SEQUEST installation instructions
To install and test SEQUEST on a cluster of Linux computers, follow these instructions. The installation is quite complex, and it may depend on a number of local system configurations. Read the entire document first, then do things one step at a time, in the order in which they are described here. This way, if you get into trouble at a certain step, you can contact me and report how far you got, which makes troubleshooting much easier.
The installation should be relatively easy for a person with good Linux system administration skills, but could be quite difficult for someone not familiar with concepts like path configuration, ssh keys, and the like. If you're in the latter category, it would be a good idea to ask for help from the local sys admin or computer guru. In this document, I try to anticipate and troubleshoot the various ways that someone may get in trouble, but I'm fairly certain that I have covered only a small part of the installation pitfalls that could arise.
Frankly, the program and its installation are more complex than they need to be (for legacy code reasons). Our new database searching software, ProLuCID, is designed to be much easier to install and execute.
Obtain the installation package (either 32 or 64 bit)
This contains the following files:
Files necessary to install SEQUEST
unify_sequest: executable file for the core SEQUEST program (searches a single ms2 spectrum)
run_ms: executable file for the master program (reads the input ms2 files, and distributes the load to one or more computers)
Example files to test SEQUEST
sequest.params: database search configuration file
cluster.txt: cluster configuration file
17PM-OrbiLTQ.ms2: input data file for a single phase run of a 17 protein mix (test sample). The data was acquired on a hybrid Orbitrap-LTQ mass spectrometer, with the full mass scans in the Orbitrap analyzer and the ms2 scans in the LTQ analyzer.
Pombe030305_17PM_reverse.fasta: database file, which contains a total of 10,006 protein sequences. The database file was generated by concatenating the 17 protein sequences in the mix to a pombe decoy database of 4,986 proteins. In addition, (17 + 4,986) = 5,003 reversed sequences were added to the database, for a grand total of 10,006 sequences. The reverse sequences are optional, and only needed for further processing of the results using DTASelect2.0.
Result files (you should obtain the same files by running the test sample)
17PM-OrbiLTQ.sqt: output file of the SEQUEST search, in .sqt format
17PM-OrbiLTQ.log: log file where the status of the search is monitored (optional)
(Optional) DTASelect2.0 result files for the test sample
DTASelect2_0.txt: file containing all the results of the SEQUEST search in text format
DTASelect-filter2_0.txt: file containing the filtered results in text format
DTASelect2_0.html: file containing the filtered results in html format
(Optional) DTASelect1.9 result files for the test sample
DTASelect1_9.txt: file containing all the results of the SEQUEST search in text format
DTASelect-filter1_9.txt: file containing the filtered results in text format
DTASelect1_9.html: file containing the filtered results in html format
Note: the files above come packaged as .tar.gz archives. You can unpack them, for example, using the gunzip and tar commands, as shown here for the file SEQUEST32.tar.gz:
[cociorva@grunt ~]$ gunzip SEQUEST32.tar.gz
[cociorva@grunt ~]$ tar -xvf SEQUEST32.tar
run_ms
unify_sequest
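The two commands above can also be combined: with the tar found on Linux systems, the -z flag decompresses and unpacks in one step and leaves the original .tar.gz file in place. The following is only a minimal sketch; the function name unpack_package and its optional destination folder are illustrative and not part of the SEQUEST distribution.

```shell
# unpack_package ARCHIVE [DEST]
# Unpack a .tar.gz archive into DEST (default: current folder) in one step.
# The function name and the DEST argument are illustrative only.
unpack_package() {
    archive=$1
    dest=${2:-.}
    mkdir -p "$dest" || return 1
    # -x extract, -z filter through gzip, -v list files as they are unpacked,
    # -f archive name, -C change to DEST before extracting
    tar -xzvf "$archive" -C "$dest"
}
```

For example, unpack_package SEQUEST32.tar.gz would list run_ms and unify_sequest as they are extracted into the current folder.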
Check the executables and place them in the user path
Make sure that the executables are appropriate for your platform (32 or 64 bit)
In the folder in which you downloaded the distribution package, type unify_sequest. You should obtain the following output:
SEQUEST v.27 (rev. 9), (c) 1993
Molecular Biotechnology, Univ. of Washington, J.Eng/J.Yates
Licensed to John Yates III @ Univ. of Washington
SEQUEST usage: unify_sequest [options] [dtafiles]
options = -Dstring where string specifies the database to be searched
-Pstring where string specifies an alternate parameter file name
(sequest.params is the default parameters file)
-S sets SEQUEST to not re-search .dta files if .out files exists
For example: sequest *.dta
And to check whether the wrapper program works, type run_ms. You should obtain the following output:
run_ms2 usage: run_ms2 options ms2file
option: -S will skip the search if ms2 file has already been searched
option: -f followed by a cluster file
run_ms2 *.ms2 will do search for all ms2 files in the directory
There are two ways in which the above tests could fail:
(i) If you get a message like this:
-bash: unify_sequest: cannot execute binary file
then STOP! You have the wrong executable for your platform (you're either trying to run a 32 bit executable on a 64 bit machine, or the other way around). You need to go back and download the correct version for your platform.
(ii) If you get a message like this:
-bash: unify_sequest: command not found
then it means that the current folder is not in your path, or that you are in the wrong folder. If this happens, make sure that you have the files unify_sequest and run_ms in the current folder, and then type ./unify_sequest and ./run_ms (add a dot and a slash in front of each command, forcing it to execute the file in the current folder).
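For case (i), it can help to confirm what an executable was built for before downloading again. Linux executables are in ELF format, and the fifth byte of an ELF file records whether it is a 32 bit or a 64 bit binary. The sketch below reads that byte with the standard od command; the function name elf_bits is illustrative and not part of the SEQUEST package.

```shell
# elf_bits FILE
# Print 32 or 64 according to the ELF class byte of FILE, or "unknown"
# if FILE does not look like an ELF executable.
# (Illustrative helper; not part of SEQUEST.)
elf_bits() {
    # Bytes 2-4 of an ELF file spell "ELF"; byte 5 (offset 4) is the class:
    # 1 = 32 bit, 2 = 64 bit.
    magic=$(od -An -c -j1 -N3 "$1" 2>/dev/null | tr -d ' \n')
    class=$(od -An -tu1 -j4 -N1 "$1" 2>/dev/null | tr -d ' \n')
    if [ "$magic" != "ELF" ]; then
        echo unknown
    elif [ "$class" = 1 ]; then
        echo 32
    elif [ "$class" = 2 ]; then
        echo 64
    else
        echo unknown
    fi
}
```

Compare the output of elf_bits ./unify_sequest with that of getconf LONG_BIT, which prints 32 or 64 for the system you are logged in to.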
Place the executables in the user path
Copy unify_sequest and run_ms into a folder that is in the user path. If you have administrative access, such a place could be /usr/bin or /usr/local/bin, for example. Otherwise, /home/your_username/bin is also a good place. To test whether this step works, first change to a different folder (any folder will do, as long as the SEQUEST files are not in it), then type which unify_sequest and which run_ms. This should confirm the location of the files. If the files are not found, get help from your local system administrator regarding path configuration. An example is provided below for my setup, where the executables are located in the folder /usr/bin:
[cociorva@grunt ~]$ which run_ms
/usr/bin/run_ms
[cociorva@grunt ~]$ which unify_sequest
/usr/bin/unify_sequest
Note: it is crucial that unify_sequest is in the user path. If run_ms is not, it will only create minor inconveniences later on, but if unify_sequest is not, the program will absolutely not work.
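The path check can be wrapped in a small helper that reports both executables at once. This is only a sketch: the function name in_path is illustrative, and it relies on the shell built-in command -v rather than which, since the built-in is available in every POSIX shell.

```shell
# in_path NAME...
# Report where each command is found; return non-zero if any is missing.
# (Illustrative helper; not part of SEQUEST.)
in_path() {
    status=0
    for name in "$@"; do
        if location=$(command -v "$name"); then
            echo "$name: $location"
        else
            echo "$name: not found in PATH" >&2
            status=1
        fi
    done
    return $status
}
```

Run it as in_path unify_sequest run_ms from a folder that does not contain the SEQUEST files. If you installed into /home/your_username/bin and the files are not found, a line such as export PATH="$HOME/bin:$PATH" in your .bashrc is the usual remedy for bash users, although the details depend on your local setup.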
Create and configure ssh keys
Generate ssh keys
In order to execute the run_ms program, the user needs to be able to send ssh commands between the machines in the cluster without being prompted for a password. Even if run_ms is executed on a single machine, this step is still necessary (the ssh commands are sent from the executing computer to itself). Here, I will use your_username and your_machine to illustrate how to do that. In the example provided below, my user name is cociorva, and my machine name is grunt. First, you need to generate the ssh keys for your machine. To do that, type the following command: ssh-keygen -t rsa. Then press Enter each time you are prompted (3 times). Your session should look like this:
[cociorva@grunt ~]$ ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/home/cociorva/.ssh/id_rsa):
Created directory '/home/cociorva/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/cociorva/.ssh/id_rsa.
Your public key has been saved in /home/cociorva/.ssh/id_rsa.pub.
The key fingerprint is:
86:a2:35:b6:98:d9:90:1c:01:6d:85:d1:4d:a0:30:c2 firstname.lastname@example.org
Configure ssh keys
Then, you need to grant password-less access to yourself: cp /home/your_username/.ssh/id_rsa.pub /home/your_username/.ssh/authorized_keys2
[cociorva@grunt ~]$ cp /home/cociorva/.ssh/id_rsa.pub /home/cociorva/.ssh/authorized_keys2
As a security precaution, you should make the newly created file readable by you only (if that's not the case already): chmod go-rx /home/your_username/.ssh/authorized_keys2
[cociorva@grunt ~]$ chmod go-rx /home/cociorva/.ssh/authorized_keys2
Finally, you can list the ssh configuration files you have just created: ls -l /home/your_username/.ssh
[cociorva@grunt ~]$ ls -l /home/cociorva/.ssh
total 24
-rw------- 1 cociorva yates 239 Feb 25 14:08 authorized_keys2
-rw------- 1 cociorva yates 887 Feb 25 14:06 id_rsa
-rw-r--r-- 1 cociorva yates 239 Feb 25 14:06 id_rsa.pub
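Note that the cp command above replaces any authorized_keys2 file you may already have, which would discard keys that other machines use to log in to your account. If that is a concern, the public key can be appended instead. The helper below is a sketch under that assumption; the function name add_public_key is illustrative.

```shell
# add_public_key PUBKEY AUTHFILE
# Append PUBKEY to AUTHFILE unless that exact line is already present,
# then make AUTHFILE readable and writable by its owner only.
# (Illustrative helper; not part of SEQUEST or OpenSSH.)
add_public_key() {
    pubkey=$1
    authfile=$2
    [ -r "$pubkey" ] || { echo "cannot read $pubkey" >&2; return 1; }
    touch "$authfile" || return 1
    # -q quiet, -x whole-line match, -F fixed strings, -f patterns from file
    if ! grep -qxF -f "$pubkey" "$authfile"; then
        cat "$pubkey" >> "$authfile"
    fi
    chmod 600 "$authfile"
}
```

For example: add_public_key /home/your_username/.ssh/id_rsa.pub /home/your_username/.ssh/authorized_keys2. Some ssh server configurations read only the file authorized_keys (without the 2); if you are still prompted for a password later on, trying that file name instead is a reasonable first step.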
Test ssh keys
Test the setup by attempting to connect from your_machine to your_machine via ssh, in order to execute a simple command (whoami, which just prints your user name to the screen): ssh your_machine whoami. The first time you do that, you will get a warning, and you will need to type yes to confirm that you want to continue:
[cociorva@grunt ~]$ ssh grunt whoami
The authenticity of host 'grunt (22.214.171.124)' can't be established.
RSA key fingerprint is 0d:54:71:9f:5f:c8:d6:56:15:55:e4:74:2e:3f:99:52.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'grunt,126.96.36.199' (RSA) to the list of known hosts.
cociorva
The second time, the command should run smoothly: ssh your_machine whoami
[cociorva@grunt ~]$ ssh grunt whoami
cociorva
If your test isn't successful or if you think you have made a mistake somewhere, you can repeat the entire process, from the beginning, by generating and then configuring the ssh keys. The only difference is that you'll be asked whether you want to overwrite the old keys, and you should type yes when prompted.
SEQUEST test run on a single machine
Test SEQUEST executables path and ssh keys configuration
If you did everything correctly so far, this step should go smoothly. This test is a combination of two tests that you have already passed: ssh your_machine which unify_sequest. Here is what it looks like on my computer:
[cociorva@grunt ~]$ ssh grunt which unify_sequest
/usr/bin/unify_sequest
If you are prompted for a password, it means that your ssh configuration step wasn't done correctly and should be repeated. Go back and make sure you get that working first.
If you get a which: no unify_sequest in ... type of error, it means that you didn't put the files in your proper user path. If you have no admin access, user path configuration could indeed be tricky, and may involve editing files such as .bashrc or .cshrc. Make sure you ask your system administrator for help with that.
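The two failure modes can also be told apart automatically. The sketch below relies on two documented ssh behaviours: the option BatchMode=yes makes ssh fail rather than prompt for a password, and ssh exits with status 255 when the connection or login itself fails, otherwise with the status of the remote command. The function name check_remote_sequest and the SSH_CMD override variable are illustrative assumptions, not part of SEQUEST.

```shell
# check_remote_sequest HOST
# Run "which unify_sequest" on HOST and say which of the two failure
# modes (ssh keys or path) occurred, if any.
# SSH_CMD is a hypothetical override (default: ssh) so that the function
# can be tried out without a real connection.
check_remote_sequest() {
    host=$1
    # BatchMode=yes: never prompt for a password, fail instead.
    ${SSH_CMD:-ssh} -o BatchMode=yes "$host" which unify_sequest
    rc=$?
    if [ "$rc" -eq 0 ]; then
        echo "OK: $host"
    elif [ "$rc" -eq 255 ]; then
        # 255 is reserved by ssh for its own errors, such as a failed login.
        echo "FAIL: $host: ssh could not connect or log in without a password" >&2
    else
        echo "FAIL: $host: unify_sequest is not in the remote path" >&2
    fi
    return "$rc"
}
```

Run it as check_remote_sequest your_machine. A status of 255 can also mean that the machine is unreachable, so treat the message as a hint rather than a diagnosis.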
Prepare fasta database folder
Make a folder in which you store fasta databases, and copy the file Pombe030305_17PM_reverse.fasta there. For example, I made a folder /home/cociorva/dbase to store my databases:
[cociorva@grunt ~]$ ls -l /home/cociorva/dbase
total 5596
-rw-r--r-- 1 cociorva yates 5712604 Feb 26 18:44 Pombe030305_17PM_reverse.fasta
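A quick way to confirm that the database was copied intact is to count its sequences. In a fasta file every sequence starts with a header line beginning with the character >, so counting those lines gives the number of sequences. The function name count_fasta below is illustrative.

```shell
# count_fasta FILE
# Print the number of sequences in a fasta file, i.e. the number of
# header lines, which begin with '>'.
# (Illustrative helper; not part of SEQUEST.)
count_fasta() {
    grep -c '^>' "$1"
}
```

If your copy matches the description given in the package contents, count_fasta /home/your_username/dbase/Pombe030305_17PM_reverse.fasta should report 10006 sequences (2 x (17 + 4,986)).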
Prepare folder for test run
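Make a folder in which you will run this test data. Copy the files 17PM-OrbiLTQ.ms2, cluster.txt, and sequest.params to this folder. For example, I made a folder called /home/cociorva/SEQUEST_test (note that in my listing and in the commands below the data file is named 17PM-Orbi.ms2; use whatever name your copy has):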
[cociorva@grunt ~]$ ls -l /home/cociorva/SEQUEST_test
total 79052
-rw-r--r-- 1 cociorva yates 80835254 Feb 26 18:07 17PM-Orbi.ms2
-rw-r--r-- 1 cociorva yates 6 Feb 26 18:59 cluster.txt
-rw-r--r-- 1 cociorva yates 5525 Feb 26 18:54 sequest.params
Edit sequest.params file
Using your favorite text editor, open the sequest.params file and change the database_name line (the 4th line from the top in the example you have) to reflect the actual location of your fasta database. This is what the top of the sequest.params file looks like in my case:
# comment lines begin with a '#' in the first position
[SEQUEST]
database_name = /home/cociorva/dbase/Pombe030305_17PM_reverse.fasta
ppm_peptide_mass_tolerance = 50.000
isotopes = 1 ; 0=search only one peak, 1=search isotopes
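If you prefer not to edit the file by hand (for instance, when preparing many search folders), the database_name line can be rewritten from the command line. The following is a sketch only; the function name set_database_name is illustrative, and it assumes the line has the form database_name = path, as in my example.

```shell
# set_database_name PARAMS DBPATH
# Rewrite the database_name line of a sequest.params file in place,
# leaving every other line unchanged.
# (Illustrative helper; not part of SEQUEST.)
set_database_name() {
    params=$1
    dbpath=$2
    tmpfile=$(mktemp) || return 1
    # Replace any line of the form "database_name = ..." and copy the rest.
    awk -v db="$dbpath" '
        /^[ \t]*database_name[ \t]*=/ { print "database_name = " db; next }
        { print }
    ' "$params" > "$tmpfile" && cat "$tmpfile" > "$params"
    rc=$?
    rm -f "$tmpfile"
    return "$rc"
}
```

For example: set_database_name sequest.params /home/your_username/dbase/Pombe030305_17PM_reverse.fasta. Writing the result back with cat keeps the permissions of the original file.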
Edit cluster.txt file
This file has a very simple format: it lists all the machines that are to be used by SEQUEST, each on a single line. If a machine has multiple processors or multiple cores, it can be listed more than once (we will get to that later). At this stage, your cluster.txt file need only have one line, which lists your_machine. My file looks like this:
grunt
Note: the name of this file (cluster.txt in our example) is irrelevant. You can choose to name it whatever you want, as long as you specify it when you run SEQUEST (see below). Also, it may be a good idea to create and save several cluster files, containing different lists of machines. You may have, for instance, a small_cluster.txt file, that lists only a few machines, and a big_cluster.txt file, that lists many machines. This way, you can launch SEQUEST on various subsets of machines without editing the same file all the time.
Execute SEQUEST program
Finally, you're ready to go! Type run_ms -f cluster.txt 17PM-Orbi.ms2 and press Enter. Here is what I get:
[cociorva@grunt SEQUEST_test]$ run_ms -f cluster.txt 17PM-Orbi.ms2
Starting Search for 17PM-Orbi.ms2
Number of Spectra 13356
Number of Spectra 13356
Reading in sequest.params file
machines = grunt
A total of 1 computers in the cluster
Starting the Search
17PM-Orbi.ms2: 0 (%0.0) dtas Sent for Search, 1 to grunt
stat = 0
17PM-Orbi.ms2: 1 (%0.0) dtas Sent for Search, 2 to grunt
stat = 0
17PM-Orbi.ms2: 2 (%0.0) dtas Sent for Search, 3 to grunt
stat = 0
17PM-Orbi.ms2: 3 (%0.0) dtas Sent for Search, 4 to grunt
stat = 0
17PM-Orbi.ms2: 4 (%0.0) dtas Sent for Search, 5 to grunt
stat = 0
17PM-Orbi.ms2: 5 (%0.0) dtas Sent for Search, 6 to grunt
...
What you see on the screen is simply a log of the program, and you should pay special attention to the stat = 0 lines. If that number differs from zero, it means that there are errors in the program. All the tedious tests and configuration steps up to this point were designed to prevent that, and they hopefully worked well.
The output of the program goes into the 17PM-Orbi.sqt file. This file grows as the program runs, and will contain the full SEQUEST search results once the program finishes. You can let the program complete, but it will probably take very long to finish on a single machine. To stop it, press Ctrl-C at any time. If you stop the program, the 17PM-Orbi.sqt file will only contain the results for the spectra for which the search was completed.
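Watching for non-zero stat values by eye becomes impractical on long runs. If you save the screen output to a file (for example with run_ms -f cluster.txt 17PM-Orbi.ms2 | tee 17PM-Orbi.log), the log can be scanned afterwards. The sketch below assumes that the status is always printed as stat = followed by an integer, as in my output above; the function name count_failed_stats is illustrative.

```shell
# count_failed_stats LOGFILE
# Print how many "stat = N" entries in a saved run_ms log have N other than 0.
# Assumes the status is always printed as "stat = <integer>".
# (Illustrative helper; not part of SEQUEST.)
count_failed_stats() {
    awk '
        {
            line = $0
            # A line may hold more than one entry; scan them all.
            while (match(line, /stat = -?[0-9]+/)) {
                # "stat = " is 7 characters long; the number follows it.
                value = substr(line, RSTART + 7, RLENGTH - 7)
                if (value + 0 != 0) bad++
                line = substr(line, RSTART + RLENGTH)
            }
        }
        END { print bad + 0 }
    ' "$1"
}
```

A result of 0 from count_failed_stats 17PM-Orbi.log means that every reported status was zero.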
Execute multi-threaded SEQUEST program
You can use the one-computer setup to execute a multi-threaded SEQUEST run. Open the cluster.txt file (or, as discussed before, create a new file named to your liking), and write the name your_machine on more than one line. In my example, I choose to start 4 threads on grunt, because grunt has 4 CPUs:
grunt
grunt
grunt
grunt
Note: the number of SEQUEST threads you start is equal to the number of lines in your cluster file. You can make this as small or as large as you wish. It does not have to be the same as the number of CPUs or cores that your machine has. However, it is generally good practice not to start more SEQUEST threads than there are CPUs or cores available. Otherwise, the machine will be overloaded and will not run any faster.
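A cluster file of this kind can be generated rather than typed. The sketch below prints the machine name once per thread, defaulting to one thread per processor reported by getconf _NPROCESSORS_ONLN and to the name printed by hostname. The function name make_local_cluster is illustrative; adjust the defaults if your machine is known to ssh under a different name.

```shell
# make_local_cluster [THREADS] [HOST]
# Print HOST on THREADS lines, suitable for redirecting into a cluster file.
# Defaults: THREADS = number of online processors, HOST = this machine's name.
# (Illustrative helper; not part of SEQUEST.)
make_local_cluster() {
    threads=${1:-$(getconf _NPROCESSORS_ONLN)}
    host=${2:-$(hostname)}
    i=0
    while [ "$i" -lt "$threads" ]; do
        echo "$host"
        i=$((i + 1))
    done
}
```

For example, make_local_cluster 4 grunt > cluster.txt reproduces my four-line file.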
Then, as before, type run_ms -f cluster.txt 17PM-Orbi.ms2. Here is what I get:
[cociorva@grunt SEQUEST_test]$ run_ms -f cluster.txt 17PM-Orbi.ms2
Starting Search for 17PM-Orbi.ms2
Number of Spectra 13356
Number of Spectra 13356
Reading in sequest.params file
machines = grunt
machines = grunt
machines = grunt
machines = grunt
A total of 4 computers in the cluster
Starting the Search
17PM-Orbi.ms2: 0 (%0.0) dtas Sent for Search, 1 to grunt
17PM-Orbi.ms2: 1 (%0.0) dtas Sent for Search, 1 to grunt
17PM-Orbi.ms2: 2 (%0.0) dtas Sent for Search, 1 to grunt
17PM-Orbi.ms2: 3 (%0.0) dtas Sent for Search, 1 to grunt
stat = 0
17PM-Orbi.ms2: 4 (%0.0) dtas Sent for Search, 2 to grunt
stat = 0
17PM-Orbi.ms2: 5 (%0.0) dtas Sent for Search, 2 to grunt
stat = 0
17PM-Orbi.ms2: 6 (%0.0) dtas Sent for Search, 2 to grunt
stat = 0
17PM-Orbi.ms2: 7 (%0.1) dtas Sent for Search, 2 to grunt
stat = 0
17PM-Orbi.ms2: 8 (%0.1) dtas Sent for Search, 3 to grunt
stat = 0
17PM-Orbi.ms2: 9 (%0.1) dtas Sent for Search, 3 to grunt
...
The program should run significantly faster (in my case, it's 4 times as fast as before). If you have a very good machine with 8 or more cores, you can let it run to completion, as it may not take very long. The number and percentage of spectra that have finished processing are continuously displayed on the screen, so you should get a pretty good idea of how long it would take to go through all 13,356 of them. If it still seems too long, or if you want to go to the next step, you can always press Ctrl-C to stop the program.
SEQUEST test run on multiple computers (a cluster)
You can try this test on only two computers or on a large cluster, as the setup is fundamentally the same. I will illustrate it on a cluster of 4 computers, called shamu001 through shamu004, with two CPUs each.
In order to run SEQUEST on a cluster of computers, different steps may be necessary, depending on the particular configuration of your cluster. I will make some assumptions regarding that configuration, which tend to hold true for the majority of clusters out there. In case your system is different, you will need to adapt these instructions, or to ask the help of your system administrator. It is beyond the scope of this manual to provide support for all types of possible networking scenarios.
First, I assume that you have a common user name across all the machines (nodes) in the cluster, and that your home folder is shared by all machines (via NFS mounting, for example). Therefore, your fasta database and test run folders are visible to all machines (with both read and write permission).
Location of SEQUEST executable files
If you choose to put the two SEQUEST executable files in your home folder (for example, in /home/your_username/bin/), then no further steps are necessary. However, if you put the executables in a machine-dependent location like /usr/bin, then you have to make sure that all nodes in the cluster have copies of these files in the same location. In my example, the machines shamu001 through shamu004 all have a copy of unify_sequest in /usr/bin/:
[cociorva@shamu001 cociorva]$ ls -l /usr/bin/unify_sequest
-rwxr-xr-x 1 root root 69593 Jun 14 2007 /usr/bin/unify_sequest
[cociorva@shamu002 cociorva]$ ls -l /usr/bin/unify_sequest
-rwxr-xr-x 1 root root 69593 Jun 14 2007 /usr/bin/unify_sequest
[cociorva@shamu003 cociorva]$ ls -l /usr/bin/unify_sequest
-rwxr-xr-x 1 root root 69593 Jun 14 2007 /usr/bin/unify_sequest
[cociorva@shamu004 cociorva]$ ls -l /usr/bin/unify_sequest
-rwxr-xr-x 1 root root 69593 Jun 14 2007 /usr/bin/unify_sequest
Test SEQUEST executables path and ssh keys configuration
You will need to pick one of the machines in the cluster as your "master node", that is, the machine that will execute run_ms and control the distribution of SEQUEST threads. In my case, I chose shamu001, but I could equally have chosen any of the other machines. Log in to that machine, and test that you can send ssh commands to itself and to the other machines, and that unify_sequest is in the user path:
[cociorva@shamu001 cociorva]$ ssh shamu001 which unify_sequest
/usr/bin/unify_sequest
[cociorva@shamu001 cociorva]$ ssh shamu002 which unify_sequest
/usr/bin/unify_sequest
[cociorva@shamu001 cociorva]$ ssh shamu003 which unify_sequest
/usr/bin/unify_sequest
[cociorva@shamu001 cociorva]$ ssh shamu004 which unify_sequest
/usr/bin/unify_sequest
Note: the first time you do these tests, you may get a warning message and be asked whether you indeed want to proceed with the remote connection. Type yes if prompted (as in the single-machine test, ssh expects the full word). Repeating the test should then produce no more warnings.
Like before, your test may fail in one of two ways: either you are prompted for a password, in which case the ssh keys have not been properly configured, or the unify_sequest file is not found, in which case your path configuration may be wrong. Either way, your system administrator should be able to help you.
Edit cluster.txt file
Here is my example file:
shamu001
shamu001
shamu002
shamu002
shamu003
shamu003
shamu004
shamu004
This file tells run_ms to start 2 SEQUEST threads on each one of the 4 machines in my mini-cluster (including the master node shamu001, which has the additional task of executing run_ms itself). Like before, you can specify to start as many threads as you wish on as many machines as you wish, but logic dictates that you should limit the number of threads to the number of physical CPUs or cores available. Also, you may prefer to leave out the master node altogether, or give it a lesser load, to compensate for the fact that it has to execute run_ms.
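Before launching a run on a larger cluster, it can be worth double-checking how many threads a cluster file assigns to each machine. The sketch below simply counts the lines per machine name; the function name threads_per_host is illustrative.

```shell
# threads_per_host CLUSTERFILE
# Print each machine in the cluster file together with the number of lines
# (that is, SEQUEST threads) assigned to it. Blank lines are ignored.
# (Illustrative helper; not part of SEQUEST.)
threads_per_host() {
    awk 'NF { count[$1]++ }
         END { for (host in count) print host, count[host] }' "$1" | sort
}
```

For my example file, threads_per_host cluster.txt lists each of shamu001 through shamu004 with a count of 2.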
Like before, type run_ms -f cluster.txt 17PM-Orbi.ms2 and press Enter. Here is what I get:
[cociorva@shamu001 SEQUEST_test]$ run_ms -f cluster.txt 17PM-Orbi.ms2
Starting Search for 17PM-Orbi.ms2
Number of Spectra 13356
Number of Spectra 13356
Reading in sequest.params file
machines = shamu001
machines = shamu001
machines = shamu002
machines = shamu002
machines = shamu003
machines = shamu003
machines = shamu004
machines = shamu004
A total of 8 computers in the cluster
Starting the Search
17PM-Orbi.ms2: 0 (%0.0) dtas Sent for Search, 1 to shamu001
17PM-Orbi.ms2: 1 (%0.0) dtas Sent for Search, 1 to shamu001
17PM-Orbi.ms2: 2 (%0.0) dtas Sent for Search, 1 to shamu002
17PM-Orbi.ms2: 3 (%0.0) dtas Sent for Search, 1 to shamu002
17PM-Orbi.ms2: 4 (%0.0) dtas Sent for Search, 1 to shamu003
17PM-Orbi.ms2: 5 (%0.0) dtas Sent for Search, 1 to shamu003
17PM-Orbi.ms2: 6 (%0.0) dtas Sent for Search, 1 to shamu004
17PM-Orbi.ms2: 7 (%0.1) dtas Sent for Search, 1 to shamu004
stat = 0
17PM-Orbi.ms2: 8 (%0.1) dtas Sent for Search, 2 to shamu001
stat = 0
17PM-Orbi.ms2: 9 (%0.1) dtas Sent for Search, 2 to shamu001
stat = 0
17PM-Orbi.ms2: 10 (%0.1) dtas Sent for Search, 2 to shamu003
stat = 0
17PM-Orbi.ms2: 11 (%0.1) dtas Sent for Search, 2 to shamu004
stat = 0
17PM-Orbi.ms2: 12 (%0.1) dtas Sent for Search, 2 to shamu002
stat = 0
17PM-Orbi.ms2: 13 (%0.1) dtas Sent for Search, 2 to shamu003
stat = 0
17PM-Orbi.ms2: 14 (%0.1) dtas Sent for Search, 3 to shamu001
stat = 0
17PM-Orbi.ms2: 15 (%0.1) dtas Sent for Search, 3 to shamu001
stat = 0
17PM-Orbi.ms2: 16 (%0.1) dtas Sent for Search, 2 to shamu004
stat = 0
17PM-Orbi.ms2: 17 (%0.1) dtas Sent for Search, 2 to shamu002
stat = 0
17PM-Orbi.ms2: 18 (%0.1) dtas Sent for Search, 4 to shamu001
stat = 0
17PM-Orbi.ms2: 19 (%0.1) dtas Sent for Search, 4 to shamu001
stat = 0
17PM-Orbi.ms2: 20 (%0.1) dtas Sent for Search, 3 to shamu003
stat = 0
17PM-Orbi.ms2: 21 (%0.2) dtas Sent for Search, 3 to shamu004
...
Now the program runs much faster than before, as the load is spread over multiple machines in the cluster.
Running SEQUEST under a batch system (e.g. PBS)
In all the examples so far, you have run SEQUEST in interactive mode: the program starts executing as soon as you type the command, and the log output goes to your screen. In real-world high-throughput proteomics, however, you need a system where you can submit batch jobs that are queued and executed by a scheduler, with the output going to a log file. There are many batch systems available, and it is not the purpose of this manual to review them. Here I only give an example of what a job file could look like on a PBS-based system:
#!/bin/sh
#PBS -l nodes=2:ppn=4
#PBS -l walltime=1:00:00
#PBS -l cput=8:00:00
#PBS -j oe
cd $PBS_O_WORKDIR
run_ms -f $PBS_NODEFILE 17PM-Orbi.ms2 > 17PM-Orbi.log
exit
In this example, the script would execute a job on 2 nodes, with 4 CPUs per node, asking for 1 hour of wall time and 8 hours of CPU time. The main line in this script is the line run_ms -f $PBS_NODEFILE 17PM-Orbi.ms2 > 17PM-Orbi.log. Comparing it with the interactive SEQUEST command, you will notice two differences: first, the output of the program is redirected to a file (17PM-Orbi.log), and second, the cluster file used by the program is no longer a file you edit yourself, but it is a file that is specified by the batch system (under the environment variable name $PBS_NODEFILE). This ensures that your program will start SEQUEST threads on the cluster nodes assigned by the batch system to your job. This detail is crucial on a batch system, because you do not know beforehand which nodes will be assigned to you, hence you cannot possibly edit the cluster.txt file before the job starts. Using $PBS_NODEFILE is the only way to make sure that your SEQUEST threads are launched on the appropriate nodes.