SEQUEST installation instructions

From Proteomics Wiki

To install and test SEQUEST on a cluster of Linux computers, follow these instructions. The installation is quite complex and may depend on a number of local system configurations, so read the entire document and perform the steps one at a time, in the order in which they are described here. That way, if you run into trouble at a certain step, you can contact me and report how far you got, which will make troubleshooting much easier. As a side note, the installation should be relatively easy for a person with good Linux system administration skills, but could be quite difficult for someone unfamiliar with concepts like path configuration, ssh keys, and the like. If you are in the latter category, it is a good idea to ask your local system administrator or computer guru for help. In this document, I try to anticipate and troubleshoot the various ways someone might get into trouble, but I am fairly certain I have covered only a small part of the installation pitfalls that could arise. Frankly, the program and its installation are more complex than they need to be (for legacy code reasons). Our new database searching software, ProLuCID, is designed to be much easier to install and execute.


Obtain the installation package (either 32 or 64 bit)

This contains the following files:

Files necessary to install SEQUEST

unify_sequest: executable file for the core SEQUEST program (searches a single ms2 spectrum)
run_ms: executable file for the master program (reads the input ms2 files, and distributes the load to one or more computers)

Example files to test SEQUEST

sequest.params: database search configuration file
cluster.txt: cluster configuration file
17PM-OrbiLTQ.ms2: input data file for a single phase run of a 17 protein mix (test sample). The data was acquired on a hybrid Orbitrap-LTQ mass spectrometer, with the full mass scans in the Orbitrap analyzer and the ms2 scans in the LTQ analyzer.
Pombe030305_17PM_reverse.fasta: database file, which contains a total of 10,006 protein sequences. The database file was generated by concatenating the 17 protein sequences in the mix to a pombe decoy database of 4,986 proteins. In addition, (17 + 4,986) = 5,003 reversed sequences were added to the database, for a grand total of 10,006 sequences. The reverse sequences are optional, and only needed for further processing of the results using DTASelect2.0.
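Since FASTA headers each begin with a '>' character, the sequence count of a database can be verified with a simple grep. Below is a minimal sketch on a throwaway three-sequence file (the file path and protein names are made up for illustration); run the same command on the real Pombe030305_17PM_reverse.fasta to confirm it contains the expected number of sequences.

```shell
# FASTA headers start with '>', so counting them counts the sequences.
# Throwaway example file with three entries:
cat > /tmp/mini.fasta <<'EOF'
>protein_A example entry
MKTAYIAKQR
>protein_B example entry
GGSSLLTEVK
>Reverse_protein_A example decoy entry
RQKAIYATKM
EOF
grep -c '^>' /tmp/mini.fasta    # prints 3
# On the real database this should report the total sequence count:
# grep -c '^>' /home/your_username/dbase/Pombe030305_17PM_reverse.fasta
```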

Result files (you should obtain the same files by running the test sample)

17PM-OrbiLTQ.sqt: output file of the SEQUEST search, in .sqt format
17PM-OrbiLTQ.log: log file where the status of the search is monitored (optional)

(Optional) DTASelect2.0 result files for the test sample

DTASelect2_0.txt: file containing all the results of the SEQUEST search in text format
DTASelect-filter2_0.txt: file containing the filtered results in text format
DTASelect2_0.html: file containing the filtered results in html format

(Optional) DTASelect1.9 result files for the test sample

DTASelect1_9.txt: file containing all the results of the SEQUEST search in text format
DTASelect-filter1_9.txt: file containing the filtered results in text format
DTASelect1_9.html: file containing the filtered results in html format

Note: the files above come packaged as .tar.gz archives. You can unpack them, for example, using the gunzip and tar commands, as shown here for the file SEQUEST32.tar.gz:

[cociorva@grunt ~]$ gunzip SEQUEST32.tar.gz
[cociorva@grunt ~]$ tar -xvf SEQUEST32.tar
run_ms
unify_sequest

Check the executables and place them in the user path

Make sure that the executables are appropriate for your platform (32 or 64 bit)

In the folder in which you downloaded the distribution package, type unify_sequest. You should obtain the following output:

SEQUEST v.27 (rev. 9), (c) 1993
Molecular Biotechnology, Univ. of Washington, J.Eng/J.Yates
Licensed to John Yates III @ Univ. of Washington
SEQUEST usage:  unify_sequest [options] [dtafiles]
 options =  -Dstring   where string specifies the database to be searched
            -Pstring   where string specifies an alternate parameter file name
                       (sequest.params is the default parameters file)
            -S         sets SEQUEST to not re-search .dta files if .out files exists
For example: sequest *.dta

And to check whether the wrapper program works, type run_ms. You should obtain the following output:

run_ms2 usage: run_ms2 options ms2file
option: -S will skip the search if ms2 file has already been searched
option: -f followed by a cluster file
run_ms2 *.ms2   will do search for all ms2 files in the directory
Troubleshooting

There are two ways in which the above tests could fail:

(i) If you get a message like this:

-bash: unify_sequest: cannot execute binary file

then STOP! You have the wrong executable for your platform (you're either trying to run a 32 bit executable on a 64 bit machine, or the other way around). You need to go back and download the correct version for your platform.

(ii) If you get a message like this:

-bash: unify_sequest: command not found

then it means that your path configuration is a bit unusual, or you are in the wrong folder. If this happens, make sure that you have the files unify_sequest and run_ms in the current folder, and then type ./unify_sequest and ./run_ms (add a dot and a slash in front of each command, forcing it to execute the file in the current folder).
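To check for a platform mismatch before (or after) hitting the error above, you can compare your machine's architecture against the binary itself. This is a generic sketch: uname is standard, and the commented file command assumes you run it from the folder containing the downloaded executable.

```shell
# Report the machine architecture: x86_64 means 64-bit, i686/i386 means 32-bit.
uname -m
# From the download folder, 'file' reports whether a binary is 32- or 64-bit ELF:
# file ./unify_sequest
```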

Place the executables in the user path

Copy unify_sequest and run_ms into a folder that is in the user path. If you have administrative access, such a place could be /usr/bin or /usr/local/bin, for example. Otherwise, /home/your_username/bin is also a good place. To test whether this step worked, first change to a different folder (any folder will do, as long as the SEQUEST files are not in it), then type which unify_sequest and which run_ms. This should confirm the location of the files. If the files are not found, get help from your local system administrator regarding path configuration. An example is provided below for my setup, where the executables are located in the folder /usr/bin:

[cociorva@grunt ~]$ which run_ms
/usr/bin/run_ms
[cociorva@grunt ~]$ which unify_sequest
/usr/bin/unify_sequest

Note: it is crucial that unify_sequest be in the user path. If run_ms is not, it will only cause minor inconveniences later on, but if unify_sequest is not, the program will not work at all.
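If you lack admin access, one common approach is a personal bin folder on the PATH. This is a sketch, not the only correct setup; the commented cp line assumes you run it from the folder where you unpacked the executables.

```shell
# Hypothetical no-root setup: keep the executables in ~/bin and put it on the PATH.
mkdir -p "$HOME/bin"
# cp unify_sequest run_ms "$HOME/bin/"   # run this from the download folder
export PATH="$HOME/bin:$PATH"            # add this line to ~/.bashrc to make it permanent
```

After opening a new shell (or sourcing ~/.bashrc), which unify_sequest should report the new location.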

Create and configure ssh keys

Generate ssh keys

In order to execute the run_ms program, the user must be able to send ssh commands between the machines in the cluster without being prompted for a password. Even if run_ms is executed on a single machine, this step is still necessary (the ssh commands are sent from the executing computer to itself). Here, I will use your_username and your_machine to illustrate how to do that; in the example provided below, my user name is cociorva and my machine name is grunt. First, you need to generate the ssh keys for your machine. To do that, type the following command: ssh-keygen -t rsa. Then, press Enter each of the three times you are prompted. Your session should look like this:

[cociorva@grunt ~]$ ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/home/cociorva/.ssh/id_rsa):
Created directory '/home/cociorva/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/cociorva/.ssh/id_rsa.
Your public key has been saved in /home/cociorva/.ssh/id_rsa.pub.
The key fingerprint is:
86:a2:35:b6:98:d9:90:1c:01:6d:85:d1:4d:a0:30:c2 cociorva@grunt.scripps.edu

Configure ssh keys

Then, you need to grant password-less access to yourself: cp /home/your_username/.ssh/id_rsa.pub /home/your_username/.ssh/authorized_keys2

[cociorva@grunt ~]$ cp /home/cociorva/.ssh/id_rsa.pub /home/cociorva/.ssh/authorized_keys2

As a security precaution, you should make the newly created file readable by you only (if that's not the case already): chmod go-rx /home/your_username/.ssh/authorized_keys2

[cociorva@grunt ~]$ chmod go-rx /home/cociorva/.ssh/authorized_keys2

Finally, you can list the ssh configuration files you have just created: ls -l /home/your_username/.ssh

[cociorva@grunt ~]$ ls -l /home/cociorva/.ssh
total 24
-rw-------  1 cociorva yates 239 Feb 25 14:08 authorized_keys2
-rw-------  1 cociorva yates 887 Feb 25 14:06 id_rsa
-rw-r--r--  1 cociorva yates 239 Feb 25 14:06 id_rsa.pub

Test ssh keys

You test the setup by attempting to connect from your_machine to your_machine via ssh, in order to execute a simple command (whoami, which just prints your user name to the screen): ssh your_machine whoami. The first time you do this, you will get a warning, and you will need to type yes to confirm that you want to continue:

[cociorva@grunt ~]$ ssh grunt whoami
The authenticity of host 'grunt (137.131.140.165)' can't be established.
RSA key fingerprint is 0d:54:71:9f:5f:c8:d6:56:15:55:e4:74:2e:3f:99:52.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'grunt,137.131.140.165' (RSA) to the list of known hosts.
cociorva

The second time, the command should run smoothly: ssh your_machine whoami

[cociorva@grunt ~]$ ssh grunt whoami
cociorva
Troubleshooting

If your test isn't successful, or if you think you made a mistake somewhere, you can repeat the entire process from the beginning, generating and then configuring the ssh keys again. The only difference is that you will be asked whether you want to overwrite the old keys, and you should type yes when prompted.
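One failure mode worth checking explicitly: sshd silently ignores key files whose permissions are too open. The sketch below tightens them; it also keeps a copy of the key file under the name authorized_keys, since newer OpenSSH versions read that file rather than the legacy authorized_keys2 name used above.

```shell
# sshd ignores keys when ~/.ssh or the key files are too permissive; tighten them:
mkdir -p "$HOME/.ssh"
chmod 700 "$HOME/.ssh"
touch "$HOME/.ssh/authorized_keys2"
chmod 600 "$HOME/.ssh/authorized_keys2"
# Newer OpenSSH reads ~/.ssh/authorized_keys, so keep a copy under that name too:
cp "$HOME/.ssh/authorized_keys2" "$HOME/.ssh/authorized_keys"
chmod 600 "$HOME/.ssh/authorized_keys"
```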

SEQUEST test run on a single machine

Test SEQUEST executables path and ssh keys configuration

If you did everything correctly so far, this step should go smoothly. This test is a combination of two tests that you have already passed: ssh your_machine which unify_sequest. Here is how it looks on my computer:

[cociorva@grunt ~]$ ssh grunt which unify_sequest
/usr/bin/unify_sequest
Troubleshooting

If you are prompted for a password, it means that your ssh configuration step wasn't done correctly and should be repeated. Go back and make sure you get that working first.

If you get a which: no unify_sequest in ... type of error, it means that the files are not in your user path. If you have no admin access, path configuration can indeed be tricky, and may involve editing files such as .bashrc or .cshrc. Make sure you ask your system administrator for help with that.

Prepare fasta database folder

Make a folder in which you store fasta databases, and copy the file Pombe030305_17PM_reverse.fasta there. For example, I made a folder /home/cociorva/dbase to store my databases:

[cociorva@grunt ~]$ ls -l /home/cociorva/dbase
total 5596
-rw-r--r--  1 cociorva yates 5712604 Feb 26 18:44 Pombe030305_17PM_reverse.fasta 

Prepare folder for test run

Make a folder in which you will run this test. Copy the files 17PM-OrbiLTQ.ms2, cluster.txt, and sequest.params to this folder. For example, I made a folder called /home/cociorva/SEQUEST_test:

[cociorva@grunt ~]$ ls -l /home/cociorva/SEQUEST_test
total 79052
-rw-r--r--  1 cociorva yates 80835254 Feb 26 18:07 17PM-Orbi.ms2
-rw-r--r--  1 cociorva yates        6 Feb 26 18:59 cluster.txt
-rw-r--r--  1 cociorva yates     5525 Feb 26 18:54 sequest.params

Edit sequest.params file

Using your favorite text editor, open the sequest.params file and change the database_name line (the 4th line from the top in the example you have) to reflect the actual location of your fasta database. This is how the top of the sequest.params file looks in my case:

# comment lines begin with a '#' in the first position

[SEQUEST]
database_name = /home/cociorva/dbase/Pombe030305_17PM_reverse.fasta
ppm_peptide_mass_tolerance = 50.000
isotopes = 1                           ; 0=search only one peak, 1=search isotopes
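The database_name edit can also be scripted, which is handy when preparing many search folders. Here is a sketch on a throwaway copy of the file, assuming the "key = value" layout shown above; the target path is only an example.

```shell
# Throwaway miniature sequest.params to demonstrate the edit:
cat > /tmp/sequest.params <<'EOF'
[SEQUEST]
database_name = /old/path/db.fasta
ppm_peptide_mass_tolerance = 50.000
EOF
# Point database_name at the actual fasta location (example path):
sed -i 's|^database_name = .*|database_name = /home/your_username/dbase/Pombe030305_17PM_reverse.fasta|' /tmp/sequest.params
grep '^database_name' /tmp/sequest.params
```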

Edit cluster.txt file

This file has a very simple format: it lists all the machines that are to be used by SEQUEST, one per line. If a machine has multiple processors or multiple cores, it can be listed more than once (we will get to that later). At this stage, your cluster.txt file needs only one line, listing your_machine. My file looks like this:

grunt

Note: the name of this file (cluster.txt in our example) is irrelevant. You can choose to name it whatever you want, as long as you specify it when you run SEQUEST (see below). Also, it may be a good idea to create and save several cluster files, containing different lists of machines. You may have, for instance, a small_cluster.txt file, that lists only a few machines, and a big_cluster.txt file, that lists many machines. This way, you can launch SEQUEST on various subsets of machines without editing the same file all the time.
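Cluster files like these can be generated with a one-line shell loop instead of a text editor. A sketch, using my machine name grunt and arbitrary file names:

```shell
# Keep several cluster files around, as suggested above (names are arbitrary):
echo grunt > small_cluster.txt                              # one SEQUEST thread
for i in 1 2 3 4; do echo grunt; done > big_cluster.txt     # four threads on one machine
wc -l small_cluster.txt big_cluster.txt
```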

Execute SEQUEST program

Finally, you're ready to go! Type run_ms -f cluster.txt 17PM-Orbi.ms2 and press Enter. Here is what I get:

[cociorva@grunt SEQUEST_test]$ run_ms -f cluster.txt 17PM-Orbi.ms2
Starting Search for 17PM-Orbi.ms2
Number of Spectra 13356
Number of Spectra 13356
Reading in sequest.params file
machines = grunt
A total of 1 computers in the cluster
Starting the Search
17PM-Orbi.ms2: 0 (%0.0) dtas Sent for Search,   1 to grunt
stat = 0
17PM-Orbi.ms2: 1 (%0.0) dtas Sent for Search,   2 to grunt
stat = 0
17PM-Orbi.ms2: 2 (%0.0) dtas Sent for Search,   3 to grunt
stat = 0
17PM-Orbi.ms2: 3 (%0.0) dtas Sent for Search,   4 to grunt
stat = 0
17PM-Orbi.ms2: 4 (%0.0) dtas Sent for Search,   5 to grunt
stat = 0
17PM-Orbi.ms2: 5 (%0.0) dtas Sent for Search,   6 to grunt
...

What you see on the screen is simply the program's log, and you should pay special attention to the stat = 0 lines. If that number differs from zero, errors occurred during the search. All the tedious tests and configuration steps up to this point were designed to prevent that, and hopefully they worked.

The output of the program goes into the 17PM-Orbi.sqt file. This file grows as the program runs, and will contain the full SEQUEST search results once the program finishes. You can let the program complete, but it will probably take very long to finish on a single machine. To stop it, press Ctrl-C at any time. If you stop the program, the 17PM-Orbi.sqt file will only contain the results for the spectra for which the search was completed.
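In the .sqt format, each searched spectrum adds a record whose first line starts with "S" (header lines start with "H", match and locus lines with "M" and "L"). Counting the S lines is therefore a quick way to monitor progress in another terminal. A sketch on a throwaway file with simplified fields:

```shell
# Miniature .sqt file: H = header, S = spectrum, M = match, L = locus lines.
cat > /tmp/mini.sqt <<'EOF'
H SQTGenerator SEQUEST
S 2 2 1 example fields
M 1 example fields
L protein_A
S 4 4 1 example fields
M 1 example fields
L protein_B
EOF
grep -c '^S' /tmp/mini.sqt      # prints 2
# To monitor a real run from another terminal:
# grep -c '^S' 17PM-Orbi.sqt
```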

Execute multi-threaded SEQUEST program

You can use the one-computer setup to execute a multi-threaded SEQUEST run. Open the cluster.txt file (or, as discussed before, create a new file named to your liking), and write the name your_machine on more than one line. In my example, I start 4 threads on grunt, because grunt has 4 CPUs:

grunt
grunt
grunt
grunt 

Note: the number of SEQUEST threads started equals the number of lines in your cluster file. You can make this as small or as large as you wish; it does not have to match the number of CPUs or cores your machine has. However, it is generally good practice not to start more SEQUEST threads than there are CPUs or cores available; otherwise, the machine will be overloaded and will not run any faster.
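To size the cluster file to the machine automatically, you can query the core count with nproc (from GNU coreutils) and generate one line per core. A sketch, again using grunt as the machine name:

```shell
# One cluster-file line per available core (nproc reports the core count):
cores=$(nproc)
for i in $(seq "$cores"); do echo grunt; done > cluster_local.txt
wc -l cluster_local.txt
```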

Then, as before, type run_ms -f cluster.txt 17PM-Orbi.ms2. Here is what I get:

[cociorva@grunt SEQUEST_test]$ run_ms -f cluster.txt 17PM-Orbi.ms2
Starting Search for 17PM-Orbi.ms2
Number of Spectra 13356
Number of Spectra 13356
Reading in sequest.params file
machines = grunt
machines = grunt
machines = grunt
machines = grunt
A total of 4 computers in the cluster
Starting the Search
17PM-Orbi.ms2: 0 (%0.0) dtas Sent for Search,   1 to grunt
17PM-Orbi.ms2: 1 (%0.0) dtas Sent for Search,   1 to grunt
17PM-Orbi.ms2: 2 (%0.0) dtas Sent for Search,   1 to grunt
17PM-Orbi.ms2: 3 (%0.0) dtas Sent for Search,   1 to grunt
stat = 0
17PM-Orbi.ms2: 4 (%0.0) dtas Sent for Search,   2 to grunt
stat = 0
17PM-Orbi.ms2: 5 (%0.0) dtas Sent for Search,   2 to grunt
stat = 0
17PM-Orbi.ms2: 6 (%0.0) dtas Sent for Search,   2 to grunt
stat = 0
17PM-Orbi.ms2: 7 (%0.1) dtas Sent for Search,   2 to grunt
stat = 0
17PM-Orbi.ms2: 8 (%0.1) dtas Sent for Search,   3 to grunt
stat = 0
17PM-Orbi.ms2: 9 (%0.1) dtas Sent for Search,   3 to grunt
...

The program should now run significantly faster (in my case, 4 times as fast as before). If you have a very good machine with 8 or more cores, you may let it run to the end; it may not take very long. The number and percentage of processed spectra are continuously displayed on the screen, so you should get a pretty good idea of how long it would take to go through all 13,356 of them. If it still seems too long, or if you want to move on to the next step, you can always press Ctrl-C to stop the program.

SEQUEST test run on multiple computers (a cluster)

You can try this test on as few as two computers or on a large cluster, as the setup is fundamentally the same. I will illustrate it on a cluster of 4 computers, called shamu001 through shamu004, with two CPUs each.

Network drives and shared folders

In order to run SEQUEST on a cluster of computers, different steps may be necessary, depending on the particular configuration of your cluster. I will make some assumptions regarding that configuration, which tend to hold true for the majority of clusters out there. In case your system is different, you will need to adapt these instructions, or to ask the help of your system administrator. It is beyond the scope of this manual to provide support for all types of possible networking scenarios.

First, I assume that you have a common user name across all the machines (nodes) in the cluster, and that your home folder is shared by all machines (via NFS mounting, for example). Therefore, your fasta database and test run folders are visible to all machines (with both read and write permission).

Location of SEQUEST executable files

If you choose to put the two SEQUEST executable files in your home folder (for example, in /home/your_username/bin/), then no further steps are necessary. However, if you put the executables in a machine-dependent location like /usr/bin, then you have to make sure that all nodes in the cluster have copies of these files in the same location. In my example, the machines shamu001 through shamu004 all have a copy of unify_sequest in /usr/bin/:

[cociorva@shamu001 cociorva]$ ls -l /usr/bin/unify_sequest
-rwxr-xr-x    1 root     root        69593 Jun 14  2007 /usr/bin/unify_sequest
[cociorva@shamu002 cociorva]$ ls -l /usr/bin/unify_sequest
-rwxr-xr-x    1 root     root        69593 Jun 14  2007 /usr/bin/unify_sequest
[cociorva@shamu003 cociorva]$ ls -l /usr/bin/unify_sequest
-rwxr-xr-x    1 root     root        69593 Jun 14  2007 /usr/bin/unify_sequest
[cociorva@shamu004 cociorva]$ ls -l /usr/bin/unify_sequest
-rwxr-xr-x    1 root     root        69593 Jun 14  2007 /usr/bin/unify_sequest

Test SEQUEST executables path and ssh keys configuration

You will need to pick one of the machines in the cluster as your "master node", that is, the machine that will execute run_ms and control the distribution of SEQUEST threads. In my case, I chose shamu001, but I could equally have chosen any of the other machines. Log in to that machine, and test that you can send ssh commands to it and to the other machines, and that unify_sequest is in the user path:

[cociorva@shamu001 cociorva]$ ssh shamu001 which unify_sequest
/usr/bin/unify_sequest
[cociorva@shamu001 cociorva]$ ssh shamu002 which unify_sequest
/usr/bin/unify_sequest
[cociorva@shamu001 cociorva]$ ssh shamu003 which unify_sequest
/usr/bin/unify_sequest
[cociorva@shamu001 cociorva]$ ssh shamu004 which unify_sequest
/usr/bin/unify_sequest

Note: the first time you run these tests, you may get a warning message asking whether you indeed want to proceed with the remote connection. Type yes if prompted. Repeating the test should then produce no more warnings.

Like before, your test may fail in one of two ways: either you are prompted for a password, in which case the ssh keys have not been properly configured, or the unify_sequest file is not found, in which case your path configuration may be wrong. Either way, your system administrator should be able to help you.

Edit cluster.txt file

Here is my example file:

shamu001
shamu001
shamu002
shamu002
shamu003
shamu003
shamu004
shamu004

This file tells run_ms to start 2 SEQUEST threads on each of the 4 machines in my mini-cluster (including the master node shamu001, which has the additional task of executing run_ms itself). As before, you can start as many threads on as many machines as you wish, but you should limit the number of threads to the number of physical CPUs or cores available. You may also prefer to leave the master node out altogether, or give it a lighter load, to compensate for the fact that it has to execute run_ms.
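A cluster file like the 8-line example above can be generated with a nested loop, which scales better than hand-editing when the node list grows. A sketch using my four node names and two threads per node:

```shell
# Two SEQUEST threads on each of the four nodes:
for host in shamu001 shamu002 shamu003 shamu004; do
    for i in 1 2; do echo "$host"; done
done > cluster.txt
wc -l cluster.txt
```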

Run SEQUEST

Like before, type run_ms -f cluster.txt 17PM-Orbi.ms2 and press Enter. Here is what I get:

[cociorva@shamu001 SEQUEST_test]$ run_ms -f cluster.txt 17PM-Orbi.ms2
Starting Search for 17PM-Orbi.ms2
Number of Spectra 13356
Number of Spectra 13356
Reading in sequest.params file
machines = shamu001
machines = shamu001
machines = shamu002
machines = shamu002
machines = shamu003
machines = shamu003
machines = shamu004
machines = shamu004
A total of 8 computers in the cluster
Starting the Search
17PM-Orbi.ms2: 0 (%0.0) dtas Sent for Search,   1 to shamu001
17PM-Orbi.ms2: 1 (%0.0) dtas Sent for Search,   1 to shamu001
17PM-Orbi.ms2: 2 (%0.0) dtas Sent for Search,   1 to shamu002
17PM-Orbi.ms2: 3 (%0.0) dtas Sent for Search,   1 to shamu002
17PM-Orbi.ms2: 4 (%0.0) dtas Sent for Search,   1 to shamu003
17PM-Orbi.ms2: 5 (%0.0) dtas Sent for Search,   1 to shamu003
17PM-Orbi.ms2: 6 (%0.0) dtas Sent for Search,   1 to shamu004
17PM-Orbi.ms2: 7 (%0.1) dtas Sent for Search,   1 to shamu004
stat = 0
17PM-Orbi.ms2: 8 (%0.1) dtas Sent for Search,   2 to shamu001
stat = 0
17PM-Orbi.ms2: 9 (%0.1) dtas Sent for Search,   2 to shamu001
stat = 0
17PM-Orbi.ms2: 10 (%0.1) dtas Sent for Search,   2 to shamu003
stat = 0
17PM-Orbi.ms2: 11 (%0.1) dtas Sent for Search,   2 to shamu004
stat = 0
17PM-Orbi.ms2: 12 (%0.1) dtas Sent for Search,   2 to shamu002
stat = 0
17PM-Orbi.ms2: 13 (%0.1) dtas Sent for Search,   2 to shamu003
stat = 0
17PM-Orbi.ms2: 14 (%0.1) dtas Sent for Search,   3 to shamu001
stat = 0
17PM-Orbi.ms2: 15 (%0.1) dtas Sent for Search,   3 to shamu001
stat = 0
17PM-Orbi.ms2: 16 (%0.1) dtas Sent for Search,   2 to shamu004
stat = 0
17PM-Orbi.ms2: 17 (%0.1) dtas Sent for Search,   2 to shamu002
stat = 0
17PM-Orbi.ms2: 18 (%0.1) dtas Sent for Search,   4 to shamu001
stat = 0
17PM-Orbi.ms2: 19 (%0.1) dtas Sent for Search,   4 to shamu001
stat = 0
17PM-Orbi.ms2: 20 (%0.1) dtas Sent for Search,   3 to shamu003
stat = 0
17PM-Orbi.ms2: 21 (%0.2) dtas Sent for Search,   3 to shamu004
...

Now the program runs much faster than before, as the load is spread over multiple machines in the cluster.

Running SEQUEST under a batch system (e.g. PBS)

In all the examples so far, you have run SEQUEST in interactive mode: the program starts executing as soon as you type the command, and the log output goes to your screen. In real-world high-throughput proteomics, however, you need a system where you can submit batch jobs that are queued and executed by a scheduler, with the output going to a log file. There are many batch systems available, and it is not the purpose of this manual to review them. Here I only give an example of what a job file could look like on a PBS-based system:

#!/bin/sh
#PBS -l nodes=2:ppn=4
#PBS -l walltime=1:00:00
#PBS -l cput=8:00:00
#PBS -j oe

cd $PBS_O_WORKDIR
run_ms -f $PBS_NODEFILE 17PM-Orbi.ms2 > 17PM-Orbi.log
exit

In this example, the script executes a job on 2 nodes with 4 CPUs per node, asking for 1 hour of wall time and 8 hours of CPU time. The main line in this script is run_ms -f $PBS_NODEFILE 17PM-Orbi.ms2 > 17PM-Orbi.log. Compared with the interactive SEQUEST command, there are two differences: first, the output of the program is redirected to a file (17PM-Orbi.log); second, the cluster file is no longer a file you edit yourself, but a file provided by the batch system (its path is given by the environment variable $PBS_NODEFILE). This ensures that your SEQUEST threads are started on the cluster nodes the batch system assigned to your job. This detail is crucial on a batch system, because you do not know beforehand which nodes will be assigned to you, so you cannot possibly edit the cluster.txt file before the job starts. Using $PBS_NODEFILE is the only way to make sure that your SEQUEST threads are launched on the appropriate nodes.
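PBS typically writes each allocated node's name into $PBS_NODEFILE once per requested processor, which is exactly the cluster-file format run_ms expects (one line per thread). The sketch below only illustrates what such a file might contain for nodes=2:ppn=4; the node names are hypothetical, and the real file is created by the scheduler, not by you.

```shell
# Illustration only: plausible $PBS_NODEFILE contents for -l nodes=2:ppn=4.
cat > /tmp/nodefile.example <<'EOF'
node01
node01
node01
node01
node02
node02
node02
node02
EOF
wc -l < /tmp/nodefile.example    # prints 8, i.e. 8 SEQUEST threads
```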