A semi-automated pipeline is available for running NCBI BLAST on the RCC rcluster.
It splits a large query file containing many sequences into multiple small input files and runs blastall (NCBI) on each of them.
rccbatchblast - given sequences in FASTA format, find similar sequences in a BLAST database on rcluster. It splits the input file into chunks and submits all chunks to the queue. It takes all standard options of NCBI blastall, plus two more:
* -s The number of sequences in each unit. The input sequence file is split into many smaller files; this option defines how many sequences go into each split file.
* -q The name of the queue to which the jobs are submitted. For more detail about queues, please refer to rcc queue.
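The splitting step described above can be pictured with a small shell sketch. This is not the actual rccbatchblast code: the function name split_fasta, the chunk file naming (prefix.1, prefix.2, ...) and the demo sequences are all assumptions made for illustration only.

```shell
#!/bin/sh
# Hypothetical helper: split a FASTA file into chunks of at most N sequences.
# Illustrative only; rccbatchblast may split and name its units differently.
split_fasta() {
    infile=$1      # input FASTA file
    per_chunk=$2   # sequences per chunk (the role played by the chunk size option)
    prefix=$3      # output prefix; chunks are written as prefix.1, prefix.2, ...
    awk -v n="$per_chunk" -v p="$prefix" '
        /^>/ {
            # Every n-th header line starts a new chunk file.
            if (count % n == 0) {
                if (out != "") close(out)
                out = p "." (int(count / n) + 1)
            }
            count++
        }
        # Write header and sequence lines to the current chunk.
        out != "" { print > out }
    ' "$infile"
}

# Tiny demonstration: three made-up sequences, two per chunk.
workdir=$(mktemp -d)
printf '>seq1\nACGT\n>seq2\nGGCC\nTTAA\n>seq3\nCATG\n' > "$workdir/demo.fasta"
split_fasta "$workdir/demo.fasta" 2 "$workdir/chunk"
grep -c '^>' "$workdir/chunk.1" "$workdir/chunk.2"
```

With a chunk size of 2, the first two sequences land in the first chunk and the third in a second chunk; each chunk would then be one unit submitted to the queue.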
Search Result utilities
rccbatchblast-check - check the results of rccbatchblast
* After submitting your job, use it to check whether your jobs are done.
* If all jobs succeeded, the BLAST results are merged into the output file; the number of input sequences, the number of result queries, and the total CPU time are summarized in check.report.
* If jobs failed, or there are duplicated results in units, the suspicious folders are backed up under the name e + original folder name; the commands for cleanup and resubmission are given in the report.
* Please check and analyze the errors, then resubmit. All results are written to check.report.
* In: original FASTA file, name-of-output-blast-result
* Out: check.report, output-blast-result
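The comparison between input sequences and result queries can also be done by hand. The sketch below is only an assumption about how such a check might look: it presumes a tab-separated result (as produced by blastall with -m 8) with the query ID in the first column, and the helper names count_fasta and count_queries and the demo data are made up. Queries without any hit do not appear in tabular output, so the two numbers can legitimately differ.

```shell
#!/bin/sh
# Count sequences in a FASTA file (one ">" header line per sequence).
count_fasta() { grep -c '^>' "$1"; }

# Count distinct query IDs in a tabular result
# (assumption: tab-separated, query ID in column 1).
count_queries() { cut -f1 "$1" | sort -u | wc -l | tr -d ' '; }

# Demonstration with made-up data: three queries, two of which have hits.
demo=$(mktemp -d)
printf '>q1\nACGT\n>q2\nGGCC\n>q3\nTTAA\n' > "$demo/input.fasta"
printf 'q1\ts1\t99.0\nq1\ts2\t95.0\nq3\ts7\t90.1\n' > "$demo/result.tab"

n_in=$(count_fasta "$demo/input.fasta")
n_out=$(count_queries "$demo/result.tab")
echo "input sequences: $n_in, queries with hits: $n_out"
```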
Note: DO NOT use "submit job to the queue".
rccbatchblast is a script that already takes care of submitting to the queue.
Apart from the command name "rccbatchblast", the options are the same as for NCBI blast, plus the options for queue name and chunk size. Please refer to Blast.
e.g.
mkdir my-new-folder
cp input.fasta my-new-folder
cd my-new-folder
rccbatchblast -i input.fasta -d targetdatabase -p program-name -b bValue -v vValue -size number-of-sequences-in-split-unit -queue queue-name -m mValue
default size-of-split-unit=1000; (for tblastx, we suggest a size of 200)
default queue-name=r1-96h; refer to queue at rcluster for more queue-name options.
The output file is named in the rccbatchblast-check step below.
To check the status of your jobs, use
bjobs -u your-user-name
your-user-name: the user who ran rccbatchblast above.
To kill all the jobs you submitted, use
bkill -u your-user-name 0
rccbatchblast-check infile outfile
infile: the original input FASTA file that was blasted.
outfile: the name to give to the BLAST result.
If the check reports that everything succeeded, all BLAST results are merged into the output file named in rccbatchblast-check. There is no need to keep the *h folders. Use the following to clean up:
rm -rf *h
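Because the *h pattern matches anything in the current folder whose name ends in h, it may be worth previewing what will be deleted first. The following is a cautious sketch run in a throwaway directory; the folder names 1h, 2h and keep are invented for the demonstration and are not the names rccbatchblast actually uses.

```shell
#!/bin/sh
# Work in a temporary directory so the demonstration cannot touch real data.
demo=$(mktemp -d)
cd "$demo" || exit 1
mkdir 1h 2h keep   # made-up folders: two end in "h", one does not

# Preview: list every directory the *h pattern would match.
for d in *h; do
    if [ -d "$d" ]; then
        echo "would remove: $d"
    fi
done

# Delete only after the preview looks right.
rm -rf -- *h
```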