As most of you are probably aware, Sean released HMMER3.0b3 last month. The beta 3 version of HMMER3.0 contains a few bug fixes and the four HMMER3 search programs now allow multi-core parallelisation. We’ve just updated all of the Pfam sequence search tools to use the new HMMER3.0 beta 3 release, so we thought we’d update you on what these changes mean for Pfam.
HMMER3.0b2 had a bug with the --cut_ga/--cut_tc/--cut_nc options. The bug meant that when these options were chosen, some matches were reported that did not meet the specified threshold. The --cut_ga option is one of the options that we make available to users when performing their sequence searches on the website, or, for those users who have downloaded “pfam_scan.pl”, on the command line. Fortunately, the HMMER3.0b2 --cut_ga bug didn’t affect Pfam searches as the “pfam_scan.pl” script post-processes the results in such a way that only those matches that meet the cut-off threshold are reported.
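To illustrate the kind of post-processing described above, here is a minimal Python sketch. It is not the code from “pfam_scan.pl” (which is a Perl script); the `Match` class, the function name, the accessions and the threshold values are all made up for illustration, and the sketch ignores the distinction between sequence-level and domain-level thresholds.

```python
from dataclasses import dataclass


@dataclass
class Match:
    family: str       # hypothetical family accession
    bit_score: float  # bit score reported by the search


def filter_by_gathering_threshold(matches, ga_thresholds):
    """Keep only matches whose bit score meets their family's threshold."""
    kept = []
    for match in matches:
        threshold = ga_thresholds.get(match.family)
        # Drop matches with no known threshold or a score below it.
        if threshold is not None and match.bit_score >= threshold:
            kept.append(match)
    return kept


# Illustrative values only; these are not real Pfam thresholds.
matches = [Match("PF00001", 25.3), Match("PF00001", 18.0), Match("PF00002", 30.1)]
ga_thresholds = {"PF00001": 20.0, "PF00002": 30.1}
kept = filter_by_gathering_threshold(matches, ga_thresholds)
```

A filter like this makes the final output independent of whether the search program applied the cut-off correctly, which is why a bug of this kind would not show up in the reported results.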
We’ve been running the batch searches with HMMER3.0b2 for a couple of months now, and the throughput has been very impressive. Thanks to the substantial speed improvements in HMMER3 and the possibility of parallelising searches across multiple machines, we have just increased the batch search limit from 1000 to 5000 sequences per file. This should allow some smaller genomes to be searched in just one batch job.
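For genomes that still exceed the limit, the input has to be split into several files. The following Python sketch shows one way this could be done; the function name is hypothetical and it assumes a plain FASTA input in which every sequence starts with a “>” header line.

```python
def split_fasta(lines, max_per_file=5000):
    """Yield lists of FASTA lines, each holding at most max_per_file sequences."""
    batch, count = [], 0
    for line in lines:
        if line.startswith(">"):
            # Start a new batch once the current one is full.
            if count == max_per_file:
                yield batch
                batch, count = [], 0
            count += 1
        batch.append(line)
    if batch:
        yield batch


# Tiny demonstration with a limit of 2 sequences per batch.
example = [">a", "AC", ">b", "GT", ">c", "TT"]
batches = list(split_fasta(example, max_per_file=2))
```

Each batch could then be written to its own file and submitted as a separate job.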
HMMER3.0b3 compatible “pfam_scan.pl”
We have released a new version of pfam_scan.pl that is compatible with HMMER3.0b3 (available on the FTP site). There weren’t many changes between beta2 and beta3 that affected hmmscan, so there are very few changes in the “pfam_scan.pl” script. The biggest change is the option to use multiple CPUs (or cores). We have implemented this option on the website, so you should see a slight increase in speed when performing sequence searches online. The option is also available on the command line for those who download the new “pfam_scan.pl”.
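As a rough illustration of what a multi-core search looks like when calling hmmscan directly, the Python sketch below assembles a command line using hmmscan’s --cpu and --cut_ga options. The helper function and the file name “query.fasta” are hypothetical, and the sketch only builds the command without running it; check the “pfam_scan.pl” documentation for the exact options the script itself accepts.

```python
def build_hmmscan_command(hmm_db, fasta, cpus=4, use_ga=True):
    """Assemble an hmmscan command line as a list of arguments."""
    cmd = ["hmmscan", "--cpu", str(cpus)]
    if use_ga:
        # Use the gathering thresholds stored in the HMM library.
        cmd.append("--cut_ga")
    cmd += [hmm_db, fasta]
    return cmd


# "query.fasta" is a placeholder input file name.
cmd = build_hmmscan_command("Pfam-A.hmm", "query.fasta", cpus=4)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run` on a machine where HMMER3.0b3 is installed.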
We have also fixed a couple of minor bugs in “pfam_scan.pl” that affected the website. The first of these meant that we reported insignificant matches whose E-values exceeded the specified threshold. We have fixed this bug so that insignificant matches are reported only if their E-values fall below the threshold.
The second bug affected sequences which had more than one match to the same Pfam family. In these cases, if the sequence E-value was below the threshold, the website would report all domains, regardless of whether each individual domain E-value met the threshold. This has been fixed so that only domains whose E-values fall below the threshold are reported. Both of these bugs affected only the website and not the downloadable version of the script. Their effect should be fairly minor: no matches were lost, and the only consequence was that some matches were reported when they shouldn’t have been.
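The corrected behaviour for the second bug can be sketched in a few lines of Python. This is only an illustration of the logic described above, not the website’s actual code; the function name is hypothetical, and treating an E-value equal to the threshold as passing is an assumption.

```python
def report_domains(sequence_evalue, domain_evalues, threshold):
    """Return the domain E-values that should be reported for one family."""
    # The sequence as a whole must pass the threshold first.
    if sequence_evalue > threshold:
        return []
    # Then each domain is checked individually (the step the bug skipped).
    return [e for e in domain_evalues if e <= threshold]


# Two of the three domains pass a threshold of 1.0.
reported = report_domains(1e-5, [1e-6, 0.5, 2.0], threshold=1.0)
```

With the old behaviour, all three domains in this example would have been reported because the sequence-level E-value passed.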
Posted by Jaina & John.