We’re currently working on a new version of one of our core scripts, ‘pfam_scan.pl’. This script searches a set of protein sequences (in FASTA format) against Pfam’s library of HMMs. The original code was written nearly a decade ago but, since then, features have been added, bugs have been fixed and the code has evolved into something that is far from elegant. The re-write is something that we’ve been planning to do for a while and, as the code needs updating to use the new HMMER3 software, now seems like the perfect time to do it.
The purpose of ‘pfam_scan.pl’ is to search one or more sequences for matching Pfam domains. Depending on the user options, the script can also process the results such that overlaps between families belonging to the same clan are resolved and can predict active sites. When we generate the Pfam database, we use ‘hmmsearch’ to search a database of protein sequences using the HMM for each Pfam domain in turn. When we run ‘pfam_scan.pl’, however, we use ‘hmmscan’ (previously known as ‘hmmpfam’ in HMMER2) to search a library of HMMs using a set of query sequences. As an aside, it’s worth noting that there can be a subtle difference between the results that you’ll get when searching a sequence using ‘pfam_scan.pl’, and the matches that would be stored in the Pfam database for exactly the same sequence, as we are using two different HMMER programs. This is a small effect, but one that’s worth knowing about.
We’re seeing a roughly 100-fold increase in search speeds using HMMER3, so we want to pay particular attention to the efficiency of our Perl code, since we don’t want this to be the rate limiting step when performing a search. We’re pleased to note that our benchmarks show that, for a typical sequence search of 300 amino acids against a library of around 11,000 HMMs, the new ‘pfam_scan.pl’ code adds only about 100-200 msecs to the search time, over and above the 1 second ‘hmmscan’ run time (benchmarks were performed on a single 2.4GHz AMD Opteron processor).
We want to use exactly the same code when running searches on our website that our users would use when running searches on their own machines. We’ve taken this requirement into account from the start, so the new ‘pfam_scan.pl’ is written in a far more modular fashion than the old one. This has necessitated some changes in the dependencies of the script, however.
In the past, ‘pfam_scan.pl’ was a standalone script, with no external dependancies other than standard Perl library modules and the HMMER programs. Rather than repeatedly re-invent the wheel, we’ve decided to forgo the standalone nature of the script and use a few modules that can be installed from CPAN. We appreciate that this might cause difficulties for some of our users, so we’re looking at whether to bundle the software along with all of its dependencies, or simply to list the dependencies and let people install them for themselves.
Those of you familiar with Pfam models will know that, previously, we created two HMMER2 models for each Pfam family: one for global matches to the model and one for local matches to the model. Substantial improvements to the local-local search method in HMMER3 now allow us to model each Pfam family with a single HMM. This means that we no longer need to choose one hit over another, in those cases where a sequence has overlapping global and local matches to a single model. Most importantly for searches, it also means that the script will only need to search against half as many models as compared to the HMMER2 version.
Although HMMER3 makes life easier in some ways, it does introduce some more complexity. The HMMER3 version of ‘pfam_scan.pl’ will report two sets of coordinates for each match, namely the alignment coordinates, and the envelope coordinates. We’ll explain more about these in a later blog post…
New ouput formats
One of the problems we’ve always had with ‘pfam_scan.pl’ is its rather terse tabular output. The old version presented hits as a simple table of results, which, if you wanted to make further use of them, had to be parsed and turned back into a data structure. In fact, the Pfam website does exactly that when running a sequence search. For the new version, we want to make sure that, as well as providing the familiar tabular output by default, we can also get the results of a search in more useful formats as well.
The main component of ‘pfam_scan.pl’ is now written as a Perl module, which is responsible for running a search and returning results as a Perl data structure. The actual script, ‘pfam_scan.pl’, is now just a thin wrapper around that module, and it’s really only responsible for interpreting command line arguments. By returning results as a Perl data structure, we’re making it much easier (and quicker) to interpret results, or to pass them onto other analysis tools.
The new architecture also allows us to think about adding other output formats. For our internal purposes, a raw Perl data structure is most useful, but if you’re a ‘pfam_scan.pl’ user and feel strongly that we should be considering some other output format (XML, CSV, maybe even JSON), now is the time to let us know !
Finally, if there are any brave souls who would be willing and able to help us test the new ‘pfam_scan.pl’, we’d really like to hear from you. Testing the script will require you to install quite a few things, such as the HMMER3 executables, the new HMMER3-based HMM library and the various Perl modules that the script requires. If that prospect doesn’t put you off, please do get in touch, either by leaving a comment here or by mail.
Posted by Jaina & John