Back in May we wrote a blog post about the new version of pfam_scan.pl. We asked if there was anyone out there who was willing to help us test our new script, and we were pleasantly surprised at the number of people who got in contact with us – so a big thank you to all those who have helped. Since releasing the alpha version of pfam_scan.pl to our testers we have made some internal changes to the script that are worth mentioning:
Searching Pfam-B HMMs
We are frequently asked how to search sequences against Pfam-B families, and now we are providing this facility as part of pfam_scan.pl. Pfam-A families are the main entry point for scientists in their day-to-day use of Pfam. The alignments for each Pfam-A family have been carefully checked by one of our curators, and each one is accompanied by annotation. In the past we have made a library of all the curated entries in the database (the Pfam-A HMMs) available for users to search against.
Within Pfam, we have a second set of families called Pfam-B families. These families are automatically generated alignments that have no accompanying HMMs or annotation, and their alignments have not passed any of the normal quality control that we would perform for a Pfam-A family. This means Pfam-B families are of much lower quality than our Pfam-A families; however, they do help us to cover the regions of sequence space that Pfam-A families do not.
In the last release there were 223,403 Pfam-B entries, with the number of sequences matching each Pfam-B entry following a long-tailed distribution. With HMMER3 being significantly faster than HMMER2, we have chosen to generate HMMs for the 20,000 largest Pfam-B families, as these are the most relevant. Most entries below this cut-off contain fewer than 10 sequences. As multi-threaded versions of HMMER3 become available we may include more entries, but for the moment the search times against the Pfam-A and Pfam-B libraries are equivalent.
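The size-based cut-off described above can be sketched in a few lines of Perl. This is only an illustration of ranking families by sequence count and keeping the largest N. The function name largest_families and the family names and counts are invented for the example and are not taken from the Pfam pipeline.

```perl
use strict;
use warnings;

# Return the names of the $n largest families, given a hash reference of
# family name => number of sequences. Ties are broken by name so that the
# result is deterministic.
sub largest_families {
    my ( $sizes, $n ) = @_;
    my @ranked = sort { $sizes->{$b} <=> $sizes->{$a} or $a cmp $b } keys %$sizes;
    $n = @ranked if $n > @ranked;    # never ask for more than we have
    return @ranked[ 0 .. $n - 1 ];
}

# Toy data, invented for illustration only.
my %sizes = ( fam_a => 5200, fam_b => 8, fam_c => 310, fam_d => 3 );

print join( ', ', largest_families( \%sizes, 2 ) ), "\n";    # prints: fam_a, fam_c
```

With a long-tailed size distribution, a cut-off like this keeps the small number of large families and drops the many tiny ones.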
BUT… please remember that Pfam-B accessions are not stable between Pfam releases, so do not rely on them!
We wanted the new version of pfam_scan.pl to be fast, but also much more maintainable than the previous version. When we started work on the new pfam_scan.pl, we made the decision to use a number of third-party Perl modules from CPAN. The most important of these is definitely the Moose framework. For those of you who are unfamiliar with Moose, it’s a complete object system that improves on the Perl 5 object system and makes object-oriented code simpler and more powerful.
A word or two about Moose
Some of our testers noticed that using Moose contributed fairly significantly to the runtime of the script (excluding the time spent running HMMER3). The performance overhead of a system like Moose is definitely something that needs to be considered. After careful consideration, we decided that it was a price worth paying, but we also thought that it was worth explaining our reasoning a little.
One of the benefits of Moose is that Moose-based objects can be configured to perform extensive data validation, in a way that is both easy to implement and maintain. The Pfam production pipeline has been entirely re-written for the upcoming Pfam release, and it now relies heavily on Moose for data validation. Since the pipeline modules perform many of the operations that we need in pfam_scan.pl, such as parsing the output of HMMER3 programs, we’ve been able to use these new modules in the script, rather than having to rewrite the same functionality from scratch.
The validation checks that can be built into Moose objects avoid the need to write complex data validation procedures for ourselves. The code that implements the checks is part of the core Moose modules and is therefore maintained by the maintainers of those CPAN modules, not us. Our new production code was easier to write because it relies on Moose, and the built-in validation checks are much tighter than they would otherwise have been, despite the fact that we’ve written less code overall. By using the same modules in pfam_scan.pl, we’ve further reduced the amount of code that we need to maintain and have significantly sped up the development of pfam_scan.pl itself.
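To give a flavour of the declarative validation described above, here is a minimal sketch of a Moose class. It is not taken from the Pfam code base: the class name HmmerHit, its attributes, the PositiveInt type and the try_hit helper are all hypothetical and exist only for this example. It assumes Moose is installed from CPAN.

```perl
package HmmerHit;    # hypothetical class, not part of the Pfam modules
use Moose;
use Moose::Util::TypeConstraints;

# A custom type: an integer that must be greater than zero.
subtype 'PositiveInt',
    as 'Int',
    where { $_ > 0 },
    message { 'Expected a positive integer' };

# Each attribute declares its type; Moose checks the values in the constructor.
has 'family'    => ( is => 'ro', isa => 'Str',         required => 1 );
has 'seq_start' => ( is => 'ro', isa => 'PositiveInt', required => 1 );
has 'seq_end'   => ( is => 'ro', isa => 'PositiveInt', required => 1 );
has 'evalue'    => ( is => 'ro', isa => 'Num',         required => 1 );

no Moose::Util::TypeConstraints;
no Moose;
__PACKAGE__->meta->make_immutable;    # standard Moose idiom; speeds up construction

package main;
use strict;
use warnings;

# Hypothetical helper: returns the object, or undef if validation failed.
sub try_hit {
    my %args = @_;
    my $hit = eval { HmmerHit->new(%args) };
    return $hit;
}

# Made-up values, for illustration only.
my $good = try_hit( family => 'example_family', seq_start => 10, seq_end => 120, evalue => 0.001 );
my $bad  = try_hit( family => 'example_family', seq_start => 'ten', seq_end => 120, evalue => 0.001 );

print 'good hit: ', ( defined $good ? 'accepted' : 'rejected' ), "\n";
print 'bad hit:  ', ( defined $bad  ? 'accepted' : 'rejected' ), "\n";
```

The class contains no hand-written checking code. The type constraints and the required flags are declarations, and the constructor that Moose generates throws an exception when a value does not satisfy them.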
In short, we’re happy that we’ve struck an appropriate balance between all-out performance and long-term maintainability. We’ll keep up to date with changes to the Moose framework that might help improve performance, and we’ve already improved the performance of the production pipelines by leveraging some of the more esoteric Moose features. The final release version of the modules and scripts will include these under-the-hood tweaks.
Once again, thank you if you tested out the alpha release of the HMMER3-enabled version of pfam_scan.pl. Your feedback should reduce our pain, and the pain of the community, when it’s bundled with Pfam release 24.0. For those of you who cannot wait, you can find the beta release of the script on our (well, Rob’s) FTP site. Of course, if you find any bugs, or have any feedback, please contact us.
The first official release of pfam_scan.pl shipped with Pfam 24.0, and it’s available for download (the tarball is called PfamScan.tar.gz) at ftp://ftp.sanger.ac.uk/pub/databases/Pfam/Tools/.