pfam_scan.pl – part II

September 11, 2009

Back in May we wrote a blog post about the new version of pfam_scan.pl. We asked if there was anyone out there who was willing to help us test our new script, and we were pleasantly surprised at the number of people who got in contact with us – so a big thank you to all those who have helped. Since releasing the alpha version of pfam_scan.pl to our testers we have made some internal changes to the script that are worth mentioning:

Searching Pfam-B HMMs

We are frequently asked how to search sequences against Pfam-B families, and now we are providing this facility as part of pfam_scan.pl. Pfam-A families are the main entry point for scientists in their day-to-day use of Pfam. The alignments for each Pfam-A family have been carefully checked by one of our curators, and each one is accompanied by annotation. In the past we have made a library of all the curated entries in the database (the Pfam-A HMMs) available for users to search against.

Within Pfam, we have a second set of families called Pfam-B families. These families are automatically generated alignments that have had no accompanying HMMs or annotation, and their alignments have not passed any of the normal quality control that we would perform for a Pfam-A family. This means Pfam-B families are of much lower quality than our Pfam-A families, but they do help us to cover the sequence space that Pfam-A families do not.

In the last release there were 223,403 Pfam-B entries, with the number of sequences matching each Pfam-B entry following a long-tailed distribution. With HMMER3 being significantly faster than HMMER2, we have chosen to generate HMMs for the 20,000 largest Pfam-B families, as these are the most relevant. Most entries below this cut-off contain fewer than 10 sequences. As multi-threaded versions of HMMER3 become available we may include more entries, but for the moment the search times against the Pfam-A and Pfam-B libraries are equivalent.
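The cut-off described above amounts to ranking families by size and keeping the largest N. The following is only a minimal sketch of that idea in Python, not the code used in the Pfam pipeline; the accessions and sequence counts are made-up placeholders, and `largest_families` is a hypothetical helper name.

```python
import heapq

# Made-up example data: Pfam-B accession -> number of matching sequences.
# Real Pfam-B sizes follow a long-tailed distribution; these values are
# placeholders for illustration only.
family_sizes = {
    "PB000001": 5120,
    "PB000002": 870,
    "PB000003": 42,
    "PB000004": 9,
    "PB000005": 3,
}


def largest_families(sizes, n):
    """Return the n (accession, size) pairs with the most sequences."""
    return heapq.nlargest(n, sizes.items(), key=lambda item: item[1])


# Keep the three largest families (the post uses a cut-off of 20,000).
for accession, size in largest_families(family_sizes, 3):
    print(accession, size)
```

With a long-tailed distribution, a cut-off like this keeps the families that account for most of the matched sequences while skipping the very large number of tiny ones.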

BUT….please remember that Pfam-B accessions are not stable between Pfam releases, so do not rely on them!

Additional formats

A further change adds the ability to write search results in JavaScript Object Notation (JSON). JSON is a compact, text-based data format, which is most commonly used in the context of the web and JavaScript applications. However, precisely because it’s so compact and portable, JSON can also be useful as a sort of lightweight XML replacement. Hopefully the JSON output option of pfam_scan.pl will be useful for those users who would otherwise have to parse the raw text output in order to do more processing.
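To give a feel for why JSON output saves work downstream, here is a minimal Python sketch that loads a list of hits and filters them by E-value. The field names (`seq_id`, `family`, `start`, `end`, `evalue`) and the sample records are assumptions made for illustration; they are not the documented output schema of pfam_scan.pl, so check the real output before relying on any of them.

```python
import json

# Hypothetical JSON shape: the field names and values below are
# illustrative assumptions, not the actual pfam_scan.pl output format.
raw = """
[
  {"seq_id": "seq1", "family": "PF00001", "start": 12, "end": 180, "evalue": 1.2e-30},
  {"seq_id": "seq1", "family": "PB000123", "start": 200, "end": 260, "evalue": 0.004},
  {"seq_id": "seq2", "family": "PF00002", "start": 5, "end": 90, "evalue": 3.0e-12}
]
"""


def significant_hits(json_text, max_evalue=1e-5):
    """Parse a JSON array of hits and keep those at or below max_evalue."""
    hits = json.loads(json_text)
    return [hit for hit in hits if hit["evalue"] <= max_evalue]


for hit in significant_hits(raw):
    print(hit["seq_id"], hit["family"], hit["start"], hit["end"])
```

The point is that a standard JSON parser does all of the tokenising, so the only code left to write is the filtering logic itself.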

We wanted the new version of pfam_scan.pl to be fast, but also much more maintainable than the previous version. When we started work on the new pfam_scan.pl, we made the decision to use a number of third-party Perl modules from CPAN. The most important of these is definitely the Moose framework. For those of you who are unfamiliar with Moose, it’s a complete object system that improves on the Perl 5 object system and makes object-oriented code simpler and more powerful.

A word or two about Moose

Some of our testers noticed that using Moose contributed fairly significantly to the runtime of the script (excluding running HMMER3). The performance overhead of a system like Moose is definitely something that needs to be considered. After careful consideration, we decided that it was a price worth paying, but we also thought that it was worth explaining our reasoning a little.

One of the benefits of Moose is that Moose-based objects can be configured to perform extensive data validation, in a way that is both easy to implement and maintain. The Pfam production pipeline has been entirely re-written for the upcoming Pfam release, and it now relies heavily on Moose for data validation. Since the pipeline modules perform many of the operations that we need in pfam_scan.pl, such as parsing the output of HMMER3 programs, we’ve been able to use these new modules in the script, rather than having to rewrite the same functionality from scratch.

The validation checks that can be built into Moose objects avoid the need to write complex data validation procedures for ourselves. The code that implements the checks is part of the Moose modules themselves and is therefore maintained by the maintainers of those CPAN modules, not us. Our new production code was easier to write because it relies on Moose, and the built-in validation checks are much tighter than they would otherwise have been, despite the fact that we’ve written less code overall. By using the same modules in pfam_scan.pl, we’ve further reduced the amount of code that we need to maintain overall and have significantly sped up the development of pfam_scan.pl itself.
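For readers who have not seen validation built into an object system, the sketch below shows the general idea in Python. It is an analogue only: Moose expresses such checks declaratively in Perl through type constraints on attributes, whereas this example hand-writes them in a dataclass. The `DomainHit` class and its fields are hypothetical and are not part of the Pfam modules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DomainHit:
    """Hypothetical record for one domain match (not a real Pfam class)."""

    family: str
    start: int
    end: int
    evalue: float

    def __post_init__(self):
        # Validation runs once, at construction time, so every object that
        # exists is known to be well-formed.
        if not isinstance(self.family, str) or not self.family:
            raise TypeError("family must be a non-empty string")
        for name in ("start", "end"):
            value = getattr(self, name)
            if not isinstance(value, int) or value < 1:
                raise ValueError(f"{name} must be a positive integer")
        if self.end < self.start:
            raise ValueError("end must not be smaller than start")
        if not isinstance(self.evalue, (int, float)) or self.evalue < 0:
            raise ValueError("evalue must be a non-negative number")


good = DomainHit("PF00001", 12, 180, 1e-30)
print(good)

try:
    DomainHit("PF00001", 180, 12, 1e-30)  # coordinates the wrong way round
except ValueError as error:
    print("rejected:", error)
```

Because bad data is rejected when the object is created, code that consumes these objects does not need to repeat the same checks.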

In short, we’re happy that we’ve struck an appropriate balance between all-out performance and long-term maintainability. We’ll keep up to date with changes to the Moose framework that might help improve performance, and we’ve already improved the performance of the production pipelines by leveraging some of the more esoteric Moose features. The final release version of the modules and scripts will include these under-the-hood tweaks.

And Finally…..

Once again, thank you if you tested out the alpha release of the HMMER3-enabled version of pfam_scan.pl. Your feedback should reduce our pain, and the pain of the community, when it’s bundled with Pfam release 24.0. For those of you who cannot wait, you can find the beta release of the script on our (well, Rob’s) ftp site. Of course, if you find any bugs, or have any feedback, please contact us.

Update 16/10/09

The first official release of pfam_scan.pl came out with Pfam 24.0, and it’s available for download (the tarball is called PfamScan.tar.gz) at ftp://ftp.sanger.ac.uk/pub/databases/Pfam/Tools/.

5 Responses to “pfam_scan.pl – part II”

  1. HCPro Says:

    sdagssfwkgfdrqfiatrdrpknaheckatinveecgemaaiinqllfpmwkitctqcgellemlsqeeelesfrrkrnqlasklsslhnkfpyvdhflnryensldqmntnfdahkqiaqiiggrkeipfsnlgrlnelliksdklvsedfyemsqclleltrwhknrsdsfkkgevhhfrnkisgkaqfnfslmcdnqldkngnfvwgergyhakrffsnffekvdstdgykkhimrvnpngtrqtaigklilstdpstlrqqmkgnpitrvpvgkhctskrddcyvypaccvtmedgtplfsdikmptknhlvignsgdpkyvdvpssssdmivakegycylniflamllnvnesesksftkkvrdiivprlgqwpslidvatecyflsafhpetknaelprilvdhtskcmhvidsygsldtqfhvlkantvsqlikfadddldsemkhylvg

  2. Juan Carlos Says:

    Hi,
    I’ve been trying to use pfam_scan.pl from your site and haven’t been able to get it to work… I always get this error:

    Failed to parse hmmsearch results |[ok]
    | in header section

    Think i did everything right, hmmer is in the path, also pfam_scan and checked all the perl modules…
    but maybe forgot something, but could you help me with this?

    Also, note that the ftp site for the beta version no longer exists or is unavailable.

    Thanks in advance!

    jc

    • jainamistry Says:

      Hi JC,

      HMMER3.1b has an additional line in its hmmscan output compared to HMMER3.0, and this breaks the parser that pfam_scan.pl uses. We would recommend that you use HMMER3.0 with pfam_scan.pl, as HMMER3.1b is, as the name suggests, still in beta testing. When HMMER3.1 is officially released, we will release an updated version of pfam_scan.pl which will be able to cope with any format changes that have taken place.

      Hope that helps,

      Jaina

