pfam_scan.pl

May 21, 2009

We’re currently working on a new version of one of our core scripts, ‘pfam_scan.pl’. This script searches a set of protein sequences (in FASTA format) against Pfam’s library of HMMs. The original code was written nearly a decade ago but, since then, features have been added, bugs have been fixed and the code has evolved into something that is far from elegant. The re-write is something that we’ve been planning to do for a while and, as the code needs updating to use the new HMMER3 software, now seems like the perfect time to do it.

The purpose of ‘pfam_scan.pl’ is to search one or more sequences for matching Pfam domains. Depending on the user options, the script can also post-process the results, resolving overlaps between matches to families that belong to the same clan and predicting active sites. When we generate the Pfam database, we use ‘hmmsearch’ to search a database of protein sequences using the HMM for each Pfam domain in turn. When we run ‘pfam_scan.pl’, however, we use ‘hmmscan’ (the HMMER3 successor to HMMER2’s ‘hmmpfam’) to search a library of HMMs using a set of query sequences. As an aside, because these are two different HMMER programs, there can be a subtle difference between the results that you’ll get when searching a sequence using ‘pfam_scan.pl’ and the matches that would be stored in the Pfam database for exactly the same sequence. This is a small effect, but one that’s worth knowing about.
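To give a feel for what resolving clan overlaps involves, here is a minimal Python sketch. It is not the code in ‘pfam_scan.pl’: the `Hit` record, its field names and the rule used here (when two overlapping matches belong to the same clan, keep the one with the lower E-value) are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Hit:
    """One match of a family to a sequence (hypothetical record layout)."""
    family: str
    clan: Optional[str]  # None for families that are not in any clan
    start: int           # 1-based, inclusive
    end: int             # 1-based, inclusive
    evalue: float


def overlaps(a: Hit, b: Hit) -> bool:
    """True if the two matches share at least one residue."""
    return a.start <= b.end and b.start <= a.end


def resolve_clan_overlaps(hits: List[Hit]) -> List[Hit]:
    """Drop a hit if a better-scoring hit from the same clan overlaps it.

    Assumed rule: the lowest E-value wins. Hits from different clans, or
    from families with no clan, are never removed.
    """
    kept: List[Hit] = []
    # Visit the most significant hits first so that they claim their region.
    for hit in sorted(hits, key=lambda h: h.evalue):
        clash = any(
            hit.clan is not None and hit.clan == k.clan and overlaps(hit, k)
            for k in kept
        )
        if not clash:
            kept.append(hit)
    # Report the surviving hits in sequence order.
    return sorted(kept, key=lambda h: h.start)
```

Two overlapping matches from different clans both survive under this rule; only matches that compete within a single clan are filtered.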

Speed

We’re seeing a roughly 100-fold increase in search speeds using HMMER3, so we want to pay particular attention to the efficiency of our Perl code, since we don’t want this to be the rate-limiting step when performing a search. We’re pleased to note that our benchmarks show that, for a typical search of a 300 amino acid sequence against a library of around 11,000 HMMs, the new ‘pfam_scan.pl’ code adds only about 100-200 ms to the search time, over and above the 1 second ‘hmmscan’ run time (benchmarks were performed on a single 2.4GHz AMD Opteron processor).

Design

We want to use exactly the same code when running searches on our website that our users would use when running searches on their own machines. We’ve taken this requirement into account from the start, so the new ‘pfam_scan.pl’ is written in a far more modular fashion than the old one. This has necessitated some changes in the dependencies of the script, however.

In the past, ‘pfam_scan.pl’ was a standalone script, with no external dependencies other than standard Perl library modules and the HMMER programs. Rather than repeatedly re-invent the wheel, we’ve decided to forgo the standalone nature of the script and use a few modules that can be installed from CPAN. We appreciate that this might cause difficulties for some of our users, so we’re looking at whether to bundle the software along with all of its dependencies, or simply to list the dependencies and let people install them for themselves.

Simplifications, complications

Those of you familiar with Pfam models will know that, previously, we created two HMMER2 models for each Pfam family: one for global matches to the model and one for local matches to the model. Substantial improvements to the local-local search method in HMMER3 now allow us to model each Pfam family with a single HMM. This means that we no longer need to choose one hit over another, in those cases where a sequence has overlapping global and local matches to a single model. Most importantly for searches, it also means that the script will only need to search against half as many models as compared to the HMMER2 version.

Although HMMER3 makes life easier in some ways, it does introduce some additional complexity. The HMMER3 version of ‘pfam_scan.pl’ will report two sets of coordinates for each match, namely the alignment coordinates and the envelope coordinates. We’ll explain more about these in a later blog post…
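As a rough picture of what "two sets of coordinates" means for anyone consuming the results, the Python sketch below stores both ranges on a single match record. The class and field names are hypothetical, and the containment check encodes an assumption (that the aligned region normally sits inside the envelope) rather than anything guaranteed by ‘pfam_scan.pl’.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    """A single match carrying both coordinate pairs (hypothetical layout)."""
    family: str
    ali_start: int  # alignment coordinates, 1-based inclusive
    ali_end: int
    env_start: int  # envelope coordinates, 1-based inclusive
    env_end: int

    @property
    def ali_length(self) -> int:
        """Number of residues spanned by the alignment."""
        return self.ali_end - self.ali_start + 1

    @property
    def env_length(self) -> int:
        """Number of residues spanned by the envelope."""
        return self.env_end - self.env_start + 1

    def alignment_within_envelope(self) -> bool:
        """Check the assumed relationship between the two ranges."""
        return self.env_start <= self.ali_start and self.ali_end <= self.env_end


# Placeholder values, not real search output.
example = Match("ExampleFam", ali_start=12, ali_end=95, env_start=8, env_end=101)
```

Keeping both pairs on one record means downstream code can choose which range to use, for instance when deciding whether two matches overlap.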

New output formats

One of the problems we’ve always had with ‘pfam_scan.pl’ is its rather terse tabular output. The old version presented hits as a simple table of results, which, if you wanted to make further use of them, had to be parsed and turned back into a data structure. In fact, the Pfam website does exactly that when running a sequence search. For the new version, we want to make sure that, as well as providing the familiar tabular output by default, we can also get the results of a search in more useful formats.

The main component of ‘pfam_scan.pl’ is now written as a Perl module, which is responsible for running a search and returning results as a Perl data structure. The actual script, ‘pfam_scan.pl’, is now just a thin wrapper around that module, and it’s really only responsible for interpreting command line arguments. By returning results as a Perl data structure, we’re making it much easier (and quicker) to interpret results, or to pass them onto other analysis tools.
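The split between a module that returns a data structure and a thin command-line wrapper can be sketched in a few lines. The example below is in Python rather than Perl and every name in it (`search`, `format_table`, `format_json`, `main`, the `--format` option) is hypothetical; `search` is a stand-in that returns placeholder hits instead of running HMMER.

```python
import argparse
import json


def search(fasta_text):
    """Stand-in for the search module: one placeholder hit per sequence.

    A real implementation would run the HMM search; here we only read the
    FASTA header lines so that the example is self-contained.
    """
    hits = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                hits.append({"seq_id": fields[0], "family": "ExampleFam",
                             "start": 1, "end": 10})
    return hits


def format_table(hits):
    """Render hits as tab-separated rows, one hit per line."""
    return "\n".join(
        f"{h['seq_id']}\t{h['family']}\t{h['start']}\t{h['end']}" for h in hits
    )


def format_json(hits):
    """Render the same data structure as JSON."""
    return json.dumps(hits, indent=2)


FORMATTERS = {"table": format_table, "json": format_json}


def main(argv=None):
    """Thin wrapper: parse arguments, call the module, print the result."""
    parser = argparse.ArgumentParser(description="Illustrative wrapper only")
    parser.add_argument("fasta", help="path to a FASTA file")
    parser.add_argument("--format", choices=sorted(FORMATTERS), default="table")
    args = parser.parse_args(argv)
    with open(args.fasta) as handle:
        hits = search(handle.read())
    print(FORMATTERS[args.format](hits))
```

Because the wrapper only chooses a formatter, supporting another output format amounts to adding one function and one entry in `FORMATTERS`, while other programs can call `search` directly and skip the text output altogether.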

Feedback

The new architecture also allows us to think about adding other output formats. For our internal purposes, a raw Perl data structure is most useful, but if you’re a ‘pfam_scan.pl’ user and feel strongly that we should be considering some other output format (XML, CSV, maybe even JSON), now is the time to let us know!

Finally, if there are any brave souls who would be willing and able to help us test the new ‘pfam_scan.pl’, we’d really like to hear from you. Testing the script will require you to install quite a few things, such as the HMMER3 executables, the new HMMER3-based HMM library and the various Perl modules that the script requires. If that prospect doesn’t put you off, please do get in touch, either by leaving a comment here or by mail.

Posted by Jaina & John

26 Responses to “pfam_scan.pl”


  1. It is easy to install Perl modules as OS packages (when they are available).

  2. Andrew Moore Says:

    Hi there;

    very interesting.

    If you still require assistance with testing pfam_scan or the annotation module or whatnot we would be glad to help.

    We are domain people, Perl literates, have an appropriate infrastructure and have already messed a bit with HMMER3 (and are actually quite excited about it).

    Get in touch if we can help.
    Thanks many and keep the good stuff coming.

    andrew


  3. Great news!

    I would love to test the new pfam_scan.pl and associated modules. Please send it.

    Even more, I am really looking forward to having a look at the new HMMER3 models. I am preparing the next release of the Plant Transcription Factor Database (plntfdb) and it would be interesting to compare the results using HMMER2 and HMMER3 models.

    Keep up with the good work.

    Best,

    Diego

  4. jainamistry Says:

    It’s lovely to hear from users that are so willing to help us with the testing. We’ve had a few people contact us directly by email too, and we’re impressed with the general level of enthusiasm you all have for Pfam.

    If you’ve offered to help us with testing and haven’t heard back from us yet, you will do shortly. A big thank you to you all.

    Jaina

  5. Luke Ulrich Says:

    100x speedup is quite impressive! We study microbial signal transduction for which domain analysis is crucial. We are looking forward to seeing the new pfam process and would love to help out with testing the new pfam_scan.pl. We have a comprehensive data processing infrastructure and have already done some testing with HMMER3.

    Regarding the data format, my vote is for JSON. It’s lightweight, machine and language independent, as well as human and machine readable. This would be a revolutionary step in making such data more accessible.

    Thanks to you all for your hard work.

    Luke

    • johntate Says:

      Hi Luke,

      We’ll add your vote for JSON. There’s a good CPAN module for converting a perl data structure to JSON, so it should be easy to incorporate.

      Thanks for the feedback,

      John.

  6. yunsheng wang Says:

    That’s great!

  7. Luca Says:

    The speedup and the new output data format are impressive and much-desired improvements. I am willing to test this new software if possible.
    I am currently working on a Linux (Ubuntu) OS, so if you had Linux executables that would be great.

    Thanks.

    Luca


  8. I also would really like to test/use your new pfam_scan.pl program. I was just going to start a large job (hundreds of cpu days) using the old script, but I think it makes much more sense to try out HMMER 3 and the new Pfam HMMs. Also, thanks for keeping people in the loop by writing about this in your blog!

  9. Khader Shameer Says:

    Hi,
    I am extensively using Pfam data and HMMER in my different analysis and web server projects. I am also excited to try the new, faster HMMER3 with pfam_scan. Please let me know if I can help to test the new version of pfam_scan.pl. I already have installed the required tools for the running.

    Keep doing the good work and also expecting initiatives like this in the future.

    Thanks,
    Khader Shameer

  10. cjfields Says:

    Add another vote for JSON for me!

  11. Wan Kyu Kim Says:

    Hi,

    I was about to begin a large-scale HMMER search for the full Pfam library.

    This is very nice news and I am looking forward to using it.

    In fact, I tried installing HMMER3 and would also be very interested in testing HMMER3 with the new pfam_scan.pl.

    Please point me to some example Pfam models and pfam_scan.pl to try.

  12. yunling Says:

    MALGSENHSVFNDDEESSSAFNGPSVIRRNARERNRVKQVNNGFSQLRQHIPAAVIADLSNGRRGIGPGA
    NKKLSKVSTLKMAVEYIRRLQKVLHENDQQKQKQLHLQQQHLHFQQQQQHQHLYAWHQELQLQSPTGSTS
    SCNSISSYCKPATSTIPGATPPNNFHTKLEASFEDYRNNSCSSGTEDEDILDYISLWQDDL

  13. timsong Says:

    Hey,
    I am new to the whole pfam_scan scene and I was just wondering if anyone could help me out getting the script to work? I just recently began to use Linux (Ubuntu) and it is my first time working with Perl.

    • johntate Says:

      The best way to get started is to download the “pfam_scan.pl” script and check the documentation there. You can see it by running “perldoc pfam_scan.pl”.

      If you have problems setting up or running the script, you can email our helpdesk and we’ll try to get you going. You can find the address at the bottom of every page in the Pfam site.

  14. Peter Says:

    Hi,

    I’m also curious about trying out HMMER3 with the new PFAM models. Are you still looking for more testers?

    Peter

    • jainamistry Says:

      Hi Peter,

      Thanks for the offer. We had a fantastic response to our plea for testers, so we’re not looking for any more at the moment.

      We’ve actually just added some new features to the script, and we’ll be releasing it to our volunteers for the second phase of testing shortly. We will write another blog post about the improved version, so you will be able to keep up to date with our progress.

      Best wishes,

      Jaina

  15. Bryan Lunt Says:

    Greetings,
    I am very happy with this script, and finding it quite useful.

    Unfortunately, I also found one point to be somewhat odd.

    In the JSON output, why is the “align” item an array/list and not a dictionary?

    I would have expected it to be a dictionary with the keys HMM, MATCH, PP, and SEQ, rather than a list of strings.

    Is there something I am missing?

    • johntate Says:

      Hi Bryan,

      No, you’re not missing anything. We’ve had a look at the JSON output that “pfam_scan.pl” generates and you’re right, there is no good reason why the “align” item is just a list. It’s simply that that’s how the associated code was written. The ability to generate JSON is quite a new feature and it’s not one that we use ourselves, so we haven’t actually worked with results in that format.

      It requires a fairly trivial change to make those data items into a dictionary rather than a list, and we’ll look at doing that in the new year. The only problem is that we’re not sure how many people are using that part of the JSON output, and how many people will object if we change it.

      If anyone else is using the JSON option of “pfam_scan.pl” and feels strongly that we shouldn’t change it, let us know.

      Cheers,

      John.

  16. Clare Bonk Says:

    Am I able to use pfam_scan.pl with a custom hmm database? I have an hmm file and used hmmpress to get the binaries, but I do not know how to get the .hmm.dat file which I need for pfam_scan.pl
    Thanks.

  17. jainamistry Says:

    Hi Clare,

    You’ll need to create the Pfam-A.hmm.dat file yourself. I’ll send you an email with instructions on what this file should contain.

    Thanks,

    Jaina

  18. Anand Rao Says:

    Could someone please help me with my two questions:

    1. How can I add and remove HMMs from a pre-existing file such as Pfam-A? Which is a question, I think, on the same lines as Clare Bonk’s query above…

    2. Where can I get a list of CLAN membership info for all of Pfam-A without having to manually go through the Pfam website? (which would be too time-consuming and error-prone)

    Thanks much
