Pfam, HMMER3 and the next release

March 23, 2010

The Xfam blog has been fairly quiet since the release of Pfam 24.0, so I thought I would give you a quick update on what we have been up to in the Pfam team.

Internal and public updates

In a previous post John talked about how we have been migrating our web servers to virtual machines. We have also been consolidating our production software pipeline after we moved everything over to Subversion. Finally, we have been trying to update our documentation as much as possible. This includes documentation of our internal pipelines, website usage, the MySQL database and remote site installation of the website. There are two reasons for this push on documentation: it was in need of a serious update after the migration to HMMER3, and there will be changes to staff this year, so we want to capture as much knowledge as possible before then. We expect the public documentation, particularly the installation notes for running the website searches and the website documentation (found under the help section), to be updated in the next few weeks, once Rfam 10.0 has been released.

How we are using HMMER3

If you follow the world of Pfam closely you may have also seen that Sean Eddy and his team at Janelia Farm are getting close to officially releasing HMMER3. I know that Sean feels a little aggrieved that we have pushed HMMER3 into production prior to the completion of beta testing and paper writing, but the 200x speed-up was too much to resist! Our change to using HMMER3 forced InterPro to migrate as well, not to mention the many users of Pfam around the world. However, I feel that the transition from HMMER3 beta to production-quality code is very much a ‘chicken and egg’ scenario. Only when major protein databases, such as Pfam, have started to use HMMER3 in earnest have we seen some of the odd use-case situations or caveats. Overall, from our point of view, the transition of Pfam to using HMMER3 has been pretty smooth. We are also keeping pace with the latest changes and are currently using HMMER3 release candidate 2 in production, both for our internal pipelines and for the website-based searches.

HMMER3 is also opening up new avenues for us to explore the protein universe. One thing we have now been able to do with HMMER3 is to build a profile-HMM for each of the 142303 Pfam-B families and search this HMM library against pfamseq. This took less than three days to run on the Sanger Institute compute farm, or 1.4 CPU years (on an Intel(R) Xeon(R) CPU 3.0 GHz). Alex Bateman has also been running a large number of jackhmmer searches (an iterative tool similar to PSI-BLAST that uses HMMs instead of PSSMs) with sequences from the C. difficile proteome. His aim is to try and annotate some of the proteins with little or unknown function in this clinically important bacterium. I have also been running jackhmmer with sequences that contain no Pfam annotations to try and develop an understanding of how many sequences really have an unknown function (or lack similarity to other sequences). Alex and I are both very excited by the results we have seen from jackhmmer; if you have not tried it, we would recommend giving it a go. Both the Pfam-B and jackhmmer searches have indicated areas where Pfam coverage is deficient, and we have started to address some of the deficiencies. However, the volume of data that we are generating (and can generate) has highlighted, even more so, that manual curation is the rate-limiting step in Pfam production!
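To put the Pfam-B run into perspective, here is a back-of-the-envelope sketch in Python that uses only the figures quoted above (142303 families, 1.4 CPU years, under three days of wall-clock time). It assumes that the 1.4 CPU years covers the whole run and that a year is 365 days; the function names are made up for illustration. The results are averages only, and because the run took less than three days, the concurrency figure is a lower bound.

```python
# Figures quoted in the post; everything else is derived from them.
PFAM_B_FAMILIES = 142_303
CPU_YEARS = 1.4
WALL_CLOCK_DAYS = 3.0  # upper bound: the post says "less than three days"

# Assumptions made for this sketch (not stated in the post).
DAYS_PER_YEAR = 365.0
SECONDS_PER_DAY = 24 * 60 * 60


def cpu_seconds(cpu_years: float) -> float:
    """Convert CPU years to CPU seconds."""
    return cpu_years * DAYS_PER_YEAR * SECONDS_PER_DAY


def mean_cpu_seconds_per_family(cpu_years: float, families: int) -> float:
    """Average CPU time spent per family over the whole run."""
    return cpu_seconds(cpu_years) / families


def min_mean_concurrency(cpu_years: float, wall_clock_days: float) -> float:
    """Average number of CPUs kept busy if the run took exactly wall_clock_days.

    A shorter wall-clock time implies more CPUs, so this is a lower bound.
    """
    return cpu_years * DAYS_PER_YEAR / wall_clock_days


if __name__ == "__main__":
    per_family = mean_cpu_seconds_per_family(CPU_YEARS, PFAM_B_FAMILIES)
    concurrency = min_mean_concurrency(CPU_YEARS, WALL_CLOCK_DAYS)
    print(f"Mean CPU time per family: {per_family:.0f} s")
    print(f"Mean CPUs in use (lower bound): {concurrency:.0f}")
```

Under these assumptions the run works out at roughly five CPU minutes per family, spread over at least 170 or so CPUs on average.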

The next release of Pfam

Since the last release (version 24.0) we have added nearly 300 families to the database. Many of these new families are based on the results from the Pfam-B and jackhmmer searches. We have also been working hard with InterPro to fix our annotation for families where the annotation has become outdated. In particular, there are many cases where a family was confined to a specific species with HMMER2, but has now expanded to include matches to other related species.

We are now starting to plan for the next release of Pfam. This typically takes four to six weeks by the time we update the sequence database, perform quality control checks, post-process the families and update our mirror sites, so do not expect it before May! However, we are hoping to build Pfam 25.0 on the first official release of HMMER3.

Posted by: Rob

8 Responses to “Pfam, HMMER3 and the next release”

  1. Alex Ochoa Says:

    Any updates on the missing zinc-finger problem? I remember you discussed several options (redoing models by partitioning families, making H3 do glocal preds too, etc.), and I wanted to know if they’ve been narrowed down since.

    • alexbateman Says:

      Hi Alex,

      I have improved the classical zinc finger model significantly over the migrated Pfam 24.0 version. By simply changing the domain threshold (leaving the sequence threshold alone) I was able to gain an extra 24,000 domain hits for the model. We have been improving many of the other models that H3 appeared to work poorly on and generally have been able to find workarounds to improve coverage. For some small repeat families this meant making the alignment consist of, say, two or three copies of the repeat.

  2. Todd Gibson Says:

    Hello. Are there any plans to update iPfam as well?

  3. rdfinn Says:

    Yes, we have been working on a new version of the iPfam website and associated data. However, there is no specific funding for it, so we have to fit development time in when we can. With all the changes to Pfam, iPfam has been somewhat neglected – sorry. When the next release of iPfam will be made public is hard to say, but we hope to have something by the summer.

  4. Anna Says:

    Hi,

    I was just wondering if there is any update on when the next release of Pfam might be?
    Thank you!

    • alexbateman Says:

      Hi Anna,

      We are busy working on Pfam release 25.0 right now. We hope to get it released around the end of the month.

      Alex

      • Andy Says:

        Hi,
        Is the release of Pfam 25.0 still imminent, or can we expect a delay because of the staff changes?
        cheers,
        Andy

      • jainamistry Says:

        Hi Andy,

        Pfam 25.0 has been delayed due to the staff changes. It’s going to be a bit later than we planned – when it is ready we’ll make an announcement on the blog.

        Jaina

