The Xfam blog has been fairly quiet since the release of Pfam 24.0, so I thought I would give you a quick update on what we have been up to in the Pfam team.
Internal and public updates
In a previous post, John talked about how we have been migrating our web servers to virtual machines. We have also been consolidating our production software pipeline after moving everything over to Subversion. Finally, we have been updating our documentation as much as possible, including documentation of our internal pipelines, website usage, the MySQL database and remote site installation of the website. There are two reasons for this push on documentation: it was in need of a serious update after the migration to HMMER3, and there will be changes to staff this year, so we want to capture as much knowledge as possible before then. We expect the public documentation, particularly the installation notes for running the website searches and the website documentation (found under the help section), to be updated in the next few weeks, once Rfam 10.0 has been released.
How we are using HMMER3
If you follow the world of Pfam closely, you may have also seen that Sean Eddy and his team at Janelia Farm are getting close to officially releasing HMMER3. I know that Sean feels a little aggrieved that we have pushed HMMER3 into production prior to the completion of beta testing and paper writing, but the 200x speed-up was too much to resist! Our change to using HMMER3 forced InterPro to migrate as well, not to mention the many users of Pfam around the world. However, I feel that the transition from HMMER3 beta to production-quality code is very much a ‘chicken and egg’ scenario. Only when major protein databases, such as Pfam, have started to use HMMER3 in earnest have we seen some of the odd use-case situations or caveats. Overall, from our point of view, the transition of Pfam to using HMMER3 has been pretty smooth. We are also keeping pace with the latest changes and are currently using HMMER3 release candidate 2 in production, both for our internal pipelines and for the website-based searches.
HMMER3 is also opening up new avenues for us to explore the protein universe. One thing we have now been able to do with HMMER3 is to build a profile HMM for each of the 142303 Pfam-B families and search this HMM library against pfamseq. This took less than three days to run on the Sanger Institute compute farm, or 1.4 CPU years (on an Intel(R) Xeon(R) CPU 3.0 GHz). Alex Bateman has also been running a large number of jackhmmer searches (jackhmmer is an iterative tool similar to PSI-BLAST that uses HMMs instead of PSSMs) with sequences from the C. difficile proteome. His aim is to annotate some of the proteins with little-known or unknown function in this clinically important bacterium. I have also been running jackhmmer with sequences that contain no Pfam annotations, to develop an understanding of how many sequences really have an unknown function (or no known similarity to other sequences). Alex and I are both very excited by the results we have seen from jackhmmer; if you have not tried it, we would recommend giving it a go. Both the Pfam-B and jackhmmer searches have indicated areas where Pfam coverage is deficient, and we have started to address some of these deficiencies. However, the volume of data that we are generating (and can generate) has highlighted, even more so, that manual curation is the rate-limiting step and the major bottleneck in Pfam production!
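To put the Pfam-B numbers in perspective, here is a rough back-of-the-envelope sketch in Python. It only uses the figures quoted above (1.4 CPU years, under three days, 142303 families). The 365.25-day year, the use of exactly three days as the wall-clock time, and the function names are my own assumptions for illustration, so treat the results as ballpark estimates rather than measured values.

```python
# Back-of-the-envelope estimate based on the figures quoted in the post.
# Assumptions (not from the post): a 365.25-day year, and a wall-clock
# time of exactly 3 days (the post only says "less than three days").

CPU_YEARS = 1.4          # total CPU time quoted in the post
WALL_CLOCK_DAYS = 3.0    # upper bound on elapsed time (assumed exact here)
N_FAMILIES = 142303      # number of Pfam-B families quoted in the post
DAYS_PER_YEAR = 365.25   # assumed length of a year
SECONDS_PER_DAY = 86400


def cpu_days(cpu_years, days_per_year=DAYS_PER_YEAR):
    """Convert CPU years into CPU days."""
    return cpu_years * days_per_year


def average_cores(cpu_years, wall_clock_days):
    """Average number of cores kept busy over the whole run."""
    return cpu_days(cpu_years) / wall_clock_days


def cpu_seconds_per_family(cpu_years, n_families):
    """Average CPU seconds spent per family across the whole run."""
    return cpu_days(cpu_years) * SECONDS_PER_DAY / n_families


cores = average_cores(CPU_YEARS, WALL_CLOCK_DAYS)
per_family = cpu_seconds_per_family(CPU_YEARS, N_FAMILIES)

print(f"Average cores in use: about {cores:.0f}")
print(f"CPU time per family:  about {per_family / 60:.1f} minutes")
```

Under these assumptions the run kept roughly 170 cores busy on average (more if it finished well inside three days), and each Pfam-B family cost around five CPU minutes in total.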
The next release of Pfam
Since the last release (version 24.0) we have added nearly 300 families to the database. Many of these new families are based on the results from the Pfam-B and jackhmmer searches. We have also been working hard with InterPro to fix our annotation for families where it has become outdated. In particular, there are many cases where a family was confined to a specific species with HMMER2, but has now expanded to include matches to other related species.
We are now starting to plan for the next release of Pfam. This typically takes four to six weeks by the time we update the sequence database, perform quality control checks, post-process the families and update our mirror sites, so do not expect it before May! However, we are hoping to base Pfam 25.0 on the first official release of HMMER3.
Posted by: Rob