What Pfam did in 2008

January 27, 2009

I thought it would be useful to give a quick overview of some of the major things that have been going on behind the scenes at Pfam during 2008. Overall it may have seemed like a quiet year for our users, as we made only one public data release, release 23.0 in July. However, as with a paddling duck, the calm seen from above belies some furious paddling below.

In total, 1326 new families were added to our curation database. These were derived from a number of different sources, such as Pfam-B, our automatically generated clustering of sequence regions not already in Pfam. As well as adding new families, we also made a valiant attempt to go through our existing families and improve their scope through a process we call iteration.

Iteration takes the full sequence alignment for a family and attempts to make a new non-redundant seed alignment from it. When a new HMM is built from this seed alignment, we hope to find new, even more distant homologues for the family. During 2008 we iterated every single family that was not in a Pfam clan. For about 50% of Pfam families we were able to find more homologues. In most cases the improvements were modest, but for some families we found many hundreds of new members. Occasionally the process also allowed us to realise that two families were actually related and to merge them into a single entry. For example, we found that the uncharacterised family DUF30 (accession PF01727) was actually related to Peptidase_S7 (PF00949).
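To make the "non-redundant seed" step more concrete, here is a minimal sketch of one common way to thin an alignment: keep a sequence only if it is not too similar to anything already kept. This is an illustration, not the Pfam pipeline itself. The function names, the toy alignment and the 80% identity cut-off are all assumptions made for the example.

```python
# Illustrative sketch only: a greedy redundancy filter over an alignment.
# The 80% threshold and all names here are assumptions, not Pfam's actual code.

GAP_CHARS = "-."


def percent_identity(a, b):
    """Percent identity over columns where both aligned sequences have a residue."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for x, y in zip(a, b):
        if x in GAP_CHARS or y in GAP_CHARS:
            continue  # skip columns with a gap in either sequence
        compared += 1
        if x.upper() == y.upper():
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def non_redundant_seed(alignment, max_identity=80.0):
    """Greedily keep sequences that are at most max_identity % identical
    to every sequence already kept. Input order decides which copy survives."""
    kept = {}
    for name, seq in alignment.items():
        if all(percent_identity(seq, other) <= max_identity for other in kept.values()):
            kept[name] = seq
    return kept


# Toy "full alignment": seqB and seqD are redundant with seqA.
full_alignment = {
    "seqA": "MKV-LSTA",
    "seqB": "MKV-LSTA",
    "seqC": "MRI-LTTG",
    "seqD": "MKVALSTA",
}

seed = non_redundant_seed(full_alignment)
print(sorted(seed))
```

In a real iteration the thinned set would then be used to build a new HMM and search the sequence database again. That part is left out here because it depends on external tools.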

We changed the underlying source of protein clusters used in Pfam-B from PRODOM to PairsDB, from Andreas Heger and Liisa Holm. Unfortunately, the PRODOM database was not able to keep up a frequent release schedule and has become somewhat out of date. The greatly improved coverage of Pfam-B has increased the overall comprehensiveness of Pfam.

The rapid growth in the number of protein sequences has meant that we have had to revisit many aspects of the software used to run Pfam. Large portions of the codebase were rewritten to help us scale with the sequence deluge. In 2008 we started providing Pfam match data for NCBI GenPept and metagenomics sequences, in addition to UniProt sequences. As a result, the number of sequences we process grew from just over 3 million UniProt sequences in 2007 to 17 million. This sequence growth, combined with the increased number of families, has resulted in a dramatic increase in the amount of computer power required to produce a Pfam release. To produce a release quickly enough that it is not obsolete by the time it appears, we restructured our release pipeline from a linear procedure to one that is largely parallel. The Pfam 23.0 release took about two months to produce, but consumed approximately 60 CPU years on the Sanger compute cluster. As part of this restructuring, we have also endeavoured to increase the level of quality control. The primary focus here was to ensure consistency between the data stored on disk (used to represent a family during the curation process) and the information populated in the MySQL database (used to provide the website).
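As a rough illustration of what moving from a linear to a largely parallel procedure means, the sketch below runs the same per-family step first one family at a time and then through a worker pool. It is a toy under stated assumptions: `process_family` is a hypothetical placeholder, the worker count is arbitrary, and the real pipeline runs on a compute cluster rather than in threads. The last lines do the back-of-envelope arithmetic for the figures quoted above.

```python
# Toy sketch: the same per-family work run linearly and in parallel.
# process_family is a hypothetical stand-in, not part of the Pfam codebase.
from concurrent.futures import ThreadPoolExecutor


def process_family(accession):
    """Placeholder for the independent work done for one family."""
    return accession, f"{accession}: processed"


def run_linear(accessions):
    """Process families one after another."""
    return dict(process_family(acc) for acc in accessions)


def run_parallel(accessions, workers=4):
    """Process families concurrently; valid because each family is independent."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(process_family, accessions))


families = ["PF00949", "PF01727", "PF00001"]
linear_result = run_linear(families)
parallel_result = run_parallel(families)

# About 60 CPU years spent over about 2 months of wall-clock time.
cpu_years = 60
wall_clock_months = 2
average_cpus_busy = cpu_years * 12 / wall_clock_months
print(average_cpus_busy)
```

On those round numbers, the release kept something like 360 CPUs busy on average, which is only practical when most of the work can proceed independently.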
