In a blog post published just over a year ago, I proposed a number of changes to the content of Pfam to improve scalability and usability of the database. These changes came into effect a few days ago, when we released Pfam 27.0. This release of Pfam contains a total of 14831 families, with 1182 new families and 22 families killed since release 26.0. 80% of all proteins in UniProt contain a match to at least one Pfam domain, and 58% of all residues in the sequence database fall within a Pfam domain.
So what has changed? To the user, hopefully, not a great deal! Nevertheless, there has been a considerable reorganization of the database production pipeline. The most notable loss of information is that we no longer provide neighbour-joining trees for the Pfam full alignments. If you need this type of data, it is now up to you to calculate it; however, if these data are important to you, they should be recalculated using a more precise method anyway. On the flip side, many new features have been integrated into Pfam 27.0. Below is a brief list of the new developments that are now available:
- Real time searches of DNA sequences for matches to Pfam models
- Use of Representative Proteomes (1) sequence sets to provide non-redundant views of the Pfam-A full alignments
- Addition of disorder predictions to the repertoire of sequence feature annotations
- AntiFam (2) has been applied to the underlying sequence database to remove sequences believed to be spurious translations
- Selectable sunbursts in the Pfam-A ‘species’ distribution tab, allowing the generation of alignments or visualisation of sequences from a user-defined taxonomic range
- New, faster keyword search using Apache Lucy
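As a rough illustration of the first item: searching DNA against protein models is commonly done by translating the nucleotide sequence in all six reading frames and then searching each translation. The sketch below shows only that translation step, using the standard genetic code. The function names are hypothetical, and whether the Pfam pipeline does exactly this is an assumption rather than something stated here.

```python
# Minimal six-frame translation sketch (standard genetic code).
# Names such as six_frame_translate are illustrative, not part of Pfam.

BASES = "TCAG"
AMINO = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
# Map each codon (in TCAG order) to its one-letter amino acid; '*' is a stop.
CODON_TABLE = dict(
    zip((a + b + c for a in BASES for b in BASES for c in BASES), AMINO)
)
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons; unknown codons (e.g. with N) become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")
        for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translate(dna: str) -> dict:
    """Translate a DNA sequence in three forward and three reverse frames."""
    dna = dna.upper()
    rev = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rev[offset:])
    return frames


if __name__ == "__main__":
    for name, protein in six_frame_translate("ATGGCCTAA").items():
        print(name, protein)
```

Each of the six protein strings could then be submitted to a protein search in the usual way; stop codons are kept as `*` here so that the caller can decide how to split open reading frames.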
Blog posts describing these new developments in more detail will follow in the coming weeks. In addition to the changes listed above, there have been many improvements to existing families, whether by improving domain boundaries, expanding membership or generating Wikipedia entries. Many of the new entries in Pfam have been built with the aim of improving human coverage.
Enjoy the release!
Posted by Rob
1: Chen C, Natale DA, Finn RD, Huang H, Zhang J, Wu CH, Mazumder R. Representative proteomes: a stable, scalable and unbiased proteome set for sequence analysis and functional annotation. PLoS One. 2011;6(4):e18910.
2: Eberhardt RY, Haft DH, Punta M, Martin M, O’Donovan C, Bateman A. AntiFam: a tool to help identify spurious ORFs in protein annotation. Database 2012:bas003.