Pfam 31.0 contains a total of 16712 families and 604 clans. Since the last release, we have built 415 new families, killed 9 families and created 11 new clans. We have also been working on expanding our clan classification; in Pfam 31.0, over 36% of Pfam entries are placed within a clan.
The new “stuff”
This release sees the first batch of families produced by Sara, our new curator. She, alongside some of the InterPro curators, has also been updating a fair amount of annotation associated with our families. Having InterPro curators help Pfam is an exciting development, meaning that the annotation gets fixed at the source, ensuring a greater consistency between the two resources.
In addition to family building, we have put over 500 families into a clan since the last release. Much of this has been guided by the excellent ECOD resource [1] produced by Nick Grishin and his lab. The ECOD structural classification aligns really well with Pfam’s concept of domains, and has highlighted some important cases where a single Pfam entry required splitting into two functional domains.
A parallel activity undertaken by Alex Bateman has produced a new and improved version of SCOOP [2], called SCOOP2. This refactoring was carried out during those many hours that are often lost while travelling. SCOOP2 implements a more principled scoring system, and has significantly improved the detection of additional relationships in Pfam 31.0. SCOOP2 will be available for download shortly.
The coverage statistics of the release
We love these coverage metrics, but take them at face value, as they are there simply to show that we are constantly improving over time. The UniProt reference proteomes set that we based Pfam 31.0 on contains 26.7 million sequences, which is an increase in size of 51% compared to when we made Pfam 30.0. Of the proteins in the UniProt reference proteomes, 73% have a match to at least one Pfam entry, and 48% of all residues fall within a Pfam family. Both figures are slightly higher than in the last release; given how much the underlying sequence set has grown, this represents a healthy increase in coverage.
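To put those percentages into absolute terms, here is a small back-of-the-envelope calculation. It uses only the figures quoted above (26.7 million sequences, 51% growth, 73% sequence coverage); the variable names are ours, and the results are rounded estimates rather than official release statistics.

```python
# Figures quoted in the release notes above.
current_sequences = 26.7e6   # reference proteome sequences behind Pfam 31.0
growth = 0.51                # increase in size relative to the Pfam 30.0 set
sequence_coverage = 0.73     # fraction of sequences with at least one Pfam match

# Implied size of the sequence set used for Pfam 30.0.
previous_sequences = current_sequences / (1 + growth)

# Approximate number of sequences in the current set with a Pfam match.
matched_sequences = sequence_coverage * current_sequences

print(f"Implied Pfam 30.0 set: {previous_sequences / 1e6:.1f} million sequences")
print(f"Sequences with a match: {matched_sequences / 1e6:.1f} million")
```

So the earlier set works out at roughly 17.7 million sequences, and about 19.5 million sequences in the current set have at least one Pfam match.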
Active site residue prediction in pfam_scan.pl
pfam_scan.pl is a script that allows users to perform local Pfam searches. It contains an option to predict active site residues on the sequences searched, and it does this by transferring UniProt active site annotation to other sequences in the same family. Previously this was done by comparing the residues at the active site column positions in the alignment; if all residues in the active site were matched by another sequence, the script would predict these residues to be active site residues. We now use a more efficient approach, mapping the active site position to the hidden Markov model (HMM) position, and then looking to see if the other sequences in the family have the same residues at those positions in the HMM. Implementing this change has given a significant speed improvement in active site prediction. The format of the active site file that accompanies pfam_scan.pl (active_site.dat) has changed, so if you wish to use the active site option with the HMMs from Pfam 31.0, you will need to download the latest version of the script from here.
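For readers curious about what the HMM-position approach looks like in practice, here is a minimal Python sketch of the idea. It is not the code in pfam_scan.pl (which is written in Perl), and it does not read active_site.dat; the function names, the data structures and the rule that every annotated residue must match are our assumptions for illustration. It also assumes a HMMER3-style pairwise alignment, where "." in the model line marks an insertion and "-" in the sequence line marks a deletion.

```python
from typing import Dict, List, Optional


def map_hmm_to_seq(model_line: str, seq_line: str,
                   hmm_start: int, seq_start: int) -> Dict[int, int]:
    """Map HMM match positions to 1-based sequence positions.

    Hypothetical helper: assumes '.' in model_line is an insertion
    (no HMM position consumed) and '-' in seq_line is a deletion
    (no sequence residue consumed).
    """
    mapping: Dict[int, int] = {}
    hmm_pos, seq_pos = hmm_start, seq_start
    for model_char, seq_char in zip(model_line, seq_line):
        in_model = model_char != "."
        in_seq = seq_char != "-"
        if in_model and in_seq:
            mapping[hmm_pos] = seq_pos
        if in_model:
            hmm_pos += 1
        if in_seq:
            seq_pos += 1
    return mapping


def predict_active_sites(reference_sites: Dict[int, str],
                         hmm_to_seq: Dict[int, int],
                         sequence: str) -> Optional[List[int]]:
    """Return predicted active site positions (1-based), or None.

    reference_sites maps an HMM position to the residue seen there in an
    annotated sequence. Assumption: a prediction is made only if every
    annotated HMM position is aligned and carries the same residue.
    """
    predicted: List[int] = []
    for hmm_pos in sorted(reference_sites):
        seq_pos = hmm_to_seq.get(hmm_pos)
        if seq_pos is None:
            return None  # position deleted or outside the aligned region
        if sequence[seq_pos - 1].upper() != reference_sites[hmm_pos].upper():
            return None  # residue differs from the annotated one
        predicted.append(seq_pos)
    return predicted


# Toy example: a query whose match to the family starts at residue 10.
query = "M" * 9 + "ABxD"
hmm_to_seq = map_hmm_to_seq("ab.cd", "ABx-D", hmm_start=1, seq_start=10)
sites = predict_active_sites({2: "B", 4: "D"}, hmm_to_seq, query)
```

The appeal of working in HMM coordinates is that the annotated positions only need to be mapped onto the model once; each query is then checked with a handful of lookups against its own alignment to the HMM, with no need to compare alignment columns against every annotated sequence.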
Enjoy release 31.0!
Jaina and Rob
1. Schaeffer RD, Liao Y, Cheng H, Grishin NV. ECOD: new developments in the evolutionary classification of domains. Nucleic Acids Research.
2. Bateman A, Finn RD. SCOOP: simple method for identification of novel protein superfamily relationships. Bioinformatics. 2007;