Pfam 29.0, our second release of 2015, contains 16295 entries and 559 clans. We have made some major changes to our underlying sequence database and the data that are displayed on the website, which we’ve outlined below. Full details can be found in our Nucleic Acids Research paper, which is available here.
The growing size of UniProtKB and the computational and biocuration effort of ensuring each Pfam release conforms to internal quality control measures have meant that Pfam releases have not been as frequent as we would like. Infrequent releases mean that new entries (many of which are submitted by our users), along with updates we make to existing entries, remain internal for many months, and sometimes for over a year, until the next release is made.
In March of this year, UniProt made their first release of reference proteomes. The reference proteome set contains a representative cross-section of the taxonomic diversity of complete proteomes found in UniProtKB. It includes the proteomes of model organisms and proteomes of particular biomedical and biotechnological interest. The release of reference proteomes has influenced an internal redesign of Pfam and, as of Pfam 29.0, we have moved over to using reference proteomes as our underlying sequence database. Prior to Pfam 29.0, Pfam was based on the whole of UniProtKB. We think the move to reference proteomes will give us many benefits:
- Due to the way reference proteomes are constructed and maintained, they have a high level of experimental validation and should provide a more stable set of sequences on which to base Pfam
- The rate of growth of reference proteomes should be significantly less than that of the whole of UniProtKB, so it should allow us to scale more easily
- The reduction in size of our sequence database will mean a more manageable number of members in our Pfam entries
- Reference proteomes contain the most important organisms in which we need to increase Pfam coverage
Where we saw little or no loss in sensitivity, we have migrated our seed alignments over to contain only reference proteome sequences. We still have some Pfam entries (~20%) which contain seed sequences from the rest of UniProtKB. We also have 260 (1.6%) Pfam entries that do not match any sequences in the reference proteome set, but they are still important. These Pfam entries do have matches to sequences in UniProtKB, and we are working with the UniProt team to see if sequences that match these models can be added to reference proteomes.
The information displayed on the Pfam website is now largely based on the reference proteome sequence set. We have, however, searched all our models against UniProtKB, and you can still search the website with any UniProtKB accession and get all of its Pfam matches. Similarly, if you want the full UniProtKB dataset in a Pfam style, you can search with the Pfam entry on the HMMER website. We also make the UniProtKB full alignments available to download as a flatfile, and they are also stored in our MySQL database.
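If you download the flatfile and want a quick count of how many sequences each entry contains, something like the sketch below may help. It assumes the flatfile is in Stockholm format (one alignment per entry, a `#=GF AC` line carrying the accession, and `//` closing each alignment). The function name `count_members`, the toy accessions and the sample sequences are invented for illustration and are not part of Pfam.

```python
from typing import Iterable, Iterator, Tuple


def count_members(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Yield (accession, number of distinct sequences) per alignment.

    Assumes Stockholm-style input: '#=GF AC <accession>' annotation lines,
    sequence lines of the form '<name> <aligned residues>', and '//' as the
    end-of-alignment marker.
    """
    accession = None
    names = set()
    for raw in lines:
        line = raw.strip()
        if line.startswith("#=GF AC"):
            # Third whitespace-separated field is the accession.
            accession = line.split()[2]
        elif line == "//":
            if accession is not None:
                yield accession, len(names)
            accession, names = None, set()
        elif line and not line.startswith("#"):
            # A set keeps the count right even if an alignment is
            # split across several interleaved blocks.
            names.add(line.split()[0])


# Toy input with made-up accessions and sequences, for illustration only.
SAMPLE = """# STOCKHOLM 1.0
#=GF ID   Toy_domain
#=GF AC   PF99999.1
seqA/1-5   ACDEF
seqB/3-7   AC-EF
//
# STOCKHOLM 1.0
#=GF AC   PF99998.2
seqC/1-4   GHIK
//
"""

if __name__ == "__main__":
    for acc, n in count_members(SAMPLE.splitlines()):
        print(acc, n)
```

For a real download you would pass an open file handle instead of `SAMPLE.splitlines()`. Because the function reads line by line, it does not need to hold a large alignment file in memory.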
Posted by Rob and Jaina