Two related questions that we are often asked via the Pfam helpdesk are ‘Which families have a known three-dimensional structure?’ and ‘Why is a particular PDB structure not found in Pfam?’. You may think that there are obvious answers to these questions, but as with many things in life the answer is not necessarily as straightforward as you would have thought. In this joint posting between Andreas Prlic (senior scientist at RCSB Protein Data Bank) and myself (Rob Finn, Pfam Production Lead), we will elaborate on the way the PDB and Pfam cross-referencing occurs, explain why discrepancies occurred in the past, and describe the pipeline that the RCSB PDB has implemented using the HMMER web services API, which should provide the most current answer to these questions.
Posts Tagged ‘pfam’
As some of you will already be aware, the Xfam family has recently gained a new member: the TreeFam database.
TreeFam aims to provide phylogenetic trees and orthology predictions for all animal genes.
AntiFam is the newest addition to the Xfam brand. It is a database of hidden Markov models (HMMs) designed to identify spurious open reading frames (ORFs). It is available now on our ftp site:
The current Pfam release, version 26.0, took approximately 4 months to nurse through the various stages of updating the sequence database, resolving overlaps between families, rebuilding the MySQL database and performing all of the post-processing that constitutes the ‘release’. The production team strives to make two releases a year, but I really do not fancy spending two thirds of a year on Pfam releases. Thus, with my colleagues, I have been reviewing what we do and why we do it and, probably more importantly, assessing how much the different sections of the Web site are used. Below is a list of changes that are going to happen in the next release, release 27.0.
Since releasing the new Pfam website four years ago, we’ve had a steady trickle of mails from users who would like to install and run the site within their own local environment. It used to be possible to do just that, given a following wind, if you were ready to install the site from its source code. Unfortunately, after some internal changes and as the list of Perl module dependencies grew and grew, the process got harder and more complex and eventually we stopped supporting it entirely. We’ve been actively discouraging people from trying this for far too long, all the while promising to make the process easier. Finally we’ve managed to get around to building a virtual machine (VM) that should make the whole thing possible again.
Some users have been contacting us about the new families that have appeared in Pfam release 26.0.
As pointed out by one of our users:
Pfam v26 includes, in addition to DDE_Tnp_1, the following new families:
These extra new families, named name_2, name_3, name_4 etc., have been constructed to increase the coverage of Pfam. Many of our existing large, diverse families are not well modelled by a single HMM, and there are many true members that are not matched. So by building multiple models we can match more sequences. Each of these models will be in the same Pfam clan, the RNaseH clan in this case. For the most part these models do not represent any particular subfamily or classification group. Essentially you should think of a match to any of the above seven DDE_Tnp_1 families as being the same thing. Because of the way Pfam is built, any particular region of a protein may only belong to one of these families. We have a step in building clans called competition, which means that if a region of a protein matches both DDE_Tnp_1 and DDE_Tnp_1_2, for example, then the region will be assigned to the family with the highest score. This means that a match to DDE_Tnp_1 in release 25.0 may now end up in a different family such as DDE_Tnp_1_2. You shouldn’t read too much into these changes.
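To make the competition step more concrete, here is a minimal Python sketch of the idea. It is an illustration only and not the code Pfam actually runs: the `Match` class, the `compete` function, the greedy "best score first" strategy and the example coordinates and scores are all assumptions made for the example. It assumes every match is on the same protein and that all the families belong to the same clan.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    """A hypothetical match of one family to one region of a protein."""
    family: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    score: float  # bit score


def overlaps(a: Match, b: Match) -> bool:
    """True if the two regions share at least one residue."""
    return a.start <= b.end and b.start <= a.end


def compete(matches: list[Match]) -> list[Match]:
    """Keep the highest-scoring match for each set of overlapping regions.

    Greedy simplification: visit matches from best to worst score and
    drop any match that overlaps one that has already been kept.
    """
    kept: list[Match] = []
    for m in sorted(matches, key=lambda m: m.score, reverse=True):
        if not any(overlaps(m, k) for k in kept):
            kept.append(m)
    return sorted(kept, key=lambda m: m.start)


# Made-up matches on a single protein, all from the same clan.
example = [
    Match("DDE_Tnp_1", 10, 200, 85.0),
    Match("DDE_Tnp_1_2", 15, 210, 120.5),
    Match("DDE_Tnp_1", 300, 400, 40.0),
]
winners = compete(example)
for m in winners:
    print(m.family, m.start, m.end, m.score)
```

In this made-up example the first region goes to DDE_Tnp_1_2 because it has the higher score, while the second, non-overlapping region stays with DDE_Tnp_1.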
The reason that many of these new families are appearing in Pfam release 26.0 is a change in strategy in how we are building many new Pfam families. The new strategy consists of taking complete genomes and using each protein that does not match Pfam as a starting point for a Jackhmmer search. Jackhmmer is an iterative search tool like PSI-BLAST. If we find that the Jackhmmer search finds lots of homologues but has some overlaps with an existing family, then we may build one of these new additional families to increase coverage of known sequences. Rather than give these families completely new names, we simply give them the same name as the existing family and append a number to show that they are closely related to each other.
Posted by Alex
Many of you will be aware of the proposed web blackout in response to the Stop Online Piracy Act, which is currently going through the U.S. House of Representatives (you can read the BBC’s explanation of the Act here). If this Act were enforced, it would have far-reaching consequences for the overall freedom of the internet. Editors of the English Wikipedia have taken the decision to close the English Wikipedia for 24 hours, starting at 0500 hrs on Wednesday 18th January. To respect this protest, we will also be making our Wikipedia content unavailable during this time.
You’ll still be able to access all the non-Wikipedia content – that is, all the covariance models and HMMs describing families, domain graphics, full and seed alignments, as well as our species trees.
Posted by Sarah and Alex
As you have surely noticed, the highly anticipated new Pfam paper is out as part of the 2012 NAR database issue! We were delighted to be listed as a featured article. The paper covers the new release 26.0 (more on this from Rob soon) and presents some novel analysis that may be of interest to Pfam addicts like you. We discuss quite extensively our use of family-specific bit score gathering thresholds (GAs), hoping to bring clarity to an issue that seems to have been a source of confusion in the past (a.k.a. stop sending us tickets asking what GAs are and how to use them! :-)). Also, we extend and update the analysis of DUF families that was presented in a previous publication, hoping to push more people into the de-DUF a DUF game. So, enjoy reading the paper and send us comments and suggestions. Your support and advice is, as always, invaluable to us!!
Posted by Marco
Since starting to work on Pfam all those years ago I have been obsessing about when the job might be finished. I have plans for retiring and learning to play golf or something similar. Cyrus Chothia published a paper saying that the majority of proteins in Nature came from no more than 1,000 families (Chothia 1992). However, we now have over 14,000 families in Pfam. To work out how close we are to finishing Pfam we measure our coverage of known proteins, that is, the fraction of proteins in the database that have a hit to a Pfam family. Currently we are close to 80% coverage of known sequences in UniProt.
Lately I have been wondering about that missing 20%. I see three possibilities:
- Sequences should be part of new families
- Sequences are missing members of existing families
- Sequences are incorrect gene predictions and never expressed
To investigate this, one can take all the proteins in a genome that do not already hit Pfam. We can then use each of these sequences as a seed for an iterative Jackhmmer search. Some proteins find thousands of hits, many of which belong to existing families (case 2), while many others have tens to hundreds of matches and do not match any known family (case 1). Some of the searches find only the sequence itself, and perhaps these are candidates for spurious proteins (case 3). The graph below shows the results of such an analysis for one bacterial genome.
On the right hand side of the graph we can see the 300 or so proteins that should be parts of existing families. But the large majority of proteins do not appear to be part of existing families and only consist of a red bar. That suggests that we are still far from getting a complete list of all known protein families (if such a thing is possible). So the search continues … and the golf course must wait.
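For anyone who wants to try this kind of triage themselves, here is a rough Python sketch of how the results of each Jackhmmer search could be sorted into the three cases. It is not the script used for the analysis above: the function name, the input format and the rule that a single hit to an existing family is enough for case 2 are all assumptions made for illustration, and a real analysis would want more careful thresholds.

```python
def classify_seed(seed_id: str, hit_ids: list[str], ids_in_pfam: set[str]) -> str:
    """Assign one Jackhmmer search to one of the three cases.

    seed_id     -- identifier of the protein used to start the search
    hit_ids     -- identifiers of all sequences found by the search
    ids_in_pfam -- identifiers of sequences that already match a Pfam family
    """
    # Ignore the seed itself: every search finds its own starting sequence.
    others = set(hit_ids) - {seed_id}
    if not others:
        return "case 3: possibly spurious protein"
    if others & ids_in_pfam:
        return "case 2: missing member of an existing family"
    return "case 1: candidate new family"


# Hypothetical identifiers, for illustration only.
already_in_pfam = {"Q00001", "Q00002"}
print(classify_seed("P00001", ["P00001"], already_in_pfam))
print(classify_seed("P00002", ["P00002", "Q00001", "X00001"], already_in_pfam))
print(classify_seed("P00003", ["P00003", "X00002", "X00003"], already_in_pfam))
```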
Posted by Alex
Chothia C. Proteins. One thousand families for the molecular biologist. Nature. 1992;357:543-544.
Well, it should have been out about 6 months ago, but finally the long awaited Pfam release 25.0 is here! Release 25.0 contains a total of 12273 families, with 384 new families and 21 families killed since the last release. Pfam 25.0 is based on UniProt release 2010_05. Those of you who follow Pfam closely will be familiar with the fact that the sequence coverage (the proportion of sequences in Pfamseq containing at least one Pfam match) has hovered at or just below 75%. Despite the addition of only a modest number of new families in this release, the sequence coverage has risen: 76.69% of all proteins in Pfamseq now contain a match to at least one Pfam domain, and 53.86% of all residues in the sequence database fall within Pfam domains.
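The difference between the two coverage figures is easy to see with a toy calculation. The Python sketch below is only an illustration of the two definitions, not how the release statistics are actually produced; the function name, the input dictionaries and the example numbers are made up.

```python
def coverage(seq_lengths: dict[str, int],
             domains: dict[str, list[tuple[int, int]]]) -> tuple[float, float]:
    """Return (sequence coverage, residue coverage) as fractions.

    seq_lengths -- length of every sequence in the database
    domains     -- for each matched sequence, the (start, end) regions
                   covered by a domain (1-based, inclusive)
    """
    matched_sequences = 0
    covered_residues = 0
    for seq_id in seq_lengths:
        regions = domains.get(seq_id, [])
        if regions:
            matched_sequences += 1
        # Use a set so residues in overlapping regions are counted once.
        positions: set[int] = set()
        for start, end in regions:
            positions.update(range(start, end + 1))
        covered_residues += len(positions)
    sequence_coverage = matched_sequences / len(seq_lengths)
    residue_coverage = covered_residues / sum(seq_lengths.values())
    return sequence_coverage, residue_coverage


# Toy database: three sequences, two of which have domain matches.
toy_lengths = {"A": 100, "B": 50, "C": 50}
toy_domains = {"A": [(1, 40), (31, 60)], "B": [(11, 20)]}
seq_cov, res_cov = coverage(toy_lengths, toy_domains)
print(f"sequence coverage: {seq_cov:.2%}, residue coverage: {res_cov:.2%}")
```

Here two of the three sequences have a match, but only 70 of the 200 residues fall within a domain, which is why residue coverage is always the lower of the two numbers.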