Since starting to work on Pfam all those years ago I have been obsessing about when the job might be finished. I have plans for retiring and learning to play golf or something similar. Cyrus Chothia published a paper saying that the majority of proteins in Nature came from no more than 1,000 families (Chothia 1992). However, we now have over 14,000 families in Pfam. To work out how close we are to finishing Pfam, we measure our coverage of known proteins, that is, the fraction of proteins in the database that have a hit to a Pfam family. Currently we are close to 80% coverage of known sequences in UniProt.
Lately I have been wondering about that missing 20%. I see three possibilities:
1. Sequences that should be part of new families
2. Sequences that are as-yet undetected members of existing families
3. Sequences that are incorrect gene predictions and are never expressed
To investigate this, one can take all the proteins in a genome that do not already hit Pfam and use each of them as the seed for an iterative Jackhmmer search. Some searches find thousands of hits, many of which belong to existing families (case 2). Many others find tens to hundreds of matches, none of which belong to any known family (case 1). Some searches find only the query sequence itself, and these are perhaps candidates for spurious proteins (case 3). The graph below shows the results of such an analysis for one bacterial genome.
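The triage described above can be sketched in a few lines of Python. This is only an illustration, not the pipeline actually used for Pfam: the function names, the category labels, and both thresholds (`min_family_size`, `pfam_fraction`) are assumptions made up for the example. The only thing taken from HMMER itself is that a `--tblout` table has `#` comment lines and puts the target name in the first column.

```python
from collections import Counter


def count_hits(tblout_text: str) -> int:
    """Count distinct target sequences in an HMMER --tblout style table."""
    targets = set()
    for line in tblout_text.splitlines():
        # Skip blank lines and the '#' header/footer comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        targets.add(line.split()[0])  # first column is the target name
    return len(targets)


def classify(n_hits: int, n_hits_in_pfam: int,
             min_family_size: int = 10,      # hypothetical threshold
             pfam_fraction: float = 0.5) -> str:  # hypothetical threshold
    """Assign a search result to one of the three cases in the post.

    n_hits counts every sequence found, including the query itself.
    n_hits_in_pfam is how many of those already match a Pfam family.
    """
    if n_hits <= 1:
        return "singleton"          # case 3: only the query was found
    if n_hits_in_pfam / n_hits >= pfam_fraction:
        return "existing_family"    # case 2: hits mostly in known families
    if n_hits >= min_family_size:
        return "new_family"         # case 1: sizeable cluster, no known family
    return "unresolved"             # too few hits to call either way


# Invented (n_hits, n_hits_in_pfam) pairs for four unmatched proteins.
results = {
    "protA": (2500, 1800),
    "protB": (120, 0),
    "protC": (1, 0),
    "protD": (4, 0),
}
summary = Counter(classify(*counts) for counts in results.values())
print(summary)
```

The "unresolved" bucket is a deliberate addition: a search that finds a handful of hits is neither an obvious singleton nor an obvious family, and where to draw those lines is a judgment call.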

Graph showing the number of Jackhmmer hits (y-axis) for each unmatched protein in a genome. Proteins are ordered along the x-axis, with those whose searches hit the most proteins at the right.
On the right-hand side of the graph we can see the 300 or so proteins that should be part of existing families. But the large majority of proteins do not appear to be part of existing families and only consist of a red bar. That suggests that we are still far from getting a complete list of all known protein families (if such a thing is possible). So the search continues … and the golf course must wait.
Posted by Alex
References
Chothia C. Proteins. One thousand families for the molecular biologist. Nature. 1992;357:543-544.
November 22, 2011 at 3:59 pm
Viruses are probably the ultimate reservoir of novel protein families… factories of innovation?
November 22, 2011 at 4:05 pm
Yes I think that is correct. Many of the singletons or ORFans are found clustered together in the bacterial genomes in prophage elements. If people really start to sequence and deposit phage sequences into the databases then I think we’re doomed to never finish.
November 22, 2011 at 6:17 pm
What about proteins that are mostly disordered and exert their function by having a few linear motif binding sites that might, for example, bring other proteins together (scaffolds)? I guess for a bacterial genome there must be very few disordered proteins like this, but for eukaryotic proteomes there might be a significant number of these proteins.
November 23, 2011 at 11:10 am
Thanks for the question Pedro. To be honest, I’ve had my head stuck in the sand over this for many years (well it never hurt your average ostrich). I recently taught on the EMBO regulatory proteins course and the disordered protein brainwashing programme began. I bought Peter Tompa’s excellent book on the subject and am now fully converted 🙂
Pfam actually does already have some families which are disordered proteins. They are often quite well conserved and our profile-HMMs are fine for identifying homologues. We really struggle with finding short linear motifs though. We plan to look at what fraction of the proteins we don’t cover are actually disordered in the coming week or two, particularly for vertebrate genomes. That will hopefully give us some insight into whether disordered regions are holding us back from more complete coverage.
November 26, 2011 at 7:27 pm
Considering that there are still entire unsequenced or barely sequenced phyla, I think it’s premature to ask whether we’ve found all the protein families yet.
Mind you, this does not necessarily contradict statements along the lines of “the majority of proteins in Nature came from no more than X families”. It’s just that there is a very long tail of small protein families.
Another interesting question: what fraction of the total protein length is covered by Pfam domains? It’s rather presumptuous to call a protein as belonging to a family just because it contains a Pfam domain that accounts for 10% of the protein…
November 28, 2011 at 11:21 am
I agree with all these sentiments. But I just can’t wait ’til we’ve sequenced everything 😉
We call the fraction of sequences with at least one Pfam match the sequence coverage, and it is 79.466% today. The fraction of residues covered is what we call the residue coverage, and today that stands at 57.231%. So there are still plenty of unrecognised domains within sequences that have a Pfam match already. We plan to include disorder predictions in Pfam and this may help us decide if the uncovered regions are likely to contain any extra domains.
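To make the two definitions concrete, here is a minimal sketch of how both coverage figures could be computed from a set of domain matches. The data structures, function names, and toy numbers are assumptions for illustration only and are not how the Pfam database calculates its statistics. Coordinates are taken to be 1-based and inclusive, and overlapping matches are merged so that no residue is counted twice.

```python
def covered_residues(intervals: list[tuple[int, int]]) -> int:
    """Count residues covered by 1-based inclusive intervals, merging overlaps."""
    total = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is not None and start <= current_end:
            # Overlaps the interval being built: extend it if needed.
            current_end = max(current_end, end)
        else:
            # Close the previous interval (if any) and start a new one.
            if current_end is not None:
                total += current_end - current_start + 1
            current_start, current_end = start, end
    if current_end is not None:
        total += current_end - current_start + 1
    return total


def coverage(lengths: dict[str, int],
             matches: dict[str, list[tuple[int, int]]]) -> tuple[float, float]:
    """Return (sequence coverage, residue coverage) as fractions in [0, 1]."""
    seqs_with_match = 0
    residues_matched = 0
    for seq_id in lengths:
        intervals = matches.get(seq_id, [])
        if intervals:
            seqs_with_match += 1
        residues_matched += covered_residues(intervals)
    return seqs_with_match / len(lengths), residues_matched / sum(lengths.values())


# Toy data: three proteins, two of which have at least one domain match.
lengths = {"P1": 100, "P2": 200, "P3": 100}
matches = {"P1": [(1, 50), (40, 60)], "P2": [(101, 120)]}

seq_cov, res_cov = coverage(lengths, matches)
print(f"sequence coverage: {seq_cov:.1%}, residue coverage: {res_cov:.1%}")
```

In the toy data two of three proteins have a match, yet only 80 of 400 residues are covered, which is the same kind of gap as the one between the two real figures quoted above.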
December 15, 2011 at 2:06 pm
Fission yeast now has 88.57% Pfam coverage at the sequence level (579/5069 proteins have no Pfam domain in version 26.0). Of the 579 “Pfam negative” proteins, 56 (10%) are essential, so essentiality is low (compared to 26.4% for all protein-coding genes). However, a quick survey of these 579 “Pfam negative” proteins reveals that ~221 (including 41 essential) are clearly fairly broadly conserved (86 to Metazoa) and could be merged into existing families or provide seeds for new families…
Maybe a proportion of the remainder are non-protein coding? We have good evidence that most are transcribed, and the majority have been GFP tagged and localised, but a small number could be non-coding RNAs which have spurious open reading frames over 100 amino acids (our cut-off for annotation with no other evidence).
There are 1071 presumed protein-coding genes in the fission yeast genome between 50 and 200 amino acids (21%), but among the 579 “Pfam negative” proteins there are 244 between 50 and 200 amino acids (42%).
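For anyone who wants to check the arithmetic, the percentages above can be reproduced from the raw counts. The counts are the ones quoted in this comment; the variable names are invented for the example. The fold enrichment and the Pfam-positive fraction are derived here and are not figures from the original survey.

```python
# Counts quoted in the comment (fission yeast, Pfam 26.0).
total_genes = 5069         # all protein-coding genes
pfam_negative = 579        # proteins with no Pfam domain
short_all = 1071           # genes of 50-200 amino acids, genome-wide
short_negative = 244       # genes of 50-200 amino acids, Pfam negative

# Fraction of short proteins genome-wide and among Pfam negatives.
frac_short_all = short_all / total_genes
frac_short_negative = short_negative / pfam_negative

# Derived values (not stated in the comment): the short fraction among
# proteins that do have a Pfam domain, and the fold enrichment of short
# proteins in the Pfam-negative set relative to the whole genome.
frac_short_positive = (short_all - short_negative) / (total_genes - pfam_negative)
fold_enrichment = frac_short_negative / frac_short_all

print(f"short, all genes:     {frac_short_all:.1%}")
print(f"short, Pfam negative: {frac_short_negative:.1%}")
print(f"short, Pfam positive: {frac_short_positive:.1%}")
print(f"fold enrichment:      {fold_enrichment:.2f}")
```

Short proteins come out roughly twice as common among the Pfam negatives as in the genome as a whole, which is the observation behind the questions that follow.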
So are newly evolved proteins shorter? Or are homologies more difficult to detect in shorter proteins?
Perhaps absence of homologs and incorrect gene predictions make a minor contribution, but my gut feeling is that we are still compromised in our ability to identify homology for very divergent proteins. This is compounded for smaller proteins, where the probability of generating a high-scoring alignment to seed your family is proportionally lower…
v.