Since starting to work on Pfam all those years ago, I have been obsessing about when the job might be finished. I have plans for retiring and learning to play golf or something similar. Cyrus Chothia published a paper suggesting that the majority of proteins in nature come from no more than 1,000 families (Chothia 1992). However, we now have over 14,000 families in Pfam. To work out how close we are to finishing Pfam, we measure our coverage of known proteins, that is, the fraction of proteins in the database that have a hit to a Pfam family. Currently we are close to 80% coverage of known sequences in UniProt.
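As a rough illustration of the coverage measure, here is a minimal sketch in Python. The function name `pfam_coverage` and the accessions are made up for the example; it assumes you already have the full list of protein identifiers and the set of those with at least one Pfam hit, and it is not how Pfam itself computes the figure.

```python
def pfam_coverage(all_ids, ids_with_hit):
    """Return the fraction of proteins that have at least one Pfam hit.

    all_ids:      every protein identifier in the database (hypothetical input)
    ids_with_hit: identifiers with at least one Pfam match (hypothetical input)
    """
    all_ids = set(all_ids)
    if not all_ids:
        raise ValueError("no proteins supplied")
    # Only count hits for proteins that are actually in the database.
    covered = all_ids & set(ids_with_hit)
    return len(covered) / len(all_ids)


# Toy example with invented accessions: 4 of 5 proteins have a hit.
proteins = ["P1", "P2", "P3", "P4", "P5"]
with_hit = ["P1", "P2", "P3", "P4"]
coverage = pfam_coverage(proteins, with_hit)
```

The proteins left over, `1 - coverage` of the total, are the ones without any Pfam hit.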
Lately I have been wondering about that missing 20%. I see three possibilities:
- Sequences should be part of new families
- Sequences are missing members of existing families
- Sequences are incorrect gene predictions and never expressed
To investigate this, one can take all the proteins in a genome that do not already hit Pfam and use each of these sequences as the seed for an iterative Jackhmmer search. Some proteins find thousands of hits, many of which belong to existing families (case 2, missing members of existing families). Many others find tens to hundreds of matches that do not match any known family (case 1, new families). Some searches find only the query sequence itself, and these are perhaps candidates for spurious proteins (case 3). The graph below shows the results of such an analysis for one bacterial genome.
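The three-way sorting of search results described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function `classify_search`, its inputs, and the rule "any hit already in Pfam means existing family" are my own simplifications, not the actual analysis code, and it assumes the Jackhmmer hits have already been parsed into lists of sequence identifiers.

```python
def classify_search(query_id, hit_ids, pfam_annotated):
    """Assign one iterative-search result to one of the three cases.

    query_id:       identifier of the seed protein
    hit_ids:        identifiers found by the search (hypothetical parsed output)
    pfam_annotated: identifiers that already match a Pfam family
    """
    # Ignore the query matching itself.
    others = set(hit_ids) - {query_id}
    if not others:
        # Case 3: nothing but the seed was found, possibly a spurious protein.
        return "possibly spurious"
    if others & set(pfam_annotated):
        # Case 2: the search reaches members of an existing family.
        return "existing family"
    # Case 1: real homologues exist, but none is in a known family.
    return "new family candidate"


# Toy data with invented identifiers.
in_pfam = {"X1", "X2"}
results = {
    "Q1": ["Q1"],
    "Q2": ["Q2", "X1", "Y7"],
    "Q3": ["Q3", "Y8", "Y9"],
}
labels = {q: classify_search(q, hits, in_pfam) for q, hits in results.items()}
```

In practice one would probably want a threshold on how many hits fall in an existing family rather than a single match, but the simple rule keeps the sketch short.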
On the right-hand side of the graph we can see the 300 or so proteins that should be part of existing families. But the large majority of proteins do not appear to be part of existing families and consist only of a red bar. That suggests that we are still far from having a complete list of all protein families (if such a thing is possible). So the search continues … and the golf course must wait.
Posted by Alex
Chothia C. Proteins. One thousand families for the molecular biologist. Nature. 1992;357:543-544.