Recently, we have been looking at how much of the human proteome is covered by Pfam (release 27.0), and at ways in which we can improve this coverage. We have even written an open access paper about it (Mistry et al., 2013, in Database; see the references below), which is part of the proceedings of the 2013 Biocuration conference. We used the human proteins in UniProtKB/Swiss-Prot (~20,000 sequences) as our human proteome set, and found that while most of the sequences in this set have some Pfam annotation (90% have at least one Pfam domain), there is still much ground to cover before we have a complete map of all (conserved) human regions (HRs). Here, rather than repeating what we presented in the paper (did we mention it is open access? :-)), we would like to tell you more about the impact this study is having on our strategies for selecting target regions to be added to Pfam.
Within the human proteome, we identified almost 15,000 clusters of sequence regions that are not in Pfam-A or Pfam-B families, are not predicted to be signal peptide regions, and are at least 50 residues long. These regions account for close to 40% of all human residues (Figure 1). We ran phmmer searches using the members of each of the 15,000 clusters as queries against the whole of UniProtKB, and obtained the following distribution of significant hits:
This histogram has two important features:
- most HRs have few hits in UniProtKB (we will comment more specifically on these regions in a future blog post), and
- there is a long tail of regions with many hits in UniProtKB.
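To make the region selection described earlier more concrete, here is a minimal Python sketch of how unannotated stretches of at least 50 residues could be pulled out of a single sequence. It is an illustration only, not the code used for the analysis: the function name `uncovered_regions`, the interval representation, and the example coordinates are all hypothetical, and the sketch does not cover how regions were grouped into clusters.

```python
from typing import List, Tuple

# Hypothetical representation: 1-based, inclusive (start, end) coordinates.
Interval = Tuple[int, int]


def uncovered_regions(seq_len: int,
                      annotated: List[Interval],
                      min_len: int = 50) -> List[Interval]:
    """Return stretches not covered by any annotated interval.

    `annotated` is assumed to hold every region to exclude, for example
    Pfam-A matches, Pfam-B matches and predicted signal peptides.
    Only stretches of at least `min_len` residues are reported.
    """
    gaps = []
    next_free = 1  # first residue not yet known to be annotated
    for start, end in sorted(annotated):
        # Residues next_free .. start-1 are unannotated.
        if start - next_free >= min_len:
            gaps.append((next_free, start - 1))
        next_free = max(next_free, end + 1)
    # Unannotated tail at the C-terminal end of the sequence.
    if seq_len - next_free + 1 >= min_len:
        gaps.append((next_free, seq_len))
    return gaps


# Made-up example: a 300-residue protein with a signal peptide (1-22)
# and two overlapping domain matches (80-150 and 140-200).
regions = uncovered_regions(300, [(1, 22), (80, 150), (140, 200)])
```

In this made-up example, the stretch between the signal peptide and the first domain match, and the tail after the second match, are both long enough to be kept.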
Want to guess which regions Pfam will target first? Of course we want to go for regions that might give us maximum impact, so we will start by trying to incorporate into Pfam the HRs that have the highest number of hits in UniProtKB. In fact, add to this the plot in Figure 3 below and you have our current strategy for adding conserved HRs to Pfam.
Figure 3 was drawn by calculating, for each cluster member, the percentage of its phmmer hits that matched an existing Pfam-A family, and averaging these values over all members in the cluster. We show only clusters that have more than 500 phmmer hits in UniProtKB. We see that a large number of regions have a high percentage of such ‘overlaps’ with existing Pfam-A families. These HRs are likely to be regions on the periphery of existing families (outliers) that have been narrowly missed by our profile HMMs. Alternatively, they could be problem regions that generate overlaps with many different clans, as is sometimes observed for coiled-coil regions (see our recent paper on the subject). You will also notice, however, that a large proportion of HRs feature few or no overlaps with Pfam-A. These are the ones that we are currently trying to add to Pfam. Ruth Eberhardt, one of our curators, will give some examples of HRs that she has recently added to Pfam-A in another post shortly.
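As a rough illustration of the calculation behind Figure 3 and of the prioritisation it feeds into, here is a minimal Python sketch. It is not the code used for the analysis. The `Cluster` container, its field names, the way a cluster's hit count is supplied, and the 10% overlap cut-off are all assumptions made for the example; only the per-member percentage, the averaging over members, and the ">500 hits" filter come from the description above.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class Cluster:
    # Hypothetical container; field names are illustrative only.
    n_hits: int                     # phmmer hits in UniProtKB for the cluster
    members: Dict[str, List[bool]]  # member id -> one flag per hit:
                                    # True if the hit matches a Pfam-A family


def member_overlap_pct(flags: List[bool]) -> float:
    """Percentage of one member's hits that match an existing Pfam-A family."""
    if not flags:
        return 0.0  # assumption: a member without hits counts as 0% overlap
    return 100.0 * sum(flags) / len(flags)


def cluster_overlap_pct(cluster: Cluster) -> float:
    """Average the per-member percentages over all members of the cluster."""
    return mean(member_overlap_pct(f) for f in cluster.members.values())


def candidate_clusters(clusters: Dict[str, Cluster],
                       min_hits: int = 500,
                       max_overlap_pct: float = 10.0) -> List[str]:
    """Clusters with more than `min_hits` hits and little Pfam-A overlap,
    ordered from most to fewest hits. The overlap cut-off is hypothetical."""
    keep = [cid for cid, c in clusters.items()
            if c.n_hits > min_hits
            and cluster_overlap_pct(c) <= max_overlap_pct]
    return sorted(keep, key=lambda cid: clusters[cid].n_hits, reverse=True)


# Made-up data for illustration.
clusters = {
    "c1": Cluster(1200, {"m1": [True, False, False, False],
                         "m2": [False, False, False, False]}),
    "c2": Cluster(800, {"m1": [False, False]}),
    "c3": Cluster(300, {"m1": [False]}),
    "c4": Cluster(2000, {"m1": [True, True]}),
}
targets = candidate_clusters(clusters, max_overlap_pct=20.0)
```

With this made-up data, `c3` is dropped for having too few hits and `c4` for overlapping entirely with Pfam-A, leaving `c1` and `c2` ranked by hit count.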
It is still early days for this analysis, and we have looked at only around 30 cases. About two thirds resulted in extensions to existing families, and six led to new families being built. In a number of instances, such as the ones that will be discussed in our next post, structural information was key to improving family annotation. In many other cases, the only information available was sequence conservation, and on that basis our curators had to decide whether to extend an existing family or to create a new one.
Overall, it looks as if the analysis of conserved human regions has great potential not only for increasing Pfam coverage of the human proteome, but also for improving the boundaries of many of our families. We can imagine that in the future, a similar strategy could be applied to other sets of sequences, for example those in the proteomes of model organisms or pathogens.
Posted by Jaina and Marco
Mistry, J., et al. (2013) The challenge of increasing Pfam coverage of the human proteome, Database (Oxford), 2013, bat023.
The UniProt Consortium (2013) Update on activities at the Universal Protein Resource (UniProt) in 2013, Nucleic Acids Res, 41, D43-47.
Kall, L., et al. (2004) A combined transmembrane topology and signal peptide prediction method, J Mol Biol, 338, 1027-1036.
Mistry, J., et al. (2013) Challenges in homology search: HMMER3 and convergent evolution of coiled-coil regions, Nucleic Acids Res.