Following on from Jaina and Marco’s blog post last week about conserved human regions not in Pfam, I would like to give you some examples of how we have used the regions identified to improve existing Pfam families and to create new ones. When available, we use three-dimensional structures to guide the boundary definitions of our families. Where no structure is available, either for the protein in question or for other proteins in the same Pfam family, we base boundary decisions on sequence conservation. The following paragraphs describe three cases I have looked at recently.
The C-terminal region (residues 302 to 445) of Q9HAC7 (CaiB/baiF CoA-transferase family protein C7orf10) was identified as a human region not covered by Pfam, and it had >3,000 phmmer hits in UniProtKB. Just over 40% of this protein was covered by PF02515 (CoA-transferase family III) in Pfam 27.0 (Figure 1A). Several structures are available for bacterial members of this family, and they confirm that the domain boundaries of the family could be widened. I extended the family at both the N-terminus and the C-terminus, and Pfam now covers 80% of Q9HAC7 (Figure 1B). The extended family will be released in Pfam 28.0.
Figure 1: Schematic representation of Q9HAC7 (CaiB/baiF CoA-transferase family protein C7orf10) before (A) and after (B) extension of CoA-transferase family III (PF02515).
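As a rough illustration of how a coverage percentage like the ones quoted for Q9HAC7 can be worked out, here is a minimal Python sketch. It is not the code Pfam uses. The function name `residue_coverage` is made up for this example, the protein length of 445 is inferred from the C-terminal region ending at residue 445, and the domain coordinates are hypothetical values chosen only so that the results land near the 40% and 80% figures; they are not the real PF02515 boundaries.

```python
def residue_coverage(length, domains):
    """Return the fraction of residues 1..length covered by at least one domain.

    `domains` is a list of (start, end) pairs, 1-based and inclusive.
    Overlapping domains are only counted once.
    """
    covered = 0
    last_end = 0  # last residue already counted as covered
    for start, end in sorted(domains):
        start = max(start, last_end + 1, 1)  # skip residues already counted
        end = min(end, length)               # clip to the protein length
        if end >= start:
            covered += end - start + 1
            last_end = end
    return covered / length


# ASSUMPTION: 445 is the full length of Q9HAC7 (inferred from "residues 302 to 445").
Q9HAC7_LENGTH = 445

# HYPOTHETICAL coordinates, chosen for illustration only.
before = [(110, 291)]  # a single domain match, as in Pfam 27.0
after = [(50, 405)]    # the same family after extension at both ends

coverage_before = residue_coverage(Q9HAC7_LENGTH, before)
coverage_after = residue_coverage(Q9HAC7_LENGTH, after)
print(f"before: {coverage_before:.1%}, after: {coverage_after:.1%}")
```

The same function works for a protein with several domains: pass all of the matches in one list and any overlap between them is counted only once.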
Some cases are more complex and require improvements to several different families. A region of P46063 (ATP-dependent DNA helicase Q1, RECQ1) covering residues 410 to 487 was identified as having no Pfam-A domain and had >550 phmmer hits in UniProtKB. This protein was covered by three Pfam-A domains and a Pfam-B family in Pfam 27.0 (Figure 2A). Three-dimensional structures are available for this protein and for the Escherichia coli RecQ protein [2-3], and show RecQ to contain four distinct domains. The N-terminus of the protein is a DEAD/DEAH box helicase domain (PF00270), followed by a helicase conserved C-terminal domain (PF00271), then a zinc-binding domain, which was absent from Pfam 27.0, and, at the C-terminus, an RQC domain (PF09382). The Pfam alignment for the helicase conserved C-terminal domain was too short, so I extended it at the N-terminus to cover the region matched by the Pfam-B family. Extensions to large families such as this one (over 89,000 members in Pfam 27.0) help to increase the Pfam residue coverage of UniProtKB, something we are always striving to do. A new family, PF16124, was built for the zinc-binding domain, filling the gap between the helicase conserved C-terminal domain and the RQC domain, and the RQC domain was extended by a small amount at both ends. Figure 2B shows the new Pfam architecture of this protein, which is a far better reflection of its structure. The changes to PF00271 and PF09382, and the new family PF16124, will be available in Pfam 28.0.
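Finding a gap such as residues 410 to 487 comes down to listing the stretches of a protein that no domain match covers. The following Python sketch shows one way to do that. The function name `uncovered_regions`, the 50-residue minimum length, the protein length of 649 and all three domain coordinates are assumptions made for illustration; the coordinates were picked so that the gap matches the region described above and are not the real Pfam 27.0 matches on P46063.

```python
def uncovered_regions(length, domains, min_len=1):
    """List (start, end) stretches of residues 1..length with no domain match.

    `domains` is a list of (start, end) pairs, 1-based and inclusive.
    Stretches shorter than `min_len` residues are dropped.
    """
    gaps = []
    next_free = 1  # first residue not yet known to be covered
    for start, end in sorted(domains):
        if start > next_free:
            gaps.append((next_free, start - 1))
        next_free = max(next_free, end + 1)
    if next_free <= length:
        gaps.append((next_free, length))
    return [(s, e) for s, e in gaps if e - s + 1 >= min_len]


# ASSUMPTION: protein length used for illustration.
P46063_LENGTH = 649

# HYPOTHETICAL Pfam-A matches, arranged so the gap is residues 410 to 487.
pfam_a_matches = [(100, 260), (290, 409), (488, 600)]

# ASSUMPTION: only report stretches of at least 50 residues.
candidates = uncovered_regions(P46063_LENGTH, pfam_a_matches, min_len=50)
print(candidates)
```

A list like this is only a starting point. Each candidate region still needs to be checked against available structures and sequence conservation before deciding whether to extend a neighbouring family or build a new one.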
In some cases there are no appropriate changes to make to existing families and no new families to build. An example of this is P01138 (Beta-nerve growth factor). The N-terminal region of this protein is not covered by Pfam and has over 600 hits in UniProtKB. However, closer examination revealed that the N-terminus of this protein consists of a signal peptide and a propeptide, so it would not be appropriate to include these regions in Pfam.
Our study on the human proteome has so far proved useful for selecting interesting regions to target for family building. We will continue to work through the long list of conserved human regions over the next few months.
Posted by Ruth
Pike, A.C., et al. (2009) Structure of the human RECQ1 helicase reveals a putative strand-separation pin, Proc Natl Acad Sci U S A, 106, 1039-1044.
Mistry, J., et al. (2013) The challenge of increasing Pfam coverage of the human proteome, Database (Oxford), 2013, bat023.