Posts Tagged ‘curation’

Case studies from the list of human regions not in Pfam 27.0.

May 14, 2013

Following on from Jaina and Marco’s blog post last week about conserved human regions not in Pfam, I would like to give you some examples of how we have used the regions identified to improve existing Pfam families and to create new ones. When available, we use three-dimensional structures to guide the boundary definitions of our families. In cases where there is no available structure, either for the protein in question or for other proteins in the same Pfam family, we base boundary decisions on sequence conservation. The following paragraphs give three examples of cases I have looked at recently.


Pfam targets conserved human regions

May 7, 2013

Recently, we have been looking at how much of the human proteome is covered by Pfam (release 27.0), and at ways in which we can improve this coverage. We have even written an open access paper about it, which you can read here [1]; it is part of the proceedings of the 2013 Biocuration conference. We used the human proteins in UniProtKB/Swiss-Prot [2] (~20,000 sequences) as our human proteome set, and found that while most of the sequences in this set have some Pfam annotation (90% have at least one Pfam domain), there is still much ground to cover before we have a complete map of all (conserved) human regions (HRs). Here, rather than repeating what we presented in the paper (did we mention it is open access? :-)), we would like to tell you more about the impact this study is having on our strategies for selecting target regions to be added to Pfam.


What are these new families with _2, _3, _4 endings?

January 19, 2012

Some users have been contacting us about the new families that have appeared in Pfam release 26.0.

As pointed out by one of our users:

Pfam v26 includes, in addition to DDE_Tnp_1, the following new families:

DDE_Tnp_1_2
DDE_Tnp_1_3
DDE_Tnp_1_4
DDE_Tnp_1_5
DDE_Tnp_1_6
DDE_Tnp_1_7

These extra families, named name_2, name_3, name_4 and so on, have been constructed to increase the coverage of Pfam. Many of our existing large, diverse families are not well modelled by a single HMM, so there are many true members that are not matched. By building multiple models we can match more sequences. Each of these models will be in the same Pfam clan, the RNaseH clan in this case. For the most part these models do not represent any particular subfamily or classification group. Essentially, you should think of a match to any of the above seven DDE_Tnp_1 families as being the same thing. Because of the way Pfam is built, any particular region of a protein may belong to only one of these families. We have a step in building clans called competition: if a region of a protein matches both DDE_Tnp_1 and DDE_Tnp_1_2, for example, then the region is assigned to the family with the highest score. This means that a match to DDE_Tnp_1 in release 25.0 may now end up in a different family, such as DDE_Tnp_1_2. You shouldn’t read too much into these changes.
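
To make the competition step concrete, here is a minimal Python sketch of the idea. It is not Pfam's actual code: the `Match` class, the `compete` function, the rule that any shared residue counts as an overlap, and the coordinates and scores in the example are all assumptions made for illustration. The only part taken from the description above is that, within a clan, an overlapping region goes to the family with the highest score.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    """One hypothetical match of a clan family to a region of a protein."""
    family: str
    start: int    # first residue of the matched region (inclusive)
    end: int      # last residue of the matched region (inclusive)
    score: float  # match score; higher is better


def overlaps(a: Match, b: Match) -> bool:
    """Assumption: two matches overlap if they share at least one residue."""
    return a.start <= b.end and b.start <= a.end


def compete(matches: list[Match]) -> list[Match]:
    """Keep only the highest-scoring match among overlapping matches.

    Matches are visited from highest to lowest score; a match is kept
    only if it does not overlap a match that has already been kept.
    """
    kept: list[Match] = []
    for match in sorted(matches, key=lambda m: m.score, reverse=True):
        if not any(overlaps(match, winner) for winner in kept):
            kept.append(match)
    return sorted(kept, key=lambda m: m.start)


# Illustrative data only: these coordinates and scores are made up.
example = [
    Match("DDE_Tnp_1", 10, 200, 45.0),
    Match("DDE_Tnp_1_2", 15, 210, 80.5),
    Match("DDE_Tnp_1_3", 300, 400, 30.0),
]
winners = compete(example)
```

In this made-up example the first two matches cover almost the same region, so only the higher-scoring DDE_Tnp_1_2 match survives, while the non-overlapping DDE_Tnp_1_3 match is unaffected.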

Many of these new families are appearing in Pfam release 26.0 because of a change in our strategy for building new Pfam families. The new strategy consists of taking complete genomes and using each protein that does not match Pfam as the starting point for a Jackhmmer search. Jackhmmer is an iterative search tool like PSI-BLAST. If the Jackhmmer search finds lots of homologues but has some overlaps with an existing family, then we may build one of these additional families to increase coverage of known sequences. Rather than give these families completely new names, we simply give them the same name as the existing family and append a number to show that they are closely related to each other.

 

Posted by Alex

Have we found all the protein families yet?

November 22, 2011

Since starting to work on Pfam all those years ago I have been obsessing about when the job might be finished. I have plans for retiring and learning to play golf or something similar. Cyrus Chothia published a paper saying that the majority of proteins in Nature came from no more than 1,000 families (Chothia 1992). However, we now have over 14,000 families in Pfam. To work out how close we are to finishing Pfam we measure our coverage of known proteins, that is, the fraction of proteins in the database that have a hit to a Pfam family. Currently we are close to 80% coverage of known sequences in UniProt.

Lately I have been wondering about that missing 20%. I see three possibilities:

  1. Sequences should be part of new families
  2. Sequences are missing members of existing families
  3. Sequences are incorrect gene predictions and never expressed

To investigate this, one can take all the proteins in a genome that do not already hit Pfam and use each of these sequences as a seed for an iterative Jackhmmer search. Some searches find thousands of hits, many of which belong to existing families (case 2). Many others find tens to hundreds of matches, none of which match any known family (case 1). Some searches find only the query sequence itself, and perhaps these are candidates for spurious proteins (case 3). The graph below shows the results of such an analysis for one bacterial genome.

Graph showing the number of Jackhmmer hits (y-axis) for each unmatched protein in a genome. Proteins are ordered along the x-axis, with those whose searches hit the most proteins at the right.

On the right-hand side of the graph we can see the 300 or so proteins that should be part of existing families. But the large majority of proteins do not appear to be part of existing families and only consist of a red bar. That suggests that we are still far from getting a complete list of all known protein families (if such a thing is possible). So the search continues … and the golf course must wait.
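
The three-way triage described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `triage`, its two inputs and the category labels are hypothetical, and the sketch assumes the Jackhmmer searches have already been run and summarised as hit counts. The decision rule is a simplification of the three cases listed earlier.

```python
from collections import Counter


def triage(n_other_hits: int, n_hits_in_pfam: int) -> str:
    """Assign one unmatched protein to one of the three cases.

    n_other_hits:   sequences found by the search, not counting the query
    n_hits_in_pfam: how many of those hits already belong to a Pfam family
    Both inputs are assumed to come from an earlier, separate search step.
    """
    if n_hits_in_pfam > n_other_hits:
        raise ValueError("n_hits_in_pfam cannot exceed n_other_hits")
    if n_other_hits == 0:
        # The search found only the query itself (case 3).
        return "possibly_spurious"
    if n_hits_in_pfam > 0:
        # Some hits are already in Pfam, so the query is probably
        # a missing member of an existing family (case 2).
        return "missing_member"
    # Hits exist but none are in Pfam: candidate new family (case 1).
    return "new_family"


# Made-up summary counts for four hypothetical proteins.
search_summaries = {
    "protein_A": (2500, 1800),
    "protein_B": (120, 0),
    "protein_C": (0, 0),
    "protein_D": (40, 0),
}
case_counts = Counter(triage(*counts) for counts in search_summaries.values())
```

Counting the labels over a whole genome, as `case_counts` does here for made-up data, gives the kind of breakdown that the graph summarises.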

Posted by Alex

References

Chothia C. Proteins. One thousand families for the molecular biologist. Nature. 1992;357:543-544.

Naming by numbers

July 21, 2010

A user recently asked us why two highly similar sequences that contain a PAS domain are in different Pfam families within the PAS clan. The PAS domain clan (CL0183) currently contains seven different families: PAS, PAS_2, PAS_3, etc., up to PAS_6, as well as the MEKHLA family. We thought we would take the opportunity to explain some of the rationale behind the way in which we construct and name our families and clans.

Renaming of Transposases in Pfam

June 21, 2010

A few weeks ago the Pfam team was visited by the curators of the ISfinder resource, a specialist database that classifies eubacterial and archaeal transposases and insertion sequences. The ISfinder database has a very different role to Pfam: it focuses on this specific set of biological sequences and is the naming authority in the field for insertion sequences (ISs). Pfam’s role is not to name transposases, but to identify the domains contained within these sequences. Below we describe the outcomes of this meeting.