Archive for November, 2011

Have we found all the protein families yet?

November 22, 2011

Ever since I started working on Pfam all those years ago, I have been obsessing about when the job might be finished. I have plans for retiring and learning to play golf or something similar. Cyrus Chothia published a paper estimating that the majority of proteins in nature come from no more than 1,000 families (Chothia 1992). However, we now have over 14,000 families in Pfam. To work out how close we are to finishing Pfam, we measure our coverage of known proteins: that is, the fraction of proteins in the database that have a hit to a Pfam family. Currently we are close to 80% coverage of known sequences in UniProt.
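As a minimal sketch of the coverage measure described above, the fraction can be computed from two collections of sequence identifiers. The function name and the toy identifiers are made up for illustration; this is not how Pfam's own pipeline computes the figure.

```python
def sequence_coverage(all_ids, ids_with_hit):
    """Return the fraction of sequences that have at least one family hit."""
    all_ids = set(all_ids)
    if not all_ids:
        raise ValueError("no sequences to measure coverage over")
    # Count only hits that belong to the sequence set being measured.
    covered = all_ids & set(ids_with_hit)
    return len(covered) / len(all_ids)


# Toy data: five hypothetical sequence identifiers, four of them with a hit.
toy_proteins = ["P1", "P2", "P3", "P4", "P5"]
toy_hits = ["P1", "P2", "P3", "P4"]
toy_coverage = sequence_coverage(toy_proteins, toy_hits)
```

With four of five toy sequences matched, the toy coverage works out to 80%, the same proportion quoted above for UniProt.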

Lately I have been wondering about that missing 20%. I see three possibilities:

  1. The sequences should be part of new families
  2. The sequences are missing members of existing families
  3. The sequences are incorrect gene predictions that are never expressed

To investigate this, one can take all the proteins in a genome that do not already hit Pfam and use each of them as the seed for an iterative Jackhmmer search. Some searches find thousands of hits, many of which belong to existing families (case 2). Many others find tens to hundreds of matches, none of which belong to any known family (case 1). Some searches find only the seed sequence itself, and perhaps these are candidates for spurious proteins (case 3). The graph below shows the results of such an analysis for one bacterial genome.
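The three-way triage of search results could be sketched roughly as follows. This assumes the Jackhmmer output has already been parsed into a list of matched sequence identifiers per seed. The rule that any overlap with a known family counts as case 2 is a simplifying assumption made here, not a threshold stated in the post, and all names and identifiers are hypothetical.

```python
from enum import Enum


class Outcome(Enum):
    NEW_FAMILY = "case 1: candidate new family"
    EXISTING_FAMILY = "case 2: missing member of an existing family"
    POSSIBLY_SPURIOUS = "case 3: possibly spurious gene prediction"


def triage(seed_id, hit_ids, ids_in_known_families):
    """Assign the search seeded by seed_id to one of the three cases.

    hit_ids: ids matched by the iterative search, which may include the seed.
    ids_in_known_families: ids that already match a known family.
    """
    others = set(hit_ids) - {seed_id}
    if not others:
        # The search found nothing but the seed itself.
        return Outcome.POSSIBLY_SPURIOUS
    if others & set(ids_in_known_families):
        # Assumption: a single hit in a known family is enough for case 2.
        return Outcome.EXISTING_FAMILY
    return Outcome.NEW_FAMILY


def order_for_plot(results):
    """Order seeds by hit count, fewest first, so the largest searches plot at the right."""
    return sorted(results, key=lambda seed: len(results[seed]))


# Toy, made-up search results keyed by seed identifier.
toy_results = {
    "seedC": ["seedC", "k1", "x3", "x4", "x5"],
    "seedA": ["seedA"],
    "seedB": ["seedB", "x1", "x2"],
}
toy_known = {"k1"}
toy_cases = {seed: triage(seed, hits, toy_known) for seed, hits in toy_results.items()}
toy_order = order_for_plot(toy_results)
```

A real analysis would need a more careful rule for case 2, for example requiring a minimum number or fraction of hits inside an existing family.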

Graph showing the number of Jackhmmer hits (y-axis) for each unmatched protein in a genome. Proteins are ordered along the x-axis, with those whose searches hit the most proteins at the right.

On the right-hand side of the graph we can see the 300 or so proteins that should be part of existing families. But the large majority of proteins do not appear to be part of existing families and consist only of a red bar. That suggests that we are still far from having a complete list of all protein families (if such a thing is possible). So the search continues … and the golf course must wait.

Posted by Alex

References

Chothia C. Proteins. One thousand families for the molecular biologist. Nature. 1992;357:543-544.

Rfam now available in UCSC Genome Browser, and other genome news.

November 2, 2011

We are pleased to announce the arrival of the Rfam Track Hub for the popular UCSC Genome Browser. Rfam data has been available in the Ensembl browser for some time, with links back to the Rfam annotation, and now the same functionality is available for the UCSC Genome Browser.

The hub file is available on our FTP site, and by following the instructions at the UCSC Genome Browser Custom Hub page, you can visualise Rfam annotations for the majority of species for which genomes are provided by the UCSC Genome Browser. Clicking on a match will give you exact start and stop positions, as well as links to the Rfam annotation page here at the Sanger. At the moment, bit scores and E-values for a given match aren’t available directly through the UCSC Genome Browser, though we’re working on it. Happy browsing!

Rfam types for Genome annotation

Xfam (in the forms of Sarah and Rob) attended the NIH Genome Annotation Workshop last week, and it was a great insight into the trials and tribulations of coming up with common standards that everyone’s happy with. It was also nice to hear that Rfam is being used extensively to annotate ncRNA features. However, there’s been some confusion amongst annotators when converting between Rfam types (such as CD-Box) and the ncRNA_classes required by INSDC under the ncRNA feature key. The ncRNA feature key is intended to describe non-coding RNAs that aren’t ribosomal or transfer RNAs; these use the rRNA and tRNA feature keys respectively.

To use the ncRNA feature key, annotators are required to supply an appropriate ncRNA_class, and this is where confusion arises, as there’s no perfect overlap between the Rfam entry types and the ncRNA classes. To reduce the confusion, here at Rfam we’ve put together a handy translation guide that makes it easy to know which ncRNA class you should apply if you are using an Rfam family to annotate a genome. There are also some cases where an INSDC type is more specific than the Rfam type; for example, we don’t have a specific telomerase RNA type, whereas there is an ncRNA_class called telomerase_RNA. Therefore any annotation to RF00025 can use the telomerase_RNA ncRNA_class category. You can find our table of Rfam types and their INSDC equivalents here.

You can also find out all you ever wanted to know about the feature tables used for genome annotation here, and here.
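For annotators who script their pipelines, the translation step might look something like the sketch below. The entries are a tiny illustrative subset chosen for this example (only the RF00025 case comes from the text above), the semicolon-separated type string is an assumption about how the Rfam type is stored, and the function names are hypothetical. The authoritative mapping is the Rfam table linked above.

```python
# Per-family overrides, for cases where INSDC is more specific than the Rfam type.
FAMILY_OVERRIDES = {
    "RF00025": "telomerase_RNA",  # the example given in the post
}

# Illustrative subset only; consult the published Rfam table for real annotation.
TYPE_LEVEL_TO_CLASS = {
    "CD-box": "snoRNA",  # C/D box RNAs are small nucleolar RNAs
    "miRNA": "miRNA",
}


def ncrna_class(rfam_acc, rfam_type):
    """Return an ncRNA_class for an Rfam family, or None if this sketch has no mapping."""
    if rfam_acc in FAMILY_OVERRIDES:
        return FAMILY_OVERRIDES[rfam_acc]
    # Assumed format: "Gene; snRNA; snoRNA; CD-box", most specific level last.
    levels = [part.strip() for part in rfam_type.split(";") if part.strip()]
    for level in reversed(levels):
        if level in TYPE_LEVEL_TO_CLASS:
            return TYPE_LEVEL_TO_CLASS[level]
    return None


def ncrna_class_qualifier(rfam_acc, rfam_type):
    """Format the feature-table qualifier, refusing to guess when nothing is mapped."""
    cls = ncrna_class(rfam_acc, rfam_type)
    if cls is None:
        raise KeyError(f"no ncRNA_class mapping in this sketch for {rfam_acc}")
    return f'/ncRNA_class="{cls}"'


# "RF99999" is a made-up accession used only to exercise the type-based lookup.
telomerase_qualifier = ncrna_class_qualifier("RF00025", "Gene")
cd_box_qualifier = ncrna_class_qualifier("RF99999", "Gene; snRNA; snoRNA; CD-box")
```

Checking the per-family override first lets a family such as RF00025 receive the more specific INSDC class even though its Rfam type alone would not identify it as a telomerase RNA.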
