The Pfam website in a virtual machine

January 26, 2012

Since releasing the new Pfam website four years ago, we’ve had a steady trickle of mails from users who would like to install and run the site within their own local environment. It used to be possible to do just that, given a following wind, if you were ready to install the site from its source code. Unfortunately, after some internal changes and as the list of Perl module dependencies grew and grew, the process got harder and more complex and eventually we stopped supporting it entirely. We’ve been actively discouraging people from trying this for far too long, all the while promising to make the process easier. Finally we’ve managed to get around to building a virtual machine (VM) that should make the whole thing possible again.


What are these new families with _2, _3, _4 endings?

January 19, 2012

Some users have been contacting us about the new families that have appeared in Pfam release 26.0.

As pointed out by one of our users:

Pfam v26 includes, in addition to DDE_Tnp_1, the following new families:

DDE_Tnp_1_2
DDE_Tnp_1_3
DDE_Tnp_1_4
DDE_Tnp_1_5
DDE_Tnp_1_6
DDE_Tnp_1_7

These extra new families, named name_2, name_3, name_4 and so on, have been constructed to increase the coverage of Pfam. Many of our existing large, diverse families are not well modelled by a single HMM, and there are many true members that are not matched. So by building multiple models we can match more sequences. Each of these models will be in the same Pfam clan, the RNaseH clan in this case. For the most part these models do not represent any particular subfamily or classification group. Essentially you should think of a match to any of the above seven DDE_Tnp_1 families as being the same thing. Because of the way Pfam is built, any particular region of a protein may only belong to one of these families. We have a step in building clans called competition, which means that if a region of a protein matches both DDE_Tnp_1 and DDE_Tnp_1_2, for example, then the region will be assigned to the family with the highest score. This means that a match to DDE_Tnp_1 in release 25.0 may now end up in a different family such as DDE_Tnp_1_2. You shouldn’t read too much into these changes.
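For the curious, the competition step can be pictured with a small Python sketch. This is only an illustration under our own assumptions, not the actual Pfam pipeline: the `Match` record, the greedy "best score first" strategy and the example scores are all made up for the purpose.

```python
from typing import NamedTuple


class Match(NamedTuple):
    family: str   # Pfam family name
    start: int    # first residue of the match (inclusive)
    end: int      # last residue of the match (inclusive)
    score: float  # bit score (values below are invented)


def overlaps(a: Match, b: Match) -> bool:
    """True if the two matches share at least one residue."""
    return a.start <= b.end and b.start <= a.end


def compete(matches: list[Match]) -> list[Match]:
    """Among overlapping matches from one clan, keep the highest scoring.

    Assumption: a simple greedy pass over matches sorted by score.
    """
    kept: list[Match] = []
    for m in sorted(matches, key=lambda m: m.score, reverse=True):
        if not any(overlaps(m, k) for k in kept):
            kept.append(m)
    return sorted(kept, key=lambda m: m.start)


# Hypothetical matches on a single protein, all families in the same clan.
matches = [
    Match("DDE_Tnp_1", 10, 200, 45.2),
    Match("DDE_Tnp_1_2", 15, 210, 88.7),
    Match("DDE_Tnp_1_3", 300, 420, 30.1),
]
winners = compete(matches)
for m in winners:
    print(m.family, m.start, m.end, m.score)
```

In this toy case the region around residues 10 to 210 is assigned to DDE_Tnp_1_2 because it outscores DDE_Tnp_1, while the non-overlapping DDE_Tnp_1_3 match is unaffected.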

Many of these new families are appearing in Pfam release 26.0 because of a change in how we build new Pfam families. The new strategy consists of taking complete genomes and using each protein that does not match Pfam as the starting point for a Jackhmmer search. Jackhmmer is an iterative search tool like PSI-BLAST. If the Jackhmmer search finds lots of homologues but has some overlaps with an existing family, then we may build one of these new additional families to increase coverage of known sequences. Rather than give these families completely new names, we simply give them the same name as the existing family and append a number to show that they are closely related to each other.
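If you want to treat these numbered families as one group in your own scripts, note that you cannot simply strip a trailing number, because DDE_Tnp_1 itself ends in one. Here is a minimal Python sketch of one way round that, assuming you have the full list of family names to hand. The helper names are ours and the rule (strip the suffix only if what remains is itself a known family) is an assumption, not an official Pfam convention.

```python
import re
from collections import defaultdict

# A name ending in "_<number>", e.g. DDE_Tnp_1_2 -> base DDE_Tnp_1
_SUFFIX = re.compile(r"^(?P<base>.+)_\d+$")


def base_family(name: str, known: set[str]) -> str:
    """Return the family that `name` was numbered from, or `name` itself.

    The suffix is only stripped when the remainder is a known family,
    so DDE_Tnp_1 is not mistaken for a numbered copy of "DDE_Tnp".
    """
    m = _SUFFIX.match(name)
    if m and m.group("base") in known:
        return m.group("base")
    return name


def group_related(names: list[str]) -> dict[str, list[str]]:
    """Group family names under their base family."""
    known = set(names)
    groups: dict[str, list[str]] = defaultdict(list)
    for name in names:
        groups[base_family(name, known)].append(name)
    return dict(groups)


names = ["DDE_Tnp_1"] + [f"DDE_Tnp_1_{i}" for i in range(2, 8)] + ["Pkinase"]
groups = group_related(names)
print(groups)
```

With the names from this post, all seven DDE_Tnp_1 families end up under the single key `DDE_Tnp_1`.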

Posted by Alex


In Support of Wikipedia

January 17, 2012

Many of you will be aware of the proposed web blackout in response to the Stop Online Piracy Act, which is currently going through the U.S. House of Representatives (you can read the BBC’s explanation of the Act here). If this Act comes into force, it will have far-reaching consequences for the overall freedom of the internet. Editors of the English Wikipedia have taken the decision to close the English Wikipedia for 24 hours, starting at 0500 hrs on Wednesday 18th January. To respect this protest, we will also be making our Wikipedia content unavailable during this time.

You’ll still be able to access all the non-Wikipedia content – that is, all the covariance models and HMMs describing families, domain graphics, full and seed alignments, as well as our species trees.

Posted by Sarah and Alex


The new NAR paper is out!

January 15, 2012

Dear Pfam-mers,

As you have surely noticed, the highly anticipated new Pfam paper is out as part of the 2012 NAR database issue! We were delighted to be listed as a featured article. The paper covers the new release 26.0 (more on this from Rob soon) and presents some novel analysis that may be of interest to Pfam addicts like you. We discuss quite extensively our use of family-specific bit score gathering thresholds (GAs), hoping to bring clarity to an issue that seems to have been a source of confusion in the past (a.k.a. stop sending us tickets asking what GAs are and how to use them! :-) ). Also, we extend and update the analysis of DUF families that was presented in a previous publication, hoping to push more people into the de-DUF a DUF game. So, enjoy reading the paper and send us comments and suggestions; your support and advice are, as always, invaluable to us!

Posted by Marco


Have we found all the protein families yet?

November 22, 2011

Since starting to work on Pfam all those years ago I have been obsessing about when the job might be finished. I have plans for retiring and learning to play golf or something similar. Cyrus Chothia published a paper saying that the majority of proteins in Nature came from no more than 1,000 families (Chothia 1992). However, we now have over 14,000 families in Pfam. To work out how close we are to finishing Pfam we measure our coverage of known proteins, that is, the fraction of proteins in the database that have a hit to a Pfam family. Currently we are close to 80% coverage of known sequences in UniProt.
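To make the coverage measure concrete, here is a tiny Python sketch. It assumes nothing more than a list of all protein identifiers and the subset that have at least one Pfam hit; the identifiers are invented for illustration.

```python
def sequence_coverage(all_ids, ids_with_hit) -> float:
    """Fraction of proteins that have at least one Pfam hit."""
    all_ids = set(all_ids)
    if not all_ids:
        raise ValueError("no proteins supplied")
    # Ignore any hit identifiers that are not in the database.
    return len(all_ids & set(ids_with_hit)) / len(all_ids)


# Hypothetical identifiers: four of five proteins have a Pfam hit.
proteins = ["P1", "P2", "P3", "P4", "P5"]
with_hit = ["P1", "P2", "P3", "P4"]
coverage = sequence_coverage(proteins, with_hit)
print(f"{coverage:.0%}")
```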

Lately I have been wondering about that missing 20%. I see three possibilities:

  1. Sequences should be part of new families
  2. Sequences are missing members of existing families
  3. Sequences are incorrect gene predictions and never expressed

To investigate this, one can take all the proteins in a genome that do not already hit Pfam. We can then use each of these sequences as a seed for an iterative Jackhmmer search. Some proteins find thousands of hits, many of which belong to existing families (case 2), while many others find tens to hundreds of hits that do not match any known family (case 1). Some of the searches find only the sequence itself, and perhaps these are candidates for spurious proteins (case 3). The graph below shows the results of such an analysis for one bacterial genome.

Graph showing the number of Jackhmmer hits (y-axis) against the unmatched proteins in a genome. Proteins are ordered along the x-axis, with those whose searches hit the most proteins at the right.

On the right hand side of the graph we can see the 300 or so proteins that should be parts of existing families. But the large majority of proteins do not appear to be part of existing families and only consist of a red bar. That suggests that we are still far from getting a complete list of all known protein families (if such a thing is possible). So the search continues … and the golf course must wait.
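The three cases listed earlier can be sketched in a few lines of Python. This is a rough illustration rather than the procedure we actually ran: the rule that a single hit in an existing family is enough for case 2 is an assumption, and the identifiers are invented.

```python
from enum import Enum


class Case(Enum):
    NEW_FAMILY = 1         # hits found, none in an existing family
    EXISTING_FAMILY = 2    # hits include members of existing families
    POSSIBLY_SPURIOUS = 3  # search found only the query itself


def classify(query_id, hit_ids, pfam_matched_ids) -> Case:
    """Assign an unmatched protein to one of the three cases.

    hit_ids:          sequences found by the Jackhmmer search
    pfam_matched_ids: sequences that already belong to a Pfam family
    """
    others = set(hit_ids) - {query_id}
    if not others:
        return Case.POSSIBLY_SPURIOUS
    if others & set(pfam_matched_ids):
        return Case.EXISTING_FAMILY
    return Case.NEW_FAMILY


in_pfam = {"X1"}  # hypothetical set of sequences already in Pfam
print(classify("Q1", ["Q1"], in_pfam))
print(classify("Q2", ["Q2", "X1", "X2"], in_pfam))
print(classify("Q3", ["Q3", "Y1"], in_pfam))
```

In practice the boundaries are fuzzier than this, and deciding between cases 1 and 2 needs a curator's eye.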

Posted by Alex

References

Chothia C. Proteins. One thousand families for the molecular biologist. Nature. 1992;357:543-544.


Rfam now available in UCSC Genome Browser, and other genome news.

November 2, 2011

We are pleased to announce the arrival of the Rfam Track Hub for the popular UCSC Genome Browser. Rfam data has been available in the Ensembl browser for some time and provides links back to the Rfam annotation, and now this same functionality is available for the UCSC Genome Browser.

The hub file is available on our ftp site, and by following the instructions at the UCSC Genome Browser Custom Hub page, you can visualise Rfam annotations for the majority of species for which genomes are provided by the UCSC Genome Browser. Clicking on a match will give you exact start and stop positions, as well as links to the Rfam annotation page here at the Sanger. At the moment, bit scores or E-values for a given match aren’t yet available directly through the UCSC Genome Browser, though we’re working on it. Happy browsing!

Rfam types for Genome annotation

Xfam (in the forms of Sarah and Rob) attended the NIH Genome Annotation Workshop last week, and it was a great insight into the trials and tribulations of coming up with common standards that everyone’s happy with. It was also nice to hear that Rfam is being used extensively to annotate ncRNA features. However, there’s been some confusion amongst annotators when converting between Rfam types (such as CD-Box) and the ncRNA_classes required by INSDC under the ncRNA feature key. The ncRNA feature key is intended to describe non-coding RNAs that aren’t ribosomal or transfer RNAs; these use the rRNA and tRNA feature keys respectively.

To use the ncRNA feature key, annotators are required to supply an appropriate ncRNA_class, and this is where confusion arises, as there’s no perfect overlap between the Rfam entry types and the ncRNA classes. To reduce this, here at Rfam we’ve put together a handy translation guide to make it easy to know what ncRNA class you should apply if you are using an Rfam family to annotate a genome. There are also some cases where an INSDC type is more specific than the Rfam type; for example, we don’t have a specific telomerase RNA type, whereas there is a ncRNA_class called telomerase_RNA. Therefore any annotation to RF00025 can use the telomerase_RNA ncRNA_class category. You can find our table of Rfam types and their INSDC equivalents here.
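As a rough idea of how such a translation might look in a script, here is a minimal Python sketch. Only the RF00025 to telomerase_RNA rule comes from this post; the Rfam type strings, the two type-level entries and the function name are illustrative assumptions, so please use the published table rather than this snippet for real annotation.

```python
# Family-level rules take priority over type-level rules.
FAMILY_OVERRIDES = {
    "RF00025": "telomerase_RNA",  # stated in this post
}

# Assumed, illustrative entries only; not a copy of the Rfam table.
TYPE_TO_NCRNA_CLASS = {
    "Gene; snRNA; snoRNA; CD-box": "snoRNA",
    "Gene; miRNA": "miRNA",
}

# Ribosomal and transfer RNAs have their own INSDC feature keys.
OWN_FEATURE_KEY = {
    "Gene; rRNA": "rRNA",
    "Gene; tRNA": "tRNA",
}


def insdc_annotation(accession: str, rfam_type: str):
    """Return (feature_key, ncRNA_class), or None if no rule is known."""
    if rfam_type in OWN_FEATURE_KEY:
        return OWN_FEATURE_KEY[rfam_type], None
    if accession in FAMILY_OVERRIDES:
        return "ncRNA", FAMILY_OVERRIDES[accession]
    if rfam_type in TYPE_TO_NCRNA_CLASS:
        return "ncRNA", TYPE_TO_NCRNA_CLASS[rfam_type]
    return None  # look it up in the translation table instead


# "RF99999" is a placeholder accession, not a real family.
print(insdc_annotation("RF00025", "Gene"))
print(insdc_annotation("RF99999", "Gene; snRNA; snoRNA; CD-box"))
print(insdc_annotation("RF99999", "Gene; tRNA"))
```

Returning `None` for unknown types is deliberate: it is safer to stop and consult the table than to guess a class.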

You can also find out all you ever wanted to know about the feature tables used for genome annotation here, and here.


The Rfam 10.1 release is out!

June 16, 2011

The crowd of people behind Rfam are proud to announce a new release of the Rfam database. This is version 10.1 and is mostly an increase in the number, size and quality of families.

Rfam now has 1973 families, 528 more families than the 10.0 release. We are just one prokaryotic RNA-seq project away from hitting 2000 families! In fact, we have passed 2000 in terms of Rfam accession (RF02031 is the Escherichia coli sRNA, tpke11). My selfish attempt to claim the coveted RF02000 accession was snatched from my grasp by Chris Boursnell, who added RF02000, which now corresponds to the rice microRNA MIR1846 from the miRBase database. I did claim RF01999 and RF02001 with two sub-types of Group II catalytic intron domains 1-4 that Zasha Weinberg kindly provided to Rfam.

The new families include nearly 100 novel elements inferred by Zasha Weinberg and colleagues in their recent Genome Biology article [1]. Zasha kindly provides Rfam with the alignments and writes Wikipedia articles for each notable element, greatly easing the burden on Rfam of incorporating these into the database.

Our new recruit, Ruth Eberhardt, originally from the UniProt group at EBI, has also made a significant mark on the new release. Ruth has been busily incorporating “domains” derived from long messenger-like non-coding RNAs (a.k.a. lncRNAs). These are regions within each transcript that are unusually well conserved, and there is some evidence that secondary structure within the regions is evolutionarily constrained. The new families include: MEG3, MALAT1, MIAT, PRINS, Xist, TUG1, HSR-omega, Evf1, HOTAIR, KCNQ1OT1, SOX2OT, NEAT1, EGOT, H19 and HOTAIRM1.

This summer we had the pleasure of hosting another talented summer student, Ben Moore. Ben is a prolific Wikipedian who rapidly made his mark on the RNA Wikipedia entries and continues to do so while working on an MRes in Computational Biology at the University of York. One stunning feat Ben accomplished was passing the article for “Toxin-antitoxin system” through Wikipedia’s peer-review process for “good articles”. This process appears to be at least as rigorous as scientific peer-review, so this is quite an achievement. He also built a number of families for RNA anti-toxins including Sok, RNAII, IstR, RdlD, FlmB, Sib, RatA, SymR and PtaRNA1. PtaRNA1 is a newly discovered RNA antitoxin that was first published by Sven Findeiss and colleagues [2] in the RNA families track at RNA Biology. This track has provided very useful updates and expansions for Rfam directly from the RNA community.

A guest Rfam rogue, Chris Boursnell, who has been visiting from Andrew Firth’s group, has also been busy building new families. With permission from the good people at the Recode-2 database [3], Chris has added a number of new frame-shift elements and has also updated a number of microRNA families based on the latest release of miRBase [4].

An enormous achievement for the database is the inclusion of full-length small subunit ribosomal RNA families. Previously Rfam had just one truncated model that covered all three kingdoms of life. This is thanks to the hard work of Eric Nawrocki and colleagues in Sean Eddy’s lab on the Infernal software and the related package ssu-align, which can now deal with much larger datasets than were previously possible. The three new alignments now cover bacteria, archaea and eukaryotes. These are all derived from the excellent and highly accurate alignments from the work of Robin Gutell and colleagues, who run the Comparative RNA Website.

Thanks to the exciting work by Stefan Washietl and colleagues on the RNAcode software package, we now have good evidence that the “RNA” family C0343 (RF00120) is in fact protein-coding and most likely does not function as an RNA other than in an mRNA sense [5]. Therefore C0343 has been removed for the 10.1 release.

Our SRP families have all been rebuilt and supplemented with additional families thanks to the work of Magnus Rosenblad and colleagues [6]. This is another excellent contribution to the RNA families track at RNA Biology. Based on this work the existing two SRP families were replaced and supplemented by 7 new families: Metazoa_SRP (RF00017), Bacteria_small_SRP (RF00169), Fungi_SRP (RF01502), Bacteria_large_SRP (RF01854), Plant_SRP (RF01855), Protozoa_SRP (RF01856) and Archaea_SRP (RF01857). These new models should improve the specificity of Rfam annotations and reduce the number of pseudogenes incorporated.

We have continued to work on the Rfam clans and have added 3 new clans. These are U3 (CL00100), Cobalamin (CL00101) and group-II-D1D4 (CL00102). Also, the membership of the clans tRNA (CL00001), RNaseP (CL00002) and SNORA62 (CL00040) has been updated.

Finally, several problematic microRNA families, mir-544 (RF01045), mir-1302 (RF00951), mir-1255 (RF00994), mir-548 (RF01061), mir-649 (RF01029) and mir-562 (RF00998), together with spliceosomal U13 (RF01210), were rethresholded to remove the excessive number of pseudogene annotations in the full alignments. This rethresholding, along with the rebuilding of our SSU models, has removed approximately 600,000 annotations from Rfam.
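To see why rethresholding shrinks a full alignment, here is a minimal Python sketch. It assumes a hit is kept when its bit score is at or above the family threshold; the hit names, scores and threshold values are invented and bear no relation to the real families above.

```python
def apply_threshold(hits, threshold):
    """Keep hits whose bit score is at or above the family threshold."""
    return [(seq_id, score) for seq_id, score in hits if score >= threshold]


# Hypothetical hits for one family: (sequence id, bit score).
hits = [("seq1", 62.0), ("seq2", 41.5), ("seq3", 33.0), ("seq4", 30.2)]

before = apply_threshold(hits, 30.0)  # old, permissive threshold
after = apply_threshold(hits, 45.0)   # new, stricter threshold
print(len(before), "annotations before,", len(after), "after")
```

Raising the threshold drops the low-scoring hits, which is where pseudogene matches tend to sit.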

There are countless other changes that have been made. If I’ve forgotten to include any that are significant to you, or to mention your name, then I apologise profusely.

This release could not have happened without the invaluable help of Jen Daub and John Tate, who have worked tirelessly and enthusiastically on it. This was made particularly challenging by the fact that I have recently relocated to my homeland in New Zealand to take up a position as a Rutherford Discovery Fellow and senior lecturer at the University of Canterbury in Christchurch. I hope to continue to contribute to Rfam and the wider RNA community from here. This is also a good moment to welcome the new Czars of Rfam, Sarah Burge and Eric Nawrocki, who will now face the exciting and challenging task of managing the day-to-day work of maintaining Rfam. I wish them the best of luck in their new roles. I hope they enjoy it as much as I have.

Paul Gardner.

References

[1] Weinberg Z, Wang JX, Bogue J, Yang J, Corbino K, Moy RH, Breaker RR. (2010) Comparative genomics reveals 104 candidate structured RNAs from bacteria, archaea, and their metagenomes. Genome Biology. 11(3):R31.

[2] Findeiss S, Schmidtke C, Stadler PF, Bonas U (2010). A novel family of plasmid-transferred anti-sense ncRNAs. RNA Biology. 7 (2): 120–4.

[3] Bekaert M, Firth AE, Zhang Y, Gladyshev VN, Atkins JF, Baranov PV. (2010) Recode-2: new design, new search tools, and many more genes. Nucleic Acids Res. 38(Database issue):D69-74.

[4] Kozomara A, Griffiths-Jones S. (2011) miRBase: integrating microRNA annotation and deep-sequencing data. Nucleic Acids Research. Jan;39(Database issue):D152-7.

[5] Washietl S, Findeiss S, Müller SA, Kalkhof S, von Bergen M, Hofacker IL, Stadler PF, Goldman N. (2011) RNAcode: robust discrimination of coding and noncoding regions in comparative sequence data. RNA. 17(4):578-94.

[6] Rosenblad MA, Larsen N, Samuelsson T, Zwieb C. (2009) Kinship in the SRP RNA family. RNA Biology. 2009 Nov-Dec;6(5):508-16.


No, seriously, we’ve made a release

April 1, 2011

Well, it should have been out about 6 months ago, but finally the long-awaited Pfam release 25.0 is here! Release 25.0 contains a total of 12273 families, with 384 new families and 21 families killed since the last release. Pfam 25.0 is based on UniProt release 2010_05. Those of you who follow Pfam closely will be familiar with the fact that the sequence coverage (the fraction of sequences in Pfamseq containing at least one Pfam match) has hovered at or just below 75%. Despite the addition of only a modest number of new families in this release, the sequence coverage has risen: 76.69% of all proteins in Pfamseq now contain a match to at least one Pfam domain, and 53.86% of all residues in the sequence database fall within Pfam domains.
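Residue coverage is a slightly different calculation from sequence coverage, so here is a small Python sketch of it. This is our own illustration, not the code used to produce the figures above: the function names are hypothetical, the example data are invented, and overlapping or adjacent regions are merged so that no residue is counted twice.

```python
def merge_intervals(intervals):
    """Merge overlapping or adjacent (start, end) regions, inclusive."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def residue_coverage(seq_lengths, domains) -> float:
    """Fraction of all residues that fall within at least one domain.

    seq_lengths: {sequence id: length}
    domains:     {sequence id: [(start, end), ...]} with 1-based coordinates
    """
    total = sum(seq_lengths.values())
    if total == 0:
        raise ValueError("no residues supplied")
    covered = 0
    for regions in domains.values():
        covered += sum(end - start + 1 for start, end in merge_intervals(regions))
    return covered / total


# Hypothetical data: two sequences, one with two overlapping domains.
seq_lengths = {"A": 100, "B": 50}
domains = {"A": [(1, 40), (30, 60)]}
res_cov = residue_coverage(seq_lengths, domains)
print(f"{res_cov:.2%}")
```

Here 60 of 150 residues are covered, while the sequence coverage for the same data would be one sequence in two.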



Who’s who?

March 22, 2011

It has been some time since we posted a blog, so, to keep you all on your toes, we are going behind the scenes to reveal something about the minds that run Pfam… From the longest-serving member to the newest recruit we have elicited a few key facts in the form of answers to some ‘trivial’ questions. Here are two profiles as they were given.  Can you work out who is who?



The Rfam NAR paper is out!

December 9, 2010

This is a belated announcement of the release of the latest Rfam database article in the NAR database issue:
[1] Gardner PP, Daub J, Tate J, Moore BL, Osuch IH, Griffiths-Jones S, Finn RD, Nawrocki EP, Kolbe DL, Eddy SR, Bateman A. (2010) Rfam: Wikipedia, clans and the “decimal” release. Nucleic Acids Res.

In this publication we discuss the success of the relationship between Wikipedia and Rfam. This includes a fun analysis of the degree of vandalism the RNA pages have received with respect to the number of useful edits. We also discuss the new clans that explicitly link families that share an evolutionary relationship yet are too divergent to be sensibly aligned, the latest “decimal” release and our future plans.

While you are there, check out the latest miRBase paper and the new miRBase blog:
[2] Kozomara A, Griffiths-Jones S. (2010) miRBase: integrating microRNA annotation and deep-sequencing data. Nucleic Acids Res.

Our friends over at the EBI have 2 papers describing the all-important nucleotide sequence archives:
[3] Cochrane G, Karsch-Mizrachi I, Nakamura Y; On behalf of the International Nucleotide Sequence Database Collaboration. (2010) The International Nucleotide Sequence Database Collaboration. Nucleic Acids Res.

[4] Leinonen R, Akhtar R, Birney E, Bower L, Cerdeno-Tárraga A, Cheng Y, Cleland I, Faruque N, Goodgame N, Gibson R, Hoad G, Jang M, Pakseresht N, Plaister S, Radhakrishnan R, Reddy K, Sobhany S, Ten Hoopen P, Vaughan R, Zalunin V, Cochrane G. (2010) The European Nucleotide Archive. Nucleic Acids Res.

The ENSEMBL series of genome databases has had an update:
[5] Flicek P et al. (2010) Ensembl 2011. Nucleic Acids Res.

Also, see the very useful nomenclature efforts of the “(Human Genome Organisation) Gene Nomenclature Committee”:
[6] Seal RL, Gordon SM, Lush MJ, Wright MW, Bruford EA. (2010) genenames.org: the HGNC resources in 2011. Nucleic Acids Res.

