We are pleased to introduce Dfam 1.0, a database of profile HMMs for repetitive DNA elements. Repetitive DNA, especially the remnants of transposable elements (TEs), makes up a large fraction of many genomes, particularly eukaryotic ones. Accurate annotation of these TEs both simplifies downstream genomic analysis and enables research into their fascinating biology and impact on the genome.
Posts Tagged ‘rfam’
The team behind Rfam is pleased to announce the release of Rfam 11.0. This release represents a major update from 10.1, primarily due to the upgrade of our underlying sequence database, Rfamseq.
As some of you will already be aware, the Xfam family has recently gained a new member: the TreeFam database.
TreeFam aims to provide phylogenetic trees and orthology predictions for all animal genes.
AntiFam is the newest addition to the Xfam brand. It is a database of hidden Markov models (HMMs) designed to identify spurious open reading frames (ORFs). It is available now on our ftp site:
Many of you will be aware of the proposed web blackout in response to the Stop Online Piracy Act, which is currently going through the U.S. House of Representatives (you can read the BBC’s explanation of the Act here). If this Act is passed and enforced, it will have far-reaching consequences for the overall freedom of the internet. Editors of the English Wikipedia have taken the decision to close the English Wikipedia for 24 hours, starting at 0500 hrs on Wednesday 18th January. To respect this protest, we will also be making our Wikipedia content unavailable during this time.
You’ll still be able to access all the non-Wikipedia content – that is, all the covariance models and HMMs describing families, domain graphics, full and seed alignments, as well as our species trees.
Posted by Sarah and Alex
We are pleased to announce the arrival of the Rfam Track Hub for the popular UCSC Genome Browser. Rfam data has been available in the Ensembl browser for some time, with links back to the Rfam annotation, and now this same functionality is available for the UCSC Genome Browser.
The hub file is available on our ftp site, and by following the instructions at the UCSC Genome Browser Custom Hub page, you can visualise Rfam annotations for the majority of species for which genomes are provided by the UCSC Genome Browser. Clicking on a match will give you exact start and stop positions, as well as links to the Rfam annotation page here at the Sanger. At the moment, bit scores or E-values for a given match aren’t yet available directly through the UCSC Genome Browser, though we’re working on it. Happy browsing!
Rfam types for Genome annotation
Xfam (in the forms of Sarah and Rob) attended the NIH Genome Annotation Workshop last week, and it was a great insight into the trials and tribulations of coming up with common standards that everyone’s happy with. It was also nice to hear that Rfam is being used extensively to annotate ncRNA features. However, there’s been some confusion amongst annotators when converting between Rfam types (such as CD-Box) and the ncRNA_classes required by INSDC under the ncRNA feature key. The ncRNA feature key is intended to describe non-coding RNAs that aren’t ribosomal or transfer RNAs; these use the rRNA and tRNA feature keys respectively.
To use the ncRNA feature key, annotators are required to supply an appropriate ncRNA_class, and this is where confusion arises, as there’s no perfect overlap between the Rfam entry types and the ncRNA classes. To reduce this confusion, here at Rfam we’ve put together a handy translation guide to make it easy to know what ncRNA class you should apply if you are using an Rfam family to annotate a genome. There are also some cases where an INSDC type is more specific than the Rfam type; for example, we don’t have a specific telomerase RNA type, whereas there is an ncRNA_class called telomerase_RNA. Therefore any annotation to RF00025 can use the telomerase_RNA ncRNA_class category. You can find our table of Rfam types and their INSDC equivalents here.
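For anyone scripting this conversion, the lookup described above might be sketched roughly as follows. This is only an illustration, not the Rfam translation table itself: the three type-to-class pairs are a small hand-picked subset, the type strings assume the semicolon-separated form used in Rfam family files, the function name ncrna_class is hypothetical, and falling back to the INSDC class "other" for unlisted types is an assumption rather than Rfam policy. The RF00025 override is the one case taken directly from the text above.

```python
# Illustrative subset only; the authoritative mapping is the table published by Rfam.
# Keys are Rfam entry types (assumed semicolon-separated format),
# values are INSDC ncRNA_class terms.
TYPE_TO_NCRNA_CLASS = {
    "Gene; snRNA; snoRNA; CD-box;": "snoRNA",
    "Gene; miRNA;": "miRNA",
    "Gene; ribozyme;": "ribozyme",
}

# Accession-level overrides for families where the INSDC class is more
# specific than the Rfam type (RF00025 is the example given in the post).
ACCESSION_OVERRIDES = {
    "RF00025": "telomerase_RNA",
}


def ncrna_class(accession: str, rfam_type: str) -> str:
    """Return an ncRNA_class for an Rfam hit (hypothetical helper).

    Accession overrides win over the type-based lookup; types missing
    from the table fall back to "other" (an assumption of this sketch).
    """
    if accession in ACCESSION_OVERRIDES:
        return ACCESSION_OVERRIDES[accession]
    return TYPE_TO_NCRNA_CLASS.get(rfam_type.strip(), "other")
```

Checking the accession before the type keeps the more specific INSDC classes from being masked by a generic Rfam type such as "Gene;".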
The crowd of people behind Rfam are proud to announce a new release of the Rfam database. This is version 10.1 and is mostly an increase in the number, size and quality of families.
Rfam now has 1973 families, 528 more than the 10.0 release. We are just one prokaryotic RNA-seq project away from hitting 2000 families! In fact, we have passed 2000 in terms of Rfam accession (RF02031 is the Escherichia coli sRNA, tpke11). My selfish attempt to claim the coveted RF02000 accession was snatched from my grasp by Chris Boursnell, who added RF02000, which now corresponds to the rice microRNA MIR1846 from the miRBase database. I did claim RF01999 and RF02001 with two sub-types of Group II catalytic intron domains 1-4 that Zasha Weinberg kindly provided to Rfam.
The new families include nearly 100 novel elements inferred by Zasha Weinberg and colleagues in their recent Genome Biology article. Zasha kindly provides Rfam with the alignments and writes Wikipedia articles for each notable element, greatly easing the burden on Rfam for incorporating these into the database.
Our new recruit, Ruth Eberhardt, originally from the UniProt group at EBI, has also made a significant mark on the new release. Ruth has been busily incorporating “domains” derived from long messenger-like non-coding RNAs (a.k.a. lncRNAs). These are regions within each transcript that are unusually well conserved and there is some evidence that secondary structure within the regions is evolutionarily constrained. The new families include: MEG3, MALAT1, MIAT, PRINS, Xist, TUG1, HSR-omega, Evf1, HOTAIR, KCNQ1OT1, SOX2OT, NEAT1, EGOT, H19 and HOTAIRM1.
This summer we had the pleasure of hosting another talented summer student, Ben Moore. Ben is a prolific Wikipedian and rapidly made his mark on the RNA Wikipedia entries, and continues to do so while working on an MRes in Computational Biology at the University of York. One stunning feat Ben accomplished was passing the article for “Toxin-antitoxin system” through Wikipedia’s peer-review process for “good articles”. This process appears to be at least as rigorous as scientific peer-review and is quite an achievement. He also built a number of families for RNA antitoxins including Sok, RNAII, IstR, RdlD, FlmB, Sib, RatA, SymR and PtaRNA1. PtaRNA1 is a newly discovered RNA antitoxin that was first published by Sven Findeiss and colleagues in the RNA families track at RNA Biology. This track has provided very useful updates and expansions for Rfam directly from the RNA community.
A guest Rfam rogue, Chris Boursnell, who has been visiting from Andrew Firth’s group, has also been busy building new families. With permission from the good people at the Recode-2 database, Chris has added a number of new frame-shift elements and has also updated a number of microRNA families based on the latest release of miRBase.
An enormous achievement for the database is the inclusion of full-length small subunit ribosomal RNA families. Previously Rfam had just one truncated model that covered all three kingdoms of life. The full-length families are possible thanks to the hard work of Eric Nawrocki and colleagues in Sean Eddy’s lab on the Infernal software and the related package ssu-align, which can now deal with much larger datasets than were previously possible. The three new alignments cover bacteria, archaea and eukaryotes. These are all derived from the highly accurate alignments produced by Robin Gutell and colleagues, who run the Comparative RNA Website.
Thanks to the exciting work by Stefan Washietl and colleagues on the RNAcode software package, we now have good evidence that the “RNA” family C0343 (RF00120) is in fact protein-coding and most likely does not function as an RNA other than in an mRNA sense. Therefore C0343 has been removed for the 10.1 release.
Our SRP families have all been rebuilt and supplemented with additional families thanks to the work of Magnus Rosenblad and colleagues. This is another excellent contribution to the RNA families track at RNA Biology. Based on this work the existing two SRP families were replaced and supplemented by 7 new families: Metazoa_SRP (RF00017), Bacteria_small_SRP (RF00169), Fungi_SRP (RF01502), Bacteria_large_SRP (RF01854), Plant_SRP (RF01855), Protozoa_SRP (RF01856) and Archaea_SRP (RF01857). These new models should improve the specificity of Rfam annotations and reduce the number of pseudogenes incorporated.
We have continued to work on the Rfam clans and have added 3 new clans. These are U3 (CL00100), Cobalamin (CL00101) and group-II-D1D4 (CL00102). Also, the membership of the clans tRNA (CL00001), RNaseP (CL00002) and SNORA62 (CL00040) has been updated.
Finally, several problematic microRNA families, mir-544 (RF01045), mir-1302 (RF00951), mir-1255 (RF00994), mir-548 (RF01061), mir-649 (RF01029) and mir-562 (RF00998), together with spliceosomal U13 (RF01210), were rethresholded to remove the excessive number of pseudogene annotations in the full alignments. This rethresholding, along with the rebuilding of our SSU models, has removed approximately 600,000 annotations from Rfam.
There are countless other changes that have been made. If I’ve forgotten to include any that are significant to you, or to mention your name, then I apologise profusely.
This release could not have happened without the invaluable help of Jen Daub and John Tate, who have worked tirelessly and enthusiastically on it. Their task was made particularly challenging by the fact that I have recently relocated to my homeland in New Zealand to take up a position as a Rutherford Discovery Fellow and senior lecturer at the University of Canterbury in Christchurch. I hope to continue to contribute to Rfam and the wider RNA community from here. This is also a good moment to welcome the new Czars of Rfam, Sarah Burge and Eric Nawrocki, who will now face the exciting and challenging task of managing the day-to-day work of maintaining Rfam. I wish them the best of luck in their new roles. I hope they enjoy it as much as I have.
Weinberg Z, Wang JX, Bogue J, Yang J, Corbino K, Moy RH, Breaker RR. (2010) Comparative genomics reveals 104 candidate structured RNAs from bacteria, archaea, and their metagenomes. Genome Biology. 11(3):R31.
Findeiss S, Schmidtke C, Stadler PF, Bonas U. (2010) A novel family of plasmid-transferred anti-sense ncRNAs. RNA Biology. 7(2):120-4.
Bekaert M, Firth AE, Zhang Y, Gladyshev VN, Atkins JF, Baranov PV. (2010) Recode-2: new design, new search tools, and many more genes. Nucleic Acids Res. 38(Database issue):D69-74.
Kozomara A, Griffiths-Jones S. (2011) miRBase: integrating microRNA annotation and deep-sequencing data. Nucleic Acids Res. 39(Database issue):D152-7.
Washietl S, Findeiss S, Müller SA, Kalkhof S, von Bergen M, Hofacker IL, Stadler PF, Goldman N. (2011) RNAcode: robust discrimination of coding and noncoding regions in comparative sequence data. RNA. 17(4):578-94.
Rosenblad MA, Larsen N, Samuelsson T, Zwieb C. (2009) Kinship in the SRP RNA family. RNA Biology. 6(5):508-16.
It has been some time since we posted a blog, so, to keep you all on your toes, we are going behind the scenes to reveal something about the minds that run Pfam… From the longest-serving member to the newest recruit we have elicited a few key facts in the form of answers to some ‘trivial’ questions. Here are two profiles as they were given. Can you work out who is who?
This is a belated announcement of the release of the latest Rfam database article in the NAR database issue:
 Gardner PP, Daub J, Tate J, Moore BL, Osuch IH, Griffiths-Jones S, Finn RD, Nawrocki EP, Kolbe DL, Eddy SR, Bateman A. (2010) Rfam: Wikipedia, clans and the “decimal” release. Nucleic Acids Res.
In this publication we discuss the success of the relationship between Wikipedia and Rfam. This includes a fun analysis of the degree of vandalism the RNA pages have received with respect to the number of useful edits. We also discuss the new clans that explicitly link families that share an evolutionary relationship yet are too divergent to be sensibly aligned, the latest “decimal” release and our future plans.
While you are there, check out the latest miRBase paper and the new miRBase blog:
 Kozomara A, Griffiths-Jones S. (2010) miRBase: integrating microRNA annotation and deep-sequencing data. Nucleic Acids Res.
Our friends over at the EBI have 2 papers describing the all-important nucleotide sequence archives:
 Cochrane G, Karsch-Mizrachi I, Nakamura Y; On behalf of the International Nucleotide Sequence Database Collaboration. (2010) The International Nucleotide Sequence Database Collaboration. Nucleic Acids Res.
 Leinonen R, Akhtar R, Birney E, Bower L, Cerdeno-Tárraga A, Cheng Y, Cleland I, Faruque N, Goodgame N, Gibson R, Hoad G, Jang M, Pakseresht N, Plaister S, Radhakrishnan R, Reddy K, Sobhany S, Ten Hoopen P, Vaughan R, Zalunin V, Cochrane G. (2010) The European Nucleotide Archive. Nucleic Acids Res.
The Ensembl series of genome databases has had an update:
 Flicek P et al. (2010) Ensembl 2011. Nucleic Acids Res.
Also, see the very useful nomenclature efforts of the “(Human Genome Organisation) Gene Nomenclature Committee”:
 Seal RL, Gordon SM, Lush MJ, Wright MW, Bruford EA. (2010) genenames.org: the HGNC resources in 2011. Nucleic Acids Res.
We have been very sad to see a few people leave the group recently. Rob Finn has been the dedicated and hard-working project leader of Pfam for many years. In fact, as a summer student, he is credited with preparing most of the families for Pfam 2.0! We’re expecting to see great things from him at his new post at HHMI’s Janelia Farm. We’ve also seen Jaina Mistry get married and move to another city; fortunately for us, she’s still working part-time for Pfam remotely. Jen Daub, after her whirlwind trip around the world, will also be working part-time on the Rfam project from her luxurious new abode in France.
This means we have a number of opportunities for bright and enthusiastic people. We are looking to recruit a new Project Leader to lead the Pfam group. This is an exciting opportunity for a motivated, enthusiastic and experienced computational biologist, and is an influential position working with a high-profile bioinformatic resource. We anticipate the candidate will lead the next phase of database development, which will include community annotation and the incorporation of new developments based on the HMMER3 software. We would expect the successful candidate to have their own research ideas and be able to deliver research outputs with the group.
We are also looking for two Computational Biologists to join the group. The successful candidates will ideally have an MSc in bioinformatics or equivalent experience and a strong background in molecular biology, biochemistry, genetics or similar.
We would also like to take the opportunity to welcome Professor John Burke from the University of Vermont. John is taking a one-year sabbatical with Rfam to learn about all things bioinformatic. He is already an expert on all things to do with ribozymes and RNA structure, so we expect some major improvements in Rfam in these areas.
Last but not least, we have Chris Boursnell, a refugee from the banking world, who is working with us and the fine Recode database to improve our coverage of frame-shift elements.