The Rfam decimal release is out!

April 16, 2010

It has been some time since Rfam has had a public update. However, the dwarves under Mount Rfam have been busy for some time working on Rfam 10.0. The good elves at the Eddy-lab some time ago forged Infernal 1.0. Since then we have been busily adding Infernal 1.0 support and individually re-thresholding each Rfam family. This has been an epic adventure largely championed by our number one dwarf Jennifer ‘Gimlet’ Daub. She has since left and is now enjoying sunny NZ (hopefully these two events aren’t correlated). We also mapped all the families and searched a new version of rfamseq based on EMBL 100.

The net result is a 178% increase in the number of regions that Rfam covers. Set against the rather modest 40% increase in rfamseq size (thankfully new sequencing technologies are primarily being deployed for re-sequencing efforts, phew), that is pretty impressive IMHO. This has also meant that some of our alignments are enormous! The tRNA full alignment, for example, now has more than 1 million sequences in it (1,101,833 to be exact). I’m waiting to hear from the Guinness Book of Records people since I have a suspicion that this may be the deepest alignment of all time. Unfortunately this alignment is too large to reasonably fetch from the website, but you can find it on the ftp site in the Rfam.full file (warning: this is a 1GB file).

If anyone wants to repeat this effort then it’ll take them roughly 5 CPU months to calibrate the models, 1 CPU year to run blast, 3 CPU years to run cmsearch and 15 days to run cmalign, all on a modern high-spec computer. Many thanks to Guy Coates, James Beal and Peter Clapham for assistance getting the Sanger infrastructure talking nicely to MPI and the granularity correct for our searches.

New alignment formats!

Another major change our power users may notice is that there is a new set of seed and full alignments available on the website. These use descriptive “species labels” for sequence names rather than the traditional EMBL accessions and coordinates that we normally provide. The provenance of the sequence data is maintained by using “#=GS” tags from Stockholm format to give a mapping back to EMBL accessions. This has been our most requested feature, including from some of the greats in the field like Eric Westhof, Mike Zuker and Sean Eddy. Hopefully our users approve of the new alignments.
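For anyone who wants to get from a species label back to the underlying sequence, here is a minimal Python sketch that collects the “#=GS” lines from a Stockholm file. It relies only on the general Stockholm layout of these lines (`#=GS <seqname> <feature> <text>`). The function name, the species labels, the `AC` feature name and the accessions in the example are all assumptions made up for illustration, so check a real alignment to see which feature actually carries the EMBL accession.

```python
from collections import defaultdict


def parse_gs_annotations(stockholm_text):
    """Collect '#=GS <seqname> <feature> <text>' lines from a Stockholm alignment.

    Returns a dict of the form {seqname: {feature: [text, ...]}}.
    """
    annotations = defaultdict(lambda: defaultdict(list))
    for line in stockholm_text.splitlines():
        if not line.startswith("#=GS"):
            continue
        # Split into at most four fields so the free-text value keeps its spaces.
        fields = line.split(None, 3)
        if len(fields) < 4:
            continue  # annotation without a value; skip it
        _, seqname, feature, text = fields
        annotations[seqname][feature].append(text.strip())
    return {name: dict(feats) for name, feats in annotations.items()}


# Hypothetical input: labels, the "AC" feature and the accessions are invented.
example = """# STOCKHOLM 1.0
#=GS Escherichia_coli.1 AC XX000001.1/10-85
#=GS Bacillus_subtilis.1 AC XX000002.1/200-276
Escherichia_coli.1   GCCCGGAUAGCUCAGUCGGUAGAGCAG
Bacillus_subtilis.1  GCGGAAGUAGUUCAGUGGUAGAACACC
//
"""

mapping = parse_gs_annotations(example)
for label, features in mapping.items():
    print(label, features)
```

Keeping every feature in a list, rather than assuming one accession per label, means the sketch does not silently drop anything if a sequence carries several “#=GS” lines.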

100 Rfam clans!

I’ve added 100 clans (actually 99 but close enough, eh). These are explicit relationships between families that share a common ancestor but are too diverged to be reasonably aligned. For example, the RNase P clan contains the related families RNase_MRP, RNaseP_arch, RNaseP_bact_a, RNaseP_bact_b and RNaseP_nuc. This means that internally for these families we can ignore our “no-overlap rule” which has meant that in the past some of these families have had artificially high thresholds to avoid overlapping a related but divergent family.
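To make the idea concrete, here is a rough Python sketch of a clan-aware overlap check. It is not our pipeline code, just an illustration of the rule described above: two hits from different families on the same sequence may not overlap unless both families belong to the same clan. The function names, the clan identifier, the sequence name and the coordinates are all invented for the example.

```python
def normalise(start, end):
    """Return the coordinates as (low, high), in case a hit has start > end."""
    return (start, end) if start <= end else (end, start)


def forbidden_overlaps(hits, clan_of):
    """Return pairs of overlapping hits that break the no-overlap rule.

    hits:    list of (family, sequence, start, end) tuples
    clan_of: dict mapping family -> clan id; families not listed have no clan
    """
    conflicts = []
    for i, hit_a in enumerate(hits):
        fam_a, seq_a, start_a, end_a = hit_a
        low_a, high_a = normalise(start_a, end_a)
        for hit_b in hits[i + 1:]:
            fam_b, seq_b, start_b, end_b = hit_b
            if fam_a == fam_b or seq_a != seq_b:
                continue  # only different families on the same sequence matter
            low_b, high_b = normalise(start_b, end_b)
            if not (low_a <= high_b and low_b <= high_a):
                continue  # the two regions do not share any position
            clan_a, clan_b = clan_of.get(fam_a), clan_of.get(fam_b)
            if clan_a is not None and clan_a == clan_b:
                continue  # same clan, so the overlap is tolerated
            conflicts.append((hit_a, hit_b))
    return conflicts


# Hypothetical data: the clan id, sequence name and coordinates are made up.
clan_of = {"RNaseP_bact_a": "RNaseP_clan", "RNaseP_bact_b": "RNaseP_clan"}
hits = [
    ("RNaseP_bact_a", "SEQ_EXAMPLE", 100, 450),
    ("RNaseP_bact_b", "SEQ_EXAMPLE", 120, 460),
    ("tRNA", "SEQ_EXAMPLE", 440, 510),
]

conflicts = forbidden_overlaps(hits, clan_of)
for hit_a, hit_b in conflicts:
    print(hit_a[0], "overlaps", hit_b[0])
```

In this toy input the two RNase P hits overlap each other without complaint, while both of them are flagged against the tRNA hit because it sits outside the clan.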

New views on the website

John ‘Dwalin’ Tate has spent a bit of time making our website even more prettierer than it was. On the secondary structure tab users now have an option to use VARNA [1]. We have also updated the RNAplot derived secondary structures so that the 5′ and 3′ ends are labelled, and we have added a download option for coloured HTML alignments generated by colorstock [2]. John has also upgraded the species tab so users can select different levels of the NCBI hierarchy and generate corresponding alignments.

One of the more exciting new features is something we have wanted to add to the site for a while. This is the ability to view the features surrounding an Rfam annotation. We investigated a number of approaches for doing this, including mutating the Pfam domain graphics. However, this brought us uncomfortably close to creating yet another genome/sequence browser, something we are somewhat reluctant to do. Fortunately for us, the European Nucleotide Archive (ENA) team have added a sequence feature viewer to their site which we can use for our purposes. This feature is particularly useful for looking at cis-regulatory elements such as riboswitches eg. Magnesium Sensor. Simply go to the Sequences tab from a family page and click on the little “features” icon next to an interesting EMBL sequence accession and you’ll be able to view features within 500 nucs either side of the region. You can switch to the next Rfam region in the list by clicking on next in the top righthand corner of the image.

Ontologies, structures and DAS

For the majority of the families we have added cross-links to both the sequence ontology and the gene ontology. Many of these were provided by our good friends at fRNAdb. One day when we get more time we may start feeding more ncRNA terms back into the ontologies. Until then the mapping is rather coarse-grained and closely related to our existing types. We have also improved the mappings between Rfam and PDB. These can be easily found now on the browse page under Structures. There is now a grand total of 43 Rfam families with 633 corresponding PDB entries.

Prasad ‘Nori’ Gunasekaran has updated the Rfam DAS sources. These are particularly useful for quickly grabbing all the Rfam annotations for your favourite bacteria or archaea eg. Salmonella typhi. This could be a nice way for the many genome browsers out there to get the most up-to-date RNA annotations. If your favourite genome is split across multiple EMBL accessions then unfortunately the job of mapping between EMBL coordinates and genome coordinates is still an incredibly challenging task (I’ve moaned about this before here and here and I’ll probably keep moaning until this becomes easy or I stop trying to do it). Perhaps one day if DAS lives up to its full potential this will become much easier. We have done our best to give genome coordinates for the latest genome versions in GFF3 format from our genome pages. However, some notable assemblies are missing from the public archives (eg. Saccharomyces cerevisiae, Mus musculus and probably many others), turning a difficult task into an absolute nightmare.
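If you do grab one of the GFF3 files, something like the following Python sketch is enough to pull the coordinates out. It assumes nothing beyond the standard nine tab-separated GFF3 columns. The function name and the example line (sequence name, source, feature type, score and attributes) are invented for illustration and are not copied from a real Rfam file.

```python
from urllib.parse import unquote


def parse_gff3(text):
    """Parse GFF3 text into a list of dicts, one per feature line."""
    records = []
    for line in text.splitlines():
        if line.startswith("##FASTA"):
            break  # anything after this directive is sequence, not features
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments and other directives
        columns = line.split("\t")
        if len(columns) != 9:
            continue  # not a well-formed feature line
        seqid, source, ftype, start, end, score, strand, phase, attrs = columns
        attributes = {}
        for pair in attrs.split(";"):
            if "=" in pair:
                key, value = pair.split("=", 1)
                # GFF3 percent-encodes reserved characters in attributes
                attributes[unquote(key)] = unquote(value)
        records.append({
            "seqid": seqid,
            "source": source,
            "type": ftype,
            "start": int(start),
            "end": int(end),
            "score": score,
            "strand": strand,
            "phase": phase,
            "attributes": attributes,
        })
    return records


# Hypothetical feature line, made up for this example.
example = (
    "##gff-version 3\n"
    "genome_example\tRfam\tncRNA\t1200\t1275\t55.2\t+\t.\tID=hit1;Name=tRNA\n"
)

records = parse_gff3(example)
for rec in records:
    print(rec["seqid"], rec["start"], rec["end"], rec["strand"], rec["attributes"])
```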

rfam_scan.pl has been rejuvenated

Sam ‘Balin’ Griffiths-Jones has reworked rfam_scan.pl. It now has an illustrious version number of 1.0. A major new feature is that a lot of the Bioperl dependencies have been removed so the memory footprint is much lower. The file usage is also much improved: users now just need to download three files (not including the Bioperl and blast dependencies) and can run it with a relatively simple command:

rfam_scan.pl -blastdb Rfam.fasta Rfam.cm yourseqfile.fa

This assumes you’ve indexed Rfam.fasta for running with NCBI-BLAST. If you spot any problems, please do report them to Sam ASAP. We are also using rfam_scan.pl behind the scenes for the searches submitted via the Rfam website. If there are any problems here we’d also like to hear about them.

Final words

There have been innumerable other tweaks to the data and the website; you can read more detail in the README. Please do shout loudly if you see any problems.

Many thanks to all the people who have contributed to this release of Rfam! In particular, I had some invaluable feedback at the wonderful Benasque RNA Workshop last year.

Posted by: Paul ‘Bombur’ Gardner.

References

[1] Darty K, Denise A, Ponty Y. VARNA: Interactive drawing and editing of the RNA secondary structure. Bioinformatics. 2009 Aug; 25(15):1974-5.

[2] Bendaña YR, Holmes IH. Colorstock, SScolor, Ratón: RNA alignment visualization tools. Bioinformatics. 2008 Feb; 24(4):579-80.

This entry was posted on April 16, 2010 at 10:29 am and is filed under Releases.

Tags: rfam

8 Responses to “The Rfam decimal release is out!”

Peter Says:

April 16, 2010 at 12:17 pm
Hej Paul, congrats to 10.0!!
Peter

ppgardne Says:

April 16, 2010 at 12:28 pm
Tak Peter!

Eric Nawrocki Says:

April 18, 2010 at 12:23 am
Paul, Congrats on 10.0! But don’t wait too long for Guinness to call, RDP has a 1.4 million deep SSU rRNA alignment (which as you know is about 20X longer than tRNA)! A goal for 11.0: SSU and LSU…

ppgardne Says:

April 18, 2010 at 6:43 am
Thanks Eric. I’ll get onto that application. 😉
Our partial SSU model is now the only model that seriously breaks our pipeline. The cmalign memory usage goes through the roof. I had to split the sequences up, cmalign them, and then use rounds of “cmalign --merge”. It still required a pretty hefty big-memory machine. Nice work on cmalign, BTW, I was amazed we managed to get it done. I shudder to think what a full-length SSU and LSU model would do; I suspect a specialist solution would be better. I’m not in too much of a hurry since there are soo many specialised rRNA resources out there eg. RDB, SILVA, Greengenes, the European ribosomal RNA database, Comparative RNA website and probably lots of others. My main concern is all the other families we don’t cover.

- Eric Nawrocki Says:
  
  April 18, 2010 at 12:09 pm
  Yikes! I see your point — I was completely ignorant of how many “other” families there are! BTW, the cmalign memory problem is solved in the development code, the size of output alignments will now be limited only by hard-disk space.
  
  - ppgardne Says:
    
    April 18, 2010 at 12:46 pm
    Whoopee! So for the next releases nothing should break the pipeline. Nice!
- ppgardne Says:
  
  April 20, 2010 at 11:45 am
  There’s a relevant discussion regarding the rRNA DBs on Jonathan Eisen’s blog:
  http://phylogenomics.blogspot.com/2010/04/yuck-rrna-databases-restrictions-on.html
  
Plans for Rfam 2010-2011 « Xfam Blog Says:

June 30, 2010 at 3:58 pm
[…] one will be building the published RNA Biology articles that we ran out of time to do for the Rfam 10.0 […]
