The Rfam decimal release is out!

April 16, 2010

It has been some time since Rfam has had a public update. However, the dwarves under Mount Rfam have been busy for some time working on Rfam 10.0. The good elves at the Eddy lab some time ago forged Infernal 1.0. Since then we have been busily adding Infernal 1.0 support and individually re-thresholding each Rfam family. This has been an epic adventure largely championed by our number one dwarf Jennifer ‘Gimlet’ Daub. She has since left and is now enjoying sunny NZ (hopefully these two events aren’t correlated). We also mapped all the families and searched a new version of rfamseq based on EMBL 100.

The net result is a 178% increase in the number of regions that Rfam covers. Set against the rather modest 40% increase in rfamseq size (thankfully new sequencing technologies are primarily being deployed for re-sequencing efforts, phew), that is pretty impressive IMHO. This has also meant that some of our alignments are enormous! The tRNA full alignment, for example, now has more than 1 million sequences in it (1,101,833 to be exact). I’m waiting to hear from the Guinness Book of Records people since I have a suspicion that this may be the deepest alignment of all time. Unfortunately this alignment is too large to reasonably fetch from the website; you can find it on the FTP site in Rfam.full (warning: this is a 1GB file).

If anyone wants to repeat this effort then it’ll take them roughly 5 CPU months to calibrate the models, 1 CPU year to run blast, 3 CPU years to run cmsearch and 15 days to run cmalign, all on a modern high-spec computer. Many thanks to Guy Coates, James Beal and Peter Clapham for assistance getting the Sanger infrastructure talking nicely to MPI and the granularity correct for our searches.
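For a feel of the overall bill, the figures above can be added up. This is only a back-of-envelope sketch: the 30-day month, the 365-day year and the treatment of the 15 cmalign days as CPU days are my assumptions, and the perfect-scaling estimate is idealised.

```python
# Back-of-envelope tally of the compute figures quoted above, in CPU days.
# Assumptions (not stated in the post): 30-day months, 365-day years,
# and the 15 days of cmalign counted as CPU days.
DAYS_PER_MONTH = 30
DAYS_PER_YEAR = 365

CPU_DAYS = {
    "calibrate": 5 * DAYS_PER_MONTH,
    "blast": 1 * DAYS_PER_YEAR,
    "cmsearch": 3 * DAYS_PER_YEAR,
    "cmalign": 15,
}


def total_cpu_days(steps):
    """Sum the CPU days over all pipeline steps."""
    return sum(steps.values())


def wall_clock_days(steps, cores):
    """Idealised wall-clock days, assuming perfect scaling across cores."""
    return total_cpu_days(steps) / cores


if __name__ == "__main__":
    total = total_cpu_days(CPU_DAYS)
    print(f"total: {total} CPU days (~{total / DAYS_PER_YEAR:.1f} CPU years)")
    print(f"on 100 cores: ~{wall_clock_days(CPU_DAYS, 100):.1f} days")
```

Under those assumptions cmsearch accounts for roughly two thirds of the total, which is why getting the job granularity right on the cluster mattered so much.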

New alignment formats!

Another major change our power users may notice is that there is a new set of seed and full alignments available on the website. These use descriptive “species labels” for sequence names rather than the traditional EMBL accessions and coordinates that we normally provide. The provenance of the sequence data is maintained by using “#=GS” tags from Stockholm format to give a mapping back to EMBL accessions. This has been our most requested feature, including from some of the greats in the field like Eric Westhof, Mike Zuker and Sean Eddy. Hopefully our users approve of the new alignments.
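If you want to recover the accessions from one of the new alignments, something like the following should do it. It is a minimal sketch that relies only on the general Stockholm layout for per-sequence annotation (“#=GS name feature text”). The use of the “AC” feature, the species labels and the accessions in the example are made up for illustration, so check a real file before trusting it.

```python
from collections import defaultdict


def parse_gs_annotations(stockholm_text):
    """Collect '#=GS' lines as {sequence_name: {feature: [values]}}."""
    annotations = defaultdict(lambda: defaultdict(list))
    for line in stockholm_text.splitlines():
        if not line.startswith("#=GS"):
            continue
        # Layout: "#=GS <seqname> <feature> <free text>"
        parts = line.split(None, 3)
        if len(parts) < 4:
            continue  # malformed or empty annotation, skip it
        _, seqname, feature, value = parts
        annotations[seqname][feature].append(value.strip())
    return annotations


# Hypothetical example: labels, accessions and the "AC" feature are
# illustrative assumptions, not copied from a real Rfam alignment.
EXAMPLE = """# STOCKHOLM 1.0
#=GS Homo_sapiens.1 AC ACC00001.1/10-82
#=GS Escherichia_coli.1 AC ACC00002.1/200-275
Homo_sapiens.1      GCAUCG
Escherichia_coli.1  GCAUGG
//
"""

if __name__ == "__main__":
    for name, features in parse_gs_annotations(EXAMPLE).items():
        print(name, "->", features["AC"])
```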

100 Rfam clans!

I’ve added 100 clans (actually 99 but close enough, eh). These are explicit relationships between families that share a common ancestor but are too diverged to be reasonably aligned. For example, the RNase P clan contains the related families RNase_MRP, RNaseP_arch, RNaseP_bact_a, RNaseP_bact_b and RNaseP_nuc. Internally, this means we can ignore our “no-overlap rule” for families within the same clan. In the past that rule forced some of these families to carry artificially high thresholds, just to avoid overlapping a related but divergent family.
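To make the clan exemption concrete, here is a toy version of the check. It is a sketch of the idea only, not our pipeline code: the Hit record, the clan lookup table and the clan label “RNaseP” are hypothetical names invented for the example.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """A family match on a sequence (hypothetical record layout)."""
    seq_acc: str
    start: int
    end: int
    family: str


def overlaps(a, b):
    """True if two hits share at least one position on the same sequence."""
    if a.seq_acc != b.seq_acc:
        return False
    a_lo, a_hi = sorted((a.start, a.end))  # tolerate reverse-strand coords
    b_lo, b_hi = sorted((b.start, b.end))
    return a_lo <= b_hi and b_lo <= a_hi


def violates_no_overlap(a, b, clan_of):
    """Overlaps between different families count unless both share a clan."""
    if a.family == b.family or not overlaps(a, b):
        return False
    clan_a = clan_of.get(a.family)
    clan_b = clan_of.get(b.family)
    same_clan = clan_a is not None and clan_a == clan_b
    return not same_clan


# Families as listed above; the clan label itself is a placeholder.
CLAN_OF = {
    "RNase_MRP": "RNaseP",
    "RNaseP_arch": "RNaseP",
    "RNaseP_bact_a": "RNaseP",
    "RNaseP_bact_b": "RNaseP",
    "RNaseP_nuc": "RNaseP",
}

if __name__ == "__main__":
    mrp = Hit("SEQ1", 100, 400, "RNase_MRP")
    nuc = Hit("SEQ1", 150, 450, "RNaseP_nuc")
    trna = Hit("SEQ1", 380, 452, "tRNA")
    print(violates_no_overlap(mrp, nuc, CLAN_OF))   # same clan, allowed
    print(violates_no_overlap(nuc, trna, CLAN_OF))  # no shared clan
```

With a check like this, related families in a clan can each keep a sensible threshold without one of them being pushed up purely to dodge its cousins.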

New views on the website

John ‘Dwalin’ Tate has spent a bit of time making our website even more prettierer than it was. On the secondary structure tab users now have an option to use VARNA [1]. We have also updated the RNAplot-derived secondary structures so that the 5′ and 3′ ends are labelled, and we have added a download option for coloured HTML alignments generated by colorstock [2]. John has also upgraded the species tab so users can select different levels of the NCBI hierarchy and generate corresponding alignments.

One of the more exciting new features is something we have wanted to add to the site for a while: the ability to view the features surrounding an Rfam annotation. We investigated a number of approaches for doing this, including mutating the Pfam domain graphics. However, this brought us uncomfortably close to creating yet another genome/sequence browser, something we are somewhat reluctant to do. Fortunately for us, the European Nucleotide Archive (ENA) team have added a sequence feature viewer to their site which we can use for our purposes. This feature is particularly useful for looking at cis-regulatory elements such as riboswitches, e.g. Magnesium Sensor. Simply go to the Sequences tab from a family page and click on the little “features” icon next to an interesting EMBL sequence accession and you’ll be able to view features within 500 nucs either side of the region. You can switch to the next Rfam region in the list by clicking on “next” in the top right-hand corner of the image.
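The window shown is simply the Rfam region plus 500 nucleotides on each side. A minimal sketch of that arithmetic follows; the clamping to the ends of the sequence and the handling of reverse-strand coordinates are my assumptions about sensible behaviour, not a description of the ENA viewer.

```python
def feature_window(start, end, flank=500, seq_length=None):
    """Return (window_start, window_end) around a region, 1-based inclusive.

    Assumptions: coordinates may be given in either order (reverse strand),
    and the window is clamped to the sequence when its length is known.
    """
    lo, hi = sorted((start, end))
    window_start = max(1, lo - flank)
    window_end = hi + flank
    if seq_length is not None:
        window_end = min(seq_length, window_end)
    return window_start, window_end


if __name__ == "__main__":
    print(feature_window(1200, 1350))
    print(feature_window(100, 250, seq_length=600))
```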

Ontologies, structures and DAS

For the majority of the families we have added cross-links to both the Sequence Ontology and the Gene Ontology. Many of these were provided by our good friends at fRNAdb. One day when we get more time we may start feeding more ncRNA terms back into the ontologies. Until then the mapping is rather coarse-grained and closely related to our existing types. We have also improved the mappings between Rfam and PDB. These can now be found easily on the browse page under Structures. There is now a grand total of 43 Rfam families with 633 corresponding PDB entries.

Prasad ‘Nori’ Gunasekaran has updated the Rfam DAS sources. These are particularly useful for quickly grabbing all the Rfam annotations for your favourite bacteria or archaea, e.g. Salmonella typhi. This could be a nice way for the many genome browsers out there to get the most up-to-date RNA annotations. If your favourite genome is split across multiple EMBL accessions then unfortunately the job of mapping between EMBL coordinates and genome coordinates is still an incredibly challenging task (I’ve moaned about this before here and here and I’ll probably keep moaning until this becomes easy or I stop trying to do it). Perhaps one day if DAS lives up to its full potential this will become much easier. We have done our best to give genome coordinates for the latest genome versions in GFF3 format from our genome pages. However, some notable assemblies are missing from the public archives (e.g. Saccharomyces cerevisiae, Mus musculus and probably many others), turning a difficult task into an absolute nightmare.

rfam_scan.pl has been rejuvenated

Sam ‘Balin’ Griffiths-Jones has reworked rfam_scan.pl. It now has an illustrious version number of 1.0. A major new feature is that a lot of the BioPerl dependencies have been removed, so the memory footprint is much lower. The file usage is also much improved: users now just need to download three files (not including the BioPerl and BLAST dependencies) and can run it with a relatively simple command:

rfam_scan.pl -blastdb Rfam.fasta Rfam.cm yourseqfile.fa

This assumes you’ve indexed Rfam.fasta for running with NCBI-BLAST. If you spot any problems, please do report them to Sam ASAP. We are also using rfam_scan.pl behind the scenes for the searches submitted via the Rfam website. If there are any problems here we’d also like to hear about them.
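If you would rather script the two steps, the sketch below just builds the command lines and prints them. The rfam_scan.pl call mirrors the one above. The indexing call assumes the legacy NCBI-BLAST formatdb tool with a nucleotide database (-p F), which is my guess at a typical setup rather than anything rfam_scan.pl mandates; adjust it to whatever BLAST you have installed.

```python
import shlex


def build_index_command(fasta="Rfam.fasta"):
    """Command to index the Rfam FASTA file for BLAST.

    Assumption: legacy NCBI-BLAST 'formatdb'; '-p F' marks the database
    as nucleotide rather than protein.
    """
    return ["formatdb", "-i", fasta, "-p", "F"]


def build_scan_command(seqfile, fasta="Rfam.fasta", cm="Rfam.cm"):
    """Command line for rfam_scan.pl, as given in the post."""
    return ["rfam_scan.pl", "-blastdb", fasta, cm, seqfile]


if __name__ == "__main__":
    # Only prints the commands; hand them to subprocess.run(cmd, check=True)
    # once both tools are installed and on your PATH.
    for cmd in (build_index_command(), build_scan_command("yourseqfile.fa")):
        print(" ".join(shlex.quote(part) for part in cmd))
```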

Final words

There have been innumerable other tweaks to the data and the website; you can read more detail in the README. Please do shout loudly if you see any problems.

Many thanks to all the people who have contributed to this release of Rfam! In particular, I had some invaluable feedback at the wonderful Benasque RNA Workshop last year.

Posted by: Paul ‘Bombur’ Gardner.

References

[1] Darty K, Denise A, Ponty Y. VARNA: Interactive drawing and editing of the RNA secondary structure. Bioinformatics. 2009 Aug; 25(15):1974-5.

[2] Bendaña YR, Holmes IH. Colorstock, SScolor, Ratón: RNA alignment visualization tools. Bioinformatics. 2008 Feb; 24(4):579-80.

8 Responses to “The Rfam decimal release is out!”

  1. Peter Says:

    Hej Paul, congrats to 10.0!!
    Peter

  2. Eric Nawrocki Says:

    Paul, Congrats on 10.0! But don’t wait too long for Guinness to call, RDP has a 1.4 million deep SSU rRNA alignment (which as you know is about 20X longer than tRNA)! A goal for 11.0: SSU and LSU…

  3. ppgardne Says:

    Thanks Eric. I’ll get onto that application. ;-)
    Our partial SSU model is now the only model that seriously breaks our pipeline. The cmalign memory usage goes through the roof. I had to split the sequences up, cmalign them, and then use rounds of “cmalign --merge”. It still required a pretty hefty big-memory machine. Nice work on cmalign, BTW, I was amazed we managed to get it done. I shudder to think what a full-length SSU and LSU model would do; I suspect a specialist solution would be better. I’m not in too much of a hurry since there are soo many specialised rRNA resources out there, e.g. RDB, SILVA, Greengenes, the European ribosomal RNA database, Comparative RNA website and probably lots of others. My main concern is all the other families we don’t cover.


  4. [...] one will be building the published RNA Biology articles that we ran out of time to do for the Rfam 10.0 [...]

