Updates to Rfam in 2008

February 11, 2009

2008 was a big year for Rfam. I’ll outline a few of the major achievements below. Many of these developments are also discussed in the latest Rfam papers [1,2].

Two new releases

Rfam 9.0 was primarily an update to the underlying sequence database. The previous version of RFAMSEQ (8.1) was based upon EMBL 84, which was three years out of date. The updated sequences also included, for the first time, the whole genome shotgun (WGS) and environmental sequence (ENV) divisions of EMBL. This resulted in an approximately 10-fold increase in the number of sequences that Rfam now searches.

There were also many improvements to existing Rfam models. These included cleaning up the most glaring problems in the consensus secondary structure annotations and alignments. However, the biggest changes came from Jennifer Daub iterating more than 370 of the smaller Rfam families. In an iteration, good sequences from the full alignment are pushed into the seed alignment, and the new model built from that seed is used to re-search RFAMSEQ.
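To make the iteration idea concrete, here is a minimal sketch of the loop in Python. It is a toy, not the Rfam pipeline: the `search` callback, the bit-score `threshold` and the `max_rounds` cap are all assumptions made for illustration. In practice, choosing which sequences count as "good" involves curator judgement rather than a single cutoff.

```python
from typing import Callable, Dict, Iterable, Set


def iterate_family(
    seed: Iterable[str],
    search: Callable[[Set[str]], Dict[str, float]],
    threshold: float,
    max_rounds: int = 5,
) -> Set[str]:
    """Grow a seed set by promoting high-scoring hits from each search round.

    `search` is a hypothetical stand-in for "build a model from the seed and
    search the sequence database"; it returns {sequence_id: score}.
    """
    current = set(seed)
    for _ in range(max_rounds):
        hits = search(current)
        # Promote hits that score well and are not already in the seed.
        promoted = {sid for sid, score in hits.items() if score >= threshold}
        promoted -= current
        if not promoted:
            break  # the seed has stopped growing, so stop iterating
        current |= promoted
    return current


def toy_search(seed: Set[str]) -> Dict[str, float]:
    """Invented scores: sequence "d" is only found once "b" is in the seed."""
    scores = {"a": 50.0, "b": 42.0, "c": 12.0}
    if "b" in seed:
        scores["d"] = 40.0
    return scores


final_seed = iterate_family({"a"}, toy_search, threshold=30.0)
print(sorted(final_seed))
```

The toy search shows why iterating helps: a richer seed gives a model that can find members the original seed missed.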

Rfam 9.1 was primarily an update to the number of models in Rfam. Our summer student, Adam Wilkinson, built a phenomenal number of new families. The new families included 408 miRNAs, 144 CD box snoRNAs, 50 H/ACA box snoRNAs, 65 CRISPRs, 57 cis-regulatory elements, 30 sRNAs, 5 riboswitches, 1 partial LSU rRNA and 9 miscellaneous RNA genes.

A new website

John Tate spent a great deal of time creating the snazzy new Rfam website. The old site was not going to scale well with the masses of new sequences and families that we planned to add. Also, by updating the Rfam website we can now use a common core of website code for both Rfam and its sister site Pfam.

Rfam and Wikipedia

There have been a lot of interesting developments with Rfam and Wikipedia. I’ll write a separate blog entry containing more details later. Rfam is one of the few bioinformatics databases to draw the textual annotations of its entries directly from Wikipedia. In the last release we moved away from writing individual entries for each family, which had resulted in hundreds of very short and repetitive entries. Instead we chose to use generic entries covering several families. For example, most of the new miRNA and snoRNA families point only to the miRNA and snoRNA Wikipedia entries respectively. Those individual families that become notable will eventually get their own entries. This decision meant that many more new Rfam families could be built, as writing these short articles does take a significant amount of our time.

Mappings to PDB

In Rfam 9.1 we now provide mappings between Rfam and PDB. These mappings are still experimental and will take a little while to mature. Rob Finn, as part of his iPfam work, kindly pulled out all the RNA sequences from PDB for us. Since many of these sequences were truncated relative to the Rfam models, I had to use a combination of BLAT and Infernal to map from these sequences to the Rfam families. This only affects 20 Rfam families, which are listed below:

Type Family
Cis-reg IRE
Cis-reg Gammaretro_CES
Cis-reg HIV-1_DIS
Cis-reg;frameshift_element HIV_FE
Cis-reg;IRES IRES_Cripavirus
Cis-reg;riboswitch TPP
Cis-reg;riboswitch glmS
Cis-reg;riboswitch Purine
Cis-reg;riboswitch SAM
Gene SRP_bact
Gene SRP_euk_arch
Gene;ribozyme RNaseP_bact_a
Gene;ribozyme RNaseP_bact_b
Gene;ribozyme HDV_ribozyme
Gene;rRNA 5S_rRNA
Gene;rRNA PK-G12rRNA
Gene;rRNA 5_8S_rRNA
Gene;rRNA SSU_rRNA_5
Gene;tRNA tRNA
Intron Intron_gpI

Genome mappings

We now have more than 1,140 genome annotations in GFF3 format. To produce these we used the mappings between EMBL and genomes provided on the EMBL website. However, these mappings have not kept up with ENSEMBL genomes (probably due to genome assemblers not submitting their assemblies to EMBL). This caused a significant headache for us. Jen Daub, with a lot of help from the ENSEMBL people, managed to get mappings for four ENSEMBL species (human, mouse, cow & C. elegans). Otherwise, I’m afraid your favourite large eukaryotic genome may be missing from our list. Until these problems get sorted out, Rfam simply doesn’t have the human resources to hunt down assemblies and sequences of genomes not in EMBL.
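If you want to work with the GFF3 files programmatically, a feature line is just nine tab-separated columns, the last of which holds `key=value` attributes separated by semicolons. Below is a minimal sketch of a parser in Python. The example line is invented for illustration and is not copied from a real Rfam file, so the source, type and attribute names Rfam actually uses may differ.

```python
from typing import Dict, NamedTuple, Optional
from urllib.parse import unquote


class Gff3Feature(NamedTuple):
    seqid: str
    source: str
    type: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive
    score: str  # kept as text because "." means "no score"
    strand: str
    phase: str
    attributes: Dict[str, str]


def parse_gff3_line(line: str) -> Optional[Gff3Feature]:
    """Parse one GFF3 feature line; return None for comments and blank lines."""
    if not line.strip() or line.startswith("#"):
        return None
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    attributes: Dict[str, str] = {}
    for pair in cols[8].split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            # GFF3 percent-encodes reserved characters in attributes.
            attributes[unquote(key)] = unquote(value)
    return Gff3Feature(
        cols[0], cols[1], cols[2], int(cols[3]), int(cols[4]),
        cols[5], cols[6], cols[7], attributes,
    )


# Hypothetical example line, made up for this sketch.
EXAMPLE = "chr1\tRfam\tncRNA\t100\t172\t65.3\t+\t.\tID=example.1;Name=tRNA"

feature = parse_gff3_line(EXAMPLE)
print(feature.seqid, feature.start, feature.end, feature.attributes["Name"])
```

Keeping the score as text avoids a crash on lines where the score column is just a dot.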

Rfam does DAS

Prasad Gunasekaran has made a DAS source from Rfam annotations of EMBL sequences. You can see an example annotation here.

References

[1] P. P. Gardner, J. Daub, J. G. Tate, E. P. Nawrocki, D. L. Kolbe, S. Lindgreen, A. C. Wilkinson, R. D. Finn, S. Griffiths-Jones, S. R. Eddy & A. Bateman (2008): Rfam: updates to the RNA families database. Nucl. Acids Res.

[2] J. Daub, P. P. Gardner, J. Tate, D. Ramskold, M. Manske, W. G. Scott, Z. Weinberg, S. Griffiths-Jones & A. Bateman (2008): The RNA WikiProject: Community annotation of RNA families. RNA.
