We are happy to announce a new release of Rfam. Version 12.1, based on the same sequence dataset as Rfam 12.0, features 23 new families, a revised clan competition algorithm, a publicly accessible MySQL database, and many website fixes.
Improved algorithm for clan competing
One of the main improvements in this release is the removal of redundancy between families belonging to the same clan. Rfam clans (introduced in Rfam 10.0) group together related Rfam families. For example, the large ribosomal subunit families from Bacteria, Archaea, and Eukaryota are all found in the LSU clan. Families from the same clan often match the same sequence region, so Rfam 12.0 introduced a ‘clan competition’ procedure that keeps only the best match. However, in some cases clan competition did not work well (or simply was not applied), and some sequences were annotated with multiple Rfam families belonging to the same clan. For example, some human sequences appeared in both the Protozoa and Metazoa SRP families. The clan competition algorithm has now been revised and the redundant matches have been eliminated from the database.
As a result, the number of annotated ncRNA regions drops significantly in release 12.1, from ~19 million in release 12.0 to ~9 million. Most of this drop comes from removing redundant annotations between the largest families found in clans, such as rRNA subunits and SRPs.
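The general idea of clan competition can be illustrated with a minimal Python sketch. This is not the Rfam implementation. It assumes that hits are ranked by bit score alone, that any shared position counts as an overlap, and that coordinates are inclusive with start <= end. The `Hit` class, the function names, and the example values are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Hit:
    """One family match on a sequence (hypothetical record layout)."""
    seq_acc: str
    start: int            # inclusive, assumed start <= end
    end: int              # inclusive
    family: str
    clan: Optional[str]   # None if the family belongs to no clan
    bit_score: float


def overlaps(a: Hit, b: Hit) -> bool:
    """True if the two hits share at least one position (assumed criterion)."""
    return a.start <= b.end and b.start <= a.end


def clan_compete(hits: List[Hit]) -> List[Hit]:
    """Keep only the best-scoring hit among overlapping hits of the same clan."""
    kept = []
    by_group = defaultdict(list)
    for hit in hits:
        if hit.clan is None:
            kept.append(hit)  # families outside clans never compete
        else:
            by_group[(hit.seq_acc, hit.clan)].append(hit)

    for group in by_group.values():
        winners = []
        # Visit hits from best to worst; a hit survives only if it does not
        # overlap a better hit from the same clan on the same sequence.
        for hit in sorted(group, key=lambda h: h.bit_score, reverse=True):
            if not any(overlaps(hit, w) for w in winners):
                winners.append(hit)
        kept.extend(winners)

    return sorted(kept, key=lambda h: (h.seq_acc, h.start))


if __name__ == "__main__":
    # Illustrative values only: one region matched by two SRP families.
    example = [
        Hit("seq1", 100, 400, "Metazoa_SRP", "SRP", 250.0),
        Hit("seq1", 110, 390, "Protozoa_SRP", "SRP", 80.0),
    ]
    for hit in clan_compete(example):
        print(hit.family, hit.start, hit.end)
```

In this toy case only the higher-scoring Metazoa SRP match survives, which mirrors the kind of redundancy described above. A production procedure would likely use a more careful overlap threshold and ranking rule.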
Pseudoknots are back
Previously, pseudoknots were removed from Rfam consensus secondary structures for technical reasons, but thanks to Eric Nawrocki they have now been restored. You can get a file with seed alignments, including those with pseudoknots, from our FTP archive.
We added 23 new families (Rfam identifiers RF02545 to RF02567), bringing the total number of families to 2473. We always welcome suggestions for new Rfam families, so feel free to get in touch.
Public MySQL Database
To make it easier to query the data in ways that the website does not support, we have created a public MySQL database with the latest Rfam data. This replaces the retired BioMart interface. You can now explore the data using SQL queries from your favourite MySQL client, or programmatically using custom scripts. The MySQL database will be updated with each Rfam release. For details on how to access the database, along with example queries, please have a look at the database documentation.
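As a rough illustration of the kind of question SQL makes easy, the sketch below counts families per clan. To stay self-contained it runs against an in-memory SQLite stand-in rather than the public MySQL server. The table and column names (`family`, `clan_membership`, `rfam_acc`, `rfam_id`, `clan_acc`) and all rows are assumptions made for this example, so check the database documentation for the real schema and connection details.

```python
import sqlite3

# Count how many families each clan contains. Table and column names are
# assumptions for illustration, not a statement of the real Rfam schema.
QUERY = """
SELECT cm.clan_acc, COUNT(*) AS n_families
FROM clan_membership AS cm
JOIN family AS f ON f.rfam_acc = cm.rfam_acc
GROUP BY cm.clan_acc
ORDER BY n_families DESC, cm.clan_acc
"""


def build_toy_db() -> sqlite3.Connection:
    """Create an in-memory stand-in database with placeholder rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE family (rfam_acc TEXT PRIMARY KEY, rfam_id TEXT);
        CREATE TABLE clan_membership (clan_acc TEXT, rfam_acc TEXT);
        """
    )
    conn.executemany(
        "INSERT INTO family VALUES (?, ?)",
        [("FAM_A", "family_a"), ("FAM_B", "family_b"), ("FAM_C", "family_c")],
    )
    conn.executemany(
        "INSERT INTO clan_membership VALUES (?, ?)",
        [("CLAN_1", "FAM_A"), ("CLAN_1", "FAM_B"), ("CLAN_2", "FAM_C")],
    )
    conn.commit()
    return conn


def families_per_clan(conn: sqlite3.Connection):
    """Return (clan_acc, n_families) rows, largest clans first."""
    return conn.execute(QUERY).fetchall()


if __name__ == "__main__":
    for clan_acc, n_families in families_per_clan(build_toy_db()):
        print(clan_acc, n_families)
```

The same SELECT statement could be sent from any MySQL client or driver once the real table names are substituted. Only the connection step would differ.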
Thanks to those of you who reported problems with R-chie diagrams on the Rfam website. These visualisations are useful for exploring seed alignments together with consensus secondary structures. In many Rfam diagrams, however, the alignment columns did not match the corresponding secondary structure arcs, as can be seen in the interactive before/after comparison. For examples of updated R-chie diagrams, have a look at tRNAs or a pseudoknot from Yellow Fever virus.
Release 12.1 is the first release produced by the new EMBL-EBI RNA Resources team, which is led by Anton Petrov and includes Ioanna Kalvari (Software Developer) and Joanna Argasinska (Biocurator). The team, housed within Rob Finn’s group and jointly coordinated by Alex Bateman, is responsible for both the Rfam and RNAcentral databases. This shared responsibility will lead to tighter coordination between the two resources in the future.
Plans for Rfam 13.0
We plan to release the next version of Rfam towards the end of 2016. The key feature of release 13.0 will be a new genome-centric organisation of the underlying sequence database. Instead of searching the WGS and STD datasets from ENA, we will search a representative set of genomes based on the reference proteomes computed by UniProt. This will ensure comprehensive annotation of the most important genomes, enable a faster release cycle, and decrease redundancy in the sequence data. Work on release 13.0 is already underway, so stay tuned!