We are happy to announce that the new release of Rfam, version 14.0, is now available! Rfam 14.0 is built using a set of over 14,000 non-redundant, representative, and complete genomes (~60% more than in Rfam 13.0). It includes 105 new families, a new genome browser hub, and ORCiD integration. Read on to find out more.
What’s new
Data updates
Rfam 14.0 has 60% more genomes than Rfam 13.0
The latest Rfam version comes on the heels of Rfam 13.0, a release that marked the transition to the genome-centric sequence database. In Rfam 13.0, the Rfam sequence database – Rfamseq – was composed of 8,364 non-redundant, representative and complete genomes derived from a genome collection maintained by UniProt. Now with the addition of 6,519 new species, the number of annotated genomes in Rfam 14.0 increased by ~60% to 14,434 genomes.
Most of the genomes from Rfam 13.0 are also present in Rfam 14.0, although a small number (385, ~4.6%) were removed or replaced. Most of the new genomes come from Bacteria and Viruses.
Since Rfamseq was updated, this is a major Rfam release (14.0). Expect a minor release (14.1) in Fall 2018 with new RNA families but no changes in Rfamseq.
More genomes, less redundancy
The switch to annotating complete genomes enabled us to resolve data redundancy at the levels of sequence and species. For instance, in Rfam 12.3 the cumulative length of all human sequences was eight times the total length of the human genome assembly hg38 used in Rfam 13.0 (note how the width of the green line narrows from Rfam 12 to Rfam 13).
Redundancy reduction at the species level relies on UniProt’s reference proteome collection, which is the result of manual curation and computational refinement. It includes well-studied model organisms and other species of high interest to the scientific community, carefully selected to represent taxonomic diversity. Rfam uses the same collection of genomes both for annotation with existing RNA families and for building new ones.
105 new families
The number of RNA families reached 2,791 with the addition of 105 new families from 8 RNA types. The breakdown of new families by ncRNA type in release 14.0 is shown below:
- 65 Gene; sRNA;
- 17 Gene; antisense;
- 11 Gene; snRNA; snoRNA; HACA-box;
- 5 Gene; snRNA; snoRNA; CD-box;
- 4 Cis-reg; thermoregulator;
- 1 Cis-reg;
- 1 Cis-reg; leader;
- 1 Cis-reg; riboswitch;
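As a quick sanity check, the breakdown above can be tallied in a few lines of Python. This is only an illustrative sketch: the counts are copied from the list, while the variable names and the grouping by top-level category (the text before the first semicolon) are our own choices for the example.

```python
# Counts of new families per RNA type, copied from the list above.
new_families = {
    "Gene; sRNA;": 65,
    "Gene; antisense;": 17,
    "Gene; snRNA; snoRNA; HACA-box;": 11,
    "Gene; snRNA; snoRNA; CD-box;": 5,
    "Cis-reg; thermoregulator;": 4,
    "Cis-reg;": 1,
    "Cis-reg; leader;": 1,
    "Cis-reg; riboswitch;": 1,
}

total = sum(new_families.values())   # total number of new families
n_types = len(new_families)          # number of distinct RNA types

# Group by the top-level category (the text before the first semicolon).
by_category = {}
for rna_type, count in new_families.items():
    category = rna_type.split(";")[0].strip()
    by_category[category] = by_category.get(category, 0) + count

print(total, n_types, by_category)
```

The totals agree with the figures quoted in the text: 105 new families across 8 RNA types.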
New 3D structures matching Rfam families
Two more Rfam families that previously did not match any 3D structure now have experimentally determined 3D structures:
| Rfam family | PDB structure |
| --- | --- |
| RF00382 DnaX ribosomal frameshifting element | 5UQ7, 5UQ8 – 70S ribosome complex with dnaX mRNA stemloop and E-site tRNA (“in” and “out” conformation) |
| RF00375 HIV primer binding site | 6B19 – Architecture of HIV-1 reverse transcriptase initiation complex core |
Search for Rfam entries in PDBe
Rfam regularly updates the mapping between Rfam families and the experimentally determined 3D structures available in PDB. With PDBe’s Advanced Search release in May 2018, PDBe users can take advantage of these mappings by searching with Rfam family names or accessions. For instance, a search using the tRNA accession RF00005 currently retrieves 502 entries.
Another powerful new feature is the interactive 3D visualization of the Rfam domains on PDBe entry pages using LiteMol. This is achieved by highlighting the RNA sequence on the corresponding structure, for example tRNA (RF00005) in structure 4UJD. Additional information can be found in the PDBe blog post.
Increased GO term coverage
Non-coding RNA functional annotation was improved with the addition of 133 GO terms to 81 families since the last release. The GO annotations are propagated to RNAcentral sequences and submitted to the GOA system, as described in GOREF:0000115.
Genome browser hub
The genome-centric sequence database enabled us to generate the genome browser track hub directly from the genome annotations, without an additional mapping step. For now, the track hub lists only species supported by UCSC; that number could grow if we incorporate all genomes with chromosome-level assemblies. There are currently 14 species, including human (hg38), chicken (galGal5), pig (susScr11) and mouse (mm10). Upon request, we will also be happy to provide .bed and .bigBed files for other genomes in our collection, depending on the level of the assembly.
Explore Rfam annotations in UCSC Genome Browser by clicking on these links:
or configure the track manually by editing the URL:
The track hub can also be attached to Ensembl using these instructions and the following URL:
ftp://ftp.ebi.ac.uk/pub/databases/Rfam/14.0/genome_browser_hub/hub.txt
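For readers who prefer to build the UCSC link themselves, here is a minimal Python sketch. It assumes the UCSC Genome Browser's `hgTracks` page with its `db` and `hubUrl` query parameters, and it reuses the hub address given above. Whether UCSC accepts this exact ftp address unchanged is not confirmed here, and the helper name `ucsc_hub_link` is made up for the example.

```python
from urllib.parse import urlencode

# Hub address quoted in this post.
HUB_URL = "ftp://ftp.ebi.ac.uk/pub/databases/Rfam/14.0/genome_browser_hub/hub.txt"

# Assumption: the standard UCSC Genome Browser entry point.
UCSC_HGTRACKS = "https://genome.ucsc.edu/cgi-bin/hgTracks"


def ucsc_hub_link(assembly, hub_url=HUB_URL):
    """Return a UCSC URL that opens `assembly` with the track hub attached."""
    return UCSC_HGTRACKS + "?" + urlencode({"db": assembly, "hubUrl": hub_url})


# Assembly names mentioned in this post.
for assembly in ("hg38", "galGal5", "susScr11", "mm10"):
    print(ucsc_hub_link(assembly))
```

Changing the `assembly` argument is the programmatic equivalent of editing the URL by hand.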
Get credit for Rfam families using ORCiD
It is now possible for Rfam authors to get credit for their contributions by claiming family accessions directly to their ORCiD profiles. This new feature was enabled by the Claim to ORCID functionality provided by EBI Search. The process has three simple steps. First, log in to your ORCiD account and use your ORCiD iD to search for associated entries. After the search, select all or a subset of the listed entries and click the Claim to ORCID button at the top of the page. The example provided is of a snoRNA family (RF02725) claimed by the Rfam curator Joanna Argasinska directly to her ORCiD profile.
New Rfam paper
We recently published a new paper in Current Protocols in Bioinformatics covering a broad spectrum of Rfam use cases, with examples that use our website as well as Infernal to annotate nucleotide sequences. There is also a section dedicated to MySQL, with tips and tricks on restoring previous versions of the database and useful examples of forming complex queries.
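To give a flavour of the kind of query such a section deals with, here is a small self-contained sketch. It is not taken from the paper: it uses an in-memory SQLite database as a stand-in for MySQL, the table name `family` and the columns `rfam_acc` and `type` are assumptions about the schema, and the rows are made-up toy data.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for the Rfam MySQL database.
conn = sqlite3.connect(":memory:")

# Assumed, simplified schema: one row per family with its RNA type.
conn.execute("CREATE TABLE family (rfam_acc TEXT PRIMARY KEY, type TEXT)")

# Made-up toy rows; the accessions are NOT real Rfam families.
toy_rows = [
    ("TOY00001", "Gene; sRNA;"),
    ("TOY00002", "Gene; sRNA;"),
    ("TOY00003", "Gene; antisense;"),
    ("TOY00004", "Cis-reg; riboswitch;"),
]
conn.executemany("INSERT INTO family VALUES (?, ?)", toy_rows)

# Count families per RNA type, largest groups first (ties sorted by type).
query = """
    SELECT type, COUNT(*) AS n_families
    FROM family
    GROUP BY type
    ORDER BY n_families DESC, type ASC
"""
counts = conn.execute(query).fetchall()
for rna_type, n in counts:
    print(rna_type, n)

conn.close()
```

The same `GROUP BY` pattern should carry over to a MySQL client, provided the table and column names are adjusted to the actual schema.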
Get in touch
Follow our new Twitter account RfamDB to be the first to find out about new Rfam families, and don’t hesitate to raise a GitHub issue or email us if you have any questions.
You can also meet the Rfam team in person at a hands-on tutorial at the upcoming ECCB 2018 conference in Athens.