Posts Tagged ‘rfam’

Rfam release 14.8

May 30, 2022

We are happy to announce the release of Rfam version 14.8. This release includes 48 updated and 25 new microRNA families; 10 families updated based on 3D structure annotations; 4 new families and updates to 5 existing families for Hepatitis C virus; a new xrRNA family from Potato leafroll virus; and the integration of LitScan, a literature scanner powered by RNAcentral and Europe PubMed Central. Read on for details on these changes.

Updated microRNA families

Rfam, miRBase and RNAcentral have been working to synchronize miRNA families between all three resources. We are now happy to report that we have completed 77% of the current miRNA families, covering >1300 miRNAs of the 1700 alignments provided to us by Professor Sam Griffiths-Jones at miRBase. The 400 remaining families need an extended review process, and we will be working on their integration in future releases.

Summary of the miRBase and Rfam synchronisation project. We estimate it is 78% complete.

Families updated with 3D structure information

Rfam has been updating families using 3D structure information. This project aims to improve Rfam families through the addition of pseudoknots, base pairs, and annotations of other structural elements by inspecting 3D structures. In this release we have updated 10 families:

  • Virus families:
    • RF00507-Coronavirus frameshifting stimulation element
    • RF01047-HBV RNA encapsidation signal epsilon
  • Riboswitch families:
    • RF01763-Guanidine-III riboswitch, also known as ykkC-III riboswitch
    • RF01734-Fluoride riboswitch
    • RF01704-Glutamine-II riboswitch, previously known as Downstream peptide
    • RF01750-ZMP/ZTP riboswitch
    • RF01739-Glutamine riboswitch
    • RF02683-NiCo riboswitch
    • RF01826-SAM-V riboswitch
  • Ribozyme family:

We have added pseudoknots to 7 of the 10 updated families and the updated secondary structure diagrams from 5 of these families are shown below.

Examples of families reviewed and updated with 3D information. Pseudoknot structures (pk) were added to each of these five families based on a review of the corresponding 3D structures.

New and updated families of Hepatitis C virus

In release 14.8 we have created 4 new families, updated 5 existing families, and deleted 4 virus families. These changes are the result of an ongoing collaboration between Professor Manja Marz of the European Virus Bioinformatics Center and Rfam. The Marz group provided Rfam with a curated alignment of representative sequences covering the entire Hepatitis C virus genome. We used this alignment to create new Rfam families and to update or remove existing ones. The new and updated families are summarized in the table below. We have deleted RF00469, which was merged into RF00260 during review. We have also deleted families RF02585 to RF02588, which have no support in the genomic alignment.

| Rfam ID | Name | Description |
| --- | --- | --- |
| RF00061 | IRES_HCV | Hepatitis C virus internal ribosome entry site |
| RF00260 | HepC_CRE | Hepatitis C virus (HCV) cis-acting replication element (CRE) |
| RF00620 | HCV_ARF_SL | Hepatitis C alternative reading frame stem-loop |
| RF00468 | HCV_SLVII | Hepatitis C virus stem-loop VII |
| RF00481 | HCV_X3 | Hepatitis C virus 3’X element |
| RF04218 | HCV_5BSL1 | Hepatitis C virus stem-loop I |
| RF04219 | HCV_J750 | J750 non-coding RNA (containing SL761 and SL783) |
| RF04220 | HCV_SL588 | SL588 non-coding RNA |
| RF04221 | HCV_SL669 | SL669 non-coding RNA |

As part of this project we have reviewed and updated Coronavirus, Flavivirus and HCV families, and we are working on adding RNA families from other viruses, such as Filoviridae (e.g. Ebolavirus) and Rhabdoviridae (e.g. Rabies viruses).

xrRNAs in Potato leafroll virus

We want to thank Professor Quentin Vicens for sharing the alignment of Potato leafroll virus exoribonuclease-resistant RNA (PLRV-xrRNA). PLRV-xrRNA is a non-coding RNA that blocks the progression of 5′ to 3′ exoribonucleases using only a folded RNA element. This family is described in RF04222.

LitScan

RNAcentral has recently developed LitScan, a tool to automatically connect non-coding RNA sequences, genes and families to the literature that discusses them. In this release we have integrated the LitScan widget into Rfam. The widget is now shown in the new ‘Publications’ tab on all Rfam families.

Example of LitScan for the mir-17 microRNA precursor family. Publications can be sorted by citation, journal, year of publication and other criteria.

Please reach out to us with feedback on the widget, or if you would like to use the LitScan widget on your site!

Rfam release 14.7

December 21, 2021

We are happy to announce the latest Rfam release, version 14.7. The release includes 121 updated microRNA families, 4 new families, and a redesigned Rfam-PDB mapping pipeline that provides weekly updates as new RNA 3D structures become available. Read on to find out more or explore the data in Rfam.

Updated microRNA families

As part of the Rfam-miRBase synchronisation project discussed in the Rfam 14 paper, we continue revising microRNA families in Rfam using the data provided by miRBase. This release includes 121 updated families, such as mir-6 (RF00143) and mir-22 (RF00653). The following five microRNA families have been deleted from Rfam as the corresponding entries were removed from miRBase due to lack of evidence: mir-1937 (RF01942), mir-1280 (RF02013), mir-353 (RF00800), mir-720 (RF02002), and mir-2973 (RF02096). We would like to thank Lisanne Knol (University of Edinburgh) for bringing the first two of these families to our attention.

We estimate that Rfam is now approximately 60% in sync with miRBase, with additional families to be released in future versions of Rfam. You can view the full list of updated families here or browse all microRNAs in Rfam.

New families

The release includes two new hairpin ribozyme families, Hairpin-meta1 (RF04190) and Hairpin-meta2 (RF04191), recently reported by the Weinberg lab. The hairpin ribozymes were discovered in metatranscriptome data and are proposed to occur in circular RNA genomes of as yet uncharacterised organisms. The new families join the original Hairpin ribozyme family (RF00173).

Based on a recent paper we also created two additional bacterial families, the icd-II ncRNA motif (RF04189) and the carA ncRNA motif (RF04192). We would like to thank Ken Brewer (Yale University) for providing the data.

Weekly updates of PDB structures matching Rfam families

For many years Rfam maintained a mapping between Rfam families and experimentally determined RNA 3D structures available in PDB. However, this mapping lagged behind the weekly PDB updates as it was only updated with Rfam releases. 

The newly implemented pipeline analyses the data every week and makes the data available on the Rfam website and in a new section of the FTP archive that contains a preview of the upcoming release. The new, up-to-date mapping also improves the ability to search PDBe using Rfam and is a key part of an ongoing project to review all Rfam families with known 3D structures. 

Currently there are 127 Rfam families with experimentally determined RNA 3D structures in the PDB. For example, a recent paper describing how a viral RNA hijacks host machinery produced several structural models (7SAM, 7SC6, and 7SCQ) showing the pseudoknot of the tRNA-like structure; these are now mapped to Rfam family RF01084. Follow Rfam on Twitter to be the first to hear when new RNA families are linked to 3D structures.

Other improvements

  • We continue improving the Gene Ontology (GO) terms associated with Rfam families, and in this release 412 families have been updated to use the latest GO terms. Keeping the GO terms up to date is important as Rfam is used for automatic assignment of GO terms in RNAcentral and other resources. The GO terms are shown in the Curation tab of each family and are also available in the rfam2go file.
  • The Rfam.seed_tree.tar.gz file hosted on the FTP archive has been fixed. We would like to thank Christian Anthon (University of Copenhagen) for reporting the problem.
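
As a rough illustration of how the rfam2go mappings might be consumed downstream, the sketch below parses lines in the external2go layout commonly used for GO mapping files (`Rfam:RF00001 5S_rRNA > GO:term name ; GO:0003735`). The exact layout of the rfam2go file is an assumption here, and the sample lines are written for illustration rather than copied from the file, so check the copy in the FTP archive before relying on this.

```python
from collections import defaultdict

def parse_rfam2go(lines):
    """Map Rfam accessions to GO ids, assuming the external2go layout."""
    mapping = defaultdict(list)
    for line in lines:
        line = line.strip()
        # Skip blank lines and '!' comment/header lines.
        if not line or line.startswith("!"):
            continue
        left, _, right = line.partition(" > ")
        if not right or ";" not in right:
            continue  # not a mapping line
        # Left side: "Rfam:RF00001 5S_rRNA" -> accession "RF00001".
        accession = left.split()[0].split(":", 1)[-1]
        # Right side: "GO:term name ; GO:0003735" -> id after the last ';'.
        go_id = right.rsplit(";", 1)[1].strip()
        mapping[accession].append(go_id)
    return dict(mapping)

# Illustrative sample lines (assumed layout, not copied from the real file).
sample = [
    "!version: example",
    "Rfam:RF00001 5S_rRNA > GO:structural constituent of ribosome ; GO:0003735",
    "Rfam:RF00001 5S_rRNA > GO:ribosome ; GO:0005840",
    "Rfam:RF00005 tRNA > GO:triplet codon-amino acid adaptor activity ; GO:0030533",
]
result = parse_rfam2go(sample)
```

The resulting dictionary can then be joined against family hits to transfer GO terms to annotated sequences.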

Get in touch

As always, we look forward to hearing from you if you have any feedback or suggestions for Rfam. Please feel free to email us or get in touch on Twitter.

Happy holidays from the Rfam team!

This is the first release produced by Emma Cooke and Blake Sweeney who have joined Rfam in the second half of 2021. The Rfam team wishes you a happy holiday season! We look forward to creating lots more families in 2022 and working towards the next major Rfam release, Rfam 15.0!

Welcome Blake as the new RNA Resources Project Leader

September 13, 2021

RNAcentral and Rfam recently completed the search for a project leader to succeed Anton Petrov. Blake Sweeney has been appointed and recently started in his new role. Some of you may know Blake as the current RNAcentral bioinformatician, a role in which he has been running the bioinformatic pipeline and speaking at conferences. With a PhD in RNA bioinformatics and a decade of experience developing RNA databases, including 4.5 years at RNAcentral, Blake is perfectly positioned to take the projects forward. The official handover date is in May, but until then Anton and Blake will be working together to ensure a smooth transition.


Additionally, we are now hiring a new RNAcentral bioinformatician. If you are interested in applying please see: https://bit.ly/38L86Xe. If you have any questions or comments about the RNA resources please contact Blake Sweeney at bsweeney@ebi.ac.uk.

Rfam 14.6 is out

July 27, 2021

We are happy to announce a new release of Rfam (14.6) that includes 121 new microRNA families, a new ribozyme family, 8 new small RNA families found in Bacteroides, as well as 10 additional families with updated secondary structures using 3D structural information. Read on for more information or explore the data in Rfam.

New microRNA families

The new release includes 121 new microRNA families bringing the total number of microRNA families in Rfam to 1,506. This work is part of the ongoing collaboration with miRBase that aims to synchronise microRNAs across miRBase, Rfam, and RNAcentral. Browse Rfam microRNAs or find out more about the microRNA project.

We also resolved an issue with 6 microRNA families that were missing a covariance model on the website and in the FTP archive. Many thanks to Dr Christian Anthon (University of Copenhagen) for pointing out this problem!

Updating families using information from 3D structures

Following on from Rfam 14.5, we updated the secondary structure of 10 additional families with 3D information, including 6 riboswitches, 1 ribozyme, 1 telomerase, 1 localization element and 1 microRNA precursor.

In some families, the updated structure is substantially changed. For example, the central part of the flavin mononucleotide (FMN) riboswitch is now organised by several additional base pairs and two pseudoknots (pK). As a result, the updated structure is more compact and more accurately reflects the experimentally determined 3D structures.

Seven of the updated families include newly annotated pseudoknots, which is an important improvement that helps better model long-distance non-nested interactions. We will continue reviewing and updating the families with 3D structure in future releases. The full list of the updated families can be found in the table below.

| Family | PDB structures | New pK |
| --- | --- | --- |
| RF00008 – Hammerhead ribozyme (type III) | 2QUS_A, 2QUS_B, 2QUW_A, 2QUW_B, 2QUW_C, 2QUW_D, 5DI2_A, 5DI4_A, 5DQK_A, 5EAO_A, 5EAQ_A | 1 |
| RF00025 – Ciliate telomerase RNA | 6D6V | 2 |
| RF00050 – FMN riboswitch (RFN element) | 3F2Q, 3F2T, 3F2W, 3F2X, 3F2Y, 3F3O | 2 |
| RF00059 – TPP riboswitch (THI element) | 2CKY_A, 2CKY_B, 2GDI_X, 2GDI_Y, 2HOJ_A, 2HOK_A, 2HOL_A, 2HOM_A, 2HOO_A, 2HOP_A, 3D2G_A, 3D2G_B, 3D2V_A, 3D2V_B, 3D2X_A, 3D2X_B, 3K0J_E, 3K0J_F, 4NYA_A, 4NYA_B, 4NYB_A, 4NYC_A, 4NYD_A, 4NYG_A | |
| RF00207 – K10 transport/localisation element (TLS) | 2KE6, 2KUR, 2KUU, 2KU, 2KUW | |
| RF00174 – Cobalamin riboswitch | 4GMA, 4GXY | 1 |
| RF00380 – ykoK leader / M-box riboswitch | 2QBZ_X, 3PDR_X, 3PDR_A | 1 |
| RF01689 – AdoCbl variant RNA | 4FRN_A, 4FRN_B, 4FRG_X, 4FRG_B | 1 |
| RF01831 – THF riboswitch | 3SD3, 3SUH, 3SUX, 3SUY, 4LVV, 4LVW, 4LVX, 4LVY, 4LVZ, 4LW0 | 1 |
| RF02095 – mir-2985-2 microRNA precursor | 2L3J | |

Hovlinc ribozyme

A recent paper by Chen Y et al. 2021 describes Hovlinc, a new type of self-cleaving ribozyme found in humans and other hominids. Hovlinc was detected in a very long intergenic noncoding RNA in hominids (hominin vlincRNA-located) using a genome-wide approach designed to discover self-cleaving ribozymes. The functions of vlincRNA and the hovlinc ribozyme remain unclear. Hovlinc joins 3 known classes of small, self-cleaving ribozymes found in humans: (1) Mammalian CPEB3 ribozyme, (2) Hammerhead ribozyme and (3) B2 and ALU retrotransposons. We would like to thank Dr Fei Qi (Huaqiao University) for providing the Hovlinc alignment. View hovlinc family in Rfam.

New Bacteroides families

In a recent article by Ryan et al. 2020, the authors report a high-resolution transcriptome map of the model organism Bacteroides thetaiotaomicron, a common bacterium of the human gut. They recognize 269 non-coding RNA (ncRNA) candidates, of which nine were validated. Eight of these ncRNAs were integrated as new families:

  1. RF04177 – Bacteroides sRNA BTnc201
  2. RF04178 – Bacteroides sRNA BTnc005
  3. RF04179 – Bacteroides sRNA BTnc049
  4. RF04180 – Bacteroides sRNA BTnc231
  5. RF04181 – rteR sRNA
  6. RF04182 – GibS sRNA
  7. RF04183 – Bacteroidales small SRP
  8. RF04184 – Bacteroides sRNA BTnc060

In addition, the RF01693 – Bacteroidales-1 family was renamed to 6S-Bacteroidales RNA. Bacteroidales-1 was first reported by Weinberg et al. 2010 in a comparative genomics analysis of genome and metagenome sequences. It was identified downstream of L20 ribosomal subunit genes in the order Bacteroidales. Ryan et al. 2020 report that this sRNA is a 6S RNA homolog in Bacteroides thetaiotaomicron. Rfam also has two other 6S RNA families: RF00013 (6S/SsrS RNA) and RF01685 (6S-Flavo). We would like to thank Dr Lars Barquist (University of Würzburg) for providing the data.

Welcome to Emma!

A few weeks ago Emma Cooke joined the Rfam team as Software Developer and is already busy working on new features. Emma has studied Genetics and Software Engineering, and her previous roles have focused on release verification pipelines, software testing, and developing for cloud environments. Please join us in welcoming Emma to the Rfam community and stay tuned for new announcements based on her work.

Get in touch

As always, we would be very happy to hear from you if you have any feedback or suggestions for Rfam. Please feel free to email us or get in touch on Twitter.

Join Rfam team!

March 19, 2021

We are looking for a Software Developer to join the Rfam team and contribute to the world’s largest database of RNA families. The post holder will be responsible for keeping Rfam up-to-date, developing Rfam Cloud, and improving the website. More information about the position can be found at https://bit.ly/rfam-software-developer

Apply now or help spread the word. Closing date: April 20th, 2021.

Rfam 14.5 is live

March 18, 2021

We are happy to announce a new Rfam release, version 14.5, featuring 112 updated microRNA families and 10 families improved using the 3D structure information. Read on for details or explore 3,940 RNA families at rfam.org.

Updated microRNA families

As described in our most recent paper, we are in the process of synchronising microRNA families between Rfam and miRBase. In this release 112 of the existing microRNA families have been updated with new manually curated seed alignments from miRBase, new gathering thresholds, and new family members found in the Rfamseq sequence database. 

In total, 852 new microRNA families have been created (356 in release 14.3 and 496 in release 14.4) and 152 existing families have been updated (40 in release 14.3 and 112 in release 14.5). As the miRBase-Rfam synchronisation is about 50% complete, additional microRNA families will be made available in the upcoming releases. You can view a list of the 112 updated families or browse all 1,385 microRNA families on the Rfam website. 

Updating families using information from 3D structures

We are also in the process of reviewing the families with the experimentally determined 3D structures in order to compare the Rfam annotations with the 3D models. Our goal is to incorporate the 3D information into Rfam seed alignments as many families have been created before the corresponding 3D structures became available. We manually review each PDB structure, verify basepair annotations from matching PDBs, and obtain a more consistent consensus secondary structure model. 

In multiple cases we were able to add missing base pairs and pseudoknots. For example, in the SAM riboswitch (RF00162), we added two base pairs in the base of helix P2, corrected a basepair in P3 and added four basepairs in P4 (one in the base of the helix and three near the terminal loop). The updated consensus secondary structure presents a more accurate central core annotation with more structure in the four-way junction.

SAM riboswitch secondary structure before and after the updates

In another example, one base pair was added in P1 and another one in P3 of the SAM-I/IV variant riboswitch (RF01725). We also corrected a base pair in P3 and included a P4 stem loop that was not integrated before.

SAM-I/IV riboswitch secondary structure before and after the updates

The SAM-I/IV riboswitch is characterised by a SAM binding core conformation similar to that of the SAM riboswitch, but it lacks the k-turn motif in P2 that is found in SAM riboswitches. These two families also have different pseudoknot interactions: the SAM riboswitch forms a pseudoknot between a P2 loop and the stem of P3, while the SAM-I/IV riboswitch contains a pseudoknot between a P3 loop and the 5′ region.

The first 10 families updated with 3D information include:

  1. RF00162 – SAM riboswitch
  2. RF01725 – SAM-I/IV variant riboswitch
  3. RF00164 – Coronavirus 3’ stem-loop II-like motif (s2m)
  4. RF00013 – 6S / SsrS RNA
  5. RF00003 – U1 spliceosomal RNA
  6. RF00015 – U4 spliceosomal RNA
  7. RF00442 – Guanidine-I riboswitch
  8. RF00027 – let-7 microRNA precursor
  9. RF01054 – preQ1-II (pre queuosine) riboswitch and
  10. RF02680 – preQ1-III riboswitch

We will continue reviewing the families with known 3D structure in future releases.

Other family updates

Initially reported by Aspegren et al. 2004, Class I (RF01414) and Class II (RF01571) RNAs were found in social amoeba Dictyostelium discoideum and later on investigated in more detail by Avesson et al. 2011. Now a new report from Kjellin et al. 2021 presents a comprehensive analysis of the Class I RNA genes in dictyostelid social amoebas. Based on this study, we updated the Dicty Class I RNA family RF01414 with a new seed alignment and removed the family RF01571, thus merging both families into one. We thank Dr Jonas Kjellin (Uppsala University) for suggesting this update.

Goodbye Ioanna!

Rfam 14.5 is the last release prepared by Dr Ioanna Kalvari, who will be leaving the team at the end of March 2021. We would like to take the opportunity to thank Ioanna for her contributions over the last 5.5 years and wish her the best of luck in the future!

Get in touch

As always, we would be very happy to hear from you if you have any feedback or suggestions for Rfam. Please feel free to email us or get in touch on Twitter.

Rfam 14.4 is live

December 18, 2020

The last Rfam release of 2020 is now live! Rfam 14.4 contains 496 new microRNA families developed in collaboration with miRBase. Find out about the microRNA project in our new NAR paper and let us know if you have any feedback.

Rfam 14.3

September 15, 2020

Rfam 14.3 includes 356 new and 40 updated microRNA families, as well as 12 new and 2 updated Flavivirus RNAs. Find out the details in our new NAR paper and get in touch if you have any feedback.

Rfam Coronavirus Special Release

April 27, 2020

In response to the SARS-CoV-2 outbreak, the Rfam team prepared a special release dedicated to the Coronavirus RNA families. Release 14.2 includes 10 new and 4 revised families that can be used to annotate the SARS-CoV-2 and other Coronavirus genomes with RNA families.

View the data at rfam.org/covid-19 ➡️

New Coronavirus Rfam families

In collaboration with the Marz group and the EVBC, we created 10 families representing the entire 5’- and 3’- untranslated regions (UTRs) for Alpha-, Beta-, Gamma-, and Delta- coronaviruses. A specialised set of alignments for the subgenus Sarbecovirus is also provided, including the SARS-CoV-1 and SARS-CoV-2 UTRs. 

The families are based on a set of high-quality whole genome alignments produced with LocARNA and reviewed by expert virologists. Note that the Alpha-, Beta-, and Deltacoronavirus alignments and structures were refined based on the literature, while the Gammacoronavirus families are based on prediction alone due to the lack of experimental data.


| Virus | 5’ UTR | 3’ UTR |
| --- | --- | --- |
| Alphacoronavirus | aCoV-5UTR (RF03116) | aCoV-3UTR (RF03121) |
| Betacoronavirus | bCoV-5UTR (RF03117) | bCoV-3UTR (RF03122) |
| Sarbecovirus and SARS-CoV-2 | Sarbecovirus-5UTR (RF03120) | Sarbecovirus-3UTR (RF03125) |
| Gammacoronavirus | gCoV-5UTR (RF03118) | gCoV-3UTR (RF03123) |
| Deltacoronavirus | dCoV-5UTR (RF03119) | dCoV-3UTR (RF03124) |

Previously, only fragments of the UTRs were found in Rfam. In particular, two families were superseded by the new whole-UTR alignments and removed from Rfam:

  • RF00496 (Coronavirus SL-III cis-acting replication element): This family represented a single stem that is now found in aCoV-5UTR and bCoV-5UTR families.
  • RF02910 (Coronavirus_5p_sl_1_2): This family represented two stems from aCoV-5UTR.

The new families are grouped into 2 clans: CL00116 and CL00117 for the 5’ and 3’ UTRs, respectively. The clans can be used with the Infernal cmscan program to automatically select the highest scoring match from a set of related families (see the Rfam chapter in CPB to learn more).
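
The idea behind selecting the best match within a clan can be sketched in a few lines of Python. This is a simplified illustration, not Infernal's implementation: the `Hit` tuple, the overlap rule, and the example sequence name, coordinates and scores are all assumptions made for the sketch. When two hits from families in the same clan overlap on the same sequence, only the higher-scoring one is kept.

```python
from typing import NamedTuple

class Hit(NamedTuple):
    family: str   # e.g. "RF03117"
    clan: str     # e.g. "CL00116"
    seq: str      # sequence name
    start: int    # 1-based, start <= end
    end: int
    score: float  # bit score

def overlaps(a: Hit, b: Hit) -> bool:
    """True if two hits share at least one position on the same sequence."""
    return a.seq == b.seq and a.start <= b.end and b.start <= a.end

def clan_competition(hits):
    """Keep the best-scoring hit among overlapping hits from the same clan."""
    kept = []
    # Visit hits from best to worst so that winners are accepted first.
    for hit in sorted(hits, key=lambda h: h.score, reverse=True):
        loses = any(k.clan == hit.clan and overlaps(k, hit) for k in kept)
        if not loses:
            kept.append(hit)
    return kept

# Hypothetical hits on a made-up genome; scores are illustrative only.
hits = [
    Hit("RF03117", "CL00116", "genome1", 1, 290, 180.0),
    Hit("RF03120", "CL00116", "genome1", 1, 300, 250.0),
    Hit("RF03125", "CL00117", "genome1", 29500, 29850, 300.0),
]
best = clan_competition(hits)
```

In this made-up example the bCoV-5UTR hit (RF03117) is dropped because the overlapping Sarbecovirus-5UTR hit (RF03120) from the same clan scores higher, while the 3’ UTR hit is unaffected because it belongs to a different clan.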

Revised Coronavirus families

We also reviewed and updated the existing Coronavirus Rfam families.

| Family | What was updated? | Is it found in SARS-CoV-2? |
| --- | --- | --- |
| RF00182 Coronavirus packaging signal | The seed alignment and consensus secondary structure were updated to include the 4 conserved repeat units. | This RNA element is found only in Embecovirus, so it is not present in SARS-CoV-2 and other Sarbecoviruses. |
| RF00507 Coronavirus frameshifting stimulation element | The seed alignment was expanded. | This RNA is present in SARS-CoV-2. |
| RF00164 Coronavirus s2m RNA | The seed alignment was expanded. There is a 3D structure for SARS-CoV-1 which can be used for understanding the s2m in SARS-CoV-2. | This RNA is present in the 3′ UTR of SARS-CoV-2. |
| RF00165 Coronavirus 3’-UTR pseudoknot | The seed alignment was expanded. The pseudoknot is annotated in the 3’ UTR families but since it is mutually exclusive with the 3’-UTR consensus structure, it is also provided as a separate family. | This RNA is present in the 3′ UTR of SARS-CoV-2. |

Where to get the data 

You can download the covariance models, as well as seed alignments for the coronavirus families from the corresponding family pages or from a dedicated folder on the FTP archive.

How to use the data

You can download the covariance models and annotate viral sequences with these RNA models using Infernal. See Rfam help for examples.
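
One way to script this step is to assemble the Infernal command lines in Python. The sketch below only builds the argument lists and does not run anything. The file names are placeholders, and the option set (`--cut_ga`, `--rfam`, `--nohmmonly`, `--fmt 2`, `--tblout`, `--clanin`) is the one commonly recommended for Rfam-based annotation, so check it against the documentation for your Infernal version.

```python
import shlex

def cmpress_command(cm_file):
    """cmscan needs the CM file to be indexed with cmpress first."""
    return ["cmpress", cm_file]

def cmscan_command(cm_file, fasta, tblout, clanin=None):
    """Build a cmscan call that applies the curated gathering thresholds."""
    cmd = [
        "cmscan",
        "--cut_ga",      # use each family's gathering threshold
        "--rfam",        # Rfam-style filter settings for speed
        "--nohmmonly",   # always use the full covariance model
        "--fmt", "2",    # tabular format that includes clan/overlap columns
        "--tblout", tblout,
    ]
    if clanin:
        # Clan membership file, used to flag overlapping hits within a clan.
        cmd += ["--clanin", clanin]
    return cmd + [cm_file, fasta]

# Placeholder file names, for illustration only.
cmd = cmscan_command("coronavirus.cm", "genome.fa", "genome.tblout",
                     clanin="coronavirus.clanin")
print(shlex.join(cmd))
```

Once Infernal is installed, each list can be passed to `subprocess.run(cmd, check=True)`, running the `cmpress` command before `cmscan`.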

Inviting all Wikipedians to contribute

We revised the Wikipedia pages associated with each family, and we invite everyone to contribute to the following articles:

Acknowledgements

We would like to thank Kevin Lamkiewicz and Manja Marz (Friedrich Schiller University Jena) for providing the curated alignments for the new families as well as Eric Nawrocki (NCBI) for revising the existing Rfam entries. We also thank Ramakanth Madhugiri (Justus Liebig University Giessen) for reviewing the Coronavirus UTR alignments.


This work is part of the BBSRC-funded project to expand the coverage of viral RNAs in Rfam. More data on SARS-CoV-2 can be found on the European COVID-19 Data Portal.

Rfam 14.1 is out

January 28, 2019

We are happy to announce that a new Rfam release is now available! Rfam 14.1 includes 226 new families bringing the total number of Rfam families to 3,016. In addition, the R-scape visualisations have been updated to display pseudoknots, both manually annotated in seed alignments and predicted by R-scape (see below for details).

New families

The majority of the new families were contributed by Dr Zasha Weinberg (University of Leipzig) and were discovered by a systematic computational analysis of intergenic regions in Bacteria and metagenomic samples (see the NAR paper for more details). Many of the families come from environmental samples, so importing them into Rfam required a new procedure (described below).

This release features many families with statistically significant covariation (highlighted in green in the images below), for example Skipping-rope, Drum, and LOOT:

as well as a new unusually large, highly-structured RNA called ROOL that is found in the Firmicutes, Fusobacteria and Tenericutes phyla as well as in phages and cow rumen metagenomic samples:

Browse new families in Rfam

Analysing pseudoknots using R-scape

Developed by Dr Elena Rivas (Harvard University), R-scape is a program that detects covariation support for structural pairs in RNA alignments (see the 2017 paper by Rivas et al. in Nature Methods for more details). Starting with version 1.2.0, R-scape systematically identifies pseudoknots supported by covariation (Rivas & Eddy, in preparation). For example, here is a pseudoknot from the SAM riboswitch that is not yet annotated in the Rfam seed alignment (left) but is correctly predicted by R-scape (right):

The nucleotides forming the pseudoknot are labelled pk_1, pk_2, pk_3 and so on in the structural annotation. Each pseudoknot is shown as a separate stem in an inset, and the basepairs with significant covariation are colored green similar to the other R-scape diagrams.

We are working on adding more pseudoknot annotations to the existing families based on the evidence from R-scape, 3D structures, and scientific literature. Please let us know if your favourite RNA is missing a pseudoknot.
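
To make the notion of a pseudoknot concrete, the sketch below extracts base pairs from a consensus structure string written in WUSS-style notation, where nested pairs use brackets and pseudoknotted pairs use matching upper- and lower-case letters (`A...a`). The example string is invented for illustration and is not taken from any Rfam family; the function names are likewise hypothetical.

```python
def parse_pairs(ss_cons):
    """Return (nested, pseudoknot) base-pair lists as 0-based index tuples."""
    openers = {"<": ">", "(": ")", "[": "]", "{": "}"}
    closers = {close: open_ for open_, close in openers.items()}
    stacks = {}          # one stack per bracket type or pseudoknot letter
    nested, pseudoknots = [], []
    for i, ch in enumerate(ss_cons):
        if ch in openers:
            stacks.setdefault(ch, []).append(i)
        elif ch in closers:
            nested.append((stacks[closers[ch]].pop(), i))
        elif ch.isalpha() and ch.isupper():
            # Upper-case letter opens a pseudoknotted pair.
            stacks.setdefault(ch, []).append(i)
        elif ch.isalpha() and ch.islower():
            # Lower-case letter closes the pair opened by its upper-case twin.
            pseudoknots.append((stacks[ch.upper()].pop(), i))
        # Any other character (such as . _ - : , ~) is treated as unpaired.
    return sorted(nested), sorted(pseudoknots)

def crosses(p, q):
    """True if two pairs interleave (i < k < j < l), i.e. are non-nested."""
    (i, j), (k, l) = sorted([p, q])
    return i < k < j < l

# Invented example: a hairpin whose loop pairs with a downstream region.
ss = "<<<.AA..>>>...aa"
nested, pk = parse_pairs(ss)
```

Here the pairs marked `A`/`a` cross the bracketed helix, which is exactly the kind of non-nested interaction that a plain bracket notation cannot express.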

Using RNAcentral identifiers in Rfam seed alignments

In previous releases, every sequence in every Rfam seed alignment was required to have an INSDC identifier assigned by a sequence archive like ENA or GenBank. However, when Rfam users submit their alignments to Rfam, they often include sequences that are not yet found in ENA or GenBank, especially if the sequences come from environmental samples. For example, sequence LV_Brine_h2_0102_1073789 from the MDR-NUDIX RNA does not exist in ENA so it does not have a stable identifier and is not associated with metadata such as NCBI taxid, description, or scientific literature.

In the past Rfam replaced such sequences with closely related ones or removed them altogether which required modifying the user-submitted alignments and could result in smaller, less informative seeds missing some covariation compared to the originals. In this release we implemented a new procedure that accepts RNAcentral identifiers in Rfam seed alignments in order to preserve the manually curated alignments as much as possible.

We began by importing the sequences and metadata from a recently established ZWD database (Zasha Weinberg Database) into RNAcentral where each distinct sequence is assigned a stable identifier (URS id) and linked to a NCBI taxid, its parent ZWD alignment, and scientific literature. For example, sequence LV_Brine_h2_0102_1073789 is assigned RNAcentral id URS0000D661D6_12908 so that it can be easily tracked using RNAcentral search, API, public database, or bulk download files.

Next we replaced the ZWD identifiers with RNAcentral accessions and used the ZWD-RNAcentral alignments as seeds for new Rfam families:

Following the standard Rfam protocol, we manually selected bit-score thresholds for each family that allow reliable identification of sequences from the seed alignments and other homologs from the Rfam sequence database.

A small number of sequences still had to be removed from ZWD alignments in the following cases:

  1. If a covariance model built using the alignment could not find some of its own sequences, these unmatched sequences were removed from the alignment
  2. If a sequence scored worse than a set of random sequences that serve as control when setting bit-score thresholds, such low-scoring sequences were also removed from the alignments.
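
The two removal rules above can be summarised in a short sketch. The data structures, names, and scores are hypothetical, and comparing against the best-scoring random control is only one possible reading of the second rule. In practice the scores would come from searching the seed sequences and a set of random control sequences with the family's covariance model.

```python
def filter_seed(seed_names, cm_scores, control_scores):
    """Split seed sequences into kept and removed lists.

    seed_names     -- sequence names in the candidate seed alignment
    cm_scores      -- {name: bit score} for seed sequences the model found
    control_scores -- bit scores of random control sequences
    """
    best_control = max(control_scores)
    kept, removed = [], []
    for name in seed_names:
        score = cm_scores.get(name)
        if score is None:
            # Rule 1: the model could not find its own seed sequence.
            removed.append((name, "not found by covariance model"))
        elif score < best_control:
            # Rule 2: the sequence scores worse than the random controls.
            removed.append((name, "scores below random controls"))
        else:
            kept.append(name)
    return kept, removed

# Hypothetical names and scores, for illustration only.
seed = ["seqA", "seqB", "seqC"]
scores = {"seqA": 62.3, "seqC": 18.0}
controls = [12.5, 21.4, 19.9]
kept, removed = filter_seed(seed, scores, controls)
```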


In future releases we plan to expand the usage of RNAcentral identifiers in Rfam seed alignments.

Please note that any software that parses Rfam seed alignments and uses ENA or GenBank for metadata lookup will now need to include RNAcentral identifiers using the RNAcentral API. For more information or if you have any questions, please contact the RNAcentral team or Rfam help.
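
For software that needs to handle both kinds of identifiers, a minimal sketch of the dispatch step might look like the following. It assumes that seed sequence names have the form `accession/start-end` and that RNAcentral identifiers follow the pattern seen in URS0000D661D6_12908 ("URS", ten hexadecimal digits, an underscore, and an NCBI taxid). The function name, the coordinates, and the INSDC-style accession in the example are made up.

```python
import re

# RNAcentral ids look like URS0000D661D6_12908: "URS", ten hex digits,
# an underscore, and an NCBI taxid.
URS_PATTERN = re.compile(r"^(URS[0-9A-F]{10})_(\d+)$")

def classify_seed_name(name):
    """Classify a seed sequence name of the form 'accession/start-end'."""
    accession, _, coords = name.partition("/")
    match = URS_PATTERN.match(accession)
    if match:
        return {
            "source": "RNAcentral",
            "accession": match.group(1),
            "taxid": int(match.group(2)),
            "coords": coords,
        }
    # Anything else is assumed to be an INSDC (ENA/GenBank) accession.
    return {"source": "INSDC", "accession": accession, "taxid": None,
            "coords": coords}

# The coordinates and the INSDC-style accession below are made up.
rnacentral_hit = classify_seed_name("URS0000D661D6_12908/1-80")
insdc_hit = classify_seed_name("AB000000.1/100-180")
```

A metadata lookup can then route RNAcentral entries to the RNAcentral API and everything else to ENA or GenBank as before.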

11 more families with 3D structure

There are 11 additional Rfam families that match 3D structures bringing the total number of families with experimentally determined structures to 98 (compared with 87 in Rfam 14.0).

| Rfam family | PDB structures |
| --- | --- |
| RF00009 (RNaseP_nuc) | 6agb and 6ah3 (yeast), 6ahr and 6ahu (human) [chains A] |
| RF00025 (Telomerase-cil) | 6d6v (chain B) |
| RF00027 (let-7) | 5zal (chain C), 5zam (chain C) |
| RF00080 (yybP-ykoY) | 6cc1 (chains A and B), 6cc3 (chains A and B) |
| RF00233 (Tymo_tRNA-like) | 6mj0 (chains A and B) |
| RF00250 (mir-TAR) | 6gml (chain P) |
| RF00390 (UPSK) | 6mj0 (chains A and B) |
| RF01727 (SAM-SAH) | 6hag (chain A) |
| RF01826 (SAM_V) | 6fz0 (chain A) |
| RF02348 (tracrRNA) | 6mcb (chain B), 6mcc (chain B) |
| RF02553 (YrlA) | 6cu1 (chain A) |

Other updates

Two existing families were updated with new seed alignments from ZWD, including RF02440 (ldcC RNA) and RF02840 (Lacto-3 RNA). There is also a new clan DUF805 (CL00115) that includes DUF805 and DUF805b families.

Acknowledgements

The Rfam team would like to thank Dr Elena Rivas and Dr Zasha Weinberg for the new data, software, and feedback, as well as the organisers and participants of the 2018 Benasque RNA meeting. We would also like to thank BBSRC for funding Rfam between 2015 and 2018.

Get in touch

Follow Rfam on Twitter to find out about new Rfam families and don’t hesitate to raise a GitHub issue or email us if you have any questions.