Archive for the 'Releases' Category

Dfam 3.2 Release

July 9, 2020

Dfam is proud to announce the release of Dfam 3.2.  This release represents a significant step in the expansion of Dfam by providing early access to uncurated, de novo generated families.  As a demonstration of this new capability, we imported a set of 336 RepeatModeler generated libraries produced by Fergal Martin and Denye Ogeh at the European Bioinformatics Institute (EBI).  Also in this release, Dfam now provides family alignments to the RepeatMasker TE protein database aiding in the discovery of related families and in the classification of uncurated TEs.

Uncurated Family Support

In addition to the fully curated libraries for the model organisms human, mouse, zebrafish, worm and fly, Dfam also includes curated libraries for seven other species.  While a fully curated library is the ultimate goal, support for uncurated families has become an essential aspect of a TE resource due to the increasing rate at which new species are being sequenced and the need to have at least a simple TE masking library available.

By standardizing the storage and tracking of uncurated families, it becomes possible to use these datasets to crudely mask an assembly, provide a first approximation of the TE content, and create a starting point for community curation efforts.  Due to the redundancy and fragmentation inherent in these datasets, we do not compute genome-specific thresholds or generate genome coverage plots for these families.  The latest update to the web portal includes new interfaces for uncurated families and some existing interfaces now include an option to include/omit uncurated families.

In this release, Dfam now contains RepeatModeler de novo-produced libraries for an additional 336 species as the result of the collaboration with EBI researchers (denoted with the new uncurated accession prefix “DR”).  Notable taxa expansions include Sauropsida (lizards and birds) and fishes (bony and cartilaginous) (Table 1). Also included are Amphibia, Viridiplantae and additional species in Mammalia.

Table 1. De novo-identified TE families from additional species

SpeciesNumber (species)RetrotransposonsDNA transposonsOther
Actinopterygii (bony fishes)116275205136177006
Chondrichthyes (cartilaginous fishes)516711982273
Viridiplantae (green plants)28964121687

Aligned Protein Features

In previous versions of Dfam, hand-curated coding regions were provided for a select set of families.  The protein products of these curated sequences were placed in the RepeatMasker TE protein database for use with the RepeatProteinMask tool.  In this release we have used this database with BLASTX to produce alignments to all Dfam families including the uncurated entries.  The resulting alignments are displayed alongside the curated coding regions as the new “aligned” feature track (Figure 1).

Figure 1. Feature track and details for BLASTX alignments to TE protein database.

Website improvements

Several minor improvements have been made to the interface since the previous release.  The browse page now provides links to download the families selected by the query/filter options as HMM, EMBL or FASTA records.  The Seed tab of the Families page now displays the average Kimura divergence of the seed alignment instances to the consensus.
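For readers unfamiliar with the measure, the sketch below computes a Kimura two-parameter distance from the proportions of transition (p) and transversion (q) differences between a sequence and the consensus. It assumes the standard formula d = -1/2 ln((1 - 2p - q) sqrt(1 - 2q)); this post does not describe Dfam's exact procedure (for example, any CpG adjustment or how per-instance values are averaged), and the function name kimura_divergence is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: Kimura two-parameter distance.
#   $1 = proportion of transition differences (p)
#   $2 = proportion of transversion differences (q)
# Assumes the standard K2P formula; not necessarily Dfam's exact method.
kimura_divergence() {
    awk -v p="$1" -v q="$2" 'BEGIN {
        a = 1 - 2 * p - q
        b = 1 - 2 * q
        if (a <= 0 || b <= 0) {
            # The formula is undefined for saturated comparisons.
            print "undefined"
            exit 1
        }
        printf "%.4f\n", -0.5 * log(a * sqrt(b))
    }'
}
```

For example, `kimura_divergence 0.10 0.05` gives the distance for a sequence with 10% transitions and 5% transversions relative to the consensus.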

Pfam 33.1 is released

June 11, 2020

We are pleased to announce the release of Pfam 33.1! Some of you may have noticed that we never released Pfam 33.0 – we had initially planned to do so in March 2020, but due to the global pandemic, we redirected our efforts to updating the Pfam SARS-CoV-2 models instead (see previous blog posts Pfam SARS-CoV-2 special update and Pfam SARS-CoV-2 special update (part 2)). We have added these updated models to the Pfam 33.0 release, along with a few other families that we had built since the data for Pfam 33.0 were frozen, to create Pfam 33.1.

Pfam 33.1 contains a total of 18259 families and 635 clans. Since the last release, we have built 355 new families and killed 25 families. We regularly receive feedback from users about families or domains that are missing in Pfam, and typically add many user-submitted families at each release. We include the submitter’s name and ORCID identifier as an author of such Pfam entries. This helps people to get credit for community activities that improve molecular biology databases such as Pfam.

One such user submission was from Heli Mönttinen (University of Helsinki) who submitted a large scale clustering of virus families. Based on this clustering we added 88 new families to Pfam. 

We have also added 8 new clans since the last Pfam release. One of the new clans is the TSP1 superfamily (CL0692). Previously a single family (PF00090) attempted to identify all known TSP1 domains.  Based on structural work by Marko Hyvönen (University of Cambridge) and colleagues we have added an additional three families (PF19028, PF19030 and PF19035) to Pfam. These new families have both improved the coverage of the TSP1 domain, and better modelled the variations in disulphide bonding across the structure space.

Figure 1. Organisation of the TSP1 clan in Pfam shown as a sequence similarity network. Image taken from Xu et al.

Finally, we are very happy to welcome Sara and Lowri who are working as curators for both the Pfam and InterPro resources and are already making great contributions to the resources.

Posted by Jaina and Alex

Rfam Coronavirus Special Release

April 27, 2020

In response to the SARS-CoV-2 outbreak, the Rfam team prepared a special release dedicated to the Coronavirus RNA families. Release 14.2 includes 10 new and 4 revised families that can be used to annotate the SARS-CoV-2 and other Coronavirus genomes with RNA families.


New Coronavirus Rfam families

In collaboration with the Marz group and the EVBC, we created 10 families representing the entire 5’- and 3’- untranslated regions (UTRs) for Alpha-, Beta-, Gamma-, and Delta- coronaviruses. A specialised set of alignments for the subgenus Sarbecovirus is also provided, including the SARS-CoV-1 and SARS-CoV-2 UTRs. 

The families are based on a set of high-quality whole genome alignments produced with LocARNA and reviewed by expert virologists. Note that the Alpha-, Beta-, and Deltacoronavirus alignments and structures were refined based on the literature, while the Gammacoronavirus families are based on prediction alone due to the lack of experimental data.

 | 5’ UTR | 3’ UTR
Sarbecovirus and SARS-CoV-2 | Sarbecovirus-5UTR |

Previously, only fragments of the UTRs were found in Rfam. In particular, two families were superseded by the new whole-UTR alignments and removed from Rfam:

  • RF00496 (Coronavirus SL-III cis-acting replication element): This family represented a single stem that is now found in aCoV-5UTR and bCoV-5UTR families.
  • RF02910 (Coronavirus_5p_sl_1_2): This family represented two stems from aCoV-5UTR.

The new families are grouped into 2 clans: CL00116 and CL00117 for the 5’ and 3’ UTRs, respectively. The clans can be used with the Infernal cmscan program to automatically select the highest scoring match from a set of related families (see the Rfam chapter in CPB to learn more).
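As a rough illustration of clan competition, the sketch below post-processes a cmscan table. It assumes the scan was run with Infernal's `--fmt 2` and `--clanin` options, in which case the tabular output carries an overlap flag and hits overlapped by a better-scoring hit are marked with `=`. The file names (coronavirus.cm, coronavirus.clanin, genome.fa, hits.tblout) and the function name drop_overlapped_hits are placeholders, not names used by Rfam.

```shell
#!/bin/sh
# Assumed scan command (file names are placeholders; run cmpress on the
# covariance model file once beforehand):
#   cmscan --cut_ga --rfam --nohmmonly --fmt 2 \
#          --clanin coronavirus.clanin --tblout hits.tblout \
#          coronavirus.cm genome.fa

# Hypothetical helper: drop hits flagged "=" (overlapped by a better hit),
# keeping comment lines and all other hits.
# Caveat: a description containing " = " would also be dropped.
drop_overlapped_hits() {
    grep -v ' = ' "$1"
}
```

Running `drop_overlapped_hits hits.tblout > hits.deoverlapped.tblout` would then leave one reported match per region where several related families overlap.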

Revised Coronavirus families

We also reviewed and updated the existing Coronavirus Rfam families.

Family | What was updated? | Is it found in SARS-CoV-2?
RF00182 Coronavirus packaging signal | The seed alignment and consensus secondary structure were updated to include the 4 conserved repeat units. | This RNA element is found only in Embecovirus, so it is not present in SARS-CoV-2 and other Sarbecoviruses.
RF00507 Coronavirus frameshifting stimulation element | The seed alignment was expanded. | This RNA is present in SARS-CoV-2.
RF00164 Coronavirus s2m RNA | The seed alignment was expanded. There is a 3D structure for SARS-CoV-1 which can be used for understanding the s2m in SARS-CoV-2. | This RNA is present in the 3′ UTR of SARS-CoV-2.
RF00165 Coronavirus 3’-UTR pseudoknot | The seed alignment was expanded. The pseudoknot is annotated in the 3’ UTR families but since it is mutually exclusive with the 3’-UTR consensus structure, it is also provided as a separate family. | This RNA is present in the 3′ UTR of SARS-CoV-2.

Where to get the data 

You can download the covariance models, as well as seed alignments for the coronavirus families from the corresponding family pages or from a dedicated folder on the FTP archive.

How to use the data

You can download the covariance models and annotate viral sequences with these RNA models using Infernal. See Rfam help for examples.

Inviting all Wikipedians to contribute

We revised the Wikipedia pages associated with each family, and we invite everyone to contribute to these articles.


We would like to thank Kevin Lamkiewicz and Manja Marz (Friedrich Schiller University Jena) for providing the curated alignments for the new families as well as Eric Nawrocki (NCBI) for revising the existing Rfam entries. We also thank Ramakanth Madhugiri (Justus Liebig University Giessen) for reviewing the Coronavirus UTR alignments.

This work is part of the BBSRC-funded project to expand the coverage of viral RNAs in Rfam. More data on SARS-CoV-2 can be found on the European COVID-19 Data Portal.

Pfam SARS-CoV-2 special update

April 2, 2020

The SARS-CoV-2 pandemic has mobilised a worldwide research effort to understand the pathogen itself and the mechanism of COVID-19 disease, as well as to identify treatment options. Although Pfam already provided useful annotation for SARS-CoV-2, we decided to update our models and annotations for this virus in an effort to help the research community. This post explains what was done and how we are making the data available as quickly as possible.

What have we done?

We assessed all the protein sequences provided by UniProt via its new COVID-19 portal, identified those which lacked an existing Pfam model, and set about building models as required. In some cases we built families based on recently solved structures of SARS-CoV-2 proteins. For example, we built three new families representing the three structural domains of the NSP15 protein (Figure 1) based on the structure by Youngchang Kim and colleagues. In other cases, such as Pfam’s RNA dependent RNA polymerase family (PF00680), we took our existing family and extended its taxonomic range to ensure it included the new SARS-CoV-2 sequences.

Figure 1. The structure of NSP15 (PDB:6VWW) from Kim et al. shows the three new Pfam domains. (1) CoV_NSP15_N (PF19219) Coronavirus replicase NSP15, N-terminal oligomerisation domain in red, (2) CoV_NSP15_M (PF19216) Coronavirus replicase NSP15, middle domain in blue and (3) CoV_NSP15_C (PF19215) Coronavirus replicase NSP15, uridylate-specific endoribonuclease in green.

We have also stratified our ID nomenclature and descriptions of the families to ensure they are both correct and consistent. The majority of the family identifiers now begin with either CoV, for coronavirus specific families, or bCoV for the families which are specific to the betacoronavirus clade, which SARS-CoV-2 belongs to. We have also fixed inconsistencies in the naming and descriptions of the various non-structural proteins, using NSPx for those proteins encoded by the replicase polyprotein, and NSx for those encoded by other ORFs. We are grateful to Philippe Le Mercier from the Swiss Institute of Bioinformatics who gave us valuable guidance for our nomenclature.

Where are the data?

You can access a small HMM library (Pfam-A.SARS-CoV-2.hmm) for all the Pfam families that match the SARS-CoV-2 protein sequences on the Pfam FTP site:

You can also find a file (matches.scan) showing the matches of the models against the SARS-CoV-2 sequences in the same FTP location. These updates are not yet available on the Pfam website. We anticipate making them available in 6-8 weeks.  We hope you find our SARS-CoV-2 models useful for your research, and as always we welcome your feedback via email.

How to use this library?

This library is not compatible with the pfam_scan software that we normally recommend to reproduce Pfam matches, as this library only contains a small subset of models.  If you wish to compare these models to your own sequences, please use the following HMMER commands:

$ hmmpress  Pfam-A.SARS-CoV-2.hmm

This only needs to be performed once. Then, to compare your sequences (in a file called my.fasta) to this special Pfam profile HMM library, run:

$ hmmscan --cut_ga --domtblout matches.scan Pfam-A.SARS-CoV-2.hmm my.fasta

The --domtblout option saves the matches in a more convenient tabular form, in case you do not want to parse the standard HMMER output.
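As a minimal sketch of working with that tabular file, the function below prints one line per domain match from matches.scan. It relies on the whitespace-delimited column layout of the HMMER3 domain table (target name, accession, ..., independent E-value in column 13, domain score in column 14, envelope coordinates in columns 20 and 21); the function name summarize_domtblout and the choice of output columns are illustrative assumptions, not part of Pfam's tooling.

```shell
#!/bin/sh
# Hypothetical helper: summarise a HMMER3 --domtblout file.
# Output columns (tab-separated):
#   query sequence, Pfam family, Pfam accession,
#   independent E-value, domain bit score, envelope start-end
summarize_domtblout() {
    # Skip comment lines (starting with "#") and incomplete rows.
    awk '!/^#/ && NF >= 22 {
        printf "%s\t%s\t%s\t%s\t%s\t%s-%s\n", $4, $1, $2, $13, $14, $20, $21
    }' "$1"
}
```

With hmmscan the target is the profile HMM and the query is your sequence, so `summarize_domtblout matches.scan` lists which Pfam families match each sequence and where.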

And finally

We will be making Pfam alignments available during the next week and will produce another blog post describing them.

Posted by The Pfam team

Dfam 3.0 is out

March 6, 2019


The Dfam consortium is excited to announce the release of Dfam 3.0.  This release represents a major transition for Dfam from a proof-of-concept database into a funded open community resource. Central to this transition is a major infrastructure and technology update, enabling Dfam to handle the increasing pace of genome sequencing and TE library generation. Equally important, we merged Dfam_consensus with Dfam to produce a single resource for transposable element family modeling and annotation. In doing so, Dfam serves the needs of a broader research community while maintaining a high standard for family characterization (seed alignments), and TE annotation sensitivity. Finally, and most importantly, we are working on making Dfam a community driven resource through the development of online curation tools and direct user engagement.

Infrastructure updates

Dfam has undergone a major infrastructure upgrade since the last release including faster servers and storage systems, a new software stack and improved website features. Together these updates will allow Dfam to greatly expand the number of families and the species represented. The new software stack includes a publicly accessible REST API, which provides the core functionality used by the redesigned website and is available for use in community developed applications and workflows. The new website is based on the Angular framework, supporting both a traditional web portal to the Dfam database as well as the use of interactive tools for data management and curation.

Dfam_consensus merger

The merger of Dfam_consensus with Dfam created a combined database of 6,235 TE families in 9 organisms, each characterized by a seed alignment of representative family members. Seed alignments constitute a rich dataset for generating sequence models such as consensus sequences, or profile Hidden Markov Models (HMMs).

Consensus sequence databases have traditionally not preserved the sequence alignment from which the consensus was generated. This omission has made it difficult to evaluate the strength of the consensus, to make incremental improvements by adding/removing members, or to regenerate models using improved methodologies. By adding support for consensus sequences to Dfam, the provenance is preserved in the seed alignment. In addition, the positions within the consensus can be directly related to the corresponding match states within the profile HMM.

Improved interfaces and metadata

The new Dfam website contains several features borrowed from Dfam_consensus including: the seed alignment visualization, the TE classification system and visualization, and per-family and full-database EMBL exports for consensus sequences.

TE classification tree visualization with search facility:


In addition, we have improved the family browsing interface, and added the ability to store/visualize family features such as coding sequences, target site preferences, binding sites, as well as ad-hoc sequence annotation.

Coding regions and target site duplication details for Kolobok-1_DR:


Dfam has adopted the recently developed (for Dfam_consensus) classification system for repetitive sequences and applied it to all of the Dfam-2.x families. This system combines concepts from established systems (Wicker et al., Piegu et al., Curcio et al., Smit et al., and Jurka et al.) with phylogenies based on reverse transcriptase and transposases. Classification names were chosen to be as descriptive as possible while still honoring the most widely used acronyms for well-defined classes.

Dfam families may be queried using the new browse form:



Community engagement

We are embarking on an effort to greatly expand the database using de novo repeat identification pipelines, data sharing with other open databases, and most importantly direct community submissions. If you have existing TE libraries or plan to develop one for a newly sequenced organism, consider making it a part of the Dfam database. We can offer assistance with importing legacy datasets and are working on tools to facilitate direct community curation of the database. Please contact us.

Rfam 14.1 is out

January 28, 2019

We are happy to announce that a new Rfam release is now available! Rfam 14.1 includes 226 new families bringing the total number of Rfam families to 3,016. In addition, the R-scape visualisations have been updated to display pseudoknots, both manually annotated in seed alignments and predicted by R-scape (see below for details).

New families

The majority of the new families were contributed by Dr Zasha Weinberg (University of Leipzig) and were discovered by a systematic computational analysis of intergenic regions in Bacteria and metagenomic samples (see the NAR paper for more details). Many of the families come from environmental samples, so importing them into Rfam required a new procedure (described below).

This release features many families with statistically significant covariation (highlighted in green in the images below), for example Skipping-rope, Drum, and LOOT:

as well as a new unusually large, highly-structured RNA called ROOL that is found in the Firmicutes, Fusobacteria and Tenericutes phyla as well as in phages and cow rumen metagenomic samples:

Browse new families in Rfam

Analysing pseudoknots using R-scape

Developed by Dr Elena Rivas (Harvard University), R-scape is a program that detects covariation support for structural pairs in RNA alignments (see the 2017 paper by Rivas et al. in Nature Methods for more details). Starting with version 1.2.0, R-scape systematically identifies pseudoknots supported by covariation (Rivas & Eddy, in preparation). For example, here is a pseudoknot from the SAM riboswitch that is not yet annotated in the Rfam seed alignment (left) but is correctly predicted by R-scape (right):

The nucleotides forming the pseudoknot are labelled pk_1, pk_2, pk_3 and so on in the structural annotation. Each pseudoknot is shown as a separate stem in an inset, and the basepairs with significant covariation are colored green similar to the other R-scape diagrams.

We are working on adding more pseudoknot annotations to the existing families based on the evidence from R-scape, 3D structures, and scientific literature. Please let us know if your favourite RNA is missing a pseudoknot.

Using RNAcentral identifiers in Rfam seed alignments

In previous releases, every sequence in every Rfam seed alignment was required to have an INSDC identifier assigned by a sequence archive like ENA or GenBank. However, when Rfam users submit their alignments to Rfam, they often include sequences that are not yet found in ENA or GenBank, especially if the sequences come from environmental samples. For example, sequence LV_Brine_h2_0102_1073789 from the MDR-NUDIX RNA does not exist in ENA so it does not have a stable identifier and is not associated with metadata such as NCBI taxid, description, or scientific literature.

In the past, Rfam replaced such sequences with closely related ones or removed them altogether, which required modifying the user-submitted alignments and could result in smaller, less informative seeds missing some covariation compared to the originals. In this release we implemented a new procedure that accepts RNAcentral identifiers in Rfam seed alignments in order to preserve the manually curated alignments as much as possible.

We began by importing the sequences and metadata from a recently established ZWD database (Zasha Weinberg Database) into RNAcentral where each distinct sequence is assigned a stable identifier (URS id) and linked to a NCBI taxid, its parent ZWD alignment, and scientific literature. For example, sequence LV_Brine_h2_0102_1073789 is assigned RNAcentral id URS0000D661D6_12908 so that it can be easily tracked using RNAcentral search, API, public database, or bulk download files.

Next we replaced the ZWD identifiers with RNAcentral accessions and used the ZWD-RNAcentral alignments as seeds for new Rfam families:

Following the standard Rfam protocol, we manually selected bit-score thresholds for each family that allow reliable identification of sequences from the seed alignments and other homologs from the Rfam sequence database.

A small number of sequences still had to be removed from ZWD alignments in the following cases:

  1. If a covariance model built using the alignment could not find some of its own sequences, these unmatched sequences were removed from the alignment.
  2. If a sequence scored worse than a set of random sequences that serve as control when setting bit-score thresholds, such low-scoring sequences were also removed from the alignments.

In future releases we plan to expand the usage of RNAcentral identifiers in Rfam seed alignments.

Please note that any software that parses Rfam seed alignments and uses ENA or GenBank for metadata lookup will now need to include RNAcentral identifiers using the RNAcentral API. For more information or if you have any questions, please contact the RNAcentral team or Rfam help.

11 more families with 3D structure

There are 11 additional Rfam families that match 3D structures bringing the total number of families with experimentally determined structures to 98 (compared with 87 in Rfam 14.0).

Rfam family | PDB structures
RF00009 (RNaseP_nuc) | 6agb and 6ah3 (yeast), 6ahr and 6ahu (human) [chains A]
RF00025 (Telomerase-cil) | 6d6v (chain B)
RF00027 (let-7) | 5zal (chain C), 5zam (chain C)
RF00080 (yybP-ykoY) | 6cc1 (chains A and B), 6cc3 (chains A and B)
RF00233 (Tymo_tRNA-like) | 6mj0 (chains A and B)
RF00250 (mir-TAR) | 6gml (chain P)
RF00390 (UPSK) | 6mj0 (chains A and B)
RF01727 (SAM-SAH) | 6hag (chain A)
RF01826 (SAM_V) | 6fz0 (chain A)
RF02348 (tracrRNA) | 6mcb (chain B), 6mcc (chain B)
RF02553 (YrlA) | 6cu1 (chain A)

Other updates

Two existing families were updated with new seed alignments from ZWD, including RF02440 (ldcC RNA) and RF02840 (Lacto-3 RNA). There is also a new clan DUF805 (CL00115) that includes DUF805 and DUF805b families.


The Rfam team would like to thank Dr Elena Rivas and Dr Zasha Weinberg for the new data, software, and feedback, as well as the organisers and participants of the 2018 Benasque RNA meeting. We would also like to thank BBSRC for funding Rfam between 2015 and 2018.

Get in touch

Follow Rfam on Twitter to find out about new Rfam families and don’t hesitate to raise a GitHub issue or email us if you have any questions.

Genome-centric Rfam is finally here!

September 15, 2017

We are pleased to announce the release of Rfam 13.0, the first major update since Rfam 12.0 went live in 2014. In this version we introduce a new genome-centric sequence database composed of non-redundant, representative, and complete genomes, as well as new website features, such as an updated text search.

Find out more about Rfam 13.0 in the NAR paper by Kalvari et al.: Rfam 13.0: shifting to a genome-centric resource for non-coding RNA families.

Rfam 12.3 is out

June 29, 2017


The new Rfam release (version 12.3) features 101 new families, unified search, and updated documentation.

New families

Rfam 12.3 featured families

In this release 101 new families were added to the database, including over a dozen Yersinia pseudotuberculosis RNA thermometers from a recent PNAS paper by Righetti et al. We would like to thank Zasha Weinberg for contributing NiCo riboswitch, Type-P5 Twister, and several RAGATH RNAs (for example, RAGATH-5). You can browse the new families here.

Unified text search

Rfam text search

Over the years Rfam developed many specialised ways of searching and exploring the data, such as Keyword search, Taxonomy search, browsing entries by type, and “Jump To” navigation. While these options work well, they may be confusing for new users, so we set out to unify all search functionality in a single text search.

The new search is available on the Rfam homepage or at the top of any Rfam page and is powered by EBI search. It allows users to browse RNA families, clans, and motifs, or to explore Rfam by category using facets. For example, one can view families with 3D structures or view all snoRNA families that match human sequences, and the URLs can be bookmarked or shared.

The new search is a full replacement for the old search functionality except for taxonomy, because the new search can find species but not higher-level taxa. For example, one can search for Homo sapiens but not for Mammals. Stay tuned for future updates and use the old Taxonomy search in the meantime. We plan to retire all old search functionality once the new search is fully developed but until then the old and the new searches will coexist.

For more information about the new search, see Rfam documentation. If you have any feedback, please let us know in the comments below, on GitHub, by email, or on Twitter.

New home for Rfam documentation

Rfam help has been migrated to ReadTheDocs, a dedicated documentation hosting platform.

Rfam ReadTheDocs help

The new system offers several advantages:

The source code of the documentation is available on GitHub so if you notice a problem you can let us know by creating an issue or help us fix it by editing the text on GitHub and sending a pull request.

Other updates

  • Clan competition for PDB entries: Now the 3D structure tab, the public MySQL database, and the FTP archive show only the lowest E-value match when several RNA families from the same clan match a PDB chain. For example, chain 0 of PDB structure 1S72 (LSU rRNA from an Archaeon Haloarcula marismortui) now matches only the Archaeal LSU family instead of all families from rRNA LSU clan.
  • New 5S rRNA clan CL00113 that includes 5S rRNA and mtPerm-5S families.
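The clan competition rule described in the first item can be sketched as follows. The input layout is a hypothetical tab-separated file with four columns (PDB chain, clan, family, E-value), and the function name best_match_per_clan is invented for illustration; it is not the format of the Rfam MySQL database or FTP files. The sketch also assumes every row has a clan assigned.

```shell
#!/bin/sh
# Hypothetical helper: for each (PDB chain, clan) pair keep only the
# row with the lowest E-value.
# Input: tab-separated columns  chain  clan  family  evalue
best_match_per_clan() {
    awk -F '\t' '
        {
            key = $1 FS $2
            e = $4 + 0            # force numeric comparison (handles 1e-20)
            if (!(key in best) || e < best[key]) {
                best[key] = e
                line[key] = $0
            }
        }
        END { for (k in line) print line[k] }
    ' "$1" | sort                 # awk array order is unspecified, so sort
}
```

Families that belong to different clans never compete with each other in this sketch, so a chain matched by two unrelated families keeps both matches.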

What’s next

This release will be the last “point release” for Rfam 12. In the next few months we will release Rfam 13.0, which will be based on a new sequence database. Previously, Rfam annotated WGS and STD subsets of ENA, which grow very quickly and include many redundant sequences. We will take advantage of reference genomes from the UniProt reference proteome collection, which is a regularly updated, reduced-redundancy set of reference genomes. This allows us to perform meaningful taxonomic comparisons and explore RNA families by taxonomy without sifting through thousands of versions of the same genome.

Get in touch

As always, we welcome comments and feedback about Rfam, so feel free to get in touch by email or by submitting a new GitHub issue.

Introducing Dfam_consensus – Dfam’s consensus sequence twin

May 18, 2017

Since its inception in 2012, Dfam has demonstrated the promise of using profile hidden Markov Models (HMMs) to improve the detection sensitivity and annotation quality of Transposable Element (TE) families in human[1] and subsequently for four additional reference organisms[2].  Despite these advances, the tools used to discover new families (de novo repeat finders), improve families (extend, defragment, subfamily clustering), and classify TE families continue to depend on consensus sequence models.  This discordance between methodologies is a direct impediment to Dfam’s expansion.


Pfam 31.0 is released

March 8, 2017

Pfam 31.0 contains a total of 16712 families and 604 clans. Since the last release, we have built 415 new families, killed 9 families and created 11 new clans.  We have also been working on expanding our clan classification; in Pfam 31.0, over 36% of Pfam entries are placed within a clan.