Archive for the 'Releases' Category

Rfam 14.6 is out

July 27, 2021

We are happy to announce a new release of Rfam (14.6) that includes 121 new microRNA families, a new ribozyme family, 8 new small RNA families found in Bacteroides, as well as 10 additional families with updated secondary structures using 3D structural information. Read on for more information or explore the data in Rfam.

New microRNA families

The new release includes 121 new microRNA families bringing the total number of microRNA families in Rfam to 1,506. This work is part of the ongoing collaboration with miRBase that aims to synchronise microRNAs across miRBase, Rfam, and RNAcentral. Browse Rfam microRNAs or find out more about the microRNA project.

We also resolved an issue with 6 microRNA families that were missing a covariance model on the website and in the FTP archive. Many thanks to Dr Christian Anthon (University of Copenhagen) for pointing out this problem!

Updating families using information from 3D structures

Following on from Rfam 14.5, we updated the secondary structure of 10 additional families with 3D information, including 6 riboswitches, 1 ribozyme, 1 telomerase, 1 localization element and 1 microRNA precursor.

In some families, the updated structure is substantially changed. For example, the central part of the flavin mononucleotide (FMN) riboswitch is now organised by several additional base pairs and two pseudoknots (pK). As a result, the updated structure is more compact and more accurately reflects the experimentally determined 3D structures.

Seven of the updated families include newly annotated pseudoknots, which is an important improvement that helps better model long-distance non-nested interactions. We will continue reviewing and updating the families with 3D structure in future releases. The full list of the updated families can be found in the table below.

Family | PDB structures | New pK
RF00008 – Hammerhead ribozyme (type III) | 2QUS_A, 2QUS_B, 2QUW_A, 2QUW_B, 2QUW_C, 2QUW_D, 5DI2_A, 5DI4_A, 5DQK_A, 5EAO_A, 5EAQ_A | 1
RF00025 – Ciliate telomerase RNA | 6D6V | 2
RF00050 – FMN riboswitch (RFN element) | 3F2Q, 3F2T, 3F2W, 3F2X, 3F2Y, 3F3O | 2
RF00059 – TPP riboswitch (THI element) | 2CKY_A, 2CKY_B, 2GDI_X, 2GDI_Y, 2HOJ_A, 2HOK_A, 2HOL_A, 2HOM_A, 2HOO_A, 2HOP_A, 3D2G_A, 3D2G_B, 3D2V_A, 3D2V_B, 3D2X_A, 3D2X_B, 3K0J_E, 3K0J_F, 4NYA_A, 4NYA_B, 4NYB_A, 4NYC_A, 4NYD_A, 4NYG_A | -
RF00207 – K10 transport/localisation element (TLS) | 2KE6, 2KUR, 2KUU, 2KU, 2KUW | -
RF00174 – Cobalamin riboswitch | 4GMA, 4GXY | 1
RF00380 – ykoK leader / M-box riboswitch | 2QBZ_X, 3PDR_X, 3PDR_A | 1
RF01689 – AdoCbl variant RNA | 4FRN_A, 4FRN_B, 4FRG_X, 4FRG_B | 1
RF01831 – THF riboswitch | 3SD3, 3SUH, 3SUX, 3SUY, 4LVV, 4LVW, 4LVX, 4LVY, 4LVZ, 4LW0 | 1
RF02095 – mir-2985-2 microRNA precursor | 2L3J | -

Hovlinc ribozyme

A recent paper by Chen Y et al. 2021 describes Hovlinc, a new type of self-cleaving ribozyme found in humans and other hominids. Hovlinc was detected in a very long intergenic noncoding RNA in hominids (hominin vlincRNA-located) using a genome-wide approach designed to discover self-cleaving ribozymes. The functions of vlincRNA and the hovlinc ribozyme remain unclear. Hovlinc joins 3 known classes of small, self-cleaving ribozymes found in humans: (1) Mammalian CPEB3 ribozyme, (2) Hammerhead ribozyme and (3) B2 and ALU retrotransposons. We would like to thank Dr Fei Qi (Huaqiao University) for providing the Hovlinc alignment. View the hovlinc family in Rfam.

New Bacteroides families

In a recent article by Ryan et al. 2020, the authors report a high-resolution transcriptome map of the model organism Bacteroides thetaiotaomicron, a common bacterium of the human gut. They identify 269 non-coding RNA (ncRNA) candidates, of which nine were validated. Eight of these ncRNAs were integrated as new families:

  1. RF04177 – Bacteroides sRNA BTnc201
  2. RF04178 – Bacteroides sRNA BTnc005
  3. RF04179 – Bacteroides sRNA BTnc049
  4. RF04180 – Bacteroides sRNA BTnc231
  5. RF04181 – rteR sRNA
  6. RF04182 – GibS sRNA
  7. RF04183 – Bacteroidales small SRP
  8. RF04184 – Bacteroides sRNA BTnc060

In addition, the RF01693 – Bacteroidales-1 family was renamed to 6S-Bacteroidales RNA. Bacteroidales-1 was first reported by Weinberg et al. 2010 in a comparative genomics analysis of genome and metagenome sequences. It was identified downstream of L20 ribosomal subunit genes in the order Bacteroidales. Ryan et al. 2020 report that this sRNA is a 6S RNA homolog in Bacteroides thetaiotaomicron. Rfam also has two other 6S RNA families: RF00013 (6S/SsrS RNA) and RF01685 (6S-Flavo). We would like to thank Dr Lars Barquist (University of Würzburg) for providing the data.

Welcome to Emma!

A few weeks ago Emma Cooke joined the Rfam team as Software Developer and is already busy working on new features. Emma has studied Genetics and Software Engineering, and her previous roles have focused on release verification pipelines, software testing, and developing for cloud environments. Please join us in welcoming Emma to the Rfam community and stay tuned for new announcements based on her work.

Get in touch

As always, we would be very happy to hear from you if you have any feedback or suggestions for Rfam. Please feel free to email us or get in touch on Twitter.

Dfam 3.4 Release

July 24, 2021

The Dfam Consortium is proud to announce the release of Dfam 3.4. This update includes over 8,200 curated transposable element (TE) families found in 240 mammalian genomes. The models therein have been carefully developed by David Ray’s lab at Texas Tech University (TTU) and further refined by Arian Smit. This is part of an ongoing effort to generate a comprehensive mammalian TE library using multi-species alignments and ancestral sequence reconstructions generated by the Zoonomia project (https://zoonomiaproject.org).

In addition to releasing the curated TE families, full genome annotations are provided for 21 Old World monkeys (Figure 1; Figure 2).

Figure 1: A portion of the available genomes aligned as part of the Zoonomia project, focused on the Primate Order.

Discovery of young, species-specific TEs

As a large portion of a mammalian genome, TEs serve as a source of genomic variation and innovation, including (but certainly not limited to) genomic rearrangement via movement and non-homologous recombination, and the provision of novel transcription factor binding sites. As part of the Zoonomia project, David Ray’s lab has undertaken the first large-scale effort to examine the TE content of the extant genomes, in order to determine TE type and location and, subsequently, the impact TEs might have on the evolution of each lineage of mammals.

Methods

A total of 248 final genome assemblies of placental mammals were initially presented for analysis, most coming from the Zoonomia dataset. Low quality assemblies and previously analyzed genomes were excluded from analyses. To avoid wasted effort on re-curation of previously described TEs, manual curation efforts were focused towards identifying newer putative TEs that underwent relatively recent accumulation, with the main assumption being that many older TEs will be widely shared among large groups of placental mammals and that previous annotation efforts have thoroughly described these older elements in detail.

To classify younger TEs, the filtered dataset was narrowed to elements that have undergone transposition in the recent past, i.e. TEs whose insertions have Kimura 2-parameter (K2P) distances of less than 4.4% (approximately 20 million years or less since insertion, based on a general mammalian neutral mutation rate of 2.2×10^-9). This approach yielded mostly lineage-specific TEs, many of which had not been previously described.
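As a rough illustration of the arithmetic behind this cutoff, the sketch below converts a K2P distance into an approximate insertion age. It assumes the simple relation age = distance / rate, with the rate read as substitutions per site per year (the unit is not stated above); the function name k2p_to_my is made up for this example.

```shell
# Convert a K2P distance (as a fraction, e.g. 0.044 for 4.4%) into an
# approximate insertion age in millions of years.
# Assumption: age = distance / rate, with the rate in substitutions
# per site per year.
k2p_to_my() {
  awk -v d="$1" -v r="$2" 'BEGIN { printf "%.1f\n", d / r / 1e6 }'
}

# The 4.4% cutoff at the quoted rate of 2.2e-9:
k2p_to_my 0.044 2.2e-9   # prints 20.0
```

Under these assumptions the 4.4% threshold corresponds to the roughly 20 million year window quoted above.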

For each iteration of manual TE curation, new consensus sequences were generated by taking the 10-50 top BLAST hits, aligning these sequences with MUSCLE, and estimating a consensus sequence with EMBOSS.

To reduce library redundancy, the potential TE consensus sequences were combined with those of known TEs from previous work as well as all known vertebrate TEs from Repbase. The program CD-HIT-EST was used to identify duplicate TEs among our combined TE library according to the 80-80-80 rule of Wicker et al.
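The 80-80-80 rule is commonly read as: two elements belong to the same family if they share at least 80% identity, over at least 80% of the compared region, with an aligned length of at least 80 bp. A minimal sketch of that test follows. The function name is_same_family is hypothetical, measuring coverage against the shorter of the two sequences is an assumption, and this is not how CD-HIT-EST was configured in the work described here.

```shell
# Return success (exit 0) when a pairwise alignment satisfies the
# 80-80-80 rule as interpreted in this sketch.
#   $1  percent identity over the aligned region
#   $2  aligned length in bp
#   $3  length in bp of the sequence coverage is measured against
#       (assumed here to be the shorter of the two consensus sequences)
is_same_family() {
  awk -v id="$1" -v aln="$2" -v ref="$3" 'BEGIN {
    ok = (id >= 80) && (aln >= 80) && (ref > 0) && (aln / ref >= 0.8)
    exit (ok ? 0 : 1)
  }'
}

# Example: 85% identity over 400 bp of a 450 bp element
is_same_family 85 400 450 && echo "same family"
```

Pairs that pass the test would be treated as duplicates of one family, and pairs that fail any of the three thresholds would be kept as separate entries.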

To confirm the TE type, each sequence in the library was run through a custom pipeline that used blastx to confirm the presence of known ORFs in autonomous elements, Repbase to identify known elements, and TEclass to predict the TE type. Structural criteria were also used to categorize TEs: DNA transposons were required to have visible terminal inverted repeats; rolling circle transposons were required to have an identifiable ACTAG at one end; putative SINEs were inspected for a repetitive tail as well as A and B boxes; and LTR retrotransposons were required to have recognizable hallmarks, such as TG, TGT, or TGTT at their 5’ ends and the inverse at their 3’ ends.

Zoonomia Project

Figure 2: Summary of the Zoonomia project

The Zoonomia project is an effort to understand the mammalian tree of life at a deeper level. This massive undertaking is a collaboration of 27 laboratories. Although far from a complete list, some current projects derived from the Zoonomia datasets include: studying mammalian speech development, regulatory element analyses, chromosome evolution and the evolution of microRNA genes.

Future Work

Future efforts will continue to analyze and catalog lineage-specific TEs in deeper branches of the 240-way genome alignment, using the genomes reconstructed at each node of the phylogenetic tree as part of the alignment, and will expand the full genome annotations available on Dfam.

Google Research Team bring Deep Learning to Pfam

March 24, 2021

We are delighted to announce the first fruits of a collaboration between the Pfam team and a Google Research team led by Dr Lucy Colwell, with Maxwell Bileschi and David Belanger. In 2019, Colwell’s team published a preprint describing a new deep learning method that was trained on Pfam data, and which improves upon the performance of the HMMER software (HMMER is the underlying software used by Pfam). Colwell’s team embraced our initial sceptical feedback and shared data that helped us to understand the new method’s performance. Over time our scepticism turned into interest as we explored novel findings from the method, and now we are very excited by the potential of these methods to improve our ability to classify sequences into domains and families.

Introducing Pfam-N

We are pleased to share a new file called Pfam-N (N for network), which provides additional Pfam 34.0 matches identified by the Google team. Pfam-N annotates 6.8 million protein regions into 11,438 Pfam families. These regions include nearly 1.8 million full-length protein sequences from UniProtKB Reference Proteomes that previously had no Pfam match, an improvement of 4.2% over the currently-annotated 42.5 million. We also note that among the sequences that get their first Pfam annotation, there are 360 human sequences.

The figure above shows the number of matches to UniProtKB Reference Proteomes 2020_06 for each Pfam release over the last decade (orange). Pfam-N (blue) adds nearly 10% more regions to Pfam v34.0, which, based on the current trend, would have taken several years for us to achieve.

How was Pfam-N made?

Deep learning approaches use training examples, much like HMMER, to learn the statistics of what it means for a protein to have a particular function. We use a subset of all the Pfam HMMER matches for training, and provide our deep learning model with both the sequence and Pfam family for each training example. 

We trained a number of replicates (“ensemble elements”) of a convolutional neural network to predict the Pfam matches. We call this ensemble model ProtENN (ENN for Ensemble of Neural Networks). The method relies on HMMER to initially parse proteins into their constituent domains before giving these regions to ProtENN. 

The Pfam-N file is in the standard Pfam Stockholm alignment format, and the ProtENN matches are aligned using the existing Pfam profile-HMM model. We only include a match in Pfam-N if it is not already included in Pfam.

It should be noted that the deep learning model has access to the full set of matches for a Pfam family, whereas the Pfam profile-HMM models are trained on the much smaller Pfam seed alignments. Thus this is not a direct comparison between ProtENN and HMMER. 

Improving Pfam using Pfam-N

We plan to add Pfam-N matches to Pfam seed alignments to help improve the performance of the Pfam profile-HMMs in future releases. Some Pfam families gain huge numbers of additional matches in Pfam-N. For example, the TAT_signal family (PF10518) matches about 4,000 sequences in Pfam 34.0. Pfam-N identifies a further 37,000 protein sequences that were missed by the current Pfam model. The ACT domain (PF01842), which confers regulation to a variety of enzymes by binding to amino acids, is doubled in size by the 27,000 additional matches identified by the deep learning model. Overall, the deep learning models seem to perform particularly well for short families, where the profile-HMMs struggle to distinguish between signal and noise. Large gains are also made for short protein repeats such as TPRs, Leucine Rich Repeats and zinc fingers found in DNA-binding transcription factors.

Funding

The work to expand Pfam families with Pfam-N hits is funded by the Wellcome Trust as part of a Biomedical Resources grant awarded to the Pfam database.

Future work

Deep learning approaches have a number of potential upsides we’re excited to explore, including explicit modeling of interactions between amino acids that are quite far from each other in sequence, as well as the fact that these approaches build a shared model across all protein classes: they attempt to leverage shared information, about, say, a helix-turn-helix region for all of the large variety of biological processes that incorporate this motif. 

If the use of deep learning in speech recognition and computer vision is any indication, our current usage to functionally annotate proteins is in its infancy. We look forward to the development of these models to help us classify the protein universe.

Posted by Alex Bateman

Rfam 14.5 is live

March 18, 2021

We are happy to announce a new Rfam release, version 14.5, featuring 112 updated microRNA families and 10 families improved using the 3D structure information. Read on for details or explore 3,940 RNA families at rfam.org.

Updated microRNA families

As described in our most recent paper, we are in the process of synchronising microRNA families between Rfam and miRBase. In this release 112 of the existing microRNA families have been updated with new manually curated seed alignments from miRBase, new gathering thresholds, and new family members found in the Rfamseq sequence database. 

In total, 852 new microRNA families have been created (356 in release 14.3 and 496 in release 14.4) and 152 existing families have been updated (40 in release 14.3 and 112 in release 14.5). As the miRBase-Rfam synchronisation is about 50% complete, additional microRNA families will be made available in the upcoming releases. You can view a list of the 112 updated families or browse all 1,385 microRNA families on the Rfam website. 

Updating families using information from 3D structures

We are also in the process of reviewing the families with the experimentally determined 3D structures in order to compare the Rfam annotations with the 3D models. Our goal is to incorporate the 3D information into Rfam seed alignments as many families have been created before the corresponding 3D structures became available. We manually review each PDB structure, verify basepair annotations from matching PDBs, and obtain a more consistent consensus secondary structure model. 

In multiple cases we were able to add missing base pairs and pseudoknots. For example, in the SAM riboswitch (RF00162), we added two base pairs in the base of helix P2, corrected a base pair in P3 and added four base pairs in P4 (one in the base of the helix and three near the terminal loop). The updated consensus secondary structure presents a more accurate central core annotation with more structure in the four-way junction.

SAM riboswitch secondary structure before and after the updates

In another example, one base pair was added in P1 and another one in P3 of the SAM-I/IV variant riboswitch (RF01725). We also corrected a base pair in P3 and included a P4 stem loop that was not integrated before.

SAM-I/IV riboswitch secondary structure before and after the updates

The SAM-I/IV riboswitch has a SAM-binding core conformation similar to that of the SAM riboswitch, but it lacks the k-turn motif in P2 that is found in SAM riboswitches. The two families also have different pseudoknot interactions: the SAM riboswitch forms a pseudoknot between a P2 loop and the stem of P3, while the SAM-I/IV riboswitch contains a pseudoknot between a P3 loop and the 5′ region.

The first 10 families updated with 3D information include:

  1. RF00162 – SAM riboswitch
  2. RF01725 – SAM-I/IV variant riboswitch
  3. RF00164 – Coronavirus 3’ stem-loop II-like motif (s2m)
  4. RF00013 – 6S / SsrS RNA
  5. RF00003 – U1 spliceosomal RNA
  6. RF00015 – U4 spliceosomal RNA
  7. RF00442 – Guanidine-I riboswitch
  8. RF00027 – let-7 microRNA precursor
  9. RF01054 – preQ1-II (pre queuosine) riboswitch
  10. RF02680 – preQ1-III riboswitch

We will continue reviewing the families with known 3D structure in future releases.

Other family updates

Initially reported by Aspegren et al. 2004, Class I (RF01414) and Class II (RF01571) RNAs were found in the social amoeba Dictyostelium discoideum and later investigated in more detail by Avesson et al. 2011. Now a new report from Kjellin et al. 2021 presents a comprehensive analysis of the Class I RNA genes in dictyostelid social amoebas. Based on this study, we updated the Dicty Class I RNA family RF01414 with a new seed alignment and removed the family RF01571, thus merging both families into one. We thank Dr Jonas Kjellin (Uppsala University) for suggesting this update.

Goodbye Ioanna!

Rfam 14.5 is the last release prepared by Dr Ioanna Kalvari, who will be leaving the team at the end of March 2021. We would like to take the opportunity to thank Ioanna for her contributions over the last 5.5 years and wish her the best of luck in the future!

Get in touch

As always, we would be very happy to hear from you if you have any feedback or suggestions for Rfam. Please feel free to email us or get in touch on Twitter.

Rfam 14.4 is live

December 18, 2020

The last Rfam release of 2020 is now live! Rfam 14.4 contains 496 new microRNA families developed in collaboration with miRBase. Find out about the microRNA project in our new NAR paper and let us know if you have any feedback.

Rfam 14.3

September 15, 2020

Rfam 14.3 includes 356 new and 40 updated microRNA families, as well as 12 new and 2 updated Flavivirus RNAs. Find out the details in our new NAR paper and get in touch if you have any feedback.

Dfam 3.2 Release

July 9, 2020

Dfam is proud to announce the release of Dfam 3.2.  This release represents a significant step in the expansion of Dfam by providing early access to uncurated, de novo generated families.  As a demonstration of this new capability, we imported a set of 336 RepeatModeler generated libraries produced by Fergal Martin and Denye Ogeh at the European Bioinformatics Institute (EBI).  Also in this release, Dfam now provides family alignments to the RepeatMasker TE protein database aiding in the discovery of related families and in the classification of uncurated TEs.

Uncurated Family Support

In addition to the fully curated libraries for the model organisms human, mouse, zebrafish, worm and fly, Dfam also includes curated libraries for seven other species.  While a fully curated library is the ultimate goal, support for uncurated families has become an essential aspect of a TE resource due to the increasing rate at which new species are being sequenced and the need to have at least a simple TE masking library available.

By standardizing the storage and tracking of uncurated families, it becomes possible to use these datasets to crudely mask an assembly, provide a first approximation of the TE content, and create a starting point for community curation efforts.  Due to the redundancy and fragmentation inherent in these datasets, we do not compute genome-specific thresholds or generate genome coverage plots for these families.  The latest update to the web portal includes new interfaces for uncurated families and some existing interfaces now include an option to include/omit uncurated families.

In this release, Dfam now contains RepeatModeler de novo-produced libraries for an additional 336 species as the result of the collaboration with EBI researchers (denoted with the new uncurated accession prefix “DR”). Notable taxa expansions include Sauropsida (lizards and birds) and fishes (bony and cartilaginous) (Table 1). Also included are Amphibia, Viridiplantae and additional species in Mammalia.

Table 1. De novo-identified TE families from additional species

SpeciesNumber (species)RetrotransposonsDNA transposonsOther
Mammalia471830137812567
Sauropsida164293261168827192
Amphibia6178120316107
Actinopterygii (bony fishes)116275205136177006
Chondrichthyes (cartilaginous fishes)516711982273
Viridiplantae (green plants)28964121687

Aligned Protein Features

In previous versions of Dfam, hand-curated coding regions were provided for a select set of families.  The protein products of these curated sequences were placed in the RepeatMasker TE protein database for use with the RepeatProteinMask tool.  In this release we have used this database with BLASTX to produce alignments to all Dfam families including the uncurated entries.  The resulting alignments are displayed alongside the curated coding regions as the new “aligned” feature track (Figure 1).

Figure 1. Feature track and details for BLASTX alignments to TE protein database.

Website improvements

Several minor improvements have been made to the interface since the previous release.  The browse page now provides links to download the families selected by the query/filter options as HMM, EMBL or FASTA records.  The Seed tab of the Families page now displays the average Kimura divergence of the seed alignment instances to the consensus.

Pfam 33.1 is released

June 11, 2020

We are pleased to announce the release of Pfam 33.1! Some of you may have noticed that we never released Pfam 33.0 – we had initially planned to do so in March 2020, but due to the global pandemic, we redirected our efforts to updating the Pfam SARS-CoV-2 models instead (see previous blog posts Pfam SARS-CoV-2 special update and Pfam SARS-CoV-2 special update (part 2)). We have added these updated models to the Pfam 33.0 release, along with a few other families that we had built since the data for Pfam 33.0 were frozen, to create Pfam 33.1.

Pfam 33.1 contains a total of 18259 families and 635 clans. Since the last release, we have built 355 new families and killed 25 families. We regularly receive feedback from users about families or domains that are missing in Pfam, and typically add many user-submitted families at each release. We include the submitter’s name and ORCID identifier as an author of such Pfam entries. This helps people to get credit for community activities that improve molecular biology databases such as Pfam.

One such user submission was from Heli Mönttinen (University of Helsinki) who submitted a large scale clustering of virus families. Based on this clustering we added 88 new families to Pfam. 

Figure 1. Organisation of the TSP1 clan in Pfam shown as a sequence similarity network. Image taken from Xu et al.

Finally, we are very happy to welcome Sara and Lowri who are working as curators for both the Pfam and InterPro resources and are already making great contributions to the resources.

Posted by Jaina and Alex

Rfam Coronavirus Special Release

April 27, 2020

In response to the SARS-CoV-2 outbreak, the Rfam team prepared a special release dedicated to the Coronavirus RNA families. Release 14.2 includes 10 new and 4 revised families that can be used to annotate the SARS-CoV-2 and other Coronavirus genomes with RNA families.

View the data at rfam.org/covid-19

New Coronavirus Rfam families

In collaboration with the Marz group and the EVBC, we created 10 families representing the entire 5’- and 3’- untranslated regions (UTRs) for Alpha-, Beta-, Gamma-, and Delta- coronaviruses. A specialised set of alignments for the subgenus Sarbecovirus is also provided, including the SARS-CoV-1 and SARS-CoV-2 UTRs. 

The families are based on a set of high-quality whole genome alignments produced with LocARNA and reviewed by expert virologists. Note that the Alpha-, Beta-, and Deltacoronavirus alignments and structures were refined based on the literature, while the Gammacoronavirus families are based on prediction alone due to the lack of experimental data.


Virus | 5’ UTR | 3’ UTR
Alphacoronavirus | aCoV-5UTR (RF03116) | aCoV-3UTR (RF03121)
Betacoronavirus | bCoV-5UTR (RF03117) | bCoV-3UTR (RF03122)
Sarbecovirus and SARS-CoV-2 | Sarbecovirus-5UTR (RF03120) | Sarbecovirus-3UTR (RF03125)
Gammacoronavirus | gCoV-5UTR (RF03118) | gCoV-3UTR (RF03123)
Deltacoronavirus | dCoV-5UTR (RF03119) | dCoV-3UTR (RF03124)

Previously, only fragments of the UTRs were found in Rfam. In particular, two families were superseded by the new whole-UTR alignments and removed from Rfam:

  • RF00496 (Coronavirus SL-III cis-acting replication element): This family represented a single stem that is now found in aCoV-5UTR and bCoV-5UTR families.
  • RF02910 (Coronavirus_5p_sl_1_2): This family represented two stems from aCoV-5UTR.

The new families are grouped into 2 clans: CL00116 and CL00117 for the 5’ and 3’ UTRs, respectively. The clans can be used with the Infernal cmscan program to automatically select the highest scoring match from a set of related families (see the Rfam chapter in CPB to learn more).
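The idea behind clan competition can be sketched in a few lines of shell. This is only an illustration of the principle, not Infernal's implementation: the function name clan_compete and the six-column tab-separated input (sequence, start, end, family, clan, bit score, with start no greater than end) are assumptions made for the example, and cmscan's own table output has a different layout.

```shell
# Keep only the best-scoring hit among overlapping hits from the same clan.
# Input: a tab-separated file in a hypothetical layout (not cmscan output):
#   seq_id  start  end  family  clan  bit_score
clan_compete() {
  # Visit hits from highest to lowest bit score.
  LC_ALL=C sort -t "$(printf '\t')" -k6,6nr "$1" |
  awk -F '\t' '
    {
      keep = 1
      for (i = 1; i <= n; i++) {
        # Drop the hit if it overlaps a better hit that was already kept
        # on the same sequence and in the same clan.
        if (seq[i] == $1 && clan[i] == $5 &&
            ($2 + 0) <= e[i] && ($3 + 0) >= s[i]) { keep = 0; break }
      }
      if (keep) {
        n++
        seq[n] = $1; s[n] = $2 + 0; e[n] = $3 + 0; clan[n] = $5
        print
      }
    }'
}
```

Calling clan_compete hits.tsv prints the surviving hits in descending score order. Hits from different clans never compete with each other, so a 5’ UTR match and a 3’ UTR match on the same genome are both retained.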

Revised Coronavirus families

We also reviewed and updated the existing Coronavirus Rfam families.

Family | What was updated? | Is it found in SARS-CoV-2?
RF00182 Coronavirus packaging signal | The seed alignment and consensus secondary structure were updated to include the 4 conserved repeat units. | This RNA element is found only in Embecovirus, so it is not present in SARS-CoV-2 and other Sarbecoviruses.
RF00507 Coronavirus frameshifting stimulation element | The seed alignment was expanded. | This RNA is present in SARS-CoV-2.
RF00164 Coronavirus s2m RNA | The seed alignment was expanded. There is a 3D structure for SARS-CoV-1 which can be used for understanding the s2m in SARS-CoV-2. | This RNA is present in the 3′ UTR of SARS-CoV-2.
RF00165 Coronavirus 3’-UTR pseudoknot | The seed alignment was expanded. The pseudoknot is annotated in the 3’ UTR families but since it is mutually exclusive with the 3’-UTR consensus structure, it is also provided as a separate family. | This RNA is present in the 3′ UTR of SARS-CoV-2.

Where to get the data 

You can download the covariance models, as well as seed alignments for the coronavirus families from the corresponding family pages or from a dedicated folder on the FTP archive.

How to use the data

You can download the covariance models and annotate viral sequences with these RNA models using Infernal. See Rfam help for examples.

Inviting all Wikipedians to contribute

We revised the Wikipedia pages associated with each family, and we invite everyone to contribute to them.

Acknowledgements

We would like to thank Kevin Lamkiewicz and Manja Marz (Friedrich Schiller University Jena) for providing the curated alignments for the new families as well as Eric Nawrocki (NCBI) for revising the existing Rfam entries. We also thank Ramakanth Madhugiri (Justus Liebig University Giessen) for reviewing the Coronavirus UTR alignments.


This work is part of the BBSRC-funded project to expand the coverage of viral RNAs in Rfam. More data on SARS-CoV-2 can be found on the European COVID-19 Data Portal.

Pfam SARS-CoV-2 special update

April 2, 2020

The SARS-CoV-2 pandemic has mobilised a worldwide research effort to understand the pathogen itself and the mechanism of COVID-19 disease, as well as to identify treatment options. Although Pfam already provided useful annotation for SARS-CoV-2, we decided to update our models and annotations for this virus in an effort to help the research community. This post explains what was done and how we are making the data available as quickly as possible.

What have we done?

We assessed all the protein sequences provided by UniProt via its new COVID-19 portal (https://covid-19.uniprot.org/), identified those which lacked an existing Pfam model, and set about building models as required. In some cases we built families based on recently solved structures of SARS-CoV-2 proteins. For example, we built three new families representing the three structural domains of the NSP15 protein (Figure 1) based on the structure by Youngchang Kim and colleagues (http://europepmc.org/article/PPR/PPR115432). In other cases, such as Pfam’s RNA dependent RNA polymerase family (PF00680), we took our existing family and extended its taxonomic range to ensure it included the new SARS-CoV-2 sequences.

Figure 1. The structure of NSP15 (PDB:6VWW) from Kim et al. shows the three new Pfam domains. (1) CoV_NSP15_N (PF19219) Coronavirus replicase NSP15, N-terminal oligomerisation domain in red, (2) CoV_NSP15_M (PF19216) Coronavirus replicase NSP15, middle domain in blue and (3) CoV_NSP15_C (PF19215) Coronavirus replicase NSP15, uridylate-specific endoribonuclease in green.

We have also standardised our ID nomenclature and descriptions of the families to ensure they are both correct and consistent. The majority of the family identifiers now begin with either CoV, for coronavirus specific families, or bCoV for the families which are specific to the betacoronavirus clade, which SARS-CoV-2 belongs to. We have also fixed inconsistencies in the naming and descriptions of the various non-structural proteins, using NSPx for those proteins encoded by the replicase polyprotein, and NSx for those encoded by other ORFs. We are grateful to Philippe Le Mercier from the Swiss Institute of Bioinformatics who gave us valuable guidance for our nomenclature.

Where are the data?

You can access a small HMM library (Pfam-A.SARS-CoV-2.hmm) for all the Pfam families that match the SARS-CoV-2 protein sequences on the Pfam FTP site:

ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam_SARS-CoV-2_1.0/

You can also find a file (matches.scan) showing the matches of the models against the SARS-CoV-2 sequences in the same FTP location. These updates are not yet available on the Pfam website. We anticipate making them available in 6-8 weeks.  We hope you find our SARS-CoV-2 models useful for your research, and as always we welcome your feedback via email at pfam-help@ebi.ac.uk.

How to use this library?

This library is not compatible with the pfam_scan software that we normally recommend to reproduce Pfam matches, as this library only contains a small subset of models.  If you wish to compare these models to your own sequences, please use the following HMMER commands:

$ hmmpress Pfam-A.SARS-CoV-2.hmm

This only needs to be performed once. Then, to compare your sequences (in a file called my.fasta) to this special Pfam profile HMM library, run:

$ hmmscan --cut_ga --domtblout matches.scan Pfam-A.SARS-CoV-2.hmm my.fasta

The --domtblout option enables you to save the matches in a more convenient tabular form, if you do not want to parse the HMMER output.
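For a quick look at the resulting table, a short awk filter is usually enough. The sketch below assumes the standard whitespace-delimited domain table layout described in the HMMER user guide (target name in column 1, query name in column 4, independent E-value in column 13, alignment coordinates in columns 18 and 19); the function name summarise_domtblout is made up for this example.

```shell
# Print one line per domain hit:
#   sequence <TAB> Pfam family <TAB> alignment start-end <TAB> independent E-value
# With hmmscan, the "target" columns describe the profile HMM and the
# "query" columns describe the input sequence. Comment lines start with '#'.
summarise_domtblout() {
  awk '!/^#/ { printf "%s\t%s\t%s-%s\t%s\n", $4, $1, $18, $19, $13 }' "$1"
}

# Usage, once hmmscan has written matches.scan:
#   summarise_domtblout matches.scan
```

Because the free-text description is the last column of the table, selecting columns by position from the left stays reliable even when a description contains spaces.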

And finally

We will be making Pfam alignments available during the next week and will produce another blog post describing them.

Posted by The Pfam team