Posts Tagged ‘release’

Dfam 3.7 : ~3.4 million TE models across 2346 taxa

January 12, 2023

We at Dfam are pleased to announce the latest data release! The Dfam 3.7 release includes additional raw and curated datasets, resulting in a ~4.5x increase in the number of families compared to the previous Dfam 3.6 data release, across a wide range of taxa. Please note the large size of this release and plan accordingly. It may be beneficial to use the API to filter and download only the data relevant to your project.
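As a rough illustration of filtering by taxon before downloading, the sketch below builds a query URL for the Dfam REST API. The endpoint `https://dfam.org/api/families` and the parameter names (`format`, `clade`, `clade_relatives`, `limit`) are assumptions that are not stated in this post, so please check them against the current API documentation before relying on them. The helper name `build_family_query` is hypothetical.

```python
from urllib.parse import urlencode

# Assumed endpoint; verify against the current Dfam API documentation.
DFAM_FAMILIES_URL = "https://dfam.org/api/families"


def build_family_query(clade, include_descendants=True, limit=20):
    """Return a URL asking for a summary of the families in one clade.

    All parameter names used here are assumptions, not confirmed by the post.
    """
    params = {
        "format": "summary",  # request lightweight records
        "clade": clade,       # taxon name to filter on
        "limit": limit,       # page size, to keep responses small
    }
    if include_descendants:
        # Also include families assigned to taxa below the named clade.
        params["clade_relatives"] = "descendants"
    return f"{DFAM_FAMILIES_URL}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_family_query("Actinopterygii"))
```

Building the URL separately from fetching it makes it easy to inspect the query, and to page through a large clade in small batches, before committing to a long download.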

EBI dataset contributes to the quadrupling of the Dfam database 

Our continued collaboration with Fergal Martin and Denye Ogeh from the European Bioinformatics Institute (EBI) has provided an additional 771 assemblies and their associated TE models that are now a part of the DR records in Dfam. This brings the total contribution of genomic data from EBI to 1551 species. The new data expands taxa such as Viridiplantae (green plants) and Actinopterygii (bony fishes), and broadens Dfam coverage with the addition of Echinodermata (starfishes, sea urchins/cucumbers) and Petromyzontiformes (lampreys). 

Community submissions – adding diversity to Dfam

Taro (Colocasia esculenta) – a threatened food staple

One of the most ancient cultivated crops, taro is a food staple in the Pacific Islands and the Caribbean, which is currently threatened by taro leaf blight (TLB). Some populations of taro are resistant to TLB, but the genetic basis for this resistance is unknown. As part of an effort to understand the genetic basis of TLB resistance, a taro de novo assembly was generated and the repetitive content was analyzed [1]. The high repetitive content (~82%) of this genome was positively correlated with genome size, with the potential to be linked to TLB resistance. Contributed by M. Renee Bellinger.

Gesneriaceae – understanding angiosperm morphological variation

A member of the plant family Gesneriaceae, the Cape Primrose Streptocarpus rexii has long been studied by evolutionary biologists due to its unique morphological aspects. Genetic resources are critical for studying the unique meristem evolution of this plant family. As such, a genome annotation pipeline was developed to address current technical challenges in genome annotation. Part of this effort included generating repeat libraries not only for the Cape Primrose, but also for Dorcoceras hygrometricum and Primulina huaijiensis [2]. Providing these libraries to Dfam will enhance the resources available for future genomic characterization of this plant family. Contributed by Kanae Nishii.

Mosquito (Anopheles coluzzii) – a human malaria vector

The adaptive flexibility of Anopheles coluzzii, a primary vector of human malaria, allows it to escape efforts to control the mosquito population with insecticides. As TEs are integral to adaptive processes in other species, it was hypothesized that TEs could underlie the rapid resistance of A. coluzzii to classic methods of intervention. Analyzing six individuals from two African localities allowed the authors to provide a comprehensive TE library [3]. This effort enhances the resources available to study the genomic architecture and gene regulation underpinning the success of this malaria vector. Contributed by Carlos Vargas and Josefa Gonzalez.

Water flea (Daphnia pulicaria) – a model organism to study climate change

Due to their short lifespans and reproductive capabilities, water fleas are used as a bioindicator to study the effects of toxins on an ecosystem, and are thus useful in studying climate change. A study of two ecological sister taxa – Daphnia pulicaria and Daphnia pulex – analyzed the evolutionary forces of recombination and gene density in driving the differentiation and divergence of the two aforementioned species [4]. TE content was analyzed as part of generating the new Daphnia pulicaria genome assembly.  Contributed by Mathew Wersebe.

601 insects – transposable element influence on species diversity 

TEs are drivers of evolution in eukaryotes. However, in some underrepresented taxa, TE dynamics are less well understood. To this end, 601 insect genomes spanning 20 orders were surveyed for TE content to assess the variation within and among insect orders [5]. This work highlights the need for community-submitted high-quality libraries. Contributed by John Sproul and Jacqueline Heckenhauer.

Analysis of six bat genomes – evolution of bat adaptations

Bats are an excellent example of complex adaptations, such as flight, echolocation, longevity and immunity. In order to enhance the genomic resources to study the development of complex traits, six high-quality genome assemblies were generated using long- and short-read technologies (Rhinolophus ferrumequinum, Rousettus aegyptiacus, Phyllostomus discolor, Myotis myotis, Pipistrellus kuhlii and Molossus molossus) [6]. As part of the effort to annotate these new genome assemblies, the TE content was analyzed. These six genomes displayed a wide range of diversity in TE content, perhaps contributing to their complex traits. Contributed by Kevin Sullivan and David Ray.

LTR7/ERVH – transcriptional regulation in the human embryo

The mechanism by which human endogenous retrovirus type-H (HERVH) exerts regulatory activities fostering self-renewal and pluripotency in the pre-implantation embryo is unknown. To elucidate this mechanism, the transcription dynamics and sequence signature evolution of HERVH were analyzed [7]. This study not only revealed previously undefined LTR7 subfamilies, but also provided a comprehensive phyloregulatory analysis of all the identified subfamilies against locus-specific regulatory data available in genome-wide assays of embryonic stem cells (ESCs), providing evidence for subfamily-specific promoter activity. The complex evolutionary history of LTR7 is mirrored in the transcriptional partitioning that takes place during early embryonic development. Contributed by Thomas Carter, Cédric Feschotte, and Arian Smit.

References

1. Bellinger, M. R., Paudel, R., Starnes, S., Kambic, L., Kantar, M. B., Wolfgruber, T., Lamour, K., Geib, S., Sim, S., Miyasaka, S. C., Helmkampf, M., & Shintaku, M. (2020). Taro Genome Assembly and Linkage Map Reveal QTLs for Resistance to Taro Leaf Blight. G3 (Bethesda, Md.), 10(8), 2763–2775. https://doi.org/10.1534/g3.120.401367

2. Nishii, K., Hart, M., Kelso, N., Barber, S., Chen, Y. Y., Thomson, M., Trivedi, U., Twyford, A. D., & Möller, M. (2022). The first genome for the Cape Primrose Streptocarpus rexii (Gesneriaceae), a model plant for studying meristem-driven shoot diversity. Plant Direct, 6(4), e388. https://doi.org/10.1002/pld3.388

3. Vargas-Chavez, C., Longo Pendy, N. M., Nsango, S. E., Aguilera, L., Ayala, D., & González, J. (2022). Transposable element variants and their potential adaptive impact in urban populations of the malaria vector Anopheles coluzzii. Genome Research, 32(1), 189–202. https://doi.org/10.1101/gr.275761.121

4. Wersebe, M. J., Sherman, R. E., Jeyasingh, P. D., & Weider, L. J. (2022). The roles of recombination and selection in shaping genomic divergence in an incipient ecological species complex. Molecular Ecology, 10.1111/mec.16383. Advance online publication. https://doi.org/10.1111/mec.16383

5. Sproul, J. S., Hotaling, S., Heckenhauer, J., Powell, A., Larracuente, A. M., Kelley, J. L., Pauls, S. U., & Frandsen, P. B. (2022). Repetitive elements in the era of biodiversity genomics: insights from 600+ insect genomes. bioRxiv 2022.06.02.494618. https://doi.org/10.1101/2022.06.02.494618

6. Jebb, D., Huang, Z., Pippel, M., Hughes, G. M., Lavrichenko, K., Devanna, P., Winkler, S., Jermiin, L. S., Skirmuntt, E. C., Katzourakis, A., Burkitt-Gray, L., Ray, D. A., Sullivan, K. A. M., Roscito, J. G., Kirilenko, B. M., Dávalos, L. M., Corthals, A. P., Power, M. L., Jones, G., Ransome, R. D., … Teeling, E. C. (2020). Six reference-quality genomes reveal evolution of bat adaptations. Nature, 583(7817), 578–584. https://doi.org/10.1038/s41586-020-2486-3

7. Carter, T. A., Singh, M., Dumbović, G., Chobirko, J. D., Rinn, J. L., & Feschotte, C. (2022). Mosaic cis-regulatory evolution drives transcriptional partitioning of HERVH endogenous retrovirus in the human embryo. eLife, 11, e76257. https://doi.org/10.7554/eLife.76257

    Rfam Release 14.9

    November 15, 2022

    We are happy to announce the release of Rfam 14.9. This release features 14 new miRNA families, 23 updated miRNA families, 10 families improved with their first 3D structure, 10 families updated with additional 3D structures, and comprehensive improvements using R-scape. Read on for more details.

    Updated and new miRNA families

    In this release, we have updated 23 miRNA families in Rfam and created 14 new families based on miRBase miRNA alignments. We estimate that this project is 80 percent finished. The remaining families are undergoing extensive curation.

    New miRNA Families:

    Rfam ID | Family
    RF04223 | MIR2619
    RF04224 | mir-9229
    RF04225 | MIR7502
    RF04226 | mir-9186
    RF04227 | mir-9215
    RF04228 | MIR6140
    RF04229 | mir-9279
    RF04230 | mir-9261
    RF04231 | mir-9191
    RF04232 | mir-9318
    RF04233 | mir-680
    RF04234 | mir-1421
    RF04235 | mir-242_2
    RF04236 | mir-1285

    Updated miRNA Families:

    Rfam ID | Family
    RF00241 | mir-8/mir-141/mir-200
    RF00456 | mir-34
    RF00661 | mir-31
    RF00672 | mir-190
    RF00700 | mir-375
    RF00702 | mir-182
    RF00706 | mir-263
    RF00713 | mir-239
    RF00716 | mir-3
    RF00717 | mir-315
    RF00726 | mir-87
    RF00727 | microRNA bantam
    RF00728 | mir-81
    RF00762 | mir-412
    RF00837 | mir-251
    RF00844 | mir-67
    RF00848 | mir-61
    RF00948 | mir-996
    RF01045 | mir-544
    RF01413 | miR-430
    RF01924 | mir-2774
    RF04088 | MIR812
    RF04195 | MIR6217

    3D families

    We continue to review and update Rfam families with available 3D information. In release 14.9, 10 families have been updated with 3D information for the first time, and additional 3D structures have been added to a further 10 families. We have added 9 pseudoknots to the 10 families with new 3D structures.

    Families with new 3D structures added:

    Rfam ID | Family | New PDB structures
    RF00037 | Iron response element I | 3SNP_C, 3SNP_D
    RF00522 | PreQ1 riboswitch | 2L1V_A
    RF01073 | Gag/pol translational readthrough site | 2LC8_A
    RF01727 | SAM/SAH riboswitch | 6HAG_A
    RF02253 | Iron response element II | 3SN2_B
    RF02519 | ToxI antitoxin | 4ATO_G
    RF02553 | Y RNA-like | 6CU1_A
    RF02796 | Pab160 RNA | 3HJW_D, 3LWO_D, 3LWP_D, 3LWQ_D, 3LWR_D, 3LWV_D
    RF03054 | Xanthine riboswitch/NMT1 RNA | 7ELP_A, 7ELP_B, 7ELQ_A, 7ELQ_A, 7ELR_A, 7ELR_B, 7ELS_A, 7ELS_B
    RF04222 | Potato leafroll virus exoribonuclease-resistant RNA | 7JJU_A, 7JJU_B

    Families with additional 3D structures:

    Rfam ID | Family | Additional PDB structures
    RF00015 | U4 spliceosomal RNA | 5GAP_V
    RF00025 | Ciliate telomerase RNA | 7LMA_B, 7LMB_B
    RF00050 | FMN riboswitch (RFN element) | 6WJR_X, 6WJS_X
    RF00059 | TPP riboswitch (THI element) | 7TD7_A, 7TDA_A, 7TDB_A, 7TDC_A, 7TZR_X, 7TZR_Y, 7TZS_X, 7TZS_Y, 7TZT_A, 7TZU_A
    RF00162 | SAM riboswitch (S box leader) | 7EAF_A
    RF00174 | Cobalamin riboswitch | 6VMY_A
    RF00442 | Guanidine-I riboswitch | 5U3G_B, 7MLW_F
    RF01763 | ykkC-III Guanidine-III riboswitch | 5NWQ_B, 5NY8_A, 5NY8_B, 5NZ3_A, 5NZ3_B, 5NZD_A, 5NZD_B, 5O62_A, 5O62_B, 5O69_A, 5O69_B
    RF01831 | THF riboswitch | 6Q57_A, 7KD1_A
    RF02680 | PreQ1-III riboswitch | 6XKN_A, 6XKO_A

    We have also created rfam.org/3d which contains a table of all families with 3D structures and a link to download the seed alignments for those families. Additionally, there is a new file on our FTP site Rfam.3d.seed.gz which contains all seed alignments for these families. The page and file will be updated each release. Please reach out if you have any suggestions for improvements!
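For readers who want to check which families a downloaded copy of Rfam.3d.seed.gz covers, here is a minimal sketch. It assumes the file is a gzipped concatenation of Stockholm-format alignments (the usual format of Rfam seed files), each carrying `#=GF AC` and `#=GF ID` annotation lines and ending with `//`. The function name `list_seed_families` is hypothetical, and the file must already be saved locally.

```python
import gzip


def list_seed_families(path):
    """Return (accession, identifier) pairs from a gzipped Stockholm file."""
    families = []
    accession = identifier = None
    # Seed files are not guaranteed to be pure ASCII, so decode leniently.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            if line.startswith("#=GF AC"):
                accession = line.split()[2]
            elif line.startswith("#=GF ID"):
                identifier = line.split()[2]
            elif line.startswith("//"):
                # "//" terminates one alignment record in Stockholm format.
                if accession is not None:
                    families.append((accession, identifier))
                accession = identifier = None
    return families


if __name__ == "__main__":
    # Assumes the file has been downloaded to the working directory.
    for acc, name in list_seed_families("Rfam.3d.seed.gz"):
        print(acc, name)
```

Reading the file line by line keeps memory use flat, which matters little for this subset but carries over to the full seed file.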

    Families updated with R-scape model

    We worked with Elena Rivas to analyse all Rfam families with R-scape. We then updated models where R-scape was able to suggest a better alignment. This led to 26 families with improvements, listed below by number of additional covarying base pairs.

    Additional covarying base pairs | Rfam ID | Family
    24 | RF02033 | HEARO
    14 | RF03065 | IS605-orfB-I
    8 | RF03068 | RT-3
    5 | RF03072 | raiA
    4 | RF02969 | DUF3800-I
    3 | RF01688 | Actino-pnp
    3 | RF02004 | group-II-D1D4-5
    3 | RF02005 | group-II-D1D4-6
    3 | RF02913 | pemK
    3 | RF03077 | RT-2
    3 | RF03135 | L4-Archaeoglobi
    3 | RF03144 | eL15-Euryarchaeota
    2 | RF00062 | HgcC
    2 | RF01731 | TwoAYGGAY
    2 | RF01794 | sok
    2 | RF02221 | sRNA-Xcc1
    2 | RF02947 | cow-rumen-2
    2 | RF03000 | LOOT
    2 | RF03158 | L31-Actinobacteria
    1 | RF01864 | plasmodium_snoR21
    1 | RF01867 | CC2171
    1 | RF02944 | c4-2
    1 | RF02968 | DUF3800-IX
    1 | RF02987 | GA-cis
    1 | RF03019 | RT-16
    1 | RF03046 | Pseudomonadales-1

    Other updates

    We have also updated 3 other families. We updated the secondary structure of the Sarbecovirus 5’ UTR (RF03120) to reflect the pairing reported in the paper “Correlated sequence signatures are present within the genomic 5′UTR RNA and NSP1 protein in coronaviruses”; specifically, we modified the consensus secondary structure of stem loop 1 of the 5’ untranslated region to match that paper. Additionally, we renamed the RF03054 and RF03071 families, which were first reported by Zasha Weinberg in a comparative analysis of intergenic regions in bacteria.

    The Xanthine riboswitch (RF03054) was first reported in Proteobacteria as NMT1 non-coding RNA (ncRNA). Later it was reported as a ncRNA that recognised Xanthine and the structure was reported in 7ELP, 7ELQ, 7ELR and 7ELS PDBs which have been added as part of the seed alignment.

    The Na+ riboswitch (RF03071) was first reported by Zasha Weinberg and called DUF1646 RNA. More recently, Neil White from the Breaker group identified it as a riboswitch that selectively senses Na+ and regulates the expression of genes related to sodium biology.

    Migrating Rfam’s public SVN repository

    Rfam and Pfam used to provide a public copy of their SVN repositories on xfamsvn.ac.uk. With the recent deprecation of Pfam’s website and its inclusion in InterPro, we have decided to move the Rfam SVN repository to http://svn.rfam.org. Users interested in a nightly updated version of Rfam can browse the repository at this new location.

    Rfam release 14.8

    May 30, 2022

    We are happy to announce the release of Rfam version 14.8. This release includes 48 updated and 25 new microRNA families; 10 families updated based on 3D structure annotations; 4 new families and updates to 5 existing families for Hepatitis C virus; a new xRNAs family from the Potato virus; and the integration of LitScan, a literature scanner powered by RNAcentral and Europe PubMed Central. Read on for details on these changes.

    Updated microRNA families

    Rfam, miRBase and RNAcentral have been working to synchronize miRNA families between all three resources. We are now happy to report that we have completed 77% of the current miRNA families, covering >1300 miRNAs out of the 1700 alignments provided to us by Professor Sam Griffiths-Jones at miRBase. The 400 remaining families need an extended review process, and we will be working on their integration in future releases.

    Summary of the miRBase and Rfam synchronisation project. We estimate it is 78% completed.

    Families updated with 3D structure information

    Rfam has been updating families using 3D structure information. This project aims to improve Rfam families through the addition of pseudoknots, base pairs, and annotations of other structural elements by inspecting 3D structures. In this release we have updated 10 families:

    • Virus families:
      • RF00507-Coronavirus frameshifting stimulation element
      • RF01047-HBV RNA encapsidation signal epsilon
    • Riboswitch families:
      • RF01763-Guanidine-III riboswitch, also known as ykkC-III riboswitch
      • RF01734-Fluoride riboswitch
      • RF01704-Glutamine-II riboswitch, previously known as Downstream peptide
      • RF01750-ZMP/ZTP riboswitch
      • RF01739-Glutamine riboswitch
      • RF02683-NiCo riboswitch
      • RF01826-SAM-V riboswitch
    • Ribozyme family:

    We have added pseudoknots to 7 of the 10 updated families and the updated secondary structure diagrams from 5 of these families are shown below.

    Examples of families reviewed and updated with 3D information. Pseudoknot structures (pk) were added to each of these five families based on a review of the corresponding 3D structures.

    New and updated families of Hepatitis C virus

    In release 14.8 we have created 4 new families, updated 5 existing families, and deleted 4 virus families. These changes are the result of the ongoing collaboration between Professor Manja Marz of the European Virus Bioinformatics Center and Rfam. The Marz group provided Rfam with a curated alignment of representative sequences for the entire Hepatitis C virus genome. We used this alignment to create, update or remove Rfam families. The new and updated families are summarized in the table below. We have deleted RF00469, which was merged into RF00260 during review. We have also deleted families RF02585 to RF02588, which have no support in the genomic alignment.

    Rfam ID | Name | Description
    RF00061 | IRES_HCV | Hepatitis C virus internal ribosome entry site
    RF00260 | HepC_CRE | Hepatitis C virus (HCV) cis-acting replication element (CRE)
    RF00620 | HCV_ARF_SL | Hepatitis C alternative reading frame stem-loop
    RF00468 | HCV_SLVII | Hepatitis C virus stem-loop VII
    RF00481 | HCV_X3 | Hepatitis C virus 3’X element
    RF04218 | HCV_5BSL1 | Hepatitis C virus stem-loop I
    RF04219 | HCV_J750 | J750 non-coding RNA (containing SL761 and SL783)
    RF04220 | HCV_SL588 | SL588 non-coding RNA
    RF04221 | HCV_SL669 | SL669 non-coding RNA

    As part of this project we have reviewed and updated Coronavirus, Flavivirus and HCV families, and we are working on adding RNA families from other viruses, such as Filoviridae (e.g. Ebolavirus) and Rhabdoviridae (e.g. Rabies viruses).

    xRNAs in Potato virus

    We want to thank Professor Quentin Vicens for sharing the alignment of Potato leafroll virus exoribonuclease-resistant RNA (PLRV-xrRNA). PLRV-xrRNA is a non-coding RNA that blocks the progression of 5′ to 3′ exoribonuclease using only a folded RNA element, and this family is described in RF04222.

    LitScan

    RNAcentral has recently developed LitScan, a tool to automatically connect non-coding RNA sequences, genes and families to the literature that discusses them. In this release we have integrated the LitScan widget into Rfam. The widget is now shown in the new ‘Publications’ tab on all Rfam families.

    Example of LitScan for the mir-17 microRNA precursor family. Publications can be sorted by citation, journal, year of publication and other fields.

    Please reach out to us with feedback on the widget, or if you would like to use the LitScan widget on your site!

    Dfam 3.6 release

    April 21, 2022

    We are pleased to announce the latest data release of the Dfam database! This latest release approximately doubles the number of species from the Dfam 3.5 release (595 to 1,109), and increases the number of transposable element (TE) families by ~2.5x (285,542 to 732,993). A more detailed summary of the species included can be seen in Table 1, and in the Dfam 3.6 release notes.
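The growth factors quoted above can be checked with a few lines of arithmetic. This is only a worked example using the counts given in this post; the helper name `fold_change` is hypothetical.

```python
def fold_change(before, after):
    """Return how many times larger `after` is than `before`."""
    return after / before


# Counts taken from the Dfam 3.5 and Dfam 3.6 figures quoted in the post.
species = fold_change(595, 1_109)
families = fold_change(285_542, 732_993)

print(f"species: {species:.2f}x, families: {families:.2f}x")
# prints "species: 1.86x, families: 2.57x"
```

Both results are consistent with the rounded descriptions in the text ("approximately doubles" and "~2.5x").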

    Community-submitted libraries

    A huge thank you to the TE community for submitting your data to us! In this release, we have: 1) 3,360 curated rice weevil TE models, submitted by Clément Goubert and Rita Rebollo [1]; 2) 22 SINE families obtained from 15 moth species (Lepidoptera insects), submitted by Guangjie Han et al. [2]; 3) 120 Penelope-classified families, submitted by Rory Craig et al. [3]; and 4) 41 repeat families generated as part of the T2T human assembly project [4], not including the 22 “composite” repetitive families, which will be available as part of a later Dfam release. To read more about the studies associated with these submissions, please see the references below.

    Rice weevil: an agricultural pest

    (Background copied from paper): The rice weevil Sitophilus oryzae is one of the most important agricultural pests, causing extensive damage to cereal in fields and to stored grains. S. oryzae has an intracellular symbiotic relationship (endosymbiosis) with the Gram-negative bacterium Sodalis pierantonius and is a valuable model to decipher host-symbiont molecular interactions. In the paper (see below), the authors show that many TE families are transcriptionally active, and changes in their expression are associated with insect endosymbiotic state.

    Moth SINEs: high diversity

    (Conclusions copied from paper): Lepidopteran insect genomes harbor a diversity of SINEs. The retrotransposition activity and copy number of these SINEs varies considerably between host lineages and SINE lineages. Host-parasite interactions facilitate the horizontal transfer of SINE between baculovirus and its lepidopteran hosts.

    Penelope elements: far-reaching impacts

    The authors investigate the Penelope (PLE) content of a wide variety of eukaryotes. (copied from paper): This paper uncovers the hitherto unknown PLE diversity, which spans all eukaryotic kingdoms, testifying to their ancient origins. 

    T2T entries: previously hidden genomic content

    A new human genome assembly has been released! The new assembly (T2T or chm13) has sequenced and assembled the remaining 10% of the human genome that was previously unattainable. The entries described in the manuscript are part of this newly-analyzed sequence.

    EBI libraries

    In collaboration with the European Bioinformatics Institute (EBI), we processed and imported RepeatModeler runs on 444 additional species, resulting in the addition of 440,543 families. Additional extension and re-classification steps were run on each model, and the final consensus sequences and HMMs were produced. Please note that the relationship data is not available for these uncurated imports at this time.

    References associated with community submissions

    1. Parisot, N., et al. (2021). The transposable element-rich genome of the cereal pest Sitophilus oryzae. BMC Biology, 19(1), 241. https://doi.org/10.1186/s12915-021-01158-2
    2. Han, G., et al. (2021). Diversity of short interspersed nuclear elements (SINEs) in lepidopteran insects and evidence of horizontal SINE transfer between baculovirus and lepidopteran hosts. BMC Genomics, 22(1), 226. https://doi.org/10.1186/s12864-021-07543-z
    3. Craig, R. J., et al. (2021). An Ancient Clade of Penelope-Like Retroelements with Permuted Domains Is Present in the Green Lineage and Protists, and Dominates Many Invertebrate Genomes. Molecular Biology and Evolution, 38(11), 5005–5020. https://doi.org/10.1093/molbev/msab225
    4. Hoyt, S. J., et al. (2022). From telomere to telomere: The transcriptional and epigenetic state of human repeat elements. Science (New York, N.Y.), 376(6588), eabk3112. https://doi.org/10.1126/science.abk3112

    Rfam release 14.7

    December 21, 2021

    We are happy to announce the latest Rfam release, version 14.7. The release includes 121 updated microRNA families, 4 new families, and a redesigned Rfam-PDB mapping pipeline that provides weekly updates as new RNA 3D structures become available. Read on to find out more or explore the data in Rfam.

    Updated microRNA families

    As part of the Rfam-miRBase synchronisation project discussed in the Rfam 14 paper, we continue revising microRNA families in Rfam using the data provided by miRBase. This release includes 121 updated families, such as mir-6 (RF00143) and mir-22 (RF00653). The following five microRNA families have been deleted from Rfam as the corresponding entries were removed from miRBase due to lack of evidence: mir-1937 (RF01942), mir-1280 (RF02013), mir-353 (RF00800), mir-720 (RF02002), and mir-2973 (RF02096). We would like to thank Lisanne Knol (University of Edinburgh) for bringing the first two of these families to our attention.

    We estimate that Rfam is now approximately 60% in sync with miRBase, with additional families to be released in future versions of Rfam. You can view the full list of updated families here or browse all microRNAs in Rfam.

    New families

    The release includes two new hairpin ribozyme families, Hairpin-meta1 (RF04190) and Hairpin-meta2 (RF04191), recently reported by the Weinberg lab. The hairpin ribozymes were discovered in metatranscriptome data and are proposed to occur in circular RNA genomes of as yet uncharacterised organisms. The new families join the original Hairpin ribozyme family (RF00173).

    Based on a recent paper we also created two additional bacterial families, the icd-II ncRNA motif (RF04189) and the carA ncRNA motif (RF04192). We would like to thank Ken Brewer (Yale University) for providing the data.

    Weekly updates of PDB structures matching Rfam families

    For many years Rfam maintained a mapping between Rfam families and experimentally determined RNA 3D structures available in PDB. However, this mapping lagged behind the weekly PDB updates as it was only updated with Rfam releases. 

    The newly implemented pipeline analyses the data every week and makes the data available on the Rfam website and in a new section of the FTP archive that contains a preview of the upcoming release. The new, up-to-date mapping also improves the ability to search PDBe using Rfam and is a key part of an ongoing project to review all Rfam families with known 3D structures. 

    Currently there are 127 Rfam families with experimentally determined RNA 3D structures in the PDB. For example, a recent paper describing how a viral RNA hijacks host machinery produced several structural models (7SAM, 7SC6, and 7SCQ) showing the pseudoknot of tRNA-like structure that are now mapped to Rfam family RF01084. Follow Rfam on Twitter to be the first to hear when new RNA families are linked to 3D structures.

    Other improvements

    • We continue improving the Gene Ontology (GO) terms associated with Rfam families, and in this release 412 families have been updated to use the latest GO terms. Keeping the GO terms up to date is important as Rfam is used for automatic assignment of GO terms in RNAcentral and other resources. The GO terms are shown in the Curation tab of each family and are also available in the rfam2go file.
    • The Rfam.seed_tree.tar.gz file hosted on the FTP archive has been fixed. We would like to thank Christian Anthon (University of Copenhagen) for reporting the problem.
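To give a sense of how the rfam2go file might be consumed downstream, here is a minimal parsing sketch. It assumes the file follows the common external2go line layout, `Rfam:RF00001 5S_rRNA > GO:term name ; GO:0003735`, with comment lines starting with `!`. That layout is an assumption, not something this post specifies, and the function name `parse_rfam2go` is hypothetical.

```python
import re
from collections import defaultdict

# Assumed line layout: "Rfam:RF00001 5S_rRNA > GO:term name ; GO:0003735"
LINE_RE = re.compile(
    r"^Rfam:(RF\d{5})\s+(\S+)\s+>\s+GO:(.+?)\s*;\s*(GO:\d{7})\s*$"
)


def parse_rfam2go(lines):
    """Map each Rfam accession to a list of (GO ID, GO term name) pairs."""
    mapping = defaultdict(list)
    for line in lines:
        if line.startswith("!"):
            continue  # header or comment line
        match = LINE_RE.match(line)
        if match:
            accession, _family_name, term, go_id = match.groups()
            mapping[accession].append((go_id, term))
        # Lines that do not fit the assumed layout are skipped silently.
    return dict(mapping)


if __name__ == "__main__":
    example = [
        "! illustrative lines only\n",
        "Rfam:RF00001 5S_rRNA > GO:ribosome ; GO:0005840\n",
    ]
    print(parse_rfam2go(example))
```

Because the function accepts any iterable of lines, it works equally well with an open file handle or a list of strings.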

    Get in touch

    As always, we look forward to hearing from you if you have any feedback or suggestions for Rfam. Please feel free to email us or get in touch on Twitter.

    Happy holidays from the Rfam team!

    This is the first release produced by Emma Cooke and Blake Sweeney who have joined Rfam in the second half of 2021. The Rfam team wishes you a happy holiday season! We look forward to creating lots more families in 2022 and working towards the next major Rfam release, Rfam 15.0!

    Dfam 3.5 release

    October 11, 2021

    We are pleased to announce the Dfam 3.5 release, which includes both new annotation data (available for download) and additional TE (transposable element) models and species.

    TE annotation data

    As part of this release, we have added annotation data for the curated TE entries for all of the extant species as part of the Zoonomia project (Figure 1). These data were curated by the combined work of David Ray’s lab at Texas Tech University (TTU) as well as the Smit group at the Institute for Systems Biology (ISB), and include young, lineage-specific TE models.

    Figure 1: Example of the genomic annotation for a Dfam TE model.

    TE models

    271 lineage-specific, curated LTR TE models for the reconstructed ancestor of New World monkeys have also been added as part of the Zoonomia project. Additional uncurated entries (DR records) have been added for duckweed (607 models) and Atlantic cod (2751 models), from TE families submitted to Dfam via the website interface. The next Dfam release will include additional submitted datasets. With the addition of these new families, Dfam now houses 285,542 TE models across 595 species (Figure 2; Figure 3). We look forward to the continued growth of Dfam!

    Figure 2: Dfam model growth. Numbers above each bar indicate the number of total models in Dfam at the time of the indicated release.
    Figure 3: Dfam species growth. Numbers above each bar indicate the number of total species in Dfam at the time of the indicated release.

    Rfam 14.6 is out

    July 27, 2021

    We are happy to announce a new release of Rfam (14.6) that includes 121 new microRNA families, a new ribozyme family, 8 new small RNA families found in Bacteroides, as well as 10 additional families with updated secondary structures using 3D structural information. Read on for more information or explore the data in Rfam.

    New microRNA families

    The new release includes 121 new microRNA families bringing the total number of microRNA families in Rfam to 1,506. This work is part of the ongoing collaboration with miRBase that aims to synchronise microRNAs across miRBase, Rfam, and RNAcentral. Browse Rfam microRNAs or find out more about the microRNA project.

    We also resolved an issue with 6 microRNA families that were missing a covariance model on the website and in the FTP archive. Many thanks to Dr Christian Anthon (University of Copenhagen) for pointing out this problem!

    Updating families using information from 3D structures

    Following on from Rfam 14.5, we updated the secondary structure of 10 additional families with 3D information, including 6 riboswitches, 1 ribozyme, 1 telomerase, 1 localization element and 1 microRNA precursor.

    In some families, the updated structure is substantially changed. For example, the central part of the flavin mononucleotide (FMN) riboswitch is now organised by several additional base pairs and two pseudoknots (pK). As a result, the updated structure is more compact and more accurately reflects the experimentally determined 3D structures.

    Seven of the updated families include newly annotated pseudoknots, which is an important improvement that helps better model long-distance non-nested interactions. We will continue reviewing and updating the families with 3D structure in future releases. The full list of the updated families can be found in the table below.

    Family | PDB structures | New pK
    RF00008 – Hammerhead ribozyme (type III) | 2QUS_A, 2QUS_B, 2QUW_A, 2QUW_B, 2QUW_C, 2QUW_D, 5DI2_A, 5DI4_A, 5DQK_A, 5EAO_A, 5EAQ_A | 1
    RF00025 – Ciliate telomerase RNA | 6D6V | 2
    RF00050 – FMN riboswitch (RFN element) | 3F2Q, 3F2T, 3F2W, 3F2X, 3F2Y, 3F3O | 2
    RF00059 – TPP riboswitch (THI element) | 2CKY_A, 2CKY_B, 2GDI_X, 2GDI_Y, 2HOJ_A, 2HOK_A, 2HOL_A, 2HOM_A, 2HOO_A, 2HOP_A, 3D2G_A, 3D2G_B, 3D2V_A, 3D2V_B, 3D2X_A, 3D2X_B, 3K0J_E, 3K0J_F, 4NYA_A, 4NYA_B, 4NYB_A, 4NYC_A, 4NYD_A, 4NYG_A | -
    RF00207 – K10 transport/localisation element (TLS) | 2KE6, 2KUR, 2KUU, 2KU, 2KUW | -
    RF00174 – Cobalamin riboswitch | 4GMA, 4GXY | 1
    RF00380 – ykoK leader / M-box riboswitch | 2QBZ_X, 3PDR_X, 3PDR_A | 1
    RF01689 – AdoCbl variant RNA | 4FRN_A, 4FRN_B, 4FRG_X, 4FRG_B | 1
    RF01831 – THF riboswitch | 3SD3, 3SUH, 3SUX, 3SUY, 4LVV, 4LVW, 4LVX, 4LVY, 4LVZ, 4LW0 | 1
    RF02095 – mir-2985-2 microRNA precursor | 2L3J | -

    Hovlinc ribozyme

    A recent paper by Chen Y et al. 2021 describes Hovlinc, a new type of self-cleaving ribozyme found in humans and other hominids. Hovlinc was detected in a very long intergenic noncoding RNA in hominids (hominin vlincRNA-located) using a genome-wide approach designed to discover self-cleaving ribozymes. The functions of vlincRNA and the hovlinc ribozyme remain unclear. Hovlinc joins 3 known classes of small, self-cleaving ribozymes found in humans: (1) Mammalian CPEB3 ribozyme, (2) Hammerhead ribozyme and (3) B2 and ALU retrotransposons. We would like to thank Dr Fei Qi (Huaqiao University) for providing the Hovlinc alignment. View the hovlinc family in Rfam.

    New Bacteroides families

    In a recent article, Ryan et al. 2020 report a high-resolution transcriptome map of the model organism Bacteroides thetaiotaomicron, a common bacterium of the human gut. They identify 269 non-coding RNA (ncRNA) candidates, of which nine were validated. Eight of these ncRNAs were integrated as new families:

    1. RF04177 – Bacteroides sRNA BTnc201
    2. RF04178 – Bacteroides sRNA BTnc005
    3. RF04179 – Bacteroides sRNA BTnc049
    4. RF04180 – Bacteroides sRNA BTnc231
    5. RF04181 – rteR sRNA
    6. RF04182 – GibS sRNA
    7. RF04183 – Bacteroidales small SRP
    8. RF04184 – Bacteroides sRNA BTnc060

    In addition, the RF01693 – Bacteroidales-1 family was renamed to 6S-Bacteroidales RNA. Bacteroidales-1 was first reported by Weinberg et al. 2010 in a comparative genomics-based analysis of genome and metagenome sequences. It was identified downstream of L20 ribosomal subunit genes in the order Bacteroidales. Ryan et al. 2020 report that this sRNA is a 6S RNA homolog in Bacteroides thetaiotaomicron. Rfam also has two other 6S RNA families: RF00013 – 6S/SsrS RNA and RF01685 – 6S-Flavo. We would like to thank Dr Lars Barquist (University of Würzburg) for providing the data.

    Welcome to Emma!

    A few weeks ago Emma Cooke joined the Rfam team as a Software Developer and is already busy working on new features. Emma has studied Genetics and Software Engineering, and her previous roles have focused on release verification pipelines, software testing, and developing for cloud environments. Please join us in welcoming Emma to the Rfam community and stay tuned for new announcements based on her work.

    Get in touch

    As always, we would be very happy to hear from you if you have any feedback or suggestions for Rfam. Please feel free to email us or get in touch on Twitter.

    Dfam 3.2 Release

    July 9, 2020

    Dfam is proud to announce the release of Dfam 3.2.  This release represents a significant step in the expansion of Dfam by providing early access to uncurated, de novo generated families.  As a demonstration of this new capability, we imported a set of 336 RepeatModeler-generated libraries produced by Fergal Martin and Denye Ogeh at the European Bioinformatics Institute (EBI).  Also in this release, Dfam now provides family alignments to the RepeatMasker TE protein database, aiding in the discovery of related families and in the classification of uncurated TEs.

    Uncurated Family Support

    In addition to the fully curated libraries for the model organisms human, mouse, zebrafish, worm and fly, Dfam also includes curated libraries for seven other species.  While a fully curated library is the ultimate goal, support for uncurated families has become an essential aspect of a TE resource due to the increasing rate at which new species are being sequenced and the need to have at least a simple TE masking library available.

    By standardizing the storage and tracking of uncurated families, it becomes possible to use these datasets to crudely mask an assembly, provide a first approximation of the TE content, and create a starting point for community curation efforts.  Due to the redundancy and fragmentation inherent in these datasets, we do not compute genome-specific thresholds or generate genome coverage plots for these families.  The latest update to the web portal includes new interfaces for uncurated families and some existing interfaces now include an option to include/omit uncurated families.

    In this release, Dfam now contains RepeatModeler de novo-produced libraries for an additional 336 species as the result of the collaboration with EBI researchers (denoted with the new uncurated accession prefix “DR”).  Notable taxa expansions include Sauropsida (lizards and birds) and fishes (bony and cartilaginous) (Table 1). Also included are Amphibia, Viridiplantae and additional species in Mammalia.

    Table 1. De novo-identified TE families from additional species

    Species | Number (species) | Retrotransposons | DNA transposons | Other
    Mammalia | 47 | 18301 | 3781 | 2567
    Sauropsida | 164 | 29326 | 11688 | 27192
    Amphibia | 6 | 1781 | 2031 | 6107
    Actinopterygii (bony fishes) | 116 | 27520 | 51361 | 77006
    Chondrichthyes (cartilaginous fishes) | 5 | 1671 | 198 | 2273
    Viridiplantae (green plants) | 2 | 896 | 412 | 1687
    Aligned Protein Features

    In previous versions of Dfam, hand-curated coding regions were provided for a select set of families.  The protein products of these curated sequences were placed in the RepeatMasker TE protein database for use with the RepeatProteinMask tool.  In this release we have used this database with BLASTX to produce alignments to all Dfam families including the uncurated entries.  The resulting alignments are displayed alongside the curated coding regions as the new “aligned” feature track (Figure 1).

    Figure 1. Feature track and details for BLASTX alignments to TE protein database.

    Website improvements

    Several minor improvements have been made to the interface since the previous release.  The browse page now provides links to download the families selected by the query/filter options as HMM, EMBL or FASTA records.  The Seed tab of the Families page now displays the average Kimura divergence of the seed alignment instances to the consensus.
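
    For readers unfamiliar with the measure, the sketch below shows one way an average Kimura divergence to a consensus could be computed. It assumes the standard Kimura two-parameter distance, d = -1/2 ln((1 - 2P - Q) sqrt(1 - 2Q)), where P and Q are the proportions of transitions and transversions. The release notes do not say which variant or corrections Dfam applies, so treat this as an illustration rather than the Dfam implementation. The function names are hypothetical.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def kimura_2p(seq, consensus):
    """Kimura two-parameter distance between two aligned DNA strings.

    Columns containing a gap or any non-ACGT symbol are skipped.
    """
    transitions = transversions = compared = 0
    for a, b in zip(seq.upper(), consensus.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        compared += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1    # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1  # purine<->pyrimidine
    if compared == 0:
        raise ValueError("no comparable columns")
    p = transitions / compared
    q = transversions / compared
    # math.log raises ValueError for very divergent pairs (argument <= 0).
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


def average_divergence(instances, consensus):
    """Mean Kimura distance of aligned instances to the consensus."""
    return sum(kimura_2p(s, consensus) for s in instances) / len(instances)


consensus = "ACGTACGTAC"
instances = ["ACGTACGTAC", "GCGTACGTAC"]  # second has one A->G transition
print(average_divergence(instances, consensus))
```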

    Pfam 33.1 is released

    June 11, 2020

    We are pleased to announce the release of Pfam 33.1! Some of you may have noticed that we never released Pfam 33.0 – we had initially planned to do so in March 2020, but due to the global pandemic, we redirected our efforts to updating the Pfam SARS-CoV-2 models instead (see previous blog posts Pfam SARS-CoV-2 special update and Pfam SARS-CoV-2 special update (part 2)). We have added these updated models to the Pfam 33.0 release, along with a few other families that we had built since the data for Pfam 33.0 were frozen, to create Pfam 33.1.

    Pfam 33.1 contains a total of 18259 families and 635 clans. Since the last release, we have built 355 new families and killed 25 families. We regularly receive feedback from users about families or domains that are missing in Pfam, and typically add many user-submitted families at each release. We include the submitter's name and ORCID identifier as an author of such Pfam entries. This helps people to get credit for community activities that improve molecular biology databases such as Pfam.

    One such user submission came from Heli Mönttinen (University of Helsinki), who submitted a large-scale clustering of virus families. Based on this clustering we added 88 new families to Pfam.

    Figure 1. Organisation of the TSP1 clan in Pfam shown as a sequence similarity network. Image taken from Xu et al.

    Finally, we are very happy to welcome Sara and Lowri, who are working as curators for both the Pfam and InterPro resources and are already making great contributions.

    Posted by Jaina and Alex

    Pfam SARS-CoV-2 special update (part 2)

    April 6, 2020

    This post presents an update to last week’s post. Since the initial release of the 40 Pfam profile HMMs that match SARS-CoV-2, we have now produced a set of flatfiles that are more typical of a Pfam release.  These files make our updated annotations describing the entries available for download before they are released via the Pfam website. Moreover, you can now use the multiple sequence alignments to investigate the conserved positions across different coronavirus proteins. Figure 1 shows the alignment of the SARS-CoV-2 receptor binding domain (PF09408; N.B. the Pfam website still shows the old alignment).

    Figure 1 – Excerpt of the Betacoronavirus spike glycoprotein S1, receptor binding domain alignment (Pfam accession PF09408), rendered using Jalview. The SARS-CoV-2 sequence is the last sequence in the alignment.

    Finally, we have made some very minor changes to the family descriptions and one name change from the last release.  You can now access all the updated files here:

    ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam_SARS-CoV-2_2.0/

    In this directory you can find the updated seed (Pfam-A.SARS-CoV-2.seed) and full alignments (Pfam-A.SARS-CoV-2.full) in Stockholm format based on the Pfamseq database, which contains sequences of the UniProt Reference Proteomes.  We provide a file with matches to UniProtKB 2019_08 (Pfam-A.SARS-CoV-2.full.uniprot). We also provide a set of alignments for each of the families, which include matches to the SARS-CoV-2 sequences that are not yet present in the Pfamseq database. These alignments can be found in aligned FASTA format here or as a tar gzipped library here.
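
    As a rough illustration of how the Stockholm files could be used to look at conserved positions, here is a minimal Python sketch. It is a simplified reader, not a complete Stockholm parser: it ignores all markup lines starting with “#”, concatenates interleaved blocks, and ends each alignment at “//”. The function names and the toy alignment (including the placeholder accession PF00000) are hypothetical.

```python
def read_stockholm(lines):
    """Yield one {name: aligned_sequence} dict per alignment.

    Simplified: markup lines (starting with '#') are skipped, and an
    alignment is only emitted when its closing '//' line is seen.
    """
    seqs = {}
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("//"):
            yield seqs
            seqs = {}
        elif stripped and not stripped.startswith("#"):
            name, chunk = stripped.split(None, 1)
            seqs[name] = seqs.get(name, "") + chunk.strip()


def conserved_columns(alignment):
    """Return 0-based columns where every sequence has the same residue."""
    rows = list(alignment.values())
    return [
        i for i, col in enumerate(zip(*rows))
        if len(set(col)) == 1 and col[0] not in "-."
    ]


# Toy alignment for illustration only; real files would be opened with open().
toy = """# STOCKHOLM 1.0
#=GF AC PF00000
seqA/1-8  ACDE-FGH
seqB/1-8  ACDEYFGH
//
""".splitlines()

for alignment in read_stockholm(toy):
    print(conserved_columns(alignment))
```

    For real analyses, an established parser such as the one in Biopython is likely a better choice, since it also handles the per-file and per-column annotation that this sketch discards.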

    Posted by The Pfam team