Posts Tagged ‘dfam’

Dfam 3.7: ~3.4 million TE models across 2346 taxa

January 12, 2023

We at Dfam are pleased to announce the latest data release! The Dfam 3.7 release includes additional raw and curated datasets across a wide range of taxa, resulting in a ~4.5x increase in the number of families compared to the previous Dfam 3.6 release. Please note the large size of the newest release and plan accordingly. It may be beneficial to use the API to filter and download only the data relevant to your project.
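As a rough illustration of filtering through the API rather than downloading the full release, the sketch below builds a query URL for family summaries restricted to one taxon. The endpoint path and the parameter names (`format`, `clade`, `clade_relatives`, `limit`, `start`) are assumptions based on the public Dfam API and should be checked against the current API documentation; `build_family_query` is a hypothetical helper, not part of Dfam.

```python
from urllib.parse import urlencode

# Assumed endpoint; confirm against the Dfam API documentation.
DFAM_FAMILIES_URL = "https://dfam.org/api/families"


def build_family_query(clade, limit=20, start=0, relatives=None):
    """Build a URL requesting one page of family summaries for a taxon.

    clade     -- NCBI taxonomy ID (e.g. 9606 for Homo sapiens)
    limit     -- page size, to keep each response small
    start     -- offset of the first record, for paging
    relatives -- optionally "ancestors" or "descendants" (assumed values)
    """
    params = {
        "format": "summary",
        "clade": str(clade),
        "limit": str(limit),
        "start": str(start),
    }
    if relatives is not None:
        params["clade_relatives"] = relatives
    return f"{DFAM_FAMILIES_URL}?{urlencode(params)}"


if __name__ == "__main__":
    # Print the URL only; fetching it is left to the caller.
    print(build_family_query(9606, limit=5, relatives="ancestors"))
```

The resulting URL can be fetched with any HTTP client. Paging with `limit` and `start` keeps individual requests small, which matters given the size of this release.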

EBI dataset contributes to the quadrupling of the Dfam database 

Our continued collaboration with Fergal Martin and Denye Ogeh from the European Bioinformatics Institute (EBI) has provided an additional 771 assemblies and their associated TE models that are now a part of the DR records in Dfam. This brings the total contribution of genomic data from EBI to 1551 species. The new data expands taxa such as Viridiplantae (green plants) and Actinopterygii (bony fishes), and broadens Dfam coverage with the addition of Echinodermata (starfishes, sea urchins/cucumbers) and Petromyzontiformes (lampreys). 

Community submissions – adding diversity to Dfam

Taro (Colocasia esculenta) – a threatened food staple

One of the most ancient cultivated crops, taro is a food staple in the Pacific Islands and the Caribbean, and is currently threatened by taro leaf blight (TLB). Some populations of taro are resistant to TLB, but the genetic basis for this resistance is unknown. As part of an effort to understand the genetic basis of TLB resistance, a taro de novo assembly was generated and the repetitive content was analyzed [1]. The high repetitive content (~82%) of this genome was positively correlated with genome size, with the potential to be linked to TLB resistance. Contributed by M. Renee Bellinger.

Gesneriaceae – understanding angiosperm morphological variation

A member of the plant family Gesneriaceae, the Cape Primrose Streptocarpus rexii has long been studied by evolutionary biologists due to its unique morphological aspects. Genetic resources are critical for studying the unique meristem evolution of this plant family. As such, a genome annotation pipeline was developed to address current technical challenges in genome annotation. Part of this effort included generating repeat libraries not only for the Cape Primrose, but also for Dorcoceras hygrometricum and Primulina huaijiensis [2]. Providing these libraries to Dfam will enhance the resources available for future genomic characterization of this plant family. Contributed by Kanae Nishii.

Mosquito (Anopheles coluzzii) – a human malaria vector

The adaptive flexibility of Anopheles coluzzii, a primary vector of human malaria, allows it to escape efforts to control the mosquito population with insecticides. As TEs are integral to adaptive processes in other species, it was hypothesized that TEs could underlie the rapid development of resistance in A. coluzzii to classic methods of intervention. Analyzing six individuals from two African localities allowed the authors to provide a comprehensive TE library [3]. This effort enhances the resources available to study the genomic architecture and gene regulation underpinning the success of this malaria vector. Contributed by Carlos Vargas and Josefa Gonzalez.

Water flea (Daphnia pulicaria) – a model organism to study climate change

Due to their short lifespans and reproductive capabilities, water fleas are used as a bioindicator to study the effects of toxins on an ecosystem, and are thus useful in studying climate change. A study of two ecological sister taxa – Daphnia pulicaria and Daphnia pulex – analyzed the evolutionary forces of recombination and gene density in driving the differentiation and divergence of the two aforementioned species [4]. TE content was analyzed as part of generating the new Daphnia pulicaria genome assembly.  Contributed by Mathew Wersebe.

601 insects – transposable element influence on species diversity 

TEs are drivers of evolution in eukaryotes. However, in some underrepresented taxa, TE dynamics are less well understood. To this end, 601 insect genomes spanning 20 Orders were analyzed for TE content to assess the variation within and among insect Orders [5]. This work highlights the need for community-submitted high-quality libraries. Contributed by John Sproul and Jacqueline Heckenhauer.

Analysis of six bat genomes – evolution of bat adaptations

Bats are an excellent example of complex adaptations, such as flight, echolocation, longevity and immunity. To enhance the genomic resources available for studying the development of complex traits, six high-quality genome assemblies were generated using long- and short-read technologies (Rhinolophus ferrumequinum, Rousettus aegyptiacus, Phyllostomus discolor, Myotis myotis, Pipistrellus kuhlii and Molossus molossus) [6]. As part of the effort to annotate these new genome assemblies, the TE content was analyzed. These six genomes displayed a wide range of diversity in TE content, perhaps contributing to their complex traits. Contributed by Kevin Sullivan and David Ray.

LTR7/ERVH – transcriptional regulation in the human embryo

The mechanism by which human endogenous retrovirus type-H (HERVH) exerts regulatory activities fostering self-renewal and pluripotency in the pre-implantation embryo is unknown. To elucidate this mechanism, the transcription dynamics and sequence signature evolution of HERVH were analyzed [7]. This study not only revealed previously undefined LTR7 subfamilies, but also provided a comprehensive phyloregulatory analysis of all the identified subfamilies against locus-specific regulatory data available in genome-wide assays of embryonic stem cells (ESCs), providing evidence for subfamily-specific promoter activity. The complex evolutionary history of LTR7 is mirrored in the transcriptional partitioning that takes place during early embryonic development. Contributed by Thomas Carter, Cédric Feschotte, and Arian Smit.

References

1. Bellinger, M. R., Paudel, R., Starnes, S., Kambic, L., Kantar, M. B., Wolfgruber, T., Lamour, K., Geib, S., Sim, S., Miyasaka, S. C., Helmkampf, M., & Shintaku, M. (2020). Taro Genome Assembly and Linkage Map Reveal QTLs for Resistance to Taro Leaf Blight. G3 (Bethesda, Md.), 10(8), 2763–2775. https://doi.org/10.1534/g3.120.401367

2. Nishii, K., Hart, M., Kelso, N., Barber, S., Chen, Y. Y., Thomson, M., Trivedi, U., Twyford, A. D., & Möller, M. (2022). The first genome for the Cape Primrose Streptocarpus rexii (Gesneriaceae), a model plant for studying meristem-driven shoot diversity. Plant Direct, 6(4), e388. https://doi.org/10.1002/pld3.388

3. Vargas-Chavez, C., Longo Pendy, N. M., Nsango, S. E., Aguilera, L., Ayala, D., & González, J. (2022). Transposable element variants and their potential adaptive impact in urban populations of the malaria vector Anopheles coluzzii. Genome Research, 32(1), 189–202. https://doi.org/10.1101/gr.275761.121

4. Wersebe, M. J., Sherman, R. E., Jeyasingh, P. D., & Weider, L. J. (2022). The roles of recombination and selection in shaping genomic divergence in an incipient ecological species complex. Molecular Ecology. Advance online publication. https://doi.org/10.1111/mec.16383

5. Sproul, J. S., Hotaling, S., Heckenhauer, J., Powell, A., Larracuente, A. M., Kelley, J. L., Pauls, S. U., & Frandsen, P. B. (2022). Repetitive elements in the era of biodiversity genomics: insights from 600+ insect genomes. bioRxiv 2022.06.02.494618. https://doi.org/10.1101/2022.06.02.494618

6. Jebb, D., Huang, Z., Pippel, M., Hughes, G. M., Lavrichenko, K., Devanna, P., Winkler, S., Jermiin, L. S., Skirmuntt, E. C., Katzourakis, A., Burkitt-Gray, L., Ray, D. A., Sullivan, K. A. M., Roscito, J. G., Kirilenko, B. M., Dávalos, L. M., Corthals, A. P., Power, M. L., Jones, G., Ransome, R. D., … Teeling, E. C. (2020). Six reference-quality genomes reveal evolution of bat adaptations. Nature, 583(7817), 578–584. https://doi.org/10.1038/s41586-020-2486-3

7. Carter, T. A., Singh, M., Dumbović, G., Chobirko, J. D., Rinn, J. L., & Feschotte, C. (2022). Mosaic cis-regulatory evolution drives transcriptional partitioning of HERVH endogenous retrovirus in the human embryo. eLife, 11, e76257. https://doi.org/10.7554/eLife.76257

    Dfam 3.6 release

    April 21, 2022

    We are pleased to announce the latest data release of the Dfam database! This latest release approximately doubles the number of species from the Dfam 3.5 release (595 to 1,109), and increases the number of transposable element (TE) families by ~2.5x (285,542 to 732,993). A more detailed summary of the species included can be seen in Table 1, and in the Dfam 3.6 release notes.

    Community-submitted libraries

    A huge thank you to the TE community for submitting your data to us! In this release, we have: 1) 3,360 curated rice weevil TE models, submitted by Clément Goubert and Rita Rebollo [1]; 2) 22 SINE families obtained from 15 moth species (Lepidoptera insects), submitted by Guangjie Han et al. [2]; 3) 120 Penelope-classified families, submitted by Rory Craig et al. [3]; and 4) 41 repeat families generated as part of the T2T human assembly project [4], not including the 22 “composite” repetitive families, which will be available as part of a later Dfam release. To read more about the studies associated with these submissions, please see the references below.

    Rice weevil: an agricultural pest

    (Background copied from paper): The rice weevil Sitophilus oryzae is one of the most important agricultural pests, causing extensive damage to cereal in fields and to stored grains. S. oryzae has an intracellular symbiotic relationship (endosymbiosis) with the Gram-negative bacterium Sodalis pierantonius and is a valuable model to decipher host-symbiont molecular interactions. In the paper (see below), the authors show that many TE families are transcriptionally active, and changes in their expression are associated with insect endosymbiotic state.

    Moth SINEs: high diversity

    (Conclusions copied from paper): Lepidopteran insect genomes harbor a diversity of SINEs. The retrotransposition activity and copy number of these SINEs varies considerably between host lineages and SINE lineages. Host-parasite interactions facilitate the horizontal transfer of SINE between baculovirus and its lepidopteran hosts.

    Penelope elements: far-reaching impacts

    The authors investigate the Penelope (PLE) content of a wide variety of eukaryotes. (copied from paper): This paper uncovers the hitherto unknown PLE diversity, which spans all eukaryotic kingdoms, testifying to their ancient origins. 

    T2T entries: previously hidden genomic content

    A new human genome assembly has been released! The new assembly (T2T or chm13) has sequenced and assembled the remaining 10% of the human genome that was previously unattainable. The entries described in the manuscript are part of this newly-analyzed sequence.

    EBI libraries

    In collaboration with the European Bioinformatics Institute (EBI), we processed and imported RepeatModeler runs on 444 additional species, resulting in the addition of 440,543 families. Additional extension and re-classification steps were run on each model, and the final consensus sequences and HMMs were produced. Please note that relationship data is not available for these uncurated imports at this time.

    References associated with community submissions

    1. Parisot, N., et al. (2021). The transposable element-rich genome of the cereal pest Sitophilus oryzae. BMC Biology, 19(1), 241. https://doi.org/10.1186/s12915-021-01158-2
    2. Han, G., et al. (2021). Diversity of short interspersed nuclear elements (SINEs) in lepidopteran insects and evidence of horizontal SINE transfer between baculovirus and lepidopteran hosts. BMC Genomics, 22(1), 226. https://doi.org/10.1186/s12864-021-07543-z
    3. Craig, R. J., et al. (2021). An Ancient Clade of Penelope-Like Retroelements with Permuted Domains Is Present in the Green Lineage and Protists, and Dominates Many Invertebrate Genomes. Molecular Biology and Evolution, 38(11), 5005–5020. https://doi.org/10.1093/molbev/msab225
    4. Hoyt, S. J., et al. (2022). From telomere to telomere: The transcriptional and epigenetic state of human repeat elements. Science (New York, N.Y.), 376(6588), eabk3112. https://doi.org/10.1126/science.abk3112

    Dfam 3.5 release

    October 11, 2021

    We are pleased to announce the Dfam 3.5 release, which includes both new annotation data (available for download) and additional TE (transposable element) models and species.

    TE annotation data

    As part of this release, we have added annotation data for the curated TE entries for all of the extant species as part of the Zoonomia project (Figure 1). These data were curated by the combined work of David Ray’s lab at Texas Tech University (TTU) as well as the Smit group at the Institute for Systems Biology (ISB), and include young, lineage-specific TE models.

    Figure 1: Example of the genomic annotation for a Dfam TE model.

    TE models

    As part of the Zoonomia project, 271 lineage-specific, curated LTR TE models for the reconstructed ancestor of New World monkeys have also been added. Additional uncurated entries (DR records) have been added for the duckweed (607 models) and Atlantic cod (2751 models) as part of TE families submitted to Dfam via the website interface. The next Dfam release will include additional submitted datasets. With the addition of these new families, Dfam now houses 285,542 TE models across 595 species (Figure 2; Figure 3). We look forward to the continued growth of Dfam!

    Figure 2: Dfam model growth. Numbers above each bar indicate the number of total models in Dfam at the time of the indicated release.
    Figure 3: Dfam species growth. Numbers above each bar indicate the number of total species in Dfam at the time of the indicated release.

    Dfam 3.4 Release

    July 24, 2021

    The Dfam Consortium is proud to announce the release of Dfam 3.4. This update includes over 8,200 curated transposable element (TE) families found in 240 mammalian genomes. The models therein have been carefully developed by David Ray’s lab at Texas Tech University (TTU) and further refined by Arian Smit. This is part of an ongoing effort to generate a comprehensive mammalian TE library using multi-species alignments and ancestral sequence reconstructions generated by the Zoonomia project (https://zoonomiaproject.org).

    In addition to releasing the curated TE families, full genome annotations are provided for 21 Old World monkeys (Figure 1; Figure 2).

    Figure 1: A portion of the available genomes aligned as part of the Zoonomia project, focused on the Primate Order.

    Discovery of young, species-specific TEs

    As a large portion of a mammalian genome, TEs serve as a source for genomic variation and innovation, including (but certainly not limited to) genomic rearrangement via movement and non-homologous recombination, and the provision of novel transcription factor binding sites. David Ray’s lab has undertaken the first large-scale effort to examine the TE content of the extant genomes in the Zoonomia project, in order to determine TE type and location and, subsequently, the impact TEs might have had on the evolution of each lineage of mammals.

    Methods

    A total of 248 final genome assemblies of placental mammals were initially presented for analysis, most coming from the Zoonomia dataset. Low quality assemblies and previously analyzed genomes were excluded from analyses. To avoid wasted effort on re-curation of previously described TEs, manual curation efforts were focused towards identifying newer putative TEs that underwent relatively recent accumulation, with the main assumption being that many older TEs will be widely shared among large groups of placental mammals and that previous annotation efforts have thoroughly described these older elements in detail.

    To classify younger TEs, the filtered dataset was narrowed to elements that have undergone transposition in the recent past, i.e. TEs whose insertions have Kimura 2-parameter (K2P) distances of less than 4.4% (roughly 20 My or less since insertion, based on a general mammalian neutral mutation rate of 2.2×10⁻⁹ substitutions per site per year). This approach yielded mostly lineage-specific TEs, many of which had not been previously described.
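To make the age cutoff concrete, here is a minimal sketch of the arithmetic, not the lab's actual pipeline. It assumes the distance is measured between an insertion and its family consensus, so that age is simply distance divided by rate, and that the rate is expressed per site per year. The function names are hypothetical.

```python
import math


def k2p_distance(seq_a, seq_b):
    """Kimura 2-parameter distance between two aligned sequences.

    Columns containing gaps or ambiguous bases are skipped.
    """
    purines = {"A", "G"}
    pyrimidines = {"C", "T"}
    sites = transitions = transversions = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        sites += 1
        if a == b:
            continue
        pair = {a, b}
        if pair <= purines or pair <= pyrimidines:
            transitions += 1      # A<->G or C<->T
        else:
            transversions += 1    # purine <-> pyrimidine
    if sites == 0:
        raise ValueError("no comparable sites in alignment")
    p = transitions / sites
    q = transversions / sites
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


def insertion_age_years(distance, rate_per_site_per_year=2.2e-9):
    """Approximate time since insertion, assuming a constant neutral rate."""
    return distance / rate_per_site_per_year


if __name__ == "__main__":
    # The 4.4% cutoff corresponds to roughly 20 million years.
    print(f"{insertion_age_years(0.044) / 1e6:.1f} My")
```

With the stated rate, a K2P distance of 0.044 gives 0.044 / 2.2×10⁻⁹ = 2×10⁷ years, which matches the ~20 My threshold quoted above.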

    For each iteration of manual TE curation, new consensus sequences were generated by taking the 10-50 top BLAST hits, aligning these sequences with MUSCLE, and estimating a consensus sequence with EMBOSS.
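The consensus step can be pictured with a simple majority-rule sketch over an existing multiple alignment. This is an illustration only: it is not the algorithm EMBOSS uses, and the `min_fraction` threshold and the handling of gap-majority columns are assumptions made for the example.

```python
from collections import Counter


def majority_consensus(aligned, min_fraction=0.5, gap="-"):
    """Majority-rule consensus of equal-length aligned sequences.

    Columns where a gap is the most common character are dropped;
    columns without a sufficiently common base are reported as 'N'.
    """
    if not aligned:
        raise ValueError("no sequences given")
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("sequences must be aligned to the same length")

    consensus = []
    for column in zip(*aligned):
        counts = Counter(c.upper() for c in column)
        base, n = counts.most_common(1)[0]
        if base == gap:
            continue  # most copies lack this position
        consensus.append(base if n / len(column) >= min_fraction else "N")
    return "".join(consensus)


if __name__ == "__main__":
    hits = ["ACGT-A", "ACGTTA", "ACCT-A", "TCGT-A"]
    print(majority_consensus(hits))
```

In practice the alignment would come from MUSCLE output for the selected BLAST hits, and the resulting consensus would seed the next round of searching.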

    To reduce library redundancy, the potential TE consensus sequences were combined with those of known TEs from previous work as well as all known vertebrate TEs from Repbase. The program CD-HIT-EST was used to identify duplicate TEs among our combined TE library according to the 80-80-80 rule of Wicker et al.
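As a small sketch of what the 80-80-80 rule checks for a single pairwise comparison: the aligned region should be at least 80 bp long, share at least 80% identity, and cover at least 80% of the sequence. Measuring coverage against the shorter sequence is an assumption of this example, as is the function name. The actual CD-HIT-EST settings used are not given in the text.

```python
def same_family_80_80_80(aligned_length, identity, shorter_length,
                         min_length=80, min_identity=0.80, min_coverage=0.80):
    """Return True if a pairwise hit satisfies the 80-80-80 rule.

    aligned_length -- length of the aligned region in bp
    identity       -- fraction of identical bases in that region (0 to 1)
    shorter_length -- length of the shorter of the two sequences in bp
    """
    if shorter_length <= 0:
        raise ValueError("shorter_length must be positive")
    coverage = aligned_length / shorter_length
    return (aligned_length >= min_length
            and identity >= min_identity
            and coverage >= min_coverage)


if __name__ == "__main__":
    # 400 bp aligned at 85% identity, covering 400/450 of the shorter sequence
    print(same_family_80_80_80(400, 0.85, 450))
```

Pairs that pass would be treated as redundant copies of the same family, with one representative retained in the library.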

    To confirm the TE type, each sequence in the library was subjected to a custom pipeline which used blastx to confirm the presence of known ORFs in autonomous elements, Repbase to identify known elements, and TEclass to predict the TE type. In addition, structural criteria were used to categorize TEs: DNA transposons were required to have visible terminal inverted repeats; rolling circle transposons were required to have an identifiable ACTAG at one end; putative SINEs were inspected for a repetitive tail as well as A and B boxes; and LTR retrotransposons were required to have recognizable hallmarks, such as TG, TGT, or TGTT at their 5’ end and the inverse at their 3’ end.
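Two of these structural checks are easy to sketch in code. The example below is a simplified illustration rather than the custom pipeline itself: it reads "the inverse" as the reverse complement (TG...CA, TGT...ACA, TGTT...AACA), and the TIR length and mismatch tolerance are arbitrary assumptions. All function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an unambiguous DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def has_tir(seq, min_len=10, max_mismatches=1):
    """True if the first min_len bases pair with the last min_len bases
    as an inverted repeat, allowing a few mismatches."""
    seq = seq.upper()
    if len(seq) < 2 * min_len:
        return False
    head = seq[:min_len]
    tail_rc = reverse_complement(seq[-min_len:])
    mismatches = sum(a != b for a, b in zip(head, tail_rc))
    return mismatches <= max_mismatches


def ltr_hallmark(seq):
    """Return the longest 5' motif (TGTT, TGT or TG) whose reverse
    complement ends the sequence, or None if no hallmark is found."""
    seq = seq.upper()
    for motif in ("TGTT", "TGT", "TG"):
        if seq.startswith(motif) and seq.endswith(reverse_complement(motif)):
            return motif
    return None


if __name__ == "__main__":
    tir = "CAGGGGTCTG"
    element = tir + "ACGTACGTAC" + reverse_complement(tir)
    print(has_tir(element), ltr_hallmark("TGTTAAGGCCAACA"))
```

Checks like these only flag candidates. As described above, they were combined with homology and classifier evidence before a TE type was assigned.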

    Zoonomia Project

    Figure 2: Summary of the Zoonomia project

    The Zoonomia project is an effort to understand the mammalian tree of life at a deeper level. This massive undertaking is the collaboration of 27 laboratories. Although far from a complete list, some current projects derived from the Zoonomia datasets include: studying mammalian speech development, regulatory element analyses, chromosome evolution and the evolution of microRNA genes.

    Future Work

    Future efforts will continue to analyze and catalog lineage-specific TEs in deeper branches of the 240-way genome alignment, using the reconstructed genomes at each node of the phylogenetic tree, and will expand the full genome annotations available on Dfam.

    Curation with Dfam: new data and platform updates

    March 17, 2020

    DNA transposon termini signatures

    The Dfam consortium is excited to announce the generation and release of terminal repeat sequence signatures for class II DNA transposable elements. The termini of class II elements are crucial for movement, and as such, can be used to classify de novo DNA transposable element families in new genomic sequences (Figure 1).

    Figure 1. Major subgroups of class II DNA transposons.

    The LOGOs of the termini can be viewed on the “Classifications” tab on the Dfam website and are organized by class II subclasses (e.g., Crypton, Helitron, TIR, etc.) (Figure 2). This allows for easy visualization of the base conservation at each position in the terminal sequences and comparisons between the 5’ and 3’ termini (Figure 2). In addition, the termini profiles are available for download as a .HMM file.

    Figure 2. Sample termini signature visualization on the Dfam website (www.dfam.org). Base conservation can be seen via the LOGOs of the 5’, 3’ and combined edge (termini) HMMs. The movement type can be seen preceding DNA transposons that move via a common mechanism (e.g. “Circular dsDNA intermediate”). The number of families used to generate the LOGOs is indicated, as well as the subclass name (e.g. “Crypton_A”). Additional notes on the termini, when relevant, are also available.

    Community data submissions

    We have taken the first small step towards a community-driven data curation platform by developing a new data submission system. Initially, this will facilitate the process of uploading data to the site for processing by the curators. As we move forward, further aspects of the curation process will be made available to the community. Upon creating an account and logging in, users can submit files to Dfam using our web-based upload page. Here you will also find information about submission requirements and how different levels of library quality are handled in Dfam.

    Moving to xfam.org

    May 1, 2014

    Back in November 2012 we announced that the Xfam team in the UK was moving from the Wellcome Trust Sanger Institute to the European Bioinformatics Institute (EMBL-EBI), just next door on the Wellcome Trust Genome Campus. On Tuesday we completed that move by switching off the Pfam and Rfam websites inside Sanger and redirecting all traffic to our shiny new home at xfam.org. You can now find the Pfam and Rfam websites at pfam.xfam.org and rfam.xfam.org respectively.

    TreeFam 9 is now available!

    May 3, 2013

    We are happy to announce that TreeFam 9 is online and you can find it under http://www.treefam.org.

    TreeFam 9 now has 109 species (vs. 79 in TreeFam 8) and is based on data from Ensembl v69, Ensembl Genomes v16, Wormbase and JGI.

    This release marks an important step for TreeFam as it is the first release built since TreeFam was resurrected.
    Here is a list of the most important changes in TreeFam 9:

    • New website layout (adopting the Pfam/Rfam/Dfam layout)
    • Infrastructure move of web servers and databases to the EBI
    • Sequence search against the library of TreeFam family profiles
    • New tree visualisations in pure JavaScript using D3, e.g. see the BRCA2 gene tree here.
    • Pairwise homology download

    We hope you find all the information you are looking for. If you don’t, please let us know so that we can include the information you want. The old website will remain online here.

    If you have questions, suggestions or find bugs, don’t hesitate to contact us through our new forum here.

    Happy treefamming,

    the TreeFam team
    (Fabian, Mateus)

    Dfam paper is an NAR “featured article”; RepeatMasker4 is out

    January 12, 2013

    We are pleased to announce that the Dfam paper (“Dfam: a database of repetitive DNA based on profile hidden Markov models”) is now available in the 2013 NAR Database issue, and has been selected as a “featured article” (meaning the NAR editorial board thinks it is among “the top 5% of papers in terms of originality, significance and scientific excellence”).

    In other exciting news, two members of the Dfam consortium, Arian Smit and Robert Hubley (Institute for Systems Biology, Seattle), just released RepeatMasker 4.0. This is a major update that, among other important improvements, adds support for searching with Dfam and nhmmer. Go get yourself a copy at http://www.repeatmasker.org/

    Posted by Travis

    Dfam 1.1 released

    November 15, 2012

    We are pleased to announce that we’ve released Dfam 1.1. This version represents a few important changes from 1.0, including updated hit results, a new tab for each entry page showing relationships to other entries, and improved handling of redundant profile hits.


    Dfam: A database of repetitive DNA elements

    September 6, 2012

    We are pleased to introduce Dfam 1.0, a database of profile HMMs for repetitive DNA elements. Repetitive DNA, especially the remnants of transposable elements, makes up a large fraction of many genomes, especially eukaryotic. Accurate annotation of these TEs both simplifies downstream genomic analysis and enables research into their fascinating biology and impact on the genome.
