Archive for the 'Dfam' Category

Dfam 3.5 release

October 11, 2021

We are pleased to announce the Dfam 3.5 release, which includes both new annotation data (available for download) and additional TE (transposable element) models and species.

TE annotation data

As part of this release, we have added annotation data for the curated TE entries for all of the extant species in the Zoonomia project (Figure 1). These entries were curated jointly by David Ray’s lab at Texas Tech University (TTU) and the Smit group at the Institute for Systems Biology (ISB), and include young, lineage-specific TE models.

Figure 1: Example of the genomic annotation for a Dfam TE model.

TE models

We have also added 271 lineage-specific, curated LTR TE models for the reconstructed ancestor of New World monkeys as part of the Zoonomia project. Additional uncurated entries (DR records) have been added for duckweed (607 models) and Atlantic cod (2751 models) from TE families submitted to Dfam via the website interface. The next Dfam release will include additional submitted datasets. With the addition of these new families, Dfam now houses 285,542 TE models across 595 species (Figure 2; Figure 3). We look forward to the continued growth of Dfam!

Figure 2: Dfam model growth. Numbers above each bar indicate the number of total models in Dfam at the time of the indicated release.
Figure 3: Dfam species growth. Numbers above each bar indicate the number of total species in Dfam at the time of the indicated release.

Dfam 3.4 Release

July 24, 2021

The Dfam Consortium is proud to announce the release of Dfam 3.4. This update includes over 8,200 curated transposable element (TE) families found in 240 mammalian genomes. The models therein have been carefully developed by David Ray’s lab at Texas Tech University (TTU) and further refined by Arian Smit. This is part of an ongoing effort to generate a comprehensive mammalian TE library using multi-species alignments and ancestral sequence reconstructions generated by the Zoonomia project (https://zoonomiaproject.org).

In addition to releasing the curated TE families, full genome annotations are provided for 21 Old World monkeys (Figure 1; Figure 2).

Figure 1: A portion of the available genomes aligned as part of the Zoonomia project, focused on the Primate Order.

Discovery of young, species-specific TEs

TEs make up a large portion of a mammalian genome and serve as a source of genomic variation and innovation, including (but certainly not limited to) genomic rearrangement via movement and non-homologous recombination, and the provision of novel transcription factor binding sites. David Ray’s lab has undertaken the first large-scale effort to examine the TE content of the extant genomes in the Zoonomia project, in order to determine TE type and location and, subsequently, the impact TEs might have on the evolution of each mammalian lineage.

Methods

A total of 248 final genome assemblies of placental mammals were initially presented for analysis, most coming from the Zoonomia dataset. Low quality assemblies and previously analyzed genomes were excluded from analyses. To avoid wasted effort on re-curation of previously described TEs, manual curation efforts were focused towards identifying newer putative TEs that underwent relatively recent accumulation, with the main assumption being that many older TEs will be widely shared among large groups of placental mammals and that previous annotation efforts have thoroughly described these older elements in detail.

To classify younger TEs, the filtered dataset was narrowed to elements that have undergone transposition in the recent past, i.e., TEs whose insertions have Kimura 2-parameter (K2P) distances of less than 4.4% (roughly 20 million years or less since insertion, based on a general mammalian neutral mutation rate of 2.2×10⁻⁹ substitutions per site per year). This approach yielded mostly lineage-specific TEs, many of which had not been previously described.
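The arithmetic behind the 4.4% cutoff can be sketched as follows. This is a minimal illustration, not the pipeline used for the release: the function names are hypothetical, the K2P distance is computed with the standard Kimura (1980) formula on a pair of pre-aligned sequences, and the age estimate assumes that the distance is measured from each insertion to its consensus and simply divided by the neutral rate quoted above.

```python
import math

# Purine<->purine and pyrimidine<->pyrimidine changes are transitions.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def k2p_distance(seq_a, seq_b):
    """Kimura 2-parameter distance between two aligned, equal-length sequences.

    Columns containing gaps or ambiguity codes are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = transitions = transversions = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        compared += 1
        if a == b:
            continue
        if (a, b) in TRANSITIONS:
            transitions += 1
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no comparable columns")
    p = transitions / compared    # proportion of transition differences
    q = transversions / compared  # proportion of transversion differences
    inner = (1 - 2 * p - q) * math.sqrt(1 - 2 * q)
    if inner <= 0:
        raise ValueError("sequences too divergent for a K2P estimate")
    return -0.5 * math.log(inner)


def estimated_age_years(distance, rate_per_site_per_year=2.2e-9):
    """Approximate time since insertion, assuming a constant neutral rate."""
    return distance / rate_per_site_per_year


# The 4.4% cutoff corresponds to roughly 20 million years.
print(round(estimated_age_years(0.044) / 1e6, 1), "My")
```

Under these assumptions, 0.044 / 2.2×10⁻⁹ gives about 2×10⁷ years, which matches the approximate 20 My figure quoted above.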

For each iteration of manual TE curation, new consensus sequences were generated by taking the 10-50 top BLAST hits, aligning these sequences with MUSCLE, and estimating a consensus sequence with EMBOSS.
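To illustrate the last step, the sketch below derives a consensus from an already aligned set of sequences by a simple plurality vote per column. It is not the EMBOSS algorithm; the function name, the 50% threshold, and the choice to drop gap-dominated columns are assumptions made for the example.

```python
from collections import Counter


def plurality_consensus(aligned, min_fraction=0.5, gap="-", ambiguous="N"):
    """Plurality consensus of equal-length aligned sequences.

    For each column the most common symbol is chosen. Columns where the
    gap symbol wins are dropped; columns where the winner falls below
    min_fraction of the sequences are reported as `ambiguous`.
    """
    if not aligned:
        raise ValueError("need at least one sequence")
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all sequences must have the same aligned length")

    consensus = []
    for column in zip(*aligned):
        counts = Counter(base.upper() for base in column)
        base, count = counts.most_common(1)[0]
        if base == gap:
            continue  # most sequences have no base here
        if count / len(column) >= min_fraction:
            consensus.append(base)
        else:
            consensus.append(ambiguous)
    return "".join(consensus)


alignment = ["ACGT-A", "ACGTTA", "ACCT-A"]
print(plurality_consensus(alignment))
```

In practice the input would be the MUSCLE alignment of the top BLAST hits; the toy alignment here only shows how disagreeing and gap-dominated columns are handled.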

To reduce library redundancy, the potential TE consensus sequences were combined with those of known TEs from previous work as well as all known vertebrate TEs from Repbase. The program CD-HIT-EST was used to identify duplicate TEs among our combined TE library according to the 80-80-80 rule of Wicker et al.
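As a rough illustration, the 80-80-80 rule is commonly read as: two sequences belong to the same family if they align over at least 80 bp, with at least 80% identity, across at least 80% of the sequence. The check below encodes that reading. The function name is hypothetical, and measuring coverage against the shorter of the two sequences is an assumption; it does not reproduce the CD-HIT-EST settings used for the release.

```python
def same_family_80_80_80(aligned_length, identity, shorter_length,
                         min_length=80, min_identity=0.80, min_coverage=0.80):
    """Return True if a pairwise alignment satisfies the 80-80-80 rule.

    aligned_length: length of the aligned region in bp
    identity:       fraction of identical positions in the aligned region
    shorter_length: length in bp of the shorter of the two sequences
                    (assumption: coverage is measured against this one)
    """
    if shorter_length <= 0:
        raise ValueError("shorter_length must be positive")
    coverage = aligned_length / shorter_length
    return (aligned_length >= min_length
            and identity >= min_identity
            and coverage >= min_coverage)


# A 400 bp alignment at 85% identity covering a 450 bp consensus: same family.
print(same_family_80_80_80(400, 0.85, 450))
# A 60 bp match is too short, however similar it is.
print(same_family_80_80_80(60, 0.95, 70))
```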

To confirm the TE type, each sequence in the library was run through a custom pipeline that used blastx to confirm the presence of known ORFs in autonomous elements, Repbase to identify known elements, and TEclass to predict the TE type. Structural criteria were also used to categorize TEs: DNA transposons were required to have visible terminal inverted repeats; rolling circle transposons were required to have an identifiable ACTAG at one end; putative SINEs were inspected for a repetitive tail as well as A and B boxes; and LTR retrotransposons were required to have recognizable hallmarks, such as TG, TGT, or TGTT at their 5’ ends and the inverse at their 3’ ends.
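Two of these structural checks are simple enough to sketch in code. The example below is only an illustration of the idea, not the custom pipeline itself: the function names, the 10 bp window, and the mismatch tolerance are all assumptions, and real curation relied on visual inspection as well.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def has_terminal_inverted_repeats(seq, length=10, max_mismatches=2):
    """True if the first `length` bases match the reverse complement of the
    last `length` bases, allowing up to `max_mismatches` differences."""
    seq = seq.upper()
    if len(seq) < 2 * length:
        return False
    left = seq[:length]
    right_rc = reverse_complement(seq[-length:])
    mismatches = sum(1 for a, b in zip(left, right_rc) if a != b)
    return mismatches <= max_mismatches


def has_ltr_termini(seq):
    """True if the sequence starts with TG and ends with CA.

    TG, TGT and TGTT all begin with TG, and their inverses (CA, ACA,
    AACA) all end with CA, so this covers the three listed variants.
    """
    seq = seq.upper()
    return seq.startswith("TG") and seq.endswith("CA")


tir = "CAGTGGTTCA"
candidate = tir + "ACGTACGTTTGACC" + reverse_complement(tir)
print(has_terminal_inverted_repeats(candidate))
print(has_ltr_termini("TGTTAGGCCAACA"))
```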

Zoonomia Project

Figure 2: Summary of the Zoonomia project

The Zoonomia project is an effort to understand the mammalian tree of life at a deeper level. This massive undertaking is the collaboration of 27 laboratories. Although far from a complete list, some current projects derived from the Zoonomia datasets include: studying mammalian speech development, regulatory element analyses, chromosome evolution and the evolution of microRNA genes.

Future Work

Future efforts will continue to analyze and catalog lineage-specific TEs in deeper branches of the 240-way genome alignment, using the reconstructed genomes at each node of the phylogenetic tree, and will expand the full genome annotations available on Dfam.

Dfam 3.2 Release

July 9, 2020

Dfam is proud to announce the release of Dfam 3.2.  This release represents a significant step in the expansion of Dfam by providing early access to uncurated, de novo generated families.  As a demonstration of this new capability, we imported a set of 336 RepeatModeler generated libraries produced by Fergal Martin and Denye Ogeh at the European Bioinformatics Institute (EBI).  Also in this release, Dfam now provides family alignments to the RepeatMasker TE protein database aiding in the discovery of related families and in the classification of uncurated TEs.

Uncurated Family Support

In addition to the fully curated libraries for the model organisms human, mouse, zebrafish, worm and fly, Dfam also includes curated libraries for seven other species.  While a fully curated library is the ultimate goal, support for uncurated families has become an essential aspect of a TE resource due to the increasing rate at which new species are being sequenced and the need to have at least a simple TE masking library available.

By standardizing the storage and tracking of uncurated families, it becomes possible to use these datasets to crudely mask an assembly, provide a first approximation of the TE content, and create a starting point for community curation efforts.  Due to the redundancy and fragmentation inherent in these datasets, we do not compute genome-specific thresholds or generate genome coverage plots for these families.  The latest update to the web portal includes new interfaces for uncurated families and some existing interfaces now include an option to include/omit uncurated families.

In this release, Dfam now contains RepeatModeler de novo-produced libraries for an additional 336 species as the result of the collaboration with EBI researchers (denoted with the new uncurated accession prefix “DR”).  Notable taxa expansions include Sauropsida (lizards and birds) and fishes (bony and cartilaginous) (Table 1). Also included are Amphibia, Viridiplantae and additional species in Mammalia.

Table 1. De novo-identified TE families from additional species

Species | Number (species) | Retrotransposons | DNA transposons | Other
Mammalia471830137812567
Sauropsida164293261168827192
Amphibia6178120316107
Actinopterygii (bony fishes)116275205136177006
Chondrichthyes (cartilaginous fishes)516711982273
Viridiplantae (green plants)28964121687

Aligned Protein Features

In previous versions of Dfam, hand-curated coding regions were provided for a select set of families.  The protein products of these curated sequences were placed in the RepeatMasker TE protein database for use with the RepeatProteinMask tool.  In this release we have used this database with BLASTX to produce alignments to all Dfam families including the uncurated entries.  The resulting alignments are displayed alongside the curated coding regions as the new “aligned” feature track (Figure 1).

Figure 1. Feature track and details for BLASTX alignments to TE protein database.

Website improvements

Several minor improvements have been made to the interface since the previous release.  The browse page now provides links to download the families selected by the query/filter options as HMM, EMBL or FASTA records.  The Seed tab of the Families page now displays the average Kimura divergence of the seed alignment instances to the consensus.

Curation with Dfam: new data and platform updates

March 17, 2020

DNA transposon termini signatures

The Dfam consortium is excited to announce the generation and release of terminal repeat sequence signatures for class II DNA transposable elements. The termini of class II elements are crucial for movement, and as such, can be used to classify de novo DNA transposable element families in new genomic sequences (Figure 1).

Figure 1. Major subgroups of class II DNA transposons.

The LOGOs of the termini can be viewed on the “Classifications” tab on the Dfam website and are organized by class II subclasses (e.g., Crypton, Helitron, TIR) (Figure 2). This allows for easy visualization of the base conservation at each position in the terminal sequences and comparisons between the 5’ and 3’ termini (Figure 2). In addition, the termini profiles are available for download as a .HMM file.

Figure 2. Sample termini signature visualization on the Dfam website (www.dfam.org). Base conservation can be seen via the LOGOs of the 5’, 3’ and combined edge (termini) HMMs. The movement type precedes DNA transposons that move via a common mechanism (e.g. “Circular dsDNA intermediate”). The number of families used to generate the LOGOs is indicated, as well as the subclass name (e.g. “Crypton_A”). Additional notes on the termini, when relevant, are also available.

Community data submissions

We have taken the first small step towards a community-driven data curation platform by developing a new data submission system.  At the start this will facilitate the process of uploading data to the site for processing by the curators. As we move forward, further aspects of the curation process will be made available to the community.  Upon creating an account and logging in, users can submit files to Dfam using our web-based upload page. Here you will also find information about submission requirements and how different levels of library quality are handled in Dfam.

Dfam 3.0 is out

March 6, 2019

 

The Dfam consortium is excited to announce the release of Dfam 3.0.  This release represents a major transition for Dfam from a proof-of-concept database into a funded open community resource. Central to this transition is a major infrastructure and technology update, enabling Dfam to handle the increasing pace of genome sequencing and TE library generation. Equally important, we merged Dfam_consensus with Dfam to produce a single resource for transposable element family modeling and annotation. In doing so, Dfam serves the needs of a broader research community while maintaining a high standard for family characterization (seed alignments), and TE annotation sensitivity. Finally, and most importantly, we are working on making Dfam a community driven resource through the development of online curation tools and direct user engagement.

Infrastructure updates

Dfam has undergone a major infrastructure upgrade since the last release including faster servers and storage systems, a new software stack and improved website features. Together these updates will allow Dfam to greatly expand the number of families and the species represented. The new software stack includes a publicly accessible REST API, which provides the core functionality used by the redesigned dfam.org website and is available for use in community developed applications and workflows. The new website is based on the Angular framework, supporting both a traditional web portal to the Dfam database as well as the use of interactive tools for data management and curation.

Dfam_consensus merger

The merger of Dfam_consensus with Dfam created a combined database of 6,235 TE families in 9 organisms, each characterized by a seed alignment of representative family members. Seed alignments constitute a rich dataset for generating sequence models such as consensus sequences, or profile Hidden Markov Models (HMMs).

Consensus sequence databases have traditionally not preserved the sequence alignment from which the consensus was generated. This omission has made it difficult to evaluate the strength of the consensus, to make incremental improvements by adding/removing members, or to regenerate models using improved methodologies. By adding support for consensus sequences to Dfam, the provenance is preserved in the seed alignment. In addition, the positions within the consensus can be directly related to the corresponding match states within the profile HMM.

Improved interfaces and metadata

The new Dfam website contains several features borrowed from Dfam_consensus including: the seed alignment visualization, the TE classification system and visualization, and per-family and full-database EMBL exports for consensus sequences.

TE classification tree visualization with search facility:

Figure 1

In addition, we have improved the family browsing interface, and added the ability to store/visualize family features such as coding sequences, target site preferences, binding sites, as well as ad-hoc sequence annotation.

Coding regions and target site duplication details for Kolobok-1_DR:

Figure 4

Dfam has adopted the classification system for repetitive sequences recently developed for Dfam_consensus and applied it to all of the Dfam-2.x families. This system combines concepts from established systems (Wicker et al., Piegu et al., Curcio et al., Smit et al., and Jurka et al.) with phylogenies based on reverse transcriptases and transposases. Classification names were chosen to be as descriptive as possible while still honoring the most widely used acronyms for well-defined classes.

Dfam families may be queried using the new browse form:

Figure 2

 

Community engagement

We are embarking on an effort to greatly expand the database using de novo repeat identification pipelines, data sharing with other open databases, and, most importantly, direct community submissions. If you have existing TE libraries or plan to develop one for a newly sequenced organism, consider making it a part of the Dfam database. We can offer assistance with importing legacy datasets and are working on tools to facilitate direct community curation of the database. Please contact us at help@dfam.org.

Dfam project seeks postdoctoral fellow

August 20, 2018
We are excited to announce the opening of a postdoctoral fellowship within the Dfam project, located at the Institute for Systems Biology (ISB) in Seattle.  At ISB, the Smit lab is focused on the study of transposable element (TE) biology and evolution using the latest developments in sequence modeling, phylogenetic reconstruction, and homology detection.  We have developed some of the most widely used tools and databases for the study of TEs including RepeatMasker, RepeatModeler, the Repeat Protein Database, and Dfam.
The position offers an opportunity to help shape the future of the new Dfam community resource in collaboration with the Wheeler lab at the University of Montana, and through a partnership with the NIH.  The project will involve investigating and advancing de novo methods for the generation of TE libraries, development of improved methods for classifying new TE families, design of quality metrics and standards for TE modeling, providing TE family curation assistance to the research community and building/studying TE families in a unique set of newly sequenced species.
Applicants should hold (or expect to be awarded shortly) a PhD degree and have experience in TE biology and genomics.  Prior experience with TE library generation/curation, genome biology, and genome evolution is considered a major advantage.  Candidates should have strong communication and data analysis skills, and an established record of principal authorship in peer-reviewed publication(s).  The successful candidate will be passionate about science, motivated, proactive and able to work in a team.
To apply please visit the career website at: https://systemsbiology.org/about/careers/

Introducing Dfam_consensus – Dfam’s consensus sequence twin

May 18, 2017

Since its inception in 2012, Dfam has demonstrated the promise of using profile hidden Markov models (HMMs) to improve the detection sensitivity and annotation quality of transposable element (TE) families in human[1] and subsequently in four additional reference organisms[2].  Despite these advances, the tools used to discover new families (de novo repeat finders), improve families (extend, defragment, subfamily clustering), and classify TE families continue to depend on consensus sequence models.  This discordance between methodologies is a direct impediment to Dfam’s expansion.


Meet Dfam2.0

October 30, 2015

Dfam is growing up. This is the first major expansion of the database since its inception. We’ve added repeat families from four new organisms: mouse, zebrafish, fruit fly, and nematode. In total, this release includes 2,844 new families (4,150 total).


Say hello to Dfam1.4

May 13, 2015

With Dfam, we are striving to build models of repeat families that yield high sensitivity without undue false annotation.  In this release of Dfam, we have improved our model building strategy to reduce the potential for false annotation, especially in the context of overextending alignments around true interspersed repeat instances.


Dfam 1.3 released

January 7, 2015

We are pleased to announce the release of Dfam 1.3. This release includes almost 200 new repeat families and updates the underlying human genome to hg38.
