Dfam 3.0 is out

March 6, 2019

 

The Dfam consortium is excited to announce the release of Dfam 3.0. This release represents a major transition for Dfam from a proof-of-concept database into a funded open community resource. Central to this transition is a major infrastructure and technology update that enables Dfam to keep up with the increasing pace of genome sequencing and transposable element (TE) library generation. Equally important, we merged Dfam_consensus with Dfam to produce a single resource for TE family modeling and annotation. In doing so, Dfam serves the needs of a broader research community while maintaining a high standard for family characterization (seed alignments) and TE annotation sensitivity. Finally, and most importantly, we are working to make Dfam a community-driven resource through the development of online curation tools and direct user engagement.

Infrastructure updates

Dfam has undergone a major infrastructure upgrade since the last release, including faster servers and storage systems, a new software stack, and improved website features. Together, these updates will allow Dfam to greatly expand the number of families and species represented. The new software stack includes a publicly accessible REST API, which provides the core functionality used by the redesigned dfam.org website and is available for use in community-developed applications and workflows. The new website is built on the Angular framework and supports both a traditional web portal to the Dfam database and interactive tools for data management and curation.
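As a rough illustration of how a community workflow might address the REST API, the sketch below only builds a query URL and does not contact the server. The base URL, the "families" resource path, and the filter names (clade, limit) are assumptions made for this example and should be checked against the API documentation on dfam.org before use.

```python
from urllib.parse import urlencode

# Assumption: base URL and resource path are illustrative, not confirmed here.
API_BASE = "https://dfam.org/api"


def build_families_url(base=API_BASE, **filters):
    """Build a query URL for a hypothetical family-listing endpoint.

    Filter names are passed through unchanged; which filters the real
    API accepts is an assumption of this sketch.
    """
    url = f"{base.rstrip('/')}/families"
    # Sort the filters so the same inputs always give the same URL.
    query = urlencode(sorted(filters.items()))
    return f"{url}?{query}" if query else url


if __name__ == "__main__":
    print(build_families_url(clade="Danio rerio", limit=5))
```

The resulting URL could then be fetched with any HTTP client, for example urllib.request.urlopen, and the response parsed as JSON if the endpoint returns that format.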

Dfam_consensus merger

The merger of Dfam_consensus with Dfam created a combined database of 6,235 TE families in 9 organisms, each characterized by a seed alignment of representative family members. Seed alignments constitute a rich dataset for generating sequence models such as consensus sequences or profile hidden Markov models (HMMs).

Consensus sequence databases have traditionally not preserved the sequence alignment from which each consensus was generated. This omission has made it difficult to evaluate the strength of a consensus, to make incremental improvements by adding or removing members, or to regenerate models using improved methodologies. With consensus sequence support added to Dfam, the provenance of each consensus is preserved in its seed alignment. In addition, positions within the consensus can be directly related to the corresponding match states within the profile HMM.
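To make the value of keeping the seed alignment concrete, here is a minimal sketch of how a consensus could be derived from one while recording which alignment column each consensus position came from. It uses a simple majority rule on a made-up toy alignment; this is an illustrative assumption, not the method Dfam uses to build its consensus sequences or HMMs.

```python
from collections import Counter


def consensus_from_seed(seed_rows, gap="-"):
    """Majority-rule consensus of an aligned seed.

    Returns (consensus, columns), where columns[i] is the index of the
    alignment column that produced consensus position i. Columns whose
    most common symbol is a gap are skipped.
    """
    if len({len(row) for row in seed_rows}) != 1:
        raise ValueError("seed alignment must be non-empty with equal-length rows")
    consensus, columns = [], []
    for col, residues in enumerate(zip(*seed_rows)):
        counts = Counter(r.upper() for r in residues)
        symbol, _ = counts.most_common(1)[0]
        if symbol == gap:
            continue  # gap-dominated column: no consensus position emitted
        consensus.append(symbol)
        columns.append(col)
    return "".join(consensus), columns


if __name__ == "__main__":
    # Toy seed alignment (hypothetical data, four aligned family members).
    seed = ["ACG-TA", "ACGTTA", "AC--TA", "TCG-TA"]
    print(consensus_from_seed(seed))
```

Because the column indices are retained, a consensus position can be traced back to the aligned residues that support it, which is the kind of bookkeeping that also lets consensus positions be related to model columns built from the same alignment.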

Improved interfaces and metadata

The new Dfam website includes several features borrowed from Dfam_consensus: the seed alignment visualization, the TE classification system and its visualization, and per-family and full-database EMBL exports for consensus sequences.

TE classification tree visualization with search facility:

Figure1

In addition, we have improved the family browsing interface and added the ability to store and visualize family features such as coding sequences, target site preferences, and binding sites, as well as ad hoc sequence annotations.

Coding regions and target site duplication details for Kolobok-1_DR:

Figure4

Dfam has adopted the classification system for repetitive sequences that was recently developed for Dfam_consensus and applied it to all of the Dfam 2.x families. This system combines concepts from established systems (Wicker et al., Piegu et al., Curcio et al., Smit et al., and Jurka et al.) with phylogenies based on reverse transcriptases and transposases. Classification names were chosen to be as descriptive as possible while still honoring the most widely used acronyms for well-defined classes.

Dfam families may be queried using the new browse form:

Figure2

 

Community engagement

We are embarking on an effort to greatly expand the database through de novo repeat identification pipelines, data sharing with other open databases, and, most importantly, direct community submissions. If you have existing TE libraries or plan to develop one for a newly sequenced organism, consider making it a part of the Dfam database. We can offer assistance with importing legacy datasets and are working on tools to facilitate direct community curation of the database. Please contact us at help@dfam.org.
