Introducing Dfam_consensus – Dfam’s consensus sequence twin

May 18, 2017

Since its inception in 2012, Dfam has demonstrated the promise of using profile hidden Markov models (HMMs) to improve the detection sensitivity and annotation quality of Transposable Element (TE) families in human[1] and subsequently in four additional reference organisms[2]. Despite these advances, the tools used to discover new families (de novo repeat finders), improve families (extension, defragmentation, subfamily clustering), and classify TE families continue to depend on consensus sequence models. This discordance between methodologies is a direct impediment to Dfam’s expansion.

The use of consensus sequences as a first-order model for TE families has a long history of demonstrated utility. The largest database of TE families, RepBase, was originally a warehouse for individual instances of TE families but very quickly switched to consensus sequences as the preferred way to model this data. Unfortunately, there are two drawbacks to RepBase: first, it is a private endeavor with a restricted-use license, and second, because it stores only consensus sequences for each family, more complex models such as HMMs cannot be readily built.

Dfam_consensus provides an open framework for the community to store both seed alignments (multiple alignments of instances for a given family) and the corresponding consensus sequence model. This provides a dataset in formats that are compatible with a wide variety of bioinformatics tools, and it facilitates modeling using HMMs and eventual submission to Dfam. The site has a similar look and feel to Dfam and includes a novel visualization that combines seed alignment coverage depth with whisker plots.


Combined coverage depth and seed whisker plot.
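
To make the relationship between a seed alignment, its consensus sequence, and the coverage depth concrete, here is a minimal Python sketch. It is an illustration only and not the method Dfam_consensus or RepeatModeler uses: it assumes a simple majority-rule consensus, equal-length aligned rows with "-" as the gap character, and a made-up toy alignment. The function names `column_depth` and `majority_consensus` are hypothetical.

```python
from collections import Counter

GAP = "-"  # assumed gap character for this toy example


def column_depth(seed_alignment):
    """Count the non-gap residues in each alignment column."""
    return [sum(1 for base in column if base != GAP)
            for column in zip(*seed_alignment)]


def majority_consensus(seed_alignment):
    """Majority-rule consensus; columns where the gap is most common are dropped."""
    consensus = []
    for column in zip(*seed_alignment):
        base, _count = Counter(column).most_common(1)[0]
        if base != GAP:
            consensus.append(base)
    return "".join(consensus)


# Toy seed alignment (invented sequences, not real Dfam_consensus data).
seed = [
    "ACGT-ACGT",
    "ACGTTACGA",
    "ACGA-ACGT",
    "-CGT-ACGT",
]

print(majority_consensus(seed))  # ACGTACGT
print(column_depth(seed))        # [3, 4, 4, 4, 1, 4, 4, 4, 4]
```

The sketch also shows why storing the seed alignment matters: the consensus string alone discards the per-column variation and depth that a profile HMM would need.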

In addition to developing the database, we have released updates to both RepeatMasker and RepeatModeler that enable them to use Dfam_consensus. RepeatMasker can use Dfam_consensus alongside RepBase in repeat annotation tasks, and RepeatModeler now comes with a utility to submit new families directly to the database.

The site can be found at www.dfam-consensus.org.

Robert, Travis, and Arian

References

[1] Wheeler TJ, Clements J, Eddy SR, Hubley R, Jones TA, Jurka J, Smit AF, Finn RD. Dfam: a database of repetitive DNA based on profile hidden Markov models. Nucleic Acids Res., 2013 Jan;41(Database issue):D70-82. doi: 10.1093/nar/gks1265. Epub 2012 Nov 30. PMID:23203985 PMCID:PMC3531169

[2] Hubley R, Finn RD, Clements J, Eddy SR, Jones TA, Bao W, Smit AF, Wheeler TJ. The Dfam database of repetitive DNA families. Nucleic Acids Res., 2016 Jan;44(Database issue):D81-89. doi: 10.1093/nar/gkv1272. PMID:26612867 PMCID:PMC4702899
