We are pleased to announce that we’ve released Dfam 1.2. This version brings a few important changes from 1.1, including increased sensitivity for many families, a new plot on the model page, and an improved Relationships tab.
Increased sensitivity for many families
More than 100 families have seen reduced gathering (GA) thresholds, resulting in greater sensitivity with minimal expected false positives (based on hits to reversed genomic sequence, as described in the Dfam paper). The largest changes in coverage come from reducing the GA for the old MIR and L2 families, which increases coverage of the human genome by almost 10 Mb while increasing estimated false coverage by only ~15 kb (a roughly 0.15% false discovery rate among the added coverage).
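The quoted rate can be checked with a line of arithmetic. This is only a sketch: it treats "almost 10 Mb" as exactly 10,000,000 bases and "~15 kb" as 15,000 bases, and the function name is ours, not part of any Dfam tool.

```python
def false_discovery_rate(added_coverage_bp: int, added_false_coverage_bp: int) -> float:
    """Estimated false coverage as a fraction of the newly added coverage."""
    return added_false_coverage_bp / added_coverage_bp


# Rounded figures from the release notes (assumed exact for illustration).
added_coverage_bp = 10_000_000      # "almost 10 Mb" of new coverage
added_false_coverage_bp = 15_000    # "~15 kb" of estimated false coverage

fdr = false_discovery_rate(added_coverage_bp, added_false_coverage_bp)
print(f"{fdr:.2%}")  # prints 0.15%
```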
New plot on the Model page
We don’t provide multiple sequence alignments of the hits to each family, because those alignments are often absurdly huge (they may contain many thousands of multi-Kb sequences containing lots of insertions relative to the model). In light of this, we’ve aimed to convey high-level information about the way hits align to the model. The new plot, “Non-Redundant Coverage, Conservation, and Inserts”, gives information about how many hits match the model, where those hits align to the model, and characteristics of sequence conservation and insertion.
For a selected threshold, the plot includes a line showing, for each model position, the fraction of all hits that have a match to that position (considering only RPH-filtered hits – hits for which this model is deemed to fit the sequence better than any other Dfam model). Among RPH-filtered hits, another line shows, for each position, the average percent identity for a window of length 7 around the position. A third line shows the number of insertions among those hits.
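To make the three lines concrete, here is a minimal sketch of how such per-position summaries could be computed. It is not the Dfam pipeline: the `Hit` representation is hypothetical, the hits are assumed to be RPH-filtered already, and the windowed identity is pooled over all aligned bases in the window, which is only one reasonable reading of "average percent identity".

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Hit:
    # Hypothetical representation of one RPH-filtered hit:
    # model position (0-based) -> True if the hit base equals the consensus base.
    aligned: Dict[int, bool]
    # Model positions after which this hit carries an insertion.
    insert_after: List[int] = field(default_factory=list)


def summarize(
    hits: List[Hit], model_len: int, window: int = 7
) -> Tuple[List[float], List[Optional[float]], List[int]]:
    """Return per-position coverage fraction, windowed % identity, insert counts."""
    covered = [0] * model_len
    identical = [0] * model_len
    inserts = [0] * model_len
    for hit in hits:
        for pos, same in hit.aligned.items():
            covered[pos] += 1
            identical[pos] += int(same)
        for pos in hit.insert_after:
            inserts[pos] += 1

    n = len(hits)
    coverage = [c / n if n else 0.0 for c in covered]

    # Pool matches over a window around each position (truncated at the model ends).
    half = window // 2
    identity: List[Optional[float]] = []
    for pos in range(model_len):
        lo, hi = max(0, pos - half), min(model_len, pos + half + 1)
        aligned_bases = sum(covered[lo:hi])
        same_bases = sum(identical[lo:hi])
        identity.append(100.0 * same_bases / aligned_bases if aligned_bases else None)
    return coverage, identity, inserts


# Toy example: two hits against a 4-position model, window of 3.
hits = [
    Hit({0: True, 1: True, 2: False}),
    Hit({1: True, 2: True, 3: True}, insert_after=[2]),
]
coverage, identity, inserts = summarize(hits, model_len=4, window=3)
print(coverage)  # [0.5, 1.0, 1.0, 0.5]
print(inserts)   # [0, 0, 1, 0]
```

Positions with no aligned bases in their window get `None` rather than 0, so a plot would show a gap there instead of a misleading drop in conservation.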
Improved Relationships tab
The Relationships tab for an entry lists all other entries that show notable similarity to it. Previously, the page showed orientation and position, along with an alignment path popup. It now also includes percent identity between the entry consensus sequences, match E-value, and percent shared coverage (length). The list of related entries can be sorted by any of these fields.
Behind the Scenes
A lot of the work that’s gone into this release isn’t visible externally: it’s backend and pipeline work that will make future releases easier to produce. We’ve already begun planning the next major milestone release; if you have suggestions or would like to contribute models to the database, please get in touch.
Posted by Travis and Rob