With Dfam, we are striving to build models of repeat families that yield high sensitivity without undue false annotation. In this release of Dfam, we have improved our model building strategy to reduce the potential for false annotation, especially in the context of overextending alignments around true interspersed repeat instances.
A new benchmark for false positives, using GARLIC
In Dfam, we use estimates of false annotation rates to set model-specific score thresholds, in order to ensure high sensitivity with low false positives. In the past, we have used a reversed (not complemented) genome to estimate false annotation. We did this because reversed genomic sequence preserves a great deal of the complexity found in real genomic sequence, and it is this complexity that challenges the specificity of sequence comparison. However, even reversed sequence fails to represent the characteristics of real genomic sequence. As an extreme example, consider the relative overabundance of CGs and lack of GCs in reversed sequence, because of low CpG frequency in genomic sequence. This release is the first to make use of a new false positive benchmark sequence. This benchmark consists of sequence simulated using the GARLIC algorithm, which we have found to be more challenging (and realistic) than reversed sequence.
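The CG/GC asymmetry is easy to see on a toy example. The sketch below is only an illustration: the sequence and the helper names are made up, and it is not part of the Dfam pipeline. It counts dinucleotides before and after reversing a short CpG-depleted string without complementing it.

```python
from collections import Counter


def dinucleotide_counts(seq: str) -> Counter:
    """Count overlapping dinucleotides in a DNA string."""
    seq = seq.upper()
    return Counter(seq[i:i + 2] for i in range(len(seq) - 1))


def reverse_only(seq: str) -> str:
    """Reverse a sequence without complementing it."""
    return seq[::-1]


# Hypothetical toy sequence: several GC dinucleotides, no CG (CpG-depleted).
toy = "AGCTTGCATGCCAGCA"

forward = dinucleotide_counts(toy)
backward = dinucleotide_counts(reverse_only(toy))

print("forward  CG:", forward["CG"], "GC:", forward["GC"])
print("reversed CG:", backward["CG"], "GC:", backward["GC"])
```

Reversal swaps the two counts: every GC in the original becomes a CG in the reversed string, so the reversed sequence carries a dinucleotide bias that real genomic sequence does not have.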
New global ERE and sequence weighting strategies
In sequence alignment, it has long been common practice to utilize scoring matrices, such as BLOSUM62, with low relative entropy. These matrices achieve high sensitivity because they are ideal for finding remote homologs, while being sufficient to find less diverged relatives as well.
The use of matrices with low relative entropy does not typically result in high rates of non-homologous false positives (false hits made up out of whole cloth). But it has recently become clear that they lead to an increased risk of homologous overextension, in which the alignment of a legitimate hit is extended beyond the correct boundaries. We have observed precisely this phenomenon in previous releases of Dfam. Dfam depends on nhmmer, which mixes observed counts with prior probabilities in a way that reaches a target average relative entropy, using an approach called entropy weighting (see: this and this). We have raised the default target average relative entropy in order to reduce false extension.
Another subtle issue is that profile hidden Markov models show position-specific levels of relative entropy: some positions are more conserved than others, and especially in the case of Dfam alignments, some positions have more observed counts than others. The default entropy weighting strategy in HMMER uniformly down-weights all positions in order to decrease average relative entropy. This has the effect of pushing the low occupancy positions well into the range of the priors, resulting in model regions exhibiting very low relative entropy, with accompanying increased rates of overextension in these regions. We have developed a new weighting method in HMMER (--eentexp; available in the current beta) to mitigate this effect. Using this method in Dfam, we see a substantial reduction in overextension.
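To make the effect of uniform down-weighting concrete, here is a minimal sketch under simplifying assumptions: a uniform background, a flat pseudocount prior (HMMER itself uses more elaborate priors), and made-up count columns. It illustrates only the default uniform strategy described above, not the new method, and all names are hypothetical.

```python
import math

BACKGROUND = [0.25, 0.25, 0.25, 0.25]  # assumed uniform background
PRIOR = [1.0, 1.0, 1.0, 1.0]           # assumed flat pseudocounts, not HMMER's priors


def column_probs(counts, scale):
    """Mean posterior emission probabilities after scaling the observed counts."""
    total = scale * sum(counts) + sum(PRIOR)
    return [(scale * c + a) / total for c, a in zip(counts, PRIOR)]


def relative_entropy(probs):
    """Relative entropy (bits) of one column against the background."""
    return sum(p * math.log2(p / q) for p, q in zip(probs, BACKGROUND) if p > 0)


def mean_relative_entropy(columns, scale):
    """Average relative entropy per position at a given count scale."""
    return sum(relative_entropy(column_probs(c, scale)) for c in columns) / len(columns)


def uniform_scale_for_target(columns, target, iters=60):
    """Bisect for one scale in [0, 1] whose mean relative entropy meets the target."""
    lo, hi = 0.0, 1.0
    if mean_relative_entropy(columns, hi) <= target:
        return hi  # already at or below target; no down-weighting needed
    for _ in range(iters):
        mid = (lo + hi) / 2
        if mean_relative_entropy(columns, mid) > target:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2


# Two well-populated columns and one low-occupancy column (hypothetical counts).
columns = [[90, 4, 3, 3], [80, 10, 5, 5], [3, 0, 0, 0]]
scale = uniform_scale_for_target(columns, target=0.5)

for counts in columns:
    before = relative_entropy(column_probs(counts, 1.0))
    after = relative_entropy(column_probs(counts, scale))
    print(counts, "before:", round(before, 3), "after:", round(after, 3))
```

Because one scale is applied everywhere, the column with only three observations ends up almost entirely determined by the prior, which is the behavior that motivates a position-aware weighting.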
Another change to nhmmer parameters involves indels. Previously, observed indel counts were mixed with very weak prior probabilities, meaning that a single observed count could result in surprisingly high model frequencies. We increased the strength of these priors, to reduce the ease with which small counts could produce large probability changes.
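A small worked example shows why prior strength matters here. It assumes a simple two-outcome (Beta) prior and made-up numbers. The actual priors and counts used by nhmmer are not given in this post, so treat the values and the function name as hypothetical.

```python
def posterior_mean(event_count, total_count, prior_event, prior_other):
    """Mean posterior probability of an event under a Beta(prior_event, prior_other) prior."""
    return (event_count + prior_event) / (total_count + prior_event + prior_other)


# One observed insertion among 5 sequences covering a position (hypothetical).
# Both priors have the same mean (0.01); they differ only in strength.
weak = posterior_mean(1, 5, prior_event=0.01, prior_other=0.99)   # total prior weight 1
strong = posterior_mean(1, 5, prior_event=0.1, prior_other=9.9)   # total prior weight 10

print("weak prior:  ", round(weak, 4))
print("strong prior:", round(strong, 4))
```

With the weak prior, the single observation dominates and the estimated insertion probability comes out more than twice as high as with the stronger prior.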
New gathering thresholds, assessment
After instituting these changes (indels, weighting, and benchmark), we updated the gathering threshold (the score at which a hit is considered reliable) for each family model. The result is a reduction in both false positive hits and overextension of legitimate annotation.
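One way to picture how a threshold can be derived from a false positive benchmark is sketched below. This is an assumption-laden illustration, not the procedure Dfam actually uses: the function name, the target rate, the benchmark size, and the scores are all hypothetical.

```python
def gathering_threshold(benchmark_scores, benchmark_mb, max_false_hits_per_mb):
    """Return a score cutoff such that the number of benchmark hits scoring
    strictly above it stays within the allowed false hit rate."""
    allowed = int(max_false_hits_per_mb * benchmark_mb)
    ranked = sorted(benchmark_scores, reverse=True)
    if len(ranked) <= allowed:
        return float("-inf")  # the benchmark imposes no constraint
    # At most `allowed` hits score strictly above the (allowed + 1)-th best score.
    return ranked[allowed]


# Hypothetical scores of one model's hits against 4 Mb of simulated sequence.
scores = [12.1, 18.4, 25.0, 9.7, 30.2, 15.3]
threshold = gathering_threshold(scores, benchmark_mb=4.0, max_false_hits_per_mb=0.5)
surviving = [s for s in scores if s > threshold]

print("threshold:", threshold)
print("benchmark hits above threshold:", len(surviving))
```

A harder benchmark produces higher-scoring false hits, so the same target rate yields a higher cutoff for the affected models.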
Overall, Dfam1.4 removes about 40% of the estimated false coverage found in Dfam1.3 (~1Mb of non-homologous false hits, and ~9Mb of homologous overextension). Dfam1.4 also yields a reduction of ~21Mb in the total coverage of the human genome by Dfam interspersed repeat families. In other words, about 21Mb of the genome was covered by Dfam1.3 and now isn’t … and we estimate that about half of the newly uncovered sequence was falsely covered in the first place. While we regret the loss of coverage, we believe that the increased reliability of the remaining annotation is well worth the cost.
Posted by Travis Wheeler and Robert Hubley