Dfam is growing up. This is the first major expansion of the database since its inception. We’ve added repeat families from four new organisms: mouse, zebrafish, fruit fly, and nematode. In total, this release includes 2,844 new families (4,150 total).
New organisms and coverage
In expanding Dfam to include families from multiple genomes, we chose the four species named above because of their status as model organisms covering a broad range of the animal kingdom. Using Dfam profile HMMs within RepeatMasker, we see a marked increase in the fraction of each genome annotated as repeat (relative to using RepBase consensus sequences): +5.1% for human, +5.5% for mouse, +4.4% for zebrafish, +0.7% for fruit fly, and +6.5% for nematode.
New data management challenges
Changes on the frontend of the website are modest, but extensive work was done on the backend to support this and future expansion. The addition of new organisms required that we deal with a number of new model properties:
- Each family is associated with an NCBI taxonomy clade, which indicates where instances are found. Because of horizontal transfer, the family profile HMM may associate with multiple clades, via “model specificity” (MS) lines.
- For a family found in multiple species, a separate score threshold has been calculated for each appropriate reference species.
- We’ve used Dfam models to identify family members in each organism. To account for this large influx of new search data (and to prepare for the forthcoming flood when more organisms are added), the underlying database schema has been split into a single central schema and multiple per-assembly schemas. Many of the backend scripts were refactored to better handle the large scale of these data.
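To make the properties above more concrete, here is a rough Python sketch of how a family with multiple clades, per-species score thresholds, and per-assembly hit storage might be modeled. It is purely illustrative and is not the actual Dfam schema or code: the class `Family`, the function `hits_schema`, the accession `DF_EXAMPLE`, the `hits_<assembly>` naming pattern, and all threshold values are assumptions made up for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Family:
    """Hypothetical stand-in for a Dfam family record (not the real schema)."""
    accession: str
    # Clades the model is associated with; more than one is possible,
    # for example because of horizontal transfer.
    clades: List[str] = field(default_factory=list)
    # One score threshold per reference species (values here are invented).
    thresholds: Dict[str, float] = field(default_factory=dict)

    def threshold_for(self, species: str) -> float:
        """Return the threshold for a species, or fail loudly if none exists."""
        if species not in self.thresholds:
            raise KeyError(f"{self.accession} has no threshold for {species!r}")
        return self.thresholds[species]

    def is_hit(self, species: str, bit_score: float) -> bool:
        """A match counts only if it meets that species' own threshold."""
        return bit_score >= self.threshold_for(species)


def hits_schema(assembly: str) -> str:
    """Name of the per-assembly schema holding hits (assumed naming pattern)."""
    return f"hits_{assembly}"


# Made-up example: one family calibrated separately for two species.
example = Family(
    accession="DF_EXAMPLE",
    clades=["Homo sapiens", "Mus musculus"],
    thresholds={"Homo sapiens": 25.0, "Mus musculus": 31.5},
)

if __name__ == "__main__":
    print(example.is_hit("Homo sapiens", 27.0))
    print(example.is_hit("Mus musculus", 27.0))
    print(hits_schema("mm10"))
```

The point of the sketch is that the same score can be a valid match in one species and fall below threshold in another, and that family metadata can live centrally while the bulky hit data is stored per assembly.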
Shortly before migrating to 2.0, we moved the Dfam website from its pilot location at the HHMI Janelia Research Campus to its new home at the University of Montana. We’d like to shout out a big thank you to HHMI for funding the pilot project, to the University of Montana for funding the purchase of the new server, and to the IT departments at both HHMI and Montana for making the transition as smooth as possible.
Recent improvements to Dfam, including the 2.0 release, are described in a manuscript that’s just been accepted to the NAR 2016 databases issue. Here’s a link to the preprint.