As the first post suggested, this blog will partly describe the progress and issues faced with the migration of Pfam to HMMER3. We’ve been waiting for the mercurial HMMER3 for well over a year now, watching its release date recede all the while. However, it has finally been released, albeit as an alpha! Given Sean Eddy’s past record on HMMER2, particularly his attention to detail and his hatred of bugs in his software, we (Pfam) are already confident enough to be looking at migrating to HMMER3. This post will set out the rationale for moving Pfam to HMMER3 quickly and look at some of the issues that will inevitably follow such a move.
First and foremost, we’re concerned about compute time: the last Pfam release, Pfam 23.0, took nearly 60 CPU years to produce, most of that time spent running hmmsearch against the various sequence databases (UniProt, NCBI genpept, metagenomic sequences). For us especially, the 100-fold speed increase promised by HMMER3 makes it well worth exploring! Additionally, when it comes to searching Pfam with HMMER3, we can immediately cut the search time in half, since there will be only one HMM per family. Currently, in order to ensure that each family is as comprehensive as possible, we produce two HMMs per family, which attentive Pfam users will recognise as Pfam_ls and Pfam_fs. The ls models (“glocal”) find sequence matches that correspond to the whole length of the HMM. The fs models (local) find partial matches with respect to the model; more often than not, these partial matches are shorter than the actual domains in the sequence. However, we understand that, with HMMER3, many of the multi-hit local alignment issues have been resolved, and partial matches can now be more readily extended to match the full length of the model, where appropriate. Consequently, with HMMER3 we will only need one HMM per family (“local”) and this will remove a heck of a lot of confusion as to how a match was found.
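To put rough numbers on the compute argument, here is a back-of-envelope sketch in Python. The 60 CPU years and the 100-fold speed-up are the figures quoted above; the 90% share for hmmsearch is purely an assumption standing in for “most of that time”, and the function name is made up for illustration.

```python
# Back-of-envelope estimate only; not a measurement of any real Pfam release.
CPU_YEARS_PFAM_23 = 60.0   # from the post: "nearly 60 CPU years"
HMMSEARCH_FRACTION = 0.9   # ASSUMPTION: "most of that time" taken as 90%
SPEEDUP = 100.0            # the promised 100-fold speed increase


def estimated_cpu_years(total, search_fraction, speedup):
    """Amdahl-style estimate: only the search portion gets faster."""
    search = total * search_fraction
    other = total - search
    return other + search / speedup


estimate = estimated_cpu_years(CPU_YEARS_PFAM_23, HMMSEARCH_FRACTION, SPEEDUP)
print(f"{estimate:.2f} CPU years")
```

Under these assumed numbers the estimate comes out at roughly 6.5 CPU years, almost all of it in the part of the pipeline that is not hmmsearch. In other words, once the searches are fast, everything else in the production pipeline becomes the thing to worry about.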
The second reason for migrating to HMMER3 is that it is much more sensitive than HMMER2, which is a very good thing! We now have well over 10,000 families in Pfam and many of these families belong to Pfam clans. Clans are collections of related families, and they are needed when a single HMM is insufficiently sensitive to find all members of a family. Furthermore, we tend to set rather conservative thresholds when we curate a family, in order to avoid the inclusion of false positives, but we do this even at the expense of missing real matches. Our hope is that the new HMMER3 software will enable our large, divergent families to become even larger and that we will be able to collect all known members of a family, without the inclusion of false positives. Cases of wine have been bet on the increased sensitivity, and it’s about time Sean actually won one of these bets!
Personal experience with HMMER3 gives us confidence that switching will ultimately be beneficial to Pfam. We have played with both a pre-alpha release and the alpha release of HMMER3, and it certainly appears to be living up to expectations. There are still some issues with HMMER3, such as missing a few true positive matches that HMMER2 found (from what I can tell, these are due to the filtering heuristics/bias composition) and failing miserably on a handful of very short (less than 10 amino acid) tandem repeats. Despite these problems, HMMER3 is, overall, a great improvement, and we are now seriously looking at migrating Pfam to use this new version of HMMER.
What are the issues for Pfam users? Well, the good news is that if you use Pfam mainly via the website, there should be little change. Hopefully, you’ll find that sequence searches will run even faster and, once we’re fully operational with HMMER3, we hope to be able to provide some additional services that take advantage of the new features in HMMER3.
For those using the Pfam data files rather than the website, we don’t expect a huge number of changes in the Pfam-A.full and Pfam-A.seed flat files. However, if you use the Pfam HMMs then most stuff is (initially) going to break. There is no way to produce parallel sets of HMMs using both HMMER2 and HMMER3, so when we finally make the change, it is going to be fairly drastic!
We still have a lot of work to do before we can release a version of Pfam based on HMMER3. Every Pfam family will need to have its curated threshold reset, a process that is likely to be time-consuming, and we need to re-write large sections of our production pipelines to deal with the changes that HMMER3 requires. As we face issues and/or make changes, we intend to post to this blog.
These are exciting times, but the changes we all face (both the Pfam group and our user community) will undoubtedly cause some pain. As we get closer to a final switch over to HMMER3, we’ll use this blog to tell users more about the process and, hopefully, to get feedback from Pfam users and find out their hopes and fears for HMMER3! Ultimately, we will produce a significantly more useful version of Pfam, where, for instance, a bacterial genome can be comfortably searched on a basic laptop within minutes, but it will be an interesting journey.