We are now on the brink of releasing Pfam 24.0. This will be a landmark release, as it will be the first to be built using the new version of the HMMER package, HMMER3. We are well aware that we have been claiming this release as imminent for some time, but we are now at the point of flicking the big switch. There are numerous changes that users need to know about, and we will briefly summarise them here.
HMMER version change
The new version of HMMER is considerably faster and more sensitive than previous ones. This has resulted in numerous families being merged and a significantly larger number of families being classified into clans (as we are now better able to detect similarities between families). All families have had their significance thresholds (also known as GA or gathering thresholds) re-set, because of the new statistics underlying the HMMER3 implementation. Thus, it is NOT possible to produce both a HMMER2 and a HMMER3 set of HMMs with Pfam-defined cut-offs. In addition, we only need a single HMM per family, so the previous ‘Pfam_ls’ and ‘Pfam_fs’ files are now replaced by a single file called ‘Pfam-A.hmm’. As well as ‘Pfam-A.hmm’, we will also be making a HMM library for the 20,000 largest Pfam-B families (‘Pfam-B.hmm’). Some of you are already using pre-released versions of these HMMER3 HMM libraries as part of the new version of “pfam_scan.pl”.
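For those who run HMMER directly rather than through “pfam_scan.pl”, here is a rough sketch of how a search against the single new library might be assembled in Python. It assumes the HMMER3 ‘hmmscan’ program is on your PATH and that ‘Pfam-A.hmm’ has already been prepared with ‘hmmpress’. The file names ‘query.fasta’ and ‘hits.domtblout’ and the helper function name are purely illustrative.

```python
import shlex
from pathlib import Path


def build_hmmscan_command(hmm_library, fasta_file, table_out):
    """Return the argument list for an hmmscan run that applies GA thresholds.

    The paths passed in are placeholders chosen by the caller; nothing here
    checks that the files exist.
    """
    return [
        "hmmscan",
        "--cut_ga",                     # use the gathering threshold stored in each HMM
        "--domtblout", str(table_out),  # write per-domain hits as a whitespace-delimited table
        str(hmm_library),
        str(fasta_file),
    ]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    cmd = build_hmmscan_command(
        Path("Pfam-A.hmm"), Path("query.fasta"), Path("hits.domtblout")
    )
    print(shlex.join(cmd))
    # To actually run the search (requires HMMER3 and a pressed library):
    # import subprocess
    # subprocess.run(cmd, check=True)
```

Because the thresholds live in the HMM file itself, the same command works for every family without any per-family cut-off options.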
Although we will not bother you with the details now, the change in HMMER version has also made us re-think many of our policies, file formats and even the philosophy behind the database. We will try to address the most significant of these in our documentation and, of course, in this blog.
In addition to migrating Pfam from HMMER2 to HMMER3, we have also updated the underlying protein sequence database. Pfam 24.0 is based on UniProtKB release 15.6, dating from 28th July 2009, and this update has resulted in a near doubling of our underlying sequence database. This has obviously changed the character of several families. For example, GP120 is no longer our largest Pfam family, having dropped to being the 3rd largest family, behind ABC_tran and RVT_1, with the latter two families each exceeding 100,000 sequence matches!
If you are not concerned about the MySQL database, then skip to the next section. If you are, then you need to know that, along with all the other changes, we have also changed the type of table-engine used, from MyISAM to InnoDB. There are both technical and managerial reasons for having done this, but one of the main benefits to the user is that InnoDB supports foreign key relationships. This improves data consistency and allows the tracing of table relationships. However, this technical change, along with the data changes, has resulted in many changes to the database schema, including subtle column-name changes and more obvious data-type changes.
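To give a feel for what foreign key enforcement buys you, here is a small, self-contained sketch. It uses Python's built-in SQLite module as a stand-in for MySQL with InnoDB, and the table and column names are invented for the example; they are not the Pfam schema. The point is only that a row referring to a non-existent family is rejected by the database rather than silently stored.

```python
import sqlite3


def demo_foreign_key_enforcement():
    """Show a foreign key constraint rejecting an orphaned row.

    SQLite is used here only as a convenient stand-in; the tables below
    are hypothetical and do not mirror the real Pfam schema.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite needs enforcement switched on
    conn.execute(
        "CREATE TABLE family ("
        " family_acc TEXT PRIMARY KEY,"
        " family_id TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE region ("
        " region_id INTEGER PRIMARY KEY,"
        " family_acc TEXT NOT NULL REFERENCES family(family_acc),"
        " seq_start INTEGER,"
        " seq_end INTEGER)"
    )
    conn.execute("INSERT INTO family VALUES ('DEMO_1', 'example_family')")
    # A region pointing at an existing family is accepted.
    conn.execute(
        "INSERT INTO region (family_acc, seq_start, seq_end) VALUES ('DEMO_1', 10, 120)"
    )
    try:
        # A region pointing at a family that does not exist is rejected.
        conn.execute(
            "INSERT INTO region (family_acc, seq_start, seq_end) VALUES ('DEMO_404', 5, 50)"
        )
        rejected = False
    except sqlite3.IntegrityError:
        rejected = True
    stored = conn.execute("SELECT COUNT(*) FROM region").fetchone()[0]
    conn.close()
    return rejected, stored


if __name__ == "__main__":
    print(demo_foreign_key_enforcement())
```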
No Pfam release would be complete without the addition of new families. We have had the obligatory summer vacation student here adding families this year. As a result of his efforts, and our day-to-day work, there have been 1,808 new families added to the database since 23.0. Release 24.0 will contain 11,912 families! Want to know the coverage? Well, you will have to wait for that!
There is good news if you only use the Pfam website, since not a great deal has changed here. You will notice many small changes around the site, as well as a few larger ones, such as the new domain graphics. There have been a few usability improvements, notably in the behaviour of the tabs and their interaction with your browser’s “back” button. You should now find that pressing “back” will take you to the last tab that you viewed, rather than to the last complete web page. This might take some getting used to, but we hope that it will prove more intuitive than the original mechanism. You should also be able to bookmark any tab in the site directly, which was not previously easy to do.
One of the most exciting changes to the website is the massive increase in the speed of interactive, single-sequence searches. Because searches used to take upwards of thirty seconds to complete, we used to show a page with a progress bar and a status message telling you how your search was doing. Thanks to the dramatic speed increases in HMMER3, we’ve been able to do away with the progress bar and we now load the results page directly. For all but the largest sequences, you should see your search results loading within just a few seconds.
One part of the site that has changed noticeably (and which will be broken in places at first) is the so-called “RESTful interface”. If you’re not familiar with the RESTful interface, it’s essentially a programmatic interface to some parts of the Pfam website, such as family pages or the sequence search system. Right now it’s looking as if the RESTful interface for much of the site will be released after the initial 24.0 data release, because we still have a lot to do there. However, because the single-sequence search mechanism has changed so much, the RESTful interface to that sub-system will change as soon as the site is released. Most importantly, there are some changes to the XML that is returned by this interface, so, if you use it, you will almost certainly need to update your scripts. We’ll do our best to document all of the changes in time for the release.
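Since the final XML layout is not yet documented, one way to soften the impact on your scripts is to have them fail loudly when the response does not look as expected, rather than quietly returning nothing. The sketch below shows the idea. The element name "match" and the attributes "accession", "start" and "end" are assumptions made up for the example and should not be read as the actual Pfam search XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical names: replace these with whatever the documented XML uses.
MATCH_TAG = "match"
REQUIRED_ATTRS = ("accession", "start", "end")


def extract_matches(xml_text):
    """Return (accession, start, end) tuples from a search-result document.

    Raises ValueError if a match element lacks an expected attribute, so a
    change in the XML layout is noticed instead of silently ignored.
    """
    root = ET.fromstring(xml_text)
    matches = []
    for node in root.iter(MATCH_TAG):
        missing = [name for name in REQUIRED_ATTRS if name not in node.attrib]
        if missing:
            raise ValueError(f"unexpected XML layout, <{MATCH_TAG}> lacks {missing}")
        matches.append(
            (node.get("accession"), int(node.get("start")), int(node.get("end")))
        )
    return matches


if __name__ == "__main__":
    # Made-up sample document, for illustration only.
    sample = '<results><match accession="DEMO_1" start="3" end="40"/></results>'
    print(extract_matches(sample))
```

Keeping the tag and attribute names in one place at the top of the script means that, once the real changes are documented, there is only one spot to update.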
If you are waiting for Pfam 24.0, it will not be long now… And if you rely on Pfam as a source of data, then start thinking about scheduling time in October to go and revisit any code that uses it.