It has been a little quiet on the Pfam blog recently, but behind the scenes we’ve been working hard on the migration to HMMER3.
We have built HMMER3 models for all of the Pfam alignments, and searched them against the sequence database. This part was super quick, as HMMER3 is ~100 times faster than HMMER2. Due to the increased sensitivity of HMMER3, many of our Pfam families have grown in size, and we have found that ~80,000 sequences in the sequence database now have overlapping matches to more than one Pfam family.
Within Pfam we have a rule that states that our families should not overlap; this means that any one amino acid can belong to only a single Pfam family. The exception to this rule applies to families within a clan – clans are Pfam’s collections of related families – where overlaps between clan members are allowed. Over the last few weeks we’ve been working through and resolving the list of 80,000 overlaps.
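To make the no-overlap rule concrete, here is a minimal sketch of how such overlaps could be detected. It is an illustration rather than Pfam's actual pipeline: the `Match` record, the `find_overlaps` function and the `clan_of` mapping are hypothetical names, and coordinates are assumed to be 1-based and inclusive.

```python
from collections import namedtuple
from itertools import combinations

# Hypothetical record for one family match on one sequence
# (start/end are assumed to be 1-based, inclusive residue positions).
Match = namedtuple("Match", "seq_id family start end")


def find_overlaps(matches, clan_of):
    """Return (match_a, match_b, shared_residues) for every pair of matches
    that breaks the rule: same sequence, different families, at least one
    shared residue, and the two families are not in the same clan.

    clan_of maps a family name to its clan; families without a clan are
    simply absent from the mapping.
    """
    by_seq = {}
    for m in matches:
        by_seq.setdefault(m.seq_id, []).append(m)

    overlaps = []
    for seq_matches in by_seq.values():
        for a, b in combinations(seq_matches, 2):
            if a.family == b.family:
                continue
            # Number of residues covered by both matches.
            shared = min(a.end, b.end) - max(a.start, b.start) + 1
            if shared <= 0:
                continue
            # Overlaps between members of the same clan are allowed.
            clan_a = clan_of.get(a.family)
            if clan_a is not None and clan_a == clan_of.get(b.family):
                continue
            overlaps.append((a, b, shared))
    return overlaps
```

Comparing every pair of matches on a sequence is quadratic in the number of matches per sequence, which is fine for a sketch because most sequences carry only a handful of matches.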
We use several methods to resolve overlaps. Where families are related, we put them into the same clan, or merge them if the similarity is very high. Where families overlap by only a few residues, we trim the domain boundaries so that the two families no longer overlap. Where we think the overlapping sequence(s) are false positives in one or other of the families, we raise the threshold in that family so that the sequence is excluded. In this way we maintain the high quality of Pfam data.
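Two of these strategies, trimming boundaries and raising a threshold, can be sketched in a few lines. This is only an illustration under stated assumptions: the function names are hypothetical, splitting the overlap at its midpoint is an assumed choice (in practice boundaries are adjusted by curators), and a match is assumed to be included in a family when its score is at or above the family threshold.

```python
def trim_boundaries(region_a, region_b):
    """Trim two partially overlapping (start, end) regions so that they no
    longer share residues. Coordinates are 1-based and inclusive.

    Assumption: the overlap is split at its midpoint. The regions are
    returned in order of position along the sequence, not in input order.
    """
    (a_start, a_end), (b_start, b_end) = sorted([region_a, region_b])
    if b_start > a_end:
        # Nothing to trim: the regions do not overlap.
        return (a_start, a_end), (b_start, b_end)
    if b_end <= a_end:
        # One region lies inside the other; trimming a few residues
        # cannot resolve this case.
        raise ValueError("nested regions cannot be resolved by trimming")
    cut = (b_start + a_end) // 2
    return (a_start, cut), (cut + 1, b_end)


def raised_threshold(current_threshold, false_positive_scores, margin=0.1):
    """Return a threshold that excludes every suspected false positive.

    Assumption: a match is included when its score >= threshold, so the
    new threshold must sit strictly above the best false-positive score.
    The threshold is never lowered.
    """
    return max(current_threshold, max(false_positive_scores) + margin)
```

For example, `trim_boundaries((10, 50), (48, 90))` gives `((10, 49), (50, 90))`: the three shared residues are divided between the two families. Raising a threshold is the blunter tool, since it can also exclude true members that score below the new cut-off, which is why it is reserved for cases where the overlapping sequence looks like a false positive.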
The overlaps generated by HMMER3 have allowed us to find new relationships between families, and to confirm relationships that we had an inkling about. We’re using PRC, SCOOP and structural data along with the HMMER3 overlap data to decide whether to put families into the same clan. In addition to adding many families to existing Pfam clans, so far we have created approximately 60 new clans.
Overlap resolution has been quite a lengthy process, and we’ve still got ~10,000 to go. However, because more families are now in clans, we hope to have added a great deal of value to the Pfam database by indicating which families are related. Overlaps are also resolved, admittedly on a smaller scale, every time we update the sequence database for a new Pfam release. Future releases should benefit from the improved clan infrastructure when it comes to overlap resolution, and we hope to improve this further with each release.
We’ll keep you posted about progress towards the next release of Pfam (version 24.0).