HMMER3 migration: resolving overlaps

March 19, 2009

It has been a little quiet on the Pfam blog recently, but behind the scenes we’ve been working hard on the migration to HMMER3.

We have built HMMER3 models for all of the Pfam alignments, and searched them against the sequence database. This part was super quick, as HMMER3 is ~100 times faster than HMMER2. Due to the increased sensitivity of HMMER3, many of our Pfam families have grown in size, and we have found that ~80,000 sequences in the sequence database now have overlapping matches to more than one Pfam family.

Within Pfam we have a rule that states that our families should not overlap; this means that any one amino acid can belong to only a single Pfam family.  The exception to this rule applies to families within a clan – clans are Pfam’s collections of related families – where overlaps between clan members are allowed. Over the last few weeks we’ve been working through and resolving the list of 80,000 overlaps.
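As a rough illustration of the rule above, here is a minimal Python sketch that flags disallowed overlaps among the matches on one sequence. It is not Pfam's actual pipeline code. The `Match` record, the family names, the clan label `CL_X` and the choice to ignore overlaps between two matches of the same family are all assumptions made for the example.

```python
from itertools import combinations
from typing import List, NamedTuple, Optional, Tuple


class Match(NamedTuple):
    """One family match on a sequence (hypothetical record, not a Pfam type)."""
    family: str
    start: int                  # 1-based, inclusive
    end: int                    # 1-based, inclusive
    clan: Optional[str] = None  # None if the family is not in a clan


def overlap_length(a: Match, b: Match) -> int:
    """Number of residues shared by two matches (0 if they do not overlap)."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def disallowed_overlaps(matches: List[Match]) -> List[Tuple[str, str, int]]:
    """Return (family_a, family_b, shared_residues) for each rule violation."""
    conflicts = []
    for a, b in combinations(matches, 2):
        if a.family == b.family:
            continue  # assumption: only overlaps between different families count
        if a.clan is not None and a.clan == b.clan:
            continue  # overlaps between members of the same clan are allowed
        shared = overlap_length(a, b)
        if shared > 0:
            conflicts.append((a.family, b.family, shared))
    return conflicts


# Made-up matches on a single sequence.
example = [
    Match("FamA", 10, 120, clan="CL_X"),
    Match("FamB", 100, 200, clan="CL_X"),
    Match("FamC", 115, 180),
]
conflicts = disallowed_overlaps(example)
```

In this made-up example FamA and FamB overlap but share a clan, so only the overlaps involving FamC are reported.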

There are several methods we use for resolving overlaps. Where families are related, we put them into the same clan, or merge them together if similarity is very high. Sometimes families overlap by a few residues, and here we trim the domain boundaries so that the two families no longer overlap. We also have cases where we think the sequences that have the overlap are false positives in one or other of the families; in these cases we raise the threshold in that family so that the sequence is excluded. In this way we maintain the high quality of Pfam data.
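The boundary-trimming case can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `trim_boundaries` is hypothetical, regions are `(start, end)` pairs with 1-based inclusive coordinates, and the overlap is simply split at its midpoint. The post does not say how the new boundaries are actually chosen.

```python
from typing import Tuple

Region = Tuple[int, int]  # (start, end), 1-based and inclusive


def trim_boundaries(first: Region, second: Region) -> Tuple[Region, Region]:
    """Trim two partially overlapping regions so they no longer overlap.

    Assumption for illustration: the shared residues are split at the
    midpoint of the overlap. Nested regions are rejected, because trimming
    a few residues cannot resolve them.
    """
    (s1, e1), (s2, e2) = sorted([first, second])
    if e1 < s2:
        return (s1, e1), (s2, e2)  # already disjoint, nothing to trim
    if s1 == s2 or e2 <= e1:
        raise ValueError("one region contains the other; trimming does not apply")
    cut = (s2 + e1) // 2  # last residue kept by the upstream region
    return (s1, cut), (cut + 1, e2)


# Two made-up regions that share residues 100 to 105.
trimmed = trim_boundaries((10, 105), (100, 200))
```

Larger or nested overlaps would fall under the other methods described above (clan membership, merging, or raising a threshold) rather than trimming.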

The overlaps generated by HMMER3 have allowed us to find new relationships between families, and to confirm relationships that we had an inkling about. We’re using PRC, SCOOP and structural data along with the HMMER3 overlap data to decide whether to put families into the same clan. In addition to adding many families to existing Pfam clans, so far we have created approximately 60 new clans.

Overlap resolution has been quite a lengthy process, and we’ve still got ~10,000 to go. However, we are hoping that because more families are now in clans, we will have added a great deal of value to the Pfam database by indicating which families are related. Resolution of overlaps is undertaken, admittedly on a smaller scale, every time we update the sequence database for a new Pfam release. Future releases should benefit from the improved clan infrastructure in terms of overlap resolution, and with each release we hope to improve this even further.

We’ll keep you posted about progress towards the next release of Pfam (version 24.0).

4 Responses to “HMMER3 migration: resolving overlaps”

  1. jamesWasmuth Says:

    If genes can overlap then why not protein domains? I appreciate that if much of the two domains overlap then this may indicate a homologous relationship. However, relatively small overlaps may reflect an evolutionary pressure to streamline – say in parasites or bacteria.

    Clearly, having overlapping domains would mess up a desire to have discrete units, but forbidding them may discard information.

    Just the $0.02 of someone who doesn’t have to build the database 😉


  2. […] will soon migrate to HMMER3 (the PFAM team is now resolving overlaps between families that arose due to increased sensitivity) and the moment it is available, it will make a huge […]

  3. graham cromar Says:

    Does this mean I can safely expect there will be no overlapping (alignment start) (alignment end) pairs w.r.t. adjacent domains? What about (envelope start) (envelope end) pairs? I read somewhere that the envelope is always populated but the alignment sometimes isn’t. In order to resolve domain occurrence and order in a given protein is it best to rely on the alignment or the envelope? Opinions?

  4. anandksrao Says:

    “We’re using PRC, SCOOP and structural data along with the HMMER3 overlap data to decide whether to put families into the same clan. ”

    In what manner are the PRC, SCOP (and not SCOOP?) and any other structural data used to inform how the user can resolve domain overlaps and conflicts?

    In response to James Wasmuth’s post which is rather old, are there published examples of any amino acid residue functioning as part of one protein domain at one time, and another protein domain at another time? Seems impossible to me, but then strange things can happen in biology!

