<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>Xfam Blog &#187; pfam</title>
	<atom:link href="http://xfam.wordpress.com/tag/pfam/feed/" rel="self" type="application/rss+xml" />
	<link>http://xfam.wordpress.com</link>
	<description>News about the Pfam and Rfam projects</description>
	<lastBuildDate>Thu, 29 Oct 2009 16:33:48 +0000</lastBuildDate>
	<generator>http://wordpress.com/</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<cloud domain='xfam.wordpress.com' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<image>
		<url>http://www.gravatar.com/blavatar/b4e6843655942d15fef21468c92d4907?s=96&#038;d=http://s.wordpress.com/i/buttonw-com.png</url>
		<title>Xfam Blog &#187; pfam</title>
		<link>http://xfam.wordpress.com</link>
	</image>
			<item>
		<title>Website update</title>
		<link>http://xfam.wordpress.com/2009/10/29/website-update/</link>
		<comments>http://xfam.wordpress.com/2009/10/29/website-update/#comments</comments>
		<pubDate>Thu, 29 Oct 2009 16:33:48 +0000</pubDate>
		<dc:creator>johntate</dc:creator>
				<category><![CDATA[Releases]]></category>
		<category><![CDATA[pfam]]></category>

		<guid isPermaLink="false">http://xfam.wordpress.com/?p=260</guid>
		<description><![CDATA[We&#8217;ve just updated the Pfam website again. This update comes fairly soon after the major Pfam 24.0 release, and it&#8217;s intended to fix some of the more annoying bugs and omissions that we&#8217;ve found in the last week or so.
There are various small changes and fixes all over the site, but there are several more [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>We&#8217;ve just updated the Pfam website again. This update comes fairly soon after the major Pfam 24.0 release, and it&#8217;s intended to fix some of the more annoying bugs and omissions that we&#8217;ve found in the last week or so.<span id="more-260"></span><br />
There are various small changes and fixes all over the site, but there are several more significant changes that you might need to be aware of.</p>
<h2>Documentation</h2>
<p>We&#8217;ve started the painful process of updating our documentation for the HMMER3-derived data. You&#8217;ll find that most of the tabs in the help page are now up to date, though some do still carry a warning about their content. We&#8217;ll be working through these remaining sections and will update them as soon as possible.</p>
<h2>Domain architectures search</h2>
<p>We now have the domain architectures search working again. The submission form has been tidied up a little and we use the new domain graphics library to render the architecture graphics but, beyond that, the search should work as it did before.</p>
<h2>Sequence search restrictions</h2>
<p>The previous version of HMMER, v2, accepted &#8220;-&#8221; as a valid sequence character, but HMMER3 considers that to be an invalid character and returns an error. In the initial 24.0 website release, you could submit a search sequence with &#8220;-&#8221; and the search would fail with an unhelpful message. With this update, the validation procedure on the website now catches &#8220;-&#8221; before submission and tells you where the invalid character was found.</p>
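<p>As a rough illustration of this kind of pre-submission check, here is a minimal Python sketch that reports where invalid characters occur. The function name and the set of accepted letters are assumptions made for this example; it is not the website&#8217;s actual validation code.</p>

```python
# Hypothetical pre-submission check; the accepted alphabet is an assumption.
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWYBXZOU")


def find_invalid_characters(sequence):
    """Return (position, character) pairs for characters outside the alphabet.

    Positions are 1-based, and whitespace (e.g. line breaks) is skipped.
    """
    problems = []
    position = 0
    for char in sequence:
        if char.isspace():
            continue
        position += 1
        if char.upper() not in VALID_RESIDUES:
            problems.append((position, char))
    return problems


print(find_invalid_characters("MKV-LAAG"))  # prints [(4, '-')]
```

<p>Checking on the client side like this means a bad character can be reported with its position, rather than surfacing later as a failed search.</p>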
<h2>GI numbers</h2>
<p>Putting a GI number into a jump box now sends you to a page about that NCBI sequence entry, rather than returning errors. The NCBI pages are still &#8220;under construction&#8221;, but they&#8217;re better than the gaping hole that previously existed in the site!</p>
<h2>RESTful services</h2>
<p>We&#8217;ve reinstated the &#8220;RESTful&#8221; services for the major parts of the site. You should now be able to use the API to get data about Pfam-A families and individual proteins, and, probably after some changes to your code, to submit single-sequence searches again.</p>
<p>The switch to HMMER3 has changed quite a few aspects of the data, as well as the way we run our searches, so there are changes to most of the XML schemas, mostly fairly minor. The documentation on the RESTful interface is now up to date, so check there for information on the new XML formats.</p>
<h3>Sequence searches</h3>
<p>It&#8217;s probably worth giving a little more detail on the changes that we&#8217;ve had to make to the RESTful interface to the sequence search system. If you&#8217;ve previously used the search interface, you will probably need to update your scripts accordingly.</p>
<h4>No &#8220;estimated time&#8221;</h4>
<p>In Pfam 23.0, searches used HMMER2 and could take on the order of a minute for a Pfam-A search, so we used a &#8220;polling page&#8221; to give the user something to look at while they waited. It showed a progress bar and gave the user an idea of the estimated run time for the search. Now that we&#8217;re using HMMER3, most Pfam-A searches are so fast that it took longer to load the polling page than to run the search. We therefore no longer calculate an estimated run time and we&#8217;ve ditched that intermediate page altogether. Results are now loaded into the results page as they appear on the search system.</p>
<p>When running a search using the API, you can simply add a short delay (a couple of seconds should be fine for most sequences) between submitting the search and retrieving the results. If your search is still running when you try to retrieve the results, you just won&#8217;t get results; wait a little longer and hit the same URL again.</p>
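<p>The submit, wait and retry pattern described above could look something like the following Python sketch. The &#8220;fetch_results&#8221; callable is a placeholder that you would supply yourself (it should request the results URL and return None while the search is still running); the function name, delay and retry limit are assumptions for illustration and are not part of the Pfam API.</p>

```python
import time


def wait_for_results(fetch_results, delay=2.0, max_attempts=5):
    """Call fetch_results() until it returns something, pausing between tries.

    fetch_results is a caller-supplied function (hypothetical here) that
    requests the results URL and returns None while the job is still running.
    """
    for attempt in range(max_attempts):
        time.sleep(delay)  # give the search a moment before each request
        results = fetch_results()
        if results is not None:
            return results
    raise TimeoutError("no results after %d attempts" % max_attempts)


# Stand-in for a real HTTP request: nothing twice, then a result.
responses = iter([None, None, "results ready"])
print(wait_for_results(lambda: next(responses), delay=0.01))
```

<p>Capping the number of attempts keeps a script from hammering the same URL indefinitely if a job never completes.</p>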
<p>When running a search against Pfam 23.0, the procedure was to submit the search and check the estimated run time in the XML that came back. After waiting for that period, you would then retrieve results from a URL in the XML.</p>
<h4>Only one job ID</h4>
<p>In Pfam 23.0, if you chose to run both a Pfam-A and a Pfam-B search, you would get two job identifiers and you would have to retrieve results for each search separately. Because we use HMMER3 to search for both Pfam-A and Pfam-B matches now, we run the jobs in the same queue, so there&#8217;s only a single job ID. Furthermore, the Pfam-A and Pfam-B hits are returned in the same result XML document. You can distinguish them using the &#8220;type&#8221; attribute on the match element.</p>
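<p>To illustrate, here is a small Python sketch that splits matches by that attribute. The XML snippet is invented for this example: only the &#8220;match&#8221; element and its &#8220;type&#8221; attribute come from the description above, while the root element, the &#8220;accession&#8221; attribute and the attribute values are assumptions, so check the RESTful documentation for the real schema.</p>

```python
import xml.etree.ElementTree as ET

# Invented sample document; the real result XML will look different.
SAMPLE = """
<results>
  <match accession="PF00001" type="Pfam-A"/>
  <match accession="PB000001" type="Pfam-B"/>
  <match accession="PF00002" type="Pfam-A"/>
</results>
"""


def group_matches_by_type(xml_text):
    """Return a dict mapping each 'type' value to a list of accessions."""
    groups = {}
    root = ET.fromstring(xml_text)
    for match in root.iter("match"):
        groups.setdefault(match.get("type"), []).append(match.get("accession"))
    return groups


print(group_matches_by_type(SAMPLE))
```

<p>If the real documents declare an XML namespace, the element name passed to &#8220;iter&#8221; would need to be qualified accordingly.</p>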
<h2>Summary</h2>
<p>There have been lots of small changes and a few larger ones in this update, but hopefully nothing too disruptive. If you have any problems using any of the newly added or recently repaired features, do let us know.</p>
<p>Posted by John.</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://xfam.wordpress.com/2009/10/29/website-update/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/a0fa0c9eef18ce83c2a7ee75fb3ca95a?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">johntate</media:title>
		</media:content>
	</item>
		<item>
		<title>Pfam release 24.0</title>
		<link>http://xfam.wordpress.com/2009/10/13/pfam-release-24-0/</link>
		<comments>http://xfam.wordpress.com/2009/10/13/pfam-release-24-0/#comments</comments>
		<pubDate>Tue, 13 Oct 2009 16:15:22 +0000</pubDate>
		<dc:creator>johntate</dc:creator>
				<category><![CDATA[News]]></category>
		<category><![CDATA[Releases]]></category>
		<category><![CDATA[pfam]]></category>

		<guid isPermaLink="false">http://xfam.wordpress.com/?p=252</guid>
		<description><![CDATA[We have just released the latest update to Pfam. Release 24.0 contains a total of 11,912 families, with 1,808 new families and 236 families killed since the last release. 75.15% of all proteins in Pfamseq contain a match to at least one Pfam domain. 53.18% of all residues in the sequence database fall within Pfam [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>We have just released the latest update to <a href="http://pfam.sanger.ac.uk/">Pfam</a>. Release 24.0 contains a total of 11,912 families, with 1,808 new families and 236 families killed since the last release. 75.15% of all proteins in Pfamseq contain a match to at least one Pfam domain. 53.18% of all residues in the sequence database fall within Pfam domains.<span id="more-252"></span></p>
<p>As we&#8217;ve <a href="http://xfam.wordpress.com/2009/10/02/imminent-release-of-pfam-24-0/">discussed previously</a>, release 24.0 is the first to be generated using HMMER3. This migration has necessitated changes in our file formats and users should be aware that HMMER3 is <strong>not</strong> backward compatible with HMMER2. For a <a href="http://pfam.sanger.ac.uk/help#tabview=tab1">list of changes</a>, please check the documentation on the Pfam website.</p>
<p>The new release is currently available from the <a href="http://pfam.sanger.ac.uk/">Pfam site</a> at the Sanger Institute and will be available from the other Pfam sites shortly. This release has been a massive undertaking for us and, although most features of the website are working well, some details remain to be addressed, notably the programmatic interface to the sequence search system. Please bear with us while we bring these back over the next few weeks.</p>
<p><span style="text-decoration:underline;">Update</span>: we&#8217;ve had one or two users reporting that the tabbed pages in the site (e.g. help, family) are broken. Although that&#8217;s eminently possible, if you see a broken page, please try doing a shift+reload before anything else. That should force your browser to download the latest versions of all of the JavaScript files, and hopefully fix the problem there and then.</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://xfam.wordpress.com/2009/10/13/pfam-release-24-0/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/a0fa0c9eef18ce83c2a7ee75fb3ca95a?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">johntate</media:title>
		</media:content>
	</item>
		<item>
		<title>Imminent Release of Pfam 24.0</title>
		<link>http://xfam.wordpress.com/2009/10/02/imminent-release-of-pfam-24-0/</link>
		<comments>http://xfam.wordpress.com/2009/10/02/imminent-release-of-pfam-24-0/#comments</comments>
		<pubDate>Fri, 02 Oct 2009 09:13:22 +0000</pubDate>
		<dc:creator>rdfinn</dc:creator>
				<category><![CDATA[HMMER3 migration]]></category>
		<category><![CDATA[Releases]]></category>
		<category><![CDATA[hmmer3]]></category>
		<category><![CDATA[pfam]]></category>

		<guid isPermaLink="false">http://xfam.wordpress.com/?p=246</guid>
		<description><![CDATA[We are now on the brink of releasing Pfam 24.0.  This will be a landmark release, as it is the first to be built using the new version of the HMMER package, HMMER3. We are well aware that we have been claiming this release as imminent for some [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>We are now on the brink of releasing Pfam 24.0.  This will be a landmark release, as it is the first to be built using the new version of the HMMER package, <a href="http://hmmer.janelia.org/">HMMER3</a>. We are well aware that we have been claiming this release as imminent for some time, but we are now at the point of flicking the big switch.  There are numerous changes that users need to know about and we will briefly summarise them here.<span id="more-246"></span></p>
<h2>HMMER version change</h2>
<p>The new version of HMMER is considerably faster and more sensitive than previous ones.  This has resulted in numerous families being merged and a significantly larger number of families being classified into clans (as we are now better able to detect similarities between families).  All families have had their significance thresholds (also known as <a href="http://pfam.sanger.ac.uk/help?tab=helpScoresBlock">GA or gathering thresholds</a>) re-set, because of the new statistics underlying the HMMER3 implementation.   Thus, it is <strong>NOT</strong> possible to produce both a HMMER2 and a HMMER3 set of HMMs with Pfam-defined cut-offs.  In addition, we only need a single HMM per family, so the previous &#8216;Pfam_ls&#8217; and &#8216;Pfam_fs&#8217; files are now replaced by a single file called &#8216;Pfam-A.hmm&#8217;.  As well as Pfam-A.hmm, we will also be making a HMM library for the 20,000 largest Pfam-B families (&#8216;Pfam-B.hmm&#8217;).  Some of you are already using pre-released versions of these HMMER3 HMM libraries as part of the new version of &#8220;pfam_scan.pl&#8221;.</p>
<p>Although we will not bother you with the details now, the change in HMMER version has also made us re-think many of our policies, file formats and even the philosophy behind the database.  We will try to address the most significant of these in our documentation and, of course, in this blog.</p>
<h2>Sequence Database</h2>
<p>In addition to migrating Pfam from HMMER2 to HMMER3, we have also updated the underlying protein sequence database.  Pfam 24.0 is based on UniProtKB release 15.6, dating from 28th July 2009, and this update has resulted in a near doubling of our underlying sequence database.  This has obviously changed the character of several families. For example, <a href="http://pfam.sanger.ac.uk/family/GP120">GP120</a> is no longer our largest Pfam family, having dropped to being the 3rd largest family, behind <a href="http://pfam.sanger.ac.uk/family/ABC_tran">ABC_tran</a> and <a href="http://pfam.sanger.ac.uk/family/RVT_1">RVT_1</a>, with the latter two families each exceeding 100,000 sequence matches!</p>
<h2><strong>Relational Database</strong></h2>
<p>If you are not concerned about the MySQL database, then skip to the next section.  If you are, then you need to know that, along with all the other changes, we have also changed the type of table-engine used, from <a href="http://en.wikipedia.org/wiki/MyISAM">MyISAM to InnoDB</a>.  There are both technical and managerial reasons for having done this, but one of the main benefits to the user is that InnoDB supports foreign key relationships. This improves data consistency and allows the tracing of table relationships.  However, this technical change, along with the data changes, has resulted in many changes to the database schema, including subtle column-name changes and more obvious data-type changes.</p>
<h2><strong>New Families</strong></h2>
<p>No Pfam release would be complete without the addition of new families.  We have had the obligatory summer vacation student here adding families this year.  As a result of his efforts, and our day-to-day work, 1,808 new families have been added to the database since 23.0. Release 24.0 will contain 11,912 families! Want to know the coverage? Well, you will have to wait for that!</p>
<h2><strong>Website Changes</strong></h2>
<p>There is good news if you only use the Pfam website, since not a great deal has changed here. You will notice many small changes around the site, as well as a few larger ones, such as the new domain graphics.  There have been a few usability improvements, notably in the behaviour of the tabs and their interaction with your browser&#8217;s &#8220;back&#8221; button.  You should now find that pressing &#8220;back&#8221; will take you to the last tab that you viewed, rather than to the last complete web page.  This might take some getting used to, but we hope that it will prove more intuitive than the original mechanism.  You should also be able to bookmark any tab in the site directly, which was not previously easy to do.</p>
<h3>Sequence searches</h3>
<p>One of the most exciting changes to the website is the massive increase in the speed of interactive, single-sequence searches.  Because searches used to take upwards of thirty seconds to complete, we used to show a page with a progress bar and a status message telling you how your search was doing. Thanks to the dramatic speed increases in HMMER3, we&#8217;ve been able to do away with the progress bar and we now load the results page directly.  For all but the largest sequences, you should see your search results loading within just a few seconds.</p>
<h3>Domain graphics</h3>
<p>Visually, probably the most significant difference in the site is the new-style domain graphics.  Our domain graphics are pictorial representations of sequences and the Pfam matches to them.  In the existing Pfam 23.0 site, every domain graphic is generated on our server as an image file, using the <a href="http://search.cpan.org/dist/GD/GD.pm">Perl GD module</a>.  Each image is then served separately by our servers.  Obviously, when viewing many sequences in one page, your browser might have to make hundreds of requests to our server and, during periods of heavy load, this severely reduces the performance of the website.  Also, in the past we have had periods during which we were generating these temporary domain graphic images faster than we could delete them!  To circumvent these issues, we have completely changed the way that the domain graphics are drawn.  In the new site, all of the domain graphics in a page are described by a single chunk of data and then drawn in your browser using a JavaScript library.  This should make the page load more quickly and it allows us to add more information to the domain graphics; you should see new tooltips as you move your mouse over the images, giving information about each domain.</p>
<h3>RESTful interface</h3>
<p>One part of the site that has changed noticeably (and which will be broken in places at first) is the so-called &#8220;RESTful interface&#8221;.  If you&#8217;re not familiar with the <a href="http://en.wikipedia.org/wiki/Representational_State_Transfer">RESTful interface</a>, it&#8217;s essentially a programmatic interface to some parts of the Pfam website, such as family pages or the sequence search system.  Right now it&#8217;s looking as if the RESTful interface for much of the site will be released after the initial 24.0 data release, because we still have a lot to do there.  However, because the single-sequence search mechanism has changed so much, the RESTful interface to that sub-system will change as soon as the site is released.  Most importantly, there are some changes to the XML that is returned by this interface, so, if you use it, you will almost certainly need to update your scripts.  We&#8217;ll do our best to document all of the changes in time for the release.</p>
<h2><strong>In Summary</strong></h2>
<p>If you are waiting for Pfam 24.0, it will not be long now&#8230; And if you rely on Pfam as a source of data, then start thinking about scheduling time in October to go and revisit any code that uses it.</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://xfam.wordpress.com/2009/10/02/imminent-release-of-pfam-24-0/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/979e0bdb3b6200e39425a8748897cf4d?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">rdfinn</media:title>
		</media:content>
	</item>
		<item>
		<title>pfam_scan.pl &#8211; part II</title>
		<link>http://xfam.wordpress.com/2009/09/11/pfam_scan-pl-part-ii/</link>
		<comments>http://xfam.wordpress.com/2009/09/11/pfam_scan-pl-part-ii/#comments</comments>
		<pubDate>Fri, 11 Sep 2009 09:10:05 +0000</pubDate>
		<dc:creator>jainamistry</dc:creator>
				<category><![CDATA[HMMER3 migration]]></category>
		<category><![CDATA[Production]]></category>
		<category><![CDATA[hmmer3]]></category>
		<category><![CDATA[pfam]]></category>

		<guid isPermaLink="false">http://xfam.wordpress.com/?p=226</guid>
		<description><![CDATA[Back in May we wrote a blog post about the new version of pfam_scan.pl.  We asked if there was anyone out there who was willing to help us test our new script, and we were pleasantly surprised at the number of people who got in contact with us &#8211; so a big thank you [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>Back in May we wrote a blog post about the new version of pfam_scan.pl.  We asked if there was anyone out there who was willing to help us test our new script, and we were pleasantly surprised at the number of people who got in contact with us &#8211; so a big thank you to all those who have helped.  Since releasing the alpha version of pfam_scan.pl to our testers we have made some internal changes to the script that are worth mentioning:<span id="more-226"></span></p>
<p><strong>Searching Pfam-B HMMs</strong></p>
<p>We are frequently asked how to search sequences against <a href="http://pfam.sanger.ac.uk/help" target="_blank">Pfam-B</a> families, and now we are providing this facility as part of pfam_scan.pl.   Pfam-A families are the main entry point for scientists in their day-to-day use of Pfam.  The alignments for each Pfam-A family have been carefully checked by one of our curators, and each one is accompanied by annotation.  In the past we have made a library of all the curated entries in the database (the Pfam-A HMMs) available for users to search against.</p>
<p>Within Pfam, we have a second set of families called Pfam-B families. These families are automatically generated alignments that have no accompanying HMMs or annotation, and their alignments have not passed any of the normal quality control that we would perform for a Pfam-A family.  This means Pfam-B families are of much lower quality than our Pfam-A families; however, they do help us to fill the sequence space that Pfam-A families do not cover.</p>
<p>In the last release there were 223,403 Pfam-B entries, with the number of sequences matching each Pfam-B entry following a <a href="http://en.wikipedia.org/wiki/The_Long_Tail" target="_blank">long-tailed distribution</a>.  With HMMER3 being significantly faster than HMMER2, we have chosen to generate HMMs for the 20,000 largest Pfam-B families, as these are the most relevant. Most entries below this cut-off contain fewer than 10 sequences.  As multi-threaded versions of HMMER3 become available we may include more entries, but for the moment the search times against the Pfam-A and Pfam-B libraries are equivalent.</p>
<p>BUT&#8230;.please remember that Pfam-B accessions are <em>not</em> stable between Pfam releases, so do <em>not</em> rely on them!</p>
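<p>The size-based cut-off mentioned above amounts to ranking families by their number of sequences and keeping the largest N. A minimal Python sketch of that idea follows; the accessions and sizes are made up for illustration, and this is not the code used in the Pfam production pipeline.</p>

```python
import heapq


def largest_families(family_sizes, n):
    """Return the n (accession, size) pairs with the most sequences."""
    return heapq.nlargest(n, family_sizes.items(), key=lambda item: item[1])


# Made-up accessions and sizes, purely for illustration.
sizes = {"PB000001": 5200, "PB000002": 8, "PB000003": 310, "PB000004": 3}
print(largest_families(sizes, 2))  # prints [('PB000001', 5200), ('PB000003', 310)]
```

<p>With a long-tailed distribution, a fairly small N captures most of the sequence matches, which is why a fixed cut-off is a reasonable trade-off.</p>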
<p><strong>Additional formats</strong></p>
<p>A further change adds the ability to write search results in <a href="http://www.json.org" target="_blank">JavaScript Object Notation</a> (JSON). JSON is a compact, text-based data format, which is most commonly used in the context of the web and JavaScript applications.  However, precisely because it&#8217;s so compact and portable, JSON can also be useful as a sort of lightweight XML replacement.  Hopefully the JSON output options of pfam_scan.pl will be useful for those users who would otherwise have to parse the raw text output in order to do more processing.</p>
<p>We wanted the new version of pfam_scan.pl to be fast, but also much more maintainable than the previous version. When we started work on the new pfam_scan.pl, we made the decision to use a number of third-party Perl modules from <a href="http://search.cpan.org" target="_blank">CPAN</a>. The most important of these is definitely the <a href="http://www.iinteractive.com/moose/">Moose framework</a>. For those of you who are unfamiliar with Moose, it&#8217;s a complete object system that improves on the Perl 5 object system and makes object-oriented code simpler and more powerful.</p>
<p><strong>A word or two about Moose</strong></p>
<p>Some of our testers noticed that using Moose contributed fairly significantly to the runtime of the script (excluding running HMMER3). The performance overhead of a system like Moose is definitely something that needs to be considered. After careful consideration, we decided that it was a price worth paying, but we also thought that it was worth explaining our reasoning a little.</p>
<p>One of the benefits of Moose is that Moose-based objects can be configured to perform extensive data validation, in a way that is both easy to implement and maintain.  The Pfam production pipeline has been entirely re-written for the upcoming Pfam release, and it now relies heavily on Moose for data validation. Since the pipeline modules perform many of the operations that we need in pfam_scan.pl, such as parsing the output of HMMER3 programs, we&#8217;ve been able to use these new modules in the script, rather than having to rewrite the same functionality from scratch.</p>
<p>The validation checks that can be built into Moose objects avoid the need to write complex data validation procedures for ourselves. The code that implements the checks is part of the Moose distribution on CPAN and is therefore maintained by the maintainers of those CPAN modules, not us.  Our new production code was easier to write because it relies on Moose and the built-in validation checks are much tighter than they would otherwise have been, despite the fact that we&#8217;ve written less code overall.  By using the same modules in pfam_scan.pl, we&#8217;ve further reduced the amount of code that we need to maintain overall and have significantly speeded up the development of pfam_scan.pl itself.</p>
<p>In short, we&#8217;re happy that we&#8217;ve struck an appropriate balance between all-out performance and long-term maintainability.  We&#8217;ll keep up to date with changes to the Moose framework that might help improve performance, and we&#8217;ve already improved the performance of the production pipelines by leveraging some of the more esoteric Moose features.  The final release version of the modules and scripts will include these under-the-hood tweaks.</p>
<p><strong>And Finally&#8230;..</strong></p>
<p>Once again, thank you if you tested out the alpha release of the HMMER3-enabled version of pfam_scan.pl.   Your feedback should reduce our pain, and the pain of the community, when it&#8217;s bundled with Pfam release 24.0.  For those of you who cannot wait, you can find the beta release of the script on our (well, Rob&#8217;s) <a href="ftp://ftp.sanger.ac.uk/pub/rdf/PfamScanBeta/" target="_blank">ftp site</a>. Of course, if you find any bugs, or have any feedback, please contact us.</p>
<h3>Update 16/10/09</h3>
<p>The first official release of pfam_scan.pl shipped with Pfam 24.0, and it&#8217;s available for download (the tarball is called PfamScan.tar.gz) at <a href="ftp://ftp.sanger.ac.uk/pub/databases/Pfam/Tools/">ftp://ftp.sanger.ac.uk/pub/databases/Pfam/Tools/</a>.</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://xfam.wordpress.com/2009/09/11/pfam_scan-pl-part-ii/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/5df05397fec11ea44613816772d9b48b?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">jainamistry</media:title>
		</media:content>
	</item>
		<item>
		<title>pfam_scan.pl</title>
		<link>http://xfam.wordpress.com/2009/05/21/pfam_scan-pl/</link>
		<comments>http://xfam.wordpress.com/2009/05/21/pfam_scan-pl/#comments</comments>
		<pubDate>Thu, 21 May 2009 10:30:29 +0000</pubDate>
		<dc:creator>jainamistry</dc:creator>
				<category><![CDATA[HMMER3 migration]]></category>
		<category><![CDATA[Production]]></category>
		<category><![CDATA[hmmer3]]></category>
		<category><![CDATA[pfam]]></category>

		<guid isPermaLink="false">http://xfam.wordpress.com/?p=189</guid>
		<description><![CDATA[
We&#8217;re currently working on a new version of one of our core scripts, &#8216;pfam_scan.pl&#8217;. This script searches a set of protein sequences (in FASTA format) against Pfam&#8217;s library of HMMs. The original code was written nearly a decade ago but, since then, features have been added, bugs have been fixed and the code has evolved [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=xfam.wordpress.com&blog=6232182&post=189&subd=xfam&ref=&feed=1" />]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><div>
<p>We&#8217;re currently working on a new version of one of our core scripts, &#8216;pfam_scan.pl&#8217;. This script searches a set of protein sequences (in FASTA format) against Pfam&#8217;s library of HMMs. The original code was written nearly a decade ago but, since then, features have been added, bugs have been fixed and the code has evolved into something that is far from elegant. The re-write is something that we&#8217;ve been planning to do for a while and, as the code needs updating to use the new <a title="HMMER3" href="http://hmmer.janelia.org/" target="_blank">HMMER3</a> software, now seems like the perfect time to do it.<span id="more-189"></span></p>
<div>
<p>The purpose of &#8216;pfam_scan.pl&#8217; is to search one or more sequences for matching Pfam domains. Depending on the user options, the script can also process the results such that overlaps between families belonging to the same clan are resolved and can predict active sites.  When we generate the Pfam database, we use &#8216;hmmsearch&#8217; to search a database of protein sequences using the HMM for each Pfam domain in turn. When we run &#8216;pfam_scan.pl&#8217;, however, we use &#8216;hmmscan&#8217; (previously known as &#8216;hmmpfam&#8217; in HMMER2) to search a library of HMMs using a set of query sequences. As an aside, it&#8217;s worth noting that there can be a subtle difference between the results that you&#8217;ll get when searching a sequence using &#8216;pfam_scan.pl&#8217;, and the matches that would be stored in the Pfam database for exactly the same sequence, as we are using two different HMMER programs. This is a small effect, but one that&#8217;s worth knowing about.</p></div>
<div>
<h3>Speed</h3>
<p>We&#8217;re seeing a roughly 100-fold increase in search speeds using HMMER3, so we want to pay particular attention to the efficiency of our Perl code, since we don&#8217;t want this to be the rate-limiting step when performing a search. We&#8217;re pleased to note that our benchmarks show that, for a typical search of a 300 amino acid sequence against a library of around 11,000 HMMs, the new &#8216;pfam_scan.pl&#8217; code adds only about 100-200 ms to the search time, over and above the 1 second &#8216;hmmscan&#8217; run time, an overhead of roughly 10-20% (benchmarks were performed on a single 2.4GHz AMD Opteron processor).</p>
<div>
<h3>Design</h3>
<p>We want to use exactly the same code when running searches on our website that our users would use when running searches on their own machines. We&#8217;ve taken this requirement into account from the start, so the new &#8216;pfam_scan.pl&#8217; is written in a far more modular fashion than the old one. This has necessitated some changes in the dependencies of the script, however.</p>
<p>In the past, &#8216;pfam_scan.pl&#8217; was a standalone script, with no external dependencies other than standard Perl library modules and the HMMER programs. Rather than repeatedly re-invent the wheel, we&#8217;ve decided to forgo the standalone nature of the script and use a few modules that can be installed from <a title="Comprehensive Perl Archive Network" href="http://www.cpan.org/" target="_blank">CPAN</a>. We appreciate that this might cause difficulties for some of our users, so we&#8217;re looking at whether to bundle the software along with all of its dependencies, or simply to list the dependencies and let people install them for themselves.</p></div>
</div>
<h3>Simplifications, complications</h3>
<p>Those of you familiar with Pfam models will know that, previously, we created two HMMER2 models for each Pfam family: one for global matches to the model and one for local matches to the model. Substantial improvements to the local search method in HMMER3 now allow us to model each Pfam family with a single HMM. This means that we no longer need to choose one hit over another in those cases where a sequence has overlapping global and local matches to a single model. Most importantly for searches, it also means that the script will only need to search against half as many models as the HMMER2 version.</p>
<p>Although HMMER3 makes life easier in some ways, it does introduce some more complexity. The HMMER3 version of &#8216;pfam_scan.pl&#8217; will report two sets of coordinates for each match, namely the alignment coordinates, and the envelope coordinates.  We&#8217;ll explain more about these in a later blog post&#8230;</p>
<h3>New output formats</h3>
<p>One of the problems we&#8217;ve always had with &#8216;pfam_scan.pl&#8217; is its rather terse tabular output. The old version presented hits as a simple table of results, which, if you wanted to make further use of them, had to be parsed and turned back into a data structure. In fact, the Pfam website does exactly that when running a sequence search. For the new version, we want to make sure that, as well as providing the familiar tabular output by default, we can also return the results of a search in more useful formats.</p>
<p>The main component of &#8216;pfam_scan.pl&#8217; is now written as a Perl module, which is responsible for running a search and returning results as a Perl data structure. The actual script, &#8216;pfam_scan.pl&#8217;, is now just a thin wrapper around that module, and it&#8217;s really only responsible for interpreting command line arguments. By returning results as a Perl data structure, we&#8217;re making it much easier (and quicker) to interpret results, or to pass them onto other analysis tools.</p>
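<p>As a rough illustration of why a structured result is easier to reuse than a printed table, here is a minimal sketch, written in Python rather than Perl. The field names and example values are hypothetical, invented for this sketch; they are not the keys that the real module returns.</p>
<pre><code>import csv
import io
import json

# Hypothetical field names for one match; the real module may use others.
FIELDS = ["seq_id", "family", "ali_start", "ali_end", "env_start", "env_end"]

# Made-up search results: one dictionary per match.
hits = [
    {"seq_id": "seq1", "family": "FamilyA", "ali_start": 12, "ali_end": 95,
     "env_start": 10, "env_end": 98},
    {"seq_id": "seq1", "family": "FamilyB", "ali_start": 140, "ali_end": 230,
     "env_start": 138, "env_end": 233},
]


def to_json(hits):
    """Serialise the list of matches as a JSON string."""
    return json.dumps(hits, indent=2)


def to_csv(hits):
    """Serialise the list of matches as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(hits)
    return buffer.getvalue()


print(to_csv(hits))
print(to_json(hits))
</code></pre>
<p>Because the search results stay in memory as a data structure, each output format is just another small function over the same list, and nothing has to be re-parsed from a text table.</p>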
<h3>Feedback</h3>
<p>The new architecture also allows us to think about adding other output formats. For our internal purposes, a raw Perl data structure is most useful, but if you&#8217;re a &#8216;pfam_scan.pl&#8217; user and feel strongly that we should be considering some other output format (XML, CSV, maybe even JSON), now is the time to let us know!</p>
<p>Finally, if there are any brave souls who would be willing and able to help us test the new &#8216;pfam_scan.pl&#8217;, we&#8217;d really like to hear from you. Testing the script will require you to install quite a few things, such as the HMMER3 executables, the new HMMER3-based HMM library and the various Perl modules that the script requires. If that prospect doesn&#8217;t put you off, please do get in touch, either by leaving a comment here or by <a title="Pfam contact details" href="http://pfam.sanger.ac.uk/help?tab=helpContactUsBlock" target="_blank">mail</a>.</p>
<div>
<p>Posted by Jaina &amp; John</p></div>
</div>
</div>]]></content:encoded>
			<wfw:commentRss>http://xfam.wordpress.com/2009/05/21/pfam_scan-pl/feed/</wfw:commentRss>
		<slash:comments>18</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/5df05397fec11ea44613816772d9b48b?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">jainamistry</media:title>
		</media:content>
	</item>
		<item>
		<title>DUFs: families in need of function</title>
		<link>http://xfam.wordpress.com/2009/04/20/dufs-families-in-need-of-function/</link>
		<comments>http://xfam.wordpress.com/2009/04/20/dufs-families-in-need-of-function/#comments</comments>
		<pubDate>Mon, 20 Apr 2009 10:54:03 +0000</pubDate>
		<dc:creator>alexbateman</dc:creator>
				<category><![CDATA[Production]]></category>
		<category><![CDATA[DUF]]></category>
		<category><![CDATA[pfam]]></category>

		<guid isPermaLink="false">http://xfam.wordpress.com/?p=156</guid>
		<description><![CDATA[Domains of Unknown Function, or DUFs, are a large set of families found in the Pfam database. Examples include &#8220;DUF26&#8221; or &#8220;DUF282&#8221;. The DUF naming scheme was introduced by Chris Ponting, through the addition of DUF1 and DUF2 to the SMART database. These two domains were found to be widely distributed in bacterial signalling [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=xfam.wordpress.com&blog=6232182&post=156&subd=xfam&ref=&feed=1" />]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>Domains of Unknown Function, or DUFs, are a large set of families found in the Pfam database. Examples include &#8220;<a title="DUF26" href="http://pfam.sanger.ac.uk/family/duf26/" target="_blank">DUF26</a>&#8221; or &#8220;<a title="DUF282" href="http://pfam.sanger.ac.uk/family/duf282" target="_blank">DUF282</a>&#8221;. The DUF naming scheme was introduced by Chris Ponting, through the addition of DUF1 and DUF2 to the SMART database. These two domains were found to be widely distributed in bacterial signalling proteins. Subsequently, the functions of these domains were identified and they have since been renamed as the <a title="GGDEF" href="http://pfam.sanger.ac.uk/family/GGDEF" target="_blank">GGDEF</a> and <a title="EAL" href="http://pfam.sanger.ac.uk/family/EAL" target="_blank">EAL</a> domains respectively (structures shown in Figures 1 and 2). These families were added to Pfam in 1997, and little did Chris know that he was starting a trend that would see thousands of uncharacterised families being added to the domain databases.<span id="more-156"></span></p>
<div id="attachment_157" class="wp-caption aligncenter" style="width: 310px"><a href="http://pfam.sanger.ac.uk/family/GGDEF" target="_blank"><img class="size-medium wp-image-157" title="Figure 1" src="http://xfam.files.wordpress.com/2009/04/fig1.jpg?w=300&#038;h=200" alt="GGDEF domain" width="300" height="200" /></a><p class="wp-caption-text">Structure of the GGDEF domain (in green), formerly known as DUF1, now known to function as a diguanylate cyclase enzyme.</p></div>
<div id="attachment_158" class="wp-caption aligncenter" style="width: 310px"><a href="http://pfam.sanger.ac.uk/family/EAL" target="_blank"><img class="size-medium wp-image-158" title="Figure 2" src="http://xfam.files.wordpress.com/2009/04/fig2.jpg?w=300&#038;h=200" alt="EAL domain" width="300" height="200" /></a><p class="wp-caption-text">Structure of the EAL domain (in green), formerly known as DUF2, now known to function as a cyclic diguanylate-specific phosphodiesterase enzyme.</p></div>
<p>At least in Britain, the word &#8220;duff&#8221; conjures up something substandard, with the dictionary definition stating:</p>
<blockquote><p><strong>duff </strong>adj. Brit. slang 1. worthless 2. useless.</p></blockquote>
<p>However, in reality, DUFs are treated with the same loving care as all other Pfam families.<span> The only difference is that</span> our curators are unable to identify any functional information from the scientific literature.</p>
<p style="text-align:left;">In Pfam release 23, the DUF number scheme has reached DUF2607, and the fraction of DUF families in Pfam has increased to about 22% of all families (shown in Figure 3).</p>
<div id="attachment_162" class="wp-caption aligncenter" style="width: 310px"><a href="http://xfam.files.wordpress.com/2009/04/fig32.jpg" target="_blank"><img class="size-medium wp-image-162" title="Figure 3" src="http://xfam.files.wordpress.com/2009/04/fig32.jpg?w=300&#038;h=187" alt="Growth of DUFs" width="300" height="187" /></a><p class="wp-caption-text">Growth of DUFs in Pfam</p></div>
<p>It looks as though the number of DUFs is on the increase. Because DUFs require little annotation, they are often easy families to add to Pfam. We expect that DUFs will soon outnumber the families of known function being added to Pfam.</p>
<p>Identifying functions for Domains of Unknown Function is extremely important if we are to understand biology at a systems level. Essentially there are two ways to find out the function of an uncharacterised domain: the first involves identifying a similarity to a domain of known function, either by sequence comparison or perhaps from a newly solved structure; the second way is good old fashioned molecular biology. Sir Rich Roberts put forward a proposal to stimulate experimentation on uncharacterized proteins [<a href="#ref1">1</a>].</p>
<p>Slowly, momentum is building and more functions of DUFs are being identified. Since we started adding DUFs nearly 10 years ago, over 270 of them have been renamed, presumably because a function had been identified. Our curators have not yet had time to recheck all of the existing 2,000 or so DUFs to see if new functional information has been identified. Therefore, over the coming year, we hope to recheck all of them, and rename and re-annotate those where function is now known. If you know of any recently identified functions for these families, please do let us know.</p>
<p style="text-align:left;">In recent years, structural genomics initiatives have solved the structures of literally hundreds of proteins in uncharacterised families. In many cases, this has helped to narrow down the possible function of a family. For example, DUF442 was shown to be a non-classical phosphatase enzyme, see Figure 4 [<a href="#ref2">2</a>].</p>
<div id="attachment_163" class="wp-caption aligncenter" style="width: 310px"><a href="http://xfam.files.wordpress.com/2009/04/fig4.jpg" target="_blank"><img class="size-medium wp-image-163" title="Figure 4" src="http://xfam.files.wordpress.com/2009/04/fig4.jpg?w=300&#038;h=263" alt="Active site of DUF442" width="300" height="263" /></a><p class="wp-caption-text">Active site of the DUF442 phosphatase enzyme in sequence and structure (From Krishna et al)</p></div>
<p>The DUFs remain a treasure trove of novel biology waiting to be plundered. So, why not get that pioneer spirit and join the gold rush!</p>
<p>Posted by Alex.</p>
<h3>References</h3>
<p>[<a name="ref1">1</a>] R.J. Roberts (2004) <a href="http://biology.plosjournals.org/perlserv/?request=get-document&amp;doi=10.1371/journal.pbio.0020042&amp;ct=1" target="_blank">Identifying protein function – A call for community action.</a><br />
PLoS Biol. 2:e42</p>
<p>[<a name="ref2">2</a>] S.S. Krishna <em>et al.</em> (2007) <a href="http://www.ncbi.nlm.nih.gov/sites/entrez" target="_blank">Crystal structure of NMA1982 from <em>Neisseria meningitidis</em> at 1.5 Å resolution provides a structural scaffold for nonclassical, eukaryotic-like phosphatases.</a> Proteins. 69:415-421</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://xfam.wordpress.com/2009/04/20/dufs-families-in-need-of-function/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/d934f744fadc2a9a95256737ac1302d9?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">alexbateman</media:title>
		</media:content>

		<media:content url="http://xfam.files.wordpress.com/2009/04/fig1.jpg?w=300" medium="image">
			<media:title type="html">Figure 1</media:title>
		</media:content>

		<media:content url="http://xfam.files.wordpress.com/2009/04/fig2.jpg?w=300" medium="image">
			<media:title type="html">Figure 2</media:title>
		</media:content>

		<media:content url="http://xfam.files.wordpress.com/2009/04/fig32.jpg?w=300" medium="image">
			<media:title type="html">Figure 3</media:title>
		</media:content>

		<media:content url="http://xfam.files.wordpress.com/2009/04/fig4.jpg?w=300" medium="image">
			<media:title type="html">Figure 4</media:title>
		</media:content>
	</item>
		<item>
		<title>HMMER3 migration: resolving overlaps</title>
		<link>http://xfam.wordpress.com/2009/03/19/hmmer3-migration-resolving-overlaps/</link>
		<comments>http://xfam.wordpress.com/2009/03/19/hmmer3-migration-resolving-overlaps/#comments</comments>
		<pubDate>Thu, 19 Mar 2009 17:39:58 +0000</pubDate>
		<dc:creator>jainamistry</dc:creator>
				<category><![CDATA[HMMER3 migration]]></category>
		<category><![CDATA[Production]]></category>
		<category><![CDATA[hmmer3]]></category>
		<category><![CDATA[pfam]]></category>

		<guid isPermaLink="false">http://xfam.wordpress.com/?p=132</guid>
		<description><![CDATA[It has been a little quiet on the Pfam blog recently, but behind the scenes we&#8217;ve been working hard on the migration to HMMER3.
We have built HMMER3 models for all of the Pfam alignments, and searched them against the sequence database. This part was super quick, as HMMER3 is ~100 times faster than HMMER2. Due [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=xfam.wordpress.com&blog=6232182&post=132&subd=xfam&ref=&feed=1" />]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>It has been a little quiet on the Pfam blog recently, but behind the scenes we&#8217;ve been working hard on the migration to HMMER3.</p>
<p>We have built HMMER3 models for all of the Pfam alignments, and searched them against the sequence database. This part was super quick, as HMMER3 is ~100 times faster than HMMER2. Due to the increased sensitivity of HMMER3, many of our Pfam families have grown in size, and we have found that ~80,000 sequences in the sequence database now have overlapping matches to more than one Pfam family.</p>
<p>Within Pfam we have a rule that states that our families should not overlap; this means that any one amino acid can belong to only a single Pfam family.  The exception to this rule applies to families within a clan &#8211; clans are Pfam&#8217;s collections of related families &#8211; where overlaps between clan members are allowed. Over the last few weeks we&#8217;ve been working through and resolving the list of 80,000 overlaps.<span id="more-132"></span></p>
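<p>The rule can be expressed as a small check. The sketch below is in Python and is only an illustration of the rule as stated above, not the code we run in production; the family names, clan name and coordinates are made up.</p>
<pre><code>from itertools import combinations

# Made-up matches on a single sequence: (family, start, end), 1-based inclusive.
matches = [
    ("FamA", 10, 120),
    ("FamB", 100, 200),
    ("FamC", 150, 260),
    ("FamD", 300, 380),
]

# Made-up clan membership; families missing from the map belong to no clan.
clan_of = {"FamB": "CL_X", "FamC": "CL_X"}


def overlap_length(a, b):
    """Number of residues shared by two (family, start, end) matches."""
    shared = min(a[2], b[2]) - max(a[1], b[1]) + 1
    return max(0, shared)


def same_clan(fam_a, fam_b):
    """True only if both families belong to the same clan."""
    clan_a = clan_of.get(fam_a)
    return clan_a is not None and clan_a == clan_of.get(fam_b)


def disallowed_overlaps(matches):
    """List (family, family, shared residues) for overlaps that break the rule."""
    problems = []
    for a, b in combinations(matches, 2):
        shared = overlap_length(a, b)
        if shared and a[0] != b[0] and not same_clan(a[0], b[0]):
            problems.append((a[0], b[0], shared))
    return problems


print(disallowed_overlaps(matches))
</code></pre>
<p>With these example values, the only overlap reported is the 21 residues shared by FamA and FamB. FamB and FamC also overlap, but they are in the same clan, so that overlap is allowed.</p>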
<p>We use several methods for resolving overlaps.  Where families are related, we put them into the same clan, or merge them if the similarity is very high. Where families overlap by only a few residues, we trim the domain boundaries so that the two families no longer overlap. Finally, where we think that the overlapping sequences are false positives in one or other of the families, we raise the threshold in that family so that those sequences are excluded. In this way we maintain the high quality of Pfam data.</p>
<p>The overlaps generated by HMMER3 have allowed us to find new relationships between families, and to confirm relationships that we had an inkling about.  We&#8217;re using PRC,  SCOOP and structural data along with the HMMER3 overlap data to decide whether to put families into the same clan.  In addition to adding many families to existing Pfam clans, so far we have created approximately 60 new clans.</p>
<p>Overlap resolution has been quite a lengthy process, and we&#8217;ve still got ~10,000 to go. However, because more families are now in clans, we hope that we will have added a great deal of value to the Pfam database by indicating which families are related. Overlaps are also resolved, admittedly on a smaller scale, every time we update the sequence database for a new Pfam release. Future releases should benefit from the improved clan infrastructure in terms of overlap resolution, and we hope to improve this further with each release.</p>
<p>We&#8217;ll keep you posted about progress towards the next release of Pfam (version 24.0).</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://xfam.wordpress.com/2009/03/19/hmmer3-migration-resolving-overlaps/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/5df05397fec11ea44613816772d9b48b?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">jainamistry</media:title>
		</media:content>
	</item>
		<item>
		<title>What Pfam did in 2008</title>
		<link>http://xfam.wordpress.com/2009/01/27/what-pfam-did-in-2008/</link>
		<comments>http://xfam.wordpress.com/2009/01/27/what-pfam-did-in-2008/#comments</comments>
		<pubDate>Tue, 27 Jan 2009 13:18:54 +0000</pubDate>
		<dc:creator>alexbateman</dc:creator>
				<category><![CDATA[News]]></category>
		<category><![CDATA[pfam]]></category>

		<guid isPermaLink="false">http://xfam.wordpress.com/?p=59</guid>
		<description><![CDATA[I thought it would be useful to give a quick overview of some of the major things that have been going on behind the scenes at Pfam during 2008. Overall it may have seemed like a quiet year for our users as we only made one public release of data in July, release 23.0. However, [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=xfam.wordpress.com&blog=6232182&post=59&subd=xfam&ref=&feed=1" />]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>I thought it would be useful to give a quick overview of some of the major things that have been going on behind the scenes at Pfam during 2008. Overall it may have seemed like a quiet year for our users as we only made one public release of data in July, release 23.0. However, like a paddling duck, the calmness viewed from above belies some furious paddling below.<span id="more-59"></span></p>
<p>In total <strong>1326 new families</strong> were added to our curation database. These were derived from a number of different sources such as Pfam-B, our automatically generated clustering of sequence regions not already in Pfam. As well as adding new families, we also made a valiant attempt to go through our existing families and improve their scope through a process we call iteration.</p>
<p>Iteration takes the <em>full</em> sequence alignment for a family and attempts to make a new non-redundant <em>seed</em> alignment from it. When a new HMM is made from this seed alignment, we hope to find new even more distant homologues for the family. During 2008 we <strong>iterated every single family</strong> that was not in a Pfam clan. For about 50% of Pfam families we were able to find more homologues. While in most cases there were only modest improvements, in some cases we found many hundreds of new family members.  In some cases this process allowed us to realise that two families were actually related and merge them into a single entry. For example we found that the uncharacterised family DUF30 (accession PF01727) was actually related to <a href="http://pfam.sanger.ac.uk//family/Peptidase_S7">Peptidase_S7 (PF00949)</a>.</p>
<p>We <strong>changed the underlying source of protein clusters used in Pfam-B,</strong> from PRODOM to the PairsDB database of Andreas Heger and Liisa Holm. Unfortunately, the PRODOM database was not able to keep up a frequent release schedule and has become somewhat out of date. The greatly improved coverage of Pfam-B has increased the overall comprehensiveness of Pfam.</p>
<p>The <strong>rapid growth in the number of protein sequences</strong> has meant that we have had to revisit many aspects of the software used to run Pfam. Large portions of the codebase were rewritten to help us scale with the sequence deluge. In 2008 we started providing Pfam match data for NCBI GenPept and metagenomic sequences, in addition to UniProt sequences.  As a result, the number of sequences we process grew from just over 3 million UniProt sequences in 2007 to 17 million sequences. This sequence growth, combined with the increased number of families, has resulted in a dramatic increase in the amount of computer power required to produce a Pfam release. To produce a release quickly enough that it is not obsolete by the time it appears, we restructured our release pipeline from a linear procedure into one that is largely parallel. The Pfam 23.0 release took about two months to produce, but consumed approximately 60 CPU years on the Sanger compute cluster. As part of this restructuring, we have also endeavoured to <strong>increase the level of quality control</strong>. The primary focus here was to ensure consistency between the data stored on disk (used to represent a family during the curation process) and the information in the MySQL database (used to provide the website).</p>
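<p>To put those two figures side by side, here is a back-of-the-envelope calculation, sketched in Python. It assumes, purely for illustration, that the work was spread evenly over the whole two months, which will not have been exactly true in practice.</p>
<pre><code># Approximate figures quoted in the paragraph above.
cpu_years = 60
wall_clock_months = 2


def average_concurrent_cpus(cpu_years, wall_clock_months):
    """Average number of CPUs kept busy, assuming an evenly spread load."""
    return cpu_years * 12 / wall_clock_months


print(round(average_concurrent_cpus(cpu_years, wall_clock_months)))
</code></pre>
<p>Under that assumption, the release kept the equivalent of roughly 360 CPUs busy around the clock, which is why a linear pipeline was no longer an option.</p>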
</div>]]></content:encoded>
			<wfw:commentRss>http://xfam.wordpress.com/2009/01/27/what-pfam-did-in-2008/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/d934f744fadc2a9a95256737ac1302d9?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">alexbateman</media:title>
		</media:content>
	</item>
		<item>
		<title>Early adoption of HMMER3</title>
		<link>http://xfam.wordpress.com/2009/01/21/early-adoption-of-hmmer3/</link>
		<comments>http://xfam.wordpress.com/2009/01/21/early-adoption-of-hmmer3/#comments</comments>
		<pubDate>Wed, 21 Jan 2009 10:46:58 +0000</pubDate>
		<dc:creator>rdfinn</dc:creator>
				<category><![CDATA[HMMER3 migration]]></category>
		<category><![CDATA[Production]]></category>
		<category><![CDATA[hmmer3]]></category>
		<category><![CDATA[pfam]]></category>

		<guid isPermaLink="false">http://xfam.wordpress.com/?p=36</guid>
		<description><![CDATA[As the first post suggested, this blog will partly describe the progress and issues faced with the migration of Pfam to HMMER3.  We&#8217;ve been waiting for the mercurial HMMER3 for well over a year now, watching all the while its ever receding release date.  However, it has finally been released, albeit in alpha phase! Given [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=xfam.wordpress.com&blog=6232182&post=36&subd=xfam&ref=&feed=1" />]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>As the first post suggested, this blog will partly describe the progress and issues faced with the migration of Pfam to HMMER3.  We&#8217;ve been waiting for the mercurial HMMER3 for well over a year now, watching all the while its ever receding release date.  However, it <em>has</em> finally been released, albeit in alpha phase! Given Sean Eddy&#8217;s past record on HMMER2, particularly his attention to detail and his hatred of bugs in his software, we (Pfam) are already confident enough to be looking at migrating to HMMER3. This post will set out the rationale for moving Pfam to HMMER3 quickly and look at some of the issues that will inevitably follow such a move.<span id="more-36"></span></p>
<h3>Speed!</h3>
<p>First and foremost, we&#8217;re concerned about compute time: the last Pfam release, Pfam 23.0, took nearly 60 CPU years to produce, most of that time spent running <em>hmmsearch</em> against the various sequence databases (UniProt, NCBI GenPept, metagenomic sequences). For us especially, the 100-fold speed increase promised by HMMER3 makes it well worth exploring! Additionally, when it comes to searching Pfam with HMMER3, we can immediately cut the search time in half, since there will be only one HMM per family. Currently, to ensure that each family is as comprehensive as possible, we produce two HMMs per family, which attentive Pfam users will recognise as <em>Pfam_ls</em> and <em>Pfam_fs</em>. The <em>ls</em> models (&#8220;glocal&#8221;) find sequence matches that correspond to the whole length of the HMM. The <em>fs</em> models (local) find partial matches with respect to the model; more often than not, partial matches would be shorter than the actual domains in the sequence. However, we understand that, with HMMER3, many of the multi-hit local alignment issues have been resolved, and partial matches can now be more readily extended to match the full length of the model, where appropriate. Consequently, with HMMER3 we will only need one HMM per family (&#8220;local&#8221;) and this will remove a heck of a lot of confusion as to how a match was found.</p>
<h3>Sensitivity</h3>
<p>The second reason for migrating to HMMER3 is that it is much more sensitive than HMMER2, which is a very good thing! We now have well over 10,000 families in Pfam and many of these families belong to Pfam clans. Clans are collections of related families and are needed to cover the situation where a single HMM is not sensitive enough to find all members of a family. Furthermore, we tend to set rather conservative thresholds when we curate a family, in order to avoid the inclusion of false positives, but we do this even at the expense of missing real matches. Our hope is that the new HMMER3 software will enable our large, divergent families to become even larger and that we will be able to collect all known members of a family, without the inclusion of false positives. Cases of wine have been bet on the increased sensitivity, and it&#8217;s about time Sean actually won one of these bets!</p>
<h3>Issues</h3>
<p>Personal experience with HMMER3 gives us confidence that switching will ultimately be beneficial to Pfam. Having played with both a pre-alpha release and the alpha release of HMMER3, we find that it certainly appears to be living up to expectations. There are still some issues with HMMER3, such as missing a few true positive matches that HMMER2 found (from what I can tell, these are due to the filtering heuristics/bias composition) and HMMER3 failing miserably on a handful of very short (fewer than 10 amino acids) tandem repeats. Despite these problems, HMMER3 is, overall, a great improvement, and we are now seriously looking at migrating Pfam to use this new version of HMMER.</p>
<p>What are the issues for Pfam users? Well, the good news is that if you use Pfam mainly via the website, there should be little change. Hopefully, you&#8217;ll find that sequence searches will run even faster and, once we&#8217;re fully operational with HMMER3, we hope to be able to provide some additional services that take advantage of the new features in HMMER3.</p>
<p>For those using the Pfam data files rather than the website, we don&#8217;t expect a huge number of changes in the <em>Pfam-A.full</em> and <em>Pfam-A.seed</em> flat files. However, if you use the Pfam HMMs then most stuff is (initially) going to break. There is no way to produce parallel sets of HMMs using both HMMER2 and HMMER3, so when we finally make the change, it is going to be fairly drastic!</p>
<h3>Conclusions</h3>
<p>We still have a lot of work to do before we can release a version of Pfam based on HMMER3. Every Pfam family will need to have its curated threshold reset, a process that is likely to be time-consuming, and we need to re-write large sections of our production pipelines to deal with the changes that HMMER3 requires. As we face issues and/or make changes, we intend to post to this blog.</p>
<p>These are exciting times, but the changes we all face (both the Pfam group and our user community) will undoubtedly cause some pain. As we get closer to a final switchover to HMMER3, we&#8217;ll use this blog to tell users more about the process and, hopefully, to get feedback from Pfam users and find out their hopes and fears for HMMER3! Ultimately, we <em>will</em> produce a significantly more useful version of Pfam, where, for instance, a bacterial genome can be comfortably searched on a basic laptop within minutes, but it will be an interesting journey.</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://xfam.wordpress.com/2009/01/21/early-adoption-of-hmmer3/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/979e0bdb3b6200e39425a8748897cf4d?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">rdfinn</media:title>
		</media:content>
	</item>
		<item>
		<title>Welcome to the Xfam blog</title>
		<link>http://xfam.wordpress.com/2009/01/19/welcome-to-the-xfam-blog/</link>
		<comments>http://xfam.wordpress.com/2009/01/19/welcome-to-the-xfam-blog/#comments</comments>
		<pubDate>Mon, 19 Jan 2009 16:03:50 +0000</pubDate>
		<dc:creator>alexbateman</dc:creator>
				<category><![CDATA[News]]></category>
		<category><![CDATA[hmmer3]]></category>
		<category><![CDATA[pfam]]></category>
		<category><![CDATA[rfam]]></category>
		<category><![CDATA[xfam]]></category>

		<guid isPermaLink="false">http://xfam.wordpress.com/?p=15</guid>
		<description><![CDATA[Welcome to the new blog for the Xfam databases! Xfam is our shorthand for the combination of the Pfam and Rfam databases, a name that should also future-proof us, in case we add any further databases to the brand.
We hope that this blog will become a useful point of reference, where our users can learn about [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>Welcome to the new blog for the Xfam databases! Xfam is our shorthand for the combination of the <a href="http://pfam.sanger.ac.uk">Pfam</a> and <a href="http://rfam.sanger.ac.uk">Rfam</a> databases, a name that should also future-proof us, in case we add any further databases to the brand.</p>
<p>We hope that this blog will become a useful point of reference, where our users can learn about what is going on behind the scenes at Xfam central. We will be announcing some important changes that are coming with the eagerly awaited release of <a href="http://selab.janelia.org/people/eddys/blog/?p=56">HMMER 3</a>. As well as announcing new releases of the data and website, we&#8217;ll also try to discuss our philosophy on protein/RNA domains and sequence classification. If there are other topics that you would like to hear more about, why not leave us a comment?</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://xfam.wordpress.com/2009/01/19/welcome-to-the-xfam-blog/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/d934f744fadc2a9a95256737ac1302d9?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">alexbateman</media:title>
		</media:content>
	</item>
	</channel>
</rss>