Following on from Jaina and Marco’s blog post last week about conserved Human regions not in Pfam, I would like to give you some examples of how we have used the regions identified to improve existing Pfam families, and to create new ones. When available, we use three-dimensional structures to guide the boundary definitions of our families. In cases where there is no available structure, either for the protein in question or for other proteins in the same Pfam family, we base boundary decisions on sequence conservation. The following paragraphs give three examples of cases I have looked at recently.
Archive for the 'Pfam' Category
Recently, we have been looking at how much of the human proteome is covered by Pfam (release 27.0), and at ways in which we can improve this coverage. We have even written an open access paper about it that you can read here; it is part of the proceedings of the 2013 Biocuration conference. We used the human proteins in UniProtKB/Swiss-Prot (~20,000 sequences) as our human proteome set, and found that while most of the sequences in this set have some Pfam annotation (90% have at least one Pfam domain), there is still much ground to cover before we have a complete map of all (conserved) human regions (HRs). Here, rather than repeating what we presented in the paper (did we mention it is open access?), we would like to tell you more about the impact this study is having on our strategies for selecting target regions to be added to Pfam.
In a blog post published just over a year ago, I proposed a number of changes to the content of Pfam to improve scalability and usability of the database. These changes came into effect a few days ago, when we released Pfam 27.0. This release of Pfam contains a total of 14831 families, with 1182 new families and 22 families killed since release 26.0. 80% of all proteins in UniProt contain a match to at least one Pfam domain, and 58% of all residues in the sequence database fall within a Pfam domain.
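The two coverage figures quoted for Pfam 27.0 measure different things: the first counts proteins with at least one match, the second counts residues that fall inside a match. The sketch below shows one way the two numbers could be computed from a set of domain regions. It is only an illustration: the function name `coverage`, the input layout and the toy data are all assumptions made here, not the Pfam production code.

```python
def coverage(seq_lengths, domain_regions):
    """Return (sequence coverage, residue coverage) as fractions.

    seq_lengths:    dict mapping sequence id -> length in residues
    domain_regions: dict mapping sequence id -> list of (start, end)
                    tuples, 1-based and inclusive (hypothetical layout)
    """
    matched_sequences = 0
    covered_residues = 0
    total_residues = sum(seq_lengths.values())

    for seq_id, length in seq_lengths.items():
        regions = domain_regions.get(seq_id, [])
        if regions:
            matched_sequences += 1
        # Use a set of positions so overlapping regions are not double counted.
        positions = set()
        for start, end in regions:
            positions.update(range(max(1, start), min(end, length) + 1))
        covered_residues += len(positions)

    return matched_sequences / len(seq_lengths), covered_residues / total_residues


# Toy data, invented for this example.
toy_lengths = {"seqA": 100, "seqB": 50, "seqC": 50}
toy_regions = {"seqA": [(1, 40), (30, 60)], "seqB": [(11, 20)]}

seq_cov, res_cov = coverage(toy_lengths, toy_regions)
print(f"sequence coverage: {seq_cov:.2f}, residue coverage: {res_cov:.2f}")
```

In the toy data two of three sequences have a match, but only 70 of 200 residues are covered, which is why residue coverage is typically the lower of the two figures.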
We’ve had a few helpdesk tickets in the last few months asking how to download all of the Pfam-A domains for a particular species. This information can be quite difficult to obtain: getting it requires either downloading and installing a subset of the tables in our MySQL database, or else searching all of the sequences from the species of interest against Pfam, probably using our batch search.
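As a rough illustration of the second route, suppose the species' sequences were searched locally against the Pfam-A HMMs with HMMER's `hmmscan` and the per-domain table was saved with `--domtblout` (optionally with `--cut_ga` to apply the gathering thresholds). This is an assumption for the example, not the batch search described above. The sketch below groups the reported domains by query sequence; the function name and the toy input lines are invented here, and the choice to report envelope coordinates is just one reasonable option.

```python
from collections import defaultdict


def domains_per_sequence(lines):
    """Group domain hits from hmmscan --domtblout output by query sequence.

    Returns a dict: query name -> list of
    (family accession, family name, envelope start, envelope end).
    """
    hits = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        # 22 whitespace-separated columns, then a free-text description.
        fields = line.split(None, 22)
        if len(fields) < 22:
            continue  # not a complete domain row
        family_name, family_acc = fields[0], fields[1]
        query_name = fields[3]
        env_from, env_to = int(fields[19]), int(fields[20])
        hits[query_name].append((family_acc, family_name, env_from, env_to))
    return dict(hits)


# Toy rows in domtblout layout; every value is made up for illustration.
toy_domtblout = """
# target name  accession  tlen  query name ...
Pkinase PF00069.20 264 seq1 - 500 1e-50 170.0 0.0 1 1 1e-53 1e-50 169.5 0.0 1 264 100 360 98 362 0.95 Protein kinase domain
SH2 PF00017.25 77 seq2 - 300 1e-20 70.0 0.1 1 1 1e-23 1e-20 69.5 0.1 1 77 10 85 9 86 0.93 SH2 domain
"""

for query, domains in domains_per_sequence(toy_domtblout.splitlines()).items():
    print(query, domains)
```

For a real run, the same function could be fed an open file handle for the `--domtblout` file instead of the toy string.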
Two related questions that we are often asked via the Pfam helpdesk are ‘Which families have a known three-dimensional structure?’ and ‘Why is a particular PDB structure not found in Pfam?’ You may think that there are obvious answers to these questions – but as with many things in life, the answer is not necessarily as straightforward as you would have thought. In this joint posting between Andreas Prlic (senior scientist at RCSB Protein Data Bank) and myself (Rob Finn, Pfam Production Lead), we will elaborate on the way the PDB and Pfam cross-referencing occurs, explain why discrepancies occurred in the past, and describe the pipeline that the RCSB PDB has implemented using the HMMER web services API, which should provide the most current answer to these questions.
AntiFam is the newest addition to the Xfam brand. It is a database of hidden Markov models (HMMs) designed to identify spurious open reading frames (ORFs). It is available now on our ftp site:
The current Pfam release, version 26.0, took approximately 4 months to nurse through the various stages of updating the sequence database, resolving overlaps between families, rebuilding the MySQL database and performing all of the post-processing that constitutes the ‘release’. The production team strives to make two releases a year, but I really do not fancy spending two-thirds of a year on Pfam releases. Thus, with my colleagues, I have been reviewing what we do and why we do it and, probably more importantly, assessing how much different sections of the Web site are used. Below is a list of changes that are going to happen in the next release, release 27.0.
Since releasing the new Pfam website four years ago, we’ve had a steady trickle of mails from users who would like to install and run the site within their own local environment. It used to be possible to do just that, given a following wind, if you were ready to install the site from its source code. Unfortunately, after some internal changes and as the list of Perl module dependencies grew and grew, the process got harder and more complex and eventually we stopped supporting it entirely. We’ve been actively discouraging people from trying this for far too long, all the while promising to make the process easier. Finally we’ve managed to get around to building a virtual machine (VM) that should make the whole thing possible again.
As you surely have noted, the highly anticipated new Pfam paper is out as part of the 2012 NAR database issue! We were delighted to be listed as a featured article. The paper covers the new release 26.0 (more on this from Rob soon) and presents some novel analysis that may be of interest to Pfam addicts like you. We quite extensively discuss our use of family-specific bit score gathering thresholds (GAs), hoping to bring clarity to an issue that seems to have been a source of confusion in the past (a.k.a. stop sending us tickets asking what GAs are and how to use them!). Also, we extend and update the analysis of DUF families that was presented in a previous publication, hoping to push more people into the de-DUF a DUF game. So, enjoy reading the paper and send us comments and suggestions; your support and advice is, as always, invaluable to us!!
Posted by Marco