We’ve had a lot of questions from users recently, wondering why our pfam_scan.pl script doesn’t work with the latest release of the HMMER package, version 3.1b. This is a quick post to explain why that is, and what we’ve done about it.
The identification of orthologs in related organisms is a routine task, and many databases and tools are available for it. Some of the databases can be installed locally, but that is not ideal when the goal is to find orthologs for only one or a few genes. To fill this gap, we developed a quick orthology-on-the-fly prediction tool that is built on top of the HMMER search we introduced in release 9 and can be used here: www.treefam.org.
On our HMMER search page, users can paste a protein sequence and tick the box “insert into tree”. In addition to reporting hits in TreeFam (we also search Pfam for protein domain hits), we then align the user-supplied sequence to the best-matching TreeFam family using MAFFT (1) and add the sequence to the corresponding gene tree, using the family gene tree as a reference for the RAxML-EPA algorithm (2).
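The two steps described above (adding a sequence to an existing alignment, then placing it in a reference tree) could be sketched roughly as follows. This is only an illustration and not the commands TreeFam actually runs: the file names, the `PROTGAMMAJTT` model and the helper function names are assumptions, and the sketch only builds the command lines without executing them.

```python
from typing import List


def mafft_add_command(new_sequence_fasta: str, family_alignment_fasta: str) -> List[str]:
    """Build a MAFFT call that adds a new sequence to an existing alignment.

    --add aligns the new sequence against the existing alignment;
    --keeplength keeps the existing alignment columns unchanged.
    MAFFT writes the combined alignment to standard output.
    """
    return ["mafft", "--add", new_sequence_fasta, "--keeplength", family_alignment_fasta]


def raxml_epa_command(
    combined_alignment: str,
    reference_tree: str,
    run_name: str,
    mode: str = "ml",
    model: str = "PROTGAMMAJTT",  # assumed model; the post does not say which one is used
) -> List[str]:
    """Build a RAxML call that places query sequences into a reference tree.

    -f v selects the likelihood-based placement, -f y the parsimony-based one.
    """
    algorithm = {"ml": "v", "parsimony": "y"}[mode]
    return [
        "raxmlHPC",
        "-f", algorithm,
        "-s", combined_alignment,  # alignment containing reference and query sequences
        "-t", reference_tree,      # reference gene tree (here: the family tree)
        "-m", model,
        "-n", run_name,
    ]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(" ".join(mafft_add_command("query.fa", "family_alignment.fa")))
    print(" ".join(raxml_epa_command("combined.fa", "family.nhx", "query_placement")))
```

The commands could then be passed to `subprocess.run`, redirecting the MAFFT output into the file that is handed to RAxML.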
We benchmarked different evolutionary placement algorithms for inserting a sequence into a reference tree. Note that rebuilding the tree is in most cases computationally too expensive to be offered as a web service. We tested the RAxML-EPA algorithm (both the maximum likelihood (ML) and the parsimony (P) option (2)) and Pagan (3) on a subset of TreeFam families and used the gene tree built with TreeBest as a reference. We were interested in accuracy and run time, so we looked at the average scaled Robinson-Foulds distance (4) and the run time of each program. Here is a summary of our results:
| Program | Avg. RF distance | Run time (seconds) |
|---|---|---|
| RAxML-EPA – ML | 0.295 | 66.178 |
| RAxML-EPA – P | 0.206 | 0.164 |
While trees produced by RAxML-ML were closest to the ones generated by TreeBest, the ones produced by RAxML’s parsimony option were by far the fastest to finish. We therefore decided to offer both RAxML options, providing a good trade-off between speed and accuracy.
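For readers unfamiliar with the metric used in the benchmark, here is a minimal sketch of a scaled Robinson-Foulds distance. It is not the code we used: trees are represented as nested tuples of leaf names, they are compared as unrooted trees, and the scaling (dividing by the total number of non-trivial splits in both trees) is an assumption, since several normalisations are in use.

```python
from typing import FrozenSet, Iterable, Set, Tuple, Union

# A tree is either a leaf name or a tuple of subtrees (illustrative representation).
Tree = Union[str, Tuple["Tree", ...]]
Split = FrozenSet[FrozenSet[str]]


def leaves(tree: Tree) -> FrozenSet[str]:
    """Return the set of leaf names below a node."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def splits(tree: Tree) -> Set[Split]:
    """Collect the non-trivial bipartitions (splits) induced by internal edges."""
    all_leaves = leaves(tree)
    found: Set[Split] = set()

    def visit(node: Tree) -> FrozenSet[str]:
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        other = all_leaves - below
        # Keep only splits with at least two leaves on each side.
        if len(below) >= 2 and len(other) >= 2:
            found.add(frozenset([below, other]))
        return below

    visit(tree)
    return found


def scaled_rf(tree_a: Tree, tree_b: Tree) -> float:
    """Splits found in only one tree, divided by all splits in both trees."""
    if leaves(tree_a) != leaves(tree_b):
        raise ValueError("both trees must contain the same leaves")
    splits_a, splits_b = splits(tree_a), splits(tree_b)
    total = len(splits_a) + len(splits_b)
    if total == 0:
        return 0.0
    return len(splits_a ^ splits_b) / total


def average_scaled_rf(pairs: Iterable[Tuple[Tree, Tree]]) -> float:
    """Average the scaled distance over (reference tree, test tree) pairs."""
    values = [scaled_rf(a, b) for a, b in pairs]
    return sum(values) / len(values)


if __name__ == "__main__":
    reference = ((("A", "B"), "C"), ("D", "E"))
    placed = ((("A", "C"), "B"), ("D", "E"))
    print(scaled_rf(reference, placed))
```

With this scaling, 0 means that both trees contain exactly the same splits and 1 means that they share none, so lower values indicate trees that are closer to the reference.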
Upon completion of the orthology-on-the-fly search, TreeFam provides a summary page of the best matching TreeFam family and a gene tree with the newly added sequence. Given that all our gene trees have protein domain annotations, we search the Pfam database to report Pfam protein domain matches on the sequence, making it easy to spot differences in the domain architecture of the new sequence and members of the gene family. Here is an example screenshot of the search results page.
Go to http://www.treefam.org, paste your sequence and see where in our gene trees it will be placed!
Your TreeFam team,
Fabian and Mateus
(1) Katoh,K. and Standley,D.M. (2013) MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol, 10.1093/molbev/mst010.
(2) Berger,S.A., Krompass,D. and Stamatakis,A. (2011) Performance, accuracy, and Web server for evolutionary placement of short sequence reads under maximum likelihood. Syst Biol, 60, 291–302.
(3) Loytynoja,A., Vilella,A.J. and Goldman,N. (2012) Accurate extension of multiple sequence alignments using a phylogeny-aware graph algorithm. Bioinformatics, 28, 1684–1691.
(4) Robinson,D.F. and Foulds,L.R. (1981) Comparison of phylogenetic trees. Math Biosci, 53, 131–147.
We are pleased to announce that we’ve released Dfam 1.2. This version represents a few important changes from 1.1, including increased sensitivity for many families, a new plot on the model page, and an improved Relationships tab.
Following on from Jaina and Marco’s blog post last week about conserved human regions not in Pfam, I would like to give you some examples of how we have used the regions identified to improve existing Pfam families, and to create new ones. When available, we use three-dimensional structures to guide the boundary definitions of our families. In cases where there is no available structure, either for the protein in question or for other proteins in the same Pfam family, we base boundary decisions on sequence conservation. The following paragraphs give three examples of cases I have looked at recently.
Recently, we have been looking at how much of the human proteome is covered by Pfam (release 27.0), and at ways in which we can improve this coverage. We have even written an open access paper about it, which you can read here and which is part of the proceedings of the 2013 Biocuration conference. We used the human proteins in UniProtKB/Swiss-Prot (~20,000 sequences) as our human proteome set, and found that while most of the sequences in this set have some Pfam annotation (90% have at least one Pfam domain), there is still much ground to cover before we have a complete map of all (conserved) human regions (HRs). Here, rather than repeating what we presented in the paper (did we mention it is open access? :-)), we would like to tell you more about the impact this study is having on our strategies for selecting target regions to be added to Pfam.
We are happy to announce that TreeFam 9 is online and you can find it under http://www.treefam.org.
TreeFam 9 now has 109 species (vs. 79 in TreeFam 8) and is based on data from Ensembl v69, Ensembl Genomes v16, Wormbase and JGI.
This release marks an important step for TreeFam, as it is the first release built since TreeFam was resurrected.
Here is a list of the most important changes in TreeFam 9:
- New website layout (adopting the Pfam/Rfam/Dfam layout)
- Infrastructure move of web servers and databases to the EBI
- Sequence search against the library of TreeFam family profiles
- Pairwise homology download
We hope you find all the information you are looking for. If you don’t, please let us know so that we can include the information you want. The old website will remain online here.
If you have questions, suggestions or find bugs, don’t hesitate to contact us through our new forum here.
the TreeFam team
In a blog post published just over a year ago, I proposed a number of changes to the content of Pfam to improve scalability and usability of the database. These changes came into effect a few days ago, when we released Pfam 27.0. This release of Pfam contains a total of 14831 families, with 1182 new families and 22 families killed since release 26.0. 80% of all proteins in UniProt contain a match to at least one Pfam domain, and 58% of all residues in the sequence database fall within a Pfam domain.
We are pleased to announce that the Dfam paper (“Dfam: a database of repetitive DNA based on profile hidden Markov models”) is now available in the 2013 NAR Database issue, and has been selected as a “featured article” (meaning the NAR editorial board thinks it is among “the top 5% of papers in terms of originality, significance and scientific excellence”).
In other exciting news, two members of the Dfam consortium, Arian Smit and Robert Hubley (Institute for Systems Biology, Seattle), just released RepeatMasker 4.0. This is a major update that, among other important improvements, adds support for searching with Dfam and nhmmer. Go get yourself a copy at http://www.repeatmasker.org/
Posted by Travis
Behind the scenes we are working hard on building the next TreeFam release, which will be TreeFam 9.
TreeFam 9 will have 109 species, a 38% increase over the 79 species in TreeFam 8. Most of the species come from EnsEMBL (v.69) and EnsEMBL genomes (v.16), with a few coming from JGI.
Beyond that, and probably most important for users, is our new website. Based on the success of other Xfam databases like Pfam, Rfam and, most recently, Dfam, we decided to give the TreeFam website a facelift by adapting it to the Xfam look and feel.
So, there are great things to come and soon we will have our next blog post.
The next TreeFam blog post will then be about TreeFam 9!
For some light weekend reading, have a look at the latest Rfam paper, Rfam 11.0: 10 years of RNA Families. It’s part of the 2013 Nucleic Acids Research Database issue, and you’ll find all the latest developments to Rfam mentioned, including the sunbursts, the Biomart and an update on the Wikipedia annotation effort.