TreeFam: new Orthology-on-the-fly feature

September 17, 2013

The identification of orthologs in related organisms is a routine task, and many databases and tools are available for it. Some of them need to be installed locally, which is not ideal when the goal is to find orthologs for only one or a few genes. To fill this gap, we developed a quick orthology-on-the-fly prediction tool. It is built on top of the HMMER search we introduced in release 9 and is available on our HMMER search page.

On our HMMER search page, users can paste a protein sequence and tick the box “insert into tree”. In addition to reporting hits in TreeFam (and protein domain hits in Pfam), we then align the user-supplied sequence to the best-matching TreeFam family using MAFFT (1) and add the sequence to the corresponding gene tree, using the family gene tree as the reference for the RAxML-EPA algorithm (2).
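The two steps described above (extend the family alignment, then place the new sequence in the existing tree) can be sketched with the command-line versions of the two programs. This is a minimal illustration, not the TreeFam server code: the function name, the file names, the `PROTGAMMAWAG` substitution model and the `raxmlHPC` binary name are assumptions, and depending on the RAxML version the alignment may first have to be converted to PHYLIP format.

```python
import subprocess


def placement_commands(query_fasta, family_alignment, family_tree,
                       run_name="query", use_parsimony=False):
    """Build (but do not run) the commands for one placement.

    All file names and the substitution model are illustrative assumptions.
    """
    extended_alignment = f"{run_name}.extended.fasta"

    # Step 1: `mafft --add` aligns new sequences to an existing alignment.
    # MAFFT writes the result to stdout, so the caller has to redirect it
    # into `extended_alignment`.
    mafft_cmd = ["mafft", "--add", str(query_fasta), str(family_alignment)]

    # Step 2: RAxML evolutionary placement on the fixed reference tree.
    # `-f v` places sequences with maximum likelihood,
    # `-f y` places them with parsimony.
    algorithm = "y" if use_parsimony else "v"
    raxml_cmd = [
        "raxmlHPC",
        "-f", algorithm,
        "-s", extended_alignment,   # alignment including the query
        "-t", str(family_tree),     # reference gene tree (Newick)
        "-m", "PROTGAMMAWAG",       # assumed protein model
        "-n", run_name,             # suffix for RAxML output files
    ]
    return mafft_cmd, extended_alignment, raxml_cmd


def run_placement(query_fasta, family_alignment, family_tree, **kwargs):
    """Run both steps; requires mafft and raxmlHPC to be installed."""
    mafft_cmd, extended, raxml_cmd = placement_commands(
        query_fasta, family_alignment, family_tree, **kwargs)
    with open(extended, "w") as out:
        subprocess.run(mafft_cmd, stdout=out, check=True)
    subprocess.run(raxml_cmd, check=True)
```

Calling `placement_commands("query.fa", "family.fa", "family.nwk", use_parsimony=True)` returns the command lists for the parsimony variant without executing anything, which makes it easy to inspect them before running.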

We benchmarked different evolutionary placement algorithms for inserting a sequence into a reference tree. Note that rebuilding the tree from scratch is in most cases computationally too expensive to be offered as a web service. We tested the RAxML-EPA algorithm (2), with both the maximum likelihood (ML) and the parsimony (P) option, and Pagan (3) on a subset of TreeFam families, using the gene tree built with TreeBest as the reference. We were interested in accuracy and run time, so we looked at the average scaled Robinson-Foulds (RF) distance (4) to the reference tree and at the run time of each program. Here is a summary of our results:

Program           Avg. RF distance   Run time (seconds)
Pagan             0.267              84.502
RAxML-EPA – ML    0.295              66.178
RAxML-EPA – P     0.206              0.164

While trees produced by RAxML-EPA's ML option were closest to the ones generated by TreeBest, the ones produced by its parsimony option were by far the fastest to finish. We therefore decided to offer both RAxML-EPA options, so that users can choose their own trade-off between speed and accuracy.
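For readers unfamiliar with the accuracy measure used in the benchmark, here is a small sketch of a Robinson-Foulds distance between two trees on the same leaves. Trees are written as nested tuples of leaf names, which is a representation chosen only for this example. The scaling used here (dividing by the total number of non-trivial splits in both trees) is an assumption; the post does not state which normalisation was used.

```python
def splits(tree):
    """Return the non-trivial bipartitions (splits) of an unrooted tree.

    `tree` is a nested tuple of leaf names, e.g. (("A", "B"), ("C", "D")).
    """
    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        return frozenset().union(*(leaves(child) for child in node))

    all_leaves = leaves(tree)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        other = all_leaves - below
        # Keep only splits with at least two leaves on each side.
        if len(below) > 1 and len(other) > 1:
            found.add(frozenset([below, other]))
        return below

    visit(tree)
    return found


def rf_distance(tree_a, tree_b):
    """Number of splits present in exactly one of the two trees."""
    return len(splits(tree_a) ^ splits(tree_b))


def scaled_rf_distance(tree_a, tree_b):
    """RF distance divided by its maximum for these two trees (0 to 1)."""
    total = len(splits(tree_a)) + len(splits(tree_b))
    return rf_distance(tree_a, tree_b) / total if total else 0.0


reference = ((("A", "B"), "C"), ("D", "E"))
placed = ((("A", "C"), "B"), ("D", "E"))
print(rf_distance(reference, placed))         # 2
print(scaled_rf_distance(reference, placed))  # 0.5
```

A value of 0 means the two trees contain exactly the same splits, and 1 means they share none. Because only splits are compared, the position of the root does not affect the result.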

Upon completion of the orthology-on-the-fly search, TreeFam provides a summary page showing the best-matching TreeFam family and a gene tree that includes the newly added sequence. Because all our gene trees carry protein domain annotations, we also search the Pfam database and report Pfam protein domain matches on the new sequence. This makes it easy to spot differences between the domain architecture of the new sequence and that of the members of the gene family. Here is an example screenshot of the search result page:

Screenshot of a search result page, showing the run times for the different searches at the top, the hits in Pfam and TreeFam in the middle, and the sequence inserted into the gene tree at the bottom.

Go to our HMMER search page, paste your sequence and see where in our gene trees it will be placed!

Your TreeFam team,

Fabian and Mateus

(1) Katoh,K. and Standley,D.M. (2013) MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol, 30, 772–780, doi:10.1093/molbev/mst010.

(2) Berger,S.A., Krompass,D. and Stamatakis,A. (2011) Performance, accuracy, and Web server for evolutionary placement of short sequence reads under maximum likelihood. Syst Biol, 60, 291–302.

(3) Löytynoja,A., Vilella,A.J. and Goldman,N. (2012) Accurate extension of multiple sequence alignments using a phylogeny-aware graph algorithm. Bioinformatics, 28, 1684–1691.

(4) Robinson,D.F. and Foulds,L.R. (1981) Comparison of phylogenetic trees. Math Biosci, 53, 131–147.


One Response to “TreeFam: new Orthology-on-the-fly feature”

  1. Cool feature! I’m currently getting “alignment: analysis failed, try again or contact authors” for glyceraldehyde-3-phosphate dehydrogenase from yeast (YGR192C
