The identification of orthologs in related organisms is a routine task, and many databases and tools are available for it. Some of these databases can be installed locally, but that is not ideal when the goal is to find orthologs for only one or a few genes. To fill this gap, we developed a quick orthology-on-the-fly prediction tool that is built on top of the HMMER search we introduced in release 9 and can be used here: www.treefam.org.
On our HMMER search page, users can paste a protein sequence and tick the box “insert into tree”. In addition to reporting hits in TreeFam (and Pfam protein domain hits), we then align the user-supplied sequence to the best-matching TreeFam family using MAFFT (1) and add the sequence to the corresponding gene tree, using the family gene tree as the reference for the RAxML-EPA algorithm (2).
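The align-then-place step described above can be sketched roughly as follows. This is only an illustration, not the code running on the TreeFam server: the function names, file names, and the `PROTGAMMAWAG` model are assumptions, as is the exact choice of options (MAFFT's `--add`/`--keeplength` and RAxML's `-f v` for likelihood-based and `-f y` for parsimony-based placement). It also assumes that `mafft` and `raxmlHPC` are installed and that the best-matching family has already been identified by the HMMER search.

```python
import subprocess


def build_commands(query_fasta, family_alignment, family_tree, combined_alignment,
                   parsimony=False, run_name="placement"):
    """Build the MAFFT and RAxML command lines (all names here are hypothetical)."""
    # Add the query to the existing family alignment without changing its length.
    mafft_cmd = ["mafft", "--add", query_fasta, "--keeplength", family_alignment]
    # Place the query into the fixed reference tree:
    # "-f v" = likelihood-based placement, "-f y" = parsimony-based placement.
    raxml_cmd = [
        "raxmlHPC",
        "-f", "y" if parsimony else "v",
        "-s", combined_alignment,   # alignment containing reference + query
        "-t", family_tree,          # reference gene tree of the family
        "-m", "PROTGAMMAWAG",       # assumed protein model, not TreeFam's documented choice
        "-p", "12345",              # random seed (arbitrary value)
        "-n", run_name,
    ]
    return mafft_cmd, raxml_cmd


def run_placement(query_fasta, family_alignment, family_tree, combined_alignment,
                  parsimony=False):
    """Run both steps; RAxML writes its output files to the current directory."""
    mafft_cmd, raxml_cmd = build_commands(
        query_fasta, family_alignment, family_tree, combined_alignment, parsimony)
    # MAFFT prints the combined alignment to stdout, so capture it in a file.
    with open(combined_alignment, "w") as handle:
        subprocess.run(mafft_cmd, stdout=handle, check=True)
    subprocess.run(raxml_cmd, check=True)
```

Calling `run_placement("query.fa", "family.aln.fa", "family.nwk", "combined.fa", parsimony=True)` would use the faster parsimony-based placement; leaving `parsimony` at its default uses the likelihood-based one.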
We benchmarked different evolutionary placement algorithms for inserting a sequence into a reference tree. Note that rebuilding the tree is in most cases computationally too expensive to be offered as a web service. We tested the RAxML-EPA algorithm (both the maximum likelihood (ML) and parsimony (P) options (2)) and Pagan (3) on a subset of TreeFam families and used the gene trees built with TreeBest as references. We were interested in accuracy and run time, so we looked at the average scaled Robinson-Foulds distance (4) to the TreeBest tree and at the run time. Here is a summary of our results:
| Program | Avg. scaled RF distance | Run time (seconds) |
|---|---|---|
| RAxML-EPA – ML | 0.295 | 66.178 |
| RAxML-EPA – P | 0.206 | 0.164 |
While trees produced by the RAxML-EPA ML option were closest to the ones generated by TreeBest, the ones produced by the RAxML-EPA parsimony option were by far the fastest to finish. We therefore decided to offer both RAxML-EPA options, providing a good trade-off between speed and accuracy.
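For readers unfamiliar with the accuracy measure used in the benchmark, here is a minimal sketch of a scaled Robinson-Foulds distance between two trees on the same set of leaves. It is not the code we used for the benchmark: the nested-tuple tree representation and the function names are assumptions made for illustration, and the scaling by 2(n − 3), the maximum distance between two fully resolved unrooted trees with n leaves, is one common normalisation.

```python
def leaf_sets(tree, clades):
    """Return the leaf names under `tree`, appending every internal clade to `clades`."""
    if isinstance(tree, str):          # a leaf is represented by its name
        return frozenset([tree])
    leaves = frozenset().union(*(leaf_sets(child, clades) for child in tree))
    clades.append(leaves)
    return leaves


def splits(tree):
    """Return (all leaves, non-trivial bipartitions) for the unrooted view of `tree`."""
    clades = []
    all_leaves = leaf_sets(tree, clades)
    anchor = min(all_leaves)           # fixed leaf used to pick a canonical side
    result = set()
    for clade in clades:
        # Always store the side of the split that does NOT contain the anchor.
        side = all_leaves - clade if anchor in clade else clade
        # Skip trivial splits (a single leaf against all the others).
        if 2 <= len(side) <= len(all_leaves) - 2:
            result.add(side)
    return all_leaves, result


def scaled_rf(tree_a, tree_b):
    """Robinson-Foulds distance divided by its maximum for binary trees, 2 * (n - 3)."""
    leaves_a, splits_a = splits(tree_a)
    leaves_b, splits_b = splits(tree_b)
    if leaves_a != leaves_b:
        raise ValueError("both trees must contain the same leaves")
    n = len(leaves_a)
    if n < 4:                          # fewer than 4 leaves: only one topology exists
        return 0.0
    return len(splits_a ^ splits_b) / (2 * (n - 3))


# Toy example with made-up gene names: gene C is attached at a different position.
reference = ((("A", "B"), "C"), ("D", "E"))
placed = ((("A", "C"), "B"), ("D", "E"))
print(scaled_rf(reference, placed))    # 0.5
```

A value of 0 means the two trees have identical topologies, and a value of 1 means that two fully resolved trees share no internal split.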
Upon completion of the orthology-on-the-fly search, TreeFam provides a summary page showing the best-matching TreeFam family and a gene tree that includes the newly added sequence. Since all our gene trees carry protein domain annotations, we also search the Pfam database and report Pfam protein domain matches on the query sequence, making it easy to spot differences in domain architecture between the new sequence and the members of the gene family. Here is an example screenshot of the search result page.
Go to http://www.treefam.org, paste your sequence, and see where in our gene trees it is placed!
Your TreeFam team,
Fabian and Mateus
(1) Katoh,K. and Standley,D.M. (2013) MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol, 10.1093/molbev/mst010.
(2) Berger,S.A., Krompass,D. and Stamatakis,A. (2011) Performance, accuracy, and Web server for evolutionary placement of short sequence reads under maximum likelihood. Syst Biol, 60, 291–302.
(3) Loytynoja,A., Vilella,A.J. and Goldman,N. (2012) Accurate extension of multiple sequence alignments using a phylogeny-aware graph algorithm. Bioinformatics, 28, 1684–1691.
(4) Robinson,D.F. and Foulds,L.R. (1981) Comparison of phylogenetic trees. Math Biosci, 53, 131–147.