Visualising & exploring TreeFam gene families

February 19, 2014

The latest TreeFam release 9 has 15,736 gene families. These families vary significantly in size (number of family members), conservation (alignment conservation) and taxonomic diversity (younger families that are only found in e.g. Vertebrates vs. older ones that were present in the last common ancestor of Metazoa).

Visualising & exploring gene families

We have always wanted to find a way to visualise our families according to the above-mentioned criteria.
Wouldn’t it be nice if you could easily see all highly conserved families or all families with >= 400 genes?

How to do that technically?

Several interesting JavaScript libraries have been developed recently, namely D3, dc.js and crossfilter.
D3 is the library we use to provide interactive trees (check here for the source code). Basically, D3 allows you to bind your data to SVG elements. This could be a bar chart; for example, the following bar chart shows the distribution of alignment conservation of all TreeFam families.
[Figure: distribution of alignment conservation across all TreeFam families]
Coming back to our goal of visualising our gene families, let’s say you want a bar chart for each of the above-mentioned categories (family size, alignment conservation, taxonomic origin, etc.). Using D3 you can do that, and it would probably look nice (check here for a tutorial on how to build bar charts or click here for other tutorials). However, the resulting visualisation is rather static.
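To make the idea of binding data to SVG elements concrete, here is a minimal sketch in plain Python (not D3, and not the code behind our charts). It turns a dictionary of counts into one SVG rect per data item. The function name bar_chart_svg, its parameters and the example bins are assumptions made up for illustration.

```python
import xml.etree.ElementTree as ET


def bar_chart_svg(counts, bar_width=30, height=100):
    """Render {label: count} as an SVG bar chart, one <rect> per data item."""
    svg = ET.Element(
        "svg",
        xmlns="http://www.w3.org/2000/svg",
        width=str(bar_width * len(counts)),
        height=str(height),
    )
    # Scale bars so that the largest count fills the full chart height
    tallest = max(counts.values(), default=0) or 1
    for i, (label, count) in enumerate(counts.items()):
        bar_height = round(height * count / tallest)
        rect = ET.SubElement(
            svg,
            "rect",
            x=str(i * bar_width),
            y=str(height - bar_height),  # SVG y grows downwards
            width=str(bar_width - 2),    # leave a 2px gap between bars
            height=str(bar_height),
        )
        # A <title> child shows up as a tooltip in most browsers
        ET.SubElement(rect, "title").text = f"{label}: {count}"
    return ET.tostring(svg, encoding="unicode")


# Hypothetical conservation bins, purely illustrative
example_svg = bar_chart_svg({"0-20": 1, "20-40": 3})
```

D3 does the same kind of mapping in the browser, and additionally keeps the elements in sync when the data changes.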

What about interactivity?

Ideally, you want to link the different charts so that selecting a subset of families with the mouse in one chart filters the data presented in all of the other charts on the page. Fortunately, the people behind dc.js have implemented this. Best of all, it is really easy to use: you don’t even have to know how to plot bar charts yourself, because dc does it for you (see the dc wiki if you would like to learn more about dc).
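The linking idea can be sketched without any charting library. The following plain-Python example is only an illustration of the principle, not the dc.js or crossfilter API: a range selected on one chart filters the records that feed every other chart, while the chart carrying the selection keeps showing all of its bars. The records, field names and helper functions are hypothetical.

```python
from collections import Counter

# Hypothetical family records; ids, fields and values are made up for illustration
FAMILIES = [
    {"id": "TF1", "size": 120, "conservation": 91, "root": "Vertebrata"},
    {"id": "TF2", "size": 450, "conservation": 62, "root": "Metazoa"},
    {"id": "TF3", "size": 80, "conservation": 88, "root": "Metazoa"},
    {"id": "TF4", "size": 510, "conservation": 45, "root": "Vertebrata"},
]


def apply_filters(records, filters):
    """Keep records whose value lies in [low, high] for every filtered field."""
    return [
        r for r in records
        if all(low <= r[field] <= high for field, (low, high) in filters.items())
    ]


def histogram(records, field, bin_width=None):
    """Count records per bin (numeric field) or per category (bin_width=None)."""
    if bin_width is None:
        return Counter(r[field] for r in records)
    return Counter((r[field] // bin_width) * bin_width for r in records)


def linked_histograms(records, filters, charts):
    """Recompute every chart, applying all selections except the chart's own."""
    result = {}
    for field, bin_width in charts.items():
        other_filters = {f: rng for f, rng in filters.items() if f != field}
        result[field] = histogram(apply_filters(records, other_filters), field, bin_width)
    return result


# "Brush" the size chart to select families with at least 400 genes
charts = {"size": 100, "conservation": 10, "root": None}
linked = linked_histograms(FAMILIES, {"size": (400, float("inf"))}, charts)
```

After the selection, the conservation and taxonomic root charts only count the two large families, whereas the size chart still counts all four, so the user can see what was selected and move the selection.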

D3 + dc + TreeFam gene families

So, we have used the combination of D3 and dc.js to visualise our families, and a prototype can be seen on our dev site (see the following picture for an example).

Overview of TreeFam families. The different charts show alignment conservation, number of gene members, taxonomic root, as well as presence of genes from model organisms.

What you can do: The visualisation should be self-explanatory and will allow you to answer simple queries, e.g.:

  1. How many Vertebrate families are there?
  2. Show me all families with ~1 gene/species
  3. Which are the highly-conserved families (alignment conservation >= 85%)?

But also more complicated ones, e.g.

  1. How many eukaryotic families are highly conserved, have at least one human gene and more than one annotated Pfam family?

We see this visualisation workbench as a proof-of-concept and plan to expand it in the future. The code is available on GitHub, so feel free to get a copy and use it with your own data. Let us know what you think and whether you would like to see additional information charted.

Posted by Fabian


Short-term Pfam position available.

February 7, 2014

We have just advertised a 9-month maternity cover position in Pfam. We are looking for a skilled Bioinformatician to help us take Pfam into its next phase of development as we become more integrated into the European Bioinformatics Institute (EMBL-EBI).

Essential knowledge, skills and experience:

  • Degree in Science with relevant experience
  • Computer literacy (unix experience)
  • Programming skills in Perl, including OO Perl
  • Familiarity with writing production software
  • MySQL, or similar, expertise
  • Experience working with biological sequence data
  • Good communication skills

See all the details on the EBI jobs page.


Join Rfam, see the world

January 31, 2014

Rfam is recruiting! We are currently recruiting an RNA informatician to join our team. We’re looking for someone really enthusiastic about RNA and who’s interested in working with Rfam as we move to genome-based alignments and explore new technologies for the database and website.

If this is you, why not apply to join us as a Senior Bioinformatician?


We’ve moved, now the websites

January 30, 2014

In November 2012, we announced that the Xfam groups were moving the few tens of metres from the Wellcome Trust Sanger Institute to the European Bioinformatics Institute. We warned you then that the websites would also eventually move. Read the rest of this entry »


A version of pfam_scan.pl for HMMER 3.1b

October 15, 2013

We’ve had a lot of questions from users recently, wondering why our pfam_scan.pl script doesn’t work with the latest release of the HMMER package, version 3.1b. This is a quick post to explain why that is, and what we’ve done about it. Read the rest of this entry »


TreeFam: new Orthology-on-the-fly feature

September 17, 2013

The identification of orthologs in related organisms is a routine task and many databases/tools are available to do that. Some of the databases can be installed locally, which is not ideal in cases where the target is to find orthologs for a single/few genes only. To fill this gap, we developed a quick orthology-on-the-fly prediction tool that is built on top of the HMMER search we introduced in release 9 and can be used here: www.treefam.org.

On our HMMER search page, users can paste a protein sequence and tick the box “insert into tree”. In addition to reporting hits in TreeFam (and in Pfam, which we search for protein domain hits), we then align the user-supplied sequence to the best-matching TreeFam family using MAFFT (1) and add the sequence to the corresponding gene tree, using the family gene tree as a reference for the RAxML-EPA algorithm (2).

We benchmarked different evolutionary placement algorithms for inserting a sequence into a reference tree. Note that rebuilding the tree is in most cases computationally too complex to be offered as a web service. We tested the RAxML-EPA algorithm (both the maximum likelihood (ML) and parsimony (P) options, (2)) and Pagan (3) on a subset of TreeFam families and used the gene tree built with TreeBest as a reference. We were interested in accuracy and run time, so we recorded the average scaled Robinson-Foulds distance (4) and the run time for each program. Here is a summary of our results:

Program           Avg. RF distance   Run time (seconds)
Pagan             0.267              84.502
RAxML-EPA – ML    0.295              66.178
RAxML-EPA – P     0.206              0.164

While trees produced by RAxML-ML were closest to the ones generated by TreeBest, the ones produced by RAxML’s parsimony option were by far the fastest to finish. We therefore decided to offer both RAxML options, providing a good trade-off between speed and accuracy.
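For readers unfamiliar with the Robinson-Foulds distance (4) used as the accuracy measure, here is a minimal Python sketch. It is not the code used for the benchmark. It assumes unrooted binary trees over the same set of leaves, written as nested tuples of leaf names, and it assumes that "scaled" means dividing by the maximum possible distance, 2(n - 3) for n leaves. The function names are made up for illustration.

```python
def leaf_set(tree):
    """Return the set of leaf names below a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def splits(tree):
    """Collect the non-trivial bipartitions (splits) induced by internal edges."""
    all_leaves = leaf_set(tree)
    anchor = min(all_leaves)  # fixed leaf used to give each split one canonical side
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        # Skip trivial splits: a single leaf on either side, or the whole tree
        if 1 < len(below) < len(all_leaves) - 1:
            found.add(all_leaves - below if anchor in below else below)
        return below

    visit(tree)
    return found


def scaled_rf(tree_a, tree_b):
    """Robinson-Foulds distance divided by its maximum, 2 * (n - 3)."""
    leaves = leaf_set(tree_a)
    if leaves != leaf_set(tree_b):
        raise ValueError("both trees must contain exactly the same leaves")
    if len(leaves) < 4:
        raise ValueError("at least four leaves are needed")
    differing = splits(tree_a) ^ splits(tree_b)  # splits found in only one tree
    return len(differing) / (2 * (len(leaves) - 3))


# Two toy trees that disagree on the placement of leaves B and C
reference = ((("A", "B"), "C"), ("D", "E"))
placed = ((("A", "C"), "B"), ("D", "E"))
distance = scaled_rf(reference, placed)
```

A value of 0 means that both trees contain exactly the same splits, and a value of 1 means that they share none.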

Upon completion of the orthology-on-the-fly search, TreeFam provides a summary page of the best matching TreeFam family and a gene tree with the newly added sequence. Given that all our gene trees have protein domain annotations, we search the Pfam database to report Pfam protein domain matches on the sequence, making it easy to spot differences in the domain architecture of the new sequence and members of the gene family. Here is an example screenshot of the search result page.

Example of a search result page showing the run times for the different searches at the top, the hits in Pfam and TreeFam in the middle and the sequence inserted into the gene tree at the bottom.

Go to http://www.treefam.org, paste your sequence and see where in our gene trees it will be placed!

Your TreeFam team,

Fabian and Mateus

(1) Katoh,K. and Standley,D.M. (2013) MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol, 10.1093/molbev/mst010.

(2) Berger,S.A., Krompass,D. and Stamatakis,A. (2011) Performance, accuracy, and Web server for evolutionary placement of short sequence reads under maximum likelihood. Syst Biol, 60, 291–302.

(3) Loytynoja,A., Vilella,A.J. and Goldman,N. (2012) Accurate extension of multiple sequence alignments using a phylogeny-aware graph algorithm. Bioinformatics, 28, 1684–1691.

(4) Robinson,D.F. and Foulds,L.R. (1981) Comparison of phylogenetic trees. Mathematical Biosciences, 53, 131–147.


Dfam 1.2 released

May 31, 2013

We are pleased to announce that we’ve released Dfam 1.2. This version represents a few important changes from 1.1, including increased sensitivity for many families, a new plot on the model page, and an improved Relationships tab.

Read the rest of this entry »


Case studies from the list of human regions not in Pfam 27.0.

May 14, 2013

Following on from Jaina and Marco’s blog post last week about conserved Human regions not in Pfam, I would like to give you some examples of how we have used the regions identified to improve existing Pfam families, and to create new ones. When available, we use three-dimensional structures to guide the boundary definitions of our families. In cases where there is no available structure, either for the protein in question or for other proteins in the same Pfam family, we base boundary decisions on sequence conservation. The following paragraphs give three examples of cases I have looked at recently.

Read the rest of this entry »


Pfam targets conserved human regions

May 7, 2013

Recently, we have been looking at how much of the human proteome is covered by Pfam (release 27.0), and ways in which we can improve this coverage. We have even written an open access paper about it that you can read here [1], which is part of the proceedings of the 2013 Biocuration conference. We used the human proteins in UniProtKB/Swiss-Prot [2] (~20,000 sequences) as our human proteome set, and found that while most of the sequences in this set have some Pfam annotation (90% have at least one Pfam domain), there is still much ground to cover before we have a complete map of all (conserved) human regions (HRs). Here, rather than repeating what we presented in the paper (did we mention it is open access? :-)), we would like to tell you more about the impact this study is having on our strategies for selecting target regions to be added to Pfam.

Read the rest of this entry »


TreeFam 9 is now available!

May 3, 2013

We are happy to announce that TreeFam 9 is online and you can find it at http://www.treefam.org.

TreeFam 9 now has 109 species (vs. 79 in TreeFam 8) and is based on data from Ensembl v69, Ensembl Genomes v16, Wormbase and JGI.

This release marks an important step for TreeFam as it is the first release built since TreeFam was resurrected.
Here is a list of the most important changes in TreeFam 9:

  • New website layout (adopting the Pfam/Rfam/Dfam layout)
  • Infrastructure move of web servers and databases to the EBI
  • Sequence search against the library of TreeFam family profiles
  • New tree visualisations in pure JavaScript using D3, e.g. see the BRCA2 gene tree here.
  • Pairwise homology download

We hope you find all the information you are looking for. If you don’t, please let us know so that we can include the information you want. The old website will remain online here.

If you have questions, suggestions or find bugs, don’t hesitate to contact us through our new forum here.

Happy treefamming,

the TreeFam team
(Fabian, Mateus)

