The latest TreeFam release 9 has 15,736 gene families. These families vary significantly in size (number of family members), conservation (alignment conservation) and taxonomic diversity (younger families that are only found in e.g. Vertebrates vs. older ones that were present in the last common ancestor of Metazoa).
Visualising & exploring gene families
We have always wanted a way to visualise our families according to the criteria above.
Wouldn’t it be nice if you could easily see all highly conserved families or all families with >= 400 genes?
How to do that technically?
D3 is the library we use to provide interactive trees (check here for the source code). Basically, D3 allows you to bind your data to SVG elements. The result could be a bar chart – for example, the following bar chart shows the distribution of alignment conservation across all TreeFam families.
Coming back to our goal of visualising our gene families: let’s say you want a bar chart for each of the above-mentioned categories (family size, alignment conservation, taxonomic origin, etc.). Using D3 you can do that, and it would probably look nice (check here for a tutorial on how to build bar charts or click here for other tutorials). However, the resulting visualisation is rather static.
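The data behind such a bar chart is just a count per bin. Here is a minimal sketch in plain JavaScript, assuming a hypothetical array of family records with a `conservation` field given in percent. The ids, values, bin width and the `binCounts` helper are all made up for illustration and are not TreeFam's actual data model. With D3, each resulting bin would then be bound to one SVG rectangle.

```javascript
// Hypothetical family records; ids and values are invented for illustration.
const families = [
  { id: "famA", conservation: 12 },
  { id: "famB", conservation: 47 },
  { id: "famC", conservation: 88 },
  { id: "famD", conservation: 91 },
  { id: "famE", conservation: 45 },
];

// Count how many families fall into each bin of width `binWidth` (percent).
function binCounts(records, binWidth) {
  const nBins = Math.ceil(100 / binWidth);
  const counts = new Array(nBins).fill(0);
  for (const r of records) {
    // Clamp so that a value of exactly 100 lands in the last bin.
    const i = Math.min(Math.floor(r.conservation / binWidth), nBins - 1);
    counts[i] += 1;
  }
  return counts;
}

const counts = binCounts(families, 20);
// One bar per bin; the bar height is the count.
counts.forEach((c, i) => console.log(`${i * 20}-${(i + 1) * 20}%: ${c}`));
```

The same helper could be reused for family size or any other numeric property by swapping the field it reads.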
What about interactivity?
Ok, ideally you want to link the different charts so that you can look at a subset of families by simply using the mouse to select a range in one of the charts, and have that selection act as a filter on the data presented in all of the other charts on the page. Fortunately, the people behind dc.js have implemented this. Best of all, it is really easy to use: you don’t even have to know how to plot bar charts yourself, because dc does it for you (see the dc wiki if you are interested in learning more about dc).
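To make the idea of linked charts concrete, here is a rough sketch of the filtering logic in plain JavaScript. It does not use dc.js and is not how dc.js is implemented; it only illustrates the principle that a selection in one chart restricts what every other chart displays. The record fields, the `filters` object and the `visibleTo` helper are hypothetical.

```javascript
// Hypothetical records; field names and values are illustrative only.
const records = [
  { id: "famA", size: 450, origin: "Vertebrata" },
  { id: "famB", size: 30, origin: "Metazoa" },
  { id: "famC", size: 520, origin: "Metazoa" },
  { id: "famD", size: 12, origin: "Vertebrata" },
];

// One active filter per chart; null means "no selection".
const filters = { size: null, origin: null };

// Records visible to a chart: those passing every filter set on the OTHER charts.
function visibleTo(chart) {
  return records.filter((r) =>
    Object.entries(filters).every(
      ([name, pred]) => name === chart || pred === null || pred(r)
    )
  );
}

// Simulate selecting families with >= 400 genes in the size chart.
filters.size = (r) => r.size >= 400;

// The origin chart now only counts the large families.
const byOrigin = {};
for (const r of visibleTo("origin")) {
  byOrigin[r.origin] = (byOrigin[r.origin] || 0) + 1;
}
console.log(byOrigin);
```

Note that a chart ignores its own selection here, so the size chart keeps showing the full distribution while its selected range narrows the others.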
D3 + dc + TreeFam gene families
So, we have used the combination of D3 and dc.js to visualise our families, and a prototype can be seen on our dev site (see the following picture for an example).
What you can do: The visualisation should be self-explanatory and will allow you to answer simple queries, e.g.:
- How many Vertebrate families are there?
- Show me all families with ~1 gene/species
- Which are the highly-conserved families (alignment conservation >= 85%)?
But also more complicated ones, e.g.:
- How many eukaryotic families are highly conserved, have at least one human gene and more than one annotated Pfam family?
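A compound query like this one amounts to combining several independent filters. As a sketch only, assuming hypothetical record fields (`origin`, `conservation`, `humanGenes`, `pfamFamilies`) and invented values, it could be expressed in plain JavaScript as follows; in the workbench itself the same effect comes from making one selection per chart.

```javascript
// Hypothetical records; all fields and values are invented for illustration.
const catalogue = [
  { id: "famA", origin: "Eukaryota", conservation: 90, humanGenes: 2, pfamFamilies: 3 },
  { id: "famB", origin: "Eukaryota", conservation: 92, humanGenes: 0, pfamFamilies: 2 },
  { id: "famC", origin: "Vertebrata", conservation: 95, humanGenes: 1, pfamFamilies: 4 },
  { id: "famD", origin: "Eukaryota", conservation: 60, humanGenes: 1, pfamFamilies: 2 },
  { id: "famE", origin: "Eukaryota", conservation: 85, humanGenes: 1, pfamFamilies: 2 },
];

// One predicate per criterion in the query.
const criteria = [
  (f) => f.origin === "Eukaryota", // eukaryotic families
  (f) => f.conservation >= 85, // highly conserved
  (f) => f.humanGenes >= 1, // at least one human gene
  (f) => f.pfamFamilies > 1, // more than one annotated Pfam family
];

// A family is a hit only if it satisfies every criterion.
const hits = catalogue.filter((f) => criteria.every((test) => test(f)));
console.log(hits.map((f) => f.id));
```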
We see this visualisation workbench as a proof of concept and plan to expand it in the future. The code is available on GitHub, so feel free to get a copy and use it with your own data. Let us know what you think and whether you would like to see additional information charted.
Posted by Fabian