The current Pfam release, version 26.0, took approximately 4 months to nurse through the various stages of updating the sequence database, resolving overlaps between families, rebuilding the MySQL database and performing all of the post-processing that constitutes the ‘release’. The production team strives to make two releases a year, but I really do not fancy spending two thirds of a year on Pfam releases. Thus, with my colleagues, I have been reviewing what we do and why we do it and, probably more importantly, assessing how heavily the different sections of the Web site are used. Below is a list of changes that are going to happen in the next release, release 27.0.
Before I dive in and describe the changes, let me first point out some of the principles behind the decision making. First and foremost, our core aim is to curate high quality protein families that are represented by alignments (termed seed and full) and a profile HMM. This is not going to change! The ultimate goal of this review is to get new and improved families into the public domain in a more timely fashion. Pfam receives contributions from around the world, with some power users contributing directly into the databases. This, combined with the fact that Pfam-UK has more of an emphasis on curation, means that the rate of new family generation and of improvements to existing families is on the increase. With more families and more sequences, the current release pipeline is once again creaking. The philosophy of Pfam is to treat every family in the same way, so what we calculate for some small 10 member DUF (domain of unknown function), we also calculate for the largest families. For example, the ABC_tran family contains over 230,000 members and this number is likely to exceed 300,000 at the next release when the underlying sequence database is updated. So what is going to change?
Probably the most significant step is that the neighbour-joining (NJ) trees are no longer going to be calculated on the full alignments. These trees are being dropped because, for the big families, they simply take too long and too much memory to produce, even using FastTree (note that it is not just FastTree consuming time and memory, but also other components in our tree building/post-processing pipeline). In addition to producing the trees, we use each tree to reorder the sequences in the alignment and update all of the rows in the database with the tree-order position. Further adding to the burden, we also produced trees for the full alignments generated from searches against the NR and metagenomic sequence databases. Thus, the biggest 20 or so Pfam families were taking approximately a day each to produce the trees and post-processing for each dataset, and with only a handful of machines accessible that could cope with the memory constraints, this was effectively a linear process. We thought about reducing the number of replicates to save time, but came to the conclusion that if you really care about family phylogeny then you probably want to use a more accurate tree construction method than FastTree, probably using a carefully defined non-redundant subset of sequences. Also, simply reducing the number of replicates would not reduce the memory overheads. We will continue to produce NJ-trees for the seed alignments, which contain a representative set of sequences of the family.
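To illustrate the tree-ordering step mentioned above, here is a minimal sketch in Python. It is not the Pfam pipeline code: the function names are hypothetical, the tree is assumed to be a simple Newick string without quoted labels, and the alignment is assumed to be a list of (name, aligned sequence) pairs.

```python
import re


def leaf_order(newick):
    """Return leaf names in the order they appear in a Newick string.

    Assumption: labels are unquoted. A leaf label directly follows "(" or ",",
    whereas internal node labels and support values follow ")".
    """
    found = re.findall(r"[(,]\s*([^(),:;]+)", newick)
    return [name.strip() for name in found if name.strip()]


def reorder_alignment(alignment, newick):
    """Sort (name, aligned_sequence) pairs by their tree-order position."""
    position = {name: i for i, name in enumerate(leaf_order(newick))}
    # Sequences that are missing from the tree go to the end, in original order.
    return sorted(alignment, key=lambda row: position.get(row[0], len(position)))
```

The position dictionary is the equivalent of the tree-order value stored per row in the database. For a family with hundreds of thousands of members, every one of those rows has to be updated, which is part of the cost described above.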
Matches to the NR sequence database
For the past 4 years we have provided match data against the NCBI NR protein database. As well as being produced in flatfile and web formats, these data have also been included in the Pfam MySQL database. Analysis of our webstats shows that, although these data are accessed often enough that we would not consider dropping them, it is primarily the alignments that are being accessed rather than individual sequences. In the next release, the same flatfile and web access will be provided for Pfam NR matches, but these data will no longer be loaded into our MySQL database. This will offer significant gains, approximately halving the amount of database loading that we have to perform for each release. If things work out as I expect, only users who download the database will notice any difference. To offer some halfway ground to our MySQL users, a TSV flat file containing matches to NR will also be made available, which could be easily incorporated into the Pfam MySQL schema.
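As a rough sketch of how such a TSV file might be pulled into a local relational table, here is a small Python example. Everything specific in it is an assumption made for illustration: the column layout, the table name nr_matches and the sample rows are all made up, and SQLite stands in for MySQL so that the example is self-contained.

```python
import csv
import io
import sqlite3


def load_nr_matches(conn, tsv_handle):
    """Create a matches table and bulk-insert rows from a tab-separated file.

    Assumed (hypothetical) columns: sequence accession, Pfam accession,
    match start, match end, bit score. Lines starting with "#" are skipped.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS nr_matches ("
        "seq_acc TEXT, pfam_acc TEXT, seq_start INTEGER, "
        "seq_end INTEGER, bit_score REAL)"
    )
    reader = csv.reader(tsv_handle, delimiter="\t")
    rows = [tuple(row) for row in reader if row and not row[0].startswith("#")]
    conn.executemany("INSERT INTO nr_matches VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


if __name__ == "__main__":
    # Made-up example rows, not real Pfam match data.
    sample = "# seq\tfamily\tstart\tend\tscore\nSEQ_0001\tPF00005\t10\t150\t87.5\n"
    connection = sqlite3.connect(":memory:")
    print(load_nr_matches(connection, io.StringIO(sample)), "row(s) loaded")
```

With a real MySQL installation the same idea would more likely be done with the server's own bulk loader, but the shape of the task is the same: one table, one pass over the file.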
Matches to Metagenomic sequences
The metagenomics sequence collection in Pfam is a bit of an ad hoc collection of sequences from various projects. Unfortunately, there is little provenance information for these sequences, and the collection was only ever planned as an interim solution until the major sequence databases established dedicated metagenomic collections. However, as it turns out, most metagenomics projects rarely get to the point of predicting protein coding regions. Thus, our initial expectations of having a metagenomics section in Pfam have not really materialized. The number of web accesses to our current data is very, very low. So, in the next release of Pfam, matches against metagenomics sequences will no longer be calculated. If you really want access to this sort of data, then I would suggest that you search a Pfam model against the metagenomic sequence database using the HMMER web server. This produces more information on metagenomic matches than is currently provided by the Pfam website, but I am biased...
And finally, the last notable change will probably be more of an internal change, but we are no longer going to produce the HTML view for sequence alignments that exceed 5000 sequences. At the moment, the Pfam website warns users when they try to view alignments over (as far as I recall) 2000 sequences. That is because these alignments contain many HTML “<div>” elements to colour residues in each aligned sequence. In fact, for the larger alignments there are tens to hundreds of thousands of them, and at these heady heights web browsers start grinding to a halt, as the rendering engines are simply not designed to cope with so many elements in a single page. As it is a struggle to visualize these alignments and even harder to interpret them, we are not going to produce the larger versions. In practice, the threshold means that this rule will only apply to the full alignments. This is only a view of the data, and the full alignments will still be available for download or visualization using Jalview.
I hope this posting has made it clearer as to why these changes are going to be made in the next release. Please feel free to comment or to send comments privately to email@example.com. We will also send this same message out to our mailing lists, so apologies if you receive it multiple times.
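To give a feel for the numbers, here is a small back-of-the-envelope sketch in Python. The 5000 sequence cut-off is the one described above. The function names are hypothetical, and the assumption of one element per coloured residue is a simplification that may not match how the real pages are built.

```python
MAX_HTML_SEQUENCES = 5000  # cut-off described in the post


def should_build_html_view(n_sequences, max_sequences=MAX_HTML_SEQUENCES):
    """Return True if the alignment is small enough to get an HTML view."""
    # "Exceed 5000" means an alignment of exactly 5000 sequences still qualifies.
    return n_sequences <= max_sequences


def estimated_elements(n_sequences, coloured_residues_per_sequence):
    """Rough element count, assuming one element per coloured residue."""
    return n_sequences * coloured_residues_per_sequence
```

Under this simplification, an alignment sitting right at the cut-off with 100 coloured residues per sequence would already need 5000 x 100 = 500,000 elements on a single page.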
Posted by Rob