Google Research Team bring Deep Learning to Pfam

March 24, 2021

We are delighted to announce the first fruits of a collaboration between the Pfam team and a Google Research team led by Dr Lucy Colwell, with Maxwell Bileschi and David Belanger. In 2019, Colwell’s team published a preprint describing a new deep learning method that was trained on Pfam data, and which improves upon the performance of the HMMER software (HMMER is the underlying software used by Pfam). Colwell’s team embraced our initial sceptical feedback and shared data that helped us to understand the new method’s performance. Over time our scepticism turned into interest as we explored novel findings from the method, and now we are very excited by the potential of these methods to improve our ability to classify sequences into domains and families.

Introducing Pfam-N

We are pleased to share a new file called Pfam-N (N for network), which provides additional Pfam 34.0 matches identified by the Google team. Pfam-N annotates 6.8 million protein regions into 11,438 Pfam families. These regions include nearly 1.8 million full-length protein sequences from UniProtKB Reference Proteomes that previously had no Pfam match, an improvement of 4.2% over the currently-annotated 42.5 million. We also note that among the sequences that get their first Pfam annotation, there are 360 human sequences.

The figure above shows the number of matches to UniProtKB Reference Proteomes 2020_06 for each Pfam release over the last decade (orange). Pfam-N (blue) adds nearly 10% more regions to Pfam v34.0, which, based on the current trend, would have taken several years for us to achieve.

How was Pfam-N made?

Deep learning approaches use training examples, much like HMMER, to learn the statistics of what it means for a protein to have a particular function. We use a subset of all the Pfam HMMER matches for training, and provide our deep learning model with both the sequence and Pfam family for each training example. 
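As a rough illustration of what a training example looks like (this is a sketch, not the actual ProtENN input pipeline; sequence and encoding details are invented), each example pairs an encoded domain region with its Pfam family label:

```python
import numpy as np

# The 20 standard amino acids; index each one for one-hot encoding.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot_encode(sequence, max_len=128):
    """Encode an amino-acid sequence as a (max_len, 20) one-hot matrix,
    zero-padded on the right for sequences shorter than max_len."""
    x = np.zeros((max_len, len(AMINO_ACIDS)), dtype=np.float32)
    for i, aa in enumerate(sequence[:max_len]):
        x[i, AA_INDEX[aa]] = 1.0
    return x

# A training example: an encoded domain region plus its Pfam family label.
example = (one_hot_encode("MKVLITGA"), "PF10518")
```

The model then learns a mapping from the encoded sequence to the family label, rather than scoring against a per-family alignment as HMMER does.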

We trained a number of replicates (“ensemble elements”) of a convolutional neural network to predict the Pfam matches. We call this ensemble model ProtENN (ENN for Ensemble of Neural Networks). The method relies on HMMER to initially parse proteins into their constituent domains before giving these regions to ProtENN. 
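The basic idea of ensembling can be sketched in a few lines (a toy illustration, assuming each replicate outputs a per-family probability vector; the actual ProtENN aggregation is described in the paper):

```python
import numpy as np

def ensemble_predict(replicate_probs):
    """Average the per-family probability vectors produced by independently
    trained replicates and return the index of the highest-scoring family."""
    avg = np.mean(replicate_probs, axis=0)
    return int(np.argmax(avg))

# Toy scores from three replicates over four candidate families:
replicates = [
    np.array([0.10, 0.60, 0.20, 0.10]),
    np.array([0.20, 0.50, 0.20, 0.10]),
    np.array([0.15, 0.55, 0.20, 0.10]),
]
winner = ensemble_predict(replicates)
```

Averaging over replicates smooths out the idiosyncrasies of any single trained network, which typically makes the ensemble's calls more reliable than those of its individual members.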

The Pfam-N file is in the standard Pfam Stockholm alignment format, and the ProtENN matches are aligned using the existing Pfam profile-HMM model. We only include a match in Pfam-N if it is not already included in Pfam.
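For readers unfamiliar with the format, a Stockholm alignment file looks roughly like this (the accession and sequences below are invented for illustration):

```
# STOCKHOLM 1.0
#=GF ID   Example_fam
#=GF AC   PF99999
Q9XYZ1/5-28     MKVLITGA..GSGIGLAIAKRFLE
A0A001/12-35    MKVLVTGASGGSGIGLAIAQRF.E
//
```

Each sequence line gives the sequence name with start-end coordinates, followed by the aligned region, and `//` terminates the alignment.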

It should be noted that the deep learning model has access to the full set of matches for a Pfam family, whereas the Pfam profile-HMM models are trained on the much smaller Pfam seed alignments. Thus this is not a direct comparison between ProtENN and HMMER. 

Improving Pfam using Pfam-N

We plan to add Pfam-N matches to Pfam seed alignments to help improve the performance of the Pfam profile-HMMs in future releases. Some Pfam families gain huge numbers of additional matches in Pfam-N. For example, the TAT_signal family (PF10518) matches about 4,000 sequences in Pfam 34.0. Pfam-N identifies a further 37,000 protein sequences that were missed by the current Pfam model. The ACT domain (PF01842), which confers regulation to a variety of enzymes by binding to amino acids, is doubled in size by the 27,000 additional matches identified by the deep learning model. Overall, the deep learning models seem to perform particularly well for short families, where the profile-HMMs struggle to distinguish between signal and noise. Large gains are also made for short protein repeats such as TPRs, Leucine Rich Repeats and zinc fingers found in DNA-binding transcription factors.

Funding

The work to expand Pfam families with Pfam-N hits is funded by the Wellcome Trust as part of a Biomedical Resources grant awarded to the Pfam database.

Future work

Deep learning approaches have a number of potential upsides we’re excited to explore, including explicit modeling of interactions between amino acids that are quite far apart in sequence, as well as the fact that these approaches build a shared model across all protein classes: they attempt to leverage shared information about, say, a helix-turn-helix region across the large variety of biological processes that incorporate this motif. 

If the use of deep learning in speech recognition and computer vision is any indication, our current use of it to functionally annotate proteins is in its infancy. We look forward to the development of these models to help us classify the protein universe. 

Posted by Alex Bateman

4 Responses to “Google Research Team bring Deep Learning to Pfam”

  1. Gabe Says:

    Exciting! This raises a few questions, actually:
    1) Why not re-train the hmm suite on a larger corpus than the small set of seed alignments?
    2) Why not just take (a representative subset of) the new Pfam-N alignments and build more HMMs out of them (naming each the same thing that ProtENN mapped them to)?
    3) For sequences that still don’t have great matches to existing clusters with ProtENN or HMMs, couldn’t you cluster their 1100-feature embedding and make new families out of those clusters, then just train HMMs on those to get a bunch of new families?
    4) After all this is done, and we have deeper training sets for existing HMMs and a bunch of brand-new families with their own HMMs too, can the team then re-run ProtENN on the new set of annotations/families and iterate on its performance improvements?
    5) I like how the paper waxed on about how much better ProtENN does over HMMs, but didn’t understand what they meant by “retraining all HMMs” to make things more “even” between their method and HMMs. Did they actually use more than the original seed alignments (e.g. entire Pfam)? If so, is that actually better? And if so, why don’t we do the same thing with the release version of the HMMs? (I see we’re back to question 1!)

    I see the potential for deep learning to improve existing models, but right now the most exciting piece of this to me is how it simply allows selection of more sequences to train HMMs. Neural networks are still largely black boxes (although it is impressive they managed to rip off the right layers to get a BLOSUM comparison going!).

    • alexbateman Says:

      Thanks for all the comments. I’ll respond point by point below:
      1. Training Pfam HMMs on the full alignments would undoubtedly find more homologues. However, there are a few reasons why we do not do this. Firstly, we select per-family thresholds, and these would not be applicable if the HMM were built from the full alignment. Secondly, although we strive for Pfam to have no false positives, there are undoubtedly some that will occur. There is a danger that many more false positives will be identified when using HMMs built from the full alignment. I could imagine that many problems would occur with the coiled-coil and other low/medium-complexity families in Pfam. Thirdly, the stability of the matches found by Pfam is guaranteed by building from the seed alignments. If we build HMMs from the full alignment, we may find that some families radically change in size between releases.
      2. We do have plans to try to improve existing Pfam families based on the Pfam-N alignments. Our first idea is to try and merge some of the sequences into the seed alignments. However, we will also try to build new families as you suggest if our first strategy doesn’t work out well.
      3. Yes, the embedding space provides some really interesting opportunities for creating new families. I don’t think that we will be working on this any time soon, simply due to a lack of capacity. But hopefully other groups will take up the idea and run with it.
      4. Yes we hope to keep Pfam-N up to date with new Pfam releases.
      5. For Pfam-N the ensemble of models is trained on the Pfam full alignments. So it is perhaps not surprising that ProtENN finds more homologues than the Pfam profile-HMMs. I don’t think it is reasonable to say at this point that ProtENN is better than profile-HMMs. What is interesting to me is that the ProtENN approach doesn’t require multiple sequence alignments for training. We are also excited by some of the new opportunities that Deep Learning offers to bring in other kinds of information and make predictions of additional labels.

      • Gabe Says:

        Thanks Alex, so if my understanding is correct,
        1. Retraining on the full pfam for each family will likely extend the sensitivity of the algorithm (especially for something like a first-match selection), but may hamper the specificity and render the pre-existing per-family thresholds invalid.
        2. This is great to hear.
        3. Happy to try my hand at a first pass of this if there is a clear explanation for how to generate the numerical 1100-feature loadings for arbitrary amino acid inputs. (i.e. taking a sequence through one-hot encoding and 0-padding, feeding that into the autoencoder layers, and outputting the [normalized?] 1100-dimension numerical representation).
        4. Excellent, good to hear as well. I was thinking more along the lines of an iterative re-training of the ENN within a given pfam release by using the ENN’s predictions to improve the HMMs (e.g. by integrating the necessary new sequences into the seed alignment like you say), then reclassifying the whole protein space with the updated HMMs, then retraining the ENN on the new labels in the full set, and repeating the whole process (again selecting from that newly-updated ENN the new high-confidence labels missed by the HMMs, rebuilding the HMMs, relabeling all proteins again, then yet again retraining the ENN on those labels… until convergence — no new labels identified by the ENN that the HMMs didn’t catch).
        5. This makes sense.

        Some more comments:
        – I have tried some of this myself. I retrained the pfam-v34 HMMs using the entire pfam-A.full and indeed found many of the same assignments identified by the new Pfam-N which the existing seed-trained Pfam-A did not identify or incorrectly identified. This supports your point that having access to the full set for training leads the models’ predictions to converge. I separately trained new HMMs from the Pfam-N database as well.
        – Further, I created a joint HMM database from Pfam-A full’s alignments together with Pfam-N’s alignments for the same family which, as expected, has substantially higher recall and substantially lower e-values for most already-low-e-value hits. A few “borderline” cases dropped out entirely (the new “entire-pfam-A-plus-Pfam-N” database no longer considered these sequences applicable for a pfam assignment), which shows that at least in some cases it doesn’t just universally “catch more sequences” but also drops low-confidence sequences. I also tried a “stacked” database (the normal Pfam-A official HMM db concatenated with a Pfam-N-trained DB) and indeed observed complementary behavior similar to the Pfam-A-full + Pfam-N merged DB (just slightly lower recall and e-values on my test corpus).
        – I share the concerns that continually retraining profiles based on what existing profiles identify will lead to overfitting of profiles as well as over-expanding cluster definitions. But I also see the promise of having more expansive and generalizable families, and have observed this in this small test.

        As my concern is with drastically remote homology using proteins from genomic clades never before seen in public reference databases (we’re talking MAGs newer than the latest MAG dumps that form new microbial families to even phyla), this “full Pfam-A + full Pfam-N retrained HMM set” goes from a “modest bump in recall” to a “dramatic increase” in genes annotated in some of these MAGs (from ~50% of proteins annotated to over 64%), despite also dropping a number of “borderline” calls the seed-trained Pfam-A HMM model makes. Whether the additional new annotations are indeed trustworthy is an entirely different (and equally if not more important) matter, of course. But for many computational applications, having something is better than nothing (even if for many others it is far worse).

  2. Maxwell Bileschi Says:

    Hi Gabe – I’d love to get in on this conversation, but to be honest I’m having a bit of trouble reading the conversation on wordpress because the replies are so nested that the text is very narrow. Can we email?

