Google Research Team bring Deep Learning to Pfam

March 24, 2021

We are delighted to announce the first fruits of a collaboration between the Pfam team and a Google Research team led by Dr Lucy Colwell, with Maxwell Bileschi and David Belanger. In 2019, Colwell’s team published a preprint describing a new deep learning method that was trained on Pfam data, and which improves upon the performance of the HMMER software (HMMER is the underlying software used by Pfam). Colwell’s team embraced our initial sceptical feedback and shared data that helped us to understand the new method’s performance. Over time our scepticism turned into interest as we explored novel findings from the method, and now we are very excited by the potential of these methods to improve our ability to classify sequences into domains and families.

Introducing Pfam-N

We are pleased to share a new file called Pfam-N (N for network), which provides additional Pfam 34.0 matches identified by the Google team. Pfam-N annotates 6.8 million protein regions into 11,438 Pfam families. These regions include nearly 1.8 million full-length protein sequences from UniProtKB Reference Proteomes that previously had no Pfam match, an increase of 4.2% over the 42.5 million sequences that are currently annotated. Among the sequences that get their first Pfam annotation, there are 360 human sequences.

The figure above shows the number of matches to UniProtKB Reference Proteomes 2020_06 for each Pfam release over the last decade (orange). Pfam-N (blue) adds nearly 10% more regions to Pfam v34.0, an increase that, based on the current trend, would have taken us several years to achieve.

How was Pfam-N made?

Deep learning approaches use training examples, much like HMMER, to learn the statistics of what it means for a protein to have a particular function. We use a subset of all the Pfam HMMER matches for training, and provide our deep learning model with both the sequence and Pfam family for each training example. 

We trained a number of replicates (“ensemble elements”) of a convolutional neural network to predict the Pfam matches. We call this ensemble model ProtENN (ENN for Ensemble of Neural Networks). The method relies on HMMER to initially parse proteins into their constituent domains before giving these regions to ProtENN. 
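As a rough illustration of what an ensemble does, the sketch below averages per-family scores from several replicate models and reports the top-scoring family. This is an assumption made for illustration only: the post does not say how ProtENN combines its ensemble elements, and the function name `ensemble_predict`, the score dictionaries and the numbers are all hypothetical.

```python
from collections import defaultdict


def ensemble_predict(per_model_scores):
    """Average per-family scores over replicate models and return the top family.

    ASSUMPTION: simple score averaging. The combination rule that ProtENN
    actually uses is not described in the post.
    """
    if not per_model_scores:
        raise ValueError("need at least one ensemble element")
    totals = defaultdict(float)
    for scores in per_model_scores:
        for family, score in scores.items():
            totals[family] += score
    n_models = len(per_model_scores)
    averaged = {family: total / n_models for family, total in totals.items()}
    best_family = max(averaged, key=averaged.get)
    return best_family, averaged[best_family]


# Hypothetical scores from three replicate networks for one protein region.
toy_scores = [
    {"PF10518": 0.7, "PF01842": 0.3},
    {"PF10518": 0.4, "PF01842": 0.6},
    {"PF10518": 0.8, "PF01842": 0.2},
]
best_family, best_score = ensemble_predict(toy_scores)
print(best_family, round(best_score, 3))
```

In this toy case one replicate disagrees with the other two, and averaging still settles on the family the majority prefers.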

The Pfam-N file is in the standard Pfam Stockholm alignment format, and the ProtENN matches are aligned using the existing Pfam profile-HMM model. We only include a match in Pfam-N if it is not already included in Pfam.
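Because the file uses Stockholm format, it can be read with a few lines of code. The sketch below counts the aligned sequences in each entry, assuming that every entry carries a `#=GF AC` accession line and ends with `//` as in other Pfam Stockholm files. The function name and the toy input (sequence names, residues and version suffixes) are invented for illustration.

```python
import io
from collections import Counter


def count_rows_per_family(handle):
    """Count distinct aligned sequence names per entry in a Stockholm stream."""
    counts = Counter()
    accession = None
    names = set()
    for raw in handle:
        line = raw.strip()
        if line.startswith("#=GF AC"):
            # e.g. "#=GF AC   PF10518.1"; drop the version suffix
            accession = line.split()[2].split(".")[0]
        elif line == "//":
            # End of one alignment entry
            if accession is not None:
                counts[accession] += len(names)
            accession, names = None, set()
        elif line and not line.startswith("#"):
            # Sequence row: "<name>/<start>-<end>  <aligned residues>"
            names.add(line.split()[0])
    return counts


# Toy input: sequence names, residues and version numbers are invented.
TOY_STOCKHOLM = """# STOCKHOLM 1.0
#=GF ID   TAT_signal
#=GF AC   PF10518.1
seqA/1-24    MNRRDFLKGAAAAGAALALSAAGA
seqB/3-26    MSRRRFLQ.AGAAGLAALSGTAHA
//
# STOCKHOLM 1.0
#=GF ID   ACT
#=GF AC   PF01842.1
seqC/10-20   VLEVEVPDRPG
//
"""

toy_counts = count_rows_per_family(io.StringIO(TOY_STOCKHOLM))
print(dict(toy_counts))
```

For the real file, the same function could be given an open file handle instead of the in-memory string. Names are collected in a set so that an interleaved alignment would not be counted twice.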

Note that the deep learning model has access to the full set of matches for a Pfam family, whereas the Pfam profile-HMM models are trained on the much smaller Pfam seed alignments. Thus, this is not a direct comparison between ProtENN and HMMER.

Improving Pfam using Pfam-N

We plan to add Pfam-N matches to Pfam seed alignments to help improve the performance of the Pfam profile-HMMs in future releases. Some Pfam families gain huge numbers of additional matches in Pfam-N. For example, the TAT_signal family (PF10518) matches about 4,000 sequences in Pfam 34.0. Pfam-N identifies a further 37,000 protein sequences that were missed by the current Pfam model. The ACT domain (PF01842), which confers regulation to a variety of enzymes by binding to amino acids, is doubled in size by the 27,000 additional matches identified by the deep learning model. Overall, the deep learning models seem to perform particularly well for short families, where the profile-HMMs struggle to distinguish between signal and noise. Large gains are also made for short protein repeats such as TPRs, Leucine Rich Repeats and zinc fingers found in DNA-binding transcription factors.

Funding

The work to expand Pfam families with Pfam-N hits is funded by the Wellcome Trust as part of a Biomedical Resources grant awarded to the Pfam database.

Future work

Deep learning approaches have a number of potential upsides that we’re excited to explore. One is the explicit modeling of interactions between amino acids that are far apart in sequence. Another is that these approaches build a shared model across all protein classes: they attempt to leverage shared information about, say, a helix-turn-helix region across the large variety of biological processes that incorporate this motif.

If the use of deep learning in speech recognition and computer vision is any indication, our current use of it to functionally annotate proteins is in its infancy. We look forward to the development of these models to help us classify the protein universe.

Posted by Alex Bateman


Pfam 34.0 is released

March 24, 2021

Pfam 34.0 contains a total of 19,179 families and 645 clans. Since the last release, we have built 935 new families, killed 15 families and created 11 new clans. UniProt Reference Proteomes has increased by 21% since Pfam 33.1, and now contains 47 million sequences. Of the sequences that are in reference proteomes, 74.5% have at least one Pfam match, and 48.8% of all residues fall within a Pfam family.
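The release figures can be cross-checked with a little arithmetic. The sketch below assumes the Pfam 33.1 total of 18,259 families reported in that release's own announcement, and treats "47 million" as a rounded value, so the second result is approximate.

```python
# Family bookkeeping between Pfam 33.1 and Pfam 34.0.
previous_families = 18_259   # Pfam 33.1 total, from the Pfam 33.1 release post
new_families = 935
killed_families = 15
current_families = previous_families + new_families - killed_families
print(current_families)  # 19179

# Approximate number of reference proteome sequences with a Pfam match.
reference_proteome_sequences = 47_000_000   # "47 million", rounded in the post
fraction_with_match = 0.745
sequences_with_match = reference_proteome_sequences * fraction_with_match
print(f"{sequences_with_match / 1e6:.1f} million")  # 35.0 million
```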

Structural models

In our previous blog post, we announced the release of ~6,000 structural models in Pfam and InterPro. Many of the new families that we have created since the last release are large enough to be suitable for structure prediction. We have sent the alignments for new and modified Pfam families to the Baker group, who are currently generating structural models for them using their pipeline. We will release the next set of structural models when Pfam 34.0 is integrated into InterPro.

Collaboration with Google Research

We have been working with Dr Lucy Colwell’s research team at Google Research to expand Pfam coverage using deep learning methods. The deep learning approach, trained on Pfam HMMER matches, has found many additional matches, which are available in a new file called Pfam-N. Another Pfam blog post describes the work in more detail.


Join the Rfam team!

March 19, 2021

We are looking for a Software Developer to join the Rfam team and contribute to the world’s largest database of RNA families. The post holder will be responsible for keeping Rfam up-to-date, developing Rfam Cloud, and improving the website. More information about the position can be found at https://bit.ly/rfam-software-developer

Apply now or help spread the word. Closing date: April 20th, 2021.


Rfam 14.5 is live

March 18, 2021

We are happy to announce a new Rfam release, version 14.5, featuring 112 updated microRNA families and 10 families improved using 3D structure information. Read on for details or explore 3,940 RNA families at rfam.org.

Updated microRNA families

As described in our most recent paper, we are in the process of synchronising microRNA families between Rfam and miRBase. In this release 112 of the existing microRNA families have been updated with new manually curated seed alignments from miRBase, new gathering thresholds, and new family members found in the Rfamseq sequence database. 

In total, 852 new microRNA families have been created (356 in release 14.3 and 496 in release 14.4) and 152 existing families have been updated (40 in release 14.3 and 112 in release 14.5). As the miRBase-Rfam synchronisation is about 50% complete, additional microRNA families will be made available in the upcoming releases. You can view a list of the 112 updated families or browse all 1,385 microRNA families on the Rfam website. 

Updating families using information from 3D structures

We are also reviewing the families that have experimentally determined 3D structures in order to compare the Rfam annotations with the 3D models. Our goal is to incorporate the 3D information into Rfam seed alignments, as many families were created before the corresponding 3D structures became available. We manually review each PDB structure, verify base pair annotations from matching PDB entries, and obtain a more consistent consensus secondary structure model.

In multiple cases we were able to add missing base pairs and pseudoknots. For example, in the SAM riboswitch (RF00162), we added two base pairs at the base of helix P2, corrected a base pair in P3 and added four base pairs in P4 (one at the base of the helix and three near the terminal loop). The updated consensus secondary structure presents a more accurate central core annotation with more structure in the four-way junction.

SAM riboswitch secondary structure before and after the updates

In another example, one base pair was added in P1 and another one in P3 of the SAM-I/IV variant riboswitch (RF01725). We also corrected a base pair in P3 and included a P4 stem loop that was not integrated before.

SAM-I/IV riboswitch secondary structure before and after the updates

The SAM-I/IV riboswitch has a SAM binding core conformation similar to that of the SAM riboswitch, but it lacks the k-turn motif in P2 that is found in SAM riboswitches. The two families also have different pseudoknot interactions: the SAM riboswitch forms a pseudoknot between a P2 loop and the stem of P3, while the SAM-I/IV riboswitch contains a pseudoknot between a P3 loop and the 5′ region.

The first 10 families updated with 3D information include:

  1. RF00162 – SAM riboswitch
  2. RF01725 – SAM-I/IV variant riboswitch
  3. RF00164 – Coronavirus 3’ stem-loop II-like motif (s2m)
  4. RF00013 – 6S / SsrS RNAP
  5. RF00003 – U1 spliceosomal RNA
  6. RF00015 – U4 spliceosomal RNA
  7. RF00442 – Guanidine-I riboswitch
  8. RF00027 – let-7 microRNA precursor
  9. RF01054 – preQ1-II (pre queuosine) riboswitch
  10. RF02680 – preQ1-III riboswitch

We will continue reviewing the families with known 3D structure in future releases.

Other family updates

Initially reported by Aspegren et al. 2004, Class I (RF01414) and Class II (RF01571) RNAs were found in the social amoeba Dictyostelium discoideum and later investigated in more detail by Avesson et al. 2011. Now a new report from Kjellin et al. 2021 presents a comprehensive analysis of the Class I RNA genes in dictyostelid social amoebas. Based on this study, we updated the Dicty Class I RNA family RF01414 with a new seed alignment and removed the family RF01571, thus merging both families into one. We thank Dr Jonas Kjellin (Uppsala University) for suggesting this update.

Goodbye Ioanna!

Rfam 14.5 is the last release prepared by Dr Ioanna Kalvari, who will be leaving the team at the end of March 2021. We would like to take the opportunity to thank Ioanna for her contributions over the last 5.5 years and wish her the best of luck in the future!

Get in touch

As always, we would be very happy to hear from you if you have any feedback or suggestions for Rfam. Please feel free to email us or get in touch on Twitter.


Folding the Protein Universe

March 3, 2021

Today signifies the realization of a long-held dream to have the structure of every (well, nearly every) family in Pfam. The Pfam and InterPro databases have made available structural models of 6,370 protein families created by Ivan Anishchenko from David Baker’s group at the University of Washington in Seattle. The models are made using their latest prediction method, called trRosetta, which can predict protein structures from large multiple sequence alignments with incredible accuracy.

The Baker group have had remarkable success over the years in the field of structure prediction, and in the recent CASP14 event the group’s predictions were the most accurate from an academic group. Although not quite as accurate as DeepMind’s AlphaFold 2.0 predictions, they are certainly of a high enough quality for many applications. For example, I am interested to understand when a Pfam family is part of a larger superfamily, or clan as we call them in Pfam. I have been able to take the structural models and identify distant homologues in the PDB using tools such as DALI and PDBeFold that compare protein structures. For longer Pfam families we can look at the structural model and identify likely domain boundaries to split up the existing Pfam family into domain-sized chunks (for example, Calmodulin_bind could be split into 3 domains).

Within the InterPro website we have developed a completely novel view that allows you to see which residues in the Pfam seed alignment are predicted to be close in space. By clicking on columns in the alignment, one can see where they are in the structural model and which residues are predicted to be nearby (see the documentation for further details). We would be very interested in getting your feedback on this feature. We could provide a similar view based on contacts found in known structures. The PDB file for individual models can be downloaded from the structural model tab on the family pages within the InterPro and Pfam websites. You can also download all of the structural model and contact map data from the Pfam FTP site and InterPro FTP site.

The figure above shows the contact map and structural model as seen through the InterPro website. Links to some example Pfam families that have a structure model are shown below (for the Pfam links, click on the Structural Model tab):

This is not the first time we have made a large set of structure predictions available. Back in 2002, again in collaboration with David Baker and Rich Bonneau, we made many models available. The overall accuracy of these models was much lower, and we did not know which models were good since we lacked an accurate quality metric. The new models come with a quality score called lDDT. Broadly, we can consider a model with lDDT > 0.6 to be good, and one with lDDT > 0.8 to be excellent.
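The lDDT rule of thumb can be written as a small helper. This is a minimal sketch: the function name and the label used for scores of 0.6 or lower are my own, since the post only names the "good" and "excellent" bands.

```python
def lddt_quality(lddt):
    """Map a model's lDDT score (0 to 1) onto the rule-of-thumb bands."""
    if not 0.0 <= lddt <= 1.0:
        raise ValueError("lDDT is expected to lie between 0 and 1")
    if lddt > 0.8:
        return "excellent"
    if lddt > 0.6:
        return "good"
    # Label chosen for illustration; the post does not name this band.
    return "below the 'good' threshold"


for score in (0.85, 0.70, 0.45):
    print(score, lddt_quality(score))
```

Both thresholds are strict, so a model scoring exactly 0.6 or 0.8 falls into the lower of the two neighbouring bands.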

Today marks an amazing milestone, with 88% of Pfam families now having a PDB structure or a structural model. The story is not quite finished though, as there remain 2,202 families that do not have structural data. We plan to investigate different sequence sets to make an even larger set of models available in the coming months. We felt however that it was useful to release this data set to the community as fast as we could. The work described here has been possible only due to funding from the BBSRC BBR, which has been a critical part of the funding landscape for UK data resources for many years.
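The 88% figure is consistent with the number of families still lacking structural data. The check below assumes the total was the 18,259 families of Pfam 33.1, the current release at the time of this post. That total is an assumption on my part and is not stated in the post itself.

```python
total_families = 18_259       # ASSUMPTION: Pfam 33.1 total
without_structure = 2_202     # families with no PDB structure or model
with_structure = total_families - without_structure
coverage = with_structure / total_families
print(with_structure, f"{coverage:.0%}")  # 16057 88%
```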

There will be many exciting stories to be told using this structural treasure trove, and we hope it is a beneficial resource to the community. Please let us know what you think of the data, and whether you find the contact maps and models useful.

Posted by Alex Bateman


Rfam 14.4 is live

December 18, 2020

The last Rfam release of 2020 is now live! Rfam 14.4 contains 496 new microRNA families developed in collaboration with miRBase. Find out about the microRNA project in our new NAR paper and let us know if you have any feedback.


Rfam 14.3

September 15, 2020

Rfam 14.3 includes 356 new and 40 updated microRNA families, as well as 12 new and 2 updated Flavivirus RNAs. Find out the details in our new NAR paper and get in touch if you have any feedback.


Dfam 3.2 Release

July 9, 2020

Dfam is proud to announce the release of Dfam 3.2.  This release represents a significant step in the expansion of Dfam by providing early access to uncurated, de novo generated families.  As a demonstration of this new capability, we imported a set of 336 RepeatModeler generated libraries produced by Fergal Martin and Denye Ogeh at the European Bioinformatics Institute (EBI).  Also in this release, Dfam now provides family alignments to the RepeatMasker TE protein database aiding in the discovery of related families and in the classification of uncurated TEs.

Uncurated Family Support

In addition to the fully curated libraries for the model organisms human, mouse, zebrafish, worm and fly, Dfam also includes curated libraries for seven other species.  While a fully curated library is the ultimate goal, support for uncurated families has become an essential aspect of a TE resource due to the increasing rate at which new species are being sequenced and the need to have at least a simple TE masking library available.

By standardizing the storage and tracking of uncurated families, it becomes possible to use these datasets to crudely mask an assembly, provide a first approximation of the TE content, and create a starting point for community curation efforts.  Due to the redundancy and fragmentation inherent in these datasets, we do not compute genome-specific thresholds or generate genome coverage plots for these families.  The latest update to the web portal includes new interfaces for uncurated families and some existing interfaces now include an option to include/omit uncurated families.

In this release, Dfam now contains RepeatModeler de novo-produced libraries for an additional 336 species as the result of the collaboration with EBI researchers (denoted with the new uncurated accession prefix “DR”).  Notable taxa expansions include Sauropsida (lizards and birds) and fishes (bony and cartilaginous) (Table 1). Also included are Amphibia, Viridiplantae and additional species in Mammalia.

Table 1. De novo-identified TE families from additional species

Species | Number (species) | Retrotransposons | DNA transposons | Other
Mammalia | 47 | 18301 | 3781 | 2567
Sauropsida | 164 | 29326 | 11688 | 27192
Amphibia | 6 | 1781 | 2031 | 6107
Actinopterygii (bony fishes) | 116 | 27520 | 51361 | 77006
Chondrichthyes (cartilaginous fishes) | 5 | 1671 | 198 | 2273
Viridiplantae (green plants) | 2 | 896 | 412 | 1687

Aligned Protein Features

In previous versions of Dfam, hand-curated coding regions were provided for a select set of families.  The protein products of these curated sequences were placed in the RepeatMasker TE protein database for use with the RepeatProteinMask tool.  In this release we have used this database with BLASTX to produce alignments to all Dfam families including the uncurated entries.  The resulting alignments are displayed alongside the curated coding regions as the new “aligned” feature track (Figure 1).

Figure 1. Feature track and details for BLASTX alignments to TE protein database.

Website improvements

Several minor improvements have been made to the interface since the previous release.  The browse page now provides links to download the families selected by the query/filter options as HMM, EMBL or FASTA records.  The Seed tab of the Families page now displays the average Kimura divergence of the seed alignment instances to the consensus.
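For readers unfamiliar with the measure, the sketch below computes the standard Kimura two-parameter distance, K = -1/2 ln((1 - 2p - q) * sqrt(1 - 2q)), where p and q are the proportions of transitions and transversions, and averages it over seed instances aligned to a consensus. This is an illustration under stated assumptions rather than Dfam's own calculation, which may differ in detail (for example in how CpG sites or gaps are handled). The function names and toy sequences are hypothetical.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def kimura_divergence(seq_a, seq_b):
    """Kimura two-parameter distance between two aligned DNA sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = transitions = transversions = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguity codes
        compared += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1      # A<->G or C<->T
        else:
            transversions += 1    # purine <-> pyrimidine
    if compared == 0:
        raise ValueError("no comparable positions")
    p = transitions / compared
    q = transversions / compared
    term1 = 1 - 2 * p - q
    term2 = 1 - 2 * q
    if term1 <= 0 or term2 <= 0:
        raise ValueError("sequences too divergent for the Kimura correction")
    return -0.5 * math.log(term1 * math.sqrt(term2))


def average_kimura_divergence(consensus, instances):
    """Mean distance of aligned instances to the consensus."""
    if not instances:
        raise ValueError("need at least one instance")
    return sum(kimura_divergence(consensus, s) for s in instances) / len(instances)


# Toy data: one instance is identical, one has a single transition (A->G).
toy_consensus = "ACGTACGTAC"
toy_instances = ["ACGTACGTAC", "GCGTACGTAC"]
toy_average = average_kimura_divergence(toy_consensus, toy_instances)
print(round(toy_average, 4))
```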


A new Pfam-B is released

June 30, 2020

In addition to our HMM-based Pfam entries (Pfam-A), we used to make a set of automatically generated, non-HMM based entries called Pfam-B. The Pfam-B entries were derived from clusters generated by applying the ADDA algorithm to an all-against-all BLAST search of UniRef-40, and removing any regions covered by Pfam-A. The overhead of producing Pfam-B in this way became too great, and as of Pfam 28.0, we stopped making Pfam-B entries (see [1] for a longer discussion on why we stopped producing Pfam-B). Erik Sonnhammer has devised an alternative method of making Pfam-B using the MMseqs2 software [2], and an overview of the process is given below (more details will follow in the next Pfam paper).

We have already begun to use the new version of Pfam-B to generate new families, and 11 of these are in Pfam 33.1. For example, the TUTase family (PF19088) was built using Pfam-B as the source. We expect that Pfam-B will be a very useful source of additional families in the coming years.

How the new Pfam-B was created

UniProtKB sequences not covered by Pfam-A were clustered using MMseqs2, and multiple sequence alignments of each cluster were generated with FAMSA [3]. This resulted in 136,730 Pfam-B families that on average contain 99 sequences (max 40,912) and are 310 positions wide (max 29,216).

How to access the new Pfam-B

The Pfam-B alignments are released as a tar archive on the Pfam FTP site [Pfam-B.tgz]. We do not plan to integrate them into the Pfam website, but we will generate them for each future Pfam release.

Posted by Erik Sonnhammer and the Pfam team

References

1. Finn et al. (2015) The Pfam protein families database: towards a more sustainable future.

2. Hauser et al. (2016) MMseqs software suite for fast and deep clustering and searching of large protein sequence sets.

3. Deorowicz et al. (2016) FAMSA: Fast and accurate multiple sequence alignment of huge protein families.


Pfam 33.1 is released

June 11, 2020

We are pleased to announce the release of Pfam 33.1! Some of you may have noticed that we never released Pfam 33.0 – we had initially planned to do so in March 2020, but due to the global pandemic, we redirected our efforts to updating the Pfam SARS-CoV-2 models instead (see previous blog posts Pfam SARS-CoV-2 special update and Pfam SARS-CoV-2 special update (part 2)). We have added these updated models to the Pfam 33.0 release, along with a few other families that we had built since the data for Pfam 33.0 were frozen, to create Pfam 33.1.

Pfam 33.1 contains a total of 18,259 families and 635 clans. Since the last release, we have built 355 new families and killed 25 families. We regularly receive feedback from users about families or domains that are missing in Pfam, and typically add many user-submitted families at each release. We include the submitter's name and ORCID identifier as an author of such Pfam entries. This helps people to get credit for community activities that improve molecular biology databases such as Pfam.

One such user submission was from Heli Mönttinen (University of Helsinki), who submitted a large-scale clustering of virus families. Based on this clustering we added 88 new families to Pfam.

Figure 1. Organisation of the TSP1 clan in Pfam shown as a sequence similarity network. Image taken from Xu et al.

Finally, we are very happy to welcome Sara and Lowri who are working as curators for both the Pfam and InterPro resources and are already making great contributions to the resources.

Posted by Jaina and Alex