Archive for the 'Production' Category

Google Research Team bring Deep Learning to Pfam

March 24, 2021

We are delighted to announce the first fruits of a collaboration between the Pfam team and a Google Research team led by Dr Lucy Colwell, with Maxwell Bileschi and David Belanger. In 2019, Colwell’s team published a preprint describing a new deep learning method that was trained on Pfam data, and which improves upon the performance of the HMMER software (HMMER is the underlying software used by Pfam). Colwell’s team embraced our initial sceptical feedback and shared data that helped us to understand the new method’s performance. Over time our scepticism turned into interest as we explored novel findings from the method, and now we are very excited by the potential of these methods to improve our ability to classify sequences into domains and families.

Introducing Pfam-N

We are pleased to share a new file called Pfam-N (N for network), which provides additional Pfam 34.0 matches identified by the Google team. Pfam-N annotates 6.8 million protein regions into 11,438 Pfam families. These regions include nearly 1.8 million full-length protein sequences from UniProtKB Reference Proteomes that previously had no Pfam match, an improvement of 4.2% over the currently-annotated 42.5 million. We also note that among the sequences that get their first Pfam annotation, there are 360 human sequences.

The figure above shows the number of matches to UniProtKB Reference Proteomes 2020_06 for each Pfam release over the last decade (orange). Pfam-N (blue) adds nearly 10% more regions to Pfam v34.0, an increase that, based on the current trend, would have taken us several years to achieve.

How was Pfam-N made?

Deep learning approaches use training examples, much like HMMER, to learn the statistics of what it means for a protein to have a particular function. We use a subset of all the Pfam HMMER matches for training, and provide our deep learning model with both the sequence and Pfam family for each training example. 

We trained a number of replicates (“ensemble elements”) of a convolutional neural network to predict the Pfam matches. We call this ensemble model ProtENN (ENN for Ensemble of Neural Networks). The method relies on HMMER to initially parse proteins into their constituent domains before giving these regions to ProtENN. 
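As a rough illustration of what an ensemble of replicates does, the sketch below averages per-family probabilities from several models and picks the top-scoring family. The averaging rule, the function name `ensemble_predict`, the input format and the example scores are all assumptions made for illustration; the post does not say how ProtENN combines its ensemble elements.

```python
from collections import defaultdict


def ensemble_predict(per_model_probs):
    """Average per-family probabilities across ensemble elements.

    per_model_probs: list of dicts mapping a family accession to a
    probability, one dict per ensemble element (hypothetical format).
    Returns the best-scoring family and its averaged probability.
    """
    if not per_model_probs:
        raise ValueError("need at least one ensemble element")
    totals = defaultdict(float)
    for probs in per_model_probs:
        for family, p in probs.items():
            totals[family] += p
    n = len(per_model_probs)
    averaged = {family: total / n for family, total in totals.items()}
    best = max(averaged, key=averaged.get)
    return best, averaged[best]


# Three hypothetical replicates scoring the same domain region.
# The probabilities are invented; only the accessions come from the post.
replicates = [
    {"PF01842": 0.70, "PF10518": 0.30},
    {"PF01842": 0.40, "PF10518": 0.60},
    {"PF01842": 0.80, "PF10518": 0.20},
]
family, score = ensemble_predict(replicates)
print(family, round(score, 2))
```

Here one replicate disagrees with the other two, but the averaged score still favours PF01842, which is the kind of robustness an ensemble is meant to provide.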

The Pfam-N file is in the standard Pfam Stockholm alignment format, and the ProtENN matches are aligned using the existing Pfam profile-HMM model. We only include a match in Pfam-N if it is not already included in Pfam.
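The rule that a match is only kept if it is not already in Pfam can be sketched as a simple filter. The tuple layout, the overlap test and the function names below are assumptions for illustration, not the pipeline actually used to build Pfam-N, and the sequence identifiers and coordinates are made up.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two inclusive residue ranges share at least one position."""
    return a_start <= b_end and b_start <= a_end


def novel_matches(candidates, existing):
    """Keep candidates that do not overlap an existing match to the same
    family on the same sequence.

    Each match is a tuple: (sequence_id, family_accession, start, end).
    """
    by_key = {}
    for seq, fam, start, end in existing:
        by_key.setdefault((seq, fam), []).append((start, end))
    kept = []
    for seq, fam, start, end in candidates:
        regions = by_key.get((seq, fam), [])
        if not any(overlaps(start, end, s, e) for s, e in regions):
            kept.append((seq, fam, start, end))
    return kept


# Hypothetical data for illustration only.
existing = [("seqA", "PF01842", 10, 80)]
candidates = [
    ("seqA", "PF01842", 15, 78),    # overlaps an existing match: dropped
    ("seqA", "PF01842", 120, 190),  # same family, new region: kept
    ("seqB", "PF10518", 1, 25),     # sequence with no prior match: kept
]
new_only = novel_matches(candidates, existing)
```

Whether "already included" should mean any overlap, as assumed here, or something stricter is a design choice the post does not spell out.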

It should be noted that the deep learning model has access to the full set of matches for a Pfam family, whereas the Pfam profile-HMM models are trained on the much smaller Pfam seed alignments. Thus this is not a direct comparison between ProtENN and HMMER. 

Improving Pfam using Pfam-N

We plan to add Pfam-N matches to Pfam seed alignments to help improve the performance of the Pfam profile-HMMs in future releases. Some Pfam families gain huge numbers of additional matches in Pfam-N. For example, the TAT_signal family (PF10518) matches about 4,000 sequences in Pfam 34.0. Pfam-N identifies a further 37,000 protein sequences that were missed by the current Pfam model. The ACT domain (PF01842), which confers regulation to a variety of enzymes by binding to amino acids, is doubled in size by the 27,000 additional matches identified by the deep learning model. Overall, the deep learning models seem to perform particularly well for short families, where the profile-HMMs struggle to distinguish between signal and noise. Large gains are also made for short protein repeats such as TPRs, Leucine Rich Repeats and zinc fingers found in DNA-binding transcription factors.

Funding

The work to expand Pfam families with Pfam-N hits is funded by the Wellcome Trust as part of a Biomedical Resources grant awarded to the Pfam database.

Future work

Deep learning approaches have a number of potential upsides we’re excited to explore. These include explicit modelling of interactions between amino acids that are far apart in sequence, and the fact that these approaches build a shared model across all protein classes: they attempt to leverage shared information about, say, a helix-turn-helix region across the large variety of biological processes that incorporate this motif.

If the use of deep learning in speech recognition and computer vision is any indication, its current use in functionally annotating proteins is in its infancy. We look forward to the development of these models to help us classify the protein universe.

Posted by Alex Bateman

Pfam 34.0 is released

March 24, 2021

Pfam 34.0 contains a total of 19,179 families and 645 clans. Since the last release, we have built 935 new families, killed 15 families and created 11 new clans. UniProt Reference Proteomes has increased by 21% since Pfam 33.1, and now contains 47 million sequences. Of the sequences that are in reference proteomes, 74.5% have at least one Pfam match, and 48.8% of all residues fall within a Pfam family.

Structural models

In our previous blog post, we announced the release of ~6,000 structural models in Pfam and InterPro. Many of the new families that we have created since the last release are large enough to be suitable for structure prediction. We have sent the alignments for new and modified Pfam families to the Baker group, who are currently generating structural models for them using their pipeline. We will release the next set of structural models when Pfam 34.0 is integrated into InterPro.

Collaboration with Google Research

We have been working with Dr Lucy Colwell’s research team at Google Research to expand Pfam coverage using deep learning methods. The deep learning approach, trained on Pfam HMMER matches, has found many additional matches which can be found in a new file called Pfam-N. There is another Pfam blog post which describes the work in more detail here.

Pfam 33.1 is released

June 11, 2020

We are pleased to announce the release of Pfam 33.1! Some of you may have noticed that we never released Pfam 33.0 – we had initially planned to do so in March 2020, but due to the global pandemic, we redirected our efforts to updating the Pfam SARS-CoV-2 models instead (see previous blog posts Pfam SARS-CoV-2 special update and Pfam SARS-CoV-2 special update (part 2)). We have added these updated models to the Pfam 33.0 release, along with a few other families that we had built since the data for Pfam 33.0 were frozen, to create Pfam 33.1.

Pfam 33.1 contains a total of 18259 families and 635 clans. Since the last release, we have built 355 new families and killed 25 families. We regularly receive feedback from users about families or domains that are missing in Pfam, and typically add many user-submitted families at each release. We include the submitter’s name and ORCID identifier as an author of such Pfam entries. This helps people to get credit for community activities that improve molecular biology databases such as Pfam.

One such user submission was from Heli Mönttinen (University of Helsinki) who submitted a large scale clustering of virus families. Based on this clustering we added 88 new families to Pfam. 

Figure 1. Organisation of the TSP1 clan in Pfam shown as a sequence similarity network. Image taken from Xu et al.

Finally, we are very happy to welcome Sara and Lowri who are working as curators for both the Pfam and InterPro resources and are already making great contributions to the resources.

Posted by Jaina and Alex

Pfam 31.0 is released

March 8, 2017

Pfam 31.0 contains a total of 16712 families and 604 clans. Since the last release, we have built 415 new families, killed 9 families and created 11 new clans. We have also been working on expanding our clan classification; in Pfam 31.0, over 36% of Pfam entries are placed within a clan.

Pfam 30.0 is available

July 1, 2016

Pfam 30.0, our second release based on UniProt reference proteomes, is now available. The new release contains a total of 16,306 families, with 22 new families and 11 families killed since the last release. The UniProt reference proteome set has expanded and now includes 17.7 million sequences, compared with 11.9 million when we made Pfam 29.0. In this release, we have updated the annotations on hundreds of Pfam entries, and renamed some of our Domains of Unknown Function (DUF) families.

DUFs are protein domains whose function is uncharacterised. Over time, as scientific knowledge increases and new data about proteins come to light, more information about the function of a domain may become available. As a result, DUFs can be renamed and re-annotated with more meaningful descriptions. As part of Pfam 30.0, we have re-annotated 116 DUFs based on updated information in the UniProtKB database, the scientific literature, and feedback from Pfam and InterPro users. Examples of some of our DUF updates in Pfam 30.0 are given below:

 

  • PF10265, created in release 23.0 and originally named DUF2217, has been renamed to Miga, a family of proteins that promote mitochondrial fusion.
  • PF10229, created in release 23.0 and originally named DUF2246, has been renamed as MMADHC, as it represents methylmalonic aciduria and homocystinuria type D proteins and their homologues.  The structure of this domain is shown below.

 

Structure of MMADHC dimer, PDB:5CV0

 

  • PF12822, created in release 25.0 and originally named DUF3816, has been renamed to ECF_trnsprt, since it contains proteins identified as the substrate-specific component of energy-coupling factor (ECF) transporters.

Please note that we may change the identifier for a family (e.g. DUF2217), but we never change the accession for a family (e.g. PF10265).
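For anyone storing Pfam cross-references, the practical consequence is to key records on the accession rather than the identifier. The sketch below uses the three renamed families listed above; the dictionary layout and the helper names `current_identifier` and `accession_for` are hypothetical and are not part of any Pfam API.

```python
# Accession -> (original identifier, current identifier),
# using the three renamed families described in this post.
FAMILIES = {
    "PF10265": ("DUF2217", "Miga"),
    "PF10229": ("DUF2246", "MMADHC"),
    "PF12822": ("DUF3816", "ECF_trnsprt"),
}


def current_identifier(accession):
    """Return the current identifier for a stable Pfam accession."""
    return FAMILIES[accession][1]


def accession_for(identifier):
    """Look up an accession by either its old or its current identifier."""
    for accession, names in FAMILIES.items():
        if identifier in names:
            return accession
    return None


print(current_identifier("PF10265"))
print(accession_for("DUF2246"))
```

A record keyed on PF10265 still resolves after the rename, whereas one keyed on DUF2217 would need the old-to-new mapping to be found.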

If you find any more DUFs that can be assigned a name based on function, or any other annotation updates, please get in touch with us (pfam-help@ebi.ac.uk).

 

Pfam 27.0 is now available!

March 22, 2013

In a blog post published just over a year ago, I proposed a number of changes to the content of Pfam to improve scalability and usability of the database. These changes came into effect a few days ago, when we released Pfam 27.0. This release of Pfam contains a total of 14831 families, with 1182 new families and 22 families killed since release 26.0. 80% of all proteins in UniProt contain a match to at least one Pfam domain, and 58% of all residues in the sequence database fall within a Pfam domain.

Dfam: A database of repetitive DNA elements

September 6, 2012

We are pleased to introduce Dfam 1.0, a database of profile HMMs for repetitive DNA elements. Repetitive DNA, particularly the remnants of transposable elements (TEs), makes up a large fraction of many genomes, especially eukaryotic ones. Accurate annotation of these TEs both simplifies downstream genomic analysis and enables research into their fascinating biology and impact on the genome.

Rfam 11.0 is out!

August 14, 2012

The team behind Rfam is pleased to announce the release of Rfam 11.0. This release represents a major update from 10.1, primarily due to the upgrade of our underlying sequence database, Rfamseq.

Does my family of interest have a determined 3D protein structure?

May 9, 2012

Two related questions that we are often asked via the Pfam helpdesk are ‘Which families have a known three-dimensional structure?’ and ‘Why is a particular PDB structure not found in Pfam?’. You may think that there are obvious answers to these questions, but as with many things in life, the answer is not necessarily as straightforward as you might expect. In this joint posting between Andreas Prlic (senior scientist at RCSB Protein Data Bank) and myself (Rob Finn, Pfam Production Lead), we will elaborate on how the PDB and Pfam cross-referencing occurs, explain why discrepancies occurred in the past, and describe the pipeline that the RCSB PDB has implemented using the HMMER web services API, which should provide the most current answer to these questions.

Proposed Pfam release changes

February 27, 2012

The current Pfam release, version 26.0, took approximately 4 months to nurse through the various stages of updating the sequence database, resolving overlaps between families, rebuilding the MySQL database and performing all of the post-processing that constitutes the ‘release’. The production team strives to make two releases a year, but I really do not fancy spending two thirds of a year on Pfam releases. Thus, with my colleagues, I have been reviewing what we do and why we do it and, probably more importantly, assessing how much different sections of the website are used. Below is a list of changes that are going to happen in the next release, release 27.0.
