Archive for the 'Production' Category

Pfam 35.0 is released

November 19, 2021

Pfam 35.0 contains a total of 19,632 families and clans. Since the last release, we have built 460 new families, killed 7 families and created 12 new clans. UniProt Reference Proteomes has increased by 7% since Pfam 34.0, and now contains 61 million sequences. Of the sequences that are in UniProt Reference Proteomes, 75.2% have at least one Pfam match, and 48.7% of all residues fall within a Pfam family.

Sources of new families

In an effort to increase the Pfam coverage of metagenomic sequence space, we have created 250 metagenomic protein families. These families were built by clustering protein sequences from the MGnify and UniProt databases, aligning the sequences in each cluster, and using the resulting alignments to create new SEED alignments. We then used our usual building process to create new families from the SEED alignments.

We have also created 52 new families from clusters in DPCfam, a new resource based on Density Peak Clustering and created by Alessandro Laio, Marco Punta and Elena Tea Russo. An interesting example of these families is the N-terminal domain of the Crinkler effector protein (PF20147). Crinkling- and necrosis-inducing proteins (CRNs), or Crinklers, were first described in plant pathogenic oomycetes, where they are ubiquitous, and have been shown to participate in processes controlling plant cell death and immunity. However, Crinklers are also found outside oomycetes: the Rhizophagus irregularis crinkler effector protein 1 (RiCRN1), like other CRNs, functions in the plant nucleus, but plays an essential role in symbiosis progression and the proper initiation of arbuscule development. This suggests that Crinkler proteins are more widely distributed than first predicted, and that their function is not limited to plant cell death (PMID:30233541). The Pfam domain contains the conserved motif FLAK and, from structure predictions, adopts the ubiquitin-like fold, as seen in the image below.

Figure 1: N-terminal domain of RiCRN. The image was generated using an AlphaFold colab notebook and is displayed using Molstar.

We continue to be provided with new families from the group of L. Aravind from NCBI, and have added 42 of them to this release of Pfam. Many of these families represent novel domains and proteins found in phage defence systems of bacteria.

Pfam-N

We are really excited about the Pfam-N matches for Pfam 35.0, but there is still a bit of work to do before we can release them. In particular, we’re working on neural networks that can predict the location of the domains themselves, instead of relying on HMMER to do so, as with the previous Pfam-N release. It will take a little more time to ensure the quality of these annotations, and we will make another announcement when they are ready.

Enjoy Pfam 35.0!

Posted by the Pfam team

Google Research Team bring Deep Learning to Pfam

March 24, 2021

We are delighted to announce the first fruits of a collaboration between the Pfam team and a Google Research team led by Dr Lucy Colwell, with Maxwell Bileschi and David Belanger. In 2019, Colwell’s team published a preprint describing a new deep learning method that was trained on Pfam data, and which improves upon the performance of the HMMER software (HMMER is the underlying software used by Pfam). Colwell’s team embraced our initial sceptical feedback and shared data that helped us to understand the new method’s performance. Over time our scepticism turned into interest as we explored novel findings from the method, and now we are very excited by the potential of these methods to improve our ability to classify sequences into domains and families.

Introducing Pfam-N

We are pleased to share a new file called Pfam-N (N for network), which provides additional Pfam 34.0 matches identified by the Google team. Pfam-N annotates 6.8 million protein regions into 11,438 Pfam families. These regions include nearly 1.8 million full-length protein sequences from UniProtKB Reference Proteomes that previously had no Pfam match, an improvement of 4.2% over the currently-annotated 42.5 million. We also note that among the sequences that get their first Pfam annotation, there are 360 human sequences.
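The 4.2% figure follows directly from the two counts quoted above. As a quick check, here is the arithmetic using the rounded numbers from this post, so the result is approximate; the function name is ours, purely for illustration:

```python
# Rounded counts quoted in the post (treated as exact here for illustration).
newly_annotated = 1.8e6      # sequences that gain their first Pfam match via Pfam-N
already_annotated = 42.5e6   # Reference Proteomes sequences already matched by Pfam


def percent_gain(new: float, existing: float) -> float:
    """Return the relative increase, in percent, of `new` over `existing`."""
    return 100.0 * new / existing


gain = percent_gain(newly_annotated, already_annotated)
print(f"{gain:.1f}%")
```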

The figure above shows the number of matches to UniProtKB Reference Proteomes 2020_06 for each Pfam release over the last decade (orange). Pfam-N (blue) adds nearly 10% more regions to Pfam v34.0, which, based on the current trend, would have taken us several years to achieve.

How was Pfam-N made?

Deep learning approaches use training examples, much like HMMER, to learn the statistics of what it means for a protein to have a particular function. We use a subset of all the Pfam HMMER matches for training, and provide our deep learning model with both the sequence and Pfam family for each training example. 

We trained a number of replicates (“ensemble elements”) of a convolutional neural network to predict the Pfam matches. We call this ensemble model ProtENN (ENN for Ensemble of Neural Networks). The method relies on HMMER to initially parse proteins into their constituent domains before giving these regions to ProtENN. 
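To give a feel for what an ensemble does, here is a minimal sketch of one common way to combine replicates: averaging their per-family probabilities for a region and taking the top family. This post does not say how ProtENN actually combines its ensemble elements, so the averaging rule, the function name and all the numbers below are assumptions for illustration only.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical per-replicate output: Pfam accession -> probability that one
# replicate assigns to a single candidate region.
ReplicateScores = Dict[str, float]


def ensemble_predict(replicates: List[ReplicateScores]) -> Tuple[str, float]:
    """Average probabilities across replicates and return the top family.

    Averaging is an assumed combination rule, not necessarily ProtENN's.
    """
    if not replicates:
        raise ValueError("need at least one replicate")
    totals: Dict[str, float] = defaultdict(float)
    for scores in replicates:
        for accession, prob in scores.items():
            totals[accession] += prob
    # A family absent from a replicate's output contributes 0 for that replicate.
    averaged = {acc: total / len(replicates) for acc, total in totals.items()}
    best = max(averaged, key=averaged.get)
    return best, averaged[best]


# Made-up scores for one region from three replicates.
replicates = [
    {"PF01842": 0.70, "PF10518": 0.30},
    {"PF01842": 0.55, "PF10518": 0.45},
    {"PF01842": 0.40, "PF10518": 0.60},
]
family, score = ensemble_predict(replicates)
print(family, round(score, 2))
```

The third replicate disagrees with the other two here, but the averaged score still favours PF01842. Smoothing out the errors of individual replicates is the usual motivation for an ensemble.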

The Pfam-N file is in the standard Pfam Stockholm alignment format, and the ProtENN matches are aligned using the existing Pfam profile-HMM model. We only include a match in Pfam-N if it is not already included in Pfam.
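For readers who want to look inside the file, here is a minimal sketch of reading a multi-alignment Stockholm file with the Python standard library and keeping only matches that are not in a set of known Pfam regions. The sample record, the sequence names and the exact-name comparison are assumptions for illustration. The real filtering may well compare region coordinates rather than names.

```python
import io
from typing import Dict, Iterable, Iterator, List, Set, TextIO, Tuple


def read_stockholm(handle: TextIO) -> Iterator[Tuple[Dict[str, str], List[str]]]:
    """Yield (#=GF annotations, sequence names) for each alignment in the file."""
    annotations: Dict[str, str] = {}
    names: Dict[str, None] = {}          # dict keeps order and removes repeats
    for raw in handle:
        line = raw.strip()
        if line.startswith("//"):        # "//" terminates one alignment
            yield annotations, list(names)
            annotations, names = {}, {}
        elif line.startswith("#=GF "):   # per-alignment annotation such as ID or AC
            parts = line.split(None, 2)
            if len(parts) == 3:
                # Keep the first value if a tag is repeated (e.g. multi-line CC).
                annotations.setdefault(parts[1], parts[2])
        elif line and not line.startswith("#"):
            names[line.split()[0]] = None  # first column is "name/start-end"


def novel_matches(candidates: Iterable[str], known: Set[str]) -> List[str]:
    """Return candidate regions that are not already in the known set."""
    return [name for name in candidates if name not in known]


# Tiny made-up record; the accession version and sequences are hypothetical.
SAMPLE = """# STOCKHOLM 1.0
#=GF ID   TAT_signal
#=GF AC   PF10518.12
P12345/3-28    MSRRQFLKA
Q67890/1-26    MNRRDFLKG
//
"""

families = list(read_stockholm(io.StringIO(SAMPLE)))
annotations, names = families[0]
new = novel_matches(names, known={"P12345/3-28"})
print(annotations["AC"], new)
```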

It should be noted that the deep learning model has access to the full set of matches for a Pfam family, whereas the Pfam profile-HMM models are trained on the much smaller Pfam seed alignments. Thus this is not a direct comparison between ProtENN and HMMER. 

Improving Pfam using Pfam-N

We plan to add Pfam-N matches to Pfam seed alignments to help improve the performance of the Pfam profile-HMMs in future releases. Some Pfam families gain huge numbers of additional matches in Pfam-N. For example, the TAT_signal family (PF10518) matches about 4,000 sequences in Pfam 34.0. Pfam-N identifies a further 37,000 protein sequences that were missed by the current Pfam model. The ACT domain (PF01842), which confers regulation to a variety of enzymes by binding to amino acids, is doubled in size by the 27,000 additional matches identified by the deep learning model. Overall, the deep learning models seem to perform particularly well for short families, where the profile-HMMs struggle to distinguish between signal and noise. Large gains are also made for short protein repeats such as TPRs, Leucine Rich Repeats and zinc fingers found in DNA-binding transcription factors.

Funding

The work to expand Pfam families with Pfam-N hits is funded by the Wellcome Trust as part of a Biomedical Resources grant awarded to the Pfam database.

Future work

Deep learning approaches have a number of potential upsides that we’re excited to explore. They can explicitly model interactions between amino acids that are far apart in sequence, and they build a shared model across all protein classes: they attempt to leverage shared information about, say, a helix-turn-helix region across the large variety of biological processes that incorporate this motif.

If the use of deep learning in speech recognition and computer vision is any indication, our current use of it to functionally annotate proteins is in its infancy. We look forward to the development of these models to help us classify the protein universe.

Posted by Alex Bateman

Pfam 34.0 is released

March 24, 2021

Pfam 34.0 contains a total of 19,179 families and 645 clans. Since the last release, we have built 935 new families, killed 15 families and created 11 new clans. UniProt Reference Proteomes has increased by 21% since Pfam 33.1, and now contains 47 million sequences. Of the sequences that are in reference proteomes, 74.5% have at least one Pfam match, and 48.8% of all residues fall within a Pfam family.

Structural models

In our previous blog post, we announced the release of ~6,000 structural models in Pfam and InterPro. Many of the new families that we have created since the last release are large enough to be suitable for structure prediction. We have sent the alignments for new and modified Pfam families to the Baker group, who are currently generating structural models for them using their pipeline. We will release the next set of structural models when Pfam 34.0 is integrated into InterPro.

Collaboration with Google Research

We have been working with Dr Lucy Colwell’s research team at Google Research to expand Pfam coverage using deep learning methods. The deep learning approach, trained on Pfam HMMER matches, has found many additional matches, which can be found in a new file called Pfam-N. A separate Pfam blog post describes the work in more detail.

Pfam 33.1 is released

June 11, 2020

We are pleased to announce the release of Pfam 33.1! Some of you may have noticed that we never released Pfam 33.0 – we had initially planned to do so in March 2020, but due to the global pandemic, we redirected our efforts to updating the Pfam SARS-CoV-2 models instead (see previous blog posts Pfam SARS-CoV-2 special update and Pfam SARS-CoV-2 special update (part 2)). We have added these updated models to the Pfam 33.0 release, along with a few other families that we had built since the data for Pfam 33.0 were frozen, to create Pfam 33.1.

Pfam 33.1 contains a total of 18259 families and 635 clans. Since the last release, we have built 355 new families and killed 25 families. We regularly receive feedback from users about families or domains that are missing in Pfam, and typically add many user-submitted families at each release. We include the submitter’s name and ORCID identifier as an author of such Pfam entries. This helps people to get credit for community activities that improve molecular biology databases such as Pfam.

One such user submission was from Heli Mönttinen (University of Helsinki), who submitted a large-scale clustering of virus families. Based on this clustering, we added 88 new families to Pfam.

Figure 1. Organisation of the TSP1 clan in Pfam shown as a sequence similarity network. Image taken from Xu et al.

Finally, we are very happy to welcome Sara and Lowri who are working as curators for both the Pfam and InterPro resources and are already making great contributions to the resources.

Posted by Jaina and Alex

Pfam 31.0 is released

March 8, 2017

Pfam 31.0 contains a total of 16712 families and 604 clans. Since the last release, we have built 415 new families, killed 9 families and created 11 new clans. We have also been working on expanding our clan classification; in Pfam 31.0, over 36% of Pfam entries are placed within a clan.

Pfam 30.0 is available

July 1, 2016

Pfam 30.0, our second release based on UniProt reference proteomes, is now available. The new release contains a total of 16,306 families, with 22 new families and 11 families killed since the last release. The UniProt reference proteome set has expanded and now includes 17.7 million sequences, compared with 11.9 million when we made Pfam 29.0. In this release, we have updated the annotations on hundreds of Pfam entries, and renamed some of our Domains of Unknown Function (DUF) families.

DUFs are protein domains whose function is uncharacterised. Over time, as scientific knowledge increases and new data about proteins come to light, more information about the function of a domain may become available. As a result, DUFs can be renamed and re-annotated with more meaningful descriptions. As part of Pfam 30.0, we have re-annotated 116 DUFs based on updated information in the UniProtKB database, the scientific literature, and feedback from Pfam and InterPro users. Examples of some of our DUF updates in Pfam 30.0 are given below:

 

  • PF10265, created in release 23.0 and originally named DUF2217, has been renamed to Miga, a family of proteins that promote mitochondrial fusion.
  • PF10229, created in release 23.0 and originally named DUF2246, has been renamed to MMADHC, as it represents methylmalonic aciduria and homocystinuria type D proteins and their homologues. The structure of this domain is shown below.

Structure of MMADHC dimer, PDB:5CV0

  • PF12822, created in release 25.0 and originally named DUF3816, has been renamed to ECF_trnsprt, since it contains proteins identified as the substrate-specific component of energy-coupling factor (ECF) transporters.

Please note that we may change the identifier for a family (e.g. DUF2217), but we never change the accession for a family (e.g. PF10265).
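In practice this means that any local notes or cross-references are safest when keyed by accession rather than identifier. Here is a small sketch using the three renames listed above; the lookup function and the example note are hypothetical and not part of any Pfam tool.

```python
# Accession -> (old identifier, new identifier), from the examples in this post.
RENAMED = {
    "PF10265": ("DUF2217", "Miga"),
    "PF10229": ("DUF2246", "MMADHC"),
    "PF12822": ("DUF3816", "ECF_trnsprt"),
}


def current_identifier(accession: str) -> str:
    """Return the identifier a family carries as of Pfam 30.0."""
    return RENAMED[accession][1]


# A hypothetical user note keyed by the stable accession survives the rename.
notes = {"PF10265": "candidate for follow-up"}
for accession, note in notes.items():
    print(accession, current_identifier(accession), note)
```

A note keyed by DUF2217 would no longer match the family's current identifier, while PF10265 still resolves.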

If you find any more DUFs that can be assigned a name based on function, or any other annotation updates, please get in touch with us (pfam-help@ebi.ac.uk).

 

Pfam 27.0 is now available!

March 22, 2013

In a blog post published just over a year ago, I proposed a number of changes to the content of Pfam to improve scalability and usability of the database. These changes came into effect a few days ago, when we released Pfam 27.0. This release of Pfam contains a total of 14831 families, with 1182 new families and 22 families killed since release 26.0. 80% of all proteins in UniProt contain a match to at least one Pfam domain, and 58% of all residues in the sequence database fall within a Pfam domain.

Dfam: A database of repetitive DNA elements

September 6, 2012

We are pleased to introduce Dfam 1.0, a database of profile HMMs for repetitive DNA elements. Repetitive DNA, especially the remnants of transposable elements (TEs), makes up a large fraction of many genomes, particularly eukaryotic ones. Accurate annotation of these TEs both simplifies downstream genomic analysis and enables research into their fascinating biology and impact on the genome.

Rfam 11.0 is out!

August 14, 2012

The team behind Rfam is pleased to announce the release of Rfam 11.0. This release represents a major update from 10.1, primarily due to the upgrade of our underlying sequence database, Rfamseq.

Does my family of interest have a determined 3D protein structure?

May 9, 2012

Two related questions that we are often asked via the Pfam helpdesk are ‘Which families have a known three-dimensional structure?’ and ‘Why is a particular PDB structure not found in Pfam?’. You may think that there are obvious answers to these questions, but as with many things in life, the answer is not necessarily as straightforward as you would have thought. In this joint posting between Andreas Prlic (senior scientist at RCSB Protein Data Bank) and myself (Rob Finn, Pfam Production Lead), we will elaborate on the way the PDB and Pfam cross-referencing occurs, explain why discrepancies occurred in the past, and describe the pipeline that the RCSB PDB has implemented using the HMMER web services API, which should provide the most current answer to these questions.