pfam – Xfam Blog

Pfam 36.0 release

Typhaine Paysan-Lafosse — Mon, 18 Sep 2023 09:40:24 +0000

Pfam 36.0 release is now out! This is a very special release as for the first time Pfam is accessible exclusively via the modern and comprehensive interface of InterPro. Offering new features and easy-to-use functionality, the InterPro website will be the Pfam home for releases to come. All Pfam release files remain accessible through the Pfam ftp site.

If you are new to the InterPro interface and would like to learn how to navigate through the website and explore the Pfam annotations, you can have a look at this video, the updated online training, Pfam documentation or get in touch.

Release content

Pfam 36.0 contains a total of 20,795 families and 660 clans. Since the last release, we have built 1,191 new families, killed 28 families and created 5 new clans.

Additionally, we have updated around 1.5% of existing Pfam entries. 2,818 families have seen a change in their boundaries, and 281 of them have changed by more than 50 residues. Most of these were trimmed or split into domains, often thanks to improved information from accurate structural models.

UniProt Reference Proteomes has increased by 23% since Pfam 35.0, and now contains 75 million sequences. Of the sequences that are in UniProt Reference Proteomes, 76.2% have at least one Pfam match, and 48.6% of all residues fall within a Pfam family.

The accession numbers for this release range from PF20625 to PF21822.

Sources of Pfam families

ECOD

In an effort to harmonise the Pfam and ECOD classifications, we have created 638 new entries, and 50 new clans. However, at the same time we have also removed 45 existing clans, usually because we merged two or more existing Pfam clans into a single entry.

While creating new Pfam families from ECOD and checking their classification, we are able to identify relationships between existing families and the new ones, grouping them together, which means we can either include them in an existing clan or create a new one. This is the case of the new THUMP clan (THioUridine synthases, RNA Methylases and Pseudouridine synthases CL0747), which in Pfam 36.0 includes three families: the existing THUMP domain (PF02926), the THUMP domain of eukaryotic Pus10 (such as human Pus10 Q3MIT2 included in PF21237) and the Ribosomal RNA large subunit methyltransferase M, THUMP-like domain (PF21239). The THUMP domain is involved in RNA metabolism and is present in enzymes involved in at least three unrelated types of RNA-modification.

The Pus10 example is particularly interesting, as it is a pseudouridine synthase with no significant sequence similarity to the other five families of pseudouridine synthases that have been characterised based on sequence homology.

Figure 1. Structure of human Pus10 (2v9k) as seen in InterPro. Highlighted in blue is the THUMP domain represented in Pfam, based on the ECOD classification.

Improving Pfam using AlphaFold

AlphaFold has revolutionised the world of protein structure and protein classification. Since its first release, we have been able to use AlphaFold predicted structures to find a function for domains of unknown function, to create missing domains and to refine Pfam domain boundaries.

An example is the case of the Zinc finger SWIM-type domain-containing protein 3 (ZSWIM3) (Q96MP5). Previously, this protein had two domains defined (Figure 2A). However, using the AlphaFold predicted aligned error graph, we can clearly see that it is actually made of 5 domains: an N-terminal domain, an RNaseH-like domain, a helical domain, a zinc-finger domain and a C-terminal domain (Figure 2B,C). The missing domains were created (Figure 2D): PF21599 (N-terminal), PF21056 (RNaseH-like), PF21609 (helical).

Additionally, the boundaries of one of the existing domains (PF19286) were too large (Figure 2A), so it has been truncated and renamed as ZSWIM1/3, C-terminal domain (Figure 2D).

Figure 2. Example of creation of new Pfam entries and update of an existing Pfam entry using AlphaFold structure prediction for Q96MP5. A) In Pfam 35.0, two entries existed: SWIM zinc-finger domain (PF04434) and DUF5909 (PF19286). B) AlphaFold Predicted Aligned Error plot, showing 5 distinct domains. C) AlphaFold predicted structure with the 5 domains highlighted in different colours. D) In Pfam 36.0, three new entries were created (PF21599 (N-terminal), PF21056 (RNaseH-like), PF21609 (helical)), and DUF5909 (PF19286) was updated.
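To give a feel for how domain boundaries can be read off a Predicted Aligned Error (PAE) matrix, here is a minimal sketch. It is not the procedure used by the Pfam curators, who inspect the plots by eye; it is a deliberately naive greedy segmentation. The function names, the 10 Å cutoff and the toy block-structured matrix are all assumptions made for illustration.

```python
def greedy_domains(pae, cutoff=10.0):
    """Split residues into contiguous segments from a square PAE matrix.

    A residue joins the current segment when its mean PAE against the
    residues already in that segment (both directions) is <= cutoff;
    otherwise a new segment is started. Indices are 0-based, inclusive.
    The cutoff value is an illustrative assumption, not a Pfam setting.
    """
    if not pae:
        return []
    segments = []
    current = [0]
    for i in range(1, len(pae)):
        cross = [pae[i][j] for j in current] + [pae[j][i] for j in current]
        if sum(cross) / len(cross) <= cutoff:
            current.append(i)
        else:
            segments.append((current[0], current[-1]))
            current = [i]
    segments.append((current[0], current[-1]))
    return segments


def toy_pae(block_sizes, within=2.0, between=25.0):
    """Build a synthetic PAE matrix: low error inside blocks, high between."""
    labels = [b for b, size in enumerate(block_sizes) for _ in range(size)]
    return [[within if a == b else between for b in labels] for a in labels]


if __name__ == "__main__":
    # Three hypothetical domains of 4, 3 and 5 residues.
    print(greedy_domains(toy_pae([4, 3, 5])))
```

Real PAE matrices are noisier than this toy example, so any automatic split would still need checking against the structure and the alignment.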

Enjoy Pfam 36.0!

Posted by the Pfam team

Pfam website decommission

Typhaine Paysan-Lafosse — Thu, 04 Aug 2022 11:06:57 +0000

After more than 20 years of good and faithful service, we have decided to retire the Pfam website. Do not worry though, we are still planning to do Pfam releases and the data will still be available.

As you can imagine this wasn’t an easy decision, and be sure it wasn’t taken lightly. The Pfam website codebase was first released over 20 years ago, and although it has been updated from time to time, some of its core functionality still dates back to its origins. There is a lot of technical debt in its current state and it is only becoming harder to maintain.

Currently, on every release, we spend more time generating data exclusively related to the website than generating the core data of Pfam: its alignments and models. Additionally, our team doesn’t have the capacity to execute all the release procedures for Pfam on a consistent basis.

Retiring the website will allow us to focus our efforts on producing the core of Pfam. The plan then is to leave the deployment and visualisation tasks to the InterPro website. InterPro was redesigned in recent years, using up-to-date technologies, including a modern framework (React).

The Pfam data and different viewing features are already available on the InterPro website. For example, searching for a Pfam accession (e.g. PF05093) using the InterPro search by text will take you to the corresponding Pfam entry page, where the menu on the left-hand side gives access to different datasets related to the entry, as shown in the figure below.

Example of Pfam entry page in the InterPro website (PF05093)
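If you would rather build the link to an entry yourself than go through the text search, the sketch below turns a Pfam accession into an InterPro page address and an API address. The base address comes from this post; the path patterns under it are assumptions based on how the InterPro site is laid out at the time of writing, and the function name is hypothetical.

```python
import re

# Base address from the post; the paths appended below are assumptions.
INTERPRO_BASE = "https://www.ebi.ac.uk/interpro"

# Pfam accessions are "PF" followed by five digits, e.g. PF05093.
_PFAM_ACCESSION = re.compile(r"PF\d{5}")


def pfam_entry_urls(accession):
    """Return assumed InterPro web and API addresses for a Pfam accession."""
    acc = accession.strip().upper()
    if not _PFAM_ACCESSION.fullmatch(acc):
        raise ValueError(f"not a Pfam accession: {accession!r}")
    return {
        "page": f"{INTERPRO_BASE}/entry/pfam/{acc}/",
        "api": f"{INTERPRO_BASE}/api/entry/pfam/{acc}/",
    }


if __name__ == "__main__":
    for kind, url in pfam_entry_urls("PF05093").items():
        print(kind, url)
```

The function only builds strings and makes no network request, so it is worth checking the resulting addresses in a browser before relying on them.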

The correspondence between the Pfam menu and InterPro menu is given in the table below.

Pfam website tab	InterPro website tab
Summary	Overview
Clan	Available in Overview, Set
Domain organisation	Domain Architectures
Alignments	Alignment
HMM logo	Signature
Trees	Taxonomy (tree icon)
Curation & model	Curation
Species	Taxonomy
Structures	Structures
AlphaFold structures	AlphaFold
trRosetta	RoseTTAFold
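For anyone updating scripts or internal documentation that refer to the old tab names, the table above can be kept as a small lookup. This is only a convenience sketch: the dictionary copies the table row by row, and the function name is hypothetical.

```python
# Pfam website tab -> where to find the same data on the InterPro website.
PFAM_TO_INTERPRO_TAB = {
    "Summary": "Overview",
    "Clan": "Overview, Set",
    "Domain organisation": "Domain Architectures",
    "Alignments": "Alignment",
    "HMM logo": "Signature",
    "Trees": "Taxonomy (tree icon)",
    "Curation & model": "Curation",
    "Species": "Taxonomy",
    "Structures": "Structures",
    "AlphaFold structures": "AlphaFold",
    "trRosetta": "RoseTTAFold",
}

# Lower-cased keys so that lookups are not case sensitive.
_LOOKUP = {name.lower(): target for name, target in PFAM_TO_INTERPRO_TAB.items()}


def interpro_tab(pfam_tab):
    """Return the InterPro tab for a Pfam tab name, or None if unknown."""
    return _LOOKUP.get(pfam_tab.strip().lower())


if __name__ == "__main__":
    print(interpro_tab("HMM logo"))
```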

You can also browse through the different Pfam families and clans (called Set in InterPro) using the InterPro Browse feature.

In the Overview tab of the Set pages in InterPro, the different members of the set (nodes) and the relationships between them (lines) are displayed in a graph (it corresponds to the Relationship tab in the Pfam website). The size of the nodes is proportional to the number of proteins in the Pfam entry. The graph can be customised to display the Pfam Accession, short name and/or name. Other tabs include Entries (equivalent to the Members section in the Summary tab in Pfam), Proteins, Structures, Taxonomy (equivalent to the Species tab in Pfam), Proteomes and alignments. Additionally, the Proteins tab in InterPro lists all the proteins matched by the different Pfam entries included in a set.

We are aware that not all Pfam users are familiar with the InterPro website interface, so the decommission will be phased over several months, starting on October 5th 2022. On that date, we will start redirecting traffic from Pfam (pfam.xfam.org) to InterPro (www.ebi.ac.uk/interpro). The Pfam website will remain available at pfam-legacy.xfam.org until January 2023, when it will be decommissioned. We are also going to organise a webinar to show you where to find the Pfam annotations in InterPro, so stay tuned and check our Twitter accounts (@PfamDB/@InterProDB) to register.

If you have any requests, feedback or suggestions on ways to improve Pfam data visualisation in InterPro please contact us through the InterPro helpdesk.

Written by Typhaine Paysan-Lafosse.

Pfam 35.0 is released

jainamistry — Fri, 19 Nov 2021 09:25:30 +0000

Pfam 35.0 contains a total of 19,632 families and clans. Since the last release, we have built 460 new families, killed 7 families and created 12 new clans. UniProt Reference Proteomes has increased by 7% since Pfam 34.0, and now contains 61 million sequences. Of the sequences that are in UniProt Reference Proteomes, 75.2% have at least one Pfam match, and 48.7% of all residues fall within a Pfam family.

Sources of new families

In an effort to increase the Pfam coverage of metagenomic sequence space, we have created 250 metagenomic protein families. These families were built by clustering protein sequences from the MGnify and UniProt databases, aligning the sequences in each cluster, and using the resulting alignments to create new SEED alignments. We then used our usual building process to create new families from the SEED alignments.

We have also created 52 new families based on clusters from a new resource called DPCfam, which is based on Density Peak Clustering and was created by Alessandro Laio, Marco Punta and Elena Tea Russo. An interesting example of these families is the N-terminal domain of the Crinkler effector protein (PF20147). Crinkling- and necrosis-inducing proteins (CRNs), or Crinklers, were first described in plant pathogenic oomycetes, where they are ubiquitous, and have been shown to participate in processes controlling plant cell death and immunity. However, Crinkler is also found outside oomycetes, such as in the Rhizophagus irregularis crinkler effector protein 1 (RiCRN1) which, like other CRNs, functions in the plant nucleus, but plays an essential role in symbiosis progression and the proper initiation of arbuscule development. This suggests that Crinkler proteins are more widely distributed than first predicted, and that their function is not limited to plant cell death (PMID:30233541). The Pfam domain contains the conserved motif FLAK and, from structure predictions, adopts the ubiquitin-like fold, as seen in the image below.

Figure 1: N-terminal domain of RiCRN. The image was generated using an AlphaFold colab notebook and is displayed using Molstar.

We continue to be provided with new families from the group of L. Aravind from NCBI, and have added 42 of them to this release of Pfam. Many of these families represent novel domains and proteins found in phage defence systems of bacteria.

Pfam-N

We are really excited about the Pfam-N matches for Pfam 35.0, but there is still a bit of work to do before we can release them. In particular, we’re working on neural networks that can predict the location of the domains themselves, instead of relying on HMMER to do so, as with the previous Pfam-N release. It will take a little more time to ensure the quality of these annotations, and we will make another announcement when they are ready.

Enjoy Pfam 35.0!

Posted by the Pfam team

AlphaFolding the Protein Universe

alexbateman — Thu, 22 Jul 2021 17:43:43 +0000

Hot on the heels of our inclusion of the Baker group’s trRosetta structural models, we are excited to announce the inclusion of models from AlphaFold 2.0 generated by DeepMind and stored in the AlphaFold Database (AlphaFold DB). AlphaFold 2.0’s performance in the CASP14 competition was spectacular, producing near experimental quality structure models.

The new AlphaFold models have been constructed for over 375,000 proteins from 22 model organisms and the very large majority of the models are full length proteins. This is in contrast to the trRosetta models, which were built from the domain region predicted by Pfam. Having full length protein models is very exciting for us because it will allow us to more easily check whether we need to extend or change the Pfam domain boundaries. We will also be able to look for missing domains in the protein structures. AlphaFold models also help to fill in gaps when only a part of a longer family has been structurally characterised.

When looking at the AlphaFold models it is important to look at the quality scores of the model overall. Sometimes a good quality structural model cannot be created, but in these cases it is usually obvious from the quality scores shown as orange regions of the model. Disordered regions of proteins are usually of low confidence.

We think that there are many thousands of Pfam families that could be improved using the AlphaFold and trRosetta models. Feel free to tell us where we could improve them. We are really enjoying mining this treasure trove of data and we hope you find some (not so) hidden gems.

The Pfam team

Google Research Team bring Deep Learning to Pfam

jainamistry — Wed, 24 Mar 2021 10:48:55 +0000

We are delighted to announce the first fruits of a collaboration between the Pfam team and a Google Research team led by Dr Lucy Colwell, with Maxwell Bileschi and David Belanger. In 2019, Colwell’s team published a preprint describing a new deep learning method that was trained on Pfam data, and which improves upon the performance of the HMMER software (HMMER is the underlying software used by Pfam). Colwell’s team embraced our initial sceptical feedback and shared data that helped us to understand the new method’s performance. Over time our scepticism turned into interest as we explored novel findings from the method, and now we are very excited by the potential of these methods to improve our ability to classify sequences into domains and families.

Introducing Pfam-N

We are pleased to share a new file called Pfam-N (N for network), which provides additional Pfam 34.0 matches identified by the Google team. Pfam-N annotates 6.8 million protein regions into 11,438 Pfam families. These regions include nearly 1.8 million full-length protein sequences from UniProtKB Reference Proteomes that previously had no Pfam match, an improvement of 4.2% over the currently-annotated 42.5 million. We also note that among the sequences that get their first Pfam annotation, there are 360 human sequences.
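The 4.2% figure follows directly from the two sequence counts quoted above. As a quick check, using only those rounded numbers (the function name is ours):

```python
def percent_gain(newly_annotated, previously_annotated):
    """Newly annotated sequences as a percentage of those already annotated."""
    return 100.0 * newly_annotated / previously_annotated


if __name__ == "__main__":
    # Rounded figures from the post: 1.8 million new, 42.5 million existing.
    print(f"{percent_gain(1.8e6, 42.5e6):.1f}%")  # prints 4.2%
```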

The figure above shows the number of matches to UniProtKB Reference Proteomes 2020_06 for each Pfam release over the last decade (orange). Pfam-N (blue) adds nearly 10% more regions to Pfam v34.0, which based on the current trend, would have taken several years for us to achieve.

How was Pfam-N made?

Deep learning approaches use training examples, much like HMMER, to learn the statistics of what it means for a protein to have a particular function. We use a subset of all the Pfam HMMER matches for training, and provide our deep learning model with both the sequence and Pfam family for each training example.

We trained a number of replicates (“ensemble elements”) of a convolutional neural network to predict the Pfam matches. We call this ensemble model ProtENN (ENN for Ensemble of Neural Networks). The method relies on HMMER to initially parse proteins into their constituent domains before giving these regions to ProtENN.
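To illustrate the idea of an ensemble, the sketch below combines the outputs of several replicates by averaging their per-family scores and picking the top family. This post does not say how ProtENN combines its ensemble elements, so the averaging rule is an assumption, and the scores in the example are invented.

```python
def ensemble_predict(replicate_scores):
    """Average per-family scores over replicates and return the best family.

    replicate_scores is a list with one dict per replicate, each mapping a
    Pfam accession to that replicate's score for the query region.
    Averaging is an illustrative choice, not necessarily ProtENN's rule.
    """
    if not replicate_scores:
        raise ValueError("need at least one replicate")
    totals = {}
    for scores in replicate_scores:
        for family, score in scores.items():
            totals[family] = totals.get(family, 0.0) + score
    averaged = {fam: total / len(replicate_scores) for fam, total in totals.items()}
    best = max(averaged, key=averaged.get)
    return best, averaged[best]


if __name__ == "__main__":
    # Invented scores from three hypothetical replicates for one region.
    replicates = [
        {"PF10518": 0.6, "PF01842": 0.4},
        {"PF10518": 0.3, "PF01842": 0.7},
        {"PF10518": 0.2, "PF01842": 0.8},
    ]
    print(ensemble_predict(replicates))
```

In this made-up example, one replicate on its own would have picked a different family from the ensemble, which is the kind of disagreement an ensemble is meant to smooth out.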

The Pfam-N file is in the standard Pfam Stockholm alignment format, and the ProtENN matches are aligned using the existing Pfam profile-HMM model. We only include a match in Pfam-N if it is not already included in Pfam.
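If you have not worked with Stockholm files before, the sketch below shows the parts of the format that matter for reading the matches: the "#=GF AC" line carrying the family accession, the aligned sequence rows, and the "//" line that ends each alignment. It is a minimal reader that ignores all other annotation lines; the example record and its accession are invented, and for real work an established parser is the safer choice.

```python
def parse_stockholm(text):
    """Yield (accession, rows) for each alignment in a Stockholm string.

    rows maps "name/start-end" to the aligned sequence. Interleaved blocks
    are handled by concatenating repeated names. Annotation lines other
    than "#=GF AC" are skipped.
    """
    accession, rows = None, {}
    for raw in text.splitlines():
        line = raw.rstrip()
        if not line or line.startswith("# STOCKHOLM"):
            continue
        if line == "//":
            yield accession, rows
            accession, rows = None, {}
        elif line.startswith("#=GF AC"):
            accession = line.split()[2]
        elif line.startswith("#"):
            continue
        else:
            parts = line.split(None, 1)
            if len(parts) == 2:
                name, aligned = parts
                rows[name] = rows.get(name, "") + aligned.strip()


# Invented two-sequence record, for illustration only.
EXAMPLE = """
# STOCKHOLM 1.0
#=GF ID   Toy_family
#=GF AC   PF00000.1
seqA/1-8     MKV-LAAG
seqB/3-10    MRVALA-G
//
"""

if __name__ == "__main__":
    for acc, rows in parse_stockholm(EXAMPLE):
        print(acc, len(rows), "sequences")
```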

It should be noted that the deep learning model has access to the full set of matches for a Pfam family, whereas the Pfam profile-HMM models are trained on the much smaller Pfam seed alignments. Thus this is not a direct comparison between ProtENN and HMMER.

Improving Pfam using Pfam-N

We plan to add Pfam-N matches to Pfam seed alignments to help improve the performance of the Pfam profile-HMMs in future releases. Some Pfam families gain huge numbers of additional matches in Pfam-N. For example, the TAT_signal family (PF10518) matches about 4,000 sequences in Pfam 34.0. Pfam-N identifies a further 37,000 protein sequences that were missed by the current Pfam model. The ACT domain (PF01842), which confers regulation to a variety of enzymes by binding to amino acids, is doubled in size by the 27,000 additional matches identified by the deep learning model. Overall, the deep learning models seem to perform particularly well for short families, where the profile-HMMs struggle to distinguish between signal and noise. Large gains are also made for short protein repeats such as TPRs, Leucine Rich Repeats and zinc fingers found in DNA-binding transcription factors.

Funding

The work to expand Pfam families with Pfam-N hits is funded by the Wellcome Trust as part of a Biomedical Resources grant awarded to the Pfam database.

Future work

Deep learning approaches have a number of potential upsides we’re excited to explore, including explicit modeling of interactions between amino acids that are quite far from each other in sequence, as well as the fact that these approaches build a shared model across all protein classes: they attempt to leverage shared information, about, say, a helix-turn-helix region for all of the large variety of biological processes that incorporate this motif.

If the use of deep learning in speech recognition and computer vision is any indication, our current usage to functionally annotate proteins is in its infancy. We look forward to the development of these models to help us classify the protein universe.

Posted by Alex Bateman

Pfam 34.0 is released

jainamistry — Wed, 24 Mar 2021 10:48:43 +0000

Pfam 34.0 contains a total of 19,179 families and 645 clans. Since the last release, we have built 935 new families, killed 15 families and created 11 new clans. UniProt Reference Proteomes has increased by 21% since Pfam 33.1, and now contains 57 million sequences. Of the sequences that are in reference proteomes, 74.5% have at least one Pfam match, and 48.8% of all residues fall within a Pfam family.

Structural models

In our previous blog post, we announced the release of ~6,000 structural models in Pfam and InterPro. Many of the new families that we have created since the last release are large enough to be suitable for structure prediction. We have sent the alignments for new and modified Pfam families to the Baker group, who are currently generating structural models for them using their pipeline. We will release the next set of structural models when Pfam 34.0 is integrated into InterPro.

Collaboration with Google Research

We have been working with Dr Lucy Colwell’s research team at Google Research to expand Pfam coverage using deep learning methods. The deep learning approach, trained on Pfam HMMER matches, has found many additional matches which can be found in a new file called Pfam-N. There is another Pfam blog post which describes the work in more detail here.

Folding the Protein Universe

alexbateman — Wed, 03 Mar 2021 11:47:32 +0000

Today signifies the realization of a long-held dream to have the structure of every (well, nearly every) family in Pfam. The Pfam and InterPro databases have made available structural models of 6,370 protein families created by Ivan Anishchenko from David Baker’s group at the University of Washington in Seattle. The models are made using their latest prediction method, called trRosetta, which can predict protein structures, based on large multiple sequence alignments, with incredible accuracy.

The Baker group have had remarkable success over the years in the field of structure prediction, and in the recent CASP14 event the group’s predictions were the most accurate from an academic group. Although not quite as accurate as DeepMind’s AlphaFold 2.0 predictions, they are certainly of a high enough quality for many applications. For example, I am interested to understand when a Pfam family is part of a larger superfamily, or clan as we call them in Pfam. I have been able to take the structural models and identify distant homologues in the PDB using tools such as DALI and PDBeFold that compare protein structures. For longer Pfam families we can look at the structure model and identify likely domain boundaries to split up the existing Pfam family into domain-sized chunks (for example Calmodulin_bind could be split into 3 domains).

Within the InterPro website we have developed a completely novel view that allows you to see which residues in the Pfam seed alignment are predicted to be close in space. By clicking on columns in the alignment, one can see where they are in the structural model and which residues are predicted to be nearby (see the documentation for further details). We would be very interested in getting your feedback on this feature. We could provide a similar view based on contacts found in known structures. The PDB file for individual models can be downloaded from the structural model tab on the family pages within the InterPro and Pfam websites. You can also download all of the structural model and contact map data from the Pfam ftp site and InterPro ftp site.

The figure above shows the contact map and structural model as seen through the InterPro website. Links to some example Pfam families that have a structure model are shown below (for the Pfam links, click on the Structural Model tab):

DUF1302 – Pfam, InterPro
Big_6 – Pfam, InterPro
MBA1 – Pfam, InterPro
Calmodulin_bind – Pfam, InterPro
Diacid_rec – Pfam, InterPro

This is not the first time we have made a large set of structure predictions available. Back in 2002, again in collaboration with David Baker and Rich Bonneau we made many models available. The overall accuracy of these models was much lower and we did not know which models were good since we lacked an accurate quality metric. The new models come with a quality score called lDDT and broadly, we can consider a model with lDDT > 0.6 to be good, and one with lDDT > 0.8 to be excellent.
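If you are filtering the models in bulk, the two rule-of-thumb thresholds above translate into a few lines of code. The thresholds come from the text; the function name and the "low" label for everything else are our own shorthand, not an official category.

```python
def model_quality(lddt):
    """Label a structural model using the rule-of-thumb lDDT thresholds.

    lDDT > 0.8 is treated as excellent and lDDT > 0.6 as good, as in the
    text. "low" is an informal label for everything at or below 0.6.
    """
    if not 0.0 <= lddt <= 1.0:
        raise ValueError(f"lDDT should be between 0 and 1, got {lddt}")
    if lddt > 0.8:
        return "excellent"
    if lddt > 0.6:
        return "good"
    return "low"


if __name__ == "__main__":
    for score in (0.85, 0.70, 0.45):
        print(score, model_quality(score))
```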

Today marks an amazing milestone, with 88% of Pfam families now having a PDB structure or a structural model. The story is not quite finished though, as there remain 2,202 families that do not have structural data. We plan to investigate different sequence sets to make an even larger set of models available in the coming months. We felt however that it was useful to release this data set to the community as fast as we could. The work described here has been possible only due to funding from the BBSRC BBR, which has been a critical part of the funding landscape for UK data resources for many years.

There will be many exciting stories to be told using this structural treasure trove, and we hope it is a beneficial resource to the community. Please let us know what you think of the data, and whether you find the contact maps and models useful.

Posted by Alex Bateman

A new Pfam-B is released

erison9 — Tue, 30 Jun 2020 12:19:24 +0000

In addition to our HMM-based Pfam entries (Pfam-A), we used to make a set of automatically generated, non-HMM based entries called Pfam-B. The Pfam-B entries were derived from clusters generated by applying the ADDA algorithm to an all-against-all BLAST search of UniRef-40, and removing any regions covered by Pfam-A. The overhead of producing Pfam-B in this way became too great, and as of Pfam 28.0, we stopped making Pfam-B entries (see [1] for a longer discussion on why we stopped producing Pfam-B). Erik Sonnhammer has devised an alternative method of making Pfam-B using the MMseqs2 software [2], and an overview of the process is given below (more details will follow in the next Pfam paper).

We have already begun to use the new version of Pfam-B to generate new families, and 11 of these are in Pfam 33.1. For example, the TUTase family (PF19088) was built using Pfam-B as the source. We expect that Pfam-B will be a very useful source of additional families in the coming years.

How the new Pfam-B was created

UniProtKB sequences not covered by Pfam-A were clustered using MMseqs2 and multiple sequence alignments of each cluster were generated with FAMSA [3]. This resulted in 136,730 Pfam-B families that on average contain 99 sequences (max 40,912) and are 310 positions wide (max 29,216).

How to access the new Pfam-B

The Pfam-B alignments are released as a tar archive on the Pfam FTP site [Pfam-B.tgz]. We do not plan to integrate them into the Pfam website, but we will generate them for each future Pfam release.
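With well over a hundred thousand alignments in the archive, it can be handy to peek inside before unpacking everything. The sketch below lists the first few files in a downloaded copy of the archive using Python's standard tarfile module. The internal layout and file naming of Pfam-B.tgz are not described in this post, so the code makes no assumptions about them, and the local path is hypothetical.

```python
import tarfile


def list_alignment_members(archive_path, limit=5):
    """Return the names of the first `limit` regular files in a tar archive.

    The archive is streamed member by member, so nothing is extracted to
    disk. "r:*" lets tarfile detect the compression automatically.
    """
    names = []
    with tarfile.open(archive_path, "r:*") as archive:
        for member in archive:
            if member.isfile():
                names.append(member.name)
                if len(names) >= limit:
                    break
    return names


if __name__ == "__main__":
    import os

    local_copy = "Pfam-B.tgz"  # hypothetical path to a downloaded copy
    if os.path.exists(local_copy):
        print(list_alignment_members(local_copy))
```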

Posted by Erik Sonnhammer and the Pfam team

References

1. Finn et al. (2015) The Pfam protein families database: towards a more sustainable future.

2. Hauser et al. (2016) MMseqs software suite for fast and deep clustering and searching of large protein sequence sets.

3. Deorowicz et al. (2016) FAMSA: Fast and accurate multiple sequence alignment of huge protein families.

Pfam 33.1 is released

jainamistry — Thu, 11 Jun 2020 08:58:21 +0000

We are pleased to announce the release of Pfam 33.1! Some of you may have noticed that we never released Pfam 33.0 – we had initially planned to do so in March 2020, but due to the global pandemic, we redirected our efforts to updating the Pfam SARS-CoV-2 models instead (see previous blog posts Pfam SARS-CoV-2 special update and Pfam SARS-CoV-2 special update (part 2)). We have added these updated models to the Pfam 33.0 release, along with a few other families that we had built since the data for Pfam 33.0 were frozen, to create Pfam 33.1.

Pfam 33.1 contains a total of 18,259 families and 635 clans. Since the last release, we have built 355 new families and killed 25 families. We regularly receive feedback from users about families or domains that are missing in Pfam, and typically add many user-submitted families at each release. We include the submitter’s name and ORCID identifier as an author of such Pfam entries. This helps people to get credit for community activities that improve molecular biology databases such as Pfam.

One such user submission was from Heli Mönttinen (University of Helsinki) who submitted a large scale clustering of virus families. Based on this clustering we added 88 new families to Pfam.

Figure 1. Organisation of the TSP1 clan in Pfam shown as a sequence similarity network. Image taken from Xu et al.

Finally, we are very happy to welcome Sara and Lowri who are working as curators for both the Pfam and InterPro resources and are already making great contributions to the resources.

Posted by Jaina and Alex

Pfam SARS-CoV-2 special update (part 2)

alexbateman — Mon, 06 Apr 2020 16:10:58 +0000

This post presents an update to last week’s post. Since the initial release of the 40 Pfam profile HMMs that match SARS-CoV-2, we have now produced a set of flatfiles that are more typical of a Pfam release. These files make our updated annotations that describe the entries available for download, prior to being released via the Pfam website. Moreover, you can now use the multiple sequence alignments to investigate the conserved positions across different coronavirus proteins. Figure 1 shows the alignment of the SARS-CoV-2 receptor binding domain (PF09408 N.B. the Pfam website still shows the old alignment).

Figure 1 – Excerpt of the Betacoronavirus spike glycoprotein S1, receptor binding domain alignment (Pfam accession PF09408), rendered using Jalview. The SARS-CoV-2 sequence is the last sequence in the alignment.
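As a starting point for looking at conserved positions in these alignments, the sketch below scores each column by the fraction of sequences that share its most common residue. This is one simple definition of conservation among many, the function name is ours, and the three short sequences are invented rather than taken from PF09408.

```python
from collections import Counter


def column_conservation(aligned_seqs, gap_chars="-."):
    """Return one score per alignment column, between 0 and 1.

    The score is the count of the most common residue in the column
    divided by the total number of sequences, so gaps lower the score.
    """
    if not aligned_seqs:
        return []
    width = len(aligned_seqs[0])
    if any(len(seq) != width for seq in aligned_seqs):
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for col in range(width):
        residues = [s[col].upper() for s in aligned_seqs if s[col] not in gap_chars]
        if residues:
            top_count = Counter(residues).most_common(1)[0][1]
            scores.append(top_count / len(aligned_seqs))
        else:
            scores.append(0.0)
    return scores


if __name__ == "__main__":
    toy = ["MKVL", "MRVL", "MK-L"]  # invented aligned sequences
    print([round(s, 2) for s in column_conservation(toy)])
```

Columns scoring close to 1 are the ones worth a closer look in Jalview, where the residue properties and the surrounding context are easier to judge.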

Finally, we have made some very minor changes to the family descriptions and one name change from the last release. You can now access all the updated files here:

ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam_SARS-CoV-2_2.0/

In this directory you can find the updated seed (Pfam-A.SARS-CoV-2.seed) and full alignments (Pfam-A.SARS-CoV-2.full) in Stockholm format based on the Pfamseq database, which contains sequences of the UniProt Reference Proteomes. We provide a file with matches to UniProtKB 2019_08 (Pfam-A.SARS-CoV-2.full.uniprot). We also provide a set of alignments for each of the families which include matches to the SARS-CoV-2 sequences which are not as yet present in the Pfamseq database. These alignments can be found in aligned fasta format here or as a tar gzipped library here.

Posted by The Pfam team