Archive for the 'News' Category

AlphaFolding the Protein Universe

July 22, 2021

Hot on the heels of our inclusion of the Baker group’s trRosetta structural models, we are excited to announce the inclusion of models from AlphaFold 2.0, generated by DeepMind and stored in the AlphaFold Database (AlphaFold DB). AlphaFold 2.0’s performance in the CASP14 competition was spectacular, producing structure models of near-experimental quality.

The new AlphaFold models have been constructed for over 375,000 proteins from 22 model organisms, and the vast majority of the models are full-length proteins. This is in contrast to the trRosetta models, which were built from the domain region predicted by Pfam. Having full-length protein models is very exciting for us because it will allow us to more easily check whether we need to extend or change the Pfam domain boundaries. We will also be able to look for missing domains in the protein structures. AlphaFold models also help to fill in gaps when only a part of a longer family has been structurally characterised.

When looking at the AlphaFold models it is important to look at the quality scores of the model overall. Sometimes a good quality structural model cannot be created, but in these cases it is usually obvious from the quality scores, which show the low-confidence parts as orange regions of the model. Disordered regions of proteins are usually of low confidence.
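For readers who want to check model confidence outside the website, here is a minimal shell sketch. It assumes a model downloaded from AlphaFold DB in PDB format, where, to our knowledge, the per-residue confidence score (pLDDT, 0 to 100) is stored in the B-factor column and scores below 50 correspond to the orange, very low confidence regions. The function name plddt_summary and the file name used below are hypothetical.

```shell
# Summarise per-residue confidence from an AlphaFold DB model in PDB format.
# Assumption: the B-factor field (columns 61-66) holds pLDDT.
# Prints: number of residues, mean pLDDT, fraction of residues with pLDDT < 50.
plddt_summary() {
  awk '
    # Use one atom per residue: the C-alpha atom of each ATOM record.
    substr($0, 1, 4) == "ATOM" && substr($0, 13, 4) == " CA " {
      score = substr($0, 61, 6) + 0
      n++
      total += score
      if (score < 50) low++
    }
    END {
      if (n > 0) printf "%d\t%.1f\t%.2f\n", n, total / n, low / n
    }
  ' "$@"
}
```

For example, plddt_summary model.pdb (where model.pdb stands for a downloaded model) prints three tab-separated values. A high fraction in the last column suggests that much of the model is disordered or poorly predicted.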

We think that there are many thousands of Pfam families that could be improved using the AlphaFold and trRosetta models. Feel free to tell us where we could improve them. We are really enjoying mining this treasure trove of data and we hope you find some (not so) hidden gems. 

The Pfam team

Google Research Team bring Deep Learning to Pfam

March 24, 2021

We are delighted to announce the first fruits of a collaboration between the Pfam team and a Google Research team led by Dr Lucy Colwell, with Maxwell Bileschi and David Belanger. In 2019, Colwell’s team published a preprint describing a new deep learning method that was trained on Pfam data, and which improves upon the performance of the HMMER software (HMMER is the underlying software used by Pfam). Colwell’s team embraced our initial sceptical feedback and shared data that helped us to understand the new method’s performance. Over time our scepticism turned into interest as we explored novel findings from the method, and now we are very excited by the potential of these methods to improve our ability to classify sequences into domains and families.

Introducing Pfam-N

We are pleased to share a new file called Pfam-N (N for network), which provides additional Pfam 34.0 matches identified by the Google team. Pfam-N annotates 6.8 million protein regions into 11,438 Pfam families. These regions include nearly 1.8 million full-length protein sequences from UniProtKB Reference Proteomes that previously had no Pfam match, an improvement of 4.2% over the currently-annotated 42.5 million. We also note that among the sequences that get their first Pfam annotation, there are 360 human sequences.

The figure above shows the number of matches to UniProtKB Reference Proteomes 2020_06 for each Pfam release over the last decade (orange). Pfam-N (blue) adds nearly 10% more regions to Pfam v34.0, which, based on the current trend, would have taken us several years to achieve.

How was Pfam-N made?

Deep learning approaches use training examples, much like HMMER, to learn the statistics of what it means for a protein to have a particular function. We use a subset of all the Pfam HMMER matches for training, and provide our deep learning model with both the sequence and Pfam family for each training example. 

We trained a number of replicates (“ensemble elements”) of a convolutional neural network to predict the Pfam matches. We call this ensemble model ProtENN (ENN for Ensemble of Neural Networks). The method relies on HMMER to initially parse proteins into their constituent domains before giving these regions to ProtENN. 

The Pfam-N file is in the standard Pfam Stockholm alignment format, and the ProtENN matches are aligned using the existing Pfam profile-HMM model. We only include a match in Pfam-N if it is not already included in Pfam.
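As a rough illustration of how a Stockholm file such as Pfam-N can be inspected, the sketch below counts the aligned sequences per family. It assumes the usual Pfam layout: one “#=GF AC” line per family, one sequence per line, and “//” closing each alignment. The function name count_stockholm_members is our own, and an interleaved alignment would need extra handling.

```shell
# Count aligned sequences per family in a (possibly multi-family) Stockholm file.
# Prints one line per family: accession, number of sequences.
count_stockholm_members() {
  awk '
    /^#=GF AC/ { acc = $3; next }                       # remember the family accession
    /^\/\//    { print acc "\t" (n + 0); acc = ""; n = 0; next }
    /^#/ || NF == 0 { next }                            # skip markup and blank lines
    { n++ }                                             # any other line is one aligned sequence
  ' "$@"
}
```

With no file argument the function reads standard input, so a compressed file can be piped in, for example gzip -dc FILE | count_stockholm_members, where FILE stands for wherever the alignment file was saved.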

It should be noted that the deep learning model has access to the full set of matches for a Pfam family, whereas the Pfam profile-HMM models are trained on the much smaller Pfam seed alignments. Thus this is not a direct comparison between ProtENN and HMMER. 

Improving Pfam using Pfam-N

We plan to add Pfam-N matches to Pfam seed alignments to help improve the performance of the Pfam profile-HMMs in future releases. Some Pfam families gain huge numbers of additional matches in Pfam-N. For example, the TAT_signal family (PF10518) matches about 4,000 sequences in Pfam 34.0. Pfam-N identifies a further 37,000 protein sequences that were missed by the current Pfam model. The ACT domain (PF01842), which confers regulation to a variety of enzymes by binding to amino acids, is doubled in size by the 27,000 additional matches identified by the deep learning model. Overall, the deep learning models seem to perform particularly well for short families, where the profile-HMMs struggle to distinguish between signal and noise. Large gains are also made for short protein repeats such as TPRs, Leucine Rich Repeats and zinc fingers found in DNA-binding transcription factors.

Funding

The work to expand Pfam families with Pfam-N hits is funded by the Wellcome Trust as part of a Biomedical Resources grant awarded to the Pfam database.

Future work

Deep learning approaches have a number of potential upsides we’re excited to explore. These include explicit modeling of interactions between amino acids that are far apart in sequence, and the fact that these approaches build a shared model across all protein classes: they attempt to leverage shared information about, say, a helix-turn-helix region across the large variety of biological processes that incorporate this motif.

If the use of deep learning in speech recognition and computer vision is any indication, our current use of it to functionally annotate proteins is in its infancy. We look forward to the development of these models to help us classify the protein universe.

Posted by Alex Bateman

Pfam 34.0 is released

March 24, 2021

Pfam 34.0 contains a total of 19,179 families and 645 clans. Since the last release, we have built 935 new families, killed 15 families and created 11 new clans. UniProt Reference Proteomes has increased by 21% since Pfam 33.1, and now contains 47 million sequences. Of the sequences that are in reference proteomes, 74.5% have at least one Pfam match, and 48.8% of all residues fall within a Pfam family.

Structural models

In our previous blog post, we announced the release of ~6,000 structural models in Pfam and InterPro. Many of the new families that we have created since the last release are large enough to be suitable for structure prediction. We have sent the alignments for new and modified Pfam families to the Baker group, who are currently generating structural models for them using their pipeline. We will release the next set of structural models when Pfam 34.0 is integrated into InterPro.

Collaboration with Google Research

We have been working with Dr Lucy Colwell’s research team at Google Research to expand Pfam coverage using deep learning methods. The deep learning approach, trained on Pfam HMMER matches, has found many additional matches which can be found in a new file called Pfam-N. There is another Pfam blog post which describes the work in more detail here.

Folding the Protein Universe

March 3, 2021

Today signifies the realization of a long-held dream to have the structure of every (well, nearly every) family in Pfam. The Pfam and InterPro databases have made available structural models of 6,370 protein families created by Ivan Anishchenko from David Baker’s group at the University of Washington in Seattle. The models are made using their latest prediction method, called trRosetta, which can predict protein structures with incredible accuracy from large multiple sequence alignments.

The Baker group have had remarkable success over the years in the field of structure prediction, and in the recent CASP14 event the group’s predictions were the most accurate from an academic group. Although not quite as accurate as DeepMind’s AlphaFold 2.0 predictions, they are certainly of a high enough quality for many applications. For example, I am interested to understand when a Pfam family is part of a larger superfamily, or clan as we call them in Pfam. I have been able to take the structural models and identify distant homologues in the PDB using tools such as DALI and PDBeFold that compare protein structures. For longer Pfam families we can look at the structure model and identify likely domain boundaries to split up the existing Pfam family into domain-sized chunks (for example, Calmodulin_bind could be split into 3 domains).

Within the InterPro website we have developed a completely novel view that allows you to see which residues in the Pfam seed alignment are predicted to be close in space. By clicking on columns in the alignment, one can see where they are in the structural model and which residues are predicted to be nearby (see the documentation for further details). We would be very interested in getting your feedback on this feature. We could provide a similar view based on contacts found in known structures. The PDB file for individual models can be downloaded from the structural model tab on the family pages within the InterPro and Pfam websites. You can also download all of the structural model and contact map data from the Pfam ftp site and InterPro ftp site.

The figure above shows the contact map and structural model as seen through the InterPro website. Links to some example Pfam families that have a structure model are shown below (for the Pfam links, click on the Structural Model tab):

This is not the first time we have made a large set of structure predictions available. Back in 2002, again in collaboration with David Baker and Rich Bonneau, we made many models available. The overall accuracy of these models was much lower, and we did not know which models were good since we lacked an accurate quality metric. The new models come with a quality score called lDDT and, broadly, we can consider a model with lDDT > 0.6 to be good, and one with lDDT > 0.8 to be excellent.
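The lDDT thresholds above can be written as a small helper. This is only a sketch: the function name classify_lddt and the label “low” for scores of 0.6 or below are our own choices, not part of the released data.

```shell
# Map an lDDT score (0 to 1) to a rough quality label.
# Thresholds follow the text: > 0.8 excellent, > 0.6 good.
# The "low" label for everything else is an assumption made for this sketch.
classify_lddt() {
  awk -v s="$1" 'BEGIN {
    if (s > 0.8)      print "excellent"
    else if (s > 0.6) print "good"
    else              print "low"
  }'
}
```

For instance, classify_lddt 0.72 reports the model as good, while a score of exactly 0.8 still counts as good because the comparison is strict.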

Today marks an amazing milestone, with 88% of Pfam families now having a PDB structure or a structural model. The story is not quite finished though, as there remain 2,202 families that do not have structural data. We plan to investigate different sequence sets to make an even larger set of models available in the coming months. We felt, however, that it was useful to release this data set to the community as fast as we could. The work described here has been possible only due to funding from the BBSRC BBR, which has been a critical part of the funding landscape for UK data resources for many years.

There will be many exciting stories to be told using this structural treasure trove, and we hope it is a beneficial resource to the community. Please let us know what you think of the data, and whether you find the contact maps and models useful.

Posted by Alex Bateman

Dfam 3.2 Release

July 9, 2020

Dfam is proud to announce the release of Dfam 3.2.  This release represents a significant step in the expansion of Dfam by providing early access to uncurated, de novo generated families.  As a demonstration of this new capability, we imported a set of 336 RepeatModeler generated libraries produced by Fergal Martin and Denye Ogeh at the European Bioinformatics Institute (EBI).  Also in this release, Dfam now provides family alignments to the RepeatMasker TE protein database aiding in the discovery of related families and in the classification of uncurated TEs.

Uncurated Family Support

In addition to the fully curated libraries for the model organisms human, mouse, zebrafish, worm and fly, Dfam also includes curated libraries for seven other species.  While a fully curated library is the ultimate goal, support for uncurated families has become an essential aspect of a TE resource due to the increasing rate at which new species are being sequenced and the need to have at least a simple TE masking library available.

By standardizing the storage and tracking of uncurated families, it becomes possible to use these datasets to crudely mask an assembly, provide a first approximation of the TE content, and create a starting point for community curation efforts.  Due to the redundancy and fragmentation inherent in these datasets, we do not compute genome-specific thresholds or generate genome coverage plots for these families.  The latest update to the web portal includes new interfaces for uncurated families and some existing interfaces now include an option to include/omit uncurated families.

In this release, Dfam now contains RepeatModeler de novo-produced libraries for an additional 336 species as the result of the collaboration with EBI researchers (denoted with the new uncurated accession prefix “DR”). Notable taxa expansions include Sauropsida (lizards and birds) and fishes (bony and cartilaginous) (Table 1). Also included are Amphibia, Viridiplantae and additional species in Mammalia.

Table 1. De novo-identified TE families from additional species

Taxon | Number of species | Retrotransposons | DNA transposons | Other
Mammalia | 47 | 18301 | 3781 | 2567
Sauropsida | 164 | 29326 | 11688 | 27192
Amphibia | 6 | 1781 | 2031 | 6107
Actinopterygii (bony fishes) | 116 | 27520 | 51361 | 77006
Chondrichthyes (cartilaginous fishes) | 5 | 1671 | 198 | 2273
Viridiplantae (green plants) | 2 | 896 | 412 | 1687

Aligned Protein Features

In previous versions of Dfam, hand-curated coding regions were provided for a select set of families.  The protein products of these curated sequences were placed in the RepeatMasker TE protein database for use with the RepeatProteinMask tool.  In this release we have used this database with BLASTX to produce alignments to all Dfam families including the uncurated entries.  The resulting alignments are displayed alongside the curated coding regions as the new “aligned” feature track (Figure 1).

Figure 1. Feature track and details for BLASTX alignments to TE protein database.

Website improvements

Several minor improvements have been made to the interface since the previous release.  The browse page now provides links to download the families selected by the query/filter options as HMM, EMBL or FASTA records.  The Seed tab of the Families page now displays the average Kimura divergence of the seed alignment instances to the consensus.

A new Pfam-B is released

June 30, 2020

In addition to our HMM-based Pfam entries (Pfam-A), we used to make a set of automatically generated, non-HMM based entries called Pfam-B. The Pfam-B entries were derived from clusters generated by applying the ADDA algorithm to an all-against-all BLAST search of UniRef-40, and removing any regions covered by Pfam-A. The overhead of producing Pfam-B in this way became too great, and as of Pfam 28.0, we stopped making Pfam-B entries (see [1] for a longer discussion on why we stopped producing Pfam-B). Erik Sonnhammer has devised an alternative method of making Pfam-B using the MMseqs2 software [2], and an overview of the process is given below (more details will follow in the next Pfam paper).

We have already begun to use the new version of Pfam-B to generate new families, and 11 of these are in Pfam 33.1. For example, the TUTase family (PF19088) was built using Pfam-B as the source. We expect that Pfam-B will be a very useful source of additional families in the coming years.

How the new Pfam-B was created

UniProtKB sequences not covered by Pfam-A were clustered using MMseqs2, and multiple sequence alignments of each cluster were generated with FAMSA [3]. This resulted in 136,730 Pfam-B families that on average contain 99 sequences (max 40,912) and are 310 positions wide (max 29,216).
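To give a feel for how such summary statistics can be derived, the sketch below computes the number of clusters and the mean and maximum cluster size from a two-column, tab-separated table of representative and member identifiers, which is the cluster table layout that MMseqs2 writes. The function name cluster_size_stats is hypothetical, and the exact commands and parameters used to build Pfam-B are not given in this post.

```shell
# Summarise cluster sizes from a "representative<TAB>member" table.
# Prints: number of clusters, mean cluster size, maximum cluster size.
cluster_size_stats() {
  awk -F '\t' '
    { size[$1]++ }                       # one row per member, keyed by representative
    END {
      for (rep in size) {
        clusters++
        total += size[rep]
        if (size[rep] > max) max = size[rep]
      }
      if (clusters > 0) printf "%d\t%.1f\t%d\n", clusters, total / clusters, max
    }
  ' "$@"
}
```

Run on a real cluster table, the second and third values correspond to the kind of “on average” and “max” figures quoted above.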

How to access the new Pfam-B

The Pfam-B alignments are released as a tar archive on the Pfam FTP site [Pfam-B.tgz]. We do not plan to integrate them into the Pfam website, but we will generate them for each future Pfam release.

Posted by Erik Sonnhammer and the Pfam team

References

1. Finn et al. (2015) The Pfam protein families database: towards a more sustainable future.

2. Hauser et al. (2016) MMseqs software suite for fast and deep clustering and searching of large protein sequence sets.

3. Deorowicz et al. (2016) FAMSA: Fast and accurate multiple sequence alignment of huge protein families.

Pfam SARS-CoV-2 special update

April 2, 2020

The SARS-CoV-2 pandemic has mobilised a worldwide research effort to understand the pathogen itself and the mechanism of COVID-19 disease, as well as to identify treatment options. Although Pfam already provided useful annotation for SARS-CoV-2, we decided to update our models and annotations for this virus in an effort to help the research community. This post explains what was done and how we are making the data available as quickly as possible.

What have we done?

We assessed all the protein sequences provided by UniProt via its new COVID-19 portal (https://covid-19.uniprot.org/), identified those which lacked an existing Pfam model, and set about building models as required. In some cases we built families based on recently solved structures of SARS-CoV-2 proteins. For example, we built three new families representing the three structural domains of the NSP15 protein (Figure 1) based on the structure by Youngchang Kim and colleagues (http://europepmc.org/article/PPR/PPR115432). In other cases, such as Pfam’s RNA dependent RNA polymerase family (PF00680), we took our existing family and extended its taxonomic range to ensure it included the new SARS-CoV-2 sequences.

Figure 1. The structure of NSP15 (PDB:6VWW) from Kim et al. shows the three new Pfam domains. (1) CoV_NSP15_N (PF19219) Coronavirus replicase NSP15, N-terminal oligomerisation domain in red, (2) CoV_NSP15_M (PF19216) Coronavirus replicase NSP15, middle domain in blue and (3) CoV_NSP15_C (PF19215) Coronavirus replicase NSP15, uridylate-specific endoribonuclease in green.

We have also stratified our ID nomenclature and descriptions of the families to ensure they are both correct and consistent. The majority of the family identifiers now begin with either CoV, for coronavirus specific families, or bCoV for the families which are specific to the betacoronavirus clade, which SARS-CoV-2 belongs to. We have also fixed inconsistencies in the naming and descriptions of the various non-structural proteins, using NSPx for those proteins encoded by the replicase polyprotein, and NSx for those encoded by other ORFs. We are grateful to Philippe Le Mercier from the Swiss Institute of Bioinformatics who gave us valuable guidance for our nomenclature.

Where are the data?

You can access a small HMM library (Pfam-A.SARS-CoV-2.hmm) for all the Pfam families that match the SARS-CoV-2 protein sequences on the Pfam FTP site:

ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam_SARS-CoV-2_1.0/

You can also find a file (matches.scan) showing the matches of the models against the SARS-CoV-2 sequences in the same FTP location. These updates are not yet available on the Pfam website. We anticipate making them available in 6-8 weeks.  We hope you find our SARS-CoV-2 models useful for your research, and as always we welcome your feedback via email at pfam-help@ebi.ac.uk.

How to use this library?

This library is not compatible with the pfam_scan software that we normally recommend to reproduce Pfam matches, as this library only contains a small subset of models.  If you wish to compare these models to your own sequences, please use the following HMMER commands:

$ hmmpress Pfam-A.SARS-CoV-2.hmm

This only needs to be performed once. Then, to compare your sequences (in a file called my.fasta) to this special Pfam profile HMM library, run:

$ hmmscan --cut_ga --domtblout matches.scan Pfam-A.SARS-CoV-2.hmm my.fasta

The --domtblout option enables you to save the matches in a more convenient tabular form, if you do not want to parse the HMMER output.
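If you would rather not read the table by eye, a short awk filter can pull out the key columns. The sketch below assumes the standard HMMER3 domain table layout (family name in column 1, accession in column 2, sequence name in column 4, domain bit score in column 14, envelope coordinates in columns 20 and 21). The function name summarise_domtblout is our own.

```shell
# Reduce an hmmscan --domtblout file to one line per domain match:
# sequence, family, accession, envelope start, envelope end, domain bit score.
summarise_domtblout() {
  awk '
    /^#/ { next }                        # skip header and footer comments
    NF >= 22 {
      printf "%s\t%s\t%s\t%s\t%s\t%s\n", $4, $1, $2, $20, $21, $14
    }
  ' "$@"
}
```

For example, summarise_domtblout matches.scan lists each domain found in your sequences, ready to be sorted or loaded into a spreadsheet.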

And finally

We will be making Pfam alignments available during the next week and will produce another blog post describing them.

Posted by The Pfam team

Rfam 12.1 has been released

April 27, 2016


We are happy to announce a new release of Rfam. Version 12.1, based on the same sequence dataset as Rfam 12.0, features over 20 new families, a new clan competing algorithm, a publicly accessible MySQL database, and many website fixes.


Rfam 12.0 is out

September 24, 2014

We are pleased to announce the release of Rfam 12.0!

Moving to xfam.org

May 1, 2014

Back in November 2012 we announced that the Xfam team in the UK was moving from the Wellcome Trust Sanger Institute to the European Bioinformatics Institute (EMBL-EBI), just next door on the Wellcome Trust Genome Campus. On Tuesday we completed that move by switching off the Pfam and Rfam websites inside Sanger and redirecting all traffic to our shiny new home at xfam.org. You can now find the Pfam and Rfam websites at pfam.xfam.org and rfam.xfam.org respectively.