Folding the Protein Universe

March 3, 2021

Today signifies the realization of a long-held dream to have the structure of every (well, nearly every) family in Pfam. The Pfam and InterPro databases have made available structural models of 6,370 protein families created by Ivan Anishchenko from David Baker’s group at the University of Washington in Seattle. The models are made using their latest prediction method, trRosetta, which can predict protein structures with incredible accuracy from large multiple sequence alignments.

The Baker group have had remarkable success over the years in the field of structure prediction, and in the recent CASP14 event the group’s predictions were the most accurate from an academic group. Although not quite as accurate as DeepMind’s AlphaFold 2.0 predictions, they are certainly of a high enough quality for many applications. For example, I am interested in understanding when a Pfam family is part of a larger superfamily, or clan as we call them in Pfam. I have been able to take the structural models and identify distant homologues in the PDB using tools such as DALI and PDBeFold that compare protein structures. For longer Pfam families we can look at the structural model and identify likely domain boundaries to split up the existing Pfam family into domain-sized chunks (for example, Calmodulin_bind could be split into 3 domains).

Within the InterPro website we have developed a completely novel view that allows you to see which residues in the Pfam seed alignment are predicted to be close in space. By clicking on columns in the alignment, one can see where they are in the structural model and which residues are predicted to be nearby (see the documentation for further details). We would be very interested in getting your feedback on this feature. We could provide a similar view based on contacts found in known structures. The PDB file for individual models can be downloaded from the structural model tab on the family pages within the InterPro and Pfam websites. You can also download all of the structural model and contact map data from the Pfam ftp site and InterPro ftp site.

The figure above shows the contact map and structural model as seen through the InterPro website. Links to some example Pfam families that have a structure model are shown below (for the Pfam links, click on the Structural Model tab):

This is not the first time we have made a large set of structure predictions available. Back in 2002, again in collaboration with David Baker and Rich Bonneau, we made many models available. The overall accuracy of these models was much lower and we did not know which models were good since we lacked an accurate quality metric. The new models come with a quality score called lDDT and, broadly, we can consider a model with lDDT > 0.6 to be good, and one with lDDT > 0.8 to be excellent.
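As a rough illustration of how these thresholds might be applied when triaging models, here is a minimal Python sketch. The function name, the "below threshold" label and the example scores are all made up for illustration; only the 0.6 and 0.8 cut-offs come from the text above.

```python
def classify_lddt(lddt: float) -> str:
    """Label a model using the broad lDDT cut-offs quoted in the post."""
    if not 0.0 <= lddt <= 1.0:
        raise ValueError("lDDT is expected to lie between 0 and 1")
    if lddt > 0.8:
        return "excellent"
    if lddt > 0.6:
        return "good"
    # The post gives no name for models at or below 0.6; this label is ours.
    return "below threshold"


# Hypothetical scores, not real Pfam values.
example_scores = {"family_A": 0.85, "family_B": 0.72, "family_C": 0.41}
for family, score in example_scores.items():
    print(family, score, classify_lddt(score))
```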

Today marks an amazing milestone, with 88% of Pfam families now having a PDB structure or a structural model. The story is not quite finished though, as there remain 2,202 families that do not have structural data. We plan to investigate different sequence sets to make an even larger set of models available in the coming months. We felt however that it was useful to release this data set to the community as fast as we could. The work described here has been possible only due to funding from the BBSRC BBR, which has been a critical part of the funding landscape for UK data resources for many years.

There will be many exciting stories to be told using this structural treasure trove, and we hope it is a beneficial resource to the community. Please let us know what you think of the data, and whether you find the contact maps and models useful.

Posted by Alex Bateman


Rfam 14.4 is live

December 18, 2020

The last Rfam release of 2020 is now live! Rfam 14.4 contains 496 new microRNA families developed in collaboration with miRBase. Find out about the microRNA project in our new NAR paper and let us know if you have any feedback.


Rfam 14.3

September 15, 2020

Rfam 14.3 includes 356 new and 40 updated microRNA families, as well as 12 new and 2 updated Flavivirus RNAs. Find out the details in our new NAR paper and get in touch if you have any feedback.


Dfam 3.2 Release

July 9, 2020

Dfam is proud to announce the release of Dfam 3.2.  This release represents a significant step in the expansion of Dfam by providing early access to uncurated, de novo generated families.  As a demonstration of this new capability, we imported a set of 336 RepeatModeler generated libraries produced by Fergal Martin and Denye Ogeh at the European Bioinformatics Institute (EBI).  Also in this release, Dfam now provides family alignments to the RepeatMasker TE protein database aiding in the discovery of related families and in the classification of uncurated TEs.

Uncurated Family Support

In addition to the fully curated libraries for the model organisms human, mouse, zebrafish, worm and fly, Dfam also includes curated libraries for seven other species.  While a fully curated library is the ultimate goal, support for uncurated families has become an essential aspect of a TE resource due to the increasing rate at which new species are being sequenced and the need to have at least a simple TE masking library available.

By standardizing the storage and tracking of uncurated families, it becomes possible to use these datasets to crudely mask an assembly, provide a first approximation of the TE content, and create a starting point for community curation efforts.  Due to the redundancy and fragmentation inherent in these datasets, we do not compute genome-specific thresholds or generate genome coverage plots for these families.  The latest update to the web portal includes new interfaces for uncurated families and some existing interfaces now include an option to include/omit uncurated families.

In this release, Dfam now contains RepeatModeler de novo-produced libraries for an additional 336 species as the result of the collaboration with EBI researchers (denoted with the new uncurated accession prefix “DR”).  Notable taxa expansions include Sauropsida (lizards and birds) and fishes (bony and cartilaginous) (Table 1). Also included are Amphibia, Viridiplantae and additional species in Mammalia.

Table 1. De novo-identified TE families from additional species

Species | Number (species) | Retrotransposons | DNA transposons | Other
Mammalia471830137812567
Sauropsida164293261168827192
Amphibia6178120316107
Actinopterygii (bony fishes)116275205136177006
Chondrichthyes (cartilaginous fishes)516711982273
Viridiplantae (green plants)28964121687

Aligned Protein Features

In previous versions of Dfam, hand-curated coding regions were provided for a select set of families.  The protein products of these curated sequences were placed in the RepeatMasker TE protein database for use with the RepeatProteinMask tool.  In this release we have used this database with BLASTX to produce alignments to all Dfam families including the uncurated entries.  The resulting alignments are displayed alongside the curated coding regions as the new “aligned” feature track (Figure 1).

Figure 1. Feature track and details for BLASTX alignments to TE protein database.

Website improvements

Several minor improvements have been made to the interface since the previous release.  The browse page now provides links to download the families selected by the query/filter options as HMM, EMBL or FASTA records.  The Seed tab of the Families page now displays the average Kimura divergence of the seed alignment instances to the consensus.


A new Pfam-B is released

June 30, 2020

In addition to our HMM-based Pfam entries (Pfam-A), we used to make a set of automatically generated, non-HMM based entries called Pfam-B. The Pfam-B entries were derived from clusters generated by applying the ADDA algorithm to an all-against-all BLAST search of UniRef-40, and removing any regions covered by Pfam-A. The overhead of producing Pfam-B in this way became too great, and as of Pfam 28.0, we stopped making Pfam-B entries (see [1] for a longer discussion on why we stopped producing Pfam-B). Erik Sonnhammer has devised an alternative method of making Pfam-B using the MMseqs2 software [2], and an overview of the process is given below (more details will follow in the next Pfam paper).

We have already begun to use the new version of Pfam-B to generate new families, and 11 of these are in Pfam 33.1. For example, the TUTase family (PF19088) was built using Pfam-B as the source. We expect that Pfam-B will be a very useful source of additional families in the coming years.

How the new Pfam-B was created

UniProtKB sequences not covered by Pfam-A were clustered using MMseqs2, and multiple sequence alignments of each cluster were generated with FAMSA [3]. This resulted in 136,730 Pfam-B families that on average contain 99 sequences (max 40,912) and are 310 positions wide (max 29,216).

How to access the new Pfam-B

The Pfam-B alignments are released as a tar archive on the Pfam FTP site [Pfam-B.tgz]. We do not plan to integrate them into the Pfam website, but we will generate them for each future Pfam release.

Posted by Erik Sonnhammer and the Pfam team

References

1. Finn et al. (2015) The Pfam protein families database: towards a more sustainable future.

2. Hauser et al. (2016) MMseqs software suite for fast and deep clustering and searching of large protein sequence sets.

3. Deorowicz et al. (2016) FAMSA: Fast and accurate multiple sequence alignment of huge protein families.


Pfam 33.1 is released

June 11, 2020

We are pleased to announce the release of Pfam 33.1! Some of you may have noticed that we never released Pfam 33.0 – we had initially planned to do so in March 2020, but due to the global pandemic, we redirected our efforts to updating the Pfam SARS-CoV-2 models instead (see previous blog posts Pfam SARS-CoV-2 special update and Pfam SARS-CoV-2 special update (part 2)). We have added these updated models to the Pfam 33.0 release, along with a few other families that we had built since the data for Pfam 33.0 were frozen, to create Pfam 33.1.

Pfam 33.1 contains a total of 18259 families and 635 clans. Since the last release, we have built 355 new families and killed 25 families. We regularly receive feedback from users about families or domains that are missing in Pfam, and typically add many user-submitted families at each release. We include the submitter’s name and ORCID identifier as an author of such Pfam entries. This helps people to get credit for community activities that improve molecular biology databases such as Pfam.

One such user submission was from Heli Mönttinen (University of Helsinki), who submitted a large-scale clustering of virus families. Based on this clustering we added 88 new families to Pfam.

We have also added 8 new clans since the last Pfam release. One of the new clans is the TSP1 superfamily (CL0692). Previously a single family (PF00090) attempted to identify all known TSP1 domains. Based on structural work by Marko Hyvönen (University of Cambridge) and colleagues, we have added an additional three families (PF19028, PF19030 and PF19035) to Pfam. These new families have both improved the coverage of the TSP1 domain and better modelled the variations in disulphide bonding across the structure space.

Figure 1. Organisation of the TSP1 clan in Pfam shown as a sequence similarity network. Image taken from Xu et al.

Finally, we are very happy to welcome Sara and Lowri who are working as curators for both the Pfam and InterPro resources and are already making great contributions to the resources.

Posted by Jaina and Alex


Rfam Coronavirus Special Release

April 27, 2020

In response to the SARS-CoV-2 outbreak, the Rfam team prepared a special release dedicated to the Coronavirus RNA families. Release 14.2 includes 10 new and 4 revised families that can be used to annotate SARS-CoV-2 and other Coronavirus genomes with RNA families.

View the data at rfam.org/covid-19 ➡️

New Coronavirus Rfam families

In collaboration with the Marz group and the EVBC, we created 10 families representing the entire 5’- and 3’- untranslated regions (UTRs) for Alpha-, Beta-, Gamma-, and Delta- coronaviruses. A specialised set of alignments for the subgenus Sarbecovirus is also provided, including the SARS-CoV-1 and SARS-CoV-2 UTRs. 

The families are based on a set of high-quality whole genome alignments produced with LocARNA and reviewed by expert virologists. Note that the Alpha-, Beta-, and Deltacoronavirus alignments and structures were refined based on the literature, while the Gammacoronavirus families are based on prediction alone due to the lack of experimental data.


Virus | 5’ UTR | 3’ UTR
Alphacoronavirus | aCoV-5UTR (RF03116) | aCoV-3UTR (RF03121)
Betacoronavirus | bCoV-5UTR (RF03117) | bCoV-3UTR (RF03122)
Sarbecovirus and SARS-CoV-2 | Sarbecovirus-5UTR (RF03120) | Sarbecovirus-3UTR (RF03125)
Gammacoronavirus | gCoV-5UTR (RF03118) | gCoV-3UTR (RF03123)
Deltacoronavirus | dCoV-5UTR (RF03119) | dCoV-3UTR (RF03124)

Previously, only fragments of the UTRs were found in Rfam. In particular, two families were superseded by the new whole-UTR alignments and removed from Rfam:

  • RF00496 (Coronavirus SL-III cis-acting replication element): This family represented a single stem that is now found in aCoV-5UTR and bCoV-5UTR families.
  • RF02910 (Coronavirus_5p_sl_1_2): This family represented two stems from aCoV-5UTR.

The new families are grouped into 2 clans: CL00116 and CL00117 for the 5’ and 3’ UTRs, respectively. The clans can be used with the Infernal cmscan program to automatically select the highest scoring match from a set of related families (see the Rfam chapter in CPB to learn more).
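To make the idea of clan-based selection concrete, here is a simplified Python sketch: among overlapping hits whose families belong to the same clan, only the highest scoring one is kept. This is an illustration of the principle and not Infernal's own implementation; the Hit class, the coordinates and the bit scores are invented, and all hits are assumed to lie on the same strand of a single sequence with start <= end.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Hit:
    family: str
    clan: Optional[str]  # None for families that are not in a clan
    start: int
    end: int
    score: float


def overlaps(a: Hit, b: Hit) -> bool:
    """True if the two hits share at least one position."""
    return a.start <= b.end and b.start <= a.end


def best_per_clan(hits: List[Hit]) -> List[Hit]:
    """Keep the top-scoring hit among overlapping hits from the same clan."""
    kept: List[Hit] = []
    # Visit hits from best to worst so a kept hit always outranks later ones.
    for hit in sorted(hits, key=lambda h: h.score, reverse=True):
        clash = any(
            hit.clan is not None and hit.clan == k.clan and overlaps(hit, k)
            for k in kept
        )
        if not clash:
            kept.append(hit)
    return sorted(kept, key=lambda h: h.start)


# Hypothetical hits on a SARS-CoV-2 genome (coordinates and scores made up).
hits = [
    Hit("bCoV-5UTR", "CL00116", 1, 290, 150.0),
    Hit("Sarbecovirus-5UTR", "CL00116", 1, 299, 310.0),
    Hit("Sarbecovirus-3UTR", "CL00117", 29540, 29870, 280.0),
]
for hit in best_per_clan(hits):
    print(hit.family, hit.start, hit.end, hit.score)
```

In this toy example the two 5’ UTR hits overlap and share a clan, so only the higher scoring Sarbecovirus-5UTR hit is reported alongside the 3’ UTR hit.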

Revised Coronavirus families

We also reviewed and updated the existing Coronavirus Rfam families.

Family | What was updated? | Is it found in SARS-CoV-2?
RF00182 Coronavirus packaging signal | The seed alignment and consensus secondary structure were updated to include the 4 conserved repeat units. | This RNA element is found only in Embecovirus, so it is not present in SARS-CoV-2 and other Sarbecoviruses.
RF00507 Coronavirus frameshifting stimulation element | The seed alignment was expanded. | This RNA is present in SARS-CoV-2.
RF00164 Coronavirus s2m RNA | The seed alignment was expanded. There is a 3D structure for SARS-CoV-1 which can be used for understanding the s2m in SARS-CoV-2. | This RNA is present in the 3’ UTR of SARS-CoV-2.
RF00165 Coronavirus 3’-UTR pseudoknot | The seed alignment was expanded. The pseudoknot is annotated in the 3’ UTR families but since it is mutually exclusive with the 3’-UTR consensus structure, it is also provided as a separate family. | This RNA is present in the 3’ UTR of SARS-CoV-2.

Where to get the data 

You can download the covariance models, as well as seed alignments for the coronavirus families from the corresponding family pages or from a dedicated folder on the FTP archive.

How to use the data

You can download the covariance models and annotate viral sequences with these RNA models using Infernal. See Rfam help for examples.
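As a hedged sketch of what such an annotation run might look like when driven from Python, the snippet below assembles cmpress and cmscan command lines using options described in the Infernal and Rfam documentation. The file names are placeholders, and the particular combination of options is our assumption rather than the Rfam team's exact recipe.

```python
import subprocess
from typing import List, Optional


def cmpress_command(cm_file: str) -> List[str]:
    """Command that indexes a covariance model file (needed once, before cmscan)."""
    return ["cmpress", cm_file]


def cmscan_command(cm_file: str, fasta_file: str, tblout_file: str,
                   clanin_file: Optional[str] = None) -> List[str]:
    """Command that scans sequences against the models and writes a hit table."""
    cmd = ["cmscan", "--cut_ga", "--rfam", "--nohmmonly",
           "--tblout", tblout_file, "--fmt", "2"]
    if clanin_file is not None:
        # Clan membership file, used to flag overlapping hits within a clan.
        cmd += ["--clanin", clanin_file]
    return cmd + [cm_file, fasta_file]


def annotate(cm_file: str, fasta_file: str, tblout_file: str,
             clanin_file: Optional[str] = None) -> None:
    """Run both steps; requires Infernal to be installed and on the PATH."""
    subprocess.run(cmpress_command(cm_file), check=True)
    subprocess.run(
        cmscan_command(cm_file, fasta_file, tblout_file, clanin_file),
        check=True,
    )


# Placeholder file names; print the command rather than running it.
print(" ".join(cmscan_command("coronavirus.cm", "genome.fa", "hits.tblout")))
```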

Inviting all Wikipedians to contribute

We revised the Wikipedia pages associated with each family, and we invite everyone to contribute to the following articles:

Acknowledgements

We would like to thank Kevin Lamkiewicz and Manja Marz (Friedrich Schiller University Jena) for providing the curated alignments for the new families as well as Eric Nawrocki (NCBI) for revising the existing Rfam entries. We also thank Ramakanth Madhugiri (Justus Liebig University Giessen) for reviewing the Coronavirus UTR alignments.


This work is part of the BBSRC-funded project to expand the coverage of viral RNAs in Rfam. More data on SARS-CoV-2 can be found on the European COVID-19 Data Portal.


ABrowse alignment viewer for Pfam coronavirus families

April 24, 2020

The SARS-CoV-2 pandemic continues to stimulate a worldwide scientific response, with results and preprints accumulating at a high rate. In turn, these papers have generated extensive online discussion.

In order to help link these discussions to data, the protein domains in the Pfam SARS-CoV-2 special release are now available for browsing in ABrowse, a phylogenetic alignment and structure browser newly-developed by the team behind the JBrowse genome browser.

Using ABrowse, you can now link to dynamic views of individual domains from the special release. As an example of how this might be used, this is a link to the ORF7a protein, of which a 27 amino acid deletion was recently observed in Arizona. It can be seen quite clearly from the alignment that a 27aa deletion is a not-insignificant dent in this protein, suggesting (as hypothesized by Nick Loman) that this protein might not be necessary for efficient human-to-human transmission of SARS-CoV-2.

You can scroll around the alignment by dragging (ABrowse owes an inspiration debt to the BioJS MSA browser for this feature), and you can use the phylogenetic tree to the left of the alignment to collapse individual clades. ABrowse displays the collapsed clades as a sequence logo, using a probabilistic profile of the ancestral sequence (performed in a separate thread for performance reasons).

For sequences where Pfam includes a link to PDB structures, you can click on a hyperlinked sequence name to bring up a structure visualization using pv, the WebGL PDB viewer. Structures can be rotated, zoomed, or recolored. Mousing over alignment columns will highlight the corresponding residue in all open structures, and vice versa, allowing comparison of homologous amino acids in the structural context.

ABrowse

The image shows the Corona_S2 domain, i.e. the Spike glycoprotein, along with several structures and some ancestral sequence logos.

This version of ABrowse is an alpha release, and is likely to be updated; however, URL bookmarks use the Pfam accession numbers and the JBrowse/ABrowse team plans to keep the alignments up to date. If you have questions, bug reports, or feature requests about ABrowse, please feel free to post them to the ABrowse issues page and/or contact Ian Holmes.

Guest post by Ian Holmes


Pfam SARS-CoV-2 special update (part 2)

April 6, 2020

This post presents an update to last week’s post. Since the initial release of the 40 Pfam profile HMMs that match SARS-CoV-2, we have now produced a set of flatfiles that are more typical of a Pfam release. These files make our updated annotations for the entries available for download before they are released via the Pfam website. Moreover, you can now use the multiple sequence alignments to investigate the conserved positions across different coronavirus proteins. Figure 1 shows the alignment of the SARS-CoV-2 receptor binding domain (PF09408; N.B. the Pfam website still shows the old alignment).


Figure 1 – Excerpt of the Betacoronavirus spike glycoprotein S1, receptor binding domain alignment (Pfam accession PF09408), rendered using Jalview. The SARS-CoV-2 sequence is the last sequence in the alignment.

Finally, we have made some very minor changes to the family descriptions and one name change from the last release.  You can now access all the updated files here:

ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam_SARS-CoV-2_2.0/

In this directory you can find the updated seed (Pfam-A.SARS-CoV-2.seed) and full alignments (Pfam-A.SARS-CoV-2.full) in Stockholm format based on the Pfamseq database, which contains sequences of the UniProt Reference Proteomes. We provide a file with matches to UniProtKB 2019_08 (Pfam-A.SARS-CoV-2.full.uniprot). We also provide a set of alignments for each of the families which include matches to the SARS-CoV-2 sequences that are not yet present in the Pfamseq database. These alignments can be found in aligned FASTA format here or as a tar gzipped library here.
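For readers who want to look at conserved positions programmatically, the following Python sketch reads an aligned FASTA file and scores each column by the fraction of sequences carrying the most common residue. The scoring scheme, the function names and the toy alignment are our own illustrative choices and are not part of the Pfam release.

```python
from collections import Counter
from typing import Dict, List

GAP_CHARS = {"-", "."}


def read_aligned_fasta(text: str) -> Dict[str, str]:
    """Parse aligned FASTA text into {sequence name: aligned sequence}."""
    seqs: Dict[str, List[str]] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            seqs[name] = []
        elif name is not None:
            seqs[name].append(line)
    return {n: "".join(parts) for n, parts in seqs.items()}


def column_conservation(alignment: Dict[str, str]) -> List[float]:
    """Fraction of sequences sharing the most common residue in each column."""
    rows = [s.upper() for s in alignment.values()]
    if not rows:
        return []
    if len({len(r) for r in rows}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    scores = []
    for column in zip(*rows):
        residues = Counter(c for c in column if c not in GAP_CHARS)
        top = residues.most_common(1)[0][1] if residues else 0
        scores.append(top / len(rows))  # gaps count against conservation
    return scores


# Toy alignment, not real coronavirus sequence.
toy = ">seq1\nACD-\n>seq2\nACE-\n>seq3\nAC-F\n"
print(column_conservation(read_aligned_fasta(toy)))
```

Columns that score close to 1.0 are identical in nearly every sequence, which is a simple first pass before inspecting the alignment in a viewer such as Jalview.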

Posted by The Pfam team


Pfam SARS-CoV-2 special update

April 2, 2020

The SARS-CoV-2 pandemic has mobilised a worldwide research effort to understand the pathogen itself and the mechanism of COVID-19 disease, as well as to identify treatment options. Although Pfam already provided useful annotation for SARS-CoV-2, we decided to update our models and annotations for this virus in an effort to help the research community. This post explains what was done and how we are making the data available as quickly as possible.

What have we done?

We assessed all the protein sequences provided by UniProt via its new COVID-19 portal (https://covid-19.uniprot.org/), identified those which lacked an existing Pfam model, and set about building models as required. In some cases we built families based on recently solved structures of SARS-CoV-2 proteins. For example, we built three new families representing the three structural domains of the NSP15 protein (Figure 1) based on the structure by Youngchang Kim and colleagues (http://europepmc.org/article/PPR/PPR115432). In other cases, such as Pfam’s RNA dependent RNA polymerase family (PF00680), we took our existing family and extended its taxonomic range to ensure it included the new SARS-CoV-2 sequences.

Figure 1. The structure of NSP15 (PDB:6VWW) from Kim et al. shows the three new Pfam domains. (1) CoV_NSP15_N (PF19219) Coronavirus replicase NSP15, N-terminal oligomerisation domain in red, (2) CoV_NSP15_M (PF19216) Coronavirus replicase NSP15, middle domain in blue and (3) CoV_NSP15_C (PF19215) Coronavirus replicase NSP15, uridylate-specific endoribonuclease in green.

We have also standardised our ID nomenclature and descriptions of the families to ensure they are both correct and consistent. The majority of the family identifiers now begin with either CoV, for coronavirus-specific families, or bCoV, for the families which are specific to the betacoronavirus clade, to which SARS-CoV-2 belongs. We have also fixed inconsistencies in the naming and descriptions of the various non-structural proteins, using NSPx for those proteins encoded by the replicase polyprotein, and NSx for those encoded by other ORFs. We are grateful to Philippe Le Mercier from the Swiss Institute of Bioinformatics, who gave us valuable guidance for our nomenclature.

Where are the data?

You can access a small HMM library (Pfam-A.SARS-CoV-2.hmm) for all the Pfam families that match the SARS-CoV-2 protein sequences on the Pfam FTP site:

ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam_SARS-CoV-2_1.0/

You can also find a file (matches.scan) showing the matches of the models against the SARS-CoV-2 sequences in the same FTP location. These updates are not yet available on the Pfam website. We anticipate making them available in 6-8 weeks.  We hope you find our SARS-CoV-2 models useful for your research, and as always we welcome your feedback via email at pfam-help@ebi.ac.uk.

How to use this library?

This library is not compatible with the pfam_scan software that we normally recommend to reproduce Pfam matches, as this library only contains a small subset of models.  If you wish to compare these models to your own sequences, please use the following HMMER commands:

$ hmmpress Pfam-A.SARS-CoV-2.hmm

This only needs to be performed once. Then, to compare your sequences (in a file called my.fasta) to this special Pfam profile HMM library, run:

$ hmmscan --cut_ga --domtblout matches.scan Pfam-A.SARS-CoV-2.hmm my.fasta

The --domtblout option enables you to save the matches in a more convenient tabular form, if you do not want to parse the HMMER output.
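If you then want to read the tabular file in a script, a minimal Python sketch along the following lines may help. It relies on the whitespace-separated column layout described in the HMMER user guide for the per-domain table; the field names in the dictionary are our own, and the example line contains made-up values rather than real results.

```python
from typing import Dict, Iterable, List


def parse_domtblout(lines: Iterable[str]) -> List[Dict[str, object]]:
    """Extract a few useful fields from hmmscan --domtblout output lines."""
    hits = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/footer comments
        fields = line.split(None, 22)  # 22 fixed columns, then the description
        if len(fields) < 22:
            raise ValueError("unexpected domtblout line: " + line.rstrip())
        hits.append({
            "family": fields[0],           # target (profile HMM) name
            "accession": fields[1],        # target accession
            "sequence": fields[3],         # query sequence name
            "i_evalue": float(fields[12]),  # independent E-value of the domain
            "score": float(fields[13]),    # bit score of the domain
            "ali_from": int(fields[17]),   # alignment start on the sequence
            "ali_to": int(fields[18]),     # alignment end on the sequence
            "description": fields[22].strip() if len(fields) > 22 else "",
        })
    return hits


# Made-up example line in the expected layout (values are not real results).
example = [
    "# target name  accession  tlen  query name ...",
    "CoV_NSP15_C PF19215.1 152 my_seq - 7096 1.2e-60 200.1 0.0 1 1 "
    "3.4e-64 1.2e-60 200.1 0.0 1 152 6646 6797 6645 6797 0.99 "
    "Coronavirus replicase NSP15",
]
for hit in parse_domtblout(example):
    print(hit["sequence"], hit["family"], hit["ali_from"], hit["ali_to"])
```

With real output you would pass the open file instead of the example list, for instance parse_domtblout(open("matches.scan")).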

And finally

We will be making Pfam alignments available during the next week and will produce another blog post describing them.

Posted by The Pfam team