Dfam 3.2 Release

July 9, 2020

Dfam is proud to announce the release of Dfam 3.2.  This release represents a significant step in the expansion of Dfam by providing early access to uncurated, de novo generated families.  As a demonstration of this new capability, we imported a set of 336 RepeatModeler generated libraries produced by Fergal Martin and Denye Ogeh at the European Bioinformatics Institute (EBI).  Also in this release, Dfam now provides family alignments to the RepeatMasker TE protein database aiding in the discovery of related families and in the classification of uncurated TEs.

Uncurated Family Support

In addition to the fully curated libraries for the model organisms human, mouse, zebrafish, worm and fly, Dfam also includes curated libraries for seven other species.  While a fully curated library is the ultimate goal, support for uncurated families has become an essential aspect of a TE resource due to the increasing rate at which new species are being sequenced and the need to have at least a simple TE masking library available.

By standardizing the storage and tracking of uncurated families, it becomes possible to use these datasets to crudely mask an assembly, provide a first approximation of the TE content, and create a starting point for community curation efforts.  Due to the redundancy and fragmentation inherent in these datasets, we do not compute genome-specific thresholds or generate genome coverage plots for these families.  The latest update to the web portal includes new interfaces for uncurated families and some existing interfaces now include an option to include/omit uncurated families.

In this release, Dfam now contains RepeatModeler de novo-produced libraries for an additional 336 species as the result of the collaboration with EBI researchers (denoted with the new uncurated accession prefix “DR”).  Notable taxa expansions include Sauropsida (lizards and birds) and fishes (bony and cartilaginous) (Table 1). Also included are Amphibia, Viridiplantae and additional species in Mammalia.

Table 1. De novo-identified TE families from additional species

Species    Number (species)    Retrotransposons    DNA transposons    Other
Mammalia471830137812567
Sauropsida164293261168827192
Amphibia6178120316107
Actinopterygii (bony fishes)116275205136177006
Chondrichthyes (cartilaginous fishes)516711982273
Viridiplantae (green plants)28964121687

Aligned Protein Features

In previous versions of Dfam, hand-curated coding regions were provided for a select set of families.  The protein products of these curated sequences were placed in the RepeatMasker TE protein database for use with the RepeatProteinMask tool.  In this release we have used this database with BLASTX to produce alignments to all Dfam families including the uncurated entries.  The resulting alignments are displayed alongside the curated coding regions as the new “aligned” feature track (Figure 1).

Figure 1. Feature track and details for BLASTX alignments to TE protein database.

Website improvements

Several minor improvements have been made to the interface since the previous release.  The browse page now provides links to download the families selected by the query/filter options as HMM, EMBL or FASTA records.  The Seed tab of the Families page now displays the average Kimura divergence of the seed alignment instances to the consensus.
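As a rough illustration of what an average Kimura divergence measures, the sketch below computes the standard Kimura two-parameter distance, K = -1/2 ln((1 - 2P - Q) sqrt(1 - 2Q)), between each aligned instance and the consensus, where P and Q are the fractions of transitions and transversions. The function names are hypothetical, and the post does not say how Dfam treats gaps, ambiguity codes or CpG sites, so the choice to skip gap and ambiguity positions here is an assumption.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def kimura_divergence(seq, consensus):
    """Kimura two-parameter distance between two aligned, equal-length sequences.

    Assumption: positions holding a gap or an ambiguity code in either
    sequence are skipped rather than counted as differences.
    """
    if len(seq) != len(consensus):
        raise ValueError("aligned sequences must have equal length")
    transitions = transversions = compared = 0
    for a, b in zip(seq.upper(), consensus.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # gap or ambiguity code
        compared += 1
        if a == b:
            continue
        # A<->G and C<->T are transitions; every other mismatch is a transversion.
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    p = transitions / compared
    q = transversions / compared
    # math.log raises ValueError when the sequences are too diverged (saturation).
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


def average_kimura(instances, consensus):
    """Mean Kimura distance of the aligned instances to the consensus."""
    return sum(kimura_divergence(s, consensus) for s in instances) / len(instances)
```

A single transition in ten compared positions gives a distance of about 0.11, slightly more than the raw 10% mismatch rate because the model corrects for multiple substitutions at the same site.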


A new Pfam-B is released

June 30, 2020

In addition to our HMM-based Pfam entries (Pfam-A), we used to make a set of automatically generated, non-HMM-based entries called Pfam-B. The Pfam-B entries were derived from clusters generated by applying the ADDA algorithm to an all-against-all BLAST search of UniRef-40, and removing any regions covered by Pfam-A. The overhead of producing Pfam-B in this way became too great, and as of Pfam 28.0, we stopped making Pfam-B entries (see [1] for a longer discussion on why we stopped producing Pfam-B). Erik Sonnhammer has devised an alternative method of making Pfam-B using the MMseqs2 software [2], and an overview of the process is given below (more details will follow in the next Pfam paper).

We have already begun to use the new version of Pfam-B to generate new families, and 11 of these are in Pfam 33.1. For example, the TUTase family (PF19088) was built using Pfam-B as the source. We expect that Pfam-B will be a very useful source of additional families in the coming years.

How the new Pfam-B was created

UniProtKB sequences not covered by Pfam-A were clustered using MMseqs2, and multiple sequence alignments of each cluster were generated with FAMSA [3]. This resulted in 136730 Pfam-B families that on average contain 99 sequences (max 40912) and are 310 positions wide (max 29216).

How to access the new Pfam-B

The Pfam-B alignments are released as a tar archive on the Pfam FTP site [Pfam-B.tgz]. We do not plan to integrate them into the Pfam website, but we will generate them for each future Pfam release.

Posted by Erik Sonnhammer and the Pfam team

References

1. Finn et al. (2015) The Pfam protein families database: towards a more sustainable future.

2. Hauser et al. (2016) MMseqs software suite for fast and deep clustering and searching of large protein sequence sets

3. Deorowicz et al. (2016) FAMSA: Fast and accurate multiple sequence alignment of huge protein families


Pfam 33.1 is released

June 11, 2020

We are pleased to announce the release of Pfam 33.1! Some of you may have noticed that we never released Pfam 33.0 – we had initially planned to do so in March 2020, but due to the global pandemic, we redirected our efforts to updating the Pfam SARS-CoV-2 models instead (see previous blog posts Pfam SARS-CoV-2 special update and Pfam SARS-CoV-2 special update (part 2)). We have added these updated models to the Pfam 33.0 release, along with a few other families that we had built since the data for Pfam 33.0 were frozen, to create Pfam 33.1.

Pfam 33.1 contains a total of 18259 families and 635 clans. Since the last release, we have built 355 new families and killed 25 families. We regularly receive feedback from users about families or domains that are missing in Pfam, and typically add many user-submitted families at each release. We include the submitter's name and ORCID identifier as an author of such Pfam entries. This helps people to get credit for community activities that improve molecular biology databases such as Pfam.

One such user submission was from Heli Mönttinen (University of Helsinki) who submitted a large scale clustering of virus families. Based on this clustering we added 88 new families to Pfam. 

We have also added 8 new clans since the last Pfam release. One of the new clans is the TSP1 superfamily (CL0692). Previously a single family (PF00090) attempted to identify all known TSP1 domains.  Based on structural work by Marko Hyvönen (University of Cambridge) and colleagues we have added an additional three families (PF19028, PF19030 and PF19035) to Pfam. These new families have both improved the coverage of the TSP1 domain, and better modelled the variations in disulphide binding across the structure space.

Figure 1. Organisation of the TSP1 clan in Pfam shown as a sequence similarity network. Image taken from Xu et al.

Finally, we are very happy to welcome Sara and Lowri who are working as curators for both the Pfam and InterPro resources and are already making great contributions to the resources.

Posted by Jaina and Alex


Rfam Coronavirus Special Release

April 27, 2020

In response to the SARS-CoV-2 outbreak, the Rfam team prepared a special release dedicated to the Coronavirus RNA families. The release 14.2 includes 10 new and 4 revised families that can be used to annotate the SARS-CoV-2 and other Coronavirus genomes with RNA families.

View the data at rfam.org/covid-19.

New Coronavirus Rfam families

In collaboration with the Marz group and the EVBC, we created 10 families representing the entire 5’- and 3’- untranslated regions (UTRs) for Alpha-, Beta-, Gamma-, and Delta- coronaviruses. A specialised set of alignments for the subgenus Sarbecovirus is also provided, including the SARS-CoV-1 and SARS-CoV-2 UTRs. 

The families are based on a set of high-quality whole genome alignments produced with LocARNA and reviewed by expert virologists. Note that the Alpha-, Beta-, and Deltacoronavirus alignments and structures were refined based on the literature, while the Gammacoronavirus families are based on prediction alone due to the lack of experimental data.


Virus                        5’ UTR                        3’ UTR
Alphacoronavirus             aCoV-5UTR (RF03116)           aCoV-3UTR (RF03121)
Betacoronavirus              bCoV-5UTR (RF03117)           bCoV-3UTR (RF03122)
Sarbecovirus and SARS-CoV-2  Sarbecovirus-5UTR (RF03120)   Sarbecovirus-3UTR (RF03125)
Gammacoronavirus             gCoV-5UTR (RF03118)           gCoV-3UTR (RF03123)
Deltacoronavirus             dCoV-5UTR (RF03119)           dCoV-3UTR (RF03124)

Previously, only fragments of the UTRs were found in Rfam. In particular, two families were superseded by the new whole-UTR alignments and removed from Rfam:

  • RF00496 (Coronavirus SL-III cis-acting replication element): This family represented a single stem that is now found in aCoV-5UTR and bCoV-5UTR families.
  • RF02910 (Coronavirus_5p_sl_1_2): This family represented two stems from aCoV-5UTR.

The new families are grouped into 2 clans: CL00116 and CL00117 for the 5’ and 3’ UTRs, respectively. The clans can be used with the Infernal cmscan program to automatically select the highest scoring match from a set of related families (see the Rfam chapter in CPB to learn more).
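The idea behind clan-based selection can be sketched in a few lines of Python: when several families from the same clan hit overlapping regions of a sequence, only the highest scoring hit is kept. This is a simplified illustration, not Infernal's implementation; the Hit record and function names are hypothetical, and hits are assumed to lie on the same sequence and strand with start <= end.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    family: str   # family name, e.g. "bCoV-5UTR"
    clan: str     # clan accession, or "" if the family is in no clan
    start: int    # 1-based inclusive start on the sequence
    end: int      # 1-based inclusive end on the sequence
    score: float  # bit score


def overlaps(a, b):
    """True if the two hits share at least one sequence position."""
    return a.start <= b.end and b.start <= a.end


def best_per_clan(hits):
    """Keep, for each group of overlapping same-clan hits, only the top-scoring one."""
    kept = []
    # Visit hits from best to worst so a kept hit always outscores later rivals.
    for hit in sorted(hits, key=lambda h: h.score, reverse=True):
        loses = any(
            hit.clan and hit.clan == k.clan and overlaps(hit, k) for k in kept
        )
        if not loses:
            kept.append(hit)
    return sorted(kept, key=lambda h: h.start)
```

For a SARS-CoV-2 genome, one would expect the Sarbecovirus-specific UTR families to outscore the broader Betacoronavirus ones in the same clan, so only the former would remain after this step.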

Revised Coronavirus families

We also reviewed and updated the existing Coronavirus Rfam families.

For each family, what was updated and whether it is found in SARS-CoV-2:

  • RF00182 (Coronavirus packaging signal): The seed alignment and consensus secondary structure were updated to include the 4 conserved repeat units. This RNA element is found only in Embecovirus, so it is not present in SARS-CoV-2 and other Sarbecoviruses.
  • RF00507 (Coronavirus frameshifting stimulation element): The seed alignment was expanded. This RNA is present in SARS-CoV-2.
  • RF00164 (Coronavirus s2m RNA): The seed alignment was expanded. There is a 3D structure for SARS-CoV-1 which can be used for understanding the s2m in SARS-CoV-2. This RNA is present in the 3’ UTR of SARS-CoV-2.
  • RF00165 (Coronavirus 3’-UTR pseudoknot): The seed alignment was expanded. The pseudoknot is annotated in the 3’ UTR families but since it is mutually exclusive with the 3’-UTR consensus structure, it is also provided as a separate family. This RNA is present in the 3’ UTR of SARS-CoV-2.

Where to get the data 

You can download the covariance models, as well as seed alignments for the coronavirus families from the corresponding family pages or from a dedicated folder on the FTP archive.

How to use the data

You can download the covariance models and annotate viral sequences with these RNA models using Infernal. See Rfam help for examples.

Inviting all Wikipedians to contribute

We revised the Wikipedia pages associated with each family, and we invite everyone to contribute to these articles.

Acknowledgements

We would like to thank Kevin Lamkiewicz and Manja Marz (Friedrich Schiller University Jena) for providing the curated alignments for the new families as well as Eric Nawrocki (NCBI) for revising the existing Rfam entries. We also thank Ramakanth Madhugiri (Justus Liebig University Giessen) for reviewing the Coronavirus UTR alignments.


This work is part of the BBSRC-funded project to expand the coverage of viral RNAs in Rfam. More data on SARS-CoV-2 can be found on the European COVID-19 Data Portal.


ABrowse alignment viewer for Pfam coronavirus families

April 24, 2020

The SARS-CoV-2 pandemic continues to stimulate a worldwide scientific response, with results and preprints accumulating at a high rate. In turn, these papers have generated extensive online discussion.

In order to help link these discussions to data, the protein domains in the Pfam SARS-CoV-2 special release are now available for browsing in ABrowse, a phylogenetic alignment and structure browser newly-developed by the team behind the JBrowse genome browser.

Using ABrowse, you can now link to dynamic views of individual domains from the special release. As an example of how this might be used, this is a link to the ORF7a protein, of which a 27 amino acid deletion was recently observed in Arizona. It can be seen quite clearly from the alignment that a 27aa deletion is a not-insignificant dent in this protein, suggesting (as hypothesized by Nick Loman) that this protein might not be necessary for efficient human-to-human transmission of SARS-CoV-2.

You can scroll around the alignment by dragging (ABrowse owes an inspiration debt to the BioJS MSA browser for this feature), and you can use the phylogenetic tree to the left of the alignment to collapse individual clades. ABrowse displays the collapsed clades as a sequence logo, using a probabilistic profile of the ancestral sequence (performed in a separate thread for performance reasons).

For sequences where Pfam includes a link to PDB structures, you can click on a hyperlinked sequence name to bring up a structure visualization using pv, the WebGL PDB viewer. Structures can be rotated, zoomed, or recolored. Mousing over alignment columns will highlight the corresponding residue in all open structures, and vice versa, allowing comparison of homologous amino acids in the structural context.

The image shows the Corona_S2 domain, i.e. the Spike glycoprotein, along with several structures and some ancestral sequence logos.

This version of ABrowse is an alpha release, and is likely to be updated; however, URL bookmarks use the Pfam accession numbers and the JBrowse/ABrowse team plans to keep the alignments up to date. If you have questions, bug reports, or feature requests about ABrowse, please feel free to post them to the ABrowse issues page and/or contact Ian Holmes.

Guest post by Ian Holmes


Pfam SARS-CoV-2 special update (part 2)

April 6, 2020

This post presents an update to last week’s post. Since the initial release of the 40 Pfam profile HMMs that match SARS-CoV-2, we have now produced a set of flatfiles that are more typical of a Pfam release.  These files make our updated annotations that describe the entries available for download, prior to being released via the Pfam website. Moreover, you can now use the multiple sequence alignments to investigate the conserved positions across different coronavirus proteins. Figure 1 shows the alignment of the SARS-CoV-2 receptor binding domain (PF09408; N.B. the Pfam website still shows the old alignment).

Figure 1 – Excerpt of the Betacoronavirus spike glycoprotein S1, receptor binding domain alignment (Pfam accession PF09408), rendered using Jalview. The SARS-CoV-2 sequence is the last sequence in the alignment.

Finally, we have made some very minor changes to the family descriptions and one name change from the last release.  You can now access all the updated files here:

ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam_SARS-CoV-2_2.0/

In this directory you can find the updated seed (Pfam-A.SARS-CoV-2.seed) and full alignments (Pfam-A.SARS-CoV-2.full) in Stockholm format based on the Pfamseq database, which contains sequences of the UniProt Reference Proteomes.  We provide a file with matches to UniProtKB 2019_08 (Pfam-A.SARS-CoV-2.full.uniprot). We also provide a set of alignments for each of the families which include matches to the SARS-CoV-2 sequences which are not as yet present in the Pfamseq database. These alignments can be found in aligned fasta format here or as a tar gzipped library here.
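To look for conserved positions in one of these Stockholm alignments without a viewer, something like the following Python sketch could be used. It is a minimal reader that handles only the first alignment in a file and ignores all markup lines; the function names and the 0.9 threshold are illustrative assumptions, and it is not a full Stockholm parser.

```python
from collections import Counter


def read_stockholm(lines):
    """Return {sequence name: aligned sequence} for the first alignment.

    Markup lines starting with '#' are skipped, interleaved blocks are
    concatenated, and reading stops at the '//' terminator.
    """
    seqs = {}
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("//"):
            break  # end of the first alignment
        if not line.strip() or line.startswith("#"):
            continue
        name, chunk = line.split(None, 1)
        seqs[name] = seqs.get(name, "") + chunk.strip()
    return seqs


def conserved_columns(seqs, min_fraction=0.9):
    """List (0-based column, residue, fraction) for well-conserved columns.

    The fraction is the share of all sequences (gapped ones included)
    carrying the most common residue in that column. All aligned
    sequences are assumed to have the same length.
    """
    rows = list(seqs.values())
    conserved = []
    for col in range(len(rows[0])):
        residues = [r[col].upper() for r in rows if r[col] not in "-."]
        if not residues:
            continue  # column is all gaps
        residue, count = Counter(residues).most_common(1)[0]
        fraction = count / len(rows)
        if fraction >= min_fraction:
            conserved.append((col, residue, fraction))
    return conserved
```

Both functions accept any iterable of lines, so an open file handle for Pfam-A.SARS-CoV-2.seed can be passed to read_stockholm directly. Since that file holds many families, each terminated by "//", only the first would be read by this sketch.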

Posted by The Pfam team


Pfam SARS-CoV-2 special update

April 2, 2020

The SARS-CoV-2 pandemic has mobilised a worldwide research effort to understand the pathogen itself and the mechanism of COVID-19 disease, as well as to identify treatment options. Although Pfam already provided useful annotation for SARS-CoV-2, we decided to update our models and annotations for this virus in an effort to help the research community. This post explains what was done and how we are making the data available as quickly as possible.

What have we done?

We assessed all the protein sequences provided by UniProt via its new COVID-19 portal (https://covid-19.uniprot.org/), identified those which lacked an existing Pfam model, and set about building models as required. In some cases we built families based on recently solved structures of SARS-CoV-2 proteins. For example, we built three new families representing the three structural domains of the NSP15 protein (Figure 1) based on the structure by Youngchang Kim and colleagues (http://europepmc.org/article/PPR/PPR115432). In other cases, such as Pfam’s RNA dependent RNA polymerase family (PF00680), we took our existing family and extended its taxonomic range to ensure it included the new SARS-CoV-2 sequences.

Figure 1. The structure of NSP15 (PDB:6VWW) from Kim et al. shows the three new Pfam domains. (1) CoV_NSP15_N (PF19219) Coronavirus replicase NSP15, N-terminal oligomerisation domain in red, (2) CoV_NSP15_M (PF19216) Coronavirus replicase NSP15, middle domain in blue and (3) CoV_NSP15_C (PF19215) Coronavirus replicase NSP15, uridylate-specific endoribonuclease in green.

We have also stratified our ID nomenclature and descriptions of the families to ensure they are both correct and consistent. The majority of the family identifiers now begin with either CoV, for coronavirus specific families, or bCoV for the families which are specific to the betacoronavirus clade, which SARS-CoV-2 belongs to. We have also fixed inconsistencies in the naming and descriptions of the various non-structural proteins, using NSPx for those proteins encoded by the replicase polyprotein, and NSx for those encoded by other ORFs. We are grateful to Philippe Le Mercier from the Swiss Institute of Bioinformatics who gave us valuable guidance for our nomenclature.

Where are the data?

You can access a small HMM library (Pfam-A.SARS-CoV-2.hmm) for all the Pfam families that match the SARS-CoV-2 protein sequences on the Pfam FTP site:

ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam_SARS-CoV-2_1.0/

You can also find a file (matches.scan) showing the matches of the models against the SARS-CoV-2 sequences in the same FTP location. These updates are not yet available on the Pfam website. We anticipate making them available in 6-8 weeks.  We hope you find our SARS-CoV-2 models useful for your research, and as always we welcome your feedback via email at pfam-help@ebi.ac.uk.

How to use this library?

This library is not compatible with the pfam_scan software that we normally recommend to reproduce Pfam matches, as this library only contains a small subset of models.  If you wish to compare these models to your own sequences, please use the following HMMER commands:

$ hmmpress  Pfam-A.SARS-CoV-2.hmm

This only needs to be performed once. Then, to compare your sequences (in a file called my.fasta) to this special Pfam profile HMM library, run:

$ hmmscan --cut_ga --domtblout matches.scan Pfam-A.SARS-CoV-2.hmm my.fasta

The --domtblout option enables you to save the matches in a more convenient tabular form, if you do not want to parse the HMMER output.
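If you want to work with the tabular file programmatically, a short Python sketch along these lines may help. It assumes the standard HMMER3 per-domain table layout (22 whitespace-separated columns followed by a free-text description) and keeps only a handful of them; the function name and the chosen dictionary keys are illustrative, not part of HMMER or Pfam.

```python
def parse_domtblout(lines):
    """Parse HMMER3 --domtblout lines from hmmscan into a list of dicts.

    With hmmscan, the "target" columns describe the Pfam model and the
    "query" columns describe your sequence.
    """
    hits = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # header, footer and blank lines
        # Split on whitespace at most 22 times so the description stays whole.
        f = line.split(None, 22)
        hits.append({
            "family": f[0],                 # target (model) name
            "accession": f[1],              # target (model) accession
            "sequence": f[3],               # query (sequence) name
            "i_evalue": float(f[12]),       # independent E-value of this domain
            "domain_score": float(f[13]),   # bit score of this domain
            "env_from": int(f[19]),         # envelope start on the sequence
            "env_to": int(f[20]),           # envelope end on the sequence
        })
    return hits
```

The function takes any iterable of lines, so the matches.scan file written by the hmmscan command above can be passed as an open file handle.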

And finally

We will be making Pfam alignments available during the next week and will produce another blog post describing them.

Posted by The Pfam team


Curation with Dfam: new data and platform updates

March 17, 2020

DNA transposon termini signatures

The Dfam consortium is excited to announce the generation and release of terminal repeat sequence signatures for class II DNA transposable elements. The termini of class II elements are crucial for movement, and as such, can be used to classify de novo DNA transposable element families in new genomic sequences (Figure 1).

Figure 1. Major subgroups of class II DNA transposons.

The LOGOs of the termini can be viewed on the “Classifications” tab on the Dfam website and are organized by class II subclasses (e.g., Crypton, Helitron, TIR, etc.) (Figure 2). This allows for easy visualization of the base conservation at each position in the terminal sequences and comparisons between the 5’ and 3’ termini (Figure 2). In addition, the termini profiles are available for download as a .HMM file.

Figure 2. Sample termini signature visualization on the Dfam website (www.dfam.org). Base conservation can be seen via the LOGOs of the 5’, 3’ and combined edge (termini) HMMs. The movement type precedes DNA transposons that move via a common mechanism (e.g. “Circular dsDNA intermediate”). The number of families used to generate the LOGOs is indicated, as well as the subclass name (e.g. “Crypton_A”). Additional notes on the termini, when relevant, are also available.

Community data submissions

We have taken the first small step towards a community-driven data curation platform by developing a new data submission system.  At the start this will facilitate the process of uploading data to the site for processing by the curators. As we move forward, further aspects of the curation process will be made available to the community.  Upon creating an account and logging in, users can submit files to Dfam using our web-based upload page. Here you will also find information about submission requirements and how different levels of library quality are handled in Dfam.


We are recruiting!

August 7, 2019

We have two biocurator positions available to work on the Pfam and InterPro databases. Come and join our team!

The main role of the jobs will be to:

  • Create and maintain InterPro and Pfam entries through the assessment of protein signature models. This will involve using our curation interfaces and tools (using basic command line).
  • Write descriptive abstracts of protein families and domains, summarizing functional information found within the scientific literature.
  • Augment entries with annotation terms for use in automatic annotation pipelines, for example the use of GO annotations and other data standards.
  • Respond to user and collaborator queries and requests.
  • Help develop and deliver training materials, either in person or via the Train Online platform.

 

Full details can be found on the EBI jobs pages:

https://www.embl.de/jobs/searchjobs/index.php?ref=EBI01490

https://www.embl.de/jobs/searchjobs/index.php?ref=EBI01418

 

If you have any questions, please get in touch.

 

Posted by Jaina and Lorna


Dfam 3.0 is out

March 6, 2019

 

The Dfam consortium is excited to announce the release of Dfam 3.0.  This release represents a major transition for Dfam from a proof-of-concept database into a funded open community resource. Central to this transition is a major infrastructure and technology update, enabling Dfam to handle the increasing pace of genome sequencing and TE library generation. Equally important, we merged Dfam_consensus with Dfam to produce a single resource for transposable element family modeling and annotation. In doing so, Dfam serves the needs of a broader research community while maintaining a high standard for family characterization (seed alignments), and TE annotation sensitivity. Finally, and most importantly, we are working on making Dfam a community driven resource through the development of online curation tools and direct user engagement.

Infrastructure updates

Dfam has undergone a major infrastructure upgrade since the last release including faster servers and storage systems, a new software stack and improved website features. Together these updates will allow Dfam to greatly expand the number of families and the species represented. The new software stack includes a publicly accessible REST API, which provides the core functionality used by the redesigned dfam.org website and is available for use in community developed applications and workflows. The new website is based on the Angular framework, supporting both a traditional web portal to the Dfam database as well as the use of interactive tools for data management and curation.
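For workflows that want to use the REST API, a request for a single family record might look like the Python sketch below. The base URL and the /families/{accession} path are assumptions on our part and should be checked against the current API documentation on dfam.org; the function names are illustrative, and only the Python standard library is used.

```python
import json
import urllib.parse
import urllib.request

# Assumption: base URL of the public Dfam REST API.
DFAM_API = "https://dfam.org/api"


def family_url(accession):
    """Build the URL for one family record, e.g. accession 'DF0000001'."""
    return f"{DFAM_API}/families/{urllib.parse.quote(accession)}"


def fetch_family(accession, timeout=30):
    """Download and decode the JSON record for one family (needs network access)."""
    with urllib.request.urlopen(family_url(accession), timeout=timeout) as resp:
        return json.load(resp)
```

Calling fetch_family("DF0000001") would return the decoded JSON as a Python dictionary; the fields it contains are defined by the API and are not assumed here.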

Dfam_consensus merger

The merger of Dfam_consensus with Dfam created a combined database of 6,235 TE families in 9 organisms, each characterized by a seed alignment of representative family members. Seed alignments constitute a rich dataset for generating sequence models such as consensus sequences, or profile Hidden Markov Models (HMMs).

Consensus sequence databases have traditionally not preserved the sequence alignment from which the consensus was generated. This omission has made it difficult to evaluate the strength of the consensus, to make incremental improvements by adding/removing members, or to regenerate models using improved methodologies. By adding support for consensus sequences to Dfam, the provenance is preserved in the seed alignment. In addition, the positions within the consensus can be directly related to the corresponding match states within the profile HMM.
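To make the relationship between a seed alignment and its consensus concrete, here is a deliberately simple majority-rule caller in Python. It is a sketch only: Dfam's own consensus calling is not described in this post and is likely more sophisticated, and both the function name and the rule of dropping columns that are mostly gaps are assumptions made for illustration.

```python
from collections import Counter


def majority_consensus(seed_alignment, gap_chars="-."):
    """Derive a consensus from equal-length aligned sequences by majority rule.

    Columns where fewer than half of the sequences have a base are treated
    as insertions and left out. Ties go to the base seen first in the column.
    """
    if not seed_alignment:
        raise ValueError("seed alignment is empty")
    width = len(seed_alignment[0])
    if any(len(row) != width for row in seed_alignment):
        raise ValueError("all aligned sequences must have the same length")
    consensus = []
    for col in range(width):
        column = [row[col].upper() for row in seed_alignment]
        bases = [c for c in column if c not in gap_chars]
        if len(bases) * 2 < len(column):
            continue  # mostly gaps: omit this column from the consensus
        consensus.append(Counter(bases).most_common(1)[0][0])
    return "".join(consensus)
```

Because the alignment is kept alongside the consensus, a curator can add or remove seed members and rerun a function like this, which is exactly the kind of incremental improvement that a consensus-only database makes difficult.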

Improved interfaces and metadata

The new Dfam website contains several features borrowed from Dfam_consensus including: the seed alignment visualization, the TE classification system and visualization, and per-family and full-database EMBL exports for consensus sequences.

TE classification tree visualization with search facility:

Figure1

In addition, we have improved the family browsing interface, and added the ability to store/visualize family features such as coding sequences, target site preferences, binding sites, as well as ad-hoc sequence annotation.

Coding regions and target site duplication details for Kolobok-1_DR:

Figure4

Dfam has adopted the recently developed (for Dfam_consensus) classification system for repetitive sequences and applied it to all of the Dfam-2.x families. This system combines concepts from established systems (Wicker et al., Piegu et al., Curcio et al., Smit et al., and Jurka et al.) with phylogenies based on reverse transcriptase and transposases. Classification names were chosen to be as descriptive as possible while still honoring the most widely used acronyms for well-defined classes.

Dfam families may be queried using the new browse form:

Figure2

 

Community engagement

We are embarking on an effort to greatly expand the database using de novo repeat identification pipelines, data sharing with other open databases, and, most importantly, direct community submissions. If you have existing TE libraries or plan to develop one for a newly sequenced organism, consider making it a part of the Dfam database. We can offer assistance with importing legacy datasets and are working on tools to facilitate direct community curation of the database. Please contact us at help@dfam.org.