Posts Tagged ‘release’

Dfam 3.2 Release

July 9, 2020

Dfam is proud to announce the release of Dfam 3.2. This release represents a significant step in the expansion of Dfam by providing early access to uncurated, de novo-generated families. As a demonstration of this new capability, we imported a set of 336 RepeatModeler-generated libraries produced by Fergal Martin and Denye Ogeh at the European Bioinformatics Institute (EBI). Also in this release, Dfam now provides family alignments to the RepeatMasker TE protein database, aiding in the discovery of related families and in the classification of uncurated TEs.

Uncurated Family Support

In addition to the fully curated libraries for the model organisms human, mouse, zebrafish, worm and fly, Dfam also includes curated libraries for seven other species.  While a fully curated library is the ultimate goal, support for uncurated families has become an essential aspect of a TE resource due to the increasing rate at which new species are being sequenced and the need to have at least a simple TE masking library available.

By standardizing the storage and tracking of uncurated families, it becomes possible to use these datasets to crudely mask an assembly, provide a first approximation of the TE content, and create a starting point for community curation efforts.  Due to the redundancy and fragmentation inherent in these datasets, we do not compute genome-specific thresholds or generate genome coverage plots for these families.  The latest update to the web portal includes new interfaces for uncurated families and some existing interfaces now include an option to include/omit uncurated families.

In this release, Dfam contains RepeatModeler de novo-produced libraries for an additional 336 species as the result of the collaboration with EBI researchers (denoted with the new uncurated accession prefix “DR”). Notable taxa expansions include Sauropsida (lizards and birds) and fishes (bony and cartilaginous) (Table 1). Also included are Amphibia, Viridiplantae and additional species in Mammalia.

Table 1. De novo-identified TE families from additional species

Species | Number (species) | Retrotransposons | DNA transposons | Other
Mammalia471830137812567
Sauropsida164293261168827192
Amphibia6178120316107
Actinopterygii (bony fishes)116275205136177006
Chondrichthyes (cartilaginous fishes)516711982273
Viridiplantae (green plants)28964121687

Aligned Protein Features

In previous versions of Dfam, hand-curated coding regions were provided for a select set of families.  The protein products of these curated sequences were placed in the RepeatMasker TE protein database for use with the RepeatProteinMask tool.  In this release we have used this database with BLASTX to produce alignments to all Dfam families including the uncurated entries.  The resulting alignments are displayed alongside the curated coding regions as the new “aligned” feature track (Figure 1).

Figure 1. Feature track and details for BLASTX alignments to TE protein database.

Website improvements

Several minor improvements have been made to the interface since the previous release.  The browse page now provides links to download the families selected by the query/filter options as HMM, EMBL or FASTA records.  The Seed tab of the Families page now displays the average Kimura divergence of the seed alignment instances to the consensus.
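
As a rough illustration of the divergence figure shown on the Seed tab, the sketch below computes the standard Kimura two-parameter distance between each aligned instance and the consensus, then averages the results. The post does not describe Dfam's exact calculation, so the formula choice, the handling of gaps, and the function names `kimura_2p` and `average_divergence` are all assumptions made for illustration.

```python
import math

# Purine<->purine and pyrimidine<->pyrimidine substitutions are transitions.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def kimura_2p(seq, consensus):
    """Kimura two-parameter distance between two aligned, equal-length sequences.

    Assumption: columns containing gaps or ambiguity codes are skipped.
    """
    if len(seq) != len(consensus):
        raise ValueError("sequences must be aligned to the same length")
    compared = transitions = transversions = 0
    for a, b in zip(seq.upper(), consensus.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # gap or ambiguity code
        compared += 1
        if a == b:
            continue
        if (a, b) in TRANSITIONS:
            transitions += 1
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    p = transitions / compared    # proportion of transitions
    q = transversions / compared  # proportion of transversions
    # math.log raises ValueError for saturated pairs where the argument is <= 0.
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


def average_divergence(instances, consensus):
    """Mean Kimura distance of the aligned instances to the consensus."""
    return sum(kimura_2p(s, consensus) for s in instances) / len(instances)
```

Identical sequences give a distance of zero, and the value grows faster than the raw mismatch fraction because the formula corrects for multiple substitutions at the same site.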

Pfam 33.1 is released

June 11, 2020

We are pleased to announce the release of Pfam 33.1! Some of you may have noticed that we never released Pfam 33.0 – we had initially planned to do so in March 2020, but due to the global pandemic, we redirected our efforts to updating the Pfam SARS-CoV-2 models instead (see previous blog posts Pfam SARS-CoV-2 special update and Pfam SARS-CoV-2 special update (part 2)). We have added these updated models to the Pfam 33.0 release, along with a few other families that we had built since the data for Pfam 33.0 were frozen, to create Pfam 33.1.

Pfam 33.1 contains a total of 18259 families and 635 clans. Since the last release, we have built 355 new families and killed 25 families. We regularly receive feedback from users about families or domains that are missing in Pfam, and typically add many user-submitted families at each release. We include the submitter's name and ORCID identifier as an author of such Pfam entries. This helps people to get credit for community activities that improve molecular biology databases such as Pfam.

One such user submission was from Heli Mönttinen (University of Helsinki) who submitted a large scale clustering of virus families. Based on this clustering we added 88 new families to Pfam. 

We have also added 8 new clans since the last Pfam release. One of the new clans is the TSP1 superfamily (CL0692). Previously a single family (PF00090) attempted to identify all known TSP1 domains. Based on structural work by Marko Hyvönen (University of Cambridge) and colleagues we have added an additional three families (PF19028, PF19030 and PF19035) to Pfam. These new families have both improved the coverage of the TSP1 domain, and better modelled the variations in disulphide binding across the structure space.

Figure 1. Organisation of the TSP1 clan in Pfam shown as a sequence similarity network. Image taken from Xu et al.

Finally, we are very happy to welcome Sara and Lowri who are working as curators for both the Pfam and InterPro resources and are already making great contributions to the resources.

Posted by Jaina and Alex

Pfam SARS-CoV-2 special update (part 2)

April 6, 2020

This post presents an update to last week’s post. Since the initial release of the 40 Pfam profile HMMs that match SARS-CoV-2, we have now produced a set of flatfiles that are more typical of a Pfam release. These files make our updated annotations that describe the entries available for download, prior to being released via the Pfam website. Moreover, you can now use the multiple sequence alignments to investigate the conserved positions across different coronavirus proteins. Figure 1 shows the alignment of the SARS-CoV-2 receptor binding domain (PF09408; N.B. the Pfam website still shows the old alignment).


Figure 1 – Excerpt of the Betacoronavirus spike glycoprotein S1, receptor binding domain alignment (Pfam accession PF09408), rendered using Jalview. The SARS-CoV-2 sequence is the last sequence in the alignment.

Finally, we have made some very minor changes to the family descriptions and one name change from the last release.  You can now access all the updated files here:

ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam_SARS-CoV-2_2.0/

In this directory you can find the updated seed (Pfam-A.SARS-CoV-2.seed) and full alignments (Pfam-A.SARS-CoV-2.full) in Stockholm format based on the Pfamseq database, which contains sequences of the UniProt Reference Proteomes.  We provide a file with matches to UniProtKB 2019_08 (Pfam-A.SARS-CoV-2.full.uniprot). We also provide a set of alignments for each of the families which include matches to the SARS-CoV-2 sequences which are not as yet present in the Pfamseq database. These alignments can be found in aligned fasta format here or as a tar gzipped library here.
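
For readers who want to inspect these alignments programmatically, the following is a minimal sketch of a Stockholm reader. It relies only on the basic conventions of the format (markup lines begin with `#`, each alignment ends with `//`, and sequence lines are a name followed by aligned residues); the function name `read_stockholm` is hypothetical, and a maintained parser is likely a better choice for production work.

```python
def read_stockholm(lines):
    """Yield one {sequence name: aligned sequence} dict per alignment.

    Interleaved blocks are joined by concatenating the chunks seen for
    each sequence name. Markup lines (#=GF, #=GS, #=GR, #=GC) are ignored.
    """
    alignment = {}
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("//"):
            # End of the current alignment.
            yield alignment
            alignment = {}
        elif not line.strip() or line.startswith("#"):
            continue
        else:
            name, chunk = line.split(None, 1)
            alignment[name] = alignment.get(name, "") + chunk.strip()
    if alignment:
        # Tolerate a file that lacks the final "//" terminator.
        yield alignment
```

A file such as Pfam-A.SARS-CoV-2.seed could be passed in as an open file handle, for example `for aln in read_stockholm(open("Pfam-A.SARS-CoV-2.seed", encoding="latin-1"))`; the encoding shown is an assumption, not something stated in the post.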

Posted by The Pfam team

Pfam SARS-CoV-2 special update

April 2, 2020

The SARS-CoV-2 pandemic has mobilised a worldwide research effort to understand the pathogen itself and the mechanism of COVID-19 disease, as well as to identify treatment options. Although Pfam already provided useful annotation for SARS-CoV-2, we decided to update our models and annotations for this virus in an effort to help the research community. This post explains what was done and how we are making the data available as quickly as possible.

What have we done?

We assessed all the protein sequences provided by UniProt via its new COVID-19 portal (https://covid-19.uniprot.org/), identified those which lacked an existing Pfam model, and set about building models as required. In some cases we built families based on recently solved structures of SARS-CoV-2 proteins. For example, we built three new families representing the three structural domains of the NSP15 protein (Figure 1) based on the structure by Youngchang Kim and colleagues (http://europepmc.org/article/PPR/PPR115432). In other cases, such as Pfam’s RNA dependent RNA polymerase family (PF00680), we took our existing family and extended its taxonomic range to ensure it included the new SARS-CoV-2 sequences.

Figure 1. The structure of NSP15 (PDB:6VWW) from Kim et al. shows the three new Pfam domains. (1) CoV_NSP15_N (PF19219) Coronavirus replicase NSP15, N-terminal oligomerisation domain in red, (2) CoV_NSP15_M (PF19216) Coronavirus replicase NSP15, middle domain in blue and (3) CoV_NSP15_C (PF19215) Coronavirus replicase NSP15, uridylate-specific endoribonuclease in green.

We have also stratified our ID nomenclature and descriptions of the families to ensure they are both correct and consistent. The majority of the family identifiers now begin with either CoV, for coronavirus specific families, or bCoV for the families which are specific to the betacoronavirus clade, which SARS-CoV-2 belongs to. We have also fixed inconsistencies in the naming and descriptions of the various non-structural proteins, using NSPx for those proteins encoded by the replicase polyprotein, and NSx for those encoded by other ORFs. We are grateful to Philippe Le Mercier from the Swiss Institute of Bioinformatics who gave us valuable guidance for our nomenclature.

Where are the data?

You can access a small HMM library (Pfam-A.SARS-CoV-2.hmm) for all the Pfam families that match the SARS-CoV-2 protein sequences on the Pfam FTP site:

ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam_SARS-CoV-2_1.0/

You can also find a file (matches.scan) showing the matches of the models against the SARS-CoV-2 sequences in the same FTP location. These updates are not yet available on the Pfam website. We anticipate making them available in 6-8 weeks.  We hope you find our SARS-CoV-2 models useful for your research, and as always we welcome your feedback via email at pfam-help@ebi.ac.uk.

How to use this library?

This library is not compatible with the pfam_scan software that we normally recommend to reproduce Pfam matches, as this library only contains a small subset of models.  If you wish to compare these models to your own sequences, please use the following HMMER commands:

$ hmmpress  Pfam-A.SARS-CoV-2.hmm

This only needs to be performed once. Then, to compare your sequences (in a file called my.fasta) to this special Pfam profile HMM library, run:

$ hmmscan --cut_ga --domtblout matches.scan Pfam-A.SARS-CoV-2.hmm my.fasta

The --domtblout option enables you to save the matches in a more convenient tabular form, if you do not want to parse the HMMER output.
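
The tabular file written by --domtblout can be read with a few lines of Python. The sketch below assumes the standard HMMER 3 per-domain table layout (22 whitespace-separated columns followed by a free-text description) and keeps only a handful of them; the `DomainHit` type and `parse_domtblout` function are illustrative names, not part of HMMER or Pfam.

```python
from typing import Iterable, Iterator, NamedTuple


class DomainHit(NamedTuple):
    family: str      # target name: with hmmscan, the Pfam family identifier
    accession: str   # target accession: the Pfam accession
    sequence: str    # query name: the sequence from my.fasta
    i_evalue: float  # independent E-value of this domain
    score: float     # bit score of this domain
    ali_from: int    # start of the aligned region on the sequence
    ali_to: int      # end of the aligned region on the sequence


def parse_domtblout(lines: Iterable[str]) -> Iterator[DomainHit]:
    """Yield one DomainHit per data row, skipping comment and blank lines."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        # The description (last column) may contain spaces, so split at most 22 times.
        f = line.split(None, 22)
        yield DomainHit(
            family=f[0],
            accession=f[1],
            sequence=f[3],
            i_evalue=float(f[12]),
            score=float(f[13]),
            ali_from=int(f[17]),
            ali_to=int(f[18]),
        )
```

For example, `list(parse_domtblout(open("matches.scan")))` would collect every domain match from the file produced by the hmmscan command above.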

And finally

We will be making Pfam alignments available during the next week and will produce another blog post describing them.

Posted by The Pfam team

Rfam 14.0 is out with over 100 new families and an expanded genome collection

August 8, 2018


We are happy to announce that the new release of Rfam, version 14.0, is now available! Rfam 14.0 is built using a set of over 14,000 non-redundant, representative, and complete genomes (~60% more than in Rfam 13.0). It includes 105 new families, a new genome browser hub, and ORCiD integration. Read on to find out more.

What’s new

Data updates

Rfam 14.0 has 60% more genomes than Rfam 13.0

The latest Rfam version comes on the heels of Rfam 13.0, a release that marked the transition to the genome-centric sequence database. In Rfam 13.0, the Rfam sequence database – Rfamseq – was composed of 8,364 non-redundant, representative and complete genomes derived from a genome collection maintained by UniProt. Now with the addition of 6,519 new species, the number of annotated genomes in Rfam 14.0 increased by ~60% to 14,434 genomes.


The majority of the genomes from Rfam 13.0 are also present in Rfam 14.0, although a small number (385, ~4.6%) was removed or replaced. The majority of the new genomes come from Bacteria and Viruses.

Since Rfamseq was updated, this is a major Rfam release (14.0). Expect a minor release (14.1) in the fall of 2018 with new RNA families but no changes in Rfamseq.

More genomes, less redundancy

The switch to annotating complete genomes enabled us to resolve data redundancy at the levels of sequence and species. For instance, in Rfam 12.3 the cumulative length of all human sequences was eight times longer than the total length of the human genome assembly hg38 in 13.0 (note how the width of the green line of Rfam 12 narrows in Rfam 13).


Redundancy reduction at the species level relies on UniProt’s reference proteome collection, which is a result of manual curation and computational refinement. It includes species of high interest to the scientific community and well-studied model organisms, carefully selected in such a way that they represent the taxonomic diversity. Rfam uses the same collection of genomes for annotation with existing RNA families and building new ones.

105 new families

The number of RNA families reached 2,791 with the addition of 105 new families from 8 RNA types. The number of new families per ncRNA type in release 14.0 is shown below:

  • 65   Gene; sRNA;
  • 17   Gene; antisense;
  • 11   Gene; snRNA; snoRNA; HACA-box;
  • 5     Gene; snRNA; snoRNA; CD-box;
  • 4     Cis-reg; thermoregulator;
  • 1     Cis-reg;
  • 1     Cis-reg; leader;
  • 1     Cis-reg; riboswitch;

Browse 105 new families

New 3D structures matching Rfam families

Two more Rfam families that previously did not match any 3D structures now have experimentally determined 3D structures:

Rfam family | PDB structure
RF00382 DnaX ribosomal frameshifting element | 5UQ7, 5UQ8 – 70S ribosome complex with dnaX mRNA stemloop and E-site tRNA (“in” and “out” conformation)
RF00375 HIV primer binding site | 6B19 – Architecture of HIV-1 reverse transcriptase initiation complex core

Search for Rfam entries in PDBe

Rfam regularly updates the mapping between Rfam families and the experimentally determined 3D structures available in PDB. With PDBe’s Advanced Search release in May 2018, PDBe users can take advantage of these mappings by searching with Rfam family names or accessions. For instance, a search using tRNA accession RF00005 currently retrieves 502 entries.

Another powerful new feature is the interactive 3D visualization of the Rfam domains on PDBe entry pages using LiteMol. This is achieved by highlighting the RNA sequence on the corresponding structure, for example tRNA (RF00005) in structure 4UJD. Additional information can be found in the PDBe blog post.

Increased GO term coverage

Non-coding RNA functional annotation was improved with the addition of 133 GO terms to 81 families since the last release. The GO annotations are propagated to RNAcentral sequences and submitted to the GOA system, as described in GOREF:0000115.

Genome browser hub

The genome-centric sequence database enabled us to generate the genome browser track hub directly out of the genome annotations without an additional mapping step. At this time we have limited the species listed in the track hub to those supported by UCSC, although that number could grow by incorporating all genomes with assemblies at chromosome level. Currently there are 14 species including human (hg38), chicken (galGal5), pig (susScr11) and mouse (mm10). Upon user request, we will also be happy to provide .bed and .bigBed files for various other genomes in our collection, depending on the level of the assembly.

Explore Rfam annotations in UCSC Genome Browser by clicking on these links:

or configure the track manually by editing the URL:

http://genome.ucsc.edu/cgi-bin/hgTracks?db=hg38&position=chr1&hgct_customText=track%20type=bigBed%20name=Rfam%20description=%22Rfam%2014.0%20ncRNA%20annotations%22%20visibility=full%20bigDataUrl=ftp://ftp.ebi.ac.uk/pub/databases/Rfam/14.0/genome_browser_hub/homo_sapiens/bigBed
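
To make the parts of this URL easier to edit, the sketch below rebuilds it from its components: the assembly (`db`), the starting `position`, and a custom track line that is percent-encoded into the `hgct_customText` parameter. The function name `ucsc_track_url` and its parameter names are hypothetical; the bigBed location is passed in unchanged, since the post only gives the human example.

```python
from urllib.parse import quote


def ucsc_track_url(big_data_url, db="hg38", position="chr1",
                   name="Rfam", description="Rfam 14.0 ncRNA annotations"):
    """Build a UCSC Genome Browser URL that loads a bigBed custom track."""
    track_line = (
        f'track type=bigBed name={name} description="{description}" '
        f"visibility=full bigDataUrl={big_data_url}"
    )
    # Encode spaces and quotes, but leave '=', ':' and '/' readable,
    # matching the style of the URL given in the post.
    encoded = quote(track_line, safe="=:/")
    return (
        "http://genome.ucsc.edu/cgi-bin/hgTracks"
        f"?db={db}&position={position}&hgct_customText={encoded}"
    )
```

Calling the function with the human bigBed location from the URL above and the default arguments reproduces that URL; changing `db`, `position` or `description` gives a variant for another view.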

The track hub can also be attached to Ensembl using these instructions and the following URL:

ftp://ftp.ebi.ac.uk/pub/databases/Rfam/14.0/genome_browser_hub/hub.txt

Get credit for Rfam families using ORCiD

It is now possible for Rfam authors to get credit for their contributions by claiming family accessions directly to their ORCiD profiles. This new feature was enabled by the Claim to ORCID functionality provided by EBI Search. The process includes three simple steps. Users are first required to log in to their ORCiD accounts and use their ORCiD iD to search for associated entries. After the search, one can manually select all or a subset of the listed entries and click the Claim to ORCID button located at the top of the page. The example provided is of a snoRNA family (RF02725) claimed by the Rfam curator Joanna Argasinska directly to her ORCiD profile.

New Rfam paper

We recently published a new paper in Current Protocols in Bioinformatics with examples covering a broad spectrum of Rfam use cases including examples using our website as well as Infernal to annotate nucleotide sequences. There is also a section dedicated to MySQL with tips and tricks on restoring previous versions of the database, along with useful examples on forming complex queries.

Get in touch

Follow our new Twitter account RfamDB to be the first to find out about new Rfam families and don’t hesitate to raise a GitHub issue or email us if you have any questions.

You can also meet the Rfam team in person at a hands-on tutorial at the upcoming ECCB 2018 conference in Athens.

Genome-centric Rfam is finally here!

September 15, 2017

We are pleased to announce the release of Rfam 13.0, the first major update since Rfam 12.0 went live in 2014. In this version we introduce a new genome-centric sequence database composed of non-redundant, representative, and complete genomes, as well as new website features, such as an updated text search.

Find out more about Rfam 13.0 in the NAR paper by Kalvari et al.: Rfam 13.0: shifting to a genome-centric resource for non-coding RNA families.

Pfam 31.0 is released

March 8, 2017

Pfam 31.0 contains a total of 16712 families and 604 clans. Since the last release, we have built 415 new families, killed 9 families and created 11 new clans. We have also been working on expanding our clan classification; in Pfam 31.0, over 36% of Pfam entries are placed within a clan.

Rfam 12.2 is live

January 25, 2017

We are happy to announce a new release of Rfam (version 12.2) which includes 115 new families, introduces R-scape secondary structure visualisations, and restores missing families to multiple Rfam clans.

New families

This release adds 115 new Rfam families bringing the total number of families to 2,588. Notable additions include Pistol, Hatchet, Twister-sister and several other riboswitches contributed by Zasha Weinberg. We are always looking for new RNA families, so please feel free to get in touch with your suggestions.

Testing covariation with R-scape

R-scape is a new method for testing whether covariation analysis supports the presence of a conserved RNA secondary structure. In order to check the quality of Rfam structures, we ran R-scape on all Rfam seed alignments and added R-scape visualisations to the secondary structure galleries. For example, here is the R-scape analysis of the SAM riboswitch:


According to R-scape, the secondary structure from the Rfam seed alignment, shown on the left, has 19 statistically significant basepairs (highlighted in green). R-scape can also use statistically significant basepairs as constraints to predict a new secondary structure that is consistent with the seed alignment. Using this approach, R-scape increased the number of statistically significant basepairs from 19 to 27 while also adding 9 new basepairs that are consistent with the seed alignment (structure on the right). This visualisation gives an idea about the quality of the Rfam structure and indicates that in this case it may need to be updated. To find out more about R-scape have a look at a recent paper by Rivas et al.

Tip: R-scape visualisations are interactive, so you can pan and zoom the structures and get additional information by hovering over nucleotides and basepairs.

R-scape analysis suggests that many existing Rfam secondary structures can be improved (for example, FMN riboswitch or 5S rRNA). In other families secondary structures are not supported by the R-scape covariation analysis (for example, oxyS RNA) which indicates that either their seed alignments need to be expanded or that these RNA families do not have a conserved secondary structure. Lastly, there are also cases where the R-scape structures do not show significant improvement compared to the current secondary structure (for instance, Metazoa SRP).

In future releases we will begin to improve existing Rfam seed alignments by using R-scape in the family building pipeline. In the meantime, Rfam users can get an indication of the quality of the structure using R-scape visualisations.

Recovering lost clan members

Since Rfam 10.0, related Rfam families have been organised into clans. The clans are manually curated and clan membership is checked using automated quality control steps (for example, to make sure that a family cannot belong to more than one clan). However, under certain circumstances these quality control procedures silently removed families from the clans. This bug was introduced in Rfam 11.0, and over time, more than 30 families were dropped from 20 clans, so that some clans did not have any families at all. The problem has now been fixed and proper clan membership has been restored using Rfam releases from the FTP archive. You can explore Rfam clans and let us know if you have any feedback.
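
The kind of check described above can be sketched as follows. This is not Rfam's actual quality control code, which the post does not show; it is a minimal illustration that reports problems instead of silently removing families, so that a clan losing all of its members would be noticed. The function name `check_clans` and the input layout (a mapping from clan accession to family accessions) are assumptions.

```python
from collections import defaultdict


def check_clans(membership):
    """Report clan membership problems without modifying the input.

    membership: dict mapping a clan accession to an iterable of family accessions.
    Returns (multi, empty), where multi maps each family found in more than
    one clan to the sorted list of those clans, and empty is the sorted list
    of clans that have no families at all.
    """
    # Materialise each family collection once so generators are not exhausted.
    clans = {clan: set(families) for clan, families in membership.items()}

    clans_of = defaultdict(set)
    for clan, families in clans.items():
        for family in families:
            clans_of[family].add(clan)

    multi = {fam: sorted(cs) for fam, cs in clans_of.items() if len(cs) > 1}
    empty = sorted(clan for clan, families in clans.items() if not families)
    return multi, empty
```

Running such a report against each archived release and comparing the results between releases is one way a regression of this kind could be spotted early.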

Other updates

How to access the data

In addition to the Rfam website, you can access the data in the FTP archive and via the API. There is also a public MySQL database introduced in the last release.

What’s next

As well as revisiting Rfam seed alignments, work is underway on the next major Rfam release (13.0) which will be based on a new sequence database built from complete genomes. We plan to make the new data available in late 2017.

Get in touch

We always welcome comments and feedback about Rfam, so feel free to get in touch by email or by submitting a new GitHub issue.

Pfam 30.0 is available

July 1, 2016

Pfam 30.0, our second release based on UniProt reference proteomes, is now available. The new release contains a total of 16,306 families, with 22 new families and 11 families killed since the last release. The UniProt reference proteome set has expanded and now includes 17.7 million sequences, compared with 11.9 million when we made Pfam 29.0. In this release, we have updated the annotations on hundreds of Pfam entries, and renamed some of our Domains of Unknown Function (DUF) families.

DUFs are protein domains whose function is uncharacterised. Over time, as scientific knowledge increases and new data about proteins comes to light, more information about the function of a domain may become available. As a result, DUFs can be renamed and re-annotated with more meaningful descriptions. As part of Pfam 30.0, we have re-annotated 116 DUFs based on updated information in the UniProtKB database, the scientific literature, and feedback from Pfam and InterPro users. Examples of some of our DUF updates in Pfam 30.0 are given below:

  • PF10265, created in release 23.0 and originally named DUF2217, has been renamed to Miga, a family of proteins that promote mitochondrial fusion.
  • PF10229, created in release 23.0 and originally named DUF2246, has been renamed as MMADHC, as it represents methylmalonic aciduria and homocystinuria type D proteins and their homologues.  The structure of this domain is shown below.

Structure of MMADHC dimer, PDB:5CV0

  • PF12822, created in release 25.0 and originally named DUF3816, has been renamed to ECF_trnsprt, since it contains proteins identified as the substrate-specific component of energy-coupling factor (ECF) transporters.

Please note that we may change the identifier for a family (e.g. DUF2217), but we never change the accession for a family (e.g. PF10265).

If you find any more DUFs that can be assigned a name based on function, or any other annotation updates, please get in touch with us (pfam-help@ebi.ac.uk).

 

Rfam 12.1 has been released

April 27, 2016


We are happy to announce a new release of Rfam. Version 12.1, based on the same sequence dataset as Rfam 12.0, features over 20 new families, a new clan competing algorithm, a publicly accessible MySQL database, and many website fixes.
