Posts Tagged ‘curation’

Pfam SARS-CoV-2 special update (part 2)

April 6, 2020

This post updates last week’s post. Since the initial release of the 40 Pfam profile HMMs that match SARS-CoV-2, we have produced a set of flatfiles that are more typical of a Pfam release. These files make our updated annotations describing the entries available for download before they are released via the Pfam website. Moreover, you can now use the multiple sequence alignments to investigate the conserved positions across different coronavirus proteins. Figure 1 shows the alignment of the SARS-CoV-2 receptor binding domain (PF09408; N.B. the Pfam website still shows the old alignment).

Figure 1 – Excerpt of the Betacoronavirus spike glycoprotein S1, receptor binding domain alignment (Pfam accession PF09408), rendered using Jalview. The SARS-CoV-2 sequence is the last sequence in the alignment.

Finally, we have made some very minor changes to the family descriptions and one name change from the last release.  You can now access all the updated files here:

ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam_SARS-CoV-2_2.0/

In this directory you can find the updated seed (Pfam-A.SARS-CoV-2.seed) and full (Pfam-A.SARS-CoV-2.full) alignments in Stockholm format, based on the Pfamseq database, which contains the sequences of the UniProt Reference Proteomes. We provide a file with matches to UniProtKB 2019_08 (Pfam-A.SARS-CoV-2.full.uniprot). We also provide a set of alignments for each of the families that include matches to the SARS-CoV-2 sequences, which are not yet present in the Pfamseq database. These alignments can be found in aligned FASTA format here or as a tar gzipped library here.
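For readers who want to look at conserved positions without an alignment viewer, the following is a minimal Python sketch and not part of the Pfam tooling. It assumes the usual Stockholm layout (annotation lines starting with "#", one "name sequence" pair per line, "//" closing each alignment) and scores each column by the fraction of sequences sharing the most common residue. The function names, the toy alignment and the accession PF00000.1 are invented for illustration.

```python
from collections import Counter


def iter_stockholm(lines):
    """Yield (accession, {name: aligned sequence}) for each alignment."""
    accession, seqs = None, {}
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("//"):            # end of one alignment
            yield accession, seqs
            accession, seqs = None, {}
        elif line.startswith("#=GF AC"):     # per-alignment accession line
            accession = line.split()[2]
        elif line.strip() and not line.startswith("#"):
            name, chunk = line.split(None, 1)
            # Interleaved alignments repeat names, so append each chunk.
            seqs[name] = seqs.get(name, "") + chunk.strip()


def column_conservation(seqs):
    """Return (most common residue, fraction of all sequences) per column."""
    result = []
    for column in zip(*seqs.values()):
        residues = [c.upper() for c in column if c not in "-."]
        if not residues:                     # column made only of gaps
            result.append(("-", 0.0))
            continue
        residue, count = Counter(residues).most_common(1)[0]
        result.append((residue, count / len(column)))
    return result


# Toy alignment with a made-up accession, for illustration only.
toy = """# STOCKHOLM 1.0
#=GF AC   PF00000.1
seqA  ACD-
seqB  ACE.
seqC  AGDK
//
"""
alignments = list(iter_stockholm(toy.splitlines()))
accession, seqs = alignments[0]
conservation = column_conservation(seqs)
for position, (residue, fraction) in enumerate(conservation, start=1):
    print(position, residue, round(fraction, 2))
```

To run this on the real data, you could pass an open file handle for Pfam-A.SARS-CoV-2.seed to iter_stockholm instead of the toy string. Gaps count against a column here, which is a design choice rather than a Pfam convention.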

Posted by The Pfam team

Pfam SARS-CoV-2 special update

April 2, 2020

The SARS-CoV-2 pandemic has mobilised a worldwide research effort to understand the pathogen itself and the mechanism of COVID-19 disease, as well as to identify treatment options. Although Pfam already provided useful annotation for SARS-CoV-2, we decided to update our models and annotations for this virus in an effort to help the research community. This post explains what was done and how we are making the data available as quickly as possible.

What have we done?

We assessed all the protein sequences provided by UniProt via its new COVID-19 portal (https://covid-19.uniprot.org/), identified those which lacked an existing Pfam model, and set about building models as required. In some cases we built families based on recently solved structures of SARS-CoV-2 proteins. For example, we built three new families representing the three structural domains of the NSP15 protein (Figure 1) based on the structure by Youngchang Kim and colleagues (http://europepmc.org/article/PPR/PPR115432). In other cases, such as Pfam’s RNA dependent RNA polymerase family (PF00680), we took our existing family and extended its taxonomic range to ensure it included the new SARS-CoV-2 sequences.

Figure 1. The structure of NSP15 (PDB:6VWW) from Kim et al. shows the three new Pfam domains. (1) CoV_NSP15_N (PF19219) Coronavirus replicase NSP15, N-terminal oligomerisation domain in red, (2) CoV_NSP15_M (PF19216) Coronavirus replicase NSP15, middle domain in blue and (3) CoV_NSP15_C (PF19215) Coronavirus replicase NSP15, uridylate-specific endoribonuclease in green.

We have also stratified our ID nomenclature and descriptions of the families to ensure they are both correct and consistent. The majority of the family identifiers now begin with either CoV, for coronavirus specific families, or bCoV for the families which are specific to the betacoronavirus clade, which SARS-CoV-2 belongs to. We have also fixed inconsistencies in the naming and descriptions of the various non-structural proteins, using NSPx for those proteins encoded by the replicase polyprotein, and NSx for those encoded by other ORFs. We are grateful to Philippe Le Mercier from the Swiss Institute of Bioinformatics who gave us valuable guidance for our nomenclature.

Where are the data?

You can access a small HMM library (Pfam-A.SARS-CoV-2.hmm) for all the Pfam families that match the SARS-CoV-2 protein sequences on the Pfam FTP site:

ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam_SARS-CoV-2_1.0/

You can also find a file (matches.scan) showing the matches of the models against the SARS-CoV-2 sequences in the same FTP location. These updates are not yet available on the Pfam website. We anticipate making them available in 6-8 weeks.  We hope you find our SARS-CoV-2 models useful for your research, and as always we welcome your feedback via email at pfam-help@ebi.ac.uk.

How to use this library?

This library is not compatible with the pfam_scan software that we normally recommend to reproduce Pfam matches, as this library only contains a small subset of models.  If you wish to compare these models to your own sequences, please use the following HMMER commands:

$ hmmpress Pfam-A.SARS-CoV-2.hmm

This only needs to be performed once. Then, to compare your sequences (in a file called my.fasta) to this special Pfam profile HMM library, run:

$ hmmscan --cut_ga --domtblout matches.scan Pfam-A.SARS-CoV-2.hmm my.fasta

The --domtblout option enables you to save the matches in a more convenient tabular form, if you do not want to parse the HMMER output.
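If you want to post-process the tabular file, here is a minimal Python sketch of one way to read it. It assumes the standard HMMER3 domain table layout: comment lines starting with "#", then 22 whitespace-separated columns followed by a free-text description. The DomainHit type, the function name and the sample row (including the sequence name my_seq1, the accession version and all the numbers) are invented for illustration and are not real results.

```python
from typing import Iterator, NamedTuple


class DomainHit(NamedTuple):
    family: str       # target name (the Pfam family ID when using hmmscan)
    accession: str    # target accession
    sequence: str     # query name (your sequence)
    score: float      # bit score of this domain
    i_evalue: float   # independent E-value of this domain
    env_from: int     # envelope start on the sequence
    env_to: int       # envelope end on the sequence
    description: str  # free-text description of the family


def parse_domtblout(lines) -> Iterator[DomainHit]:
    """Yield one DomainHit per data row of a --domtblout file."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        # 22 fixed columns; the rest of the line is the description.
        f = line.rstrip("\n").split(None, 22)
        yield DomainHit(
            family=f[0],
            accession=f[1],
            sequence=f[3],
            score=float(f[13]),
            i_evalue=float(f[12]),
            env_from=int(f[19]),
            env_to=int(f[20]),
            description=f[22] if len(f) > 22 else "",
        )


# Invented example row, for illustration only.
sample = [
    "# target name  accession  tlen  query name ...",
    "CoV_NSP15_C PF19215.1 150 my_seq1 - 346 1.2e-60 200.5 0.1 1 1 "
    "3.4e-64 1.2e-60 200.1 0.1 1 150 195 344 195 346 0.99 "
    "Coronavirus replicase NSP15, uridylate-specific endoribonuclease",
]
hits = list(parse_domtblout(sample))
for hit in hits:
    print(hit.sequence, hit.family, hit.env_from, hit.env_to, hit.score)
```

For a real run, you could pass an open handle for matches.scan to parse_domtblout in place of the sample list.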

And finally

We will be making Pfam alignments available during the next week and will produce another blog post describing them.

Posted by The Pfam team

Curation with Dfam: new data and platform updates

March 17, 2020

DNA transposon termini signatures

The Dfam consortium is excited to announce the generation and release of terminal repeat sequence signatures for class II DNA transposable elements. The termini of class II elements are crucial for movement, and as such, can be used to classify de novo DNA transposable element families in new genomic sequences (Figure 1).

Figure 1. Major subgroups of class II DNA transposons.

The LOGOs of the termini can be viewed on the “Classifications” tab on the Dfam website and are organized by class II subclass (e.g., Crypton, Helitron, TIR) (Figure 2). This allows for easy visualization of the base conservation at each position in the terminal sequences and comparisons between the 5’ and 3’ termini (Figure 2). In addition, the termini profiles are available for download as a .HMM file.

Figure 2. Sample termini signature visualization on the Dfam website (www.dfam.org). Base conservation can be seen via the LOGOs of the 5’, 3’ and combined edge (termini) HMMs. The movement type precedes DNA transposons that move via a common mechanism (e.g. “Circular dsDNA intermediate”). The number of families used to generate the LOGOs is indicated, as well as the subclass name (e.g. “Crypton_A”). Additional notes on the termini, when relevant, are also available.

Community data submissions

We have taken the first small step towards a community-driven data curation platform by developing a new data submission system. Initially, this will simplify the process of uploading data to the site for processing by the curators. As we move forward, further aspects of the curation process will be made available to the community. After creating an account and logging in, users can submit files to Dfam using our web-based upload page. Here you will also find information about submission requirements and how different levels of library quality are handled in Dfam.

We are recruiting!

August 7, 2019

We have two biocurator positions available to work on the Pfam and InterPro databases. Come and join our team!

The main role of the jobs will be to:

  • Create and maintain InterPro and Pfam entries through the assessment of protein signature models. This will involve using our curation interfaces and tools (using basic command line).
  • Write descriptive abstracts of protein families and domains, summarizing functional information found within the scientific literature.
  • Augment entries with annotation terms for use in automatic annotation pipelines, for example the use of GO annotations and other data standards.
  • Respond to user and collaborator queries and requests.
  • Help develop and deliver training materials, either in person or via the Train Online platform.

 

Full details can be found on the EBI jobs pages:

https://www.embl.de/jobs/searchjobs/index.php?ref=EBI01490

https://www.embl.de/jobs/searchjobs/index.php?ref=EBI01418

 

If you have any questions, please get in touch.

 

Posted by Jaina and Lorna

Case studies from the list of human regions not in Pfam 27.0.

May 14, 2013

Following on from Jaina and Marco’s blog post last week about conserved human regions not in Pfam, I would like to give you some examples of how we have used the regions identified to improve existing Pfam families and to create new ones. When available, we use three-dimensional structures to guide the boundary definitions of our families. In cases where there is no available structure, either for the protein in question or for other proteins in the same Pfam family, we base boundary decisions on sequence conservation. The following paragraphs give three examples of cases I have looked at recently.

Read the rest of this entry »

Pfam targets conserved human regions

May 7, 2013

Recently, we have been looking at how much of the human proteome is covered by Pfam (release 27.0), and ways in which we can improve this coverage. We have even written an open access paper about it, which you can read here [1] and which is part of the proceedings of the 2013 Biocuration conference. We used the human proteins in UniProtKB/Swiss-Prot [2] (~20,000 sequences) as our human proteome set, and found that while most of the sequences in this set have some Pfam annotation (90% have at least one Pfam domain), there is still much ground to cover before we have a complete map of all (conserved) human regions (HRs). Here, rather than repeating what we presented in the paper (did we mention it is open access? :-)), we would like to tell you more about the impact this study is having on our strategies for selecting target regions to be added to Pfam.

Read the rest of this entry »

What are these new families with _2, _3, _4 endings?

January 19, 2012

Some users have been contacting us about the new families that have appeared in Pfam release 26.0.

As pointed out by one of our users:

Pfam v26 includes, in addition to DDE_Tnp_1, the following new families:

DDE_Tnp_1_2
DDE_Tnp_1_3
DDE_Tnp_1_4
DDE_Tnp_1_5
DDE_Tnp_1_6
DDE_Tnp_1_7

These extra new families, named name_2, name_3, name_4 and so on, have been constructed to increase the coverage of Pfam. Many of our existing large, diverse families are not well modelled by a single HMM, and there are many true members that are not matched. By building multiple models we can match more sequences. Each of these models will be in the same Pfam clan, the RNaseH clan in this case. For the most part these models do not represent any particular subfamily or classification group. Essentially, you should think of a match to any of the above seven DDE_Tnp_1 families as being the same thing. Because of the way Pfam is built, any particular region of a protein may only belong to one of these families. We have a step in building clans called competition, which means that if a region of a protein matches both DDE_Tnp_1 and DDE_Tnp_1_2, for example, then the region will be assigned to the family with the highest score. This means that a match to DDE_Tnp_1 in release 25.0 may now end up in a different family such as DDE_Tnp_1_2. You shouldn’t read too much into these changes.
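To make the competition idea concrete, here is a simplified Python sketch. It is not Pfam's actual implementation: it assumes that two matches compete whenever they are on the same sequence, belong to the same clan and overlap by at least one residue, and it resolves them greedily by bit score. The Match type, the function names and the example coordinates and scores are all invented for illustration.

```python
from typing import List, NamedTuple


class Match(NamedTuple):
    sequence: str   # protein identifier
    family: str     # Pfam family ID
    clan: str       # clan the family belongs to
    start: int      # first residue of the match (inclusive)
    end: int        # last residue of the match (inclusive)
    score: float    # bit score


def overlaps(a: Match, b: Match) -> bool:
    """True if two matches share at least one residue on the same sequence."""
    return a.sequence == b.sequence and a.start <= b.end and b.start <= a.end


def resolve_clan_competition(matches: List[Match]) -> List[Match]:
    """Keep only the highest-scoring match among overlapping same-clan hits."""
    kept: List[Match] = []
    # Visit the best-scoring matches first, so a weaker overlapping match
    # from the same clan is always the one that gets dropped.
    for m in sorted(matches, key=lambda m: m.score, reverse=True):
        if not any(m.clan == k.clan and overlaps(m, k) for k in kept):
            kept.append(m)
    return sorted(kept, key=lambda m: (m.sequence, m.start))


# Invented example: two overlapping clan members plus one separate region.
candidates = [
    Match("seq1", "DDE_Tnp_1", "RNaseH", 10, 200, 50.0),
    Match("seq1", "DDE_Tnp_1_2", "RNaseH", 15, 210, 80.0),
    Match("seq1", "DDE_Tnp_1_3", "RNaseH", 300, 400, 30.0),
]
winners = resolve_clan_competition(candidates)
for m in winners:
    print(m.sequence, m.family, m.start, m.end, m.score)
```

In this toy case the region at 10-210 is assigned to DDE_Tnp_1_2 because it scores higher than DDE_Tnp_1, while the non-overlapping DDE_Tnp_1_3 match is unaffected.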

Many of these new families are appearing in Pfam release 26.0 because of a change in strategy in how we build new Pfam families. The new strategy consists of taking complete genomes and using each protein that does not match Pfam as a starting point for a Jackhmmer search. Jackhmmer is an iterative search tool like PSI-BLAST. If the Jackhmmer search finds lots of homologues but has some overlaps with an existing family, then we may build one of these new additional families to increase coverage of known sequences. Rather than give these families completely new names, we simply call them the same as the existing family and append a number to show that they are closely related to each other.

 

Posted by Alex

Have we found all the protein families yet?

November 22, 2011

Since starting to work on Pfam all those years ago I have been obsessing about when the job might be finished. I have plans for retiring and learning to play golf or something similar. Cyrus Chothia published a paper saying that the majority of proteins in Nature came from no more than 1,000 families (Chothia 1992). However, we now have over 14,000 families in Pfam. To work out how close we are to finishing Pfam we measure our coverage of known proteins, that is, the fraction of proteins in the database that have a hit to a Pfam family. Currently we are close to 80% coverage of known sequences in UniProt.
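As a small illustration of the coverage measure described above, here is a minimal Python sketch. It assumes you already have the set of all sequence identifiers and the set of identifiers with at least one Pfam hit. The function name and the toy identifiers are invented for illustration.

```python
def sequence_coverage(all_ids, ids_with_hit):
    """Fraction of sequences that have at least one Pfam match."""
    all_ids = set(all_ids)
    if not all_ids:
        raise ValueError("coverage is undefined for an empty sequence set")
    # Ignore hits to identifiers that are not in the reference set.
    return len(all_ids & set(ids_with_hit)) / len(all_ids)


# Toy data: five sequences, four of which have a Pfam hit.
database = ["p1", "p2", "p3", "p4", "p5"]
matched = ["p1", "p2", "p4", "p5"]
coverage = sequence_coverage(database, matched)
print(f"{coverage:.0%} of sequences have a Pfam match")
```

This counts whole sequences only; it says nothing about what fraction of each sequence's residues is covered.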

Lately I have been wondering about that missing 20%. I see three possibilities:

  1. Sequences should be part of new families
  2. Sequences are missing members of existing families
  3. Sequences are incorrect gene predictions and never expressed

To investigate this, one can take all the proteins in a genome that do not already hit Pfam and use each of these sequences as a seed for an iterative Jackhmmer search. Some proteins find thousands of hits, many of which belong to existing families (case 2), and many of them have tens to hundreds of matches and do not match any known family (case 1). Some of the searches find only the sequence itself, and perhaps these are candidates for spurious proteins (case 3). The graph below shows the results of such an analysis for one bacterial genome.

Graph showing the number of Jackhmmer hits (y-axis) against the unmatched proteins in a genome. Proteins are ordered along the x-axis, with those whose searches hit the most proteins at the right.

On the right hand side of the graph we can see the 300 or so proteins that should be parts of existing families. But the large majority of proteins do not appear to be part of existing families and only consist of a red bar. That suggests that we are still far from getting a complete list of all known protein families (if such a thing is possible). So the search continues … and the golf course must wait.

Posted by Alex

References

Chothia C. Proteins. One thousand families for the molecular biologist. Nature. 1992;357:543-544.

Naming by numbers

July 21, 2010

A user recently asked us why two highly similar sequences that contain a PAS domain are in different Pfam families within the PAS clan. The PAS domain clan (CL0183) currently contains seven different families: PAS, PAS_2, PAS_3 and so on up to PAS_6, as well as the MEKHLA family. We thought we would take the opportunity to explain some of the rationale behind the way in which we construct and name our families and clans. Read the rest of this entry »

Renaming of Transposases in Pfam

June 21, 2010

A few weeks ago the Pfam team was visited by the curators of the ISfinder resource, a specialist database that classifies eubacterial and archaeal transposases and insertion sequences (ISs). The ISfinder database has a very different role to Pfam: it focuses on this specific set of biological sequences and is the naming authority in the field for insertion sequences. Pfam’s role is not to name transposases, but to identify the domains contained within these sequences. Below we describe the outcomes of this meeting. Read the rest of this entry »