Does my family of interest have a determined 3D protein structure?

May 9, 2012

Two related questions that we are often asked via the Pfam helpdesk are ‘Which families have a known three-dimensional structure?’ and ‘Why is a particular PDB structure not found in Pfam?’. You may think that there are obvious answers to these questions, but as with many things in life, the answers are not necessarily as straightforward as you would have thought. In this joint posting between Andreas Prlic (senior scientist at RCSB Protein Data Bank) and myself (Rob Finn, Pfam Production Lead), we will elaborate on how the cross-referencing between PDB and Pfam occurs, explain why discrepancies have occurred in the past, and describe the pipeline that the RCSB PDB has implemented using the HMMER web services API, which should provide the most current answers to these questions.

Constructing Pfam-PDB mappings

So how is a PDB entry linked to in Pfam? As most readers know, Pfam is primarily based on the UniProt sequence database. This is our unit of currency, and everything in the database is arranged around ‘families’ and UniProt sequences. In order to form a link to a PDB structure, a polypeptide chain in a PDB file has to be linked to a UniProt sequence. A decade ago there was little, if any, relationship between the sequences contained in UniProt and PDB. However, thanks to an MRC-funded initiative and the will and determination of a few individuals, the linking between UniProt and PDB is now very good, with most sequences with a known protein structure being linked to a UniProt entry.

The linking of entries between PDB and UniProt has been performed by an internal collaboration within the EBI, between PDBe and UniProt, primarily involving Sameer Velankar and Julius Jacobsen. This collaboration, the SIFTS project, does not just provide the overall linking between PDB and UniProt entries, but actually provides a more detailed residue-by-residue mapping between two linked entries (more information on how this is performed is available here). A version of the PDBe Oracle database (formerly known as the MSD) used to be replicated at the Wellcome Trust Sanger Institute and contained the SIFTS residue-by-residue mappings. These residue-by-residue mappings are used to mark up the UniProt sequences in Pfam multiple sequence alignments with secondary structure and to determine a link between a PDB structure and a Pfam family. Pfam has recently switched to using the SIFTS-XML files directly, and the RCSB PDB, as part of the wwPDB, is using the same mapping as well. As you can see, the way Pfam links to PDB is via an indirect mapping to UniProt.
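To make the indirect mapping concrete, here is a minimal sketch in Python of how a Pfam match on a UniProt sequence could be projected onto a PDB chain through a residue-by-residue mapping. It is an illustration of the idea only, not the Pfam or SIFTS code: the function name `project_domain`, the dictionary layout and all coordinates are assumptions, except that the four-residue overlap mirrors the PTH1R_HUMAN/184-187 case raised in the comments on this post.

```python
def project_domain(domain_start, domain_end, uniprot_to_pdb):
    """Return the PDB residue numbers covered by a Pfam match on UniProt.

    uniprot_to_pdb maps UniProt positions to PDB residue numbers for one
    chain, in the spirit of a SIFTS residue-level mapping (hypothetical
    layout, chosen for this sketch).
    """
    return [
        uniprot_to_pdb[position]
        for position in range(domain_start, domain_end + 1)
        if position in uniprot_to_pdb
    ]


# Hypothetical chain: it covers UniProt positions 180-187 and keeps the
# same numbering in the PDB file.
chain_map = {position: position for position in range(180, 188)}

# Hypothetical Pfam match boundaries (184-460) on the UniProt sequence.
covered = project_domain(184, 460, chain_map)
print(covered)
```

An empty list would mean that the chain and the Pfam match do not overlap on the UniProt sequence, so no link between the structure and the family would be made by this route.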

Missing Structures

So why are there missing PDB structures in Pfam? One of the main reasons is that all three databases are in constant flux. PDB is updated every week, UniProt is released every month and Pfam is updated approximately every 6 months. When making the last Pfam release, version 26.0, the version of UniProt used was from June, but the release was not publicly available until December (there were lots of reasons for the delays, but that is for another posting). Even between those two points in time, thousands of new structures had been determined, and now that the release has been out for 5 months, even more have been determined. So, the different periodicity of releases accounts for why some structures are not present in Pfam. Other structures are missed because they simply do not have a corresponding UniProt entry, or because the mapping was not present at the time the release was made. Furthermore, the mapped UniProt accession must also be in the Pfam version of UniProt; the slightly asynchronous acquisition of UniProt and PDBe data leads to further mismatches.

How can we provide a more up-to-date Pfam to PDB mapping?

Well, the absolute answer is to take the current version of PDB and Pfam and run the profile HMMs against the PDB sequences. However, we know that may be a little irksome for users, so we have done this for you! When a new structure is released in the Protein Data Bank, the RCSB PDB now performs a search against Pfam using the HMMER web services API. The PDB SEQRES records are used for this scan. Once the Pfam domain annotations have been calculated, they are mapped onto the PDB ATOM coordinates (the PDB residue numbers), thereby ensuring the atomic coordinates are available. You can access these Pfam-PDB annotations via the RCSB PDB RESTful API in the following way:

Fetch Pfam domains by PDB id:

http://www.rcsb.org/pdb/rest/hmmer?structureId=1cdg

Fetch all Pfam domains in one XML file:

http://www.rcsb.org/pdb/rest/hmmer

Fetch all Pfam domains in a tab-delimited file:

http://www.rcsb.org/pdb/rest/hmmer?file=hmmer_pdb_all.txt
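If you would rather script these requests than paste the addresses into a browser, the three calls above can be sketched in Python as follows. This is only a sketch: the helper names `hmmer_url` and `fetch` are made up for this example, and it assumes the service is reachable at the address quoted in this post.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint exactly as quoted in the post.
BASE_URL = "http://www.rcsb.org/pdb/rest/hmmer"


def hmmer_url(structure_id=None, file_name=None):
    """Build one of the three request URLs listed above (hypothetical helper)."""
    params = {}
    if structure_id is not None:
        params["structureId"] = structure_id
    if file_name is not None:
        params["file"] = file_name
    return BASE_URL + ("?" + urlencode(params) if params else "")


def fetch(url, timeout=30):
    """Download the response body as text (needs network access)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

For example, `fetch(hmmer_url(structure_id="1cdg"))` would request the Pfam domains for a single entry, while `hmmer_url(file_name="hmmer_pdb_all.txt")` gives the address of the tab-delimited file.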

Open source via BioJava

In order to obtain Pfam annotations for newly released protein sequences via the HMMER web site, we contributed a new module to BioJava. It is part of the BioJava web-services module and allows any protein sequence to be submitted to the HMMER RESTful API and the results to be retrieved as a simple list of annotations. It is available as open source and can be used to annotate your favourite pet proteins. There are additional examples of how to access the HMMER API using Perl or Python on the HMMER website. We quite happily annotated 100,000 sequences in a few hours using the HMMER servers.
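As a rough idea of what such a submission looks like outside BioJava, here is a minimal Python sketch that prepares a sequence search request. It is not the BioJava module and not the official HMMER client: the address is a placeholder, and the `seq` and `hmmdb` parameter names and the JSON `Accept` header are assumptions, so please take the real values from the examples on the HMMER website.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Placeholder address (assumption): replace it with the hmmscan address
# given in the HMMER website documentation.
HMMSCAN_URL = "https://example.org/hmmer/search/hmmscan"


def build_hmmscan_request(sequence, hmmdb="pfam", url=HMMSCAN_URL):
    """Prepare, but do not send, a POST request asking for JSON results.

    The 'hmmdb' and 'seq' parameter names are assumptions for this sketch.
    """
    body = urlencode({"hmmdb": hmmdb, "seq": sequence}).encode("ascii")
    return Request(url, data=body, headers={"Accept": "application/json"})


def submit(request, timeout=60):
    """Send the request and decode the JSON reply (needs network access)."""
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Keeping the request construction separate from the network call makes it easy to check what would be sent before annotating a large batch of sequences.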

We have also set up a cron job to pull this data on to the Pfam ftp site. We hope this helps find structures for Pfam entries (and vice versa) as soon as they become available.

Posted by Rob and Andreas

3 Responses to “Does my family of interest have a determined 3D protein structure?”

  1. Shyam Says:

    Hi,

    I downloaded all Pfam domains in tab delimited file via

    http://www.rcsb.org/pdb/rest/hmmer?file=hmmer_pdb_all.txt

    A quick look doesn’t show PF0002 with any structures.

    However, looking online shows 2 structures:
    http://pfam.sanger.ac.uk/family/7tm_2#tabview=tab9

    Could you help me understand what’s going on?

    Thanks,
    Shyam

    • rdfinn Says:

      Dear Shyam,

      The difference is due to the fact that the mappings are obtained in two very different ways. The matches you obtained from RCSB come from running each sequence in the PDB file against the Pfam HMM library. The mappings to PDB on the Pfam website are considerably more detailed, as they are actually a residue-by-residue mapping between the PDB entry sequence and a UniProt sequence; we then cross-reference, or project, our Pfam matches on UniProt onto the PDB structure/sequence. In the case of PF00002 (assuming you mean this accession), the structure 3L2J only represents a subregion of the UniProt sequence. In fact, it only maps to four residues of the UniProt sequence (PTH1R_HUMAN/184-187). In this case the projection approach works, but there is simply insufficient sequence to obtain a significant match using the PDB sequence to HMM approach (i.e. RCSB).

      I hope this helps,

      Rob

