Two related questions that we are often asked via the Pfam helpdesk are ‘Which families have a known three-dimensional structure?’ and ‘Why is a particular PDB structure not found in Pfam?’. You may think that there are obvious answers to these questions, but as with many things in life, the answers are not necessarily as straightforward as you would have thought. In this joint posting between Andreas Prlic (senior scientist at RCSB Protein Data Bank) and myself (Rob Finn, Pfam Production Lead), we will elaborate on how the cross-referencing between PDB and Pfam occurs, explain why discrepancies have occurred in the past, and describe the pipeline that the RCSB PDB has implemented using the HMMER web services API, which should provide the most current answer to these questions.
Constructing Pfam-PDB mappings
So how is a PDB entry linked to in Pfam? As most readers know, Pfam is primarily based on the UniProt sequence database; this is our unit of currency, and everything in the database is arranged around ‘families’ and UniProt sequences. In order to form a link to a PDB structure, a polypeptide chain in a PDB file has to be linked to a UniProt sequence. A decade ago there was little, if any, relationship between the sequences contained in UniProt and PDB. However, thanks to an MRC-funded initiative and the will and determination of a few individuals, the linking between UniProt and PDB is now very good, with most sequences that have a known protein structure being linked to a UniProt entry.
The linking of entries between PDB and UniProt has been performed by an internal collaboration within the EBI, between PDBe and UniProt, primarily involving Sameer Velankar and Julius Jacobsen. The resulting SIFTS project does not just provide the overall linking between PDB and UniProt entries, but also provides a more detailed residue-to-residue mapping between two linked entries (more information on how this is performed is available here). A version of the PDBe Oracle database (formerly known as the MSD) used to be replicated at the Wellcome Trust Sanger Institute and contained the SIFTS residue-by-residue mappings. These residue-by-residue mappings are then used to mark up the UniProt sequences in Pfam multiple sequence alignments with secondary structure and to determine a link between a PDB structure and a Pfam family. Pfam has recently switched to using the SIFTS-XML files directly, and the RCSB PDB, as part of the wwPDB, is using the same mapping as well. As you can see, the way Pfam links to PDB is via an indirect mapping through UniProt.
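To make the indirect mapping concrete, here is a minimal Python sketch of how a residue-level mapping can carry a Pfam domain, defined on UniProt coordinates, over to a PDB chain. The dictionary layout, the function name and the toy data are all assumptions made for illustration; they are not the SIFTS file format or the code Pfam actually runs.

```python
from typing import Dict, List

# Hypothetical in-memory form of a residue-level mapping for one PDB chain:
# UniProt sequence position -> PDB residue number. PDB residue numbers are kept
# as strings because they can carry insertion codes (e.g. "27A").
ResidueMap = Dict[int, str]


def pdb_residues_for_domain(residue_map: ResidueMap, start: int, end: int) -> List[str]:
    """Return the PDB residue numbers that fall inside a Pfam domain.

    `start` and `end` are 1-based, inclusive UniProt coordinates of the domain.
    UniProt positions with no mapped PDB residue are simply skipped.
    """
    return [residue_map[pos] for pos in range(start, end + 1) if pos in residue_map]


# Toy data: UniProt positions 10-14 map to PDB residues 1-5, with position 12 unmapped.
toy_map: ResidueMap = {10: "1", 11: "2", 13: "4", 14: "5"}

# A (made-up) Pfam domain spanning UniProt residues 11-20.
covered = pdb_residues_for_domain(toy_map, 11, 20)
print(covered)
```

If the returned list is empty, the chain would not be linked to the family at all; how much overlap should count as a link is a policy decision that this sketch deliberately leaves out.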
So why are there missing PDB structures in Pfam? One of the main reasons is that all three databases are in constant flux. PDB is updated every week, UniProt is released every month and Pfam is updated approximately every 6 months. When making the last Pfam release, version 26.0, the version of UniProt used was from June, but the release was not publicly available until December (there were lots of reasons for the delay, but that is for another posting). Even between those two points in time, thousands of new structures had been determined, and now that the release has been out for 5 months, even more have been determined. So, the different periodicity of releases accounts for why some structures are not present in Pfam. Other structures are missed because they simply do not have a corresponding UniProt entry, or because the mapping was not present at the time the release was made. Furthermore, the mapped UniProt accession must also be in the version of UniProt used by Pfam; the slightly asynchronous acquisition of UniProt and PDBe data leads to further mismatches.
How can we provide a more up-to-date Pfam to PDB mapping?
Well, the definitive answer is to take the current versions of PDB and Pfam and run the profile HMMs against the PDB sequences. However, we know that may be a little irksome for users, so we have done this for you! When a new structure is released in the Protein Data Bank, the RCSB PDB now performs a search against Pfam using the HMMER web services API. The PDB SEQRES records are used for this scan. Once the Pfam domain annotations have been calculated, they are mapped onto the PDB ATOM coordinates (the PDB residue numbers), thereby ensuring that the atomic coordinates are available. You can access these Pfam-PDB annotations via the RCSB PDB RESTful API in the following ways:
Fetch Pfam domains by PDB ID:
Fetch all Pfam domains in one XML file
Fetch all Pfam domains in a tab-delimited file
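The step of mapping SEQRES-based domain boundaries onto PDB residue numbers, described above, can be sketched in Python as follows. This is only an illustration under stated assumptions: the list-based alignment, the function name and the choice to clip a domain to its first and last observed residues are ours, and the RCSB PDB pipeline may handle unobserved residues differently.

```python
from typing import List, Optional, Tuple


def map_domain_to_atom_numbering(
    seqres_to_atom: List[Optional[str]], start: int, end: int
) -> Optional[Tuple[str, str]]:
    """Translate a domain on the SEQRES sequence into PDB residue numbers.

    `seqres_to_atom[i]` holds the PDB residue number for SEQRES position i + 1,
    or None when that residue has no atomic coordinates (e.g. a disordered loop).
    `start` and `end` are 1-based, inclusive SEQRES positions.
    Returns (first observed residue, last observed residue) inside the domain,
    or None if no residue in the domain has coordinates.
    """
    observed = [res for res in seqres_to_atom[start - 1:end] if res is not None]
    if not observed:
        return None
    return observed[0], observed[-1]


# Toy chain of nine SEQRES residues; both termini and one loop residue are unobserved.
toy_alignment: List[Optional[str]] = [None, None, "5", "6", "7", None, "9", "10", None]

# A (made-up) domain hit covering SEQRES positions 2-9.
print(map_domain_to_atom_numbering(toy_alignment, 2, 9))
```

Scanning SEQRES rather than the ATOM records means the HMM sees the full sequence that was crystallised, including residues that are missing from the coordinates; the mapping step then reports only the part of the domain you can actually see in the structure.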
Open source via BioJava
In order to obtain Pfam annotations for newly released protein sequences via the HMMER web site, we contributed a new module to BioJava. It is part of the BioJava web-services module and allows any protein sequence to be submitted to the HMMER RESTful API and the results to be retrieved as a simple list of annotations. It is available as open source and can be used to annotate your favourite pet proteins. There are additional examples of how to access the HMMER API using Perl or Python on the HMMER website. We quite happily annotated 100,000 sequences in a few hours using the HMMER servers.
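For readers who prefer Python, the shape of such a submission can be sketched with the standard library alone. Please treat this as a sketch rather than a reference: the endpoint below is a placeholder, and the form field names (`hmmdb`, `seq`), the form encoding and the JSON `Accept` header are assumptions that should be checked against the current HMMER web services documentation. The code only builds the request; it does not send it.

```python
import urllib.parse
import urllib.request

# Placeholder endpoint (an assumption): replace it with the hmmscan URL given in
# the current HMMER web services documentation.
HMMSCAN_URL = "https://hmmer.example.org/search/hmmscan"


def build_hmmscan_request(
    sequence: str, url: str = HMMSCAN_URL, hmmdb: str = "pfam"
) -> urllib.request.Request:
    """Build (but do not send) a POST request for scanning one sequence against Pfam.

    The field names `hmmdb` and `seq` are assumed, not guaranteed; adjust them to
    match whatever the service documentation specifies.
    """
    form = urllib.parse.urlencode({"hmmdb": hmmdb, "seq": sequence}).encode("ascii")
    return urllib.request.Request(
        url,
        data=form,
        headers={"Accept": "application/json"},  # ask for machine-readable results
        method="POST",
    )


# A made-up query sequence in FASTA format.
request = build_hmmscan_request(">query\nMKVLAAGIVGLLLA")
print(request.get_method(), request.full_url)
```

Once the endpoint and field names have been confirmed, the request can be sent with `urllib.request.urlopen(request)` and the response body parsed with the `json` module. If you are annotating many sequences, be kind to the servers and pause between submissions.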
We have also set up a cron job to pull this data onto the Pfam FTP site. We hope this helps you find structures for Pfam entries (and vice versa) as soon as they become available.
Posted by Rob and Andreas