A new version of Pfam-N is available

October 20, 2022

In March 2021, we announced the release of Pfam-N, a file that increases the Pfam coverage of the UniProtKB Reference Proteomes using deep learning. It is the fruit of a collaboration between Pfam and the Google Research team led by Dr Lucy Colwell, with Maxwell Bileschi and David Belanger. Since then, the Google Research team has refined the deep learning methodology to further increase the Pfam coverage of protein sequences. The new version of Pfam-N represents the most significant gain in Pfam coverage ever reported.

Updated deep learning methodology

Building on the methods developed by Max Bileschi and colleagues [1], convolutional neural networks annotate each residue of every sequence in the Pfam database with a Pfam family or clan label. These per-residue labels are then converted into domain calls. Finally, we look for conflicting calls between nonhomologous families or clans and resolve them. A paper on this method is forthcoming; please direct inquiries to mlbileschi@google.com.
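As the paper on this method is still forthcoming, the following is only a minimal sketch of what converting per-residue labels into domain calls, and then resolving conflicts, could look like. It is not the Google Research implementation. The names `DomainCall`, `labels_to_calls` and `resolve_conflicts`, the family labels, the `clan_of` mapping and the "keep the longest call" rule are all hypothetical choices made for illustration.

```python
from itertools import groupby
from typing import Dict, List, NamedTuple, Optional, Sequence


class DomainCall(NamedTuple):
    label: str  # family or clan label (hypothetical names in this sketch)
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def labels_to_calls(labels: Sequence[Optional[str]],
                    min_length: int = 1) -> List[DomainCall]:
    """Merge runs of identical per-residue labels into domain calls.

    None marks an unlabelled residue. min_length is an assumed filter
    for discarding very short runs.
    """
    calls = []
    position = 1
    for label, run in groupby(labels):
        length = len(list(run))
        if label is not None and length >= min_length:
            calls.append(DomainCall(label, position, position + length - 1))
        position += length
    return calls


def overlaps(a: DomainCall, b: DomainCall) -> bool:
    """True if the two calls share at least one residue."""
    return a.start <= b.end and b.start <= a.end


def resolve_conflicts(calls: Sequence[DomainCall],
                      clan_of: Dict[str, str]) -> List[DomainCall]:
    """Assumed rule: visit calls from longest to shortest and drop any call
    that overlaps an already kept call from a different clan.

    Families missing from clan_of are treated as their own clan.
    """
    kept: List[DomainCall] = []
    by_length = sorted(calls, key=lambda c: c.end - c.start + 1, reverse=True)
    for call in by_length:
        clan = clan_of.get(call.label, call.label)
        conflict = any(
            overlaps(call, other)
            and clan_of.get(other.label, other.label) != clan
            for other in kept
        )
        if not conflict:
            kept.append(call)
    return sorted(kept, key=lambda c: c.start)


# Toy input: one label per residue, None where the network assigns no family.
residue_labels = [None, "FAM_A", "FAM_A", "FAM_A", None, "FAM_B", "FAM_B"]
calls = labels_to_calls(residue_labels)
```

Overlapping calls from families in the same clan are kept in this sketch, since only calls between nonhomologous families or clans count as conflicts.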

Pfam coverage

The first version of Pfam-N used Pfam 34.0 as a reference and assigned 6.8 million protein regions to 11,438 Pfam families. These regions included nearly 1.8 million full-length protein sequences from the UniProtKB Reference Proteomes that previously had no Pfam match, an improvement of 4.2% over the 42.5 million sequences annotated at the time. Among the sequences that received their first Pfam annotation were 360 human sequences.

Pfam 35.0 annotates 46.0 million protein sequences, covering 75% of the UniProt Reference Proteomes. The latest version of Pfam-N includes 5.2 million full-length protein sequences from the UniProt Reference Proteomes that previously had no Pfam match, an improvement of 8.5% over the 46.0 million sequences currently annotated, as illustrated in Figure 1.

Figure 1. Percentage of the UniProt Reference Proteomes covered by Pfam and Pfam-N over time.

Together, Pfam and Pfam-N cover 83.7% of the UniProt Reference Proteomes, as shown in Figure 2.

Figure 2. Percentage of the UniProt Reference Proteomes covered by Pfam and Pfam-N.
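The coverage figures for Pfam 35.0 can be cross-checked with a few lines of arithmetic. The sketch below uses only the rounded numbers quoted in this section. The function name `coverage_summary` is hypothetical, and reading the 8.5% gain as a share of all Reference Proteome sequences (rather than as a ratio to the 46.0 million sequences already annotated) is an interpretation of the numbers, not a statement from the post.

```python
def coverage_summary(pfam_annotated: float, pfam_coverage: float, pfam_n_new: float):
    """Derive coverage figures from the rounded numbers quoted in the post."""
    # Implied size of the UniProt Reference Proteomes.
    total = pfam_annotated / pfam_coverage
    # New sequences as a share of all Reference Proteome sequences.
    gain_of_total = pfam_n_new / total
    # New sequences relative to the sequences Pfam already annotates.
    gain_relative = pfam_n_new / pfam_annotated
    # Combined coverage of Pfam plus Pfam-N.
    combined = pfam_coverage + gain_of_total
    return total, gain_of_total, gain_relative, combined


total, gain_of_total, gain_relative, combined = coverage_summary(
    pfam_annotated=46.0e6,  # sequences annotated by Pfam 35.0
    pfam_coverage=0.75,     # share of the Reference Proteomes covered by Pfam 35.0
    pfam_n_new=5.2e6,       # sequences newly annotated by Pfam-N
)

print(f"implied total:     {total / 1e6:.1f} million sequences")
print(f"gain (of total):   {gain_of_total:.1%}")
print(f"gain (relative):   {gain_relative:.1%}")
print(f"combined coverage: {combined:.1%}")
```

With these rounded inputs, the gain as a share of the total comes out at about 8.5% and the combined coverage at about 83.5%. The small gap to the reported 83.7% is plausibly down to rounding in the quoted 75% and 46.0 million.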

Improving Pfam annotations

In our previous post announcing the release of Pfam-N, we highlighted the benefit of using Pfam-N to gain large numbers of additional matches that expand existing Pfam families.

Additionally, Pfam-N can be used to functionally characterise Domains of Unknown Function (DUFs). Neural network annotations can guide us towards very distant, previously undiscovered relationships between DUFs and better-annotated families. For example, DUF5309 (PF17236) has been shown, using RoseTTAFold structural models, to be evolutionarily related to the phage capsid family and has been added to the Pfam clan CL0373: phage-coat.

Furthermore, neural networks can be used to create new families within a clan. They can predict that a protein sequence belongs to a clan even when that sequence is not yet covered by any family. Based on one of these ProtENN predictions, the Alpha/beta hydrolase domain family (PF20408) has been created and added to the Pfam clan CL0028: AB_hydrolase.

Visualising Pfam-N in InterPro

As part of InterPro 91.0, the Google Research team has extended the Pfam-N annotations to all the proteins in UniProt 2020_04, covering 83.6% of UniProtKB.

To increase the visibility of these annotations, we have added them to the Other features section of the protein sequence viewer displayed on protein pages on the InterPro website, as shown in Figure 3.

Figure 3. Pfam-N annotation for UniProt P18207.

Funding

The work to expand Pfam families with Pfam-N hits is funded by the Wellcome Trust as part of a Biomedical Resources grant awarded to the Pfam database.

[1] Bileschi, Maxwell L., et al. “Using deep learning to annotate the protein universe.” Nature Biotechnology (2022): 1-6.

Written by Typhaine Paysan-Lafosse

2 Responses to “A new version of Pfam-N is available”

  1. Jude Wells Says:

    Are the Pfam-N annotations available for bulk download anywhere, e.g. via FTP?

