Pfam website decommission

August 4, 2022

After more than 20 years of good and faithful service, we have decided to retire the Pfam website. Do not worry though, we are still planning to do Pfam releases and the data will still be available. 

As you can imagine this wasn’t an easy decision, and be sure it wasn’t taken lightly. The Pfam website codebase was first released over 20 years ago, and although it has been updated from time to time, some of its core functionality still dates back to its origins. There is a lot of technical debt in its current state and it is only becoming harder to maintain. 

Currently, for every release, we spend more time generating data used exclusively by the website than generating the core data of Pfam: its alignments and models. Additionally, our team is too small to execute all the Pfam release procedures on a consistent basis.

Retiring the website will allow us to focus our efforts on producing the core of Pfam. The plan is then to leave the deployment and visualisation tasks to the InterPro website, which was redesigned in recent years using up-to-date technologies, including a modern framework (React).

The Pfam data and different viewing features are already available on the InterPro website. For example, searching for a Pfam accession (e.g. PF05093) using the InterPro text search will take you to the corresponding Pfam entry page, where the menu on the left-hand side gives access to different datasets related to the entry, as shown in the figure below.

Example of Pfam entry page in the InterPro website (PF05093)
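
As a complement to the text search described above, the sketch below builds direct links to a Pfam entry from its accession. The two URL patterns are assumptions about the current layout of the InterPro website and its REST API, not something stated in this post, and `pfam_entry_urls` is a hypothetical helper name; please check the InterPro documentation before relying on them.

```python
import re

# ASSUMPTION: URL layout of the InterPro website and REST API.
# Verify against the InterPro documentation before depending on it.
INTERPRO_SITE = "https://www.ebi.ac.uk/interpro/entry/pfam/{acc}/"
INTERPRO_API = "https://www.ebi.ac.uk/interpro/api/entry/pfam/{acc}/"

# Pfam accessions look like PF05093: "PF" followed by five digits.
PFAM_ACCESSION = re.compile(r"PF\d{5}")


def pfam_entry_urls(accession):
    """Return (web page URL, API URL) for a Pfam accession (hypothetical helper)."""
    acc = accession.strip().upper()
    if not PFAM_ACCESSION.fullmatch(acc):
        raise ValueError(f"not a Pfam accession: {accession!r}")
    return INTERPRO_SITE.format(acc=acc), INTERPRO_API.format(acc=acc)


page_url, api_url = pfam_entry_urls("pf05093")
print(page_url)
print(api_url)
```

The accession check rejects anything that is not a Pfam identifier (for example an Rfam accession such as RF00001) before a link is built.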

The correspondence between the Pfam menu and InterPro menu is given in the table below.

Pfam website tab | InterPro website tab
Clan | Available in Overview, Set
Domain organisation | Domain Architectures
HMM logo | Signature
Trees | Taxonomy (tree icon)
Curation & model | Curation
AlphaFold structures | AlphaFold

You can also browse through the different Pfam families and clans (called Set in InterPro) using the InterPro Browse feature.

In the Overview tab of the Set pages in InterPro, the different members of the set (nodes) and the relationships between them (lines) are displayed in a graph (this corresponds to the Relationship tab in the Pfam website). The size of each node is proportional to the number of proteins in the Pfam entry. The graph can be customised to display the Pfam accession, short name and/or name. Other tabs include Entries (equivalent to the Members section in the Summary tab in Pfam), Proteins, Structures, Taxonomy (equivalent to the Species tab in Pfam), Proteomes and Alignments. Additionally, the Proteins tab in InterPro lists all the proteins matched by the different Pfam entries included in a set.

We are aware that not all Pfam users are familiar with the InterPro website interface, so the decommissioning will be phased over several months, starting on October 5th 2022. On that date, we will start redirecting traffic from the Pfam website to the InterPro website. The Pfam website will remain available until January 2023, when it will be decommissioned. We are also going to organise a webinar to show you where to find the Pfam annotations in InterPro, so stay tuned and check our Twitter accounts (@PfamDB/@InterProDB) to register.

If you have any requests, feedback or suggestions on ways to improve Pfam data visualisation in InterPro please contact us through the InterPro helpdesk.

Written by Typhaine Paysan-Lafosse.

Rfam release 14.8

May 30, 2022

We are happy to announce the release of Rfam version 14.8. This release includes 48 updated and 25 new microRNA families; 10 families updated based on 3D structure annotations; 4 new families and updates to 5 existing families for Hepatitis C virus; a new xrRNA family from the Potato leafroll virus; and the integration of LitScan, a literature scanner powered by RNAcentral and Europe PubMed Central. Read on for details on these changes.

Updated microRNA families

Rfam, miRBase and RNAcentral have been working to synchronize miRNA families between all three resources. We are now happy to report that we have completed 77% of the current miRNA families, covering more than 1,300 of the 1,700 alignments provided to us by Professor Sam Griffiths-Jones at miRBase. The 400 remaining families need an extended review process, and we will be working on their integration in future releases.

Summary of the miRBase and Rfam synchronisation project, which we estimate to be 78% complete.

Families updated with 3D structure information

Rfam has been updating families using 3D structure information. This project aims to improve Rfam families through the addition of pseudoknots, base pairs, and annotations of other structural elements by inspecting 3D structures. In this release we have updated 10 families:

  • Virus families:
    • RF00507-Coronavirus frameshifting stimulation element
    • RF01047-HBV RNA encapsidation signal epsilon
  • Riboswitch families:
    • RF01763-Guanidine-III riboswitch, also known as ykkC-III riboswitch
    • RF01734-Fluoride riboswitch
    • RF01704-Glutamine-II riboswitch, previously known as Downstream peptide
    • RF01750-ZMP/ZTP riboswitch
    • RF01739-Glutamine riboswitch
    • RF02683-NiCo riboswitch
    • RF01826-SAM-V riboswitch
  • Ribozyme family:

We have added pseudoknots to 7 of the 10 updated families and the updated secondary structure diagrams from 5 of these families are shown below.

Examples of families reviewed and updated with 3D information. Pseudoknot structures (pk) were added to each of these five families based on a review of the corresponding 3D structures.

New and updated families of Hepatitis C virus

In release 14.8 we have created 4 new families, updated 5 existing families, and deleted 4 virus families. These changes are the result of the ongoing collaboration between Professor Manja Marz of the European Virus Bioinformatics Center and Rfam. The Marz group provided Rfam with a curated alignment of representative sequences for the entire Hepatitis C virus genome. We used this alignment to create new Rfam families and to update or remove existing ones. The families are summarized in the table below. We have deleted RF00469, which was merged into RF00260 during review. We have also deleted families RF02585 to RF02588, which have no support in the genomic alignment.

Rfam ID | Name | Description
RF00061 | IRES_HCV | Hepatitis C virus internal ribosome entry site
RF00260 | HepC_CRE | Hepatitis C virus (HCV) cis-acting replication element (CRE)
RF00620 | HCV_ARF_SL | Hepatitis C alternative reading frame stem-loop
RF00468 | HCV_SLVII | Hepatitis C virus stem-loop VII
RF00481 | HCV_X3 | Hepatitis C virus 3’X element
RF04218 | HCV_5BSL1 | Hepatitis C virus stem-loop I
RF04219 | HCV_J750 | J750 non-coding RNA (containing SL761 and SL783)
RF04220 | HCV_SL588 | SL588 non-coding RNA
RF04221 | HCV_SL669 | SL669 non-coding RNA

As part of this project we have reviewed and updated Coronavirus, Flavivirus and HCV families, and we are working on adding RNA families from other viruses, such as Filoviridae (e.g. Ebolavirus) and Rhabdoviridae (e.g. Rabies viruses).

xrRNAs in Potato leafroll virus

We want to thank Professor Quentin Vicens for sharing the alignment of Potato leafroll virus exoribonuclease-resistant RNA (PLRV-xrRNA). PLRV-xrRNA is a non-coding RNA that blocks the progression of 5′ to 3′ exoribonucleases using only a folded RNA element. This family is described in RF04222.


RNAcentral has recently developed LitScan, a tool to automatically connect non-coding RNA sequences, genes and families to the literature that discusses them. In this release we have integrated the LitScan widget into Rfam. The widget is now shown in the new ‘Publications’ tab on all Rfam families.

Example of LitScan for the mir-17 microRNA precursor family. Publications can be sorted by citation, journal, year of publication and other criteria.

Please reach out to us with feedback on the widget, or if you would like to use the LitScan widget on your site!

Dfam 3.6 release

April 21, 2022

We are pleased to announce the latest data release of the Dfam database! This latest release approximately doubles the number of species from the Dfam 3.5 release (595 to 1,109), and increases the number of transposable element (TE) families by ~2.5x (285,542 to 732,993). A more detailed summary of the species included can be seen in Table 1, and in the Dfam 3.6 release notes.
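
The growth figures quoted above can be checked with a few lines of arithmetic. This is only an illustrative sketch; the variable names are ours, and the counts are the ones given in the announcement.

```python
# Counts quoted in the announcement above.
species_3_5, species_3_6 = 595, 1_109
families_3_5, families_3_6 = 285_542, 732_993


def fold_change(old, new):
    """Ratio of the new count to the old count."""
    return new / old


species_fold = fold_change(species_3_5, species_3_6)
families_fold = fold_change(families_3_5, families_3_6)

print(f"species:  {species_fold:.2f}x")   # 1.86x, i.e. roughly double
print(f"families: {families_fold:.2f}x")  # 2.57x, i.e. about 2.5x
print(f"new families: {families_3_6 - families_3_5:,}")  # 447,451
```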

Community-submitted libraries

A huge thank you to the TE community for submitting your data to us! In this release, we have: 1) 3,360 curated rice weevil TE models, submitted by Clément Goubert and Rita Rebollo [1]; 2) 22 SINE families obtained from 15 moth species (Lepidoptera insects), submitted by Guangjie Han et al. [2]; 3) 120 Penelope-classified families, submitted by Rory Craig et al. [3]; and 4) 41 repeat families generated as part of the T2T human assembly project [4], not including the 22 “composite” repetitive families, which will be available as part of a later Dfam release. To read more about the studies associated with these submissions, please see the references below.

Rice weevil: an agricultural pest

Background, as given in the paper: The rice weevil Sitophilus oryzae is one of the most important agricultural pests, causing extensive damage to cereal in fields and to stored grains. S. oryzae has an intracellular symbiotic relationship (endosymbiosis) with the Gram-negative bacterium Sodalis pierantonius and is a valuable model to decipher host-symbiont molecular interactions. In the paper (see below), the authors show that many TE families are transcriptionally active, and changes in their expression are associated with insect endosymbiotic state.

Moth SINEs: high diversity

Conclusions, as given in the paper: Lepidopteran insect genomes harbor a diversity of SINEs. The retrotransposition activity and copy number of these SINEs varies considerably between host lineages and SINE lineages. Host-parasite interactions facilitate the horizontal transfer of SINE between baculovirus and its lepidopteran hosts.

Penelope elements: far-reaching impacts

The authors investigate the Penelope-like element (PLE) content of a wide variety of eukaryotes. In the paper's own words: This paper uncovers the hitherto unknown PLE diversity, which spans all eukaryotic kingdoms, testifying to their ancient origins.

T2T entries: previously hidden genomic content

A new human genome assembly has been released! The new assembly (T2T or chm13) has sequenced and assembled the remaining 10% of the human genome that was previously unattainable. The entries described in the manuscript are part of this newly-analyzed sequence.

EBI libraries

In collaboration with the European Bioinformatics Institute (EBI), we processed and imported RepeatModeler runs on 444 additional species, resulting in the addition of 440,543 families. Additional extension and re-classification steps were run on each model, and the final consensus sequences and HMMs were produced. Please note that the relationship data is not available for these uncurated imports at this time.

References associated with community submissions

[1] Parisot, N., et al. (2021). The transposable element-rich genome of the cereal pest Sitophilus oryzae. BMC Biology 19(1), 241.
[2] Han, G., et al. (2021). Diversity of short interspersed nuclear elements (SINEs) in lepidopteran insects and evidence of horizontal SINE transfer between baculovirus and lepidopteran hosts. BMC Genomics 22(1), 226.
[3] Craig, R. J., et al. (2021). An Ancient Clade of Penelope-Like Retroelements with Permuted Domains Is Present in the Green Lineage and Protists, and Dominates Many Invertebrate Genomes. Molecular Biology and Evolution 38(11), 5005–5020.
[4] Hoyt, S. J., et al. (2022). From telomere to telomere: The transcriptional and epigenetic state of human repeat elements. Science 376(6588), eabk3112.

Rfam release 14.7

December 21, 2021

We are happy to announce the latest Rfam release, version 14.7. The release includes 121 updated microRNA families, 4 new families, and a redesigned Rfam-PDB mapping pipeline that provides weekly updates as new RNA 3D structures become available. Read on to find out more or explore the data in Rfam.

Updated microRNA families

As part of the Rfam-miRBase synchronisation project discussed in the Rfam 14 paper, we continue revising microRNA families in Rfam using the data provided by miRBase. This release includes 121 updated families, such as mir-6 (RF00143) and mir-22 (RF00653). The following five microRNA families have been deleted from Rfam as the corresponding entries were removed from miRBase due to lack of evidence: mir-1937 (RF01942), mir-1280 (RF02013), mir-353 (RF00800), mir-720 (RF02002), and mir-2973 (RF02096). We would like to thank Lisanne Knol (University of Edinburgh) for bringing the first two of these families to our attention.

We estimate that Rfam is now approximately 60% in sync with miRBase, with additional families to be released in future versions of Rfam. You can view the full list of updated families here or browse all microRNAs in Rfam.

New families

The release includes two new hairpin ribozyme families, Hairpin-meta1 (RF04190) and Hairpin-meta2 (RF04191), recently reported by the Weinberg lab. The hairpin ribozymes were discovered in metatranscriptome data and are proposed to occur in circular RNA genomes of as yet uncharacterised organisms. The new families join the original Hairpin ribozyme family (RF00173).

Based on a recent paper we also created two additional bacterial families, the icd-II ncRNA motif (RF04189) and the carA ncRNA motif (RF04192). We would like to thank Ken Brewer (Yale University) for providing the data.

Weekly updates of PDB structures matching Rfam families

For many years Rfam maintained a mapping between Rfam families and experimentally determined RNA 3D structures available in PDB. However, this mapping lagged behind the weekly PDB updates as it was only updated with Rfam releases. 

The newly implemented pipeline analyses the data every week and makes the data available on the Rfam website and in a new section of the FTP archive that contains a preview of the upcoming release. The new, up-to-date mapping also improves the ability to search PDBe using Rfam and is a key part of an ongoing project to review all Rfam families with known 3D structures. 

Currently there are 127 Rfam families with experimentally determined RNA 3D structures in the PDB. For example, a recent paper describing how a viral RNA hijacks host machinery produced several structural models (7SAM, 7SC6, and 7SCQ) showing the pseudoknot of a tRNA-like structure; these are now mapped to Rfam family RF01084. Follow Rfam on Twitter to be the first to hear when new RNA families are linked to 3D structures.

Other improvements

  • We continue improving the Gene Ontology (GO) terms associated with Rfam families, and in this release 412 families have been updated to use the latest GO terms. Keeping the GO terms up to date is important as Rfam is used for automatic assignment of GO terms in RNAcentral and other resources. The GO terms are shown in the Curation tab of each family and are also available in the rfam2go file.
  • The Rfam.seed_tree.tar.gz file hosted on the FTP archive has been fixed. We would like to thank Christian Anthon (University of Copenhagen) for reporting the problem.

Get in touch

As always, we look forward to hearing from you if you have any feedback or suggestions for Rfam. Please feel free to email us or get in touch on Twitter.

Happy holidays from the Rfam team!

This is the first release produced by Emma Cooke and Blake Sweeney who have joined Rfam in the second half of 2021. The Rfam team wishes you a happy holiday season! We look forward to creating lots more families in 2022 and working towards the next major Rfam release, Rfam 15.0!

Pfam 35.0 is released

November 19, 2021

Pfam 35.0 contains a total of 19,632 families and clans. Since the last release, we have built 460 new families, killed 7 families and created 12 new clans. UniProt Reference Proteomes has increased by 7% since Pfam 34.0, and now contains 61 million sequences. Of the sequences that are in UniProt Reference Proteomes, 75.2% have at least one Pfam match, and 48.7% of all residues fall within a Pfam family.
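
To put the percentages above into absolute numbers, here is a small back-of-the-envelope calculation. It is a sketch only: the 61 million figure is rounded in the post, so the results are approximate, and the variable names are ours.

```python
# Figures quoted in the paragraph above ("61 million" is a rounded value).
reference_proteome_sequences = 61_000_000
fraction_with_pfam_match = 0.752   # 75.2% of sequences have at least one match
growth_since_pfam_34 = 0.07        # UniProt Reference Proteomes grew by 7%

# Approximate number of sequences with at least one Pfam match.
matched_sequences = reference_proteome_sequences * fraction_with_pfam_match

# Approximate size of UniProt Reference Proteomes at the time of Pfam 34.0.
previous_size = reference_proteome_sequences / (1 + growth_since_pfam_34)

print(f"sequences with a Pfam match: ~{matched_sequences / 1e6:.1f} million")
print(f"size at Pfam 34.0:           ~{previous_size / 1e6:.1f} million")
```

This suggests roughly 45.9 million matched sequences, and a reference set of about 57 million sequences at the time of Pfam 34.0.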

Sources of new families

In an effort to increase the Pfam coverage of metagenomic sequence space, we have created 250 metagenomic protein families. These families were built by clustering protein sequences from the MGnify and UniProt databases, aligning the sequences in each cluster, and using the resulting alignments to create new SEED alignments. We then used our usual building process to create new families from the SEED alignments.

We have also created 52 new families based on clusters from a new resource called DPCfam, which is based on Density Peak Clustering and was created by Alessandro Laio, Marco Punta and Elena Tea Russo. An interesting example of these families is the N-terminal domain of the Crinkler effector protein (PF20147). Crinkling- and necrosis-inducing proteins (CRNs), or Crinklers, were first described in plant pathogenic oomycetes, where they are ubiquitous, and have been shown to participate in processes controlling plant cell death and immunity. However, Crinkler is also found outside oomycetes, such as in the Rhizophagus irregularis crinkler effector protein 1 (RiCRN1) which, like other CRNs, functions in the plant nucleus, but plays an essential role in symbiosis progression and the proper initiation of arbuscule development. This suggests that Crinkler proteins are more widely distributed than first predicted, and that their function is not limited to plant death (PMID:30233541). The Pfam domain contains the conserved motif FLAK and, from structure predictions, adopts the ubiquitin-like fold, as seen in the image below.

Figure 1: N-terminal domain of RiCRN. The image was generated using an AlphaFold colab notebook and is displayed using Molstar.

We continue to be provided with new families from the group of L. Aravind from NCBI, and have added 42 of them to this release of Pfam. Many of these families represent novel domains and proteins found in phage defence systems of bacteria.


We are really excited about the Pfam-N matches for Pfam 35.0, but there is still a bit of work to do before we can release them. In particular, we’re working on neural networks that can predict the location of the domains themselves, instead of relying on HMMER to do so, as with the previous Pfam-N release. It will take a little more time to ensure the quality of these annotations, and we will make another announcement when they are ready.

Enjoy Pfam 35.0!

Posted by the Pfam team

Dfam 3.5 release

October 11, 2021

We are pleased to announce the Dfam 3.5 release, which includes both new annotation data (available for download) and additional TE (transposable element) models and species.

TE annotation data

As part of this release, we have added annotation data for the curated TE entries for all of the extant species as part of the Zoonomia project (Figure 1). These data were curated by the combined work of David Ray’s lab at Texas Tech University (TTU) as well as the Smit group at the Institute for Systems Biology (ISB), and include young, lineage-specific TE models.

Figure 1: Example of the genomic annotation for a Dfam TE model.

TE models

271 lineage-specific, curated LTR TE models for the reconstructed ancestor of New World monkeys have also been added as part of the Zoonomia project. Additional uncurated entries (DR records) have been added for the duckweed (607 models) and Atlantic cod (2751 models) from TE families submitted to Dfam via the website interface. The next Dfam release will include additional submitted datasets. With the addition of these new families, Dfam now houses 285,542 TE models across 595 species (Figure 2; Figure 3). We look forward to the continued growth of Dfam!

Figure 2: Dfam model growth. Numbers above each bar indicate the number of total models in Dfam at the time of the indicated release.
Figure 3: Dfam species growth. Numbers above each bar indicate the number of total species in Dfam at the time of the indicated release.

Welcome Blake as the new RNA Resources Project Leader

September 13, 2021

RNAcentral and Rfam recently completed the search for a project leader to succeed Anton Petrov. Blake Sweeney has been appointed and recently started in his new role. Some of you may know Blake as the current RNAcentral bioinformatician, a role in which he has been running the bioinformatic pipeline and speaking at conferences. With a PhD in RNA bioinformatics and a decade of experience developing RNA databases, including 4.5 years at RNAcentral, Blake is perfectly positioned to take the projects forward. The official handover date is in May, but until then Anton and Blake will be working together to ensure a smooth transition.

Additionally, we are now hiring a new RNAcentral bioinformatician. If you have any questions or comments about the RNA resources, please contact Blake Sweeney.

Rfam 14.6 is out

July 27, 2021

We are happy to announce a new release of Rfam (14.6) that includes 121 new microRNA families, a new ribozyme family, 8 new small RNA families found in Bacteroides, as well as 10 additional families with updated secondary structures using 3D structural information. Read on for more information or explore the data in Rfam.

New microRNA families

The new release includes 121 new microRNA families bringing the total number of microRNA families in Rfam to 1,506. This work is part of the ongoing collaboration with miRBase that aims to synchronise microRNAs across miRBase, Rfam, and RNAcentral. Browse Rfam microRNAs or find out more about the microRNA project.

We also resolved an issue with 6 microRNA families that were missing a covariance model on the website and in the FTP archive. Many thanks to Dr Christian Anthon (University of Copenhagen) for pointing out this problem!

Updating families using information from 3D structures

Following on from Rfam 14.5, we updated the secondary structure of 10 additional families with 3D information, including 6 riboswitches, 1 ribozyme, 1 telomerase, 1 localization element and 1 microRNA precursor.

In some families, the updated structure is substantially changed. For example, the central part of the flavin mononucleotide (FMN) riboswitch is now organised by several additional base pairs and two pseudoknots (pK). As a result, the updated structure is more compact and more accurately reflects the experimentally determined 3D structures.

Seven of the updated families include newly annotated pseudoknots, which is an important improvement that helps better model long-distance non-nested interactions. We will continue reviewing and updating the families with 3D structure in future releases. The full list of the updated families can be found in the table below.

Family | PDB structures | New pK
RF00008 – Hammerhead ribozyme (type III) | 2QUS_A, 2QUS_B, 2QUW_A, 2QUW_B, 2QUW_C, 2QUW_D, 5DI2_A, 5DI4_A, 5DQK_A, 5EAO_A, 5EAQ_A | 1
RF00025 – Ciliate telomerase RNA | 6D6V | 2
RF00050 – FMN riboswitch (RFN element) | 3F2Q, 3F2T, 3F2W, 3F2X, 3F2Y, 3F3O | 2
RF00059 – TPP riboswitch (THI element) | 2CKY_A, 2CKY_B, 2GDI_X, 2GDI_Y, 2HOJ_A, 2HOK_A, 2HOL_A, 2HOM_A, 2HOO_A, 2HOP_A, 3D2G_A, 3D2G_B, 3D2V_A, 3D2V_B, 3D2X_A, 3D2X_B, 3K0J_E, 3K0J_F, 4NYA_A, 4NYA_B, 4NYB_A, 4NYC_A, 4NYD_A, 4NYG_A |
RF00207 – K10 transport/localisation element (TLS) | 2KE6, 2KUR, 2KUU, 2KU, 2KUW |
RF00174 – Cobalamin riboswitch | 4GMA, 4GXY | 1
RF00380 – ykoK leader / M-box riboswitch | 2QBZ_X, 3PDR_X, 3PDR_A | 1
RF01689 – AdoCbl variant RNA | 4FRN_A, 4FRN_B, 4FRG_X, 4FRG_B | 1
RF01831 – THF riboswitch | 3SD3, 3SUH, 3SUX, 3SUY, 4LVV, 4LVW, 4LVX, 4LVY, 4LVZ, 4LW0 | 1
RF02095 – mir-2985-2 microRNA precursor | 2L3J |

Hovlinc ribozyme

A recent paper by Chen Y et al. 2021 describes Hovlinc, a new type of self-cleaving ribozyme found in humans and other hominids. Hovlinc was detected in a very long intergenic noncoding RNA in hominids (hominin vlincRNA-located) using a genome-wide approach designed to discover self-cleaving ribozymes. The functions of vlincRNA and the hovlinc ribozyme remain unclear. Hovlinc joins 3 known classes of small, self-cleaving ribozymes found in humans: (1) Mammalian CPEB3 ribozyme, (2) Hammerhead ribozyme and (3) B2 and ALU retrotransposons. We would like to thank Dr Fei Qi (Huaqiao University) for providing the Hovlinc alignment. View the hovlinc family in Rfam.

New Bacteroides families

In a recent article by Ryan et al. 2020, the authors report a high-resolution transcriptome map of the model organism Bacteroides thetaiotaomicron, a common bacterium of the human gut. They identify 269 non-coding RNA (ncRNA) candidates, of which nine were validated. Eight of these ncRNAs were integrated as new families:

  1. RF04177 – Bacteroides sRNA BTnc201
  2. RF04178 – Bacteroides sRNA BTnc005
  3. RF04179 – Bacteroides sRNA BTnc049
  4. RF04180 – Bacteroides sRNA BTnc231
  5. RF04181 – rteR sRNA
  6. RF04182 – GibS sRNA
  7. RF04183 – Bacteroidales small SRP
  8. RF04184 – Bacteroides sRNA BTnc060

In addition, the RF01693 – Bacteroidales-1 family was renamed to 6S-Bacteroidales RNA. Bacteroidales-1 was first reported by Weinberg et al. 2010 in a comparative genomics-based analysis of genome and metagenome sequences. It was identified downstream of L20 ribosomal subunit genes in the order Bacteroidales. Ryan et al. 2020 report that this sRNA is a 6S RNA homolog in Bacteroides thetaiotaomicron. Rfam also has two other 6S RNA families: RF00013-6S/SsrS RNA and RF01685-6S-Flavo. We would like to thank Dr Lars Barquist (University of Würzburg) for providing the data.

Welcome to Emma!

A few weeks ago Emma Cooke joined the Rfam team as Software Developer and is already busy working on new features. Emma has studied Genetics and Software Engineering, and her previous roles have focused on release verification pipelines, software testing, and developing for cloud environments. Please join us in welcoming Emma to the Rfam community and stay tuned for new announcements based on her work.

Get in touch

As always, we would be very happy to hear from you if you have any feedback or suggestions for Rfam. Please feel free to email us or get in touch on Twitter.

Dfam 3.4 Release

July 24, 2021

The Dfam Consortium is proud to announce the release of Dfam 3.4. This update includes over 8,200 curated transposable element (TE) families found in 240 mammalian genomes. The models therein have been carefully developed by David Ray’s lab at Texas Tech University (TTU) and further refined by Arian Smit. This is part of an ongoing effort to generate a comprehensive mammalian TE library using multi-species alignments and ancestral sequence reconstructions generated by the Zoonomia project.

In addition to releasing the curated TE families, full genome annotations are provided for 21 Old World monkeys (Figure 1; Figure 2).

Figure 1: A portion of the available genomes aligned as part of the Zoonomia project, focused on the Primate Order.

Discovery of young, species-specific TEs

As a large portion of a mammalian genome, TEs serve as a source of genomic variation and innovation, including (but certainly not limited to) genomic rearrangement via movement and non-homologous recombination, and the provision of novel transcription factor binding sites. David Ray’s lab has undertaken the first large-scale effort to examine the TE content of the extant genomes in the Zoonomia project, in order to determine the type and location of TEs and, subsequently, the impact they might have on the evolution of each lineage of mammals.


A total of 248 final genome assemblies of placental mammals were initially presented for analysis, most coming from the Zoonomia dataset. Low quality assemblies and previously analyzed genomes were excluded from analyses. To avoid wasted effort on re-curation of previously described TEs, manual curation efforts were focused towards identifying newer putative TEs that underwent relatively recent accumulation, with the main assumption being that many older TEs will be widely shared among large groups of placental mammals and that previous annotation efforts have thoroughly described these older elements in detail.

To classify younger TEs, the filtered dataset was narrowed to elements that have undergone transposition in the recent past, i.e. TEs whose insertions have Kimura 2-parameter (K2P) distances of less than 4.4% (approximately 20 million years or less since insertion, based on a general mammalian neutral mutation rate of 2.2×10^-9). This approach yielded mostly lineage-specific TEs, many of which had not previously been described.
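
The relationship between the K2P distance cutoff and the age estimate can be illustrated with a short sketch. It uses the standard Kimura 2-parameter formula and assumes that age is simply distance divided by the mutation rate, with the rate expressed per site per year (the units are our assumption). The function names are hypothetical and this is not the pipeline used for the curation itself.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq_a, seq_b):
    """Kimura 2-parameter distance between two aligned sequences.

    Raises a math domain error for saturated (very divergent) pairs.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = compared = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguous bases
        compared += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1    # A<->G or C<->T
        else:
            transversions += 1  # purine <-> pyrimidine
    if compared == 0:
        raise ValueError("no comparable positions")
    p = transitions / compared
    q = transversions / compared
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


def insertion_age_years(distance, rate_per_site_per_year=2.2e-9):
    """ASSUMPTION: age = distance / neutral mutation rate."""
    return distance / rate_per_site_per_year


# The 4.4% cutoff corresponds to about 20 million years.
print(f"{insertion_age_years(0.044) / 1e6:.0f} million years")
```

In practice the distance would be measured between each genomic insertion and the family consensus, so a copy with a K2P distance of 0.044 would sit right at the cutoff.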

For each iteration of manual TE curation, new consensus sequences were generated by taking the 10-50 top BLAST hits, aligning these sequences with MUSCLE, and estimating a consensus sequence with EMBOSS.
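
To give a feel for the consensus step, the sketch below calls a simple majority-rule consensus from sequences that are already aligned. It is a stand-in for illustration only and does not reproduce the EMBOSS algorithm; the function name, the `min_fraction` threshold and the gap handling are our assumptions.

```python
from collections import Counter


def majority_consensus(aligned, min_fraction=0.5, gap="-"):
    """Majority-rule consensus of equal-length aligned sequences.

    Columns that are mostly gaps are dropped; columns without a base
    present in more than `min_fraction` of the non-gap rows become "N".
    """
    if not aligned:
        raise ValueError("no sequences given")
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("sequences must be aligned to the same length")
    consensus = []
    for column in zip(*aligned):
        bases = [base.upper() for base in column if base != gap]
        if len(bases) * 2 < len(column):
            continue  # mostly gaps: leave this column out
        base, count = Counter(bases).most_common(1)[0]
        consensus.append(base if count / len(bases) > min_fraction else "N")
    return "".join(consensus)


hits = ["ACGT-A", "ACGTTA", "ACCT-A"]
print(majority_consensus(hits))  # ACGTA
```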

To reduce library redundancy, the potential TE consensus sequences were combined with those of known TEs from previous work as well as all known vertebrate TEs from Repbase. The program CD-HIT-EST was used to identify duplicate TEs among our combined TE library according to the 80-80-80 rule of Wicker et al.
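
As a rough illustration of the 80-80-80 rule, the sketch below treats two sequences as members of the same family when their alignment is at least 80 bp long, at least 80% identical, and covers at least 80% of the shorter sequence. This is our reading of the rule, and the function and parameter names are hypothetical; it is not the CD-HIT-EST implementation.

```python
def same_family_80_80_80(aligned_length, identical_positions, shorter_length):
    """Apply the 80-80-80 rule to summary statistics of a pairwise alignment.

    aligned_length:      number of aligned positions (bp)
    identical_positions: aligned positions with identical bases
    shorter_length:      length of the shorter of the two sequences (bp)
    """
    if aligned_length <= 0 or shorter_length <= 0:
        return False
    identity = identical_positions / aligned_length
    coverage = aligned_length / shorter_length
    return aligned_length >= 80 and identity >= 0.80 and coverage >= 0.80


print(same_family_80_80_80(100, 85, 110))  # True: long, similar, well covered
print(same_family_80_80_80(70, 70, 70))    # False: alignment shorter than 80 bp
print(same_family_80_80_80(100, 90, 200))  # False: covers only half the sequence
```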

To confirm the TE type, each sequence in the library was subjected to a custom pipeline that used blastx to confirm the presence of known ORFs in autonomous elements, Repbase to identify known elements, and TEclass to predict the TE type. In addition, structural criteria were used to categorize TEs: DNA transposons were required to have visible terminal inverted repeats; rolling circle transposons were required to have an identifiable ACTAG at one end; putative SINEs were inspected for a repetitive tail as well as A and B boxes; and LTR retrotransposons were required to have recognizable hallmarks, such as TG, TGT, or TGTT at their 5’ end and the inverse at their 3’ end.
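
One of the structural criteria above, the LTR hallmark, is easy to express in code. The sketch below assumes that "the inverse" at the 3’ end means the reverse complement of the 5’ motif (for example TG ... CA); that interpretation, and the function names, are ours rather than a description of the actual curation pipeline.

```python
# 5' hallmark motifs named in the text, longest first so the most
# specific match is reported.
LTR_5PRIME_MOTIFS = ("TGTT", "TGT", "TG")

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def ltr_hallmark(seq):
    """Return the 5' motif if seq starts with it and ends with its
    reverse complement (ASSUMED meaning of "the inverse"), else None."""
    seq = seq.upper()
    for motif in LTR_5PRIME_MOTIFS:
        if seq.startswith(motif) and seq.endswith(reverse_complement(motif)):
            return motif
    return None


print(ltr_hallmark("TGTTAGGCCAACA"))  # TGTT (ends with AACA)
print(ltr_hallmark("TGAAACCCA"))      # TG (ends with CA)
print(ltr_hallmark("ACGTACGT"))       # None
```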

Zoonomia Project

Figure 2: Summary of the Zoonomia project

The Zoonomia project is an effort to understand the mammalian tree of life at a deeper level. This massive undertaking is the collaboration of 27 laboratories. Although far from a complete list, some current projects derived from the Zoonomia datasets include: studying mammalian speech development, regulatory element analyses, chromosome evolution and the evolution of microRNA genes.

Future Work

Future efforts will continue to analyze and catalog lineage-specific TEs in deeper branches of the 240-way genome alignment, using the reconstructed genomes at each node of the phylogenetic tree, and will expand the full genome annotations available on Dfam.

AlphaFolding the Protein Universe

July 22, 2021

Hot on the heels of our inclusion of the Baker group’s trRosetta structural models, we are excited to announce the inclusion of models from AlphaFold 2.0 generated by DeepMind and stored in the AlphaFold Database (AlphaFold DB). AlphaFold 2.0’s performance in the CASP14 competition was spectacular, producing structure models of near-experimental quality.

The new AlphaFold models have been constructed for over 375,000 proteins from 22 model organisms, and the vast majority of the models are full-length proteins. This is in contrast to the trRosetta models, which were built from the domain region predicted by Pfam. Having full-length protein models is very exciting for us because it will allow us to more easily check whether we need to extend or change the Pfam domain boundaries. We will also be able to look for missing domains in the protein structures. AlphaFold models also help to fill in gaps when only a part of a longer family has been structurally characterised.

When looking at the AlphaFold models it is important to look at the quality scores of the model overall. Sometimes a good quality structural model cannot be created, but in these cases it is usually obvious from the low quality scores, which are shown as orange regions of the model. Disordered regions of proteins usually have low confidence.
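
For readers who work with the per-residue quality scores directly, the sketch below bins scores into the confidence bands used for colouring models. It assumes the scores are AlphaFold pLDDT values on a 0-100 scale and uses the band thresholds published by AlphaFold DB (above 90, 70 to 90, 50 to 70, below 50, with the lowest band drawn in orange); the function names are hypothetical.

```python
def plddt_band(score):
    """Map a per-residue pLDDT score (0-100) to a confidence band.

    ASSUMPTION: thresholds follow the AlphaFold DB colour scheme,
    where the "very low" band is the one drawn in orange.
    """
    if not 0 <= score <= 100:
        raise ValueError(f"pLDDT must be between 0 and 100, got {score}")
    if score > 90:
        return "very high"
    if score > 70:
        return "confident"
    if score > 50:
        return "low"
    return "very low"


def very_low_fraction(scores):
    """Fraction of residues that fall in the "very low" (orange) band."""
    scores = list(scores)
    if not scores:
        raise ValueError("no scores given")
    return sum(plddt_band(s) == "very low" for s in scores) / len(scores)


residue_scores = [95.2, 80.1, 45.0, 30.7]
print([plddt_band(s) for s in residue_scores])
print(very_low_fraction(residue_scores))  # 0.5
```

A domain whose residues are mostly in the lowest band is a poor candidate for checking Pfam boundaries, whereas high-confidence regions flanking an existing domain may indicate that the boundaries should be extended.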

We think that there are many thousands of Pfam families that could be improved using the AlphaFold and trRosetta models. Feel free to tell us where we could improve them. We are really enjoying mining this treasure trove of data and we hope you find some (not so) hidden gems. 

The Pfam team