Rfam Release 14.10

November 15, 2023

We are happy to announce the release of Rfam version 14.10. This release includes 62 new microRNA families and 138 updated miRNA families, 9 updated Hepatitis C Virus (HCV) families and a new intron family, the bZIP non-canonical Hac1/Xbp1 intron. Read on for more details.

Rfam has completed the first phase of synchronisation with miRBase

Rfam, miRBase, and RNAcentral have been working to synchronise miRNA families between all three resources. In this release we created 62 new miRNA families and updated 138. Over the course of the project we have created and updated 1536 families in total. We have updated all miRNA families in Rfam that do not require novel data from miRBase. This release completes the first phase of the synchronisation! Many thanks to everyone who has worked on this project, including Joanna Argasinska, Emma Cooke, Ioanna Kalvari, Anton I. Petrov, Nancy Ontiveros-Palacios, Blake A. Sweeney and Sam Griffiths-Jones.

Further phases will focus on updating or improving miRBase families and then reflecting these changes in miRBase and Rfam. Users interested in the detailed changes to Rfam families can browse this sheet. A summary of the changes over the course of this project is given below:

Release	Updated	New	Total
14.3	37	353	390
14.4	0	496	496
14.5	112	0	112
14.6	0	126	126
14.7	121	0	121
14.8	30	25	55
14.9	22	14	36
14.10	138	62	200
Total	460	1076	1536
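
For readers who want to re-derive the totals, here is a minimal sketch that sums the per-release rows of the table above. The dictionary layout and the function name are our own illustration, not part of any Rfam tooling.

```python
# Rows copied from the table above: release -> (updated, new).
ROWS = {
    "14.3": (37, 353),
    "14.4": (0, 496),
    "14.5": (112, 0),
    "14.6": (0, 126),
    "14.7": (121, 0),
    "14.8": (30, 25),
    "14.9": (22, 14),
    "14.10": (138, 62),
}


def totals(rows):
    """Return (updated, new, overall) summed across all releases."""
    updated = sum(u for u, _ in rows.values())
    new = sum(n for _, n in rows.values())
    return updated, new, updated + new


print(totals(ROWS))  # (460, 1076, 1536)
```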

Hepatitis C Virus families

As part of the ongoing collaboration between Professor Manja Marz of the European Virus Bioinformatics Center and Rfam, we have updated 9 HCV families to better reflect their secondary structures. The Marz group plans to publish this work shortly, so keep an eye out for their paper. Thank you to Sandra Triebel and Manja Marz for their work on this. The families are listed below.

Rfam ID	Family	Description
RF04220	HCV_SL588	Hepatitis C virus SL588 non-coding RNA
RF04221	HCV_SL669	Hepatitis C virus SL669 non-coding RNA
RF04219	HCV_J750	Hepatitis C virus J750 non-coding RNA (containing SL761 and SL783)
RF04218	HCV_5BSL1	Hepatitis C virus stem-loop I
RF00468	HCV_SLVII	Hepatitis C virus stem-loop VII
RF00620	HCV_ARF_SL	Hepatitis C alternative reading frame stem-loop
RF00260	HepC_CRE	Hepatitis C virus cis-acting replication element (CRE)
RF00481	HCV_X3	Hepatitis C virus 3′ X element
RF00061	IRES_HCV	Hepatitis C virus internal ribosome entry site

bZIP non-canonical Hac1/Xbp1 intron

bZIP intron RNA is a non-canonical intron that was first reported in the transcription factor Hac1 gene in yeast and the Xbp1 gene in Metazoa. bZIP introns are spliced independently of the spliceosome; instead, splicing is carried out by the ribonuclease Ire1 and guided by the secondary structure of the bZIP intron. This splicing has been described in the Xbp1 gene of human, mouse, Caenorhabditis elegans and Drosophila melanogaster, and in the Hac1 gene of fungi. The conserved structure was reported in Hooks et al., and we have created the bZIP family (RF04247) for this RNA.

Secondary structure of the bZIP intron. Left: consensus structure of Hac1/Xbp1 mRNA, taken from Griffiths-Jones S. & Hooks K. B. Conserved RNA structures in the non-canonical Hac1/Xbp1 intron. RNA Biology 2011; 8:4, 552-556. https://doi.org/10.4161/rna.8.4.15396. Right: secondary structure of the bZIP intron in Rfam family RF04247.

Posted in Rfam
Tags: release, rfam

Pfam 36.0 release

September 18, 2023

Pfam 36.0 release is now out! This is a very special release as for the first time Pfam is accessible exclusively via the modern and comprehensive interface of InterPro. Offering new features and easy-to-use functionality, the InterPro website will be the Pfam home for releases to come. All Pfam release files remain accessible through the Pfam ftp site.

If you are new to the InterPro interface and would like to learn how to navigate the website and explore the Pfam annotations, you can have a look at this video, the updated online training, Pfam documentation or get in touch.

Release content

Pfam 36.0 contains a total of 20,795 families and 660 clans. Since the last release, we have built 1,191 new families, killed 28 families and created 5 new clans.

Additionally, we have updated around 1.5% of existing Pfam entries. 2,818 families have seen a change in their boundaries, and 281 of them have changed by more than 50 residues. Most of these were trimmed or split into domains, often thanks to improved information from accurate structural models.

UniProt Reference Proteomes has increased by 23% since Pfam 35.0, and now contains 75 million sequences. Of the sequences that are in UniProt Reference Proteomes, 76.2% have at least one Pfam match, and 48.6% of all residues fall within a Pfam family.

The accession numbers for this release range from PF20625 to PF21822.

Sources of Pfam families

ECOD

In an effort to harmonise the Pfam and ECOD classifications, we have created 638 new entries, and 50 new clans. However, at the same time we have also removed 45 existing clans, usually because we merged two or more existing Pfam clans into a single entry.

While creating new Pfam families from ECOD and checking their classification, we can identify relationships between existing families and the new ones. Grouping related families together means we either include them in an existing clan or create a new one. This is the case for the new THUMP clan (THioUridine synthases, RNA Methylases and Pseudouridine synthases; CL0747), which in Pfam 36.0 includes three families: the existing THUMP domain (PF02926), the THUMP domain of eukaryotic Pus10 (PF21237, which includes human Pus10, Q3MIT2) and the Ribosomal RNA large subunit methyltransferase M, THUMP-like domain (PF21239). The THUMP domain is involved in RNA metabolism and is present in enzymes involved in at least three unrelated types of RNA modification.

The Pus10 example is particularly interesting, as it is a pseudouridine synthase with no significant sequence similarity to the other five families of pseudouridine synthases that have been characterised based on sequence homology.

Figure 1. Structure of human Pus10 (2v9k) as seen in InterPro. Highlighted in blue is the THUMP domain represented in Pfam, based on the ECOD classification.

Improving Pfam using AlphaFold

AlphaFold has revolutionised the world of protein structure and protein classification. Since its first release, we have been able to use AlphaFold predicted structures to find a function for domains of unknown function, to create missing domains and to refine Pfam domain boundaries.

An example is the Zinc finger SWIM-type domain-containing protein 3 (ZSWIM3) (Q96MP5). Previously, this protein had two domains defined (Figure 2A). However, using the AlphaFold predicted aligned error graph, we can clearly see that it is actually made of 5 domains: an N-terminal domain, an RNaseH-like domain, a helical domain, a zinc-finger domain and a C-terminal domain (Figure 2B,C). The missing domains were created (Figure 2D): PF21599 (N-terminal), PF21056 (RNaseH-like), PF21609 (helical).

Additionally, the boundaries of one of the existing domains (PF19286) were too wide (Figure 2A); the domain has been truncated and renamed ZSWIM1/3, C-terminal domain (Figure 2D).

Figure 2. Example of the creation of new Pfam entries and the update of an existing Pfam entry using AlphaFold structure prediction for Q96MP5. A) In Pfam 35.0, two entries existed: SWIM zinc-finger domain (PF04434) and DUF5909 (PF19286). B) AlphaFold Predicted Aligned Error plot, showing 5 distinct domains. C) AlphaFold predicted structure with the 5 domains highlighted in different colours. D) In Pfam 36.0, three new entries were created (PF21599 (N-terminal), PF21056 (RNaseH-like), PF21609 (helical)), and DUF5909 (PF19286) was updated.

Enjoy Pfam 36.0!

Posted by the Pfam team

Posted in Pfam
Tags: pfam, release

Dfam 3.7 : ~3.4 million TE models across 2346 taxa

January 12, 2023

We at Dfam are pleased to announce the latest data release! The Dfam 3.7 release includes additional raw and curated datasets, resulting in a ~4.5x increase in the number of families compared to the previous Dfam 3.6 data release over a wide range of taxa. Please note the large size of the newest release and plan accordingly. It may be beneficial to filter and download only the data relevant to your project by using the API.
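
As a rough illustration of filtering through the API rather than downloading everything, the sketch below only builds a query URL for families in one clade. The base URL and the parameter names (format, clade, limit) are assumptions on our part and should be checked against the current Dfam API documentation; the helper function is hypothetical.

```python
from urllib.parse import urlencode

# Assumption: endpoint and parameter names as we understand the public
# Dfam REST API; verify them in the API documentation before use.
DFAM_FAMILIES_URL = "https://dfam.org/api/families"


def build_families_query(clade, limit=20):
    """Build a URL requesting summary records for families in one clade."""
    params = {"format": "summary", "clade": clade, "limit": limit}
    return f"{DFAM_FAMILIES_URL}?{urlencode(params)}"


url = build_families_query("Anopheles coluzzii", limit=5)
print(url)
```

Keeping the limit small while exploring avoids pulling large result sets; the URL can then be fetched with any HTTP client.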

EBI dataset contributes to the quadrupling of the Dfam database

Our continued collaboration with Fergal Martin and Denye Ogeh from the European Bioinformatics Institute (EBI) has provided an additional 771 assemblies and their associated TE models that are now a part of the DR records in Dfam. This brings the total contribution of genomic data from EBI to 1551 species. The new data expands taxa such as Viridiplantae (green plants) and Actinopterygii (bony fishes), and broadens Dfam coverage with the addition of Echinodermata (starfishes, sea urchins/cucumbers) and Petromyzontiformes (lampreys).

Community submissions – adding diversity to Dfam

Taro (Colocasia esculenta) – a threatened food staple

One of the most ancient cultivated crops, taro is a food staple in the Pacific Islands and the Caribbean, which is currently threatened by taro leaf blight (TLB). Some populations of taro are resistant to TLB, but the genetic basis for this resistance is unknown. As part of an effort to understand the genetic basis of TLB resistance, a taro de novo assembly was generated and the repetitive content was analyzed [1]. The high repetitive content (~82%) of this genome was positively correlated with genome size, with the potential to be linked to TLB resistance. Contributed by M. Renee Bellinger.

Gesneriaceae – understanding angiosperm morphological variation

A member of the plant family Gesneriaceae, the Cape Primrose Streptocarpus rexii has long been studied by evolutionary biologists due to its unique morphological aspects. Genetic resources are critical for studying the unique meristem evolution of this plant family. As such, a genome annotation pipeline was developed to address current technical challenges in genome annotation. Part of this effort included generating repeat libraries not only for the Cape Primrose, but also for Dorcoceras hygrometricum and Primulina huaijiensis [2]. Providing these libraries to Dfam will enhance the resources available for future genomic characterization of this plant family. Contributed by Kanae Nishii.

Mosquito (Anopheles coluzzii) – a human malaria vector

The adaptive flexibility of Anopheles coluzzii, a primary vector of human malaria, allows it to escape efforts to control the mosquito population with insecticides. As TEs are integral to adaptive processes in other species, it was hypothesized that TEs could underlie the rapid resistance of A. coluzzii to classic methods of intervention. Analyzing six individuals from two African localities allowed the authors to provide a comprehensive TE library [3]. This effort enhances the resources available to study the genomic architecture and gene regulation underpinning the success of this malaria vector. Contributed by Carlos Vargas and Josefa Gonzalez.

Water flea (Daphnia pulicaria) – a model organism to study climate change

Due to their short lifespans and reproductive capabilities, water fleas are used as a bioindicator to study the effects of toxins on an ecosystem, and are thus useful in studying climate change. A study of two ecological sister taxa – Daphnia pulicaria and Daphnia pulex – analyzed the evolutionary forces of recombination and gene density in driving the differentiation and divergence of the two aforementioned species [4]. TE content was analyzed as part of generating the new Daphnia pulicaria genome assembly. Contributed by Mathew Wersebe.

601 insects – transposable element influence on species diversity

TEs are drivers of evolution in eukaryotes. However, in some underrepresented taxa, TE dynamics are less well understood. To this end, 601 insect genomes across 20 Orders were analyzed for TE content to assess the variation between and among insect Orders [5]. This work highlights the need for community-submitted high-quality libraries. Contributed by John Sproul and Jacqueline Heckenhauer.

Analysis of six bat genomes – evolution of bat adaptations

Bats are an excellent example of complex adaptations, such as flight, echolocation, longevity and immunity. To enhance the genomic resources available for studying the development of complex traits, six high-quality genome assemblies were generated using long- and short-read technologies (Rhinolophus ferrumequinum, Rousettus aegyptiacus, Phyllostomus discolor, Myotis myotis, Pipistrellus kuhlii and Molossus molossus) [6]. As part of the effort to annotate these new genome assemblies, the TE content was analyzed. These six genomes displayed a wide range of diversity in TE content, perhaps contributing to their complex traits. Contributed by Kevin Sullivan and David Ray.

LTR7/ERVH – transcriptional regulation in the human embryo

The mechanism by which human endogenous retrovirus type-H (HERVH) exerts regulatory activities fostering self-renewal and pluripotency in the pre-implantation embryo is unknown. To elucidate this mechanism, the transcription dynamics and sequence signature evolution of HERVH were analyzed [7]. This study not only revealed previously undefined LTR7 subfamilies, but also provided a comprehensive phyloregulatory analysis of all the identified subfamilies against locus-specific regulatory data available in genome-wide assays of embryonic stem cells (ESCs), providing evidence for subfamily-specific promoter activity. The complex evolutionary history of LTR7 is mirrored in the transcriptional partitioning that takes place during early embryonic development. Contributed by Thomas Carter, Cédric Feschotte, and Arian Smit.

References

1. Bellinger, M. R., Paudel, R., Starnes, S., Kambic, L., Kantar, M. B., Wolfgruber, T., Lamour, K., Geib, S., Sim, S., Miyasaka, S. C., Helmkampf, M., & Shintaku, M. (2020). Taro Genome Assembly and Linkage Map Reveal QTLs for Resistance to Taro Leaf Blight. G3 (Bethesda, Md.), 10(8), 2763–2775. https://doi.org/10.1534/g3.120.401367

2. Nishii, K., Hart, M., Kelso, N., Barber, S., Chen, Y. Y., Thomson, M., Trivedi, U., Twyford, A. D., & Möller, M. (2022). The first genome for the Cape Primrose Streptocarpus rexii (Gesneriaceae), a model plant for studying meristem-driven shoot diversity. Plant direct, 6(4), e388. https://doi.org/10.1002/pld3.388

3. Vargas-Chavez, C., Longo Pendy, N. M., Nsango, S. E., Aguilera, L., Ayala, D., & González, J. (2022). Transposable element variants and their potential adaptive impact in urban populations of the malaria vector Anopheles coluzzii. Genome research, 32(1), 189–202. https://doi.org/10.1101/gr.275761.121

4. Wersebe, M. J., Sherman, R. E., Jeyasingh, P. D., & Weider, L. J. (2022). The roles of recombination and selection in shaping genomic divergence in an incipient ecological species complex. Molecular ecology, 10.1111/mec.16383. Advance online publication. https://doi.org/10.1111/mec.16383

5. Sproul, J.S., Hotaling, S., Heckenhauer, J., Powell, A., Larracuente, A.M., Kelley, J.L., Pauls, S.U., Frandsen, P.B. (2022). Repetitive elements in the era of biodiversity genomics: insights from 600+ insect genomes. bioRxiv 2022.06.02.494618; doi: https://doi.org/10.1101/2022.06.02.494618

6. Jebb, D., Huang, Z., Pippel, M., Hughes, G. M., Lavrichenko, K., Devanna, P., Winkler, S., Jermiin, L. S., Skirmuntt, E. C., Katzourakis, A., Burkitt-Gray, L., Ray, D. A., Sullivan, K. A. M., Roscito, J. G., Kirilenko, B. M., Dávalos, L. M., Corthals, A. P., Power, M. L., Jones, G., Ransome, R. D., … Teeling, E. C. (2020). Six reference-quality genomes reveal evolution of bat adaptations. Nature, 583(7817), 578–584. https://doi.org/10.1038/s41586-020-2486-3

7. Carter, T. A., Singh, M., Dumbović, G., Chobirko, J. D., Rinn, J. L., & Feschotte, C. (2022). Mosaic cis-regulatory evolution drives transcriptional partitioning of HERVH endogenous retrovirus in the human embryo. eLife, 11, e76257. https://doi.org/10.7554/eLife.76257

Posted in Dfam, Releases
Tags: dfam, release, submissions, transposable elements, xfam

Rfam Release 14.9

November 15, 2022

We are happy to announce the release of Rfam 14.9. This release features 14 new miRNA families, 23 updated miRNA families, 10 families improved with their first 3D structure, 10 families updated with additional 3D structures, and comprehensive improvements using R-scape. Read on for more details.

Updated and new miRNA families

In this release, we have updated 23 miRNA families in Rfam and created 14 new families based on miRBase miRNA alignments. We estimate that this project is 80 percent finished. The remaining families are undergoing extensive curation.

New miRNA Families:

Rfam ID	Family
RF04223	MIR2619
RF04224	mir-9229
RF04225	MIR7502
RF04226	mir-9186
RF04227	mir-9215
RF04228	MIR6140
RF04229	mir-9279
RF04230	mir-9261
RF04231	mir-9191
RF04232	mir-9318
RF04233	mir-680
RF04234	mir-1421
RF04235	mir-242_2
RF04236	mir-1285

Updated miRNA Families:

Rfam ID	Family
RF00241	mir-8/mir-141/mir-200
RF00456	mir-34
RF00661	mir-31
RF00672	mir-190
RF00700	mir-375
RF00702	mir-182
RF00706	mir-263
RF00713	mir-239
RF00716	mir-3
RF00717	mir-315
RF00726	mir-87
RF00727	microRNA bantam
RF00728	mir-81
RF00762	mir-412
RF00837	mir-251
RF00844	mir-67
RF00848	mir-61
RF00948	mir-996
RF01045	mir-544
RF01413	miR-430
RF01924	mir-2774
RF04088	MIR812
RF04195	MIR6217

3D families

We continue to review and update Rfam families with available 3D information. In release 14.9, 10 families received their first 3D structures and 10 further families gained additional 3D structures. We have added 9 pseudoknots to the 10 families with new 3D structures.

Families with new 3D structures added:

Rfam ID	Family	New
RF00037	Iron response element I	3SNP_C, 3SNP_D
RF00522	PreQ1 riboswitch	2L1V_A
RF01073	Gag/pol translational readthrough site	2LC8_A
RF01727	SAM/SAH riboswitch	6HAG_A
RF02253	Iron response element II	3SN2_B
RF02519	ToxI antitoxin	4ATO_G
RF02553	Y RNA-like	6CU1_A
RF02796	Pab160 RNA	3HJW_D, 3LWO_D, 3LWP_D, 3LWQ_D, 3LWR_D, 3LWV_D
RF03054	Xanthine riboswitch/NMT1 RNA	7ELP_A, 7ELP_B, 7ELQ_A, 7ELQ_B, 7ELR_A, 7ELR_B, 7ELS_A, 7ELS_B
RF04222	Potato leafroll virus exoribonuclease-resistant RNA	7JJU_A, 7JJU_B

Families with additional 3D structures:

Rfam ID	Family	Update
RF00015	U4 spliceosomal RNA	5GAP_V
RF00025	Ciliate telomerase RNA	7LMA_B, 7LMB_B
RF00050	FMN riboswitch (RFN element)	6WJR_X, 6WJS_X
RF00059	TPP riboswitch (THI element)	7TD7_A, 7TDA_A, 7TDB_A, 7TDC_A, 7TZR_X, 7TZR_Y, 7TZS_X, 7TZS_Y, 7TZT_A, 7TZU_A
RF00162	SAM riboswitch (S box leader)	7EAF_A
RF00174	Cobalamin riboswitch	6VMY_A
RF00442	Guanidine-I riboswitch	5U3G_B, 7MLW_F
RF01763	ykkC-III Guanidine-III riboswitch	5NWQ_B, 5NY8_A, 5NY8_B, 5NZ3_A, 5NZ3_B, 5NZD_A, 5NZD_B, 5O62_A, 5O62_B, 5O69_A, 5O69_B
RF01831	THF riboswitch	6Q57_A, 7KD1_A
RF02680	PreQ1-III riboswitch	6XKN_A, 6XKO_A

We have also created rfam.org/3d, which contains a table of all families with 3D structures and a link to download the seed alignments for those families. Additionally, there is a new file on our FTP site, Rfam.3d.seed.gz, which contains all seed alignments for these families. The page and file will be updated with each release. Please reach out if you have any suggestions for improvements!
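
As a minimal sketch of how the seed file could be inspected once downloaded, the code below lists the family accessions found in a gzipped Stockholm file by reading the `#=GF AC` annotation lines. We assume Rfam.3d.seed.gz follows the same Stockholm layout as other Rfam seed files; the function names and the toy alignment are ours, for illustration only.

```python
import gzip
import io


def seed_accessions(handle):
    """Yield the '#=GF AC' accession of each alignment in a Stockholm stream."""
    for line in handle:
        if line.startswith("#=GF AC"):
            yield line.split()[2]


def read_gzipped_seed(path):
    """List accessions in a gzipped Stockholm file on disk."""
    # errors="replace" guards against stray non-UTF-8 bytes in annotations.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        return list(seed_accessions(handle))


# Toy two-family file built in memory (sequences are made up).
toy = (
    "# STOCKHOLM 1.0\n#=GF AC   RF00037\nseq1  ACGU\n//\n"
    "# STOCKHOLM 1.0\n#=GF AC   RF00522\nseq2  GGCC\n//\n"
)
compressed = gzip.compress(toy.encode("utf-8"))
with gzip.open(io.BytesIO(compressed), "rt", encoding="utf-8") as handle:
    accessions = list(seed_accessions(handle))

print(accessions)  # ['RF00037', 'RF00522']
```

On a real download, `read_gzipped_seed("Rfam.3d.seed.gz")` would return one accession per family in the file.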

Families updated with R-scape model

We worked with Elena Rivas to analyse all Rfam families with R-scape. We then updated models where R-scape was able to suggest a better alignment. This led to 26 families with improvements, listed below by number of additional covarying base pairs.

Additional covarying base pairs	Rfam ID	Family
24	RF02033	HEARO
14	RF03065	IS605-orfB-I
8	RF03068	RT-3
5	RF03072	raiA
4	RF02969	DUF3800-I
3	RF01688	Actino-pnp
3	RF02004	group-II-D1D4-5
3	RF02005	group-II-D1D4-6
3	RF02913	pemK
3	RF03077	RT-2
3	RF03135	L4-Archaeoglobi
3	RF03144	eL15-Euryarchaeota
2	RF00062	HgcC
2	RF01731	TwoAYGGAY
2	RF01794	sok
2	RF02221	sRNA-Xcc1
2	RF02947	cow-rumen-2
2	RF03000	LOOT
2	RF03158	L31-Actinobacteria
1	RF01864	plasmodium_snoR21
1	RF01867	CC2171
1	RF02944	c4-2
1	RF02968	DUF3800-IX
1	RF02987	GA-cis
1	RF03019	RT-16
1	RF03046	Pseudomonadales-1

Other updates

We have also updated 3 other families. We updated the secondary structure of the Sarbecovirus 5’ UTR (RF03120), modifying the consensus structure of stem loop 1 of the 5’ untranslated region to reflect the pairing reported in “Correlated sequence signatures are present within the genomic 5′UTR RNA and NSP1 protein in coronaviruses”. Additionally, we renamed the RF03054 and RF03071 families, which were first reported by Zasha Weinberg in a comparative analysis of intergenic regions in bacteria.

The Xanthine riboswitch (RF03054) was first reported in Proteobacteria as NMT1 non-coding RNA (ncRNA). It was later shown to be an ncRNA that recognises xanthine, and its structure was reported in PDB entries 7ELP, 7ELQ, 7ELR and 7ELS, which have been added to the seed alignment.

The Na+ riboswitch (RF03071) was first reported by Zasha Weinberg and called DUF1646 RNA. More recently, Neil White from the Breaker group identified it as a riboswitch that selectively senses Na+ and regulates the expression of genes related to sodium biology.

Migrating Rfam’s public SVN repository

Rfam and Pfam used to provide a public copy of their SVN repositories on xfamsvn.ac.uk. With the recent deprecation of Pfam’s website and its inclusion as part of InterPro, we have decided to move the Rfam SVN repository to http://svn.rfam.org. Users interested in a nightly updated version of Rfam can browse the repository at this new location.

Posted in Uncategorized
Tags: release, rfam

A new version of Pfam-N is available

October 20, 2022

In March 2021, we announced the release of Pfam-N, a file increasing the Pfam coverage of the UniProtKB Reference Proteomes using deep learning, the fruit of the collaboration between Pfam and the Google Research team led by Dr Lucy Colwell, with Maxwell Bileschi and David Belanger. Since then, the Google Research team has worked hard on improving and refining the deep learning methodology previously developed to further increase the Pfam coverage of protein sequences. The new version of Pfam-N represents the most significant gain in Pfam coverage ever reported.

Updated deep learning methodology

Building on the methods developed by Max Bileschi and colleagues [1], convolutional neural networks are used to annotate each residue (for all sequences in the Pfam database) with a Pfam family or clan label, which are then converted into domain calls. We look for any conflicting calls between nonhomologous families or clans and resolve them. A paper on this method is forthcoming; please direct inquiries to mlbileschi@google.com.
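
To make the step from per-residue labels to domain calls concrete, here is a minimal sketch that collapses runs of identical labels into regions. It is our own illustration and not the Google Research team's implementation (which is not yet published): the function name, the `min_length` filter and the example labels are assumptions, and it does not attempt the conflict resolution between nonhomologous families or clans described above.

```python
from itertools import groupby


def labels_to_domain_calls(labels, min_length=1):
    """Collapse per-residue labels into (label, start, end) calls.

    Positions are 1-based and inclusive; None marks an unlabelled residue.
    Runs shorter than min_length are dropped (a hypothetical filter).
    """
    calls = []
    position = 1
    for label, run in groupby(labels):
        length = sum(1 for _ in run)
        if label is not None and length >= min_length:
            calls.append((label, position, position + length - 1))
        position += length
    return calls


# Hypothetical labels for an 8-residue sequence.
labels = [None, None, "PF00001", "PF00001", "PF00001", None, "CL0028", "CL0028"]
print(labels_to_domain_calls(labels))
# [('PF00001', 3, 5), ('CL0028', 7, 8)]
```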

Pfam coverage

The first version of Pfam-N used Pfam 34.0 as a reference and annotated 6.8 million protein regions into 11,438 Pfam families. These regions included nearly 1.8 million full-length protein sequences from UniProtKB Reference Proteomes that previously had no Pfam match, an improvement of 4.2% over the 42.5 million sequences annotated at the time. We also note that the sequences receiving their first Pfam annotation included 360 human sequences.

Pfam 35.0 annotates 46.0 million protein sequences, covering 75% of the UniProt Reference Proteomes. The latest version of Pfam-N includes 5.2 million full-length protein sequences from the UniProt Reference Proteomes that previously had no Pfam match, an improvement of 8.5% over the currently-annotated 46.0 million, as illustrated in Figure 1.

Figure 1. Percentage of the UniProt reference proteome covered by Pfam and Pfam-N over time.

Overall, together Pfam and Pfam-N cover 83.7% of the UniProt Reference proteome, as shown in Figure 2.

Figure 2. Percentage of the UniProt reference proteome covered by Pfam and Pfam-N.

Improving Pfam annotations

In our previous post announcing the release of Pfam-N we highlighted the benefit of using Pfam-N to gain huge numbers of additional matches to expand existing Pfam families.

Additionally, Pfam-N can be used to functionally characterise DUFs. Neural network annotations can be used to guide us when there are very distant previously undiscovered relationships between DUFs and more well-annotated families. For example, DUF5309 (PF17236) has been shown to be evolutionarily related to the phage capsid family using RoseTTAFold structural models and has been added to the Pfam clan CL0373: phage-coat.

Furthermore, neural networks can be used to create new families within a clan. They can predict the attachment of protein sequences to a clan, which sometimes are not yet covered by a family. Based on one of these ProtENN predictions, the Alpha/beta hydrolase domain family (PF20408) has been created and added to the Pfam clan CL0028: AB_hydrolase.

Visualising Pfam-N in InterPro

As part of InterPro 91.0, the Google Research team has expanded the Pfam-N annotations to cover all the proteins in UniProt 2020_04, covering 83.6% of UniProtKB.

To increase the visibility of these annotations, we have added them to the Other features section in the protein sequence viewer displayed on protein pages in the InterPro website, as shown in Figure 3.

Figure 3. Pfam-N annotation for UniProt P18207.

Funding

The work to expand Pfam families with Pfam-N hits is funded by the Wellcome Trust as part of a Biomedical Resources grant awarded to the Pfam database.

[1] Bileschi, Maxwell L., et al. “Using deep learning to annotate the protein universe.” Nature Biotechnology (2022): 1-6.

Written by Typhaine Paysan-Lafosse

Posted in News, Pfam

Pfam website decommission

August 4, 2022

After more than 20 years of good and faithful service, we have decided to retire the Pfam website. Do not worry though, we are still planning to do Pfam releases and the data will still be available.

As you can imagine, this wasn’t an easy decision, and rest assured it wasn’t taken lightly. The Pfam website codebase was first released over 20 years ago, and although it has been updated from time to time, some of its core functionality still dates back to its origins. There is a lot of technical debt in its current state and it is only becoming harder to maintain.

Currently, on every release, we spend more time generating data exclusively related to the website than the core data of Pfam: its alignments and models. Additionally, our team doesn’t have the capacity to execute all the release procedures for Pfam on a consistent basis.

Retiring the website will allow us to focus our efforts on producing the core of Pfam. The plan then is to leave the deployment and visualisation tasks to the InterPro website. InterPro was redesigned in recent years, using up-to-date technologies, including a modern framework (React).

The Pfam data and different viewing features are already available on the InterPro website. For example, searching for a Pfam accession (e.g. PF05093) using the InterPro search by text will allow you to reach the corresponding Pfam entry page, where the menu on the left-hand side gives access to different datasets related to the entry, as shown in the figure below.
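
For those who prefer to jump straight to an entry, the sketch below builds the InterPro page address and the matching programmatic address for a Pfam accession. The two URL patterns are assumptions based on how the InterPro website and its API are laid out at the time of writing, and the helper function is hypothetical; please check the InterPro documentation before relying on them.

```python
import re

INTERPRO_SITE = "https://www.ebi.ac.uk/interpro"


def pfam_entry_urls(accession):
    """Return assumed InterPro page and API URLs for a Pfam accession."""
    acc = accession.strip().upper()
    # Pfam accessions are 'PF' followed by five digits, e.g. PF05093.
    if not re.fullmatch(r"PF\d{5}", acc):
        raise ValueError(f"not a Pfam accession: {accession!r}")
    return {
        "page": f"{INTERPRO_SITE}/entry/pfam/{acc}/",
        "api": f"{INTERPRO_SITE}/api/entry/pfam/{acc}",
    }


urls = pfam_entry_urls("pf05093")
print(urls["page"])
print(urls["api"])
```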

Example of Pfam entry page in the InterPro website (PF05093)

The correspondence between the Pfam menu and InterPro menu is given in the table below.

Pfam website tab	InterPro website tab
Summary	Overview
Clan	Available in Overview, Set
Domain organisation	Domain Architectures
Alignments	Alignment
HMM logo	Signature
Trees	Taxonomy (tree icon)
Curation & model	Curation
Species	Taxonomy
Structures	Structures
AlphaFold structures	AlphaFold
trRosetta	RoseTTAFold

You can also browse through the different Pfam families and clans (called Set in InterPro) using the InterPro Browse feature.

In the Overview tab of the Set pages in InterPro, the different members of the set (nodes) and the relationships between them (lines) are displayed in a graph (this corresponds to the Relationship tab in the Pfam website). The size of the nodes is proportional to the number of proteins in the Pfam entry. The graph can be customised to display the Pfam accession, short name and/or name. Other tabs include Entries (equivalent to the Members section in the Summary tab in Pfam), Proteins, Structures, Taxonomy (equivalent to the Species tab in Pfam), Proteomes and Alignments. Additionally, the Proteins tab in InterPro lists all the proteins matched by the different Pfam entries included in a set.

We are aware that not all Pfam users are familiar with the InterPro website interface, so the decommission will be phased over several months, starting on October 5th 2022. On October 5th, we will start redirecting traffic from Pfam (pfam.xfam.org) to InterPro (www.ebi.ac.uk/interpro). The Pfam website will be available at pfam-legacy.xfam.org until January 2023, when it will be decommissioned. We are also going to organise a webinar to show you where to find the Pfam annotations in InterPro, so stay tuned and check our Twitter accounts (@PfamDB/@InterProDB) to register.

If you have any requests, feedback or suggestions on ways to improve Pfam data visualisation in InterPro please contact us through the InterPro helpdesk.

Written by Typhaine Paysan-Lafosse.

Posted in Pfam
Tags: pfam

Rfam release 14.8

May 30, 2022

We are happy to announce the release of Rfam version 14.8. This release includes 48 updated and 25 new microRNA families; 10 families updated based on 3D structure annotations; 4 new families and updates to 5 existing families for Hepatitis C virus; a new xrRNA family from Potato leafroll virus; and the integration of LitScan, a literature scanner powered by RNAcentral and Europe PubMed Central. Read on for details on these changes.

Updated microRNA families

Rfam, miRBase and RNAcentral have been working to synchronize miRNA families between all three resources. We are now happy to report that we have completed 77% of the current miRNA families, covering more than 1,300 of the 1,700 alignments provided to us by Professor Sam Griffiths-Jones at miRBase. The 400 remaining families need an extended review process, and we will be working on their integration in future releases.

Summary of the miRBase and Rfam synchronisation project; we estimate it is 78% completed.

Families updated with 3D structure information

Rfam has been updating families using 3D structure information. This project aims to improve Rfam families through the addition of pseudoknots, base pairs, and annotations of other structural elements by inspecting 3D structures. In this release we have updated 10 families:

Virus families:
- RF00507-Coronavirus frameshifting stimulation element
- RF01047-HBV RNA encapsidation signal epsilon

Riboswitch families:
- RF01763-Guanidine-III riboswitch, also known as ykkC-III riboswitch
- RF01734-Fluoride riboswitch
- RF01704-Glutamine-II riboswitch, previously known as Downstream peptide
- RF01750-ZMP/ZTP riboswitch
- RF01739-Glutamine riboswitch
- RF02683-NiCo riboswitch
- RF01826-SAM-V riboswitch

Ribozyme family:
- RF02678-Hatchet ribozyme

We have added pseudoknots to 7 of the 10 updated families, and the updated secondary structure diagrams for 5 of these families are shown below.

Examples of families reviewed and updated with 3D information. Pseudoknot structures (pk) were added to each of these five families based on a review of the corresponding 3D structures.

New and updated families of Hepatitis C virus

In release 14.8 we have created 4 new families, updated 5 existing families, and deleted 4 virus families. These changes are the result of the ongoing collaboration between Professor Manja Marz of the European Virus Bioinformatics Center and Rfam. The Marz group provided Rfam with a curated alignment of representative sequences covering the entire Hepatitis C virus genome. We used this alignment to create, update or remove Rfam families. The new and updated families are summarized in the table below. We have deleted RF00469, which was merged into RF00260 during review. We have also deleted families RF02585 to RF02588, which have no support in the genomic alignment.

Rfam ID	Name	Description
RF00061	IRES_HCV	Hepatitis C virus internal ribosome entry site
RF00260	HepC_CRE	Hepatitis C virus (HCV) cis-acting replication element (CRE)
RF00620	HCV_ARF_SL	Hepatitis C alternative reading frame stem-loop
RF00468	HCV_SLVII	Hepatitis C virus stem-loop VII
RF00481	HCV_X3	Hepatitis C virus 3’X element
RF04218	HCV_5BSL1	Hepatitis C virus stem-loop I
RF04219	HCV_J750	J750 non-coding RNA (containing SL761 and SL783)
RF04220	HCV_SL588	SL588 non-coding RNA
RF04221	HCV_SL669	SL669 non-coding RNA

As part of this project we have reviewed and updated Coronavirus, Flavivirus and HCV families, and we are working on adding RNA families from other viruses, such as Filoviridae (e.g. Ebolavirus) and Rhabdoviridae (e.g. Rabies viruses).

xRNAs in Potato virus

We want to thank Professor Quentin Vicens for sharing the alignment of Potato leafroll virus exoribonuclease-resistant RNA (PLRV-xrRNA). PLRV-xrRNA is a non-coding RNA that blocks the progression of 5′ to 3′ exoribonucleases using only a folded RNA element. This family is described in RF04222.

LitScan

RNAcentral has recently developed LitScan, a tool to automatically connect non-coding RNA sequences, genes and families to the literature that discusses them. In this release we have integrated the LitScan widget into Rfam. The widget is now shown in the new ‘Publications’ tab on all Rfam families.

Example of LitScan for the mir-17 microRNA precursor family. Publications can be sorted by citation, journal, year of publication and other fields.

Please reach out to us with feedback on the widget, or if you would like to use the LitScan widget on your site!

Posted in Rfam, Uncategorized
Tags: release, rfam

Dfam 3.6 release

April 21, 2022

We are pleased to announce the latest data release of the Dfam database! This release approximately doubles the number of species from the Dfam 3.5 release (595 to 1,109), and increases the number of transposable element (TE) families by ~2.5x (285,542 to 732,993). A more detailed summary of the species included can be seen in Table 1, and in the Dfam 3.6 release notes.

Community-submitted libraries

A huge thank you to the TE community for submitting your data to us! This release includes: 1) 3,360 curated rice weevil TE models, submitted by Clément Goubert and Rita Rebollo¹; 2) 22 SINE families obtained from 15 moth species (Lepidoptera insects), submitted by Guangjie Han et al.²; 3) 120 Penelope-classified families, submitted by Rory Craig et al.³; and 4) 41 repeat families generated as part of the T2T human assembly project⁴, not including the 22 “composite” repetitive families, which will be available as part of a later Dfam release. To read more about the studies associated with these submissions, please see the references below.

Rice weevil: an agricultural pest

Background, as given in the paper: The rice weevil Sitophilus oryzae is one of the most important agricultural pests, causing extensive damage to cereal in fields and to stored grains. S. oryzae has an intracellular symbiotic relationship (endosymbiosis) with the Gram-negative bacterium Sodalis pierantonius and is a valuable model to decipher host-symbiont molecular interactions. In the paper (see below), the authors show that many TE families are transcriptionally active, and changes in their expression are associated with insect endosymbiotic state.

Moth SINEs: high diversity

Conclusions, as given in the paper: Lepidopteran insect genomes harbor a diversity of SINEs. The retrotransposition activity and copy number of these SINEs varies considerably between host lineages and SINE lineages. Host-parasite interactions facilitate the horizontal transfer of SINE between baculovirus and its lepidopteran hosts.

Penelope elements: far-reaching impacts

The authors investigate the Penelope-like element (PLE) content of a wide variety of eukaryotes. In the authors' words: This paper uncovers the hitherto unknown PLE diversity, which spans all eukaryotic kingdoms, testifying to their ancient origins.

T2T entries: previously hidden genomic content

A new human genome assembly has been released! The new assembly (T2T or chm13) has sequenced and assembled the remaining 10% of the human genome that was previously unattainable. The entries described in the manuscript are part of this newly-analyzed sequence.

EBI libraries

In collaboration with the European Bioinformatics Institute (EBI), we processed and imported RepeatModeler runs on 444 additional species, resulting in the addition of 440,543 families. Additional extension and re-classification steps were run on each model, and final consensus sequences and HMMs were produced. Please note that relationship data is not available for these uncurated imports at this time.
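As a quick sanity check on the figures quoted in this post, the growth factors can be recomputed directly. This is a minimal sketch in Python, not part of the Dfam tooling; the variable names are illustrative, and the final line assumes that all 440,543 EBI-derived families count towards the difference between the Dfam 3.5 and 3.6 totals.

```python
# Figures quoted in this post (Dfam 3.5 -> Dfam 3.6).
species_before, species_after = 595, 1_109
families_before, families_after = 285_542, 732_993
ebi_families = 440_543  # families added from the EBI RepeatModeler runs

# Growth factors behind "approximately doubles" and "~2.5x".
species_ratio = species_after / species_before
family_ratio = families_after / families_before

# Assumption: every EBI-derived family is part of the net increase.
new_families = families_after - families_before
ebi_share = ebi_families / new_families

print(f"species growth:  {species_ratio:.2f}x")
print(f"family growth:   {family_ratio:.2f}x")
print(f"new families:    {new_families:,}")
print(f"EBI share of new families: {ebi_share:.1%}")
```

Under that assumption, the EBI imports account for nearly all of the growth in family count, with the community submissions contributing most of the remainder.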

References associated with community submissions

¹ Parisot, N., et al. (2021). The transposable element-rich genome of the cereal pest Sitophilus oryzae. BMC Biology, 19(1), 241. https://doi.org/10.1186/s12915-021-01158-2
² Han, G., et al. (2021). Diversity of short interspersed nuclear elements (SINEs) in lepidopteran insects and evidence of horizontal SINE transfer between baculovirus and lepidopteran hosts. BMC Genomics, 22(1), 226. https://doi.org/10.1186/s12864-021-07543-z
³ Craig, R. J., et al. (2021). An Ancient Clade of Penelope-Like Retroelements with Permuted Domains Is Present in the Green Lineage and Protists, and Dominates Many Invertebrate Genomes. Molecular Biology and Evolution, 38(11), 5005–5020. https://doi.org/10.1093/molbev/msab225
⁴ Hoyt, S. J., et al. (2022). From telomere to telomere: The transcriptional and epigenetic state of human repeat elements. Science, 376(6588), eabk3112. https://doi.org/10.1126/science.abk3112

Posted in Dfam, Releases
Tags: dfam, EBI, release

Rfam release 14.7

December 21, 2021

We are happy to announce the latest Rfam release, version 14.7. The release includes 121 updated microRNA families, 4 new families, and a redesigned Rfam-PDB mapping pipeline that provides weekly updates as new RNA 3D structures become available. Read on to find out more or explore the data in Rfam.

Updated microRNA families

As part of the Rfam-miRBase synchronisation project discussed in the Rfam 14 paper, we continue revising microRNA families in Rfam using the data provided by miRBase. This release includes 121 updated families, such as mir-6 (RF00143) and mir-22 (RF00653). The following five microRNA families have been deleted from Rfam as the corresponding entries were removed from miRBase due to lack of evidence: mir-1937 (RF01942), mir-1280 (RF02013), mir-353 (RF00800), mir-720 (RF02002), and mir-2973 (RF02096). We would like to thank Lisanne Knol (University of Edinburgh) for bringing the first two of these families to our attention.

We estimate that Rfam is now approximately 60% in sync with miRBase, with additional families to be released in future versions of Rfam. You can view the full list of updated families here or browse all microRNAs in Rfam.

New families

The release includes two new hairpin ribozyme families, Hairpin-meta1 (RF04190) and Hairpin-meta2 (RF04191), recently reported by the Weinberg lab. The hairpin ribozymes were discovered in metatranscriptome data and are proposed to occur in circular RNA genomes of as yet uncharacterised organisms. The new families join the original Hairpin ribozyme family (RF00173).

Based on a recent paper we also created two additional bacterial families, the icd-II ncRNA motif (RF04189) and the carA ncRNA motif (RF04192). We would like to thank Ken Brewer (Yale University) for providing the data.

Weekly updates of PDB structures matching Rfam families

For many years Rfam maintained a mapping between Rfam families and experimentally determined RNA 3D structures available in PDB. However, this mapping lagged behind the weekly PDB updates as it was only updated with Rfam releases.

The newly implemented pipeline analyses the data every week and makes the data available on the Rfam website and in a new section of the FTP archive that contains a preview of the upcoming release. The new, up-to-date mapping also improves the ability to search PDBe using Rfam and is a key part of an ongoing project to review all Rfam families with known 3D structures.

Currently there are 127 Rfam families with experimentally determined RNA 3D structures in the PDB. For example, a recent paper describing how a viral RNA hijacks host machinery produced several structural models (7SAM, 7SC6, and 7SCQ) showing the pseudoknot of a tRNA-like structure; these models are now mapped to Rfam family RF01084. Follow Rfam on Twitter to be the first to hear when new RNA families are linked to 3D structures.

Other improvements

We continue improving the Gene Ontology (GO) terms associated with Rfam families, and in this release 412 families have been updated to use the latest GO terms. Keeping the GO terms up to date is important as Rfam is used for automatic assignment of GO terms in RNAcentral and other resources. The GO terms are shown in the Curation tab of each family and are also available in the rfam2go file.
The Rfam.seed_tree.tar.gz file hosted on the FTP archive has been fixed. We would like to thank Christian Anthon (University of Copenhagen) for reporting the problem.
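For readers who want to use the rfam2go file mentioned above in their own scripts, here is a minimal Python sketch of a parser. It assumes the file follows the common external2go layout (comment lines starting with `!`, and data lines of the form `Rfam:<accession> <name> > GO:<term> ; GO:<id>`); please check this against the file you download. The function name `parse_rfam2go` and the sample lines are illustrative and not taken from a release.

```python
import re
from collections import defaultdict

# Assumed line layout: "Rfam:RF00001 5S_rRNA > GO:<term name> ; GO:0003735"
LINE_RE = re.compile(
    r"^Rfam:(RF\d{5})\s+(\S+)\s+>\s+GO:(.+?)\s*;\s*(GO:\d{7})$"
)

def parse_rfam2go(lines):
    """Map each Rfam accession to a list of (GO id, GO term name) pairs."""
    mapping = defaultdict(list)
    for line in lines:
        line = line.strip()
        # Skip blank lines and "!" comment/header lines.
        if not line or line.startswith("!"):
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue  # ignore lines that do not fit the assumed layout
        accession, _name, term, go_id = match.groups()
        mapping[accession].append((go_id, term))
    return dict(mapping)

# Illustrative input only; a real file would be read with open(...).
sample = [
    "! illustrative lines, not copied from a release",
    "Rfam:RF00001 5S_rRNA > GO:structural constituent of ribosome ; GO:0003735",
    "Rfam:RF00001 5S_rRNA > GO:ribosome ; GO:0005840",
]

for accession, terms in parse_rfam2go(sample).items():
    print(accession, terms)
```

Lines that do not match the assumed layout are skipped rather than raising an error, so the sketch degrades quietly if the format differs; a stricter script might want to report them instead.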

Get in touch

As always, we look forward to hearing from you if you have any feedback or suggestions for Rfam. Please feel free to email us or get in touch on Twitter.

Happy holidays from the Rfam team!

This is the first release produced by Emma Cooke and Blake Sweeney, who joined Rfam in the second half of 2021. The Rfam team wishes you a happy holiday season! We look forward to creating lots more families in 2022 and working towards the next major Rfam release, Rfam 15.0!

Posted in Rfam
Tags: release, rfam

Pfam 35.0 is released

November 19, 2021

Pfam 35.0 contains a total of 19,632 families and clans. Since the last release, we have built 460 new families, killed 7 families and created 12 new clans. UniProt Reference Proteomes has increased by 7% since Pfam 34.0, and now contains 61 million sequences. Of the sequences that are in UniProt Reference Proteomes, 75.2% have at least one Pfam match, and 48.7% of all residues fall within a Pfam family.

Sources of new families

In an effort to increase the Pfam coverage of metagenomic sequence space, we have created 250 metagenomic protein families. These families were built by clustering protein sequences from the MGnify and UniProt databases, aligning the sequences in each cluster, and using the resulting alignments to create new SEED alignments. We then used our usual building process to create new families from the SEED alignments.

We have also created 52 new families based on clusters from a new resource called DPCfam, which is based on Density Peak Clustering and was created by Alessandro Laio, Marco Punta and Elena Tea Russo. An interesting example of these families is the N-terminal domain of the Crinkler effector protein (PF20147). Crinkling- and necrosis-inducing proteins (CRNs), or Crinklers, were first described in plant pathogenic oomycetes, where they are ubiquitous, and have been shown to participate in processes controlling plant cell death and immunity. However, Crinklers are also found outside oomycetes, such as the Rhizophagus irregularis crinkler effector protein 1 (RiCRN1) which, like other CRNs, functions in the plant nucleus, but plays an essential role in symbiosis progression and the proper initiation of arbuscule development. This suggests that Crinkler proteins are more widely distributed than first predicted, and that their function is not limited to plant death (PMID:30233541). The Pfam domain contains the conserved motif FLAK and, from structure predictions, adopts the ubiquitin-like fold, as seen in the image below.

Figure 1: N-terminal domain of RiCRN1. The image was generated using an AlphaFold colab notebook and is displayed using Molstar.

We continue to be provided with new families from the group of L. Aravind from NCBI, and have added 42 of them to this release of Pfam. Many of these families represent novel domains and proteins found in phage defence systems of bacteria.

Pfam-N

We are really excited about the Pfam-N matches for Pfam 35.0, but there is still a bit of work to do before we can release them. In particular, we’re working on neural networks that can predict the location of the domains themselves, instead of relying on HMMER to do so, as with the previous Pfam-N release. It will take a little more time to ensure the quality of these annotations, and we will make another announcement when they are ready.

Enjoy Pfam 35.0!

Posted by the Pfam team

Posted in News, Pfam, Production, Releases
Tags: pfam
