Archive for the 'Production' Category

Pfam 31.0 is released

March 8, 2017

Pfam 31.0 contains a total of 16712 families and 604 clans. Since the last release, we have built 415 new families, killed 9 families and created 11 new clans. We have also been working on expanding our clan classification; in Pfam 31.0, over 36% of Pfam entries are placed within a clan.

Pfam 30.0 is available

July 1, 2016

Pfam 30.0, our second release based on UniProt reference proteomes, is now available. The new release contains a total of 16,306 families, with 22 new families and 11 families killed since the last release. The UniProt reference proteome set has expanded and now includes 17.7 million sequences, compared with 11.9 million when we made Pfam 29.0. In this release, we have updated the annotations on hundreds of Pfam entries, and renamed some of our Domains of Unknown Function (DUF) families.

DUFs are protein domains whose function is uncharacterised. Over time, as scientific knowledge increases and new data about proteins comes to light, more information about the function of a domain may become available. As a result, DUFs can be renamed and re-annotated with more meaningful descriptions. As part of Pfam 30.0, we have re-annotated 116 DUFs based on updated information in the UniProtKB database, the scientific literature, and feedback from Pfam and InterPro users. Examples of some of our DUF updates in Pfam 30.0 are given below:

 

  • PF10265, created in release 23.0 and originally named DUF2217, has been renamed to Miga, a family of proteins that promote mitochondrial fusion.
  • PF10229, created in release 23.0 and originally named DUF2246, has been renamed to MMADHC, as it represents methylmalonic aciduria and homocystinuria type D proteins and their homologues. The structure of this domain is shown below.

Figure: Structure of MMADHC dimer, PDB:5CV0

  • PF12822, created in release 25.0 and originally named DUF3816, has been renamed to ECF_trnsprt, since it contains proteins identified as the substrate-specific component of energy-coupling factor (ECF) transporters.

Please note that we may change the identifier for a family (e.g. DUF2217), but we never change the accession for a family (e.g. PF10265).
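As a rough illustration of why the stable accession is the safer key, here is a minimal Python sketch. The record layout and the `find_accession` helper are hypothetical and are not part of any Pfam tool or API; only the accessions and names are taken from the examples above.

```python
# Hypothetical in-memory records keyed by the stable accession.
# The field names are assumptions for illustration only; the
# accessions and identifiers come from the DUF examples in this post.
families = {
    "PF10265": {"identifier": "Miga", "previous_identifiers": ["DUF2217"]},
    "PF10229": {"identifier": "MMADHC", "previous_identifiers": ["DUF2246"]},
    "PF12822": {"identifier": "ECF_trnsprt", "previous_identifiers": ["DUF3816"]},
}


def find_accession(name, records=families):
    """Return the accession whose current or former identifier equals name.

    Returns None when no record matches.
    """
    for accession, record in records.items():
        if name == record["identifier"] or name in record["previous_identifiers"]:
            return accession
    return None
```

With records like these, a lookup by an old name such as DUF2217 and a lookup by the new name Miga both resolve to PF10265, so anything stored against the accession survives a rename.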

If you find any more DUFs that can be assigned a name based on function, or have any other annotation updates, please get in touch with us (pfam-help@ebi.ac.uk).

 

Pfam 27.0 is now available!

March 22, 2013

In a blog post published just over a year ago, I proposed a number of changes to the content of Pfam to improve scalability and usability of the database. These changes came into effect a few days ago, when we released Pfam 27.0. This release of Pfam contains a total of 14831 families, with 1182 new families and 22 families killed since release 26.0. 80% of all proteins in UniProt contain a match to at least one Pfam domain, and 58% of all residues in the sequence database fall within a Pfam domain.

Dfam: A database of repetitive DNA elements

September 6, 2012

We are pleased to introduce Dfam 1.0, a database of profile HMMs for repetitive DNA elements. Repetitive DNA, particularly the remnants of transposable elements (TEs), makes up a large fraction of many genomes, especially eukaryotic ones. Accurate annotation of these TEs both simplifies downstream genomic analysis and enables research into their fascinating biology and impact on the genome.


Rfam 11.0 is out!

August 14, 2012

The team behind Rfam is pleased to announce the release of Rfam 11.0. This release represents a major update from 10.1, primarily due to the upgrade of our underlying sequence database, Rfamseq.


Does my family of interest have a determined 3D protein structure?

May 9, 2012

Two related questions that we are often asked via the Pfam helpdesk are ‘Which families have a known three-dimensional structure?’ and ‘Why is a particular PDB structure not found in Pfam?’ You may think that there are obvious answers to these questions – but as with many things in life, the answer is not necessarily as straightforward as you would have thought. In this joint posting between Andreas Prlic (senior scientist at RCSB Protein Data Bank) and myself (Rob Finn, Pfam Production Lead), we will elaborate on how the PDB and Pfam cross-referencing occurs, explain why discrepancies occurred in the past, and describe the pipeline that the RCSB PDB has implemented using the HMMER web services API, which should provide the most current answer to these questions.

Proposed Pfam release changes

February 27, 2012

The current Pfam release, version 26.0, took approximately 4 months to nurse through the various stages of updating the sequence database, resolving overlaps between families, rebuilding the MySQL database and performing all of the post-processing that constitutes the ‘release’. The production team strives to make two releases a year, but I really do not fancy spending two thirds of a year on Pfam releases. Thus, with my colleagues, I have been reviewing what we do and why we do it and, probably more importantly, assessing how much the different sections of the Web site are used. Below is a list of changes that are going to happen in the next release, release 27.0.


The Rfam 10.1 release is out!

June 16, 2011

The crowd of people behind Rfam are proud to announce a new release of the Rfam database. This is version 10.1, and it mostly brings an increase in the number, size and quality of families.

Rfam now has 1973 families, 528 more than in the 10.0 release. We are just one prokaryotic RNA-seq project away from hitting 2000 families! In fact, we have passed 2000 in terms of Rfam accession (RF02031 is the Escherichia coli sRNA, tpke11). The coveted RF02000 accession, which I selfishly attempted to claim, was snatched from my grasp by Chris Boursnell; RF02000 now corresponds to the rice microRNA MIR1846 from the miRBase database. I did claim RF01999 and RF02001 with two sub-types of Group II catalytic intron domains 1-4 that Zasha Weinberg kindly provided to Rfam.

The new families include nearly 100 novel elements inferred by Zasha Weinberg and colleagues in their recent Genome Biology article [1]. Zasha kindly provides Rfam with the alignments and writes Wikipedia articles for each notable element, greatly easing the burden on Rfam of incorporating these into the database.

Our new recruit, Ruth Eberhardt, originally from the UniProt group at EBI, has also made a significant mark on the new release. Ruth has been busily incorporating “domains” derived from long messenger-like non-coding RNAs (a.k.a. lncRNAs). These are regions within each transcript that are unusually well conserved, and there is some evidence that secondary structure within the regions is evolutionarily constrained. The new families include: MEG3, MALAT1, MIAT, PRINS, Xist, TUG1, HSR-omega, Evf1, HOTAIR, KCNQ1OT1, SOX2OT, NEAT1, EGOT, H19 and HOTAIRM1.

This summer we had the pleasure of hosting another talented summer student, Ben Moore. Ben is a prolific Wikipedian who rapidly made his mark on the RNA Wikipedia entries and continues to do so while working on an MRes in Computational Biology at the University of York. One stunning feat Ben accomplished was passing the article for “Toxin-antitoxin system” through Wikipedia’s peer-review process for “good articles”. This process appears to be at least as rigorous as scientific peer-review, so this is quite an achievement. He also built a number of families for RNA anti-toxins including Sok, RNAII, IstR, RdlD, FlmB, Sib, RatA, SymR and PtaRNA1. PtaRNA1 is a newly discovered RNA antitoxin that was first published by Sven Findeiss and colleagues [2] in the RNA families track at RNA Biology. This track has provided very useful updates and expansions for Rfam directly from the RNA community.

A guest Rfam rogue, Chris Boursnell, who has been visiting from Andrew Firth’s group, has also been busy building new families. With permission from the good people at the Recode-2 database [3], Chris has added a number of new frame-shift elements and has also updated a number of microRNA families based on the latest release of miRBase [4].

An enormous achievement for the database is the inclusion of full-length small subunit ribosomal RNA families. Previously Rfam had just one truncated model that covered all three kingdoms of life. The new families are possible thanks to the hard work of Eric Nawrocki and colleagues in Sean Eddy’s lab on the Infernal software and the related package ssu-align, which can now deal with much larger datasets than was previously possible. The three new alignments now cover bacteria, archaea and eukaryotes. These are all derived from the highly accurate and excellent alignments from the work of Robin Gutell and colleagues, who run the Comparative RNA Website.

Thanks to the exciting work by Stefan Washietl and colleagues on the RNAcode software package, we now have good evidence that the “RNA” family C0343 (RF00120) is in fact protein-coding and most likely does not function as an RNA other than as an mRNA [5]. Therefore C0343 has been removed for the 10.1 release.

Our SRP families have all been rebuilt and supplemented with additional families thanks to the work of Magnus Rosenblad and colleagues [6]. This is another excellent contribution to the RNA families track at RNA Biology. Based on this work the existing two SRP families were replaced and supplemented by 7 new families: Metazoa_SRP (RF00017), Bacteria_small_SRP (RF00169), Fungi_SRP (RF01502), Bacteria_large_SRP (RF01854), Plant_SRP (RF01855), Protozoa_SRP (RF01856) and Archaea_SRP (RF01857). These new models should improve the specificity of Rfam annotations and reduce the number of pseudogenes incorporated.

We have continued to work on the Rfam clans and have added 3 new clans. These are U3 (CL00100), Cobalamin (CL00101) and group-II-D1D4 (CL00102). Also, the membership of the clans tRNA (CL00001), RNaseP (CL00002) and SNORA62 (CL00040) has been updated.

Finally, several problematic families, namely the microRNAs mir-544 (RF01045), mir-1302 (RF00951), mir-1255 (RF00994), mir-548 (RF01061), mir-649 (RF01029) and mir-562 (RF00998), together with spliceosomal U13 (RF01210), were rethresholded to remove the excessive number of pseudogene annotations in the full alignments. This rethresholding, along with the rebuilding of our SSU models, has removed approximately 600,000 annotations from Rfam.

There are countless other changes that have been made. If I’ve forgotten to include any that are significant to you, or to mention your name, then I apologise profusely.

This release could not have happened without the invaluable help of Jen Daub and John Tate, who have worked tirelessly and enthusiastically on this release. This was made particularly challenging by the fact that I have recently relocated to my homeland in New Zealand to take up a position as a Rutherford Discovery Fellow and senior lecturer at the University of Canterbury in Christchurch. I hope to continue to contribute to Rfam and the wider RNA community from here. This is also a good moment to welcome the new Czars of Rfam, Sarah Burge and Eric Nawrocki, who will now face the exciting and challenging task of managing the day-to-day work of maintaining Rfam. I wish them the best of luck in their new roles. I hope they enjoy it as much as I have.

Paul Gardner.

References

[1] Weinberg Z, Wang JX, Bogue J, Yang J, Corbino K, Moy RH, Breaker RR. (2010) Comparative genomics reveals 104 candidate structured RNAs from bacteria, archaea, and their metagenomes. Genome Biology. 11(3):R31.

[2] Findeiss S, Schmidtke C, Stadler PF, Bonas U. (2010) A novel family of plasmid-transferred anti-sense ncRNAs. RNA Biology. 7(2):120-4.

[3] Bekaert M, Firth AE, Zhang Y, Gladyshev VN, Atkins JF, Baranov PV. (2010) Recode-2: new design, new search tools, and many more genes. Nucleic Acids Research. 38(Database issue):D69-74.

[4] Kozomara A, Griffiths-Jones S. (2011) miRBase: integrating microRNA annotation and deep-sequencing data. Nucleic Acids Research. 39(Database issue):D152-7.

[5] Washietl S, Findeiss S, Müller SA, Kalkhof S, von Bergen M, Hofacker IL, Stadler PF, Goldman N. (2011) RNAcode: robust discrimination of coding and noncoding regions in comparative sequence data. RNA. 17(4):578-94.

[6] Rosenblad MA, Larsen N, Samuelsson T, Zwieb C. (2009) Kinship in the SRP RNA family. RNA Biology. 6(5):508-16.

Plans for Rfam 2010-2011

June 30, 2010

Besides running RNA informatics courses, the Rfam people will be working on the usual summer family-building exercise together with a bright summer student. Priority number one will be building the families from the published RNA Biology articles that we ran out of time to do for the Rfam 10.0 release:

SmY, MRP, Yfr2, tmRNA,
Trypanosomal H/ACA ncRNAs, GIR1, U3, SRP,
influenza pseudoknot, ptaRNA1, RsaOG, rsmX

An exciting trend we’re starting to see is groups appending machine-parsable alignments to their papers and writing Wikipedia articles for their families outside of the RNA families track. Of particular note are the 81 families from Zasha Weinberg’s latest papers [1-3] (see the table below). Also, Daniel Gautheret’s and Wade Winkler’s groups are supporting this effort with their respective CsfG RNA [4] and EAR motif [5] articles. Rightly or wrongly, we are crediting some of this to the increased exposure of Rfam’s requirements from the RNA families track at the journal RNA Biology. The journal, by the way, has an impact factor for the first time, and it is a fantastic 5.559. RNA Biology is punching well above its weight for now; long may this continue. Once our super summer student has finished with these (much easier) families, he’ll be moving on to our terrifyingly long list of potential Rfam families that are waiting to be built. If you see anything on this list that you might be interested in writing an RNA Biology article for, then please let us know as soon as possible.

In other news, we’re still hunting for good people to join a revamped Xfam group. In a few days we’ll be advertising Curator positions and we’re still looking for a Senior Computational Biologist. Check the Sanger Jobs site over the next few days.

[1] Weinberg Z, Wang JX, Bogue J, Yang J, Corbino K, Moy RH, Breaker RR. (2010) Comparative genomics reveals 104 candidate structured RNAs from bacteria, archaea, and their metagenomes. Genome Biology.

[2] Weinberg Z, Perreault J, Meyer MM, Breaker RR. (2009) Exceptional structured noncoding RNAs revealed by bacterial metagenome analysis. Nature.

[3] Meyer MM, Ames TD, Smith DP, Weinberg Z, Schwalbach MS, Giovannoni SJ, Breaker RR. (2009) Identification of candidate structured RNAs in the marine organism ‘Candidatus Pelagibacter ubique’. BMC Genomics.

[4] Marchais A, Duperrier S, Gautheret D, Stragier P. (2010) A sporulation-specific, small noncoding RNA highly conserved in endospore formers. In preparation.

[5] Irnov I, Winkler WC. (2010) A regulatory RNA required for antitermination of biofilm and capsular polysaccharide operons in Bacillales. Mol Microbiol.

6S-Flavo Acido-1 Acido-Lenti-1 Actino-pnp
AdoCbl-variant Bacillaceae-1 Bacillus-plasmid Bacteroid-trp
Bacteroidales-1 Bacteroides-1 C4-a1b1 C4
Chlorobi-1 Chlorobi-RRM Chloroflexi-1 Clostridiales-1
Collinsella-1 Cyano-1 Cyano-2 Dictyoglomi-1
Downstream-peptide Flavo-1 Gut-1 JUMPstart
L17DE Lacto-rpoB Lacto-usp Lnt
Methylobacterium-1 Moco-II Ocean-V Pedo-repair
PhotoRC-I PhotoRC-II Polynucleobacter-1 Pseudomon-1
Pseudomon-Rho Pseudomon-groES Pyrobac-1 Rhizobiales-2
SAM-Chlorobi SAM-I-IV-variant SAM-II_long_loops SAM-SAH
STAXI Termite-flg Termite-leu TwoGGAY
asd atoC crcB EAR
flg-Rhizobiales flpD gabT glnA
gyrA hopC lactis-plasmid leu-phe_leader
livK manA mraW msiK
nuoG pan pfl potC
psaA psbNH radC rmf
rne-II sbcD sucA-II sucC
traJ-II wcaG whalefall-1 yjdF
ykkC-III

Pfam, HMMER3 and the next release

March 23, 2010

The Xfam blog has been fairly quiet since the release of Pfam 24.0, so I thought I would give you a quick update on what we have been up to in the Pfam team.