In a blog post published just over a year ago, I proposed a number of changes to the content of Pfam to improve scalability and usability of the database. These changes came into effect a few days ago, when we released Pfam 27.0. This release of Pfam contains a total of 14831 families, with 1182 new families and 22 families killed since release 26.0. 80% of all proteins in UniProt contain a match to at least one Pfam domain, and 58% of all residues in the sequence database fall within a Pfam domain.
Archive for the 'Production' Category
We are pleased to introduce Dfam 1.0, a database of profile HMMs for repetitive DNA elements. Repetitive DNA, especially the remnants of transposable elements (TEs), makes up a large fraction of many genomes, particularly eukaryotic ones. Accurate annotation of these TEs both simplifies downstream genomic analysis and enables research into their fascinating biology and impact on the genome.
The team behind Rfam is pleased to announce the release of Rfam 11.0. This release represents a major update from 10.1, primarily due to the upgrade of our underlying sequence database, Rfamseq.
Two related questions that we are often asked via the Pfam helpdesk are ‘Which families have a known three-dimensional structure?’ and ‘Why is a particular PDB structure not found in Pfam?’. You may think that there are obvious answers to these questions, but as with many things in life the answer is not necessarily as straightforward as you would have thought. In this joint posting between Andreas Prlic (senior scientist at RCSB Protein Data Bank) and myself (Rob Finn, Pfam Production Lead), we will elaborate on the way the PDB and Pfam cross-referencing occurs, explain why discrepancies occurred in the past, and describe the pipeline that the RCSB PDB has implemented using the HMMER web services API, which should provide the most current answer to these questions.
The current Pfam release, version 26.0, took approximately 4 months to nurse through the various stages of updating the sequence database, resolving overlaps between families, rebuilding the MySQL database and performing all of the post-processing that constitutes the ‘release’. The production team strives to make two releases a year, but I really do not fancy spending two thirds of a year on Pfam releases. Thus, with my colleagues, I have been reviewing what we do and why we do it and, probably more importantly, assessing how much different sections of the Web site are used. Below is a list of changes that are going to happen in the next release, release 27.0.
The crowd of people behind Rfam are proud to announce a new release of the Rfam database. This is version 10.1 and is mostly an increase in the number, size and quality of families.
Rfam now has 1973 families, 528 more families than the 10.0 release. We are just one prokaryotic RNA-seq project away from hitting 2000 families! In fact, we have passed 2000 in terms of Rfam accession (RF02031 is the Escherichia coli sRNA, tpke11). My selfish attempt to claim the coveted RF02000 accession was thwarted by Chris Boursnell, who added RF02000, which now corresponds to the rice microRNA MIR1846 from the miRBase database. I did claim RF01999 and RF02001 with two sub-types of Group II catalytic intron domains 1-4 that Zasha Weinberg kindly provided to Rfam.
The new families include nearly 100 novel elements inferred by Zasha Weinberg and colleagues in their recent Genome Biology article [1]. Zasha kindly provides Rfam with the alignments and writes Wikipedia articles for each notable element, greatly easing the burden on Rfam for incorporating these into the database.
Our new recruit, Ruth Eberhardt, originally from the UniProt group at EBI, has also made a significant mark on the new release. Ruth has been busily incorporating “domains” derived from long messenger-like non-coding RNAs (a.k.a. lncRNAs). These are regions within each transcript that are unusually well conserved and there is some evidence that secondary structure within the regions is evolutionarily constrained. The new families include: MEG3, MALAT1, MIAT, PRINS, Xist, TUG1, HSR-omega, Evf1, HOTAIR, KCNQ1OT1, SOX2OT, NEAT1, EGOT, H19 and HOTAIRM1.
This summer we had the pleasure of hosting another talented summer student, Ben Moore. Ben is a prolific Wikipedian and rapidly made his mark on the RNA Wikipedia entries and continues to do so while working on an MRes in Computational Biology at the University of York. One stunning feat Ben accomplished was passing the article for “Toxin-antitoxin system” through Wikipedia’s peer-review process for “good articles”. This process appears to be at least as rigorous as scientific peer-review and is quite an achievement. He also built a number of families for RNA antitoxins including Sok, RNAII, IstR, RdlD, FlmB, Sib, RatA, SymR and PtaRNA1. PtaRNA1 is a newly discovered RNA antitoxin that was first published by Sven Findeiss and colleagues [2] in the RNA families track at RNA Biology. This track has provided very useful updates and expansions for Rfam directly from the RNA community.
A guest Rfam rogue, Chris Boursnell, who has been visiting from Andrew Firth’s group, has also been busy building new families. With permission from the good people at the Recode-2 database [3], Chris has added a number of new frame-shift elements and has also updated a number of microRNA families based on the latest release of miRBase [4].
An enormous achievement for the database is the inclusion of full-length small subunit ribosomal RNA families. Previously Rfam had just one truncated model that covered all three kingdoms of life. The change is thanks to the hard work of Eric Nawrocki and colleagues in Sean Eddy’s lab on the Infernal software and the related package ssu-align, which can now deal with much larger datasets than was previously possible. The three new alignments now cover bacteria, archaea and eukaryotes. These are all derived from the highly accurate alignments produced by Robin Gutell and colleagues, who run the Comparative RNA Website.
Thanks to the exciting work by Stefan Washietl and colleagues on the RNAcode software package, we now have good evidence that the “RNA” family, C0343 (RF00120), is in fact protein-coding and most likely does not function as an RNA other than in an mRNA sense [5]. Therefore C0343 has been removed for the 10.1 release.
Our SRP families have all been rebuilt and supplemented with additional families thanks to the work of Magnus Rosenblad and colleagues [6]. This is another excellent contribution to the RNA families track at RNA Biology. Based on this work the existing two SRP families were replaced and supplemented by 7 new families: Metazoa_SRP (RF00017), Bacteria_small_SRP (RF00169), Fungi_SRP (RF01502), Bacteria_large_SRP (RF01854), Plant_SRP (RF01855), Protozoa_SRP (RF01856) and Archaea_SRP (RF01857). These new models should improve the specificity of Rfam annotations and reduce the number of pseudogenes incorporated.
We have continued to work on the Rfam clans and have added 3 new clans. These are U3 (CL00100), Cobalamin (CL00101) and group-II-D1D4 (CL00102). Also, the membership of the clans tRNA (CL00001), RNaseP (CL00002) and SNORA62 (CL00040) has been updated.
Finally, several problematic microRNA families, mir-544 (RF01045), mir-1302 (RF00951), mir-1255 (RF00994), mir-548 (RF01061), mir-649 (RF01029) and mir-562 (RF00998), as well as spliceosomal U13 (RF01210), were rethresholded to remove the excessive number of pseudogene annotations in the full alignments. This rethresholding, along with the rebuilding of our SSU models, has removed approximately 600,000 annotations from Rfam.
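For readers unfamiliar with rethresholding, here is a minimal sketch of the idea in Python. It assumes that each hit in a family's full alignment carries a bit score and that membership is decided by a single per-family cutoff. The `Hit` class, the sequence names, the scores and both cutoff values are invented for illustration; none of them come from Rfam or from the Infernal software.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """A hypothetical match of a family model to a sequence region."""
    sequence_id: str
    bit_score: float


def apply_threshold(hits, threshold):
    """Keep only the hits scoring at or above the family's cutoff."""
    return [hit for hit in hits if hit.bit_score >= threshold]


# Invented scores: two strong matches and three weak, pseudogene-like ones.
hits = [
    Hit("seq_A", 82.4),
    Hit("seq_B", 67.1),
    Hit("seq_C", 41.0),
    Hit("seq_D", 38.5),
    Hit("seq_E", 36.2),
]

old_full = apply_threshold(hits, threshold=35.0)  # permissive cutoff
new_full = apply_threshold(hits, threshold=60.0)  # raised cutoff

removed = len(old_full) - len(new_full)
print(f"kept {len(new_full)} of {len(old_full)} hits, removed {removed}")
```

Raising the cutoff trades a little sensitivity for specificity: the weak hits drop out of the full alignment while the high-scoring members are untouched.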
There are countless other changes that have been made. If I’ve forgotten to include any that are significant to you, or to mention your name, then I apologise profusely.
This release could not have happened without the invaluable help of Jen Daub and John Tate, who have worked tirelessly and enthusiastically on it. This was made particularly challenging by the fact that I have recently relocated to my homeland in New Zealand to take up a position as a Rutherford Discovery Fellow and senior lecturer at the University of Canterbury in Christchurch. I hope to continue to contribute to Rfam and the wider RNA community from here. This is also a good moment to welcome the new Czars of Rfam, Sarah Burge and Eric Nawrocki, who will now face the exciting and challenging task of managing the day-to-day work of maintaining Rfam. I wish them the best of luck in their new roles. I hope they enjoy it as much as I have.
[1] Weinberg Z, Wang JX, Bogue J, Yang J, Corbino K, Moy RH, Breaker RR. (2010) Comparative genomics reveals 104 candidate structured RNAs from bacteria, archaea, and their metagenomes. Genome Biology. 11(3):R31.
[2] Findeiss S, Schmidtke C, Stadler PF, Bonas U. (2010) A novel family of plasmid-transferred anti-sense ncRNAs. RNA Biology. 7(2):120-4.
[3] Bekaert M, Firth AE, Zhang Y, Gladyshev VN, Atkins JF, Baranov PV. (2010) Recode-2: new design, new search tools, and many more genes. Nucleic Acids Research. 38(Database issue):D69-74.
[4] Kozomara A, Griffiths-Jones S. (2011) miRBase: integrating microRNA annotation and deep-sequencing data. Nucleic Acids Research. 39(Database issue):D152-7.
[5] Washietl S, Findeiss S, Müller SA, Kalkhof S, von Bergen M, Hofacker IL, Stadler PF, Goldman N. (2011) RNAcode: robust discrimination of coding and noncoding regions in comparative sequence data. RNA. 17(4):578-94.
[6] Rosenblad MA, Larsen N, Samuelsson T, Zwieb C. (2009) Kinship in the SRP RNA family. RNA Biology. 6(5):508-16.
Besides running RNA informatics courses, the Rfam people will be working on the usual summer family-building exercise together with a bright summer student. Priority number one will be building families from the published RNA Biology articles that we ran out of time to do for the Rfam 10.0 release:
Trypanosomal H/ACA ncRNAs, GIR1, U3 and SRP.
An exciting trend we’re starting to see is groups appending machine-parsable alignments to their papers and writing Wikipedia articles for their families outside of the RNA families track. Of particular note are the 81 families from Zasha Weinberg’s latest papers [1-3] (see the table below). Also, Daniel Gautheret’s and Wade Winkler’s groups are supporting this effort with their respective CsfG RNA [4] and EAR motif [5] articles. Rightly or wrongly, we are crediting some of this to the increased exposure of Rfam’s requirements from the RNA families track at the journal RNA Biology. The journal, by the way, now has an impact factor for the first time, and it is a fantastic 5.559. RNA Biology is punching well above its weight for now; long may this continue. Once our super summer student has finished with these (much easier) families he’ll be moving on to our terrifyingly long list of potential Rfam families that are waiting to be built. If you see anything on this list that you might be interested in writing an RNA Biology article for, then please let us know as soon as possible.
In other news, we’re still hunting for good people to join a revamped Xfam group. In a few days we’ll be advertising Curator positions and we’re still looking for a Senior Computational Biologist. Check the Sanger Jobs site over the next few days.
[1] Weinberg Z, Wang JX, Bogue J, Yang J, Corbino K, Moy RH, Breaker RR. (2010) Comparative genomics reveals 104 candidate structured RNAs from bacteria, archaea, and their metagenomes. Genome Biology.
[2] Weinberg Z, Perreault J, Meyer MM, Breaker RR. (2009) Exceptional structured noncoding RNAs revealed by bacterial metagenome analysis. Nature.
[3] Meyer MM, Ames TD, Smith DP, Weinberg Z, Schwalbach MS, Giovannoni SJ, Breaker RR. (2009) Identification of candidate structured RNAs in the marine organism ‘Candidatus Pelagibacter ubique’. BMC Genomics.
[4] Marchais A, Duperrier S, Gautheret D, Stragier P. (2010) A sporulation-specific, small noncoding RNA highly conserved in endospore formers. In preparation.
[5] Irnov I, Winkler WC. (2010) A regulatory RNA required for antitermination of biofilm and capsular polysaccharide operons in Bacillales. Mol Microbiol.
The Xfam blog has been fairly quiet since the release of Pfam 24.0, so I thought I would give you a quick update on what we have been up to in the Pfam team.
We are now on the brink of releasing Pfam 24.0. This will be a landmark release, as it will be the first to be built using the new version of the HMMER package, HMMER3. We are well aware that we have been claiming this release as imminent for some time, but we are now at the point of flicking the big switch. There are numerous changes that users need to know about and we will briefly summarise them here.
Back in May we wrote a blog post about the new version of pfam_scan.pl. We asked if there was anyone out there who was willing to help us test our new script, and we were pleasantly surprised at the number of people who got in contact with us, so a big thank you to all those who have helped. Since releasing the alpha version of pfam_scan.pl to our testers we have made some internal changes to the script that are worth mentioning.