Rosetta@home

Rosetta@home is a distributed computing project run by the Baker Laboratory at the University of Washington that aims to solve the protein structure prediction problem.

Rosetta@home's project goals
Rosetta's goal is to develop computational methods that accurately predict and design protein structures and protein complexes. This computational endeavor may ultimately help researchers develop cures for human diseases such as HIV/AIDS, cancer, Alzheimer's disease and malaria.
 * Long-term molecular biology research has concluded that most diseases are directly manifested at the level of protein activity.
 * Predicting (and designing) protein structures and protein complexes is one of the "holy grails" of computational biology.
 * Computational chemistry requires an enormous amount of computing resources, more than most off-the-shelf supercomputers (such as those made by Cray) can provide.

Rosetta researchers rely on a technique known as distributed computing, which pools the resources of volunteered idle computers. As of June 2006 the project has about 70,000 active PCs delivering about 38 teraFLOPS of sustained combined processing power, but it is still actively seeking new participants to reach the 150 teraFLOPS mark.
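
The figures above imply an average contribution per PC and a rough number of PCs needed to reach the target. The following is a back-of-the-envelope sketch only: it assumes that new participants would contribute the same average throughput as existing ones, and the function names are illustrative.

```python
import math

# Figures quoted in the text (June 2006).
ACTIVE_PCS = 70_000
SUSTAINED_TFLOPS = 38.0
TARGET_TFLOPS = 150.0


def gflops_per_pc(total_tflops: float, pcs: int) -> float:
    """Average sustained GFLOPS contributed by one PC."""
    return total_tflops * 1_000 / pcs


def pcs_needed(target_tflops: float, per_pc_gflops: float) -> int:
    """PCs required to reach the target, assuming the same average per PC."""
    return math.ceil(target_tflops * 1_000 / per_pc_gflops)


if __name__ == "__main__":
    per_pc = gflops_per_pc(SUSTAINED_TFLOPS, ACTIVE_PCS)
    print(f"average per PC: {per_pc:.2f} GFLOPS")
    print(f"PCs needed for target: {pcs_needed(TARGET_TFLOPS, per_pc)}")
```

Under that assumption, each PC contributes about 0.54 GFLOPS on average, so the 150 teraFLOPS mark would need roughly four times as many active PCs as the project had in June 2006.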

Baker Lab
The Baker Laboratory is based at the University of Washington.

The principal investigator is David Baker, Professor of Biochemistry at the University of Washington and Howard Hughes Medical Institute investigator, who was elected to the United States National Academy of Sciences in April 2006.

The Baker Lab scientific team includes post-docs Phil Bradley, Jim Havranek, Bill Schief, Vanita Sood, Bin Qian, Eric Althoff, Daniela Roethlisberger and John Karanicolas, as well as numerous graduate students and visiting scientists.

Computing platform
The Rosetta science application is available for the Microsoft Windows, Linux and Macintosh platforms. Participation in the project requires a CPU of at least 500 MHz, 200 MB of free disk space, 256 MB of RAM, and Internet connectivity (preferably a fast connection such as ADSL or other broadband).

The project uses the Berkeley Open Infrastructure for Network Computing (BOINC) distributed computing platform. BOINC is a free, open-source program that runs under Windows (Vista/XP/2K/2003/NT/98/ME), Linux/FreeBSD/Unix and Mac OS X. It is developed at the University of California, Berkeley and is used by most distributed computing science projects.

The current BOINC program version is 5.10, and the current Rosetta application (as of August 6, 2007) is Rosetta version 5.68.

Joining the project
To join the project, follow these steps:
 * 1) Read the project's rules and policies.
 * 2) Download, install and run the BOINC software (v5.4.9 or later) used by Rosetta@home.
 * 3) On your PC, go to BOINC Manager, then to Projects → Attach to Project and, when prompted, enter:
    * Project URL: http://boinc.bakerlab.org/rosetta/
    * Email address: your valid e-mail address
    * Password: choose a password for this project

If you need help installing and running the BOINC software (Step #2), you can read the BOINC install instructions in the BOINC-Wiki.

Once you complete Step #3, BOINC will automatically download the Rosetta software and your first work unit and start working. You can monitor progress and view the graphics from BOINC Manager → Work tab → select a running work unit → click Show Graphics.

Debian/Ubuntu Linux users are advised to install the BOINC package from http://wiki.debian.org/BOINC, which includes automatic startup/shutdown scripts.
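
On a machine without BOINC Manager, the attach step can in principle be performed from the command line. The sketch below only builds the command and does not run it. It assumes the `boinccmd` command-line tool that ships with BOINC and its `--project_attach URL account_key` option; the tool name and options may differ between client versions, and the account key shown is a placeholder, so check the documentation of the installed client before relying on it.

```python
PROJECT_URL = "http://boinc.bakerlab.org/rosetta/"  # project URL from the joining steps


def attach_command(account_key: str, tool: str = "boinccmd") -> list[str]:
    """Build (but do not run) the argument list that attaches a client to the project."""
    if not account_key:
        raise ValueError("an account key is required")
    # Assumed invocation: <tool> --project_attach <project URL> <account key>
    return [tool, "--project_attach", PROJECT_URL, account_key]


if __name__ == "__main__":
    # "YOUR_ACCOUNT_KEY" is a placeholder, not a real key.
    print(" ".join(attach_command("YOUR_ACCOUNT_KEY")))
```

The returned list can be passed to `subprocess.run` on a machine where the BOINC client is installed and running.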

The Rosetta@home Frequently Asked Questions (FAQ) covers a broad range of both technical and scientific issues.

Project significance
Decoding the human genome may be the greatest scientific achievement of this century. But before that knowledge can be used, scientists need to take the research a step further — they need to understand the proteins that are built from our DNA. Proteins are the parts that make up the machinery of living cells.

With the completion of the Human Genome Project, scientists have only a 'flat' view of the molecular structures of the proteins that make up the working parts of living cells and the human organism (the primary structure: the amino acid sequence). To really understand what a protein does, scientists need to know its 3-dimensional structure (the tertiary structure). Knowing the 3D structures of proteins, scientists can infer their role in cell processes and create new, effective therapeutics to treat a myriad of diseases.

Protein 3D structures are currently solved "experimentally" in the laboratory through X-ray crystallography or nuclear magnetic resonance (NMR) experiments. The process is slow (it can take weeks or even months to work out how to crystallise a protein for the first time) and expensive ($20,000-$100,000 USD per protein). Once the 3D structure of a protein has been finalised, it is often deposited in a public protein database such as the Protein Data Bank or the Cambridge Protein Structure Database. The human proteome has ~400,000 proteins, and other organisms contain many more. So far, ~46,000 protein 3D structures have been solved and deposited in the Protein Data Bank. Many structures obtained by private commercial ventures that crystallise medicinally relevant proteins are not deposited in public protein databases.

One of Rosetta@home's goals is to determine the 3D structure and function of as many proteins as possible "computationally" (on computers, rather than "experimentally" in the laboratory as described in the previous paragraph), and to make this information available to researchers worldwide at no cost.

The knowledge that is generated and shared through this project has the potential to fast-forward efforts to develop cures for diseases such as malaria, cancer, AIDS/HIV and Alzheimer's.

Head scientist Prof. David Baker wrote in the Rosetta@home progress journal on March 3, 2006:


 * Figure (Rosetta CASP6 Target T281): The first close to atomic-level resolution, blind ab initio structure prediction, CASP6 T281. The high-resolution refinement methodology described in the text produced a model 1.5 Å RMSD from the crystal structure (left panel), with aspects of the native side-chain packing (right panel).
 * "The protein structure prediction problem is perhaps the longest standing problem in molecular biology. It has been known for forty years that the structures of proteins are determined by their amino acid sequences, but as recently as five or six years ago it was generally thought that the prediction problem was completely intractable as very little progress had been made. Starting about this time we showed in the CASP blind tests that with the Rosetta low resolution structure prediction method rough models could be built for small proteins that in some cases were reasonably similar in topology to the true structure, but the predicted structures were never accurate at the atomic level. We have worked for the past five years on developing high resolution refinement methods that could take these rough models and refine them to much higher accuracy. This goal remained elusive for the first few years, but about a year and a half ago we made a breakthrough and found that we could make very accurate predictions for some proteins using a trick that involves folding not only the sequence of the protein of interest but also the sequences of a large number of evolutionarily related homologs. Using this method we made the first high accuracy ab initio structure prediction in CASP (the last target in CASP6) and did further tests which showed accurate predictions for 6 of 16 proteins which were published in Science last year.


 * However, this work did not achieve the goal of predicting structure accurately from the amino acid sequence of a protein alone as we had to resort to evolutionary information. Achieving this goal has been the central aim of Rosetta@home thus far, and as I said above it is almost a "holy grail" of computational biology. So now, for quite a few proteins we are coming close to predicting structure from their amino acid sequences without any other information is pretty breathtaking.


 * It is clear for the still large number of proteins for which we are failing that the problem is not enough sampling, even with 100,000 independent folding runs we are not coming close enough to the native structure to land in its energy minimum. So we need more CPU power! It is kind of amazing that solving such a long standing scientific problem depends so crucially on the efforts of volunteers like yourselves!"

By participating in the Rosetta@home Project, volunteers help verify and develop these revolutionary new algorithms.
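
Prediction accuracy in this section is quoted as an RMSD (root-mean-square deviation) in ångströms between a model and the crystal structure: the square root of the mean squared distance between corresponding atoms. The sketch below illustrates the calculation on made-up coordinates. It assumes the two structures have already been superimposed and list their atoms in the same order; real comparisons first find the optimal superposition, which is omitted here.

```python
import math

Point = tuple[float, float, float]


def rmsd(model: list[Point], reference: list[Point]) -> float:
    """Root-mean-square deviation between two equally long, pre-aligned coordinate lists."""
    if len(model) != len(reference) or not model:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(
        (mx - rx) ** 2 + (my - ry) ** 2 + (mz - rz) ** 2
        for (mx, my, mz), (rx, ry, rz) in zip(model, reference)
    )
    return math.sqrt(squared / len(model))


if __name__ == "__main__":
    # Toy coordinates (in ångströms), not taken from any real protein.
    reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
    model = [(x, y + 1.5, z) for x, y, z in reference]  # every atom displaced by 1.5 Å
    print(rmsd(model, reference))
```

In this toy case every atom is displaced by exactly 1.5 Å, so the RMSD is 1.5 Å, the same value reported for the CASP6 T281 model.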

Project science
To understand the project better, it helps to clarify why it needs a distributed computing platform. Head scientist Dr. David Baker explains:
 * "With the completion of the Human Genome Project, we are now capable of predicting what the amino acid sequences of different proteins in the human body will be. However, while we know the order of the amino acids, proteins do not remain in a two dimensional shape inside the human body. They "fold" into different three dimensional shapes for various proteins in order to serve their functions inside the human body. A protein with a distinctive amino acid chain does not simply fold in a random manner, but usually folds in a specific configuration. The force behind this phenomenon is believed to be the hydrophobic effect. This effect is due to the fact that certain groupings of amino acids are either attracted or repelled by water, therefore they will attempt to either maximize or minimize their exposure to water respectively. Since the fluids that living bodies are composed of primarily consist of water, these amino acid chains fold in on themselves with the chains repelled by water tending to be part of the center of the protein to minimize the amount of their surface area exposed to the body's fluids. Those amino acid chains that are attracted to water tend to be part of the surface of the protein in order to increase the proportion of the chain exposed to the body's fluids. Since the forces that act on a protein depend on what the particular amino acid sequence of a protein is, a specific protein will usually 'fold' in the same manner. Due to the complexity of the forces acting upon amino acid sequences large enough to form a protein, it is extremely difficult to predict how the protein will fold, but this knowledge is vital for determining how the protein functions. Devising a way to quickly and accurately predict how a protein will fold could potentially allow researchers to devise treatments for diseases such as Alzheimer's and AIDS.


 * By using molecular dynamics, it is possible to attempt to use the basic laws of physics to simulate the folding process for a particular protein. (For example the Folding@home Project.) However, such an approach is limited in its utility since it requires an enormous amount of processing power to simulate even a simple protein's folding behavior. The Rosetta@home project takes a different approach by utilizing a newly developed software algorithm to try to predict the shape that a protein would be most likely to fold into. An additional piece of software analyzes the projected results and determines which of the various projected results is most likely to be correct. By utilizing a distributed computing project, it is possible to quickly create a database of billions of possible structures for a protein, and thereby obtain an accurate picture of what that folded protein would look like."
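
The approach described above (generate many candidate structures independently, score each one, and keep the most favourable) can be illustrated with a toy sketch. This is not Rosetta's algorithm or energy function: the one-dimensional "conformation", the made-up `toy_energy` landscape and the Metropolis Monte Carlo search are all stand-ins chosen to show why many independent runs from different random seeds help when the landscape has many local minima.

```python
import math
import random


def toy_energy(x: float) -> float:
    """A made-up rugged landscape with many local minima; the global minimum is at x = 0."""
    return 0.1 * x * x + 1.0 - math.cos(3.0 * x)


def folding_run(seed: int, steps: int = 2000, temperature: float = 0.3) -> tuple[float, float]:
    """One independent Metropolis Monte Carlo run; returns (best_energy, best_x)."""
    rng = random.Random(seed)          # each run is fully determined by its seed
    x = rng.uniform(-10.0, 10.0)       # random starting "conformation"
    energy = toy_energy(x)
    best = (energy, x)
    for _ in range(steps):
        candidate = x + rng.gauss(0.0, 0.5)
        candidate_energy = toy_energy(candidate)
        delta = candidate_energy - energy
        # Always accept downhill moves; accept uphill moves with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            x, energy = candidate, candidate_energy
            if energy < best[0]:
                best = (energy, x)
    return best


def predict(seeds: range) -> tuple[float, float]:
    """Run every seed independently and keep the lowest-energy result."""
    return min(folding_run(seed) for seed in seeds)


if __name__ == "__main__":
    energy, x = predict(range(50))
    print(f"lowest energy found: {energy:.4f} at x = {x:.3f}")
```

Because each run depends only on its seed, the runs can be handed to different volunteer computers and the results merged afterwards, which is what makes this kind of search well suited to distributed computing.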

Medical Relevance
Rosetta@home focuses on basic research, but its disease-related work includes AIDS/HIV, Alzheimer's, cancer, prostate cancer, malaria, anthrax and various viruses. Not all of these projects are running on BOINC yet, because the project is still working on an efficient queuing system that lets researchers submit jobs easily. (Source: http://boinc.bakerlab.org/rosetta/rah_medical_relevance.php)

The structure prediction calculations currently running on BOINC still have a direct bearing on treating disease, for three reasons:

1. Structure prediction and protein design are closely related
 * Improvements in structure prediction lead to improvements in protein design, which in turn can be directly translated into making new enzymes, vaccines, etc. For more information on protein design you might be interested in looking at the review we recently wrote in Science which is available at our home page (http://depts.washington.edu/bakerpg).


 * Schueler-Furman, O., Wang, C., Bradley, P., Misura, K., Baker, D. (2005). Progress in modeling of protein structures and interactions Science 310, 638-642.

2. Structure prediction identifies targets for new drugs
 * When we predict structures for proteins in the human genome on a large scale, we learn about the functions of many proteins, which will help in understanding how cells work and how disease occurs. More directly, we will be able to identify many new potential drug targets for which small molecule inhibitors (drugs) can be designed. To put this in context, one major road-block to developing new treatments for human disease is identifying new "drugable" protein targets. Most new drugs these days interact with the same targets as the old drugs, so these drugs lead to only small improvements in disease treatment. Structure prediction helps us identify new drug targets, and so will help us find innovative, perhaps even breakthrough, treatments for disease.

3. Structure prediction allows us to use "rational design" to create new drugs
 * If we know the structure of a protein, we can determine its functional sites, and specifically target those sites to be inactivated by a new drug. Calculation of whether a small molecule (drug) will bind to and inactivate a protein target is similar in many ways to the structure prediction calculations we are doing here--it is basically a problem of finding the lowest energy structure of the protein plus drug system--and we have recently developed a new module in ROSETTA to do this docking problem. Results are very promising, and in the near future your machines will likely be running drug docking calculations along with the vaccine and therapeutic protein design projects described above, in addition to the protein folding calculations you are doing now.

There will also be tests of calculations for the other projects described in the introduction section of the web site (HIV, malaria, cancer, prostate cancer, Alzheimer's). The vaccine design calculations will run on BOINC in the near future.


 * "With regard to the message board posts, we aren't yet doing any work on diabetes or Multiple Sclerosis specifically, but if we can generate accurate structures of proteins involved in these diseases using the methods you are helping us to develop, it will contribute to efforts to develop therapies."


Description

 * Rosetta is a combination of two software elements. Rosetta ab initio predicts the three-dimensional structure of a folded protein from its linear sequence of amino acids. Numerous software tools around the ab initio concept have also been created that facilitate protein structure prediction. Rosetta combines the Rosetta ab initio structure prediction method with Nuclear Magnetic Resonance (NMR) experimental data for rapid backbone structure determination. Rosetta Design is a useful tool in creating better proteins by determining amino acid sequences that are good for a particular protein structure. It can also be used to enhance protein stability and create alternative sequences for naturally occurring proteins.


 * The Rosetta codes are available to academics free of charge under a non-exclusive license while industry may obtain Rosetta through a non-exclusive license.

Development Background

 * Under the aegis of David Baker, who is a Howard Hughes Medical Institute investigator as well as a professor in the Department of Biochemistry, numerous faculty, postdocs, graduate students, and undergraduates have worked on the Rosetta project over the past nine years. The result has been a highly collaborative research program hosted at the Baker Laboratory and enriched by its many contributors.


 * Funding for the development of Rosetta was provided by the Packard Foundation, the National Science Foundation, and the National Institutes of Health.


Collaborative projects
The Human Proteome Folding Project (HPF) is a collaborative effort between New York University (Bonneau Lab), the Institute for Systems Biology (ISB) and the University of Washington (Baker Lab). HPF is running on IBM's World Community Grid (WCG) and on United Devices' grid.org.

HPF Phase-1 applied Rosetta v4.2x software on the human genome and 89 other genomes, starting in November 2004. It is expected to end in April 2006. HPF Phase-2 (HPF2) will apply the latest Rosetta v5.x software in higher resolution, "full atom refinement" mode, concentrating on cancer biomarkers (proteins found at dramatically increased levels in cancer tissues), human secreted proteins and malaria.


 * "The Baker laboratory at the University of Washington has developed a protein folding program named Rosetta. It has 3 major sections. The first section tries to fold a protein, going from a long string of amino acids to a crumpled up 3D structure. The second section tries to reverse this process. Given the surface of a crumpled up protein molecule, it attempts to design a chain of amino acids that will fold up to form that molecule. The third section tries to dock 2 different protein molecules to see how they will interact with each other.


 * A number of universities (such as the University of Warsaw) and research institutes use this Rosetta program for different purposes (see Rosetta Commons at http://www.rosettacommons.org/ ). David Baker maintains a server on the Internet called the Robetta server which allows other scientists to use Rosetta for their projects without maintaining local servers with Rosetta.


 * Recently (3Q05) the Baker Lab has started a BOINC project named Rosetta@home ( http://boinc.bakerlab.org/rosetta/ ). The Baker Lab only has a 500-node Linux cluster, so it is very time-consuming to test variations while trying to improve Rosetta. The first section of Rosetta which folds proteins (called the ab initio prediction section) uses 2 methods. The first method is a speedy low resolution method. The second method is a computationally intensive high resolution method which takes a fold prediction from the low resolution method and attempts to refine it to produce a more accurate prediction. The cluster of computers created for Rosetta@home is used to test various improvements in the high resolution method. Eventually, Dr. Baker also intends to use this cluster to run queries from other scientists that are currently queued up to run on the Rosetta server.


 * Rosetta has been producing the most accurate computer predictions of protein folds, as you can see at CASP6. The most accurate predictions are still made by human scientists, assisted by computer programs, but like human chess players, the computers are putting some pressure on them. Also see the 'Gene Machine' in the July 2001 issue of Wired


 * Now, getting down to particulars. Where do we come in? The Institute for Systems Biology (ISB) in Seattle, WA, USA, has started a project called the Human Proteome Folding Project (HPF) to fold all the unknown proteins found in the human genome plus a number of proteins from 80 other genomes. See HPF


 * This project uses the low resolution method of protein folding that is in the ab initio section of Rosetta. Each unknown protein is folded to produce about 10,000 predictions. Variable conditions are established by a random seed. Both grid.org and the World Community Grid are running this project for ISB. Each Work Unit makes 100-500 fold predictions for a previously unknown protein. The ISB creates a batch of proteins and puts it on the ISB server. Then either grid.org or WCG downloads the batch, sends out the work units, reassembles the results returned and finally uploads the corresponding batch of results back to the ISB server.


 * Both grids (WCG and grid.org) are using the same version of Rosetta to fold the proteins. There were some bug fixes made in Rosetta back in December 2004. The WCG took the lead and then sent the patched version to grid.org which beta tested the new version, then deployed it in January 2005. This was the only cross-grid transfer of Rosetta code that I know of since the HPF project went live.


 * There is a future project being readied to run on the World Community Grid that will use the new high resolution folding method being developed at Rosetta@home to refine the folding predictions made by HPF for some selected proteins. It is currently (and unimaginatively) being referred to as HPF2." source
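
The batching described in the quote (about 10,000 predictions per protein, split into work units of 100-500 predictions, with each prediction varied by a random seed) can be sketched as follows. The `WorkUnit` structure, the default of 250 predictions per unit and the scheme of handing each unit a contiguous block of seeds are illustrative assumptions, not the actual format used by grid.org or the World Community Grid.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkUnit:
    protein_id: str
    first_seed: int      # seeds first_seed .. first_seed + n_predictions - 1
    n_predictions: int


def make_work_units(protein_id: str, total: int = 10_000, per_unit: int = 250) -> list[WorkUnit]:
    """Split `total` predictions for one protein into work units of at most `per_unit`."""
    if total <= 0 or per_unit <= 0:
        raise ValueError("total and per_unit must be positive")
    units = []
    for first_seed in range(0, total, per_unit):
        # The last unit may be smaller when per_unit does not divide total evenly.
        units.append(WorkUnit(protein_id, first_seed, min(per_unit, total - first_seed)))
    return units


if __name__ == "__main__":
    units = make_work_units("example_protein")
    print(len(units), "work units,", sum(u.n_predictions for u in units), "predictions in total")
```

With these defaults one protein yields 40 work units. Since every seed appears in exactly one unit, the returned results can be reassembled into a full batch without duplicated predictions.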

Differences between protein research projects
Protein structure prediction projects such as Rosetta@home aim to predict the final tertiary structure of proteins from their amino acid sequences. Only then can biomedical scientists deduce each protein's role in cell processes.

Rosetta@home and Predictor@home are similar in that they both seek to predict the 3D structure. Rosetta uses energy functions to find the lowest-energy (most stable) state, whereas Predictor@home uses Monte Carlo simulations with a knowledge-based force field based on a simplified lattice model.

Currently Rosetta is among the most accurate protein prediction methods, as evidenced in recent biennial CASP experiments (see chart comparing Rosetta and Predictor accuracy).

Folding@home is an advanced computational chemistry project that uses molecular dynamics (the laws of physics) to study the dynamics of protein folding and to understand misfolding (aggregation) diseases such as Alzheimer's. Dr. Vijay S. Pande wrote in the Folding@home forums:


 * I know Baker and Ranganathan and their work very well and (like the rest of the protein community) find their work very important and impressive. However, Rosetta@home and Folding@Home are addressing very different problems.


 * Rosetta only predicts the final folded state, not how do proteins fold (and Rosetta has nothing to do with protein misfolding). Thus, those methods are not useful for the questions we're interested in and the diseases we're tackling (Alzheimer's Disease and other aggregation related diseases).


 * Also, one should note that accurate computational protein structure prediction is still very challenging compared to what one can do experimentally, whereas the information obtained from Folding@home on the nature of folding and misfolding pathways matches experiment (e.g. with quantitative validation in rates, free energy, etc) and then goes beyond what experiment can tell us in that arena. While Rosetta has gone a long way and is a very impressive project, given the choice between a Rosetta predicted structure and a crystal structure, one would always chose the crystal structure. I bet that will be changing due to their great efforts, but that may still be a ways off for that dream to be realized.


 * So, both are valuable projects IMHO, but addressing very different questions. I think there are some misunderstandings out there, though. Some people think FAH is all about structure prediction (which it is not -- that's Rosetta's strength) and some think Rosetta is about misfolding related disease (which it's not, that's Folding@Home's strength). Hopefully this post helps straighten some of that out.

Features and Issues
Features related to the current version
 * Since March 2006 the project has used variable run time work units, using the same raw protein data, with each unit being approximately 3 MB. Each work unit now runs for a defined period of CPU time (the default is 3 hours), calculating as many predicted protein structures, termed "models", as the computer can create in that period. A slow PC might compute only one model, whereas a fast PC might compute over 100.
 * Each work unit can run for up to 24 hours, with the exact run time being user-configurable. The maximum CPU run time will increase to four days once the "stuck at 1%" bug has been solved.
 * This CPU time option was added so that participants on dial-up Internet connections, or those operating large networks of PCs ("crunching farms"), can drastically reduce their Internet traffic, from 1 GB per month per Pentium 4 (running 24/7) to one tenth of that or even less.
 * Users also have the option to change the frame rate and CPU use for graphics. The default frame rate is 10 fps.
 * A new graphics version is available for Mac OS X users.
 * Rosetta will consume between 40 MB and 140 MB of memory. You need 256 MB of memory to be safe! (Official project requirements are for a 512 MB RAM PC, but many contributors have less). The biggest units, which are very rare, need up to 250 MB of memory.
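
The effect of the run time setting on model count and Internet traffic can be estimated with a little arithmetic. The sketch below is a rough model under stated assumptions: it counts only the roughly 3 MB download per work unit (uploads and application updates are ignored), assumes a 30-day month of continuous running, and assumes that a work unit always completes at least one model.

```python
import math

HOURS_PER_MONTH = 30 * 24      # one CPU running 24/7 for a 30-day month
WORK_UNIT_MB = 3.0             # approximate work unit size quoted in the list above


def models_per_work_unit(run_time_hours: float, hours_per_model: float) -> int:
    """Models completed in one work unit; assumes at least one model is always finished."""
    return max(1, math.floor(run_time_hours / hours_per_model))


def monthly_download_mb(run_time_hours: float) -> float:
    """Approximate monthly download volume for one CPU at a given target run time."""
    return HOURS_PER_MONTH / run_time_hours * WORK_UNIT_MB


if __name__ == "__main__":
    for hours in (3, 24):
        print(f"{hours:>2} h run time: about {monthly_download_mb(hours):.0f} MB per month")
```

Under these assumptions, the default 3-hour run time gives about 720 MB of downloads per month and a 24-hour run time about 90 MB, which is in line with the roughly tenfold reduction described above.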

Issues
 * The latest version update has solved the "Leave in memory when preempted" issue and the project is working to solve the "stuck at 1%" issue.