Title | Integration of GridFTP with Freeloader storage system |
---|---|
Student | Hesam Ghasemi |
Mentor | Rajkumar Kettimuthu |
Abstract

Scientific experiments produce large volumes of data, which require cost-conscious data stores as well as reliable data transfer mechanisms. GridFTP is a high-performance, secure, reliable data transfer protocol optimized for high-bandwidth wide-area networks. GridFTP, however, assumes the support of a high-performance parallel file system, a relatively expensive resource. FreeLoader is a storage system that aggregates idle storage space from workstations connected within a local area network to build a low-cost, yet high-performance data store. FreeLoader breaks files into chunks and distributes them across storage nodes, thus enabling parallel access to files. A central manager keeps track of file metadata as well as the location of the chunks associated with each file.

This project will integrate the Globus project's GridFTP implementation with FreeLoader to reduce the cost and increase the performance of GridFTP deployments. The integration of these two open-source systems will address the following main problems. First, FreeLoader storage nodes will be exposed as GridFTP data transfer processes (DTPs). Second, the assumption the current GridFTP implementation makes, namely that all data is available at all DTPs, will be relaxed by integrating the GridFTP server with FreeLoader's data location mechanisms. Finally, load balancing mechanisms will be added to the GridFTP server implementation to match FreeLoader's ability to stripe and replicate data across multiple nodes.

Because files at a FreeLoader-backed GridFTP site are spread over multiple FreeLoader nodes (exposed as GridFTP DTPs), the integration will need to orchestrate multiple connections to execute a file transfer. The current implementation of GridFTP cannot handle cases where the number of DTPs on the receiver and sender side does not match. For instance, for a single client accessing a GridFTP server with N DTPs, the server will need to manage the connections such that only one server DTP connects to the client at a time. Once the first DTP has finished its transfer, the second server DTP will connect to the client and execute its transfer. The same mechanism can be applied to cases where the client-to-server node relationship is N-to-M. This mechanism has two advantages: the load is balanced across the DTP nodes, and it supports FreeLoader's data layout, in which file chunks are stored on each DTP.

The design and implementation requirements are: code changes to either system should be minimal, to enable future integration with the mainstream GridFTP/FreeLoader code; GridFTP client code changes should be avoided; and the integration components should be implemented in C, since both FreeLoader and GridFTP are implemented in C.
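The N-to-1 case above amounts to a simple hand-off schedule: the server-side controller gives the single client connection to one DTP at a time, and each DTP sends the chunks FreeLoader has placed on it before passing the connection on. Below is a minimal, illustrative sketch of that schedule in C; the types and helpers (dtp_t, send_chunk, the hard-coded chunk layout) are hypothetical stand-ins, not the actual GridFTP DSI or FreeLoader manager interfaces.

```c
/*
 * Illustrative sketch only: sequential hand-off of a single client
 * connection across N server-side DTPs (FreeLoader nodes).
 * All names (dtp_t, send_chunk) and the chunk layout are hypothetical;
 * the real integration would use the GridFTP server interfaces.
 */
#include <stdio.h>

#define N_DTPS   3          /* FreeLoader nodes exposed as DTPs      */
#define N_CHUNKS 8          /* chunks of the file being transferred  */

typedef struct {
    int id;
    int chunks[N_CHUNKS];   /* chunk ids held by this node           */
    int n_chunks;
} dtp_t;

/* Pretend to push one chunk over the (single) client connection. */
static void send_chunk(int dtp_id, int chunk_id)
{
    printf("DTP %d -> client: chunk %d\n", dtp_id, chunk_id);
}

int main(void)
{
    /* FreeLoader's manager would supply this layout; here it is fixed. */
    dtp_t dtps[N_DTPS] = {
        { 0, {0, 3, 6}, 3 },
        { 1, {1, 4, 7}, 3 },
        { 2, {2, 5},    2 },
    };

    /* One DTP holds the client connection at a time; when it has sent
     * all of its chunks, the connection is handed to the next DTP.   */
    for (int i = 0; i < N_DTPS; i++) {
        printf("DTP %d connects to the client\n", dtps[i].id);
        for (int j = 0; j < dtps[i].n_chunks; j++)
            send_chunk(dtps[i].id, dtps[i].chunks[j]);
        printf("DTP %d disconnects\n", dtps[i].id);
    }
    return 0;
}
```

Extending this to the N-to-M case would mean assigning each server DTP to one of the M client DTPs (for example, round-robin) instead of to a single shared connection, which is where the load balancing mentioned above comes in.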
Tuesday, May 20, 2008
Google Summer of Code project: Integration of GridFTP with Freeloader storage system
Computational Biology On The Grid: Decoupling Computation And I/O With ParaMEDIC
CCT talk:
http://www-unix.mcs.anl.gov/~balaji/#publications
Abstract:
Many large-scale computational biology applications simultaneously rely on multiple resources for efficient execution. For example, such applications may require both large compute and storage resources; however, very few supercomputing centers can provide large quantities of both. Thus, data generated at the compute site oftentimes has to be moved to a remote storage site for either storage or visualization and analysis. Clearly, this is not an efficient model, especially when the two sites are distributed over a Grid. In this talk, I'll present a framework called "ParaMEDIC: Parallel Metadata Environment for Distributed I/O and Computing", which uses application-specific semantic information to convert the generated data to orders-of-magnitude smaller metadata at the compute site, transfer the metadata to the storage site, and re-process the metadata at the storage site to regenerate the output. Specifically, ParaMEDIC trades a small amount of additional computation (in the form of data post-processing) for a potentially significant reduction in data that needs to be transferred in distributed environments. The ParaMEDIC framework allowed us to use nine different supercomputers distributed within the U.S. to sequence-search the entire microbial genome database against itself and store the one petabyte of generated data in Tokyo, Japan.
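To make the trade-off concrete, here is a toy sketch of the same compute-site/storage-site split: only a few small metadata records cross the "wire", and the bulky output records are regenerated at the destination. This is not ParaMEDIC code; the record layout, the match_meta_t type, and the regenerate step are invented for illustration.

```c
/*
 * Toy illustration of the ParaMEDIC idea: ship small metadata instead of
 * bulky output, and regenerate the output at the storage site.
 * All names and the "alignment" format are invented for this example.
 */
#include <stdio.h>

typedef struct { int query; int subject; int score; } match_meta_t;

/* Regenerate a full output record from its metadata. This post-processing
 * is the extra computation traded for a much smaller transfer.           */
static void regenerate_record(const match_meta_t *m, char *buf, size_t len)
{
    snprintf(buf, len,
             "query %d matches subject %d (score %d) ...full alignment...",
             m->query, m->subject, m->score);
}

int main(void)
{
    /* Compute site: run the search and keep only the metadata. */
    match_meta_t meta[3] = { {1, 42, 97}, {2, 7, 88}, {5, 13, 91} };
    size_t meta_bytes = sizeof meta;   /* what actually crosses the WAN */

    printf("transferring %zu bytes of metadata instead of full output\n",
           meta_bytes);

    /* Storage site: re-process the metadata to recreate the full records. */
    char record[128];
    for (size_t i = 0; i < 3; i++) {
        regenerate_record(&meta[i], record, sizeof record);
        printf("%s\n", record);
    }
    return 0;
}
```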
GENI - The Global Environment For Networking Innovations Project
CCT talk: Chip Elliott, BBN Technologies, GENI PI/PD/Chief Engineer
http://www.geni.net/
Abstract:
GENI is an experimental facility called the Global Environment for Network Innovation. GENI is designed to allow experiments on a wide variety of problems in communications, networking, distributed systems, cyber-security, and networked services and applications. The emphasis is on enabling researchers to experiment with radical network designs in a way that is far more realistic than is possible today. Researchers will be able to build their own new versions of the “net” or to study the “net” in ways that are not possible today. Compatibility with the Internet is NOT required. The purpose of GENI is to give researchers the opportunity to experiment unfettered by assumptions or requirements and to support those experiments at a large scale with real user populations.
GENI is being proposed to NSF as a Major Research and Equipment Facility Construction (MREFC) project. The MREFC program is NSF’s mechanism for funding large infrastructure projects. NSF has funded MREFC projects in a variety of fields, such as the Laser Interferometer Gravitational Wave Observatory (LIGO), but GENI would be the first MREFC project initiated and designed by the computer science research community.
Friday, May 16, 2008
MOPS (Managed Object Placement Service)
MOPS (Managed Object Placement Service) is an enhancement to the Globus GridFTP server that allows you to manage some of the resources needed for data transfers more efficiently.
MOPS 0.1 release includes the following:
- GFork - This is a service like inetd that listens on a TCP port and runs a configurable executable in a child process whenever a connection is made. GFork also creates bi-directional pipes between the child processes and the master service. These pipes are used for interprocess communication between the child process executables and a master process plugin. More information on GFork can be found here. A rough sketch of this fork-with-pipes pattern appears after this list.
- GFork master plugin for GridFTP - This master plugin provides enhanced functionality such as dynamic backend registration for striped servers, managed system memory pools, and internal data monitoring for both striped and non-striped servers. More information on the GridFTP master plugin and information on how to run the Globus GridFTP server with GFork can be found here.
- Storage usage enforcement using Lotman - All data sent to a Lotman-enabled GridFTP server and written to the Lotman root directory will be managed by Lotman. Information on how to configure Lotman and run it with the Globus GridFTP server can be found here.
- Pipelining data transfer commands - GridFTP is a command-response protocol: a client sends one command and then waits for a "Finished" response before sending another. Incurring this overhead on a per-file basis makes performance suffer for large data sets partitioned into many small files. Pipelining allows the client to have many outstanding, unacknowledged transfer commands at once. Instead of being forced to wait for the "Finished" response, the client is free to send transfer commands at any time. Pipelining is enabled by using the -pp option with globus-url-copy.
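For readers unfamiliar with the inetd-style pattern GFork follows, the sketch below shows the basic accept/fork/pipe mechanics in C: the master listens on a TCP port, forks a child per connection, hands the child the connection on stdin/stdout, and keeps a pipe pair for master-child communication. It is only a rough illustration of the pattern, not GFork source; the port, the one-message exchange, and the commented-out exec of a child executable are invented, and error handling is omitted.

```c
/* Rough sketch of an inetd-like fork-with-pipes master. Not GFork source. */
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int main(void)
{
    int listener = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = { 0 };
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(5000);                 /* arbitrary example port */

    bind(listener, (struct sockaddr *)&addr, sizeof addr);
    listen(listener, 8);

    for (;;) {
        int conn = accept(listener, NULL, NULL);
        if (conn < 0)
            break;

        int to_child[2], to_master[2];           /* bi-directional link   */
        pipe(to_child);
        pipe(to_master);

        pid_t pid = fork();
        if (pid == 0) {
            /* Child: keep its pipe ends, take over the connection, and
             * (in the real service) exec the configured executable.      */
            close(to_child[1]);
            close(to_master[0]);
            dup2(conn, STDIN_FILENO);
            dup2(conn, STDOUT_FILENO);
            /* execl("/path/to/child", "child", (char *)NULL); */
            const char msg[] = "child ready\n";
            write(to_master[1], msg, sizeof msg - 1);
            _exit(0);
        }

        /* Master: in GFork a plugin would keep these pipe ends open for
         * ongoing communication; here we just read one message.          */
        close(to_child[0]);
        close(to_master[1]);
        close(conn);
        char buf[64];
        ssize_t n = read(to_master[0], buf, sizeof buf - 1);
        if (n > 0) {
            buf[n] = '\0';
            printf("from child %d: %s", (int)pid, buf);
        }
        close(to_child[1]);
        close(to_master[0]);
    }
    return 0;
}
```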
Wednesday, May 14, 2008
GRIDS Lab Topics Related Theses/Dissertations World-Wide
- Daniel Colin Vanderster, Resource Allocation and Scheduling Strategies on Computational Grids, Ph.D. Thesis, University of Victoria, February 2008.
- SungJin Choi, Group-based Adaptive Scheduling Mechanism in Desktop Grid, Ph.D. Thesis, Korea University, June 2007.
- Bjorn Schnizler, Resource Allocation in the Grid : A Market Engineering Approach, Ph.D. Thesis, Karlsruhe University, 2007.
- Dang Minh Quan, A Framework For SLA-Aware Execution of Grid-Based Workflows, Ph.D. Thesis, Informatik und Mathematik der Universitaet, November 2006.
- Tummalapalli Sudhamsh Reddy, Bridging Two Grids: The SAM-Grid / LCG integration Project, Thesis of Master in Computing Science, The University of Texas, Arlington, May 2006.
- Flavia Donno, Storage Management and Access in WLHC Computing Grid, Ph.D. Thesis, University of Pisa, 2006.
- Patricia Kayser Vargas Mangan, GRAND: A Model for Hierarchical Application Management in Grid Computing Environment, Ph.D. Thesis, COPPE/Federal University of Rio de Janeiro, Brazil, March 2006.
- Anoop Rajendra, Integration of the SAM-Grid Infrastructure to the DZero Data Reprocessing Effort, Thesis of Master in Computing Science, The University of Texas, Arlington, December 2005.
- Bimal Balan, Enhancements to the SAM-Grid Infrastructure, Thesis of Master in Computing Science, The University of Texas, Arlington, December 2005.
- Tevfik Kosar, Data Placement in Widely Distributed Systems, Ph.D. Thesis, University of Wisconsin-Madison, August 2005.
- Aditya Nishandar, Grid-Fabric Interface For Job Management In Sam-Grid, A Distributed Data Handling And Job Management System For High Energy Physics Experiments, Thesis of Master in Computing Science, The University of Texas, Arlington, December 2004.
- Sankalp Jain, Abstracting the heterogeneities of computational resources in the SAM-Grid to enable execution of high energy physics applications, Thesis of Master in Computing Science, The University of Texas, Arlington, December 2004.
- Gurmeet Singh Manku, Dipsea: A Modular Distributed Hash Table, Ph.D. Thesis, Stanford University, August 2004.
- Adriana Ioana Iamnitchi, Resource Discovery in Large Resource-Sharing Environments, Ph.D. Dissertation, The University of Chicago, December 2003.
- Michal Karczmarek, Constrained and Phased Scheduling of Synchronous Data Flow Graphs for StreamIt Language, Thesis of Master of Science in CS, Massachusetts Institute of Technology (MIT), USA, December 2002.
- Abhishek S. Rana, A globally-distributed grid monitoring system to facilitate HPC at D0/SAM-Grid (Design, development, implementation and deployment of a prototype), Thesis of Master in Computing Science, The University of Texas, Arlington, Nov. 2002.
- Akiko Nakaniwa, Optimal Design of System Resource Management for Distributed Networks, Ph.D. Dissertation, Department of Electronics Engineering, Faculty of Engineering, Kansai University, March 2002.
- Rajkumar Buyya, Economic-based Distributed Resource Management and Scheduling for Grid Computing, Ph.D. Thesis, Monash University, Melbourne, Australia, April 12, 2002.
- Carlos A. Varela, Worldwide Computing with Universal Actors: Linguistic Abstractions for Naming, Migration, and Coordination, Ph.D. Thesis, University of Illinois at Urbana-Champaign, 2001.
- Heinz Stockinger, Database Replication in World-Wide Distributed Data Grids, Institute of Computer Science and Business Informatics, University of Vienna, Austria, November 2001.
- Heinz Stockinger, Multi-Dimensional Bitmap Indices for Optimising Data Access within Object Oriented Databases at CERN, University of Vienna, Austria, November 2001.
- Luis F.G. Sarmenta, Volunteer Computing, Ph.D. Thesis, Massachusetts Institute of Technology (MIT), USA, June 2001.
- Jonathan Bredin, Market-based Control of Mobile Agents, Ph.D. Thesis, Dartmouth College, Hanover, NH, USA, June 2001.
- Andrea Carol Arpaci-Dusseau, Implicit Coscheduling: Coordinated Scheduling with Implicit Information in Distributed Systems, PhD thesis, University of California at Berkeley, December 1998.
- Daniel M. Zimmerman, A Preliminary Investigation into Dynamic Distributed Workflow,M.S. Thesis, California Institute of Technology, May 1998.
Sunday, May 11, 2008
GROMACS Flow Chart
VERSION 3.3
This is a flow chart of a typical GROMACS MD run of a protein in a box of water. A more detailed example is available in the Getting Started section. Several steps of energy minimization may be necessary; these consist of grompp -> mdrun cycles.
eiwit.pdb
  -> Generate a GROMACS topology: pdb2gmx -> conf.gro, topol.top
  -> Enlarge the box: editconf -> conf.gro
  -> Solvate protein: genbox -> conf.gro, topol.top
  -> Generate mdrun input file: grompp (with grompp.mdp) -> topol.tpr
     (continuation runs: tpbconv with traj.trr)
  -> Run the simulation (EM or MD): mdrun -> traj.xtc, ener.edr
  -> Analysis: g_... / ngmx, g_energy
Computational Biomolecular Dynamics Group
We carry out computer simulations of biological macromolecules to study the relationship between dynamics and function.
Using molecular dynamics simulations and other computational tools we predict the dynamics and flexibility of proteins, membranes, carbohydrates and polynucleotides to study biological function and dysfunction at the atomic level.