Wednesday, July 16, 2008

RADIANT Lab - Research on TCP tuning

  • RADIANT Lab @LANL


http://public.lanl.gov/radiant/pubs.html


  • Optimizing GridFTP Through Dynamic Right-Sizing.
    S. Thulasidasan, W. Feng, and M. Gardner
    IEEE Symposium on High-Performance Distributed Computing. (HPDC-12/2003), Seattle WA, June 2003.
    [LA-UR 03-2486] (8 pages - postscript, 339K; PDF, 267K)

  • Routing and Scheduling Large File Transfers over Lambda Grids.
    A. Banerjee, W. Feng, D. Ghosal, and B. Mukherjee
    The 3rd International Workshop on Protocols for Fast Long-Distance Networks (PFLDnet'05), Lyon, France, February 2005.
    [LA-UR 05-7911] (5 pages - PDF, 576K)
  • Scheduling and Transport for File Transfers on High-Speed Optical Circuits.
    M. Veeraraghavan, X. Zheng, W. Feng, H. Lee, E. Chong, and H. Li
    Journal of Grid Computing, Vol. 1, No. 4, June 2004.
    [LA-UR 04-2008] (18 pages - PDF, 195K)



Dynamic Right-Sizing (DRS) Software Distribution

This is the official distribution site for DRS.

Dynamic Right-Sizing provides automatic tuning of TCP flow-control windows to support high bandwidth over high-latency (WAN) links. It improves TCP throughput by orders of magnitude over links with a large bandwidth-delay product. It also keeps windows small for low-bandwidth and low-latency connections so they don't consume unnecessary amounts of memory.

There are two versions of DRS. The first is implemented in the Linux kernel and the second is implemented in user-space applications.

DRS Kernel-Space Implementation

The modifications to the Linux kernel are being distributed as a patch file under the GNU General Public License. If the GPL is too restrictive, please contact us.

Once you have the source for the correct version of the Linux kernel, download and apply the appropriate DRS patch file for your kernel. (The patch may also apply against a different version of the kernel, but you may have to patch some of the kernel files by hand.) For a synopsis of how to install the DRS kernel patch and rebuild your kernel, see the DRS Installation Instructions. We regret that we are unable to assist with DRS installation questions. However, we would like to know if you find (and fix) a bug or make enhancements so they can be incorporated into future releases.

DRS User-Space Applications

For some, installing DRS in the kernel can be a little daunting. As an interim solution until vendors install DRS by default, we are also providing DRS in user-space applications.

Unlike in the kernel version of DRS, the information necessary to implement DRS is not directly available to user-space applications and hence must be synthesized. As a result, DRS in kernel-space performs better than DRS in user-space. Even so, DRS in user-space provides a dramatic performance improvement over traditional applications.
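
As a rough sketch of the flavor of this approach (a simplified Python illustration of my own, not the published DRS algorithm or its actual code), a user-space receiver can estimate bandwidth from how much data it read over an interval, multiply by a measured RTT, and grow its receive buffer toward that bandwidth-delay product:

import socket

def right_size(sock, rtt_s, bytes_received, interval_s, max_buf=6553500):
    """Grow SO_RCVBUF toward 2 * (estimated bandwidth * RTT), capped at max_buf.

    Simplified illustration only: kernel DRS sees per-packet timing directly,
    while an application has to synthesize these estimates itself.
    """
    bandwidth = bytes_received / interval_s              # bytes per second
    target = min(int(2 * bandwidth * rtt_s), max_buf)    # 2x BDP for headroom
    current = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    if target > current:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, target)
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)

# In a real transfer this would be called periodically from the receive loop;
# here it is only demonstrated on an unconnected socket.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
print(right_size(s, rtt_s=0.1, bytes_received=5_000_000, interval_s=1.0))

Note that Linux clamps SO_RCVBUF to net.core.rmem_max, which is one reason the maximum buffer limits described below must also be raised.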

Another disadvantage of DRS in user-space is that each application (or set of related applications) requires modification to implement the DRS algorithm. The following applications have been modified to support DRS.

  • FTP client and server (Coming soon.)

Besides the applications themselves, the maximum buffer space Linux allows will need to be increased so DRS has room to work.


DRS Kernel-Space Installation Instructions

  • Download an official kernel release from www.kernel.org or your favorite mirror. (Source packages from some Linux distributions, such as Debian, also work.)
  • Uncompress and untar the kernel source in an appropriate location (usually /usr/src).
  • Download the appropriate DRS patch.
  • Patch the kernel by entering the kernel source directory and running: patch -p1 < [path to DRS patch]
  • Configure, build and install the kernel in the usual way. (For an example, see the MAGNET distribution page.)
  • Increase the maximum buffer limits so DRS has room to work.

Tuning Linux Buffer Limits

Besides installing DRS, you'll want to tune the maximum buffer sizes. (Do not change the minimum and default sizes.) This is done by writing to the /proc file system as root:

echo 6553500 > /proc/sys/net/core/wmem_max
echo 6553500 > /proc/sys/net/core/rmem_max
echo 4096 16384 6553500 > /proc/sys/net/ipv4/tcp_wmem
echo 8192 87380 6553500 > /proc/sys/net/ipv4/tcp_rmem
echo 6553500 6653500 6753500 > /proc/sys/net/ipv4/tcp_mem

Data Compression


D. A. Lelewer and D. S. Hirschberg, "Data Compression," Computing Surveys 19(3) (1987), 261-297. Reprinted in Japanese BIT Special Issue in Computer Science (1989), 165-195. (in HTML)

Debra A. Lelewer and Daniel S. Hirschberg

Table of Contents

Abstract
INTRODUCTION
1. FUNDAMENTAL CONCEPTS
1.1 Definitions
1.2 Classification of Methods
1.3 A Data Compression Model
1.4 Motivation
2. SEMANTIC DEPENDENT METHODS
3. STATIC DEFINED-WORD SCHEMES
3.1 Shannon-Fano Coding
3.2 Static Huffman Coding
3.3 Universal Codes and Representations of the Integers
3.4 Arithmetic Coding
4. ADAPTIVE HUFFMAN CODING
4.1 Algorithm FGK
4.2 Algorithm V
5. OTHER ADAPTIVE METHODS
5.1 Lempel-Ziv Codes
5.2 Algorithm BSTW
6. EMPIRICAL RESULTS
7. SUSCEPTIBILITY TO ERROR
7.1 Static Codes
7.2 Adaptive Codes
8. NEW DIRECTIONS
9. SUMMARY
REFERENCES

Abstract

This paper surveys a variety of data compression methods spanning almost forty years of research, from the work of Shannon, Fano and Huffman in the late 40's to a technique developed in 1986. The aim of data compression is to reduce redundancy in stored or communicated data, thus increasing effective data density. Data compression has important application in the areas of file storage and distributed systems.

Concepts from information theory, as they relate to the goals and evaluation of data compression methods, are discussed briefly. A framework for evaluation and comparison of methods is constructed and applied to the algorithms presented. Comparisons of both theoretical and empirical natures are reported and possibilities for future research are suggested.

INTRODUCTION

Data compression is often referred to as coding, where coding is a very general term encompassing any special representation of data which satisfies a given need. Information theory is defined to be the study of efficient coding and its consequences, in the form of speed of transmission and probability of error [Ingels 1971]. Data compression may be viewed as a branch of information theory in which the primary objective is to minimize the amount of data to be transmitted. The purpose of this paper is to present and analyze a variety of data compression algorithms.

A simple characterization of data compression is that it involves transforming a string of characters in some representation (such as ASCII) into a new string (of bits, for example) which contains the same information but whose length is as small as possible. Data compression has important application in the areas of data transmission and data storage. Many data processing applications require storage of large volumes of data, and the number of such applications is constantly increasing as the use of computers extends to new disciplines. At the same time, the proliferation of computer communication networks is resulting in massive transfer of data over communication links. Compressing data to be stored or transmitted reduces storage and/or communication costs. When the amount of data to be transmitted is reduced, the effect is that of increasing the capacity of the communication channel. Similarly, compressing a file to half of its original size is equivalent to doubling the capacity of the storage medium. It may then become feasible to store the data at a higher, thus faster, level of the storage hierarchy and reduce the load on the input/output channels of the computer system.
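
To make that characterization concrete, here is a toy Python sketch (my own illustration, not code from the paper or from the utilities discussed below) of static Huffman coding (Section 3.2): frequent characters receive short bit strings and rare ones receive long bit strings, so the coded message is shorter than a fixed-length 8-bit encoding.

import heapq
from collections import Counter

def huffman_code(text):
    """Build a Huffman code (symbol -> bit string) for the given text."""
    freq = Counter(text)
    # Each heap entry: (weight, tie-breaker, {symbol: code-so-far}).
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    if len(heap) == 1:                     # degenerate case: one distinct symbol
        return {sym: "0" for sym in heap[0][2]}
    tie = len(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)    # two least-frequent subtrees
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]

text = "this is an example of a huffman tree"
codes = huffman_code(text)
encoded = "".join(codes[ch] for ch in text)
print("fixed-length (8-bit) size:", 8 * len(text), "bits")
print("Huffman-coded size:       ", len(encoded), "bits")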

Many of the methods to be discussed in this paper are implemented in production systems. The UNIX utilities compact and compress are based on methods to be discussed in Sections 4 and 5 respectively [UNIX 1984]. Popular file archival systems such as ARC and PKARC employ techniques presented in Sections 3 and 5 [ARC 1986; PKARC 1987]. The savings achieved by data compression can be dramatic; reduction as high as 80% is not uncommon [Reghbati 1981]. Typical values of compression provided by compact are: text (38%), Pascal source (43%), C source (36%) and binary (19%). Compress generally achieves better compression (50-60% for text such as source code and English), and takes less time to compute [UNIX 1984]. Arithmetic coding (Section 3.4) has been reported to reduce a file to anywhere from 12.1 to 73.5% of its original size [Witten et al. 1987]. Cormack reports that data compression programs based on Huffman coding (Section 3.2) reduced the size of a large student-record database by 42.1% when only some of the information was compressed. As a consequence of this size reduction, the number of disk operations required to load the database was reduced by 32.7% [Cormack 1985]. Data compression routines developed with specific applications in mind have achieved compression factors as high as 98% [Severance 1983].

While coding for purposes of data security (cryptography) and codes which guarantee a certain level of data integrity (error detection/correction) are topics worthy of attention, these do not fall under the umbrella of data compression. With the exception of a brief discussion of the susceptibility to error of the methods surveyed (Section 7), a discrete noiseless channel is assumed. That is, we assume a system in which a sequence of symbols chosen from a finite alphabet can be transmitted from one point to another without the possibility of error. Of course, the coding schemes described here may be combined with data security or error correcting codes.

Much of the available literature on data compression approaches the topic from the point of view of data transmission. As noted earlier, data compression is of value in data storage as well. Although this discussion will be framed in the terminology of data transmission, compression and decompression of data files for storage is essentially the same task as sending and receiving compressed data over a communication channel. The focus of this paper is on algorithms for data compression; it does not deal with hardware aspects of data transmission. The reader is referred to Cappellini for a discussion of techniques with natural hardware implementation [Cappellini 1985].

Background concepts in the form of terminology and a model for the study of data compression are provided in Section 1. Applications of data compression are also discussed in Section 1, to provide motivation for the material which follows.

While the primary focus of this survey is data compression methods of general utility, Section 2 includes examples from the literature in which ingenuity applied to domain-specific problems has yielded interesting coding techniques. These techniques are referred to as semantic dependent since they are designed to exploit the context and semantics of the data to achieve redundancy reduction. Semantic dependent techniques include the use of quadtrees, run length encoding, or difference mapping for storage and transmission of image data [Gonzalez and Wintz 1977; Samet 1984].

General-purpose techniques, which assume no knowledge of the information content of the data, are described in Sections 3-5. These descriptions are sufficiently detailed to provide an understanding of the techniques. The reader will need to consult the references for implementation details. In most cases, only worst-case analyses of the methods are feasible. To provide a more realistic picture of their effectiveness, empirical data is presented in Section 6. The susceptibility to error of the algorithms surveyed is discussed in Section 7 and possible directions for future research are considered in Section 8.

Research on TCP tuning - buffer size - parallel streams



Tom Dunigan's Home Page

http://www.csm.ornl.gov/~dunigan/


TCP auto-tuning (buffer size)
http://www.csm.ornl.gov/~dunigan/netperf/auto.html

  • Dynamic right-size (DRS)
  • Net100
  • Web100 research (using the latency-bandwidth product to calculate buffer size)
  • Automatic TCP buffer tuning
  • Linux 2.4 and auto tuning
  • slow start
  • WAD *

Parallel streams
http://www.csm.ornl.gov/~dunigan/netperf/parallel.html



LANL: Weigle and Feng, "A Comparison of TCP Automatic Tuning Techniques for Distributed Computing," 2002.

Friday, July 11, 2008

Stream Computing: A New Paradigm To Gain Insight and Value

www.almaden.ibm.com/institute/resources/2008/presentations/Halim.ppt


HPCwire >> Features

IBM Looks to Tap Massive Data Streams

...
And out of the box it will come. Although this is still a research project and there remains much work to be done before the product is shrink wrapped, Halim and his team are motivated by an expansive view that "stream computing is not just a new computing model, it is a new scientific instrument."

Thursday, July 10, 2008

parallel NetCDF (scoop's format) and FastBit

http://www.scidacreview.org/0602/html/data.html

FastBit: Indexing for Fast Searches

In data mining and analyses, the process of quickly isolating important information from much larger pools of data is critical. FastBit is a software package capable of extremely fast searches of large databases. During a series of head-to-head trials, FastBit considerably outperformed a leading database management system. Searching a dataset composed of 250,000 email messages, FastBit handled queries between ten and one thousand times faster than the popular commercial software.
Bitmaps are sequences of bits, basic yes/no units of information represented by 1 or 0, a computationally practical representation. A bitmap index is a set of bit sequences that represent information about certain indexed attributes. Because FastBit uses bitmap indices, user queries can be addressed by bitwise logical operations, which computer hardware systems generally handle quite efficiently. However, scientific applications often involve indices containing information about a large number of bitmaps, and such bitmap indices demand impractical storage requirements. Schemes that compress index files can reduce space requirements, but compression can also slow down search methods. To maximize FastBit performance, researchers had to optimize this tradeoff between storage space and speed. Using the Word-Aligned Hybrid (WAH) compression method, FastBit achieves this functional balance. Bitmap indices compressed by the WAH scheme are a little larger than indices compressed by other methods, but WAH-compressed indices can be queried much faster because they can be searched without being fully decompressed.
SDM researchers at LBNL developed both the FastBit software package and the WAH compression scheme it employs. A number of SciDAC-supported projects, including the STAR experiment (sidebar, p35) and combustion research (figure 3, p32), have benefited from the impressive power of FastBit.
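
As a rough illustration of the bitmap-index idea (a toy Python sketch of my own, using uncompressed bitmaps rather than FastBit's WAH-compressed ones): each distinct value of an indexed attribute gets one bitmap with one bit per row, and a query becomes a few bitwise operations.

# Toy bitmap index over one attribute; Python ints stand in for bitmaps.
rows = ["red", "green", "red", "blue", "green", "red"]

index = {}                  # value -> bitmap (bit i set <=> row i has that value)
for i, value in enumerate(rows):
    index[value] = index.get(value, 0) | (1 << i)

# Query: rows where the value is "red" OR "blue" -> a single bitwise OR.
hits = index["red"] | index["blue"]
print([i for i in range(len(rows)) if hits & (1 << i)])   # [0, 2, 3, 5]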

Thursday, June 19, 2008

plan 9 & 9P

Plan 9 from Bell Labs
http://plan9.bell-labs.com/plan9/
http://plan9.bell-labs.com/wiki/plan9/Papers/index.html

Plan 9 from User Space
http://swtch.com/plan9port/

v9fs
http://swtch.com/v9fs/

9pfuse
http://swtch.com/plan9port/man/man4/9pfuse.html

Plan9 wiki
http://plan9.bell-labs.com/wiki/plan9/plan_9_wiki/index.html

9p
http://en.wikipedia.org/wiki/9P

9p manual
http://www.cs.bell-labs.com/magic/man2html/5/0intro

xcpu
http://www.xcpu.org/


9p implementations

Grifi: GridFTP File System

This program is free software (NO WARRANTY of any kind). I used parts of different projects; since the source code I took is free and my code is free, I do not see any problem, but if you think I am breaking any copyright law please tell me and I will do my best to fix the problem.

The current version can be found at the project home site: http://sourceforge.net/projects/grifi/

-

UberFTP is the first interactive, GridFTP-enabled FTP client. It supports GSI authentication, parallel data channels, and third-party transfers.

http://dims.ncsa.uiuc.edu/set/uberftp/index.html

Tuesday, May 20, 2008

google summer code project: Integration of GridFTP with Freeloader storage system

Title Integration of GridFTP with Freeloader storage system
Student Hesam Ghasemi
Mentor Rajkumar Kettimuthu
Abstract
Scientific experiments produce large volumes of data which require cost-conscious data stores as well as reliable data transfer mechanisms. GridFTP is a high-performance, secure, reliable data transfer protocol optimized for high-bandwidth wide-area networks. GridFTP, however, assumes the support of a high-performance parallel file system, a relatively expensive resource.
FreeLoader is a storage system that aggregates idle storage space from workstations connected within a local area network to build a low-cost, yet high-performance data store. FreeLoader breaks files into chunks and distributes them across storage nodes thus enabling parallel access to files. A central manager keeps track of file meta-data as well as of the location of the chunks associated with each file.
This project will integrate the Globus project’s GridFTP implementation and FreeLoader to reduce the cost and increase the performance of GridFTP deployments. The integration of these two open-source systems will address the following main problems. First, FreeLoader storage nodes will be exposed as GridFTP data transfer processes (DTPs). Second, the assumption the current GridFTP implementation makes, namely that all data is available at all DTPs, will be relaxed by integrating the GridFTP server with FreeLoader’s data-location mechanisms. Finally, load-balancing mechanisms will be added to the GridFTP server implementation to match FreeLoader’s ability to stripe and replicate data across multiple nodes.
The fact that at a FreeLoader-supported GridFTP site files are spread over multiple FreeLoader nodes (exposed as GridFTP DTPs) implies that the integration will need to orchestrate multiple connections to execute a file transfer. The current implementation of GridFTP is not able to handle cases where the number of DTPs on the receiver and sender side does not match. For instance, for a single client accessing a GridFTP server with N DTPs, the server will need to manage the connections so that a single server DTP connects to the client at a time. Once the first DTP has finished its transfer, the second server DTP will connect to the client and execute its transfer. The same mechanism can be applied to cases where the client-server node relationship is N-to-M (a sketch of this pairing idea follows the abstract). This mechanism has the following advantages: the load is balanced between the DTP nodes, and it supports FreeLoader’s data layout, in which file chunks are stored on each DTP.
The design and implementation requirements include: code changes to either system should be minimal to enable future integration with the mainstream GridFTP/FreeLoader code, GridFTP client code changes should be avoided, and the integration components should be implemented in C since both FreeLoader and GridFTP are implemented in C.
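
A rough sketch of the round-based pairing idea described in the abstract (hypothetical Python of my own, not project code): with N sender DTPs and M receiver DTPs, transfers are issued in rounds so that no DTP handles more than one connection at a time.

from collections import deque

def schedule_transfers(sender_dtps, receiver_dtps):
    """Yield rounds of (sender, receiver) pairs; each round runs concurrently,
    and no DTP appears more than once within a round."""
    senders = deque(sender_dtps)
    receivers = list(receiver_dtps)
    while senders:
        round_pairs = []
        while senders and len(round_pairs) < len(receivers):
            round_pairs.append((senders.popleft(), receivers[len(round_pairs)]))
        yield round_pairs

# Example: five server-side DTPs sending to a client that exposes two DTPs.
for pairs in schedule_transfers(["srv-dtp%d" % i for i in range(5)],
                                ["cli-dtp0", "cli-dtp1"]):
    print(pairs)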

Computational Biology On The Grid: Decoupling Computation And I/O With ParaMEDIC



Pavan Balaji, Argonne National Laboratory
Postdoctoral Researcher

CCT talk:

http://www-unix.mcs.anl.gov/~balaji/#publications

Abstract:

Many large-scale computational biology applications simultaneously rely on multiple resources for efficient execution. For example, such applications may require both large compute and storage resources; however, very few supercomputing centers can provide large quantities of both. Thus, data generated at the compute site oftentimes has to be moved to a remote storage site for either storage or visualization and analysis. Clearly, this is not an efficient model, especially when the two sites are distributed over a Grid. In this talk, I'll present a framework called "ParaMEDIC: Parallel Metadata Environment for Distributed I/O and Computing'' which uses application-specific semantic information to convert the generated data to orders-of-magnitude smaller metadata at the compute site, transfer the metadata to the storage site, and re-process the metadata at the storage site to regenerate the output. Specifically, ParaMEDIC trades a small amount of additional computation (in the form of data post-processing) for a potentially significant reduction in data that needs to be transferred in distributed environments. The ParaMEDIC framework allowed us to use nine different supercomputers distributed within the U.S. to sequence-search the entire microbial genome database against itself and store the one petabyte of generated data at Tokyo, Japan.

GENI - The Global Environment For Networking Innovations Project


CCT talks: Chip Elliott, BBN Technologies And GENI PI/PD/Chief Engineer


http://www.geni.net/

Abstract:


GENI is an experimental facility called the Global Environment for Network Innovation. GENI is designed to allow experiments on a wide variety of problems in communications, networking, distributed systems, cyber-security, and networked services and applications. The emphasis is on enabling researchers to experiment with radical network designs in a way that is far more realistic than is possible today. Researchers will be able to build their own new versions of the “net” or to study the “net” in ways that are not possible today. Compatibility with the Internet is NOT required. The purpose of GENI is to give researchers the opportunity to experiment unfettered by assumptions or requirements and to support those experiments at a large scale with real user populations.

GENI is being proposed to NSF as a Major Research and Equipment Facility Construction (MREFC) project. The MREFC program is NSF’s mechanism for funding large infrastructure projects. NSF has funded MREFC projects in a variety of fields, such as the Laser Interferometer Gravitational Wave Observatory (LIGO), but GENI would be the first MREFC project initiated and designed by the computer science research community.

Friday, May 16, 2008

MOPS (Managed Object Placement Service)

http://www.globus.org/toolkit/data/gridftp/mops.html

MOPS (Managed Object Placement Service) is an enhancement to the Globus GridFTP server that allows you to manage some of the resources needed for the data transfer in a more efficient way.

MOPS 0.1 release includes the following:

  1. GFork - This is a service like inetd that listens on a TCP port and runs a configurable executable in a child process whenever a connection is made. GFork also creates bi-directional pipes between the child processes and the master service. These pipes are used for interprocess communication between the child process executables and a master process plugin. More information on GFork can be found here.

  2. GFork master plugin for GridFTP - This master plugin provides enhanced functionality such as dynamic backend registration for striped servers, managed system memory pools and internal data monitoring for both striped and non striped servers. More information on the GridFTP master plugin and information on how to run the Globus GridFTP server with GFork can be found here.

  3. Storage usage enforcement using Lotman - All data sent to a Lotman-enabled GridFTP server and written to the Lotman root directory will be managed by Lotman. Information on how to configure Lotman and run it with the Globus GridFTP server can be found here.

  4. Pipelining data transfer commands - GridFTP is a command response protocol. A client sends one command and then waits for a "Finished response" before sending another. Adding this overhead on a per-file basis for a large data set partitioned into many small files makes the performance suffer. Pipelining allows the client to have many outstanding, unacknowledged transfer commands at once. Instead of being forced to wait for the "Finished response" message, the client is free to send transfer commands at any time. Pipelining is enabled by using the -pp option with globus-url-copy.
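
For example, a hypothetical client-side invocation might look like the following sketch (endpoints are placeholders; only the -pp option is taken from the description above, and the -r recursive flag is an assumption to check against your globus-url-copy version):

import subprocess

# Copy a directory of many small files with pipelining enabled (-pp).
# Hypothetical endpoints; -r (recursive) is an assumption to verify.
subprocess.run(
    ["globus-url-copy", "-pp", "-r",
     "gsiftp://source.example.org/data/many_small_files/",
     "gsiftp://dest.example.org/incoming/"],
    check=True,
)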

Wednesday, May 14, 2008

GRIDS Lab Topics Related Thesis/Dissertations World-Wide

http://www.gridbus.org/grids_thesis.html


Sunday, May 11, 2008

GROMACS Flow Chart

Main Table of Contents

VERSION 3.3
Thu 11 May 2006


This is a flow chart of a typical GROMACS MD run of a protein in a box of water. A more detailed example is available in the Getting Started section. Several steps of energy minimization may be necessary; these consist of grompp -> mdrun cycles.

eiwit.pdb
  -> pdb2gmx (generate a GROMACS topology) -> conf.gro + topol.top
  -> editconf (enlarge the box) -> conf.gro
  -> genbox (solvate the protein) -> conf.gro + topol.top
  -> grompp (generate the mdrun input file, using grompp.mdp) -> topol.tpr
  -> mdrun (run the simulation, EM or MD) -> traj.xtc + ener.edr
  -> analysis tools (g_..., g_energy, ngmx)

Continuation: tpbconv combines topol.tpr and traj.trr into a new run input file so the simulation can be continued with mdrun.


Computational Biomolecular Dynamics Group

http://www.mpibpc.mpg.de/groups/de_groot/


We carry out computer simulations of biological macromolecules to study the relationship between dynamics and function.

Using molecular dynamics simulations and other computational tools we predict the dynamics and flexibility of proteins, membranes, carbohydrates and polynucleotides to study biological function and dysfunction at the atomic level.






GROMACS

http://wiki.gromacs.org/index.php/Main_Page

The 5 latest News
GROMACS 3.3.3 released
Friday, 29 February 2008
It is a pleasure to announce the immediate availability of GROMACS 3.3.3, the latest stable version. Please check the download section in order to find the source code and a limited set of binaries. More binaries will probably be released in the coming days. Please check revision information here .
Stanford Workshop
Thursday, 21 February 2008
The workshop in Stanford was fully booked a few days after registration opened. Stay tuned for a workshop in Europe this fall.
RSS Feed activated
Wednesday, 13 February 2008
You can now subscribe to the latest news from the GROMACS website using an RSS feed. The address to use is http://www.gromacs.org/index2.php?option=com_rss&no_html=1 .
120,000 Downloads
Sunday, 03 February 2008
Since May 9, 2004, GROMACS packages have been downloaded 120,000 times from our server. This means that there were more than 2500 downloads each month, including, obviously, people who download newer versions, of which there have been 3 since that date.
GROMACS 4 Paper
Saturday, 02 February 2008
It is with great pleasure that I announce that the paper about GROMACS 4 is now available on-line at the Journal of Chemical Theory and Computation . Processing by the journal was so fast that we did not manage to have the software ready for release, but it will be announced this spring.

Systems Biology

http://www.systemsbiology.org/technology/Data_Visualization_and_Analysis/Human_Proteome_Folding_Project



HUMAN PROTEOME FOLDING PROJECT
Overview

The Human Proteome Folding Project will use the computing power of millions of computers to predict the shapes of human proteins about which researchers currently know little. From these shapes scientists hope to learn about the function of these proteins, as the shape of a protein is inherently related to how it functions in our bodies. This database of protein structures and putative functions will let scientists take the next steps in understanding how diseases that involve these proteins work and, ultimately, how to cure them.

Proteins could be said to be the most important molecules in living beings. Just about everything in your body involves or is made out of proteins. Proteins are actually long chains made up of smaller molecules called amino acids.

There are 20 different amino acids that make up all proteins. One can think of the amino acids as being beads of 20 different colors. Sometimes, hundreds of them make up one protein. Proteins typically don't stay as long chains however. As soon as the chain of amino acids is built, the chain folds and tangles up into a more compact mass, ending up in a particular shape. This process is called protein folding.

Protein folding occurs because the various amino acids like to stick to each other following certain rules. You can think of the amino acids (beads on a string) as being sticky, but sticky in such a way that only certain colors can stick to certain other colors.

The amino acid chains built in the body must fold up in a particular way to make useful proteins. The cell has mechanisms to help proteins fold properly and mechanisms to get rid of improperly folded proteins. Each gene tells the order of the amino acids for one protein. The gene itself is a section of a long chain called DNA.

In recent years scientists have sequenced the human genome. Depending on how genes are counted, it contains over 30,000 genes; the collection of all human genes is known as "the human genome." Each of these genes tells how to build the chain of amino acids for one of the roughly 30,000 proteins. The collection of all of the human proteins is known as "the human proteome."

What the genes don't tell is exactly how the proteins will fold into their compact final form. The final shape of a protein is very important because that determines what it can do and what other proteins it can connect to or interact with. You can think of the protein shapes like puzzle pieces. For example muscle proteins connect to each other to form a muscle fiber. They stick together that way because of their shape, and certain other factors relating to the shape.

Everything that happens in cells, and in the body, is very specifically controlled by protein shapes. For example, the proteins of a virus or a bacterium may have particular shapes that interact with human proteins or the human cell membrane and let it infect the cell. This is obviously an oversimplified description, but it is important to understand how important the shapes of proteins are. Knowing these shapes lets us understand how proteins perform their desired functions and also how diseases prevent proteins from doing the correct things to maintain a healthy cell and body.

When your grid agent is running, it is trying to fold a single protein from the set of human proteins with no known shape. The client will try millions of shapes and return to the central server the best 500 shapes it can find. The goodness of each shape the grid agent tries is determined by something referred to as the Rosetta score. The Rosetta score examines the packing of amino acids in the protein and produces a number; the lower the number, the better. The program that the grid agent is running is called Rosetta. As the computers try to fold the protein chains in different ways, they attempt to find the particular folding/shape that is closest to how the proteins really fold in our bodies. You can see the pictures of the partially folded proteins in the right half of the grid agent screen. The left side shows various scores which tell how properly folded the protein is so far, per all of the rules. If a trial fold gets a worse score, then the computer tries to refold it a different way which may produce a better score. This is done millions of times for each protein; scientists will look at only the lowest-scoring structures.




Graphical Overview of Project
  1. The project starts with human proteins from the human genome. These protein sequences are the result of a large amount of research in and of themselves (the human genome project was a huge research project carried out at numerous institutions, including the ISB). A great deal of interesting research is still ongoing to find all of the proteins in the Human Genome with mixtures of computational and experimental efforts. We will fold the proteins in the Human proteome that have no known structure (and often no known function).
  2. Rosetta structure prediction is the program that we’ll use to predict the structures (fold) of these mystery proteins. Rosetta uses a scoring function to rip through huge numbers of possible structures for a given sequence and choose the best structures (which it reports to us as predicted structures). Because there are a large number of possible conformations per sequence and a large number of sequences we need astronomically large amounts of processor power to fold the Human proteome.
  3. We’ll use the spare computing power from huge numbers of volunteers to run Rosetta on more than fifty thousand protein sequences.
  4. We’ll get one or more fold predictions for each protein. Not all predictions work, so we’ll also have several numbers attached to each prediction that tell us how much we can trust each prediction.
  5. We will cross-match the predicted structures with the large data-base of known (by X-ray crystallography and NMR-spectroscopy) structures to see if our predicted fold has been seen before.
  6. If we find a match, that’s just the beginning. In biology, context is very important: biologists use a large diversity of experiments and analysis techniques to carry out their work. There are several methods to try to get at the function of unknown proteins, and structure is just one of them; biologists can best use these fold predictions (from 4) and fold-matches (from 5) when they are integrated with results from other relevant methods (when is a gene turned on or off, what tissue is a protein expressed in, can I find this gene in other organisms, where is this protein in a metabolic pathway, etc.).



Central Dogma
  1. Genomic sequence is the final result produced by genome sequencing projects. For the human genome there would be one word (as shown in 1) for each chromosome. The total length of the 23 chromosomes in humans is ~3 billion letters or bases. This represents a relatively stable place for a cell to store information.
  2. Genomic DNA is copied to complementary messenger RNA by RNA-polymerase. RNA is less stable than DNA and thus the cell can turn on and off genes by regulating how quickly genes are transcribed into RNA message.
  3. RNA is translated into protein sequence by the ribosome. Each three-letter chunk of the RNA sequence (a codon) is translated into one of 20 amino acids; thus each mRNA codes for a single unique protein (a toy sketch follows this list). The protein is made as a long, unfolded polypeptide that is not functional until folded.
  4. Protein folding consists primarily of rotations around the chemical bonds in the backbone and side-chains of the polypeptide to make a conformation that allows the side-chains to pack in a compact core. Here the nascent/unfolded polypeptide/protein is schematized as a red zig-zag. The protein then folds spontaneously to a folded protein as shown at the very bottom. See the description of the scoring functions used by Rosetta for more information about why a protein would do any of this, but the short answer is that the sidechains sticking off the backbone at the bottom make favorable contacts (“+” touching “-“ and oily touching oily for instance).





What is a protein?


Proteins are far from being just things we eat. They are the molecular machines that carry out metabolism, they carry messages that direct development and enable the immune system to tell friend from foe, and they repair damage to our DNA after we’ve spent too much time in the sun. In short, proteins are at the center of human biology, and indeed of all biology.

But what IS a protein?

Most genes code for proteins. Proteins are polymers that are built from smaller monomers called amino acids (say, 150 at a time, though the length of proteins varies from gene to gene). These strings of amino acids (with different amino acids having different shapes and chemical properties) then fold up to make more compact shapes that have specific functions. Nature can thus use the same 20 amino acids, which have a common backbone but different variable groups, and the same ribosome (the machine that strings the amino acids together) to make an astronomically large variety of shapes and functions. By changing the order and type of amino acids in proteins, living things can come up with new functions and shapes. This process is often called mutation. Mutations to proteins can be changes of one amino acid in a protein (say, the hemoglobin in your blood) for another, or the deletion of several amino acids from a protein: http://web.mit.edu/esgbio/www/dogma/mutants.html. Many research efforts are currently underway to allow us to rationally engineer protein sequences to make new functions and therapies.

Most drugs carry out their functions by binding to the specific shapes that folded proteins make in cells. Understanding protein three-dimensional structure is one of many things we need to understand if we are to decode the Human genome or the genome of a given pathogen. For more info on the central dogma of modern biology see:
http://en.wikipedia.org/wiki/Central_dogma
http://web.mit.edu/esgbio/www/dogma/dogmadir.html
http://www.emc.maricopa.edu

To see the 20 amino acids see:
http://web.mit.edu/esgbio/www/lm/proteins/aa/aminoacids.html
http://web.mit.edu/esgbio/www/lm/lmdir.html


Which proteins are important?

When faced with the question of which proteins to fold, the following criterion was used: choose proteins that are important to the people who will be donating the computing cycles.

Overall, predicting the structure of every protein in an organism with Rosetta will contribute to our overall understanding of several proteins in that genome and of how those predicted proteins interact with the organism as a system. Can you imagine trying to fix a car or a machine knowing the function of only 60% of the components? That is the situation that biomedical and biological researchers, to their credit, operate in. Thus, anything that can shed light on these mystery proteins is of use to the fields of biology and medicine. These predictions will not be a magic bullet, but they provide a resource for biologists who are working on the genomes we fold.

The first category of proteins to fold consists of the proteins in the human genome with no known structural homologs. Human proteins are the targets of drugs and the key to improving human health. Improving our understanding of these proteins has innumerable positive effects. Some human proteins in the blood are therapeutics in and of themselves.

The second category consists of proteins found in the genomes of pathogens. Understanding the biology of the bacteria and viruses that cause disease will allow us to better fight them. Many of these proteins are the targets of drugs or have roles in virulence that have yet to be fully understood.

The last category consists of proteins that are found in the genomes of environmental microbes. These microbes represent the majority of molecular biodiversity on the planet, and understanding these microbes and their role in our environment will be aided by a deeper understanding of their proteomes (the structure and function of the proteins in their genomes). These microbes are responsible for global carbon and nitrogen cycles, they degrade human waste products, and they can perform countless as-yet-undiscovered enzymatic biosyntheses.




Drawing Proteins

Proteins are large complicated molecules, so simplifying how we represent them visually is key to protein structure research.

  1. The chemical structure of a single amino acid. These are strung together by the ribosome in the order encoded on the mRNA that codes for the protein. There are 20 amino acids to choose from, R (see depiction below) can thus be any of 20 different chemical structures depending on what amino acid is specified at that position by the mRNA.
  2. A simpler way to write an amino acid.
  3. Three different amino acids forming the beginning of a protein.
  4. The backbone stays the same (thus, the ribosome can use the same machinery to add each new amino acid), while the sidechains are variable (a huge diversity of chemical functions and structures results from varying the order and composition of amino acids in proteins; nature can solve most of its problems with proteins).



Rosetta

Rosetta is a computer program for de novo protein structure prediction, where de novo implies modeling in the absence of detectable sequence similarity to a previously determined three-dimensional protein structure. Rosetta uses small sequence similarities from the Protein Data Bank [http://www.rcsb.org/pdb/] to estimate possible conformations for local sequence segments (three and nine residue segments). These segments are called fragments of local structure. It then assembles these pre-computed structure fragments by minimizing a global scoring function that favors hydrophobic burial and packing, strand pairing, compactness and energetically favorable residue pairings. Results from the fourth and fifth critical assessment of structure prediction (CASP4, CASP5) [http://predictioncenter.llnl.gov/] have shown that Rosetta is currently one of the best methods for de novo protein structure prediction and distant fold recognition.

Using Rosetta generated structure predictions we were previously able to recapitulate or predict many functional insights not detectable from primary sequence. Rosetta was also recently used to generate both fold and function predictions for Pfam protein families that had no link to a known structure, resulting in many high confidence fold predictions. In spite of these successes, Rosetta has a significant error rate, as do all methods for distant fold recognition and de novo structure prediction. We thus calculate not just the structure but also the probability that the predicted structure is correct using the Rosetta confidence function. The Rosetta confidence function partially mitigates this error rate by assessing the accuracy of predicted folds.

Another unavoidable source of uncertainty, with respect to function prediction from structure, is the error associated with distilling function from fold matches. Sometimes a fold carries out more than one function. The predictions generated by de novo structure prediction are thus best used in combination with other sources of putative or general functional information, such as proximity in protein association or gene regulatory networks. Thus, making the predictions resulting from this project available to the public in an easily accessible way is a critical final step in this project.

For a quicktime movie showing a protein (Ubiquitin) being folded by Rosetta click here [1d3z.mov OR 1d3z.mpg ]




Rosetta Score

Rosetta uses a scoring function to judge different conformations (shapes/packings of amino acids within the protein). The simulation consists of making moves (changing the bond angles of a bunch of amino acids) and then scoring the new conformation. The Rosetta score is a weighted sum of component scores, where each component score is judging a different thing. The environment score is judging how well the hydrophobic (oily) residues are packing together to form a core, while the pair-score is judging how compatible touching residues are with each other, one pair at a time.

Environment score: The formation of a hydrophobic core, or the hydrophobic effect, is for most proteins the central driving force for protein folding. The Rosetta environment score rewards burial of hydrophobic residues in a compact hydrophobic core and penalizes solvation of these oily groups. I’ve represented hydrophobic residues as orange stars. The left conformation is good (all the hydrophobics together) while the rightmost conformation is bad (with the hydrophobic amino acids not touching).

Pair-score: Two conformations of a polypeptide are shown, one (top) where the chain is folded back on itself, bringing two cysteines together (yellow + yellow = possible disulphide bond) and forming a salt bridge (blue + red = opposites attract). The conformation at bottom does not make these pairings, and the pair-score would thus favor the top conformation.
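
In code form the idea is just a weighted sum; the Python sketch below is illustrative only (the component names, weights, and numbers are invented for this example and are not the actual Rosetta energy function):

# Lower total score = better conformation. Weights and values are made up.
WEIGHTS = {"environment": 1.0, "pair": 0.5, "compactness": 0.8}

def total_score(components):
    """components: dict mapping component-score name -> value for one conformation."""
    return sum(WEIGHTS[name] * value for name, value in components.items())

# A move loop would keep the lowest-scoring conformation seen so far.
conformation_a = {"environment": -12.0, "pair": -3.0, "compactness": -1.5}
conformation_b = {"environment": -9.0,  "pair": -6.5, "compactness": -2.0}
print(total_score(conformation_a), total_score(conformation_b))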




The Amino Acids

When we display proteins we often use different coloring schemes to help us see the interactions taking place between the different amino acids. We have used the following color scheme for the Human Proteome folding project:

Hydrophobic (oily): orange
Acidic (negatively charged): red
Basic (positive charge): blue
Histidine (positive or negative): purple
Sulphur containing residues: yellow
Everything else (even though every amino acid is special): green




More information for Scientists

Read more in our recent Journal Articles:
Application to Halobacterium NRC-1: [http://genomebiology.com/]
Application to Initial annotation of Haloarcula marismortui: [http://www.genome.org/]
Application to the annotation of Pfam domains of unknown function: [http://www.sciencedirect.com/]

The earliest papers on Rosetta de novo structure prediction (including works by Kim Simons, Rich Bonneau, Charlie EM Strauss, Chris Bystroff, Ingo Ruczinski, Carol Rohl, Phil Bradley, Lars Malmstrom, Dylan Chivian, David Kim, Jens Meiler, Jack Schonbrun, David Baker, and others) can be found at: http://bakerlab.org

Review of De Novo structure prediction methods: annual-rev-bonneau.pdf
[http://arjournals.annualreviews.org]

One-at-a-time Rosetta server (the Robetta server), hosted at the ISB and Los Alamos National Laboratory (Charlie EM Strauss): [http://robetta.bakerlab.org/] Papers describing Robetta: [http://www3.interscience.wiley.com]
[http://www.ncbi.nlm.nih.gov]




PEOPLE

Read more about the scientists at the ISB and the University of Washington leading the Human Proteome Folding Project. For more information on this project direct scientific inquiries to either Richard Bonneau or proteomefolding@systemsbiology.org.

ISB:

Dr. Richard Bonneau: rbonneau@systemsbiology.org
Dr. Bonneau is the technical lead on the Human Proteome Folding project. Dr. Bonneau has expertise primarily in ab initio protein structure prediction, protein folding, and regulatory network inference. He is currently focused on applying structure prediction and structural information to functional annotation and the modeling/prediction of regulatory and physical networks. Dr. Bonneau is working to develop general methods to solve protein structures and protein complexes with small sets of distance constraints derived from chemical cross-linking. At the ISB Dr. Bonneau also works on a number of systems biology data-integration and analysis algorithms, including algorithms designed to infer global regulatory networks from systems-biology data.

Dr. Leroy Hood
Dr. Leroy Hood is recognized as one of the world's leading scientists in molecular biotechnology and genomics. A passionate and dedicated researcher, he holds numerous patents and awards for his scientific breakthroughs and prides himself on his life-long commitment to making science accessible and understandable to the general public, especially children. One of his foremost goals is bringing hands-on, inquiry-based science to K-12 classrooms.
[more: http://www.systemsbiology.org ]

University of Washington:

Lars Malmstroem: larsm@u.washington.edu
Lars Malmstroem has worked to engineer the infrastructure (at the ISB/UW end) needed to handle the vast highly interconnected data-sets that this project will generate; he will also be heavily involved in developing the correct data-integration schemes to best deliver the resultant predictions to biologists.

Dr. David Baker:
Rosetta was developed initially in the laboratory of David Baker by a team that included a large number of scientists at several institutions. The goal of current research in his laboratory is to develop improved models of intra and intermolecular interactions and to apply improved models to the prediction and design of macromolecular structures and interactions. Prediction and design applications can be of great biological interest in their own right, and also provide very stringent and objective tests which drive the improvement of the model and increases in fundamental understanding.
[more: http://depts.washington.edu/ ]
