Tuesday, May 20, 2008

Google Summer of Code project: Integration of GridFTP with FreeLoader storage system

Title: Integration of GridFTP with FreeLoader storage system
Student: Hesam Ghasemi
Mentor: Rajkumar Kettimuthu
Abstract:
Scientific experiments produce large volumes of data, which require cost-conscious data stores as well as reliable data transfer mechanisms. GridFTP is a high-performance, secure, reliable data transfer protocol optimized for high-bandwidth wide-area networks. GridFTP, however, assumes the support of a high-performance parallel file system, a relatively expensive resource.
FreeLoader is a storage system that aggregates idle storage space from workstations connected within a local area network to build a low-cost, yet high-performance data store. FreeLoader breaks files into chunks and distributes them across storage nodes thus enabling parallel access to files. A central manager keeps track of file meta-data as well as of the location of the chunks associated with each file.
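To make the chunking idea concrete, here is a minimal Python sketch. It is not FreeLoader's actual code: the chunk size, the node names, and the round-robin placement policy are all illustrative assumptions. It splits a file into fixed-size chunks, assigns them to storage nodes, and records the layout in a manager-style table.

```python
def stripe(file_size, chunk_size, nodes):
    """Return a manager-style table: chunk index -> (node, offset, length).

    Round-robin placement is an assumption made for this sketch only.
    """
    layout = {}
    offset = 0
    index = 0
    while offset < file_size:
        # The last chunk may be shorter than chunk_size.
        length = min(chunk_size, file_size - offset)
        layout[index] = (nodes[index % len(nodes)], offset, length)
        offset += length
        index += 1
    return layout


# A 10-byte file in 4-byte chunks over two hypothetical nodes.
layout = stripe(file_size=10, chunk_size=4, nodes=["node-a", "node-b"])
```

Because consecutive chunks land on different nodes, a reader can fetch them in parallel, which is the property the paragraph above describes.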
This project will integrate the Globus project’s GridFTP implementation with FreeLoader to reduce the cost and increase the performance of GridFTP deployments. The integration of these two open-source systems will address the following main problems. First, FreeLoader storage nodes will be exposed as GridFTP data transfer processes (DTPs). Second, the assumption the current GridFTP implementation makes, namely that all data is available at all DTPs, will be relaxed by integrating the GridFTP server with the FreeLoader-managed data location mechanisms. Finally, load balancing mechanisms will be added to the GridFTP server implementation to match FreeLoader’s ability to stripe and replicate data across multiple nodes.
Because files at a FreeLoader-supported GridFTP site are spread over multiple FreeLoader nodes (exposed as GridFTP DTPs), the integration will need to orchestrate multiple connections to execute a file transfer. The current implementation of GridFTP is not able to handle cases where the numbers of DTPs on the receiver and sender sides do not match. For instance, for a single client accessing a GridFTP server with N DTPs, the server will need to manage the connections such that only one server DTP connects to the client at a time. Once the first DTP has finished its transfer, the second server DTP will connect to the client and execute its transfer. The same mechanism can be applied to cases where the client-to-server node relationship is N-to-M. This mechanism has the following advantages: the load is balanced between the DTP nodes, and it supports the FreeLoader data layout, in which file chunks are stored on each DTP.
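The connection ordering described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the GridFTP server's logic: the function name, the DTP labels, and the idea of grouping server DTPs into fixed "rounds" are hypothetical. Each round pairs at most one server DTP with each client DTP.

```python
def schedule(server_dtps, client_dtps):
    """Group server DTPs into rounds so that, within a round,
    every client DTP talks to at most one server DTP."""
    rounds = []
    m = len(client_dtps)
    for start in range(0, len(server_dtps), m):
        batch = server_dtps[start:start + m]
        # zip stops at the shorter list, so a final partial batch is fine.
        rounds.append(list(zip(batch, client_dtps)))
    return rounds


# Single client, three server DTPs: one connection per round.
one_client = schedule(["s0", "s1", "s2"], ["c0"])

# Three server DTPs, two client DTPs: two rounds.
two_clients = schedule(["s0", "s1", "s2"], ["c0", "c1"])
```

With one client the rounds degenerate to the strictly sequential case from the text; with more client DTPs the same loop simply runs several transfers per round.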
The design and implementation requirements include: code changes to either system should be minimal to enable future integration with the mainstream GridFTP/FreeLoader code, GridFTP client code changes should be avoided, and the integration components should be implemented in C since both FreeLoader and GridFTP are implemented in C.

Computational Biology On The Grid: Decoupling Computation And I/O With ParaMEDIC

Pavan Balaji, Argonne National Laboratory
Postdoctoral Researcher

CCT talk:

http://www-unix.mcs.anl.gov/~balaji/#publications

Abstract:

Many large-scale computational biology applications simultaneously rely on multiple resources for efficient execution. For example, such applications may require both large compute and storage resources; however, very few supercomputing centers can provide large quantities of both. Thus, data generated at the compute site oftentimes has to be moved to a remote storage site for either storage or visualization and analysis. Clearly, this is not an efficient model, especially when the two sites are distributed over a Grid. In this talk, I'll present a framework called "ParaMEDIC: Parallel Metadata Environment for Distributed I/O and Computing" which uses application-specific semantic information to convert the generated data to orders-of-magnitude smaller metadata at the compute site, transfer the metadata to the storage site, and re-process the metadata at the storage site to regenerate the output. Specifically, ParaMEDIC trades a small amount of additional computation (in the form of data post-processing) for a potentially significant reduction in data that needs to be transferred in distributed environments. The ParaMEDIC framework allowed us to use nine different supercomputers distributed within the U.S. to sequence-search the entire microbial genome database against itself and store the one petabyte of generated data in Tokyo, Japan.
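A toy Python sketch of the metadata idea follows. It is not ParaMEDIC code: the tiny database, the substring "search", and all function names are hypothetical, and it assumes the storage site holds its own copy of the database. The compute site ships only the IDs of matching sequences, and the storage site regenerates the full records from them.

```python
# Hypothetical sequence database, assumed to exist at both sites.
DATABASE = {"d1": "ACGTACGT", "d2": "TTTTGGGG", "d3": "ACGTTTTT"}


def report(query, db_id):
    """Full output record for one query/database pair (the bulky data)."""
    seq = DATABASE[db_id]
    return f"query={query} hit={db_id} sequence={seq} position={seq.find(query)}"


def compute_site(query):
    """Run the full search but return only matching IDs (the metadata)."""
    return [db_id for db_id, seq in sorted(DATABASE.items()) if query in seq]


def storage_site(query, metadata):
    """Re-process the metadata to regenerate the full output."""
    return [report(query, db_id) for db_id in metadata]


metadata = compute_site("ACGT")            # small: just IDs cross the network
output = storage_site("ACGT", metadata)    # large: rebuilt where it is stored
```

The extra work is the second pass in `storage_site`, which touches only the matching entries; that is the "small amount of additional computation" traded for the smaller transfer.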

GENI - The Global Environment for Network Innovations Project

CCT talks: Chip Elliott, BBN Technologies And GENI PI/PD/Chief Engineer


http://www.geni.net/

Abstract:


GENI is an experimental facility called the Global Environment for Network Innovations. GENI is designed to allow experiments on a wide variety of problems in communications, networking, distributed systems, cyber-security, and networked services and applications. The emphasis is on enabling researchers to experiment with radical network designs in a way that is far more realistic than they can today. Researchers will be able to build their own new versions of the “net” or to study the “net” in ways that are not possible today. Compatibility with the Internet is NOT required. The purpose of GENI is to give researchers the opportunity to experiment unfettered by assumptions or requirements and to support those experiments at a large scale with real user populations.

GENI is being proposed to NSF as a Major Research Equipment and Facilities Construction (MREFC) project. The MREFC program is NSF’s mechanism for funding large infrastructure projects. NSF has funded MREFC projects in a variety of fields, such as the Laser Interferometer Gravitational-Wave Observatory (LIGO), but GENI would be the first MREFC project initiated and designed by the computer science research community.

Friday, May 16, 2008

MOPS (Managed Object Placement Service)

http://www.globus.org/toolkit/data/gridftp/mops.html

MOPS (Managed Object Placement Service) is an enhancement to the Globus GridFTP server that allows you to manage some of the resources needed for the data transfer in a more efficient way.

MOPS 0.1 release includes the following:

  1. GFork - This is a service like inetd that listens on a TCP port and runs a configurable executable in a child process whenever a connection is made. GFork also creates bi-directional pipes between the child processes and the master service. These pipes are used for interprocess communication between the child process executables and a master process plugin. More information on GFork can be found here.

  2. GFork master plugin for GridFTP - This master plugin provides enhanced functionality such as dynamic backend registration for striped servers, managed system memory pools, and internal data monitoring for both striped and non-striped servers. More information on the GridFTP master plugin and information on how to run the Globus GridFTP server with GFork can be found here.

  3. Storage usage enforcement using Lotman - All data sent to a Lotman-enabled GridFTP server and written to the Lotman root directory will be managed by Lotman. Information on how to configure Lotman and run it with the Globus GridFTP server can be found here.

  4. Pipelining data transfer commands - GridFTP is a command-response protocol. A client sends one command and then waits for a "Finished response" before sending another. For a large data set partitioned into many small files, paying this overhead on every file hurts performance. Pipelining allows the client to have many outstanding, unacknowledged transfer commands at once. Instead of being forced to wait for the "Finished response" message, the client is free to send transfer commands at any time. Pipelining is enabled by using the -pp option with globus-url-copy.
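The benefit of pipelining for many small files can be estimated with a deliberately simple model, sketched here in Python. The model is an assumption, not a measurement of GridFTP: it charges one command round trip per file without pipelining, and a single round trip in total with pipelining, and the numbers in the example are made up.

```python
def total_time(n_files, file_seconds, rtt_seconds, pipelined):
    """Estimated transfer time for n_files equally sized files.

    Simplified model (an assumption): without pipelining every file
    pays one command round trip before its data moves; with pipelining
    the commands overlap with the data, so only one round trip is paid.
    """
    if pipelined:
        return rtt_seconds + n_files * file_seconds
    return n_files * (rtt_seconds + file_seconds)


# Hypothetical workload: 1000 small files, 10 ms of data each, 100 ms RTT.
without_pp = total_time(1000, 0.01, 0.1, pipelined=False)
with_pp = total_time(1000, 0.01, 0.1, pipelined=True)
```

Under these made-up numbers the per-file round trips dominate: the estimate drops from roughly 110 seconds to roughly 10 seconds, which is why the effect is largest for data sets made of many small files.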

Wednesday, May 14, 2008

GRIDS Lab Topics Related Thesis/Dissertations World-Wide

http://www.gridbus.org/grids_thesis.html


Sunday, May 11, 2008

GROMACS Flow Chart

VERSION 3.3
Thu 11 May 2006


This is a flow chart of a typical GROMACS MD run of a protein in a box of water. A more detailed example is available in the Getting Started section. Several steps of energy minimization may be necessary; these consist of cycles: grompp -> mdrun.

Input: eiwit.pdb

  1. Generate a GROMACS topology (pdb2gmx) -> conf.gro, topol.top
  2. Enlarge the box (editconf) -> conf.gro
  3. Solvate protein (genbox) -> conf.gro, topol.top
  4. Generate mdrun input file (grompp, with grompp.mdp) -> topol.tpr
     Continuation (tpbconv, with traj.trr) -> topol.tpr
  5. Run the simulation (EM or MD) (mdrun) -> traj.xtc, ener.edr
  6. Analysis (g_..., ngmx, g_energy)
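The flow chart can also be written down as data and checked mechanically. The Python sketch below does not run GROMACS: it only records which files each tool consumes and produces, as read off the chart (the exact input lists are my inference from it), and verifies that every step finds its inputs.

```python
# (tool, input files, output files), in flow-chart order.
STEPS = [
    ("pdb2gmx", ["eiwit.pdb"], ["conf.gro", "topol.top"]),
    ("editconf", ["conf.gro"], ["conf.gro"]),
    ("genbox", ["conf.gro", "topol.top"], ["conf.gro", "topol.top"]),
    ("grompp", ["conf.gro", "topol.top", "grompp.mdp"], ["topol.tpr"]),
    ("mdrun", ["topol.tpr"], ["traj.xtc", "ener.edr"]),
]


def check(steps, initial_files):
    """Walk the pipeline and fail if a step's input does not exist yet."""
    available = set(initial_files)
    order = []
    for tool, inputs, outputs in steps:
        missing = [name for name in inputs if name not in available]
        if missing:
            raise ValueError(f"{tool} is missing {missing}")
        available.update(outputs)
        order.append(tool)
    return order, available


# The user supplies the structure file and the run parameters.
order, files = check(STEPS, ["eiwit.pdb", "grompp.mdp"])
```

Dropping grompp.mdp from the initial files makes the check fail at the grompp step, which mirrors the chart: that is the one file in the pipeline that no earlier tool generates.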


Computational Biomolecular Dynamics Group

http://www.mpibpc.mpg.de/groups/de_groot/


We carry out computer simulations of biological macromolecules to study the relationship between dynamics and function.

Using molecular dynamics simulations and other computational tools we predict the dynamics and flexibility of proteins, membranes, carbohydrates and polynucleotides to study biological function and dysfunction at the atomic level.






GROMACS

http://wiki.gromacs.org/index.php/Main_Page

The 5 latest News

GROMACS 3.3.3 released
Friday, 29 February 2008
It is a pleasure to announce the immediate availability of GROMACS 3.3.3, the latest stable version. Please check the download section in order to find the source code and a limited set of binaries. More binaries will probably be released in the coming days. Please check revision information here .
Stanford Workshop
Thursday, 21 February 2008
The workshop in Stanford was fully booked a few days after registration opened. Stay tuned for a workshop in Europe this fall.
RSS Feed activated
Wednesday, 13 February 2008
You can now subscribe to the latest news from the GROMACS website using an RSS feed. The address to use is http://www.gromacs.org/index2.php?option=com_rss&no_html=1
120,000 Downloads
Sunday, 03 February 2008
Since May 9, 2004, GROMACS packages have been downloaded 120,000 times from our server. This means that there were more than 2500 downloads each month, including, obviously, people who download newer versions, of which there have been 3 since that date.
GROMACS 4 Paper
Saturday, 02 February 2008
It is with great pleasure that I announce that the paper about GROMACS 4 is now available on-line at the Journal of Chemical Theory and Computation. Processing by the journal was so fast that we did not manage to have the software ready for release, but it will be announced this spring.

Systems Biology

http://www.systemsbiology.org/technology/Data_Visualization_and_Analysis/Human_Proteome_Folding_Project



HUMAN PROTEOME FOLDING PROJECT
Overview

The Human Proteome Folding Project will use the computer power of millions of computers to predict the shape of human proteins about which researchers currently know little. From this shape scientists hope to learn about the function of these proteins, as the shape of proteins is inherently related to how they function in our bodies. This database of protein structures and putative functions will let scientists take the next steps in understanding how diseases that involve these proteins work and, ultimately, how to cure them.

Proteins could be said to be the most important molecules in living beings. Just about everything in your body involves or is made out of proteins. Proteins are actually long chains made up of smaller molecules called amino acids.

There are 20 different amino acids that make up all proteins. One can think of the amino acids as being beads of 20 different colors. Sometimes, hundreds of them make up one protein. Proteins typically don't stay as long chains however. As soon as the chain of amino acids is built, the chain folds and tangles up into a more compact mass, ending up in a particular shape. This process is called protein folding.

Protein folding occurs because the various amino acids like to stick to each other following certain rules. You can think of the amino acids (beads on a string) as being sticky, but sticky in such a way that only certain colors can stick to certain other colors.

The amino acid chains built in the body must fold up in a particular way to make useful proteins. The cell has mechanisms to help the proteins fold properly and mechanisms to get rid of improperly folded proteins. Each gene tells the order of the amino acids for one protein. The gene itself is a section of a long chain called DNA.

In recent years scientists have sequenced the human genome. The collection of all human genes is known as "the human genome". Depending on how genes are counted, there are over 30,000 genes in the human genome. Each of these genes tells how to build the chain of amino acids for one of the 30,000 proteins. The collection of all of the human proteins is known as "the human proteome."

What the genes don't tell is exactly how the proteins will fold into their compact final form. The final shape of a protein is very important because that determines what it can do and what other proteins it can connect to or interact with. You can think of the protein shapes like puzzle pieces. For example, muscle proteins connect to each other to form a muscle fiber. They stick together that way because of their shape, and certain other factors relating to the shape.

Everything that happens in cells, and in the body, is very specifically controlled by protein shapes. For example, the proteins of a virus or a bacterium may have particular shapes that interact with human proteins or the human cell membrane and let it infect the cell. This is obviously an oversimplified description, but it is important to understand how important the shapes of proteins are. Knowing these shapes lets us understand how the proteins perform their desired function and also how diseases prevent proteins from doing the correct things to maintain a healthy cell and body.

When your grid agent is running it is trying to fold a single protein from the set of human proteins with no known shape. The client will try millions of shapes and return to the central server the best 500 shapes it can find. The goodness of each shape the grid agent tries is determined by something referred to as the Rosetta score. The Rosetta score examines the packing of amino acids in the protein and produces a number; the lower the number, the better. The program that the grid agent is running is called Rosetta. As the computers try to fold the protein chains in different ways, they attempt to find the particular folding/shape that is closest to how the proteins really fold in our bodies. You can see the pictures of the partially folded proteins in the right half of the grid agent screen. The left side shows various scores which tell how properly folded the protein is so far, per all of the rules. If a trial fold gets a worse score, then the computer tries to refold it a different way which may produce a better score. This is done millions of times for each protein; scientists will look at only the lowest scoring structures.
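The "try many shapes, keep the best 500" loop can be sketched in Python as follows. The scoring function here is a made-up stand-in, not the Rosetta score, and a "shape" is just a short list of random numbers; only the selection logic (lower score is better, keep the 500 lowest) follows the description above.

```python
import heapq
import random


def toy_score(shape):
    """Stand-in for the Rosetta score (lower is better). NOT the real function."""
    return sum((x - 0.5) ** 2 for x in shape)


def search(n_trials, keep, seed=0):
    """Generate n_trials random toy shapes and keep the lowest-scoring ones."""
    rng = random.Random(seed)
    trials = ([rng.random() for _ in range(5)] for _ in range(n_trials))
    # nsmallest keeps only `keep` candidates in memory at a time.
    return heapq.nsmallest(keep, trials, key=toy_score)


best = search(n_trials=10_000, keep=500)
```

Using `heapq.nsmallest` over a generator means the client never has to hold all the trial shapes at once, only the current best few hundred, which matters when the number of trials is in the millions.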


Graphical Overview of Project
  1. The project starts with human proteins from the human genome. These protein sequences are the result of a large amount of research in and of themselves (the human genome project was a huge research project carried out at numerous institutions, including the ISB). A great deal of interesting research is still ongoing to find all of the proteins in the Human Genome with mixtures of computational and experimental efforts. We will fold the proteins in the Human proteome that have no known structure (and often no known function).
  2. Rosetta structure prediction is the program that we’ll use to predict the structures (fold) of these mystery proteins. Rosetta uses a scoring function to rip through huge numbers of possible structures for a given sequence and choose the best structures (which it reports to us as predicted structures). Because there are a large number of possible conformations per sequence and a large number of sequences we need astronomically large amounts of processor power to fold the Human proteome.
  3. We’ll use the spare computing power from huge numbers of volunteers to run Rosetta on more than fifty thousand protein sequences.
  4. We’ll get one or more fold predictions for each protein. Not all predictions work, so we’ll also have several numbers attached to each prediction that tell us how much we can trust each prediction.
  5. We will cross-match the predicted structures with the large data-base of known (by X-ray crystallography and NMR-spectroscopy) structures to see if our predicted fold has been seen before.
  6. If we find a match, that’s just the beginning. In biology context is very important: biologists use a large diversity of experiments and analysis techniques to carry out their work. There are several methods to try to get at the function of unknown proteins, and structure is just one of them. Biologists can best use these fold predictions (from 4) and fold matches (from 5) when they are integrated with results from other relevant methods (when is a gene turned on or off? what tissue is a protein expressed in? can I find this gene in other organisms? where is this protein in a metabolic pathway? etc.).

Central Dogma
  1. Genomic sequence is the final result produced by genome sequencing projects. For the human genome there would be one word (as shown in 1) for each chromosome. The total length of the 23 chromosomes in humans is ~3 billion letters or bases. This represents a relatively stable place for a cell to store information.
  2. Genomic DNA is copied to complementary messenger RNA by RNA polymerase. RNA is less stable than DNA and thus the cell can turn genes on and off by regulating how quickly genes are transcribed into RNA message.
  3. RNA is translated into protein sequence by the ribosome. Each three-letter chunk of the RNA sequence (codon) is translated into one of 20 amino acids. Thus each mRNA codes for a single unique protein. The protein is made as a long, unfolded polypeptide that is not functional until folded.
  4. Protein folding consists primarily of rotations around the chemical bonds in the backbone and side-chains of the polypeptide to make a conformation that allows the side-chains to pack in a compact core. Here the nascent/unfolded polypeptide/protein is schematized as a red zig-zag. The protein then folds spontaneously to a folded protein as shown at the very bottom. See the description of the scoring functions used by Rosetta for more information about why a protein would do any of this, but the short answer is that the side-chains sticking off the backbone at the bottom make favorable contacts (“+” touching “-” and oily touching oily, for instance).
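The translation step (item 3) can be illustrated with a short Python sketch. The codon table below is deliberately partial, holding only a handful of entries from the standard genetic code, and the example mRNA string is made up; a real translator would need all 64 codons.

```python
# A few entries of the standard genetic code; None marks a stop codon.
CODON_TABLE = {
    "AUG": "Met", "UUU": "Phe", "GGC": "Gly", "GCU": "Ala",
    "AAA": "Lys", "GAU": "Asp", "UAA": None,
}


def translate(mrna):
    """Read the mRNA three letters at a time until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid is None:  # stop codon: the chain is finished
            break
        protein.append(amino_acid)
    return protein


# Made-up message: AUG UUU GGC AAA UAA
protein = translate("AUGUUUGGCAAAUAA")
```

The result is the unfolded chain Met, Phe, Gly, Lys; nothing in this step says how that chain will fold, which is the gap the folding project addresses.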



What is a protein?


Proteins are far from being just things we eat. They are the molecular machines that carry out metabolism; they carry messages that direct development and enable the immune system to tell friend from foe; they repair damage to our DNA after we’ve spent too much time in the sun. In short, proteins are at the center of human biology, and of all biology.

But what IS a protein?

Most genes code for proteins. Proteins are polymers that are built from smaller monomers called amino acids (let's say 150 at a time, but the length of proteins varies from gene to gene). These strings of amino acids (with different amino acids having different shapes and chemical properties) then fold up to make more compact shapes that have specific functions. So nature can use the same 20 amino acids, which have a common backbone but different variable groups, and the same ribosome (the machine that strings the amino acids together) to make an astronomically large variety of shapes and functions. By changing the order and type of amino acids in proteins, living things can come up with new functions and shapes. This process is often called mutation. Mutations to proteins can be changes of one amino acid in a protein (say, the hemoglobin in your blood) for another, or the deletion of several amino acids from a protein (http://web.mit.edu/esgbio/www/dogma/mutants.html). Many research efforts are currently underway to allow us to rationally engineer protein sequences to make new functions and therapies.

Most drugs carry out their functions by binding to the specific shapes that folded proteins make in cells. Understanding protein three-dimensional structure is one of many things we need to understand if we are to decode the Human genome or the genome of a given pathogen. For more info on the central dogma of modern biology see:
http://en.wikipedia.org/wiki/Central_dogma
http://web.mit.edu/esgbio/www/dogma/dogmadir.html
http://www.emc.maricopa.edu

To see the 20 amino acids see:
http://web.mit.edu/esgbio/www/lm/proteins/aa/aminoacids.html
http://web.mit.edu/esgbio/www/lm/lmdir.html


Which proteins are important?

When faced with the question of which proteins we should fold, the following criterion was used: choose proteins that are important to the people who will be donating the computing cycles.

Overall, predicting the structure of every protein in an organism with Rosetta will contribute to our understanding of several proteins in that genome and how those predicted proteins interact with the organism as a system. Can you imagine trying to fix a car or a machine knowing the function of only 60% of the components? That is the situation that biomedical and biological researchers, to their credit, operate in. Thus, anything that can shed light on these mystery proteins is of use to the field of biology and medicine. These predictions will not be a magic bullet but will provide a resource for biologists who are working on the genomes we fold.

The first category of proteins to fold are the proteins in the Human Genome with no known structural homologs. Human proteins are the targets of drugs and the key to improving human health. Improving our understanding of these proteins has innumerable positive effects. Some Human proteins in the blood are therapeutics in and of themselves.

The second category consists of proteins found in the genomes of pathogens. Understanding the biology of these bacteria and viruses that cause disease will allow us to better fight them. Many of these proteins are the targets of drugs or have roles in virulence that have yet to be fully understood.

The last category consists of proteins that are found in the genomes of environmental microbes. These microbes represent the majority of molecular biodiversity on the planet, and understanding these microbes and their role in our environment will be aided by a deeper understanding of their proteomes (the structure and function of the proteins in their genomes). These microbes are responsible for global carbon and nitrogen cycles, they degrade human waste products, and they can perform countless undiscovered enzymatic biosyntheses.


Drawing Proteins

Proteins are large complicated molecules, so simplifying how we represent them visually is key to protein structure research.

  1. The chemical structure of a single amino acid. These are strung together by the ribosome in the order encoded on the mRNA that codes for the protein. There are 20 amino acids to choose from; R (see depiction below) can thus be any of 20 different chemical structures depending on what amino acid is specified at that position by the mRNA.
  2. A simpler way to write an amino acid.
  3. Three different amino acids forming the beginning of a protein.
  4. The backbone stays the same (thus, the ribosome can use the same machinery to add each new amino acid) while the side-chains are variable. A huge diversity of chemical functions and structures results from varying the order and composition of amino acids in proteins, so nature can solve most of its problems with proteins.

Rosetta

Rosetta is a computer program for de novo protein structure prediction, where de novo implies modeling in the absence of detectable sequence similarity to a previously determined three-dimensional protein structure. Rosetta uses small sequence similarities from the Protein Data Bank [http://www.rcsb.org/pdb/] to estimate possible conformations for local sequence segments (three and nine residue segments). These segments are called fragments of local structure. It then assembles these pre-computed structure fragments by minimizing a global scoring function that favors hydrophobic burial and packing, strand pairing, compactness and energetically favorable residue pairings. Results from the fourth and fifth critical assessment of structure prediction (CASP4, CASP5) [http://predictioncenter.llnl.gov/] have shown that Rosetta is currently one of the best methods for de novo protein structure prediction and distant fold recognition.

Using Rosetta-generated structure predictions we were previously able to recapitulate or predict many functional insights not detectable from primary sequence. Rosetta was also recently used to generate both fold and function predictions for Pfam protein families that had no link to a known structure, resulting in many high-confidence fold predictions. In spite of these successes, Rosetta has a significant error rate, as do all methods for distant fold recognition and de novo structure prediction. We thus calculate not just the structure but also the probability that the predicted structure is correct using the Rosetta confidence function. The Rosetta confidence function partially mitigates this error rate by assessing the accuracy of predicted folds.

Another unavoidable source of uncertainty, with respect to function prediction from structure, is the error associated with distilling function from fold matches. Sometimes a fold carries out more than one function. The predictions generated by de novo structure prediction are thus best used in combination with other sources of putative or general functional information such as proximity in protein association or gene regulatory networks. Thus, making the predictions resulting from this project available to the public in an easily accessible way is a critical final step in this project.

A QuickTime movie showing a protein (ubiquitin) being folded by Rosetta is available as 1d3z.mov or 1d3z.mpg.


Rosetta Score

Rosetta uses a scoring function to judge different conformations (shapes/packings of amino acids within the protein). The simulation consists of making moves (changing the bond angles of a bunch of amino acids) and then scoring the new conformation. The Rosetta score is a weighted sum of component scores, where each component score judges a different thing. The environment score judges how well the hydrophobic (oily) residues pack together to form a core, while the pair-score judges how compatible touching residues are with each other, one pair at a time.

Environment score: The formation of a hydrophobic core, or the hydrophobic effect, is for most proteins the central driving force for protein folding. The Rosetta environment score rewards burial of hydrophobic residues in a compact hydrophobic core and penalizes solvation of these oily groups. I’ve represented hydrophobic residues as orange stars. The left conformation is good (all the hydrophobics together) while the rightmost conformation is bad (with the hydrophobic amino acids not touching).

Pair-score: Two conformations of a polypeptide are shown, one (top) where the chain is folded back on itself, bringing two cysteines together (yellow + yellow = possible disulphide bond) and forming a salt-bridge (blue + red = opposites attract). The conformation at bottom does not make these pairings and the pair-score would, thus, favor the top conformation.
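As a rough illustration of a weighted sum of component scores, consider the Python sketch below. The weights and the component values are invented for the example and are not Rosetta's; the only things taken from the text are the two component names and the convention that a lower total is better.

```python
def weighted_score(components, weights):
    """Total score = sum over components of weight * component score."""
    return sum(weights[name] * value for name, value in components.items())


# Hypothetical weights and component values (lower is better).
weights = {"environment": 1.0, "pair": 0.5}
top_conformation = {"environment": -4.0, "pair": -2.0}     # makes the pairings
bottom_conformation = {"environment": -4.0, "pair": 0.0}   # does not

top_total = weighted_score(top_conformation, weights)
bottom_total = weighted_score(bottom_conformation, weights)
```

With identical environment scores, the pair term alone separates the two conformations, so the top conformation gets the lower (better) total, matching the pair-score description above.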


The Amino Acids

When we display proteins we often use different coloring schemes to help us see the interactions taking place between the different amino acids. We have used the following color scheme for the Human Proteome folding project:

Hydrophobic (oily): orange
Acidic (negatively charged): red
Basic (positive charge): blue
Histidine (positive or negative): purple
Sulphur containing residues: yellow
Everything else (even though every amino acid is special): green
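The color scheme maps naturally onto a lookup table. In the Python sketch below the class-to-color mapping comes from the list above, but the assignment of one-letter residue codes to classes is my assumption (in particular, which residues count as "hydrophobic" is a judgment call the page does not spell out).

```python
# Class -> color, as listed above.
COLOR_BY_CLASS = {
    "hydrophobic": "orange",
    "acidic": "red",
    "basic": "blue",
    "histidine": "purple",
    "sulphur": "yellow",
    "other": "green",
}

# Residue -> class. ASSUMPTION: this membership is illustrative only.
CLASS_BY_RESIDUE = {
    **dict.fromkeys("AVLIFW", "hydrophobic"),
    "D": "acidic", "E": "acidic",
    "K": "basic", "R": "basic",
    "H": "histidine",
    "C": "sulphur", "M": "sulphur",
}


def color(residue):
    """Color for a one-letter residue code; unlisted residues are green."""
    return COLOR_BY_CLASS[CLASS_BY_RESIDUE.get(residue, "other")]


colors = [color(residue) for residue in "MKLDS"]
```

Falling back to "other" for anything not listed mirrors the last line of the scheme, where every remaining amino acid is drawn in green.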


More information for Scientists

Read more in our recent Journal Articles:
Application to Halobacterium NRC-1: [http://genomebiology.com/]
Application to initial annotation of Haloarcula marismortui: [http://www.genome.org/]
Application to the annotation of Pfam domains of unknown function: [http://www.sciencedirect.com/]

The earliest papers on Rosetta de novo structure prediction (including works by Kim Simons, Rich Bonneau, Charlie EM Strauss, Chris Bystroff, Ingo Ruczinski, Carol Rohl, Phil Bradley, Lars Malmstrom, Dylan Chivian, David Kim, Jens Meiler, Jack Schonbrun, David Baker, and others) can be found at: http://bakerlab.org

Review of De Novo structure prediction methods: annual-rev-bonneau.pdf
[http://arjournals.annualreviews.org]

One-at-a-time Rosetta server (the Robetta server); Hosted at ISB and Los Alamos National Labs (Charlie EM Strauss) [http://robetta.bakerlab.org/] Papers describing Robetta: [http://www3.interscience.wiley.com]
[http://www.ncbi.nlm.nih.gov]


PEOPLE

Read more about the scientists at the ISB and the University of Washington leading the Human Proteome Folding Project. For more information on this project direct scientific inquiries to either Richard Bonneau or proteomefolding@systemsbiology.org.

ISB:

Dr. Richard Bonneau: rbonneau@systemsbiology.org
Dr. Bonneau is the technical lead on the Human Proteome Folding project. Dr. Bonneau has expertise primarily in ab initio protein structure prediction, protein folding, and regulatory network inference. He is currently focused on applying structure prediction and structural information to functional annotation and the modeling/prediction of regulatory and physical networks. Dr. Bonneau is working to develop general methods to solve protein structures and protein complexes with small sets of distance constraints derived from chemical cross-linking. At the ISB Dr. Bonneau also works on a number of systems biology data-integration and analysis algorithms, including algorithms designed to infer global regulatory networks from systems-biology data.

Dr. Leroy Hood
Dr. Leroy Hood is recognized as one of the world's leading scientists in molecular biotechnology and genomics. A passionate and dedicated researcher, he holds numerous patents and awards for his scientific breakthroughs and prides himself on his life-long commitment to making science accessible and understandable to the general public, especially children. One of his foremost goals is bringing hands-on, inquiry-based science to K-12 classrooms.
[more: http://www.systemsbiology.org ]

University of Washington:

Lars Malmstroem: larsm@u.washington.edu
Lars Malmstroem has worked to engineer the infrastructure (at the ISB/UW end) needed to handle the vast highly interconnected data-sets that this project will generate; he will also be heavily involved in developing the correct data-integration schemes to best deliver the resultant predictions to biologists.

Dr. David Baker:
Rosetta was developed initially in the laboratory of David Baker by a team that included a large number of scientists at several institutions. The goal of current research in his laboratory is to develop improved models of intra and intermolecular interactions and to apply improved models to the prediction and design of macromolecular structures and interactions. Prediction and design applications can be of great biological interest in their own right, and also provide very stringent and objective tests which drive the improvement of the model and increases in fundamental understanding.
[more: http://depts.washington.edu/ ]

Saturday, May 10, 2008

Phil Dykstra's nuttcp quick start guide

nuttcp is a TCP/UDP network testing tool, much like iperf. I think it's the best such tool available, for its simplicity, ease of use, and feature set.

Getting nuttcp

Official Site: http://www.lcp.nrl.navy.mil/nuttcp/ (html)
Official Site: ftp://ftp.lcp.nrl.navy.mil/pub/nuttcp/ (ftp)
Local copies: http://www.wcisd.hpc.mil/nuttcp/

Compiling/Installing nuttcp

Compiling nuttcp for unix/linux should be easy:
  cc -O3 -o nuttcp nuttcp-5.5.5.c
I copy it to /usr/local/bin/nuttcp but you could put it anywhere. It does NOT need any special permissions to run.

To run nuttcp manually

On one system (or both), run nuttcp -S. This starts a server that will wait for connections. On the other system, try commands like:
  nuttcp hostname        (transmits to hostname)
  nuttcp -r hostname     (receives from hostname)
Type "nuttcp" to see lots of options. Most useful:
  -i1       to watch tests run (1 second intervals)
  -w8m      to set socket buffers ("window") to 8 MBytes
  -u        for UDP tests
  -R10m     for a 10 Mbps UDP test (or TCP rate limit)
  -l512     to set UDP packet length (or TCP write size)

With servers running on remote machines you can also do third party tests:

  nuttcp host1 host2

To start nuttcp from xinetd

  1. Copy nuttcp4 and nuttcp6 to /etc/xinetd.d/ (these are in the xinetd.d subdirectory on this server)
  2. Make sure "nuttcp" is in your /etc/services file:
       nuttcp          5000/tcp
       nuttcp-data     5001/tcp
    This tells xinetd what port to listen on for the "nuttcp" service.
  3. Enable the services:
       chkconfig nuttcp4 on
       chkconfig nuttcp6 on
    If you don't have or don't want IPv6, only enable nuttcp4.
  4. Reload xinetd:
       killall -HUP xinetd

Note on IPv6 and old xinetd's:

Modern xinetd's can listen for IPv4 and IPv6 services on the same port. Old ones (e.g. Redhat 8.0) can't. I'm not sure when this changed. For old systems just use the nuttcp file for xinetd.d. If you also want IPv6 you will have to start another service on a different port.

Optional nuttcp server access control

You can use /etc/hosts.allow and /etc/hosts.deny if you are starting nuttcp from xinetd. And/or you can use iptables to restrict access to the control port (default 5000).

nuttcp on Windows

Try one of the zip files. Unzip it and run nuttcp from that directory. These were compiled with cygwin. A cygwin dll is included in the zip file. If you get IPv6 errors, try the version that says "noipv6".

Example Run

host1$ nuttcp -i1 -w8m host2
83.1246 MB / 1.00 sec = 699.7341 Mbps
118.0095 MB / 1.00 sec = 990.0559 Mbps
118.0095 MB / 1.00 sec = 990.0886 Mbps
118.0009 MB / 1.00 sec = 989.9823 Mbps
118.0095 MB / 1.00 sec = 990.0718 Mbps
118.0095 MB / 1.00 sec = 990.0757 Mbps
118.0009 MB / 1.00 sec = 989.9744 Mbps
118.0095 MB / 1.00 sec = 990.0807 Mbps
118.0095 MB / 1.00 sec = 990.0896 Mbps
118.0009 MB / 1.00 sec = 989.9893 Mbps

1157.4375 MB / 10.10 sec = 961.4075 Mbps 16 %TX 10 %RX

This shows a 10 second TCP test from host1 to host2 with a window size of 8 MBytes. It ran at 990 Mbps (GigE with 9000 byte jumbo frames) most of the time. The first second or two is often slower due to TCP slow start. The average for the entire test was 961 Mbps. During each second it sent ~118 MBytes. The percentages at the end mean that nuttcp consumed 16% of the CPU on the transmitter (host1) and 10% of the CPU on the receiver (host2). They are useful for telling if you were CPU limited and should not be confused with packet loss or retransmits.
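
The arithmetic behind these figures can be checked with a short Python sketch. It assumes, as the numbers above suggest but the guide does not state, that nuttcp's "MB" is 2^20 bytes while "Mbps" is 10^6 bits per second. The function name is made up for illustration.

```python
def mbps_from_mbytes(mbytes, seconds):
    """Convert `mbytes` (assumed to be 2**20 bytes each) transferred in
    `seconds` into megabits per second (10**6 bits each)."""
    bits = mbytes * 2**20 * 8
    return bits / seconds / 1e6

# One interval line from the example run above.
interval_rate = mbps_from_mbytes(118.0095, 1.00)
# The summary line for the whole test.
overall_rate = mbps_from_mbytes(1157.4375, 10.10)

print(f"interval: {interval_rate:.1f} Mbps")
print(f"overall:  {overall_rate:.1f} Mbps")
```

Both results land within about 0.15 Mbps of what nuttcp printed. The small gap is consistent with the elapsed times being rounded to two decimals in the output.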


Phil Dykstra
April 2007

ttcp/nttcp/nuttcp/iperf

http://sd.wareonearth.com/~phil/net/ttcp/



ttcp/nttcp/nuttcp/iperf versions

ttcp was one of the first TCP throughput testing tools ever written. It was created by Mike Muuss at BRL to compare the performance of TCP stacks by U.C. Berkeley and BBN to help DARPA decide which version to place in the first BSD Unix release (Berkeley won). The name stands for "Test TCP", but it also supports UDP testing. Many variations have since been created with different defaults, new options, ports to new systems, etc.

What does the reported data rate really mean?

All variations of ttcp/iperf report payload or user data rates, i.e. no overhead bytes from headers (TCP, UDP, IP, etc.) are included in the reported data rates. When comparing to "line" rates or "peak" rates, it is important to consider all of this overhead. It is also important to understand what the tools mean by "K" or "M" bits or bytes per second. Versions of the tools differ on this point.

Computer memory is measured in powers of two, e.g. 1 KB = 2^10 = 1024 bytes; 1 MB = 2^20 = 1024*1024 = 1048576 bytes. Data communication rates however should always be stated in simple bits per second. For example "100 megabit ethernet" can send exactly 100,000,000 bits per second.

    K and M in ttcp/nttcp/iperf
    • ttcp original: K = 1024 (M is not used)
    • nttcp: K = 1024, M = 1000*1024
    • iperf 1.1.1 and earlier: K = 1024, M = 1024*1024
    • iperf 1.2 and above: K=1000, M=1000000
    • nuttcp: K=1000, M=1000000
Thus for some early tools, the reported K bits per second should be multiplied by 1.024 to get the actual kilobits per second payload data rate. Reported M bits per second should be multiplied by 1.024 for nttcp and 1.048576 for iperf 1.1.1 and earlier to get the payload data rate in millions of bits per second. Iperf 1.2 and above and nuttcp correctly report bits per second rates in powers of ten.
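
As a rough illustration, the table above can be turned into a small Python lookup that converts a reported rate into plain bits per second. The tool keys and function name are invented for this sketch. The sample rates are taken from the ttcp, nttcp, and iperf example runs shown later in this post.

```python
# Size of "K" and "M" used by each tool, copied from the list above.
UNIT_SIZES = {
    "ttcp":      {"K": 1024},                      # original ttcp, no M
    "nttcp":     {"K": 1024, "M": 1000 * 1024},
    "iperf-old": {"K": 1024, "M": 1024 * 1024},    # iperf 1.1.1 and earlier
    "iperf":     {"K": 1000, "M": 1000 * 1000},    # iperf 1.2 and above
    "nuttcp":    {"K": 1000, "M": 1000 * 1000},
}

def true_bits_per_second(reported, unit, tool):
    """Convert a reported K or M bits/sec figure into plain bits/sec."""
    return reported * UNIT_SIZES[tool][unit]

ttcp_bps = true_bits_per_second(58119.3, "K", "ttcp")
nttcp_bps = true_bits_per_second(353.9473, "M", "nttcp")
iperf_old_bps = true_bits_per_second(336, "M", "iperf-old")

for name, bps in [("ttcp", ttcp_bps), ("nttcp", nttcp_bps),
                  ("iperf 1.1.1", iperf_old_bps)]:
    print(f"{name}: {bps / 1e6:.2f} million bits per second")
```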

Different Versions

ttcp original

(source code)
RCSid[] = "@(#)$Header: ttcp.c,v 1.10 87/09/02 23:26:36 mike Exp $ (BRL)"
Usage: ttcp -t [-options] host
-l## length of bufs written to network (default 1024)
-s source a pattern to network
-n## number of bufs written to network (-s only, default 1024)
-p## port number to send to (default 2000)
-u use UDP instead of TCP
Usage: ttcp -r [-options] >out
-l## length of network read buf (default 1024)
-s sink (discard) all data from network
-p## port number to listen at (default 2000)
-B Only output full blocks, as specified in -l## (for TAR)
-u use UDP instead of TCP
Example run:
% ttcp -r -s
ttcp-r: nbuf=1024, buflen=1024, port=2000
ttcp-r: socket
ttcp-r: accept
ttcp-r: 0.0user 0.5sys 0:10real 5% 0i+0d 0maxrss 0+0pf 0+0csw
ttcp-r: 74891264 bytes processed
ttcp-r: 0.54 CPU sec = 135437 KB/cpu sec, 1.0835e+06 Kbits/cpu sec
ttcp-r: 10.067 real sec = 7264.92 KB/real sec, 58119.3 Kbits/sec
It is also worth noting that the "bytes processed" number can wrap for large data transfers.

nttcp

(source code)

Created by someone at Silicon Graphics (SGI), it added the important option (-w) to set the transmit and receive window size (actually socket buffer sizes, which indirectly set the window size). It also changed several things:

  • changed default nbuf (-n) from 1024 to 2048
  • changed default buflen (-l) from 1024 to 65536
  • changed default port (-p) from 2000 to 5001
  • added window size (-w) option
  • added TCP nodelay (-D) option
  • reversed the meaning of (-s) source/sink option!

The default buffer length increase to 65536 was probably an attempt to maximize performance by reducing the number of write system calls. I have found however that on modern systems larger buffers are not always faster because they may not fit in cache memory as well. Somewhere around 8kB seems like a good value in my experience for today's hardware.

The changes in the default port and in the meaning of -s are unfortunate. At least the author changed the program name. Others however have created "ttcp" programs that also have these changes. iperf and nuttcp continued nttcp's use of port 5001.

Usage: ttcp -t [-options] host [
-l## length of bufs written to network (default 8192)
-s don't source a pattern to network, use stdin
-n## number of source bufs written to network (default 2048)
-p## port number to send to (default 5001)
-u use UDP instead of TCP
-D don't buffer TCP writes (sets TCP_NODELAY socket option)
-L set SO_LONGER socket option
Usage: ttcp -r [-options >out]
-l## length of network read buf (default 8192)
-s don't sink (discard): prints all data from network to stdout
-p## port number to listen at (default 5001)
-B Only output full blocks, as specified in -l## (for TAR)
-u use UDP instead of TCP
Example run:
% ./nttcp -r
ttcp-r: buflen=65536, nbuf=2048, port=5001 tcp
ttcp-r: socket
ttcp-r: accept from 127.0.0.1
send window size = 65535
receive window size = 65535
ttcp-r: 455426048 bytes in 10.05 real seconds = 44243.41 KB/sec = 353.9473 Mb/s
ttcp-r: 55550 I/O calls, msec/call = 0.19, calls/sec = 5526.05
ttcp-r: 0.0user 2.5sys 0:10real 25% 0i+0d 0maxrss 0+14pf 0+0csw

nuttcp

ftp://ftp.lcp.nrl.navy.mil/pub/nuttcp/

One of the best! Highly recommended. The author calls it n-u-t-t-c-p, but many of us affectionately call it nut-c-p. nuttcp can run as a server and passes all output back to the client side, so you don't need an account on the server side to see the results.

Usage: nuttcp or nuttcp -h      prints this usage info
Usage: nuttcp -V prints version info
Usage: nuttcp -xt [-m] host forward and reverse traceroute to/from server
Usage (transmitter): nuttcp -t [-options] host [3rd-party] [ out]
-4 Use IPv4
-6 Use IPv6
-l## length of network read buf (default 8192/udp, 65536/tcp)
-s don't sink (discard): prints all data from network to stdout
-n## number of bufs for server to write to network (default 2048)
-w## receiver window size in KB (or (m|M)B or (g|G)B)
-ws## server transmit window size in KB (or (m|M)B or (g|G)B)
-wb braindead Solaris 2.8 (sets both xmit and rcv windows)
-p## port number to listen at (default 5001)
-P## port number for control connection (default 5000)
-B Only output full blocks, as specified in -l## (for TAR)
-u use UDP instead of TCP
-m## use multicast with specified TTL instead of unicast (UDP)
-N## number of streams (starting at port number), implies -B
-R## server transmit rate limit in Kbps (or (m|M)bps or (g|G)bps)
-T## server transmit timeout in seconds (or (m|M)inutes or (h|H)ours)
-i## client interval reporting in seconds (or (m|M)inutes)
-Ixxx identifier for nuttcp output (max of 40 characters)
-F flip option to reverse direction of data connection open
-xP## set nuttcp process priority (must be root)
-d set TCP SO_DEBUG option on data socket
-v[v] verbose [or very verbose] output
-b brief output (default)
Usage (server): nuttcp -S [-options]
note server mode excludes use of -s
-4 Use IPv4 (default)
-6 Use IPv6
-1 oneshot server mode (implied with inetd/xinetd), implies -S
-P## port number for server connection (default 5000)
note don't use with inetd/xinetd (use services file instead)
-xP## set nuttcp process priority (must be root)
--no3rdparty don't allow 3rd party capability
--nofork don't fork server
Format options:
-fxmitstats also give transmitter stats (MB) with -i (UDP only)
-frunningtotal also give cumulative stats on interval reports
-f-drops don't give packet drop info on brief output (UDP)
-f-percentloss don't give %loss info on brief output (UDP)
-fparse generate key=value parsable output

iperf

http://dast.nlanr.net/Projects/Iperf/

% iperf -v
iperf version 1.1.1 (23 Feb 2000) pthreads
Example run:
% iperf -s
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 64.0 KByte (default)
------------------------------------------------------------
[ 6] local 127.0.0.1 port 5001 connected with 127.0.0.1 port 3639
[ ID] Interval Transfer Bandwidth
[ 6] 0.0-10.0 sec 421 MBytes 336 Mbits/sec

P. Dykstra, phil@sd.wareonearth.com, March 2001 (update March 2004)

Friday, April 25, 2008

Data: How to Initiate Transfers: GridFTP Clients

GridFTP is a high-performance, secure, reliable data transfer protocol optimized for high-bandwidth, wide-area networks. TeraGrid has three clients which utilize GridFTP; more information and examples of each method are on the pages linked below:

http://www.teragrid.org/userinfo/data/gridftp.php

http://www.teragrid.org/userinfo/data/

Tuesday, April 22, 2008

The evolution of storage systems

by R. J. T. Morris and B. J. Truskowski

Storage systems are built by taking the basic capability of a storage device, such as the hard disk drive, and adding layers of hardware and software to obtain a highly reliable, high-performance, and easily managed system. We explain in this paper how storage systems have evolved over five decades to meet changing customer needs. First, we briefly trace the development of the control unit, RAID (redundant array of independent disks) technologies, copy services, and basic storage management technologies. Then, we describe how the emergence of low-cost local area data networking has allowed the development of network-attached storage (NAS) and storage area network (SAN) technologies, and we explain how block virtualization and SAN file systems are necessary to fully reap the benefits of these technologies. We also discuss how the recent trend in storage systems toward managing complexity, ease-of-use, and lowering the total cost of ownership has led to the development of autonomic storage. We conclude with our assessment of the current state-of-the-art by presenting a set of challenges driving research and development efforts in storage systems.


http://www.research.ibm.com/journal/sj/422/morris.html

Monday, April 14, 2008

Google Summer of Code

Google Summer of Code 2008 is on! Over the past three years, the program has brought together over 1500 students and 2000 mentors from 90 countries worldwide, all for the love of code. We look forward to welcoming more new contributors and projects this year.


** Globus: Google Summer of Code 2008 Ideas

New execution and data transfer providers

Globus project: Swift

Description: The Java CoG kit provides an abstraction for process execution and file transfer (for example, local execution, local filesystem copy, over ssh/scp, GridFTP, GRAM2, GRAM4, direct submission to the PBS scheduling system). Execution and transfer providers can then be used by higher level applications such as Swift in order to move data to execution sites and to perform application execution without needing to be particularly aware of how that execution and transfer is happening. An interesting project might be to implement a provider for some existing execution or transfer mechanisms so that they could be used as part of CoG.

Requires: Decent Java programming skills and a favourite execution or data transfer mechanism.


Mentor: Ben Clifford
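
To make the provider idea concrete, here is a minimal Python sketch of the pattern the description outlines: callers ask for a transfer by provider name and never see how the copy is carried out. This is not the Java CoG kit API. The class names, the registry, and the `transfer` function are all hypothetical, and only a local filesystem copy is implemented.

```python
import os
import shutil
import tempfile
from abc import ABC, abstractmethod

class TransferProvider(ABC):
    """Hypothetical interface; not the actual Java CoG kit API."""

    @abstractmethod
    def copy(self, source, destination):
        """Copy `source` to `destination` using this provider's mechanism."""

class LocalCopyProvider(TransferProvider):
    """Local filesystem copy, the simplest mechanism named above."""

    def copy(self, source, destination):
        shutil.copyfile(source, destination)

# A new mechanism would be added by registering another provider here.
PROVIDERS = {"local": LocalCopyProvider()}

def transfer(provider_name, source, destination):
    """Higher-level code calls this without knowing how the copy happens."""
    PROVIDERS[provider_name].copy(source, destination)

# Small demonstration inside a throwaway directory.
with tempfile.TemporaryDirectory() as workdir:
    src = os.path.join(workdir, "input.dat")
    dst = os.path.join(workdir, "output.dat")
    with open(src, "w") as f:
        f.write("payload")
    transfer("local", src, dst)
    with open(dst) as f:
        copied = f.read()

print(copied)
```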


Integration of GridFTP with Freeloader storage system

Globus project: GridFTP

Description: GridFTP is a high-performance, secure, reliable data transfer protocol optimized for high-bandwidth wide-area networks. It is based on the Internet FTP protocol, and it defines extensions for high performance operation and security. Striped data transfer (aka cluster-to-cluster data transfer) is a key feature that utilizes multiple CPUs and NICs to achieve higher performance. In striped mode, however, GridFTP assumes the support of a high-performance parallel file system, a relatively expensive resource. Freeloader is a storage system that aggregates the idle storage space from workstations connected to a local area network to build a high-performance data store. FreeLoader breaks files into chunks and distributes these chunks across the storage nodes. This accelerates read/write operations as they can benefit from the parallel access to multiple disks. This project aims to integrate GridFTP and FreeLoader to reduce the cost and increase the performance of GridFTP deployments.
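
The chunking idea can be sketched in a few lines of Python. The description does not say how FreeLoader picks a node for each chunk, so the round-robin placement below is an assumption, and the function and node names are made up for illustration.

```python
def stripe_chunks(file_size, chunk_size, nodes):
    """Assign fixed-size chunks of a file to storage nodes round-robin.

    Returns a dict mapping node name -> list of (offset, length) chunks.
    Round-robin is an assumed policy, not FreeLoader's documented one.
    """
    placement = {node: [] for node in nodes}
    offset = 0
    index = 0
    while offset < file_size:
        length = min(chunk_size, file_size - offset)  # last chunk may be short
        node = nodes[index % len(nodes)]
        placement[node].append((offset, length))
        offset += length
        index += 1
    return placement

layout = stripe_chunks(file_size=10, chunk_size=4, nodes=["node-a", "node-b"])
for node, chunks in layout.items():
    print(node, chunks)
```

Because consecutive chunks sit on different nodes, a reader can fetch them from several disks at once, which is the parallel access the description refers to.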


** Project Ideas from the Ohio Supercomputer Center



  1. Improved scalability in pbsdcp scatter implementation
    Mentor: Troy Baer
    Programming Language(s): Perl, C with MPI
    License: GPL

    pbsdcp is a distributed copy command for PBS and TORQUE batch environments that is part of OSC's pbstools package. It is used to copy files between shared directories (e.g. NFS home directories) and local storage on a set of compute nodes (e.g. /tmp). It has two modes of operation: scatter, in which files in a shared directory are copied into local file systems on each of the compute nodes; and gather, in which files in local file systems on each of the compute nodes are collected into a shared directory.

    The scatter mode in pbsdcp is currently implemented in a rather naive fashion: for each node, it forks an rcp on the source files with a destination directory on that node's local storage. This means that the amount of data which must be transferred from the shared storage scales linearly with the number of nodes. We would like to replace that implementation with something more scalable, such as a tree-based or store-and-forward distribution scheme. Moreover, we would like this to use MPI for communication between nodes if possible, so that the high-performance Infiniband and Myrinet networks in our (and similar) clusters will be used for as much of the data transfer as possible.

  2. Improved scalability in all
    Mentor: Rick Mohr
    Programming Language(s): C
    License: GPL

    all is a distributed shell command built on top of rsh used by OSC and other sites. It allows commands to be run on either all or an arbitrary subset of the nodes in a cluster, either sequentially or in parallel.

    The parallel mode of all currently has a scalability problem on clusters larger than about 200 nodes. Because all uses rsh and rsh wants to use privileged ports (i.e. port numbers below 1024), parallel executions of all run out of the necessary ports for node counts above 200 or so. One solution to this problem would be "chunking" or "batching"; that is, starting up at most a fixed number (say 128) of rsh connections and then only starting more once the first few rshes have completed. (Similar logic can be seen in OSC's parallel command processor.)

    Alternately, a project to add some of all's features, such as its relatively simple syntax and PBS/TORQUE integration, to another distributed shell command such as pdsh would also be considered.
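
The "batching" idea for all can be sketched in Python with a bounded worker pool. This is only an illustration of the technique: the project itself targets C, the remote command here is a stand-in that sleeps instead of calling rsh, and every name in the sketch is hypothetical.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

def run_on_nodes(nodes, command, max_parallel=128):
    """Run command(node) for every node, never more than max_parallel at once.

    A new task starts only when a worker finishes an earlier one.
    Results come back in the same order as `nodes`.
    """
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(command, nodes))

# Stand-in for an rsh call that also records how many run at the same time.
stats = {"active": 0, "peak": 0}
stats_lock = threading.Lock()

def fake_remote_command(node):
    with stats_lock:
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
    time.sleep(0.01)  # pretend the remote command takes a moment
    with stats_lock:
        stats["active"] -= 1
    return f"{node}: ok"

nodes = [f"node{i:03d}" for i in range(300)]
results = run_on_nodes(nodes, fake_remote_command, max_parallel=128)
print(len(results), "nodes done, peak concurrency", stats["peak"])
```

With 300 nodes and a cap of 128, the peak never exceeds the cap, so the number of privileged ports in use at any moment stays bounded no matter how large the cluster is.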



** dev:sahana_gsoc08_ideas



SystemTap

SystemTap provides free software (GPL) infrastructure to simplify the gathering of information about the running Linux system. This assists diagnosis of a performance or functional problem. SystemTap eliminates the need for the developer to go through the tedious and disruptive instrument, recompile, install, and reboot sequence that may be otherwise required to collect data.

SystemTap provides a simple command line interface and scripting language for writing instrumentation for a live running kernel. We are publishing samples, as well as enlarging the internal "tapset" script library to aid reuse and abstraction. We also plan to support probing userspace applications. We are investigating interfacing SystemTap with similar tools such as Frysk, Oprofile and LTT.

Current project members include Red Hat, IBM, Intel, and Hitachi.

Scripts & Tools
http://sourceware.org/systemtap/wiki/ScriptsTools

Sunday, April 6, 2008

DTrace Network Providers

from : http://opensolaris.org/os/community/dtrace/NetworkProvider/

The following is a design proposal for a collection of DTrace Networking Providers. These providers aim to provide networking observability and troubleshooting information for Solaris users. The first prototype TCP provider was demonstrated at CEC 2006.


# dtrace -n 'tcp:::receive /args[2]->tcp_dport == 80/ {
    @pkts[args[1]->ip_daddr] = count();
}'

dtrace: description 'tcp:::receive' matched 1 probe
^C

192.168.1.8 9
fe80::214:4fff:fe3b:76c8 12
192.168.1.51 32
10.1.70.16 83
192.168.7.3 121
192.168.101.101
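
For readers unfamiliar with D aggregations, the one-liner above behaves roughly like the following Python sketch: the predicate between the slashes filters events, and `@pkts[key] = count()` increments a per-key counter. The packet records are made-up sample data, not DTrace output.

```python
from collections import Counter

# Hypothetical received segments: (destination address, destination port).
packets = [
    ("192.168.1.8", 80),
    ("192.168.1.51", 80),
    ("192.168.1.51", 22),
    ("192.168.1.51", 80),
]

pkts = Counter()
for daddr, dport in packets:
    if dport == 80:        # the /args[2]->tcp_dport == 80/ predicate
        pkts[daddr] += 1   # @pkts[args[1]->ip_daddr] = count();

# Print smallest count first, matching the ordering of the output above.
for daddr, count in sorted(pkts.items(), key=lambda item: item[1]):
    print(daddr, count)
```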

Friday, April 4, 2008

more DTrace

Brendan Gregg's Homepage

http://brendangregg.com/


Top Ten DTrace (D) Scripts

DTrace is a comprehensive and flexible dynamic tracing facility built into the Solaris Operating System. DTrace allows dynamic instrumentation of a running Solaris system, which can assist with answering questions like "which process is chewing up CPU 38," or "which user is causing the cross-call activity on CPU 6," or "which setuid binaries are being executed?"

DTrace uses a scripting language called "D," which uses a syntax very similar to C and Awk. Several amazing D scripts have been developed and distributed through the Internet, so I thought I would share my favorite D scripts in a Letterman-like "Top 10" format:

http://prefetch.net/articles/solaris.dtracetopten.html

Observing I/O Behavior with the DTraceToolkit

http://prefetch.net/articles/observeiodtk.html


DTrace user's guide http://docs.sun.com/app/docs/doc/817-6223/

DTrace Toolkit http://www.opensolaris.org/os/community/dtrace/dtracetoolkit/


DTrace presentation [FIRST look at this]

http://www.nbl.fi/~nbl97/solaris/dtrace/dtt_present.pdf


OpenSolaris Community: DTrace


http://www.opensolaris.org/os/community/dtrace/

Endorsed projects

Chime Visualization Tool for DTrace
DTrace Provider for NFSv4
Mozilla DTrace

An Overview of DTrace

DTrace is a comprehensive dynamic tracing framework for the Solaris™ Operating Environment. DTrace provides a powerful infrastructure to permit administrators, developers, and service personnel to concisely answer arbitrary questions about the behavior of the operating system and user programs.

The Solaris™ Dynamic Tracing Guide describes how to use DTrace to observe, debug and tune system behavior. The guide also includes a complete reference for bundled DTrace observability tools and the D programming language.

For Users:

  • dynamically enable and manage thousands of probes
  • dynamically associate predicates and actions with probes
  • dynamically manage trace buffers and probe overhead
  • examine trace data from a live system or from a system crash dump

For Solaris™ Developers:

  • implement new trace data providers that plug into DTrace
  • implement trace data consumers that provide data display
  • implement tools that configure DTrace probes




No Bad Dogs - DTrace - Bryan Cantrill

No Bad Dogs: How to Make a Dog-Slow System Sit Up and Speak

http://research.sun.com/minds


--from WIKI

Bryan M. Cantrill is an engineer at Sun Microsystems. Cantrill graduated from Brown University with a B.Sc. in computer science. He was born in Colorado, where he attained the rank of Eagle Scout.

In 2005 Bryan Cantrill was named one of the 35 Top Young Innovators by Technology Review, MIT's magazine. Cantrill was included in the