Tuesday, May 20, 2008

google summer code project: Integration of GridFTP with Freeloader storage system

Title Integration of GridFTP with Freeloader storage system
Student Hesam Ghasemi
Mentor Rajkumar Kettimuthu
Abstract
Scientific experiments produce large volumes of data which require a cost-conscious data stores, as well as reliable data transfer mechanisms. GridFTP is a high-performance, secure, reliable data transfer protocol optimized for high-bandwidth wide-area networks. GridFTP, however, assumes the support of a high-performance parallel file system, a relatively expensive resource.
FreeLoader is a storage system that aggregates idle storage space from workstations connected within a local area network to build a low-cost, yet high-performance data store. FreeLoader breaks files into chunks and distributes them across storage nodes thus enabling parallel access to files. A central manager keeps track of file meta-data as well as of the location of the chunks associated with each file.
This project will integrate Globus project’s GridFTP implementation and FreeLoader to reduce the cost and increase the performance of GridFTP deployments. The integration of these two opens-source systems system will address the following main problems. First, FreeLoader storage nodes will be exposed as GridFTP data transfer processes (DTPs). Second, the assumption the current GridFTP implementation makes, namely that all data is available at all DTPs will be relaxed by integrating the GridFTP server and FreeLoader managed data location mechanisms. Finally, load balancing mechanisms will be added to the GridFTP server implementation to match FreeLoader’s ability to stripe and replicate data across multiple nodes.
The fact that at a Freeloader-supported GridFTP site files are spread over multiple FreeLoader nodes (exposed as GridFTP DTPs) implies that the integration will need to orchestrate multiple connections to execute a file transfer. The current implementation of GridFTP is not able to handle cases where the number of DTPs on the receiver and sender side does not match. For instance, for a single client accessing a GridFTP server with N DTPs, the server will need manage the connections such that a single server DTP will connect to the client at a time. Once the first DTP has finished its transfer, the second server DTP will connect to the client and execute its transfer. The same mechanism can be applied to the cases where the client server node relationship is N-to-M. This mechanism has the following advantages: the load is balanced between the DTP nodes, and can support FreeLoader data layout where file chunks are stored on each DTP.
The design and implementation requirements include: code changes to either system should be minimal to enable future integration with the mainstream GridFTP/FreeLoader code, GridFTP client code changes should be avoided, and the integration components should be implemented in C since both FreeLoader and GridFTP are implemented in C.

No comments: