<=Back to the Research Page

SciDAC Review, August 2003

Optimizing Performance and Enhancing Functionality of Distributed Application Using Logistical Networking

Lead PI: Micah Beck, University of Tennessee (UT)
PIs: Jack Dongarra (UT), James S. Plank (UT), Rich Wolski, University of California Santa Barbara

1. Introduction

Communication is the foundation for all collaboration, and accordingly the growth of the kind distributed collaboration envisioned by the SciDAC Program has closely tracked the development of high-performance networking. Sustained collaboration, however, inevitably involves communication with both synchronous and asynchronous elements. While current forms of networking address the former, the latter requires uses of storage that are not part of the current network model. Applications that need significant amounts of non-archival, "working storage" for distributed state management, latency hiding, and exposed buffer management, are not well served by traditional models of file and database storage services. By providing best-effort control over substantial shared, untrusted state management resources (RAM and disk), at network intermediate nodes and at shared edge locations, our project on Logistical Networking (LoN) addresses these needs by implementing communication services that IP networking alone have not been able to provide.

As the name suggests, the goal of Logistical Networking is to bring data transmission and storagewithin one framework, much as military or industrial logistics treat transportation lines and storage depots as coordinate elements of one infrastructure. Achieving this goal requires a form of network storage that is globally scalable, meaning that the storage systems attached to the global data transmission network can expand beyond organizational and geographical boundaries and can grow large enough to be global in scope.

The approach to this problem we developed for SciDAC is based upon an exposed approach to computer service architecture, in which one offers client software a primitive service the semantics of which are closely based on the underlying physical infrastructure. Higher-level tools and protocols with more abstract semantics running on clients compose these primitive services to implement more sophisticated run-time algorithms for applications. In this way the infrastructure of LoN follows the exposed approach embodied in IP datagram service. LoN uses scalable storage resources in the network to model communication in both its synchronous and asynchronous aspects; its fundamental service is the movement of data between buffers on adjacent nodes.

The Internet Backplane Protocol (IBP) supplies LoN with its basic mechanism. It implements scalable management of storage in the network using shared physical resources. IBP is implemented by intermediate nodes, called "depots," that offer standard storage operations, including allocate, load, store and third party copy. From a networking perspective, an IBP depot can be viewed as a kind of router that exposes buffers to clients, thereby enabling the implementation of flexible network services, such as overlay multicast with explicit specification of the delivery tree by the source.
Figure 1. Network Storage Stack

To create a shared resource fabric that exposes network storage for general use in this way, we defined a new storage stack (Figure 1), analogous to the Internet stack, using a bottom-up and layered design approach that adheres to the end-to-end principles. IBP is the lowest layer of the storage stack that is globally accessible from the network. Its design is modeled on the design of IP datagram delivery. Just as IP is a more abstract service based on link-layer datagram delivery, so IBP is a more abstract service based on blocks of data (on disk, memory, tape or other media) that are managed as "byte arrays." By masking the details of the storage at the local level - fixed block size, differing failure modes, local addressing schemes - this byte array abstraction allows a uniform IBP model to be applied to storage resources generally. The use of IP networking to access IBP storage resources creates a globally accessible storage service.

As the case of IP shows, however, in order to scale globally the service guarantees that IBP offers must be weakened, i.e. it must present a "best effort" storage service. First and foremost, this means that IBP storage allocations can be time limited. When the lease on an IBP allocation expires, the storage resource can be reused and all data structures associated with it can be deleted. Additionally an IBP allocation can be refused by a storage resource in response to over-allocation, much as routers can drop packets, and such "admission decisions" can be based on both size and duration. Enforcing time limits puts transience into storage allocation, giving it some of the fluidity of datagram forwarding; more importantly, it makes network storage far more sharable, and creates an infrastructure that can scale.

The semantics of IBP storage allocation also assume that an IBP storage resource can be transiently unavailable. Since the user of remote storage resources depends on so many uncontrolled, remote variables, it may be necessary to assume that storage can be permanently lost. Thus, IBP is a "best effort" storage service. To encourage the sharing of idle resources, IBP even supports "soft" storage allocation semantics, where allocated storage can be revoked at any time. In all cases such weak semantics mean that the level of service must be characterized statistically.

Like the higher levels of the traditional network stack, the higher levels of the network storage stack - exNode, Logistical Runtime System (LoRS), Logistical Backbone (L-bone, and the Logistical Files System - implement abstractions and services necessary to aggregate and utilize the generic best effort storage service provided by IBP. A discussion of each of these layers is integrated into the account of our work on Logistical Networking for the SciDAC research community presented below.

2. Goals and Objectives of the Project

To enhance the effectiveness of the collaboration among SciDAC's distributed teams and facilitate their access to data resources, our work on Logistical Networking has pursued two closely related goals: We have explored the architectural implications of LoN on the implementation of network services; and investigated its impact on SciDAC and other science applications through the development and deployment of pre-production tools. While the architectural implications can be explored through experimentation in the lab and on other testbed facilities, assessing their impact on science applications requires active cooperation with projects that normally utilize production IT facilities.

Our plan to achieve these goals has focused on 3 different sets of objectives:

  • Design and development of the IBP depot - Since IBP provides the foundation for all LoN efforts, the rapid deployment of a stable depot was essential. Therefore, our initial objectives included the implementation of a robust IBP depot with extensible data movement module architecture, experimentation on a prototype depot testbed to determine performance and reliability characteristics, and the extension of data movement capabilities of IBP depot to include highest-performance protocols currently available. After satisfactorily achieving these objectives in the first two years, we have pursued further reliability enhancements, performance testing and optimization.

  • Design and development of LoN middleware - IBP is limited as a storage mechanism in much the same way that IP is limited as a networking mechanism; weak semantics mean that the levels of service required by applications must be implemented by additional layers of middleware. These middleware services fall into two classes: infrastructure services and end-to-end services.

  • Application Impact - In order to evaluate the impact of our work on real applications and to direct our work toward meeting their needs, we have sought to engage with science application projects, and in particular SciDAC projects. The worldwide deployment of a large scale IBP storage resources has been an essential part of this effort.

3. Technical Approach & Accomplishments

As with our objectives, we present our technical approaches and accomplishments in three sections, focusing on depot design and development, necessary middleware, and application and community impact.

3.1 IBP depot

The weakening of storage semantics means that traditional protocols and APIs are not well adapted to implement data logistics. Rather than attempting to express our architectural ideas as variants of conventional tools (e.g. FTP, HTTP), resulting in versions which are far enough from existing standards that existing client code cannot be used, we have chosen to define our own simple storage protocol and our own server, called the depot. In order to have an implementation that satisfies these requirements and yet is ready to deploy quickly for applications and middleware developers, we emphasized full functionality and software stability in our initial release, working on performance optimizations as the project moved forward.

3.1.1 IBP as RPC over TCP

In order to enable rapid, convenient depot development we decided to make each IBP operation through a separate remote procedure call, which accrues TCP connection overhead on each call. Having accomplished this goal, we are now working on optimizations, starting with persistent sockets, which amortize the TCP overhead. Version 1.3 of the depot software offers high reliability, dynamic thread control, and full IPv6 compatibility.

3.1.2 Data Movers Plug-in Modules

The IBP protocol and depot software architecture allows for extensibility through plug-in modules, called Data Movers, which implement new data transmission protocols as external helper processes. This plug-in architecture has allowed us to incorporate new protocols very quickly without compromising depot stability. An important new component of IBP v1.3 is the high performance SABUL (Simple Available Bandwidth Utilization Library) Data Mover, which utilizes a reliable UDP transfer stream along with a TCP flow control channel to provide very high throughput over high performance long-haul networks. SABUL was developed by Robert Grossman's group at the University of Chicago and the National Center for Data Mining; this group collaborated with us in the development of the SABUL Data Mover. We are currently implementing a more integrated and optimized plug-in architecture that allows Data Movers to run as threads within the depot.

3.1.3 Single resource vs. multiresource depots

The typical depot platform consists of multiple storage/memory resources (e.g. RAM and disk, or multiple disk volumes) that, for technical or policy reasons, are managed separately. In order to keep the initial depot implementation simple, a separate depot process was used for each such resource pool. But this approach does not allow for optimizations between resources on a single system. We have released a beta version of an optimized multiresource depot which supports more fine-grained buffer management without paying the cost of interprocess communication, in particular on transfers between RAM and disk.

3.1.4 Encapsulated Data Movers

In order to take advantage of native point-to-multipoint transfer protocols, such as layer 2 broadcast or IP multicast, the IBP protocol and the Data Mover module interface allow Data Movers to implement point-to-multipoint operations. In early Data Movers we took advantage of this generality in order to implement reliable point-to-multipoint operations that are the composition of 1. local disk-to-RAM data movement, and 2. multiple reliable point-to-point network transfers from that RAM buffer to each destination. Encapsulating this complexity inside a single Data Mover module allowed us to avoid the overheads imposed by single resource Data Movers, as described above, but suffers from a lack of adaptability. We are currently experimenting with an exposed approach to dynamic buffering in reliable multicast that takes advantage of our optimized multiresource RAM/disk depot.

3.1.5 Persistent sockets for optimization, security and pipelining

Although the simple RPC-over-TCP binding of the IBP protocol provides stability and robustness, it rules out many optimizations that can be made when multiple requests use a single TCP connection, in the manner of HTTP 1.1. These optimizations include simple amortization of TCP set-up and slow-start over multiple operations; the ability to apply security protocols, such as SSL, which impose heavyweight initial connection costs; and, perhaps most powerfully, the use of operation pipelining to mask the unavoidable latency between client and server in the wide area network. We have defined a variant of the IPB protocol and implemented a corresponding alpha version of the depot that uses persistent. This extension also supports the use of X.509 certificates for authentication and authorization, which allows IBP to conform to Grid security infrastructure standards as required by NERSC and other DOE sites. Extension of the IBP protocol to support pipelining is in process.

3.1.6 Porting depot and client code; IPv6

To meet user requirements, we have ported the client code to a wide variety of versions of Unix and Linux. The IBP client has also been ported to Java and Win32. An early version of the depot was ported to Windows and the process of porting the current depot in under way. The client library and depot have been ported to IPv6 and have been deployed within the 6net project. The client library is currently being ported to operate within the Linux kernel, allowing the implementation of a character device driver that accesses storage on IBP depots. The /dev/ibp device allows applications with no knowledge of IBP to take advantage of Logistical Networking by simply opening the device file and reading or writing to it.

3.2 Middleware Services

The heart of the client middleware that adapts the weak semantics of the IBP protocol to the strong requirements of user applications is the exNode. The exNode stores references to multiple IBP capabilities in a manner analogous to the way that a Unix inode stores references to disk blocks. It also stores the metadata required for the implementation of additional end-to-end services, including encryption, compression, error coding, check sums, etc. In order to enable portability across architectures and network locations, we have defined an XML serialization for the exNode, which is used by the higher-level LoRS tools.

3.2.1 Logistical Runtime System (LoRS)

The Logistical Runtime System (LoRS) is a substantial client-side middleware component that builds on the weak semantics of IBP to implement the strong service guarantees required by users and applications. The core functionality of LoRS accesses the L-Bone (see 3.2.2) for resource discovery, makes allocations on IBP depots, transfers data to and from local files and applications to IBP depots, and manages the transfer of data between IBP depots. As experimental services, such as overlay multicast, are implemented for research purposes, they are also incorporated into new releases of the LoRS tools. Version 0.81 has improved end-to-end data compression, encryption, and checksums. An upcoming release will feature fault-tolerant Reed-Solomon encoding of stored data. The LoRS Tools package presents LoRS functionality through command-line tools and GUIs. v0.81 includes access to new Data Mover modules, and a new version of LoRS View visualization software with an improved graphical user interface. While the extent of LoRS is quite broad, and much of our user support is focused on it, space limitations prevent us from the kind of detailed consideration of its structure and features that it merits.

3.2.2 Logistical Backbone (L-Bone) software development with NWS integration

For rapid development with scalability, we decided to take a centralized, replicated directory approach to resource discovery based on OpenLDAP. We developed an application specific API and query language in order to adapt and generalize a directory mechanism to the specific semantics of depot discovery. Depots are registered and static metadata is entered manually. Regular polling then monitors the state of the depots, including storage availability and network accessibility. The Network Weather Service can be used to incorporate network proximity criteria into L-Bone queries. Depot status and aggregate L-Bone status is publicly available through a Web interface. These tools currently provide access to a global depot infrastructure, which we describe in more detail below. (Sec. 3.3.1)

3.2.3 Latency hiding through aggressive prestaging for remote visualization

Like other SciDAC projects, the TSI group intends to make future use of remote visualization technology in their distributed collaboration. We are developing a LoRS toolset to facilitate that activity. In the browsing of remote image databases, latency is chronic, limiting factor for applications. In cases where continuous maintenance of a local copy of the entire database is not practical, aggressive prestaging based on prediction of browsing locality can provide near local performance. However, aggressive prestaging at the client can overwhelm client resources and does not make use of shared infrastructure, other than the network itself. We have successfully demonstrated the use of the IBP depot as a shared resource in the local area network on which such aggressive prestaging strategies can be implemented. The result is near local latency with no collateral impact on the client [1]

3.2.4 File services on Logistical Networking infrastructure

We have developed two different approaches to providing limited file services on LoN infrastructure. The Read-Only Logistical File System (ROLFS) provides users with a directory services that automatically renews allocations on depots, keeping the capabilities that comprise the exNode from expiring. The Logistical File System (LFS) is more advanced. It enables users to interact with LoN technologies using a familiar, recognizable file system interface. We are also extending LFS to include single-writer consistency, automatic replication generation, and automatic replica scheduling. This new system will use a dynamic scheduler that forecasts future performance and availability levels for IBP depots to determine the degree of replication and placement of replicas necessary to insure a specified availability and performance goal.

3.2.5 Exposed multicast and routing

The inclusion of both point-to-point and point-to-multipoint operations in the IBP protocol enable it to support an exposed form of overlay multicast that can conform to various middleware- and application-specific reliability and performance criteria. We have demonstrated the feasibility and efficacy of this approach in wide area testbed experiments [2]. Ongoing experimentation is testing the hypothesis that the exposed approach can achieve levels of combined performance and reliability not previously demonstrated by encapsulated IP-based methods. One of the most difficult parts of the problem is the discovery of dynamic topology in an overlay networking environment. For this reason we are investigating the deployment of depots conforming to the underlying layer 2 topology and the use of multicast routing protocols similar to those used for native IP multicast.

3.2.6 Integration of Network Weather Service

The typical approach to maximizing both throughput and connection reliability using TCP is to use parallel and redundant communication streams. Unfortunately, when redundant data is moved as a precautionary measure and then discarded, this approach can result in wasted bandwidth. We have developed a novel approach to managing replicated communication streams that relies on the on-line performance forecasts generated by the Network Weather Service (NWS). By dynamically ranking the connectivity between data source and sink, and adaptively discovering appropriate time out values, our approach achieves higher performance than the redundant streams approach, with the same robustness characteristics, using a small fraction of the bandwidth [3].

3.3 Application and User Support

3.3.1 Global deployment and support for early adopters

The application driven development strategy we have followed depends on making depot resources, close to end users in terms of network topology, available. In addition to the 15 TB public testbed (19 countries, 200+ depots) we have already deployed depots (1.65 TB per depot) for each of the five major TSI collaboration sites (ORNL, NERSC, UCSD, NCSU, SUNY Stony Brook). All of the depots on the L-Bone are available for TSI and other SciDAC users. Clusters of public depots, thirty-one in California and eight in North Carolina, serve as backup transfer paths for TSI data transfers. To facilitate the porting of SciDAC middleware and application software to use Logistical Networking, we have also provided dedicated consulting services and help-line support to the TSI and broader SciDAC community.

3.3.2 Integration with NetSolve/GridSolve

Support for distributed computing that facilitates the research efforts of collaboratories and other advanced applications is an important part of the SciDAC vision. Our experiments with the integration of Logistical Networking technology into NetSolve middleware developed by PI Dongarra's group are designed to support that vision. NetSolve allows remote users, working with familiar interfaces such as Matlab and Mathematica, to access distributed hardware and software resources in order to perform complex calculations. Integration with Logistical Networking enables users to store (and replicate) data objects in IBP depots near the locations of NetSolve servers, and then point NetSolve servers to these depots to find data to use in computations. The proximity of the IBP depots to the NetSolve servers, as well as the existence of multiple replicas that can supply the needed data, improves the performance of NetSolve, especially across the wide area. The user can thus run computations on remote data and retrieve only the pertinent portion of the output at the client. A prototype version showing the power of this enhanced functionality was exhibited at SC2002.

3.3.3 LAPACK for Clusters (LFC)

PI Dongarra is exploring the use of IBP RAM depots in middleware that couples cluster system information with the specifics of a user problem to launch cluster based applications on the best set of resources available [4].

3.3.4 GridSAT

The Logistical File System is being tested using GridSAT, a Grid enabled Satisfiability solver used in circuit design and verification. PI Wolski has developed a prototype GridSAT using LFS, IBP and the LBONE and an "always-available" checkpoint storage service which will be demonstrated at SC 2003. GridSAT is compatible with NMI and the NPACKage software infrastructures.

3.3.5 Demo, presentations and publication; tutorial materials

"An End-to-End Approach to Globally Scalable Network Storage," presented at SIGCOMM 2002, explains the unique architectural vision behind Logistical Networking [5]. We gave several major public demonstrations of the performance and functionality of Logistical Networking technology. At the international iGrid2002 conference, multiple standard TCP streams were used to accomplish data transfers of 100 Mbps from the US to Amsterdam. At SC2002, a distributed computing application running on NetSolve servers around the world used Logistical Networking technology to seamlessly access blocks from distributed data replicas for enhanced application performance.
A comprehensive tutorial on the LoRS toolset has also been developed. This tutorial will enable SciDAC and other DOE users to make immediate and productive use of global L-Bone resources.

4. Application and Community Impact

4.1 Terascale Supernova Initiative (TSI)

The TSI project has adopted Logistical Networking as a key component of their data management strategy and our project plans have been driven by TSI requirements. TSI uses the LoRS tools to share both large data sets and computational results between the major collaboration sites, viz. ORNL, NERSC, NCSU, SUNYSB, and UCSD. Previously, using traditional FTP tools, TSI collaborators tolerated transfer rates of 8 Mbps, at best. Using the LoRS tools, they now routinely achieve transfer data rates at speeds above 220 Mbps between key research sites. It is a simple, scalable solution to a crucial TSI problem and need.

We are deploying five new IBP depots to the major TSI research sites. The depots at ORNL, NCSU and SUNYSB are already in use. These dedicated depots will provide an additional 9.5 TB of IBP storage space for use by TSI and the entire SciDAC research community.

Ongoing work with the TSI team will provide further integration with the TSI supercomputer environment, their HPSS storage environment and tools, the commercial visualization tools they, and in general their production work flow. The TSI team are comparing IBP functionality with Arie Shoshani's Hierarchical Resource Manager (HRM) middleware and evaluating the benefits of integrating IBP storage resources with HRM, which is oriented towards archival storage. Of particular interest would be HRM's ability to take advantage of LoN overlay routing and point-to-multipoint transfer capabilities.

4.2 Global research participation in Logistical Networking

There has been substantial participation from Internet2 member institutions, many of which have collaborations with DOE research labs, in the deployment of IBP. There have been regular Logistical Networking sessions and meetings of the Network Storage Working Group, which focuses one Logistical Networking, at Internet2 member meetings and Joint Techs workshops, which are held jointly with the ESnet Coordinating Committee. An upcoming meeting of the Asia Pacific Advanced Network (APAN) consortium will include a session dedicated to Logistical Networking. Against this background, LoN tools and the L-Bone are being used by a variety of researchers and research groups spread around the world, including the following: Singapore: National Technical University (Assoc. Prof. Lee Bu Sung), Korea: Korea Advanced Institute of Science and Technology (Prof. Kilnam Chon), Netherlands: Surfnet (Maarten Koopmans), Czech Republic: Masaryk University (Assoc. Prof. & Dean Ludek Matyska), France: INRIA (Permanent Researcher Laurent Lefevre), Italy: Universita' del Piemonte Orientale (Prof. Cosimo Anglano), and Canada, Yotta Yotta (Dr. Geoff Hayward).

5. Plans and Futures

Although more specific near-term plans are included in our discussion above, the items below represent additional future directions that we are also developing:

  • Reaching other SciDAC applicatoins and DOE projects
  • Extensible depot services, e.g. error encoding, reliable multicast, reduction operations, bit rate adaption [6].
  • Experimentation with layer 2 and optical connectivity between depots
  • Expanded integration with TSI tools and workflow
  • Logistical toolkit for remote and distributed visualization
  • Address needs of Data Grids
  • Adoption of high performance networking best practices on depots
  • Address IETF and GGF for standardization of IBP; commercial adoption
  • Peering between distinct logistical networks
  • 100TB deployed throughout DOE labs and partner institutions
  • Client integration with all common DOE platforms

6. References

[1] J. Ding, J. Huang, M. Beck, S. Liu, T. Moore, and S. Soltesz, "Remote Visualization by Browsing Image Based Databases with Logistical Networking," in Proceedings of SC2003. Pheonix, AZ, 2003 (to appear).

[2] M. Beck, Y. Ding, E. Fuentes, and S. Kancherla, "An Exposed Approach to Reliable Multicast in Heterogeneous Logistical Networks," presented at Workshop on Grids and Advanced Networks (GAN03), Tokyo, Japan, May 12-15, 2003.

[3] R. Wolski and M. Allen, "The Livny and Plank-Beck Problems: Studies in Data Movement on the Computational Grid," in Proceedings of SC2003. Pheonix, AZ, 2003 (to appear).

[4] K. Roche and J. Dongarra, "Deploying Parallel Numerical Library Routines to Cluster Computing in a Self Adapting Fashion," Parallel Computing 2002 (submitted).

[5] M. Beck, T. Moore, and J. S. Plank, "An End-to-end Approach to Globally Scalable Network Storage," in Proceedings of ACM Sigcomm 2002. Pittsburgh, PA: Association for Computing Machinery, 2002.

[6] M. Beck, T. Moore, and J. S. Plank, "An End-to-End Approach to Globally Scalable Programmable Networking," presented at Future Directions in Network Architecture (FDNA-03), an ACM SIGCOMM 2003 Workshop, Karlsruhe, Germany, August 25-29, 2003 (to appear).



<=Back to the Research Page