<=Back
to the Research Page
SciDAC Review, August 2003
Optimizing Performance and Enhancing Functionality of Distributed Application
Using Logistical Networking
Lead PI: Micah Beck, University of Tennessee (UT)
PIs: Jack Dongarra (UT), James S. Plank (UT), Rich Wolski, University
of California Santa Barbara
1. Introduction
Communication is the foundation for all collaboration, and accordingly
the growth of the kind distributed collaboration envisioned by the SciDAC
Program has closely tracked the development of high-performance networking.
Sustained collaboration, however, inevitably involves communication with
both synchronous and asynchronous elements. While current forms of networking
address the former, the latter requires uses of storage that are not part
of the current network model. Applications that need significant amounts
of non-archival, "working storage" for distributed state management,
latency hiding, and exposed buffer management, are not well served by
traditional models of file and database storage services. By providing
best-effort control over substantial shared, untrusted state management
resources (RAM and disk), at network intermediate nodes and at shared
edge locations, our project on Logistical Networking (LoN) addresses these
needs by implementing communication services that IP networking alone
have not been able to provide.
As the name suggests, the goal of Logistical Networking is to bring data
transmission and storagewithin one framework, much as military or industrial
logistics treat transportation lines and storage depots as coordinate
elements of one infrastructure. Achieving this goal requires a form of
network storage that is globally scalable, meaning that the storage systems
attached to the global data transmission network can expand beyond organizational
and geographical boundaries and can grow large enough to be global in
scope.
The approach to this problem we developed for SciDAC is based upon an
exposed approach to computer service architecture, in which one offers
client software a primitive service the semantics of which are closely
based on the underlying physical infrastructure. Higher-level tools and
protocols with more abstract semantics running on clients compose these
primitive services to implement more sophisticated run-time algorithms
for applications. In this way the infrastructure of LoN follows the exposed
approach embodied in IP datagram service. LoN uses scalable storage resources
in the network to model communication in both its synchronous and asynchronous
aspects; its fundamental service is the movement of data between buffers
on adjacent nodes.
The Internet Backplane Protocol (IBP) supplies LoN with its basic
mechanism. It implements scalable management of storage in the network
using shared physical resources. IBP is implemented by intermediate nodes,
called "depots," that offer standard storage operations, including
allocate, load, store and third party copy. From a networking perspective,
an IBP depot can be viewed as a kind of router that exposes buffers to
clients, thereby enabling the implementation of flexible network services,
such as overlay multicast with explicit specification of the delivery
tree by the source.
Figure 1. Network Storage Stack
To create a shared resource fabric that exposes network storage for general
use in this way, we defined a new storage stack (Figure 1), analogous
to the Internet stack, using a bottom-up and layered design approach that
adheres to the end-to-end principles. IBP is the lowest layer of the storage
stack that is globally accessible from the network. Its design is modeled
on the design of IP datagram delivery. Just as IP is a more abstract service
based on link-layer datagram delivery, so IBP is a more abstract service
based on blocks of data (on disk, memory, tape or other media) that are
managed as "byte arrays." By masking the details of the storage
at the local level - fixed block size, differing failure modes, local
addressing schemes - this byte array abstraction allows a uniform IBP
model to be applied to storage resources generally. The use of IP networking
to access IBP storage resources creates a globally accessible storage
service.
As the case of IP shows, however, in order to scale globally the service
guarantees that IBP offers must be weakened, i.e. it must present a "best
effort" storage service. First and foremost, this means that IBP
storage allocations can be time limited. When the lease on an IBP allocation
expires, the storage resource can be reused and all data structures associated
with it can be deleted. Additionally an IBP allocation can be refused
by a storage resource in response to over-allocation, much as routers
can drop packets, and such "admission decisions" can be based
on both size and duration. Enforcing time limits puts transience into
storage allocation, giving it some of the fluidity of datagram forwarding;
more importantly, it makes network storage far more sharable, and creates
an infrastructure that can scale.
The semantics of IBP storage allocation also assume that an IBP storage
resource can be transiently unavailable. Since the user of remote storage
resources depends on so many uncontrolled, remote variables, it may be
necessary to assume that storage can be permanently lost. Thus, IBP is
a "best effort" storage service. To encourage the sharing of
idle resources, IBP even supports "soft" storage allocation
semantics, where allocated storage can be revoked at any time. In all
cases such weak semantics mean that the level of service must be characterized
statistically.
Like the higher levels of the traditional network stack, the higher levels
of the network storage stack - exNode, Logistical Runtime System (LoRS),
Logistical Backbone (L-bone, and the Logistical Files System - implement
abstractions and services necessary to aggregate and utilize the generic
best effort storage service provided by IBP. A discussion of each of these
layers is integrated into the account of our work on Logistical Networking
for the SciDAC research community presented below.
2. Goals and Objectives of the Project
To enhance the effectiveness of the collaboration among SciDAC's distributed
teams and facilitate their access to data resources, our work on Logistical
Networking has pursued two closely related goals: We have explored the
architectural implications of LoN on the implementation of network services;
and investigated its impact on SciDAC and other science applications through
the development and deployment of pre-production tools. While the architectural
implications can be explored through experimentation in the lab and on
other testbed facilities, assessing their impact on science applications
requires active cooperation with projects that normally utilize production
IT facilities.
Our plan to achieve these goals has focused on 3 different sets of objectives:
- Design and development of the IBP depot - Since IBP provides
the foundation for all LoN efforts, the rapid deployment of a stable
depot was essential. Therefore, our initial objectives included the
implementation of a robust IBP depot with extensible data movement module
architecture, experimentation on a prototype depot testbed to determine
performance and reliability characteristics, and the extension of data
movement capabilities of IBP depot to include highest-performance protocols
currently available. After satisfactorily achieving these objectives
in the first two years, we have pursued further reliability enhancements,
performance testing and optimization.
- Design and development of LoN middleware - IBP is limited
as a storage mechanism in much the same way that IP is limited as a
networking mechanism; weak semantics mean that the levels of service
required by applications must be implemented by additional layers of
middleware. These middleware services fall into two classes: infrastructure
services and end-to-end services.
- Application Impact - In order to evaluate the impact of our
work on real applications and to direct our work toward meeting their
needs, we have sought to engage with science application projects, and
in particular SciDAC projects. The worldwide deployment of a large scale
IBP storage resources has been an essential part of this effort.
3. Technical Approach & Accomplishments
As with our objectives, we present our technical approaches and accomplishments
in three sections, focusing on depot design and development, necessary
middleware, and application and community impact.
3.1 IBP depot
The weakening of storage semantics means that traditional protocols and
APIs are not well adapted to implement data logistics. Rather than attempting
to express our architectural ideas as variants of conventional tools (e.g.
FTP, HTTP), resulting in versions which are far enough from existing standards
that existing client code cannot be used, we have chosen to define our
own simple storage protocol and our own server, called the depot. In order
to have an implementation that satisfies these requirements and yet is
ready to deploy quickly for applications and middleware developers, we
emphasized full functionality and software stability in our initial release,
working on performance optimizations as the project moved forward.
3.1.1 IBP as RPC over TCP
In order to enable rapid, convenient depot development we decided to
make each IBP operation through a separate remote procedure call, which
accrues TCP connection overhead on each call. Having accomplished this
goal, we are now working on optimizations, starting with persistent sockets,
which amortize the TCP overhead. Version 1.3 of the depot software offers
high reliability, dynamic thread control, and full IPv6 compatibility.
3.1.2 Data Movers Plug-in Modules
The IBP protocol and depot software architecture allows for extensibility
through plug-in modules, called Data Movers, which implement new data
transmission protocols as external helper processes. This plug-in architecture
has allowed us to incorporate new protocols very quickly without compromising
depot stability. An important new component of IBP v1.3 is the high performance
SABUL (Simple Available Bandwidth Utilization Library) Data Mover, which
utilizes a reliable UDP transfer stream along with a TCP flow control
channel to provide very high throughput over high performance long-haul
networks. SABUL was developed by Robert Grossman's group at the University
of Chicago and the National Center for Data Mining; this group collaborated
with us in the development of the SABUL Data Mover. We are currently implementing
a more integrated and optimized plug-in architecture that allows Data
Movers to run as threads within the depot.
3.1.3 Single resource vs. multiresource depots
The typical depot platform consists of multiple storage/memory resources
(e.g. RAM and disk, or multiple disk volumes) that, for technical or policy
reasons, are managed separately. In order to keep the initial depot implementation
simple, a separate depot process was used for each such resource pool.
But this approach does not allow for optimizations between resources on
a single system. We have released a beta version of an optimized multiresource
depot which supports more fine-grained buffer management without paying
the cost of interprocess communication, in particular on transfers between
RAM and disk.
3.1.4 Encapsulated Data Movers
In order to take advantage of native point-to-multipoint transfer protocols,
such as layer 2 broadcast or IP multicast, the IBP protocol and the Data
Mover module interface allow Data Movers to implement point-to-multipoint
operations. In early Data Movers we took advantage of this generality
in order to implement reliable point-to-multipoint operations that are
the composition of 1. local disk-to-RAM data movement, and 2. multiple
reliable point-to-point network transfers from that RAM buffer to each
destination. Encapsulating this complexity inside a single Data Mover
module allowed us to avoid the overheads imposed by single resource Data
Movers, as described above, but suffers from a lack of adaptability. We
are currently experimenting with an exposed approach to dynamic buffering
in reliable multicast that takes advantage of our optimized multiresource
RAM/disk depot.
3.1.5 Persistent sockets for optimization, security and pipelining
Although the simple RPC-over-TCP binding of the IBP protocol provides
stability and robustness, it rules out many optimizations that can be
made when multiple requests use a single TCP connection, in the manner
of HTTP 1.1. These optimizations include simple amortization of TCP set-up
and slow-start over multiple operations; the ability to apply security
protocols, such as SSL, which impose heavyweight initial connection costs;
and, perhaps most powerfully, the use of operation pipelining to mask
the unavoidable latency between client and server in the wide area network.
We have defined a variant of the IPB protocol and implemented a corresponding
alpha version of the depot that uses persistent. This extension also supports
the use of X.509 certificates for authentication and authorization, which
allows IBP to conform to Grid security infrastructure standards as required
by NERSC and other DOE sites. Extension of the IBP protocol to support
pipelining is in process.
3.1.6 Porting depot and client code; IPv6
To meet user requirements, we have ported the client code to a wide variety
of versions of Unix and Linux. The IBP client has also been ported to
Java and Win32. An early version of the depot was ported to Windows and
the process of porting the current depot in under way. The client library
and depot have been ported to IPv6 and have been deployed within the 6net
project. The client library is currently being ported to operate within
the Linux kernel, allowing the implementation of a character device driver
that accesses storage on IBP depots. The /dev/ibp device allows applications
with no knowledge of IBP to take advantage of Logistical Networking by
simply opening the device file and reading or writing to it.
3.2 Middleware Services
The heart of the client middleware that adapts the weak semantics of
the IBP protocol to the strong requirements of user applications is the
exNode. The exNode stores references to multiple IBP capabilities in a
manner analogous to the way that a Unix inode stores references to disk
blocks. It also stores the metadata required for the implementation of
additional end-to-end services, including encryption, compression, error
coding, check sums, etc. In order to enable portability across architectures
and network locations, we have defined an XML serialization for the exNode,
which is used by the higher-level LoRS tools.
3.2.1 Logistical Runtime System (LoRS)
The Logistical Runtime System (LoRS) is a substantial client-side middleware
component that builds on the weak semantics of IBP to implement the strong
service guarantees required by users and applications. The core functionality
of LoRS accesses the L-Bone (see 3.2.2) for resource discovery, makes
allocations on IBP depots, transfers data to and from local files and
applications to IBP depots, and manages the transfer of data between IBP
depots. As experimental services, such as overlay multicast, are implemented
for research purposes, they are also incorporated into new releases of
the LoRS tools. Version 0.81 has improved end-to-end data compression,
encryption, and checksums. An upcoming release will feature fault-tolerant
Reed-Solomon encoding of stored data. The LoRS Tools package presents
LoRS functionality through command-line tools and GUIs. v0.81 includes
access to new Data Mover modules, and a new version of LoRS View visualization
software with an improved graphical user interface. While the extent of
LoRS is quite broad, and much of our user support is focused on it, space
limitations prevent us from the kind of detailed consideration of its
structure and features that it merits.
3.2.2 Logistical Backbone (L-Bone) software development with NWS integration
For rapid development with scalability, we decided to take a centralized,
replicated directory approach to resource discovery based on OpenLDAP.
We developed an application specific API and query language in order to
adapt and generalize a directory mechanism to the specific semantics of
depot discovery. Depots are registered and static metadata is entered
manually. Regular polling then monitors the state of the depots, including
storage availability and network accessibility. The Network Weather Service
can be used to incorporate network proximity criteria into L-Bone queries.
Depot status and aggregate L-Bone status is publicly available through
a Web interface. These tools currently provide access to a global depot
infrastructure, which we describe in more detail below. (Sec. 3.3.1)
3.2.3 Latency hiding through aggressive prestaging for remote visualization
Like other SciDAC projects, the TSI group intends to make future use
of remote visualization technology in their distributed collaboration.
We are developing a LoRS toolset to facilitate that activity. In the browsing
of remote image databases, latency is chronic, limiting factor for applications.
In cases where continuous maintenance of a local copy of the entire database
is not practical, aggressive prestaging based on prediction of browsing
locality can provide near local performance. However, aggressive prestaging
at the client can overwhelm client resources and does not make use of
shared infrastructure, other than the network itself. We have successfully
demonstrated the use of the IBP depot as a shared resource in the local
area network on which such aggressive prestaging strategies can be implemented.
The result is near local latency with no collateral impact on the client
[1]
3.2.4 File services on Logistical Networking infrastructure
We have developed two different approaches to providing limited file
services on LoN infrastructure. The Read-Only Logistical File System (ROLFS)
provides users with a directory services that automatically renews allocations
on depots, keeping the capabilities that comprise the exNode from expiring.
The Logistical File System (LFS) is more advanced. It enables users to
interact with LoN technologies using a familiar, recognizable file system
interface. We are also extending LFS to include single-writer consistency,
automatic replication generation, and automatic replica scheduling. This
new system will use a dynamic scheduler that forecasts future performance
and availability levels for IBP depots to determine the degree of replication
and placement of replicas necessary to insure a specified availability
and performance goal.
3.2.5 Exposed multicast and routing
The inclusion of both point-to-point and point-to-multipoint operations
in the IBP protocol enable it to support an exposed form of overlay multicast
that can conform to various middleware- and application-specific reliability
and performance criteria. We have demonstrated the feasibility and efficacy
of this approach in wide area testbed experiments [2]. Ongoing experimentation
is testing the hypothesis that the exposed approach can achieve levels
of combined performance and reliability not previously demonstrated by
encapsulated IP-based methods. One of the most difficult parts of the
problem is the discovery of dynamic topology in an overlay networking
environment. For this reason we are investigating the deployment of depots
conforming to the underlying layer 2 topology and the use of multicast
routing protocols similar to those used for native IP multicast.
3.2.6 Integration of Network Weather Service
The typical approach to maximizing both throughput and connection reliability
using TCP is to use parallel and redundant communication streams. Unfortunately,
when redundant data is moved as a precautionary measure and then discarded,
this approach can result in wasted bandwidth. We have developed a novel
approach to managing replicated communication streams that relies on the
on-line performance forecasts generated by the Network Weather Service
(NWS). By dynamically ranking the connectivity between data source and
sink, and adaptively discovering appropriate time out values, our approach
achieves higher performance than the redundant streams approach, with
the same robustness characteristics, using a small fraction of the bandwidth
[3].
3.3 Application and User Support
3.3.1 Global deployment and support for early adopters
The application driven development strategy we have followed depends
on making depot resources, close to end users in terms of network topology,
available. In addition to the 15 TB public testbed (19 countries, 200+
depots) we have already deployed depots (1.65 TB per depot) for each of
the five major TSI collaboration sites (ORNL, NERSC, UCSD, NCSU, SUNY
Stony Brook). All of the depots on the L-Bone are available for TSI and
other SciDAC users. Clusters of public depots, thirty-one in California
and eight in North Carolina, serve as backup transfer paths for TSI data
transfers. To facilitate the porting of SciDAC middleware and application
software to use Logistical Networking, we have also provided dedicated
consulting services and help-line support to the TSI and broader SciDAC
community.
3.3.2 Integration with NetSolve/GridSolve
Support for distributed computing that facilitates the research efforts
of collaboratories and other advanced applications is an important part
of the SciDAC vision. Our experiments with the integration of Logistical
Networking technology into NetSolve middleware developed by PI Dongarra's
group are designed to support that vision. NetSolve allows remote users,
working with familiar interfaces such as Matlab and Mathematica, to access
distributed hardware and software resources in order to perform complex
calculations. Integration with Logistical Networking enables users to
store (and replicate) data objects in IBP depots near the locations of
NetSolve servers, and then point NetSolve servers to these depots to find
data to use in computations. The proximity of the IBP depots to the NetSolve
servers, as well as the existence of multiple replicas that can supply
the needed data, improves the performance of NetSolve, especially across
the wide area. The user can thus run computations on remote data and retrieve
only the pertinent portion of the output at the client. A prototype version
showing the power of this enhanced functionality was exhibited at SC2002.
3.3.3 LAPACK for Clusters (LFC)
PI Dongarra is exploring the use of IBP RAM depots in middleware that
couples cluster system information with the specifics of a user problem
to launch cluster based applications on the best set of resources available
[4].
3.3.4 GridSAT
The Logistical File System is being tested using GridSAT, a Grid enabled
Satisfiability solver used in circuit design and verification. PI Wolski
has developed a prototype GridSAT using LFS, IBP and the LBONE and an
"always-available" checkpoint storage service which will be
demonstrated at SC 2003. GridSAT is compatible with NMI and the NPACKage
software infrastructures.
3.3.5 Demo, presentations and publication; tutorial materials
"An End-to-End Approach to Globally Scalable Network Storage,"
presented at SIGCOMM 2002, explains the unique architectural vision behind
Logistical Networking [5]. We gave several major public demonstrations
of the performance and functionality of Logistical Networking technology.
At the international iGrid2002 conference, multiple standard TCP streams
were used to accomplish data transfers of 100 Mbps from the US to Amsterdam.
At SC2002, a distributed computing application running on NetSolve servers
around the world used Logistical Networking technology to seamlessly access
blocks from distributed data replicas for enhanced application performance.
A comprehensive tutorial on the LoRS toolset has also been developed.
This tutorial will enable SciDAC and other DOE users to make immediate
and productive use of global L-Bone resources.
4. Application and Community Impact
4.1 Terascale Supernova Initiative (TSI)
The TSI project has adopted Logistical Networking as a key component
of their data management strategy and our project plans have been driven
by TSI requirements. TSI uses the LoRS tools to share both large data
sets and computational results between the major collaboration sites,
viz. ORNL, NERSC, NCSU, SUNYSB, and UCSD. Previously, using traditional
FTP tools, TSI collaborators tolerated transfer rates of 8 Mbps, at best.
Using the LoRS tools, they now routinely achieve transfer data rates at
speeds above 220 Mbps between key research sites. It is a simple, scalable
solution to a crucial TSI problem and need.
We are deploying five new IBP depots to the major TSI research sites.
The depots at ORNL, NCSU and SUNYSB are already in use. These dedicated
depots will provide an additional 9.5 TB of IBP storage space for use
by TSI and the entire SciDAC research community.
Ongoing work with the TSI team will provide further integration with
the TSI supercomputer environment, their HPSS storage environment and
tools, the commercial visualization tools they, and in general their production
work flow. The TSI team are comparing IBP functionality with Arie Shoshani's
Hierarchical Resource Manager (HRM) middleware and evaluating the benefits
of integrating IBP storage resources with HRM, which is oriented towards
archival storage. Of particular interest would be HRM's ability to take
advantage of LoN overlay routing and point-to-multipoint transfer capabilities.
4.2 Global research participation in Logistical Networking
There has been substantial participation from Internet2 member institutions,
many of which have collaborations with DOE research labs, in the deployment
of IBP. There have been regular Logistical Networking sessions and meetings
of the Network Storage Working Group, which focuses one Logistical Networking,
at Internet2 member meetings and Joint Techs workshops, which are held
jointly with the ESnet Coordinating Committee. An upcoming meeting of
the Asia Pacific Advanced Network (APAN) consortium will include a session
dedicated to Logistical Networking. Against this background, LoN tools
and the L-Bone are being used by a variety of researchers and research
groups spread around the world, including the following: Singapore: National
Technical University (Assoc. Prof. Lee Bu Sung), Korea: Korea Advanced
Institute of Science and Technology (Prof. Kilnam Chon), Netherlands:
Surfnet (Maarten Koopmans), Czech Republic: Masaryk University (Assoc.
Prof. & Dean Ludek Matyska), France: INRIA (Permanent Researcher Laurent
Lefevre), Italy: Universita' del Piemonte Orientale (Prof. Cosimo Anglano),
and Canada, Yotta Yotta (Dr. Geoff Hayward).
5. Plans and Futures
Although more specific near-term plans are included in our discussion
above, the items below represent additional future directions that we
are also developing:
- Reaching other SciDAC applicatoins and DOE projects
- Extensible depot services, e.g. error encoding, reliable multicast,
reduction operations, bit rate adaption [6].
- Experimentation with layer 2 and optical connectivity between depots
- Expanded integration with TSI tools and workflow
- Logistical toolkit for remote and distributed visualization
- Address needs of Data Grids
- Adoption of high performance networking best practices on depots
- Address IETF and GGF for standardization of IBP; commercial adoption
- Peering between distinct logistical networks
- 100TB deployed throughout DOE labs and partner institutions
- Client integration with all common DOE platforms
6. References
[1] J. Ding, J. Huang, M. Beck, S. Liu, T. Moore, and S. Soltesz, "Remote
Visualization by Browsing Image Based Databases with Logistical Networking,"
in Proceedings of SC2003. Pheonix, AZ, 2003 (to appear).
[2] M. Beck, Y. Ding, E. Fuentes, and S. Kancherla, "An Exposed
Approach to Reliable Multicast in Heterogeneous Logistical Networks,"
presented at Workshop on Grids and Advanced Networks (GAN03), Tokyo, Japan,
May 12-15, 2003.
[3] R. Wolski and M. Allen, "The Livny and Plank-Beck Problems:
Studies in Data Movement on the Computational Grid," in Proceedings
of SC2003. Pheonix, AZ, 2003 (to appear).
[4] K. Roche and J. Dongarra, "Deploying Parallel Numerical Library
Routines to Cluster Computing in a Self Adapting Fashion," Parallel
Computing 2002 (submitted).
[5] M. Beck, T. Moore, and J. S. Plank, "An End-to-end Approach
to Globally Scalable Network Storage," in Proceedings of ACM Sigcomm
2002. Pittsburgh, PA: Association for Computing Machinery, 2002.
[6] M. Beck, T. Moore, and J. S. Plank, "An End-to-End Approach
to Globally Scalable Programmable Networking," presented at Future
Directions in Network Architecture (FDNA-03), an ACM SIGCOMM 2003 Workshop,
Karlsruhe, Germany, August 25-29, 2003 (to appear).
<=Back to the
Research Page
|