Lecture Notes in Computer Science Edited by G. Goos, J. Hartmanis, and J. van Leeuwen
1736
Berlin Heidelberg New York Barcelona Hong Kong London Milan Paris Singapore Tokyo
Luigi Rizzo Serge Fdida (Eds.)
Networked Group Communication First International COST264 Workshop, NGC’99 Pisa, Italy, November 17-20, 1999 Proceedings
Series Editors
Gerhard Goos, Karlsruhe University, Germany
Juris Hartmanis, Cornell University, NY, USA
Jan van Leeuwen, Utrecht University, The Netherlands

Volume Editors
Luigi Rizzo
Università di Pisa, Dip. Ing. dell'Informazione
Via Diotisalvi 2, I-56126 Pisa, Italy
E-mail: [email protected]

Serge Fdida
Université Pierre et Marie Curie, Laboratoire LIP6-CNRS
8, Rue du Capitaine Scott, F-75015 Paris, France
E-mail: [email protected]
Cataloging-in-Publication Data applied for

Die Deutsche Bibliothek - CIP-Einheitsaufnahme
Networked group communication : first international COST 264 workshop / NGC '99, Pisa, Italy, November 17-20, 1999. Luigi Rizzo ; Serge Fdida (ed.). - Berlin ; Heidelberg ; New York ; Barcelona ; Hong Kong ; London ; Milan ; Paris ; Singapore ; Tokyo : Springer, 1999
(Lecture notes in computer science ; Vol. 1736)
ISBN 3-540-66782-2

CR Subject Classification (1998): C.2, D.4.4, H.4.3, H.5.3
ISSN 0302-9743
ISBN 3-540-66782-2 Springer-Verlag Berlin Heidelberg New York

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable for prosecution under the German Copyright Law.

© Springer-Verlag Berlin Heidelberg 1999
Printed in Germany

Typesetting: Camera-ready by author
SPIN: 10704258 06/3142 - 5 4 3 2 1 0
Printed on acid-free paper
Preface
Enabling group communication is one of the major challenges for the future Internet. Various issues ranging from services and applications to protocols and infrastructure have to be addressed. Moreover, they need to be studied from various angles and therefore involve skills in multiple areas. COST264 was created to contribute to this international effort towards group communication and related technologies. The European COST framework is ideal for establishing a new community of interest, providing an open forum for ideas, and also supporting young researchers in the field. The COST264 action, officially started in late 1998, aims at leveraging the European research in this area and creating intensive interaction at the international level. To this end, COST264 decided to organize an annual technical workshop, the "International Workshop on Networked Group Communication". NGC'99 in Pisa is the first event of the series.
Despite this being the first workshop, the very short time between the Call for Papers and the submission deadline, and competition from other, more established events, the Call for Papers of NGC'99 was highly successful: we received 49 papers, of which 18 were selected to form the basis of the technical program. We hope you will enjoy our paper selection, which is the
core of these proceedings, and addresses important issues in the research and development of networked group communication. In addition to refereed contributions, we scheduled two keynote speakers (Christophe Diot and Steve Deering), and four invited talks by Ken Birman (Cornell), Bob Briscoe (BT), Radia Perlman (SUN), and Tony Speakman (CISCO). Because one of the goals of COST264 is to disseminate information, we decided to include two poster sessions in the program, to give young researchers an opportunity to present their work and receive useful feedback from the participants. Finally, the workshop was preceded by a day dedicated to tutorials. We had a total of three half-day tutorials:

– Mostafa Ammar and Don Towsley on Principles of Multicast Protocols and Services;
– Mark Handley on The Near-Term Future of IP Multicast;
– Radia Perlman on Network and Multicast Security.

This event would not have been possible without the enthusiastic contribution and hard work of a number of individuals and institutions. On behalf of COST264 and all participants in the workshop, we would like to thank, for their contributions:

– the program committee members, the reviewers, and all authors who submitted their work to the workshop, and made it possible to have a very high-level technical program;
– the tutorial lecturers and the invited speakers, who put their knowledge and expertise at the service of the workshop;
– the supporting institutions: Scuola Superiore S. Anna, which hosted the meeting; the Istituto di Applicazioni Telematiche (IAT/CNR), which took care of the secretariat; and the Università di Pisa, which gave technical, financial, and human support to the organization of the workshop;
– the industrial sponsors: Cisco, Microsoft Research, and Motorola Labs (in alphabetical order), who are fully aware of the importance of this topic.

A special thanks goes to Jon Crowcroft and Christophe Diot, whose help and expertise were fundamental during the organization of this workshop. We hope that this workshop will be the first of a successful series, and we look forward to fruitful technical interaction in the delightful city of Pisa.
October 1999
Luigi Rizzo, Serge Fdida
Organization
NGC'99 – the First International Workshop on Networked Group Communication – is organized by action COST264, in cooperation with the Istituto per le Applicazioni Telematiche (IAT), Università di Pisa, and Scuola Superiore S. Anna.
Industrial Sponsors
CISCO
Microsoft Research
Motorola Labs
Conference Chairs
Luigi Rizzo, Università di Pisa, Italy
Serge Fdida, LIP6, Paris, France
Program Committee
Kevin C. Almeroth, UCSB
Mostafa Ammar, GeorgiaTech
Ernst Biersack, EURECOM
Bob Briscoe, British Telecom
David Cheriton, Stanford
Jon Crowcroft, University College London
Walid Dabbous, INRIA
André Danthine, University of Liège
Christophe Diot, SPRINT
Jordi Domingo-Pascual, Univ. Politecnica de Catalunya
Wolfgang Effelsberg, University of Mannheim
JJ Garcia-Luna, UC Santa Cruz
Jim Gemmel, Microsoft
Jose Guimares, ISCTE Lisbon
Mark Handley, ACIRI
Markus Hofmann, Bell Labs
David Hutchison, Lancaster University
Roger Kermode, Motorola
Jim Kurose, University of Massachusetts
Luciano Lenzini, Università di Pisa
Helmut Leopold, Telekom AT
Brian Levine, UC Santa Cruz
Allison Mankin, ISI
Jörg Nonnenmacher, Bell Labs
Huw Oliver, Hewlett Packard
Sanjoy Paul, Bell Labs
Radia Perlman, SUN
Jean-Jacques Quisquater, UCL (BE)
Tony Speakman, CISCO
Burkhard Stiller, ETH Zurich
Don Towsley, University of Massachusetts
Giorgio Ventre, Università di Napoli
Lorenzo Vicisano, CISCO
Brian Whetten, Talarian Corporation
List of Reviewers
Kevin Almeroth, Mostafa Ammar, Alberto Bartoli, Cinzia Bernardeschi, Ernst Biersack, Ken Birman, Bob Briscoe, David Cheriton, Domenico Cotroneo, Jon Crowcroft, Walid Dabbous, Raffaele D'Albenzio, André Danthine, Gianluca Dini, Christophe Diot, Jordi Domingo-Pascual, Wolfgang Effelsberg, Serge Fdida, Thomas Fuhrmann, JJ Garcia-Luna, Jim Gemmel, Mark Handley, Volker Hilt, Markus Hofmann, Hugh Holbrook, David Hutchison, Roger Kermode, Jim Kurose, Helmut Leopold, Brian Levine, Laurent Mathy, Martin Mauve, Jörg Nonnenmacher, Huw Oliver, Sanjoy Paul, Radia Perlman, Suchitra Raman, Luigi Rizzo, Dan Rubenstein, Tony Speakman, Burkhard Stiller, Ion Stoica, Don Towsley, Giorgio Ventre, Lorenzo Vicisano, Brian Whetten, Hui Zhang
Table of Contents
A Preference Clustering Protocol for Large-Scale Multicast Applications . . . . . 1
  Tina Wong, Randy Katz, Steven McCanne (Computer Science Division, University of California, Berkeley, USA)

Layered Multicast Group Construction for Reliable Multicast Communications . . . . . 19
  Miki Yamamoto, Yoshitsugu Sawa, Hiromasa Ikeda (Department of Communications Engineering, Osaka University, Japan)

Building Groups Dynamically: A CORBA Group Self-Design Service . . . . . 36
  Eric Malville (France Télécom CNET, France)

Issues in Designing a Communication Architecture for Large-Scale Virtual Environments . . . . . 54
  Emmanuel Léty, Thierry Turletti (INRIA, Sophia Antipolis, France)

HyperCast: A Protocol for Maintaining Multicast Group Members in a Logical Hypercube Topology . . . . . 72
  Jörg Liebeherr (Computer Science Department, University of Virginia, USA), Tyler K. Beam (Microsoft Corporation, Redmond, USA)

Support for Reliable Sessions with a Large Number of Members . . . . . 90
  Roger Kermode (Motorola Research Centre, Botany, Australia), David Thaler (Microsoft Corporation, Redmond, USA)

Distributed Core Multicast (DCM): A Multicast Routing Protocol for Many Groups with Few Receivers . . . . . 108
  Ljubica Blazević, Jean-Yves Le Boudec (Institute for Computer Communications and Applications (ICA), Swiss Federal Institute of Technology, Lausanne)

A Distributed Recording System for High Quality MBone Archives . . . . . 126
  Angela Schuett, Randy Katz, Steven McCanne (Department of Electrical Engineering and Computer Science, University of California, Berkeley, USA)

Reducing Replication of Data in a Layered Video Transcoder . . . . . 144
  Gianluca Iannaccone (Università di Pisa, Pisa, Italy)

Providing Interactive Functions through Active Client-Buffer Management in Partitioned Video Multicast VoD Systems . . . . . 152
  Zongming Fei, Mostafa H. Ammar (Networking and Telecommunications Group, College of Computing, Georgia Institute of Technology, Atlanta, USA), Ibrahim Kamel, Sarit Mukherjee (Panasonic Information and Networking Technology Lab, Panasonic Technologies Inc., USA)

A Multicast Transport Protocol for Reliable Group Applications . . . . . 170
  Congyue Liu (Guangzhou Communications Institute, Guangzhou, P.R. China), Paul D. Ezhilchelvan (University of Newcastle, Newcastle upon Tyne, UK), Marinho Barcellos (UNISINOS, Sao Leopoldo, Brazil)

Efficient Buffering in Reliable Multicast Protocols . . . . . 188
  Oznur Ozkasap, Robbert van Renesse, Kenneth P. Birman, Zhen Xiao (Department of Computer Science, Cornell University, USA)

Native IP Multicast Support in MPLS . . . . . 204
  Arup Acharya (C&C Research Labs, NEC USA, USA), Frédéric Griffoul (NPDL-E, NEC Europe Ltd., Germany)

Cyclic Block Allocation: A New Scheme for Hierarchical Multicast Address Allocation . . . . . 216
  Marilynn Livingston (Computer Science, Southern Illinois University, USA), Virginia Lo, Daniel Zappala (Computer Science, University of Oregon, USA), Kurt Windisch (Adv. Network Technology Ctr., University of Oregon, USA)

Survivable ATM Group Communications Using Disjoint Meshes, Trees, and Rings . . . . . 235
  William Yurcik (Department of Applied Computer Science, Illinois State University, USA)

The Direction of Value Flow in Connectionless Networks . . . . . 244
  Bob Briscoe (BT Research, BT Labs, Ipswich, UK)

Techniques for Making IP Multicast Simple and Scalable . . . . . 270
  Radia Perlman (Sun Microsystems Laboratories, USA), Suchitra Raman (Department of Electrical Engineering and Computer Science, University of California, Berkeley, USA)

Watercasting: Distributed Watermarking of Multicast Media . . . . . 286
  Ian Brown, Colin Perkins, Jon Crowcroft (Department of Computer Science, University College London, UK)

MARKS: Zero Side Effect Multicast Key Management Using Arbitrarily Revealed Key Sequences . . . . . 301
  Bob Briscoe (BT Research, BT Labs, Ipswich, UK)

Multicast Service Differentiation in Core-Stateless Networks . . . . . 321
  Tae Eun Kim, Raghupathy Sivakumar, Kang-Won Lee, Vaduvur Bharghavan (TIMELY Research Group, University of Illinois at Urbana-Champaign, USA)

Author Index . . . . . 339
A Preference Clustering Protocol for Large-Scale Multicast Applications

Tina Wong, Randy Katz, and Steven McCanne
Computer Science Division
University of California, Berkeley
{twong,randy,mccanne}@cs.berkeley.edu
Abstract. IP Multicast has enabled a variety of large-scale applications on the Internet which would otherwise bombard the network and the content servers if unicast communication was used. However, the efficiency of multicast is often constrained by preference heterogeneity, where receivers range in their preferences for application data. We examine an approach in which approximately similar preferences are clustered together and transmitted on a limited number of multicast addresses, while consuming bounded total session bandwidth. We present a protocol called Matchmaker that coordinates sources and receivers to perform clustering. The protocol is designed to be scalable, fault tolerant and reliable through the use of decentralized design, soft-state operations and sampling techniques. Our simulation results show that clustering can reduce the amount of superfluous data at the receivers for certain preference distributions. By factoring in application-level semantics into the protocol, it can work with different application requirements and data type characteristics. We discuss how three different applications—stock quote dissemination, distributed network games, and session directory services—can specialize the protocol to perform clustering and achieve better resource utilization.
1 Introduction
The deployment of the IP Multicast Backbone (MBone) has enabled a variety of large-scale applications in the Internet, ranging from video conferencing tools to electronic whiteboards to information dissemination applications and distributed network games. These applications would otherwise bombard the network and the content servers if unicast communication was used. However, the efficiency in using multicast communication is often constrained by receiver heterogeneity. One form of heterogeneity is found at the network level, where receiving rates at the receivers vary by their bandwidth capacities. We also observe heterogeneity at the end-host level, where data types that can be handled at the receivers differ in their processing speeds. While the problems of network and end-host heterogeneity have been studied extensively, only recently have researchers started to investigate the concept of preference heterogeneity, where receivers range in their preferences for data within a single application. Preference heterogeneity manifests itself mostly in
Fig. 1. The clustering concept: a spectrum from complete similarity (multicast) at one end to complete heterogeneity (unicast) at the other, with clusters of approximately similar sources and receivers in between.
large-scale multicast applications containing rich data types and configurable user interfaces. For example, in news dissemination, different subscribers are interested in different news categories. In network games, players that are fighting together require detailed and frequent state updates from one another, but not from those users farther away. In Internet TV broadcasts, viewers can only watch a few programs among the many available at any time, since humans can only attend to a limited amount of information simultaneously. In the limit of complete preference similarity, multicast is the optimal communication paradigm; in the limit of complete preference heterogeneity, unicast should be used instead. Between these two extreme scenarios is a spectrum where we need to group sources and receivers with matching preferences together. However, in the limit of many small groups, the control overhead associated with multicast forwarding state becomes unacceptable. The tradeoff lies herein: we cluster sources and receivers within an application into approximately similar groups, while maximizing preference overlap and minimizing network resource consumption. Figure 1 illustrates this concept.

Clustering works with the current IP model and requires no new mechanisms in the network. We assume simple network primitives: packets sent to a multicast address are delivered to all end-hosts subscribed to that address.

An alternate solution to accommodate preference heterogeneity is to have sources transmit all their data, and receivers filter out the undesired data. However, this solution is inappropriate, because it wastes both network resources and CPU processing cycles in handling the unnecessary data. A more effective approach, similar to layered multicast for congestion control [23,37] and multicast filtering in Distributed Interactive Simulation (DIS) [26,20], is to have sources send different versions of their data on separate multicast addresses. Although receiver preferences are well-matched in this case, the main drawback is that the number of multicast addresses used scales linearly with the number of sources and/or the granularity of preferences within an application. While the introduction of IPv6 provides ample distinct multicast addresses, the more severe problem of multicast routing state still remains [5,10]. The detrimental cost arises from the overhead of periodic keep-alive messages from routers to maintain multicast forwarding state.

To combat this linearity of growth in the number of multicast addresses used, we can send a single data stream that models average preferences to all the receivers. This is proposed by the SCUBA protocol for Internet video conferencing [1], in which votes are collected from receivers to determine the popularity of
video sources, which are then used to decide which sources should be allocated most of the total session bandwidth. This approach works well if receivers exhibit a consensus among their preferences; e.g., in a lecture broadcast, the audience is usually interested only in the people currently holding the floor. However, applications do not always show such consensus, e.g., in news dissemination and network games as explained earlier. Assuming consensus in these applications leads to poor preference matching at the receivers.

To deal with applications where receivers exhibit multiple modes of preference, we can create a separate multicast session for each group of sources and receivers with the same preferences. This is analogous to proxy-based schemes to accommodate network and end-host heterogeneity [3], in which a proxy is instantiated to service clients' requests in a fine-grained manner. Although this approach transmits and processes only the data matching receiver preferences, it is impractical if the number of these groups is large. This is because the total data rate injected into the network is unregulated, and the control overhead from using a large number of multicast addresses is not considered.

In this paper, we present a protocol called Matchmaker that coordinates sources and receivers within a single application to perform clustering. By grouping only approximately similar preferences together, the protocol allows the application to control the number of multicast addresses it uses and also the number of connections to be maintained by the data sources. The protocol also governs the total data rate injected into the network across all the sources according to the preferences, which helps to avoid and accommodate network congestion. We designed the protocol to be scalable, fault tolerant and reliable through the use of decentralization, soft-state operations and sampling techniques. Our simulation results show that clustering can reduce the amount of superfluous data experienced at the receivers for certain preference distributions. By factoring application-level semantics into the protocol, it can work with different application requirements and data type characteristics. We discuss how three different applications—stock quote dissemination, distributed network games, and session directory services—can specialize the protocol to achieve better resource utilization.

The rest of the paper is organized as follows. Section 2 describes the Matchmaker protocol in detail. Section 3 discusses simulation results that study the feasibility of clustering. Section 4 details three different applications and how they can specialize the protocol. Section 5 compares our work to related research. Section 6 goes over future directions and concludes this paper.
2 The Matchmaker Protocol
Before we dive into the details of the protocol, we explain the following terminology used in the paper:

– A source represents a logical stream of data. There are multiple data sources in an application; they can originate from a single end-host or from different end-hosts. A receiver is interested in certain sources.
– A cluster represents a group of similar sources and receivers. One or more multicast addresses can be associated with a cluster.
– A partition is the set of clusters that encompasses all the sources and receivers in an application.

The goal of the Matchmaker protocol is to coordinate members in an application so that clustering is performed: members are grouped according to similarity in their preferences, such that they communicate using a fixed number of multicast addresses with limited total session bandwidth. There are four main tasks in achieving this coordination:

Task 1: Execution of the clustering algorithm to form an initial partition of the members, and to re-group them in reaction to changes.
Task 2: Collection of member reports, which are incorporated in the clustering algorithm to yield a meaningful partition.
Task 3: Notification of members of the current or adapted partition.
Task 4: Handoff from the old to the new partition, so sources can rendezvous with interested receivers.

One approach to implementing these tasks is to have a network agent responsible for the coordination. The Active Services model [2] is an example that provides fault tolerant, scalable, highly available agent platforms in the network to carry out user-specified computations. This model has been realized through the AS1 framework, which uses a cluster of computers and the soft-state concept to allow robust and graceful crash recovery when faced with agent failures. Though a valid approach, finding and placing agents at strategic points in the network remain open research questions. Thus, in Matchmaker, members of the application are responsible for the coordination tasks.

Matchmaker would be simple if it were possible for members to independently come up with the same partition. However, this requires them to have a globally consistent view of the current configuration, such as preference values, member existence, and so on. Otherwise, members might have different notions of which multicast addresses to transmit to and subscribe to for data. Although distributed programming toolkits such as ISIS [6] and HORUS [36] provide robust group communication primitives to achieve virtual synchrony among members, the control overhead and latency involved in accomplishing it is costly if not unacceptable with large member populations. Thus, we allow inconsistency among members, and elect only one to execute the clustering algorithm. This follows the Lightweight Session model [16,24], which advocates loosely-coupled and lightweight communication in multi-party applications for enhanced scalability to large session sizes.
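To make the terminology concrete, here is a minimal sketch, in Python, of the state a Matchmaker implementation might keep; the field names and types are our own illustration rather than part of the protocol.

```python
from dataclasses import dataclass, field

@dataclass
class Cluster:
    """A group of similar sources and receivers; one or more
    multicast addresses can be associated with it."""
    addresses: list[str] = field(default_factory=list)
    sources: set[str] = field(default_factory=set)    # logical stream names
    receivers: set[str] = field(default_factory=set)

@dataclass
class Partition:
    """The set of clusters covering all sources and receivers
    in the application."""
    clusters: list[Cluster] = field(default_factory=list)
```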
2.1 The Clustering Algorithm
We briefly describe a clustering algorithm used by Matchmaker. For details and performance results of the algorithm, please refer to [38]. The protocol is not limited to using this particular algorithm.
A Preference Clustering Protocol for Large-Scale Multicast Applications Sources
Receivers
R1
R2
R3
R4
S2
S3
Sources
11111111111111 00000000000000 H H L L 00000000000000 11111111111111 00000000000000 11111111111111 00000000000000 11111111111111 00000000000000 11111111111111 H H L L 00000000000000 11111111111111 00000000000000 11111111111111 00000000000000 11111111111111 L L H H 00000000000000 11111111111111 00000000000000 11111111111111 00000000000000 11111111111111 00000000000000 11111111111111 L L H H 00000000000000 11111111111111 00000000000000 11111111111111 (i)
S1
S4
(a) GR
11111111 00000000000000000000 11111111111111111111 0000 000000000000 111111111111 0000 0000000000000000000 1111111111111111111 000000000000 111111111111 00001111 1111 0000 00000000000000000000 11111111111111111111 0000 1111 000000000000 111111111111 0000 1111 0000000000000000000 1111111111111111111 000000000000 111111111111 0000 1111 0000 1111 00000000000000000000 11111111111111111111 0000 1111 000000000000 111111111111 0000 1111 0000000000000000000 1111111111111111111 000000000000 111111111111 0000 1111 0000 1111 000 111 000 111 00000000000000000000 11111111111111111111 0000 1111 000000000000 111111111111 0000 1111 0000000000000000000 1111111111111111111 000000000000 111111111111 0000 1111 0000 1111 000 111 000 111 C1 111 C2 000 000 111 000 111 000 111 0000 1111 1111 0000 1111 0000 0000 1111 0000 1111 0000 1111 0000 1111 0000 1111 0000 1111 0000 1111 0000 0000 1111 0000 1111 0000 1111 0000 1111 0000 1111 0000 1111 1111 0000 1111 0000 1111 0000 1111 0000 1111 0000 1111 0000 1111 0000 1111 0000 1111 1111 0000 1111 0000 0000 1111 00001111 1111 0000 00001111 1111 0000 S1
R1
S2
S3
R2
R3
R1
S4
R4
Receivers
S1
5
R2
R3
R4
S2
S3
S4
1111111 0000000 0000000 1111111 H H 0000000 L L 0000000 1111111 0000000 1111111 1111111 0000000 1111111 0000000 1111111 0000000 1111111 0000000 1111111 0000000 H H 1111111 L L 0000000 1111111 0000000 1111111 0000000 1111111 0000000 0000000 1111111 1111111 0000000 1111111 0000000 1111111 0000000 L L 1111111 H H 0000000 1111111 0000000 0000000 1111111 1111111 0000000 1111111 0000000 1111111 0000000 1111111 L L 1111111 H H 0000000 1111111 0000000 1111111 0000000 1111111 0000000
0000 0000 1111 11111 00000 0000 1111 1111 0000 1111 00001111 1111 0000 000001111 11111 0000 0000 1111 00000 11111 0000 1111 0000 0000 1111 0000 1111 00000 11111 0000 1111 0000 1111 00000 11111 0000 1111 1111 0000 1111 0000 1111 0000 1111 00000 11111 0000 1111 0000 1111 00000 11111 0000 1111 0000 1111 0000 1111 0000 1111 00000 11111 0000 1111 0000 1111 000 111 000 111 C1 111 C2 000 000 111 000 111 000 111 0000 1111 0000 1111 0000 1111 0000 1111 0000 1111 0000000000000000000 1111111111111111111 000000000000 111111111111 0000 1111 0000000000000000000 1111111111111111111 000000000000 111111111111 0000 1111 0000 1111 0000 1111 0000 1111 0000 1111 0000 1111 0000 1111 0000000000000000000 1111111111111111111 000000000000 111111111111 0000 1111 0000000000000000000 1111111111111111111 000000000000 111111111111 0000 1111 0000 0000 1111 0000 1111 0000 1111 0000 1111 0000 1111 0000000000000000000 1111111111111111111 000000000000 111111111111 0000 1111 0000000000000000000 1111111111111111111 000000000000 111111111111 00001111 1111 0000 1111 0000 1111 0000 1111 0000 1111 0000 1111 0000 1111 0000000000000000000 1111111111111111111 000000000000 111111111111 0000 1111 0000000000000000000 1111111111111111111 000000000000 111111111111 0000 1111 0000 1111 R2 R1 R3 R4 S1
S2
S3
S4
H L
H L
(ii)
(i)
(b) GS
(ii)
Fig. 2. Grouping Schemes.
Grouping Schemes

There are different ways sources and receivers can be grouped together to share a multicast address. We can group receivers with similar preferences into clusters—a scheme we call GR. For example, in a network game, we can use GR to group players that are close together in virtual space because they are focused on the same objects. Figure 2(a) is an illustration. A preference matrix is shown in (i), with each element representing the preference vector the corresponding receiver assigns to the source. We use a scalar to simplify the figure: H denotes high-quality and L low-quality. Receivers R1 and R2 are grouped because they show the same preferences towards all four sources; likewise for R3 and R4. The data transmissions and subscriptions are shown in (ii). R1 and R2 get data from cluster C1, to which sources S1 and S2 send high-quality data, and S3 and S4 low-quality. Each cluster uses one multicast address here. In GR, a source sends data to all clusters containing receivers interested in it, and a receiver gets data from the one cluster that best matches its preferences.

An alternative is to group sources that receivers find similarly "interesting" or "uninteresting" into clusters—a scheme we call GS. For example, in news dissemination, we can use GS to group categories that users find collectively useful. Figure 2(b) is an illustration. S1 and S2 are grouped together because R1 and R2 want high-quality data from both, while R3 and R4 want low-quality. Each source layers its data into one base layer and one enhancement layer when sending to a cluster. R1 and R2 subscribe to two layers from C1 to get high-quality data from both S1 and S2, but only one from C2 to get low-quality data from both S3 and S4. In this example, each cluster uses two multicast addresses. In GS, a source sends data to just one cluster, and a receiver gets data from all clusters containing its desired sources.
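A small sketch of the two schemes on the Figure 2 matrix may help; for clarity it groups exactly matching preference vectors, whereas Matchmaker clusters approximately similar ones.

```python
# The 4x4 preference matrix of Fig. 2: rows are receivers R1-R4,
# columns are sources S1-S4; 1 = high-quality (H), 0 = low-quality (L).
prefs = [(1, 1, 0, 0),
         (1, 1, 0, 0),
         (0, 0, 1, 1),
         (0, 0, 1, 1)]

def group_identical(vectors):
    """Map each distinct preference vector to the indices carrying it."""
    groups = {}
    for idx, vec in enumerate(vectors):
        groups.setdefault(tuple(vec), []).append(idx)
    return list(groups.values())

# GR clusters receivers with matching preference vectors (rows):
print(group_identical(prefs))        # [[0, 1], [2, 3]] -> {R1,R2}, {R3,R4}
# GS clusters sources that receivers rate alike (columns):
print(group_identical(zip(*prefs)))  # [[0, 1], [2, 3]] -> {S1,S2}, {S3,S4}
```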
Objective Function

The objective function in clustering is to maximize the preference overlap in each cluster of a partition. The number of possible ways to arrange N objects into K clusters is approximately K^N. An exhaustive search algorithm that finds the optimal solution is impractical for applications with real-time constraints. Thus, we formulate clustering in our problem with an approximation algorithm. It is divided into two phases:
– A bootstrapping phase to handle the joining of new sources and receivers to the application in an on-line manner.
– An adaptation phase to deal with changes in preferences and the departures of old sources and receivers dynamically. It consists of a simple control loop that slowly backs off in the absence of opportunity for beneficial re-grouping and quickly converges otherwise.

The unsynchronized membership model in IP multicast says that members can join and leave a multicast address at any time. The bootstrapping phase of the algorithm needs to be on-line to group new sources or receivers as they come into existence. We use a greedy approach which adds a new source S to the cluster Gk containing the most "similar" sources to S, if GS is used, and likewise for GR. The algorithm allows an application to define what constitutes similar sources and receivers depending on the meaning of preferences. For example, preferences can be "binary" in that each receiver either wants all data from a given source, or none. Example definitions can be found in [38]. Note that the algorithm only groups sources in GS and receivers in GR; the mapping of the other is implicit, as explained in Section 2.1. We assume the number of clusters, K, available is fixed. We describe its derivation later.

While the algorithm needs to be on-line during the bootstrapping phase, it can be off-line in the adaptation phase, working on a snapshot of the current configuration—sources, receivers, preferences, and partition. We use the k-means method here, also known as "switching", which provably converges to a locally optimal solution [14]. For each source S in some cluster Gi, the algorithm switches S to another cluster Gj if it is more similar to the set of sources belonging to Gj, when GS is used. The process is analogous in GR. One benefit of k-means is that it incrementally refines the current partition to arrive at the adapted partition. This minimizes the disruption to the application, because it limits the number of multicast join and leave operations that result. A potential disadvantage of k-means is its unbounded convergence time. The algorithm stops only when none of the sources can be moved from its current cluster to another. However, our simulation results show that the algorithm quickly converges to a locally optimal partition in a small number of iterations (3-5) for a range of preference distributions and workloads. A sketch of the switching step appears at the end of this subsection.

Constraints

We need to satisfy the two constraints of network state and bandwidth consumption. The first constraint arises from the cost of using multicast addresses, which is in the form of forwarding state at the routers and the control overhead to maintain it. The second constraint limits the total session bandwidth available to an application. We leave the problem of assigning constraints to other research [21]. For example, an ISP can be responsible for allocating blocks of multicast addresses and session bandwidth to an application.

The number of clusters K used in the algorithm is simply the number of addresses A available to the application, if data is not layered or GR is used. Otherwise, K is defined as A/L, where L is the maximum number of layers deemed useful at the application. This is the worst-case estimate of K because
receivers might not need all layers from all sources. A heuristic to alleviate this problem is to re-run the algorithm with a larger K if the previous K results in unused addresses.

The bandwidth consumption constraint is considered after the partition is formed. In GS, the total session bandwidth available to an application is divided among the sources according to the average weights assigned by the receivers. In GR, each source further allocates the assigned bandwidth among the clusters based on the data sent to each. This controls the total amount of data injected into the network across all sources, which helps to avoid and accommodate network congestion.
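As referenced above, here is a hedged sketch of the k-means "switching" step under the binary preference model; the similarity measure is one plausible choice of our own, since the paper deliberately leaves its definition to the application [38].

```python
def similarity(s, members, prefs):
    """Average agreement between receivers' (binary) interest in source s
    and in the cluster's current members; illustrative metric only."""
    if not members:
        return 0.0
    n = len(prefs)
    agree = lambda m: sum(prefs[r][s] == prefs[r][m] for r in prefs) / n
    return sum(agree(m) for m in members) / len(members)

def kmeans_switch(assign, n_clusters, prefs, max_iters=5):
    """Adaptation phase (GS variant): repeatedly switch each source to the
    cluster whose members it is most similar to, stopping when a full pass
    makes no move or after max_iters passes (the paper observes convergence
    in 3-5 iterations)."""
    for _ in range(max_iters):
        moved = False
        for s in list(assign):
            members = lambda k: [t for t in assign if assign[t] == k and t != s]
            best = max(range(n_clusters),
                       key=lambda k: similarity(s, members(k), prefs))
            if best != assign[s]:
                assign[s], moved = best, True
        if not moved:
            break
    return assign

# assign maps each source to a cluster index; prefs[r][s] is 0 or 1.
# A GR version is symmetric, switching receivers instead of sources.
```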
2.2 Election for Execution
The clustering algorithm is executed "on-demand", when a new member needs to be bootstrapped into the current partition, or when the partition needs to be adapted. We use a decentralized approach to elect one member as the "matchmaker" to execute the algorithm, so that global consistency among members is not required. The matchmaker is elected using randomized timers with multicast damping, as used for retransmissions in SRM [11]. The matchmaker, if it exists, periodically sends out heartbeats on a multicast control channel. Members listen on this channel, and detect the (non-)existence of a matchmaker. The absence of heartbeats triggers members to set randomized timers to compete to become the matchmaker. When its timer expires, a member deems itself the matchmaker and starts sending out heartbeats. A simple tie-breaker such as IP address is used to deal with multiple timers expiring at the same time. Members unfit to become matchmakers, such as those behind bottleneck links, should not participate in the election.

One problem with this election is a potentially large latency for picking the matchmaker. Given a large number of members, the random timer values are chosen from a relatively large uniform distribution to decrease the probability of multiple timers expiring at the same time. Thus, only a few members participate in the matchmaker election. Additionally, these members keep soft-state of member reports for the matchmaker. When the current matchmaker crashes or quits the application voluntarily, another can be elected fairly quickly and without the delay of rebuilding the soft-state through their periodic announcements. Note that this election process is the same for choosing an initial matchmaker or a replacement after a crash. This simplifies the protocol because no explicit recovery procedure is required. Such approaches are often used in soft-state systems such as the Active Services model [2] and distributed recording [31].

We again use a decentralized approach to choose the "potential" matchmakers. Each member periodically and independently conducts the following experiment: it generates a random number D from the uniform distribution (0, 1], and compares D with F = (C - M)/N, where C is the expected number of members to be chosen, M the current number of members already chosen, and N the total number of members in the application. If D ≤ F, then the member
becomes a potential matchmaker: it participates in the election process and keeps soft-state. Otherwise, it conducts the experiment again at a later time. The value of C is not critical here, and can be heuristically determined by the protocol, based on tradeoffs between election latency and state backup level. The values of M and N are determined through the announcements of heartbeats and member reports, respectively.
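The self-selection experiment is simple enough to state in code; this sketch assumes C, M, and N are already known from heartbeats and member reports, as described above.

```python
import random

def becomes_potential_matchmaker(C, M, N):
    """One round of the periodic experiment: with C the expected number
    of potential matchmakers, M the number already chosen, and N the
    member population, self-select with probability F = (C - M) / N so
    that roughly C - M further members volunteer in expectation."""
    F = (C - M) / N
    D = random.uniform(0.0, 1.0)   # stands in for a draw from (0, 1]
    return D <= F
```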
2.3 Collection of Member Reports
There are two types of member reports:

– Preference Reports from a receiver denote its preferences for application data. Preferences can be configured at the user interface, or inferred by the application. We describe examples in three applications in Section 4.
– Heartbeat Messages from a source include the application-level names of the data it is sending.

Members send reports periodically using the announce/listen mechanism [30,12,32] to a multicast address serving as a control channel. The address is known to all members and negotiated outside the protocol, such as through SAP [12] and SDP [13]. The reports are aged out when not refreshed, which means members who have left the application or crashed involuntarily, and thus are no longer sending reports, are automatically not considered in the clustering algorithm.

To limit control bandwidth consumption, the interval at which reports are sent is dynamically adjusted according to the member population size. Mathematically, this interval is defined as t = nS/B, where n is the estimated number of members, S the size of a report, and B the control bandwidth. As pointed out by Amir et al. [1], this convergence time to learn about receiver preferences scales linearly with population size. If the clustering algorithm tries to collect all reports before (re-)grouping the members, some of the earlier reports might already have become out-of-date. However, our simulation results show that the algorithm only needs to sample a small percentage (approximately 10% or less) of the preference reports to calculate a partition as effectively as when all the reports are considered, given a range of preference distributions and workloads.
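In code, the adaptive report interval is a one-liner; the byte/bit unit handling is our assumption, since the paper does not fix units.

```python
def report_interval_seconds(n_members, report_size_bytes, control_bw_bps):
    """t = n*S/B: spread n reports of size S over the control bandwidth B
    so that total report traffic stays within budget regardless of
    population size."""
    return n_members * report_size_bytes * 8 / control_bw_bps

# Example: 1000 members, 100-byte reports, 8 kbit/s control channel
# -> each member announces once every 100 seconds.
print(report_interval_seconds(1000, 100, 8000))  # 100.0
```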
2.4 Notification of Rendezvous Information
The rendezvous messages contain directions for sources about transmission multicast addresses, and for receivers about subscription multicast addresses best suited to their preferences. Instead of mappings from member names (IP addresses) to multicast addresses, we add another level of indirection and use mappings from preferences (application-level data names) to multicast addresses. The benefits are three-fold. First, sources are logical streams of data which can originate
from the same IP address. Second, this allows new members to know immediately, without going through the matchmaker, approximately where to send and/or receive data. Third, and most importantly, we adhere to the Lightweight Sessions model, in which knowledge of full membership is not required by the protocol. In Section 4, we describe examples of indirection in three different applications.

There are two ways to disseminate the rendezvous messages. We can either use the announce/listen mechanism, or deliver them with a reliable multicast protocol. The advantage of the former approach is its simplicity, but its periodic nature can waste bandwidth. The latter approach allows members to recover only the mappings related to their preferences, if selective reliability with data naming [28] is used. Part of our future work is to quantify the tradeoffs between these two approaches.
2.5 Handoffs in Transmission and Subscription
Upon receipt of rendezvous information that requires multicast address handoffs, the ideal case is the execution of the following steps in the order presented:

Step 1: The receivers subscribe to the new address, without unsubscribing from the old one.
Step 2: The sources start sending to the new address.
Step 3: The receivers unsubscribe from the old address.

Unsynchronized execution of these steps can lead to receivers experiencing missing data. For example, if a receiver unsubscribes from the old address before the corresponding source starts sending to the new one, some amount of data is lost. We can use group communication protocols such as ISIS and HORUS to achieve a perfect handoff process. Certain classes of applications, such as distributed banking database transactions and critical military exercises, require such strong guarantees. However, for most Internet-based applications that are consumer and entertainment services, like news dissemination and video broadcasts, data losses are tolerable to a certain extent and can sometimes be recovered later. Also, if we wait for every member to receive the rendezvous information before completing the handoff process, the application can be unnecessarily affected by a few slow members with bad network connectivity.

We use a heuristic to alleviate the problem of missing data, illustrated in Figure 3. Members introduce a short lag time during which they rendezvous at both the current and new multicast addresses: sources send to both addresses, possibly at lower data rates, and receivers subscribe to both addresses, possibly getting duplicated data. Each member independently changes to the new address after the short lag time. Receivers can also switch when they detect that their desired sources have already moved to the new address. This approach is analogous to soft handoff or doublecasting schemes in mobile networking [17].

We can also temporarily increase the announcement frequency of the rendezvous information when a new partition is formed. This increases the probability that members receive the information in a timely manner, and thus handoff to
Fig. 3. State diagram for the handoff process: on receipt of rendezvous information, a member moves from the normal state (S sends to Cr, R subscribes to Cr) to a move state (S sends to Cr and Cp, R subscribes to Cr and Cp), and returns when the lag timer expires. Cr denotes the most recent cluster and Cp the previous cluster.
the new addresses at similar times. Raman and McCanne [29] propose a soft-state transport protocol that allows different levels of consistency by adjusting the bandwidth allocated for "hot" and "cold" data. Their techniques can be incorporated in the dissemination of the rendezvous information, so as to intelligently adjust its announcement rate in different situations.
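A minimal sketch of the lag-time heuristic of Figure 3 follows; the join/leave callbacks standing in for multicast subscription, and the fixed timer value, are illustrative assumptions.

```python
import threading

class SoftHandoff:
    """On new rendezvous information, use both the previous cluster Cp and
    the most recent cluster Cr for a short lag time, then drop Cp. A source
    would send to both addresses during the lag; a receiver subscribes to
    both and may see duplicates, but avoids missing data."""
    def __init__(self, join, leave, lag_seconds=2.0):
        self.join, self.leave = join, leave
        self.lag = lag_seconds
        self.current = None

    def on_rendezvous(self, new_addr):
        old, self.current = self.current, new_addr
        self.join(new_addr)              # rendezvous at both addresses
        if old is not None and old != new_addr:
            # Deferred step 3: leave the old address after the lag.
            threading.Timer(self.lag, self.leave, args=[old]).start()
```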
3 Simulation Results
To simplify the experiments, we 1) used a binary function to represent preferences, such that each receiver either wants all data from a given source or none, and 2) had sources send at equal data rates. Performance is measured in terms of average receiver "goodput"—the amount of useful data divided by the total received—to indicate the efficiency of resource utilization. A cluster here represents one multicast address.

We modeled three preference patterns:

– Zipf. Preferences collectively follow a perfect Zipf distribution. This means the expected number of receivers interested in the ith most popular source is inversely proportional to i. We modeled this because several studies have shown that Web access follows Zipf's law [7].
– Multi-modal. Preferences fall into modes. We partitioned sources evenly into five modes, and organized receivers so that each selects from sources in only one mode. This maps to applications with geographical correlations; e.g., in weather report dissemination, a user is only interested in certain regions.
– Uniform. Preferences are random. This is the worst-case scenario as there is no correlation among receivers' preferences. It serves as a baseline pattern.

We also modeled two application classes:

– Categorical. Each user is interested in 5% to 10% of 100 available categories. The sources are the categories, and the receivers the users. This resembles "live data feeds" such as stock quote services. Since in this application class there is usually a limited number of categories but a much larger user population, the default number of receivers is 1000. We model changes in a receiver's preferences with a complete change of its interest in the sources.
– Spatial. We use a 32x32 grid to represent a virtual space where the distribution of participants on the grid follows one of the above patterns. Each participant is interested in others located within a radius of 5.65 units from itself, which covers about 10% of the positions on the grid. The sources and the receivers are the participants, and at each position there is an avatar which serves only as a source. This models collaborative applications like network games. The default number of participants is 100. We model changes in a participant's location with a move of 5.65 units in the up, down, left, or right direction.

In the experiments, the order in which receivers' preferences were presented to the algorithm was random. The bootstrapping phase of the algorithm was used to incorporate new sources and receivers. We changed the preferences of all receivers as specified above, and applied the adaptation phase to re-cluster the partition. We measured the performance of the algorithm with the sampling (10%) and limited-iterations (5 times) heuristics to reduce its running time. We compared this performance to the locally optimal k-means algorithm and a simple round-robin scheme.

We found that if the preference patterns do not lend themselves to effective clustering, then an inexpensive algorithm like round-robin suffices. Figure 4(a) illustrates this comparison as we varied the number of receivers, given a categorical workload with a Zipf pattern. Round-robin performs nearly as well as our algorithm, and as the number of receivers scales, it is even comparable to the locally optimal algorithm. Note that we used 16 addresses here; with fewer addresses, the differences become even less obvious. This poor performance results because there is a heavy tail of "cold" sources that few receivers are interested in. Unless there are enough addresses to isolate these sources, a receiver needs only one or two sources out of each address. The two algorithms out-perform round-robin marginally because they find and group the few "hot" sources, leaving slightly more addresses for the "cold" ones. Similar results are observed with a uniform pattern, when the degree of correlation among receivers' preferences is even weaker.

In contrast, in the presence of opportunity for effective clustering, a more sophisticated algorithm like ours is necessary to achieve good performance. Figure 4(b) shows the comparison given a categorical workload with a multi-modal pattern. Not only does our algorithm achieve higher goodput than round-robin, by a factor of 2 to 4 depending on the number of receivers, it also performs as well as the locally optimal algorithm. Here, round-robin can only reduce the number of extra sources in each cluster, whereas the other two algorithms group sources in the same mode together. By the same token, we see similar results given a spatial workload in Figure 4(c), since receivers that are close together in virtual space can be placed in the same cluster. Results with a Zipf pattern are similar.

We also measured the actual execution times on an unloaded Pentium II 133 MHz machine with 128 MB of memory. Each iteration took about 0.3 to 0.5 seconds. These results say that the adaptation phase of the algorithm is practical, but only if its execution is infrequent compared to its running time.
Fig. 4. Performance of the clustering algorithm: (a) Categorical - Zipf; (b) Categorical - Multi-modal; (c) Spatial - Uniform. Each plot compares the average receiver goodput of the locally optimal algorithm, the heuristic adaptation, and round-robin (16 groups) as the number of receivers grows.
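For reference, here is a sketch of the binary Zipf workload and the goodput metric used above; the exact sampling procedure is our reconstruction of the stated setup (5-10% interest, popularity of the ith source proportional to 1/i).

```python
import random

def zipf_binary_prefs(n_receivers, n_sources=100, interest=0.08):
    """Each receiver wants all data from k = interest * n_sources sources,
    chosen with probability proportional to 1/rank (Zipf)."""
    k = max(1, int(interest * n_sources))
    weights = [1.0 / (i + 1) for i in range(n_sources)]
    prefs = []
    for _ in range(n_receivers):
        wanted = set()
        while len(wanted) < k:
            wanted.add(random.choices(range(n_sources), weights)[0])
        prefs.append([int(s in wanted) for s in range(n_sources)])
    return prefs

def goodput(useful, received):
    """Average receiver goodput: useful data over total data received."""
    return useful / received if received else 1.0
```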
Speeding up the algorithm, such as through parallel or distributed calculations, is part of our future work.
4 Applications and Implementation Status
We are in the process of implementing Matchmaker using the network/multimedia toolkit MASH [22]. We have designed the protocol to be generic and customizable by a range of applications and data types. In this section, we show how three applications — stock quote dissemination, distributed network games, and session directory services — can specialize the protocol to perform clustering and thus achieve better resource utilization.
4.1 Stockcaster
To experiment with Matchmaker, we are developing a stock quote dissemination application called Stockcaster. The Stockcaster server periodically polls the CheckFree website (http://qs.secapl.com) for stock quotes, and multicasts this information to its clients. There are on the order of 7000 symbols in the U.S. market alone.

Fig. 5. Screen capture of the Stockcaster user interface.
Assigning each symbol to a multicast address is not a scalable solution. Grouping symbols into categories like "technology stock" or "pharmaceutical stock" is also ineffective: a user usually has a diversified portfolio, and might subscribe to all categories to get just a few symbols from each. Clustering can be used to group symbols (i.e., sources) to be disseminated on the same multicast addresses, based on correlations in user portfolios. Symbols are used as the level of indirection since they serve as both application-level names and preferences.

Figure 5 is a screen capture of Stockcaster. Clients configure their preferences via the user interface. A preference report simply contains a list of symbols, such as:

{DELL BMED AAPL AOL AMZN ...}

The server does not explicitly generate source heartbeats, because they are implicitly covered by the reports. The rendezvous message is a list of symbol-to-multicast-address mappings of the form:

{230.1.1.1 {DELL BMED AAPL ...}}
{230.1.1.2 {AOL AMZN ...}}
...

The server only sends symbols that are mapped in the rendezvous message. The algorithm divides the total session bandwidth among the symbols based on the percentage of users interested in each symbol. The server uses this allocated bandwidth to determine the rate at which the stock quotes of each symbol are disseminated.
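The report and rendezvous formats above are easy to handle in a few lines; this sketch, with hypothetical parsing and a toy rendezvous table, shows how a client could pick its subscriptions and how the server might split bandwidth.

```python
def parse_report(text):
    """Parse a preference report such as '{DELL BMED AAPL AOL AMZN}'."""
    return set(text.strip().strip("{}").split())

# Toy rendezvous message: symbol-to-multicast-address mappings.
rendezvous = {"230.1.1.1": {"DELL", "BMED", "AAPL"},
              "230.1.1.2": {"AOL", "AMZN"}}

def subscriptions(portfolio):
    """A client joins every address carrying at least one of its symbols;
    symbols double as data names and preferences, so no per-member state
    is needed."""
    return sorted(a for a, syms in rendezvous.items() if syms & portfolio)

def per_symbol_bandwidth(total_bw, reports):
    """Split total session bandwidth across symbols in proportion to the
    fraction of users interested in each."""
    counts = {}
    for folio in reports:
        for sym in folio:
            counts[sym] = counts.get(sym, 0) + 1
    total = sum(counts.values())
    return {sym: total_bw * c / total for sym, c in counts.items()}

print(subscriptions(parse_report("{DELL AMZN}")))  # ['230.1.1.1', '230.1.1.2']
```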
4.2 Network Games and DIS
Detailed studies of real-life exercises show that clustering is necessary in DIS environments: [26] found that there were 15,500 entities in 50 km square grid cells within a 2200x1500 km area in a particular simulation. In network games and DIS, assigning each entity to a multicast address is not scalable, nor is statically dividing the virtual space evenly into regions. Entities can congregate in certain regions, and their locations can be continuously changing. [20] found that
60% of the terrain was outside the detection range of all entities in a simulation. Resources assigned to deserted regions and disabled entities are therefore wasted.

Fig. 6. Cumulative data layering of increasing update frequency.

Preferences can be represented by the locations of entities. For example, an entity can be interested only in other entities close to itself. Alternatively, an entity can be interested in all entities in the virtual environment, but at different levels: an entity might want frequent and detailed updates from other entities close to itself, and less so from those farther away. Updates can be organized into layers of cumulatively increasing frequencies, where each layer is sent on a different multicast address. Depending on the distance between itself and another entity, an entity subscribes to a varying number of layers to get the different frequency levels necessary for interaction. Figure 6 illustrates this data layering.

The clustering algorithm can group entities together based on the distances between their locations. The rendezvous messages can contain mappings from the centroids of entity locations to multicast addresses, such as:

{230.1.1.1 {x1,y1}}
{230.1.1.2 {x2,y2}}
...

An entity then transmits and/or subscribes to the address with the centroid closest to itself.
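A sketch of the centroid rendezvous and cumulative layering follows; the linear distance-to-layer falloff is an illustrative assumption, since the paper only requires that closer entities receive more layers.

```python
import math

# Toy rendezvous message: cluster centroids mapped to addresses.
centroids = {"230.1.1.1": (4.0, 7.0), "230.1.1.2": (20.0, 25.0)}

def nearest_cluster(pos):
    """An entity transmits to and subscribes at the address whose
    centroid is closest to its own position in the virtual space."""
    return min(centroids, key=lambda a: math.dist(pos, centroids[a]))

def layers_to_join(distance, max_layers=4, radius=5.65):
    """Cumulative layering (Fig. 6): subscribe to more layers, and hence
    receive more frequent updates, the closer another entity is."""
    if distance >= 2 * radius:
        return 1                          # base layer only
    return max(1, round(max_layers * (1 - distance / (2 * radius))))

print(nearest_cluster((5.0, 6.0)))              # 230.1.1.1
print(layers_to_join(1.0), layers_to_join(10.0))  # 4 1
```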
4.3 Session Directory Service
The session directory tool sdr announces information on MBone sessions. Currently, the bandwidth used by sdr is limited to 200 bits per second [12], which is divided evenly among all announcements. As pointed out by Swan et al. [34], with just 25 sessions this constraint can lead to a wait of 10-20 minutes to see a particular announcement. Clustering is beneficial here because an sdr client is not necessarily interested in all the sessions. More bandwidth should be allocated to popular sessions, such as a NASA shuttle launch, than to less interesting ones, like test sessions. Likewise, the announcements for sessions that are occurring or will soon start should be allowed to disseminate more frequently.

There are several ways to configure user preferences in sdr. One is for users to enter their categories of interest via the user interface. Note that categories alone might not be sufficient, and a more fine-grained keyword approach can be useful. Keywords are helpful when users are interested in specific events,
such as different sports during Olympic seasons. These keywords can be incorporated in the clustering algorithm using an ontology tool to identify similar terms. Preferences can also be implicitly generated based on the current time. For example, the clustering algorithm can artificially boost the preferences of sessions that are occurring now or will soon start, so that more bandwidth is assigned to current sessions.
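As one possible realization of the time-based boost, a hedged sketch; the weight values and the one-hour window are invented for illustration.

```python
import time

def session_weight(start, stop, base=1.0, now=None):
    """Boost announcement preferences for sessions in progress or about
    to start, so their announcements are disseminated more frequently."""
    now = time.time() if now is None else now
    if start <= now <= stop:
        return 4.0 * base        # session in progress
    if 0 < start - now <= 3600:
        return 2.0 * base        # starting within the hour
    return base
```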
5 Related Work
Multicast state aggregation [27,35] combines entries in a multicast forwarding table when the outgoing interface sets of the entries match. Variants of this approach include trading data leakage for greater reduction in entries. This approach handles the problem of scaling multicast forwarding state across applications, which is not currently dealt with by clustering. However, it does not decrease the amount of control messages that maintain the multicast forwarding state: at any time, routers still need to process these periodic messages on all multicast addresses in use concurrently.

Recent proposals for router-assisted multicast forwarding services, such as PGM [33], Breadcrumb Forwarding Services [39], and AIM [18], provide mechanisms to send data on a per-packet basis to a subset of receivers on a multicast tree. Clustering is independent of the underlying multicast routing infrastructure, and is orthogonal to and works with these new services. Additionally, the amount of state and processing required at the routers is associated with the granularity of forwarding. We can apply clustering to group approximately similar forwarding requests at the application level, thus decreasing load at the network level.

Levine et al. [19] suggest using addressable routing [18] to consider receiver interest in IP delivery. While this works well if interest correlates to topology, a hybrid scheme that uses clustering is necessary otherwise. The authors conclude that low-latency address allocation and group creation should be supported by the Internet architecture to realize large-scale multicast applications.

The Destination Set Grouping (DSG) scheme splits receivers in a multicast session into disjoint groups with which the source carries on independent conversations. Research in DSG has focused on congestion control [4,8] and fairness in video distribution [9]. Our research deals with heterogeneity at the application level.

DIS researchers have been looking at using multicast groups to split entities in the virtual environment. Macedonia et al. [20] propose dividing entities based on temporal, spatial, and functional classes. Pullen [25,26] presents a similar approach where the virtual space is split into a grid of a certain cell size. In both cases, the constraints of network state and bandwidth consumption are not considered.
6 Conclusions and Discussions
In this paper, we have examined an approach to accommodating preference heterogeneity, in which approximately similar preferences are clustered together and transmitted on a limited number of multicast addresses while consuming bounded total session bandwidth. We have presented the Matchmaker protocol, which coordinates sources and receivers to perform this clustering. The protocol is designed to be scalable, fault-tolerant, and reliable through the use of decentralized concepts, soft-state operations, and sampling techniques. Our simulation results show that clustering can reduce the amount of superfluous data at the receivers for certain preference distributions. By factoring application-level semantics into the protocol, it can work with different application requirements and data-type characteristics.

The feasibility of Matchmaker also lies in the scalability of the clustering algorithm. Even with the sampling heuristics, the running time of the algorithm depends on the number of sources and receivers in the application. If receiver preferences change frequently, the algorithm might not calculate a new partition fast enough before they change again. We are currently modifying the algorithm so that clustering is executed in parallel on subsets of sources and receivers. Although the performance of the resulting partition might be degraded, we can employ application-level knowledge to divide the sources and receivers more intelligently for independent clustering.

Some have argued that the "killer" application for multicast is Internet TV [15], which can involve many independent content providers. In this paper, we focus on clustering within a single application to reduce multicast forwarding state and control-message processing at the routers. The protocol we presented does not handle the scenario of many separate applications each using only a small number of multicast addresses. We are currently investigating end-to-end clustering across applications. The idea is to allocate multicast addresses to applications based on similarity in their durations, data rates, and subscriber locations. It is also worthwhile to study the tradeoffs between solving this problem in an end-to-end manner and at the router level [27,35].
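For illustration, the following sketch shows the general flavor of such a clustering step on binary preference vectors. It is our own greedy approximation, not the Matchmaker algorithm itself, and the parameters k (the multicast-address budget) and radius are illustrative:

```python
# Sketch: group approximately similar binary preference vectors into at
# most k clusters (our greedy illustration, not the paper's algorithm).

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def cluster_preferences(prefs, k, radius):
    """prefs: list of equal-length 0/1 tuples; returns a cluster list.

    A receiver joins the first cluster whose centre is within `radius`;
    otherwise it founds a new cluster while fewer than k exist, else it
    joins the nearest one (and will receive some superfluous data).
    """
    clusters = []   # each: {"centre": tuple, "members": [prefs]}
    for p in prefs:
        best = min(clusters, key=lambda c: hamming(c["centre"], p),
                   default=None)
        if best is not None and (hamming(best["centre"], p) <= radius
                                 or len(clusters) >= k):
            best["members"].append(p)
        else:
            clusters.append({"centre": p, "members": [p]})
    return clusters

groups = cluster_preferences([(1,1,0), (1,0,0), (0,0,1), (1,1,1)],
                             k=2, radius=1)
print([c["members"] for c in groups])
```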
Acknowledgments
We are grateful to Tzi-cker Chiueh, Adam Costello, Christophe Diot, Mark Handley, Emmanuel Lety, Manuel Oliveira, Angela Schuett, and Helen Wang for their thoughtful discussions and comments on this paper. We also thank the anonymous reviewers for their insightful feedback. This work was supported by DARPA contract N66001-96-C-8505, by the State of California under the MICRO program, and by NSF Contract CDA 94-01156.
References
1. Amir, E., McCanne, S., and Katz, R. Receiver-driven bandwidth adaptation for light-weight sessions. In ACM Multimedia '97 (Seattle, WA, November 1997).
2. Amir, E., McCanne, S., and Katz, R. An Active Service Framework and its Application to Real-time Multimedia Transcoding. In Proceedings of SIGCOMM (Vancouver, Canada, September 1998).
3. Amir, E., McCanne, S., and Zhang, H. An application-level video gateway. In Proceedings of ACM Multimedia '95 (Nov. 1995), ACM.
4. Ammar, M. H., and Wu, L. Improving the Throughput of Point-to-Multipoint ARQ Protocols through Destination Set Splitting. In Proceedings of IEEE INFOCOM '92 (Florence, Italy, May 1992).
5. Ballardie, T., Francis, P., and Crowcroft, J. Core Based Trees (CBT): An Architecture for Scalable Inter-Domain Multicast Routing. In Proceedings of SIGCOMM '93 (San Francisco, CA, Sept. 1993), ACM, pp. 85-95.
6. Birman, K., Schiper, A., and Stephenson, P. Lightweight causal and atomic group multicast. ACM Transactions on Computer Systems 9, 3 (Aug. 1991), 272-314.
7. Breslau, L., Cao, P., Fan, L., Phillips, G., and Shenker, S. Web Caching and Zipf-like Distributions: Evidence and Implications. In Proceedings of INFOCOM (New York, NY, March 1999).
8. Cheung, S. Y., and Ammar, M. H. Using Destination Set Grouping to Improve the Performance of Window-Controlled Multipoint Connections. Computer Communications Journal 19 (1996), 723-736.
9. Cheung, S. Y., Ammar, M. H., and Li, X. On the Use of Destination Set Grouping to Improve Fairness in Multicast Video Distribution. In Proceedings of IEEE INFOCOM '96 (San Francisco, CA, March 1996).
10. Deering, S., Estrin, D., Farinacci, D., and Jacobson, V. An Architecture for Wide-Area Multicast Routing. In Proceedings of SIGCOMM '94 (University College London, London, U.K., Sept. 1994), ACM.
11. Floyd, S., Jacobson, V., Liu, C.-G., McCanne, S., and Zhang, L. A reliable multicast framework for light-weight sessions and application level framing. IEEE/ACM Transactions on Networking (1995).
12. Handley, M. SAP: Session Announcement Protocol. Internet Draft, Nov 19, 1996.
13. Handley, M., and Jacobson, V. SDP: Session Directory Protocol. Internet Draft, Mar 26, 1997.
14. Hartigan, J. Clustering Algorithms. John Wiley and Sons, 1975.
15. Holbrook, H. W., and Cheriton, D. R. IP Multicast Channels: EXPRESS Support for Large-Scale Single-Source Applications. In Proceedings of ACM SIGCOMM (Harvard, MA, 1999).
16. Jacobson, V. SIGCOMM '94 Tutorial: Multimedia conferencing on the Internet, Aug. 1994.
17. Lee, J. S. Overview of the Technical Basis of Qualcomm's CDMA Cellular Telephone System Design. In Proceedings of ICCS (Nov. 1994).
18. Levine, B., and Garcia-Luna-Aceves, J. Internet Multicast Based on Group-Relative Addressing. Tech. rep., University of California at Santa Cruz, 1999.
19. Levine, B., Crowcroft, J., Diot, C., Garcia-Luna-Aceves, J., and Kurose, J. Consideration of Receiver Interest in Content for IP Delivery. Tech. rep., University of California at Santa Cruz, 1999. Submitted for publication.
20. Macedonia, M. R., Zyda, M. J., Pratt, D. R., Brutzman, D. P., and Barham, P. T. Exploiting Reality with Multicast Groups: A Network Architecture for Large-Scale Virtual Environments. IEEE Computer Graphics and Applications 15, 5 (Sept 1995), 38-45.
21. MacKie-Mason, J. K., and Varian, H. R. Public Access to the Internet. Prentice-Hall, Englewood Cliffs, NJ, 1994, ch. Pricing the Internet.
22. McCanne, S., et al. Towards a Common Infrastructure for Multimedia Networking Middleware. In Proceedings of the Seventh International Workshop on Network and OS Support for Digital Audio and Video (St. Louis, MO, May 1997), ACM.
23. McCanne, S., Jacobson, V., and Vetterli, M. Receiver-driven layered multicast. In Proceedings of SIGCOMM '96 (Stanford, CA, Aug. 1996), ACM.
24. McCanne, S. R. Scalable Multimedia Communication with Internet Multicast, Light-weight Sessions, and the MBone. Tech. Rep. CSD-98-1002, U.C. Berkeley, 1998.
25. Pullen, J. M., and White, E. L. Analysis of Dual-Mode Multicast for Large Scale DIS Exercises. In Proceedings of the 13th DIS Workshop on Standards for Interoperability of Distributed Simulations (Sept 1995).
26. Pullen, J. M., and White, E. L. Simulation of Dual-Mode Multicast Using Real-World Data. In Proceedings of the 14th DIS Workshop on Standards for Interoperability of Distributed Simulations (Mar 1996).
27. Radoslavov, P. I., Govindan, R., and Estrin, D. Exploiting the Bandwidth-Memory Tradeoff in Multicast State Aggregation. Tech. rep., University of Southern California/ISI, 1999. Submitted for publication.
28. Raman, S., and McCanne, S. Scalable Data Naming for Application Level Framing in Reliable Multicast. In Proceedings of ACM Multimedia '98 (Bristol, England, September 1998), ACM.
29. Raman, S., and McCanne, S. A Model, Analysis, and Protocol Framework for Soft State-based Communication. In SIGCOMM '99 (Harvard, MA, September 1999).
30. Schooler, E. A Multicast User Directory Service for Synchronous Rendezvous. Tech. rep., California Institute of Technology, Sept 1996.
31. Schuett, A., Raman, S., Chawathe, Y., McCanne, S., and Katz, R. A Soft-state Protocol for Accessing Multimedia Archives. In Proceedings of the 8th International Workshop on Network and Operating Systems Support for Digital Audio and Video (NOSSDAV) (Cambridge, UK, July 1998).
32. Schulzrinne, H., Casner, S., Frederick, R., and Jacobson, V. RTP: A Transport Protocol for Real-Time Applications. Internet Engineering Task Force, Audio-Video Transport Working Group, Nov. 1995. Internet Draft; expires 3/1/96.
33. Speakman, T., Farinacci, D., Lin, S., and Tweedly, A. Pretty Good Multicast (PGM) Transport Protocol Specification, Jan. 1998. Internet Draft (RFC pending).
34. Swan, A., McCanne, S., and Rowe, L. Layered Transmission and Caching for the Multicast Session Directory Service. In Proceedings of ACM Multimedia (September 1998).
35. Thaler, D., and Handley, M. On the Aggregatability of Multicast Forwarding State. Tech. Rep. MSR-TR-99-34, Microsoft Research, 1999.
36. van Renesse, R., Birman, K. P., and Maffeis, S. Horus: A Flexible Group Communication System. Communications of the ACM (1996).
37. Vicisano, L., Rizzo, L., and Crowcroft, J. TCP-like congestion control for layered multicast data transfer. In Proceedings of INFOCOM (San Francisco, CA, March 1998).
38. Wong, T., Katz, R., and McCanne, S. Efficient Multi-Party Applications using Preference Clustering. Tech. rep., July 1999. Submitted for publication.
39. Yano, K., and McCanne, S. The Breadcrumb Forwarding Service and the Digital Fountain Rainbow: Toward a TCP-Friendly Reliable Multicast. Tech. rep., University of California at Berkeley, 1999. Submitted for publication.
Layered Multicast Group Construction for Reliable Multicast Communications
Miki Yamamoto, Yoshitsugu Sawa, and Hiromasa Ikeda
Department of Communications Engineering, Osaka University
2-1 Yamadaoka, Suita, Osaka 565-0871 Japan
{yamamoto, sawa, ikeda}@comm.eng.osaka-u.ac.jp
http://www2b.comm.eng.osaka-u.ac.jp
Abstract. In reliable multicast communications, the transmission rate of a sender should be restricted to the capability of the lowest-capability node in order to support reliable delivery of packets to all receivers in a multicast group. Even a node of high capability must receive packets at a lower rate when there are lower-capability nodes in the corresponding multicast group: in reliable multicast communications, heterogeneity of node capability degrades the total performance of a multicast group. In this paper, we present the layered multicast group construction, a technical solution to the performance degradation caused by heterogeneity. The basic concept of the layered multicast group construction is to divide a multicast group into multiple subgroups and order them based on node capability. This reduces the diversity of node capability inside each subgroup, which improves the delay performance of the whole multicast group. We investigate the optimal form of the layered multicast group construction, i.e., the optimal dividing points of the subgroups. Numerical examples show that average delay performance is remarkably improved by the layered multicast group construction compared with a conventional single multicast group construction, and that the improvement obtained by two or three subgroups is sufficient for practical use.
1 Introduction
Multicast communication is a popular and important scheme for disseminating information from a sender to a group of nodes. For instance, wb (whiteboard), videoconferencing, and traffic-report dissemination are multicast applications [5]. Most of them are realized by IP multicast, in which transmission of an IP datagram to a host group is identified by a single IP destination address [6][7]. Based on their communication requirements, these applications can be grouped into two classes: real-time applications and reliable applications. In a real-time application, voice and motion pictures are transmitted under a bounded delay constraint [8]. If the packet arrival rate at one of the receivers exceeds its capability, excess packets will be discarded at this receiver. In a real-time application, retransmission of packets may be less important because of the delay constraint; thus, in almost all real-time applications, the sender does not control the data flow. In a reliable application, data should be transmitted under a loss constraint [9][10][11][12]. To guarantee reliability, the sender must retransmit a packet when it is lost at at least one receiver. When a lot of packets are lost, retransmission of these packets degrades the total performance of the corresponding reliable
multicast group. So, a flow control mechanism which adjusts the packet transmission rate of the sender to the capability of the lowest receiver in a group should be introduced in reliable multicast communications [14]. When adequate flow control is implemented in reliable multicast communications, receivers of high capability must receive packets at the lowest receiver's rate. The larger the diversity of receiver capability is, the larger the unused node capability in a group is. The capability with which a node receives packets is restricted by many factors, e.g., the processing capability of the computer, the network bandwidth between the sender and the node, the number of link hops, and so on. For example, in the Internet, various kinds of computers, e.g. high-speed supercomputers, workstations, and personal computers, are interconnected. VLSI chips for computers of different generations also have significantly different processing capabilities.

In this paper, we propose a layered multicast group construction for reliable multicast communications, which divides a whole multicast group into multiple subgroups based on receiver capability and constructs layered subgroups. By dividing a multicast group based on receiver capability, the diversity of receiver abilities inside each subgroup, i.e., the heterogeneity of node capability inside a subgroup, is reduced. We also discuss the optimal construction of the layered multicast group and analyze how many subgroups are necessary to obtain substantial improvement.

The remainder of this paper is structured as follows. Section 2 reviews related work and discusses the differences between this paper and others. Section 3 presents the layered multicast group construction. Section 4 investigates the optimal construction of the layered multicast group under the condition that a distribution of node capabilities is given. Section 5 provides numerical examples of the optimal construction for several distributions of node capabilities. Section 6 concludes this paper.
2 Related Work
Many schemes concerning layered construction of multicast receivers have been proposed. Here, we describe related work and show how it differs from this paper. Cheung [1] has proposed the destination set grouping method for multicast video distribution. The concept of destination splitting is similar to ours, but its application is video distribution, which is different from ours, and there is no investigation of how many destination sets are preferable. McCanne [8] has proposed the use of multiple multicast groups for flow control. This scheme is a receiver-initiated approach and does not show how many multicast groups are adequate, either. For error recovery, which is a different topic from this paper, Ammar [2] and Kasera [3] suggest using multiple multicast groups. For flow control in reliable multicast communications, Bhattacharyya [4] has proposed a rate-controlled bulk data transfer method. In this scheme, receivers are not split according to their capabilities; the same data are transmitted on multiple multicast channels with different transmission rates. Receivers with higher capability can receive data from multiple channels, which enables faster completion of file transfer. The basic idea of how to use multiple multicast groups is significantly different from ours.
Fig. 1. Layered Multicast Group Construction (subgroup 1 forms layer 1; subgroups 2, 3, ..., L form layer 2)
3 Layered Multicast Group Construction
In this section, we present an efficient network construction for reliable multicast communications in a heterogeneous network: the layered multicast group construction. In reliable multicast communications, the sender must adjust its transmission rate to the node of the lowest capability in the corresponding multicast group. The worst case is that only one node of extremely low capability is included in a multicast group in which all the other nodes have high capability. In such a case, many high-capability nodes receive packets at the same rate as the low-capability node. This means that in a heterogeneous network, diversity of node capability degrades the total performance of a multicast group.

We propose a layered multicast group construction, which divides a multicast group into multiple subgroups according to node capability and can thus reduce the heterogeneity of node capability in each subgroup. The divided subgroups construct two-layer multicast groups (Fig. 1). The subgroup of the highest capability is located at the first layer and all the other subgroups are located at the second layer. The node of the highest capability in the first layer behaves as a source node for the subgroup of the second-highest capability, and in general the node of the n-th highest capability relays packets to the subgroup of the (n+1)-st highest capability. At a node relaying packets to the lower layer, packets arrive from the source at a high rate and must be transmitted at a lower rate. This leads to sojourn of packets at this node, so the node behaves as a log-server [11]. The layered multicast group construction decreases the diversity of node capability in each subgroup compared with a single multicast group; nodes in a higher-capability subgroup can receive packets at a higher rate than those in a lower-capability subgroup.

Packet transmission in the layered multicast group construction flows as follows. When the source node is in the highest-capability subgroup, it multicasts packets to all nodes in the highest-capability subgroup, and the node of the n-th highest capability multicasts (relays) packets to the subgroup of the (n+1)-st highest capability (Fig. 2(a)). When the source node is included in a lower subgroup, it multicasts packets to all nodes
Fig. 2. Packet flow in the layered multicast group construction: (a) sender in the highest subgroup; (b) sender in a lower subgroup
included in the higher subgroups and in its own subgroup. To make this multicast applicable, a multicast subgroup which includes the corresponding subgroup and also the higher subgroups (the shaded part in Fig. 2(b)) should be prepared for each lower subgroup. For a subgroup lower than the corresponding subgroup, the n-th highest node relays packets to the (n+1)-st highest subgroup (Fig. 2(b)).
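As an illustration of this construction, the sketch below (ours, under the paper's assumptions) sorts nodes by capability, cuts them into subgroups at given dividing points, sets each subgroup's rate to that of its slowest member, and assigns the n-th highest-capability node as the relay to subgroup n+1. The relay burden analysed in Section 4 is ignored here for simplicity:

```python
# Sketch: build a layered multicast group construction from node
# capabilities (our illustration of Section 3, not a full protocol;
# the relay burden on subgroup 1 is ignored here).

def layer_groups(capabilities, cuts):
    """capabilities: list of node rates; cuts: sizes of S1..SL.

    Returns subgroups, highest capability first, each with its members,
    its transmission rate (its slowest member), and the node that
    relays packets down to it from the layer above.
    """
    nodes = sorted(range(len(capabilities)),
                   key=lambda i: capabilities[i], reverse=True)
    groups, start = [], 0
    for size in cuts:
        members = nodes[start:start + size]
        groups.append({"members": members,
                       "rate": min(capabilities[i] for i in members)})
        start += size
    # The n-th highest-capability node relays to subgroup n+1.
    for n, g in enumerate(groups[1:], start=1):
        g["relay_from"] = nodes[n - 1]
    return groups

caps = [1, 2, 3, 8, 9, 10, 30, 40]       # heterogeneous receivers
for g in layer_groups(caps, cuts=[3, 3, 2]):
    print(g)
```

With the illustrative capabilities above, the fast nodes proceed at rate 10 instead of being dragged down to 1 by the slowest receivers, which is exactly the effect the construction aims for.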
4 Optimal Construction
For an efficient layered multicast group construction, the optimal dividing points which divide a whole multicast group into subgroups are an important parameter. In this section, we investigate the optimal network construction when a distribution of node capabilities is given. We assume three types of distribution of node capability: linear, exponential concave, and exponential convex. The discussion in this section can easily be extended to any distribution of node capability.

4.1 Network Model
In this section, we describe our network model:
– N_total: the number of nodes in a multicast group;
– R_i: capability of node i (0 ≤ i ≤ N_total − 1);
– S_i: the subgroup of the i-th highest capability;
– N_i: total number of nodes included in the subgroup S_i and all the lower subgroups S_j (j ≥ i + 1); from this definition, N_1 = N_total (fixed);
– T_i: transmission rate of the subgroup S_i.

Without loss of generality, we assume that R_i ≤ R_j (i < j). From the definition of S_i, R_m > R_l when R_m ∈ S_p, R_l ∈ S_q and p < q. (Note that R_0 denotes the node of the lowest capability, while S_1 is the highest-capability subgroup: the numbering order with respect to capability is reversed between nodes and subgroups.)
In reliable multicast communications, the transmission rate is restricted to the lowest capability of the nodes in the corresponding subgroup. In subgroups lower than S_1, the transmission rate of a subgroup S_i (i ≥ 2) is restricted by the capability of its lowest-capability node, R_{N_{i+1}}:

T_i = R_{N_{i+1}} \quad (i \ge 2).   (1)

In the subgroup S_1, some nodes carry the burden of packet relay. We assume that the processing load for sending a packet is the same as that for receiving a packet [13]. With this assumption, the transformed capability \hat{R}_j of a node j which relays packets to subgroup S_k can be derived as follows:

\hat{R}_j = R_j - R_{N_{k+1}}.   (2)

Thus, the transmission rate of the subgroup S_1 is defined as

T_1 = \min\{ \min_{j \in \hat{N}_1} \hat{R}_j,\ R_{N_2} \},   (3)

where \hat{N}_1 denotes the set of nodes which relay packets in subgroup S_1.

The average delay D necessary for transmitting a unit-length file to a whole multicast group is defined as

D = \sum_{i=1}^{L} \frac{N_i - N_{i+1}}{N_{total} T_i},   (4)

where N_{L+1} is assumed to be 0 and L is the total number of subgroups. We define a subgroup construction to be optimal when the average delay defined above is minimal. The optimal construction also gives the highest average throughput, as is easily seen from the definition of average delay, (4).

4.2 Linear Distribution
In this section, the capability of nodes is assumed to be expressed as a linear function,

f(x) = Ax + B,   (5)

where x denotes the node number (x = 0, 1, ..., N_total − 1) and A > 0, B > 0. For the subgroups lower than S_1, the transmission rate of a subgroup S_i is

T_i = R_{N_{i+1}} = A N_{i+1} + B \quad (i \ge 2).   (6)

In the subgroup S_1, several nodes are candidates for restricting its transmission rate. The node of capability R_{N_total-1} relays packets to the subgroup S_2, and the node of capability R_{N_total-2} relays to S_3. R_{N_3} and R_{N_4} are the lowest capabilities in subgroups S_2 and S_3, respectively. The difference between R_{N_3} and R_{N_4} is

R_{N_3} - R_{N_4} = (A N_3 + B) - (A N_4 + B) = A(N_3 - N_4).   (7)

Here, (N_3 − N_4) is the number of nodes in subgroup S_3 and must be greater than or equal to one, so R_{N_3} − R_{N_4} ≥ A. From (5),

R_{N_total-1} - R_{N_total-2} = A.   (8)

Thus,

R_{N_total-1} - R_{N_total-2} \le R_{N_3} - R_{N_4}.   (9)

This means

R_{N_total-1} - R_{N_3} \le R_{N_total-2} - R_{N_4}.   (10)

In subgroup S_1, the node of capability R_{N_total-1} relays packets to subgroup S_2. The transformed capability of this node is \hat{R}_{N_total-1} = R_{N_total-1} - R_{N_3}, because this node should relay packets at rate R_{N_3}. Similarly, the transformed capability of the node of capability R_{N_total-2}, which relays packets to S_3, is \hat{R}_{N_total-2} = R_{N_total-2} - R_{N_4}. From (10), \hat{R}_{N_total-1} ≤ \hat{R}_{N_total-2}. Similarly, the following relation among the transformed capabilities can be obtained:

\hat{R}_{N_total-1} \le \hat{R}_{N_total-2} \le \hat{R}_{N_total-3} \le \cdots \le \hat{R}_{N_total-(L-1)}.   (11)

Relation (11) means that the transmission rate of subgroup S_1 will be restricted by \hat{R}_{N_total-1} or R_{N_2}:

T_1 = \min\{ \hat{R}_{N_total-1},\ R_{N_2} \}.   (12)

We investigate the optimal construction for the two cases \hat{R}_{N_total-1} ≥ R_{N_2} and \hat{R}_{N_total-1} < R_{N_2}.

[\hat{R}_{N_total-1} ≥ R_{N_2}]
In this case, A(N_total − N_3 − 1) ≥ A N_2 + B and T_1 = R_{N_2}, so

D = \sum_{i=1}^{L} \frac{N_i - N_{i+1}}{N_{total}(A N_{i+1} + B)}.   (13)

In (13), N_{L+1} is assumed to be 0. N_i is originally a discrete value, but we treat D as a continuous function of N_i, i.e., we remove the restriction of N_i to integers. Then (13) can be differentiated with respect to N_i, and the differential coefficient ∂D/∂N_i is obtained as follows:

\frac{\partial D}{\partial N_i} = \frac{1}{N_{total}} \left( \frac{1}{A N_{i+1} + B} - \frac{A N_{i-1} + B}{(A N_i + B)^2} \right) \quad (i = 2, 3, \dots, L).   (14)

Since A > 0 and B > 0, \partial^2 D / \partial N_i^2 > 0, so ∂D/∂N_i increases monotonically. The N_i should also satisfy

N_1 \ge N_2 \ge \cdots \ge N_L.   (15)

So the optimal dividing points can be obtained from the following equations:

\frac{\partial D}{\partial N_i} = 0,   (16)

N_1 \ge N_2 \ge \cdots \ge N_L.   (17)
If the N_i satisfying (16) do not satisfy (17), the number of layers L is too large. For example, when N_1 and N_2 satisfying (16) have the relation N_1 < N_2, the optimal values of N_1 and N_2 which minimize D satisfy N_1 = N_2. This means that no nodes are included in subgroup S_1, so the number of layers L is too large.

[\hat{R}_{N_total-1} < R_{N_2}]
In this case, A(N_total − N_3 − 1) < A N_2 + B and T_1 = A(N_total − N_3 − 1), so

D = \frac{N_1 - N_2}{N_{total} A(N_{total} - N_3 - 1)} + \sum_{i=2}^{L} \frac{N_i - N_{i+1}}{N_{total}(A N_{i+1} + B)}.   (18)

The differential coefficients ∂D/∂N_i are

\frac{\partial D}{\partial N_2} = \frac{1}{N_{total}} \left( \frac{1}{A N_3 + B} - \frac{1}{A(N_{total} - N_3 - 1)} \right),   (19)

\frac{\partial D}{\partial N_3} = \frac{1}{N_{total}} \left( \frac{1}{A N_4 + B} + \frac{N_1 - N_2}{A(N_{total} - N_3 - 1)^2} - \frac{A N_2 + B}{(A N_3 + B)^2} \right),   (20)

\frac{\partial D}{\partial N_i} = \frac{1}{N_{total}} \left( \frac{1}{A N_{i+1} + B} - \frac{A N_{i-1} + B}{(A N_i + B)^2} \right) \quad (i = 4, 5, \dots, L).   (21)

In (19), A N_3 + B = T_2 and A(N_total − N_3 − 1) = T_1 are the transmission rates of S_2 and S_1, respectively. From the basic concept of the layered multicast group construction, the transmission rate of S_1 should be higher than that of S_2, so A(N_total − N_3 − 1) ≥ A N_3 + B. Thus,

\frac{\partial D}{\partial N_2} \ge 0.   (22)

This means that the optimal value of N_2 should satisfy

A N_2 + B = A(N_{total} - N_3 - 1).   (23)

Since \partial^2 D / \partial N_i^2 > 0 (i ≥ 3), the optimal point can be obtained from (23) and the following equations:

\frac{\partial D}{\partial N_i} = 0 \quad (i = 3, 4, \dots, L),   (24)

N_1 \ge N_2 \ge \cdots \ge N_L.   (25)
4.3 Exponential Concave Distribution
In this section, the capability of nodes is assumed to be expressed as an exponential concave function,

f(x) = C \exp[Dx],   (26)

where x denotes the node number (x = 0, 1, ..., N_total − 1) and C > 0, D > 0. As described in equation (3), in the highest-capability subgroup S_1 several nodes can be candidates for restricting the transmission rate of the subgroup. We discuss the optimal construction for two cases: one where the originally lowest-capability node in S_1 restricts T_1, and one where a node relaying packets to S_k restricts T_1.

[Originally lowest-capability node restricts T_1]
In this case, the transmission rate of each subgroup can be expressed as

T_i = C \exp[D N_{i+1}] \quad (i = 1, 2, \dots, L),   (27)

where N_{L+1} is assumed to be 0. Thus, the average delay of the whole multicast group is

D = \sum_{i=1}^{L} \frac{N_i - N_{i+1}}{N_{total} C \exp[D N_{i+1}]}.   (28)

The differential coefficient ∂D/∂N_i is

\frac{\partial D}{\partial N_i} = \frac{1}{N_{total} C \exp[D N_{i+1}]} - \frac{1 + D(N_{i-1} - N_i)}{N_{total} C \exp[D N_i]} \quad (i = 2, 3, \dots, L),   (29)

and \partial^2 D / \partial N_i^2 > 0. Thus ∂D/∂N_i increases monotonically and the optimal construction satisfies

\frac{\partial D}{\partial N_i} = 0 \quad (i = 2, 3, \dots, L),   (30)

N_1 \ge N_2 \ge \cdots \ge N_L.   (31)

[Node which relays packets to S_k restricts T_1]
The original capability of a node relaying packets to S_k is R_{N_total-k}. The transformed capability of this node is

\hat{R}_{N_total-k} = R_{N_total-k} - R_{N_{k+1}} = C \exp[D(N_{total} - k)] - C \exp[D N_{k+1}].   (32)

When this node restricts T_1, the following relation is satisfied:

C \exp[D(N_{total} - k)] - C \exp[D N_{k+1}] \le C \exp[D N_2] \quad (2 \le k \le L - 1).   (33)
The average delay of the whole multicast group in this case is

D = \frac{N_1 - N_2}{N_{total}(C \exp[D(N_{total} - k)] - C \exp[D N_{k+1}])} + \sum_{i=2}^{L} \frac{N_i - N_{i+1}}{N_{total} C \exp[D N_{i+1}]}.   (34)

The differential coefficient ∂D/∂N_2 satisfies ∂D/∂N_2 ≥ 0. This means that the average delay increases as N_2 increases, so the minimum value of N_2 is optimal. From equation (33), the optimal value of N_2 should satisfy

C \exp[D(N_{total} - k)] - C \exp[D N_{k+1}] = C \exp[D N_2].   (35)

Due to lack of space, we omit the expression of ∂D/∂N_{k+1}, but \partial^2 D / \partial N_{k+1}^2 > 0 and ∂D/∂N_{k+1} increases monotonically. Thus the optimal construction satisfies

\frac{\partial D}{\partial N_{k+1}} = 0.   (36)

For i = 3, 4, ..., L with i ≠ k + 1, ∂D/∂N_i has the same expression as equation (29), so ∂D/∂N_i increases monotonically. Thus the optimal construction satisfies

\frac{\partial D}{\partial N_i} = 0 \quad (i = 3, 4, \dots, L,\ i \ne k + 1),   (37)

N_1 \ge N_2 \ge \cdots \ge N_L.   (38)

Consequently, in the case where a node relaying packets to subgroup S_k restricts T_1, the optimal construction can be obtained from equations (35), (36), (37) and (38).

4.4 Exponential Convex Distribution
In this section, the distribution of node capability is expressed as an exponential convex function,

f(x) = E - F \exp[-Gx],   (39)

where x denotes the node number (x = 0, 1, ..., N_total − 1) and E > 0, F > 0, G > 0. As in Section 4.3, we discuss the optimal construction for two cases: one where the originally lowest-capability node in S_1 restricts T_1, and one where a node relaying packets to S_k restricts T_1.

[Originally lowest-capability node restricts T_1]
In this case, the transmission rate of each subgroup can be expressed as

T_i = E - F \exp[-G N_{i+1}] \quad (i = 1, 2, \dots, L),   (40)

where N_{L+1} is assumed to be 0. Thus, the average delay of the whole multicast group is

D = \sum_{i=1}^{L} \frac{N_i - N_{i+1}}{N_{total}(E - F \exp[-G N_{i+1}])}.   (41)
As in Section 4.3, \partial^2 D / \partial N_i^2 > 0, so ∂D/∂N_i increases monotonically. Thus, the optimal construction satisfies the following equations:

\frac{\partial D}{\partial N_i} = 0 \quad (i = 2, 3, \dots, L),   (42)

N_1 \ge N_2 \ge \cdots \ge N_L.   (43)

[Node which relays packets to S_k restricts T_1]
The original capability of a node relaying packets to S_k is R_{N_total-k}. The transformed capability of this node is

\hat{R}_{N_total-k} = R_{N_total-k} - R_{N_{k+1}} = \{E - F \exp[-G(N_{total} - k)]\} - \{E - F \exp[-G N_{k+1}]\} = F \{\exp[-G N_{k+1}] - \exp[-G(N_{total} - k)]\}.   (44)

When this node restricts T_1, the following relation is satisfied:

F \{\exp[-G N_{k+1}] - \exp[-G(N_{total} - k)]\} \le E - F \exp[-G N_2].   (45)

The average delay of the whole multicast group in this case is

D = \frac{N_1 - N_2}{N_{total} F \{\exp[-G N_{k+1}] - \exp[-G(N_{total} - k)]\}} + \sum_{i=2}^{L} \frac{N_i - N_{i+1}}{N_{total}(E - F \exp[-G N_{i+1}])}.   (46)

The differential coefficient ∂D/∂N_2 satisfies ∂D/∂N_2 ≥ 0. This means that the average delay increases as N_2 increases, so the minimum value of N_2 is optimal. From equation (45), the optimal value of N_2 should satisfy

F \{\exp[-G N_{k+1}] - \exp[-G(N_{total} - k)]\} = E - F \exp[-G N_2].   (47)

Due to lack of space, we omit the expression of ∂D/∂N_{k+1}, but when 2F \exp[-G N_{k+1}] > E - F \exp[-G N_{k+1}], ∂D/∂N_{k+1} increases monotonically and the optimal construction satisfies

\frac{\partial D}{\partial N_{k+1}} = 0.   (48)

And ∂D/∂N_i increases monotonically. Thus, the optimal construction satisfies

\frac{\partial D}{\partial N_i} = 0 \quad (i = 3, 4, \dots, L,\ i \ne k + 1),   (49)

N_1 \ge N_2 \ge \cdots \ge N_L.   (50)

Consequently, in the case where a node relaying packets to subgroup S_k restricts T_1, the optimal construction can be obtained from equations (47), (48), (49) and (50).
Fig. 3. Distribution of node capability (capability rises from R_0 = 1 at node 0 to R_{N_total-1} at node N_total − 1)
5 Numerical Examples
In the layered multicast group construction, the more subgroups are prepared, the better the throughput (delay) performance seems to be. An increase in the number of subgroups, however, means inefficient usage of network resources. Multicast communications remove the redundant packet transmissions which cannot be avoided when a 1:n communication is supported by n point-to-point communications; increasing the number of subgroups re-introduces redundant packet transmission in the network. So, to answer the question of how many subgroups are adequate, we should look for the minimum number of subgroups which obtains a satisfactory improvement of throughput (delay) performance. In this section, we show some numerical examples for the three types of node capability distribution examined in the previous section and discuss the number of subgroups.

For the distributions of node capability, we assume that R_0, i.e. the lowest capability, is 1 (Fig. 3). We treat two cases of diversity of node capability: R_{N_total-1} ≤ 30 and R_{N_total-1} ≤ 1000. The former situation assumes that the diversity of capability is not so large; for example, in a multicast group where computers with several generations of processors are included, the difference in node processing capability is not so large. In the latter case, the diversity is assumed to depend on network bandwidth. For the linear distribution, B is 1 and A is varied with R_{N_total-1}. For the exponential concave distribution, C = 1; for the convex distribution, E = 30, F = 29 when R_{N_total-1} = 30 and E = 1000, F = 999 when R_{N_total-1} = 1000. D and G are varied with R_{N_total-1}.

[Linear Distribution]
Figures 4(a), (b) and (c) show the optimal dividing points when R_{N_total-1} ≤ 30 for the two-, three- and four-subgroup constructions, respectively. The horizontal axis, diversity of capability, denotes the ratio R_{N_total-1}/R_0, i.e. (maximum capability)/(minimum capability) in a multicast group. As described in Section 4.1, a lower-capability node has a lower number, so in these figures lower-capability nodes are located lower on the vertical axis. For example, in Fig. 4(a), when the total number of nodes is 101, nodes whose node number is above the dividing line should be included in subgroup S_1 and nodes below the line should be included in S_2. Figure 5 shows the normalized delay performance of the optimal layered multicast group construction with 2, 3, 4 and 5 subgroups when R_{N_total-1} = 30. Delay is normalized by that of the single multicast group construction.
Fig. 4. Optimal dividing point in a linear distribution: (a) two subgroups; (b) three subgroups; (c) four subgroups (number of nodes (%) vs. diversity of capability)
When the diversity of capability is larger than 5, the layered multicast group construction performs better than the single multicast group construction. For example, when the diversity is 20, with only 2 subgroups the layered multicast group construction improves delay performance by 60% (from 1 to 0.4). By dividing a whole multicast group into 2, 3 or 4 subgroups according to the optimal dividing points, average delay can be improved as shown in Fig. 6 (for R_{N_total-1} = 1000). In both the R_{N_total-1} ≤ 30 and the R_{N_total-1} ≤ 1000 cases, the layered multicast group construction with 2 subgroups remarkably improves average delay compared with the single construction, as shown in Figs. 5 and 6. The improvement in delay from incrementing the number of subgroups from 2 to 3 is smaller than that from going from a single group to 2 subgroups. From the viewpoint of practical use, the layered multicast group construction with 2 or 3 subgroups is a reasonable solution.
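These normalized-delay figures are straightforward to reproduce with the search sketch from Section 4 (again our own companion code, using the parameters stated above, R_0 = 1 and a linear f):

```python
# Sketch: normalized delay for the linear distribution (our companion
# to Fig. 5, re-using avg_delay and best_split from the Section 4 sketch).

def normalized_delay(f, n_total, L):
    single = avg_delay(f, n_total, ())    # one group: everyone at rate R0
    layered = avg_delay(f, n_total, best_split(f, n_total, L))
    return layered / single

for diversity in (5, 10, 20, 30):
    f = lambda x, d=diversity: 1 + (d - 1) * x / 99   # R0 = 1, R99 = d
    print(diversity, round(normalized_delay(f, 100, L=2), 2))
# At diversity 20 this yields roughly 0.36 for two subgroups, in line
# with the ~0.4 improvement read off Fig. 5.
```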
Fig. 5. Normalized delay characteristics –linear distribution– (diversity of node capability is below 30)

Fig. 6. Normalized delay characteristics –linear distribution– (diversity of node capability is below 1000)
[Exponential Concave and Convex Distributions]
Figures 7 and 8 show the normalized delay performance of the optimal layered multicast construction when the distribution of node capability is an exponential concave function, for R_{N_total-1} ≤ 30 and 1000, respectively. Figures 9 and 10 show the normalized delay performance of the optimal layered multicast construction when the distribution of node capability is an exponential convex function, for R_{N_total-1} ≤ 30 and 1000, respectively. The concave distribution represents a situation where many lower-capability nodes are included in a multicast group; the convex distribution represents a situation where many high-capability nodes are included. In both cases, as the diversity becomes larger, the improvement in delay becomes larger (this is also observed for the linear distribution). This is because the throughput of higher-capability nodes is restricted by the lower-capability nodes inside a multicast group, and with the layered multicast group construction this restriction can be removed in each subgroup, even with 2 or 3 subgroups. For practical use, the layered multicast group construction with 2 or 3 subgroups is a reasonable choice.
Fig. 7. Normalized delay characteristics –exponential concave distribution– (diversity of node capability is below 30)

Fig. 8. Normalized delay characteristics –exponential concave distribution– (diversity of node capability is below 1000)
6 Conclusions
In this paper, we have presented the layered multicast group construction. It is one solution to the technical problem of the performance degradation caused by the heterogeneity of the network in reliable multicast communications.
Fig. 9. Normalized delay characteristics –exponential convex distribution– (diversity of node capability is below 30)

Fig. 10. Normalized delay characteristics –exponential convex distribution– (diversity of node capability is below 1000)
The basic concept of the layered multicast group construction is to divide a multicast group into multiple subgroups and order them based on node capability. This reduces the diversity of node capability inside each subgroup, which improves the delay performance of the whole multicast group. We investigated the optimal form of the layered multicast group construction, i.e. the optimal dividing points of the subgroups, and have clarified the way to derive the optimal construction for three types of node capability distribution: linear, exponential concave and exponential convex.
Numerical examples show that average delay performance is notably improved by the layered multicast group construction compared with the single multicast group construction, i.e. the conventional multicast group. The improvement in delay performance observed in going from the conventional multicast group to the layered multicast group construction with two subgroups is the most significant; the improvement obtained by each further increment of the number of subgroups decreases as subgroups are added. Increasing the number of subgroups also has the drawback of inefficient network resource usage through redundant packet transmission. Our numerical examples show that two or three subgroups obtain a satisfactory improvement of delay performance for the layered multicast group construction.

Acknowledgements
This work is supported in part by the Grant-in-Aid for Scientific Research (B) of the Ministry of Education, Science and Culture, Grant No. 1045015.
References
1. S.Y. Cheung, M.H. Ammar and X. Li, "On the Use of Destination Set Grouping to Improve Fairness in Multicast Video Distribution," Tech. Report GIT-CC-95-25, Georgia Institute of Technology, Atlanta, July 1995.
2. M.H. Ammar and L. Wu, "Improving the Throughput of Point-to-Multipoint ARQ Protocols through Destination Set Splitting," Proc. of IEEE INFOCOM'92, pp. 262-271, June 1992.
3. S. Kasera, J. Kurose and D. Towsley, "Scalable Reliable Multicast Using Multiple Multicast Groups," CMPSCI Tech. Report TR 96-73, University of Massachusetts, Amherst, October 1996.
4. S. Bhattacharyya, J. Kurose, D. Towsley and R. Nagarajan, "Efficient Rate-Controlled Bulk Data Transfer using Multiple Multicast Groups," Proc. of IEEE INFOCOM'98, pp. 1172-1179, June 1998.
5. S. Floyd, V. Jacobson, S. McCanne and L. Zhang, "A Reliable Multicast Framework for Light-weight Sessions and Application Level Framing," IEEE/ACM Transactions on Networking, Vol. 5, No. 6, pp. 784-803, Dec. 1997.
6. S. Deering, "Host Extensions for IP Multicasting," RFC 1112, Aug. 1989.
7. R. Braudes and S. Zabele, "Requirements for Multicast Protocols," RFC 1458, 1993.
8. S. McCanne, V. Jacobson and M. Vetterli, "Receiver-driven Layered Multicast," in Proc. of ACM SIGCOMM'96, Stanford, pp. 117-130, Aug. 1996.
9. A. Koifman and S. Zabele, "RAMP: A Reliable Adaptive Multicast Protocol," in Proc. of IEEE INFOCOM'96, Boston, pp. 1442-1451, Apr. 1996.
10. J.C. Lin and S. Paul, "RMTP: A Reliable Multicast Transport Protocol," in Proc. of IEEE INFOCOM'96, Boston, pp. 1414-1424, Apr. 1996.
11. H.W. Holbrook, S.K. Singhal and D.R. Cheriton, "Log-Based Receiver-Reliable Multicast for Distributed Interactive Simulation," in Proc. of ACM SIGCOMM'95, pp. 328-341, Aug. 1995.
12. S. Ramakrishnan and B.N. Jain, "A Negative Acknowledgement with Periodic Polling," in Proc. of IEEE INFOCOM'87, pp. 502-511, Apr. 1987.
13. M. Yamamoto, J. Kurose, D. Towsley and H. Ikeda, "A Delay Analysis of Sender-Initiated and Receiver-Initiated Reliable Multicast Protocols," in Proc. of IEEE INFOCOM'97, Kobe, pp. 481-489, Apr. 1997.
14. M.Yamamoto, Y.Sawa, S.Fukatsu and H.Ikeda, “NAK-based Flow Control Scheme for Reliable Multicast Communications,” in Proc. of IEEE Globecom’98, Sydney, pp.2732-2736, Nov. 1998.
Building Groups Dynamically: A CORBA Group Self-Design Service
Eric Malville
France Télécom CNET (Centre National d'Etudes des Télécommunications)
42, rue des Coutures F-14066 Caen Cedex 4, France
[email protected]
Abstract. This paper focuses on CORBA object group services. Our aim is to provide a Group Self-Design (GSD) protocol which enables a dynamic and autonomous construction of groups. From a global point of view, the GSD protocol enables the system to be organised into a tree-structure whose nodes are groups. From a local point of view, it enables a group to be sub-divided autonomously and independently of the others. This paper presents the GSD protocol and proposes an implementation of this protocol on top of CORBA. The advantages of our GSD approach are illustrated through an application to the task allocation problem in Open Information Systems (OIS).
1 Introduction
Open Information Systems (OIS) [10][12] are large-scale information systems composed of heterogeneous and distributed resources (e.g. people, printers, word processors) that may appear, disappear and change. Distributed systems provide object-oriented communication infrastructures for building applications in distributed and heterogeneous environments (in terms of system and programming language). One of the most important of these object-oriented communication infrastructures is CORBA (Common Object Request Broker Architecture), specified by the OMG (Object Management Group) [25], which allows objects to communicate independently of the specific platforms and techniques used to implement them. CORBA defines objects as the unit of distribution. Each object provides a set of operations through IDL (Interface Definition Language) interfaces. CORBA specifies basic mechanisms for remote invocation through the ORB (Object Request Broker), as well as a set of services for object management (e.g. naming service, transaction service). However, the current version of CORBA does not integrate the notion of object group.

The group abstraction has been studied in great depth in the domain of distributed systems [1][2][26][4]. A group is a set of objects that are addressed as a single entity. The key mechanisms underlying the group paradigm are group multicast and fault-tolerance. The properties of groups have led some
people [16][18][19][22][9][24] to provide the group abstraction in a CORBA environment. However, the group primitives they propose do not allow a dynamic construction of groups. Some applications require more flexible group services to deal with the dynamics of OIS: in such an environment, the groups should be able to organize themselves. In this article, we propose a GSD (Group Self-Design) protocol which allows object groups to achieve such a self-organization. In the next section, we present an example application, a task allocation mechanism, which requires the groups to organize themselves depending on the dynamics of the environment. In Section 3, we present the group self-design mechanism we propose. Section 4 summarizes some implementation aspects.
2 Task Allocation in Open Information Systems
A lot of applications can take advantage of groups [17]. For example, the group paradigm can be used to solve one of the most important problems in OIS (and, more generally, in distributed systems): the problem of task allocation. In our approach [6][7][21], each resource is associated with an agent (i.e. an "intelligent" and autonomous CORBA object; more details on what an agent is can be found in [6][7]) which is charged with representing this resource in the system. Each agent can play the roles of both a server, since it provides services to the other agents (i.e. the services the resource it represents provides), and a client, since it is in charge of performing tasks on its resource's behalf. The services (or competencies) an agent provides are described by a set of characteristics. As in [6], a distinction is made between two types of characteristics: structural characteristics, which are static or weakly dynamic, and conjunctural characteristics which, on the contrary, are highly dynamic. In the same way, a task is described by a list of the competencies that a server must possess in order to carry it out. In this context, the problem of task allocation consists in a search by a client for servers capable of satisfying its needs.

Our focus is on the following two allocation mechanisms: the contract net protocol (CNP) of R.G. Smith [5][27] and the agent group model (AGM) of B. Dillenseger and F. Bourdon [6][7]. The CNP is based on a mechanism of calls for tender. As a general rule, a client looking for a server capable of performing a given task broadcasts a call for tender to all the servers. Those servers which are capable of carrying out the task return a service offer to the client, which chooses the server most suited to its requirements. In the AGM, each member of a group knows the structural characteristics of the other members. When a server wishes to join a group, it broadcasts its structural characteristics throughout the group and in return receives those of all its members. Similarly, when a member leaves a group or when its structural characteristics change, a multicast is required in order to update the knowledge of the other members. In this way, when a client has a task to execute, it sends a call for tender to one of the members of the group, chosen at random. Since the latter knows the structural characteristics of all the other
members, it can determine which of them are structurally capable of performing the task and forward them the call for tender. Depending on their conjunctural characteristics (e.g. their availability), the servers determine whether or not they can in fact perform the task; if so, they return a service offer to the client, which can then select the most appropriate offer.

The number of messages generated by these two mechanisms is mainly due to the broadcast of the call for tender in the case of the CNP, and of the structural characteristics in the case of the AGM. The task allocation protocol that we put forward is based on a group self-design (GSD) protocol which structures the search space (i.e. the system) into a group tree-structure. The group tree-structure enables a reduction in the network load generated by the search for servers and by the management of the system dynamics.
3 The GSD Approach
3.1 Group Management Protocol
The GSD protocol which we put forward is based on the coherence of the views of the members of a group. The view of a group is local to each member and contains the information that the server has on the group (e.g. an ordered list of the members and their structural characteristics). A group management protocol is necessary in order to maintain the coherence of the views of the members in an environment in which the machines, agents and/or communication network may fail. The protocol manages the update of the views upon modifications of the group structure. A number of solutions have been found for this problem, which has been studied in great depth in the context of distributed systems (e.g. [1], [13], [14], [23]). The group management protocol which we put forward has been strongly influenced by the work presented in [1] and [14]. The protocol ensures:
– broadcast atomicity: when a message is broadcast within a group, the protocol ensures that either all the members receive the message, or none of them do;
– total ordering of message delivery: when two messages A and B are broadcast within a group, the protocol ensures that if one member receives A before B, then all the other members also receive A before B.

When a server wishes to join a group (see Figure 1), it sends the group a message of type Join which contains its structural characteristics. A member chosen at random receives the message and transmits it to the group co-ordinator; the latter is the first agent to have joined the group. The delivery of messages in total order ensures that the list of members is ordered in the same way for all the members of the group, and that the co-ordinator is therefore the same for all. If it fails, the role of co-ordinator is taken over by its successor in the list. The co-ordinator then broadcasts the message to the other members of the group in order for them to update their view, and sends the new member a message of type SetView containing its view of the group.
Fig. 1. Group join protocol
Broadcast is performed by a series of point-to-point communications. This mechanism poses a well-known problem in the field of distributed systems for ensuring the atomicity of broadcast: the co-ordinator may initiate a broadcast and fail before the process has been completed. In this case, some addressees do not receive the message and the views of the members are not coherent. To solve this problem, each member in turn sends the Join message once to its successors, and sends a SetView message to the new member (see Figure 1). When an addressee is not accessible, the transmitter withdraws it from the group by sending a Leave message to the co-ordinator. The use of logical clocks, local to each agent, prevents the members from sending the same message twice.

The delivery in total order of broadcast messages is ensured by the co-ordinator. Messages to be broadcast are sent to the co-ordinator. If two messages are sent concurrently to a group, the co-ordinator broadcasts the first message it receives and checks that all the members of the group have received it; this done, it can broadcast the second message. This ensures that the members of the group all receive the broadcast messages in the same order.

The protocols managing the withdrawal of a member from a group and changes in the structural data of a member rely on the same mechanisms. When a server withdraws or when its structural characteristics change, it sends a Leave or Change message directly to the co-ordinator, and the message is then handled similarly.
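The following sketch (our illustration, not the actual CORBA service; failures and concurrent broadcasts are omitted) shows the core of this mechanism: the co-ordinator serialises a Join, every member relays it once to its successors, and per-sender logical clocks suppress duplicates so that all views stay coherent:

```python
# Sketch: duplicate-suppressed relayed broadcast inside one group
# (our simplified illustration of the Join/SetView mechanics).

class Member:
    def __init__(self, name):
        self.name = name
        self.view = []        # ordered member list = this member's view
        self.seen = {}        # sender -> highest logical clock seen

    def deliver(self, msg, group):
        sender, clock, payload = msg
        if self.seen.get(sender, -1) >= clock:
            return                     # duplicate: already relayed once
        self.seen[sender] = clock
        self.view.append(payload)      # apply the Join to the local view
        me = group.index(self)
        for succ in group[me + 1:]:    # forward once to each successor
            succ.deliver(msg, group)

def join(group, newcomer):
    coord = group[0]                   # first member is the co-ordinator
    clock = coord.seen.get(coord.name, -1) + 1
    coord.deliver((coord.name, clock, newcomer.name), group)
    newcomer.view = list(coord.view)   # SetView sent to the newcomer
    group.append(newcomer)

g = [Member("a"), Member("b"), Member("c")]
for m in g:
    m.view = [x.name for x in g]
join(g, Member("n"))
print([m.view for m in g])             # all four views are identical
```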
3.2 Group Tree-Structure
While the group is a means of limiting the network load during the search for servers, its management protocol becomes increasingly unsuitable as the group grows in size. Indeed, both the number of messages required to maintain the coherence of the views and the quantity of information shared by the members of the group are proportional to the size of the group. The GSD protocol we put forward in this article enables a limit to be placed on the size of
groups and, therefore, both on the network load created by the management of the system dynamics and on the quantity of information shared by the agents. From a global point of view, it enables the search space (i.e. the system) to be structured into a tree in which the nodes are groups and the branches are labelled by the structural characteristics of the son groups. From a local point of view, it enables a group whose size has become too large to be sub-divided into smaller groups. The original group thus sub-divided becomes the father of the new groups. Certain servers remain members of the father group while others leave it to join son groups. Each son group is represented by an agent, called the representative, in its father group. Information about son groups forms the branches of the tree, since this information enables access to son groups from a node.

Each server is represented by a list of structural characteristics of the form

SC = {sc_1, ..., sc_n}.

The structural characteristics of a representative correspond to those of the group it represents. Each member knows the list (organised by order of arrival) of the members (either plain members or representatives) of the group, as well as their characteristics. The view of a group therefore has the form:

Members = {s_1, ..., s_n}
SC = {SC_{s_1}, ..., SC_{s_n}}
Representatives ⊆ Members
Father = G

The system initially comprises a single group, i.e. the root of the tree. Upon creation, the agents all know of the existence of this group and address it in their search for servers or to declare themselves in the system. When it becomes too large, the group sub-divides into a group tree-structure. Each node of the tree can sub-divide autonomously and independently of the others, and thus extend the tree-structure. A protocol also manages the disappearance of groups which no longer contain any agents, to maintain the connectivity of the tree.
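In code, a member's local view might be represented as in the sketch below (ours; the real service would define these as CORBA IDL types, and the field names simply mirror the notation above):

```python
# Sketch: a member's local view of a group (our illustration, mirroring
# the Members/SC/Representatives/Father notation above).
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set

@dataclass
class GroupView:
    members: List[str] = field(default_factory=list)       # arrival order
    sc: Dict[str, Set[str]] = field(default_factory=dict)  # structural chars
    representatives: Dict[str, str] = field(default_factory=dict)  # rep -> son
    father: Optional[str] = None

view = GroupView(
    members=["s1", "s2", "s3"],
    sc={"s1": {"printer"}, "s2": {"printer", "color"}, "s3": {"printer"}},
    representatives={"s3": "G'"},
    father="G",
)
print(view.members[0], "acts as co-ordinator")  # first to have joined
```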
3.3 Group Sub-division
The group sub-division protocol guarantees the coherence of the views of the members in the father group and in its son groups. It is based on three global constraints:
1. All the agents know two thresholds relating to group size: the maximum size S_max that a group must not exceed, and the minimum size S_min that a group must have in order to be created.
2. They use the same sub-division function to determine how to sub-divide a group based on their current view of this group. This function returns the list of the new son groups (whose size is greater than S_min) and their content.
3. The sub-division protocol relies on the coherence of the views of the members.
These three global constraints enable the members of a group to decide how and when to sub-divide the group in an autonomous, uniform fashion. The sub-division of a group (if any) is performed during the inscription phase of a new member (cf. Figure 2). The inscription protocol is enriched in order to preserve the coherence of the views of the members of the father group and its son groups. When an agent (N) wishes to join a group, it sends it a Join message.
Fig. 2. Group sub-division protocol
When the co-ordinator receives a Join message, it evaluates the new view of the group and determines whether its size is greater than S_max. If this is
the case, it applies the sub-division function to the new group view in order to determine which son groups (G') are to be created and which agents are to be their members. It creates these new groups and their representatives (S), and sends a Join message to all the members of the original group. This message contains the structural characteristics of the new member and an ordered list of the new representatives. The co-ordinator does not need to send the structural characteristics of the new representatives, since each member can determine these from its current view of the group; only the information the members cannot determine autonomously has to be communicated to them. Each member updates its view upon reception of the Join message. As stated before, it can then determine the view of each group issued from the sub-division (the father group and its sons), update its own view, and send a SetView message to the new representatives and the new member so as to initialize their views.

During a sub-division, a representative, just like any other agent, can become a member of a new group. The group it represents becomes the son of this new group and therefore no longer has the same father. The view of each of its members therefore has to be changed appropriately. To do so, the members of a new group send each son group a message containing the name of the new father. The message is broadcast once to all the members in order for them to update their view.
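The co-ordinator's decision step can be sketched as follows (our illustration, re-using the GroupView sketch above; the thresholds and the toy halve function stand in for the shared constants and the deterministic sub-division function required by the constraints of Section 3.3):

```python
# Sketch: the co-ordinator's handling of a Join that triggers a
# sub-division (our illustration; message exchange, representative
# creation details and failure handling are omitted).

S_MAX, S_MIN = 4, 2    # illustrative thresholds known to every agent

def handle_join(view, newcomer, chars, subdivide):
    view.members.append(newcomer)        # evaluate the would-be new view
    view.sc[newcomer] = set(chars)
    if len(view.members) <= S_MAX:
        return []
    sons = subdivide(view)  # deterministic: every member computes the
                            # same result from the same (coherent) view
    for son_name, son_members in sons:
        for m in son_members:            # these members leave the father
            view.members.remove(m)
            del view.sc[m]
        rep = "rep:" + son_name          # the son group's representative
        view.members.append(rep)
        view.sc[rep] = set()             # would carry the son's shared chars
        view.representatives[rep] = son_name
    return sons

def halve(view):                         # toy deterministic sub-division
    half = sorted(view.members)[: len(view.members) // 2]
    return [("G2", half)] if len(half) >= S_MIN else []

v = GroupView(members=["s1", "s2", "s3", "s4"],
              sc={s: {"printer"} for s in ("s1", "s2", "s3", "s4")})
print(handle_join(v, "s5", {"printer"}, halve), v.members)
```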
[Figure 3 depicts an example of a functional group-tree structure for an electronic-document domain: a root group G sub-divides into G1, whose members share the Printer characteristic, and G2, whose members share the Word processor characteristic; G1 further sub-divides into G11 (Black and White printers) and G12 (Color printers).]

Fig. 3. An example of functional group-tree structure
3.4 An Example of a Sub-division Function
The sub-division protocol we have just presented can be applied to every sub-division function which (i) is deterministic and (ii) leads to sub-divisions in which the groups are disjoint. In [21], we propose an example of such a function, the functional sub-division function, which verifies these two properties (this can be demonstrated, but the proof lies outside the scope of the present article).
This sub-division function enables the construction of a tree whose nodes are functionally homogeneous groups, i.e. groups whose members have structural similarities (cf. figure 3). The groups are therefore described by the structural characteristics that their members have in common.

3.5 The Group Deletion
A group disappears when it no longer contains any agents. A node group contains at least the representatives of its son groups, so only the leaf groups of the tree can disappear. When a leaf disappears, its representative must be withdrawn from the father group. To do so, when the last member of a group withdraws, it sends the father group a Leave message which contains the name of the representative (i.e. the name of the group it represents). This message is broadcast to all the members of the father group, which withdraw the representative from their views. A representative withdraws from a group when it receives a Leave message containing its own name.

3.6 The Tree Browsing
Browsing through the group tree-structure enables a new agent to declare itself in the system, and a client to search for servers capable of meeting its needs. Since the branches of the tree are labelled by the structural characteristics of the son groups, it is not necessary to browse through the whole tree-structure. At each node, a decision is taken to determine which sub-trees a client or a new server must browse through. The sub-trees of a node are browsed through in parallel. In order to declare itself in the system, a server must first determine in which groups it has to register. To do so, it sends an inscription request containing its structural characteristics to the root group of the tree. A member of the group, chosen at random, receives the message and determines whether the new agent should register in son groups. If this is the case, it returns the list of these groups to the new agent, which reiterates the process by sending them an inscription request. If this is not the case, it sends the request to the co-ordinator in order to register the new member. A client which has a task to delegate sends a call for tender to the root group of the tree. The message contains the list of the characteristics that an agent has to satisfy in order to perform the task. The member (chosen at random) which receives the request forwards it to the structurally competent servers of its group and returns to the client the son groups which satisfy the characteristics of the request (so that the client browses through the relevant sub-trees). Upon reception of the call for tender, a server determines, on the basis of its conjunctural characteristics, whether or not it can in fact perform the task; if it can, it sends a service offer to the client. Any message sent to a group is received by one of its members chosen at random. Invoking a group is therefore possible if at least one of its members is accessible. Browsing through the whole tree is possible as long as the nodes contain at least one accessible member. The tolerance of the tree-structure to failures of machines, agents and the network therefore depends on its size and the distribution of its members on the different sites. The Smin and Smax thresholds used to sub-divide a group determine an average group size and therefore influence the sturdiness of the tree-structure.
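The following sketch summarises the browsing of the tree for a call for tender. All methods are hypothetical stand-ins for the actual service invocations, and the recursion is written sequentially although the sub-trees are browsed in parallel in the real system:

def call_for_tender(group, eligibility):
    offers = []
    member = group.random_member()            # any accessible member will do
    # Forward the request to the structurally competent servers of the group.
    for server in member.competent_servers(eligibility):
        offer = server.bid(eligibility)       # decided on conjunctural state
        if offer is not None:
            offers.append(offer)
    # Recurse only into the son groups whose structural characteristics
    # satisfy the request (the relevant sub-trees).
    for son in member.matching_sons(eligibility):
        offers.extend(call_for_tender(son, eligibility))
    return offers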
3.7 Evaluation
The purpose of the GSD protocol is to structure the search space into a group tree-structure in order to limit the size of the groups and, therefore, the network load created by the management of the activity and the dynamics of the system. However, the group self-design protocol relies on the coherence of the views of the members. This coherence requires a reliable broadcast protocol which is very expensive in terms of communication costs. Indeed, in the AGM a message has to be broadcast only once, while in the GSD approach each member has to broadcast the message to all of its successors in the view. Therefore, the broadcast of a message in a group containing n members requires only n − 1 messages in the case of the AGM, and (n − 1) + (n − 2) + . . . + 2 + 1 = n(n − 1)/2 messages in the case of our GSD approach. In the present section we put forward an evaluation of our approach in terms of network load, in relation to the CNP and the AGM. The GSD protocol we have put forward has been validated in a real distributed environment: CORBA [25]. However, in such a real environment it is difficult to study all of its properties. In fact, it is difficult (indeed impossible) to control all the parameters influencing the behavior of the system (e.g. delays in the transmission of the communications, crashes of the network and/or hosts). It is also difficult to study these protocols on a large scale (e.g. a high number of hosts, a high number of agents). The evaluation of the GSD protocols has therefore been done through simulations, carried out on the oRis simulator [11]. Simulation constitutes both a design tool and an analytic device which allows, when the real situation is too complex, an artificial environment to be rebuilt in which all the parameters are precisely controlled. However, it must be noted that simulations are not sufficient. In particular, no formal proof of our model can be derived from these simulations on oRis. These simulations only allow us to study how the GSD protocols behave in relation to the others. The results obtained through simulations have to be put in perspective, since their meaning mainly depends on the model chosen to simulate the real environment. This model has to reproduce the real situation as faithfully as possible for the results to be as close as possible to those we would obtain in the real environment. Even if the simulation model is a “good” approximation of reality (according to the chosen evaluation criteria), an evaluation in a real environment (i.e. CORBA) had to be done in order to verify that the obtained results remain relevant, even under the influence of parameters not (voluntarily or not) taken into account in the artificial environment.
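For illustration, the message counts for the two broadcast schemes can be computed directly; a group of 10 members needs 9 messages per broadcast with the AGM against 45 with the GSD approach:

def agm_broadcast_cost(n):
    return n - 1                 # the sender reaches all others at once

def gsd_broadcast_cost(n):
    # Each member forwards the message to all of its successors in the view:
    # (n-1) + (n-2) + ... + 1 = n(n-1)/2.
    return n * (n - 1) // 2

assert agm_broadcast_cost(10) == 9 and gsd_broadcast_cost(10) == 45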
The network load generated by the three mechanisms is evaluated in terms of the amount of information transmitted between agents, since the network traffic actually depends not only on the number of messages but also on their size. The properties of the group tree-structure mainly depend on the global constraints (e.g. Smin, Smax). The fluctuations of the network load generated by our GSD approach mainly depend on the size of the groups the new agents join: the bigger the groups, the bigger the network load (and conversely). Therefore, the network load created by the GSD protocols mainly depends on the global constraints Smax and Smin. An initial series of simulations shows the evolution of the network load created by the different task allocation mechanisms in relation to the size of the system (cf. figure 4). This evaluation shows that, despite the costly broadcast protocol it requires, the GSD mechanism allows a limit to be placed on the communication load generated by the management of the dynamics of the system and by the search for servers.
[Figure 4 plots the network load (number of bytes, 0 to 2000) against the number of servers (0 to 200) for the CNP, Group (AGM) and GSD mechanisms.]

Fig. 4. The network load created by the three task allocation mechanisms
4 The Group Service
Several mechanisms that augment the basic functionalities of CORBA have been designed and adopted by the OMG as part of the CORBA specifications. Nevertheless, no support is provided for the group abstraction. In this section, we present how our GSD protocol has been implemented.

4.1 Integration vs. Service Approach
There are basically two main approaches to implementing a GSD protocol in a CORBA environment [8,9]: the integration approach and the service approach.
The integration approach (e.g. Orbix+Isis [16], Electra [15,19,18]) consists in integrating group primitives provided by the underlying system (e.g. Isis [3], Horus [26], Amoeba [14]) within the ORB core. The ORB is therefore modified in order to distinguish references to singleton objects from references to groups of objects. In this approach, a request to a group is performed by the underlying system. In the service approach (e.g. [9], [8], [22]), the group primitives are provided by a CORBA service on top of the ORB. This approach therefore complies with the CORBA philosophy: it follows the design of the other functionalities that have been added to CORBA through services, such as the naming service, which are specified in IDL (Interface Definition Language). The group service appears as a new CORBA service, beside the other services and any other CORBA object. A service is viewed as a set of IDL interfaces and can be composed of several objects located at different nodes of the network. The service is inherently accessible from anywhere on the bus. The main advantages of the integration approach are its ease of development (there is no need to build a new group system from scratch) and its transparency (an object group is not distinguishable by a client from a singleton object that implements the same interface). In order to achieve a similar transparency in the service approach, the use of group proxies is necessary. Another advantage of the integration approach is that the implementation can be very efficient, since group communication is performed directly by the underlying system: multicasting a request does not involve an intermediary object as in the service approach. The drawbacks of the integration approach are the loss of portability (the implementation is strongly ORB- and system-dependent) and of interoperability (both clients and servers have to use the same ORB and system). On the contrary, a group service is neither ORB-dependent nor system-dependent (it is defined only in terms of IDL interfaces) and, by using the ORB communication primitives, it benefits from the ORB's interoperability properties. It requires no modification of the ORB and is easily portable from one ORB to another. Therefore, the service approach is more appropriate for the task allocation mechanism we propose for OIS, since it provides the group abstraction in heterogeneous environments.

4.2 The CORBA Group Self-Design Service
A CORBA service is viewed as a set of IDL (Interface Definition Language) interfaces and is generally composed of several distinct interfaces. It can be implemented as a collection of distinct objects that cooperate to provide the complete service and which are located at different nodes of the network. This is typically the case for our GSD service.

The general architecture. Our GSD service is composed of three families of objects: the server objects, the group objects and the factory objects. In a group, a server is represented by a server object (SO) that knows stable information about this server (e.g. its structural characteristics, its type, i.e. server
or representative, its location). The server objects implement the GSD protocols and provide two interfaces: an Internal interface and an External interface. The Internal interfaces allow the server objects to communicate within the group to which they belong. Their External interfaces allow them to receive requests coming from outside the group. A group is represented by one or more group objects (GO), which all know the External interface and the location of all the server objects belonging to this group. These group objects allow the external agents (clients or servers) to contact the group in order to join it or to search for servers. The role of a group object consists in transmitting the requests it receives to one of the members (chosen at random) of the group it represents. The sturdiness of a group depends on the number and the distribution of its group objects, which in turn depend on the distribution of the members of the group; as a general rule, if a server object runs on a machine hosti, a group object also runs on this machine. Each group object has two interfaces: an External interface and an Administration interface. The External interfaces allow the agents to send requests to the group. The Administration interfaces allow the members of a group to inform the group objects that a server has joined or left the group. It must be noted that the management of the group objects is integrated into the GSD protocol. As a general rule, the group objects are created by the co-ordinator and destroy themselves when their host no longer contains any member of the group. In the example of figure 5, a server S3 running on host host2 joins a group G. No member of this group runs on host2. The co-ordinator therefore has to create a new group object (GOG2) on host2 and send the other members the object reference of this newly created group object.
[Figure 5 shows the message sequence for this scenario: S3 sends joinRequest(S3, CSS3, host2) to the group object GOG1; the request is forwarded as forwardJoin(S3, CSS3, host2) to the co-ordinator, which creates the new group object GOG2 on host2, registers the members with insert(Si, host(Si)), broadcasts join(S3, CSS3, host2, GOG2) within the group, and initializes S3 with setView(viewG).]

Fig. 5. The creation of a new group object
The factories allow the agents to create (locally or remotely) CORBA objects such as the group objects or the representatives. A factory runs on each machine of the distributed system and provides a GenericFactory interface defined by the OMG.

The naming graph. The standard naming service of CORBA allows an object to create a name-to-object association (i.e. a name binding). A naming context is an object that contains a set of name bindings in which each name is unique. The interface the naming service provides allows the CORBA objects to resolve and to bind names. To resolve a name is to determine the object associated with the name in a given context. To bind a name is to create a name binding in a given context. A name is always resolved relative to a context – there are no absolute names. Since a context is like any other CORBA object, it can also be bound to a name in a naming context. Binding contexts in other contexts creates a naming graph – a directed graph with labeled edges whose nodes are contexts. We have defined our own context graph. The root of this graph is the GSD context. This graph contains two specific contexts, namely factories and groups. The factories context contains the object references of all the factories of the distributed system. The groups context contains a context for each group. The context of a group is identified by the name of the group and contains the object references of all its group objects. The object reference of a group object is associated with the name of its site. This conventional binding allows the agents to get the object references of the factories and the group objects depending on their location.

[Figure 6 depicts the naming graph: the root GSD context contains a groups context, with one context per group G1 ... Gn binding group objects to host names hosti ... hostl, and a factories context binding the factories of hosts host1 ... hostm.]

Fig. 6. The naming context graph in the CORBA GSD service
To contact a group or to create a new CORBA object, an agent has to access the naming service. The reliability of our GSD service therefore depends on the reliability of the naming service, so the GSD service has to be combined with a reliable naming service. S. Maffeis [20] proposes a naming service which relies on an integration approach to the group abstraction: it is provided by a group of replicated naming servers distributed over a homogeneous environment (e.g. Horus, Isis or Amoeba). This approach increases the reliability of the naming service and therefore of our GSD service, while our GSD service itself provides the group abstraction over a heterogeneous environment.
The view of a group. Each server object manages its own view of the group to which it belongs. This view contains (stable) information about the group itself, its group objects and the server objects it contains (cf. figure 7). Part of this information corresponds to the information introduced in the previous section; the rest is implementation specific. In particular, each server object knows the External and Internal interfaces of all the other members.

module GSD {                   // IDL definition
  ...
  struct ServerObject {        // The stable information of a server object:
    string server;             // - The name of the server it represents
    string type;               // - The type of the server it represents
    string host;               // - Its location
    Object internal, external; // - Its object references
    AVSeq sc;                  // - The structural characteristics of
    ...                        //   the server it represents
  };
  struct GroupObject {         // The stable information of a group object:
    string host;               // - Its location
    Object external, admin;    // - Its object references
  };
  typedef sequence<ServerObject> SOSeq;
  typedef sequence<GroupObject> GOSeq;
  struct View {                // The view of a group:
    string name;               // - The name of the group
    string father;             // - The name of the father
    AVSeq sc;                  // - The characteristics of the group
    GOSeq group_objects;       // - The group objects of the group
    SOSeq server_objects;      // - The server objects the group contains
  };
};

Fig. 7. The view of a group
The interfaces. The group objects and the server objects all implement an External interface (cf. figure 8). The External interface of the group objects allows an agent (client or server) to join the group (joinRequest), to withdraw a representative from the group (leaveRequest) or to search for servers (searchRequest) satisfying a set of eligibility constraints (the eligibility parameter). The External interfaces of the server objects allow a group object to send them the requests it receives.
module GSD {                   // IDL definition
  ...
  interface External {
    void joinRequest(in RequestId id, in ServerObject n)
      raises(NoAliveMember, AlreadyJoined, Failure, JoinTo);
    void leaveRequest(in RequestId id, in StringSeq ms)
      raises(NoAliveMember, NotJoined, Failure);
    void searchRequest(in AVSeq eligibility, out BidSeq offers, out Node node)
      raises(NoAliveMember);
  };
  struct SonGroup {            // Information on a newly created son group:
    Object representative;     // - Its representative
    string site;               // - The location of the representative
    ObjectSeq gos;             // - The object references of its group objects
  };
  typedef sequence<SonGroup> SonGroupSeq;
  interface Internal {
    void forwardJoin(in RequestId id, in ServerObject n)
      raises(AlreadyJoined, Failure, JoinTo);
    void forwardLeave(in RequestId id, in StringSeq ms)
      raises(NotJoined, Failure);
    void forwardChange(in RequestId id, in ServerObject m)
      raises(NotJoined, Failure);
    void join(in RequestId id, in ServerObject n,
              in GroupObject new_go, in SonGroupSeq sons);
    void change(in RequestId id, in ServerObject m);
    void leave(in RequestId id, in StringSeq ms);
    void setView(in View v);
    void search(in AVSeq eligibility, out Bid offer);
  };
};

interface Administration {
  void insert(in Object so, in string site);
  void remove(in ObjectSeq sos);
};

Fig. 8. The interfaces of the GSD service
The server objects also implement an Internal interface (cf. figure 8). It must be noted that, from a conceptual point of view, the distinction between the External and the Internal interfaces consists in isolating the internal communications from those coming from outside the group: the Internal interface allows the server objects to communicate within the group, while the External interface describes the methods the agents are allowed to use to contact the group. Each group object provides an Administration interface (cf. figure 8). This interface allows the members of a group to update the knowledge of the group objects (i.e. the external object reference and the location of the server objects) when a member leaves the group (remove) or joins it (insert).
5 Outlook
Our implementation of the GSD service comes up against one limitation: the interface of a group is specific to the task allocation problem, since the group objects and the server objects have to implement the External interface. We are currently studying how the Dynamic Skeleton Interface (DSI) and the Dynamic Invocation Interface (DII) could help us implement generic group and server objects, i.e. objects independent of the application. The DII lets clients choose at run-time the operation invoked, through a set of standard APIs; in contrast to static stubs, the DII is independent of the target object's interface. The DSI provides a run-time binding mechanism for servers that do not have static skeletons, allowing them to handle any request dynamically. With these mechanisms, group and server objects could be implemented independently of the application setting, i.e. independently of a specific group interface.
6 Conclusion
In this article we have put forward a Group Self-Design (GSD) mechanism which enables the system to be structured into a tree whose nodes are groups and whose branches are labelled by the structural characteristics of the son groups. It is based on a group sub-division protocol which enables the construction of the tree-structure. By browsing through the tree-structure, a client can search for servers, and new agents can declare themselves within the system. The sub-division protocol is based on a deterministic sub-division function which enables the members of a group to determine, in an autonomous but uniform fashion, how to sub-divide a group. The sub-division function we put forward enables the construction of a tree-structure of functionally homogeneous groups: the nodes of the tree output by the sub-division function are groups whose members have structural characteristics in common. A comparative study between our approach, the contract net protocol of R. G. Smith, and the agent group model of B. Dillenseger and F. Bourdon has shown that group self-design is the only one of the three to place a limit on the network load created by the management of the system dynamics and by task allocation.
7 Acknowledgments
This work was carried out in collaboration with Michel Riveill from the University of Savoie and François Bourdon from the University of Caen. Special dedication to Bruno Dillenseger, the originator of the agent group model.
References

1. K. Birman, A. Schiper, and P. Stephenson. Lightweight causal and atomic group multicast. ACM Trans. Comput. Syst. 9(3) (1991)
2. K. P. Birman, R. Cooper and B. Gleeson. Programming with process groups: Group and multicast semantics. Technical Report TR91-1185, Cornell Univ., Computer Science Dept. (1991)
3. K. P. Birman. The process group approach to reliable distributed computing. Communications of the ACM 36(12) (1993)
4. D. Dolev and D. Malki. The Transis Approach to High Availability Cluster Communication. Communications of the ACM 39(4) (1996)
5. R. Davis and R. G. Smith. Negotiation as a metaphor for distributed problem-solving. Artificial Intelligence 20(1) (1983) 63–109
6. B. Dillenseger and F. Bourdon. Towards a multi-agent model for the office information system: a Prolog-based approach. In proceedings of PAP'95 (Practical Applications of Prolog) (1995) 191–200
7. B. Dillenseger and F. Bourdon. Supporting Intelligent Agents in a Distributed Environment: a COOL-based approach. In proceedings of TOOLS EUROPE-95 (Technology of Object-Oriented Languages and Systems) (1995) 235–246
8. P. A. Felber, B. Garbinato and R. Guerraoui. The Design of a CORBA Group Communication Service. In proceedings of the 15th Symposium on Reliable Distributed Systems (1996) 150–159
9. P. Felber, R. Guerraoui and A. Schiper. A CORBA Object Group Service. Workshop (CORBA: Implementation, Use, and Evaluation) of the 11th European Conference on Object-Oriented Programming (1997)
10. L. Gasser. Social conceptions of knowledge and action: DAI foundations and open systems semantics. IEEE Transactions on Systems, Man, and Cybernetics (1981) 107–138
11. F. Harrouët, R. Cozien, P. Reignier, et J. Tisseau. oRis: un langage pour simulations multi-agents. JFIADSMA'97 (1997)
12. C. Hewitt. Open Information Systems Semantics for DAI. Artificial Intelligence 8 (1991) 323–364
13. W. Jia, J. Cao, and X. Jia. Heuristic Token Selection for Total Order Reliable Multicast Communication. In proceedings of ISADS'97, the Third International Symposium on Autonomous Decentralized Systems (1997)
14. M. F. Kaashoek, A. S. Tanenbaum and K. Verstoep. Group Communication in Amoeba and its Applications. Distributed Systems Engineering Journal 1 (1993) 48–58
15. S. Landis and R. Stento. CORBA with fault tolerance. Object Magazine (1995)
16. S. Landis and S. Maffeis. Building Reliable Distributed Systems with CORBA. Theory and Practice of Object Systems, John Wiley (1997)
17. L. Liang, S. T. Chanson and G. W. Neufeld. Process Groups and Group Communications: Classifications and Requirements. IEEE Computer 23(2) (1990) 57–66
18. S. Maffeis. A Flexible System Design to Support Object-Groups and Object-Oriented Distributed Programming. In proceedings of the ECOOP'93 Workshop on Object-Based Distributed Programming (1994)
19. S. Maffeis. Adding Group Communication and Fault-Tolerance to CORBA. In proceedings of the 1995 USENIX Conference on Object-Oriented Technologies (1995)
20. S. Maffeis. A Fault-Tolerant CORBA Name Server. In proceedings of the IEEE Symposium on Reliable Distributed Systems (1996)
21. E. Malville and F. Bourdon. Task Allocation: A Group Self-Design Approach. In proceedings of the Third International Conference on Multi-Agent Systems (1998)
22. F. M. Costa and E. R. M. Madeira. An object group model and its implementation to support cooperative applications on CORBA. In proceedings of the IFIP/IEEE International Conference on Distributed Platforms: Client/Server and Beyond: DCE, CORBA, ODP and Advanced Distributed Applications (1996) 213–229
23. L. E. Moser, P. M. Melliar-Smith, D. A. Agarwal, R. K. Budhia and C. A. Lingley-Papadopoulos. Totem: A Fault-Tolerant Multicast Group Communication System. Communications of the ACM 39(4) (1996)
24. P. Narasimhan, L. E. Moser and P. M. Melliar-Smith. Consistency of Partitionable Object Groups in a CORBA Framework. In proceedings of the 30th Hawaii International Conference on System Sciences (1997) 120–129
25. Object Management Group. The Common Object Request Broker: Architecture and Specification. Document 97.02.25 (1996)
26. R. van Renesse, K. P. Birman and S. Maffeis. Horus: A Flexible Group Communication System. Communications of the ACM 39(4) (1996)
27. R. G. Smith. The Contract Net Protocol: High-Level Communication and Control in a Distributed Problem Solver. IEEE Transactions on Computers C-29(12) (1980) 1104–1113
Issues in Designing a Communication Architecture for Large-Scale Virtual Environments

Emmanuel Léty and Thierry Turletti

INRIA, 06902 Sophia Antipolis, FRANCE
{Emmanuel.Lety, Thierry.Turletti}@sophia.inria.fr
Abstract. This paper describes the issues in designing a communication architecture for large-scale virtual environments on the Internet. We propose a transport-layer approach, using multiple multicast groups and multiple agents. Our approach involves the dynamic partitioning of the virtual environment into spatial areas and the association of these areas with multicast groups. We describe a method to determine an appropriate cell-size that limits the traffic received per participant with a limited number of multicast groups.
1 Introduction
This paper describes the issues in designing a communication architecture for Large-Scale Virtual Environments (LSVE) on the Internet. Such virtual environments (VE) include massively multi-player games, Distributed Interactive Simulations (DIS) [1], and shared virtual worlds. Today, many of these applications have to handle an increasing number of participants and deal with the difficult problem of scalability. Moreover, the real-time requirements of these applications make the scalability problem more difficult to solve. In this paper, we consider only many-to-many applications, where each participant is both source and receiver. We also make the assumption that a single data flow is generated per participant. However, we believe that most of the results and mechanisms presented in this paper can be easily adapted to more complex applications that use several media types or layered encodings [2]. Using IP multicast solves part of the scalability problem by allowing each source to send data only once to all the participants, without having to deal with as many sequential or concurrent unicast sessions. However, with a large number of heterogeneous users, transmitting all the data to all the participants dramatically increases the probability of congestion within the network, and particularly at the receiver side. Indeed, processing and filtering all the packets received at the application level could overload local resources, especially if the rendering module is already processor intensive [3]. [4] shows that in a group communication setting, as the number of data flows and the number of users increase,
the percentage of the received content that is of interest to each participant decreases. This useless information represents a cost in terms of network bandwidth, router buffer occupancy and end-host resources, and is largely responsible for the degradation of performance in LSVE. We argue that the superfluous traffic has to be filtered out before it reaches the end-host. The main difficulty in this filtering mechanism comes from the heterogeneity and the dynamicity of the receivers, not only in terms of bandwidth and processing power but also in terms of data interest, and virtual and physical locations. In AIM [5], a network-layer approach is proposed that enables sources to restrict the delivery of packets to a subset of the multicast group members. However, this proposal requires modifications in the routers and is unfortunately not yet deployed in the Internet. In this paper, we propose a filtering mechanism at the transport layer, using multiple multicast groups and agents. Our approach involves dynamic VE partitioning into spatial areas called cells and the association of these cells with multicast groups. In [6], we described a method, based on the theory of planar point processes, to determine an appropriate cell-size so that the incoming traffic at the receiver side remains, with a given probability, below a sufficiently low threshold. The simulations presented in this paper are complementary to this work and present a new evaluation of the mapping algorithm. We then propose mechanisms to dynamically partition the VE into cells of different sizes, depending on the density of participants per cell, the number of available multicast groups, and the link bandwidth and processing power available per participant. The rest of the paper is organized as follows. Section 2 reviews the limitations of the current IP multicast model, presents the cell-based grouping strategy, and examines the tradeoff in selecting the cell-size parameter. Section 3 discusses simulation results on the impact of the cell-size and the density of participants on the traffic received at the receivers. Section 4 presents a communication architecture framework that allows a dynamic cell-based grouping strategy with a limited number of multicast groups. Finally, Section 5 discusses related work, and Section 6 concludes the paper and presents directions for future work.
2 Motivation
We now examine the different limitations on using multiple multicast groups, the cell-based grouping strategy, and how to select the best cell-size.

2.1 Multiple Multicast Groups Limitations
There are several limitations on the use of multiple multicast groups. First, we have to consider that today, multicast groups are not inexhaustible resources: the number of available multicast groups in IPv4 is limited to 268 million Class D addresses (IPv4 Class D addresses use a 28-bit address space). There is an increasing number of applications that require several multicast addresses, such as videoconferencing applications based on layered coding, or DIS applications. The widespread use of multicast increases the probability of address collisions. A few solutions have already been proposed in the literature to solve the multicast address allocation problem. For example, a scalable multicast address assignment based on DNS has been proposed in [7]. Another option could be the use of the Multicast Address Set Claim (MASC) protocol, which describes a hierarchical block allocation scheme for Class D addresses in the Internet [8]. Some alternatives to the current IP multicast model have also been proposed: [9] describes a multicast address space partitioning scheme based on the port number and the unicast host address. In Simple Multicast, a multicast group is identified by the pair (address of the group, address of the core of the multicast tree), which gives each core the full Class D address space [10]. In EXPRESS, a multicast channel is identified by both the sender's source address and the multicast group [11]. Finally, with IPv6, the multicast address space will be as large as the unicast address space, which will solve the multicast address assignment problem. However, none of these proposals is yet available in the Internet and most of them are still active research areas. Second, multicast addresses are expensive resources. The routing and forwarding tables within the network are limited resources with limited size. For each multicast group, all the routers of the associated multicast tree have to keep state about which ports are in the group. Hosts and routers also need to report their IP multicast group memberships periodically to their neighboring multicast routers using IGMP [12]. Moreover, some routing protocols (such as DVMRP [13]) rely on the periodic flooding of messages throughout the network. All this traffic has a cost, not only in terms of bandwidth but also in terms of join and leave latency, which should be taken into consideration for interactive applications [14]. Indeed, when a participant sends a join request, it can take several hundred milliseconds before the first multicast packet arrives. Such costs should obviously be considered in Large-Scale Multicast Applications (LSMA) and argue in favor of a bigger cell-size, and therefore of a limited number of multicast groups.

2.2 The Cell-Based Grouping Strategy
Grouping strategies consist in partitioning senders and receivers into multiple multicast groups, depending on their common interests. Before partitioning the entire set of participants into multiple multicast groups, the data in which users are interested have to be identified. In this paper, we define the user interest as the set of virtual entities that a user can interact with. Note that the entities located within the vision domain of a participant should be considered as only a part of its area of interest. However, users' interests can change during the session; in particular, new participants can join or leave a session. So, it is important to handle the dynamicity of these centers of interest during the session.
Once this identification is done, a grouping strategy has to be defined according to several parameters, such as the number of available multicast groups, link capacities at the receivers, etc. Different grouping strategies have been proposed for LSVE [15,16]. In this paper, we focus on the cell-based grouping strategy, which basically consists in partitioning the VE into cells and assigning a multicast group to each cell. During the session, each participant identifies the cell it is currently "virtually" located in, and sends its data to the associated multicast group. To receive the data from the other participants included in its area of interest, each participant has to join the multicast groups associated with the cells that intersect its area of interest (see the sketch below). Similarly, when a participant moves, it needs to leave the multicast groups associated with the cells which no longer intersect its area of interest. The cell-based grouping strategy is particularly suitable for VEs that can easily be partitioned into virtual areas (e.g., virtual Euclidean spaces). However, the main difficulty in this partitioning is to find the appropriate cell-size. Indeed, decreasing the cell-size increases the overhead associated with dynamic group membership, whereas increasing the cell-size increases the unwanted information received per participant [17]. In the following subsection, we examine the issues involved in selecting the best cell-size.
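As a minimal sketch of this membership computation (assuming an axis-aligned square grid of cell side s and a square area of interest of side r centred on the participant; the cell-to-group association then maps each cell index to a multicast address):

import math

def cells_to_join(x, y, r, s):
    # Grid indices of all cells whose squares intersect the area of interest.
    half = r / 2.0
    i0, i1 = math.floor((x - half) / s), math.floor((x + half) / s)
    j0, j1 = math.floor((y - half) / s), math.floor((y + half) / s)
    return {(i, j) for i in range(i0, i1 + 1) for j in range(j0, j1 + 1)}

def membership_update(old_cells, new_cells):
    # Groups to leave and to join after a move.
    return old_cells - new_cells, new_cells - old_cells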
2.3 The Cell-Size Tradeoff
Two approaches are possible to estimate the best cell-size in a LSVE: the first approach requires the pre-calculation of a static cell-size parameter, which remains the same during the whole session; the second approach consists in dynamically re-estimating the cell-size during the session, taking into account dynamic parameters. To motivate the choice of one of these two approaches, let us first identify the parameters involved in the cell-size calculation and then examine the impact of the dynamicity of these parameters on the appropriate cell-size.

– The number of available multicast groups is an important parameter to take into account for the cell-size calculation because it gives a lower bound on the cell-size. As the number of multicast groups used is inversely proportional to the size of the cells, a small set of available multicast groups will lead to a bigger cell-size.
– The receivers' capabilities identify the link capacities and the processing power available per receiver. Typically, this parameter limits the amount of traffic that receivers can handle. Assuming each user generates roughly the same amount of traffic, the incoming traffic per receiver grows linearly with the total number of sources contained in the multicast groups to which it has subscribed. Nevertheless, some of these participants may be located outside the area of interest but inside a cell that intersects this area of interest. The ratio between the corresponding number of unwanted participants and the total number of sources received represents the percentage of superfluous traffic received. So the cell-size, and more particularly the ratio between the cell-size and the size of the area of interest, has a direct impact on the amount of unwanted traffic.
– The density of participants represents the ratio between the number of participants and the size of the VE. In the cell-based grouping strategy, the area of interest is approximated by the smallest set of cells covering it. In the rest of the paper, we refer to the difference between these two areas as the superfluous area (see Figure 3). The density of participants in a VE thus has an impact not only on the average number of participants located in the area of interest, but also on the number located in the superfluous area. A smaller cell-size could allow a better approximation of the area of interest and a significant reduction of the superfluous area and its corresponding traffic.
– The participant velocity can be used, in a cell-based grouping VE, to estimate the bandwidth overhead generated when participants cross cells and the mean time that a participant stays in each cell. For example, assuming a straight and horizontal movement, the ratio of the cell-size to the participant velocity gives the mean time that the participant stays in a cell, and therefore the average frequency of join and leave messages.
3 Impact of the Cell-Size
In this section, we analyze the impact of the cell-size and the density of participants on the traffic received by participants. We denote by s the cell-size (i.e., the distance between two adjacent horizontal or vertical cell boundaries), by CellArea the cell area s², and by IArea the area of interest. Furthermore, the following assumptions are made for the simulations:

– The cells form a regular square grid on the plane; the left and right extremities, and the bottom and top extremities, are linked to each other, thus forming a torus.
– The participants are static and located on the plane according to a uniform distribution law, and each participant generates the same amount of traffic.
– IArea is a square of area r² centered on each participant.

In order to be as generic as possible, we focus on the impact of the ratio between CellArea and IArea (i.e., s²/r²). Figure 1 plots the average percentage of superfluous traffic out of the total traffic received by a participant. Since the participants are located on the plane according to a uniform distribution law, the percentage of superfluous traffic is equal to the ratio between the superfluous area and the area covered by all the cells that intersect IArea. We observe that when CellArea is larger than IArea, more than 70% of the traffic is superfluous. This figure also suggests that when CellArea is smaller than IArea, a slight reduction of CellArea significantly decreases the traffic received. However, it is important to notice that even 70% of superfluous traffic is acceptable compared to the situation in which all the users communicate on a single multicast group [3]. Indeed, this 70% represents the percentage of superfluous traffic out of the total traffic received, not out of the total superfluous traffic; with a single multicast group and a large number of participants, almost all the traffic received would be superfluous.
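Under these assumptions, the average percentage of superfluous traffic depends only on where the area of interest falls within a cell, so it can be estimated with a few lines of Monte-Carlo simulation (a sketch, not the simulator used in the paper); for s = r it yields about 75%, consistent with the "more than 70%" observation above:

import math
import random

def superfluous_fraction(s, r, trials=100_000):
    total = 0.0
    for _ in range(trials):
        # By symmetry, only the offset of IArea inside one cell matters.
        x, y = random.uniform(0, s), random.uniform(0, s)
        # Cells spanned by [x, x+r] horizontally and [y, y+r] vertically.
        nx = math.floor((x + r) / s) + 1
        ny = math.floor((y + r) / s) + 1
        covered = (nx * s) * (ny * s)     # area of the covering cells
        total += 1.0 - (r * r) / covered
    return total / trials

# superfluous_fraction(1.0, 1.0) ~= 0.75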
[Figure 1 plots the percentage of superfluous traffic (roughly 20% to 90%) against the CellArea/IArea ratio (0 to 1.6).]

Fig. 1. Percentage of Superfluous Traffic
In order to evaluate the maximum traffic received per participant with a given probability, we performed a set of extensive simulations with different densities of participants and different CellArea values. In each simulation, we kept track of the traffic received by each participant for 10,000 different distributions of participants on the grid. We then sorted an array containing the traffic values received over all the distributions and all the participants. In order to determine the maximum traffic with a probability p, we read the element whose index is equal to p times the number of elements in the sorted array. This allows us to assert that the traffic received per participant is less than the value of this element with probability p. In [6], we presented a model based on the theory of planar point processes to determine the probability that the received traffic stays below a given threshold; the following simulations are complementary to this model and confirm the impact of the cell-size on the traffic received by end-users. The left side of Figure 2 shows the maximum traffic received by a participant 50% of the time, depending on the density of participants in the VE and on the s²/r² ratio. Here, the density of participants represents the average number of participants per IArea. Expressing the density of participants in a VE in this way is very useful, as it allows us to modify CellArea without affecting the value of the density. The simulation shows that for a given value of the participant density, it is possible to find the largest ratio between CellArea and IArea such that this traffic remains under a sufficiently low threshold. Note that the maximum traffic received with a probability of 0.5 does not represent the average traffic received but the median value of the traffic received.
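This percentile computation is easy to reproduce; a sketch of it (the sample data is assumed, not the paper's) is:

def max_traffic(samples, p):
    # Maximum traffic received with probability p: sort every per-participant
    # traffic value collected over all distributions, then read the element
    # at index p * len(samples); it is exceeded only (1 - p) of the time.
    ordered = sorted(samples)
    index = min(int(p * len(ordered)), len(ordered) - 1)
    return ordered[index]

# e.g. the median and the 99th percentile of the collected traffic values:
# max_traffic(traffic_values, 0.5), max_traffic(traffic_values, 0.99)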
[Figure 2 shows two surface plots of the maximum traffic received as a function of the CellArea/IArea ratio and the participant density, for p = 0.5 (left) and p = 0.99 (right).]

Fig. 2. Max Traffic Threshold with p=0.5 and p=0.99
Finally, the right side of Figure 2 probably shows the most interesting results. In order to satisfy the participants in a VE, it is better to determine an appropriate CellArea so that the incoming traffic remains, with a high probability, below the maximum traffic that they can handle. This value of probability reflects the tradeoff between the satisfaction of the users and the required number of multicast groups. Figure 2 shows that for a given density of participants, it is possible to find the largest CellArea (i.e., the smallest number of multicast groups) such that the incoming traffic remains below a sufficiently low threshold with a probability of 0.99 (i.e., 99% of the time). Moreover, for a given CellArea, we observe that this traffic increases linearly with the density of participants. In [6], we showed that the mean residence time in a cell decreases exponentially as the mean velocity approaches 1 cell-size per second. This result argues in favor of a limited velocity in LSVE, so that the residence time per cell remains higher by orders of magnitude than the join and leave latency. Indeed, a participant needs to anticipate its join requests by subscribing to the multicast groups which map the cells where it can go during the time corresponding to a join latency. Hence, its velocity and the cell-size affect the number of multicast groups it needs to join by anticipation, and therefore the IGMP traffic generated.
4 Framework for a Scalable Communication Architecture
In this section, we describe a framework for a scalable communication architecture for LSVE. We believe that today, such many-to-many applications, with potentially thousands of users, require minimal management and administration support. Indeed, the number of participants in such applications is too large to transmit all data to all users and let them filter out the part of the data they are interested in. This would not only waste network and end-host resources, but also result in the fast degradation of the application performance at the end-user. Moreover, the aim of this paper is neither to propose a new IP multicast model nor to come up with a network-layer approach adding new mechanisms in the routers. Instead, we present a transport-layer solution with multiple agents, assuming that all the users are capable of receiving multicast transmissions. The goal of this architecture is to make LSVE scale to thousands of heterogeneous users on the Internet. Moreover, we claim that this solution works with a limited number of available multicast groups. In order to allow participants to select the information they would like to receive, we propose mechanisms using multiple multicast groups: each participant joins and leaves multicast groups depending on its interest in the content of the data transmitted. This section is organized as follows. First, we introduce a user satisfaction metric and present the role of the agents in our architecture. Then, we describe the information exchange process between participants and agents. Finally, we present our mapping algorithm with a first evaluation.

4.1 User Satisfaction Metric
An ideal situation from the end-user viewpoint can be defined as a situation where the traffic received contains no superfluous data. However, this situation is far from realistic, considering the cost of multicasting and, therefore, the limitation in the number of available multicast groups (see Section 2.1). For this purpose, we define the user satisfaction metric S as:

S = Ur / min(T, C)    (1)
where Ur stands for the interesting data rate received and processed (in the case of a homogeneous Poisson point process of intensity λ, this would be proportional to λr²); T represents the global data rate (received or not) in which the user is interested; and C stands for the receiver capability, i.e. the maximum data rate that the receiver can handle (limited by its network connectivity and/or processing power). When a participant receives and processes all the data it is interested in, this satisfaction metric is maximal whatever the superfluous traffic rate. Notice that for a particular user, S is also maximal when Ur is equal to C; this is true even though only a part of the interesting and useful data is received by the application. We justify the choice of this metric by the necessary tradeoff between the superfluous data rate received, the network state, and the overhead associated with dynamic group membership. Note that the goal of our architecture is not to maximize the satisfaction of the worst receiver in terms of network connectivity and processing power, but of the receiver that is the least satisfied. This approach, often referred to as max-min fairness, is described in [18].
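A direct transcription of equation (1) and of this max-min objective (a sketch; receivers are represented as (Ur, T, C) triples):

def satisfaction(Ur, T, C):
    # Equation (1): S = Ur / min(T, C).
    return Ur / min(T, C)

def least_satisfied(receivers):
    # The max-min objective targets the least satisfied receiver first,
    # not the one with the worst connectivity or processing power.
    return min(receivers, key=lambda r: satisfaction(*r))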
4.2 Agents Responsibility
Let us define agents as servers or processes running at different parts of the network (e.g., on a campus LAN, hosted by an ISP or by LSVE developers). Administrators of LSVE are responsible for deploying such agents on the Internet and for positioning them as close as possible to their potential users. Our approach requires the dynamic partitioning of the VE into cells of different sizes, and the association of these cells with multicast groups. Agents have to dynamically determine appropriate cell-size values in order to maximize the users' satisfaction. Before any participant is connected, the VE has to be partitioned into start-zones, according to its intrinsic structure (e.g., walls, rooms, etc.). Each start-zone is then associated with a single multicast group. During the session, four successive operations are required (a sketch follows the figure below):

– Partition each start-zone into several zones, according to the distribution of users within the start-zone.
– Compute the appropriate cell-size for each zone, according to the parameters listed in Section 2.3 (see Figure 3).
– Divide each zone into cells, according to its computed cell-size, and assign a multicast group address to each cell of each zone.
– Inform the participants of which multicast groups they need to join in order to interact with participants located around them.
[Figure 3 depicts a partitioning with different cell-sizes, showing a participant's area of interest, the cells approximating it, and the resulting superfluous area.]

Fig. 3. Partitioning with different cell-sizes
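A sketch of the first three operations (the mapping algorithm defined below); split_by_density, appropriate_cell_size and grid_cells are hypothetical helpers standing in for the mechanisms of Sections 2.3 and 4.5:

def remap(start_zone, user_positions, available_groups):
    mapping = {}
    for zone in split_by_density(start_zone, user_positions):       # op. 1
        s = appropriate_cell_size(zone, user_positions,
                                  len(available_groups))            # op. 2
        for cell in grid_cells(zone, s):                            # op. 3
            mapping[cell] = available_groups.pop()
    # The fourth operation sends the relevant part of `mapping` to the
    # participants, on request (see Section 4.4).
    return mapping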
In the rest of the paper, we refer to the first three operations as the mapping algorithm, and we designate their results as the mapping information. Concerning the fourth operation, it is necessary to distinguish between two different situations. The first situation happens when a participant moving in the VE is about to enter an area for which it does not have the mapping information. The second situation occurs when agents decide that the cell-size of a part of the VE is no longer appropriate, for example if the density of participants in this area suddenly increases. In this case, a new cell-size needs to be computed and the participants currently located in this area need to update their group memberships. Moreover, participants need to keep interacting with each other without suffering from this remapping. From now on, we refer to this critical operation as the handover management (see Section 4.5).
4.3 Mapping Information
In order to communicate mapping information to users, i.e., the association between cells and multicast group addresses, it is necessary to find a way to identify and name these cells within the VE. Moreover, the VE could be a structured environment with walls and rooms of different sizes: two participants can be very close to each other but, if a wall separates them, no interaction is possible. This specific information should be taken into account before partitioning a VE into different zones. Note that cells have the same size within each zone and that all cells are squares (with the exception of a few of them located at the border of the zones). To refer to a virtual position in the VE in a permanent way, we divide the VE into area units; a cell thus contains an integer number of area units and a zone an integer number of cells. The area unit is chosen according to the maximal participant velocity, the number of available multicast groups, the average size of the participants' areas of interest, and the join and leave latency. Once this division is done, each area unit is referenced by its position in the VE. A matrix of "probability of interaction" between area units is built according to the structure of the VE. Agents use this matrix to dynamically define the different zones of the VE and the mapping information.
4.4 Participants-to-Agent Communication
Figure 4 shows the different levels of communication in our architecture:

– Each participant subscribes to one or more multicast groups but sends data packets on a single group.
– Each participant is connected to a single agent, using a unicast connection.
– Agents communicate with each other on a single multicast group: the Agent Multicast Group (AMG).
– Agents subscribe to users' multicast groups during handover operations.
– New participants send Hello packets on the agents' multicast group.

[Figure 4 depicts these communication levels: participants on multiple multicast groups, unicast participant-to-agent connections, the AMG between agents, agents joining users' groups during handover only, and agent discovery by new participants.]

Fig. 4. Communication architecture

Let us now describe the way participants enter the VE and the different messages they send to their agents.

Connection to the Virtual Environment. We assume that before starting a session, participants have already downloaded the VE description and know the agents' multicast group address. When a new participant wants to enter the VE, it first needs to find the "closest" agent before registering and starting a login process. In our architecture, end-users discover agents by sending Hello packets on the agent multicast group address (they do not need to request membership of that group). This agent discovery is done using either an incremental TTL-based mechanism or an RTT-based mechanism, depending on the distance metric we decide to choose. As soon as an agent receives a Hello packet from a new participant, it opens a TCP connection with it and starts the login process. Afterwards, an optional authentication process can start. Only the mapping information concerning the virtual area where the new participant is located is transmitted during this connection. Indeed, the mapping information of other parts of the VE might change before the participant needs to use it.
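A sketch of the incremental TTL-based discovery (an expanding-ring search) is given below; the agent multicast address is made up, and a real implementation would carry a structured Hello payload rather than a bare string:

import socket
import struct

AGENT_GROUP = ("224.2.2.2", 9999)          # hypothetical AMG address

def discover_agent(timeout=0.5):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    for ttl in (1, 2, 4, 8, 16, 32):       # widen the ring on each failure
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL,
                        struct.pack("b", ttl))
        # No group membership is needed to send to the group.
        sock.sendto(b"HELLO", AGENT_GROUP)
        try:
            # The closest agent answers in unicast, then opens the TCP
            # connection used for the login process.
            _, agent_addr = sock.recvfrom(2048)
            return agent_addr
        except socket.timeout:
            continue
    return None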
Map Request Message. As participants move in the VE, they enter new virtual areas and require the associated mapping information in order to keep interacting with other participants. Therefore, when participants are about to enter a new part of the VE, they send a unicast map request to their agent. These requests contain their current position in the VE, so that agents can send back to them the right mapping information. Note that participants have to anticipate their map requests in order to obtain the mapping information before they enter the new area. The anticipation time depends on the round-trip time to the associated agent, the participant velocity, and the size of its area of interest. Moreover, mapping information is considered valid by a participant only for a short duration after its reception. Indeed, if a participant receives the mapping information but finally decides to stay out of the corresponding area for a while, the mapping information for that area may change during that time (see Section 4.5). So, before entering a virtual area, a participant needs to check whether its mapping information is still valid: if the difference between the current time and the reception time of the mapping information exceeds the validity period, a new map request is necessary. In case of a remapping, the agents which have sent mapping information during the time corresponding to a validity period re-send the new mapping information to the corresponding users.

Remapping Request Message. In LSVE, heterogeneity at the receivers implies that some users are able to interact simultaneously with a large number of other participants whereas other users are much more limited. However, both can be confronted with the situation in which the data rate received is about to exceed the maximum data rate they can handle. Two different reasons may lead to this situation:

– the number of participants located in the user's area of interest exceeds the maximum number of participants it can handle;
– the sum of the numbers of participants located in its area of interest and in its superfluous area (see Figure 3) exceeds the maximum number of participants it can handle.

In the first case, there is no way for our architecture to increase the satisfaction of the participant; the only thing the participant can do is leave the multicast groups which map the "least interesting" part of its area of interest. In the second case, the participant can ask for a better mapping, i.e., a better cell-size. Indeed, with a smaller cell-size, its area of interest will be better approximated and, therefore, its superfluous area will be reduced. In this case, the participant sends to its agent a remapping request containing its virtual position.
4.5 Agent Algorithm Overview
We now describe the communication between agents, then the mapping and handover algorithms, and finally evaluate the performance of the mapping algorithm.

Agent-to-Agent Communication
We assume that, at the beginning, agents only know the maximal velocity of the participants. This assumption is realistic, considering that most of the time the maximal velocity is an application-dependent parameter. In our architecture, two kinds of information are used by the agents to partition the VE into zones and cells: map requests and remapping requests. Since a map request contains the virtual position of its sender, each agent is able to track the location of its connected users in the VE. In order to evaluate the density of
participants within each zone, agents exchange information on the AMG multicast group. However, agents do not need to send the exact virtual positions of their associated users: only the number of users per zone is necessary for agents to periodically compute the density of participants per zone. Remapping requests inform agents of the possible dissatisfaction of some of their connected users. As remapping requests also contain the virtual positions of their respective senders, agents can use these messages to define new zones where a more appropriate cell-size should be computed. In order to process all the remapping requests received per zone, each agent forwards all the remapping requests it receives on the agent multicast group. Using these messages, agents can jointly decide when and where to modify the mapping of the VE.

Mapping Algorithm
The same mapping algorithm is used by each agent. Basically, throughout the session, agents periodically compute the average density of participants per multicast group by dividing the number of connected participants by the number of multicast groups available to the application. We refer to this density as the remapping threshold of the mapping algorithm. Since a VE disposes of a limited set of multicast groups, the number of cells in the VE is also limited. So, the density of participants in the VE must be bounded for agents to be able to maximize user satisfaction. In order to make a VE scalable to a large number of participants, the following solutions are possible:
– Build an extensible VE whose size adapts to the number of users, so that the average density of participants in the VE always remains under a maximal threshold.
– Limit the maximum number of participants connected to the VE, and build a VE large enough that the average density of participants in the VE never exceeds a maximal threshold.
– Use protocols such as the ones defined in MASC to dynamically allocate more multicast groups to the LSVE whenever the density of participants exceeds a maximal threshold. However, this solution only solves part of the problem by reducing the average size of the superfluous area: as the density of participants increases, more and more participants have to reduce their area of interest in order to avoid packet loss or CPU overload.
As participants arrive and move in the VE, agents keep track of the density of participants in each zone. Two possible reasons can lead to the division of a zone into smaller cells:
– A smaller cell-size exists for which the average density of participants per cell exceeds the remapping threshold.
– Remapping requests are sent by participants located in the zone.
In the first case, agents can use the density of participants in the zone to compute a more appropriate cell-size, taking into account the number of available multicast groups. In the second case, agents first determine the distribution of remapping requests within the zone. If agents detect a concentration of “unsatisfied users” in only a part of the zone, the zone is divided into smaller zones and a new cell-size is evaluated for each of them. However, before proceeding with a remapping, agents have to check whether there are still enough available multicast groups. Conversely, agents can decide to remap a zone using bigger cells. This remapping occurs when the density of participants per cell is smaller than the remapping threshold. Moreover, if neighboring zones contain only one cell each (i.e., each zone is associated with a single multicast group), agents can decide to merge these zones into a single one. This situation can occur if the resulting zone contains fewer participants than the remapping threshold. Nevertheless, two start zones can never be merged.

Handover Management
This is certainly the most critical operation in an LSVE. When agents decide to change the mapping of a zone, the participants located in that zone need to keep interacting with each other while they update their group memberships. The successive operations required to realize a handover are as follows:
– Agents elect a designated agent, which takes care of the operation. The designated agent is the one with the highest number of connected users involved in the handover. If several agents are candidates at the same time, a selection can be made based on their IP addresses (see the sketch after this list).
– The designated agent joins all the current multicast groups associated with the cells of the remapped zones. Since each group maps a cell, the agent only sends in each group the mapping information relative to the neighboring virtual area of that cell.
– When a participant receives the new mapping information, it joins the new groups which map its area of interest. However, the participant waits for the time corresponding to the join latency [14] before sending in the new groups. Thus, when the first packet is sent to a group, the new multicast tree is already established between the participants.
– As the designated agent keeps receiving information on the old groups (the new mapping information might have been lost before reaching some end-user), it periodically re-sends the new mapping information on the old multicast groups.
– When the agent detects that no packet has been received on an old multicast group for a given period of time, it sends a Free packet to that group.
– When a participant receives a Free packet, it leaves the corresponding multicast group.
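The election rule in the first step above is mechanical enough to sketch directly. The class below is illustrative only; it assumes agents already know, from the exchanges on the agent multicast group, how many of their connected users are involved in the handover, and it breaks ties by comparing IP addresses as strings (the paper leaves the exact comparison open).

import java.util.Comparator;
import java.util.List;

// Illustrative election of the designated agent for a handover: the agent
// with the most connected users involved in the handover wins; ties are
// broken deterministically on the IP address.
class AgentCandidate {
    final String ipAddress;
    final int usersInHandover;

    AgentCandidate(String ipAddress, int usersInHandover) {
        this.ipAddress = ipAddress;
        this.usersInHandover = usersInHandover;
    }
}

class DesignatedAgentElection {
    static AgentCandidate elect(List<AgentCandidate> candidates) {
        return candidates.stream()
                .max(Comparator.comparingInt((AgentCandidate a) -> a.usersInHandover)
                        .thenComparing(a -> a.ipAddress))
                .orElseThrow(() -> new IllegalArgumentException("no candidates"));
    }
}

Since every agent evaluates the same deterministic rule on the same shared information, all agents arrive at the same designated agent without any extra election messages.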
Performance of the Mapping Algorithm
In order to evaluate the performance of the mapping algorithm, we compared it with a static partitioning strategy by simulating a square VE with 511 participants and 144 multicast groups. First, the VE was partitioned into 3x3 square start-zones. For the static partitioning strategy, we divided the VE into 12x12 square cells of the same size (i.e., 4x4 cells per start-zone). With the mapping algorithm, each start-zone was dynamically divided into square cells, depending on its density of participants. Note that, as cells and zones are both square, the number of cells per zone takes its value in {1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121}. However, the total number of cells in the VE was always less than or equal to the number of available multicast groups. For each partitioning strategy, we ran two major sets of simulations. In the first set, the 511 participants were randomly distributed over the whole VE according to a uniform distribution. In the second set, we randomly distributed participants within each start-zone according to the same law, but we fixed the number of participants per zone: the first zone contained 256 participants, the second zone 128 participants, the third zone 64 participants, and so on. During these simulations, we considered that participants had different capabilities, uniformly distributed between 15 and 35 sources. For example, if a participant was able to handle a maximum of 20 sources but 40 participants were located in the cells intersecting its area of interest, then only half of its incoming traffic was received and processed. The mapping algorithm first computed an appropriate cell-size for each zone based on the density of participants, and then kept dividing the zones where the participants with the lowest satisfaction were located (see Section 4.1). Figure 5 shows the distribution of participants over the different multicast groups, i.e., over the different cells of the VE. As expected, Figure 5 shows that with a
Fig. 5. Distribution of participants in multicast groups: percentage of groups versus number of participants per group, for static and dynamic cell-sizes under uniform and non-uniform participant distributions. (Plot not reproduced.)
uniform distribution, the static partitioning performs as well as our algorithm: in both cases, more than 95% of the groups contain between 1 and 10 users, with an average of 3.5 users per group. This average is equal to the ratio between the total number of participants and the total number of available multicast groups (511/144 ≈ 3.5). Moreover, Figure 5 shows that the two strategies differ completely in the second set of simulations. Indeed, with the static strategy, more than 40% of the multicast groups contain no user, while some other groups contain up to 25 users. The mapping algorithm, however, yields approximately the same distribution of participants over multicast groups as in the first set of simulations. It thus demonstrates its capacity to adapt to the VE, even with a limited number of available multicast groups. In order to evaluate satisfaction, we use the metric S defined in Section 4.1. The first simulation, done using a uniform distribution of participants, shows that more than 95% of participants have a maximal satisfaction (S = 1) independently of the mapping strategy (static or dynamic). The result of the second simulation, done with a non-uniform distribution of participants, is shown in Figure 6.
Fig. 6. Percentage of satisfaction: percentage of participants versus satisfaction S, for static and dynamic cell-sizes under a non-uniform distribution. (Plot not reproduced.)
We note that about 80% of participants have a maximal satisfaction using the mapping algorithm, compared to less than 55% with the static partitioning. Moreover, it is important to notice that in the case of a non-uniform distribution of participants, some participants receive a higher data rate than they can handle, especially the participants located within the zone with the highest density of participants. However, as the mapping algorithm computes a smaller cell-size in this zone, participants can gradually reduce their number of subscribed multicast groups by leaving the groups which map the “least interesting part” of their area of interest.
5 Related Work
There has been relatively little published work on evaluating grouping strategies for LSVE. [19] analyzes the performance of a grid-based relevance filtering algorithm that estimates the cell-size value which minimizes both the network traffic and the use of scarce multicast resources. However, the paper presents specific simulations done with different grid granularities for several types of DIS entities; the generic case is not studied. [16] compares the cost of cell-based and entity-based grouping strategies using both static and dynamic models, but does not propose any solution to calculate the cell-size value. Different architectures using multiple multicast groups have already been designed for LSVE, such as NPSNET [15], SPLINE [20], and MASSIVE-2 [21]. [22] also suggests an approach for interest management using multicast groups. However, none of them presents an architecture that dynamically partitions the VE into multicast groups taking into account the density of participants per cell and the participants' capabilities.
6 Conclusion and Future Work
Although the current IP multicast model has many shortcomings for LSVE applications, we have described a communication architecture that enables us to run such applications on the Internet today. We have proposed and simulated mechanisms to dynamically partition a VE into cells of different sizes, depending on the density of participants per cell, the number of available multicast groups, and the link bandwidth and processing power available per participant. We believe that our framework is flexible enough to be easily adapted to new approaches to multicast, such as EXPRESS [11] or Simple Multicast [10], and also to benefit from new functionality like the future support for source filtering in IGMP [23]. Directions for future work include the use of the congestion control scheme for multicast UDP streams described in [24], the extension of the architecture to multi-flow sources, and the implementation of and experimentation with a real LSVE application on the Internet.
References

1. J.M. Pullen, M. Myjak, and C. Bouwens, “Limitations of internet protocol suite for distributed simulation in the large multicast environment,” Request for Comments 2502, February 1999.
2. S. McCanne, V. Jacobson, and M. Vetterli, “Receiver-driven layered multicast,” in Proc. of ACM SIGCOMM, August 1996.
3. E. Lety, L. Gautier, and C. Diot, “MiMaze, a 3D multi-player game on the Internet,” in Proc. of 4th International Conference on VSMM (Virtual Systems and MultiMedia), Gifu, Japan, November 1998.
4. B.N. Levine, J. Crowcroft, C. Diot, J.J. Garcia-Luna-Aceves, and J. Kurose, “Consideration of receiver interest for IP multicast delivery,” submitted for publication, July 1999.
5. B.N. Levine and J.J. Garcia-Luna-Aceves, “Improving internet multicast with routing labels,” in Proc. of IEEE International Conference on Network Protocols, October 1997, pp. 241–250.
6. E. Lety, T. Turletti, and F. Baccelli, “Cell-based multicast grouping in large-scale virtual environments,” Tech. Rep. RR-3729, INRIA, July 1999.
7. M. Sola, M. Ohta, and T. Maeno, “Scalability of internet multicast protocols,” in Proc. of INET, 1998.
8. S. Kumar, P. Radoslavov, D. Thaler, C. Alaettinoglu, D. Estrin, and M. Handley, “The MASC/BGMP architecture for inter-domain multicast routing,” in Proc. of ACM SIGCOMM, September 1998.
9. S. Pejhan, A. Eleftheriadis, and D. Anastassiou, “Distributed multicast address management in the global internet,” IEEE Journal on Selected Areas in Communications, pp. 1445–1456, October 1995.
10. R. Perlman, C.-Y. Lee, A. Ballardie, J. Crowcroft, Z. Wang, T. Maufer, C. Diot, and M. Green, “Simple multicast: A design for simple, low-overhead multicast,” Internet Draft, February 1999.
11. H.W. Holbrook and D.R. Cheriton, “IP multicast channels: EXPRESS support for large-scale single-source applications,” in Proc. of ACM SIGCOMM, August 1999.
12. S. Deering, “Host extensions for IP multicasting,” RFC 1112, August 1989.
13. D. Waitzman, C. Partridge, and S. Deering, “Distance vector multicast routing protocol,” RFC 1075, November 1988.
14. L. Rizzo, “Fast group management in IGMP,” in Proc. of HIPPARCH’98 Workshop, UCL, London, UK, June 1998.
15. M.R. Macedonia, M.J. Zyda, D.R. Pratt, D.P. Brutzman, and P.T. Barham, “Exploiting reality with multicast groups: A network architecture for large-scale virtual environments,” IEEE Computer Graphics and Applications, vol. 15, no. 5, pp. 38–45, September 1995.
16. L. Zou, M.H. Ammar, and C. Diot, “An evaluation of grouping techniques for state dissemination in networked multi-user games,” submitted for publication, May 1999.
17. D.J. Van Hook, S.J. Rak, and J.O. Calvin, “Approaches to relevance filtering,” in Proc. of 11th DIS Workshop, September 1994.
18. D. Bertsekas and R. Gallager, Data Networks, chapter 6, pp. 524–529, Prentice-Hall, 1987.
19. S.J. Rak and D.J. Van Hook, “Evaluation of grid-based relevance filtering for multicast group assignment,” in Proc. of 14th DIS Workshop, 1996.
20. J.W. Barrus, R.C. Waters, and D.B. Anderson, “Locales and beacons: Efficient and precise support for large multi-user virtual environments,” in Proc. of IEEE Virtual Reality Annual International Symposium, March 1996.
21. C. Greenhalgh, “Dynamic, embodied multicast groups in MASSIVE-2,” Tech. Rep. NOTTCS-TR-96-8, Department of Computer Science, The University of Nottingham, UK, 1996.
22. H. Abrams, K. Watsen, and M. Zyda, “Three-tiered interest management for large-scale virtual environments,” in Proc. of ACM Symposium on Virtual Reality Software and Technology, Taipei, Taiwan, 1998.
23. B. Cain, S. Deering, and A. Thyagarajan, “Internet group management protocol, version 3,” Internet Draft, February 1999.
24. A. Clerget, “TUF: A tag-based UDP multicast flow control protocol,” Tech. Rep. RR-3728, INRIA, July 1999.
HyperCast: A Protocol for Maintaining Multicast Group Members in a Logical Hypercube Topology*

Jörg Liebeherr¹ and Tyler K. Beam²
¹ Computer Science Department, University of Virginia, Charlottesville, VA 22903, USA
² Microsoft Corporation, Redmond, WA 98052, USA
Abstract. To efficiently support large-scale multicast applications with many thousands of simultaneous members, it is essential that protocol mechanisms be available which support the efficient exchange of control information between the members of a multicast group. Recently, we proposed the use of a control topology which organizes multicast group members in a logical n-dimensional hypercube and transmits all control information along the edges of the hypercube. In this paper, we present the design, verification, and implementation of a protocol, called HyperCast, which maintains the members of a multicast group in a logical hypercube. We use measurement experiments with an implementation of the protocol on a networked computer cluster to quantitatively assess the performance of the protocol for multicast group sizes of up to 1024 members.
1 Introduction
A major impediment to the scalability of multicast applications is the need of multicast group members to exchange control information with each other. Consider, for example, the implementation of a reliable multicast service. A unicast protocol with a single sender and a single receiver requires the receiver to send positive (ACK) or negative (NACK) acknowledgment packets to indicate reception or loss of data. If the same mechanism is applied to large groups, the sender is soon flooded by the number of incoming ACK or NACK packets; this is referred to as the ACK implosion problem [5]. Even though many techniques and protocol mechanisms have been proposed to improve the scalability of multicast applications by solving the ACK implosion problem (e.g., [6][15][18]), the problem of protocol support for large-scale multicast applications, especially with a large number of senders, is not solved, and scalability to thousands of users is currently not feasible [19]. We are pursuing a novel approach to the problem of scalable multicasting in packet-switched networks. The key to our approach is to organize the members of a multicast group in a logical n-dimensional hypercube.
* This work is supported in part by the National Science Foundation under grants ANI-9870336 and NCR-9624106 (CAREER). The work of Tyler Beam was done while he was with the University of Virginia.
By exploiting the symmetry properties of a hypercube, operations which require an exchange of control information between multicast group members can be efficiently implemented. Note that we do not consider using the hypercube for data transmissions, even though this is feasible. The main use of the hypercube is to channel the transmission of control information, such as acknowledgments, so as to avoid the ACK implosion problem. In a previous paper [13], we showed that by labeling multicast group members as the nodes of a hypercube, we can almost trivially build so-called acknowledgment trees [22], which can avoid ACK implosion even in very large multicast groups. We also performed an analysis of the load-balancing properties of tree embeddings in a hypercube, and demonstrated that the trees embedded in a hypercube have excellent load-balancing properties. In this paper, we present the protocol mechanisms needed to maintain a hypercube in a connectionless wide-area network such as the Internet. We discuss the design of a protocol, called HyperCast, which organizes the members of an IP multicast group in a logical hypercube, and we evaluate its scalability properties through measurements of a prototype implementation. We demonstrate that the HyperCast protocol can maintain a hypercube in a multicast group with dynamically changing group membership. The HyperCast protocol achieves scalability through a distributed soft-state implementation: no entity in the network has knowledge of the entire group. The protocol is capable of repairing a hypercube which has become inconsistent through failures of group members, network faults, or packet losses. The approach presented in this paper is intended for many-to-many multicast applications where each group member can transmit information. There are many multicast applications where only one or a few multicast group members actually transmit information, e.g., multicast web servers or electronic software upgrades. We do not claim that, in these situations, our approach presents any significant advantages over currently available solutions.
2 Control Topologies for Multicast Groups
In recent years, many protocol mechanisms have been proposed to solve the ACK implosion problem, mostly in the context of providing a reliable multicast service (e.g., [1][5][7][8][11][12][16][17][22]). In packet-switched networks, we find two main approaches to limit the volume of control information which causes ACK implosion. In one approach, control information is broadcast to all or a subgroup of the multicast group members, and a backoff algorithm [2][7] or a predefined bound on the volume of control traffic [21] is used to avoid ACK implosion. In the second approach, multicast group members are organized in a logical graph, henceforth called a control topology, and every group member exchanges control information only with its neighbors in the logical graph. Control topologies that have been considered include rings [4][23] and trees [8][10][11][12][17][22][24].
Tree topologies have emerged as the most often proposed control topology. In a (proto-)typical tree topology, the members of a multicast group are organized as a rooted spanning tree, and all control information is transmitted along the edges of the tree. Tree topologies achieve scalability by exploiting the hierarchical structure of the tree. For example, by “merging” acknowledgments at the internal nodes of a tree, the number of acknowledgments received by a group member is limited to the number of its children, thus avoiding ACK implosion. In multicast applications with multiple senders, one tree is needed for each sender. Since maintaining a separate tree for each sender introduces substantial overhead, several protocols propose to use a single spanning tree, a so-called shared tree [11][17], and to “rehang” the tree with different nodes as root node.
Figure 1: Re-hanging a shared tree with different nodes as root: (a) node 1 is root; (b) node 5 is root. (Diagram not reproduced.)

In Figure 1-a, we show a binary tree control topology with node 1 as root. Figure 1-b depicts the same tree, “rehung” with node 5 as root node. Among the currently considered topologies, tree-based topologies seem to be the best suited to support large multicast groups. However, an analysis presented in [13] showed that, in multicast groups with a large number of senders, rehanging shared trees may not balance the load of processing control information among group members. We showed that a hypercube and tree-embedding algorithms in a hypercube (presented in the next section) improve the load-balancing properties.
3 Group Communication with Hypercubes
In this section, we describe the underpinnings of the proposed approach of using logical hypercubes to support group communications, from [13]. An n-dimensional hypercube is a graph with 2^n nodes. Each node is labeled by a bit string k_n…k_1, where k_i ∈ {0, 1}. Two nodes in a hypercube are connected by an edge if and only if their bit strings differ in exactly one position. A hypercube of dimension n = 3 is shown in Figure 2.
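Since two labels are adjacent exactly when they differ in one bit, the neighbors of a node can be enumerated by XOR-ing its label with every single-bit mask. The sketch below (ours, not from the paper) stores labels in an int, which matches the 31 usable label bits of the HyperCast implementation described later.

// Neighbor computation in an n-dimensional hypercube with int-encoded labels.
class HypercubeLabels {
    // All n potential neighbors of a label: flip each bit in turn.
    static int[] neighbors(int label, int dimension) {
        int[] result = new int[dimension];
        for (int bit = 0; bit < dimension; bit++) {
            result[bit] = label ^ (1 << bit); // differs in exactly one position
        }
        return result;
    }

    // Two labels are adjacent iff their XOR has exactly one bit set.
    static boolean areNeighbors(int a, int b) {
        int diff = a ^ b;
        return diff != 0 && (diff & (diff - 1)) == 0;
    }
}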
We organize multicast group members as the nodes of a logical n-dimensional hypercube. By imposing a particular ordering on the nodes, we can efficiently embed spanning trees into the hypercube topology. By enforcing that control information between multicast group members can only be transmitted to the parent node in the spanning tree, the ACK implosion problem can be avoided. Since spanning trees serve the function of filtering acknowledgments transmitted to the root of the tree, the trees are referred to as acknowledgment trees. Since, in an actual multicast group, the number of group members will generally not be a power of 2, we need to be able to work with hypercubes where certain positions are not occupied. We refer to a hypercube with N nodes and N < 2^n as an incomplete hypercube. For incomplete hypercubes we will try to maintain the following properties:
Compactness: The dimension of the hypercube should be kept as small as possible, n = ⌈log2 N⌉.
Complete Containment of Trees: If we compute an acknowledgment tree for an incomplete hypercube, we want to ensure that the tree is a subgraph of the incomplete hypercube. That is, no node should be part of an acknowledgment tree if the node is not present in the cube.

Figure 2: 3-dimensional hypercube with node labels. (Diagram not reproduced.)

In a dynamic hypercube, compactness can be achieved by labeling newly added nodes in a specific order and by properly relabeling nodes whenever a node leaves the hypercube. Maintaining complete containment, however, is difficult to achieve if the acknowledgment trees are computed in a distributed fashion and without global state information. In [13], we presented a simple algorithm which guarantees complete containment of embedded trees. A key idea behind the algorithm is to use a Gray code [20] for ordering the node labels of a hypercube and to add nodes to the hypercube in the order given by the Gray code. As an example, consider the labels of the 3-dimensional hypercube in Figure 2. If we add nodes to the hypercube, we need a rule for the order in which node labels are added. If we used the order of a binary encoding, nodes would be added in the sequence 000, 001, 010, 011, …, 111. Using the ordering given by a Gray code, we add node labels in the following order:
000, 001, 011, 010, …, 100. In Table 1, we show the ordering of labels according to a binary code and a Gray code. Note that consecutive node labels in a Gray code differ in exactly one bit position.

Table 1: Comparison of binary code and Gray code.

Index i:             0    1    2    3    4    5    6    7
Binary code Bin(i):  000  001  010  011  100  101  110  111
Gray code G(i):      000  001  011  010  110  111  101  100
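The Gray code in Table 1 is the standard reflected binary code, so the conversion in both directions can be written in a few lines using the well-known formulas G(i) = i XOR (i >> 1) and the prefix-XOR inverse [20]. The sketch below reproduces the ordering of Table 1; the class name is our own.

// Reflected binary Gray code and its inverse, matching Table 1
// (e.g., toGray(4) = 0b110 and fromGray(0b110) = 4).
class GrayCode {
    // G(i): the standard reflected Gray code.
    static int toGray(int i) {
        return i ^ (i >>> 1);
    }

    // G^{-1}(g): invert by accumulating the prefix XOR of the bits.
    static int fromGray(int g) {
        int i = g;
        for (int shift = 1; shift < Integer.SIZE; shift <<= 1) {
            i ^= i >>> shift;
        }
        return i;
    }
}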
Using a Gray code, we can devise a simple algorithm which embeds a spanning tree into an incomplete hypercube. The algorithm, given in Figure 3, implements a spanning tree in a distributed fashion: a node with label G(i) calculates the label of its parent node in the tree rooted at the node with label G(r), using only the labels G(i) and G(r) as input. The algorithm consists of flipping a single bit. The trees constructed by our algorithm have the following properties:
Property 1. The path length between a node and the root is given by the Hamming distance of their labels.
Property 2. If N = 2^n, that is, the hypercube is complete, then the embedding results in a binomial tree.
Property 3. In an incomplete and compact hypercube, the trees obtained by the algorithm are completely contained.
In Figure 4, we show a spanning tree generated by the algorithm for a root with label 111 in an incomplete hypercube with 7 nodes.

Input: Label of the i-th node in the Gray encoding: G(i) := I = I_n…I_2 I_1, and the label of the r-th node (r ≠ i) in the Gray encoding: G(r) := R = R_n…R_2 R_1.
Output: Label of the parent node of node I in the embedded tree rooted at R.

Procedure Parent(I, R)
  If (G⁻¹(I) < G⁻¹(R)) {
    // Flip the least significant bit where I and R differ.
    Parent := I_n I_{n−1} … I_{k+1} (1 − I_k) I_{k−1} … I_2 I_1  with  k = min { j : I_j ≠ R_j }
  } Else {  // G⁻¹(I) > G⁻¹(R)
    // Flip the most significant bit where I and R differ.
    Parent := I_n I_{n−1} … I_{k+1} (1 − I_k) I_{k−1} … I_2 I_1  with  k = max { j : I_j ≠ R_j }
  }
End

Figure 3: Tree Embedding Algorithm [13]. (G⁻¹(·) is the inverse function of G(·), which assigns its index to a bit label, that is, G⁻¹(G(k)) = k.)
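The procedure of Figure 3 translates almost line for line into code: depending on which of G⁻¹(I) and G⁻¹(R) is smaller, flip the least or the most significant bit in which I and R differ. The sketch below is our rendering of the algorithm (reusing the GrayCode helper from the previous sketch, and 0-based bit positions instead of the paper's 1-based I_n…I_1 notation).

// Tree-embedding algorithm of Figure 3: parent of node I in the spanning
// tree rooted at R, for Gray-coded labels I != R.
class TreeEmbedding {
    static int parent(int labelI, int labelR) {
        int diff = labelI ^ labelR; // bits where I and R differ
        int k;
        if (GrayCode.fromGray(labelI) < GrayCode.fromGray(labelR)) {
            k = Integer.numberOfTrailingZeros(diff);     // least significant differing bit
        } else {                                         // G^{-1}(I) > G^{-1}(R)
            k = 31 - Integer.numberOfLeadingZeros(diff); // most significant differing bit
        }
        return labelI ^ (1 << k); // flip that single bit
    }
}

For example, parent(0b001, 0b111) = 0b011 and parent(0b011, 0b111) = 0b111, matching the tree of Figure 4.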
In [13] we performed an analytical comparison of the acknowledgment trees generated by the algorithm in Figure 3 and the acknowledgment trees generated by a shared-tree approach (see Section 2). For both the hypercube and the shared tree, we assumed that spanning trees rooted at the sender are used for aggregation of control
information. For the analysis, we made the simplifying assumptions that (a) group communication is symmetric, that is, on average each member of the group generates the same amount of control information, (b) the physical network topology is not considered, (c) the hypercube is complete, that is, N = 2^n, and (d) the number of nodes in the hypercube is constant. Under these assumptions, the hypercube was shown to have better load-balancing properties than a shared tree. (Due to space considerations, the results of the analysis are not included in this manuscript; we refer the interested reader to [13].) The analytical results have encouraged us to pursue the design and implementation of a protocol which maintains a hypercube control topology.
111
101
010
111
101
110
011
011
010 000
001
001
000
(a) Embedded in hypercube
(b) Resulting tree
Figure 4: Embedded Tree with 111 as Root.
4 The HyperCast Protocol
The goal of the HyperCast protocol is to maintain the members of a multicast group as the nodes of a logical hypercube structure, so that services such as reliable multicast can be implemented on top of the logical structure. We want to emphasize that the HyperCast protocol is not concerned with the transmission of data, nor does HyperCast provide any application-level services. The HyperCast protocol provides mechanisms which allow new nodes to enter the hypercube, and it has procedures for repairing the hypercube in case of one or multiple failures. The key to the scalability of the logical hypercube to very large group sizes is that every node is aware of only a few other nodes in the hypercube. No entity in the multicast group has complete state information.
4.1 Overview
The HyperCast protocol presented here takes advantage of the IP multicast service. A multicast group member, henceforth simply called a node, that wishes to participate in the hypercube structure joins a single IP multicast group address, referred to as the
control channel. Every node can both send and receive messages on this control channel. Obviously, scalability requirements demand that the traffic on this channel be kept minimal. We will see that only a few stations transmit on the control channel at a time. (The protocol can be revised so that only a small subset of the nodes listens on the control channel at any given point in time; currently, however, we assume that every member of the group is listening to the control channel.) Nodes in the hypercube have a physical and a logical address. The physical address consists of the IP address of the host on which a node resides and the UDP port used by the node for HyperCast unicast messages. Each node has a unique physical address. The logical address of a node is a bit string label which uniquely indicates the position of the node in the hypercube (as discussed in Section 3). In the HyperCast protocol, logical addresses are represented as 32-bit integers, with one bit reserved to designate an invalid logical address. Therefore, the protocol allows for hypercubes of up to 2^31 (approximately two billion) nodes. The task of the HyperCast protocol is to keep the hypercube in a stable state, which is defined by the following three criteria:
- Consistent: No two nodes share the same logical address.
- Compact: In a multicast group with N nodes, the nodes have bit string labels equal to G(0) through G(N − 1).
- Connected: Every node knows the physical address of each of its neighbors in the hypercube.
Nodes joining the hypercube, nodes leaving the hypercube, and network faults can cause a hypercube to violate one or more of the above conditions, leading to an unstable state. The task of the HyperCast protocol is to continuously return the hypercube to a stable state in an efficient manner.
4.2 Basic Data Structures
The neighbors of a node in a hypercube are those nodes with logical addresses that differ from the logical address of the node in exactly one bit location. In an m-dimensional hypercube, every node has at most m neighbors. In the HyperCast protocol, every node maintains a table with the logical addresses of all its neighbors, the so-called neighborhood table. The fields of an entry in the neighborhood table are:
- the neighbor's logical address,
- the neighbor's physical address, if it is known, and
- the time elapsed since the node last received a message from the neighbor.
Given any node, its successor in the Gray ordering is defined to be its ancestor. In a stable hypercube, every node except the one with the largest logical address has one ancestor. A node without an ancestor is defined to be a Hypercube Root (HRoot). In the HyperCast protocol, every node keeps track of the currently highest logical address in the hypercube according to the Gray ordering, and assumes that this node is
the HRoot. (This assumption may be incorrect in certain situations.) The highest known logical address is used by a node to determine which of its neighbors should be present in its neighborhood table. If, based on the highest address, a node determines that a neighbor should be present in the node's neighborhood table but is not, the node is said to have an incomplete neighborhood. Each node keeps the following information on the node with the highest logical address: the logical address, the physical address, and the time elapsed since it last received a message from this node, together with the last sequence number received from this node. (The node with the highest logical address attaches sequence numbers to the multicast messages it sends, as discussed in Subsection 4.4. Nodes store this sequence number so that they can determine whether they have received recent or outdated information.) In an unstable hypercube, multiple nodes may consider themselves to be an HRoot. Also, different nodes in the hypercube may have different assumptions about who the HRoot is. However, in a stable hypercube there is exactly one HRoot.
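A minimal sketch of the per-neighbor state described in this subsection. The field and method names are our own; the actual data layout of the Java implementation is documented in [14].

import java.net.InetSocketAddress;

// Illustrative entry of the neighborhood table. Staleness is judged against
// t_timeout; a stale neighbor is finally removed after a further t_missing.
class NeighborEntry {
    int logicalAddress;                // Gray-coded position in the hypercube
    InetSocketAddress physicalAddress; // IP address + UDP port, null if unknown
    long lastHeardMs;                  // time the last message from this neighbor arrived

    boolean isStale(long nowMs, long timeoutMs) {
        return nowMs - lastHeardMs > timeoutMs;
    }

    boolean isFailed(long nowMs, long timeoutMs, long missingMs) {
        return nowMs - lastHeardMs > timeoutMs + missingMs;
    }
}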
4.3 HyperCast Timers and Periodic Operations
Four time parameters are used in the HyperCast protocol. These parameters and their uses are defined below, together with their default values:
t_heartbeat (default = 2 s): Nodes send messages to each of the neighbors in their neighborhood table periodically, every t_heartbeat seconds.
t_timeout (default = 10 s): When the time elapsed since a node last received a message from a neighbor exceeds t_timeout seconds, the neighbor's entry is said to be stale and the neighborhood table is said to be incomplete. A missing neighbor is referred to as a tear in the hypercube. The information on the HRoot also becomes stale after t_timeout.
t_missing (default = 20 s): After a neighbor entry becomes stale, a node begins multicasting on the control channel to contact the missing neighbor. If the missing neighbor fails to respond for another t_missing seconds, the node removes the entry from the neighborhood table and proceeds under the assumption that the neighbor has failed.
t_joining (default = 6 s): Nodes that are in the process of joining the hypercube send multicast messages to announce their presence to the entire group. To prevent a large number of joining nodes from saturating the control channel with multicast messages, a joining node that receives a multicast message from another joining node backs off from its attempt to join the hypercube for a period of time t_joining before retrying.
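For reference, the four parameters and their defaults as constants (a sketch; how the implementation actually stores them is not specified here):

// Default HyperCast timer values, in milliseconds.
final class HyperCastTimers {
    static final long T_HEARTBEAT = 2_000;  // ping every neighbor this often
    static final long T_TIMEOUT   = 10_000; // neighbor entry becomes stale
    static final long T_MISSING   = 20_000; // additional wait before declaring a neighbor failed
    static final long T_JOINING   = 6_000;  // back-off after another joiner's beacon

    private HyperCastTimers() {} // constants only
}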
4.4 Message Types
There are a total of four message types used by the HyperCast protocol. All messages are sent as UDP datagrams. A node transmits a message either by unicasting it to one or all of its neighbors, or by multicasting it on the control channel. We do not assume that transmissions of these messages are reliable.
Beacon Message: The beacon message is multicast on the control channel. A beacon contains the logical/physical address pair of the sender, as well as the logical address of the currently known HRoot. A node transmits a beacon message only (1) if the node considers itself to be the HRoot, (2) if the node determines that it has an incomplete neighborhood, or (3) if the node is in the process of joining the hypercube. By construction of the hypercube, there is always at least one HRoot, and, therefore, at least one node is sending beacons on the multicast channel. In a stable hypercube, there is only one HRoot, and thus only one node sends beacons on the multicast channel. Every node uses the beacon messages sent by HRoot(s) to form an estimate of the largest logical address in the hypercube. This information is sufficient for the node to determine whether it has a complete neighborhood. Each beacon message contains a sequence number, SeqNo, which is used to resolve conflicts if beacons are received from multiple nodes. The HRoot's sequence number begins at 0. Whenever the HRoot sends a beacon message, SeqNo is incremented by one. Whenever a new HRoot is chosen, the sequence number is also incremented (SeqNo of new HRoot = SeqNo of current HRoot + 1). Since each node keeps track of the current HRoot, the sequence number tracks the timeliness of the information on the HRoot. When the information at a node is inconsistent, the information tagged with the lower sequence number is ignored. The last group of nodes which send beacon messages are joining nodes, which periodically send beacons to advertise their presence to the group.

Ping Message: Every node periodically sends a ping to all of the neighbors listed in its neighborhood table. A ping informs the receiver that the sender is still present in the hypercube. A ping is a short unicast message containing the logical and physical addresses of both the sender and the receiver, as well as the logical address and sequence number of the currently known HRoot. If a node has not received a ping from a neighbor for an extended period of time (t_timeout), the node considers its neighborhood incomplete and begins sending beacons as described above. If it still has not received a ping from its neighbor after another period of time (t_missing), it assumes that the neighbor has failed and removes the missing neighbor from its neighborhood table. Ping messages are also the only mechanism for assigning a new logical address to the receiver of a ping message.

Leave Message: When a node wishes to leave the hypercube, it sends a leave message to its neighbors. Nodes receiving this leave message remove the leaving node from their neighborhood tables. Since a leave message is not reliable, a neighbor may not receive it. In this case, the neighbor will notice the absence of the departed node through missing responses to its ping messages. Even without leave messages, a former neighbor eventually realizes that a node has left the neighborhood, since no ping messages arrive from it.

Kill Message: A kill message is used to eliminate a node from the hypercube. More specifically, a kill message is used to eliminate nodes with duplicate logical addresses. A node which receives a kill message immediately sends a leave message to all its neighbors, and tries to rejoin the hypercube as a new node.
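The beacon's contents are fully enumerated above; the record below sketches them in code. The field names and the helper are ours, and no wire format is implied.

import java.net.InetSocketAddress;

// Illustrative beacon message contents (no wire format implied).
class BeaconMessage {
    int senderLogical;                // logical address of the sender
    InetSocketAddress senderPhysical; // IP address + UDP port of the sender
    int hrootLogical;                 // logical address of the currently known HRoot
    int seqNo;                        // HRoot sequence number, resolves conflicting reports

    // Conflicting HRoot information: the report with the lower SeqNo is ignored.
    static BeaconMessage fresher(BeaconMessage a, BeaconMessage b) {
        return (a.seqNo >= b.seqNo) ? a : b;
    }
}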
4.5 Protocol Mechanisms
The HyperCast protocol implements two mechanisms for maintaining a stable hypercube. Recall from Subsection 4.1 that a stable hypercube satisfies the criteria of being consistent, compact, and connected.

Duplicate Elimination (Duel): The Duplicate Elimination (Duel) mechanism enforces consistency by ensuring that duplicate logical addresses are removed from the hypercube. If a node detects that another node has the same logical address, it compares its own physical address with the physical address of the conflicting node. If the node's physical address is numerically greater than the conflicting node's physical address, it issues a kill message to the other node. Otherwise, it sends leave messages to all of its neighbors and rejoins the hypercube.

Address Minimization (Admin): The Address Minimization (Admin) mechanism is used to maintain compactness of the hypercube. On a conceptual level, the Admin mechanism has nodes attempt to assume lower logical addresses whenever opportunities arise. To see how Admin reconstitutes compactness, recall first that a hypercube which violates compactness must have a tear in the hypercube fabric (that is, some node has an incomplete neighborhood table). The Admin mechanism enforces that a node with a logical address higher than the logical address of a tear lowers its logical address to repair the tear. The Admin mechanism at a node consists of an active and a passive part. The active part is executed when a node receives a beacon message from the HRoot and realizes that it is missing, in its own neighborhood table, a neighbor whose logical address is lower than that of the HRoot. In such a situation, the node sends a ping with the missing lower logical address to the HRoot. The passive part is activated when the HRoot receives such a ping message with a destination logical address lower than its current logical address; the HRoot then sets its logical address to the value given in the ping. The Admin mechanism also governs the process of nodes joining the hypercube. Initially, the logical address of a newly joining node is marked as an invalid logical address. The invalid address is, by definition, larger than any valid address in the hypercube. Since a joining node sends beacons to announce its presence to the group, other nodes check whether they can find a “lower” (valid) logical address for the new node in the hypercube. If there is a node with an incomplete neighborhood, this node sends a ping to the new node with the address of the vacant position. The new node assumes the (lower) address given in the ping message and occupies the vacant position. If there is no tear in the hypercube, the new node is placed as a neighbor of the HRoot. More precisely, the HRoot sends a ping to the new node containing the logical address which corresponds to the successor of the HRoot in the Gray ordering. Therefore, a node which joins a stable hypercube becomes the new HRoot. The Duel and Admin mechanisms, respectively, enforce consistency and compactness of a hypercube. The last criterion for a stable hypercube, connectedness, is maintained by the following process: whenever a node A receives a message from another node B with a logical address that designates it as a neighbor in the hypercube, the
logical/physical address pair of node B is added to node A's neighborhood table. If a neighbor does not send pings for an extended period of time, it is assumed that the neighbor has dropped out of the hypercube, and its entry in the neighborhood table is removed. Actions taken by the Admin mechanism then repair the tear in the neighborhood table.
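The Duel decision is a pure function of the two physical addresses, so the two conflicting nodes reach opposite conclusions without further coordination. A sketch, comparing the addresses as strings (one possible total order; the protocol only requires a consistent numeric comparison):

// Illustrative resolution of a duplicate logical address (Duel mechanism):
// the node with the greater physical address kills the other, which then
// leaves and rejoins the hypercube as a new node.
class Duel {
    enum Action { SEND_KILL, LEAVE_AND_REJOIN }

    static Action resolve(String ownPhysicalAddress, String conflictingPhysicalAddress) {
        // Any fixed total order works, provided both nodes apply the same one.
        return ownPhysicalAddress.compareTo(conflictingPhysicalAddress) > 0
                ? Action.SEND_KILL
                : Action.LEAVE_AND_REJOIN;
    }
}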
4.6 States and State Transitions
In the HyperCast protocol, each node is in one of eleven different states. Based on events that occur and HyperCast control messages that are received, nodes transition between states. In Figure 5 we show the state transition diagram of the HyperCast protocol; the states are indicated as circles, and state transitions as arcs, each labeled with the condition which triggers the transition. The states of a hypercube node are described in Table 2. We refer to [14] for a more detailed description of the state transitions. With the state definitions, we can give a precise definition of a stable hypercube: a hypercube with N nodes is stable if all of its nodes have unique logical addresses, ranging from G(0) to G(N−1) (where G(·) denotes the Gray code discussed in Section 3), and all nodes are in state Stable, with the exception of the node with logical address G(N−1), which is in state HRoot/Stable.
Figure 5: Node State Transition Diagram. (Diagram not reproduced.)
Table 2: Node state definitions.

Outside: Not yet participating in the group.
Joining: Wishes to join the hypercube, but does not yet have any information about the rest of the hypercube. Its logical address is marked as invalid.
JoiningWait: A Joining node that has received a beacon from another Joining node within the last t_joining.
StartHypercube: Has determined that it is the only node in the multicast group, since it has not received any control messages for a period of time t_timeout, and starts its own stable hypercube of size one.
Stable: Knows all of its neighbors' physical addresses.
Incomplete: Does not know one or more of its neighbors' physical addresses, or a neighbor is assumed to have left the hypercube after no pings were received from that neighbor for t_timeout.
Repair: Has been Incomplete for a period of time t_missing and begins taking actions to repair its neighborhood.
HRoot/Stable: Stable node which also believes that it has the highest logical address in the hypercube.
HRoot/Incomplete: Incomplete node which believes that it has the highest logical address in the entire hypercube.
HRoot/Repair: Repair node which also believes that it has the highest logical address in the hypercube.
Leaving: Node that wishes to leave the hypercube.
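The eleven states of Table 2 map naturally onto an enumeration; a sketch (the constant names are ours):

// The eleven node states of the HyperCast protocol (Table 2).
enum NodeState {
    OUTSIDE,          // not yet participating in the group
    JOINING,          // wants to join; no information about the hypercube yet
    JOINING_WAIT,     // backing off after a beacon from another joining node
    START_HYPERCUBE,  // alone in the group; starts a stable hypercube of size one
    STABLE,           // knows all of its neighbors' physical addresses
    INCOMPLETE,       // a neighbor's physical address is unknown or the neighbor is gone
    REPAIR,           // Incomplete for t_missing; actively repairing the neighborhood
    HROOT_STABLE,     // Stable and believes it holds the highest logical address
    HROOT_INCOMPLETE, // Incomplete and believes it is the HRoot
    HROOT_REPAIR,     // Repair and believes it is the HRoot
    LEAVING           // wishes to leave the hypercube
}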
4.7 Example
We next illustrate the operation of the protocol with a simple example. In the example, we use a small number of nodes and assume that there are no packet losses. Figure 6 shows a hypercube with five nodes, represented as circles. We use arrows to represent unicast messages; circles around a node indicate a multicast message. In Figure 6-a, we show a stable hypercube. Here, the HRoot, node 110, periodically multicasts beacons. The beacon is received by all nodes and keeps all nodes informed of the logical address of the HRoot. Therefore, the nodes know which of their neighbors should be present in their neighborhood tables. Every node periodically sends ping messages to the neighbors in its neighborhood table (Figure 6-b).
Figure 6: Stable hypercube. (a) The HRoot multicasts a beacon; (b) nodes exchange pings with their neighbors. (Diagram not reproduced.)
Figure 7: Joining node. (Diagram not reproduced.)

In Figure 7-a, we show a node in state Joining, labeled “New”, that wants to join the hypercube. The node periodically sends beacon messages, thus making its presence known to the group. The HRoot places the Joining node as its neighbor at the next successive position in the hypercube according to the Gray ordering, and pings the new node with the new logical address (111) (Figure 7-b). The new node takes on the
new logical address and replies with a ping back to the original HRoot (Figure 7-c). The new node determines from the ping packet that it is now the HRoot, since its own logical address is the highest known logical address, and it begins sending beacons as an HRoot (Figure 7-d). When node 011 receives the beacon from the new HRoot, it realizes that 111 should be its neighbor. Thus, node 011 sends a ping message to 111 (Figure 7-e). Once node 111 receives the ping message, it responds with a ping itself (Figure 7-f). At this time, all nodes in the hypercube have complete neighborhood tables and know all their neighbors, so the hypercube is stable.
5 Verification and Implementation
We used the Spin protocol verification tool [9] to aid in the development of the HyperCast protocol. Spin checks the logical consistency of a protocol specification by searching for deadlocks, non-progress cycles, and any violation of user-specified assertions. To verify the HyperCast design in Spin, the entire HyperCast protocol specification, as well as a system for simulating multiple hypercube nodes, was encoded in the Process Meta Language (PROMELA). In addition to checking for deadlocks and non-progress cycles, Spin was used to ensure that every execution path resulted in a stable hypercube. Due to the unavoidable state-space explosion when using a tool such as Spin, we were only able to analyze hypercubes with at most 6 nodes. While verification cannot be used to prove results for large hypercube sizes, we assert that, for the purposes of verification, there is little qualitative difference between a hypercube of six nodes and a hypercube of several thousand nodes: it is unlikely that non-progress cycles and deadlocks exist in large hypercubes which do not have analogous fault modes in a 6-node hypercube. We wish to emphasize, however, that our verification with Spin is not equivalent to a complete formal verification of the protocol. The HyperCast protocol was implemented in the Java programming language; the total size of the implementation is about 5,000 lines of code. Java was chosen for its portability across platforms and its easy-to-use threading constructs [3]. The implementation was an exact port of the code written in PROMELA. Two sockets are used by each hypercube node: one for unicast packets, and one for multicast packets on the control channel.
6 HyperCast Experimental Validation
To determine the scalability properties of the HyperCast protocol, we have tested the Java implementation in a testbed environment. The testbed is the Centurion computer cluster at the University of Virginia, a cluster of workstations used primarily as a platform for distributed computing. The part of the cluster used for this experiment consists of 32 computers, each a 533 MHz DEC Alpha with 256 MB of RAM running Linux 2.0.35. The Centurion cluster machines are connected
with a 100 Mbit/s switched Ethernet network. Up to 32 logical hypercube nodes are run on a single machine. (IPC processing in the Java Virtual Machine (JVM) is the bottleneck when running multiple hypercube nodes on a single machine; the limit of 32 nodes per machine results from the restriction on the maximum number of sockets that can be handled by the JVM.) The goal of the experiments is to answer the following questions. What is the overhead of the protocol, and how does the overhead scale with increasing size of the hypercube? The overhead of the protocol consists of the (unicast and multicast) control messages: ping, beacon, kill, and leave. Of particular importance for scalability is that the volume of beacon messages be low. Note that, in the current implementation, beacon messages are sent to all nodes of the hypercube via IP multicast. How much time does the protocol require to return a hypercube to a stable state? To assert scalability, the time needed to return the hypercube to a stable state should not depend on the size of the hypercube. The time to reconstitute stability indicates how quickly the HyperCast protocol can adapt to dynamic changes in the group membership. In this paper, we present only a single experiment; we refer to [14] for additional experiments. We examine a scenario where multiple nodes want to join the hypercube simultaneously, and we measure the time until the HyperCast protocol establishes a stable hypercube. In the experiment, we vary the number N of nodes that are already present in the hypercube (N = 2^{i/2}, where i ranges from 0 to 18) and the number J of nodes which want to join the hypercube when the experiment begins (J = 2^{i/2}, where i ranges from 0 to 16). The performance measures considered here are the time needed to return the hypercube to a stable state (measured in multiples of t_heartbeat) and the number of packets (unicast and multicast) transmitted. At the start of each experiment, there is a stable hypercube with N nodes, and J nodes want to join it; all J nodes are in state Joining. An experiment is completed when the hypercube contains N + J nodes and is in a stable state. We measure the time until stability is reached, as well as the traffic transmitted over the duration of the run. Figure 8 shows, for all (N, J) pairs, the time until the hypercube stabilizes. Note that the plotted surface is nearly constant as a function of N; the plot indicates that there is little correlation between the number of nodes already present in the hypercube and the time to attain a stable hypercube. The increase in time with the number of joining nodes J indicates a linear relation between the number of joining nodes and the time needed. This behavior is expected, since the process of adding one node to the hypercube should take a constant amount of time. Figure 9 shows the average number of unicast packets sent or received by a node per t_heartbeat time units, averaged over the entire duration of the experiment. The data indicates that the unicast traffic at a node grows on a logarithmic scale. Since unicast transmissions are primarily ping messages between neighbors, this behavior is as expected. Figure 10 shows the average rate of multicast transmissions sent and received at each node during the join operation. The data indicates
that there is no strong correlation between the multicast traffic and the number of nodes present in the hypercube. There is, however, a correlation between the multicast traffic and the number of nodes joining the hypercube; this correlation is due to the beacons sent by newly joining nodes. Overall, this experiment shows that the process of adding nodes to the hypercube scales well to larger group sizes. Applications which require low latency for join operations can use a lower value of t_heartbeat, thereby reducing the time needed to add a node to the hypercube.
Figure 8: Time until the hypercube reaches a stable state, in multiples of t_heartbeat, as a function of log2(N) and log2(J). N and J are the number of nodes already in the hypercube and the number of joining nodes, respectively, at the beginning of the experiment. (Plot not reproduced.)
Figure 9: Average number of unicast packets sent and received per node and per t_heartbeat time unit over the duration of the experiment, as a function of log2(N) and log2(J). (Plot not reproduced.)
Figure 10: Average number of multicast packets sent and received per node and per t_heartbeat time unit over the duration of the experiment, as a function of log2(N) and log2(J). (Plot not reproduced.)
7 Conclusions
We have presented a novel approach to the problem of scalable multicast in packet-switched networks, in which we organize the members of a multicast group in a logical n-dimensional hypercube. By exploiting the symmetry properties of a hypercube, operations that require an exchange of feedback information between multicast group members can be efficiently implemented. In this paper, we presented the design, specification, verification, and evaluation of the HyperCast protocol, which maintains the members of a dynamically changing multicast group in a logical hypercube topology. The implementation has been tested for group sizes of up to 1024 nodes; the data indicates that larger group sizes may be reached. The HyperCast protocol organizes nodes into a hypercube but, at present, does not support any applications. In future work, we will build protocol mechanisms which use the symmetric hypercube topology to support applications.
References
[1] M. Ammar and L. Wu. Improving the Performance of Point to Multi-Point ARQ Protocols through Destination Set Splitting. In: Proc. IEEE Infocom '92, Pages 262-271, May 1992.
[2] J. Bolot. End-to-End Packet Delay and Loss Behavior in the Internet. In: Proc. ACM Sigcomm '93, 23(4):289-298, September 1993.
[3] M. Campione and K. Walrath. The Java Tutorial: Object-Oriented Programming for the Internet (Java Series). Addison-Wesley Publishing, March 1998.
[4] J.M. Chang and N.F. Maxemchuk. Reliable Broadcast Protocols. ACM Transactions on Computing Systems, 2(3):251-273, August 1984.
[5] J. Crowcroft and K. Paliwoda. A Multicast Transport Protocol. In: Proc. ACM Sigcomm '88, Pages 247-256, August 1988.
[6] C. Diot, W. Dabbous, and J. Crowcroft. Multipoint Communications: A Survey of Protocols, Functions, and Mechanisms. IEEE Journal on Selected Areas in Communications, 15(3):277-290, April 1997.
[7] S. Floyd, V. Jacobson, C. Liu, S. McCanne, and L. Zhang. A Reliable Multicast Framework for Light-weight Sessions and Application Level Framing. IEEE/ACM Transactions on Networking, 5(6):784-803, December 1997.
[8] H.W. Holbrook, S.K. Singhal, and D.R. Cheriton. Log-based Receiver-Reliable Multicast for Distributed Interactive Simulation. In: Proc. ACM Sigcomm '95, Pages 328-341, August 1995.
[9] G.J. Holzmann. The Model Checker SPIN. IEEE Transactions on Software Engineering, 23(5):279-295, May 1997.
[10] M. Kadansky, D. Chiu, and J. Wesley. Tree-Based Reliable Multicast (TRAM). Internet Draft, Internet Engineering Task Force, November 1998.
[11] B.N. Levine, D.B. Lavo, and J.J. Garcia-Luna-Aceves. The Case for Reliable Concurrent Multicasting Using Shared Ack Trees. In: Proc. ACM Multimedia '96, Pages 18-22, November 1996.
[12] B.N. Levine and R. Rom. Supporting Reliable Concast with ATM Networks. Technical Report SDS-96-0517, Sun Research Labs, January 1997.
[13] J. Liebeherr and B.S. Sethi. A Scalable Control Topology for Multicast Communications. In: Proc. IEEE Infocom '98, Pages 1197-1204, March 1998.
[14] J. Liebeherr and T.K. Beam. HyperCast Protocol: Design and Evaluation. Technical Report CS-99-26, University of Virginia, September 1999.
[15] C.K. Miller. Multicast Networking and Applications. Addison-Wesley, 1998.
[16] C. Papadopoulos, G. Parulkar, and G. Varghese. An Error Control Scheme for Large-Scale Multicast Applications. In: Proc. IEEE Infocom '98, Pages 1188-1197, March 1998.
[17] S. Paul, K.K. Sabnani, J.C.-H. Lin, and S. Bhattacharyya. Reliable Multicast Transport Protocol (RMTP). IEEE Journal on Selected Areas in Communications, 15(3):407-421, April 1997.
[18] S. Paul. Multicasting on the Internet and Its Applications. Kluwer Academic Publishers, 1998.
[19] M. Pullen, M. Myjak, and C. Bouwens. Limitations of Internet Protocol Suite for Distributed Simulation in the Large Multicast Environment. IETF Internet Draft, March 1997.
[20] M.J. Quinn. Parallel Computing: Theory and Practice. McGraw-Hill, New York, 2nd edition, 1994.
[21] H. Schulzrinne, S. Casner, R. Frederick, and V. Jacobson. RTP: A Transport Protocol for Real-Time Applications. Request for Comments RFC 1889, Internet Engineering Task Force, January 1996.
[22] R. Yavatkar, J. Griffioen, and M. Sudan. A Reliable Dissemination Protocol for Interactive Collaborative Applications. In: Proc. ACM Multimedia '95, Pages 333-343, November 1995.
[23] B. Whetten, T. Montgomery, and S. Kaplan. A High Performance Totally Ordered Multicast Protocol. Lecture Notes in Computer Science, Vol. 938, Theory and Practice in Distributed Systems (K.P. Birman, F. Mattern, A. Schiper, Eds.), Pages 33-57, 1995.
[24] B. Whetten, M. Basavaiah, S. Paul, T. Montgomery, N. Rastogi, J. Conlan, and T. Yeh. The RMTP-II Protocol. Internet Draft, Internet Engineering Task Force, September 1998.
Support for Reliable Sessions with a Large Number of Members

Roger Kermode (Motorola Australian Research Centre, Botany, NSW 2019, Australia; [email protected]) and David Thaler (Microsoft, Redmond, WA 98052, U.S.A.; [email protected])
Abstract. The ability to localize traffic when performing distributed searches within a group of nodes that form a session is a key factor in determining how large the group can scale. In this paper we describe an algorithm, based on the concept of scoping, that we believe significantly enhances the ability to localize traffic for the service-discovery aspect of many protocols, and hence their ability to scale. The algorithm is based upon the notion of a hierarchy of administrative multicast scopes where smaller scopes nest inside larger ones. To exploit this topological structure, we describe an application-layer protocol, the Scoped Address Discovery Protocol (SADP), which provides session members with the ability to discover, for each session, which addresses to use within each of the various scopes within a hierarchy. We show via simulation that SADP affords this ability in a manner that scales, by merging the well-known distribution mechanisms of announce/listen and query/response and by exploiting the nested hierarchy of scopes itself.
1 Introduction
Since the Internet Multicast Backbone (MBone) was first unveiled in 1992, numerous attempts have been made to realize multicast's promise of efficient group communication. While these attempts have usually remained small in scope, it can be argued that, for the most part, the current set of solutions for ad-hoc session management and unreliable data transport for a few sessions with a small to medium number of session members have performed reasonably well within the research environment. The recent explosion of Web services and streaming media, along with the need to support significantly increased numbers of sessions, receivers, and senders in a reliable, scalable manner, may soon render this assessment invalid for the commercial environment. At first glance the problem that must be solved appears to be: "How does one deliver data reliably to a large number of globally distributed destinations in a manner that scales with the number of receivers, the number of sessions, and the number of senders?"
Several services must be provided to the application for this problem to be solved: 1) Address Allocation, 2) Session Announcement, and 3) Reliable Multicast Transport. Currently, the provision of address-allocation and session-announcement services is handled by the combination of the Session Announcement Protocol (SAP) [1] and the Session Description Protocol (SDP) [2]. Reliable multicast transport has been, and continues to be, the subject of many research efforts. All of these efforts have assumed flat (non-hierarchical) multicast routing, and strive to reduce the volume of traffic sent to repair packet losses by using techniques including NAKs with suppression [3], tree-based aggregation [4,5], router assist [6,7], and Forward Error Correction [8,9]. Ultimately, all of the above solutions are limited in their ability to scale by the lack of hierarchy within the network, with many attempting to provide hierarchy inside the service protocol definition. For example, SAP allows session announcements to be scoped using the methods described in Section 2 (TTL or administrative), while many of the reliable multicast protocols mentioned earlier use multiple multicast groups or aggregation points to constrain traffic to some homogeneous region of network connectivity. Since all of the above are attempts at traffic localization, the problem stated earlier may be restated as: "How can one provide the necessary mechanisms for traffic localization so that one can deliver data reliably to a large number of globally-distributed destinations in a manner that scales with the number of receivers, the number of sessions, and the number of senders?" In this paper we describe a refinement to the concept of administrative scoping that we believe significantly enhances the ability to localize traffic for the service-discovery aspect of many protocols. The refinement is based upon the notion of exposing a hierarchy of multicast scopes where smaller scopes "nest" within larger scopes. We also describe a new application-layer protocol, the Scoped Address Discovery Protocol (SADP), which provides clients with the ability to discover, for a given session, which addresses to use within the various scopes in a hierarchy. During the course of the description, we show how SADP works, how it realizes a nested scoping search service, and how it provides this service by using a merger of the two common distributed search approaches of announce/listen and request/response.
2 Multicast Scoping

2.1 TTL Scoping
One method of scoping which has been in use for some time on the Internet Multicast Backbone (MBone) is that of TTL-based scoping. The header of every IP packet contains a “Time-to-Live” (TTL) field which is decremented by each router as the packet is forwarded. Normally, the packet is dropped if the TTL reaches zero. Hence limiting the TTL value of a multicast packet limits its scope to a region with a hop-count radius specified by the initial TTL value.
It was common practice in the early MBone days to extend this facility by using non-zero thresholds in routers as well [10]. A router with a threshold of, say, 128 would drop all packets of TTL 128 or less. In this way, packets for one session could be sent with a higher TTL, to achieve global distribution, while packets for another session sent with a lower TTL could be limited to a site or regional scope created using TTL thresholds configured in routers. Another common use of TTL scoping is to perform "expanding-ring" searches, in which a node in search of another repeatedly increases the TTL of its request until a node that can reply is contacted. The replying node would then transmit a reply with a TTL greater than or equal to that used by the original node. There are, however, numerous problems with TTL scoping [11], most notably the fact that routers cannot prune off traffic that is dropped due to TTL scoping, since subsequent packets may arrive with a higher TTL. These problems result in a significant waste of bandwidth, which motivated the need for another method of scoping.
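As a concrete illustration of the expanding-ring search just described, the following minimal Python sketch (not part of the original text) doubles the TTL after each timeout; the multicast group, port, timeout, and doubling policy are all assumptions made here for illustration.

import socket
import struct

GROUP, PORT = "239.255.0.1", 9999   # hypothetical search group and port

def expanding_ring_search(payload: bytes, max_ttl: int = 128,
                          timeout: float = 2.0):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    ttl = 1
    while ttl <= max_ttl:
        # Limit the request's reach to a hop-count radius of `ttl`.
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL,
                        struct.pack("B", ttl))
        sock.sendto(payload, (GROUP, PORT))
        try:
            reply, responder = sock.recvfrom(4096)
            return reply, responder          # a replier was reached
        except socket.timeout:
            ttl *= 2                         # widen the ring and retry
    return None                              # no replier within max_ttl hops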
2.2 Administrative Scoping
Administrative Scoping [11] takes a different approach to the task of limiting the propagation of multicast traffic. Instead of using soft, sender-centric boundaries based on hops from the source, administrative scoping uses hard, router-centric boundaries that are pre-set by the network administrators. Such a boundary is defined by an address range: a boundary router will not forward data or control traffic for groups within that address range. This shift of responsibility from the application or sender to the network has several effects:
– Administrative scopes are topologically explicit: Administrative scopes have a boundary that is the same for all nodes within them, unlike TTL scopes, whose boundaries are relative to the host from which a packet originates.
– Administrative scopes must be configured: As with non-zero TTL thresholds, explicit decisions must be made as to the set of routers that act as boundaries for a given scope.
– Administrative scopes are long-lived: The effort to create and maintain an administrative scope means that its boundaries cannot change rapidly.
– Administrative scopes provide greater control than TTL scopes: The fact that administrative scopes are configured, topologically explicit, and long-lived means that network administrators can have a high degree of confidence that traffic sent to an address within an administrative scope will stay within the scope. Furthermore, administrative scope boundaries can be defined on a per-router basis for optimal locality, and therefore need not be topologically circular about some central point within a network.
For the reasons described above, administrative scoping is the best current practice for scoping on the MBone.
Fig. 1. Example of a nested scoping hierarchy: scopes d and e nest within scope b, scopes f and g nest within scope c, and scopes b and c nest within the outermost scope a.
2.3 Nested Administrative Scoping
We now introduce the concept of nested administrative scoping, which was born in large part out of the realization that TTL scoping fails to provide adequate localization. The other motivating factor was the observation that many current applications attempt to create their own hierarchies, and that a network-centric mechanism for creating hierarchy could relieve them of this responsibility. Finally, the introduction of a network-centric hierarchy would allow network administrators to provide this service in a controlled manner for many applications at the same time. Nested administrative scoping extends the basic model of administrative scoping by explicitly exposing the topological relationship between scopes. The nesting property allows scopes of differing sizes to be arranged in hierarchies similar to that shown in Figure 1. Such hierarchies afford localization through the ability to send data to "subgroups" of members in one's general vicinity. Hierarchies also provide the ability to perform "expanding-zone" searches in lieu of an expanding-ring search. The expanding-zone search is in many regards similar to the expanding-TTL search: a querying node makes requests that progressively incorporate larger and larger regions of the network (and hence more and more session members) until such time that a session member is found that can reply to the request. The difference lies in the fact that the region covered by each request is defined by an administrative scope boundary and not by a fixed number of hops from the querying node. This means that by judiciously choosing the appropriate scopes, one can exercise much tighter control on which regions of the network are queried during a search. Consider the following hypothetical example where a population of nodes is distributed amongst the scopes in Figure 1. Let there be members D, E, F, and G in scopes d, e, f, and g, respectively. Node D would initially send out a
request to scope d and, upon discovering no response, would repeat the request at scope b. At this point node E would hear the request and send a reply. Hence, D is able to receive an answer without having to send a request to the entire session, which would be seen by F and G. Algorithmically, the basic search process can be defined as follows (a sketch of this loop appears at the end of this section):
1. Starting with the smallest known scope, a node issues a request within that scope and waits for a reply.
2. If another node hears a request at a certain scope that it can satisfy, it sends a response at that same scope, possibly after some random delay to reduce duplicate responses.
3. Nodes that receive a response to a particular request while waiting to send a response to that request suppress their own response. (In contrast, TTL scoping in general cannot achieve as much suppression, since each response might be seen by different sets of nodes.)
4. If a requestor issues a request to a scope and does not hear a response after a specified amount of time, it may retransmit its request at the same scope a small number of additional times. Should these retries fail to elicit a response, the requestor increases the scope to the next largest scope and tries again.
5. Requestors increase the scope of the request according to step 4 until either a response is received or the entire scope of the session itself is reached. Should attempts to elicit a response at this largest scope for the session fail to yield a response, the requestor may conclude that the request cannot be met.
In order to realize searches of this kind, several services must be made available to the members of a session:
– First, a mechanism must be provided that can determine which scopes are present (including the address ranges associated with them, for reasons discussed later), and the nesting relationships between them. This can be done by manual configuration, or via the Multicast-scope Zone Announcement Protocol (MZAP) [12].
– Second, session members will need the ability to allocate multicast addresses within these scopes. This service could be provided by the Multicast Address Dynamic Client Allocation Protocol (MADCAP) [13]. The sdr tool in use today also provides this service to humans, but cannot easily be invoked by applications to allocate addresses.
– Finally, session members need to know when a subgroup address for a particular session has already been allocated within a given scope before deciding to allocate a new subgroup address for the session in that scope. This final service would be provided by our Scoped Address Discovery Protocol (SADP).
The remainder of this paper focuses on the SADP protocol, specifically the design principles involved, how it extends the concepts behind existing protocols, and how these extensions improve performance.
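A minimal Python sketch of the five-step search above; this loop is our abstraction of the algorithm, with send_request and wait_for_response standing in for the actual multicast I/O.

def expanding_zone_search(scopes, send_request, wait_for_response,
                          attempts_per_scope=2):
    """`scopes` is ordered from the smallest scope to the session's scope."""
    for scope in scopes:                      # step 1: start at the smallest scope
        for _ in range(attempts_per_scope):   # step 4: retry at the same scope
            send_request(scope)
            response = wait_for_response(scope)
            if response is not None:
                return response               # steps 2-3: answered in this scope
        # steps 4-5: no answer after retries; expand to the next larger scope
    return None                               # step 5: the request cannot be met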
3 Related Work
SADP has but one purpose: to allow hosts to take a session identifier (namely, the primary address used for the session) and to determine in a scalable fashion what address has been allocated (if any) for use by the session at each of the smaller scopes within which that host resides. The manner by which it affords this functionality is simple: try to find the nearest host that can help, learn as much pertinent scope-address information as possible from this node, and then provide this information to the application. To that end, SADP's design draws much on that of similar protocols, specifically ARP, DNS, and SAP. It is therefore useful to briefly examine the problem space each of these protocols attempts to address.
3.1 The Address Resolution Protocol
The Address Resolution Protocol (ARP) [14,15] is used to map a (wide-area) network-layer address to a (local-area) link-layer address. (ARP is used only for unicast addresses, since the mapping for multicast addresses is static.) ARP works as follows. To acquire the local/lower-layer (MAC) address corresponding to a given unicast IP address, a host broadcasts a request within the local area containing the IP address for which to find the local address. The machine owning the address of interest sends a reply to the sender giving the mapping, while other machines ignore the request. The reply is then cached for future use so that subsequent packets use the cached mapping. Since an ARP request is confined to a single link, any mapping received in a response is guaranteed to be significant to the requester; that is, it will never contain a MAC address on some other link which cannot be used by the requestor. In addition, ARP's use of broadcast messages means that it provides fast resolution at the expense of poor scalability when the number of requests increases. Finally, ARP requests will fail if the owner of the address is down (although this is not problematic, since knowing the MAC address of an unreachable host is not particularly useful).
3.2 The Domain Name Service
The Domain Name Service (DNS) [16] is used to map an application-layer address (i.e., a name) to a network-layer address. DNS works as follows. The application-layer namespace is organized into a hierarchy. Each node in the namespace hierarchy is assigned to one or more authoritative servers which store the mapping information. To resolve a name to an address, a host sends a request to a server, which either relays or redirects the request to other servers up or down the namespace hierarchy until a server is found which knows the mapping. Answers are cached so that future requests may be answered immediately.
The problem with DNS is that it does not provide any guarantee that mappings obtained are significant to the requestor. For example, network addresses in the range 10.x.x.x are reserved for private reuse [17], and as such can be considered to be "scoped" unicast addresses. If a host has the address 10.0.0.1 stored in DNS, then a remote requester may obtain this mapping, and when attempting to use it, may end up reaching some other (local) host with the same address! Since DNS uses a hierarchy, it exhibits good scalability as the number of requests increases, and provides reasonably good response time. However, it is primarily intended for situations where the name and the address both have global significance.
3.3 The Session Announcement Protocol
The Session Announcement Protocol (SAP) [1,18] is used to announce information related to multicast sessions. Among other attributes, SAP carries an application-layer address (i.e., a name) and one or more network-layer addresses, and hence announces mapping information somewhat analogous to that found in DNS for unicast addresses. SAP works as follows. Periodically, a system “owning” session information multicasts out the mapping information to the same scope within which the addresses are significant. Other systems can then cache the information for later lookup on demand. Since SAP advertisements are multicast to the scope within which the mapping is significant, all receivers are again guaranteed that all mappings in the cache are usable. Scalability of bandwidth is achieved by extending the inter-announcement period so that the overall bandwidth remains constant as the number of addresses advertised increases. In this manner, SAP provides scalability of bandwidth at the expense of fast resolution, and state in the listeners (which must cache all mappings of potential interest in the future).
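The pacing rule just described can be made concrete with a small sketch; the 500-byte announcement size and the 200 bit/s bandwidth budget below are illustrative assumptions, not values taken from SAP or from this paper.

def announcement_interval(num_sessions: int,
                          bytes_per_announcement: int = 500,
                          bandwidth_budget_bps: float = 200.0) -> float:
    """Seconds between two announcements of the same session, chosen so
    that the aggregate announcement rate stays within the budget."""
    bits_per_cycle = num_sessions * bytes_per_announcement * 8
    return bits_per_cycle / bandwidth_budget_bps

print(announcement_interval(10))   # 200.0 s between re-announcements
print(announcement_interval(20))   # 400.0 s: twice the sessions, twice the period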
4 Scoped Address Discovery Protocol (SADP)
SADP aims to meet the dual requirements of timely response and scalability. To achieve these two goals, it uses a hybrid of multicast request/response and announce/listen mechanisms, so that session members can learn of scoped addresses without causing packet storms. The reasons for using multicast exchanges, as opposed to unicast ones, lie in the fact that multicast supports a fully distributed mode of operation. Were unicast exchanges to be used, clients would require additional configuration or functionality to locate a server. Additional mechanisms would also be required to disseminate mapping information between servers, as well as mechanisms to handle failover in the event that one or more of these servers fails. Hence, we believe that a multicast-based mechanism is simpler.
4.1 SADP Basic Operation
The SADP protocol merges the request/response mechanisms found in ARP and DNS with the announce/listen mechanism found in SAP. It then extends them by exploiting the nesting of administrative scopes. The reasons for adopting this hybrid approach stem from two fundamental design goals: first, the solution must afford address discovery in a timely fashion, and second, the solution must scale. We take the hybrid approach since it is well known that request/response protocols offer timely response but not scalability, while announce/listen protocols afford scalability at the expense of response time. Fortunately, the process of merging the request/response and announce/listen mechanisms has been solved within the context of peer-based recovery mechanisms for reliable multicast. An example of such a mechanism is Scalable Reliable Multicast (SRM) [3], which empowers session members within a multicast session to repair each other's losses. When a packet is detected missing in an SRM session, all the members that missed the packet wait a random amount of time before sending off a NACK. Should a member receive a NACK from another member, it suppresses its own NACK and eavesdrops on the subsequent response. This minimizes the number of duplicate NACKs. Responding members similarly delay sending off the repair to minimize the number of duplicate repairs transmitted. If we now substitute SADP requests for NACKs and SADP responses for repairs, the resulting algorithm is a distributed search in which the session members can listen in on the responses to others' requests in order to learn about the global state of the session. Studies (e.g., [19,20,21]) have shown that search algorithms that rely on this kind of delay-based suppression mechanism work best for small numbers of members concentrated within small areas of a network. The same studies also show that these algorithms fail badly for large numbers of members distributed over a wide area. In these scenarios, isolated losses in the distribution tree cause the suppression mechanism to fail and duplicate responses to be sent. The fact that search algorithms based on a delay-based suppression mechanism may still allow a member to receive responses from more than one other member causes a potential problem. In cases where there are no network partitions, these replies should be identical for a given scope, since the responding members should have an identical view of the world. This may not be the case, however, when a new member joins a session which spans a newly healed partition: the new member may receive multiple responses for a given scope. These different responses will correspond to the trees that existed on either side of the partition. Since these trees are separate and both belong to the same hierarchy, the new member may join either one and still safely participate in the session. In effect, the partition introduces another intermediate level into the hierarchy that does not break the hierarchy but simply reduces the region covered by the partitioned scope. Alternatively, conflicting subgroup addresses can be resolved by using the one with the lower address, and freeing the higher one.
With this basic mechanism in place, one must now examine the needs of session members for scoped address information, and how the existence of the administrative scopes within which these addresses are allocated can be used to assist in their discovery. The first thing to note is that nested administrative scopes can be used by SADP to perform an "expanding-zone" search for a suitable node that can send a response. The address within each scope to be used for this search would need to be known in advance by all potential session members. To achieve this, SADP uses a well-known "scope-relative" address in every scope to send SADP messages about all sessions. Scope-relative addresses are computed from the address range in each scope by applying a constant negative offset to the end of the range, as sketched below. Hence any SADP speaker which learns the scope ranges can compute the SADP scope-relative address in each scope. This address is used to communicate information regarding all sessions encompassing the given scope. To differentiate between searches for different sessions, one would use the address at the largest scope to identify both the session as well as the largest scope to which the search should expand. New session members would learn the administratively scoped addresses specific to their sessions by first searching within the local (smallest) scope. Were an existing session member found in this scope, the request/response exchange would take place at this scope, with existing members informing new members of which addresses to use. Since the set of scopes, and hence addresses, for an existing session member would be identical to that for a new member, the new member would then stop the search after receiving the first response. Should the new member's request at the local scope fail, it would reattempt the request at the same scope a small number of times before expanding the search to a larger scope. In this case, the existing session member that eventually responds would not be in the same scope as the sender for scopes smaller than that to which the request was sent. Therefore, the existing member would not send information for these scopes, and the new member could infer that it was the only member of that session at those scopes. In these cases it could, if so desired, allocate additional addresses at these smaller scopes to achieve greater localization for new session members that join at these scopes at some time in the future.
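A minimal sketch of the scope-relative address computation: the constant offset used here (3) is a made-up value for illustration; in practice such offsets are assigned, and each SADP speaker would apply the assigned offset to every scope range it learns.

import ipaddress

SADP_OFFSET = 3   # hypothetical assigned offset for SADP

def scope_relative_address(scope_range: str, offset: int = SADP_OFFSET):
    """Apply a constant negative offset to the end of the scope's range."""
    net = ipaddress.ip_network(scope_range)
    return net.broadcast_address - offset   # last address in the range, minus offset

# For example, for a scope with the range 239.255.0.0/16:
print(scope_relative_address("239.255.0.0/16"))   # 239.255.255.252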
4.2 Summary of Algorithm
The general algorithm that new members of a session use to determine which scopes and addresses are involved in the hierarchy for a particular session can be summarized as follows:
1. Determine the multicast address used for the largest scope, and use it as a Session Identifier (SID). This task is done by the session announcement service outside of SADP, such as via SAP as described earlier.
2. Multicast a SADP Request message, containing the SID, on the well-known SADP group in the local (smallest) scope.
3. Potential repliers that receive a multicast SADP Request message start a random timer with an expiration set to a random time T = Tmax · log256(256 · X + 1), where X is chosen over the uniform random interval [0, 1) and Tmax is the maximum delay [18]. This mechanism ensures that close to one replier will respond, with the remaining repliers suppressing their responses; this timer is sketched below.
4. The requester waits for a (multicast) response for s (e.g., 2.5) seconds. This time may be a configurable parameter, but should be larger than Tmax plus the expected round-trip time. If no response is heard, then repeat the request at the same scope.
5. If, after a total of k (e.g., two) attempts at a given scope, no response has been received, increase the scope to the next largest scope and repeat, starting from step 2. Also, allocate an address for future use in the scope for which no response was received.
6. Continue until either a response has been heard or the scope of the session itself is reached. No requests are sent at the session's scope, since the address is already known.

The basic mechanism described above affords significant localization, but only when new session members join at places in the network where session members already exist. If no session member exists locally, then the new session member's search will expand until one is found at a larger scope. The fact that sparsely-populated, far-flung sessions are not uncommon means that a significant number of searches may expand to the largest scope, a characteristic that would drastically inhibit scalability.

To see how this occurs, consider the example shown in Figure 2. Here, a user in Sydney has created a global session, allocated a global address (the SID), and advertised this session via some means to other users. Two users, located in sites in Seattle and Berkeley respectively, decide to join the session, in which the only current participants are located in Australia. The Seattle user joins first, and his host begins the process of attempting to learn which subgroup addresses to use. First it tries locally within his own site in Seattle and has no success. It then successively tries scopes corresponding to a regional ISP scope, a backbone provider scope, and finally a North America continental scope, before giving up and allocating subgroup addresses within these scopes. No request is sent to the global scope, since the global address (the SID) is already known. Later, when the Berkeley user joins, her application also begins the process of attempting to learn which subgroup addresses to use. The application tries her own site in Berkeley, a regional ISP scope, and a backbone provider scope, until finally a response is received from the Seattle user's machine at the continental scope, since the Seattle and Berkeley users are on different continental backbone networks. The important thing to note here is that both searches expanded to relatively large scopes. This operational artifact can be a serious impediment to scalability.
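The response-suppression delay of step 3 above can be sketched as follows; note that the logarithm skews delays toward Tmax, so that with many potential repliers usually only one draws a distinctly early slot, and its multicast response suppresses the rest.

import math
import random

def suppression_delay(t_max: float) -> float:
    """T = Tmax * log256(256*X + 1), with X uniform over [0, 1)."""
    x = random.random()
    return t_max * math.log(256 * x + 1, 256)

delays = [suppression_delay(t_max=1.0) for _ in range(5)]
print(sorted(delays))   # the earliest timer fires; the others are suppressed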
Fig. 2. Australia/USA example: a Seattle site and a Berkeley site, each within its own regional ISP and backbone provider scopes inside a North America scope, and a Sydney site within regional ISP and backbone provider scopes inside an Australia scope.
4.3 Server Operation
SADP counters the problem of searches expanding to a session's entire scope, which occurs when new members join a sparsely populated session, by introducing SADP caching servers. SADP servers subscribe to the SADP address in every scope in which they reside. Their purpose is to short-cut the normal search process by announcing address information learned from larger scopes at lower scopes. These servers assist in the operation of SADP by listening to the responses at each scope and then serving replies to the requests they receive. To further expedite their ability to act as proxies, the basic mechanism is modified to propagate the association between an address and a scope without a request being made. Session members that have just expanded the hierarchy into a particular scope by allocating a new address in that scope announce it to servers by multicasting an unsolicited SADP Response message to the well-known SADP group in that scope. Consequently, when a request is received on a given scope from a session member, a SADP server replies according to the following set of rules:
1. A response is sent to the same SADP address and scope as that on which the request was received.
2. Responses should contain address information for the scope of the request, as well as for all larger scopes within which that scope nests.
The net result is that the SADP servers act as proxy members for all sessions as far as storing the addresses they use. This allows a new session member to quickly acquire information about higher-level addresses by sending requests within a small area, rather than having to send requests at the larger scopes. SADP servers thus reduce traffic by minimizing the size of the scope in which a SADP request is answered. Consider the Australia/USA session example again, but this time with the addition of SADP servers located in each site. These servers will join all the appropriate SADP scope-relative addresses for the scopes they are in, and hence receive SADP messages for all these scopes. Now let us again investigate what happens when the Sydney user creates a session and advertises it via some means to the other users. When the Sydney user joins, his client will again allocate a subgroup address for each smaller scope level it is in, as before. As each of these subgroup addresses is allocated, it is announced within its scope level over the scope-relative SADP address for that scope, using an unsolicited SADP Response message. When the Seattle user's client allocates an address for use at the North America continental scope, the address is announced to the North America SADP group and is heard by all North American SADP servers, but not the Australian SADP servers. This enables the North American SADP servers to learn the continental address for the session and to then announce it in SADP Responses for any requests they receive at lower levels. Later, when the Berkeley user joins and the Berkeley server receives a site-wide request, it is at this scope level that the search can stop.
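A sketch of the two reply rules, under our own assumptions about the server's cache layout; the (session_id, scope) keying and the enclosing_scopes argument are hypothetical names introduced here for illustration.

def build_response(cache: dict, session_id, scope, enclosing_scopes):
    """Rule 2: include the mapping for the requested scope and for every
    larger scope within which it nests. Rule 1 (not shown) multicasts the
    result back on the same SADP scope-relative address and scope."""
    response = {}
    for s in [scope] + list(enclosing_scopes):
        addr = cache.get((session_id, s))
        if addr is not None:
            response[s] = addr
    return response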
5 Simulations
To analyze the operation of SADP, we constructed a simulator to model the number of SADP requests and SADP responses sent to each administrative scope zone. A decision to forego packet-level simulation was made to maximize the number of nodes that could be simulated. This approach assumes that SADP request/response exchanges are independent of one another. This assumption gives conservative performance results, since a separate SADP response must be issued for every request, even when two members send a request simultaneously.
5.1 Simulation Conditions
The simulations were performed using a five-level topology consisting of a hierarchy of scopes: 1 global scope containing 5 top-level scopes (e.g., continental), each of which contains 5 scopes (e.g., country, or backbone provider) for a total of 25 scopes at this second level. Each of these in turn contains another 5 scopes
(125 scopes at a "regional" level, for instance), each of which in turn contains 5 local scopes (625 local scopes). Simulations were then performed as follows for varying numbers of clients and SADP servers. An initial member was first placed randomly within the topology. SADP servers were then placed in random local scope zones, with no more than one server per local scope zone. Finally, each additional randomly-placed member joined the session at a random time, queried what addresses to use, allocated additional smaller-scoped addresses if needed, and then left the session at a randomly chosen time before the end of the simulation. This scenario was deliberately chosen as it models the worst case; in practice, members would likely exhibit a degree of clustering, which would naturally lead to better performance. For these simulations, we assumed that packet loss was negligible (we will revisit this assumption later in Section 5.3), and furthermore that the conservative duplicate-suppression mechanism employed allowed only one response per scope when multiple responses were possible. The effect of these assumptions greatly simplifies the simulation, since a scope-level, and not a link-level, simulation can be used, along with a one-query-per-scope-level (k = 1) policy. While these assumptions make simulation considerably easier, it is important to note that they are not unreasonable even if small losses are present. The fact that scopes nest means that each successive attempt at a larger scope is heard by all previously tried smaller scopes, and hence can be considered a retry at these scopes. Thus, the probability that a search would expand to more than two scopes beyond the smallest possible successful scope is small. For example, if the probability of all responses being dropped at a given scope is 5%, it drops to an upper bound of 0.25% for the next largest scope, and 0.0125% for the next largest scope after that. Two measurements were taken for each run: Cumulative Coverage, and the number of requests per new member. The Cumulative Coverage measurement indicates the average total amount of topology (fraction of local scopes) that participates in all SADP requests when a new member joins and wants to find out what address to use, and hence is a measure of the total amount of bandwidth used. For example, were a member to perform a single query attempt at the global scope, the coverage would be 100%. Likewise, if a request were made at the smallest scope, which covers 1/625 of the total topology, the coverage would be 0.16%. Searches at successively larger scopes are cumulative. Thus, in the five-level hierarchy simulated, where the SADP algorithm can send requests at all levels but the largest scope, the maximum cumulative coverage for an individual member joining the session is 24.96% (0.16% + 0.8% + 4% + 20%). The number of requests measures the number of retries that must be made before a response is received, or the session scope itself is reached. Ideally this value would be as close to 1 as possible, since this corresponds to the "best case" scenario where a single request is sent and a single response is received. This number can also be used to estimate the time required to resolve the currently allocated addresses, since the total time is given by ks(l − 1) plus the round-trip
time of the final request/response, where l is the number of requests sent in the simulation (with k = 1), and k and s are as specified in Section 4.2.
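The coverage arithmetic for this topology can be checked with a short sketch, which reproduces the 24.96% worst case quoted above:

LEVELS, FANOUT = 5, 5
LOCAL_SCOPES = FANOUT ** (LEVELS - 1)                     # 625 local scopes

def coverage_of_request(level: int) -> float:
    """Fraction of local scopes covered by one request at `level`
    (0 = local scope; LEVELS - 2 = largest scope actually queried)."""
    return FANOUT ** level / LOCAL_SCOPES

worst_case = sum(coverage_of_request(l) for l in range(LEVELS - 1))
print(f"{worst_case:.2%}")   # 24.96% = 0.16% + 0.8% + 4% + 20%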
5.2 Results
Figure 3 shows the Cumulative Coverage and the number of requests per member for a number of members ranging from 25 (an average of 0.04 members per local scope) to 625 (an average of 1 member per local scope). The number of SADP servers was similarly varied from 0 to 625, and the results were averaged over four trials. Figure 3 shows that SADP performs as one might expect. In scenarios where the session member density is low, searches invariably expanded to larger scopes in order to find another session member who could provide a response. Thus, for low session member densities, the coverage was higher, but still less than in a flat topology (which would always be 100%), while the number of requests per new member was measurably higher than 1, the value for a flat topology. In scenarios where the session member density was high, the first few requests expanded to the larger scopes as before. However, subsequent requests by new members had a greater chance of finding an active member in a smaller scope, and therefore did not expand their searches to the larger scopes. Thus the coverage, while initially high for the first few joins, quickly subsided as subsequent searches completed at smaller scopes. The overall effect was much greater localization. This conclusion is supported by the measured request count, which approached 1 as the member density, and hence the probability of a response at the smallest scope, increased. Figure 3 also shows that the addition of SADP servers significantly improves the performance for low-density sessions. In these cases, the caches serve the address information at the smallest possible scope to a new session member. Thus, the greater the density of SADP servers in the topology, the greater the likelihood that a new session member will find a nearby cache at a lower scope level to answer its request for address information. The results of the simulations shown in Figure 3 can be extended, with minimal effort, to account for scenarios with extremely large numbers of session members. In such scenarios the coverage will approach 1/(number of local scopes within the session). Thus, for the topology used for Figure 3, the coverage for global sessions will asymptotically approach 1/625, or 0.16%, as the probability of finding a SADP server or active session member in each scope approaches one. It also follows that in these scenarios the request count will also approach one.
5.3 Loss Analysis
With the insights from the previous section, it is possible to calculate what will happen if one assumes that losses cause an exchange at a given scope to fail with probability p. This will cause the coverage to increase by a factor of r as the next request will be sent at the next level, and so on until a request succeeds
Fig. 4. Coverage limit as loss increases: the cumulative-coverage bound (up to about 0.03) plotted against the loss probability p, from 0 to 1, for r = N = 5.
or the scope level of the session itself is reached. Let N be the number of scope levels, and let each parent scope contain r child scopes. If one considers the case where the number of members m increases, and therefore that there will always be an active member or SADP server in a new member's local scope, then the limit of the expected coverage of the last request in a search is given by:

$$\lim_{m \to \infty} \mathrm{LastCvg}
= r^{-(N-1)}\left(p^0(1-p)r^0 + p^1(1-p)r^1 + \cdots + p^{N-2}(1-p)r^{N-2}\right)
= \frac{1-p}{r^{N-1}} \sum_{i=0}^{N-2} (pr)^i
= \frac{(1-p)\left(1 - p^{N-1}r^{N-1}\right)}{r^{N-1}(1-pr)}$$
In addition, when k = 1, an upper bound on the cumulative coverage can be derived from the coverage of the last request:

$$\mathrm{CumulCvg} \le \mathrm{LastCvg} \sum_{i=0}^{N-2} (1/r)^i \le \mathrm{LastCvg}\,\frac{r\left(1 - r^{1-N}\right)}{r-1}$$
Combining these equations, a graph of CumulCvg for r = N = 5 is given in Figure 4. From this, we see that loss has a relatively negligible effect on coverage for large groups in our simulation, until the probability of loss passes about 10%. Similarly, the response time will be governed by the number of requests until one succeeds or the session scope itself is reached. Hence, as the number of members increases, the number of requests will approach min(1/p, k(N − 1)).
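These bounds are easy to evaluate numerically; the sketch below computes the k = 1 cumulative-coverage bound for r = N = 5 at a few loss probabilities (note that the closed form is singular at p = 1/r = 0.2, which the sampled values avoid).

def last_cvg(p: float, r: int = 5, n: int = 5) -> float:
    """Limiting coverage of the last request as the membership grows."""
    return (1 - p) * (1 - (p * r) ** (n - 1)) / (r ** (n - 1) * (1 - p * r))

def cumul_cvg_bound(p: float, r: int = 5, n: int = 5) -> float:
    """Upper bound on cumulative coverage for k = 1."""
    return last_cvg(p, r, n) * r * (1 - r ** (1 - n)) / (r - 1)

for p in (0.0, 0.05, 0.10, 0.15):
    print(f"p = {p:.2f}: CumulCvg <= {cumul_cvg_bound(p):.4f}")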
6 Conclusions and Future Work
In this paper, we outlined the concept of using hierarchical nested administrative scopes to afford localization for multicast sessions with extremely large numbers of members. We examined the application's needs for additional addresses on a per-scope basis to support such a localization scheme. Furthermore, we introduced a new application-layer protocol, the Scoped Address Discovery Protocol (SADP), designed to efficiently distribute this information to new members. We showed how SADP merges the two major paradigms for information exchange, announce/listen and request/response, and exploits the existence of hierarchically-nested administrative scopes. We confirmed through simulation that SADP scales well, and in fact that its performance improves as the number of members increases. Our results show that SADP's ability to scale is governed by two factors: the first is the number of local (smallest) scopes, and the second is the probability of a new session member finding an active session member (either a client or a SADP server) within its local scope. In scenarios where the number of local scopes is large and the probability of finding an active session member at the smallest scope is non-trivial, say greater than 10%, SADP scales particularly well. The deployment of SADP servers throughout the network increases this probability for all sessions, and hence significantly reduces the initial tendency for the SADP expanding-zone search to expand to the largest scopes for sparsely populated wide-area sessions. Hierarchically nested administrative scoping provides a new means for multicast traffic localization, especially for distributed searching applications with large numbers of participants. The SADP protocol itself fits into this space and provides the means for scalably discovering the per-scope addresses for use by such applications. Once deployed, we expect a number of applications to take advantage of the service afforded by SADP, particularly those concerned with service discovery, resource resolution, and localized recovery for reliable multicast. An additional potential area of applicability is that of aggregated statistics collection. Instead of every receiver reporting some statistic directly to a source, reports can be multicast to a local group, with aggregated statistics generated by one receiver and multicast to the next higher scope. We leave this possibility for future work. One opportunity for future investigation is to perform a full link-level simulation of SADP with various loss rates. Finally, some further optimizations may be possible when multiple members join simultaneously and issue redundant requests.
7 Acknowledgements
The authors would like to gratefully acknowledge the feedback given by Mark Handley and the other members of the IETF MBoneD and Malloc Working Groups, which greatly assisted in the refinement of the ideas presented in this paper.
References
1. Mark Handley, Colin Perkins, and Edmund Whelan. SAP: Session Announcement Protocol, August 1999. Internet Draft, draft-ietf-mmusic-sap-v2-*.txt, Work in Progress.
2. M. Handley and V. Jacobson. SDP: Session Description Protocol, April 1998. RFC 2327.
3. Sally Floyd, Van Jacobson, Steven McCanne, Ching-Gung Liu, and Lixia Zhang. A reliable multicast framework for light-weight sessions and application level framing. In Proceedings of ACM SIGCOMM, pages 342-356, 1995.
4. B. Whetten, M. Basavaiah, S. Paul, T. Montgomery, N. Rastogi, J. Conlan, and T. Yeh. The RMTP-II protocol, April 1998. Internet Draft, Work in Progress.
5. M. Kadansky, D. Chiu, J. Wesley, and J. Provino. Tree-based reliable multicast (TRAM), September 1999. Internet Draft, draft-kadansky-tram-*.txt, Work in Progress.
6. T. Speakman, N. Bhaskar, D. Farinacci, S. Lin, A. Tweedly, L. Vicisano, and J. Gemmell. PGM reliable transport protocol specification, June 1999. Internet Draft, draft-speakman-pgm-spec-*.txt, Work in Progress.
7. B.N. Levine and J.J. Garcia-Luna-Aceves. Improving internet multicast routing with routing labels. In Proceedings of the IEEE International Conference on Network Protocols, pages 241-250, 1997.
8. L. Rizzo. Effective erasure codes for reliable computer communication protocols. Computer Communication Review, 27(2):24-36, April 1997.
9. J. Nonnenmacher, E. Biersack, and D. Towsley. Parity-based loss recovery for reliable multicast transmission. In Proceedings of ACM SIGCOMM, pages 289-300, 1997.
10. Steve Casner. Frequently asked questions (FAQ) on the multicast backbone (MBONE), December 1994. ftp://venera.isi.edu/mbone/faq.txt.
11. D. Meyer. Administratively scoped IP multicast, July 1998. BCP 23, RFC 2365.
12. Mark Handley, Dave Thaler, and Roger Kermode. Multicast-scope zone announcement protocol (MZAP), June 1999. Internet Draft, draft-ietf-mboned-mzap-*.txt, Work in Progress.
13. Stephen R. Hanna, Baiju V. Patel, and Munil Shah. Multicast address dynamic client allocation protocol (MADCAP), August 1999. Internet Draft, draft-ietf-malloc-madcap-*.txt, Work in Progress.
14. David C. Plummer. An ethernet address resolution protocol, November 1982. STD 37, RFC 826.
15. Christian Huitema. Routing in the Internet. Prentice Hall, 1995.
16. P.V. Mockapetris. Domain names - implementation and specification, November 1987. STD 13, RFC 1035.
17. Y. Rekhter, B. Moskowitz, D. Karrenberg, G.J. de Groot, and E. Lear. Address allocation for private internets, February 1996. BCP 5, RFC 1918.
18. Mark Handley. Session directories and scalable internet multicast address allocation. In Proceedings of ACM SIGCOMM, pages 105-116, 1998.
19. R. Kermode. Smart network caches: Localized content and application negotiated recovery mechanisms for multicast media distribution, June 1998. Ph.D. Thesis, MIT.
20. C.-G. Liu, D. Estrin, S. Shenker, and L. Zhang. Local error recovery in SRM: Comparison of two approaches. Technical Report 97-648, USC, January 1997.
21. O. Ozkasap, Z. Xiao, and K. Birman. Scalability of two reliable multicast protocols, May 1999. Work in Progress.
Distributed Core Multicast (DCM): A Multicast Routing Protocol for Many Groups with Few Receivers

Ljubica Blazević and Jean-Yves Le Boudec
Institute for Computer Communications and Applications (ICA), Swiss Federal Institute of Technology, Lausanne
email: {Ljubica.Blazevic, Leboudec}@epfl.ch
Abstract. We present a multicast routing protocol called Distributed Core Multicast (DCM). It is intended for use within a large single-domain Internet network with a very large number of multicast groups, each with a small number of receivers. Such a case occurs, for example, when multicast addresses are allocated to mobile hosts as a mechanism to manage Internet host mobility, or in large distributed simulations. For such cases, existing dense- or sparse-mode multicast routing algorithms do not scale well with the number of multicast groups. DCM is based on an extension of the centre-based tree approach. It uses several core routers, called Distributed Core Routers (DCRs), and a special control protocol among them. DCM aims at: (1) avoiding multicast group state information in backbone routers, (2) avoiding triangular routing across expensive backbone links, and (3) scaling well with the number of multicast groups. We evaluate the performance of DCM and compare it to an existing sparse-mode routing protocol when there is a large number of small multicast groups.
1 Introduction
We describe a multicast routing protocol called Distributed Core Multicast (DCM). DCM is designed to provide low-overhead delivery of multicast data in a large single-domain network for a very large number of small groups. This occurs when the number of multicast groups is very large (for example, greater than a million), the number of receivers per multicast group is very small (for example, less than five), and each host is a potential sender to a multicast group. DCM is a sparse-mode routing protocol, designed to scale better than the existing multicast routing protocols when there are many multicast groups but each group has few members in total. Recent sparse-mode multicast routing protocols, such as Protocol Independent Multicast (PIM-SM) [4] and Core-Based Trees (CBT) [2], build a single delivery tree per multicast group that is shared by all senders in the group. This tree is rooted at a single centre router, called the "core" in CBT and the "rendezvous point" (RP) in PIM-SM.
Both centre-based routing protocols have the following potential shortcomings:
– traffic for the multicast group is concentrated on the links along the shared tree, mainly near the core router;
– finding an optimal centre for a group is an NP-complete problem and requires knowledge of the whole network topology [12]. Current approaches typically use either an administrative selection of centres or a simple heuristic [10].
Data distribution through a single centre router can cause a non-optimal distribution of traffic when the centre router is badly positioned with respect to senders and receivers. This problem is known as the triangular routing problem. PIM-SM is not only a centre-based routing protocol; it also uses source-based trees. With PIM-SM, destinations can start building source-specific trees for sources with a high data rate. This partly addresses the shortcomings mentioned above, but at the expense of having routers on the source-specific tree keep source-specific state. Keeping state for each sender is undesirable when the number of senders is large. The Multicast Source Discovery Protocol (MSDP) [5] allows multiple RPs per multicast group in a single shared-tree PIM-SM domain. It can also be used to connect several PIM-SM domains together. Members of a group initiate the sending of a join message towards the nearest RP. MSDP enables RPs which have joined members for a multicast group to learn about active sources for the group. Such RPs trigger a source-specific join towards the source. Multicast data arrives at the RP along the source tree and is then forwarded along the group shared tree to the group members. [13] proposes to use the MSDP servers to distribute the knowledge of active multicast sources for a group. DCM is based on an extension of the centre-based tree approach and is designed for the efficient and scalable delivery of multicast data under the assumptions mentioned above (a large number of multicast groups, a few receivers per group, and potentially a large number of senders to a multicast group). As a first simplifying step, we consider a network model where a large single-domain network is configured into areas that are organised in a two-level hierarchy. At the top level is a single backbone area; all other areas are connected via the backbone (see Figure 1). This is similar to what exists with OSPF [7]. The issues addressed by DCM are: (1) to avoid multicast group state information in backbone routers, (2) to avoid triangular routing across expensive backbone links, and (3) to scale well with the number of multicast groups. The following is a short overview of DCM, illustrated in Figure 1. We introduce an architecture based on several core routers per multicast group, called Distributed Core Routers (DCRs).
– The DCRs in each area are located at the edge of the backbone. The DCRs act as backbone access points for the data sent by senders inside their area to receivers outside this area. A DCR also forwards the multicast data received
Fig. 1. This is a model of a large single domain network and an overview of data distribution with DCM. In this example there are four non-backbone areas that communicate via the backbone. We show one multicast group M and the DCRs X1, X2, X3, and X4 that serve M. Step (1): Senders A2, B1, and C1 send data to the corresponding DCRs inside their areas. Step (2): DCRs distribute the multicast data across the backbone area to DCR X1, which needs it. Step (3): A local DCR sends data to the local receivers in its area.
from the backbone to receivers in the area it belongs to. When a host wants to join the multicast group M, it sends a join message. This join message is propagated hop-by-hop to the DCR inside its area that serves the multicast group. Conversely, when a sender has data to send to the multicast group, it sends the data encapsulated to the DCR assigned to the multicast group.
– The Membership Distribution Protocol (MDP) runs between the DCRs serving the same range of multicast addresses. It is fully distributed. MDP enables the DCRs to learn about other DCRs that have group members.
– The distribution of data uses a special mechanism between the DCRs in the backbone area, and the trees rooted at the DCRs towards members of the group in the other areas. We propose a special mechanism for data distribution between the DCRs which does not require that non-DCR backbone routers perform multicast routing.
With the introduction of DCRs close to every sender and receiver, converging traffic is not sent to a single centre router in the network. Data sent from a sender to a group within the same area is not forwarded to the backbone. Our approach alleviates the triangular routing problem common to all centre-based trees and, unlike PIM-SM, is suitable for groups with many sporadic senders. Like PIM-SM and CBT, DCM is independent of the underlying unicast routing protocol. In this paper we examine the properties of DCM in a large single-domain network. However, DCM is not constrained to a single-domain network.
Distributed Core Multicast (DCM)
111
perability of DCM with other inter-domain routing protocols is the object of ongoing work. The structure of this paper is as follows. In the next section we present the architecture of DCM. That is followed by the DCM protocol specification in Section 3. In Section 4 we give a preliminary evaluation of DCM. Section 5 presents how DCM can be used to route packets to the mobile hosts.
2 Architecture of DCM
In this section we describe the general concepts used by DCM. A detailed description follows in Section 3. We group the general concepts into three broad categories: (1) the hierarchical network model, (2) how membership information is distributed, and (3) how user data is forwarded.

2.1 Hierarchical Network Model
We consider a network model where a large single domain network is configured into areas that can be viewed as being organised in a two-level hierarchy. At the top level is a single backbone area to which all other areas connect. This is similar to what exists with OSPF [7] and MOSPF [6]. In DCM we use the area concept of OSPF. DCM, unlike MOSPF, does not require link state routing; DCM is independent of the underlying unicast routing protocol. Our architecture introduces several core routers per multicast group, called Distributed Core Routers (DCRs). The DCRs are border routers situated at the edge of the backbone. Inside each non-backbone area there can be several DCRs serving as core routers for the area.

2.2 Distribution of the Membership Information
Given the two-level hierarchical network model, we distinguish between the distribution of membership information in non-backbone areas and in the backbone area. Inside non-backbone areas, multicast routers keep group membership information for groups that have members inside the corresponding area. But unlike MOSPF, the group membership information is not flooded inside the area. The state information kept in multicast routers is per group ((*,G) state) and not per source per group (no (S,G) state). If there are no members of the multicast group G inside an area, then no (*,G) state is kept in that area. This is similar to MSDP when it is applied to our network model. Inside the backbone, non-DCR routers do not keep membership information for groups that have members in non-backbone areas. This is different from MSDP, where backbone routers can keep (S,G) information when they are on the source-specific distribution trees from the senders towards RPs. This is also different from MOSPF, where all backbone routers have complete knowledge of all areas' group membership. In DCM, the backbone routers may keep group membership information for a small number of reserved multicast groups that
are used for control purposes inside the backbone. We say a DCR is labelled with a multicast group when there are members of the group inside its corresponding area. DCRs in different areas run a special control protocol for the distribution of membership information, e.g. the information of being labelled with a multicast group.

2.3 Multicast Data Distribution
Multicast packets are distributed natively from the local DCR in the area to members inside the area. Multicast packets from senders inside the area are sent towards the local DCR. This can be done by encapsulation or by source routing. This is similar to what exists in MSDP. DCRs act as packet exploders and, by using the other areas' membership information, attempt to send multicast data across the backbone only to those DCRs that need it (those that are labelled with the multicast group). DCRs run a special data distribution protocol that tries to optimize the use of backbone bandwidth. The distribution trees in the backbone are source-specific but, unlike in MSDP, routers do not keep (S,G) information.
3 The DCM Protocol Specification
In this section we give the specification of DCM by describing the protocol mechanisms for every building block in the DCM architecture.

3.1 Hierarchical Network Model: Addressing Issues
In each area there are several routers that are configured to act as candidate DCRs. The identities of the candidate DCRs are known to all routers within an area by means of an intra-area bootstrap protocol [3]. This is similar to PIM-SM, with the difference that the bootstrap protocol is constrained within an area. This entails a periodic distribution of the set of reachable candidate DCRs to all routers within an area. Routers use a common hash function to map a multicast group address to one router from the set of candidate DCRs. For a particular group address M, we use the hash function to determine the DCR that serves¹ M. The hash function used is h(r(M), DCRi), where the function r(M) takes a multicast group address as input and returns the range of the multicast group, and DCRi is the unicast IP address of the DCR. The target DCRi is then chosen as the candidate DCR with the highest value of h(r(M), DCRj) among all j from the set {1, ..., J}, where J is the number of candidate DCRs in an area:

h(r(M), DCRi) = max{ h(r(M), DCRj) : j = 1, ..., J }    (1)

¹ A DCR is said to serve the multicast group address M when it is dynamically elected among all the candidate DCRs in the area to act as the access point for address M.
One possible example of a function that gives the range² of the multicast group address M is:

r(M) = M & B, where B is a bit mask.    (2)

² A range is a partition of the set of multicast addresses into groups of addresses. The range to which a multicast group address belongs is defined by Equation (2); e.g. if the bit mask is (hex) 00000009, we get 4 possible ranges of IPv4 class-D addresses.
We do not present the hash function theory here; for more information see [11], [3] and [9]. The benefits of using hashing to map a multicast group to a DCR are the following:
– We achieve minimal disruption of groups when there is a change in the candidate DCR set. This means that only a small number of multicast groups have to be re-mapped when the candidate DCR set changes. See [11] for more explanation.
– We apply the hash function h(.,.) as defined by the Highest Random Weight (HRW) algorithm [9]. This function ensures load balancing between candidate DCRs: no single DCR serves more multicast groups than any other DCR inside the same area. As a consequence, when the number of candidate DCRs increases, the load on each DCR decreases.
All routers in all non-backbone areas should apply the same functions h(., .) and r(.).
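To make the mapping concrete, the following sketch shows how a router could combine r(.) and an HRW-style h(.,.) to elect the serving DCR. It is a minimal illustration under our own assumptions, not the protocol's actual hash: the SHA-1-based weight function and all names are placeholders.

```python
import hashlib

def r(group_addr: int, bit_mask: int = 0x00000009) -> int:
    """r(M) = M & B: map a class-D address (as an int) to its range, Equation (2)."""
    return group_addr & bit_mask

def h(range_val: int, dcr_addr: int) -> int:
    """h(r(M), DCR_i): a stand-in HRW weight; any hash with uniform output works."""
    data = range_val.to_bytes(4, "big") + dcr_addr.to_bytes(4, "big")
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")

def serving_dcr(group_addr: int, candidate_dcrs: list) -> int:
    """Equation (1): pick the candidate DCR with the highest weight.
    Every router that applies the same h(.,.) and r(.) to the same
    candidate set elects the same DCR, with no extra coordination."""
    rng = r(group_addr)
    return max(candidate_dcrs, key=lambda dcr: h(rng, dcr))
```

Because each weight depends on a single candidate, adding or removing one candidate DCR remaps only the ranges for which that candidate held (or now holds) the maximum weight, which is the minimal-disruption property cited from [11].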
Each candidate DCR is aware of all the ranges of multicast addresses for which it is elected to be a DCR in its area. There is one reserved multicast address that corresponds to every range of multicast group addresses. A DCR joins the reserved multicast address that corresponds to a range of multicast addresses that it serves. This multicast address is used by DCRs in different areas that serve the same range of multicast addresses to exchange control information (see Section 3.3).

3.2 Distribution of Membership Information inside Non-backbone Areas
When a host is interested in joining the multicast group M, it issues an IGMP join message. A multicast router on its LAN, known as the designated router (DR), receives the IGMP join message. The DR determines the DCR inside its area that serves M, as described in Section 3.1. The process of establishing the group shared tree is as in PIM-SM [4]. The DR sends a join message towards the determined DCR. Sending a join message forces any off-tree routers on the path to the DCR to forward a join message and join the tree. Each router on the way to the DCR keeps forwarding state for M. When a join message reaches the DCR, this DCR becomes labelled with the multicast group M. In this way, the delivery subtree for the receivers of the multicast group M in an area is established. The subtree is maintained
by periodically refreshing the state information for M in the routers (as in PIM-SM, this is done by periodically sending join messages). Also as in PIM-SM, when the DR discovers that there are no longer any receivers for M, it sends a prune message towards the nearest DCR to disconnect from the shared distribution tree. Figure 2 shows an example of joining the multicast group; a sketch of the hop-by-hop join processing is given below.
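The sketch below is a rough illustration of this join processing: each router on the path installs soft (*,G) state and propagates the join one hop towards the DCR. The router and interface model (names, tables) is our own simplification, not a protocol specification.

```python
class AreaRouter:
    """Minimal model of a non-backbone multicast router handling joins."""

    def __init__(self, name, routers, unicast_next_hop):
        self.name = name
        self.routers = routers                    # router name -> AreaRouter
        self.unicast_next_hop = unicast_next_hop  # destination -> next-hop router name
        self.state = {}                           # group -> set of downstream neighbours

    def handle_join(self, group, dcr_name, from_neighbour):
        newly_on_tree = group not in self.state
        # (*,G) state only: remember which neighbour wants the group.
        self.state.setdefault(group, set()).add(from_neighbour)
        if newly_on_tree and self.name != dcr_name:
            # Off-tree router: propagate the join one hop towards the DCR.
            nxt = self.unicast_next_hop[dcr_name]
            self.routers[nxt].handle_join(group, dcr_name, self.name)
```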
Fig. 2. The figure shows hosts in four areas that join two multicast groups M1 and M2. The four DCRs (X1, X2, X3 and X4) presented in the figure serve the range of multicast addresses to which the group addresses M1 and M2 belong. Circles in the figure represent multicast routers in non-backbone areas that are involved in the construction of the DCR-rooted subtrees. These subtrees are shown with dashed lines. X2, X3 and X4 are now labelled with M1, while X1 and X4 are labelled with M2.
3.3 Distribution of Membership Information inside the Backbone
The Membership Distribution Protocol (MDP) is used by DCRs in different areas to exchange control information. As said in Section 3.1, within each non-backbone area, for each range of multicast addresses (as defined by Equation (2)) there is one DCR serving that range. DCRs in different areas that serve the same range of multicast addresses are members of the same MDP control multicast group. This group is defined by an MDP control multicast address used for exchanging control information. A DCR joins as many MDP control multicast groups as the number of ranges of multicast addresses it serves. There are as many MDP control multicast groups as there are possible ranges of multicast addresses. We do not propose a specific protocol for maintaining the multicast tree for an MDP multicast group; this can be done by means of an existing multicast routing protocol (e.g. CBT). DCRs that are members of the same MDP control multicast group exchange the following control information:
– periodic keep-alive messages;
– unicast distance information: each DCR sends, to the corresponding MDP control multicast group, information about the unicast distance from itself to the other DCRs that it has learned serve the same range of multicast addresses. This information comes from existing unicast routing tables, and it is used for the distribution of multicast data among the DCRs;
– multicast group information: a DCR that is labelled with the multicast group M informs the DCRs in other areas responsible for M that it has receivers for M. In this way, every DCR keeps a record of every other DCR that has at least one member for a multicast address from the range that the DCR serves. A DCR should notify all other DCRs when it becomes labelled with a new multicast group or is no longer labelled with a multicast group.
The three message types are sketched below.
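The paper does not give a wire format for MDP, so the following representation of the three control message types is purely our own sketch; the field names are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class KeepAlive:
    dcr: str                 # unicast address of the sending DCR

@dataclass
class DistanceReport:
    dcr: str
    distances: dict = field(default_factory=dict)  # peer DCR -> unicast distance

@dataclass
class LabelUpdate:
    dcr: str
    group: str               # multicast group M
    labelled: bool           # True: now has members; False: no longer labelled
```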
3.4 How Senders Send to a Multicast Group
The sending host originates native multicast data for the multicast group M, which is received by the designated router (DR) on its LAN. The DR determines the DCR within its area that serves M; we call this DCR the source DCR. The DR encapsulates the multicast data packet (IP-in-IP) and sends it with a destination address equal to the address of the source DCR. The source DCR receives the encapsulated multicast data. This is similar to PIM-SM, where the DR sends encapsulated multicast data to the RP corresponding to the multicast group.

3.5 Data Distribution in the Backbone
The multicast data for the group M is distributed from a source DCR to all DCRs that are labelled with M. Since we assume that the number of receivers per multicast group is not large, there are only a few labelled routers per multicast group. Our goal is to perform multicast data distribution in the backbone in such a way that backbone routers keep minimal state information while, at the same time, backbone bandwidth is used efficiently. We propose a solution that can be applied in the Internet today. It uses point-to-point tunnels to perform data distribution among DCRs. With this solution, non-DCR backbone routers do not keep any state information related to the distribution of the multicast data in the backbone.

Point-to-Point Tunnels. The DCR that serves the multicast group M keeps the following information: (1) the set V of DCRs that serve the range to which M belongs; (2) information about the unicast distances between each pair of DCRs in V; (3) the set L of labelled DCRs for M. The DCR obtains this information by exchanging MDP control messages with DCRs in other areas. In this way, we represent the virtual network of DCRs that serve the same range of multicast group addresses by an undirected complete graph G = (V, E). V is defined above, while the edges E are tunnels between each pair of DCRs in V. Each edge is associated with a cost value that is equal to the inter-DCR unicast distance.
The source DCR, called S, calculates an optimal tree that spans the labelled DCRs. In other words, S finds the subtree T = (VT, ET) of G that spans the set of nodes L such that cost(T) = Σ_{e∈ET} cost(e) is minimised. We recognise this problem as the Steiner tree problem. Instead of finding the exact solution, which is an NP-complete problem, we introduce a simple heuristic called the Shortest Tunnel Heuristic (STH). STH consists of two phases. In the first phase a greedy tree is built by adding, one by one, the nodes that are closest to the tree under construction, and then removing unnecessary nodes. The second phase further improves the tree established so far.

Phase 1: Build a greedy tree
– Step 1: Begin with a subtree T of G consisting of the single node S. k = 1.
– Step 2: If k = n then go to Step 4, where n is the number of nodes in the set V.
– Step 3: Determine a node zk+1 ∈ V, zk+1 ∉ T, closest to T (ties are broken arbitrarily). Add the node zk+1 to T. k = k + 1. Go to Step 2.
– Step 4: Remove from T non-labelled DCRs of degree¹ 1 and degree² 2 (one at a time).

Phase 2: Improve the greedy tree
STH can be further improved by two additional steps:
– Step 5: Determine a minimum spanning tree for the subnetwork of G induced by the nodes in T (after Step 4).
– Step 6: Remove from the minimum spanning tree non-labelled DCRs of degree 1 and 2 (one at a time).

The resulting tree is the (suboptimal) solution. Figures 3, 4 and 5 illustrate three examples of the use of STH. Nodes X1, X2, X3 and X4 represent four DCRs that serve the multicast group M. In all examples the source DCR is X1, and the labelled DCRs for M are X2 and X4. For the first two examples, the tree obtained by the first phase cannot be further improved by Steps 5 and 6. In the third example, Steps 5 and 6 improve the cost of the resulting tree. The source DCR applies STH to determine the distribution tunnel tree from itself to the list of labelled DCRs for the multicast group. The source DCR puts the inter-DCR distribution information, in the form of an explicit distribution list, in the end-to-end option field of the packet header. Under the assumption that there is a small number of receivers per multicast group, the number of labelled DCRs for a group is also small. Thus, an explicit distribution list that completely describes the distribution tunnel tree is not expected to be long. When a DCR receives a packet from another DCR, it reads from the distribution list whether it should make a copy of the multicast data, and the identities of the DCRs to which it should send the multicast data by tunnelling. Labelled DCRs deliver data to local receivers in their corresponding areas. An example that shows how multicast data is distributed among DCRs is presented in Figure 6. A compact sketch of STH follows the footnotes below.

¹ The degree of a node in a graph is the number of edges incident with the node.
² A node of degree 2 is removed by replacing its two edges with a single edge (tunnel) connecting the two nodes adjacent to the node being removed. The source DCR is never removed from the graph.
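The following is a compact sketch of STH under the rules stated above. The graph representation (a dict of per-node edge-cost dicts over the complete graph) and all helper names are our own assumptions.

```python
def sth(cost, source, labelled):
    """cost: complete graph as {node: {other: tunnel_cost}}; returns the
    tunnel tree as an adjacency dict {node: set(neighbours)}."""
    nodes = set(cost)
    # Phase 1, Steps 1-3: grow a greedy tree from the source, always
    # attaching the remaining node closest to the tree (ties arbitrary).
    tree = {source: set()}
    while len(tree) < len(nodes):
        _, u, v = min((cost[t][z], t, z) for t in tree for z in nodes - tree.keys())
        tree[u].add(v)
        tree.setdefault(v, set()).add(u)
    prune(tree, source, labelled)              # Step 4
    # Phase 2, Steps 5-6: MST over the surviving nodes, then prune again.
    tree = mst(cost, set(tree))
    prune(tree, source, labelled)
    return tree

def prune(tree, source, labelled):
    """Remove non-labelled, non-source nodes of degree 1; splice out those
    of degree 2 by replacing their two tunnels with one direct tunnel."""
    changed = True
    while changed:
        changed = False
        for v in list(tree):
            if v not in tree or v == source or v in labelled:
                continue
            if len(tree[v]) == 1:
                (u,) = tree[v]
                tree[u].discard(v)
                del tree[v]
                changed = True
            elif len(tree[v]) == 2:
                u, w = tree[v]
                tree[u].discard(v)
                tree[w].discard(v)
                tree[u].add(w)
                tree[w].add(u)
                del tree[v]
                changed = True

def mst(cost, nodes):
    """Prim's minimum spanning tree over the complete subgraph on `nodes`."""
    start = next(iter(nodes))
    tree, seen = {start: set()}, {start}
    while seen != nodes:
        _, u, v = min((cost[t][z], t, z) for t in seen for z in nodes - seen)
        tree[u].add(v)
        tree.setdefault(v, set()).add(u)
        seen.add(v)
    return tree
```

Phase 1 alone can be suboptimal because the greedy attachment order fixes which tunnels exist; recomputing an MST over the surviving nodes in Step 5 can swap in cheaper direct tunnels, which is what the third example illustrates.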
Fig. 3. The first example of the application of STH on the complete graph: (a) the four-node complete graph; (b) start with X1; add X3 to the tree because it is closer to X1 than X2 and X4; (c) X2 is added to the tree; (d) result of STH up to Step 4; since X3 is of degree 3 it is not removed from the tree; this is the STH solution, since the tree cannot be improved by Steps 5 and 6.
Fig. 4. The second example of the application of STH on the complete graph: (a) the four-node complete graph; (b) result of STH up to Step 4; (c) X3 is of degree 2 and is not a labelled DCR; it is removed from the tree, and X2 and X4 are connected with an edge (a tunnel between them); this is the STH solution, since the tree cannot be improved by Steps 5 and 6.
Fig. 5. The third example of the application of STH on the complete graph: (a) the four-node complete graph; (b) result of STH up to Step 4; (c) X3 is removed from the tree; result of STH up to Step 5; (d) Steps 5 and 6: minimum spanning tree of X1, X2 and X4; this is the STH solution.
Fig. 6. An example of inter-DCR multicast data distribution using point-to-point tunnels. The source DCR is X1 and the labelled DCRs are X2 and X4. X1 calculates the distribution tunnel tree to X2 and X4 by applying STH. Assume that the result of STH is the distribution tunnel tree consisting of the edges X1-X3, X3-X2 and X3-X4 (similar to the example presented in Figure 3). X1 then sends the encapsulated multicast data packet to X3; the end-to-end option field of the packet contains the distribution list. X3 sends two copies of the multicast data: one to X2 and the other to X4. The figure also shows the packet formats at various points (1, 2 and 3) on the way from X1 to X2 and X4. A tunnel between two DCRs is shown with a dashed line.
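As a sketch of the forwarding step in Figure 6, the tunnel tree computed by STH can be rooted at the source into an explicit parent-to-children distribution list, which each DCR then acts on. Encoding the list as a Python dict is our assumption; the paper does not fix a format for the end-to-end option.

```python
def distribution_list(tree, source):
    """Root the undirected tunnel tree at the source DCR: parent -> children."""
    rooted, stack = {}, [(source, None)]
    while stack:
        node, parent = stack.pop()
        rooted[node] = [n for n in tree[node] if n != parent]
        stack.extend((child, node) for child in rooted[node])
    return rooted

def on_tunnel_packet(self_addr, dist_list, payload, send_tunnel, is_labelled,
                     forward_on_subtree):
    """What a DCR does with an encapsulated packet received from another DCR."""
    for child in dist_list.get(self_addr, []):
        send_tunnel(child, dist_list, payload)   # one copy per outgoing tunnel
    if is_labelled(self_addr):
        forward_on_subtree(payload)              # deliver inside the local area
```

For the tree of Figure 6, distribution_list yields {X1: [X3], X3: [X2, X4], X2: [], X4: []}, so X3, although not labelled, duplicates the packet once for X2 and once for X4.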
3.6 Data Distribution inside Non-backbone Areas
A DCR receives encapsulated multicast data packets either from a source within its area, or from a DCR in another area. The DCR checks whether it is labelled with the multicast group that corresponds to the received packet, i.e. whether there are members of the multicast group in its area. If this is the case, the DCR forwards the multicast packet along the distribution subtree already established for the multicast group (as described in Section 3.2).
4 Preliminary Evaluation of DCM
In this section we examine DCM performance under the following assumptions: a large number of multicast groups, a few receivers per group, and a potentially large number of senders per multicast group. We show that, under these assumptions, DCM performs better than the PIM-SM shared-tree multicast routing protocol.
Fig. 7. The figure presents one member of the multicast group M in area A and four senders in areas A, B, C and D. Two different approaches to data distribution are illustrated: (a) the PIM-SM shared-tree case and (b) DCM. In the case of DCM, within each area there is one DCR that serves M. In PIM-SM, one of the DCRs is chosen to be the centre router (RP). With PIM-SM, all senders send encapsulated multicast data to the RP, and multicast data is distributed from the RP along the established distribution tree to the receiver (dashed line). With DCM, each sender sends encapsulated multicast data to the DCR inside its area, and data is distributed from the source DCRs (X1, X2, X3 and X4) to a receiver by means of point-to-point tunnels (full lines in the backbone) and the established subtree in area A (dashed line).
We have implemented DCM using the Network Simulator (NS) tool [1]. To examine the performance of DCM in a realistic manner, we performed simulations on a single-domain network model consisting of four areas connected via the backbone area. Figure 7 illustrates the network model used in the simulations, where areas A, B, C and D are connected via the backbone. The whole network contains 128 nodes. We examined the performance under realistic conditions: the links in the network were configured to run at 1.5 Mb/s with a 10 ms delay between hops. The link costs in the backbone area are higher than the costs in the other areas. We analyse the following characteristics: size of the routing table, traffic concentration in the network, and control traffic overhead.

– The amount of multicast router state information. DCM requires that each multicast router maintains a table of multicast routing information. In our simulations, we check the size of this routing table. The routing table size becomes an especially important issue when the number of senders and groups grows, because router speed and memory requirements are affected. We performed a number of simulations. In all of them, we use the same network model presented in Figure 7, but with different numbers of multicast groups. For each multicast group there is only one receiver and 20 senders. Within each area, there is more than one candidate DCR. The hash function is used by routers within the network to map a multicast group to one DCR in the corresponding area. We randomly distributed membership among a number of active groups: for every multicast group, one receiver in the network is chosen randomly, and senders are chosen in the same way. The same scenarios were simulated with PIM-SM as the multicast routing protocol. In PIM-SM, candidate RP routers are placed at the same locations as the candidate DCRs in the DCM simulation. We verified that, among all routers in the network, the routers with the largest routing table size are the DCRs in the case of DCM; in the case of PIM-SM they are the RPs and the backbone routers. We define the most loaded router as the router with the largest routing table size. Figure 8 shows the routing table size in the most loaded router for the two approaches. It illustrates that the routing table size of the most loaded DCR increases linearly with the number of multicast groups. The most loaded router in PIM-SM is in the backbone. As the number of multicast groups increases, the size of the routing table in the most loaded DCR becomes considerably smaller than that in the most loaded PIM-SM backbone router. As expected, the routing table size in RPs is larger than in DCRs. This can be explained by the fact that an RP in PIM-SM is responsible for the receivers and senders in the whole domain, while a DCR is responsible for the receivers and senders in the area to which it belongs. For non-backbone routers, the simulation results show that, with the RPs placed at the edge of the backbone, there is not a big difference in routing table sizes between the two approaches. If the RPs are located elsewhere inside the area, non-backbone routers have a smaller routing table size when DCM is applied as the multicast routing protocol than in the case of PIM-SM.
Fig. 8. Routing table size for the most loaded routers
Figure 9 illustrates the average routing table size in the backbone routers for the two routing protocols. In the case of PIM-SM, this size increases linearly with the number of multicast groups. With DCM, all join/prune messages from receivers in non-backbone areas are terminated at the corresponding DCRs situated at the edge of the backbone. Thus, in DCM, non-DCR backbone routers need not keep multicast group state information for groups with receivers inside non-backbone areas. Backbone routers may keep group membership information only for a small number of MDP control multicast groups.

– Traffic concentration. In the shared-tree case of PIM-SM, every sender to a multicast group sends encapsulated data to the RP router uniquely assigned to that group within the whole domain. This is illustrated in Figure 7(a), where all four senders to a multicast group send data to a single point in the network. This increases traffic concentration on the links leading to the RP. With DCM, converging traffic is not sent to a single point in the network, because each sender sends data to the DCR assigned to the multicast group within the corresponding area (as presented in Figure 7(b)). In DCM, if all senders and all receivers are in the same area, data is not forwarded to the backbone; backbone routers do not forward the local traffic generated inside an area. Consequently, triangular routing across expensive backbone links is avoided.
Fig. 9. Average routing table size at the backbone router
– Control traffic overhead. Join/prune messages are overhead messages used for setting up, maintaining and tearing down the multicast data delivery subtrees. In our simulations we measured the number of such messages exchanged when DCM and when PIM-SM is used as the multicast routing protocol. The simulations show that in DCM the number of join/prune messages is 20% smaller than in PIM-SM. This result can be explained by the fact that in DCM all join/prune messages from receivers in the non-backbone areas are terminated at the corresponding DCRs inside the same area, close to the destinations, whereas in PIM-SM join/prune messages must reach the RP, which may be far away from the destinations. In DCM, the DCRs additionally exchange MDP control messages. The evaluation of the overhead of these messages depends on the dynamics of group joins and leaves and on the update frequency; it is left for future work.
5 Application of DCM in a New Mobility Management Scheme
In this section we show how DCM can be used for a new mobility management approach based on multicasting. When a visiting mobile host arrives in a new domain, it is assigned a temporary multicast address. This is the care-of address that the mobile host keeps as long as it stays in the same domain. This is unlike the Mobile IP [8] proposal, where the mobile host performs a location update after each migration and informs its possibly distant home agent.
We propose to use DCM as the mechanism to route packets to mobile hosts. As explained in Section 2.1, for the mobile host's assigned multicast address there exists, within each area, a DCR that serves that multicast address. These DCRs are responsible for forwarding packets to the mobile host. As said before, the DCRs run the MDP control protocol and are members of an MDP control multicast group for exchanging MDP control information. A multicast router in the mobile host's cell initiates the joining of the multicast group assigned to the mobile host; typically this router coexists with the base station in the cell. As described in Section 3.2, the join message is propagated to the DCR inside the area that serves the mobile host's multicast address. The DCR then sends an MDP control message to the MDP control multicast group when the mobile host is registered.
Fig. 10. The mobile host (MH) is assigned multicast address M. Four DCRs, X1, X2, X3 and X4, serve M. Step (1): Base station BS1 sends a join message for M towards X1; X1 informs X2, X3 and X4 that it has a member for M. Step (2): Advance registration for M in a neighbouring cell is done by BS2. Step (3): The sender sends a packet to multicast group M. Step (4): The packet is delivered through the backbone to X1. Step (5): X1 receives the encapsulated multicast data packet; from X1, data is forwarded to BS1 and BS2. MH receives the data from BS1.
In order to reduce packet latency and losses during a handover, advance registration can be performed. The goal is that, when a mobile host moves to a new cell, the base station in the new cell should already have started receiving data for the mobile host, so that the mobile host continues to receive the data without disruption. There are several ways to achieve this:
– A base station that anticipates¹ the arrival of a mobile host initiates joining the multicast address assigned to the mobile host. This is illustrated by an example in Figure 10.
– In the case where bandwidth is not expensive on the wired network, all neighbouring base stations can start receiving data destined to a mobile host. This guarantees that there is no latency and no packet loss during a handover. A packet for the mobile host reaches all base stations that joined the multicast group assigned to the mobile host, while the mobile host receives data only from the base station in its current cell. A base station that receives a packet on behalf of a mobile host that is not present in its cell can either discard the packet or buffer it for a certain interval of time (e.g. 10 ms). Further research is needed to determine the best approach (a sketch of this base-station behaviour is given below).
In this paper we do not address the problems of using multicast routing to support end-to-end unicast communication. These problems are related to protocols such as TCP, ICMP, IGMP and ARP. A simple solution could be to have a special range of unicast addresses that are routed as multicast addresses. In this way, packets destined to the mobile host are routed using a multicast mechanism, while at the end systems these packets are treated as unicast packets and standard unicast mechanisms apply.

¹ The mechanism by which the base station anticipates the arrival of the mobile host is outside the scope of this paper.
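A minimal sketch of the base-station side of advance registration follows, assuming injected send-join and radio-send primitives; all names and the 10 ms hold time are our own illustrative choices.

```python
import time
from collections import deque

class BaseStation:
    """Sketch of advance registration and short-term buffering."""

    def __init__(self, send_join, radio_send):
        self.send_join = send_join     # group -> None: Section 3.2 join towards the DCR
        self.radio_send = radio_send   # (mobile, packet) -> None
        self.present = set()           # mobile hosts currently in this cell
        self.buffers = {}              # mobile -> deque of (deadline, packet)

    def anticipate_arrival(self, mobile, group):
        # Join the mobile's assigned multicast address before it arrives,
        # so data is already flowing when the handover completes.
        self.send_join(group)

    def on_packet(self, mobile, packet, hold_s=0.010):
        if mobile in self.present:
            self.radio_send(mobile, packet)
        else:
            # The mobile is not here yet: buffer briefly (e.g. 10 ms) or drop.
            self.buffers.setdefault(mobile, deque()).append(
                (time.monotonic() + hold_s, packet))

    def on_arrival(self, mobile):
        self.present.add(mobile)
        now = time.monotonic()
        for deadline, packet in self.buffers.pop(mobile, ()):
            if deadline >= now:        # flush packets still within their hold time
                self.radio_send(mobile, packet)
```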
6 Conclusions
We have considered the problem of multicast routing in a large single domain network with a very large number of multicast groups, each with a small number of receivers. Our proposal, called Distributed Core Multicast (DCM), is based on an extension of the centre-based tree approach. DCM uses several core routers, called Distributed Core Routers (DCRs), and a special control protocol among them. The objectives achieved by DCM are: (1) avoiding multicast state information in backbone routers, (2) avoiding triangular routing across expensive backbone links, and (3) scaling well with the number of multicast groups. Our initial results indicate that DCM performs better than the existing sparse-mode routing protocols in terms of multicast forwarding table size. We have also presented an application of DCM in which it is used to route packets to mobile hosts.
References
1. Network Simulator. Available from http://www-mash.cs.berkeley.edu/ns.
2. A. Ballardie. Core Based Trees (CBT) Multicast Routing Architecture. RFC 2201, September 1997.
3. Deborah Estrin, Mark Handley, Ahmed Helmy, Polly Huang, and David Thaler. A Dynamic Mechanism for Rendezvous-based Multicast Routing. In Proc. of IEEE INFOCOM'99, New York, USA, March 1999.
4. D. Estrin et al. Protocol Independent Multicast-Sparse Mode (PIM-SM): Protocol Specification. RFC 2117, June 1997.
5. D. Farinacci et al. Multicast Source Discovery Protocol (MSDP). Internet Draft (work in progress), June 1998.
6. J. Moy. Multicast Extensions to OSPF. RFC 1584, 1994.
7. J. Moy. OSPF Version 2. RFC 1583, 1994.
8. C. Perkins. IP Mobility Support. Network Working Group, RFC 2002, October 1996.
9. D. G. Thaler and C. V. Ravishankar. Using Name-Based Mappings to Increase Hit Rates. IEEE/ACM Transactions on Networking, 6(1), February 1998.
10. David G. Thaler and Chinya V. Ravishankar. Distributed Center-Location Algorithms. IEEE JSAC, 15(3), April 1997.
11. Vinod Valloppillil and Keith W. Ross. Cache Array Routing Protocol v1.0. Internet Draft (work in progress), 1998.
12. Liming Wei and Deborah Estrin. The Trade-offs of Multicast Trees and Algorithms. In Proc. of the 1994 International Conference on Computer Communications and Networks, San Francisco, CA, USA, September 1994.
13. Yunzhou Li. Group Specific MSDP Peering. Internet Draft (work in progress), June 1999.
A Distributed Recording System for High Quality MBone Archives
Angela Schuett, Randy Katz, Steven McCanne
Department of Electrical Engineering and Computer Science
University of California, Berkeley
{schuett,randy,mccanne}@cs.berkeley.edu
Abstract. Popular multicast applications that allow group communication using real-time audio and video have enabled a wide variety of online meetings, conferences and panel discussions. The ability to record and later replay these sessions is one of the key functionalities required for a complete collaboration system. One of the unsolved problems in archiving these interactive sessions is the lack of any method for recording sessions at the highest possible quality. Since audio and video transmissions are typically sent unreliably, there may be a wide variance in recorded quality depending on where the recorder is placed relative to the various sources. This is especially problematic if multiple sources are active in a single session. In addition, because of congestion control schemes that send high-quality, high-rate data to local receivers, and low-rate data in the wide area, different sets of data may be available in different areas of the network for any given session. In response to these challenges, we have developed a system that uses multiple distributed recorders placed at or near the sources of the session. These recorders serve as data caches that transmit data to archives. The archive systems collate the data from the various recorders and create a high-quality recorded session, which is then available for playback. In this paper, we present the tradeoffs involved in architecting a distributed recording system, and present our design for a fault-tolerant, scalable system that also supports a wide range of heterogeneity in end-system connectivity and processor speed. This is achieved in our system through the use of decentralized, shared control protocols that allow simple and fast fault recovery, and decentralized, multicast data collection protocols that allow multiple systems to share data collection bandwidth. We describe an implementation of the system using the MASH multimedia toolkit, the libsrm reliable multicast protocol framework, and the AS1 active service middleware platform implementation. We also discuss our experience with the system and identify several areas of future work.
1 Introduction
The deployment of the multicast backbone, or MBone, has made possible synchronous multi-party communication that is more efficient and more accessible
than ever before. The standardization of RTP [SCFJ96] as a light-weight, best-effort, real-time data transmission protocol has allowed interactive sessions that scale to a very large number of participants. A number of publicly available tools transmit and receive RTP data streams [MJ95,JM,Sch92,CR], and these tools have been used to transmit sessions such as concerts, classes, group meetings, lectures, and conversations. Archive systems have also been developed that record and play back RTP sessions. These include local recorders and local players, which save RTP packets and replay them from local disk [Sch,Hol95,CSa], and archive systems [AA98,Kle94,CSa], which allow remote clients to request playback of sessions stored at the archive. Some archive systems also allow clients to request that the archive system record an advertised session [Hol97,CSb,LKH98]. As the MBone carries more content, we expect that more archive systems will be run at a variety of sites, independently recording and replaying sessions that are of interest to their local user populations. One of the challenges in designing and deploying a recording system for the MBone is that RTP audio and video sessions may exhibit different types of intra-session heterogeneity. That is, participants in the same session may see different data. The first type of heterogeneity is in received reception quality. Since RTP is a lossy protocol, without retransmission requests, some receivers may receive more packets, and thus a better reception quality, than other receivers¹. This is particularly apparent on the current MBone, where one study measured that 50% of receivers in a large MBone session had loss rates above 10% [Han97]. Another study found that loss bursts lasting longer than 8 seconds were not uncommon [YKT96]. If a recorder is too "far away" (in network terms) from the session sources, it may produce an almost unusable recording of the session. The second type of heterogeneity is in the subscribed reception quality. In order to satisfy the network and processing requirements of a variety of receivers, a source may send its data stream in a hierarchically encoded, layered format [MJV96]. As illustrated in Figure 1, receivers subscribe to as many layers as can fit in the available bandwidth. In this way, the source may generate high-quality, high-bandwidth layers that are viewed by nearby local receivers, while receivers across congested links may only subscribe to the lower-quality, lower-rate layers. A recorder which is not local to some sources may be missing a large percentage of the available session data. Another approach to dealing with both congested links and the differing decoding abilities of receivers is to install media gateways [AMZ95], mixers, and transcoders in the network. These may transform sessions into different transmission formats, or may rate-limit sessions. This type of transformation is described by the third type of intra-session heterogeneity, reception and transmission format. A session that contains transcoders may contain sub-sessions where data is transmitted in a format incompatible with the wider session. The transcoders join these sub-sessions into a single virtual session.

¹ Note that this is not an issue for SRM and other reliable multicast protocols, since these protocols are designed so that all packets will eventually reach all receivers.
Fig. 1. Layered multicast session with 2 sources separated by a bottleneck. Each source sends 3 layers locally, but there is no single point in the session where a recorder could receive all 6 possible layers.
The location of recorders with respect to these gateways may have a large influence on the format and quality of the recorded session. Because of these various types of sender and receiver heterogeneity, any single session recorder will have the unique viewpoint of the place in the network where the recording is taking place. When seeking an archival copy of a session, the question is which viewpoint produces the best recording. In a session with a single source, the most complete session viewpoint would be the one closest to the source. However, in a session with multiple sources, there may be no single place in the network that has a high-quality viewpoint of all the sources. Therefore, instead of choosing a single viewpoint for an archived session, we would prefer to combine the data from multiple viewpoints into a recorded session with no missing pieces. These individual recorded viewpoints may be streamed on a separate session to an archive system that collates the data into a single high-quality representation. High-quality archived sessions allow playback systems more flexibility in replaying stored sessions. During the interactive session, certain sacrifices for global congestion control may have been necessary, but during playback a different set of tradeoffs exists. Viewers may trade a longer playout buffer and playback latency for a higher-quality representation. Some playback viewers may only be interested in a subset of the original participant group, but at the highest data rate possible for that subset. High-quality representations are also very useful at the archive for post-processing algorithms such as image or voice recognition for automated indexing and annotation services. Distributed recording as a solution to the problem of obtaining perfect-quality recordings has also been described by Lambrinos et al. [LKH99]. Their paper motivates the need for distributed recording both as a way to achieve higher-quality archive copies of sessions, and as a means of gathering delay information from various receivers. This delay information can be used in replaying the session from a variety of "perception points". Other possibilities for improving the quality of recorded sessions include using forward error correction (FEC) or retransmission schemes such as resilient or reliable multicast [XMZY97,LPA98]. These can improve the received reception quality of recorders by limiting the number of transmission errors, but they cannot break out of the wide-area session view by providing local enhancement layers or data from behind a gateway. In addition, real-time participants may not wish to have as many retransmissions as archives may require, because interactivity in sessions requires limits on the latency between transmission and display. If a packet arrives at the site of a session participant after the audio or video frame has been played, it is useless, except for the case of archiving. A separate session for archive data gathering allows us to relax the real-time latency constraints and stream data in a slower, TCP-friendly way through bottleneck links, without forcing all of the interactive session participants to receive the data enhancements. In order to achieve high-quality recordings of sessions, we have designed and implemented a system that uses recording caches as data caches, supplying data to archive systems, which collect this enhancement data on a session separate from the original interactive session. In the next section of the paper, we describe in more detail the components and protocols of the system, and the design decisions that inform our final design. In Section 3 we describe the details of the protocol for collecting recorded data at participating archives. Section 4 describes our implementation and experience with the system. Section 5 describes our ongoing and future work in improving and extending the system, and our conclusions are given in Section 6.
2 System Design

When implementing a system that will be large-scale and distributed, there are a number of important design considerations. In common with other large-scale, wide-area systems, the system must be scalable, fault-tolerant, TCP-friendly, and able to support heterogeneity in participants and network conditions. These goals have been achieved in a number of routing and application layer protocols through the design principles of light-weight sessions [McC98,Jac94]. Light-weight sessions use shared multicast control rather than centralized control, and soft state rather than hard state. Soft state [Cla88,RM99a] describes a protocol style which uses unreliable transmission of periodic state messages to achieve eventual consistency. State which is not refreshed by a periodic message eventually expires. Since messages are periodic, state is rebuilt automatically if a participant loses state due to a crash, or enters the session late. In contrast, hard state is sent reliably, and the default is that state is only established once.
Functionality                                      Archive Systems   Recording Caches
Provides user interface for scheduling recording   Yes               No
Records sessions                                   Yes               Yes
Stores sessions                                    Long term         Short term
Provides streaming session playback                Yes               No
Responds to requests for cached packets            Yes               Yes

Fig. 2. Functionality of Archive Systems and Recording Caches
Separate procedures must be used to rebuild state after a crash or for a new participant. We apply the light-weight sessions model of protocol design to many of the components of our recording system, ensuring that it will co-exist smoothly with other MBone protocols. Whenever possible, we use soft state rather than hard server state. We use distributed algorithms rather than centralized server-side algorithms, so that there is no single point of failure. We use receiver-driven protocols, to allow for heterogeneity of receiver participation interest. In addition to the general goals of scalability, efficiency and fault-tolerance, we have several more specific goals for a distributed recording system. Our first goal is that the live session quality should never be degraded by recording operations or retransmissions; to meet this goal we must include provisions for congestion control in our data transmissions. Our second goal is that the recorded session content should be accessible as soon as possible, even during the on-going live session; accordingly, we must consider latency in our design. Our third goal is that the system should be able to work with a variety of recorder, archive and collaboration tools; to this end, the system should be componentized, rather than monolithic, with clear protocols for interaction between components. We must also consider that archive administrators and end-users may have heterogeneous interests, both in the types of sessions to be recorded and in the desired quality levels of the resulting archival copies. Consequently, the protocols of the system must allow for application-defined quality levels. In the context of these goals and general system design principles, we have framed our system and designed the protocols and components that implement it. In the next section, we give a high-level overview of the design. In the following sections, we go into more detail and describe the rationale for the various design decisions.
2.1 Design Summary
As stated earlier, the distributed recording system consists of two cooperating components, archive systems and recording caches. Figure 2 lists the differing responsibilities of these two components. Archive systems are fairly large, statically placed systems that provide playback and recording services to a number of users. Recording caches are smaller infrastructure services which do not provide a user interface or long-term storage, but that perform recording upon
Fig. 3. An example interactive session showing multiple sources, an archive system, and several recording caches. Recording caches temporarily store data from sources that are not otherwise adequately recorded by archives, because of bottleneck links or local-only data transmissions. On a separate session, data is retrieved from the caches and stored permanently at the archive system.
request, store packets temporarily, and answer requests from archive systems for specific packets. Note that both recording caches and archive systems perform recordings, and both may answer requests for packets. Our design allows archive systems to be run independently, with a variety of implementations and specializations, similar to the current diversity of web servers on the Internet. Because archives act independently, there may be several archive systems near each other, and no archive systems in other areas of the session. In order to achieve a high-quality recording, at least one recording agent (either an archive system or a recording cache) must be "close" to each of the session sources. If no archive system is close to a source, then a recording cache may be placed near the source to provide coverage. A single recording cache may take responsibility for multiple sources if the sources are located in the same local area. The system uses these caches to provide enhancement data to archive systems. The archive systems individually record as much of the session as they can receive, consistent with congestion control algorithms. On a separate archive collection session, archive systems request missing data packets. Responses may come from other archive systems, or from recording caches. Using these responses, archives build a complete, high-quality copy of the session. Figure 3 shows an example session containing multiple sources, a single archive, and several recording caches.

2.2 Recorders
In our system, recording caches and archive systems provide the data needed to construct high-quality recordings. An alternate design would be to require
sources to keep a log of packets so that archives could request missing packets or layers from sources. This has the advantage of simplicity, in that each source is responsible only for its own packets, and no packet need ever be lost. However, it has several disadvantages. Due to heterogeneity, some sources may not have sufficient resources to maintain a complete packet log. They may be running on a disk-limited or processor-limited platform, and requiring the source to keep a packet log may degrade the quality of the original session. Another limitation is that it may be inefficient to require each source to keep a separate log. There may be several sources on the same local network participating in the same session; any one source on this network could keep packets for all local sources, simplifying the recovery process. Finally, relying on sources to be loggers and responders is not fault-tolerant, since a source could disappear from the session, causing all of its packets to be lost from the archives. To solve these problems, our design allows sources to transmit normally and uses nearby recording agents (either caches or archives) to store data from sources and provide that data to archive systems. In essence, these recorders act as proxy responders on behalf of sources. In this way, we can also take advantage of the mechanisms for ensuring the reliability and scalability of proxy services, including cluster-based platforms and the automatic restart of services that have crashed. We believe that proxy platforms may be provided by ISPs to run a number of services on behalf of users, such as web content extractors [FGC+97], Internet shopping brokers [GAE98], and video stream transcoders [AMK98,AMZ95]. Another advantage of placing recorders on proxy platforms is that it allows recorded session data to be used by other services running at the platform. For example, an instant replay service might provide quick replay in the local area, while an indexing service might produce an on-the-fly summary of the session. For the implementation of our system, described in Section 4, we use recording caches implemented on top of clusters of computers controlled by middleware that provides fault tolerance and workload balancing. Although the rest of this paper uses the term recording cache, the recording and responding process could be located at the source instead. But for maximum scalability, fault-tolerance and heterogeneity, caches implemented on cluster-based middleware platforms are recommended.

2.3 Control Protocol
In order to take advantage of the scalability and fault tolerance possible when placing recording caches on proxy platforms, the control and data transmission protocols for the distributed recording service must be designed appropriately. In particular, using the automatic fault-recovery provided by middleware proxy platforms can be very difficult if a hard state control protocol is used [FGC+97,SRC+98]. If only soft state is used, then failure recovery is almost automatic, because it does not require any special case in the protocol. The next soft state refresh message to arrive after the system fault can be used by the middleware platform to restart the necessary failed component. In a hard state
system, some central agent must store the state necessary for restarting the failed agent. This makes the middleware system less extensible, since the addition of each new service requires changes to the central agent. It may be more intuitive to consider recording to be a hard state service, where a recording request is initiated for a certain period of time (the announced length of the session) and runs without further control input for that length of time. However, the small extra bandwidth cost of the periodic announce/listen control messages used in a soft state protocol is dwarfed by the bandwidth required to transmit data for any sort of multimedia session. The remaining problem with using a soft state protocol is ensuring that the failure semantics are designed correctly: if the agents controlling the recording cache fail, the recording cache should also gracefully close. In the next section, we describe how this is achieved in our system (a minimal sketch of the soft-state pattern is given below). Another consideration is that soft state control protocols can be more scalable than hard state protocols, since they allow multiple clients to share the control of a single agent without increasing the control bandwidth. Because of this enhanced scalability and the improved fault-tolerance, we use soft state for initiation and control of distributed recorders.
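The following is a minimal sketch of the announce/listen soft-state pattern as it could apply to a recording cache; the timer value and all names are our assumptions, not the system's actual interface.

```python
import time

class SoftStateRecorder:
    """Recording state lives only as long as refresh messages keep arriving."""

    def __init__(self, expiry_s=30.0):
        self.expiry_s = expiry_s
        self.active = {}             # session id -> time of last refresh

    def on_refresh(self, session_id):
        # A periodic announce both starts and maintains a recording; the
        # same message re-creates state on a restarted replacement agent,
        # so no special crash-recovery path is needed.
        self.active[session_id] = time.monotonic()

    def sweep(self):
        now = time.monotonic()
        for sid, last in list(self.active.items()):
            if now - last > self.expiry_s:
                del self.active[sid]   # controller gone: close this recording
```

If the controlling agents fail or are partitioned away, the refreshes stop and the recording simply expires, which gives exactly the failure semantics described above.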
2.4 Control Chain
As we have described, archive systems accept recording requests from users, but recording caches must be initiated and controlled by other agents of the system. In order to initiate a recording cache, an agent must have some knowledge of where in the session recording coverage is required. One possible design is for a controlling archive system to monitor the interactive session, initiating recording caches in areas of poor archive coverage, and removing recording caches when they are no longer necessary. However, there are several problems with this design. First, this centralized control hampers fault-tolerance by introducing a single point of failure into the system. Centralized control also limits the system's ability to respond to heterogeneity, because the central archive may not possess the information necessary to correctly place recording caches. For example, only local participants may be aware of transcoders or local-only enhancement layers. In addition, centralized control of the recording caches stretches control channels over long distances and lossy links. This makes control more difficult and more susceptible to failure in case of a temporary network partition. Source control of recording caches has the correct failure semantics: if a network partition occurs between the source and the recording cache it is controlling, then the cache is not able to perform its recording duty and should be suspended. However, if there is a partition between the source and the recording cache, but not between the recording cache and the collecting archive, then the recording cache may still need to participate in the data recovery session, responding to requests for packets. We solve this problem by splitting the functionality of recording caches into recording and responding. The recording agent stores packets to the shared disk; the responding agent retrieves packets from disk and sends them to collecting archives. If the responding agent is under
separate control, by the collecting archives, then the failure semantics for both agents will be correct. To summarize, our solution is to design the system so that sources select and control their own recording caches. While this may cause some temporary duplication of effort, it allows us to manage heterogeneity, system partitions and errors. Responders are selected and controlled by collecting archives.
2.5 Data Collection Responsibilities
Given the correct coverage and control of recording caches in a session, it is next necessary to collect and combine the recorded data in such a way as to produce the high-quality session copy. Again, we are confronted with the choice between centralized and decentralized algorithms. As an example, in a centralized algorithm, the cache for each source would slowly stream its source's data to a centralized collection point, where the data would be combined and stored. However, if each cache is responsible only for a specific source or sources, the loss of a recording cache due to hardware or software faults will cause severe damage to the final session recording. Another negative point is that the entire data stream for each source may not be necessary at the archive. If the archive joins the original session, it will receive the baseline data along with the other real-time, interactive participants. This has the advantage of allowing the archive to support playback during the session (although it may not be high-quality). Since the baseline data has already been received and stored, only the enhancement data will need to be streamed to the archive from the caches. Since the caches will not know what data the archive is missing, it will be necessary for the archive to request the data it requires. As we described in the overview, we also want to support multiple archives recording the same session. One possible solution is for one archive to build the high-quality copy of the session and then to transfer this copy in bulk to all other interested archive sites. This has the advantage of simplicity but the disadvantage of centralization. That is, it is less fault-tolerant, has a longer latency before the other participating archives can make the data available to their users, and has less support for heterogeneous session interest on the part of archives. Some archives may not wish to collect a high-quality copy of all of the streams of a session. They may be interested in only portions of the session, or may want high-quality representations of some participants but not others. These archives could filter out unwanted data after it has been transmitted to them, but this is an inefficient use of the network. In this section, we have laid out the framework of our distributed recording system design, which allows a completely decentralized system with multiple sources, multiple recorders, and multiple collecting archives. In the next section, we describe the data collection process in more detail.
3 Data Collection Protocol
3.1 Single Archive, Single Cache

To begin the discussion of the data collection protocol, we consider a simple protocol in which one archive system collects data from one recording cache. The archive uses a reliable request-response protocol to retrieve data from the cache. To begin, the archive needs to know what data it is missing and what data it wishes to collect. The archive will be able to calculate some of the data it is missing by looking for holes in the sequence number space of RTP packets. However, without input from the recording cache, it will not know whether it is also missing data from the beginning of the stream, since RTP streams begin with a random sequence number. The archive will also not be able to detect tail losses. Also, since locally transmitted data may not be advertised outside of the local scope, the archive may not know that it is missing entire layers or streams that the recorder has cached. Once the archive knows all of the data that is available for collection, it still needs some application-level information about the data in order to decide whether it wishes to collect that data. Archives may have different policies about how much data should be collected and stored. For example, an archive may want to gather the highest quality audio data possible, but just use best-effort for video data. Or, an archive may have a very specific mandate to record only certain sources in a session, or a certain time-slot of a long session. For this reason, when notifying the archive about data available at the recording cache, it is important to use application-level naming information, not just sequence number extents, so that the archive can decide which data is necessary. To begin the data collection process, each recording cache produces a namespace of the data available at that cache. This namespace includes information that uniquely identifies each individual stream and layer of data in the session. To identify streams and layers of data, we use the original session transmission addresses, along with source identification information from the RTCP protocol. Beginning and ending sequence numbers and timestamps for each stream are also included. The archive builds this same namespace for the data it has already recorded, and compares namespaces to find missing streams or missing sequence number spaces.
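The comparison itself is mechanical once both sides describe their holdings the same way. The following Python sketch is not the authors' code; the type names and the flat per-stream sequence sets are illustrative assumptions, and it ignores RTP sequence-number wrap:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class StreamKey:
        cname: str       # RTCP canonical name of the source
        addr: str        # original multicast transmission address

    @dataclass
    class StreamExtent:
        first_seq: int   # first RTP sequence number held
        last_seq: int    # last RTP sequence number held
        have: set        # sequence numbers actually present

    def missing_data(cache_ns, archive_ns):
        """Return {StreamKey: sorted missing seqnos} the archive may request."""
        wanted = {}
        for key, ext in cache_ns.items():
            full = set(range(ext.first_seq, ext.last_seq + 1))
            mine = archive_ns.get(key)
            if mine is None:
                gaps = sorted(full)              # whole stream unknown to the
            else:                                # archive, e.g. a local-only layer
                gaps = sorted(full - mine.have)  # holes, head and tail losses
            if gaps:
                wanted[key] = gaps
        return wanted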
3.2 Multiple Archives, Multiple Caches

In the previous section we specified that we want to design the system so that multiple archives can collect data from multiple recording caches. It would be possible to have multiple archives individually contact the responders and request their missing data. However, there will be many cases where the archives' requests overlap. For example, all archives might need a locally-transmitted layer that is only present at a single recording cache. For this reason, we would like to use multicast rather than unicast to perform data collection. We can also use multicast to transmit the namespaces of participants.
We need to use a reliable multicast protocol to transmit data namespaces to collection participants, since missing namespaces can impact protocol correctness. We could send data enhancement packets with either a reliable or unreliable protocol. Data enhancement does not necessarily need to be sent reliably to all participants, since RTP can tolerate certain levels of loss. However, some archives may want to devote resources to receiving every packet possible, so we feel that a reliable multicast protocol, with some latitude for receiver input into whether retransmissions are necessary, is the best choice. There are several reliable multicast protocols that we could use for our namespace and data transmission. Since we want receivers to choose whether to receive retransmissions, we feel that a NACK-based scheme like SRM [FJM+95] is a better fit than an ACK-based scheme like RMTP [LP96]. In addition, SRM uses the principle of Application Level Framing, which allows application control of protocol features wherever possible. In SRM, applications decide whether to NACK packets or ignore the packet loss. For these reasons, we have chosen SRM as the reliable multicast protocol for our data collection algorithm. However, we do not use SRM simply as a replacement for TCP, to transmit packet requests and responses reliably. Instead, we take advantage of the data recovery features already present in SRM and cast each enhancement packet request as an SRM retransmission request. In SRM, upon receiving a retransmission request, agents which have the requested data set a timer based on how far they are from the requesting agent. In this way the closest responder should generally provide the missing data. Through this mechanism we have multiple archives and recording caches sharing the responsibility of providing data, and multiple archives benefiting from the retransmission requests of other archives. In order to use the SRM recovery protocol, a globally unique namespace must be established. A requested packet must have the same name at all responders so that it can be requested by a single global message. We could map this namespace onto a flat sequence number space for SRM retransmission requests and replies, but this would be very difficult. Participants would need to coordinate their assignment of sequence numbers to data streams so that a packet available from numerous recording caches or archive systems would have the same sequence number at each source. Instead of a flat sequence number space, we would like to use a hierarchical space, with separate sequence number spaces for individual streams, so that we can reuse the original RTP sequence numbers. Figure 4 shows our hierarchical naming scheme. We use source naming information from the RTP protocol, which is globally available to all collection session participants. Each source may have transmitted multiple streams of different media types and/or multiple layers. These are identified by the multicast address they were transmitted on. Finally, individual data containers are identified with the starting timestamp from the first packet in that data container. Individual packets are identified by the RTP sequence number of the interactive session. Multiple data containers may be used for a very long session where sequence numbers wrap. Using this namespace, participants can create a globally correct naming scheme that uniquely identifies each packet.
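For concreteness, the repair behavior described above can be sketched as follows. This is a hedged illustration of SRM-style request/repair timers, not the libsrm implementation; the timer constants are assumptions:

    import random

    def repair_timer(distance, c1=1.0, c2=1.0):
        """Backoff before answering, proportional to distance from the requester."""
        return c1 * distance + random.uniform(0, c2 * distance)

    def on_request(name, have_data, distance, pending):
        """Schedule a repair if we hold the data and none is already pending."""
        if have_data and name not in pending:
            pending[name] = repair_timer(distance)

    def on_repair_heard(name, pending):
        """Someone closer already answered: suppress our scheduled repair."""
        pending.pop(name, None)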
[Figure: a four-level naming tree. Root → Source ID (RTP cname) → Stream ID (RTP transmission address) → Data container (starting time).]

Fig. 4. Global data naming using a hierarchy. Streams are uniquely identified by source ID and stream transmission address. Data packets are identified by RTP sequence number and a container starting time, to account for sequence number wrap.
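A minimal sketch of such a name, assuming the fields of Fig. 4 are simply concatenated into a tuple (the cname and address values below are made up):

    def packet_name(cname, addr, container_start, seq):
        # (source, stream/layer, data container, packet), per Fig. 4
        return (cname, addr, container_start, seq)

    # Sequence numbers wrap in long sessions; the container starting time
    # disambiguates two packets carrying the same RTP sequence number.
    a = packet_name("alice@host.edu", "224.2.0.1/5004", 1000, 7)
    b = packet_name("alice@host.edu", "224.2.0.1/5004", 987654, 7)
    assert a != b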
The SNAP protocol (Scalable Naming and Announcement Protocol) [RM98], implemented in the libsrm framework [RC], allows transmitted data in an SRM session to be named hierarchically with application-generated names and sequence number spaces. SNAP provides all the functionality required to transmit this namespace reliably in a compressed format. It provides periodic refreshes of portions of the namespace, including tail sequence numbers for all containers.
4 Implementation

The distributed recording system we have described is composed of heavyweight archive systems, lightweight recording caches, and the collaboration tools used by senders and receivers. Figure 5 shows these components and the protocols which are necessary for communicating between components. The recording agent and responding agent are lightweight agents which should be run on a computing cluster, with middleware to provide load balancing and fault recovery. To take advantage of these features, the control protocols for these agents should be soft-state, announce/listen protocols. Using a soft-state, announce/listen control protocol, clients must continue to send periodic keep-alive messages through the life of the service agent. If a service agent fails, from a hardware or software fault, the next keep-alive message will cause the agent to be restarted. In this style of protocol, message ordering is not important, and each message must contain the complete set of data to allow the agent to be restarted after a fault.
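A rough sketch of this control style, under assumed refresh and timeout periods (the Agent record and function names are illustrative, not the AS1 API):

    from dataclasses import dataclass

    REFRESH_S = 30    # assumed announcement period
    TIMEOUT_S = 90    # soft state expires without refreshes

    @dataclass
    class Agent:
        request: dict     # complete recording request, rebuilt from any message
        last_seen: float

    def on_announce(agents, session_id, request, now):
        """Cluster side: any fresh keep-alive (re)starts or refreshes the agent."""
        a = agents.get(session_id)
        if a is None or now - a.last_seen > TIMEOUT_S:
            agents[session_id] = Agent(request, now)   # restart after a fault
        else:
            a.request, a.last_seen = request, now      # refresh complete state

    def expire(agents, now):
        """Drop agents whose controlling clients stopped announcing."""
        dead = [s for s, a in agents.items() if now - a.last_seen > TIMEOUT_S]
        for s in dead:
            del agents[s]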
[Figure: the source controls the recording agent and the archive system controls the responding agent over soft-state control channels; data flows from the source to the recording agent as RTP, and from the responding agent to the archive system as SRM/RTP.]

Fig. 5. Components and protocols of the distributed recording system
As described in Section 2.4, the recording agents in the system are controlled by the nearby source or sources which require additional recording coverage. In order to operate, the recorder needs information about the addresses and media types of the session to be recorded. In addition, the recorder needs naming information about the session, so that it can be properly labeled in storage and available to the responder. In our protocol, the session's SDP announcement is used to provide this information. The SDP announcement contains all of the necessary session addresses that the recorder needs to monitor. Local-only layers are also advertised in the local SDP announcement, so no separate mechanism is required to achieve individualized recorder initiation. Each source will automatically instantiate its recorder to record all the data that the source is aware of. Changes to the SDP announcement, such as the addition of a new layer or media type, are automatically forwarded to the recording agent, since the message is soft-state and periodic. The response message from the recording agent to the controlling sources is simpler. Sources merely need to know that a recording is taking place, so a simple acknowledgment message is sent. Eventually, we may add quality report information, so that sources can decide whether to move their recording agent to a different platform. Unlike the recording agents, the responding agents are controlled by the archive servers which are collecting data. We would like archive servers which are requesting data from the same session to be served by the same agent. Therefore, the control message needs to contain a field which will distinguish among responders for different sessions. The obvious choice is again to use the session identification information from the SDP announcement. In fact, this identification and an address on which the responder will join the SRM data recovery session is all that is needed in the responder initiation and control protocol. Fine-grained packet requests take place on the separate SRM channel. We did not create any new data protocols for the distributed recording system. As shown in Figure 5, we use unmodified RTP for the original data session, and SRM carrying RTP packets for the data recovery session. We chose the libsrm implementation of SRM, which also includes the SNAP protocol, for the implementation of the data recovery protocol between archive servers and response agents. Overall, we have found libsrm to be very helpful in providing the correct level of abstraction to the system. However, the libsrm library is still undergoing development and is not completely tuned to provide the best possible data retransmission. We have implemented the recording and response agents using the AS1 Active Service framework as the middleware system that allows the agents to be run scalably and reliably.
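A hedged sketch of the responder control message implied above; the field names are assumptions, and only the SDP session identification plus the SRM recovery address are carried:

    def responder_init_msg(sdp_session_id, srm_group):
        """Complete soft-state control message for a responding agent."""
        return {
            "session": sdp_session_id,   # e.g. the SDP origin ("o=") line
            "srm_group": srm_group,      # address of the SRM recovery session
        }

    msg = responder_init_msg(
        "alice 2890844526 2890842807 IN IP4 126.16.64.4",
        "224.2.1.1/6000",
    )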
[Figure: two recovery-session configurations, (a) a single retransmission channel shared by both archives behind the recorder, and (b) archives behind different bottlenecks, each collecting from the recorder separately.]

Fig. 6. Two session configurations. In Figure (a), the archives would benefit from sharing a single multicast channel for retransmitted data. In Figure (b), the archives are across different bottlenecks and so are collecting disjoint sets of data.
The AS1 Active Service framework implements a service platform that is a cluster of computers providing load-balancing and automatic restart for agents. The agents themselves are implemented using the MASH multimedia toolkit [MBKea97], a set of composable multimedia networking and application objects. The archive server we use in our system is the MASH Pathfinder [CSb]. Through a web interface, Pathfinder allows users to view current session announcements (using the SDP/SAP protocol [HJ97]), join a live MBone session, request that a session be recorded, and play various sessions. Recorded sessions are immediately made available for playback. For this system we made minimal changes to Pathfinder, adding an agent to perform data collection on sessions being recorded. The playback agent has been described in a previous paper [SRC+98]. We have been using the recording and archive server objects for some time, but are still gathering experience with the responding and collection objects. Using the AS1 middleware has been very helpful in simplifying the object implementation requirements, since the fault recovery code does not have to be re-written for each object.
5 Ongoing Work
Using a protocol like SRM for our data collection algorithm only works well if all of the collecting archives have similar data needs, because SRM uses global retransmission of data. However, if archives have divergent needs, then using a single global channel for retransmission may be very wasteful. Figure 6 shows examples of recovery sessions where shared recovery channels would and would not be beneficial. If collecting archives are across different bottleneck links from the source and recorder, then they will need different sets of data. If they have correlated losses, or need the same set of local-only data, then they would benefit from sharing a multicast channel. One solution to this problem is to construct a hierarchy of participants that allows retransmissions to be sent only to the portion of the tree that requires the data. This solution, adding local recovery to SRM, has been proposed by many researchers, but is not yet feasible. The problem with building these retransmission trees is that there is no agreed way to build them without introducing new functionality into routers. Some schemes acknowledge that, currently, administrator help is required to build trees of responders [LP96].
Other schemes use an expanding ring search based on TTL [YGS95], use short experiments to measure link delay [XMZY97], or use multicast IGMP trace packets to locate responders and receivers relative to each other [LPGLA98]. Although this is a difficult problem, and is currently the subject of much research, we believe that it can be solved for this particular application domain because of the application-level knowledge that is available and because the problem is somewhat more limited than in the fully general reliable multicast domain. The application-level knowledge that is available for building a tree of participants is twofold. First, we have the data from the original, interactive session that indicates which participants were able to subscribe to various local-only layers or pre-transcoded data. Second, we have the data from the interactive session indicating which participants lost packets due to congestion. Since archive applications have less onerous latency requirements than interactive applications, we may be able to use more history, based on the original session, than other reliable multicast applications typically have access to. In essence, we have a longer time available for bootstrapping. The Group Formation Protocol [RM99b,RM99c] uses receiver-generated lossprints that enumerate the packets lost at that receiver. These lossprints are used to group receivers who are behind the same bottleneck. We are working on using this protocol to organize collection session participants into sub-groups so that requests and responses only go to necessary participants.
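One plausible way to group by lossprints (a sketch, not the Group Formation Protocol itself; the set-overlap measure and threshold are assumptions):

    def similarity(a, b):
        """Jaccard overlap of two lossprints (sets of lost packet names)."""
        return len(a & b) / len(a | b) if (a or b) else 1.0

    def group_receivers(lossprints, threshold=0.8):
        """Greedily merge receivers whose lossprints overlap strongly."""
        groups = []
        for rid, lp in lossprints.items():
            for g in groups:
                if similarity(lp, g["print"]) >= threshold:
                    g["members"].append(rid)   # likely behind the same bottleneck
                    g["print"] |= lp
                    break
            else:
                groups.append({"members": [rid], "print": set(lp)})
        return groups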
6 Summary and Conclusions

We have described the inherent problems in recording a multi-source MBone session with only one recorder. Session-wide variations in reception quality, subscribed quality, and reception and transmission formats indicate that, in order to achieve the best possible recorded session, multiple cooperating recorders are necessary. To support this distributed recording system, we have introduced the recording cache, composed of a recording agent and a responding agent. The recording cache provides enhancement data to archive systems. Because the recording cache agents are initiated and controlled using soft-state protocols, the cache is amenable to implementation on cluster-based middleware, such as an active service platform, that provides a scalable, fault-tolerant implementation base. Because archives and caches record the entire session, users can begin viewing baseline-quality session playback with low latency, and individual components can be lost from the system without catastrophic results. Because the system uses a decentralized data collection protocol, the system supports heterogeneity in archives' desired recording quality, and has no single point of failure. We have presented a protocol where archives collect data from recording caches and other participating archives using SRM retransmission requests. Archives are able to uniquely identify missing packets and streams through a globally consistent hierarchical namespace that uses RTP stream identifications and sequence numbers. This namespace is reliably and efficiently transmitted through
the SNAP protocol. We have described our initial implementation of the archives and recording caches, using the MASH multimedia toolkit, the libsrm reliable multicast protocol framework and the AS1 active service platform implementation. We are using these implementations to explore new techniques for using packet loss information to form data collection subtrees, so that the data collection algorithm scales to a larger number of archive session participants.
7 Acknowledgments

Many thanks to Yatin Chawathe, Suchitra Raman, Drew Roselli, Helen Wang, Tina Wong, and the anonymous reviewers for their feedback and suggestions. This work was supported by DARPA contract N66001-96-C-8505, by the State of California under the MICRO program, and by NSF Contract CDA 94-01156. Angela Schuett is supported by a National Physical Science Consortium Fellowship.
References

[AA98] K. Almeroth and M. Ammar. The Interactive Multimedia Jukebox (IMJ): A New Paradigm for the On-Demand Delivery of Audio/Video. In Proceedings of the Seventh International World Wide Web Conference, April 1998.
[AMK98] Elan Amir, Steve McCanne, and Randy Katz. An Active Service Framework and its Application to Real-time Multimedia Transcoding. In Proceedings of SIGCOMM '98, September 1998.
[AMZ95] Elan Amir, Steve McCanne, and Hui Zhang. An Application Level Video Gateway. In Proceedings of ACM Multimedia '95, November 1995.
[Cla88] D. D. Clark. The Design Philosophy of the DARPA Internet Protocols. In Proceedings of SIGCOMM '88, Stanford, CA, August 1988. ACM.
[CR] Yatin Chawathe and Cynthia Romer. MASH Collaborator Documentation. http://mash.cs.berkeley.edu/mash/software/usage/collaboratorusage.html.
[CSa] Yatin Chawathe and Angela Schuett. MASH Archive Tools Documentation. http://mash.cs.berkeley.edu/mash/software/archive-usage.html.
[CSb] Yatin Chawathe and Angela Schuett. MASH Pathfinder Documentation. http://mash.cs.berkeley.edu/mash/software/usage/pathfinder.html.
[FGC+97] Armando Fox, Steven Gribble, Yatin Chawathe, Eric Brewer, and Paul Gauthier. Cluster-based Scalable Network Services. In Proceedings of SOSP '97, pages 78-91, St. Malo, France, October 1997.
[FJM+95] Sally Floyd, Van Jacobson, Steven McCanne, Ching-Gung Liu, and Lixia Zhang. A Reliable Multicast Framework for Light-weight Sessions and Application Level Framing. In Proceedings of SIGCOMM '95, Boston, MA, September 1995. Association for Computing Machinery.
[GAE98] Ramesh Govindan, Cengiz Alaettinoglu, and Deborah Estrin. A Framework for Active Distributed Services. Technical Report 98-669, Information Sciences Institute, University of Southern California, 1998.
[Han97] Mark Handley. An Examination of MBone Performance. Technical Report ISI/RR-97-450, USC/ISI, 1997.
[HJ97] Mark Handley and Van Jacobson. SDP: Session Description Protocol. Internet Draft, Internet Engineering Task Force, November 1997.
[Hol95] Wieland Holfelder. MBone VCR - Video Conference Recording on the MBone. In Proceedings of ACM Multimedia, 1995.
[Hol97] Wieland Holfelder. Interactive Remote Recording and Playback of Multicast Videoconferences. In Proceedings of the Fourth International Workshop on Interactive Distributed Multimedia Systems and Telecommunication Services (IDMS), 1997.
[Jac94] Van Jacobson. SIGCOMM '94 Tutorial: Multimedia Conferencing on the Internet, August 1994.
[JM] Van Jacobson and Steven McCanne. Visual Audio Tool. Lawrence Berkeley Laboratory. Software available at ftp://ftp.ee.lbl.gov/conferencing/vat.
[Kle94] Anders Klemets. The Design and Implementation of a Media on Demand System for WWW. In Proceedings of the First International Conference on WWW, Geneva, May 1994.
[LKH98] Lambros Lambrinos, Peter Kirstein, and Vicky Hardman. The Multicast Multimedia Conference Recorder. In Proceedings of the 7th International Conference on Computer Communications and Networks, October 1998.
[LKH99] Lambros Lambrinos, Peter Kirstein, and Vicky Hardman. Improving the Quality of Recorded MBone Sessions Using a Distributed Model. In Proceedings of the 6th International Workshop on Interactive Distributed Multimedia Systems and Telecommunication Services (IDMS), October 1999.
[LP96] John C. Lin and Sanjoy Paul. RMTP: A Reliable Multicast Transport Protocol. In Proceedings of IEEE Infocom '96, pages 1414-1424, San Francisco, CA, March 1996.
[LPA98] Xue Li, Sanjoy Paul, and Mostafa Ammar. Layered Video Multicast with Retransmissions (LVMR): Evaluation of Hierarchical Rate Control. In Proceedings of INFOCOM '98, March 1998.
[LPGLA98] B. N. Levine, S. Paul, and J. J. Garcia-Luna-Aceves. Organizing Multicast Receivers Deterministically According to Packet-Loss Correlation. In Proceedings of ACM Multimedia '98, September 1998.
[MBKea97] Steve McCanne, Eric Brewer, Randy Katz, Lawrence Rowe, et al. Toward a Common Infrastructure for Multimedia-Networking Middleware. In Proceedings of the Fifth International Workshop on Network and OS Support for Digital Audio and Video (NOSSDAV), May 1997.
[McC98] Steven McCanne. Scalable Multimedia Communication with Internet Multicast, Light-weight Sessions, and the MBone. Proceedings of the IEEE, 1998.
[MJ95] Steven McCanne and Van Jacobson. vic: A Flexible Framework for Packet Video. In Proceedings of ACM Multimedia '95, pages 511-522, San Francisco, CA, November 1995.
[MJV96] Steven McCanne, Van Jacobson, and Martin Vetterli. Receiver-driven Layered Multicast. In ACM SIGCOMM, Stanford, CA, August 1996.
[RC] Suchitra Raman and Yatin Chawathe. libsrm: A Generic Framework for Reliable Multicast Transport. http://www-mash.cs.berkeley.edu/mash/software/srm2.0/.
[RM98] Suchitra Raman and Steven McCanne. Scalable Data Naming for Application Level Framing in Reliable Multicast. In Proceedings of ACM Multimedia '98, 1998.
[RM99a] Suchitra Raman and Steven McCanne. A Model, Analysis, and Protocol Framework for Soft State-based Communication. In Proceedings of SIGCOMM '99, Cambridge, MA, September 1999.
[RM99b] Sylvia Ratnasamy and Steven McCanne. Inference of Multicast Routing Trees and Bottleneck Bandwidths Using End-to-end Measurements. In Proceedings of IEEE Infocom '99, New York, March 1999.
[RM99c] Sylvia Ratnasamy and Steven McCanne. Scaling End-to-end Multicast Transports with a Topologically-sensitive Group Formation Protocol. In Proceedings of the 7th International Conference on Network Protocols, November 1999.
[SCFJ96] Henning Schulzrinne, Steve Casner, R. Frederick, and Van Jacobson. RTP: A Transport Protocol for Real-Time Applications. Internet Engineering Task Force, Audio-Video Transport Working Group, January 1996. RFC 1889.
[Sch] Henning Schulzrinne. RTP Tools 1.6. http://www2.ncsu.edu/eos/service/ece/project/succeed_info/rtptools/rtptools-1.7/rtptools.html.
[Sch92] Henning Schulzrinne. Voice Communication Across the Internet: A Network Voice Terminal. Technical Report TR-92-50, University of Massachusetts, Amherst, 1992.
[SRC+98] Angela Schuett, Suchitra Raman, Yatin Chawathe, Steven McCanne, and Randy Katz. A Soft-state Protocol for Accessing Multimedia Archives. In Proceedings of the 8th International Workshop on Network and Operating Systems Support for Digital Audio and Video (NOSSDAV 98), Cambridge, UK, July 1998.
[XMZY97] X. Rex Xu, Andrew C. Myers, Hui Zhang, and Raj Yavatkar. Resilient Multicast Support for Continuous-media Applications. In Proceedings of NOSSDAV '97, 1997.
[YGS95] R. Yavatkar, J. Griffioen, and M. Sudan. A Reliable Dissemination Protocol for Interactive Collaborative Applications. In Proceedings of ACM Multimedia '95, San Francisco, CA, November 1995. Association for Computing Machinery.
[YKT96] Maya Yajnik, Jim Kurose, and Don Towsley. Packet Loss Correlation in the MBone Multicast Network. IEEE Global Internet Conference, 1996.
Reducing Replication of Data in a Layered Video Transcoder

Gianluca Iannaccone

Dipartimento di Ingegneria dell'Informazione: Elettronica, Informatica e Telecomunicazioni, Università degli Studi di Pisa, Via Diotisalvi 2, I-56126 Pisa, Italy
Abstract. In a previous work, we presented a Layered Video Transcoder (lvt) that transforms an input H.261 stream into a set of data streams suitable for multi-rate layered multicast transmission. lvt uses a very simple strategy for filtering data, and thus generates some amount of redundancy in the output streams. While this redundancy is useful to let receivers tolerate losses and change their subscription level without experiencing inconsistencies in the received video, reducing it would grant a better compression rate. Furthermore, the performance analysis we performed suggested possible improvements to lvt's filtering algorithms. In this paper, we explore a set of new strategies for filtering data, trading some additional complexity for better performance. In particular, we discuss the possibility of performing a second conditional replenishment step in the transcoder, and we implement macroblock classification policies that use the history of macroblock updates to recognize the "role" of a macroblock in the video signal (e.g. background or moving subject). Performance analysis shows that these new strategies significantly improve the video quality of the transcoder.
1 Introduction
In a multicast environment, receiver heterogeneity makes it difficult, indeed almost impossible, to use a single fixed data rate to satisfy all users interested in a given transmission. Recent research has proposed the use of layered schemes to distribute multimedia streams over the Internet, stimulating research on algorithms for layered video compression [4, 5, 7]. In this model, the source distributes multiple levels of quality of the same signal simultaneously across multiple network channels. If data are organized in a layered way, receivers may adapt their reception rate to bandwidth availability by choosing which layer, or how many layers, to receive. In a recent work [1], we presented a Layered Video Transcoder (lvt), namely a stand-alone application that can be connected to existing videoconference tools, such as VIC [3], to provide layered video streams. We focused on a transcoder approach, rather than designing a specialized layered encoder, because this provides an easy way to deploy adaptive and scalable multicast services using existing
video sources or recordings. Further, we focused on low-complexity transcoding schemes so that low-overhead transcoding services could easily be located either at the source or within the network. On the other hand, our design goals forced us to partially sacrifice performance at high data rates in favor of simplicity and robustness. The current implementation of lvt uses very simple strategies for filtering the data to be transmitted and thus generates a relatively high degree of data redundancy. The introduction of new strategies might reduce this redundancy, allowing a more effective use of bandwidth and granting better video quality, in particular at low data rates. These strategies have to define a much finer priority ordering among the macroblocks in the video signal. They could use the history of macroblock updates to recognize each macroblock's role, and thus its importance, inside the video signal, or could perform a second conditional replenishment step based on the video actually transmitted on each layer, rather than relying on the VIC conditional replenishment only. In the next section, we give a brief description of the Layered Video Transcoder (lvt), while in Section 3 we discuss the need for replication of data. Section 4 is dedicated to the analysis of the proposed new strategies for filtering data, while Section 5 shows the results of the performance analysis. Finally, concluding remarks are provided in Section 6.
2 Architecture of lvt
The Layered Video Transcoder acts as a filter between an H.261 video source, such as VIC, and H.261 receivers. On the source side, lvt intercepts data generated by VIC and distributes them over four layers. On the receiver side, the transcoder merges the received layers into a single data flow, which is then processed by VIC.

Compression standard. One of the video compression algorithms used by VIC is an adaptation of ITU-T H.261 [2], called Intra-H.261. This algorithm inherits the general structure of H.261 but excludes any function based on inter-frame information, like motion vectors. Intra-H.261 instead uses conditional replenishment to increase the compression ratio, thus avoiding the use of inter-frame information. Like H.261, Intra-H.261 splits the image into macroblocks (composed of four blocks of 8x8 pixels) and treats macroblocks separately from each other. Then, according to conditional replenishment, each macroblock is compared with the previously transmitted version of the same macroblock. If the differences between the two macroblocks are small (i.e. below a given threshold), then the new macroblock is not transmitted at all and is discarded. Conditional replenishment guarantees a robust and low-complexity compression, but these results are gained at the expense of the quality of the video signal: there will certainly be inconsistencies in the received image, and receivers can lose small motion artifacts.
However, since our focus is on videoconference transmission, we can relax many constraints on the quality of the video signal.
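The conditional replenishment test can be sketched in a few lines. This is an illustration of the idea, not VIC's code; the luminance-difference metric and threshold value are assumptions:

    def mb_distance(new_mb, ref_mb):
        """Sum of absolute luminance differences over a macroblock."""
        return sum(abs(a - b) for a, b in zip(new_mb, ref_mb))

    def replenish(frame_mbs, reference, threshold=400):
        """Yield (index, macroblock) pairs that need (re)transmission."""
        for i, mb in enumerate(frame_mbs):
            ref = reference.get(i)
            if ref is None or mb_distance(mb, ref) > threshold:
                reference[i] = mb    # remember the last transmitted copy
                yield i, mb          # below-threshold blocks are discarded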
Layered organization. As fully detailed in [1], lvt follows the scheme shown in Fig. 1 to rearrange data over the four layers. The label L means that the frame may be a reduced-quality version of the one coming from the source. The task of the fourth layer is to transmit the high-quality version of macroblocks sent in the lower layers. According to Fig. 1, in principle the first layer processes only one fourth of the frames transmitted by VIC, and is designed to use 1/8 of the nominal source bit rate. The other layers use a similar scheme, only working on different frames and target bit rates (respectively 1/8, 1/4 and 1/2 of the nominal source rate).
Fig. 1. Frames normally used on the various layers. L means that the block can be recompressed by lvt using a coarser quantization, while H means the block is used as received from the source.
In practice, due to the H.261 compression and conditional replenishment, the size of each frame is highly variable, so there is no guarantee that a layer can transmit all the frames it receives. Therefore, lvt introduces a bandwidth control policy as follows (a sketch of the policy is given after this list):
– if the frame to be transmitted exceeds the available bandwidth, the coder compresses low-priority macroblocks by using a coarser quantization;
– the priority of a macroblock is calculated as the distance, by luminance factor, between the new macroblock received and the last copy sent;
– if, after compression with the coarser quantization, the target data rate is still not reached, the frame is discarded and the credit is accumulated for the next frame.
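A hedged sketch of this policy, with invented frame and macroblock representations; the requantizer, the size function, and the half-and-half priority split are illustrative assumptions:

    def send_frame(mbs, rate_budget, credit, requantize, size_of):
        """mbs: list of (priority, macroblock), priority = luminance distance
        from the last copy sent. Returns (macroblocks_sent, new_credit)."""
        budget = rate_budget + credit
        mbs = sorted(mbs, key=lambda pm: -pm[0])        # high priority first
        if sum(size_of(mb) for _, mb in mbs) > budget:
            cut = len(mbs) // 2                         # illustrative split
            mbs = mbs[:cut] + [(p, requantize(mb)) for p, mb in mbs[cut:]]
        total = sum(size_of(mb) for _, mb in mbs)
        if total > budget:
            return [], budget           # discard frame, accumulate the credit
        return [mb for _, mb in mbs], budget - total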
3 Replication of Data
In lvt, there are two situations that force the transcoder to introduce replicated data in the layered stream. The first, which introduces a very low degree of redundancy, is a direct consequence of the organization of frames across the four layers. Indeed, the fourth layer provides information discarded on lower layers or the high-quality version of macroblocks sent on lower layers. Hence we experience a replication of data, since the frequency components sent in the low-quality macroblocks (e.g. the DC coefficient) are replicated in the high-quality macroblocks. We refer to this type of redundancy as "refinement redundancy". The second and most important cause of replication of data is shown in Fig. 2. Frame 2 is scheduled for transmission on layer 3, which transmits the update of block (x,y). Frame 3, scheduled for transmission on layer 2, does not include block (x,y) since it has not changed from frame 2. If block (x,y) is not replicated on layer 2, receivers that are not subscribed to layer 3 will miss the update, and will experience an inconsistent image. In the following we refer to this second case as "consistency redundancy".
Fig. 2. lvt records block updates and can retransmit a block on multiple layers to help keep the image consistent for receivers at all subscription levels.
As discussed in [1], replicated macroblocks represent about 25% of the total macroblocks sent across all layers. This redundancy does not permit lvt to fully exploit the conditional replenishment algorithm on the lower layers, thus yielding, for each subscription level, a video quality worse than that of a VIC source encoded at the same rate.
4 Proposed strategies
In the current implementation, lvt replicates all data without performing any computation on the replicated macroblocks. A different strategy could instead evaluate whether it is really worthwhile to retransmit every macroblock.
Any new strategy needs an algorithm capable of estimating the priority (i.e. the importance) of each macroblock in the image. The conditional replenishment algorithm is far from perfect when used in conjunction with the layered video transcoder, since high-priority macroblocks for the 256 Kbps VIC source might be very low-priority ones for the 32 Kbps first layer. Hence, finding low-priority macroblocks in the video stream could be a useful way to avoid the replication of data and the frame discards that strongly affect the performance of the transcoder [1]. In the implementation of new strategies, we impose the following constraints:
– complexity. The strategy must not require excessive processing since, as mentioned in Section 1, low complexity is a design goal of lvt. A measure of complexity is the number of operations made on each macroblock, i.e. decoding, encoding or calculation of differences.
– robustness. Some redundancy is useful to let receivers tolerate losses and change their subscription level without experiencing inconsistencies in the received video. Therefore, new strategies have to reach a trade-off between the need for increasing the compression rate and the need for granting robustness and high tolerance to lost or missing data.
– receiver side. Currently, lvt implements a very light-weight receiver process that merges the four-layer stream into a single stream to be forwarded to VIC. In implementing new strategies for replication of data, we aim to preserve the receiver's simplicity.
4.1 Conditional replenishment
The first strategy we explore is very simple. It consists of performing a second conditional replenishment step for each subscription level, based on the video actually transmitted at that level. This method is useful for finding those macroblocks, among those to be transmitted or replicated, which deliver a low quantity of information (i.e. whose differences are small). Further, using a different reference frame for each layer can prevent lvt from replicating macroblocks that revert to their initial status after being modified in a higher layer. To implement this strategy, the transcoder has to maintain a history of recent frames sent on each layer. The overhead required is limited to memory usage, while computational complexity is almost unaffected since lvt already computes the differences of each macroblock when reducing quality. Robustness is also unaffected, since we simply perform an additional step of the conditional replenishment algorithm inside the transcoder. Moreover, the receiver needs no modification at all.
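A minimal sketch of this second step, assuming per-layer reference frames keyed by macroblock index (mb_distance is the same illustrative metric as in the earlier sketch):

    def mb_distance(a, b):                   # same metric as the earlier sketch
        return sum(abs(x - y) for x, y in zip(a, b))

    def second_cr(candidates, layer_ref, threshold=400):
        """Drop macroblocks that barely differ from this layer's own reference."""
        out = []
        for i, mb in candidates:             # output of the first CR/replication
            ref = layer_ref.get(i)
            if ref is None or mb_distance(mb, ref) > threshold:
                layer_ref[i] = mb            # update this layer's reference
                out.append((i, mb))
        return out                           # the rest are silently dropped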
4.2 Background recognition
This strategy tries to find low-priority macroblocks among those received from the source, considering the history of macroblock updates. The history information
for each macroblock consists of the frequency of updates (i.e. the number of frames in which the macroblock exceeded the threshold) and the most recent frame that contained the macroblock. Therefore, if a macroblock has been rarely updated and its last update is old, we consider a new update low-priority, and the transcoder will forward the macroblock only if the higher-priority macroblocks do not need all the available bandwidth. This technique tries to distinguish background macroblocks from those concerning the moving subject, and eliminates small motion artifacts that can occur in the background (e.g. a moving shadow or small variations in light exposure). On the other hand, it is aware of entire scene changes (computed from the mean number of macroblocks per frame sent by VIC) and thus can avoid possible inconsistencies at scene changes. This strategy has very low complexity since it needs only a small amount of additional state per macroblock and the computations are very simple. Both robustness and the receiver process are substantially unaffected. On the other hand, video quality could suffer from a static background that could give a sense of slow motion to users.
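A sketch of such a classifier, with assumed thresholds for "rarely updated", "old", and scene-change detection:

    def low_priority_updates(history, frame_no, updated, mean_per_frame,
                             rare=0.05, old=30):
        """history: {mb_index: (update_count, last_update_frame)}."""
        scene_change = len(updated) > 3 * mean_per_frame   # illustrative factor
        low = set()
        for i in updated:
            count, last = history.get(i, (0, -old))
            rarely = count / max(frame_no, 1) < rare
            if not scene_change and rarely and frame_no - last > old:
                low.add(i)                     # likely background noise
            history[i] = (count + 1, frame_no)
        return low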
5 Performance Analysis
In order to evaluate the quality of the video signal achieved using the new strategies, we compared a set of four video streams derived from the same video source: 1) a full VIC video at 256 kbps, CIF resolution (352x288 pixels) and a peak frame rate of 30 fps; 2) the lvt video as described in [1]; 3) the lvt video with the conditional replenishment algorithm (CR); and 4) the lvt video with conditional replenishment and background recognition (CR&BR). Each video stream has been transmitted along links with different bottleneck bandwidths and then compared with the others using parameters such as the percentage of replicated macroblocks in the transmission, the mean Peak Signal-to-Noise Ratio (PSNR), and the mean frame rate. The bottlenecks were simulated using dummynet [6] with bandwidths of 32, 64, 128 and 256 kbps and a queue size of 40 packets. Figure 3 shows the percentage of replicated macroblocks for the first three layers. As we see from the graph, the new strategies significantly reduce the replication of data and improve bandwidth utilization on the lower layers. In Fig. 4, we show the PSNR of the VIC video and the lvt video with the different strategies. Due to the computational complexity involved in dealing with a video source at 256 Kbps, we computed the PSNR offline with a set of frames of the video signal sampled at constant time intervals. Even if this method is far from perfect (the sampling rate has a great impact on the PSNR calculation), we believe that it is a good measure for qualitatively comparing the proposed strategies. As shown in Fig. 4, the use of more complex strategies for discovering useless replication of data only grants a small improvement in image quality
Fig. 3. Average ratio between replicated macroblocks and macroblocks transmitted.
Fig. 4. Mean Peak Signal-to-Noise Ratio.
Fig. 5. Mean Frame Rate.
(approximately 3 dB), both at lower and higher layers, compared with the basic lvt video^1. On the other hand, as we see from Fig. 5, they greatly affect the frame rate of the video. Indeed, the mean frame rate achieved with the proposed strategies is approximately equal to the maximum frame rate allowed by the layered organization scheme (see Section 2). At first glance, the background recognition algorithm seems to be ineffective and superfluous. We believe that further investigation is needed to discover whether it is useless or, as we expect, its effectiveness depends upon the kind of video transmitted.
6 Conclusions and future work
In this paper, we have proposed two strategies for reducing the replication of data in the layered video transcoder. The performance analysis shows that lvt with the new strategies yields a video quality which scales well with available bandwidth and achieves almost the same quality as the source at the highest layer. There is still a need for better parameter tuning, both for the second conditional replenishment threshold and for the background recognition algorithm. Future work will focus on finding optimal parameters, but also on introducing new techniques for increasing the performance of lvt on the lower layers. A technique that is currently a work in progress involves the use of a set of future frames (i.e. successive frames transmitted by VIC) to better recognize the "role" of a macroblock in the frame.
References

1. G. Iannaccone, L. Rizzo, A Layered Video Transcoder for Videoconference Applications, Mosaico Research Report PI-DII/4/1999, Univ. di Pisa, July 1999.
2. ITU-T, Video codec for audiovisual services at p*64 kb/s, International Telecommunication Union Recommendation H.261, 1993.
3. S. McCanne and V. Jacobson, vic: a flexible framework for packet video, Proceedings of ACM Multimedia '95, November 1995.
4. S. McCanne, M. Vetterli and V. Jacobson, Receiver-driven Layered Multicast, ACM SIGCOMM '96, August 1996.
5. S. McCanne, M. Vetterli and V. Jacobson, Low-complexity Video Coding for Receiver-driven Layered Multicast, IEEE Journal on Selected Areas in Communications, vol. 16, no. 6, pp. 983-1001, August 1997.
6. L. Rizzo, Dummynet: a simple approach to the evaluation of network protocols, ACM Computer Communication Review, Vol. 27, n. 1, pp. 31-41, January 1997.
7. L. Vicisano, L. Rizzo and J. Crowcroft, TCP-like congestion control for layered multicast data transfer, IEEE INFOCOM '98, March 1998.
^1 It has to be noted that, even if the computed values may depend on the sampling rate, the proposed strategies have a minor impact on image quality.
Providing Interactive Functions through Active Client-Buffer Management in Partitioned Video Multicast VoD Systems

Zongming Fei^1, Mostafa H. Ammar^1, Ibrahim Kamel^2, and Sarit Mukherjee^2

^1 Networking and Telecommunications Group, College of Computing, Georgia Institute of Technology, Atlanta, GA 30332-0280, USA, {fei,ammar}@cc.gatech.edu
^2 Panasonic Information and Networking Technology Lab, Panasonic Technologies, Inc., Two Research Way, Princeton, NJ 08540, USA, {ibrahim,sarit}@research.panasonic.com
Abstract. Multicast delivery is an attractive approach to the provision of a video-on-demand service because it scales well to a very large number of clients. The problem is how to provide interactive functions to individual clients within the multicast framework without compromising the scalability of the multicast paradigm. In this paper, we propose an active buffer management scheme to provide interactive functions in partitioned video broadcast. Our scheme lets the client selectively prefetch segments from broadcast channels based on the observation of the play point in its local buffer. We introduce the concept of feasible points which can guarantee the continuity of playback after resuming normal play following VCR actions. Our simulations show that the active buffer management scheme can implement interactive actions through buffering with a very high probability in a wide range of user interaction levels.
1 Introduction
A Video-on-Demand (VoD) service provides subscribers with a set of videos and sends a specific video to customers upon request. Using multicast to send popular videos has been demonstrated by many researchers to be an efficient way for their delivery [1-6]. There are two basic approaches to providing multicast video-on-demand services. One is on-demand batching, in which the server allocates a channel to a group of pending clients requesting the same video and sends it over that channel. These systems are on-demand in the sense that the server allocates a channel to send a video only if there are requests sent over upstream channels by the clients requesting the video. The other approach is continuous broadcast VoD systems, in which the server allocates one or several channels to one video and each channel sends the whole video or a part of it in cycles. These systems are not on-demand because they keep broadcasting videos even if there are no requests for them. They do not need upstream channels and are most suitable for broadcasting popular videos. A simple form of this approach is the pay-per-view model (or staggered broadcast), in which several channels broadcast a video periodically with staggered start times. In such a system, the maximum startup latency is equal to the
length of the video divided by the number of channels allocated to the video. Pyramid broadcasting [7] divides a video into segments of exponentially increasing sizes and lets one channel broadcast each segment repeatedly. Segments are broadcast on the channels at a faster speed than the playback rate. The scheme greatly decreases the startup latency at the expense of requiring client buffering. Skyscraper broadcasting [8] modifies the distribution of segment sizes of the pyramid broadcasting scheme and keeps the broadcast speed the same as the playback rate. It also places an upper bound on the weight of the maximum segment size to reduce the storage required at the clients. The greedy disk-conserving broadcasting scheme [9] generalizes the skyscraper scheme to the case where each client can download from more than 2 channels at the same time. We call these schemes partitioned video broadcast and will describe them in more detail in Section 2.1. The multicast approach to video-on-demand satisfies the requests of several users at one time and thus, to some extent, sacrifices the special requirements of each individual user. The startup latency is a tradeoff of some degradation of service quality for scalability. The desirable interactive functions (sometimes called VCR functions) for a video-on-demand service are also difficult to provide in multicast systems. Some recent VoD work has begun to deal with this problem. The work by Almeroth and Ammar [3-5] incorporates interactive functions into a multicast delivery VoD system and introduces the concept of discontinuous VCR actions. The system uses client buffering to provide interactive functions and proposes that emergency interactive channels be used when the client buffer contents are not sufficient for the desired interaction. The SAM protocol proposed by Liao and Li [10] uses a synch-buffer and special interactive channels (called I-streams) to provide VCR functions. The work by Abram-Profeta and Shin [11] improves the SAM protocol by changing the shared synch buffers to separate buffers at each client, making it more scalable. Most of this work has addressed VCR functions in the environment of an on-demand batching video delivery system and depends on both client buffering and "interactive" channels to provide VCR functions. The client-side buffering technique can be extended to apply to the continuous broadcast model of video delivery (including partitioned broadcast), but interactive channels are not available in this model. Even if we had interactive channels, using them to provide VCR functions would not be preferable because it compromises the scalability of the multicast paradigm. As an alternative to interactive channels in a continuous broadcast system, interactive functions that cannot be handled through client buffering can be accomplished by switching broadcast streams and incurring a possible discontinuity. In this paper, we propose an active buffer management scheme for providing VCR functions in continuous partitioned video broadcast systems. It is a generalization of our previous work on staggered multicast near-VoD systems [12]. The active buffer management scheme uses client-side buffering in a novel fashion that relies on the simultaneous availability of "past", "present" and "future" parts of a video. It lets the client selectively prefetch segments from broadcast channels based on the observation of the play point in its local buffer. We let clients make the decision of what to prefetch and when to prefetch from broadcast
channels according to the current status of their buffers, because in such systems no backward channel is available from the client to the server^1. The challenging part is how to deal with the unequal segment sizes in the partitioned broadcast scheme. After a VCR action moves the play point to a new place in the video stream and we resume normal playback, we may face a discontinuity of playback, because the unequal segment sizes can prevent us from obtaining frames before their scheduled playback times. To deal with this problem, we introduce the concept of feasible points, which restricts a client to move only to those points within the broadcast stream that can preserve the continuity of playback. While the existing broadcasting schemes [8,9] focus on reducing the startup latency and client storage, they do not deal with the problem of providing interactive functions for clients. We design a new video partitioning scheme that is more suitable for the interactive behavior of clients. We perform extensive simulations to explore the effects of various parameters of our scheme at different user interaction levels. Our simulations show that our scheme can implement VCR actions through buffering with a high probability over a wide range of user interaction levels. The rest of this paper is structured as follows. In Section 2 we describe partitioned video broadcast schemes and summarize the existing work on using client buffering to provide VCR functions in video multicast. In Section 3 we propose an active client buffer management technique to provide VCR functions in partitioned video broadcast, and in Section 4 we give the rules for deciding feasible points. In Section 5 we perform an extensive evaluation of the performance of the proposed scheme. The paper is concluded in Section 6.
2 Background

2.1 Partitioned Video Broadcast
Partitioned video broadcast divides a video into segments and sends each segment over one channel in cycles. Assume the bandwidth allocated to a video is B Mbits/sec^2. The bandwidth is divided into equal-bandwidth channels, and the bandwidth of each channel is equal to the playback rate, b Mbits/sec. The number of channels is K = ⌊B/b⌋. A video of length L is divided into K segments and channel n periodically broadcasts segment n, where 1 ≤ n ≤ K. The relative weights of the segments distinguish one broadcasting scheme from another. Usually a function f(n) (1 ≤ n ≤ K) is given and the size of segment n is f(n) · L / Σ_{j=1}^{K} f(j). The maximum startup latency is (f(1) · L / b) / Σ_{j=1}^{K} f(j). The function f(n) is designed to minimize the startup latency. One way to achieve this is to make f(n) increase as fast as possible, subject to the continuity condition that guarantees the clients can download each segment of the video in a timely fashion with a given number of loaders in order to play back without jitter.

^1 If such a backward channel were available, then on-demand batching is a preferable option to continuous delivery of a static set of videos.
^2 How the server bandwidth is allocated among videos depends on the strategy of the server and is not considered here.
The function f(n) generates a series specifying the relative weights of the segments. The skyscraper broadcasting scheme [8] assumes that each client has two loaders (it can download from two channels at the same time as it plays back from the buffer); the series generated by its function f(n) is [1, 2, 2, 5, 5, 12, 12, 25, 25, ...]. There is an upper bound parameter W, which sets f(n) to W if f(n) > W. The greedy disk-conserving broadcasting scheme [9] generalizes the skyscraper scheme to the case in which each client can have more than 2 loaders. A series generated by a function f(n) with a 4b disk bandwidth requirement for each client is [1, 2, 4, 8, 14, 24, 40, 70, ...]^3.
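Under these definitions, segment sizes and the maximum startup latency follow directly; a small sketch (the example numbers are invented):

    def segment_sizes(series, L, W=None):
        """Sizes (in Mbits) of each segment for weight series f(n), bound W."""
        f = [min(x, W) for x in series] if W else list(series)
        total = sum(f)
        return [fi * L / total for fi in f]

    def max_startup_latency(series, L, b, W=None):
        """Duration of segment 1: f(1)*L / (b * sum of f(j))."""
        return segment_sizes(series, L, W)[0] / b

    sky = [1, 2, 2, 5, 5, 12, 12, 25, 25]      # skyscraper series from the text
    # e.g. a 3600-Mbit video at b = 1.5 Mbit/s over K = 9 channels:
    print(max_startup_latency(sky, 3600, 1.5)) # ~27 seconds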
2.2 VCR Actions
We consider a video as a continuous sequence of frames. The play point of a client is the point in the video which the client is currently playing out. The destination point of a VCR action is the point in the video when the client finishes the VCR action. A VCR action can be described as a pair (Δt, Δl), where Δt is the time it takes for the VCR action to complete and Δl is the relative position of the destination point with respect to the current play point (assume it is p). Obviously Δt ≥ 0 and the destination is p + Δl. The destination is in the forward direction if Δl > 0 and in the backward direction if Δl < 0. We assume that the part of the video (of length Δl) is evenly distributed over the time interval Δt. Assuming the VCR action is issued at the play point p and the normal playback rate is b, we consider the following VCR actions.
1. Jump Forward/Backward (JF/JB): Δt = 0, Δl ≠ 0.
2. Fast Forward/Backward (FF/FB): Δt > 0, |Δl/Δt| > b.
3. Slow Forward/Backward (SF/SB): Δt > 0, 0 < |Δl/Δt| < b.
4. Pause: Δt > 0, Δl = 0.
5. Play: Δt > 0, Δl/Δt = b; and Play Backward: Δt > 0, Δl/Δt = −b.
Note that Play is also described as a pair (Δt, Δl) because we can denote the time from the moment we resume normal play until the moment we issue the next VCR action as Δt and the video length played as Δl. It is interesting to note that VCR actions can be characterized by a single parameter z, defined as z = (Δl/Δt)/b.
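A worked classification by z (a sketch; the boundary cases follow the definitions above):

    def classify(dt, dl, b):
        if dt == 0:
            return "jump forward" if dl > 0 else "jump backward"
        z = (dl / dt) / b
        if z == 1:   return "play"
        if z == -1:  return "play backward"
        if z == 0:   return "pause"
        if z > 1:    return "fast forward"
        if z < -1:   return "fast backward"
        return "slow forward" if z > 0 else "slow backward"

    assert classify(10, 30, 1) == "fast forward"       # z = 3
    assert classify(10, 10 / 3, 1) == "slow forward"   # 0 < z < 1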
2.3 Conventional Client Buffering Scheme to Provide VCR Functions for Video Multicast
Using client buffering is an intuitive solution to the problem of providing VCR functions [3,10,11]. We illustrate the conventional client buffering scheme in Figure 1 and focus on how the relative position of the play point in the buffer is changed by each VCR action. In Figure 1, we show a client buffer of size 100. The buffer can be viewed as a sliding window over a usable portion of a video.

^3 After downloading the first segment and sending it directly to the display, the client only has to download from 3 channels (3b) at the same time. 4b disk bandwidth is required at this time because one b is used for sending data from disk to the display.
Action (Δt, Δl)                     Buffer window     Play point   Relative position
Before                              [N, N+100]        N+60         60
Play (Δt=10, Δl=10)                 [N+10, N+110]     N+70         60
Fast Forward (Δt=10, Δl=30)         [N+10, N+110]     N+90         80
Fast Backward (Δt=10, Δl=-30)       [N+10, N+110]     N+30         20
Slow Forward (Δt=10, Δl=10/3)       [N+10, N+110]     N+63.3       53.3
Pause (Δt=10, Δl=0)                 [N+10, N+110]     N+60         50
Jump Forward (Δt=0, Δl=10)          [N, N+100]        N+70         70
Jump Backward (Δt=0, Δl=-10)        [N, N+100]        N+50         50

Fig. 1. The effects of VCR functions on play point position in the buffer
When a client begins to download a video, the buffer is empty. As the frames are downloaded from the multicast group, they are played back and the buffer is filled up at the same time. If there is no VCR function, the play point is the same as the most recent frame. When the buffer becomes full, the oldest frames are discarded to allow more recent frames to come in. Let us focus on the configuration labeled Before in Figure 1. We assume the frame immediately before the oldest frame is N and the buffer is full, so the most recent frame is N + 100. This is also the frame that is currently being downloaded from the multicast channel. If some VCR functions have been performed, then the play point can be a frame other than the most recent frame; in the example we assume the play point is N + 60. We show how each VCR function changes the relative position of the play point in the buffer. When a play action is issued, the play point moves forward at the same rate as the buffer boundaries, so the relative position of the play point in the buffer does not change. When a fast forward (specified as (Δt = 10, Δl = 30)) is issued, the boundaries move 10 units while the play point moves 30 units, so the relative position of the play point in the buffer moves forward 20 units. For the same reason, fast backward moves the play point in the buffer backward 40 units. Though pause and slow forward do not play in the backward direction, the relative positions of the play point in the buffer do move backward, because the boundaries of the buffer move forward. Jump forward and jump backward are implemented instantaneously, so the boundaries of the buffer do not change: jump forward moves the play point forward and jump backward moves the play point backward. We can see that play does not change the relative position of the play point, fast forward and jump forward move the play point forward (called forward VCR actions), and all other actions move the play point backward (called backward VCR actions).

(The unit of time is a frame: Δt = 10 represents a time interval for 10 frames of video to be played back at the normal playback rate, which equals the channel transmission rate. We use 3 as the ratio of the FF speed over the play speed in this example.)
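All of the effects in Figure 1 follow a single rule: during an action (Δt, Δl) the buffer boundaries advance by Δt while the play point moves by Δl, so the play point's relative position changes by Δl − Δt. A minimal sketch (the helper name is ours) reproduces the numbers in the figure:

    def relative_position_after(pos, dt, dl, size=100):
        new_pos = pos + dl - dt          # play point moves dl, window slides dt
        if not 0 <= new_pos <= size:
            raise ValueError("play point would leave the buffer")
        return new_pos

    # Starting from the Before configuration (play point at 60):
    print(relative_position_after(60, 10, 30))    # fast forward  -> 80
    print(relative_position_after(60, 10, -30))   # fast backward -> 20
    print(relative_position_after(60, 10, 10/3))  # slow forward  -> 53.3...
    print(relative_position_after(60, 10, 0))     # pause         -> 50
    print(relative_position_after(60, 0, 10))     # jump forward  -> 70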
Problems with the scheme: At the beginning, no forward VCR actions can be performed, because the relative position of the play point coincides with the most recent frame (the right boundary). Only after some backward VCR actions have moved the play point toward the middle of the buffer can forward VCR actions be performed. The problem with this buffer scheme is that the relative position of the play point is determined by the history of VCR actions. The effects of VCR actions in the same direction are cumulative: when consecutive VCR actions in the same direction are performed, the play point ultimately reaches a boundary of the buffer, at which point VCR functions in that direction can no longer be implemented. Only VCR actions in the opposite direction can reverse the effects, so whether a VCR action can be implemented depends on the previous VCR actions. This lack of self-adjustment is an inherent problem of the scheme. In the next section, we propose a new buffer scheme that takes advantage of the characteristics of partitioned video broadcast and provides VCR functions with greater flexibility.
3 Providing VCR Functions in Partitioned Video Broadcast

3.1 The Idea of Active Buffer Management
As we know, the probability that a VCR function can be implemented with local buffers depends on many factors, including buffer size, the play point within the buffer, and the buffer contents and management. The key is how to keep this probability high given a fixed buffer size. Our intuition is to keep the play point in the middle of the buffer so that there is a lower probability that VCR actions will move the play point out of the buffer. However, when a VCR action is performed, the relative position of the play point in the buffer will change. Our scheme is based on the use of a buffer manager that adjusts the contents of the buffer after VCR actions so that the relative position of the play point in the buffer stays in the middle. (We may prefer a different position if the probabilities of forward and backward VCR actions are skewed. For example, if 90% of VCR actions are in the forward direction, we will keep the relative position of the play point in the left part of the buffer.) We employ a policy of selectively downloading segments based on the current position of the play point in the buffer. This is possible in partitioned video broadcast because all the segments are repeatedly broadcast in the channels, ready for downloading; a video's "past", "present" and "future" are therefore being broadcast simultaneously at any point in time. Let us explain the basic idea by a simplified example. Assume that the client buffer can hold three segments. At some point, the buffer holds segments x, x+1, x+2 and the play point is in the middle segment x+1. Now we observe the situation after one segment time.
1) Case 1: If there is no VCR function, the play point will be in segment x+2. During this period, the loader finishes downloading segment x+3 and segment x is discarded. The buffer now holds segments x+1, x+2, x+3, and the play point is still in the middle segment (now x+2). 2) Case 2: Suppose a fast forward action is issued during this period and moves the play point to a position within segment x+3; the play point is then no longer in the middle segment. The idea is to let the buffer manager select the segments to be downloaded according to the position of the play point. After observing that the play point is in the last segment x+3 of the three segments x+1, x+2, x+3 in the buffer, the buffer manager will download segments x+4 and x+5 at the same time. After one segment time, the buffer will hold segments x+3, x+4, x+5 and the play point will move to segment x+4. Therefore, the play point is moved back to the middle segment. We see from this example that by letting the buffer manager selectively prefetch segments based on the position of the play point in the buffer, it can compensate for the effects of VCR actions. Therefore, our scheme can tolerate consecutive VCR actions in the same direction and solves the problem of the conventional client buffering scheme.
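A toy model of this idea, for the three-segment example above (assuming the play point advances one segment per segment time under normal play; the function names are ours, and the real manager in Section 3.4 additionally distinguishes pyramid and equal phases):

    def plan_downloads(buffered, play_segment):
        # Target the three segments centered on where the play point will be
        # one segment time from now, and fetch whichever are missing.
        next_play = play_segment + 1
        wanted = {next_play - 1, next_play, next_play + 1}
        return sorted(wanted - set(buffered))

    x = 10
    # Case 1 (no VCR action): buffer {x, x+1, x+2}, play in x+1 -> fetch x+3.
    print(plan_downloads({x, x + 1, x + 2}, x + 1))      # [13]
    # Case 2 (after fast forward): buffer {x+1, x+2, x+3}, play now in x+3
    # -> fetch x+4 and x+5, re-centering the play point next period.
    print(plan_downloads({x + 1, x + 2, x + 3}, x + 3))  # [14, 15]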
3.2 Destination Adjustment for VCR Actions
An important property of the destination of any VCR action is that, once the client resumes normal play from this destination, it should be able to play the rest of the video to the end without interruptions, if there are no further VCR actions. We call all destinations satisfying this property feasible points. Since we cannot predict user behavior, there always exists a case in which the destination point of a user action is outside of the client buffer. Although we let the user specify an arbitrary destination for VCR actions, the algorithm adjusts it to the nearest feasible point automatically, if necessary. We call this discontinuous interactive function implementation, because of the discontinuity between the specified destination and the adjusted destination. (An alternative is to delay the VCR action until the buffer is filled with the contents we want; no destination adjustment is needed in this mode, but the user has to wait before the VCR action can be implemented. A hybrid mode may give a good tradeoff between destination adjustment and response time.) By adjusting the destination, we can guarantee continuity after we resume normal playback. When the destination is not in the buffer, only those frames that are being broadcast on each channel are available immediately. Naively, we could select the nearest frame among them as the adjusted destination. This, however, will not always work because, in unequally segmented schemes, these frames do not always satisfy the above property. If all the segments are of equal size, the points currently being broadcast on each channel are feasible; the situation is complicated in partitioned video broadcast because of unequal segmentation, and not all frames being broadcast are good candidates. If we consider buffering at the clients, the frames in the buffer can also be feasible points even if they are not being broadcast on the channels. In Section 4, we give a detailed description of the rules for deciding whether a point is feasible and for finding the nearest feasible point with respect to a given destination.
In summary, our scheme uses an active buffer management technique to maintain the play point in the middle of the buffer. If the destination point of a VCR action is a point already in the buffer and it is also a feasible point at the finishing time, then the action can be served by the frames from the buffer. If not, the destination is adjusted to the nearest feasible point, and the continuity of the playback of the rest of the video is guaranteed if there are no further VCR actions. We use the percentage of VCR actions implemented by the contents of the client buffer (abbreviated as the percentage of buffer VCR actions) as one performance measure of our scheme; the higher this percentage, the better the interactive performance of the system. We also measure the difference between the specified destination and the adjusted destination for those VCR actions whose destination was changed for feasibility reasons. We use the percentage of destination shift (defined as the percentage of the difference over the length of the VCR action) as another performance measure; the smaller this percentage, the better the performance.

3.3 A VCR-oriented Broadcasting Series
Fig. 2. A scenario in the skyscraper scheme
The design of a new broadcasting series is motivated by problems we encountered while exploring the possibility of providing VCR functions in the skyscraper scheme [8]. Consider a scenario in the skyscraper scheme as in Figure 2. Assume the channels are broadcasting the frames at the positions identified by the dotted line, and that the length of segment i is l_i. At this time, the client wants to jump to somewhere in segment 4. Assume segment 4 is not in the client buffer. The available frames are those being broadcast on each channel; the nearest one among them is what is being broadcast on channel 4. If we let the client jump to this destination and resume normal playback, after time t + l_5 we finish playing the rest of segment 4 and the whole of segment 5. Now we need the beginning part of segment 6. However, channel 6 is now beginning to broadcast the last two units of segment 6, and during the above t + l_5 period we cannot get the beginning part of segment 6 either, because only the shaded part of segment 6 is broadcast on the channel during this period. We thus face an interruption of playback.
If we let the client jump to channel 5, the interruption occurs even sooner (after t). We may consider adjusting the destination point to other channels. The problem is that jumping to any even-numbered channel leads to the same situation as channel 4, and jumping to any odd-numbered channel leads to the same situation as channel 5. The only feasible points are in those equal segments whose length is restricted by W. A similar discontinuity can occur in the greedy scheme [9]. We design a VCR-oriented broadcast series based on two observations from the above analysis. If a destination segment is not in the buffer and we still require that we can always select the frame being broadcast on its channel as the destination, its size must be equal to the size of the next segment, to guarantee that the client can play back the next segment when it finishes playing the current one. The sizes of all segments beginning from the latter of the two equal segments must satisfy the continuity condition, to guarantee smooth playback of the rest of the video, assuming the client can download from m channels simultaneously. To keep the requirement on the clients small, we set m to 3. We give a function generating segment sizes that satisfy these two conditions as follows:

    f(n) = 1           if n = 1
    f(n) = 2           if n = 2
    f(n) = 4           if n = 3
    f(n) = f(n-1)      if f(n-1) ≠ f(n-2) and f(n-2) ≠ f(n-3)
    f(n) = 2 · f(n-1)  otherwise

It generates the following segment sizes: 1, 2, 4, 4, 8, 16, 16, 32, 64, 64, .... We use a parameter u to limit the maximum value of the series: we set f(n) = f(n−1) if n > u, so all f(n) with n ≥ u are equal to f(u). This restriction also limits the buffer requirements for clients. The loading strategy for the clients is simple if there are no VCR functions. At the beginning, three loaders are allocated to segments 1, 2 and 3. Each loads its assigned segment as soon as it begins. The downloaded segments are put into the buffer. The display gets the frames from the buffer, or from the channel at the same time as they are sent to the buffer. The playback of the video begins as soon as the loader begins downloading the first segment. When a loader finishes loading segment j, it is assigned to segment j + 3. The loader downloads the segment when it is broadcast on the channel in the latest cycle before its playback time. The reasoning about the continuity of playback, and the generalization of the series, can be found in [13].
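The recurrence and the cap u are easy to check mechanically. The sketch below is a toy generator (names ours), reproducing the series printed above:

    def segment_sizes(count, u=None):
        f = []
        for n in range(1, count + 1):
            if u is not None and n > u:
                f.append(f[-1])                 # cap: f(n) = f(u) for n >= u
            elif n <= 3:
                f.append({1: 1, 2: 2, 3: 4}[n])
            elif f[-1] != f[-2] and f[-2] != f[-3]:
                f.append(f[-1])                 # repeat the previous size
            else:
                f.append(2 * f[-1])             # otherwise double it
        return f

    print(segment_sizes(10))  # [1, 2, 4, 4, 8, 16, 16, 32, 64, 64]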
3.4 VCR Function Implementations with the Active Buffer Management Scheme
When clients download or play back video segments, they go through two phases. A segment k is in the pyramid phase (unequal segments) if 1 ≤ k ≤ u − 1, and in the equal segment phase if u ≤ k ≤ K, where K is the total number of segments.
Clients are required to have at least three loaders and three buffers. The size of each buffer is the same as that of the maximum segment; it can hold either one equal-phase segment or several pyramid-phase segments. During the pyramid phase, we may have more than 3 segments in the buffers at the same time. We distinguish two components that work together to provide VCR functions. One is the player, which accepts user interaction commands and plays back the video as requested. The other is the loader/buffer manager (or simply manager), which manages the loaders/buffers and decides which channels the client should listen to and download segments from.

Player. The player keeps playing back the contents of the buffers until it accepts a VCR command. It then makes a decision based on the availability of the contents in the buffer and the broadcasting position of the relevant channels. When the player accepts a Fast Forward/Fast Backward/Slow Backward/Play Backward command, it checks whether the contents to be played are in the buffer and whether the destination point is a feasible point at the finishing time. When a command passes these two tests, it can be implemented smoothly; otherwise we simply jump to the nearest feasible point with respect to the specified destination. When the player accepts a Jump Forward/Backward command, it checks whether the destination is a feasible point. If it is, it moves to that point and resumes normal playback; otherwise, the player selects the nearest feasible point with respect to the destination and resumes normal playback. If the player accepts a Pause or Slow Forward command, it implements it for the duration specified, because Pause and Slow Forward can be implemented smoothly from any feasible point: for Pause, the player stops moving forward for the specified period of time; for Slow Forward, it plays back the video from the buffer at the specified lower speed. After completing a VCR action whose destination was outside of the buffer, the player checks the allocation of loaders/buffers and reallocates them, if necessary. Assume the segment containing the current play point is k. During the pyramid phase, if the three segments k, k+1, k+2 are in the buffer or being downloaded, no actions are taken; otherwise the player reallocates loaders to these segments. During the equal segment phase, if segments k−1, k, k+1 are in the buffer or being downloaded, no actions are taken; otherwise the player reallocates loaders to these segments. Downloading begins immediately after the loaders are allocated.
Loader/buffer manager. The manager allocates loaders/buffers at the channel boundary. By channel boundary we mean the point in time when the channel finishes one cycle of broadcasting the segment and begins the next cycle. During the pyramid phase, if any of the segments k+1, k+2, k+3 is not in the buffer and not being downloaded, we allocate a loader for it. Startup allocation is considered a special case: when a client starts, we assume the current segment is 0. After the startup latency, we move to the boundary of the first channel, and the manager allocates loaders to segments k+1, k+2, k+3, i.e., 1, 2, 3. Downloading of the first segment begins immediately; downloading of the second and third segments may begin later, depending on channel positions.
During the equal segment phase, if the play point is in the earlier half of the current segment k, the three loaders are assigned to segments k−1, k, k+1; if the play point is in the later half of the current segment k, the three loaders are assigned to segments k, k+1, k+2. If the contents of a segment are in the buffer, no loading will occur. The downloaded segments overwrite the earlier segments already in the buffer; however, if a VCR function is ongoing, the buffer contents needed for that action cannot be overwritten. If a loader is already downloading a segment when we want to allocate a loader to it, no action is taken and we consider one loader to be allocated to that segment. Our scheme has the flexibility to support clients with more than three loaders and buffers. In this case, extra loaders/buffers are allocated to unallocated segments based on the priorities of these segments. We define the priority of a segment as its distance to the current segment: a smaller number means a higher priority. If two segments have the same priority number, the segment in the forward direction is given the higher priority.
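The priority rule for extra loaders can be stated compactly; a sketch (helper name ours):

    def allocation_order(current, unallocated):
        # Sort by distance to the current segment; on ties, forward wins.
        return sorted(unallocated,
                      key=lambda s: (abs(s - current), 0 if s > current else 1))

    print(allocation_order(10, [7, 8, 9, 11, 12, 13]))  # [11, 9, 12, 8, 13, 7]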
4 Feasible Points
We introduce the concept of feasible points to deal with problems that occur in providing VCR functions in partitioned video broadcast. If we do not check this property of the destination point of a VCR action, we may face discontinuity of playback after we resume normal play following the VCR action, even if no further VCR actions are issued. We first describe some notation, then illustrate the problem by another example, and finally give the rules for deciding whether a destination point is feasible. Let us assume the start position and end position of segment i in a video (e.g., measured as time from the video's starting point) are s_i and e_i, respectively. At any time, each channel is broadcasting a specific frame of its segment, called the channel point and denoted c_i. We have s_i ≤ c_i ≤ e_i and e_i = s_{i+1}. We use [y1, y2] to represent the part of the video between two points y1 and y2. Assume the destination point d is in segment j; note that d can be different from c_j.

4.1 Another Infeasible Case
In Section 3.3, we showed that an unrestricted jump can lead to discontinuity of later playback, even if there are no further VCR actions. In this section, we present another example, in which the video contents required by a VCR action are in the buffer, but performing it to the full length still leads to discontinuity after we resume normal play. In Figure 3, we assume the current play point is s_k and the fast forward command is specified as (t, [s_k, e_k]). The channel broadcast point is c_i for channel i. Assume the contents of channel k ([s_k, e_k]) are in the buffer and the contents of channel k+1 are not. Because all the contents ([s_k, e_k]) required by the VCR action are in the buffer, we can perform the FF action to the full length, to position e_k. This fast forward action takes time t. During this period, we can only download [c_{k+1}, c_{k+1} + t] to fill the buffer, but we need [s_{k+1}, c_{k+1}] to continue normal play, which is not available. Therefore, we have a case in which a VCR action cannot be performed to the full length, for continuity reasons, even though the contents required by the VCR action itself are in the buffer.
Fig. 3. Another infeasible case
4.2 Rules
We now give the rules for deciding whether a given point is feasible, and for finding the nearest feasible point with respect to a given destination. We distinguish three cases based on the relative sizes of the segments involved. If the destination point d (= p + Δl) is in segment j and d is located at or before the channel point c_j (i.e., d ≤ c_j), the segment containing d is designated as the current segment. We use k to denote the current segment number and have k = j. We now give the rules for these destination points; later we will consider the rules for a destination point d located after the channel point c_j.
Fig. 4. Rules for feasible points: (a) Case 1; (b) Case 2 (d = destination; shaded parts in the buffer imply d is feasible)
– Case 1 (Figure 4(a)): the (Δ, Δ) case. The size of the current segment is equal to the size of the next segment, i.e., f(k) = f(k+1). If [d, c_k] is in the buffer, then d is feasible; otherwise, the later nearest feasible point is the point q such that [q, c_k] is in the buffer with the smallest q value.
– Case 2 (Figure 4(b)): the (Δ, 2Δ, 2Δ) case. The size of the current segment is half the size of the next segment, and the next two segments are of equal size, i.e., f(k) = 1/2 · f(k+1) and f(k+1) = f(k+2).
  Case 2.1: the segments in channels k and k+1 are left aligned, i.e., c_k − s_k = c_{k+1} − s_{k+1}. If [d, c_k] and [s_{k+1}, c_{k+1}] are in the buffer, then d is feasible; otherwise, if [s_{k+1}, c_{k+1}] is in the buffer, the later nearest feasible point is the point q such that [q, c_k] is in the buffer with the smallest q value; otherwise, the later nearest feasible point is the point q such that [q, c_{k+1}] is in the buffer with the smallest q value.
  Case 2.2: the segments in channels k and k+1 are right aligned, i.e., c_k − s_k ≠ c_{k+1} − s_{k+1}. If [d, c_k] is in the buffer, then d is feasible; otherwise, the later nearest feasible point is the point q such that [q, c_k] is in the buffer with the smallest q value.
– Case 3: the (Δ, 2Δ, 4Δ) case. The size of the current segment is half the size of the next segment, and the size of the next segment is in turn half the size of its next segment, i.e., f(k) = 1/2 · f(k+1) and f(k+1) = 1/2 · f(k+2). We distinguish four subcases based on the relative positions of the relevant broadcasting channels and give rules for each; interested readers are referred to [13].
Now we consider the case in which the destination point d in segment j is located after the channel point c_j (i.e., d > c_j). If we designate segment j+1 as the current segment, we can use the above rules: if we still use k to denote the current segment number, then k = j + 1, and in this situation [d, c_k] means [d, e_{k−1}] and [s_k, c_k].
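A sketch of the tests for Cases 1 and 2 follows; the helpers in_buffer(a, b) (true iff the interval [a, b] is buffered) and earliest_q(point) (the smallest q with [q, point] buffered) are assumptions of ours, not primitives from the paper.

    def adjust_destination(d, k, f, s, c, in_buffer, earliest_q):
        if f[k] == f[k + 1]:                               # Case 1: (D, D)
            return d if in_buffer(d, c[k]) else earliest_q(c[k])
        if 2 * f[k] == f[k + 1] == f[k + 2]:               # Case 2: (D, 2D, 2D)
            if c[k] - s[k] == c[k + 1] - s[k + 1]:         # Case 2.1: left aligned
                if in_buffer(d, c[k]) and in_buffer(s[k + 1], c[k + 1]):
                    return d
                if in_buffer(s[k + 1], c[k + 1]):
                    return earliest_q(c[k])
                return earliest_q(c[k + 1])
            return d if in_buffer(d, c[k]) else earliest_q(c[k])  # Case 2.2
        raise NotImplementedError("Case 3 has four subcases; see [13]")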
5 Simulation Results

5.1 Experimental Settings
In our experiments, we assume the video length is 120 minutes and we have 30 channels for broadcasting the video. The video is divided into 30 segments; the first 8 segments are of unequal size, with the weights defined by the series in Section 3.3, and segments 9 to 30 are of equal size. The size of the first segment is 0.0805 minutes, or 4.83 seconds, so the maximum startup latency is 4.83 seconds and the average is half of that, i.e., 2.42 seconds. The size of the largest segment is 5.15 minutes. The buffer size required at the client side is a multiple (at least 3) of the size of the largest segment.
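These numbers follow directly from the series; a quick check (assuming the cap is chosen so that segments 9–30 all get the weight 64, i.e., u = 9):

    weights = [1, 2, 4, 4, 8, 16, 16, 32] + [64] * 22  # 8 unequal + 22 equal
    unit = 120.0 / sum(weights)      # size of the first segment, in minutes
    print(unit * 60)                 # ~4.83 s: maximum startup latency
    print(unit * 60 / 2)             # ~2.42 s: average startup latency
    print(unit * 64)                 # ~5.15 min: size of the largest segment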
5.2 User Interaction Model
The user interaction model is shown in Figure 5(a), which gives the probability of issuing each VCR action and its mean duration. (For Jump, the mean duration is the average jump step, since Δt = 0.) We assume the duration of a VCR action is exponentially distributed with the mean τ_i given in the user interaction model. The durations τ_i and the transition probabilities p_i determine the degree of interactivity of user actions. A similar model is also used in [11]. Finding typical values for these parameters is beyond the scope of this paper; rather than selecting one specific set of values, we experiment with a range of values. The model is further simplified to reduce the number of experimental parameters. We set p_7 = 0 to prevent unexpected departure from the middle of a video.

Fig. 5. User interaction model
We set p_8 = p_9 = 0 because slow backward and play backward are not used as often as the other VCR functions. Also, we set p_i = p_j and τ_i = τ_j for 1 ≤ i < j ≤ 6. We use p_v to represent the probability of issuing VCR actions after a play action, and get p_v = p_1 + ... + p_6 and p_0 = 1 − p_v. We use τ_v to represent the common value of the τ_i (1 ≤ i ≤ 6). Therefore, p_i = p_v/6 and τ_i = τ_v for 1 ≤ i ≤ 6. After the simplification, we get the user interaction model shown in Figure 5(b). In this model, we have only three parameters: p_v, τ_0 and τ_v. We call τ_v/τ_0 the duration ratio. We experiment with two ways of changing the user interaction level.
1. Change probability. In this case, we let τ_v/τ_0 = 0.5 and vary p_v from 0.1 to 0.9.
2. Change duration ratio. In this case, we set p_v = 0.6 and change the duration ratio τ_v/τ_0 from 0.2 to 1.0.

5.3 Numerical Results
Recall that we measure the performance of our scheme by the percentage of buffer VCR actions and the percentage of destination shift (Section 3.2). In our first experiment (Figure 6(a)), we examine the effect of the frequency of issuing VCR actions on the percentage of buffer VCR actions. We change the probability of issuing VCR actions (p_v) from 0.1 to 0.9. We set τ_0 to be about half of the maximum segment size (2.58 minutes). We experiment with different numbers of loaders and buffers at the clients. As expected, the percentage of buffer VCR actions decreases as the probability of issuing VCR actions increases. However, the percentage of buffer VCR actions stays above 87% even when the probability of issuing VCR actions is as high as 0.9. Though such a high degree of interactivity rarely happens, the simulation results in this setting make us believe that our scheme can deal with a very high level of user interaction. We change the duration ratio in Figure 6(b), letting it vary from 0.2 to 1.0. The percentage of buffer VCR actions decreases as the ratio increases, and the degradation of performance is larger than in the case in which we change the probability. However, even when the ratio is 1.0, we still need only 3 loaders/buffers to keep the percentage of buffer VCR actions above 84%.
Fig. 6. Percentage of buffer VCR actions: (a) change probability; (b) change the duration ratio
In most practical situations the ratio is usually low; we can achieve 95% buffer VCR actions when the ratio is 0.2.

Number of channels            127   67   47   37   31   27   24
Maximum segment size (min)    1.0  2.0  3.0  4.0  4.9  5.9  6.9

Table 1. Relation of the number of channels and the maximum segment size
Next we explore the effects of the maximum segment size. We set p_v = 0.6 and τ_0 = 2.58 minutes. We let the duration ratio be 0.25, 0.50 and 1.0. We vary the number of channels so that the maximum segment size changes from 1.0 minute to 6.9 minutes; the actual numbers are listed in Table 1. The client has three loaders. When the maximum segment size is 6.9 minutes, each client needs three 6.9-minute buffers, so the total buffer size in this case is 20.7 minutes. We let the clients in the cases with a smaller maximum segment size have a total buffer of the same size, so they may have more than 3 buffers of the maximum segment size. Figure 7(a) shows the percentage of buffer VCR actions as the maximum segment size changes from 1 minute to 6.9 minutes. As the maximum segment size increases, the percentage of buffer VCR actions decreases. However, even when the segment size is about 7 minutes, the percentage of buffer VCR actions is still above 82%. We can see from the plot that we can achieve better performance by increasing the number of channels and thus decreasing the maximum segment size: when the maximum segment size is 2 minutes, the percentage of buffer VCR actions is 93% for a duration ratio of 1.0. Next we measure the percentage of destination shift for the adjusted VCR actions. Figure 7(b) shows the percentage of destination shift; in all cases, it is always less than 11%. With the total buffer size being the same, the client achieves a smaller shift if the segment size is smaller. This implies that increasing the number of channels not only reduces the startup latency, but also leads to better performance for VCR functions.
Fig. 7. The effects of maximum segment size: (a) percentage of buffer VCR actions; (b) percentage of destination shift
5.4 Comparison with Staggered Broadcast
In this section, we compare the performance in three cases: 1) staggered broadcast with the conventional buffer scheme; 2) partitioned broadcast with the active buffer management scheme; 3) staggered broadcast with the active buffer management scheme. We let the number of channels change from 12 to 30 and set the number of loaders to 3. We let the case of staggered broadcast with 12 channels have 4 buffers of its maximum segment size. For fairness, all other cases have a total buffer of the same size; thus each of these cases may have a different number (other than 4) of buffers of its maximum segment size, because its maximum segment may be smaller or bigger than that of the above-mentioned 12-channel case. For the partitioned broadcast cases, we select an appropriate u value to guarantee that they have at least 3 buffers of their maximum segment size. We keep the user interaction level the same (p_v = 0.6 and the duration ratio 0.5). The startup latency decreases as the number of channels increases in both staggered and partitioned broadcast, and it is much smaller in partitioned broadcast than in staggered broadcast with the same number of channels; here we are more interested in their interactive behavior. Figure 8(a) shows the percentage of buffer VCR actions in the three cases. The percentage is almost constant, around 63%, in the staggered broadcast with the conventional buffer scheme, while the partitioned broadcast with active buffer management achieves 86% with 12 channels and 97% with 30 channels. It is interesting to note that the staggered broadcast equipped with our active buffer management scheme performs even better than the partitioned broadcast. This is because the maximum segment size of the staggered broadcast is smaller than that of the partitioned broadcast, so it has a larger number of buffers of its maximum segment size and more flexibility to arrange the contents of the buffer.
Fig. 8. Comparisons with staggered broadcast: (a) percentage of buffer VCR actions; (b) percentage of destination shift
The same performance improvement can also be seen in the percentage of destination shift, shown in Figure 8(b). The percentage in staggered broadcast with the conventional buffer scheme changes from 23% to 9%. The partitioned broadcast has smaller numbers, varying from 11% to 1.5%. The staggered broadcast with active buffer management enjoys a further improvement, with the percentage decreasing from 6% to less than 1%. Comparing its performance with the conventional buffer management scheme, we can clearly see the advantage of our active buffer management scheme. With the same number of channels, the staggered broadcast has better interactive performance than the partitioned broadcast if both use active buffer management; on the other hand, the partitioned broadcast has a much shorter startup latency. An interesting observation here is the tradeoff between startup latency and interactive performance between these two schemes: the scheme with the larger startup latency has the better performance for providing interactive functions.
6 Concluding Remarks
Multicast delivery is an attractive approach to the provision of a video-on-demand service because it scales well to a very large number of clients. The problem is how to provide interactive functions to individual clients within the multicast framework without compromising the scalability of the multicast paradigm. In this paper, we propose a scheme to support interactive functions in partitioned video broadcast. The scheme lets the client selectively prefetch segments from broadcast channels based on the observed position of the play point in its local buffer. The contents of the buffer are adjusted in such a way that the relative position of the play point is kept in the middle part of the buffer, and a high probability of providing interactive functions with the contents of the local buffer is achieved. After analyzing the problems with providing interactive functions in existing partitioned video broadcast schemes, we design a new broadcast series
suitable for the interactive behavior of clients. We illustrate several cases in which VCR actions can lead to discontinuity of later playback in partitioned broadcast, and introduce the concept of feasible points to deal with these problems arising from unequal segmentation in the broadcast scheme. By restricting the destination of a VCR action to feasible points, we can guarantee that the video can be played back without interruptions after the VCR action. Using a simple user interaction model, we experiment with various levels of user interactivity and explore the performance of our scheme. Using the percentage of buffer VCR actions and the percentage of destination shift as performance measures, we show that our scheme can implement interactive actions through buffering with a very high probability over a wide range of user interaction levels. Compared with the conventional buffering scheme, our active buffer management scheme is shown to improve the performance of interactive behavior with the same resource requirements.
References

1. A. Dan, D. Sitaram, and P. Shahabuddin, "Scheduling policies for an on-demand video server with batching," in Proc. of ACM Multimedia'94, pp. 15-23, 1994.
2. A. Dan, D. Sitaram, and P. Shahabuddin, "Dynamic batching policies for an on-demand video server," Multimedia Systems, vol. 3, pp. 112-121, June 1996.
3. K. C. Almeroth and M. H. Ammar, "A scalable interactive video-on-demand service using multicast communication," in Proceedings of International Conference on Computer Communication and Networks, pp. 292-301, 1994. San Francisco, CA.
4. K. C. Almeroth and M. H. Ammar, "On the performance of a multicast delivery video-on-demand service with discontinuous VCR actions," in Proceedings of ICC'95, pp. 292-301, 1995. Seattle, WA.
5. K. C. Almeroth and M. H. Ammar, "On the use of multicast delivery to provide a scalable and interactive video-on-demand service," IEEE Journal of Selected Areas in Communications, vol. 14, pp. 1110-1122, August 1996.
6. M. Hofmann, E. Ng, K. Guo, S. Paul, and H. Zhang, "Caching techniques for streaming multimedia over the Internet," tech. rep., Bell Laboratories, April 1999. Technical Report 990409-04TM.
7. S. Viswanathan and T. Imielinski, "Metropolitan area video-on-demand service using pyramid broadcasting," Multimedia Systems, vol. 3, pp. 197-208, May 1996.
8. K. A. Hua and S. Sheu, "Skyscraper broadcasting: A new broadcasting scheme for metropolitan video-on-demand systems," in Proceedings of ACM Sigcomm'97, 1997.
9. L. Gao, J. Kurose, and D. Towsley, "Efficient schemes for broadcasting popular videos," in Proceedings of NOSSDAV'98, 1998.
10. W. Liao and V. O. Li, "The split and merge protocol for interactive video-on-demand," IEEE Multimedia, vol. 4, pp. 51-62, October-December 1997. Also in Proceedings of Infocom'97.
11. E. L. Abram-Profeta and K. G. Shin, "Providing unrestricted VCR functions in multicast video-on-demand servers," in Proceedings of IEEE International Conference on Multimedia Computing and Systems (ICMCS'98), Austin, Texas, 1998.
12. Z. Fei, I. Kamel, S. Mukherjee, and M. H. Ammar, "Providing interactive functions for staggered multicast near video-on-demand systems," in Proceedings of IEEE International Conference on Multimedia Computing and Systems '99, 1999.
13. Z. Fei, M. Ammar, I. Kamel, and S. Mukherjee, "Providing interactive functions through active client buffer management in partitioned video broadcast," tech. rep., College of Computing, Georgia Institute of Technology, 1999. GIT-CC-99-09.
A Multicast Transport Protocol for Reliable Group Applications

Congyue Liu 1, Paul D. Ezhilchelvan 2, and Marinho Barcellos 3

1 Guangzhou Communications Institute, Guangzhou, 510310, P.R. China. [email protected]
2 University of Newcastle, Newcastle upon Tyne, NE1 7RU, UK. [email protected]
3 UNISINOS, Sao Leopoldo, RS, 93022-000, Brazil. [email protected]

Abstract. This paper presents a transport-level multicast protocol that is useful for building fault-tolerant group-based applications. It provides (i) reliable, end-to-end message delivery, and (ii) a failure suspector service wherein best efforts are made to avoid mistakes. This service can facilitate an efficient, higher-level implementation of a group membership service which does not capriciously exclude a functioning and connected member from the membership set. The protocol has mechanisms for flow and implosion control, and for recovering from packet losses. Through simulations, its performance is studied for both homogeneous and heterogeneous network configurations. The results are very encouraging.
1 Introduction

Building group-based applications capable of tolerating site crashes and network partitions has been under investigation for several years. Useful programming paradigms such as view synchrony and virtual synchrony (VS) have been specified [1]. An efficient implementation of these abstractions can be obtained if the following low-level services are available: (ser1) multicasts from a group member (known as the sender) are received by other members (called receivers) in the sent order and with no message duplication or omission; and (ser2) if a receiver crashes or detaches due to partition, the sender is issued a failure-suspicion notice over that receiver. A suspicion cannot always be correct, as a slow, overloaded site cannot be distinguished from a crashed or detached one. So, false suspicions are admitted and are even acted upon as the only way to ensure liveness in applications. However, raising false suspicions must be avoided as much as possible. This, we believe, can be better achieved in a cost-effective manner if ser2 is built at the same level as where ser1 can be efficiently provided. The reasons for this are as follows. The sender's suspicion becomes false when the timeout it employs to suspect a receiver failure becomes smaller than the round-trip time (rtt) between itself and the receiver. That is, the sender's ability to make correct suspicions depends solely on the sender having an accurate estimate of rtt.
The rtt of course varies with the network load and congestion levels. A multicast protocol that provides end-to-end reliability (ser1 above) will be constantly estimating up-to-date rtts for detecting packet losses. Given that the quality of service offered by ser2 depends on the accuracy of the rtt estimates being used, it seems only natural to build both ser1 and ser2 together at the same level. Systems that implement VS abstractions tend to build ser1 using standard, low-level communication services and then implement ser2 at a higher level (typically using pings). For example, Horus [2] builds on UDP/IP multicast to obtain end-to-end reliability; Newtop [3] and Phoenix [4] use multiple TCP unicasts; and Transis [5] employs the Trans protocol [6] designed for broadcast media. This is probably because many well-known, end-to-end, reliable multicast protocols in the literature do not provide ser2, as they are also designed with scalability in mind. Consequently, they strive to relieve the sender of the burden of keeping excessive state information regarding receivers, in particular of the task of having to ensure that every functioning receiver receives the transmitted packets. In receiver-initiated protocols, such as SRM [7] and LBRM [8], receivers are responsible for detecting and recovering from packet losses; so the sender cannot know which receivers are receiving the transmitted packets and which ones have crashed or detached. In sender-initiated protocols, such as RMTP [9], the sender plays an active role in loss recovery only when receivers inform it of packet losses; it is not made aware of the entire receiver set (for reasons of scalability) and hence cannot suspect crashed or disconnected receivers. We here present a sender-initiated, transport-level multicast protocol that provides both ser1 and ser2 for a generic network topology. Our protocol is designed to exploit two characteristics of fault-tolerant group applications: the group size is usually small, so the sender can afford to keep a fair amount of state information per receiver; and the sender knows the full membership of the receiver set. With this knowledge, it retransmits a packet until each receiver sends an ack (i.e., acknowledgement) for the packet or gets suspected. A common problem to be tackled in the sender-initiated approach is ack-implosion [10], [11], an acute shortage of resources caused by the volume and synchrony of incoming acks at the sender, resulting in ack losses and hence increased network cost caused by unnecessary retransmissions. Our protocol employs an effective mechanism to control implosion, and also provides flow control. Its design is motivated by the performance study of our earlier protocol PRMP [12], [13]. Compared to PRMP, the protocol presented here is simple and hence easier to implement; it incurs far less computational overhead at the sender and less network cost; it does, however, achieve a smaller throughput. Its performance and cost-effectiveness are demonstrated through simulations on both homogeneous and heterogeneous networks. We assume the following system context: the transmission phase, during which data packets are transferred from the sender to receivers, is preceded by the connection setup phase.
The transmission phase is the focus of the paper; the connection setup is assumed to accomplish an unreliable multicast service that efficiently propagates the sender's data packets to the connected destinations. This unreliable service may be achieved through a series of unicast transmissions or a single IP multicast. (Both cases are simulated.) A receiver can unicast to the sender, and can crash or detach from the sender. The rest of this paper is organized as follows. Section 2 describes the design and mechanisms of the protocol. Simulation results are given in Section 3, followed by concluding remarks in Section 4.
2 Protocol Description

2.1 Overview
The protocol solves the problem of a non-faulty sender having to reliably transmit a number of data packets to a destination set of NC receivers. Failures during transmission result in packets being dropped, or corrupted and discarded. To deal with packet losses, the sender must keep a copy of each transmitted packet in its buffer and keep retransmitting a packet until an acknowledgement for that packet is received from every receiver. However, no acknowledgement can be received from a crashed or detached receiver. So, to ensure termination, the sender retransmits a given packet only a specified number of times. A receiver that has not indicated the receipt or non-receipt of the packet after all these retransmissions is suspected to have failed. Once a packet is acknowledged by all unsuspected receivers, the sender regards it as fully acknowledged and removes it from the buffer. Receivers indicate the receipt/non-receipt of a packet through their responses, unicast to the sender. Simultaneous arrival of many such responses can cause implosion, and the sender may miss receiving some of them. To minimise implosion losses, the sender sends Feedback Information (FI for short) to receivers at regular intervals called the Cycle. FIs essentially tell receivers when to unicast their responses. When a receiver receives an FI, it delays for a certain amount of time and then sends a response once every Period. (Both the delay and the Period are indicated in the FI.) This continues until another FI is received, which initiates (after an indicated delay) another episode of periodic responses from the receiver. The sender estimates the delay time (specific to a given receiver) and the Period (common to all receivers) based on its estimation of rtts for the receivers. If this estimation is, and remains, accurate, the sender will receive one response from each receiver during the Period, and the time gap between the arrival times of any two consecutive responses will be constant. In other words, the sender should receive the responses in a non-bursty manner and at a rate that cannot cause implosion, provided that the sender can precisely estimate rtts and that the rtts change very little over time. In practice, however, rtts seldom remain constant even over short periods of time. The protocol copes with variations in rtt in the following manner:
(a) the sender obtains a fresh estimate of rtt every time it receives a response from a receiver, and thereby its rtt estimates are kept reasonably up-to-date; and (b) it uses the most recent rtt estimates to compute FI parameters at the start of every Cycle, and receivers employ a mechanism to tolerate small fluctuations of rtt which might occur during a Cycle. The sender observes a retransmission timeout (RTO) before it retransmits a packet that is considered to have been lost during previous transmissions. RTO is fixed based on the sender's estimate of rtt so that premature retransmissions are avoided. Further, a retransmission can be either a multicast or a series of selective unicasts, depending on which option is deemed economical in terms of message cost. The protocol employs a sliding window scheme for flow control and for detecting packet losses. It involves the use of three types of packets which can be exchanged between the sender and the receivers: (a) DATA packets are sent by the sender and contain a sequence number seq that uniquely identifies a packet; (b) FI packets, also sent by the sender, contain a unique sequence number FIseq; and (c) response packets, RESP, sent by a receiver, indicate to the sender which data packets have been received and which ones have not. (We assume seq/FIseq is large enough to avoid the problems that arise when sequence numbers are wrapped around and reused.) A DATA packet (respectively, FI packet) is said to be earlier than another DATA packet (respectively, FI packet) if the former has a smaller sequence number.
2.2 Protocol Details

2.2.1 Sliding Window
Packet loss detection and flow control in our protocol are based on a sliding window scheme with fixed-size data packets. Both the sender and receivers employ a buffer of size S packets, negotiated at connection setup and kept constant during the whole transmission. The protocol, at the receivers, is required to sequentially deliver the data packets received. At a receiver, a data packet is said to be unconsumable if an earlier packet has not been received. Further, the upper layer may be slow, and consequently some packets ready for consumption may remain unconsumed.

Each receiver R_i keeps a receiving window W_i that is characterised by the buffer size S, a left edge LE_i, and the highest sequence number HR_i received from the sender. LE_i is the minimum between the sequence number of the earliest unconsumed packet in R_i and the sequence number of the earliest packet yet to be received by R_i. Thus LE_i refers to the smallest sequence number of a packet that is either waiting to be consumed or expected to be received. W_i is a boolean vector indexed by seq, LE_i ≤ seq ≤ LE_i + S − 1: W_i[seq] is true if R_i has received the data packet seq, and false otherwise. HR_i is set to the seq of a data packet received from the sender if seq > HR_i.

The sender keeps a set of NC sending windows, one W_{p,i} for each receiver. W_{p,i} is the sender's (latest) knowledge of W_i of R_i. Like W_i, it is characterised by the size S, a left edge, denoted LE_{p,i}, and the highest received sequence number HR_{p,i}. LE_{p,i} and HR_{p,i} are the sender's knowledge of LE_i and HR_i, respectively. For the data packet seq, LE_{p,i} ≤ seq ≤ LE_{p,i} + S − 1, W_{p,i}[seq] indicates the sender's knowledge of whether R_i has received seq; it is initially set to false. Finally, the sender keeps the variable HS to record the largest seq of the data packets it has multicast so far.

When it is time to respond, R_i sends a RESP packet to the sender containing: (a) RESP.W, a copy of its receiving window; (b) RESP.W.LE, the value of LE_i; (c) RESP.W.HR, the value of HR_i; and (d) a timestamp RESP.ts, which is used by the sender to estimate the round-trip time (RTT for short). When the sender receives a RESP packet from R_i, it updates its variables related to R_i: LE_{p,i} ← max{LE_{p,i}, RESP.W.LE}, HR_{p,i} ← max{HR_{p,i}, RESP.W.HR}, and then, for all seq, RESP.W.LE ≤ seq ≤ RESP.W.HR, W_{p,i}[seq] ← W_{p,i}[seq] ∨ RESP.W[seq]. From W_{p,i}, the sender can infer that R_i has received all data packets with seq < LE_{p,i}.
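The sender-side update on receiving a RESP amounts to a few assignments; a minimal sketch (state kept in plain dictionaries, names ours):

    def on_resp(state, resp):
        # state: sender's view of one receiver, keys "LE", "HR", "W" (dict).
        # resp:  the receiver's RESP packet, keys "W_LE", "W_HR", "W" (dict).
        state["LE"] = max(state["LE"], resp["W_LE"])
        state["HR"] = max(state["HR"], resp["W_HR"])
        for seq in range(resp["W_LE"], resp["W_HR"] + 1):
            state["W"][seq] = state["W"].get(seq, False) or resp["W"].get(seq, False)
        # Every packet with seq < state["LE"] is now known to be received.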
2.2.2 Flow Control

Our protocol employs window-based flow control. Since data packets are multicast to the entire destination set, the receiver with the smallest number of free buffer spaces determines the number of new packets that can be multicast. The sender determines the effective window EW_{p,i} for each R_i, where EW_{p,i} denotes the number of new packets R_i can take without buffer overflow: EW_{p,i} ← (LE_{p,i} + S) − (HS + 1). Without causing buffer overflow at any of the receivers, EW_p, EW_p ← min{EW_{p,i} | ∀i : 1 ≤ i ≤ NC}, new packets can be multicast. When EW_p is zero, the sender is said to be blocked. In addition to the window-based scheme, the protocol allows the user to set a maximum transmission rate by establishing an inter-packet gap (IPG), which is the minimum interval that must elapse between two successive transmissions from the sender.
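The effective-window computation transcribes directly (a sketch; LE_p holds LE_{p,i} for every receiver):

    def effective_window(LE_p, S, HS):
        per_receiver = [(le + S) - (HS + 1) for le in LE_p]   # EW_{p,i}
        return min(per_receiver)   # EW_p; the sender is blocked when this is 0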
2.2.3 Feedback Information (FI) and Related Parameters

The sender divides time into epochs of fixed length ε, which is known to all receivers. Epochs are denoted E_n, with n = 0, 1, 2, ...; E_n is the time interval between n·ε and (n+1)·ε. An FI packet multicast by the sender contains: (a) an array Rdel, with Rdel[i], 1 ≤ i ≤ NC, indicating the delay R_i should observe before sending its first response after it receives the FI; (b) Period, which is the time interval that should (ideally) elapse between the arrival of two successive responses from any given receiver; Period is identical for all receivers and is expressed as a number of epochs;
(c) ts, which indicates the sender's local time when the FI packet is sent; and (d) FIseq, which indicates the sequence number of the FI packet and helps a receiver to detect duplicate FIs. The sender periodically sends a new FI to receivers, and this period is called Cycle, Cycle = k · Period for some k ≥ 1.

Figure 2.1. Receiver response times and FI parameters.

Figure 2.1 shows two successive FI packets, FI1 and FI2, from the sender. These packets are received by R_i at (local) times t1 and t2 respectively, and contain parameters {Rdel1, Period1, FIseq = 1, ts1} and {Rdel2, Period2, FIseq = 2, ts2} respectively. R_i sends its first response as per FI1 at time t1 + Rdel1[i] and subsequently once every Period1. This continues until t2 + Rdel2[i], when R_i sends its first response as per FI2. After t2 + Rdel2[i], R_i responds once every Period2 until it begins to respond as per the third FI packet it may receive later. The cycle thus repeats.
2.2.4 Sender's Estimation of FI Parameters

If the rate of responses arriving at the sender exceeds a given threshold, losses can be expected due to implosion. This threshold is called the implosion threshold (ITR). It depends on the buffer space and the processing capacity currently available within the sender node, and hence may change if, for example, a new multicast session is opened or a concurrent one is closed. To avoid losses by implosion, the protocol controls the arrival rates and arrival timings of response packets using an input value, the response rate (RR). Thus, RR is the protocol's knowledge of ITR. It should not exceed ITR if implosion is to be avoided; if it is too small compared to ITR, then the sender is processing the receiver responses below its capacity, which will decrease the throughput. So, the objective should be to have RR track ITR. In our simulation study, we analyse the effect of RR exceeding ITR on implosion losses. The sender computes FI parameters to meet two objectives. First, it aims to receive within every epoch the maximum number of responses permitted by RR; that is, it plans to receive RQ responses, RQ ← ⌊RR · ε⌋. Secondly, within a Period, it plans to receive one response from every receiver. So, Period = ⌈NC/RQ⌉ · ε. It computes Rdel such that these objectives are met.
Figure 2.2. Rationale behind the estimation of FI parameters.

For now, assume that the sender knows RTT_i for every receiver (see Section 2.2.6). It orders the RTT values in non-increasing manner: RTT1 ≥ RTT2 ≥ ... ≥ RTT_NC. Let n be the smallest integer such that n·ε − RTT1 > current time. Say t = n·ε − RTT1. The sender is to multicast the FI packet at its clock time t. It plans for the first response (as per the FI to be sent at t) from R1 to arrive at n·ε, so Rdel[1] is set to zero. The first response from R2 (as per the FI to be sent at t) is expected to arrive at n·ε + ε/RQ; so, upon receiving the FI (to be sent at t), R2 should be instructed to delay sending its first response by Rdel[2] = n·ε + ε/RQ − (t + RTT2). In general, Rdel[i] = n·ε + (i−1)·ε/RQ − (t + RTT_i). Figure 2.2 depicts the rationale behind the sender's estimation of Rdel for NC = 3, RQ = 1, and RTT1 > RTT2 > RTT3. By estimating and sending a new FI every Cycle, the sender accounts for changes that occurred in RTT and RR during the past Cycle, and also for any changes in the receiver set due to the exclusion of a receiver suspected to have crashed or disconnected. Further, receivers are programmed to cope with small fluctuations of RTT around the sender's RTT estimate used at the beginning of the Cycle (described in Section 2.2.7). Thus, maximum effort is made for responses to arrive at the sender within the planned timing interval. Observe that the receivers need not know the value of Cycle used by the sender. So, the sender can be made more responsive to changes in RTT, RR, or the receiver set by sending a new FI immediately after it detects these changes (instead of waiting for any remaining part of the Cycle to be over).
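A sketch of this computation (names ours; now is the sender's current clock time):

    import math

    def fi_parameters(rtts, epsilon, RR, now):
        RQ = math.floor(RR * epsilon)            # responses planned per epoch
        NC = len(rtts)
        period = math.ceil(NC / RQ) * epsilon
        order = sorted(range(NC), key=lambda i: -rtts[i])  # non-increasing RTTs
        rtt_max = rtts[order[0]]
        n = 1
        while n * epsilon - rtt_max <= now:      # smallest n with t > now
            n += 1
        t = n * epsilon - rtt_max                # time to multicast the FI
        rdel = [0.0] * NC
        for rank, i in enumerate(order):         # rank 0 gets Rdel = 0
            rdel[i] = n * epsilon + rank * epsilon / RQ - (t + rtts[i])
        return t, period, rdel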
2.2.5 Reducing Redundant Transmissions

When the sender sends no new data packets for a long period (say, due to being blocked), successive responses from a receiver could be identical, and some of these responses may well be redundant. To save bandwidth, R_i sends no response until W_i changes, if it has already sent x identical responses in succession; we assume that at least one of these x responses will reach the sender. If q is the probability that a given packet is lost due to network error, then the probability that at least one of x packets reaches the sender is 1 − q^x. The sender also multicasts a given FI packet x times, with an interval of one Period between two successive FI transmissions. Note that the FI parameters need not be recomputed before retransmission, owing to the chosen interval of one Period between two successive transmissions of a given FI.
2.2.6 RTT (Round Trip Time) Estimation

The sender and receivers use their own clocks, which need not be synchronised; i.e., the clocks may read different values at any given time. To deal with this clock asynchrony, R_i maintains a clock-difference counter C_Diff. Recall that the field ts in a DATA or FI packet indicates the time when that packet is sent according to the sender's clock. Whenever R_i receives a DATA or FI packet, it sets C_Diff to the local clock time when the packet is received minus the value of ts in the received packet. When R_i is to send a RESP packet at its clock time T_RESP, it computes RESP.ts to be T_RESP − C_Diff. When the sender receives a RESP packet from R_i at its clock time clock, it computes RTT_i = clock − RESP.ts.

Figure 2.3 explains the rationale behind the sender's estimation of RTT_i and assumes that the last packet R_i received before it decides to send RESP at T_RESP is an FI packet. Denote by t1 the FI's transit time from the sender to R_i, by t2 the time R_i holds between receiving the FI (at local time T_FI) and sending the RESP, and by t3 the RESP's transit time back to the sender. As per the figure, RTT_i = t1 + t3. Since C_Diff = T_FI − FI.ts, RESP.ts = T_RESP − C_Diff = FI.ts + t2. So, RTT_i = clock − RESP.ts = t1 + t3. Note that the sender keeps no state information for RTT_i estimation and obtains a fresh estimate of RTT_i for every RESP it receives from R_i.

Figure 2.3. Estimation of RTT.
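A sketch of this clock-difference bookkeeping, split into its receiver and sender halves (variable names are ours):

    /* Receiver side: C_Diff tracks (local clock - sender clock + one-way delay). */
    static double c_diff;

    void on_data_or_fi(double local_rx_time, double pkt_ts)
    {
        c_diff = local_rx_time - pkt_ts;   /* reset on every DATA or FI */
    }

    double make_resp_ts(double t_resp)     /* RESP sent at local time t_resp */
    {
        return t_resp - c_diff;            /* timestamp in the sender's clock */
    }

    /* Sender side: stateless, fresh estimate per RESP arrival. */
    double estimate_rtt(double clock_now, double resp_ts)
    {
        return clock_now - resp_ts;        /* = t1 + t3; hold time t2 cancels */
    }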
2.2.7 Handling Absent Responses and Lost Packets
Both data packets and response packets can be lost during transmission. To avoid waiting forever for a response from a given R_i, the sender also waits on a retransmission timeout (RTO) after having transmitted a given packet: RTO = max{RTT_i | 1 <= i <= NC} + x·Period. A receiver that neither ACKs nor NACKs the transmitted packet during RTO is regarded as an absentee for that packet. The packet is retransmitted to every absentee for a maximum number of times, with each transmission followed by waiting for a multiplicatively increased RTO to see whether a RESP can be received. (This maximum number and the multiplicative factor can be specified as parameters of the protocol.) A receiver that remains an absentee despite all these retransmissions is removed from the receiver set. This removal is indicated to the upper layer in a failure suspicion exception. Note that removing a persistently non-responsive receiver from the set of receivers is necessary to prevent the sender from being indefinitely blocked (see flow control, Section 2.2.2). Once a receiver is removed, any response received subsequently from it is ignored.

The explanation behind RTO estimation and failure suspicion is as follows. Recall that R_i gets a chance to send its RESP once every Period. So, in the absence of failures, the first RESP sent by R_i after receiving a given packet should reach the sender within RTT_i + Period. Since at least one of x consecutive responses sent by R_i is likely to succeed, the sender waits for (RTT_i + x·Period) for every R_i. The sender retransmits a packet to an absentee receiver for a specified, maximum number of times. We assume that if the absentee is functioning and connected to the sender, at least one of the attempts will succeed; if all attempts fail, the absentee is suspected to be crashed/detached and is removed from the receiver set.

Changes in RTT during a Cycle can cause a receiver's response to arrive earlier or later than the planned time. To deal with small RTT changes during a Cycle, R_i employs a second clock-difference counter C_Diff_FI which is updated only when an FI is received: C_Diff_FI = T_FI − FI.ts, where T_FI is the local time when R_i received the FI. Note that receiving an FI makes C_Diff and C_Diff_FI have the same value, and subsequent arrival of a DATA packet may make C_Diff differ from C_Diff_FI. After R_i has updated C_Diff following the arrival of a DATA packet, it computes ΔT = C_Diff − C_Diff_FI. ΔT indicates half of the increase in RTT since R_i received the last FI, and hence approximately half the amount of increase in RTT over the RTT estimate used by the sender in computing the FI parameters. Soon after sending a given RESP, R_i sets T_send to the time when it has to send the next planned RESP; it sets T_adj = T_send − ΔT if T_send − ΔT is larger than the current time. The RESP planned to be sent at T_send is actually sent at T_adj, and new values for T_send and T_adj are computed after the RESP is sent at T_adj. That T_send − ΔT is not a future time means that the fluctuations of RTT since the sender last sent its FI are so large that compensation is not possible.
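A sketch of this receiver-side compensation (assuming the two counters are maintained as described; names are ours):

    /* Adjust the planned RESP send time for RTT drift observed since the
     * last FI; dt approximates half the RTT increase over the estimate
     * the sender used when computing the FI parameters. */
    double adjust_send_time(double t_send, double c_diff, double c_diff_fi,
                            double now)
    {
        double dt = c_diff - c_diff_fi;
        if (t_send - dt > now)
            return t_send - dt;   /* send earlier (or later) to hit the plan */
        return t_send;            /* drift too large: compensation impossible */
    }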
Next we discuss how the sender deals with packets known to be lost. After transmitting a packet seq at local time t, the sender performs the first recovery in one of the following two ways: (i) if the number of receivers that NACK seq reaches or exceeds the multicast threshold MTR·NC, the packet seq is multicast; or (ii) if, at local time t + RTO, the number of absentees plus the receivers that NACK seq reaches or exceeds MTR·NC, seq is multicast; otherwise, seq is unicast to each absentee and to every NACKing receiver. After the first recovery, seq is only unicast to any receiver that still remains an absentee or keeps NACKing. This is because the number of receivers requiring retransmission after the first recovery is expected to be small.
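The first-recovery rule reduces to a threshold test (sketch; MTR is taken as a fraction here):

    #include <stdbool.h>

    /* At time t + RTO: multicast packet seq if the receivers needing it
     * (NACKers plus absentees) reach the threshold; unicast otherwise. */
    bool first_recovery_multicast(int nackers, int absentees, double mtr, int nc)
    {
        return nackers + absentees >= mtr * nc;
    }
    /* With the MTR = 20% used in Section 3, NC = 30 makes 6 the break-even
     * point between one multicast and per-receiver unicasts. */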
3 Simulation Results

To evaluate the performance of our protocol, we carried out simulation experiments under various settings. These experiments are undertaken on two different network topologies: single-level tree and multi-level tree. In the single-level case, each receiver is connected directly to the sender and the sender's multicast is realised through multiple unicasts, one for each receiver. In the second case, a receiver is connected to the sender either directly or via multicast-enabled routers and the sender uses IP multicast. In both cases, a receiver addresses its RESP packets to the sender; the applications at receivers are assumed to be message-hungry: the received packets are consumed as soon as they become ready for consumption. The results for the multi-level topology are omitted for space reasons but can be seen in Section 3.2 of the full paper [15].

Three parameters are evaluated: throughput T, relative network cost N, and relative implosion losses I. If D bytes of data are to be transmitted, and the packet size is P, the number of data packets to be transmitted, DP, can be defined as DP = ⌈D/P⌉. Let Δt be the period of time (in ms) between the transmission of the first data packet and the moment all packets become fully acknowledged (both events occurring at the sender); the throughput is calculated as T = DP/Δt, in packets/ms. N is calculated as the total number of packets exchanged (TP) per receiver per data packet, i.e., N = TP/(NC·DP); the ideal value for N is (DP + 1)/DP (at least one ack per receiver is required at the end of transmission). I is measured as the ratio of total implosion losses to NC·DP. The desired value for I is 0, i.e., no losses due to implosion.

In the network model we use in the experiments, losses are assumed to be independent. Each packet has a destination address, which may be a unicast or multicast address. Each receiver is uniquely identified by a fictitious network address. The implosion losses are simulated in the following manner: an incoming response is to be stored in an incoming queue (IQ) before being processed; it is considered lost if there is no space in IQ when it arrives. The size of IQ is 64 packets.
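The three metrics, written out directly from the definitions above (a transcription, not code from the paper's simulator):

    #include <math.h>

    int    num_packets(double d_bytes, double p_bytes) { return (int)ceil(d_bytes / p_bytes); }   /* DP */
    double throughput(int dp, double dt_ms)            { return dp / dt_ms; }                     /* T, packets/ms */
    double rel_net_cost(long tp, int nc, int dp)       { return (double)tp / ((double)nc * dp); } /* N */
    double rel_implosion(long losses, int nc, int dp)  { return (double)losses / ((double)nc * dp); } /* I */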
3.1 Simulation on Single-Level Tree

Series of experiments were conducted for different values of NC. The network was modeled as a set of channels directly connecting the sender and receivers. Three types of channels are considered: short, medium, and long. Each channel type is characterised by a set of three attributes: normally distributed propagation latency with mean L, latency standard deviation (to emulate jitter) SD, and the percentage error rate Err. The values associated with each type are listed in Table 3.1. Two network configurations are defined: (a) LAN, in which all channels are of type short; (b) HYBRID (HYB in figures), where at least ⌊NC/3⌋ receivers are connected to the sender by channels of a given type.

    Channel Type   L        SD    Err
    short          1.5 ms   0.08  1%
    medium         5 ms     0.5   1%
    long           75 ms    15    10%

Table 3.1. General properties for the three channel types.

    Input Variable               Name      Value
    data unit size               unitSize  1000 bytes
    transmission size            DP        1000 packets
    inter-packet gap             IPG       1 ms
    epoch length                 ε         10 ms
    response rate                RR        1500 RESP/s
    uni vs multicast threshold   MTR       20%
    number of same responses     x         2 RESPs
    FI Cycle                     Cycle     10 Periods

Table 3.2. Protocol inputs used in the single-level tree configuration.

To assess the impact of window size (S) on performance, we employ two values for S: 64 packets, and 1000 packets, the latter being large enough to represent a window of infinite size. Table 3.2 shows the values of all input parameters used in the experiments. Unless specified otherwise, RR = ITR. From Table 3.2, RQ = ⌊RR·ε⌋ = 15, and Period = ⌈NC/RQ⌉ epochs.
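The derived quantities follow mechanically from Table 3.2; a short restatement (units as in the table):

    #include <math.h>

    /* RQ = floor(RR * eps): with RR = 1500 RESP/s = 1.5 RESP/ms and
     * eps = 10 ms, RQ = 15 responses per epoch. */
    int rq(double rr_per_ms, double eps_ms) { return (int)floor(rr_per_ms * eps_ms); }

    /* Period = ceil(NC / RQ) epochs: e.g. NC = 30 and RQ = 15 give
     * 2 epochs = 20 ms between a receiver's response opportunities. */
    int period_epochs(int nc, int rq)       { return (nc + rq - 1) / rq; }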
3.1.1 Comparison with Full-Feedback Protocol

To illustrate the relative performance of our protocol, we compare it with the standard sender-initiated Full-Feedback protocol (or FF for short). FF is a multicast extension of TCP, and the details can be seen in [14]. Its main characteristics are: (a) it employs a sliding window scheme with selective
retransmission (i.e., no go-back-N); (b) receivers instantly acknowledge every packet they receive; (c) loss detection is timeout-based, and loss recovery is via global multicasts (i.e., MTR·NC is 1).
Figure 3.1. Relative implosion rate, I.

Figure 3.2. Throughput, T.
Figures 3.1, 3.2, and 3.3 show the simulation results for I, T, and N, respectively. (A marked point in all the graphs is the average taken over 6 experiments.) The graphs named FW-XXX and IW-XXX indicate the performance of our protocol with finite and infinite window size respectively, in the network configuration XXX, which can be either LAN or HYB; the ones named FF-FW-XXX and FF-IW-XXX indicate the corresponding performance of the FF protocol. These figures indicate that our protocol performs much better than FF in all three parameters concerned. With near-zero implosion losses in our protocol (see Fig. 3.1), packets get fully acknowledged sooner, resulting in higher throughput, as shown in Figure 3.2 by the (widening) gap between the graphs of our protocol and FF. For both protocols, T decreases as NC increases because the probability of a given multicast not reaching at least one receiver (thus requiring recovery) increases with NC.
Figure 3.3. Relative network cost, N.

The relative throughput gains of our protocol are not achieved at the expense of increased N, as illustrated in Figure 3.3. With our protocol, N decreases with increasing NC, while FF shows the opposite tendency. This is due to two reasons: our implosion control mechanism reduces the number of responses sent by receivers, while in FF receivers ack immediately after receiving a packet; and FF employs only multicasts for lost-packet recovery, whereas we decide judiciously between multicast and unicast. So, the total number of packets used in FF increases more compared to the increase in NC·DP.
3.1.2 Performance in Hybrid Network

The graphs named FW-HYB and IW-HYB of Figures 3.4 and 3.5 show respectively T and N of our protocol in the hybrid network configuration. (The graphs MAX-T and IW-HYB-NOERR will be discussed in the next subsection.) Note the gaps between FW and IW in both figures. FW provides less than half the throughput of IW, because in FW the transmission of data packets is restricted by the window size, while in IW data can be continuously transmitted at a rate of 1/IPG. As for the difference in N between FW and IW: T in IW is much higher than in FW, which means the time needed to transmit the same amount of data is much longer in FW than in IW. So more FI packets and RESP packets are used, hence increasing N.

Simulation results also show that the chosen value of x affects the performance of the protocol. When x is equal to 10, both T and N are worse than when x is 2. So, the smaller x, the better the performance. In our experiments, the worst Err is 10%. Using the formula discussed in Section 2.2.5, x = 2 means that there is at least a 99% chance that at least 1 out of 2 responses from a connected receiver reaches the sender.

Figure 3.4. Throughput T in the hybrid configuration.
3.1.3 Impact of IQ Size and RR on Implosion Losses

In all the experiments presented so far, the implosion losses were nearly zero when the IQ size is 64 packets. To evaluate the impact of IQ size on implosion, we fixed it to 3 and chose parameters that would maximise the possibility of implosion: we considered the hybrid case, where the variation in RTT is higher, which provides larger scope for the FI parameters computed at the beginning of a Cycle to become incorrect, thus causing responses to arrive outside the planned interval; an infinite window (IW) was assumed, so that the sender is never blocked due to flow control and hence is subject to the possibility of implosion to the maximum extent.
Figure 3.5. Relative network cost N in the hybrid configuration.
Until NC = 20, the implosion loss was zero, and thereafter it became non-zero but still negligibly small. The largest I we observed was 7e-5 for NC = 30, and the average I (over six experiments) for NC = 30 was 1.667e-5. This near-zero I can be attributed to (i) the effectiveness of our implosion control mechanism, and (ii) the RR used in the estimation of FI parameters being the same as ITR. In practice, however, ITR cannot be accurately estimated and can vary with time. These experiments nevertheless lead us to conclude that if the RR used for estimating FI parameters is accurate, our implosion control mechanism transforms the system, at least for small NC, into a system of infinite IQ size which can suffer no implosion. However, the mechanism, unlike the system with infinite IQ, extracts its price in terms of reduction in T and increase in N. We estimate this cost by fixing Err = 0%. That is, any loss that occurs can only be due to implosion. The IW-HYB-NOERR graphs in Figures 3.4 and 3.5 show respectively T and N of our protocol for IQ size = 3 packets, Err = 0%, and IW in the hybrid case.
In a system with an infinite-length IQ, the maximum throughput achieved when Err = 0% is DP/ΔT, where ΔT = time for the sender to multicast all packets + time to get acks for all packets from all unsuspected receivers. The first term is DP·IPG (due to IW), and the second term is 2·RTTmax if we assume that each receiver collectively acks/nacks only after receiving an end-of-transmission packet from the sender. Thus the maximum T achievable in a loss- and implosion-free environment is shown as MAX-T in Fig. 3.4, and N in such an environment is nearly 1. (For instance, with DP = 1000 packets and IPG = 1 ms from Table 3.2, and taking RTTmax to be roughly 150 ms for a long channel, ΔT is about 1300 ms, giving MAX-T of about 0.77 packets/ms.) The IW-HYB-NOERR case indicates the cost of our implosion control scheme: the smallest T is about 50% of MAX-T and the maximum N is just below 1.9.

Observe the "humps" in Figure 3.5: N increases with NC during the first interval 3 <= NC <= 15, falls sharply during the second interval 15 < NC <= 30, and falls again during the interval 30 < NC <= 45. The explanation for this lies in the value chosen for the jitter of long channels (SD = 15) and the value of Period during these intervals, which is 10 ms, 20 ms, and 30 ms respectively. Note that Period does not change with NC during a given interval, and that RTO = max{RTT_i | 1 <= i <= NC} + x·Period. So, the smaller the Period, the smaller is RTO and hence the larger is the probability that a response sent along a long channel reaches the sender after RTO. Responses that do not arrive before RTO trigger retransmission, thus increasing N. For small values of NC in the first interval, N increases sharply. This is because when at least MTR·NC (= NC/5) of the NC/3 receivers that are connected by long channels cause RTO timeouts for a given packet, the packet is multicast to all receivers.

Our final set of experiments analysed the effect on I of the variation between RR (used in estimating FI parameters) and ITR, the value RR should ideally track. Note that when ITR is larger than the RR used, it does not lead to implosion, as buffer spaces are under-utilised. So, with buffer size = 1, we fixed RR at 1.5 packets/ms and estimated I by varying ITR for NC = 30. The values of I observed were 0.00213 and 0.00047 when ITR was 0.5 and 1.00, respectively.
4 Conclusions

We have presented a transport-level multicast protocol that provides (i) reliable, end-to-end message delivery, and (ii) a failure suspector service wherein best efforts are made to minimise mistakes. The simulations indicate that objective (i) is met with good throughput and low network cost; the implosion control mechanism employed effectively minimises implosion losses. As the sender computes an up-to-date round trip time (RTT) estimate for every arrival of a response from receivers, it is aware of RTT variations and hence can make fewer mistakes while suspecting a receiver failure. This is very useful for building a group membership service that does not remove a functioning member capriciously. Building a failure suspector at the lower level and thereby facilitating efficient provision of fault tolerance at the application level is a novelty of this paper.
The Trans protocol [9], like ours, also provides some basic services at the lower level which are useful for building important higher-level services: it includes (at the data-link level) some useful (ack) information in the delivered packets, which is used for efficient message ordering at higher levels; we differ in the type of low-level service we chose to provide. The Reliable Multicast Transport Service of [16] provides a service similar to ser2 which helps build a membership service at the higher level; but the emphasis there is to transparently extend unicast TCP into multicast TCP.

Recall that when a receiver receives no transmissions from the sender, our protocol causes the receiver to stop unicasting its responses to the sender. This feature not only helps save bandwidth but also allows the protocol to be easily extended to deal with a sender crash, an issue that is not addressed here. Other planned future work is three-fold: incorporating congestion control, extending the protocol to the n-to-n context, and implementing the extended version as the underlying service for our group management system [6].
Acknowledgements

The authors thank the anonymous reviewers for their many useful comments. Thanks are also due to Richard Achmatowicz and Martin Beet for their patience and technical support.
References

[1] O. Babaoglu, A. Bartoli and G. Dini, Group Membership and View Synchrony in Partitionable Asynchronous Distributed Systems, IEEE Transactions on Computers, 46(6), June 1997, pp. 642-658.
[2] R. van Renesse, K. P. Birman and S. Maffeis, Horus: A Flexible Group Communication System, Comm. of the ACM, Vol. 39, No. 4, April 1996, pp. 76-83.
[3] G. Morgan, S. K. Shrivastava, P. D. Ezhilchelvan and M. C. Little, Design and Implementation of a CORBA Fault-Tolerant Group Service, 2nd IFIP WG 6.1 International Working Conference on Distributed Applications and Interoperable Services, Helsinki, June 1999.
[4] P. Felber, R. Guerraoui and A. Schiper, The Implementation of a CORBA Object Group Service, Theory and Practice of Object Systems, Vol. 4, No. 2, 1998, pp. 93-105.
[5] D. Dolev and D. Malki, The Transis Approach to High Availability Cluster Communication, Comm. of the ACM, Vol. 39, No. 4, April 1996, pp. 64-70.
[6] P. M. Melliar-Smith, L. E. Moser and V. Agrawala, Broadcast Protocols for Distributed Systems, IEEE Trans. on Parallel and Distributed Systems, 1(1), January 1990, pp. 17-25.
[7] S. Floyd, V. Jacobson, S. McCanne, C. Liu and L. Zhang, A Reliable Multicast Framework for Light-Weight Sessions and Application Level Framing, IEEE/ACM Transactions on Networking, 5(6), Dec. 1997, pp. 784-803.
[8] H. Holbrook, S. Singhal and D. Cheriton, Log-Based Receiver-Reliable Multicast for Distributed Interactive Simulation, ACM SIGCOMM'95, Conf. on Applications, Technologies, Architectures, and Protocols for Computer Communication, Sept. 1995, Cambridge, USA.
[9] S. Paul, K. Sabnani, J. Lin and S. Bhattacharyya, Reliable Multicast Transport Protocol, IEEE Journal on Selected Areas in Communications, 15(3), April 1997, pp. 407-421.
[10] M. W. Jones, S. Sorensen and S. Wilbur, Protocol Design for Large Group Multicasting: the Message Distribution Protocol, Computer Communications, v. 14, n. 5, June 1991.
[11] J. Crowcroft and K. Paliwoda, A Multicast Transport Protocol, ACM SIGCOMM'88, Stanford, 16-19 Aug. 1988.
[12] A. M. P. Barcellos and P. D. Ezhilchelvan, An End-to-End Reliable Multicast Protocol Using Polling for Scaleability, IEEE INFOCOM'98, San Francisco, California, April 1998, pp. 1180-1187.
[13] A. M. P. Barcellos and P. D. Ezhilchelvan, A Scalable Polling-Based Reliable Multicast Protocol, Proc. of the 4th International Workshop on High Performance Protocol Architecture (HIPPARCH'98), London, June 1998.
[14] S. Pingali, D. Towsley and J. Kurose, A Comparison of Sender-Initiated and Receiver-Initiated Reliable Multicast Protocols, Proc. ACM SIGMETRICS Conf. on Measurement and Modelling of Computer Systems, Nashville, May 16-20, 1994.
[15] C. Liu, P. D. Ezhilchelvan and A. M. P. Barcellos, A Multicast Transport Protocol for Reliable Group Applications. (http://www.cs.ncl.ac.uk/people/paul.ezhilchelvan/home.formal/papers/ngc99Full.ps)
[16] R. Talpade and M. H. Ammar, "Single Connection Emulation (SEC): An Architecture for Providing a Reliable Multicast Service", Proc. of the 15th IEEE International Conference on Distributed Computing Systems (ICDCS95), Vancouver, Canada, June 95, pp. 144-152.
Efficient Buffering in Reliable Multicast Protocols

Oznur Ozkasap, Robbert van Renesse, Kenneth P. Birman, and Zhen Xiao

Dept. of Computer Science, Cornell University, 4118 Upson Hall, Ithaca, NY 14853
[email protected], {rvr,ken,xiao}@cs.cornell.edu
Abstract. Reliable multicast protocols provide all-or-none delivery to participants. Traditionally, such protocols suffer from large buffering requirements, as receivers have to buffer messages, and buffer sizes grow with the number of participants. In this paper, we describe an optimization that allows such protocols to reduce the amount of buffering drastically at the cost of a very small probability that all-or-none delivery is violated. We analyze this probability, and simulate an optimized version of an epidemic multicast protocol to validate the effectiveness of the optimization. We find that the buffering requirements are sub-constant, that is, the requirements shrink with group size, while the probability of all-or-none violation can be set to very small values.
1 Introduction
The aim of reliable multicast protocols is to provide all-or-none delivery of messages to all participants in a group. (We use here the distributed systems terminology for reliability [HT94], rather than the networking terminology, which does not stipulate all-or-none delivery.) Informally, if any participant delivers the message, then eventually all participants should deliver the message. Since the sender may fail, processes buffer the messages that they receive in case a retransmission is necessary. Most existing reliable multicast protocols have all receivers buffer messages until it is known that the message has become stable (i.e., has been delivered to every participant). In such systems, it is always the case that the amount of buffering on each participant grows with group size for a combination of the following reasons:

1. the time to accomplish stability increases;
2. the time to detect stability increases;
3. depending on the application, the combined rate of sending may increase.

As a result, these multicast protocols do not scale well.

This work is supported in part by ARPA/ONR grant N00014-92-J-1866, ARPA/RADC grant F30602-96-1-0317, NSF grant EIA 97-03470, and the Turkish Research Foundation.
In this paper, we investigate optimizing buffering by buffering messages on only a small subset of participants, while spreading the load of buffering over the entire membership. This way, each participant only requires a fraction of the buffering space used previously. Indeed, the amount of buffering space per participant decreases with group size. On the negative side, we introduce a small, known probability that a message is not delivered to all members. For example, this may happen if the entire subset responsible for buffering the message crashes before the message is delivered everywhere. We believe that in many situations such small probabilities can be condoned. In fact, epidemic multicast protocols such as those used in the Clearinghouse domain name service [DGH+87], Refdbms [GLW94], Bayou [PST+97], and Cornell's bimodal multicast [BHO+99] already introduce such a small known probability. Because of this, we focus our attention on using our suggested optimization in such protocols.

Note that, using the terminology of [HT94], the resulting protocols are not reliable. Yet, we consider the robustness of these protocols better than that of protocols such as RMTP [LP96] or SRM [FJL+97], because the probability of message loss is known and therefore a certain known Quality of Service is provided (even if the original sender of a message fails). Our protocols are useful for life- and mission-critical applications such as air traffic control, health monitoring, and stock exchanges, where a certain a priori known probability of message loss is acceptable [BHO+99]. For such applications as money transfer, fully reliable all-or-none multicast or atomic transactions would be necessary, while for applications that are not life- or mission-critical, such as tele-conferencing or distribution of software, protocols such as RMTP and SRM are sufficient.

We investigate techniques for choosing suitable subsets of participants for buffering messages, ways for locating where messages are buffered in case a retransmission is required, how these techniques improve memory requirements, and how they impact the reliability of the multicast protocols. We take into account message loss and dynamic group membership. For analysis we use both stochastics and simulation.

This paper is organized as follows. In Section 2, we describe the group membership model, as well as how reliable multicast protocols (particularly epidemic protocols) are structured. The buffer optimization technique is presented in detail, and analyzed stochastically, in Section 3. Section 4 describes how this technique may be incorporated into an existing multicast protocol. In Section 5, we weaken our assumptions and describe a technique to improve the reliability of the optimized protocol without sacrificing the scalability. Simulation results are presented in Section 6. Section 7 describes related work, and Section 8 concludes.
2 Model and Epidemic Protocols
We consider a single group of processes or members. Each member is uniquely identified by its address. Each member has available to it an approximation of the entire membership in the form of a set of addresses. We do not require
that the members agree on the membership, such that a scalable membership protocol such as [vRMH98] suffices to provide this membership information. We consider a non-Byzantine fail-stop model of processes. As is customary, recovery of a process is modeled as a new process joining the membership.

The members can send or multicast messages among each other. There are two kinds of message loss: send omissions and receive omissions. In case of a send omission, no process receives the message. In case of a receive omission, only a corresponding receiver loses the message. Initially, we assume that receive omissions are independent from receiver to receiver and message to message, and occur with a small probability Ploss, and that there are no send omissions. We weaken these assumptions in Section 5.

The members run a reliable multicast protocol that aims to provide all-or-none delivery of multicast messages, that is, to deliver each message to all processes that are up (not failed) at the time of sending. We do not require FIFO or total order on message delivery. All such protocols run in three phases:

1. an initial (unreliable) multicast phase attempts to reach as many members as possible;
2. a repair phase detects message loss and retransmits messages;
3. a garbage collection phase detects message stability and releases buffer space.

Most protocols use a combination of positive or negative acknowledgment messages for the last two phases.

Epidemic multicast protocols accomplish the all-or-none guarantee with high probability by a technique called gossiping. Each member p periodically chooses another member q at random to send a gossip message to, which includes a report of what messages p has delivered and/or buffered. (Every message that is buffered by a process has been delivered by that process, but not necessarily vice versa.) q may update p with messages that q has buffered, but p has not delivered. q may also request from p those messages that p has buffered but q has not delivered.

Garbage collection in epidemic protocols is accomplished by having members only maintain messages in their buffer for a limited time. In particular, members garbage collect a message after a time at which they can be sure, with a specific high probability, that the gossiping has disseminated all messages that were lost during the initial multicast. This time grows O(log n), where n is the size of the membership as the corresponding member observes it [BHO+99].
3 Basic Optimization
In this section, we describe the technique we use to buffer messages on only a subset of the membership. The subset has a desired constant size C. We say desired because, as we shall see, failures and other randomized effects cause messages to be buffered on more or fewer than C members. The subset is not fixed, but randomized from message to message in order to spread the load of buffering evenly over the membership.
We assume that each message is uniquely identified, for example by the tuple (source address, sequence number). Using a hash function H : bitstring → [0...1], we hash tuples of the form ⟨message identifier, member address⟩ to numbers between 0 and 1. This hash function has a certain fairness property, in that for a set of different inputs, the outputs should be unrelated. Cryptographic hashes are ideal, but too CPU-intensive. CRCs (cyclic redundancy checks) are cheap, but the output is too predictable for our purpose: when given the 32-bit big-endian numbers 0, 1, 2, ... as input, the output of CRC-16 is 0, 256, 512, etc. We will describe a hash function that is cheap and appears fair, as well as why we require these properties, in Section 4.

A member with address A and a view of the membership of size n buffers a message with identifier M if and only if H(M, A) × n < C. We call a member that buffers M the bufferer of M. If H is fair, n is correct, and there is no message loss, the expected number of bufferers for M is C. Also, for a set of messages M1, M2, ..., the messages are buffered evenly over the membership. If members agree on the membership, then any member can calculate for any message which members are the bufferers for this message. If members have slightly different memberships, it is possible that they disagree on the set of bufferers for a message, but not by much. In particular, the sets of bufferers calculated by different members will mostly overlap. Also, if C is chosen large enough, the probability that all bufferers fail to receive a message is small.

We will now calculate this probability. For simplicity, we assume that every member agrees on the membership (and is therefore correct), and that this membership is of size n. We consider an initial multicast successful if it is received by all members, or if it is received by at least one bufferer (which can then satisfy retransmission requests). Thus, the probability of success is the sum of the following two independent probabilities:

P1: no members are bufferers, but they all received the initial multicast;
P2: there is at least one member that is a bufferer and that received the initial multicast.

P1 is simple to calculate, as, by fairness of H, "being a bufferer" is an independent event (with probability C/n), as is message loss (with probability Ploss):

    P1 = ((1 − C/n) · (1 − Ploss))^n                                    (1)

P2 can be calculated as follows:

    P2 = P(∃ bufferer that receives M)
       = 1 − P(all processes are not bufferers or lose M)
       = 1 − P(a process is not a bufferer or loses M)^n
       = 1 − (1 − P(a process is a bufferer and receives M))^n
       = 1 − (1 − (C/n) · (1 − Ploss))^n                                (2)
The probability of failure Pfail is then calculated as:

    Pfail = 1 − P1 − P2
          = (1 − (C/n) · (1 − Ploss))^n − ((1 − C/n) · (1 − Ploss))^n    (3)

Assuming Ploss is constant (independent of n), it is easy to see that as n grows, Pfail tends to e^(−C·(1−Ploss)). Thus, given the probability of receive omission, the probability of failure can be adjusted by setting C, independent of the size of the membership. Pfail gets exponentially smaller when increasing C.

In many cases Ploss is a function of group size, as it depends on the size and topology of the underlying network. For example, in a tree-shaped topology, messages have to travel over O(log n) links. If Pll is the individual link loss, then Ploss = 1 − (1 − Pll)^t, where t is the average number of links that the message has to travel (t grows O(log n)). Worse yet, receive omissions are no longer independent from each other. Thus, setting C in this case does depend on n. We discuss a solution to this problem in Section 5, and see how this affects the choice of C in Section 6.
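Equation 3 transcribed directly, together with the obvious way to pick C for a target failure probability (a sketch; Section 6 reports C of roughly 6 to 9 in the simulations):

    #include <math.h>

    /* Equation 3: probability that a multicast neither reaches everybody
     * nor reaches at least one bufferer. */
    double p_fail(double c, double n, double p_loss)
    {
        double recv = 1.0 - p_loss;
        return pow(1.0 - (c / n) * recv, n) - pow((1.0 - c / n) * recv, n);
    }

    /* Smallest C meeting a target, e.g. target = 0.001 as in Section 6. */
    int choose_c(double n, double p_loss, double target)
    {
        int c = 1;
        while (c < n && p_fail(c, n, p_loss) > target)
            c++;
        return c;
    }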
4 Implementation
In this section, we discuss the design of the hash function H that we use, how we integrate our optimization with an epidemic multicast protocol, and how this affects the buffering requirements of the protocol.

As mentioned, the hash function H has to be fair and cheap. It has to be fair, so that the expected number of bufferers for a message is C, and so that the messages are buffered evenly over the membership. It has to be cheap, since it is calculated each time a message is received. Cryptographic hashes are typically fair, but they are not cheap. CRC checks are cheap, but not fair. We therefore had to design a new hash function.

Our hash function H uses a table of 256 randomly chosen integers, called the shuffle table. The input to H is a string of bytes, and the output is a number between 0 and 1. The algorithm is:

    static unsigned int shuffle[256];   /* 256 randomly chosen integers */

    double hash_H(const unsigned char *input, size_t len)
    {
        unsigned int hash = 0;
        for (size_t i = 0; i < len; i++)
            hash ^= shuffle[input[i] ^ (hash & 0xff)];   /* least significant byte */
        return (double)hash / (double)UINT_MAX;          /* UINT_MAX from <limits.h> */
    }

To integrate optimized buffering into an actual epidemic protocol (see Section 2), we have to modify the protocol as follows. Previously, members satisfied the retransmission of a message out of their own buffers. With the optimization, if a member does not have the message buffered locally, it calculates the set of bufferers for the message and picks one at random. The member then sends a retransmission request directly to the bufferer, specifying the message identifier
and the destination address. A bufferer, on receipt of such a request, determines if it has the message buffered. If so, it satisfies the request. If not, it ignores the request. Note that processes still have to maintain some information about the messages they have, or have not, received. In the original protocol, processes had to buffer all messages until they are believed to be stable (a global property). In the optimized protocol, processes only need to remember the identifiers of messages they have received locally. They can do so in sorted lists of records, one list per sender. Each record describes, using two sequence numbers, a range of consecutively received messages. Since there are typically not many senders, and each list will typically be of size 1, the amount of storage required is negligible. The buffering requirements of the epidemic protocol are improved as follows. In the original protocol, the memory requirement for buffering on each member grew O(ρ log n), where ρ is the total message rate and n is the number of participants (assuming fixed sized messages and fixed message loss rate) [BHO+ 99]. This is because the number of rounds of gossip required to spread information fully with a certain probability grows O(log n). In the modified protocol, the buffering requirement on each member shrinks by O(ρ log n/n), since C is constant.
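Putting Sections 3 and 4 together, the bufferer test is a one-line predicate over the hash function above (sketch; the key is assumed to serialize the ⟨message identifier, member address⟩ tuple):

    #include <stdbool.h>
    #include <stddef.h>

    /* A member buffers message M iff H(M, A) * n < C, where key encodes
     * the tuple of M's identifier and the member's own address A. */
    bool is_bufferer(const unsigned char *key, size_t keylen, int n, int c)
    {
        return hash_H(key, keylen) * n < c;
    }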
5 Improvement
Up until now we have assumed that the only message loss was due to rare and independent receive omissions. In this section, we will suggest an improved strategy in order to deal with more catastrophic message loss, without sacrificing the advantageous scaling properties. The improvement consists of two parts.

The first part is to maintain two message buffers. The so-called long-term buffer is as before: in it, messages are kept for which the corresponding process is a bufferer. The short-term buffer is a buffer in which all messages are kept in FIFO order as they are received, for some fixed amount of time. (Since messages are kept for a fixed amount of time, the size of this buffer is linearly dependent on the message rate ρ, but independent of group size.) Both buffers can be used for retransmissions.

The second part involves an improvement to the initial multicast phase. The idea is to detect send omissions or large dependent receive omission problems, and retransmit the message by multicasting it again (rather than by point-to-point repairs). Such strategies are already built into multicast protocols such as bimodal multicast [BHO+99] and SRM [FJL+97]. Thus, the initial multicast phase is subdivided into three subphases:

1a. an unreliable multicast attempts to reach as many members as possible;
1b. detection of catastrophic omission;
1c. multicast retransmission if useful.

For example, in a typical epidemic protocol this can be done as follows. Members detect holes in the incoming message stream by inspecting sequence
numbers. They include information about holes in gossip messages. When a member receives a gossip with information about a hole that it has detected as well, it sends a multicast retransmission request to the sender. The probability of this happening is low in case of a few receive omissions, but high in the case of a catastrophic omission. The sender should still have the message in its short-term buffer to satisfy the retransmission request. Since these retransmission requests are only triggered by randomized gossip messages, they will not lead to implosion problems such as those seen in ack- or nak-based protocols.

These two parts, when combined, lead to two significant improvements. First, they make catastrophic loss unlikely, so that the assumptions of the original basic optimization are approximately satisfied. Secondly, since most message loss is detected quickly, retransmissions will typically be satisfied out of the short-term buffer without the need for retransmission requests to bufferers. The long-term buffer is only necessary for accomplishing all-or-none semantics in rare failure scenarios.
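A boolean abstraction of the gossip-triggered detection rule just described (the names are ours):

    #include <stdbool.h>

    /* A member asks the sender for a multicast retransmission only when a
     * hole it has detected in its own stream is also reported in an
     * incoming gossip message: rare for isolated receive omissions, but
     * likely for send omissions or correlated losses. */
    bool request_multicast_repair(bool hole_seen_locally, bool hole_in_gossip)
    {
        return hole_seen_locally && hole_in_gossip;
    }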
6 Simulation Results
Fig. 1. (a) a tree topology; (b) a transit-stub topology.
To validate our techniques, we have simulated bimodal multicast [BHO+99] with and without our buffering optimization. This protocol follows the description of epidemic protocols in Section 2 closely, and contains a multicast retransmission scheme similar to the one described in Section 5. For our simulations, we used the ns2 network simulator [BBE+99], and multicast messages from a single sender. In all experiments, we set C so that Pfail ≈ 0.1%, based on the link loss probability and the number of members (see Equation 3).
We simulated on two different network topologies (see Figures 1(a) and (b)): a pure tree topology, with the sender located at the root of the tree, and a transit-stub topology (generated by the ns2 gt-itm topology generator), with the sender located on a central node. The transit-stub topology is more representative of the Internet than is the tree topology. The topology has an influence on the probability of message loss, but as we shall see, the overall effect of these two topologies on the buffering requirements is similar.

In Figures 2 and 3, we show the required amount of buffer space (the maximum number of messages that needed to be buffered) per member as a function of group size. In all these experiments, the individual link-loss probability in the network is 0.1%. In these cases, C ≈ 6. The graphs for the original bimodal multicast are labeled "pbcast-ipmc," while the buffer optimized multicast graphs are labeled "pbcast-hash." Figure 2 uses a rate of 25 messages/sec. In (a) we used the tree topology, and in (b) the transit-stub topology. Figure 3 shows the same graphs for 100 messages/sec. We find not only that the buffering optimization greatly reduces the memory requirements on the hosts, but also that the buffering behavior is more predictable.

To see the effect of message rate and noise rates more clearly, see Figure 4. In both experiments we used a tree topology with 100 members. In (a), the link-loss probability is still fixed at 0.1%, while the message rate is varied from 25 to 100 messages/sec. In (b), the message rate is fixed at 100 messages/sec, but the link loss probability is varied from 0.1% to 1.5%. At Pll = 1.5%, we find that C ≈ 9.

Figure 5 shows that the buffer optimization significantly reduces the memory requirements on each individual host, and also that the buffering responsibility is spread evenly over all members. Again, we are using a tree topology with 100 members. We show how much was buffered on each one of the members using both the original protocol and the optimized one. In (a), the message rate is 25 messages/sec, while in (b) it is 100 messages/sec.

In Figure 6, we show on how many locations each of the messages numbered 1000-1500 was buffered, for two different link-loss probabilities: (a) 0.1% and (b) 1.5%. With larger loss, it is necessary to buffer messages in more locations in order to get the same Pfail probability. Because of this, the probability that nobody buffers a message (1 − P2) is actually smaller for situations with larger loss. The graphs demonstrate this clearly. Note that although, in (a), three messages were not buffered anywhere, this does not imply that the messages were not delivered to every member. In fact, all three messages were correctly delivered.
Fig. 2. The required amount of buffer space per member as a function of group size. In (a) we use a tree topology, while in (b) we use a transit-stub topology. The message rate is 25 messages/sec.
Fig. 3. Same as Figure 2, but for 100 messages/sec.
Fig. 4. In (a), we show the average amount of buffer space per member as a function of message rate. In (b), we show the buffer space as a function of link loss probability.
Fig. 5. This graph shows, for each member, how much buffer space was required. In (a), the message rate is 25 messages/sec, while in (b) the message rate is 100 messages/sec.
Fig. 6. This graph shows, for 500 messages starting at message 1000, on how many locations each of these messages was buffered. In (a), the link loss probability is 0.1%, while in (b) this probability is 1.5%.
7 Related Work

Work on buffering in group communication can be classified in three categories:

1. Multicast flow control techniques attempt to control the amount of buffering using rate- or credit-based mechanisms;
2. Stability optimization techniques attempt to minimize the time to achieve and detect stability of messages, thereby reducing the time that messages are buffered;
3. Memory reduction techniques attempt to minimize the amount of buffer memory necessary.

In the first category, a good representative paper is by Mishra and Wu [MW98]. They study the effect on buffering of rate- and credit-based flow control in both ACK- and NAK-based multicast protocols using simulation. They conclude that rate-based flow control techniques are generally best. We note that flow control is mostly orthogonal to the buffer optimization. Flow control is an adaptive mechanism, intended to deal with varying availability of resources. These resources include CPU and memory resources on end-hosts and routers, and bandwidth availability on network links. Buffer optimization is not adaptive, but as it reduces the use of memory resources on the end-hosts, it will have an impact on flow control. (Note that although related, congestion control deals with buffering requirements on routers rather than end-hosts, and is therefore not discussed here further.)

In the second category, all reliable communication protocols attempt to optimize the time to achieve stability. Mishra and Kuntur [MK99] present a general technique, which they call Newsmonger, to improve the time to detect stability. This is important when the application requires uniform or safe delivery of messages. As a beneficial side-effect, it also reduces the amount of time that messages need to be buffered. The Newsmonger is a token that rotates among the members, and can be applied to any reliable multicast protocol that provides membership agreement of some sort. The technique, when combined with our buffering optimization, is still useful to improve the latency of uniform delivery.

Our buffer optimization technique belongs in the third category. The best known work in this category is a general protocol model called Application Level Framing (ALF) [CT90]. ALF leaves many reliability decisions to the application, rather than providing an abstraction of a reliable multicast channel. SRM [FJL+97] is a well-known implementation of a multicast facility in the ALF model, and is used in various tele-conferencing applications. SRM does not buffer or order messages, but provides call-backs to the application when it detects that a message is lost. It is the application that decides whether and how it wants to retransmit the message. Rather than buffering messages, the application may be able (and, in current SRM applications, usually is able) to regenerate messages based on its state. In contrast to our work, SRM does not provide all-or-none delivery with any known level of reliability.
8 Conclusion
In this paper, we presented a technique that significantly optimizes buffer requirements in reliable multicast protocols. In particular, the buffer requirements on a host are reduced by a factor of n/C, where n is the size of the group and C is a small constant giving the number of sites where a message should be buffered (typically on the order of about 10). The reliability guarantees of the protocol are slightly adversely affected, but the probability of this can be calculated and limited by choosing a suitable C. Using simulation, we have demonstrated that this technique is highly effective.

We have described how buffer optimization can be incorporated into an epidemic multicast protocol such as bimodal multicast [BHO+99]. In the future, we would like to study the impact of our optimization on other, non-epidemic, reliable multicast protocols. Since such protocols do not allow occasional violations of the all-or-none delivery guarantee, additional mechanisms may be necessary. For example, in so-called virtually synchronous protocols, the conflict can be solved by simulating a partition in the membership if a message cannot be recovered. Since such events should be rare, this may be an acceptable solution, as these protocols already deal with real network partitions.
References

[BBE+99] S. Bajaj, L. Breslau, D. Estrin, K. Fall, S. Floyd, P. Haldar, M. Handley, A. Helmy, J. Heidemann, P. Huang, S. Kumar, S. McCanne, R. Rejaie, P. Sharma, K. Varadhan, Y. Xu, H. Yu, and D. Zappala. Improving simulation for network research. Technical Report 99-702, Univ. of Southern California, March 1999.
[BHO+99] K.P. Birman, M. Hayden, O. Ozkasap, Z. Xiao, M. Budiu, and Y. Minsky. Bimodal multicast. ACM Transactions on Computer Systems, 17(2):41-88, May 1999.
[CT90] D.D. Clark and D.L. Tennenhouse. Architectural considerations for a new generation of protocols. In Proc. of the '90 Symp. on Communications Architectures & Protocols, pages 200-208, Philadelphia, PA, September 1990. ACM SIGCOMM.
[DGH+87] A. Demers, D. Greene, C. Hauser, W. Irish, J. Larson, S. Shenker, H. Sturgis, D. Swinehart, and D. Terry. Epidemic algorithms for replicated database maintenance. In Proc. of the Sixth ACM Symp. on Principles of Distributed Computing, pages 1-12, Vancouver, British Columbia, August 1987. ACM SIGOPS-SIGACT.
[FJL+97] S. Floyd, V. Jacobson, C.-G. Liu, S. McCanne, and L. Zhang. A reliable multicast framework for light-weight sessions and application level framing. IEEE/ACM Transactions on Networking, 5(6):784-803, December 1997.
[GLW94] R.A. Golding, D.D. Long, and J. Wilkes. The Refdbms distributed bibliographic database system. In USENIX Winter 1994 Technical Conference Proceedings, January 1994.
[HT94] V. Hadzilacos and S. Toueg. A modular approach to the specification and implementation of fault-tolerant broadcasts. Technical Report TR94-1425, Department of Computer Science, Cornell University, 1994.
[LP96] J.C. Lin and S. Paul. RMTP: A reliable multicast transport protocol. In Proc. of IEEE INFOCOM'96, pages 1414-1424, March 1996.
[MK99] S. Mishra and S.M. Kuntur. Improving performance of atomic broadcast protocols using the newsmonger technique. In Proc. of the 7th IFIP International Working Conference on Dependable Computing for Critical Applications, pages 157-176, San Jose, CA, January 1999.
[MW98] S. Mishra and L. Wu. An evaluation of flow control in group communication. IEEE/ACM Transactions on Networking, 6(5), October 1998.
[PST+97] K. Petersen, M.J. Spreitzer, D.B. Terry, M.M. Theimer, and A.J. Demers. Flexible update propagation for weakly consistent replication. In Proc. of the Sixteenth ACM Symp. on Operating Systems Principles, pages 288-301, Saint-Malo, France, October 1997.
[vRMH98] R. van Renesse, Y. Minsky, and M. Hayden. A gossip-style failure detection service. In Proc. of Middleware'98, pages 55-70. IFIP, September 1998.
Native IP Multicast Support in MPLS

Arup Acharya¹ and Frédéric Griffoul²

¹ C&C Research Labs, NEC USA, Princeton NJ 08540, USA
² NPDL-E, NEC Europe Ltd., 69115 Heidelberg, Germany
Abstract. Multicast support in a Multiprotocol Label Switching (MPLS) network has yet to be defined. An MPLS network consists of label switching devices such as ATM switches. This document discusses both dense-mode and sparse-mode native IP multicast within the context of MPLS networks. Unlike unicast routing, dense-mode multicast routing trees are established in a data-driven manner and it is not possible to topologically aggregate such trees, which are rooted at different sources. In sparse-mode multicast, source-specific trees may coexist with a core or shared tree, and it is not possible to assign a common label to traffic from different sources on a branch of the shared tree. This leads us to suggest a per-source traffic-driven label allocation scheme for supporting all three types of multicast routing trees (dense mode, shared tree, source tree) in an MPLS network. Note that we focus on native multicast support and that our scheme is not to be applied when MPLS-based tunnels are appropriate, for instance to support multicast in MPLS-based VPNs.
1 Introduction
IP switching technology allows for an efficient and scalable operation of IP directly on label switching hardware such as ATM. Such an approach is currently being standardized at the IETF as Multiprotocol Label Switching (MPLS). The standardization efforts have been primarily focused on topology-based aggregation schemes for unicast traffic. Efficient support for multicast over label switching hardware is still an open problem, both within the MPLS working group and the research community. This paper first describes why label switching for multicast traffic is vastly different from topology-based schemes for unicast, and then presents a solution for both dense-mode and sparse-mode multicast. The key objective is to use multicast switching hardware, such as in ATM switches, to forward IP multicast packets at layer 2 (L2) with a minimal resort to layer 3 (L3) forwarding.

The MPLS terminology generalizes the notion of IP flow by defining the concept of Forwarding Equivalence Class (FEC). The association between a FEC and a label used to forward the FEC datagrams is called label binding. In the ATM case, a label is the VPI/VCI cell field identifying an ATM Virtual Circuit (VC). An ATM switch controlled by an IP MPLS module is called an ATM Label Switched Router (LSR). Refer to [1] for an exhaustive MPLS terminology description. The discussion in this paper applies to any label-switching device, including ATM. We use the term "label" synonymously with "VCI" in this paper.
The MPLS specification drafts ([1], [2]) advocate a topology-driven procedure to map Layer 3 IP unicast routes onto Layer 2 switched paths, for instance ATM VCs. The key points of MPLS unicast forwarding are the following:

- Routing table updates trigger the creation or destruction of label bindings.
- The label bindings are advertised using either a dedicated Label Distribution Protocol (LDP) or piggy-backing in existing control protocols such as RSVP and BGP.

As a consequence, a label binding exists before any data is received by the LSR; thus all the packets are switched at layer 2. However, unlike unicast routing, dense-mode multicast routing trees are established in a traffic-driven manner and it is not possible to topologically aggregate such trees. In sparse-mode multicast, source-specific trees may coexist with a shared tree, and thus it is not possible to always assign a common label to traffic from different sources on a branch of the IP shared tree.
2 Multicast Dense Mode Support
In this section, we describe dense mode label (VC) binding and release events, without referring to any specific label distribution procedure.

2.1 PIM-DM Support
In PIM-DM ([5]), a source-specific shortest path or (S, G) tree is created with the arrival of a multicast packet from a source S to a group G. PIM-DM characteristics are the following:
1. There is no (S, G) routing entry prior to the arrival of data from S.
2. It is not possible to aggregate several (S, G) entries for the same group when the incoming and outgoing interfaces of the entries are different.
3. A given routing table entry changes dynamically (even without any change in the unicast network topology) due to periodic pruning of branches and/or arrival of new members and/or source inactivity.
4. All packets are forwarded at the IP level until incoming and outgoing labels are assigned to the (S, G) entry.
Points (1) and (3) lead us to conclude that label assignment for dense-mode flows needs to be hop-by-hop and traffic-driven. From (2), each (S, G) entry needs to be assigned a separate incoming label. When the first packet from source S to destination G is received by an LSR, multicast IP forwarding carries out the RPF check and creates an (S, G) entry in the multicast routing table (MRT). Once this (S, G) entry exists, the procedure to bind a label to the (S, G) FEC is activated. (RPF: Reverse Path Forwarding. This mechanism checks whether a multicast packet is received on the interface which is on the shortest path to the source.)
From (3), a label binding destruction is triggered in two cases: either when receiving/emitting a Prune(S, G) or when the activity timer expires. Due to (4), the label bindings need to be done as quickly as possible, to keep IP forwarding to a minimum. Once the label binding procedure successfully completes, all subsequent (S, G) packets are forwarded in ATM hardware. Arrival of a PIM Graft(S, G) message requires adding an outgoing branch to the existing point-to-multipoint VC carrying the (S, G) traffic. The (S, G) forwarding state is associated with an activity timer, which is used to remove inactive (S, G) entries, i.e. flows with no traffic during a specified amount of time. In an IP router, this is achieved by resetting the timer whenever a packet is forwarded using the (S, G) entry. When forwarding traffic in switched mode, no traffic will be observed at the IP level; therefore, the timer has to be reset based on forwarding activity on the LSP. When the timer expires, both the label and the (S, G) MRT entry are removed or reclaimed.
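As an illustration of the event handling just described, the following sketch shows the traffic-driven (S, G) binding lifecycle in present-day pseudocode (Python). It is our reading of the scheme, not the authors' implementation: the label allocator, the RPF lookup, and the L3 routing decision are stubbed out, and all names are hypothetical.

import itertools

class DenseModeLSR:
    def __init__(self):
        self.mrt = {}                       # (S, G) -> {"label", "oifs"}
        self._labels = itertools.count(33)  # stub label (VCI) allocator

    def rpf_check(self, src, iif):
        return True                         # stub: packet assumed to arrive on the RPF interface

    def compute_oifs(self, src, group):
        return {2, 3}                       # stub: L3 routing decision

    def on_packet(self, src, group, iif):
        entry = self.mrt.get((src, group))
        if entry is None:
            # First packet: L3 forwarding performs the RPF check, creates
            # the (S, G) MRT entry, and the label binding procedure runs.
            if not self.rpf_check(src, iif):
                return
            self.mrt[(src, group)] = {"oifs": self.compute_oifs(src, group),
                                      "label": next(self._labels)}
        # Subsequent (S, G) packets are switched at L2; the activity timer
        # must be refreshed from LSP-level activity, not IP forwarding.

    def on_prune(self, src, group):
        # Prune(S, G), or activity timer expiry: destroy the binding and the entry.
        self.mrt.pop((src, group), None)

    def on_graft(self, src, group, oif):
        # Graft(S, G): add a branch to the point-to-multipoint LSP.
        self.mrt[(src, group)]["oifs"].add(oif)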
2.2 DVMRP Support
DVMRP ([6]) is supported in the same fashion as PIM-DM: both are flood-and-prune techniques which create an (S, G) entry in the MRT on arrival of the first data packet. The difference between the two is mainly at the IP level, e.g. DVMRP uses RIP-specific information to disambiguate equal-cost paths, while PIM-DM uses explicit PIM-Assert messages. Our proposed mechanism for PIM-DM is equally applicable to setting up the label switched path when the multicast protocol is DVMRP.
3 Multicast Sparse Mode Support
We consider PIM-SM in this section, as specified in [7]. Support for shared-tree-only protocols like Core-Based Tree (CBT) is for further study. Unlike dense mode, a multicast routing entry already exists in a sparse mode tree prior to the arrival of data packets, since the IP group membership is propagated along the multicast tree by explicit PIM Join/Prune messages.

3.1 Previous Work for PIM-SM
There has already been some work on the support of PIM-SM in MPLS. One approach, described in [8], was first proposed for Cisco's Tag Switching technology: it suggests a piggy-backing methodology to assign and distribute labels for sparse-mode trees. The idea is that PIM Join/Prune messages are augmented to carry labels. (MRT: Multicast Routing Table.)
There are pros and cons to PIM piggybacking: the basic advantage is to tightly synchronize the L3 route set-up and the L2 LSP establishment, without introducing additional control messages. However, it obviously requires a non-trivial change in the PIM protocol. Moreover, such an approach cannot work for dense-mode protocols, which have no explicit Join messages (see Sect. 2). A more detailed discussion of piggybacking disadvantages can be found in the MPLS multicast framework draft [12]. As we discuss below, it is not always possible to assign a single label, common to all sources, for PIM sparse-mode shared trees, and thus providing a correct label allocation with the piggybacking approach is not trivial.
3.2 The (*, G)/(S, G) Co-existence Problem
PIM-SM allows receivers to join a shared (*, G) tree for the group G, with a common Rendezvous Point (RP) as the root, or a shortest-path (S, G) tree rooted at a specific source S. A receiver may thus receive traffic for a given source S through the (S, G) tree, and for other sources through the (*, G) tree. In an MPLS context, a problem arises when a node on the (*, G) tree needs to forward data differently depending on the source. Figure 1 shows an example. There are two members of the group G, H1 and H2, attached to their respective Designated Routers (DR) R2 and R3. Two sources S1 and S2 transmit their traffic to the Rendezvous Point RP. Let us consider the case when R2 decides to join the source-specific tree for S1. It does so by first sending a Join(S1, G) towards S1. Then, when receiving the first packet of S1 on the incoming interface 3 (while the shared tree incoming interface is 1), it sends a Prune(S1, G) message to R3, which is forwarded upstream towards the RP. This Prune message results in R3 forwarding data traffic from S1 on interface 2 only, while traffic from S2 is forwarded on both interfaces 2 and 3 (since there is no source-specific join to S2 by any of the receivers H1 or H2). The multicast routing entries on R3 are:
(*, G):  iif = (1), oif = (2, 3)
(S1, G): iif = (1), oif = (2)
To accomplish the same forwarding behavior at L2, a common label cannot be assigned to all the traffic on R3's incoming link 1; the traffic from S1 on R3's interface 1 must be assigned a label distinct from that of S2. Otherwise, S1 traffic would be duplicated at H1. Such selective forwarding may be necessary at different points of the shared tree depending on the source of the traffic. For PIM-SM, a naive topology-driven label assignment thus leads to incorrect data delivery. In the next subsections we describe two schemes to overcome this problem. (The recommended policy for a router with directly connected members is to switch from the RP-tree to the SP-tree after receiving a significant number of data packets from a particular source during a specified time interval.)
Fig. 1. (*, G)/(S, G) coexistence in R3 and R2
3.3 PIM-SM Extensions
In a recent draft ([9]), the piggybacking approach has been updated to take into account the (*, G)/(S, G) co-existence problem previously described. Essentially, the Join/Prune PIM message is again extended to percolate a new label binding on the shared tree for a source, if any receiver sends a Prune for that source. In our example (Fig. 1), this PIM modification means the Prune(S1, G) message transmitted by R2 is propagated up to the RP, carrying a label assignment for the packets from S1.
As specified in [7], this propagation to the RP is not needed (for instance in the case of Fig. 1). Note that since PIM-SM is a soft-state protocol, the Prune message has to be sent periodically up to the RP. This approach leads to a per-source signaling solution.
3.4 Per-source Label Assignment
To solve the (S, G)/(*, G) coexistence problem without resorting to IP forwarding, source-specific labels are to be assigned on intermediate nodes of the shared tree. Multiple labels will be associated with one (*, G) entry, corresponding to one label per active source. In order to unambiguously distinguish a per-source (*, G) label binding from a (S, G) binding, we propose to introduce a (S, G)' FEC representing IP packets from source S forwarded on the (*, G) tree. Since PIM manages only one entry timer per route, the MPLS module needs to maintain additional per-source activity timers, one for each LSP. When a (S, G)' timer expires, the corresponding label binding is deleted. The removal of the (*, G) entry releases all the remaining (S, G)' label bindings. When a new member joins the shared tree, the (*, G) entry of an LSR may get an additional outgoing interface (oif). A new branch must then be added to each (S, G)' LSP. The switch from a shared tree to a shortest path tree is handled as follows. If the trees fully overlap, a new (S, G) LSP is set up when the first packet from S arrives at the ingress node, since the packet matches the (S, G) entry; the (S, G)' LSP will time out due to inactivity, or can be released as soon as the (S, G) LSP is set up. If the trees do not overlap, a Prune(S, G) for a (*, G) oif removes the corresponding (S, G)' LSP branch. If the (S, G)' oif list becomes empty, the (S, G)' binding is released. PIM-SM allows a sender to transmit packets either as encapsulated messages (PIM-Register) to the RP, or as native multicast (typically when the RP joins the source-specific tree). In the former case, an end-to-end LSP cannot be created, since a unicast path between the source and the RP may have been set up; moreover, the data packets need to be decapsulated at the IP level. In the latter case, an end-to-end switched path can be established.
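The per-source state proposed here can be sketched as follows. The sketch assumes a Python representation with one (*, G) entry holding multiple (S, G)' bindings, each with its own activity timer; the class and the timeout value are illustrative assumptions, not part of the proposal.

import itertools
import time

class SharedTreeEntry:
    TIMEOUT = 210.0                     # assumed per-source activity timeout (seconds)

    def __init__(self, oifs):
        self.oifs = set(oifs)           # (*, G) outgoing interfaces
        self.source_labels = {}         # S -> [label, last_activity]; the (S, G)' bindings

    def label_for(self, src, alloc):
        # Bind a distinct label per active source on the (*, G) tree; the
        # (S, G)' binding is created on the first traffic from S.
        if src not in self.source_labels:
            self.source_labels[src] = [alloc(), time.monotonic()]
        binding = self.source_labels[src]
        binding[1] = time.monotonic()   # refresh the per-source activity timer
        return binding[0]

    def expire(self):
        # An expired (S, G)' timer deletes only that source's binding.
        now = time.monotonic()
        for src in [s for s, (_, t) in self.source_labels.items() if now - t > self.TIMEOUT]:
            del self.source_labels[src]

    def add_member_branch(self, oif):
        # A new (*, G) oif requires adding a branch to every (S, G)' LSP.
        self.oifs.add(oif)
        return list(self.source_labels)  # sources whose LSPs need the new branch

    def release_all(self):
        # Removing the (*, G) entry releases all remaining (S, G)' bindings.
        self.source_labels.clear()

For instance, two sources sending on the same shared tree obtain distinct labels, which is exactly what the co-existence problem of Sect. 3.2 requires:

alloc = itertools.count(100).__next__
g = SharedTreeEntry(oifs={2, 3})
assert g.label_for("S1", alloc) != g.label_for("S2", alloc)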
4 Multicast Label Distribution
In order to avoid label distribution by piggy-backing in the multicast routing protocols, we propose two solutions for label distribution:
- Upstream implicit distribution.
- Downstream on demand explicit assignment.
(This is not a PIM-SM specification violation: a node with distinct (S, G) and (*, G) iifs sends a Prune(S, G) upstream when receiving the first (S, G) packet on the (S, G) iif.)
4.1 Upstream Implicit Distribution
In this method, when a multicast-capable LSR receives a packet with a label that has no current binding on the incoming interface, L3 processing is invoked. In the rest of the document, we use the term Unused Label (UL) to denote a free multicast label, i.e. a label within the multicast label range with no current binding. When a multicast-capable LSR detects a new multicast flow (for example at the edge of the MPLS cloud), it invokes L3 routing to determine the outgoing interfaces. For each outgoing interface, it selects a UL and binds the UL to the corresponding multicast tree. It then forwards the packet downstream. A downstream LSR receives the packet with the UL, invokes L3 routing (since the incoming label has no binding) to determine the outgoing interfaces, and selects a UL for each of those interfaces. An entry is added to the label table consisting of the incoming interface/label and the outgoing interfaces/labels. Subsequent traffic on the corresponding multicast tree is label-switched at L2. In Fig. 2, consider a new multicast flow that arrives on interface 1: the UL selected by the upstream LSR is A, and reception of the packet invokes L3 processing. As a result of L3 processing, interfaces 2, 3 and 4 are selected as the outgoing interfaces. ULs X, Y and Z are then picked for interfaces 2, 3 and 4 respectively, and a copy of the packet is forwarded on each of those interfaces with the corresponding labels. An entry is added to the label table:

in: (interface 1, label A)  ->  out: (interface 2, label X), (interface 3, label Y), (interface 4, label Z)
Subsequent packets that arrive at interface 1 with label A are switched at L2, without invoking L3 processing. Thus, only the first packet undergoes L3 processing. Note that this scheme works well for both point-to-point and multi-access interfaces. A partitioned label space between multicast and unicast traffic avoids a situation where a label l is allocated by a downstream LSRd for unicast traffic from LSRu1, and is then subsequently allocated by another LSRu2 for multicast traffic downstream. A disjoint label space amongst multicast LSRs ensures no two LSRs assign the same label on a common multi-access link, e.g. LSRu1 and LSRu2 (Fig. 3). [10] describes a solution; however, it augments PIM-Hello messages to achieve disjoint multicast labels across PIM-capable LSRs on a multi-access link. Alternatively, it is possible to add some extensions to the LDP ([3]) initialisation protocol to achieve label partitioning. Moreover, since there can only be one forwarder on the link for a given (S, G), a per-source upstream label binding requires no further coordination among multicast LSRs on a common link.
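The forwarding side of this scheme can be sketched as follows (Python); the label range, the interface numbering, and the stubbed L3 routing decision are assumptions used only to mirror the example of Fig. 2.

class ImplicitUpstreamLSR:
    def __init__(self, multicast_label_range):
        self.range = multicast_label_range
        self.free = {}        # interface -> set of unused labels (ULs)
        self.table = {}       # (iif, label) -> [(oif, label), ...]

    def free_labels(self, iface):
        return self.free.setdefault(iface, set(self.range))

    def l3_route(self, packet):
        return [2, 3, 4]      # stub: L3 multicast routing decision

    def on_packet(self, iif, label, packet):
        key = (iif, label)
        if key not in self.table:
            # Incoming label has no binding: invoke L3 processing and pick
            # a UL per outgoing interface (a real LSR would also handle
            # label exhaustion here).
            out = [(oif, self.free_labels(oif).pop()) for oif in self.l3_route(packet)]
            self.free_labels(iif).discard(label)   # the label is now bound on iif
            self.table[key] = out
        # Subsequent packets with this (iif, label) are switched at L2.
        return self.table[key]

Replaying the example above, a first packet carrying an unused label on interface 1 triggers L3 processing and installs the entry; any later packet with the same label is forwarded without L3 processing:

lsr = ImplicitUpstreamLSR(range(64, 128))
first = lsr.on_packet(iif=1, label=65, packet=b"...")   # invokes L3 routing
assert lsr.on_packet(iif=1, label=65, packet=b"...") == first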
Fig. 2. Implicit upstream label assignment
Fig. 3. Multi-access interfaces
Once a label l has been assigned on an LSR's outgoing interface, there needs to be a mechanism to reclaim that label. To prevent traffic from being switched along the wrong LSP, it is sufficient that the following relation holds: if l is a UL on an outgoing interface of LSRu, then l must also be a UL on the corresponding incoming interface of any LSRd on the same link as LSRu. Note that traffic is not forwarded incorrectly at L2 if l is a UL on LSRd's incoming interface but not a UL on LSRu's outgoing interface: in this case, any traffic that LSRu sends with label l invokes L3 processing at LSRd. In this multicast solution for MPLS, we need to ensure that a label is first reclaimed as a UL on the downstream LSR before the upstream LSR. Additionally, whenever a branch of the multicast (L3) routing tree is deleted (e.g. explicit PIM Prune messages, or deletion of an outgoing interface in an MRT entry due to non-arrival of PIM-Join), this triggers an immediate reclamation of the L2 label without additional LDP messages. Thus, our solution is aggressive in both
assigning and reclaiming labels, without sacrificing correct forwarding behavior at L2.

4.2 LDP-based Downstream On Demand
An alternative scheme to assign labels to multicast flows is to use the Label Distribution Protocol (LDP), a new protocol defined to distribute MPLS unicast labels ([3]). LDP provides control messages to explicitly request and assign labels associated with a unicast FEC. LDP can be extended to support multicast FECs as well: when an LSR detects a multicast flow, it sends a Label Mapping LDP message to the upstream LSR, assigning a label to the flow. Until a label is assigned to the multicast flow, packets for that flow are forwarded at L3, using a default label, like VPI=0, VCI=32 in the ATM case ([4]). As currently defined in [3], LDP operates over a point-to-point (TCP) reliable connection between adjacent LSRs: on a multi-access link, like Ethernet, the LDP Label Mapping message has to be sent as a link-local multicast so that only one of the downstream LSRs sends a label binding upstream. Thus an LDP modification is required to support multicast on multi-access networks, while the current TCP-based message exchange can be used as is for point-to-point interfaces like ATM.
Fig. 4. Label allocation messages (traffic or a Label Request travels downstream; a Label Mapping is returned upstream)
In addition to label binding triggered by the arrival of a new flow, an LSR must activate a label binding when receiving a PIM-DM Graft message or a PIM-SM Join(*, G) (if the oif list is modified), as mentioned in Sects. 2.2 and 3.4. In order to use LDP for multicast label allocation, two new FEC elements need to be defined:
- the source-group element, type 0x04;
- the group element, type 0x05.
The source-group element corresponds to the dense mode and sparse mode source-specific multicast routing entries. The TLV encoding follows the FEC TLV specified in [3]:

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|      0x04     |        Address Family         |     Length    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                         Source Address                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                          Group Address                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
The "Address Family" field encodes the address family of both the source and the group address, as specified in [14]. The "Length" field is the length in bits of the source/group addresses that follow. The group element represents the sparse mode, shared multicast routing entry. Although never used in a Mapping message (since we use per-source L2 trees), it can be included in a Label Release message to reclaim all the labels associated with a (*, G) MRT entry, for instance when the MRT entry is deleted. Its encoding is similar to the (S, G) FEC TLV.
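For concreteness, the following sketch encodes the source-group FEC element for IPv4 according to our reading of the layout above (a 1-octet type 0x04, a 2-octet address family, a 1-octet length in bits, then the source and group addresses). The helper name and the field packing are illustrative, not a normative implementation.

import socket
import struct

def source_group_fec(src: str, grp: str) -> bytes:
    family = 1                        # IPv4, per the Assigned Numbers RFC [14]
    length = 32                       # length in bits of the source/group addresses
    header = struct.pack("!BHB", 0x04, family, length)
    return (header
            + socket.inet_aton(src)   # Source Address
            + socket.inet_aton(grp))  # Group Address

# Example: the FEC element for (S, G) = (10.0.0.1, 224.1.1.1)
element = source_group_fec("10.0.0.1", "224.1.1.1")
assert element[0] == 0x04 and len(element) == 12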
4.3 Label Reclamation
When an LSR receives a Label Withdraw from a downstream LSR, the corresponding outgoing LSP branch is removed. Upon reception of a Label Release from upstream, an LSR deletes the indicated LSP.

Fig. 5. Label Release messages (an optional Label Withdraw travels upstream towards LSRu; a Label Release, triggered by a Prune or by inactivity, travels downstream)
Sending a Label Withdraw is normally triggered by the removal of a (S, G) or (*, G) MRT entry for which a label binding has been distributed. In the event that the activity timer associated with a label binding expires, the LSR will send a Label Withdraw to its upstream LSR.
5 Conclusion
In this paper, we first make the following observations for existing multicast routing protocols (PIM, DVMRP, MOSPF):
- Dense-mode trees are created in a data-driven fashion; no L3 messages are used to create the tree.
- Dense-mode trees are created on a per-source basis, with no known mechanisms to aggregate different (S, G) trees.
- Source-specific sparse-mode trees are set up via explicit L3 control messages, but like dense-mode trees, multiple (S, G) trees cannot be aggregated.
- Nodes of a shared sparse-mode tree may forward traffic selectively based on the traffic source.

From these observations, it appears that the (S, G) structure of DM and source-specific SM trees at L3 favors a per-source label assignment. Sparse-mode shared trees should also be mapped to per-source LSPs, to avoid L3 routing at intermediate nodes of the shared tree. This led us to suggest a per-source LSP setup that is applicable to all three tree types. No changes are needed to any L3 routing protocol. Further, at the level of individual nodes, we observe that:
- Data-driven creation of an MRT entry at DM tree nodes can be coupled with label assignment, thus avoiding L3 processing beyond the first packet.
- PIM-Prune messages can be exploited to trigger immediate reclamation of labels on the upstream and downstream nodes of the pruned branch (DM or SM).
- Nodes on a shared SM tree need to perform data-driven per-source label assignment, since the sources are not known a priori.

As a result, we presented a basic building block, using the dual notions of unused labels and implicit binding, to achieve a data-driven, per-source LSP that binds labels to flows at the earliest possible time, i.e. on the first packet. This architecture has been submitted to the IETF MPLS Working Group ([11]), and a prototype implementation of both the implicit and the LDP schemes has been built on a Linux-based MPLS platform, controlling the hardware of an external ATM fabric via the General Switch Management Protocol ([13]). A comparison with other multicast label distribution approaches is also available in the MPLS multicast framework document ([12]).
References
1. Rosen E., Viswanathan A., Callon R.: Multiprotocol Label Switching Architecture. draft-ietf-mpls-arch-06.txt, August 1999.
2. Callon R., Doolan P., Feldman N., Fredette A., Swallow G., Viswanathan A.: A Framework for Multiprotocol Label Switching. draft-ietf-mpls-framework-05.txt, September 1999.
3. Andersson L., Doolan P., Feldman N., Fredette A., Thomas B.: LDP Specification. draft-ietf-mpls-ldp-05.txt, June 1999.
4. Davie B., Lawrence J., McCloghrie K., Rekhter Y., Rosen E., Swallow G.: MPLS using ATM VC Switching. draft-ietf-mpls-atm-02.txt, April 1999.
5. Deering S., Estrin D., Farinacci D., Jacobson V., Helmy A., Meyer D., Wei L.: Protocol Independent Multicast Version 2 Dense Mode Specification. draft-ietf-pim-v2-dm-03.txt, June 1999.
6. Waitzman D., Partridge C.: Distance Vector Multicast Routing Protocol. RFC 1075, November 1988.
7. Estrin D., Farinacci D., Helmy A., Thaler D., Deering S., Handley M., Jacobson V., Liu C., Sharma P., Wei L.: Protocol Independent Multicast (PIM), Sparse Mode Protocol: Specification. RFC 2362, June 1998.
8. Farinacci D., Rekhter Y.: Multicast Label Binding and Distribution using PIM. draft-farinacci-multicast-tagsw-01.txt, November 1998.
9. Farinacci D., Rekhter Y., Rosen E.: Using PIM to Distribute MPLS Labels for Multicast Routes. draft-farinacci-mpls-multicast-00.txt, June 1999.
10. Farinacci D., Rekhter Y.: Partitioning Label Space among Multicast Routers on a Common Subnet. draft-farinacci-multicast-label-part-01.txt, September 1999.
11. Acharya A., Griffoul F., Ansari F.: IP Multicast Support in MPLS Networks. draft-acharya-ipsofacto-mpls-mcast-00.txt, February 1999.
12. Ooms D., Livens W., Sales B., Ramahlo M., Acharya A., Griffoul F., Ansari F.: Framework for IP Multicast in MPLS. draft-ietf-mpls-multicast-00.txt, June 1999.
13. Newman P., Edwards W., Hinden R., Hoffman E., Ching Liaw F., Lyon T., Minshall G.: Ipsilon's General Switch Management Protocol Specification, Version 1.1. RFC 1987, August 1996.
14. Reynolds J., Postel J.: Assigned Numbers. RFC 1700, October 1994.
Cyclic Block Allocation: A New Scheme for Hierarchical Multicast Address Allocation

M. Livingston (Computer Science, Southern Illinois University, [email protected]), V. Lo and D. Zappala (Computer Science, University of Oregon, {lo,zappala}@cs.uoregon.edu), and K. Windisch (Adv. Network Technology Ctr., University of Oregon, [email protected]). Supported by NSF NCR-9714680.
Abstract. This paper presents a new hierarchical multicast address allocation scheme for use in interdomain multicast. Our scheme makes use of masks that are contiguous but not prefix-based to provide significant improvements in performance. Our Cyclic Block Allocation (CBA) scheme shares some similarities with both Reverse Bit Expansion and kampai, but overcomes many shortcomings associated with these earlier techniques by exploiting techniques from the area of subcube allocation for hypercubes. Through static analysis and dynamic simulations, we show that CBA has the following characteristics that make it an excellent candidate for practical use in interdomain multicast protocols: better address utilization under dynamic requests and releases than other schemes; low blocking time; efficient routing tables; addresses reflect domain hierarchy; and compatibility with MASC architecture.
1 Introduction
The next decade will see increasing demands for the use of interdomain multicast as an important form of group communication to support applications ranging from multimedia transmissions to distributed conferencing and game-playing to E-commerce transactions. A critical problem that must be solved in order for multicast to serve these needs is the dynamic allocation of multicast addresses. In this paper, we present a new hierarchical multicast address allocation scheme for use in interdomain multicast. Our scheme makes use of masks that are contiguous but not prefix-based to dramatically improve address utilization. Our Cyclic Block Allocation (CBA) scheme shares some similarities with both Reverse Bit Expansion [5] and kampai [16], but overcomes shortcomings associated with these earlier techniques by exploiting results from the area of subcube allocation for hypercubes. We establish several fundamental results that impact the design of address allocation schemes. We demonstrate the inherent limitations of pure prefix-based address allocation with respect to its ability to recognize aggregatable blocks of
addresses. In addition, we show that the likelihood that a domain can release a block of addresses by halving is very poor. These results motivate the need for aggressive strategies such as migration and swapping to increase address space utilization and led us to the development of the CBA scheme. Through static analysis and dynamic simulation we show that CBA has the following characteristics that make it an excellent candidate for practical use in interdomain multicast protocols: better address utilization under dynamic requests and releases than other schemes; low blocking time; efficient routing tables; addresses reflect domain hierarchy; and compatibility with MASC architecture.
2 Background
In this section, we review those approaches to address allocation that are most relevant to our approach, highlighting their strengths and weaknesses. We then briefly show the correspondence between the address allocation problem and the subcube allocation problem from parallel processing. We show how known results from the latter arena are applicable to the address allocation problem.

2.1 Terminology and Notation
Throughout this paper, we use the following terminology and notation:
– block of addresses: a set of addresses that can be expressed using a single address expression or mask. We use the standard notation for describing a block of addresses, e.g. the set of four addresses 0000, 0001, 0010, 0011 can be represented as the address expression 00XX, in which the X's represent "don't care" bits. The same set of four addresses can be represented by the mask 1100, where the 1's in a mask correspond to those bit positions in an address that are significant for use in the routing table lookup. In most of this paper, we use address expressions instead of masks, but we use the terms interchangeably when the context is clear.
– prefix-based mask/address expression: one in which all the significant bits are in the leftmost positions; equivalently, in the address expression, all the "don't care" bits are in the rightmost positions.
– contiguous mask/address expression: one in which all the significant bits are contiguous (adjacent) modulo the number of bits in the address; thus wraparound is allowed.
– non-contiguous mask/address expression: one in which the significant bits are located in arbitrary positions; equivalently, the don't cares are also located in arbitrary positions.

For example, given a block of 2^5 addresses allocated from a 2^10 address space, 00100XXXXX denotes a prefix-based address expression, 001XXXXX01 and XX00110XXX both denote contiguous address expressions, and X00XX10XX0 denotes a non-contiguous expression.
2.2 Current Approaches to Hierarchical Address Allocation
We presume a model for interdomain multicast as specified in Estrin et al. [9] and the proposals of the IETF's MALLOC working group [7]. Under this model, domains operate using the Multicast Address-Set Claim (MASC) protocol [4] for the assignment of address blocks between allocation domains. The MASC allocation domains function as nodes in the interdomain hierarchy, whose composition follows customer-provider relationships between ISPs. Allocation domains claim sufficient address space from a parent domain to satisfy multicast address requests from both internal applications and child MASC domains. Internal applications are served by separate intradomain MALLOC protocols. Throughout the MASC architecture, all address allocations are granted with limited lifetimes, although renewal is possible. This allows some allocations to time out naturally, permitting space to be reclaimed for aggregation, and avoids the alternative of forcing all applications to renumber. Several techniques have been considered for use within the MASC/BGMP protocols for hierarchical and dynamic address allocation: prefix-based techniques, and the kampai scheme, which uses non-contiguous masks. Both of these techniques expand a domain's address space by doubling the size of the domain's block of addresses, and contract by halving the size of the address space. This approach is necessary to keep the routing tables small, with the goal of one routing table entry per domain. Prefix-based techniques are those that adopt the same address masking techniques as used in unicast addressing, specifying a block of addresses with a prefix-based mask. These include the scheme currently under consideration by the MALLOC working group of the IETF, Reverse Bit Expansion (RBE) [5], which was proposed by Estrin, Govindan, Handley, Thaler, and Radoslavov. Under the prefix-based mask, doubling occurs by changing the rightmost significant bit in the mask to a don't care bit, and halving occurs by changing the leftmost don't care bit to either a 0 or a 1. While natural and easy to understand, prefix-based address allocation is known to suffer from poor address utilization and aggregation abilities, with utilization levels as low as 25% in a two-level hierarchy [5,8]. The MALLOC proposal addresses this problem by augmenting RBE with the ability to migrate to a new block of addresses, at the cost of two Group Routing Information Base (G-RIB) entries per domain. We will demonstrate later why this extension is an excellent idea. Tsuchiya's kampai scheme [16] uses non-contiguous masks with the same doubling and halving method discussed above. The use of non-contiguous masks significantly improves address utilization but suffers from several shortcomings:
– kampai's bottom-up, one-address-at-a-time allocation method is too cumbersome for realistic address allocation. Obtaining a large block of addresses for a busy subdomain involves too much overhead in terms of the number of requests and constant G-RIB updates. The variation suggested by Tsuchiya, in which addresses are allocated in chunks of 2^k-sized blocks at a time, limits address utilization.
– kampai allows for ambiguous or overloaded masks. This means an address mask associated with a given domain may only have part of its associated block of addresses actually used by that domain, with the remainder distributed arbitrarily to other siblings. The conflict is disambiguated in the routing tables, but results in a confusing and inelegant address scheme.
– kampai does not have a way to deal with address aggregation. Tsuchiya acknowledges that with kampai, removed addresses will be scattered about the address space, and that it is not possible to give back a block of address space without reassigning some addresses. Tsuchiya's simulations tested address allocation only and did not investigate a more complete model with both address requests and releases.
– kampai's scheme only supports growth through doubling; it does not provide a mechanism for migration.
– kampai's scheme and its data structures are hard to understand, although we believe this is a problem with the presentation, not the underlying scheme.

2.3 Address Allocation and Subcube Allocation
There is a simple and straightforward correspondence between the address allocation problem and the subcube allocation problem in hypercubes. In the former, blocks are allocated from a set of 2^n binary addresses, with each block specified as a mask or address expression as defined above. The hypercube is a recursive structure that served as the underlying communication network of the Intel iPSC and N-Cube parallel processors. In a hypercube, the 2^n processors are each labeled with an n-bit address; processors are connected in a regular pattern as illustrated in Figure 1. A subcube of size 2^k is a subset of the hypercube that itself forms a smaller hypercube. Each subcube is specified using a prefix-based, contiguous, or non-contiguous mask in exactly the same way as blocks of addresses are described using masks. See Figure 2 for the relationship between subcubes and address blocks.

Fig. 1. Recursive definition of the hypercube (0-D through 3-D hypercubes, with nodes labeled by binary addresses)
Fig. 2. The correspondence between address allocation and subcube allocation (a 4-D hypercube in which highlighted subcubes correspond to the address blocks 00XX (prefix-based), 1XX0 (contiguous), and X1X1 (non-contiguous address expressions))
Under address allocation, child domains request blocks of addresses, use them for a certain length of time, and then release them to the parent domain. In a hypercube machine, applications request subcubes, hold them for the runtime of the application, and then release the subcubes back to the operating system. The algorithm used by the operating system to handle the requests and relinquishments of subcubes is the processor allocation algorithm, which has been the target of intensive research for the past decade [14,12,11,17,3]. The key idea to remember is that a subcube is equivalent to a block of addresses that is expressible using a single address expression or mask. This equivalence means that subcube recognition techniques can be applied to the problem of multicast address allocation. Note, however, that while hypercube machines typically have fewer than 2^12 processors, the address allocation problem deals with much larger numbers of addresses: 2^28 under IPv4 and 2^120 under IPv6. This difference and other practical constraints associated with address allocation require that results from hypercube theory be applied to the address allocation problem with great care.
3 Principles for Address Allocation
From past work in the area of subcube allocation and fault-tolerant hypercubes, we derive several results that have strong implications for multicast address allocation. First, we discuss the ability of prefix-based, contiguous, and non-contiguous allocation schemes to grow by doubling the size of their address space. Next, we compare the ability of these schemes to recognize aggregatable blocks of addresses, and we show that prefix-based schemes perform poorly. Finally, we demonstrate that under any scheme, when a child domain wishes to contract by releasing one or more blocks of addresses, the likelihood of being able to contract is extremely small. Poor ability to reclaim addresses results in intolerably low address space utilization due to fragmentation of the address space. After describing these results, we discuss their implications for the design of a practical method for hierarchical and dynamic address allocation. Most of our discussion will use the terminology from address allocation, even though the original results may have been developed for hypercubes.

3.1 Doubling Capability
The capacity of a scheme to expand its address space by doubling is affected by the type of mask allowed by that scheme. In any prefix-based allocation scheme, there is only one choice for a new block to be combined with the current block for doubling. This can be seen by noting that doubling can only occur by converting the rightmost significant bit to a don't care bit in the address expression corresponding to the current allocation. If the single desired block needed for doubling is not free, the expansion cannot occur. Schemes that use contiguous masks have two choices when expanding through doubling. This can be seen by noting that doubling occurs by converting either the leftmost or the rightmost significant bit to a don't care. If neither of the two desired blocks needed for doubling is free, the expansion cannot occur. Schemes that use non-contiguous masks have n - k choices when expanding through doubling, where n is the total number of bits in the full address space and k is the number of don't cares in the current address expression. This can be seen by noting that doubling occurs by converting any one of the significant bits to a don't care bit. The complexity of an algorithm to double the size of a domain's block of addresses is O(n) for all three cases, where n is the number of bits in the multicast address. The doubling algorithm for contiguous masks under CBA is described in Section 4.
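The counting argument can be made concrete with a few lines of Python. The sketch below operates on address expressions written as strings over {0, 1, X}; for simplicity it ignores the wraparound (modulo n) permitted for contiguous masks, and the function name is ours.

def doubling_candidates(expr, scheme):
    sig = [i for i, c in enumerate(expr) if c != 'X']
    if scheme == "prefix":
        picks = [sig[-1]]              # one choice: the rightmost significant bit
    elif scheme == "contiguous":
        picks = [sig[0], sig[-1]]      # two choices: leftmost or rightmost
    else:
        picks = sig                    # non-contiguous: any of the n - k significant bits
    return [expr[:i] + 'X' + expr[i + 1:] for i in picks]

print(doubling_candidates("0010XX", "prefix"))      # ['001XXX']
print(doubling_candidates("001XX0", "contiguous"))  # ['X01XX0', '001XXX']
print(doubling_candidates("0X10X0", "noncontig"))   # four candidates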
3.2 Recognition Capability
A given multicast address allocation scheme can be characterized by the number of distinct blocks of a given size that it is capable of recognizing. Again, the mask
pattern limits the number of blocks that are recognizable, due to the respective constraints on the location of significant bits in each of these types of masks. We will see later that recognition capacity is critical for reclaiming unused addresses through migration and swapping. Table 1 gives formulas for the recognition ability of a wide range of schemes taken from the processor allocation literature. We classify each scheme based on its use of prefix-based masks, contiguous masks, or non-contiguous masks. The one with the least recognition capability is the prefix-based Buddy scheme. We note that Cyclic with contiguous masks supports n times the recognition capability of prefix-based schemes. Also shown is the recognition capability of three schemes not considered in this paper: Gray Code and Double Gray Code [2], and Partners [1,18]. These schemes use non-contiguous masks but do not allow for all possible patterns of non-contiguity. Interestingly, they all do worse than Cyclic. This is why we chose Cyclic for recognition in the CBA address allocation scheme. We originally designed Cyclic as a subcube allocation algorithm almost a decade ago; it is exciting to see its potential for use in multicast address allocation. The highest recognition capability (full recognition) is associated with fully non-contiguous masks. While kampai uses fully non-contiguous masks, it does not support full recognition in the sense discussed in this section; kampai only gives an algorithm for expansion through doubling. Figure 3 presents these recognition formulas for three sample address space sizes: 2^5, 2^10, and 2^28. The graphs show the percentage of aggregatable blocks found by a given scheme relative to the total possible number of aggregatable blocks (under full recognition). From these graphs, we see that Cyclic is the only scheme with reasonable performance. As the address space increases in size, all other schemes will fail to recognize blocks, and Cyclic will only recognize large and small blocks. It is important to note that the increasing recognition ability of these schemes comes at the cost of higher overhead. The complexity of prefix-based recognition for a given parent domain is O(n × C), where n is the number of bits in a multicast address and C is the number of child domains served by that parent. The complexity of any algorithm for full recognition with non-contiguous masks is O(n! × n × C); the factorial term renders full recognition with non-contiguous masks computationally infeasible. The complexity of recognition for contiguous masks may be exponential in the worst case, but has not yet been proven to be so. We have found a linear time recognition algorithm for a large subclass of contiguous masks. We believe there exists a polynomial time recognition algorithm for all contiguous masks; this is part of our ongoing research.

3.3 Address Aggregation Capability
Our work in the area of faulty hypercubes [6,13] yields some results that have severe implications for the likelihood of address aggregation even in the best of circumstances. The problem of reducing the set of addresses allocated to a child domain by repeatedly halving its address space is related to the problem of finding the minimum number of faulty processors in an n-dimensional hypercube such that every k-dimensional subcube is faulty, k < n. Since our focus is on contraction through halving, we first state the results for this special case. We then briefly describe the more general situation. In particular, suppose a child domain wishes to relinquish a block of addresses to its parent by freeing half of its addresses. Specifically, if the child holds a 2^k size block of addresses, it would like to free up a 2^(k-1) size block, leaving itself with a 2^(k-1) size block. However, the likelihood that 2^(k-1) or more unused addresses actually form an aggregatable block that is representable by a single mask turns out to be almost negligible. Figure 4 shows the probabilities of being able to contract by halving the size of the address space. The formula used to compute these probabilities is given in [10]. From the graph, we see that only a small number of addresses in use (around 10) from the huge exponential address space causes the probability of aggregation to fall below 5% for the samples given. The prospects become even worse for larger address spaces: an increasingly smaller fraction of the addresses in active use reduces the probability of aggregation to these low levels. In [6], we show that only two addresses in active use for multicast sessions can prevent any aggregatable block of size 2^(k-1) from existing for purposes of halving: any pair of addresses that are complements of each other comprises such a set. In general, our work [6] shows that with probability close to 1, a random collection of O(n) faulty processors will leave no large fault-free subcubes. In terms of address allocation, this means that only a small number of addresses in active use for multicast sessions can completely destroy the ability of the domain to free up a large aggregatable subset of its addresses. We note that under the MASC architecture this problem is diminished, since rather than explicitly releasing space, MASC simply renews its claim on a smaller sub-block. The MASC protocol guarantees that any sessions allocated to the non-renewed block will have timed out by the time the sub-block times out.

Table 1. Recognition capability of prefix-based, contiguous, and non-contiguous schemes: n-bit address space, k-bit subcube/block

  Subcube Allocation Scheme             | General formula      | Example: n = 8, k = 3
  Buddy (prefix-based)                  | 2^(n-k)              | 32
  Gray (constrained non-contig)         | 2^(n-k+1)            | 64
  Double Gray (constrained non-contig)  | not shown            | 128
  Partners (constrained non-contig)     | (n-k+1) × 2^(n-k)    | 192
  Cyclic (contiguous)                   | n × 2^(n-k)          | 256
  Full (non-contig)                     | C(n,k) × 2^(n-k)     | 1792

Fig. 3. Subcube recognition capacities (three panels plot the percentage of aggregatable blocks recognized versus aggregatable block size, log scale, for address spaces of size 2^5, 2^10, and 2^28; the schemes compared are cyclic, partners, doublegray, onegray, and buddy)
Fig. 4. Probability of address aggregation (probability of releasing a block by halving versus the number of addresses in use, for n-bit multicast address spaces of size 2^12, 2^16, 2^20, 2^24, and 2^28)
3.4 Implications for Practical Address Allocation Schemes
Any viable address allocation scheme must perform well under dynamic expansion and contraction of its address space. Because of the woefully poor probability that a domain can actually free up a block of addresses by halving, we believe it is necessary to adopt techniques such as migration and swapping in order to aggressively reclaim fragmented blocks of free addresses. Migration was proposed by Estrin, Govindan, Handley, Thaler, and Radoslavov in the IETF's MALLOC working group [5] in order to address the shortcomings of RBE. When a child domain needs to increase its address space and it cannot expand its current block, it migrates. Under migration, it is given a completely new block of addresses of double the size. As new multicast sessions are created, the multicast addresses are allocated from the new block. As the old sessions time out, their addresses are freed up; when all the old addresses are free, the old block is released to the parent domain. Swapping is a type of migration we propose in which a domain is given a new block that is the same size as its current block. The domain migrates to the new block and eventually frees up its old block. Swapping occurs among sibling domains, for example, when one domain can only double by using a block held by one of its siblings. Both migration and swapping support higher address space utilization by finding alternative blocks when standard doubling and halving fail. By seeking available space within the domain's current allocation, we avoid asking the parent domain for more space for as long as possible. This is desirable since allocation of more space from the parent triggers global changes to the routing tables [9]. Migration and swapping come at a significant price: they require two routing table entries per domain during the time that the domain holds both the old
and new blocks. This becomes a severe problem for the top-level domains, whose current BGP routing tables contain tens of thousands of unicast entries. Under BGMP, the tables will be expanded even further to add G-RIB entries to the unicast and M-RIB entries. Thus, any scheme with fewer entries per domain is clearly preferable. The discussion above shows a clear tradeoff among competing goals: the ability to expand and contract with reasonable likelihood, versus the complexity of the algorithms used for expansion and contraction, versus the need to keep the routing tables stable and small. To summarize the analysis above:
– Independent of the type of mask, the probability of aggregation by halving is close to zero for most block sizes, and especially for addresses with n > 12. This means migration and swapping techniques must be used to recapture fragmented free blocks.
– Migration requires good recognition capability. Prefix-based schemes have poor recognition capabilities, and full recognition (non-contiguous masks) has intolerable overhead.
– Cyclic recognition (contiguous masks) is the best known scheme that has reasonable overhead.

As a result, we designed our Cyclic Block Allocation scheme, which uses contiguous masks, augmented with both migration and swapping.
4 Cyclic Block Allocation (CBA)
Cyclic Block Allocation (CBA) is a new hierarchical multicast address allocation scheme that performs well under dynamic expansion and contraction of address spaces. The defining feature of the CBA address allocation algorithm is that it assigns blocks of addresses to child domains using contiguous address expressions as defined earlier. Based on our earlier discussion, the use of contiguous address expressions provides greater opportunities for doubling and migration than schemes that use prefix-based address expressions. For ease of explanation, assume the domains are organized in a tree with a single root node, and assume that the top-level node initially owns the full n-bit address space. We first describe the overall operation of CBA. We then describe the algorithms for doubling and migration. CBA can be implemented as either a request-reply protocol or as a claim-collide protocol.

4.1 CBA Description
Main Routine
– Initial request: A leaf child domain requests a base block of size 2^k. Each child domain can use a different size for its base block.
– Expansion: When a high threshold is met, expand by doubling. If doubling is not possible, expand by migration. If migration is not possible, block and try again later.
– Contraction: When a low threshold is met, contract by halving. If halving is not possible, contract by migration.¹ If migration is not possible, block and try again later.
– Swapping: When opportunities exist for productive swapping, a parent initiates swapping among selected siblings. (Swapping is not yet implemented.)

Subroutines
– Doubling: Doubling occurs as follows: first convert the leftmost significant bit (modulo n) to a don't care, to create a new mask that represents the desired new block. This new mask is compared with the masks of all siblings to see if there is a conflict. A conflict exists between the masks for domain A and domain B if and only if the following condition holds: for every bit position in which the bits in both mask(A) and mask(B) are significant bits, they are identical (both ones or both zeroes). If a conflict exists, repeat with the rightmost significant bit (modulo n). If both cases fail, doubling has failed (see the sketch after this list). The complexity of doubling is O(n × s), where n is the number of bits in the multicast address and s is the number of sibling domains. (This basic doubling algorithm can easily be modified for doubling with prefix-based masks and with non-contiguous masks.)
– Migration and swapping: Migration and swapping use the basic approach of the (Full) Cyclic subcube allocation scheme for hypercubes [13] to find an available block of addresses of the desired size; this parallel algorithm recognizes subcubes in an n-bit address space using 2^n processors. For address allocation under CBA, we developed a sequential algorithm for Cyclic to be executed on the parent node when migration or swapping is initiated. CBA's subcube recognition algorithm is called the k-Cyclic algorithm because it recognizes those subcubes whose don't care bits start in bit positions 0 through k-1 and which do not wrap around. It turns out that this limitation is needed to ensure that contiguous masks remain contiguous in a fully hierarchical address allocation scheme. This is in contrast to Full Cyclic, which allows don't care bits to start in any position and to wrap. k-Cyclic runs in O(n) time for fixed k. Briefly, k-Cyclic maintains k lists of free subcubes, one each for bit positions 0 through k-1. There is no duplication in the recording of a free subcube from one list to another, i.e. a maximal free subcube will appear in only one of the lists. The difficulty lies in the maintenance of this maximal property when a subcube is returned. We can show that for fixed k, returning a q-cube while maintaining the list property requires that we investigate how the q-cube can combine with an element in one list, then further combine with an element in another list, and so on. One can show this process is bounded by 2^k searches; when k is fixed, this number is constant. Further details about the k-Cyclic subcube recognition algorithm can be found in [10].
¹ Recall that under the MASC architecture halving can always be achieved without migration, as discussed earlier.
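Under our reading of the Doubling subroutine above, the procedure can be sketched as follows (Python, with the wraparound of contiguous masks elided and all names illustrative). Two address expressions conflict, i.e. their blocks overlap, exactly when they agree at every position where both have significant bits.

def conflicts(a, b):
    # Overlap test: no position may hold opposite significant bits.
    return all(x == y for x, y in zip(a, b) if x != 'X' and y != 'X')

def try_double(mask, siblings):
    sig = [i for i, c in enumerate(mask) if c != 'X']
    # Try the leftmost, then the rightmost, significant bit.
    for i in (sig[0], sig[-1]):
        candidate = mask[:i] + 'X' + mask[i + 1:]
        if not any(conflicts(candidate, s) for s in siblings):
            return candidate
    return None   # doubling failed; fall back to migration

# Example: a sibling holding 11XX blocks growth to the left (X11X would
# overlap it), but growth to the right succeeds.
print(try_double("011X", ["11XX"]))   # -> '01XX'

The O(n × s) complexity quoted above is visible directly: each of the at most two candidate masks is compared against every sibling mask in time linear in n.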
4.2 CBA Features
In this subsection, we describe some of the characteristics of the CBA scheme that make it an excellent candidate for hierarchical multicast address allocation. Results of performance evaluation experiments for address utilization, blocking time, and size of routing tables are described in the next section.
– Addresses reflect domain hierarchy: Given two or more address masks, it is easy to determine the relationship among the domains holding the corresponding blocks of addresses. In particular, domain B is a descendant of domain A iff the don't cares in mask(A) cover those of mask(B) and the significant bits in mask(B) cover those of mask(A) (see the sketch after this list). RBE's prefix-based masks satisfy this condition, but kampai's do not. Note that this property does not hold for Full Cyclic but does hold for k-Cyclic.
– Routing table size and lookup: CBA requires a maximum of two routing table entries per subdomain, since it uses migration and swapping. Because the addresses reflect the domain hierarchy and because the masks are contiguous, the routing tables can be searched using standard search techniques. Also, because CBA delays requesting more address space from the parent domain through migration and swapping, fewer global changes to the routing tables are needed.
– Compatible with MASC architecture: CBA can function within the structure of the MASC architecture as a claim-collide protocol.
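The hierarchy property admits a one-line containment test. The sketch below phrases it on address expressions rather than raw masks, so that agreement of the bit values is checked together with significance; the function name is ours.

def is_descendant(b, a):
    # b's block lies inside a's block iff b is significant, with the same
    # bit value, wherever a is significant.
    return all(x == y for x, y in zip(a, b) if x != 'X')

print(is_descendant("001XX0", "0X1XX0"))  # True:  001XX0 is inside 0X1XX0
print(is_descendant("001XX0", "1XXXX0"))  # False: the blocks are disjoint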
5 Simulations
The address allocation problem is highly dynamic in nature. This is particularly true when evaluating the effects of address relinquishment, which is crucial to any realistic analysis of address allocation algorithm performance. Therefore, many of the properties of the multicast address allocation schemes already presented are comparable only through the techniques of modeling and simulation. In this section, we present a simplified model of multicast address allocation that we use to analyze the properties of Cyclic versus prefix-based address allocation. Our simplified model roughly approximates MASC, for the purpose of investigating the fundamental principles behind these algorithms. Our ongoing work uses the detailed model and simulator proposed by Radoslavov [15], which we are using to obtain a more realistic analysis of CBA performance.

5.1 Algorithms Simulated
These preliminary simulations investigate the growth capability and the need for migration in the context of address block requests and releases within a single-level domain hierarchy. We compare the performance of CBA with Adaptive Reverse Bit Expansion, which is prefix-based. For the remainder of this paper, we refer to the latter scheme as Prefix.
We experimented with three versions of CBA, comparing their performance with two versions of Prefix. These differ in the method used to select a new block of addresses for migration.
– Random Fit: CBA-RF and Prefix-RF randomly choose a block from the set of free blocks of the desired size.
– First Fit: CBA-FF and Prefix-FF choose the block containing the smallest address.
– Best Fit: CBA-BF examines the blocks in an order corresponding to their distance from earlier allocated blocks, in order to minimize potential interference.² We did not implement Prefix-BF, since prior work in processor allocation has shown First Fit and Best Fit to perform comparably.

5.2 Simulation Details and Performance Metrics
We initially distribute a block of addresses to each child domain, taking the block size from a uniform distribution of block sizes up to the maximum initial block size of 20 bits in a 28-bit address space. This method was used to ramp up the simulation to a reasonable initial utilization level in order to reach steady state more quickly. Steady state was typically achieved within 5000 iterations. A single iteration of the simulation visits all the children once in randomized order. At each visit, a child domain can initiate one of three actions: (a) a request to grow by doubling in size, (b) a relinquishment of half of the owned address space back to the parent, or (c) a null action. If a request to grow cannot be satisfied by doubling in place, the failure to double is recorded and an attempt is made to migrate to a new block of double the current size. If migration cannot be achieved, the failure to migrate is recorded and the action becomes a null action. A request to halve is always satisfiable, because we do not model individual session allocations or lifetimes in this simulation. When halving, the block released is selected in a manner consistent with the block selection criterion (random, first fit, best fit). The probabilities for the three actions are biased towards growth at a ratio of 5 : 3 : 2 for grow, relinquish, and null, respectively. We modeled the load on the system by varying the number of child domains from 32 to 200, and recorded the following metrics. Results reported represent means taken over repeated runs with 90% confidence levels. The confidence intervals are not shown in the graphs below but are within ±0.2% for all metrics except migration success rate. Performance metrics captured by the simulation include: average address space utilization; average percentage of failed requests to double; average percentage of successful migrations; average percentage of successful growth, through either doubling or migration; average iteration number of the first failure to double, which reflects the ability of the scheme to grow through doubling only; and average iteration number of the first failure to migrate.
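To make the action model above concrete, the following is a minimal Python sketch of one simulation iteration. It is our own reconstruction from the description in the text, not the authors' simulator; the `space` and `stats` objects and helper names such as `can_double_in_place` and `find_free_block` are hypothetical.

```python
import random

# Action probabilities biased towards growth: grow : relinquish : null = 5 : 3 : 2
ACTIONS = ("grow", "relinquish", "null")
WEIGHTS = (5, 3, 2)

def iterate(children, space, stats):
    """One simulation iteration: visit every child domain once in randomized order."""
    order = list(children)
    random.shuffle(order)
    for child in order:
        action = random.choices(ACTIONS, weights=WEIGHTS)[0]
        if action == "grow":
            if space.can_double_in_place(child.block):
                space.double_in_place(child.block)
            else:
                stats.failed_doubles += 1
                # Fall back to migrating to a free block of twice the current size,
                # chosen by the configured policy (random / first / best fit).
                new_block = space.find_free_block(2 * child.block.size)
                if new_block is not None:
                    space.migrate(child.block, new_block)
                else:
                    stats.failed_migrations += 1  # the action degenerates to null
        elif action == "relinquish":
            # Always satisfiable: half of the owned space goes back to the parent,
            # the released half chosen by the same block selection criterion.
            space.release_half(child.block)
        # "null": do nothing
```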
5.3 Simulation Results
Below we give representative results for a 28-bit address space. Results were similar for address spaces larger than 16 bits. Doubling (no migration): Figure 5 shows address utilization versus iteration number with migration turned off. Four schemes are included: Cyclic-RF, k-Cyclic-RF, Prefix-RF, and Prefix-FF. k-Cyclic reaches 80% utilization and consistently maintains higher utilization levels than the Prefix-RF scheme at 60%. Also of interest is the rate at which each scheme reaches its maximum achievable utilization level. k-Cyclic-RF has the fastest rate of increase. The ability to reach the plateau more quickly corresponds directly to high success rates for doubling. Note that Prefix-FF has very poor performance (less than 3% utilization) because the tight packing of the blocks severely limits the ability to double. In contrast, random schemes spread the blocks out, leaving empty spaces for future growth through doubling.
[Figure 5 plot: “24 Bit-Space, 200 Children w/ Max Init. Alloc. 15, No Migration”; y-axis: Percent Utilization (0–90); x-axis: Iteration Number (0–10000); curves: k-Cyclic RF, Cyclic RF, Prefix RF, Prefix FF.]
Fig. 5. Address utilization without migration (Note: legend lists algorithms in best to worst order. Prefix-FF is at the very bottom of the graph.)
Doubling plus migration: Figure 6 and Figure 7 illustrate the performance of the address allocation schemes when the migration feature is activated. k-Cyclic-RF and Cyclic-RF perform best, and four out of the five versions of Cyclic outperform the two versions of Prefix. Only k-Cyclic-FF has utilizations comparable to Prefix. However, the differences among all seven algorithms are only a few percentage points. What is particularly surprising is that, with respect to address utilization, each individual scheme performs better without migration than with migration (except for Prefix-FF). We speculate that the addition of migration tends to fragment the address space in a manner that interferes with the ability to double. We believe this fragmentation can be controlled by refinements of the migration selection criteria. Further investigation of this phenomenon is part of our ongoing work.
[Figure 6 plot: “Address Space Utilization with migration and 28 bit address space”; y-axis: Address Space Utilization (56.5–61.5); x-axis: Number of children (10–50); curves: prefix-RF, prefix-FF, k-cyclic-FF, k-cyclic-RF, cyclic-RF, cyclic-FF, cyclic-BF.]
Fig. 6. Address utilization versus number of children, with migration
[Figure 7 plot: “Successful Growth Rate with migration and 28 bit address space”; y-axis: Successful Growth Rate (60.2–61.1); x-axis: Number of children (10–50); same seven curves as Figure 6.]
Fig. 7. Growth success rate versus number of children, with migration
Migration success rate: Figure 8 shows stark differences in the ability of these algorithms to migrate. It appears that the First Fit algorithms perform better for migration, while the Random Fit algorithms perform better for doubling. Also, while the confidence intervals for all other metrics were very small (less than 0.2%), for migration success rate the confidence intervals ranged from 0.05% to 1.8%, indicating that subcube recognition capacity is sensitive to the dynamic situation as well as the algorithm.
[Figure 8 plot: “Migration Success Rate with migration and 28 bit address space”; y-axis: Migration Success Rate (0–60); x-axis: Number of children (10–50); same seven curves as Figure 6.]
Fig. 8. Migration success rate versus number of children
5.4 Dynamic Features of CBA
– Address utilization and blocking time: k-Cyclic outperforms all other schemes and quickly reaches high levels of utilization (up to 80%) without migration. Its high success rate for growth implies less blocking time spent waiting for the parent domain to acquire more addresses.
– Impact on G-RIB tables: The ability to grow through doubling versus migration affects G-RIB table sizes. Growth through migration requires retaining two G-RIB entries for a period of time, a high-cost situation given the current explosion in routing table sizes. Growth solely through doubling avoids this overhead. Growth success rates directly affect G-RIB flux. Whenever a growth request fails, the parent domain must also make a growth request to its parent domain, necessitating global changes to G-RIB tables at the higher-level domains.
While we have run many simulations over a wide parameter space, the results are not yet conclusive. The preliminary nature of these simulations calls for continued
study of CBA to understand the conditions under which its performance can attain its theoretical promise.
5.5 Realistic Simulations
Much more detailed and realistic simulations are being conducted using the MascSim software [15], written by Pavlin Radoslavov at USC-ISI and modified to model the non-wrapping version of CBA. MascSim simulates the MASC protocol in detail, including hierarchical domain topologies, arbitrary address-demand step functions, link failures, and true MASC claim-and-collide behavior. The most important benefit of this level of simulation will be a better analysis of the comparative G-RIB sizes and flux generated in a hierarchy by CBA versus the RBE prefix algorithm. MascSim will also provide observations of utilization and allocation latency.
6 Future Work and Conclusion
Our ongoing and future plans include refinement of our current migration algorithms with better subcube selection criteria to avoid the interference and fragmentation effects. We also plan to develop a polynomial-time algorithm for Full-Cyclic. CBA’s swapping technique will be developed further, with new algorithms to select targets for swapping. Finally, many additional simulations are called for, most notably for migration when a child relinquishes a block and for swapping. In general, we believe that the multicast address allocation problem is an instance of a broad class of resource allocation problems from which cross-pollination will continue to be fruitful. Much work remains to be done in both theoretical foundations and in performance analysis within the realm of hierarchical multicast address allocation.
7 Acknowledgments
Special thanks to Prajna Dasgupta, Joannie Humphreys, and Iyer Sivaramakrishna, University of Oregon, and Josh Hoyt and Dave Meyer, Harvey Mudd College for their contributions to this project. These students were supported by NSF NCR-9714680 and an NSF Research Experiences for Undergraduates supplement to this grant. Thanks also to the referees for their helpful comments.
References
[1] A. Al-Dhelaan and B. Bose. A new strategy for processor allocation in an nCUBE multiprocessor. In Proceedings of the International Phoenix Conference on Computers and Communication, pages 114–118, March 1989.
[2] Ming-Syan Chen and Kang G. Shin. Processor allocation in an n-cube multiprocessor using gray codes. IEEE Transactions on Computers, pages 1396–1407, 1987.
[3] S. Dutt and J. P. Hayes. Subcube allocation in hypercube computers. IEEE Transactions on Computers, 40(3):341–352, March 1991.
[4] D. Estrin, R. Govindan, M. Handley, S. Kumar, P. Radoslavov, and D. Thaler. The multicast address-set claim (MASC) protocol. Internet Draft of the IETF MALLOC Working Group, draft-ietf-malloc-masc-01.txt, February 1999.
[5] D. Estrin, R. Govindan, M. Handley, D. Thaler, and P. Radoslavov. MASC prefix allocation algorithm. Presentation given at MALLOC Working Group Meeting of IETF-41, Los Angeles, 1998. http://netweb.usc.edu/masc/ietf-41-masc-sims.ps.
[6] Niall Graham, Frank Harary, Marilynn Livingston, and Quentin F. Stout. Subcube fault-tolerance in hypercubes. Information and Computation, 102:280–314, 1993.
[7] M. Handley, D. Thaler, and D. Estrin. The internet multicast address allocation architecture. Internet Draft of the IETF MALLOC Working Group, draft-ietf-malloc-arch-01, April 1999.
[8] Phillip Krueger, Ten-Hwang Lai, and Vibha A. Radiya. Processor allocation vs. job scheduling on hypercube computers. In The 11th International Conference on Distributed Computing Systems, pages 394–401, 1991.
[9] S. Kumar, P. Radoslavov, D. Thaler, C. Alaettinoglu, D. Estrin, and M. Handley. The MASC/BGMP architecture for inter-domain multicast routing. In ACM SIGCOMM, October 1998.
[10] M. Livingston and V. Lo. A linear-time algorithm for k-cyclic subcube recognition. In preparation.
[11] M. Livingston and Q. F. Stout. Parallel allocation algorithms for hypercubes and meshes. In Proceedings of the 4th Conference on Hypercube Concurrent Computers and Applications, 1989.
[12] Marilynn Livingston and Quentin Stout. Distributing resources in hypercube computers. In Proceedings of the 3rd Conference on Hypercube Concurrent Computers and Applications, 1988.
[13] Marilynn L. Livingston and Quentin F. Stout. Fault tolerance of the cyclic buddy subcube location scheme in hypercubes. In The Sixth Distributed Memory Computing Conference Proceedings (DMCC6), pages 34–41, 1991.
[14] Virginia Lo, Kurt J. Windisch, Wanqian Liu, and Bill Nitzberg. Noncontiguous processor allocation algorithms for mesh-connected multicomputers. IEEE Transactions on Parallel and Distributed Systems, pages 712–726, 1997.
[15] P. Radoslavov. MascSim simulator, 1999. http://catarina.usc.edu/masc.
[16] Paul F. Tsuchiya. Efficient and flexible hierarchical address allocation. INET92, pages 441–450, June 1992.
[17] N. F. Tzeng and G. L. Feng. Resource allocation in cube network systems based on the covering radius. IEEE Transactions on Parallel and Distributed Systems, 7(4):328–342, April 1996.
[18] Kurt Windisch, Virginia Lo, and Bella Bose. Contiguous and non-contiguous processor allocation algorithms for k-ary n-cubes. In Proceedings of the 1995 International Conference on Parallel Processing, pages II–164–II–168, 1995.
Survivable ATM Group Communications Using Disjoint Meshes, Trees, and Rings
William Yurcik
Department of Applied Computer Science, Illinois State University, Normal, IL 61790, USA
[email protected]
Abstract. This paper examines the potential benefits of providing survivability to ATM group communications based on the extension of circuit-switched restoration techniques. Specifically I compare the feasibility and cost of using disjoint dedicated backup route sets established by the VC Mesh model, shared multicast trees, and rings. Results show promise for further development of self-healing survivable rings.
1 Introduction
The search for a native ATM group communications solution began with the introduction of multicast capability (point-to-multipoint circuits) in the ATM Forum’s User Network Interface (UNI) specification 3.1. The ATM Forum MPOA (Multi-Protocol Over ATM) subworking group proposes two approaches for ATM intracluster group communications: (1) the VC Mesh Model and (2) the Multicast Server Model (MCS). As a result of the perceived complexity and inefficiency of these two approaches, other techniques have been proposed, such as shared trees and rings. System survivability is important due to the increasing number of information systems and society’s increasing dependence on these systems for dependable service. The majority of work on providing survivability for ATM networks focuses on extending circuit-switched facility restoration techniques to ATM using Virtual Paths (VPs). VPs can be viewed as semi-permanent routes through the network with dedicated bandwidth onto which Virtual Circuits (VCs) are grouped. Different algorithms for ATM network survivability have been proposed using either a preplanned backup VP for each working VP and/or dynamically searching for new path(s) after a single link failure [9,8]. While ATM network survivability has focused on point-to-point VP/VC restoration, there has been very little work on the survivability of ATM group communications. The same types of vulnerabilities present in ATM unicast are also present in ATM group communications, but the potential risk of these vulnerabilities is considerably greater. For instance, while an ATM unicast path is a collection of links and nodes between one source and one destination, ATM group communications can have multiple such paths. To exacerbate this vulnerability, it is often more complex to determine and control which links are in use
for a given group routing scheme as compared to the equivalent point-to-point communication. Additionally, users may require an all-or-nothing restoration where, if all group members cannot be restored simultaneously, the session is abandoned. In this paper the focus of investigation is the extension of circuit-switched restoration techniques to the survivability of connection-oriented ATM group communications. The basic concept is that each group communication session requiring survivability will establish a disjoint backup set of routes with reserved bandwidth for restoration. When a single link or node fault occurs within a “working” set of routes, traffic flow is rerouted to the corresponding disjoint “backup” set of routes. The remainder of the paper is organized as follows: Section 2 briefly overviews some of the techniques proposed for ATM group communications. Section 3 summarizes survivability issues and tradeoffs specific to these proposed schemes. Section 4 reports comparative experimental results, and I conclude with a summary in Section 5.
2 ATM Group Communication Schemes
2.1 VC Mesh Model
Figure 1A shows an example of group communications between three nodes using the VC Mesh Model. For a single group containing multiple senders, a unidirectional VC for each sender is established on the network with no coordination between VCs. This criss-crossing of VCs across a network gives rise to the name “VC Mesh” [1]. If a member joins or leaves the group, each point-to-multipoint VC for each sender in the group must be modified. In the ATM Forum UNI 3.1 specification, the sender keeps track of all members in the connection; in all cases, only the sender can add a new group member, and either the sender can delete a member or the member can delete itself from the point-to-multipoint VC [3]. The ATM Forum UNI 4.0 specification allows a new member to join a point-to-multipoint VC without the intervention of the sender by using a Leaf-Initiated Join (LIJ) capability [2].
[Figure 1: three schematic panels showing a sender S, a multicast server M, receivers, group members, and a non-group member connected by VCs.]
Fig. 1. (A) VC Mesh Model; (B) Multicast Server Model; (C) Shared Multicast Tree
2.2 Multicast Server Model
Figure 1B shows an example of ATM group communications between three nodes using the MCS Model (this figure needs to be duplicated for each active sender). In the MCS Model, a server is chosen within each cluster to serve as a proxy for all senders in order to relieve senders from the VC connection setup and release operations resulting from group dynamics. Conceptually the MCS serves as a centralized connection manager, mediating the association and disassociation of senders and receivers within a group [12]. All senders establish a unidirectional point-to-point VC to the MCS. The MCS establishes a unidirectional point-to-multipoint VC to the rest of the multicast group. During transmission, the MCS reassembles cells into packets arriving on all incoming VCs (avoiding interleaving cells from different packets) and queues cells belonging to particular packets for transmission on an outgoing point-to-multipoint VC. All requests to create a group, delete a group, add a member to a group, or delete a member from a group are sent to the MCS, which maintains all state information.
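As an illustration of the forwarding discipline just described, here is a minimal Python sketch of an MCS that reassembles cells per incoming VC and forwards only whole packets, so cells from different packets are never interleaved downstream. The cell structure and the callback are our own simplifications, not part of any MCS specification.

```python
from collections import defaultdict, deque

class MulticastServer:
    """Reassemble cells per incoming VC; forward only whole packets,
    so cells from different packets are never interleaved downstream."""

    def __init__(self, send_on_p2mp_vc):
        self.partial = defaultdict(list)   # incoming VC -> cells of the packet in progress
        self.ready = deque()               # complete packets queued for the outgoing VC
        self.send = send_on_p2mp_vc        # transmits one packet's cells on the p2mp VC

    def on_cell(self, cell):
        # Assumed cell fields: vc (incoming VC id), last (end-of-packet, AAL5-style)
        self.partial[cell.vc].append(cell)
        if cell.last:
            self.ready.append(self.partial.pop(cell.vc))

    def transmit_one(self):
        # Send the next fully reassembled packet, if any.
        if self.ready:
            self.send(self.ready.popleft())
```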
2.3 Shared ATM Multicast Tree Schemes
Figure 1C shows a generic shared tree approach connecting a group of three nodes. I consider three specific shared ATM multicast tree schemes: SMART, SEAM, and SPAM [5,6,7]. These schemes are similar in that they aim to provide a general-purpose control architecture by modifying in-band control mechanisms of ATM switches [11]. Resources are reserved in both directions on all of the VC links of a shared multicast tree until the connections are released [5]. Each of these schemes individually addresses two inherent problems of a shared tree architecture: (1) cell interleaving, because cells from different sources may arrive interleaved at one destination, and (2) resource management, because resources allocated to a connection are shared between a number of different sources. SMART (Shared Many-to-many ATM ReservaTions) proposes a shared tree of one to several associated many-to-many VCs [6]. The same service class and traffic parameters are given to all connections of the tree at call setup. By granting access to a shared tree through the use of resource management (RM) cells, SMART ensures that only one source is active on a tree at any given instant. SEAM (Scalable and Efficient ATM Multicast) relies on per-VC queueing and per-packet forwarding to avoid interleaving of cells from different packets by requiring ATM switches to detect the first and last cells in a packet. SEAM introduces two concepts: (1) “cut-through forwarding”, which is a mechanism to forward all cells from a packet together while buffering cells from later packets until the last cell is forwarded, and (2) “short-cutting”, which is a signaling mechanism allowing cells to follow the shortest path along a shared tree instead of having to go all the way to the “root” and be forwarded back to receivers. SPAM (a Simple Protocol for ATM Multicast) is a third attempt to support multipoint-to-multipoint VCs in a shared tree approach, incrementally improving on SEAM. In contrast to the forwarding model of SEAM, where there is no multiplexing of cells from different packets, SPAM multiplexes cells from different packets through the use of a proposed AAL-SPAM and native “cut-through forwarding”. As a compromise between AAL5 and AAL3/4, AAL-SPAM is AAL5 plus an added 16-bit MID (Multiplexing IDentifier) field. Unique MIDs are assigned to senders within a multicast group such that receivers can distinguish cells from different senders. “Native cut-through forwarding” refers to cells from different packets being switched immediately and multiplexed over the shared many-to-many VC connecting the group. Results comparing SPAM and SEAM report the following: (1) SPAM buffer requirements are independent of packet size while SEAM buffer requirements increase significantly with packet size, and (2) SPAM delay is the expected propagation and transmission delay, while certain SEAM switches in a network will experience maximum delay due to increased blocking [7]. The major shortcomings of SPAM are: (1) the practicality of modifying AAL5 to AAL-SPAM and (2) the global allocation of unique MIDs.
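A receiver-side sketch of the AAL-SPAM multiplexing idea described above: because every cell carries a 16-bit MID unique to its sender, interleaved cells can be sorted back into per-sender packets. This is our own illustration; the field names and end-of-packet convention are assumptions.

```python
from collections import defaultdict

class SpamReceiver:
    """Demultiplex interleaved cells on a shared many-to-many VC by MID."""

    def __init__(self, deliver):
        self.buffers = defaultdict(list)  # MID -> payloads of that sender's current packet
        self.deliver = deliver            # callback taking (mid, list_of_payloads)

    def on_cell(self, mid, payload, last):
        assert 0 <= mid < 2**16           # MID is a 16-bit field added to AAL5
        self.buffers[mid].append(payload)
        if last:                          # end-of-packet marker, as in AAL5
            self.deliver(mid, self.buffers.pop(mid))
```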
2.4 Virtual Ring Multicasting
Ofek and Yener have proposed the Virtual Ring [10,13] for window-based packet group communications. The Virtual Ring is a unidirectional circular overlay of routes that join all desired members for a group communication. Using Virtual Rings is straightforward since a message sent from one member of the group travels around the ring and back to the sender. In the application of Virtual Ring multicasting to ATM networks, each unicast connection between adjacent nodes on the Virtual Ring can be implemented as a separate point-to-point VC. The Virtual Ring provides a global sense of direction for concatenated VCs from switch to switch around an enclosed ring. The main advantage of the Virtual Ring is that it extends multicast to large groups since it incorporates feedback from receiver to sender to confirm cells have been received correctly. Through a combined strategy of feedback mechanisms referred to as “Implicit ACK” and “Explicit NACK”, the Virtual Ring minimizes the feedback needed for reliability while overcoming the “ACK implosion effect”. The Implicit ACK procedure considers each cell that circulates the Virtual Ring and returns to its source as an implicit acknowledgment. An Explicit NACK is a unicast retransmission request initiated by a receiver (at a higher layer) when ordered cells are missing.
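To make the feedback strategy concrete, here is a minimal sender-side sketch under simplifying assumptions of our own: cells carry sequence numbers, the sender can recognise its own cells when they return, and receiver-side NACK generation is out of scope.

```python
class RingSender:
    """Track outstanding cells on a unidirectional Virtual Ring.
    A cell that circulates the ring and returns is an Implicit ACK."""

    def __init__(self, window_size):
        self.window = window_size
        self.outstanding = {}   # seq -> cell, awaiting its return

    def can_send(self):
        return len(self.outstanding) < self.window

    def send(self, seq, cell, put_on_ring):
        self.outstanding[seq] = cell
        put_on_ring(cell)

    def on_cell_returned(self, seq):
        # The cell travelled around the ring back to its source:
        # every member has seen it, so treat this as an acknowledgment.
        self.outstanding.pop(seq, None)

    def on_explicit_nack(self, seq, put_on_ring):
        # Unicast retransmission request from a receiver missing ordered cells.
        if seq in self.outstanding:
            put_on_ring(self.outstanding[seq])
```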
3 Survivability Issues and Tradeoffs
3.1 Relevant Circuit-Switched Restoration
The most relevant survivability issues that translate from circuit-switched restoration to multipoint restoration are: (1) the location at which to initiate restoration and (2) the method of identifying a backup route for restoration. Considering the tradeoffs briefly outlined below, end-to-end preplanned restoration is the only option that simultaneously provides both optimal and guaranteed restoration.
There are two locations for initiating restoration: end-to-end restoration, initiated between sender and receivers, and local restoration, initiated between the upstream and downstream nodes adjacent to the detected fault. While local restoration offers greater speed and ease of implementation, it may also result in blocking and “backhaul” situations where the backup route is not disjoint from the remaining working route. In general, initiating restoration further from the fault results in slower restoration but yields a globally optimal solution, less blocking, and avoidance of backhaul situations. There are two methods of identifying a backup route for restoration: a preplanned backup route can be computed in advance, or a dynamic search for a backup route can be computed in real time. In the preplanned method, each working route has a prearranged backup route when initially assigned. In the dynamic search method, senders and receivers broadcast signaling messages after being notified of a network failure and backup routes are selected via a specified algorithm. In the preplanned method, survivability is guaranteed because the backup route has reserved bandwidth. In dynamic search, no bandwidth is reserved for backup routes, so restoration is not guaranteed (i.e., residual bandwidth may not be available).
3.2 Survivability of Proposed Models: VC Mesh, Shared Multicast Trees, Multicast Server, and Virtual Ring
In the VC Mesh Model, group dynamics drive the establishment of a new VC for each sender in a group in order to reflect updated membership after each join or leave operation. Also in the VC Mesh Model, multiple senders transmitting to the same group members use separate VCs, while in the MCS Model multiple senders each establish point-to-point VCs to the MCS and the MCS then integrates point-to-multipoint transmissions to all group members. For both of these reasons, the number of VCs utilized in the VC Mesh Model increases with the number of senders at a faster rate than in the MCS Model. As a result, there is potentially a larger number of routes over which a failure may occur for the VC Mesh, which means an increased exposure to link and node failures. The MCS constrains the number of potential paths over which a failure may occur by concentrating traffic on a server, which is both a performance bottleneck and a single point of failure. If an MCS is disabled, all group sessions using that MCS terminate. For the ATM Shared Multicast Tree, the dominant survivability issue is handling link or node failures in the “trunk” of the shared tree, which will interrupt all group communications. On the other hand, sharing a common tree makes the protection of links and nodes an easier task. A general technique to provide survivability to tree-based approaches is to restore faults on the “working” tree by rerouting to a preplanned disjoint “backup” shared multicast tree, as depicted in Figure 2A. I propose extending Virtual Ring multicasting to ATM group communications in the form of Self-Healing Survivable Rings (SHSR). An SHSR consists of a working and a backup set of routes, both in the form of rings.
[Figure 2: (A) a working tree and a disjoint backup tree; (B) a working Virtual Ring over nodes E, H, I, J, K with a counter-rotating backup Virtual Ring in the opposite direction.]
Fig. 2. Restoration Techniques to Provide Survivability to (A) Tree-based or (B) Ringbased ATM Group Communications
In the case of a failure in the working ring, traffic is rerouted to the backup ring by switches adjacent to the failure. To use SHSR to provide survivability, two contributions are necessary: (1) a novel formulation of a least-cost Virtual Ring connecting all desired group members that is link and node disjoint, such that a single fault will only disconnect a ring in one place, and (2) a real-time ring reconfiguration mechanism to provide restoration of single faults while maintaining the established ring. In [14], the author develops a novel “Disjoint Steiner Ring” (DSR) formulation which provides a theoretical basis upon which SHSRs can be implemented. A real-time ring reconfiguration mechanism to provide restoration of single link or single node faults can be achieved by bandwidth provisioning of circuits for counter-rotating rings. When a link or node fault occurs, the adjacent upstream node in the working ring will loop back onto the counter-rotating backup ring [14], as shown in Figure 2B. This real-time ring reconfiguration mechanism requires no new ATM signaling capabilities.
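The loop-back step lends itself to a simple illustration. The sketch below is a simplification of our own: it models the working ring as a node cycle and computes the path the upstream node uses on the counter-rotating backup ring after a link failure. The node names reuse the labels from Figure 2B.

```python
def backup_route(ring, failed_link):
    """ring: node cycle in the working direction, e.g. ['E','H','I','J','K'].
    failed_link: (upstream, downstream) pair on the working ring.
    Returns the loop-back path the upstream node uses on the
    counter-rotating backup ring to reach the downstream node."""
    up, down = failed_link
    i = ring.index(up)
    path = [up]
    # Walk the backup ring (reverse direction) from the upstream node
    # until the node beyond the failure is reached.
    while path[-1] != down:
        i = (i - 1) % len(ring)
        path.append(ring[i])
    return path

# Example: failure of link H->I on working ring E->H->I->J->K->E:
# backup_route(['E','H','I','J','K'], ('H','I')) == ['H','E','K','J','I']
```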
4 Experimental Results
Here I report the results of experiments to compare the feasibility and cost of preplanned dedicated backup route restoration techniques to provide survivability for ATM group communications implemented via the VC Mesh Model (VCMESH), the Shared Multicast Tree (SMT), and the Self-Healing Survivable Ring (SHSR). The MCS model was not included because of its dominant single-point-of-failure vulnerability, which requires a different set of restoration techniques. To enable comparisons of feasibility and cost I make the following state-of-the-art assumption: only single link/node faults are considered. I consider two actual network configurations used previously in [14,15]. For these networks, all groups of sizes 3 and 4 are formed.
When a link or node fault occurs within a “working” set of routes, traffic flow is automatically rerouted to the corresponding disjoint “backup” set of routes. Minimum-cost working and backup rings are calculated by an implementation of the DSR formulation [14]. The SMT approaches leave the issue of how to build the shared tree to an external routing protocol, so I have assumed the best-case scenario by identifying the minimum-cost tree calculated using an implementation variant of a Steiner Tree procedure due to Lawler, known as the spanning tree enumeration algorithm [15]. The VCMESH approach specifies that each sender establishes circuit(s) to connect with all other group members but does not specify the mechanism, so I have assumed the best-case scenario such that each sender optimally selects the set of circuits that minimizes cost. For the networks studied, the feasibility of restoration using the different techniques is not equal. Figure 3A shows that while the SHSR displays 100% feasibility, the feasibility of the SMT and VCMESH techniques begins at approximately 88% and 58% respectively and rapidly decreases as group size increases.
[Figure 3: (A) “Feasibility of Survivability Techniques by Group Size” (percent of groups where disjoint backup restoration exists, for group sizes 3/4): SHSR 100.00%/100.00%, SMT 88.33%/78.10%, VCMESH 58.33%/25.24%; (B) “Cost of Survivability Techniques by Group Size” (cost in links, averages with 95% confidence intervals) for working/backup VCMESHs, SMTs, and SHSRs.]
Fig. 3. Results Comparing Survivability Techniques for ATM Group Communications (for group sizes 3 and 4): (A) feasibility; (B) average cost
The explanation for the poor feasibility of the SMT and VCMESH techniques is the existence of “multipoint traps”. A “trap” is a topology where a corresponding set of backup routes is not available due to the disjointness constraint, although disjoint working and backup routes may be available if selected differently [4]. Traps occur because the routing algorithm for the working set of routes optimizes selection according to a least-cost/minimum-hop or similar metric without considering the survivability provided by the selection of a non-optimal
set of working routes that can be paired with a disjoint set of backup routes. “Multipoint traps” become more prevalent as the number of group members increases because more links must be used for the “working” SMT/VCMESH, leaving fewer links for forming a disjoint “backup” SMT/VCMESH. “Multipoint traps” also explain why the VCMESH technique has a dramatically lower feasibility than the SMT technique, since the VCMESH utilizes more working routes. SHSRs have superior feasibility due to the DSR formulation, which solves for both the working and backup routes jointly to eliminate “multipoint traps”. Multipoint traps have become the dominant factor for preplanned end-to-end restoration of ATM group communications after being a relatively rare event in point-to-point restoration. Given that the use of disjoint dedicated backup SMT/VCMESH techniques to provide survivability to ATM group communications is not feasible for groups on arbitrary networks, there still exist densely connected networks where these techniques can provide guaranteed single-fault tolerance. For these dense networks only, the question of restoration cost is relevant. Figure 3B shows the cost of the SHSR technique is lower, with statistical significance, than either the SMT or VCMESH techniques, with a cost differential that increases as the group size increases. The size of the cost differential is not small: SHSRs are 20% less expensive than SMTs and 56% less expensive than VCMESHes.
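The trap phenomenon is easy to reproduce in a few lines of Python. The toy topology below is our own illustration (not one of the networks studied here): greedy shortest-path selection of the working route leaves no link-disjoint backup, while jointly selecting working and backup routes succeeds.

```python
# Trap topology: two link-disjoint 4-hop routes s-u1-u2-u3-t and s-v1-v2-v3-t,
# plus a cross link u1-v3 that creates a tempting 3-hop shortcut s-u1-v3-t.
EDGES = {('s','u1'), ('u1','u2'), ('u2','u3'), ('u3','t'),
         ('s','v1'), ('v1','v2'), ('v2','v3'), ('v3','t'), ('u1','v3')}

def neighbours(edges, u):
    return [w for a, b in edges for v, w in ((a, b), (b, a)) if v == u]

def shortest_path(edges, src, dst):
    """Plain BFS; returns a node list, or None if dst is unreachable."""
    prev, frontier = {src: None}, [src]
    while frontier:
        nxt = []
        for u in frontier:
            for w in neighbours(edges, u):
                if w not in prev:
                    prev[w] = u
                    nxt.append(w)
        frontier = nxt
    if dst not in prev:
        return None
    path, u = [], dst
    while u is not None:
        path.append(u)
        u = prev[u]
    return path[::-1]

def path_edges(path):
    return {tuple(sorted(pair)) for pair in zip(path, path[1:])}

def simple_paths(edges, src, dst):
    stack = [[src]]
    while stack:
        path = stack.pop()
        if path[-1] == dst:
            yield path
        else:
            stack.extend(path + [w] for w in neighbours(edges, path[-1])
                         if w not in path)

edges = {tuple(sorted(e)) for e in EDGES}

# Greedy selection falls into the trap: the shortest working path uses the
# cross link, and no link-disjoint backup remains afterwards.
working = shortest_path(edges, 's', 't')                               # s-u1-v3-t
print(working, shortest_path(edges - path_edges(working), 's', 't'))  # ... None

# Joint selection escapes it: search for any pair of link-disjoint paths.
paths = list(simple_paths(edges, 's', 't'))
print(next((p, q) for p in paths for q in paths
           if not path_edges(p) & path_edges(q)))   # the two 4-hop routes
```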
5 Conclusion
This research has investigated the extension of circuit-switched techniques to the restoration of ATM group communications implemented via the VC Mesh model, shared multicast trees, and self-healing survivable rings. Experimental results show that for the networks studied the use of disjoint dedicated backup is 100% feasible for the SHSR technique and less feasible for the SMT and VCMESH techniques due to “multipoint trap” topologies. Results also show that the relative ranking order of each technique by cost is consistent across different networks and group sizes: SHSR (lowest cost), SMT (middle cost), and VCMESH (highest cost). I conclude that providing survivability to ATM group communications via SHSRs is both feasible and the least cost preplanned backup restoration technique. Further research is needed to develop less complex but accurate heuristics that can scale to large networks.
6 Acknowledgment
I thank David Tipper, Luigi Rizzo, and the anonymous referees for comments which have significantly improved the manuscript.
References
1. G. Armitage, “Support for Multicast Over UNI 3.0/3.1 Based ATM Networks,” IETF RFC rfc2022.txt, November 1996.
2. ATM Forum, ATM User-Network Interface (UNI) Signaling Specification Version 4.0, July 1996.
3. ATM Forum, ATM User-Network Interface (UNI) Specification Version 3.1.
4. D. Dunn et al., “Comparison of k-Shortest Paths and Maximum Flow Routing for Network Facility Restoration,” IEEE J. on Sel. Areas in Comm., Vol. 12, No. 1, pp. 88–99.
5. E. Gauthier et al., “SMART: A Many-To-Many Multicast Protocol for ATM,” IEEE J. on Sel. Areas in Comm., Vol. 15, No. 3, pp. 458–472.
6. M. Grossglauser and K.K. Ramakrishnan, “SEAM: Scalable and Efficient ATM Multicast,” IEEE INFOCOM’97, 1997.
7. S. Komandur and D. Mossé, “SPAM: A Data Forwarding Model for Multipoint-to-Multipoint Connection Support in ATM Networks,” IEEE IC3N’97.
8. K. Murakami and H.S. Kim, “Virtual Path Routing for Survivable ATM Networks,” IEEE/ACM Trans. on Networking, Vol. 4, No. 1, pp. 22–39.
9. T. H. Noh et al., “Reconfiguration For Service and Self-Healing in ATM Networks based on Virtual Paths,” Computer Networks and ISDN Systems, Vol. 29, pp. 1857–1867.
10. Y. Ofek and B. Yener, “Reliable Concurrent Multicast From Bursty Sources,” IEEE J. on Sel. Areas in Comm., Vol. 15, No. 3, pp. 434–444.
11. J.E. van der Merwe and I.M. Leslie, “Service-Specific Control Architectures for ATM,” IEEE J. on Sel. Areas in Comm., Vol. 16, No. 3, pp. 424–436.
12. Y. Xie et al., “Multicasting Over ATM Using Connection Server,” ICC’97.
13. B. Yener, Y. Ofek, and M. Yung, “Combinatorial Design of Congestion-Free Networks,” IEEE/ACM Trans. on Networking, Vol. 5, No. 6, pp. 989–1000.
14. W. Yurcik and D. Tipper, “Providing Network Survivability to ATM Group Communications Via Self-Healing Survivable Rings,” 7th Intl. Conf. on Telecomm. Systems, 1999, pp. 501–518.
15. W. Yurcik, “Providing ATM Multipoint Survivability Via Disjoint VC Mesh Backup Groups,” 7th IEEE Intl. Conf. on Computer Comm. and Networks, 1998, pp. 129–136.
The Direction of Value Flow in Connectionless Networks
Bob Briscoe
BT Research, B54/74, BT Labs, Martlesham Heath, Ipswich, IP5 3RE, England
[email protected]
Abstract. This paper argues that all network providers in a connectionless multi-service network should offer each class of their service to each neighbour for each direction at a single price. This is called “split-edge pricing”. If sets of customers wish to reapportion their networking charges between themselves, this should be tackled end-to-end. Edge reapportionment should not be muddled with networking charges, as is the case in the telephony market. Avoiding the telephony approach is shown to offer full reapportionment flexibility while avoiding the otherwise inevitable network complexity, particularly for multicast. “Split-edge pricing” is recursive, applying as much to relationships between providers as to edge-customers. Various scenarios are discussed, showing the advantages of the approach. These include phone-to-Internet gateways and even inter-domain multicast conferences with heterogeneous QoS. The business model analysis suggests a new, purely financial role of end-to-end intermediary in the Internet industry.
1 Introduction
Traditionally, data communications has been sold so cheaply that charging for it on a usage basis has not seemed feasible or sensible. While flat-rate subscription or connect-time charging prevails, the question of reapportioning the value of a particular communication between its ends rarely surfaces. With the possibility of variable quality of service (QoS) approaching, the need for some form of usage-charging for high QoS has arisen. This has led to new thinking on cheaper usage-charging systems for packet networks [15,2], which in turn has brought the issue of reapportionment of charges between the end customers back into the limelight [5]. However, to tackle this issue, the traditional model of a unit of communication value is inadequate because it grew in the context of connection-oriented telephone systems. Worse, it has often been misapplied to the connectionless Internet. The connection-oriented mind-set also leads to confusion over blame and liability for each unit of communication. This paper explicitly clarifies these fundamental issues. Our minimalist model for a connectionless network business has boundaries that match the service access points above and below the network layer of the OSI stack [20]. This business can be modelled as buying in lower
and higher layer services (e.g. links, virtual connections or naming services). It still applies to ISPs that use their own links and services: the cost just becomes internalised. This paper proposes a simple charging model that can be applied between any pair of multi-service connectionless networks for each class of service and for send and receive separately. It works whether the pair are both providers or even if one is an edge customer. The model’s simplicity ensures charging will always be straightforward at every border in the Internet, whether for unicast or multicast flows. No matter how many networks are connected together, any one network is only dependent on prices from its direct neighbours. Therefore, the model is intrinsically scalable. We call the model “split-edge pricing”. From the end customers’ points of view, this means that any flow through the Internet is sold on entry and on exit. As a consequence, the model appears to require each end customer to pay the price of their local network provider. This appears to restrict any customers who would rather reapportion the costs differently between themselves (termed clearing). Invariably, network providers offer their services at a set price regardless of the value each customer derives from each transmission. This is a natural consequence of a competitive market, often called a “buyer’s market”. If usage-based charging is in operation, no one bothers with any communication of less value than this market price. However, transmissions naturally have at least two ends. (In fact, we consider two-ended flows as just a specific case of multipoint flows.) Often a transmission just never happens because one of the ends derives less value than their local price. Often in such cases, the total value derived from the transmission by all ends would have been greater than the total charges levied by all providers on all end customers. Therefore, it is in any network provider’s interest to matchmake customers who derive surplus value with those who would otherwise be in deficit. That is, clearing plays an important role in encouraging network demand. Matchmaking in the traditional telephony market is well understood. Various ways are available for end customers to share the cost of a call besides the normal “originator pays”. Examples are “calls free to the originator”, “local charge only”, etc. Telephony interconnect arrangements ensure that wherever payment enters the system, it ends up being cleared between the providers who bore the cost of each call. However, the interconnect pricing scheme that drives clearing blurs the distinction between clearing of edge payments and the market price of interconnect. This paper argues that these over-complicated clearing arrangements are the result of evolution from a fully connected matrix of single-country providers and are flawed for the Internet. Instead we propose “split-edge pricing” as a more flexible replacement. The apparent problem of no flexibility to clear between the ends is solved simply. Clearing can be achieved end-to-end, directly between customers or their edge providers, bypassing the core network businesses. If, instead, clearing follows the same path as the data flow, we show that core network complexity becomes inevitable, particularly for multicast, but also for unicast. Incidentally, end-to-end clearing was never possible on the PSTN
because there was no convenient way to form independently routed end-to-end data connections simultaneously with call progress. Clearly, this is possible on the Internet. Clearing requirements will differ on a per-session basis; therefore the model where clearing takes place end-to-end involves per-session accounting without involving the network providers along the data path (other than those at the edge). Thus, it seems natural for clearing to re-use existing e-commerce concepts and mechanisms. This results in a scenario where the traditional telephone bill becomes an anachronism for the Internet. Instead, edge-provider charges can be settled by any other third party across the ends of a communication, leaving the “bill” as just the balance of those charges that are directly retailed between the edge-provider and its edge-customer. E-commerce-based clearing allows part of local customer A’s usage to be “wholesaled” to remote customer B or to third party C, while part of customer B’s usage can be wholesaled to customer A, and so on. The cost of the act of clearing is significant; therefore it is important that the default apportionment in the core model matches the most common case. We establish that the common case is where both senders and receivers pay, at all ends of each transmission. Incidentally, this makes the perceived need for the network to report how many receivers are subscribed to a multicast evaporate. We then consider the unpleasant fact that, on the Internet, a receiver can never protect itself from being sent to. We suggest a rather novel business model that is still optimised for the common case, but simultaneously has no receiver liability. For completeness, we examine the specific problems with the traditional clearing model used in telephony. The paper draws to a close by working through some example scenarios to suggest how the models would work in practice. Finally, limitations and further work are listed before conclusions are drawn.
2 Related Work
Some authors state that they believe the business model of the current fixed-access-rate Internet is “sender takes all” [8,21]. This phrase is used to imply that the sender’s ISP receives all the revenue. This is completely erroneous. ISPs’ rates relate to access bandwidth, regardless of the direction in which it is used. Thus, “sender and receiver each take half” is more appropriate (approximately). This is a similar position to the half-circuit charging common for data links, but applied end-to-end. MacKie-Mason et al. assert that the blame for a transmission is impossible to determine at the network level [12], an argument that can descend into sophistry. However, later, using precise definitions of the terms, we argue that the sender is always to blame for a transmission in a connectionless network. Clark analyses the apportionment of charges between senders and receivers [5], and proposes an engineering solution, which he admits would introduce considerable complexity to the Internet if implemented. Shenker et al. describe edge pricing [16], a business model that appears regularly in communication networks and which forms much of the background to this work.
3 The Value of Place
The value of communication concerns the incremental value of having information in a certain place (or places) by a certain time, instead of or as well as in the original places. Usually, the more the information is worth, the more value is placed on having it in the right places in a timely manner. There is no value to the customer at all while information is in transit; it is delivery that is important. Strictly, one also has to take account of the mitigating cost of storage in both places (or only in the second place if the sender deletes after sending). In summary, the added value of transmission is the marginal change in value caused by associating a new location at the delivery time with the intrinsic value of the information to the customer. However, because the data communications market is fairly competitive, charges for communicating information tend instead to follow the “cost plus margin” rule. This is particularly so because it is very difficult for providers to predict what value their customers put on moving any one piece of information. Any payment to an edge-network provider has two aspects: “who pays” and “who is paid”. “Who is paid” can only be each local provider collecting its local price. With competitive “cost plus” pricing there is no scope for any provider to break out of that. But, because communications naturally involves at least two parties, in order to cover the total costs of all the providers involved, “who pays” can be on a different apportionment. The edge customers do know the value to them of having the information at a certain place in time. Thus, although apportionment is difficult for network providers, it is very relevant to edge-customers. Clearly, the network providers can stimulate more use of their networks by making arrangements for customers to efficiently apportion costs between themselves.
4 End-to-End Pricing
If a price is higher than the perceived value for any customer, she is free to get the remote party (or anyone else) to make up the difference through some higher-level arrangement. On the other hand, if the value to her is higher than her local price, she is also free to offer to cover some of the costs of the remote end(s). However, our minimalist provider doesn’t have to be concerned with matchmaking multiple customers to get round local discrepancies between price and customer value. This is an issue that can be dealt with end-to-end, not locally. We are not saying ISPs shouldn’t offer end-to-end pricing: it is clearly in their interest to matchmake between customers with surplus value and those with deficit. All we are saying is that, if they do, end-to-end pricing should be considered as a separate role (Fig. 1). Such a role could be a separate business: it could gain on some combinations and lose on others, possibly making a profit overall. In this case it would be a retail service that used the networking services as wholesalers. It is also possible that edge customers could effectively take on this role themselves.
Fig. 1. “End-to-end pricing” role
Fig. 1 shows three end customers using a data path through multiple connected ISPs. The relative value of the service flows and prices for one direction of one class of service is represented by the thickness of the arrows. Note that the size of the proportions of prices represents a choice by the end-system that is willing to pay more than its local price. In fact, the end-to-end pricing role may advertise identical prices to each customer. However, these could be modified by an offer by A to cover a proportion (possibly all of it). Pricing between providers is omitted for clarity (but see later). Telephony firms have traditionally offered end-to-end pricing because they are selling an application. The role of network provider has always been muddled with selling the end-to-end application. This is already putting considerable strains on the International Accounting Rate System (IARS) [9], with potentially s(n−1)^2 prices having to be negotiated (where n is the number of edge providers and s is the number of global schemes for sharing the proportions of the price between the ends, e.g. local rate only, free to sender). In practice, end providers are grouped together to reduce the number of prices presented to customers. The PSTN uses addressing conventions (e.g. +800 for free to sender), but this limits commercial flexibility to the few schemes that are widely recognised. Clark proposed an Internet-based solution to allow flexibility [5]. However, catering for various combinations of sender and receiver payments through the core of the network needs packet format changes and router involvement. Further, wholesale prices between providers would have to be negotiated for every possible scheme for sharing charges between ends as well as for every possible grouping of end points beyond that boundary. Worse still, inter-provider accounting would then require traffic flows to be isolated and then further sub-classified by how much each end was paying on a per-flow basis. The “n^2 problem” would still exist for our end-to-end pricing solution, but this is fairly easy to contain by grouping. An example scenario is given at the
end of the paper. Importantly though, end-to-end pricing gets rid of all the inter-provider problems described above. There is no longer a need to identify end-to-end flows at inter-provider boundaries. Thus inter-provider charging could be based on bulk measures like average queue lengths, number of routing advertisements, etc. Also, most importantly, end-to-end pricing can be introduced without changing the Internet at all, and it allows future flexibility. To summarise so far, we should ensure any discrepancy in the willingness to pay across end customers is normalised end-to-end first, so that edge ISPs always receive payment at their local price.
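To put rough numbers on the scaling argument, the following back-of-envelope comparison is our own illustration, assuming n edge providers, s global sharing schemes, and, for split-edge pricing, d direct neighbours and c classes of service per provider (the per-provider figures are invented for the example).

```python
def iars_prices(n, s):
    """Telephony-style end-to-end pricing: potentially one negotiated price
    per ordered pair of distinct edge providers, per sharing scheme: s(n-1)^2."""
    return s * (n - 1) ** 2

def split_edge_prices(n, d, c):
    """Split-edge pricing: each provider quotes two half-prices (send and
    receive) per class of service per direct neighbour; negotiation is local."""
    return n * d * c * 2

# e.g. 1000 edge providers and 5 sharing schemes, vs. 10 neighbours and
# 4 classes of service per provider:
print(iars_prices(1000, 5))            # 4990005 globally negotiated prices
print(split_edge_prices(1000, 10, 4))  # 80000 prices, all between neighbours
```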
5 Common Case Value Apportionment
Although we have delegated the problem of apportioning sender and receiver payments to a higher layer, it is still important to cater for the common case at the network charging level so that the higher-layer functions are unnecessary in most cases. We propose that all edge providers should charge their local customers for both sending and receiving as the default case. Our primary justification for this stance is that the large majority of communication occurs between consenting parties. In this section we also argue that other possible default scenarios (e.g. “only senders pay”) would be unstable anyway and collapse back to our proposed model, which appears stable. Further, delegating charge reapportionment to a higher layer eliminates all but local pricing, so that we can extend charging for sending and receiving recursively to apply at the boundary between any pair of providers. This greatly simplifies the inter-provider metering essential for Internet scalability as QoS and multicast are introduced. Thus our edge pricing model applies to any “edge”, whether at the edge of the Internet or just the edge of a backbone. The stability analysis intrinsically applies equally to this more general case. Incidentally, allowing prices for each direction to be different hardly needs justification. It allows for asymmetric costs (e.g. access technology like xDSL or satellite) and for asymmetric demand (e.g. some ISPs might host more big senders, while others might host the mass of receivers). If these factors aren’t asymmetric, the two prices can simply be set to be the same. Figure 2 shows a generic scenario with multiple networks, N, all connected to the network of interest, Nb. Each connected network has a status relative to Nb based on whether it provides more or less connectivity to other hosts at that class of service. Although the diagram gives the impression that Nb is a backbone network, any one of the neighbouring networks could be a simple link to an edge customer’s single host. The model is designed to be general enough for Nb to be an edge customer, an edge network, a backbone network or some hybrid. Those networks with the same suffix are of similar status relative to Nb. For instance, those labelled Nc may be edge customers, Nd may be equally large backbones and Ne a peer network. In fact, this is a simplification. To be more specific, we propose that a provider should offer each class of service in each direction at a separate price. Thus, Fig. 2 shows the situation for one of possibly many classes of service.
Fig. 2. Split-edge pricing
Class of service is defined as a unique combination of the service mode (unicast, multicast) and quality (latency, instantaneous bandwidth, reliability, jitter). Quality specifications within one class may leave one parameter to be specified by the customer while others remain fixed, thus generalising both RSVP and diffserv [19,14]. Appendix A justifies treating each class of service independently. Appendix B gives an introduction to the model for each class of service, but allows heterogeneous QoS per leg of the multicast. The full model is given in an earlier version of this paper [1]. However, all this detail would obscure the summary of the analysis we attempt to give here, which we now continue. A packet of a particular class of service is shown being multicast from Na into Nb and onward into the other networks. Because multicast is a general case of unicast, this allows us to model both topologies. We will also be able to treat the topology as aggregation (reverse multicast) by reversing the direction of transmission; examples of packets that are forwarded until aggregation are RSVP receiver-initiated reservation (RESV) messages, and pragmatic general multicast (PGM) [17] negative acknowledge (NACK) messages or the “lay breadcrumbs” messages [6] suggested in their place. The term packet is used, but the arrows could represent flows of similar-class packets for a certain time. Fig. 2 highlights the pricing between networks Na and Nb. Wbas and Wbar denote the per-direction weightings applied to the “nominal charge” that Nb applies to Na (for more detail on exactly what the nominal charge means, see Appendix B). Wabs and Wabr likewise weight the charge Na applies to Nb. Each weighted price is for transmission between the edge in question and the remote edge of the Internet, not just the remote edge of that provider. For full generality, there have to be four price weightings like this for every class of service at every inter-network interface, but the weights would take different
values unless the neighbours were of the same status. The relationship between any two parties across the edge of their networks is split into prices for each class of service; each of these is further split into two prices for each direction, each of which is again split into “half” prices that each party offers the other. Hence, we call this model “split-edge pricing”. Thus the payment for traffic in any one direction across each interface depends on the difference between the two weighted prices offered by the networks either side. In other words, no assumptions are made about who is provider and who is customer; this purely depends on the sign of the difference between the charges at any one time. Clearly, edge customers (Nc, say) have no provider status in the networking market. So, for all j, Wcjs = 0 and Wcjr = 0. In Appendix B we analyse policies like “only senders pay” or “only receivers pay” using the model (by simply setting all receiving weights to zero or all sending weights to zero). Stability of a policy is determined by assessing whether one network would gain from a maverick policy. The results are summarised here. “Only senders pay” or “only receivers pay” are only stable policies if all providers agree to adopt the same policy, and none break ranks. As soon as one goes maverick, customers who are primarily receivers and those who are primarily senders migrate to different providers. Income appears to remain stable, but the source of the income switches from retail customers to interconnect, causing the inter-provider link to become a bottleneck. Thus costs increase without any increase in revenue. “Only senders pay” is also unstable where multicast is concerned. (“Only receivers pay” is all but meaningless for multicast.) To support an “only multicast senders pay” policy, all domains have to trust each other to faithfully report receiver community size. In Appendix B we show that it is simple for a domain to lie about its local receiver community size to increase its profits. Proposed mechanisms such as the “EXPRESS count management protocol” [7] suffer from this flaw. Solving this problem is unlikely to be successful without breaking the scalability benefits of receiver-initiated IP multicast that ensure upstream nodes are unaware of downstream join and leave activity. In contrast, “both senders and receivers pay” is stable in both unicast and multicast cases. Unlike the above cases, it also doesn’t lead to inefficient network utilisation. It is also possible to cater for different balances of predominant senders and receivers by weighting the sending price differently to the receiving price. For instance, if there are a few big predominant senders but many small predominant receivers, the economy of scale in managing a large customer can be reflected in a lower sender weighting. Similarly, the inefficiencies of multicasts to small receiver communities compared to multiple unicasts can be discouraged by slightly weighting multicast sender pricing. The aggregation case is similar, with “both senders and receivers pay” stable while the two other policies go unstable for the same reasons as for multicast, but swapped round. If end customers want a different apportionment of charges, we have made the case for this being arranged end-to-end.
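A minimal sketch of the settlement rule as we read it, for one class of service at one interface: the net payment in each direction is the difference between the two weighted “half” prices the neighbours offer each other, and its sign decides who is customer and who is provider at that moment. All names and the linear form of the nominal charge here are our own simplifications, not the formal model of Appendix B.

```python
from dataclasses import dataclass

@dataclass
class HalfPrices:
    """Per-class weightings one network offers a given neighbour:
    w_send is applied when the neighbour sends traffic in,
    w_recv when the neighbour receives traffic out."""
    w_send: float
    w_recv: float

def settle(nominal, b_offers_a, a_offers_b, direction):
    """Net amount A pays B per unit of nominal charge for one class of service.
    A negative result means B pays A: no fixed customer/provider roles."""
    if direction == "a_to_b":        # A sends into B
        return nominal * (b_offers_a.w_send - a_offers_b.w_recv)
    else:                            # "b_to_a": A receives from B
        return nominal * (b_offers_a.w_recv - a_offers_b.w_send)

# Edge customers have no provider status, so their own weights are zero
# (Wcjs = Wcjr = 0) and they simply pay the provider's price in both directions.
customer = HalfPrices(w_send=0.0, w_recv=0.0)
provider = HalfPrices(w_send=1.0, w_recv=0.8)      # e.g. receiving priced lower
print(settle(10.0, provider, customer, "a_to_b"))  # 10.0: customer pays to send
print(settle(10.0, provider, customer, "b_to_a"))  #  8.0: customer pays to receive
```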
The remainder of the paper concentrates on issues surrounding clearing. Also, at the end, various worked examples are given to illustrate how it would be achieved in practice. However, first, we introduce one further relevant issue: that of how a receiver can control its costs if it can’t stop itself being sent to.
6 Blame, Liability, and Control
We have shown that all ends paying is the common case and a stable one, so it should be the default. We can share the cost differently at a higher level if end-user value is shared differently from this default (and if it is worth bothering, given the cost of another financial transfer). However, we must remember that a sender can decide not to send, but a receiver cannot avoid being sent to (in the current Internet).
We must be careful here to define the context of the question of blame. We are only concerned with blame for sending into or receiving from the service access point of the network layer. Clearly, if someone operates a Web service, they don't normally decide whether to send replies on a request-by-request basis. But this doesn't mean they have been forced to send at the network level. They have chosen to put the service on a well-known port with public access. They can stop certain people requesting them to send by securing the Web server or interposing a firewall. But whenever they send, it is because they have arranged it to be so.
Ultimate sender blame presents a problem. In cases where the sender derives surplus value from a communication and the receiver derives less value than their provider charges, receivers are vulnerable to being exploited (e.g. adverts). Such cases are much rarer than first appears, mainly because of confusions that can be cleared up by considering the following factors:
– The value of the information isn't relevant when considering the networking service - only the value of moving the information, i.e. getting it to a useful place
– Often the value of moving information is transitory - getting it to a useful place only to discover that moving it wasn't useful
– Often the value of moving lots of information is to get a small part of it to a useful place, but it isn't possible to know which part before moving it
– The cost of transmitting information is often far less than the cost of the effort of targeting which information should be transmitted
– Information in one direction often controls the flow of information in the other
Nonetheless, genuine cases remain where the receiver is being persistently forced to pay for transmission that is valuable to the sender but not to the receiver. The only solution to this seemingly intractable dilemma is for it to be customary for all ends to pay, but for ultimate liability to remain with the sender. Any receiver could then dispute the customary apportionment (end-to-end) with no risk of denial (unless the sender had proof of a receiver request). A similar but opposite situation used to prevail with the UK postal service. It was customary for the sender to pay for the stamp, but if it was missing or
insufficient the receiver was liable for the payment, because the Royal Mail had an obligation to deliver every letter.
7 End-to-End Clearing
Fig. 3. “End-to-end clearing” model
We have discussed how prices can be apportioned between the ends of a communication. We now discuss how payment will follow the same path. We can assume electronic commerce will make it possible for anyone to pay anyone else's ISP on the Internet, even if a clearinghouse is needed. We shall call this the "end-to-end clearing" model (Fig. 3). These arrangements will typically be made through higher-level protocols. The act of making a financial transfer has costs of a similar order to the cost of transmitting a couple of small e-mails. In addition there is the cost to the ISP of providing processing resources for authentication. Therefore, arranging a different apportionment of charges between ends is more likely for long-lived sessions, such as Internet telephony or conferences on the Mbone, than for short connections, such as are typical on the Web. However, a collection of related short connections may be combined into one longer-lived session for these purposes.
In the "end-to-end clearing" model, the clearinghouse role deals with the end-to-end "half-circuit" sharing (including the straightforward price differences between the two ends), leaving inter-provider accounting to be purely about wholesaling. The figure shows one end paying and follows example proportions of this money as they are distributed among the providers. The clearing role may be involved in taking payments for a higher-level service from each customer (e.g. conference fees or pay-TV charges). In such cases it knows the number of
participants and can charge the customer on the left (who is paying for everyone) accordingly. In this example each leg is charged at fifty units and no profit is made. Note that inter-provider money flows match the flow of networking service in the opposite direction across the same interface. This is the deliberate aim of the model: to ensure that bulk measurements at these boundaries can drive interconnect charges based solely on local conditions. It also allows each pair of providers to choose their own basis for metering, independently of arrangements at other interfaces. There is nothing to stop providers or customers assuming the clearinghouse role, but the accounting information model needs to be based on a third-party clearing system to allow for the most general case. To clarify, the paying customer may make payment:
– either to a dedicated clearing house
– or direct to the ISP at the remote end (the remote customer need only give the payer the address of her ISP's payment interface)
– or even direct to the remote customer, who can then pay her own ISP
In all cases, the role of clearing must be separate even if there is no separate enterprise to perform the function. Note that the last case is special - the clearing role is null, but it still appears in the information model. In other words, the charges for all ends should never be lumped together while accounting. If, instead, end-to-end half-circuit sharing were achieved through the provider chain, end-to-end clearing information would have to be identified separately from that needed for wholesale accounting. If clearing information were not identified separately, the types of model that could be built on the infrastructure would be restricted.
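To make the money flows concrete, here is a minimal sketch of the settlement in the example above, using the fifty-units-per-leg figure from the text; the provider names and the three-leg topology are hypothetical:

```python
# Hypothetical end-to-end clearing settlement: one payer covers all legs.
# Each leg's charge is set by the edge provider serving that leg.
leg_charges = {"ISP_A": 50, "ISP_B": 50, "ISP_C": 50}  # units per leg

payer_total = sum(leg_charges.values())  # the paying customer pays 150 units

# The clearer credits each edge ISP at that ISP's own local price, so
# inter-provider accounting remains purely wholesale.
ledger = {isp: charge for isp, charge in leg_charges.items()}

# In this example no profit is made: everything paid in is paid out.
assert payer_total == sum(ledger.values())
```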
8 Iterative Clearing
We have presented what we believe to be an optimum business model, but other models need to be considered. In particular, we will now consider a model similar to the public 'phone service, which has one or two implicit features that need to be separated out for full understanding. We will consider payment in this model first, rather than pricing, as it will then be easier to understand the pricing issues. In this model, ISPs don't expect payment for all sent and received traffic to be made to all edge providers (Fig. 4). Instead a customer might pay their own provider on behalf of both (all) ends, as in the normal case for telephony and as proposed by Clark for the Internet [5]. This alternative business model allows customers to decide into which end(s) payment enters the system, on a per-flow basis. We shall call this the "iterative" model, for reasons that will become clear as we go. The financial flows between providers in this model depend on the ends at which payment enters the system, on a per-flow (or per-packet) basis. For some flows, there may even be proportional sharing of costs between the ends. For business model flexibility an accounting system would need a "payee
percentage" field - the percentage of the total cost to be paid by the customer at the end being accounted for. Usually it would be 100% or 0%, in the typical cases of "paid completely to local provider" or "completely to remote". The balance would be the remote end's payment (a sketch of such a record follows the list below). Note, though, that the perceived purpose of this model is the transaction efficiency when the local payee gets 100%. If Fig. 4 is compared with the end-to-end clearing model in Fig. 3, both models end up with two of the edge ISPs paid the same amounts on a half-circuit basis (we will explain where the third leg of the multicast has gone later). The difference is merely in the route the payment takes from payer to payee. With "iterative" clearing the payment follows the data path. Along the way, providers take their cut, with two types of money sharing being mixed together:
– wholesale cut
– half-circuit sharing
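A sketch of what such a usage record might look like (the field names and layout are our own invention for illustration; the text does not define a concrete format):

```python
from dataclasses import dataclass

@dataclass
class UsageRecord:
    """Hypothetical per-flow accounting record for iterative clearing."""
    flow_id: str
    total_cost: float       # total end-to-end cost of the flow
    payee_percentage: int   # share paid by the customer at this end (0-100)

    def local_share(self) -> float:
        return self.total_cost * self.payee_percentage / 100

    def remote_share(self) -> float:
        # The balance is paid by the customer(s) at the remote end(s).
        return self.total_cost - self.local_share()

# Typical case: paid completely to the local provider.
rec = UsageRecord("flow-42", total_cost=100.0, payee_percentage=100)
assert rec.remote_share() == 0.0
```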
Fig. 4. “Iterative clearing” model
However, the amount deducted from the flow at each boundary doesn't match the level of service crossing that boundary. This can lead to complexity in the network, as there is pressure to design the network itself to reveal the apportionment of costs. This is why Clark was concerned about how much complexity would be added to the Internet to cater for arbitrary combinations of sender and receiver payments. It is also why international and interconnect charging on the PSTN have limited flexibility to apportion charges arbitrarily between the ends. Even "free to sender" calls are blocked between many countries because they don't yet have prices set or the interconnect accounting in place. Specifically, there are five points stacked up against the "iterative clearing" model:
– As already pointed out, a "payee percentage" field would have to drive inter-provider accounting, whether it was in accounting messages or packets. Otherwise the revenue of an edge ISP and its upstream providers would depend on a factor completely outside their control - to which end its customers chose to make payment. The "payee percentage" field would therefore have to be trusted by upstream providers. To help prevent the field being tampered with, it would need to be signed by the remote ISP. How signed fields can be aggregated without losing the signature integrity would be a matter for further research.
– Still further complication might be introduced for some future applications if the share of payment between the parties wasn't fixed but depended on characteristics of the flow or other parameters only understood at a higher level - higher than the provider would normally be interested in.
– Worse still, the payment should ideally be split taking into account the current prices of all the edge providers who will eventually be paid. The only alternative (used in the international accounting rate system (IARS) for telephony) is for ISPs to agree compromise prices between themselves that average out price inconsistencies. This is what has been causing all the tensions in IARS, as some countries liberalise earlier than others, causing huge variation in prices around the world, between which no happy compromise can be found. This is difficult even for a system where every end-to-end path passes through at most two international carriers, each pair setting compromise prices with each other. With eight ISPs on many end-to-end Internet paths (five being typical [11]) and considerable peer interconnection, it is likely to take longer to negotiate prices than the time available, leading to distortions of providers' supply and demand signals.
– Because of the much longer provider chains typically found on the Internet, unacceptable delays will be introduced before the revenue arrives in the correct place. Any delay in clearing hugely increases the cost of the payment system, as extra trust mechanisms have to be invoked while the payment remains unconfirmed. These trust mechanisms have to be applied to the edge customers, not just the providers, hugely increasing the total cost of the system.
– Finally, if multicast is to be catered for by iterative clearing (e.g. conferences), each provider needs to know how many ends they are serving locally, both to inform the person paying and to check settlement. In contrast, with the end-to-end clearing model, if senders and receivers all pay, no-one needs to count ends nor trust others to count ends for them. If clearing is desired by the ends, only the ends need to know how many ends there are to pay for - no-one needs to calculate how many ends are attached to each provider. Thus, for instance, if there is a charge to join a conference, this can cover the cost of paying each participant's communications charges as well as the content (each participant would have to declare their ISP when they join). The more who pay the host to join, the more there is to cover charges. Then the host can send bulk payments to the relevant ISPs, either directly or through a single clearer (see worked example below). Multicast has been omitted from
Fig. 4, simply because there is no generic way to handle multipoint networking with iterative clearing. The only advantage of the "iterative" model is that it appears to reduce (by one) the number of transactions needed to achieve the desired apportionment. Also, all the inter-provider transactions can be fairly lightweight because they can be batched up. For example, consider the case where both parties in an Internet 'phone conversation are being paid for by the caller. It appears less complex for the caller to make everyone's payments to her own ISP, then let the ISP transfer the correct amount to its upstream provider as part of a bulk transaction. However, on the other side of the bargain is a considerably more complicated network, compromise pricing, increased credit time lags, and less flexibility in inventing new ways to apportion charges, particularly for multicast.
9 Example Scenarios
9.1 Finding an End-to-End Price
Let us assume some way has been invented for an ISP's edge customer, Ca, to announce her intention to cover some part of the transmission costs of parties communicating with her, Cb, Cc, etc. Some suggestions are given in [5]. Kausar suggests modifications to SDP [13] to achieve this for longer sessions [10]. A price needs to be set and settlement made between each pair of parties. If this is achieved end-to-end between the parties involved, there are no further engineering implications - the pairs of parties clearly trust each other enough to enter into a financial arrangement and are willing to accept the cost of the transaction. However, there will be many occasions where the parties have no trust relationship. In these cases the problem reduces to Cb, Cc, etc. finding suitable intermediaries. First they must know Ca's ISP. Ca may have already given this information in the session description protocol. Alternatively, a directory of ISPs could be operated by the Internet Assigned Numbers Authority (IANA) or any private concern, in which one could look up the network address of Ca and be given the payment interface of the associated ISP. This directory might be operated by one organisation monolithically (feasible for the current 74,000 ISPs) or it could be hierarchical like DNS. They may then choose to check whether their own ISP has a direct relationship with Ca's ISP. Alternatively, they could go straight to a different directory, which we postulate would be necessary. This directory would accept lists of ISPs and return a list of organisations that would act as an intermediary between them all. The mechanism would be identical to a Web search engine - an inverted index of ISP-intermediary pairs that would accept queries of logically ANDed ISPs. There may be some intermediaries who will deal with any combination of ISPs through their own network of secondary clearing arrangements. In any case, the resulting intermediary could then be contacted to find the price being offered for the particular combination of ISPs. The same organisation would naturally take the payments and clear them between providers. The
intermediary would also have to find out the prices being charged by the relevant edge providers, which would represent its back-end costs. We assume the providers would be using tariff dissemination protocols such as those in Rizzo et al [15], Carle et al [4] or Yemini et al [18], which could be listened to by the clearer as easily as by the edge customers, particularly if transported over multicast.
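As an illustration of the intermediary directory described above, the following is a minimal sketch of an inverted index from ISPs to the intermediaries that can clear between them; all names are invented:

```python
# Hypothetical inverted index from ISP to the intermediaries able to clear
# for it; a query for several ISPs is the intersection (logical AND) of
# the per-ISP sets, exactly like a search-engine query.
index = {
    "isp-a.example": {"ClearCo", "NetSettle"},
    "isp-b.example": {"ClearCo", "PayHub"},
    "isp-c.example": {"ClearCo", "NetSettle", "PayHub"},
}

def find_intermediaries(isps):
    """Return intermediaries with clearing arrangements covering all ISPs."""
    sets = [index.get(isp, set()) for isp in isps]
    return set.intersection(*sets) if sets else set()

# An end customer looks up a clearer covering all parties' ISPs.
print(find_intermediaries(["isp-a.example", "isp-b.example", "isp-c.example"]))
# -> {'ClearCo'}
```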
9.2 Accounting and Clearing with "Sender Liable But Local Payment Customary"
We concluded earlier that only the sender should be ultimately liable for usage charges. However, we suggested that the customary position should be to expect every customer to pay for both reception and sending. We will now describe how this necessary but rather novel business model would work, assuming also end-to-end clearing. At this point we have to make a clear separation between accounting and liability for payment. We propose that each edge customer and her edge network provider should first reconcile their usage records for sent and received traffic, whoever is expected to pay for that usage. That is, we require the edge ISP to be willing to sign the relevant accounts if asked. Whoever ends up paying, the price applied to each account will be that set by the edge provider supplying the service. We will also describe the more demanding case of post-payment.
The following discussion is easiest to understand if we consider the accounts for one flow at a time. We assume that accounting is operating in near-real-time on a per-session basis; such an approach makes sense as a session consists of a set of flows known to at least a subset of the participants. Again, for ease of understanding, let us consider one incoming flow to one customer before we consider her outgoing flow. Note that two or more edge charges are raised per flow, depending on the number of ends, but we are focusing on one customer at one end (termed "local"). For brevity, we will use feminine pronouns for the local customer and masculine for the remote customer(s).
If the local customer doesn't wish to pay her provider's charges for reception, she will present the account of her received flow to the clearer (discovered using the procedure in the previous section). She will identify the remote sender who she expects will pay. The clearer looks up the sender's ISP and debits its account, referring to the sender's address. When the sender's ISP settles its account with the clearer, it will deduct the amount from its account with the sender. The price used is the clearer's, which may or may not be identical to the local provider's price. If the sender is a large organisation, it might have an account directly with the clearer. The clearer also credits its account with the local ISP at the local provider's price, referring to the local customer. When this account settles, it will clear the local customer's debt with her local provider. The sender cannot dispute the charge unless he has evidence that the receiver previously agreed to accept payment liability for traffic of this type from him. Purely for the clearer's information, the local customer may identify which records she believes the remote party expects to pay and which she is simply disputing. Thus, despite the customary case being each end paying its own charges,
the onus is on the sender to ensure it can prove this. In most cases, there is a relationship between the parties in a communication which will allow the sender to safely assume the customary case, so that no receiver is expected to bounce its customary charges back to the sender. Where such trust isn't present, the sender might wish to ensure receivers have confirmed they are accepting the flow on the customary terms (where they pay their own end's charges). However, some ISPs may assume this liability and offer sender debt collection as a service.
We now move on to consider outgoing flows sent by the local customer. If she has evidence that the remote receiver has agreed to cover her local provider's charges for her sent traffic, she will present her account for the sent flow to the clearer. The mechanics of looking up accounts and debiting and crediting are as before. The receiver can only dispute this charge if he can prove the evidence saying he agreed to pay is invalid. The local customer may optionally notify her local provider that she expects settlement for each flow to come from the clearer, not herself. If the clearer never pays the local provider, the local customer remains liable to her own provider for the charges. If she has also already paid the clearer, she can ask the clearer for evidence that it has settled for the usage in question with her local provider. If it can provide the evidence signed by the provider, her liability to her provider is cleared. If it can't provide such evidence, it must return her payment, so she will never have to pay twice.
This all seems very complicated, given it is per flow. However, in practice, settlement will be done in batch between each pair of parties, as providers and clearers all have long-term relationships with their customers. It is only the accounting that is per-flow. Indeed, even the accounting may be done in batches, but each batch will contain per-flow granularity of usage records. Also, it is clear that such reapportionment will only be cost-effective for flows where the value of the communications quality is higher than the cost of reapportionment. Examples would be long-lived flows, or collections of flows between the same ends where premium-quality transmission service is used. Also, it should be recalled that the clearing intermediary is merely a role. This role may be taken by one of the edge ISPs or by one of the edge customers, which removes the need for half the messaging. Essentially, receipts are being traded much in the same way as employees claim travel expenses from their employer. This results in a scenario where the traditional telephone bill becomes an anachronism. Such a bill represents two commercial processes wrapped into one, which we propose should be separate. The first stage is reconciliation of all local usage records, whoever will pay. The second stage is agreement over who pays for which usage record.
9.3 Inter-domain Multicast with Heterogeneous QoS
Illustrating the power of the principles set down so far, we can take an example like multicast with heterogeneous QoS per receiver and show that charging for it with correct apportionment will "just happen", even inter-domain. However, it will only be efficient using the "split-edge pricing" model. For illustration,
let us have two provider networks, with an edge customer of Na sending into a multicast address where the tree crosses into Nb, reaching receivers who are customers of Nb. Nb will have given a price for multicast reception and Na one for sending to an address in the multicast range. The tree may spread to other receivers on other networks too. The receivers in each domain will note when they join the multicast and each start being charged for the traffic they receive at their local price. The sender will be charged at her provider's price. At the domain boundary between Na and Nb, Nb will be charged Na's price for sending to a multicast, while Nb will charge its price to Na for receiving from a multicast. This usage may either be measured exactly at the inter-provider border, calculated statistically by combining customer usage data with multicast routing tables, or simply covered within bulk measurements at the border. The receivers of a particular multicast group may happen to be located so that the multicast tree fans out immediately at its entrance to the network. With heterogeneous QoS per receiver (e.g. RSVP), any message to set up the QoS must emanate from the receiver and can therefore be charged for locally. Again, this can be treated identically at the inter-provider boundary. We believe edge pricing allows enough flexibility to charge differentially for broad ranges of route lengths, because it allows different charges for different administrative domains. Even if a single domain spanned the globe, it could if desired be divided into internal pricing domains to achieve the same effect.
We now go on to discuss how reapportionment of charges between the ends might work in this multicast case. It would be unlikely that anyone would volunteer to pay for all receivers, however many there were, unless they had a prior arrangement with each receiver. For instance, if a conference organiser were offering to pay everyone's communications expenses, part of the charge for each participant to join the conference would most likely cover these costs. Thus, any number of participants could join, but the host would still have all payments covered from income. The host might just allow enough in the conference charge to cover most prices of most providers, and probably make a little profit to cover the risk. Other models might be possible. The host may only agree to pay each participant's charges up to a ceiling. Alternatively, the host may ask for receipts authenticated by ISPs and pay the exact charge of each ISP. Thus, the end-to-end pricing model allows full flexibility for reapportioning charges in both the multicast and unicast cases.
9.4 Phone to Internet Gateway
Others are taking the approach of allowing the telephony charging model to determine that for Internet telephony. Because of the complexity implications of this approach, we suggest the Internet should take advantage of the opportunity for a fresh start. The Internet should reject the complexity of iterative PSTN charging before it becomes endemic. Instead, phone to Internet gateways (PIGs) should be treated exactly as end-systems are treated above. Any apportionment of sender and receiver payments should be dealt with from one end of the Internet to the PIG. That is, end-to-end across the Internet's patch, rather than
end-to-end across both the Internet and the PSTN. This also allows for multicast models on the Internet side (multipoint on the PSTN side remains as complicated as today). Clearing between end parties must never become muddled in with network provider pricing. If this were to happen, the Internet would be forever saddled with interworking with a legacy, even when the legacy had virtually withered and died. It is always better to make the legacy interwork with the new model than the other way round. Note that the end customer will see no difference if they rely on their edge network provider for all Internet telephony charging. This is purely an internal rearrangement between ISPs. However, the customer could make these arrangements herself, if she desired.
Fig. 5. Clearing across a PIG
10 Limitations and Further Work
The models described in this paper only become critical for Internet communications that are usage-charged (e.g. QoS requests). Such cases are rare at present; however, the author is engaged in related work which suggests that the subscription and connect-time charging models for Internet communications are only viable as long as capacity utilisation is low. However, even ultra-lightweight usage-charging [2] is yet to be proven cost-effective; therefore the context that this paper relies on is not at all certain without considerable further work. Also, this paper argues the case for "sender and receiver both pay", but it is not proven. The stability of inter-related ISP policies requires "war-game" simulation as a step towards a proof. Further work is also required to exercise scenarios based on these models, through simulation and prototyping, in order to fully work through the performance and security issues.
11 Conclusions
We have defined a generalised pricing model for any number of interconnected multi-service networks. Each network offers each of its neighbours each class of service at a separate price for each direction of transmission. We call this "split-edge pricing". This model scales naturally to any size of inter-network because all prices depend only on direct neighbours.
We have shown that the common case for apportioning value between the ends of a connectionless communication network is catered for if all users are charged for both sending and receiving. We have also shown that this is the most stable and efficient case, particularly for multicast and aggregation. It should therefore be used as the default in the "split-edge pricing" model.
We have suggested that a new business model would be useful and more efficient to cater for the cases where there is a large discrepancy from this default in terms of value apportionment - large enough for it to be worth the cost of making a balancing transaction. This new model requires a new role in communications markets - an intermediary for end-to-end pricing and clearing. This new role could be conducted by existing ISPs or by customers themselves, but there appears to be considerable added value, making this a viable business in its own right. It appears that this role is a threat to existing ISPs' business. The role is suitable for a purely financial processor using common e-commerce mechanisms, with relatively low costs and the ability to take a share of the surplus value available on top of charge reapportionments. This role relegates edge ISPs to wholesalers for a potentially large class of Internet applications. The intermediary would become the retail face of the Internet in many cases.
Further, we suggest a subtle twist to the recommendation that customers should pay for both sending and receiving. We suggest this should be customary, but that ultimate liability for sending should lie with the sender. Disputes could then quickly be resolved through the end-to-end clearing role. This stems from the unavoidable fact that receivers can't avoid being sent to.
Acknowledgements Richard Gandon (BT International Carrier Services), Martin Tatham, Mike Rizzo, Jérôme Tassel, Steve Rudkin, Chris Russell (BT Research), Jon Crowcroft (UCL).
References
1. Bob Briscoe (BT), The Direction of Value Flow in Multi-service Connectionless Networks, Int'l Conf on Telecommunications & E-Commerce '99, Oct 1999, http://www.labs.bt.com/projects/mware/
2. Bob Briscoe (BT), Lightweight, End to End, Usage-based Charging for Packet Networks, submitted to Openarch 2000, Sep 1999, http://www.labs.bt.com/projects/mware/
3. Frances Cairncross, The Death of Distance: How the Communications Revolution Will Change Our Lives, pub. Harvard Business School Press, ISBN 0875848060, Oct 1997
4. Georg Carle, Felix Hartanto, Michael Smirnow, Tanja Zseby (GMD FOKUS), Generic Charging and Accounting for Value-Added IP Services, Technical Report GloNe-12-98, Dec 1998 (also presented at IWQoS'99 pricing workshop, Jun 1999), http://www.fokus.gmd.de/research/cc/glone/employees/georg.carle/
5. David D Clark (MIT), Combining Sender and Receiver Payments in the Internet, in Interconnection and the Internet, edited by G. Rosston and D. Waterman, Oct 1996, http://diffserv.lcs.mit.edu
6. Ross Finlayson, Discussion at Third Reliable Multicast Research Group meeting, Orlando, FL, and following on the mailing list, Feb 1998, http://www.east.isi.edu/rm/
7. Hugh W. Holbrook and David R. Cheriton (Stanford Uni), IP Multicast Channels: EXPRESS Support for Large-scale Single-source Applications, in proc. ACM SIGCOMM'99, Sep 1999, http://www.acm.org/sigcomm/sigcomm99/papers/session2-3.html
8. ITU, The Direction of Traffic, ITU/TeleGeography Inc, Geneva, 1996, http://www.itu.ch/ti/publications/traffic/direct.htm; in brief, chapter 3, Box 3.1 is extracted on-line: "Accounting rates and how they work", http://www.itu.ch/intset/whatare/howwork.html
9. ITU, Reforming the International Accounting Rate System, http://www.itu.ch/intset/
10. Nadia Kausar, Bob Briscoe and Jon Crowcroft (UCL), A Charging Model for Sessions on the Internet, in proc. IEEE ISCC'99, Egypt, 6-8 Jul 1999, http://www.rennes.enst-bretagne.fr/~afifi/iscc99.html
11. Sean McCreary and kc claffy, How Far Does the Average Packet Travel on the Internet?, CAIDA, 25 May 1998, http://www.caida.org/Learn/ASPL/
12. Jeffrey K MacKie-Mason and Hal Varian (UMich), Some Economics of the Internet, Tenth Michigan Public Utility Conference at Western Michigan University, March 25-27, 1993, http://www.sims.berkeley.edu/~hal/people/hal/papers.html
13. Mark Handley, Van Jacobson, SDP: Session Description Protocol, IETF RFC 2327, Mar 1998, ftp://ftp.nordu.net/rfc/rfc2327.txt
14. S. Blake (Torrent), D. Black (EMC), M. Carlson (Sun), E. Davies (Nortel), Z. Wang (Bell Labs Lucent), W. Weiss (Lucent), An Architecture for Differentiated Services, IETF RFC 2475, Dec 1998, ftp://ftp.nordu.net/rfc/rfc2475.txt
15. Mike Rizzo, Bob Briscoe, Jérôme Tassel, Kostas Damianakis (BT), A Dynamic Pricing Framework to Support a Scalable, Usage-based Charging Model for Packet-switched Networks, in proc. IWAN'99, Springer-Verlag, Jul 1999, http://www.labs.bt.com/projects/mware/
16. Scott Shenker (Xerox PARC), David Clark (MIT), Deborah Estrin (USC/ISI) and Shai Herzog (USC/ISI), Pricing in Computer Networks: Reshaping the Research Agenda, SIGCOMM Computer Communication Review, Volume 26, Number 2, Apr 1996, http://www.statslab.cam.ac.uk/frank/PRICE/scott.ps
17. Tony Speakman, Nidhi Bhaskar, Richard Edmonstone, Dino Farinacci, Steven Lin, Alex Tweedly and Lorenzo Vicisano (cisco), PGM Reliable Transport Protocol Specification, Work in progress: IETF Internet Draft, Jun 1999 (expires Dec 1999), ftp://ftp.nordu.net/internet-drafts/draft-speakman-pgm-spec-03.txt
18. Y. Yemini, A. Dailianas, and D. Florissi, MarketNet: A Market-based Architecture for Survivable Large-scale Information Systems, in proc. Fourth ISSAT International Conference on Reliability and Quality in Design, Aug 1998, Seattle, WA, http://www.cs.columbia.edu/dcc/marketnet/
19. Lixia Zhang (Xerox PARC), Stephen Deering (Xerox PARC), Deborah Estrin (USC/ISI), Scott Shenker (Xerox PARC) and Daniel Zappala (USC/CSD), RSVP: A New Resource ReSerVation Protocol, IEEE Network, Sep 1993, http://www.isi.edu/div7/rsvp/pub.html
20. H. Zimmerman, OSI Reference Model - The ISO Model of Architecture for Open Systems Interconnection, IEEE Transactions on Communications, Vol 28, No. 4, Apr 1980, pp 425-432
21. Chris Zull (Cutler & Co), Interconnection Issues in the Multimedia Environment, Interconnection Asia '97, IIR Conferences, Singapore, Apr 1997, http://www.cutlerco.com.au/core/content/speeches/Interconnection%20Issues/Interconnection Issues.html
Appendix A: Independence of Logical Classes of Service
Here we argue that each class of service can be treated independently of other logical classes of service. Figure 6 attempts to show the split-edge prices for different classes of service between Na and Nb by layering the diagram in the third dimension, "out of the page". Each class of service "layer" is a logically independent inter-network in its own right, as it has its own share of resources and its own inter-network prices. Relating this to currently proposed technology: for integrated services [19], class of service is defined as either best effort, controlled load, or guaranteed service, with the particular flowspecs reserved being dealt with as heterogeneous QoS within a class (Appendix B). For differentiated services (DS) [14], each DS code-point represents a class of service.
Fig. 6. Split-edge pricing per class of service
However, because each network is managed autonomously, there may be disjoint mappings between classes of service in neighbouring networks (as allowed in diffserv). Such a case is shown between Nb and the right-most Nd, which uses one class of service where everyone else uses two. At such a boundary, Wdbs and Wdbr for the merged class appear from Nb's point of view to each be a pair of prices, for each of its classes of service, that happen to be identical. On the other hand, Nb might offer two different Wbdr prices and two Wbds prices to Nd, one for each of Nb's two classes. Nd would just see each pair as a single price that varied depending on the relative proportions of traffic coming from or going to each class. Therefore, even with disjoint class mappings, we need not concern ourselves with more than one class of service at a time.
Appendix B: Split-Edge Pricing Model
There follows a description of the "split-edge pricing" model. From this, one can analyse the surplus (or deficit) income that any party on a network can expect for any flow topology (unicast or multicast). Also, pricing can be different for traffic in each direction, so reverse multicasting (aggregation) is catered for in the model. There is no room here for analysis based on the model, but this is presented in [1]; conclusions are described here, backed by a natural-language rather than mathematical argument.
We make the assumption that differences in the length of routes or the shapes of routing trees through any party's network will not produce cost differentials that are significant enough to be worth measuring and charging for. That is, we assume the "death of distance" [3], or a "black box" network routing assumption where a network's routing is hidden within its interfaces. This means we trust a network or inter-network of federated domains to find the cheapest route without end-system intervention or, put another way, that routing protocols approximate to a competitive market. If the route isn't truly the cheapest, we assume the discrepancy is so minor that the end customer isn't concerned. Border routing policy distorts this considerably, but one of our long-term motivations is to make most border-routing policy redundant, simplifying inter-provider interfaces by using usage-charging instead. This also implies that internal network design inefficiency should be absorbed by providers in their overall pricing. For instance, in current multicast routing, tree stability always takes priority over tree efficiency. Providers are free not to make this decision if they can design better multicast routing algorithms, but there is no need to expose internal cost differentials in external pricing (when they are purely for provider convenience in the first place). This is deliberately unlike any of the scenarios analysed in Herzog et al - the classic work on sharing the cost of multicast. Herzog et al has a scenario where all receivers share the cost equally, but not one where receivers are divided into sets based on provider domains and only charged equally per domain (edge pricing). Nor does it consider heterogeneous QoS. If some degree of distance-based pricing is required, we assume edge-address-based pricing mechanisms can be used without having to
concern the end-systems with the route between addresses. But we believe even this is unlikely. Figure 7 shows a generic topology that will be used to illuminate the analysis. The general model and terminology have already been introduced in Section 5. We explained the scenario of a packet or flow being multicast between interconnected networks of various statuses relative to the one of interest, and we explained the four price weights at any interface. However, we glossed over the details of the model (e.g. not saying what the weights were applied to). We will now correct those omissions.
Fig. 7. Split-edge pricing with heterogeneous QoS
The figure shows the classes of service, Q, set for each branch of the tree. Q may be confined to discrete levels or allowed to take any value from a bounded continuous range, depending on the QoS mechanism. It is assumed that, wherever a packet is duplicated for multicasting, the multiple copies might each have different classes of service. Note that it has not been assumed that the branches all have different classes of service; this depends on the (independent) requirements of the ultimate ends of each branch. There need be no correlation between neighbouring network status and the value of Q for that branch. RSVP is an example of such a heterogeneous QoS scheme (diffserv has the potential to become heterogeneous, e.g. by setting the class of each branch based on the DS bytes in the multicast routing packets or in the IGMP join packets from end systems, or even by using RSVP). The packet or flow being modelled could be data or signalling.
V represents some measure of the volume or size of the service consumed. It might be the amount of data in the packet or in a flow of similar packets for a certain time. It might be the time for which a reservation of a certain size is held. We simplify the model by requiring V to be the same for all branches
of the tree. This is justified because a branch leaving the tree, a packet loss or a network filter can be dealt with as an alteration to the topology, rather than by allowing V to be heterogeneous.
We define the "nominal charge" function that Nb levies for the packet or flow as Cb(V, Q). This is a nominal charge because we will next describe how it is weighted to determine the actual charge for each different type of neighbour. Wbas and Wbar denote the per-direction weightings applied to the charge that Nb applies to Na, as shown at the highlighted interface between these two in Fig. 7. The third digit of the suffix denotes the direction of traffic that the weighting applies to: s being the weight for traffic sent into the provider, r for traffic received from the provider setting the charge. However, Na is also offering service to Nb, so similarly Na weights its charge Ca with weights Wabr and Wabs. The first digit of the suffix of W denotes the network provider setting the charge being weighted. The second digit denotes the type of neighbour network provider to which this weighting applies. Often, we can consider scenarios where many of these price weightings are set to be either equal to each other or zero, but the formulae derived for this model allow fine granularity of price weighting for any scenario we might dream up in future work. We have initially experimented with extreme policies like "sender pays all", but the formulae allow consideration of more subtle scenarios where prices might be slightly unequal in the different directions, perhaps because of asymmetric access technology like xDSL.
The analysis in [1] takes this model and produces formulae for the surplus of each party. Then various scenarios such as "sender pays all" or "senders and receivers pay equally" are exercised to determine whether some policies are more advantageous to providers than others. It is concluded that for intra-provider packets, the revenue is unaffected by whether senders, receivers or both are charged, but which customer(s) contributes to it depends on the direction of the unicast. However, for inter-provider packets where peer providers charge each other as much as any edge customer, if both providers adopt a "receiver pays all" or a "sender pays all" policy, the revenue always ends up moving to the edge provider furthest from the customer paying. As long as each provider has a similar customer mix in terms of senders and receivers, the net effect is that all providers make similar gains (and incidentally they could mutually agree not to charge each other with little change in their income - peering). Charging for only one direction has the benefit of halving the cost of measuring and accounting.
However, if all provider policies start as "sender pays all", Na might notice it has more receivers than senders. Let us assume Na switches to charging receivers only, while Nb continues to charge only senders. A packet in the direction from Nb to Na would result in each provider charging their edge customer for it, but neither charging the other. A packet in the opposite direction would result in neither provider charging their end customers at all, but both nominally charging each other. This would, as one might expect, cause a migration of all
predominant receivers to providers like Nb, and all big senders to Na. This leaves all providers with very little edge revenue, but all just nominally charging each other similar amounts. Clearly this will not be tolerated for long, as the edge customers are exploiting the providers, all the traffic is being inefficiently and expensively funnelled across the inter-provider links, and there is consequent pressure for all providers to reverse their policy. Changing the wholesale pricing policies will make no difference; if the retail policies remain imbalanced, there will be little revenue entering the system. Thus neither all providers charging only for sending nor all charging only for receiving can be stable unless there is some industry-wide agreement on which one to standardise around. If there is such tacit agreement, charging for sending seems preferable, as it avoids the problem of charging for unsolicited receipt. What this teaches us is that any extreme policy where either sending or receiving is offered at a low or zero price encourages instability, simply because local pricing doesn't match true local costs. Therefore, when end customers arrange themselves to their best advantage, providers suffer unless they collectively organise themselves, which is unlikely. On the other hand, all providers charging for both sending and receiving is stable without artificial standardisation. If any provider breaks ranks and charges, say, only senders, receivers will quickly migrate towards it and senders away to the rest of the industry. Thus the most stable arrangement is for send and receive pricing to approximately match costs.
When multicast is analysed, this only strengthens the case for all ends being liable for their use of the network, whether sending or receiving. Here we consider single-source trees for brevity, but the argument is even stronger for shared trees. Today, multicast senders are being charged exorbitant rates, such that only those with income from (or advertising to) large numbers of receivers become customers. In the longer term, "sender pays all" will only be tenable if it is at a rate that reflects the number of receivers. However, when this problem is considered, the trust required to achieve a solution appears to put up a theoretical barrier to "sender pays all". The cost of an inter-domain multicast is spread throughout the tree. The cost to each domain is approximately proportional to the number of branches within that domain. The proximity of the tree's fan-out to its ingress into the domain determines the actual cost per branch. This in turn depends on how meshed the network is, which is usually a characteristic of the design of a whole domain (and if not, the internal design is under the domain's control). Thus, a "leaf domain" (one with only edge receivers and no downstream domains on the tree) can work out its cost for a certain number of receivers in its domain and report it to the upstream domain. This report is effectively a bill presented to the upstream domain, requesting a share of the income from the sender as it trickles down through the domains in the same direction as the tree. This upstream domain can collect similar "bills" from other downstream networks and report the sum onwards upstream, eventually reaching the sender. Thus the head-end provider charges the sender the costs of the whole tree and has to pass on much of this income to meet the downstream "bills". Clearly, any domain can over-report its bill and make a profit.
This is because the sender
cannot determine the topology directly from its receivers without destroying the scalability benefits of multicast, which deliberately hides receiver activity. It seems far simpler to apply the same argument as for unicast above and charge each receiver and the sender. The charges need not be identical for each, but the sender's charge doesn't need to reflect the number of receivers. Where the tree crosses a domain boundary, each domain simply charges the other as one sender or one receiver. This ensures charging is completely distributed. There is not even a need to correlate together the receivers for one tree. All receivers are just usage-charged as they occur, with no need for co-ordination or totalling. As before, if the sender (or one receiver, or even a third party) wants to offer to pay for all receivers, this can be arranged end-to-end, rather than through the networks.
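The following is a minimal executable sketch of the split-edge pricing model as described above; the linear form of the nominal charge function and the weight values are our own assumptions for illustration, not results from [1]:

```python
# Minimal sketch of split-edge pricing. Only the structure (per-neighbour,
# per-direction weights applied to a nominal charge) is taken from the
# model above; the numbers and the linear charge form are assumptions.

def nominal_charge(V, Q):
    """Hypothetical nominal charge: proportional to volume V and class Q."""
    return V * Q

# W[a][b][d]: weight that network a applies to its charge for neighbour b,
# for direction d ('s' = traffic sent into a, 'r' = received from a).
W = {
    "Na": {"Nb": {"s": 1.0, "r": 1.0}},
    "Nb": {"Na": {"s": 1.0, "r": 0.0}},  # e.g. Nb charges Na only for sending
}

def charge(setter, neighbour, direction, V, Q):
    """Actual charge levied by `setter` on `neighbour` for one direction."""
    return W[setter][neighbour][direction] * nominal_charge(V, Q)

# "Only senders pay" across an interface is modelled by zeroing all the
# 'r' weights, as in the policy analysis of this appendix.
print(charge("Nb", "Na", "s", V=100, Q=2))  # 200.0
print(charge("Nb", "Na", "r", V=100, Q=2))  # 0.0
```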
Techniques for Making IP Multicast Simple and Scalable
Radia Perlman 1 and Suchitra Raman 2
1 Sun Microsystems Laboratories, USA
2 Department of Electrical Engineering and Computer Science, University of California, Berkeley
Abstract. This paper describes Root Administered Multicast Addressing (RAMA), a protocol for wide-area IP multicast that scales to a large number of simultaneous groups with topologically distant members. RAMA solves the wide-area "rendezvous" problem by making the root of the distribution tree explicit in the multicast group identifier. This is done by extending the 4-byte IPv4 multicast group address format to an 8-byte address format, (R, G), where R is the unicast address of the root of the multicast distribution tree and G is a group identifier, unique with respect to R and administered by it. Data distribution occurs via a single shared bi-directional tree, allowing scalable operation for multiple senders. RAMA generalizes two recent protocols, Simple Multicast [18] and EXPRESS Multicast [10], into a common protocol that has the desirable features of both. RAMA has several advantages over existing designs for wide-area IP multicast: (i) it is a single protocol that works both within a domain and across domains, (ii) it provides efficient support for multiple-sender as well as single-sender applications, (iii) since group identifiers are allocated and administered locally with respect to the root, there is no need for globally coordinated Internet-wide multicast address allocation. Hence, RAMA does not require a companion heavy-weight multicast address allocation infrastructure. The extended address format also provides a larger number of multicast addresses, mitigating the address availability problem. In this paper, we motivate and describe the RAMA protocol architecture and the engineering issues in developing and deploying it. Extended addressing in RAMA requires changes to end hosts, and we outline the design and implementation of RAMA using IP options for a BSD-based end host operating system. Unfortunately, IP options are not handled efficiently by today's routers, which forward such packets in a slow path, typically in software. To address this issue, we also discuss variants of RAMA that do not require the extended address format.
1 Introduction
Network layer multicast provides an efficient distribution mechanism for delivering data to multiple receivers. Besides providing efficient multi-point delivery, network layer multicast also provides a "group" abstraction in which a sender
of data can refer to a group of receivers, without listing them explicitly. Much past work addresses the problem of extending the unicast IP service model to include multicast. While today's model of IP multicast, originally proposed in [6], provides an elegant and general group abstraction, it defies a scalable wide-area multicast routing protocol. Many of the current designs for wide-area multicast routing protocols suffer from scaling problems when there are large numbers of simultaneous groups whose members are distributed across the Internet. As a result, even though the design and deployment of IP multicast protocols started almost a decade ago, we do not yet have a ubiquitous multicast infrastructure. We first discuss existing designs for multicast routing protocols to understand their scaling properties.
1.1 Flood and Prune Protocols
One class of routing protocols (DVMRP [6] and dense mode PIM [4]) involves "broadcast and prune", where traffic is flooded from the source using reverse path forwarding [2]. Additionally, when a router R receives a multicast message from a source S with destination address G, and R has no neighbors that wish to receive traffic for (S, G), it sends a message in response indicating that the neighbor should "prune (S, G)", i.e., should stop sending traffic for (S, G) to this router. This class of protocols does not support large numbers of groups with topologically distant members, because of two drawbacks:
Too much flooded data. To reach all potential receivers, flood and prune protocols must periodically flood data to reach all parts of the Internet. However, in practice, only a very small portion of the groups would be of interest to a given receiver.
Too much prune state. Each router R must remember all the (S, G) pairs it received from each neighbor (representing all the (S, G) pairs the neighbor is not interested in receiving), in addition to all the (S, G) pairs R has sent prunes for. In other words, the prune state in routers grows in proportion to the number of sources s and the number of groups g that a router is not interested in! A sketch of this state is given below.
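A minimal sketch of this per-router prune state (the data layout is our own illustration; real DVMRP implementations differ):

```python
# Illustrative sketch of dense-mode prune state in one router.
# For every neighbor, the router remembers every (S, G) pair the
# neighbor has pruned; it also remembers the prunes it sent upstream.
prunes_received = {
    "neighbor1": {("10.0.0.1", "224.1.1.1"), ("10.0.0.2", "224.2.2.2")},
    "neighbor2": {("10.0.0.1", "224.1.1.1")},
}
prunes_sent = {("10.0.0.3", "224.3.3.3")}

def should_forward(source, group, neighbor):
    """Forward (S, G) traffic to a neighbor only if it hasn't pruned it."""
    return (source, group) not in prunes_received.get(neighbor, set())

# State grows with the number of uninteresting (source, group) pairs,
# which is the scaling problem described above.
print(should_forward("10.0.0.1", "224.1.1.1", "neighbor1"))  # False
```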
1.2 Explicit Tree Protocols
The other class of protocols (CBT [1], sparse mode PIM [3], and BGMP [12]) explicitly builds a shared tree based on a root C, so that only routers on the distribution path of a multicast group need to keep state about the group. CBT and BGMP create a bi-directional tree, whereas sparse mode PIM creates a uni-directional tree. The shared tree approach is more scalable, since router state does not grow as rapidly as in the dense mode protocols: routing state is no longer maintained for groups that there is no interest in! In addition, sparse mode PIM allows for switching between a state-efficient shared tree and a latency-optimal source-rooted tree. Each router decides if it is
receiving a "sufficiently high volume" of traffic from a particular source, and if so, joins a tree rooted at that source, pruning itself from the shared tree. This dynamic switching between a uni-directional shared tree and per-source trees is complex and has stability problems as per-source trees time out. Also, the root of the uni-directional shared tree becomes a bottleneck.
All the shared tree approaches require the use of periodic announcement messages to locate the rendezvous point, or RP, for a group, i.e., to learn the mapping from the group address G to its RP. In sparse mode PIM, a bootstrap mechanism within a domain advertises candidate RPs, and a hash function maps G to one of the set of candidate RPs. This mechanism does not scale beyond a domain because it is too expensive to do Internet-wide advertisements of the list of candidate RPs. In addition, this mechanism creates highly suboptimal trees if the candidate RP is selected using a hash function from among Internet-wide candidates, rather than being co-located with high-bandwidth senders to optimize data paths.
There have been several recent proposals that specifically address the wide-area IP multicast routing problem. The inter-domain multicast routing protocol BGMP [12] proposes using a shared bi-directional distribution tree among domains, such that any intra-domain protocol (i.e., DVMRP or PIM) can be run within each domain. Routing between BGMP domains requires that multicast address allocation reflect the underlying unicast network topology, or at least provide core location information. This alignment with the unicast routing hierarchy also makes BGMP routing entries aggregatable, resulting in state savings in inter-domain routers. There is less consensus on how such address allocation is to be done in a scalable and deployable manner. Some approaches that have been suggested are:
– Multicast Address Set Claim (MASC) [12], a scheme proposed in conjunction with BGMP for dynamically assigning blocks of multicast addresses to each domain, and using inter-domain unicast routing, e.g., BGP [15], to distribute reachability information. Once R is localized in this manner to a domain, a mechanism such as PIM bootstrap is used to map G to the RP within that domain. While we feel that the shared bi-directional inter-domain tree architecture in BGMP is a scalable distribution mechanism, we are less convinced that the MASC architecture is sufficiently dynamic and free from allocation conflicts, especially in the face of network partitions. If multicast addresses need to be allocated in blocks to domains, either statically or dynamically, multicast addresses will become a scarce resource.
– GLOP addressing [16] is a static assignment of multicast addresses based on unicast domains, in which each domain is assigned 256 multicast addresses. 256 addresses per domain are not sufficient to support anything but a very restricted set of applications (perhaps a few streams broadcast by an ISP). Another scheme [17] assigns class D addresses based on 24 bits of the unicast address space. This scheme cannot be used along with "routing realms" connected to the rest of the Internet via network address translators (NATs)
[14] that do not have any globally assigned unique unicast addresses, and therefore would not have any multicast addresses.

Another proposal to overcome the wide-area rendezvous problem is MSDP [7], the Multicast Source Discovery Protocol. MSDP is a scheme in which tunnels are configured between candidate RPs in various domains. When a source S transmits on group G, knowledge that (S, G) is an active (source, group) pair is flooded throughout all domains. This scheme suffers from severe scaling problems if many sources and groups are active simultaneously.

Recent research has revisited the basic abstraction of a group, since perhaps it was too ambitious and generalized. This has led to the development of at least two independent proposals [18,10] that argue in favor of a modified abstraction for a group that is less general, but affords a scalable wide-area routing protocol. One such scheme is the “Simple Multicast” routing protocol (SM), which proposes extending the multicast address architecture by making end hosts aware of the core router of the multicast distribution tree. SM overcomes the core location problem in the wide area by explicitly distributing the core address at the application layer, along with the 4-byte group address. In this scheme, the group “address” (G) is extended with the unicast address (C) of the core or RP. Hence, the new extended group identifier is (C, G). The additional address bytes may be carried in an IP option or “next header,” following the IP header. The EXPRESS multicast model explicitly names the source of data in the group address. Hence, the group is identified by the 8-byte quantity (S, G), where G is a group identifier with respect to S.

Both SM and EXPRESS mitigate the difficult problem of globally coordinated multicast address allocation by localizing address management to a single node. Since G is unique with respect to the root R of the distribution tree (i.e., the core C in SM, or the source S in EXPRESS), there is no need for a separate address allocation infrastructure. The key difference between SM and EXPRESS is their sender model. At the network layer, EXPRESS supports data delivery from only one source per group, whereas SM preserves this aspect of the existing multicast model by providing support for multiple senders. The designers of the EXPRESS protocol make the strong assumption that the only application requiring wide-area large-scale multicast is IP television. However, existing transport protocols such as RTP [19] and SRM [8] require network layer support for many-to-many communication, and can only achieve this through substantial support from a session-level agent that provides this abstraction over the underlying one-to-many model. For example, scalable timers in RTP, as well as the slotting and damping used by SRM, require the use of a multicast back channel from every receiver. In addition, EXPRESS uses uni-directional source-rooted trees, while SM uses more efficient bi-directional core-based trees. Bi-directional trees better match the requirements of several transport protocols that use scoped transmissions to perform local recovery. Uni-directional trees are ill-suited for this type of operation, since the data is first encapsulated towards the source and subsequently traverses down the distribution tree.
While the concept of optimizing the single-sender data path is attractive, we see that the design choices made in the EXPRESS model are too restrictive. Hence, in our design of RAMA, we combine the desirable features of SM and EXPRESS to provide a flexible network service that performs efficiently for one sender, while not excluding the multiple-sender case. EXPRESS and SM can therefore be viewed as special cases of the more general RAMA protocol. Although it is not intrinsically more difficult to provide bi-directional trees than uni-directional trees, it is a change from existing protocols such as DVMRP and PIM. So, until routers change, forwarding of packets from senders other than the root would be suboptimal.

In this paper, we describe the RAMA protocol and highlight the engineering issues involved in designing the address extensions in routers as well as end hosts. Since current router architectures do not support efficient forwarding of packets with IP options, which are relegated to a slow software path, we also investigate variants of the basic RAMA protocol that are better optimized for today's router architectures: (i) link-local address allocation with hop-by-hop address translation, and (ii) UDP port-extended addresses. We also describe “user-level only” extensions that allow us to deploy the protocol without requiring modifications to the end host operating system.

The rest of this paper is organized as follows. In Section 2 we describe the details of the root-administered multicast addressing protocol. Section 3 discusses the header extensions and how they impact the end host API as well as the router implementation. We present hop-by-hop translation and UDP port-extended addressing in Section 4 as two variants of the basic RAMA scheme that can be used instead to overcome its forwarding inefficiencies. In Section 5 we discuss other aspects of RAMA, e.g., robustness, access control, and the benefits of a bi-directional distribution tree. Finally, we conclude in Section 6.
2 Extended Address Multicast Using RAMA
How would you create a group with an 8-byte address? How would a potential member learn the address?

2.1 Choosing the Root R
We first discuss how the root of the distribution tree is chosen. In practice, most applications have a small number of high-bandwidth sources, e.g., video sources in a videoconferencing or multi-player application, where the core of the tree can be chosen administratively to be located close to these sources. If the assumption behind EXPRESS is valid, i.e., that by far the most important application to support is that of a single content provider, e.g., Internet TV, then choosing a root for a group is obvious: it should be chosen to be the source of the data.1

1 EXPRESS allows only the source to be the root of the tree, and does not allow other session sources to send to the (S, G) tree unless appropriately tunneled. Since RAMA allows bi-directional trees for this application, we accommodate tree-building protocols as in reliable multicast [13,11], and also provide a low-bandwidth back channel, say for questions in a “lecture” type application.
For applications like conference calls, any member of the group, or any router near a member, is a reasonable choice for the root. In the case of a group where none of the members or senders is known in advance, a root could be chosen arbitrarily, and we will do no worse than one chosen algorithmically based on a hash of the address, as is done in PIM.

2.2 Address Allocation
Now that R is selected, it is easy to find G in RAMA. Selection of G is done autonomously by R, since G is unique with respect to R. So the node creating the group (perhaps R) just needs to request R to allocate a group identifier G.

2.3 Discovering a Group to Join
Several mechanisms are available to announce a group to potentially interested receivers. For a private group, such as a conference call, a reasonable mechanism is for the group creator to send e-mail to the invited participants, giving the 8-byte group ID. For longer-lived groups in which anyone would be invited to participate, the group might be advertised on a web page or in a directory such as sdr [9]. For instance, a web server that provides a stream of data might advertise the service on its web page. Perhaps after paying for the service, the client would be told the group address, together with an encryption key if it is using end-to-end security.

2.4 Joining the Group
The receiver joins a Simple Multicast (or EXPRESS) group by issuing a special packet saying “Join (R, G)”. Routers that receive a Join message establish forwarding state for that group, and continue to forward the message toward R. The Join message travels towards the root until it reaches a router that already has forwarding state for the group (R, G). End hosts also perform membership reporting based on the 8-byte address. To survive router failures, and to avoid orphaned membership state due to host failures, forwarding state in the routers is “soft”, i.e., it expires after a certain period of time. To renew their membership with the router, end hosts and on-tree routers periodically retransmit membership report messages. This is in tune with the model of multicast defined a decade ago, which favors receiver-initiated joins.
2.5 Sending to the Group
We now discuss the packet formats in the two root-administered protocols: EXPRESS and SM. In EXPRESS only the root R can multicast data to the group, and the case when R is the source of the data is straightforward. The IP source address is R and the destination address is G; the (R, G) information from the IP header is used to perform lookups for the group in the routers. However, if another node N wishes to transmit, N must either tunnel the packet to R, which will decapsulate and forward it to the group, or N must create another group with N as the root, and somehow alert all members to join (N, G′) as well as (R, G). In RAMA, multiple senders are allowed to transmit to a group. The key challenge is to devise a packet format that allows the source as well as the 8-byte group identifier to be specified in the data packet in a way that is convenient to both end hosts and routers. We propose three alternatives and discuss their relative merits.
3 Header Extensions
The simplest scheme involves extending the IP header to carry the extra 4-byte root address. Note that the header extension only needs to be included when the sender is a node other than the root, since it can be assumed, when the header extension is absent, that the group identifier is (S, G), i.e., that the IP source address is the root. This can either be done with an IP option or a “next header” that includes R. Although the two are functionally identical, some router implementers argue that a “next header” is more efficient to parse in router hardware, since it appears at a fixed position and has a fixed length, whereas an IP option can appear at a variable offset within the IP header, depending on how many other IP options are present. Since IP options are encountered infrequently, many router implementations relegate packets containing IP options to a “slow path” processed in software. The extended header is used only when a node other than R transmits the packet. Otherwise, when R is the source of data, the packet format resembles the EXPRESS data packet and carries no additional option. Hence, the data path is optimized for the common single-sender case, while not disallowing non-root senders. One motivation for allowing non-root senders is to preserve the flexibility of IP multicast as originally conceived, and to allow for efficient multi-sender applications. Another reason is based on experience with transport protocols. The multiple-sender model eases the task of higher layer protocol designers, e.g., in reliable multicast transport, where a multicast back channel is crucial for scalable loss recovery and congestion control.

RAMA requires that end hosts be aware of the root of the distribution tree. We briefly discuss end host implementation strategies for RAMA address extensions. The additional functionality required in the end hosts may be described as follows (a byte-level sketch of the option encoding appears after the list).
1. The multicast application programming interface (API) is extended to allow applications to specify the address of the root node R, in addition to the class D group identifier G. This may be performed in-kernel by changing the socket implementation, or at user level, using the raw IP socket interface.
2. Data packets on the output path from a non-root sender must carry the root address R in an IP option or “next header” (as in IPv6 [5]).
3. On the input data path to a receiving process, data delivery and filtering must be done based on the new (R, G) address format. Once again, this additional packet filtering is easily accomplished by modifying the kernel. Alternatively, it may also be done entirely at user level using the raw IP interface. The user-level solution is less efficient in this case, since the in-kernel IP filtering mechanism is imperfect and may deliver spurious packets to the user-level receiving process, which must then be filtered again based on (R, G).
4. End hosts perform membership reporting based on the new extended group address format (R, G). Again, this may be performed easily either within the kernel or at user level.

Extended address handling may be done by kernel modifications, and we describe these changes in Section 3.1. We consider the UNIX(R)2 operating system and describe the changes required in the case of a BSD-based implementation of the TCP/IP stack. Since new releases of operating system kernels may take a long period of time to become ubiquitous, we have also developed a user-level architecture for the end host extensions for RAMA, which we present in Section 3.2.

2 UNIX is a registered trademark in the United States and other countries, exclusively licensed through X/Open Company, Ltd.
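To make the option layout of item 2 concrete, the following C fragment sketches one plausible byte-level encoding of the root address as an IP option. This is a sketch under stated assumptions: no such IP option type is actually assigned, so the type value below is invented for illustration.

    #include <stdint.h>
    #include <string.h>

    /* Hypothetical option type, chosen for illustration only; no such
       IP option is actually assigned. */
    #define IPOPT_RAMA_ROOT 0x9e

    /* Write the 4-byte root address R as an IP option, padded with
       end-of-option-list bytes to the required 32-bit boundary.
       Returns the number of bytes added to the IP header. */
    static size_t rama_encode_root_option(uint8_t *buf, uint32_t root_be)
    {
        buf[0] = IPOPT_RAMA_ROOT;     /* option type */
        buf[1] = 6;                   /* length: type + length + 4 address bytes */
        memcpy(&buf[2], &root_be, 4); /* root address R, network byte order */
        buf[6] = 0;                   /* padding: end of option list */
        buf[7] = 0;
        return 8;                     /* IP header grows by two 32-bit words */
    }

Since IP options are padded to 32-bit boundaries, the extension costs eight bytes per packet in this encoding, which is only needed for non-root senders.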
3.1 Unix Kernel Implementation
On the sending path, our kernel implementation extends the socket data structure with the root address associated with the RAMA group that the socket is connected to. This causes an IP option to be attached to all packets transmitted using send() or sendto(). The root address in the socket data structure is set using a new socket option via the setsockopt() system call. This is illustrated in Figure 1. On the receiving path, we maintain a list of 8-byte RAMA groups that each socket is subscribed to, and perform membership reporting for them. Incoming multicast datagrams are filtered based on the 8-byte membership list before being passed up to the application. Since this filtering is performed in-kernel, this implementation is more efficient than the user-level architecture we describe in Section 3.2.

Fig. 1. RAMA end host API extensions.
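As an illustration of the sending side, the fragment below drives such a modified kernel through the normal sockets API. IP_RAMA_ROOT is a name and value we invent for the sketch (the paper does not name the new socket option); a stock kernel will reject it with ENOPROTOOPT.

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* Hypothetical socket option a RAMA-modified kernel would define;
       stock kernels will reject it. */
    #define IP_RAMA_ROOT 77

    int rama_send_example(const char *root_ip, const char *group_ip,
                          uint16_t port, const void *data, size_t len)
    {
        int s = socket(AF_INET, SOCK_DGRAM, 0);
        if (s < 0)
            return -1;

        /* Record the root R on the socket; the modified kernel then
           attaches the RAMA option to every outgoing packet (Figure 1). */
        struct in_addr root;
        inet_pton(AF_INET, root_ip, &root);
        if (setsockopt(s, IPPROTO_IP, IP_RAMA_ROOT, &root, sizeof(root)) < 0) {
            close(s);
            return -1;
        }

        struct sockaddr_in grp;
        memset(&grp, 0, sizeof(grp));
        grp.sin_family = AF_INET;
        grp.sin_port = htons(port);
        inet_pton(AF_INET, group_ip, &grp.sin_addr);

        /* send()/sendto() are unchanged; the extension is invisible here. */
        ssize_t r = sendto(s, data, len, 0, (struct sockaddr *)&grp, sizeof(grp));
        close(s);
        return (int)r;
    }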
3.2 User-Level End Host Architecture
Our user-level solution uses the raw IP socket interface to construct outgoing RAMA data packets with the root address R in an IP option. Since raw IP
access under Unix is available only in privileged mode, applications using this user-level solution must run in super-user mode. Our solution, shown in Figure 2, has the following components.

– A user-level library (librama) that invokes the raw IP socket API in the sending and receiving data path. librama in turn provides an extended socket API, and explicitly handles the unicast address of the root. Applications must be re-written to use the data path functions of librama through the extended API that it provides. librama performs subsequent 8-byte filtering on all the packets received through the raw IP interface, since the kernel IP filtering does not consider the root address carried in the option field. Hence, user-level filtering of addresses is a source of inefficiency, since receivers subscribed to (R, G) also have to process packets that are sent to other processes on the same host or LAN that are listening to other (R′, G) groups. This scheme merely provides an easily deployable interim user-level implementation to solve the address extension problem until modified end host systems can be widely deployed.
– A user-level RAMA protocol daemon (ramapd), which performs membership reporting on behalf of all RAMA receivers on the host. We use a simple host RAMA protocol that runs between a receiving process and the user-level membership reporting daemon and is implemented via a local socket. A RAMA receiver subscribes to an (R, G) group through ramapd, which periodically reports this membership information to the upstream multicast router
using the RAMA membership protocol, similar to IGMP, but with 8-byte addresses. ramapd also periodically checks for failed receiver processes and sends a Leave message if the last receiver for a group dies.

Fig. 2. User-level only extensions to the end host.
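The paper's librama builds packets on a raw IP socket in super-user mode. As a rough, unprivileged approximation of its output path, the sketch below uses the standard IP_OPTIONS socket option on an ordinary UDP socket; this is our substitution for illustration, not the authors' implementation, and the option type is the same hypothetical value as before. Receive-side filtering on the full (R, G) pair must still be done in user space, as described above.

    #include <netinet/in.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/socket.h>

    /* Same hypothetical option type as the sketch in Section 3. */
    #define IPOPT_RAMA_ROOT 0x9e

    /* Ask the stock kernel to append our option to every datagram sent
       on this UDP socket, approximating librama's raw-IP output path. */
    static int librama_attach_root(int sock, struct in_addr root)
    {
        uint8_t opt[8] = {0};
        opt[0] = IPOPT_RAMA_ROOT;          /* option type (unassigned; illustrative) */
        opt[1] = 6;                        /* type + length + 4 address bytes */
        memcpy(&opt[2], &root.s_addr, 4);  /* root address R, network byte order */
        /* opt[6] and opt[7] stay 0: end-of-option-list padding */
        return setsockopt(sock, IPPROTO_IP, IP_OPTIONS, opt, sizeof(opt));
    }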
4 Variants of the Basic Scheme
As we have seen in Section 3, using extended addresses on all data packets in RAMA requires either kernel-level changes to end hosts or a user-level implementation that has suboptimal data path performance. In addition, today's routers do not efficiently forward data packets with IP options, since such packets are processed in the router's slow path in software, which results in significantly poorer performance than the hardware fast path used for forwarding regular option-less IP packets. In this section, we outline alternative solutions in which the extended 8-byte format is used to establish forwarding state, but no modifications are required at the end host for the data path.
4.1 Link-Local Addresses with Hop-by-Hop Translation
The link-local scheme uses a 4-byte class D destination address in data packets. The IP destination field acts like a “label” in label switching, or a virtual circuit identifier in ATM. This scheme is shown in Figure 3. Each multicast RAMA router on a LAN is statically configured with a range of class D addresses for use on that LAN as link-local addresses. If R is the upstream multicast router for the IP prefix containing G, then R assigns a link-local address for (R, G) from its share of class D addresses.3 With 28 bits of unique class D addresses, there are 256 million addresses for each local link.

To join (R, G), a node sends “Join (R, G)”. The next router R1 replies “Ack (R, G). Use X1”, where X1 is picked from among the statically allocated block of addresses of R1. Then R1 sends “Join (R, G)” to R2, the next router on the path to R. R2 replies “Ack (R, G). Use X2,” and so on. Each router propagates the Join message towards the root, creating address mappings at each level. The 8-byte end host Join message is sent from the application layer using the raw IP interface. Once the alias X1 is assigned, the application uses this as the group identifier to which datagrams are transmitted. The underlying existing IGMP mechanism is reused for reporting membership to this group.

Data packets transmitted to X1 undergo hop-by-hop address translation. Since the destination address is rewritten by each router, the IP checksum must be recalculated at each hop.4 Once a member has joined (R, G) and is told to use X1, it transmits to X1 and listens to X1 just as in the existing multicast API. When R1 receives a packet for X1, it must overwrite the destination address with X2 in order to forward it towards R2. Non-member senders simply use the 8-byte address format to send data packets. If a router has no forwarding state for the destination (R, G), it simply forwards these packets towards the core. When the packet reaches an “on-tree” router that has forwarding state for (R, G), it is transmitted on its remaining interfaces that are on the bi-directional tree. The first router that sees a data packet destined to (R, G) sends back a Join-Ack message with a 4-byte address to the source, and adds the new 8-byte to 4-byte mapping to its translation tables.
3 As added safety, the routers periodically announce to the ALL-ROUTERS group, with an IP TTL of 1, what address range they have, to detect misconfigurations where they have overlapping address ranges.
4 Routers also perform similar address translation when they see a Join-Ack message going through them. This is to handle the case where a non-member sender's data packet causes a Join-Ack from a non-adjacent router; see the description of the non-member sender case above.
Fig. 3. Hop-by-hop multicast address translation.
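To make the per-hop rewrite concrete, here is a minimal sketch of the per-(R, G) state a router might keep and the forwarding-time translation; the data structures and names are our own assumptions, not the paper's, and the incremental checksum update follows the standard RFC 1624 method.

    #include <stdint.h>

    /* Illustrative per-(R, G) translation state for the link-local scheme. */
    #define MAX_IFACES 8

    struct xlate_entry {
        uint32_t root, group;            /* the 8-byte identifier (R, G) */
        uint32_t in_alias;               /* alias X assigned to us by upstream */
        uint32_t out_alias[MAX_IFACES];  /* aliases handed out per downstream LAN */
    };

    /* Rewrite the destination address for forwarding on interface i, and
       patch the IP header checksum incrementally (RFC 1624), since the
       destination changed and the checksum must be redone at each hop. */
    static void xlate_forward(const struct xlate_entry *e, int i,
                              uint32_t *dst, uint16_t *cksum)
    {
        uint32_t oldd = *dst, newd = e->out_alias[i];
        uint32_t sum = (uint16_t)~*cksum;
        sum += (uint16_t)~(oldd >> 16) + (uint16_t)~(oldd & 0xffff);
        sum += (newd >> 16) + (newd & 0xffff);
        while (sum >> 16)
            sum = (sum & 0xffff) + (sum >> 16);
        *cksum = (uint16_t)~sum;
        *dst = newd;
    }

The incremental update avoids recomputing the checksum over the whole header, which matters if the translation is to run on a router's fast path.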
The main drawback of this scheme is that it relies heavily on the “signaling” phase, i.e., the Join and Join-Ack messages. Since forwarding and address translation depend on the presence of mapping tables, they are susceptible to router failures and route changes. Therefore, this scheme requires special fault handling for these cases.

4.2 Layer 4 Port Allocation
Since G is unique with respect to the core address, we do not need a large number of addresses for each root. We consider a scheme that provides 128 groups per root node. This is a specific allocation strategy that covers the UDP port fields in addition to the IP destination field. We try to leverage router machinery that is used to classify flows based on layer 4 port numbers for QoS purposes. The scheme functions optimally when routers take the fixed-length UDP header into account during switching. If not, spurious forwarding based merely on a portion
of the address occurs. As long as perfect filtering is done at the end hosts based on the destination and port combination, this scheme can be used to avoid extended IP headers.

The idea is to use the IP destination address (4 bytes) and the UDP destination port (2 bytes) to form a group address, which in this RAMA variant will be 6 bytes rather than 8 bytes as in the other RAMA variants. We need 4 bytes to specify R (since it is an ordinary IP address). We cannot put all 4 bytes of R into the IP destination address, because we need to use a class D address. To specify class D we would theoretically use only 4 bits of the destination address. However, in order to allow this variant of RAMA to coexist with other multicast schemes, we use only 1/16 of the class D address space, fixing the remaining 4 bits of the first byte to a constant. This leaves us with 3 bytes we can use in the destination address; we put 3 bytes of R there. Now we have 2 bytes in the UDP destination port. In order to make sure we do not conflict with well-known ports, we set the top bit to 1. That leaves 15 bits, in which we place the final byte of R, leaving 7 bits for R to assign, so that R can be the root of 128 different multicast groups. This layout is shown in Figure 4.

This scheme requires no changes to the end host operating system, since the Unix kernel performs filtering based on the IP destination and UDP port fields for UDP datagrams. We also note that most current multicast applications are written over UDP and do have the UDP ports available for allocation.
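The packing just described is mechanical; the sketch below implements it. Two details are our assumptions, since the text does not fix them: the three high-order bytes of R go into the address and the low-order byte into the port, and the 4-bit constant is an arbitrary value.

    #include <stdint.h>

    #define RAMA_FIXED_NIBBLE 0x5  /* hypothetical constant for the low 4 bits
                                      of the first byte (class D prefix | const) */

    /* Pack root R and a 7-bit group id into a class D destination address
       plus a UDP destination port, per the layout described above. */
    static void rama_pack(uint32_t root, uint8_t gid,
                          uint32_t *dst_addr, uint16_t *dst_port)
    {
        *dst_addr = ((uint32_t)(0xE0u | RAMA_FIXED_NIBBLE) << 24) /* 1110 + constant */
                  | (root >> 8);             /* top 3 bytes of R */
        *dst_port = (uint16_t)(0x8000u       /* top bit set: avoid well-known ports */
                  | ((root & 0xFFu) << 7)    /* final byte of R */
                  | (gid & 0x7Fu));          /* 7-bit group id: 128 groups per root */
    }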
5 Discussion

5.1 Robustness
If the root address R is fixed, a natural question to ask is: what happens if R fails? In the case of a single content provider, where R is the content provider, the group terminates when R fails, as expected. In the case of a conference call, in most cases the chosen root can be assumed to survive for the duration of the call. In cases where extra robustness is required, a second backup group (R′, G′) could be created. Members may join the backup group only when the primary root fails, or may join both groups but only start using the backup group if the primary group fails. In order to detect failures, we rely upon periodic keepalive messages from the root to all the subscribed members.

5.2 Access Control
Although the original IP multicast model allowed any sender to transmit to a group and any receiver to subscribe to a group, there is a growing concern that this is not the desired behavior. Although end-to-end cryptography can be used at a higher layer to authenticate authorized transmitters, it is desirable for the routing infrastructure to assist in limiting transmissions from unauthorized senders. Router assistance for access control saves network bandwidth and also reduces the excess load on receivers caused by processing unauthorized sender traffic.
Fig. 4. Assigning UDP ports for RAMA address extensions.
In EXPRESS, access control is implicit, since R is the only authorized transmitter. If other senders are allowed in EXPRESS, they tunnel to R, and R maintains an access control list from which it decides which sources it will forward to the group. In RAMA, R informs all the routers in the tree of the access control policies for the group by periodically multicasting the allowed and blocked sender lists. If either list grows too large, R can add sources to the lists incrementally, as needed. For instance, initially all senders are allowed; only when an unauthorized sender transmits a packet does the root add its address to the block list. Alternatively, R starts with the policy that only R is allowed, and if S wants to transmit, it has to ask R's permission and be included in the access list that R multicasts before S can transmit. Until that occurs, S can tunnel to R, as in EXPRESS.

5.3 Bi-Directional Trees
Probably the most important property of the multicast distribution tree is the cost to the network to deliver data. A single bi-directional tree into which all members can inject data is no more costly, in terms of data distribution, than n different trees rooted at each source. In fact, the bi-directional tree could be
thought of as n different per-source trees that all happen to share the same links, since once the shared tree is formed, it can be used with any node as the root. One concern with a single shared tree is that the route from a source to a particular destination might be highly suboptimal, for example, when the source N1 and receiver N2 are in domain A, but the path from N1 to N2 must go through domain B, because the core or RP is located in B. Not only would such a path likely be very suboptimal, assuming that inter-domain links are more costly than intra-domain links, but it also introduces what is known as a “third party dependency”, where domain A depends on the nodes in domain B for data flow between members of A. Bi-directional trees reduce the severity of the third party dependency problem. However, they still do not guarantee that the path from a sender does not traverse inter-domain links to reach a receiver within the same domain. In order to ensure that data from a sender does not leave the domain before reaching receivers also located within the same domain, all routers within the domain must agree upon one exit router for each root R. With this rule of having a single exit point from a domain towards each root address R, the path between members of the domain is guaranteed to be intra-domain.
6 Summary
Extended addresses greatly simplify wide-area multicast routing and make it far more scalable. In this paper, we described the key challenges in making root-addressed multicast addressing a reality. We discussed the benefits of embedding the root address of the multicast distribution tree in the group address. Root Administered Multicast Addressing supports the multiple-sender case and uses bi-directional trees to obtain better data distribution. The extended address format in RAMA data packets incurs an extra processing overhead in routers. We address this using alternatives to the extended header protocol that can be processed efficiently in router hardware. Simple Multicast and EXPRESS multicast are special cases of RAMA, which incorporates the desirable features of the individual protocols. EXPRESS restricts itself to the single-sender case, with the sender as the root R of the tree. Although EXPRESS is efficient to implement in today's router architectures, its single-sender model is inflexible for higher layer protocols, especially transport protocols. We also discussed the end host changes needed to handle 8-byte group addresses, a requirement for RAMA and Simple Multicast as well as EXPRESS.
7 Acknowledgements
Cheng-Yin Lee of Nortel, as well as many other members of the IETF, contributed significantly to the design of Simple Multicast.
References

1. Ballardie, T., Francis, P., and Crowcroft, J. Core Based Trees (CBT): An Architecture for Scalable Inter-Domain Multicast Routing. In Proceedings of SIGCOMM '93 (San Francisco, CA, Sept. 1993), ACM, pp. 85–95.
2. Dalal, Y., and Metcalfe, R. Reverse path forwarding of broadcast packets. Communications of the ACM (Dec. 1978).
3. Deering, S., Estrin, D., Farinacci, D., and Jacobson, V. An Architecture for Wide-Area Multicast Routing. In Proceedings of SIGCOMM '94 (University College London, London, U.K., Sept. 1994), ACM.
4. Deering, S., Estrin, D., Farinacci, D., Jacobson, V., Helmy, A., Meyer, D., and Wei, L. Protocol Independent Multicast version 2 Dense Mode Specification, Aug. 1997. Internet Draft.
5. Deering, S., and Hinden, R. Internet Protocol, Version 6 (IPv6) Specification, Dec. 1998. RFC 2460.
6. Deering, S. E. Multicast Routing in a Datagram Internetwork. PhD thesis, Stanford University, Dec. 1991.
7. Farinacci, D., Rekhter, Y., Lothberg, P., Kilmer, H., and Hall, J. Multicast Source Discovery Protocol (MSDP), June 1998. Internet Draft.
8. Floyd, S., Jacobson, V., McCanne, S., Liu, C.-G., and Zhang, L. A Reliable Multicast Framework for Light-weight Sessions and Application Level Framing. In Proceedings of SIGCOMM '95 (Boston, MA, Sept. 1995), ACM.
9. Handley, M., and Jacobson, V. sdr — A Multicast Session Directory. University College London.
10. Holbrook, H., and Cheriton, D. IP Multicast Channels: EXPRESS Support for Large-scale Single-source Applications. In Proceedings of SIGCOMM '99 (Cambridge, MA, Sept. 1999), ACM.
11. Kadansky, M., Chiu, D., Wesley, J., and Provino, J. Tree-based Reliable Multicast (TRAM), Sept. 1999. Internet Draft.
12. Kumar, S., Radoslavov, P., Thaler, D., Alaettinoglu, C., Estrin, D., and Handley, M. The MASC/BGMP Architecture for Inter-domain Multicast Routing. In Proceedings of SIGCOMM '98 (Vancouver, Canada, Sept. 1998), ACM.
13. Li, D., and Cheriton, D. OTERS (On-Tree Efficient Recovery using Subcasting): A Reliable Multicast Protocol. In Proceedings of the 6th IEEE International Conference on Network Protocols (ICNP '98) (Oct. 1998).
14. Lo, J., and Taniguchi, K. IP Network Address (and Port) Translation, June 1998. Internet Draft, expires 6/99.
15. Lougheed, K., and Rekhter, Y. A Border Gateway Protocol (BGP). Cisco Systems and T. J. Watson Research Center, IBM Corp., June 1989. RFC 1105.
16. Meyer, D. Glop Bit Usage. Cisco Systems, 1999. draft-ietf-mboned-glop-bits-00.txt.
17. Ohta, M., and Crowcroft, J. Static Multicast, June 1999. Internet Draft.
18. Perlman, R., Crowcroft, J., Ballardie, T., and Lee, C.-Y. A Design for Simple Low Overhead Multicast, Dec. 1998. Internet Draft (work in progress).
19. Schulzrinne, H., Casner, S., Frederick, R., and Jacobson, V. RTP: A Transport Protocol for Real-Time Applications. Internet Engineering Task Force, Audio-Video Transport Working Group, Jan. 1996. RFC 1889.
Watercasting: Distributed Watermarking of Multicast Media

Ian Brown, Colin Perkins, and Jon Crowcroft

Department of Computer Science, University College London
Gower Street, London WC1E 6BT, UK
{I.Brown, C.Perkins, J.Crowcroft}@cs.ucl.ac.uk
Abstract. We outline a scheme by which encrypted multicast audiovisual data may be watermarked by lightweight active network components in the multicast tree. Every recipient receives a slightly different version of the marked data, allowing those who illegally re-sell that data to be traced. Groups of cheating users or multicast routers can also be traced. There is a relationship between the requirements for the scheme proposed here, the requirements for reliable multicast protocols, and proposed mechanisms to support layered delivery of streamed media in the Internet.
1 Introduction
When discussing multicast in the Internet, the loosely coupled model often causes information providers some disquiet. There are a number of reasons for this, but the one of interest to us is the relative anonymity of the receivers: traffic forwarding and group membership are local issues, and at no point is the complete group membership known, unless an application level protocol is used to provide an approximate list of members (for example RTCP [26]). This anonymity implies that it is a simple matter to eavesdrop on, and record, a media stream in an undetectable manner.

There have been various proposals for protecting multicast data using public key authentication and encryption techniques, together with some suggestions for providing scalable key distribution for such services [2,3]. These schemes rely on the cost of re-transmitting media data being large enough to deter paying customers from re-selling content they have received. However, the Internet provides an environment where miscreants can easily re-transmit data without detection and on a large scale. Even if illegal copies are found, it may be difficult to determine which receiver was the source of those copies. It is therefore necessary to provide a means for tracing illegal copies, to deter would-be thieves.

One way to provide an audit trail of the origin of a copy is to use a technique known as watermarking. This entails embedding additional, virtually undetectable, information in the data to distinguish one copy from another. There is, however, a problem with using watermarking with broadcast or multicast media: the addition of distinguishing marks to the data stream is contrary to the
bandwidth saving of being able to distribute the same data efficiently to multiple recipients. We propose to use lightweight active network components at the branch points in an IP multicast network to modify a media stream, such that each recipient is delivered a unique version of the sourced data in an efficient manner. Our method builds upon some of the mechanisms proposed to provide support for reliable multicast data delivery and for delivery of scalable, layered, streamed multicast data (such as video and audio), and is part of a family of end-to-end services that can make use of a distributed, heterogeneous and dynamic set of filters in a multicast distribution tree. We consider here the typical content provider case: one source sending data to many recipients.

In the next section we describe related work on watermarking and multicast security. Following that, we outline the salient points of our approach and analyse its efficiency and possible threats to its effectiveness. Finally, we note directions for further study.
2 Background

2.1 Watermarking
Typically, in broadcast networks, cryptographic techniques give a sufficient level of protection against dishonest reception. The major problem is legitimate users illegally selling their keys to others. Schemes have been designed to make keys effectively as large as the content they protect, and hence uneconomic to sell [9], or to identify the source of stolen keys [21], even when legitimate users collude to try to cover their trail. These schemes are no protection against retransmission of content rather than keys, but this is not typically a problem, since retransmission of content is difficult and expensive. In the Internet environment, retransmission of content is simple and cheap, so further protection is essential.

Watermarking allows content providers to trace any illegally redistributed data they discover back to a subscriber, by subtly marking the data sent to each user in a way that is difficult to reverse. The simplest watermarking schemes use the low bits in an audio or image file to embed information such as a subscriber ID. These are easily defeated by altering those bits. More complex schemes select a subset of bits to alter using a secret key, or use spread spectrum techniques or complex transformations of the data to make removal of the watermark more difficult.

Anderson and Manifavas describe an ingenious scheme that allows a single broadcast ciphertext to be decrypted to slightly different plaintexts by users with slightly different keys. Unfortunately, the scheme is extremely vulnerable to collusion between users: five or more users can together produce plaintext (or keys for installation in pirate decoders) that cannot be traced. Shamir has pointed out that increasing collusion resistance in all of these schemes requires exponential work from the defender to cost the attacker linearly more effort [1].
Regrettably, this is the only scheme proposed so far that allows efficient marking of data being supplied to large numbers of users. While the others are not hugely computationally intensive, they would not scale well. A content provider would need huge computing resources to be able to watermark data being sent to typical live event audiences. An enormous amount of bandwidth would also be used up sending a different version of the data to each viewer. This motivates our approach, which distributes the processing needed throughout the multicast tree used to deliver the data efficiently.

2.2 Multicast Security
The IP multicast model allows any receiver to become a sender, subject to their ISP's policy and pricing mechanisms. Unlike traditional broadcast networks, the effort needed to re-multicast data is small. Further, any host can receive multicast traffic, with the decision to route data to that host being made close to that host, with no reference to the original source of the data. It becomes clear that, in addition to marking data to trace those who illegally redistribute it, we also need to protect data in transit to prevent unauthorised access by eavesdroppers.

By encrypting packets, we ensure only those possessing the necessary key can access content. This encryption is typically performed at the application level, although in the future it may be integrated into the forwarding mechanisms [15]. Whilst the problem of key management for multicast data is not yet completely solved, proposals to provide this functionality [12] are moving forward and could build on content providers' systems for authenticating users. With rapid rekeying, content providers could remove pirates from groups even more quickly than current pay-per-view systems.

It is possible to limit multicast traffic to a specific region of the network, using administratively scoped addressing [19]. This relies on the border routers of the administrative region being correctly configured to prevent traffic sent to certain address ranges leaking out of the region. It provides an effective means of limiting the flow of traffic if correctly configured, but does not prevent unauthorised reception of data by hosts within the region. It is also difficult to configure and use, although future protocol developments may ease these problems [11].

To summarise, we note that it is almost impossible to limit access to multicast data. We must rely on encryption and good key management to prevent intercepted traffic being decoded, and watermarking to trace authorised users who illegally redistribute content.

2.3 Multicast Loss Characteristics
A number of studies have been conducted into the performance and loss characteristics of the Mbone [10,30]. These have shown a large amount of heterogeneity in the reception quality for multiple receivers in a single session, posing a challenge to designers of resilient multicast streaming protocols [8,22] and reliable multicast transport protocols [16].
The loss signature of a receiver is used by a number of protocols to identify subsets of receivers which belong, at least symptomatically, to shared subtrees. This signature has two components: the temporal pattern of packet loss, and the correlation between the position of a receiver in the multicast distribution tree and the observed loss.

The temporal correlation of packet loss has been noted by a number of authors. Bolot [5] noted that packet loss is not independent (it is more likely that a packet is dropped if the previous packet was also lost) and derived a simple Bernoulli model for such loss. More recent work [31,20] notes that this model is not sufficient in many cases, and that higher-order Markov models are more accurate. Correlation is also noticeable at longer time-scales. For example, Handley [10] and Bolot [5] have noted bursts of loss with a 30 second period (possibly due to router bugs), and the authors have noted similar effects.

Packet loss is also correlated between receivers, such that many receivers see the same patterns of loss [30]. This is clearly due to lossy links within the distribution tree, which cause loss for all leaf nodes below them. Packet loss correlation is therefore a good predictor for the shape of the multicast distribution tree [17]. One significant drawback of these techniques is that whilst loss signatures match the network topology to a fairly high accuracy, they do not allow the topology to be discovered directly, although recent work has shown that it is possible to infer the logical network topology based on packet loss measurements [7,24].

2.4 Reliable Multicast: Router Support
Reliably multicasting a packet to a large group of receivers becomes more efficient if the network acts to ensure reliability. Recently, a number of proposals have been made to add such reliability into the network. One of these, PGM [27], provides a close fit for our requirements for a watermarking scheme. PGM is “a reliable transport protocol for applications that require ordered, duplicate-free, multicast data delivery from multiple sources to multiple receivers”. To achieve reliability, receivers send negative acknowledgements (NAKs), which are reliably propagated up the multicast distribution tree towards the sender, with the aid of the routers. Retransmissions of lost data are provided by the sender or by designated local retransmitters. Two mechanisms are incorporated to prevent NAK implosion: on detecting loss, receivers employ a random backoff delay before sending a NAK, with suppression if a NAK is received from another receiver. In addition, routers which receive duplicate NAKs from multiple downstream links eliminate the duplicates, sending only a single NAK up towards the source. The result is a timely and efficient means by which NAKs can be returned to the source of multicast data, allowing either retransmission of that data or addition of FEC to the stream to ensure reliable delivery.
In addition to providing an efficient NAK delivery and summarisation service, PGM offers a number of end-to-end options to support fragmentation, sequence number ranges, late joins, time-stamps, reception quality reports, sequence number dropout, and redirection. Of interest to us is the sequence number dropout option. This allows the placement of “intermediate application-layer filters” in routers. Such filters allow the routers to selectively discard data packets and convey the resulting sequence number discontinuity to receivers, such that sequencing can be preserved across the dropout, and to suppress NAKs for those packets intentionally discarded. They act as lightweight active network elements, modifying data streams passing through them. The operation of these filters is not defined by PGM. In later sections of this paper, we describe semantics for these filters suitable for watermarking multicast streams.

2.5 Reliable Multicast: Layering and FEC
The use of packet-level forward error correction data to recover from loss is well-known. For every k data packets, n − k FEC packets are generated, for the transmission of n packets over the network. For every transmission group of n packets it is necessary to receive only a subset (with an ideal erasure code, any k of the n packets) to reconstruct the original data. There are a number of means by which these FEC packets may be transmitted. The three primary means are piggy-backing them onto previous packets, sending them as part of the same stream but with a different payload type indicator, or sending them as a separate stream.

Sending FEC packets within the same stream as the original data has the advantage of reducing overheads (routers need only keep state for a single stream), but forces all receivers to receive the FEC data in addition to the original data. If a receiver is not experiencing loss, this is clearly wasteful. Sending the FEC data on a different stream has greater overhead (because routers need to keep state for multiple flows), but allows for greater flexibility. Those receivers which are not experiencing loss do not join the multicast group transporting the FEC stream, and hence do not receive the FEC data; varying amounts of FEC can be supplied, layered over a range of groups, giving different levels of protection; or enhancement layers can be provided: not FEC, but additional data to improve the quality of the stream for those on high capacity links who are not experiencing loss. This use of layered transmission to provide either FEC or differing quality has been studied by a number of authors [28,18] and shown to perform well.
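As a minimal illustration of packet-level FEC, the sketch below uses a single XOR parity packet (n = k + 1); production systems use Reed-Solomon-style erasure codes so that any k of the n packets suffice, but the XOR case shows the mechanics.

    #include <stddef.h>
    #include <stdint.h>

    /* One parity packet protects k equal-length data packets (n = k + 1).
       Any single loss in the transmission group is repairable by XORing
       the survivors. */
    static void fec_make_parity(const uint8_t *data[], size_t k, size_t len,
                                uint8_t *parity)
    {
        for (size_t i = 0; i < len; i++) {
            uint8_t x = 0;
            for (size_t j = 0; j < k; j++)
                x ^= data[j][i];
            parity[i] = x;
        }
    }

    /* Rebuild a single lost data packet from the k - 1 surviving data
       packets plus the parity packet: the same XOR run in reverse. */
    static void fec_recover(const uint8_t *survivors[], size_t cnt, size_t len,
                            uint8_t *lost)
    {
        for (size_t i = 0; i < len; i++) {
            uint8_t x = 0;
            for (size_t j = 0; j < cnt; j++)
                x ^= survivors[j][i];
            lost[i] = x;
        }
    }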
3 Protocol Overview
Given that the loss signature of a receiver corresponds to its position in the network, it should be possible to use this as a simple form of digital watermark. The pattern of degradation in a stream will likely be different for each receiver provided there is a non-zero packet loss rate in the network (see section 2.3). There are four problems with this:
1. A receiver may neglect to send a loss signature back to the sender, escaping notice by the watermarking scheme.
2. Lost packets cause degradation of the delivered stream. A network which drops enough packets to make this watermarking technique successful will likely provide insufficient quality for most uses.
3. A receiver may collude with another receiver to repair the loss, hence defeating the watermarking scheme.
4. A receiver may easily defeat the watermarking scheme by dropping additional packets (possibly transforming the stream to match that received by another receiver).

Ensuring that a receiver returns its loss signature to the sender is clearly an impossible task in the traditional Internet environment with smart end-points and dumb routers. However, if an active network is assumed, it becomes possible for the last-hop router to return a loss signature to the sender. If the installation of this active element forms part of the multicast tree setup procedure, we may ensure that the loss signature of each receiver is returned to the source. The active network elements can also conspire to ensure that all receivers see unique loss patterns, rather than leaving this to chance. Instead of relying on the loss signature of a particular branch in the multicast forwarding tree being unique, the position of a node in the tree can be used to determine which packets to drop, in order to ensure a unique loss pattern for each node.

The proposed use of active network components is not unique to our scheme; a number of reliable multicast protocols have been developed which would benefit from support within the network. This support typically takes the form of filtering, summarisation and subcasting abilities: exactly the requirements for our scheme (see section 2.4).

The assumption of an active network leaves three barriers to the development of an effective watermarking solution: degradation of the stream by packet loss, collusion attacks by multiple receivers to repair the stream, and the ease of breaking the protection by dropping additional packets. These three problems are related, and have a common solution. A typical counter to the problem of packet loss in a multicast network is to add forward-error correction data to a media stream (section 2.5). This allows a stream to be repaired if some fraction of the packets are lost. We modify this approach by sending FEC data which is subtly different to the original data, such that a stream repaired using this FEC data will differ based on the observed loss pattern, but will not be noticeably degraded. This altering of the media stream is typically straightforward, although content specific. It is vital that the set of transformed packets resulting from one packet cannot be used to recreate the original, otherwise a collusion attack could produce a non-watermarked version of the data. Likewise, the watermark must be resistant to a wide range of transforms, such as the introduction of jitter or re-sampling [23].

The active network elements therefore subtract FEC packets. Rather than ensuring a unique loss pattern at each receiver, they ensure a unique pattern of
packets is received. This may be implemented using the PGM sequence number dropout option and application-layer filters, as noted in section 2.4. This solves the quality degradation problem: since some version of each packet is received by each participant, the reception quality is no worse than that provided by the underlying network, although each receiver sees a slightly different stream. Receivers can no longer collude to repair a stream (the result will simply be a combination of their watermarks, enabling identification of the conspirators). Finally, discarding additional packets simply results in a degraded stream with the watermark still present.

The result is a relatively simple means of watermarking multicast data: the source sends multiple subtly different copies of each packet, and routers at the branch points in the network discard packets, such that the stream delivered to each receiver is unique.
4 Implementation Strategy
We have designed an initial protocol to implement watercasting using PGM and slight modifications to multicast tree setup. These could be made using active network code in routers. We hope to refine this protocol after gaining implementation experience.

A client wishing to receive content from a server first performs a unicast authentication with that server. After convincing the server it is a valid subscriber, the client is given a receiver identification key (RIK). This key is supplied to the last-hop router when the receiver joins the session. The last-hop router passes this key, its address, and the time the receiver joined back up the multicast tree to the source. Each router in the tree adds its address, encrypted with the public key of the server to keep the topology secret. This is a slight variation on the current Internet Group Management Protocol MTRACE packet. When the source receives a valid RIK and topology report, it unicasts the current session key(s) for the requested media to the client. The server is therefore able to validate and store the entire tree topology. This information is necessary to allow it to later determine the correct watermark for each receiver, in the event that it becomes necessary to trace an illegal copy of the content.

For a multicast distribution tree with maximum depth d, the source generates a total of n differently watermarked copies of each packet, such that n ≥ d. Each group of n alternate packets is termed a transmission group. On receiving the packets which form a transmission group, a router forwards all but one of those packets out of each downstream interface on which there are receivers. The choice of which packet to discard is made at random on a per-interface basis, with the pseudo-random sequence keyed by the position of the router in the distribution tree and the interface address.

Each last-hop router in the distribution tree will receive n − d_r packets from each transmission group, where d_r is the depth of the route through the distribution tree to this router. Exactly one of these packets will be forwarded
onto the subnet with the receiver(s). The choice of which packet is forwarded is determined pseudo-randomly, keyed with the position of the router in the tree, the interface address, and the receiver identification key.

The filtering process is illustrated in figure 1. In this example, the receiver furthest from the source is R1, and the maximum depth of the distribution tree is d = 5. The source will generate n ≥ 5 distinct versions of each packet to form a transmission group; label these ABCDE. At router 0 these are filtered, passing ABDE to router 00 and ACDE to router 01. At router 01 the packets are filtered again, with ACE being passed to router 010. Since this is the last-hop router before receiver R2, router 010 does not just filter out a single packet; it pseudo-randomly selects one packet, E, to pass to the receiver. A similar process occurs to filter the packets destined for the other receivers.
Fig. 1. Filtering transmission groups to obtain a unique watermark
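A sketch of the per-interface selection a watercasting router might perform is shown below. The paper does not specify the keyed pseudo-random generator, so an FNV-1a hash stands in for it here, and the struct fields merely mirror the keying inputs named above.

    #include <stddef.h>
    #include <stdint.h>

    /* FNV-1a stands in for the keyed pseudo-random sequence, which the
       paper leaves unspecified. */
    static uint32_t fnv1a(const uint8_t *p, size_t len)
    {
        uint32_t h = 2166136261u;
        while (len--) { h ^= *p++; h *= 16777619u; }
        return h;
    }

    struct wc_key {
        uint32_t tree_position;  /* e.g., the path label "010" packed into an int */
        uint32_t iface_addr;     /* downstream interface address */
        uint32_t group_seq;      /* transmission group sequence number */
    };

    /* Index of the packet to discard (or, at a last-hop router, of the
       single packet to forward) among the m packets of the transmission
       group still in hand. Deterministic given the key, so the source
       can later replay the choice when tracing a watermark. */
    static unsigned wc_pick(const struct wc_key *k, unsigned m)
    {
        return fnv1a((const uint8_t *)k, sizeof *k) % m;
    }

Determinism is the point of the keying: as the tracing discussion below explains, the source can re-run these choices from the stored topology to reconstruct which packets each receiver saw.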
The media payload is encrypted. The receiver also receives a decryption key from the source, by the same means that it receives the receiver identification key. Media packet headers are not encrypted, since routers need to use the sequence numbers in the packets to determine which packets to discard. The encryption of the media payload prevents unauthorised receivers from snooping on the packets. The watermark may be used to detect illegal redistribution of the decrypted payload by legitimate receivers.

In effect, the combination of tree topology and receiver identification key is the secret used by the source to do the watermarking. Participating routers should therefore refuse requests to reveal any part of that topology. Even if some routers and clients collude, they would need a conspiracy from a client right up to the source to discover anything useful.

The selective discard function aims to give the multicast routers and their clients the minimum degree of freedom possible, in order to facilitate the later tracing of cheating routers or users. Every router in the tree indelibly affects the
stream by dropping certain packets. A cheating downstream router therefore cannot produce a stream seemingly originating from an upstream router, as it does not have access to all of the packets passing through that router. It would need to collude with all other branches of that router to get copies of all data passing through it. The higher up the tree a router is, the less likely such collusion becomes; the lower down the tree, the easier it is to eliminate targets from an investigation. An upstream router could attempt to impersonate a downstream router or client, but would need to know the topology of the tree to that point to do so effectively. This becomes increasingly difficult as the distance of the point from the cheating router increases, which may be sufficient protection, in that points upstream of a suspected router could be included in any investigation of that router. Alternatively, each router could be given a shared secret by the source at the time it joins the multicast tree, and include that secret in the initialisation of its pseudo-random number generator.

We keep as much of the processing as possible at the source, to simplify the router protocol. For each watercast stream a router is processing, it only needs to store the sequence state that allows it to drop the appropriate packet in each transmission group. We place a heavier burden on the source: it needs to appropriately modify the outgoing stream and inject sufficient redundant packets to maintain quality for all clients, whilst allowing enough packets to be dropped to watermark each client's stream uniquely. The source also needs to store enough information to enable it to later reconstruct the path to a watermark found in a recovered media clip. Because the watermarking algorithms, and the pseudo-random sequence generator, operate in a deterministic manner, this comprises the original data, the topology of the distribution tree, and details of when receivers join and leave.

Given a recovered clip, the server can determine how many redundant packets it was sending out at that time, and hence which transformations were being applied. It can then reduce the clip to a series of packet labels (such as AECDBBABC, using the example in section 4). By simulating the operation of the various network components from the start of the transmission of the original broadcast through to the end of the clip, it can aggressively rule out nodes that could not have produced the clip, because they did not have access to some of the packets present in it. The completion of this process will result in a set of nodes that could have produced the given clip. The length of clip required for this result depends on the multicast tree topology and the pseudo-random sequence generator used. This can vary continuously according to the topology of the multicast tree at the time of transmission of the marked data, something known only to the source. This makes it difficult, even for conspiracies of users and routers, to remove the watermark or alter it to implicate someone else. The probability that the media clip must have originated with a particular receiver increases with the length of the clip.
5 Analysis
The two important effects of watercasting are reducing the number of unique watermarked copies of the data required, and distributing the task of selecting a different sequence of watermarked packets for each recipient. The results of these gains are considered below.

The probability p_v that a particular version, v, of a packet is received, given that the transmission group size is n and the receiver is d hops from the source, is simply

    p_v = [1/(n − d + 2)] · ∏_{i=1}^{d−2} (n − i)/(n − i + 1) = 1/n    (1)
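The second equality holds because the product telescopes, making the result independent of the receiver's depth; writing it out:

    ∏_{i=1}^{d−2} (n − i)/(n − i + 1) = (n − 1)/n · (n − 2)/(n − 1) · … · (n − d + 2)/(n − d + 3) = (n − d + 2)/n

so that p_v = [1/(n − d + 2)] · (n − d + 2)/n = 1/n.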
The probability that multiple receivers receive the same version of a packet depends on the position of those receivers in the distribution tree. The closer those receivers are, that is, the longer the shared path from the source, the more likely they are to receive the same version. This can be offset by increasing the number of packets in a transmission group.

Introducing redundant data into the multicast stream increases the size of the stream by the number of packets per transmission group at the first hop, then one less at the second hop, and so on until the last hop, where the traffic size is the same as for a non-watermarked stream. If n is the group size and d the maximum depth of the tree, this increases the amount of traffic by a factor of

    (2n − d + 3)/2 − n/d    (2)

This is still far less than the traffic that would be generated by unicasting unique versions of each stream to every receiver. If we set n = d, which creates the minimum extra traffic but makes the watermark sequence longer, this factor is

    (d + 1)/2    (3)

At the cost of greater complexity at routers, this figure could be decreased still further. If each router knows the maximum depth of each of its interfaces' subtrees, it need only send the minimum number of redundant packets necessary to each child node. Rather than choosing one packet from each transmission group to drop at random, it selects n − d_i packets to drop, where d_i is the depth of the subtree on that interface. This reduces the extra bandwidth consumed on all subtrees that are shallower than the maximum depth of the tree. It also reduces the scope of attacks by cheating downstream routers, which have fewer versions of each packet to use.

Using a value of n > d allows a tree to grow more easily. At the initial setup of a tree, this is particularly important: as many new members join, continually altering the depth of the tree, the source would otherwise need to constantly
increase n. By setting n to the likely depth of the tree after this phase, the source reduces this update complexity, at the cost of greater bandwidth requirements over the (initially small) tree as it is set up. The source may use knowledge of other likely tree changes, or react to a rapidly changing depth, by discontinuously varying n in the same manner. There is therefore a tradeoff: larger values of n allow greater flexibility in tree setup and reduce the length of the watermark sequence required to trace the originator of data, but require greater processing and bandwidth to create and distribute.

An obvious optimisation would seem to be to watermark only 1 in every x packets, reducing bandwidth and computation requirements by a factor close to x. But because every router in the multicast tree would need to know the location of the watermarked packets to run the watercasting algorithm, it would be impossible to keep this information secret. Unfortunately, if an attacker knew the position of the watermarked packets, she could simply remove them and redistribute the resulting degraded data. While this would be fatal to data such as executable code, it may be acceptable for lossy information such as an audiovisual signal. An information provider must consider the quality of data they are effectively prepared to give away before using this technique.

When the last-hop network is a multi-access subnet, such as an Ethernet, any host on that network can receive the same packet with no extra effort. Non-subscribers on a network can intercept such packets, but do not have the decryption key needed to read them. But two legitimate subscribers on the same sub-network will receive the same watermarked data. We contend that this is a small problem: multi-access sub-networks are typically under the administrative control of a single agency. If one of the users of such a network illegally resold data, it would be traceable to the agency controlling that network, which is sufficient in most cases.

Collaborative conspiracies are always a difficult problem for watermarking schemes. Groups of users can attempt to combine their different watermarked versions of the same piece of data in a way that removes, or at least damages, the watermark. The simplest way to do this is to perform 'bit voting': set each bit in the reconstructed piece of data to the value most prevalent at that bit position across the set of watermarked files. This is usually fatal to simple schemes, and can damage more sophisticated watermarks. The watermarking techniques we use must therefore be able to defeat collaborative and individual attacks such as introducing jitter and re-sampling [23]. But the main contribution of our paper is that an active network can perform part of the watermarking function. Even if the specific transforms we use can be defeated, we hope that it should be possible to simply plug in others more resistant to attacks, preserving the validity of our approach.
6 Conclusions and Future Work
We have outlined a general idea for watermarking media data using active network components, and a specific method to perform that task. Our method leverages schemes such as PGM that are being developed to provide other services, such as reliable multicast and filtering, in active networks.

Traditional communications security systems protect data en route to recipients by encrypting it with a key known only to authorised users. This is effective at preventing eavesdropping on the data in transit, but can do nothing to stop authorised recipients redistributing the data. This is a major problem for 'secrets' such as live sporting events shared between millions of paying customers. As the cost of redistributing data continues to plummet, watermarking is likely to become as essential to information security as cryptography is today.

In conventional watermarking schemes, each recipient gets a different version of the watermarked data to allow any cheating recipients who illegally redistribute the data to be traced. This is computationally expensive for the source, which needs to calculate a large number of unique versions of the data, and requires unicast transmission of the resulting data to clients. Our scheme reduces both loads. The source needs to calculate a far smaller number of versions of each packet of the data, and the routers relieve it of the task of deciding which packet goes to which user. This load is spread thinly throughout the distribution tree. The resulting data can be transmitted efficiently through the network, at a cost in bandwidth over pure multicast related to the depth of the multicast tree rather than the number of recipients. This is particularly important for very large multimedia streams.

Watercasting requires a considerable amount of network complexity compared to content control schemes such as Nark [6]. Nark uses a trusted smartcard to provide the keys needed to decrypt data according to an access policy, and thus scales excellently even with a constantly changing membership. Watercasting is more appropriate for protecting high-value content, where sufficient incentive exists for an attacker to compromise a smartcard.

We are now performing simulations to model the performance of our method. Factors such as the size of transmission groups and the complexity of filtering algorithms will have large effects on the performance of the system, and we intend to use our simulations to fine-tune these parameters. We are also investigating optimisations such as increasing capacity near the multicast root for given sizes of receiver sets and the depth and breadth of the tree. Finally, we intend to build small test networks and distribute audiovisual and other data through them to experimentally verify our scheme and determine its implementation complexity, using these results to further develop the protocol.

The central task of watercasting is to provide evidence. Computer security systems often claim to reduce or obviate the need for legal solutions to a problem by removing it through technical means. Our design instead aims to provide an audit trail through which the illegal distributors of a given piece of data can be traced and prosecuted. The strength of the evidence provided by watercasting is crucial to the ability to mount a successful trial, particularly if no other evidence
is available. We are working with legal researchers to determine how best to fine-tune our scheme to meet this aim. We are also investigating the requirements of content providers: 'fair use' provisions in copyright laws, for example, may reduce their need for the tracing of very short clips.

Watercasting has wide applicability to the protection of any data that is distributed to a large number of people via a network. While we have focussed on audiovisual data, other information such as software could equally be covered with the development of appropriate transforms. Indeed, an appropriate software architecture would allow small pieces of transform code to be plugged in to our system to extend it with minimum effort. The authenticity of such data may be more important to clients than that of a video broadcast. It would be trivial to put a public-key Authentication Header in each packet [14] and so assure clients of the information's integrity and origin. Servers authenticating large amounts of data could use a hybrid scheme combining public-key certificates and k-time signature schemes that allows offline pre-computation of expensive parameters, so that even bursty, lossy and low-latency streams can be authenticated [25].

Watermarking technology is still in its infancy. Petitcolas et al. [23] hoped that their attacks on first-generation algorithms would lead to an improved second generation, and so on. We hope our system is reasonably resistant to the attacks they designed, but we will no doubt see further ones developed. Our design criteria are slightly less robust than those of, for example, the International Federation of the Phonographic Industry (IFPI), who required a watermark that could not be removed or altered "without sufficient degradation of the sound quality as to render it unusable" [13]. Our definition of unusable is not unlistenable or even unsellable, but simply perceptually intrusive enough to justify paying for the original rather than a pirated copy. We plan to use tools developed for measuring subjective audio quality [29] to assess the impact of removing the watermarks we develop.

We also believe that the best use for our system is the transmission of 'live' data. As Barlow observed [4], the value of such transmissions drops rapidly as they age. The live TV rights to a popular sporting event are worth a considerable amount, but this drops dramatically once the game is finished and the result is known. Therefore, even if our watermarking scheme can be defeated, as long as it takes a reasonable amount of time to do so it will have achieved its main objective: to prevent large profits being made from the illegal re-distribution of content.
7 Acknowledgements
Thanks to the IRTF Reliable Multicast Research Group and the network multimedia research group at UCL for discussion that led to some of these ideas. Thanks also to Bob Briscoe and the anonymous referees for their useful comments.
References
1. R.J. Anderson and C. Manifavas. Chameleon – A New Kind of Stream Cipher. Fourth Workshop on Fast Software Encryption, pp. 107–113, January 1997.
2. A. Ballardie. Scalable Multicast Key Distribution. RFC 1949, May 1996.
3. A. Ballardie and J. Crowcroft. Multicast-Specific Security Threats and Counter-Measures. The Internet Society Symposium on Network and Distributed System Security, San Diego, California, February 1995.
4. J. P. Barlow. The Economy of Ideas. Wired 2(3), p. 85, March 1994.
5. J.-C. Bolot and A. Vega-García. The Case for FEC-Based Error Control for Packet Audio in the Internet. ACM Multimedia Systems, 1997.
6. B. Briscoe and I. Fairman. Nark: Receiver-based Multicast Non-repudiation and Key Management. ACM Conference on Electronic Commerce (EC-99), Denver, Colorado, November 1999.
7. R. Cáceres, N. G. Duffield, J. Horowitz, D. Towsley and T. Bu. Multicast-Based Inference of Network-Internal Characteristics: Accuracy of Packet Loss Estimation. IEEE Infocom, New York, March 1999.
8. G. Carle and E. Biersack. A Survey of Error Recovery Techniques for IP-Based Audio-Visual Multicast Applications. IEEE Network, November/December 1997.
9. C. Dwork, J. Lotspiech and M. Naor. Digital Signets: Self-Enforcing Protection of Digital Information. 28th Annual ACM Symposium on the Theory of Computing, Philadelphia, p. 489, May 1996.
10. M. Handley. An Examination of Mbone Performance. USC/ISI Research Report ISI/RR-97-450, April 1997.
11. M. Handley and D. Thaler. Multicast-Scope Zone Announcement Protocol. IETF work in progress, October 1998.
12. T. Hardjono, B. Cain and N. Doraswamy. A Framework for Group Key Management for Multicast Security. IETF work in progress, July 1998.
13. International Federation of the Phonographic Industry. Request for Proposals: Embedded Signalling Systems, Issue 1.0. June 1997.
14. S. Kent and R. Atkinson. IP Authentication Header. RFC 2402, November 1998.
15. S. Kent and R. Atkinson. IP Encapsulating Security Payload. RFC 2406, November 1998.
16. B. Levine and J. J. Garcia-Luna-Aceves. A Comparison of Reliable Multicast Protocols. ACM Multimedia Systems Journal, August 1998.
17. B. Levine, S. Paul and J. J. Garcia-Luna-Aceves. Organizing Multicast Receivers Deterministically by Packet-Loss Correlation. Preprint, University of California, Santa Cruz.
18. S. McCanne, V. Jacobson and M. Vetterli. Receiver-driven Layered Multicast. ACM SIGCOMM '96, Stanford, CA, August 1996.
19. D. Meyer. Administratively Scoped IP Multicast. RFC 2365, July 1998.
20. S. Moon, J. Kurose, P. Skelly and D. Towsley. Correlation of Packet Delay and Loss in the Internet. Technical Report 98-11, Department of Computer Science, University of Massachusetts, Amherst, MA 01003, USA.
21. M. Naor and B. Pinkas. Threshold Traitor Tracing. Advances in Cryptology – CRYPTO '98, Lecture Notes in Computer Science 1462, August 1998.
22. C. Perkins, O. Hodson and V. Hardman. A Survey of Packet Loss Recovery Techniques for Streaming Audio. IEEE Network, pp. 40–49, September/October 1998.
23. F. A. P. Petitcolas, R. J. Anderson and M. G. Kuhn. Attacks on Copyright Marking Systems. Second Workshop on Information Hiding, Portland, Oregon, April 1998.
24. S. Ratnasamy and S. McCanne. Inference of Multicast Routing Trees and Bottleneck Bandwidths using End-to-end Measurements. IEEE Infocom, New York, March 1999.
25. P. Rohatgi. A Hybrid Signature Scheme for Multicast Source Authentication. IRTF work in progress, June 1999.
26. H. Schulzrinne, S. Casner, R. Frederick and V. Jacobson. RTP: A Transport Protocol for Real-Time Applications. RFC 1889, January 1996.
27. T. Speakman, D. Farinacci, S. Lin and A. Tweedly. PGM Reliable Transport Protocol Specification. IETF work in progress, February 1999.
28. L. Vicisano, L. Rizzo and J. Crowcroft. TCP-like Congestion Control for Layered Multicast Data Transfer. IEEE Infocom, San Francisco, March 1998.
29. A. Watson and M. A. Sasse. Measuring Perceived Quality of Speech and Video in Multimedia Conferencing Applications. ACM Multimedia '98, Bristol, England, pp. 55–60, September 1998.
30. M. Yajnik, J. Kurose and D. Towsley. Packet Loss Correlation in the Mbone Multicast Network. IEEE Global Internet Conference, November 1996.
31. M. Yajnik, S. Moon, J. Kurose and D. Towsley. Measurement and Modelling of the Temporal Dependence in Packet Loss. IEEE Infocom, New York, March 1999.
MARKS: Zero Side Effect Multicast Key Management Using Arbitrarily Revealed Key Sequences

Bob Briscoe
BT Research, B54/74, BT Labs, Martlesham Heath, Ipswich, IP5 3RE, England
email: [email protected]
Abstract. The goal of this work is to separately control individual secure sessions between unlimited pairs of multicast receivers and senders. At the same time, the solution given preserves the scalability of receiver initiated Internet multicast for the data transfer itself. Unlike other multicast key management solutions, there are absolutely no side effects on other receivers when a single receiver joins or leaves a session and no smartcards are required. The cost per receiver-session is typically just one short set-up message exchange with a key manager. Key managers can be replicated without limit because they are only loosely coupled to the senders who can remain oblivious to members being added or removed. The technique is a general solution for access to an arbitrary sub-range of a sequence of information and for its revocation, as long as the end of each sub-range can be planned at the time each access is requested.
1
Introduction
This paper presents techniques to maintain an individual security relationship between multicast senders and each receiver without compromising the efficiency and scalability of IP multicast's data distribution. We focus on issues that are foremost if the multicast information is being sold commercially. Of prime concern is how to individually restrict each receiver only to data for which it has paid.

We adopt an approach where the key used to encrypt sent data is systematically changed for each new unit of application data. The keys are taken from a sequence seeded with values initially known only to senders. A key sequence construction is presented where arbitrarily different sub-sequences can be revealed to each receiver by revealing only a small number of intermediate seed values, rather than having to reveal every key in each sub-sequence. Specifically, a maximum of O(log(N)) seeds need to be revealed, once per session, to each receiver in order to reconstruct a sub-sequence N keys long. This should be compared with the most efficient multicast key management solutions to date, which require a message of length O(log(n)) to be multicast to all n receivers every time a
receiver or group of receivers joins or leaves. Further, calculation of each key in the sequence only requires a mean of under two fast hash operations. (Notation is explained in Appendix B.)

In contrast, whenever a receiver is added or removed with the present scheme, there is zero side effect on other receivers. A special group key change isn't required because systematic changes occur sufficiently regularly anyway. No keys are sent over multicast, therefore reliable multicast isn't required. If key managers are delegated to handle requests to set up receiver sessions, the senders can be completely oblivious to any receiver addition or removal. Thus, there is absolutely no coupling back to the senders. In many commercial scenarios (e.g. prepayment) key managers can be stateless, allowing performance to grow linearly with unbounded key manager replication. Resilience of the whole system would also be assured in such scenarios, even in the face of partial failures, due to the complete decoupling of all the elements.

Our thesis is that there are many applications that only rarely, if ever, require premature eviction, e.g. pre-paid or subscription pay-TV or pay-per-view. Thus, we don't present a solution for unplanned eviction, but instead concentrate on the pragmatic scenario of pre-planned eviction, which we believe is a novel approach. Each eviction from the multicast group is planned at each session set-up, but each is still allowed to occur at an arbitrary time. Nonetheless, we briefly describe how the occasional unplanned eviction can be catered for by modular combination with existing solutions, at the expense of some loss of simplicity and scalability.

Four other key sequence constructions are presented in a companion technical report [4], which also presents a mathematical model that encompasses all five schemes and others in the same class (including the one-way function tree (OFT) [14]). Each scheme in the companion report has particular strengths; one is useful for sessions of unknown duration, another multiplies the effective key length against brute force attack (without increasing the operational key-length), and yet another is extremely simple and efficient in terms of message bandwidth but has limited commercial applicability. The scheme chosen for this paper is the simplest, and is secure enough for most commercial scenarios.

In Section 2, we discuss requirements and describe related work on multicast key management and other multicast security issues. In Section 3 we use an example application to put the paper into a practical context and to highlight the scalability advantages of using systematic key changes. In Section 4 we present the key sequence construction that allows different portions of a key sequence to be reconstructed from various combinations of intermediate seeds. Section 5 discusses the efficiency and security of the construction. Section 6 very briefly describes variations on the approach to add other security requirements such as multi-sender multicast, a watermarked audit trail and unplanned eviction, although more detail and a wider literature review can be found in [4]. Finally, limitations of the approach are discussed, followed by conclusions.
2 Background, Definitions, and Requirements
When using Internet multicast, senders send to a multicast group address, while receivers 'join' the multicast group through a message to their local router. For scalability, the designers of IP multicast deliberately ensured that any one router in a multicast tree would hide all downstream join and leave activity from all upstream routers and senders [7]. Thus a multicast sender is oblivious to the identities of its receivers. Clearly any security relationship with individual receivers is impossible if they can't be uniquely distinguished. Conversely, if receivers have to be distinguished from each other, the scalability benefits start to be eroded.

If a multicast sender wishes to restrict its data to a set of receivers, it will typically encrypt the data at the application level. End-to-end access is then controlled by limiting the circulation of the key. A new receiver could have been storing away the encrypted stream before it joined the secure session. Therefore, every time a receiver is allowed in, the key needs to be changed (termed backward security [14]). Similarly, after a receiver is thrown out or requests to leave, it will still be able to decrypt the stream unless the key is changed again (forward security). Most approaches work on the basis that when the key needs to be changed, every receiver will have to be given a new key. Continually changing keys clearly has messaging side effects on all the receivers other than the one joining or leaving.

We define a 'secure multicast session' as the set of data that a receiver could understand, having passed one access control test. If one key is used for many related multicast groups, they all form one secure session. If a particular receiver leaves a multicast group then re-joins, but could have decrypted the information she missed, the whole transmission is still a single secure session.

We envisage very large receiver communities, e.g. ten million viewers for a popular Internet pay-TV channel. Even if just 10% of the audience tuned in or out within a fifteen-minute period, this would potentially cause thousands of secure joins or leaves per second.

We use the term 'application data unit' (ADU) as a more general term for the minimum useful atom of data from a security or commercial point of view. The ADU equates to the aggregation interval used in [6], and has also been called a cryptoperiod when measured in units of time. ADU size is application- and security-scenario dependent. It may be an initialisation frame and its set of associated 'P-frames' in a video sequence, or it may be ten minutes of access to a network game. Note that the ADU from a security point of view can be different from that used at a different layer of the application. ADU size can vary throughout the duration of a stream dependent on the content.

ADU size is a primary determinant of system scalability. If a million receivers were to join within fifteen minutes, but the ADU size was also fifteen minutes, this would only require one re-key event. However, reduction in re-keying requirements isn't the only scalability issue. In the above example, a system that can handle a million requests in fifteen minutes still has to be provided, even if its output is just one re-key request to
the senders. With just such scalability problems in mind, many multicast key management architectures introduce a key manager role as a separate concern from the senders. This deals with policy concerns over membership and isolates the senders from much of the messaging traffic needed for access requests.

2.1 Related Work
Ballardie suggests exploiting the same scalability technique used for the underlying multicast tree, by delegating key distribution along the chain of routers in a core-based multicast routing tree [12]. However, end-to-end security suffers from the complexity of requiring edge customers to entrust their keys to many intermediate network providers, which requires a long chain of security associations.

The Iolus system [15] sets up a similar distribution hierarchy, but one involving only trusted end-systems. These gateway nodes partition re-keying side effects by decrypting and re-encrypting the stream, localising sub-group keys. This introduces a latency burden on every packet in the stream and requires strategically placed intermediate systems to volunteer their processing resource.

An alternative class of approaches involves a single key for the multicast data, but a hierarchy of keys under which to send out a new key over the same multicast channel as the data. These approaches involve a degree of redundant re-keying traffic arriving at every receiver in order for the occasional message to arrive that is decipherable by that receiver. The state of the art in this class is [6]. The group members are arranged as the leaves of a binary tree with the group session key at the root. Two auxiliary keys are assigned per layer of the tree. If each member is assigned a different user identity number (UID), this effectively assigns a pair of auxiliary keys to each bit in the UID space. The first of each pair is given to users with a 1 at that bit position in their UID, and the other when there is a 0. When a single member leaves, a new group session key is randomly generated and multicast encrypted with every auxiliary key in the tree except those held by the leaving member. This guarantees (aside from the reliability of the multicast) that every remaining member will get at least one message they can decrypt (a concrete sketch follows below). A variant recognises the potential for aggregation of member removals if many occur within the timespan of one ADU. The group session key is multicast to the group multiple times, each encrypted with different logical combinations of the auxiliary keys, in order to ensure all members but the leaving ones can decrypt at least one message. Finding this minimised set has the same solution as the familiar problem of reducing the number of logic gates and inputs in the field of logic hardware design.

Wong et al. [17] take an approach that is a generalisation of Chang et al., analysing key graphs as a general case of trees. They find a tree of degree four, rather than binary, is the most efficient for large groups.

The standardised approach to pay-TV key management also falls into this class [13]. A set of secondary keys is created and each receiver holds a sub-set of these in tamper-resistant storage. The group key is also unknown outside the tamper-resistant part of the receiver. In case the group key becomes compromised, a new one is regularly generated and broadcast multiple times under different secondary keys to ensure the appropriate receivers can re-key.
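For illustration, here is a minimal sketch (our reconstruction, not code from [6]) of the per-UID-bit selection rule just described: on an eviction, the new group session key is sent under exactly those auxiliary keys the leaver does not hold, one per UID bit. The function encrypt_and_multicast() and the integer key handles are hypothetical stand-ins for the real cryptography and I/O.

    #include <stdio.h>

    #define BITS 4                       /* width of the UID space          */

    /* aux[j][b]: auxiliary key for value b of UID bit j (opaque handles).  */
    static int aux[BITS][2];

    /* Hypothetical stand-in: encrypt newkey under k and multicast it.      */
    static void encrypt_and_multicast(int k, int newkey) {
        printf("send E(aux=%d, newkey=%d)\n", k, newkey);
    }

    /* Re-key after `leaver` departs: use only the auxiliary keys the
     * leaver does not hold, i.e. the complement of each of its UID bits.   */
    static void rekey_on_leave(unsigned leaver, int newkey) {
        for (int j = 0; j < BITS; j++) {
            int b = (leaver >> j) & 1;
            encrypt_and_multicast(aux[j][1 - b], newkey);
        }
    }

    int main(void) {
        for (int j = 0; j < BITS; j++) { aux[j][0] = 2*j; aux[j][1] = 2*j + 1; }
        rekey_on_leave(0xB /* UID 1011 */, 42);   /* 4 messages, none of    */
        return 0;                                 /* them readable by 1011  */
    }

Every remaining member differs from the leaver in at least one UID bit, and therefore holds at least one of the auxiliary keys used.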
All work in this class of approaches uses multicast itself as the transport to send new keys. As 'reliable multicast' is still to some extent a contradiction in terms, all such approaches have to allow for some receivers missing the occasional multicast of a new key due to localised transmission losses. Some approaches include redundancy in the re-keying to allow for losses, but this reduces their efficiency and increases their complexity. Others simply ignore the possibility of losses, delegating the problem to a choice of a sufficiently reliable multicast scheme.

The Nark scheme [3] falls into the same class as the present work, because the group key is systematically changed for each new ADU in a stream. However, unlike with the present approach, a smartcard happens to be required to give non-repudiation of delivery and latency, so its presence can also be exploited to control which keys in the sequence to reveal. Each receiver has a proxy of the sender running within her smartcard, so all smartcards can be sent one primary seed for the whole key sequence. The proxy on the smartcard then determines which keys to give out depending on the policy it was given by the key manager when the receiver set up the session. The present paper shows how to construct a key sequence such that it can be partially reconstructed from intermediate seeds, thus removing the need for a smartcard if non-repudiation is not a requirement.

Beyond the requirement we focus on, two taxonomies of multicast security requirements [2,5] include many other possible combinations of security requirements for multicast. It is generally agreed that a modular approach is required to building solutions for combined requirements, rather than searching for a single monolithic 'super-solution'. Later, as examples of this modular approach, we show how a number of variations can be added to our basic key management schemes to achieve a selection of the more commercially important requirements.
3 Sender-Decoupled Architecture
We now describe a large-scale network game scenario to explain why systematic key changes allow sender decoupling, giving the scalability benefits asserted in the introduction. This motivates the need for key sequences that can initially be built from a small number of seeds. Use of a practical example also clarifies why it must be possible to reveal arbitrary portions of the key sequence to different customers. This motivates the need for reconstruction of any sub-range of the key sequence, also from a small number of intermediate seeds.

We deliberately choose an example where the financial value of an ADU (defined in Section 2) doesn't relate to time or data volume, but only to a completely application-specific factor. In this example, participation is charged per 'game-minute', a duration that is not strictly related to real-time minutes, but is defined and signalled by the game time-keeper. The game consists of many virtual zones, each moderated by a different zone controller. The zone controllers provide the background events and data that bring the zone to life. They send this data encrypted on a multicast address per zone, but the same ADU index, and hence key, is used at any one time in all zones. Thus the whole game is
one single 'secure multicast session' (defined in Section 2) despite being spread across many multicast addresses. Players can tune in to the background data for any zone as long as they have the current key. The foreground events created by the players in the zone are not encrypted, but they are meaningless without reference to this background data.

Fig. 1 shows only the data flows relevant to game security, and only those once the game is in progress, not during set-up. Clearly all players are sending data, but the figure only shows encrypting senders, S – the zone controllers. Similarly, only receivers that decrypt, R, are shown – the game players. A game controller sets up the game security (not shown but described below). Key management operations are delegated to a number of replicated key managers, KM, that use secure Web server technology.

The key to the secure multicast session is changed every game-minute (every ADU) in a sequence. All encrypted data is headed by an ADU index in the clear, which refers to the key needed to decrypt it. After the set-up phase, the game controller, zone controllers and key managers hold initial seeds that enable them to calculate the sequence of keys to be used for the entire duration of the game.

Game Set-Up

1. The game controller (not shown) unicasts a shared 'control session key' to all KM and S after satisfying itself of the authenticity of their identity. The easiest way to do this is for all S and KM to run secure Web servers, so that the session key can be sent to each of them encrypted with each public key using client-authenticated secure sockets layer (SSL) communications [8]. The game controller also notifies all KM and S of the multicast address it will use for control messages, which they immediately join.
2. The game controller then generates the initial seeds to construct the entire key sequence and multicasts them to all KM and all S, encrypting the message with the control session key and using a reliable multicast protocol suitable for the probably small number of targets involved.
3. The game is announced in an authenticated session directory announcement [9], regularly repeated over multicast (not shown). The announcement protocol is enhanced to include details of key manager addresses and the price per game-minute. Authenticated announcement prevents an attacker setting up spoof payment servers to collect the game's revenues. The key managers as well as the receivers listen to this announcement in order to get the current price of a game-minute.

Receiver Session Set-Up, Duration, and Termination

1. A receiver that wishes to pay to join the game, having heard it advertised in the session directory, contacts a KM Web server requesting a certain number of game-minutes using the appropriate form. This is shown as 'unicast setup' in Fig. 1. R pays the KM the cost of the requested game-minutes, perhaps paying in some form of e-cash or in tokens won in previous games. In return,
KM sends a set of intermediate seeds that will allow R to calculate just the sub-range of the key sequence that she has bought. The key sequence construction described in the next section makes this possible efficiently. All this would take place over SSL, with only KM needing authentication, not R.
2. R generates the relevant keys using the intermediate seeds she has bought.
3. R joins the relevant multicasts determined by the game application, one of which will always be the encrypted background zone data from one S. R uses a key from the sequence calculated in the previous step to decrypt these messages, thus making the rest of the game data meaningful.
4. Whenever the time-keeper signals a new game-minute (over the control multicast), all the zone controllers increment their ADU index and use the next key in the sequence. They all use the same ADU index. Each R notices that the ADU index in the messages from S has been incremented and uses the appropriate next key in the sequence.
5. When the game-minute index approaches the end of the sequence that R has bought, the application gives the player an 'Insert coins' warning before she loses access. The game-minutes continue to increment until the point is reached where the key required is outside the range that R can feasibly calculate. If R has not bought more game-minutes, she has to drop out of the game.

Fig. 1. Key management system design
This scenario illustrates how senders can be completely decoupled from all receiver join and leave activity as long as key managers know the financial value of each ADU index or the access policy to each ADU through some prearrangement. There is no need for any communication between key managers and senders during the session. Senders certainly never need to hear about any
receiver activity. If key managers need to avoid selling ADUs that have already been transmitted, they merely need to synchronise with the changing stream of ADU sequence numbers from the senders. In the example, key managers synchronise by listening in to the multicast data itself. In other scenarios, it may be possible for synchronisation to be purely time-based, either via explicit synchronisation signals or implicitly by time-of-day synchronisation. In yet other scenarios (e.g. multicast distribution of commercial software), the time of transmission may be irrelevant. For instance, the transmission may be regularly repeated, with receivers being sold keys to a part of the sequence that they can tune in to at any later time.

In this example, pre-payment is used to buy seeds. This ensures the key managers hold no state about their customers, which means they can be replicated without limit, as no central state repository is required (as would otherwise be the case if seeds were bought on account and the customer's account status needed to be checked). Thus performance can grow linearly with key manager replication, and system resilience is independent of key manager resilience.
4 Key Sequence Construction
The following notations are used:
– b(s) denotes a function that blinds the value of s. That is, a computationally limited adversary cannot find s from b(s). An example of a blinding or one-way function is a hash function such as MD5 [11] or the standard Secure Hash 1 [16]. Good hash functions typically require only lightweight computational resources. Hash functions are designed to reduce an input of any size to a fixed-size output; in all cases, we will use an input that is already the same size as the output, merely using the blinding property, not the size-reduction property, of the hash.
– r(s) is any computationally fast one-to-one function that maps from a set of input values to itself. A circular (rotary) bit shift is an example of such a function.

4.1 Binary Hash Tree (BHT)
The binary hash tree requires two blinding functions, b0(x) and b1(x), to be well-known. We will term these the 'left' and the 'right' blinding functions. Typically they could be constructed from a single blinding function, b(), by applying one of two simple one-to-one functions, r0() and r1(), before the blinding function. As illustrated in Fig. 2, b0(s) = b(r0(s)) and b1(s) = b(r1(s)). For instance, the first well-known blinding function could be a one-bit left circular shift followed by an MD5 hash, while the second blinding function could be a one-bit right circular shift followed by an MD5 hash. Other alternatives might be to precede one blinding function with an XOR with 1, or a concatenation with a well-known word. It seems advantageous to choose two functions
Fig. 2. Two blinding functions from one
that consume minimal but equal amounts of processor resource, as this balances the load in all cases and limits susceptibility to covert channels that would otherwise arise because the level of processor load would reveal which function was being executed. Alternatively, for efficiency, two variants of a hash function could be used, e.g. MD5 with two different initialisation vectors. However, it seems ill advised to tamper with tried-and-tested algorithms.

The key sequence is constructed as follows:
1. The sender randomly generates an initial seed value, s(0, 0). As a concrete example, we will take its value as 128 bits wide.
2. The sender decides on the required maximum tree depth, D, which leads to a maximum key sequence length N0 = 2^D before a new initial seed is required.
3. The sender generates two first-level intermediate seed values, applying respectively the 'left' and the 'right' blinding functions to the initial seed:

       s(1, 0) = b0(s(0, 0));  s(1, 1) = b1(s(0, 0)).

   The sender generates four second-level intermediate seed values:

       s(2, 0) = b0(s(1, 0));  s(2, 1) = b1(s(1, 0));  s(2, 2) = b0(s(1, 1));  s(2, 3) = b1(s(1, 1)),

   and so on, creating a binary tree of intermediate seed values to a depth of D levels. Formally, if s(d, i) is an intermediate seed that is d levels below the initial seed s(0, 0),

       s(d, i) = bp(s(d − 1, ⌊i/2⌋)), where p = i mod 2    (1)

   (see Appendix B for notation).
4. The key sequence is then constructed from the seed values across the leaves of the tree. Strictly, the stream cipher in use may not require 128b keys, in which case a shorter key may be derived from the leaf seeds by truncation of the most (or least) significant bits, typically to 64b. The choice of stream cipher is irrelevant as long as it is fast and secure. That is, if D = 5,

       k0 = s(5, 0);  k1 = s(5, 1);  ...  k31 = s(5, 31).

   Formally,

       ki = s(D, i)    (2)
5. The sender starts multicasting the stream, encrypting ADU0 with k0, ADU1 with k1, etc., but leaving at least the ADU sequence number in the clear.
6. If the sender delegates key management, it must privately communicate the initial seeds to the key managers.

A receiver reconstructs a portion of the sequence as follows:
1. When a receiver is granted access from ADUm to ADUn, the sender (or a key manager) unicasts a set of seeds to that receiver (e.g. using SSL). The set consists of the intermediate seeds closest to the tree root that enable calculation of the required range of keys without enabling calculation of any key outside the range. These are identified by testing the indexes, i, of the minimum and maximum seed, using the fact that an even index is always a 'left' child, while an odd index is always a 'right' child. A test is performed at each layer of the tree, starting from the leaves and working upwards. A 'right' minimum or a 'left' maximum always needs revealing before moving up a level. If a seed is revealed, the index is shifted inwards by one seed. To move up a layer, the minimum and maximum indexes are halved, with the maximum rounded down. The odd/even tests are repeated on the new indexes, revealing a 'right' minimum or 'left' maximum as before. The process continues until the minimum and maximum cross or meet. They can cross after either or both have been shifted inwards. They can meet after they have both been shifted upwards, in which case the seed where they meet needs revealing before terminating the procedure. This procedure is described more formally, in C-like code, in Appendix A; a reconstruction is sketched at the end of this section.
2. Clearly, each receiver needs to know where each seed that it is given resides in the tree. The seeds and their indexes can be explicitly paired when they are revealed. Alternatively, to reduce the bandwidth required, the protocol may specify the order in which seeds are sent, so that each index can be calculated implicitly from the minimum and maximum index and the order of the seeds. This is possible because there is only one minimal set of seeds that allows re-creation of any one range of keys. Each receiver can then repeat the same pairs of blinding functions on these intermediate seeds, as the sender did, to re-create the sequence of keys km to kn (Equations 1 and 2).
3. Any other receiver can be given access to a completely different range of ADUs by being sent a different set of intermediate seeds.

The creation of a key sequence with D = 4 is graphically represented in Fig. 3. As an example, we circle the relevant intermediate seeds that allow one receiver to re-create the key sequence from k3 to k9. The seeds and keys that remain blinded from this receiver are shown on a grey background. Of course, a value of D greater than 4 would be typical in practice. Note that each layer can be assigned an arbitrary value of d as long as it uniquely identifies the layer. Nothing relies on the actual value of d or D.
Fig. 3. Binary hash tree
Therefore it is not necessary for the sender to reveal how far the tree extends upwards, thus improving security.

Often a session will have an unknown duration when it starts. Clearly, the choice of D limits the maximum length of key sequence from any one starting point. The simplest work-around is just to generate a new initial seed and start a new binary hash tree alongside the old if it is required. If D is known by all senders and receivers, a range of keys that overflows the maximum key index, 2^D, will be immediately apparent to all parties. In such cases it would be sensible to allocate a 'tree id' for each new tree and specify this along with the seeds for each tree.
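To make the construction concrete, the following self-contained sketch implements Equations (1) and (2) and the seed-revelation procedure; it is our reconstruction from the prose above, not the code of Appendix A. The toy_hash() mixing function is a stand-in for MD5/SHA-1 (an assumption for illustration; it is not one-way); everything else follows the text. Running it on the Fig. 3 example (D = 4, keys k3 to k9) reveals s(4,3), s(3,4) and s(2,1).

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define SEED_BYTES 16                          /* 128-bit seeds        */
    typedef struct { uint8_t b[SEED_BYTES]; } seed_t;

    /* Toy stand-in for MD5: a few add-rotate-xor rounds. NOT one-way.     */
    static seed_t toy_hash(seed_t s) {
        uint64_t x, y;
        memcpy(&x, s.b, 8); memcpy(&y, s.b + 8, 8);
        for (int r = 0; r < 6; r++) {
            x += 0x9e3779b97f4a7c15ULL ^ y;
            x = (x << 13) | (x >> 51);
            y = ((y << 7) | (y >> 57)) ^ x;
        }
        memcpy(s.b, &x, 8); memcpy(s.b + 8, &y, 8);
        return s;
    }

    /* b0 = hash(1-bit left circular shift), b1 = hash(1-bit right circular
     * shift), as in Fig. 2; p selects 'left' (0) or 'right' (1).           */
    static seed_t blind(seed_t s, int p) {
        seed_t r;
        for (int i = 0; i < SEED_BYTES; i++) {
            int lo = (i + 1) % SEED_BYTES;               /* next byte      */
            int hi = (i + SEED_BYTES - 1) % SEED_BYTES;  /* previous byte  */
            r.b[i] = p ? (uint8_t)((s.b[i] >> 1) | (s.b[hi] << 7))
                       : (uint8_t)((s.b[i] << 1) | (s.b[lo] >> 7));
        }
        return toy_hash(r);
    }

    /* Equation (1): s(d,i) = b_p(s(d-1, floor(i/2))), p = i mod 2.        */
    static seed_t bht_seed(seed_t s00, int d, unsigned i) {
        if (d == 0) return s00;
        return blind(bht_seed(s00, d - 1, i / 2), i % 2);
    }

    /* Equation (2): k_i = s(D,i).                                         */
    static seed_t bht_key(seed_t s00, int D, unsigned i) {
        return bht_seed(s00, D, i);
    }

    /* Reveal the minimal seed set covering k_min..k_max (key manager).    */
    static void reveal(int d, unsigned i) { printf("reveal s(%d,%u)\n", d, i); }

    static void reveal_range(int D, unsigned min, unsigned max) {
        for (int d = D; ; d--) {
            if (min > max) return;                    /* crossed: done      */
            if (min == max) { reveal(d, min); return; } /* met: reveal, stop */
            if (min % 2 == 1) reveal(d, min++);       /* 'right' minimum    */
            if (max % 2 == 0) reveal(d, max--);       /* 'left' maximum     */
            if (min > max) return;                    /* crossed after shift */
            min /= 2; max /= 2;                       /* up a layer         */
        }
    }

    int main(void) {
        seed_t s00 = {{0}};                  /* in reality: 128 random bits */
        seed_t k3 = bht_key(s00, 4, 3);      /* first key of the Fig. 3 range */
        printf("k3[0] = %02x\n", k3.b[0]);
        reveal_range(4, 3, 9);               /* -> s(4,3), s(3,4), s(2,1)  */
        return 0;
    }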
5 Discussion

5.1 Storage and Processing Costs
The general approach is to use a small number of seeds to generate a larger number of keys, both at the sender before encryption and at the receiver before decryption. In either case, there may be limited memory capacity for the key sequence, which appears to require exponentially more memory than the seeds. We will now show that the tree construction requires minimal memory and minimal processing at either the sender or the receiver as each new key in the sequence is calculated. We assume the keys are used sequentially, and that once a key has been used it will never be required again. After this we will discuss the trade-offs between storage and processing that key managers may make, given that they have to be able to serve seeds from arbitrary points in the future tree at any time.

For senders and receivers using the BHT, it is most efficient to store only the seeds on the branch of the tree from a root to the key following the one currently in use. Note that there may be multiple roots, particularly for receivers, where each revealed seed is a root. In practice this principle translates into being
able to deallocate the memory for a parent seed as soon as it has been hashed to produce its right child. If leaf seeds are also deallocated as soon as the next in the sequence is in use, the tree will only ever hold log(N) seeds in memory, on top of any revealed seeds being held to generate the rest of the tree to the right of the current key.

Re-using the earlier example in Fig. 3, we will now follow the key calculation sequence step by step. For brevity we will treat keys as synonymous with their corresponding leaf seeds:
1. s(4, 3) is immediately available as one of the revealed seeds.
2. s(4, 4) requires two hash operations from s(2, 1). The value of s(3, 2) calculated on the way should be stored.
3. s(4, 3) may be deallocated once s(4, 4) is in use.
4. s(4, 5) requires one hash of the stored s(3, 2).
5. s(4, 4) and s(3, 2) may then be deallocated.
6. s(4, 6) requires two hashes from s(2, 1). Again the value of s(3, 3) calculated on the way should be stored.
7. s(2, 1) may be deallocated as soon as it has been hashed.
8. s(4, 5) may be deallocated as soon as s(4, 6) is in use.
9. The process continues along similar lines until s(4, 9) is finished with, when it is deallocated, leaving no further seeds in memory.

It will be noted that, if the above seed storage strategy is adopted, one hash operation is required per key on the seeds in the penultimate layer, one hash every two keys on the next layer up, one hash every four keys on the next layer, and so on. In other words, no branch of the tree ever requires the hash to be calculated more than once. Therefore:

    mean no. of hashes per key = no. of branches / no. of leaves = (2^(D+1) − 1)/2^D < 2

If memory is extremely scarce (e.g. an embedded device) but some clock cycles are spare, storage can be traded off against processing. Any intermediate seeds down the branch of the tree to the current key need to be calculated, but they don't all need to be stored. Those closest to the leaves should be stored (cached), as they will be needed soonest to calculate the next few keys. As intermediate seeds nearer to the root are required, they can be recalculated, as long as the seeds originally sent by the key manager are never discarded until the sequence has left them behind.

Unlike senders or receivers, a key manager cannot guarantee to access the key-space only sequentially. It will have to respond to requests for seeds from anywhere in the tree. However, for most scenarios it is likely that requests will tend not to be randomly distributed. A key manager can therefore use an approach identical to that of the device with scarce memory. It can calculate seeds in any part of the tree from the initial seeds, but cache those being most frequently used. This simply requires a fixed-size cache memory allocation and discard of the least recently used values in the store.
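A sketch of this sequential strategy as code, under the same caveats as the earlier sketch (blind() below is a trivial stand-in for the rotate-then-hash blinding functions): the walker stores only the h+1 seeds on the branch from one revealed seed to the current leaf, and recomputes just the suffix of the branch that changes between consecutive keys, giving under two hash operations per key on average, as derived above.

    #include <stdint.h>
    #include <stdio.h>

    #define SEED_BYTES 16
    #define MAX_H 32
    typedef struct { uint8_t b[SEED_BYTES]; } seed_t;

    /* Stand-in blinding function: substitute the rotate-then-hash b0/b1
     * of the earlier sketch (or MD5) in practice.                         */
    static seed_t blind(seed_t s, int p) {
        for (int i = 0; i < SEED_BYTES; i++)
            s.b[i] = (uint8_t)(31 * s.b[i] + 17 * p + i + 1);
        return s;
    }

    /* Iterates the 2^h keys under one revealed seed in order, holding
     * only the branch from that seed to the current leaf.                 */
    typedef struct {
        int      h;                 /* revealed seed is h levels above leaves */
        uint64_t next;              /* offset of the next leaf                */
        seed_t   path[MAX_H + 1];   /* path[j] = seed j levels below the root */
    } walker_t;

    static void walker_init(walker_t *w, seed_t root, int h) {
        w->h = h; w->next = 0; w->path[0] = root;
    }

    /* Returns 0 once all 2^h keys have been produced.                     */
    static int walker_next(walker_t *w, seed_t *key) {
        uint64_t i = w->next;
        if (i >> w->h) return 0;
        int from = 1;               /* first call: build the whole branch  */
        if (i > 0) {
            uint64_t diff = i ^ (i - 1);   /* branch bits flipped at step i */
            int hb = 0; while (diff >>= 1) hb++;
            from = w->h - hb;       /* recompute from this level downward  */
        }
        for (int j = from; j <= w->h; j++)
            w->path[j] = blind(w->path[j - 1], (int)((i >> (w->h - j)) & 1));
        *key = w->path[w->h];       /* leaf seed = key (Equation 2)        */
        w->next++;
        return 1;
    }

    int main(void) {
        seed_t root = {{1}}, k;
        walker_t w;
        walker_init(&w, root, 3);   /* e.g. a revealed seed 3 levels up    */
        int n = 0;
        while (walker_next(&w, &k)) n++;
        printf("produced %d keys\n", n);  /* 8 keys, 14 hashes: 1.75 each  */
        return 0;
    }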
5.2 Efficiency
Table 1 shows various performance parameters of the BHT per secure multicast session, where:
– R, S and KM are the receiver, sender and key manager, respectively, as defined in Section 3;
– N (= n − m + 1) is the length of the range of keys that the receiver requires, randomly positioned in the key space;
– ws is the size of a seed (typically 128b);
– wh is the size of the key management protocol header overhead;
– ts is the processor time to blind a seed (plus one relatively negligible circular shifting operation).
Table 1. Efficiency parameters of the BHT per secure multicast session

    Parameter (BHT)                                    min   mean            max
    (unicast message size)/ws − wh,
      = (min storage)/ws, per R                        1     O(log(N) − 1)   2(log(N + 2) − 1)
    (processing latency)/ts, per R                     0     O(log(N)/2)     log(N)
    (processing per key)/ts, per R, S or KM            1     2               log(N)
    (min storage)/ws, per S or KM                      1
    (min random bits)/ws, per S                        1
The unicast message size for each receiver's session set-up is shown equated to the minimum amount of storage each receiver requires. This is the storage required before starting the session, not once keys have started to be calculated. The minimum sender storage row has the same meaning. The processing latency is the time required for one receiver to be ready to decrypt incoming data after having received the unicast set-up message for its session. Note that there is no latency cost when other members join or leave, of the kind incurred by schemes that cater for unplanned eviction.

The figures for processing per key assume sequential access of keys and the caching strategy described in Section 5.1. The exceptional cases when a session starts or ends are not included in the figures for per-key processing. Only the sender (or a group controller, if there are multiple senders) is required to generate random bits for the initial seeds. The number of bits required is clearly equal to the minimum sender storage of these initial seeds.

It can be seen that the only parameters that depend on the size of the group membership are those that are per receiver. The cost of two of these (storage
and processing latency) is distributed across the group membership, thus being constant per receiver. Only the unicast message size causes a cost at a key manager that rises linearly with group membership size, but that cost is borne only once per receiver session. Certainly, none of the per-receiver costs are themselves dependent on the group size, as they are in all schemes that allow unplanned eviction. Thus, the BHT construction is highly scalable.

5.3 Security
Each seed in the tree is potentially twice as valuable as its child. Therefore, there is an incentive to exhaustively search the seed space for the correct value that blinds to the current highest known seed value in the tree. For the MD5 hash, this will involve 2^127 MD5 operations on average. It is possible a value will be found that is incorrect but blinds to a value that collides with the known value (typically one will be found every 2^64 operations with MD5). This will only be apparent by using the seed to produce a range of keys and testing one on some data supposedly encrypted with it. Having succeeded at breaking one level, the next level will be twice as valuable again, but will require the same brute-force effort to crack. Note that one MD5 hash (portable source) of a 128b input takes about 4 µs on a Sun SPARCserver-1000. Thus, 2^128 MD5s would take about 4 × 10^25 years. MD5 optimised for its host architecture is about twice as fast.

Generally, the more random values that are needed to build a tree, the more it can contain sustained attacks to within the bounds of the sub-tree created from each new random seed. However, for long-running sessions, there is a trade-off between security and the convenience of a continuous key-space (as against concatenating BHTs side by side, described earlier). The randomness of the randomly generated seeds is another potential area of weakness that must be correctly designed.

Any key sequence construction like that discussed here is vulnerable to collusion between valid group members. If a sub-group of members agree amongst themselves to each buy a different range of the key space, they can all share the seeds they are sent, so that they can all access the union of their otherwise separate key spaces. Arbitrage is a variant of member collusion that has already been discussed. This is where one group member buys the whole key sequence then sells portions of it more cheaply than the selling price, still making a profit if most keys are bought by more than one customer. Protection against collusion with non-group members is discussed in Section 6.2 on watermarking.

Finally, the total system security for any particular application clearly depends on the strength of the security used when setting up the session. The example scenario in Section 3 describes the issues that need to be addressed and suggests standard cryptographic techniques to meet them. As always, the overall security of an application is as strong as the weakest part, which is more likely to be some 'human' element than the key sequence construction discussed here.
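As a quick arithmetic check of that figure, using the stated 4 µs per hash:

    2^128 × 4 µs ≈ 3.4 × 10^38 × 4 × 10^−6 s ≈ 1.4 × 10^33 s ≈ 4.3 × 10^25 years.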
6 Requirement Variations
The key management scheme described in the current work lends itself to modular combination with other mechanisms to meet the additional commercial requirements described below.

6.1 Multi-sender Multicast
A multi-sender multicast session can be secured using the BHT as long as all the senders arrange to use the same key sequence. They need not all be using the same key simultaneously, as long as the keys they use are all part of the same sequence. Receivers can know which key to use, even if each sender is out of sequence with the others, as long as the ADU index is transmitted in the clear as a header for the encrypted ADU. The example scenario in Section 3 described how multiple senders might synchronise the ADU index they were all using, if this was important to the commercial model of the application.

If each sender in a multi-sender multicast uses different keys or key sequences, each sender is creating a different secure multicast session, even if they all use the same multicast address. This follows from the distinction between a multicast session and a secure multicast session defined in Section 2.

6.2 Watermarked Audit Trail
Re-multicast of received data requires very low resources on the part of any receiver. Even if the value of the information received is relatively low, there is always a profit to be made by re-multicasting data and undercutting the original price (arbitrage), as proved in Herzog et al. [10]. In general, prevention of information copying is considered infeasible; instead, most attention focuses on the more tractable problem of copy detection by uniquely 'watermarking' each copy of a work. If a watermarked copy is later discovered, it can be traced back to its source, thus deterring the holders of original copies from passing on further, illicit copies. Watermarks are typically applied to the least significant bits of a medium to avoid significantly degrading the quality.

An approach such as Chameleon [1] can be used to watermark the keys used to decrypt the stream of data, and can therefore be combined with keys from the BHT. In Chameleon, a stream is ciphered by combining a regular stream cipher with a large block of bits (512kB in Chameleon's concrete example). Each receiver is given a long-term copy of the block to decipher the stream. The block is watermarked for each receiver in a way specific to the medium. Because the block is only used for the XOR operation, the position of any watermarked bits is preserved in the output, allowing the approach to be generic. Thus, the keys generated by the BHT construction can be treated as a sequence of intermediate keys from which a watermarked sequence of final keys is generated, thus enforcing watermarked decryption.
However, this approach suffers from an applicability limitation of Chameleon that, to our knowledge, has not previously been discussed. Chameleon doesn't detect 'semi-internal' leakage to users who legitimately hold a valid long-term key block. Intermediate keys, rather than final ones, can be leaked to any such receiver. For instance, in the above network game example, a group of players can collude to each buy a different game-hour and share the (unwatermarked) intermediate keys that each buys between themselves. Thus a receiver not entitled to certain of the intermediate keys can create final keys watermarked with her own key block and hence decrypt the cipherstream. Although the keys and data produced are stamped with her own watermark, this only gives an audit trail to the target of the leak, not the source (shutting the stable door after the horse has bolted). Chameleon does nonetheless create an audit trail for any keys or data that are passed to a completely unauthorised receiver, that is, a receiver without a long-term key block, e.g. someone who has not played the game recently. In such cases the keys or data can be traced back to the traitor who revealed them. Similarly, there is an audit trail if one of the players passes on their long-term key block instead, as it also contains a watermark traceable to the source of the leak. Thus Chameleon 'raises the bar' against leakage, and is therefore still a valid candidate for modular combination with the BHT.
6.3 Unplanned Eviction
As already pointed out, the BHT allows for eviction from the group at arbitrary times, but only if planned at the time each receiver session is set up. If pre-planned eviction is the common case, but occasional unplanned evictions are needed, keys from the BHT can be combined with another scheme, such as LKH++ [6], to allow the occasional unplanned eviction. To achieve this, as with watermarking above, keys from the sequence generated by the BHT are treated as intermediate keys. These are combined (e.g. XORed) with a group key distributed using, for example, LKH++ to produce a final key used for decrypting the data stream. Thus both the BHT intermediate key and the LKH++ intermediate key are needed to produce the final key at any one time. Indeed, any number of intermediate keys can be combined (e.g. using XOR) to meet multiple requirements simultaneously. For instance, MARKS, LKH++, and Chameleon intermediate keys can be combined to simultaneously achieve low-cost planned eviction, occasional unplanned eviction, and a watermarked audit trail against leakage outside the long-term group. Formally, the final key is k_{i,j,...} = c(k_i, k_j, ...), where the intermediate keys k can be generated from sequences using a BHT construction or by any other means such as Chameleon or LKH++, and c() is a combining function, such as XOR.

In general, combination in this way produces an aggregate scheme with storage costs that are the sum of the individual component schemes. However,
combining LKH++ with MARKS, where most evictions are planned, cuts out all the re-keying messages of LKH++ unless an unplanned eviction is actually required.
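A minimal sketch of the combining function c() in C, assuming 16-byte keys and XOR as the combiner (names and signature are ours):

    #include <stddef.h>
    #include <stdint.h>

    #define KEY_BYTES 16   /* 16 B keys, as in the seed size quoted in Section 8 */

    /* final key = k_i XOR k_j XOR ... over all intermediate keys
     * (e.g. one from the BHT sequence, one from LKH++, one from Chameleon). */
    void combine_keys(uint8_t final_key[KEY_BYTES],
                      const uint8_t intermediate[][KEY_BYTES], size_t n)
    {
        for (size_t b = 0; b < KEY_BYTES; b++) {
            uint8_t x = 0;
            for (size_t i = 0; i < n; i++)
                x ^= intermediate[i][b];
            final_key[b] = x;
        }
    }

Losing any one intermediate key makes the final key unrecoverable, which is what lets each component scheme veto access independently.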
7 Limitations and Further Work
Duplication of information costs so little that selling multiple copies at a unit price much greater than the cost of duplication always creates an economic incentive for potential buyers to collude. We discuss receiver collusion and arbitrage in Sections 5.3 and 6.2, but the best solution we can offer without requiring smartcards only provides the possibility of detecting collusion between a group member and a non-member. Detecting intra-group collusion without requiring specialist hardware is left for further work.

We have assumed that knowledge of more than one value, blinded in different ways from the same starting value, doesn't lead to an analytical solution for the original value. Until proofs exist showing that any blinding function is resistant to analytical (as against brute-force) attack, it won't be possible to prove whether an analytical attack has been made easier by our techniques.

Finally, through pressure of time, we have avoided analysis of trees of degree three and above. They potentially offer greater efficiency at the expense of additional complexity. For instance, the experiments in Wong et al [17] recommend a tree of degree four, but the pattern of usage that their tree is subjected to is only tenuously related to the present work.
8 Conclusion
We have presented a solution to manage the keys of very large groups. It preserves the scalability of receiver-initiated Internet multicast by completely decoupling senders from all receiver join and leave activity. Senders are also completely decoupled from the key managers that absorb this receiver activity. We have shown that many commercial applications have models that only need stateless key managers, in which case unlimited key manager replication is feasible. These gains have been achieved by having systematic group key changes, rather than receiver join or leave activity, drive re-keying. Decoupling is achieved by senders and key managers pre-arranging the unit of financial value in the multicast data stream (the 'application data unit' with respect to charging). Using this model, there is zero side effect on other receivers (or on the senders) when one receiver joins or leaves. We also ensure multicast is not used for key management, only for bulk data transfer. Thus, re-keying isn't vulnerable to random transmission losses, which are complex to repair scalably when using multicast.

State-of-the-art techniques that allow unplanned eviction from the group are still costly in messaging terms. In contrast, we have focussed on the problem of planned eviction: that is, eviction per receiver after some arbitrary future ADU, but planned at the time the receiver requests a session. We have asserted that many commercial scenarios based on pre-payment or subscription don't require
unplanned eviction but do require arbitrary planned eviction. Examples are pay-TV, pay-per-view TV, or network gaming. To achieve planned but arbitrary eviction we have designed a key sequence construction that is used by the senders to systematically change the group key. It is designed such that an arbitrary sub-range of the sequence can be reconstructed by revealing a small number of seeds (16 B each). We can reveal N keys to each receiver using O(log(N)) seeds. The scheme requires on average just O(log(N)/2) fast hash operations to get started, then on average no more than two more hashes to calculate each new key in the sequence. This implies under 10 µs of processing time to generate each ADU key with today's technology.

To put this work in context, for pay-TV charged per second with 10% of ten million viewers tuning in or out within a fifteen-minute period, the best alternative scheme (Chang et al [6]) might generate a re-key message of the order of tens of kB every second, multicast to every group member. The present work requires a message of a few hundred bytes unicast just once to each receiver at the start of perhaps four hours of viewing. This comparison is not strictly fair as, unlike the present scheme, Chang et al and the other schemes of its class allow for unplanned eviction from the group, thus allowing accurate charging for serendipitous viewing. However, the purpose of this work is to present a far more scalable solution for commercial scenarios where unplanned eviction is not required. Another way of putting this is that the cost of scenarios requiring unplanned eviction might make them economically unviable compared to those that can make do with planned eviction. Nonetheless, if unplanned eviction is occasionally required, we have shown how to combine our scheme with Chang's to get the best of both worlds. Combining schemes sums the storage requirements of each, but both are very low in this respect. We also show how to further combine with the Chameleon watermarking scheme to give rudimentary detection of information leakage outside the group.
Acknowledgements

Jake Hill, Ian Fairman, David Parkinson (BT).
References

1. Ross Anderson, Charalampos Manifavas (Cambridge Uni), "Chameleon - A New Kind of Stream Cipher", Fast Software Encryption, Haifa (Jan 1997), http://www.cl.cam.ac.uk/ftp/users/rja14/chameleon.ps.gz
2. Pete Bagnall, Bob Briscoe, Alan Poppitt (BT), "Taxonomy of Communication Requirements for Large-scale Multicast Applications", Internet Draft (work in progress), Internet Engineering Task Force (17 May 1999), draft-ietf-lsma-requirements-03.txt
3. Bob Briscoe, Ian Fairman (BT), "Nark: Receiver-based Multicast Non-repudiation and Key Management", forthcoming in ACM conference on Electronic Commerce (Nov 1999), http://www.labs.bt.com/projects/mware/
4. Bob Briscoe (BT), "MARKS: Zero Side Effect Multicast Key Management using Arbitrarily Revealed Key Sequences", BT Technical Report (Aug 1999), http://www.labs.bt.com/projects/mware/
5. Ran Canetti (IBM T.J. Watson), Juan Garay (Bell Labs), Gene Itkis (NDS), Daniele Micciancio (MIT), Moni Naor (Weizmann Inst. of Science), Benny Pinkas (Weizmann Inst. of Science), "Multicast Security: A Taxonomy and Efficient Constructions", Proceedings IEEE Infocomm'99, Vol 2, 708-716 (Mar 1999), http://www.wisdom.weizmann.ac.il/~bennyp/PAPERS/infocom.ps
6. Isabella Chang, Robert Engel, Dilip Kandlur, Dimitrios Pendarakis, Debanjan Saha (IBM T.J. Watson Research Center), "Key Management for Secure Internet Multicast using Boolean Function Minimization Techniques", Proceedings IEEE Infocomm'99, Vol 2, 689-698 (Mar 1999), http://www.research.ibm.com/people/d/debanjan/papers/infocom99.srm.pdf
7. S. Deering, "Multicast Routing in a Datagram Network", PhD thesis, Dept. of Computer Science, Stanford University (1991).
8. A. Frier, P. Karlton and P. Kocher (Netscape), "The SSL 3.0 Protocol", Nov 18, 1996.
9. Mark Handley (UCL), "On Scalable Internet Multimedia Conferencing Systems", PhD thesis (14 Nov 1997), http://www.aciri.org/mjh/thesis.ps.gz
10. Shai Herzog (IBM), Scott Shenker (Xerox PARC), Deborah Estrin (USC/ISI), "Sharing the Cost of Multicast Trees: An Axiomatic Analysis", in Proceedings of ACM/SIGCOMM '95, Cambridge, MA, Aug 1995, http://www.research.ibm.com/people/h/herzog/sigton.html
11. Ronald L. Rivest, "The MD5 Message-Digest Algorithm", Request for Comments (RFC) 1321, Internet Engineering Task Force (1992), http://www.ietf.org/rfc/rfc1321.txt
12. Tony Ballardie, "Scalable Multicast Key Distribution", Request for Comments (RFC) 1949, Internet Engineering Task Force (May 1996), http://www.ietf.org/rfc/rfc1949.txt
13. ITU-R Rec. 810, "Conditional-Access Broadcasting Systems" (1992), http://www.itu.int/itudocs/itu-r/rec/bt/810.pdf
14. David A. McGrew, Alan T. Sherman, "Key Establishment in Large Dynamic Groups Using One-way Function Trees", TIS Report No. 0755, TIS Labs at Network Associates, Inc., Glenwood, MD (May 1998).
15. Suvo Mittra, "Iolus: A Framework for Scalable Secure Multicasting", Proceedings of ACM SIGCOMM '97, 14-18 Sep 1997, Cannes, France.
16. FIPS Publication 180-1, Secure Hash Standard, NIST, U.S. Department of Commerce, Washington, D.C. (April 1995).
17. Chung Kei Wong, Mohamed Gouda and Simon S. Lam, "Secure Group Communications Using Key Graphs", Proceedings of ACM SIGCOMM '98 (Sep 98), http://www.acm.org/sigcomm/sigcomm98/tp/abs_06.html
Appendix A - Algorithm for Identifying Minimum Set of Intermediate Seeds for BHT

In the following C-like code fragment, the function odd(x) tests whether x is odd, and the function reveal(d,i) reveals seed s(d, i) to the receiver.

    min = m; max = n;
    for (d = D; ; d--) {          // working from leaves, move up tree 1 level each loop
        if (min == max) {         // min & max have converged...
            reveal(d, min);       // ...so reveal sub-tree root...
            break;                // ...and quit
        }
        if (odd(min)) {           // odd min never left child...
            reveal(d, min);       // ...so reveal odd min seed
            min++;                // and step min in 1 to the right
        }
        if (!odd(max)) {          // even max never right child...
            reveal(d, max);       // ...so reveal even max seed
            max--;                // and step max in 1 to the left
        }
        if (min > max) break;     // min & max are cousins, so quit
        min /= 2;                 // halve min...
        max /= 2;                 // ...& halve max ready for the next level round the loop
    }
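As an illustrative trace of this algorithm (assuming the usual indexing, in which seed s(d, i) covers leaf keys i·2^(D-d) through (i+1)·2^(D-d) - 1), take D = 4 (16 leaf keys) and the range min = m = 5, max = n = 12:

    d=4: min = 5 is odd, so reveal s(4,5) and step to min = 6; max = 12 is even,
         so reveal s(4,12) and step to max = 11; halve to min = 3, max = 5.
    d=3: min = 3 is odd, so reveal s(3,3) and step to min = 4; max = 5 is odd,
         so nothing is revealed; halve to min = 2, max = 2.
    d=2: min == max, so reveal s(2,2) and quit.

The four revealed seeds cover keys {5}, {12}, {6,7} and {8,9,10,11} respectively, i.e. exactly the requested range 5-12, in line with the O(log(N)) bound.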
Appendix B - Notation

O(x) is notation for 'of order x'. ⌊j/P⌋ is notation for the value of j/P rounded down to the nearest integer (the floor function). j mod P is notation for the remainder of j/P.
Multicast Service Differentiation in Core-Stateless Networks

Tae-eun Kim, Raghupathy Sivakumar, Kang-Won Lee, and Vaduvur Bharghavan

TIMELY Research Group, University of Illinois at Urbana-Champaign
{tkim, sivakumr, kwlee, bharghav}@timely.crhc.uiuc.edu
Abstract. Diffserv has emerged as a popular architectural paradigm for providing service differentiation for future internetworks. The key principle of Diffserv is to move per-flow state out of the core and into the fringes of network clouds. Using core-stateless networks, techniques such as Core Stateless Fair Queueing (CSFQ) have been proposed for achieving unicast service differentiation in core routers. In this paper, we propose a core-stateless network architecture called mCorelite that extends CSFQ in order to provide multi-rate weighted max-min fairness for layered multicast flows, and eliminates the need for multicast receivers to perform rate adaptation using join experiments. Preliminary performance results using the ns-2 simulator demonstrate that it is possible to achieve service differentiation for multicast flows without maintaining any per-flow state in the core of the network.
1 Introduction
In the initial stages of the deployment of the Internet, most routers supported a single best effort service class and provided only simple datagram service. However, as the Internet has moved from a predominantly academic networking environment to a largely commercial networking environment, the issue of providing service differentiation among flows has come to the fore. In the early half of this decade, the Intserv paradigm for providing quality of service gained prominence [1,2]. Intserv provides end-to-end per-flow metrics for unicast and multicast flows, but requires a significant amount of per-flow state, signalling, and computation in the routers of the network. As the Internet has continued to grow explosively in size and heterogeneity in recent years, routers in the core of backbone networks often serve hundreds of thousands of flows concurrently; in this scenario, it has been argued that the "core" routers cannot maintain per-flow state or perform per-flow computations. In order to address the scalability issues of Intserv, the Diffserv paradigm for providing quality of service is now becoming popular [3,4]. The basic goal of Diffserv is to achieve service differentiation among flows while moving flow-specific state and computation out of the core and into the fringes of the network. In a similar vein, the goal of this paper is to design a core-stateless network architecture that provides relative service differentiation. To this end, we present a QoS architecture called mCorelite for
achieving multi-rate weighted max-min fair rate allocation for layered multicast flows co-existing with unicast flows¹. It is well known that weighted max-min fair rate allocation can be achieved in a network where all the routers perform weighted fair queueing (WFQ) [18]. However, implementing WFQ requires maintaining per-flow state, which is not desirable at core routers. To alleviate this problem, recently a technique called Core Stateless Fair Queueing (CSFQ) was proposed for approximating the behavior of WFQ without maintaining per-flow state in core routers [6]. While CSFQ is able to effectively emulate the behavior of WFQ by allocating per-flow rate equal to the "weighted fair share" when serving only unicast flows, it has inherent limitations that make it unsuitable for serving layered multicast flows. In mCorelite, core routers use the same principle as in CSFQ but are able to compute the weighted fair share when serving both unicast and layered multicast flows. As a packet traverses its flow path, it gets updated with the weighted fair share information of the core routers along the path. Using this information, end hosts or access routers can join as many layers of a flow as can be sustained along the path for multicast flows, or perform rate adaptation for unicast flows. mCorelite thus achieves multi-rate weighted max-min fairness and eliminates the need for multicast receivers to perform independent join experiments [9]. The key technical challenge is to design low overhead mechanisms at the core routers to compute the weighted fair share without maintaining any per-flow state when there are both unicast and layered multicast flows traversing the core router. The rest of the paper is organized as follows: Section 2 presents the network and service models of mCorelite. Section 3 provides a brief overview of CSFQ and makes the case for mCorelite. Section 4 presents the core and access router mechanisms in mCorelite. Section 5 discusses initial performance results of mCorelite. Section 6 compares our work to related work and Section 7 concludes the paper.

¹ mCorelite is the multicast component of the Corelite QoS architecture. While Corelite supports minimum rate guarantees and relative rate/delay differentiation in a core-stateless network [7,8], the focus of this paper is on achieving weighted max-min fair rate differentiation for layered multicast flows.
2 Network and Service Model

2.1 Network Model
mCorelite is designed for a heterogeneous inter-network of autonomously managed network clouds, where a network cloud consists of core routers in the center and edge routers in the fringes. Since core routers may serve in the order of hundreds of thousands of flows concurrently, in mCorelite core routers do not maintain per-flow state or perform per-flow computations. On the other hand, edge routers are allowed to maintain a limited amount of per-flow state and perform some per-flow computations. In particular, since end hosts themselves may not be trusted to maintain their traffic contracts, "access routers", i.e. edge
routers to which end hosts are connected, maintain some flow-specific state and participate in flow-specific traffic shaping. The mCorelite architecture is independent of the underlying unicast and multicast routing mechanisms, so long as all packets in a flow traverse the same route (otherwise, fair share computations along a path become meaningless). For our purposes, it does not matter where the multicast routers are placed in the network, because we make the case that every router needs to be "multicast aware", even if it is not a multicast router, in order to correctly determine its weighted fair share when multicast flows traverse through it. We assume that multicast flows are layered and prioritized, i.e. a receiver must subscribe to layer i before it attempts to join layer i + 1. We further assume that a receiver must be able to receive all the packets in a layer in order to join the layer; we do not allow for partial reception of packets in a layer. For unicast flows, we assume that the flow can utilize whatever bandwidth is allocated to it. The access router serving the receiver performs joins on behalf of multicast receivers, while the access router serving the sender enforces rate adaptation on behalf of unicast senders. The interaction between the end hosts and the access routers is beyond the scope of this paper.
2.2 Service Model
In the network environment described above, mCorelite seeks to achieve multi-rate weighted max-min fairness for unicast and layered multicast flows. We now present the definition of multi-rate weighted max-min fairness used in this paper. Let us start with defining max-min fairness in a network consisting of only unicast flows [5]. Consider a set of flows {1, ..., n} and a rate allocation vector for the flows R = [r_1, ..., r_n]. R is a max-min fair rate allocation if it is impossible to increase the rate allocation r_i without causing the decrease of some r_j ≤ r_i. Weighted max-min fairness is a simple extension of max-min fairness, where each flow i has a corresponding weight w_i. A rate allocation vector R is weighted max-min fair if it is impossible to increase the rate allocation r_i without causing the decrease of some r_j such that r_j/w_j ≤ r_i/w_i. In a network with only unicast flows, the definition of weighted max-min fairness tells us how to achieve relative rate differentiation among flows. Now let us extend the above definition to include multicast flows (unicast being a special case of multicast with a single receiver). Consider a set of paths {(1, 1) ... (1, n_1) ... (m, 1) ... (m, n_m)} corresponding to flows {1, ..., m}, where {(i, 1) ... (i, n_i)} represents the set of paths to the n_i receivers of flow i. We define multi-rate weighted max-min fairness as follows: a rate allocation R = [r_{1,1}, ..., r_{m,n_m}] is a multi-rate weighted max-min fair rate allocation if, for a receiver j ≤ n_i in a multicast group i ≤ m, it is impossible to increase the rate allocation r_{i,j} without causing the decrease of some r_{k,l} such that r_{k,l}/w_k ≤ r_{i,j}/w_i, where w_i is the weight of flow i. This definition provides multi-rate weighted max-min fairness, but does not address the additional requirement that a multicast receiver must either receive all packets in a layer or none at all. Let B_{i,x} represent the cumulative rate of the first x layers of multicast flow i. Given a rate allocation
of r_{i,j}, the j-th receiver of flow i can receive l_j layers, such that B_{i,l_j} ≤ r_{i,j} < B_{i,l_j+1}. In other words, each receiver of a multicast flow will be subscribed to as many full layers as the weighted fair share along the path from the sender to the receiver can sustain. The remaining bandwidth (r_{i,j} - B_{i,l_j}) that is left unused by j is then reassigned to each of the links along the path of the flow to j, and the weighted max-min fair rate allocation is recursively recomputed till no multicast receiver can join additional layers. At this point, the additional bandwidth is distributed among unicast flows (which can thus receive rate allocations in excess of their weighted fair share). Figure 1 shows an example of multi-rate max-min fair rate allocation and illustrates a simple centralized algorithm to compute the rates.
Fig. 1. Multi-rate Max-min Fairness Example. There are four flows in the network shown above: three unicast flows (f1, f2, f3), and one multicast flow (f4) with two receivers D and E. Let f4 consist of two layers that require 1 unit of bandwidth each. Network link capacities are as specified for each link (A-B: 15, B-C: 5, C-D: 3, D-E: 1). Initially, the weighted fair shares on the links are as follows: A-B: 5, B-C: 5/3, C-D: 3/2, D-E: 1/1. In the first iteration, the bottleneck link is D-E, with a weighted fair share of 1. The flow traversing D-E, f4,e, is allocated a rate of 1, and weighted fair shares are recomputed. In the second iteration, the bottleneck link is C-D, with a weighted fair share of 1.5. The flows traversing C-D, f1 and f4,d, are allocated a rate of 1.5 each. Since f4,d can only accept rates of 1 or 2, it is allocated a rate of 1, and the remaining bandwidth is reassigned to C-D. In the third iteration the bottleneck links are B-C and C-D, with a weighted fair share of 2. The flows traversing B-C and C-D, f1 and f3, are allocated a rate of 2 each. Finally, f2 is allocated a rate of 12. The multi-rate weighted max-min rate allocation is f1: 2, f2: 12, f3: 2, f4,d: 1, f4,e: 1.
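As a small illustration of the layer-quantisation step in this algorithm, the sketch below (names are ours, not from the paper) returns the number of full layers a fair share can sustain, i.e. the largest l with B_{i,l} ≤ r_{i,j}:

    /* Number of full layers sustainable at the given rate; layer_rate[k] is the
     * rate of layer k+1, so the cumulative rate B is accumulated as we go. */
    int layers_sustained(double rate, const double *layer_rate, int n_layers)
    {
        double cum = 0.0;
        int l = 0;
        while (l < n_layers && cum + layer_rate[l] <= rate) {
            cum += layer_rate[l];
            l++;
        }
        return l;   /* receiver subscribes to layers 1..l */
    }

For f4 in Figure 1 (two layers of 1 unit each) and the second-iteration fair share of 1.5, this returns 1, matching the allocation in the caption.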
3 Background
The core router mechanisms in mCorelite are similar to those of CSFQ, but are designed to support layered multicast flows as well as unicast flows. In this section, we first provide a brief overview of CSFQ and then make the case for mCorelite.
3.1 Core Stateless Fair Queueing
CSFQ estimates the weighted fair share of each link without maintaining any per-flow state in the core router. The fair share α at a core router represents
the share of the output link capacity that is allotted to each flow that traverses the router. In CSFQ, each packet has the rate r and weight w of the flow to which the packet belongs stamped in its header. When the packet arrives at a router, the router drops the packet with a probability of max{0, 1 - w·α/r}. If the packet is not dropped, it is accepted for transmission. If A represents the aggregate arrival rate, F represents the aggregate accepted rate (where the two variables are updated after the arrival of every packet), and C represents the link capacity, the fair share α is updated as follows:

    if (A > C)
        α_new ← α_old ∗ C/F
    else
        α_new ← largest rate of any active flow

The combination of fair share estimation and probabilistic dropping of packets for those flows whose rate exceeds the weighted fair share enables CSFQ to enforce fair sharing of a link without maintaining any per-flow state in the router. However, as we explain in the next section, this mechanism works only for unicast flows.
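To make this concrete, the following is a minimal per-link sketch of the two CSFQ rules just quoted. The state layout, names, and the use of per-epoch byte counts in place of rate estimators are our simplifications; the published algorithm uses exponential averaging of A and F.

    #include <stdlib.h>

    typedef struct {
        double alpha;   /* current fair share estimate */
        double A, F;    /* bytes arrived / accepted this epoch (stand-ins for rates) */
        double C;       /* link capacity, in bytes per epoch */
    } csfq_link;

    /* Per-packet rule: drop with probability max{0, 1 - w*alpha/r}; r > 0 is the
     * flow rate stamped in the packet header, w the flow weight. */
    int csfq_accept(csfq_link *l, double r, double w, double bytes)
    {
        double p = 1.0 - w * l->alpha / r;
        if (p < 0.0) p = 0.0;
        l->A += bytes;
        if ((double)rand() / RAND_MAX < p)
            return 0;                       /* dropped */
        l->F += bytes;
        return 1;                           /* accepted for transmission */
    }

    /* Per-epoch fair share update, as quoted above. */
    void csfq_epoch(csfq_link *l, double max_active_rate)
    {
        if (l->A > l->C)
            l->alpha *= l->C / l->F;
        else
            l->alpha = max_active_rate;     /* largest rate of any active flow */
        l->A = l->F = 0.0;
    }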
3.2 Case for mCorelite
From the description above, it is clear that CSFQ is based on two key premises: (a) the rate advertised in a packet header reflects the rate of the flow to which the packet belongs, and (b) flows that exceed the weighted fair share lose a fraction of their packets, such that the expected egress rate of no flow out of the router exceeds the weighted fair share of the flow at the router. Unfortunately, both of these premises pose problems in supporting layered multicast flows, as we will see below.
Fig. 2. CSFQ and Layered Multicasting: CSFQ routers treat each layer of a layered multicast session as an independent flow and hence over-allocate resources to multicast sessions and penalize unicast flows. In the above example, the weighted fair share unfairly converges to 20 units, due to which the unicast flow gets a share of 20 while the multicast flow gets a share of 80 units. (The figure annotations trace the convergence: in (a), with only f1, fs = 100·100/100 = 100; in (b), 1st round fs = 100·100/180 = 55, 2nd round fs = 55·100/135 = 40.74, ..., n-th round fs = 20.)
Figure 2 shows a simple example. There are two flows: one is a unicast flow f1, and the other is a multicast flow f2. Flow f1 has a rate request of 100, which can fully utilize the shared link. Flow f2 is composed of four layers with decreasing priorities, where each layer requires 20 units of bandwidth. Let us assume initially there is only f1 in the network (Figure 2.a). In this case, the initial weighted fair share is 100. Now, f2 starts transmission, and shares the link with
f1. Let us consider that the receiver has subscribed to all four layers initially (because the perceived weighted fair share of 100 on the shared link is sufficient to accommodate all the layers). From the service model in Section 2, we expect to see a rate allocation of 60 units for f1 and 40 units for f2. The question is: "In CSFQ, what should be the rate r_i contained in the header of the packets belonging to each layer of the multicast flow in order to achieve the desired rate allocation?" We present three possible choices, but show that none of them results in the desired rate allocation.

– Option a: following the original CSFQ algorithm, each multicast packet carries the rate of its own layer. In this case, CSFQ will end up treating each layer of the multicast flow as an individual flow, and converge to a weighted fair share of 20 (see Figure 2.b). Thus the unicast flow is allocated a rate of 20, and the multicast flow is allocated a rate of 80.
– Option b: each multicast packet carries the aggregate rate of the flow. In this case, all multicast packets will carry the rate of 80. There are two problems with this approach. First, the rate information in the packet header is different from the actual rate of each layer. Second, CSFQ does not know how to preferentially transmit packets of layers 1 and 2 and drop higher layer packets. Instead, it will end up dropping packets from all layers with equal probability. The unicast flow is allocated a rate of 50, and the multicast flow is allocated a rate of 50, with an expected 12.5 packets transmitted in each layer.
– Option c: each packet carries the aggregate rate of the flow up to that layer. In this case, the packets of layers 1, 2, 3, and 4 will carry the rates of 20, 40, 60, and 80, respectively. Let us consider some time instant when the weighted fair share is 50. At this time, CSFQ will drop packets of layer 3 with a probability of 1/6, and label the accepted packets with a rate of 50, when it should drop 1/2 of layer 3 for correct operation. Similarly, it will drop packets of layer 4 with a probability of 3/8, when it should drop all the packets of layer 4. There are two problems with this approach: (a) the accepted number of packets is much higher than the weighted fair share (because advertised rates in packet headers are inaccurate), and this leads to a reduction in the weighted fair share, thereby causing unnecessary packet loss for both unicast and multicast flows (as well as increased fluctuation in the weighted fair shares), and (b) for CSFQ routers downstream, the adjusted rates in the higher layers are very different from the actual traffic generated, and this causes CSFQ to break down for the multi-hop case.

In summary, CSFQ is unable to handle layered multicast flows because it expects all packets of a flow to carry the same advertised rate in their headers, and because it does not distinguish higher priority layers from lower priority layers. It turns out that even if we use CSFQ in core routers with an RLM-like scheme at access routers, the basic problems of CSFQ described above remain. In fairness to the designers of CSFQ, supporting multicast flows was not a target of that work. While this is the main limitation of CSFQ that we overcome in this work, other issues with CSFQ that we address are that it must drop packets to perform
rate adaptation, that it must perform specialized computations for every packet, and that it requires a change in packet headers.
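To make the Option a failure concrete, the fair-share iteration annotated in Figure 2.b can be written as a recurrence (unit weights, C = 100):

\[
\alpha_{k+1} = \alpha_k \cdot \frac{C}{F(\alpha_k)}, \qquad
F(\alpha) = \min(100, \alpha) + 4\,\min(20, \alpha).
\]

Starting from α_0 = 100, F = 180 gives α_1 ≈ 55; F(55) = 135 gives α_2 ≈ 40.74; the iteration settles at the fixed point α = 20, where F(20) = 20 + 4·20 = 100 = C, which is exactly the unfair 20/80 split described in the caption.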
4 The mCorelite Architecture
In this section, we first present a brief overview of the mCorelite architecture, then describe the algorithms at the access router and the core router that facilitate multicast service differentiation.
4.1 Overview
As we have mentioned before, the mCorelite service model calls for either fully accepting a layer or not accepting any packet of the layer at the core router. In order to achieve this, we need to be able to specify that packets belonging to layer i must be accepted only if the cumulative rate for the layers 1 through i is not greater than the weighted fair share. In fact, this was the motivation for the third option that we considered in the previous section. However, the problems with this option in the context of CSFQ were that (a) the advertised rate in the packets does not match the sending rate, which results in miscalculation of the weighted fair share in CSFQ, and (b) packets of a layer may be partially delivered. Both of these problems are due to the fact that, in CSFQ, the flow-specification contained in packet headers is tightly coupled to the way the packet itself is handled. There are two key features of mCorelite that solve these problems.

– The first key feature of mCorelite is that the source access router sends two pieces of information about the layer (as opposed to just the rate information in CSFQ): the current rate of the layer, and the admission rate of the layer. The current rate denotes the sending rate of the layer, while the admission rate denotes the minimum value of the weighted fair share that must be available to the flow for the corresponding layer to be admitted. For unicast flows, the admission rate is 0 (because a unicast flow can be partially delivered), while for multicast flows, the admission rate is the "cumulative rate" of the layers up to, and including, the current layer.
– The second key feature of mCorelite is that it decouples weighted fair share computation from data packet forwarding. mCorelite uses periodic control packets called markers (which are distinct from data packets) that are sent along the path of the flow and carry both flow-specific information (layer rate and cumulative rate) from the source, and path-specific information (weighted fair share) from the core routers. A marker is generated once every epoch² and is interleaved with the data packets belonging to the layer by the source access router, and retrieved by the destination access router. End hosts are oblivious of the insertion and removal of markers in the packet flow.

The combination of these two features allows mCorelite to overcome the problems associated with CSFQ mentioned in the previous section. Using markers instead of data packets to carry the sender rate information means that the weighted fair share computations of CSFQ can be performed independent of packet forwarding, simply by counting one marker packet as equivalent to the number of data packets (i.e. the sending rate of the layer) that it represents. The admission rate is used to determine if the current layer can be sustained given the current weighted fair share of the link. If the current layer can be sustained, the corresponding marker is allowed to pass through, and the weighted fair share is stamped/updated in the marker header. Otherwise the marker is dropped. Receiver-side access routers decide whether they should join or leave layers based on the weighted fair share information contained in the markers. In addition to the multicast data layers, we require a 0th layer, which is a control layer, that periodically carries information about the layer structure, i.e. the number of layers in the multicast session (n), the sending rate of each layer (rate_i, 1 ≤ i ≤ n), and the multicast address of each layer (addr_i, 1 ≤ i ≤ n). A receiver always joins the 0th layer, and this enables the receiver to know which layers it can join given the current weighted fair share along its path. We believe that using the approach outlined above, mCorelite can support multi-rate weighted max-min fairness for layered multicast and unicast flows. Moreover, using markers eliminates the need for changing packet header formats, reduces the packet processing overhead to one marker per epoch rather than every packet, and reduces packet loss (markers are dropped instead of data packets). However, it makes the approach more sensitive to delayed or lost markers. We now present a detailed description of the specific mechanisms in mCorelite.

² An epoch is a pre-specified time period defined in both access routers and core routers. In access routers, a marker packet is transmitted once per epoch. In core routers, weighted fair share calculation happens once per epoch. Epochs need not be synchronized.
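For concreteness, a marker can be pictured as carrying just four numbers. The struct below is our illustration of the fields named in Figures 3 and 4, not a wire format defined by the paper:

    /* Illustrative marker contents (not a defined wire format). */
    typedef struct {
        double currentRate;      /* sending rate of this layer */
        double cumulativeRate;   /* admission rate: cumulative rate up to and
                                    including this layer; 0 for unicast flows */
        double weight;           /* flow weight */
        double alpha;            /* weighted fair share, taken as the minimum
                                    over the core routers on the path */
    } marker_t;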
4.2 Access Router Mechanisms
Once every epoch, the source access router sends a marker for each layer with appropriate values for the current rate (marker.currentRate), admission rate (marker.cumulativeRate), weight (marker.weight), and weighted fair share (marker.α) fields. The first three are set by the source access router, while the weighted fair share is initially empty and is subsequently updated by core routers along the path. When the destination access router sees a marker packet at the end of the epoch, the marker contains the weighted fair share for the path. For a unicast flow, the router sends the weighted fair share back to the source access router for rate adaptation (the source access router may possibly inform the sending host about the updated rate), while for a multicast flow, the router initiates rate adaptation as shown in Figure 3.
     1  Unicast Sender Rate Adaptation            /* at the source access router */
     2    if ( currentSendingRate < marker.α )    /* α is the weighted fair share */
     3      currentSendingRate = currentSendingRate + 1
     4    else
     5      currentSendingRate = marker.α

     6  Multicast Receiver Rate Adaptation        /* at the destination access router */
     7    currentReceptionRate = Σ rate_i, {i: receiver is a member of layer i}
     8    maxLayer = MAX(i), {i: receiver is a member of layer i}
     9    if ( currentReceptionRate < marker.α )
    10      if ( currentReceptionRate + rate_{maxLayer+1} ≤ marker.α )
    11        join(addr_{maxLayer+1})
    12        currentReceptionRate = currentReceptionRate + rate_{maxLayer+1}
    13        maxLayer = maxLayer + 1
    14    else
    15      while ( currentReceptionRate > marker.α )
    16        currentReceptionRate = currentReceptionRate − rate_{maxLayer}
    17        leavelayer(maxLayer)
    18        maxLayer = maxLayer − 1

Fig. 3. Rate Adaptation at Access Routers: In the case of unicast, the sender-side access router performs rate adaptation by increasing or decreasing the sending rate, while in the case of multicast, the receiver-side access router performs rate adaptation by joining or leaving multicast groups.
When the current reception rate (currentReceptionRate) is smaller than the weighted fair share of the session (marker.α), then in the case of multicast the destination access router joins the next layer (lines 9 – 13) and in the case of unicast the source access router increases the flow's rate by one. The reason for the access router joining only the next layer is that when the network is not congested, the weighted fair share information (marker.α) may be too optimistic (in the absence of congestion CSFQ sets the fair share to the rate of the maximum-rate flow; see Section 4.3) and thus may lead to unstable behavior³. On the other hand, when the current reception rate for the session is greater than the weighted fair share, in the case of multicast the destination access router drops the highest layers until the cumulative rate becomes smaller than or equal to the weighted fair share (lines 14 – 18), and in the case of unicast, the source access router adjusts the outgoing rate of the flow to the reduced weighted fair share. It is clear that this rate adaptation will achieve the fairness service model presented in Section 2 as long as the weighted fair share estimate at the core router is fairly accurate. We now describe how core routers perform the weighted fair share estimation.
³ For the same reason, the unicast sender performs linear increase instead of directly setting the sending rate to the received weighted fair share (lines 2, 3).
4.3 Core Router Mechanisms
The core router performs the weighted fair share computation and puts the weighted fair share value in the markers. The computation is similar to CSFQ. As shown in Figure 4, the core router performs two functions: (a) whenever it sees a marker, it decides whether the marker should be "admitted" or not, and does a trivial amount of book-keeping (lines 2 – 5); and (b) at the end of each epoch, it performs the weighted fair share computation.
     1  Marker Arrival                            /* for every marker */
     2    if ( α ∗ marker.weight ≥ marker.cumulativeRate )
     3      A ← A + marker.currentRate
     4      maxRate = MAX(maxRate, marker.cumulativeRate, marker.currentRate)
     5      marker.α = MIN(marker.α, α ∗ marker.weight)

     6  Fair Share Update                         /* once every epoch */
     7    if ( A ≥ C )                            /* C is the link capacity */
     8      α = α ∗ C/A
     9    else
    10      α = maxRate + 1
Fig. 4. Fair Share Computation at Core Routers
When a marker arrives at a core router, the router computes the weighted fair share for the corresponding flow by multiplying the fair share by the flow weight. If the admission rate of the marker does not exceed the weighted fair share, the marker is admitted (line 2). In the unicast case, we set marker.cumulativeRate to 0 so that the marker is always admitted with its current rate. However, for multicast, only those layers whose cumulative rate is smaller than the weighted fair share are admitted. After this step, the core router updates the admitted rate A by the current rate of the layer denoted in the marker (line 3), updates the maxRate variable, which keeps the maximum sending rate over all the sessions and flows in the current epoch (line 4), and updates the weighted fair share field of the marker, marker.α (line 5). The maxRate is used for setting the new fair share of the link at the end of the epoch when the aggregate admitted rate is smaller than the capacity of the link (line 10). The reason for considering marker.currentRate in line 4 is that unicast flows have their marker.cumulativeRate set to 0, and it is marker.currentRate that reflects their correct rate information. The fair share computation itself is similar to CSFQ. If the aggregate rate of the admitted markers is greater than the capacity of the link, a new fair share value for this link is calculated following the approximation proposed by CSFQ [6] (lines 7, 8). On the other hand, if the aggregate rate is smaller than the
capacity, the new fair share is set to maxRate + 1 again following the convention of CSFQ (lines 9, 10).
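A plain-C rendering of the Figure 4 rules might look as follows; the struct layouts and function names are our assumptions, since the paper specifies only the pseudocode above:

    #include <math.h>

    typedef struct { double currentRate, cumulativeRate, weight, alpha; } marker_t;
    typedef struct { double alpha, A, C, maxRate; } core_link;

    void on_marker(core_link *l, marker_t *m)
    {
        if (l->alpha * m->weight >= m->cumulativeRate) {            /* line 2: admit */
            l->A += m->currentRate;                                 /* line 3 */
            l->maxRate = fmax(l->maxRate,
                              fmax(m->cumulativeRate, m->currentRate)); /* line 4 */
            m->alpha = fmin(m->alpha, l->alpha * m->weight);        /* line 5 */
        }                                                           /* else: marker dropped */
    }

    void on_epoch(core_link *l)
    {
        if (l->A >= l->C)                                           /* line 7 */
            l->alpha *= l->C / l->A;                                /* line 8 */
        else
            l->alpha = l->maxRate + 1.0;                            /* line 10 */
        l->A = 0.0;                                                 /* reset epoch state */
        l->maxRate = 0.0;
    }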
5 Performance Evaluation
In this section, we present the performance of mCorelite in the ns-2 simulation environment. We present three sets of results:

– First, we show that mCorelite supports inter-session fairness when unicast and multicast flows coexist. The test scenarios include both the case when the network is static and the case when there are network dynamics.
– Second, we show that mCorelite achieves the max-min fairness service model in a multiple bottleneck topology.
– Third, we compare mCorelite to RLM with CSFQ (option c in Section 3). We see that mCorelite achieves better fairness and fewer variations in terms of dynamic joins and leaves.

In all our simulations, we used packets of size 1 KB, tail-drop routers with a buffer size of 20 packets, default weights of 1 for all flows, and an epoch size of 1 second at both core routers (for fair share computation) and access routers (for rate adaptation). The unicast flows perform slow start unless otherwise mentioned. The multicast sessions used in the simulations have one of the following three layer structures: Type 1 (30, 20, 40, 80, 10), Type 2 (20, 40, 20, 80, 10), and Type 3 (4, 8, 16, 32, 64), where the elements of the vector denote the rate of each layer. The rates are in packets per second.
5.1 Inter-session Fairness
Fig. 5. Single Bottleneck Link Topology (access links of 4 Mbps/1 ms around a 1 Mbps/10 ms bottleneck link)
In this section, we show that mCorelite allocates bandwidth fairly among responsive unicast and multicast flows in a simple network topology shown in Figure 5. There are two multicast sessions (M1 → {R1 , R2 } and M2 → {R3 , R4 }) and two unicast flows (U1 → R5 and U2 → R6 ). Three different scenarios were
used to evaluate the following: (a) inter-session fairness between two multicast sessions, (b) inter-session fairness between a multicast flow and a unicast flow, and (c) inter-session fairness between two multicast sessions with different start times. The first scenario consists of two multicast sessions (Type 1 and Type 2) and one unicast flow. The second scenario consists of one multicast session (Type 1) and two unicast flows, and the third scenario consists of two multicast sessions (both Type 2) started at different times: 10 seconds and 40 seconds respectively. The first two scenarios show how the rates of competing flows stabilize at their weighted fair shares. The third scenario tries to evaluate how quickly mCorelite adapts to network dynamics and results in fair bandwidth allocation. The results are shown in Figures 6.a, 6.b, and 6.c, respectively.
Fig. 6. Inter-session Fairness: (a) two multicast sessions with different layer structures (Type 1 and 2), (b) one multicast session and two unicast flows, and (c) two multicast sessions (Type 2s) with different start times (10 sec and 40 sec)
The x-axis represents time in seconds, and the y-axis represents transmission rate in packets per second (pps). In Figure 6.a, we observe that the Type 1 multicast session joins up to 2 layers (cumulative rate = 50 pps) and the Type 2 multicast session joins up to 2 layers (cumulative rate = 60 pps), since the weighted fair share for each flow is 62.5 pps (or 500 Kbps). In Figure 6.b, we observe that the Type 1 multicast flow again stabilizes at 50 pps, and the two unicast flows take up the remaining bandwidth (∼ 68 pps⁴). Finally, in Figure 6.c, the first multicast session stabilizes at 80 pps since the initial weighted fair share is 130 pps. However, when another multicast session (Type 2) is introduced at 40 seconds, the original session immediately leaves the third layer and gives up room for the new session, and both sessions stabilize at 60 pps.

In addition, to evaluate the scalability of mCorelite, we simulated the performance of mCorelite with different sets of varying parameters: (a) bottleneck link delay (1 to 300 msec), (b) round-trip time (24 to msec), (c) number of multicast sessions (2 to 10 sessions), and (d) number of receivers in a session (20 to 100 receivers). Essentially, we found that the performance of mCorelite is consistent in all cases. Specifically, effective throughput was constant with different link delays and round-trip times, and the loss rate was always near 0 % with varying numbers of multicast sessions and receivers.

⁴ (1.5 Mbps − 50 · 8 Kbps)/2 = 0.55 Mbps ≈ 68 pps
5.2 Max-Min Fairness
In this section, we show that mCorelite indeed achieves the multi-rate weighted max-min rate allocation service model defined in Section 2. We consider two multicast sessions (M1 → {R1, ..., R10} and M2 → {R11, ..., R20}) and two unicast flows (U1 → UR1 and U2 → UR2) in the network configuration shown in Figure 7. M1 and M2 have layer structures of Type 1 and Type 2, respectively. The two unicast flows start transmission at 1 sec. M1 starts at 5 sec and lasts for the lifetime of the simulation. M2 starts at 30 sec and stops at 60 sec. We ran the simulation for a period of 100 sec.
Fig. 7. Multiple Bottleneck Link Topology (links of 1 Mbps/10 ms, 1.25 Mbps/10 ms, 1.5 Mbps/10 ms, and 4 Mbps/1 ms)
We summarize the results in Table 1. The table consists of entries for (a) bottleneck links⁵, (b) for each link, the flows (or receivers) which have the link as their bottleneck link, (c) the computed weighted fair shares of the flows, and (d) the measured reception rates at the receivers. In the table, the computed weighted fair share (according to the definition of weighted max-min fairness) and the measured reception rate match in most cases. Due to the start-up behavior (which does not implement slow start in this case), the measured rates of the unicast flows during the 5 – 30 sec period are smaller than the computed weighted fair share. Also, there is one case (R12 – R15) where the transient behavior results in higher throughput than the weighted fair share for the multicast session. But overall, we observe that mCorelite effectively achieves its goal of provisioning max-min fairness among multicast and unicast flows in the given multiple bottleneck link network.

⁵ The links are identified by their end points. For example, 1-3 refers to the link which connects nodes 1 and 3.
Table 1. Weighted Fair Shares and Rates (fs: computed weighted fair share, r: measured rate)

    Bottleneck Link  Receivers/Flows   5 – 30 sec    30 – 60 sec    60 – 100 sec
                                       (fs / r)      (fs / r)       (fs / r)
    1-3              R1, R2            180 / 180     90 / 90        180 / 180
    1-3              R11               -             80 / 80        -
    1-4              R3, R7, R10       90 / 90       50 / 50        90 / 90
    1-4              R12 - R15         -             60 / 65        -
    2-5              R4, R5            90 / 90       90 / 90        90 / 90
    2-6              R6, R8, R9        90 / 90       50 / 50        90 / 90
    2-6              R16 - R20         -             60 / 60        -
    7-11             UR2               97.5 / 54     77.5 / 75.3    97.5 / 92
    7-11             UR1               97.5 / 56.5   137.5 / 100    97.5 / 95

5.3 mCorelite vs Receiver-Driven Layered Multicast with CSFQ (RLM/CSFQ)
In this section, we compare the performance of mCorelite with both plain RLM and RLM/CSFQ. There are several enhancements to RLM in the literature that try to achieve better throughput and inter-session fairness [10,11,12]. Therefore, comparing mCorelite with plain RLM or RLM/CSFQ is in some sense not fair. Nevertheless, since RLM is a popularly accepted layered multicasting scheme, we compare mCorelite with RLM in this paper to provide a glimpse of how mCorelite performs with respect to one of the existing schemes. We are also in the process of comparing mCorelite with extensions to RLM that explicitly try to achieve inter-session fairness [12]. However, it is worthwhile to note that the extensions to RLM do not preclude the need for receivers to perform join experiments and hence, unlike mCorelite, will suffer from the periodic congestion induced by the join experiments.
Fig. 8. Topology for Comparison with RLM (links of 4 Mbps/5 ms, 2 Mbps/20 ms, 450 Kbps/20 ms, 120 Kbps/20 ms, and 30 Kbps/20 ms)
The tests presented in this section were performed in the topology shown in Figure 8. There are two multicast sessions (M1 → {R1, R2, R4} and M2 → {R5}, whose layer structures are Type 3) and a unicast flow U1 → R3. M1 starts transmission at 5 seconds and lasts for the lifetime of the simulation, whereas M2 starts at 100 seconds and terminates at 210 seconds. The unicast flow starts at time 150 seconds. We plot the reception rates of mCorelite, RLM, and RLM/CSFQ in Figure 9.
Fig. 9. Reception rate: (a) RLM, (b) RLM/CSFQ, and (c) mCorelite
From the figure, we observe that, in the case of RLM, the unicast flow starves since RLM does not support inter-session fairness. In the case of RLM/CSFQ, since RLM does not use the feedback provided by CSFQ, it still performs join experiments to try to join higher layers. This causes a fluctuation in the fair share computation at core routers, leading to a fluctuation in the rate observed by the receiver even in the absence of network dynamics (when no new flows are introduced). Further, since CSFQ does not treat the different layers of the multicast flow as one composite flow, it miscalculates the fair share on occasions, leading to more fluctuations. Even if RLM is cognizant of the fair share, it cannot prevent rate fluctuations because CSFQ inherently miscalculates the fair share (as explained in Section 3). Finally, CSFQ assumes that all packet drops in the network are because of the probabilistic dropping algorithm employed. However, packet drops can also occur because of buffer overflow (this is more so in the case of layered flows). Since data packets in CSFQ also carry control information, any packet dropping that occurs due to reasons other than CSFQ's dropping algorithm severely affects the fair share computation at downstream routers. As can be seen in Figure 9(b), all these factors in tandem contribute to the fluctuation in rate observed by the flows in the network. On the other hand, we observe that mCorelite provides the weighted max-min fairness defined in Section 2 between the unicast and the two multicast sessions (i.e. R2 gets 12 pps and R3 gets the remaining bandwidth⁶). In addition, in the case of mCorelite, the rate adaptation is quicker (see the start-up behavior during 0 – 25 sec) and the rate variation is smaller than that of RLM because mCorelite does not depend on join experiments for determining the reception rate. In addition to the throughput performance, we have also found that the packet loss rate of mCorelite is smaller than that of RLM since it does not require join experiments. In the case of mCorelite, the packet loss rate is always less than 0.1 %. However, in the case of RLM, due to congestion induced by join experiments, the packet loss rate is as high as 0.5 %. In summary, we have observed that mCorelite effectively achieves its design goal of providing max-min rate service differentiation in the network configurations considered in this section.

⁶ When the network has reached equilibrium, the sending rate of mCorelite remains constant. However, there is a small rate fluctuation in Figure 9.c, because the rate was measured at the receiver.
6 Related Work
The Corelite QoS architecture supports delay/rate service differentiation for both unicast and multicast flows [7,8] in core-stateless networks. This paper has focussed on the component of Corelite that supports rate service differentiation for multicast flows. In the literature, it has been shown that, by implementing a fair packet scheduler (such as fair queueing or round-robin) at each network router, it is possible to achieve max-min fairness [5]. However, the major challenge is to achieve max-min fairness without maintaining any per-flow state or performing any per-flow processing at the core routers. RED [13] and FRED [14] are some of the early efforts to provide fair resource sharing among contending flows. But they deviate from the ideal behavior in a number of scenarios [6]. CSFQ was shown to effectively approximate the service of fair queueing [6]. However, it was not designed with the goal of supporting multicast flows in the network and hence cannot be employed in multicast environments as is. While RLM is the most popularly accepted layered multicast approach in related work, it does not explicitly address the issue of inter-session fairness. However, there have been other approaches in related work that extend RLM to support inter-session fairness. Vicisano et al. [10] have proposed a TCP-like congestion control mechanism for layered multicasting which uses "synchronization points" in each layer, marked by the sender, to coordinate the join experiments of individual receivers. The sender also divides the layers in such a way that layer i + 1 has double the bandwidth of layer i, so that dropping the highest layer effectively reduces the aggregated rate of the session by half in order to simulate the behavior of TCP congestion control. Li et al. [12] have proposed an extension to RLM that achieves multi-rate max-min fairness by ensuring that competing multicast sessions sharing the same bottleneck link join approximately the same number of layers. The basic idea is to make higher layers more sensitive to congestion and lower layers less sensitive to congestion. Thus, when there is congestion, a receiver subscribed to a larger number of layers will preferentially drop its highest layer in favor of the other receivers (belonging to the other sessions) subscribed to fewer layers. ThinStreams [11] also tried to achieve fair link sharing among multicast sessions by making higher layers easier to drop and lower layers more persistent. However, the difference in this case is that congestion is detected by comparison of the measured throughput
and the expected throughput, similar to TCP Vegas. Different "join threshold" and "leave threshold" values are chosen for each layer to ensure that higher layers are more sensitive to congestion, e.g. the "leave threshold" of a receiver exponentially decreases as a function of the number of subscribed groups. When the problem of fair rate allocation is moved to a domain consisting of both unicast and multicast flows, other issues, such as how the notion of fairness should be extended to the multicast environment, need to be addressed. There are several multicast fairness models in related work: TCP-friendliness [10], bounded fairness [15], and multicast max-min fairness [16,17]. In [17], TCP-friendly multicast congestion control was shown to approximate max-min fairness. Bounded fairness is a generalization of TCP-friendliness that can be achieved by weighted max-min fairness. In this paper, we have thus focused on providing weighted max-min fairness. When defining max-min fairness in the multicast environment, an issue to be considered is whether participating multicast sessions are single-rate (where all the receivers in the same multicast group perceive the same data rate) [16] or multi-rate (where each receiver perceives a different data rate according to the available bandwidth on the sender to receiver path) [17]. Rubenstein et al. [17] define a multi-rate max-min fairness for multicast sessions and show that multi-rate multicast sessions achieve "better max-min fairness" than single-rate multicast sessions in terms of the four fairness criteria presented in [17]. The service model of mCorelite is also based on multi-rate max-min fairness. However, there is a major difference between the service model of mCorelite and the one given in [17]. In mCorelite, when a multicast flow is unable to use its weighted fair share fully, the unused portion of the weighted fair share is allocated to other flows. But in [17], receivers that cannot fully sustain the next layer subscribe to the next layer for a fractional amount of time. In addition, the fairness model of [17] raises further issues, such as when fractional joins should be performed and what the granularity of the join time should be. The tradeoffs between these two service models thus need to be explored in more detail.
7 Summary
mCorelite achieves service differentiation by providing multi-rate weighted max-min fairness, using an approach that is inspired by CSFQ but considerably enhanced to support layered multicast flows, lossless rate adaptation, and low-overhead weighted fair share computation. Our preliminary performance results using the ns-2 simulator demonstrate that mCorelite effectively achieves its design goal. While our approach appears promising, we still have a long way to go. Specifically: (a) we need to compare our work more thoroughly with related work and understand the limitations of the weighted fair share computation in highly dynamic network environments; (b) we are updating the service model to include minimum rate guarantees in addition to service differentiation for multicast flows; (c) we need to remove the simplifying assumption that the
multicast layers are in strict priority order; (d) we have yet to analyze the interaction between the rate adaptation mechanism of mCorelite and end-to-end congestion control schemes such as TCP's; and (e) while our current simulations have tested up to hundreds of receivers, ongoing work is testing the scalability of mCorelite with larger numbers of receivers and flows.
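For readers unfamiliar with CSFQ, the following minimal sketch (our illustration of the mechanism of [6], not of mCorelite's enhancements; variable names are ours) shows the core-router forwarding decision that makes the approach stateless: edge routers label each packet with an estimate of its flow's arrival rate, and a core router drops the packet with probability max(0, 1 - alpha/rate), where alpha is the link's estimated fair share:

import random

def csfq_forward(rate_label, fair_share):
    # rate_label: edge-computed estimate of the flow's arrival rate,
    # carried in the packet header; assumed positive.
    # fair_share: the link's fair share (alpha), which a CSFQ core
    # router estimates iteratively from aggregate arrival and
    # acceptance rates; taken as a given input here.
    drop_prob = max(0.0, 1.0 - fair_share / rate_label)
    return random.random() >= drop_prob   # True = forward, False = drop

# A flow sending at three times the fair share keeps roughly one packet
# in three, i.e. it is throttled to approximately the fair share.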
References

1. J. Wroclawski. Specification of the Controlled-Load Network Element Service. RFC 2211, September 1997.
2. S. Shenker and C. Partridge. Specification of Guaranteed Quality of Service. RFC 2212, September 1997.
3. K. Nichols and S. Blake. Differentiated Services Operational Model and Definitions. IETF Internet-Draft, February 1998.
4. K. Nichols, V. Jacobson, and L. Zhang. A Two-bit Differentiated Services Architecture for the Internet. IETF Internet-Draft, November 1997.
5. D. Bertsekas and R. Gallager. Data Networks (Second Edition). Prentice Hall, pp. 506-507, 1992.
6. I. Stoica, S. Shenker, and H. Zhang. Core-Stateless Fair Queueing: Achieving Approximately Fair Bandwidth Allocations in High Speed Networks. Proceedings of ACM SIGCOMM '98, September 1998.
7. N. Venkitaraman, R. Sivakumar, and V. Bharghavan. Achieving Per-Flow Rate Fairness with a Stateless Core. TIMELY Group Technical Report, June 1999.
8. T. Nandagopal, N. Venkitaraman, R. Sivakumar, and V. Bharghavan. Relative Delay Differentiation and Delay Class Adaptation in Core-Stateless Networks. TIMELY Group Technical Report, July 1999.
9. S. McCanne, V. Jacobson, and M. Vetterli. Receiver-driven Layered Multicast. Proceedings of ACM SIGCOMM '96, August 1996.
10. L. Vicisano, J. Crowcroft, and L. Rizzo. TCP-like Congestion Control for Layered Multicast Data Transfer. Proceedings of IEEE INFOCOM '98, March 1998.
11. L. Wu, R. Sharma, and B. Smith. ThinStreams: An Architecture for Multicasting Layered Video. Proceedings of NOSSDAV '97, May 1997.
12. X. Li, S. Paul, and M. H. Ammar. Multi-Session Rate Control for Layered Video Multicast. Proceedings of Multimedia Computing and Networking 1999, January 1999.
13. S. Floyd and V. Jacobson. Random Early Detection Gateways for Congestion Avoidance. IEEE/ACM Transactions on Networking, vol. 1, no. 4, August 1993.
14. D. Lin and R. Morris. Dynamics of Random Early Detection. Proceedings of ACM SIGCOMM '97, September 1997.
15. H. A. Wang and M. Schwartz. Achieving Bounded Fairness for Multicast and TCP Traffics in the Internet. Proceedings of ACM SIGCOMM '98, September 1998.
16. H. Tzeng and K. Siu. On Max-Min Fair Congestion Control for Multicast ABR Service in ATM. IEEE JSAC, vol. 15, no. 3, April 1997.
17. D. Rubenstein, J. Kurose, and D. Towsley. The Impact of Multicast Layering on Network Fairness. Proceedings of ACM SIGCOMM '99, September 1999.
18. A. Demers, S. Keshav, and S. Shenker. Analysis and Simulation of a Fair Queueing Algorithm. Proceedings of ACM SIGCOMM '89, September 1989.
Author Index
Acharya, Arup, 204
Ammar, Mostafa H., 152
Barcellos, Marinho, 170
Beam, Tyler K., 72
Bharghavan, Vaduvur, 321
Birman, Kenneth P., 188
Blazević, Ljubica, 108
Briscoe, Bob, 244, 301
Brown, Ian, 286
Crowcroft, Jon, 286
Ezhilchelvan, Paul D., 170
Fei, Zongming, 152
Griffoul, Frédéric, 204
Iannaccone, Gianluca, 144
Ikeda, Hiromasa, 19
Kamel, Ibrahim, 152
Katz, Randy, 1, 126
Kermode, Roger, 90
Kim, Tae-eun, 321
Léty, Emmanuel, 54
Le Boudec, Jean-Yves, 108
Lee, Kang-Won, 321
Liebeherr, Jörg, 72
Liu, Congyue, 170
Livingston, Marilynn, 216
Lo, Virginia, 216
Malville, Eric, 36
McCanne, Steven, 1, 126
Mukherjee, Sarit, 152
Ozkasap, Oznur, 188
Perkins, Colin, 286
Perlman, Radia, 270
Raman, Suchitra, 270
Sawa, Yoshitsugu, 19
Schuett, Angela, 126
Sivakumar, Raghupathy, 321
Thaler, David, 90
Turletti, Thierry, 54
van Renesse, Robbert, 188
Windisch, Kurt, 216
Wong, Tina, 1
Xiao, Zhen, 188
Yamamoto, Miki, 19
Yurcik, William, 235
Zappala, Daniel, 216