September/October 2008, Vol. 22, No. 5
THE MAGAZINE OF GLOBAL INTERNETWORKING www.comsoc.org
Implications and Control of Middleboxes in the Internet
A Publication of the IEEE Communications Society in cooperation with the IEEE Computer Society and the Internet Society
THE MAGAZINE OF GLOBAL INTERNETWORKING SEPTEMBER/OCTOBER 2008, VOL. 22, NO. 5
Special Issue
Implications and Control of Middleboxes in the Internet
Guest Editors: Xiaoming Fu, Martin Stiemerling, and Henning Schulzrinne

8 A Retrospective View of Network Address Translation
Today, network address translators, or NATs, are everywhere. Their ubiquitous adoption was not promoted by design or planning but by the continued growth of the Internet.
Lixia Zhang

14 Behavior and Classification of NAT Devices and Implications for NAT Traversal
For a long time, traditional client-server communication was the predominant communication paradigm of the Internet. Network address translation devices emerged to help with the limited availability of IP addresses and were designed with the hypothesis of asymmetric connection establishment in mind. But with the growing success of peer-to-peer applications, this assumption is no longer true.
Andreas Müller, Georg Carle, and Andreas Klenk

20 Modeling Middleboxes
The authors present a simple middlebox model that succinctly describes how different middleboxes process packets and illustrate it by representing four common middleboxes.
Dilip Joseph and Ion Stoica

26 Network Address Translation for the Stream Control Transmission Protocol
The authors discuss the deficiencies of using existing NAT methods for SCTP and describe a new SCTP-specific NAT concept. This concept is analyzed in detail for several important network scenarios, including peer-to-peer, transport layer mobility, and multihoming.
Michael Tüxen, Irene Rüngeler, Randall Stewart, and Erwin P. Rathgeb

33 Distributed Connectivity Service for a SIP Infrastructure
The authors present a distributed connectivity service solution that integrates relay functionality directly in user nodes.
Luigi Ciminiera, Guido Marchetto, Fulvio Risso, and Livio Torrero

41 Dial “M” for Middlebox Managed Mobility
Users can be served by multiple network-enabled terminal devices, each of which in turn can have multiple network interfaces. This multihoming at both the user and device level presents new opportunities for mobility handling.
Stephen Herborn and Aruna Seneviratne

48 NAT Issues in the Remote Management of Home Network Devices
The authors focus on NAT issues in the management of home network devices. Specifically, they discuss efforts relating to standardization.
Choongul Park, Kitae Jeong, Sungil Kim, and Youngseok Lee

56 Improving the Performance of Route Control Middleboxes in a Competitive Environment
The authors show that by blending randomization with adaptive filtering techniques, it is possible to drastically reduce the interference between competing route controllers, and this can be achieved without penalizing the end-to-end traffic performance.
Marcelo Yannuzzi, Xavi Masip-Bruin, Eva Marin-Tordera, Jordi Domingo-Pascual, Alexandre Fonte, and Edmundo Monteiro

Editor’s Note 2
New Books & Multimedia 4
Guest Editorial 6
IEEE NETWORK ISSN 0890-8044 is published bimonthly by the Institute of Electrical and Electronics Engineers, Inc. Headquarters address: IEEE, 3 Park Avenue, 17th Floor, New York, NY 100165997, USA; tel: +1-212-705-8900; e-mail:
[email protected]. Responsibility for the contents rests upon authors of signed articles and not the IEEE or its members. Unless otherwise specified, the IEEE neither endorses nor sanctions any positions or actions espoused in IEEE Network. ANNUAL SUBSCRIPTION: $40 in addition to IEEE Communications Society or any other IEEE Society member dues. Non-member prices: $250. Single copy price $50. EDITORIAL CORRESPONDENCE: Address to: Chatschik Bisdikian, Editor-in-Chief, IEEE Network, IEEE Communications Society, 3 Park Avenue, 17th Floor, New York, NY 10016-5997, USA; email:
[email protected] COPYRIGHT AND REPRINT PERMISSIONS: Abstracting is permitted with credit to the source. Libraries are permitted to photocopy beyond the limits of U.S. Copyright law for private use of patrons: those articles that carry a code on the bottom of the first page provided the per copy fee indicated in the code is paid through the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923, USA. For other copying, reprint, or republication permission, write to Director, Publishing Services, at IEEE Headquarters. All rights reserved. Copyright ©2008 by the Institute of Electrical and Electronics Engineers, Inc. POSTMASTER: Send address changes to IEEE Network, IEEE, 445 Hoes Lane, Piscataway, NJ 08855-1331, USA. Printed in USA. Periodical-class postage paid at New York, NY and at additional mailing offices. Bulk rate postage paid at Easton, PA permit #7. Canadian GST Reg# 40030962. Return undeliverable Canadian addresses to: Frontier, P.O. Box 1051, 1031 Helena Street, Fort Eire, ON L2A 6C7. SUBSCRIPTIONS, orders, address changes should be sent to IEEE Service Center, 445 Hoes Lane, Piscataway, NJ 08855-1331, USA. Tel. +1-732-981-0060. ADVERTISING: Advertising is accepted at the discretion of the publisher. Address correspondence to IEEE Network, 3 Park Avenue, 17th Floor, New York, NY 10016-5997, USA.
EDITOR’S NOTE
Director of Magazines Thomas F. La Porta, Penn. State Univ., USA
Editor-in-Chief Ioanis Nikolaidis, U. of Alberta, Canada
Associate Editor-in-Chief
Chatschik Bisdikian, IBM Research, USA
Senior Technical Editors Thomas M. Chen, Swansea U., UK Yi-Bing (Jason) Lin, National Chiao Tung Univ., Taiwan Peter O’Reilly, Northeastern Univ., USA
Technical Editors Kevin Almeroth, UCSB, USA N. Asokan, Nokia Res. Ctr., Finland Olivier Bonaventure, U. Catholique de Louvain, Belgium Adrian Conway, Verizon, USA Jon Crowcroft, U. of Cambridge, UK Christos Douligeris, U. of Piraeus, Greece Paolo Giacomazzi, Politecnico di Milano, Italy David Greaves, U. of Cambridge, UK Nikhil Jain, Qualcomm, USA Admela Jukan, T. U. Braunschweig, Germany Tim King, BTexact Tech., UK Frank Magee, Consultant, USA Ioanis Nikolaidis, U. of Alberta, Canada Georgios I. Papadimitriou, Aristotle Univ., Greece Mohammad Peyravian, IBM Corporation, USA Kazem Sohraby, U. of Arkansas, USA James Sterbenz, Univ. of Kansas, USA Joe Touch, USC/ISI, USA Vittorio Trecordi, CEFRIEL, Italy Guoliang Xue, Arizona State Univ., USA Raj Yavatkar, Intel, USA Bulent Yener, Rensselaer Polytechnic Institute, USA
Feature Editors Olivier Bonaventure, "Software Tools for Networking" U. Catholique de Louvain, Belgium Olivier Bonaventure, "New Books & Multimedia" U. Catholique de Louvain, Belgium
IEEE Production Staff Joseph Milizzo, Assistant Publisher Eric Levine, Associate Publisher Susan Lange, Digital Production Manager Catherine Kemelmacher, Associate Editor Jennifer Porcello, Publications Coordinator Devika Mittra, Publications Assistant
2008 IEEE Communications Society Officers Doug Zuckerman, President Andrzej Jajszczyk, VP–Technical Activities Mark Karol, VP–Conferences Byeong Gi Lee, VP–Member Relations Sergio Benedetto, VP–Publications Nim Cheung, Past President Stan Moyer, Treasurer John M. Howell, Secretary
Board of Governors The officers above plus Members-at-Large: Class of 2008 Thomas M. Chen, Andrea Goldsmith Khaled Ben-Letaief, Peter J. McLane Class of 2009 Thomas LaPorta, Theodore Rappaport Catherine Rosenberg, Gordon Stuber Class of 2010 Fred Bauer, Victor Frost Stefano Galli, Lajos Hanzo
2008 IEEE Officers Lewis M. Terman, President John R. Vig, President-Elect Barry L. Shoop, Secretary David G. Green, Treasurer Leah H. Jamieson, Past President Jeffry W. Raynes, Executive Director Curtis A. Siller, Jr., Director, Division III
NATs and Frozen Veggies
Ioanis Nikolaidis

Dear readers, welcome to the September 2008 issue of IEEE Network. The sound of trucks, the heavy-duty disposal bins on the curb, and the thud and bang of construction are all elements of a “quiet” summer, full of renovations, in my neighborhood. Resisting this Siren’s call is difficult even though I swore off any renovations for the rest of my life, given past experience. I naively thought that this time it wouldn’t be that bad. After all, this time it looked like a much smaller job than last time. Of course, I neglected a key conservation law: if a job is small, the additional delays for various reasons will expand it to be roughly equal to the total time of a “big” job (more professionally managed, one might argue, and hence with much less slack).

What I was not prepared to experience is the shift in attitudes caused by the widespread adoption of many “information” appliances in today’s household. I should have spotted the shift when my contractor warned that he would need to turn off the power to our house, only to qualify it with “If that’s okay with your gear, right?” noticing that there were maybe a tad too many devices, computers, firewalls, servers, and bridges spread around the house. He was concerned that some might develop bad hiccups after the switch was turned off and on again. He had himself experienced some “unhealthy” side effects to his equipment under similar circumstances, so his concern was genuine. I thought for a moment of explaining the benefits of statelessness and how, I would hope, most of my gear could survive power being cut and restored later (no, I don’t have a UPS — I believe in luck). I decided not to expand on the topic, just agreeing that it was okay to cut the power to the house. Things indeed went as planned, although it should have struck me as odd that he did not ask about other things that might be influenced by cutting the power.

A few days later, while I was at work, the contractor stumbled on a dilemma. He had to run an industrial-strength vacuum cleaner to pick up lots of debris. Having pulled down walls and removed several wall outlets left him with no choice but to run an extension cord to the nearest outlet he could find still standing. It happened that this was an outlet already fully populated by two cords, one connecting a refrigerator we keep in the basement, and one connecting a NAT/firewall box, a nearby server, and a cable modem. Without any hesitation, he removed the one least likely to create a hassle: the refrigerator! In comparison to a NAT box, a refrigerator is low tech and almost stateless — if not for its volatile contents. His choice was reasonable. He was not expecting to keep it unplugged for more than an hour. But human nature conspired. The contractor forgot to plug in the refrigerator when he was done. The packets were running smoothly while our frozen veggies were thawing. To
make matters worse, and blame my own human nature here, I did not notice the “failure” until late in the evening (okay, so I do keep some beer there too). Had it been the firewall malfunctioning, I would have spotted it in minutes. I spent a good part of the evening deciding what had to be thrown away and what to keep (luckily this was not a warm day) and laughing at our priorities: mine and the contractor’s. The fact that everyday people think of consumer-grade networking and information appliances as possibly the most sensitive objects in a house reflects what they have learned from their own experience in the recent past. After all, a lost file can be a major blow, while a pound of rotten spinach is, well, compost. A handful of remarkable technologies made it into these everyday devices, and one that is still a topic of research, extension, and overall controversy is Network Address Translation (NAT). NAT is no longer just a way to establish a home user’s little kingdom of an Internet-connected private network (while remaining guilt-free of hoarding IP addresses). NAT boxes are increasingly active participants as the “middle point” of communication paths, and this has led to the use of a new term, “middlebox,” to describe this particular class of technologies. This special issue, entitled “Implications and Control of Middleboxes in the Internet,” provides a timely
review of where we are in middlebox evolution and how middleboxes might further evolve. I would like to thank the guest editors, Xiaoming Fu, Martin Stiemerling, and Henning Schulzrinne, as well as the liaison editor of this issue, Jon Crowcroft, for their excellent work in putting this issue together.

I would also like to welcome a new member to our editorial board: Dr. Admela Jukan. Dr. Jukan received her Ph.D. degree from Vienna University of Technology in Austria, and is currently a W3 Professor of Electrical and Computer Engineering at the Technical University Carolo-Wilhelmina of Brunswick (Braunschweig), Germany. Dr. Jukan served between 2002 and 2004 as Program Director in Computer and Networks System Research at the National Science Foundation (NSF), responsible for funding and coordinating US-wide university research and education activities in the area of network technologies and systems.

As always, your feedback regarding the direction and substance of the magazine is invaluable and always appreciated. Please contact me, by e-mail, at
[email protected], to let me know what you think about the editorial comments, what type of content might be more interesting to you, and in what ways the magazine’s distinct character could be improved or further publicized.
NEW BOOKS AND MULTIMEDIA / EDITED BY OLIVIER BONAVENTURE

The New Books and Multimedia column contains brief reviews of new books in the computer communications field. Each review includes a highly abstracted description of the contents, relying on the publisher’s descriptive materials, minus advertising superlatives, and checked for accuracy against a copy of the book. The reviews also comment on the structure and the target audience of each book. Publishers wishing to have their books listed in this manner should contact Olivier Bonaventure by email.

Olivier Bonaventure
Université Catholique de Louvain, Belgium
[email protected]
LAN Switch Security: What Hackers Know About Your Switches
Eric Vyncke and Christopher Pagen, Cisco Press, 2008, ISBN-10: 1-58705-256-3, Softbound, 360 pages

Ethernet is now the default fixed local area network technology. Ethernet LANs are found in all enterprise environments, and in more and more home networks. Ethernet was designed in the 1970s when security was not a concern. Since then, Ethernet has evolved with the introduction of hubs and switches. Many network administrators are aware that hubs are a security concern since they broadcast Ethernet frames, and some of them assume that switches are more secure. Unfortunately, hackers have learned the limitations of Ethernet switches and have developed several tools that can be used to exploit them. This book describes the current state of the art in securing Ethernet switches. The authors take a practical approach by using different types of Cisco switches and freely available tools to demonstrate the security problems and their solutions. Despite its focus on a single vendor, this book is an interesting reference for system administrators who are willing to better understand how to secure their Ethernet networks. This is particularly important in environments such as schools where uncontrolled laptops are often connected. The first part discusses the basic security problems that affect Ethernet switches: the learning bridge process and the implications of the limited size of the MAC table on Ethernet switches. It also discusses configurations to mitigate these problems. Then the book analyzes several protocols and their security implications: the spanning tree protocol, 802.1Q VLANs, DHCP, IPv4 ARP, and IPv6 Neighbor Discovery, but also surprising electrical security issues with
power over Ethernet. The second part focuses on techniques that can be used on switches to withstand denial-of-service attacks, from both the forwarding and control plane viewpoints. The last part analyzes recent techniques that can be used to improve the security of Ethernet switches, such as 802.1X or 802.1AE and access control lists.
Principles of Protocol Design
Robin Sharp, Springer Verlag, 2008, ISBN: 978-3-540-77540-9, Hardbound, 402 pages

This book takes an unusual path to describe computer network protocols. While most standard networking texts mainly focus on a textual description of the different protocols and mechanisms, Robin Sharp starts from formal description techniques. More precisely, he chooses the Communicating Sequential Processes (CSP) notation proposed by Hoare. CSP is a process algebra that allows one to model the interactions among communicating processes. The book starts with a detailed description of CSP and then uses the CSP formalism to describe several mechanisms such as flow and error control, fault-tolerant broadcast, and two-phase commits. An advantage of using CSP is that the book contains proofs of several of the described mechanisms. However, as CSP does not contain complex data types, it is difficult to completely model complex protocols in detail. Surprisingly, the author did not consider more powerful formal description techniques that evolved from CSP, such as LOTOS. The second part of the book is more heterogeneous. Several security protocols are discussed, and the BAN logic is introduced. Then the author briefly discusses real protocols. The discussion considers both open system interconnection (OSI) protocols and Internet protocols. This part
is less interesting than the first part, whose CSP models could be of interest to readers drawn to the application of formal description techniques to network protocols.
Patterns in Network Architecture: A Return to Fundamentals
John Day, Prentice Hall, 2008, ISBN-10: 0132252422, Hardbound, 464 pages

The architecture of today’s Internet was mainly designed together with the TCP and IP protocols in the 1970s and early 1980s. In recent years, researchers and funding organizations in America, Europe, and Asia have started to work on different alternative architectures for the Internet. Some consider an evolutionary approach where the Internet architecture would be incrementally modified in a backward-compatible manner, while others believe a completely new architecture should be developed to take into account the requirements of today’s and tomorrow’s Internet. John Day’s book is a must-read for researchers interested in the evolution of the Internet architecture. The book is composed of two main parts. The first part is mainly a history of the evolution of computer network architectures in the 1970s and 1980s. John Day participated actively in this research on both the Internet side and the OSI side. He explains the reasons for some of the design choices and discusses alternatives that were considered but not selected. The discussion considers several of the key elements of a computer network architecture, including the protocol elements, layering, naming, and addressing. The second part describes John Day’s vision of an alternative network architecture. For this, he starts by reconsidering network-based InterProcess Communication (IPC) and shows that a distributed IPC should be at the core of a computer network architecture. This discussion is interesting, but the author does not explain in detail how it could be realized in practice. The second part ends with two chapters on topological addressing influenced by Mike O’Dell’s GSE proposal, and a discussion of the impact of multicast and multihoming on the architecture.
GUEST EDITORIAL
Implications and Control of Middleboxes in the Internet
Xiaoming Fu
Martin Stiemerling
Henning Schulzrinne
Middleboxes in the Internet have been explored, sometimes quite controversially, in operations, standardization, and the research community for more than 10 years. The main concern in the past has been that middleboxes contradict the Internet’s end-to-end principle, which is often understood to posit that “intelligence” is placed in end systems while network elements just forward packets. Middleboxes introduce functions beyond forwarding in the data path between a source and destination, as described, for example, in RFC 3234, which covers a wide range of middleboxes, from TCP performance-enhancing proxies to transcoders. On the other hand, middleboxes were introduced in the Internet for various reasons: NATs intend to decouple internal IP addressing from the public address space while allowing multiple hosts to share a single public IP address, for the purpose of preserving the IP address space; firewalls are used by administrators to enforce policies on data traffic at administrative borders, with the intention of preventing their networks from being attacked or monitored; and application level gateways (ALGs) are typically used to assist applications in their operations.

The implications of the emergence and popularity of middleboxes are complicated. With middleboxes it is difficult to provide even basic end-to-end connectivity for many applications. For example, Internet hosts behind NATs can only initiate a TCP connection with another host, but cannot accept a connection request. Unlike in the past, when the vast majority of applications followed the client-server design pattern and most hosts behind NATs were clients anyway (e.g., your browser accessing a Web server), a variety of new applications today, such as voice over IP, gaming, and peer-to-peer file sharing, run into an enormous list of issues. Hosts behind NATs are no longer reachable from any other host, which becomes particularly troublesome for VoIP and other peer-to-peer applications. Likewise, firewalls are usually statically configured to block certain TCP ports or do not understand non-TCP protocols, making it difficult to deploy new applications and protocols. This results in a number of issues to be considered in the design and development of new protocols and applications. To mitigate the negative impacts of these issues, quite a number of techniques have been developed, which can be
categorized as explicit control and implicit control of firewalls and NATs. For explicit control, an entity, either the end host or a proxy in the network, has a relationship with the middlebox and controls its behavior (e.g., the set of policies or filter rules loaded). Examples of explicit control are Universal Plug and Play (UPnP), Internet Engineering Task Force (IETF) Middlebox Communications (MIDCOM), and IETF Next Steps in Signaling (NSIS). On the other hand, implicit control is the traditional way of traversing middleboxes. Implicit control does not involve any control relationship with the middlebox; instead, end hosts, probably with the support of other end hosts, use hole-punching techniques to achieve a working middlebox traversal. Examples of implicit control are the IETF’s Session Traversal Utilities for NAT (STUN), Traversal Using Relays around NAT (TURN), and Interactive Connectivity Establishment (ICE). In addition, there have been some recent attempts to design or use certain types of middleboxes, such as various application proxies.

In this special issue we are pleased to introduce a series of state-of-the-art articles on this specific area. These articles cover the subject from a variety of perspectives, offering readers an understanding of the issues and implications of various middleboxes in the Internet, including their control mechanisms. A total of eight articles, selected from 26 submissions based on a strict peer review process, cover a broad range in the field of implications and control of middleboxes in the Internet. While some articles present more general issues with middleboxes, seeking to understand their behaviors and implications, others focus on new approaches to controlling and using middleboxes.

NATs, an unplanned reality, have posed complications for the Internet architecture and applications. The first article, “A Retrospective View of NAT” by Lixia Zhang, takes readers back to the early days of middleboxes. It gives a historic review of NATs and the lessons learned, including how they impeded the standardization and deployment of IPv6, the expected solution to the Internet address depletion problem. Because NAT was not standardized in a timely manner, a number of different NAT implementations exist today, and it is vital to understand their behaviors given their nearly ubiquitous presence. The second article, “Behavior and Classification of NAT Devices and Implications for NAT Traversal” by Andreas Müller, Andreas Klenk, and Georg Carle, provides a comprehensive overview of NAT behaviors and currently available
NAT traversal techniques. The article presents a new categorization approach based on an analytical abstraction of NAT traversal, which classifies NAT traversal services into four distinct types and deduces the corresponding NAT behaviors. This may help developers of new protocols and applications to determine applicable techniques for NAT traversal.

While the first two articles describe the history, behavior, and classification of NAT, the next article by Dilip Joseph and Ion Stoica, “Modeling Middleboxes,” proposes a formal and generic model for deducing middlebox functionalities and behaviors. Using this model, the article illustrates how different middleboxes process packets, and how four common middleboxes — firewall, NAT, and layer 4 and layer 7 load balancers — may be depicted. As such, the article provides an initial step for relevant designers, users, and researchers to understand and refine the behaviors and implications of various middleboxes.

Existing middleboxes mostly consider TCP and UDP in their implementations, and typically do not support other protocols, such as the Stream Control Transmission Protocol (SCTP). In the fourth article, Michael Tüxen et al. describe the extensions required to support NAT for SCTP. The analysis presented in this article may be useful as a general lesson in the near future, as several other protocols after SCTP, including DCCP, XCP, and HIP, use similar techniques such as multihoming, rehoming, and handshake cookies.

Applications using the Session Initiation Protocol (SIP) or a peer-to-peer mode of operation (P2PSIP, or simply normal P2P applications) are among those that suffer most from the middlebox traversal issue. The fifth article, “Distributed Connectivity Service for a SIP Infrastructure” by Luigi Ciminiera et al., examines this issue and presents an alternative approach to the current STUN/TURN/ICE approach to middlebox traversal. The approach distributes the rendezvous and relay functions among SIP user agents, which discover their peers autonomously and maintain a P2P overlay to ensure connectivity across NATs and firewalls in a SIP infrastructure without relying on a centralized server.

The remaining three articles address new applications of middleboxes. The sixth article, “Dial M for Middlebox Managed Mobility” by Stephen Herborn and Aruna Seneviratne, describes a new usage type of middleboxes for mobility support via the concept of virtual private “personal networks.” Such a network is created and maintained by way of HIP combined with IPsec plus middlebox state, which may be interesting (at least to some extent) for the recent research efforts on network virtualization, as they use today’s technologies directly.

An increasing number of home users today are using NATs to connect their home IP devices with the Internet. Choongul Park et al. discuss this issue in their article “NAT Issues in the Remote Management of Home Network Devices.” By extending SNMP and using additional management objects (MOs) to gather NAT binding information, the authors attempt to address the NAT traversal problem under a symmetric NAT, based on their observations in Korea. While the success rate of NAT traversal could be a potential issue outside Korea, the article provides insight into what home networking standards may have to deal with.
Yet another type of middlebox function, intelligent route control (IRC) for multihomed sites and subscribers, has recently been identified as a key issue in efficient network operations. The final article, “Improving the Performance of Route Control Middleboxes in a Competitive Environment” by Marcelo Yannuzzi et al., addresses this issue and introduces an IRC approach for competitive environments, by blending randomization with adaptive filtering techniques.

We hope that these articles will help to clarify and explain the state-of-the-art advances on middlebox issues in the Internet, providing current visions of how the behaviors, implications, and control of middleboxes may be analyzed, encompassed, and utilized. In preparing this special issue, we wish to thank all the peer reviewers for their efforts in carefully reviewing the manuscripts to meet the tight deadlines. We are grateful to our liaison editor Jon Crowcroft for his constructive feedback, and to Editor-in-Chief Ioanis Nikolaidis for his timely and critical suggestions.
Biographies

XIAOMING FU [M’02] (
[email protected]) received his Ph.D. degree in computer science from Tsinghua University, Beijing, China, in 2000. After almost two years of postdoctoral work at Technical University Berlin, he joined the University of Göttingen as an assistant professor, leading a team working on networking research. Since April 2007 he has been a professor and head of the Computer Networks Group at the University of Göttingen. During 2003–2005 he also served as an expert on the ETSI Specialist Task Forces on Internet Protocol Testing; he was also a visiting scientist at the University of Cambridge and Columbia University. In the research fields of architectures, protocols, and applications for QoS, firewalls, p2p overlays, and mobile networking, as well as related security issues, he has (co-)authored more than 50 refereed papers as well as several RFCs/I-Ds. He has served as TPC member and session chair for several conferences, including IEEE INFOCOM, ICNP, ICDCS, GLOBECOM, and ICC. He was also founding chair of the ACM Workshop on Mobility in the Evolving Internet Architecture (MobiArch) and is TPC Co-Chair of the IEEE GLOBECOM 2009 Next Generation Networking and Internet Symposium. He is currently a member of the editorial board of Computer Communications Journal (Elsevier). MARTIN STIEMERLING [M’00] (
[email protected]) received his M.Sc. degree (Diploma) in electrical engineering with a focus on IP networking technologies from the Polytechnic University of Applied Sciences in Cologne in 2000. After that he joined NEC Laboratories Europe, Heidelberg, Germany, where he is currently a senior researcher. His areas of research interest are Internet architecture, Internet signaling protocols, network management, and overlay/peer-to-peer systems. He has published several papers in these areas, and served as a TPC member of IEEE IPOM 2007. In the IETF he is active as a working group document editor in the MIDCOM, MMUSIC, and NSIS working groups, as well as in other IETF working groups and IRTF research groups. He is co-chair of the IETF Next Steps in Signaling (NSIS) working group, secretary of the IP over DVB (IPDVB) working group, and a co-author of RFC 3816, RFC 3989, and RFC 4540, as well as RTSPng. HENNING SCHULZRINNE [F’06] (
[email protected]) received his Ph.D. from the University of Massachusetts in Amherst, Massachusetts. He was a member of technical staff at AT&T Bell Laboratories, Murray Hill, New Jersey, and an associate department head at GMD-Fokus (Berlin) before joining the Computer Science and Electrical Engineering Departments at Columbia University, New York. He is currently a professor and chair of the Department of Computer Science. He has been a member of the Board of Governors of the IEEE Communications Society and is vice chair of ACM SIGCOMM, former chair of the IEEE Communications Society Technical Committees on Computer Communications and the Internet, has been technical program chair of Global Internet, INFOCOM, NOSSDAV, and IPTCOMM, and was General Chair of ACM Multimedia 2004. He has also been a member of the Internet Architecture Board. Protocols codeveloped by him, such as RTP, RTSP, and SIP, are now Internet standards, used by almost all Internet telephony and multimedia applications. His research interests include Internet multimedia systems, ubiquitous computing, mobile systems, quality of service, and performance evaluation.
A Retrospective View of Network Address Translation

Lixia Zhang, University of California, Los Angeles
Abstract

Today, network address translators, or NATs, are everywhere. Their ubiquitous adoption was not promoted by design or planning but by the continued growth of the Internet, which places an ever-increasing demand not only on IP address space but also on other functional requirements that network address translation is perceived to facilitate. This article presents a personal perspective on the history of NATs, their pros and cons in a retrospective light, and the lessons we can learn from the NAT experience.
A network address translator (NAT) commonly refers to a box that interconnects a local network to the public Internet, where the local network runs on a block of private IPv4 addresses as specified in RFC 1918 [1]. In the original design of the Internet architecture, each IP address was defined to be globally unique and globally reachable. In contrast, a private IPv4 address is meaningful only within the scope of the local network behind a NAT and, as such, the same private address block can be reused in multiple local networks, as long as those networks do not directly talk to each other. Instead, they communicate with each other and with the rest of the Internet through NAT boxes.

Like most unexpected successes, the ubiquitous adoption of NATs was not foreseen when the idea first emerged more than 15 years ago [2, 3]. Had anyone foreseen where NAT would be today, it is possible that NAT deployment might have followed a different path, one that was better planned and standardized. The set of Internet protocols that were developed over the past 15 years also might have evolved differently by taking into account the existence of NATs, and we might have seen less overall complexity in the Internet compared to what we have today. Although the clock cannot be turned back, I believe it is a worthwhile exercise to revisit the history of network address translation to learn some useful lessons. It also can be worthwhile to assess, or reassess, the pros and cons of NATs, as well as to take a look at where we are today in our understanding of NATs and how best to proceed in the future.

It is worth pointing out that in recent years many efforts were devoted to the development and deployment of NAT traversal solutions, such as simple traversal of UDP through NAT (STUN) [4], traversal using relay NAT (TURN) [5], and Teredo [6], to name a few. These solutions remove obstacles introduced by NATs to enable an increasing number of new application deployments. However, as the title suggests, this article focuses on examining the lessons that we can learn from the NAT deployment experience; a comprehensive survey of NAT traversal solutions must be reserved for a separate article.
I also emphasize that this writing represents a personal view, and my recall of history is likely to be incomplete and to contain errors. My personal view on this subject has also changed over time, and it may continue to evolve, as we are all in a continuing process of understanding the fascinating and dynamically changing Internet.
How a NAT Works

As mentioned previously, IP addresses originally were designed to be globally unique and globally reachable. This property of the IP address is a fundamental building block in supporting the end-to-end architecture of the Internet. Until recently, almost all of the Internet protocol designs, especially those below the application layer, were based on the aforementioned IP address model. However, the explosive growth of the Internet during the 1990s not only signaled the danger of IP address space exhaustion, but also created an instant demand for IP addresses: suddenly, connecting large numbers of user networks and home computers demanded IP addresses instantly and in large quantities. Such demand could not possibly be met by going through the regular IP address allocation process. Network address translation came into play to meet this instant high demand, and NAT products were quickly developed to meet the market demand. However, because NATs were not standardized before their wide deployment, a number of different NAT products exist today, each with somewhat different functionality and different technical details. Because this article is about the history of NAT deployment — and not an examination of how to traverse various different NAT boxes — I briefly describe a popular NAT implementation as an illustrative example. Interested readers can visit Wikipedia to find out more about existing types of NAT products.

A NAT box N has a public IP address for its interface connecting to the global Internet and a private address facing the internal network. N serves as the default router for all of the destinations that are outside the local NAT address block. When an internal host H sends an IP packet P to a
public IP destination address D located in the global Internet, the packet is routed to N. N translates the private source IP address in P’s header to N’s public IP address and adds an entry to its internal table that keeps track of the mapping between the internal host and the outgoing packet. This entry represents a piece of state, which enables subsequent packet exchanges between H and D. For example, when D sends a packet P’ in response to P, P’ arrives at N, and N can find the corresponding entry from its mapping table and replace the destination IP address — which is its own public IP address — with the real destination address H, so that P’ will be delivered to H. The mapping entry times out after a certain period of idleness that is typically set to a vendor-specific value. In the process of changing the IP address carried in the IP header of each passing packet, a NAT box also must recalculate the IP header checksum, as well as the checksum of the transport protocol if it is calculated based on the IP address, as is the case for Transmission Control Protocol (TCP) and User Datagram Protocol (UDP) checksums. From this brief description, it is easy to see the major benefit of a NAT: one can connect a large number of hosts to the global Internet by using a single public IP address. A number of other benefits of NATs also became clear over time, which I will discuss in more detail later. At the same time, a number of drawbacks to NATs also can be identified immediately. First and foremost, the NAT changed the end-to-end communication model of the Internet architecture in a fundamental way: instead of allowing any host to talk directly to any other host on the Internet, the hosts behind a NAT must go through the NAT to reach others, and all communications through a NAT box must be initiated by an internal host to set up the mapping entries on the NAT. In addition, because ongoing data exchange depends on the mapping entry kept at the NAT box, the box represents a single point of failure: if the NAT box crashes, it could lose all the existing state, and the data exchange between all of the internal and external hosts must be restarted. This is in contrast to the original goal of IP of delivering packets to their destinations, as long as any physical connectivity exists between the source and destination hosts. Furthermore, because a NAT alters the IP addresses carried in a packet, all protocols that are dependent on IP addresses are affected. In certain cases, such as TCP checksum, which includes IP addresses in the calculation, the NAT box can hide the address change by recalculating the TCP checksum when forwarding a packet. For some of the other protocols that make direct use of IP addresses, such as IPSec [7], the protocols can no longer operate on the end-to-end basis as originally designed; for some application protocols, for example, File Transfer Protocol (FTP) [8], that embed IP addresses in the application data, application-level gateways are required to handle the IP address rewrite. As discussed later, NAT also introduced other drawbacks that surfaced only recently.
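To make the translation mechanism described above concrete, the following sketch models a toy NAT mapping table. It is illustrative only, not taken from the article or from any vendor's product: the per-port disambiguation, the timeout value, and all addresses are assumptions made for the example (as noted above, real NAT products differ in exactly these details).

```python
# Illustrative sketch only -- a toy NAPT-style mapping table, not any vendor's
# implementation. The per-port mapping, timeout, and addresses are assumptions.
import time

PUBLIC_IP = "203.0.113.1"   # the NAT box N's single public address (example value)
IDLE_TIMEOUT = 120          # seconds; vendor-specific in real products

class ToyNat:
    def __init__(self):
        self.next_port = 40000
        self.out = {}        # (private_ip, private_port) -> (public_port, last_used)
        self.back = {}       # public_port -> (private_ip, private_port)

    def outbound(self, src_ip, src_port, dst_ip, dst_port):
        """Host H sends packet P: rewrite its source and record the mapping state."""
        key = (src_ip, src_port)
        if key not in self.out:
            self.out[key] = (self.next_port, time.time())
            self.back[self.next_port] = key
            self.next_port += 1
        pub_port, _ = self.out[key]
        self.out[key] = (pub_port, time.time())        # refresh the idle timer
        # A real box would now recompute the IP header checksum and the
        # TCP/UDP checksum, since both cover the rewritten source address.
        return (PUBLIC_IP, pub_port, dst_ip, dst_port)  # header as seen by D

    def inbound(self, dst_port):
        """Reply P' arrives at N's public address: translate back, or drop."""
        key = self.back.get(dst_port)
        if key is None:
            return None                                 # no state: packet dropped
        pub_port, last_used = self.out[key]
        if time.time() - last_used > IDLE_TIMEOUT:
            del self.out[key], self.back[dst_port]      # idle entry timed out
            return None
        return key                                      # (private_ip, private_port) of H

nat = ToyNat()
print(nat.outbound("192.168.1.10", 5060, "198.51.100.7", 80))  # H -> D
print(nat.inbound(40000))   # D's reply is mapped back to ('192.168.1.10', 5060)
print(nat.inbound(40001))   # unsolicited packet: no mapping, dropped (None)
```

The inbound path also shows why the NAT box is a single point of failure: if the table is lost, replies from D no longer match any entry and are simply dropped.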
A Recall of the History of NATs

I started my Ph.D. studies in the networking area at the Massachusetts Institute of Technology at the same time as RFC 791 [9], the Internet Protocol Specification, was published in September 1981. Thus I was fortunate to witness the fascinating unfolding of this new system called the Internet. During the next ten years, the Internet grew rapidly. RFC 1287 [2], Towards the Future Internet Architecture, was published in 1991 and was probably the first RFC that raised a concern about IP address space exhaustion in the foreseeable future.
RFC 1287 also discussed three possible directions to extend IP address space. The first one pointed to a direction similar to current NATs: Replace the 32-bit field with a field of the same size but with a different meaning. Instead of being globally unique, it would be unique only within some smaller region. Gateways on the boundary would rewrite the address as the packet crossed the boundary. RFC 1335 [3], published shortly after RFC 1287, provided a more elaborate description of the use of internal IP addresses (i.e., private IP addresses) as a solution to IP address exhaustion. The first article describing the NAT idea, “Extending the IP Internet through Address Reuse” [10], appeared in the January 1993 issue of ACM Computer Communication Review and was published a year later as RFC 1631 [11]. Although these RFCs can be considered forerunners in the development of NAT, as explained later, for various reasons the IETF did not take action to standardize NAT. The invention of the Web further accelerated Internet growth in the early 1990s. The explosive growth underlined the urgency to take action toward solving both the routing scalability and the address shortage problems. The IETF took several follow-up steps, which eventually led to the launch of the IPng development effort. I believe that the expectation at the time was to develop a new IP within a few years, followed by a quick deployment. However, the actual deployment during the next ten years took a rather unexpected path.
The Planned Solution

As pointed out in RFC 1287, the continued growth of the Internet exposed strains on the original design of the Internet architecture, the two most urgent of which were routing system scalability and the exhaustion of IP address space. Because long-term solutions require a long lead time to develop and deploy, efforts began to develop both a short-term and a long-term solution to those problems. Classless inter-domain routing, or CIDR, was proposed as a short-term solution. CIDR removed the class boundaries embedded in the IP address structure, thus enabling more efficient address allocation, which helped extend the lifetime of IP address space. CIDR also facilitated routing aggregation, which slowed down the growth of the routing table size. However, as stated in RFC 1481 [12], IAB Recommendation for an Intermediate Strategy to Address the Issue of Scaling: “This strategy (CIDR) presumes that a suitable long-term solution is being addressed within the Internet technical community.” Indeed, a number of new IETF working groups started in late 1992 and aimed at developing a new IP as a long-term solution; the Internet Engineering Steering Group (IESG) set up a new IPng area in 1993 to coordinate the efforts, and the IPng Working Group (later renamed to IPv6) was established in the fall of 1994 to develop a new version of IP [13].

CIDR was rolled out quickly, which effectively slowed the growth of the global Internet routing table. Because it is a quick fix, CIDR did not address emerging issues in routing scalability, in particular the issue of site multihoming. A multihomed site should be reachable through any of its multiple provider networks. In the existing routing architecture, this requirement translates to having the prefix, or prefixes, of the site listed in the global routing table, thereby rendering provider-based prefix aggregation ineffective. Interested readers are referred to [14] for a more detailed description on multihoming and its impact on routing scalability.

The new IP development effort, on the other hand, took much longer than anyone expected when the effort first
began. The IPv6 working group finally completed all of the protocol development effort in 2007, 13 years after its establishment. The IPv6 deployment also is slow in coming. Until recently, there were relatively few IPv6 trial deployments; there is no known commercial user site that uses IPv6 as the primary protocol for its Internet connectivity. If one day someone writes an Internet protocol development history, it would be very interesting to look back and understand the major reasons for the slow development and adoption of IPv6. But even without doing any research, one could say with confidence that NATs played a major role in meeting the IP address requirement that arose out of the Internet growth and at least deferred the demand for a new IP to provide the much needed address space to enable the continued growth of the Internet.
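As a concrete illustration of the prefix aggregation that CIDR enabled (discussed above), the short sketch below uses Python's standard ipaddress module; the prefixes are made-up examples, not real allocations.

```python
# Illustrative only: CIDR aggregation with Python's standard ipaddress module.
# The prefixes are made-up examples, not real provider allocations.
import ipaddress

# Four contiguous former "class C" networks held by one site...
prefixes = [ipaddress.ip_network(f"198.51.{i}.0/24") for i in range(100, 104)]

# ...collapse into a single CIDR aggregate that a provider can announce,
# occupying one routing table entry instead of four.
print(list(ipaddress.collapse_addresses(prefixes)))
# [IPv4Network('198.51.100.0/22')]
```

Before CIDR, class boundaries would have forced four separate routing entries for these networks; with CIDR a provider can announce the single /22 aggregate on their behalf.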
The Unplanned Reality

Although largely unexpected, NATs have played a major role in facilitating the explosive growth of Internet access. Nowadays, it is common to see multiple computers, or even multiple LANs, in a single home. It would be unthinkable for every home to obtain an IP address block, however small it may be, from its network service provider. Instead, a common implementation for home networking is to install a NAT box that connects one home network or multiple home networks to a local provider. Similarly, most enterprise networks deploy NATs as well. It also is well known that countries with large populations, such as India and China, have most of their hosts behind NAT boxes; the same is true for countries that connected to the Internet only recently. Without NATs, the IPv4 address space would have been exhausted a long time ago. For reasons discussed later, the IETF did not standardize NAT implementation or operations. However, despite the lack of standards, NATs were implemented by multiple vendors, and the deployment spread like wildfire. This is because NATs have several attractions, as we describe next.
Why NATs Succeeded

NATs started as a short-term solution while waiting for a new IP to be developed as the long-term solution. The first recognized NAT advantages were stated in RFC 1918 [1]:

With the described scheme many large enterprises will need only a relatively small block of addresses from the globally unique IP address space. The Internet at large benefits through conservation of globally unique address space, which will effectively lengthen the lifetime of the IP address space. The enterprises benefit from the increased flexibility provided by a relatively large private address space.

The last point deserves special emphasis. Indeed, anyone can use a large block of private IP addresses — up to 16 million without asking for permission — and then connect to the rest of the Internet by using only a single public IP address. A big block of private IP addresses provides the much needed room for future growth. On the other hand, for most if not all user sites, it is often difficult to obtain an IP address block that is beyond their immediate requirements. Today, NAT is believed to offer advantages well beyond the above. Essentially, the mapping table of a NAT provides one level of indirection between hosts behind the NAT and the global Internet. As the popular saying goes, “Any problem in computer science can be solved with another layer of indirection.” This one level of indirection means that one never need worry about renumbering the internal network when
changing providers, other than renumbering the public IP address of the NAT box. Similarly, a NAT box also makes multihoming easy. One NAT box can be connected to multiple providers and use one IP address from each provider. Not only does the NAT box shelter the connectivity to multiple ISPs from all the internal hosts, but it also does not require any of its providers to “punch a hole” in the routing announcement (i.e., make an ISP de-aggregate its address block). Such a hole punch would be required if the multihomed site takes an IP address block from one of its providers and asks the other providers to announce the prefix. Furthermore, this one level of indirection also is perceived as one level of protection because external hosts cannot directly initiate communication with hosts behind a NAT, nor can they easily figure out the internal topology. Besides all of the above, two additional factors also contributed greatly to the quick adoption of NATs. First, NATs can be unilaterally deployed by any end site without any coordination by anybody else. Second, the major gains from deploying a NAT were realized on day one, whereas its potential drawbacks were revealed only slowly and recently.
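For reference, the sketch below (illustrative only; it simply uses Python's standard ipaddress module) lists the three RFC 1918 private blocks and shows where the "up to 16 million" figure for net-10 comes from.

```python
# The three private address blocks reserved by RFC 1918 and their sizes.
import ipaddress

rfc1918 = ["10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16"]
for block in rfc1918:
    net = ipaddress.ip_network(block)
    print(f"{block:>16}  {net.num_addresses:>10,} addresses")
# 10.0.0.0/8 ("net-10") alone contains 2**24 = 16,777,216 addresses --
# the "up to 16 million" figure mentioned above.

# Any site can reuse these blocks internally; only the NAT box itself needs a
# globally unique address, so an address like 192.168.1.10 never appears in
# the public routing system.
addr = ipaddress.ip_address("192.168.1.10")
print(any(addr in ipaddress.ip_network(b) for b in rfc1918))   # True
```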
The Other Side of the NAT

A NAT disallows the hosts behind it from being reachable by an external host and hence prevents them from acting as servers. However, in the early days of NAT deployment, many people believed that they would have no need to run servers behind a NAT. Thus, this architectural constraint was viewed as a security feature and believed to have little impact on users or network usage. As an example, the following four justifications for the use of private addresses are quoted directly from RFC 1335 [3]:
• In most networks, the majority of the traffic is confined to its local area networks. This is due to the nature of networking applications and the bandwidth constraints on inter-network links.
• The number of machines that act as Internet servers, that is, run programs waiting to be called by machines in other networks, is often limited and certainly much smaller than the total number of machines.
• There are an increasingly large number of personal machines entering the Internet. The use of these machines is primarily limited to their local environment. They also can be used as clients such as ftp and telnet to access other machines.
• For security reasons, many large organizations, such as banks, government departments, military institutions, and some companies, allow only a very limited number of their machines to have access to the global Internet. The majority of their machines are purely for internal use.

As time goes on, however, the above reasoning has largely been proven wrong. First, network bandwidth is no longer a fundamental constraint today. On the other hand, voice over IP (VoIP) has become a popular application over the past few years. VoIP changed the communication paradigm from client-server to a peer-to-peer model, meaning that any host may call any other host. Given the large number of Internet hosts that are behind NAT, several NAT traversal solutions have been developed to support VoIP. A number of other recent peer-to-peer applications, such as BitTorrent, also have become popular recently, and each must develop its own NAT traversal solutions. In addition to the change of application patterns, a few other problems also arise due to the use of non-unique,
private IP addresses with NATs. For instance, a number of business acquisitions and mergers have run into situations where two networks behind NATs were required to be interconnected, but unfortunately, they were running on the same private address block, resulting in address conflicts. Yet another problem emerged more recently. The largest allocated private address block is 10.0.0.0/8, commonly referred to as net-10. The business growth of some provider and enterprise networks is leading to, or already has resulted in, net-10 address exhaustion. An open question facing these networks is what to do next. One provider network migrated to IPv6; a number of others simply decided on their own to use another unallocated IP address block [15].

It is also a common misperception that a NAT box makes an effective firewall. This may be due partly to the fact that in places where NAT is deployed, the firewall function often is implemented in the NAT box. A NAT box alone, however, does not make an effective firewall, as evidenced by the fact that numerous home computers behind NAT boxes have been compromised and have been used as launch pads for spam or distributed denial of service (DDoS) attacks. Firewalls establish control policies on both incoming and outgoing packets to minimize the chances of internal computers being compromised or abused. Making a firewall serve as a NAT box does not make it more effective in fencing off malicious attacks; good control policies do.
Why the Opportunity of Standardizing NAT Was Missed

During the decade following the deployment of NATs, a big debate arose in the IETF community regarding whether NAT should, or should not, be deployed. Due to its use of private addresses, NAT moved away from the basic IP model of providing end-to-end reachability between hosts, thus representing a fundamental departure from the original Internet architecture. This debate went on for years. As late as 2000, messages posted to the IETF mailing list by individual members still argued that NAT was architecturally unsound and that the IETF should in no way endorse its use or development. Such a position was shared by many people during that time. These days most people would accept the position that the IETF should have standardized NAT early on. How did we miss the opportunity? A simple answer could be that the crystal ball was cloudy. I believe that a little digging would reveal a better understanding of the factors that clouded our eyes at the time. As I see it from my personal viewpoint, the following factors played a major role.

First, the feasibility of designing and deploying a brand new IP was misjudged, as were the time and effort required for such an undertaking. Those who were opposed to standardizing NAT had hoped to develop a new IP in time to meet the needs of a growing Internet. Unfortunately, the calculation was way off. While the development of a new IP was taking its time, Internet growth did not wait. Network address translation is simply an inevitable consequence that was not clearly recognized at the time.

Second, the community faced a difficult question regarding how strictly one should stick to architectural principles, and what can be acceptable engineering trade-offs. Architectural principles are guidelines for problem solving; they help guide us toward developing better overall solutions. However, when the direct end-to-end reachability model was interpreted as an absolute rule, it ruled out network address translation as a feasible means to meet the instant high demand for IP
addresses at the time. Furthermore, sticking to the architectural model in an absolute way also contributed to the one-sided view of the drawbacks of NATs, hence the lack of a full appreciation of the advantages of NATs as we discussed earlier, let alone any effort to develop a NAT-traversal solution that can minimize the impact of NATs on end-to-end reachability.

Yet another factor was that given that network address translation could be deployed unilaterally by a single party alone, there was not an apparent need for standardization. This seemingly valid reasoning missed an important fact: a NAT box does not stand alone; rather it interacts both directly with surrounding IP devices, as well as indirectly with remote devices through IP packet handling. The need for standardizing network address translation behavior has since been well recognized, and a great effort has been devoted to developing NAT standards in recent years [16]. Unfortunately the early misjudgment on NAT already has cost us dearly. While the big debate went on through the late 1990s and early part of the first decade of this century, NAT deployment was widely rolled out, and the absence of a standard led to a number of different behaviors among various NAT products. A number of new Internet protocols also were developed or finalized during the same time period, such as IPSec, Session Announcement Protocol (SAP), and Session Initiation Protocol (SIP), to name a few. Their designs were based on the original model of IP architecture, wherein IP addresses are assumed to be globally unique and globally reachable. When those protocols became ready for deployment, they faced a world that was mismatched with their design. Not only were they required to solve the NAT traversal problem, but the solutions also were required to deal with a wide variety of NAT box behaviors.

Although NAT is accepted as a reality today, the lessons to learn from the past are yet to be clarified. One example is the recent debate over Class-E address block usage [17]. Class-E refers to the IP address block 240.0.0.0/4 that has been on reserve until now. As such, many existing router and host implementations block the use of Class-E addresses. Putting aside the issue of required router and host changes to enable Class-E usage, the fundamental debate has been about whether this Class-E address block should go into the public address allocation pool or into the collection of private address allocations. The latter would give those networks that face net-10 exhaustion a much bigger private address block to use. However, this gain is also one of the main arguments against it, as the size limitation of private addresses is considered a pressure to push those networks facing the limitation to migrate to IPv6, instead of staying with NAT. Such a desire sounds familiar; similar arguments were used against NAT standardization in the past. However, if the past is any indication of the future, we know that pressures do not dictate new protocol deployment; rather, economic feasibility does. This statement does not imply that migrating to IPv6 brings no economic benefit. On the contrary, it does, especially in the long run. New efforts are being organized both in protocol and tools development to smooth and ease the transition from IPv4 to IPv6, and in case studies and documentation to show clearly the short- and long-term gains from deploying IPv6.
Looking Back and Looking Forward The IPv4 address space exhaustion predicted long ago is finally upon us today, yet the IPv6 deployment is barely visible on the horizon. What can and should be done now to enable the Internet to grow along the best path forward? I hope this review of NAT history helps shed some light on the answer.
First, we should recognize not only the fact that IPv4 network address translation is widely deployed today, but also recognize its perceived benefits to end users as we discussed in a previous section. We should have a full appraisal of the pros and cons of NAT boxes; the discussion in this article merely serves as a starting point. Second, it is likely that some forms of network address translation boxes will be with us forever. Hopefully, a full appraisal of the pros and cons of network address translation would help correct the view that all network address translation approaches are a “bad thing” and must be avoided at all costs. Several years ago, an IPv4 to IPv6 transition scheme called Network Address Translation-Protocol Translation (NAT-PT; see [18]) was developed but later classified to historical status,1 mainly due to the concerns that: • NAT-PT works in much the same way as an IPv4 NAT box. • NAT-PT does not handle all the transition cases. However, in view of IPv4 NAT history, it seems worthwhile to revisit that decision. IPv4, together with IPv4 NAT, will be with us for years to come. NAT-PT seems to offer a unique value in bridging IPv4-only hosts and applications with IPv6enabled hosts and networks. There also have been discussions of the desire to perform address translations between IPv6 networks as a means to achieve several goals, including insulating one’s internal network from the outside. This question of “Whither IPv6 NAT?” deserves further attention. Instead of repeating the mistakes with IPv4 NAT, the Internet would be better off with well-engineered standards and operational guidelines for traversing IPv4 and IPv6 NATs that aim at maximizing interoperability. Furthermore, accepting the existence of network address translation in today’s architecture does not mean we simply take the existing NAT traversal solutions as given. Instead, we should fully explore the NAT traversal design space to steer the solution development toward restoring the end-to-end reachability model in the original Internet architecture. A new effort in this direction is the NAT traversal through tunneling (NATTT) project [19]. Contrary to most existing NAT traversal solutions that are server-based or protocol-specific, NATTT aims to restore end-to-end reachability among Internet hosts in the presence of NATs, by providing generic, incrementally deployable NAT-traversal support for all applications and protocols. Last, but not least, I believe it is important to understand that successful network architectures can and should change over time. All new systems start small. Once successful, they grow larger, often by multiple orders of magnitude as is the case of the Internet. Such growth brings the system to an entirely new environment that the original designers may not have envisioned, together with a new set of requirements that must be met, hence the necessity for architectural adjustments. To properly adjust a successful architecture, we must have a full understanding of the key building blocks of the architecture, as well as the potential impact of any changes to them. I believe the IP address is this kind of key building block that touches, directly or indirectly, all other major components in the Internet architecture. The impact of IPv4 NAT, which changed IP address semantics, provides ample evidence. During IPv6 development, much of the effort also involved a change in IP address semantics, such as the introduction of new concepts like that of the site-local address. 
1. Historical status means that a protocol is considered obsolete and is thus removed from the Internet standard protocol set.
The site-local address was later abolished and partially replaced by unique
local IPv6 unicast addresses (ULA) [20], another new type of IP address. The debate over the exact meaning of ULA is still going on. The original IP design clearly defined an IP address as being globally unique and globally reachable and as identifying an attachment point to the Internet. As the Internet continues to grow and evolve, recent years have witnessed an almost universal deployment of middleboxes of various types. NATs and firewalls are dominant among deployed middleboxes, though we also are seeing increasing numbers of SIP proxies and other proxies to enable peer-to-peerbased applications. At the same time, proposals to change the original IP address definition, or even redefine it entirely, continue to arise. What should be the definition, or definitions, of an IP address today, especially in the face of various middleboxes? I believe an overall examination of the role of the IP address in today’s changing architecture deserves special attention at this critical time in the growth of the Internet.
Acknowledgments
I sincerely thank Mirjam Kuhne and Wendy Rickard for their help with an earlier version of this article that was posted in the online IETF Journal of October 2007. I also thank the co-editors and reviewers of this special issue for their invaluable comments.
References
[1] Y. Rekhter et al., “Address Allocation for Private Internets,” RFC 1918, 1996. [2] D. Clark et al., “Towards the Future Internet Architecture,” RFC 1287, 1991. [3] Z. Wang and J. Crowcroft, “A Two-Tier Address Structure for the Internet: A Solution to the Problem of Address Space Exhaustion,” RFC 1335, 1992. [4] J. Rosenberg et al., “STUN: Simple Traversal of User Datagram Protocol (UDP) through Network Address Translators (NATs),” RFC 3489, 2003. [5] J. Rosenberg, R. Mahy, and P. Matthews, “Traversal Using Relays around NAT (TURN),” draft-ietf-behave-turn-08, 2008. [6] C. Huitema, “Teredo: Tunneling IPv6 over UDP through Network Address Translations (NATs),” RFC 4380, 2006. [7] S. Kent and R. Atkinson, “Security Architecture for the Internet Protocol,” RFC 2401, 1998. [8] J. Postel and J. Reynolds, “File Transfer Protocol (FTP),” RFC 959, 1985. [9] J. Postel, “Internet Protocol Specification,” RFC 791, 1981. [10] P. Tsuchiya and T. Eng, “Extending the IP Internet through Address Reuse,” ACM SIGCOMM Computer Commun. Review, Sept. 1993. [11] K. Egevang and P. Francis, “The IP Network Address Translator (NAT),” RFC 1631, 1994. [12] C. Huitema, “IAB Recommendation for an Intermediate Strategy to Address the Issue of Scaling,” RFC 1481, 1993. [13] R. M. Hinden, “IP Next Generation Overview,” http://playground.sun.com/ipv6/INET-IPng-Paper.html, 1995. [14] L. Zhang, “An Overview of Multihoming and Open Issues in GSE,” IETF J., Sept. 2006. [15] L. Vegoda, “Used but Unallocated: Potentially Awkward /8 Assignments,” Internet Protocol J., Sept. 2007. [16] http://www.ietf.org/html.charters/behave-charter.html; the IETF BEHAVE Working Group develops requirements documents and best current practices to enable NATs to function in a deterministic way, as well as advises on how to develop applications that discover and reliably function in environments with the presence of NATs. [17] http://www.ietf.org/mail-archive/web/int-area/current/msg01299.html; see the message dated 12/5/07 with subject line “240/4” and all the follow-up. [18] G. Tsirtsis and P. Srisuresh, “Network Address Translation-Protocol Translation (NAT-PT),” RFC 2766, 2000. [19] E. Osterweil et al., “NAT Traversal through Tunneling (NATTT),” http://www.cs.arizona.edu/~bzhang/nat/ [20] R. M. Hinden and B. Haberman, “Unique Local IPv6 Unicast Addresses,” RFC 4193, 2005.
Biography
LIXIA ZHANG ([email protected]) received her Ph.D. in computer science from the Massachusetts Institute of Technology. She was a member of the research staff at the Xerox Palo Alto Research Center before joining the faculty of the UCLA Computer Science Department in 1995. In the past she served as vice chair of ACM SIGCOMM and co-chair of the IEEE ComSoc Internet Technical Committee. She is currently serving on the Internet Architecture Board.
Behavior and Classification of NAT Devices and Implications for NAT Traversal Andreas Müller and Georg Carle, Technische Universität München Andreas Klenk, Universität Tübingen
Abstract For a long time, traditional client-server communication was the predominant communication paradigm of the Internet. Network address translation devices emerged to help with the limited availability of IP addresses and were designed with the hypothesis of asymmetric connection establishment in mind. But with the growing success of peer-to-peer applications, this assumption is no longer true. Consequently network address translation traversal became a field of intensive research and standardization for enabling efficient operation of new services. This article provides a comprehensive overview of NAT and introduces established NAT traversal techniques. A new categorization of applications into four NAT traversal service categories helps to determine applicable techniques for NAT traversal. The interactive connectivity establishment framework is categorized, and a new framework is introduced that addresses scenarios that are not supported by ICE. Current results from a field test on NAT behavior and the success ratio of NAT traversal techniques support the feasibility of this classification.
When the Internet Protocol (IP) was designed, the growth of the Internet to its current size was not imaginable. Therefore, it was reasonable to use a fixed 32-bit field to identify a host based on its IP address. This limited address range makes it impossible to assign globally unique IPv4 addresses to the growing number of networked devices. Furthermore, requesting an IP address for every newly added device results in an unacceptable administration overhead. The authors in [1] propose to assign a number of public IP addresses to a designated border router instead of configuring certain hosts with addresses that can be routed globally. The border router is then responsible for translating IP addresses between the private and the public domains, allowing as many simultaneous connections as public IP addresses were assigned. This allows a host within the local network to access the Internet even though it has a private IP address. This technique became known as network address translation (NAT). Because the translation of addresses breaks the end-to-end connectivity model of the IP, newly developed services following the peer-to-peer (P2P) paradigm such as file sharing, instant messaging, and voice over IP (VoIP) applications suffer from the existence of NAT. Thus, NAT traversal is an important problem today. And even in the future, after a possible success of IPv6, companies and home users still might deploy NAT devices to hide their topologies from Internet service providers (ISPs). There are two possible approaches to the problem. One direction within the Internet Engineering Task Force (IETF) Behave Working Group [2] is to cope with existing NAT implementations and to establish standards for the
detection of NAT behavior and for NAT traversal. On the other hand, the IETF also standardizes behavioral properties for NATs to work in conjunction with IETF protocols (e.g., Datagram Congestion Control Protocol [DCCP], Internet Control Message Protocol [ICMP], Stream Control Transmission Protocol [SCTP]). Enterprise class NATs are among the first to incorporate new features introduced through standardization. However, the large scale deployment of residential gateways with NAT functionality prohibits the change of NAT and requires the use of protocols that work with existing NATs. This is also the focus of this article, where we treat NATs as black boxes rather than trying to change them.
NAT Behavior Today, a NAT device usually is used to share a single public IP address among a number of private end systems. The NAT maintains a table, listing all connections between the public and the private domains. For every connection attempt (e.g., a Transmission Control Protocol synchronize [TCP SYN] packet) coming from an internal host, the NAT creates a new entry in the list. In NAT terminology this entry is called a binding [3]. Each entry contains the source IP address and the source port. The NAT replaces the source IP address with its public IP address. The source port is replaced using one of the strategies explained later in this section. Although the concept of NAT was published as early as 1994 [1], no common approach for NAT emerged. Current NAT implementations not only differ from vendor to vendor but also from model to model, which leads to compatibility
Port binding: Port preservation; No port preservation; Port overloading; Port multiplexing
NAT binding: Endpoint-independent; Address- (port)-dependent; Connection-dependent
Endpoint filtering: Independent; Address restricted; Address and port restricted
Table 1. NAT behavior categories (classification) and possible NAT properties.
issues. If an application works with one particular NAT, this does not imply that it always works in a NATed environment. Therefore, it is very important to understand and classify existing NAT implementations in order to design applications that can work in combination with current NATs. The classification in this article is mainly derived from simple traversal of User Datagram Protocol (UDP) through NAT (STUN) [4], whereas the address binding and mapping behavior follows the terminology used in RFC 4787 [5]. This section covers only topics that are required for the understanding of this article. A detailed discussion and further information (including test results) is given in [6] (for TCP) and [5] (for UDP). Binding covers “context based packet translation” [7], which describes the strategy the NAT uses to assign a public transport address (combination of IP address and port) to a new state in the NAT. Filtering, or packet discard, shows how the NAT handles (or discards) packets trying to use an existing mapping. Table 1 shows the different categories and their possible properties. Port binding describes the strategy a NAT uses for the assignment. With port preservation, the NAT assigns an external port to a new connection; it attempts to preserve the local port number if possible. Port overloading is problematic and rarely occurs. A new connection takes over the binding, and the old connection is dropped. Port multiplexing is a very common strategy where ports are demultiplexed based on the destination transport address. Incoming packets can now carry the same destination port and are distinguished by the source transport address. NAT binding deals with the reuse of existing bindings. That is, if an internal host closes a connection and establishes a new one from the same source port, NAT binding describes the assignment strategy for the new connection. As shown in Table 1, the NAT binding is organized into three categories. With Endpoint Independent, the external port is only dependent on the source transport address of the connection. As long as a host establishes a connection from the same source IP address and port, the mapping does not change. The assignment is dependent on the internal and the external transport address with the Address (Port) Dependent strategy. As long as consecutive connections from the same source to the same destination are established, the mapping does not change. As soon as we use a different destination, the NAT changes the external port. With a Connection Dependent binding, the NAT assigns a new port to every connection. We distinguish between NATs that increase the new port number by a specific (and well predictable) delta and NATs that assign random port numbers to the new mappings. Endpoint filtering describes how existing mappings can be used by external hosts and how a NAT handles incoming connection attempts that are not part of a response. Independent Filtering allows inbound connections independent of the
source transport address of the packet. As long as the destination transport address of a packet matches an existing state, the packet is forwarded. With Address Restricted Filtering, the NAT forwards only packets coming from the same host (matching IP address) to which the initial packet was sent. Address and Port Restricted Filtering also compares the source port of the inbound packet in addition to address restricted filtering.
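These binding and filtering categories can be made concrete with a small simulation. The following Python sketch is our own illustration (the class and property names are invented, not taken from the article) of how a NAT picks an external port under the three binding strategies and decides whether to forward an inbound packet under the three filtering strategies.

import itertools

class SimpleNat:
    """Toy NAT following the binding and filtering categories of Table 1.
    binding: "endpoint-independent", "address-dependent", or "connection-dependent"
    filtering: "independent", "address-restricted", or "address-and-port-restricted"
    """

    def __init__(self, public_ip, binding, filtering):
        self.public_ip = public_ip
        self.binding = binding
        self.filtering = filtering
        self.next_port = itertools.count(1024)   # external ports handed out in order
        self.bindings = {}                       # binding key -> external port
        self.contacted = {}                      # external port -> set of remote endpoints

    def _key(self, src, dst):
        if self.binding == "endpoint-independent":
            return src                           # same internal (ip, port) -> same external port
        if self.binding == "address-dependent":
            return (src, dst[0])                 # new port when the remote address changes
        return (src, dst)                        # connection-dependent: new port per connection

    def outbound(self, src, dst):
        """Internal host src=(ip, port) sends to dst=(ip, port); returns the external endpoint."""
        key = self._key(src, dst)
        if key not in self.bindings:
            self.bindings[key] = next(self.next_port)
        port = self.bindings[key]
        self.contacted.setdefault(port, set()).add(dst)
        return (self.public_ip, port)

    def inbound_allowed(self, remote, ext_port):
        """Would an inbound packet from remote=(ip, port) to ext_port be forwarded?"""
        if ext_port not in self.contacted:
            return False                         # no mapping exists for this port
        if self.filtering == "independent":
            return True
        if self.filtering == "address-restricted":
            return any(r[0] == remote[0] for r in self.contacted[ext_port])
        return remote in self.contacted[ext_port]   # address-and-port-restricted

if __name__ == "__main__":
    nat = SimpleNat("203.0.113.1", "endpoint-independent", "address-restricted")
    ext_ip, ext_port = nat.outbound(("10.0.0.5", 5060), ("198.51.100.7", 5060))
    print("mapped to", (ext_ip, ext_port))
    print("same remote host, different port:", nat.inbound_allowed(("198.51.100.7", 9999), ext_port))
    print("unrelated remote host:", nat.inbound_allowed(("192.0.2.9", 5060), ext_port))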
NAT Traversal Problem To work properly, the NAT must have access to the protocol headers at layers 3 and 4 (in case of a network address port translation [NAPT]). Additionally, for every incoming packet, the NAT must already have a state listed in its table. Otherwise, it cannot find the related internal host to which the packet belongs. According to RFC 3027 [8], the NAT traversal problem can be separated into three categories, which are presented in this section. In addition to the three problems, we identified Unsupported Protocols as a new category. The first problem occurs if a protocol uses Realm-Specific IP Addresses in its payload. That is, if an application layer protocol such as the Session Initiation Protocol (SIP) uses a transport address from the private realm within its payload signalizing where it expects a response. Because regular NATs do not operate above layer 4, application layer protocols typically fail in such scenarios. A possible solution is the use of an application layer gateway (ALG) that extends the functionality of a NAT for specific protocols. However, an ALG supports only the application layer protocols that are specifically implemented and may fail when encryption is used. The second category is P2P Applications. The traditional Internet consists of servers located in the public realm and clients that actively establish connections to these servers. This structure is well suited for NATs because for every connection attempt (e.g., a TCP SYN) coming from an internal client, the NAT can add a mapping to its table. But unlike client-server applications, a P2P connection can be initiated by any of the peers regardless of their location. However, if a peer in the private realm tries to act as a traditional server (e.g., listening for a connection on a socket), the NAT is unaware of incoming connections and drops all packets. A solution could be that the peer located in the private domain always establishes the connection. But what if two peers, both behind a NAT, want to establish a connection to each other? Even if the security policy would allow the connection, it cannot be established. The third category is a combination of the first two. Bundled Session Applications, such as File Transfer Protocol (FTP) or SIP/Session Description Protocol (SDP), carry realm-specific IP addresses in their payload to establish an additional session. The first session is usually referred to as the control session, whereas the newly created session is called the data session. The problem here is not only the realm-specific IP addresses, but the fact that the data session often is established from the public Internet toward the private host, a direction the NAT does not permit (e.g., active FTP). Unsupported Protocols are typically newly developed transport protocols such as the SCTP or the DCCP that cause problems with NATs even if an internal host initiates the connection establishment. This is because current NATs do not have built-in support for these protocols. The unsupported protocols also cover protocols that cannot work with NATs because their layer 3 or layer 4 header is not available for translation. This happens when using encryption protocols such as IPSec.
Figure 1. NAT traversal service categories for applications: a) RNT; b) GSP; c) SPPS; d) SSP.
NAT Traversal Service Categories Instead of classifying the NAT behavior (see classification in STUN [4]), we defined four NAT traversal service categories, each making different assumptions about the purpose of the connection establishment and the infrastructure that is available. Our categorization emphasizes that the applicability of many NAT traversal techniques depends on the support of a combination of requester, the responder, globally reachable infrastructure nodes, and the role of the application. On the one hand, server applications set up a socket and wait for connections (which also applies to P2P applications). On the other hand, client applications such as VoIP clients actively initiate a connection and wait for an answer on a different port (bundled session applications). Other applications work only across NATs if both ends participate in the connection establishment (unsupported protocols). Thus, we differentiate between supporting a service and supporting a client. In this article, the client is called the requester because it actively initiates a connection. The behavior of the NAT is important because it allows or prohibits certain NAT traversal techniques within one service category. If only one end implements NAT traversal support (e.g., by running a stand-alone framework or by built-in NAT traversal functionality), NAT traversal techniques that rely on a collaboration of both ends (e.g., ICE) are not applicable. Our first category, requester side NAT traversal (RNT), covers scenarios where only the requester side supports NAT traversal (e.g., the application or the NAT itself). RNT helps applications that actively participate in the connection establishment and still suffer from the existence of NATs. Typical examples are applications that have problems with realm-specific IP addresses in their payload. This applies to protocols using in-band signaling on the application layer, which is related to bundled session applications with asymmetric connection establishment (e.g., VoIP using SIP/SDP). The second category, global service provisioning (GSP), assumes that the host providing the service implements NAT traversal support, helping to make a service globally accessible. This is done by creating and maintaining a NAT mapping that then accepts multiple connections from previously unknown clients (Fig. 1). This is the main difference from RNT, which only creates a NAT mapping for one particular session (e.g., one call in the case of VoIP). The last two categories assume support at both ends, the service and the requester. On the one side, NAT traversal is required to make a service behind a NAT globally accessible, whereas on the other side, the support at the requester allows the use of sophisticated techniques through coordinated action. Thus, service provisioning using pre-signaling (SPPS) extends the GSP category by the assumption that both hosts have interoperable frameworks (e.g., ICE [9]; NAT, URIs, Tunnels, SIP, and STUNT [NUTSS] [10]; NATBlaster [11]; or NatTrav [12]) running. This allows a selection from all available NAT traversal solutions, which leads to a high success rate of NAT traversal. In Fig. 1, the two hosts use a rendezvous point to agree on a NAT traversal technique. After
creating the mapping in step 2, the service is accessible by any host, depending on the selected NAT traversal technique and the filtering strategy of the NAT. SPPS supports all types of services where a one-to-one connection is sufficient and presignaling is available. The last category, secure service provisioning (SSP), is an extension of SPPS and addresses scenarios that require authorization of the remote party before initiating the NAT traversal process. The hereby established channel must be accessible only by the authorized remote party. This requires additional functionality that enforces this policy and only allows authorized users to access the service. The policy enforcement can be done at the NAT itself, at a data relay, or at a firewall. Table II depicts all four service categories with popular NAT traversal techniques and shows the implications for automated NAT traversal and required signaling. First we distinguish between the service and the requester. “Support at the service” means, for example, that a framework must be deployed at the same host providing the service. The same applies to the requester. “RP” means that a rendezvous point is required for relaying data back and forth. “Signaling messages” means that some sort of signaling protocol is used for NAT traversal. Again, we differentiate between signaling at the service and signaling at the requester. A rendezvous point for signaling messages is required in case of pre-signaling. Finally, “stream independent” describes the requirement for consecutive connections. For example, a port forwarding entry must be created only once, whereas hole punching [13] requires sending a new hole punching packet for every new stream (with restricted filtering). Table 2 shows the main differences of our service categories. RNT deals with bundled session applications that wait on a port after initiating a session (e.g., via a SIP INVITE). GSP requires only support of the service and aims to make a service globally reachable for multiple clients. SPPS and SSP combine these categories and require support at both ends. The requester initiates pre-signaling to exchange information about a global end point. The service then creates a mapping in the NAT that can be used by the client.
Applicability of NAT Traversal Techniques for NAT Traversal Service Categories There are many different techniques for solving the NAT traversal problem in specific scenarios, but none of them provides a solution that works well with all NATs, applications, and network topologies. Another article explains many of the available protocols for NAT traversal [14] in general. This section describes the applicability of existing techniques from the applications point of view. RNT is required for protocols using in-band signaling (bundled session applications). Therefore, one common approach is to integrate RNT into these applications (e.g., the VoIP client), to establish port bindings on the fly. One possibility is the integration of a universal plug and play (UPnP) client. Another option is to use ALGs that are integrated in the NAT, interpreting in-band signaling and establishing map-
pings accordingly. ALGs are not a general solution because the NAT must implement the required logic for each protocol, and end-to-end security prohibits the interpretation of the signaling by the NAT.
GSP depends on NAT traversal techniques that allow unrestricted access to a public end point. A control protocol can be used to directly establish a port forwarding entry in the mapping tables of the NAT, for instance, with UPnP [15]. Port forwarding entries created by UPnP are easy to maintain and work independently from NAT behavior. However, UPnP only works if the NAT is in the local network on the path to the other end point. Thus, nested NATs are not allowed, and path changes break the connectivity. Hole punching is an alternative if UPnP is not applicable and works for NATs with an independent filtering strategy. The mapping must be refreshed periodically, for instance, by sending keep-alive packets. For NATs other than full-cone, hole punching for GSP cannot be used because the source port of the request is unknown in advance.
SPPS makes no assumption about the accessibility of a created mapping; thus all possible techniques are applicable. Different from GSP, hole punching for SPPS works as long as port prediction is possible. For NATs implementing restricted filtering, pre-signaling helps to create the appropriate mapping because the five-tuple of the connection is exchanged. Pre-signaling also enables the establishment of a UDP tunnel, allowing the encapsulation of unsupported protocols. SPPS also can use UPnP to establish port forwarding entries for one session.
Table 2. Service categories and the popular NAT traversal techniques applicable to each.
RNT: NAT with ALG; STUN; UPnP (for bundled session applications)
GSP: UPnP (port forwarding); hole punching — independent filtering; open data relay (e.g., RSIP)
SPPS: hole punching — independent binding; UPnP; closed/open data relay (e.g., TURN, Skype); tunneling (e.g., over UDP); hole punching — restricted filtering
SSP: NSIS NATFW NSLP; closed data relay (e.g., TURN); tunneling (e.g., over secure channel)
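As a concrete illustration of the hole punching technique discussed above for GSP and SPPS, the following Python sketch shows the core of UDP hole punching between two peers that have already exchanged their public transport addresses (for example, via a rendezvous point or pre-signaling). The addresses and ports are placeholders; a real implementation also needs keep-alives and the NAT behavior checks discussed in this article.

import socket
import time

def punch_udp_hole(local_port, peer_public_addr, payload=b"punch", tries=5, timeout=1.0):
    """Send datagrams to the peer's public (ip, port) so our NAT creates a mapping,
    while listening for the peer's datagrams arriving through that mapping."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("0.0.0.0", local_port))           # use the same port that was announced to the peer
    sock.settimeout(timeout)
    for _ in range(tries):
        sock.sendto(payload, peer_public_addr)   # outgoing packet opens the pinhole in our NAT
        try:
            data, addr = sock.recvfrom(1500)
            if addr[0] == peer_public_addr[0]:   # peer's traffic made it through: hole punched
                return sock, addr
        except socket.timeout:
            time.sleep(0.2)                      # peer may not have started punching yet; retry
    sock.close()
    return None, None

if __name__ == "__main__":
    # Placeholder values: both peers learn these via the rendezvous or pre-signaling step.
    conn, peer = punch_udp_hole(40000, ("198.51.100.23", 40001))
    if conn:
        print("direct path established with", peer)
    else:
        print("hole punching failed (e.g., connection-dependent binding or strict filtering)")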
SSP is an extension to SPPS that allows only authorized hosts to allocate and to use a mapping. Protocols that authorize requests and assume control over the middlebox, such as middlebox communication (MIDCOM) [16] or the NAT/Firewall Next Step in Signaling (NSIS) Layer Protocol [17] qualify for SSP. The advantage of NSIS is that it can discover and configure multiple middleboxes along the data path, thus supporting complex scenarios with nested NATs and multipath routing. However, if one NAT on the path does not support the protocol, NSIS fails. Using NSIS and MIDCOM for SSP requires restrictive rules that allow only authorized clients to use the mapping, for instance, by opening pinholes for IP fivetuples. UPnP is not useful for SSP because it forwards inbound packets without considering the source transport address. Hole punching can be used only with SSP if the NAT implements a restricted filtering strategy. All cases discussed previously rely on additional measures to prohibit IP spoofing. The use of secure tunnels impedes IP spoofing and allows secure NAT traversal, even for unsupported protocols (e.g., IPSec, SCTP, DCCP). SSP also can be achieved by using traversal using relay NAT (TURN) with authentication, authorization, and secure communication (e.g., via transport layer security [TLS]). ICE [9] is under standardization by the IETF and strives to combine several techniques into a framework flexible enough to work with all network topologies. Because ICE requires both peers to have an ICE implementation running, it can be seen as a technique for SPPS or SSP, depending on the accessibility and the security policies of the public endpoint.
The same is true for solutions such as TURN [18]. TURN is a promising candidate for SPPS because it provides a relay with a public transport address, allowing the exchange of data packets between a TURN client and a public host.
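When the direct techniques above fail, the remaining option is the relay: a host with a public address forwards packets between the two NATed peers, which is what TURN standardizes. The sketch below is a toy UDP relay in that spirit only; it does not implement the TURN protocol, and the port number and the two-peer registration scheme are invented for illustration.

import socket

def run_toy_relay(listen_port=50000):
    """Forward UDP datagrams between the first two peers that contact this host.
    Peers behind NATs can always reach the relay because it has a public address."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("0.0.0.0", listen_port))
    peers = []                                   # public (ip, port) of the two registered peers
    while True:
        data, addr = sock.recvfrom(2048)
        if addr not in peers and len(peers) < 2:
            peers.append(addr)                   # first packet from each peer acts as registration
        if len(peers) == 2 and addr in peers:
            other = peers[1] if addr == peers[0] else peers[0]
            sock.sendto(data, other)             # relay the payload to the other peer

if __name__ == "__main__":
    run_toy_relay()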
Why Unilateral Solutions Exist
ing NAT traversal support. With the session manager, ANTS can provide GSP and RNT directly. Whenever an application is added and associated with GSP or RNT, the session manager calls the NAT traversal logic and asks to allocate an appropriate mapping in the NAT. This also requires ANTS to have sufficient knowledge about the applicability of the integrated techniques regarding the service categories. For example, UPnP cannot be used for SSP because it violates the idea of an endpoint that is accessible only by authenticated hosts. Figure 2 shows a decision tree that ANTS uses to establish a mapping in the NAT. First, we distinguish between requester initiated NAT traversal on the one hand and the access to a service on the other hand. Then, we must know which ends actually implement ANTS. If both hosts have the framework running, pre-signaling is possible, which leads to a wide choice of techniques depending on the security considerations of the mapping. If only one end supports ANTS, only techniques belonging to GSP or RNT are applicable. Despite some unsolved issues such as the question of how to connect legacy applications to ANTS (e.g., by using a library or a traversal of UDP through NAT [TUN]-based approach), the idea of a knowledge-based framework seems
Despite the great flexibility of SPPS and SSP, both categories involve a number of assumptions that are not always satisfied. The most important one is the requirement for both ends (and sometimes also the infrastructure), to support compatible versions of the NAT traversal framework. It remains to be seen if the future will bring a sufficiently big deployment of one framework on which to rely for arbitrary applications. The chances are better within homogeneous problem domains, like telecommunication, where such frameworks can be integrated with the applications and be distributed in large numbers. For instance, the adoption of ICE is occurring mainly within the VoIP/SIP community and focusing on VoIP specific use cases. These drawbacks are the reason why RNT and GSP as unilateral solutions for the NAT traversal problems exist. It is easier to enhance an infrastructure under one responsibility than to rely on a solution that requires a global deployment. However, unilateral solutions are limited to the middleboxes in the given domain. They fail to provide solutions S. cat. to scenarios with nested NATs and depend on the network topology.
Coalescing Unilateral and Cooperative Approaches for NAT Traversal
When investigating existing NAT traversal techniques, we determined that none of them can be used in all scenarios. For example, UPnP only supports globally accessible end points, whereas ICE requires both hosts to run the framework. In [19], we proposed a new framework that aims toward providing an advanced NAT traversal service (ANTS) supporting all four service categories. The concept of ANTS is based on the idea of reusing previously obtained knowledge about the topology of the network and the capabilities of the NAT. A small component of ANTS, the NAT tester, is responsible for gathering this information and is presented (together with some test results) in the next section. If a user decides that a particular application should be reachable from the public Internet, he registers it at a session manager that keeps track of all applications requesting NAT traversal support. With the session manager, ANTS can provide GSP and RNT directly. Whenever an application is added and associated with GSP or RNT, the session manager calls the NAT traversal logic and asks it to allocate an appropriate mapping in the NAT. This also requires ANTS to have sufficient knowledge about the applicability of the integrated techniques with respect to the service categories. For example, UPnP cannot be used for SSP because it violates the idea of an endpoint that is accessible only by authenticated hosts.
Figure 2 shows the decision tree that ANTS uses to establish a mapping in the NAT. First, we distinguish between requester-initiated NAT traversal on the one hand and access to a service on the other. Then, we must know which ends actually implement ANTS. If both hosts have the framework running, pre-signaling is possible, which leads to a wide choice of techniques depending on the security requirements of the mapping. If only one end supports ANTS, only techniques belonging to GSP or RNT are applicable. Despite some unsolved issues, such as the question of how to connect legacy applications to ANTS (e.g., by using a library or a TUN-based approach), the idea of a knowledge-based framework seems to be the right answer. Thus, once implemented, ANTS can help many existing services by integrating several techniques and making its choice based on knowledge about the NAT and the requirements of the application.
Figure 2. Decision tree for ANTS.
Service category | Protocol | Condition | Success rate
RNT | UDP | UPnP or HP-UDP | 90.27%
RNT | TCP | UPnP or HP-TCP | 77.84%
GSP | UDP | Full Cone and HP-UDP | 27.03%
GSP | TCP | Full Cone and HP-TCP | 17.30%
GSP | UDP | UPnP or (Full Cone and HP-UDP) | 50.27%
GSP | TCP | UPnP or (Full Cone and HP-TCP) | 44.32%
SPPS | UDP | HP-UDP | 88.65%
SPPS | TCP | HP-TCP | 71.35%
SPPS | TCP | HP-TCP or HP-UDP | 94.59%
SPPS | UDP | UPnP or HP-UDP | 90.27%
SPPS | TCP | UPnP or HP-TCP | 77.84%
SPPS | TCP | UPnP or HP-TCP or HP-UDP | 95.14%
SSP | UDP | Restricted NAT and HP-UDP | 48.65%
SSP | TCP | Restricted NAT and HP-TCP | 38.38%
Table 3. Results of the field test: success rates of NAT traversal techniques depending on service categories (HP denotes hole punching).
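The decision logic of Figure 2, combined with the conditions summarized in Table 3, can be expressed compactly in code. The following Python sketch is our own rendering of how such a knowledge-based selection might look once a NAT tester has classified the NAT; the function name and the property flags are illustrative and not part of ANTS.

def select_technique(category, nat, both_ends_run_framework=False):
    """Pick a traversal technique for a service category, roughly following Table 3.
    `nat` holds properties reported by a NAT tester, e.g.
    {"upnp": False, "full_cone": False, "restricted_filtering": True,
     "hole_punching_udp": True, "hole_punching_tcp": False}."""
    if category in ("SPPS", "SSP") and not both_ends_run_framework:
        return "not applicable: pre-signaling requires framework support at both ends"
    if category == "RNT":
        if nat["upnp"] or nat["hole_punching_udp"] or nat["hole_punching_tcp"]:
            return "UPnP or hole punching on behalf of the requester"
    elif category == "GSP":
        if nat["upnp"]:
            return "UPnP port forwarding"
        if nat["full_cone"] and nat["hole_punching_udp"]:
            return "hole punching (independent filtering)"
    elif category == "SPPS":
        if nat["hole_punching_udp"] or nat["hole_punching_tcp"]:
            return "hole punching negotiated via pre-signaling"
        if nat["upnp"]:
            return "UPnP port forwarding for this session"
    elif category == "SSP":
        if nat["restricted_filtering"] and nat["hole_punching_udp"]:
            return "hole punching restricted to the authorized peer"
        return "closed data relay or secure tunnel"
    return "fall back to an external data relay"

if __name__ == "__main__":
    nat_props = {"upnp": False, "full_cone": False, "restricted_filtering": True,
                 "hole_punching_udp": True, "hole_punching_tcp": False}
    print(select_technique("SSP", nat_props, both_ends_run_framework=True))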
Field Test on NAT Traversal To prove that existing techniques can be adapted to our service categories, we implemented a NAT tester that acts as a cornerstone for our new framework. This section presents the results of a field test investigating 185 NATs in the wild. For a detailed description including all results, see our Web site: http://nettest.net.in.tum.de. The first test queries a public STUN server to determine the type of the NAT. Afterward, the NAT tester performs the following connection tests and tries to establish a connection to the host behind the NAT: UPnP, hole punching, and connecting to a data relay (each for both protocols, UDP and TCP) (Table 3). We then adapted the test results to our work and evaluated the success rates of the individual techniques regarding our defined service categories. Table III shows the categories and the conditions that must be met according to the considerations made previously. For example, GSP requires the use of UPnP or hole punching support in combination with a fullcone NAT to make a service globally accessible. Therefore, 50.27 percent of our tested NATs supported a direct connection for UDP and category GSP (44.32 percent for TCP). In all other cases (the remaining percentages), an external relay must be used to provide GSP. For SPPS, which makes no security assumptions, we divided our results into two categories. First we determined the success rates without considering UPnP. With 88.65 percent of all NATs, we were able to establish a direct connection to the host behind the NAT (71.35 percent for TCP). This rate increased slightly (for TCP to 77.84 percent) when UPnP was an option. The highest success rate for TCP NAT traversal (95.14 percent) was discovered when we also allowed the tunneling of TCP packets through UDP. SSP allows only authorized hosts to create and to use a mapping. Therefore, a suitable technique for SSP is hole punching in combination with a NAT implementing a restricted filtering strategy. This was supported by 48.65 percent for UDP and 38.38 percent for TCP. The success rate for RNT depends on the effort that is made for the specific protocol. For example, if we assume that we can inspect each signaling packet on the application layer thoroughly, we could adopt the results from SPPS to RNT. If we would only modify the packets in a way that the internal port is reachable by any client, the success rate of GSP would apply to RNT. Finally, we did not measure the effect of NATs with integrated ALGs in this field test.
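For reference, the first step of the NAT tester is a STUN query. The minimal Python client below sends an RFC 3489 [4] binding request and decodes the MAPPED-ADDRESS attribute to learn the public transport address the NAT assigned to the socket; the server name is a placeholder, and the additional tests needed for a full NAT classification are omitted.

import os
import socket
import struct

def stun_mapped_address(server, port=3478, timeout=2.0):
    """Send an RFC 3489 Binding Request and return the (ip, port) the STUN server saw,
    i.e., the public mapping created by the NAT for this socket."""
    txid = os.urandom(16)
    request = struct.pack("!HH", 0x0001, 0) + txid       # Binding Request, no attributes
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    try:
        sock.sendto(request, (server, port))
        data, _ = sock.recvfrom(2048)
    finally:
        sock.close()
    msg_type, msg_len = struct.unpack("!HH", data[:4])
    if msg_type != 0x0101 or data[4:20] != txid:         # expect a Binding Response, same txid
        return None
    pos = 20
    while pos < 20 + msg_len:
        attr_type, attr_len = struct.unpack("!HH", data[pos:pos + 4])
        value = data[pos + 4:pos + 4 + attr_len]
        if attr_type == 0x0001:                          # MAPPED-ADDRESS attribute
            mapped_port = struct.unpack("!H", value[2:4])[0]
            return socket.inet_ntoa(value[4:8]), mapped_port
        pos += 4 + attr_len
    return None

if __name__ == "__main__":
    # "stun.example.org" is a placeholder; substitute a reachable public STUN server.
    print(stun_mapped_address("stun.example.org"))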
Conclusion With the increasing popularity of P2P communication, the NAT traversal problem has become more urgent than ever. Existing solutions have the drawback of supporting only certain types of NATs and cannot be viewed as a general solution to the problem. When analyzing the NAT traversal problem more thoroughly, we discovered that the question of who supports the NAT traversal framework determines which NAT traversal techniques are applicable. Therefore, we identified four NAT traversal service categories that differentiate
between support by service, client, and infrastructure and listed applicable NAT traversal techniques for each category. Our findings from a field test showed that there are a number of prospective NAT traversal techniques that enable connectivity for each NAT traversal service category. We emphasized how to build upon this categorization to develop a knowledgebased NAT traversal framework. Future frameworks that aspire to support the typical connectivity scenarios of current applications should support all four service categories.
References [1] K. Egevang and P. Francis, “The IP Network Address Translator (NAT),” IETF RFC 1631, May 1994. [2] IETF, “Behavior Engineering for Hindrance Avoidance (behave);” http://www.ietf.org [3] P. Srisuresh and M. Holdrege, “IP Network Address Translator (NAT) Terminology and Considerations,” IETF RFC 2663, Aug. 1999. [4] J. Rosenberg et al., “STUN: Simple Traversal of User Datagram Protocol (UDP) through Network Address Translators (NATs),” IETF RFC 3489, Mar. 2003. [5] E. F. Audet and C. Jennings, “NAT Behavioral Requirements for Unicast UDP,” IETF RFC 4787, Jan. 2007. [6] S. Guha and P. Francis, “Characterization and Measurement of TCP Traversal through NATs and Firewalls,” Proc. ACM Internet Measurement Conf., Berkeley, CA, Oct. 2005. [7] G. Huston, “Anatomy: A Look Inside Network Address Translators,” The Internet Protocol J., vol. 7, 2004, pp. 2–32. [8] M. Holdrege and P. Srisuresh, “Protocol Complications with the IP Network Address Translator,” IETF RFC 3027, Jan. 2001. [9] J. Rosenberg, “Interactive Connectivity Establishment (ICE): A Protocol for Network Address Translator (NAT) Traversal for Offer/Answer Protocols,” IETF Internet draft, work in progress, Oct. 2007. [10] P. Francis, S. Guha, and Y. Takeda, “NUTSS: A SIP-based Approach to UDP and TCP Network Connectivity,” Cornell Univ., Panasonic Commun., tech. rep., 2004. [11] A. Biggadike et al., “NATBLASTER: Establishing TCP Connections between Hosts behind NATs,” ACM SIGCOMM Asia Wksp., Beijing, China, 2005. [12] J. Eppinger, “TCP Connections for P2P Applications — A Software Approach to Solving the NAT Problem,” Carnegie Mellon Univ., Pittsburgh, PA, tech. rep., 2005. [13] B. Ford, P. Srisuresh, and D. Kegel, “Peer-to-Peer Communication across Network Address Translation,” MIT, tech. rep., 2005. [14] H. Khlifi, J. Gregoire, and J. Phillips, “VoIP and NAT/Firewalls: Issues, Traversal Techniques, and a Real-World Solution,” IEEE Commun. Mag., July 2006. [15] U. Forum, “Internet Gateway Device (IGD) Standardized Device Control Protocol,” Nov. 2001. [16] P. Srisuresh et al., “Middlebox Communication Architecture and Framework,” IETF RFC 3303, Aug. 2002. [17] M. Stiemerling et al., “NAT/Firewall NSIS Signaling Layer Protocol (NSLP),” IETF Internet draft, Feb. 2008. [18] J. Rosenberg, R. Mahy, and P. Matthews, “Traversal Using Relays around NAT (TURN),” IETF Internet draft, work in progress, June 2008. [19] A. Müller, A. Klenk, and G. Carle, “On the Applicability of KnowledgeBased NAT-Traversal for Future Home Networks,” Proc. IFIP Networking 2008, Springer, Singapore, May 2008.
Biographies
ANDREAS MÜLLER ([email protected]) received his diploma degree in computer science from the University of Tübingen, Germany, in 2007. Currently, he is a research assistant and Ph.D. candidate in the Network Architecture and Services Department at the Technical University of Munich. His research interests include middleboxes, P2P systems, and autonomic networking.
ANDREAS KLENK ([email protected]) earned his diploma degree in computer science from Ulm University, Germany, in 2003. He is a Ph.D. candidate and research assistant at the University of Tübingen and works with Professor Carle. He contributes to European research projects in the telecommunication field. His research interests include negotiation and security in autonomic systems.
GEORG CARLE ([email protected]) received an M.Sc. degree from Brunel University London in 1989, a diploma degree in electrical engineering from the University of Stuttgart in 1992, and a doctoral degree from the faculty of computer science, University of Karlsruhe, in 1996. He is a full professor of computer science at the Technical University of Munich, where he is chair of the Department of Network Architecture and Services. Among the focal interests of his research are Internet technology and mobile communication in combination with security.
Modeling Middleboxes
Dilip Joseph and Ion Stoica, University of California at Berkeley
Abstract
The lack of a concise and standard language to describe diverse middlebox functionality and deployment configurations adversely affects current middlebox deployment, as well as middlebox-related research. To alleviate this problem, we present a simple middlebox model that succinctly describes how different middleboxes process packets and illustrate it by representing four common middleboxes. We set up a pilot online repository of middlebox models and prototyped model inference and validation tools.
Middleboxes, like firewalls, NATs, load balancers, and intrusion-prevention boxes have become an integral part of networks today. There is great diversity in how these middleboxes process and transform packets, and in how they are configured and deployed. For example, a firewall is commonly connected inline on the physical network path and transparently forwards packets unmodified or drops them. A load balancer, on the other hand, rewrites packet headers and contents and often requires packets to be explicitly IP addressed and forwarded to it. There is currently no standard way to succinctly describe the complexity and diversity of middlebox packet processing and deployment mechanisms. Middlebox taxonomies like RFC 3234 [1] provide only a high-level classification of middleboxes. Details about middlebox operations and deployment configurations often are buried in middlebox- and vendor-specific configuration manuals or simply are not documented clearly. Efforts like the Unified Firewall Model [2] and BEHAVE [3] provide models to describe the operations of specific middleboxes like firewalls and NATs. The lack of a concise and standard language to describe different middleboxes adversely affects current middlebox deployment, as well as hinders middlebox-related research. Correctly deploying and configuring a middlebox is a challenging task by itself. Without a clear understanding of how different middleboxes process packets and interact with the network and with other middleboxes, network planning, verification of operational correctness, and troubleshooting become even more complicated. In our own research experience of designing and implementing the policy-aware switching layer [4] — a new mechanism to overhaul the ad hoc manner in which middleboxes are deployed in data centers today — the non-availability of clear information about how some middleboxes process packets led to initial design decisions that were wrong and that later manifested as hard-to-debug errors while testing. In this article, we present a general model to clearly and succinctly describe the functionality of a middlebox and deployment configurations. Through sets of pre-conditions and processing rules, the model describes the types of packets expected by a middlebox and how it transforms them. Later, we provide more details of our model and illustrate it by representing four common middleboxes. The middlebox model provides a standard language to concisely describe different middleboxes. We are building an
online repository of middlebox models at http://www.middlebox.org, which we envision as filled with models of various commonly used middleboxes. To ease model construction, we prototyped a tool that infers hints about the operations of a particular middlebox through black box testing. We also prototyped a tool that validates the operations of a middlebox against its model and thus helps detect unexpected behavior. We discuss these and other applications of our model later.
The Model RFC 3234 [1] defines a middlebox as “an intermediary device performing functions other than the normal, standard functions of an IP router on the datagram path between a source host and destination host.” We refine this high-level definition of a middlebox to construct a simple model that describes various aspects of middlebox functionality and operations. A middlebox in our model consists of zones, input pre-conditions, state databases, processing rules, auxiliary traffic, and the interest and state fields deduced from the processing rules. In this section, we describe and illustrate our model using four common middleboxes — firewall, NAT, layer-4 load balancer, and SSL-offload capable layer-7 load balancer. Table 1 describes the notations used.
Interfaces and Zones Packets enter and exit a middlebox through one or more of its physical network interfaces. Each physical interface belongs to one or more logical network zones. A zone represents a packet entry and exit point from the perspective of middlebox functionality. A middlebox processes packets differently based on their ingress and egress zones. For example, the firewall shown in Fig. 1a has two physical interfaces, one belonging to the red zone that represents the insecure external network, and the other belonging to the green zone representing the secure internal network. Packets entering through the red zone are more stringently checked than those entering through the green zone. Similarly, the NAT in Fig. 1b has two different physical network interfaces, one belonging to the internal network (zone int) and the other belonging to the external network (zone ext). The source IP and port number are rewritten for packets received at zone int, whereas the destination IP and port number are rewritten for packets received at zone ext. Figure 1c shows a load balancer with a single physical network interface that belongs to two different zones — zone inet representing the
∧: Logical AND operation
!: Logical NOT operation
sm: Source MAC (layer 2) address
dm: Destination MAC (layer 2) address
si: Source IP (layer 3) address
di: Destination IP (layer 3) address
sp: Source TCP/UDP (layer 4) port
dp: Destination TCP/UDP (layer 4) port
p: Packet
[hd]: Packet with header h and payload d
5tpl: Packet 5-tuple: si, di, sp, dp, proto
Xrev: Swaps any source-destination IP, MAC, or port number pairs in X
Z(A, p): True if packet p arrived at or departed zone A
I(P, p): Input precondition; true if packet p matches pattern P
C(p): Condition specific to middlebox functionality
newflow?(p): True if packet p indicates a new flow, e.g., TCP SYN
set(A, key → val): Stores the specified key-value pair in zone A's state database
S : get?(A, key): Returns true and assigns val to S if key → val is present in zone A's state database
Table 1. Notations used in this article.
Internet and zone srvr representing the Web server farm. The load balancer spreads out packets received at zone inet to Web server instances in zone srvr. We assume that the mapping between interfaces and zones is pre-determined by the middlebox vendor or configured during middlebox initialization. Frames reaching an interface belonging to multiple zones are distinguished by their virtual local area network (VLAN) tags, IP addresses, and/or transport port numbers.
Input Preconditions Input preconditions specify the types of packets that are accepted by a middlebox for processing. For example, a transparent firewall processes all packets received by it, whereas a load balancer in a single-legged configuration processes a packet arriving at its inet zone only if the packet is explicitly addressed to it at layers 2, 3, and 4. Similarly, a NAT processes all packets received at its int zone, but requires those received at its ext zone to be addressed to it at layers 2 and 3. Input pre-conditions are represented using a clause of the form I (P, p), which is true if the headers and contents of packet p match the pattern P. For example, the firewall has the input precondition I (< * >, p), and the load balancer has I (< dm = MAC LB , di = IP LB , dp = 80 >, p) for its inet zone, where MAC LB and IP LB are the layer-2 and layer-3 addresses of the load balancer. Although I (< * >, p) is a tautology, we still explicitly specify it in the firewall model to enhance model clarity.
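A precondition clause I(P, p) is essentially a pattern match on packet header fields. The short Python sketch below is our own illustration of this idea (the MAC and IP constants are made up, and the dictionary-based packet representation is not part of the paper's model), showing the firewall's wildcard precondition and the load balancer's inet-zone precondition.

def matches(pattern, packet):
    """I(P, p): true if every field named in the pattern has the required value in the packet.
    An empty pattern plays the role of the wildcard <*> and matches any packet."""
    return all(packet.get(field) == value for field, value in pattern.items())

# Firewall: I(<*>, p) accepts every packet for processing.
FIREWALL_PRE = {}

# Load balancer, zone inet: I(<dm = MAC_LB, di = IP_LB, dp = 80>, p).
MAC_LB = "00:16:3e:00:00:01"   # illustrative addresses, not taken from the article
IP_LB = "192.0.2.10"
LB_INET_PRE = {"dm": MAC_LB, "di": IP_LB, "dp": 80}

if __name__ == "__main__":
    pkt = {"sm": "00:16:3e:aa:bb:cc", "dm": MAC_LB, "si": "198.51.100.7",
           "di": IP_LB, "sp": 51514, "dp": 80, "proto": "tcp"}
    print(matches(FIREWALL_PRE, pkt), matches(LB_INET_PRE, pkt))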
State Database Most middleboxes maintain state associated with the flows and sessions they process. Our model represents state using key-value pairs stored in zone-independent or zone-specific state databases. Processing rules (described next) record the state using the set primitive and query state using the get? primitive. Accurately tracking state removal is hard, unless explicitly specified by the del primitive in a processing rule. Although state expiration timeouts can be specified as part of the set primitive, inaccuracies in timeout values or in their fine-
grained measurement can cause discrepancies between the model-predicted behavior of a middlebox and its actual operations. As we illustrate in the next section, we use special processing rules to flag such possible discrepancies.
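The set, get?, and del primitives amount to a small per-zone key-value store. The sketch below is a minimal, hypothetical rendering of such a store in Python, with an expiration timeout added to mirror the timing discrepancy discussed above; it is not code from the paper.

import time

class ZoneStateDB:
    """Per-zone state database offering the set / get? / del primitives of the model."""

    def __init__(self):
        self._store = {}                      # key -> (value, expiry or None)

    def set(self, key, value, timeout=None):
        expiry = time.time() + timeout if timeout is not None else None
        self._store[key] = (value, expiry)

    def get(self, key):
        """get?(zone, key): returns (True, value) if present and not expired."""
        if key not in self._store:
            return False, None
        value, expiry = self._store[key]
        if expiry is not None and time.time() > expiry:
            # The real middlebox may expire this state earlier or later than we assume,
            # which is exactly the discrepancy the special processing rules flag.
            del self._store[key]
            return False, None
        return True, value

    def delete(self, key):
        self._store.pop(key, None)

if __name__ == "__main__":
    db = ZoneStateDB()
    db.set(("10.0.0.5", 5060), 61000, timeout=30)
    print(db.get(("10.0.0.5", 5060)))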
Processing Rules Processing rules model the core functionality of a middlebox. A processing rule specifies the action taken by a middlebox when a particular condition becomes true. For example, the processing of an incoming packet is represented by a rule of the general form: Z(A, p) ∧ I (P, p) ∧ C (p) ⇒ Z (B, T (p)) ∧ state ops The above rule indicates that a packet p reaching zone A of the middlebox is transformed to T(p) and emitted out through zone B, if it satisfies the input precondition I(P, p) and a middlebox-specific condition C(p). In addition, the middlebox may update state associated with the TCP flow or application session to which the packet belongs. We now present concrete examples of processing rules for common middleboxes. Firewall — First, consider a simple stateless layer-4 firewall that either drops a packet received on its red zone or relays it unmodified to the green zone. This behavior can be represented using the following two rules: Z(red, p) ∧ I(< * >, p) ∧ Caccept(p) ⇒ Z(green, p) Z(red, p) ∧ I(< * >, p) ∧ Cdrop(p) ⇒ DROP(p)
Since I(< * >, p) is a tautology, whether a packet is dropped or accepted by the firewall is solely determined by the Caccept and Cdrop clauses that represent the filtering functionality of the firewall. Common filtering rules can be represented easily using the appropriate Boolean expressions (e.g., Caccept(p) : p.di = 80 || p.si = 128.34.45.6). For more complex filtering rules, we leverage external middlebox-specific
21
(iv), is keyed by [h.si, h.sp, h.di, h.dp] rather than by just [h.si, h.sp]. A symmetric NAT is also more restrictive than a full cone NAT. It relays a packet with header [IPs, IPNAT, PORTs, PORTd] from the ext zone only if it had earlier received a packet destined to IPs:PORTs at the int zone and had rewritten its source port to PORTd. This restrictive behavior is captured by keying the zone ext state set in rule (i) and retrieved in rules (iii) and (v) with [h.di, h.dp, newport] rather than with just newport. Other NAT types, like restricted cone and port restricted cone, can be easily represented with similar minor modifications.
Figure 1. Zones of different middleboxes: a) firewall; b) NAT; and c) load balancer in single-legged configuration.
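Read operationally, rules (i) through (iii) of the full cone NAT map onto straightforward code. The following Python sketch is our interpretation of those rules under simplifying assumptions (made-up public addresses, plain dictionaries for the zone state databases, and no state expiration); it is not an implementation from the article.

import itertools

# Illustrative constants standing in for the NAT's public identity and next-hop gateway.
MAC_NAT, IP_NAT, MAC_GW = "00:16:3e:00:00:02", "203.0.113.1", "00:16:3e:00:00:03"

int_db, ext_db = {}, {}            # zone int / zone ext state databases
ports = itertools.count(1024)      # stand-in for the NAT's port allocation mechanism

def snat_fwd(h, port):
    """SNATfwd: source MAC/IP become the NAT's public identity, the source port becomes
    the allocated port, and the destination MAC becomes the next-hop gateway."""
    return {**h, "sm": MAC_NAT, "dm": MAC_GW, "si": IP_NAT, "sp": port}

def snat_rev(h, si, sp):
    """SNATrev: restore the internal destination recorded for this mapping (MAC handling omitted)."""
    return {**h, "di": si, "dp": sp}

def from_int(h, payload):
    """Rules (i) and (ii): a packet [h d] arrives at zone int."""
    key = (h["si"], h["sp"])
    if key not in int_db:          # rule (i): previously unseen [si, sp] pair
        newport = next(ports)
        int_db[key] = newport
        ext_db[newport] = key
    newport = int_db[key]          # rule (ii): reuse the port recorded earlier
    return "ext", snat_fwd(h, newport), payload

def from_ext(h, payload):
    """Rules (iii) and (iv): a packet arrives at zone ext for one of our external ports."""
    if h["dp"] not in ext_db:
        return "drop", None, None  # rule (iv): no (or expired) state, so flag/drop
    si, sp = ext_db[h["dp"]]
    return "int", snat_rev(h, si, sp), payload

if __name__ == "__main__":
    zone, hdr, _ = from_int({"sm": "aa", "dm": MAC_NAT, "si": "10.0.0.5", "di": "198.51.100.7",
                             "sp": 5060, "dp": 5060, "proto": "udp"}, b"hello")
    print(zone, hdr)
    print(from_ext({"sm": "bb", "dm": MAC_NAT, "si": "198.51.100.7", "di": IP_NAT,
                    "sp": 5060, "dp": hdr["sp"], "proto": "udp"}, b"reply"))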
22
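To make the keying difference concrete, here is a minimal sketch in Python; it mirrors only the state-lookup behavior of the rules discussed above, assuming dictionary-based state sets and a sequential port allocator, and is not the authors' formal notation.

```python
# Minimal sketch (ours, not the authors' notation) of the keying difference
# between a full cone and a symmetric NAT in the model above.

class FullConeNat:
    def __init__(self):
        self.int_state = {}     # zone int: (si, sp) -> newport
        self.ext_state = {}     # zone ext: newport -> (si, sp)
        self.next_port = 50000

    def from_int(self, si, sp, di, dp):
        # Rules (i)/(ii): one port per (si, sp) pair.
        key = (si, sp)
        if key not in self.int_state:
            port = self.next_port
            self.next_port += 1
            self.int_state[key] = port
            self.ext_state[port] = key
        return self.int_state[key]

    def from_ext(self, ext_si, ext_sp, dp):
        # Rule (iii): any external sender may use the mapping.
        return self.ext_state.get(dp)


class SymmetricNat(FullConeNat):
    def from_int(self, si, sp, di, dp):
        # One port per (si, sp, di, dp) tuple.
        key = (si, sp, di, dp)
        if key not in self.int_state:
            port = self.next_port
            self.next_port += 1
            self.int_state[key] = port
            # zone ext keyed by (di, dp, newport): only that host may reply.
            self.ext_state[(di, dp, port)] = (si, sp)
        return self.int_state[key]

    def from_ext(self, ext_si, ext_sp, dp):
        return self.ext_state.get((ext_si, ext_sp, dp))
```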
Layer-4 Load Balancer — Next, we present a layer-4 load balancer, which, unlike the NAT in the previous example, rewrites the destination IP address of a packet to that of an available Web server (rule box 2). Rule (i) describes how the load balancer processes the first packet of a new flow received at its inet zone. The load balancer dynamically selects a Web server instance Wi for the flow and records it in the state database of the inet zone. It rewrites the destination IP and MAC addresses of the packet to Wi using the destination NAT (DNATfwd) transformation function and then emits it out through the srvr zone. It also records this flow in the state database of the srvr zone, keyed by the five-tuple of the packet expected there in the reverse flow direction. Rule (ii) specifies that subsequent packets of the flow simply will be emitted out after rewriting the destination IP and MAC addresses to those of the recorded Web server instance. Rule (iii) describes how the load balancer processes a packet received from a Web server. It verifies the existence of flow state for the packet and then emits it out through the inet zone after applying the reverse DNAT transformation — that is, rewriting the source IP and MAC addresses to those of the load balancer and the destination MAC to the next hop IP gateway.
Although the Web server instance selection mechanism is beyond the scope of our general model, the load balancer model easily can be augmented with primitives to represent common selection mechanisms like least loaded and round robin. In the previous example, we assumed that the load balancer was set as the default IP gateway at each Web server. Other load balancer deployment configurations (e.g., direct server return or source NAT) can be represented with minor modifications.

n Rule box 2. Processing rules for the layer-4 load balancer.

Layer-7 Load Balancer — We now present our most complex example, a layer-7 SSL offload-capable load balancer. This example illustrates how our model describes a middlebox whose processing spans both packet headers and contents and is not restricted to one-to-one packet transformations. The layer-7 load balancer is the end point of the TCP connection from a client (the CL connection). Because accurately modeling TCP is very hard, we abstract it using a black box TCP state machine tcpCL and buffer the data received from the client in a byte queue DCL. The I clauses are similar to those in the layer-4 load balancer and hence not repeated in rule box 3. Rule (i) specifies that the load balancer creates tcpCL and DCL and records them along with the packet header on receiving the first packet of a new flow from a client at the inet zone. Rule (ii) specifies how the TCP state and data queue of the CL connection are updated as the packets of an existing flow arrive from the client. Rule (iii), triggered when tcpCL has data or acknowledgments to send, specifies that packets from the load balancer to the client will have header hrevCL (with appropriate sequence numbers filled in by tcpCL) and payload read from the DLS queue, if it was already created by the firing of rule (iv). Rule (iv), triggered when the data collected in DCL is sufficient to parse the HTTP request URL and/or cookies, specifies that the load balancer selects a Web server instance Wi and opens a TCP connection to it, that is,
creates tcpLS and DLS. It also installs a pointer to the state indexed by the DNATed header hLS in the state database of the srvr zone. Rule (v) shows how this state is retrieved, and its tcpLS and DLS are updated, on receipt of a packet from a Web server. Rule (vi) specifies the header and payload of packets sent by the load balancer to a Web server instance — hLS and data read from DCL.
The rules listed above represent a plain layer-7 load balancer. By replacing the + and read data queue operations with +ssl and readssl operations that perform SSL encryption and decryption on the data, we can represent an SSL offload-capable load balancer without disturbing other rules. Similar to the TCP black box, we abstract out the details of the SSL protocol.

n Rule box 3. Processing rules for the layer-7 SSL offload-capable load balancer.
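As a rough illustration of the layer-7 behavior just described, the following sketch terminates the client "connection" locally, buffers data until the request URL can be parsed, and only then binds the flow to a server. TCP and SSL are abstracted away entirely, just as the model does with its black boxes; the class and the hash-based server selection are our own illustrative choices, not part of the model.

```python
# Rough sketch of the layer-7 flow handling described above; TCP and SSL
# are abstracted away, as in the model's tcpCL/tcpLS and DCL/DLS black boxes.

class Layer7Flow:
    def __init__(self, servers):
        self.servers = servers
        self.d_cl = b""      # bytes buffered from the client (DCL)
        self.server = None   # chosen Web server instance (Wi)

    def from_client(self, data: bytes):
        # Rules (i)/(ii): buffer client data.
        self.d_cl += data
        # Rule (iv): once the request line is parseable, pick a server.
        if self.server is None and b"\r\n" in self.d_cl:
            request_line = self.d_cl.split(b"\r\n", 1)[0]
            parts = request_line.split(b" ")
            url = parts[1] if len(parts) > 1 else b"/"
            self.server = self.servers[hash(url) % len(self.servers)]

    def to_server(self) -> bytes:
        # Rule (vi): relay buffered data once a server has been chosen.
        if self.server is None:
            return b""
        out, self.d_cl = self.d_cl, b""
        return out

flow = Layer7Flow(["10.0.0.1", "10.0.0.2"])
flow.from_client(b"GET /index.html HTTP/1.1\r\nHost: example.com\r\n\r\n")
print(flow.server, flow.to_server())
```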
Auxiliary Traffic
In addition to its core functionality of transforming and forwarding packets, a middlebox can generate additional traffic, either independently or when triggered by a received packet. For example, a load balancer periodically checks the liveness of its target servers by making TCP connections to each server. It also can send an Address Resolution Protocol (ARP) request for the layer-2 address of the Web server assigned to a received packet. Such packets generated by middleboxes and their responses, which support middlebox functionality, are referred to as auxiliary traffic in our model.
Auxiliary traffic is represented using processing rules as well. For example, the auxiliary traffic associated with the load balancer can be represented as shown in rule box 4. The PROBE function returns a set of packets to check the liveness of server Wi. In the simple case, these are just the TCP handshake packets with the appropriate sm, dm, si, di, sp, and dp.

n Rule box 4. Auxiliary traffic rules for the load balancer: periodic liveness probes and ARP resolution of server addresses.
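A small sketch of how the auxiliary traffic rules could be exercised, assuming a periodic timer for probes and a local ARP cache; the PROBE and ARPREQ packets are represented by simple tuples rather than real packets.

```python
# Sketch of the auxiliary traffic rules: periodic liveness probes for every
# target server and on-demand ARP requests for unknown server MAC addresses.
import time

class AuxiliaryTraffic:
    def __init__(self, server_ips, probe_interval=5.0):
        self.server_ips = list(server_ips)
        self.probe_interval = probe_interval
        self.last_probe = 0.0
        self.arp_cache = {}            # IP -> MAC, filled by ARP replies

    def due_probes(self):
        now = time.monotonic()
        if now - self.last_probe >= self.probe_interval:
            self.last_probe = now
            return [("PROBE", ip) for ip in self.server_ips]
        return []

    def arp_if_needed(self, server_ip):
        if server_ip not in self.arp_cache:
            return ("ARPREQ", server_ip)
        return None

    def on_arp_reply(self, ip, mac):
        self.arp_cache[ip] = mac       # record IP -> MAC
```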
Interest and State Fields
The interest fields of a middlebox identify the packet fields of interest, that is, the fields it reads or modifies. The state fields identify the subset of the interest fields used by the middlebox in storing and retrieving state. Although these fields can be deduced from the processing rules, they are explicitly presented in the model because they can succinctly highlight unexpected aspects of middlebox processing.
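For illustration only, the interest and state fields of two of the models above might be written down as plain sets; the exact field names below are our own shorthand, not the paper's notation.

```python
# Illustrative only: interest and state fields for two of the models above,
# written as plain sets (shorthand field names, not the paper's notation).
NAT_FIELDS = {
    "interest": {"src_mac", "dst_mac", "src_ip", "dst_ip", "src_port", "dst_port"},
    "state":    {"src_ip", "src_port", "dst_ip", "dst_port"},   # symmetric NAT keying
}

L4_LOAD_BALANCER_FIELDS = {
    "interest": {"src_mac", "dst_mac", "src_ip", "dst_ip", "src_port", "dst_port"},
    "state":    {"src_ip", "dst_ip", "src_port", "dst_port", "protocol"},  # 5-tuple
}
```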
Utility of a Middlebox Model
A middlebox model is useful only if it can easily represent many real-world middleboxes and has practical applications. In this section, we first describe how we constructed the models described in the previous section and then discuss the applications of our model in planning and troubleshooting existing middlebox deployments and in guiding the development of new network architectures.
Model Instances
The models for the firewall, NAT, and layer-4 and layer-7 load balancers illustrated in the previous section were constructed by analyzing generic middlebox descriptions and taxonomies (like RFC 3234 [1]), consulting middlebox-specific manuals, and observing the working of the following real-world middleboxes:
• Linux Netfilter/iptables software firewall
• Netgear home NAT
• BalanceNg layer-4 software load balancer
• HAProxy layer-7 load balancer VMware appliance
We prototyped a black box testing-based model-inference tool to aid middlebox model construction. The tool infers hints about the operations of a middlebox by carefully sending different kinds of packets on one zone and observing the packets emerging from other zones, as illustrated in Fig. 2. The following are some of the inferences generated by it:
• The firewall does not modify packets; all packets sent by the tool emerge unmodified or are dropped.
• The load balancers only process packets addressed to them at layers 2, 3, and 4.
• The layer-4 load balancer rewrites the destination IP and MAC addresses of packets in the inet → srvr direction and the source addresses in the reverse direction. This inference was made by pairing and analyzing packets with identical payloads seen at the two zones of the load balancer. By using a relaxed payload similarity metric, the header rewriting rules for even the layer-7 load balancer were partially inferred.
• The layer-4 load balancer caches source MAC addresses of packets processed by it in the inet → srvr direction and uses them in packets in the reverse direction. This inference was made by correlating rewritten packet header fields with values seen in earlier packets.
Our inference tool is quite basic and serves only as an aid for model construction. It is not fully automated; for example, it requires the IP address and TCP port of the load balancer as input to avoid an exhaustive IP address search for packets
accepted by it. The inferred packet header transformation rules and state fields may not be 100 percent accurate and thus only serve to guide further analysis. For middleboxes like SSL offload boxes that completely transform packet payloads, the tool cannot infer the processing rules. We believe that completely inferring middlebox models through black box testing alone is impossible. If the source code for a middlebox implementation were available, we hypothesize that automatic white box software test-generation tools like directed automated random testing (DART) [5] can be adapted to infer middlebox model parameters. Automatically parsing middlebox configuration manuals to extract models is another open research direction.
We envision an online repository containing models of common middleboxes. We set up a pilot version of such a repository at http://www.middlebox.org with the models described in this article. We hope that middlebox manufacturers and network administrators who use middleboxes will contribute additional models to the repository. We also prototyped a model validation tool that analyzes traffic traces collected from the different zones of a middlebox and verifies whether its operations are consistent with its model downloaded from the repository. Apart from flagging errors and incompleteness in the models themselves, the validation tool can be used to detect unexpected middlebox behavior, as we describe next.
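The payload-pairing heuristic used by the inference tool can be sketched as follows, assuming captures from two zones are available as (header, payload) pairs; matching on identical payloads and diffing the headers yields candidate rewrite rules. This is a simplified sketch of the idea, not the actual tool.

```python
# Sketch of the payload-pairing heuristic: match packets captured at two
# zones on identical payloads and report header fields that differ.

def infer_rewrites(zone_a_pkts, zone_b_pkts):
    """Each packet is a (header_dict, payload_bytes) pair."""
    by_payload = {payload: hdr for hdr, payload in zone_b_pkts}
    rewrites = set()
    for hdr_a, payload in zone_a_pkts:
        hdr_b = by_payload.get(payload)
        if hdr_b is None:
            continue                      # no identical payload seen on zone b
        for field, value in hdr_a.items():
            if value != hdr_b.get(field):
                rewrites.add((field, value, hdr_b.get(field)))
    return rewrites

inet = [({"dst_ip": "1.2.3.4", "dst_mac": "aa:aa"}, b"GET / HTTP/1.1\r\n\r\n")]
srvr = [({"dst_ip": "10.0.0.1", "dst_mac": "bb:bb"}, b"GET / HTTP/1.1\r\n\r\n")]
print(infer_rewrites(inet, srvr))   # candidate dst_ip and dst_mac rewrites
```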
Network Planning and Troubleshooting
The middlebox model clearly describes how various middleboxes under different configurations interact with the network and with each other in a standard and concise format. This information aids in planning new middlebox deployments and in monitoring and troubleshooting existing ones.
The input preconditions of a middlebox specify the types of packets expected by it and thus help a network architect plan the network topology and middlebox placement required to deliver the correct packets to it. The input preconditions and processing rules together help in analyzing the feasibility of placing different middleboxes in sequence. For example, because the right-hand sides of the firewall processing rules do not interfere with the conditions on the left-hand sides of the load balancer processing rules, the firewall can be placed in front of the load balancer with little scrutiny. However, placing the load balancer before the firewall requires more careful analysis, as the destination address rewriting indicated by the processing rules of the load balancer may interfere with the Caccept and Cdrop clauses of the firewall.
The middlebox processing rules specify the packets flowing in
different parts of a network. This information can be used to statically analyze and detect problems with a middlebox deployment before actual network rollout. It also aids in troubleshooting existing middlebox deployments and enhances automated traffic monitoring and anomaly detection.
For example, the model validation tool helped us detect unexpected NAT behavior in the home network of one of the authors. The author's home NAT was not rewriting the source port numbers of the packets sent by internal hosts. The tool automatically flagged this behavior as a violation of rules (i) and (ii) of our NAT model. We expected the multi-interface home NAT to use source port translation to support simultaneous TCP connections to the same destination from the same source port on multiple internal hosts. The failure of such simultaneous TCP connections on further investigation confirmed the anomaly. Although a small example, this experience indicates that our middlebox model holds practical utility in detecting unexpected middlebox behavior.
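A validation check of this kind could look roughly like the sketch below, assuming the tool has already paired internal and external packets of the same flow (the flow_id field is a hypothetical helper for that pairing); under the NAT model above, which always allocates a new source port, an unchanged port is flagged.

```python
# Sketch of a validation check: under the NAT model above (rules (i)/(ii)),
# the source port seen on the external zone should be a freshly allocated
# port, so an unchanged source port is flagged as an anomaly.

def check_source_port_rewriting(int_pkts, ext_pkts):
    """Packets are dicts with 'src_port' and a 'flow_id' that the validation
    tool derives by pairing packets across zones (hypothetical helper)."""
    ext_by_flow = {p["flow_id"]: p for p in ext_pkts}
    violations = []
    for p in int_pkts:
        q = ext_by_flow.get(p["flow_id"])
        if q is not None and q["src_port"] == p["src_port"]:
            violations.append(p["flow_id"])   # port not rewritten by the NAT
    return violations

internal = [{"flow_id": 1, "src_port": 52001}]
external = [{"flow_id": 1, "src_port": 52001}]   # NAT kept the port
print(check_source_port_rewriting(internal, external))   # -> [1]
```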
Guide Networking Research
Our middlebox model provides networking researchers with clear and concise descriptions of how various middleboxes operate. Such information is very useful for researchers, as well as companies involved in developing new network architectures, especially those that deal with middleboxes [6]. Not only does it provide hints to make a new architecture compatible with existing middleboxes, but it also helps identify middleboxes that cannot be supported.
In retrospect, the availability of a middlebox model would have greatly benefited our research on designing the policy-aware switching layer (PLayer) [4], alluded to earlier. The PLayer consists of enhanced layer-2 switches (pswitches) that explicitly forward packets to the middleboxes specified by a network administrator. In our original (erroneous) design, pswitches rewrote the source MAC addresses of packets processed by a transparent firewall to a unique dummy MAC address to mark packets that had already been processed by the firewall. Contrary to our expectation that the load balancer would use ARP, it cached the dummy source MAC addresses of packets in the forward flow direction and used them to address packets in the reverse direction. Such packets never reached their intended destinations. The presence of the source MAC address in the interest and state fields of the load balancer would have helped us debug this problem more quickly. Moreover, it would have warned us against rewriting the source MAC address in our original design, thus avoiding a time-consuming redesign.
Limitations
The model presented in this article is only a first step toward modeling middleboxes. Its three main limitations are:
• The inability to describe highly specific middlebox operations in detail
• The lack of formal coverage proofs
• The complexity of model specification
The goal of building a general middlebox model that can describe a wide variety of middleboxes precludes our model from representing functionality that is very specific to a particular middlebox. We can extend our model easily using middlebox-specific models like the Unified Firewall Model as described earlier, although at the expense of reducing model simplicity and conciseness. The desire for simplicity and conciseness also limits our model from capturing accurate timing and causality between triggering of different processing rules.
On the other hand, our model may not be general enough to describe all possible current and future middleboxes. Although we represented many common middleboxes in our model and are not aware of any existing middleboxes that cannot be represented, we are unable to formally prove that our model covers all possible middleboxes.
The model for a particular middlebox consists of a small number (typically < 10) of processing rules. However, constructing the model itself is a non-trivial task even with support from our model inference and validation tools. We expect models to be constructed by experts and shared through an online model repository, thus making them easily available to all, without requiring widespread model construction skills.

n Figure 2. Middlebox model inference tool analyzing a load balancer.

Related Work
The middlebox model described in this article is placed at an intermediate level in between related work on very general network communications models and very specific middlebox models. An axiomatic basis for communication [7] presents a general network communications model that axiomatically formulates packet forwarding, naming, and addressing. This article presents a model tailored to represent middlebox functionality and operations. The processing rules and state database in our model are similar to the forwarding primitives and local switching table in [7]. As part of future work, we plan to investigate the integration of the two models and thus combine the practical benefits of our middlebox model (e.g., middlebox model inference and validation tools, model repository) and the theoretical benefits of the general communications model (e.g., formal validation of packet forwarding correctness through chains of middleboxes).
Predicate routing [8] attempts to unify security and routing by declaratively specifying network state as a set of Boolean expressions dictating the packets that can appear on various links connecting together end nodes and routers. This approach can be extended to represent a subset of our middlebox model. For example, Boolean expressions on the ports and links (as defined by predicate routing) of a middlebox can specify the input preconditions of our model and indirectly hint at the processing rules and transformation functions. From a different perspective, middlebox models from our repository can aid the definition of the Boolean expressions in a network implementing predicate routing.
Reference [9] uses statistical rule mining to automatically group together commonly occurring flows and learn the underlying communication rules in a network. Our work has a narrower and more detailed focus on how middleboxes operate. Reference [10] uses detailed measurement techniques to evaluate the performance and reliability of production middlebox deployments. We plan to investigate how the techniques described in these papers can enhance our model inference and validation tools.
RFC 3234 [1] presents a taxonomy of middleboxes. Our model goes well beyond a taxonomy and describes middlebox packet processing in more detail using a concise and standard language. In addition, our model can naturally induce a more fine-grained taxonomy on middleboxes (e.g., "middleboxes that rewrite the destination IP and port number" versus "middleboxes operating at the transport layer"). Our model does not currently consider the middlebox failover modes and functional versus optimizing roles identified by RFC 3234.
The Unified Firewall Model [2] and the IETF BEHAVE [3] working group characterize the functionality and behavior of specific middleboxes — firewalls and NATs in this case. Guided by these efforts, we construct a general model that applies to a wide range of middleboxes and enables us to compare different middleboxes and study their interactions. Furthermore, these specific models can be plugged into our general model and alleviate the limitations of model generality.

Conclusion
In this article, we presented a simple middlebox model and illustrated how various commonly used middleboxes can be described by it. The model guides middlebox-related research and aids middlebox deployments. Our work is only an initial step in this direction and calls for the support of the middlebox research and user communities to further refine the model and to contribute model instances for the many different kinds of middleboxes that exist today.

References
[1] "Middleboxes: Taxonomy and Issues," RFC 3234.
[2] G. J. Nalepa, "A Unified Firewall Model for Web Security," Advances in Intelligent Web Mastering.
[3] "Behavior Engineering for Hindrance Avoidance"; http://www.ietf.org/html.charters/behave-charter.html
[4] D. Joseph, A. Tavakoli, and I. Stoica, "A Policy-Aware Switching Layer for Data Centers," Proc. SIGCOMM, 2008.
[5] P. Godefroid, N. Klarlund, and K. Sen, "DART: Directed Automated Random Testing," Proc. PLDI, 2005.
[6] M. Walfish et al., "Middleboxes No Longer Considered Harmful," Proc. OSDI, 2004.
[7] M. Karsten et al., "An Axiomatic Basis for Communication," Proc. SIGCOMM '07.
[8] T. Roscoe et al., "Predicate Routing: Enabling Controlled Networking," SIGCOMM Comp. Commun. Rev., vol. 33, no. 1, 2003.
[9] S. Kandula, R. Chandra, and D. Katabi, "What's Going On? Learning Communication Rules in Edge Networks," Proc. SIGCOMM, 2008.
[10] M. Allman, "On the Performance of Middleboxes," Proc. IMC, 2003.

Biographies
DILIP JOSEPH ([email protected]) received his B.Tech. degree in computer science from the Indian Institute of Technology, Madras, in 2004 and his M.S. degree in computer science from the University of California at Berkeley in 2006. He is currently a Ph.D. candidate at the University of California at Berkeley. His research interests include data center networking, middleboxes, and new Internet architectures.
ION STOICA ([email protected]) received his Ph.D. from Carnegie Mellon University in 2000. He is an associate professor in the EECS Department at the University of California at Berkeley, where he does research on peer-to-peer network technologies in the Internet, resource management, and network architectures. He is the recipient of the 2007 Rising Star Award, a Sloan Foundation Fellowship (2003), a Presidential Early Career Award for Scientists and Engineers (PECASE) (2002), and the ACM doctoral dissertation award (2001). In 2006 he co-founded Conviva, a startup company to commercialize peer-to-peer technology for video distribution.
Network Address Translation for the Stream Control Transmission Protocol

Michael Tüxen and Irene Rüngeler, Münster University of Applied Sciences
Randall Stewart, The Resource Group
Erwin P. Rathgeb, University of Duisburg-Essen

Abstract
Network address translation is widely deployed in the Internet and supports the Transmission Control Protocol and the User Datagram Protocol as transport layer protocols. Although part of the kernels of all recent Linux distributions, the FreeBSD 7, and the Solaris 10 operating systems, the new Internet Engineering Task Force transport protocol — the Stream Control Transmission Protocol — is not supported on most NAT middleboxes yet. This article discusses the deficiencies of using existing NAT methods for SCTP and describes a new SCTP-specific NAT concept. This concept is analyzed in detail for several important network scenarios, including peer-to-peer, transport layer mobility, and multihoming.
Network address translation (NAT) is a common method for separating private networks from global networks by translating private Internet Protocol (IP) addresses to global IP addresses. Often there is only one global IP address available for multiple hosts inside the private network. In this case, the transport layer port number also is modified, and the method is called network address and port number translation (NAPT). NAT and NAPT have been in use for the Transmission Control Protocol (TCP) and the User Datagram Protocol (UDP) for a long time, but the Stream Control Transmission Protocol (SCTP), as a fairly new transport protocol, is not supported yet. Applying this method also to SCTP does not work for multihomed associations. Currently, the first NAT implementations that support SCTP in a way similar to TCP or UDP are being developed. Although this works well for single-homed SCTP associations, it does not work for multihomed SCTP associations. This makes these solutions not applicable for typical SCTP applications that require multihoming. However, in these cases, some vendors and operators also want to use NAT middleboxes for various reasons. Therefore, it is important to have NAT middleboxes that support SCTP not only in a limited way, but with all features, especially multihoming.
In [1] and [2], the authors of this article describe an approach to integrate SCTP in network address translators for single-homed client-server communication. This article extends this method in a way that also works in the case of multihomed and peer-to-peer scenarios. Additionally, it covers the case of transport layer mobility or routing changes in the network. These additions also will be provided to the Internet Engineering Task Force (IETF) for standardization.
The structure of this article is as follows: first, we provide an introduction to SCTP, emphasizing the features that are relevant for this article. We discuss generic NAT and NAPT methods for traffic based on TCP or UDP and for traffic that
is not based on these protocols. Their applicability for SCTP is analyzed. An SCTP-specific method for NAT middleboxes that overcomes the deficiencies of the generic methods is described. Several examples are given explaining in detail how the SCTP-specific NAT method works for different scenarios including single-homed and multihomed client-server scenarios, peer-to-peer scenarios, and transport-layer mobility scenarios. Then, conclusions are presented.
Introduction to the Stream Control Transmission Protocol
SCTP is currently specified in [5]. It was standardized by the IETF as the generic transport protocol for signaling transport in IP-based telephone signaling networks. SCTP is a connection-oriented protocol providing reliable transport of user messages. It supports IPv4 and IPv6 as a network layer. A connection between two SCTP end points is called an SCTP association or just an association.
One of the major design goals was network fault tolerance, and therefore, each SCTP end point can use multiple IP addresses within each association but only one port number. Each IP address of the peer can be used as the destination address of a packet. Currently, this multihoming support is used only for redundancy, but ongoing research is analyzing the possibility of also using it for load sharing.
SCTP is already part of all recent Linux distributions, the Solaris 10 operating system, and the FreeBSD 7.0 release. It is deployed in signaling networks of telephony network operators and is used in IP-based signaling for universal mobile telecommunication system (UMTS) networks. Other applications using SCTP include the IP Flow Information Export (IPFIX) protocol, Diameter, and the Reliable Server Pooling (RSerPool) protocol suite. It should be noted that SCTP was the first transport protocol specified by the IETF in 2000 and
deployed in commercial networks after the introduction of UDP and TCP in the 1980s. Four years later, a modification of UDP with limited checksum coverage — UDP-Lite — was standardized and is used in Third Generation Partnership Project (3GPP) networks. In 2006, the IETF standardized the Datagram Congestion Control Protocol (DCCP). Currently, neither UDP-Lite nor DCCP is available on major operating systems.
An SCTP packet consists of a common header followed by a number of chunks. The common header contains source and destination port numbers similar to TCP or UDP headers, a 32-bit verification tag, and a CRC32C checksum. The checksum covers only the SCTP packet and does not take any kind of pseudo header into account. Each chunk consists of a type field, eight flags, a length field, and type-specific data. Furthermore, it is padded at the end to be 32-bit aligned.
The basic association setup procedure is based on a four-way handshake and follows the client-server principle. It is shown on the left-hand side of Fig. 1. The first SCTP message is sent from the client to the server. It contains exactly one chunk, the initiation (INIT) chunk. The INIT chunk contains a 32-bit random number, the initiate tag, and the list of IP addresses used by the client. The server responds with an SCTP message, which also contains just one chunk, the initiation acknowledge (INIT-ACK) chunk. It also contains a 32-bit random initiate tag and the list of addresses of the server. If the client or server is single-homed, the list of addresses in the INIT or INIT-ACK chunk should be empty. After the server has sent the INIT-ACK chunk, it does not hold any state regarding the association. Instead, it puts all information in a state cookie, which itself is put into the INIT-ACK chunk. On reception of the INIT-ACK chunk, the client sends the state cookie in a COOKIE-ECHO chunk to the server. On reception of the COOKIE-ECHO chunk, the server responds with a COOKIE-ACK chunk, and the association is established. Other chunks might be bundled with the COOKIE-ECHO or COOKIE-ACK chunk in the third and fourth message.

n Figure 1. Examples of the SCTP association setup.

The verification tag in the common header is always the initiate tag sent by the peer in the INIT or INIT-ACK message during the association setup. This is used to protect associations against blind attacks. Only the common header of the packet containing the INIT chunk has the verification tag 0. It is important to note that most SCTP implementations use the verification tag for looking up the association when a packet is received. Some implementations even ensure that the verification tags are unique across all associations currently known.
SCTP supports not only the client-server model for association setup, but also the more general peer-to-peer model. Both end points can start the four-way handshake at about the same time, and the SCTP setup procedure ensures that exactly one association is established. This is called a collision case. An example message flow is shown on the right-hand side in Fig. 1.
It is also possible that one side starts the association procedure while the peer is still in the established state. This might happen, for example, if one side reboots without tearing down the association and then starts the association setup procedure. The four-way handshake succeeds, and for the server side, the association restarts. One example is shown in the middle of Fig. 1; detailed descriptions of the handling of all the possible cases are given in [9].
If an SCTP end point must terminate an association immediately, it can send a packet containing an ABORT chunk. This chunk also is sent in response to almost all packets for which no association can be looked up. On reception of an ABORT chunk, the association is terminated. Error conditions can be signaled by sending an ERROR chunk. ABORT and ERROR chunks can include the causes of the error in order to provide more detailed information.
In addition to the base protocol, several extensions also were standardized and implemented. The SCTP extension that is crucial for this article is the ability to add or delete IP addresses dynamically during the lifetime of an SCTP association. This is specified in [6]. If an SCTP end point wants to add or delete an IP address, it sends an address configuration change (ASCONF) chunk that contains the address to be added or deleted and an address that can be used to look up the association, the so-called lookup address. When the peer has processed an ASCONF chunk, it sends back an address configuration acknowledgment (ASCONF-ACK) chunk. There is a special rule that if the address to be added is the wildcard address (0.0.0.0 for IPv4 or ::0 for IPv6), the source address of the packet containing the ASCONF chunk is added. If the address to be deleted is the wildcard address, all addresses except for the source address of the packet containing the ASCONF chunk are deleted.

Applicability of Generic Methods for NAT or NAT Traversal

UDP or TCP-like Network Address and Port Number Translation
NAT in its original meaning is realized by changing the (private) IP address of the client to a global address of the NAT middlebox and keeping this correlation in a table (Fig. 2). Thus, the server addresses its packets to this global address, and they reach the NAT, which substitutes the destination address with the address of the client. This is a feasible method, as long as the source ports of the clients connecting to the same server are different. The source port numbers are chosen dynamically from operating system dependent ranges. Some operating systems use the port numbers between 49152 and 65535. Because many clients can be located behind the same NAT middlebox, and these clients might access a very popular server at about the same time, the chance that two clients get the same port is non-negligible.
Therefore, TCP or UDP sessions usually are translated by changing the private IP address and additionally, the private
port number to a global IP address and port number in the TCP or UDP header, respectively. This method is called NAPT. Thereby, the NAT middlebox chooses the port numbers from a pool and makes sure that no two connections to the same server obtain the same port numbers.

n Figure 2. Using basic NAT.

As the transport layer checksum of the TCP and UDP packets covers the transport header that includes the port numbers, it must be modified according to the port number change. However, the checksum used for TCP or UDP has the property that the change of the checksum can be computed only from the change of the port numbers. So this can be done very efficiently by a simple set of additions and subtractions.
It should be noted that the behavior of NAT middleboxes varies dramatically because there were no standards describing how to build them. The Behavior Engineering for Hindrance Avoidance (BEHAVE) working group of the IETF develops best current practice (BCP) documents giving requirements for NAT middlebox behavior and protocols to help applications to run over networks with NAT middleboxes.
Considering only single-homed SCTP clients and servers, it is also possible to use this NAPT concept for SCTP because it has the same port number concept as TCP and UDP. However, the transport layer checksum used by SCTP is different from the one used by UDP and TCP. This checksum does not allow the computing of the checksum change based only on the port number change. Therefore, the NAT middlebox must compute the new SCTP checksum again, based on the complete SCTP packet. This requires a substantial amount of computing power that might be reduced when the computation is performed directly by hardware.
For multihomed SCTP clients and servers, reusing the techniques from TCP and UDP becomes much harder. As we mentioned earlier, hosts can be multihomed, which means that they can simultaneously use multiple network addresses and thus can be attached to multiple networks. Therefore, the traffic of one SCTP association, in general, passes through different NAT middleboxes on different paths. Because each SCTP end point can use only one SCTP port number on all paths, the NAT middleboxes cannot change the port number independently. To apply the existing NAT concept, the NAT middleboxes involved would have to synchronize the port numbers to assign a common number for the association. This is very hard to achieve.
Based on this discussion, it seems desirable to use a NAT mechanism for SCTP that does not require a change to the SCTP header at all and hence to the port numbers, which avoids synchronization among NAT middleboxes and the recomputation of the SCTP checksum.

UDP-Based Tunneling
Currently, most NAT middleboxes support only protocols running on top of TCP or UDP. A standard technique for all other protocols is to encapsulate these packets into UDP instead of IP. Because both UDP and IP provide an unreliable packet delivery service, this is feasible. This also works for SCTP, as described in [3], and is currently implemented in the SCTP kernel extension for Mac OS X.
It should be noted that NAT middleboxes on different paths are not synchronized, and therefore, the UDP port number might be different on different paths. One drawback of using UDP encapsulation is that Internet Control Message Protocol (ICMP) messages might not contain enough information to be processed by the SCTP layer. Another drawback is that the simple peer-to-peer solution described in the sections about peer-to-peer communication and multihoming with a rendezvous server does not work because the UDP port numbers might be changed by NAT middleboxes.
Tunneling SCTP over UDP must handle the same problems as any other UDP-based communication for NAT traversal. However, this is the only possibility for SCTP-based communication through a NAT middlebox without modifying it to add SCTP support.

An SCTP-Specific Variant of NAT
In the NAPT method described previously, the NAT middlebox controls the 16-bit source port number of outgoing TCP connections to distinguish multiple TCP connections of all clients behind the NAT middlebox to the same server. The basic idea for the SCTP-specific method is instead to use the combination of the source port number and the verification tag. For single-homed hosts, this method is described in [2]. If NAT middleboxes use the verification tags together with the addresses and the port numbers to identify an association, the probability that two hosts end up with the same combination decreases to a tolerable level.

A Simple Association Setup
The main task of a NAT middlebox is to substitute the source address of each packet with the public address used by the NAT middlebox and to keep the corresponding IP addresses in a table. First, we consider an association setup between a single-homed client and a single-homed server. Neither the INIT nor the INIT-ACK chunk contains any IP addresses. This leads to a scheme as described in Fig. 3.
In the first message of the handshake, the verification tag in the common header must be set to 0, but the initiate tag (initTag) in the INIT chunk holds a 32-bit random number that is supposed to be the verification tag (VTag) of the incoming packets. Hence, at the beginning of the handshake, only one verification tag is known. The NAT middlebox keeps track of this information and takes the local private address (Local-Address) and the officially registered destination IP address (Global-Address) from the IP header of the SCTP packet and saves them in the NAT table (Fig. 3). The local source port (Local-Port) and the destination port (Global-Port) are obtained the same way.
The initiate tag of the INIT chunk, which the client has chosen for its communication, is also extracted from the INIT chunk header and saved as Local-VTag. The Global-VTag that eventually will be chosen by the communication partner is not known yet.
n Figure 3. Four-way handshake for the SCTP association setup with NAT table. (NAT table after the INIT: Local-Address 10.1.0.1, Global-Address 100.4.5.1, Local-Port 52001, Global-Port 8080, Local-VTag 12345, Global-VTag unknown; after the INIT-ACK, the Global-VTag 45678 is added.)

Before forwarding the packet, the NAT middlebox replaces the source address of the IP header with the NAT address (NAT-Global-Address) and sends the packet toward the other end point.
The other SCTP end point receiving the packet containing the INIT chunk answers the request with a message containing the INIT-ACK chunk. This message is addressed to the NAT-Global-Address and the Local-Port. Its verification tag in the common header must be identical to the initiate tag of the INIT chunk, whereas the initiate tag of the INIT-ACK chunk will be used as the verification tag for all packets that are sent by the initiating end point (client 10.1.0.1 in the figure) of the association. For an incoming INIT-ACK chunk, the NAT middlebox searches the table entries for the corresponding combination of Local-Port, Global-Address, Global-Port, and the Local-VTag and adds the Global-VTag. Thus, after the reception of the INIT-ACK chunk, both verification tags are known. Now the NAT middlebox sets the destination address to the Local-Address found in the table entry and delivers the packet. To complete the handshake, a packet with a COOKIE-ECHO chunk is sent that is acknowledged with a message containing a COOKIE-ACK chunk.
NAT Table
The NAT table consists of several entries. Each entry is a tuple consisting of:
1) Local-Address
2) Global-Address
3) Local-Port
4) Global-Port
5) Local-VTag
6) Global-VTag
In addition to the procedure to modify the table given in the next subsection, a timer must be used to remove entries that have not been used for a certain amount of time. This time should be long enough such that the SCTP path supervision procedure prevents the table entries from timing out.
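As a sketch of how such a table might be held in software, the snippet below uses Python dataclasses and a monotonic-clock inactivity timeout; the field names follow the tuple above, but the structure and the matching logic are illustrative rather than an implementation of any particular middlebox.

```python
# Sketch of an in-memory NAT table with the six fields above and a simple
# inactivity timeout; purely illustrative, not tied to any implementation.
import time
from dataclasses import dataclass, field

@dataclass
class NatEntry:
    local_address: str
    global_address: str
    local_port: int
    global_port: int
    local_vtag: int
    global_vtag: int = 0                   # unknown until the INIT-ACK is seen
    last_used: float = field(default_factory=time.monotonic)

class SctpNatTable:
    def __init__(self, timeout=120.0):
        self.entries = []
        self.timeout = timeout

    def expire(self):
        now = time.monotonic()
        self.entries = [e for e in self.entries if now - e.last_used < self.timeout]

    def add_from_init(self, local_addr, global_addr, lport, gport, init_tag):
        self.entries.append(NatEntry(local_addr, global_addr, lport, gport, init_tag))

    def learn_from_init_ack(self, lport, global_addr, gport, header_vtag, init_tag):
        # Match on Local-Port, Global-Address, Global-Port, and Local-VTag,
        # then record the Global-VTag carried in the INIT-ACK chunk.
        for e in self.entries:
            if (e.local_port == lport and e.global_address == global_addr
                    and e.global_port == gport and e.local_vtag == header_vtag):
                e.global_vtag = init_tag
                e.last_used = time.monotonic()
                return e
        return None
```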
Modifications to the NAT Table
The basic procedure for handling INIT and INIT-ACK chunks was described previously. If the INIT or INIT-ACK chunk contains a list of addresses, then for each address in the list, an entry is added to the table.
If an ASCONF chunk is received to add the wildcard
address, an entry to the NAT table is made for that address. Because both verification tags must be added, a parameter must be included in the ASCONF chunk that contains the verification tag that is not present in the common header.
Behavior of the SCTP End Points
Because multiple clients behind the NAT middlebox might choose the same local port when connecting to the same server, the restart procedure would result in a loss of an SCTP association. Therefore, the INIT chunk sent by the clients should contain a parameter indicating that the server should not follow the restart procedure. Instead it should use the verification tag to distinguish between the associations. This is what most SCTP implementations already do.
Furthermore, the SCTP end points must not include non-global addresses in the INIT or INIT-ACK chunk. If an SCTP end point is multihomed and has non-global addresses, it should set up the association single-homed and then add the other addresses after the association has been established by sending an SCTP packet containing an ASCONF chunk for each address. To add such an address, the ASCONF should contain only the wildcard address and the parameter providing the required verification tag. The source address of the packet containing the ASCONF chunk will be added to the association. To remove an address, an ASCONF chunk is sent with the wildcard address. Then, all addresses except the source address of the packet containing the ASCONF chunk are deleted from the association.
Communication between the NAT Middleboxes and the SCTP End Points
If a NAT middlebox receives an INIT chunk that would result in adding an entry to the NAT table that conflicts with an already existing entry, it should not insert this entry and may send an ABORT chunk back to the SCTP end point. In the ABORT chunk, an M-bit should be set that indicates that it has been generated by a middlebox. This happens if two different clients choose the same local port number and initiate tag and try to connect to the same server. On reception of such an ABORT chunk, the end point can try to choose a different initiate tag and try setting up the association again.
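The conflict check can be sketched as follows, with NAT table entries represented as plain dictionaries; the returned chunk descriptions (M_bit, cause) are illustrative stand-ins for the actual chunk formats, not a wire encoding.

```python
# Sketch of the conflict check: an INIT that would duplicate the
# (global address, global port, local port, local verification tag)
# combination of an existing entry for a different internal host is
# answered with an ABORT carrying the M-bit.

def handle_init(entries, local_addr, global_addr, lport, gport, init_tag):
    for e in entries:
        if (e["global_address"] == global_addr and e["global_port"] == gport
                and e["local_port"] == lport and e["local_vtag"] == init_tag
                and e["local_address"] != local_addr):
            return ("ABORT", {"M_bit": True, "cause": "NAT table conflict"})
    entries.append({"local_address": local_addr, "global_address": global_addr,
                    "local_port": lport, "global_port": gport,
                    "local_vtag": init_tag, "global_vtag": 0})
    return ("FORWARD", None)

table = []
print(handle_init(table, "10.1.0.1", "100.4.5.1", 52001, 8080, 12345))
print(handle_init(table, "10.1.0.2", "100.4.5.1", 52001, 8080, 12345))  # conflict
```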
n Figure 4. Building the NAT table for the single-homed client with a multihomed server. (The table holds one entry per server address, 10.1.0.1/100.4.5.1 and 10.1.0.1/100.5.5.1, both with Local-Port 52001, Global-Port 8080, Local-VTag 12345, and Global-VTag 45678.)

If the NAT middlebox receives an SCTP packet that cannot be processed because there is no entry in the NAT table, the NAT middlebox should discard the packet and can send back an ERROR chunk. An M-bit must be set to indicate that the chunk is generated by a middlebox, and an error cause should indicate that the NAT middlebox does not have the required information to process the packet. On reception of such an ERROR chunk, the end point should use an ASCONF chunk to provide the required information to the NAT middlebox.
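On the end-point side, the reaction to such an ERROR chunk might be sketched like this; the dictionary fields standing in for chunk contents are assumptions for illustration, not the actual chunk or parameter format.

```python
# Sketch of the end-point reaction: an ERROR with the M-bit and a
# "NAT state missing" cause triggers an ASCONF that adds the wildcard
# address and carries both verification tags.

def on_error_chunk(assoc, error):
    if error.get("M_bit") and error.get("cause") == "NAT state missing":
        return {
            "type": "ASCONF",
            "add_address": "0.0.0.0",      # wildcard: add the packet's source address
            "vtags": (assoc["local_vtag"], assoc["peer_vtag"]),
        }                                  # send on the path the ERROR arrived on
    return None

assoc = {"local_vtag": 12345, "peer_vtag": 45678}
print(on_error_chunk(assoc, {"M_bit": True, "cause": "NAT state missing"}))
```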
New SCTP Protocol Elements
Clients require a new parameter to be included in the INIT chunk to indicate that they will use the procedures described in this article. This parameter also is included in the INIT-ACK chunk to indicate that the receiver also supports it. Another new parameter is required that can contain a verification tag and is included in an ASCONF chunk.
Both the ERROR chunk and the ABORT chunk must have an M-bit indicating that the packet containing the chunk is generated by a middlebox instead of the peer. Two additional error causes are introduced, one to be included in the ERROR chunk to indicate that the NAT middlebox misses some state, and one to be included in the ABORT chunk to indicate a conflict in the NAT table.
Examples
This section provides a detailed discussion of several network scenarios involving NAT middleboxes. The proposed NAT mechanisms were verified in all these scenarios using an SCTP simulation in the INET framework for the OMNeT++ simulation kernel described in [10]. Furthermore, a group of the Center for Advanced Internet Architecture at Swinburne University is implementing this method for the FreeBSD operating system. This project,
SCTP over NAT Adaptation (SONATA), is being implemented in cooperation with two of the authors and is based on [2].

Single-Homed Client to Multihomed Server
In the case of a single-homed client and a multihomed server, the server announces all its global addresses in address parameters included in the INIT-ACK chunk (Fig. 4). The packet crosses the NAT middlebox, which updates its entries for the association. When the client receives the chunk, it adds those addresses to its list of destination addresses. As a result, there will be a separate entry for each server address although there is only one association.

Adding New NAT Middleboxes
After setting up an association, data can be exchanged between client and server. The packets are routed through the Internet. It must be emphasized that the routes are not stable and can change during the lifetime of an association, in particular if the association has a long life span as expected for major SCTP application scenarios. Therefore, a new NAT middlebox could become involved that has no knowledge of the properties of the association, as shown in Fig. 5.
Passing through a new NAT middlebox also means that the server receives a packet with a new source address, which appears as if the client has an additional IP address.
In Fig. 5 the upper route shows the path where the association was set up initially. After the route was changed, the packets travel on the lower route. An example for the address/port combination for both routes is shown below the server.

n Figure 5. After a route change a new NAT middlebox appears. (Initially the association runs via a NAT with global address 120.10.2.1; after the route change it runs via a new NAT with global address 140.1.1.1, so the server sees 120.10.2.1:52001 and 140.1.1.1:52001 for the same association.)

If the new NAT middlebox receives the first packet from the client, it sends back a packet containing an ERROR chunk indicating that it lacks the required NAT table entry. Therefore, upon receipt of the ERROR chunk, the client sends an ASCONF chunk on the new path with the required information. The new NAT middlebox can add a complete entry to its table upon receipt of this message. This message can pass through the NAT middlebox and can be acknowledged by the server with an ASCONF-ACK message. Afterward, the communication can proceed as usual.

Client Using Transport Layer Mobility
SCTP with its functionality of dynamic address configuration is well suited to be employed in an environment with host mobility. Whereas all other parameters remain the same, the moving client will receive a new address. This not only results in a new source address for the packet but also in a changing route, such that eventually another NAT middlebox must be traversed, which, again, initially has no knowledge of the association. As the situation is similar to the one described in the last subsection, we suggest that the same actions are taken. For more information on transport layer mobility, see [7].

Peer-to-Peer Communication
A greater challenge is the communication between two peers, that is, two hosts that both use private IP addresses (peer-to-peer communication). A detailed description for UDP and TCP is given in [8]. The two peers require an agent to help them find their communication partner. This agent usually is called a rendezvous server. In Fig. 6 the corresponding network setup is shown.

n Figure 6. Peer-to-peer communication with rendezvous server.

The communication process in this case consists of two phases. First, associations are initialized between the peers and the rendezvous server; after retrieving the required information from the rendezvous server, the peers can communicate with each other independently of the server. After both peers retrieve the required information, the actual communication between the peers can start. As there is no server, both hosts must be able to act as client and server. Thus, both start an association. If the message containing the INIT chunk of Peer 1 reaches the NAT middlebox, NAT 2, before the message of Peer 2 could arrive, it will be discarded. The retransmission of the INIT chunk will arrive if in the meantime, Peer 2 has punched a hole by triggering the NAT middlebox to set up a table entry. The best results can be achieved if the associations are started at the same time. From the perspective of SCTP, the simultaneous sending of INIT chunks also is not a normal situation because the INIT chunk is not followed directly by an INIT-ACK chunk but by another INIT chunk. The SCTP collision handling procedure ensures that exactly one association between the peers is established.

Multihomed Client and Server
The client sends an INIT chunk without a list of addresses to the server, which responds with an INIT-ACK chunk including a list of all addresses of the server. As shown in Fig. 7, this initial handshake uses the path via NAT 1.
After the association is established, the client adds its second address by sending an ASCONF chunk. If the packet containing this chunk is sent via the path containing NAT 2, both NAT middleboxes have the required state. If this packet is sent on the path via NAT 1, any packet sent from the client on the path via NAT 2 results in an ERROR chunk being sent back, and this triggers the sending of an ASCONF chunk.
This chunk provides the required information to the NAT middlebox, NAT 2.

n Figure 7. Multihoming through NAT middleboxes. (Messages 1 to 4: INIT, INIT-ACK, COOKIE-ECHO, and COOKIE-ACK via NAT 1; messages 5 and 6: ASCONF with ADD-IP and ASCONF-ACK.)
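The middlebox side of this mechanism can be sketched as follows: an ASCONF adding the wildcard address, together with the extra verification-tag parameter, carries everything a NAT middlebox on a new path needs to create a complete table entry. The field names and addresses below are illustrative, not the actual parameter encoding.

```python
# Sketch of the middlebox side: an ASCONF adding the wildcard address,
# plus the extra verification-tag parameter, lets a NAT middlebox on a new
# path build a complete table entry. Field names are illustrative only.

def handle_asconf(table, src_addr, dst_addr, sport, dport, header_vtag, asconf):
    # The tag in the common header is the peer's (global) tag; the extra
    # parameter carries the local tag that is missing from the header.
    if asconf.get("add_address") == "0.0.0.0" and "local_vtag" in asconf:
        table.append({
            "local_address": src_addr, "global_address": dst_addr,
            "local_port": sport, "global_port": dport,
            "local_vtag": asconf["local_vtag"], "global_vtag": header_vtag,
        })
    return "FORWARD"

table = []
handle_asconf(table, "10.1.1.1", "192.0.2.1", 52001, 8080, 45678,
              {"add_address": "0.0.0.0", "local_vtag": 12345})
print(table)
```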
Multihomed Transport Layer Mobility
Previously, we discussed the procedure for a case when a client moves and hence changes its source address and the corresponding NAT middlebox as well. During the transition from one cell to another in a host mobility scenario, there is likely to be a zone where both cells are active, and thus, two addresses can be in use. Adding the new address results in a temporarily multihomed client. We propose to handle this situation in a way similar to the case explained in the last section. The new address is added by the sending of a message containing an ASCONF chunk. But as the old address is completely replaced by the new one as soon as the previous cell is left, another parameter must be added that indicates that the primary path should be set to the new address. This causes the server to send the next packets to the new address.
Multihoming with Rendezvous Server
The final step in increasing the complexity of the NAT scenario is the communication between two multihomed peers that are behind different NAT middleboxes. Just like in the single-homed case, the rendezvous server must gather the peer information to fill its table. This time the table must be enlarged by the additional addresses. The peers first set up an association with the rendezvous server. Using this server the peers can obtain each other's addresses and port numbers. At this point, the peers must set up an association via initialization collision to provide a path by using hole punching.
To also use the second path, on the way, the NAT middleboxes must obtain the required information. By sending messages containing ASCONF chunks almost simultaneously, the NAT middleboxes are notified to allow packets arriving from the opposite direction to pass through. Unfortunately, the mechanism described earlier to request information by sending a message containing an ERROR chunk does not work when coming from the global side of the network because only the host behind the NAT middlebox can provide the data to fill the NAT table. So when the message containing an ASCONF chunk arrives at the opposite NAT middlebox before a hole is punched, the packet is discarded, but its retransmission might be successful. After both NAT tables receive the appropriate entries, the secondary paths also can be used.
Conclusion
In this article, we proposed a comprehensive solution for the support of SCTP in NAT middleboxes. We motivated the necessity for a specific NAT concept with NAPT functionality, where the verification tags provided by SCTP are used to distinguish between associations. The NAT middleboxes can request information from the SCTP end points and give hints to improve the overall procedure. Furthermore, several scenarios were analyzed to explain the manipulation of the NAT table in single-homed, multihomed, and mobility environments. The peer-to-peer communication with a preregistration was taken into account as well.
Generalizing the SCTP-specific variant of NAT, the following is important. For supporting a transport protocol with multipath support, a connection identifier makes connection tracking possible without a requirement to rely on the port
numbers. This avoids the requirement of changing the port numbers and possibly synchronizing them between different NAT middleboxes. A feature of dynamic address reconfiguration can be used to avoid having IP addresses in the transport layer, which is problematic for the processing in NAT middleboxes. For peer-to-peer communications, it is helpful if the transport layer supports simultaneous connection setups. Finally, it might be preferable to use simple algorithms involving random numbers with a small chance of collision instead of more complex deterministic algorithms without collision. The solution presented in this article will be included in a future version of our Internet drafts to be considered for standardization in the BEHAVE working group of the IETF.
References
[1] Q. Xie et al., "SCTP NAT Traversal Considerations," draft-xie-behave-sctpnat-cons-03.txt (work in progress), Nov. 2007.
[2] R. Stewart and M. Tüxen, "Stream Control Transmission Protocol (SCTP) Network Address Translation," draft-stewart-behave-sctpnat-03.txt (work in progress), Nov. 2007.
[3] M. Tüxen and R. Stewart, "UDP Encapsulation of SCTP Packets," draft-tuexen-sctp-udp-encaps-02.txt (work in progress), Nov. 2007.
[4] P. Srisuresh and M. Holdrege, "IP Network Address Translator (NAT) Terminology and Considerations," RFC 2663, Aug. 1999.
[5] R. Stewart, "Stream Control Transmission Protocol," RFC 4960, Sept. 2007.
[6] R. Stewart et al., "Stream Control Transmission Protocol (SCTP) Dynamic Address Reconfiguration," RFC 5061, Sept. 2007.
[7] M. Riegel and M. Tüxen, "Mobile SCTP Transport Layer Mobility Management for the Internet," Proc. SoftCOM 2002, Int'l. Conf. Software, Telecommunications and Computer Networks, Split, Croatia, 2002, pp. 305–09.
[8] B. Ford and P. Srisuresh, "Peer-to-Peer Communication across Network Address Translators," USENIX Annual Technical Conf., Anaheim, CA, Apr. 2005.
[9] R. Stewart and Q. Xie, Stream Control Transmission Protocol (SCTP): A Reference Guide, Addison-Wesley, Oct. 2001.
[10] I. Rüngeler, M. Tüxen, and E. Rathgeb, "Integration of SCTP in the OMNeT++ Simulation Environment," Int'l. Developers Wksp. OMNeT++ (OMNeT++ 2008), Mar. 2008.
Biographies

ERWIN P. RATHGEB ([email protected]) received his Dipl.-Ing. and Ph.D. degrees in electrical engineering from the University of Stuttgart, Germany, in 1985 and 1991, respectively. He has been a full professor at the University Duisburg-Essen since 1999 and holds the Alfried Krupp von Bohlen und Halbach Chair for Computer Networking Technology at the Institute for Experimental Mathematics. From 1991 to 1998 he held various positions at Bellcore, Bosch Telekom, and Siemens. His current research interests include concepts and protocols for next-generation Internets with a focus on network security. He is a member of IFIP, GI, and ITG, where he is chairman of the expert group on network security.

IRENE RÜNGELER ([email protected]) received her diplomas in computer science and economics at the University of Hagen in 1992 and 2000, respectively. She joined the Münster University of Applied Sciences in 2002, where she works as a research staff member. Her research interests include innovative transport protocols, especially SCTP, and their performance analysis, signaling transport over IP-based networks, and fault-tolerant systems.

RANDALL STEWART ([email protected]) works for TRG Holdings as chief development officer. His current duties include integrating software solutions for call center applications using both SCTP and RSerPool. Previously, he was a distinguished engineer at Cisco Systems. He also has worked for Motorola, NYNEX S&T, Nortel, and AT&T Communications. Throughout his career he has focused on operating system development, fault tolerance, and call-control signaling protocols. He is also a FreeBSD committer with responsibility for the SCTP reference implementation within FreeBSD.

MICHAEL TÜXEN ([email protected]) studied mathematics at the University of Göttingen and received a Dipl.-Math. degree in 1993 and a Dr. rer. nat. degree in 1996. He has been a professor in the Department of Electrical Engineering and Computer Science of Münster University of Applied Sciences since 2003. In 1997 he joined the Systems Engineering group of ICN WN CS of Siemens AG in Munich. His research interests include innovative transport protocols, especially SCTP, IP-based networks, and highly available systems. At the IETF, he participates in the Signaling Transport, Reliable Server Pooling, and Transport Area Working Groups.
Distributed Connectivity Service for a SIP Infrastructure
Luigi Ciminiera, Guido Marchetto, Fulvio Risso, and Livio Torrero, Politecnico di Torino

Abstract
Because of the steady depletion of available public network addresses and the need to secure networks, middleboxes such as network address translators and firewalls have become quite common. Because they are designed around the client-server paradigm, they break connectivity when protocols based on different paradigms are used (e.g., VoIP or P2P applications). Centralized solutions for middlebox traversal are not an optimal choice because they introduce bottlenecks and single points of failure. To overcome these issues, this article presents a distributed connectivity service solution that integrates relay functionality directly in user nodes. Although the article focuses on applications using the Session Initiation Protocol, the proposed solution is general and can be extended to other application scenarios.
Although end-to-end direct connectivity was a must in the early days of the Internet, increasing numbers of hosts are now connected through middleboxes such as network address translators (NATs), which enable the reuse of private addresses, and/or firewalls, which are used to secure corporate networks and internal resources. These devices work seamlessly for client-server applications (although the client must reside in the “protected” part of the network), but they limit the end-to-end connectivity of applications that use different paradigms, such as voice over IP (VoIP) and peer-to-peer (P2P). In particular, middleboxes prevent nodes behind them from being contacted directly by external nodes. For example, an internal host might have no problem starting a data transfer to an external host, but the reverse (e.g., an incoming VoIP call) may be impossible. Thus, proper strategies for middlebox traversal are required to enable seamless communication between hosts, no matter where they are located. Among the known strategies, hole punching and relaying [1] are the ones used most frequently. The common idea is to make the middlebox behave as if the internal host had begun the communication. The middlebox then creates a temporary channel with the remote host, thus allowing the delivery of external packets. In particular, hole punching requires each internal host to maintain a persistent connection with an external rendezvous server located on the public Internet. This creates a type of “hole” that can be used by an external host to contact the internal host directly. If hole punching fails, for example, because the hosts are behind symmetric NATs, relaying represents the last chance: internal hosts maintain a persistent connection with an external node (the relay server), which operates as a forwarder, that is, it receives all packets directed to the internal host and redirects them to it. This solution requires that the internal host advertise the IP address of the relay server as one of its addresses and that it instruct the relay server with the proper forwarding rules.
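As a rough illustration of the hole-punching idea (not the protocol machinery defined in the drafts cited below), the sketch assumes each peer has already learned the other's public address and port through the rendezvous server:

```python
import socket

def punch_hole(local_port, peer_public_addr):
    # The first outgoing datagram creates the NAT binding (the "hole");
    # once both peers have sent one, packets from the peer's public
    # address can reach the internal host directly.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", local_port))
    sock.sendto(b"punch", peer_public_addr)  # e.g. ("203.0.113.5", 40000)
    return sock
```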
This article focuses on the problem of middlebox traversal for applications using the Session Initiation Protocol (SIP) [2], which is among the protocols that suffer most from middlebox limitations. Two solutions were defined in this context. SIP messages directed to the destination user agent (UA) are delivered with a relay-based approach that exploits an intermediate public SIP proxy [3]. For media flows, the Interactive Connectivity Establishment (ICE) [4] protocol was proposed. ICE is an integrated solution defined to discover NAT bindings and to execute hole punching for media streams. In addition, ICE also supports media relaying based on the Traversal Using Relays around NAT (TURN) [5] protocol. Both the hole-punching mechanism of ICE and TURN rely on Simple Traversal of UDP through NAT (STUN) [6], a client-server protocol consisting of two messages, Binding Request and Binding Response. These messages are sufficient for implementing the hole-punching procedure [1], whereas TURN extends the STUN protocol to establish communication channels with relays, called TURN servers. STUN also can be used to implement a middlebox behavior discovery service [7] that internal hosts can use to determine the type of NAT/firewall they are behind.

Current middlebox traversal solutions rely on centralized servers that provide rendezvous and relay capabilities. However, the centralized server is a single point of failure: if the server fails, all UAs behind middleboxes become unreachable. Furthermore, a centralized solution cannot scale to an IP-based telecommunication provider with millions of customers, in which servers may be required to handle a huge amount of traffic (both SIP signaling messages and media datagrams), thus requiring a large amount of computational resources and bandwidth. The server acting as relay for SIP (i.e., the SIP proxy) also must handle the traffic generated by the keep-alive messages that UAs behind middleboxes periodically send to it. Keep-alive messages are required to maintain the communication channel with the server and thus to guarantee that these UAs can always be reached. This could result in a high overhead. For example, according to the NAT binding timeout reported in [3], in a SIP domain including 1.5 million UAs with limited connectivity, the central server must handle about 50,000 keep-alive messages per second.
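A back-of-the-envelope check of that figure, assuming a keep-alive interval on the order of 30 s (a typical UDP NAT binding lifetime; the interval itself is not stated here):

```python
uas_with_limited_connectivity = 1_500_000
keepalive_interval_s = 30                        # assumed interval
print(uas_with_limited_connectivity / keepalive_interval_s)  # 50000.0 messages/s
```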
This article proposes a distributed architecture, referred to as the DIStributed COnnectivity Service (DISCOS), for ensuring connectivity across NATs and firewalls in a SIP infrastructure. This solution overcomes the limitations of the current centralized solution by creating a gossip-based P2P network and integrating the previously described rendezvous and relay functionalities in the UAs. Each globally reachable UA with enough resources can provide such services to UAs with limited connectivity. A major emphasis is given to the overlay design, as it is a key point for ensuring a fast “service lookup” (i.e., finding a peer that still has enough resources to offer the connectivity service), which is instrumental for providing an adequate quality of service to users. In particular, we show how a scale-free topology can fit this requirement, and we propose an overlay construction model that can be used to build such a topology.

DISCOS is somewhat orthogonal to P2P-SIP [8], although both are based on P2P technologies. In fact, P2P-SIP is a solution mainly for distributed lookup, whereas DISCOS offers a solution for middlebox traversal. The idea of distributing such functionalities among end systems is also one of the characteristics of Skype, a well-known VoIP application. However, Skype uses secret and proprietary protocols that cannot be studied and evaluated by third parties, therefore limiting the ability to understand exactly how these problems are solved. For example, in the Skype analyses presented in [9] and [10], the authors could give only partial explanations of its NAT and firewall traversal mechanisms. Their experiments pointed out that nodes with enough resources can become supernodes and provide support for NAT and firewall traversal. In particular, supernodes offer relay functionalities and probably run a sort of STUN server that other nodes use to discover the presence (and to determine the type) of NAT and firewall in front of them. Therefore, it is clear that a node behind a NAT must connect to a supernode to be part of the Skype network, but no information could be provided about the supernode discovery and selection policies. Also, the supernode overlay topology is almost completely unknown. Thus, there is no way to evaluate the effectiveness of these solutions. Here, on the other hand, we propose a distributed architecture for middlebox traversal whose scalability and robustness are discussed and evaluated. In addition, the solution was engineered and validated by simulation on a SIP infrastructure, but it is more general and can be seen as a mechanism to cope with middlebox traversal, thus opening the path to wider adoption.
Operating Principles

Distributed Connectivity Service

DISCOS extends current centralized NAT and firewall traversal solutions by distributing rendezvous and relay functionalities among UAs. The relaying and hole-punching service for media flows is implemented by integrating a STUN/TURN server in each UA. The TURN server is also used to support the relaying of SIP messages. However, DISCOS can easily be modified to offer the relaying of SIP messages by integrating SIP proxy functionalities in each UA, leading to a distributed implementation of [3]. A UA with enough resources (e.g., a public network address, a wideband Internet connection, and free CPU cycles) becomes what we define as a connectivity peer and starts to offer a connectivity service.
In particular, connectivity peers can act as both SIP relays (leveraged by UAs with limited connectivity for receiving SIP messages) and media relays. Connectivity peers can also support the hole-punching procedure for media session establishment, thus operating as a distributed rendezvous server. In addition, connectivity peers provide support for middlebox behavior discovery [7]. UAs with limited connectivity can locate and attach to an available peer whenever they require one of these services. Connectivity peers are organized in a P2P overlay, and knowledge of them is spread through proper advertisement messages, thus building an unstructured gossip-based network. Structured networks, characterized by the additional overhead of maintaining the structure, are not considered because their excellent lookup properties are not required; in fact, DISCOS uses the overlay only to find the first available connectivity peer, not to locate a precise resource. Note that because DISCOS distributes existing middlebox traversal functionalities among peers, it is also fully compatible with current middleboxes and their traversal techniques. This enables a smooth deployment of the proposed solution.
Overlay Topology

In order to enable DISCOS to locate an available peer for UAs with limited connectivity in the shortest time possible, peers should have a deep knowledge of the network: the greater the number of known peers, the higher the probability of finding an available peer in a short time, especially if the known peers are lightly loaded. In gossip-based networks, the spread of information is based on flooding, so the overlay topology has a deep impact on network efficiency. For instance, the greater the average path length between nodes, the higher the depth of the flooding (hence the load on the network) required for an adequate spread of the information. Thus, an overlay topology that ensures a small average path length is required. However, this is not sufficient for enabling peers to know a large set of suitable connectivity peers from which to choose when a UA asks for the connectivity service. In fact, nodes maintain a cache that should be kept small to reduce the overhead required to manage its entries, and this limits the number of peers known at each instant. The limited cache size can be compensated by frequently refreshing its contents so that the set of known peers changes frequently, resulting in a sort of round robin among peers: different connectivity peers can always be provided to UAs that request the service at different instants, thus increasing the opportunity for a queried connectivity peer to suggest available ones when it cannot provide the service itself. Frequent cache refresh is also useful for ensuring that nodes store up-to-date information about existing peers. Such a policy can be adopted efficiently if the overlay results in a scale-free network [11], an interesting topology that ensures a small average path length and features scalability and robustness.

In a scale-free network, few nodes (referred to in the following as hubs) have a high degree, whereas the others have a low one. The degree of a node is the sum of all its incoming (i.e., the in-degree) and outgoing (i.e., the out-degree) links. In the DISCOS overlay, the out-degree of a node is limited by the cache size, whereas the in-degree is the number of other peers that have that node in their cache. Thus, nodes can be considered hubs when they are in the cache of several peers, that is, when they are highly popular. Hubs frequently receive advertisement messages from a large set of different nodes, so they frequently update their cache. In particular, if advertisement messages contain nodes that are low in popularity, hubs can discover peers that, being low in popularity, are lightly loaded with high probability. The key is to make searches through hubs because they potentially know a large variety of lightly loaded peers.
Figure 1. DISCOS overlay topology. (A joining connectivity peer with no entries in its cache queries the bootstrap service for some hubs; a UA behind a NAT queries a node, possibly a hub, for service; highly popular connectivity peers act as hubs.)
Thus, the proposed solution essentially exploits the results achieved by Adamic et al. [12] on random walk searches in unstructured P2P overlays, generalizing them to the case of a single resource provided by many nodes. They demonstrated that searches in scale-free networks are extremely scalable (their cost grows sublinearly with the size of the network), also proving that searches toward hubs perform better than random searches because hubs have pointers to a larger number of resources. In DISCOS, the benefit of searching through hubs comes from the high frequency with which the pointers to connectivity peers in their cache change. These properties are obtained at the expense of a non-uniform distribution of the number of messages handled by nodes: the higher the popularity of a node, the larger the number of advertisement messages it receives. However, a proper hub selection policy and a reasonable advertisement rate can mitigate the effects of this disparity. These aspects are analyzed in more detail in the following section.

The Barabási-Albert [11] model was proposed to create scale-free graphs. In this model, a few nodes are immediately available, and when a new node arrives, it connects to one of the existing nodes with a probability proportional to the degree of that node (preferential attachment); in other words, the model assumes global knowledge of nodes and their degree, which is clearly inapplicable in a real network scenario. A first step toward implementing such a model in our overlay is to make M peers available to other nodes through a bootstrap service. When a node joins the overlay for the first time, it queries the bootstrap service for a subset of these M registered nodes. However, preferential attachment is not possible with the mechanism described so far because all incoming peers:
• Can learn only the nodes provided by the bootstrap service
• Cannot compute the popularity of a node
An adequate spread of the network knowledge can address the first issue, but there is no way to enable a node to learn the in-degree (i.e., the precise metric of node popularity) of the others.
In our case, the popularity is computed autonomously by each node through a simple approximated metric based on the number of received advertisement messages that contain such a node. In our approximated model, preferential attachment is implemented by forcing peers to evaluate the popularity of nodes through the previously mentioned mechanism and then to include some of the most popular peers in the advertisement messages they send. This allows nodes to insert highly popular peers (hubs) in their cache, thus building and maintaining the scale-free topology. In summary, new nodes use the peers learned through the bootstrap service as “bootstrap” nodes; they then learn the most popular ones through the received advertisement messages and start to perform preferential attachment. Furthermore, incoming nodes that already know peers discovered during their previous visits can avoid the bootstrap procedure by attaching directly to them. The resulting topology is shown in Fig. 1.

It is worth noting that different bootstrap services can be used to create disjoint overlays, because joining peers that fetch nodes from different bootstrap services start to exchange advertisement messages with different connectivity peers. This enables the possibility of deploying different DISCOS overlays in different geographical areas of a SIP domain. If a location-aware bootstrap service selection policy is adopted, users can find a connectivity peer that is close to them, thus preserving the user-relay latency achieved by current centralized solutions, where different servers can be used at different locations. The implementation of the bootstrap service is highly customizable. A possible solution consists of deploying M static peers and preconfiguring their addresses on each UA. A more flexible approach (considered in the following) consists of deploying multiple bootstrap servers reachable through appropriate domain name system service (DNS SRV) records configured in the DNS. Each bootstrap server stores information about M connectivity peers that spontaneously register themselves when they join the overlay.
Figure 2. Operation of DISCOS when: a) a node joins the SIP domain; b) a node in the overlay receives an advertisement message; c) a node performs a SIP/media relay lookup.
Multiple bootstrap servers are deployed for redundancy and load balancing purposes. Proper DNS configuration can enable a location-aware bootstrap service selection.
Protocol Overview

Whenever a UA joins the SIP domain, it must determine whether it can become a connectivity peer or whether it is behind a middlebox. This is done by contacting a connectivity peer and exploiting its STUN functionalities [7]. If the UA does not know an active peer, it performs the bootstrap procedure described earlier. The flow chart of the join procedure is shown in Fig. 2a. If the UA can become a connectivity peer, it checks the number of addresses registered on each bootstrap server and, if this number is smaller than a fixed bound M, it adds itself to the list.
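A hedged sketch of this join step; the helper names and attributes below are hypothetical, not part of the protocol.

```python
M = 20  # bound on registered peers per bootstrap server; the simulations below use 20

def join_domain(ua, bootstrap_servers):
    # STUN tests against a known connectivity peer reveal whether the UA
    # is behind a NAT/firewall or can itself offer the connectivity service.
    if ua.has_limited_connectivity():
        ua.lookup_sip_relay()                  # attach to a SIP relay (Fig. 2c)
    else:
        for server in bootstrap_servers:
            if len(server.registered_peers) < M:
                server.registered_peers.append(ua.address)
        ua.join_discos_overlay()               # start advertising itself
```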
Then, it sends an advertisement message to the known peers to announce itself. The UA is now part of the DISCOS overlay, and it starts receiving messages from other nodes, thus gradually filling its cache with new peers. A proper peer advertisement policy is adopted to implement preferential attachment (thus building and maintaining the scale-free topology) and to enable caches to be refreshed with lightly loaded peers (thus having potential nodes available for the service). In particular, advertisement messages include the sender node, the two most popular peers it knows (enabling preferential attachment), and the two least popular peers it knows (spreading the knowledge of lightly loaded peers).
Advertisement messages are periodically sent by peers to all nodes they have in their cache and contain a special time-to-live (TTL) field that allows the message to cross N hops: as soon as the message is received, the TTL value is decremented and, if it is still positive, the recipient forwards the message to all the nodes in its cache. Every time a peer receives an advertisement message, it updates its cache by increasing the popularity of the nodes already present and by inserting the new ones. As previously described, it is important for a node to have both hubs and peers of low popularity in its cache. Thus, a proper cache management policy is also adopted when the cache is full: the node with average popularity is removed before the insertion, resulting in a cache that favors big hubs and peers of low popularity. Figure 2b details the operations of a peer when it receives an advertisement message.

UAs with limited connectivity behave differently because they essentially exploit DISCOS to find SIP relays (they choose a connectivity peer as relay for SIP messages as soon as they join the SIP domain and select another when the current one disappears) and media relays (when they need one to establish a media session). A UA with limited connectivity performs these lookups by contacting the most popular peer in its list, which can accept or decline the request. If the peer declines, it includes in the answer the two least popular peers and the most popular peer it knows: the least popular peers are queried immediately (since they are likely to be free enough to provide connectivity), whereas the most popular one is inserted in the cache (because, being probably a hub, it can support faster searches). If both queried peers refuse to provide the service, another node is picked from the cache, and the procedure is repeated. If all the nodes in the cache have been queried without success, two different policies apply, depending on the type of service the UA requires: for a SIP relay lookup, the UA waits for a random time and then repeats the procedure; for a media relay lookup, the procedure is stopped, and the media session cannot be established. The relay lookup procedure is shown in Fig. 2c. UAs with limited connectivity also receive ad hoc messages from their relays containing three highly popular peers, which allow them first to fill, and then to update, their cache with new hubs. This enables them to direct searches toward hubs when they require a connectivity peer. Broken hubs (e.g., because of a network failure) are detected through a timeout: if a hub does not reply to a query, the UA can query one of the other hubs in its cache. If no peers are available, the UA again fetches the registered ones from the bootstrap server; however, this situation is unlikely to occur because UAs with limited connectivity periodically receive new hubs from their SIP relays.

This protocol could be integrated in SIP or implemented separately. The former approach is more straightforward because it simply consists of defining new SIP header fields. The latter is more efficient, especially with respect to message size; the human-readable nature of SIP messages would result in advertisement messages of about 800 bytes.
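A compact sketch of the advertisement handling and cache policy described above; the data structures are illustrative, and the real messages and flow charts carry more detail.

```python
CACHE_SIZE = 10  # cache size also used in the simulations below

def handle_advertisement(cache, advertised_peers, ttl, forward):
    """cache maps peer -> popularity (count of advertisements naming it);
    advertised_peers holds the sender plus the two most and two least
    popular peers it knows."""
    for peer in advertised_peers:
        if peer in cache:
            cache[peer] += 1                        # node already known: raise popularity
        else:
            if len(cache) >= CACHE_SIZE:
                # Drop the entry of average popularity so the cache keeps
                # both big hubs and lightly loaded, low-popularity peers.
                ranked = sorted(cache, key=cache.get)
                del cache[ranked[len(ranked) // 2]]
            cache[peer] = 1
    if ttl - 1 > 0:
        forward(advertised_peers, ttl - 1)          # gossip one more hop
```

With the TTL of 2 used in the simulations, an advertisement therefore reaches the sender's cached peers and, in turn, their cached peers.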
Security Issues

The deployment of a P2P architecture for providing a connectivity service raises several security issues that differ from those of centralized solutions. In DISCOS, as in many other distributed systems, controlling the consequences of malicious node behavior can be more difficult than in the centralized counterpart. Much effort has been expended in past years in investigating these issues in the context of P2P-SIP overlays [8, 13, 14], which must deal with similar concerns as they replace centralized SIP proxies for user location.
Some solutions have been proposed and can be applied seamlessly in DISCOS. For example, in [14], public key certificates are distributed among users to enable them to verify the origin and integrity of messages. Analogously, certificates can be used in DISCOS to authenticate advertisement messages so that they can be considered trusted. This limits the operation of malicious peers, as they become easily traceable. These and other P2P-SIP-derived security policies certainly require further refinement to better fit specific DISCOS requirements. However, we are confident that effective results can be obtained with minimal modifications because, as mentioned previously, the security issues that must be addressed are similar in the two environments. This additional effort is left for future work.
Overlay Simulation

Simulations Background

We developed a custom, event-driven simulator to evaluate the effectiveness of the proposed solution. In particular, we were interested in proving its scalability and validating its algorithms. Thus, we implemented a simulator supporting the following four operations: node arrival/departure, media session setup/teardown, SIP relay lookup (triggered when a node with limited connectivity joins the network or when its current SIP relay disappears), and media relay lookup (which occurs when a node requires a relay to perform a media session). Simulations refer to a single SIP domain. Node arrivals and call occurrences are modeled using a Poisson process, whereas node lifetime and call length are extracted from real Skype traffic coming from/to the network of the university campus to approximate the behavior of real VoIP networks. With our parameters, the average number of nodes in the network depends on their arrival rate because of the effect of the Poisson arrival model coupled with the Skype lifetime distribution. For example, an arrival rate λN = 100 nodes/minute leads to a network consisting, on average, of 30,000 nodes, which is the standard size in our simulations and is a good trade-off between simulation length (some runs lasting several days on a Dual Xeon 3 GHz processor) and significance of results. To test our solution under different traffic load scenarios, three different rates are used for media session occurrences: 1.4 λN, 5 λN, and 20 λN sessions/minute. These values, coupled with the distribution of Skype call duration, lead to 10 percent, 30 percent, and 98 percent of nodes simultaneously involved in a media session, respectively.

Statistics presented in [15] show that about 74 percent of hosts are behind a NAT. In addition, [1] shows that hole punching is successful in about 82 percent of cases. To the best of our knowledge, no detailed information is available about firewall proliferation over the Internet. On the strength of these available data, we consider for simulation a network scenario where nodes have limited connectivity with probability PLC = 0.74 and media sessions directed to these nodes require relaying with probability PMR = 0.18. Whenever a node joins the SIP domain, two different actions can be performed at the simulation level: if it is tagged as a node with limited connectivity (with probability PLC), it triggers a SIP relay lookup; otherwise, it joins the DISCOS overlay as a connectivity peer. Media sessions are possible between each pair of nodes (selected randomly). When a node behind a NAT is contacted, a media relay lookup is triggered by this node with probability PMR.
Figure 3. Simulation results: a) average clustering coefficient evaluation; b) in-degree power law distribution; c) average number of contacted peers to find a SIP relay; d) media session failure probability vs. number of allocated backup relays; e) average number of peers contacted to allocate K relays; f) bandwidth consumption distribution.
The number of UAs with limited connectivity to which a peer can simultaneously provide the SIP relay service is set to 10; advertisement messages have a TTL equal to 2; their sending interval is set to 60 minutes; and the cache of a peer is assumed to contain 10 entries. Furthermore, the number of peers registered in the bootstrap server (which is assumed to be unique and reachable by all nodes) is set to 20. Simulations last long enough to exit the transient period; the presented results refer to the steady state.
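As a rough consistency check of the population figures above, Little's law relates the stated arrival rate and average network size; the mean node lifetime below is inferred, not reported in the article.

```python
arrival_rate = 100        # nodes/minute (lambda_N)
avg_population = 30_000   # average number of nodes in the network
mean_lifetime = avg_population / arrival_rate   # L = lambda * W  =>  W = L / lambda
print(mean_lifetime)      # 300.0 minutes, i.e. about five hours per node on average
```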
Overlay Topology Evaluation

First, the simulation aims at demonstrating that our protocol creates a scale-free network among connectivity peers. In particular, we consider the clustering coefficient and the in-degree of nodes [11]. The clustering coefficient of a node is defined as the number of links between its neighboring nodes divided by the number of links that could possibly exist between them.
To be scale-free, an overlay must have an average clustering coefficient higher than that of a random graph obtained under the same conditions, which is clearly shown in Fig. 3a. In detail, the average clustering coefficient for DISCOS decreases as the network size grows, asymptotically converging to a value that is about 20 times the clustering coefficient of a random graph. We also verified that, at all network sizes tested, the coefficient remains almost constant over time. Concerning the in-degree, the requirement to be met is that the node degree distribution follows a power law P(k) = ck^(-γ), where P(k) is the probability that a node has k connections and c is a normalization factor. Figure 3b shows that the distribution of in-degree values obtained through simulation fits well a power law with c = 0.7 and γ = 1.5. These tests validate our overlay construction model, showing that the resulting topology really evolves into a scale-free network.
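A minimal sketch of the per-node clustering coefficient defined above, assuming an undirected view of the overlay stored as an adjacency mapping:

```python
def clustering_coefficient(graph, node):
    # graph: dict mapping each node to the set of its neighbours
    neighbours = list(graph[node])
    k = len(neighbours)
    if k < 2:
        return 0.0
    # Count the links actually present among the neighbours...
    links = sum(1 for i in range(k) for j in range(i + 1, k)
                if neighbours[j] in graph[neighbours[i]])
    # ...and divide by the number of links that could possibly exist.
    return links / (k * (k - 1) / 2)
```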
To prove the effectiveness of the DISCOS topology, we compare our solution with a distributed system where the information is randomly spread and the nodes to query during lookup procedures are randomly chosen among the peers in the cache. Figure 3c depicts the average number of peers that must be contacted to reach an available SIP relay for both DISCOS and the randomized overlay. Although the advertisement rate and the TTL value remain the same, the figure shows that in DISCOS the number of peers contacted is considerably lower. Furthermore, the performance ratio between the two policies increases with the network size, thus demonstrating the scalability properties of our solution. These tests prove the effectiveness and the scalability of DISCOS. In particular, the results show how the scale-free topology ensures overlay efficiency with a limited message rate (each peer sends an advertisement message every 60 minutes), a small TTL (equal to 2), and a limited cache size (10 entries). We also evaluated the number of advertisement messages that connectivity peers must handle in our simulated SIP domain of 30,000 UAs: 99 percent of nodes process fewer than seven advertisement messages per minute, and the remaining 1 percent process between eight and 48 messages per minute, resulting in a reduced per-node overhead. However, this confirms that hubs should be chosen carefully, with a preference for nodes with enough computational and bandwidth resources, for example, using the dynamic protocol proposed by Chawathe et al. for the Gia P2P network [16].
Media Sessions Relaying Performance

This section analyzes the overlay support for media sessions, in particular when hole punching fails and relaying is required. To prevent wasting resources, a media relay is typically chosen by a UA immediately before the establishment of a media session. Various types of media flows are considered, differing in the amount of bandwidth they consume. In particular, assuming b bit/s is the consumed bandwidth unit, five types of flows requiring nb (1 ≤ n ≤ 5) bit/s are defined. The flow type is randomly selected (with uniform distribution) when a new session starts. We also define Bi as the amount of bandwidth that peer i can offer for relaying media sessions. For the sake of simplicity, Bi is assumed to be the same for each connectivity peer and equal to 5b bit/s. However, in a real scenario, this value could vary according to node capabilities.

We start the evaluation of the DISCOS support for media sessions with the estimation of the call failure probability, because it is the parameter that mainly affects the quality of service perceived by users. A session can fail because either an available relay cannot be found, or the relay is found but becomes unavailable during the session (e.g., because it disconnects from the network). With respect to the first problem, we never observed such an event during simulation: a UA with limited connectivity was always able to find a media relay. This result suggests that, with our assumptions about the number of media sessions requiring a relay, the probability of this event occurring in a DISCOS environment can be considered negligible. The second issue can be mitigated by implementing proper relay back-up policies. As shown in Fig. 3d, a media session can fail in about 0.6–0.65 percent of cases, but the selection of a single back-up relay (which handles the communication in case the first relay fails) considerably reduces this probability, and further reductions are possible by increasing the number of relay nodes. The blocking probability remains low even in the unlikely case in which 98 percent of the users are involved in a call (i.e., almost all users are on the phone).
The overhead deriving from the search for back-up relays is depicted in Fig. 3e, which plots the average number of peers that must be contacted to find K available media relays. For a reasonable number of simultaneous sessions, this value remains low. However, we set the number of back-up relay nodes to one, which is a reasonable trade-off between the probability of a session drop and the additional complexity that results when a UA must search for a back-up relay node before starting media sessions. Finally, we analyzed the distribution of load among connectivity peers. In particular, Fig. 3f shows the distribution of the number of media flows simultaneously handled by media relays. It can be observed that, although media flows have different bandwidth requirements, most relays simultaneously handle no more than one media session. Thus, good load balancing among peers is guaranteed.
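A minimal sketch of the relay admission rule implied by these assumptions (the function name is illustrative): a peer offering Bi = 5b bit/s of relay bandwidth accepts a new flow of nb bit/s only if enough budget remains.

```python
def can_relay(used_units, flow_units, capacity_units=5):
    # Bandwidth is expressed in multiples of the unit b; capacity_units = 5
    # mirrors the assumption B_i = 5b for every connectivity peer.
    return used_units + flow_units <= capacity_units

# Example: a relay already carrying a 3b flow can accept a 2b flow but not a 3b one.
print(can_relay(3, 2), can_relay(3, 3))   # True False
```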
Conclusions

This article presents a distributed infrastructure, called DISCOS, that aims at providing a connectivity service to hosts behind middleboxes. This solution extends current centralized approaches (and overcomes their scalability and robustness limitations) by integrating middlebox traversal functionalities into edge nodes. The article also presents the mechanisms that can be used to manage such an infrastructure and exploit its services. The proposed infrastructure is based on an unstructured peer-to-peer paradigm and proved to be extremely effective in locating suitable relays and distributing media sessions evenly among the available connectivity peers. Results confirm that the overhead for managing the overlay is low, that each host is able to locate a suitable connectivity peer with a small number of messages (hence, in a very short time), and that the blocking probability of a new media call is negligible even under very high load. Although our simulations cannot reproduce a nationwide network (because of processing/memory constraints), we are confident that the results can be extended to such an environment because the distributed infrastructure is based on a scale-free topology, which is the key to achieving these results while ensuring overlay scalability and robustness. Future work aims to validate the proposed infrastructure in non-SIP environments and to address security issues more exhaustively.
Acknowledgment

The authors would like to thank Marco Mellia, who was instrumental in obtaining a proper characterization of Skype user agents.
References
[1] B. Ford, P. Srisuresh, and D. Kegel, "Peer-to-Peer Communication across Network Address Translators," USENIX Annual Tech. Conf., Anaheim, CA, Apr. 2005.
[2] J. Rosenberg et al., "SIP: Session Initiation Protocol," IETF RFC 3261, June 2002.
[3] C. Jennings and R. Mahy, Eds., "Managing Client Initiated Connections in SIP," http://tools.ietf.org/html/draft-ietf-sip-outbound-11, Nov. 2007.
[4] J. Rosenberg, "Interactive Connectivity Establishment (ICE): A Protocol for NAT Traversal for Offer/Answer Protocols," http://tools.ietf.org/html/draft-ietf-mmusic-ice-18, Mar. 2008.
[5] J. Rosenberg, R. Mahy, and P. Matthews, "Traversal Using Relays around NAT (TURN): Relay Extensions to Session Traversal Utilities for NAT (STUN)," http://www3.tools.ietf.org/html/draft-ietf-behave-turn-07, Feb. 2008.
[6] J. Rosenberg et al., "Session Traversal Utilities for NAT (STUN)," http://tools.ietf.org/html/draft-ietf-behave-rfc3489bis-15, Feb. 2008.
[7] D. MacDonald and B. Lowekamp, "NAT Behavior Discovery Using STUN," http://www3.tools.ietf.org/html/draft-ietf-behave-nat-behavior-discovery-03, Feb. 2008.
[8] D. A. Bryan and B. B. Lowekamp, "Decentralizing SIP," ACM Queue, vol. 5, no. 2, Mar. 2007.
[9] S. A. Baset and H. Schulzrinne, "An Analysis of the Skype Peer-to-Peer Internet Telephony Protocol," IEEE INFOCOM ’06, Barcelona, Spain, Apr. 2006.
[10] P. Biondi and F. Desclaux, "Silver Needle in the Skype," Black Hat Europe 2006, Amsterdam, The Netherlands, Mar. 2006.
[11] R. Albert and A.-L. Barabási, "Statistical Mechanics of Complex Networks," Rev. Modern Physics, vol. 74, 2002, pp. 47–97.
[12] L. A. Adamic et al., "Search in Power Law Networks," Physical Rev. E, vol. 64, 2001.
[13] J. Seedorf, "Security Challenges for Peer-to-Peer SIP," IEEE Network, vol. 20, no. 5, Sept. 2006.
[14] C. Jennings et al., "Resource Location and Discovery (RELOAD)," http://www.p2psip.org/drafts/draft-bryan-p2psip-reload-04.txt, June 2008.
[15] M. Casado and M. J. Freedman, "Peering through the Shroud: The Effect of Edge Opacity on IP-Based Client Identification," USENIX/ACM Int'l. Symp. Networked Sys. Design and Implementation, Cambridge, MA, Apr. 2007.
[16] Y. Chawathe et al., "Making Gnutella-Like P2P Systems Scalable," ACM SIGCOMM ’03, Karlsruhe, Germany, Aug. 2003.
Biographies

LUIGI CIMINIERA ([email protected]) [M] is a professor of computer engineering in the Dipartimento di Automatica e Informatica at Politecnico di Torino, Italy. His research interests include grids and peer-to-peer networks, distributed software systems, and computer arithmetic. He is a co-author of two international books and more than 100 contributions published in technical journals and conference proceedings.
GUIDO MARCHETTO ([email protected]) received his Ph.D. in computer engineering in April 2008 and his laurea degree in telecommunications engineering in April 2004, both from Politecnico di Torino. He is a post-doctoral fellow in the Department of Control and Computer Engineering at the Politecnico di Torino. His research topics are packet scheduling and quality of service in packet-switched networks, peer-to-peer technologies, and voice over IP protocols. His interests include network protocols and network architectures.

FULVIO RISSO ([email protected]) received his Ph.D. in computer and system engineering from Politecnico di Torino in 2000 with a dissertation on quality of service in packet-switched networks. He is an assistant professor in the Department of Control and Computer Engineering of Politecnico di Torino. His current research activity focuses on efficient packet processing, network analysis, network monitoring, and peer-to-peer overlays. He is the author of several papers on quality of service, packet processing, network monitoring, and IPv6.

LIVIO TORRERO ([email protected]) is a Ph.D. student in computer and system engineering in the Department of Control and Computer Engineering at Politecnico di Torino. He received his laurea degree in computer engineering from Politecnico di Torino in November 2004. His research topics include voice over IP protocols, IPv6, and peer-to-peer technologies and their NAT/firewall related issues.
Dial “M” for Middlebox Managed Mobility
Stephen Herborn and Aruna Seneviratne, NICTA
Abstract
Users can be served by multiple network-enabled terminal devices, each of which in turn can have multiple network interfaces. This multihoming at both the user and device level presents new opportunities for mobility handling. Mobility can be handled by utilizing middleboxes, devices that can provide intermediary routing or adaptation services. This article presents an approach to enabling this kind of mobility handling using the concept of personal networks (PNs). PNs consist of dynamic conglomerations of terminal and middlebox devices tasked to facilitate the delivery of information to and from a single human user. This concept creates the potential to view mobility handling as a path selection problem, because there may be multiple valid terminal device and middlebox configurations that can successfully carry a given communication session. We present details and an evaluation of our approach, based on an extension of the Host Identity Protocol, which demonstrate its simplicity and effectiveness.
A major trend evident in mobile communication systems today is the use of multiple network-enabled terminal devices to deliver information to a single user. These devices can be equipped with multiple network interfaces, of which some, all, or none may be satisfactorily operational at a given point in time. This trend toward multiple devices, which we refer to as “user multihoming,” combined with what is commonly termed “device multihoming,” introduces the potential for multiple end-to-end paths between two mobile communicating parties. These alternate paths can be facilitated with the support of specialized intermediaries, called middleboxes or service proxies, to either establish or maintain a communication session. For the purposes of our discussion here, the common terms middlebox and intermediary, and our own term service proxy (or SP), are synonymous. Although some aspects of user multihoming will diminish with the next generation of devices that provide integrated functionality, the availability of different paths may be useful for different purposes or at different times. Therefore, multihoming can be exploited to maximize the service offerings to mobile users, for example, to take advantage of a more cost-effective network interface available on another device by re-routing ongoing communication sessions through that device. One of the primary requirements for exploiting this network/path diversity is the ability to seamlessly switch communication sessions between different networks/paths and potentially add or remove intermediaries in the process. Although there have been numerous proposals for network/path switching, such as [1–3], they do not take advantage of intermediaries, namely middlebox devices. One major advantage of being able to utilize intermediaries is that they may be able to perform intensive mobility handling operations, such as content adaptation or time-shifting, at the edges of the core networks, thus saving last-hop bandwidth and terminal device processor/power capacity. Additionally, intermediaries can be used to provide indirect connectivity to the core network at times when no direct connectivity is possible.
This article analyzes the requirements for exploiting the opportunities that arise as a result of user and device multihoming in mobile communications systems. It then proposes a solution that enables the use of intermediaries to redirect or transform data flows so that they can be received by the best available terminal device, or combination thereof, at any given time. The article presents a vision for mobility management that captures the potential for mobility support using middleboxes and then describes an approach to solving one of the fundamental requirements to realize this vision. The following sections elaborate on the details of a proposed scheme, concluding with a summary and review of future work.
Overview

The conceptual foundation on which this article is based is that users are served by a PN comprising a loose, dynamic conglomeration of many devices focused on serving a particular user. PNs, as depicted in Fig. 1, encompass personal area networks (PANs) [4] and may incorporate other terminal and non-terminal devices that belong to public infrastructure or devices at distant locations in the network. In a PN, terminal devices (TDs) terminate an end-to-end connection on behalf of a user to either display or store the information received via the connection. In Fig. 1, the mobile phone and laptop computer, as well as the large display, all represent potential terminal devices. The non-terminal entities (middleboxes) act as intermediary relay points for an end-to-end connection. These entities intercept information in transit from one end of an end-to-end connection, possibly process it, and then forward it toward the other end of the connection. The processing provides either high-level adaptation, filtration, and transformation of application data, or low-level connectivity provision. Figure 1 presents four examples of non-terminal entities. The mobile phone and laptop computer are able to serve as bridges between various access technologies. The third and fourth examples of service proxy devices are a remotely located application-aware general packet radio service (GPRS) gateway and a mobile router.
Figure 1. A personal network. (The figure shows a personal area network and a local area network connected to the public Internet via Bluetooth, Wi-Fi, Ethernet, and GPRS links.)
Managing Personal Networks
One of the primary management functions for PNs is the selection of service paths. Consider a single unicast communication session between two sets of terminal devices belonging to PN A and PN B in a content distribution application. Assume that certain parts of the requested content are available in multiple and different forms on different servers; for example, advertisements can be served from a different location than the rest of the content. Non-terminal entities can be used if required and can also be composed to form an aggregated end-to-end service. The result is a number of valid end-to-end paths that can be used to carry the communication session, as shown in Fig. 2. To facilitate this, at least one candidate path from one set of terminal devices to the other must be discovered, selected, and configured. In the simplest case, the best path is one directly between two terminal devices. In other cases, it may involve some non-terminal intermediary. To realize this, it is necessary to develop mechanisms for:
• Discovering and constructing end-to-end paths
• Dynamically configuring and utilizing these paths
The first requires a means to discover and select appropriate intermediaries that are distributed throughout a network and to compose an end-to-end path. Such mechanisms are beyond the scope of this article but are addressed in numerous existing works, including [5]. The second requires a means to switch ongoing communication sessions between terminal devices and to transparently insert or remove intermediaries in the end-to-end path. This section focuses on the second issue, for which there have been numerous attempts to develop mechanisms that dynamically discover, construct, and configure end-to-end paths, for example, [6] and [7]. However, the deployment of these schemes is predicated on the development and large-scale deployment of proprietary system components. The contribution this article makes is to show that it is possible to achieve the desired functionality through minor modification of a pre-existing protocol. Specifically, we show that it is possible to extend the mobility and multihoming capability present in the Internet Engineering Task Force (IETF) Host Identity Protocol (HIP) [8]. The proposal extends the IETF HIP with the capability for movement of communication sessions between terminal devices, as well as the transparent insertion and removal of intermediaries (middleboxes), while retaining ultimate control at the terminal devices on either side of an end-to-end connection through the use of a central functional building block we call “identity delegation.” This is explained in further detail in the following section.

Identity Delegation

The overwhelming majority of user-level applications that require network connectivity do so by creating a socket bound to some local interface identifier (i.e., an IP address) and possibly statically connected to a peer identifier. End-to-end connections based on sockets cannot cope well with the changes in identifiers (i.e., IP addresses) that occur as a result of mobility. This can be solved at the middleware level by providing applications with “bind-able” identifiers that can be assigned on a per-flow basis and delegated between physical devices. This enables applications on both ends of a communication session to remain oblivious to mobility management activities performed at the operating system level or by mobility handling middleware. To transfer sessions, there must first be assurance that devices within a given PN are able and willing to accept any incoming data stream that is redirected to them. This is not guaranteed, for a number of reasons. First, no device blindly accepts an unsolicited incoming data stream. Second, the redirected stream may be forwarded toward a port on the new device that is already occupied by another application. Finally, even if the device can accept the redirected stream, it is not aware of the application to which the data should be delivered. The use of cryptographic identifiers provides the means to solve at least the first issue by decoupling network locators from identifiers and by providing strong authentication of the identities used to send, receive, and forward data. Port conflicts can be solved with stateful connection management similar to the approach used by network address translators (NATs). What is missing is a way to delegate identifiers between devices; a solution for this problem is proposed here and described in detail later. The capacity for identity delegation makes possible a number of interesting application scenarios. The scenarios considered here are identity delegation toward a single intermediate SP and toward an arbitrary number of intermediate SPs. Figures 3 and 4 illustrate how this could be achieved with the proposed scheme using HIP and IPSec. Background on HIP and its relationship to IPSec is provided in [8].
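As a rough illustration (not the authors' implementation) of the identifier/locator split that identity delegation builds on, applications bind to a stable identifier while a separate, changeable mapping resolves it to the current locator or to a delegated intermediary; the identifiers and addresses below are placeholders.

```python
# Stable identifier -> current locator (IP address) of the device or
# intermediary currently authorized to receive the flow.
identifier_to_locator = {
    "HIT-A": "192.0.2.10",      # terminal device TD_A (example address)
}

def delegate(identifier, new_locator):
    # Mobility or delegation changes only this mapping; sockets bound to
    # the identifier itself are unaffected, so transport sessions survive.
    identifier_to_locator[identifier] = new_locator

delegate("HIT-A", "198.51.100.7")   # e.g. redirect the flow toward a service proxy
```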
Identity Delegation Toward Terminal Devices

A clear potential application for host identity delegation is to enable a single intermediary host to be dynamically inserted into or removed from an end-to-end session. This occurs transparently to the transport and higher layers, so it does not break Transmission Control Protocol (TCP) connections. Figure 3 provides a generic step-by-step illustration of the process of inserting an intermediary, SPA, in between two terminal devices, TDB and TDA.
initiating the action. In the example, TDA is the initiator as shown in Fig. 3c.
Figure 3. Insertion of a single host.

Identity Delegation Toward Multiple SPs

Identity delegation assists service composition by allowing two hosts engaged in a communication session, TDA and TDB, to delegate their identities to the head and tail of the composed SP chain, SPA1 and SPA2. This enables an arbitrary number of intermediate SPs to be inserted between the head and the tail, transparently to TDA and TDB. Figure 4 follows Fig. 3 and depicts this usage case as a sequence of twelve consecutive steps.
Since application data streams can consist of several separable atomic components that can be routed independently (for example, audio and video), and because some intermediary SPs can split or join certain application data flows, it is possible to construct an end-to-end SP path that is composed of two or more converging subpaths. The benefits provided by splitting and joining media include the potential for selection of hybrid service paths that are more efficient than any available serial service path. In some cases, it may be desirable to construct service paths that do not completely converge, for example, to deliver the audio component of a media stream to a different network interface or terminal device than the rest of the stream. For the purpose of discussion, it is assumed that SPs participate in some common directory, the administration of which
can be centralized or distributed. Service selection and discovery are not addressed in detail here but are discussed in the context of related work.
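As a rough illustration of the composition idea (our own sketch, not the authors' design), a composed service path can be represented as a head and tail proxy holding the delegated identities plus an arbitrary, reconfigurable middle segment; proxy names such as SP_transcode are hypothetical.

service_path = {
    "head": "SP_A1",                          # holds TD_A's delegated identity
    "middle": ["SP_transcode", "SP_cache"],   # arbitrary intermediaries (hypothetical)
    "tail": "SP_A2",                          # holds TD_B's delegated identity
}

def insert_intermediary(path, proxy, position):
    # Only the "middle" list changes; head and tail (and thus TD_A and TD_B)
    # are untouched, which is what makes the insertion transparent.
    path["middle"].insert(position, proxy)
    return path

insert_intermediary(service_path, "SP_mixer", 1)
print(" -> ".join(["TD_A", service_path["head"], *service_path["middle"],
                   service_path["tail"], "TD_B"]))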
Design and Implementation

In current systems, IP addresses are the most common type of identifier used for end-to-end communication. However, IP addresses are strongly bound to topological location and thus are not suitable for the delegation required to realize the scenarios described in the previous section. As a result, we base our design on HIP, which uses identifiers that are decoupled from network topology. This section first provides some background on HIP and IPSec and a general description of the identity delegation approach compared with a naïve private-key-duplication approach, and then delves into the specifics of the prototype implementation.
Host Identity Protocol and IPSec

Schemes such as Mobile IPv6 (MIPv6) [9] and HIP [10] provide a static identifier, referred to as a home address (HoA) in the former and a host identity tag (HIT) in the latter, which is separate from the routable IP (or IPv6) address. The approach presented here is based on HIP, although the general approach is applicable to any similar scheme. HIP is an end-to-end communication protocol that introduces a thin layer of resolution between the network and
transport layers, decoupling sockets from network addresses. Instead of binding to IPv6 addresses, applications bind to 128-bit HITs, flat (non-hierarchical) cryptographic identifiers generated by hashing a public key. Due to the decoupling between the network and transport layers, HIP enables applications on a mobile host to continue communication oblivious to changes in local network addresses. New HIP communication sessions are preceded by a challenge-response-based authentication process. As HIP deals only with control signaling, standard IPSec is used to carry the actual data traffic. The implementation of HIP referred to in this article uses the recently proposed
bound end-to-end tunnel (BEET) mode of IPSec operation, which eliminates the requirement to retain the source and destination HIT as an encapsulated header in each transmitted packet [10]. The setup of a HIP connection between two hosts results in a pair of unidirectional BEET mode IPSec security associations (SAs) at each host. The security parameter index (SPI) for each SA is contained in the I2 and R2 base exchange packets and is used by the hosts to determine the source and destination HIT. The mapping between IPSec SPI and source/destination HIT is performed by the BEET mode association, which simply replaces the network layer addresses with HITs after decryption.
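A conceptual sketch of the SPI-to-HIT bookkeeping just described might look as follows; this is an illustration, not the InfraHIP code, and the SPI values and HIT labels are placeholders.

beet_sa_table = {
    # SPI -> (source HIT, destination HIT), learned from the I2/R2 exchange
    0x11AA: ("HIT_TDA", "HIT_TDB"),
    0x22BB: ("HIT_TDB", "HIT_TDA"),
}

def deliver_to_transport(spi, decrypted_payload):
    """Hand the packet up with HITs in place of the network-layer addresses."""
    src_hit, dst_hit = beet_sa_table[spi]
    return {"src": src_hit, "dst": dst_hit, "payload": decrypted_payload}

packet = deliver_to_transport(0x11AA, b"application data")
assert packet["dst"] == "HIT_TDB"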
Figure 4. Insertion of an arbitrary number of intermediaries.
HIP mobility handling comprises an authenticated location update procedure in which the mobile host delivers a signed location update packet to the correspondent host with details of the new network layer address. Our contribution is an extension to standard HIP that provides a means to delegate HITs between physical hosts on-the-fly in response to a mobility event.
Depending on local security policy, either the mobile host or the correspondent host may ask to re-key the connection in response to mobility. Re-keying also can be requested by either host after a certain time period has elapsed. Re-keying involves the deletion of existing IPSec SAs and the establishment of new ones with a newly generated session key. If re-keying is not required, existing SAs are deleted and re-established with the previous session key. This reconfiguration of IPSec SAs is transparent to the transport layer.
To mitigate the effect of the implementation-specific security policy on experimental results, a base exchange was substituted for an update procedure in the work presented here. A base exchange is, in fact, in most cases roughly equivalent to an update, with the main difference being modified HIP header fields.

Figure 5. Proxy insertion by TDA.

Transferring Cryptographic Identities: Duplication versus Delegation
The use of cryptographic identifiers, as in HIP, decouples the identifier used by applications and transport layer sockets from the locator used for routing. A natural consequence of decoupling is that it is possible to transfer the identity between different physical devices, which is what we want to achieve to be able to insert intermediate middleboxes into an ongoing communication session. However, for a host to be able to verify that it is authorized to use a certain identifier it must present messages signed with the private key corresponding to the identifier. This can be solved in two ways, either by duplicating the requisite private key on any host that requires it, or by forwarding location update signaling packets to be signed on demand. The second approach is advocated in this article because private keys should, in principle, remain private. To avoid the introduction of another acronym, the abbreviation HIT is used interchangeably with the term cryptographic identifier in the remainder of this article. In the approach proposed in this article, hosts that wish to use a delegated HIT are required to forward location update signaling packets to the owner of the corresponding private
key for signing. The owner of the private key can then, at its discretion, sign the messages and return them to the host that requested them, which can in turn forward them on to the corresponding host, thereby verifying the claim to use the HIT. The main advantage of this approach is that it avoids the dissemination of private keys. This approach also allows temporary delegation of a HIT, because the destination host can use the HIT only for as long as the corresponding host does not request a re-keying procedure to be performed. An additional advantage is that because the location-update-signaling messages forwarded to the key-holder host for signatures also contain the HIT and IP addresses of the corresponding host, the key-owner host can keep track of the corresponding hosts with which communication sessions are being conducted using its identity. Should the key owner wish to revoke the use of its HIT by a certain destination host, it need only perform a re-key directly with the corresponding host. The drawback of this approach is that if the key-owner host disappears, any further requests to sign location-update-signaling messages cannot be processed. This means that a destination host may be forced to terminate a communication session if the corresponding host initiates a re-key.
One potential philosophical ramification of the delegation approach (on HIP specifically) is that so-called host identities no longer explicitly belong to a specific host but can be moved around between physical hosts, contrary to the original intention of the designers of HIP. The proposed approach limits the architectural impact of this by ensuring that identities are delegated only temporarily and can never be used without the explicit consent of the actual entity that the host identity serves to identify.
Our proposal changes the notion of end-to-end security in HIP because, even though communication is still encrypted (IPSec), all nodes explicitly included in the service path can read the payload. This is desired functionality because we envisage that nodes included on the service path may be tasked with some application-layer processing such as content adaptation.
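The signing-on-demand idea can be sketched as follows. This is our own illustration under stated assumptions, not code from any HIP implementation: sign() is a hash-based stand-in for the public-key operation on the host identity key, and the class and method names are hypothetical.

import hashlib

def sign(private_key, message):
    # Placeholder "signature": a hash over key material and message.
    return hashlib.sha256(private_key + message).hexdigest()

class KeyOwner:
    """The terminal device that owns the host identity (and its private key)."""
    def __init__(self, private_key):
        self._key = private_key
        self.revoked = set()

    def sign_update(self, delegate, update_msg):
        if delegate in self.revoked:
            return None                       # refuse: delegation withdrawn
        return sign(self._key, update_msg)    # the key never leaves this host

class DelegateHost:
    """A host (e.g., a service proxy) using a delegated HIT."""
    def __init__(self, owner):
        self.owner = owner

    def send_location_update(self, update_msg):
        signature = self.owner.sign_update("SP_A", update_msg)
        if signature is None:
            raise RuntimeError("delegation revoked; cannot prove right to use HIT")
        return update_msg, signature          # forwarded on to the correspondent

owner = KeyOwner(b"private-key-bytes")
proxy = DelegateHost(owner)
print(proxy.send_location_update(b"UPDATE: new locator 198.51.100.7"))

The point the sketch tries to capture is that the private key never leaves the key owner, and revocation amounts simply to refusing to sign or forcing a re-key.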
Figure 6. TCP sequence number vs. time plot: two service proxies.

Implementation Details

We implemented the identity delegation approach described in this article by extending publicly available code from the Infrastructure for HIP project (InfraHIP) [11] as a base. Our
extensions were evaluated on a Debian Linux system running kernel version 2.6.16. The remainder of this section provides an analysis of the signaling procedures specific to the implementation. Our description refers to a simple scenario: a mobile terminal device (TDA) communicating with a static correspondent terminal device (TDB). The analysis commences with a description of how TDA may delegate its identity to an intermediary SP.
Figure 5 depicts the signaling involved in the delegation process. It is assumed that there is a pre-existing trust relationship between terminal devices belonging to a single PN. The delegation process starts when TDA queries the SP for the IP address (IPSP) that it wants to use for the delegated identity. At the same time, TDA and the SP establish a transport mode IPSec channel. This channel carries encapsulated HIP signaling traffic, as well as the IPSec security policy and association information used to establish the BEET mode IPSec SAs used for application data. The HIP signaling traffic between TDA and the SP is sent as encapsulated payloads, indicated in Fig. 5 by "p[…]." The SP relays any HIP signaling traffic either to TDA or TDB without modification. The whole process of identity delegation and subsequent session redirection is transparent to applications running on TDB. The modular nature of our design means that the scheme can be implemented as an extension to an existing HIP-enabled network stack. As mentioned previously, intermediary SP insertion also can be performed by TDB to construct chains of two or more composed SPs between TDA and TDB. The signaling involved in the insertion of the second SP is equivalent to the SP insertion by TDA shown in Fig. 5.
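As a minimal sketch of the relay role attributed to the SP (again an illustration, not the prototype code), the proxy only de-encapsulates or re-encapsulates the HIP signaling carried over its secured channel with TDA and never modifies it; the dictionary-based "encapsulation" below stands in for the IPSec transport-mode channel.

def sp_relay(packet, from_host):
    """Forward encapsulated HIP signaling without inspecting or altering it."""
    if from_host == "TD_A":
        return ("TD_B", packet["hip_payload"])      # de-encapsulate toward TD_B
    elif from_host == "TD_B":
        return ("TD_A", {"hip_payload": packet})    # re-encapsulate toward TD_A
    raise ValueError("unknown peer")

dest, forwarded = sp_relay({"hip_payload": b"I1 base exchange packet"}, "TD_A")
assert dest == "TD_B"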
Experimental Evaluation

Evaluation of the identity delegation scheme was performed for the second usage scenario described previously. The results of this scenario are also applicable to the other scenarios because they utilize the same identity delegation mechanism. The intention of the experiments was twofold: first, to provide a general evaluation of HIP performance in a real system, and second, to show that the delegation approach does not result in any measurable performance drop compared to unmodified HIP. Initially, it was assumed that most of the hand-off latency overhead would be due to heavy CPU load caused by the cryptographic operations required to sign HIP signaling messages and establish IPSec sessions. As such, it was expected that the performance of both approaches would be equivalent, provided that the machines used to sign HIP messages and set up IPSec sessions were equal in terms of processing power. These assumptions were confirmed by the evaluation results presented below.
The experiment was performed to evaluate the scenario of inserting intermediary SPs, which, for example, can be a content adaptation SP between two devices engaged in a TCP communication session. The purpose of evaluating this scenario was to demonstrate that the TCP connection between the two devices remains unbroken and that the scheme does not cause any specific harm to the normal performance of higher-layer protocols. In reality, altering the end-to-end path in mid-session may introduce some degradation in TCP performance if the new path is of lower quality than the old path; however, this issue is outside the scope of our proposal. In these experiments, an initial communication session was established from TDA (600 MHz Pentium III) toward TDB (500 MHz Celeron). The evaluated scenario was the insertion of two 3 GHz Pentium 4 service proxies, SPA1 and SPA2, in series between the initial TCP session end points, TDA and TDB. Figure 6 shows the resulting TCP sequence number vs. time plot. The two large gaps (a) and (c) in the plot represent the
respective times at which SPA1 and SPA2 were inserted between TDA and TDB. From the plot it can be observed that the effects of the insertion of SPA2 are similar to those of the insertion of SPA1 in terms of latency and impact on TCP performance. The plot also demonstrates that the insertion of multiple consecutive SPs does not result in any further drop in performance, provided the SPs are powerful enough to handle the required IPSec sessions without CPU saturation. Some smaller gaps, such as that indicated by (b), can be attributed to the CPU being utilized by the cryptographic operations required to set up a secure signaling channel prior to hand-off. It is important to note that if the capability to delegate or transfer identity were not available, the session would have to be broken and restarted to insert and remove each intermediary proxy, causing the TCP sequence number to reset to the beginning each time.
Related Work

There are a number of previous and ongoing related works addressing inter-device mobility. On the other hand, there are fewer proposals that address the insertion of intermediary SPs as a mobility-handling technique. The only proposal with similar functionality is based on the Stream Control Transmission Protocol (SCTP). As with any other transport protocol, a node can be made to act as a proxy. In SCTP, when an end point (A) initiates a connection, the other end (B) can, with or without the knowledge of the initiator, open an association to another entity (C) and act as a proxy in between. Then B can either remove itself or make an association with C to receive the data from C. All this must be done before heartbeat signals are exchanged [12].
The HIP base specification provides no mechanism for inter-device mobility. However, [1] and [8] allude to the possibility of identity delegation using signed certificates. The approach proposed here provides a higher degree of transparency and control and is more responsive than delegation certificates. Koponen, Gurtov, and Nikander provide a high-level discussion of the potential for HIP identity delegation with certificates [1]. References [2, 3] are related solutions that enable ongoing communication sessions to be moved between devices. In [2], Su creates a virtualized network interface that can be transferred between different devices, and with it the associated communication sessions. It should be noted that none of these schemes conflict with HIP or with the scheme presented here; in fact, there is even potential for useful interoperation. A major difference of the delegation approach is that it focuses only on managing connectivity and can be implemented in such a way that it is transparent to at least one end of an end-to-end connection, if not both.
There are also a number of related activities in the IETF associated with the locator/ID split [13, 14]. Of this work, network-based schemes such as the Locator/Identifier Separation Protocol (LISP) do not consider the use of middleboxes. The others, especially mobility Internet key exchange (MOBIKE) and SHIM6, focus on device mobility and do not support the use of middleboxes as described in this article.
Conclusion

Auxiliary devices that can serve as dynamically configured middleboxes introduce potential for a new approach to mobility handling that makes use of multiple available network interfaces and terminal devices. Mobility handling in this case means adapting to the changing status of an individual terminal by delivering application data flows to the best available
terminal device(s) and utilizing the available service proxies (middleboxes) in the best possible way. This cannot be achieved using currently available technology. This article addresses the problem by creating and exploiting PNs to provide enhanced mobility handling to mobile users. It focuses on the specific problem of decoupling application data flows from specific devices by making use of multiple available network interfaces and terminal and service proxy devices. We propose mechanisms to switch ongoing communication sessions between terminal devices and to transparently insert or remove intermediary service proxies, with mobility managed at layers below the transport layer. The proposed identity delegation approach is based on HIP and allows the identity creator to retain full control over the use of its identity. The approach enables the movement of communication sessions between terminal devices, as well as the transparent insertion and removal of middleboxes, service proxies, or other intermediaries able to perform routing or adaptation.
Future Work

Future work on support for the movement of communication sessions between terminal devices may include coupling identity delegation with "checkpointing" and the transfer of transport, session, and application layer state to allow full application sessions to be moved between devices. Another problem worthy of investigation, for security reasons, is how to enable independent verification of whether or not two terminal devices belong to the same PN.
References
[1] T. Koponen, A. Gurtov, and P. Nikander, "Application Mobility Using the Host Identity Protocol," Proc. ICT '05, Madeira, Portugal, May 2005.
[2] G. Su, MOVE: Mobility with Persistent Network Connections, Ph.D. diss., Columbia Univ., Oct. 2004.
[3] R. Baratto et al., "MobiDesk: Mobile Virtual Desktop Computing," Proc. MobiCom, Philadelphia, PA, Sept. 2004.
[4] I. G. Niemegeers and S. M. Heemstra De Groot, "From Personal Area Networks to Personal Networks: User Oriented Approach," Wireless Personal Commun., vol. 22, no. 2, 2002, pp. 175–86.
[5] S. Herborn, A Personal-Network Centric Approach to Mobility Aware Networking, Ph.D. diss., Univ. New South Wales, Mar. 2007.
[6] S. Ardon et al., "MARCH: A Distributed Content Adaptation Architecture," Int'l. J. Commun. Sys., vol. 16, 2003, pp. 97–115.
[7] B. Knutsson and H. Lu, "Architecture and Performance of Server Directed Transcoding," ACM Trans. Internet Technology, vol. 3, 2003, pp. 392–424.
[8] R. Moskowitz et al., "Host Identity Protocol," IETF RFC 5201; http://www.ietf.org/rfc/rfc5201.txt
[9] D. Johnson, C. Perkins, and J. Arkko, "Mobility Support in IPv6," IETF RFC 3775; http://www.ietf.org/rfc/rfc3775.txt
[10] P. Nikander and J. Melen, "A Bound End-to-End Tunnel (BEET) Mode for ESP," IETF draft; http://tools.ietf.org/id/draft-nikander-esp-beet-mode-08.txt
[11] InfraHIP project; http://infrahip.hiit.fi/
[12] T. Aura, P. Nikander, and G. Camarillo, "Effects of Mobility and Multihoming on Transport-Protocol Security," Proc. IEEE Symp. Security and Privacy, Berkeley, CA, May 2004.
[13] D. Meyer, "The Locator/ID Split, Its Implications for IP Architecture, and a Few Current Approaches," Future of Routing Wksp., APRICOT '07; http://www.1-4-5.net/dmm/talks/apricot2007/locid
[14] D. Lee, X. Fu, and D. Hogrefe, "A Review of Mobility Support Paradigms for the Internet," IEEE Commun. Surveys & Tutorials, vol. 8, no. 1, 2006.
Biographies ARUNA SENEVIRATNE ([email protected]) received his Ph.D. in electrical engineering from the University of Bath, United Kingdom, in 1982. He is director of the NICTA Australian Technology Park Laboratory. He has held academic appointments at the University of Bradford, United Kingdom, Curtin University, and the University of New South Wales. He has also held visiting appointments at the University of Pierre and Marie Curie, Paris, and INRIA, Nice. In addition, he has been a consultant to numerous organizations including Telstra, Vodafone, Inmarsat, and Ericsson. STEPHEN HERBORN ([email protected]) completed his Ph.D. at the University of New South Wales under the supervision of Professor Aruna Seneviratne. He works for Accenture consulting. Between 2003 and 2008, he was a member of the Networking and Pervasive Computing (NPC) program at NICTA in Sydney, first as a student and then as a full-time researcher. While at NICTA, his research activities centered around personal area networking, mobile networking, and context-aware computing.
NAT Issues in the Remote Management of Home Network Devices
Choongul Park, Kitae Jeong, and Sungil Kim, KT Technology Lab
Youngseok Lee, Chungnam National University
Abstract Currently, many customer devices are being connected to home networks. For this reason, it is expected that device management capabilities will be a powerful instrument for the service provider to cope with high maintenance costs, security concerns, and management issues related to home networks. Through DM, the service provider could provide valuable services such as auto-provisioning, remote configuration, firmware and software updates, diagnostics, monitoring, scheduling, and fraud management. However, network address translators that are widely deployed in the home network environment prohibit DM operations from reaching user devices behind the NAT. In this article, we focus on NAT issues in the management of home network devices. Specifically, we discuss efforts relating to standardization and present our proposal to deploy DM services for VoIP and IPTV devices behind NATs. By slightly changing the behavior of Simple Network Management Protocol managers and agents and by defining additional management objects (MOs) to gather NAT binding information, we could solve the NAT traversal problem under a symmetric NAT. Moreover, we propose an enhanced method to search for the UDP hole binding time of the NAT box. For evaluation, we applied our method to 22 randomly selected VoIP devices out of 194 NATed hosts in the real broadband network and achieved a success ratio of 99 percent for exchanging SNMP request messages and a 26 percent enhancement in determining the UDP hole binding time.
In the broadband network, the service provider must communicate with customer devices located at the end of the last mile for administrative purposes. As user devices become more diverse and complex, the software that controls them becomes more complex as well. Thus, for the development of effective device management (DM) software, it is important to deal with the various customer problems that can center on the device, such as firmware updates, software misbehavior, or configuration errors. The costs relating to deployment, customer care, operation, and management in a large-scale network could be significantly reduced through the DM services described in Fig. 1. The importance of the remote device management function will be emphasized further as the number of broadband subscribers carrying network-attached devices increases dramatically. In Korea, the number of subscribers to high-speed broadband Internet services is over 14 million.1 Therefore, it is assumed that many network address translators (NATs) are deployed in home networks. According to recent statistics2 from Korea Telecom (KT), which is the largest Internet service provider (ISP) in Korea, approximately 20 percent of customer devices are located behind a NAT middlebox.
1 This was announced by the Korea Information Promotion Committee in the domestic information trend (vol. 7, no. 8) in March 2008.
2 The percentage of NAT penetration was produced by our device management system named U-CEMS, even though official figures were lower than ours by 5 percent in July 2007, which was referenced by the Digital Times news (www.dt.co.kr) published on July 30, 2007.
A NAT [1] allows several computers to share a single public IP address. Private IP addresses are assigned to hosts behind a NAT, which means that communication between hosts and public Internet nodes passes through the NAT, which maintains port and address translation information. Managing a device that is behind a NAT is one of the most urgent problems for service providers who are attempting to provide DM services to their customers. As shown in Fig. 1, a device management system (DMS) communicates with a device management client (DMC) to receive information from the remote devices and control them. In a NATed environment, a DMS, like other Internet applications, cannot avoid the NAT traversal problem. Namely, if a customer is using a NAT, the device behind the NAT cannot be controlled by the DMS. The NAT traversal problem has been studied a great deal in order to support hosts using voice over IP (VoIP) and peer-to-peer (P2P) applications behind a NAT [2]. However, to the best of our knowledge, this issue has not been considered with the aim of controlling NATed hosts through DM, although some efforts for standardization have begun recently. In this article, we aim to identify the issues and challenges relating to NATs when using DM to manage home network devices that are behind a NAT. In addition, we present a Simple Network Management Protocol (SNMP)-based approach to control hosts under NATs, which employs a User Datagram Protocol (UDP) hole-punching technique with a correct timer estimation method.
Figure 1. Management of customer devices.
The remainder of this article is organized as follows. We provide an overview of DM protocols and standards and then discuss the open issues of the remote management of NATed devices. We describe our proposal using SNMP for device management, give the results of our experiment, and also make comparisons with other DM methods. Our conclusions and suggestions for future research are presented later.
Managing Devices Behind a NAT

Overview of Device Management Protocols

There are many device management protocols; the protocols we discuss here are presented in Table 1. These are standards-based protocols that are widely accepted around the world by many service and solution providers for device management.
Open Mobile Alliance (OMA) DM [3] uses extensible markup language (XML) for data exchange, more specifically the subset defined by OMA device synchronization (OMA DS). OMA device management (OMA DM) is designed to support Wireless Session Protocol (WSP), Wireless Application Protocol (WAP), Hypertext Transfer Protocol (HTTP), OBject EXchange (OBEX), or similar transports as a transport layer protocol. The protocol specifies the exchange of packages during a session, with each package consisting of several messages and each message in turn being composed of one or more commands. The server initiates the commands, and the client is expected to execute the commands and return the result with a reply message.
Technical Report 069 (TR-069) [4] is also a device management protocol, defined by a Digital Subscriber Line (DSL) Forum technical specification. This application layer protocol provides the remote management function for end-user devices. Based on a bidirectional Simple Object Access Protocol (SOAP)3/HTTP protocol, it enables communication between a device and a DMS. Typical applications of TR-069 are safe auto-configuration and the control of other customer premises equipment (CPE) management functions within the integrated framework.
SNMP [5, 6] is popular in network management because it enables easy monitoring of the status of network-attached devices. SNMP defines a set of standards for network management: an application-layer protocol, a database schema, and a set of data objects. Management data are specified in the form of variables on the managed systems, which describe the system configuration information. These variables can then be queried and sometimes set by SNMP manager applications.
3 SOAP stood for Simple Object Access Protocol, but this acronym was dropped in Version 1.2 of the standard because it was considered to be misleading.
Open Issues of Remote Management of NATed Devices

As explained earlier, several protocols have been standardized to support device management. However, with the advent of many NATs in the home network environment, the NAT becomes an important component to consider. Therefore, we present open issues in the remote management of NATed devices.
A NAT translates between internal private IP addresses and external public ones. NATs, particularly network address port translation (NAPT) devices, one of the most common NAT systems, deal with communication sessions, which are identified uniquely by the combination of source IP address, source port number, destination IP address, and destination port number. When a NATed device in a private network sends packets to an external host, the NAT intercepts the packet and replaces the source private IP address and port number with a public IP address and a port number. Subsequently, when the NAT receives an incoming packet from the same public IP address and port number, it replaces the destination address and port number with the corresponding entry stored in the translation table, forwarding the packet to the private network.
The first issue in the remote management of a NATed device is to find an efficient way to facilitate the successful exchange of remote management request/response messages through the NAT box. A DMS cannot provide management authorities with management functions for a device behind a NAT because the management operations are blocked by the NAT.
Table 1. Overview of three device management standards.
• Organization: OMA-DM — OMA (1); TR-069 — DSL Forum (2); SNMP — IETF (3).
• Target devices: OMA-DM — mobile devices (mobile phones, PDAs, palmtop computers, etc.); TR-069 — fixed devices (residential gateways, VoIP phones, IPTV STBs, etc.); SNMP — network elements (computers, routers, switches, terminal servers, VoIP phones).
• Typical uses: OMA-DM — provisioning, device configuration, software management, firmware over the air (FOTA), fault management; TR-069 — auto-configuration, dynamic service activation, firmware management, status and performance control; SNMP — fault, configuration, account, performance, and security management.
• Data model: OMA-DM — OMA-DS (4); TR-069 — XML; SNMP — ASN.1 (5).
• Transport protocol: OMA-DM — TCP; TR-069 — TCP; SNMP — UDP.
• Operations: OMA-DM — Add, Get, Replace, Exec, Copy, Event; TR-069 — GetRPCMethods, Get/Set Parameter (Values, Names, Attributes), (Add/Delete)Object, Download, Inform, etc.; SNMP — Get, GetNext, GetBulk, Trap, Set.
• Current specification: OMA-DM — OMA Device Management V1.2 Approved Enabler (April 2006); TR-069 — CPE WAN Management Protocol v1.1 (Dec. 2007); SNMP — RFC 3411 (Dec. 2002) through RFC 3418 (Dec. 2002).
(1) Open Mobile Alliance; http://www.openmobilealliance.org
(2) Digital Subscriber Line Forum; http://www.dslforum.org
(3) Internet Engineering Task Force (IETF); http://www.ietf.org
(4) Open Mobile Alliance Data Synchronization; the former name of OMA-DS is Synchronization Markup Language (SyncML).
(5) Abstract Syntax Notation One (ASN.1): a standard and flexible notation that describes data structures for representing, encoding, transmitting, and decoding data.
On the other hand, a NAT maintains a table that maps private addresses and port numbers to public port numbers and IP addresses. It is important to note that this "binding" information can be created only by outgoing traffic from the internal host. In addition, most NATs maintain an idle timer for each outgoing session and close the hole if no traffic is observed for the given time period. If we knew the default timer value of a NAT, we could minimize the session management overhead. However, there is no way to know the default timer value without any information about the NAT itself, such as its vendor or model. In other words, the issue we focus on here is determining the NAT timer value for an unknown NAT. Therefore, a second important issue is to estimate the correct timer value for each NAT box at minimal cost. Without knowledge of the appropriate timer values, the DMS must repeatedly send unnecessary probe packets to each NAT to find them in a large-scale network. These two issues are not specific to DM but apply to all applications behind an unknown NAT. To provide DM services across a NAT, we first look into efforts for standardizing NATed device management.
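The binding and idle-timer behavior described above can be sketched as follows. This is an illustrative model of a generic NAPT written for this article, not the code of any particular NAT; the port allocation policy, timeout value, and addresses are assumptions.

import time

class Napt:
    def __init__(self, public_ip, idle_timeout=180.0):
        self.public_ip = public_ip
        self.idle_timeout = idle_timeout
        self.bindings = {}        # (private_ip, private_port) -> (public_port, last_used)
        self.reverse = {}         # public_port -> (private_ip, private_port)
        self._next_port = 40000   # hypothetical sequential allocation policy

    def outbound(self, private_ip, private_port):
        key = (private_ip, private_port)
        if key not in self.bindings:
            self.bindings[key] = (self._next_port, time.time())
            self.reverse[self._next_port] = key
            self._next_port += 1
        public_port, _ = self.bindings[key]
        self.bindings[key] = (public_port, time.time())   # refresh the idle timer
        return self.public_ip, public_port

    def inbound(self, public_port):
        key = self.reverse.get(public_port)
        if key is None:
            return None                                    # no hole: packet dropped
        _, last_used = self.bindings[key]
        if time.time() - last_used > self.idle_timeout:
            return None                                    # hole expired
        return key                                         # translate back to private host

nat = Napt("203.0.113.5")
print(nat.outbound("192.168.0.10", 161))   # e.g., ('203.0.113.5', 40000)
print(nat.inbound(40000))                  # ('192.168.0.10', 161)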
Efforts for the Standardization of NATed Device Management

When it comes to the issue of how to manage NATed devices, there exist similar discussions such as RFC 2962 [7] and the CALLHOME Birds of a Feather (BoF) draft [8]. First, RFC 2962 describes an SNMP application-level
gateway (ALG) for payload address translation, but this ALG has serious limitations, including its scalability and the speed of deployment of new applications. Moreover, it requires an upgrade to existing NATs. A CALLHOME BoF was held at the 64th IETF meeting and suggested a connection model that reversed the client-server roles when establishing a connection. However, its activity ended without a clear result. For these reasons, in this section we focus on the efforts of the de facto DM standardization bodies to manage NATed devices. We discuss and compare these with our approach in detail in a later section.
Technical Report-111 — TR-111 [9] extends the mechanism defined in TR-069 for the remote management of devices and is incorporated in TR-069 ANNEX G. TR-111 enables a management system to access and manage devices connected to a local area network (LAN) through a NAT. Two mechanisms were suggested in TR-111. TR-111 Part 1 is defined for the situation in which both the NAT and the device are TR-069 managed by the same DMS. TR-111 Part 2 provides a mechanism to realize a remote connection request to a device behind a NAT, in the event that the NAT does not support TR-069. It allows a DMS to initiate a TR-069 session with a device that is operating behind a NAT. The Simple Traversal of UDP through NATs (STUN) protocol mechanism defined in RFC 3489 [10] is included as Part 2 of TR-111, in which a device uses STUN to determine whether or not it is behind a NAT. Then, if the device is behind a NAT with a privately allocated address, the device uses the procedures defined in STUN to discover the binding timeout. The device
sends periodic STUN binding requests at a sufficient frequency to maintain the NAT binding on which it listens for UDP connection requests. The STUN-based mechanism requires a large amount of bandwidth but covers a wide range of usages, including VoIP service deployed behind an unmanaged or unfriendly NAT, or home networks with multiple NATs. Two alternative mechanisms based on the Dynamic Host Configuration Protocol (DHCP) or Universal Plug and Play (UPnP) have been proposed and discussed recently, as follows:
• DHCP-based TR-111: DHCP is a well-known protocol used by networked devices to obtain the parameters necessary for Internet connectivity. A device informs the NAT of its connection request URL via DHCP option 60. The NAT in turn creates a proxy URL to use for the communication back to the device. The device can then communicate the proxy URL as its connection request URL to the DMS. The NAT forwards packets arriving on the proxy URL to the device connection request URL.
• UPnP-based TR-111: UPnP is a set of protocols that allow devices in the home network to be connected seamlessly. A device uses UPnP to discover the NAT, learn its public IP address, and open a forwarding port. After a port is opened, the device can register for notification of changes to the wide area network (WAN) IP address and communicate the connection request URL, with the public IP address of the NAT and the forwarding port, to the auto-configuration server (ACS).
However, these two alternatives also are limited, in that the NAT must support the DHCP option mechanism or the UPnP protocol.
OMA CDM — There is standardization work being performed in the area that involves the discussion of converged DM (CDM) issues, such as the configuration and management of devices that support one or more bearer technologies for services. This is expected to standardize urgent management issues for devices within a consumer's network, including the assessment of device management when a device is located behind a NAT. However, the CDM standardization issue, which will be a part of the OMA-DM v2.0 work item document (WID), is still in its infancy.
Lessons from Standardization Efforts — Note that two approaches exist to provide DM services over NATs. One is to make NATs manageable, like ALGs, or friendly to device management protocols, like proxies; the NAT box can then relay the operations of a DMS to the device. The other is to adopt common NAT solutions like STUN and to make DMCs independent of NAT traversal mechanisms. The first solution is not easily applied in the real environment because most currently deployed NATs do not support the SNMP ALG, the DHCP option, or UPnP, so the deployment cost becomes expensive. The second solution, using STUN, a well-known NAT traversal mechanism, also has scalability and cost issues in that it requires additional dedicated servers and clients. We therefore propose an SNMP-based device management scheme exploiting the UDP hole punching technique [11] that can easily be implemented and deployed in the current network. The comparison of these issues is discussed more specifically later.
Our Proposal: Using SNMP as Device Management

More than 700,000 IPTV set-top boxes and VoIP devices had been distributed to high-speed broadband customers in the KT network as of the end of 2007.4 Consequently, those customer devices must be managed by the integrated device management system. To manage those devices, we are adopting the popular SNMP version 2c as a DM protocol. There are several reasons for this choice. First, SNMP is a well-known protocol both for service providers and device vendors, which means that we can benefit from fitting the time-to-market by rapidly implementing our requirements. In addition, vendors want to adopt a lightweight DM client to avoid a cost problem for devices in terms of required resources. However, we are trying to change this situation by considering SNMPv3, TR-069, or OMA-DM because of the issue of security, which could be a big challenge in the future. Accordingly, in this article we propose an SNMP-based DM method for NATed hosts over unmanageable NATs. The UDP mechanism using SNMP traps that we propose here is easier to implement and deploy than a TCP mechanism for managing NATed devices, because a TCP mechanism could result in substantial system overhead by holding a large number of sessions initiated by hundreds of thousands of devices.
4 This was announced in the pressroom on the MegaTV portal; http://mymegatv.com/pressroom/pressroomList03.asp.
Our Challenges in Avoiding the NAT Traversal Problem

As stated previously, to solve the NAT traversal problem, we designed a connection request mechanism for the NATed device, which enables a DMS to exchange SNMP messages through an unknown NAT. Based on the well-known NAT traversal mechanism of UDP hole punching, we slightly modified the behaviors of the SNMP manager and the agent and defined additional MOs so that we could effectively solve the NAT traversal issues in a manner that avoids the problems of cost and scalability. Moreover, we developed an enhanced method to determine the UDP hole binding time for the NAT box and applied it to VoIP devices. Through experiments, we obtained highly reliable results: 99 percent of exchanges of SNMP request messages were successful, and the search time for the default UDP hole binding timer value was reduced by 26 percent.
Most NATs hold an idle timer for a UDP session and close the hole if no traffic is observed for the given time period. Hence, we can reach a device behind a NAT by using the UDP hole punching scheme, in which the SNMP agent sends keep-alive trap messages to the DMS periodically. This enables the DMS to learn the private/public address and port binding information. Because the SNMP agent at the device usually uses UDP port 161 for the SNMP request message, the binding entry for UDP port 161 must exist in the binding table at the NAT. Generally, any port could be allocated as the source port for sending the SNMP trap message. If an SNMP agent sends the trap message using the fixed UDP port of 161, we can ensure that the binding entry will be maintained in the binding table at the NAT.
On the other hand, there is another problem in managing devices with a private IP address, known as the symmetric NAT problem, whereby a public IP:port can reach a private IP:port only if the traffic was initiated from the private network toward that public IP:port. To solve this problem, a DMS uses UDP port 162 to send the
SNMP request message, because the SNMP agent sends the SNMP trap message to the destination port of 162. By using these concepts, we propose an SNMP-based remote management method for a device behind a NAT as follows. First, we define the behavior of the SNMP agent embedded in the device. An SNMP agent triggers a UDP hole by periodically sending SNMP trap messages as keep-alive messages to the DMS. We chose 180 seconds as the interval of keep-alive messages because it was the most frequently found value among the NATs in our experiments. We also fixed the source port of the SNMP trap message sent from the SNMP agent to UDP port 161. Second, we added to the SNMP manager in the DMS a function for gathering the agent IP address and its source port number. If the agent IP address is different from the source IP address, the SNMP manager decides that the device that sent the SNMP message is located behind a NAT. To avoid the symmetric NAT problem, the SNMP manager must fix the source port to 162 when sending the SNMP request message.
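A minimal sketch of the agent-side behavior defined above is given below. It is an illustration rather than our deployed agent: it assumes a reachable DMS address, sends a placeholder datagram instead of a properly encoded SNMPv2c trap carrying the ClientAddress object, and binding to port 161 normally requires elevated privileges.

import socket
import time

DMS_ADDR = ("198.51.100.20", 162)   # assumed DMS address (C:162)
KEEPALIVE_INTERVAL = 180            # seconds, matching the value used in the article

def run_keepalive(local_ip="0.0.0.0", iterations=3):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((local_ip, 161))      # fixed source port 161 (privileged port)
    for i in range(iterations):
        # Placeholder payload; a real agent sends an SNMP trap with ClientAddress.
        sock.sendto(b"SNMP-trap placeholder: ClientAddress=A", DMS_ADDR)
        if i + 1 < iterations:
            time.sleep(KEEPALIVE_INTERVAL)   # refresh the hole before it expires
    sock.close()

if __name__ == "__main__":
    run_keepalive(iterations=1)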
Figure 2. The sequential message flow of SNMP device management in the NAT environment.

Proposal of SNMP-Based DM over NAT

Figure 2 shows the message flow associated with the procedures of our proposed method to manage a NATed device using the UDP hole punching scheme. In Fig. 2 the address/port pairs use the notation (Address:port). There are four steps in our mechanism, as follows:
• Precondition: The DMC uses port 161 as its listening port for receiving SNMP requests from the DMS and as its source port for sending SNMP trap messages. The DMS uses port 162 as its listening port for receiving SNMP trap messages sent from the DMC and as its source port for sending SNMP request messages to the DMC.
• Step 1: Creating a UDP hole: When the IP address (A) is assigned to the device by the NAT, the DMC (A:161) sends the SNMP trap message to the DMS (C:162), which includes the device address as a trap object and its value (ClientAddress=A), which provides the private IP address of the device. The NAT translates the IP:port pair (A:161) of the SNMP trap packet to (B:p), which are the IP address and the port number allocated by the NAT, randomly or sequentially. In other words, the NAT creates a UDP hole and a binding entry (A:161, B:p).
• Step 2: Binding discovery: The DMS determines that the device is located behind a NAT when it finds that the address (A) of the device extracted from the SNMP object ClientAddress differs from the source address of the SNMP trap packet. If a device is behind the NAT, the DMS extracts the binding information (A:161, B:p) from the SNMP trap message, whereby the IP:port (A:161) of the device is extracted from the SNMP message, and the IP:port pair (B:p) of the NAT is extracted from the received SNMP trap packet.
• Step 3: Keep punching the UDP hole: To maintain the UDP hole bound with entry (A:161, B:p) of the NAT, the DMC keeps sending SNMP trap messages to the DMS (C:162) at hole-timer intervals.
• Step 4: Sending SNMP request messages: When the DMS (C:162) wants to manage a NATed device, it sends the SNMP request message to manage the device through the hole (B:p). Then, the message can pass through the hole and reach the NATed device. The NAT translates (B:p) to (A:161) according to the binding table. The DMC receives this SNMP message from UDP port 161, and it sends the SNMP response message to the DMS (C:162). The process to deliver this SNMP response message with the result to the DMS is the same as that of the SNMP request message (a sketch of this exchange is given below the list).
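The sketch below illustrates the DMS side of Steps 2 and 4 under the same assumptions as the earlier agent sketch: the ClientAddress value is parsed from a placeholder payload rather than a real SNMP trap PDU, and the manager replies through the recorded (B:p) hole from source port 162.

import socket

def run_dms(listen_ip="0.0.0.0"):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((listen_ip, 162))                 # fixed listen/source port 162
    bindings = {}                               # private address A -> public (B, p)

    data, (src_ip, src_port) = sock.recvfrom(2048)
    # Placeholder parsing; a real manager decodes the ClientAddress trap object.
    client_addr = data.decode().split("ClientAddress=")[-1].strip()
    if client_addr != src_ip:
        bindings[client_addr] = (src_ip, src_port)   # device is NATed: remember the hole

    # Later, to manage the device, send the request through the hole (B:p);
    # the NAT translates it back to (A:161).
    if client_addr in bindings:
        sock.sendto(b"SNMP GET placeholder", bindings[client_addr])
    return bindings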
Heuristic to Estimate the UDP Hole Punching Timer Values
In general, UDP mapping timer values are not standardized, so they could be different for each NAT vendor. For the remote management of devices behind a NAT from a public network, the DMS should make the user device send a UDP packet periodically before the UDP hole is closed. In other words, the device should punch the UDP hole periodically at the time interval configured by the DMS. Note that searching for the UDP mapping time could cause a large amount of overload on the DMS in a large-scale network because the DMS must send many probe packets with the estimated timeout values for each NAT box. As such, we propose a heuristic method that maintains a list of the top 10 UDP mapping times statistically obtained through experiments. Then, we applied the binary search algorithm to the list of the top 10 known timer values. That is, a DMS uses the binary search algorithm to find the UDP mapping time in the list and then to search for it between two items. Table 2 summarizes four kinds of applicable search methods. The experimental results are explained in the next section.

Table 2. Methods for searching for the UDP hole timer values of a NAT (EV: expected value; IV: initial value; TE: tolerable error; PEV: previous expected value).
• Linear search — Concept: search for the UDP mapping time with a linearly increasing value of EV. Algorithm: 1) EV = IV + TE; 2) wait for EV and send an SNMP command.
• Slow start search — Concept: similar to TCP congestion control; search for the UDP mapping time by increasing EV exponentially, but after a failure perform a linear increase. Algorithm: 1) EV = IV + PEV*2; 2) wait until EV and send an SNMP command; 3) if success, go to 1); if failure, do a linear search.
• Binary search — Concept: use the binary search method. Algorithm: 1) EV = (MinVal + MaxVal)/2; 2) wait until EV and send an SNMP command; 3) if success, MinVal = EV and go to 1); if failure, MaxVal = EV and go to 1).
• TopN binary search — Concept: a limited number of NAT vendors are deployed in the real environment; maintain a TopN list of UDP hole times; based on the TopN list, first perform a binary search between entries. Algorithm: 1) binary search in the TopN list; 2) binary search between TopN(i) and TopN(i+1) until the difference is within TE.
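A sketch of the search strategies in Table 2, including the TopN refinement, is shown below. The probe() callback, the tolerance, and the Top-10 list of timer values are assumptions made for illustration; the real DMS would issue SNMP commands after the corresponding idle period instead of evaluating a simulated function.

def binary_search_timer(probe, low, high, tolerance=5):
    """Classic bisection on the UDP hole lifetime."""
    while high - low > tolerance:
        mid = (low + high) / 2
        if probe(mid):
            low = mid          # hole still open after mid seconds of silence
        else:
            high = mid         # hole already closed
    return low

def topn_binary_search(probe, top_n, tolerance=5):
    """Bisect over the list of commonly observed timer values first,
    then bisect between the two neighbouring list entries.
    Assumes the true timer is at least the smallest list entry."""
    values = sorted(top_n)
    lo, hi = 0, len(values) - 1
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if probe(values[mid]):
            lo = mid
        else:
            hi = mid - 1
    lower = values[lo]
    upper = values[lo + 1] if lo + 1 < len(values) else lower * 2
    return binary_search_timer(probe, lower, upper, tolerance)

# Example with a simulated NAT whose hole lasts 200 s:
simulated_probe = lambda t: t <= 200
common_timers = [30, 60, 90, 120, 180, 240, 300, 600, 900, 1800]  # assumed Top-10 list
print(topn_binary_search(simulated_probe, common_timers))

With the simulated 200-second hole, the TopN variant needs only a handful of probes because the first bisection runs over ten known values rather than a continuous range.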
Experimental Results
To evaluate the proposed method, we implemented an SNMP manager and defined the client behavior. With the implementation, we tested our method in a nationwide high-speed broadband network, as shown in Fig. 3. Of 1177 manageable VoIP devices, 194 hosts were found to be NATed in our network; thus, on average 17 percent of end hosts are NATed. Based on these hosts, we randomly selected 22 devices and tested our proposed method. The reason we chose only a small number of devices is that we had to carefully test the minimum number of devices so as not to affect customer service if we sent command messages repeatedly. The SNMP manager was implemented based on University of California, Davis (UCD) SNMP version 4.2.6,5 and can send and receive SNMP messages simultaneously using one port, UDP 162. We also embedded the SNMP agent implementing our proposed method into the device.
Table 3 shows the results of different methods of searching for the UDP hole time. Our heuristic, based on the binary search method, showed the best performance in the experiments. Compared with the popular binary search method, our heuristic could reduce search times by 26 percent, as well as reduce the average number of probes by 0.6. Table 4 shows the command success ratio of our proposed method. We achieved a 99 percent success rate of SNMP command penetration to the NATed devices. This result provides compelling evidence that it is possible to manage a device using private IP addresses without any additional servers or equipment. It also shows that the TopN binary search heuristic is useful for the efficient management of NATed hosts. There might be an argument with our NAT traversal success rate of 99 percent when compared with the well-known result of 80 percent in [12].
5 The current release version of UCD SNMP is NET-SNMP 5.4.1; http://net-snmp.sourceforge.net
Table 3. Average time of searching UDP hole timer values for different NATs.
Method / Device / Test / Average number of probes / Average time (s)
Linear search / 22 / 196 / 25.6 / 4608
Slow start search / 22 / 275 / 19.2 / 2984
Binary search / 22 / 488 / 2.9 / 470
TopN binary search / 22 / 541 / 2.3 / 348
Table 4. SNMP command success ratio for NATed devices.
Hole punching method / Searching hole timer / Device / Test / Success / Success ratio (%)
No hole punching / — / 22 / NA / NA / 0
Hole punching / Fixed timer of 180 s / 22 / 2160 / 1636 / 75
Hole punching / TopN binary search / 19 / 1221 / 1207 / 99
First, we think that the small number of tested devices (22) could be contributing to the high success ratio of passing through NATs, compared to the experiment in [12] with 40 different kinds of NATs. We could not perform experiments with a large number of subscribers because such testing should not affect ongoing service. It also is possible that the failure rate of 1 percent is due to multiple NATs or abnormal NATs.
Comparison of DM Approaches under NAT
In this section, we present a brief comparative discussion of various approaches to realizing remote device management functions under NATs, as summarized in Table 5. Using STUN servers requires an additional server, as well as client modules to support the mechanism, thus demanding more overhead in the aspects of implementation complexity, scalability, fault tolerance, security, deployment cost, command response time, system load, and compatibility. DHCP and UPnP are available only in environments where NATs are compatible with TR-111. Moreover, as mentioned before, CDM still has no standardization result. However, our proposal based on SNMP shows manageable advantages when compared with the other methods.

Table 5. Comparison of device management methods under NAT (TR-111 using STUN, DHCP, or UPnP; CDM; and our proposal).
• Implementation complexity: Our proposal employs a NAT traversal mechanism by slightly changing SNMP-based DM software. However, STUN requires an additional dedicated server and a client in addition to the DM software.
• Scalability: The periodic hole punching method, which increases the number of in-flight packets in proportion to the number of devices, may cause a scalability problem like STUN. However, our proposal uses fewer and smaller packets than STUN.
• Fault tolerance: Our proposal has fewer points of failure affecting the overall availability of the system, whereas TR-111 with STUN needs an additional STUN server and client for NAT traversal.
• Security: Our proposal may be affected by source address spoofing attacks, like DHCP and UPnP.
• Deployment cost: Except for slightly changing the SNMP trap mechanism, our proposal does not require an additional deployment cost. On the other hand, for TR-111, STUN needs dedicated clients and servers, and DHCP or UPnP functions would have to be deployed on all NATs.
• Command response time: Our proposal could reduce the command response time by estimating the UDP hole timer values correctly. However, STUN might experience a long command response time because of additional intermediate nodes such as STUN servers and clients.
• System load: Our proposal results in less server load to maintain the NAT traversal mechanism, using simple UDP packets. However, STUN uses a somewhat more complex mechanism, exchanging many UDP and TCP packets for NAT traversal.
• Compatibility: Our proposal is compatible with all NAT environments, including symmetric NATs. However, STUN is an independent standard, separate from device management, and needs additional consideration for compatibility with the legacy system.
Conclusion and Future Work
As the number of NATs deployed in the broadband network grows, more and more IP devices will be hidden behind a NAT. Therefore, it is necessary for a DMS to find a connection request mechanism for NATed devices so that it can exchange messages through an unknown NAT. When we apply the known NAT traversal solutions to the real environment, we can meet new challenges, such as expensive maintenance costs and symmetric NAT problems. The problem of expensive maintenance costs is related to the additional servers that must be deployed, or the NATs that must be upgraded to support a remote connection request mechanism. The problem of symmetric NATs is that the NAT traversal mechanism must work under all different kinds of NATs, including symmetric NATs. In this article, we presented a simple overview of early standardization efforts and have proposed an effective remote SNMP connection request mechanism for NATed devices using the UDP hole punching method. By slightly modifying the behaviors of the SNMP manager and the agent, and by defining additional management objects to gather NAT binding information, we solved the cost problem and symmetric NAT issue. In addition, we proposed an enhanced method to efficiently determine the binding time of the UDP holes of the NAT box. For the experimental evaluation, we applied our method to 22 VoIP devices behind NATs in the real environment and achieved a success ratio of 99 percent in exchanging SNMP request messages and a 26 percent enhancement in determining the UDP hole binding time. Even though the proposed DM protocol is to be changed to SNMP v3 in the future, we believe that this would necessitate only a slight change in our scheme.
Figure 3. Test environment for estimating the UDP hole time in a commercial broadband network in Korea. The DMS is located in KT's backbone network (KORNET), and the managed VoIP phones (DMCs) reach it over Ethernet, xDSL, and FTTH access networks; 194 of the 1,177 VoIP phones are behind NATs.

There are some issues in our system. One is scalability: our system must cope with thousands of keep-alive SNMP trap messages per minute from thousands of devices. We are therefore working on a time-to-live (TTL)-based scheme to avoid keep-alive traffic that does not need to reach the DMS, in which periodic trap messages are sent with TTL = n, n being the smallest TTL required to punch the hole; this scheme will be discussed in future work. Another issue is the security weakness inherited from SNMPv2. To address it, we are considering changing the DM protocol in the near future to one of the standards-based secure DM protocols, such as SNMPv3, OMA-DM, or TR-069. In future work, we are also going to evaluate and analyze results from the real environment in our large-scale VoIP and IPTV network, which will be widely deployed this year and reach over a million devices.
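As a small illustration of the TTL-limited keep-alive idea sketched above (this is not the authors' implementation), a device behind a NAT could refresh its UDP binding with periodic packets whose IP TTL is capped so that they cross the NAT but expire before reaching the management server. The DMS address, TTL value, and refresh period below are placeholders.

```python
import socket
import time

DMS_ADDR = ("192.0.2.50", 162)   # placeholder DMS address, SNMP trap port
HOLE_TTL = 2                     # assumed smallest TTL that still crosses the NAT
REFRESH_PERIOD = 60              # seconds; should stay below the estimated hole timeout

def keepalive_loop(local_port=16200):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", local_port))
    # Limit how far the keep-alive travels: it only needs to cross the NAT.
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TTL, HOLE_TTL)
    while True:
        sock.sendto(b"keepalive", DMS_ADDR)   # refreshes the UDP hole
        time.sleep(REFRESH_PERIOD)
```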
Acknowledgment
This work was partly supported by the IT R&D program of MKE/IITA (2008-F-016-01, Collect, Analyze, and Share for Future Internet) and partly by the ITRC (Information Technology Research Center) support program of MKE/IITA (IITA-2008-C1090-0801-0016). The corresponding author is Youngseok Lee.
Biographies
CHOONGUL PARK ([email protected]) received B.S. and M.S. degrees in computer engineering in 2001 from Pusan National University and in 2008 from Chungnam National University, Korea, respectively. Currently, he is a Ph.D. student at Chungnam National University. He joined KT Technology Laboratory in 2002 and started his research work on the Next Generation OSS project. Since 2005 he has been a member of the KT Device Management project and a senior researcher in the Department of Next Generation Network Research. His research interests include device management and traffic engineering in the next-generation Internet.

KITAE JEONG ([email protected]) received B.S. and M.S. degrees in 1983 and 1986 in electronic engineering from Kyungpook National University, and a Ph.D. from Tohoku University of Japan in 1996. He joined KT Laboratory in 1986, and is the leader of the Department of Next Generation Network Research. His research interests are in the fields of device management, next-generation networks, and fiber to the home.

SUNGIL KIM ([email protected]) received B.S. and M.S. degrees in 1992 and 1994 in computer engineering from Choongbuk National University. He joined KT Technology Laboratory in 1994, and is the leader of the KT Device Management project and delegate to the Broadband Convergence Network Standardization Group. His research interests are in the fields of device management and next-generation networks.

YOUNGSEOK LEE [SM] ([email protected]) received B.S., M.S., and Ph.D. degrees in 1995, 1997, and 2002, respectively, all in computer engineering, from Seoul National University, Korea. He was a visiting scholar at the Networks Lab at the University of California, Davis from October 2002 to July 2003. In July 2003 he joined the Department of Computer Engineering, Chungnam National University. His research interests include Internet traffic measurement and analysis, traffic engineering in the next-generation Internet, wireless mesh networks, and wireless LAN.
Improving the Performance of Route Control Middleboxes in a Competitive Environment

Marcelo Yannuzzi, Xavi Masip-Bruin, Eva Marin-Tordera, Jordi Domingo-Pascual, Technical University of Catalonia
Alexandre Fonte, Polytechnic Institute of Castelo Branco
Edmundo Monteiro, University of Coimbra

Abstract
Multihomed subscribers are increasingly adopting intelligent route control solutions to optimize the cost and end-to-end performance of the traffic routed among the different links connecting their networks to the Internet. Until recently, IRC practices were not considered adverse, but new studies show that in a competitive environment, they can lead to persistent traffic oscillations, causing significant performance degradation rather than improvements. To cope with this, randomized IRC techniques were proposed. However, the proliferation of IRC products raises concerns, given that randomization becomes less effective as the number of interfering IRC systems increases. In this article, we present a more scalable route control strategy that can better support the foreseeable spread of IRC solutions. We show that by blending randomization with adaptive filtering techniques, it is possible to drastically reduce the interference between competing route controllers, and this can be achieved without penalizing the end-to-end traffic performance. In addition to the potential improvements in terms of scalability and performance, the route control strategy outlined here has various practical advantages. For instance, it does not require any kind of protocol or coordination between the competing IRC middleboxes, and it can be adopted readily today because the only requirement is a software upgrade of the available route controllers.

This work was partially funded by the European Commission through CONTENT under contract FP6-0384239.
Today, the vast majority of the communications on the Internet are between nodes located in non-transit (i.e., stub) networks. Stub networks are primarily composed of medium and large enterprise customers, universities, public administrations, content service providers (CSPs), and small Internet service providers (ISPs). These networks exploit a widespread practice called multihoming, which consists of using multiple external links to connect to different transit providers. By increasing their connectivity to the Internet, stub networks can potentially obtain several benefits, especially in terms of resilience, cost, and traffic performance [1]. These are described as potential benefits because multihoming per se cannot improve resilience, cost, or traffic performance. Accordingly, multihomed stub networks require additional mechanisms to achieve these improvements. In particular, when an automatic mechanism actively optimizes the cost and end-to-end performance of the traffic routed among the different links connecting a multihomed stub network to the Internet, it is referred to as intelligent route control (IRC).

During the last few years, IRC has attracted significant interest in both the research and the commercial fields. Several vendors are developing and offering IRC solutions [2–4] that are increasingly being adopted by multihomed stub networks. Most available IRC solutions follow the same principle, that is, they dynamically shift part of the egress traffic of a multihomed subscriber from one of its ISPs to another, using measurement-driven path switching techniques. IRC systems operate in relatively short timescales — even reaching switching frequencies on the order of a few seconds — allowing IRC users to balance cost and performance criteria according to the priority and requirements of their applications.

Despite these strengths, IRC practices have one major weakness: they try to achieve a set of local objectives individually, without considering the effects of their decisions on the performance of the network. Recently, it was discovered that in a competitive environment IRC systems can actually cause significant performance degradation rather than improvement. In [5], the authors show that persistent oscillations can occur when independent controllers become synchronized due to a considerable overlap in their measurement time windows. To avoid synchronization issues, the authors propose randomized IRC strategies and empirically show that the oscillations disappear after introducing a random component in the route control decision. It is important to note that although randomization offers a straightforward mechanism to mitigate the oscillations, it cannot guarantee global stability.
This issue raises concerns given the proliferation of IRC products because, as the number of interfering IRC systems increases, randomization becomes less effective, and hence the more likely it is that the oscillations reappear. In light of this, it is necessary to explore more scalable route control strategies that can safely support the foreseeable spread of IRC solutions.

In principle, two research approaches can be taken. On the one hand, the research community could formally study the stability properties of IRC practices and provide guidelines on how to design IRC systems with guaranteed stability. Unfortunately, several challenging stages must be completed properly before a formal study of stability can be conducted. For instance, accurate measurements are required to comprehensively understand the actions of the closed-source IRC systems deployed today (e.g., [2–4]) and thereby model the stochastic distribution of path switches in a competitive IRC environment. Only after characterizing the distribution of path switches is it possible to formally study the stability aspects of competitive IRC.

In the absence of such characterization, the practical alternative is to find ways to drastically reduce the potential interference between competing route controllers without penalizing the end-to-end traffic performance. This is precisely the challenge addressed in this work. This article makes the following contributions:
• We show that although randomization offers a straightforward way to mitigate the oscillations, it leads to a large number of unnecessary path switches.
• We report some of our recent results on the development of strategies blending randomization with a lightweight and more "sociable" route-control algorithm. The term sociable route control (SRC) refers here to a route control strategy that explicitly considers the potential implications of its decisions in the performance of the network and can adaptively restrain its intrinsic selfishness depending on the network conditions.
• We show that a simple enhancement to randomized IRC systems, such as endowing them with an SRC algorithm supported by adaptive filtering techniques, is enough to drastically reduce the number of path switches, and most importantly, this can be accomplished without penalizing the end-to-end traffic performance. Extensive simulations show that with SRC, it is possible to reduce the overall number of path switches by approximately 40 to 80 percent on average (depending on the load on the network) and still obtain better end-to-end traffic performance than with randomized IRC techniques in a competitive environment.

The rest of the article is structured as follows. First, we present the basics of IRC. Then, we overview the most relevant related work. Next, we analyze some general aspects of different IRC strategies and describe the SRC approach together with some of our main results. We conclude with directions for future research in the area of IRC.

The Basics of IRC

Figure 1. The IRC model. IRC systems are composed of three modules: the monitoring and measurement module (MMM), the route control module (RCM), and a reporting and viewer module (RVM).

A typical IRC scenario with two different configurations is shown in Fig. 1. The IRC box at the top of Fig. 1 is connected by a span port off a router or switch, so although the egress traffic is controlled by the box, it is never forwarded through it. The IRC box in the multihomed network at the bottom of Fig. 1 is placed along the data path, so traffic is always forwarded through it. Typically, the former configuration offers a more scalable solution than the latter, in the sense that it is able to control and optimize a larger number of traffic flows.

Conceptually, an IRC system is composed of the following three modules (Fig. 1):
• Monitoring and measurement module (MMM)
• Route control module (RCM)
• Reporting and viewer module (RVM)

The existing IRC systems can control a moderately large number of flows (typically on the order of several hundreds and even thousands, using a configuration like the one shown at the top of Fig. 1 with several border routers) toward a set of target destination networks. These target destinations can be configured manually or discovered by means of passive measurements performed by the MMM. By using passive measurements, the MMM can rank the destinations according to the amount of traffic sourced from the local network and subsequently optimize the performance for the traffic toward the D destinations at the top of the rank. The MMM also uses passive measurements to monitor the target flows in real time and analyze packet losses, latency, and retransmissions, among others, as indicators of conformance or degradation of the expected traffic performance. To assist the RCM in the dynamic selection of the best egress link to reach each target destination, the MMM probes all the candidate paths using both Internet Control Message Protocol (ICMP) and Transmission Control Protocol (TCP) probes.

The set of active and passive measurements collected by the MMM enables IRC systems to concurrently assess the quality of the active and the alternative paths toward the target destinations. The role of the RCM is to dynamically choose the best egress link for each target flow, depending on the outcome of these measurements. More specifically, the RCM is capable of making rapid routing decisions for the target flows, often avoiding the effects of issues such as distant link/node failures (the timescale required by IRC systems to detect and react to a distant link/node failure is very small compared to that of the general IGP/BGP routing system [2–4, 6]) or performance degradation due to congestion (which cannot be automatically detected and avoided with BGP [7]).

The third module of an IRC system, namely the RVM, typically supports a broad set of reporting options and provides online information about the average latency, jitter, bandwidth utilization, and packet loss experienced through the different providers, summaries of traffic usage, associated costs for each provider, and so on.

Overall, IRC offers an incremental approach, complementing some of the key deficiencies of the Interior Gateway Protocol/Border Gateway Protocol (IGP/BGP)-based route control model. It is worth emphasizing that the set of candidate routes to be probed by IRC boxes usually is determined by IGP/BGP; so, unlike overlay networks [8], IRC boxes never circumvent IGP/BGP routing protocols. The effectiveness of multihoming in combination with IRC is confirmed not only by studies like [8], but also by the increasing trend in the deployment of these solutions. In this article we deal with the algorithmic aspects of IRC systems, so hereafter we focus our attention on the RCM in Fig. 1 — the functionality of the MMM and RVM modules is essentially orthogonal to the proposals made in this work.
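To make the passive ranking step above concrete, the following is a minimal sketch (not an API of any particular IRC product) of how an MMM-like component might rank destinations by observed traffic volume and keep the top D as optimization targets. The flow-record format is an assumption for illustration.

```python
from collections import Counter

def rank_targets(flow_records, top_d=25):
    """flow_records: iterable of (dst_prefix, bytes) tuples from passive monitoring."""
    volume = Counter()
    for dst_prefix, nbytes in flow_records:
        volume[dst_prefix] += nbytes
    return [dst for dst, _ in volume.most_common(top_d)]

# Example: four observed flow records, keep the two heaviest destinations.
records = [("203.0.113.0/24", 5_000_000), ("198.51.100.0/24", 1_200_000),
           ("203.0.113.0/24", 700_000), ("192.0.2.0/24", 300_000)]
print(rank_targets(records, top_d=2))   # ['203.0.113.0/24', '198.51.100.0/24']
```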
Related Work
In [9], the authors simultaneously optimize the cost and performance for multihomed stub networks by introducing a series of new IRC algorithms. The contributions of that work are fundamentally theoretical. For instance, the authors show that an intelligent route controller can improve its own performance without adversely affecting other controllers in a competitive environment, but the conclusions are drawn at traffic equilibria (traffic equilibrium is defined by the authors as a state in which no traffic can improve its latency by unilaterally changing its link assignment). However, after examining and modeling the key features of conventional IRC systems, it becomes clear that they do not seek this type of traffic equilibria. Indeed, more recent studies, such as [5], show that in practice the performance penalties can be large, especially when the network utilization increases. In light of this, and considering the current deployment trend of IRC solutions, it becomes necessary to explore alternative IRC strategies. These new route control strategies should always improve the performance and reliability of the target flows, or at least they should drastically reduce the potential implications associated with frequent traffic relocations, such as persistent oscillations causing packet losses and increased packet delays [5].

Although most commercially available IRC solutions do not reveal in depth the technical details of their internal operation and route control decisions, the behavior of one particular controller is described in detail in [10]. That work also provides measurements that evaluate the effectiveness of different design decisions and load balancing algorithms. Akella et al. also provided rather detailed descriptions and experimental evaluations of multihoming in combination with IRC tools, as in [1, 8, 11]. These research publications, along with the documentation provided by vendors, allowed us to capture and model the key features of conventional IRC techniques. A similar approach was followed by the authors in [5]. For simplicity, and as in [5, 8, 10], we consider traffic performance as the only criterion to be optimized for the target flows (cost reductions are typically accomplished by aggregating traffic toward non-target destinations over the cheapest ISPs).
The General IRC Network Model
The general IRC network model is composed of a multihomed stub network S, a route controller C, the transit domains, and a set of target destinations {d} with cardinality |d| = D to be optimized by C. The source domain S has a set of egress links {e}, with |e| = E. For the sake of simplicity, we keep the notation at the granularity of destinations (d), but the model can easily be extended to consider various flows per target d. To dynamically decide the best egress link for each target destination d, the MMM in C probes all the candidate paths through the egress links e of S. Then, the collected measurements are processed and abstracted into a performance function Pe(d,t) at time t, associated with the quality perceived for each of the available paths toward the target destinations d. Let N(d) denote the number of available paths to reach d. Because N(d) usually represents the number of candidate paths in the forwarding information base (FIB) of the BGP border routers of S, N(d) ≤ E ∀ d. We assume that the better the end-to-end traffic performance perceived by C for a target destination d through egress link e, the lower the value of the performance function Pe(d,t).

In this framework, IRC strategies can be classified into two categories, namely, reactive route control (RRC) and proactive route control (PRC). RRC practices switch a target flow from one egress link to another only when a maximum tolerable threshold (MTT) is met. The MTTs are application-specific and typically represent the maximum acceptable packet loss, the maximum tolerated packet delay, and so on, for a given application. Beyond any of these bounds, the performance perceived by the users of the application becomes unacceptable. PRC strategies, on the other hand, switch traffic before any of the MTTs are met, and in turn can be classified into two categories: those that can be called fully proactive (FP), and those that follow a controlled proactivity (CP) approach.

FP IRC practices always switch to the best path. Therefore, the dynamic optimization problem addressed by an FP route controller is to find min{Pe(d,t)} ∀ d, t and enforce the redirection of the corresponding traffic to the egress link found. The alternative offered by CP is to keep the proactivity, but switch traffic as soon as the performance becomes degraded to some extent, typically represented by a relocation threshold (Rth). The dynamic optimization problem addressed by CP-based strategies can be formulated as follows. Let ebest denote the egress link utilized to reach d at time t, and let e′ be such that Pe′(d,t) = min{Pe(d,t)} for destination d at time t (note that with CP, ebest might be different from e′). A CP-based route controller would switch traffic to d from ebest to e′ whenever Pebest(d,t) – Pe′(d,t) ≥ Rth, with Rth > 0.
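As a worked example of the CP rule just stated (a sketch only, not the authors' implementation), the decision reduces to comparing the current link against the best alternative and switching only when the gap reaches Rth:

```python
# perf maps each egress link to P_e(d,t); lower values mean better performance.
def cp_should_switch(perf, e_best, r_th):
    e_prime = min(perf, key=perf.get)
    return (perf[e_best] - perf[e_prime] >= r_th), e_prime

# Example: current link scores 7, best alternative 4, Rth = 2 -> switch to isp2.
switch, e_prime = cp_should_switch({"isp1": 7, "isp2": 4, "isp3": 6}, "isp1", 2)
```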
After extensive evaluations and analysis, we confirmed that PRC performs much better than RRC. The reason for this is that proactive approaches can anticipate network congestion situations, whereas in the reactive case several traffic relocations are typically needed once congestion has already been reached. In addition, we found that in a competitive environment, CP-based route control strategies can outperform the FP ones. Therefore, our SRC algorithm (outlined in the following section) is supported by a CP-based route control strategy.

Figure 2. Filtering process and interaction between the monitoring and measurement module (MMM) and the route control module (RCM) of a sociable route controller. The Randomized SRC Algorithm within the RCM is outlined in Algorithm 1.
Sociable Route Control
In the SRC strategy that we conceive, each controller remains independent, so the SRC boxes do not require any kind of coordination with one another — just as conventional IRC systems operate today. Moreover, our SRC strategy does not introduce changes in the way measurements are conducted and reported by conventional IRC systems, so both the MMM and the RVM in Fig. 1 remain unmodified. Our SRC strategy introduces changes only in the algorithmic aspects of the RCM.
High-Level Description of the SRC Strategy
For simplicity in the exposition, we focus on the optimization of a single application, namely voice over IP (VoIP), and we describe the overall SRC process for the round-trip time (RTT) performance metric. For a comprehensive and formal analysis, the reader is referred to [12]. Our goal is that a controller C becomes capable of adaptively adjusting its proactivity, depending on the RTT conditions for each target destination d. To be precise, a sociable controller analyzes the evolution of the RTT, that is, {RTTe(d,t)}, and depending on its dynamics, the controller can restrain its traffic reassignments (i.e., its proactivity) adaptively. To this end, the RCM processes the RTT samples gathered from the MMM using two filters in cascade (Fig. 2).

The first filter corresponds to the median RTT, Me(d,t), which is constantly computed through a sliding window. This approach is widely used in practice because the median represents a good estimator of the delay that the users' applications are currently experiencing in the network. These medians are precisely the input to the second filter, where the social nature of the route control algorithm covers two different facets:
• CP
• SRC
Controlled Proactivity
On the one hand, the proactivity of box C is controlled to avoid minor changes in the medians triggering traffic relocations at S. This prevents interfering too often with other route controllers. For this reason, our sociable controllers filter the medians. The second filter in Fig. 2 works like an analog-to-digital (A/D) converter, with quantization step ∆, and its output is one of the levels of the converter, Qe(d,t). The right-hand side of Fig. 2 illustrates how the instantaneous samples of RTT are filtered to obtain the median Me(d,t), and then the latter is filtered to obtain Qe(d,t). As described earlier, IRC systems compare the quality of the active and alternative paths by means of a performance function Pe(d,t), which, as shown in Fig. 2, is fed by Qe(d,t). The controller C would switch traffic toward d only when the variations of Qe(d,t) cause Pebest(d,t) – Pe(d,t) ≥ Rth.

A more detailed description of the route selection process is shown in Algorithm 1. For simplicity, only the stationary operation of the algorithm is summarized. The randomized nature of Algorithm 1 is discussed later, and the hysteresis switching timer TH used in it is also introduced later. For the RCM described here, we simply used the outcome of the digital conversion as the performance function Pe(d,t), that is, the number of quantization steps in the quantification level Qe(d,t). Similarly, Rth represents the difference in the number of quantization steps that Pe(d,t) must reach to trigger a path switch. Overall, the advantage of this filtering technique is that it produces the desired effect (i.e., controlled proactivity) because it prevents minor changes in the medians from triggering unnecessary traffic relocations at S.
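The two cascaded filters can be sketched in a few lines of code. This is only an illustration under assumed parameter values (window length and quantization step are placeholders), not the authors' implementation.

```python
from collections import deque
from statistics import median

class CascadedFilter:
    def __init__(self, window=20, delta_ms=10.0):
        self.samples = deque(maxlen=window)   # sliding window of RTT samples
        self.delta = delta_ms                 # quantization step (ms)

    def update(self, rtt_ms):
        self.samples.append(rtt_ms)
        m = median(self.samples)              # first filter: median RTT
        q = int(m // self.delta)              # second filter: quantization level
        return m, q

# The controller can use q directly as P_e(d,t); small fluctuations of the
# median leave q, and hence the performance function, unchanged.
f = CascadedFilter(window=20, delta_ms=10.0)
for rtt in (52.0, 55.0, 49.0, 61.0):
    m, q = f.update(rtt)
```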
Input:
  d – a target destination of network S
  {e} – the set of egress links of network S
  Pe(d,t) – performance function to reach d through e at time t
Output:
  ebest – the best egress link to reach target destination d

1:  Wait for changes in Pebest(d,t)
2:  if Pebest(d,t) – Pe(d,t) < Rth ∀ e ≠ ebest then go to Step 1
3:  /* Egress link selection process for d */
4:  Choose e′ as Pe′(d,t) = min{Pe(d,t)}
5:  Estimate the performance after switching the traffic
6:  if Pebest(d,t) – Pe′(d,t)|Estimate ≥ Rth then
7:      Wait until TH = 0  /* Hysteresis Switching Timer */
8:      Switch traffic toward d from ebest to e′
9:      ebest ← e′
10:     Pebest(d,t) ← Pe′(d,t)
11: end if
12: /* End of egress link selection process for d */
13: Go to Step 1

Algorithm 1. Randomized SRC algorithm.
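The following Python sketch mirrors the stationary operation of Algorithm 1, with the random hysteresis timer TH modeled as a random hold-off interval. It simplifies Step 5 by using the currently measured value of the candidate link as the post-switch estimate, and all parameter values are illustrative rather than taken from the article.

```python
import random
import time

class RandomizedSRController:
    def __init__(self, e_best, r_th=2, max_hysteresis_s=30.0):
        self.e_best = e_best
        self.r_th = r_th
        self.max_hysteresis_s = max_hysteresis_s
        self.next_switch_allowed = 0.0        # time at which TH reaches zero

    def on_performance_update(self, perf):
        """perf: dict egress link -> P_e(d,t); returns the (possibly new) best link."""
        e_prime = min(perf, key=perf.get)                    # step 4
        if perf[self.e_best] - perf[e_prime] < self.r_th:    # steps 2 and 6
            return self.e_best
        if time.monotonic() < self.next_switch_allowed:      # step 7: TH > 0
            return self.e_best
        self.switch_traffic(self.e_best, e_prime)            # step 8
        self.e_best = e_prime                                 # steps 9-10
        # Re-arm the random hysteresis timer after each relocation.
        self.next_switch_allowed = time.monotonic() + random.uniform(
            0.0, self.max_hysteresis_s)
        return self.e_best

    def switch_traffic(self, old, new):
        # Placeholder for enforcing the routing decision (e.g., policy routing).
        pass
```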
Socialized Route Control
The second facet of the social behavior of the algorithm relates to the dynamics of the median RTTs; more precisely, to how rapid the variations are in the median values that are typically computed by IRC systems using a sliding window. The motivation for this is that when the median values start to show rather quick variations, the algorithm must react so as to avoid a large number of traffic reassignments in a short timescale. Such RTT dynamics typically occur when several route controllers compete for the same resources, leading to situations where their traffic reassignments interfere with each other. To cope with this problem, we turn the second filter in Fig. 2 into an adaptive filter. This filter is endowed with an adaptive quantization step ∆(d,t) for each target destination d that is automatically adjusted by the algorithm according to the evolution of the median RTTs. If the RTT conditions are smooth, the quantization step is small, and more proactivity is allowed by the controller C. However, if the RTT conditions could lead to instability, the quantization step ∆(d,t) automatically increases, so the number of changes in the values of Qe(d,t) is diminished or even stopped until the network conditions become smooth again. This has the effect of desynchronizing only the competing route controllers. Therefore, the filtering technique outlined here allows a controller C to "sociably" decide whether or not to switch traffic to an alternative egress link, in the sense that the degree of proactivity of C is constantly adjusted by the adaptive nature of the second filter.

For the sake of simplicity, we focused here on the optimization of a single performance metric (the RTT), but the concept of SRC is general and can be extended to consider other metrics, such as available bandwidth, packet losses, and jitter. When multiple metrics are used, two straightforward approaches can be followed. On the one hand, a combination of two or more metrics can be used in the same performance function Pe(d,t). For instance, [12] introduces a more general performance function based on a non-linear combination of the quantification level Qe(d,t) and the available bandwidth (AB) in the egress links of the source network. This, in turn, can be extended to consider the AB along the entire path to a target destination d, using available bandwidth estimation techniques like the one described in [5]. With this approach, the weights of the different metrics combined in Pe(d,t) can be tuned on an application basis, for example, to prioritize the role of the AB over the RTTs (or vice versa) depending on the application type.
On the other hand, multiple performance functions Pe(d,t) can be used (e.g., one for each metric), and the selection of the best path for each target destination can be performed by sequentially comparing the performance functions Pe(d,t) and tie-breaking similarly to the BGP tie-breaking rules [7]. With this approach, the order in which the performance functions are compared can be tuned on an application basis. For example, a controller might select the path with the maximum AB, and if there is more than one path with the same AB, choose the one with the lowest RTT. In either case, adaptive filtering techniques are required to prevent rapid variations in the performance metrics considered.
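The sequential, BGP-like tie-breaking just described can be illustrated with a short sketch; the per-path record format is an assumption made here for illustration only.

```python
def select_egress(paths):
    """paths: dict egress link -> {"ab": available bandwidth, "rtt": round-trip time}.

    Prefer the path with the largest AB; among ties, pick the lowest RTT
    (AB is negated so a single lexicographic sort key implements both rules).
    """
    return min(paths, key=lambda e: (-paths[e]["ab"], paths[e]["rtt"]))

best = select_egress({
    "isp1": {"ab": 80.0, "rtt": 45.0},
    "isp2": {"ab": 80.0, "rtt": 30.0},   # same AB, lower RTT -> selected
    "isp3": {"ab": 60.0, "rtt": 20.0},
})
```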
Randomization
Randomization is present in Algorithm 1 in two different ways: implicitly and explicitly. On the one hand, the route control decisions in Algorithm 1 are inherently stochastic for a number of reasons, for example, due to its adaptive features over time, the fact that different controllers might have configured different thresholds Rth, and others. On the other hand, we explicitly use a hysteresis switching timer TH that we introduced in a previous work [13] and that guarantees a random hysteresis period after each traffic relocation. More precisely, traffic toward a given destination d cannot be relocated until the random and decreasing timer TH reaches zero. A similar approach was used in [5] for one of the randomized algorithms presented there.
Performance Evaluation
The performance of our SRC strategy is compared against that obtained with:
• Randomized IRC
• Default IGP/BGP routing
Evaluation Methodology and Simulation Set Up
The simulation tests were performed using the event-driven simulator J-Sim [14]. All the functionalities of the route controllers were developed on top of the IGP/BGP implementations available in this platform.

Network Topology — The network topology was built using the Boston University Representative Internet Topology gEnerator (BRITE) [15]. The topology was generated using the Waxman model with (α, β) set to (0.15, 0.2) [16], and it was composed of 100 domains with a ratio of domains to inter-domain links of 1:3. This simulated network aims at representing a set of ISPs that can provide connectivity and reachability to customers operating stub networks. We assume that all ISPs operate points of presence (PoPs) through which the stub networks are connected. We considered 12 uniformly distributed stub networks across the domain-level topology as the traffic sources toward the set of target destinations. These source networks are connected to the routers located at the PoPs of three different ISPs. We considered triple-homed stub networks given that significant performance improvements are not expected from higher degrees of multihoming [1]. For the stub networks containing target destinations, we considered 25 uniformly distributed destinations across the domain-level topology. This offers an emulation of 12 × 25 = 300 IRC flows competing for the same network resources during the simulation run time. Furthermore, given that IRC solutions operate in short timescales, we assumed that the domain-level topology remains invariant during the simulation run time.
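For readers unfamiliar with the Waxman model, the sketch below generates a random topology in its spirit: nodes are placed uniformly in a unit square and an edge (u, v) is added with probability α·exp(−d(u, v)/(β·L)). This is our own illustrative implementation, not BRITE, and the mapping of (α, β) to BRITE's parameters is an assumption.

```python
import math
import random

def waxman_topology(n=100, alpha=0.15, beta=0.2, seed=1):
    rng = random.Random(seed)
    pos = [(rng.random(), rng.random()) for _ in range(n)]
    L = math.sqrt(2.0)   # maximum possible distance in the unit square
    edges = []
    for u in range(n):
        for v in range(u + 1, n):
            d = math.dist(pos[u], pos[v])
            if rng.random() < alpha * math.exp(-d / (beta * L)):
                edges.append((u, v))
    return edges
```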
Figure 3. Number of path switches (top) and average RTTs (bottom) for L = 0.450 (left), L = 0.675 (center), and L = 0.900 (right).
Simulation Scenarios — We ran the same simulations separately using three different scenarios:
• Default IGP/BGP routing, where BGP routers choose their best routes based on the shortest AS-path
• BGP combined with the SRC strategy at the 12 source domains
• BGP combined with randomized IRC systems at the 12 source domains
For a more comprehensive comparison between the different route control strategies, we performed the simulations for three different network loads. We considered the following load factors (L):
• L = 0.450, low load, corresponding to an average occupancy of 45 percent of the egress links' capacity
• L = 0.675, medium load, corresponding to an average occupancy of 67.5 percent of the egress links' capacity
• L = 0.900, high load, corresponding to an average occupancy of 90 percent of the egress links' capacity

Simulation Conditions — The simulation tests were conducted using traffic aggregates sent from the source domains to each target destination d. These traffic aggregates were composed of a variable number of multiplexed Pareto flows as a way to generate the traffic demands, as well as to control the network load during the tests. The flow arrivals were modeled according to a Poisson process and were independently and uniformly distributed during the simulation run time. This approach aims at generating sufficient traffic variability to support the assessment of the different route control strategies. In addition, we used the following method to generate traffic demands for the remaining Internet traffic, usually referred to as background traffic. We started by randomly picking four nodes in the network. The first one chosen acts as the origin (O) node, and the remaining three nodes act as destinations (D) of the background traffic. We assigned one Pareto flow for each O-D pair. This process continues until all the nodes are assigned three outgoing flows (including those in the multihomed stub domains and those in the ISPs). All background connections were active during the simulation run time. Furthermore, the frequency and size of the probes sent by the route controllers were correlated with the outbound traffic being controlled, just as conventional route controllers do today [2–4]. Finally, we assume that the route controllers have pre-established performance bounds (i.e., the MTTs) for the traffic under control. For instance, recommendation G.114 of the International Telecommunication Union–Telecommunication Standardization Sector (ITU-T) suggests a one-way delay (OWD) bound of 150 milliseconds to maintain high-quality VoIP communication over the Internet. Thus, for VoIP traffic, the maximum RTT tolerated was chosen as twice this OWD bound, that is, 300 ms.
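The traffic-demand generation described above (Poisson flow arrivals, Pareto flow sizes) can be sketched as follows. The rate, shape, and minimum-size values are placeholders; the article does not give the exact parameters used in J-Sim.

```python
import random

def generate_flows(duration_s, arrival_rate=0.5, pareto_shape=1.5,
                   min_size_kb=64, seed=1):
    """Return a list of (start_time_s, size_kb) tuples for one traffic aggregate."""
    rng = random.Random(seed)
    flows, t = [], 0.0
    while True:
        t += rng.expovariate(arrival_rate)          # Poisson arrivals
        if t >= duration_s:
            break
        size = min_size_kb * rng.paretovariate(pareto_shape)   # heavy-tailed sizes
        flows.append((t, size))
    return flows
```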
Objectives of the Performance Evaluation
Our evaluations have two main objectives.

Assess the Number of Path Switches — The first objective of the simulation study is to demonstrate that the sociable nature of our SRC strategy contributes to drastically reducing the potential interference between competing route controllers. To this end, we compared the number of path switches that occurred during the simulation run time for the 300 competing IRC flows in the SRC and randomized IRC scenarios. The number of path switches is obtained by adding the number of route changes required to meet the desired RTT bound for each target destination d. It is worth emphasizing that in both the randomized IRC and SRC strategies, the route controllers operate independently and compete for the same network resources. This allows us to evaluate the overall impact on the traffic caused by the interference between several standalone route controllers running at different stub domains. Thus, when analyzing the results for the different route control strategies, it is important to keep in mind that we take into account all the competing route controllers present in the network.

To contrast the number of path switches under fair conditions, we made the following decisions. First, both the randomized IRC and SRC controllers are endowed with the same (explicit) randomization technique [5, 13]. This approach avoids the appearance of persistent oscillations that might lead to a large number of path switches in the case of conventional IRC [5]. Second, both types of controllers follow a controlled proactivity approach, and we conducted the simulations modeling the same triggering condition Rth for both of them. The main difference is that in the SRC case, the social adaptability of the controllers can result in the trigger being reached more often, or less often, depending on the variability of the RTTs on the network.
End-to-End Traffic Performance — The second objective of the simulation study is to demonstrate that the drastic reduction in the number of path switches obtained with our SRC strategy can be achieved without penalizing the end-to-end traffic performance. To this end, we compared the RTTs obtained for the 300 flows in the three different scenarios, namely, default IGP/BGP, SRC, and randomized IRC.

Main Results

The top of Fig. 3 illustrates the total number of path switches performed by both the randomized IRC and SRC strategies, in all the stub networks, and for the three different load factors: L = 0.450 (left), L = 0.675 (center), and L = 0.900 (right). The number of path switches is contrasted for different triggering conditions, that is, for different values of the threshold Rth (shown on a logarithmic scale). Several conclusions can be drawn from the results shown in Fig. 3. In the first place, the results confirm that SRC drastically reduces the number of path switches compared to a randomized IRC technique (clearly, no results are shown for the default IGP/BGP routing scenario here because BGP does not perform path switching actively). An important result is that the reductions are significant for all the load factors assessed. For instance, when compared with randomized IRC, our SRC strategy contributes to reductions of up to:
• 77 percent for Rth = 1 and 71 percent for Rth = 2 when L = 0.450
• 75 percent for Rth = 1 and 74 percent for Rth = 2 when L = 0.675
• 34 percent for Rth = 1 and 36 percent for Rth = 2 when L = 0.900

The second observation is that the reductions in the number of path switches offered by the SRC strategy become more and more evident as the proactivity of the controllers increases, that is, for low values of Rth, which is precisely the region where IRC solutions operate today. It is worth recalling that these results were obtained when both route control strategies were complemented by the same randomized decisions. This confirms that in a competitive environment, SRC is much more effective than pure randomization in reducing the potential interference between route controllers. On the other hand, our results show that when the route control strategies become less proactive, that is, for higher values of Rth, randomized IRC and SRC tend to behave comparatively the same, so SRC does not introduce any benefit over a randomized IRC technique.

To assess the effectiveness of SRC, it is mandatory to confirm that the reductions obtained in the number of path switches are not excessive, resulting in a negative impact on the end-to-end traffic performance. To this end, we first analyze the performance of randomized IRC and our SRC "globally," that is, by averaging the RTTs obtained by "all" competing route controllers. This is shown at the bottom of Fig. 3 and in Fig. 4. The end-to-end performance obtained by "each" route controller individually is shown in Fig. 5. The bottom of Fig. 3 reveals that, as expected, both SRC and randomized IRC perform much better than IGP/BGP for all values of L and Rth, and the improvements in the achieved performance become more evident as the network utilization increases. In particular, SRC is capable of improving the 〈RTTs〉 (this average is computed over the RTTs obtained by all competing route controllers in the network) by more than 40 percent for L = 0.675 and by more than 35 percent for L = 0.900 when compared with IGP/BGP. Moreover, the 〈RTTs〉 obtained by SRC and IRC are comparatively the same, and particularly for L = 0.675, SRC not only drastically reduces the number of path switches, but also improves the end-to-end performance for almost all the triggering conditions assessed. It is worth emphasizing that a low value of Rth together with a load factor of L = 0.675 reasonably reflects the conditions in which IRC currently operates in the Internet. Our results also reveal an important aspect: by allowing more path switches, some route controllers can improve their end-to-end performance slightly, but such actions have no major effect on the overall 〈RTTs〉. Indeed, a certain number of path switches is always required, and this number of path switches is what actually ensures the average performance observed in the RTTs at the bottom of Fig. 3 (this becomes clear as the proactivity decreases).

By analyzing Fig. 3 as a whole, it becomes evident that the selection of the best triggering condition actually depends on the load present in the network. For this particular case, the best trade-offs are Rth = 30 for L = 0.450, Rth = 10 for L = 0.675, and Rth = 7 for L = 0.900, which is a reasonable progression to lower values of Rth because the route controllers require less proactivity when the network utilization is low. The corollary is that the triggering condition should be adaptively adjusted as well, depending on the amount of traffic carried through the egress links of the domain. We plan to investigate this in the future.

Figure 4. Complementary cumulative distribution function (CCDF) of the RTTs for the 300 competing IRC flows, for Rth = 1, and for L = 0.450 (left), L = 0.675 (center), and L = 0.900 (right).

Figure 4 compares the distribution of the RTTs obtained by IGP/BGP, SRC, and randomized IRC for the 300 competing IRC flows, for the three different load factors assessed, and for Rth = 1, which as mentioned above is in the range of operation of the IRC solutions presently deployed in the Internet. To facilitate the interpretation of the results, we use the complementary cumulative distribution function (CCDF). An important observation is that under high egress link utilization, that is, L = 0.900, there is a fraction of 〈RTTs〉 for which the bound of 300 ms is exceeded in the case of IGP/BGP, whereas both SRC and the randomized IRC fulfill the targeted bound.
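For completeness, the CCDF curves used in Figs. 4 and 5 can be computed from raw RTT samples as shown below; the sample values are made up purely for illustration.

```python
def ccdf(samples):
    """Return (x, P(sample >= x)) pairs for each distinct value in samples."""
    xs = sorted(set(samples))
    n = len(samples)
    return [(x, sum(1 for s in samples if s >= x) / n) for x in xs]

rtts_ms = [42.0, 55.0, 55.0, 61.0, 120.0, 240.0]
for x, p in ccdf(rtts_ms):
    print(f"P(RTT >= {x:.0f} ms) = {p:.2f}")
```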
Figure 5. CCDFs for IGP/BGP routing (top), SRC (center), and randomized IRC (bottom), for L = 0.450 (left), L = 0.675 (center), and L = 0.900 (right).

To complete the analysis, Fig. 5 provides a more granular picture than Fig. 4 because it shows the CCDFs of the RTTs obtained by each of the 12 competing route controllers. The figure shows the results for the three studied scenarios and for all the load factors assessed when Rth = 1. Our results show that the targeted bound of 300 ms is satisfied by both SRC and randomized IRC in all cases and for all controllers. IGP/BGP, however, shows a distribution of large delays, given that the shortest AS-paths are not necessarily the best performing paths. Figure 5 also shows that when considering boxes individually, randomized IRC achieves slightly better end-to-end performance for some of them, but at the price of a much larger number of path switches when Rth = 1:
• approximately 435 percent larger for L = 0.450
• approximately 400 percent larger for L = 0.675
• approximately 80 percent larger for L = 0.900
Conclusion
In this article, we examined the strengths and weaknesses of randomized IRC techniques in a competitive environment. We proposed a way to blend randomization with a sociable route control (SRC) strategy, where by sociable we mean a route control strategy that explicitly considers the potential implications of its decisions in the performance of the network and has the ability to adaptively restrain its intrinsic selfishness depending on the network conditions. We have shown that in a competitive scenario, our SRC strategy is capable of drastically reducing the potential interference between controllers without penalizing the end-to-end traffic performance. This makes SRC more scalable and promising than pure randomization, given the proliferation of IRC systems in the Internet. SRC strategies, like the one described in this article, also have a number of practical advantages; for example, they do not require any kind of coordination between the competing IRC boxes, and they can be supported by a lightweight software implementation based on well-known filtering techniques, with no additional requirements for adoption other than a software upgrade of existing IRC systems.

Among the open issues in the area, the most important is the lack of a stochastic model characterizing the distribution of path switches in a competitive environment. Studies like [5] have shown that randomized techniques are effective in desynchronizing some route controllers when their measurement windows are sufficiently overlapped; however, they cannot guarantee stability. Only after characterizing the distribution of path switches will it be possible to formally study the local and global stability aspects of competitive IRC. Furthermore, the proposals and results described here apply to the optimization of VoIP traffic, but the concept of blending randomization with an SRC strategy is general in scope, so our work can be extended to control other kinds of traffic flows concurrently, as well as to consider other performance metrics besides the RTT.
References
[1] A. Akella et al., "A Measurement-Based Analysis of Multihoming," Proc. ACM SIGCOMM, Karlsruhe, Germany, Aug. 2003.
[2] Avaya, Inc., "Converged Network Analyzer."
[3] Cisco Systems, Inc., "Optimized Edge Routing."
[4] Internap Networks, Inc., "Flow Control Platform."
[5] R. Gao, C. Dovrolis, and E. W. Zegura, "Avoiding Oscillations Due to Intelligent Route Control Systems," Proc. IEEE INFOCOM 2006, Barcelona, Spain, Apr. 2006.
[6] C. Labovitz et al., "Delayed Internet Routing Convergence," Proc. ACM SIGCOMM, Stockholm, Sweden, Aug. 2000.
[7] M. Yannuzzi, X. Masip-Bruin, and O. Bonaventure, "Open Issues in Interdomain Routing: A Survey," IEEE Network, vol. 19, no. 6, Nov.–Dec. 2005, pp. 49–56.
[8] A. Akella et al., "A Comparison of Overlay Routing and Multihoming Route Control," Proc. ACM SIGCOMM, Portland, OR, Aug. 2004.
[9] D. K. Goldenberg et al., "Optimizing Cost and Performance for Multihoming," Proc. ACM SIGCOMM, Portland, OR, Aug. 2004.
[10] F. Guo et al., "Experiences in Building a Multihoming Load Balancing System," Proc. IEEE INFOCOM '04, Hong Kong, China, Mar. 2004.
[11] A. Akella, S. Seshan, and A. Shaikh, "Multihoming Performance Benefits: An Experimental Evaluation of Practical Enterprise Strategies," USENIX Annual Technical Conf., Boston, MA, June 2004.
[12] M. Yannuzzi, "Strategies for Internet Route Control: Past, Present, and Future," Ph.D. dissertation, Technical University of Catalonia, Barcelona, Spain, 2007.
[13] M. Yannuzzi et al., "A Proposal for Inter-Domain QoS Routing Based on Distributed Overlay Entities and QBGP," Proc. QoFIS '04, LNCS 3266, Barcelona, Spain, Oct. 2004.
[14] J-Sim homepage; http://www.j-sim.org
[15] A. Medina et al., "BRITE: An Approach to Universal Topology Generation," Proc. MASCOTS, Aug. 2001.
[16] B. Waxman, "Routing of Multipoint Connections," IEEE JSAC, Dec. 1988.
Biographies
MARCELO YANNUZZI ([email protected]) received a degree in electrical engineering from the University of the Republic (UdelaR), Uruguay, in 2001, and DEA (M.Sc.) and Ph.D. degrees in computer science from the Department of Computer Architecture, Technical University of Catalonia (UPC), Spain, in 2005 and 2007, respectively. He is with the Advanced Network Architectures Lab at UPC, where he is an assistant professor. He held previous positions with the Physics Department of the School of Engineering, UdelaR, from 1997 to 2003, and with the Electrical Engineering Department of the same university from 2003 until 2006. He worked in industry for 10 years at the national telco in Uruguay (1993–2003).

XAVI MASIP-BRUIN ([email protected]) received M.S. and Ph.D. degrees from UPC, both in telecommunications engineering, in 1997 and 2003, respectively. He is currently an associate professor of computer science at UPC. His current research interests are in broadband communications, QoS management and provision, and traffic engineering. His publications include around 60 papers in national and international refereed journals and conferences. Since 2000 he has participated in many research projects: IST projects E-NEXT, NOBEL, and EuQoS; and Spanish research projects SABA, SABA2, SAM, and TRIPODE.

EVA MARIN-TORDERA ([email protected]) received M.S. degrees in physics in 1993 and electronic engineering in 1998, both from Barcelona University, and a Ph.D. from UPC in 2007, where she works as an assistant professor. She has published many papers in national and international conferences. Her main interests focus on QoS provisioning and optical networks. She is now actively participating in the BONE and DICONET international projects, and in the national project CATARO.

JORDI DOMINGO-PASCUAL ([email protected]) is a full professor of computer science and communications at UPC. He is co-founder of and a researcher at the Advanced Broadband Communications Center (CCABA) of the university. His research topics are broadband communications and applications, IP/ATM integration, QoS management and provision, traffic engineering, IP traffic analysis and characterization, and QoS measurements.

ALEXANDRE FONTE ([email protected]) graduated in electrical engineering from the University of Coimbra, Portugal, in 1995, and received his M.Sc. degree in electronic and telecommunications engineering (distributed systems specialty) from the University of Aveiro, Portugal, in 2000. He is currently a Ph.D. student in computer engineering at the Department of Informatics Engineering, University of Coimbra. His Ph.D. research activity is focused on interdomain quality of service routing and traffic engineering in IP networks.

EDMUNDO MONTEIRO ([email protected]) is an associate professor at the University of Coimbra, Portugal, from which he graduated in 1984 and received a Ph.D. in electrical engineering (computer specialty) in 1995. His research interests are computer communications, QoS, mobility, routing, resilience, and security.