Dr. Dobb's Journal (September)

, Dr. Dobbs J O U R N A L #376 SEPTEMBER 2005 SOFTWARE TOOLS FOR THE PROFESSIONAL PROGRAMMER http://www.ddj.com CO...

80 downloads 1212 Views 2MB Size Report

This content was uploaded by our users and we assume good faith they have the permission to share this book. If you own the copyright to this book and it is wrongfully on our website, we offer a simple DMCA procedure to remove your content from our site. Start by pressing the button below!

Report copyright / DMCA form

DOWNLOAD PDF

,

Dr. Dobbs J O U R N A L

#376 SEPTEMBER 2005

SOFTWARE TOOLS FOR THE PROFESSIONAL PROGRAMMER

http://www.ddj.com

COMMUNICATIONS & NETWORKING 3D Graphics & Mobile Phones Inside Mobile IP High-Performance Java I/O:

How Orbitz.com Does It! C++, STL, & Custom Pool Allocators EFI: Not Your Father’s BIOS! C++ Exceptions & the Linux Kernel $4.95US $6.95CAN

09

Multithreading .NET Apps Testing Web Applications

0

74470 01051

7

Software Optimization & DSP Embedded Systems Ed Nisley on

Complexity & Software Development Jerry Pournelle on

Mac OS X Tiger

C O N T E N T S

SEPTEMBER 2005 VOLUME 30, ISSUE 9

FEATURES Mobile Java & 3D Applications 14

FORUM

by Oscar Vivall and Tom Thompson

Oscar and Tom examine the Mascot Capsule Micro3D and JSR 184 APIs, then use them to develop high-quality 3D applications for mobile phones.

Inside Mobile IP 24 by Narendra Venkataraman

Mobile IP lets mobile-device users stay connected when moving to networks with different IP addresses.

High-Performance I/O with Java NIO 32 by Brian Pontarelli

The NIO library offers a high-performance technique for handling input/output (I/O) operations.

Improving Performance with Custom Pool Allocators for STL 40 by Anthony Aue

Anthony presents a highly flexible and configurable replacement for C++’s std::allocator for use with nodebased standard containers.

The Extensible Firmware Interface 46 by Craig Szydlowski

The Extensible Firmware Interface is a modern replacement for the venerable BIOS.

C++ Exceptions & the Linux Kernel 50 by Halldór Ísak Gylfason and Gísli Hjálmtysson

C++ kernel-level runtime support for Linux lets you use the full power of C++ in kernel-space programming.

Multithreading .NET Apps for Optimal Performance 54 by Eric Bergman-Terrell

Multithreading delivers maximum performance from parallel CPU architectures. .NET has two main threading mechanisms — the Thread class and asynchronous methods.

Testing Web Applications 60 by Sean Dawson and Kristin Kerr

Sean and Kristin automate web application testing by integrating JWebUnit into Hippo’s existing test framework.

Finding Binary Clones with Opstrings & Function Digests: Part III 64 by Andrew Schulman

Andrew wraps up his examination of reverse engineering this month, further unraveling binary code.

EMBEDDED SYSTEMS Software Optimization & DSP Embedded Systems 71 by Robert Oshana

The challenge of developing DSP-based embedded systems lies in making the most of limited resources — performance, memory, and power.

COLUMNS Programming Paradigms 75

Chaos Manor 82

by Michael Swaine

by Jerry Pournelle

Embedded Space 78

Programmer’s Bookshelf 85

by Ed Nisley

by Gregory V. Wilson

EDITORIAL 6 by Jonathan Erickson LETTERS 8 by you DR. ECCO’S OMNIHEURIST CORNER 10 by Dennis E. Shasha NEWS & VIEWS 12 by DDJ Staff PRAGMATIC EXCEPTIONS 34 by Benjamin Booth OF INTEREST 87 by DDJ Staff SWAINE’S FLAMES 88 by Michael Swaine

RESOURCE CENTER As a service to our readers, source code, related files, and author guidelines are available at http:// www.ddj.com/. Letters to the editor, article proposals and submissions, and inquiries can be sent to [email protected], faxed to 650-513-4618, or mailed to Dr. Dobb’s Journal, 2800 Campus Drive, San Mateo CA 94403. For subscription questions, call 800-456-1215 (U.S. or Canada). For all other countries, call 902-563-4753 or fax 902-563-4807. E-mail subscription questions to ddj@neodata .com or write to Dr. Dobb’s Journal, P.O. Box 56188, Boulder, CO 803226188. If you want to change the information you receive from CMP and others about products and services, go to http://www.cmp .com/feedback/permission.html or contact Customer Service at the address/number noted on this page. Back issues may be purchased for $9.00 per copy (which includes shipping and handling). For issue availability, send e-mail to [email protected], fax to 785838-7566, or call 800-444-4881 (U.S. and Canada) or 785-8387500 (all other countries). Back issue orders must be prepaid. Please send payment to Dr. Dobb’s Journal, 4601 West 6th Street, Suite B, Lawrence, KS 66049-4189. Individual back articles may be purchased electronically at http://www.ddj.com/.

NEXT MONTH: In October, we open the vault on valuable computer-security information.

DR. DOBB’S JOURNAL (ISSN 1044-789X) is published monthly by CMP Media LLC., 600 Harrison Street, San Francisco, CA 94017; 415-947-6000. Periodicals Postage Paid at San Francisco and at additional mailing offices. SUBSCRIPTION: $34.95 for 1 year; $69.90 for 2 years. International orders must be prepaid. Payment may be made via Mastercard, Visa, or American Express; or via U.S. funds drawn on a U.S. bank. Canada and Mexico: $45.00 per year. All other foreign: $70.00 per year. U.K. subscribers contact Jill Sutcliffe at Parkway Gordon 01-49-1875-386. POSTMASTER: Send address changes to Dr. Dobb’s Journal, P.O. Box 56188, Boulder, CO 80328-6188. Registered for GST as CMP Media LLC, GST #13288078, Customer #2116057, Agreement #40011901. INTERNATIONAL NEWSSTAND DISTRIBUTOR: Worldwide Media Service Inc., 30 Montgomery St., Jersey City, NJ 07302; 212-332-7100. Entire contents © 2005 CMP Media LLC. Dr. Dobb’s Journal is a registered trademark of CMP Media LLC. All rights reserved.

http://www.ddj.com

Dr. Dobb’s Journal, September 2005

3

,

Dr.Dobbs J O U R N A L

PUBLISHER Michael Goodman

SOFTWARE TOOLS FOR THE PROFESSIONAL PROGRAMMER

EDITOR-IN-CHIEF Jonathan Erickson

EDITORIAL MANAGING EDITOR Deirdre Blake MANAGING EDITOR, DIGITAL MEDIA Kevin Carlson SENIOR PRODUCTION EDITOR Monica E. Berg ASSOCIATE EDITOR Della Wyser ART DIRECTOR Margaret A. Anderson SENIOR CONTRIBUTING EDITOR Al Stevens CONTRIBUTING EDITORS Bruce Schneier, Ray Duncan, Jack Woehr, Jon Bentley, Tim Kientzle, Gregory V. Wilson, Mark Nelson, Ed Nisley, Jerry Pournelle, Dennis E. Shasha EDITOR-AT-LARGE Michael Swaine PRODUCTION MANAGER Stephanie Fung INTERNET OPERATIONS DIRECTOR Michael Calderon SENIOR WEB DEVELOPER Steve Goyette WEBMASTERS Sean Coady, Joe Lucca AUDIENCE DEVELOPMENT AUDIENCE DEVELOPMENT DIRECTOR Kevin Regan AUDIENCE DEVELOPMENT MANAGER Karina Medina AUDIENCE DEVELOPMENT ASSISTANT MANAGER Shomari Hines AUDIENCE DEVELOPMENT ASSISTANT Melani Benedetto-Valente MARKETING/ADVERTISING ASSOCIATE PUBLISHER Will Wise SENIOR MANAGERS, MEDIA PROGRAMS see page 86 Pauline Beall, Michael Beasley, Cassandra Clark, Ron Cordek, Mike Kelleher, Andrew Mintz MARKETING DIRECTOR Jessica Marty SENIOR ART DIRECTOR OF MARKETING Carey Perez DR. DOBB’S JOURNAL 2800 Campus Drive, San Mateo, CA 94403 650-513-4300. http://www.ddj.com/ CMP MEDIA LLC Gary Marshall President and CEO John Day Executive Vice President and CFO Steve Weitzner Executive Vice President and COO Jeff Patterson Executive Vice President, Corporate Sales & Marketing Leah Landro Executive Vice President, Human Resources Mike Mikos Chief Information Officer Bill Amstutz Senior Vice President, Operations Sandra Grayson Senior Vice President and General Counsel Alexandra Raine Senior Vice President, Communications Kate Spellman Senior Vice President, Corporate Marketing Mike Azzara Vice President, Group Director of Internet Business Robert Faletra President, Channel Group Vicki Masseria President, CMP Healthcare Media Philip Chapnick Vice President, Group Publisher Applied Technologies Michael Friedenberg Vice President, Group Publisher InformationWeek Media Network Paul Miller Vice President, Group Publisher Electronics Fritz Nelson Vice President, Group Publisher Network Computing Enterprise Architecture Group Peter Westerman Vice President, Group Publisher Software Development Media Joseph Braue Vice President, Director of Custom Integrated Marketing Solutions Shannon Aronson Corporate Director, Audience Development Michael Zane Corporate Director, Audience Development Marie Myers Corporate Director, Publishing Services

American Buisness Press

4

Dr. Dobb’s Journal, September 2005

Printed in the USA

http://www.ddj.com

EDITORIAL

Only a Matter of Time

I

t’s only a matter of time until Dr. Jekyll changes into Mr. Hyde, a dumb idea turns into a zillion dollars, and an Apple Macintosh runs on an Intel CPU. But then on second glance, it looks like time has proved all these true. Okay, try on these it’s-only-a-matter-of-time scenarios:

Video baseball will replace real baseball. Naw, never happen. There’s nothing like the smell of fresh-cut grass, the crack of the ball leaving the bat, and an $8 hotdog washed down with a $10 beer. Ah, actually it has happened. This past summer, the first two innings of a minor-league baseball game between the Kansas City (KS) T-Bones and the Schaumburg (IL) Flyers were played, well, virtually. Instead of professional ballplayers taking the field, two opposing video gamers, selected in a month-long competition sponsored by CompUSA, settled into recliner chairs at home plate, then proceeded to spit, scratch, pitch, hit, and field it out on the 16×24 foot CommunityAmerica Ballpark video board using EA Sports MVP Baseball 2005 software and X-Box hardware. After the first two innings, the real players took the field, picking up the lineups, scores, and statistics from the video gamers. For the box score, see http://www.tbonesbaseball.com/. Wireless communication systems will go nuts. Get outta here! Some of technology’s best and brightest are on the wireless bandwagon. Except for a few security problems, there’s nothing at all that can go wrong. Right? Okay, you’ve convinced me, but you might try selling this to homeowners who live near U.S. military facilities. It seems that an $800-million program to upgrade military communications systems is causing some ups-and-downs for nearby residential automatic garage-door-opening systems. As it turns out, garage-door-opener manufacturers, with the permission of the FCC, have for years been operating on an unlicensed basis in the 390-MHz frequency band reserved for military radios. This hasn’t been a problem until recently, as new higher powered military radios have begun jamming low-power door-openers. Depending on the nearby terrain and who you believe, homes anywhere from 25 to 50 miles can be affected. Meanwhile the FCC is recommending homeowners lengthen their antenna, buy a new transmitter/receiver that operates on a different frequency, or sell their homes. (All right, I made up that last suggestion.) For more information, see http://www.dasma.com/PDF/Publications/ TechDataSheets/OperatorElectronics/TDS374.pdf. E-mail spam will go away. There’s no one more than me who hopes it’s only a matter of time before e-mail spam disappears — and books like the recently released Ending Spam: Bayesian Content Filtering and the Art of Statistical Language Classification, by anti-spam expert Jonathan Zdziarski (not to mention DDJ articles such as John Graham-Cumming’s May 2005 “Naïve Bayesian Text Classification”) will help make this happen. I first heard about Zdziarski’s book in an e-mail from the book’s distributor. Then I heard about it again in another e-mail from the distributor, then in another e-mail, and another, and…sigh — you get the idea. If it walks like a can of Spam and tastes like a can of Spam, it darn well must be spam. Maybe it is more than a matter of time before spam goes away. Artificial intelligence will replace medical doctors. I hope not, at least not anytime soon. Still, AI is making in-roads into the medical information community. The Physician Point of View, an intelligent enterprise portal developed by DataGlider (http://www.dataglider.com/), is a system that does the leg-work for doctors by tracking down information they need to treat patients. According to a DataGlider spokesperson, doctors report that previous information systems, which took up to 200 mouse clicks to return specific information, now require no clicks at all — the information is automatically presented via portlets. The word “McDigital” will be listed in the Oxford English Dictionary. Sad to say, but it’s probably going to happen. Looking to attract “a younger, hipper [but presumably not thinner] clientele,” McDonald’s is turning to technology. Forget about food. The fast-food purveyor wants us to use its kiosks inside the golden arches to burn CDs, download ringtones, surf the Web, and print photos. Wi-Fi will be available, along with large-screen TVs and cashless payment systems. At the heart of this new high-tech McDonald’s is the “Blaze Net” system, implemented by Digital Transaction Machines (http://www.dtmachines.com/), which is also capable of distributing videos, mobile games, tickets, and whatnot. Jeez, what’s next —“Happy Meals for Geeks”? It’s only a matter of time.

Jonathan Erickson editor-in-chief [email protected] 6

Dr. Dobb’s Journal, September 2005

http://www.ddj.com

LETTERS

, PO S S O BB

T

D

2

C

T EN

S

2

High-Reliability Languages Dear DDJ, In “Reliability: The Hard and the Soft” (DDJ, May 2005), Ed Nisley dismisses Ada for DO-178B applications, saying “the entire cadre of Ada programmers is reaching retirement age with no replacements on tap.” I’m surprised that Ed is propagating this myth. For one example, I’m not “nearing retirement age,” and I’ve got two young engineers on my team, happily learning Ada and hard real-time programming. The AdaCore Academic Initiative (http://www.adacore.com/academic_members.php) has over 70 member universities, all training new Ada programmers. The Ada 2006 standard will be released next year, adding Java-style interfaces to the language, along with other significant improvements. Many Ada vendors are making money and growing; see http:// www.adaic.com/index.html for current information on the state of the Ada industry. The SPARK (http://www.praxis-his.com/ sparkada/) subset of Ada is directly targeted to high- integrity systems, and is growing in popularity; several tool vendors are incorporating it (for example, ILogix— see http://www.ilogix.com/newsroom/ newsroom_detail.cfm?pressrelease= 2004_09_29_035924_126720pr.cfm). Stephen Leake [email protected] Ed replies: Thanks Stephen. While it’s true that Ada continues to be pretty much the only high-level language that’s a good fit for DO-178B applications, it suffers from a certain lack of undergraduate mindshare. Judging from the vendors I’ve talked to (admittedly, a small and biased subset), the consensus is that while their Ada biz is doing okay, many customers are interested in a Javaoid high-reliability language. It comes down to economics. The same realities 8

that drove the military to COTS hardware and software (with sometimes devastating consequences) is driving software toward readily available languages and personnel. That those languages aren’t such a good fit for the job and the personnel might have the wrong background may be just unintended consequences. Looked at from the other side of the paycheck, though, knowing how to dance with Ada can be the ticket to a rewarding future. Undergraduates reading this, take heed! Nuclear versus Wind Energy Dear DDJ, Jonathan Erickson’s February 2005 “Editorial” presented some interesting numbers about U.S. wind energy. That generated a rather inflamed answer by our fellow reader Peter Andre, defending nuclear power, calling it “the cleanest, safest, best way to create energy.” Not in the least. These days, nuclear power is generating something like one third of the pollution put out by fossil- fuel operated power plants. Enriched uranium has to be mined and transported to the power plants, and in the end, nuclear waste has to be shelled and buried. All of this spends energy and produces pollution. Unlike wind, enriched uranium is a finite resource, so scarce that the amount available on the Earth’s crust is sufficient but to three years of world energy consumption (at present rates). Detailed information about this can be found at http://beheer.oprit.rug.nl/deenen/. But there are other environment impact issues. That’s why the future of wind energy lays off-shore. The EU has, among others, a program to build a wind farm in a 44 meters sea deep area off Scotland (DOWNVinD project); see http://home.wxs.nl/~windsh/of fshore.html. Actually, this nuclear/wind war is pretty stupid, because they are the best we have to tackle the oil peak. It seems that nuclear will be the first hour answer, but only wind power can be seen as a longterm solution for the post-oil days. The beginners guide to the oil peak is at http://www.wolfatthedoor.org.uk/. Luís de Sousa [email protected] Surround Sound Dear DDJ, In his excellent article “Surround Sound” (DDJ, July 2005), Don Morgan discusses the difference between the definition of a decibel by electrical engineers and audio engineers. The implication is that Dr. Dobb’s Journal, September 2005

a given power ratio can have more than one dB value. This is not true. The reason for the two formulas shown in the first paragraph on page 80 has simply to do with what unit of measurement is being used — volts or watts. If, for example, the two measurements were made on a voltmeter, dB=20*log(E2/E1). If the measurements were made with a wattmeter, the calculation would be dB= 10*log(P2/P1). The reason for this is that when calculating power from voltage and resistance, the voltage is squared and divided by the resistance (P=E**2/R). When simplified, the equation becomes 10*log((E2/E1)**2), which is 2*10* log(E2/E1), because log(n**2)=2*log(n). The bottom line is that 10 dB is always 10 dB, no matter how you measure it. Dave Bushong [email protected] Don replies: Thanks Dave. You are absolutely right, dB is dB. My meaning was that Magnitude Squared is the way the audio engineer looks at the spectrum while most electrical engineering folk prefer simple magnitude measurements. And I made a point of it because it can become confusing. If we compare a Bode plot of a spectrum plotted in magnitude we will find that the –3dB point is at 0.5 of the magnitude, while –3dB on a magnitude squared plot will be at 0.707. Actually, multiplying by 20 (magnitude_squared_db=20*log(out/in) is equivalent to squaring magnitude_db= 10*log(out/in) because we are dealing with exponents –> we get magnitude squared of the transfer function by multiplying the output of the transfer function by its complex conjugate; the magnitude is simply the output of the absolute value of the output of the transfer function. Magnitude squared does not reduce to simple magnitude, and dB is always dB, but its context needs to be understood.

DDJ

Letters to the Editor Send letters to the editor via e-mail to [email protected]; or Dr. Dobb’s Journal, 2800 Campus Drive, San Mateo, CA 94403.

http://www.ddj.com

DR. ECCO’S OMNIHEURIST CORNER

Maximum Lottery Dennis E. Shasha

W

hen I arrived at Ecco’s house, he had just greeted two visitors. “Jimmy Casino’s the name,” said the man in the beige suit and white shoes, as he offered his embossed card to Ecco. “The competition is hell out there, doc. I gotta find a new betting game to attract players. My numbers guy Janos will explain it to you.” “Dr. Ecco, we have invented a betting lottery for large stakes, based on hollow lottery balls,” Janos, with his shock of disheveled blond hair, spoke with a distinct Hungarian accent. “Here is how it goes. “Your adversary is given 100 identical pieces of paper, writes a number on each one — he can choose the range of numbers and can even put in duplicates — then folds them. Then these papers are given to an independent third party whom both sides can see. That third party shuffles the papers and then inserts one piece of paper into each of 100 hollow lottery balls that can be screwed open or shut. These balls have been tested using drop tests, bounce tests, and resiliometric tests to be sure they are all as close to equal as tolerances allow. “The third party then puts the balls into a lottery machine. The lottery machine mixes the balls until one comes out. The ball is opened and you are told the number. You have the option to ‘keep’ the Dennis, a professor of computer science at New York University, is the author of four puzzle books: The Puzzling Adventures of Dr. Ecco (Dover, 1998); Codes, Puzzles, and Conspiracy (Freeman 1992, reprinted by Dover in 2004 as Dr. Ecco: Mathematical Detective); and recently Dr. Ecco’s Cyberpuzzles (W.W. Norton, 2002); and Puzzling Adventures (W.W. Norton, 2005). With Philippe Bonnet, he has written Database Tuning: Principles, Experiments, and Troubleshooting Techniques (2002, Morgan Kaufmann). With Cathy Lazere, he wrote Out of Their Minds: The Lives and Discoveries of 15 Great Computer Scientists (1995, Copernicus/Springer). He can be contacted at [email protected]. 10

number. If you keep it, you put it in your keep pile and you have used up one keep. If you don’t, it goes in the discard pile never to come out. (You can record the numbers on a private storage device for future reference, however). Repeat this procedure for all 100 balls. You are allowed three ‘keeps’ altogether. Your goal is to have the highest number written in the keep pile. If you do, you win $100,000. If you don’t, you lose $100,000. Should you take the bet? If so, what is your probability of winning?” “Echoes of the sultan’s daughters problem,” said Ecco with a chuckle after a few minutes of silence. “Surely you know that one, Professor,” he said. “A young suitor may choose any of the 100 daughters of the sultan. They are presented to him in some random order. He has little to go on, so he judges only by outward beauty and grace. If he rejects one, he never sees her again. Once he selects one, he must marry her and no other.” Warm-up: Can you design a strategy that gives him at least a 1/4 chance of marrying the most beautiful daughter? Solution to warm-up: Look at but reject the first half of the daughters. Then take the first daughter who is more beautiful than any of those in the first half. This is not the optimal solution, but it has the virtue of offering an easy rough analysis: You have a 1/2 chance that the most beautiful daughter is in the second half and a 1/2 chance that the second most beautiful daughter is in the first half. In that case, which happens with probability 1/4 assuming the daughters are presented to you in random order, you are sure to marry the most beautiful daughter with this protocol. In fact, your odds are better (for example, if the third most beautiful daughter is in the first half and the most beautiful one precedes the second most beautiful one in the second half) than 25 percent. A deeper analysis (check out http://mathworld.wolfram .com/SultansDowryProblem.html) shows that it is better to reject only 37 daughters and then choose the most beautiful one that follows. Dr. Dobb’s Journal, September 2005

Erratum in the Solution to the Treasure Arrow Alan Dragoo pointed out an arithmetic error in my solution to the weight of the arrow in “Treasure Arrow.” It should be 75 kilograms, because delta is only 15 centimeters. DDJ

Dr. Ecco Solution Solution to “Election Fraud in Verity,” DDJ, August 2005. 1. There can be at most three pollsters meeting these conditions. Here’s why. Each pollster must miss about 13 Wendy votes (because there are 51 in total, but only 38 seen by each pollster). No two pollsters miss the same Wendy voter, because every pair of pollsters interview all 100 voters. If we number the Wendy voters W1 to W51, then we can imagine that pollster A misses W1 to W13, pollster B misses W14 to W26, and pollster C misses W27 to W39. There are not enough more Wendy voters for a fourth pollster to miss, so there can be only three pollsters. 2. Number the Fred voters F1 to F49. Pollster A could interview W14 to W51 and F1 to F42. Then pollster B could interview W1 to W13 and W27 to W51 as well as F8 to F49. Finally, pollster C could interview W1 to W16 and W40 to W51 as well as F1 to F7 and F15 to F49. 3. As the solution to the first question showed, this result would not be possible with five honest pollsters. So Fred was right: Wendy stole the election. Tom Rokicki helped enormously with these solutions.

http://www.ddj.com

SECTION

A

MAIN NEWS

Dr. Dobb’s

News & Views

Next Generation Java Announced Sun Microsystems has announced a slew of new features available in Java EE 5, the next edition of Java (http://java.sun.com/ ee/). New features in Java EE 5 will include annotations, simplified common coding scenarios, streamlined deployment, and new utility classes and helpers. New APIs and services include a JSP standard tag library, StAX, web-services metadata, JAXB, easy web applications with JavaServer Faces, Common Annotations, and a new persistence API.

eBay Launches Open-Source Program eBay has launched its Community Codebase initiative, which open sources some of its search and access applications to members of eBay’s Developers Program and PayPal Developer Network. The free program (http://codebase.ebay.com/) lets developers access source code for a variety of eBay and PayPal tools and applications. Initial projects available in the eBay Community Codebase include a Firefox MyeBay toolbar, Eclipse plug-in, and TiVo/eBay sample application (all built by eBay), and five payment scripts for integrating PayPal.

IBM’s BlueGene/L Still the Fastest Top500, an organization that ranks supercomputer performance (http://www.top500 .org/), has named IBM’s BlueGene/L as the most powerful supercomputing system in the world. With a sustained performance of 136.8 Teraflops, or trillions of floatingpoint calculations per second, the system developed by IBM and the U.S. Department of Energy’s National Nuclear Security Administration will be installed at Lawrence Livermore National Labs. Eventual plans call for it to run at 360 Teraflops.

Smart Ankle Prevents Falls Graduate engineers at Stanford University have designed a “smart” ankle brace to reduce falls among the elderly. Working with Thomas Andriacchi, professor of mechanical engineering and orthopedic surgery, the smart ankle brace designed by Tim Ramsey, Ryan McDonnell, Buzzy Bonneau, Tejas Mazmudar, Jeremy Dittmer, and Surag Mantri corrects imbalances and prevents falling (http://www.stanford.edu/group/ biodesign/). The device is built around a microcontroller that continuously monitors 12

the roll of the ankle. If the chip detects a roll that is greater than normal, it provides a correctional vibration. This vibration helps the wearer change position or shift balance to avoid a fall in much the same way that sensory nerves provide correctional feedback to the brain. One in three individuals over the age of 65 fall every year, and one fall in 200 results in a broken hip. Falls account for $26 billion in medical costs each year.

Cw Extends C# for Concurrency Nick Benton and Gavin Bierman, a pair of researchers in Microsoft’s Cambridge Lab, have extended C# for greater support of concurrency by developing the Comega (Cw) language (http://research.microsoft .com/Comega/). Cw is a strongly typed, dataoriented language that connects semistructured hierarchical XML data, relational SQL data, and the .NET Common Type System (CTS). Comega also extends C# with asynchronous concurrency abstractions.

Sun Sorta Open Sources Sun Microsystems has begun open sourcing some of its Java applications, although the company isn’t open sourcing Java itself. In introducing a new open-source license, Sun announced that the upcoming Java System Application Server Platform Edition 9 (http://glassfish.dev.java.net/) and the Java System Enterprise Server Bus (Java ESB), will allow developers to examine and use the application server’s source code under the terms spelled out in the Common Development and Distribution License (CDDL). To do so, however, developers first need to sign Sun’s Java Research License (JRL) agreement.

Microsoft Wants Smartphone Market The head of Microsoft’s mobile and embedded division is promising that the company will have the lead in the smartphone market. Zhang Ya-qin claims Microsoft can do it within the next three years, but that could be a tough sell, given that Symbian currently has 76 percent of the smart mobile device market, compared to 7.6 percent for Microsoft, according to market researcher Canalys. Microsoft’s roadmap is based in part on providing for additional features in Microsoft’s Mobile 5.0, released last month. Version 5.0 sports a Blackberry-like push email system, although according to Zhang, Dr. Dobb’s Journal, September 2005

DR. DOBB’S JOURNAL September 1, 2005

Mobile 5.0 uses a more direct approach for linking users with e-mail, bypassing proprietary servers. Microsoft has teamed up with manufacturing-services provider Flextronics (http://www.flextronics.com/) to develop the Peabody platform, which will reportedly be based on the OMAP730 GSM/GPRS processor from Texas Instruments and will include 32-Mbytes of RAM and 64-Mbytes of flash memory. It will also include a 1.9-inch color LCD with a resolution of 176×220 pixels, USB, and IrDA ports and a miniSD slot. A 1.3 megapixel camera is optional. The first devices should come on the market in 2006, Zhang said.

IT Advisory Committee Shut Down The President’s IT Advisory Committee (PITAC), a panel of industry and academic experts mandated by the High-Performance Computing Act of 1991 and the Next Generation Internet Act of 1998, has been disbanded. PITAC (http://www.nitrd.gov/ pitac/) members included representatives from Salesforce.com, Microsoft’s SQL Server group, Akamai Technologies, Dell, the University of California at Berkeley, MIT, and Purdue, among others. What turned out as the committee’s final report focused on U.S. competitiveness in computation science and recommended a study on how government can better connect computational science research across academia, industry, and government.

NSF Honors Young Scientists and Engineers Twenty young scientists and engineers whose work is supported by the National Science Foundation (NSF) have received the Presidential Early Career Award for Scientists and Engineers (PECASE), considered the highest national honor for investigators in the early stages of promising research careers (http://www.nsf.gov/news/ news_summ.jsp?cntn_id=104239). The PECASE awards include four computer and information scientists: David V. Anderson, Georgia Institute of Technology for his work in embedded signal processing; Elaine Chew, University of Southern California, for creating computer assisted methods for making music; Shalinee Kishore, Lehigh University, for innovative research wireless networks; and ChengXiang Zhai, University of Illinois at Urbana-Champaign, for research into user-centered, adaptive intelligent information access. http://www.ddj.com

Mobile Java & 3D Apps Cell phones come alive with 3D graphics OSCAR VIVALL AND TOM THOMPSON

T

he mobile phone has changed tremendously over the past decade. While pundits may talk about technology convergence coming to the living room, it has already arrived on your mobile phone. This convergence is still a work in progress. It has arrived to the point where the mobile phone is capable of functioning not only as a true imaging and communication device, but also as an advanced gaming device and a deliverer of 3D content. To accomplish the latter, mobile phone manufacturer Sony Ericsson has incorporated not one, but two APIs that manage the real-time rendering of 3D graphics on a mobile phone. In this article, we examine these 3D APIs and show how to use a few of their capabilities. The two 3D APIs on Sony Ericsson mobile phones are HI Corporation’s Mascot Capsule Micro3D Version 3, and Java Specification Request (JSR) 184, which describes the Mobile 3D Graphics (M3G) API for the Java 2 Mobile Edition (J2ME). Both APIs support high-quality model rendering, using transparent color blending and appearance properties, plus applying texture maps to the model’s surfaces. They also let you import files that store a model’s geometry data, its appearance properties, texture maps, and animation sequences. This allows the hard work of assembling the model and any animation sequences to be done on a PC using 3D-authoring software such as Newtek’s Lightwave, Discreet’s Studio 3D Max, Avid System’s SoftImage, and Alias System’s Maya. The work is then exported via a plug-in for use in a mobile Java application. This eliminates the computationally expensive stage of generating a model’s geometry through algorithms (although you can do that if necessary), and instead focuses the mobile phone’s processing power on the rendering stage. Why have two disparate APIs in a mobile phone? Sony Ericsson’s goal is to encourage widespread adoption of 3D applications and 3D-application development. Mascot Capsule V3 had already been field tested and offers opportunities to migrate 3D applications from existing BREW CDMA or DoJa devices to the J2ME platform on GSM/UMTS phones. It also lets you deliver 3D content now while JSR 184 tools and devices become widely available, and the standard matures. Each of these APIs takes a different approach to solving the same problem of rendering 3D models quickly while consuming few platform resources.

Oscar and Tom are support engineers for Sony Ericsson Developer World, http://www.SonyEricsson.com/developer/. 14

Mascot Capsule Micro3D Version 3 Mascot Capsule V3 started life as a 3D-rendering engine for embedded devices. Such an environment requires robust code and has a small resource footprint — precisely what the J2ME platform requires (http://www.hicorp.co.jp/english/products/mc_top.html).

“Mascot Capsule V3 started life as a 3D rendering engine for embedded devices”

Although Mascot Capsule V3 has a proprietary API, its wide adoption by mobile phone vendors such as Sony Ericsson, Motorola, and Sanyo make it a de facto standard. Mascot Capsule V3 consists of an enhanced rendering engine and Java extension that exposes its proprietary APIs for use on the J2ME platform. The rendering engine uses 32-bit integer arithmetic operations exclusively; thus, rendering functions don’t require any graphic acceleration hardware or a floating-point math coprocessor. A side effect of this is that 3D programmers must pass all coordinate data and API arguments to Mascot Capsule V3 as integers within the range of 0 (equivalent to 0.0) and 4096 (equivalent to 1.0). This can be unsettling for seasoned 3D programmers who are used to supplying floating-point values to 3D APIs such as OpenGL. As Figure 1 shows, Mascot Capsule V3 hews to its embedded roots with a minimalist set of 10 classes that manage the display and control of 3D content. However, don’t be fooled by Mascot Capsule V3’s lean architecture: The methods provided by these classes are feature rich and handle the most commonly used 3D-graphics operations. Some of these classes, such as the Figure class, store geometric object information. Other classes, such as Effect3D and

Dr. Dobb’s Journal, September 2005

http://www.ddj.com

FigureLayout, attach rendering attributes to instances of the geometric objects. Other classes describe the lighting and textures that a scene uses. The Graphics3D class handles all rendering operations, while a utility class (Util3D) provides methods that can help with the design of graphics algorithms. Mascot Capsule V3 also has a graphics pipeline, similar to those found in other 3D-graphics environments such as OpenGL. This means that graphics commands are fed into a queue leading to the rendering engine where they are executed. Mascot Capsule V3 operates in an immediate mode where all graphics commands entering the pipeline are executed immediately upon the receipt of a flush command. Some Mascot Capsule V3 commands consist of graphics primitives or descriptions of a model’s 3D geometry. Other commands configure the graphics environment and manage the rendering process. Mascot Capsule V3 lets you import model geometry data that was generated by 3D-authoring programs. The exported data is stored as a resource in the mobile application’s JAR file. The application then loads this resource into an instance of a Figure. The Figure class also helps implement animation effects for a model. The model’s current arrangement of 3D objects represents a pose. An action description class, ActionTable, is associated with the Figure and stores the motions that change the model’s pose. The action tables animate the model by changing its pose over time under program control. Animation effects generated in 3D-authoring programs can be exported as ActionTables whose files are also stored as JAR file resources. When imported by the application, these Actiontables animate the models.

Controller and AnimationTrack classes that are used to animate 3D models under program control. JSR 184 also supports an immediate mode, where you invoke the render( ) method on submesh objects (actually objects that consist of arrays of vertex information and attributes). These two modes can execute on the same scene simultaneously. You can import a World that creates a virtual scene, then use the immediate mode to move sprites or other objects through it. A Loader class enables JSR 184 to import 3D models and animations made by 3D-authoring programs with desktop PCs. The file format that’s used to store 3D-geometry data, appearance attributes, texture maps, and other information is described in detail in the specification document. Using the APIs To add 3D capabilities to mobile applications (MIDlets), you simply add 3D API calls to the source code. No special linking or postprocessing is required. Both APIs work by “binding” a single instance of the 3Drendering environment to a J2ME LCD User Interface (LCDUI) Graphics object. The code renders the scene into this object and when done, it is released. Listings One(a) and One(b) illustrate this sequence for both APIs. While the principle is simple, the tricky part lies in the details that don’t appear in the listing.

Object

Figure

JSR 184 JSR 184 is a J2ME-based standard developed under the guidance of the Java Community Process (JCP), a consortium of companies of which Sony Ericsson is a member (http://www .jcp.org/en/jsr/detail?id=184). JSR 184 was crafted to provide a wide set of 3D-rendering features that function within J2ME’s constrained resources. The JSR 184 implementation consists of API classes and a rendering engine. The rendering engine is software-only, although it was designed so that its architecture can take advantage of any FPU or graphics accelerators if present. JSR 184’s API follows typical 3D conventions in that you specify floating-point values for coordinates, transforms, and other 3D information. The floating-point numbers must conform to the Java language’s float data type. By using floating-point arguments, JSR 184 simplifies the porting of code based on OpenGL or other 3D APIs. Compared to Mascot Capsule V3, JSR 184 has a bevy of classes —30 in all (Figure 2). JSR 184 takes the kitchen-sink approach of providing a wide variety of features. The many classes let you use just a few API calls to execute more specialized 3D operations —not just the common ones. They allow you to design complex scenery using meshes and other objects to build a landscape, illuminate it with multiple lights, and view the scene from different angles using multiple cameras. A detailed description of these objects is beyond the scope of this article. JSR 184 supports both retained and immediate modes of operation. Retained mode uses what’s known as a “scene graph” to link all of the geometric objects in a 3D world through a structured tree of nodes. You use objects subclassed from Node (such as Lights, Sprites, and Meshes) to assemble the 3D world. The Group class lets you gather node objects together and organize them, and the World class defines a special Group node that acts as the top-level container for all of the nodes in the 3D world. To display a view of the 3D world using the retained mode, you execute a Graphics3D render( ) method on a World node. Like Mascot Capsule V3, JSR 184 supports Animationhttp://www.ddj.com

Effect3D

Texture

Graphics3D

Light FigureLayout

=AnimationSupport =Rendering Attributes =Render Context/Control

AffineTrans

Util3D

ActionTable

=3D Object Assembly =Utility Classes

Vector3D

Class

Figure 1: The Mascot Capsule V3 class hierarchy.

Object3D

Graphics3D

Image2D

Transform

RayIntersection

Loader

AnimationTrack

VertexBuffer

AnimationController

KeyframeSequence

Background

VertexArray

IndexBuffer TriangleStripArray

Transformable

Appearance

CompostingMode

PolygonMode

Material Node

Group

Texture2D

Sprite3D

Camera

World

=AnimationSupport =Rendering Attributes =Render Context/Control

Light

Fog

Mesh

MorphingMesh

SkinnedMesh

=Scene Graph Objects =3D Object Assembly =Utility Classes

Class Abstract Class

Figure 2: JSR 184 class hierarchy.

Dr. Dobb’s Journal, September 2005

15

Often you use an instance of Canvas or GameCanvas as the Graphics binding target, as it provides low-level access to the phone’s keypad, jog dial, or touch screen. You add a run( ) method that periodically updates the models in the 3D world, then calls repaint( ) to display the revised 3D content. The Canvas’s paint( ) method is where you render the scene. The set-up sequence takes six steps: 1. 2. 3. 4. 5. 6.

Assemble or import the models. Set up textures and visual attributes for the models. Set up the camera. Set up the lights and any other effects. Set up the projection layout (parallel or perspective). Assign the active camera.

Now the scene can be rendered. Some of these steps will be part of the MIDlet’s initialization code while others, such as the active camera choice, lighting, and effects, are in the paint( ) method. Step 1 is to assemble or import your cast of 3D models. Listings Two(a) and (b) show how to use the immediate mode in Mascot Capsule V3 and JSR 184 to generate a 3D model of a cube. The JSR 184 procedure is more involved because the data has to be stored into instances of VertexArrays and then referenced by VertexBuffer. The additional effort is necessary because JSR 184 uses these objects to share texture and visual attributes with other models, thereby reducing memory usage for complex models. For step 2, you assign visual attributes for the models and set up any texture maps. Listings Three(a) and (b) show how to load and assign the textures in these APIs. For Mascot Capsule V3, you make instances of a Texture. This is also where you configure the visual attributes of the model with instances of Appearance and Material in JSR 184. Again, because of JSR 184’s flexibility and wider range of functions, it requires that you supply more information to set up the model’s visual characteristics. With the cast of models and their appearances prepared, now it’s onto step 3 to arrange the point of view (POV) for the scene. Mascot Capsule V3 has one camera, where you supply three vectors to a lookAt( ) method to configure the viewing position coordinates. JSR 184 defines a Camera class used for this purpose. A JSR 184 scene can have multiple instances of Camera with different POVs, although only one Camera can be active at a time. Listings Four(a) and (b) detail how this is done in both APIs. Note that in making a Camera, you also set the projection layout. For step 4, you add lighting and effects to the stage. Listings Five(a) and (b) show how this is done. Mascot Capsule V3 has only two white lights, a directional one and an ambient one, that are part of a Light class. For effects, you make an Effect object and specify the color-shading process the scene uses. This is also where you fold in the lighting characteristics. JSR-184 provides a Light class that allows for multiple lights of different colors. You’ll make an instance of Light, then supply its color and intensity. It’s time to specify the viewing geometry or projection model of the 3D space for step 5. The projection model type determines what visual corrections and transformations are applied to render a 2D image from the 3D scene. Each API supports two views: parallel and perspective.

Listings Six(a) and (b) show how the perspective projection model is set. For Mascot Capsule V3, this is accomplished by invoking the setPerspective( ) method call on an instance of FigureLayout. For JSR 184, invoking its setPerspective( ) method during the creation of the Camera does the job, as shown in Listing Four. Now all that’s left to do for step 6 is to assign the active camera and render the scene; see Listings Seven(a) and (b). For Mascot Capsule V3, first we assemble a graphics command, COMMAND, that specifies triangle primitives, no color, texture mapping, and other attributes to be dropped into the graphics pipeline. The renderPrimitives( ) method is invoked, and it is given the information stored in the Texture, FigureLayout, and Effect instances, along with the model geometry data in vert, norm, and tex. For JSR 184, the Camera is positioned and chosen as the active camera with the setCamera( ) method. The Light is added and made active. Finally, the scene is rendered using the information in the instances of VertexBuffer, IndexBuffer, and Appearance. The operation occurs in immediate mode because it is invoked on a single object and not an instance of a World.

“The projection model type determines what visual corrections and transformations are applied to render a 2D image from the 3D scene”

16

Interacting with the Models As the listings suggest, you can also change lights, lighting values, the projection view, and other aspects of the scene on-thefly. This allows for easy user interaction with the models. For example, JSR 184’s render( ) method also accepts a transformation matrix as an argument, which lets you rotate, scale, or move the model. For Mascot Capsule V3, you can apply a transformation matrix to the instance of FigureLayout, which then affects the model during rendering. Through the Canvas class’s event methods, you capture keystrokes on the phone’s keypad. This input can be used to modify a transformation array so that the model responds in a specific way to the key press. For example, pressing certain keys could cause a race car to move left or right. Or, they might make a spaceship rotate about. The two demo programs that accompany this article (available electronically; see “Resource Center,” page 3) demonstrate how to import models from files, and you can move them about using the phone’s keypad or navigation button. As you better understand the capabilities of the APIs, you can apply more creative effects. By selecting a different Camera that contains a different POV, you can effortlessly switch between different views of the same scene (perhaps a view above a race car, where you can toggle to a second view from the driver’s seat) or in JSR 184, turn on/off different “banks” of colored lights. From the previous discussion, you can see that both Mascot Capsule V3 and JSR 184 provide capable 3D-graphic environments. If your 3D-graphics requirements are modest, Mascot Capsule V3 is a suitable API. It also has better performance than the JSR 184 implementation on current phones in the market. MIDlets with more sophisticated graphics requirements, or those applications using less common 3D functions, are best served by JSR 184. Sony Ericsson’s phones, by carrying both 3D APIs, have become the labs where next-generation 3D applications will be written. For more information on mobile Java 3D development, see http://www.SonyEricsson.com/developer/java3d/. DDJ (Listings begin on page 20.)

Dr. Dobb’s Journal, September 2005

http://www.ddj.com

Listing One (a)

-10, 10,-10, 10, 10,-10, -10,-10,-10, 10, 10,-10, -10,-10,-10, 10,-10,-10, // back -10, 10, 10, -10, 10,-10, -10,-10, 10, -10, 10,-10, -10,-10, 10, -10,-10,-10, // left 10, 10,-10, 10, 10, 10, 10,-10,-10, 10, 10, 10, 10,-10,-10, 10,-10, 10, // right 10, 10,-10, -10, 10,-10, 10, 10, 10, -10, 10,-10, 10, 10, 10, -10, 10, 10, // top 10,-10, 10, -10,-10, 10, 10,-10,-10, -10,-10, 10, 10,-10,-10, -10,-10,-10 }; // bottom // Cube's normals -- Used in lighting calculations. private int[] norm = { 0, 0, 4096, 0, 0, 4096, 0, 0, 4096, 0, 0, 4096, 0, 0, 4096, 0, 0, 4096, 0, 0, -4096, 0, 0,-4096, 0, 0,-4096, 0, 0,-4096, 0, 0,-4096, 0, 0, -4096, -4096, 0, 0, -4096, 0, 0, -4096, 0, 0,-4096, 0, 0, -4096, 0, 0, -4096, 0, 0, 4096, 0, 0, 4096, 0, 0, 4096, 0, 0, 4096, 0, 0, 4096, 0, 0, 4096, 0, 0, 0, 4096, 0, 0, 4096, 0, 0, 4096, 0, 0, 4096, 0, 0, 4096, 0, 0, 4096, 0, 0, -4096, 0, 0, -4096, 0, 0, -4096, 0, 0, -4096, 0, 0, -4096, 0, 0, -4096, 0};

// Mascot Capsule v3 public class CoolGameCanvas extends Canvas { // Get instance of 3D environment Graphics3D gameG3D = new Graphics3D(); // Paint method which handle 3D content display public void paint(Graphics g) { try { gameG3D.bind(g); // update the scene elements // render the scene } ﬁnally { gameG3D.release(); } } // end CoolGameCanvas

(b)

(b)

// JSR 184

// JSR 184

private VertexBuffer iVb; // Vertex positions, normals, colors, texcoords private IndexBuffer iIb; // Indices to VertexBuffer, for tri-strips // Triangles strips that make up the cube's faces short[] vert = { 10, 10, 10, -10, 10, 10, 10,-10, 10, -10,-10, 10, // front -10, 10,-10, 10, 10,-10, -10,-10,-10, 10,-10,-10, // back -10, 10, 10, -10, 10,-10, -10,-10, 10, -10,-10,-10, // left 10, 10,-10, 10, 10, 10, 10,-10,-10, 10,-10, 10, // right 10, 10,-10, -10, 10,-10, 10, 10, 10, -10, 10, 10, // top 10,-10, 10, -10,-10, 10, 10,-10,-10, -10,-10,-10 }; // bottom

public class CoolGameCanvas extends Canvas { // Get instance of 3D environment Graphics3D gameG3D = Graphics3D.getInstance(); // Paint method which handle 3D content display public void paint(Graphics g) { try { gameG3D.bindTarget(g); // update the scene elements // render the scene } ﬁnally { gameG3D.releaseTarget(); } } // end CoolGameCanvas

// Create a VertexArray to hold the vertices for the object VertexArray vertArray = new VertexArray( vert.length / 3, 3, 2 ); vertArray.set( 0, vert.length/3, vert );

Listing Two (a) // Mascot Capsule v3 // Triangles that make up the cube's faces. Are vertices, not triangle strips private int[] vert = { 10, 10, 10, -10, 10, 10, 10,-10, 10, -10, 10, 10, 10,-10, 10, -10,-10, 10, // front

20

// The per-vertex normals for the cube byte[] norm = {0, 0, 127, 0, 0, 127, 0, 0, 127, 0, 0, -127, 0, 0,-127, 0, 0,-127, -127, 0, 0, -127, 0, 0, -127, 0, 0, 127, 0, 0, 127, 0, 0, 127, 0, 0, 0, 127, 0, 0, 127, 0, 0, 127, 0, 0, -127, 0, 0, -127, 0, 0, -127, 0, // Create a vertex array for the normals of the object. VertexArray normArray = new VertexArray( norm.length / 3, 3, normArray.set( 0, norm.length/3, norm );

Dr. Dobb’s Journal, September 2005

0, 0, 127, 0, 0, -127, -127, 0, 0, 127, 0, 0, 0, 127, 0, 0, -127, 0}; 1 );

(continued on page 22)

http://www.ddj.com

(continued from page 20) int[] stripLen = { 4, 4, 4, 4, 4, 4 }; // Length of each triangle strip // Create the VertexBuffer for our object VertexBuffer vb = iVb = new VertexBuffer(); vb.setPositions(vertArray, 1.0f, null); // Unit scale, zero bias vb.setNormals(normArray); // Create the index buffer for the object (this tells how to create triangle // strips from the contents of the vertex buffer). iIb = new TriangleStripArray( 0, stripLen );

Listing Three (a) // Mascot Capsule v3 private static Texture d1_texture; d1_texture = new Texture("tex_skin.bmp", true); // Coordinates on 2D BMP image to be mapped onto the cube's private int[] tex = { 96, 32, 64, 32, 96, 64, 64, 32, 64, 32, 32, 32, 64, 64, 32, 32, 64, 0, 32, 0, 64, 32, 32, 0, 32, 0, 0, 0, 32, 32, 0, 0, 32, 32, 0, 32, 32, 64, 0, 32, 96, 0, 64, 0, 96, 32, 64, 0, private int [] kolor = { 0xFF00FF }; // Placeholder

faces. 96, 64, 64, 64, 64, 32, 32, 32, 32, 64, 96, 32,

64, 64, 32, 64, 32, 32, 0, 32, 0, 64, 64, 32 };

(b) // JSR 184 private Image iImage; private Appearance iAppearance; // For material, texture, compositing // Vertex texture coordinates. These specify the coordinates on a 2-D image // that will be painted onto a surface. short[] tex = { 96, 32, 64, 32, 96, 64, 64, 64, 64, 32, 32, 32, 64, 64, 32, 64, 64, 0, 32, 0, 64, 32, 32, 32, 32, 0, 0, 0, 32, 32, 0, 32, 32, 32, 0, 32, 32, 64, 0, 64, 96, 0, 64, 0, 96, 32, 64, 32 }; // Create a vertex array for the texture coordinates of the object VertexArray texArray = new VertexArray( tex.length / 2, 2, 2 ); texArray.set(0, tex.length/2, tex); // Place texture coords into instance of VertexBuffer made in Listing 2 vb.setTexCoords(0, texArray, (1.0f/128.0f), null); // 128-pixel scale, zero bias // Load the image for the texture iImage = Image.createImage("/cubeface.png"); // Create the Image2D (we need this so we can make a Texture2D) Image2D image2D = new Image2D( Image2D.RGB, iImage ); // Create the Texture2D and enable mipmapping Texture2D texture = new Texture2D( image2D ); texture.setFiltering( Texture2D.FILTER_NEAREST,Texture2D.FILTER_NEAREST ); texture.setWrapping( Texture2D.WRAP_CLAMP, Texture2D.WRAP_CLAMP ); texture.setBlending( Texture2D.FUNC_MODULATE ); // Create the appearance iAppearance = new Appearance(); iAppearance.setTexture( 0, texture ); // Add Texture2D to the Appearance iAppearance.setMaterial(iMaterial); iMaterial.setVertexColorTrackingEnable( true ); // Track per-vertex color iMaterial.setColor(Material.SPECULAR, 0xFFFFFFFF); // Specular = white iMaterial.setShininess(100.0f);

Listing Four (a) //Mascot Capsule v3 // Vectors for point of camera point of view private Vector3D D1_POS = new Vector3D(0,30,50); private Vector3D D1_LOOK = new Vector3D(0,-2048,-4096); private Vector3D D1_UP = new Vector3D(0,4096,0); // Set up camera and rotation transform AfﬁneTrans d1_vtrans = new AfﬁneTrans(); d1_vtrans.setIdentity(); d1_vtrans.lookAt(D1_POS, D1_LOOK, D1_UP);

(b) // JSR 184 private Camera iCamera; // Create a camera iCamera = new Camera(); iCamera.setPerspective(80.0f, (ﬂoat)getWidth()/ (ﬂoat)getHeight(), 1.0f, // near clipping plane 1000.0f); // far clipping plane

22

Dr. Dobb’s Journal, September 2005

// ﬁeld of view // aspectRatio

http://www.ddj.com

Listing Five (a) // Mascot Capsule v3 Effect3D effect; // Lights settings. Only two lights: directional and ambient // Directional light, vector and intesity private Vector3D lightDirVec = new Vector3D(-146,293,439); private int lightDir = 3730; // Ambient light intensity private int lightAmbi = 1626; // Make the lights private Light d1_light = new Light(lightDirVec,lightDir,lightAmbi); // Effects set up effect = new Effect3D(); effect.setShadingType(Effect3D.NORMAL_SHADING); effect.setLight(d1_light);

(b) // JSR 184 private Light iLight; // Create a light iLight = new Light(); iLight.setColor( 0xffffff ); iLight.setIntensity(1.25f);

// White // Over bright

Listing Six (a) // Mascot Capsule v3 private int D1_pers = 512; // Equivalent to 45 degrees FigureLayout d1_layout; // Set up Figure's (cube) layout d1_layout = new FigureLayout(); d1_layout.setPerspective(10, 4096, D1_pers); d1_layout.setCenter(getWidth()/2, getHeight()/2); // Center on screen

(b) // Accomplished in Listing 4.

Listing Seven (a) // Mascot Capsule v3 // Constants for graphics commands private static ﬁnal int COMMAND = Graphics3D.PRIMITVE_TRIANGLES | Graphics3D.PDATA_NORMAL_PER_VERTEX | Graphics3D.PDATA_TEXURE_COORD | Graphics3D.PDATA_COLOR_NONE | Graphics3D.ENV_ATTR_LIGHTING | Graphics3D.PATTR_BLEND_HALF; // Instantiate 3D graphics object Graphics3D g3d = new Graphics3D(); // Draw the scene try{ g3d.bind(g); g3d.renderPrimitives(d1_texture, 0, 0, d1_layout, effect, COMMAND, 12, vert, norm, tex, kolor); g3d.ﬂush(); g3d.release( g ); } catch (Throwable h){ h.printStackTrace(); } // end catch

(b) // JSR 184 private Graphics3D iG3D; // Set up the camera in the desired position Transform transform3D = new Transform(); transform3D.postTranslate(0.0f, 0.0f, 30.0f); // Get the singleton Graphics3D instance iG3D = Graphics3D.getInstance(); iG3D.setCamera(iCamera, transform3D); iG3D.resetLights(); iG3D.addLight(iLight, transform3D); // Render our cube iG3D.render(iVb, iIb, iAppearance, null); // Free the graphics object iG3D.releaseTarget();

DDJ http://www.ddj.com

Dr. Dobb’s Journal, September 2005

23

Inside Mobile IP Staying connected while moving around NARENDRA VENKATARAMAN

M

obile IP is a standard that lets mobile-device users whose IP addresses are associated with one network, stay connected when moving to a network with a different IP address. When users leave the network with which their device is associated (that is, the “home network”) and enter the domain of a foreign network, the foreign network uses the Mobile IP protocol to inform the home network of a “care-of” address to which all packets for the user’s device should be sent. Mobile IP is most often found in wireless WAN environments where users need to carry their mobile devices across multiple LANs with different IP addresses. A common analogy to explain Mobile IP is when someone moves his residence from one location to another; say, from Boston to New York. The person drops off the new mailing address at the New York post office, which notifies the Boston post office of the new mailing address. When the Boston post office receives mail for a person, it knows to forward the mail to that person’s New York address. Some of the key drivers for embracing Mobile IP are: large-scale deployments of wireless networks, be it public WLAN hotspots or All-IP 3G networks such as UMTS; less expensive wireless Internet access; and powerful laptops and PDAs with built-in WLAN and GPRS/UMTS radio interfaces. Telecommunications companies that have both WLAN and WWAN infrastructure

Narendra Venkataraman is a senior developer at Hewlett-Packard Telecom division and can be reached at narendra [email protected]. 24

(GPRS/UMTS/CDMA1x) are providing seamless roaming as a value-added service to subscribers. Among the various roaming solutions in the market, Mobile IP (being an IETF Standard) scores very high. Mobile IP comes with two versions — Mobile IPv4 and Mobile IPv6. Though conceptually similar to IPv4, IPv6 supports more enhanced security and optimized routing features. Because of the slow transition toward IPv6 infrastructure, Mobile IP solutions for existing IPv4 networks are already being deployed. In this article, I focus mainly on Mobile IPv4, highlighting the key concepts and issues faced during integration with wireless LAN and GPRS/UMTS networks. In this article, I will explain the core concepts of roaming and highlight the insights gained during the integration of Mobile IP with WLAN and GPRS/UMTS networks. Though the article is focused on wireless networks, Mobile IP is applicable to wireline networks as well. Layer 2 Mobility For instance, Figure 1 shows two access points —AP1 and AP2— within the same subnet 15.76.222. The WLAN client is initially associated with AP1 and assigned the IP address 15.76.222.11. The user issues a huge HTTP download and walks out of the coverage area of AP1 and is now associated to the new access point AP2. Because both the access points are serving the same subnet, the client’s IP address is retained. The download continues without any hiccups. In this case, the roaming occurs at Layer 2. Layer 2 roaming is very fast and reliable across similar network media within the same subnet or network topology. Even the latest GPRS/UMTS combo adapters handle roaming at the link layer, provided GPRS and UMTS networks interface with the same IP backbone. Layer 3 Mobility Take a case where AP1 is in subnet 15.76.222/24 and AP2 is in subnet 16.138.52/24. As illustrated in Figure 2, the client is assigned 15.76.222.11 when Dr. Dobb’s Journal, September 2005

connected to AP1. Later, when it associates with AP2, it cannot retain the same address; hence, it gets a new address 16.138.52.112 from the 16.138.52 subnet’s DHCP server. The HTTP download is a TCP session identified by a tuple (source ip:port and destination ip:port). In this case, the TCP session breaks because the source address changed from 15.76.222.11 to 16.138.52.112. The question is how to retain the address 15.76.222.11 even while you move to the new subnet 16.138.52

“In a public hotspot, users associate with an AP and are assigned an IP address” so that the existing application sessions are not disrupted. Mobile IP solves this problem by introducing a host entity called “mobile node” and two new network entities called “Home Agent” (HA) and “Foreign Agent” (FA). The mobile node is any host running the Mobile IP client. The mobile node is always assigned one permanent IP address; in this case, 15.76.222.11. In Mobile IP terminology, the permanent address is known as the “home address” and the subnet 15.76.222 is the “home network.” The mobile node implements a virtual network adapter that is assigned to the home address. All application layer packets are routed via this adapter. These packets are then sent over the physical WLAN interface, which is assigned 16.138.52.112. The WLAN adapter’s address changes depending on the network it is connected to and this temporary address is called the “care-of” address. Say that the mobile node was downloading a file from a web server hosted http://www.ddj.com

(continued from page 24) on 161.114.22.105. The second piece of puzzle is to determine how the mobile node receives IP packets at its new location. Applying the standard routing prin-

ciples, packets destined for 15.76.222.11 are forwarded to the subnet default gateway 15.76.222.1 and eventually dropped. Figure 3 explains the basic concept of Mobile IP:

Figure 1: Same subnet.

Figure 2: Different subnet.

When the mobile node goes back to the 15.76.222 network, it deregisters by sending a registration packet with a lifetime equal to 0. If it moves to another subnet, it registers its new care-of address and deregisters the previous one. The mobile node (after extracting the original packet) knows the CN’s IP address. The mobile node can send the packets directly to the CN without tunneling as shown in Figure 4. This is known as “triangular routing” and optimizes the return path. However, currently deployed routers and firewalls enable “ingress filtering,” in which the outgoing packets are blocked if the source address is not part of the network. In this case, 15.76.222.11 is not part of subnet 16.138.52; hence, it is blocked. Mobile IP solves the ingress filtering issue by introducing reverse tunneling (see RFC 3024; http://www.faqs.org/rfcs/rfc3024 .html). The mobile node uses IP- in-IP tunneling with the care- of address 16.138.52.112 in the source and the HA address 15.76.222.1 in the destination; see (3) of Figure 4. The HA extracts the original packet and forwards it to the CN. This is shown in (4) of Figure 4.

Figure 3: Basic architecture.

Optional Foreign Agent If a large number of mobile nodes move to subnet 16.138.52, the DHCP runs out

Figure 4: Reverse tunneling. 26

1. In a mobile IP scenario, the default gateway acts as the HA. The mobile node notifies its new care-of address by sending a Mobile IP registration packet. The registration packet is sent over UDP at port 434 of the HA. 2. The HA sends back the reply after updating its registration table. The registration table consists of home address and care-of address mapping. It also has other parameters such as registration lifetime. (It is not shown for simplicity.) 3. The packets sent by the host running the web server known as the “correspondent node” (CN) are intercepted by the HA. These packets are then sent to the care-of address 16.138.52.112. 4. The HA encapsulates the original IP packet sent by the CN in another IP packet with the destination address as the care-of address 16.138.52.112 and the source address as the HA address 15.76.222.1. The process is known as IP- in-IP “tunneling” (see RFC 2003; http://www.faqs.org/rfcs/rfc2003.html). 5. The mobile node’s WLAN adapter receives the tunneled packets, extracts the original packets, and passes them to the virtual adapter. Because the host application interfaces only with the virtual adapter, it is transparent to changes in the care-of address.

Dr. Dobb’s Journal, September 2005

http://www.ddj.com

(continued from page 26) of IP addresses. Mobile IP resolves this address scarcity problem by introducing another network entity called “Foreign Agent” (FA). Figure 5 shows how the packets flow in the presence of an FA. The NAT Problem In the presence of an FA, each mobile node is not assigned a care-of address. Instead, the FA’s address is used as the care-of address in the registration table and the mobile node registers with the HA via an FA. The packets sent from the CN are tunneled via the HA and sent to the FA. The FA decapsulates them and sends the original packets directly to the mobile node’s WLAN adapter MAC address. The FA maintains a routing table with the mobile node’s home address and the WLAN adapter’s MAC address as in Figure 5. The mobile node detects an FA by listening to agent advertisements broadcast by the FA. These agent advertisements

are an extension of ICMP router advertisement messages (see RFC 1256; http://www.faqs.org/rfcs/rfc1256.html). In short, there are two types of care-of addresses: an FA care-of address, when an FA is present in the visited network; and a colocated care-of address, when the mobile node obtains the care-of address from DHCP or PPP or some other means. IP-in-IP tunneling assumes that the endpoints are the HA and MN in colocated mode or the HA and FA in the presence of an FA. If the mobile node is behind a Network Address Port Translation (NAPT) enabled wireless router, IP-in-IP tunneling breaks because the care-of address (192.168.1.110) is not reachable from outside the local subnet. IP-in-UDP tunneling (see RFC 3519; http://www.faqs.org/ rfcs/rfc3519.html) solves this problem by tunneling the IP packets in a UDP packet. The UDP packet contains a Mobile IP tunnel message header and the original IP packet. Figure 6 shows how IP-in-UDP

tunneling traverses through a mobile node behind NAPT. 1. The mobile node requests for IP-inUDP tunneling when it sends the registration packet. The registration packet contains the tuple {192.168.1.110:2001, 15.76.222.1:434}. 2. The router uses port address translation to map many internal addresses and TCP/UDP ports in the subnet 192.168.1 to a single external address 15.76.222.100. For the registration UDP packet, the router is mapped from 192.168.1.110:2001 to 15.76.222.100:1656. 3. The HA receives the registration request with source IP 15.76.222.100 and a UDP port 1656. The HA maintains the registration table mapping home address 15.76.222.11 to the care-of address and UDP port 15.76.222.100:1656. 4. When a packet addressed to the mobile node reaches the HA, it encapsulates the packet in another UDP packet and sends it to the port 1656 that was used for registration. 5. Once the packet reaches the NAPTenabled wireless router, it is forwarded to 192.168.1.110:2001. 6. The mobile node decapsulates and extracts the original packet. The IP-in-UDP tunneling is not bandwidth efficient because the packet overhead for the tunneling is 32 bytes compared to 20 bytes for IP-in-IP tunneling.

Figure 5: Foreign agent.

Combining Different Wireless Technologies Figure 7 shows a high-level deployment architecture for public hotspots and UMTS networks. The public WLAN and UMTS networks are connected to the common IP backbone, enabling users to seamlessly roam across these networks. Deploying a Mobile IP infrastructure requires installation of HAs and FAs. The routers can be upgraded to serve as an HA or FA. If the existing routers cannot be upgraded, you can run FA and HA software on standard Linux or UNIX servers. On the host side, you can install

Figure 6: NAPT traversal.

Figure 7: Deployment architecture.

28

Dr. Dobb’s Journal, September 2005

http://www.ddj.com

(continued from page 28) any third-party Mobile IP client available on the market. Currently, Windows XP has no built-in support for Mobile IP. While switching from WLAN to UMTS, the latency is in the order of 2–5 seconds. This is not acceptable for real-time streaming applications. It is better to be always connected to the ubiquitous GPRS/UMTS network and switch to a cheaper and faster WLAN whenever available. This is a good strategy because most service providers use packet-based billing; hence, you are not charged extra for an alwayson UMTS connection. The mobile node should implement low-latency handoff logic to ensure smooth switch over. The handoff logic chooses the right network interface at the right time in the right location. Handoff across similar network technology (for instance, WLAN to WLAN) is simply based on signal strength. Whereas handoff across WLAN and UMTS is a complex function with many parameters such as speed of interface, signal strength, power consumption, bandwidth, cost of network usage, location of user, user preferences, time of the day, and so on. The best approach is to study and anticipate user behavior and utilize it for switching to a new network before losing the existing connection.

Instead of using a static home address, configure Network Access Identifier (NAI) (see RFC 2794; http://www.faqs.org/rfcs/ rfc2794.html). NAI is a text identifier that is primarily used as a user ID during registration. If the mobile node is config-

“The VPN provides secure access based on user credentials, whereas Mobile IP provides terminal mobility” ured with an NAI, then it does not need a static home address. The mobile node sends a registration request with an NAI. The HA sends a registration reply including the mobile node’s home address. You can set, for instance, an e-mail ID, such as “[email protected]” as an NAI.

In a public hotspot, users associate with an AP and are assigned an IP address. They need to authenticate using some application layer authentication in order to obtain access to the Internet. The most common authentication mechanism is web login, where the user is presented with a login page and the user’s credentials are sent over a secure socket layer (SSL). It is important to note that before the authentication phase, all the packets are blocked including Mobile IP registration packets. Therefore, ensure that Mobile IP registration packets are sent after the hotspot login. Upon successful authentication, if there is no network activity for a predefined interval, most of the hotspots log you out. Therefore, make sure that you send ping or keep-alive packets at regular intervals. Home agents installed within an intranet offer roaming across LAN and WLAN networks. Unlike public WLANs, enterprise WLANs use link-layer authentication called “802.1x authentication.” Because authentication is already done before obtaining an IP address, Mobile IP registration happens smoothly. Apart from roaming between various LAN and WLAN subnets within an intranet, we can also roam across the intranet and Internet while retaining the corporate network connection. Some enterprises partner with GPRS service providers to provide access to intranet resources. This is optimal because there is no extra overhead from a VPN tunnel. But this is not always feasible, and therefore, the most generic and popular way to connect to the enterprise is using IPSecbased VPNs. The VPN provides secure access based on user credentials, whereas Mobile IP provides terminal mobility. The commercially available IPSec-based VPN gateways are not interoperable and standardizing the integration of HA functionality into a VPN gateway is not an easy task. Various approaches are being evaluated and are available in draft form (see http://www.ietf.org/internet-drafts/draftietf-mip4-vpn-problem-statement-03.txt). Conclusion Mobile IP is an important step toward ubiquitous networking. However, it does not address all the issues of mobility. For instance, if a bandwidth-hogging application running on a high-speed WLAN network switches to a low-bandwidth GPRS network, TCP connections cannot survive the high latency of the GPRS network. This can be solved if the application is designed to cater to the differences in quality- of- service of various wireless media. True mobility can only be achieved if all layers in the TCP/IP stack collaborate. DDJ

30

Dr. Dobb’s Journal, September 2005

http://www.ddj.com

High-Performance I/O With Java NIO How Orbitz.com improved throughput and performance using NIO BRIAN PONTARELLI

T

he NIO library, a feature added to Java as part of JDK 1.4, offers an alternative, high-performance technique for handling input/output (I/O) operations. This library supports working directly with native channels, sockets, and direct memory buffers to increase the performance and throughput of I/O. In addition, it offers the ability to do true nonblocking I/O operations. However, if used incorrectly, NIO can yield little performance increase and can cause an increase in the load and reduce stability of an application. In this article, I discuss how Orbitz (the company I work for) used NIO to increase performance and stability by replacing an older standard I/O implementation. Orbitz communicates with various third-party backend systems to retrieve data for its customers. During the course of handling a single customer request, the Orbitz.com application performs time-sensitive I/O to fetch the customer’s response. With a large

Brian is a senior architect with Orbitz.com. He can be contacted at [email protected]. 32

number of concurrent users, the application can be performing hundreds of I/O operations at the same time. The Orbitz.com application handles concurrent user requests by allowing each request to be performed in a separate thread. These threads, called “execute threads,” can be configured such that each request is handled by a new thread or by a thread from a thread pool. In either case, it is vital to ensure that no thread is running for an extended period of time. When threads take too long processing a single user request and a new thread is used for each request, the Java VM has the possibility of crashing when excessive threads are created. Likewise, if a pool is being used and threads from that pool take too long while executing, other users are forced to wait until a thread has finished and can be reused. Standard I/O The Orbitz.com application was initially implemented with a standard I/O design using Java 1.3. To guarantee timeouts for I/O operations, it employed a separate thread to perform the I/O operations. The execute threads are then joined on this new thread using a timeout. With this approach, it prevented the execute thread from hanging when I/O operations encountered problems. Listing One shows the original thread that performed the I/O operations. It has been simplified from its original version. By calling the join method on the IOThread, one of the execute threads waits until the IOThread has finished running. A timeout is specified when the join method is called to ensure that the exeDr. Dobb’s Journal, September 2005

cute threads do not wait forever for an IOThread to complete. Listing Two shows an example of how an execute thread used an IOThread. The first version of the execute threads created new instances of the IOThread class for each I/O operation. Using the initial implementation it was observed that the

“Multiplexed NIO is a technique that moves all the I/O work into a single thread” servers were periodically experiencing high thread counts. It had initially been assumed that the maximum number of threads that could be created in a Java VM running this code would be 2N, where N was the number of execute threads. After some investigation, I found that an instance of the IOThread would remain in memory if it experienced network problems and was hung while performing an I/O operation. In addition, the execute thread correctly timed-out and the next execute thread attempted the same operation, with the same results. The net result was that the Java VM created an enormous number of threads and eventually ran out of memory. http://www.ddj.com

(continued from page 32) To reduce the thread count, I added a thread pool that stored instances of the IOThread class. I also changed the execute thread code so that it checked out an IOThread instance from the pool, used it, and then checked it back in. This not only reduced the high thread count, but also reduced the overhead of thread creation. Listing Three shows the second version of the execute thread that used the pool of IOThreads (the implementation Minute 1 2 3 4 5 6 7 8 9 10 Avg

Standard IOMultiplexed NIO 135 137 162 152 144 167 164 164 163 154 154.2

287 240 235 270 246 266 390 339 364 390 302.7

Table 1: Load test results. Solution Standard I/O Multiplexed NIO

of a thread pool is out of scope of this article). This improvement still has an underlying problem. If the network experiences major failures, all instances of IOThread in the pool quickly become stalled and execute threads begin to fail during checkout because there are no more IOThreads in the pool. Additionally, when I began testing this new design under heavy load, I found that it still had problems that reduced overall performance. The issue was excessive context switching. All the implementations thus far force the Java VM to perform increased context switching because either the execute threads or the IOThreads are all performing I/O operations concurrently. The largest problem with this concurrency is that the Java VM doesn’t know which threads are ready to perform I/O operations and which are not. Therefore, the Java VM might put a thread onto the processor that isn’t ready to perform I/O operations. Meanwhile another thread that is waiting to run is ready to perform

Average Latency

Average Load

1489 1199

0.8 0.7

“Use exceptions for exceptional cases” is such a common refrain that I wonder if anyone really knows what it means. It means not using them as a part of control flow, but beyond that, the proper use of exceptions is anybody’s guess and everyone’s mystery. When should you throw exceptions? Should you throw a runtime or a checked exception? Is it okay to catch a Throwable exception? Where is the best place to catch an exception? What do you do when you catch one? In general, how can you be pragmatic when it comes to exceptions? Over the next few months, I answer some of these questions. While Java is the reference language, many of these tips are universal.

34

Pragmatic Exceptions . . . .

Table 2: Statistics from the Orbitz.com production environment.

I/O operations. Because Java does not provide the ability to “stack the deck” by moving threads that are ready to perform I/O to the top of the list, it is possible that a thread, which is ready to perform I/O, might wait long enough that it times out. NIO The NIO package added to Java 1.4 offers Java developers the ability to perform true nonblocking I/O and guarantee timeouts for I/O operations. It also works with low-level operating-system constructs to increase performance and reduce latency. In addition, it provides the ability to monitor multiple I/O operations concurrently, also known as “multiplexing.” Multiplexed NIO is a technique that moves all the I/O work into a single thread that watches over many I/O operations executing concurrently. A multiplexed NIO design uses the java.nio.channels.Selector class in conjunction with subclasses of the java.nio.channels.SelectableChannel class. The Selector class lets the application watch many I/O connections (subclasses of the SelectableChannel class and also known as “Channels”) and organize these Channels into those that are ready to perform I/O and those that are not.

Tip #1: If In Doubt, Throw It Out

Exceptions are guilty until proven innocent. To ensure the integrity of the application (see the upcoming Tip #5 “Exceptions May Be Poisonous”), exception pitchers should also play it safe and throw exceptions out if there’s any doubt. Let someone else who will know for sure take a closer look. Throwing it out is usually the first, safest bet. At some point, higher up in the code path, there will be sufficient programmatic context to know for sure whether a contract has been broken. With less ambiguity, appropriately handling the exception becomes straightforward. If the method signature must change, then change it if at all possible. If not, then throw an unchecked exception (in Java, a RuntimeException). This is better than blindly handling a potentially fatal state change in your application.

Dr. Dobb’s Journal, September 2005

—Benjamin Booth [email protected]

http://www.ddj.com

I decided to redesign to use NIO and a single thread to perform all the I/O operations. Listing Four shows the skeleton body of the NIO thread. As with the threaded standard I/O design, the execute threads need a way to supply work to the NIOWorker in the form of a URL they want to contact. The NIOWorker then connects to the URL, sends an HTTP message to the server, and then reads the HTTP-response. After it completes that work, it passes back the result to the execute thread. One unit of work is designated using an inner class within the NIOWorker thread that lets a URL be passed in and the contents to be passed back. Listing Five shows the inner class of the NIOWorker that represents one unit of work. Supplying work to the NIOWorker is not a simple process and requires two methods. Because the NIOWorker thread is a separate thread that handles all the I/O operations, execute threads must add their work and then wait for the NIOWorker thread to process it. Therefore, the NIOWorker class has one method that forces the execute threads to wait, and a second method that actually adds the work. This design waits and notifies using the Work object as the monitor. Listing Six shows the two add methods that force the execute threads to wait and add the work to the Selector. This code creates a SocketChannel, which is a Channel that extends SelectableChannel, and registers the Selector with it. This Channel is set to nonblocking so that operations performed on Channel do not block until complete but rather return as quickly as possible. Registering a Selector with a Channel informs the Channel to begin a specific I/O operation and also tells the Channel to signal the Selector once the operation is complete. Here the code is setting the Channel to begin an OP_CONNECT operation, which forces the Channel to connect to a remote socket. During the registration, this code also constructs a small class that holds some state information for the I/O operation. This is called an “attachment” and can be any object. The attachment is associated with a specific Selector/Channel relationship. This Selector/Channel relationship is called a “SelectionKey” (also known as “key”). It stores the type of operation the Channel is performing as well as the attachment. The WorkState object is what the NIOWorker uses as an attachment. WorkState lets the results that are read from the Channel be stored in a StringBuffer. It also provides a mechanism to monitor timeouts so that the NIOWorker thread does not continue working on something that has timed out. Listing Seven shows the WorkState class. To watch over the Channels, the NIOWorker thread needs to perform “sehttp://www.ddj.com

lects” by calling the selectNow method on the Selector. Performing “selects” tells the Selector to update all of its keys to reflect which Channels are ready and which are not, allowing the NIOWorker thread to stack the deck by only working on the Channels that are ready. The number of ready keys is returned from this method. Listing Eight shows the main run method for the NIOWorker thread that performs the select. The processKeys method handles all of the Channels that signal the Selector that they are ready to perform I/O. Because the Selector handles multiple Channels, the processKeys method performs the I/O for all Channels that are ready. Listing Nine shows the processKeys method that is responsible for iterating over all the SelectionKeys and handling each one.

Dr. Dobb’s Journal, September 2005

Dobbs_0508.indd 1

The Selector object provides the ability to retrieve all the keys that are ready to perform I/O using the selectedKeys( ) method. This method returns a Set containing all the keys that are ready to perform I/O. This Set is populated with the ready keys whenever a select operation is performed on the Selector. One important detail about the processKeys method is that during the loop iteration, the WorkState attachment is retrieved from the SelectionKey and used to verify that the work has not timed out. If it has timed out, the finished method is called to signal that the work is complete, which in this case is due to a timeout rather than success. The doWrite, doRead, and finished methods are the last methods that complete the NIOWorker. The doWrite method

35

07.06.2005 17:24:19 Uhr

writes the HTTP-request String to the channel. Listing Ten shows the doWrite method. The doRead method is responsible for performing a read from a Channel and signaling if everything has been read from the Channel or not. Listing Eleven is the doRead method. The doRead method is a standard NIO implementation of reading from a Channel. It reads zero or more bytes from the Channel and also determines if the Channel has any more to read. Once this code is certain that there is nothing more to read, it returns True to signal to the processKeys method that it should call the finished method. This code goes one step further and sets the success flag on the WorkState object. This is used in the finished method to determine whether the HTTP-response should be parsed. The final finished method notifies the execute thread that its work was complete, whether successfully or not. The success flag set in the doRead method is used to determine if the HTTP-response is parsed out. Listing Twelve is the finished method. NIO does not handle the HTTP-request generation and HTTP-response parsing automatically. That work is left up to the NIO implementation or the client. This design handles the generation of the HTTP-

request and parsing of the HTTP-response messages. The buildHTTPRequest and parseHTTPResponse methods were added here for illustration. The work of generating and parsing the HTTP messages is slightly more complex but is available in

“The work of generating and parsing the HTTP messages is slightly more complex” the full version of the code, within the HTTPParser class and the classes that implement the HTTPMethod interface. Listing Thirteen is an example of an execute thread calling the NIOWorker to perform some work. Performance To fully appreciate the benefits gained between standard I/O and NIO, I collected statistics using each design employed.

These statistics were collected by running each of the actual implementations through a series of load tests in a fixed environment. Table 1 presents the results from those tests. All values are the number of transactions completed by an implementation per minute. In addition to a load test, the standard I/O and the NIO solutions were both used in the Orbitz production environment for some period of time. I was able to monitor each one and collect statistics for load and latency. Table 2 shows the statistics from the Orbitz.com production environment. Additional Thoughts The code I present here is only part of the complete solution used in the Orbitz.com application. I simplified some of the code, but have retained the major work required to implement a multiplexed NIO solution. NIO is not a simple API to use and I encountered many issues during testing that required fine tuning. A few things I found that should be kept in mind are: • Prevent adding too much work to the Selector. This can cause memory leaks that result in a Java VM crash. The Selector can be queried to determine how many keys it currently has and that number can be compared to a configurable threshold. • Determine if your run method runs in a tight loop. The earlier code runs in a tight loop and this may be undesirable for lowto medium-volume applications. To cause a looser loop, one of the Selector methods with a timeout can be used. This will cause the call to that method to block until a Channel is ready. • You may want to use a tight loop but wait when the Selector is empty inside the run method. Inside the add methods, the NIOWorker thread can then be notified when work is successfully added. • Increase performance by moving the ByteBuffer and CharsetDecoder from the doRead method to instance variables on the NIOWorker object. • Check the Selector for each loop iteration in the main run method to determine if any work has timed out. This check will prevent the Selector from becoming full of invalid connections due to network failures. • The full version of the code, available electronically (see “Resource Center,” page 3) includes all of these implementation details. Pitfalls One major NIO pitfall to be aware of is that you should always use multiplexed

36

Dr. Dobb’s Journal, September 2005

http://www.ddj.com

(continued from page 36) NIO. Putting NIO code into multiple threads reduces overall performance and can cause excessive load. I experimented with using NIO in multiple threads and observed that due to enormous amounts of context switching the load nearly tripled, while latency and throughput remained unchanged. Conclusion A multiplexed NIO solution is an ideal solution for the Orbitz application requirements because of the efficiencies

of multiplexed I/O operations. Multiplexed I/O means that there is a single thread doing many I/O operations by leveraging a Selector to perform the multiplexing. Using a Selector, multiple Channels executing I/O operations are managed and watched over concurrently. This means that at no point is the Java VM ever waiting on Channels that are not ready to perform I/O; because the Selector knows precisely the Channels that are ready to perform I/O and those that are not. Because there is only a single thread performing I/O,

Listing One public class IOThread extends Thread { private String result; public synchronized String getResult() { return result; } public void run() { try { // Connect to the remote host and read in the data URL url = new URL("http://example.orbitz.com"); URLConnection connection = url.openConnection(); String localResult = doRead(connection); synchronized (this) { result = localResult; } } catch (IOException ioe) { System.err.println(ioe.toString()); } } private String doRead(URLConnection connection) throws IOException { StringBuffer buf = new StringBuffer(); InputStream is = connection.getInputStream(); BufferedReader br = new BufferedReader(new InputStreamReader(is)); char[] cbuf = new char[4096]; int n; do {

38

the problem of context switching between multiple threads performing I/O operations has been eliminated. All of these conditions greatly reduce overhead and latency while increasing total throughput over a standard I/O implementation. Acknowledgments Thanks to David P. Thomas of Orbitz for reviewing this article. DDJ

n = br.read(cbuf); if (n > 0) { buf.append(cbuf, 0, n); } } while (n >= 0); return buf.toString(); } }

Listing Two public class OldIO { public String execute() { IOThread thread = new IOThread(); thread.start(); try { thread.join(15000); // Wait for 15 seconds } catch (InterruptedException ie) {} return thread.getResult(); } }

Listing Three public class PoolOldIO { private static ThreadPool pool = new ThreadPool();

Dr. Dobb’s Journal, September 2005

http://www.ddj.com

public String execute() { IOThread thread = pool.checkOut(); String result; try { thread.execute(15000); // Wait for 15 seconds } catch (InterruptedException ie) { // Smother, okay to be interrupted } finally { result = thread.getResult(); pool.checkIn(thread); // Must always check-in the thread } return result; } }

Listing Four public class NIOWorker extends Thread { private Selector selector; public NIOWorker () throws IOException { selector = Selector.open(); } ... }

Listing Five public static class Work { public URL in; public String out; public ByteBuffer httpRequest; }

Listing Six public String doWork(URL url, long timeout) { Work work = new Work(); work.in = url; String result = null; synchronized (work) { work.httpRequest = buildHTTPRequest(work); // If the work was added successfully, call wait on the work object // which forces the calling thread to wait (i.e. the execute thread) if (add(work, timeout)) { try { work.wait(timeout); } catch (InterruptedException e) { System.err.println("NIO operation interrupted"); } result = work.out; } } return result; } protected boolean add(Work work, long timeout) { SocketChannel channel = null; try { URL url = work.in; InetSocketAddress addr = new InetSocketAddress(url.getHost(), 80); channel = SocketChannel.open(); channel.configureBlocking(false); channel.connect(addr); WorkState state=new WorkState(System.currentTimeMillis(),timeout,work); channel.register(selector, SelectionKey.OP_CONNECT, state); } catch (IOException ioe) { if (channel != null) { try { channel.close(); } catch (IOException ioe2) { System.err.println("Unable to close channel: " + ioe2); } } System.err.println("Channel creation or registration failed: " + ioe); return false; } return true; }

Listing Seven public static class WorkState { public final StringBuffer buffer = new StringBuffer(); public final Work work; public final long start; public final long timeout; public boolean success = false; public WorkState(long start, long timeout, Work work) { this.start = start; this.timeout = timeout; this.work = work; } public boolean isTimedOut() { return (System.currentTimeMillis() - start >= timeout); } }

Listing Eight public void run() { while (true) { try { int num = selector.selectNow(); if (num > 0) { processKeys(); } else { Thread.yield(); } } catch (IOException ioe) { System.err.println("Unable to select: " + ioe.toString()); } catch (InterruptedException ie) { // Continue processing

http://www.ddj.com

} } }

Listing Nine protected void processKeys() { Set keys = selector.selectedKeys(); for (Iterator iter = keys.iterator(); iter.hasNext();) { SelectionKey key = (SelectionKey) iter.next(); iter.remove(); WorkState state = (WorkState) key.attachment(); SocketChannel channel = (SocketChannel) key.channel(); try { if (state.isTimedOut()) { finished(channel, key, state); continue; } boolean connectable = key.isConnectable(); if (connectable && channel.finishConnect()) { // If the Channel is connected, setup the Channel to // write the HTTP message to the remote server key.interestOps(SelectionKey.OP_WRITE); } else if (key.isWritable()) { // If the Channel is finished writing, setup the // Channel to read the HTTP response if (doWrite(channel, state)) { key.interestOps(SelectionKey.OP_READ); } } else if (key.isReadable()) { // If the Channel is finished reading, call finished // to complete the work if (doRead(channel, state)) { finished(channel, key, state); } } } catch (IOException ioe) { System.err.println("Failure during IO operation"); finished(channel, key, state); } } }

Listing Ten protected boolean doWrite(SocketChannel channel, WorkState state) throws IOException { int rem = state.work.httpRequest.remaining(); int num = channel.write(state.work.httpRequest); // Continue writing until everything has been written return (num == rem); }

Listing Eleven protected boolean doRead(SocketChannel channel, WorkState state) throws IOException { buffer.clear(); decoder.reset(); boolean done = false; int num = channel.read(buffer); if (num == -1) { state.success = true; done = true; } else if (num > 0) { buffer.flip(); String data = decoder.decode(buffer).toString(); state.buffer.append(data); } return done; }

Listing Twelve protected void finished(SocketChannel channel, SelectionKey key, WorkState state) { key.cancel(); try { channel.close(); } catch (IOException ioe) { System.err.println("Failed to close socket: " + ioe.toString()); } finally { Work work = state.work; synchronized (work) { // Only if the Work was successful, parse out the HTTP response if (state.success) { String result = state.buffer.toString(); work.out = parseHTTPResponse(result); } work.notify(); } } }

Listing Thirteen public class NewIO { // Assume someone else setup the NIOWorker and passed it in public String execute(NIOWorker worker) { try { // Construct the URL and pass it to the worker URL url = new URL("http://example.orbitz.com"); return worker.doWork(url, 15000); // Wait 15 seconds } catch (IOException ioe) { System.err.println(ioe.toString()); } } }

Dr. Dobb’s Journal, September 2005

DDJ

39

Improving Performance With Custom Pool Allocators for STL An efficient method for providing allocation services ANTHONY AUE

W

hen preaching the benefits of C++, evangelists invariably mention things such as type safety, generic programming, and multiple inheritance when trying to convince devotees of other languages that C++ is better. For me, one of the most attractive features of C++ is the Standard Template Library (STL). Having access to a standardized, well-designed, and well-tested set of containers with algorithms to operate on them takes the load off programmers and lets us concentrate on the interesting parts of the code. At the root of the STL is an incredibly rich set of containers. The STL provides default implementations for contiguous storage (vectors), lists, sets, and maps. Most STL implementations have hash sets and hash maps, which will become canonical in the next version of the Standard. This rich set of containers lets you store data in the most efficient manner. If freAnthony is a software developer in the Natural Language Processing group at Microsoft Research. He can be contacted at [email protected]. 40

quent insertions and deletions are likely but the data does not need to be sorted by a key, the data can be put in a list. If constant-time lookup is a requirement, they can be put in a hash map instead. Best of all, if your requirements change, you can usually change container types without having to touch too much of your code. While the ideal of container-neutral code is unfortunately unattainable — see Item 2 of Effective STL: 50 Specific Ways to Improve Your Use of the Standard Template Library, by Scott Meyers (AddisonWesley, 2001)— much of the code that inserts, iterates through, and retrieves elements from the various container types is, thanks to the magic of iterators, similar if not identical. The Problem As is always the case, however, general solutions do not come without a price. With STL containers, one of the ways we pay that price is in the final template argument to all of the containers —std::alloc. Every STL container lets you specify the type of the allocator used to acquire storage for whatever it is you’re storing, as well as the container’s bookkeeping information. You don’t have to specify it, because the Standard Library provides a default argument for you in the form of std::allocator. The default implementations of std::allocator provided by STL implementations vary in complexity by the STL implementation from thin wrappers around new and delete to carefully built allocators running to thousands of lines of code. Different allocator implementations shine in different Dr. Dobb’s Journal, September 2005

scenarios. If you’re allocating a big block of memory, holding on to it for a long time, and then releasing it all at once (as is the case with vectors and deques), it may be

“It would be nice to have a whole family of allocators, each optimized for a different memory usage pattern, ideally sharing as much code as possible” hard to beat the minimalist wrapper around new and delete. If you’re using node-based containers (such as maps, sets, or lists), allocators optimized for smaller objects are probably a better way to go. In some situations, however, the standard allocator can penalize you heavily in terms of both time and space, and unless you’re paying attention, you won’t even realize it. There won’t be any memory http://www.ddj.com

(continued from page 40) leaks, and your program won’t crash (at least, not until it runs out of memory). You’ll just be using more memory than you should, and possibly spending a lot of unnecessary time in calls to allocate and delete. The problem is that std::allocator is, by definition, a general-purpose allocator, and yet there’s really no such thing as general-purpose memory usage. Different programs use memory differently. Even within the same program, different areas have different memory usage patterns. What To Do? I’m certainly not the first person to realize that this is a problem. Scott Meyers, for instance, talks about using alternatives to std::allocator (see Effective STL). Andrei Alexandrescu devotes an entire chapter of Modern C++ Design: Generic Programming and Design Patterns Applied (AddisonWesley, 2001) to an allocator for small objects. The Boost library provides two wrappers around its pool_allocator that are std::allocators. It would be nice to have a whole family of allocators, each optimized for a different memory usage pattern, ideally sharing as much code as possible. It should be possible to switch from one allocation strategy to another as easily as possible. Again, each program has different memory usage patterns, and this makes it difficult for memory allocator writers to write a general-purpose allocator that’s good for all cases. No matter how your memory allocator is designed, it’s possible to write pathological programs that result in fragmented memory and poor allocation times (see “An Estimate of the Store Size Necessary for Dynamic Storage Allocation,” by J.M. Robson, Journal of the ACM, 18(3):416 – 423, 1971). However, there are common usage patterns that are seen again and again, especially with standard containers. Once you know how your container is going to be used, you ought to be able to specify the allocation strategy that performs best with your particular usage pattern. In addition, the environment in which the code is being executed plays a part in design decisions. For instance, if an allocator is only going to be used in single-threaded environments, there’s no need to worry about protecting its data against access from multiple threads. The family of allocators just described is a perfect example of a case where policy- based design can really shine. Knowledge of how the container will be used, performance requirements on the container, the size of the objects that will be allocated, and the environment in which the code will be executed are all generally known at compile time. Ideally, 42

users of the allocator could specify the right combination of policies, and an allocator configured to perform well for the given situation would be generated at compile time. Design The allocator library I present here (available electronically; see “Resource Center,” page 3) is a pool allocator. Pool allocators are an extremely efficient method for providing allocation services for objects of fixed size s. The basic strategy is to reserve a chunk of memory sufficient for storing N objects of size s all at once. When the allocator is asked to provide a chunk of memory, it simply returns an offset mod s into the allocated chunk. This is far more efficient than calling operator new separately for each request because it avoids the bookkeeping overhead required of a general-purpose memory allocator that has to service requests for different-sized allocations. It’s an economy of scale. The decision to use a pool allocation scheme limits its usability in some cases. Because pool allocators are based on the idea that you’ll be allocating objects of a single size, using them with containers that are likely to request different amounts of memory with each allocation doesn’t pay off. This means that you won’t see the same performance boosts when using the allocator with large containers of type std::vector, std::deque, or std::string as you would in node-based containers — such as std::map, std::set, and std::list To guarantee O(1) amortized insertion time, std::vector and friends have to request memory proportional to the amount of memory currently allocated at each request. In fact, requests for allocations greater than 64 bytes will be shunted off to malloc and free. This isn’t such a serious limitation because the default memory allocator is generally optimized for dealing with large, variable-sized memory blocks anyway. A potentially more serious caveat is that, since the allocator uses nonstatic data, it’s not technically Standard compliant because the Standard requires that allocators of the same type be equivalent. See Effective STL (Item 10) for a thorough explanation of the issue. This amounts to requiring that an allocator for a given type be able to deallocate memory allocated by any other instance of an allocator for that type. For many uses of standard containers, this requirement is unnecessary (some might say Draconian). However, there are two cases where this requirement is absolutely necessary: list::splice and swap( ). The case of swap( ) is especially serious because it is needed in order to implement certain operations on containers in an exception-safe manner Dr. Dobb’s Journal, September 2005

(see Exceptional C++, Item 12). Technically, swap could be (and in some cases, is) implemented in the face of allocators that don’t compare equally— items could be copied or the allocators could be swapped along with the data — but this is not always the case. For this reason, if you’re using swap( ) or list::splice, you should make sure to use HoldingPolicySingleton; otherwise, you’re bound to run into some really nasty behavior. The lowest level of the allocator is the PoolChunk class, which represents a contiguous chunk of memory living at m_ pMem, big enough for NUMBLOCKS objects of size ALLOCSIZE. It uses a neat space-saving trick (used by many implementations of pool allocators) to store a linked list of memory blocks in the same space that it uses for servicing allocation requests. Upon construction, the beginning of each block is an integer pointing to the next free block. The object keeps track of the head of this list in m_pNext. When a block is requested, the head of the list is updated to point to the second chunk in the list, and the chunk of memory that used to be the head is returned. When a block is freed, the beginning of the block is updated to point to the old head, and this block becomes the new head. PoolChunk exposes a very basic interface: • PoolChunk(size_t allocsize_set, IndexType num_blocks_set) constructs a new PoolChunk for num_blocks_set objects of size allocsize_set. • bool In(void* p) simply tells you whether the memory at p comes from this PoolChunk. • bool Empty( ) tells you whether the chunk is empty. • bool Full( ) tells you whether the chunk is full. • void* allocate( ) returns the next item from the free list, or null if the chunk is full. • void deallocate(void* p) puts p at the head of the free list. The second layer of the allocator is the BunchOfChunks object, so named because it aggregates multiple PoolChunks. Because each PoolChunk has a static size, you need more than one of them in order to be able to satisfy an arbitrary number of allocation requests. Like the PoolChunk object, the BunchOfChunks object only handles allocation for objects of a fixed size. It does so by keeping a collection of PoolChunk objects and farming the requests out to them. The BunchOfChunks interface is even simpler than the PoolChunk interface. Aside from the constructor and destructor, only two public functions are exposed: http://www.ddj.com

• void* allocate( ) finds a nonempty PoolChunk and allocates a block of memory from it. • void deallocate(void* p) finds the PoolChunk that allocated p, and deallocates it from that PoolChunk. I’ve deliberately left the exact mechanisms for storing and accessing individual PoolChunks vague in this description, since these are details handled by BunchOfChunks’s STORAGEPOLICY (one of the required template arguments). The previous two objects provide allocation services for objects of fixed size. In order to be able to handle requests for allocations of arbitrary size, which is required of an std::allocator, you need to aggregate BunchOfChunks objects for various sizes. PoolAllocator, the third level in the design hierarchy, does just this. It keeps a vector of BunchOfChunks objects at m_vboc, sorted by size. Once again, the interface is straightforward: • void* allocate(size_t stNumBytes) does a binary search in m_vboc to find a BunchOfChunks for size stNumBytes. If it finds one, it just calls the allocate( ) method on it. If it doesn’t find one, it creates a new one and inserts it into the vector. • void deallocate(void* pDeallocateMe, size_t stNumBytes) finds the BunchOfChunks for stNumBytes and calls the deallocate( ) method on it. The final layer of the allocator is pool_allocator_stl, which contains the implementation of the std::allocator interface. It holds an instance of PoolAllocator and delegates allocation and deallocation requests to it. The multitiered design (individual chunk object -> collection of chunks -> collection of collections of chunks for various sizes) is a typical design for pool allocators — both boost::pool and Alexandrescu’s SmallObject follow similar designs. Readers will notice a more-than-passing resemblance to Alexandrescu’s SmallObject implementation, to which I am much indebted, as it was my first exposure to the issue of small object allocation. What makes this implementation more flexible is the rich set of policies for specifying the behavior of the allocator. Evaluation Evaluating memory allocators is notoriously difficult because every program uses memory somewhat differently. To measure allocation and deallocation directly, I wrote a drop-in std::allocator (implemented in trace_allocator.hpp, available electronically) that records each allocation and deallocation to a log file. I specified this allocator in containers that use memory in http://www.ddj.com

different ways, and then wrote a testing application that “plays back” the memory allocations and deallocations using different allocators, monitoring memory usage, page faults, and execution times. The four containers I attached the allocator to are: • simple_map: creates an int->int map, fills it with some integer pairs, then lets it go out of scope. Based on a quick survey of my colleagues, as well as a look through code I’ve written and/or maintain, this is a very common scenario. • fixed_ priority_queue: is based on a std::set of fixed size. The set is allowed to grow to a certain size. Thereafter, new inserts are only allowed if the item to be inserted is greater than the smallest item in the set. The smallest item is deleted and the new item is inserted. A real-world example of this would be an MRU cache. • remove_if: creates an int->int map and adds a number of integer pairs to it. Generate a random integer, call remove_if on the map with that integer. Repeat three times. • random: creates an int->int map. Insert a number of integer pairs to it. Then, for some number of iterations, randomly either insert a new pair or delete an old pair. Figures 1 through 4 depict the memory profiles for the cases I instrumented. I played each allocator trace back using allocators built with these compilers, in each case running once with the default (compiler included) STL, and once with STLport 4.6.2: • Microsoft Visual C++ 7.1 • GCC 3.3.3 under cygwin • GCC 3.3.3 under Linux Performance data for each combination of policies, on all configurations, is in the spreadsheet entitled “results.xls” (available electronically). Policies The actual definition of pool_allocator_stl is as follows: template< class T, class STORAGEPOLICY, template class ThreadingPolicy, template class HoldingPolicy, class GrowthPolicy, bool DeletionPolicy=true> class pool_allocator_stl.

Aside from the first template argument T, which is the type of object to be allocated, the rest of the arguments are all configurable policies that govern Dr. Dobb’s Journal, September 2005

various aspects of the allocator’s behavior. I’ll now give a quick description of each of the policies, along with some charts showing the effects of each policy on each of the data sets. The performance numbers I present for each policy are based on code compiled with the Microsoft Visual C++ 7.1 compiler using the default STL implementation, with aggressive speed optimizations turned on (/Ox /Os). StoragePolicy. The StoragePolicy is the most interesting and challenging part of the allocator. It dictates how individual PoolChunks are stored in a BunchOfChunks. It is also responsible for actually releasing the memory allocated by BunchOfChunks. The interface is defined by the following functions: • void Store(ChunkType* pStoreMe) stores the chunk in question for later retrieval. • void Purge( ) releases all memory from the container. • ChunkType* GetChunkForAlloc( ) returns a pointer to a nonempty PoolChunk in the collection, or NULL if there is not one. • ChunkType* FindChunkDealloc(void* p, iterator& itrRet) finds and returns the PoolChunk containing p. Also returns a StoragePolicy::iterator (necessary typedef ) via itrRet that points to the deallocated chunk. This enables us to efficiently remove the chunk from our collection if necessary • void DealWithEmptyChunk(iterator itr) is called by BunchOfChunks whenever a chunk becomes empty as the result of a deallocation. I’ve implemented three different StoragePolicies: • StoragePolicySimple keeps its PoolChunks in an std::list. When a chunk is requested for allocation, it looks in the chunk most recently allocated from. If there’s still room, it just returns that chunk. If the current chunk is full, it does a linear search in the list for a chunk with free space. Allocation and deallocation are O(c), where c is the number of chunks in the list. Deallocation is handled similarly. • StoragePolicyBoundary is similar to StoragePolicySimple, but it does a bit of extra housekeeping to enforce a boundary between full and nonfull chunks. This improves worst- case allocation speed to O(1), though deallocation is still O(c) in the number of chunks in the list. • StoragePolicyOrdered keeps its PoolChunks in an std::map. Allocation is O(c), but deallocation is O( logc). 43

Figure 5 shows the effect of StoragePolicy on speed in each of the contains I profiled. HoldingPolicy. The HoldingPolicy is a simple policy that lets you control the lifetime of the pool_allocator object held by pool_allocator_stl. It has only a single function, T& Get( ), which returns a reference to the encapsulated object. Currently, there are two HoldingPolicies: • HoldingPolicyStack, which puts the encapsulated object on the stack, to be cleaned up whenever it goes out of scope. • HoldingPolicySingleton, which provides access to a shared pool_allocator for all objects. ThreadingPolicy. ThreadingPolicy consists of a few simple typedefs: • VolatileType, inspired by Alexandrescu’s ThreadingModel, is a way to make sure

that data members shared across multiple threads are properly qualified as volatile. • MutexType indicates a type that must in turn have a scoped_lock typedef. Thread safety for multithreaded policies is guaranteed by wrapping all state-changing code in a scoped_lock. I’ve provided two separate threading policies: • ThreadingSingleThreaded is a policy in which the lock is not operational. As you might expect, it’s always preferred for single-threaded programs. You can use it in multithreaded programs as well, so long as you are using HoldingPolicyStack. Each container has its own copy of the allocator, and any container operations causing allocation or deallocation that might be called from multiple threads have to be locked by the client for thread safety anyway, so there’s no need to lock the allocator separately.

• ThreadingMultiThreaded is a policy suitable for use in multithreaded situations. It’s a thin wrapper around boost::mutex for maximum portability. GrowthPolicy. Each BunchOfChunks object has a GrowthPolicy member that’s responsible for calculating the size of the next chunk to allocate. The policy implements a single function, unsigned int NextChunkSize( ). Two GrowthPolicies are defined: • GrowthPolicyLinear allocates fixed-size chunks of size 255 each time. • GrowthPolicyExponential doubles the size of the last chunk. The GrowthPolicy basically controls the relationship between N, the number of objects allocated; and C, the number of PoolChunks encapsulated by the BunchOfChunks. This has a predictable effect on the performance guarantees of the container. When you use GrowthPolicyLinear,

Figure 1: simple_map.

Figure 4: random.

Figure 2: fixed_priority_queue.

Figure 5: StoragePolicy.

Figure 3: remove_if.

Figure 6: Speed for GrowthPolicies.

44

Dr. Dobb’s Journal, September 2005

http://www.ddj.com

C is O(N). When you use GrowthPolicyExponential, C is O(log(N)). This lets you optimize for time or space. For instance, recall that allocation and deallocation time using StoragePolicySimple were both O(C). If you use GrowthPolicyLinear, then allocation and deallocation time will be O(N), but the amount of wasted space will be O(1) because you’ll never be wasting space for more than 255 objects. If you use GrowthPolicyExponential, allocation and deallocation will be O(log(N)), but the amount of wasted space will be O(N). Figures 6 and 7 show the effect of GrowthPolicy on speed and working set size. DeletePolicy. DeletePolicy is just a Boolean flag that’s passed as a template argument to pool_allocator_stl. It controls whether objects are actually deallocated when deallocate( ) is called. When DeletePolicy is set to True, deallocation is performed as per normal. When it’s set to False, deallocation becomes nonoperational. There are plenty of times when this can come in handy. For instance, a very common memory usage pattern is to fill a container, perform some processing with the elements in it, save the results as temporary, and then destroy the object (or let

it go out of scope). In this case, you know you’re going to be getting rid of the entire container at once, so there’s no need to call deallocate( ) separately for each object. All of the memory will be released at once as soon as the allocator goes out of scope. Figure 8 shows the effect of DeletePolicy on speed. Comparative Performance As you can see in Figures 9 through 12, the graphs comparing different policies, the best set of policies to use can vary depending on the allocation pattern. To determine how much performance gain I could achieve over less-configurable solutions, I also ran each scenario against the default std::allocator for the STL I was using, as well as against boost::pool_allocator and boost::fast_ pool_allocator. In most platform/STL combinations, the allocator presented in this article performs better than the others. This is not surprising because, in each case, we were able to tune the instantiated allocator type according to the specific memory use at hand. One thing that puzzles me, however, is that the boost::fast_allocator is invariably faster on Linux when using the STL provided with GCC, sometimes by a

large margin. The most likely explanation seems to me that the allocator was probably optimized in this configuration. Conclusion I’ve presented here a highly flexible and configurable replacement for std::allocator for use with node-based standard containers. An efficient and configurable allocator is useful because of the widespread use of standard containers. On several occasions, I’ve been able to make substantial improvements to the performance and/or memory usage of programs by a simple one-line modification that specifies a different allocator for a given container. Hopefully, the allocator described here can be as useful to you as it has been to me — give it a try and see! Acknowledgments Thanks to Eric Niebler for pointing out the danger of swap( )ing containers whose allocators have state, and to Andrei Alexandrescu for pointing out the necessity of special allocators for small objects in the first place. DDJ

Figure 7: Peak Working Set for GrowthPolicies. Figure 10: Scaled speed on Windows, using Visual C++ and STLport.

Figure 8: DeletePolicy. Figure 11: Scaled speed on Linux, using GCC and default STL.

Figure 9: Scaled speed on Windows, using Visual C++ and the default STL. http://www.ddj.com

Figure 12: Scaled speed on Linux, using GCC and STLport.

Dr. Dobb’s Journal, September 2005

45

The Extensible Firmware Interface Putting firmware control in the hands of the developers CRAIG SZYDLOWSKI

E

very time a PC boots up, the basic input/output system (BIOS) initializes the system. This system boot process has remained virtually unchanged over the last 20 years, while other platform technologies have advanced significantly. The Extensible Firmware Interface (EFI) changes this (http://developer.intel.com/ technology/efi/). EFI is an open-standard specification that defines an interface between the operating system and platform firmware. This interface facilitates modifications and optimizations by board developers, opening the door to greater platform differentiation. Intel-architecture-based platforms have evolved since the PC AT hardware of 1981. Many of the platform components — CPU, buses, video, audio, and operating sysCraig is an engineer for the Infrastructure Processor Division at Intel. He can be contacted at [email protected]. 46

tems — have seen tremendous technological changes (see Figure 1). The firmware, however, has been stuck in a 16-bit world, founded in a time when 640 KB of memory was “more than anyone would need.” BIOS is still 16-bit assembler code while the rest of the platform is 64-bit and highlevel language capable. Helping firmware participate in the technological revolution, EFI supports 64-bit with preboot drivers and applications in C. EFI is designed to be a pure API between the platform firmware and the operating system. BIOS has traditionally provided this functionality for 32-bit Intel architecture-based platforms. EFI is an architecture-independent specification that defines the interfaces and structures the platform firmware must implement to successfully launch an operating system (Figure 2). Platform firmwarerelated information is abstracted using EFI data tables, while boot and runtime services can configure hardware and run applications before the OS launches. This capability supports many preOS features, such as initializing add-in cards, communicating with other devices on a network, and running diagnostics. The Intel Platform Innovation Framework for EFI (the “Framework”) is a productionready implementation of EFI (http:// www.intel.com/technology/framework/). The Framework is designed to support multiple architectures, including IA-32-, ItaniDr. Dobb’s Journal, September 2005

um-, and Xscale-based processors. This broad reach helps drive interoperability and communications across a wide range of networked devices, such as servers, appliances, storage, desktops, laptops, and PDAs. Most new Intel boards that ship in 2005 support the Framework for EFI, and the expectation is that many embedded boards will

“EFI defines an interface between operating systems and platform firmware” be EFI based in 2006. Platform developers can get the Framework through participating vendors, including the likes of American Megatrends (http://www.megatrends .com/) and Insyde Software (http://www .insydesw.com/). The Framework is not generally available directly from Intel. However, to assist platform firmware engineers who http://www.ddj.com

write, build, debug, and test drivers and EFI applications within the Framework’s preboot environment, Intel hosts a web-based collaborative software development community called “TianoCore” (http://www .tianocore.com/). EFI provides embedded board vendors and system developers new flexibility. This includes the ability to boot with both legacy and new EFI methods. For example, the Framework supports an optional Compatibility Support Module (CSM) that launches existing BIOS code. This lets you utilize legacy BIOS to initialize peripherals instead of developing new software. Unlike PCs, many embedded designs are not required to support a large number of legacy devices. A simple EFI implementation resembles a board-support package, with the minimum amount of code required to initialize the hardware and to load the operating system. Hardware boot times can be significantly decreased when legacy device support is not a concern. Keeping up with modern times, board developers can write EFI drivers and services in high-level languages such as C. This should make EFI- compliant code rather portable between designs. As BIOS functions migrate from 16-bit assembly code to C, firmware projects can be assigned more easily to general programmers as opposed to relying on specialized BIOS programmers. For instance, Examples 1 through 3 wait for a keypress and print the corresponding character to a console. The progression of these code snippets illustrate how the EFI facilitates the use of high-level languages. In Example 1, programmers need to be aware of the routines supported by the firmware. The implication is, you must be educated on the lowlevel details of the boot firmware. In Example 2, efilib is added and lets users use more generic print instructions instead of knowing about the ConOut routine supported by the firmware. Also, the code to wait for a character input is abstracted by not requiring a call to the “Bootservices” firmware routine. In Example 3, the atk_libc or “application toolkit” library lets the code be written with C instructions without any firmware-specific calls. This further abstracts the firmware from the program code. The EFI model is modular, which increases software reusability and eases development resource planning. For example, a driver for initializing a USB port for one platform can easily be ported to other boards. A modular approach lets code development be partitioned into independent modules, enabling tasks to be divided amongst a team of programmers. http://www.ddj.com

Debug and testing can be conducted in a modular fashion as well. EFI implementations, such as the Framework, support a development shell to ease software development. Within the shell, drivers can be invoked, tested and revised, and reinvoked without going through

painful steps such as reprogramming onboard flash. The shell greatly reduces driver development time by improving programmer interactivity between the hardware and the software. EFI also lets board vendors build more reliable systems. By writing EFI drivers

Original PC

Subsequent Years

Today

CPU Architecture

16-bit internal, 8-bit external

32-bit, virtual memory

64-bit extensions

Bus

8-bit ISA

Microchannel*, XT, EISA, PCI

PCI Express*

Video

Text

Color, VGA, SVGA, XVGA

SXVGA

Audio

PC speaker

Soundblaster*, AC '97

Intel High Definition Audio

Operating System

DOS

Windows* 3.1, Windows 95

Windows XP

Firmware

16-bit assembly based BIOS

Fundamentally unchanged

Fundamentally unchanged

Figure 1: Intel architecture platform component evolution. #include "efi.h" EFI_STATUS InitializeHelloApplication ( IN EFI_HANDLE ImageHandle, IN EFI_SYSTEM_TABLE *SystemTable ) { UINTN Index; SystemTable->ConOut->OutputString(SystemTable->ConOut, L"Hello application started\n"); SystemTable->ConOut->OutputString(SystemTable->ConOut, L"\n\r\n\r\n\rHit any key to exit this image\n\r"); SystemTable->BootServices->WaitForEvent( 1, &(SystemTable->ConIn->WaitForKey), &Index); SystemTable->ConOut->OutputString(SystemTable->ConOut, L"\n\r\n\r"); return EFI_SUCCESS; }

Example 1: Routines supported by the firmware. #include "efi.h" #include "efilib.h" EFI_STATUS InitializeHelloLibApplication ( IN EFI_HANDLE ImageHandle, IN EFI_SYSTEM_TABLE *SystemTable ) { InitializeLib (ImageHandle, SystemTable); Print(L"\n\n\nHelloLib application started\n\n\n"); Print(L"\nHit any key to exit this image\n"); WaitForSingleEvent(ST->ConIn->WaitForKey,0); ST->ConOut->OutputString (ST->ConOut, L"\n\r\n\r"); return EFI_SUCCESS; }

Example 2: Using more generic print instructions. Dr. Dobb’s Journal, September 2005

47

that execute at the start of the boot sequence, hardware can be protected against intrusion even before the OS loads. Board developers have greater control over the hardware shortly after reset, allowing them to pursue enhancements such as decreasing the time to secure the board or running TCP/IP stacks to communicate with other network elements. In addition, new services are enabled through the flexibility afforded by EFI. For example, EFI code can be loaded from a range of storage devices including disk, CD-ROM, and DVD as well as a remote location via a network. The ability to boot from a network source will usher in new platform-management features for IT departments as well as provide mechanisms to implement advanced diagnostics and manufacturing test automation. IT managers will appreciate the improved network manageability enabled through EFI applications. It will be easier to handle out-of-box configuration issues because EFI applications can facilitate the

Operating System

EFI OS Loader

EFI Boot Services

EFI Runtime Services

Platform Hardware

}

EFI defined interfaces and structures

Figure 2: EFI conceptual description. 1

Platform Initialization

2

3

Load EFI Images

Boot OS Loader

4

• Execute from reset vector • Initialize the basic platform

management of computers by other computers. Instead of manual and serial configuration of new network devices, EFI enables network managers to automate this process. Because EFI is extensible to many types of devices, it helps networks better identify and control diverse machines and improve device interoperability and network reliability. As opposed to the relatively closed nature of BIOS, EFI embraces modularity, high-level language, scalability, reusability, and enhanced debug environments to drive reduced development time and effort. Additionally, some embedded platforms may boot faster when firmware code is tailored to specific hardware configurations. In short, EFI is providing a new era of firmware efficiency. The EFI Booting Sequence Figure 3 illustrates a generalized boot sequence. This sequence is similar to BIOS today, but with the “hooks” provided by EFI for developers to add their own code. Upon system reset, the CPU, chipset, and main memory are initialized. This code is likely to be supplied by the silicon manufacturer or a BIOS vendor. Initialization of other components such as an Ethernet controller may be handled by traditional BIOS code supplied by an Independent BIOS Vendor (IBV) or perhaps through the component manufacturer. This code runs in 16-bit, real-mode, which is limited to a small region of code. This phase is intended to be short, representing around 15 percent or less of a booting sequence. For 32-bit Intel architecture platforms, the CPU transitions to “protectedmode,” which provides more memory and better support for high-level language programming. During the second phase, the EFI boot manager starts running the EFI firmware. EFI drivers, protocols, and applications are loaded and executed, most of which are written in C. These drivers and applications allow platform developers to add new features and differentiate their boards.

Drivers and applications are modular and can be developed and tested in a development shell. The shell provides an interactive console interface, loads EFI protocols and drivers, and supports simple scripting and batch file capabilities. This environment helps developers debug and test drivers more efficiently, especially during the integration of new hardware and features. Some EFI drivers, protocols, and applications will be available within an EFI development environment. EFI environments will provide code libraries to expedite platform firmware development. There are two types of EFI firmware services — boot and runtime. Boot services terminate once the OS loads and never run again. Runtime services remain after the OS launches and today are limited to manipulating nonvolatile RAM and the real-time clock. EFI applications execute platformspecific tasks within the booting sequence. Diagnostics, disaster recovery tools, and hard-drive-related operations are examples of tasks that can be implemented in a preOS environment. These routines will run in the development shell after the OS launches to make debugging easier. During the third booting sequence phase, the OS loader is loaded. If the OS loader is unable to load its operating system properly, control can be issued back to the EFI boot manager and alternative action taken. The last booting sequence phase occurs when the OS loads properly. The EFI boot services are terminated and the OS takes control. Conclusion EFI lets firmware evolve from a strict legacy-based BIOS structure to a more flexible and feature-rich environment. By putting more of the firmware control in the hands of the developers, EFI opens the door to new features and capabilities DDJ

• Load EFI drivers, protocols • Initialize buses, devices • Load EFI applications • Load EFI OS loader • OS failure mechanism

Handoff to OS Loader

• Boot services terminate • OS takes control of system

OS in Operation

EFI Defined

Code from supplier

#include EFI_STATUS InitializeHelloLibCApplication ( IN EFI_HANDLE ImageHandle, IN EFI_SYSTEM_TABLE *SystemTable ) { InitializeLib(ImageHandle, SystemTable); printf("Hello LibC application started\n\n\n"); printf("Hit C/R to exit this image\n"); return( getchar() ); }

OS Control

Figure 3: Boot sequence. 48

Example 3: Using C without any firmware-specific calls. Dr. Dobb’s Journal, September 2005

http://www.ddj.com

C++ Exceptions & the Linux Kernel Leveraging the power of C++ HALLDÓR ÍSAK GYLFASON AND GÍSLI HJÁLMTYSSON

E

xceptions are common in user-space programming. Most modern programming languages offer some form of syntactic constructs to handle exceptional events. The belief is widespread that the use of exceptions leads to more maintainable and robust systems, as errorhandling code is separated from the normal flow. Multiple modern programming styles and best practices encourage the use of exceptions to handle exceptional cases. When a function detects an error, an exception is raised, which directs the program flow to the nearest dynamically enclosing handler for that type of exception. Thus, exceptions are handled differently from normal procedure exits, and exceptions are transparently passed through functions that do not handle the error. In spite of their common use in user space, the use of exceptions in kernel space has been limited. In fact, some operating systems do not exploit higher level language abstractions at all. In particular, Linux is written in pure C. Whereas performance issues may negate the use of some modern languages such as Java, one of the driving factors behind the creation of C++ was for use in writing operating systems. Some constructs were specifically introduced and de-

Halldór was a research assistant at the Network Systems and Services lab at Reykjavik University and chief architect at Calidris. He is currently a Ph.D. student at the University of California at Berkeley. Gísli is a professor of computer science at Reykjavik University. He previously was a member of the technical staff at AT&T Bell Labs, where he researched networking systems and services, optical networking, router architectures, modeling, and performance evaluations. They can be contacted at [email protected] and [email protected], respectively. 50

signed based on observed patterns — and to address problems — in operating system implementations. Although C++ does not enforce strong type safety or other safety properties (such as Java), C++ offers an array of high-level language abstractions valuable for the construction of operating systems, and offers type safety and compiler support far beyond that of C. In particular, the safety provided by language-level polymorphism provides significant value as polymorphic behavior is widespread throughout any operating system. In our work on the Pronto software router (http://netlab.ru.is/pronto/pronto .shtml), we have used many of the advanced C++ constructs extensively, including classes and virtual functions to achieve clarity, flexibility, and extensibility. We have shown that these benefits come at no performance penalty compared to the Linux implementation (see “The Pronto Platform: A Flexible Toolkit for Programming Networks using a Commodity Operating System” by G. Hjálmtysson. Proceedings of OpenArch 2000, March 2000). Our desire to employ the full range of C++ abstractions in the kernel and, in particular, to use C++ exceptions in our work on Pronto is the driver behind the work presented herein. Of course, handling exceptional conditions is relatively expensive, regardless of the mechanisms employed for implementation. Substantial fraction of operating system code (for any operating system) is there to resolve exceptional conditions. Handling such conditions requires: • Detecting when an exceptional condition occurs. • Determining where (by whom) such conditions should be handled. • Doing the work needed to recover resources and otherwise handle the condition to return the operating system to a state from where it is safe to resume normal execution. The cost of the first and the last requirements are independent of the mechanisms employed to implement the second. The use of language-level exception handling translates into machinery providing complete and systematic approach Dr. Dobb’s Journal, September 2005

to the second — to identify where a thrown exception should be handled. The performance cost of using the language-level exception machinery must be weighed against both its nonperformance benefits and the total cost of handling a given exceptional condition. Important nonperformance quality metrics include reliability, robustness, flexibility,

“Employing a language-level exception mechanism at kernel level has not become common in practice” maintainability, and speed of development. In contrast, ad hoc exception-handling patterns, common particularly in the Linux kernel, consist of convoluted traces where exceptional function abort is communicated to the caller via an exceptional return value. The performance overhead of throwing exceptions in C++ is appreciable when compared, for example, to a simple return with an integer error code. However, in many cases, executing the exception-handling code dominates the total cost of recovering from an error condition. In those cases, the substantial benefits of exceptions warrant the relatively small cost. Clearly, throwing a C++ exception is more expensive than returning from a function. Therefore, there are cases where using exceptions should be avoided. The latter would typically apply when the exception handling is trivial and a simple return value is appropriate, or for exceptions that occur relatively frequently (and thus, perhaps constitute a branch rather than a true exception) or operate on such a fine time-scale that even a small overhead is burdensome. However, rather than voiding the viability of using languagelevel exceptions, the added cost instead http://www.ddj.com

determines the granularity appropriate for the use of exceptions, and/or the rarity at which the exceptional case must occur to justify the cost. Implementation of Exception Handling There are two main approaches to implementing exception handling in modern languages such as C++ (see “Exception Handling for C++,” by Andrew Koenig and Bjarne Stroustrup. Journal of Object Oriented Programming, July/August 1990). The first approach employs dynamic registration based on the setjmp/longjmp methods. This method manages a stack of execution contexts — each entry to a try-block pushes an execution context onto the stack through the setjmp method, while each exit pops an execution context off the stack through the longjmp method. This approach incurs some (albeit small) cost when entering try blocks, but has the advantage that it is portable in the sense that it is possible to generate ANSI C from a C++ program. The second approach is based on a statically generated table of program counter values that map try blocks into program counter values for exception handlers. When an exception is thrown, this table is searched for the appropriate handler using the current program counter value as entry point. This method incurs no runtime overhead when entering a try block (zero instructions), at the expense of increased overhead when throwing and during exception. The current GNU g++ compiler implementation, on which we have built our runtime library, implements exception handling using the table-driven approach. Former versions of the compiler used the setjmp/longjmp method, and that version can still be compiled into the implementation. The GNU implementation is contained in the (user-level) application binary interface (ABI), which accompanies the compiler as part of the standard runtime library. When using C++ exceptions, GNU g++ generates calls to the ABI; for example, the throw operator is transformed into a call to the ABI function _ _cxa_allocate_exception followed by _ _cxa_throw. Versions 3.x of GNU g++ implements the C++ ABI specification for the IA-64 (see “C++ ABI for Itanium: Exception Handling,” http:// www.codesourcery.com/cxx-abi/abi-eh .html). The aim of the C++ ABI specification is to standardize the object layout and the interface of the object code to the runtime system. Thus, code compiled with old versions of the compiler should be compatible with newer releases, and object code from different compilers should be compatible. Employing a language-level exception mechanism at kernel level has not become http://www.ddj.com

common in practice. Some operating systems are written at least partly in C++ but generally do not employ C++ exceptions. However, Windows NT provides a facility called “Structured Exception Handling” (SEH) (see “A Crash Course on the Depths of Win32TM Structured Exception Handling,” by Matt Pietrek. Microsoft System Journal, January 1997), which can be used in kernel device drivers. SEH is an operating-system facility and is, therefore, independent of any compiler or language. SEH uses dynamic registrations of try blocks and thus does affect normal program flow. Each thread is associated with a stack of exception registrations, which contain a function pointer to a handler function. Although SEH is similar to C++ exception handling, it has different semantics. In the SEH model, exceptions are thrown explicitly with the RaiseException Win32 routine and — in contrast to the C++ exception model — exceptions in the SEH model are singular integers. However, the SEH model also covers processor-level errors, such as divide-by-zero, access violation, and stack overflow, which requires support from the OS. When an exception is raised in the SEH model, the operating system calls the handler functions on the exception registration stack in sequence. Each handler function decides whether to handle the exception, to pass it through, or to resume execution at the point where the exception was raised. One drawback of SEH is that it does not call class destructors during stack unwinding; however, the SEH-specific _ _ finally block can be used to clean up resources. The Microsoft C++ compiler implements C++ exceptions on top of the SEH model, and as a consequence, a catch(…) block catches all C++ exceptions as well as processor-level errors, such as segmentation faults. Segmentation faults in Linux are not caught when compiled with the GNU g++ compiler. Implementing the C++ Runtime Support The first step in our work to support kernel-level exceptions was to create and include an implementation of the C++ ABI in the kernel. We started by carving out the user-level GNU ABI implementation as the basis for our implementation; however, simply porting user-level code to the kernel requires some changes. The malloc function used to reserve space on the heap for exceptions is replaced with the kernel-level kmalloc function. Furthermore, the GNU library is threadsafe and uses the pthread library for locking. However, as the pthread library is not available in kernel space, we modified the locking mechanism to use kernel-level spinlocks. Dr. Dobb’s Journal, September 2005

The semantics of the C++ throw operator requires the implementation to keep a stack of active exceptions to support the usage of throw without any arguments. In user space, this is done using thread-local storage. To achieve this effect, in kernel space, we augmented the Linux task_struct to include information on the active exceptions; see Listing One. When creating ELF executables, GNU g++ silently links in two object files at the front and the back (crtbegin.o and crtend.o, respectively). Furthermore, GNU g++ adds initialization code into the ELF .init section, and clean-up code into the .fini section. This is necessary to ensure that global constructors and destructors are run, and that the exception tables are registered with the ABI. To enable the usage of exceptions and global objects in the kernel, we modified the Linux makefile rule for the kernel image and kernel modules to link with those two files. Furthermore, we ensure that the initialization routines are called on module load by clever use of preprocessor macros, which modify the definition of the module initialization and finalization functions, module_init and module_exit. This is necessary because the kernel module loader in Linux pays no attention to the ELF .init section. We implemented several optimizations to the standard GNU implementation of exceptions because our original measurements indicated that the cost of using exceptions was somewhat high. One optimization concerns the fact that the standard GNU implementation unwinds the stack in two phases — the first phase locates a handler, and the second phase unwinds the stack to the handler. The rationale for this two-phased approach is that, in the case that no handler is found, the stack frames have not been destroyed and debuggers can inspect the state of the frame that threw the exception. However, for our use at kernel level, we don’t see this cost justified. In fact, we feel that even in user space, programmers should have the option of having the compiler optimize this debugging help out of the code. To enhance the performance of exceptions further, we included an optimization that improves the mechanisms used to search for the handler of an exception by caching frame state of functions, which is used to restore registers that have been saved on the stack. Our optimization # Stack Frames

Cost (µs)

1 2 3 4 5

2.14 2.52 2.85 3.21 3.59

Table 1: Absolute cost of throwing exceptions through a number of stack frames. 51

caches this frame state data in a hash table, indexed by program counter. When an exception is thrown the first time through a function, or more specifically, the first time through a certain place in the function, the frame state is computed and subsequently inserted into the hash table. Subsequent throws through this place result in a successful lookup in the hash table. The importance of this optimization increases as more exceptions are used in the kernel. This is because the time needed to locate the frame descriptor entry for a function is proportional to the number of modules that use exceptions and the number of functions within those modules. The cost of dynamic type checking in C++ is highly dependent on the method used to encode the runtime type informa-

tion in the objects. GNU g++ follows the traditional approach and associates with each class a type information object that encodes the type of the class as a mangled string and puts a pointer to this object in the virtual table for the class. GNU g++ uses weak symbols to reduce the dynamic type checking to a pointer comparison, thus avoiding the more expensive string comparison. Each time a class containing virtual functions is used in a source file, GNU g++ generates the virtual table, type information object, and type name string as weak symbols. The user-space linker, ld, ensures that there is only one copy of this object, which renders the simple pointer comparison sufficient. However, the kernel module loader, which in the 2.6 versions of the kernel is exclusively in kernel

space, does not handle these weak symbols and always relocates references to weak symbols to the weak definition within each object file. Therefore, multiple type information objects may exist for the same class, and pointer comparison becomes insufficient when doing dynamic type checking across kernel modules. To avoid this overhead, we modified the Linux kernel module loader to handle these weak symbols; the first time a weak symbol is encountered, it is added to the symbol map, and on subsequent encounters, the relocation is done relative to the first symbol. Using Kernel-Level Exceptions The C++ kernel-level runtime support for Linux provides complete runtime support for C++, including support for virtual functions, memory allocation operators, global constructors/destructors, dynamic type checking, and exceptions. The code is installed by applying a patch to the Linux kernel and enables the full use of C++ using the GNU g++ compiler. Using our new C++ kernel-level runtime support, programming in C++ at kernel level becomes similar to programming in user space. The compiler compiles files with the suffix .cc as a C++ file. However, the Linux kernel distribution is written in vanilla C, so consequently, C++ source files need to include C files to interface with the Linux core. This introduces a problem not commonly encountered in user space, as some of the C++ keywords have been used as identifiers in the Linux header files. To combat this, we have provided two inclusion files — begin_include.h and end_include.h — with our distribution that should be used to enclose the Linux C header files, as in Listing Two. These two files use #define and #undef, respectively, to redefine these identifiers to names accepted by the C++ compiler. In the following examples, we use a class hierarchy (Listing Three) where we group the exceptions according to subsystems, using inheritance. The top-level exception class —OSException— consists of a message, severity, and the virtual method report that, by default, prints the message through printk if the severity is MAJOR or FATAL. The other two exceptions defined are NetworkException, derived from OSException, and ProntoException, derived from NetworkException. System Calls The most straightforward use of exceptions in kernel space is in system calls. The Pronto architecture introduces three new system calls to the Linux kernel. New types of packet processors (an abstract data type for processing in the data path of the Pronto router) can be plugged into the operating system at runtime and their behavior

52

Dr. Dobb’s Journal, September 2005

http://www.ddj.com

manipulated through type-specific system calls that are dynamically linked through virtual functions. To promote safety, it is beneficial to catch all exceptions thrown by packet processors. Listing Four shows the use of exceptions to guard a system call. In this example, the sys_pproc_type_call is the entry point from the system call. Its only function is to dispatch a method invocation to thePProcKType, which is an object in a dynamically loaded module. To guard the dispatch, the virtual call is performed inside a try block. The system call catches all ProntoExceptions and calls the virtual function report. The packet processors can throw subclasses of ProntoException and customize the report function. Finally, all other exceptions are caught with the second clause. This could include processor-level errors if the OS provides support for mapping processor-level errors into catchable exceptions. Using Try Blocks in the Data Path The Pronto data path consists of a classifier that maps packets to flows. Each flow is associated with a forwarding path consisting of chains of packet processors. Each forwarding path may have multiple branches. Examples of packet processors include basic IP forwarding, tunnel entry/exit, NAT functionality, and more. Packet processors are dynamically added to the router at runtime. Listing Five shows how we employ exceptions in this critical part of the data path. A try block guards the processing of a packet as it is sent through the chain of packet processors associated with the flow they belong to (identified by the call

to the classifier above the try). As in the previous example, there are two catch statements, one catching all ProntoExceptions, the other catching all. It is worth noting, in this example, that as new types of packet processors, say for example IPSec tunnel entry, are introduced, they may in turn introduce new subclasses of the ProntoException, defining a new handler (the report method). This way, the Pronto data path is capable of performing type-specific exception handling for new, dynamically installed types. Evaluation For the purpose of measuring the absolute cost of throwing an exception, we implemented a kernel module that throws an integer out of a function. Our measurements were performed on an Intel Pentium 3, 996.859 MHz running the Linux 2.6.6 kernel that has been patched to include Pronto and the C++ runtime library. To put the numbers into context, we measured the performance of the Linux printk function, which is commonly used in exceptional circumstances to communicate error messages. The time to print a string of length 6 —printk("Error\n") — was measured to be 18.15 µs. Typical usage of printk include formatting the strings, which is even more expensive. As expected, the cost of exceptions is dependent on the number of stack frames that the exception is thrown through. Table 1 tabulates how the number of stack frames affects the cost of throwing an exception, using our optimized implementation. We observe that cost increases about 0.35 µs with each stack frame. However, when using other types of error-handling

# Stack Frames

Cost (µs)

1 2 3 4 5

12.70 15.43 18.12 20.46 23.09

Table 2: Absolute cost of throwing exceptions through a number of stack frames without our kernel-level optimizations. techniques, the cost also increases with the number of stack frames traversed. For comparison, the cost of using exceptions without our optimizations is tabulated in Table 2. We observe that the increase in cost for each stack frame in the GNU g++ implementation without our optimizations is around 2.5 µs. Hence, the effect of our optimization is even more impressive when throwing the exception through multiple functions. Finally, we measured the effect of using try blocks in the data path when no exceptions occur. We observed an increase of 2.8 percent in packet latency— consistent with the results of other writers. Using exceptions induces a slight overhead, even though no instructions are executed on entry to try blocks, which can be accredited to less successful optimizations of compilers. Summary In this article, we discussed our C++ kernel-level runtime support for Linux, which lets you use the full power of C++ in kernel-space programming, including global constructors and destructors, dynamic type checking, and exceptions. DDJ

Listing One

Listing Four

struct task_struct { volatile long state; struct thread_info *thread_info; ... #ifdef CONFIG_CXX_RUNTIME struct { void *caughtExceptions; unsigned int uncaughtExceptions; } cxa_eh_globals; #endif };

asmlinkage int sys_pproc_type_call(int pptype, int call, void* args) { int retval = -ENOSYS; try { if (thePProcKType) { retval = thePProcKType->syscall(pptype, call, args); } else { printk(KERN_ERR "pproc not loaded"); } } catch(ProntoException & exception) { exception.report(); } catch(...) { printk(KERN_ERR "Unknown Exception occurred"); } return retval; }

Listing Two #include #include #include #include

<end_include.h>

Listing Three

Listing Five

class OSException { public: char* getMessage(); OSException(char* msg,int sev); int getSeverity(); virtual void report(); enum tSeverity {MINOR=1,MAJOR,FATAL}; private: char* message; int severity; }; class NetworkException : public OSException { ... }; class ProntoException : public NetworkException { ... };

int pronto_ip_rcv(struct sk_buff *skb, .) { ... flow = classifier->lookup(skb); if( flow ){ try { flow->arrive( skb ); } catch(ProntoException & exception) { exception.report(); kfree_skb(skb); } catch(...) { printk("Unknown exception occurred"); kfree_skb(skb); } } ... }

http://www.ddj.com

DDJ Dr. Dobb’s Journal, September 2005

53

WINDOWS/.NET DEVELOPER

Multithreading .NET Apps For Optimal Performance Leaving it to software to improve performance ERIC BERGMAN-TERRELL

F

rom the 1970s to 2004, microprocessor execution speed increased at an amazing rate — from kilohertz to megahertz to gigahertz. But starting in 2004, CPU speeds have been increasing at a much slower pace. The main speed bumps are power and heat. Faster clock rates require more power, extra power creates more heat, and keeping the CPU from melting requires increasingly elaborate and expensive cooling mechanisms. CPU manufacturers are shifting their focus from increasing clock rates to improving the performance of multithreaded code. For instance, Intel’s HyperThreading technology lets an individual Pentium 4 CPU appear as two CPUs to Windows XP and other operating systems. Although a HyperThreaded CPU is still a single processor, HyperThreading can speed up multithreaded code up to 25 percent. Intel, AMD, and other manufacturers are also shipping multicore CPUs, which are simply multiple CPUs on the same chip. Multithreading is often the only way to achieve maximum performance from parallel CPU architectures. In this article, I show how to reliably multithread .NET applications. I include File Search (available electronically; see “Resource Center,” page 3), which I wrote to examine .NET multithreading. File Search was developed and tested with Visual Studio 2003/.NET 1.1 and Visual C# Express 2005/.NET 2.0 beta 2. .NET provides two main threading mechanisms — the Thread class and asynEric has developed everything from data reduction software for particle bombardment experiments to software for travel agencies. He can be contacted at ericterrell@ comcast.net. 54

chronous methods. Listing One shows how to explicitly manage threads. In the Main method, two Thread objects are instantiated with ThreadStart delegates as constructor parameters. When their Start methods are called, the methods pointed to by the ThreadStart delegates start running. These methods, HelloWorld1 and HelloWorld2, must be void and take no parameters. Calling the Join methods causes the program to block until the methods finish. The program exits after both HelloWorld1 and HelloWorld2 have finished. If the Join statements were removed, the program would still wait for the threads to complete because thread1 and thread2 are foreground threads. If the threads were changed to be background threads (by setting the IsBackground property to True), and if the Join statements were removed, the program would exit before the threads finished running. Thread objects are easy to configure. For example, you can change a thread’s priority by assigning a new value to its Priority property. Unlike Threads, asynchronous method calls (Listing Two) can have parameters. In the Main method workDelegate refers to a method (Work) that takes a string parameter. Each workDelegate.BeginInvoke call returns immediately. After each BeginInvoke call, the Work method is run on its own background thread. BeginInvoke returns an IAsyncResult object. The IAsyncResult.AsyncWaitHandle object can be used to block until the method returns. The WaitHandle.WaitAll method waits until both asynchronous method calls have finished. Asynchronous methods run on threads managed by the ThreadPool class. The ThreadPool class maintains a set of worker threads (25 per CPU, 50 per Hyperthreaded CPU), and creates and destroys threads as necessary. Asynchronous methods don’t let you directly manipulate the threads running the methods, but an asynchronous method can manipulate its thread once it’s running, by accessing Thread.CurrentThread. I decided to use asynchronous methods in File Search rather than explicit Thread objects because asynchronous Dr. Dobb’s Journal, September 2005

methods can be parameterized. File Search (Figure 1) uses asynchronous methods to simultaneously search multiple drives. For example, if the user searches the C: and F: drives in parallel, one asynchronous method searches the C:\ drive, another searches F:\.

“.NET provides two main threading mechanisms—the Thread class and asynchronous methods” To install File Search, navigate to the Setup folder and double- click filesearch.zip. If you use WinZip, you can automatically install by clicking the Install toolbar button. Otherwise, you may need to extract the .zip file to a hard disk folder and run setup.exe. To build File Search, extract the source code to a hard disk folder and load the .sln file into Visual Studio. After you’ve installed or built File Search, run it. Select the Settings! menu item. The number of CPUs on your system is displayed. HyperThreaded CPUs count as 2. Make sure that the Multithreaded search checkbox is checked, then press OK. Enter the file types to search, and optional search text. Select the Text or Regular Expression radio button. Then specify the drives to search by checking the drive letter checkboxes. Click the Browse button to specify individual folders and network drives. Then click the Search Now button. When the search begins, the Results tab automatically displays (Figure 2). Click a file in the ListView to display its contents and search for specific text. Notice the 4 and 5 on the right of the status bar. The number to the left, 4, is the number of completed asynchronous method calls. http://www.ddj.com

(continued from page 54) The number to the right, 5, is the total number of asynchronous method calls. Asynchronous Method Call Lifecycle When you click the Search Now button the Search.SearchAllFolders method is called (Listing Three). The actual searches are performed by the SearchFolders method. The SearchFoldersDelegate variable is used to call SearchFolders asynchronously. The AsyncCallback delegate points to the SearchCallback method. SearchCallback is automatically called each time an asynchronous method finishes running. The SearchFoldersDelegate.BeginInvoke calls return immediately. After the BeginInvoke calls have been made, the SearchFolders method is called from a pooled thread. If a pooled thread is available, it is used; otherwise, a new thread is created and added to the pool. The call to CreateDriveHashtable creates a Hashtable with a key for each drive being searched. The value corresponding to each key is an ArrayList of the folders in the drive to search. Once the Hashtable has been created, it’s used to call the asynchronous methods. For each key (drive) in the Hashtable, the SearchFolders method is called asynchronously with an array of folders to search. Because each

56

drive is searched by only one asynchronous method, searching the drive will not slow down due to thrashing. The last parameter (drive) in the BeginInvoke call is automatically sent to the SearchCallback method after the method finishes. It is available via the IAsyncResult.AsyncState property. SearchFolders calls FileUtils.GetFiles to navigate through folders and search for files. As GetFiles runs, it periodically checks the SearchInfo.Cancelled property. If users press the Stop Search button, the Cancelled property is set to True and GetFiles returns. The thread running GetFiles could be terminated by calling Thread.Abort, but I don’t recommend it. Thread.Abort attempts to terminate a thread by throwing a ThreadAbortException. This is a dangerous thing to do without knowing exactly where the thread is in its processing. Additionally, calling Abort on a suspended thread deadlocks the thread and the application. When SearchFolders returns, the SearchCallback method is automatically called (Listing Four). The SearchInfo.remainingSearches counter is decremented because one of the parallel searches has just finished. SearchCallback needs to update the status bar and the progress bar, but it must not directly call MainForm.Update-

Dr. Dobb’s Journal, September 2005

StatusBar and MainForm.UpdateProgressBar. Manipulating GUI components derived from the Control class (forms, labels, status bars, progress bars, and the like) must only be done on the GUI thread that created the component. Because SearchCallback is called from a different worker thread, the MainForm methods are called indirectly by the Control.Invoke method (the Form class is derived from Control). Control.Invoke takes two parameters, a delegate to the method being called, and an object array containing the method’s parameters. Control.Invoke calls the method on the thread that created the Control. Visual Studio 2003’s debugger does not notify you when your code manipulates Controls in a worker thread unless you put Debug.Assert(!InvokeRequired); statements in GUI code that is callable by worker threads. However, Visual Studio 2005’s debugger automatically reports these threading bugs. Race Conditions and Synchronization In multithreaded applications, variables can be accessed and updated simultaneously by multiple threads. This can cause race conditions (timing-dependent bugs) and data corruption issues that are notoriously difficult to debug. For example, consider the UnsafeCounter property:

http://www.ddj.com

private int unsafeCounter; public int UnsafeCounter { [MethodImpl(MethodImplOptions. Synchronized)] get { return unsafeCounter; } [MethodImpl(MethodImplOptions .Synchronized)] set { unsafeCounter = value; } }

The Synchronized attribute ensures that the get and set methods are only called by one thread at a time. This prevents one thread from assigning a value at the same time that another thread is reading it. But this is not sufficient to avoid a race condition. When a thread executes the UnsafeCounter++ statement, these native CPU instructions are executed:

terlocked class to increment and decrement counters atomically. It also uses the Synchronized attribute to ensure that the get and set methods are not called by multiple threads simultaneously. The ThreadSafeCounter is limited to 32-bit integer values. .NET 2.0 lets the Interlocked class manipulate 64-bit integers, but this can cause race conditions on 32-bit CPUs. The Interlocked class is ideal for synchronizing access to integers. The lock statement is a more general-purpose synchronization mechanism:

lock ( {expression} ) { {statements} }

The lock statement waits until no other thread is holding a lock on the expression. Then the expression is locked and the statements are executed. After the statements are finished, the lock is released and another thread can acquire a lock on the expression. The expression must be a reference type. Typically code locks on

// UnsafeCounter++; mov esi,ebx mov ecx,ebx // Call the get method call dword ptr ds:[0BFF63D8h] mov edi,eax // Increment value inc edi mov edx,edi mov ecx,esi // Call the set method to store the result call FFB91C23

Consider what happens if the UnsafeCounter has a value of 0 and two threads are about to increment the counter with the ++ operator. Here’s one possible outcome: 1. Thread 1 calls the get method, retrieves value (0). 2. Thread 1 increments the value to 1.

Figure 1: File Search’s search screen.

The operating system switches to Thread 2. 3. Thread 2 calls the get method, retrieves value (0). 4. Thread 2 increments the value to 1. 5. Thread 2 calls the set method to store result (1). The operating system switches back to Thread 2. 6. Thread 1 calls the set method to store result (1). In this case, the counter started with a value of 0, was incremented twice, and ended up with an incorrect value of 1. And this is only one of many erroneous outcomes that could be caused by the race condition. I wrote a ThreadSafeCounter class to prevent this race condition (Listing Five). The ThreadSafeCounter class uses the Inhttp://www.ddj.com

Figure 2: File Search’s results screen. Dr. Dobb’s Journal, September 2005

57

this to lock an entire object. To acquire locks in a static method, use the typeof operator to lock on the method’s class. See the lock statement in updateStaticMembers below: public class LockDemo { int w, x; static int y, z; public void updateMembers() { lock(this) { w = w * 2; x = w * 4; } } public static void updateStaticMembers() { lock (typeof(LockDemo)) { y = 33; z = z * 2 + y; } } }

method’s code requires mutual exclusion, it’s better to put only that subset in one or more lock statements. Make sure each lock is held only as long as necessary. You can use Monitor.TryEnter method to acquire a lock only if it’s available: if (Monitor.TryEnter(this)) { lock(this) { } }

Performance One of the main reasons for developing a multithreaded application is to improve performance. But even multithreaded apps can have performance problems. For example, creating too many threads can slow down an application. Every thread uses memory to store its stack and other state information. And it takes time for the operating system to context switch from one thread to a different thread. More threads mean more memory consumption and more context switches. For some applications you may want to make the number of threads proportional to the number of available CPUs. The NUMBER_OF_PROCESSORS environment variable specifies the number of CPUs (HyperThreaded processors count as 2). You can access this variable programmatically, see SettingsForm.cs (available electronically) for the details. The Synchronized attribute is a convenient way to make an entire method mutually exclusive, but if only a subset of the

This works because the lock uses the Monitor class internally. The lock statement calls Monitor.Enter before executing the statements in its body, and calls Monitor.Exit after the statements have been executed. Another way to reduce lock contention is to use the [ThreadStatic] attribute. When the [ThreadStatic] attribute is placed above a member variable, a separate instance of that variable is available to each thread. If synchronizing a member variable is slowing your application down, you may be able to mark it as [ThreadStatic] and remove the synchronization. For example, File Search’s asynchronous method uses a Regex member variable for regular expression searching. The member variable, FileUtils.regularExpression, is marked as [ThreadStatic] so there’s no need to synchronize it. Run perfmon to determine if your app is slowing down due to lock contention. Press the + toolbar button and select the .NET CLR LocksAndThreads performance object. Select the All Counters and Select Instances From List radio buttons. Highlight your program in the listbox and press Add and Close. Click on the View Report toolbar button and monitor the Contention Rate/sec and Total # of Contentions counters. Searching files for text is I/O-bound rather than CPU-bound. Consequently, searching multiple drives with simultaneously executing asynchronous methods is a major performance enhancement. For example, when I searched two local drives and four network shares, the multithreaded search was about 3.7 times faster than the single-threaded search. This dramatic speedup is not surprising. While one thread is blocked waiting for file I/O, another thread can search for text. Multithreading a CPU-bound program typically results in a more modest speedup.

Figure 3: NUnit detects a race condition.

Debugging and Testing Multithreaded Apps Debugging a multithreaded application full of race conditions is a nightmare. Here are some testing tips to make testing more effective and reduce the likelihood of race conditions. If you’re developing a Windows Forms application, test it with the Visual Studio 2005 debugger. Unlike previous ver-

Applications must synchronize access to any variable that can potentially be accessed by multiple threads. The Interlocked class is a high-performance option for integer variables. The lock mechanism and Monitor class are slower than the Interlocked class, but they can synchronize access to any type, not just integers.

58

Dr. Dobb’s Journal, September 2005

sions, the 2005 debugger automatically detects when a control is manipulated by a worker thread. Unit testing software like NUnit is extremely useful for multithreaded code. NUnit (Figure 3) is able to detect the multithreaded increment bug mentioned earlier. To see NUnit in action, download it from http://www.nunit.org/. Run the NUnit-Gui program. Select File/Open and open the FileSearchNUnitTests.nunit project file from the FileSearchNUnitTests project. Press Run and NUnit detects the race condition if it happens during the test run. NUnit test cases are synchronous. If you call asynchronous methods in NUnit, use WaitHandle.WaitAll to force the test case to wait for the asynchronous methods to complete. If your code uses Thread objects, force the test code to wait by calling Thread.Join. See NUnitTests.cs (available electronically) for sample multithreaded NUnit test cases. It’s worthwhile to test on a variety of machines with different performance characteristics. Some race conditions only show up on true multiprocessor machines. If your customers will be running your software on multicore or SMP machines, your test matrix needs to include multiprocessor machines. Conclusion Moore’s Law hasn’t been repealed. CPU manufacturers are still able to double the transistor count on a given area of silicon every 18 months or so. What has changed is the rate at which the performance of single-threaded applications can be improved. For at least the next few years, single-threaded applications will not speed up dramatically as customers upgrade their machines. But multithreaded applications will speed up if they take advantage of the parallel processing capabilities of newer HyperThreaded and multicore CPUs. Perhaps in the future, compilers will automatically parallelize .NET code and distribute processing across multiple CPUs. But for now, multithreading is the way to harness the parallelism of today’s computers. Asynchronous method calls are a convenient way to call parameterized methods on multiple threads. Just be sure to synchronize access to variables. Debug your Windows Forms code with Visual Studio 2005 and you’ll automatically find Windows Forms threading bugs. Finally, put NUnit to work testing your multithreaded code. Detecting any bugs, especially threading issues, during unit testing is much less expensive than finding them later in the development process or at a customer’s site.

DDJ http://www.ddj.com

Listing One

Listing Four

class ThreadTest { const int iterations = 100; public void HelloWorld1() { for (int i = 1; i <= iterations; i++) { Console.WriteLine("Hello World1 {0}", i); } Console.WriteLine("\nHello World1 finished\n"); } public void HelloWorld2() { for (int i = 1; i <= iterations; i++) { Console.WriteLine("Hello World2 {0}", i); } Console.WriteLine("\nHello World2 finished\n"); } [STAThread] static void Main(string[] args) { ThreadTest threadTest = new ThreadTest(); // Create the threads. The ThreadStart delegate must // refer to a void method with no parameters. Thread thread1 = new Thread(new ThreadStart(threadTest.HelloWorld1)); Thread thread2 = new Thread(new ThreadStart(threadTest.HelloWorld2)); // Start the threads. thread1.Start(); thread2.Start(); // Wait for threads to complete. Doesn't matter which thread finishes first. thread1.Join(); thread2.Join(); Console.WriteLine("Main finished"); } }

delegate void UpdateStatusBarDelegate(String Text); ... delegate void UpdateProgressBarDelegate(); private static void SearchCallback(IAsyncResult asynchResult) { // One parallel search just completed, so decrement // the number of remaining searches. int remainingSearches = SearchInfo.remainingSearches.Decrement(); UpdateStatusBarDelegate USBD = new UpdateStatusBarDelegate(Globals.mainForm.UpdateStatusBar); // If the search was not cancelled... if (!SearchInfo.Cancelled) { string message = string.Format("Finished searching {0}", asynchResult.AsyncState); // Update the status bar with a progress message. Globals.mainForm.Invoke(USBD, new object[] { message } ); } // If the last asynchronous method finished searching... if (remainingSearches == 0) { TimeSpan elapsedTime = DateTime.Now - SearchInfo.StartTime; // Update status bar to display total search time. Globals.mainForm.Invoke(USBD, new object[] { SearchInfo.Cancelled ? "Search cancelled" : "Search completed." + " Elapsed time: " + elapsedTime.ToString() }); SearchInfo.InProgress = false; } // Update the progress bar. UpdateProgressBarDelegate UPD = new UpdateProgressBarDelegate(Globals.mainForm.UpdateProgressBar); // You don't need to pass an object array to Invoke when the // method (UpdateProgressBar) does not have any parameters. Globals.mainForm.Invoke(UPD); }

Listing Two class AsynchMethodTest { const int iterations = 100; private delegate void WorkDelegate(string message); private void Work(string message) { for (int i = 1; i <= iterations; i++) { Console.WriteLine(message + " " + i); } Console.WriteLine("\n" + message + " finished\n"); } static void Main(string[] args) { AsynchMethodTest test = new AsynchMethodTest(); WorkDelegate workDelegate = new WorkDelegate(test.Work); WaitHandle[] waitHandles = new WaitHandle[2]; waitHandles[0] = workDelegate.BeginInvoke("Hello World1", null, null).AsyncWaitHandle; waitHandles[1] = workDelegate.BeginInvoke("Hello World2", null, null).AsyncWaitHandle; // Asynchrononous methods are run on background threads by default. // Wait for them to complete before letting the main thread exit. WaitHandle.WaitAll(waitHandles);

}

}

Console.WriteLine("Main finished");

Listing Three public static void SearchAllFolders(string[] allFolders, string searchPattern, string containingText, bool regularExpression, bool caseSensitive) { SearchInfo.StartTime = DateTime.Now; SearchFoldersDelegate searchFoldersDelegate = new SearchFoldersDelegate(SearchFolders); AsyncCallback asyncCallback = new AsyncCallback(SearchCallback); SearchInfo.Cancelled = false; SearchInfo.InProgress = true; // If user requested a multithreaded search... if (SerializeConfiguration.Settings.MultithreadedSearch) { // Don't want to thrash hard drives and optical drives. // Ensure that each drive is only accessed by one thread. Hashtable drives = CreateDriveHashtable(allFolders); // Keep track of how many searches will be done. SearchInfo.totalSearches.Set(drives.Keys.Count); SearchInfo.remainingSearches.Set(drives.Keys.Count); // For each drive being searched... foreach (string drive in drives.Keys) { // Get the folders on the drive to be searched. ArrayList foldersArrayList = (ArrayList) drives[drive]; // Convert ArrayList of folders to a string array. string[] folders = (string[]) foldersArrayList.ToArray(typeof(string)); // Call the asynchronous method with all the search parameters. searchFoldersDelegate.BeginInvoke(folders, searchPattern, containingText, regularExpression,caseSensitive, asyncCallback, drive); } } else // If user requested a single-threaded search... { SearchInfo.totalSearches.Set(1); SearchInfo.remainingSearches.Set(1); searchFoldersDelegate.BeginInvoke(allFolders, searchPattern, containingText, regularExpression, caseSensitive, asyncCallback, "all folders"); } // It's OK to directly manipulate the status bar because // this method was called from the GUI thread. Globals.mainForm.UpdateStatusBar("Searching..."); }

http://www.ddj.com

Listing Five // A ThreadSafeCounter contains an integer value that can be read, written, // incremented and decremented from multiple threads without data corruption // issues caused by race conditions. public sealed class ThreadSafeCounter { private int intValue; public ThreadSafeCounter() { intValue = 0; } public ThreadSafeCounter(int intValue) { this.intValue = intValue; } [MethodImpl(MethodImplOptions.Synchronized)] public int Get() { return intValue; } [MethodImpl(MethodImplOptions.Synchronized)] public int Set(int newValue) { int result = intValue; intValue = newValue; return result; } public int Increment() { return Interlocked.Increment(ref intValue); } public int Decrement() { return Interlocked.Decrement(ref intValue); } }

DDJ

More .NET on DDJ.com Custom Error Pages in ASP.NET Customized error-handling messages provide a more polished application for your users, and they can help performance, as well. http://www.ddj.com/documents/ddj050609asp/

Securing the Win32 File I/O APIs By creating an I/O firebreak with alternative custom API calls, your Windows boxes become impervious to attackers trying to exploit the standard Win32 APIs. http://www.ddj.com/documents/ddj050607sec/

Dr. Dobb’s Journal, September 2005

59

PROGRAMMER’S TOOLCHEST

Testing Web Applications Web testing presents unique challenges SEAN DAWSON AND KRISTIN KERR

O

ver the past five years, testing has evolved from retroactive (testing at the end of a cycle) to proactive (testdriven development). However, testing web applications is challenging. For Java servlet-based applications, the problem is due in part to the servlet’s dependency on the servlet container. Issues arise, such as how testing can be performed inside a container, how servlet requests made from outside the container can be tested, and how a JSP can be tested to ensure that the page is rendered correctly. Frameworks such as Tapestry (http:// jakarta.apache.org/tapestry/) further complicate testing. While these frameworks simplify the creation of dynamic user interfaces, they make it difficult to isolate areas of code that rely on a large number of components within the framework. In 2004, we worked on Hippo, a Java web application for managing undergraduate team programming projects. Hippo used the Tapestry web application framework and Hibernate (http://www.hibernate .org/) for managing data persistence. Automated unit testing was a guiding principle throughout development. The creation of new tests came to a standstill, however, when development moved on to the application’s web-based interface. To address this, we set out to extend our test framework so that developers

Sean and Kristin are senior undergraduate computer science students at the University of Toronto. They can be reached at [email protected] and kristinkerr@ gmail.com, respectively. 60

could automate the testing of the presentation layer. In this article, we describe how we integrated JWebUnit (http://jwebunit .sourceforge.net/) into Hippo’s existing test framework. The Testing Toolshed Servlet applications typically use the Model-View-Controller (MVC) design pattern. The servlet acts as the controller, receiving HTTP requests and dispatching to the appropriate domain objects. The servlet then selects a View (JSP) to render the response, possibly using a templating engine. While there is no one-size-fits-all framework to test these layers, the layered structure lets the testing be broken down into smaller stages. The Model layer, typically backed by a database, can be isolated and tested using standard unit testing. In this context, a testing framework, such as JUnit, is acting as the Controller; the Model is oblivious to how it is being used. The Controller layer is harder to test because interacting with the Controller (servlet) requires communicating using HTTP. Servlets usually live in a servlet container that dispatches the HTTP requests to the appropriate servlet. Testing the controller layer directly requires a testing framework, such as Cactus (http://jakarta .apache.org/cactus/), to simulate a servlet container. Testing the View layer can be done in a number of ways. First, you can view the entire application as a black box, sending requests to the servlet container and analyzing the returned response. This requires a testing framework, such as HttpUnit (http://httpunit.sourceforge.net/), to simulate HTTP requests; the generated HTML can then be analyzed by the tester to ensure correctness. To gain more control, the Model/Controller layers can be set to a predetermined state, at which point the rendered View can be verified. This strategy works well with templating frameworks that use data objects in combination with a set of templates to render the view. The testing Dr. Dobb’s Journal, September 2005

framework, rather than the Controller, supplies the data object, allowing for finegrained testing. This divide-and-conquer approach to testing is easiest to apply to applications designed with clear separation. For example, object-oriented programming simplifies unit testing by modularizing the

“Building and deploying a web application usually involves combining a number of external components”

code. Likewise, the MVC design pattern takes this idea to a higher level; functionality, rather than code, is modularized. For this reason, web-development frameworks that enforce this separation simplify both development and, in theory, testing. In practice, however, web-development frameworks can have the opposite effect. Complicated frameworks tend to hide the inner workings of an application, freeing the developer from dealing with mundane and error-prone details. While this is ideal for development, testing requires digging through the abstraction that the framework presents (Law of Leaky Abstractions; http://www.joelonsoftware .com/articles/LeakyAbstractions.html). If the framework is not designed with testing in mind, it may be difficult or impossible http://www.ddj.com

to interact with the framework internals. A vicious cycle soon follows, with developers reluctant to test, and frameworks less likely to be designed with testing in mind for lack of demand. In our case, we found the Tapestry 3.0 framework difficult to test, a point acknowledged by its creator, Howard Lewis Ship. Testing Tapestry code requires working with abstract classes, long lists of mock objects, and monolithic Tapestry code. Howard is planning to improve testing in the upcoming 3.1 release, but this did not help our project at the time. JWebUnit Many web-testing tools and frameworks are available, both commercially and through the open-source community. One of the first we looked at was HttpUnit, an open-source Java library that simulates a web browser, allowing external, black-box testing of a web application. By providing an API for interacting with HTTP servers, a live web application can be queried and the response analyzed. This API allows for submitting forms, parsing HTML, following links, and setting cookies. It also offers basic JavaScript support, and separate support for using a simulated servlet container. Ultimately we chose to use JWebUnit, an open-source tool for automated external testing of web applications. This test framework combines the functionality of HttpUnit with JUnit and includes an extensive set of assertions to analyze the returned HTML. Because JWebUnit extends JUnit, developers familiar with JUnit can start using this web-testing framework right away. The JWebUnit API contains methods to navigate web sites (by links or form submission) and analyze the returned HTML DOM tree. The two most important classes in JWebUnit are: • WebTester, which provides a high-level API for basic web-application navigation and validation by wrapping HttpUnit and providing JUnit assertions. • WebTestCase, which extends JUnit’s TestCase class, and acts as a wrapper for a WebTester object. To implement a test with JWebUnit, you use the WebTester API to navigate to a specific web page, then use the provided assert methods to validate the HTML returned by the server. Using JWebUnit is much like manual web browsing, except that you can automate the procedure and perform complex validation on the generated HTML. It runs against a live instance of the application to be tested, which means that the application must be deployed to a servlet conhttp://www.ddj.com

tainer during testing. Once JWebUnit is configured to point to the URL of the application, the various navigation and assertion methods can be called to browse the application and verify its operation. For instance, the sequence in Listing One tests the operation of a login form. Starting at the main page of the application, Listing One looks for fields named username and password, sets values for them, asserts the presence of a button called login, and presses the login button. The form is submitted to the application just as if users had entered and submitted those values from a browser. The returned web page is stored in the WebTester object; consequent actions, such as asserting a link is present, apply to this recently returned page. In this way, writing tests in JWebUnit is fairly intuitive, hiding the complicated details of HTTP requests and responses. Other methods allow assertions to be made about the existence and content of forms, tables, and so on. Integrating JWebUnit Prior to writing any tests, we had to modify Hippo’s build and test environment. Again, JWebUnit requires a functioning application to test against; the application must be built, initialized, and deployed to a servlet container. The first step was to automate the process of building and deploying to a servlet container so that the test suite could be invoked in a single step. Both Tomcat and Jetty (http:// jakarta.apache.org/cactus/integration/ integration_ jetty.html), the servlet containers our team used while developing Hippo, provide Java packages allowing servers to be managed from within an application. This let us create a wrapper around our test suite, automatically starting and stopping the servlet container during the web tests. Furthermore, building and deploying a web application usually involves combining a number of external components, many of which may be slow, bulky, and time consuming to set up. These issues are less of a concern with standard unit testing, where the functionality being tested tends to be more isolated. In our case, Hippo requires a functioning Subversion server to provide version control. This places an unwanted dependency on developers wishing to run the web test suite; either a local Subversion server must be running or Hippo must be configured to use a remote Subversion server. The former would be inconvenient and time consuming to set up, the latter slow and dependent on network access. We removed this dependency by stubbing out the Subversion layer. First, we Dr. Dobb’s Journal, September 2005

specified a custom version-control interface; classes implementing this interface are expected to provide only the functionality required by Hippo (for example, creating a repository or adding a file). Interaction between Hippo and the versioncontrol layer occurs through this interface. We then created two version-control implementations: The first encapsulates the functionality provided by the official Subversion Java bindings, while the second is a trivial implementation consisting of empty methods. The version-control implementation is then chosen at runtime using a factory method, letting testing proceed without a functioning Subversion server. The trivial implementation was sufficient for our purposes because interaction with the Subversion repository was limited to a small section of code. Had it been more widespread, the behavior of Hippo would have changed dramatically during testing. The middleground between the trivial and functional implementations is an implementation that simulates a versioncontrol repository in memory, using a simple tree-based data structure to represent the repository. Version-control functionality can then be tested without having to interact with a slow Subversion server. It could be argued that using a virtual implementation, whether trivial or simulated, is ineffective — how can we find problems in the application if we remove the functioning components during testing? Well, in some sense we cannot. There are bound to be subtle bugs lurking that only surface when we use the functional component. Such examples include filesystem synchronization, database locking, network failures, and so forth. While a virtual implementation cannot discover these bugs, it can discover those bugs that would surface regardless of the implementation used — the quick and dirty bugs. These bugs are arguably more common, and thus more important to catch in a rapid edit-test-debug development cycle. Because virtual implementations tend to be an order of magnitude faster than their functional equivalent, testing can be done early and often. Due to the plug-in nature of the implementations, the functional implementation can be transparently substituted into the application, allowing real testing to be carried out. The insidious bugs that may have escaped the initial tests will then be discovered. This approach of “test once, and then test for real” allows testing to be partitioned into stages, particularly important if the testing process is time consuming. We also needed to modify the Tapestry layer of Hippo. JWebUnit uses anchors to identify components in HTML source code. For instance, to locate a table 61

in the source HTML of a web page with the string , the ID table23 is used as an anchor. Tapestry uses static HTML templates as a guide for generating dynamic HTML, but the anchors produced are neither predictable nor consistent. For example, this static HTML:

generates this HTML during runtime by Tapestry:

The dynamic anchor ($TextField$0) is determined by Tapestry at runtime and changes depending on the composition of the static HTML template. Because Tapestry automatically maps code vari-

ables to HTML variables, you aren’t required to ensure that anchors in the HTML match those in the code. This makes it difficult to analyze the generated HTML; without consistent anchors, it is impossible to locate tables, forms, and so forth. To get around this, we instructed Tapestry to insert specific anchors. For instance, the revised static HTML:

generates a consistent anchor in the HTML:

Fortunately, this only required a small amount of changes to the various static templates. Had the previous Hippo team been designing with testing in mind, they would have already added these tags.

The last major change we made was related to cleaning and initializing Hippo’s data layer. Effective unit tests are independent; the outcome of any particular test should not be affected by the order in which it is run within the suite. In most cases, independence can be attained by initializing the testing environment to a predetermined state prior to running a test. In fact, this is such a common procedure that JUnit provides built-in methods, setUp( ) and tearDown( ), which are called before and after, respectively, every unit test. Initializing the environment for an isolated class is usually straightforward. Testing against a live application, however, complicates the matter. We now had to be concerned with reinitializing every component of the data layer (database, subversion repository, and so on) without direct access to any of the internal data structures used in Hippo. We implemented a set of hooks allowing our external testing framework to trigger the initialization cycle while Hippo was deployed. The easiest way to implement this was by adding two additional submit buttons (reset, rebuild) to the main login form of Hippo that would perform the necessary actions once invoked. JWebUnit could then be configured to click these buttons prior to every web test, allowing each test to run independently of the others. The obvious drawback to this scheme is the slowdown resulting from a costly initialization cycle every test. This slowdown can be offset somewhat by using mock objects and in-memory data structures whenever possible. For example, some databases can be configured to run entirely in memory rather than constantly reading and writing to disk, a wellknown performance bottleneck. This is another example of the “test once, and then test for real” concept we described earlier; in-memory databases may mask subtle bugs, but they are generally much faster than databases on disk. Implementing the Test Framework Similar to JUnit, creating test-case classes using JWebUnit requires extending WebTestCase or any class that subclasses it. We chose to extend WebTestCase with a new base class, HippoWebTestCase (see Figure 1). This gave us a place to put common set-up code and utility methods that could be used by all tests. First, we overrode the setUp( ) method to handle configuration of the test environment. In particular, JWebUnit was configured to point to the application’s base URL. Since we deployed the application on our own local machines for testing

62

Dr. Dobb’s Journal, September 2005

http://www.ddj.com

purposes, we set the base URL to “http:// localhost:8080.” We also used the setUp( ) method to handle cleaning and initializing of the database, as discussed in the previous section. Then we added a number of utility methods to handle common testing scenarios. For example, we created a login method that would execute the sequence of steps required to login to Hippo (go to the login page, fill in form parameters, and submit the form). Other methods were created to handle common navigation sequences, such as navigating to the main Project configuration page, a prerequisite for doing any project-related testing. We decided to start all navigation sequences at the login page, regardless of the current position within the web page. Because the front end of the application is built on Tapestry, which generates HTML pages dynamically at runtime, we could not always predict the current state of the application. By beginning at the login page, we could help ensure consistent navigation sequences. Also, jumping directly to a specific web page was not possible as Tapestry performs a complex URL transformation depending on dynamic content. Unfortunately, this adds to the running time of the tests, as each navigation sequence requires that much more data to be sent over the network. Listing Two is an example of a typical navigation sequence. This method reproduces the sequence of steps a user would take when manually moving through Hippo. The followLink(String linkId) method is a custom Hippo method that uses the JWebUnit API (http://jwebunit.sourceforge.net/apidocs/ index.html) to first assert that the link is present with a text value of linkId, then calls clickLinkWithTest(linkId). Finally, we added a number of operations that could be viewed as atomic to the developer. For example, you may wish to add a project and then assert that the project was indeed added. Listing Three executes the sequence of actions to add a project to Hippo. Now, a test class can be created by extending HippoWebTestCase and creating test methods using these utility methods Listing One getTestContext().setBaseUrl("http://localhost"); beginAt("/main"); setField("username", username); setField("password", password); assertSubmitButtonPresent("login"); assertSubmitButtonPresent("reset"); submit("login");

Listing Two

and the JWebUnit API to browse Hippo, execute actions, and make assertions about the results. Conclusion By using JWebUnit, we can now perform functional testing on Hippo. We have only addressed one facet of testing, so our

“Given hundreds of tests, it’s impractical to frequently run the entire test suite” work isn’t quite done yet. Our decision to use JWebUnit was primarily influenced by the difficulty associated with testing a Tapestry application. While Tapestry is a terrific framework to develop with, the testing support needs to be improved. Our testing framework is not perfect. If you’re accustomed to running hundreds of unit tests in a few seconds, you may be frustrated when running JWebUnit tests. As discussed, JWebUnit requires interacting with an application over a network. Even on a local network, the time required to send requests between the testing framework and the application quickly ac-

DDJ

TestCase (JUnit)

WebTestCase (JWebUnit)

Server Running Hippo

HippoWebTestCase HTTP TestLogin

TestAddProject

TestAddUser

Figure 1: Web testing architecture. Listing Three protected void addProject(String parentProject, String project, String url) { gotoProjectAdmin(); followLink("Create a project"); assertFormPresent(); setField("projectName", project); setField("parentProject", parentProject); setField("courseURL", url); submit("create"); }

private void gotoProjectAdmin() { login(); followLink("System Administration"); followLink("Manage Projects"); }

http://www.ddj.com

cumulates. On top of this, the application itself may have to perform time-consuming tasks, such as interacting with the filesystem or querying a subversion repository. For example, each JWebUnit test in our framework takes, on average, one to two seconds to complete. Given hundreds of tests, it’s impractical to frequently run the entire test suite. Moreover, JWebUnit presents some tradeoffs developers will have to live with. Using a functional testing framework like JWebUnit is arguably less beneficial to the developer than standard unit testing with JUnit. This is because higher level approaches to testing make it difficult to test functionality in isolation; as a result, it’s not always self-evident what went wrong in a JWebUnit test because the entire application is working in concert. On the other hand, functional testing frameworks allow the developer to simulate user interaction with the application. A comprehensive suite of JWebUnit tests can, therefore, save you countless hours of manual testing. The fact that our framework has its problems points to the central issue — there is no single solution for testing. Rather, testing often requires many different strategies to fully exercise an application. Testing a web application presents more challenges, but applications designed with testing in mind will go a long way to encouraging testing in the future.

DDJ Dr. Dobb’s Journal, September 2005

63

Finding Binary Clones With Opstrings & Function Digests: Part III Implementation and testing ANDREW SCHULMAN

C

an a few extracts from a piece of binary code be sufficient to identify that piece of code in a large database? As discussed in the two previous parts of this article, a piece of binary code’s noteworthy features can be extracted and placed into an opstring, an ordered set of instruction mnemonics or operations (“ops”). Noteworthy features include atypical instructions, system or API calls, branch-target locations (places the code may jump to), and the use of unlikely “magic” numbers. In x86 Win32 code, one such opstring would be: “test, jz, call, test, jz, call, ret, loc, [HeapFree], loc, ret”. Using an algorithm such as MD5, this can be turned into a function digest such as 65E482BF6A2F391D84D243D9E02244D7. If the function’s name is known, either from debug symbols or from a Win32 DLL export table, the function digest and name can Andrew is a software litigation consultant who works on the technical aspects of cases involving copyright, patents, trade secrets, antitrust, and privacy. He can be contacted at [email protected] or http:// www.undoc.com/. 64

be added to a function database. The “test,jz,...” example and its MD5 signature happen to correspond to one implementation of the free C runtime library (RTL) function. If, in some other program, a function were encountered with a digest of 65E482BF6A2F391D84D243D9E02244D7, it could with a fair degree of certainty be identified as free. How much certainty is one of the subjects of this third and final article on binary code lookup and comparison. Generating additional symbolic information for otherwise anonymous code — using symbols from one file to ID code in a different file — can tell you, when staring at code in a debugger, that some function is the free memory-deallocation function; this beats trying to read and understand the code. However, that is a fairly limited use. Given a large database with MD5 signatures for every function in every Win32 file preinstalled on a typical PC or on the Windows XP CD, you can measure the amount of code duplication in the system, classify modules by sorting them into sets based on how much code they share in common, identify code that should perhaps be moved into a shared library, or find multiple instances of flawed code. Excluding Boilerplate I had a different purpose when I started this project: Not to use or study the contents of the function database, but instead to filter it out as noise. The function database was intended merely as a list of “boilerplate” code — stereotyped verDr. Dobb’s Journal, September 2005

biage — that could be ignored when comparing binary modules in copyright infringement litigation. Almost all litigation happens outside a courtroom. At least in the U.S., the majority of legal cases never reach trial (the system would collapse if more than a handful did). Cases are settled based on each side’s estimation of the “value” of the case, determined in part by what each

“Different compilers can take the same source code and produce fairly different binary code” side learns about both sides to the case. Modern rules of “discovery” require each side to turn over to the other side (“produce”) certain types of information, such as potentially relevant internal corporate e-mails and other documents (for typical examples, see http://www.usdoj.gov/atr/ cases/ms_exhibits.htm). In software copyright litigation, this includes each side’s source code. Each side’s expert compares the two sets of source code, looking for literal similarities that shouldn’t be there, and for signs http://www.ddj.com

(continued from page 64) of nonliteral copying. (Each source tree may consist of millions of lines of code, so this process must be automated. It’s a bit harder than it sounds. Hints: diff is not the right tool; associative arrays are nearly essential.) Even open-source software is faced with issues like these: What percentage does GNU represent of a typical Linux distribution? Is a commercial vendor reusing open-source code without abiding by the GPL (see gpl-violations.org)? Has a contributor used copyrighted code? Binary comparison is useful in part because source is sometimes unavailable. Not surprisingly, litigants sometimes resist turning over their source code (“the crown jewels”) to the other side, even

with court protective orders limiting who may see the source code. Also, a litigant only sees the other side’s source code after it becomes an actual litigant. Thus, a software copyright complaint will generally be issued before the plaintiff has seen the defendant’s source code. In the U.S., complaints can be filed “upon information and belief,” and can later be amended. Still, a plaintiff should try confirming their suspicions. Meanwhile, a potential defendant might also have concerns (“I wonder if my programmers have been ‘borrowing’ code from their previous jobs”). If and when both sides’ source code becomes available, the binary comparison is a double check, and may pinpoint specific source-code areas to investigate.

(a) int foo(int x, int y) { while (b(x,y)) if (c(x,y)) return 1; return 0; } (b) 004011F0 004011F0 004011F1 004011F2 004011F4 004011F6 004011FB 004011FD 004011FF 004011FF 00401201 00401203 00401208 0040120A 0040120C 0040120E 00401210 00401215 00401217 00401219 00401219 0040121A 0040121C 0040121D 0040121E 0040121E 0040121F 00401224 00401225

56 57 8BF2 8BF9 E869FEFFFF 85C0 741A 8BD6 8BCF E857FEFFFF 85C0 7512 8BD6 8BCF E84FFEFFFF 85C0 75E6 5F 33C0 5E C3 5F B801000000 5E C3

@foo@8: push esi push edi mov esi,edx mov edi,ecx call fn_00401064 ; _b test eax,eax jz loc_00401219 loc_004011FF: mov edx,esi mov ecx,edi call fn_0040105F ; _c test eax,eax jnz loc_0040121E mov edx,esi mov ecx,edi call fn_00401064 ; _b test eax,eax jnz loc_004011FF loc_00401219: pop edi xor eax,eax pop esi ret loc_0040121E: pop edi mov eax,1 pop esi ret

(c) call,test,jz,loc,call,test,jnz,call,test,jnz,loc,xor,ret,loc,ret (d) EEC1E2A53DA3D4FCF3E57850F302412D (e) int bar(int x, int y, int z) { while (g(x,y,666)) if (h(1234,x,y)) return 1; return 0; }

Figure 1: (a) A small C function; (b) disassembly of a compiled version; (c) its opstring; (d) its function digest; (e) and a slightly different C function that yields the same opstring and function digest. 66

Dr. Dobb’s Journal, September 2005

When detecting similarities in code — be it source or binary— there must be a baseline for comparison. Comparing any two otherwise-unrelated source- code trees, written in the same programming language and/or targeting the same operating system, yields some exact matches that nonetheless don’t reflect copying, such as return 0; and for (int i=0; i
if it’s not found in the standard function database; see Computer Associates v. Altai.) Does “Structurally Similar” Really Mean “Clone”? As noted, the function database contains MD5 digests of opstrings, each of which is a string of the noteworthy features of a function. For example, the C source code in Figure 1(a), compiled to produce something like the code in Figure 1(b), yields the opstring in Figure 1(c) and the MD5 digest in Figure 1(d). The instruction mnemonics have been extracted from the disassembly, the common mov, push, and pop instructions discarded, and the remaining mnemonics placed into the string, along with “loc” pseudo-ops, indicating the targets of the jz and jnz instruction. The operands have been ignored. While Figure 1(c) plainly abbreviates the code in Figure 1(b), it is less clear that Figure 1(c) truly represents Figure 1(b), much less its source in Figure 1(a). After all, wouldn’t a lot of other code, looking nothing like Figure 1(a) or Figure 1(b), also abbreviate to Figure 1(c)? It turns out that the answer is no in this particular case. A database of about 775,000 function digests, representing most of the binary code in Windows XP SP2, did not contain a match for the function digest in Figure 1(d). Even having discarded so much of the function, what’s left are reasonably reliable indicators of the function’s identity. Knowing that the opstring algorithm ignores mov, push, and pop, you could easily construct a function that differed from Figure 1(a) but that yielded the same opstring as in Figure 1(c). For example, Figure 1(e) appears to be a false positive, but only under a literal definition of “clone,” in which even obvious cut-and-paste variations are treated as nonmatches. As noted in Part II of this series, Opstring was written to ignore minor differences. In other words, this is a feature, not a bug. It is important to have a uniform method for generating the function database, but a testing version of Opstring (available electronically; see “Resource Center,” page 3) does provide command-line options to control what is included or excluded in an opstring, allowing for different definitions of “clone.” If two opstrings match, then — given a minimum opstring length — the binary code from which they were generated (and, to a lesser extent, the overlying source code) will be sufficiently similar that someone eyeballing them would regard them, perhaps not as identical, but as a good match, such that one could be a cut-and-paste variation of the other. In http://www.ddj.com

other words, we are backing into a working definition of “clone,” based on what this tool can find. A binary-code “clone” can be defined as a match found by Opstring. Obviously, not all tools are worth listening to. A “tool” that hashed any piece of code, of whatever length, to a single byte, would “tell” us which of the only 256

“Opstrings allow textual methods to be applied to binary code”

types of code in existence a given function fell into. Like classifying people by their astrological signs, it would provide only the illusion of usefulness. There’s a similar danger here, because we’re usually looking at compiler output (rather than handwritten assembler), and one can expect great regularity, even given infinite possible source-code input, because the compiler acts as a funnel, narrowing infinite varieties of source into a relatively small number of microprocessable structures.

Testing for False Positives Where symbolic names are available, one test for false positives checks for ostensibly matching functions that have significantly different names. Multiple function names mapped to the same function digest (MD5 signature) may indicate that the opstrings from which the signatures are created are not good indicators of the function’s identity. In my first test, nearly half of the named signatures appearing more than once had more than one name. However, browsing through the test program’s output showed that nearly all the mismatches were C++ mangled/decorated names that differed from each other in only minor respects, such as ?Unlock@Perimeter@@QAEXXZ versus ?Unlock@RWLock@@QBEXXZ, or ?Remove@foo versus ?Update@foo versus ?Set@foo. These functions tended to be shorter than the system average. In many cases, the seemingly mismatched names were all located in a single file, indicating that indeed the functions were related, despite the different names. I also found many cases where, despite having used Microsoft’s symchk utility, PDB symbols and code were misaligned. When I reran the test, this time looking only for cases where code appears in more than one file and where a given symbol is used in more than one file (to eliminate PDB misalignments), ignoring all nonalpha characters, and looking only at the first five characters of the function name, the result was still that almost 3 percent of the signatures had one or more names that apparently mismatched the others in its first five characters. Of the apparent mismatches, many were C RTL functions, such as isdigit, islower, and ispunct mapping to the same signature. That Opstring currently treats, say, islower and isupper as the same function

65 . A675B247E3DA5CC0CFC0418AE0A33607 taskkill.exe - _wmainCRTStartup 65 . A675B247E3DA5CC0CFC0418AE0A33607 systeminfo.exe - start 65 . A675B247E3DA5CC0CFC0418AE0A33607 syskey.exe - _wmainCRTStartup ... 63 . E65204032B23E3DE64472CE99203EF34 wmspdmoe.dll - _m6_ownBitRev1_Z 63 . E65204032B23E3DE64472CE99203EF34 wmspdmoe.dll - _a6_ownBitRev1_Z 63 . E65204032B23E3DE64472CE99203EF34 wmspdmod.dll - _m6_ownBitRev1_Z 63 . E65204032B23E3DE64472CE99203EF34 wmspdmod.dll - _a6_ownBitRev1_Z ... 27 . D40F1D68E0D3D7859498B13925213272 provthrd.dll ?HexCharToDecInteger@ProvAnalyser@@SGKD@Z 27 . D40F1D68E0D3D7859498B13925213272 msmsgs.exe ?HexCharToDigitA@@YGHD@Z 27 . D40F1D68E0D3D7859498B13925213272 dpwsockx.dll - _GetDigit@4 27 . D40F1D68E0D3D7859498B13925213272 dplayx.dll - _GetDigit@4 27 . D40F1D68E0D3D7859498B13925213272 dplaysvr.exe - _GetDigit@4 ... 25 . 90ACA191ED795D375823FF86C4CDBB97 sapi.dll - __lseek 25 . 90ACA191ED795D375823FF86C4CDBB97 netfxocm.dll - __lseek 25 . 90ACA191ED795D375823FF86C4CDBB97 msvcr71.dll - _write 25 . 90ACA191ED795D375823FF86C4CDBB97 mstscax.dll - __read 25 . 90ACA191ED795D375823FF86C4CDBB97 msihnd.dll - __lseek

Figure 2: Some function signatures with different names in different files. Dr. Dobb’s Journal, September 2005

67

(a) test,jz,call,and,loc,test,jz,call,and,loc,test,jz,call,and,loc,ret (b) ... 16 . 16 . 16 . 16 . ...

BE52429A4DB568969503A30A27D4CB24 BE52429A4DB568969503A30A27D4CB24 BE52429A4DB568969503A30A27D4CB24 BE52429A4DB568969503A30A27D4CB24

(c) 7D9D7AFB 7D9D7AFB 7D9D7AFD 7D9D7AFE 7D9D7B00 7D9D7B02 7D9D7B04 7D9D7B06 7D9D7B07 7D9D7B0C 7D9D7B0F 7D9D7B0F 7D9D7B12 7D9D7B14 7D9D7B16 7D9D7B18 7D9D7B19 7D9D7B1C 7D9D7B20 7D9D7B20 7D9D7B23 7D9D7B25 7D9D7B27 7D9D7B28 7D9D7B2D 7D9D7B31 7D9D7B31 7D9D7B32

8BFF 56 8BF1 8B06 85C0 7409 50 E81EC40B00 832600 8B4604 85C0 740A 8B08 50 FF5108 83660400 8B4608 85C0 740A 50 E8FDC30B00 83660800 5E C3

rtcshare.exe- ??1CShareErrorInfo@@UAE@XZ rend.dll - ??1CObjectWithSite@@QAE@XZ query.dll-?_Cleanup@CDbParameter@@AAEXXZ qedit.dll - ?Release@CDXDataPtr@@QAEXXZ

?_Cleanup@CDbParameter@@AAEXXZ: ;;; query.dll mov edi,edi push esi mov esi,ecx mov eax,[esi] test eax,eax jz loc_7D9D7B0F push eax call _CoTaskMemFree@4 and dword ptr [esi],0 loc_7D9D7B0F: mov eax,[esi+4] test eax,eax jz loc_7D9D7B20 mov ecx,[eax] push eax call dword ptr [ecx+8] and dword ptr [esi+4],0 loc_7D9D7B20: mov eax,[esi+8] test eax,eax jz loc_7D9D7B31 push eax call _CoTaskMemFree@4 and dword ptr [esi+8],0 loc_7D9D7B31: pop esi ret

(d) 5DA13F91 ??1CObjectWithSite@@QAE@XZ: ;;; rend.dll 5DA13F91 56 push esi 5DA13F92 8BF1 mov esi,ecx 5DA13F94 8B4604 mov eax,[esi+4] 5DA13F97 85C0 test eax,eax 5DA13F99 C706AC17A15D mov dword ptr [esi],offset ??_7CRendezvous@@6BCObjectWithSite@@@ 5DA13F9F 740B jz loc_5DA13FAC 5DA13FA1 50 push eax 5DA13FA2 E819EF0000 call ??3@YAXPAX@Z 5DA13FA7 83660400 and dword ptr [esi+4],0 5DA13FAB 59 pop ecx 5DA13FAC loc_5DA13FAC: 5DA13FAC 8B460C mov eax,[esi+0Ch] 5DA13FAF 85C0 test eax,eax 5DA13FB1 740A jz loc_5DA13FBD 5DA13FB3 8B08 mov ecx,[eax] 5DA13FB5 50 push eax 5DA13FB6 FF5108 call dword ptr [ecx+8] 5DA13FB9 83660C00 and dword ptr [esi+0Ch],0 5DA13FBD loc_5DA13FBD: 5DA13FBD 8B4610 mov eax,[esi+10h] 5DA13FC0 85C0 test eax,eax 5DA13FC2 740B jz loc_5DA13FCF 5DA13FC4 50 push eax 5DA13FC5 E8F6EE0000 call ??3@YAXPAX@Z 5DA13FCA 83661000 and dword ptr [esi+10h],0 5DA13FCE 59 pop ecx 5DA13FCF loc_5DA13FCF: 5DA13FCF 5E pop esi 5DA13FD0 C3 ret

Figure 3: A false positive: (a) a function digest; (b) a few of the 30 functions in XP with this digest; and (c) and (d) two disassemblies showing that the code doesn’t really match. 68

Dr. Dobb’s Journal, September 2005

may reduce its ability to supply exact symbolic names to disassembly listings, but does not undercut its ability to find code clones. Figure 2 shows what some of the remaining apparent mismatches look like (the first column is the opstring length). I did find one genuine mismatch, so surely there must be others, especially in signatures based on shorter opstrings. Figure 3(a) shows the offending opstring, Figure 3(b) lists a few of the 30 functions whose code abbreviates to this opstring; Figures 3(c) and 3(d) are two examples of the actual code. Even here, given that the code in Figure 3(c) is called “Cleanup” and calls CoTaskMemFree twice, and that the code in Figure 3(d) calls C++ operator delete (“YAXPAX” in Microsoft C++) twice, the functions perhaps aren’t totally unrelated. But clones, probably not. Even though many of the mismatches are only apparent, some clearly are real; the 3 percent given earlier is probably a reasonable upper bound on the falsepositive rate. This was based on a minimum opstring length of 12, and the falsepositive rate can be reduced simply by requiring longer opstrings. A minimum length of 40 brings the false-positive rate down to about 1 percent. This raises the question of the ideal minimum length for an opstring. The Opstring program discards any function whose opstring is shorter than the desired minimum. The average opstring length in XP is about 55. As seen in Figure 1, even a fairly small C function yielded an opstring whose length was 15. As a rough back-of-theenvelope figure, say there are about four “ops” per line of nonblank, noncomment C code. A minimum opstring length of 20 or 25 then represents a function definition, curly braces, some variable definitions, and five or six lines of actual code. False Negatives Different compilers can take the same source code and produce fairly different binary code. The same compiler, with different optimization settings, can do the same thing. Likewise, a single piece of source code that uses preprocessor macros or C++ templates can generate disparate binary code. Given a template class Stack, with Push and Pop methods, consider the difference between Stack, Stack, and Stack. The developer wrote Push and Pop once, but the compiler generates three different pieces of code for each. To still detect the underlying similarities between these three different binary versions of Stack::Push, based on the instruction mnemonics, would likely require boiling away so many differences among the three versions that unacceptably high false positives would result. http://www.ddj.com

One approach to reduce false negatives is to match on something other than function boundaries. In addition to using branch targets for the “loc” pseudo-op, they could act as delimiters for the blocks of code from which to generate opstrings. With smaller blocks of code to consider, naturally more matches would be found. But considering each branch target as a separate block of code will result in many blocks of code that fall below the minimum length that avoids false positives. One solution is to consider both functions as a whole, and the branch-target blocks within them. Another approach is to ignore boundaries entirely. The self-similarity within a file or group of files found using the DotPlot method (see “Dotplot: A Program for Exploring Self-Similarity in Millions of Lines of Text and Code,” by Kenneth Church and Jonathan Helfman; http://imagebeat .com/dotplot/rp.jcgs.pdf). DotPlot is typically a visual technique in which duplication appears as diagonal lines, but the diagonals can be produced nonvisually (and thus selected by another program) by comparing every element of a string with every other element, using nested for loops. Church and Helfman describe several easy-to-implement optimizations that allow most comparisons to be skipped, by ignoring high-frequency tokens (much as Opstring ignores high-frequency instruction mnemonics). The frequency with which tokens appear can also be used to do weighted matching. All of this can be applied to binary code, once this code has been turned into some form of text, like an opstring. One of the benefits of opstrings, apart from generating MD5 digests, is that they allow textual methods such as DotPlot to be applied, in effect, to binaries. Using the DotPlot method and a hacked version of Opstring, I located large tracts of duplicated code, crossing function boundaries, in Windows programs. However, even with the Church/Helfman optimizations, the program was significantly slower than comparisons with MD5 digests. Apart from redefining what gets matched with what, we can also allow nonexact matching. By modifying the Levenshtein edit- distance algorithm (see “Finding String Distances,” by Ray Valdes; DDJ, April 1992) to work with ops rather than characters, we could find identical functions by finding those whose opstrings have an edit distance of zero, then locate slight variations by selecting those pairs with small edit distances, for instance, only a small number of insertions, deletions, or substitutions needed to turn one opstring into the other. As with the DotPlot method, this too is significantly slower than comparing MD5 digests. http://www.ddj.com

Implementation The Opstring program (available electronically; see “Resource Center,” page 3) is implemented in the AWK programming language, which simplifies text manipulation (if desired, the AWK code can be translated into Perl with a2p). Even though Opstring digests binary files, it’s a text-manipulation tool that depends on an external disassembler, DumpPE. AWK splits each line of text input into spacedelimited fields, designated as $1, $2, and so on. The entire input line is $0, and the number of fields is NF. In DumpPE disassembly output such as in Figure 1(b), opcode bytes appear in $2, the instruction mnemonic in $3, and operands in $4. A function label or branch target appears in $2. To aid reading the code listing, Figure 4 presents pseudocode. When Opstring sees a function boundary, it outputs the previously collected filename, function name, and opstring, then sets up for the next opstring. When Opstring sees an instruction, it adds it to the opstring. All else is commentary. Opstring could work directly on PE files, but it makes more sense, and not only for ease of implementation, to simply create a wrapper around an existing disassem-

bler, DumpPE (http://www.tbcnet.com/ ~clive/dumppe.htm). While Opstring is currently overly tied to the DumpPE output format in particular, and Wintel code in general, there is little reason it couldn’t be generalized for other disassemblers and other instruction sets. (Java and .NET IL come to mind, though there are decent decompilers for both of these, allowing somewhat higher level comparison methods.) Microsoft’s dumpbin is an obvious choice. Its -disasm option produces reasonable disassemblies of Win32 executable files. Newer versions use symbols in a PDB file. However, its output format is inconvenient to work with (each opcode byte is output followed by a space, making the location of the instruction mnemonic difficult to locate), and more importantly it does not identify branch targets. To use Opstring with dumpbin output, which is similar to that of the Linux objdump utility, I’ve written a converter (available electronically) that makes a first pass over a listing, looking for calls and jumps, to construct a pseudosymbol table that is then used to output function and branch-target labels on the second pass.

input file with "magic numbers" (DEADCAFEh, etc.) into array given output from DumpPE -disasm: get Win32 filename from first line of dumppe output ignore everything until see "Disassembly" for each line in disassembly if it's a function label if opstring length > minlength and if opstring contains at least one ret or jmp if opstring length is maximum, might be junk so: chop it off at the first ret or jmp output filename, previous function name, and opstring set up for next opstring: set new function name; clear opstring; oplength = 0 else if it's an instruction line, and opstring isn't at maxlength if line contains a large hex operand if the hex operand is found in the "magic" array add mnemonic "_" magic number to opstring else if it's a Windows API call add API name to opstring without trailing 'W' or 'A' else if it's a branch target add "loc" to opstring else if it's common (mov, push, pop, add esp) if it's an API mov add "mov_" and API name to opstring else do nothing else if it's junk code (nop, int 3, etc.) do nothing else if it's data ; relying on DumpPE to separate code/data do nothing else if we've seen this same thing many times in a row do nothing else if mnemonic has prefix (rep, lock, etc.) add prefix "_" mnemonic to opstring else add mnemonic to opstring if added to opstring oplength++ stop processing when see hex dump output last one send output to mkmd5db

Figure 4: Pseudocode for the Opstring program. Dr. Dobb’s Journal, September 2005

69

IDA Pro from DataRescue (http://www .datarescue.com/) is the gold standard in disassemblers. Its numerous features already include one of the side benefits that I’ve listed for opstrings and function digests — identification of runtime library functions. IDA Pro’s FLIRT (“Fast Library Identification and Recognition Technology”) identifies functions using the first 32 bytes of a function, marking all variant bytes, and a CRC of the remaining bytes. Provision is made for the fact that functions such as _strupr and _strlwr share identical code. An IDA Pro external plugin could presumably employ a database of opstring-style function signatures. However, IDA Pro is designed largely for interactive use (it even includes a runtime debugger to help with encrypted or obfuscated code that is difficult to untangle when looking at a static listing). To build the database itself, however, which is the purpose of Opstring, interactivity is less useful. Thousands of disassemblies are generated, briefly used by another program, and then discarded. The end result is the function signature. Clive Turvey’s DumpPE is a good choice as the supplier of function names, branch targets, and instruction mnemonics to opstring. It makes two passes over the code to find branch targets, uses symbols from a PDB file, identifies functions and branch targets that are indirectly accessed via pointer tables, and identifies the names of Windows API calls. A key problem is finding the start and end of each block to be compared, which generally will be a function. Opstring relies on DumpPE to find function starts; DumpPE gets these from entry points (especially from the PE file’s export table), from any debug symbols, and from its first pass through the code. Finding function ends is more difficult; this is the halting problem. As seen in the pseudocode, the program doesn’t find function ends at all. It simply relies on seeing the next function, or the end of the disassembly. Function returns or unconditional jumps do not necessarily signal the end of a function, because code often contains multiple returns or jumps per function. Because Opstring continues to add to the current opstring as long as it thinks it is inside the same function, it is crucial to filter out as much junk as possible. As seen in the pseudocode, opstring ignores NOPs, debugger invocations (INT 3), data that looks like code, and anything that will take the opstring length beyond a reasonable length. Any opstring that doesn’t contain at least one jmp or ret is discarded, and any opstring of the maximum length is chopped back to the location of its first jmp or ret. 70

When Opstring has generated the opstrings for a file, they can be turned into MD5 digests with the MkMD5Db program (available electronically). This is little more than a wrapper around the standard MD5Init, MD5Update, and MD5Final functions originally written by Colin Plumb in 1993. MkMD5Db expects the name for an item specified on a line beginning with ‘!’,

“Commercial software isn’t analyzed as rigorously as malware”

followed by one or more lines to be digested, and a blank line. Opstring, of course, outputs this format. Preliminary Results I noted in Part I that while this technique can be used for all sorts of things, these articles only construct the technique itself, deferring the actual use of it to some later time. While it is in the true spirit of software to construct tools and then move onto something else, without putting the tool to much use, it would be good to glimpse this hammer actually encountering some nails. I took a Windows XP CD, expanded all files in the \I386 directory, and then discarded those that were not Win32 PE files. The result was 11,913 files (primarily DLLs), totaling 304 MB. I ran the following command on this directory: for %f in (\pefiles\*.*) do dumppe -disasm | awk -f opstring.awk | mkmd5db >> xp_i386.db

This ran in about half an hour on a cheap laptop computer. The resulting file contained just over 500,000 function signatures, of which 44,000 different signatures appeared more than once (and, on average, was used 3.8 times; thus, about 123,000 signatures were duplicates). The small program funcdupe.awk (available electronically) pulls an entire function Dr. Dobb’s Journal, September 2005

database into an associative array and then generates these numbers. It also factors in opstring lengths to calculate a duplication percentage. For Windows XP, the result was 17 percent. Considering that this includes startup and runtime library code that is included in many programs, 17 percent doesn’t sound very high. On the other hand, considering that Windows is built around shared libraries (DLLs), and is supposed to comprise the very operating system itself, and not merely a loose collection of applications, this duplication rate could also be viewed as high. One study of source-code duplication in large systems reports a range (excluding comments and blank lines) from 8.7 percent in GCC, to 29 percent in a web-based message board written in Python, to 36.4 percent for a Smalltalk database server, and 59 percent for a COBOL payroll system (“A Language Independent Approach for Detecting Duplicated Code,” by Stephane Ducasse et al., http://www.bauhaus-stuttgart.de/clones/ Duploc_ICSM99_IEEECopyright.pdf). Previous Work Given that cut and paste comprises a lot of what developers do, it is not surprising that clone detection is a rich field of research, and even has its own annual “Detection of Software Clones” workshop, held in conjunction with the IEEE’s annual Working Conference on Reverse Engineering (WCRE). For an excellent overview, see “Detecting Duplicated Code” in Object-Oriented Reenginering Patterns, by Serge Demeyer et al. (Morgan Kaufmann, 2002). Over 10 years ago, Brenda Baker described “parameterized string matching” for the purpose of finding sections of code that are identical except for a systematic change of parameters. Initially this involved source rather than binary code, but has been extended to Java bytecodes (see http://cm.bell-labs.com/who/bsb/). Other key contributions are discussed in Parts I and II of this article. Inspecting binary executable files has been particularly important to the antivirus industry. Binary signatures, with provision made for wildcards, are used to scan for known viruses (see Art of Computer Virus Research and Defense, by Peter Szor; Addison-Wesley Professional, 2005). It seems odd that the software used by millions every day isn’t analyzed with the same rigor as malware. In large part, Opstring takes malicious-code analysis, as used by antivirus researchers, and tries to appropriate it to analysis of commercial software. DDJ http://www.ddj.com

EMBEDDED SYSTEMS

Software Optimization & DSP Embedded Systems Planning for performance ROBERT OSHANA

N

ew tools, more comprehensive software architectures, and a wider range of available algorithms have made it much easier to develop embedded systems based on digital signal processors (DSPs). Nevertheless, DSP system developers still have to optimize software because the challenge of developing DSP embedded systems lies in making the most of limited resources — performance, memory, and power. As a result, developers spend a considerable amount of time working to obtain the right balance of code size and execution speed for their particular system requirements. Unfortunately, the process of optimizing software for DSP embedded systems has remained hit or miss, with software developers often following ad hoc approaches that may or may not achieve the best solution. A more effective optimization process does, however, exist. Developers who treat optimization as a concern throughout the entire development flow, rather than as an isolated stage, often achieve better results. In addition, those who become familiar with the algorithms, architectures, and compilers underlying their DSP systems are better positioned to take advantage of software techniques that can help them get the most out of their hardware. While the process of optimizing embedded DSP systems is not always easy, it can be accomplished successfully through a disciplined approach.

A Lifecycle Process Although efficiency goals differ in every system, optimization normally focuses on program execution speed, memory usage, Robert is an engineering manager in the Software Development Systems Group at Texas Instruments. He can be contacted at [email protected]. http://www.ddj.com

and power consumption. These indices are not independent because what affects one usually affects the others in some way, and all have some effect on system cost. Optimization is usually thought of as beginning after the code is written and functionally verified, but this is a short-sighted view. Choosing the right processor and algorithm at the beginning of the design cycle can make much more difference in performance than any amount of codetweaking on a poorly chosen platform. A disciplined approach to DSP code optimization includes the following guidelines: • Do your homework. Make certain you have a thorough understanding of the DSP architecture and compiler, as well as the application algorithms. • Know when to stop. Performance analysis and optimization is a process of diminishing returns, so take advantage of the big improvements that come with relatively little effort and let the difficult, low-yield modifications go. • Change one parameter at a time. Avoid making several optimization changes at the same time, so you can tell which change led to which result. Retest after each significant change, then document so that you have a history of changes and results in case you need to backtrack. • Use the right tools. Profilers and other tools let you quickly identify and focus on the core bottlenecks in the program, helping you use your time more efficiently and effectively. • Have a set of regression tests and use it after each iteration. Develop a test plan that compares the expected results to the actual results of the software program, and run the regression tests often enough to catch problems early. Doing Your Homework DSPs are a specialized type of microprocessor that handles certain types of algorithms extremely efficiently for real-time signal processing. Knowing how the hardware and compiler function together can help you concentrate on the 20 percent of code that performs 80 percent of the work, giving you greater leverage on program efficiency as you modify your code. Dr. Dobb’s Journal, September 2005

Typical DSP algorithms used in realtime signal processing involve sums of products (SOPs) based on mathematical summations. To handle repetitive SOP operations as fast as possible, DSP data paths perform multiply and accumulate (MAC) operations in a single cycle. High- end DSPs are also characterized by multiple data and address buses that keep data flowing with minimal latency and wait

“DSPs are a specialized type of microprocessor that handles certain types of algorithms extremely efficiently” states. Sometimes algorithms can be structured to compute more efficiently on DSP architectures while still obtaining the same result. A classic example is the use of fast Fourier transforms (FFTs) instead of discrete Fourier transforms (DFTs), a substitution that gains greater cycle count efficiency as iterations increase. Relatively recent introductions in highperformance DSP architectures include two-level on-chip caches, sophisticated direct memory access (DMA) controllers, and multiple function units with very long instruction words (VLIWs) for increased parallelism. DSP architectures specifically aimed at portable systems tend to implement less parallelism, instead adding more features for control-of-power consumption. While DSP compilers are capable of handling these architectural elements, they may not automatically achieve the most efficient solution in terms of performance. Today, compiled C and C++ code is highly efficient, but judiciously tweaking source code can often achieve faster throughput or more compact code. 71

Machine-Independent Optimizations Compilers perform optimization techniques that are both machine dependent and independent. Examples include: • Rearranging code to combine blocks and minimize branching. • Eliminating unnecessary recalculations of values. • Combining equivalent constants. • Eliminating unused code and storage of unreferenced values. a

b

c

Figure 1: Wasted bytes in storage. a b

c

Figure 2: An efficient mapping.

• Global register allocation. • Inlining or replacing function and library calls with program code. • Substituting operations to conserve resources. • Eliminating ambiguous pointer references when possible. A particularly important generic optimization technique is loop unrolling. Like other loop techniques, unrolling can be extremely significant in optimizing DSP algorithms, which normally spend the bulk of program execution time in loop iterations. Loop unrolling is simply rewriting the loop’s body more than once to reduce the number of loop iterations or eliminate the loop altogether. For instance, the loop in Example 1(a) can be unrolled four times as in Example 1(b). Because the loop branch logic is overhead, reducing the number of times the

loop has to execute reduces the overhead and makes the loop body, the important part of the structure, run faster. The tradeoff in loop unrolling is that it requires more instruction code, and more registers and cache space for data. The latter factor puts a natural limit to the number of times you can unroll a loop because if your code stores values that exceed the number of registers available, or if it needs enough data to cause cache spills, memory accesses will be required, slowing execution. Performance Optimizations The area where your knowledge can make the greatest difference in optimizing code performance is in those transformations that depend on the underlying machine architecture. Generally, machine-dependent optimizations fall into three groups: Special features, which you implement by selecting machine-specific instructions in the source code. Because DSPs are number crunchers by nature, their assembly instruction sets may include some instructions for mathematical operations used often in algorithms. However, compilers may not be able to implement these algorithm-specific instructions efficiently. In these cases, the programmer can force their use through specific assembly language instructions known as “intrinsics.” The C compiler for TI’s TMS320C5x DSPs, for instance, recognizes intrinsics for functions such as absolute value, normalization, rounding, and saturated add. The last of these, saturation, limits two added values instead of overflowing, and is performed by many DSP applications during video and image processing. When it is manually written, a saturated add (sadd) requires many steps; see Example 2(a). Compiling this code generates 14 assembly instructions before the return. On the other hand, using an intrinsic in Example 2(b) compiles to only three instructions before returning. One benefit of intrinsics is that they are automatically inlined, so they do not waste the overhead of calling a function. Also,

Figure 3: Software pipelining.

(a) do i = 1 to 400 by 1 a(i) = a(i) + b(i) end (b) do i = 1 a(i) a(i+1) a(i+2) a(i+3) end

Figure 4: Profiler performance/code size tradeoff. 72

Dr. Dobb’s Journal, September 2005

to 100 by 4 = a(i) + b(i) = a(i+1) + b(i+1) = a(i+2) + b(i+2) = a(i+3) + b(i+3)

Example 1: Loop unrolling. http://www.ddj.com

kernel, the NOPs are eliminated and the total loop cycle count drops to two:

the pipeline, because the kernel is much longer than the relatively inefficient prolog and epilog. A pipelined loop with many iterations is more efficient in overhead cycle counts than the same loop unrolled, which is in turn more efficient than the original loop. In code size, the opposite order of efficiency applies because unrolling repeats instructions, and software pipelining does so even more. Software pipelining inherently improves execution speed, but it may not be possible if unnecessary dependencies exist. In Example 3, the two input arrays may or may not overlap in memory. The compiler must assume that there is an overlap and that a given load or store must be finished before the next one can be initiated. If this source is compiled for serial execution on a C6x DSP, the complete loop requires 26 clock cycles to execute, including 20 NOP cycles. Using a little bit of parallelism, the compiled loop requires five NOP cycles and 10 cycles to execute. However, if you know that the inputs are not dependent, you can indicate that these fields will not change by declaring the input1 and input2 arrays as restrict. When the declaration is changed like this, it becomes possible to software pipeline the loop efficiently because two values can be fetched from different memory locations simultaneously. Within the pipeline

because the compiler is aware of the intrinsic contents, it may be able to perform some additional optimizations that it cannot perform on an assembly-language function of which it is ignorant. Latency techniques, which schedule the order of operations for most efficient execution. Assembly operations require different lengths of time due to memoryaccess delays and differences in parallel function units. When the compiler predicts these conditions, it can force the processor to wait by inserting no operation (NOP) cycles into the instruction stream. Alternatively, various instruction schedulers can decrease running time by reordering some of the operations in the compiled code. One of these schedulers implements software pipelining, which distributes different iterations of a loop to different parallel function units so that they execute simultaneously instead of sequentially. Figure 3 schematically shows the body of a loop in five sections (A– E) that has been software pipelined to execute n iterations (A1–An, and so on) through four different function units. The opening and closing sections, which load and clear the pipeline, are referred to as the “prolog” and “epilog.” The central section, when all of the function units are operating, is the pipeline kernel. Generally, the more iterations in the loop, the more efficient

void example2(float *out, restrict float *input1, restrict float *input2)

The complexity of high-end DSPs means that it is no longer feasible to fully handcode algorithms in assembly language. Even so, you can still achieve considerable performance enhancement if you understand the DSP compiler and architecture. Resource allocation, which determines the values that should reside in registers at each point in the program. (a) int sadd(int a, int b) { int result; result = a + b; if (((a^b) & 0x8000) == 0) { if ((result ^ a) & 0x8000) result = ( a < 0) ? 0x8000 : 0x7FFF; } return result; } (b) int sadd(int a, int b) { return _sadd(a,b); }

Example 2: A saturated add (sadd).

1

2

3

4

5

Understand key performance scenarios for application

Set goals for key optimizations for performance, memory, and power

Select processor architecture to match DSP application and perf

Analyze key algorithms in system and transform if necessary

Analyze compiler performance and output for key benchmarks

13

12

Goals not Instrument code to Run test regression, get data as close as met profile application, possible to CPU using and rerank DMA and other

11 New pareto rank

Tune C/C++ code to map to DSP architecture

10

15

Write out-of-the-box code in high-level language

Turn on low-level optimizations

Goals met

Goals met

16

17

Instrument code to Run test regression, provide hints to profile application, compiler with instrinsic Goals New and rerank pareto pragmas, keywords not rank met

8 Goals not met

Profile application and pareto rank "hot spots" Goals met Document and save code-build settings and compiler-switch settings for each function

Goals met

Run test regression, profile application, and rerank New pareto rank

7 Debug and achieve correctness and develop regression test suite

9

New Goals pareto not Run test regression, rank met profile application, and rerank

Goals met

14

6

Goals met

18

Turn on higher Run test regression, levels of optimization profile application, using compiler New Goals and rerank pareto directives not rank met

Goals met

19 Rewrite key inner loops in assembly language

20 Run test regresssion, profile application, Goals and rerank not met

21 Back to step 1

New pareto rank

Repartition the application in HW and SW and start again

Figure 5: DSP code optimization process. http://www.ddj.com

Dr. Dobb’s Journal, September 2005

73

void example1(float *out, float *input1, float *input2) { int i; for(i = 0; i < 100; i++) { out[i] = input1[i] * input2[i]; } }

Example 3: Unnecessary dependencies can prevent software pipelining. (a) for (expression) { Do A Do B Do C Do D } (b) for (expression){ Do A} for (expression){ Do B} for (expression){ Do C} for (expression){ Do D}

Example 4: Breaking big loops into smaller ones. Registers are the fastest locations in the memory hierarchy. Making optimum use of the registers is the job of the register allocator, which takes the instructions generated by the instruction scheduler and arranges data and variables to avoid spilling them from the registers into memory, where they take longer to access. Of course, sometimes there is simply too much data for the register allocator to avoid spills, and at these times you may be able to help by reshaping the source code. One way to reduce or eliminate register spilling is to break larger loops into smaller loops. When a loop has the shape as in Example 4(a), it may be possible to rewrite it in the form in Example 4(b). The register allocator treats each loop independently, increasing the possibility of finding a suitable allocation for each of the functions. Eliminating inner loops from embedded loop structures, whenever possible, can also free up registers and promote the chances of software pipelining. Memory Optimization Some of the techniques used to optimize execution speed can lead to inflated code and larger memory requirements. The compiler’s profiler can create a number of cycle count/code size options, giving you the choice of where you want to optimize. Figure 4 shows a screen from a compiler with a range of speed/memory optimiza74

tion choices indicated along the curve of dots. The best balance is normally found in the lower left corner, where little of either code compactness or performance is sacrificed to benefit the other. Programmers have other techniques available as well to make code more compact. For instance, some data structures make more sense for real-time applications than others do. Structures of arrays in the familiar form: struct A [ n ] -> a short b int c char

waste bytes in storage with arrangements as shown in Figure 1. They also require an extra load in statements such as: for ( A [ i ] -> a ± 1)...

On the other hand, pointer-to-pointer structures of arrays such as: A -> a [ i ]

are treated by the compiler as invariants and are mapped to memory in a thoroughly efficient manner, as in Figure 2. In addition, statements such as: ptr [I] ± 1

require only a single load inside a loop. In general, you should remember that compilers have a hard time dealing with complicated structures, so it is best to keep the simplest structures at the bottom of the pointer tree where they are easy to optimize and manage. Direct memory access (DMA) control is essential to the operation of highperformance DSPs, allowing data transfers to proceed with little or no intervention from the processor core. While DMA operates automatically, some degree of intervention from you can increase its efficiency. For instance, writing to bits in the DMA register can control staging data to and from off-chip memory, so that while one block of cache is being used by the DSP core, another block is being written out and/or filled with fresh data. These blocks can be very fine grained because DMA controllers are capable of keeping track of what is in many small areas of on-chip memory. Dr. Dobb’s Journal, September 2005

Power Optimization Modern DSPs are designed with power consumption in mind, especially those targeted at portable applications. Typical power conservation techniques include reducing voltages and frequencies whenever possible and disabling clocks and power rails from peripherals that are not in use. These are useful optimization resources for the developer, but the largest single factor in digital power consumption remains the area of memory accesses. In general, you should try to keep the most-used code as close to the DSP core as possible in order to minimize power-expensive pin and bus traffic. Some of the same techniques that improve performance speed also help lower power consumption by keeping data access on-chip as much as possible. Increasing code efficiency by reducing the number of instructions executed also serves to lower power requirements. Code-Optimization Process Figure 5 shows the steps involved in the DSP code optimization process from beginning to end. Although the initial seven steps take place before optimization is traditionally considered to begin, they are nevertheless part of the process. From step 8 on, the process can and should end whenever the application efficiency goals are met. Each optimization step is followed by a test regression step to verify that the software still functions as it should. You should note that optimizing compilers can make your code harder to understand. Optimizers reorder operations and eliminate variables, so it becomes difficult, if not impossible, to debug code once you are in the optimization stage. In addition, because the compiler has scheduled instructions carefully, you risk seriously impairing performance if you try to debug optimized code. The best strategy is to make your code work right first, then turn on optimization features one at a time, documenting results as you go. This disciplined approach makes optimization workable in the increasingly complex world of high-performance DSP design. Optimization should not be an afterthought, but an inherent part of the development process from concept to completion. Developers who do their homework about the algorithms, architecture, and compiler are in a better position to take advantage of the DSP hardware and software for greater performance with less code and less power consumption. With today’s multifunctional DSP embedded systems, it may not always be possible to achieve the perfect optimization, but a disciplined approach should always make it possible to achieve a satisfactory optimization. DDJ http://www.ddj.com

PROGRAMMING PARADIGMS

Freon in The Styx Michael Swaine

A

s John “Hannibal” Stokes suggests in an essay at arstechnica.com, it was probably the liquid cooling. Some demon poured freon in the river Styx in May, causing hell to freeze over. This, in turn, forced Apple to dump the PowerPC processor architecture in favor of chips from Apple’s new best friend, Intel. At the announcement, my friend Chris Breen wondered what this meant for Steve Jobs’s health, since he was sure he recalled Steve saying that this would only happen over his dead body. This month’s column includes some early thoughts on the Macintel transition, as well as on some other transitioning paradigms, including moving beyond the meme of the meme, the death of Grokster, and hacking your own home.

Macintel Q&G Moscone West in San Francisco was totally taken up with Apple’s Worldwide Developer Conference (WWDC) for one week in May. I attended this year as a developer, something I hadn’t done for a while. Apple is doing so many interesting things right now at the scripting or Saturday- morning programmer level (about which I hope to have something interesting to say when I get deeper into them) that I knew it would be worth my time to go. I also knew that attending as a developer would mean that I couldn’t share what I learned there, except what was already public. Which, realistically, is almost everything you’d care about Michael is editor-at-large for DDJ. He can be contacted at [email protected]. http://www.ddj.com

unless you’re an Apple developer, in which case you were probably at the conference. What was covered in Steve Jobs’s keynote was public information, and very public indeed. As rumored a couple of days before the event, Steve revealed that Apple would be moving its computers to Intel processors. The move to Macintel (now an official Apple trademark) has been the subject of intense analysis by everybody and his dog, but I can hardly let it pass by without comment, so here are a few thoughts, presented more or less in the form of a Q&A, or really a Q&G (questions and guesses). The first question has to be: Why? And the informed guess is surely that Jobs perceived a lack of commitment and delivery from IBM. His stated reasons were that IBM had failed to deliver a 3-GHz G5 and that Intel’s chip roadmap made better sense in terms of power consumption. Given that portables are increasingly important for Apple, I see no reason to doubt that motive, and there seems to be some evidence that IBM was more interested in the market for chips in game boxes than in Macintoshes. The second question and the third and beyond must all be about the impact of the move. And right off, a word needs to be said about elegance, orthogonality, and byte sex. Byte sex is also known as “Endianness.” PowerPC chips are Big- endian, like Internet IP addresses. Intel chips are Little- endian, like German sentences. (Would you call Mac OS X on Intel “Ten Little Endian”?) Just about everyone Dr. Dobb’s Journal, September 2005

agrees that one of these is brilliant and the other is insane, but there is some disagreement about which is which. To many, though, the PowerPC chips have a certain cachet (I hope that doesn’t get copyedited into cache) that has to do with elegance and beauty. That’s all gone now. But how much does it matter, and to whom? Pragmatically, the part that really matters to some is byte sex. Here, I suspect, some 90 –10 rule — or 98.5 –1.5 rule — applies. Most developers will have little trouble making the transition. I can’t tell you that I watched programmers port their applications with very little effort to run on those 3.6-GHz Pentium 4 Macintoshes in the porting labs at WWDC. But Apple will happily describe such experiences, and if I had seen anything different in the labs, I’d find a way to let you know about it. Applications that were developed using Apple’s Xcode, applications written in Java or Objective C, scripting applications, UNIX-centric applications, applications that make no assumptions about the hardware — these should port easily. (Tom Yager is being a little cute and a lot accurate, I think, when he says that the problems that are encountered will involve issues that the developers would be embarrassed to admit exist in their code.) Then there will be applications that port easily except for certain small sections of code. And then there will be apps ported from Windows under intense time pressure. And then there will be Microsoft’s Mac apps. Apple will have lots of success stories to tell while 75

waiting for Microsoft to come aboard, I suspect. The marketing question has to be, Will there be an Osborne Effect? (Osborne Computer Company failed spectacularly back in the ’80s. That’s fact. It is less than fact and closer to myth that it was all because of an unfortunate preannouncement of a new model that would obsolete the current line. That did happen, and sales of the current models did in fact drop off precipitously, which perhaps justifies the term “Osborne Effect,” even though there were other factors responsible for the demise of the company.) Contrary to intuition, there’s generally no reason for consumers to avoid the end-of-life nonIntel Macs, but intuition is powerful. So what if the worst case happens and people quit buying Macs, waiting for the new machines? Does Apple rush to release Intel Macs ahead of schedule? Try to fix the problem with PR? Take the loss? The question seems to be generating more questions. But this move is well- timed: The company is cash-rich (which is one of the very best kinds of rich). Apple could, in fact, survive a huge hole in sales; its position is almost the opposite of Osborne’s. Final question: Does the move to Intel push Apple any closer to licensing the OS? Mike Dell has made noise about wanting to sell Macintel systems. It’s easy to question his motivation: Is he angling for better terms from Microsoft? But the idea of Dell licensing the Mac OS has to be looked at both in the light of past licensing efforts and in the light of current market conditions. Although Steve Jobs did the Macintosh licensing program, it wasn’t his licensing program. Opening the floodgates of device incompatibility and lack of quality control and brand dilution is probably not on the Apple agenda. But partnering with the biggest fish in the pond — that’s a different story. We are in a different world now than when Apple was licensing the OS before. Computers are less important to Apple in some ways, the PC era keeps threatening to end, and Steve likes to stay ahead of the curve. Of course, he also believes in the personal computer, but does this necessarily mean by extension, in building boxes? Remember, Steve Jobs has in the past made the decision to move from a proprietary hardware model to a licensed software model with NeXT. He may have hated doing it, he may have had no other choice, but he did go the licensing route when he concluded that it was the right move. Apple has partnered with HP on marketing iPods. Could it decide to partner exclusively with Dell, build76

ing some constraints into the contract to protect the Mac market? Smart Home Hacks While at WWDC, I picked up Gordon Meyer’s book Smart Home Hacks (O’Reilly Media, 2005; ISBN 0-596-00722-1). I also caught a talk by Meyer on building

“Contrary to intuition, there’s generally no reason for consumers to avoid the end-of-life nonIntel Macs” an X10 network to automate a home. I have long been attracted to the hackishness of X10. The whole essence of X10 is a hack, and a particularly hackish hack at that, with the appeal of taking advantage of an unintended side effect of one technology to support another, the slightly outlawish feeling that you’re hacking the power company to run your little homemade network. Actually, of course, you’re just hacking your own home or office wiring. X10 works by sending signals over your home’s electrical wiring to turn on or off lights, dim them, and control or respond to other devices. Any electrical device can be X10-enabled by plugging it into a cheap X10 module that, in turn, plugs into the wall outlet. In addition to the ability to turn things on and off, X10 can accept signals and data from sensors and can interface with scripting languages, allowing you to construct arbitrarily complex conditions under which actions take place. Because the whole technology is a hack, you have to hack it to make it work properly. Various devices in your home treat X10 signals as noise and try to filter them. Some buzz from your neighbor’s house may get picked up and treated as an X10 signal, throwing your system into confusion. You have to tune the network with filters and boosters. Amusingly, if someone unplugs the dryer, even though it is not part of your X10 scheme, you could find that carefully designed X10 system unceremoniously split into two noncommunicating networks. Fun stuff. This book may just be the nudge that gets me into X10. I certainly have uses for it: a 1000-bottle wine cellar to monDr. Dobb’s Journal, September 2005

itor, plus other business inventory to protect and track; a farm operation needing to monitor weather and soil data; several irrigation systems; a restaurant that has a long checklist of daily startup and shutdown activities, some of which could surely be automated; various heaters and air conditioners and freezers and refrigerators and other devices that could surely be operated more efficiently with automated monitoring of temperatures and an understanding of the different states the restaurant can be in — open for business, opening within an hour, between lunch and dinner shifts, closed for the night, closed for the weekend. What I find especially appealing is this idea of defining meaningful states that a home/business might be in, then figuring out how to map these states onto the paltry bits of data that X10 devices can pick up. The devices can tell you that there is water at address B10 or movement at address G5 or that switch D8 is closed. But what you would really like to track are meaningful states like “Residents not at home,” “Guest staying in guest room for weekend,” “Summer home closed up for winter,” “Everybody’s asleep,” “Intruder is in house,” “Wine cellar locked and all its systems operating properly,” “UPS left package at front door,” “Motion detector batteries need replacing,” “It’s daylight,” “It’s a holiday,” “Rain expected this week,” and so on. Coming up with the right states to monitor and working out how to tease evidence for those states out of sensor data calls for some creative thinking. And deterministic algorithms are rare: Mostly it’s sifting the evidence and weighing the probabilities and making a judgment. In other words, hacks. The 100 hacks in the book are mostly Meyer’s, and he’s certainly an expert, having automated his own home to what unimaginative people might call a ridiculous degree. But a couple dozen others contributed hacks, including Bill Fernandez, who introduced Steve Jobs to Steve Wozniak and was at Jobs’s wedding. (See how I connected this X10 story with Apple?) The hacks in the book include tasks such as making sure the lights stay on in your home office when you’re working and not moving around much, even though you’ve set them to turn off after a set period when no motion has been detected. Since X10 signals can take a couple of seconds to take effect, a good hack is to anticipate where lights need to be turned on. When motion is detected in a bedroom after lights out, and perhaps the bedroom door opens, go ahead and turn on the hall light (dimmed appropriately) and maybe the bathroom http://www.ddj.com

light as well. Other hacks monitor the litter box, heat the toilet seat, control lights while you’re out, and interface an X10 system to calendar or charting software for scheduling events and viewing system activity. Incidentally, I was pleased to see that, in that charting hack, an X10 hacker had interfaced his X10 system to a SuperCard program running on OS X. With cardbased development products like SuperCard and Revolution, the spirit of HyperCard is still alive on the Mac, and will clearly make the transition to Macintel. (I have it confirmed by a SuperCard developer that SuperCard ran fine on one of those Intel boxes at WWDC.) Meyer makes good use of states in his hacks, conditioning all sorts of system activity on the occupancy state of the house. His system knows whether or not he’s home, his wife is home, and/or a guest is visiting, and responds accordingly. For example, if he’s home, certain alerts are announced throughout the house; if he’s on the road, the alert might instead triggers an e-mail or cell phone call to him. If a guest is present and the residents are away, many of the automated lighting effects are turned off so as not to freak out the guest. Another state variable comes almost free with X10: knowing whether it’s daylight outside. Some X10 software will do its best to figure that out for you based on settings on your computer: If it can find a longitude and latitude, it’ll use those. You can override its guess, though: If you live on the side of a hill, your local sunrise might be later than it would be on level ground, for example. There are decent software packages for X10 for Mac, Windows, and Linux, and a couple hundred dollars will typically get you started with a computercontrolled X10 system. From there, it’s $10 or $20 to hook up each new device. X10 can also work with wireless devices, including cameras and motion detectors.

ulation Bomb, and more recently, the president of the Center for Conservation Biology at Stanford University. In this article, they tackle the huge question of how the values of a culture change over time. A lot of folks in technology have read Richard Dawkins’s The Selfish Gene and gotten caught up with the analogy of

“With card-based development products like SuperCard and Revolution, the spirit of HyperCard is still alive on the Mac” genes in biological evolution and memes in cultural. In certain circles, the meme is a powerful meme. Although I hate to throw cold water on such an appealing concept, these authors judge Dawkins’s meme idea “regrettably infertile.” You

can see why for yourself at http://www. biology.plosjournals.org/. Collaboration Evolves This summer, the Supreme Court invalidated the Grokster model of collaborative peer-to-peer file sharing. That’s the noise on the Web. The way I read it, the Supreme Court invalidated the encouragement-ofcopyright-infringement model, but as a repeat-offender copyright holder, I may be an unreliable judge of such things. One thing I am sure of is that the whole notion of collaboration on the Internet is evolving. The Los Angeles Times boldly published an editorial online in the form of a wiki. Not surprisingly, the experiment failed as editorial, but may have succeeded as an experiment. Kudos to them for taking the plunge. Then there’s the Gitmo Project, a collaborative project of sifting through the mountains of data about Guantanamo Bay detention practices made available to the ACLU through a Freedom of Information Act request (http://www.dkosopedia.com/index .php/FOIA:Detention_Practices_Project). Unlike SETI@Home, this one involves people rather than computers collaborating. DDJ

Beyond Memes I check out the Public Library of Science (PLoS, http://www.plos.org/) from time to time. The idea behind this set of online journals is the open sourcing of science, which you’d think would be redundant, but regrettably, is not. If you’re in any doubt as to the need for such a site, you might look at “Medical Journals are an Extension of the Marketing Arm of Pharmaceutical Companies” by Richard Smith (http://medicine.plosjournals.org/). In June, PLoS Biology (http://biology .plosjournals.org/) posted an essay titled “The Evolution of Norms” by Paul R. Ehrlich and Simon A. Levin. Ehrlich is famously the author of the book The Pophttp://www.ddj.com

Dr. Dobb’s Journal, September 2005

77

EMBEDDED SPACE

Monoculturalism Ed Nisley

E

ven a quick glance at programming magazine ads shows the unrelenting complexity of software development. A program’s source code no longer defines its overall logical structure, as it now must connect the various frameworks and scaffolds and infrastructures that actually do most of the heavy lifting. The days when writing a program required nothing more than a pad of paper and a keyboard have vanished with, oh, the slide rule. The ads promise that, in exchange for giving up control over major chunks of functionality, we can devote more time to the small pieces that actually differentiate our program from its competitors. We’re told that there’s little enough value in Yet Another User Interface, Database Back End, Network Stack, and so forth and so on, that we may as well use Other People’s Code for those parts of the project and get on with our own stuff. Embedded systems take this notion to an entirely new level, because hardware design has become subject to the same pressures. I think we have four puzzle pieces that, in combination, pose some interesting challenges for embedded systems designers.

Other People’s Code Although it’s hard to be certain, the first instance of code reuse probably occurred while laying out the plugboards for ENIAC’s second program. If not, then surely the third program recycled parts of the second, establishing the “Write one to throw away” design pattern. Fast forward four decades, as programming segues from plugging cables to GUI IDEs. The notion of dynamic linking, which long predates Window’s DLLs, now lets programs invoke common routines without including them in every program’s executable file (let alone card deck). Programs become nondeterministic on a grand scale, because they depend on Other People’s Code that not only isn’t inEd’s an EE, PE, and author in Poughkeepsie, NY. Contact him at ed.nisley@ieee .org with “Dr Dobbs” in the subject to avoid spam filters. 78

cluded in the executable file, but may differ from run to run. Such late binding promised that you could fix problems by simply replacing a single common file, rather than relinking all the programs that used it. The reality, known as “DLL Hell,” was an ever mutating series of errors and dependency failures. Of course, by the time the symptoms surfaced, they were completely unrelated to the actual causes. Fast forward another two decades, when software frameworks unifying disparate operating system and utility functions become feasible. Writing a program now requires less overall system knowledge, while simultaneously putting more reliance on vast chunks of Other People’s Code. A system-level package management database can help resolve dependencies, although everybody must agree to play by the rules. Pop quiz: If you weren’t such a nice person, what could you do with an arrangement like that? Hint: On foreverrunning systems, you can delete the file that loads your code. Building Blocks On the hardware side, classic small-scale embedded-system design started with datasheets for single integrated circuits that were actually comprehensible to one person. The chip designers specified how the external hardware must behave, to the extent that Motorola and Intel microprocessors might as well have been from different planets, and embeddedsystem designers continued that process to produce completely unique chunks of hardware. Eventually though, vendors noticed that most of the gadgets looked pretty much the same: A microcontroller, some external memory, an assortment of digital and analog I/O, plus some communications ports. Add a few status LEDs and a debugging port: Shazam, everyone could use a single board. At least if they were willing to buy a few features they didn’t really need, which generally turned out to be the dealbreaker: The cost of the board was relatively Dr. Dobb’s Journal, September 2005

high compared to the cost of the final product. Nowadays, chips have become sufficiently complex that reading the datasheet does not give you confidence that you can actually build a working gadget. Chip vendors now produce “evaluation boards” to provide an existence theorem: If you interconnect the hardware using this exact board layout, then run this software, the chip behaves properly. Once you’ve seen it work, the theory goes, you can adapt it for your own use, write your own code, and go on your merry way to satisfy your customers. Any problems can be traced back to differences between the eval board’s design and your efforts, so figuring out what’s going wrong shouldn’t take all that long. In actual practice, however, the adaptation process sometimes goes along these lines: “This works, so we’ll just run with it!” Grafting an eval chip’s layout onto your board is one thing, but shoehorning sample code into a production application becomes something quite different and rather scary. Eval board code tends to be written for the very specific purpose of showing how the circuit works. Considerations of reliability, error handling, overall structure, and security tend to fall by the wayside even in ordinary projects, but particularly for sample code that’s written very, very early in the chip’s design stage. As a result, eval board code comes with a surprisingly high level of cruft, not to mention outright errors in lesser used functions. Nevertheless, there’s an awful pressure to make as few changes as possible, because you (or your boss) really wants to concentrate on the rest of the project, the part that contains your unique and nodoubt valuable IP, rather than buildingblock hardware and code. Sound familiar? Hammering on the Wall The next puzzle piece comes from the Internet, with an overall design dating back to the Good Old Days when everyone was on more-or-less friendly terms with everyone else. The degree to which that is http://www.ddj.com

(continued from page 78) a bad assumption has taken many people by surprise, so a quick look at what’s going on is in order. Some consumer- grade firewalling routers, including my D-Link DI-604, can e-mail their status logs on a regular basis. Being that sort of bear, I stored all those status e-mails since installing the firewall in late 2003, knowing that they’d come in handy. Such a firewall’s key job is to block all external packets that do not correspond to traffic originating from the protected “inside” network. From the Internet side, it appears that the firewall’s IP address is unused: Blocked packets are not bounced back to the sender, they’re simply discarded. These firewalls can also relay ex-

ternal packets to internal servers, but I’m not using that function. The DI-604 also reveals a closed Port 113, the standard Internet ID Protocol port, so you can tell that there’s something at my IP address, although attempted connections won’t work and all the other ports simply don’t respond. I wrote a Python script to read those e-mail messages, look up the host name corresponding to each dotted-quad IP address, and tally the number of packets received from each one. Because many of the IP addresses are in blocks reserved for ISPs, many different systems may use a single IP address. Conversely, it’s also trivially easy to spoof source IP addresses, so you can’t trust everything you see. Nevertheless, we can ex-

tract some interesting information from those records. From January through late May 2005, my firewall discarded 32202 packets from 5787 distinct IP addresses, sending 170 status emails in the process. Figure 1 shows the top 15 packet sources and the ports they attempted to connect with at my IP address. Although it looks like NASA is a perp, there’s an innocent explanation. It turns out that if you shut down the viewer for Quicktime movies, the server continues to hose down your IP address until it notices that you’ve vanished. The firewall discards those incoming packets, because they no longer correspond to outbound traffic from your system. Similarly, the 322 packets from Best Buy were probably triggered by something I viewed on their web site. I used the Shields-Up port scanner at Gibson Research to verify that my firewall was operating correctly. Scanning your own system from the outside makes sense, as long as you can trust the system doing the scanning, but the firewall can’t distinguish the scan from an actual attack. All the remaining entries represent unsolicited probes of my firewall. Although I don’t know if they first tested port 113 to verify that the IP address was in use, that seems unlikely on the face of it. One system in my optonline.net neighborhood steadily plinks away at the longpatched Microsoft Remote Procedure Call vulnerability, another engages in broadspectrum port scans, and a third still harbors a Slammer worm. While they may not be geographically nearby, they have easy access to my firewall because they’re within the Optimum Online subnetwork, behind any ISP-level firewalls. All but one of the remaining systems (seem to) reside in mainland China, doggedly trying to stuff pop-up spam through MS Messenger’s ports around 1024. You might think that, after 2700 attempts, every eight minutes or so, over the course of two weeks, someone would get a clue, but that’s not the case. Hold that thought for a moment. Remote Computing Early this year, a trio of researchers from Shandong and Princeton Universities announced, without giving the details, an attack on the SHA-1 message digest algorithm. Dallas Semiconductor subsequently produced a white paper describing the effect of the attack on its high-security memory devices. These gizmos use SHA-1 to create a MAC (Message Authentication Code, not to be confused with the Media Access Control address on your network cards) digest over the memory contents, the device’s serial number, a challenge string, and a secret key to authenticate and verify both

80

Dr. Dobb’s Journal, September 2005

http://www.ddj.com

data storage and transmission. They’re not your common stick of SDRAM, for sure! Anyhow, Dallas observes that it’s still infeasible to determine the 64-bit secret key given all the rest of the information. The computation requires 264 SHA-1 operations, each of which takes about 1740 “basic arithmetic operations.” They estimate 12.4 years of grinding on a 64-CPU Cray X1 at 819 GFLOPS or 2 months on a 4096-CPU machine. While renting a supercomputer isn’t feasible for most of us, there’s a much cheaper way to get serious computing power. The street price for zombie Windows boxes (aka bots) is now under $0.10 each in lots of 20,000. That price includes preinstalled Trojan software giving you complete control. Spammers send junk mail and MS Messenger popups from zombies, but we can do better than that! Dallas allows for 20-percent overhead in the SHA-1 computation, so, if a typical zombie is a 2-GHz Windows PC, it can run a million (2×109/2100) SHA-1 computations per second. Giving each zombie’s owner half the CPU cycles still yields 40×109 SHA-1 computations per day, so a single zombie can crack one memory device in 500 million days, worst case. There are, however, two types of Windows machines: Those behind firewalls and zombies. The former divide into two camps — those with good security practices and zombies. I’m not sure the numbers are reliable, but Symantec estimates a quarter of U.S. PCs are zombies and, as nearly as I can tell, the U.S. has about 150 million PCs. That gives a pool of 37×106 Windows zombies, just in the U.S. alone. Let’s derate that by an order of magnitude to account for the usual handwaving assumptions. If you had half a megabuck to spend (What? No bulk discount?), 4×106 PCs can crack a SHA-1 MAC in four months. That seems like lots of money and a long time, but, by definition, SHA-1 memories appear in high-value, long-lived applications. Desktop PCs have been stuck around 3 GHz for a few years and, while Moore’s Law isn’t in abeyance yet, Intel and AMD pretty much admit that more performance requires more CPUs. Assuming that the overall performance per box doubles every two years and the Windows zombie pool remains unchanged, by 2013 we’ll crack SHA-1 MACs in a week, just with U.S. PCs — no need for offshoring! If that sounds like it’s too far in the future to worry about, then you must have grown up since, oh, about 1998. Assembling the Pieces Embedded systems range from simple microcontrollers to command-and-control networks, but they tend to run all day, evhttp://www.ddj.com

Count

Source IP

Resolved Domain Name

Destination Ports and Commentary

3977 3973 2730

192.55.91.23 204.1.226.228 61.235.154.92

6989 Interrupted Quicktime streams Deliberate firewall scans 1026 1027 MS Messenger popups

1739 567 520 492 489 457 369

222.88.173.5 61.172.249.200 61.152.158.109 69.119.120.80 61.152.158.123 69.119.193.167 222.208.168.126

359

61.235.154.101

354

61.235.154.102

323 322 308

164.109.152.173 198.22.124.62 69.119.194.32

ttpweb.grc.nasa.gov shieldsup.grc.com CHINA RAILWAY TELECOMMUNICATIONS CENTER CHINANET henan province network Beijing Waei Software Development Shanghai Global Network Co., Ltd. ool-45777850.dyn.optonline.net Shanghai Global Network Co., Ltd. ool-4577c1a7.dyn.optonline.net meishan telecom idc meishan, Sichuan PR China CHINA RAILWAY TELECOMMUNICATIONS CENTER CHINA RAILWAY TELECOMMUNICATIONS CENTER staging-admin-1.wellcheck.com Best Buy Co., Inc. ool-4577c220.dyn.optonline.net

1026 1026+ 1026+ 1025 MS RPC service 1026+ myriad ports 1026 1027 1026 1027 1026 1027 62500+/22438 1433 MS SQL Server: Slammer worm

Figure 1: The top-15 sites sending unsolicited packets to my IP address in early 2005. Remember that spoofing a packet’s source address to refer to an innocent bystander takes little skill, so you cannot assume the actual senders appear here. The top two entries result from my actions. The others? Well, that’s why you need a firewall! ery day, through a lifetime of years to decades. With few exceptions, they’re designed so their users can forget the “embedded computer” part and get on with their jobs. Alas, keeping up with current security patches isn’t usually part of that job. The complexity of current systems ensures that they will be built with vast chunks of Other People’s Code atop common hardware blocks. In practical terms, such code comes with no guarantees of reliability or security, regardless of whether it’s from proprietary or Free Software sources. Internet attackers are no longer amateurs. The current goal is control of your system for financial gain and, as always, money changes everything. High-value systems, those protecting resources that can be converted to money, will endure relentless automated attacks designed by folks at least as smart as you are, with more knowledge and better tools. The computational resources available to a determined attacker with even a modest budget means that bad crypto will be fatal. SHA-1 is extremely good, but if the system’s security depends on “computationally infeasible” tasks, the time horizon must outlast the system: If you’re not thinking SHA-256 right now, you’re in trouble. Bottom line? Your gizmo’s function depends on your code, but its security depends on everything. Any flaw in any of the common building blocks will expose your product, just like all the rest. Worse, any flaw detected in any other product can provide entry to your systems, particularly if the attacker can gain physical control of any gizmo using similar building blocks. Dr. Dobb’s Journal, September 2005

Monocultures exist in more places than just Wintel boxes. Somehow, we must provide defense in depth, despite using large chunks of common code and hardware, to our embedded gizmos. Surely, it’s time to start paying more attention to security than to fancier features? Reentry Checklist My capsule review of the progression from source to frameworks surely mangles a detail you consider vital. Much of computing’s history can be found in Wikipedia starting at http://en.wikipedia.org/wiki/ Main_Page. More on stealthing Port 113 at http:// www.grc.com/port_113.htm. Top 10 lists of scanned ports and scanning IPs from the Internet Storm Center at http://isc .sans.org/top10.php. Read the Dallas Semiconductor white paper on SHA-1 attacks at http:www .maxim- ic.com/appnotes.cfm/ apnote_number/3522. I cannot make their Cray numbers work out, but it doesn’t matter. See the original papers describing various attacks at http://www.cs.ut.ee/ ~helger/crypto/link/hash/mdx.php. A news writeup of the price of zombies is at http://www.usatoday.com/technews/ computersecurity/2004-09-08zombieprice_x.htm. The links and hints here will give you an idea of the scale of the problem: http://windowssecrets .com/comp/040923/. Symantec’s observations on the current state of the Internet are at http://www.cert- in.org.in/ training/21-22april05/internet%20threat.pdf. DDJ 81

CHAOS MANOR

Tiger, Tiger Burning Bright Jerry Pournelle

I

’ve installed Mac OS X 10.4 Tiger and can summarize simply enough — it’s Grrreaaat! OS X 10.4 (which quickly upgrades to 10.4.1) is wonderful. It has just about all the features Microsoft promises to have in Longhorn Real Soon Now, but Tiger is here today, and It Just Works. Everything about this inspires enthusiasm. Of course, just as Apple brings out an OS that makes me seriously consider keeping some fast Windows machines for gaming but switching over to the Mac as my main system for writing, mail handling, and everything but games, Steve Jobs announces that the current Mac hardware is to be replaced by x86 chips, which are themselves coming to the end of their development cycle. The rumor that Apple would switch to Intel has been around for a while, but without details. There was speculation that Apple would stay with the PowerPC, but it would be made by Intel. That didn’t happen. Instead, Apple went whole hog. Apple is changing to x86 architecture, and in fact, all of the recent Apple OS X operating systems already run (in the lab) on the current Intel hardware. With this bold stroke, Jobs is clearly announcing his intent to take Apple out of the niche market and compete for real market share. Given Apple’s marketing strategies, it’s unlikely Apple can — or even wants to — become the dominant player in the PC market, but it’s clear Jobs is aiming to regain at least the 20 percent market share Apple almost had back in the early days of small computers — and do that while staying out ahead in the Way Cool department. The announcement generated a world of speculation, some of it outright silly, Jerry is a science-fiction writer and senior contributing editor to BYTE.com. You can contact him at [email protected]. 82

some silly on reflection. For example, the night before the announcement, I wondered if this would now make it possible to build “white box” Apple computers? This seems not unreasonable on first thought, but it isn’t going to happen. Oh, no doubt someone will figure out a way to do it, but it won’t make sense as anything but a stunt. The whole point of Apple is the “Apple Experience,” which is made possible through Apple’s system integration. As Peter Glaskowsky observed, with Apple, everything is either very simple or impossible. There’s a reason for that. By restricting the kinds of hardware Apple works with, the driver problem is much simplified. You pay for Apple system integration, and one of the prices is that you can’t use some commodity peripherals and parts; but if you’re willing to accept the costs and restrictions, you get the genuine Apple experience. That’s not going to change. On the other hand, some things do become possible. The development work needed to make a TabletPC work with Intel hardware is already done. It remains to integrate that with the Apple OS, which shouldn’t be all that difficult. Apple may not choose to do this — predicting the whims of Steve Jobs is work for a better soothsayer than I— but it becomes possible if there’s a demand. The Game World And then there are games. In the early days, the Apple II had the coolest games around. After the Mac came out, game developers tried to work in both Mac and DOS, and then Windows, but as Apple lost market share and game development got increasingly expensive, most gave up. Really popular games might be ported to a Mac well after they were released for Windows, but most never made that crossing. For those who like massively multiplayer Dr. Dobb’s Journal, September 2005

online games such as Everquest II, there’s no choice — you need a Windows system, and a relatively powerful one at that. Bringing games from Windows to Mac is terribly expensive, because DirectX and other such tools don’t exist on the Mac. That won’t change. After Mac switches from PowerPC to Intel hardware, they still won’t have DirectX, but there’s another solution to the gaming problem — dual-boot systems. Intel’s latest security developments, called “LaGrande” (http:// www.intel.com/technology/security/), can prevent systems not made by Apple from booting up the Apple OS, while permitting genuine Apple- made systems to boot up Windows. Dual booting has never been very popular, but in this case, it makes sense: Boot up in Windows for games and in Mac OS for everything else. There’s another possibility: Intel’s new Virtualization Technology (VT) lets systems run several operating systems at once. Microsoft already sells Virtual PC for the Mac. Current PowerPC Macs aren’t really fast enough to allow a satisfactory gaming experience running a Windows game on a G4 Mac (I haven’t tried with a G5, but I gather it’s true for those as well); but once the systems are running on the same hardware, that difficulty should go away. I can foresee a time when you boot up your system in Mac OS, let that automatically open a Virtual Windows Machine, and have the best of both worlds, letting each OS run the applications it does best. Switch Now or Later? Jobs touched all the bases. Major software developers such as Microsoft and Adobe publicly promised to support the new “Aptel” platform, and others not consulted before the announcement are scrambling to get on board. There doesn’t look to be any impending shortage of software for http://www.ddj.com

the new Aptel systems. Moreover, Jobs demonstrated many of the PowerPC Mac applications running on Intel hardware. That raises the question: If you’re planning on changing from Windows to Mac, when do you do it? Do you buy a G5 now? Wait for Aptel Macs? Which generation of Aptel Mac? The first generation of Aptel will be “Yonah,” which doesn’t seem to have 64-bit support. Is Apple going to launch with Tiger for 32-bit x86, then have another migration to Leopard for 64-bit, at which point it will finally be back to the memory capacity of the G5? This cannot be great news for developers, but I haven’t heard any of them talking about such matters. These are all serious questions, particularly for corporate buyers. I can raise the questions, but I don’t have answers. I do know this: When I first saw Tiger I thought “Wow! I’ve got to get a G5!” On reflection, though, I think I’ll wait for some of this dust to settle. On that subject, Peter Glaskowsky says “This media focus on Apple’s perceived demand for Intel’s superior offerings is seriously counterproductive. Intel’s current offerings aren’t superior at all! Apple just thinks Intel’s future x86 processors will be better than IBM’s future PowerPC chips. That’s all they’ve claimed. “So buy the Mac you want now. You’ll be able to use it for one full product generation (three to four years) with no lack of software and no performance disadvantage. Any software you buy during that time will come in PowerPC or ‘universal binary’ versions, so that’s all anyone can ask.” Tiger, Tiger, Burning Bright None of this takes away a thing from Tiger, which has just about everything Microsoft promises to deliver with Longhorn — but it has it now. We installed Tiger on Ariadne, my 15inch PowerBook, in under an hour, and a good part of that time was used in making sure that Ariadne was completely up to date with applications upgrades installed before we started. Once that was done, we inserted the Tiger DVD and let her rip. The installation was nearly automatic. I had a short panic when I realized I didn’t know my .MAC password, but managed to get around that with the secret questions, and won’t have that problem again. There was also a hooray over the serial number for Stuffit: You don’t need the number to install the upgrades I needed, but you do need it to download them. That’s plain silly. I figured out a way around that problem, too. Otherwise, the installation was clear sailing. Everything worked. It looks and feels like Apple, but nicer. Things are crisp, http://www.ddj.com

and less confusing. Dashboard is fascinating. There’s a variety of small applications— widgets— available free (see http:// www.apple.com/downloads/dashboard/ for the “official” ones). These include clocks set to almost any city time in the world, weather forecasts, airline flight progress (type in airline and flight number and see where the airplane is), Yellow Pages for your neighborhood, traffic reports, and lots more. A long time ago, when I was a systems analyst for aerospace, one of the most high prestige items you could have was one of those sun clocks that showed in real time just what part of the world was in daylight. They were electromechanical and cost about $1200, and the only reason to have one was if you needed to know just what our satellites could see at any given time. I managed to get one. Dan Spisak was helping install Tiger; I asked him if something like that was available for Dashboard. It took just under a minute to find the proper widget, download it, and install it. Widgets are apparently easy to construct if you know what you’re doing, being written in scripts and XML-like languages rather than coded. There are a lot of them available, and you can expect a lot more — and they’re easily added to Dashboard. Dashboard comes up when you press the F12 key, and unless it’s active, it’s not using system resources. There are a hundred other features to like about Tiger. For instance, if someone sends you a pile of pictures in e-mail, you can make an automatic slide show within Mail 2.0 rather than having to open them one at a time. Also, in the past, I had a lot of trouble connecting the Apple to my Active Directory Windows 2000 Server environment. With Tiger, It Just Works. No third-party software needed, no problems with length of path or computer names, nothing. If you’re using Windows Server 2003 you will need this (http://www.allinthehead.com/retro/218/ accessing-a-windows-2003-share-fromos-x) workaround to make your network visible to your Mac. And, of course, the really spectacular feature of Tiger is Spotlight, a search system that works wonders. It does take time to build an index, but it was able to do that surprisingly quickly, after which, it can find files, e-mail, pictures, you name it. Dan asked for a word to search for, and Alex suggested “Magneto.” Spotlight found two instances, one inside an Apple iMovie PDF in French, the other inside a WinHEC 2004 Power Point presentation. Double clicking on the file name opened the presentation in Keynote, which is Mac’s equivalent of Power Point, and took us to the Dr. Dobb’s Journal, September 2005

proper slide. All that was automatic. Finding words in documents or in e-mail was trivially easy. That was impressive enough that I tried an experiment: Find and show the Power Point presentation I had made at Henry Vanderbilt’s space conference (http:// www.space-access.org/). It was over on Lisabetta, the TabletPC. Tiger was able to connect to Lisabetta and search for the file on ∗.ppt (I had forgotten what I had named it). Once we identified it (“How To Get To Space.ppt”), it copied the file to the Mac and opened the presentation. All this took under a minute. We then closed the presentation, gave the system a few seconds to digest it, and used Spotlight to search for the word “irreverent,” which I knew I had used in “How To Get To Space.” Spotlight found it instantly and took us to the proper slide. Spotlight’s metadata capabilities are also extensible to new file formats through plug-ins available at http://www.apple.com/downloads/macosx/spotlight/. I could say more, but surely the point is clear. OS X 10.4.1 Tiger is what we’ve been waiting for a long time, and if this doesn’t attract a lot of people to Macs, nothing will. Winding Down There are two series of computer books well worth your notice. First, there is the O’Reilly & Associates “Annoyances” series, of which the best is PC Annoyances by Steve Bass, now in its second edition. The second set is the “Degunking” series from Paraglyph Press. These include Degunking Windows, by Joli Ballew and Jeff Duntemann; and Degunking your Email, Spam, and Viruses, by Jeff Duntemann. Every one of these books is worth going through. In each case, you’ll find a bunch of stuff you already knew, and other stuff of no interest to you, but in every book, there will be a couple of gems. The book of the month is A New Republic: A History of the United States in the Twentieth Century, by John Lukacs (New Haven, Yale University Press). This is a revised and retitled edition of Lukacs’s 1983 Outgrowing Democracy. Lukacs points out the profound changes in American values and institutions that often go unnoticed but affect us all. My father once told me that 1956 was a crucial year in the history of America, because the Suez affair was a turning point in the Cold War. Lukacs also identifies the midfifties as crucial for other reasons, although the Suez affair remains important as a reversal of a longstanding set of policies. DDJ 83

PROGRAMMER’S BOOKSHELF

What Were They Thinking? Gregory V. Wilson

T

he older I get, the more often I ask, “What were they thinking!?” Take The Hitchhiker’s Guide to the Galaxy, for example: The books are among the funniest ever written, but somehow, Hollywood managed to turn them into a movie that makes a parlimantary debate on export duties for raw cabbage look like The Life of Brian. Didn’t anyone read the script before they started shooting? More importantly, didn’t anyone watch it the whole way through before they sent it to theaters? Sigh. Unfortunately, it isn’t just Hollywood. The same thing sometimes happens to books. Buffer Overflow Attacks, by James Foster et al., is a case in point. I was really looking forward to it, both because I’m interested in computer security and because I think that showing students how malware works is a great way to teach them the nitty-gritty of systems programming. The book itself, though, was a disappointment. The authors seem to have no idea who they’re writing for: Anyone who needs to be told what a compiler is (page 13) isn’t going to understand the fragments of assembly code that are presented just a few pages later. The material is also badly organized. Concepts are used before they’re explained, then explained several times in quick succession. (Yes, I get it. You can’t have a null byte in an instruction sequence if you want to smuggle it into a program as a string…) There’s a lot of important material here, but it’s too raw to be a cost-effective read. Frank Cohen’s Java Testing and Design is a much better book, although it’s still one that I’d borrow rather than buy. The Greg is a DDJ contributing editor and can be contacted at [email protected]. His most recent book is Data Crunching: Solved Everyday Problems Using Java, Python, and More. http://www.ddj.com

Buffer Overflow Attacks: Detect, Exploit, Prevent James C. Foster, Vitaly Osipov, Nish Bhalla, and Niels Heinen Syngress, 2005 497 pp., $34.95 ISBN 1932266674 Java Testing and Design: From Unit Testing to Automated Web Tests Frank Cohen Prentice Hall PTR, 2004, 489 pp., $49.99 ISBN 0131421891 Painless Project Management With FogBugz Mike Gunderloy Apress, 2005 184 pp., $34.99 ISBN 159059486X Designing Effective Database Systems Rebecca M. Riordan Addison-Wesley, 2005 353 pp., $49.99 ISBN 0321290933 book gets off to a slow start, as ideas about testing are interrupted by apparent nonsequiturs on things like management style. After a while, though, the author settles into a fairly technical description of a webtesting tool called “TestMaker.” There’s plenty of example code (much of it written in the JVM-based dialect of Python called “Jython”), and some discussion of interoperating with .NET. It isn’t the “webtesting bible” so many programmers have been looking for, but it is proof by example that automated testing of webDr. Dobb’s Journal, September 2005

based applications doesn’t have to be any harder than any other kind of testing. Like Java Testing and Design, Mike Gunderloy’s Painless Project Management with FogBugz focuses on a single product. Unlike JT&D, this book is brief, to the point, lavishly illustrated, and has an introduction by Joel Spolsky (whose company builds the bug-tracking product the book describes). This book is a user guide, but it is to user guides what Kernighan and Ritchie’s The C Programming Language is to language manuals. Every question I had was answered within a paragraph or two of being raised; every explanation made sense, and the only thing that was missing was a glossary. I think we’ll all be better off if this book will become the standard against which other end-user documentation is measured. Finally, Rebecca Riordan’s Designing Effective Database Systems is just as well written as Gunderloy’s book and just as useful. I came to database programming late — I managed to program for a living for almost 15 years before writing my first SQL query— and it’s been difficult to find a book for someone with that kind of background. Riordan’s is a good fit. She writes for intelligent grownups, rather than dummies, without making too many assumptions about background knowledge. She starts with basic concepts, such as relations and normal forms, then moves on to database design, and finally talks about user-interface issues. Her examples are all Microsoft specific, but almost all apply equally well to PostgreSQL on Linux (which is my target platform). It’s a good book and I hope it’s as widely read as it deserves to be.

DDJ 85

OF INTEREST

Zugg Software has released zApp 2.0, an application development tool for Windows. zApp lets you create software that can be modified by end users. Using Perl, JavaScript, VBScript, Python, and other languages compatible with the Windows Scripting Host, end users can create “plugins” that add features to your software. Multiple scripting languages can be used within one application for added flexibility. zApp includes a set of visual components that support Themes and Skins. Existing ActiveX and COM components can also be used within zApp. The tool features database support based on JDBC so that Microsoft ADO is not required. zApp applications support popular databases. The developer’s version lets you encrypt zApp programs, while still allowing end users to customize them. Zugg Software 5465 Broadmoor Bluffs Drive Colorado Springs, CO 80906 719-527-0091 http://www.zuggsoft.com/ Softel vdm has released a new version of SftTabs/ATL 5.0 for use with Visual Basic 6 applications. The Version 5 offers 66 basic tab control styles, each of which can be customized. This version also introduces six new tab styles, gradient fill backgrounds, built-in MDI-style Close, Minimize and Restore buttons including tool tips, new button styles, simplified deployment using Registration-Free Activation on Windows XP and extensive online help. The tab control supports many different styles and tab locations (such as top, left, right, or bottom). Among its many features are background bitmaps, user-definable multirow indentation, and scrollable tabs. Scrollable tabs offer several button styles, which can even use custom images. Softel vdm Inc. 1436 Kinglet Drive Punta Gorda, FL 33950 941-505-8600 http://www.windowscontrols.com/ http://www.ddj.com

The MyEclipse Enterprise Workbench is a J2EE IDE that enhances Eclipse with additional features for enterprise Java development. Version 4.0 introduces a number of features, including UML modeling, Tapestry support, and an advanced Oracle connector for editing events, triggers, and stored procedures. MyEclipse JSF Developer includes support for Sun JSF RI 1.1.01, MyFaces 1.0.9, and Faces Config Designer all integrated with MyEclipse HotSync Deployment and WAR Export Tools. MyEclipse Hibernate enhancements include a new Hibernate Configuration editor and a Simplified Hibernate support wizard. Genuitec 2505 Powderhorn Drive Plano, TX 75025 888-267-4176 http://www.myeclipseide.com/ Absoft has announced support of its Pro Fortran 9.2 Compiler Suite for Apple’s Mac OS X 10.4 Tiger operating system. Pro Fortran 9.2 provides full support for Apple’s 64-bit operating system, letting Fortran developers fully exploit the entire addressable memory space of the G5 processor for the first time. Pro Fortran 9.2 can produce programs optimized for either 32-bit or 64-bit Mac OS X systems. Pro Fortran 9.2 still supports Mac OS X 10.3 Panther. Pro Fortran 9.2 also includes the Absoft IDE with programmer’s editor, graphical debugger, graphical libraries, and other tools. Further 9.2 improvements include Absoft’s application framework, the Macintosh Runtime Window Environment (MRWE) completely rewritten in Fortran 95 (with full source code included). Absoft Corp. 2781 Bond Street Rochester Hills, MI 48309 248-853-0050 http://www.absoft.com/ Dangberg Software Solutions has announced the release of Version 4.0 of its Jindent Java source-code formatter. Version 4.0 supports more than 300 formatting options and all language features of Java 5.0, including generics and other language extensions. Additionally, Jindent 4.0 introduces a new GUI design and native installers for all major operation systems. The Jindent 4.0 formatting engine includes functionality such as full-control sorting support for Java source-code fragments. Integrations with all major Java IDEs (JBuilder, Eclipse, NetBeans, IntelliJ IDEA, and JDeveloper) are included. Dangberg Software Solutions Julius-Leber-Strasse 32 33102 Paderborn, Germany + 49-179-3892126 http://www.jindent.com/ Dr. Dobb’s Journal, September 2005

Recursion Software has announced enhancements to Cinergi, its web services integration platform for multilanguage environments. Cinergi 1.02 support has been added for IBM’s AIX platform. Errorreporting capabilities have been extended, including streamlined error-message handling. For C++ integration, intelligence has been added to source parsing. Recursion Software Inc. 2591 North Dallas Parkway, Suite 200 Frisco, TX 75034 800-727-8674 http://www.recursionsw.com/ Tangible Engineering has released Tangible Architect 2.0, an updated version of its model-driven code generator for Visual Studio.NET. New features include UML support, business-object generation, database schema generation, data binding, and an object-oriented database browser. Tangible Architect generates a full business object implementation including data access code from C# interface definitions or UML. Tangible Engineering GmbH Beethovenstr. 14 D-73274 Notzingen, Germany + 49 0700 82644253 http://www.tangible.de/ FairCom has released its c-tree Server for Apple’s Mac OS X Server 10.4 Tiger operating system. The new database server exploits one of Tiger’s major new features — 64-bit support. The redesigned kernel and updated system software allows 64-bit addressing for extended memory access, thereby making available huge amounts of memory for application developers. In addition, the new operating system uses the latest version of the GCC compiler, GCC 4.0, featuring support for 64-bit code generation. Developers using FairCom’s ctree Server SDK, which lets you compile your own custom server, benefit from both the 64-bit code generation and the performance of this new compiler. FairCom 6300 West Sugar Creek Drive Columbia, MO 65203 573-445-6833 http://www.faircom.com/ DDJ Dr. Dobb’s Software Tools Newsletter What’s the fastest way of keeping up with new developer products and version updates? Dr. Dobb’s Software Tools e-mail newsletter, delivered once a month to your mailbox. This unique newsletter keeps you up-to-date on the latest in SDKs, libraries, components, compilers, and the like. To sign up now for this free service, go to http://www.ddj.com/maillists/.

87

SWAINE’S FLAMES

Slick & Slack YAP on Podcasting

S

lick: Welcome to Yet Another Program — or YAP, as we like to call it, from Slick, that’s me, the charming one, and my grumpy partner, Slack— Slack: That would be me. Slick: Well at least you admit it. Our topic for today is Apple’s sweet move to support podcasting in a major way in iTunes. Slack: And I just want to say thanks, Apple. Thanks for once again empowering The Rest of Us. Or as I prefer to put it, for arming nutcases. Slick: Wow. Not pulling any punches, eh Slack? Slack: Think about it: There are probably a million L. Ron Hubbards out there — Slick: Now that’s a sobering thought. Slack: And each one has a 500-page Dianetics inside him. Slick: And what a great place to keep it. Slack: Slick, the only thing keeping those manifestos inside those egotists is the effort of typing. And now Apple has removed that barrier. Slick: Well it’s not just Apple. Apple didn’t invent podcasting. Although I have to say, the iTMS implementation is now the game to beat. Slack: Slick, I’m old enough to remember the early days of the Desktop Publishing Revolution. Slick: For our younger listeners, that would be the mid-1980s, right after the invention of paper. Slack: Right, when term papers started looking like magazine ads and business letters started looking like ransom notes. Slick: Let me cut to the chase, because I think I get where you’re going with this. You fear the unleashing of a million podcasters, all with the egos of Matt Drudge but lacking his typing skills. Slack: Like I used to fear cooties. Slick: Fear not, querulous one. I have a different take on the phenomenon. Podcasting is just radio. Get it? But it’s Radio Done Right. Slack: I don’t listen to radio. Slick: What do you mean you don’t listen to radio? We’re on radio. Slack: I can’t help that. It’s a poor medium for transmitting information. Slick: But what about music, entertainment, humor? Slack: Oh, that’s good. Radio is good for those things. Just not news or opinion. Slick: And didn’t you tell me you wake up to the BBC? Slack: Yes, but — Slick: There you go. Talk radio. Slack: But only because I can’t understand what they’re saying at 5:00 AM with those British accents. I just hear soothing noises, like geese in the fog. It’s pleasant to wake up to. Slick: The idea of waking up at 5:00 AM is making my head hurt. Slack: Here’s what I think sound is good for: music and entertainment and alarms. But text is a much better medium for news and opinions. It’s efficient, you can search it and index it and edit it — Slick: You’re not listening in the right spirit. The people who listen to talk radio are basically using it as an alarm. Whole radio programs that are built on that premise, that everything the host thinks is the information equivalent of a burglar alarm. That’s why the jocks shout. It’s like Paul Revere: The British are coming. Only today, it’s the treasonous liberals are coming, or the fascist wingnuts. Slack: Yes, and I hate that. I think America should turn down the volume. Slick: Let me give you another analogy. Podcasting is like TiVo for audio. Radio when you want it. You use TiVo, right? Slack: No. I don’t like TV enough to worry about what I missed. Besides, that’s why God created reruns. Slick: Well if you’re so down on podcasting, then I guess you wouldn’t be interested in this. But you know, what’s slipped below the radar on this announcement is that Apple has become a content provider. Slack: You’re heading somewhere with this, I just know it. Slick: Well Apple’s 3000 podcast feeds have an interesting slant. Lots of NPR and ABC News, and nothing remotely like Rush Limbaugh. Slack: ABC News? Are you serious? Cripes. With a liberal bias like that, they’re going to have the FCC at the door. Slick: Hannity and O’Reilly on their case. Slack: Black helicopters over One Infinite Loop. Slick: And on top of that, the Beatles will probably sue them again. Slack: Oh Apple, what have you done?

Michael Swaine editor-at-large [email protected] 88

Dr. Dobb’s Journal, September 2005

http://www.ddj.com

Dr. Dobb's Journal (September)