Lecture Notes in Computer Science
Commenced Publication in 1973
Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board
David Hutchison, Lancaster University, UK
Takeo Kanade, Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler, University of Surrey, Guildford, UK
Jon M. Kleinberg, Cornell University, Ithaca, NY, USA
Alfred Kobsa, University of California, Irvine, CA, USA
Friedemann Mattern, ETH Zurich, Switzerland
John C. Mitchell, Stanford University, CA, USA
Moni Naor, Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz, University of Bern, Switzerland
C. Pandu Rangan, Indian Institute of Technology, Madras, India
Bernhard Steffen, University of Dortmund, Germany
Madhu Sudan, Microsoft Research, Cambridge, MA, USA
Demetri Terzopoulos, University of California, Los Angeles, CA, USA
Doug Tygar, University of California, Berkeley, CA, USA
Gerhard Weikum, Max-Planck Institute of Computer Science, Saarbruecken, Germany
5661
Cecilia S. Gal Paul B. Kantor Michael E. Lesk (Eds.)
Protecting Persons While Protecting the People
Second Annual Workshop on Information Privacy and National Security, ISIPS 2008
New Brunswick, NJ, USA, May 12, 2008
Revised Selected Papers
Volume Editors
Cecilia S. Gal
Paul B. Kantor
Michael E. Lesk
Rutgers University
School of Communication and Information
New Brunswick, NJ, USA
[email protected] [email protected] [email protected]
Library of Congress Control Number: 2009938117
CR Subject Classification (1998): E.3, K.6.5, D.4.6, K.4, K.4.1, C.2.6, H.2.8
LNCS Sublibrary: SL 4 – Security and Cryptology
ISSN 0302-9743
ISBN-10 3-642-10232-8 Springer Berlin Heidelberg New York
ISBN-13 978-3-642-10232-5 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law.
springer.com
© Springer-Verlag Berlin Heidelberg 2009
Printed in Germany
Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India
Printed on acid-free paper
SPIN: 12776047 06/3180 543210
Preface
The Second Annual Workshop on Privacy and Security, organized by the Center for Interdisciplinary Studies in Information Privacy and Security of the School of Communication and Information at Rutgers University, was held on May 12, 2008 at the Hyatt Regency, New Brunswick, New Jersey, USA.

A few of the papers in this volume were produced through a multi-step process. First, we recorded the talk given by each author at the workshop in May 2008. Next, we transcribed the recording. The authors then produced a draft of their paper from these transcriptions, refining each draft until the final version. Although the papers are not verbatim transcriptions of the talks given, some do retain the informal and conversational quality of the presentations. In one instance we have included some material from the question-and-answer period after the talk, since the material covered proved to be relevant and interesting. The majority of authors, however, preferred to contribute a more formal paper based on the material presented at the workshop.

A few notes about the language and conventions used in the book. Since some of the authors in this volume come from different parts of the globe, we have tried to preserve their native cadences in the English versions of their papers. And finally, a few papers have pictures from screen captures of illustrations or graphics created for computer displays. Although every effort was made to include the highest quality pictures so they would reproduce well in print, in some instances these pictures may not reproduce as well as might be desired, and we beg the reader’s indulgence.

We want to thank Rutgers University for its support of the ISIPS Program, DyDAn for sponsoring the workshop, and SPARTA, Inc. for its generous contribution for the workshop bags and nametags. We also want to thank our many reviewers for their help in the paper selection process, and the Program Committee for help with the initial direction and planning of the workshop.
May 2009
Cecilia S. Gal
Organization
Conference Co-chairs
Paul B. Kantor, Rutgers University, USA
Michael E. Lesk, Rutgers University, USA
Naftaly Minsky, Rutgers University, USA
Reviewers
Yigal Arens, University of Southern California, USA
Antonio Badia, University of Louisville, USA
Hsinchun Chen, The University of Arizona, USA
Gordon Cormack, University of Waterloo, Canada
Dennis Egan, Telcordia Technologies, USA
Stephen Fienberg, Carnegie Mellon University, USA
Mark Goldberg, Rensselaer Polytechnic Institute, USA
Jim Horning, SPARTA, Inc., USA
Leslie Kennedy, Rutgers University, USA
Moshe Koppel, Bar-Ilan University, Israel
Ivan Koychev, Bulgarian Academy of Science, Bulgaria
Don Kraft, Louisiana State University, USA
Carl Landwehr, IARPA, USA
Janusz Luks, GROM Group, Poland
Antonio Sanfilippo, Pacific Northwest National Laboratory, USA
Joshua Sinai, The Analysis Corporation, USA
David Skillicorn, Queen's University, Canada
Rebecca Wright, Rutgers University, USA
Program Committee
Yaakov Amidror, Lander Institute, Israel
Yigal Arens, University of Southern California, USA
Antonio Badia, University of Louisville, USA
Maureen Baginski, SPARTA, Inc., USA
Arthur Becker, IARPA, USA
Michael Blair, SAIC, USA
Endre Boros, Rutgers University, USA
Yigal Carmon, MEMRI, USA
Hsinchun Chen, University of Arizona, USA
Gordon Cormack, University of Waterloo, Canada
George Cybenko, Dartmouth College, USA
Timothy Edgar, ODNI, USA
Dennis Egan, Telcordia Technologies, USA
Yuval Elovici, Deutsche Telekom Research Laboratories at Ben-Gurion University, Israel
Stephen Fienberg, Carnegie Mellon University, USA
Uwe Glaesser, Simon Fraser University, Canada
Mark Goldberg, Rensselaer Polytechnic Institute, USA
Vladimir Golubev, Computer Crime Research Center, Ukraine
David Grossman, Illinois Institute of Technology, USA
Jim Horning, SPARTA, Inc., USA
Leslie Kennedy, Rutgers University, USA
Joseph Kielman, U.S. Department of Homeland Security, USA
Moshe Koppel, Bar-Ilan University, Israel
Ivan Koychev, Bulgarian Academy of Science, Bulgaria
Don Kraft, Louisiana State University, USA
Carl Landwehr, IARPA, USA
Mark Levene, Birkbeck University of London, UK
Janusz Luks, GROM Group, Poland
Richard Mammone, Rutgers University, USA
Joan McNamara, Los Angeles Police Department, USA
Rafail Ostrovsky, University of California, Los Angeles, USA
Gerhard Paass, Fraunhofer Institute for Intelligent Analysis and Information Systems, Germany
Warren Powell, Princeton University, USA
Fred Roberts, Rutgers University, USA
Antonio Sanfilippo, Pacific Northwest National Laboratory, USA
Bracha Shapira, Ben-Gurion University, Israel
Andrew Silke, University of East London, UK
Joshua Sinai, The Analysis Corporation, USA
David Skillicorn, Queen's University, Canada
Eugene Spafford, Purdue University, USA
Gary Strong, Johns Hopkins University, USA
Rebecca Wright, Rutgers University, USA
Stefan Wrobel, Fraunhofer Institute for Intelligent Analysis and Information Systems, Germany
Daniel Zeng, University of Arizona, USA
Conference Coordinator
Cecilia S. Gal
Conference Sponsors
Rutgers University, ISIPS, DyDAn, SPARTA Inc.
Table of Contents
The Challenges of Seeking Security While Respecting Privacy . . . . . 1
Paul B. Kantor and Michael E. Lesk

Section One: Statement of the Problem
Intelligence Policy and the Science of Intelligence . . . . . 11
Maureen Baginski
Cyber Security: Assessing Our Vulnerabilities and Developing an Effective Defense . . . . . 20
Eugene H. Spafford
Intelligence, Dataveillance, and Information Privacy . . . . . 34
Robyn R. Mace
Results of Workshops on Privacy Protection Technologies . . . . . 45
Carl Landwehr
Words Matter: Privacy, Security, and Related Terms . . . . . 57
James J. Horning

Section Two: Theoretical Approaches to the Problem
kACTUS 2: Privacy Preserving in Classification Tasks Using k-Anonymity . . . . . 63
Slava Kisilevich, Yuval Elovici, Bracha Shapira, and Lior Rokach
Valid Statistical Analysis for Logistic Regression with Multiple Sources . . . . . 82
Stephen E. Fienberg, Yuval Nardi, and Aleksandra B. Slavković

Section Three: Practical Approaches to the Problem
Suspicious Activity Reporting (SAR) . . . . . 95
Joan T. McNamara
Stable Statistics of the Blogograph . . . . . 104
Mark Goldberg, Malik Magdon-Ismail, Stephen Kelley, and Konstantin Mertsalov
Privacy-Preserving Accountable Accuracy Management Systems (PAAMS) . . . . . 115
Roshan K. Thomas, Ravi Sandhu, Elisa Bertino, Budak Arpinar, and Shouhuai Xu
On the Statistical Dependency of Identity Theft on Demographics . . . . . 122
Giovanni Di Crescenzo

Author Index . . . . . 139
The Challenges of Seeking Security While Respecting Privacy

Paul B. Kantor and Michael E. Lesk
Rutgers University, 4 Huntington Street, New Brunswick, NJ
[email protected]
Abstract. Security is a concern for persons, organizations, and nations. For the individual members of organizations and nations, personal privacy is also a concern. The technologies for monitoring electronic communication are at the same time tools to protect security and threats to personal privacy. Participants in this workshop addressed the interrelation of personal privacy and national or societal security, from social, technical and legal perspectives. The participants represented industry, the academy and the United States Government. The issues addressed have become, if anything, even more pressing today than they were when the conference was held. Keywords: personal privacy, national security, computer security, intelligence agencies.
“Three can keep a secret if two of them are dead”-- Poor Richard’s Almanack [1].
1 Introduction

Cooperative behavior is not unique to humans. Ants, bees, even bacteria seem to engage in cooperative behavior for survival and for defense against their enemies. But as far as anyone can tell they accomplish this with absolutely no expectation of privacy. And there may have been a point in the development of human intelligence when that was also true. But for all of recorded history, and probably well before it, keeping secrets was part and parcel of human communication.

In the earliest days, when communication was only by speech, two could keep a secret if each trusted the other and they were sure that no one was within earshot when they discussed the matter. Even that, of course, was subject to some limitations. The mere fact of being seen to go together to some place where you cannot be overheard does alert others to the possibility that there may be a secret there to be discovered. But it was with the introduction of writing, as a method for transferring information among people who could not speak with each other directly, that security really came into its own. Ciphers [2] have been with us for a long time; one of the earliest recorded ciphers was used by Julius Caesar. As long as written messages had to be carried from one place to another it was necessary that they be protected from
prying eyes. Other physical techniques, such as the wax seal with a special imprint or the use of invisible inks, attest to the importance of maintaining the privacy of communication. With the founding of the United States and the establishment of the postal service this requirement of privacy was written deeply into the fabric of American civilization. It has adapted, as modes of communication have changed, to deal with the telegram, and then the telephone. It is adapting, with some difficulty, to the era of email. Email must pass through many intermediary “hands”. Many employers, such as our own university, claim a legal right to all the email that we place on our servers, although they assure us that they would not look at it save for the gravest of reasons. All such stored records are subject to subpoena by the courts on proper authorization. With the advent of Voice over IP, telephonic communications also travel through the Internet, sliced into many packets and perhaps traveling across the ocean and back on their way from New York to Baltimore. Thus the opportunities for prying hands to read the mail are growing exponentially. All of us have a reasonable expectation of privacy if we are sending a message that says nothing more than “mom is feeling better and we hope you can drop by for coffee in the afternoon”. Unless, of course, this is a pre-agreed code meaning “get the explosives and meet me in the basement of the World Trade Center”.

This volume contains selected papers written to extend remarks presented at a workshop on privacy and security, sponsored by Rutgers University, the Center for Interdisciplinary Studies in Information Privacy and Security (ISIPS), the Department of Homeland Security Center for Dynamic Data Analysis (DyDAn), and SPARTA. At the time of the workshop, May 2008, it was clear from a cursory glance at the media that in the United States there was a strong and growing concern regarding the tensions between programs aimed at protecting the security of citizens and programs or policies aimed at protecting the privacy of those same citizens. There was a sense among many concerned observers that the United States had fallen behind the European Union in the protection of individual privacy. There was also a general sense, particularly following the attacks in the United States in 2001, in Madrid in 2004, and in London in 2005, that the North American and European community faced a sustained threat from a very diffuse organization whose primary symbolic center might be located in Al Qaeda.

The wheels of public concern turn in unpredictable ways. For example, one of the presenters shared with workshop participants the ACLU Pizza Video clip. That clip, which appears (at the time of this writing) on the website of the American Civil Liberties Union, was posted as a campaign message against the policies of the United States Administration at the time (President G. W. Bush). The video, which remains an engaging and insightful comment on the threats to privacy, is preserved at the Internet Archive [3].

Since the time at which the workshop was held, there has been a national election in the United States, with substantial changes in the composition of the Congress, and a new President in the White House. At the same time, the world economy has experienced a decline which is being compared to the most difficult recessions of the preceding 40 years. The spotlight of media and public interest therefore seems to have moved away from the themes of this conference.
However, it takes little reflection to realize that these themes remain as important as they ever were. In addition, there is no evidence that significant progress has been
made in the improvement of privacy-protecting investigation of suspicious activities or persons, or in clarifying the complex legal strictures under which those responsible for protecting the security of the nation operate. Thus each paper in this collection remains strikingly timely. The papers are organized into several thematic groups.
2 Statement of the Problem

The first group deals with issues of policy and regulation, with particular attention to the responsibilities of government. Keynote speaker Maureen Baginski explains how the language of the laws and policies that govern the activities of government agencies was developed in an era when information was hard to obtain, and when the very mechanism by which we obtained information also made it clear how and in what ways the information might be useful. Today, gathering information is like drinking from a fire hose, and bits of information that may be found by crawling the Web or monitoring signals traffic do not come labeled with the particular threat or target to which they should be associated. (In the world with which this conference is concerned, “target” may refer to the target of investigation or the target of a terrorist attack; here we mean the latter.)

Gene Spafford, of Purdue, highlights the lack of a national policy to respond to cyberattacks, which are steadily becoming more numerous and more sophisticated. He calls for a combination of better design of important software systems, better platforms on which to build them, and additional resources for research and education in computer security and reliability. The legal framework is also reviewed in the paper by Robyn Mace, “Intelligence, Dataveillance, and Information Privacy”. In surveying the need for privacy-protecting mechanisms, Carl Landwehr summarizes the work of a select commission, whose report has not yet been published, which defines a range of threats and a range of actions that might be taken. The report seeks to identify the actions that are most promising for near-term research. James Horning, of SPARTA, reviews the tremendous range of different meanings that one may attach to the term “security” and to the term “privacy” when working in this arena.
3 Theoretical Attacks on the Problem

From the legal and social framework, we turn to the technical appraisal of the problem. Is it easy, or is it difficult, to protect privacy? Is it simply a matter of removing people’s names from the file and replacing them with “Mr. X” and “Mr. Y”? This question is important because there are major research efforts aimed at finding automatic ways to “connect the dots” in support of intelligence and security. The researchers developing these programs, who are usually situated in universities, are not at all prepared to work with classified material. But they continually ask the government for “scrubbed data”, that is, data which has the same structure as the real stuff, but does not invade anyone’s privacy.
The problem turns out to be difficult. In the paper by Kisilevich et al., the authors explore the potential, and the limitations, of batching and grouping records to maintain the privacy of individuals. The levels of protection needed vary. If we want to protect records from the idle curiosity of a graduate student working on a research project, having five or six different individuals who would be candidates for a given identity might be enough protection; the student does not really have the time or the resources to sort them out. On the other hand, if we want to protect privacy from intrusion by large government organizations, an ambiguity of hundreds or even thousands might not be sufficient to really protect privacy (a small illustration of this grouping idea, the k-anonymity property, appears at the end of this section).

In another paper, Stephen Fienberg et al. explore the question from the perspective of rigorous statistical analysis, and point up the enormous difficulty of ensuring that privacy has been maintained when investigators might bring together two collections of scrubbed data and find that, while each retrieved set of documents is privacy-protecting, the combination of results is not.

A provocative presentation by Rafi Ostrovsky of the University of California, Los Angeles, sketched a mechanism by which it may prove possible to submit privacy-protecting queries from one agency to the data held by another, without violating necessary protections of privacy. That paper is not included in this volume, but there is a relevant publication in the open literature [4].
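The grouping idea behind the Kisilevich et al. paper can be made concrete with a small sketch. What follows is our own illustration of the basic k-anonymity property, not their kACTUS algorithm: a table is k-anonymous when every combination of quasi-identifying attributes (ZIP code, age band, and so on) is shared by at least k records, so that no individual can be narrowed down to fewer than k candidates. The attribute names and values below are hypothetical.

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """True when every combination of quasi-identifier values occurs in
    at least k records, so each individual hides among >= k candidates."""
    combos = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in combos.values())

# Toy example with already-generalized (hypothetical) attributes.
people = [
    {"zip": "089**", "age": "20-29", "diagnosis": "flu"},
    {"zip": "089**", "age": "20-29", "diagnosis": "asthma"},
    {"zip": "089**", "age": "30-39", "diagnosis": "flu"},
]
print(is_k_anonymous(people, ["zip", "age"], 2))  # False: the 30-39 group has only one record
```

In the scenarios sketched above, a k of five or six may deter the curious graduate student, while protection against a well-resourced organization would demand a far larger k; and, as the Fienberg et al. paper shows, even that guarantee can be undone when independently scrubbed releases are combined.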
4 Practical Approaches to the Problem

Joan McNamara, of the Los Angeles Police Department, reports on a “suspicious incident” reporting system that enables first responders, and the “officer on the beat”, to watch for any of some 74 indications that a threat relevant to homeland security may be underway. By integrating this additional classification into the normal workflow of the officers, the LAPD multiplies the effectiveness of the thousands of front-line observers, who are trained to see anything out of the ordinary. With the new reporting system, which is spreading to other cities, they are now able to feed potentially actionable information into the integrated “fusion centers” that look for a pattern in the “dots”.

Kelley presented a report on the work of the Rensselaer Polytechnic Institute team, headed by Mark Goldberg, on the necessary first steps towards practical monitoring of the exploding blogosphere. You cannot know what is a “suspicious indication” until you know what things look like ordinarily. We have very little experience with a “cyber beat” such as the blogosphere, or Second Life, or Twitter. To lay the foundation for automatic flagging of anomalies, this group is finding rigorous mathematical characterizations of the normal and stable appearance of this cyber beat (a toy illustration of the idea appears below).

Thomas et al. lay out a high-level overview of the components that must be present in a software system that supports security activities while protecting the privacy of the persons and organizations whose information it contains.

Di Crescenzo reports on the use of available metropolitan statistics to look for patterns of identity theft. Although this crime occurs in cyberspace, its victims are not randomly distributed around the country. There are concentrations of victims that can be predicted by the analyses in this paper. This can, in turn, guide efforts to increase security and to educate potential victims, so that the defensive effort is most cost-effective.
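To illustrate the premise behind the blogosphere work, that anomalies can only be flagged against a well-characterized baseline of normal activity, here is a minimal sketch of our own (it is not the Goldberg group's method, and the counts are hypothetical): daily activity is compared against the mean and standard deviation of a "normal" baseline window, and a new day is flagged when it deviates by more than a chosen number of standard deviations.

```python
import statistics

def deviates_from_baseline(baseline_counts, new_count, threshold=3.0):
    """Compare a new day's activity against a baseline of 'normal' days:
    flag it when it lies more than `threshold` standard deviations away."""
    mean = statistics.mean(baseline_counts)
    stdev = statistics.stdev(baseline_counts) or 1.0  # guard against a zero-variance baseline
    return abs(new_count - mean) / stdev > threshold

# Hypothetical daily post counts for one blog community.
normal_week = [102, 98, 105, 99, 101, 97, 103]
print(deviates_from_baseline(normal_week, 104))  # False: within ordinary variation
print(deviates_from_baseline(normal_week, 240))  # True: a day worth a closer look
```

The real research problem, of course, is establishing what the stable statistics of a blog graph actually are; the sketch only shows why such a baseline is a prerequisite for any automatic flagging.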
5 Two Editorial Postscripts

5.1 Plus Ça Change…

In the months that have passed since the workshop, and the much anticipated change of governing party in the United States, problems of the interrelation between security and privacy have become, if anything, an even greater concern. There is growing concern that many types of government systems, including the systems used by the numerous contractors who build equipment for, and supply services to, the governments of various nations, are far more easily subject to attack than those which the governments control themselves. It is extraordinarily difficult to trace the origins of these attacks, and to maintain a consistent policy across a richly diverse network of organizations and systems. As of this writing, the United States is moving to address this problem by establishing a cyber-security command. This will work at a high level within the civilian organization of the Pentagon, and will be headed, in all probability, by a career military officer. In other discussions, the National Security Agency, which is the base for the most advanced knowledge about cyber-security and threats against it within the United States government, is offering to expand the services it provides to the government as a whole. This raises concerns on the part of privacy advocates, as the operations of the National Security Agency (NSA) are not routinely open to public scrutiny.

As an example of the type of concerns addressed by participants in this workshop, it has recently been revealed in the press that telephone conversations of the Congresswoman who chairs the House Subcommittee on Intelligence, Information Sharing and Terrorism Risk Assessment, The Honorable Jane Harman, were inadvertently intercepted and transcribed by the National Security Agency. At the time of this writing it is not known how the existence of these calls, or the redacted information about their content, came to appear in the media. Without advocating conspiracy theories, it would be overly optimistic to assume that the strong ability to intercept conversations among U.S. persons does not carry with it the temptation to use information gained in that way to advance both honorable and dishonorable goals.

Similarly, the United States is committed, under the new administration of President Barack Obama, to move aggressively towards shared portable electronic health records. The content of health records raises enormous concerns about privacy. From the social perspective, although each generation seems to be more open about the discussion of illness than is the preceding generation, a vast portion of the population still has some parts of the medical record that it would wish to keep private even from close friends and associates. In addition, the complex method of providing health insurance coverage in the United States exacerbates the threats raised by a breach of privacy of information. A great many Americans are protected by health insurance policies which are purchased for them at group rates by their employers. Upon changing employment they must move to new coverage. At this point, the content of their medical records becomes engaged in the legal issues surrounding the honest search to obtain coverage at reasonable rates. An overly cautious insurance auditor might respond to the merest trace in a medical record and deny coverage. It is rumored that such traces are easy to find, as physicians follow complex guidelines on the indication of suspected diseases
in order to obtain coverage for laboratory tests that they deem necessary but that do not fall under purposes currently “coded” in the guidelines.

Bringing together these developments, we see a growing awareness of the threats to the national security apparatus, and to such vital governmental functions as national defense, from the proliferation of the uses of the World Wide Web and the built-in weaknesses of currently used software. At present the only “guaranteed defense” against such threats to information integrity is to detach a computer from the Web completely. With flash drives small enough to be concealed almost anywhere, the threat of physical removal of information grows almost as rapidly as the threat over the Internet. Thus it must be anticipated that the United States, and all other nations, will move aggressively to limit these threats. In doing so it is quite certain that they will at the same time introduce mechanisms with enormous potential to invade the privacy of individual citizens. Thus the concerns highlighted in this workshop have become, if anything, even more urgent and more complex than they were in May of 2008.

Paul Kantor, Rutgers. May 2009
6 Why Does Privacy Matter?

This conference is about technological methods to improve both privacy and security. All too often, in such discussions it is assumed that privacy is important for purely personal reasons. This is summarized in the phrase "if you have done nothing wrong you have nothing to hide", paraphrased recently by Senator Roland Burris to the City Club of Chicago as "I've done nothing wrong and I have absolutely nothing to hide" [5]. Economists, particularly the reigning Chicago school, will tell you that the more information there is, the better the markets work. Is that correct: is personal privacy merely a whim that interferes with social and economic efficiency?

Sometimes, giving up privacy is worth doing to improve a service you are seeking to use. Amazon.com, for example, tracks what you buy and what you search for, and as a result it makes recommendations for new books or music that are frequently useful [6]. Similarly, one company which sells air filters seems to keep track of purchases and sends reminders when the appropriate time period has elapsed, encouraging the owner to replace them. In these cases, the economists are correct: by knowing when you bought the last air filter, somebody can send you a useful message at the appropriate time, instead of bombarding you continuously to avoid the risk of missing the magic moment when the spirit moves you to buy a new filter.

So, if this is true, why are companies so anxious to keep their information private? After all, companies are not people, and do not have personal feelings such as shame or guilt. Yet they frequently see an economic benefit to keeping many of their activities private, and theft of "trade secrets" is a well-known offense. What is this benefit, and why does society respect it? As an example, during the dot-com boom a company that considered creating an internal directory of employee expertise rejected the idea, the objection being that such a directory was likely to fall into the hands of headhunters who would use it to
recruit staff away. And yet the usual economic argument would apply: if one of the employees was worth more to a different employer, then society as a whole would be better off if that employee changed jobs. The company in question, however, was not interested in maximizing the benefit to society at its own expense.

This argument is why we have patent law in the first place. If there were no patents, companies would rely entirely on trade secrets, and thus it would be difficult for inventors to figure out improvements on existing processes. By instituting a patent system, the government encourages companies to make their processes public, offering the prospect of royalties in exchange for giving up secrecy. This practice encourages benefits to society as a whole by dividing the gains from improvements between the original and the later inventor.

Unfortunately, there are cases where commercial privacy is needed for reasons other than the company's advantage over society in general; it is needed to prevent misappropriation by thieves. For example, Google does not publish the details of the PageRank algorithm, for at least two reasons. One is that it does not want competitors who have not shared in the development and data collection costs to be able to use the same system. The other is that the details would be used by those who want to push their listing higher on the page of search results to "fool" the algorithm. In 2006, for example, somebody exploited a quirk in the Googlebot spider to temporarily insert some 5 billion spam webpages as part of an effort to distort the search results [7]. Similarly, the IRS does not publish the exact thresholds which will cause a tax return to be audited; these would undoubtedly turn up in suggestions of the form "since you will be audited if you claim $200 or more for your cat's medical expenses, always claim $199 whether you have a cat or not."

Individual information, similarly, can be misused. If everybody's medical information were public, employers seeking to minimize the cost of their medical insurance might decide to avoid hiring people with chronic diseases. Logically, people would then avoid treatment and not tell their doctors about symptoms, which would lead to incorrect medical care and probably higher social costs later. Some years ago, for example, people avoided having HIV tests for fear that merely taking such a test would later mean they would not be able to purchase life or health insurance.

In fact, some instances of misuse are comical. The Canadian list of "do-not-call" phone numbers was sold to telemarketers operating outside Canada who promptly used it as a source of numbers to call [8]. The Tennessee Republican party requested the list of gun permit holders so they could ask them for contributions to help make that list confidential [9].

In general, if having more information gives one party to a transaction more of an advantage, you would expect that each party would try to keep its information private and get information from the other party. This is exactly what we observe. There are some cases where disclosing information helps both parties. In that case, you would expect people to present it voluntarily; the difficulty is that we may not always agree on these cases. For example, a complaint to a representative of a professional society, that it was selling its mailing list to an insurance company that was sending out advertising, was met with the reply that this was not junk mail, but a membership service.

How much is privacy worth?
This is a difficult question to answer. Ideally we would like to know how much each side will pay either to get information or to preserve
it as a secret. That is four separate questions, each of which has widely variable answers depending on the specific kind of information and the context.

What do businesses pay to get information about customers? Simple direct-mail lists cost about 10 cents per name, as does a minimal "pay-per-click" bid on a Web keyword. On the other hand, "mesothelioma" was going for $65 per click [10].

What do businesses claim their own information is worth? ClearOne recently won $9.7M in a trade secrets lawsuit over some telephony software [11]. Unfortunately, in this as in all other cases, the amount is based partly on the cost of developing the information stolen as well as its free market value. Or, for that matter, some iconic value, as in the case of the "11 herbs and spices" recipe for KFC chicken, which an analyst described as of "immeasurable value" [12]. On a smaller scale, an Oregon bakery named Tienda y Panaderia Santa Cruz sued two employees who left to work at El Grande Bakery, seeking $10,000 over the theft of recipes for bread and cake [13].

What do consumers pay to get information about companies? Well, the subscription price of Consumer Reports is $26.00 per year. This is not, of course, for trade secret information.

What do people pay to preserve their own privacy? Ross Anderson tried an experiment where he asked subjects how much they wanted to be paid in exchange for having their locations recorded every few minutes, 24/7, for a month. The answer was 10 pounds [14]. Other indications of a low value include the widespread adoption of affinity or club cards in supermarkets, the use of radio toll devices, and the like. In each case, relatively small rewards, such as a dollar off a pound of fish or a few minutes saved at a toll booth, are enough to get people to cede the possibility of observation.

Society has reasonable economic grounds for preserving privacy. It preserves a balance between corporations and individuals. It's not just about intangible beliefs. However, the values attached are fairly small. Right now, people seem relatively unconcerned about commercial privacy, or at least not placing a high cash value on it. So then why is this debate heating up? Why do we need a conference on possible technological solutions to the compromise between security issues and privacy issues? Both technology and society are shifting the balance of information transfer: it is easier to collect data, the rise of variable pricing makes it more likely to have an effect on you, and the growth of "disintermediation" makes for larger collections of personal data.

Clearly, new technology is making it easier to collect information about people. Credit card purchases, bar code scanners, and online shopping all mean that an exact record is left of who bought what. This assists businesses in building up mailing lists. An amateur theater company, for example, might encourage patrons to use credit cards rather than cash in order to obtain their names and addresses for marketing purposes. Web activity permits even more careful surveillance. More than your purchases are tracked; every mouse click is recorded, and even the time that you spend with your mouse hovering over a particular button is known and used [15]. Again, this is intended for use in a later bargaining session: what products you are offered, and at what price, may depend on how much the seller knows about you.
The airlines are the most obvious example, with their constant effort to find ways to offer lower-priced tickets to "discretionary" (leisure) travelers and higher-priced tickets to those who
must travel. They usually decide on your kind of travel based on how far in advance you are buying the ticket and which days you are staying at the destination. But in the future you can easily imagine that the price quoted to you for a hotel room might depend on whether your last few Web purchases were at Prada or Target. Price discrimination, in which vendors try to sell to each customer at the maximum price that customer will pay, is on the rise. This raises the commercial value of the information available to both sides.

Another change taking place is the shrinking of the sales chain, often called "disintermediation". If you buy a book in a local bookstore, even if the store knows your name, they don't tell the wholesaler, and the wholesaler doesn't tell the publisher. These supply chains are being condensed, so that a purchase made at the online L.L. Bean website provides information directly to the original vendor. The smaller the number of companies we deal with, the greater their opportunity to use their corporate data about us. Sometimes this is a benefit to us as consumers; if L.L. Bean remembers somebody’s shirt size across transactions, that's a convenience. But when Amazon.com tried to charge returning customers more than new customers for the same book, there was a public outcry [16]. Although people have historically put a low economic value on privacy, this might well change if they understand what the cost to them is going to be. We may look forward to an increasing number of "shopping bots" that operate by concealing the consumer's information as they surf for pricing. With luck, this might produce a bit more transparency in the general selling system, so that consumers feel that it is more of an even bargain.

What are the implications of the commercial privacy debates for government security? Even though government security is not trying to extract money from people, individuals still think of this in the context of "somebody else is trying to capture information about me". Do we really know whether a security camera in Penn Station is trying to identify passengers in order to help Amtrak marketing or to help the New York City police? There are scare stories about companies targeting marketing to people who have browsed websites about cancer, and there are scare stories about the FBI asking for the names of people who borrowed library books about Osama bin Laden. The technology is pretty much the same. So if we wish legitimate public needs to be served without provoking unreasonable reactions, it would help if we could design technology that was either more protective of personal privacy or at least more transparent about what it did. If we continue down the current road of invasive technology, we are likely to see heightened public fear and thus increased difficulty in providing information for national purposes.

Michael Lesk, Rutgers. May 2009
References

[1] Franklin, B.: Poor Richard’s Almanack. Peter Pauper Press, White Plains (1980)
[2] Kahn, D.: The Codebreakers: The Comprehensive History of Secret Communication from Ancient Times to the Internet. Scribner, New York (1996)
[3] ACLU “Pizza”, http://web.archive.org/web/20080306060321/http://www.aclu.org/pizza/ (accessed April 22, 2009)
[4] Ostrovsky, R., Skeith, W.E.: Private Searching on Streaming Data. Journal of Cryptology 20(4) (2007), http://www.cs.ucla.edu/~rafail/PUBLIC/Ostrovsky-Skeith.html (accessed March 11, 2009)
[5] Korecki, N., McKinney, D.: More holes in Burris’ story. Chicago Sun-Times (February 18, 2009)
[6] Moran, M.: Search is getting personal. Revenue Today, p. 28 (January/February 2007)
[7] Lester, E.: http://articles.apollohosting.com/search/A05-Pushing_Bad_Data-Googles_Latest_Black_Eye.php (2006)
[8] Galloway, G.: Fraudsters abusing do-not-call list. Toronto Globe and Mail (January 23, 2009)
[9] Sher, A.: GOP spokesman wants database of all Tennessee handgun carry permit holders. Chattanooga Times Free Press (April 3, 2008)
[10] Liptak, A.: Competing for clients, and paying by the click. The New York Times (October 15, 2007)
[11] Womble Trade Secrets (2009), http://wombletradesecrets.blogspot.com
[12] Schreiner, B.: http://kathythompson.wordpress.com/2008/09/09/kfc-extra-super-secret-secret-11-herbs-spices/ (2008)
[13] Pitkin, J.: http://blogs.wweek.com/news/2008/02/12/juicy-suits-war-between-rival-hispanic-bakeries/ (2008)
[14] Danezis, G., Lewis, S., Anderson, R.: How Much Is Location Privacy Worth? In: 4th Workshop on the Economics of Information Security (2005), http://infosecon.net/workshop/pdf/location-privacy.pdf
[15] Edmons, A., White, R., Morris, D., Drucker, S.: Instrumenting the Dynamic Web. J. of Web Engineering 6, 244–260 (2007)
[16] Cox, J.L.: Can differential prices be fair? J. Product and Brand Management 10, 264–275 (2001)
Intelligence Policy and the Science of Intelligence

Maureen Baginski
SPARTA, Inc.
[email protected]
Abstract. Intelligence policy must protect the security of intelligence sources, and the privacy of individuals. We have moved away from a world in which the most important information was secret, and was very hard to collect. Today there is a lot of valuable information that is available from open sources, and it provides key context for intelligence analysis. At the same time, a scientific focus is needed to define the missing elements, so that they can be collected. Moving away from the “vacuum cleaner” approach will improve intelligence operations and, at the same time, solve many of the most difficult issues of privacy of citizens and security of sources. Keywords: privacy, US Persons, information sharing.
1 The Environment Has Changed

Good morning, it is very nice to see you, and thank you for having me here. It will come as no surprise to you that I am not a technologist, and I never did play one on TV. I am a practitioner. And I come to you this morning as a person who spent 27 years in the intelligence community dealing with issues of privacy and information sharing. During those 27 years the world “changed under my feet”. When I started in the intelligence community, there was, for all practical purposes, no Internet; there was no cable news. Cell phones were heavy things in a suitcase, with a battery. And “the computer” was a room you went to, made a query and got the answer two weeks later. Now, I’m not that old, and 27 years is not very much time for the revolution that has come upon us. In addition to the technological changes, I think, even more importantly, that during those times many of us couldn’t (well, I certainly couldn’t) imagine a world that didn’t have the Soviet Union in it as our major concern.

I want to concentrate on the rate of change. Historians can argue about what caused which changes. I think it isn’t all that important which caused what. What we do know is that the fall of the Berlin Wall and the Iron Curtain allowed for globalization and the extension of the IT infrastructure that has created the very networked world we are dealing with today. Because of this it is no longer true that “what happens over there only happens over there”, and “what happens over here in the US happens only over here”. We don’t have these convenient compartments any longer.

When I was working as an analyst, I was the Soviet Electric Power Analyst. It was wonderful. (I had no idea why I was doing that, by the way.) So, I was the Soviet electric power analyst, and my communication source was a dedicated channel for specific information relevant to
my responsibility. I stared at those communications all day long, and I could keep up with that very specific target. The communication channel and the geography were in a one-to-one mapping, as the mathematicians would say. And, as an analyst, I didn’t have to worry about privacy at all, because my information was all collected overseas. We did not deal with individuals: the nation state was the adversary we dealt with.

That’s another way in which the world has grown more complex. Many people do not realize that in the 1950s there were fewer than 60 nation states. But today there are well over one hundred. And what is the single quality that most of them share? It is that they are weak, much weaker as nation states than many of them were when they were parts of larger and more powerful nation-states. What has emerged, almost on a par with them and in some cases to take their place, are things that I would call “shadowy criminalized organizations” (SCO). These criminal enterprises use the marvelous information technology and the global capability to network seamlessly and make geography irrelevant. Geography is not a barrier in terms of where they’re doing harm and how they’re doing harm.
2 Intelligence and Crime-Fighting

Consider, for example, North Korea. Now, what everyone most worries about relative to North Korea is the nuclear weapons. We can see the nukes, and they scare the heck out of us. Now, if you are really going to do something about that threat, what do you have to understand? You have to dig into the question of how North Korea finances its nuclear development. On the evidence, they do it by counterfeiting and by drug smuggling. So we began with what is clearly a national security issue, and we find it is joined at the hip to a law enforcement problem. So, is the North Korean nuclear issue a national security problem, or is it a law enforcement problem? The answer is “yes”. It’s both.

North Korea is not alone. Afghanistan is almost a “poster child” for this connectedness. For quite a few years that was a nation state. Correct. But controlled by whom? By terrorists. So in fact it was a terrorist state, and those terrorists controlled the instruments of power. And I don’t have to say anything more, because we are all aware of what happened on 9/11. Those terrorists, in their nation-state, were communicating with a bunch of people right here in this country, and we know what happened as a result of that. Their preferred battlefield is the territory of the United States.

For one final example, my favorite and former subject: Russia. We may ask: who’s in charge of Russia? Putin. But besides Putin there are the oligarchs, along with a resurgent central government and very strong organized crime organizations which have their roots in the old secret police. Their primary tool is the control of infrastructure, which enables them to control the flow of goods and services. Organized criminals in Russia love to open branch offices right here in the United States of America. So, again, we have to ask: is that a national security issue or is that a criminal problem?

In short, the case I’m making right now is that the world has changed. Not only is the economy global, but threats to national security are global as well. What happens over there now does indeed also happen over here. In the intelligence community we have to come to grips with unifying the information collected from multiple sources.
3 Three Truths about Intelligence

I will give you three truths about the community that I come from:

1) Privacy and security are conflated in the intelligence community (IC). As Jim Horning points out, “words matter”. In the intelligence community and the law enforcement community, finding the right words to describe what needs to be done has been an art form. But what people felt “needed to be done” was to not share information. If “security” doesn’t work as a reason not to share, then privacy is offered as the reason not to share. This is a habit; it is a rut in the brain. Within the intelligence community it is a social issue, a huge social and cultural issue. Just to spell out the chain of reasoning: in the IC, “security” is about protecting the raw source of our information. And there is a shared belief, a supposition, that anything tied to that information might reveal the source. That’s the mindset in the intelligence community: protect the security of our information sources. Now consider the law enforcement community; what are they worried about? They really are worried about privacy. They have the authority to collect information about US persons because they are law enforcement officers. Insofar as they are intelligence officers, they do not have that authority. This is a big difference. If you are an intelligence community professional you’re really proud of what you do. If you are in the law enforcement profession you are very proud of what you do. But because of these two different mindsets -- and, really, two different missions -- here’s how they view one another. The law enforcement community views the intelligence community as breaking the laws of the country it operates in. Is that wrong? No. That happens to be right. On the other hand the IC view of the law enforcement guys is, politely, “they just don’t understand Intell”. They do not understand the need to share information to prevent harm to the nation; they only understand prosecution after the fact of harm. And the Intell community doesn’t understand the solemn responsibility that they, as law enforcement officers, have. That is true.

2) The soft issues in this problem are in fact the hardest issues. Even if we can resolve people’s concerns with technology, which I believe we can, there’s a lack of trust in the technology and across the two communities. Do not underestimate that.

3) With all the respect that I have for technology and technologists, who are far more brilliant than I am, I have never seen a technology alone solve a problem that was not solved first on a legal-policy level. Only afterwards can it be clearly understood on the concept-of-operations level.

Those three things are the biggest challenges, in my view, to “connecting the dots”. I must say I hate that term, but let’s just use it, because you know what I mean. Connecting the dots. We know information doesn’t come marked “terrorism information”. Good lord, do I wish it did. Recall that when I was learning the craft, information came marked “Soviet electric power information”, because there was a one-for-one mapping between the kind of information and its communication stream. It was easy: geography = source. It was a one-for-one equation. That is not the case right now.
4 The Purposes of Intelligence

Let’s think about the attack in Madrid in 2004. Radical Moroccans joined Spanish society. How do they support themselves? By counterfeiting compact discs, low-level drug trafficking, and so forth. Then they buy some cell phones from a known international Indian criminal enterprise. Next they buy their explosives from a known local Spanish criminal enterprise. They steal a van, they drive to a train station and they kill hundreds of people. What was that activity: criminal, terrorism, a national security issue? Unless you are thinking about national security, there is no reason to connect this strange set of low-level crimes. So how on earth could you have known and put those pieces together? We all agree that connecting the dots is essential.

Now, what I hope to do in the next couple of minutes is to give you a sense of why we are where we are, and some strategies for getting past the difficulties. I have a very pedestrian view of intelligence. It is, very simply, vital information about people and things and, in some cases, places that would do you harm. And (much to the chagrin of some of us who did this for multiple years) we actually don’t do it for its own sake. If that were the purpose, analysts could sit in their little cubicles and just have a grand old time writing briefings. But they do it to inform the decision makers; that’s the bottom line. Those decision makers range from the President in the White House, at the top, to the patrolman on the street. The key to the job is to understand how to serve all of them.
5 Origins of Intelligence Policy

Now let’s recall how modern Intell was born. I will date that to 1947. What was the information environment like? There was very little information and it was extraordinarily hard to get, right? Now, let’s fast-forward some 60-odd years. What is today’s information environment? There’s too much information and it’s very hard to understand.

Many of you have seen the picture of the intelligence cycle. If you go to the Internet Archive to check the old CIA website, the cycle actually doesn’t go clockwise either; it goes counterclockwise. It’s a fascinating thing, right? I just figured they were doing their own thing. In theory it starts with requirements, it moves to collection, goes to processing, analysis and dissemination. And each term means something. Requirements are, “the President said this is important”; national security concerns, etc. So the Intell community turns those into requirements. Now the homeland community and the Intell community do it together; and that’s a “cool” thing.

But because of the way we grew up, I would argue, the single biggest problem we have today is that we have not recognized the shifts in the business environment. The world gives us a new challenge now: “too much information that is too hard to understand”. But we, by our habits, are still completely collection-driven. This is a holdover from the days when there was too little information and it was too hard to get. Because of this, all the investment of energy in building safeguards, all of the policy, and all of the law, is written about collection. None of the policy and none of the law is written about processing.
Let me describe processing in two parts. The first part is making the unintelligible intelligible to a human being. That might be signals (zeros and ones) or encrypted systems, and so on. But the first step is making it intelligible to a human being. Just as important is the second part of processing: placing that information in corporate accessible databases.

Now, over the years the collection activity and the processing activity were hooked together, and so we created what are called “stovepipes”. How could such a terrible thing have happened? There was no notion in the world that anything else was called for. Those systems, policies and regulations were built with not the faintest notion that any of those databases would ever be shared outside of the authority that collected them. That’s the source of our challenge. Laws and policies have been written about who has the authority to collect certain information: who can do Sigint (Signals Intelligence), who can do Humint (Human Intelligence), and who can do law-enforcement types of collection. But where is the policy, and where is the body of law, for federating a query across that information? I have looked for it. I have not found it. That’s the problem on which we’re working. At this workshop we have the Deputy Civil Liberties Officer of the Office of the Director of National Intelligence (DNI). Neither that position nor the deputy ever existed before. We recognize that the DNI can be thought of as a czar for information sharing. But nonetheless, establishing policies, and maintaining privacy and security, is the biggest issue.
6 Shortcomings of Present Law

Now, I will tell you why I don’t see an answer in law, and then you can argue with me; in fact, I’m sure some of you will argue with me. The National Security Act of 1947 and the new version of that, the Intelligence Reform and Terrorism Prevention Act of 2004 both define two types of “intelligence operations”, which we can read as “collection operations”: foreign intelligence and counterintelligence. What is the difference? One is to catch bad guys and the other is to find out about bad guys. I have looked to see whether, in 2004 when we updated the law, there was a change. There wasn’t. The law still says there are two types of intelligence operation, foreign and counter. So when you hear any magical discussion about “domestic intelligence”, get a grip. There is no such thing. To the extent that we do domestic intelligence collection in the IC, it is determined by where you do it: abroad it is foreign; at home it is not domestic, it is “counter”.
7 Responsibility for Domestic (Counter) Intelligence
Who do we leave domestic intelligence collection to? The arm of national power that operates in the United States, that is, the law enforcement community. Think about the Foreign Intelligence Surveillance Act (FISA). There has been a lot of debate about FISA in recent years. What is FISA? It's a method for obtaining warrants; it is a law enforcement mechanism. The Foreign Intelligence Surveillance Act is a law enforcement mechanism to make lawful the collection of information to, from, or about a US person.
So my point is that the types of operations remain the same. Years ago, when the Intelligence law was written, there is a wonderful line in it, to the effect that “now that we’ve defined Intell operations, let’s define information about the national security.” So, what does it say? It says that information about the national security, or national security information, “can come from anywhere”, that’s what the law says (Section 3(1)). And it leaves it at that. If you are a law enforcement person and you are collecting law enforcement information, then you’re collecting information based on your law enforcement authority. Where is the enabling policy that lets you share it outside of that? That stroke of the pen in law, simply saying that you “can” or that you “should” is one thing. But where’s the policy? Where’s the policy that says a law enforcement professional can take law enforcement sensitive information and share it with an Intell community person? We’re working on it, and we’re developing it. But, to me, this absence of policy is the biggest issue. Do not be terribly comforted by the fact that we have asserted that this information “can be shared”. Because there are years, there are decades of law that have not been altered, and these laws circumscribe the way people can collect and share information. So where do we stand? Here’s how it looks to me: We have got the collection end right, and we’ve always had the dissemination right, we throw a person in the sharing pathway, who cleanses stuff out and you get the right stuff being disseminated. Where the key to this problem lies, where it has not yet been resolved, is in the processing steps and in the analysis steps. If we don’t get processing right, so that there can be federated clearing of common data, we’re not going to solve this problem. Let me give you a couple of examples. Right now, “man in the middle” is always necessary, even as you researchers think of technology solutions to help us out. But there aren’t enough people in the world to keep people “in the middle”. We can do some of this with technology and computers. At one time, I was in charge of Sigint, signals intelligence. And the laws were very intrusive, and very circumscribing in terms of what we could and could not collect. I had some partners who came to me and said “I want some access to your raw data”. And my answer to that was automatically and always “absolutely not!” Because I had a certain set of laws about how that stuff was to be looked at and handled, and especially the concept of “minimization”. So when I got a request, I would “arm wrestle” with our lawyers. And eventually I found a complex “workaround” that could be used. To do this I would say: “for the purposes of this data, I am going to make these individuals and this partner agency part of the Sigint production chain. I will train them, I will conduct oversight, I will make sure that they do minimization according to my rules. So for the present purposes I’m going to make them an asset that is part of our operations.” That was the only solution that made sense. However, can you imagine how labor intensive that is? Train them and provide oversight. Just think about it. So, when partners came and wanted access to our data, we did the right thing for them. But we did not enable them to take some of the human load off of our shoulders. And this is a perfect time for me to ask:
Can’t technology be used to lift some of this load? Let’s consider the specific requirement of “minimization”. That is, simply removing US identities, at least in a pedestrian description. So, can’t you guys figure out a way to do that? Flag it, tag it, screen it out? Make sure that it is not in there. Make it not accessible to those who shouldn’t have it. That’s a real operational issue that I imagine we could solve today, but we haven’t solved that through technology, yet.
8 The National Counter Terrorism Center (NCTC)
And finally, let me talk to you about the NCTC. Very quickly, the National Counter Terrorism Center is where we've brought all the instruments of national power together. The whole idea is that we bring the FBI, CIA, the NSA -- all the "warring parties" into one room. Then we would finally create an environment where we could do cross-organizational analysis of terrorism information. How did we do it the first time? Each of them had their own CPU under their own desk and was connected to their individual network! Right! Each person was only allowed to query what he or she could have queried anyway. And then they could sort of talk to people, but not until they checked "back home" to see if they could actually release that information to them. You are laughing, but that is true, that is absolutely true. Why was this? Because the FBI data was collected with FBI authority and the Sigint data was collected with Sigint authority, and so on. We have not resolved the federated query problem. Yes, the NCTC is a huge step forward. It keeps everybody in a room; they can at least say yes, "I have that same information". But it is like the children's game of "go fish": "got any 2s?"; "yes"; "got any 3s?"; "yes"; "got any…?" So people can do that, and it's a huge advance. But it is not a solution to the federated query problem.
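For illustration only, a federated query in this sense might look like the sketch below, in which each source evaluates the query under its own authority and applies its own release policy before anything is returned. Every source name, authority label, and release rule here is hypothetical; the hard part the talk points to is not this plumbing, but the policy that would let such per-source release rules be written and trusted in the first place.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Source:
    """A separately governed holding with its own release policy (all hypothetical)."""
    name: str
    records: List[Dict[str, str]]
    may_release: Callable[[Dict[str, str], str], bool]  # per-source policy check

def federated_query(sources: List[Source], selector: str,
                    requester_authority: str) -> List[Dict[str, str]]:
    """Each source releases only what its own policy allows; the requester
    never touches the raw holdings directly."""
    results = []
    for src in sources:
        for rec in src.records:
            if selector in rec.get("selector", "") and src.may_release(rec, requester_authority):
                results.append({"source": src.name, **rec})
    return results

if __name__ == "__main__":
    fbi = Source("FBI", [{"selector": "subject-42", "summary": "case note"}],
                 may_release=lambda rec, auth: auth in ("law_enforcement", "nctc_analyst"))
    sigint = Source("SIGINT", [{"selector": "subject-42", "summary": "minimized intercept"}],
                    may_release=lambda rec, auth: auth == "nctc_analyst")
    print(federated_query([fbi, sigint], "subject-42", requester_authority="nctc_analyst"))
```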
9 A Possible Solution
Here's what I think they should do. Have any of you ever been stopped by a cop? [laughter]. What happens? You're sitting in your car. And what is the officer doing? He's checking you for warrants. You know that; you stop shaking and find your registration and all your papers. And he's sitting back there in his car, and he's checking for warrants. How is that possible? So if you are in Michigan in an accident, does it ever occur to you that for 60 years the police in this country have some kind of a system? Are they sharing their data? Do they say "sure, go fishing in mine"? No, they're not. What happened is that they held a leadership meeting about criminal justice information. They decided what elements of information are necessary for all policemen. They sat in a room and decided what they were going to share, and they decided what format it would be in. And they said "the price of admission for all of this information is that you have to abide by this policy; you have to respect these standards and if you misuse it you are kicked out and you never use it again".
Because of this working agreement, for 60 years that miracle has been going on without violating anybody's privacy. Because they decided first, what information they needed. They didn't say "let's suck it all up now and figure it out later". But that "vacuum cleaner" approach, today, is exactly what scares me. Isn't that what scares you? The data you get that way just cannot be managed on a human level. Nor can it accommodate our very real privacy concerns.
In summary, the changes we need to make to enable information sharing begin with changing our business model. The imperative for that change is the change in the information environment. We can no longer approach this by collecting all the information we can and sorting it out later. There is too much information to successfully scale that business process; and collecting all information possible and figuring it out later raises significant civil liberties concerns. We must first understand what is already known and from there understand what is unknown and what the gaps are to prevent harm to the nation. Only then should we launch collection operations to fill critical information gaps, and leverage all instruments of national power to do so lawfully.
Questions and Answers
Question: I have a question about the use of policy and it's a cultural one… What if that organization uses policy as an excuse not to share or as a barrier for something that needs to be overcome? Do you have any sense of that?
Maureen Baginski: I think we are unfortunately back to 50/50. I think after 9/11 we got a little smarter with the clarity that a crisis gives you, and a commonality of purpose. The further we get from it the more barriers can be put up. How do we deal with the law? It's what I used to describe as a self-inflicted wound. "You wrote it, it's yours, you can change it". But what happens, and Mike Hayden is right about this, is that the law stands at a certain point, and policy generally takes 4 steps back from what is really allowed, to avoid coming too close to the law. We did that for years, Art, as you know. It's a really good thing to hide behind because one of the real issues is that the various parts of the community don't understand how one another do business. So when people use policy to avoid sharing, you have to just say "OK, it is law enforcement, and we can't use it", right? So you just kind of nod your head and in the end it's not shared at all.
Q: Any thoughts about how that might change?
MB: Yes. I think the solution lies in a total re-examination of the business models for intelligence. As long as it is a vacuum cleaner collection approach, I don't think you can get out of the bind we're in. What I would advocate is based on the fact that there's more information outside of the intelligence community than is inside of it. There is more that exists in the non-secret world. So if we stay focused on the information that is secret, what I fear is happening is that there is no rich and accurate depiction of reality, of what is true in the world, for intelligence professionals to place their data into. Intelligence professionals are trying to create reality based only on what they collect. That is a recipe for getting it wrong most of the time. So I would advocate that the intelligence community shift completely and use open-source as a way to help it define what is truly secret, what it must truly go after. Along with this we may need to make the Intell community smaller and make it do nothing but
national security again. The IC never was built to do everything. Unfortunately, I think that as the information environment has changed in the way it has, we have just continued to collect whatever we can, and not to focus on what we really need. We need to get back to where we collect only what we really should, and that’s the stuff that really is secret. We weren’t born to deal with CNN and many other open sources. Q: This point of view comes from a different era, right, where different kinds of information was collected in different ways, and when there was a sense of what each piece of information was related to. “If you know the database, if you know where the information came from”. Now the sense that many people have is “if we know everything about him, what he has in the morning for breakfast, then we’ll be able to tell if he’s going to put a bomb somewhere.” MB: And that’s what’s wrong. Q: Well, that is a very strong sense that some people have today, right? MB: It is and I share your concern with that and what I just said to [the previous questioner] is part of my answer. I think that view is wrongheaded and it certainly won’t scale. It’s sort of like saying “well, give me as much information as you possibly can and then I’ll figure out if I really need it or not”. I believe that, in the end, intelligence is related to scientific reasoning. And I think we have to return that rigor and discipline to the intelligence community. We must ask: “What do we need to know?” We must take a problem, cut it into its simplest parts and then gather the information that would help a decision maker to make a decision. We instead are saying it’s a mystery and it’s an art, and really we couldn’t possibly understand it so we need everything. That’s wrong.
Reference
National Security Act of 1947; Act of July 26 (1947) (as Amended), http://www.intelligence.gov/0-natsecact_1947.shtml
Speaker Bio
Keynote Speaker Maureen Baginski began her cryptologic career as a Russian Language instructor in 1979. During her tenure, Ms. Baginski held various operational management positions, including a tour as a Senior Operations Officer in the National Security Operations Center and the Signals Intelligence (SIGINT) National Intelligence Officer for Russia. Ms. Baginski held the position of Signals Intelligence Director during the terrorist attacks on the United States on September 11, 2001. She directed the Extended SIGINT Enterprise during this time to acquire, produce, and disseminate foreign SIGINT to a wide variety of government and military customers. She interacted with the Executive Branch, members of the Intelligence Community, and worldwide SIGINT partners providing crucial information during a critical time. She left NSA to take a senior position at the Federal Bureau of Investigation. She is now President of the National Security Systems Center at SPARTA.
Cyber Security: Assessing Our Vulnerabilities and Developing an Effective Defense*
Eugene H. Spafford
Purdue University CERIAS, West Lafayette, IN
[email protected]
Abstract. The number and sophistication of cyberattacks continue to increase, but no national policy is in place to confront them. Critical systems need to be built on secure foundations, rather than the cheapest general-purpose platform. A program that combines education in cyber security, increasing resources for law enforcement, development of reliable systems for critical applications, and expanding research support in multiple areas of security and reliability is essential to combat risks that are far beyond the nuisances of spam email and viruses, and involve widespread espionage, theft, and attacks on essential services.
Keywords: cyber security, security education, security policy, vulnerabilities, patching.
1 Introduction
Our country is currently under unrelenting attack. It has been under attack for years, and too few people have heeded the warnings posed by those of us near the front lines. Criminals and agents of foreign powers have been probing our computing systems, defrauding our citizens, stealing cutting-edge research and design materials, corrupting critical systems, and snooping on government and private information. Our systems have been compromised at banks, utilities, hospitals, law enforcement agencies, every branch of the armed forces, and even the offices of the Congress and White House. Although exact numbers are impossible to obtain, some estimates currently run in the tens to hundreds of billions of dollars per year lost in fraud, IP theft, data loss, and reconstitution costs. Attacks and losses in much of the government and defense sector are classified, but losses there are also substantial.
Over the last few decades, numerous reports and warnings of the problems have been issued. When I was a member of the President's Information Technology Advisory Committee (PITAC) in 2003-2005, we found over a score of carefully researched and well-written reports from research organizations that highlighted the dangers and losses, and pointed out that the problem was only going to get worse unless drastic action was taken. Our report from the PITAC, Cyber Security: A Crisis of Prioritization [1], published in 2005, echoed these concerns but was given scant attention.
* This essay is derived from testimony presented by the author on March 19, 2009 to the Senate Committee on Commerce, Science, and Transportation. A copy of that testimony is .
Other notable reports, such as Computers at Risk: Safe Computing in the Information Age [3], Cyber Security Today and Tomorrow [5], Trust in Cyberspace [4], and Toward a Safer and More Secure Cyberspace [2], all by the National Academies, have similarly been paid little attention by leaders in government and industry. Meanwhile, with each passing week, the threats grow in sophistication and number, and the losses accumulate. The economic impact is equivalent to having one or two Hurricane Katrinas per year, with almost no response; the effects throughout the economy are diffused, so they are not as noticeable as an all-at-once event.
I do not mean to sound alarmist, but the lack of attention being paid to these problems is threatening our future. Every element of our industry and government depends on computing. Every field of science and education in our country depends, in some way, on computing. Every one of our critical infrastructures depends on computing. Every government agency, including the armed forces and law enforcement, depends on computing. As our IT infrastructure becomes less trustworthy, the potential for failures in the institutions that depend on it increases. Furthermore, each new incident erodes public trust in those institutions.
There are a number of reasons why our current systems are so endangered. Most of the reasons have been detailed in the various reports I mentioned above and their lists of references, and I suggest those as background. I will outline some of the most significant factors that I see as contributing to the problem, in no particular order:
• Society has placed too much reliance on marketplace forces to lead to development of solutions. This strategy has failed, in large part, because the traditional incentive structures have not been present: there is no liability for poor quality, and there is no overt penalty for continuing to use faulty products. In particular, there is a continuing pressure to maintain legacy systems and compatibility rather than replace components with deficient security. The result is a lack of reward in the marketplace for vendors with new, more trustworthy, but possibly more expensive products.
• Our computer managers have become accustomed to deploying systems with inherent weaknesses, buying add-on security solutions, and then entering a cycle of penetrate-and-patch. As new flaws are discovered, we deploy patches or else add on yet new security applications. There is little effort devoted to really "designing in" security and robustness from the beginning. This also has contributed to unprotected supply chains, where software and hardware developed and sold by untrusted entities is then placed in trusted operational environments: the (incorrect) expectation is that the add-on security will address any problems that may be present.
• There is a misperception that security is a set of problems that can be "solved" in a static sense. That is not correct, because the systems are continuing to change, and we are always facing new adversaries who are learning from their experiences. Security is dynamic and changing, and we will continue to face new challenges. Thus, protection is something that we will need to continue to evolve and pursue.
• Too few of our systems are designed around known, basic security principles. Instead, the components we do have are optimized for cost and speed rather than
resilience and security, and those components are often needlessly complex. Better security is often obtained by deploying systems that do less than current systems – extra features not necessary for the task at hand too often provide additional avenues of attack, error, and failure. However, too few people understand cyber security, so the very concept of designing, building, or obtaining less capable systems, even if they are more protected, is viewed as unthinkable.
• We have invested far too little in the resources that would enable law enforcement to successfully investigate computer crimes and perform timely forensic activities. Neither have we pursued enough political avenues necessary to secure international cooperation in investigation and prosecution of criminals operating outside our borders. As a result, we have no effective deterrent to computer crime.
• The problems with deployed systems are so numerous that we would need more money than is available simply to patch existing systems to a reasonable level. Unfortunately, this contributes to a lack of funding for long-term research into more secure systems to replace what we currently have. The result is that we are stuck in a cycle of trying to patch existing systems and not making significant progress towards deploying more secure systems.
• Over-classification hurts many efforts in research and public awareness. Classification and restrictions on data and incidents mean that it is not possible to gain an accurate view of the scope or nature of some problems. It also means that some research efforts are inherently naive in focus because the researchers do not understand the true level of sophistication of adversaries they are seeking to counter.
• Too little has been invested in research in this field, especially in long-term, risky research that might result in major breakthroughs. We must understand that real research does not always succeed as we hope, and if we are to make major advances we must take risks. Risky research led to computing and the Internet, among other things, so it is clear that some risky investments can succeed in a major way.
• We have too many people who think that security is a network property, rather than understanding that security must be built into the endpoints. The problem is not primarily one of "Internet security" but rather of "computer and device" security.
• There is a common misconception that the primary goal of intruders is to exfiltrate information or crash our systems. In reality, clever adversaries may simply seek to modify critical applications or data so that our systems do not appear to be corrupted but fail when relied upon for critical functions – or worse, operate against our interests. We seldom build and deploy systems with sufficient self-checking functions and redundant features to operate correctly even in the presence of such subversion.
• Government agencies are too disorganized and conflicted to fully address the problems. Authorities are fragmented, laws exist that prevent cooperation and information sharing, and political "turf" battles all combine to prevent a strong, coordinated plan from moving forward. It is debatable whether there should be a single overarching authority, and where it should be if so. However, the current disconnects among operational groups including DHS, law enforcement, the armed forces and the intelligence community are a key part of the problem that must be addressed.
• We have too few people in government, industry and the general public who understand what good security is about. This has a negative effect on how computing is taught, designed, marketed, and operated.
I would be remiss not to note that most systems handling personal information have also been poorly designed to protect privacy. Good security is necessary for privacy protection [7]. Contrary to conventional wisdom, it is not necessary to sacrifice privacy considerations to enhance security. However, it takes additional effort and expense to design to both protect privacy and improve security, and not everyone is willing to make the effort despite the rewards.
This battle is global. Our colleagues in other countries are also under siege from criminals, from anarchists, from ideologues, and from agents of hostile countries. Any effective strategy we craft for better cyber security will need to take into account that computing is in use globally, and there are no obvious national borders in cyberspace.
Additionally, it is important to stress that much of the problem is not purely technical in nature. There are issues of sociology, psychology, economics and politics involved (at the least). We already have technical solutions to some of the problems we face, but the parties involved are unable to understand or agree to fielding those solutions. We must address all these other issues along with the technical issues if we are to be successful in securing cyberspace.
2 Rethinking Computing
Fifty years ago, IBM introduced the first commercial all-transistor computer (the 7000 series). A working IBM 7090 system with a full 32K of memory (the capacity of the machine) cost about $3,000,000 to purchase – over $21,000,000 in current dollars. Software, peripherals, and maintenance all cost more. Rental of a system (maintenance included) could be well over $500,000 per month. The costs of having such a system sit idle between jobs (and during I/O) led the community to develop operating systems that supported sharing of hardware to maximize utilization. It also led to the development of user accounts for cost accounting and development of security features to ensure that the sharing didn't go too far. As the hardware evolved and became more capable, the software also evolved and took on new features.
Costs and capabilities of computing hardware have changed by a factor of tens of millions in five decades. It is now possible to buy a greeting card at the corner store with a small computer that can record a message and play it back to music: that card has more memory and computing power than the multimillion dollar machine of 1958. Yet, despite these incredible transformations, the operating systems, databases, languages, and more that we use are still basically the designs we came up with in the 1960s to make the best use of limited equipment. We're still suffering from problems known for decades, and systems are still being built with intrinsic weaknesses.
1 Adapted from Rethinking computing insanity, practice and research, CERIAS Weblog, December 15, 2008, ., which was itself derived from my essay in the October 2008 issue of Information Security magazine.
We failed to make appreciable progress with the software because, in part, we've been busy trying to advance on every front. It is simpler to replace the underlying hardware with something faster, thus getting a visible performance gain. This helps mask the ongoing lack of quality and progression to really new ideas. As well, the speed with which the field of computing (development and application) moves is incredible, and few have the time or inclination to step back and re-examine first principles. This includes old habits such as the sense of importance in making code "small" even to the point of leaving out internal consistency checks and error handling. (Y2K was not a one-time fluke – it was an instance of an institutionalized bad habit.)
Another such habit is that of trying to build every system to have the capability to perform every task. There is a general lack of awareness that security needs are different for different applications and environments; instead, people seek uniformity of OS, hardware architecture, programming languages and beyond, all with maximal flexibility and capacity. Ostensibly, this uniformity is to reduce purchase, training, and maintenance costs, but it fails to take into account risks and operational needs. Such attitudes are clearly nonsensical when applied to almost any other area of technology, so it is perplexing they are still rampant in IT. For instance, imagine the government buying a single model of commercial speedboat and assuming it will be adequate for bass fishing, auto ferries, arctic icebreakers, Coast Guard rescues, oil tankers, and deep water naval interdiction – so long as we add on a few aftermarket items and enable a few options. Fundamentally, we understand that this is untenable and that we need to architect a vessel from the keel upwards to tailor it for specific needs, and to harden it against specific dangers. Why can we not see that the same is true for computing? Why do we not understand that the commercial platform used at home to store Aunt Bea's pie recipes is not equally suitable for weapons control, health care records management, real-time utility management, storage of financial transactions, and more? Trying to support everything in one system results in huge, unwieldy software on incredibly complex hardware chips, all requiring dozens of external packages to attempt to shore up the inherent problems introduced by the complexity. Meanwhile, we require more complex hardware to support all the software, and this drives complexity, cost and power issues.
The situation is unlikely to improve until we, as a society, start valuing good security and quality over the lifetime of our IT products. We need to design systems to enforce behavior within each specific configuration, not continually tinker with general systems to stop each new threat. Firewalls, intrusion detection, antivirus, data loss prevention, and even virtual machine "must-have" products are used because the underlying systems aren't trustworthy—as we keep discovering with increasing pain. A better approach would be to determine exactly what we want supported in each environment, build systems to those more minimal specifications only, and then ensure they are not used for anything beyond those limitations. By having a defined, crafted set of applications we want to run, it will be easier to deny execution to anything we do not want; to use some current terminology, that's "whitelisting" as opposed to "blacklisting."
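As a rough sketch of the whitelisting idea (the file path handling is generic, and the approved digest below is invented for illustration rather than taken from any real deployment), a minimal platform might keep nothing more than a short list of approved program digests and refuse to run anything that is not on it:

```python
import hashlib
import sys

# Digests of the small, defined set of applications this system exists to run.
# The value below is a placeholder, not a real deployment's allowlist.
APPROVED_SHA256 = {
    "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08",
}

def is_whitelisted(path: str) -> bool:
    """Allow execution only if the program's hash appears on the approved list."""
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    return digest in APPROVED_SHA256

if __name__ == "__main__":
    if len(sys.argv) != 2:
        sys.exit("usage: allowlist_check.py <path-to-program>")
    target = sys.argv[1]
    if is_whitelisted(target):
        print(f"{target}: approved to run")
    else:
        print(f"{target}: not on the whitelist, execution denied")
```

The point is not the handful of lines of code but the design stance they embody: the system enumerates what it exists to do, and everything else is denied by default.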
This approach to design is also craftsmanship—using the right tools for each task at hand, as opposed to treating all problems the same because all we have is a single tool, no matter how good that tool may be. After all, you may have the finest quality multi-tool money can buy, with dozens of blades and screwdrivers and pliers. You would never dream of building a house (or a government agency) using that
multi-tool. Sure, it does many things passably, but it is far from ideal for expertly doing most complex tasks.
Managers will make the argument that using a single, standard component means it can be produced, acquired and operated more cheaply than if there are many different versions. That is often correct insofar as direct costs are concerned. However, it fails to account for secondary costs, such as the cost of total failure and exposure, and the cost of "bridge" and "add-on" components needed to make items suitable. There is far less need to upgrade and patch smaller and more directed systems than large, all-inclusive systems because they have less to go wrong and do not change as often. There is also a defensive benefit to the resulting diversity: attackers need to work harder to penetrate a given system, because they do not know what is running (e.g., [8]). Taken to an extreme, having a single solution also reduces or eliminates real innovation as there is no incentive for radical new approaches; with a single platform, the only viable approach is to make small, incremental changes built to the common format. This introduces a hidden burden on progress that is well understood in historical terms – radical new improvements seldom result from staying with the masses in the mainstream.
Therein lies the challenge, for researchers and policy-makers. The current cyber security landscape is a major battlefield. We are under constant attack from criminals, vandals, and professional agents of governments. There is such an urgent, large-scale need to simply bring current systems up to some minimum level of security that it could soak up far more resources than we have to throw at the problems. The result is that there is a huge sense of urgency to find ways to "fix" the current infrastructure. Not only is this where the bulk of the resources is going, but this flow of resources and attention also fixes the focus of our research establishment on these issues. When this happens, there is great pressure to direct research towards the current environment, and towards projects with tangible results. Program managers are encouraged to go this way because they want to show they are good stewards of the public trust by helping solve major problems. CIOs and CTOs are less willing to try outlandish ideas, and cringe at even the notion of replacing their current infrastructure, broken as it may be. So, researchers go where the money is – incremental, "safe" research.
We have crippled our research community as a result. There are too few resources devoted to far-ranging ideas that may not have immediate results. Even if the program managers encourage vision, review panels are quick to quash it. The recent history of DARPA is one that has shifted towards immediate results from industry and away from vision, at least in computing. NSF, DOE, NIST and other agencies have also shortened their horizons, despite claims to the contrary. Recommendations for action (including the recent CSIS Commission report to the President [9]) continue this tunnel vision by posing the problem as how to secure the current infrastructure rather than asking how we can build and maintain a trustable infrastructure to replace what is currently there. Some of us see how knowledge of the past combined with future research can help us have more secure systems. The challenge continues to be convincing enough people that "cheap" is not the same as "best," and that we can afford to do better.
Let's see some real innovation in building and deploying new systems, languages, and even networks. After all, we no longer need to fit in 32K of memory on a $21 million computer. Let's stop optimizing the wrong things, and start focusing on discovering and
building the right solutions to problems rather than continuing to try to answer the same tired (and wrong) questions. We need a major sustained effort in research into new operating systems and architectures, new software engineering methods, new programming languages and systems, and more, including some approaches with a (nearly) clean-slate starting point. Failures should be encouraged, because they indicate people are trying risky ideas. Then we need a sustained effort to transition good ideas into practice.
I'll conclude this section with a quote that many people attribute to Albert Einstein, but I have seen multiple citations to its use by John Dryden in the 1600s in his play The Spanish Friar: "Insanity: doing the same thing over and over again expecting different results." What we have been doing in cyber security has been insane. It is past time to do something different.
3 Education
One of the most effective tools we have in the battle in cyber security is knowledge. If we can marshal some of our existing knowledge and convey it to the appropriate parties, we can make meaningful progress. New knowledge is also necessary, and there too there are urgent needs for support.
3.1 History
In February 1997, I testified before the House Science Committee [10]. At that time, I observed that nationally, the U.S. was producing approximately three new Ph.D.s in cyber security per year. I also noted that there were only four organized centers of cyber security education and research in the country, that none of them were very large, and that all were judged to be somewhat at risk. Indeed, shortly after that testimony, one of the centers2 dissolved as institutional support faded and faculty went elsewhere.
Although the number of university programs and active faculty in this area has increased in the last dozen years, the number involved and the support provided for their efforts still fall far short of the need. As an estimate, there have been fewer than 400 new Ph.D.s produced in cyber security in the U.S. over the last decade, with some nontrivial percentage leaving the U.S. to work in their countries of origin. (Over 100 of those graduates have come from CERIAS at Purdue.) Of those that remained, fewer than half have gone back into academia to be involved in research and education of new students.
In my testimony in 1997 and in subsequent testimony in 2001 [11], I provided suggestions for how to increase the supply of both students and faculty in the field to meet the anticipated demand. Three of my suggestions were later developed by others into Federal programs: the Centers of Academic Excellence (CAE), the NSF Scholarship for Service program, and the NSF Cyber Trust program.
Today, we have about a dozen major research centers around the country at universities, and perhaps another two dozen secondary research groups.
2 At the University of Wisconsin-Milwaukee.
Many—but not all—of these institutions are certified as CAEs, as are about 60 other institutions providing only specialized cyber security education. The CAE program has effectively become a certification effort for schools offering educational programs in security-related fields instead of any true recognition of excellence; there are some highly regarded programs that do not belong to the CAE program for this reason (Purdue and MIT among them). One problem with the way the CAE program has evolved is that it does not provide any resources that designated schools may use to improve their offerings or facilities.
The Scholarship for Service program, offered through NSF, has been successful, but in a limited manner. This program provides tuition, expenses and a stipend to students completing a degree in cyber security at an approved university. In return, those students must take a position with the Federal government for at least two years or pay back the support received. Over the last seven years, over 1000 students have been supported under this program at 30 different campuses. The majority of students in these programs have, indeed, gone on to Federal service, and many have remained there. That is an encouraging result. However, the numbers work out to an average of about four students per campus per year entering Federal service, and anecdotal evidence indicates that demand is currently five times current production and growing faster than students are being produced. Nor does this program address needs in other segments of U.S. society.
NSF has been the principal supporter of open university research in cyber security and privacy through its Cyber Trust program (now called Trustworthy Computing). That effort has produced a number of good results and supported many students to completion of degrees, but has been able to support only a small fraction (perhaps less than 15%) of the proposals submitted for consideration. Equally unfortunate, there has been almost no support available from NSF or elsewhere in government for the development and sustainment of novel programs that are not specifically designated as research; as an example, CERIAS, as an important center of education, research and outreach, has never received direct Federal funding to support core activities, staff, and educational development. If it were not for periodic gifts from generous and civic-minded industrial partners, the center would have disappeared years ago—and may yet, given the state of the economy. Other defined centers are similarly precariously funded.
3.2 Future
We need significant, sustained efforts in education at every level to meet the challenges posed by cyber security and privacy. In the following, I will outline some of the general issues and needs, with some suggestions where Federal funding might be helpful. A study by an appropriate organization would be necessary to determine more precisely what program parameters and funding levels would be useful. Given the complexity of the issues involved, I can only outline some general approaches here.
Let me note that many of these activities require both a ramp-up and sustainment phase. This is especially true for postgraduate programs. We do not currently have the infrastructure to switch into "high gear" right away, nor do we have the students available. However, once students are engaged, it is disruptive and discouraging to
them and to faculty if resources and support are not provided in a steady, consistent fashion.
I will start by reiterating my support for the existing Scholarship for Service program. It needs to include additional funding for more students, and to allow recipient institutions to pursue curricular development and enhancement, but is otherwise functioning well.
K-12. Our children are the future. We should ensure that, as they are being taught how to use the technology of tomorrow, they are also getting a sound background in what to do to be safe when using computers and networks. We teach children to cover their mouths when they sneeze, to wash their hands, and to look both ways when they cross the street—we should also ensure that they know something about avoiding phishing, computer viruses, and sharing their passwords. Older students should be made familiar with some of the more complex threats and issues governing computing, especially privacy and legal implications.
Avenues for teaching this material certainly include the schools. However, too many of our nation's schools do not currently offer any computing curriculum at all. In many schools, all that is taught on computers is typing, or how to use the WWW to research a paper. Many states have curricula that treat computing as a vocational skill rather than as a basic science skill. Without having a deeper knowledge of the fundamentals of computing it is more difficult to understand the issues associated with privacy and security in information technology. Thus, teaching of computing fundamentals at the K-12 level needs to be more widespread than is currently occurring, and the addition of cyber security and privacy material nationally should be considered as part of a more fundamental improvement to K-12 education. Recently the leaders of the computing community released recommendations on how the Federal Government's Networking and Information Technology Research and Development (NITRD) Program could be strengthened to address shortfalls in computer science education at the K-12 level.3
Consideration should be given to encouraging various adjunct educational opportunities. Children's TV is one obvious venue for conveying useful information, as is WWW-based delivery.
Computing has a significant diversity problem. Cyber security and privacy studies appear, anecdotally, to be especially attractive to students from underrepresented groups, including females. Presenting meaningful exposure to these topics at the K-12 level might help encourage more eager, able young people to pursue careers in those or related STEM fields.
Undergraduate Degrees. Of the thousands of degree-granting institutions throughout the U.S., perhaps only a few hundred have courses in computer security basics. These courses are usually offered as an elective rather than as a part of the core curriculum. As such, basic skills such as how to write secure, resilient programs and how to protect information privacy are not included in standard courses but relegated to the elective course. This needs to change or we will continue to graduate students who do not understand the basics of the area but who will nonetheless be producing and operating consumer computing artifacts.
3 http://www.acm.org/public-policy/NITRD_Comment_final.pdf
More seriously, we have a significant shortfall of students entering computing as a major area. Last year was the first year in six in which the enrollment of undergraduates in CS did not decline. This concern is important not only from a national competitiveness standpoint; it also implies that we will have a significant shortfall of trained U.S. citizens in the coming years to operate in positions of national responsibility. We are already off-shoring many critical functions, and without an increase in the U.S. production of computing majors, this will pose a significant national security threat.
Graduate Degrees. There is disagreement within the field about the level of education needed for some positions in the workforce. Clearly, there is a range of positions, some of which may only require an undergraduate degree, but many that require at least a Master's degree. Some educators (myself included) believe that a strong undergraduate degree in computing or software engineering, or in some other field related to cyber security (e.g., criminal justice), should be obtained, followed by a graduate degree to ensure appropriate depth of knowledge.
There continues to be a need for Ph.D. graduates in cyber security. Individuals at this level are needed for advanced concept development in academia, industry and government. Generally, a Ph.D. is also required for faculty positions and some senior technical supervisory positions. Given the strong demand in this field and the number of institutions with need of faculty with experience in security or privacy topics, there will undoubtedly be a continuing and increasing demand for graduates at this level.
One of the issues facing researchers in academia is the lack of access to current commercial equipment. Most funding available to researchers today does not cover obtaining new equipment. Universities also do not have sufficient resources to equip laboratories with a variety of current products and then keep them maintained and current. As a result, unless faculty are adept at striking deals with vendors (and few vendors are so inclined), they are unable to work with current commercial security products. Consequently, their research may not integrate well with fielded equipment, and may even be duplicative of existing solutions. The situation is in some senses similar to that of the 1980s, when major research institutions were able to seek grants to get connections to research networking, but has evolved to a point where almost every college and university has network access. We now need a program to fund the instantiation of experimental laboratories for cyber security with a cross-section of commercial products, with an eventual goal of having these be commonplace for teaching as well as research.
Some faculty and their students are willing and able to work on classified problems so long as that work is near enough to their home institution to make travel reasonable. The best solution is to have a facility on campus capable of supporting classified research. This is not common on today's campuses.4 It is not inexpensive to build or retrofit a facility for classified processing, and it is costly to staff and maintain it. Research grants almost never cover these costs. A Federal program to identify institutions where such facilities would be useful, and then to build and support them, might be helpful.
To produce graduate students requires resources for stipends, laboratory equipment, and general research support, as well as support for the faculty advisors.
4 As an example, I need to travel over 70 miles from Purdue to be able to find a cleared facility.
Given university overhead costs, it will often cost more than $250,000 over a period of years for a graduate student to complete a Ph.D. That support must be consistent, however, because interruptions in funding may result in students leaving the university to enter the workforce. Additionally, there needs to be support for their advisors, usually as summer salary, travel, and other expenses. Here again, consistency (and availability) are important. If faculty are constantly worried about where the money will come from for the coming year, some will choose to leave the field of study or academia itself.
Other disciplines. Computing is not the only area where advanced research can and should occur. As noted earlier, the cyber security "ecology" includes issues in economics, law, ethics, psychology, sociology, policy, and more. To ensure that we have an appropriate mix of trained individuals, we should explore including training and support for advanced education and research in these areas related to cyber security and privacy. Encouraging scholars in these areas to work more closely with computing researchers would provide greater synergy. One possibility that should be explored is to expand the current Scholarship for Service program in a manner that includes students taking advanced degrees with a mix of cyber studies and these other areas; as an example, the program might fund students who have completed an undergraduate degree in cyber security to obtain a J.D., or fund a student with a degree in public policy to obtain an M.S. in cyber privacy. Upon graduation those individuals would be highly qualified to enter government service as policy experts, prosecutors, investigators, and other roles where there is currently an urgent and growing need for multidisciplinary expertise.
Training. There are many people working in the IT field today who have security and privacy as one of their job functions. Given the pace of new tool development, best practices, new threats, and other changes, it is necessary that these individuals receive periodic training to stay current with their positions. Many third-party organizations are currently providing such training (although the expense per student is significant), but as demand grows it seems unlikely that these efforts will scale appropriately. It is also the case that not all individuals who currently need such training either know they need it, or can afford it. There should be an effort made, perhaps through DHS and/or the Department of Education, to provide ongoing training opportunities to the workforce in a cost-effective and timely manner. This might be by way of some mechanism that is delivered over the Internet and/or through community colleges. "Train the trainer" opportunities should be considered as well. Note that this is not the same as continuing education, as it assumes that the students involved already know how to perform their jobs. Rather, this is training in new tools and techniques to enable individuals to stay current in their positions.
Adult Education. The majority of citizens today using personal computers do not know anything about computer security, yet they are common targets for fraud and abuse. Phishing, spam, and botnets are all generally targeted at home computers. Most people do not know that they need additional knowledge about security, and those that do are often unsure where to go to obtain that knowledge.
This is an area where many different techniques could be employed. Having educational modules and resources available online for citizens to review at their leisure would seem to be an obvious approach. Providing incentives and materials for ISPs, community groups, public libraries, and perhaps state and local governments to offer courses and information would be another possibility. Public television is yet another avenue for education of the general population about how to defend their computing resources.
Coupled with this effort at citizen education might be some program to provide access and ratings of products that could be obtained and deployed effectively. Unfortunately, there are many ineffectual products on the market, and some that are actually malicious in the guise of being helpful. Providing resources for citizens to get product details and up-to-date information on what they should be doing could make a large difference in our national cyber security posture.
Professional Education. We have many people in professional roles who use computers in their work, but who were not exposed to computing education during their formal studies. These positions include law enforcement personnel, judges, doctors, lawyers, managers, C-level executives, bankers, and more. In these various professions the individuals need education and training in cyber security and privacy basics as they relate to their jobs. They also need to be made aware that lack of security has real consequences, if not for their organizations, then for the country, and that it should be taken seriously. Many professional organizations already provide organized training along these lines; for example, the National White Collar Crime Center (NW3C) offers courses for law enforcement personnel. Mechanisms need to be developed to help scale these offerings and motivate more professionals to take them. Where no such courses are available they need to be developed in conjunction with experienced and competent advisors who understand both the material involved and the issues specific to the professions.
4 Concluding Remarks
The cyber security problem is real. Informed warnings have been largely ignored for years, and the problems have only gotten worse. There is no "silver bullet" that will solve all our problems, nor are solutions going to appear quickly. Any program to address our problems will need to focus on deficiencies in our regulatory system, in the economic incentives, and in user psychology issues as well as the technical issues. We need a sustained, significant research program to address questions of structure, deployment, and response. We need a significant boost to law enforcement to act as an effective deterrent. Most of all, we need a comprehensive and wide-reaching program of education and training, to bring far more of the population into addressing the problem than the small number of experts currently involved.
Thus, there needs to be a significant investment made in both students and research in cyber security and privacy. The PITAC report made a conservative recommendation in 2005 of tripling available research funding per year, although the committee
privately discussed that 4-5 times the base could be productively spent. We noted that much of the money designated as R&D funding is really spent on the "D" portion and not on research. In the years since that report, it is unlikely that the amount has more than doubled, and that is due, in part, to standard inflationary issues and across-the-board increases rather than any targeted effort.
A conservative estimate for FY 2010 would similarly be to at least triple the current allocation for basic research and for university fellowships, with some nontrivial fractions of that amount dedicated to privacy research, cyber forensics tools and methods for law enforcement, cyber security infrastructure, and multidisciplinary research. Equal or increasing amounts should be allocated in following years. An additional annual allocation should be made for community and professional education. This is almost certainly less than 1% of the amount lost each year in cyber crime and fraud in the U.S. alone, and would be an investment in our country's future well-being. Again, it is important to separate out the "R" from the "R&D" to ensure that increases are made to the actual long-term research rather than to short-term development.
There must be a diverse ecology of research funding opportunities supported, with no single agency providing the vast majority of these funds. Opportunities should exist for a variety of styles of research to be supported, such as research that is more closely aligned with specific problems, research that is better coordinated amongst larger numbers of investigators, research that involves significant numbers of supporting staff beyond the PIs, and so on. The NITRD Coordination Office is well suited to assist with coordination of this effort and to help avoid duplication of effort.
There are many good topics for research expenditures of this order of magnitude and beyond. As already mentioned, there are numerous problems with the existing infrastructure that we do not know how to solve, including attribution of attacks, fast forensics, stopping botnets, preventing spam, and providing supply chain assurance. More speculative tasks include protecting future architectures (including highly portable computing), developing security and privacy metrics, creating self-defending data, semi-autonomous system protection, building high-security embedded computing for real-time controls, and beyond. The PITAC report listed 10 priority areas, and the National Academies reports list more. The community has never had a shortage of good topics for research: it has always been a lack of resources and personnel that has kept us from pursuing them.
Above all, we must keep in mind two important facts: First, protection in any realm, including cyber, is a process and not a goal. It is an effort we must staff and support in a sustainable, ongoing manner. And second, as with infections or growth of criminal enterprises, a failure to appropriately capitalize the response now will simply mean, as it has meant for over two decades, that in the future the cost will be greater and the solutions will take longer to make a difference.
Acknowledgements. I wish to acknowledge comments and assistance provided to me in preparing this essay from Becky Bace, Steve Cooper, Dan Geer, Harry Hochheiser, Lance Hoffman, Carl Landwehr, Ed Lazowska, Victor Piotrowski, Bobby Schnable, Carlos Solari and Cameron Wilson.
Despite listing their names here, none of those individuals necessarily agrees with or endorses any of my comments or opinions. Portions of the effort leading to this paper were supported by NSF grant 0523243.
References

1. Cyber Security: A Crisis of Prioritization. Report from the President's Information Technology Advisory Committee. National Coordination Office, NITRD (2005)
2. Goodman, S.E., Lin, H.S. (eds.): Toward a Safer and More Secure Cyberspace. National Academy Press, Washington (2007)
3. Computers at Risk: Safe Computing in the Information Age. National Academy Press, Washington (1991)
4. Schneider, F.B.: Trust in Cyberspace. National Academy Press, Washington (1999)
5. Cyber Security Today and Tomorrow. National Academy Press, Washington (2002)
6. Unsecured Economies: Protecting Vital Information. McAfee Corporation (2008)
7. Spafford, E.H., Antón, A.I.: The Balance Between Security and Privacy. In: Kleinman, D.L., Cloud-Hansen, K.A., Matta, C., Handelsman, J. (eds.) Controversies in Science and Technology, vol. II, ch. 8, pp. 152–168. Mary Ann Liebert, Inc., New York (2008)
8. Karas, T.H., Moore, J.H., Parrott, L.K.: Metaphors for Cyber Security. Sandia Report SAND2008-5381. Sandia Labs, NM (2008)
9. Securing Cyberspace for the 44th Presidency. Center for Strategic & International Studies, Washington, DC (2008)
10. Spafford, E.H.: One View of a Critical National Need: Support for Information Security Education and Research. In: Briefing Before the Committee on Science Subcommittee on Technology, U.S. House of Representatives, February 11 (1997), http://spaf.cerias.purdue.edu/usgov/index.html
11. Spafford, E.H.: Cyber Security — How Can We Protect American Computer Networks From Attack? In: Briefing Before the Committee on Science, U.S. House of Representatives, October 10 (2001), http://spaf.cerias.purdue.edu/usgov/index.html
Intelligence, Dataveillance, and Information Privacy

Robyn R. Mace

Michigan State University, School of Criminal Justice
450 Baker Hall, East Lansing, MI 48824
[email protected]
Abstract. The extent and scope of intelligence activities are expanding in response to technological and economic transformations of the past decades. Intelligence efforts involving aggregated data from multiple public and private sources combined with past abuses of domestic intelligence functions have generated significant concerns among privacy advocates and citizens about the protection of individual civil liberties and information privacy from corporate and governmental misuse. In the information age, effective regulation and oversight are key components in the legitimacy and success of government domestic intelligence activities. Keywords: dataveillance, intelligence, privacy, terrorism, surveillance.
1 Intelligence in the Age of Global Commerce and Terror

As communication and travel, transnational trade and business, consumer economies, and digital technologies have proliferated, new types of transaction and exchange forums have created opportunities for, and threats to, social, economic, and government activities. Among the most serious threats are transnational terrorism, identity theft, and fraud. Technological and economic transformations have significantly altered the methods and tactics for intelligence and related activities designed to prevent and mitigate threats to national security and economic stability.

In the aftermath of the September 11th terrorist attacks, there have been concerted federal efforts to enhance the ability of law enforcement and national security agencies to collect, store, use, and exchange information, so as to improve the ability of the United States Intelligence Community (USIC)1 and associated federal and state agencies to identify and neutralize potential threats and the sources from which they emanate. Congress passed the USA PATRIOT Act2 forty-five days after the September 11th attacks.
1 The USIC as defined in 50 U.S.C. 401a(4) comprises the following 16 agencies: Central Intelligence Agency (CIA), Bureau of Intelligence and Research, Department of State (INR), Defense Intelligence Agency, National Security Agency (NSA), National Reconnaissance Office (NRO), National Geospatial-Intelligence Agency (NGA), Federal Bureau of Investigation (FBI), Army Intelligence, Navy Intelligence, Air Force Intelligence, Marine Corps Intelligence, Department of Homeland Security (DHS), Coast Guard, Treasury Department, Energy Department, and Drug Enforcement Agency (Best, 2007).
2 The formal title of the Act is the Uniting and Strengthening America by Providing Appropriate Tools Required to Intercept and Obstruct Terrorism Act of 2001.
Identifying substantial changes governing financial, educational, transportation, and library records, Regan elaborates on how the Act "amends virtually every information privacy statute to facilitate access, increase data collection, and reduce the due process and privacy protections for record subjects" [1]. Many efforts to collect personal information about individual citizens and foreign nationals (and thereby potential terrorists) use information technologies to aggregate massive quantities of data about individuals from publicly and privately held sources; these data are often managed and stored by privately contracted firms. The unprecedented extent, type, sources, and availability of previously isolated or unshared personal information have generated significant concern over the protection of individual civil liberties and information privacy from corporate and governmental misuse in the digital age. Government agencies, privacy advocates, and courts continue to struggle to find the appropriate balance between national security and citizen privacy as applications of technological advances challenge traditional laws and expectations of privacy and redefine appropriate activities of government agencies and private business.

Over the past thirty years, there has been a shift towards private rather than public ownership, operation, and oversight of various aspects of the infrastructure in the United States, including energy and utilities, administration of public services and functions, and the management and exchange of information. Technological capabilities that allow the aggregation of disparate and dislocated sources of information have also generated confusion about the ownership of and responsibility for a variety of types of public and private data. Combined with concerns over national security, prospective behavior and risk modeling, and distrust of government agencies, aggregated databases and surveillance technologies present conflicts between the free flow of information and appropriate protection for individual citizens.
2 The Evolution of Conceptualizations of Privacy and Regulatory Orientations

Rights to privacy are traditionally mediated and recognized through law or social convention, while notions of and values related to privacy in any society are individually and collectively determined based on the confluence of social, cultural, political, and economic factors. Legal protections for privacy in the United States are based in the common law and in court interpretations of the Bill of Rights3. While a constitutional right to privacy was not recognized until 1965 in Griswold v. Connecticut, "prior to the late nineteenth century, no jurist or legal scholar made the argument that personal privacy warranted constitutional protection" [2].

Dynamic and robust scholarship on privacy over the past half century has defined multiple aspects and components of privacy within the public domain, the home, and the workplace. Overlapping subsets of privacy include employee, consumer, and
3 "The various amendments deal with several aspects of privacy: First, freedom to teach and give information; Third, protection of one's home; Fourth, protection of the security of one's person, home, papers, and effects; Fifth, protection against self-incrimination; Ninth, rights shall not be construed to deny or disparage others retained by people; and Fourteenth, the due process clause and concept of liberty" DeCew (1997:23-24) as cited in Caudill and Murphy [3].
information privacy. A growing literature explores how individuals conceptualize privacy in various contexts and the conditions under which they will knowingly and willingly accept diminished privacy in return for an expected benefit. People may voluntarily (and happily) exchange their personal information in return for specific benefits, such as consideration for a loan or a valued good or service (e.g., shopper discount cards or notification of special sales). However, they appear to be less willing to give up information that may be used for prospective behavioral modeling or unknown purposes; this is particularly true of U.S. citizens with respect to governmental data collection and use.

Before the expansion of the industrial consumer economy after the First and Second World Wars, privacy concerns were generally localized, both personally and geographically, and with limited exceptions did not constitute a major social issue. As wartime and post-war technologies began to be applied to the collection and collation of consumer data, individual citizens started to become aware (and increasingly wary) of their privacy and of the potential value or harm of releasing private data in a manner over which they have no control. As information technologies and economic developments ushered in the information age, the federal government became correspondingly more active in limited regulation and oversight of both corporate and government activities involving the exchange of previously "private" data about individuals. During the 1970s much of this activity focused on regulation of actual and potential abuses by government agencies with respect to data collection and surveillance of the activities of individual citizens and legal organizations.

In a special issue of the Journal of Social Issues devoted to privacy, Westin identified and delineated the defining features of a privacy baseline period and three eras of contemporary privacy development: the rise of technological threats to information privacy; enhanced technology performance without a corresponding focus on privacy; and proliferating technology applications and privacy concerns [4]. During the baseline period (1945-1960), there was very little public interest in or awareness of potential threats to privacy, as a result of relatively unified social concerns and a high trust in government agencies. During the first era (1961-1979), social turbulence and technological advances generated distrust of the government and concerns over unnecessary state and business intrusion into private matters, as well as the beginning of a substantive body of privacy scholarship. It was during this era that "Fair Information Practices became the dominant approach to information privacy protection for the next three decades" [4]. The second era (1980-1989) featured continuing technological advances that enhanced efficiency and reduced the cost of collecting and aggregating personal information, as well as the further development of procedural approaches and expanded legislation to manage a fair information practices model to protect consumers and privacy. Due to globalization, communications technologies, the Internet, data mining, and technological challenges (including encryption) to national security and law enforcement agencies, privacy became a central social and regulatory concern in Westin's third era (1990-2002).
The realities and challenges of national and economic security since 9/11 and the passage of the USA PATRIOT Act have prompted a recent and significant expansion of previously limited government agency surveillance powers and activities. These activities, and the perceptions of citizens and privacy advocacy groups that personal data are being used inappropriately, have contributed to intense debate over information privacy and the government's role and responsibilities in the collection and protection of personal data in the current and emerging era of privacy development.
Recognizing the potentially negative implications and side effects of abuses and errors in a technologically mediated, information-driven society during Westin's first era, the U.S. Congress reactively passed numerous pieces of legislation that acknowledged both the necessity and the social value of information exchange and aggregation, and that focused on establishing the fair (business and professional) practices under which these exchanges can operate. The Fair Credit Reporting Act of 1970 was the first federal legislation to explicitly recognize that individuals had the right to correct errors in their aggregated personal financial information collected and "owned" by private corporations. At least thirteen additional federal regulations dealing explicitly with privacy issues were promulgated during the remainder of the twentieth century4. The Privacy Act of 1974 was the first comprehensive federal legislation designed to govern the collection, access, disclosure, and regulation of information records of individuals by the government; several of the abuses that encouraged these federal procedural controls are discussed below. The Computer Security Act of 1987 required that government agencies develop procedures and protocols to protect sensitive personal data stored on government computers5. The timing of these regulations reflected increasing social concern and anxiety about the civil liberties and personal privacy of citizens, particularly with respect to the potential for unwarranted governmental intrusion.

While the United States has taken a fair practices approach to regulating privacy, the European model has focused on "national data protection laws covering the entire governmental and private sectors, under independent national data protection agencies" [4]. Based on international research into varying approaches to regulating information privacy, Milberg, Smith, and Burke delineate five regulatory models that range from lowest to highest government involvement on a privacy regulation spectrum6: self-help, voluntary control, data commissioner, registration, and licensing [6]. These models indicate that cultural values and individual privacy concerns influence corporate management orientation to information privacy, and "seem to suggest that
4 In addition to the regulations discussed subsequently, other U.S. federal regulations on privacy include the Right to Financial Privacy Act (1978), the Electronic Funds Transfer Act (1980), the Privacy Protection Act of 1980, the Cable Communications Act (1984), the Family Educational Rights and Privacy Act (1984), the Electronic Communications Privacy Act (1988), the Video Privacy Protection Act (1988), the Computer Matching and Privacy Protection Act (1988), the Telephone Consumer Protection Act (1991), the Drivers' Privacy Protection Act (1993), and the Children's Online Privacy Protection Act (1998).
5 A 2008 PricewaterhouseCoopers survey conducted with CIO and CSO magazines reported, perhaps ironically, that forty-two percent of public sector employees surveyed indicated that employee data was at greater risk of compromise than constituent data [5].
6 The self-help and voluntary control models involve rules and requirements for record keeping, although the individuals from whom the data is collected and the corporations that manage the data bear respective responsibility for the accuracy of data and the resolution of disputes about compliance. The data commissioner model employs a knowledgeable, but nonregulatory, ombudsman, who proposes activities and policies and performs inspections. The registration model requires the registration of data banks that contain personal information and provides recourse for violations or failures of regulatory adherence through deregistration. The licensing model requires each data bank to be licensed under specific and prior stipulations regarding the conditions under which personal data can be collected, stored, and used [6].
privacy regulation, corporate management of personal data, and consumer reactions are tightly interwoven around the world" [6]. With its fair information practices orientation, the U.S. has traditionally favored business-friendly regulation and the lower-government-involvement models, self-help and voluntary (corporate) control [3,6]. This approach places the burden of awareness of the potential use (and misuse) of personal information, and the responsibility for redress of violations, upon the person whose data is collected or stored for later use.
3 Information Privacy and Intelligence

Intelligence is both a product and a process7; it is the result of analytic processes that collect and collate information from multiple sources and speculate about or predict particular activities and their potential impact on various issues of local or national security. Without a method to determine and assess the content, validity, and reliability of raw information and to apply that information to relevant situations and contexts, individual bits of information are meaningless as a basis for action. Intelligence has great potential for minimizing and preventing threats to both national security and economic viability8. Valid and actionable intelligence allows the anticipation or pre-identification of potential and credible threats, the assessment of the scope or scale of threats and consequences, and the deployment of appropriate personnel and technology for prevention and mitigation by neutralizing or minimizing the influence of threat agents.

Intelligence gathering does not necessarily involve a set of formal disciplinary or organizational principles and practices, nor a static set of activities. As a product, intelligence is the basis for organizational or unit action; as a process, it is the culmination of rigorous practices and procedures of collecting, collating, and analyzing raw information. Policies that guide the development and maintenance of law enforcement intelligence systems are detailed in 28 CFR Part 23 (Criminal Intelligence Systems Operating Policies); these regulations govern the operation and monitoring of intelligence collection systems for compliance with the privacy and constitutional rights of individuals, although the regulations are rather vague about oversight and the remediation of violations of the policies detailed.

The most significant concerns about protecting privacy against unwarranted government intrusion relate to data mining, or knowledge discovery in databases (KDD) [2]. Data surveillance, or dataveillance, "is the systematic use of personal data systems in the investigation or monitoring of the actions or communications of one or more persons" [7]. Dramatic changes in business practices and in the national security threatscape over the past twenty years have presented significant challenges for agencies in the Intelligence Community, which are constrained by limited capital investments in technology and skilled personnel. Since "the federal government has never openly developed a central data
7 Intelligence is an important component of risk management and threat assessment, security activities that involve data-driven assessments of threats and vulnerabilities, corresponding organizational actions, and the alignment of resources to eliminate, reduce, transfer, share, or accept risks.
8 While commonly associated with issues of national security, there is a long-established tradition of competitive intelligence in business that involves the legal collection of information to discover the clients, plans, and methods of business competitors for strategic purposes.
storage and processing facility” [2], the private sector filled the void and began to purchase and digitize public records which “infused consumer dossiers with data such as tax payments, auto registrations, arrest records, veteran status, political party affiliations, vital statistics (height, weight, and eye color), property ownership data, census data, and marital status” [2]. There are few limitations on the voluntary disclosure to government agencies of privately-held data, nor are there any formal mechanisms of oversight unless those agencies or the companies from whom they request data have procedures to govern such requests. These requests may be generated in the course of specific investigations, but they may also be used for prospective modeling of potentially dangerous behaviors; in the latter case, innocent people not suspected of any criminal behavior may find themselves with federal agency records that indicate a “profile” in need of further scrutiny, information, or surveillance. This type of situation raises the potential for unwarranted, politicized, or selective enforcement.
4 Abuses, Regulation, and Oversight of Information-Sharing and Intelligence

Most threats to privacy in the information age have been the result of technology and business practices related to constructing marketing and consumer information databases (and the attendant vulnerabilities to fraud and loss of personal information through databases and the Internet), and not the result of government intelligence activities or intrusions. However, American traditions and laws based in distrust of central government authority, combined with past incidents of illegal and improper domestic information collection and intelligence activities by government agencies, have created suspicion and debate about the propriety and scope of government domestic intelligence in the United States. The unprecedented expansion of government surveillance and data collection powers under the USA PATRIOT Act of 2001, combined with past failures and abuses of the rights and civil liberties of individual citizens, leads us to a brief review of formative federal activities and legislation that struggled to balance the needs of law enforcement and civil liberties.

Perhaps the most well-known and egregious example of the abuse of the power of surveillance and data collection in the history of the United States is the FBI's Counterintelligence Program (COINTELPRO), which formally operated from 1956 to 1971. Part of this program involved the surveillance, investigation, and infiltration of leftist organizations perceived to be critical of the U.S. government. Tactics included covert actions and campaigns directed at U.S. citizens in an attempt to discredit the leaders of "subversive" organizations and reduce their effectiveness in organizing for social change. The National Association for the Advancement of Colored People (NAACP), Students for a Democratic Society (SDS), women's and gay liberation organizations, and civil rights organizers were among the targets of these activities: the five-year disinformation campaign against Martin Luther King, Jr. is notorious among these abusive efforts [8]. The Special Service Staff (1969-1973) of the Internal Revenue Service (IRS) was also involved in COINTELPRO, using its investigative capabilities and improper electronic surveillance to compile financial dossiers on thousands of allegedly extremist organizations and individuals; this information was provided to the FBI without sufficient justification for the requests and was used for purposes of intimidation and harassment
of those deemed subversive [9,10]. The Long Subcommittee Hearings in 1965 investigated prior abuses of the IRS's intelligence functions during the early to mid-1960s and further documented "widespread abuse of electronic surveillance" [9] as well as targeted selective tax enforcement against political activists. During this same period, the CIA was also the focus of investigations into inappropriate domestic surveillance activities.

As a result of federal agency abuses and misuses of information and power in the 1960s and 1970s, several investigations into Intelligence Community activities resulted in reforms and the establishment of oversight mechanisms designed to prevent future abuses. In 1971, the Schlesinger Report identified intelligence failures which were to become emblematic over the next several decades of a concern "that intelligence functions were fragmented and disorganized; collection activities were unnecessarily competitive and redundant; intelligence suffered from unplanned and unguided growth" among other issues [11]. This marked the first steps toward formal, if not meaningful, Congressional oversight of intelligence functions under the executive branch. Throughout the course of the decade, reforms of intelligence were recommended and undertaken by both the Congressional and Executive branches. Judicial review of the legality of statutes governing intelligence activities expanded concurrently with intelligence (and intelligence regulation); the requisite secrecy of intelligence activities required special legislation, in the form of the Classified Information Procedures Act (CIPA) of 1980, to begin to standardize procedures for managing classified information in criminal trials.

Convened by President Ford's Executive Order 11828 in 1975, the Rockefeller Commission investigated potential illegal activities of the Central Intelligence Agency within the United States and determined that while most Agency activities were in compliance with statutory authorities, the Agency had opened domestic mail, amassed records on domestic dissidents, and provided information that former President Nixon tried to use for political purposes. The Commission's report indicated that stronger Executive and joint Congressional oversight of intelligence activities was necessary to preserve civil liberties and protect against abuses of government powers [11]. That same year, President Ford directed the implementation of many of the recommendations of the Rockefeller Commission, including "measures to provide improved internal supervision of CIA activities; additional restrictions on CIA's domestic activities; a ban on mail openings; and an end to wiretaps, abuse of tax information, and the testing of drugs on unsuspecting persons" [11]. In 1976, Ford issued Executive Order 11905, the first such order governing intelligence activities, which notably featured a ban on political assassination by U.S. employees and the creation of an Intelligence Oversight Board within the Executive Office of the President.

In 1975-76, the Senate's Church Committee9 examined alleged abuses of domestic surveillance by the CIA and the military as well as misuses of federal data. The Church Committee produced a multi-volume report entitled Intelligence Activities and the Rights of Americans; the report identified multiple areas for improvement in the operations and oversight of intelligence agencies.
Also in 1975, the Select Committee on Intelligence to Investigate Allegations of Illegal or Improper Activities of Federal
9 This Commission was formally known as the Select Committee to Study Government Operations with Respect to Intelligence Activities.
Intelligence Agencies (the Pike Commission) was convened under the chairmanship of Congressman Pike, although, due to lack of approval by the House of Representatives, the results of the Commission's investigation were never issued officially. The Senate and House established Select Committees on Intelligence in 1976 and 1977 respectively, as part of a permanent effort to regulate and oversee Intelligence Community activities.

Throughout subsequent Administrations, until the 9/11 attacks, intelligence activities continued to proliferate under multiple agencies and to be subjected to periodic attempts at reform. The collapse of the Soviet Union and the end of the Cold War in the 1990s ushered in a period of critical examination of the intelligence function and of reorganization and reorientation of intelligence activities to reflect post-Cold War global dynamics. The shocking discovery of internal espionage and the arrests of CIA employee Aldrich Ames in 1994 and FBI employee Robert Hanssen in early 2001 raised doubts about the ability of the Intelligence Community to oversee its own operations and personnel.
5 Public and Private Information-Sharing

An area of intense focus during the concerted Congressional and Executive scrutiny of the Intelligence Community during the 1970s was the potential for misuse of data collected from domestic surveillance activities. Westin's discussion of contemporary privacy development identified three types of relationships between authorities and individuals that are differentiated based upon power, expectations, and legal frameworks: citizen-government, consumer-business, and employee-employer [4]. Today's emerging era of privacy development is characterized by tensions generated by the exchange of information about citizens between government and businesses. This seems to represent a new set of relationships between the citizen/consumer, business, and government that reflects contemporary realities of privacy development. The increasing and overlapping information sharing by governments and businesses about formerly confidential or private activities generates concerns about potential violations of individuals' privacy rights. There is an established history of public-private sector information sharing, as well as information sharing between public entities, for competitive intelligence and espionage (e.g., using international business persons as collectors of or conduits for information).

The technological features that facilitate unprecedented global interaction pose new challenges to intelligence and law enforcement agencies, as well as to businesses. The geographic separation of a potential crime or terrorist activity from its planners and perpetrators changes the traditional dynamics of monitoring threats to national, economic, and infrastructure security. International markets that transfer technologies allow new competitors to enter markets and increase the potential or opportunity for theft of intellectual property, including data assets. Incentives for cooperation, or overlapping interests, of terrorist and criminal organizations intensify risks related to the conditions and mechanisms of exchange (trade, travel, information) and the management and purging of collected data, as well as threats to individual privacy from law enforcement and corporate activities. Well-publicized losses of aggregated consumer data by corporations and government agencies through systems infiltration and device loss (Veterans Administration, TJX, Citigroup) frequently reinforce
public and regulatory concerns about the potential harm from loss, the oversight of these assets, and the mechanisms for remediation of losses. Since the USA PATRIOT Act, warrantless sharing of customer activity data by numerous telephone companies, and (sometimes refused) demands for library lending records, have raised concerns about the scope and intent of federal intelligence and data mining efforts and generated numerous lawsuits and class actions. Perhaps most notably, JetBlue's provision (in violation of its own privacy policies) of five million customers' travel records to a private Defense Department contractor raises substantial concerns about the integrity of JetBlue's data management practices as well as the potential harm from unwarranted third-party release of private information [12].

The complexities of terrorism and transnational crime present special challenges to the investigative abilities and capacities of law enforcement and intelligence agencies and their corporate security counterparts. In many cases, government agencies are limited by capital spending constraints that prevent speedy adoption of the latest and best technologies. Other competing internal and external demands may exacerbate entrenched resistance to the development of coherent, directed actions. Private ownership of the information infrastructure and the concentration of information assets in private companies increase the incentives and the pressure for governmental agencies to use private resources (databases, contractors, skills, established systems) to meet the demands of information-intensive intelligence and investigation activities.

The value of information sharing to law enforcement has promoted the development of certain structural and operational arrangements within agencies, initiated either by legislation or by environmental demands. These arrangements may take a variety of forms, ranging from voluntary or involuntary requests (warrants and subpoenas) for information, individual employee training, task force participation, and joint operations or exercises, to the establishment of standing task forces, dedicated units (e.g., fusion centers), and forums (e.g., InfraGard) that facilitate regular and ongoing exchanges among agencies and, in some circumstances, private companies with direct investment, a compelling interest in infrastructure issues, or contractual arrangements with government agencies. The oversight and control mechanisms to ensure the appropriate collection, storage, and use of information exchanged and generated in these forums are not yet firmly established or tested, although progress has been made, largely as a result of the memory of past agency failures and the concerted efforts of civil liberties and privacy advocates.
6 Building Effective Intelligence Oversight

National and domestic security and individual privacy are not necessarily mutually exclusive; however, past practices by agencies within the Intelligence Community have demonstrated that monitoring and oversight provisions of intelligence functions have not been sufficiently developed to prevent the misuse of personal information in investigations, criminal proceedings, and political maneuvers. Building effective intelligence capabilities will rely "less on prohibiting the collection and dissemination of private information and more on effective oversight and control of government activity" [8]. Procedural clarification and control, regular auditing, and transparent reporting on agency activities will be essential to effective oversight and meaningful protection for citizens against potential abuses.
Since the establishment of the Department of Homeland Security in 2002, there has been a significant focus on the collection and dissemination of intelligence; only recently has attention been directed towards establishing accountability and oversight for the proliferation of executive-level intelligence functions. In 2004, Executive Order 13353 established the President's Board on Safeguarding Americans' Civil Liberties within the Department of Justice; this board was primarily advisory. A congressionally established Privacy and Civil Liberties Oversight Board (PCLOB) was also created that year. With limited independence and budget, and operating from the Executive Office of the President, the PCLOB as constituted had little influence beyond advisory and review capacities. In 2007, in response to congressional dissatisfaction with its operations, the Board was legislatively "reconstituted as an independent agency within the executive branch" [13] and charged with balancing civil liberties and privacy concerns with executive branch actions and with ensuring the consideration of these concerns in the development and implementation of laws and policies related to antiterrorism efforts. Funding to support the activities of the five-member Board was included in the legislation, and it became an independent entity in early 2008. The scope and impact of the PCLOB in terms of providing meaningful and effective oversight of the operations of the Intelligence Community will be an area of great interest to privacy advocates and legal scholars over the next few years.

On the corporate side, an increasing number of companies are developing and promulgating privacy policies; a smaller number are appointing chief privacy officers (CPOs) to address customer concerns and third-party data exchanges that may be facilitated by organizational information selling and sharing practices. As consumer privacy values continue to become more articulated, "institutions and government agencies that collect and handle sensitive information should consider how their contractual relationships and information-sharing practices affect their ability to remain compliant with their own policies and relevant privacy laws" [12].
7 Conclusion

Effective regulation and meaningful oversight are key components in the legitimacy and success of government domestic intelligence activities in the information age. The expansion of intelligence activities to address new and emerging threats is an important and necessary government function; however, past abuses and failures suggest that, left unchecked, the potential for misuse of personal information can have serious consequences for individual citizens, civil liberties, and trust in government. Procedural clarity and substantive monitoring, combined with awareness and recognition of potential abuses, serve to improve the operational effectiveness of the Intelligence Community in identifying and mitigating threats while enhancing public confidence in institutional protection of privacy and civil liberties.
References

1. Regan, P.M.: Old issues, new context: Privacy, information collection, and homeland security. Gov. Info. Qtly. 21, 481–497 (2004)
2. Kuhn, M.: Federal Dataveillance: Implications for Constitutional Privacy Protections. LFB Scholarly Publishing, LLC, New York (2007)
3. Caudill, E.M., Murphy, P.: Consumer Online Privacy: Legal and Ethical Issues. Jl. Pub. Ply. & Mrtg. 19(1), 7–19 (Spring 2002)
4. Westin, A.F.: Social and Political Dimensions of Privacy. J. Soc. Iss. 59(2), 431–453 (2003)
5. Collins, H.: Survey: Employee Data More Vulnerable Than Constituent Data. Gov. Tech., November 14 (2008), http://www.govtech.com/gt/560506?topic=117671
6. Milberg, S.J., Smith, H.J., Burke, S.: Information Privacy: Corporate Management and National Regulation. Org. Sci. 11(1), 35–57 (2002)
7. Clarke, R.: Introduction to Dataveillance and Information Privacy, and Definitions of Terms (2006), http://www.anu.edu.au/people/Roger.Clarke/DV/Intro.html
8. De Rosa, M.: Privacy in the Age of Terror. T. Wash. Qtly. 26(3), 27–41 (2003)
9. Select Committee to Study Government Operations with Respect to Intelligence Activities: Intelligence Activities and the Rights of Americans (1976)
10. The IRS's 287 Billion Dollar Man. Time (April 7, 1975), http://www.time.com/time/magazine/article/0,9171,917244,00.html?promoid=googlep
11. McNeil, P.: The Evolution of the U.S. Intelligence Community – An Historical Overview. In: Johnson, L., Wirtz, J. (eds.) Intelligence and National Security, pp. 1–20. Oxford University Press, New York (2008)
12. Antón, A.I., He, Q., Baumer, D.: Inside JetBlue's Privacy Policy Violations. IEEE Sec. Priv. (November/December 2004)
13. Relyea, H.C.: Privacy and Civil Liberties Oversight Board: New Independent Agency Status. CRS Report for Congress (No. RL 34385), July 21. Congressional Research Service, Washington (2008)
Results of Workshops on Privacy Protection Technologies

Carl Landwehr

IARPA, Office of the Director of National Intelligence
Washington, DC
[email protected]
Abstract. This talk summarizes the results of a series of workshops on privacy protecting technologies convened in the fall of 2006 by the Office of the Director of National Intelligence through its Civil Liberties Protection Office and the (then) Disruptive Technology Office (now part of the Intelligence Advanced Research Projects Activity, IARPA). Keywords: privacy protection technologies, strategies, workshop, intelligence community.
1 Introduction

Since I have the after-lunch speaking spot, and the organizers apparently think I can keep you awake after lunch, I thought I would start with a joke: a cartoon indicating where data sharing may lead us, in which a couple watching television says, "Oh, good. My complete sexual history is on tonight."1

I will review a series of workshops on privacy protecting technologies that were held in the fall of 2006, under the auspices of the Office of the Director of National Intelligence (ODNI) Civil Liberties Protection Office and the Disruptive Technology Office (DTO). In October 2007, DTO became part of the Intelligence Advanced Research Projects Activity (IARPA), which is continuing to pursue research opportunities in this area2. I believe, and I think a large part of the intelligence community agrees, that we are not trying to advance security at the cost of privacy; rather, we strive to advance both security and privacy together.
2 Privacy and Security

It's difficult sometimes to think about what privacy means for the intelligence community (IC) because, in some sense, the IC is interested in compromising the privacy of the "bad guys", the targets, in a big way. But the problem is to do that without compromising the privacy of everybody else.
1 http://www.cartoonbank.com/product_details.asp?sitetype=2&sid=40949&pid=1003&did=4
2 Toffler Report; this report has not yet been published by the U.S. Government.
Figure 1 lists three primary areas of concern that relate to privacy and the intelligence community. One is accuracy: if you look at fair information practice principles, they typically call for review of records and mechanisms for correcting erroneous information. How can you apply that policy in the context of intelligence data? Certainly there are cases where you'd like to correct erroneous or ambiguous watch list entries, incorrect inferences that have been made, or sources that have been repudiated. Those problems do come up, and you'd like to have solutions for them.
Fig. 1. The three A’s
The second area is access. Access to private information can be difficult to arrange. Thinking about what's in an intelligence community database, are you likely to let the public review their records and correct them? That's going to be difficult. On the other hand, you don't want incorrect information in the records. One analog that's been suggested for this is the kind of legal discovery process that already exists, which has some steps that could enable individual access to private data under controlled conditions. A second point under access, which I think Prof. Ostrovsky [1] has addressed in his research, is that permitting access to one person's information in a database doesn't have to imply granting access to everyone else's information, or to the criteria used in a search.

Accountability is the third general area of concern. We want accountability in these systems as a deterrent for user misbehavior and also as a means to identify where the system is being abused.
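To make the accountability idea concrete, here is a minimal sketch (in Python) of a tamper-evident, append-only audit log of the general kind often proposed for this purpose: each entry commits to the hash of the previous entry, so any later alteration or deletion breaks the chain when the log is verified. This is purely an illustration under my own assumptions, not a description of any system discussed at the workshops, and the field names are invented.

```python
# Minimal sketch of a hash-chained (tamper-evident) audit log.
# Each entry commits to the previous entry's hash, so modifying or
# deleting an earlier entry invalidates every hash that follows it.

import hashlib
import json

class AuditLog:
    def __init__(self):
        self.entries = []          # list of (record_dict, hex_digest)

    def append(self, who: str, action: str, target: str) -> str:
        prev_hash = self.entries[-1][1] if self.entries else "0" * 64
        record = {"who": who, "action": action, "target": target, "prev": prev_hash}
        digest = hashlib.sha256(json.dumps(record, sort_keys=True).encode()).hexdigest()
        self.entries.append((record, digest))
        return digest

    def verify(self) -> bool:
        prev_hash = "0" * 64
        for record, digest in self.entries:
            if record["prev"] != prev_hash:
                return False
            expected = hashlib.sha256(json.dumps(record, sort_keys=True).encode()).hexdigest()
            if digest != expected:
                return False
            prev_hash = digest
        return True

log = AuditLog()
log.append("analyst-a", "query", "record-123")
log.append("analyst-b", "export", "record-123")
assert log.verify()
log.entries[0][0]["who"] = "someone-else"   # tampering with history...
assert not log.verify()                     # ...is detected
```

A real deployment would also need to protect the log's tail (for example, by periodically publishing or countersigning the latest digest), since the chain only detects changes relative to a trusted head.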
Fig. 2. CLPO-DTO workshops on privacy protection technologies
These three A's capture overarching concerns that came out of the workshops. Let me go into a little more detail about who was involved and what happened at the workshops. These workshops were a collaboration of the Office of the Director of National Intelligence Civil Liberties Protection Office, including the chief, Alex Joel, and also Tim Edgar, who is here, and some others who are not here. From the Disruptive Technology Office (DTO), Rita Bush, who was in charge of the information exploitation programs, and myself, representing information assurance programs, were the primary organizers, but other members of DTO participated as well. The workshops were facilitated by a conference organizer, National Conference Services, Inc. (NCSI), and also by Toffler Associates, the people listed there.

We had participants from a wide range of organizations as key players in the workshops. Figure 2 provides names of some of these people. They ranged from government people, including "technical" people but also people from the legal side of things and from privacy offices in different agencies. We had industry participation from people who work for companies with large commercial data mining activities, and we had a fair number of academic researchers as well. We held three separate one-day sessions from late September through early December.

The primary questions we were trying to address in the workshops were (a) what is the state of the art of current technologies that might be used to protect US persons' privacy and US persons' data in the course of intelligence community activities, and (b) what specific emerging technologies might be brought forward in the next few years, potentially through some research funding, that could
protect individual privacy while still advancing national security. So that’s what we were trying to do. Figure 3 lists some of the people who were involved. Toffler Associates developed a report from this workshop which has had some limited distribution. Perhaps the workshop’s most important role to date has been in helping to develop language used to solicit new research proposals for privacy protecting technologies in the fall of 2007 under the NICECAP BAA.
Fig. 3. Workshop participants (not a complete list)
Our original intention had been to ask the participants in the workshop to provide comments on the draft report, and then to revise the report based on their comments. In the end, we were unable to do that and so we cannot represent the report as having been endorsed by the participants; it represents Toffler’s best effort to capture the sense of the discussions.
3 Privacy Scenarios

Now I will go through the scenarios that are in the report to give you an idea of how the workshop considered various situations that can arise in the intelligence community and that may involve private information, potentially of United States persons.

3.1 Sensitivity and Authority

Sensitive Records. One concern is that we, that is, the government and businesses, have records that are sensitive, and we can expect that at some point in the future
there’s going to be another crisis of some sort. At that point there may be pressure to provide augmented access to information in certain situations. So it would be useful to have privacy tools that can function even in that sort of crisis mode and still provide some privacy guarantees. We don’t want to have to say, “Well, it’s a crisis, so all the privacy controls are going away”. We would like to have controls that will work in that context.
Fig. 4. Four privacy scenarios
Progressive Authority. A second scenario concerns operating not in a crisis but in a normal situation in which you get some information about an activity or individual. As you get more information, you may be able to justify more access in relation to information about that particular topic or individual. And so you would like to have correspondingly higher degrees of authorization required as you go up that scale. I should say, as a general comment, that these scenarios were created to bring up representative problems, and so in many cases the problems and concerns in the different scenarios overlap. They are examples that we used in the context of the workshop to try to help people understand how the intelligence community processes information and what sorts of situations can arise.

Connecting raw dots. As Maureen Baginski said this morning, a lot of people in the intelligence community believe that "connecting dots" is a terrible analogy for what they do, because dot-to-dot figures are really well structured a priori. Perhaps a better analogy for the intelligence analysts' job would be that they have a lot of oddly shaped puzzle pieces, perhaps from several different puzzles, and they have to try to figure out which ones fit into which puzzle, and where.
Nevertheless, the idea here is that we are searching for anomalous patterns of activity that may reflect hidden threats of some sort, and the databases holding this information may include some US persons' information, but we want to protect the privacy of those US persons at the same time. So we would like to assure that the policy and restrictions are in place. And as Maureen Baginski said, this is done now in a very responsible way. But it is done in a way that demands a great deal of manual intervention, and consequently it can't scale to problems of significant size.

The "MI-6" scenario refers to the UK intelligence organization, and captures the idea that we occasionally cooperate with foreign partners. If we want to get information from them, and they want to get information from us, we want to be sure that any information we provide in that context remains protected in their environment in the way that we ourselves are mandated to protect it. Tools to help with that kind of protection would be helpful.

3.2 Second Set

Long-term storage. This scenario has already come up in a discussion earlier at this conference. It's hard to say when information might no longer be useful for intelligence purposes. On the other hand, it's clear that the longer we keep personal information about individuals, in the present scheme of things, the more likely it is to get lost or exposed in some way. So there is a risk, in fact, in keeping information around indefinitely. We seek some way of protecting it against exposure, while at the same time keeping it in a way that we can, if necessary, somehow retrieve it.
Fig. 5. Four more privacy scenarios
CSI Fort Meade. I am not enough of a fan of the TV show CSI to know exactly how this scenario was introduced. But the idea here is that we want to be able to detect whether people are actually exceeding their authority. We want to be able to provide assurances to the public that, if there are abuses going on inside government agencies, we have a way of detecting them and enforcing action against those who do abuse their authority. We've actually seen examples of this outside the intelligence community in recent months, for example, the abuse of the passport database that came up in the context of the presidential primary campaigns. We didn't prevent those instances of abuse, and we may never be able to prevent all such abuses, but if we can detect them and somebody gets fired or disciplined as a result, that has, I think, a real effect on the workforce.

Privacy toolbar. The concept here is that at present we have what is sometimes called, instead of stovepipes, "cylinders of excellence". But as we try to bring these cylinders together into a funnel of excellence of some sort, new issues arise. Each of those cylinders has authorities and privacy rules associated with it, and the rules are not necessarily the same. For an analyst working in some unified context, the issue becomes: which rules apply? It would be useful to provide clear guidance on this topic. It would also be useful to have better policy in place. But we have to deal with the situation as it is now.

US-address.com. This scenario refers to the issue of identifying the location and "US person" status of communicants, so that we can actually enforce the US person regulations appropriately. This is clearly a challenge that has been aggravated by technology. Technology has been very helpful in many respects, but not for this purpose.
Fig. 6. Final four privacy scenarios
3.3 Third Set

Mr. Smith. We've already discussed this scenario in some respects. The issue involves trying to disambiguate identities, so that, in fact, we don't have intrusions on every person named "John Smith" if there's just one John Smith causing problems. If they all share a common name, we would like to avoid intruding on the privacy of all the innocent parties.

Watch list. We've mentioned aspects of this scenario as well. In this context you don't want to reveal who is on the watch list to the people whose database you might want to search. This is an interesting challenge, and I think Rafi Ostrovsky may discuss this as well.

No-fly redress. This one is commonly discussed. If you are erroneously placed on the no-fly list, or, even worse, if the algorithm keeps putting you back on it, how can you keep yourself off of it? We don't want to expose that list to the public, but we would like somehow to allow people to help us correct it. How can we deal with this kind of problem?

False alarms. Sometimes we get information that turns out to be wrong. How do we retract that information, and not only the information itself, but all of the inferences that have been based on it? What if the correction is itself wrong? How do we correct the (now) invalid inferences?
4 Privacy Technology Areas

These were the scenarios put in front of the group to stimulate discussion. Then we considered technology areas in relation to these. So what you are going to see next is a matrix matching the technology areas with those scenarios. I won't say too much about each of these individually.

Private information retrieval. This is a technique that's been around in the world of cryptography and computer science for a while, but has not seen very much use in practice. We're hoping to see some use of it in the future.

Nonmonotonic logic. For those of you from outside this field, the idea is that in a typical logic exercise the more hypotheses we have, the more inferences we can make. In nonmonotonic logic, however, we may learn something new, which means we get a new hypothesis from an observation, but it negates some hypotheses we had or inferences we made earlier. So we can actually infer fewer things after learning this new fact. This corresponds, in some ways, to the notion of a repudiated source: when we find out that some information is actually unreliable, we might need to retract some inferences made based upon it.

Rules-based access and usage control. The idea here is access control that is based on rules, in particular the kinds of rules we have in privacy protection, and also on usage control. In many cases a privacy policy might authorize information to be used for one purpose but not another. Usage-based controls could be helpful in enforcing such policies.
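As a toy illustration of the usage-control idea just described (not any system considered at the workshops), the following Python sketch tags each record with the purposes a privacy policy permits and releases it only to queries that declare a matching purpose; the record fields and purpose labels are invented for the example, and a real system would enforce this far more deeply than an application-level check.

```python
# Minimal sketch of purpose-based usage control (illustrative only).
# Each record carries the purposes a privacy policy allows; a query must
# declare its purpose, and records are released only when the purpose matches.

from dataclasses import dataclass, field

@dataclass
class Record:
    subject: str                      # e.g., a person identifier
    data: dict                        # the protected attributes
    allowed_purposes: set = field(default_factory=set)

class UsageControlledStore:
    def __init__(self, records):
        self._records = list(records)
        self.audit_log = []           # who asked, for what purpose, and the outcome

    def query(self, analyst: str, subject: str, purpose: str):
        released = []
        for rec in self._records:
            if rec.subject != subject:
                continue
            if purpose in rec.allowed_purposes:
                released.append(rec.data)
                outcome = "released"
            else:
                outcome = "denied"
            self.audit_log.append((analyst, subject, purpose, outcome))
        return released

# Example use with invented data and purpose labels.
store = UsageControlledStore([
    Record("person-17", {"travel": "ORD->LHR"}, {"counterterrorism"}),
    Record("person-17", {"tax_status": "filed"}, {"tax-administration"}),
])
print(store.query("analyst-a", "person-17", "counterterrorism"))  # travel record only
print(store.query("analyst-a", "person-17", "marketing"))         # nothing released
```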
Fig. 7. Privacy technology areas considered by the workshop
Secret sharing. This is really a cryptographic technology. It provides a mathematical analog to the old tear-the-treasure-map-up-into-several-pieces-and-distribute-it scheme, preventing any insufficiently large subset of the individuals from learning the full secret. The cryptographic form of this is more flexible than the paper-based scheme, but the idea is similar.

Entity disambiguation. As we have discussed, this refers to techniques for trying to determine whether two or more records refer to the same physical individual or not.

Secure multi-party function evaluation. Again, as I said, these are not all orthogonal categories of technologies, since this technology underlies some kinds of private information retrieval. You may have already heard talks on this technology earlier today.

Anonymous matching. The idea here is to allow the owners of two or more databases to determine whether they have records on the same individual or set of individuals without revealing data on other individuals. Another way to look at this technology is as anonymous set intersection. Again cryptographically based, this technology can be used to match sets without revealing any information outside of the intersection.

Digital rights management technology. I think we all have some familiarity with this technology, just from reading the daily press. But this is technology that might be brought to bear on some of these problems as well. We wanted the workshop to consider it.
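To make the splitting idea concrete, the sketch below implements the simplest possible form, XOR-based n-of-n secret sharing, in Python: all n shares are needed to reconstruct the secret, and any smaller collection of shares is statistically independent of it. Threshold schemes such as Shamir's are more flexible but follow the same spirit; this toy is offered only as an aid to intuition, not as any particular scheme the workshop had in mind.

```python
# Toy XOR-based n-of-n secret sharing (illustrative, not production crypto).
# All n shares are needed to reconstruct; any n-1 of them reveal nothing,
# because n-1 of the shares are uniformly random and independent of the secret.

import secrets

def xor_all(chunks, length):
    acc = bytes(length)
    for c in chunks:
        acc = bytes(a ^ b for a, b in zip(acc, c))
    return acc

def split(secret: bytes, n: int):
    shares = [secrets.token_bytes(len(secret)) for _ in range(n - 1)]
    last = bytes(s ^ r for s, r in zip(secret, xor_all(shares, len(secret))))
    return shares + [last]

def reconstruct(shares):
    return xor_all(shares, len(shares[0]))

shares = split(b"meet at dawn", 3)
assert reconstruct(shares) == b"meet at dawn"
# Any two of the three shares alone are just uniformly random bytes.
```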
Automated access expirations. This technology could presumably deal with the retention of records. If you have a defined time horizon for record storage, you might be able to automatically "expire" records as they reach their horizon.

Digital signatures and hash algorithms are here as well. These are technologies already in use in many venues, but we wanted to keep them under consideration.

Figure 8 shows the result of a group activity to identify which technologies might help most with which scenarios. A check mark indicates that the corresponding technology could contribute to the indicated scenario; a check mark with a box around it indicates strongly supportive technologies.

Figure 9 is a summary figure. The goal is to represent which technologies could advance both privacy and security and to provide a rough ordering of where the biggest potential gains might be made. I do not propose a logical argument for the contents of the figure; rather, it represents the opinions of the workshop participants as reported by Toffler Associates. Private information retrieval emerged as the technology that supported both security and privacy most strongly. The fact that some of these technologies are already in use also influenced the results: the idea here was to identify technology areas where research investment might actually tip the scale sufficiently in a few years to bring some new technologies into use.

The final figure, Figure 10, shows how privacy technologies might fit into the intelligence information life cycle.
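As a simple illustration of the automated expiration idea mentioned above (a sketch under my own assumptions; the retention period and record fields are invented), a data store can stamp each record with an expiry time, hide anything past its horizon from queries, and purge it outright on a schedule:

```python
# Minimal sketch of automated record expiration (illustrative only).
# Each record is stamped with an expiry time; expired records are never
# returned by queries and can be purged outright on a schedule.

from datetime import datetime, timedelta, timezone

def _now():
    return datetime.now(timezone.utc)

class ExpiringStore:
    def __init__(self, retention: timedelta):
        self.retention = retention
        self._records = []            # list of (expires_at, record_dict) pairs

    def add(self, record: dict, now=None):
        now = now or _now()
        self._records.append((now + self.retention, record))

    def query(self, now=None):
        now = now or _now()
        return [rec for expires_at, rec in self._records if expires_at > now]

    def purge(self, now=None):
        now = now or _now()
        before = len(self._records)
        self._records = [(exp, rec) for exp, rec in self._records if exp > now]
        return before - len(self._records)   # number of records destroyed

store = ExpiringStore(retention=timedelta(days=180))
store.add({"subject": "person-17", "note": "travel record"})
print(len(store.query()))                                    # 1: still within its horizon
later = _now() + timedelta(days=200)
print(len(store.query(now=later)), store.purge(now=later))   # 0 visible, 1 purged
```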
Fig. 8. Relating privacy protecting technologies and scenarios
Fig. 9. Identifying technologies with the greatest security and privacy payoff
Fig. 10. Privacy Protection Technology
Anonymous matching might be of use at the collection/creation phase, depending on the notion of collection. Private information retrieval definitely seems like something that could be used in the processing phase. Secure multi-party computation might be used in the dissemination phase. Immutable audit logs can certainly help assure accountability of access, though I am not sure research is required there. Finally, automated expiration techniques can be used at the end of the lifecycle of the information.

Acknowledgments. Thanks to all the PPT Workshop organizers and participants.
References
[1] Boneh, D., Kushilevitz, E., Ostrovsky, R., Skeith III, W.E.: Public Key Encryption That Allows PIR Queries. In: Menezes, A. (ed.) CRYPTO 2007. LNCS, vol. 4622, pp. 50–67. Springer, Heidelberg (2007)
[2] Toffler Report. Office of Disruptive Technologies, United States Government (unpublished report)
Words Matter: Privacy, Security, and Related Terms
James J. Horning
SPARTA, Inc., Information Systems Security Organization, 710 Lakeway Dr., Ste. 195, Sunnyvale, CA 94085, USA
[email protected]
www.sparta.com
Abstract. Both “privacy” and “security” have a variety of meanings that are relevant to ISIPS. Inadvertently mixing these meanings can lead to confusion. We will communicate more effectively, and avoid needless misunderstandings, if we make it clear which meanings we intend in each context. This paper discusses several constellations of meanings and suggests key questions whose answers will more precisely define the kind of “privacy” and the kind of “security” that we are referring to. Keywords: Privacy, security, national security, homeland security, computer security, network security, freedom of speech, definitions.
1 Introduction

We must consider now whether the need to capture dangerous conversations occurring in cyberspace will collide dramatically with the desire to protect the right of free assembly and free speech in cyberspace; or whether the need to gather information about certain individuals to prevent terrorist activities will enter in conflict with the citizen's right to privacy… The relation between privacy and security is the central theme of this workshop.
—ISIPS 2007 website

The impetus for this paper came from an email exchange shortly after I joined the ISIPS 2008 Program Committee. I had just read the website for Interdisciplinary Studies in Information Privacy and Security (ISIPS) 2007, and commented to Paul Kantor: "One thing bothers me: it seems to me that you have conflated some concepts that it is helpful to keep distinct… In particular, 'free speech' and 'privacy' are not equivalent… 'Security' is a much broader concept than 'intelligence collection and processing.' …" His response was, "Would you be willing to give a talk on the importance and difficulty of making these distinctions, at the Conference?" And I agreed.

This paper discusses a collection of words that are sometimes used more or less interchangeably, and sometimes with quite distinct meanings, in various disciplines concerned with privacy and/or security. The goals of this paper are to point out possible sources of confusion, and to encourage those who work with these words to make their precise meanings clear.
1.1 "Freedom of Speech" and "Privacy"

"Freedom of speech" is primarily concerned with public behavior; "privacy," with information about non-public behavior. The First Amendment of the United States Constitution explicitly protects the right to freedom of expression without government interference. However, a right to privacy is only implicitly recognized in the Constitution, and was not made explicit by the courts until the 20th Century. One could imagine the provision of complete freedom of speech even in the privacy-free environment that David Brin has called the "transparent society." [1] Conversely, one could imagine a society with very strong privacy protection, and no freedom of speech whatsoever. There are even areas, such as libel, where these two goals come into direct conflict. So these are actually two very different ideals, which should never be confused. A related topic is the right to anonymous communication, which some feel is inherent in the right of free speech, but many others do not. Those who implicitly make either assumption may well confuse those who do not.

1.2 "National Security" Is Broader Than "Intelligence Collection and Analysis"

Looking at the history of the Interdisciplinary Studies in Information Privacy and Security program and workshops, it seems that "security" in the title was implicitly understood to mean "intelligence collection and analysis for national security and homeland security." However, most IT researchers think of "security" in terms of "computer security" and "network security." National and homeland security involve a great deal more than just intelligence. They include a variety of defensive and preventive measures that are typically not discussed in the ISIPS context. Intelligence itself involves a larger cycle that includes direction, collection, interpretation, analysis, dissemination, and—ultimately—reaction.
2 Meanings of "Privacy" and "Security"

Both of these words have a variety of meanings [2, 3, 4, 5, 6] that this section will briefly examine.

2.1 A Verbal Paradox

Recently, at another workshop, I participated in a discussion session entitled "Security with Privacy." Within a few minutes, two strongly held views emerged:
• Difficult trade-offs must be made between privacy and security. "Historically, governments have cited compelling national security needs for seeking to violate privacy." [7]
• Security is necessary for privacy, and vice versa. Private credentials are essential for authentication at a distance, and information stored in insecure systems is inherently non-private.
As the discussion proceeded, it became clear that both of these apparently contradictory statements were true, but in different contexts and with somewhat different meanings of "privacy" and "security."

2.2 Assumed Meanings

Until recently, if asked to define "privacy" I would probably have said something like: Privacy in a computer system is reasonable assurance that the complete system does not release personal information to unauthorized entities. And if asked to define "security" I would have said: Computer system security is reasonable assurance that the complete system will function (only) as required and intended, despite hostile activity.

But the more contact I have had with people concerned with privacy and security, the more I have come to realize that many of them typically mean different things when using these words, and that hidden differences in assumed meanings can interfere with communication and lead to unfruitful arguments and invalid conclusions. Unless you know that your entire audience will make the same assumptions that you do, it is best to be explicit about what you mean. What kind of privacy are you relating, to what kind of security, using the perspectives of what disciplines?

2.3 Can't We Just Use the Dictionary Definitions?

This is just a rhetorical question. Pick your dictionary. Look these words up. Each will have multiple definitions, which may or may not agree with your own assumed meaning. Interestingly, "freedom" typically appears in the definitions of "privacy" and "security," but privacy and security typically do not appear in the definition of freedom, suggesting that freedom is a more fundamental concept.

2.4 Constellations of Concepts

ISIPS, by definition, involves multiple disciplines, each of which has its own specialized concepts and vocabularies. It is inevitable that there will be multiple meanings of "privacy" and "security" and that the defaults will change depending on the discipline and the context. It is helpful to group collections of concepts that tend to be used together in particular contexts. I have identified five such clusters that either seem relevant to ISIPS, or that use "privacy" or "security" in ways that are not very relevant to ISIPS and need not be further considered in this discussion.

Civil liberties: freedom of speech, freedom of association, freedom from arbitrary search, protection from self-incrimination, freedom of the press, protection of sources.

Privacy: personal information, secret communication, anonymity, trade secrets, witness protection.
Security1:* computer security, network security, information security, security classifications, communication security, operational security, physical security.

Security2: national security, homeland security, intelligence collection, surveillance, communication interception, data mining, intelligence analysis, intelligence dissemination, publication.

Security3: job security, Social Security, financial security (retirement), financial securities (Wall Street).

Henceforth, I will primarily be concerned with Privacy, Security1, and Security2.

2.5 Questions Characterizing Security1

To fully understand the meaning of an element of the Security1 constellation, we need to know the answers to the following questions:
• What accesses or actions are being restricted?
• To what resources or information?
• Who1 is being restricted?
• For what reason?
• Who is enforcing the restriction?
• By what means?
• For whose benefit?
• On what authority?

2.6 Questions Characterizing Security2

To fully understand the meaning of an element of the Security2 constellation, we need to know the answers to the following questions:
• What accesses or actions are being sought?
• To what resources or information?
• By whom?
• By what means?
• For what reason?
• At whose direction?
• On what authority?

2.7 Questions Characterizing Privacy Protection2

To fully understand the meaning of an element of the Privacy constellation, we need to know the answers to the following questions:
• What actions or information flows are being blocked?
• About whom?
• From whom?
• By whom?
• For what reason?
• By what means?
• On what authority?

* We use the general semantic notation of a subscript to distinguish meanings.
1 I use the generic "who" to include organizations as well as people.
2 I use the generic term "privacy" to include organizational secrets.
2.8 Inherent Interactions

It should be clear that some form of Security1 is fundamental to enforcing virtually any form of Privacy, and that many forms of Security2 are potentially invasive of some forms of Privacy. Conversely, Privacy of certain information is necessary for any form of Security1—for example, authenticating information must be kept secret. Privacy of certain information is also necessary for many forms of Security2—for example, it may be necessary to protect the identities of informants, detailed intelligence results, and the existence of certain intelligence capabilities.
3 Summary

"There's glory for you!"
"I don't know what you mean by 'glory,'" Alice said.
Humpty Dumpty smiled contemptuously. "Of course you don't—till I tell you. I meant 'there's a nice knock-down argument for you!'"
"But 'glory' doesn't mean 'a nice knock-down argument,'" Alice objected.
"When I use a word," Humpty Dumpty said in a rather scornful tone, "it means just what I choose it to mean—neither more nor less."
"The question is," said Alice, "whether you can make words mean so many different things."
"The question is," said Humpty Dumpty, "which is to be master—that's all."
—Lewis Carroll, Through the Looking Glass [8]

Humpty Dumpty was right. There is nothing wrong with giving words special meanings for particular types of discourse. It is fine for a mathematician to use "set" in a very different sense than a teacup collector would. The problem comes when the person using the word fails to tell the audience what it means in that context. Interdisciplinary Studies in Information Privacy and Security will progress more rapidly if we are careful to explain for each context what we mean by "privacy" and by "security."
References 1. Brin, D.: The Transparent Society: Will Technology Force Us to Choose Between Privacy and Freedom? Basic Books, New York (1999), http://www.davidbrin.com/transparent.html 2. Waldo, J., Lin, H.S., Millett, L.I. (eds.): Engaging Privacy and Information Technology in a Digital Age. National Academies Press, Washington (2007)
3. Goodman, S.E., Lin, H.S. (eds.): Toward a Safer and More Secure Cyberspace. National Academies Press, Washington (2007)
4. Kent, S.T., Millett, L.I. (eds.): IDs—Not That Easy: Questions About Nationwide Identity Systems. National Academies Press, Washington (2002)
5. Schneider, F.B. (ed.): Trust in Cyberspace. National Academies Press, Washington (1999)
6. Horning, J.J.: Nothing is as simple as we hope it will be. Security, http://horning.blogspot.com/search/label/SecurityPrivacy, http://horning.blogspot.com/search/label/Privacy
7. Infosec Research Council: Hard Problem List (2005), http://www.cyber.st.dhs.gov/docs/IRC_Hard_Problem_List.pdf
8. Carroll, L.: Through the Looking Glass. Gutenberg.org (2008), http://www.gutenberg.org/wiki/Main_Page
9. ISIPS 2007 website (2008), http://www.scils.rutgers.edu/ci/isips/v1_ISIPS_Flyer_02.21.07.htm
kACTUS 2: Privacy Preserving in Classification Tasks Using k-Anonymity
Slava Kisilevich, Yuval Elovici, Bracha Shapira, and Lior Rokach
Department of Computer and Information Science, Konstanz University, Universitaets Strasse 10, Box 78, 78457 Konstanz, Germany
[email protected]‡
Department of Information System Engineering and Deutsche Telekom Laboratories at Ben-Gurion University, Ben Gurion University, Be'er Sheva 84105, Israel
{elovici,bshapira,liorrk}@bgu.ac.il§ ¶
Abstract. k-anonymity is a method for masking sensitive data which addresses the problem of re-linking the data with an external source and makes it difficult to re-identify individuals. k-anonymity works on a set of quasi-identifiers (public sensitive attributes), whose possible availability and linking from an external dataset is anticipated, and demands that the released dataset contain at least k records for every possible quasi-identifier value. Another aspect of k-anonymity is its capability of maintaining the truthfulness of the released data (unlike other existing methods). This is achieved by generalization, a primary technique in k-anonymity. Generalization consists of substituting attribute values with semantically consistent but less precise values. When the substituted value does not preserve semantic validity, the technique is called suppression, which is a special case of generalization. We present a hybrid approach called compensation, which is based on suppression and swapping, for achieving privacy. Since swapping decreases the truthfulness of attribute values, our algorithm incorporates a tradeoff between the level of swapping (information truthfulness) and suppression (information loss). We use k-anonymity to explore the issue of anonymity preservation. Since we do not use generalization, we do not need a priori knowledge of attribute semantics. We investigate data anonymization in the context of classification and use tree properties to satisfy k-anonymization. Our work improves on previous approaches by increasing classification accuracy.

Keywords: anonymity, privacy preserving, generalization, suppression, data mining.
‡ http://www.informatik.uni-konstanz.de/arbeitsgruppen/infovis/mitglieder/slava-kisilevich
§ http://www.ise.bgu.ac.il/faculty/elovici
  http://www.ise.bgu.ac.il/faculty/bracha-s
¶ http://www.ise.bgu.ac.il/faculty/liorr
1 Introduction and Motivation
In today's computerized world, when storage device costs are low, data about individuals are extensively available and easily collected: Internet surfers leave all sorts of tracks at every visited site (from the computer's IP address to submitted forms with private data); email providers scan email traffic; medical patient records are stored in databases; governments maintain records about every citizen; private companies and organizations (such as travel agencies, flight and insurance companies, etc.) collect data about people from a variety of sources. What would happen if some of this information, perhaps financial or health-related, became publicly available? It could very possibly lead to community ostracism or even dismissal from work. Clearly, there is a growing concern among individuals that the data they share about themselves, either voluntarily or due to various regulations, should be protected.

Technological advances are increasing the demand for data, and there is a growing interest among private companies and universities in using these data for research, to investigate patterns of behaviour and to draw conclusions. In the face of this continually expanding interest in exploiting available datasets, we have seen over the past 10 years increased attention to privacy-preserving data mining, a field which tries to provide a tradeoff between maintaining privacy and using the private data. As an example of this thin line between preserving privacy and using private data, consider a typical hospital's cardiology center, where the database undoubtedly includes patient records comprising personal information (such as name and social security number); demographic data (race and age); and medical information (symptoms, diagnosis and medications). An external researcher is interested in using this database in order to generate a complications model for these patients with the aim of introducing a new treatment protocol. A patient record should, of course, be stripped of identifiable information before being shared with any person other than the patient's primary physician. Obviously, unique identity fields such as name and social security number should be completely removed from the dataset. However, other fields can be used to disclose the real identity of the patient indirectly, especially if part of the data can be linked to other data sources that may include the patients' identities. For example, if there is only a single patient in the database who is 90 years old, then providing the age datum may reveal the identity of this patient.

A real-life example was demonstrated by Sweeney [1]. She purchased a copy of the Cambridge, Massachusetts voter registration list and also obtained a copy of the medical records of Massachusetts state employees. Linking together the data from the two datasets, she managed to find medical information about the governor of Massachusetts. Only three attributes were needed: ZIP, birth date and sex. She demonstrated that even though sensitive data such as social security numbers were not present in either of the two datasets, and examining only one of the two datasets alone could not provide an inference about a particular individual, nevertheless cross-linking the datasets yielded the governor's personal medical records. In an effort to enhance data privacy, Sweeney proposed a new method, k-anonymity, which guarantees privacy protection against linking and allows the safe release of information without invalidating the truthfulness of individual records.
In this paper we discuss privacy preservation and data mining problems in terms of classification. We propose an algorithm for privacy-preserving data mining that performs dataset anonymization using the k-anonymity model while taking into account its effect on classification results. We extend the k-anonymity model by providing new definitions and use several anonymization techniques together in order to obtain better results, in terms of accuracy, than reported in the literature. Section 2 discusses related work, while Section 3 provides the definitions used in the article and formulates the problem. Section 4 explains how our proposed algorithm works, and Section 5 provides experimental results. Section 6 concludes the paper.
2 Related Work
k-anonymity is a popular approach to masking data and avoiding re-identification of sensitive data through common attributes [2]. A dataset complies with k-anonymity protection if the individual data stored in the released dataset cannot be distinguished from at least k−1 other individuals whose data also appear in the dataset. This protection guarantees that the probability of identifying an individual based on the released data in the dataset does not exceed 1/k. Generalization and suppression are the most common methods used for de-identification of the data in k-anonymity-based algorithms [3],[4],[5],[6]. Generalization is a process of substituting an attribute value with a more general value that is semantically consistent with the previous one, using so-called attribute hierarchies. One of the advantages of generalization over other disclosure limitation techniques is that it preserves the truthfulness of the information. However, a major drawback of existing generalization techniques is that domain hierarchy trees are required for every quasi-identifier attribute on which k-anonymity is to be applied [4],[5],[6],[7],[8],[9],[10]. These domain hierarchy trees are generated manually by the user before applying the generalization process. In order to overcome this major drawback of generalization, we effectively use suppression in our algorithm. While many other approaches for masking private data exist, we would like to mention swapping since it is used in our algorithm. Proposed in 1978 by Dalenius and Reiss [11], data swapping selectively modifies a fraction of the records in the database by exchanging a subset of attributes between selected pairs of records. The advantage of swapping lies in the fact that it introduces uncertainty for potential attackers about the true value of a sensitive attribute. Since 1978 many swapping techniques have been proposed [12],[13],[14]. A characteristic common to all of these techniques is the attempt to preserve the statistical characteristics of the original dataset (the univariate distribution of the swapped attribute, the natural dependency between fields, means and variances) rather than high classification results.
In contrast to the original definition and purpose of swapping, the technique that we propose here employs swapping as a means for achieving k-anonymity and as a way of limiting the information loss incurred by suppression. To the best of our knowledge, no research has been performed to date in which several privacy-preserving techniques are mixed together. In our previous research [15], we presented a k-anonymity classification tree-based suppression algorithm (kACTUS). kACTUS wraps a decision tree inducer which is used to induce a classification tree from the original dataset. kACTUS then uses this tree to apply k-anonymity to the dataset while minimizing the tradeoff between k-anonymity constraints and classification quality. The resultant anonymous dataset can then be sent to an external user who may utilize any induction algorithm for training a classifier over the anonymous dataset. kACTUS uses the suppression approach alone, performing a random selection of instances, which is known to reduce the quality of the data when inefficiently applied. Moreover, this randomness injection may cause the performance of the algorithm to be unstable; results might be improved if a greedy selection rule is used. In this paper we present a variation of kACTUS that we refer to as kACTUS-2. Our new revised algorithm does not randomly select records; instead it uses an information gain criterion for selecting the best instances to suppress. Moreover, kACTUS-2 efficiently uses multi-dimensional suppression and swapping, so that the new method outperforms the previous algorithm and other generalization methods which require manually defined generalization taxonomies.
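To make the masking operations discussed above concrete, the following toy sketch (illustrative values only, not drawn from the paper's datasets or code) contrasts generalization, suppression, and swapping on a single quasi-identifier:

# Toy contrast of three masking operations on the quasi-identifier "age".
record = {"age": 34, "zip": "08901", "diagnosis": "flu"}

# Generalization: replace the value with a semantically consistent but less
# precise one; this requires a predefined hierarchy (here, 10-year bins).
generalized = dict(record, age="30-39")

# Suppression: replace the value with a meaningless symbol; no hierarchy needed.
suppressed = dict(record, age="?")

# Swapping: exchange the value with that of another record, which preserves the
# attribute's distribution but not the truthfulness of the individual record.
other = {"age": 58, "zip": "08901", "diagnosis": "asthma"}
record["age"], other["age"] = other["age"], record["age"]

print(generalized)
print(suppressed)
print(record, other)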
3 Problem Formulation
In this section several definitions which will be used later in the article are introduced, and the problem formulation is presented (Definition 9). The following two definitions are adopted from [3].

Definition 1 (Quasi-identifier). Given a population U, a person-specific table PT(A1, . . . , An), fc : U → PT and fg : PT → U′, where U ⊆ U′, a quasi-identifier of PT, written QPT, is a set of attributes {Ai, . . . , Aj} ⊆ {A1, . . . , An} such that ∃ pi ∈ U for which fg(fc(pi)[QPT]) = pi.

Definition 2 (k-anonymity). Let RT(A1, . . . , An) be a table and QRT the quasi-identifier associated with it. RT is said to satisfy k-anonymity if and only if each sequence of values in RT[QRT] appears at least k times in RT[QRT].

Definition 3 (Decision tree). A decision tree is a classifier consisting of decision and leaf nodes. A decision node specifies a test over one of the attributes; the best attribute is selected using one of the accepted metrics. A leaf node specifies a value of the target attribute. Decision nodes which are closer to the root were selected first, so they can be regarded as more important in terms of classification than nodes closer to the leaves. kACTUS-2 receives a decision tree as input and works with its decision nodes.
The algorithm does not use the value of the target node directly from the decision tree; thus, whenever the term leaf node is encountered below, it denotes only a decision node that has zero child nodes.

Definition 4 (Complying Node and Non-complying Node). Suppose that we follow a tree branch from the root node a1 to a leaf node an. We construct an attribute-value set going down the branch up to the leaf node an. If the number of tuples in the dataset matching this attribute-value set is greater than or equal to the k-anonymity threshold, then we call the leaf an a complying node; otherwise we call it a non-complying node. From the properties of decision trees we can state that there are only complying and non-complying nodes in the tree.

Definition 5 (Suppression). As defined by Sweeney in [3], suppression means that a specific value from the original dataset is substituted by a meaningless one in the anonymized dataset. In this paper, we assume that suppression is used on the non-complying node only: we prune the leaf node of the non-complying node, thus suppressing all the attribute values associated with that node. We use the question mark "?" as the symbol for a suppressed value, which is interpreted by the evaluated inducers as a missing value.

Definition 6 (Compensation). Using Definition 4 about complying and non-complying nodes, we propose a general approach for achieving k-anonymity using decision trees, called compensation. A complying node can manipulate its associated records so as to compensate part of its records in favor of a non-complying node in order to turn the non-complying node into a complying one. kACTUS-2 performs compensation by suppression and compensation by swapping. In addition, compensation works only between sibling leaf nodes which have a common parent.

Definition 7 (Compensation by Suppression). We now extend the definition of suppression given in Definition 5. Compensation by suppression is always accompanied by suppression of non-complying nodes. First, we check how many records are required for the non-complying node to be compliant. Let us assume that the number of required records is k − m, where m is the number of records associated with the non-complying node. Suppose that there is a complying node which can compensate k − m of its records. We select k − m records from the complying node (the selection criteria are described in Section 4) and suppress their attribute values associated with the complying leaf node. The suppression of the non-complying node and compensation by suppression guarantee that the non-complying node will be compliant.

Definition 8 (Compensation by Swapping). First, we check how many records are required for the non-complying node to be compliant. Let us assume that the number of required records is k − m, where m is the number of records associated with the non-complying node.
Suppose that there is a complying node which can compensate k − m of its records. We perform k − m iterations such that on every iteration we check (using criteria explained in Section 4) what is the best class value required for the non-complying node. Then, we search for a record with the specified class value which is associated with the complying node. If such a record is found, we swap its leaf node value with the leaf node value of the non-complying node. If such a record is not found, we select another record (using specific criteria explained in Section 4) and swap its leaf node value. It is clear that after k − m iterations there will be k records associated with the non-complying node.

Definition 9 (Optimal k-anonymity Transformation for a Classification Problem). Given a k-anonymity threshold k ∈ [1, m], an inducer I, a dataset S with input attribute set A = a1, . . . , an and target class y from a distribution D over the labeled instance space, the goal is to find an optimal transformation of S into an anonymized dataset S' such that S' satisfies k-anonymity. Optimality is defined in terms of minimizing the deterioration in the generalized accuracy over the distribution D as a result of replacing classifier I(S) with classifier I(S').
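As a side illustration of Definition 2, the k-anonymity condition can be checked directly by grouping the released records on their quasi-identifier values; the following minimal sketch uses illustrative data and is not the authors' implementation:

# Check Definition 2: every combination of quasi-identifier values must occur
# at least k times in the released table. The rows here are illustrative.
from collections import Counter

def satisfies_k_anonymity(rows, quasi_identifiers, k):
    counts = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(c >= k for c in counts.values())

rows = [
    {"age": "30-39", "zip": "089**", "diagnosis": "flu"},
    {"age": "30-39", "zip": "089**", "diagnosis": "asthma"},
    {"age": "50-59", "zip": "089**", "diagnosis": "flu"},
]
print(satisfies_k_anonymity(rows, ["age", "zip"], k=2))  # False: the 50-59 group has only one record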
4 Method
Recall from Definition 9 that our objective is to obtain an anonymous dataset such that the predictive performance of a classifier trained on the anonymous dataset is comparable to the performance of a classifier trained on the original dataset. In our approach we wrap an existing classification tree induction algorithm (such as C4.5) trained on the quasi-identifiers of the original dataset. Having the classification model of the original dataset, we can perform k-anonymization by manipulating the leaves of the decision tree. If all the leaves in the tree are complying nodes, then we can say that the dataset complies with k-anonymity. For leaves that do not comply with k-anonymity we perform one of two operations: suppression and compensation by suppression (Def. 5 and 7), or compensation by swapping (Def. 8).
4.1 How a Complying Node Is Selected for Compensation by Swapping
Every complying node is checked against a non-complying node. First, we perform a virtual compensation by swapping on a complying node and check the final entropy [16] of its remaining records. The Shannon entropy of a random variable X having the probability distribution p = (p1, . . . , pn) is given by:

H(p1, . . . , pn) = − Σ_{i=1}^{n} p_i log2 p_i    (1)
Lastly, the complying node with the minimal final entropy is selected for compensation.
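A small sketch of the entropy criterion of Eq. (1), applied to the class labels that would remain in two hypothetical complying leaves after compensation (the labels are illustrative, and this is not the authors' code):

# Shannon entropy (Eq. 1) of the class labels remaining in a complying leaf;
# the complying leaf whose remaining labels have minimal entropy is preferred.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

after_candidate_a = ["pos", "pos", "pos", "neg"]   # records left if leaf A compensates
after_candidate_b = ["pos", "neg", "pos", "neg"]   # records left if leaf B compensates
print(entropy(after_candidate_a))  # ~0.811
print(entropy(after_candidate_b))  # 1.0, so leaf A would be chosen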
4.2 How Compensation by Swapping Is Performed
Once a complying node is selected for compensation, we perform the actual swapping in the following way. We check which class value has the least influence, in terms of entropy, on the records associated with the non-complying node. Given that class value, we search the original dataset for a record associated with the complying node having that class value. Provided such a record is found, we select it and swap its attribute value associated with the leaf node with the attribute value of the non-complying leaf node. If such a record is not found, we check which class value will have the least influence on the final entropy after we remove such a record from the set of records associated with the complying node. The same swapping process described above is then applied to the selected record. Given an initial number m < k of records associated with the non-complying node, where k is the anonymity threshold, we perform k − m iterations, and in every iteration we perform swapping as described above.
4.3 The kACTUS-2 Algorithm
Algorithms 1-4 describe our new method for k-anonymity. Since the algorithm uses several helper functions which are quite straightforward, we do not provide their pseudo-code; their general descriptions and explanations are presented below.

root - returns the root node of CT.
height - returns the height of a node (the length of the longest path from the node to a leaf).
parent - returns the parent of the node.
count-instances - counts the instances in the dataset associated with a particular node.
move-complying-node - moves the instances associated with the complying node to the anonymized dataset, together with the non-quasi-identifiers of the original instance, if any.
remove-leaf-nodes - removes the leaf nodes of a node.
get-total-instance-count - counts the total number of instances associated with all the nodes in the set.
move-root-non-complying-nodes - moves the instances associated with root nodes by first suppressing all quasi-identifiers (the root node contains only one quasi-identifier), keeping only the target class value of the original instance along with the non-quasi-identifiers of the instance, if any.
get-non-complying-leafs - given a node, returns all leafs which do not comply with k-anonymity.
get-complying-leafs - given a node, returns all leafs which comply with k-anonymity.
move-instances - like move-complying-node, but takes a set of nodes as input.
calculate-final-compliant-entropy - the explanation is given further in this section.
swap-from-complying - performs swapping of the attribute value of the complying leaf node to the attribute value of the non-complying leaf node, and swapping of the class value required by the non-complying node to keep its entropy at a low level. See Definition 8 and Subsection 4.2 for detailed explanations.
move-non-complying-instances - like move-instances, but before moving such instances, the attribute of the leaf node is removed from all the instances (suppressed).
compensate-from-complying - see Definition 7 for a detailed explanation.

The algorithm requires the following input parameters:
1. k-anonymity threshold KT
2. swapping threshold ST
3. original dataset OD
4. classification tree CT
5. set of non-quasi-identifiers NQI
The output of the algorithm is the anonymized dataset AD. The process is as follows: we iterate over the classification tree while it has at least one root node. Having selected a root-node, we find the longest path from it to a leaf node. If the longest path is of a height greater than 1, it means that the root-node has children and thus we can call the main function, PerformAnonymization (line 1.15); otherwise we need to check how many instances are associated with the particular root-node. If the number of instances is greater than or equal to the k-anonymity threshold, then we move the instances from the original dataset (CS) to the anonymized dataset (line 1.18). After that we remove the root-node from the classification tree (line 1.19). If the number of instances is less than the k-anonymity threshold, we add the node to a set of non-complying nodes (line 1.21). When all root nodes are pruned from the classification tree, we need to check how many instances in total are associated with all the non-complying nodes stored in the non-complying set (line 1.25). If the total number of instances is greater than or equal to the k-anonymity threshold, we call the move-root-non-complying-nodes function (line 1.25), which moves these instances to the anonymized dataset. Only the target attribute and non-quasi-identifiers of the original instance are retained. On the other hand, if the total number of instances is less than the k-anonymity threshold, we do not copy these instances to the anonymized dataset. In such cases the anonymized dataset can contain fewer instances than the original one; however, data loss is bounded by the k-threshold, so in the worst case scenario the maximum number of records which can be lost is k − 1. Algorithm 2 describes the PerformAnonymization procedure. This procedure gets the leaf node with the longest path to the root, found in Algorithm 1, as a parameter. Using the parent function we acquire the parent of this node (line 2.2). Then, we check how many instances are associated with the parent; this means that we need to count all the instances of its child nodes.
If the total number of instances is less than the k-threshold, then we simply prune the parent node by removing all its children (line 2.4). In lines 2.7 and 2.8 we get two sets which contain the complying and non-complying leaf nodes (children of the parent node). An additional check is performed on line 2.9: if the set which holds the non-complying leaves is empty, there is no need to continue further; we just move all instances associated with the complying leaf nodes (line 2.10) and prune all the children (line 2.11). We call the two functions PerformSwapping and PerformSuppression in succession on lines 2.14 and 2.15. Finally, we prune all the children nodes (line 2.16). Algorithm 3 describes the PerformSwapping procedure. It starts by iterating over the non-complying nodes (lines 3.2-3.22). First, we calculate how many instances are required in order to make the non-complying node compliant (line 3.3). Then, the required-ratio is calculated and compared to the swapping threshold ST (line 3.4). We perform swapping only if the ratio of required instances is less than the predefined ST. For every non-complying node, an inner loop is started over the complying nodes (lines 3.9-3.16) in order to find the best complying node to perform swapping (lines 3.11-3.15). Finally, when the best-complying-node is found, we move the instances associated with the non-complying node to the anonymized dataset using the move-instances function (line 3.20) and perform swapping using the swap-from-complying function (line 3.21). Algorithm 4 describes the PerformSuppression procedure. In lines 4.2-4.13 we iterate over the complying nodes. We need to check whether the complying node is capable of compensating some of its instances to non-complying nodes. In order to implement this action, we start an inner loop over the non-complying nodes (lines 4.3-4.11). We obtain the number of instances which can be given by the complying node (line 4.4) and the number of instances required by the non-complying node (line 4.5). If the number of instances which the complying node can compensate is greater than or equal to the number of required instances, then compensation is possible; otherwise compensation is not used. Finally, we move the instances associated with the non-complying node to the anonymized dataset (line 4.9), but with the leaf attribute value suppressed. After this step is carried out, the compensate-from-complying function is called to compensate the required number of instances of the complying node (line 4.10). The remaining instances of the complying node are moved to the anonymized dataset (line 4.12).
5 Experimental Evaluation
This section presents the results for the various issues that were examined in regard to the proposed algorithm: (1) verification of the proposed hybrid approach for achieving k-anonymity without significant loss of classification accuracy; (2) comparison of kACTUS-2 with the TDS, TDR and kADET algorithms presented in [6], [9] and [10] in terms of classification accuracy, information loss and statistical significance; (3) comparison of kACTUS-2 with the former algorithm kACTUS in terms of classification accuracy, information loss, statistical significance and data loss.
Algorithm 1. kACTUS-2 (Part 1)
Input:
1:  KT - k-anonymity threshold
2:  ST - swapping threshold
3:  OD - original dataset
4:  CT - classification tree
5:  NQI - non-quasi-identifier set
Output:
6:  AD - anonymous dataset
7:  procedure Main(KT, ST, OD, AD, CT, NQI)
8:      CS ← OD                                   ▷ work on a copy of the original dataset
9:      AD ← ∅
10:     non-complying-node-set ← ∅
11:     for all root-node in root(CT) do
12:         while height(root-node) > 0 do
13:             longest-node ← get-longest-node(root-node)
14:             if height(longest-node) > 1 then
15:                 PerformAnonymization(longest-node, KT, ST, CS, AD)
16:             else                               ▷ the longest node is the root node
17:                 if count-instances(longest-node, CD) ≥ KT then
18:                     move-complying-node(longest-node, CS, AD)
19:                     remove-leaf-nodes(longest-node)
20:                 end if
21:                 non-complying-node-set ← non-complying-node-set ∪ longest-node
22:             end if
23:         end while
24:     end for
25:     if get-total-instance-count(non-complying-node-set) ≥ KT then
            move-root-non-complying-nodes(non-complying-node-set, CS, AD, NQI)
26:     end if
27: end procedure
5.1 Datasets
A comparative experimental study was carried out using five datasets from the UCI Repository [17]: Adult, German Credit, TTT, Glass Identification and Waveform. Researchers into k-anonymity mostly work with the Adult and German Credit datasets, as they are the most suitable publicly available benchmarks. We use three additional publicly available datasets (TTT, Glass Identification and Waveform) to evaluate the performance of the above-mentioned algorithms. The Adult dataset has six continuous and eight categorical attributes and a binary class attribute which represents income level (less or more than 50K). The numerical data were discretized by a supervised discretization filter using the Weka data mining framework [18] as a preprocessing step. The dataset has 45,222 records.
Algorithm 2. kACTUS-2 (Part 2)
1:  procedure PerformAnonymization(longest-node, KT, ST, CS, AD, NQI)
2:      parent-node ← parent(longest-node)
3:      if count-instances(parent-node, CD) < KT then
4:          remove-leaf-nodes(parent-node)
5:          return
6:      end if
7:      non-complying-leafs ← get-non-complying-leafs(parent-node)
8:      complying-leafs ← get-complying-leafs(parent-node)
9:      if non-complying-leafs = ∅ then
10:         move-instances(complying-leafs, CS, AD, NQI)
11:         remove-leaf-nodes(parent-node)
12:         return
13:     end if
14:     PerformSwapping(complying-leafs, non-complying-leafs, KT, ST, CS, AD, NQI)
15:     PerformSuppression(complying-leafs, non-complying-leafs, KT, ST, CS, AD, NQI)
16:     remove-leaf-nodes(parent-node)
17: end procedure
A domain generalization hierarchy was built using all 14 attributes and was adopted from [6]; it was used by the TDS, TDR and kADET algorithms. The German Credit dataset contains observations on 30 variables for 1000 past credit applicants. Each applicant was rated as good credit (700 cases) or bad (300 cases). We used the nine quasi-identifiers proposed by [9]: credit history, savings account, duration in months, status of existing checking account, credit amount, purpose, property, installment rate as a percentage of disposable income, and other debtors. The TTT dataset encodes the complete set of possible board configurations at the end of tic-tac-toe games. It contains 958 instances, nine categorical attributes and a binary class (positive or negative). All attributes were used as quasi-identifiers. The Glass Identification database contains 214 instances and nine continuous attributes. We performed supervised discretization of the continuous attributes. The class label represents seven types of glass: building windows float processed, building windows non-float processed, vehicle windows float processed, vehicle windows non-float processed, containers, tableware, and headlamps. All attributes were used as quasi-identifiers. The Waveform Database Generator contains 5000 instances and 40 continuous attributes, which were discretized by Weka. The class label represents three classes of waves. All attributes were used as quasi-identifiers.
5.2 Comparing to Other k-Anonymity Algorithms
Algorithm 3. kACTUS-2 (Part 3)
1:  procedure PerformSwapping(complying-nodes, non-complying-nodes, KT, ST, CS, AD, NQI)
2:      for all nc in non-complying-leafs do
3:          required ← KT − count-instances(nc)
4:          required-ratio ← required / KT
5:          if ST < required-ratio then
6:              continue                           ▷ swapping is not possible
7:          end if
8:          best-complying-node ← ∅
9:          for all c in complying-leafs do
10:             if count-instances(c) − KT > required then
11:                 entropy ← calculate-final-compliant-entropy(c, nc, KT)
12:                 if entropy is minimal then
13:                     best-complying-node ← c
14:                 end if
15:             end if
16:         end for
17:         if best-complying-node = ∅ then
18:             continue                           ▷ no best complying node was found for swapping
19:         end if
20:         move-instances(nc, CS, AD, NQI)
21:         swap-from-complying(best-complying-node, nc, required, NQI)
22:     end for
23: end procedure
In order to evaluate the classification accuracy of the proposed method and compare it with other algorithms, we divided each dataset using the 5x2 cross-validation procedure proposed by [19]. The method is based on five iterations of a twofold cross-validation. All five algorithms were evaluated on training sets with different k-anonymity thresholds. We evaluated four classification algorithms in the induction phase in order to examine various induction biases: C4.5 [20], PART [18], Naïve Bayes and Logistic [21]. All experiments were performed using Weka. The performance graphs we generated represent the classification accuracy as a function of k. The evaluation results are very encouraging. Figures 1-5 present performance graphs using a C4.5 classifier. Figure 1 displays a classification accuracy graph for the Adult dataset with an anonymity threshold of 5 ≤ k ≤ 1000. The classification accuracy of the original model is 85.96. The k values we used are 5, 20, 50, 100, 500 and 1000. We can see that the classification accuracy of kACTUS and kACTUS-2 with k = 5 is considerably higher than that of TDS, TDR and kADET, but the accuracy of kACTUS-2 converges to the accuracy of TDS for k = 1000, while the classification accuracy of kACTUS becomes worse than kADET starting from about k = 700, and worse than TDS starting from k = 900. Figure 2 displays the classification accuracy for the German dataset. The k values we used are 5, 10, 15, 20, 30, 40, 50, 80 and 100. The original classification accuracy is 71.67. Again, with a small value of k = 5, kACTUS-2 performs better than all other algorithms.
Algorithm 4. kACTUS-2 (Part 4)
1:  procedure PerformSuppression(complying-nodes, non-complying-nodes, KT, ST, CS, AD, NQI)
2:      for all c in complying-leafs do
3:          for all nc in non-complying-leafs do
4:              can-compensate ← count-instances(c) − KT
5:              required ← KT − count-instances(nc)
6:              if can-compensate < required then
7:                  continue                       ▷ complying node has fewer instances than required for compensation
8:              end if
9:              move-non-complying-instances(nc, CS, AD, NQI)
10:             compensate-from-complying(c, required)
11:         end for
12:         move-instances(c, CS, AD, NQI)
13:     end for
14: end procedure
With higher values of k, the performance of kACTUS-2 gets worse, and in the range 45 ≤ k ≤ 78 the classification accuracy of kACTUS is higher. The classification accuracy of kACTUS-2 converges to the accuracy of kADET for k = 100, and both algorithms provide their best results at k = 100. Figure 3 shows the classification accuracy for the TTT dataset using the C4.5 classifier. The original classification accuracy is 81.20. We used the following k-anonymity thresholds: 5, 10, 15, 20, 30. The performance of kACTUS-2 is better than that of all the other algorithms. Only from k = 25 does its performance decrease below the classification accuracy of kACTUS.
Fig. 1. Adult 14qi/14. Classification accuracy vs. k-threshold. C4.5 inducer.
Fig. 2. German Credit. Classification accuracy vs. k-threshold. C4.5 inducer.
Fig. 3. TTT. Classification accuracy vs. k-threshold. C4.5 inducer.
However, it remains higher than that of kADET, TDS and TDR. The worst performance is that of TDR. From Figure 4, we can see that for small values of k, all algorithms except TDR behave almost the same, with minor variations. kADET seems to perform better up to k = 15.
Fig. 4. Glass. Classification accuracy vs. k-threshold. C4.5 inducer.
Fig. 5. Waveform. Classification accuracy vs. k-threshold. C4.5 inducer.
For higher values of k, kACTUS outperforms all the other algorithms. Finally, after k = 25, kACTUS-2 outperforms kACTUS. The k-thresholds are 5, 10, 15, 20, 30. The original classification accuracy is 67.63. Figure 5 presents the classification accuracy of the five algorithms for Waveform with k = 5, 10, 15, 20, 30, 40, 50. The original classification accuracy is 74.78. We can see that kACTUS-2 performance deteriorates very slowly and its graph is almost horizontal. kACTUS performance is slightly worse than kACTUS-2 but better than kADET, TDS and TDR.
Table 1. Area Under the Curve (AUC)

Dataset         Classifier      TDS         TDR         kADET       kACTUS      kACTUS-2
Adult 14qi/14   C4.5            76414.025   75472.2     80843.275   81830.7     82663.22
                PART            76401.425   75472.225   N/A         81849.25    82338.2
                Naïve Bayes     75295.975   73071.75    N/A         82383.8     82755.32
                Logistics R.    76377.775   76477.2     N/A         80761.575   81623.65
German Credit   C4.5            6583.175    6621.625    6596.7      6645.725    6646.4
                PART            6456.275    6292.65     N/A         6438.1      6505.225
                Naïve Bayes     6585.775    6173.25     N/A         6640.52     6625.95
                Logistics R.    6635.25     6525.025    N/A         6635.35     6626.425
TTT             C4.5            1731.175    1658.75     1771.3      1789.725    1802.35
                PART            1727.925    1628.5      N/A         1795.325    1791.15
                Naïve Bayes     1751.525    1716.075    N/A         1623.675    1566.675
                Logistics R.    1738.275    1693.15     N/A         1481.8      1389.725
Glass           C4.5            1292.3      964.475     1350.35     1383.2      1372.225
                PART            1292.3      964.475     N/A         1400.675    1390.225
                Naïve Bayes     1202.8      1289.625    N/A         1387.525    1463.15
                Logistics R.    1284.65     965.9       N/A         1188.9      1242.9
Waveform        C4.5            2860.2      1837.65     3113.85     3185.825    3252.475
                PART            2857.85     2175.825    N/A         3208.5      3231.75
                Naïve Bayes     2827.9      2064.7      N/A         2961.075    3054.7
                Logistics R.    2859.825    1035.925    N/A         2038        2208.4
Since we used all 40 attributes of the Waveform dataset as quasi-identifiers, the performance graph suggests that kACTUS-2, kACTUS and kADET can work with a large number of quasi-identifiers, while TDS and TDR cannot handle a large number of quasi-identifiers without losing classification accuracy. This is clearly seen in the example of the German dataset in Figure 2. Because the curves of the compared algorithms might intersect, we used the Area Under the Curve (AUC) as a single-value metric to compare the algorithms and establish a dominance relationship among them; a better algorithm should have a larger area. The values reported in Table 1 represent the mean AUC performance on the Adult, German, TTT, Glass and Waveform datasets, respectively. The results indicate that kACTUS-2 outperforms the other four algorithms in 11 cases out of a total of 20. kACTUS-2 outperforms TDS in 15 cases out of 20 and TDR in 18 cases out of 20. Moreover, kACTUS-2 outperforms kADET using C4.5 on all datasets. Note that since the output of kADET is an anonymous C4.5 decision tree rather than an anonymous dataset, we cannot compare it using other classification algorithms. More importantly, kACTUS-2 outperforms kACTUS in 13 out of 20 cases. In contrast to the kACTUS algorithm, kACTUS-2 does not use a random generator, so it provides more stable results, and it achieves an average reduction of 0.5% in classification error with respect to kACTUS. In order to check whether all five algorithms construct classifiers with the same error rate over a test set, we followed the procedure proposed in [22]. First, an adjusted Friedman test was used to reject the null hypothesis, and then a Bonferroni-Dunn test was applied to examine whether kACTUS-2 performs significantly better than the other four algorithms. We found that kACTUS-2 statistically outperforms TDS, TDR, kADET and kACTUS with a confidence level of 95%.
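For readers who want to reproduce this kind of comparison, the following sketch applies a Friedman test with standard SciPy routines to hypothetical per-dataset accuracies; the Bonferroni-Dunn post-hoc step is only indicated in a comment, and none of the numbers are taken from the paper:

# Friedman test over hypothetical per-dataset accuracies of the five algorithms;
# if the null hypothesis of equal error rates is rejected, a post-hoc
# Bonferroni-Dunn comparison against kACTUS-2 would follow.
from scipy.stats import friedmanchisquare

tds     = [0.80, 0.69, 0.70, 0.55, 0.63]
tdr     = [0.79, 0.69, 0.67, 0.45, 0.50]
kadet   = [0.82, 0.69, 0.71, 0.57, 0.66]
kactus  = [0.84, 0.70, 0.72, 0.59, 0.68]
kactus2 = [0.85, 0.70, 0.73, 0.59, 0.69]

stat, p_value = friedmanchisquare(tds, tdr, kadet, kactus, kactus2)
print(stat, p_value)  # p_value < 0.05 would justify the post-hoc step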
5.3 Information Loss
A metric which is of great importance to us is information loss. Information loss in our case is the number of missing values contained in a dataset divided by the total number of values. Suppression is not as flexible as generalization, so one would anticipate a higher information loss rate in kACTUS-2 than in TDS and TDR. However, this is not the case, as can be seen in Table 2, which displays suppression rates in percentages for the five datasets. Since the output of kADET is an anonymized model rather than a dataset, we obviously cannot determine its rate of information loss. The results are very encouraging: from Table 2 we can see that the kACTUS-2 information loss rate is lower in 19 cases out of a total of 32.

Table 2. Percentage of missing values vs. k-threshold

Dataset         k-anonymity   TDS    TDR    kACTUS  kACTUS-2
Adult 14qi/14   5             0.73   0.71   0.72    0.71
                20            0.63   0.67   0.72    0.72
                50            0.69   0.67   0.73    0.72
                100           0.75   0.73   0.75    0.73
                500           0.83   0.73   0.78    0.75
                1000          0.76   0.75   0.83    0.76
TTT             5             0.67   0.69   0.62    0.58
                10            0.7    0.72   0.67    0.61
                15            0.7    0.75   0.73    0.64
                20            0.71   0.76   0.74    0.65
                30            0.8    0.78   0.77    0.68
Glass           5             0.66   0.67   0.57    0.51
                10            0.67   0.70   0.61    0.55
                15            0.72   0.71   0.64    0.58
                20            0.75   0.77   0.69    0.62
                30            0.78   0.78   0.75    0.67
Waveform        5             0.82   0.86   0.82    0.81
                10            0.85   0.88   0.83    0.81
                15            0.85   0.88   0.83    0.82
                20            0.85   0.89   0.84    0.83
                30            0.85   0.89   0.85    0.83
                40            0.85   0.9    0.85    0.84
                50            0.87   0.91   0.86    0.84
German Credit   5             0.31   0.31   0.34    0.34
                10            0.32   0.32   0.36    0.35
                15            0.35   0.33   0.36    0.35
                20            0.36   0.34   0.37    0.35
                30            0.36   0.34   0.37    0.37
                40            0.36   0.34   0.38    0.38
                50            0.36   0.35   0.39    0.39
                80            0.36   0.37   0.42    0.41
                100           0.36   0.38   0.42    0.42
The kACTUS-2 rate is lower for all values of k in the TTT, Glass and Waveform datasets, lower in two cases out of six in the Adult dataset, and higher in the German dataset for every value of k.
6 Conclusions
In this paper we presented a new method of using k-anonymity for preserving privacy in classification tasks. The proposed method requires no prior knowledge and can be used with any inducer. It combines the compensation by suppression presented in [15] with compensation by swapping, which decreases the information loss induced by the suppression approach. The new method also shows higher predictive performance and less information loss when compared to existing state-of-the-art methods. Issues to be studied further include examining kACTUS-2 in relation to other decision tree inducers. kACTUS-2 should also be extended to other data mining tasks (such as clustering and association rules) and to anonymity measures (such as l-diversity) which respond to other known attacks against k-anonymity, such as homogeneity and background-knowledge attacks.
References
1. Sweeney, L.: k-anonymity: a model for protecting privacy. Int. J. Uncertain. Fuzziness Knowl.-Based Syst. 10(5), 557–570 (2002)
2. Samarati, P., Sweeney, L.: Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression
3. Sweeney, L.: Achieving k-anonymity privacy protection using generalization and suppression (2002)
4. Iyengar, V.S.: Transforming data to satisfy privacy constraints. In: KDD 2002: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 279–288. ACM, New York (2002)
5. Wang, K., Yu, P.S., Chakraborty, S.: Bottom-up generalization: a data mining solution to privacy protection. In: Proc. of the 4th IEEE International Conference on Data Mining (ICDM 2004) (November 2004)
6. Fung, B.C.M., Wang, K., Yu, P.S.: Top-down specialization for information and privacy preservation. In: Proc. of the 21st IEEE International Conference on Data Engineering (ICDE 2005), Tokyo, Japan, April 2005, pp. 205–216 (2005)
7. LeFevre, K., DeWitt, D.J., Ramakrishnan, R.: Incognito: efficient full-domain k-anonymity. In: SIGMOD 2005: Proceedings of the 2005 ACM SIGMOD international conference on Management of data, pp. 49–60. ACM, New York (2005)
8. Friedman, A., Schuster, A., Wolff, R.: k-anonymous decision tree induction. In: Fürnkranz, J., Scheffer, T., Spiliopoulou, M. (eds.) PKDD 2006. LNCS (LNAI), vol. 4213, pp. 151–162. Springer, Heidelberg (2006)
9. Fung, B.C.M., Wang, K.: Anonymizing classification data for privacy preservation. IEEE Trans. on Knowl. and Data Eng. 19(5), 711–725 (2007); Fellow-Philip S. Yu
10. Friedman, A., Wolff, R., Schuster, A.: Providing k-anonymity in data mining. VLDB J. (2008) (accepted for publication)
11. Dalenius, T., Reiss, S.P.: Data-swapping, a Technique for Disclosure Control. Program in Computer Science and Division of Engineering. Brown University (1978)
12. Reiss, S.P.: Practical data-swapping: the first steps. ACM Trans. Database Syst. 9(1), 20–37 (1984)
13. Richard, A., Moore, J.: Controlled data-swapping techniques for masking public use microdata sets. Statistical Research Division Report Series, RR96-04, U.S. Bureau of the Census (1996)
14. Fienberg, S.E., McIntyre, J.: Data swapping: Variations on a theme by Dalenius and Reiss. Technical report, National Institute of Statistical Sciences, Research Triangle Park, NC (2003)
15. Kisilevich, S., Elovici, Y., Shapira, B., Rokach, L.: A multi-dimensional suppression for k-anonymity (to appear, 2009)
16. Shannon, C.E.: A mathematical theory of communication. Bell Systems Technical Journal 27, 379–423 (1948)
17. Newman, C.B.D., Merz, C.: UCI repository of machine learning databases (1998)
18. Witten, I.H., Frank, E.: Data mining: practical machine learning tools and techniques with java implementations. SIGMOD Rec. 31(1), 76–77 (2002)
19. Dietterich, T.G.: Approximate statistical tests for comparing supervised classification learning algorithms. Neural Comput. 10(7), 1895–1923 (1998)
20. Salzberg, S.L.: C4.5: Programs for Machine Learning by J. Ross Quinlan. Machine Learning 16(3), 235–240 (1994)
21. Cessie, S.L., Houwelingen, J.C.V.: Ridge estimators in logistic regression. Applied Statistics 41(1), 191–201 (1992)
22. Demšar, J.: Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res. 7, 1–30 (2006)
Valid Statistical Analysis for Logistic Regression with Multiple Sources

Stephen E. Fienberg1, Yuval Nardi1, and Aleksandra B. Slavković2

1 Carnegie Mellon University, Pittsburgh, PA 15213
{fienberg,yuval}@stat.cmu.edu
2 Pennsylvania State University, University Park, PA 16802
[email protected]
Abstract. Considerable effort has gone into understanding issues of privacy protection of individual information in single databases, and various solutions have been proposed depending on the nature of the data, the ways in which the database will be used and the precise nature of the privacy protection being offered. Once data are merged across sources, however, the nature of the problem becomes far more complex and a number of privacy issues arise for the linked individual files that go well beyond those that are considered with regard to the data within individual sources. In this paper, we propose an approach that gives full statistical analysis on the combined database without actually combining it. We focus mainly on logistic regression, but the method and tools described may be applied to other statistical models as well. Keywords: Distributed Databases, Horizontally Partitioned Data, Log-linear models, Privacy Preserving Data Mining, Secure Logistic Regression, Vertically Partitioned Data.
1 Introduction
Following the events of September 11, 2001, there has been heightened attention in the United States and elsewhere to the use of multiple government and private databases for the identification of possible perpetrators of future attacks, as well as an unprecedented expansion of federal government data mining activities, many involving databases containing personal information. There have also been claims that prospective data mining could be used to find the "signature" of terrorist cells embedded in larger networks. Fienberg [1,2] describes some proposals for the search of multiple databases which supposedly do not compromise possible pledges of confidentiality to the individuals whose data are included. One example is the concept of selective revelation associated with the now abandoned Total Information Awareness (TIA) security program. Considerable effort has gone into understanding issues of privacy protection of individual information in single databases, and various statistical solutions have been proposed depending on the nature of the data, the ways in which the database will be used and the precise nature of the privacy protection being offered. Many data mining algorithms attempt to mine multiple distributed
databases and this was the goal for TIA. For an assessment of the role of data mining in terrorism prevention and related privacy issues see the recently released report of the Committee on Technical and Privacy Dimensions of Information for Terrorism Prevention and Other National Goals at the National Research Council [3]. Once data are merged across sources, however, the nature of the confidentiality problem becomes far more complex and a number of privacy issues arise for the linked individual files that go well beyond those that are considered with regard to the data within individual sources. Mining for individual data or for identifiable groups of individuals is inherently disclosive and the privacy of those whose data are sought clearly cannot be protected! But such data mining can also compromise the data of others in the databases being searched. We have some hope of privacy protection when the goal is the production of the results of some statistical calculation, such as a regression analysis. In working with multiple databases, we can conceptualize the existence of a single combined database containing all of the information for the individuals in the separate databases and for the union of the variables. In this paper, we propose an approach that gives full statistical analysis on this combined database without actually combining information sources. We focus mainly on logistic regression, but the method and tools described may be applied to other statistical models as well. In Section 2, we briefly review the relevant privacy-preserving data mining (PPDM) and statistical disclosure limitation (SDL) literatures, state the general problem and provide an overview of binary logistic regression. Section 3 describes two protocols for secure logistic regression used when dealing with horizontally or vertically partitioned databases. We conclude by discussing our proposed protocols, privacy leakage problems and other ongoing work.
2 Background and Problem Formulation
Suppose that there are several parties collecting data separately and privately on overlapping sets of individuals or entities and involving overlapping sets of variables. We can use the designation ‘parties’ to stand for statistical or other government agencies, competing corporations, or any other organizations engaged in data collection. We assume that the parties desire to keep the information in their separate databases private and do not wish to share the databases with any other party. Each party is interested in performing some statistical analysis using the database in its possession in order to learn more about the underlying population from which it has drawn its database. Each party recognizes that the desired statistical analysis would enjoy greater statistical accuracy if it were able to carry out the relevant calculations using a hypothetical pooled (combined) database (made out of all of the parties’ databases). But data integration may lead to privacy breaches. Therefore, our goal is to establish new methods for analyzing the pooled data without actually combining the databases. Privacy-preserving data mining (PPDM) is a class of algorithms used to extract (mine) information, but at the same time, maintain privacy (see [4,5]).
Emphasis is often on the algorithms rather than full statistical analyses. Related statistical disclosure limitation (SDL) techniques aim to preserve confidentiality but in contrast to PPDM techniques also aim to provide access to useful statistical data. The idea is that statistical inference should be the same whether one is using the original complete dataset, or an output dataset resulting from the original dataset and the SDL techniques. Both the PPDM and SDL literatures have addressed problems related to partitioned databases. The technique used depends on how the database is partitioned. When the parties have exactly the same variables but for different data subjects, we call the situation (pure) horizontally partitioned data. At the other extreme, when the parties hold disjoint sets of attributes for the same data subjects we call the situation (pure) vertically partitioned data. These two pure cases have gained much attention recently. For results concerning horizontally partitioned data, see [6] (log-linear based logistic regression), [7] (adaptive regression splines), [8] (regression) and [9] (regression, data integration, contingency tables, maximum likelihood, Bayesian posterior distributions; regression for vertically partitioned data). Also see [10,11] for mining of association rules, and [12,13] for privacy-preserving support vector machine (SVM) classification, for both horizontally and vertically partitioned data. Sanil et al. [14,15] describe two different perspectives for performing linear regression on vertically partitioned databases. The work in [14] relies on quadratic optimization to solve for the regression coefficients $\hat\beta$ but it has two main problems. This method relies on the often unrealistic assumption that the party holding the response attribute is willing to share it with the other parties, and the method releases only limited diagnostic information. In [15] the authors use a form of secure matrix multiplication to calculate off-diagonal blocks of the full-data covariance matrix. An advantage of this approach is that rather complete diagnostic information can be obtained with no further loss of privacy. Analyses similar to ordinary regression (e.g., ridge regression) work in the same manner. Du and colleagues [16,17] describe similar, but less complete, approaches. This work is related to the literature on secure multi-party computation (SMC). Over the past twenty years computer scientists have developed a number of efficient algorithms to securely evaluate a function whose inputs are distributed among several parties, known as secure multi-party computation protocols [18,19]. We make repeated use of these algorithms. Specifically, we use the secure summation protocol, a secure algorithm to compute a sum without sharing distributed inputs [20], and secure matrix multiplication, a secure way to multiply two private matrices. Finally, we assume that the parties involved are semi-honest, i.e., (1) they follow the protocol and (2) they use their true data values. But parties may retain values from intermediate computations. Logistic regression is a form of multivariate regression used for the analysis of binary outcomes. It is one of the most widely used statistical methods in biomedicine, genetics, the social sciences, business and marketing. It can be used to classify and predict, in a similar fashion to linear discriminant analysis, and is closely related to neural networks and support vector machines described in
data mining and machine learning literatures. In this paper, we draw from both PPDM and SDL paradigms, and address the problem of performing a "secure" logistic regression when the design matrix is distributed among multiple sources.
2.1 Partitioned Database Types
In this section we consider horizontally and vertically partitioned databases while more involved partitioning schemes are briefly discussed in Section 4. We assume that K parties (designated by A1 , . . . , AK ) with K ≥ 2 are involved. Note, however, that the case with K = 2 is often trivial for security purposes. Horizontally partitioned data is the case in which agencies share the same fields but not the same individuals, or subjects. Assume the data consist of a design matrix X and a response vector Y , such that: ] and Y = [Y1 , Y2 , . . . , YK ] , X = [X1 , X2 , . . . , XK
(1)
where the apostrophe (′) denotes transpose. Here, Xk, for k = 1, . . . , K, is the database held privately by party Ak, and Yk is its vector of responses. We let nk denote the number of individuals that belong to party Ak, and let $N = \sum_{k=1}^{K} n_k$ be the overall sample size. Each Xk is an nk × p matrix and we assume that the first column of each Xk matrix is a column of 1's. We will refer to X and Y as the "global" predictor matrix and the "global" response vector, respectively. For horizontally partitioned databases it is assumed that all parties have the same variables, and that no parties share observations. Also, the attributes need to be in the same order. In vertically partitioned data, all parties have the same subjects, but different attributes. Assume the data look like the following:

\[ [Y, X] = [Y, X_1, \ldots, X_K], \tag{2} \]

where Xk is the matrix of a distinct number of independent variables on all N subjects, and Y is the vector of responses. We assume that Y is held by only one party, say party A1. We let pk be the number of variables for party Ak, k = 1, . . . , K. Note that each Xk is an N × pk matrix (except for X1, which is N × (1 + p1)) and we assume that the first column of the X1 matrix is a column of 1's. For vertically partitioned databases it is further assumed that all parties have the same observations, and that no parties share variables. In order to match up a vertically partitioned database, all parties must have a global identifier, such as a social security number.
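To make the two pure partitioning schemes concrete, the sketch below assembles a small synthetic global design matrix and response in each case. It is purely illustrative; the number of parties, the dimensions, and the NumPy representation are our own choices and are not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Horizontal partitioning: both parties hold the same p+1 columns
# (intercept first) for disjoint sets of subjects.
X1 = np.column_stack([np.ones(4), rng.normal(size=(4, 2))])  # party A1: n1 x (p+1)
X2 = np.column_stack([np.ones(6), rng.normal(size=(6, 2))])  # party A2: n2 x (p+1)
y1, y2 = rng.integers(0, 2, 4), rng.integers(0, 2, 6)
X_h = np.vstack([X1, X2])        # global N x (p+1) design matrix
y_h = np.concatenate([y1, y2])   # global response vector

# Vertical partitioning: both parties hold all N subjects but disjoint
# attribute blocks; A1 also holds the response and the intercept column.
N = 10
Xa = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])  # A1: N x (1 + p1)
Xb = rng.normal(size=(N, 3))                                 # A2: N x p2
y_v = rng.integers(0, 2, N)
X_v = np.hstack([Xa, Xb])        # global N x (1 + p1 + p2) design matrix
```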
2.2 Logistic Regression
Researchers and data analysts use binary logistic regression for modeling binary outcomes, for example, to predict membership in a group, e.g., does a person have a high-risk credit score or a low-risk credit score given her payment history, income, and gender?
Let Y1, . . . , YN be independent Bernoulli variables whose means πi = E(Yi) depend on some covariates $x_i \in \mathbb{R}^{p+1}$ through the relationship

\[ \mathrm{logit}(\pi_i) = \sum_{j=0}^{p} x_{ij}\beta_j = x_i'\beta, \tag{3} \]

where logit(π) = log[π/(1 − π)], and the xi's make up the N rows of the design matrix X, whose first column is unity. In logistic regression, the vector of coefficients β is of interest. Since we cannot compute the estimate of β in closed form, we traditionally use Newton-Raphson or a related iterative method (see [21]) to find a value of β that maximizes the log-likelihood:

\[ l(\beta) = \sum_{j}\Big(\sum_{i} y_i x_{ij}\Big)\beta_j - \sum_{i} n_i \log\big[1 + \exp(x_i'\beta)\big]. \tag{4} \]

Here j runs over the set of predictors, and i runs over the different "settings" of the covariates. The number of observations corresponding to setting xi is denoted by ni (note that it is different from nk in Section 2.1). In the case of continuous predictors, we have ni = 1. At each iteration of the Newton-Raphson algorithm, we calculate the new estimate of $\hat\beta$ by

\[ \hat\beta^{(s+1)} = \hat\beta^{(s)} + (X'W^{(s)}X)^{-1}X'(y - \mu^{(s)}), \tag{5} \]

where $W^{(s)} = \mathrm{diag}\big(n_i\pi_i^{(s)}(1-\pi_i^{(s)})\big)$, $\mu_i^{(s)} = n_i\pi_i^{(s)}$, and $\pi_i^{(s)}$ is the probability of a "success" for the ith observation in iteration s, i = 1, . . . , N. The algorithm stops when the estimate converges. Note that we require an initial estimate of $\hat\beta$. For the complete statistical analysis, finding the coefficients of a regression equation is not sufficient; we need to know whether the model has a reasonable fit to the data. We typically look at the residuals and fit statistics such as Pearson's χ² and the likelihood-ratio deviance statistics. Next we describe how to use secure matrix sharing techniques and apply them to the logistic regression setting over distributed databases.
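Before turning to the secure protocols, the (non-secure) update in (5) is easy to state in code for the common case of individual binary observations (n_i = 1). The sketch below is a minimal single-machine implementation on pooled data, included only for reference; it is not part of the distributed protocols that follow, and the synthetic data are our own.

```python
import numpy as np

def logistic_newton_raphson(X, y, tol=1e-8, max_iter=25):
    """Newton-Raphson (IRLS) for binary logistic regression, eq. (5) with n_i = 1."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        pi = 1.0 / (1.0 + np.exp(-X @ beta))          # fitted probabilities pi^(s)
        W = pi * (1.0 - pi)                           # diagonal of W^(s)
        XtWX = X.T @ (W[:, None] * X)                 # X' W^(s) X
        step = np.linalg.solve(XtWX, X.T @ (y - pi))  # (X' W X)^{-1} X'(y - mu)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Small synthetic example (illustrative only).
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ np.array([-0.5, 1.0, 2.0]))))
print(logistic_newton_raphson(X, y))
```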
3 Secure Logistic Regression
Fienberg et al. [6] described “secure” logistic regression for horizontally partitioned databases when all variables are categorical. They discuss the advantages of the log-linear approach versus the regression approach in the fully categorical case where the minimal sufficient statistics are marginal totals and logistic regression is effectively equivalent to log-linear model analysis (e.g., see [21,22]). Here we focus on binary logistic regression in the case of horizontally and vertically partitioned databases but with quantitative covariates using secure multi-party computation. We draw from [6] for the horizontal case presented here and suggest necessary modifications. We continue to work on the problem of vertically partitioned data in the fully categorical data setting but do not report on any results here.
3.1 Logistic Regression over Horizontally Partitioned Data
We now turn to a general approach for logistic regression for a horizontally partitioned database using ideas from secure regression (e.g., see [8]). In ordinary linear regression, the estimate of the vector of coefficients is

\[ \hat\beta = (X'X)^{-1}X'y. \tag{6} \]
To find the global $\hat\beta$ vector, party Ak calculates its own Xk'Xk and Xk'yk matrices. The sums of these respective matrices are the global X'X and X'y matrices. Since direct sharing of these matrices results in full disclosure, the parties need to employ some other method, such as secure summation, to preserve privacy. In this secure summation process, the first party adds a random matrix to its data matrix. The remaining agencies add their raw data to the updated matrix until, in the last step, the first party subtracts its added random values and shares the global matrices. Next we apply the secure summation approach to the logistic regression analysis, and implement the secure Newton-Raphson algorithm. We can choose an initial estimate for the Newton-Raphson procedure in two ways: (i) the parties can discuss and share an initial estimate of the coefficients, or (ii) we can compute initial estimates using ordinary linear regression of the responses and predictors using secure regression computations. In order to update β, we need the parts shown in (5). We can break the last term on the right-hand side into two parts: the $(X'W^{(s)}X)^{-1}$ matrix and the $X'(y - \mu^{(s)})$ matrix. At each iteration of Newton-Raphson, we update the π vector, and thus update the W matrix and the vector µ. It follows that

\[ X'W^{(s)}X = X_1'W_1^{(s)}X_1 + X_2'W_2^{(s)}X_2 + \cdots + X_K'W_K^{(s)}X_K, \tag{7} \]

\[ X'(y - \mu^{(s)}) = X_1'(y_1 - \mu_1^{(s)}) + X_2'(y_2 - \mu_2^{(s)}) + \cdots + X_K'(y_K - \mu_K^{(s)}), \tag{8} \]

where $\mu_k^{(s)}$ is the vector of $(n_k)_l(\hat\pi_k^{(s)})_l$ values and $W_k^{(s)} = \mathrm{diag}\big((n_k)_l(\hat\pi_k^{(s)})_l(1 - (\hat\pi_k^{(s)})_l)\big)$ for party k, k = 1, . . . , K, l = 1, . . . , nk, and for iteration s. Note, however, that since we are dealing here with only continuous explanatory variables, nk = (nk)l. Then for each iteration of Newton-Raphson, we find the new estimate of β by using secure summation. One major drawback of this method is that we have to perform secure matrix sharing for every iteration of the algorithm; every time it runs, we also have to share the old $\hat\beta$ vector with all of the parties so they may calculate their individual pieces. When all variables are categorical, this method involves more computation than using the log-linear model approach to logistic regression, where only the relevant marginal totals must be shared (once) among the parties. In the more general setting, we also have no simple way to check on potential disclosure of individual-level data and thus we are providing security only for the parties and not necessarily for the individuals in their databases; e.g., see the discussion in [8] for the linear regression secure computation problem.
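The sketch below simulates one pass of this update for horizontally partitioned data: each party forms its local pieces X_k'W_k^(s)X_k and X_k'(y_k − µ_k^(s)), and the sums in (7) and (8) are combined with a toy secure-summation routine in which the initiating party masks its contribution with a random matrix. All parties live in a single process here, so this is only an illustration of the data flow under our own synthetic data, not a hardened implementation of the protocol.

```python
import numpy as np

def secure_sum(parts, rng):
    """Toy secure summation: the first party adds a random mask, the others add
    their raw values, and the first party removes the mask before sharing the
    total. Simulated within one process purely for illustration."""
    mask = rng.normal(size=parts[0].shape) * 1e3
    running = parts[0] + mask
    for part in parts[1:]:
        running = running + part
    return running - mask

def secure_horizontal_update(X_parts, y_parts, beta, rng):
    """One Newton-Raphson step using the per-party pieces of eqs. (7)-(8)."""
    A_parts, b_parts = [], []
    for Xk, yk in zip(X_parts, y_parts):
        pik = 1.0 / (1.0 + np.exp(-Xk @ beta))
        Wk = pik * (1.0 - pik)
        A_parts.append(Xk.T @ (Wk[:, None] * Xk))  # X_k' W_k^(s) X_k
        b_parts.append(Xk.T @ (yk - pik))          # X_k' (y_k - mu_k^(s))
    A = secure_sum(A_parts, rng)                   # global X' W^(s) X
    b = secure_sum(b_parts, rng)                   # global X' (y - mu^(s))
    return beta + np.linalg.solve(A, b)

# Illustrative run with two parties and synthetic data.
rng = np.random.default_rng(2)
X_parts = [np.column_stack([np.ones(n), rng.normal(size=(n, 2))]) for n in (60, 80)]
y_parts = [rng.integers(0, 2, Xk.shape[0]) for Xk in X_parts]
beta = np.zeros(3)
for _ in range(15):
    beta = secure_horizontal_update(X_parts, y_parts, beta, rng)
print(beta)
```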
Diagnostics. One way to assess fit is to use various forms of model diagnostics such as residuals, but this potentially increases the risk of disclosure. As Fienberg et al. [6] proposed in their log-linear model approach, we can compare the log-likelihood functions of the larger model and the more parsimonious model. We can rewrite the log-likelihood equation from (4) in terms of the K parties and use secure summation to find its value,

\[ \sum_{k=1}^{K}\sum_{j=1}^{n_k}\big\{(y_k)_j\log((\pi_k)_j) + (1-(y_k)_j)\log(1-(\pi_k)_j)\big\}, \tag{9} \]

as well as the Pearson χ² or likelihood-ratio deviance statistics:

\[ X^2 = \sum_{k=1}^{K}\sum_{j=1}^{n_k}\frac{\big[(y_k)_j - (n_k)_j(\pi_k)_j\big]^2}{(n_k)_j(\pi_k)_j\big(1-(\pi_k)_j\big)} \tag{10} \]

and

\[ G^2 = 2\sum_{k=1}^{K}\sum_{j=1}^{n_k}\left[(y_k)_j\log\frac{(y_k)_j}{(\hat\mu_k)_j} + \big((n_k)_j-(y_k)_j\big)\log\frac{(n_k)_j-(y_k)_j}{(n_k)_j-(\hat\mu_k)_j}\right]. \tag{11} \]

If the change in the likelihood is large with respect to a chi-square statistic with the appropriate degrees of freedom, we can reject the null hypothesis and conclude that the simpler model does not provide an adequate fit to the data.
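Because each of the sums in (9)-(11) decomposes into per-party contributions, the fit statistics can be pooled with the same secure-summation idea used for the coefficient updates. The sketch below shows the local pieces a party would contribute, assuming individual binary observations ((n_k)_j = 1) and using a small epsilon to handle the 0 log 0 convention; it is an illustration, not the authors' implementation.

```python
import numpy as np

def local_fit_pieces(Xk, yk, beta, eps=1e-12):
    """Party k's contributions to (9)-(11) for binary observations, (n_k)_j = 1."""
    pik = 1.0 / (1.0 + np.exp(-Xk @ beta))
    loglik = np.sum(yk * np.log(pik + eps) + (1 - yk) * np.log(1 - pik + eps))
    pearson = np.sum((yk - pik) ** 2 / (pik * (1 - pik) + eps))
    deviance = 2 * np.sum(yk * np.log((yk + eps) / (pik + eps))
                          + (1 - yk) * np.log((1 - yk + eps) / (1 - pik + eps)))
    return np.array([loglik, pearson, deviance])

# The pooled statistics are simply the (secure) sums of these vectors over the
# K parties; only the three totals, not the individual records, are revealed.
```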
3.2 Logistic Regression over Vertically Partitioned Data
For vertically partitioned data held by K parties (A1, . . . , AK), we have X = [X1, X2, . . . , XK], where each Xk is an N × pk matrix, except for X1, which has 1 + p1 columns (one for the intercept). The parameter β has a similar block structure. Thus we can rewrite equation (3) as

\[ \mathrm{logit}(\pi_i) = \sum_{k=1}^{K} x_{i,k}'\beta_k, \tag{12} \]

where x_{i,k} are the measurements of record i restricted to the set of variables held by party Ak, and β = (β1, . . . , βK). This additivity across parties is crucial. Indeed, virtually all of the work noted in Section 3.1 for horizontally partitioned data depends on "anonymous" sharing of analysis-specific sufficient statistics that add over the parties. We can now write the log-likelihood function, up to an additive constant, as

\[ l(\beta) = \sum_{k=1}^{K} y'X_k\beta_k - \sum_{i=1}^{N} \log\Big[1 + \exp\Big(\sum_{k=1}^{K} x_{i,k}'\beta_k\Big)\Big]. \tag{13} \]

We must obtain the maximum likelihood estimator $\hat\beta$ of β through an iterative procedure like before. We show below how to implement a secure Newton-Raphson algorithm to find roots of the likelihood equations for the vertically
partitioned case. Karr et al. [9] describe a similar approach to numerical maximization of likelihood functions for horizontally partitioned data. To estimate β = (β1, . . . , βK), at each iteration of the Newton-Raphson algorithm, we calculate the new estimate of $\hat\beta$ by

\[ \hat\beta^{(s+1)} = \hat\beta^{(s)} + (X'W^{(s)}X)^{-1}X'y^{(s)}, \tag{14} \]

where $W^{(s)} = \mathrm{diag}\big(\pi_i^{(s)}(1-\pi_i^{(s)})\big)$, $\pi_i^{(s)} = (1+\exp\{-x_i'\beta^{(s)}\})^{-1}$, and $y^{(s)} = y - \pi^{(s)}$ (see (5)). In both horizontal and vertical settings thus far we have assumed that the parties are semi-honest, i.e., they follow the protocol but may retain intermediate values, and use them to gain more information. We now outline an additional set of assumptions required for the vertical partitioning scheme. We assume that (a) some ordering of subjects has been performed, (b) party A1 holds the response variable y, and is not willing to share it (as usually happens in practice), (c) parties are not willing to share intermediate values of their components of $\hat\beta$, except for the convergent estimated parameter (see below for a possible security breach if they are willing), and (d) parties are willing to share some 'summary' statistics (see below). Note that a protocol is not private according to privacy by simulation (see, e.g., [23], [24]) if a participating party may learn more information using any intermediate values than it could have learned based on its input and output only! A way around this problem is usually achieved by decrypting the intermediate values in such a way that parties learn only random shares of the values; a random share of an output O is a set of random outputs O1, . . . , OK, such that Oj is kept hidden from the other parties, and such that O = O1 + . . . + OK. This idea relies on a result, coming from the secure multi-party computation community, that there exists a secure protocol for any probabilistic polynomial-time functional (see [25], [26]). These generic protocols are computationally inefficient unless the size of the problem is relatively small. We are currently working on designing specific protocols for our problem.

We now describe a protocol that parties can follow to update $\hat\beta$, using equation (14), in a secure way. We define $I^{(s)} = X'y^{(s)}$ and $II^{(s)} = X'W^{(s)}X$. Then

– Party Ak picks an initial value $\beta_k^{(0)}$, k = 1, . . . , K.
– Parties obtain $\pi_i^{(s)}$ by applying a K-party secure summation to $x_i'\beta^{(s)} = \sum_{k=1}^{K} x_{i,k}'\beta_k^{(s)}$.
– Write $I^{(s)} = (I_1', \ldots, I_K')'$. Party Ak, for every k = 2, . . . , K, obtains $I_k = X_k'y^{(s)}$ by applying secure inner product to $X_k'y$ (note that this is done only once). The calculation of $I_1$ requires no secure protocol since the response y is assumed to be held by party A1. At the end of this step each party Ak holds $I_k$ privately. The interactions are pairwise between party A1 and party Ak, for k = 2, . . . , K.
– Parties apply secure matrix product to off-diagonal sub-matrices of $II^{(s)}$. Diagonal sub-matrices, $X_k'W^{(s)}X_k$, are computed locally by each party, and they then share the results with the other parties (this sharing has to be done in every iteration). At the end of this step each party gets to learn $II^{(s)}$.
– Each party can now invert $II^{(s)}$. Suppose

\[ (II^{(s)})^{-1} = \begin{pmatrix} A_{11}^{(s)} & A_{12}^{(s)} & \cdots & A_{1K}^{(s)} \\ A_{21}^{(s)} & A_{22}^{(s)} & \cdots & A_{2K}^{(s)} \\ \vdots & \vdots & & \vdots \\ A_{K1}^{(s)} & A_{K2}^{(s)} & \cdots & A_{KK}^{(s)} \end{pmatrix}, \]

for suitable matrices $A_{jk}^{(s)}$. Then, to work with equation (14), party Aj obtains

\[ \sum_{k=1}^{K} A_{jk}^{(s)} I_k \]

using secure summation, by initiating the protocol and never sharing the result.
– Each party updates its own share of the estimated parameter, $\hat\beta$.

A Possible Privacy Breach. Trying to relax the assumption that the agencies are unwilling to share intermediate values of their components of $\hat\beta$ may lead to serious privacy breaches. To see this, suppose now that the agencies are willing to share the intermediate values of $\hat\beta$. In order to update $\hat\beta$ we compute $x_i'\hat\beta^{(s)}$ for each i = 1, . . . , n. Suppressing the dependence on i, we write $x = x_i = (x_1, x_2, \ldots, x_K) \in \mathbb{R}^p$. The computation of $x'\hat\beta^{(s)}$ involves the secure sum and, given any fixed s, reveals nothing but the sum. However, enough iterations may lead to full privacy leakage. To see this, suppose that we iterate p times. Every agency gets to learn $\hat\beta^{(s)}$ and $\hat a^{(s)} := x'\hat\beta^{(s)}$, for s = 1, . . . , p. Therefore, they can form the following system of linear equations:

\[ x'\hat\beta = \hat a', \tag{15} \]

where $\hat\beta = [\hat\beta^{(1)}, \ldots, \hat\beta^{(p)}]$ is a p × p matrix and $\hat a = (\hat a^{(1)}, \ldots, \hat a^{(p)})$. If, and this is the key, $\{\hat\beta^{(1)}, \ldots, \hat\beta^{(p)}\}$ are linearly independent, one may solve (15):

\[ x = (\hat\beta^{-1})'\hat a. \]

Therefore, one iteration may not reveal private information, but enough iterations may. Similar concerns may appear in the calculation of $X'y^{(s)}$.
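The linear algebra behind this breach is easy to demonstrate: an observer who sees the secure sums x'β̂^(s) for p linearly independent iterates can recover the private record x exactly by solving (15). The sketch below simulates this with random synthetic iterates; it is illustrative only.

```python
import numpy as np

rng = np.random.default_rng(3)
p = 5
x_private = rng.normal(size=p)        # the record another party would like to hide

# Columns are the shared iterates beta_hat^(1), ..., beta_hat^(p); random
# vectors are linearly independent with probability 1, the key condition in (15).
B = rng.normal(size=(p, p))
a = B.T @ x_private                   # observed secure sums a^(s) = x' beta_hat^(s)

# The observer solves the p x p system B' x = a and recovers x exactly.
x_recovered = np.linalg.solve(B.T, a)
print(np.allclose(x_recovered, x_private))  # True
```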
4 Discussion and Future Directions
What are the disclosure risks with respect to distributed databases? In this setting, one goal is to perform the analysis on the unaltered data, by anonymously
sharing sufficient statistics rather than the actual data. To perform secure logistic regression with continuous predictors in the vertical case, however, one requires unique record identifiers common to all the databases. Such identifiers alone do not constitute identity disclosures, because if one shares them they do not necessarily link to associated attribute values. Nonetheless, the parties must be willing to share some intermediate estimates of the components of regression coefficients which may unintentionally reveal some identifying information (see Section 3.2). Secure logistic regression in the vertical case also poses attribute disclosure risks: if the analysis reveals that attributes held by party A predict those held by party B, then A gains knowledge of attributes held by B. This is equally true even for linear regression on pure vertically partitioned data, e.g., see [14]. For the horizontal case, there is no simple way of checking for individual disclosure risk (see Section 3.1). In the full categorical horizontal case, with the "secure" log-linear approach to logistic regression [6], the parties must only perform one round of secure summation to compute the relevant sufficient statistics. The "secure" logistic regression protocol is thus computationally more intensive than the log-linear method since the parties must perform a secure summation for each Newton-Raphson iteration. In the full quantitative horizontal case we cannot apply the log-linear approach. A preliminary analysis indicates that the "secure" Newton-Raphson protocol for logistic regression will have a different computational performance given the two partition types. The total computation time of the vertical case is strongly dependent on the number of parties. In contrast to the horizontal case, the vertical case must use secure matrix products to compute the off-diagonal block elements of the covariance matrix. The secure matrix product protocol requires a QR decomposition to mitigate leakage. This is a fairly expensive calculation, and we expect the total computation time for the vertically partitioned data set to increase roughly as $O(N^2)$. We are currently exploring the efficiency of our protocol for both the horizontal and vertical case. The "secure" Newton-Raphson implementation needs further investigation. Each new iteration of the algorithm may present a leakage situation since secure matrix operations are not disclosure-free; e.g., two parties may each relinquish information to the other such as vectors that are orthogonal to their respective databases [15].
Table 1. Schematic representation of vertically partitioned, partially overlapping data. Party A records values of x1, x2 for observations 1–2000 and 3001–4000; Party B records values of x3, x4, x5 for observations 1–1000 and 2001–3000; and Party C records values of x6, x7 for observations 1001–3000.
n               Party A       Party B             Party C
                X1     X2     X3     X4     X5     X6     X7
1 ... 1000      √      √      √      √      √      •      •
1001 ... 2000   √      √      •      •      •      √      √
2001 ... 3000   •      •      √      √      √      √      √
3001 ... 4000   √      √      •      •      •      •      •
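The structure in Table 1 can be mirrored directly in code: each party's attribute block is observed only for the rows that party holds, so the assembled global matrix has missing blocks, which is what motivates the missing-data treatment discussed below. The sketch below builds such a matrix for the layout in Table 1, using NaN for unobserved cells; the random values are of course only placeholders.

```python
import numpy as np

N = 4000
X = np.full((N, 7), np.nan)  # global matrix for x1..x7 with missing blocks
rng = np.random.default_rng(4)

def fill(rows, cols):
    X[np.ix_(rows, cols)] = rng.normal(size=(len(rows), len(cols)))

rows_A = list(range(0, 2000)) + list(range(3000, 4000))  # Party A holds x1, x2
rows_B = list(range(0, 1000)) + list(range(2000, 3000))  # Party B holds x3, x4, x5
rows_C = list(range(1000, 3000))                         # Party C holds x6, x7
fill(rows_A, [0, 1])
fill(rows_B, [2, 3, 4])
fill(rows_C, [5, 6])

print(np.isnan(X).mean())  # fraction of cells missing in the assembled database
```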
Looking at Table 1, one is naturally led to think that the vertical partitioning, partially overlapping case may be cast as a missing data problem. We are currently working on developing methods for logistic regression on vertically partitioned, partially overlapping data. In particular, we consider two cleanly identifiable cases, involving solely continuous predictors or solely categorical ones. For the continuous case we assume that X follows a Gaussian mixture model (GMM), whereas for the categorical case, we assume a Multinomial mixture model (MMM). We can apply well known approaches for dealing with missing values such as the EM algorithm. In particular, we develop a “secure” (double) EM algorithm where the “double” here stands for the two types of incompleteness; the first has to do with the mixture parameters and the GMM (or MMM) parameters, while the second captures the actual missing covariates. We can obtain parameter estimates by applying the “secure” (double) EM algorithm in conjunction with secure multi-party protocols.
5 Conclusion
There are many scientific and business settings which require statistical analyses that “integrate” data stored in multiple distributed databases. Unfortunately, barriers exist that prevent simple integration of the databases. In many cases, the owners of the distributed databases are bound by confidentiality to their data subjects, and cannot allow database access to outsiders. We have outlined an approach to carry out “valid” statistical analysis for logistic regression with quantitative covariates on both horizontally and vertically partitioned databases that does not require actually integrating the data. This allows parties to perform analyses on the global database while preventing exposure of details that are beyond those used in the joint computation. We are currently developing a log-linear model approach for strictly vertically partitioned databases and a more general secure logistic regression for problems involving partially overlapping databases with measurement error.
Acknowledgments. The research reported here was supported in part by NSF grants EIA9876619 and IIS0131884 to the National Institute of Statistical Sciences, by NSF Grant SES-0532407 to the Department of Statistics, Penn State University, and by Army contract DAAD19-02-1-3-0389 to CyLab at Carnegie Mellon University.
References
1. Fienberg, S.: Privacy and confidentiality in an e-commerce world: Data mining, data warehousing, matching and disclosure limitation. Statistical Science 21, 143–154 (2006)
2. Fienberg, S.: Data mining, privacy, disclosure limitation, and the hunt for terrorists. In: Chen, H., Reid, E., Sinai, J., Silke, A., Ganor, B. (eds.) Terrorism Informatics: Knowledge Management and Data Mining for Homeland Security. Springer, New York (2008)
3. Committee on Technical and Privacy Dimensions of Information for Terrorism Prevention and Other National Goals: Protecting Individual Privacy in the Struggle Against Terrorists: A Framework for Assessment. National Academy Press, Washington (2008)
4. Agrawal, R., Srikant, R.: Privacy preserving data mining. In: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, Dallas (2000)
5. Clifton, C., Vaidya, J., Zhu, M.: Privacy Preserving Data Mining. Springer, New York (2006)
6. Fienberg, S., Fulp, W., Slavkovic, A., Wrobel, T.: "Secure" log-linear and logistic regression analysis of distributed databases. In: Domingo-Ferrer, J., Franconi, L. (eds.) PSD 2006. LNCS, vol. 4302, pp. 277–290. Springer, Heidelberg (2006)
7. Ghosh, J., Reiter, J., Karr, A.: Secure computation with horizontally partitioned data using adaptive regression splines. Computational Statistics and Data Analysis (2006) (to appear)
8. Karr, A., Lin, X., Reiter, J., Sanil, A.: Secure regression on distributed databases. Journal of Computational and Graphical Statistics 14(2), 263–279 (2005)
9. Karr, A., Fulp, W., Lin, X., Reiter, J., Vera, F., Young, S.: Secure, privacy-preserving analysis of distributed databases. Technometrics (2007) (to appear)
10. Kantarcioglu, M., Clifton, C.: Privacy preserving data mining of association rules on horizontally partitioned data. Transactions on Knowledge and Data Engineering 16, 1026–1037 (2004)
11. Vaidya, J., Clifton, C.: Privacy preserving association rule mining in vertically partitioned data. In: Proc. of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, Alberta, Canada (2002)
12. Yu, H., Jiang, X., Vaidya, J.: Privacy preserving svm using nonlinear kernels in horizontally partitioned data. In: Proc. of ACM SAC Conference Data Mining Track (2006)
13. Yu, H., Vaidya, J., Jiang, X.: Privacy-preserving svm classification on vertically partitioned data. In: Ng, W.-K., Kitsuregawa, M., Li, J., Chang, K. (eds.) PAKDD 2006. LNCS (LNAI), vol. 3918, pp. 647–656. Springer, Heidelberg (2006)
14. Sanil, A., Karr, A., Lin, X., Reiter, J.: Privacy preserving regression modelling via distributed computation. In: Proc. Tenth ACM SIGKDD Internat. Conf. on Knowledge Discovery and Data Mining, pp. 677–682 (2004)
15. Sanil, A., Karr, A., Lin, X., Reiter, J.: Privacy preserving analysis of vertically partitioned data using secure matrix products. Journal of Official Statistics (2007); Revised manuscript under review (2007)
16. Du, W., Zhan, Z.: A practical approach to solve secure multi-party computation problems. In: New Security Paradigms Workshop, pp. 127–135. ACM Press, New York (2002)
17. Du, W., Han, Y., Chen, S.: Privacy-preserving multivariate statistical analysis: Linear regression and classification. In: Proceedings of the 4th SIAM International Conference on Data Mining, pp. 222–233 (2004)
18. Goldwasser, S.: Multi-party computations: Past and present. In: Proceedings of the 16th Annual ACM Symposium on Principles of Distributed Computing, pp. 1–6. ACM Press, New York (1997)
19. Yao, A.: Protocols for secure computations. In: Proceedings of the 23rd Annual IEEE Symposium on Foundations of Computer Science, pp. 160–164. ACM Press, New York (1982)
20. Benaloh, J.: Secret sharing homomorphisms: Keeping shares of a secret secret. In: Odlyzko, A.M. (ed.) CRYPTO 1986. LNCS, vol. 263, pp. 251–260. Springer, Heidelberg (1987)
21. Agresti, A.: Categorical Data Analysis, 2nd edn. Wiley, New York (2002)
22. Bishop, Y., Fienberg, S., Holland, P.: Discrete Multivariate Analysis: Theory and Practice. MIT Press, Cambridge (1975); Reprinted by Springer (2007)
23. Lindell, Y., Pinkas, B.: Privacy preserving data mining. J. Cryptology 15(3), 177–206 (2002)
24. Lindell, Y., Pinkas, B.: Secure multiparty computation for privacy-preserving data mining. Journal of Privacy and Confidentiality (2008) (to appear)
25. Yao, A.C.: How to generate and exchange secrets. In: Proceedings of the 27th Symposium on Foundations of Computer Science (FOCS), pp. 162–167. IEEE, Los Alamitos (1986)
26. Goldreich, O., Micali, S., Wigderson, A.: How to play any mental game - a completeness theorem for protocols with honest majority. In: Proceedings of the 19th Annual Symposium on the Theory of Computing (STOC), pp. 218–229. ACM, New York (1987)
27. Reiter, J., Karr, A., Kohnen, C., Lin, X., Sanil, A.: Secure regression for vertically partitioned, partially overlapping data. In: Proceedings of the American Statistical Association (2004)
28. Fienberg, S., Karr, A., Nardi, Y., Slavkovic, A.: Secure logistic regression with distributed databases. In: Proceedings of the 56th Session of the ISI, The Bulletin of the International Statistical Institute (2007)
Suspicious Activity Reporting (SAR)

Joan T. McNamara

Los Angeles Police Department, 150 N. Los Angeles Street, Los Angeles, CA 90012
[email protected]
http://www.lapdonline.org/
Abstract. In August of 2007, the Los Angeles Police Department pioneered a Suspicious Activity Report (SAR) program that enabled local, state and federal law enforcement agencies to, for the first time, gather and share information about suspicious activities with a possible nexus to terrorism. The SAR program established an information platform at the local level that previously didn’t exist and had the potential to connect many of the country’s police departments, thus shifting local law enforcement’s approach to terrorism from a reactive to a preventative model. It also essentially flipped the age-old paradigm in which information was pushed from the federal to the local level. Now local police departments are valuable players in the information sharing process and are increasingly relied on to provide their federal partners with an accurate picture of what is happening at the local level. Keywords: Suspicious Activity Report (SAR), Institutionalization, Standardization, Measurement, Paradigm Shift.
1 Introduction The role of local police in counter-terrorism efforts has steadily increased in the past seven years. Front-line officers, with their intimate knowledge of their communities and their keen observational skills, have traditionally been thought of as first responders. That perception has changed with the 9/11 terrorist attacks. Policymakers, law enforcement executives and others increasingly called for police to be redefined as "first preventers" of terrorism and the emphasis at the local level shifted from response to prevention. Still, a critical gap existed in the information-sharing cycle. Local police were now being viewed through the lens of national security but were still only receiving crumbs of information from the intelligence table. The Los Angeles Police Department (LAPD) recognized that this needed to change and developed the Suspicious Activity Report (SAR) program in 2007. The SAR program closed the information-sharing gap and adapted the strengths and systems of local police to the threat of terrorism. Rather than relying solely on their federal partners for information, police were now able to paint their own rich picture of what was happening "on the ground" in their communities and decipher the emerging patterns
and linkages that require further attention. In a broader sense, the program created an information platform at the local level where none existed before. This platform could then be adopted by police departments nationwide, simply by modifying the ways in which they recognize and report suspicious activities. What made the SAR program unique is that it focused on behaviors and activities, rather than individuals, as potential links to terrorism. This approach was taken in order to ensure that citizens’ civil and privacy rights were protected. The 63 behaviors/activities that were selected as suspicious have historically been linked to pre-operational planning and preparation for terrorist attacks. They include: taking photos, measurements, or drawing diagrams; abandoning suspicious packages or vehicles; and testing security measures. These behaviors and activities are referred to by law enforcement as Modus Operandi (MO), a Latin term that, loosely translated, means mode of operation and describes how a criminal commits his or her crimes.
Fig. 1. Behaviors and activities were identified that had a clear, historical link to terrorist activities
By modifying an existing system based on MO codes, police created a new intake path for this type of information. This apparently small change proved revolutionary in terms of directing investigations and prevention efforts, holding management accountable and ensuring that the right clues were passed to the right people. The SAR program also helped to institutionalize counter-terrorism activities throughout the LAPD. Front-line officers received training on behaviors and activities with a possible nexus to terrorism. The underlying premise of SARs is very simple: An officer’s observation and reporting of just one of these events could be the vital “nugget” of information needed to focus attention in the right place, or to connect seemingly unrelated dots and predict or prevent a terrorist act.
2 New World Order The attacks on the World Trade Center and other U.S. cities on September 11, 2001, irreversibly changed the belief of Americans that terrorism was a distant problem that affected other people in other lands. The 9/11 Commission Report concluded that numerous indicators of an impending attack on American soil had been overlooked and ignored, and information-sharing between the nation’s various law enforcement and intelligence agencies was virtually non-existent. Since then, local and state law enforcers and their counterparts in the federal government have sought answers to a series of questions: Could the most devastating terrorist attack in American history have been predicted and prevented? What should have been done differently? Could a better information-sharing system, one that promoted open and standardized communication, have made a critical difference? After the attacks, government officials assured the public that the nation’s more than 800,000 state and local law enforcers were providing a cloak of security. But, the reality was that while there was widespread recognition that these officers had the potential to be the first line of defense in protecting the home front from acts of terrorism, no action was taken at the time to truly engage these officers, creating a gap in the information-sharing cycle. Tasking local law enforcement with the policing of traditional crime and the prevention of terror attacks in their local jurisdictions constituted a dramatic paradigm shift – both for the federal government and for the local and state agencies themselves. If this shift in established thought and practice were to be successful, it would require law enforcement agencies nationwide to adopt universal guidelines for effective communication and information-sharing. This was far easier said than done. There was no system in place at any level to facilitate this crucial and necessary exchange. The SAR program was the Los Angeles Police Department’s simple and elegant answer to this problem and serves as a national solution for the American law enforcement community.
3 The Los Angeles Landscape The City of Los Angeles, which spans more than 460 square miles, is home to many critical infrastructure sites and locations with national iconic or monumental significance. This has not gone unnoticed by terrorists, who in the past have targeted the Los Angeles International Airport and the Library Tower. After exhaustive research and study, the LAPD determined that it needed to create more detailed descriptions of places, classified as “premise” and “target” locations, which could be at risk of a terrorist attack. For instance, identifying that a crime, such as vandalism or theft, occurred on a train may be sufficient for basic crime statistics and analysis. But, for counter-terrorism purposes, it became crucial to identify whether the train in question was a freight train, passenger train, subway, or light rail. To create these distinctions, the LAPD developed 74 additional location descriptors in addition to the standard ones, such as sewage facilities, pipelines and others – which included sites of cultural or tactical significance and public infrastructure and mass-gathering locations.
Fig. 2. A more specific list of premises was developed for a more detailed analysis
The key to the SAR program is that the 63 behaviors and activities, and the 74 new premise/target locations are “coded” using the same type of numbering system the Department employed to track and analyze crime. By creating and assigning numbers, or codes, to the behaviors and premises, terrorist activities could be tracked by date, time and location, just as other crimes are currently tracked. This simple coding tool created a common language that mirrors the Uniform Crime Reporting system and enables data to be shared vertically and horizontally among local, state and federal agencies. With the wealth of new information, the LAPD is able to track various types of suspicious behaviors and activities, pinpoint where they are happening, and present that information in various forms such as maps and graphs. Furthermore, it also provides the ability to eliminate information that does not have a direct nexus to terrorism, but may have a correlation to crime, and funnel it properly to the correct investigative entity. The Department could now put a focus on patterns of activity that warranted further attention within Los Angeles, as well as in other cities across the nation. This has led to more informed discussions with all law enforcement partners, particularly those at the federal level.
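Conceptually, the coding scheme turns each report into a small, queryable record keyed by behavior code, premise code, date, and location, so that the same analysis used for conventional crime codes applies. The sketch below illustrates the idea with entirely hypothetical code values and reports; the actual LAPD code tables are not given in this paper.

```python
from collections import Counter
from datetime import datetime

# Hypothetical code tables and reports, purely for illustration; the real
# behavior and premise code values are not specified in the paper.
BEHAVIOR = {101: "photographing infrastructure", 102: "testing security measures"}
PREMISE = {201: "passenger rail", 202: "water treatment facility"}

reports = [
    {"date": datetime(2008, 3, 2, 14, 30), "behavior": 101, "premise": 201, "area": "Central"},
    {"date": datetime(2008, 3, 9, 22, 10), "behavior": 102, "premise": 201, "area": "Central"},
    {"date": datetime(2008, 4, 1, 9, 45),  "behavior": 101, "premise": 202, "area": "Harbor"},
]

# Query: which suspicious behaviors were reported at passenger-rail premises in March 2008?
march = [r for r in reports
         if r["premise"] == 201 and r["date"].year == 2008 and r["date"].month == 3]
print(Counter(BEHAVIOR[r["behavior"]] for r in march))
```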
4 Implementation On the local level, the SAR program has engaged the LAPD’s 13,000 sworn and civilian personnel in the fight against terrorism and has institutionalized counterterrorism efforts throughout the Department. On the national level, the program has paved the way for a national implementation plan as other departments have taken the LAPD blueprint and adopted it. A national SAR roll-out was preceded by the implementation of the program in the LAPD. Research and development was essentially completed by the end of 2007. The
executive leadership that LAPD Chief William J. Bratton provided paved the way for the nation. He signed off on the project shortly thereafter. This led the way for SAR implementation throughout the Department. The goal was to introduce SAR in phases, beginning with the creation of a Department-wide policy mandating that front-line officers complete a SAR when they become aware of information with a possible nexus to terrorism. Special Order No. 11, which explained SAR, was created and distributed throughout the LAPD in March 2008. This was a critical step in mandating that suspicious activity reporting happened and ensured that the process was institutionalized. The Special Order was accompanied by a comprehensive training curriculum that instructed officers on the policy and the scope of their new responsibilities. An e-learning course introduced officers to SAR, provided them with a general overview of terrorism, and detailed the behaviors and activities that they were responsible for reporting. Police officers were also given notebook dividers that provided them with a quick reference to those 63 suspicious behaviors. This, in turn, was followed by both in-service training and roll call training, which helped ensure standardization of the process.
DEPARTMENT TRAINING: AWARENESS IS THE KEY
• Web-based training (e-learning)
• Roll Call Training & Divisional Training Days
• In-Service Schools: Watch Commander, Supervisor, Detective, FTO School, Academy Recruits
• Terrorism Liaison Officer (TLO) Program
• Notebook Divider and Officers' Resource Guide

Fig. 3. A comprehensive training program was initiated to institutionalize counter-terrorism activities throughout the Department
This training program provides a seamless and universal basis of awareness of SAR that can be provided not only to all ranks of law enforcement, but to employees of other public service agencies, community members, and both public and private stakeholders. The Department has developed a community outreach program: iWATCH, iREPORT, i KEEP US SAFE. The iWATCH program includes training, posters, brochures, public service announcements and a website on which the public across the country can report suspicious activities, link to other state and federal agencies, and get additional information on terrorism. The website can be found at www.iWATCHLA.org.
5 Reporting The LAPD’s SAR policy requires officers to report suspicious behavior in a standardized way, follow strict guidelines due to the sensitivity of the information, and seek guidance from trained counter-terrorism investigators when necessary. They do this using the Department’s existing crime report, which now has three SAR-related modifications.
Fig. 4. Basic modifications to an existing Department form were introduced for ease of use
In terms of process, the SAR policy requires that on-call LAPD counter-terrorism personnel be immediately notified in the event the information or an incident is significant in nature, results in an arrest, or needs immediate follow-up investigation. This process rapidly engages trained and experienced investigators who can provide guidance or take action on real-time information. In recognition of the vital and sensitive nature of SARs, the LAPD also designed a processing policy to ensure the expeditious delivery of the information to the appropriate investigative entities. From the inception of the SAR program, the LAPD leadership stressed the importance of protecting civil liberties and an individual’s Constitutional rights. Throughout the SAR development phase, the LAPD consulted with the Los Angeles City Attorney’s Office to ensure that these rights were protected. Specific roll call training emphasizes that officers must abide by an individual’s right to be free from unreasonable search and seizure, and reasonable suspicion must be met in order to detain someone. In the absence of reasonable suspicion, the officer can initiate a consensual encounter to investigate, but a person is free to leave if he or she chooses. Completion or investigation of a SAR gives an officer no additional or “special” powers. From the start, the focus of the SAR program has been solely based on behavior. A SAR is never taken based on a person’s ethnicity, race, national origin, gender or religion.
6 Local and National Implications Completion of a SAR by front-line officers, detectives and specialized units ensures that potentially critical information is being gathered and flows through the organization, providing a solid information platform. The basic information contained on the SAR is entered into the records management system – the same Department database utilized for tracking crimes. Just as current crime analysis procedures utilize codes to capture, track, map and analyze crimes such as robbery and auto theft, the same philosophy is now, for the first time, applied to terrorism-related suspicious activity. The nation-wide application of this behavioral coding and uniform reporting and tracking method will provide the revolutionary basis for linking indicators and revealing emerging patterns for terrorist activity throughout the United States. With the advent of coding, an agency’s records management system has been transformed into a valuable and viable terrorism prevention tool. When the basic preliminary information contained on a SAR report is entered into a crime analysis database using these codes, the system can be utilized to map, chart and graph suspicious behaviors, and allows counter-terrorism personnel to run specific queries based on a particular behavior, location, or time frame in order to identify emerging patterns. This coding also permits the sharing of information in a systematic and uniform fashion throughout the nation, thus providing a basis for identifying trends, spikes, patterns and similarities within the national context. In addition, the capability to use queries to look back at previously reported suspicious activity enables personnel to make linkages in behavior over time that might otherwise be overlooked. This ability to run queries is crucial to the successful analysis and synthesis of information and to the production of actionable intelligence. The
Fig. 5. Specific queries can be run and maps created which assist in identifying emerging patterns
added ability to fuse this capability with an "all crimes" picture can provide decision makers with statistical support that helps them allocate resources and personnel in a more strategic way. It can also help leaders determine focus areas for training, investigation, enforcement and protection, and reveal potential patterns that extend beyond the region to the rest of the country and, potentially, overseas. What previously might have appeared to be an isolated activity, once information is shared vertically and horizontally throughout the region and nation, may now reveal a pattern of behavior that law enforcers can use to focus their prevention and investigation efforts. Ultimately, the SAR system is the foundation that law enforcers have been seeking to help predict and prevent terrorism.
7 Conclusion The ultimate goal of any counter-terrorism system or organization must be to provide actionable intelligence: to identify patterns and linkages and to support knowledgeable decision-making at both the strategic and operational level. Properly gathered information, coupled with timely and accurate analysis, results in intelligence. The insertion of the SAR process into this cycle at the local level will produce investigative leads critical to determining or identifying emerging patterns or spikes in terrorism-related activity and assist in identifying potential preventative measures. It will also provide a statistical foundation for managing and directing the deployment and allocation of resources, ensuring relentless follow-through and accountability. Combined, these elements will contribute to what must be the ultimate goal of law enforcement in the realm of counter-terrorism: the potential for the prediction, and thus prevention, of future activities and attacks by foreign or domestic terrorists. Counter-terrorism investigators frequently describe the intelligence process as the search for a possibly vital "nugget" of information. Without the proper information-collection system and information-sharing platforms, such as that provided by the LAPD's SAR process, counter-terrorism investigators are not only missing vital indicators that may reveal that "needle in the haystack," they are quite possibly searching in the wrong haystacks to begin with. Acknowledgements. The following agencies have supported the work presented above: Office of the Director of National Intelligence (ODNI), Ambassador Thomas McNamara, PM-ISE (Program Manager-Intelligence Sharing Environment, ODNI), John Cohen, Office of the Director of National Intelligence (ODNI), Bureau of Justice Assistance, Department of Homeland Security (DHS), Major Cities Chiefs (MCC).
Stable Statistics of the Blogograph Mark Goldberg, Malik Magdon-Ismail, Stephen Kelley, and Konstantin Mertsalov Rensselaer Polytechnic Institute Department of Computer Science 110 8th Street Troy, NY 12180 {goldberg,magdon,kelles,mertsk2}@cs.rpi.edu
Abstract. The primary focus of this paper is to describe stable statistics of the blogosphere’s evolution which convey information on the social network’s dynamics. In this paper, we present a number of non-trivial statistics that are surprisingly stable and thus can be used as benchmarks to diagnose phase-transitions in the network. We believe that stable statistics can be used to identify anomalous behavior at all levels: that of a node, of a local community, or of the entire network itself. Any substantial change in those stable statistics must alert the researchers and analysts to the need for further investigation. Furthermore, the usage of these or similar statistics that are based solely on the communication dynamics and not on the communication content, allows one to diagnose anomalous behavior with minimal intrusion of privacy. Keywords: Social networks, graph theory, blogs.
1 Introduction
Large social networks, such as the Blogosphere, are now channels for a significant portion of information flow. One would expect important social events to manifest themselves on such social networks as changes to the information flow dynamics, perhaps slightly before, during and after the events. More specifically, suppose one tracks social groups which are identified solely based on the pattern of their communication. One might ask whether a particular group gains in popularity and has the potential for becoming a large movement, so that a thorough study of this group is warranted. In order to answer questions like this, a picture of what normal group dynamics and behavior look like is needed as a benchmark against which hypotheses might be tested. Our goal is to develop a framework for detecting anomalous behavior in blogosphere-like social networks. In particular, we take the first step in this direction by describing normal behavior against which anomalous behavior can be calibrated. As our test-bed, we take data from the LiveJournal blogosphere. There are certainly many parameters that can be extracted from the data. However, for any statistic of the social network's evolution to be useful as a diagnostic tool of anomalous behavior, the statistic should be stable during the normal
functioning of the network. Only then can we identify a change in the statistic, and possibly connect it to some underlying change in its fundamental behavior. The primary focus of this paper is to describe stable statistics of the blogosphere's evolution which convey information on the social network's dynamics. We categorize such statistics as follows: (i) Individual statistics: statistics relating to properties of individual nodes, such as in-degree and out-degree distributions; (ii) Relational statistics: statistics describing communication links (edges) in the network, such as the persistence of edges, correlation, and clustering coefficients; (iii) Global statistics: statistics relating to global properties of the network, such as the size and diameter of its largest ("giant") component and total communication density; (iv) Community statistics: statistics relating to the community (social group) structure in the network; and (v) Evolution statistics: statistics relating to evolution in the social network; for example, the average lifespan of a social group. We are interested in the dynamics of such statistics, in particular, their stability. In this paper, we present a number of non-trivial statistics that are surprisingly stable and thus can be used as benchmarks to diagnose phase-transitions in the network. The stability of these statistics is surprising because even though the network size is stable, the network dynamics itself is far from stable: our experiments show that close to 60% of the edges in the network change from week to week. We believe that stable statistics can be used to identify anomalous behavior at all levels: that of a node, of a local community, or of the entire network itself. Any substantial change in those stable statistics must alert the researchers and analysts to the need for further investigation. Furthermore, the usage of these or similar statistics that are based solely on the communication dynamics and not on the communication content allows one to diagnose anomalous behavior with minimal intrusion of privacy.
2 LiveJournal Blog Data
We define the blogograph to represent the communication within a fixed time period. For our experiments, this period is one week. The blogograph is a directed unweighted graph with a node for every blogger and a directed edge from the author of any comment to the owner of the blog where the comment was made during the observed time period. Parallel edges are not allowed and a comment is ignored if the corresponding edge is already present in the graph. To study the evolution dynamics, we considered consecutive weekly snapshots of the network. The communication graph contains the bloggers that either posted or commented during this week and the edges represent the comments that appeared during the week. An example blogograph is given in Figure 1.
Fig. 1. Blogograph generation example. Vertices are placed for every blogger who posted or commented, the edges are placed from the author of the comment to the author of the post (the blog owner). Parallel edges and loops are not allowed.
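A minimal sketch, in Python, of the blogograph construction rule just described: one node per blogger who posted or commented, one directed edge from commenter to blog owner, with loops and parallel edges discarded. The comment tuples below are toy data, not the LiveJournal feed.

```python
def build_blogograph(comments):
    """comments: iterable of (commenter, blog_owner) pairs observed during one week.
    Returns a directed graph as {node: set of successors}; loops and parallel edges are dropped."""
    graph = {}
    for commenter, owner in comments:
        if commenter == owner:                          # no loops
            continue
        graph.setdefault(commenter, set()).add(owner)   # a repeated comment adds no new edge
        graph.setdefault(owner, set())                  # blog owners are nodes even if they never comment
    return graph

# Toy week mirroring Fig. 1: comments on Alice's and Bill's posts.
week = [("Bill", "Alice"), ("Cory", "Alice"), ("Alice", "Bill"),
        ("Cory", "Bill"), ("Dave", "Bill"), ("Cory", "Bill")]
g = build_blogograph(week)
print(sorted(g["Cory"]))    # ['Alice', 'Bill']
```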
Our data was collected from the popular blogging service LiveJournal. LiveJournal imposes few restrictions on communication. What makes this network particularly interesting for our purposes is that bloggers typically make decisions to communicate and join social communities without strong influence from the outside. For this reason we believe the network observed at LiveJournal has a natural communication structure as the steady state of the network evolves. This makes the LiveJournal Blogosphere an attractive domain for our research. Much of the communication in LiveJournal is public, which allows for easy access, especially given the real time RSS update feature provided by LiveJournal that publishes all open posts that appear on any of the hosted blogs. In our experience, the overwhelming majority of comments appear on a post within two weeks of the posting date. Thus, our screen-scraping program visits the page of a post after it has been published for two weeks and collects the comment threads. We then generate the communication graph. We have focused on the Russian section of LiveJournal, as it is reasonably but not excessively large (currently close to 250,000 active bloggers) and almost self-contained. We identify Russian blogs by the presence of Cyrillic characters in the posts. Technically, this also captures the posts in other languages with a Cyrillic alphabet, but we found that the vast majority of the posts are actually Russian. The community of Russian bloggers is very active. On average, 32% of all LiveJournal posts contain Cyrillic characters. Our work is based on data collected during September and October of 2006.

 w    |V|       |E|       GC     C       d      α
 35   111,248   376,878   96.0%  0.0788  5.336  2.87
 36   118,664   411,294   96.0%  0.0769  5.327  2.74
 37   120,549   410,735   96.0%  0.0752  5.375  2.79
 38   119,451   386,962   95.8%  0.0728  5.455  2.82
 39   113,296   323,062   95.2%  0.0666  5.641  2.80
 40   124,329   430,523   96.3%  0.0764  5.332  2.77
 41   121,609   380,773   95.9%  0.0705  5.471  2.81
 42   124,633   415,622   96.2%  0.0739  5.349  2.74
 43   123,898   403,309   96.5%  0.0713  5.425  2.81

Fig. 2. Statistics for the observed blogograph: order of the graph (|V|), graph size (|E|), fraction of vertices that are part of the giant component (GC), clustering coefficient (C), average separation (d), power law exponent (α)
3 Global, Individual, and Relational Statistics
The observed communication graph has interesting properties. The graph is very dynamic (on the level of nodes and edges) but quite stable if we look at some aggregated statistics. For any week, about 70% of active bloggers will also be active in the next week. Further, about 40% of edges that existed in a week will also be found in the next week. A large part of the network changes weekly, but a significant part is preserved. Some of the important parameters of the blogograph illustrating this stability are presented in Figure 2. The giant component (GC) is the largest connected subgraph of the undirected blogograph. A giant component of similar size has been observed in other large social networks [5,4]. The clustering coefficient (C) refers to the probability that the neighbors of a node are connected. The clustering coefficient of a node with degree k is the ratio of the number of edges between its neighbors and k(k − 1). The clustering coefficient of the graph is defined to be the average of the node clustering coefficients. The observed clustering coefficient is stable over multiple weeks and significantly different from the clustering coefficient in a random graph with the same out-degree distribution, which is 0.00029. The average separation (d) is the average shortest path between two randomly selected vertices of the graph. We computed it by sampling 10,000 random pairs of nodes and finding the undirected shortest path between them. With respect to this parameter, the blog communication graph is not significantly different from other observed social networks [5,6]. The in-degree of a node describes its popularity in a network. The popularity is determined through the interaction of network participants and depends on the properties of the participants and the network structure. Many large social networks [1,4] have a power law in-degree distribution, P(k) ∝ k^(−α), where P(k) is the probability that a node has degree k.
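The following sketch shows how the aggregate quantities of Fig. 2 could be recomputed from a weekly blogograph under the definitions given above (note the k(k − 1) normalization stated for the node clustering coefficient, and the estimation of the average separation by sampling node pairs). It is an illustrative reimplementation in plain Python, not the authors' code, and the 10,000-pair sample size simply mirrors the text.

```python
import random
from collections import deque

def undirected(graph):
    """Symmetrize a directed blogograph given as {node: set of successors}."""
    u = {v: set() for v in graph}
    for v, nbrs in graph.items():
        for w in nbrs:
            u[v].add(w)
            u.setdefault(w, set()).add(v)
    return u

def giant_component_fraction(u):
    """Fraction of vertices in the largest connected component."""
    seen, best = set(), 0
    for s in u:
        if s in seen:
            continue
        comp, queue = {s}, deque([s])
        while queue:
            x = queue.popleft()
            for y in u[x]:
                if y not in comp:
                    comp.add(y)
                    queue.append(y)
        seen |= comp
        best = max(best, len(comp))
    return best / len(u)

def average_clustering(u):
    """Node-averaged clustering coefficient with the k(k - 1) normalization used in the text."""
    total = 0.0
    for v, nbrs in u.items():
        k = len(nbrs)
        if k < 2:
            continue
        links = sum(1 for a in nbrs for b in nbrs if a < b and b in u[a])
        total += links / (k * (k - 1))
    return total / len(u)

def average_separation(u, samples=10000, seed=0):
    """Average undirected shortest-path length over sampled node pairs (connected pairs only)."""
    rng, nodes, dists = random.Random(seed), list(u), []
    for _ in range(samples):
        s, t = rng.sample(nodes, 2)
        dist, queue = {s: 0}, deque([s])
        while queue and t not in dist:
            x = queue.popleft()
            for y in u[x]:
                if y not in dist:
                    dist[y] = dist[x] + 1
                    queue.append(y)
        if t in dist:
            dists.append(dist[t])
    return sum(dists) / len(dists) if dists else float("inf")
```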
Fig. 3. Average in-degree distribution (P(x = k) vs. k; observed distribution with power-law fit, α = 2.81)

Fig. 4. Edge stability (portion of edges vs. number of weeks, log scale)

Fig. 5. Average edge histories with envelopes. The bottom line presents the portion of edges that existed in the previous iteration; every next line shows the portion of the current edges whose endpoints in the previous iteration were on the distance not exceeding the corresponding value. (Axes: portion vs. out-degree.)
Figure 3 shows the in-degree distribution averaged over the observed period. We observed a power law tail with parameter α ≈ 2.81, which is stable from week to week. This value was computed using the maximum likelihood method described in [3] and Matlab code provided by Aaron J. Clauset. Figure 6 shows the average cumulative in-degree distribution over 9 weeks of observed data, with an envelope (grey area) showing the maximum and minimum curves over the same 9 weeks. The envelope curves appear very close to the average value, clearly showing the stability of the in-degree distribution.
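For the power-law exponent, the paper relies on the maximum-likelihood method of [3] (via Clauset's Matlab code); the sketch below uses the simpler continuous-approximation estimator from that same line of work, α̂ = 1 + n / Σ ln(k_i / k_min), omitting the discrete correction and the data-driven choice of k_min via the Kolmogorov–Smirnov statistic. The synthetic degree sample is a placeholder.

```python
import math
import random

def power_law_alpha(degrees, k_min):
    """Continuous MLE of the power-law exponent for the tail k >= k_min:
    alpha_hat = 1 + n / sum(ln(k_i / k_min)); an approximation to the discrete estimator of [3]."""
    tail = [k for k in degrees if k >= k_min]
    return 1.0 + len(tail) / sum(math.log(k / k_min) for k in tail)

# Toy usage on synthetic power-law-like degrees (exponent around 2.8); real use would also
# select k_min, e.g. by minimizing a Kolmogorov-Smirnov distance as in [3].
random.seed(0)
degrees = [int((1.0 - random.random()) ** (-1.0 / 1.8)) for _ in range(50000)]
print(round(power_law_alpha(degrees, k_min=5), 2))
```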
Fig. 6. Cumulative in-degree distribution with envelope for nine observed weeks (P(x < k) vs. k)

Fig. 7. Cumulative out-degree distribution with envelope for nine observed weeks (P(x < k) vs. k)
The out-degree distribution of the network describes the activity levels of the participants. Figure 7 shows the average cumulative out-degree distribution over 9 weeks of data with a minimum and maximum curve envelope. As with the in-degree distribution, the envelope curves of the out-degree distribution appear very close to the average value and illustrate the stability of the out-degree distribution. We use edge stability and edge history to evaluate the evolutionary dynamics of individual edges in the snapshots of the evolving network. Edge stability measures the number of the observed time periods that contained a particular edge, and the edge history measures how close the end points of the edge were in the previous iteration, conditioned on the activity level of the source of the edge. Both edge history and edge stability can be measured in the directed or undirected graph. We found the directed version to be more informative for edge stability evaluation and the undirected version to be more informative for edge history. Figure 4 presents the edge stability distribution shown on a log scale.
As shown, the majority of the edges appear only once or twice in the observed period, but the network also contains some edges that are very stable and appear in almost every observed week. We define the history H_ij^T of an edge (i, j) found in iteration T to be the geodesic distance between vertices i and j in the graph of iteration T − 1. The average edge history with minimum and maximum curve envelopes over nine observed weeks of data is presented in Figure 5. This plot shows the average portion of edges whose end points had geodesic distance one, two, three, etc. in the previous observed week, for each activity level (out-degree). The lower line on the plot shows the portion of the edges in time period T that were present in the graph of time period T − 1 and therefore had geodesic distance one. The second line from the bottom shows the portion of edges whose end points had geodesic distance at most two, the third line is for the portion of edges with geodesic distance at most three, etc. The minimum and maximum curves for each line bound the envelope around it. Clearly, the envelope is very close to the line itself. This suggests that the edge histories are stable in the observed period. It is surprising to see that the portion of the edges that repeat week to week, conditioned on the out-degree of the edge source, is so stable. As Figure 5 shows, the portion of edges repeated in the next week is around 40% for vertices with out-degree five, 45% for vertices with out-degree ten, and 47% for vertices with out-degree fifteen. Furthermore, the portion of edges for which the end points had geodesic distance greater than one follows the same trend.
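A sketch of the two edge-level measures defined above: edge stability counts the number of weekly snapshots containing each directed edge, and the edge history of an edge (i, j) in week T is the undirected geodesic distance between i and j in week T − 1 (None if the endpoints were not connected or not present). This is an illustrative implementation over graphs represented as {node: set of successors}, not the authors' code.

```python
from collections import Counter, deque

def edge_stability(weekly_graphs):
    """For every directed edge, count the number of weekly snapshots that contain it."""
    counts = Counter()
    for g in weekly_graphs:
        for v, nbrs in g.items():
            for w in nbrs:
                counts[(v, w)] += 1
    return counts

def to_undirected(directed):
    u = {}
    for v, nbrs in directed.items():
        u.setdefault(v, set())
        for w in nbrs:
            u[v].add(w)
            u.setdefault(w, set()).add(v)
    return u

def geodesic(u, s, t):
    """Undirected shortest-path length from s to t, or None if unreachable or missing."""
    if s not in u or t not in u:
        return None
    dist, queue = {s: 0}, deque([s])
    while queue:
        x = queue.popleft()
        if x == t:
            return dist[x]
        for y in u[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    return None

def edge_histories(prev_week, curr_week):
    """History H_ij^T: distance between i and j in week T-1, for every edge (i, j) of week T."""
    prev_u = to_undirected(prev_week)
    return {(v, w): geodesic(prev_u, v, w)
            for v, nbrs in curr_week.items() for w in nbrs}
```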
4 Community and Evolution Statistics
Beyond statistics centering on individual vertices and edges, statistics of groups may also be examined. In order to determine groups in the data, the Iterative Scan algorithm presented in [2] was used. This algorithm produces sets of vertices which are locally optimal with respect to some density function. The density function used in this paper is as shown below:

δ = (Ein + λ·ep) / (Ein + Eout)
where Ein and Eout are the numbers of edges within the community and cut by the community boundary, respectively, ep is the edge probability within the community, and λ is a parameter which can either increase or decrease the amount of weight placed on the edge probability of a community. This weighting was added to the density function to improve the intuitive "quality" of clusters in sparse graphs such as the one detailed in this paper. Without this term, sparse areas of the graph can be added to a cluster quite easily, resulting in very large communities with high diameters. The algorithm was seeded using the Link Aggregate algorithm described in [2]. The number of clusters produced after optimization via Iterative Scan, their average size, average density, and average edge probability are all shown in Figure 8. Further, two plots showing size and density are given in Figure 9.
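A minimal sketch of the weighted density computation and a single, simplified local-improvement sweep in the spirit of Iterative Scan; the full algorithm of [2] (and the Link Aggregate seeding) is more involved, the formula follows the reconstruction δ = (Ein + λ·ep) / (Ein + Eout) shown above, and the λ value is an arbitrary placeholder.

```python
def community_density(u_graph, community, lam=1.0):
    """delta = (E_in + lambda * e_p) / (E_in + E_out) over an undirected graph {node: set}."""
    c = set(community)
    e_in = sum(1 for v in c for w in u_graph[v] if w in c) // 2
    e_out = sum(1 for v in c for w in u_graph[v] if w not in c)
    n = len(c)
    e_p = e_in / (n * (n - 1) / 2) if n > 1 else 0.0    # edge probability inside the community
    return (e_in + lam * e_p) / (e_in + e_out) if (e_in + e_out) else 0.0

def improve_once(u_graph, seed, lam=1.0):
    """One simplified sweep: toggle single vertices while the density strictly increases.
    The seed is assumed to be a subset of the graph's vertices."""
    community, improved = set(seed), True
    while improved:
        improved = False
        for v in list(u_graph):
            candidate = community ^ {v}     # add v if absent, remove it if present
            if candidate and (community_density(u_graph, candidate, lam)
                              > community_density(u_graph, community, lam)):
                community, improved = candidate, True
    return community
```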
Note the similarities in both scale and shape of these plots. Also in Figure 9 is a plot showing the boundary of each week's plot. Here, each point is defined as the largest 5% of the clusters in a given range along the y-axis of the plot. The portions of this plot where the lines are furthest apart are areas with few communities. However, it can be seen that each of the plots has an upper portion similar to those observable in the preceding weekly plots. The plots also show that each week has a number of low-density communities of size 2. These communities are merely seeds which optimization did not modify. They can be filtered out based on some domain-specific criteria, but in this case they were left in the data to get a more general sense of the algorithm's performance without obscuring details.

 week   |C|     size_avg   δ_avg    ep_avg
 35     71590   2.7400     0.2701   0.15812
 36     76210   2.6915     0.2645   0.15194
 37     77384   2.6823     0.2660   0.15367
 38     77099   2.7548     0.2736   0.16266
 39     73741   2.9312     0.2864   0.17837
 40     70308   2.6620     0.2646   0.15193
 41     78707   2.7876     0.2748   0.16442
 42     80851   2.6946     0.2657   0.15371
 43     80404   2.7437     0.2697   0.15787

Fig. 8. Cluster Statistics
Now that communities are clearly defined, the question of how they evolve over time arises. For this paper, we have defined community evolution as follows. The Iterative Scan algorithm takes as input a set of seeds and produces optimized output communities. The output from running the algorithm on one week can be used as input to the next week's optimization. This causes some difficulty, as sets of connected vertices taken from one graph may not be connected in the next. In order to get around this, the set of vertices that make up the optimized community is placed into the next graph and the largest connected component of this set in the new graph is used as a seed. A second difficulty is the definition of when a community actually succeeds another. Given two successive communities Ct and Ct+1 discovered in the manner described above, we consider cluster Ct+1 to be a continuation of cluster Ct if

|Ct ∩ Ct+1| / |Ct ∪ Ct+1| > t,

where t is a threshold value indicating how strong we require the continuation to be. We define the lifespan of some initial community as the number of consecutive graphs in which the initial community or one of its continuation communities exists. We measure these lifespans with respect to some initial set of communities which are discovered in the manner presented at the start of this section. Figure 10 shows a histogram of the lifespans with respect to three different starting weeks in the 9-week data. These numbers appear to be quite stable.
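The continuation test and lifespan count defined above can be sketched as follows; the weekly community lists are assumed to be the Iterative Scan outputs described earlier, and the toy data at the bottom is only for illustration.

```python
def continues(prev_c, next_c, threshold):
    """Continuation test: |C_t ∩ C_{t+1}| / |C_t ∪ C_{t+1}| > threshold."""
    prev_c, next_c = set(prev_c), set(next_c)
    union = prev_c | next_c
    return bool(union) and len(prev_c & next_c) / len(union) > threshold

def lifespan(start_community, weekly_community_lists, threshold=0.3):
    """Number of consecutive later weeks in which some community continues the starting one."""
    current, life = set(start_community), 0
    for communities in weekly_community_lists:      # one list of communities per later week
        successor = next((c for c in communities if continues(current, c, threshold)), None)
        if successor is None:
            break
        current, life = set(successor), life + 1
    return life

# Toy usage: a community that survives two weeks and then dissolves.
weeks = [[{"a", "b", "c", "d"}, {"x", "y"}], [{"a", "b", "c"}], [{"p", "q"}]]
print(lifespan({"a", "b", "c", "d"}, weeks))    # 2
```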
Fig. 9. The top figures show a size-density plot for weeks 35 and 40. Each point represents one discovered community. The bottom figure shows a line representing the boundaries of each week's plot. (Axes: density vs. community size.)
Fig. 10. Lifespan based on continuation results. The first image has t = 0.3 while the second has t = 0.4. (Axes: number of communities vs. lifespan.)
5 Conclusion
In the observed graph, communication patterns are dynamic. Even with these changes in the linkage of individual nodes, general statistics appear to be quite stable. Beyond this, link evolution and community evolution present another set of statistics which are stable.
We propose that each of these sets of base statistics can be used as a foundation upon which future mechanisms for detecting anomalous individuals and communities can be built. In the future, this work will be expanded to a variety of structurally different social networks. In these explorations, additional in-depth statistics will also be examined.

Acknowledgements. This material is based upon work partially supported by the U.S. National Science Foundation (NSF) under Grant Nos. IIS-0621303, IIS-0522672, IIS-0324947, CNS-0323324, and IIS-0634875, by the U.S. Office of Naval Research (ONR) Contract N00014-06-1-0466, and by the U.S. Department of Homeland Security (DHS) through the Center for Dynamic Data Analysis for Homeland Security administered through ONR grant number N00014-07-1-0150 to Rutgers University. The content of this paper does not necessarily reflect the position or policy of the U.S. Government, and no official endorsement should be inferred or implied.
References
1. Barabási, A., Jeong, H., Néda, Z., Ravasz, E., Schubert, A., Vicsek, T.: Evolution of the social network of scientific collaborations. Physica A 311, 590–614 (2002)
2. Baumes, J., Goldberg, M.K., Magdon-Ismail, M.: Efficient identification of overlapping communities. In: Kantor, P., Muresan, G., Roberts, F., Zeng, D.D., Wang, F.-Y., Chen, H., Merkle, R.C. (eds.) ISI 2005. LNCS, vol. 3495, pp. 27–36. Springer, Heidelberg (2005)
3. Clauset, A., Shalizi, C.R., Newman, M.E.J.: Power-law distributions in empirical data. SIAM Review (to appear, 2009)
4. Goh, K.-I., Eom, Y.-H., Jeong, H., Kahng, B., Kim, D.: Structure and evolution of online social relationships: Heterogeneity in unrestricted discussions. Physical Review E (Statistical, Nonlinear, and Soft Matter Physics) 73(6), 066123 (2006)
5. Kossinets, G., Watts, D.J.: Empirical analysis of an evolving social network. Science 311, 88–90 (2006)
6. Newman, M.E.J.: The structure of scientific collaboration networks. Proc. Natl. Acad. Sci. USA 98, 404 (2001)
Privacy-Preserving Accountable Accuracy Management Systems (PAAMS)
Roshan K. Thomas (1), Ravi Sandhu (2), Elisa Bertino (3), Budak Arpinar (4), and Shouhuai Xu (5)
(1) SPARTA, Inc., 5875 Trinity Parkway, Suite 300, Centreville, VA 20120
(2) Institute for Cyber Security, University of Texas at San Antonio, San Antonio, TX 78249
(3) Computer Science Department and CERIAS, Purdue University, West Lafayette, IN 47907
(4) Department of Computer Science, 415 GSRC, University of Georgia, Athens, GA 30602
(5) Department of Computer Science, University of Texas at San Antonio, TX 78249
Abstract. We argue for the design of “Privacy-preserving Accountable Accuracy Management Systems (PAAMS)”. The designs of such systems recognize from the onset that accuracy, accountability, and privacy management are intertwined. As such, these systems have to dynamically manage the tradeoffs between these (often conflicting) objectives. For example, accuracy in such systems can be improved by providing better accountability links between structured and unstructured information. Further, accuracy may be enhanced if access to private information is allowed in controllable and accountable ways. Our proposed approach involves three key elements. First, a model to link unstructured information such as that found in email, image and document repositories with structured information such as that in traditional databases. Second, a model for accuracy management and entity disambiguation by proactively preventing, detecting and tracing errors in information bases. Third, a model to provide privacy-governed operation as accountability and accuracy are managed. Keywords: privacy, accuracy, accountability, entity disambiguation.
1 Introduction

We present our initiative on "Privacy-preserving Accountable Accuracy Management Systems (PAAMS)" that aims to advance the accuracy management of intelligence information collected, produced and disseminated by government as well as commercial systems. Our approach is to develop unified proactive and reactive, both backward (trace to the source) and forward (trace to derivates) accuracy management techniques. Collectively, these will enable timely detection, identification and correction of errors in source information and finished intelligence while ensuring accountability and privacy preservation. Such information may span multiple repositories in varying formats including documents, email, and databases across multiple administrative domains. Our effort has direct and immediate application to well-known difficult problems such as entity disambiguation, as manifested in the compilation, merging and correction of terror watch lists.
Our approach is based on the recognition that accountability is a prerequisite for better accuracy management and, further, that trading off privacy in controlled and accountable ways may yield significant increases in accuracy. Hence, the approach involves three key elements:
• A model to link unstructured information such as that found in email, image and document repositories with structured information such as that in traditional databases. This linkage forms the foundation for our development of a unified accountability and audit model that tracks the provenance, state changes and information flows across information bases.
• With the accountability model as an enabler, we propose to build a model for accuracy management and entity disambiguation to (1) proactively prevent errors and reactively detect errors in information bases and (2) trace and correct the impact of such errors in source and target (derivate) information bases.
• Finally, both the accountability and accuracy models are governed by a privacy model that provides the principles and associated mechanisms for preserving privacy as accountability and accuracy are managed.
We are developing an architectural framework to realize these models and demonstrate their utility and viability.
2 Technical Approach We believe that significant improvements in the accuracy of intelligence community (IC) information bases can be achieved only with a fundamental realignment of the information processing architecture – one that recognizes from the outset that accuracy, accountability and privacy are intertwined. Any comprehensive solution must manage the dependencies and tradeoffs between these elements. Our key assertion is that these tradeoffs cannot be hardwired but instead must be dynamically managed based on ongoing threats and acceptable individual tolerances for these elements as driven by application needs. Current systems do not recognize this dynamic interrelationship and most existing research efforts have treated these elements in isolation. To illustrate the above, consider the Terrorist Identities Datamart Environment (TIDE) and associated compilation, dissemination and use of watch lists by intelligence and law enforcement agencies as reported in the popular press [1]. The high number of misidentifications and occurrences of repeated misidentification of the same individual indicate inadequate accuracy management. Public perception is that the system has poor accountability as it cannot quickly trace the source of errors with consequent slow and cumbersome redress procedures. Our claim is that accuracy can only be improved in such systems by addressing and improving the accountability and privacy management dynamics as they interrelate with accuracy. To elaborate, better provenance tracking and auditing of actions on content can help in the rapid location of sources of errors. Once errors are located, the error correction process can be significantly improved if better entity disambiguation is provided. However, this may call for access to more discriminating and potentially privacy-sensitive information about source intelligence as well as individuals.
The framework for our novel approach to dynamically navigating the accuracy-accountability-privacy tradeoff triad is to formulate this as variants of an optimization problem. For example, one variant would be to maximize accuracy subject to pre-specified limits on accountability overhead and privacy intrusion. In a different scenario an 80% accuracy with 70% statistical confidence may be acceptable as long as privacy exposure is kept below a specified threshold.

2.1 Better Accountability by Linking Unstructured and Structured Data
To improve accountability, it is necessary to link structured information such as watch lists stored in modern relational database systems with unstructured information that can provide a corpus of supporting documentation and evidence. The latter can consist of text files, images, web pages, etc. A database management system tracks the audit history of a record or a field, but is mostly unaware of the accuracy of the stored data, the data sources that produced the data and the means to correct the accuracy of such data when an error is discovered. The modern approach to linking structured and unstructured data is based on the following three layers: (i) a set of information sources that need to be integrated in a common view; (ii) a semantic layer providing common interfaces and ontology for the information sources as well as a high-level language/notation through which the user can express queries on the semantic model; and (iii) a knowledge/information access layer which translates queries on the semantic model into the languages supported by the information sources. We are developing a semantic model and specialized ontology centered around two key concepts that are at the core of realizing the vision of PAAMS: an accountable knowledge unit and an accountable knowledge activity model.

The Accountable Knowledge Unit (AKU). From a semantic modeling standpoint, an AKU represents an identifiable and attributable piece of intelligence information, such as "Joe Smith should be on the suspect watch list." The AKU goes beyond traditional database transactions and system-level notions for structuring activities and updates to records and instead provides the basis to reason about the collection, production, and dissemination of intelligence "knowledge" pieces at a higher level. The AKU can thus link unstructured information with structured database records. This linking can happen in many combinations depending on how specific unstructured and structured information bases serve as sources, intermediary storage and sinks (derivates).

The Accountable Knowledge Activity Model (AKAM). The heart of our novel accountability approach is the AKAM. This is an activity model that considers how information-related activities for knowledge units need to be organized and tracked across information bases. In particular, several AKUs may be dependent on each other. Thus the AKU which says "Joe Smith should be on the suspect watch-list" may be dependent on facts from another AKU which says: "Joe Smith was arrested in July 2005 with weapons possession at Atlanta airport." These dependencies are essentially information flow, accountability and integrity dependencies across the information bases touched by AKUs. The AKAM keeps track of these dependencies. Thus if a source document or email is now considered to be suspect or
false, the relevant and encompassing AKU and integrity dependency will point us to the dependent database records that need to be corrected or retracted. In this case, the DBMS has a backward-looking source-to-sink information flow dependency in that the DBMS records were derived from the source documents. The dependencies also exist in the forward direction when a DBMS becomes the source for subsequent documents or emails.

2.2 Improved Accuracy Management through Better Entity Disambiguation
Given the accountability substrate based on AKUs and the AKAM, we pursue better accuracy management through improved entity disambiguation techniques. Accuracy management aimed at correcting data errors has been proposed in the context of data quality [2]. Such an approach, known as record matching, consists of correcting data by comparison with other sources, presumably of better quality. Record matching has also been used for the purpose of integrating data. However, our goal in PAAMS for entity disambiguation is to go beyond record matching and combine this with ontology-based approaches. Figure 1 provides an overview of the approach. Disambiguation utilizes background knowledge stored in the form of one or more populated ontologies. It does not rely solely on the existence of data items that can provide strong and apparently obvious evidence such as email addresses or affiliations. In addition, it also uses any available relationships that may be provided as evidence, as well as those from the ontology, to provide clues in determining the correct entity. Similar to what we have done in [3], the task of disambiguating entities can utilize the types of relationships that connect similar entities to determine if they are the same entity or different. For example, in the domain of computer science researchers, the affiliation of a researcher is commonly used to indicate, with a high confidence level, that the computer scientist "John Miller" we are talking about is the one at the University of Georgia, rather than the one at the University of Montana. In the same domain, other types of relationships can also be used, such as publications or research interests, to disambiguate two entities. Of course the confidence level of the disambiguation process will greatly depend on the types of relationships used. For example, the "affiliation" might be a more accurate indication than the "research interests". The facts which can be used to disambiguate entities can have varying "sensitivity levels" to limit their access and avoid privacy exposure. By "sensitivity level" we mean that certain information can be classified as public knowledge, like address and telephone number, while other information, such as a credit rating, may be considered more "private" or sensitive. Yet other information, such as medical history, may be considered "very sensitive". Attempts to regulate access to these types of documents have been undertaken. For example, [4] describes the research and prototyping of a system that takes an ontological approach and is primarily targeted for use by the intelligence community. The approach utilizes the notion of semantic associations and their discovery among a collection of heterogeneous documents. The basic input to the entity disambiguation module (DM) in PAAMS is a request with input parameters that specify the desired accuracy, level of confidence, and the tolerable limits on accountability overhead and privacy intrusion.
The DM then attempts to find an answer to a feasibility or optimization problem. It may come back and give an answer for accuracy with some confidence but indicate that further
accuracy improvements with higher confidence intervals are possible if, say, the privacy constraint can be relaxed. Thus the analyst or TSA agent (as an example) can have a series of interactions with PAAMS to refine the desired result, provided the needed level of access can be justified.

2.3 Privacy-Governed Operation
The third element in our technical approach provides for a privacy-governed operation so as to meaningfully trade off privacy with accuracy and accountability. Popular techniques to ensure accuracy and improve data quality require access to several data sources often containing personally identifying information, and hold the potential for privacy breaches. To address such issues, privacy-preserving data matching techniques in the database context have been proposed [5, 6]; such techniques, however, have some major drawbacks. They use protocols based on secure set intersection [8] and their costs are prohibitive. They only perform exact matching [7, 8]. This is a major drawback when data across different sources have heterogeneous quality, as an exact match may not be very successful and thus will likely result in very few matches. In such cases, approximate matching techniques are the only viable approach. To address such issues, we plan to explore a technique recently proposed in [7]. This technique is based on embedding the records to be matched in a Euclidean space, that is, a vector space having the Euclidean distance as its norm, and on performing the comparison in such a space. To ensure privacy, the SparseMap embedding method [8] is used. Preliminary experiments have shown that the approach is also very efficient for very large data sets. We will investigate a variety of issues, such as how to deal with the problem of heterogeneous data schemas to determine how the approach can be used for disambiguating entities, as well as how to perform record matching between structured data (like relational DB data) and unstructured data.
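To give a concrete flavor of embedding-based approximate matching, the sketch below maps strings to vectors of edit distances to a fixed reference set (a simplified Lipschitz-style embedding, not the actual SparseMap method of [8]) and compares records by Euclidean distance. The reference strings are arbitrary placeholders, and this toy version makes no privacy claims; it only illustrates why approximate comparison in the embedded space can tolerate heterogeneous data quality.

```python
import math

def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def embed(record, references):
    """Map a string to a vector of distances to fixed reference strings."""
    return [edit_distance(record, r) for r in references]

def euclidean(u, v):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

# Toy matching: two noisy spellings of the same name end up close in the embedded space.
refs = ["john smith", "maria garcia", "wei zhang"]     # placeholder reference set
a, b, c = (embed(s, refs) for s in ["jon smith", "john smyth", "peter brown"])
print(euclidean(a, b) < euclidean(a, c))               # True: the near-duplicates are closer
```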
Fig. 1. Main parts of the entity disambiguation module: populated ontologies (ontology schemas plus knowledge base), thematic ontologies (RDF), facts and documents, relationship finding with weights, and the output listing of disambiguated entities with confidence scores
In essence, our goal is to explore how entity disambiguation techniques for both structured and unstructured information can be tuned to navigate privacy and accountability tradeoffs while seeking greater accuracy in information bases.

Metrics-Oriented Privacy Management. We are developing methods for "metrics-oriented privacy management" based on the following ideas. First, different information items (i.e., records, fields, documents) may be annotated with different privacy sensitivity levels, possibly real numbers between zero and one. Second, the decision as to whether a computation process (such as a query) is allowed to succeed would depend on how much private information may be disclosed by the answer to the query. Such leakage may be quantified through the differential in the private information entropies before and after conducting the query. If such leakage violates a policy, the query or computation process may be cancelled. However, in some extreme cases, such as when a TSA supervisor needs to decide whether to allow a passenger to board a flight, the supervisor may be given access to some sensitive information about the passenger, perhaps with the consent of the passenger. Schemes such as privacy-governed computation using cryptography-based [9, 10, 11] and statistics-based approaches [12, 13] can be leveraged for this purpose.
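The entropy-differential policy check mentioned above can be illustrated with a toy computation: the Shannon entropy of an adversary's belief about a sensitive attribute is compared before and after a query, and the query is allowed only if the reduction stays below a policy threshold. The distributions and the 0.5-bit threshold are fabricated for illustration; estimating the actual posterior induced by a query answer is the hard part in practice.

```python
import math

def shannon_entropy(dist):
    """Entropy (in bits) of a probability distribution given as {value: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def allow_query(prior, posterior, max_leakage_bits):
    """Permit the query only if the entropy reduction about the sensitive attribute is small enough."""
    leakage = shannon_entropy(prior) - shannon_entropy(posterior)
    return leakage <= max_leakage_bits

# Belief about a passenger's sensitive attribute before and after a query answer.
prior = {"v1": 0.25, "v2": 0.25, "v3": 0.25, "v4": 0.25}        # 2.0 bits of uncertainty
posterior = {"v1": 0.70, "v2": 0.10, "v3": 0.10, "v4": 0.10}    # about 1.36 bits remain
print(allow_query(prior, posterior, max_leakage_bits=0.5))      # False: leaks about 0.64 bits
```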
3 Summary and Conclusions

We have discussed a vision and approach to better accuracy management in information systems. This approach recognizes from the outset that accuracy, accountability, and privacy management are intertwined. Many systems, such as those maintaining terror watch lists, have had a difficult time maintaining accuracy and consistency. Our thesis is that accuracy in such systems can be improved by providing better accountability links between structured and unstructured information. Further, accuracy may be enhanced if access to private information is allowed in controllable and accountable ways. Thus some access to private information can lead to better entity disambiguation. We thus argue for a metrics-based approach to privacy management. We lay out a framework for providing better accountability by developing the notion of an Accountable Knowledge Activity Model (AKAM) that ties together Accountable Knowledge Units (AKUs). AKUs span structured and unstructured information. Collectively, we lay out a vision and the related architectural concepts to build systems that can provide improved accuracy through better accountability across islands of information and through controlled management of privacy to reduce errors and ambiguity.
References
1. http://www.washingtonpost.com/wp-dyn/content/article/2007/03/24/AR2007032400944.html
2. Batini, C., Scannapieco, M.: Data Quality: Concepts, Methodologies and Techniques. Springer, Heidelberg (2006)
3. Hassell, J., Aleman-Meza, B., Arpinar, I.B.: Ontology-Driven Automatic Entity Disambiguation in Unstructured Text. In: 5th International Semantic Web Conference (ISWC 2006), Athens, GA, USA, November 5-9 (2006)
4. Aleman-Meza, B., Burns, P., Eavenson, M., Palaniswami, D., Sheth, A.P.: An Ontological Approach to the Document Access Problem of Insider Threat. In: Proceedings of the IEEE Intl. Conference on Intelligence and Security Informatics (ISI-2005), May 19-20 (2005)
5. Agrawal, R., Evfimievski, A., Srikant, R.: Information Sharing Across Private Databases. In: SIGMOD (2003)
6. Naor, M., Pinkas, B.: Oblivious Transfer and Polynomial Evaluation. STOC (1999)
7. Scannapieco, M., Figotin, I., Bertino, E., Elmagarmid, A.: Privacy Preserving Schema and Data Matching. In: SIGMOD (2007)
8. Hjaltason, G.R., Samet, H.: Properties of Embedding Methods for Similarity Searching in Metric Spaces. IEEE TPAMI 25(5) (2003)
9. Xu, S., Yung, M.: K-anonymous Multi-party Secret Handshakes. In: Financial Cryptography and Data Security, FC 2007 (2007)
10. Tsudik, G., Xu, S.: A Flexible Framework for Secret Handshakes. In: Danezis, G., Golle, P. (eds.) PET 2006. LNCS, vol. 4258, pp. 295–315. Springer, Heidelberg (2006)
11. Xu, S., Yung, M.: K-Anonymous Secret Handshakes with Reusable Credentials. In: The Proceedings of ACM Conference on Computer and Communications Security 2004 (ACM CCS 2004), pp. 158–167. ACM Press, New York (2004)
12. Sharkey, P., Tian, H., Zhang, W., Xu, S.: Privacy-Preserving Data Mining Through Knowledge Model Sharing. In: Proceedings of the First ACM SIGKDD International Workshop on Privacy, Security, and Trust in KDD, ACM PinKDD 2007 (2007)
13. Dowd, J., Xu, S., Zhang, W.: Privacy-Preserving Decision Tree Mining Based on Random Substitutions. In: Müller, G. (ed.) ETRICS 2006. LNCS, vol. 3995, pp. 145–159. Springer, Heidelberg (2006)
On the Statistical Dependency of Identity Theft on Demographics Giovanni Di Crescenzo Telcordia Technologies, Inc. One Telcordia Drive, Piscataway, NJ, 08854, USA [email protected]
Abstract. An improved understanding of the identity theft problem is widely agreed to be necessary to succeed in counter-theft efforts in legislative, financial and research institutions. In this paper we report on a statistical study about the existence of relationships between identity theft and area demographics in the US. The identity theft data chosen was the number of citizen complaints to the Federal Trade Commission in a large number of US municipalities. The list of demographics used for any such municipality included: estimated population, median resident age, estimated median household income, percentage of citizens with a high school or higher degree, percentage of unemployed residents, percentage of married residents, percentage of foreign born residents, percentage of residents living in poverty, density of law enforcement employees, crime index, and political orientation according to the 2004 presidential election. Our study findings, based on linear regression techniques, include statistically significant relationships between the number of identity theft complaints and a non-trivial subset of these demographics. Keywords: Identity Theft, US Demographics, Regression Analysis, Statistics.
1 Introduction
Concern is growing significantly in the US (and other nations) about the variety of computer-based attacks and general fraud that fall under the name of "identity theft". While there seems to be no universally accepted definition of identity theft, all such illegal acts involve theft and/or unauthorized misuse of sensitive personal information, such as names, addresses, dates of birth, social security numbers, credit card numbers, login names and passwords. Typical examples of identity theft include using someone else's personal information to open credit accounts, make financial transactions, and obtain goods and services. Less typical but related examples include illegal immigration and espionage. Identity theft was almost non-existent before the rise of the Internet and, until a few years ago, was not even recognized as a crime. In a famous news event (sometimes credited with an important role in motivating US legislation on the topic), the thief would repeatedly call and mock his victims, sometimes describing his illegal acts in the
victim's name, and stress that he was not violating any law [1]. Despite being a relatively recent phenomenon, identity theft today has a huge impact on the US population. Recent statistics indicate identity theft as the highest-percentage source of consumer fraud-related complaints to the Federal Trade Commission (FTC) in both 2005 and 2006 [2]. Several entities are significantly increasing their attention to the identity theft phenomenon. First, US Congress is instituting laws for citizens and financial institutions (see, e.g., [3,4,5]). Major challenges in preventing identity theft are in guaranteeing that sensitive identifying information, ranging from names and addresses to social security numbers and passwords, is used cautiously and responsibly by individuals, especially in the context of online banking. Accordingly, financial institutions are providing non-trivial technical solutions and infrastructures to comply with recent laws (see, e.g., [6,7]). Academic and industry research communities (from computer science and, specifically, the security and cryptography communities; see, e.g., [8,9]) are helping in the creation of new solutions. Despite all these legislative, technical and research efforts, it is widely believed that a large number of attacks related to identity theft will always succeed, mainly because of individuals' lack of education on the topic. An example which is very familiar to computer users is the following: several identity thief wannabes send e-mails to millions of individuals, where they claim to represent a well-known bank; furthermore, they claim that the individual's bank account access has been interrupted and she/he needs to re-establish it by entering sensitive information on the thief's web site. Such attacks, often referred to as (one type of) phishing attack, may be successful regardless of how sophisticated the access system is at the individual's real bank, thus negating the mentioned security efforts by the financial institutions and research communities. On the other hand, it has been suggested by many that improving individual education on this topic may make such attacks less and less successful. More generally, a deeper understanding of this topic has been argued to provide more accurate guidance for government, financial and research institutions' efforts in citizen education, counter-theft initiatives and proposals. Thus, we argue that statistical studies on identity theft are very much needed today, as they can shed light on several features of identity theft victims, including the social environment under which identity theft acts are successful, and help towards the mentioned efforts and initiatives.

Investigation Details. Using linear regression techniques (see, e.g., [10,11,12]), we investigated the existence of statistical relationships between the number of identity theft complaints, as reported by the Federal Trade Commission (FTC) in [13], and various US demographics, collected from [14], representing features about the victims and the social environment associated with such complaints.

Identity theft data. In [15,13] the FTC's Identity Theft Clearinghouse provides identity theft data of interest for the year 2006, such as the total number of identity theft complaints by residents, as well as the number and percentage of complaints related to identity theft of specific types (e.g., credit card fraud, bank fraud, loan fraud, phone or utilities fraud, etc.), for each of the 379 municipalities considered in our analysis. The first one of these 379 entries is replicated below.
fraud, loan fraud, phone or utilities fraud, etc.), for each of the 379 municipalites considered in our analysis. The first one of these 379 entries is replicated below. Abilene, TX Metropolitan Statistical Area Theft Type Phone or Utilities Fraud Credit Card Fraud Bank Fraud Employment-Related Fraud Loan Fraud Government Documents or Benefits Fraud Other Identity Theft Attempted Identity Theft Total:
Complaints 33 22 21 15 11 8 36 11 130
Percentage 25.4 % 16.9 % 16.2 % 11.5 % 8.5 % 6.2 % 27.7 % 8.5 %
We remark that in the above table the number of total complaints is smaller than the sum of the complaints for each theft type, since each identity theft complaint may be associated with more than one theft type. This also explains why the sum of all percentages is > 100%. The response variable in our analysis will contain the number in the above 'Total' field, for each of the municipalities.

Demographic data. Our second data source is [14], listing about 100 demographic variables (as of 2006) for all municipalities in the US with at least 6000 inhabitants, from which we selected, because of their potential relevance to identity theft, the following 11 variables (also referred to in the rest of this paper using the following number and/or abbreviation):
1. estimated population (briefly, population);
2. median resident age (briefly, age);
3. estimated median household income (briefly, income);
4. percentage of residents at least 25 years old with a high school or higher degree (briefly, hsdegree);
5. percentage of unemployed residents at least 25 years old (briefly, unemployment);
6. percentage of married residents at least 15 years old (briefly, married);
7. percentage of foreign-born residents (briefly, foreign);
8. a "city-data.com" crime index (briefly, crimeindex), where a higher value means more crime;
9. percentage of residents living in poverty (briefly, poverty);
10. density of law enforcement employees (briefly, lawenfo);
11. percentage of votes obtained for one of the two major political parties in the 2004 US presidential election (briefly, politics); we omit the name of the party considered to avoid partisan interpretations.
The explanatory variables in our analysis will be the entries for the above demographics, for each of the municipalities from [13]. Because of mismatches between the two sources (i.e., some municipalities in the first source actually group results from 2 or more municipalities in the second source), the number of usable observations was reduced to 223 (also listed in Appendix A).
While intuition seems to provide a number of suggestions for the existence of a relationship between these demographic categories and identity theft occurrences, a precise answer is far from clear. After a brief examination, it appears obvious that for many features in the above list, one can provide valid arguments towards a stronger or weaker relationship between identity theft and the specific feature. As an example, we may wonder how the number of identity theft complaints relates to individual age, across cities nationwide. On the one hand, it might seem that older individuals may encounter more identity theft, as they may have a lower chance of using appropriate computer security protections. On the other hand, younger individuals may be less careful about protecting their identity. Similar ambiguities can be raised for other features. One objective of this study is to provide statistical evidence on how such ambiguities should be resolved.

Sketch of Results and Techniques. Our results provide a linear regression model where the response variable is the number of identity theft complaints and the explanatory variables are the mentioned demographics. Techniques used in our analysis within the theory of linear regression include: analysis of scatter, residual and qq-norm plots; analysis of leverage, DFFITS, DFBETAS, Cook's distance, and variance inflation factor metrics; model selection via AIC, RSE, coefficient of determination, variance inflation, t-tests, F-tests and regularized regression; and model validation via all-subset prediction-based cross-validation, backward elimination based on variance inflation and t-tests, and performance on prediction errors. To deal with some minor collinearity among the explanatory variables, part of our analysis applies a modified t-test based on variance inflation factors. The main objective of this study was explanatory, in that we aimed at finding an appropriate set of explanatory variables and an appropriate model that would explain much of the variation in the response variable. Quantitatively speaking, we aimed, among other things, at a high coefficient of determination (a.k.a. adjusted R²). We obtained an adjusted R² value of 0.764, which seems to confirm a strong linear relationship between the considered response and explanatory variables (in addition to high prediction capability).
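A sketch of the kind of model fit underlying these results, using numpy/pandas/statsmodels (assumed available). The data below are synthetic stand-ins, not the FTC or city-data.com values, and only three of the eleven demographics are mimicked; the transformations (logarithm of the response and of population, then standardization) follow those described in Section 2.1.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 223                                   # number of usable municipalities in the study
# Placeholder demographics: random stand-ins, NOT the real city-data.com values.
df = pd.DataFrame({
    "population": rng.lognormal(11, 1, n),
    "income": rng.normal(45000, 12000, n),
    "crimeindex": rng.normal(300, 80, n),
})
# Synthetic response loosely tied to population, only so that the script runs end to end.
df["complaints"] = np.exp(0.9 * np.log(df["population"]) - 6 + rng.normal(0, 0.3, n))

# Transformations: log of the response (and of population), then standardization.
y = np.log(df["complaints"])
X = pd.DataFrame({"log_population": np.log(df["population"]),
                  "income": df["income"], "crimeindex": df["crimeindex"]})
y = (y - y.mean()) / y.std()
X = (X - X.mean()) / X.std()

model = sm.OLS(y, sm.add_constant(X)).fit()
print(model.rsquared_adj)                 # adjusted R^2, the explanatory criterion emphasized above
print(model.pvalues)                      # per-coefficient t-test p-values
```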
2 Dataset Analysis and Modeling
Under the assumption that there exists a linear relationship between the number of identity theft complaints (the response variable) and at least some of the 11 chosen demographics (the explanatory variables), we will use linear regression modeling techniques (see, e.g., [10,11,12]) to investigate the details of this relationship, keeping in mind that the main objective of this study is explanatory. The procedure used to fit the data was the Ordinary Least Squares (OLS) procedure, which is known, under appropriate assumptions, to return unbiased and minimum-variance estimators for the coefficients in the regression model.
2.1 Graphical Exploration and Transformations
All variables were used in standardized form to limit the impact of roundoff errors and to allow the comparison of the regression coefficients in the same units. A graphical exploration was performed, mainly through scatter plots between almost all pairs of variables; specifically, between the response variable and any of the explanatory variables, and between any two explanatory variables. Very few linear relationships appeared evident, suggesting that some functional transformations of the variables were necessary. Moreover, both scatter and residual plots (obtained after fitting a regression model) revealed non-constant error variance and an error distribution with tails heavier than for a normal distribution (both violations of basic assumptions underlying the linear regression modeling procedure). Applying the logarithmic transformation to the response variable (before standardizing it) seemed to significantly move towards eliminating both problems. In total, we applied the logarithmic transformation to the response variable and to 2 of the 11 explanatory variables: population and foreign. After these transformations were performed, a linear relationship between the response variable and some of the explanatory variables was much clearer (see Figure 1). For some other explanatory variables, however, the scatter plots still did not seem to reveal enough information to determine whether a linear relationship with the response variable exists or not (see Figure 2). A number of other transformations did not appear to increase visual evidence for linear relationships.
Fig. 1. Four scatter plots with seemingly linear relationships with explanatory variables 1, 4, 7, 8. (R2 values were 0.56, 0.07, 0.17, 0.13, respectively.)
Fig. 2. Four scatter plots with unclear relationships between the response variable and explanatory variables 3, 5, 10, 11
2.2 Verification of Linear Regression Assumptions
Before any data transformation, there seemed to be problems in verifying all basic assumptions behind the OLS-based linear regression methodology (i.e., linearity of the relationship between the response and predictor variables; independence of the errors; homoscedasticity, or constant variance, of the errors; and normality of the error distribution). The logarithmic transformation from Section 2.1 significantly helped in verifying the normal distribution of the errors, which otherwise seemed to have heavier tails (this was verified by plotting the qq-plot of the errors obtained by fitting a linear regression model via OLS before and after the transformation). At least three potential outliers were identified after considering the residual plots, and after evaluating the leverage, Cook's distance, DFFITS and DFBETAS metrics (see Figure 3). Specifically, high values in the last four plots were obtained for observations 15, 88, 140, which have both high leverage and a high influence on coefficients and fitted values. These observations correspond, respectively, to the following municipalities: Atlantic City, NJ (which, upon more careful data inspection, seems to have an excessively high value for crimeindex); Honolulu, HI (an excessively high value for income); and Monroe, MI (an excessively low value for politics, and quite low values for foreign and unemployed). Accordingly, all three observations were considered outliers for this study and were removed to obtain a better fit for the rest of the data.
Fig. 3. Leverage, Cook’s Distance, DFFITS and max DFBETAS Plots
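A minimal R sketch of these influence diagnostics follows; the data frame and variable names continue the earlier hypothetical sketches, and the cutoffs shown are common rules of thumb rather than the exact thresholds used in the study.

# Fit the model on the standardized, transformed data and compute the four
# influence metrics summarized in Figure 3.
fit_std <- lm(complaints ~ ., data = msa_std)

lev <- hatvalues(fit_std)                      # leverage
cd  <- cooks.distance(fit_std)                 # Cook's distance
dff <- dffits(fit_std)                         # DFFITS (influence on fitted values)
dfb <- apply(abs(dfbetas(fit_std)), 1, max)    # max |DFBETAS| over all coefficients

# Observations that stand out on several metrics are candidate outliers
# (in the study: observations 15, 88, 140).
candidates <- which(lev > 2 * mean(lev) & cd > 4 / nrow(msa_std))
msa_clean  <- if (length(candidates) > 0) msa_std[-candidates, ] else msa_std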
The data fitting of the final model was evaluated both with and without these three observations, and this evaluation confirmed that an overall better fit was obtained by omitting these three outliers. This was not the case for a few other observations. Specifically, some other observations (including 190, 129, 69, 176, 161) seemed to have a moderately high leverage but a relatively normal influence, and thus were not removed; and observation 175 seemed to have medium leverage and influence, but was not considered for removal. Even after the mentioned transformations and outlier removals, there seemed to be some problems in verifying some basic assumptions; specifically, the symmetry of errors, the model sufficiency and the constant-variance assumption. In particular, while the standardized residuals vs fitted values plot showed symmetric residuals, the residuals vs leverage plot and especially the residuals vs fitted values plot showed some non-trivial curvature of apparently quadratic type. This problem did not go away even after a number of transformations were applied to the response and explanatory variables. Simultaneously resolving these problems gave rise to the formulation of different candidate models.

An examination of the variance inflation factor (VIF) over the 11 explanatory variables revealed that all VIF values were very low, except for 2 of them: variable 9, with a VIF value of 4.973, and variable 3, with a VIF value of 3.829. The average VIF value was 2.495. (Different rules of thumb exist for when a variable has high collinearity with all the others: either a VIF value > 10 [10] or even > 5 [16].) Accordingly, by assuming that all such VIF values were satisfactory and that the curvature effects on the residual plots could be ignored, a first candidate model (see Section 2.4 for some features of this model) was produced as follows:
1. a preliminary linear model was obtained via the OLS technique using the 11 explanatory variables and the original dataset (minus 3 outliers);
2. model variable selection was performed on the model obtained in 1, using backward variable elimination based on the variables' t-values, where the elimination was repeated until all remaining variables passed the t-test (a sketch of this elimination loop is given below).

However, this model is tagged as one for which we had some potential problems in verifying some of the basic assumptions underlying the OLS-based linear regression methodology.
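The following R sketch illustrates step 2 (backward elimination driven by t-values); it is an illustrative loop over the hypothetical data frame msa_clean, not the exact code used in the study, and the |t| > 2 cutoff is an approximation of the t-test at the usual 5% level.

# Repeatedly drop the variable with the smallest |t| until every remaining
# variable passes the t-test.
current <- setdiff(names(msa_clean), "complaints")
repeat {
  fit   <- lm(reformulate(current, response = "complaints"), data = msa_clean)
  tvals <- summary(fit)$coefficients[-1, "t value"]   # drop the intercept row
  worst <- names(which.min(abs(tvals)))
  if (abs(tvals[worst]) >= 2) break                   # all remaining variables pass
  current <- setdiff(current, worst)                  # eliminate the weakest variable
}
summary(fit)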
2.3 Interaction Variables
Since the curvature effect on the residual plots required further study, we studied the interactions between the 11 explanatory variables, by considering all pairwise interactions. We added 55 interaction variables Ii,j, where (i, j) ∈ {1, ..., 11}^2 and i < j. After this addition, the curvature problem completely disappeared, in that all residual plots showed no curvature or pattern, and the normality assumption was well verified as well. On the other hand, an examination of the variance inflation factor (VIF) over the 66 variables (11 original explanatory + 55 interaction variables) revealed that many of the VIF values were quite high, and their mean was very high as well (12.381). The number of interaction variables was then reduced using a VIF-based adjusted t-test; that is, testing whether the interaction variables would not pass the regular t-test even ignoring the effect due to collinearity with the other variables, as measured via the VIF values. Specifically, VIF-based adjusted t-values were obtained by multiplying the original t-values by the square root of the VIF value (this is based on the fact that the square root of a VIF value shows how much larger the standard error is, compared with what it would be if that variable were uncorrelated with the other explanatory variables). This approach reduced the number of interaction variables from 55 to 34, since only for 34 out of the 55 interaction variables was the VIF-based adjusted t-value above 2. A regression model was re-fit using these 34 interaction variables only, and this model had essentially unchanged residual plots; namely, with no significant pattern or curvature, and satisfying the normality assumption. A second candidate model (see Section 2.4 for some features of this model) was generated as follows (a sketch of the interaction screening appears after this list):

1. a preliminary linear model was obtained via the OLS technique using the 11 demographic variables and the remaining 34 interaction variables on the original dataset (minus 3 outliers);
2. model variable selection was performed on the model obtained in 1, using backward variable elimination based on the variables' t-values, where the elimination was repeated until all remaining variables passed the t-test;
3. an outlier analysis similar to the one in Section 2.2 was performed on the final model, observations 68, 127, 172 were considered outliers and removed, and the OLS technique was applied again on the original dataset (minus 6 outliers).
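A minimal R sketch of the interaction construction and the VIF-based screening follows; it uses the car package for vif() (an assumption, since the paper does not name a package) and the hypothetical names from the earlier sketches.

# Add all 55 pairwise interaction variables I_i_j to the data.
library(car)

demog <- setdiff(names(msa_clean), "complaints")
inter <- msa_clean
for (i in 1:(length(demog) - 1)) {
  for (j in (i + 1):length(demog)) {
    inter[[paste0("I_", i, "_", j)]] <- msa_clean[[demog[i]]] * msa_clean[[demog[j]]]
  }
}

# VIF-based adjusted t-test: multiply each |t| by sqrt(VIF) and keep the
# interaction variables whose adjusted |t| exceeds 2.
fit_all <- lm(complaints ~ ., data = inter)
tvals   <- summary(fit_all)$coefficients[-1, "t value"]
vifs    <- vif(fit_all)
adj_t   <- abs(tvals) * sqrt(vifs[names(tvals)])
keep    <- names(adj_t)[grepl("^I_", names(adj_t)) & adj_t > 2]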
However, this model is tagged as one for which the regression coefficients may have highly unreliable values and important variables may have been omitted, due to the high collinearity, as shown by the original VIF values.
2.4 Candidate Final Models
As neither of the two candidate models obtained so far seems satisfactory, we investigated the addition of some, but not all, of the interaction variables. We used the F-test to test whether certain classes of subsets of the interaction variables should be added or not. The classes of subsets that we tested were:

1. the 'all-but-one subsets' (i.e., for i = 1, ..., 34, the null hypothesis in the i-th test said that all but the i-th interaction variable were = 0);
2. the 'all-but-two subsets' (i.e., for i1, i2 = 1, ..., 34, the null hypothesis in the (i1, i2)-test said that all but the i1-th and i2-th interaction variables were = 0).

For the tests in class 1 above, the null hypothesis was always accepted, thus suggesting that either all interaction variables should not be included or more than a single interaction variable should be included. Interestingly, it was accepted with a significantly higher margin for i = 1, 5 (in which case the associated adjusted R2 value was also significantly higher). For the tests in class 2 above, the null hypothesis was only accepted when both i1 and i2 were in {1, 3, 4, 5}, thus suggesting that a model with at least two interaction variables among the first, third, fourth, or fifth in the previously obtained sequence of 34 interaction variables (specifically, I1,2, I1,6, I1,7, I1,8) might be a potentially good fit for the data. Note that there are 11 such models. As a consequence of these F-tests, one could generate 11 new candidate models to choose from. We decided to narrow down this number using cross-validation and regularized regression.
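As an illustration, one of these partial F-tests can be carried out in R by comparing nested OLS fits with anova(); the variable names continue the hypothetical sketches above, and the particular pair of interaction variables shown is only an example.

# Reduced model: the 11 demographics plus two interaction variables.
# Full model: the 11 demographics plus all screened interaction variables.
reduced_vars <- c(demog, c("I_1_2", "I_1_8"))
full_vars    <- c(demog, union(keep, c("I_1_2", "I_1_8")))  # ensure the models are nested

reduced <- lm(reformulate(reduced_vars, response = "complaints"), data = inter)
full    <- lm(reformulate(full_vars,    response = "complaints"), data = inter)

# F-test of H0: the interaction coefficients dropped in the reduced model are all 0.
anova(reduced, full)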
2.5 Cross Validation
We used the multi-fold cross-validation technique, using as explanatory variables the original 11 demographic variables and the 34 interaction variables that passed the VIF-based adjusted t-test. (Here, the reduction from 55 to 34 such variables was crucial to significantly reduce the computation time of this procedure.) Our cross-validation experiment consisted of the following steps (a sketch follows the list):

1. Repeat the following 100 times:
   - randomly and independently split the data into a training subset (of density 2/3) and a testing subset (of density 1/3);
   - for each of the 2^(11+34) = 2^45 subsets of variables, fit a linear regression model using OLS with this subset of variables on the training data subset and record the prediction error of this model on the testing data subset;
   - among these 2^45 models, choose the one with the minimum prediction error on the testing subset, and increment (by 1) the 'winning counter' for each of the variables in this model.
2. Return the winning counters for all 45 variables.
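A minimal R sketch of this counting procedure follows. Scoring every one of the 2^45 subsets is shown here only schematically: the candidate subsets are assumed to be supplied in a list called subsets (each element a character vector of variable names), and the names and the seed are illustrative assumptions.

set.seed(1)   # any seed, for reproducibility of the random splits
all_vars <- c(demog, keep)
wins <- setNames(integer(length(all_vars)), all_vars)

for (rep in 1:100) {
  idx   <- sample(nrow(inter), size = round(2 / 3 * nrow(inter)))
  train <- inter[idx, ]
  test  <- inter[-idx, ]

  # Prediction error of each candidate subset of variables on the testing data.
  pe <- sapply(subsets, function(v) {
    m <- lm(reformulate(v, response = "complaints"), data = train)
    mean((test$complaints - predict(m, newdata = test))^2)
  })

  best <- subsets[[which.min(pe)]]
  wins[best] <- wins[best] + 1    # increment the winning counter of each variable
}
wins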
explanatory  winning   interaction  winning   interaction  winning   interaction  winning
variable     count     variable     count     variable     count     variable     count
population   100       I(1,2)       36        I(3,5)       0         I(5,8)       0
age          94        I(1,4)       2         I(3,6)       0         I(5,9)       1
income       13        I(1,6)       26        I(3,8)       2         I(5,10)      0
hsdegree     100       I(1,7)       27        I(3,10)      0         I(5,11)      2
unemployed   8         I(1,8)       87        I(3,11)      0         I(6,7)       1
married      100       I(2,3)       12        I(4,5)       2         I(6,8)       0
foreign      100       I(2,5)       7         I(4,6)       2         I(6,10)      1
crimeindex   84        I(2,6)       7         I(4,8)       0         I(6,11)      1
poverty      46        I(2,8)       0         I(4,9)       1         I(7,11)      18
lawenfo      24        I(2,9)       1         I(5,6)       4         I(8,9)       0
politics     0         I(3,4)       0         I(5,7)       0         I(8,10)      0
                                                                     I(9,11)      1

Fig. 4. Winning counters for all explanatory and interaction variables
The winning counters returned by the multi-fold cross-validation technique are reported in Figure 4. Considering that this study is of interpretation type, we selected all variables with a > 33% occurrence, thus resulting in a model containing 7 out of the 11 original explanatory variables: population, age, hsdegree, married, foreign, crimeindex and poverty; and 2 out of the 34 interaction variables: I1,2, I1,8. Using the OLS procedure on this subset of variables, we obtained another linear regression model, and performed an outlier analysis of this model similar to the one in Section 2.2. This caused the removal of observations 68, 127, 172 and a re-application of the OLS technique. The resulting model will be one of the candidate models discussed in Section 3.
2.6 Regularized Regression
As mentioned before, 2 among the 11 demographic variables had a moderate VIF value, and it is worth studying this potential collinearity effect further. Remedies to multicollinearity include, in addition to the already performed model selection via cross-validation, principal component regression and regularized regression. Principal component regression is better suited for predictive regression studies; since this study is an interpretation type of study, we opted for regularized regression. More specifically, we performed ridge regression to evaluate whether it was possible to obtain some significant shrinking of any of the coefficients among the 11 demographic variables plus the 4 interaction variables I1,2, I1,6, I1,7, I1,8 (as a result of the F-tests discussed in Section 2.4). Our ridge regression experiment consisted of the following steps:

1. Repeat the following 10 times:
   - randomly and independently split the data into a training subset (of density 4/5) and a testing subset (of density 1/5);
   - compute the value of the shrinkage factor that minimizes the prediction error;
   - for this shrinkage factor, increment (by 1) the 'elimination counter' for each of the variables that have a coefficient of less than 0.1 in this model.
2. Return the elimination counters for all 15 variables.

Using the output of our ridge regression experiment, we obtained a list of variables with coefficient less than 0.1 in absolute value at the end of the i-th execution, for i = 1, ..., 10 (see Figure 5).

Execution number   List of variables
1                  3, 5, 10, 11, 12, 13, 15
2                  3, 5, 10, 11, 12, 13
3                  3, 5, 10, 11, 12, 13, 15
4                  3, 5, 9, 10, 11, 12, 13
5                  3, 5, 10, 11, 13, 14, 15
6                  3, 5, 10, 11, 12, 13, 15
7                  3, 10, 11, 12, 13
8                  3, 5, 9, 10, 11, 12, 13, 15
9                  2, 3, 5, 6, 9, 10, 11, 12, 13
10                 3, 5, 10, 11, 12, 13

Fig. 5. List of variables having small coefficient at the end of each execution
By using the above list, we counted in how many executions (out of 10) each variable had a coefficient smaller than 0.1 in absolute value. Variables 13, 11, 10, 3, 12, 5, 15 had coefficients less than 0.1 for at least 50% of the time. Eliminating these variables results precisely in the subset of variables obtained via cross-validation in Section 2.5. As mentioned there, using the OLS procedure on this subset of variables returns another candidate linear regression model, discussed in Section 3.
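A minimal R sketch of the ridge regression step follows; glmnet is used as one possible implementation (an assumption, since the paper does not name a package), with cv.glmnet choosing the shrinkage factor that minimizes the cross-validated prediction error.

library(glmnet)

vars <- c(demog, c("I_1_2", "I_1_6", "I_1_7", "I_1_8"))  # 11 demographics + 4 interactions
x <- as.matrix(inter[, vars])
y <- inter$complaints

# alpha = 0 selects ridge regression; lambda is picked by cross-validation.
cvfit <- cv.glmnet(x, y, alpha = 0)
coefs <- as.matrix(coef(cvfit, s = "lambda.min"))[-1, , drop = FALSE]  # drop intercept

# Variables whose ridge coefficient is below 0.1 in absolute value are
# candidates for elimination, as in the procedure above.
rownames(coefs)[abs(coefs) < 0.1]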
3 Choice of Final Model
Thanks to the cross-validation and ridge regression experiments, we have reduced our preliminary set of 13 candidate final models to just 3: model M1 from Section 2.2, model M2 from Section 2.3, and model M3 from Sections 2.5 and 2.6. As previously discussed, model M1 assumes that all VIF values associated with the original 11 demographic explanatory variables are satisfactory and that the curvature effects on the residual plots can be ignored. Thus, this model is understood to have potential problems with some of the basic assumptions underlying the OLS-based linear regression methodology. As previously discussed, for model M2 we apparently can verify all 5 basic assumptions underlying the OLS-based linear regression methodology.
Metrics                Model M1    Model M2    Model M3
RSE                    0.5079      0.4715      0.4849
adjR2                  0.7439      0.7806      0.764
averageVIF             2.0265      1.896       1.9758
maxVIF                 3.0364      2.7242      3.3494
AIC                    337.0569    306.8769    313.4442
number of variables    8           15          9
Fig. 6. Quantitative comparison of the 3 candidate models
Fig. 7. Residual plots for the final model
However, model M2 is understood to have highly unreliable values for the coefficients and to have potentially omitted several important variables. Specifically, this is due to the high collinearity, as shown by the VIF values of the original 11 demographic explanatory variables plus the 34 interaction variables obtained from the 55 interaction variables using a VIF-based adjusted t-test. Model M3 verifies all 5 basic assumptions underlying the OLS-based linear regression methodology, even with somewhat higher evidence than model M2. It has low VIF values. Model selection based on both the t-test and the F-test is reliable, and so are the values for the regression coefficients. Qualitatively, this model offers the best tradeoff between verification of the linear regression assumptions, reliable model selection and reliable regression coefficient values. Quantitatively,
this model also seems close enough to being the best (i.e., model M2), as indicated by three metrics (RSE, adjR2, AIC) in Figure 6. It is also worth noting that the obtained adjR2 is sufficiently high for this interpretation study in all three models.
4 Conclusions
In this paper we report on a statistical study about the existence of relationships between identity theft and area demographics in the US. Our study findings, based on linear regression techniques, include statistically significant relationships between the number of identity theft complaints and a non-trivial subset of the chosen area demographics. We obtain that the number of identity theft complaints in US municipalities appears to have linear and statistically significant dependency on the following variables:

1. estimated population;
2. median resident age;
3. percentage of residents above 25 with a high school or higher degree;
4. percentage of married residents above 25;
5. percentage of foreign residents;
6. the crime index (as computed from [14]);
7. the percentage of residents below poverty level;
8. the product of population and median age;
9. the product of population and the crime index.
The dependence is positive on variables 1, 2, 5, 6, 9 in this list, and negative on variables 3, 4, 7, 8. There seems to be no statistically significant dependency on median resident income, percentage of unemployed residents, density of law enforcement employees, and percentage of voters of one of the two major political parties at the time of the 2004 presidential elections. Apart from some minor curvature effects in the residual plots (see Figure 7), the assumptions behind the OLS-based linear regression modeling methodology seem to be verified with sufficient accuracy. The explanatory power of this model is considered very high, as verified with a number of metrics, including a very high coefficient of determination (i.e., 0.764). In what follows, we include the quantitative details of our final linear regression model, including values for the regression coefficients' estimates and standard error, the model's RSE, adjusted R2, p-value, and the VIF values.

Residuals:
     Min       1Q   Median       3Q      Max
-1.08657 -0.32042 -0.03326  0.28984  1.22313

Coefficients:
                  Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)      -0.007518    0.033006   -0.228  0.820041
log(population)   0.722802    0.040582   17.811   < 2e-16  ***
age               0.174883    0.058220    3.004  0.002995  **
hsdegree         -0.213929    0.043983   -4.864  2.28e-06  ***
married          -0.257040    0.056967   -4.512  1.08e-05  ***
foreign           0.323288    0.040923    7.900  1.62e-13  ***
crimeindex        0.165432    0.043994    3.760  0.000221  ***
poverty          -0.133265    0.060479   -2.204  0.028662  *
inter-1-2        -0.120747    0.041567   -2.905  0.004073  **
inter-1-8         0.132059    0.035328    3.738  0.000240  ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4849 on 207 degrees of freedom
Multiple R-Squared: 0.7738, Adjusted R-squared: 0.764
F-statistic: 78.69 on 9 and 207 DF, p-value: < 2.2e-16
VIF values: 1.420 2.594 1.744 2.813 1.501 1.627 3.349 1.587 1.146
Mean VIF value: 1.9758
Acknowledgements. Many thanks go to Rebecka Jornsten and Dennis Egan for interesting conversations on related topics, and to the ISIPS 08 committee for their comments on this paper.
References

1. US Dept. of Justice website, http://www.usdoj.gov/criminal/fraud/websites/idtheft.html
2. Consumer Fraud and Identity Theft Complaint Data 2005 and 2006, FTC, http://www.consumer.gov/sentinel/pubs/Top10Fraud{2005,2006}.pdf
3. The 1998 Identity Theft and Assumption Deterrence Act, http://www.ftc.gov/os/statutes/itada/itadact.htm
4. Announcement of the Release of the Presidents Identity Theft Task Force Comprehensive Strategic Plan (April 2007), http://www.sec.gov/news/press/2007/2007-69.htm
5. Law enforcement publications related to identity theft from the Federal Trade Commission, http://www.ftc.gov/bcp/edu/microsites/idtheft/law-enforcement-publications.html
6. http://internetcommunications.tmcnet.com/topics/broadband-mobile/articles/13757-fed-takes-steps-against-id-theft.htm
7. http://www.forbes.com/markets/feeds/afx/2007/10/31/afx4284882.html
8. ACM CCS 2007 Workshop on Digital Identity Management (DIM 2007) (November 2007), http://www2.pflab.ecl.ntt.co.jp/dim2007/
9. DIMACS Workshop on Theft in E-Commerce: Content, Identity, and Service, DIMACS Center, CoRE Building, Rutgers University, Piscataway, NJ (April 2005), http://dimacs.rutgers.edu/Workshops/Intellectual/
10. Kutner, M., Nachtsheim, C., Neter, J.: Applied Linear Regression Models, 4th edn. McGraw Hill, New York (2004)
11. Miller, A.: Subset Selection in Regression, 4th edn. Chapman and Hall, Boca Raton (1990)
12. DeGroot, M., Schervish, M.: Probability and Statistics, 3rd edn. Addison-Wesley, Reading (2002)
13. Federal Trade Commission's Consumer Sentinel Network, http://www.ftc.gov/sentinel/reports/sentinel-annualmsa-reports/topidttheft-msa2006.pdf
14. http://www.city-data.com/
15. Federal Trade Commission's Consumer Sentinel Network, http://www.ftc.gov/sentinel/
16. http://en.wikipedia.org/wiki/Variance_inflation_factor
17. Di Crescenzo, G., Cochinwala, M., Shim, H.: Modeling Cryptographic Properties of Voice and Voice-based Entity Authentication. In: Proc. of ACM CCS 2007 Workshop on Digital Identity Management (DIM 2007). ACM Press, New York (2007)
18. Di Crescenzo, G., Lipton, R., Walfish, S.: Perfectly-Secure Password Protocols in the Bounded Retrieval Model. In: Halevi, S., Rabin, T. (eds.) TCC 2006. LNCS, vol. 3876, pp. 225–244. Springer, Heidelberg (2006)
19. De Santis, A., Di Crescenzo, G., Persiano, G.: Communication-Efficient Group Identification. In: Proc. of 5th ACM Conference on Computers and Communication Security (ACM CCS 1998). ACM Press, New York (1998)
A Municipalities
Abilene, TX; Adrian, MI; Akron, OH; Albuquerque, NM; Alexandria, LA; Altoona, PA; Amarillo, TX; Anchorage, AK; Anderson, IN; Anderson, SC; Ann Arbor, MI; Appleton, WI; Asheville, NC; Ashtabula, OH; Atlantic City, NJ; Bakersfield, CA; Bangor, ME; Barnstable Town, MA; Baton Rouge, LA; Battle Creek, MI; Bay City, MI; Bellingham, WA; Bend, OR; Billings, MT; Binghamton, NY; Bloomington, IN; Bluefield, WV-VA; Boulder, CO; Bowling Green, KY; Burlington, NC; Cedar Rapids, IA; Chambersburg, PA; Charleston, WV; Charlottesville, VA; Chattanooga, TN-GA; Chico, CA; Clarksville, TN-KY; Cleveland, TN; Coeur d’Alene, ID; Colorado Springs, CO; Columbia, MO; Columbia, SC; Columbus, GA-AL; Columbus, OH; Concord, NH; Corpus Christi, TX; Cumberland, MD-WV; Dalton, GA; Danville, VA; Dayton, OH; Decatur, AL; Decatur, IL; Des Moines, IA; Dothan, AL; Dover, DE; Duluth, MN-WI; Dunn, NC; Durham, NC; Eau Claire, WI; El Centro, CA; Elizabethtown, KY; El Paso, TX; Erie, PA; Evansville, IN-KY; Fargo, ND-MN; Farmington, NM; Fayetteville, NC; Flagstaff, AZ; Flint, MI; Florence, SC; Fort Smith, AR-OK; Fort Wayne, IN; Fresno, CA; Gadsden, AL; Gainesville, FL; Gainesville, GA; Glens Falls, NY; Goldsboro, NC; Grand Junction, CO; Greeley, CO; Green Bay, WI; Greenville, NC; Greenville, SC; Harrisonburg, VA; Hattiesburg, MS; Hilo, HI; Homosassa Springs, FL; Honolulu, HI; Huntsville, AL; Idaho Falls, ID; Indianapolis, IN; Iowa City, IA; Ithaca, NY; Jackson, MI; Jackson, MS; Jackson, TN; Jacksonville, FL; Jacksonville, NC; Janesville, WI; Jefferson City, MO; Johnson City, TN; Johnstown, PA; Jonesboro, AR; Joplin, MO; Kansas City, MO-KS; Kingston, NY; Knoxville, TN; Kokomo, IN; La Crosse, WI-MN; Lafayette, IN; Lafayette, LA; Lake Charles, LA; Lakeland, FL; Lancaster, PA; Laredo, TX; Las Cruces, NM; Lawrence, KS; Lawton, OK; Lebanon, NH-VT; Lebanon, PA; Lima, OH; Lincoln, NE; Longview, TX; Louisville, KY-IN; Lubbock, TX; Lynchburg, VA; Macon, GA; Madera, CA; Madison, WI; Manhattan, KS; Mansfield, OH; Medford, OR; Memphis, TN-MS-AR; Merced, CA; Meridian, MS; Midland, TX; Mobile, AL;
Modesto, CA; Monroe, LA; Monroe, MI; Montgomery, AL; Morgantown, WV; Morristown, TN; Muncie, IN; Napa, CA; New Bern, NC; Ocala, FL; Ocean City, NJ; Odessa, TX; Oklahoma City, OK; Olympia, WA; Owensboro, KY; Peoria, IL; Pittsburgh, PA; Pittsfield, MA; Pottsville, PA; Prescott, AZ; Pueblo, CO; Punta Gorda, FL; Racine, WI; Rapid City, SD; Reading, PA; Redding, CA; Richmond, VA; Roanoke, VA; Rochester, MN; Rochester, NY; Rockford, IL; Rocky Mount, NC; Roseburg, OR; Salem, OR; Salinas, CA; Salisbury, MD; Salisbury, NC; Salt Lake City, UT; San Angelo, TX; San Antonio, TX; Santa Fe, NM; Savannah, GA; Sheboygan, WI; Sioux Falls, SD; Spartanburg, SC; Spokane, WA; Springfield, IL; Springfield, MA; Springfield, MO; Springfield, OH; State College, PA; St. Cloud, MN; St. George, UT; St. Joseph, MO-KS; St. Louis, MO-IL; Stockton, CA; Sumter, SC; Syracuse, NY; Tallahassee, FL; Terre Haute, IN; Toledo, OH; Topeka, KS; Torrington, CT; Traverse City, MI; Tucson, AZ; Tulsa, OK; Tupelo, MS; Tuscaloosa, AL; Tyler, TX; Valdosta, GA; Vero Beach, FL; Victoria, TX; Waco, TX; Warner Robins, GA; Wausau, WI; Wenatchee, WA; Wichita Falls, TX; Wichita, KS; Williamsport, PA; Willimantic, CT; Wilmington, NC; Wooster, OH; Worcester, MA; Yakima, WA; Yuba City, CA; Yuma, AZ.
Author Index
Arpinar, Budak 115
Baginski, Maureen 11
Bertino, Elisa 115
Di Crescenzo, Giovanni 122
Elovici, Yuval 63
Fienberg, Stephen E. 82
Goldberg, Mark 104
Horning, James J. 57
Kantor, Paul B. 1
Kelley, Stephen 104
Kisilevich, Slava 63
Landwehr, Carl 45
Lesk, Michael E. 1
Mace, Robyn R. 34
Magdon-Ismail, Malik 104
McNamara, Joan T. 95
Mertsalov, Konstantin 104
Nardi, Yuval 82
Rokach, Lior 63
Sandhu, Ravi 115
Shapira, Bracha 63
Slavković, Aleksandra B. 82
Spafford, Eugene H. 20
Thomas, Roshan K. 115
Xu, Shouhuai 115