SECURITY INFORMATICS AND TERRORISM: PATROLLING THE WEB
NATO Science for Peace and Security Series

This Series presents the results of scientific meetings supported under the NATO Programme: Science for Peace and Security (SPS). The NATO SPS Programme supports meetings in the following Key Priority areas: (1) Defence Against Terrorism; (2) Countering other Threats to Security and (3) NATO, Partner and Mediterranean Dialogue Country Priorities. The types of meeting supported are generally “Advanced Study Institutes” and “Advanced Research Workshops”. The NATO SPS Series collects together the results of these meetings. The meetings are co-organized by scientists from NATO countries and scientists from NATO’s “Partner” or “Mediterranean Dialogue” countries. The observations and recommendations made at the meetings, as well as the contents of the volumes in the Series, reflect those of participants and contributors only; they should not necessarily be regarded as reflecting NATO views or policy.

Advanced Study Institutes (ASI) are high-level tutorial courses to convey the latest developments in a subject to an advanced-level audience. Advanced Research Workshops (ARW) are expert meetings where an intense but informal exchange of views at the frontiers of a subject aims at identifying directions for future action.

Following a transformation of the programme in 2006 the Series has been re-named and reorganised. Recent volumes on topics not related to security, which result from meetings supported under the programme earlier, may be found in the NATO Science Series.

The Series is published by IOS Press, Amsterdam, and Springer Science and Business Media, Dordrecht, in conjunction with the NATO Public Diplomacy Division.

Sub-Series
A. Chemistry and Biology (Springer Science and Business Media)
B. Physics and Biophysics (Springer Science and Business Media)
C. Environmental Security (Springer Science and Business Media)
D. Information and Communication Security (IOS Press)
E. Human and Societal Dynamics (IOS Press)

http://www.nato.int/science
http://www.springer.com
http://www.iospress.nl
Sub-Series D: Information and Communication Security – Vol. 15
ISSN 1874-6268
Security Informatics and Terrorism: Patrolling the Web Social and Technical Problems of Detecting and Controlling Terrorists’ Use of the World Wide Web
Edited by
Cecilia S. Gal Rutgers University, USA
Paul B. Kantor Rutgers University, USA
and
Bracha Shapira Ben-Gurion University of the Negev, Israel
Amsterdam • Berlin • Oxford • Tokyo • Washington, DC Published in cooperation with NATO Public Diplomacy Division
Proceedings of the NATO Advanced Research Workshop on Security Informatics and Terrorism – Patrolling the Web Beer-Sheva, Israel 4–5 June 2007
© 2008 IOS Press. All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, without prior written permission from the publisher.

ISBN 978-1-58603-848-9
Library of Congress Control Number: 2008922759

Publisher
IOS Press
Nieuwe Hemweg 6B
1013 BG Amsterdam
Netherlands
fax: +31 20 687 0019
e-mail: [email protected]

Distributor in the UK and Ireland
Gazelle Books Services Ltd.
White Cross Mills
Hightown
Lancaster LA1 4XS
United Kingdom
fax: +44 1524 63232
e-mail: [email protected]

Distributor in the USA and Canada
IOS Press, Inc.
4502 Rachael Manor Drive
Fairfax, VA 22032
USA
fax: +1 703 323 3668
e-mail: [email protected]
LEGAL NOTICE The publisher is not responsible for the use which might be made of the following information. PRINTED IN THE NETHERLANDS
Security Informatics and Terrorism: Patrolling the Web C.S. Gal et al. (Eds.) IOS Press, 2008 © 2008 IOS Press. All rights reserved.
Editors’ Introduction 1
This book is based on presentations given at the NATO Advanced Research Workshop on “Security Informatics and Terrorism – Patrolling the Web”, held in June 2007 in Israel. The two-day workshop included presentations by scholars, researchers and practitioners from several related fields (computer science, information retrieval, and data mining, as well as terrorism and intelligence studies). The speakers and participants were scientists from prestigious institutes, security consulting firms, representatives of official government agencies, and practitioners. The workshop’s primary aim was to initiate a discussion and sharing of ideas among experts in quite diverse disciplines, and especially between computer scientists and terrorism experts. Most speakers were recruited by the directors, based on their academic and professional experience and performance. All talks specifically included introductory and summary remarks that were clear to the full range of experts participating. During the closing panel discussion, some conclusions about the problems of patrolling the Web, and desired action items, were drawn up.
Audience

This work is intended to be of interest to counter-terrorism experts and professionals, to academic researchers in information systems, computer science, political science, and public policy, and to graduate students in these areas.

The goal of this book is to highlight several aspects of patrolling the Web that were raised and discussed during the workshop by experts from different disciplines. The book includes academic studies from related technical fields, namely computer science and information technology; the strategic point of view, as presented by intelligence experts; and finally the practical point of view, presented by experts from related industry describing lessons learned from practical efforts to tackle these problems.

This volume is organized into four major parts: definition and analysis of the subject, data-mining techniques for terrorism informatics, other theoretical methods to detect terrorists on the Web, and practical relevant industrial experience on patrolling the Web.

Part I addresses the current status of the relationship between terrorists and the Internet. The presenters are experienced intelligence experts; they describe the causes and impacts of terrorists’ use of the Web, the current status, and the governmental responses, and provide an overview of methods for the detection and prevention of terrorist use of the Web in different parts of the world.

Part II addresses data and Web mining techniques for identifying and detecting terrorists on the Web. The presenters are primarily computer scientists, and they present recent studies suggesting data-mining techniques that are applicable for detecting terrorists and their activities on the Web.

Part III addresses theoretical techniques (other than data mining) applicable to terrorism informatics. The presenters are again computer and information scientists, but they propose computational methods that are not (presently) commonly used in terrorism informatics. These papers suggest new directions and promising techniques for the detection problem, such as visual recognition, information extraction and machine learning techniques.

Part IV reports on “learning from experience”; the presenters are industry practitioners who describe their applications and their experiences with operations attempting to patrol the Web.

Together, the participants worked to fashion a summary statement, drawing attention to the strengths and the limitations of our present efforts to patrol and limit the use of the World Wide Web by terrorist organizations. The summary statement may be regarded as representing a consensus, but the reader is cautioned that not every participant agrees with every element of the summary.

1 For a variety of reasons most, but not all, of the presentations could be included in this book. A complete list of the presenters and their contact information is given in Appendix 1. Readers can find slides of some of the presentations posted on: http://cmsprod.bgu.ac.il/Eng/conferences/Nato/Presentations.htm
Summary Statement

As the proceedings show, a wide range of topics were discussed, and several points of view were presented. In a final session reflecting upon the entire workshop, participants identified a number of key points which should be kept in mind for future studies and efforts to limit the effectiveness of the World Wide Web as an aid to terrorists. These key points can be divided into two areas: 1) Social/Policy Issues, and 2) narrower Technical Issues.
1. Social and Policy Issues

The Internet is used by terrorists for various activities such as recruitment, propaganda, operations, etc., without their physically meeting. But the Internet can be used to track the conspirators once they are identified. Some activities are open, others are hidden. The struggle against terrorism is the quintessential example of asymmetric warfare. In addition, terrorism stands on the ill-defined boundary where criminality, warfare, and non-governmental actors meet.

It was agreed that, collectively, nations have the resources needed to counter terrorism, but that it is essential to share information in order to combat the geographically distributed nature of the terrorist organizations, which may have a very small footprint in any one nation. Therefore a key recommendation of the workshop is that: International cooperation is required among intelligence and law enforcement experts and computer scientists.

Other policy issues concern the interplay between the called-for cooperation and the rights of individual citizens. This may be formulated as a technical question: “Can data mining truly protect privacy when the data is held and mined by ‘distrusted’ custodians?”. This important question was not addressed at the present workshop, but should definitely be on the agenda for future research.

At the interface between policy and technical matters, several participants stressed that some idea of a scheme, a model or a scenario is needed to interdict terrorists, because simple searching cannot cover every possibility. Therefore, whatever technical means and alarms are developed will have to be triggered by considerations of likelihood or probability, and/or by existing intelligence from other sources.
2. Technical Issues

As noted, technical means and alarms will have to be triggered by considerations of likelihood or probability, and/or by existing intelligence from other sources. In summary, a second key agreement is that: It is necessary to have people in the analytical loop, supplying human judgment.

An important point to consider in patrolling the Web for terrorism is that rates of false positives are necessarily high for any automated method of identification or discovery. As noted, the search must be moderated by some understanding of plausible scenarios. There is a clear interface to policy and social issues here: the consequences of a “false positive” (that is, naming a person, an organization, or a website as terrorist when it is not) must be weighed, and balanced, against the consequences of failing to identify terrorist activity on the Web.

It was noted that computers work “from the bottom up”, digesting large masses of data and producing indications when something is out of the ordinary. Human analysts, in contrast, work “from the top down”, guided by models or scenarios which may be drawn from previous experience, or may be suggested, for the very first time, by some configuration in the available data.

In working to make the computer a more powerful ally, it would be of immense value to have some common “challenge tasks”. This is the final finding of the workshop: It is necessary to have some model tasks which are well defined, and which have a “gold standard” known correct resolution or answer. Ideally these model challenges should be driven by the real missions of the several NATO nations.

It was noted that most of the presentations dealt with websites in English, and a few with sites in Arabic. All of the technical work needs also to be extended to other languages.

Overall, there was an extremely effective exchange of ideas and of concerns between the experts in technical/computer issues and the experts in social/policy issues.
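The point about necessarily high false-positive rates follows from simple base-rate arithmetic. The sketch below illustrates it with purely hypothetical numbers (population size, detector sensitivity, false-positive rate); none of these figures comes from the workshop or from any real system.

```python
# Why false positives dominate when true targets are rare:
# a back-of-the-envelope base-rate calculation with hypothetical numbers.

population = 10_000_000        # items (pages, users) scanned -- assumed
true_targets = 100             # genuinely terrorist-related items -- assumed
sensitivity = 0.99             # detector flags 99% of true targets -- assumed
false_positive_rate = 0.001    # detector wrongly flags 0.1% of innocent items -- assumed

true_alarms = true_targets * sensitivity
false_alarms = (population - true_targets) * false_positive_rate
precision = true_alarms / (true_alarms + false_alarms)

print(f"alarms raised:               {true_alarms + false_alarms:,.0f}")
print(f"fraction of alarms genuine:  {precision:.2%}")
```

Under these illustrative assumptions, a detector that misses only 1% of true targets and wrongly flags only 0.1% of innocent items still produces roughly ten thousand alarms of which under 1% are genuine; this is one concrete reason human judgment in the analytical loop is indispensable.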
It is highly recommended that this type of boundary-spanning workshop be expanded and replicated in the future.

Bracha Shapira, Paul Kantor, Cecilia Gal
A note on the production of this volume

The papers of this volume have been produced by a multi-step process. First, we recorded the talk given by each author at the conference in June (due to some technical difficulties a few presentations were not recorded). Next, we transcribed each recording. The authors then produced a draft of their paper from these transcriptions, refining each draft until the final version. Although the papers are not exactly like the talks given, some do retain the informal and conversational quality of the presentations. Other authors, however, preferred to include a more formal paper based on the material presented at the conference.

A few notes about the language and conventions used in the book. Since the authors in this volume come from different parts of the globe, we have tried to preserve their native cadences in the English versions of their papers. We have also decided to use both British English and American English spelling and standards, depending on the specific style the author preferred to use. Language conventions for naming entities which have more than one standard spelling – such as Al Qaeda, 9/11 or Hezbollah – were taken from the New York Times editorial practices. The formatting and style of references, when used, are consistent within each paper but vary between papers. And finally, a number of papers have pictures from screen captures of illustrations or of proprietary software. Although every effort was made to include the highest quality pictures so they would reproduce well in print, in some instances these pictures may not reproduce as well as might be desired, and we beg the reader’s indulgence.

Cecilia Gal, Rutgers
Acknowledgements

It is a pleasure to acknowledge the superb hospitality of Ben-Gurion University, which provided its magnificent faculty Senate Hall for the two days of the conference, together with excellent audio-visual support. The Deutsche Telekom Laboratory at BGU and Dr. Roman Englert provided additional hospitality for the participants. We want to thank Rivka Carmi, the President of Ben-Gurion University, for her gracious welcome; Yehudith Naftalovitz and Hava Oz for all their hard work with the conference arrangements; and Professor Fernando Carvalho Rodrigues and Elizabeth Cowan at NATO for their generous support.
Contents

Editors’ Introduction
Bracha Shapira, Paul Kantor and Cecilia Gal .... v

Part One. The Social and Political Environment

The Internet and Terrorism
Yaakov Amidror .... 3

Understanding International “Cyberterrorism”: A Law Enforcement Perspective
Marc Goodman .... 8

Global Threats – Local Challenges. Current Issues of Intelligence Tasking and Coordination in the Rapidly Changing Threat Environment
Janusz Luks .... 17

Cybercrime and Cyberterrorism as a New Security Threat
Vladimir Golubev .... 20

The Enemy Within: Where Are the Islamist/Jihadist Websites Hosted, and What Can Be Done About It?
Yigal Carmon .... 26

Part Two. Discovering Hidden Patterns of Activity

Mining Users’ Web Navigation Patterns and Predicting Their Next Step
José Borges and Mark Levene .... 45

Data Mining for Security and Crime Detection
Gerhard Paaß, Wolf Reinhardt, Stefan Rüping and Stefan Wrobel .... 56

Enhancement to the Advanced Terrorist Detection System (ATDS)
Bracha Shapira, Yuval Elovici, Mark Last and Abraham Kandel .... 71

Discovering Hidden Groups in Communication Networks
Jeff Baumes, Mark K. Goldberg, Malik Magdon-Ismail and William A. Wallace .... 82

Part Three. Knowledge Discovery in Texts and Images

Authorship Attribution in Law Enforcement Scenarios
Moshe Koppel, Jonathan Schler and Eran Messeri .... 111

Bayesian Models, Prior Knowledge, and Data Fusion for Monitoring Messages and Identifying Actors
Paul B. Kantor .... 120

Entity and Relation Extraction in Texts with Semi-Supervised Extensions
Gerhard Paaß and Jörg Kindermann .... 132

Data Compression Models for Prediction and Classification
Gordon Cormack .... 142

Security Informatics in Complex Documents
G. Agam, D. Grossman and O. Frieder .... 156

Visual Recognition: How Can We Learn Complex Models?
Joachim M. Buhmann .... 166

Approaches for Learning Classifiers of Drifting Concepts
Ivan Koychev .... 176

Part Four. Some Contemporary Technologies and Applications

Techno-Intelligence Signatures Analysis
Gadi Aviran .... 193

Web Harvesting Intelligent Anti-Terror System – Technology and Methodology
Uri Hanani .... 202

How to Protect Critical Infrastructure from Cyber-Terrorist Attacks
Yuval Elovici, Chanan Glezer and Roman Englert .... 224

Appendix 1: Conference Participants .... 233

Author Index .... 237
Part One The Social and Political Environment
The Internet and Terrorism

Yaakov AMIDROR 1
Vice President of Lander Institute, Jerusalem, Israel
Head of ICA Project, Jerusalem Center for Public Affairs, Israel
Abstract. Terrorism was part of our lives before the Internet, and it will be a part of our lives even if we some day have all the means and all the possibilities to control the Internet. So when we look at the Internet as a tool for terrorists, we must remember that at the end of the day, the Internet is not the source of terrorism and the Internet is not the only way to implement terrorism. There are four major areas where the Internet has had a major impact on terrorist activities. First, the Internet has become a substitute for the “open squares” in cities: an anonymous meeting place for radicals of like minds. Second, non-governmental organizations can spread propaganda, stories, movies, and pictures about their successes, and no one can stop them. Third, the Internet is used by terrorists as a source of information. Fourth, the Internet can be employed as a command and control vehicle, for operational coordination. Finally, a smaller point, the Internet can be used by terrorist organizations to raise money for their activities. Anti-terrorist organizations can also use the Internet to collect information on known terrorists. Ultimately, democratic governments can find the standards that are needed and share knowledge about the way that terrorists are using the Internet, and in the long run the governmental systems will prevail, and not the terrorist organizations.
Keywords. Cyber terrorism, Internet, Terrorism
Good morning. Let me say that I am very glad to be with you here in Beer-Sheva. I welcome you here. I thought, during the long drive from my home in the center of Israel to Beer-Sheva, that in a way we are exaggerating. In a way, if we think that terrorism without the Internet cannot survive, or that terrorism is a result of the Internet, or the Internet is something that is critical for terrorism, that is not true. I can tell you from my own experience. I have been involved in fighting terrorism since 1969, when I was a young officer in the territories dealing with Arafat and his followers, and this was before the Internet. We fought Hezbollah before the Internet existed as an international communication system. Maybe in some exotic areas in the United States, back then, the Internet existed, but for the terrorists it was not a tool. At the end of the day, terrorism was part of our lives before the Internet and it will be a part of our lives even if we have all the means and all the possibilities to control the Internet some day. I think that some sense of proportion is needed in this discussion about the Internet and terrorism. At the same time that the Internet is flourishing all around the world,

1 Corresponding Author: Yaakov Amidror, Email: [email protected]
since 9/11 Al Qaeda hasn’t succeeded in carrying out any new operations in the United States. In all these years of very good Internet connections inside Israel, and outside Israel, Al Qaeda hasn’t succeeded in carrying out any actions inside Israel. We can see that in some areas where Al Qaeda had some success, like the operation in London, the operation was not connected at all to the Internet. They could do it without using the Internet, and they did do it without the Internet. So when we look at the Internet as a tool for terrorists, we must remember that at the end of the day, the Internet is not the source of terrorism and the Internet is not the only way to implement terrorism. Having said that, I want to go to four areas, actually four and a half areas, in which the terrorist organizations use the Internet as a very important machine or tool to enhance their capabilities.
1. Internet as a Meeting Place

The first one, which I think is one of the most important, is the Internet as a substitute for the “open squares” in cities. In the past, when someone wanted to hear the preacher in order to get some information, or to get the ideology, he had to go to a club or to a meeting somewhere. And consequently it was very easy for us, the intelligence organization, to understand, to know, and if needed, to track these people from their homes to the venue where they heard the one who is preaching for terrorism. This is the case in Israel; it is the case all around the world. Today is the first time in history that all of these terrorist leaders, preachers, ideologists, can spread their ideas all around the world at the same time to millions of people. No one knows who is listening to these preachers, who will come to the meeting again, who is becoming more and more caught up in this process of becoming slowly, slowly more extremist, as they understand and begin to believe in these preachers. It is the first time in history that you can organize many, many people without bringing them together in a physical place. You can preach without directly seeing these people. You can bring them the ideology without being face-to-face with them, and no one around the world can know who is there and who is not. And I think that this is more important than all the other very interesting and esoteric uses of the Internet, the fact that the terrorist can preach around the world at the same time to millions, to convince people to go into the circles in which they will be not only affiliated with the ideas but also part of the system. It is a slow process; nobody is born to be a terrorist. It is a slow process that can take months or years. I do not believe that someone will become a terrorist in a few days. This is, I think, the most important development in this area because of the Internet and only because of the Internet. This is something new.
2. Non-Governmental Organizations Can Spread Propaganda

Connected to the above, and important enough to mention as a separate category, is that this is the first time that organizations which are non-governmental organizations can spread propaganda, stories, movies, pictures, about their successes and no one can stop it. There is nothing more important to the success of the terrorist organization than that they can show others that they are succeeding. The Internet gives them a way of spreading their operations all around the world, without any censorship, and nobody
can stop it. One can watch the success of Hezbollah from Australia. Or one can look at the success in Iraq from inside Israel, and nobody can stop people from watching these transmissions, at least in democratic countries. The fact is, they can spread the message of their success all around the world immediately; almost at the same time that the event is taking place in Iraq, one can see the pictures in Israel -- it takes an hour or two and you have all the pictures. It is real-time propaganda. These organizations gain life from their success, so they want to show around the world that they are doing a very good job fighting the Israelis, fighting the Americans, fighting the infidels, whoever. So for them it is a very important tool. So this is a new phenomenon and it exists only because of the Internet. Nobody can control the way that they are spreading their propaganda, mainly through pictures, all around the world.
3. The Internet as an Information Source

The third area in which the Internet is very helpful for these terrorist organizations is in collecting information. Of course, it is true that it is helping governments as well, but governments had this ability in the past because of the intelligence organizations that they had to collect information, even if it is a very complicated process. For non-governmental organizations it was almost impossible, or at least very, very hard. Think about the picture of a target in Tel Aviv. In the past, they had to send someone to Tel Aviv to photograph it, to go in and out through some check points in Lod Airport. If it was someone whose name was on the watch list it was not so easy to do this. And then the issue of taking the photo: if the target is in a sensitive area, this is also not so easy. So they need a process to get the photo of this target if it is in Tel Aviv or in Beer-Sheva or elsewhere. Today all they have to do is go to the Web, to Google Earth™, and get the picture they need for their operation. The fact that information so necessary for their operations is on the Web is making the life of those people who are thinking about carrying out terrorist operations much easier. In the past non-governmental organizations didn’t have the resources to collect this type of information as easily as they can today. And for them, in this instance, the Internet is more important and more helpful than it is for governmental organizations. Many elements of information that, in the past, we had to send our people around the world to collect can today be easily found on the Internet. I remember as a young officer, a Captain, when we planned an operation against Fatah in Beirut, we had to send some of our guys into Beirut to learn the streets which lead to the target and back. Today, we can go to the map of the world on the Internet and get the information we need for such an operation.
In the past, as a government it was very easy for us to send people into Beirut and collect the information. It was not that easy for non-governmental organizations in the past to do the same, but now they have access to it on the Web as easily as we have it.
4. Command and Control Via the Internet

The fourth area in which the Internet is very helpful is the one most people like to think is the most important one. Personally, I don’t think this is the most important.
This is communication between the headquarters and the operational people on the ground, and among the operational people. I usually tell a story about four people sitting in four different places all around the world. They don’t know each other; they are getting information through the Web. One is using a code of agricultural terms; another one is using retail terms to communicate in code. Technically they can hide the source of the information using servers all over the world, and they are getting all the information they need to plan an operation, even pictures, which were taken by another guy who is using a third-generation cellular phone to take photos in and around the target. And slowly, slowly they gather all the information they need, all the details of the operation. Then someone that they don’t know invites them to a restaurant somewhere in London. They don’t know each other. When they come to the restaurant, waiting for each of them there is a table and chair and an envelope, and in the envelope there is some money. The boss of the restaurant says: “Sorry, Mr. John Reed, the person who invited you cannot come, but he sent you the money and the information that you needed.” The restaurant boss does not know anything about it; they don’t know anything; but they have a key to a lock somewhere in the train station where the explosive is hidden. The explosive they cannot move via the Web. This is something in the real world that someone has to carry from somewhere, Iran or Iraq maybe. So each of them has a very clear understanding of what he should do, where he is going now, what time he is opening the case, what time he is operating, everything he needs to know. And they don’t know each other. They did not meet each other. And they will not meet each other; they don’t know who sent them all these materials; and everything is done through the Web.
And the chance of an intelligence organization being able to track and find these four people, who are not connected and never were connected, people who are using coded words of agriculture, retail and commodities in their communications, is almost zero. The only part in which the real world is involved in the operation is when someone has to bring the explosive to the site of the operation; this cannot be done through the Internet. This is an example of how the Internet can be used for such operations.
5. Weaknesses of the Internet for Terrorists

What are the weak points of the system from the point of view of the counter-terrorist organizations? I think that there is, philosophically, one great weak point. To find the “goldfish” in such a huge sea is almost impossible. To identify the one who is the terrorist, the one that we are looking for, is very, very difficult. However, once we identify him on the Web, the details that we can learn about him through the Web are enormous. So the first part, to identify the guy we are looking for, is very complicated. After he is identified, however, we can know everything about him. And, in the future, we will know even more. In the future, he will have a smart house and we will know what he is eating, when he’s out of bread, when he’s coming home, when he opened the door and so on and so forth. As the system becomes more sophisticated it will be more complicated to identify him, but it will also enable us to have more details after he has been identified. This is, philosophically, the point.
And in the end I am very optimistic. I am very optimistic because these systems that involve the Internet need a lot of money to make them untraceable, and you need to invest a lot of energy and know-how. At the end of the day, governments, and states, have more money and more know-how and more energy than non-governmental organizations. For the present, Al Qaeda can use it, Hezbollah is using it, Hamas is using it; all these organizations are using the Internet. But it is only a question of time before the free world will organize itself, and after investing money, time, and people, and sharing information with each other, states will be in a better position to deal with the terrorist websites and Internet networks than non-governmental organizations. So we are now in a very delicate situation in which we know that these organizations use the Internet as a tool for all the four purposes that I mentioned. The “half” purpose I mentioned earlier is to collect money. Although, from our experience, most of the money being used is cash, some of it comes through the Web. But I think that if we act together (by “we” I mean the free world, those who want to fight the terrorists) and find the standards that are needed and share our knowledge about the way that they are using the Internet, then at the end of the day the governmental systems will prevail, and not the terrorist organizations. Then we will find ourselves, as in the old days, fighting terrorists who maybe won’t use the Internet but, nonetheless, have a very strong will to implement terrorist actions.
Understanding International "Cyberterrorism": A Law Enforcement Perspective

Marc GOODMAN 1
Interpol Steering Committee on Information Technology Crime, INTERPOL, Lyon, France
Abstract. Cyber terrorism is a commonly heard expression, but it lacks definition and precision. Does it apply to terrorists' use of the Internet or to computer attacks against critical infrastructures? Many entities play a role in this complex topic, from governments and technology and telecommunication service providers to the terrorists themselves. This paper explores the role of law enforcement in the international fight against terrorists' use of the Internet and provides an overview of the wide extent to which terrorists and their support structures have fully embraced cyber space.
Keywords. Cyber crime, Cyber terrorism, Terrorist use of the Internet, Law enforcement in cyber space
Introduction

Thank you, thank you very much. Good morning. We'll make this an informal conversation; if you have any questions, please let me know. I would like to thank Ben-Gurion University, NATO, Bracha and Paul for having invited me here. It is my great honor to be with you. Earlier, Paul said that he wanted differing perspectives on cyber terrorism, and I hope to provide you with the law enforcement perspective on how police and security officials regard the topics of cyber crime and cyber terrorism.
1. Interpol

I thought, first, it would be useful to talk for a moment about Interpol itself. In some regards it is somewhat similar to the United Nations in that it is an organization of member states or member countries. After the United Nations, Interpol has more member states than any other international organization, with 186 members around the world, which gives the organization considerable reach. Interpol works with law enforcement entities around the world, operating through the national law enforcement authorities in nearly 200 countries. So for example, Israel 1
Corresponding Author: Marc Goodman, Senior Advisor, Interpol Steering Committee on Information Technology Crime, INTERPOL, General Secretariat, 200, quai Charles de Gaulle, 69006 Lyon, France; Email:
[email protected]
is a member of Interpol, Ukraine is a member of Interpol, Germany is a member of Interpol, and the national police forces in each of these countries can reach out to Interpol for assistance in sharing police communications and information across a global police information network. There are a number of criminal databases that Interpol can offer to police forces around the world -- everything from DNA to fingerprints, photographs, etc. Interpol also provides proactive support for police operations around the world. A very typical scenario under which Interpol would become involved in an international criminal investigation would be if somebody committed a murder in one country and fled to another. Interpol would help coordinate the investigation between, for example, France and Germany, or France and El Salvador, or among multiple jurisdictions.

Let me give you a little bit of history. Interpol was officially created in 1923 in Vienna, and in 1946 its headquarters was moved to Paris. As for the name Interpol itself, it is an abbreviation of the International Criminal Police Organization, and the shortened name was taken from the original telegraphic address used to communicate with the organization. Thus long before we had SMS, text messages, Skype and virtual worlds, in simpler times people used to send telegrams -- and the telegram address for the International Criminal Police Organization was Interpol.

Over the years, I have had the pleasure of working with Interpol, in a variety of capacities, on the issues of cyber crime and cyber terrorism. In many regards, Interpol is the perfect organization to help address some of these issues because cyber crime, by its very nature, is a transnational crime. In the not too distant past, transnational crime was quite rare. For example, if somebody in Sofia, Bulgaria walked into a bank and robbed the institution, the thief would usually travel back to his home somewhere in Bulgaria after the crime.
The investigation of this matter was rather routine and, without dispute, it would be handled by Bulgarian police officials. The victim was in Sofia, the perpetrator was in Sofia, the evidence was in Sofia; it was very easy to investigate. With the advent of ocean liners, railroads, airplanes and now, of course, the Internet, it is extremely easy to cross international borders. The Internet by its very nature is international. If one were to send an email from Eilat to Jerusalem, it could still pass through several international computer services, perhaps a server in London or in Tokyo. Thus even if one were to communicate solely within a single country, there could quite possibly be an international dimension to every Internet investigation.

For law enforcement agencies around the world this represents a complete paradigm shift in their own modus operandi. Traditionally, law enforcement organizations have derived their power from the state, and states are by their nature national-level institutions. So how do we, as representatives of states around the world, adapt to this new global transnational threat, which is not confined to any one single state? How can government authorities responsibly conduct investigations around the world in jurisdictions where they lack any actual authority to do so? The answer requires significant transnational cooperation among states, and because Interpol is the world's only true international organization established to deal with global crime issues, Interpol has a key role to play in global matters of cyber crime and terrorism.

What has Interpol been doing to combat cyber crime and terrorism? Quite a bit, actually. Dating back to 1989, Interpol assembled a group of cyber crime experts within Europe to study and respond to the issue of trans-border cyber crime. Thus in 1990, Interpol formed its first Working Party on Information Technology Crime.
In those early days, Interpol brought together cyber crime experts from Scotland Yard, the French National Police and the German Bundeskriminalamt to meet and
share experiences and training in the fight against cyber offenses. These officials had traditional police backgrounds and training, but also brought deep technical knowledge to the table. Since 1990, Interpol has grown its regional working party efforts substantially and currently has working groups not just in Europe, but also in Asia, Africa, the Middle East and the Americas. Each of these groups shares criminal information on the latest cyber tools and trends. In addition, the European Working Party has established a widely respected international computer crime investigation manual, which is available to investigators around the world to help build their capacity to respond to international cyber crime incidents.

One of the ways that Interpol facilitates communication among high-tech crime investigators is via its National Central Reference Points (NCRP) program. This network of cyber crime experts augments the traditional Interpol I-24/7 network as a means of sharing emerging cyber threat and crime information in near real time. In effect, it is a closed list of global law enforcement experts working on cyber crime and security matters around the world who are available on a 24/7 basis to assist their colleagues in other countries on urgent cyber crime matters.
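The earlier point that even an email between two Israeli cities may transit foreign relays can be seen in the message itself: every mail server that handles a message stamps a `Received:` header recording the hop, and these trace headers are often the first evidence an investigator examines. The following sketch (all hostnames and addresses are invented for illustration, not taken from any real case) uses Python's standard `email` library to reconstruct the relay path:

```python
import email
from email import policy

# Illustrative raw message. Each mail server that handles a message
# prepends a "Received:" header, so the newest hop appears first.
raw_message = """\
Received: from relay.example.net (relay.example.net [203.0.113.5])
    by mx.example.org; Mon, 1 Oct 2007 10:00:05 +0000
Received: from sender.example.com (sender.example.com [198.51.100.7])
    by relay.example.net; Mon, 1 Oct 2007 09:59:58 +0000
From: [email protected]
To: [email protected]
Subject: hello

body
"""

msg = email.message_from_string(raw_message, policy=policy.default)

# Reading the trace headers in reverse order reconstructs the path the
# message took from sender to recipient.
hops = list(reversed(msg.get_all("Received")))
for i, hop in enumerate(hops, start=1):
    relay = str(hop).split()[1]  # hostname following the word "from"
    print(f"hop {i}: via {relay}")
```

Run as-is, this lists the hops in chronological order; a real message may record hops in several countries even when sender and recipient are in the same one.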
2. Cyber Crime

As many of you know, cyber crimes differ greatly from traditional crimes, where the evidence may last much longer. For example, with a homicide there is a body on the ground; somebody has been killed, and the body is not going anywhere: it will stay right there until moved by somebody else or until the police arrive. Evidence in the cyber world is much more transient. Server log data can disappear very quickly, assuming such data is even kept in the first place. Thus, police and law enforcement must investigate cyber offenses much more rapidly and across greater distances than they have done with traditional crimes in the past. For example, in the kidnapping of Daniel Pearl, the Wall Street Journal reporter, much of the original evidence in the case first emerged on the Internet, a non-traditional crime scene for law enforcement. The Interpol NCRP network is meant to help in the case of an emergent crime such as the pursuit of kidnappers across the globe in real time.

In fact, we must ask the question: how "new" is cyber crime really? Today we frequently hear police executives referring to cyber crime as a new phenomenon. But is it really? For at least twenty years now, many police services have been dealing with a large number of traditional crimes that have migrated to the Internet. For example, frauds that previously existed in "real space," such as West African financial frauds (419 schemes), have merely migrated online. Sexual exploitation of minors sadly existed in "real space" for a very long time, but now predators have leveraged the Internet to take advantage of unsuspecting victims around the world. Interpol has responded through the establishment of several professional databases to deal with the severe problem of child pornography and exploitation. But just about any crime that occurs in "real space" can occur in cyber space, including prostitution, theft of intellectual property and even sales of narcotics.
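The transience of server log data noted above is often a matter of deliberate retention policy rather than accident. As a hypothetical illustration (the path and values are invented), a common logrotate policy like the following keeps only a week of web server access logs, so entries older than seven days are simply gone unless preservation was requested in time:

```conf
# Hypothetical retention policy: log entries older than a week are deleted.
/var/log/apache2/access.log {
    daily            # rotate the log once per day
    rotate 7         # keep only the last 7 rotated files
    compress         # gzip rotated files to save space
    missingok        # no error if the log file is absent
    notifempty       # skip rotation when the log is empty
}
```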
There are also some crimes that law enforcement has never seen before in "real space" and which can exist only due to the nature of the Internet itself. Distributed denial of service (DDoS) attacks, for example, have no equivalent in the real world. DDoS attacks committed via botnet armies controlled by organized criminal groups represent a completely new phenomenon for law enforcement officials. Interpol has a
number of project groups underway studying these issues in depth, bringing together some of the best experts in the world from both the private and public sectors to tackle these problems.
3. Cyber Terrorism

And now to the theme of this conference: cyber terrorism. What is cyber terrorism? This is not such an easy question to answer. Before we can define cyber terrorism, we must have a firm understanding of what terrorism itself is. So what is terrorism? It depends on whom you ask. There are approximately 100 different definitions of terrorism, and while many of these definitions have elements in common, none is universal. The United Nations has a definition, Interpol has a definition, several national-level police forces have their own definitions; political organizations, academic organizations, everyone has their own definition of terrorism, and no single definition reigns supreme. Thus when one adds a widely misunderstood prefix to the mix (the word cyber), things become even more complicated. If there is no universal agreement on the definition of terrorism, then there certainly is no fundamental accord today that delineates "cyber terrorism." Thus in defining cyber terrorism, we end up with a definition that draws upon confusion from both the cyber and terrorist worlds. For most people, cyber terrorism refers to the activities of radical terrorist organizations on the Internet. The other common view of cyber terrorism relates to the use of computer networks to attack the critical information infrastructures of a nation state.

3.1. Use of the Internet by Radical Organizations

I will focus on the general definition first: use of the Internet by radical organizations for the purposes of terror. The activities of these groups broadly break down into two categories: support activities and operational activities.

3.1.1. Support Activities

Terrorist support activities on the Internet include disseminating terrorist goals via propaganda in support of radicalization of thought, such as encouraging others to believe that violence is the answer to a particular political problem.
Many recognized terrorist organizations have well-organized and well-funded propaganda machines, news services, video feeds and radio stations on the Internet. Another example of a "cyber terrorist" support activity is the emerging trend in the gaming world vis-à-vis radicalization of thought. There is extensive evidence that cyber jihadists are using stolen computer source code and modifying it to create computer games whose aim is to kill an Israeli soldier, to kill George Bush, to kill Tony Blair or to kill King Abdullah II of Jordan. Those types of games certainly exist and are freely and widely circulated in cyber space as a means of spreading a particular violent cyber jihadist point of view. There are also radical terrorist organizations that have crafted cyber jihadist messages geared specifically at children, the goal of which is radicalization at an early age to bring new members into the fold of the terrorist group. For example, were one to visit a number of terrorist websites (and organizations such as MEMRI have carefully cataloged such sites), one would see
cartoon messages specially tailored for children. These cartoons look like other cartoon images well known to young people, such as Disney™ characters. The major difference is that the message put forth by these characters is one of hate and violence rather than youthful diversion.

Fundraising is another major terrorist support activity that takes place over the Internet. Many terrorist organizations try to raise money over the Internet, either via traditional means, where people can send in cash or give money to a local terrorist organization, or via virtual means, using new forms of money such as E-Gold, etc.

One of our speakers on the first day talked about the concept of a town common. In the past it used to be very difficult to find another terrorist, depending on where you lived and what your neighborhood looked like. Now it is very easy for people with unique ideas and beliefs to "gather" and to find one another. Previously, if two disturbed individuals lived on different sides of a city, it would have been difficult for them to find each other and to share a terrorist agenda. Now, with so many chat rooms, blogs, websites, etc., it is very easy for individuals to find these groups. As a result, these would-be terrorists meet in the commons and share uncommon beliefs, reinforcing each other's radical mindset and agenda.

We have seen a similar radicalization of thought on the criminal side with regard to child pornography and sexual exploitation of children on the Internet. The idea of sleeping with an 8-year-old boy or girl is repugnant to most adults; most people find it deeply offensive and would never consider it. Thus, for obvious reasons, in the past child pornographers have had a hard time meeting each other. Now, via the Internet and this new concept of a common, they can come together and convince each other that it is perfectly normal to sleep with a child.
The same modus operandi and mindset is also found in the radicalized common space inhabited by terrorist organizations.

3.1.2. Operational Activities

The next category of terrorist activity that takes place on the Internet may be described as "operational activities." There are vast resources online, such as training manuals that can help any person complete a self-paced study program to become a terrorist. There is a document called the Encyclopedia of Jihad, which is widely available online. It runs to many thousands of pages and covers everything from how to build an IED (improvised explosive device) to how to build weapons of mass destruction. Other chapters cover topics such as how to engage in secret communications and how to avoid detection by the police. Of course, the Internet also facilitates operational terrorist activities through its communications capabilities, allowing for command and control in cyber space, whether through peer-to-peer networks, SMS text messaging or Voice over Internet Protocol.

When one thinks about it, intimidation is one of the major goals of terrorist organizations. The Internet is an outstanding operational venue for terrorists to intimidate and terrorize the public. We have seen examples of this, such as the beheading video of Nicholas Berg. What was the ultimate effect of that very graphic and violent portrayal? It was to intimidate and to cause terror. And so jihadists are carrying out their terrorist goals via an alternative method: cyber space.

Another major category of terrorist operational activity that can be carried out online is research and planning for an eventual terror attack. Terrorists can use the Internet to research their targets, whether by using satellite imagery or just Google™ to obtain open-source information to identify, select and prepare to attack a particular
target. A current and highly publicized example of this is the recently broken-up terrorist plot to attack JFK Airport in New York City. In that case, the FBI in the United States uncovered a plot in which a small homegrown terrorist cell, with links to Guyana, planned to attack a very large pipeline that provided jet fuel to JFK Airport. The pipeline was over 40 kilometers in length, and the terrorists planned to blow it up using homemade explosives. The group's planning efforts were also enhanced by the participation in the plot of an insider who had previously worked at the airport and become a convert to radical Islam. Law enforcement officials discovered, upon arresting the cell members, that the terrorists had extensively used Google Earth™ to help map out the entire airport as well as the pipeline. Thus, more and more, we are seeing commonly available computer tools serving the operational planning requirements of terrorist plots.

3.1.3. Attacks Against Critical Infrastructure

Up until this point, we have focused on terrorists using the Internet to facilitate or support operations. Another common definition of "cyber terrorism" refers to attacks against critical information infrastructures. That is, using computers, computer networks and the Internet to attack those systems that we rely upon in our daily lives. These systems include transportation, government communications and banking/financial institutions. If you think about it, almost all modern services involve a computer in one way or another. When one visits an ATM (automatic teller machine), the reason one is able to take money from the machine is that there is a computer and a telephone network line behind the machine that decides it is OK to give you money. The reason police, fire and ambulance personnel can come to your house in an emergency is that they have a dispatch system that knows how to get to where you are.
The reason we can see each other in this windowless room is that there is electricity, controlled via computer across a power grid. Each one of these systems represents a critical infrastructure upon which we depend in our daily lives. Some countries are far more dependent on these infrastructures than others, and for that reason the most developed countries are at the greatest risk of attacks against their critical infrastructure. While many people often predict so-called "electronic Pearl Harbors," there is disagreement as to whether such a threat is fact or fantasy. To better understand the potential threat of a critical infrastructure attack, I will provide three examples that give an overview of the problem.

One of the earliest known critical infrastructure attacks occurred in Queensland, Australia, where a former employee of a water sewage treatment plant hacked into the system of his old company and, through remote computer control, was able to release raw sewage into a neighborhood. Over 260,000 gallons of raw sewage poured into the streets of this particular town in Queensland. People had to be evacuated, the grounds of an internationally known chain hotel were covered in raw sewage, and much of the wildlife in this wetlands area died as a result. All of this damage was unleashed via a remote computer attack against the information infrastructure of a utility provider.

Another famous example, from the late 1990s, is Gazprom. Gazprom, as you know, is one of the largest gas pipeline companies in the world. Hackers were able to access its control systems, and at one point Gazprom was no longer in control of its own pipes, valves and pressure gauges. Instead, each of these was under the remote control of computer hackers.

A third example of this type of critical infrastructure attack comes from the United States and involved a disruption to the aviation system at a local
airport. A small airport in Worcester, Massachusetts had its telephone and computer systems taken over by a hacker. By extension, the people in the FAA control tower, the air traffic controllers, were no longer able to make outbound phone calls. Had there been a plane crash, officials could not have called an ambulance or the fire or police services. Because this attack involved a small airport, the runway lights were controlled via the same phone lines and were also affected by the attack. Each of the above scenarios is a stark example of how one could attack the critical infrastructure of a nation via a cyber attack.

3.2. Terrorist Organizations as Learning Organizations

As we know, terrorist organizations are learning organizations. We see evidence of this all over the world. We see it very clearly now in Iraq. The people who are attacking the various coalition troops in Iraq are refining their methodology, or modus operandi, so that every day they are getting smarter and better. They observe the techniques used by coalition forces to protect themselves and adjust their attacks to counter security measures.

One of the things we know terrorist organizations have learned all too well is how to use our own systems and infrastructure against us. In the first attack against the World Trade Center in New York, in 1993, terrorists smuggled a truck bomb into the basement parking lot of the building. The 1,500-lb (680 kg) urea nitrate-fuel oil device was intended to knock down the North Tower and kill thousands. Instead, six people were murdered and over 1,000 injured -- not quite the spectacular event the terrorists had hoped for. In the second attack against the towers, in 2001, rather than trying to move a 10,000-kilo bomb into the building, the terrorists adapted and instead chose an airplane full of jet fuel as the bomb/projectile, achieving the desired effect.
Thus the terrorists took advantage of infrastructure that already existed (an airplane) and weaponized it. There are many other examples of terrorists attacking existing infrastructures, including the bombing of the Atocha train station in Madrid on March 11, 2004 and the London Tube bombings on July 7, 2005. Transportation systems are not the only infrastructures that can be subverted for terrorist use. Back in 2001, in the United States, the mail system ground to a halt as the result of a series of anthrax-contaminated letters sent through the public mail, creating widespread panic and fear.

Given all the attacks against various other infrastructures, it is not too far-fetched to presume that terrorists might attack critical information infrastructures next. While many downplay the significance of "cyber terrorism" (meaning terrorist attacks against computer networks), the threat is indeed real and one that has been discussed frequently in at least one jihadist chat room in the past. Since we already know that terrorists can turn the transportation system and the mail system against us, why would they not turn the computer systems against us next?

One of our previous speakers, an Israeli general, noted that terrorist organizations have at least one Achilles heel: the need for somebody to actually move the explosives to the detonation site, which gives law enforcement an opportunity for interception (such as observing a terrorist wearing a suicide belt). I am less optimistic about the Achilles heel theory than is the general. I think he had a very good point, but in a whole new genre of attack scenarios it is no longer necessary to actually move explosives. We saw in early 2007 the JFK Airport pipeline attack plot, which was interrupted by authorities. In that plot, a small amount of explosives could have been used to detonate the 40-kilometer jet fuel pipeline,
causing extensive damage to JFK Airport and the surrounding neighborhoods. But, in fact, that entire pipeline was controlled by a variety of computer systems. Thus terrorists could also, theoretically, have compromised the safety features in the computer system and caused an explosion anyway as the pressure in the pipeline grew too high. I am therefore not sure that it is as necessary today to move large amounts of explosives in order to do extensive damage. The explosives, the chemicals, the fertilizer -- all of these things are around us everywhere, and on many occasions they are controlled by computer. Thus I remain concerned about terrorist organizations compromising an information system to cause a spectacular infrastructure attack in the future -- a form of "cyber terrorism," if you will.
4. Relationship Between Crime and Terrorism

Among law enforcement and justice officials around the world, there has been something of a debate as to the relationship between crime and terrorism. Some feel quite strongly that terrorism is a form of crime that must be investigated and prosecuted like all other forms of criminality. Others, however, focus on the special nature of terrorism and define it as a problem for armies and intelligence agencies -- for a "global war against terrorism." In my mind, however, the distinction is not so clear, and there is often overlap between the two. There have been many incidents of terrorist groups engaging in traditional criminal activities and other instances in which organized crime groups have committed terrorist acts. In South America, for example, the Fuerzas Armadas Revolucionarias de Colombia (FARC) have engaged in narcotics trafficking, and they pose a terrorist security threat to the people of Colombia. Evidence has now revealed that the Al Qaida-affiliated terror cells in Spain responsible for the March 11th plot were using stolen credit cards to help finance their operational expenses. And we now know that organized crime groups are selling their services and expertise to the highest bidders all over the world -- whether global pedophile rings or terrorists. Today, organized crime groups sell so-called botnet armies, which can be used to target computer networks around the world.

While Interpol is actively engaged across the wide gamut of activities discussed previously, it is important to note that Interpol is not alone in this fight. Obviously, NATO, through its sponsorship of this important forum, is heavily engaged in some of these questions. So are other institutions such as the United Nations, the Council of Europe, the European Union, ASEAN, APEC and many others. Many of these efforts are growing; however, there is always more to be done.
Earlier, Professor Golubev mentioned the Council of Europe Convention on Cybercrime. Forty-two countries have signed the treaty, though a much smaller number have actually ratified the document to date. It is good news, in the global fight against cyber crime, that so many countries have agreed on common standards and protocols for this important emerging criminal trend. Unfortunately, with over 180 countries in the United Nations, there is still much work to be done. Until every country has some method of dealing with cyber crime and cyber terrorism, we are all in danger. Just as money laundering proliferated in so-called financial havens, in the Caribbean and elsewhere, so too we may suffer because of "cyber havens" from which criminals and terrorists may launch their attacks. If a particular country does not protect its slice of cyber space from criminal activity, then criminals will frequent that jurisdiction.
5. Conclusion

So, in conclusion, where are we today vis-à-vis cyber terrorism? It depends on whom you ask. Some activists and pundits argue that the cyber world is close to an eventual and inevitable destruction. These people often talk about an "electronic Pearl Harbor," in reference to the 1941 attack on the U.S. naval base in Hawaii. They believe that an imminent electronic attack will crash the entire Internet, with modern man possibly returning to the Stone Age. Though I exaggerate for emphasis, some people do believe in that vision of the future. Others, including many traditionalists in law enforcement and counter-terrorism, remain unconvinced by the so-called cyber threat. They think it is "no big deal" to date. In their view, "somebody has to move the explosives anyway, so why do we have to worry about computers?" Maybe we do and maybe we don't. In my opinion, the truth lies somewhere in between, and elements of both camps are correct. I personally believe that while total cyber destruction may be science fiction at this point, some critical infrastructure attack involving cyber space is more than likely over the next ten years. That, of course, is in the realm of an infrastructure attack. As we already know from so much of our own experience, and from the many experienced speakers at this forum, from the perspective of terrorist use of the Internet, cyber jihad has certainly arrived and is flourishing.

I believe strongly in cooperation and think that only through collaboration will we have any chance of gaining the upper hand on either issue. Based on my discussions with many of the attendees at this forum, I remain convinced that law enforcement cannot win this fight alone. It is evident to me that academia and industry have much to contribute to this discussion. There is an overwhelmingly compelling argument for cooperation against these threats.
There are so many interesting tools being developed by academia and the technology industry; law enforcement simply cannot do it alone. We need to work as a team, so for those of you who have interesting ideas that may help support our efforts, please introduce yourselves, as I would enjoy the opportunity to share ideas with you. It has been my pleasure to address you today, and I thank you all for your kind attention.
Global Threats – Local Challenges
Current Issues of Intelligence Tasking and Coordination in the Rapidly Changing Threat Environment

Janusz LUKS 1
Grupa GROM, Warsaw, Poland
Abstract. The nature of threats to international and national security is changing at a rapid pace, forcing a series of fundamental changes in intelligence management and tasking methods and processes. These changes are clearly visible in counter-terrorism operations, particularly in intelligence collection against terrorist targets in the Internet environment. The need to develop new analytical capabilities and tools requires a new level of cooperation and dialogue between the intelligence community and research organizations, at both the national and international level. NATO can aid and foster international intelligence cooperation through requirements definition by the NC3A, which will help address some of the compatibility and tool availability issues the intelligence services currently face in developing a joint response to the terrorist threat.
Keywords. Intelligence management, intelligence tasking, terrorism, Internet
The intelligence cycle environment in the 21st century has changed quite significantly. The tasking, collection and analysis of any national intelligence organization, regardless of its size, must now face the challenges of the globalization of national economies, policies and national security threats. Local conflicts, even in remote areas, nowadays often have an impact on the national security of countries thousands of miles away. For many countries, intelligence support to military operations for their forces has also gone global. In contemporary intelligence there is also a significant asymmetry of threats and targets. Threats, instead of being large, static and highly organized, have become smaller and less defined, highly mobile, often cloaked, and amorphous. Information technology has entered the arena, this time as a weapon, difficult to control and universally proliferated. We now witness electronic warfare, look for and identify cyber
Corresponding Author: Janusz Luks, Ul. Granitowa 3/6, 02-681 Warszawa, Tel: +48 606 802999; Email:
[email protected]
J. Luks / Intelligence Tasking and Coordination in the Rapidly Changing Threat Environment
threats, take up defense postures in cyberspace and police it very much as we would our physical environment. In a world of dual-use technology development, with commercial-off-the-shelf (COTS) products and solutions and ambiguous delivery platforms, it is getting harder to decide what is potentially a threat or a possible weapon of today, and what will be a weapon of tomorrow. The rapid development of the Internet and Web resources has brought about a steady erosion of governments' monopoly on access to potentially sensitive information, coupled with the proliferation of effective collection tools and techniques. In parallel, both industry and governments already suffer from information overload and "white noise," and are constantly trying to resolve the emerging and difficult privacy issues so important in democratic societies. As the communication revolution continues at great speed, stretching the available collection and monitoring resources almost to their limits, versatile and easily accessible Internet services (data, P2P, IRC, VoIP, Webcam), mobile and satellite communication networks, the proliferation of high-level encryption techniques (voice, data, digitally enhanced steganography) and user ID cloaking all present intelligence services with new functional and technological challenges. The changing nature of the threat has resulted in the introduction of new categories in national security priorities, such as:

• National critical infrastructure protection;
• A return to a homeland defense concept, with profound impacts on both the composition and functionality of the intelligence community;
• Forging global and regional alliances against terrorism, the proliferation of weapons of mass destruction and narcotrafficking, in order to benefit from the international cooperative environment for the exchange of intelligence.
Intelligence customers have also made significant changes in the way their requirements (tasking) are carried out. The speed of intelligence response to tasking has gone up, and more often it has to conform to the concept of the information reaching the decision maker "just in time". Intelligence services are asked to collect against a very broad range of targets and to provide all-source analysis (ASA) at the top level. It is important that threat analysis be able to differentiate between a criminal and a terrorist threat. One should note a growing need for intelligence (or intelligence-type) information in industry, where the vetting of major decisions (e.g., investments) has become a lasting component of the risk management process. Moreover, the private sector has growing intelligence resources, often directly related to national and/or transnational security issues. It is also in industry that a new understanding of the value of intelligence takes root, faster than in governments: information costs money and takes time; intelligence makes money and buys time. The implications of the above-discussed changes for the intelligence cycle and process are multiple:

• On the functional side, there is a growing need to expand the presence of a "new breed" of intelligence analysts in the intelligence community. They should possess the set of skills which seem to be of crucial importance in the contemporary intelligence environment:
  o a good understanding of source balance and ASA processes;
  o the ability to deliver value-added information from analysis;
  o an understanding of the new tools and technologies, especially in the IT area;
  o the ability to work, interact and communicate with science and technology experts in order to understand the advantages and limitations of available collection techniques.
• On the management side, there should be an understanding that the processing of intelligence is becoming a core competency. Both global coverage and counterintelligence issues matter more nowadays, as do non-traditional threats. Decision support remains the raison d'être for the decision maker, but new value for him comes from a combination of content + context + speed.
One of the major problems that remains is the failure to share information. "Out there," especially in the world of national governments, it is still a stove-pipe environment. We all seem to understand the benefits of all-source analysis, but it is still a challenge for many governments to provide effective mechanisms to fuse, process and analyze information collected from multiple, non-traditional and diversified sources. Tracking terrorists on the Web, in the words of U.S. security expert Bruce Schneier, has all the qualities of a needle-in-a-haystack dilemma. Schneier asks a pointed question: "Are we going to get better results by simply adding more hay to sift through?" The answer is obvious, so we had better start looking for a better pitchfork and, preferably, X-ray vision. The issues and challenges discussed above call, among other things, for a more extensive R&D contribution to intelligence collection and analysis. One possible way appears to be a focused R&D program under the auspices of NC3A (the NATO Consultation, Command and Control Agency). NC3A has a good record of supporting the research and development of new IT tools and systems, both for NATO and for member countries' use, through the Basic Ordering Agreement (BOA) mechanism, as well as of promoting and certifying standardized solutions. The agency is also well versed in producing high-quality international tenders with clearly stated functional requirements for the desired technical solutions. Finally, it has the funds to do such jobs in the interest of, and on behalf of, member countries.
Cybercrime and Cyberterrorism as a New Security Threat Vladimir GOLUBEV 1 Computer Crime Research Center, Zaporozhye, Ukraine
Abstract. As with many other new technologies in the past, the World Wide Web presents us with great opportunities for progress as well as potential for misuse. Among the crimes committed on the Internet are: network attacks, credit card fraud, stealing money from bank accounts, corporate espionage, and child pornography distribution. These crimes represent a significant social danger to Ukraine and the other CIS (Commonwealth of Independent States) countries. Below I discuss three different countries and the ways in which they have managed their technology infrastructure and dealt with Internet crime.
Keywords. Internet crime, Cyber crime, Cyber terrorism, CIS countries, Network, Attacks, Information, Security
Introduction

As with many other new technologies in the past, the World Wide Web presents us with great opportunities for progress as well as potential for misuse. Here are only a few of the crimes committed on the Internet: network attacks, credit card fraud, stealing money from bank accounts, corporate espionage, and child pornography distribution. These crimes represent a significant social danger to Ukraine and the other CIS (Commonwealth of Independent States) countries. Below I discuss three different countries and the ways in which they have managed their technology infrastructure and dealt with Internet crime.
Corresponding Author: Vladimir Golubev, Computer Crime Research Center, Box 8010, Zaporozhye 95, Ukraine, 69095. Telephone: +38 (0612) 341470; Email:
[email protected]
V. Golubev / Cybercrime and Cyberterrorism as a New Security Threat
1. The Republic of Moldova

In Moldova, the number of Internet users grows each day. Out of a total of 100,000 Internet users, 20,000 are government officials. In 2006-2007, all attempted computer attacks and infringements of information systems were successfully neutralized and did not cause any significant problems with the normal operation of state information systems and websites. Maintaining the security of the telecommunication system entails a number of technical measures and methods, which are employed by the Center of Special Telecommunications. One of the main prerequisites for the reliable, stable and safe functioning of the state information structure is a dedicated, modern and protected telecommunication environment. In 2005, the Center of Special Telecommunications began to develop a modern and protected telecommunication system for government agencies. Today this system has more than 30 kilometers of fiber-optic network linking more than 60 state agencies, including the Office of the President, the Parliament, the Ministries of Commerce and Economy, the Ministry of Finance, the Ministry of Internal Affairs, the Ministry of Defense, the General Prosecutor's Office, the Information and Security Service, the Border and Customs Offices, the Center for Fighting Economic Crimes and Corruption, and other state offices. State agencies can already exchange information through high-speed, wide-band, encrypted channels. The central information center of the telecommunication system of state agencies has been upgraded. State agencies can use special services such as high-speed and protected access to the Internet, protected data exchange, and e-mail services. There is a technical support office for all information services provided in all state agencies. One of the top-priority tasks at present is to develop protected communication systems between state offices in the capital and regional state offices.
Taking into account the proliferation of computer criminals, the Center of Special Telecommunications is focused on implementing high-tech solutions in the sphere of information security, including digital signature technologies. The digital signature is one of the most significant means of providing high-level protection for e-documents. The introduction of digital signatures in the republic increased the use of e-documents in the communications of state authorities and public offices; this created a sound basis for implementing the electronic exchange of documents, and increased the efficiency of document flow and administrative work. In fact, only 12% of computer crimes in Moldova become the subject of criminal prosecution. Georgiy Suprunov, sales director of the largest telecommunications company in Moldova, said at the international seminar "Control of Internet Use," held in Chisinau, that computer crime is unfavorable for companies because it decreases their business rating.
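The digital-signature mechanism mentioned above can be illustrated in miniature. The Python sketch below is a toy illustration of the principle only, not any system actually deployed in Moldova, and its key is far too small to be secure: a document's hash is transformed with a private exponent, and anyone holding the public exponent can check that the document has not been altered since signing.

```python
# Toy RSA-style digital signature (illustration only, NOT production crypto).
# Real deployments use vetted libraries and keys thousands of bits long.
import hashlib

# Tiny key pair: n = 61 * 53 = 3233; e * d = 17 * 2753 ≡ 1 (mod 60 * 52).
N, E, D = 3233, 17, 2753

def _digest(document: bytes) -> int:
    """Hash the document and reduce the hash into the modulus range."""
    return int.from_bytes(hashlib.sha256(document).digest(), "big") % N

def sign(document: bytes) -> int:
    """Signer: exponentiate the document hash with the private exponent."""
    return pow(_digest(document), D, N)

def verify(document: bytes, signature: int) -> bool:
    """Verifier: undo the signature with the public exponent and compare."""
    return pow(signature, E, N) == _digest(document)
```

Because verification recomputes the hash, altering even one byte of a signed e-document makes `verify` fail, which is the property that underpins electronic document exchange between state offices.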
2. Republic of Kazakhstan

Representatives of Kazakhstan, at a Euro-Asian forum on information security, stated that about 500,000 hacker attacks are attempted each month on their state information networks. Hackers regularly try to break into the databases of banks and companies in attempts to steal large sums of money and to obtain confidential information. The struggle against cyber crime is hampered because the government and these companies
do not want to reveal information about these attacks. In addition, most hackers attack from abroad. The global use of modern information technologies by governments and financial institutions all over the world makes the solution of information security problems a top priority. Besides the direct harm from possible intrusions into systems, information gained through cyber crime may become a serious means of human oppression. For this reason, Kazakhstan adopted state programs to maintain information security, protect state secrets and provide other necessary means, including the highest level of cooperation with the Committee of National Security. It is important to note the need to balance maximum openness of access to information, as mandated by the Constitution of Kazakhstan, with the needs of national security. The sphere of information security also demands the rational investment of state financial resources in building modern information systems and maintaining their security. Under the CIS countries' Protocol on Cooperation in the Area of Fighting Computer Crimes, a computer crime unit was created within the Committee of National Security. The unit's primary tasks are to find and prevent crimes related to the information resources of state authorities. It is essential to cooperate with European and Western countries in this area. Authorized state agencies are interested in ratifying the Council of Europe Convention on Cybercrime adopted in 2001. This will further cooperation in fighting computer crime and will enable colleagues from all over the world to share their experiences in this arena.
3. Republic of Byelorussia

As is well known, a favorable environment for crime in general, and computer crime in particular, may be explained by the conjunction of three factors: motivation, opportunity and the absence of capable counteraction. When all three are present, there is impunity for criminals and the opportunity for repeated illegal actions in the area of computer information. Byelorussian law enforcement first encountered cases of computer crime in 1998. Subsequently, law enforcement agencies began to fight cyber criminals using traditional legal rules. However, this approach was unsuccessful, as most computer crimes grew out of traditional ones but were committed in new ways. The new Criminal Code of the Republic of Byelorussia, enacted in 2001, contains a number of articles permitting prosecution for crimes against property (Article 212 of the CC) and against information security (Articles 349-355 of the CC) committed using computer technologies. The newly introduced offenses are the following seven:

• Article 349, Unauthorized access to computer information;
• Article 350, Modification of computer information;
• Article 351, Computer sabotage;
• Article 352, Unauthorized capture of computer information;
• Article 353, Production or marketing of special means designed to obtain unauthorized access to computer systems or networks;
• Article 354, Development, use or spreading of detrimental computer programs;
• Article 355, Violation of computer system or network operating rules.

Taking into account the reality of the potential threat of the misuse of information technologies, the Ministry of Internal Affairs gradually realized that a complex program of counteraction to these crimes was needed; this program was created in 2001.
Special units were created in the ministry specifically to fight high-tech crimes. Today, cyber crime problems are handled by the high-tech crime unit at the ministry and its regional departments. The state program for implementing the information technology infrastructure of the Republic of Belarus for 2003-2005 (with a forecast to 2010), "Electronic Belarus," was developed to carry out the instructions of the President of Belarus of May 27, 2002, No. 09/540-20, by a group of specialists from various organizations of the republic. This endeavor is under the control of the National Academy of Sciences of the Republic of Belarus and is mandated to remove the registered negative factors. The program involves inter-agency collaboration and is based on the basic regulations of state policy in the area of information technology. According to Vladislav Kaluga, the Chief Executive of "Beltelekom," the number of Internet users in Byelorussia is now over 2 million. According to statistical information from 2003, the number of Internet users in the country was then more than 1 million. Every day thousands of Byelorussians connect to global networks. Between 1998 and 2001, most cyber crimes were related to stealing goods in e-markets using stolen credit cards, and these crimes were committed by single criminals. Now, crimes are more often related to hacking with the goal of stealing information: hacker attacks on the web resources of schools, universities, ISPs (Internet Service Providers), mass media, companies and citizens, in order to obtain restricted information or to disrupt an organization's normal operations. Repeated crimes committed with the use of computers occur almost everywhere on the Internet: theft using stolen credit card information, unauthorized access to computer information, alteration of computer information, computer sabotage, and the distribution of slander about public persons and officials.
Criminals actively use technologies to gain maximum profit from illegal financial operations. An increase in Internet use and higher levels of education contributed to a rise in cyber crime in Belarus last year. Belarusian computer criminals caused about $400,000 in damages to foreign institutions last year, mostly banks, cyber crime investigators reported. Hackers targeted only a few domestic banks. The country's High Technology Crimes Investigation Department started 429 legal proceedings against hackers, and the FBI (U.S. Federal Bureau of Investigation) visited the country twice to get information for its own investigations. As the number of Belarusians attending university rises, so does the number of cyber criminals. Officials at the Ministry of Internal Affairs estimate that 80% of the 189 hackers known in the country are students. In 2006, almost 2,000 crimes involving computers, software and the Internet occurred in the country, and 175 criminal cases involving illegal actions in the area of information security were prosecuted; in 2005 the number of cases prosecuted was 178. In August of last year, the Belarusian Internet community was shaken by a number of hacker attacks, in which anonymous hackers altered hundreds of Belarusian websites. An analysis of the crimes committed in the "digital" sphere shows that the majority (55%) of illegal actions were related to computer crimes: criminal infringements in the sphere of computer information and thefts using computers. Most computer crimes in Belarus are committed by persons aged 18-29 years (60.7%), while those 30 years and older represent about one third (33.3%) and under-aged persons make up 6%. Viewed from another perspective: 11.9% were committed by students, 21.4% by unemployed individuals, 9.5% by women and 3.6% by prisoners (with some overlap among these classes).
According to the Belarusian Ministry of Internal Affairs' Public Relations Department, illegal access to computer information, illegal modification of computer information, computer sabotage, and credit card theft and fraud are the most typical computer crimes in Belarus. In 2005, cooperation between the Russian and Belarusian Ministries of Internal Affairs resulted in the uncovering of an international criminal network specializing in computer crimes. In June of 2005, Belarusian police uncovered an international criminal group that stole USD 112,000 from the accounts of Citibank USA's clients between August and November 2003. In 2003, a criminal group specializing in the online dissemination of child pornography was uncovered in Minsk; that group's total profits amounted to USD 3 million. Currently, the Belarusian Ministry of Internal Affairs participates in the international special operation Innocent Images Task Force. A Ministry of Internal Affairs spokesman stressed that 60% of computer crimes are committed by people aged 18 to 29 years, and 8% by juvenile delinquents. According to the official data of the Interior Ministry of the Republic of Belarus, the growth of computer crime in Belarus is significant: high-tech crimes showed an 80.5% increase in 2002 as compared to 2001, with 924 such crimes occurring in 2002 and 512 in 2001. The experience of counteracting computer crime in Belarus has raised a number of grave questions related to the investigation of computer crimes and the cooperation of law enforcement agencies of different countries. The high-tech crimes unit at the Interior Ministry of the Republic of Belarus, together with the National Hi-Tech Crime Unit of Great Britain, held a joint operation, "Firewall," in 2004. The operation resulted in finding and arresting members of an international criminal group that produced, sold and used forged credit cards.
The Interior Ministry of the Republic of Belarus has emphasized the significance of combating computer crimes. A meeting dedicated to these issues was held at the Ministry in 2004, where counteracting cyber crime was flagged as a high-priority issue and a program of actions, including cooperation among all agencies, was adopted.
4. Conclusion

The agreement on cooperation of the CIS countries in fighting cyber crime, adopted six years ago in Minsk, laid the basis for the strategy and tactics of law enforcement for these countries. This agreement also put in place a mechanism for cooperation to combat high-tech crimes. However, these countries did not fulfill their obligations related to joint investigation measures, the creation of special information systems, and cooperation in the field of training skilled police officers. The CIS countries are at a complicated and critical stage in solving the numerous problems related to the investigation of cyber crimes. Massive Distributed Denial of Service (DDoS) attacks, in which hundreds of compromised botnet computers, and even networks from several countries, are used to attack websites in other countries, and computer worms causing damage in two-thirds of the countries of the world, raise fundamental questions about the localization and jurisdiction of these crimes and criminals. Recently, a number of measures were taken in order to develop efficient methods of international collaboration in combating cyber crime at the regional and international levels, and these methods produced significant results. Cooperation is
necessary for operations combating the wide range of problems related to computer crime to be successful. It is also necessary to develop partnerships between the CIS and all other interested parties. To reiterate, only through cooperation among nations will it be possible to stop the growth of Internet crime.
The Enemy Within: Where Are the Islamist/Jihadist Websites Hosted, and What Can Be Done About It? Yigal CARMON1 MEMRI, Washington, DC, USA
Abstract. Terrorist organizations and their supporters make extensive use of the Internet for a variety of purposes, including the recruitment, training and indoctrination of jihad fighters and supporters. The overwhelming majority of the Islamist/jihadist websites are hosted by Internet Service Providers (ISPs) based in the West, many of which are unaware of the content of the Islamist/jihadist sites they are hosting. Experience shows that, once informed, most of the ISPs remove these sites from their servers. Therefore, an effective way to fight the phenomenon is to expose the sites via the media. It would be advisable to establish an organization – governmental or non-governmental – which would maintain a database and publish information about Islamist/jihadist sites on a regular basis, and/or provide such information to ISPs upon their request. Keywords. Jihadist websites, ISPs in the U.S., online jihad, Internet jihad
Introduction

Extremist Islam makes extensive use of the Internet. [1] One can hardly imagine the growth of radical Islam and its jihadist organizations in recent years without the immense reach, impact and capabilities of the Internet. The threat posed by Islamist websites has recently been demonstrated by three cases: the planned attack on Fort Dix by a group of young men from New Jersey; the planned terrorist attack on JFK Airport; and the attempted car bombings in the UK, all of them in 2007. According to media reports, the terrorists in all three cases were inspired by jihadist websites. There were also two recent court cases, in Britain and in Switzerland, in which terrorists were convicted of using Internet sites to promote terrorist activities. The National Intelligence Estimate recently published by the U.S. National Intelligence Council stressed the following: "We assess that the spread of radical - especially Salafi - Internet sites, increasingly aggressive anti-US rhetoric and actions, and the growing number of radical, self-generating cells in Western countries indicate that the radical and violent segment of the West's Muslim population is expanding, including in the United States. The arrest and prosecution by U.S. law enforcement of a small number of violent Islamic extremists inside the United States - who are becoming
Corresponding Author: Yigal Carmon, MEMRI, P.O. Box 27837, Washington, DC 20038-7837, Phone: (202) 955-9070; Email:
[email protected].
Y. Carmon / The Enemy Within
more connected ideologically, virtually, and/or in a physical sense to the global extremist movement - points to the possibility that others may become sufficiently radicalized that they will view the use of violence here as legitimate..." [2]
1. The Goals of the Jihadist Websites

The jihadist terrorist organizations utilize the Internet for two main purposes: for operational needs, and for indoctrination and da'wa (propagation of Islam).

1.1. Operational Purposes

The terrorists use the Internet in the military training of jihad fighters by circulating military guidebooks on weaponry, battle tactics, explosives manufacture, and other topics. An example is Al Qaeda's online military magazine Mu'askar Al-Battar (The Al-Battar Training Camp), published by the Military Committee of the Mujahideen in the Arabian Peninsula. [3]
Figure 1. Al Qaeda's online military magazine Mu'askar Al-Battar (www.qa3edoon.com). ISP: R & D Technologies LLC; Nevada, USA
Figure 2. Online course on manufacturing explosives (www.w-n-n.com). ISP: SiteGenie LLC; Minnesota, USA
Some websites also carry courses on manufacturing explosives and even guides for making homemade dirty bombs. [4] Another type of online operational activity is the use of hacking techniques to break into Internet sites - what the Islamists term "electronic jihad." As part of this activity, Islamist hackers attack the websites of those whom they consider their enemies with the aim of sabotaging the sites and damaging morale. They also attempt to hack into strategic economic and military networks with the aim of inflicting substantial damage on infrastructures in the West. Many Islamist websites and forums have special sections devoted to the topic of electronic jihad, such as the electronic jihad section in the Abu Al-Bukhari forum.
Figure 3. Electronic jihad section on the Al-Bukhari forum (www.abualbokhary.info). ISP: Everyones Internet; Texas, USA
1.2. Indoctrination and Da'wa Needs

The main use of the Internet by Islamist/jihadist organizations is for purposes of indoctrination and da'wa (propagation of Islam). These organizations attach great significance to this Internet activity, as is evident from the considerable effort they invest in it. Al Qaeda, for example, has an "information department" and a very active production company, Al-Sahab. Likewise, the Islamic State of Iraq (ISI) - an umbrella organization founded by Al Qaeda, incorporating several terrorist groups in Iraq - has an "information ministry" and two media companies, Al-Furqan and Al-Fajr. In addition, there are independent media companies serving the Islamist organizations, such as the Global Islamic Media Front (GIMF), which denies having ties with Al Qaeda but has posted Al Qaeda statements taking responsibility for terrorist acts. The GIMF has also established the Media Jihad Brigade (Katibat Al-Jihad Al-I'lami). [5] The online indoctrination and da'wa activities are regarded by these organizations as an integral part of jihad, and as another front of jihad in addition to its military, economic, and political fronts. In fact, they characterize online media or informational activity as a type of jihad that can be carried out by those who cannot participate in the
fighting on the battlefield. They call this activity "media jihad" (al-jihad al-i'lami) or "da'wa jihad" (al-jihad al-da'awi). [6] The following are some examples of websites that carry out indoctrination and da'wa activities by disseminating communiqués, religious tracts and videos that serve the terrorists' purposes.

a) The "World News Network" forum, which posts communiqués by numerous terrorist organizations. The following is a page from the website that includes messages by Al Qaeda's production company Al-Sahab, by the ISI media company Al-Furqan, and by the Iraqi terrorist organization Ansar Al-Sunna.
Figure 4. The "World News Network" forum (www.w-n-n.net). ISP: SiteGenie LLC; Minnesota, USA
b) The Al-Saha forum, which posts videos produced by Al-Sahab and speeches by Al Qaeda leaders like Ayman Al-Zawahiri.
Figure 5. The Al-Saha forum (www.alsaha.com). ISP: Liquid Web Inc.; Michigan, USA
c) Some Islamist websites have special sections devoted to messages by the jihadist media companies. For example, the Al-Nusra forum, which is affiliated with GIMF, has a special section devoted to GIMF messages.
Figure 6. Page devoted to GIMF messages on Al-Nusra forum (www.alnusra.net). ISP: Select Solutions LLC; Texas, USA
In practice, the two types of online activity conducted by terrorist organizations - military activity and informational activity - cannot be separated, since prominent terrorists are actively involved in online media activities. For example, Fares Al-Zahrani, an Al Qaeda leader who was arrested in August 2004 by the Saudi authorities, used to post on Islamist forums and on Abu Muhammad Al-Maqdisi's website "Minbar Al-Tawhid Wal-Jihad" under the pseudonyms Abu Jandal Al-Azadi and Al-Athari. [7] Another senior Al Qaeda operative, Abd Al-Aziz Al-Anzi, who likewise participated in clashes with Saudi security forces, was also very active on the Internet. Al-Anzi, known as "the information minister of Al Qaeda in the Arabian Peninsula," headed the Al Qaeda Media Council, recruited Al Qaeda supporters on Internet forums, and was a supervisor on the Al-Salafiyoon website (www.alsalafyoon.com) under the pseudonym Abd Al-Aziz Al-Bakri. He also wrote regularly for Al Qaeda's online magazine Sawt Al-Jihad (The Voice of Jihad) under various pseudonyms, including Abdallah bin Nasser Al-Rashid, Abd Al-Aziz bin Musharraf Al-Bakri, Sheikh Nasser Al-Najdi, and Nasser Al-Din Al-Najdi. [8]
2. Most Islamist Websites Are Hosted on Servers Based in the West

Some of the Arab countries in which Islamic extremists are most active employ highly restrictive supervision measures against individuals and groups involved in online terrorist activity. [9] As a result, Islamist organizations and their supporters prefer to use Internet Service Providers (ISPs) located in the West - and especially in the U.S., which is a key provider of Internet services - and thus exploit Western freedom of speech to spread their message. In many cases, Western ISPs even host websites of organizations that have been officially designated by these very countries as illegal terrorist organizations. It must be stressed, though, that the ISPs themselves are frequently unaware that they are providing services to extremist elements. The following are examples of jihadist websites hosted in the West (along with their URLs and the names of the ISPs that host them), in three main categories: [10]
Y. Carmon / The Enemy Within
• Websites officially or unofficially associated with specific terrorist organizations
• Forums used by terrorist organizations
• Websites of sheikhs supporting terrorism
2.1. Websites of Terrorist Organizations

a) The website of the Iraqi organization Asaeb Al-Iraq, hosted in Texas.
Figure 7. www.iraqiasaeb.org. ISP: Layered Technologies Inc.; Texas, USA
b) The website of the Jaysh Al-Mujahideen organization, hosted in Pennsylvania.
Figure 8. www.nasrunmiallah.net. ISP: Network Operations Center Inc.; Pennsylvania, USA
c) The website of the Palestinian Islamic Jihad organization, hosted in Texas.
Figure 9. www.sarayaalquds.org. ISP: ThePlanet.com Internet Services Inc.; Texas, USA
d) A Pro-Hezbollah website, hosted in Texas.
Figure 10. www.shiaweb.org/hizbulla/index.html. ISP: Everyones Internet; Texas, USA
e) A blog called "Supporters of Jihad in Iraq," hosted in Washington state. The caption at the top of the page says "Kill the Americans everywhere."
Figure 11. www.hussamaldin.jeeran.com. ISP: Electric Lightwave; Washington, USA
2.2. Islamist Forums

Currently, forums are the primary platform used by Islamist organizations to spread their message. The importance the organizations attach to these forums is evident in the fact that their official statements, such as communiqués or videos, are usually posted on forums before being posted on their websites. For instance, Al Qaeda, which has no official websites, uses Islamist forums to convey its messages. The following are some examples of forums:

a) The highly popular Al-Hesbah forum, hosted in Texas.
Figure 12. www.alhesbah.org. ISP: RealWebHost, Texas, USA
b) The Al-Tajdid forum, hosted in Britain.
Figure 13. www.tajdeed.org.uk. ISP: FASTHOSTS-UK-NETWORK, UK
c) The Al-Saha forum, hosted in the U.S.
Figure 14. www.alsaha.com. ISP: Liquid Web Inc.; Michigan, USA
2.3. Websites of Sheikhs Who Support Terrorism

Known Islamist sheikhs play a pivotal role in setting up the terrorist organizations' ideological infrastructure and in granting religious-legal legitimacy to their activities. Many of these religious scholars are currently serving prison sentences for incitement to terrorism or even for active involvement in terrorist activity. However, this does not prevent them from spreading inciting messages and supporting terrorism via their websites, which remain active, and are often hosted in the very same countries that convicted and imprisoned them. For example:

a) The website of Abu Qatada Al-Falastini (a.k.a. Omar Mahmoud Abu Omar), a Jordanian of Palestinian origin, who has been detained in Britain since 2005 on suspicion of having ties with Al Qaeda. His site is hosted in Britain.
Figure 15. www.almedad.com/tae. ISP: BT-Wholesale, UK.
b) The website of Sheikh Abu Muhammad Al-Maqdisi, who was the spiritual mentor of the late commander of Al Qaeda in Iraq, Abu Mus'ab Al-Zarqawi. Al-Maqdisi's website includes numerous documents that provide ideological and religious legitimacy for Islamist terrorist organizations. [11]
Figure 16. www.tawhed.ws / www.alsunnah.info. ISP: Interserver Inc.; New Jersey, USA
c) The website of extremist Islamist sheikh Hamed Al-Ali, known for supporting jihad fighters.
Figure 17. www.h-alali.net. ISP: FortressITX; New Jersey, USA
2.4. Sites Hosted by the Large Internet Companies Google™, Yahoo!™ and MSN™

The Islamist/jihadist organizations also make use of the free Internet services offered by large Internet companies like Google™, Yahoo!™ and MSN™. The following are examples of jihadist blogs hosted by these companies:
a) The unofficial blog of Al-Furqan (the production company of the Al Qaeda-founded ISI), hosted by Google™.
Figure 18. www.gldag.blogspot.com. ISP: Google Inc.; California, USA
b) A blog dedicated to the legacy of the late commander of Al Qaeda in Iraq, Abu Mus'ab Al-Zarqawi. The blog, hosted by Google™, features messages and videos by Al-Zarqawi.
Figure 19. www.kjawd.blogspot.com. ISP: Google Inc.; California, USA
c) A Yahoo!™ group called "Supporters of Jihad in Iraq."
Figure 20. http://groups.yahoo.com/group/ansaraljehad. ISP: Yahoo!; California, USA
d) A pro-Hezbollah group on Microsoft's portal MSN™ which posts speeches by Nasrallah, information on Hezbollah operations and links to other Hezbollah sites.
Figure 21. http://groups.msn.com/Hezbollahh/page16.msnw. ISP: Microsoft Corp.; Washington, USA
2.5. Websites in European Languages

Among the jihadist websites hosted in the West, most are in Arabic, but there are also websites in European languages such as English, French and German, as well as Arabic websites that have sections in other languages. For example:
a) The World News Network forum, hosted in the U.S., has a section in English.
Figure 22. www.w-n-n.net. ISP: SiteGenie LLC; Minnesota, USA
b) GIMF's website in German, hosted in Britain:
Figure 23. http://gimf1.wordpress.com. ISP: Akamai Technologies; London, UK.
c) A website in French called La Voix des Opprimés (The Voice of the Oppressed), hosted in the U.S.
Figure 24. http://news.stcom.net. ISP: DNS Services; Florida, USA
3. What Can Be Done?

The prevalent view in the West, even among officials in charge of counterterrorism, is that the primary way to fight the jihadist websites is to spread an alternative message, or a "counter-narrative," which is opposed to that of the Islamists. [12] Though Islamist ideology can and should be countered by alternative messages, and is indeed being increasingly challenged by reformists in the Arab and Muslim world, such an ideological campaign is, by its very nature, a long-term effort that does not produce immediate results. A more effective and immediate way to fight the phenomenon is, firstly, to expose the extremist sites via the media, and thus to inform ISPs and the public at large of their content, and secondly, to bring legal measures against ISPs that continue to host extremist websites and forums.

3.1. Exposure

Experience teaches us that exposure is, in itself, an effective measure against extremist sites. In 2004, MEMRI published a comprehensive two-part review of Islamist websites and their hosts. [13] Within a week of the document's publication, most of the sites exposed in it were closed down by their ISPs. This suggests that an effective measure against the extremists' online activities would be to establish a database - governmental or non-governmental - which would regularly publish information about Islamist/jihadist sites, and/or provide such information to ISPs upon their request. This database would provide a service similar to that of government bodies which inform the public about various kinds of threats to its safety, such as agencies that provide weather alerts and travel advisories; the Better Business Bureau, which provides businessmen and individuals with information about companies convicted of fraud; or the U.S. Treasury's Office of Foreign Assets Control, which provides information to banks, thereby enabling them to comply with "know your customer" regulations.
[14] It should also be stressed that the ISPs themselves have the legal authority to remove sites that violate the law (e.g., the copyright laws) or sites that violate their own regulations or terms of use. Thus, with information on extremist sites at their disposal, the ISPs should have the ability – and the obligation – to remove such sites from their servers.

3.2. Legal Countermeasures

The prevailing opinion among the American public - and even among government officials, including those in charge of counterterrorism - is that the First Amendment severely limits the scope of legal measures that can be brought against terrorist websites. But the fact is that American law contains clear and effective provisions against terrorist organizations and their online activities, including the following:

3.2.1. Designated Terrorists and Foreign Terrorist Organizations

The U.S. officially designates certain organizations as "Foreign Terrorist Organizations" (FTOs) and certain individuals as "Specially Designated Terrorists" (SDTs) or "Specially Designated Global Terrorists" (SDGTs). The list of designated organizations and individuals is updated regularly and made available to the public. [15] These designations have legal consequences. For example, section 219 of the Immigration and Nationality Act states that it is unlawful to provide a designated FTO with "material support or resources," including "any property, tangible or intangible, or services," including "communication equipment and facilities." [16] Some of the organizations and individuals listed in this document - such as the Islamic Jihad organization, for example - are designated FTOs and SDGTs, which means that the ISPs hosting their sites are violating U.S. law.

3.2.2. 18 U.S.C. Section 842

Another pertinent law is Title 18, Section 842, of the U.S. Code, which states: "It shall be unlawful for any person to teach or demonstrate the making or use of an explosive, a destructive device, or a weapon of mass destruction, or to distribute by any means information pertaining to...
the manufacture or use [thereof] with the intent that the teaching, demonstration, or information be used for, or in furtherance of, an activity that constitutes a Federal crime of violence” (see original document listed on http://www.atf.gov/explarson/fedexplolaw/explosiveslaw.pdf). Websites and forums that disseminate operational military information to terrorists, such as many of the sites and forums presented here, are illegal under this law, even if they are not formally associated with a designated FTO. A motion claiming this law to be incompatible with the First Amendment was denied by a U.S. court in the case of Rodney Adam Coronado, a radical animal-rights and environmental activist. Coronado was charged with violating Title 18, Section 842 of the U.S. Code after giving a lecture in California in which he showed how to build an incendiary bomb. After being indicted, Coronado filed a motion to dismiss the case against him on grounds that the law had violated his First Amendment rights. The motion, however, was denied by the District Court of San Diego. In explaining its ruling, the court stated that "the First Amendment does not provide a defense to a criminal charge simply because the actor uses words [rather than actions] to carry out his illegal purpose..."
References

[1]
For a general overview of Islamist websites, see MEMRI Inquiry & Analysis No. 328, " Islamist Websites as an Integral Part of jihad: A General Overview," February 21, 2007, http://memri.org/bin/articles.cgi?Page=archives&Area=ia&ID=IA32807. [2] http://www.dni.gov/press_releases/20070717_release.pdf. [3] See MEMRI Special Dispatch No. 637, "The Al-Battar Training Camp: The First Issue of Al-Qa'ida's Online Military Magazine," January 6, 2004, http://memri.org/bin/articles.cgi?Page=archives&Area=sd &ID=SP63704. [4] See MEMRI Special Dispatch No. 1004, "On Islamic Websites: A Guide for Preparing Nuclear Weapons," October 12, 2005, http://memri.org/bin/articles.cgi?Page=archives&Area=sd&ID= SP100405. [5] Al-Sharq Al-Awsat (London), August 14, 2005. [6] See article by Nur Al-Din Al-Kurdi in the third issue of the magazine Dhurwat Sanam Al-Islam, published by the Al Qaeda in Iraq Information Department. [7] Al-Sharq Al-Awsat (London), July 31, 2007. [8] http://www.alarabiya.net/Articles/2005/10/17/17782.htm. [9] Saudi Arabia, for example, passed a bill in March 2007 that stipulates imprisonment of up to 10 years and/or a fine of up to 5 million riyals for anyone setting up a website for a terror organization or a site promoting terrorist goals (Al-Watan, Saudi Arabia, March 27, 2007). There have also been arrests. On June 5, 2007, the Saudi Security services detained three Al Qaeda members for using the Internet to spread extremist ideology and recruit young people to jihad. One of the detainees was Abu Asid AlFaluji, who attempted to recruit youths for terrorist operations and disseminated via the Internet speeches by bin Laden and Abu Mus'ab Al-Zarqawi; another was Abu Adullah Al-Najdi, who was involved in terrorist operations against Saudi Arabia and attempted to publish a new edition of the Al Qaeda in Saudi Arabia e-journal, Sawt Al-Jihad, which was closed down two years ago (Al-Watan, Saudi Arabia, June 6, 2007; Al-Riyadh, Saudi Arabia, June 7, 2007). 
[10] All data concerning the websites was verified July 15, 2007. [11] See MEMRI Inquiry and Analysis No. 239, "Dispute in Islamist Circles over the Legitimacy of Attacking Muslims, Shiites, and Non-Combatant Non-Muslims in Jihad Operations in Iraq: Al-Maqdisi vs. His Disciple Al-Zarqawi," September 11, 2005, http://memri.org/bin/articles.cgi?Page=archives &Area=ia&ID=IA23905. [12] Measures of this sort were the focus at a May 3, 2007 hearing of the Senate Homeland Security and Governmental Affairs Committee, chaired by Senator Joseph Lieberman, which dealt with "The Internet as a Portal for Islamist Extremists." Among the speakers at this hearing were Lieutenant Colonel Joseph Felter from the United States Army; Deputy Assistant Secretary of Defense Michael Doran; and Frank J. Cilluffo, Director of the Homeland Security Institute at George Washington University. Frank Cilluffo stated that, in order to counter the Islamist threat, the West must "win the battle for hearts and minds... and offer hope... to those who might otherwise be seduced by the jihadi ideology." As examples of activities that achieve this aim, Cilluffo mentioned, among other things, the Indonesian pop-star Ahmad Dhani who "uses his music... to counter calls to violence with a message of peace and tolerance," and an anti-terrorism fatwa issued in 2005 by the Fiqh Council of North America - a pro-Islamist body (Speech delivered at a hearing of the U.S. Senate Homeland Security and Governmental Affairs Committee, May 3, 2007). [13] See MEMRI Special Report No. 31, "Islamist Websites and Their Hosts Part I: Islamist Terror Organizations," July 16, 2004, http://memri.org/bin/articles.cgi?Page=archives&Area=sr&ID=SR3104; MEMRI Special Report No. 35: " Islamist Websites and their Hosts Part II: Clerics," November 11, 2004, http://memri.org/bin/articles.cgi?Page=archives&Area=sr&ID=SR3504. [14] Website of the Office of Foreign Assets Control: http://www.treas.gov/offices/enforcement/ofac/. 
[15] For an updated list of Specially Designated Nationals, see: http://www.treas.gov/offices/enforcement/ofac/sdn/t11sdn.pdf. [16] http://www.state.gov/s/ct/rls/fs/37191.htm.
Part Two
Discovering Hidden Patterns of Activity
Security Informatics and Terrorism: Patrolling the Web C.S. Gal et al. (Eds.) IOS Press, 2008 © 2008 IOS Press. All rights reserved.
Mining Users’ Web Navigation Patterns and Predicting Their Next Step

José BORGES (a) and Mark LEVENE (b,1)
(a) School of Engineering, University of Porto, Portugal
(b) School of Computer Science and Information Systems, Birkbeck, University of London

Abstract. Web server logs can be used to build a variable length Markov model representing users’ navigation through a web site. With the aid of such a Markov model we can attempt to predict the user’s next navigation step on a trail that is being followed, using a maximum likelihood method that predicts that the highest probability link will be chosen. We investigate three different scoring metrics for evaluating this prediction method: the hit and miss score, the mean absolute error and the ignorance score. We present an extensive experimental evaluation on three data sets that are split into training and test sets. The results confirm that the quality of prediction increases with the order of the Markov model, and further increases after removing unexpected, i.e. low probability, clicks from the test set.

Keywords. Web usage mining, variable length Markov chains, scoring rules
Introduction

When users click on a link in a web page, submit a query to a search engine or access a wireless network they leave a trace behind them that is stored in a log file. The information stored in the log file for each user click will include items such as a time-stamp, identification of the user (for example, an IP address, a cookie or a tag), the user’s location, query terms entered and further clickstream data, where appropriate. (We use the term ‘click’ generically to mean a click on a link, a query submission or an access to a network.) The log file thus contains an entry for each click and can be pre-processed into time-ordered sessions of sequential clicks. Each session includes a trail that the user followed through the space, which we take without loss of generality to be a web site. A trail can be defined in various ways, for example as a sequence of clicks that has taken no longer than a given time span, or one such that the time between clicks in the sequence is no longer than a given short time span, or as a sequence of link-connected clicks (i.e. clicks that are linked together). In [21] the authors present a study to evaluate heuristics to reconstruct sessions from server log data, known as sessionizing. They show that sessions can be accurately inferred for web sites with embedded session identification mechanisms, that time-based reconstruction heuristics are acceptable when cookie identifiers

1 Corresponding Author: Mark Levene, School of Computer Science and Information Systems, Birkbeck, University of London, Malet Street, London WC1E 7HX, U.K.; E-mail:
[email protected]
J. Borges and M. Levene / Mining Users’ Web Navigation Patterns and Predicting Their Next Step
are available and that referrer-based heuristics should be used when cookie identifiers are not available. Logs are thus a valuable source of information for understanding what users are doing and how a site is being used. Here we focus on web usage mining, which concentrates on developing techniques that model and study users’ web navigation data obtained from server log files, with the aim of discovering and evaluating “interesting” patterns of behaviour [17]. It is important to note that apart from studying log data recording a user’s clickstream, other information, for example related to the content being browsed, or to the context in which the user is browsing, will be very useful. Several methods have been proposed for modelling web usage data. Schechter et al. [20] utilised a tree-based data structure that represents the collection of paths inferred from the log data to predict the next page access. Dongshan and Junyi [11] proposed a hybrid-order tree-like Markov model, which provides good scalability and high coverage of the state-space, also used to predict the next page access. Chen and Zhang [9] utilised a Prediction by Partial Match forest that restricts the roots to popular nodes; assuming that most user sessions start at popular pages, the branches having a non-popular page as their root are pruned. Deshpande and Karypis [10] proposed a technique that builds kth-order Markov models and combines them to include the highest order model covering each state; a technique to reduce the model complexity is also proposed therein. Moreover, Eirinaki et al. [12] propose a method that incorporates link analysis, such as the PageRank measure, into a Markov model in order to provide web path recommendations.
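The time-based session reconstruction (sessionizing) heuristic described above can be sketched in a few lines. This is an illustrative sketch, not code from the paper; the record format (user identifier, timestamp, page) and the 30-minute gap are assumptions:

```python
from datetime import datetime, timedelta

def sessionize(clicks, gap_minutes=30):
    """Group (user_id, timestamp, page) clicks into sessions using a
    time-gap heuristic: a new session starts when the time since the
    user's previous click exceeds the gap."""
    sessions = []
    last_seen = {}  # user_id -> (timestamp of last click, index into sessions)
    for user, ts, page in sorted(clicks, key=lambda c: (c[0], c[1])):
        prev = last_seen.get(user)
        if prev is None or ts - prev[0] > timedelta(minutes=gap_minutes):
            sessions.append([page])           # start a new session
            last_seen[user] = (ts, len(sessions) - 1)
        else:
            sessions[prev[1]].append(page)    # continue the current session
            last_seen[user] = (ts, prev[1])
    return sessions
```

Referrer-based heuristics, as noted above, would instead check that each click's referrer matches a page already in the session.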
Applications of web usage mining include: creating adaptive web sites that automatically improve their organisation and presentation [18], prefetching web pages to improve latency [9], personalisation of users’ web experience [17], and web access prediction [11,10,6,7]. In [3,4] we have proposed a first-order Markov model for a collection of user navigation sessions, and more recently we have extended the method to represent higher-order conditional probabilities by making use of a cloning operation [7]. In particular, we make use of a Variable Length Markov Chain (VLMC), which is an extension of a Markov chain that allows variable length history to be captured [1]. Our method transforms a first-order model into a VLMC so that each transition probability between two states takes into account the path a user followed to reach the first state prior to choosing the outlink corresponding to the transition to the second state. In addition, we have previously proposed a method to evaluate the predictive power of a model that takes into account a variable length history when estimating the probability of the next link choice of a user, given the user’s navigation trail [6]. Here we concentrate on the prediction problem, i.e. given a trail, we ask how well we can predict the final link the user followed to complete the trail. The algorithm we use for prediction is to choose the highest probability link, given that the user has inspected n web pages before reaching the final link on the trail, where n ≥ 1. In particular, we are interested in knowing how the prediction is affected by n, the order of the model used for prediction. In addition, we investigate three different scoring metrics for evaluating the prediction algorithm: the hit and miss score (also known as the hit rate) [23], the mean absolute error [23] and the ignorance score (also known as the information score) [19].
We are also interested in assessing unexpected or surprising events [16,13], since such rare events are not predictable. Although not predictable, unexpected clicks can be
detected by the fact that their probability of occurrence is low. The approach we take is to label a click as unexpected if its probability is less than or equal to some threshold. Equivalently we can say that an event is expected if its probability of occurrence is greater than the threshold. We observe that if the threshold is zero then unexpected events are the ones that have not yet occurred in the log data. A significant application for the detection of unexpected events is patrolling the Web for security purposes [8]. Web usage mining can be used in this context to capture and analyse navigation patterns within a web site being monitored. Once an unexpected event is identified then the focus can be directed to tracking the user initiating the unexpected event. We have carried out an extensive experimental evaluation of the prediction algorithm on three substantial data sets with respect to the three scoring metrics. For evaluation purposes we split the data into a training set on which the VLMC model is constructed and a test set on which the prediction algorithm is evaluated. Our results show that prediction improves as the order of the Markov model increases, i.e. with a longer user history we are more likely to predict the user’s next step correctly. Moreover, the results also suggest that prediction stabilises as the order of the model increases. We also found that when we remove unexpected clicks from the training set up to a threshold of 0.1, the prediction further improves. The rest of the paper is organised as follows: in Section 1 we introduce the variable length Markov chain model we use. In Section 2 we introduce the prediction problem and present three different methods of evaluating the prediction. In Section 3 we discuss the issue of unexpected accesses. In Section 4 we present an experimental evaluation of the prediction algorithm, and, finally, in Section 5 we give our concluding remarks.
1. Web Mining with Variable Length Markov Chains

In previous work we have proposed the use of Markov models to represent a collection of user sessions. A first-order Markov model [14] provides a compact way of representing a collection of sessions but, in most cases, its accuracy is low. A VLMC is a Markov model extension that allows variable length navigation history to be captured. We have proposed in [5] a method that transforms a first-order model into a VLMC so that each transition probability between two states takes into account the path a user followed to reach the anchor state. We now briefly review our VLMC construction method. Consider the collection of sessions and the corresponding first-order model given in Figure 1. Each web page corresponds to a state in the model. In addition, there is an artificial start state (S) and an artificial final state (F) appended to every session. The first-order model is incrementally built by processing each sequence of page requests. A transition probability is estimated by the ratio of the number of times the transition was traversed to the number of times the anchor state was visited. Next to a link, the first number gives the number of times the link was traversed and the number in parentheses gives its estimated probability. The model accuracy can be assessed by comparing a transition probability with the corresponding higher-order probability estimated by the n-gram frequency counts. For example, according to the input data the conditional probability of going to state A3 after following link (A1, A2) is given by the number of times users followed (A1, A2, A3)
divided by the number of times users followed (A1, A2), that is 3/6 = 0.5. The first-order model is not accurate since it gives p(A2, A3) = 0.38.
Session       freq.
A1, A2, A3      3
A1, A2, A4      3
A5, A2, A3      1
A5, A2, A4      4
A6, A2, A3      2
A6, A2, A4      3

[First-order model diagram: states S, A1, A5, A6, A2, A3, A4 and F, with each link labelled by its traversal count and, in parentheses, its estimated probability; e.g. A2 → A3 is labelled 6 (0.38) and A2 → A4 is labelled 10 (0.62).]
Figure 1. A collection of user navigation sessions and the corresponding first-order model
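The counts and probabilities in Figure 1 can be reproduced directly from the session table. The following is an illustrative sketch (the dictionary-of-counters representation is our own, not the paper's):

```python
from collections import Counter

# Session table from Figure 1, with frequencies.
sessions = [(("A1", "A2", "A3"), 3), (("A1", "A2", "A4"), 3),
            (("A5", "A2", "A3"), 1), (("A5", "A2", "A4"), 4),
            (("A6", "A2", "A3"), 2), (("A6", "A2", "A4"), 3)]

# Count first-order transitions, appending the artificial start (S)
# and final (F) states to every session.
trans, visits = Counter(), Counter()
for trail, freq in sessions:
    path = ("S",) + trail + ("F",)
    for a, b in zip(path, path[1:]):
        trans[(a, b)] += freq
        visits[a] += freq

def p(a, b):
    # Transition probability: traversals of link (a, b) over visits to a.
    return trans[(a, b)] / visits[a]

# First-order estimate for the link (A2, A3): 6/16 = 0.375, i.e. 0.38.
assert round(p("A2", "A3"), 2) == 0.38

# Second-order estimate from the n-gram counts: times (A1, A2, A3) was
# followed over times (A1, A2) was followed, i.e. 3/6 = 0.5.
n3 = sum(f for t, f in sessions if t[:3] == ("A1", "A2", "A3"))
n2 = sum(f for t, f in sessions if t[:2] == ("A1", "A2"))
assert n3 / n2 == 0.5
```

The mismatch between the two assertions (0.38 versus 0.5) is exactly the inaccuracy that motivates cloning state A2.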
Our model extension that incorporates higher-order probabilities is based on a cloning operation that duplicates states whose outlink transition probabilities are not accurate. In Figure 1, for example, state A2 is cloned in order to separate in-paths to it that induce distinct conditional probabilities. In order to identify in-paths inducing similar conditional probabilities a clustering method is used; there is also a parameter used to set the intended accuracy. Figure 2 shows the resulting second-order model when a 5% deviation is allowed between the conditional probability induced by the n-gram counts and the transition probability given by the model. Since paths from A1 and A6 induce closer conditional probabilities, they were assigned to the same clone. The transition counts of the updated model are computed in a way that reflects the number of times each path was followed. Although the example corresponds to a second-order evaluation, the method was generalised for higher-orders; see [5].
[Second-order model diagram: state A2 is cloned into A2 and A'2. The in-paths from A1 and A6 lead to A2, whose outlinks are A3, labelled 5 (0.45), and A4, labelled 6 (0.55); the in-path from A5 leads to A'2, whose outlinks are A3, labelled 1 (0.20), and A4, labelled 4 (0.80).]
Figure 2. The second order model corresponding to the model given in Figure 1
Modelling a collection of sessions in a VLMC has the advantage of providing a platform to (i) identify the most popular trails, which are defined as the higher probability paths, and (ii) predict the user’s next navigation step after following a given trail.
2. Prediction and Scoring Rules

Given a trail, the prediction problem is the task of predicting the next link a user followed to complete the trail. That is, by observing the clicks that lead to the last link on the trail, i.e. the clickstream history, the last link is the one to be predicted. The prediction algorithm we use is simply to choose the highest probability link, given that the user has inspected n web pages (states) before reaching the final link on the trail, where n ≥ 1; this prediction method is known as maximum likelihood. In particular, we are interested in knowing how the prediction is affected by n, the order of the model used for prediction. We investigate three different scoring metrics (see [22]) for evaluating the prediction algorithm. The first method is the hit and miss scoring rule (HM) [23], which counts a correct prediction, i.e. a hit, as 1 and an incorrect prediction, i.e. a miss, as 0. HM can be interpreted as the probability of guessing that the link followed was the one with the maximum probability, and is thus equal to the expected maximum likelihood probability of the last link on the trail; we denote this expected probability by MTP. The second method is the mean absolute error (MAE) [23], which ranks the links from 1 to r, where the rth link was the one that was followed, and records the MAE as r − 1. MAE can be interpreted as the expected rank of the last link on the trail that was followed minus one. So, for example, if, on average, the user clicks on the link that was ranked 3rd the MAE will be 2. The third method is the ignorance score (IS) [19], which records the score as − log2(p), where p is the probability of the link that was, in fact, followed by the user.
The ignorance score has an information-theoretic interpretation as the entropy of an event [19], and two to the power of minus IS can be interpreted as the expected probability of the last link on the trail that was followed by the user; we denote this expected probability by ATP. As opposed to HM and MAE, IS is a non-linear scoring function, ranging from zero, when p = 1, to infinity, when p = 0 (when p = 0.5, the IS is equal to 1), i.e. the penalty is large when the user follows a link whose probability of occurrence is very low.
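The three scoring rules can be sketched as follows for a single prediction, where `probs` holds the model's outlink probabilities for the state reached before the final link. This is an illustrative sketch, not the authors' code:

```python
import math

def score_prediction(probs, followed):
    """Score a next-link prediction. `probs` maps each candidate link to
    its model probability; `followed` is the link the user actually chose.
    Returns (HM, MAE, IS)."""
    ranked = sorted(probs, key=probs.get, reverse=True)
    hm = 1 if ranked[0] == followed else 0        # hit and miss: hit = 1, miss = 0
    mae = ranked.index(followed)                  # rank of followed link minus one
    p = probs[followed]
    ig = math.inf if p == 0 else -math.log2(p)    # ignorance score, -log2(p)
    return hm, mae, ig
```

For example, with probabilities {A: 0.5, B: 0.3, C: 0.2}, a user who follows A gives HM = 1, MAE = 0 and IS = 1 (since p = 0.5), while a user who follows B gives HM = 0, MAE = 1 and a larger ignorance score.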
3. Unexpected Events

If the user follows a link whose probability of occurrence is very low (or zero when it is followed for the first time), the event may be considered as being unexpected from the point of view of the Markov model. Such unexpected or surprising events [16,13] are not predictable, since they are unlikely to occur according to the constructed VLMC model. Thus, although not predictable, unexpected user clicks can be detected by the fact that their probability of occurrence is low. The approach we take here is to label a click as unexpected if its probability is less than or equal to some threshold, α. Equivalently we can say that an event is expected if its probability of occurrence is greater than the threshold, α. We observe that if the threshold is less than zero, then there are no unexpected events (in the following we use all to denote a value of α which is less than 0). We also observe that when α = 0 then the only unexpected events are the ones whose probability is zero, i.e. representing links that have not yet occurred in the Markov model. When α = all, i.e. α < 0, we apply a form of Laplace smoothing [24] to the model. In particular, for each state, we set the probability, p_i, of the ith outlink from the state as
p_i = (w_i + d) / (W + m·d),     (1)

where m is the number of links that can be followed from the state, w_i is the number of times the ith outlink was traversed, W = Σ_{i=1}^{m} w_i, and d is an initial small positive weight assigned to each outlink; we determined the value of d = 0.001 used in our experiments by trial and error. We note that when d = 0 Laplace smoothing is turned off.
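Equation (1) and the threshold test can be sketched as follows. The outlink counts below are hypothetical (a state with two traversed links and one link X never traversed in the training data):

```python
def smoothed_probs(outlink_counts, d=0.001):
    """Laplace-smoothed outlink probabilities for one state, per Eq. (1):
    p_i = (w_i + d) / (W + m*d). With d > 0 every outlink receives a
    non-zero probability, so zero-probability events disappear."""
    m = len(outlink_counts)
    W = sum(outlink_counts.values())
    return {link: (w + d) / (W + m * d) for link, w in outlink_counts.items()}

def is_unexpected(p, alpha):
    # A click is unexpected if its probability is <= the threshold alpha.
    return p <= alpha
```

Note that the smoothed probabilities still sum to one, and an untraversed link gets a tiny but positive probability, which the threshold α can then flag as unexpected.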
4. Experimental Evaluation

For the experimental evaluation we make use of three distinct data sets. The first data set (LTM) represents four months of usage from the London Transport Museum between November 2002 and February 2003. Erroneous requests, image requests and requests from IP addresses that requested the robots.txt file were eliminated. A session was defined as a sequence of requests from the same IP address with a time limit of 30 minutes between consecutive requests. After sessionizing, sessions with a single request were removed. For testing, we decided to use a temporal based natural split; thus, the training set is the first three months of data and the test set is the last month of data.

The second data set (MSWEB) was obtained from the UCI KDD archive (http://kdd.ics.uci.edu/databases/msweb/msweb.html) and records the areas within www.microsoft.com that users visited in a one-week time frame in February 1998. Two separate data sets are provided, a training set and a test set.

The third data set (PKDD) corresponds to the ECML/PKDD 2005 challenge and contains server sessions from seven e-commerce vendors [2]. The data contains a session generated ID, and a session was thus defined as a sequence of requests with the same ID for a given vendor. For testing, we randomly split the data into a 70/30 training set and test set split.

Table 1 summarises the characteristics of the data sets. For each data set we indicate the number of pages occurring in the log file, the total number of requests recorded and the total number of sessions derived from each data set.

Table 1. Summary characteristics of the three real data sets used

                  Training set                       Test set
          Pages   Requests   Sessions     Pages   Requests   Sessions
  LTM      2438     792886      31953      1974     338322      13694
  PKDD      316     456786      60288       263     188922      25837
  MSWEB     285      98654      32711       236      15191       5000
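The sessionizing procedure described for the LTM data (a 30-minute timeout between consecutive requests from the same IP, discarding single-request sessions) can be sketched as below; the `(ip, timestamp, url)` tuple layout is an assumption for illustration.

```python
from collections import defaultdict

TIMEOUT = 30 * 60  # 30-minute limit between consecutive requests, in seconds

def sessionize(requests):
    """Split (ip, timestamp, url) request records into sessions.

    A session is a sequence of requests from the same IP with at most
    TIMEOUT seconds between consecutive requests; sessions with a
    single request are removed, as in the LTM preprocessing.
    """
    by_ip = defaultdict(list)
    for ip, ts, url in sorted(requests, key=lambda r: (r[0], r[1])):
        by_ip[ip].append((ts, url))
    sessions = []
    for reqs in by_ip.values():
        current = [reqs[0]]
        for prev, cur in zip(reqs, reqs[1:]):
            if cur[0] - prev[0] <= TIMEOUT:
                current.append(cur)
            else:
                sessions.append(current)
                current = [cur]
        sessions.append(current)
    return [s for s in sessions if len(s) > 1]

sessions = sessionize([("1.1.1.1", 0, "/a"), ("1.1.1.1", 100, "/b"),
                       ("1.1.1.1", 5000, "/c"), ("2.2.2.2", 0, "/x")])
# Only the first two requests fall within 30 minutes of each other;
# the remaining singleton sessions are discarded.
```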
We let BF be the model branching factor. For a model having $i = 1, 2, \ldots, s$ states, with each state $i$ having $u_i$ outlinks, we define BF as follows:

$$BF = \frac{1}{s} \sum_{i=1}^{s} u_i. \qquad (2)$$
J. Borges and M. Levene / Mining Users’ Web Navigation Patterns and Predicting Their Next Step

Although BF is related to the model complexity it does not give a clear indication of the difficulty associated with predicting a user's next navigation step when viewing a given state. Thus, we define the weighted branching factor wBF to measure the average number of outlinks weighted by the number of times the anchor state was visited. We let $w_i$ be the number of times state $i$ was visited and $W = \sum_{i=1}^{s} w_i$, and define wBF as follows:

$$wBF = \frac{1}{W} \sum_{i=1}^{s} w_i u_i. \qquad (3)$$
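Equations (2) and (3) translate directly into code; the following sketch computes both factors from per-state outlink and visit counts (the names are illustrative):

```python
def branching_factors(outlinks, visits):
    """Branching factor BF (Eq. 2) and weighted branching factor wBF (Eq. 3).

    outlinks: u_i, the number of outlinks of each state i.
    visits:   w_i, the number of times each state i was visited.
    """
    s = len(outlinks)
    W = sum(visits)
    bf = sum(outlinks) / s                                   # (1/s) * sum u_i
    wbf = sum(w * u for w, u in zip(visits, outlinks)) / W   # (1/W) * sum w_i u_i
    return bf, wbf

# A heavily visited state with many outlinks pulls wBF well above BF:
bf, wbf = branching_factors([10, 2, 2], [100, 5, 5])
```

This illustrates the point made below: when popular states have many outlinks, wBF exceeds BF, so prediction is harder than the plain branching factor suggests.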
Figure 3. Branching factor (left) and weighted branching factor (right) for LTM, MSWEB and PKDD, plotted against the order of the model (1–4).
In Figure 3 we show how BF and wBF vary with the order of the model. It can be seen that wBF is, on average, larger than BF by a factor of over 10. Although for higher order models BF is, on average, less than 5, states with a higher probability of being visited have a much larger number of outlinks and thus the choice of which link to predict is a non-trivial task.
Figure 4. Average percentage of expected events in the test data: percentage of trails in the test set for LTM, MSWEB and PKDD, plotted against α (all, 0, 0.01, 0.05, 0.1).
Figure 4 shows an approximately linear decrease in the percentage of trails in the test sets as α is increased. (We note that although the α values on the x-axis in the figure are not proportionally spaced, when adjusting the data points appropriately linear
regression shows a good fit for the three data sets.) Overall, LTM and MSWEB have more unexpected events than PKDD, and for α > 0.01 the increase in the number of unexpected events is sharper for MSWEB than for LTM. This could explain why the links of MSWEB are less predictable than those of the other two data sets, as discussed below. We note that increasing α too much is not desirable since, as seen from the trend in Figure 4, this will have an adverse effect on the size of the test set.
Figure 5. Hit and miss score (HM) for LTM (left) and PKDD (right), plotted against the order of the model (1–4), for α thresholds all, > 0.00, > 0.01, > 0.05 and > 0.10.
Figure 6. Hit and miss score for MSWEB (left) and mean absolute error (MAE) for LTM (right), plotted against the order of the model (1–4), for α thresholds all, > 0.00, > 0.01, > 0.05 and > 0.10.
In Figures 5, 6, 7, 8 and 9 (left) we show for the three data sets how the HM score, the MAE and the IS score vary, respectively, with the order of the model for varying values of α. The general pattern for all scoring metrics is that the scores improve as the order of the model increases and as α increases. By increasing the order of the model we are able to make use of the history to limit the choices of clicks that users have made, and by increasing α, until it reaches a predefined level, we are able to eliminate unexpected events which make the prediction more difficult. It can be seen that for HM and IS, the LTM data set performs best, then the PKDD data set and the worst performing data set is MSWEB. For MAE the same holds when α is all or 0, but when α is greater than 0 the
distinction, although present, is less evident. Thus, assuming that α > 0, we can conclude that, on average, the link we are trying to predict is among the top 3 most probable links. For HM we see that when α is greater than 0 we can reach a prediction level above 0.75 for LTM, and between 0.5 and 0.75 for PKDD, while for MSWEB only when α = 0.10 can we reach a level above 0.5. (We note that 0.5 is still much better than a uniform guess as, in general, there are more than two links to choose from.) The results for the IS scoring rule show that there are significant differences between the three data sets with respect to the expected probability of the link that we are trying to predict. As we can see, for LTM the probability $2^{-IS}$ is able to reach levels above 0.5 when α ≥ 0.05, for PKDD the probability levels are all below 0.5, while for MSWEB they barely reach 0.25. Thus, although, as seen in Figure 3, the branching factors for the data sets are comparable, the probabilities of the links we are trying to predict are generally lower in MSWEB than in PKDD, and lower in PKDD than in LTM. This is reinforced in Figure 9 (right), where it can be seen that the difference between MTP and ATP is larger for MSWEB than for PKDD, and larger for PKDD than for LTM.
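Following the interpretations given here (HM as a hit on the most probable link, MAE as the rank of the followed link minus one, and IS with $2^{-IS}$ as the predicted probability of the followed link), the three metrics for a single prediction can be sketched as follows. This reconstruction, including its tie-breaking by ranking order, is illustrative rather than the authors' implementation.

```python
import math

def scores(predicted, actual):
    """Score one navigation prediction.

    predicted: dict mapping each candidate outlink to its model probability.
    actual: the link the user actually followed.
    Returns (HM, MAE, IS): hit-and-miss (1 if the most probable link was
    followed), rank of the followed link minus one, and the ignorance
    score -log2 of the followed link's probability.
    """
    ranking = sorted(predicted, key=predicted.get, reverse=True)
    hm = 1 if ranking[0] == actual else 0
    mae = ranking.index(actual)          # rank minus one
    ig = -math.log2(predicted[actual])   # 2**-ig recovers the probability
    return hm, mae, ig

# The model favoured "a" but the user followed "b":
hm, mae, ig = scores({"a": 0.5, "b": 0.25, "c": 0.25}, "b")
```

Averaging these per-prediction scores over a test set yields the curves shown in Figures 5 to 9.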
Figure 7. Mean absolute error for PKDD (left) and MSWEB (right), plotted against the order of the model (1–4), for α thresholds all, > 0.00, > 0.01, > 0.05 and > 0.10.
Figure 8. Ignorance score ($2^{-IS}$) for LTM (left) and PKDD (right), plotted against the order of the model (1–4), for α thresholds all, > 0.00, > 0.01, > 0.05 and > 0.10.
Figure 9. Ignorance score for MSWEB (left) and average (ATP) and maximum (MTP) test trail probability for LTM, MSWEB and PKDD (right), plotted against the order of the model (1–4).
5. Concluding Remarks

We have evaluated a maximum likelihood prediction algorithm using three metrics: the hit and miss score, the mean absolute error and the ignorance score. Our experiments show that, as expected, prediction accuracy increases with the order of the model, and also increases when unexpected events, controlled by the parameter α, are detected rather than predicted. Our experiments also show that the accuracy of prediction varies across data sets. We discussed the different interpretations of the three scoring metrics: HM can be understood in terms of the expected maximum likelihood probability of the last link on a trail (MTP), MAE can be understood in terms of the expected rank of the last link on the trail that was followed minus one, and IS can be understood in terms of the expected probability of the last link on a trail that was followed by the user (ATP).

Future work involves a better understanding of what makes the prediction algorithm perform better on different data sets. In addition, we would like to take into account concept drift [15] when building the Markov model, since we do not expect the probabilities to be stationary. Finally, we also wish to apply the prediction algorithms to data sets from application areas such as patrolling the Web.
References

[1] G. Bejerano. Algorithms for variable length Markov chain modelling. Bioinformatics, 20(5):788–789, March 2004.
[2] P. Berka. Guide to the click-stream data. In Proceedings of the Discovery Challenge Workshop, in conjunction with ECML/PKDD, 2005.
[3] J. Borges and M. Levene. Data mining of user navigation patterns. In B. Masand and M. Spiliopoulou, editors, Web Usage Analysis and User Profiling, LNAI 1836, pages 92–111. Springer-Verlag, Berlin, 2000.
[4] J. Borges and M. Levene. An average linear time algorithm for web usage mining. International Journal of Information Technology and Decision Making, 3(2):307–319, June 2004.
[5] J. Borges and M. Levene. Generating dynamic higher-order Markov models in web usage mining. In Proceedings of the 9th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), LNAI 3721, pages 34–45, Porto, Portugal, October 2005.
[6] J. Borges and M. Levene. Testing the predictive power of variable history web usage. Soft Computing, 11(8):717–727, 2007.
[7] J. Borges and M. Levene. Evaluating variable length Markov chain models for analysis of user web navigation sessions. IEEE Transactions on Knowledge and Data Engineering, 19(4):441–452, 2007.
[8] H. Chen. Editorial: Intelligence and security informatics: information systems perspective. Decision Support Systems, 41:555–559, 2006.
[9] X. Chen and X. Zhang. A popularity-based prediction model for web prefetching. IEEE Computer, 36(3):63–70, 2003.
[10] M. Deshpande and G. Karypis. Selective Markov models for predicting web page accesses. ACM Transactions on Internet Technology, 4(2):163–184, May 2004.
[11] X. Dongshan and S. Juni. A new Markov model for web access prediction. IEEE Computing in Science & Engineering, 4(6):34–39, November 2002.
[12] M. Eirinaki, M. Vazirgiannis, and D. Kapogiannis. Web path recommendations based on page ranking and Markov models. In Proceedings of the 7th ACM International Workshop on Web Information and Data Management (WIDM), pages 2–9, New York, NY, USA, 2005.
[13] L. Geng and H.J. Hamilton. Interestingness measures for data mining: A survey. ACM Computing Surveys, 38(3), 2006.
[14] J.G. Kemeny and J.L. Snell. Finite Markov Chains. D. Van Nostrand, Princeton, NJ, 1960.
[15] I. Koychev. Experiments with two approaches for tracking drifting concepts. Serdica Journal of Computing, 1:27–44, 2004.
[16] K. McGarry. A survey of interestingness measures for knowledge discovery. The Knowledge Engineering Review, 20(1):39–61, 2005.
[17] B. Mobasher. Web usage mining and personalization. In Munindar P. Singh, editor, Practical Handbook of Internet Computing. Chapman Hall & CRC Press, Baton Rouge, 2004.
[18] M. Perkowitz and O. Etzioni. Towards adaptive web sites: Conceptual framework and case study. Artificial Intelligence, 118:245–275, 2000.
[19] M.S. Roulston and L.A. Smith. Evaluating probabilistic forecasts using information theory. Monthly Weather Review, 130:1653–1660, 2002.
[20] S. Schechter, M. Krishnan, and M.D. Smith. Using path profiles to predict HTTP requests. Computer Networks and ISDN Systems, 30(1-7):457–467, 1998.
[21] M. Spiliopoulou, B. Mobasher, B. Berendt, and M. Nakagawa. A framework for the evaluation of session reconstruction heuristics in web usage analysis. INFORMS Journal on Computing, 15(2):171–190, 2003.
[22] R.L. Winkler. Scoring rules and the evaluation of probabilities (with discussion). Test, 5:1–60, 1996.
[23] I.H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, San Francisco, CA, 2nd edition, 2005.
[24] C. Zhai and J. Lafferty. A study of smoothing methods for language models applied to information retrieval. ACM Transactions on Information Systems, 22(2):179–214, 2004.
Security Informatics and Terrorism: Patrolling the Web C.S. Gal et al. (Eds.) IOS Press, 2008 © 2008 IOS Press. All rights reserved.
Data Mining for Security and Crime Detection Gerhard PAAß 1 , Wolf REINHARDT, Stefan RÜPING, and Stefan WROBEL Fraunhofer Institute for Intelligent Analysis and Information Systems, Sankt Augustin, Germany
Abstract. As the Internet becomes more pervasive in all areas of human activity, attackers can use the anonymity of cyberspace to commit crimes and compromise the IT infrastructure. As there is currently no generally implemented authentication technology, we have to monitor the content and relations of messages and Internet traffic to detect infringements. In this paper, we present recent research on Internet threats such as fraud or the hampering of critical information infrastructures. One approach concentrates on the rapid detection of phishing email, designed to make it next to impossible for attackers to obtain financial resources or commit identity theft in this way. Then we address how another type of Internet fraud, the violation of the rights of trademark owners by the selling of faked merchandise, can be semi-automatically detected with text mining methods. Thirdly, we report on two projects that are designed to prevent fraud in business processes in public administration, namely in the healthcare sector and in customs administration. Finally, we focus on the issue of critical infrastructures, and describe our approach towards protecting them using a specific middleware architecture.
Keywords. Internet security, email fraud, phishing, trademark infringement, counterfeit merchandise, Internet auctions, critical infrastructure
Introduction

Classically, the most severe and dangerous security threats to countries, organizations and individuals are considered to be physical acts of violence, such as the ones we have sadly had to observe in recent years. Consequently, besides classical police work, efforts in the online realm have concentrated strongly on finding individuals and organizations who are offering material that incites such acts of violence, or who are using electronic means to arrange for committing them. At the same time, however, it has become clear that the nature of the threats, and the structure of the organizations and individuals behind them, has started to change.
1 Corresponding Author: Gerhard Paaß, Fraunhofer Institute for Intelligent Analysis and Information Systems, IAIS, Schloss Birlinghoven, 53757 Sankt Augustin, Germany, http://www.iais.fraunhofer.de; Email: [email protected]
Firstly, the distinction between criminal activity carried out for purely financial gain and security threats motivated by political or ideological reasons is beginning to dissolve. More and more, we are seeing that the Internet is being used to provide financing for politically or ideologically motivated offences, and that techniques that up to now have been associated only with financial crime, such as phishing, are being used in this context. Drying up these sources of funds through measures designed to prevent phishing and other financial fraud on the Internet thus becomes an important contribution not only to fighting crime in general, but also to defending against larger scale security threats.

Secondly, the nature of the threats is changing dramatically. While previously, explosives or other chemical, biological or nuclear weapons were needed to seriously threaten the general public or the infrastructure of a country, recent examples such as the attacks against Estonia in the spring of 2007 [Landler, Markoff 07; Rhoads 07] show that nowadays critical infrastructures can be destabilized by entirely nonphysical methods, simply by attacking a country's information infrastructure with suitable means.

In this paper, we therefore present recent research that is not primarily directed at identifying individuals conspiring to threaten countries or organizations, but which is focused on the general Internet threats that provide the financial basis for such activities. In particular, in the following sections, we first present a research project centrally focused on the rapid detection of phishing email, designed to make it next to impossible for attackers to obtain financial resources or commit identity theft in this way. In the next section, we then show how another type of Internet fraud, the violation of the rights of trademark owners by offering faked merchandise, can be semi-automatically detected with text mining methods.
Thirdly, we report on two projects that are designed to prevent fraud in business processes in public administration, namely in the healthcare sector and in customs administrations. Finally, we focus on the issue of critical infrastructures, and describe our approach towards protecting them using a specific middleware architecture.
1. Fighting Phishing: The Anti-Phish Project
Figure 1. Example of a phishing email (© wikipedia http://en.wikipedia.org/wiki/Image:PhishingTrustedBank.png)
One of the most harmful forms of email spam is phishing. Criminals try to convince unsuspecting online customers of various institutions to surrender passwords, account numbers, social security numbers or other personal information. To this end they use spoofed messages which masquerade as coming from reputable online businesses such as financial institutions (see Figure 1 for an example). Such a message includes a web link or a form to collect passwords and other sensitive information. Subsequently this information is used to withdraw money, enter a restricted computer system, etc.
Figure 2. Number of hijacked brands by month: brands used in phishing emails, April '06 – April '07, according to the AntiPhishing Working Group (© Fraunhofer IAIS).
The AntiPhishing Working Group reports [APWG 07] that phishing has increased enormously over recent months and is a serious threat to global security and the economy (see Figure 2). Most often banks and online payment services are targeted. Other phishers obtain sensitive data from U.S. taxpayers by fraudulent IRS emails. Recently more non-financial brands have been attacked, including social networking (e.g. myspace.com), VoIP, and numerous large web-based email providers (see Figure 3; source of graphics and information: APWG 07). There is an upward trend in the total number of different phishing emails sent over the whole Internet: they grew from 11,121 in April 2006 to 55,643 in April 2007. There has been a massive increase in the number of phishing sites over the past year. In addition, phishing emails are becoming increasingly sophisticated, e.g. through link manipulation, URL spelling using Unicode letters, and website address manipulation. Finally, phishing methods are evolving from shotgun-style email to phishing using embedded images and targeted attacks on specific organizations (“spear phishing”).
Figure 3. Phishing emails attacks by business sector of attack brand (© Fraunhofer IAIS)
Figure 4. Anti-Phish project logo (© AntiPhish Consortium)
AntiPhish (see Figure 4, http://www.antiphishresearch.org/press.html) is a specific targeted research project funded under Framework Program 6 by the European Union. It started in January 2006 and will run until December 2008. AntiPhish is an acronym for “Anticipatory Learning for Reliable Phishing Prevention”. The consortium consists of six partners. AntiPhish aims at developing spam and phishing filters with high accuracy for use both on traditional emails and mobile messaging services. The scientific focus of the project is on trainable and adaptive filters that are not only able to identify variations of previous phishing messages, but are capable of anticipating new forms of phishing attacks. Such technology does not exist yet, but
could greatly improve all existing methods used in spam and phishing filters. Thus the project does not pursue new legal regulations or changes to email protocols, but concentrates on technical recognition approaches. Besides Fraunhofer IAIS, the consortium comprises a reputable university (K.U. Leuven), a world leading company (Symantec), a major Internet service provider (Tiscali) and a leading provider of mobile phone infrastructure (Nortel). The research will be driven by the hands-on expertise of our industrial participants, who have years of success fighting spam on a global scale. The AntiPhish project not only aims at developing the filter methodology in a test laboratory setting, but has the explicit goal of implementing this technology in real world settings at our partners' sites, to be used to filter all email traffic online in real time, as well as for content filtering at the edge of wireless networks. Figure 5 shows the structure of the project [AntiPhish 06]. From the stream of emails appropriate features are extracted to train and estimate classifiers in a laboratory setting. Subsequently these filters are simulated under real-world conditions in the Symantec labs. Finally, they will be deployed into large-scale real world environments at an Internet provider and a mobile phone provider.
Figure 5. Anti-Phish project structure (© AntiPhish Consortium)
If successful, all of this technology will be implemented in Symantec's Global Intelligence Network, where people work in three shifts, 24 hours a day, around the globe to help fight spam and phishing. Success against these threats will always require a combination of human labor and intelligent algorithms. The main approach of this project is to obtain training data from the email stream, to extract features, to estimate and update classifiers and, at the end, to deploy them at the Internet service provider. A central challenge to this task is concept drift, and the project will develop adaptive approaches to detect this drift and react to it.
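To make the feature-extraction step concrete, the following sketch derives a few simple indicator features from an email. The specific features and names are illustrative assumptions on our part, not the AntiPhish feature set; real filters combine hundreds of such features with trained classifiers.

```python
import re

def phishing_features(body):
    """Extract a few illustrative phishing-indicator features from an email body."""
    urls = re.findall(r"https?://[^\s\"'>]+", body)
    urgent = ("verify", "suspended", "confirm your account", "urgent")
    return {
        # Links are the main delivery vehicle for phishing forms.
        "num_urls": len(urls),
        # Raw IP addresses instead of domain names are a classic red flag.
        "has_ip_url": any(re.match(r"https?://\d+\.\d+\.\d+\.\d+", u)
                          for u in urls),
        # Pressure language pushing the reader to act immediately.
        "urgent_language": any(w in body.lower() for w in urgent),
        # Legitimate institutions rarely request passwords by email.
        "asks_for_password": "password" in body.lower(),
    }

f = phishing_features("Please verify your password at http://192.0.2.7/login")
```

Such feature vectors would then feed the classifier training and updating pipeline described above.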
2. Detecting Illegal Products: Text Mining to Identify Fake Auction Offers

According to a current study by the University of Mainz [Huber et al. 06], the number of fake products seized by German customs between 1998 and 2004 rose by about 1000 percent. The real number of fake products is estimated to be significantly higher. Whereas previously mainly the watch, clothing and perfume industries had to face this problem, now the pharmaceutical, automobile and aeronautical industries are affected as well. For one perfume web auction in a specific period, 84.4 percent of products in the sample survey could be identified as fakes, and only 7 percent as the genuine article. Besides the financial losses for the original manufacturer, the poor quality of fake products can adversely affect the original brand.
Figure 6. Counterfeited Luxury Watch (© wikipedia http://en.wikipedia.org/wiki/Image:Frolex.jpg)
Often the fake products are openly offered as replicas, remakes, etc., and the customer knows that he will get a fake product, but nevertheless with the prestige of the genuine product. For a well-known manufacturer of luxury watches, Fraunhofer IAIS developed a filter to detect such fakes in Ebay™ watch offer auctions. First a training set of genuine and fake watch offers was compiled. Then a classifier was trained to distinguish the different classes using the text of the offer and format features as inputs. On a training set these classifiers showed very good performance and needed very little time for actual filtering.
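The workflow of training a text classifier on labeled genuine and fake offers can be sketched as below. The text does not specify which learner Fraunhofer IAIS used, so a small multinomial naive Bayes classifier serves here as an illustrative stand-in, and the training offers are toy data.

```python
import math
from collections import Counter

def train_nb(docs, labels):
    """Train a multinomial naive Bayes text classifier on labeled offers."""
    classes = set(labels)
    word_counts = {c: Counter() for c in classes}
    class_counts = Counter(labels)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.lower().split())
    vocab = {w for c in classes for w in word_counts[c]}
    return classes, word_counts, class_counts, vocab

def classify(model, doc):
    """Return the most probable class for an offer text."""
    classes, word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    best, best_lp = None, float("-inf")
    for c in classes:
        lp = math.log(class_counts[c] / total)
        denom = sum(word_counts[c].values()) + len(vocab)
        for w in doc.lower().split():
            # Laplace smoothing avoids zero probabilities for unseen words.
            lp += math.log((word_counts[c][w] + 1) / denom)
        if lp > best_lp:
            best, best_lp = c, lp
    return best

model = train_nb(
    ["authentic swiss watch with certificate box and papers",
     "original watch full warranty certificate",
     "replica watch best copy cheap",
     "high quality replica copy of luxury watch cheap"],
    ["genuine", "genuine", "fake", "fake"])
label = classify(model, "cheap replica copy watch")
```

A production filter would add the format features mentioned above alongside the offer text.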
Therefore it may be used in the workflow of an Internet auction platform. Whenever a new offer is entered the classifier checks the text and immediately bans a suspicious auction. The additional computational effort for the auction provider is minimal. Using the results of the project the highest German court, the Bundesgerichtshof, ruled that a brand product manufacturer can require an Internet auction provider to set up a filter to identify fake offers and exclude them from the auctions [Bundesgerichtshof 07].
3. Preventing Fraud: The iWebCare and RACWeB Projects

The assessment of fraud risks associated with business processes is a very interesting field of application for data mining. With respect to security and crime detection, those risks can, for example, be the risk of customs offences, or the risk of fraudulent activities in public administration. This has a significant economic impact; for example, according to the Counter Fraud Service of the UK's National Health Service, fraud accounts for almost 3% of public healthcare expenditures in the UK. The projects iWebCare and RACWeB (Risk Assessment in Customs in the Western Balkans), in which Fraunhofer IAIS takes part, are aimed at implementing anti-fraud measures in public administration based on data mining techniques in the domains of healthcare and customs, respectively, in order to improve the efficiency and transparency of risk assessment and fraud detection and prevention.

The problem of fraud detection is characterized by the adversarial setting of a fraudster trying to hide from detection in a large amount of data. This setting has important implications that distinguish fraud detection from standard data mining tasks:

Skewed class distribution: there are many different estimates of how many cases of fraud exist in different domains. What they all have in common is that fraud is always the exception to the rule, that is, the percentage of fraud cases is usually quite low. This poses a problem for many supervised learning algorithms, which usually work best when there are equally many positive and negative examples. Skewed class distributions can be addressed in data mining by forms of sampling [Scholz, 2005].

Sparsely labeled data: fraud is a complex organizational problem, and it is usually not trivial to determine whether a given case actually constitutes fraud or not, as many other legal, organizational, and practical issues have to be considered.
As a result, one cannot expect fraud experts to produce the large labeled data sets that would be necessary to train supervised methods. Instead, one has to take into account the fact that expert feedback is a scarce resource, and one must make optimal use of the limited time that experts have by using methods like active learning and semi-supervised learning [Chapelle et al., 2006].

False negatives: a false negative is a case of fraud which the system has labeled as non-fraud. While a case the system labels as fraud will be investigated by a fraud operator, such that false positives will be corrected later in the fraud discovery process, false negatives will not be inspected any further and will disappear in the large number of cases which are seemingly correct. Hence, completely novel, unknown methods of committing fraud will not be detected by a supervised learning scheme, and one must rely on unsupervised methods to detect new trends and developments in the data that may turn out to be fraud on further inspection. The general problem with this is that while it is possible to confirm a detected case of fraud with high confidence, it is much
more difficult to decide whether a given case is not fraudulent or whether the fraud is simply hidden too well.

Trivial rules: a standard approach to fraud detection would be to collect an initial set of known fraud cases from fraud inspectors and then to use a supervised learner to generalize these fraud cases into rules to detect future cases of fraud. The problem with this approach is that fraud inspectors already use rules for finding fraud. These rules may either be given explicitly as a set of procedures to apply to new cases (an example would be to check that no two invoices have the same serial number), or implicitly by the way fraud inspectors work (an example being that when fraud inspectors come upon a novel case of fraud, they try to find similar cases, e.g., by checking all the invoices submitted by the same company). Consequently, the set of known fraud cases given to the system is not randomly selected, as would be assumed by standard statistical techniques, but is biased in the sense that only fraud cases that are detectable by the known fraud rules are included. When this happens, the best thing the data mining algorithm can do is to reconstruct the old rules that have been used to construct the data set. Returning these rules to the fraud inspectors will be disappointing to them, because they will only get back what they already know.

Concept drift: as a direct result of the adversarial nature of fraud and anti-fraud measures, whenever there is a good way to detect certain cases of fraud, fraudsters will sooner or later adapt to the detection strategy and will try to find novel ways to cheat the system. This means that fraud is evolving over time, and hence one cannot expect fraud rules to remain constant over time. This scenario is called concept drift in the machine learning literature and must be treated by careful validation of the learning results [Widmer, Kubat, 1996].
Interpretability: data mining is only one step in the process of fighting fraud. Human fraud inspectors and automatic data mining have to interact closely to have the maximum effect. In order to achieve this interactivity, it is important that the rules returned by the system are interpretable to the user, such that he can decide whether the new pattern that the algorithm has discovered actually describes fraud or not [Rüping, 2006].

From this discussion it follows that one has to distinguish between two different goals of data mining for fraud detection.

1) Detection of new fraud cases of a known fraud pattern. When a new case of fraud is detected, the goal of fraud detection is not only to stop and prosecute this particular instance of fraud, e.g., by dismissing an employee who was involved in procurement fraud, but also to prevent similar cases of fraud from occurring, e.g., by finding indicators that facilitate the identification of such cases earlier and with higher confidence. Supervised learning can be used to find these indicators by generalizing the single cases into high-quality rules, and thus prevent the same type of fraud from happening again.

2) Detection of new fraud patterns. It is safe to assume that new types of fraud are being developed all the time. Hence, fraud detection cannot rely only on tracing known types of fraud, but must also incorporate methods to find new fraud patterns. In order to do this, one cannot rely on known fraud labels, but must identify unusual patterns based on other properties of the data. Once a statistically significant deviation is found, the pattern can be reported back to a fraud officer for investigation. An example might be that a certain type of doctor spends much more money per patient than the rest. While it can be statistically confirmed that such a deviation in the budget is indeed significant and not random, one can usually not decide from the given data
whether there is a valid reason for the higher spending (e.g., the doctor treating a special group of high-risk patients that require more expensive treatment) or whether this is a sign of fraud. This makes unsupervised fraud detection more challenging, because it needs to combine high statistical significance of the found patterns with interpretability, such that the experts can understand and judge the validity of the patterns.
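The kind of unsupervised deviation detection described here, such as flagging a doctor whose per-patient spending deviates strongly from the peer group, can be sketched with a simple z-score test. The threshold and data are illustrative; real systems would use more robust statistics and, as stressed above, report flagged cases for human inspection rather than treat them as confirmed fraud.

```python
import statistics

def flag_outliers(spending, threshold=3.0):
    """Flag entities whose average spending deviates from the peer group.

    spending: dict mapping an entity id to its average spending per patient.
    Returns ids whose z-score exceeds the threshold; a flagged case is a
    candidate for investigation, not a fraud verdict.
    """
    values = list(spending.values())
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return [k for k, v in spending.items()
            if sd and abs(v - mean) / sd > threshold]

# Twenty doctors with similar spending, plus one extreme deviation:
doctors = {f"doc{i}": 100.0 + i for i in range(20)}
doctors["doc_x"] = 500.0
flagged = flag_outliers(doctors)
```

Whether the flagged doctor has a valid reason for the higher spending, as discussed in the text, is a question the data alone cannot answer.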
4. Fighting Threats Against Critical Infrastructures: The IRRIIS Project

The EU Integrated Project “Integrated Risk Reduction of Information-based Infrastructure Systems” (IRRIIS) is carried out under the motto: enhance substantially the dependability of Large Complex Critical Infrastructures (LCCIs) by introducing appropriate Middleware Improved Technology (MIT) components. IRRIIS increases the dependability, survivability and resilience of EU critical information infrastructures based on Information and Communication Technology (ICT) and has the objectives to:

- determine a sound set of public and private sector requirements based upon scenario and related data analysis;
- design, develop, integrate and test MIT components suitable for preventing and limiting cascading effects and supporting automated recovery and service continuity in critical situations;
- develop, integrate, and validate novel and advanced modeling and simulation tools integrated into a synthetic environment (SimCIP) for experiments and exercises;
- validate the functions of the MIT components using the SimCIP environment and the results of the scenario and data analysis;
- disseminate novel and innovative concepts, results, and products to other ICT-based critical sectors.
IRRIIS addresses the challenges of Critical Information Infrastructure Protection (CIIP) by a "diagnosis - therapy" strategy and a "therapy implementation and validation" approach, starting with the electrical power infrastructure and its supporting telecommunication infrastructure. After thoroughly analysing these infrastructures and their interdependencies, the synthetic simulation environment (SimCIP) is built. MIT components are developed, tested and validated inside SimCIP to demonstrate their capabilities before dissemination to potential stakeholders. The approach subsequently includes additional critical infrastructures. The interdisciplinary research is performed by a European consortium of fifteen partners, ranging from academia to key stakeholders from the fields of energy supply and telecommunication. The project is partly financed by the European Union's Sixth Framework Programme for a term of three years.

4.1. Large Complex Critical Infrastructure Analysis and Requirements

Up until now there has been a lack of deep understanding of Large Complex Critical Infrastructures' (LCCIs) dependability and interdependency, particularly with regard to the use of Information and Communication Technology (ICT). Although some models
and tools dealing with these issues exist, LCCI complexity cannot yet be tackled properly. Basic research is necessary to understand the phenomena of interdependency, dynamic behaviour and cascading effects in order to support the development of solutions for protecting and managing existing LCCIs in case of incidents. IRRIIS will perform in-depth research regarding the topological structure of LCCIs and the interdependencies between different LCCIs. Appropriate analytical approaches will be applied, such as simulation models or analytical models suitable to investigate interdependency, network dynamics and cascading effects. Starting from a thorough analysis of LCCIs, incorporating the stakeholders' views regarding ICT tools and models, a sound set of public and private sector requirements can be determined. These requirements will be the base for the development of the SimCIP simulation environment and the Middleware Improved Technology (MIT) components. In order to enhance the understanding of LCCIs and to gain a sound foundation for the development of the SimCIP simulation environment and the MIT components, IRRIIS will:

- survey LCCI stakeholders' requirements on technology and tools needed for understanding and mitigating cascading effects;
- survey and analyse existing tools and models;
- analyse current research gaps to identify relevant research and development efforts;
- provide detailed scenario and risk analysis;
- perform in-depth topological analysis of LCCIs;
- analyse the interdependency between different LCCIs;
- analyse the upcoming Next Generation Networks (NGN), i.e., networks based on IP connectivity or wireless connections with mainly software-based services.
This work will not only help ensure that the SimCIP environment and the MIT components meet the stakeholders' needs but also contribute to the ongoing worldwide research efforts concerning LCCIs.

4.2. Middleware Improved Technology

Starting with the knowledge gained from the LCCI analysis and the survey of stakeholders' requirements and existing tools, Middleware Improved Technology (MIT) components are developed. These MIT components facilitate the communication between different LCCIs and allow incidents and malicious attacks to be identified, evaluated and responded to accordingly. Currently, one big problem for the dependability, security and resilience of LCCIs is the high interdependence between different LCCIs within the same sector and also between different sectors. The consequence is that problems within one LCCI can lead to severe problems in dependent LCCIs. The resulting cascading effects are not limited to one kind of infrastructure and do not stop at national borders. To make things worse, there is often a lack of appropriate communication structures between the dependent LCCIs (see Figure 7). This results in a lack of awareness of problems occurring in other infrastructures, and therefore mitigating actions cannot be performed in time.
Figure 7. Interdependent LCCIs of the same and different sectors. The arrows indicate interdependencies and the lines communication links using different standards. (© IRRIIS Consortium)
To facilitate the communication between different infrastructures, IRRIIS will develop appropriate middleware communication components. All communication between different LCCIs should run via this middleware. The advantage is that each LCCI only needs one communication link to the middleware and does not have to interface with several other LCCIs (see Figure 8).
Figure 8. Interdependent LCCIs with Middleware Communication Layer and Middleware Improved Technology components. (© IRRIIS Consortium)
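The single-link idea can be sketched as a toy message hub through which dependent infrastructures exchange incident reports; the class name, the message format and the infrastructure names below are illustrative assumptions, not part of the actual MIT design.

```python
class Middleware:
    """Toy message hub: each LCCI registers once and receives status
    messages from the infrastructures it depends on through the hub,
    instead of maintaining pairwise links to every other LCCI."""

    def __init__(self):
        self.inboxes = {}      # lcci -> received (source, message) pairs
        self.dependents = {}   # lcci -> set of LCCIs that depend on it

    def register(self, lcci, depends_on=()):
        self.inboxes.setdefault(lcci, [])
        for upstream in depends_on:
            self.dependents.setdefault(upstream, set()).add(lcci)

    def publish(self, source, message):
        # Forward an incident report only to the infrastructures that
        # declared a dependency on the source.
        for lcci in self.dependents.get(source, ()):
            self.inboxes[lcci].append((source, message))

hub = Middleware()
hub.register("power_grid")
hub.register("telecom", depends_on=["power_grid"])
hub.register("water", depends_on=["power_grid"])
hub.publish("power_grid", "substation outage, load shedding expected")
print(hub.inboxes["telecom"])
```

Each infrastructure talks only to the hub, which is exactly the advantage claimed above: one communication link per LCCI instead of a mesh of bilateral interfaces.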
The middleware will also be used by the optional MIT add-on components, which have some kind of built-in "intelligence". These add-on components will monitor data flowing within and between the infrastructures, raise an alarm in case of intrusions or emergencies, and then take measures to avoid cascading effects. They will be able to detect anomalies, filter alarms according to their relevance and support recovery actions, and will thus contribute to the security and dependability of LCCIs. MIT components will interface with existing systems and will not require major replacement
of existing hardware or software. The flexibility of the middleware allows the easy integration of new LCCIs or the exchange of new kinds of information.
4.3. SimCIP Simulator for Critical Infrastructure Protection Application

The purpose of the SimCIP simulation environment is twofold: On the one hand, simulation can be used to improve the understanding of interdependent LCCIs. On the other hand, the MIT components will be tested and validated in experiments using SimCIP. Furthermore, their applicability and usefulness will be demonstrated to stakeholders within the SimCIP environment before deployment to the "real world" systems (see Figure 9).
Figure 9. The role of the SimCIP simulation environment in the development of the MIT components: MIT development and testing in simulation, followed by MIT deployment to the real world. (© IRRIIS Consortium)
Building the SimCIP environment is a big challenge because the simulation will not only have to include physical simulations but also has to simulate the cyber layer and the management layer of an LCCI as well (see Figure 10). For this purpose SimCIP will use the principle of agent-based simulation. Each object will be modeled as an agent with clear interfaces to its environment and other agents. The SimCIP environment will include and interface with existing tools to keep the simulation meaningful with respect to existing technologies and to allow the use of the results gained in current systems. This also means that the SimCIP environment does not have to start from scratch but can rely on already existing and proven technology.
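As a rough illustration of the agent-based idea, the sketch below propagates failures over an assumed dependency graph. Real SimCIP agents would be far richer (modeling physical, cyber and management layers), so the rule "an agent fails as soon as any dependency fails" and the example topology are deliberate simplifications of ours, not part of the SimCIP design.

```python
def simulate_cascade(agents, initially_failed):
    """Propagate failures through dependency links until a fixed point.

    agents: dict name -> list of names it depends on.  An agent fails as
    soon as any of its dependencies has failed (a deliberately crude rule
    standing in for the per-agent behaviour a real simulator would have)."""
    failed = set(initially_failed)
    changed = True
    while changed:
        changed = False
        for name, deps in agents.items():
            if name not in failed and any(d in failed for d in deps):
                failed.add(name)
                changed = True
    return failed

# Hypothetical topology: a telecom switch depends on power, SCADA on telecom.
topology = {
    "power_substation": [],
    "telecom_switch": ["power_substation"],
    "scada": ["telecom_switch"],
    "billing": ["scada"],
}
print(sorted(simulate_cascade(topology, {"power_substation"})))
```

Even this crude model shows the phenomenon of interest: a single failure in one infrastructure cascades across dependency links into others.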
Figure 10. The different layers of a single LCCI and the corresponding SimCIP simulation objects. (© IRRIIS Consortium)
To decide which tools and models should be included in SimCIP, an in-depth survey of existing tools and models will be performed. However, the main strength of SimCIP will be the simulation of interdependencies between different LCCIs. To that end, some objects of the individual LCCIs will be modeled at more abstract levels. This will ensure the high scalability and flexibility of the SimCIP environment. SimCIP should be as generic as possible to allow its application to various kinds of LCCIs and its adaptation to the specific needs of individual stakeholders.

Knowledge elicitation and research will lead to a "diagnosis" of the current and the future status of interdependent LCCIs. The "therapy" will be implemented through the MIT components, which can be tested and validated in the SimCIP environment. The main contributions of IRRIIS are an enhanced understanding of LCCIs, the SimCIP simulation environment and the MIT components. To disseminate the results broadly to stakeholders, technology and service providers and the research community, these interest groups will be included in the IRRIIS project right from the start. IRRIIS also relies on international cooperation and is open to joint efforts of all kinds to achieve its goals. To foster international cooperation, IRRIIS will establish an international conference and will define scenarios and benchmarks to allow the comparison of different approaches.
5. Conclusion

In this paper, we have given an overview of several projects currently carried out at our institute which are aimed at fighting Internet-related crime. As these projects show, to
fully address the issue of Internet-related security threats, especially politically and ideologically motivated attacks, one must consider not only physical violence, but also the criminal activities that provide the funds for such attacks. As a primary example, phishing is used on the Internet in this way, and we have described how phishing activities can be suppressed using advanced text mining solutions. In related projects, we have shown how data mining and text mining techniques can be used to prevent fraud involving fake auction goods and fraud in medical administration systems. Finally, we have described how specific software architectures and middleware can be used to protect critical infrastructures consisting of interrelated technical, economic and social systems. All the above projects form part of what we call "Preventive Security", which is one of the major areas of activity for our institute. In addition to the projects described here, we provide services to military and non-military organizations in the area of organizational planning, simulation and process intelligence, as well as research and applied work on physical hardware and robots for security and inspection purposes.
Acknowledgements

This work was funded in part by the EU FP6 projects AntiPhish (contract 027600), IRRIIS (contract 027568), iWebCare (contract 28055), and RACWeB (contract 045101).
References

AntiPhish (2006): Website of the project. http://www.antiphishresearch.org/home.html.
APWG (2007): Phishing Activity Trends. Report for the Month of April, 2007. http://www.antiphishing.org/reports/apwg_report_april_2007.pdf.
Bundesgerichtshof (2007): Bundesgerichtshof bestätigt Rechtsprechung zur Haftung von eBay bei Markenverletzungen. Press release 45/2007, April 19th 2007.
Chapelle, O., Schölkopf, B., and Zien, A. (2006): Semi-Supervised Learning, MIT Press, Cambridge, MA.
Huber, F., Matthes, I., Vollhardt, K., Ulbrich, D. (2006): Marken- und Produktpiraterie aufdecken und bekämpfen – am Beispiel von Internetauktionen eines Markenparfums. Arbeitspapiere Management P6, Center of Market-Oriented Product and Production Management, Mainz. ISBN 3-938879-13-0.
Landler, M., Markoff, J. (2007): Digital Fears Emerge After Data Siege in Estonia. New York Times, online edition, May 29th, 2007. http://www.nytimes.com/2007/05/29/technology/29estonia.html?pagewanted=1&ei=5070&en=5858f0fda0af7087&ex=1184644800.
Rhoads, C. (2007): Cyber Attack Vexes Estonia, Poses Debate. The Wall Street Journal Online, May 18, 2007, page A6.
Rüping, S. (2006): Learning Interpretable Models, Ph.D. Thesis, Universität Dortmund. http://hdl.handle.net/2003/23008.
Scholz, M. (2005): Sampling-Based Sequential Subgroup Mining. In Grossman, R. L., Bayardo, R., Bennett, K., and Vaidya, J. (editors), Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '05), pages 265-274, Chicago, Illinois, USA, ACM Press.
Widmer, G. and Kubat, M. (1996): Learning in the presence of concept drift and hidden contexts, Machine Learning 23, pp. 69-101.
Copyright of images:

- Figure 1: Public domain image from http://en.wikipedia.org/wiki/Image:PhishingTrustedBank.png
- Figures 2 & 3: Own graphs
- Figure 4: Logo of our own project AntiPhish, copyright AntiPhish consortium
- Figure 5: Graph of our own project AntiPhish, copyright AntiPhish consortium
- Figure 6: Public domain image from http://en.wikipedia.org/wiki/Image:Frolex.jpg
- Figures 7-10: Graphs of our own project IRRIIS, copyright IRRIIS consortium
Security Informatics and Terrorism: Patrolling the Web C.S. Gal et al. (Eds.) IOS Press, 2008 © 2008 IOS Press. All rights reserved.
71
Enhancement to the Advanced Terrorist Detection System (ATDS)

Bracha SHAPIRA a,1, Yuval ELOVICI b, Mark LAST a, Abraham KANDEL c

a Department of Information Systems Engineering, Ben-Gurion University of the Negev, Beer-Sheva, Israel
b Deutsche Telekom Laboratories at Ben-Gurion University of the Negev, Beer-Sheva, Israel
c Department of Computer Science and Engineering, University of South Florida
Abstract. The ATDS system is aimed at detecting potential terrorists on the Web by tracking and analyzing the content of pages accessed by users in a known environment (e.g., university, organization). The system would alert and report on any user who is "too" interested in terrorist-related content. The system learns and represents the typical interests of the users in the environment. It then monitors the content of pages the users access and compares it to the typical interests of the users in the environment. The system issues an alarm if it discovers a user whose interests are significantly and consistently dissimilar to the other users' interests. This paper briefly reviews the main ideas of the system and suggests improving the detection accuracy by learning terrorists' typical behaviors from known terrorist related sites. An alarm would be issued only if a "non-typical" user is found to be similar to the typical interests of terrorists. Another enhancement suggested is the analysis of the visual content of the pages in addition to the textual content. Keywords. Cluster-Analysis, Content-based detection, Similarity measure
Introduction

Terrorists increasingly take advantage of the Internet as a cheap and accessible infrastructure for intelligence and propaganda. The Internet enables anonymous communication with other terrorists, coordination of activities, promotion of new ideas to a mass audience, and recruitment of new members into terror organizations. Governments and intelligence agencies are trying to track terrorists and their activities
1 Corresponding Author: Bracha Shapira, Department of Information Systems Engineering, Ben-Gurion University; Email: [email protected]
B. Shapira et al. / Enhancement to the Advanced Terrorist Detection System (ATDS)
on the Web in order to prevent future acts of terrorism [3,5]. Vast amounts of resources are invested in research aimed at developing new methods and technologies for the cyber intelligence effort. One fact is that terrorists and their supporters access the Web for their activities and browse terror-related sites. This is especially true for "newcomers" who are planning to form a new terrorist cell and browse terror-related sites to obtain information relevant for terrorist activities. It is therefore desirable to detect users who are continuously accessing terror-related sites and to examine further whether they are indeed interested in committing terrorist acts. Monitoring known terror-related sites in order to detect the users who access them is not feasible, since terrorist sites tend to change their locations very often to prevent surveillance by intelligence agencies.

The Advanced Terrorist Detection System (ATDS) uses a content-based approach for the detection of potential terrorists and their supporters while they access the Web, by monitoring and analyzing the content of the web pages accessed. The assumption underlying ATDS is that it is possible to determine users' interests (referred to as profiles) from this analysis [1,2]. Using this approach, real-time web traffic monitoring can be performed to identify terrorists and their supporters as they access terror-related information on the Internet. The current implemented version of ATDS is described in detail in [1,2,7].

In this paper we provide a brief overview of the current version of ATDS and suggest an extension to the underlying model that is aimed at improving ATDS performance. We also point out the current version's limitations and suggest improvements that are being implemented and evaluated now. The paper is structured as follows: Section 1 briefly reviews ATDS and reports the results of the experiments conducted to examine its feasibility and performance.
Section 2 presents the model extension. Section 3 lists further suggestions for improving the system based on the current version's limitations, and Section 4 concludes with future issues.
1. ATDS Review

1.1. ATDS Model

ATDS uses the content of web pages accessed by a group of users in a specific environment as input for detecting non-typical activities of individual users. ATDS analyzes the textual content of web pages. The underlying idea is to learn and model the typical interests of a group of users in a certain environment, such as a company, a university campus, or a public organization, and then identify users whose interests are atypical, i.e., users whose interests are consistently and significantly dissimilar to the typical interests in the environment. These users are referred to as "abnormal" or "non-typical" users. The model consists of two phases, the learning phase and the detection phase. During the learning phase, the web traffic of a group of users is recorded and an efficient representation is created for further analysis. We use the vector-space model [6] and cluster analysis to represent users' typical interests, i.e., each page is transformed to a vector of weighted terms. The set of users' vectors is clustered to represent the users' areas of interest. The learning phase is applied in the same environment where the detection will later be applied, in order to learn the typical content accessed by users in the environment.
The detection phase is aimed at detecting non-typical users, i.e., users whose interests are consistently and significantly dissimilar to the typical interests of the group. The detection is performed on-line while the users browse the Internet. Each accessed page is represented as a vector and compared to the known typical interests of the reference group. The required level of consistency and significance of the dissimilarity can be configured by the dissimilarity threshold and by the number and frequency, within a time frame, of non-typical accesses that would cause an alert. The learning and detection phases are described briefly in the following sections. The reader is referred to [1,2,7] for further details.

1.1.1. Learning Phase

During the learning phase, a database representing the interests of the monitored group is generated. During the detection phase, the content of the pages users access is compared to this database. The learning module includes the Filter, Vectors-generator and Clusters-generator components. The functionality of these components is as follows: each page of the training data is sent to the Filter for exclusion of inappropriate pages, e.g., non-textual pages or pages in languages not supported by the system, and for omitting images and tags related to the content format. The filtered pages are then sent to the Vectors-generator component, which generates a weighted-terms vector for each page. The vector entries represent terms, and their values represent the discriminative value of the term for the page, defined by the normalized frequency of the term on the page and by other properties of the term, such as whether it is part of the title, or its position on the page. The vectors are recorded for the clustering process that follows. The Clusters-generator module (Figure 1) receives the vectors from the Vectors-generator and performs cluster analysis on them. Cluster-analysis methods group objects of similar kinds into respective categories.
A clustering process receives as input a set of objects and their attributes and generates clusters (groups) of similar objects based on their attributes, so that objects within a cluster have high similarity in their attribute values, while objects in different groups are dissimilar. The objects in ATDS are the pages, and the attributes are the weighted terms. The n clusters generated by the Clusters-generator represent the n areas of interest of the users in the monitored environment. The optimal n is determined by the clustering algorithm. For each cluster, the Group-representor component computes the central vector (centroid), denoted by Cvi for cluster i. In the ATDS context, each centroid represents an area of interest of the users in the environment. The learning phase is a batch process and should be activated recurrently to assure an up-to-date representation of the users' interests.

1.1.2. The Detection Phase

During the detection phase, a group of computers in a particular environment is monitored to detect non-typical users. The content of the web accesses of the group members is collected on-line and compared to the set of pre-defined cluster centroids
Figure 1. The learning phase of the model (Training Data → Filter → Vectors-Generator → Clusters-Generator → Group-Representors → representation of the group's interests)
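The learning pipeline described above (term-weight vectors, clustering, centroids) can be sketched roughly as follows. The toy pages, the plain term-frequency weighting, the naive seeding and the tiny k-means-style loop are illustrative simplifications of ours, not the actual Vectors-generator and Clusters-generator implementations.

```python
import math
from collections import Counter

def tf_vector(text):
    """Normalised term-frequency vector for one filtered page."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid(vectors):
    acc = Counter()
    for v in vectors:
        acc.update(v)
    return {t: w / len(vectors) for t, w in acc.items()}

def cluster(pages, k, iters=5):
    """Very small k-means-style grouping of page vectors by cosine
    similarity; the resulting centroids play the role of ATDS's Cvi."""
    vecs = [tf_vector(p) for p in pages]
    cents = vecs[:k]                      # naive seeding with the first k pages
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in vecs:
            best = max(range(k), key=lambda i: cosine(v, cents[i]))
            groups[best].append(v)
        cents = [centroid(g) if g else cents[i] for i, g in enumerate(groups)]
    return cents

pages = [
    "database query optimisation lecture notes",
    "football league results and scores",
    "sql database index tutorial",
    "football match report scores",
]
cents = cluster(pages, k=2)
# One centroid now summarises the "database" pages, the other the sports pages.
```

A page about databases is now closer (by cosine similarity) to one centroid, and a sports page to the other, which is the property the detection phase relies on.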
(generated during the learning phase). The detection phase consists of the Sniffer, the Filter, the Vectors-generator and the Detector, which are the main components of this phase. The Sniffer captures the pages that users (identified by their Internet Protocol addresses, IPs) access at the network layer, and sends each page to the Filter for further processing. The Filter and the Vectors-generator have the same functionality as in the learning phase. The Detector receives the vectors (and their respective IPs) and decides whether to issue an alarm. The Detector measures the distance between each vector and each of the centroids representing an area of interest of the group. Each incoming vector is compared to every cluster centroid and classified as non-typical if its similarity to all clusters is below a pre-defined threshold. The similarity between the vectors accessed by users and the centroids is measured by the cosine of the angle between them, as defined in Equation 1. A vector representing a user access is considered non-typical if the similarity between the access vector and the nearest centroid is lower than the threshold denoted by tr.
$$\max\left(
\frac{\sum_{i=1}^{m} tCv_{i1}\, tAv_{i}}
     {\sqrt{\sum_{i=1}^{m} tCv_{i1}^{2}}\,\sqrt{\sum_{i=1}^{m} tAv_{i}^{2}}},
\;\ldots,\;
\frac{\sum_{i=1}^{m} tCv_{in}\, tAv_{i}}
     {\sqrt{\sum_{i=1}^{m} tCv_{in}^{2}}\,\sqrt{\sum_{i=1}^{m} tAv_{i}^{2}}}
\right) < tr
\qquad (1)$$
where Cvi is the i-th centroid vector, Av the access vector, tCvin the i-th term in the n-th centroid vector, tAvi the i-th term in the vector Av, and m the number of unique terms in each vector. The system holds a data structure to maintain the history of users' accesses. For each known IP address in the environment, a queue is maintained with the most recent vectors accessed by the user or users at that IP. The vector, its classification (typical or non-typical) and the time stamp of the access are recorded into the queue of the specific IP address in the sub-queue data structure. The next step is performed by the Anomaly finder, which scans the queue of the IP address of the accessed page and analyzes the set of recorded vectors within the queue, representing the history of accesses for the specific IP. If the percentage of non-typical vectors in a queue is above a pre-defined threshold, an alarm is generated, flagging that the user at the specific IP is consistently accessing non-typical information. The detection algorithm can be calibrated with certain parameters to fine-tune the detection. Some of the parameters are: 1) the dissimilarity threshold, 2) the minimum number of non-typical accesses by the IP required to issue an alarm, and 3) the time frame of suspicious accesses by the IP that issues an alarm. These thresholds are set to prevent alerts caused by random accesses, and to prevent alerts caused by the aggregation of accesses from different sessions or users on the same IP.
Figure 2. The Detection Phase (Sniffer → Filter → Vectors-Generator → Detector → Anomaly Finder → Alarm, using the per-IP access-history queues and the database of user interests)
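The per-IP detection loop can be sketched as follows: each access vector is compared against every centroid as in Equation 1, and an alarm fires when the share of non-typical accesses in the IP's recent queue exceeds a threshold. The parameter names, the toy centroids and the 3-term vocabulary are illustrative, not taken from the ATDS implementation.

```python
import math
from collections import deque

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

class Detector:
    """Sketch of the detection loop: an access is non-typical if its best
    cosine similarity to any centroid falls below tr; an alarm is raised
    when a full queue holds at least alarm_share non-typical accesses."""

    def __init__(self, centroids, tr=0.3, queue_size=32, alarm_share=1.0):
        self.centroids = centroids
        self.tr = tr
        self.queues = {}                  # IP -> recent non-typical flags
        self.queue_size = queue_size
        self.alarm_share = alarm_share

    def access(self, ip, vector):
        non_typical = max(cosine(vector, c) for c in self.centroids) < self.tr
        q = self.queues.setdefault(ip, deque(maxlen=self.queue_size))
        q.append(non_typical)
        share = sum(q) / len(q)
        # Alarm only once the history window is full, to damp random accesses.
        return len(q) == self.queue_size and share >= self.alarm_share

# Two toy centroids over a 3-term vocabulary, and one deviating user.
centroids = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0]]
det = Detector(centroids, tr=0.5, queue_size=4)
alarms = [det.access("10.0.0.7", [0.0, 0.0, 1.0]) for _ in range(4)]
print(alarms)  # no alarm until the queue is full of non-typical accesses
```

Lowering `alarm_share` or shortening the queue makes the detector more sensitive but also more prone to false alarms, which is exactly the calibration trade-off described above.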
1.2. ATDS Evaluations and Results

A very brief overview of the ATDS evaluations and their results is presented below, to illustrate the suggested extensions. The reader is referred to [1,2,7] for further details. The ATDS evaluation was performed at the Information Systems Engineering Department of Ben-Gurion University in Beer-Sheva, Israel. The evaluation took place in a student teaching lab with 38 PCs. The typical interests of the users were learned from 170,000 accesses to the Web collected during the students' regular activities (with their permission). After the exclusion of non-English pages, which ATDS cannot handle, we obtained a collection of 13,300 pages as input for the learning phase, to which the Filter and Vectors-generator modules were applied.
To simulate atypical accesses during the detection phase, we collected 582 terror-related pages to which we applied the Filter and Vectors-generator modules. We also randomly selected 582 vectors from the data set of typical users to simulate typical accesses, in order to examine the system's ability to distinguish between typical and non-typical content. ATDS performance was measured by the fraction of True Positive (TP) alerts, which indicates the detection rate, vs. the fraction of False Positive (FP) alerts, which indicates the false-alarm rate. This trade-off is described by ROC (Receiver Operating Characteristic) curves. A True Positive (TP) is the percentage of alarms issued when non-typical activity takes place; a False Positive (FP) is the percentage of false alarms issued when typical activity is taking place. The ATDS evaluation had two goals: the first was to prove its feasibility, and the second to examine the effect of the following parameters on ATDS performance: the size of the sub-queue for each IP, i.e., the length of the history list of accessed pages kept for each IP; the percentage of abnormal pages required to issue an alert; the similarity threshold values for issuing an alert; and the number of clusters representing the normal profile of the monitored users. Three simulations were run. The first examined the ATDS monitoring capability, where all 38 computers submitted 13 iterations of 100 URL access requests while controlling the time gap between requests. In this simulation the loss of data was recorded, which turned out to be very minimal. The second and third simulations manipulated the above-mentioned parameters in order to examine their effects.
To sum up the results briefly (please see [1,2,7] for further details), the experiments showed that all the parameters had a significant effect on the results. For example, the best detection was observed for a queue size of 32, with 100% of the vectors in the queue required to be atypical for issuing an alarm. However, we do not think that these results should be generalized before further study. Besides proving the feasibility of ATDS, the evaluation did show that the system is sensitive to these parameters and that a sensitivity analysis might be required before applying ATDS to a new data set.
Figure 3. ROC results of the simulation for queue size = 32 and alarm thresholds of 50% and 100% of accesses in a queue
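The TP/FP computation behind such ROC curves can be sketched as follows, using made-up dissimilarity scores (the real evaluation used the 582 terror-related and 582 typical page sets described above).

```python
def roc_points(scores_pos, scores_neg, thresholds):
    """ROC points for an alarm score: TP rate on non-typical (positive)
    traffic vs FP rate on typical (negative) traffic, at each threshold."""
    pts = []
    for tr in thresholds:
        tp = sum(s >= tr for s in scores_pos) / len(scores_pos)
        fp = sum(s >= tr for s in scores_neg) / len(scores_neg)
        pts.append((fp, tp))
    return pts

# Hypothetical dissimilarity scores: terror-related pages score high,
# typical student pages score low.
pos = [0.9, 0.8, 0.75, 0.6, 0.4]
neg = [0.5, 0.3, 0.2, 0.1, 0.05]
print(roc_points(pos, neg, [0.25, 0.55, 0.85]))
```

Sweeping the threshold traces out the curve: a low threshold catches every non-typical access but also raises false alarms, while a high threshold suppresses false alarms at the cost of missed detections.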
2. Extensions to the Model

In the current version of ATDS, users who consistently browse content significantly dissimilar to the typical interests of their environment will be suspected as potential "terrorists" and will be further investigated. The underlying assumption is that potential terrorists, using the infrastructure of some non-terrorist environment, would be detected due to their non-typical interests. While this assumption is reasonable with respect to real terrorists, it is too general and might cause the detection of users who are not typical for the group but who are not terrorists. Examples of such users are workers in an organization who use the organizational infrastructure for private activities or interests during their work, students in a specific department who are interested in topics that have nothing to do with their studies, etc. Since the system aims at detecting terrorists and not just non-typical users, it might be beneficial to narrow the suspect list in order to prevent further investigation of non-typical users who are not involved in any illegal activities. We suggest using a similar but reverse method and testing whether a user who is detected as non-typical has interests similar to the known interests of terrorists. This requires the following additions to the ATDS model. The enhanced learning phase would include a process for learning typical terrorists' interests, in addition to learning the typical interests of the users in the environment. The typical terrorists' interests would be derived from the many sites known to include terror-related content. Such sites are being monitored by many organizations that track terror activities on the Web; examples of such organizations are the SITE Institute (www.siteinstitute.org) and MEMRI (www.memri.org).
The learning phase components, i.e., the Filter, the Vectors-generator, and the Clusters-generator, are activated with terror-related sites as input, and the output is a set of clusters that represent terrorists' typical interests. Thus, the system holds two sets of clusters: the first represents the typical users' interests in the environment being monitored, and the second represents terrorists' typical interests. The enhanced detection phase (presented in Figure 4) would include an additional component, the "Terrorist Detector", which receives as input every user (IP) detected as non-typical for the environment by the Detector component. As described above, a user is flagged as non-typical if a pre-defined percentage of the pages she accessed within a pre-defined time frame is not similar to any of the typical known interests of the environment. The set of vectors viewed by the suspected user (as not similar to the typical known interests) is then compared to the centroids of the clusters representing terrorists' interests (using cosine similarity). An alarm would be issued if the similarity between the user's interests and the terrorists' typical interests is above a pre-defined threshold. Thus, only if a non-typical user is consistently interested specifically in terror-related content is she suspected as a potential terrorist and further investigated. The detection algorithm should be calibrated to set the similarity threshold that would cause an alarm to be issued. The system is flexible with respect to the type of "abnormal" users it is able to detect; it can actually be tuned to detect any type of dissimilar users. Thus, if law enforcement authorities are interested in detecting child abusers, they can collect web pages with child-abuse-related content and activate the learning phase on the relevant data.
B. Shapira et al. / Enhancement to the Advanced Terrorist Detection System (ATDS)
[Figure 4 diagram: the Sniffer captures the web page and IP of each user access; the Filter and Vector Generator feed the Detector (user queues, access history, DB of group user interests); the Anomaly Finder, driven by the detection parameters, passes non-typical users to the Terrorist Detector, which consults the DB of terror-related interests and issues an Alarm.]
Figure 4. Enhanced Detection Algorithm
3. Visual Content Handling

ATDS currently handles valid HTML web pages, and extracts only the textual information on the page. Extension of ATDS to support additional formats (e.g., XML) is rather technical, and can easily be solved by adding other interpreters to the model.
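Such an interpreter extension might look like the following sketch; the registry, function names, and supported content types are hypothetical, using only Python's standard-library parsers.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

class _TextExtractor(HTMLParser):
    # Collects the character data of an HTML page, ignoring markup.
    def __init__(self):
        super().__init__()
        self.chunks = []
    def handle_data(self, data):
        self.chunks.append(data)

def html_to_text(payload):
    parser = _TextExtractor()
    parser.feed(payload)
    return " ".join(" ".join(parser.chunks).split())

def xml_to_text(payload):
    # Concatenate every text node of an XML document.
    root = ET.fromstring(payload)
    return " ".join(" ".join(root.itertext()).split())

# Hypothetical interpreter registry keyed by content type; a new
# format is supported by registering one more extractor function.
INTERPRETERS = {"text/html": html_to_text, "text/xml": xml_to_text}

def extract_text(content_type, payload):
    return INTERPRETERS[content_type](payload)
```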
However, many terrorist-related pages include typical visual elements rather than text. The visual elements are very strong identifiers of their sources. These elements include typical colors, images (of terrorist leaders or holy places), logos (of organizations), and known phrases. Terror-related visual elements, in addition to the terror-related textual content, may improve the accuracy of ATDS by providing stronger evidence for a terror-related page. Tollari and Glotin [8] showed that even simple linear fusion of textual and visual analysis of web pages for image search engines significantly improves retrieval results. They claim that the "textualness" and the "visualness" of web pages are dependent. We believe that the same idea might apply to the representation of a group profile, and that using both types of evidence might improve classification results. We thus intend to add a visual analysis component that will examine the existence of typical terrorist visual elements on a website. The visual analysis is to be integrated into both phases of ATDS: the learning phase, in the context of visual elements, would learn the typical visual elements on terror-related sites; the detection phase would use the knowledge about typical elements as input for classifying a page as terror-related or non-terror-related, in order to improve classification accuracy. During the first stage we will concentrate on improving detection; thus, we intend to use human experts to define a set of typical visual elements known to appear on terror-related sites. Such elements include background colors, terror-organization logos, and images. We will represent these visual elements efficiently to enable fast comparison during the detection phase [10]. We will then focus on integrating the detection of terror-related content using visual elements on a page in addition to textual content.
We will conduct empirical experiments to examine the optimal method of fusing textual and visual analysis. Several options exist:
1. Check for the existence of terror-related visual elements after the textual analysis is applied, regardless of the classification result of the textual analysis, and fuse the results according to a relative weighting scheme. This method is similar to the method presented in [8].
2. Check for the existence of terror-related elements after the textual analysis is applied, only for documents classified to the terror-related group, using the visual elements as additional verification of the classification results.
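The two fusion options can be sketched as follows; the scores, weights, and thresholds are hypothetical placeholders that would be calibrated empirically, in the spirit of the linear fusion of [8].

```python
def fuse_scores(text_score, visual_score, w_text=0.7, w_visual=0.3):
    # Linear late fusion of the textual classifier score and the
    # visual-evidence score; the weights are placeholders that would
    # be calibrated empirically.
    return w_text * text_score + w_visual * visual_score

def classify_fused(text_score, visual_score, threshold=0.5):
    # Option 1: always run both analyses and fuse their scores.
    return fuse_scores(text_score, visual_score) >= threshold

def classify_verified(text_is_terror, visual_score, v_threshold=0.5):
    # Option 2: run the visual check only on pages the textual
    # classifier already labeled terror-related, as verification.
    return text_is_terror and visual_score >= v_threshold
```

Option 2 is cheaper at detection time, since the visual check runs only on the (presumably small) set of pages already classified as terror-related.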
As for the visual analysis implementation (shown in Figure 5), the Vector-generator component of the detection phase will also extract the images of every incoming document. The comparison between the visual elements of a user-accessed web page and the known terror-related visual elements would be activated only for pages classified as non-typical to the environment, and according to the fusion policy. A crucial consideration in the selection of methods for the non-textual representation of visual elements is their efficiency, since the visual analysis is performed on-line during the detection process. Thus, efficient algorithms for extracting, representing, and comparing visual elements should be used [10].
[Figure 5 diagram: as in Figure 4, but the Vector Generator is extended to a Vector Generator & Image Extractor; page images are passed to a Visual Analysis component with its DB of terror-related visual elements, which works alongside the Terrorist Detector (DB of terror-related interests) before an Alarm is issued.]
Figure 5. ATDS including visual analysis
4. Future Issues and Limitations

For ATDS to become practical, i.e., to enable its use in real environments to monitor web users and detect potential terrorists, the following additional enhancements should be applied: Language - ATDS should support other languages in addition to English, which is the only language currently supported. In particular, ATDS should support Arabic, the common language of many Islamist terror organizations. The Arabic language is very challenging for text analysis due to its many special characteristics. For example, a given word in Arabic can be found in many different forms which could be conflated in information retrieval. Many definite articles, particles and other prefixes can be attached to the beginnings of words, and a large number of suffixes can be attached to their ends. Most related forms (such as plurals) are irregular, and cannot be formed just by adding prefixes or suffixes [4]. Additionally, most noun, adjective, and verb stems are derived from a few thousand roots by infixing, for example, creating words like
maktab (office), kitaab (book), Kutub (books), kataba (he wrote), and naktubu (we write), from the root ktb [9]. Thus, in the ATDS context, stemming is required to prevent mismatches between a term representing a cluster and a similar term on a web page. There exist some stemmers for Arabic, mainly statistically based stemmers [4], but their performance is far worse than the performance of stemmers in English. ATDS should include the most advanced existing Arabic-based retrieval tools to support Arabic as the inclusion of Arabic is a pre-condition for its practicality. Scalability – The current version of ATDS was evaluated on a network with 38 computers, which of course is not the scale of the environment it is designed for. ATDS would have to be scaled up, optimized, and calibrated for large scale environments and would have to be evaluated in such a large scale environment. We are planning to conduct experiments applying and manipulating the enhancements described above, so we can examine their effect on ATDS performance. The experiments will be conducted with a new set of pages in a different environment to test consistency with former results.
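For illustration, a "light" stemmer in the spirit of [4] strips a small set of common prefixes and suffixes rather than analyzing roots; the abridged affix lists and the minimum-stem-length constraint below are our assumptions, not a production stemmer.

```python
# Abridged affix lists for illustration; real light stemmers use
# longer, carefully ordered lists.
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "و"]
SUFFIXES = ["ات", "ان", "ين", "ون", "ها", "ة"]

def light_stem(word, min_len=3):
    # Strip at most one leading and one trailing affix, keeping the
    # remainder long enough to plausibly be a stem.
    for p in PREFIXES:
        if word.startswith(p) and len(word) - len(p) >= min_len:
            word = word[len(p):]
            break
    for s in SUFFIXES:
        if word.endswith(s) and len(word) - len(s) >= min_len:
            word = word[:-len(s)]
            break
    return word
```

In the ATDS context, both the cluster terms and the page terms would pass through the same stemmer, so that surface-form variation does not prevent a match.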
References

[1] Elovici, Y., Shapira, B., Last, M., Kandel, A., and Zaafrany, O. Using Data Mining Techniques for Detecting Terror-Related Activities on the Web. Journal of Information Warfare, 3 (1) (2004), 17-28.
[2] Elovici, Y., Shapira, B., Last, M., Zaafrany, O., Freidman, M., Schneider, M., and Kandel, A. Content-Based Detection of Terrorists Browsing the Web Using an Advanced Terror Detection System (ATDS). ISI 2005 (2005), 244-255.
[3] Ingram, M. Internet privacy threatened following terrorist attacks on US. http://www.wsws.org/articles/2001/sep2001/isps24.shtml (2001).
[4] Larkey, L., Ballesteros, L., and Connell, M. Improving Stemming for Arabic Information Retrieval: Light Stemming and Co-occurrence Analysis. In Proceedings of ACM SIGIR (2002), 269-274.
[5] Lemos, R. What are the real risks of cyberterrorism? ZDNet, http://zdnet.com.com/2100-1105955293.html (2002).
[6] Salton, G., Wong, A., and Yang, C.S. A Vector Space Model for Automatic Indexing. Communications of the ACM, 18 (11) (1975), 613-620.
[7] Shapira, B. Content Based Model for Web Monitoring. In Fighting Terror in Cyberspace, Series in Machine Perception and Artificial Intelligence, Vol. 65 (2005), 63-69.
[8] Tollari, S., and Glotin, H. Web Image Retrieval on ImagEVAL: Evidences on Visualness and Textualness Concept Dependency in Fusion Models. In Proceedings of the 6th ACM International Conference on Image and Video Retrieval (2007), 65-72.
[9] Wightwick, J., and Gaafar, M. Arabic Verbs and Essentials of Grammar. Chicago: Passport Books, 1998.
[10] Yoo, H.-W., Jang, D.-S., and Na, Y.-K. An Efficient Indexing Structure and Image Representation for Content-Based Image Retrieval. IEICE Trans. Inf. Syst. (2002), 1390-1398.
Security Informatics and Terrorism: Patrolling the Web C.S. Gal et al. (Eds.) IOS Press, 2008 © 2008 IOS Press. All rights reserved.
Discovering Hidden Groups in Communication Networks 1

Jeff BAUMES a, Mark K. GOLDBERG a,2, Malik MAGDON-ISMAIL a, William A. WALLACE b
a CS Department, Rensselaer Polytechnic Institute, Troy, NY 12180.
b DSES Department, Rensselaer Polytechnic Institute, Troy, NY 12180.

Abstract. This chapter presents statistical and algorithmic approaches to discovering groups of actors that hide their communications within the myriad of background communications in a large communication network. Our approach to discovering hidden groups is based on the observation that a pattern of communications exhibited by actors in a social group pursuing a common objective is different from that of a randomly selected set of actors. We distinguish two types of hidden groups: temporal, which exhibits repeated communication patterns; and spatial, which exhibits correlations within a snapshot of communications aggregated over some time interval. We present models and algorithms, together with experiments showing the performance of our algorithms on simulated and real data inputs.

Keywords. statistical communication analysis, terrorist networks, graph clustering, temporal correlation
1. Introduction

1.1. Motivation

Modern communication networks (telephone, email, Internet chatrooms, etc.) facilitate rapid information exchange among millions of users around the world. This vast communication activity provides the ideal environment for groups to plan their activity undetected: the related communications are embedded (hidden) within the myriad of random background communications, making them difficult to discover. When a number of individuals in a network exchange communications related to a common goal or a common activity, they form a group; usually, the presence of the coherent communication activity imposes a certain structure on the communications of the set of actors, as a group. A group of actors may communicate in a structured way while not being forthright in exposing its existence and membership. This chapter develops statistical and algorithmic approaches to discovering such hidden groups.

1 This article is a reproduction of the article "Identification of Hidden Groups in Communications," by J. Baumes, M. Goldberg, M. Magdon-Ismail, and W. Wallace, Handbook in Information Systems, Volume 2, pp. 209-242; © 2007, Elsevier B.V.
2 Corresponding Author: Rensselaer Polytechnic Institute; 110 8th street, Troy, N.Y., 12180; USA; Email:
[email protected].
J. Baumes et al. / Discovering Hidden Groups in Communication Networks
Finding hidden groups on the Internet has become especially important since the September 11, 2001 attacks. The tragic event underlines the need for a tool (a software system) which facilitates the discovery of hidden (malicious) groups during their planning stage, before they move to implement their plans. A generic way of discovering such groups is based on discovering correlations among the communications of the actors of the communication network. The communication graph of the network is defined by the set of its actors, as the vertices of the graph, and the set of communications, as the graph's edges. Note that the content of the communications is not used in the definition of the graph. Although the content of the messages can be informative and natural language processing may be brought to bear in its analysis, such an analysis is generally time-consuming and intractable for large data-sets. The research presented in this chapter makes use of only three properties of a message: its time, the name of the sender, and the name of the recipient of the message. Our approach to discovering hidden groups is based on the observation that a pattern of communications exhibited by actors in a social group pursuing a common objective is different from that of a randomly selected set of actors. Thus, we focus on the discovery of such groups of actors whose communications during the observation time period exhibit statistical correlations. We will differentiate between spatial and temporal correlations, which, as we shall see, lead to two different notions of hidden groups.

1.2. Temporal Correlation

One possible instance of temporal correlation is an occurrence of a repeated communication pattern. Temporal correlation may emerge as a group of actors plans some future activity. This planning stage may last for a number of time cycles, and, during each of them, the members of the group need to exchange messages related to the future activity.
These message exchanges imply that, with high probability, the subgraph of the communication graph formed by the vertices corresponding to the active members of the group is connected. If this connectivity property of the subgraph is repeated during a sufficiently long sequence of cycles, longer than is expected for a randomly formed subgraph of the same size, then one can discover this higher-than-average temporal correlation, and hence identify the hidden group. Thus, in order to detect hidden groups exhibiting temporal correlations, we exploit the non-random nature of their communications as contrasted with the general background communications. We describe algorithms, first appearing in [4,5], which, under certain conditions on the density of the background communications, can efficiently detect such hidden groups. We differentiate between two types of temporally correlated hidden groups: a trusting, or non-secretive, hidden group, whose members are willing to convey their messages to other hidden group members via actors that are not hidden group members, using these non-hidden group members as "messengers"; and a non-trusting, or secretive, hidden group, where all the "sensitive" information that needs to be conveyed among hidden group members uses only other hidden group members as messengers. Our results reveal those properties of the background network activity and hidden group communication dynamics that make detection of the hidden group easy, as well as those that make it difficult. We find that if the background communications are dense or more structured, then the hidden group is harder to detect. Surprisingly, we also find that
when the hidden group is non-trusting (secretive), it is easier to detect than if it is trusting (non-secretive). Thus, a hidden group which tries to prevent the content of its messages from reaching third parties undermines its operations by becoming easier to detect!

1.3. Spatial Correlation

We use spatial correlation to refer to correlations in the communications of a single communication graph, which represents a snapshot of the communications aggregated over some time interval (in contrast to temporal correlation, which refers to correlation in the communications over multiple communication graphs representing successive snapshots of the communications). Spatial correlation of messages initiated by a group of actors in a social network can be identified by a higher-than-average total communication level within this group. This property does not rely on the content of the messages and is adequately described by the communication graph: the edge density of the corresponding set of vertices of the graph is higher than that of the average set. To be able to address a wide variety of applications, we consider a general notion of edge density, which compares the intensity of communications between the actors within a particular set and that between the set and the "outside world." The edge density may be defined in numerous ways depending on the desired characteristics of the discovered groups; our algorithms for discovering groups of higher density (potential hidden groups) are generic with respect to the definition of density. Furthermore, we find only groups which are more dense than any group sufficiently close, which reflects the principle of locality in a social network.
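Two simple instantiations of such an edge density can be sketched as follows, for a group of actors and an undirected edge list: one compares internal communications to all possible internal pairs, the other compares internal communications to all communications incident to the group. The function names and representation are illustrative.

```python
def internal_density(group, edges):
    # Density 1: actual internal communications over all possible
    # pairs of group members.
    group = set(group)
    possible = len(group) * (len(group) - 1) // 2
    internal = sum(1 for u, v in edges if u in group and v in group)
    return internal / possible if possible else 0.0

def boundary_density(group, edges):
    # Density 2: internal communications over all communications
    # incident to the group, i.e., including messages that cross
    # the group boundary to the "outside world".
    group = set(group)
    internal = external = 0
    for u, v in edges:
        if u in group and v in group:
            internal += 1
        elif u in group or v in group:
            external += 1
    total = internal + external
    return internal / total if total else 0.0
```

A density-based clustering heuristic can treat either function as a black box, which is the genericity noted above.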
For our numerical experiments, we use two main ideas in defining density: one is the proportion of the number of actual communications to the total number of possible communications; the other is the ratio of the number of communications within the group to the total number of group communications, including messages to individuals outside the group. In graph-theoretical terminology, the problem we study is clustering. An important implication of our approach is that our algorithms construct clusters that may overlap, i.e., some actors may be assigned to more than one group. While there is much literature in the area of graph clustering, up until very recent work it has mainly focused on a specific sub-case of the problem: graph partitioning. As opposed to partitioning algorithms, which decompose the network into disjoint groups of actors, general clustering allows groups to extend to their natural boundaries by allowing overlap. We discuss prior work in the area of partitioning, and present three general clustering heuristics, originally described in [2,3]. We refer to these procedures by the names Iterative Scan (IS), Rank Removal (RaRe), and Link Aggregate (LA). We present experimental data that illustrate the efficiency, flexibility, and accuracy of these heuristics. Searches for spatial and temporal correlation may be combined to produce a more effective algorithm for the identification of hidden groups. The temporal algorithms may indicate that a large group of individuals is involved in planning some activity. The spatial-correlation algorithm may then be used to cluster this large group into overlapping subgroups, which would correspond to smaller working groups within the larger group. We present results from the testing of our spatial hidden-group algorithms on a number of real-world graphs, such as newsgroups and email. We analyze the quality of the
groups produced by the clustering algorithms. We also test the algorithms on random graph models in order to determine trends in both runtime and accuracy. One of the interesting experimental discoveries is that different implementations of the Iterative Scan algorithm are optimized for different domains of application, based on the sparseness (density) of the communication network.

2. Discovering Temporal Correlation

2.1. Literature Review

The identification of temporally correlated hidden groups was initiated in [30] using Hidden Markov models. Here, our underlying methodology is based upon the theory of random graphs [7,22]. We also incorporate some of the prevailing social science theories, such as homophily [32], by incorporating group structure into our model. A more comprehensive model of societal evolution can be found in [20,37]. Other simulation work in the field of computational analysis of social and organizational systems [9,10,36] primarily deals with dynamic models for social network infrastructure, rather than the dynamics of the actual communication behavior, which is the focus of this chapter. One of the first works analyzing hidden groups is found in [14]. Here, the author studies a number of secret societies, such as a resistance that was formed among prisoners at Auschwitz during the Second World War. The focus, as it is in this chapter, was on the structure of such societies, and not on the content of communications. An understanding of a hidden network comes through determining its general pattern and not the details of its specific ties. The September 11, 2001 terrorist plot spurred much research in the area of discovering hidden groups. Specifically, the research was aimed at understanding the terrorist cells that organized the hijacking. Work has been done to recreate the structure of that network and to analyze it to provide insights on general properties that terrorist groups have.
Analyzing their communication structure provides evidence that Mohammed Atta was central to the planning, but that a large percentage of the individuals would have needed to be removed in order to render the network inoperable [39]. In "Uncloaking Terrorist Networks" [29], Krebs uses social network measures such as betweenness to identify which individuals were most central to the planning and coordination of the attacks. Krebs has also observed that the network that planned September 11 attempted to hide by making their communications sparse. While these articles provide interesting information on the history of a hidden group, our research uses properties of hidden groups to discover their structure before a planned attack can occur. There is also work being done in analyzing the theory of networked groups, and how technology is enabling them to become more flexible and challenging to deal with. The hierarchical structure of terrorist groups in the past is giving way to a more effective and less organized network structure [35]. Of course, the first step to understanding decentralized groups is to discover them. What follows are some strategies to solve this problem.

2.2. Methodology

A temporally hidden group is a different kind of group from the normally functioning social groups in the society that engage in "random" communications. We define a temporally hidden group, or simply a hidden group in this section, as some subset of the actors who are planning or coordinating some activity over time; the hidden group members may also be engaging in other, non-planning-related communications. The hidden group may be malicious (for example, some kind of terrorist group planning a terror attack) or benign (for example, a foursome planning their Sunday afternoon golf game). So in this sense, a hidden group is not assumed to be intentionally hiding, but the group activity is initially unknown and masked by the background communications. The hidden group is attempting to coordinate some activity, using the communication network to facilitate the communications between its members. Our task now is to (1) discover specific properties that can be used to find hidden groups; and (2) construct efficient algorithms that utilize those properties. Next steps, such as the formulation of empirically precise models and further investigation of the properties of hidden groups, are beyond the scope of this methodology. Whether intentional or not, in a normal society, communications will, in general, camouflage the planning-related activity of the hidden group. This could occur in any public forum such as newsgroups or chatrooms, or in private communications such as email messages or phone conversations. However, the planning-related activity is exactly the Achilles heel that we will exploit to discover the hidden group: on account of the planning activity, the hidden group members need to stay "connected" with each other during each "communication cycle." To illustrate the general idea, consider the following time evolution of a communication graph for a hypothetical society; here, communications among the hidden group are in bold, and each communication cycle graph represents the communications that took place during an entire time interval.

Figure 1. Cyclic representation of the communication.
We assume that information must be communicated among all hidden group members during one communication cycle (see Figure 1). Note that the hidden group is connected in each of the communication cycle figures above. We interpret this requirement that the communication subgraph for the hidden group be connected as the requirement that, during a single communication cycle, information must have passed (directly or indirectly) from some hidden group member to all the others. If the hidden group subgraph is disconnected, then there is no way that information could have been passed from a member in one of the components to a member in the other, which makes the planning impossible during that cycle. The information need not pass from one hidden group member to every other directly: a message could be passed from A to C via B. Strictly speaking, A and C are hidden group members; however, B need not be one. We will address this issue more formally in the next section. A hidden group may try to hide its existence by changing its connectivity pattern, or by throwing in "random" communications to non-hidden group members. For example, at some times the hidden group may
be connected by a tree, and at other times by a cycle. None of these disguises changes the fact that the hidden group is connected, a property we will exploit in our algorithms. We make the assumption here that the hidden group remains static over the time period when communications are collected. The algorithms described here would still be useful, however, as long as a significant subset of the group remains the same. The algorithms would likely not detect members that joined or left the group, but would discover a "core" group of members.

2.2.1. Trusting vs. Non-Trusting Hidden Groups

Hidden group members may have to pass information to each other indirectly. Suppose that A needs to communicate with B. They may use a number of third parties to do this: A → C1 → · · · → Ck → B. Trusting hidden groups are distinguished from non-trusting ones by who the third parties Ci may be. In a trusting (or non-secretive) hidden group, the third parties used in a communication may be any actor in the society; thus, the hidden group members (A, B) trust some third-party couriers to deliver a message for them. In doing so, the hidden group is taking the risk that the non-hidden group members Ci have access to the information. For a malicious hidden group, such as a terrorist group, this could be too large a risk, and so we expect that malicious hidden groups will tend to be non-trusting (or secretive). In a non-trusting (secretive) hidden group, all the third parties used to deliver a communication must themselves be members of the hidden group, i.e., no one else is trusted. The more malicious a hidden group is, the more likely it is to be non-trusting. Hidden groups that are non-trusting (vs. trusting) need to maintain a higher level of connectivity.

Figure 2. Types of connectivity.

We define three notions of connectivity, as illustrated by the shaded groups in Figure 2. A group is internally connected if a message may be passed between any two group members without the use of outside third parties.
In the terminology of Graph Theory, this means that the subgraph induced by the group is connected. A group is externally connected if a message may be passed between any two group members, perhaps with the use of outside third parties. In Graph Theory terminology, this means that the group is a subset of a connected set of vertices in the communication graph. For example, in Figure 2 (2) above, a message from A to B would have to use the outside third party C. A group is disconnected if it is not externally connected. The following observations are the basis for our algorithms for detecting hidden groups. (i) Trusting hidden groups are externally connected in every communication cycle.
Figure 3. Internally persistent, externally persistent and non-persistent groups. The communication graph during 4 communication cycles is shown. Three groups are highlighted: IP, EP, and D. One can easily verify that IP is internally persistent during these 4 communication cycles, and so is a candidate non-trusting hidden group. EP is internally persistent only for time periods 1 and 2. If we only observed data during this time period, then EP would also be a candidate non-trusting hidden group. However, EP is externally persistent for all the communication cycles, and hence can only be a candidate for a trusting hidden group. D becomes disconnected during communication cycle 4, and hence is not a candidate hidden group.
(ii) Non-trusting hidden groups are internally connected in every communication cycle.

We can now state the idea behind our algorithm for detecting a hidden group: a group of actors is persistent over communication cycles 1, . . . , T if it is connected in each of the communication graphs corresponding to each cycle. The two variations of the connectivity notion, internal or external, depend on whether we are looking for a non-trusting or trusting hidden group. Our algorithm is intended to discover potential hidden groups by detecting groups that are persistent over a long time period. An example is illustrated in Figure 3. A hidden group can be hidden from view if, by chance, there are many other persistent subgroups in the society. In fact, it is likely that there will be many persistent subgroups in the society during any given short time period. However, these groups will be short-lived on account of the randomness of the society communication graph. Thus we expect our algorithm's performance to improve as the observation period increases.

2.2.2. Detecting the Hidden Group

Our ability to detect the hidden group hinges on two things. First, we need an efficient algorithm for identifying maximally persistent components over a time period Δ. Second, we need to ensure, with high probability, that over this time period there are no persistent components that arise, by chance, due to the background societal communications. We will construct algorithms to efficiently identify maximal components that are persistent over a time period Δ. Given a model for the random background communications, we can determine (through simulation) how long a group of a particular size must be persistent in order to ensure that, with high probability, this persistent component did not arise by chance due to background communications.
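Checking persistence requires testing, in each communication cycle, whether a candidate group is internally or externally connected. This reduces to standard graph reachability; the following sketch uses an edge-list representation and hypothetical function names.

```python
from collections import defaultdict

def reach(start, edges):
    # Set of vertices reachable from `start` using the given edges
    # (iterative depth-first search over an undirected graph).
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, stack = set(), [start]
    while stack:
        x = stack.pop()
        if x not in seen:
            seen.add(x)
            stack.extend(adj[x] - seen)
    return seen

def internally_connected(group, edges):
    # Non-trusting groups: the subgraph induced by the group alone
    # must be connected (no outside couriers allowed).
    group = set(group)
    induced = [(u, v) for u, v in edges if u in group and v in group]
    return group <= reach(next(iter(group)), induced)

def externally_connected(group, edges):
    # Trusting groups: the group need only lie inside one connected
    # component of the whole communication graph.
    group = set(group)
    return group <= reach(next(iter(group)), edges)
```

Internal connectivity implies external connectivity, mirroring observations (i) and (ii).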
(a) Externally persistent components

1: Ext_Persistent({G_t}_{t=1}^{T}, V)
2: // Input: graphs G_t = (V, E_t), t = 1, . . . , T.
3: // Output: a partition P = {V_j} of V.
4: Use DFS to find the connected components C_t of every G_t;
5: Set P_1 = C_1 and P_t = {} for t > 1;
6: for t = 2 to T do
7:   for every set A ∈ P_{t-1} do
8:     obtain a partition P of A by intersecting A with every set in C_t;
9:     place P into P_t;
10:  end for
11: end for
12: return P_T;

(b) Internally persistent components

1: Int_Persistent({G_t}_{t=1}^{T}, V)
2: // Input: graphs G_t = (V, E_t), t = 1, . . . , T.
3: // Output: a partition P = {V_j} of V.
4: {V_k}_{k=1}^{K} = Ext_Persistent({G_t}_{t=1}^{T}, V);
5: if K = 1 then
6:   P = {V_1};
7: else
8:   P = ∪_{k=1}^{K} Int_Persistent({G_t(V_k)}_{t=1}^{T}, V_k);
9: end if
10: return P;
Figure 4. Algorithms for detecting persistent components.
2.3. Algorithms

Select δ to be the smallest time-interval during which it is expected that information is passed among all group members. Index the communication cycles (which are consecutive time periods of duration δ) by t = 1, 2, . . . , T. Thus, the duration over which data is collected is Δ = δ · T. The communication data is represented by a series of communication graphs, G_t for t = 1, 2, . . . , T. The vertex set for each communication graph is the set V of all actors. The input to the algorithm is the collection of communication graphs {G_t} with a common set of actors V. The algorithm splits V into persistent components, i.e., components that are connected in every G_t. The notion of connected could be either external or internal, and so we develop two algorithms, Ext_Persistent and Int_Persistent. Each algorithm develops the partition in an iterative way. If we have only one communication graph G_1, then both the externally and internally persistent components are simply the connected components of G_1. Suppose now that we have one more graph, G_2. The key observation is that two vertices i, j are in the same external component if and only if they are connected in both G_1 and G_2, i.e., they are in the same component in both G_1 and G_2. Thus, the externally persistent components for the pair G_1, G_2 are exactly the intersections of the connected components of G_1 with those of G_2. This argument clearly generalizes to more than two graphs, and relies on the fundamental property that any subset of an externally connected set is also externally connected. Unfortunately, the same property does not hold for internal connectivity, i.e., a subset of an internally connected set is not guaranteed to be internally connected. However, a minor modification of the externally connected algorithm, in which one goes back and checks any sets that get decomposed, leads to the algorithm for detecting internally persistent components (Figure 4(b)).
The formal details of the algorithms are given in [5].
J. Baumes et al. / Discovering Hidden Groups in Communication Networks
2.3.1. Analysis

The correctness and computational complexity results for the algorithms of Figure 4 are stated here; for full details see [5]. We say that a set A is a maximal persistent set (internal or external) if it is persistent and any other persistent set that contains at least one element of A is a subset of A. Clearly, any two maximal persistent sets must be disjoint, which also follows from the following lemma.

Lemma 1 If A and B are non-disjoint externally (resp. internally) persistent sets, then A ∪ B is also externally (resp. internally) persistent.

Theorem 1 (Correctness of Ext_Persistent) Algorithm Ext_Persistent correctly partitions the vertex set V into maximal externally connected components for the input graphs {G_t}, t = 1, . . . , T.

Let E_t denote the number of edges in G_t, and let E denote the total number of edges in the input, E = Σ_{t=1}^{T} E_t. The size of the input is then E + V · T.

Theorem 2 (Complexity of Ext_Persistent) The computational complexity of Algorithm Ext_Persistent is in O(E + V · T) (linear in the input size).

Theorem 3 (Correctness of Int_Persistent) Algorithm Int_Persistent correctly partitions the vertex set V into maximal internally connected components.

Theorem 4 (Complexity of Int_Persistent) The computational complexity of Algorithm Int_Persistent is in O(V · E + V² · T).

2.3.2. Statistical Significance of Persistent Components

Let h be the size of the hidden group we wish to detect. Suppose that we find a persistent component of size ≥ h over T communication cycles. A natural question is how sure we can be that this is really a hidden group, rather than a persistent component that happened to arise by chance from the random background communications. Let X(t) denote the size of the largest persistent component over the communication cycles 1, . . . , t that arises due to normal societal communications.
X(t) is a random variable with some probability distribution, since the communication graph of the society follows a random process. Given a confidence threshold ε, we define the detection time τ_ε(h) as the time at which, with probability 1 − ε, the largest persistent component arising by chance in the background is smaller than h, i.e.,

τ_ε(h) = min{t : P[X(t) < h] ≥ 1 − ε}.      (1)
Then, if after τ_ε(h) cycles we observe a persistent component of size ≥ h, we can claim, with confidence 1 − ε, that it did not arise from the normal functioning of the society, and hence must contain a hidden group. τ_ε(h) indicates how long we must wait in order to detect hidden groups of size h. Another useful function is h_ε(t), an upper bound for X(t) that holds with high probability 1 − ε, i.e.,

h_ε(t) = min{h : P[X(t) < h] ≥ 1 − ε}.      (2)
If, after a given time t, we observe a persistent component of size ≥ h_ε(t), then with confidence at least 1 − ε we can claim it contains a hidden group. h_ε(t) indicates what size hidden group we can detect with only t cycles of observation. The previous approaches to detecting a hidden group assume that we know h, or fix a time t at which to make a determination. By slightly modifying the definition of h_ε(t), we can get an even stronger hypothesis test for a hidden group. For any fixed δ > 0, define

H_ε(t) = min{h : P[X(t) < h] ≥ 1 − εδ/t^(1+δ)}.      (3)
Then one can show that if X(t) ≥ H_ε(t) at any time, we have a hidden group with confidence 1 − ε. Note that the computation of τ_ε(h) and h_ε(t) constitutes a pre-processing of the society's communication dynamics. This can be done either from a model (such as the random graph models we have described) or from the true, observed communications over some time period. More importantly, it can be done off-line. For a given realization of the society dynamics, let T(h) = min{t : X(t) < h}. Some useful heuristics that aid in the computation of τ_ε(h) and h_ε(t) by simulation can be obtained by assuming that T(h) and X(t) are approximately normally distributed, in which case:

Confidence level    τ_ε(h)                      h_ε(t)
50%                 E[T(h)]                     E[X(t)]
84.13%              E[T(h)] + √Var[T(h)]        E[X(t)] + √Var[X(t)]
97.72%              E[T(h)] + 2√Var[T(h)]       E[X(t)] + 2√Var[X(t)]
                                                                    (4)
2.4. Random Graphs as Communication Models

Social and information communication networks, e.g., the Internet and WWW, are usually modeled by graphs [33,9,10,36], where the actors of the networks (people, IP addresses, etc.) are represented by the vertices of the graph, and the connections between the actors are represented by the graph edges. Since we have no a priori knowledge regarding who communicates with whom, i.e., how the edges are distributed, it is appropriate to model the communications using a random graph. In this paper, we study hidden group detection in the context of two random graph models for the communication network: uniform random graphs and random graphs with embedded groups. In describing these models, we use standard graph theory terminology [40] and its extension to hypergraphs [6]. In a hypergraph, the concept of an edge is generalized to a hyperedge, which may join more than two vertices. In addition to these two models, there are other models of random networks, such as the small-world model and the preferential attachment model [1]; however, in this work we limit our experiments to the following models, which are also illustrated in Figure 5.

Random Model A simple communication model is one where communications happen uniformly at random among all pairs of actors. Such a communication model can be represented by the random graph model developed and extensively studied by Erdős and Rényi [11,12,13,7]. In this model, the graph is generated by a random process in which an edge between every pair of vertices is generated independently with a given probability p. The probability space of graphs generated by such a random process is denoted G(n, p); such graphs are sometimes called Bernoulli graphs. We will use the G(n, p) notation throughout this paper.
Figure 5. Two random graph models: the random model (left) and the group model (right).
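Both models of Figure 5 can be sampled in a few lines. The sketch below assumes the group model described next (groups of m actors, internal probability p_g, external probability p_e); the uniform random model is recovered by supplying no groups. Function names and the edge-list representation are our own.

```python
import random

def random_groups(n, num_groups, m, rng):
    """num_groups hyperedges, each of m actors chosen uniformly; may overlap."""
    return [set(rng.sample(range(n), m)) for _ in range(num_groups)]

def group_model_cycle(n, groups, p_g, p_e, rng):
    """One communication cycle: a pair communicates with probability p_g if
    the two actors share at least one group, and with probability p_e
    otherwise. With groups == [] this reduces to the uniform G(n, p_e) model."""
    share = set()                           # pairs that share some group
    for g in groups:
        members = sorted(g)
        for a in range(len(members)):
            for b in range(a + 1, len(members)):
                share.add((members[a], members[b]))
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if rng.random() < (p_g if (i, j) in share else p_e)]
```

Repeated calls with the same `groups` produce the time series G_1, . . . , G_T used by the detection algorithms.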
Group Model The G(n, p) random graph model may not be a suitable model for large communication networks. Actors tend to communicate more often with certain actors and less frequently with others. In a more realistic model, actors belong to one or more social groups, in which communication among group members is more frequent than communication among actors that do not share a group. This leads us to a hypergraph model of the communication network, in which the actors associate themselves into groups. In this paper, we assume that each group is static and contains m actors. While this is a simplification, it serves to illustrate all the essential ideas and results without undue complication. A group of actors is represented by a hyperedge in the graph, and an actor may belong to zero or more hyperedges. The set of all hyperedges represents the structure of the communication network. Since groups tend to be small, it is appropriate to model the communications within a group as a G(m, p_g), where p_g is the communication probability within the group. We also allow communication between two actors that do not share a group; we call such communications external. The probability of an external communication is p_e; we further assume that p_e ≪ p_g, because intra-group communications are much more likely than extra-group communications.

Connectivity of Random Graphs The key idea of our algorithms is based on the following observation: for any subset of actors in a random-model network, it is very unlikely that this subset stays connected over a long consecutive sequence of communication cycles, while a hidden group must stay connected (for its operations) as long as it functions as a group. Thus, we summarize here some results from random graph theory regarding how the connectivity of a G(n, p) depends on n and p [11,12,13,7].
These results are mostly asymptotic in nature (with respect to n); however, we use them as a guide that remains accurate even for moderately sized n. Given a graph G = (V, E), a subset S ⊆ V of the vertices is connected if there exists a path in G between every pair of vertices in S. G can be partitioned into disjoint connected components such that every pair of vertices from the same connected component is connected and every pair of vertices in different connected components is not connected. The size of a component is the number of its vertices; the size of the largest connected component is denoted by L(G). The remarkable discovery by Erdős and Rényi, usually termed The Double Jump, deals with the size of the largest component, and essentially states that L(G) goes through two phase transitions as p increases beyond a critical threshold value. All the results hold asymptotically, with high probability, i.e., with probability tending to 1 as n → ∞:
For p = c/n:

L(G(n, p)) = O(ln n) if 0 < c < 1;  L(G(n, p)) = O(n^(2/3)) if c = 1;  L(G(n, p)) = β(c) · n if c > 1, where β(c) < 1.

For p = ln n/n + x/n, x > 0:

L(G(n, p)) = n with probability e^(−e^(−x)).      (5)
Note that as x → ∞, the graph is connected with probability tending to 1. Since our approach is based on the tenet that a hidden group displays a higher level of connectivity than the background communications, we will only be able to detect the hidden group if the background is not fully connected, i.e., if L(G) < n. Thus we expect our ability to detect the hidden group to undergo a phase transition exactly when the background connectivity undergoes a phase transition. For p = constant or p = d · ln n/n with d > 1, the graph is asymptotically connected, which makes it hard to detect the hidden group. However, when p = constant, connectivity is exponentially more probable than when p = d · ln n/n, which has implications for our algorithms.

2.5. Experiments and Results

In these tests, we simulate societies of sizes n = 1000 and 2000. The results for both the random and the group background communication models are presented in parallel. For each model, multiple time series of graphs are generated for communication cycles t = 1, 2, . . . , T, where T = 200. Experiments were run on multiple time series (between five and thirty) and averaged in order to obtain statistically more reliable results. To estimate h_ε(t), we estimate E[X(t)] by taking the sample average of the largest persistent component over communication cycles 1, . . . , t. Given h, the time at which the plot of E[X(t)] drops below h indicates the time at which we can identify a hidden group of size ≥ h. We first describe the experiments with the random model G(n, p). The presence of persistently connected components depends on the connectivity of the communication graphs over periods 1, 2, . . . , T. When the societal communication graph is connected for almost all cycles, we expect the society to generate many large persistent components.
By the results of Erdős and Rényi described in Section 2.4, a phase transition from short-lived to long-lived persistent components should occur at p = 1/n and p = ln n/n. Accordingly, we present the results of simulations with p = 1/n, p = c ln n/n for c = 0.9, 1.0, 1.1, and p = 1/10 for n = 1000. The rate of decrease of E[X(t)] is shown in Figure 6. For p = 1/n, we expect exponential or super-exponential decay in E[X(t)] (Figure 6, thin dashed line). This is expected because L(G) is at most a fraction of n. An abrupt phase transition occurs at p = ln n/n (Figure 6, dot-dashed line); at this point the detection time begins to become large. For constant p (where p does not depend on n, in this case 1/10), the graph is connected with probability tending to 1, and it becomes essentially impossible to detect a hidden group using our approach without additional information (Figure 6, thick dashed line). This occurs for any constant choice as n becomes large: for any constant p > 0, there is an integer N such that if n > N then G(n, p) is connected with high probability, tending to 1. The parameters of the experiments with the group model are similar to those of the G(n, p) model. We pick the group size m to be equal to 20. Each group is selected independently and uniformly from the entire set of actors; the groups may overlap; and each actor may be a member of zero or more groups. If two members are in the same
Figure 6. The largest internally persistent component E[X(t)] for the G(n, p) model with n = 1000, 2000. The five lines represent p = 1/n, p = c ln n/n for c = 0.9, 1.0, 1.1, and p = 1/10. Note the transition at p = ln n/n, which becomes more apparent at n = 2000. When p is a constant (i.e., does not depend on n; here we used 1/10), the graph is almost always connected. The results were averaged over a number of runs; the sharp jumps indicate where the largest component quickly dropped from about 0.9n to zero in different runs.
group together, the probability that they communicate during a cycle is p_g; otherwise the probability is p_e. Intuitively, p_g is significantly bigger than p_e; we picked p_e = 1/n, so each actor has about one external communication per time cycle. The values of p_g used in the experiments are chosen to achieve a certain average number of communications per actor, so that the effect of a change in the structure of the communication graph may be investigated while keeping the average density of communications constant. The average number of communications per actor (the degree of the actor in the communication graph) is set to six in the experiments; the results do not change qualitatively for other choices of the average degree. The number of groups g is chosen from {50, 100, 200}. These cases are compared to the G(n, p) structure with an average of six communications per actor. For the selected values of g, each actor is, on average, in 1, 2 and 4 groups, respectively. When g is 50, an actor is, on average, in approximately one group, and the overlaps of groups are small. However, when g is 200, each actor,
                   Group Model              Random Model
g (# of groups)    50      100     200     G(n, 6/n)
T(1)               > 100   63      36      32
Figure 7. Times of hidden group discovery for various amounts of group structure; each group is independently generated at random and has 20 actors. In all cases, n = 1000, degav = 6, and the group size m = 20. Note how, as the number of groups becomes large, the behavior tends toward the G(n, p) case.
on average, is in about 4 groups, so there is a significant amount of overlap between the groups. The goal of our experiments is to see the impact of g on finding hidden groups. Note that as g increases, any given pair of actors tends to belong to at least one group together, so the communication graph tends toward a G(n, p_g) graph. We give a detailed comparison between the society with structure (group model) and the one without (random model) in Figure 7. The table shows T(1), the time after which the size of the largest internally persistent component has dropped to 1. This is the time at which any hidden group would be noticed, since the group would persist beyond the time expected in our model. We have also run similar experiments for detecting trusting groups. The results are shown in Figure 8. As the table shows, compared to the corresponding non-trusting communication model, the trusting group is much harder to detect.

3. Discovering Spatial Correlation

3.1. Literature Review

While an informal definition of the goal of clustering algorithms is straightforward, difficulty arises when formalizing this goal. There are two main approaches to clustering: partitioning and general clustering.
deg_av    T(1) for trusting groups    T(1) for non-trusting groups
2         28                          2
6         > 100                       32
Figure 8. Times of hidden group discovery for non-trusting (internally connected) hidden groups and trusting (externally connected) hidden groups. In all cases the communication graphs are G(n, p) with n = 1000.
Partitioning, or hierarchical clustering, is the traditional method of performing clustering; in some circles, clustering and partitioning are synonymous. For example, Kannan, Vempala, and Vetta define clustering as “partitioning into dissimilar groups of similar items” [23]. However, the partitioning approach forces every cluster to be either entirely contained within or entirely disjoint from every other cluster. Partitioning algorithms are useful when the set of objects needs to be broken down into disjoint categories. These categories simplify the network and may be treated as separate entities. Partitioning is used in the fields of VLSI design, finite element methods, and transportation [24]. Many partitioning algorithms attempt to minimize the number of connections between clusters, also called the cut size of the partition [27,26,21,24,25]. The ρ-separator metric attempts to balance the sizes of the clusters while also minimizing the cut size [15]. The betweenness metric is used to find a small cut by removing edges that are likely to split the network into components [17,19]. In addition to minimizing the cut size, some algorithms attempt to maximize the quality of each cluster. Two metrics used in partitioning that define cluster quality are expansion and conductance [16,23]. A final metric relates to how well the members of the same cluster correlate with the values in eigenvectors of the adjacency matrix of the network [8]. Groups in social networks do not conform to this partitioning approach. For example, in a social network, an individual may belong to numerous groups (e.g., occupational, religious, political, activity-based). A general clustering algorithm may put the individual into all these clusters, while a partitioning algorithm will place the individual into only one cluster. Classifying an individual as belonging to a single cluster or social group will often miss the full picture of the societal structure.
As opposed to partitioning, general clustering allows individuals to belong to many groups at once. General clustering algorithms determine the zero, one or more groups that each actor belongs to, without enforcing a partition structure on the clusters. This complex structure may more directly correspond to real-world clusters. However, permitting overlapping groups is a more complex problem, since each cluster may not be treated as a separate entity. See Figure 9 for a comparison of partitioning and general clustering.
Figure 9. A comparison of a partitioning (a) and a general clustering (b) of the same network.
When clustering a network, there needs to be a definition of what constitutes a “good” cluster. In some sense, members of a cluster need to be “close” to each other and “far” from the other objects in the network; there are many ways to formalize this criterion. The general clustering problem has been less widely studied than the partitioning problem; however, some algorithms exist for discovering a general clustering of a network. Some algorithms of this type are well suited for web networks [28,18,31]; these algorithms all attempt to find clusters by optimizing a metric referred to as bicliqueness. Though often used for partitioning, eigenvector correlation may also be used to discover overlapping clusters [38]. Local optimality is a generic technique which can optimize many of the previously mentioned metrics, even metrics originally developed for partitioning; these algorithms have been applied to social networks [2,3]. We present these algorithms in more detail in the following sections.

3.2. Methodology

Let G = (V, E) be a graph whose nodes represent individuals, web pages, etc., and whose edges represent communications, links, etc. The graph may be directed or undirected; we present the definitions for directed graphs, the undirected case being similar. A graph cluster C is a set of vertices, which can be viewed as a binary vector of length |V| that indicates which nodes are members of C. The set of all graph clusters, C, is the power set of V. A weight function, or metric, is a function W : C → R that assigns a weight to a graph cluster. Associated to a cluster C, we define three edge sets: E(C), the edges induced by C; E(C, C̄), the edges in E from C to its complement; and E(C̄, C), the edges in E to C from its complement. Let E_out(C) = E(C, C̄) + E(C̄, C). We define the internal and external edge intensities,

p_in(C) = E(C) / (|C| · (|C| − 1)),      p_ex(C) = E_out(C) / (2|C| · (n − |C|))      (6)
(p_ex = 1 when |C| = |V|). We will consider three weight functions: the internal edge probability W_p; the edge ratio W_e; and the intensity ratio W_i,

W_p(C) = p_in(C),      W_e(C) = E(C) / (E(C) + E_out(C)),      W_i(C) = p_in(C) / (p_in(C) + p_ex(C)).      (7)
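A direct implementation of these weight functions might look as follows. This is a sketch: the adjacency-dict representation and function names are our own, and no local-update optimization is attempted.

```python
def edge_counts(adj, cluster):
    """adj: dict vertex -> set of out-neighbours (directed graph).
    Returns (|E(C)|, |E_out(C)|) for the cluster C."""
    C = set(cluster)
    e_in = e_out = 0
    for u, nbrs in adj.items():
        for v in nbrs:
            if u in C and v in C:
                e_in += 1
            elif (u in C) != (v in C):      # edge crosses the boundary
                e_out += 1
    return e_in, e_out

def w_edge_ratio(adj, cluster):
    """We(C) = |E(C)| / (|E(C)| + |E_out(C)|), per Eq. (7)."""
    e_in, e_out = edge_counts(adj, cluster)
    return e_in / (e_in + e_out) if e_in + e_out else 0.0

def w_intensity_ratio(adj, cluster, n):
    """Wi(C) = p_in / (p_in + p_ex), with the intensities of Eq. (6)."""
    C = set(cluster)
    e_in, e_out = edge_counts(adj, cluster)
    p_in = e_in / (len(C) * (len(C) - 1)) if len(C) > 1 else 0.0
    p_ex = e_out / (2 * len(C) * (n - len(C))) if n > len(C) else 1.0
    return p_in / (p_in + p_ex) if p_in + p_ex else 0.0
```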
These metrics measure how intense the communication within the cluster is, relative to that outside the cluster; they can be efficiently updated locally, i.e., the metric may be updated knowing only the connectivity of the one node that is added or removed (which improves the efficiency of the algorithms). A set-difference function δ is a metric that measures the difference between two clusters C₁, C₂. Two useful set-difference functions are the Hamming or edit distance δ_h, and the percentage non-overlap δ_p:

δ_h(C₁, C₂) = |(C₁ ∩ C̄₂) ∪ (C̄₁ ∩ C₂)|,      δ_p(C₁, C₂) = 1 − |C₁ ∩ C₂| / |C₁ ∪ C₂|.      (8)
The ε-neighborhood of a cluster, B_δ,ε(C), is the set of clusters within ε of C with respect to δ, i.e., B_δ,ε(C) = {C′ : δ(C, C′) ≤ ε}. For a weight function W, we say that a cluster C* is ε-locally optimal if W(C*) ≥ W(C′) for all C′ ∈ B_δ,ε(C*). We are now ready to formally state our abstraction of the problem of finding overlapping communities in a communication network. The input is a graph G, the communication graph, along with the functions W and δ and the parameter ε. The output is a set of clusters O ⊆ C such that C ∈ O iff C is ε-locally optimal. While our heuristic approaches are easily adapted to different weight and set-difference functions, we will focus on the choices W = W_e, δ = δ_h and ε = 1, referring to the output clusters as locally optimal. As stated, the problem is NP-hard; in fact, the restriction to δ = δ_h and ε = |V| asks for all the globally optimal clusters according to an arbitrary weight function W, which is well known to be NP-hard. Thus, we present heuristic, efficient (low-order polynomial time) algorithms that output candidate (overlapping) clusters, and then evaluate the quality of the output.
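The set-difference functions of (8) and the ε = 1, δ = δ_h local-optimality test can be sketched as below (function names are our own; `weight` is any cluster metric supplied as a Python function):

```python
def delta_h(c1, c2):
    """Hamming / edit distance: size of the symmetric difference, Eq. (8)."""
    return len(set(c1) ^ set(c2))

def delta_p(c1, c2):
    """Percentage non-overlap, Eq. (8)."""
    a, b = set(c1), set(c2)
    return 1 - len(a & b) / len(a | b) if a | b else 0.0

def is_locally_optimal(cluster, vertices, weight):
    """eps = 1, delta = delta_h: C is locally optimal iff no single-vertex
    addition or removal strictly improves the weight."""
    C = set(cluster)
    w = weight(C)
    for v in vertices:
        trial = C - {v} if v in C else C | {v}
        if trial and weight(trial) > w:
            return False
    return True
```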
3.3. Algorithms

3.3.1. k-Neighborhood (k-N)

k-N is a trivial algorithm that yields overlapping clusters. The clusters are simply the k-neighborhoods of a randomly selected set S of cluster centers. The inputs to this algorithm are k and |S|.
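A sketch of k-N, assuming an adjacency-dict graph (breadth-first search to depth k from each randomly chosen center; names are our own):

```python
from collections import deque
import random

def k_neighborhood(adj, center, k):
    """All vertices within distance k of `center` (BFS)."""
    dist = {center: 0}
    q = deque([center])
    while q:
        u = q.popleft()
        if dist[u] == k:
            continue                       # do not expand beyond depth k
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return set(dist)

def k_n_clusters(adj, k, num_centers, rng):
    """k-N: the (possibly overlapping) clusters are the k-neighborhoods
    of a random set of cluster centers."""
    centers = rng.sample(sorted(adj), num_centers)
    return [k_neighborhood(adj, c, k) for c in centers]
```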
3.3.2. Rank Removal (RaRe)

Algorithm RaRe is based on the assumption that within a communication network there is a subset of “important” or high-ranking nodes, which do a significant amount of communication. RaRe attempts to identify these nodes and remove them from the graph, in order to disconnect the graph into smaller connected components. The removed nodes are added to a set R. This process is repeated until the sizes of the resulting connected components are within a specified range. These connected components can be considered the core of each cluster. Next, the vertices in R are considered for addition into one or more of these cores. If a vertex from R is added to more than one cluster, then those clusters now overlap. Note, however, that the cores of the clusters are disjoint, and communicate with each other only through vertices in R. “Important” or high-ranking nodes are determined by a ranking function φ; these are the nodes removed at each iteration. We wish to remove nodes whose removal disconnects the graph as much as possible. One choice is to remove vertices of high degree, corresponding to the choice φ_d(v) = deg(v). Another approach, which we have found to be experimentally better, is to rank nodes according to their PageRank™ φ_p(v) [34]. The PageRank™ of a node is defined implicitly as the solution of the following equation,

φ_p(v) = c · Σ_{u : (u,v) ∈ E} φ_p(u)/deg⁻(u) + (1 − c)/n,      (9)
Table 1. User-specified inputs for Algorithm RaRe.

Input        Description
W            Weight function.
φ            Ranking function.
min, max     Minimum and maximum core sizes.
t            Number of high-ranking vertices to remove.
where n is the number of nodes in the graph, deg⁻(v) is the out-degree of vertex v, and c is a decay factor between 0 and 1. An iterative algorithm to compute φ_p(v) for all nodes converges rapidly to the correct values. Once we have obtained the cores, we add the vertices in R back into the cores to build up the clusters. Intuitively, a vertex v ∈ R should be part of any cluster to which it is immediately adjacent, as it would have been part of the core had it not been removed at some step. Moreover, if we do not take this approach, we run the risk of v not being added to any cluster at all, which seems counterintuitive, as v was deemed “important” by the fact that it was at one time added to R. This is therefore the approach we take. We also add vertices in R to any cluster for which doing so increases the metric W. The algorithm is summarized in Figure 10, and the user-specified inputs are summarized in Table 1. It is important to note that the initial procedure of removing vertices, though not explicitly attempting to optimize any single metric, does produce somewhat intuitive clusters. The cores that result are mutually disjoint and non-adjacent. Consider a connected component C at iteration i. If C has more vertices than our maximum desired core size max, we remove a set R_i of vertices, where |R_i| = t. If the removal of R_i disconnects C into two or more connected components C₁, C₂, . . . , C_k, we have decreased the diameters of C₁, C₂, . . . , C_k with respect to C, resulting in more compact connected components. If the removal of R_i does not disconnect the graph, we simply repeat the procedure on the remaining graph until it either becomes disconnected or its size is less than max. As a performance boost, the ranks may be computed initially but not recomputed after each iteration; the idea is that if the set R is being removed, the rank of a vertex v in G will be close to the rank of v in G − R.
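The iterative computation of φ_p mentioned above is standard power iteration on Eq. (9). The sketch below is one common variant; in particular, spreading the mass of dangling nodes uniformly is a convention we have chosen, not something specified in the text.

```python
def pagerank(adj, c=0.85, iters=50):
    """Power iteration for Eq. (9). adj: dict u -> list of out-neighbours.
    Dangling-node mass is redistributed uniformly (an assumption)."""
    nodes = sorted(set(adj) | {v for vs in adj.values() for v in vs})
    n = len(nodes)
    phi = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        nxt = {v: (1 - c) / n for v in nodes}   # teleport term (1 - c)/n
        for u in nodes:
            out = adj.get(u, [])
            if out:
                share = c * phi[u] / len(out)   # c * phi(u) / deg-(u)
                for v in out:
                    nxt[v] += share
            else:
                for v in nodes:                 # dangling node
                    nxt[v] += c * phi[u] / n
        phi = nxt
    return phi
```

Total mass stays 1 across iterations, and for RaRe one would remove the t highest-ranked vertices per round.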
3.3.3. The Link Aggregate Algorithm (LA)

The IS algorithm (described below) performs well at discovering communities given a good initial guess, for example when its initial “guesses” are the outputs of another clustering algorithm such as RaRe, as opposed to random edges in the communication network. We discuss a different, efficient initialization algorithm here. RaRe begins by ranking all nodes according to some criterion, such as PageRank™ [34]. Highly ranked nodes are then removed in groups until small connected components are formed (called the cluster cores). These cores are then expanded by adding each removed node to any cluster whose density is improved by adding it. While this approach is successful in discovering clusters, its main disadvantage is its inefficiency, due in part to the fact that the ranks and connected components must be recomputed each time a portion of the nodes is removed. The runtime of RaRe is significantly improved when the ranks are computed only once; for the remainder of this paper, RaRe refers to the Rank Removal algorithm with this improvement, unless otherwise stated. Since the clusters are to be refined by IS, the seed algorithm needs only to find approximate clusters; the IS algorithm will “clean up” the clusters. With this in mind, the new seed algorithm, Link Aggregate (LA), focuses on efficiency while still capturing good initial clusters. The pseudocode is given in Figure 10. The nodes are ordered according to some criterion, for example
decreasing PageRank™, and then processed sequentially according to this ordering. A node is added to any cluster if adding it improves the cluster density; if the node is not added to any cluster, it creates a new cluster. Note that every node is in at least one cluster; clusters that are too small to be relevant to the particular application can then be dropped. The runtime may be bounded in terms of the number of output clusters C as follows.

Theorem 5 The runtime of LA is O(|C||E| + |V|).

Proof: Let C_i be the set of clusters just before the i-th iteration of the loop. The time taken by the i-th iteration is O(|C_i| deg(v_i)), where deg(v_i) is the number of edges adjacent to v_i: each edge adjacent to v_i must be put into two classes for every cluster in C_i, according to whether or not the other endpoint of the edge is in the cluster. With this information, the density of the cluster with v_i added may be computed quickly (in O(1) time) and compared to the current density. If deg(v_i) = 0, the iteration takes O(1) time. Therefore the total runtime is asymptotically on the order of

Σ_{i : deg(v_i) > 0} |C_i| deg(v_i) + Σ_{i : deg(v_i) = 0} 1      (10)

≤ Σ_{i=1}^{|V|} |C| deg(v_i) + |V| = 2|C||E| + |V| = O(|C||E| + |V|).      (11)

Q.E.D.
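A sketch of the LA loop under these assumptions (undirected adjacency sets, internal edge probability as the density metric, strict improvement; names are our own):

```python
def density(e_in, size):
    """Internal edge probability of an undirected cluster."""
    return e_in / (size * (size - 1) / 2) if size > 1 else 0.0

def link_aggregate(adj, order):
    """LA sketch: process nodes in the given order (e.g. by decreasing
    PageRank); add a node to every existing cluster whose density strictly
    improves, otherwise start a new cluster.
    adj: dict v -> set of neighbours (undirected)."""
    clusters = []                          # list of (vertex_set, internal_edges)
    for v in order:
        placed = False
        for i, (members, e_in) in enumerate(clusters):
            gained = sum(1 for u in adj.get(v, ()) if u in members)
            if density(e_in + gained, len(members) + 1) > density(e_in, len(members)):
                clusters[i] = (members | {v}, e_in + gained)
                placed = True
        if not placed:
            clusters.append(({v}, 0))
    return [members for members, _ in clusters]
```

Tracking the internal edge count per cluster is what makes the density comparison an O(1) update per edge, matching the bookkeeping used in the proof of Theorem 5.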
3.3.4. Iterative Scan (IS)

Algorithm IS explicitly constructs a cluster that is a local maximum with respect to a density metric by starting at a “seed” candidate cluster and updating it, adding or deleting one vertex at a time, as long as the metric strictly improves. The algorithm stops when no further improvement can be obtained with a single change. This algorithm is given in pseudocode in Figure 10. Different local maxima can be obtained by restarting the algorithm at a different seed, or by changing the order in which vertices are examined for cluster updating. The algorithm terminates when neither the addition to C nor the deletion from C of any single vertex increases the weight. During the course of the algorithm, the cluster C follows some sequence C₁, C₂, . . . with W(C₁) < W(C₂) < · · ·, where all the inequalities are strict. Since the number of possible clusters is finite, the algorithm must terminate when started from any seed, and the output cluster is locally optimal. The cluster size may be controlled heuristically by incorporating a size criterion into the weight function, adding a penalty for clusters whose size lies outside the desired range; such an approach does not impose hard boundaries on the cluster size. If the desired range is [C_min, C_max], then a simple penalty function Pen(C) that linearly penalizes deviations from this range is

Pen(C) = max{0, h₁ · (C_min − |C|)/(C_min − 1), h₂ · (|C| − C_max)/(|V| − C_max)},      (12)

where C_min, C_max, h₁, h₂ are user-specified parameters. All the user-specified inputs are summarized in Table 2. We emphasize that algorithm IS can be used to improve any seed cluster to a locally optimal one. Instead of building clusters from random edges as a starting point, we can refine clusters output by some other algorithm; these input clusters might be good starting points, but they may not be locally optimal. IS then refines them to a set of locally optimal clusters.
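The core IS update can be sketched as follows, assuming the weight W is supplied as a Python function; restarts, the max_fail input, and the penalty of (12) are omitted from this sketch.

```python
def iterative_scan(seed, vertices, weight):
    """IS sketch: greedily add or remove single vertices while the weight
    strictly improves; returns a 1-locally optimal cluster (delta = delta_h)."""
    C = set(seed)
    improved = True
    while improved:
        improved = False
        for v in vertices:
            trial = C - {v} if v in C else C | {v}
            if trial and weight(trial) > weight(C):   # strict improvement only
                C = trial
                improved = True
    return C
```

Because each accepted change strictly increases W and the number of clusters is finite, the loop terminates, mirroring the termination argument above.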
The original process for IS consisted of iterating through the entire list of nodes over and over until the cluster density cannot be improved. In order to decrease the runtime of IS, we
J. Baumes et al. / Discovering Hidden Groups in Communication Networks
Table 2. User specified inputs to Algorithm IS.

    Parameter      Description
    W              Weight function.
    δ              Set-difference function (δ = δh in our implementation).
    ε              Size of set neighborhood (ε = 1 in our implementation).
    max_fail       Number of unsuccessful restarts to satisfy stopping condition.
    Cmin, Cmax     Desired range for cluster size.
    h1, h2         Penalty for a cluster of size 1 and |V|.
Table 3. Algorithm performance on real-world graphs. The first entry in each cell is the average value of Wad. The two entries in parentheses are the average number of clusters found and the average number of nodes per cluster. The fourth entry is the runtime of the algorithm in seconds. The e-mail graph represents emails among the RPI community on a single day (16,355 nodes). The web graph is a network representing the domain www.cs.rpi.edu/∼magdon (701 nodes). In the newsgroup graph, edges represent responses to posts on the alt.conspiracy newsgroup (4,526 nodes). The Fortune 500 graph is the network connecting companies to members of their board of directors (4,262 nodes).

    Algorithm     E-mail               Web
    RaRe → IS     1.96 (234,9); 148    6.10 (5,8); 0.14
    LA → IS2      2.94 (19,25); 305    5.41 (6,19); 0.24

    Algorithm     Newsgroup            Fortune 500
    RaRe → IS     12.39 (5,33); 213    2.30 (104,23); 4.8
    LA → IS2      17.94 (6,40); 28     2.37 (288,27); 4.4
make the following observation. The only nodes capable of increasing the cluster’s density are the members of the cluster itself (which could be removed) or members of the cluster’s immediate neighborhood, defined as those nodes connected to a node inside the cluster. Thus, rather than visiting each node on every iteration, we may skip over all nodes except those belonging to one of these two groups. If the neighborhood of a cluster is much smaller than the entire graph, this can significantly improve the runtime of the algorithm.

This modification can either decrease or increase the runtime. A decrease occurs when the cluster and its neighborhood are small compared to the number of nodes in the graph, which is the likely case in a sparse graph. In this case, building the neighborhood set N takes a relatively short time compared to the time savings of skipping nodes outside the neighborhood. An increase in runtime may occur when the cluster neighborhood is large: finding the neighborhood is expensive, and the time savings could be small since few nodes are absent from N. A large cluster in a dense graph could have this property. In this case, placing all nodes in N is preferable. Taking into account the density of the graph, we may construct N in either of the two ways described here, in order to maximize efficiency in all cases. If the graph is dense, all nodes are placed in N, but if the graph is sparse, the algorithm computes N as the cluster together with its neighborhood. In the experiments that follow, the original variant of IS that examines all nodes is denoted IS, and the neighborhood-based variant is denoted IS2.
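To make the procedure concrete, here is an illustrative Python sketch of the IS loop (our own reimplementation under stated assumptions, not the authors' code): the graph is an adjacency-list dict over integer node labels, and the weight is the average-degree metric Wad(C) = 2|E(C)|/|C| used later in the experiments.

```python
def avg_degree_density(cluster, adj):
    """Wad(C) = 2|E(C)| / |C|: twice the number of internal edges over the cluster size."""
    if not cluster:
        return 0.0
    internal = sum(1 for u in cluster for v in adj[u] if v in cluster and u < v)
    return 2.0 * internal / len(cluster)

def iterative_scan(seed, adj, weight, sparse=True):
    """Return one locally optimal cluster, following the IS procedure of Figure 10.

    adj maps each node (an int) to a set of neighbors.  With sparse=True, only
    the cluster and its immediate neighborhood are scanned on each pass;
    otherwise every node in the graph is scanned.
    """
    cluster = set(seed)
    w = weight(cluster, adj)
    increased = True
    while increased:
        if sparse:
            candidates = set(cluster)          # members may be removed...
            for u in cluster:
                candidates.update(adj[u])      # ...neighbors may be added
        else:
            candidates = set(adj)
        for v in sorted(candidates):           # fixed order for reproducibility
            trial = cluster - {v} if v in cluster else cluster | {v}
            if weight(trial, adj) > weight(cluster, adj):
                cluster = trial
        new_w = weight(cluster, adj)
        increased = new_w != w                 # stop once a full pass gains nothing
        w = new_w
    return cluster
```

On a triangle with one pendant node, a seed containing a single triangle vertex grows to exactly the triangle, since adding the pendant would lower the average degree.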
3.4. Experiments and Results

A series of experiments was run in order to compare both the runtime and the performance of the new algorithm with those of its predecessor. In all cases, a seed algorithm was run to obtain initial clusters, then a refinement algorithm was run to obtain the final clusters. The baseline was the seed algorithm
procedure RaRe(G, W)
  global R ← ∅;
  {Hi} are connected components in G;
  for all Hi do
    ClusterComponent(Hi);
  end for
  Initial clusters {Ci} are cluster cores;
  for all v ∈ R do
    for all clusters Ci do
      Add v to cluster Ci if v is adjacent to Ci or W({v} ∪ Ci) > W(Ci);
    end for
  end for
procedure ClusterComponent(H)
  if |V(H)| > max then
    {vi} are the t highest rank nodes in H;
    R ← R ∪ {vi};
    H ← H \ {vi};
    {Fi} are connected components in H;
    for all Fi do
      ClusterComponent(Fi);
    end for
  else if min ≤ |V(H)| ≤ max then
    mark H as a cluster core;
  end if
procedure LA(G, W)
  C ← ∅;
  Order the vertices v1, v2, . . ., v|V|;
  for i = 1 to |V| do
    added ← false;
    for all Dj ∈ C do
      if W(Dj ∪ {vi}) > W(Dj) then
        Dj ← Dj ∪ {vi};
        added ← true;
      end if
    end for
    if added = false then
      C ← C ∪ {{vi}};
    end if
  end for
  return C;
procedure IS(seed, G, W)
  C ← seed;
  w ← W(C);
  increased ← true;
  while increased do
    if G is sparse then
      N ← C together with all nodes adjacent to C;
    else
      N ← all nodes in G;
    end if
    for all v ∈ N do
      if v ∈ C then
        C′ ← C \ {v};
      else
        C′ ← C ∪ {v};
      end if
      if W(C′) > W(C) then
        C ← C′;
      end if
    end for
    if W(C) = w then
      increased ← false;
    else
      w ← W(C);
    end if
  end while
  return C;
Figure 10. Algorithms Rank Removal (RaRe), Link Aggregate (LA), and Iterative Scan (IS).
RaRe followed by IS. The proposed improvement consists of the seed algorithm LA followed by IS2. The algorithms were first run on a series of random graphs with average degrees 5, 10, and 15, where the number of nodes ranges from 1,000 to 45,000. In this simple model, all pairs of nodes are equally likely to communicate. All the algorithms take as input a density metric W and attempt to optimize that metric. In these experiments, the density metric was chosen as Wad, called the average degree, which is
[Figure 11 plots omitted: six panels of runtime (s) versus number of nodes (0 to 4.5 × 10^4) for graphs with 5, 10, and 15 edges per node; the left panels compare the initialization procedures (Original RaRe, RaRe, LA) and the right panels compare the refinement procedures (IS, IS2).]
Figure 11. Runtime of the previous algorithm procedures (RaRe and IS) compared to the current procedures (LA and IS2 ) with increasing edge density. On the left is a comparison of the initialization procedures RaRe and LA, where LA improves as the edge density increases. On the right is a comparison of the refinement procedures IS and IS2 . As expected, IS2 results in a decreased runtime for sparse graphs, but its benefits decrease as the number of edges becomes large.
defined for a set of nodes C as

    Wad(C) = 2|E(C)| / |C|,   (13)
where E(C) is the set of edges with both endpoints in C. The runtime for the algorithms is presented in Figure 11. The new algorithm remains quadratic, but both the seed algorithm and the refinement algorithm runtimes are improved significantly for sparse graphs. In the upper left plot in Figure 11, the original version of RaRe, which recalculates the node ranks a number of times instead of precomputing them a single time, is also plotted. LA is 35 times faster than the original RaRe algorithm, and IS2 is about twice as fast as IS for graphs with five edges per node. The plots on the right demonstrate the tradeoff in
[Figure 12 plots omitted: runtime per cluster (s) versus number of nodes for graphs with 5 edges per node; the left panel compares RaRe and LA, the right panel IS and IS2.]
Figure 12. Runtime per cluster of the previous algorithm (RaRe followed by IS) and the current algorithms (LA followed by IS2 ). These plots show the algorithms are linear for each cluster found.
[Figure 13 plots omitted: mean density versus number of nodes for RaRe → IS and LA → IS2, at 5 and 10 edges per node.]
Figure 13. Performance (average density) of the algorithm compared to the previous algorithm.
IS2 between the time spent computing the cluster neighborhood and the time saved by not needing to examine every node. It appears that the tradeoff is balanced at about 10 edges per node. For graphs that are more dense, the original IS algorithm runs faster, but for less dense graphs, IS2 is preferable. Figure 12 shows that the quadratic nature of the algorithm is based on the number of clusters found. When the runtime per cluster found is plotted, the resulting curves are linear.

Runtime is not the only consideration when examining this new algorithm. It is also important that the quality of the clustering is not hindered by these runtime improvements. Figure 13 compares the average density of the clusters found for both the old and improved algorithms. A higher average density indicates a clustering of higher quality. Especially for sparse graphs, the average density is approximately equal in the old and new algorithms, although the older algorithms do show a slightly higher quality in these random graph cases.

Another graph model more relevant to communication networks is the preferential attachment model. This model simulates a network growing in a natural way. Nodes are added one at a time, linking to other nodes in proportion to the degree of those nodes. Therefore, popular nodes get more attention (edges), which is a common phenomenon on the Web and in other real-world networks. The resulting graph has many edges concentrated on a few nodes. The algorithms were run on graphs using this model with five links per node, and the number of nodes ranging from 2,000 to 16,000. Figure 14 demonstrates a surprising change in the algorithm RaRe when run on this type of graph. RaRe removes high-ranking nodes, which correspond to the few nodes with very
[Figure 14 plots omitted: runtime (s) and mean density versus number of nodes (2,000 to 16,000) for preferential attachment graphs with 5 edges per node, comparing RaRe → IS and LA → IS2.]
Figure 14. Runtime and performance of the previous algorithm (RaRe followed by IS) and the current algorithm (LA followed by IS2 ) for preferential attachment graphs.
[Figure 15 plots omitted: runtime (s) and mean density versus number of nodes (up to 18,000), comparing LA with PageRank ordering and LA with random ordering.]
Figure 15. Runtime and performance of LA with two different ordering types.
large degree. When these nodes are added back into clusters, they tend to be adjacent to almost all clusters, and it takes a considerable amount of time to iterate through all edges to determine which connect to a given cluster. The algorithm LA, on the other hand, starts by considering high-ranking nodes before many clusters have formed, saving a large amount of time. The plot on the right of Figure 14 shows that the quality of the clusters is not compromised by using the significantly faster new algorithm LA → IS2.

Figure 15 confirms that constructing the clusters in order of a ranking such as PageRank™ yields better results than a random ordering. LA performs better in terms of both runtime and quality. This is a surprising result, since a random ordering is obtained much more quickly than the ranking process. However, the first nodes in a random ordering are not likely to be well connected. This will cause many single-node clusters to be formed in the early stages of LA. When the high-degree nodes are finally examined, there are many clusters to check to see whether adding the node will increase the cluster density, which is time consuming. If the nodes are ranked, the high-degree nodes will be examined first, when few clusters have been created. These few clusters are likely to attract many nodes without a number of new clusters being started, resulting in the algorithm completing more quickly.

The algorithms were also tested on real-world data. The results are shown in Table 3. For all cases other than the web graph, the new algorithm produced a clustering of higher quality.
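The ranking step itself can be sketched in a couple of lines; here node degree stands in for the PageRank scores used in the experiments (an assumption made for brevity):

```python
def rank_order(adj):
    """Order nodes for LA, highest rank first; node degree is used here as a
    cheap stand-in for the PageRank ordering used in the experiments."""
    return sorted(adj, key=lambda v: len(adj[v]), reverse=True)
```

Visiting the well-connected nodes first lets the few early clusters absorb them before a long tail of singleton clusters can form, which is exactly the behavior described above.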
4. Conclusion

In this article, we described methods for discovering hidden groups based only on communication data, without the use of communication contents. The algorithms rely on the fact that such groups display correlations in their communication patterns (temporal or spatial). We refer to such groups as hidden because they have not declared themselves as a social entity. Because our algorithms detect hidden groups without analyzing the contents of the messages, they can be viewed as an additional, separate toolkit, different from approaches that are based on interpreting the meaning of the messages. Our algorithms extract structure in the communication network formed by the log of messages; the output groups can further be studied in more detail by an analyst who might take into account the particular form and content of each communication, to get a better overall result. The main advantage is that our algorithms greatly reduce the search space of groups that the analyst will have to look at. The spatial and temporal correlation algorithms target different types of hidden groups. The temporal hidden group algorithms identify those groups which communicate periodically and are engaged in planning an activity. Our algorithms have been shown to be effective at correctly identifying hidden groups artificially embedded into the background of random communications. Experiments show that as the background communications become more dense, it takes longer to discover the hidden group. A phase transition occurs if the background gets too dense, and the hidden group becomes impossible to discover. However, as the hidden group becomes more structured, the group is easier to detect. In particular, if a hidden group is secretive (non-trusting), and communicates key information only among its members, then the group is actually more readily detectable.
Our approach to the discovery of spatial correlation in communications data is based on the observation that social groups often overlap. This fact rules out the traditional techniques of partitioning and calls for novel procedures for clustering actors into overlapping groups. The families of clustering algorithms described here are able to discover a wide variety of group types based on the clustering metric provided. The algorithms have been shown to be both efficient and accurate at discovering clusters and retaining meaningful overlap between clusters.
Acknowledgments

The research presented here was partially supported by NSF grants 0324947 and 0346341.
Part Three
Knowledge Discovery in Texts and Images
Security Informatics and Terrorism: Patrolling the Web C.S. Gal et al. (Eds.) IOS Press, 2008 © 2008 IOS Press. All rights reserved.
Authorship Attribution in Law Enforcement Scenarios1 Moshe KOPPEL2, Jonathan SCHLER and Eran MESSERI Department of Computer Science, Bar-Ilan University, Israel
Abstract. Typical authorship attribution methods are based on the assumption that we have a small closed set of candidate authors. In law enforcement scenarios, this assumption is often violated. There might be no closed set of suspects at all or there might be a closed set containing thousands of suspects. We show how even under such circumstances, we can make useful claims about authorship. Keywords. Authorship attribution, text categorization, machine learning, law enforcement.
Introduction

We are going to discuss authorship attribution in law enforcement scenarios. The problem is quite straightforward. You get an anonymous text, presumably from some assailant, if it’s a law enforcement situation, and you want to know as much as you possibly can about the writer of the text. Ideally, what you’d like to know is whether a particular suspect wrote the text. That is sometimes hard to do. If we can’t do that, we’d like to at least be able to say something about the guy. Or it might not be a guy. So we’d like to know: is it a man or a woman? How old is the person? What is his or her native language? Can we say anything about their personality? The question is how much can we know, just given a little piece of text (no handwriting, it’s electronic text). The answer is, we can know a lot more than you might think. Let’s first consider the vanilla authorship attribution problem, the kind of problem definition that you give if you are a researcher in Computer Science, who doesn’t care much about the real world but wants to have a well-defined problem that submits well to the kind of tools that you’ve got. In the vanilla problem you’ve got a small closed set of candidate authors for each of whom you’ve got lots of texts. And you want to be able to take some new anonymous text and say which one of your handful of authors – in the ideal situation, two authors – it is. Was Edward the Third written by Shakespeare or Marlowe? That’s the perfect authorship attribution problem for a researcher.
1 This paper is a lightly edited transcript of a talk given by Moshe Koppel at the June 2007 NATO meeting in Beer-Sheva on security informatics and terrorism. Thanks to Bracha Shapira and Paul Kantor for organizing the meeting, to our collaborator, Shlomo Argamon, and to Cecilia Gal for preparing the transcript of the talk.
2 Corresponding Author: Moshe Koppel, Dept. of Computer Science, Bar-Ilan University, Ramat Gan, Israel; E-mail:
[email protected].
M. Koppel et al. / Authorship Attribution in Law Enforcement Scenarios
But the vanilla problem is not especially hard nor does it often occur in real life. In law enforcement you will typically have no suspects. You just have a text and you have no clue who wrote it. In that case the question is, can we profile the author? What can we say about the person who wrote this text? Alternatively, you might have thousands of suspects, and then the question is, can we find the needle in the haystack? Ideally, you want to say, of the tens of thousands of people who might have written this, exactly which one it is. Incredibly enough you can do this a lot of the time, as we’ll see. And the key thing is that the texts might be very short. Unless the assailant is the Unabomber, he doesn’t send a 50,000 word tract for analysis. He’s more likely to send some short little note that says, “give me the money or we’ll shoot you”. So, the question is, given a short note, how much can we say. (Well, that example was a little bit too short, but we will see that even a couple of hundred words is quite useful.)
1. Solving the Vanilla Attribution Problem

Let’s first discuss how we solve the vanilla problem with a small number of authors. The general picture is shown in Figure 1. Suppose you’ve got texts by A and texts by B. First, you clean them up, removing whatever junk is totally inappropriate. Then you translate them into numerical vectors that capture measures (say, frequency) of features that you think might be relevant to this problem. Now that you have two sets of vectors, some of type A and some of type B, you use your favorite learning algorithm to build a classifier that distinguishes A vectors from B vectors. Once you’ve got your classifier, you put new texts into it for attribution. That’s pretty much how the game works.
Figure 1. The categorization process
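A toy version of this pipeline (our own illustration, not the authors' system: a handful of function words as the feature set, and a nearest-centroid rule standing in for the learning algorithms discussed later) might look like:

```python
import math
from collections import Counter

# A tiny illustrative function-word list; real systems use hundreds of features.
FUNCTION_WORDS = ["and", "it", "of", "if", "the", "a", "i", "me", "my", "in", "to"]

def vectorize(text):
    """Relative frequency of each function word in the text."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    n = max(len(tokens), 1)
    return [counts[w] / n for w in FUNCTION_WORDS]

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def attribute(unknown, texts_a, texts_b):
    """Attribute `unknown` to author A or B by distance to each author's centroid."""
    ca = centroid([vectorize(t) for t in texts_a])
    cb = centroid([vectorize(t) for t in texts_b])
    vu = vectorize(unknown)
    return "A" if math.dist(vu, ca) < math.dist(vu, cb) else "B"
```

An anonymous text whose function-word profile leans toward one author's training texts is assigned to that author; the real experiments replace the centroid rule with SVM or Winnow-style learners.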
The most important question is what kind of features we should use. One of the dirty secrets of machine learning is that, although researchers generally spend more
time talking about better learning algorithms than about what features to use, in real life, it doesn’t really matter all that much which learning algorithm you use. The good ones all give pretty much the same results. What really matters is what feature sets you use; if you do some good feature engineering, you can often improve the results quite a bit. What kind of features do you want for authorship attribution? Ideally, you want the kind of features that stay constant within a given author but vary between authors. And what kind of features might those be? Well, they are not topic words, because a given author is going to use different topic words depending on whether he’s writing about a given topic or not. So, it’s got to be stylistic information. The main stylistic features are the following:

• Function words: The ancestor of all stylistic features for authorship analysis are the function words, those little words like “and” and “it” and “of” and “if” and “the” that don’t really tie in very strongly to content. Those were used by Mosteller and Wallace back in 1964 in their work on the Federalist Papers [1], which is the seminal work in this area.

• Syntax: Different authors use syntax differently, so we might try to pick up on syntactic habits of individual authors. Parsing is slow and unreliable. A better approach is to consider the frequency with which authors use particular sequences of parts-of-speech (POS); that information indirectly gives you hints to syntax.

• Systemic Functional Linguistics (SFL) trees: A more general approach that gives us elements of both of the above feature types is to use systemic functional linguistics features. Essentially, those are glorified parts-of-speech but, as can be seen in Figure 2, instead of very general parts of speech, like conjunctions, we’ll talk about specific kinds of conjunctions, and then get even more fine-grained than that, until finally at the bottom of this tree we have actual function words. This subsumes both function words, which are down at the leaves, and parts-of-speech, which are up at the roots.
Conjunctions
  ConjExtension: and, or, but, yet, however, …
  ConjElaboration: for_example, indeed, …
  ConjEnhancement
    ConjSpatiotemporal: then, beforehand, afterwards, while, during, …
    ConjCausalConditional: if, because, since, hence, thus, therefore, …
Figure 2. SFL tree sample
• Morphology: The frequency of use of various grammatical suffixes and prefixes can sometimes be a useful clue for authorship. In English, these tend not to be very useful because there are just not that many of them, but in languages like Hebrew or Arabic, morphology is crucial. In such languages, many of the function words that we use in English don’t exist as separate words and only show up in the morphology.

• Complexity measures: Historically, the first features explored as possible markers of authorial style were complexity measures such as average word length, average sentence length, all kinds of entropy measures, type/token ratio, hapax legomena and so on [2].

• Idiosyncrasies: Researchers like Donald Foster, who identified the anonymous author of the roman-à-clef Primary Colors as Joe Klein, rely mostly on authorial idiosyncrasies, including neologisms, exotic syntax and word construction and so forth.
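The classic complexity measures mentioned above are straightforward to compute; here is a rough sketch (with naive punctuation-based tokenization, for illustration only):

```python
def complexity_measures(text):
    """Classic complexity markers of authorial style: average word length,
    average sentence length (in words), and type/token ratio."""
    # crude sentence split on terminal punctuation
    sentences = [s for s in text.replace("!", ".").replace("?", ".").split(".") if s.strip()]
    # crude word tokenization: strip surrounding punctuation, lowercase
    tokens = [w.strip(".,!?;:").lower() for w in text.split() if w.strip(".,!?;:")]
    return {
        "avg_word_len": sum(len(t) for t in tokens) / len(tokens),
        "avg_sent_len": len(tokens) / max(len(sentences), 1),
        "type_token_ratio": len(set(tokens)) / len(tokens),
    }
```

Real systems would add entropy measures and hapax legomena counts on top of these, but the three statistics above already give a per-author stylistic fingerprint of the kind the early literature used.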
Once we have built vectors based on the frequency with which features of these sorts are used, we use some learning algorithm to distinguish between authors. We have run many vanilla authorship experiments using a variety of learning algorithms, including linear SVM [3], which is almost a de facto standard, Bayesian regression (using software kindly made available by the Rutgers group [4]) and real-valued balanced Winnow [5,6], a kind of exponential gradient algorithm. And they all work. If you need to decide between two candidate authors and you’ve got a reasonable amount of known text for each author, I can pretty much guarantee you can attribute a not-too-short anonymous text with accuracy well above 90%. The amount of text I’m discussing is not especially large: maybe a few tens of documents of 500 words and up. So, the vanilla attribution problem is definitely solvable and, in fact, function words and single parts of speech are generally enough. Systemic functional linguistics trees by themselves are enough, because they subsume the previous two. Idiosyncrasies are the best thing to use if you’ve got unedited text (like email); obviously, if you’re dealing with edited documents, they are useless. And morphology is useful for particular languages, such as Hebrew, which is rich in morphology.
2. Profiling

All the above was just to provide background. What about real life? In real life, you might have a text written by some anonymous assailant, but without any specific suspects at all. In such cases you’d be satisfied to extract some general information about the gender, age, native language and personality of the author. So let’s consider the problems of gender, which is binary (this point is apparently subject to debate, but not for now), and age, which we divide into three categories: teenagers, people in their twenties, and people in their thirties or above. As luck would have it, for the purpose of running systematic experiments, we needed people to write electronic texts about anything they wanted and to also tell us their gender and their age, and we got about 100 million volunteers. They call themselves bloggers. We took tens of thousands of blogs labeled for gender and age and randomly threw out enough of them so that we had the same number of male and female writers in each age group. (You may be interested to know that, as of several years ago, a large majority of
bloggers below the age of 18 were females, while a large majority of those above 22 were males. The numbers may have changed since then.) We also ran experiments on native speakers of five different languages (Russian, Czech, Bulgarian, Spanish and French) writing in English [10]; our objective was to determine an author’s native language. For this we used the International Corpus of Learner English [7] and selected 258 essays from speakers of each of the five languages. For each of the experiments, we used two feature sets, SFL trees for capturing style and frequent non-function words for content [11,12]. We used Bayesian regression as our learning algorithm and ran ten-fold cross-validation experiments to estimate accuracy of attribution on out-of-sample documents. The results for each experiment are shown in Figure 3.
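The ten-fold protocol mentioned above can be sketched as follows (the fold assignment and the classifier callbacks are our own illustrative stand-ins, not the authors' setup):

```python
def ten_fold_accuracy(examples, labels, train_fn, predict_fn, k=10):
    """Estimate out-of-sample accuracy by k-fold cross-validation:
    train on k-1 folds, test on the held-out fold, average over folds."""
    correct = 0
    for fold in range(k):
        # every k-th example (offset by fold) is held out for testing
        test_idx = set(range(fold, len(examples), k))
        train = [(x, y) for i, (x, y) in enumerate(zip(examples, labels))
                 if i not in test_idx]
        model = train_fn(train)
        correct += sum(1 for i in test_idx
                       if predict_fn(model, examples[i]) == labels[i])
    return correct / len(examples)
```

Any learner can be plugged in through `train_fn`/`predict_fn`; the point is that every document is scored exactly once while held out of training.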
Figure 3. Experimental results: classification accuracy for gender (2 classes), age (3 classes), and native language (5 classes), using SFL trees alone, frequent content words alone, and the two feature sets combined.
As can be seen, for the gender problem, which has a natural baseline of 50%, we obtain accuracy of 76% using both feature sets. In fact, when we also use blog jargon as features – “LOL” and “ROTFL” and all that – we get accuracy over 80%. The features that are used most differently by males and females (as measured using information gain [8]) are shown in Table 1. Note that among style features, personal pronouns are used more by females, particularly the words “I, me, him” and “my”, while males make more frequent use of the determiners “a” and “the” and certain kinds of prepositions. In fact, there are hundreds of features that are used very differently by males and females. This is true in a variety of genres: blogs, fiction, and non-fiction, including academic articles in professional journals. The numbers vary but the differences between males and females are consistent in all of them. In all those genres,
Table 1. Most distinguishing features (information-gain) for gender

Class    SFL Features                           Content Features
Female   personal pronoun, I, me, him, my       cute, love, boyfriend, mom, feel
Male     determiner, the, preposition, of, as   system, software, game, based, site
females use more pronouns and males use more determiners. (As for content features that are used most differently by male and female bloggers, there must be a message in there somewhere but it’s not likely to help us in typical law enforcement scenarios.) As can be seen, for the age problem, which has a natural baseline of 42.7% (the size of the largest of the three classes), we obtain accuracy of 77.7% using both feature sets. Unsurprisingly, when we also use blog jargon as features, we get accuracy over 80%. The content features that are used most differently by bloggers of different ages are shown in Table 2. It is amusing to imagine this as representing the changing interests through the lifespan of the bloggers in our sample. As teens, they are concerned with matters that are either “boring” or “awesome”, in their 20’s they are mostly concerned with “college”, “bar”, “apartment”, and “dating”, and eventually are preoccupied with running the whole world, which is apparently neither boring nor awesome. As can be seen, for native language, which has a natural baseline of 20%, we get 65% accuracy using SFL features alone. (The results using content are uninteresting for this experiment since differences between the groups are almost certainly artifacts of the experimental setup.) Interestingly, using only SFL features and a variety of idiosyncrasy-based features, we get accuracy above 80%. The best features for distinguishing native speakers of each language are shown in Figure 4.
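The information-gain criterion used to rank the features in Tables 1 and 2 can be sketched as follows. This is a generic illustration (binary feature presence, two classes), not the authors' implementation, and the toy corpus is hypothetical:

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(has_feature, labels):
    """Reduction in class entropy from splitting the documents on
    whether they contain the feature -- the ranking criterion [8]."""
    with_f = [y for f, y in zip(has_feature, labels) if f]
    without_f = [y for f, y in zip(has_feature, labels) if not f]
    n = len(labels)
    split_entropy = (len(with_f) / n) * entropy(with_f) \
                  + (len(without_f) / n) * entropy(without_f)
    return entropy(labels) - split_entropy

# Toy corpus: a feature used by only one class is maximally informative;
# a feature used evenly by both classes is worthless.
labels = ["female"] * 4 + ["male"] * 4
uses_cute = [1, 1, 1, 1, 0, 0, 0, 0]
uses_the = [1, 0, 1, 0, 1, 0, 1, 0]
```

Ranking every candidate feature by this quantity and keeping the top of the list yields tables like Tables 1 and 2.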
Russian – over, the (infrequent), number_reladverb
French – indeed, Mr (no period), misused o (e.g., outhor)
Spanish – c-q confusion (e.g., cuality), m-n confusion (e.g., confortable), undoubled consonant (e.g., comit)
Bulgarian – most_ADVERB, cannot (uncontracted)
Czech – doubled consonant (e.g., remmit)

Figure 4. Best features for distinguishing native speakers
Table 2. Most distinguishing features (information-gain) for different ages. Numbers are uses per 1000 words.

feature        10s    20s    30s
bored          3.84   1.11   0.47
boring         3.69   1.02   0.63
awesome        2.92   1.28   0.57
mad            2.16   0.80   0.53
homework       1.37   0.18   0.15
mum            1.25   0.41   0.23
maths          1.05   0.03   0.02
dumb           0.89   0.45   0.22
sis            0.74   0.26   0.10
crappy         0.46   0.28   0.11
college        1.51   1.92   1.31
bar            0.45   1.53   1.11
apartment      0.18   1.23   0.55
beer           0.32   1.15   0.70
student        0.65   0.98   0.61
drunk          0.77   0.88   0.41
album          0.64   0.84   0.56
dating         0.31   0.52   0.37
semester       0.22   0.44   0.18
someday        0.35   0.40   0.28
son            0.51   0.92   2.37
local          0.38   1.18   1.85
marriage       0.27   0.83   1.41
development    0.16   0.50   0.82
tax            0.14   0.38   0.72
campaign       0.14   0.38   0.70
provide        0.15   0.54   0.69
democratic     0.13   0.29   0.59
systems        0.12   0.36   0.55
workers        0.10   0.35   0.46
3. Finding a Needle in a Haystack

Suppose we’ve got 10,000 authors and someone gives us one text and we’ve got to say who wrote it. We began with 10,000 blogs and removed from each one the last post or, if it was too short, enough posts to add up to 500 words [9]. Let’s call the removed texts “snippets”. Now, given a random snippet, we need to decide which of these 10,000 bloggers wrote it. (We don’t have any hints in terms of formatting; we only have the actual text, without even distinguishing quoted text from integral text.)
We begin with a naive information retrieval approach. Let’s just assign the snippet to whichever blog looks most similar, using standard information retrieval measures of similarity: cosine of tf.idf representations. This method does not work very well. Using three different versions of the tf.idf representation (style features only, content features with raw term frequency, content features with binary term appearance), we find that the best of them only assigns the snippet to the actual author 36% of the time. But, here is what we can do. In a typical law enforcement scenario, we can decide we are not allowed to be wrong, but we are allowed to say, “I don’t know”. We can say that a given snippet just doesn’t have enough information in it for us to say anything. But if we say “we know”, we had better get the right answer, generally speaking. So we use a meta-learning technique: we consider how strong the similarity is to the top ranked author and how far back the second ranked author is, using each of our similarity measures. Without belaboring the details of the learning techniques and how they are applied, it’s enough to say that if there is one stick-out author who is much likelier than the others to be the actual author, we gamble on that author. Otherwise, we just throw up our hands. Clearly, the more risk-averse we are, the lower the recall we achieve but the higher the precision. The full recall-precision curve is shown in Figure 5 (upper curve). Note that, for example, we can achieve recall of 40% with precision of 84%. But if we can settle for recall of 30%, we can get precision of 90%. To make this more concrete, if we have 10,000 candidate authors and 10,000 snippets to attribute, and we venture a guess for 4,762 of these snippets, we’ll be right for 4,000 of them.
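The similarity-plus-abstention scheme just described can be sketched as follows. This is a simplified illustration: a fixed gap test between the top two candidates stands in for the full meta-learning step, and the toy two-author corpus is hypothetical:

```python
from math import log, sqrt

def tfidf_vectors(author_texts):
    """Bag-of-words tf.idf vectors, one per candidate author."""
    n = len(author_texts)
    df = {}
    for counts in author_texts.values():
        for term in counts:
            df[term] = df.get(term, 0) + 1
    idf = {t: log(n / d) for t, d in df.items()}
    vectors = {a: {t: c * idf[t] for t, c in counts.items()}
               for a, counts in author_texts.items()}
    return vectors, idf

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = sqrt(sum(w * w for w in u.values()))
    nv = sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def attribute(snippet_counts, vectors, idf, min_gap=0.1):
    """Assign the snippet to the most similar author, but abstain
    (return None, i.e. "I don't know") unless one author sticks out
    from the runner-up by at least min_gap."""
    snippet = {t: c * idf.get(t, 0.0) for t, c in snippet_counts.items()}
    ranked = sorted(((cosine(snippet, v), a) for a, v in vectors.items()),
                    reverse=True)
    (s1, best), (s2, _) = ranked[0], ranked[1]
    return best if s1 - s2 >= min_gap else None

# Hypothetical toy corpus of two candidate authors.
authors = {"alice": {"cats": 5, "the": 3},
           "bob": {"politics": 5, "the": 3}}
vecs, idf = tfidf_vectors(authors)
guess = attribute({"cats": 3, "the": 2}, vecs, idf)           # one stick-out author
abstained = attribute({"cats": 1, "politics": 1}, vecs, idf)  # too close to call
```

Raising `min_gap` makes the attribution more risk-averse: lower recall, higher precision, exactly the trade-off traced by the curves in Figure 5.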
Figure 5. Recall/precision curves in attributing a snippet to one of 10,000 authors. Curves are for snippets of length 600 (upper) and 200 (lower).
Now you might wonder how much text we really need. For these experiments, we used snippets of length between 500 and 600 words. We ran the identical experiment but with the snippets limited to exactly 200 words. As can be seen in Figure 5 (lower curve), at a recall level of 30%, we achieve precision of 82%. So, the results do degrade for very short texts, but they are still quite useful even at very realistic document lengths. To conclude then, we can use these techniques in order to profile an anonymous author. We can tell you with some reasonable degree of accuracy the author’s age, gender, and native language. (We didn’t discuss personality, but in fact we can tell if a writer is neurotic or not with the same accuracy as the degree of psychologist agreement on neurosis.) And even with 10,000 candidates, in a fair number of cases, we can confidently and correctly identify the author.
References
[1] F. Mosteller and D.L. Wallace, Inference and Disputed Authorship: The Federalist, Addison-Wesley, Reading, MA, 1964.
[2] G.U. Yule, On sentence length as a statistical characteristic of style in prose, with application to two cases of disputed authorship, Biometrika 30 (1938), 363-390.
[3] T. Joachims, Text categorization with support vector machines: learning with many relevant features, in: Proc. 10th European Conference on Machine Learning (ECML-98), 1998, 137-142.
[4] A. Genkin, D.D. Lewis and D. Madigan, Large-scale Bayesian logistic regression for text categorization, Technometrics (to appear), 2006.
[5] I. Dagan, Y. Karov and D. Roth, Mistake-driven learning in text categorization, in: Proc. EMNLP-97: 2nd Conference on Empirical Methods in Natural Language Processing, 1997, 55-63.
[6] N. Littlestone, Learning quickly when irrelevant attributes abound: a new linear-threshold algorithm, Machine Learning 2(4) (1987), 285-318.
[7] S. Granger, E. Dagneaux and F. Meunier, The International Corpus of Learner English: Handbook and CD-ROM, Presses Universitaires de Louvain, Louvain-la-Neuve, 2002.
[8] T. Mitchell, Machine Learning, McGraw-Hill, New York, 1997.
[9] M. Koppel, J. Schler, S. Argamon and E. Messeri, Authorship attribution with thousands of candidate authors, in: Proc. 29th Annual International ACM SIGIR Conference on Research & Development in Information Retrieval, August 2006.
[10] M. Koppel, J. Schler and K. Zigdon, Determining an author's native language by mining a text for errors, in: Proceedings of KDD 2005, Chicago, IL, August 2005.
[11] S. Argamon, M. Koppel, J.W. Pennebaker and J. Schler, Mining the blogosphere: age, gender and the varieties of self-expression, First Monday 12(9), September 2007.
[12] M. Koppel, J. Schler, S. Argamon and J. Pennebaker, Profiling the author of an anonymous text, to appear in Communications of the ACM.
Security Informatics and Terrorism: Patrolling the Web C.S. Gal et al. (Eds.) IOS Press, 2008 © 2008 IOS Press. All rights reserved.
Bayesian Models, Prior Knowledge, and Data Fusion for Monitoring Messages and Identifying Actors Paul B. KANTOR1 School of Communication, Information and Library Studies, Rutgers University
Abstract. Research on modeling the identification of materials related to a given topic, person or country is reported. The models use Bayesian analysis, and a sparsity-inducing prior distribution of the chance that a term will be useful. The result is concise computer-generated models, which can be understood, and improved, by human users of the tools. Examples are given based on technical literature, and on materials of interest to intelligence and policy analysts. The methods are particularly effective in learning to recognize materials pertinent to a specific topic, from very small sets of learning materials, provided that general background information (such as might be found in a text-book or encyclopedia) can be used to set the prior probability that a term will be used in the machine learning model. Keywords. Automatic content identification, entity resolution, Machine Learning.
Introduction

I want to thank Ben-Gurion University for the wonderful setting in which we are meeting. This will be some rather technical work, and this is a cross-disciplinary conference. Even though the title looks a little bit technical, there will be only a few equations in the slides. We will be looking into the question of whether one can identify authors by their writings, as those writings might appear on websites or on blogs. We are particularly interested in the case of persons who might wish to conceal their identities under multiple shifting aliases, but may reveal themselves by their stylistic quirks and turns of phrase.
1. Bayesian Statistics

The approach invented by the Reverend Thomas Bayes, Bayesian Analysis, is a somewhat controversial way of thinking about statistics. It recognizes the fact that
1 Corresponding Author: Paul B. Kantor, SCILS - Rutgers University, The State University of New Jersey, 4 Huntington Street, New Brunswick, NJ 08901; Email: [email protected]
P.B. Kantor / Bayesian Models, Prior Knowledge, and Data Fusion for Monitoring Messages
121
when we encounter clues or evidence positively linked to some possibility “of interest”, it should increase our suspicion, or our confidence, that the thing of interest is a reality. The possibility may be the presence of a disease, an attack on a border outpost, a change in political climate, or the radicalization of a young blogger. Bayes pointed out that we cannot actually compute how likely that “possibility” is unless we have some belief, before we begin, about how likely it is to occur. In the specific application we discuss today, the kinds of “possibility” that concern us might be “is person A the author of message M?” or “were messages M and N written by the same author?” There is another approach to statistics, called the frequentist approach, which emphasizes that observations are the result of randomness, and that we do not know anything about that randomness. I was educated in that point of view, which might be called an “orthodox” view of statistics. But that approach cannot lead us to estimates of probabilities, and such estimates are needed (in the expected-value theory of decision-making) to compute the best course of action. If I may joke about it, I could say that “I pray as a frequentist, but I have to shop at the Bayes supermarket in order to put dinner on the table”. The basic principle of Bayesian analysis is this: if some observable feature f is associated with, or characteristic of, some condition of interest, then the odds in favor of that condition, when we see the feature f, are improved. The precise formula requires that we consider the probability that f would happen if the condition were true, divided by the probability that it would happen if it were not true, and this whole factor multiplies whatever we believed about the odds in favor of the event beforehand. We speak about this in terms of the “prior odds” and the “posterior odds”, which are the odds adjusted to exploit the new evidence, as illustrated in Equation 1:
Bayes’ Updating Rule (Equation 1):

    OddsFavoring(C | f) = [ Pr(f | C) / Pr(f | not-C) ] × OriginalOddsFavoring(C)
The odds favoring a possibility C are increased if the feature we have seen, f, is more strongly associated with the possibility C than with its absence. The expression Pr(f | C) means “the probability that a particular observable feature f will occur if, in fact, the condition C holds”. Thus what we want to know, the odds that an event will occur, is translated into a calculation depending on how likely the event is to produce the features that we can see. Thus we spend effort trying to calculate things such as the probability that, for example, a terrorist would mention the words “Christmas dinner” in a message.
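Equation 1 translates directly into code. A minimal sketch, with hypothetical numbers:

```python
def update_odds(prior_odds, p_f_given_c, p_f_given_not_c):
    """Bayes' updating rule (Equation 1): multiply the prior odds in
    favor of condition C by the likelihood ratio of the observed
    feature f."""
    return (p_f_given_c / p_f_given_not_c) * prior_odds

# Hypothetical numbers: a feature ten times likelier under C than in
# its absence turns prior odds of 1:100 into posterior odds of 1:10.
posterior = update_odds(prior_odds=0.01, p_f_given_c=0.5, p_f_given_not_c=0.05)
probability = posterior / (1.0 + posterior)  # odds converted back to a probability
```

Seeing several (conditionally independent) features just repeats the update, multiplying in one likelihood ratio per feature.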
2. Application to Message Streams: Theory

In the research at DIMACS, the Center for Discrete Mathematics and Theoretical Computer Science, at Rutgers, we actually deal with the texts of the messages themselves. We
seek to make models of the texts written by many people, which, taken all together, can contain hundreds of thousands of words. Because there are so many possible words, it is easy for the computer to build a model that is much too big for a person to comprehend. That weakens the usefulness of the model, since its users cannot check the “reality” or “sensibleness” of it. That is where the Bayesian methods come into play. They provide a specific technique for decreasing the complexity of the model. When we (that is, our algorithms) begin the analysis, we do not know how important any particular word will be, in the final model. This is represented by saying that there is a “prior distribution of the possible importance of the word, in the model”. The importance might be zero, it might be positive (meaning that the word is evidence in favor of whatever we are looking for) or it might be negative (meaning that it is evidence against whatever we are looking for). Typically, absent any evidence, we have no expectation that the weight is either positive or negative. Specifically, we imagine, before we start considering evidence, that the probability that any particular word is going to be important is controlled by some distribution as shown in Figure 1. [This figure shows the distribution for two particular words, one on each axis.]
Figure 1. Possible prior distributions. Although there are thousands of terms, we show the distribution of the weights for only two of them. The distribution on the left, because it is smooth at the maximum, ensures that all words will end up in the model. The distribution on the right is far more restrictive.
The peak at the middle of the distributions in Figure 1 means that the most probable weight, for any word, is zero. This would mean that the specific word does not matter at all. That model is given by the famous Gaussian distribution, or “bell-shaped” curve, whose shape is familiar. But there is an alternative model for this peak, which is called the Laplace model. The Laplace model is different because it has a very sharp point right in the middle, something like a mail spike. The mathematical effect of this sharp peak [for technical details see the papers (1-4)] is that it tends to keep the weight or importance of any particular word “stuck at zero”. This means that one really has to have a lot of evidence that the word matters before it is included in the model.
Underlying this use of the Laplacian model, there is a single parameter that one can adjust. This parameter corresponds to the “sharpness” of the mail spike. As it is varied, more or fewer terms will appear to have important positive weight or important negative weight in the final model. On the other hand, in the Gaussian model, as soon as one begins the analysis, everything is involved. This, of course, makes it very hard to explain the model to the user of the system.
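The contrast between the two priors can be illustrated by the one-dimensional MAP estimates they induce. This is an illustrative simplification, not the actual fitting procedure; `lam` below plays the role of the spike's sharpness parameter, and `w_ml` is the weight the data alone would suggest:

```python
def map_weight_laplace(w_ml, lam):
    """MAP weight under the Laplace ("mail spike") prior: minimize
    (w - w_ml)**2 / 2 + lam * |w|.  The spike at zero soft-thresholds
    the weight: weak evidence leaves the term out of the model entirely."""
    if w_ml > lam:
        return w_ml - lam
    if w_ml < -lam:
        return w_ml + lam
    return 0.0

def map_weight_gaussian(w_ml, lam):
    """MAP weight under the Gaussian prior: minimize
    (w - w_ml)**2 / 2 + lam * w**2.  The weight shrinks toward zero
    but never reaches it, so every term stays in the model."""
    return w_ml / (1.0 + 2.0 * lam)

weak_evidence = 0.3    # hypothetical maximum-likelihood weight for a word
strong_evidence = 2.0
lam = 0.5              # sharpness of the spike
```

With these numbers, the Laplace prior zeroes the weakly supported word out of the model while keeping the strongly supported one, whereas the Gaussian prior merely shrinks both, which is exactly the behavior traced in Figure 2.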
Figure 2. The weights of various terms as the parameter controlling entry of terms into the model is adjusted. Terms enter the model one at a time as the controlling parameter, lambda, is varied. [Plot: “Posterior Modes with Varying Hyperparameter - Laplace”; posterior mode (vertical axis) plotted against lambda (horizontal axis, from 120 down to 0) for the terms glu, bp, bmi/100, ped, npreg, age/100, skin, and intercept.]
For this reason, we prefer the Laplacian prior, that is, the one with the spike. With it, new features are added very slowly as one adjusts the crucial parameter. Thus it tends to produce models that are relatively small. When this happens, it is possible for the all-important human analyst to be “in the loop”, reviewing and improving the model. We believe that this is an essential point in any monitoring of traffic. When the models have relatively few terms (technically, when they are “Sparse Models”) then a human can examine the model and say “this is [or is not] a reasonable set of words to describe what I am looking for”.
Figure 3. This flow chart shows that the model is applied to stored materials, to identify those that fit the model. The model has been trained to flag interesting messages (Shown here in darker shading).
3. Application in Practice
Figure 4. SIDEBAR: Dealing with Uncertainty. For some feature of interest, the relative frequency is shifted slightly to the right for “positive cases” (that is, interesting messages). For a particular choice of the cutoff value of the feature, shown by the light vertical bar, 0.616 of the positives will be flagged. But 0.487 of the negatives will also be flagged, and they count as “False Positives Found (FPF)”. The second row of the small cross-table shows the cases that are “not flagged”. As the vertical bar is moved, the numbers in the upper row of the table trace out the curve called the ROC curve.
To develop these models, the algorithms must start with training materials, which are analyzed to develop a model. Specifically, in the training stage (where the truth is known), the algorithm selects those terms which are most effective in “predicting” whether or not a message will be of interest to an analyst. In application, a second component of the algorithm is applied to new materials, to identify those things that appear to be suspicious or worth passing on to a human analyst. It is very important to recognize that no method is going to be 100% accurate. The complexity of the situation is shown in the Sidebar: Dealing with Uncertainty. The area under the ROC curve provides a numerical measure of the accuracy of any classification. It ranges from 50% (no better than random guessing) to 100% (perfect classification). In use, the algorithms compute a “suspiciousness score”. In the example in Figure 4, the truly suspicious items fall under the right-hand bell-shaped curve, and the harmless fall under the left-hand curve. Because the distributions overlap, there is a (continuous) range of choices for the threshold or cutoff. For any particular choice of the threshold (as indicated by the vertical line) there will be some fraction of “truly suspicious items” flagged and some number of them that are “missed”. At the same time, there will be some fraction of the harmless items that are “not flagged”, and others which are flagged, in a “false alarm”. Finally, the information can be summarized in a curve, called the ROC curve, which shows how the fraction detected rises as a function of the false alarm rate. The area under this curve (AUC) is a convenient summary of the power of the system. It can also be shown that the AUC is equal to the “average distance, from the bottom of the list, of the items we’d like to see at the top”.
If the proposed method actually has no power at all (the so-called “null hypothesis”), this area is distributed as the average of N samples drawn from a uniform distribution on the unit interval.
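The AUC itself is simple to compute from a set of scores, using the pairwise (Mann-Whitney) form that underlies the “average distance from the bottom of the list” interpretation. A sketch, with hypothetical scores:

```python
def auc(suspicious_scores, harmless_scores):
    """Area under the ROC curve in its pairwise (Mann-Whitney) form:
    the fraction of (suspicious, harmless) pairs that the score orders
    correctly, with ties counting one half."""
    wins = 0.0
    for s in suspicious_scores:
        for h in harmless_scores:
            if s > h:
                wins += 1.0
            elif s == h:
                wins += 0.5
    return wins / (len(suspicious_scores) * len(harmless_scores))

perfect = auc([0.9, 0.8], [0.2, 0.1])  # scores separate the classes completely
chance = auc([0.5, 0.1], [0.5, 0.1])   # scores carry no information
```

The two toy cases reproduce the two ends of the scale quoted above: 100% for perfect classification and 50% for random guessing.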
4. Illustrative Results

In this setting we cannot present real messages, or web pages, that are specifically known to be of interest to intelligence analysts. So, to illustrate how this approach is used, we built a model collection of texts that represented a problem similar to the problems of interest. When one applies this model, one finds that it works rather well, when there are enough training examples, as suggested in Figure 3. But when we are searching for “needles in haystacks”, there typically is not a lot of training data. To deal with this, one must build a collection of additional materials that are relevant to the model we need. I will give some details about how we did that, after reviewing the kinds of example data we used.
5. The Data Analyzed

In view of the setting I will not go into great detail, and refer the reader to our publications (5). One of our tasks was to identify news items, in the Reuters collection, that were about specific countries. For these documents, we used, as “ground truth” about the documents, the categories that Reuters news assigned to them. We then cross-checked for countries whose names could be found in the CIA Factbook, which is available online (6). The idea here is to mimic a fully automatable process of going to
machine-readable resources, including, but not limited to, the World Wide Web, to provide more information about a topic (in this case, a country). Our search produced too many matches for a pilot project. We reduced this set to 27 names beginning with the letter A or B, as an example. We then looked for classes with tiny training sets, containing only five positive examples (that is, examples of what we are looking for) and complemented them with five randomly chosen examples that were about other countries.

5.1. Effectiveness in Topic Identification

As is shown in Figure 5, adding information from the Factbook can produce a very large improvement when applied to the so-called Mod-Apte collection (7). For this collection, adding data from the Factbook provides great improvement both for the LASSO models (corresponding to the Laplace prior) and for the Ridge models (corresponding to the Gaussian).
Figure 5. The improvement in classification performance (measured by the F-measure) when additional training information is used to set the prior distributions. Lasso represents the “Laplace” approach, Ridge represents the “Gaussian” approach. The row in darker font is the baseline. The rows labeled in lighter font correspond to two different ways of using the prior information: in one case adjusting the variance, and in the other case adjusting the mode. In all cases the so-called “TFIDF” model was used to represent the documents.
Although I have tried to persuade you that the Gaussian model is not a very good way to do things, I should note here that, purely judged in terms of fit to the data (area under the ROC curve) when we combine the Gaussian method (which performs poorly)
with this kind of prior knowledge, we do as well as we can with the Laplacian methods that produce “human readable” models. In sum, the findings of this work were: when there is a lot of training data, adding domain knowledge doesn’t help much. But when there is very little training data, it helps more, and it helps more often.

5.2. Two Ways to Use Training Data

As I mentioned, we have used the training data to control the ease or difficulty with which terms can be added to the model. Although we did not consider it here, one can think of using training data in another way. This would be to pretend that the training data represents some number of other messages that are “relevant”. While this is simpler, what we actually do, which is adjusting the prior, or “changing the shape of the spike”, is more complicated but is also more effective.
5.3. Another Application: Entity Resolution

Our methods can be applied to another kind of question that is also important in the terrorist domain. When we see a name attached to a piece of information, or an activity, is the name that we see the name of the person we think it is? For example, let’s say that we have two postings on the Web, and we want to know the identity of the authors. In the work reported here, we worked with examples that are publicly available, specifically papers that people have written, or web postings.
Figure 6. Two kinds of entity resolution problems that can be of interest in dealing with authorship of materials on the Web. The same “person”, Jones, appears under two names. The single name Smith refers to two different persons. [Diagram: two texts whose authorship we care about, one attributed to “Smith and Jones” (same person, different names), the other to “Jones, Smith and Wesson” (same name, different persons).]
Suppose that we find an item that claims to have been written by “Smith” and “Jones” and another one that was, in fact, written by “Jones” and “Wesson”. But for some reason the authors deleted the name “Jones” and they put “Smith” there. At this point there are several questions that we can ask:
a) are persons claiming the same name (“Smith”) the same person? b) in general, presented with a pair of identity claims (or labels), do they refer to the same real person? In this case, we have two people apparently named “Smith”. Are they the same person? The answer is, of course, “no”. On the other hand, we have two people with apparently different names, Smith and Jones, but in fact the answer to the question is “yes”. They are the same person. One may go further, and ask: “which of the persons in a long list of authors is hiding behind someone else’s name?” In that case we would want to find that the name “Smith” is false, and eventually figure out that the person Jones is hiding behind it. The research organization focused on Knowledge Discovery and Dissemination (KDD), which is run from the ITIC (Intelligence Technology Innovation Center), in the United States, set up a challenge in which they provided data of this type. The data were created by taking biomedical articles, represented by their abstracts, and then systematically changing the names of some people. So the “documents” were presented to us with unaltered abstracts, institutional information, and locations. But some of the names were changed. There were several specific challenges, and I will focus here on the ones for which the methods developed at Rutgers seemed most effective.

5.4. Are Two Actors, In Fact, the Same Person?

One challenge asked: for each pair of names (as they occur associated with specific documents and co-authors) do they, in fact, refer to the same real person? The Rutgers methods achieved top accuracy on the task of figuring out whether two different “names” are, in fact, names of the same person. As it happened, we did not do this by using our Bayesian modeling approaches, which do not readily adapt to this type of challenge. Instead, we applied an array of “off-the-shelf” clustering techniques (8).
For each clustering method, any given pair of “authors” either appeared in the same cluster, or did not. From this data, we had to make an assessment of whether they are, or are not, the same “real person”. We did this by asking whether the documents to which these authors’ names were attached seemed “sufficiently similar”. That is, the relations among the documents were used to infer relations among the authors (pairwise), representing each author by the information about and in the documents to which his or her name was assigned. The disappointing news, from the point of view of an algorithm researcher, is that the best of the algorithms we tried on this problem didn’t work terribly well; nor did the second best algorithm. All in all, we tried 11 different algorithms, obtained by varying the representations of the documents (vector, or binary), the measures of similarity, the measures of cluster membership, and the cluster assignment rules. When we combined the results by adding the appropriate scores (a form of what is called Model Averaging), the resulting analysis performed quite well. Well enough, in fact, to outperform methods that were more pure to their algorithmic intentions. We then did a kind of “ablation analysis”: we just took away some of the information and averaged the remaining models. It produced the second best score. When we then removed all analyses using binary representations, and continued to average the remaining models, it produced the third best score.
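The score-adding fusion described above can be sketched as follows. The four candidate pairs and their scores are hypothetical, contrived so that each individual method misranks the truly matching pairs (pairs 0 and 1) while the fused scores do not:

```python
def fuse_scores(score_lists):
    """Model averaging (score fusion): rescale each method's scores to
    [0, 1] so they are comparable, then add them item by item."""
    fused = [0.0] * len(score_lists[0])
    for scores in score_lists:
        lo, hi = min(scores), max(scores)
        span = (hi - lo) or 1.0
        for i, s in enumerate(scores):
            fused[i] += (s - lo) / span
    return fused

# Hypothetical same-author scores for four candidate pairs; the truly
# matching pairs are 0 and 1.  Method A wrongly ranks pair 2 near the
# top, and method B wrongly ranks pair 3 near the top.
method_a = [0.9, 0.2, 0.8, 0.1]
method_b = [0.2, 0.9, 0.1, 0.8]
fused = fuse_scores([method_a, method_b])
top_two = sorted(range(len(fused)), key=lambda i: -fused[i])[:2]
```

Because the two methods make different mistakes, their errors partly cancel when the scores are added, which is the mechanism behind the fusion results reported here.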
It appears that there’s a lesson here. For some reason, averaging models together, which is also known by other names, such as “data fusion” and “combination of methods”, is simply more effective than any of the parts that go into it. I propose, in fact, that this is a very important message for the customer. I know that we, as researchers, would like to be able to claim not only that our method is good, but, also, that it’s better than someone else’s method. But, in our experience, many of these differences are quite small, while a combination of all the methods will be perceptibly better than any one of them alone.

5.5. Which Member of a Group of Persons Is Using an Alias?

We also looked at a more difficult question: which of the authors in a given list is using an alias? In fact, we did this without paying any attention to what they said in their papers. We just looked for other kinds of information, such as the list of co-authors, and information about addresses and keywords used to describe the documents. Using this data, we achieved about the third best performance (depending on the measure used) among the group of researchers. Basically, the approach was a kind of social network approach. We sought to compute the probability that no person in a list of authors is using a false identity by calculating separately the probability that each of them does belong in that list.
Aff(a, b) = log [ p(a, b) / ( p(a) · p(b) ) ]

Aff(a, D) = Σ_{b ∈ Authors(D), b ≠ a} Aff(a, b)
Figure 7. The affinity of authors to each other is measured by how often they appear together, which is normalized against the probability that they would appear together by chance. The affinity of an author to a document is determined by the affinity of that author to all the other authors on the document.
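A minimal sketch of the computation in Figure 7, with an invented toy collection of author lists. The probabilities p(a) and p(a, b) are estimated here as simple relative frequencies over the documents, which is one plausible reading, not necessarily the exact estimator used in the study.

```python
from collections import Counter
from itertools import combinations
from math import log

# Toy document collection: each entry is the author list of one document.
docs = [["smith", "jones"], ["smith", "jones", "wu"], ["wu", "lee"], ["lee"]]
n = len(docs)

# Estimate p(a) and p(a, b) as relative co-occurrence frequencies.
p_single = Counter()
p_pair = Counter()
for authors in docs:
    for a in set(authors):
        p_single[a] += 1 / n
    for a, b in combinations(sorted(set(authors)), 2):
        p_pair[(a, b)] += 1 / n

def aff(a, b):
    """Aff(a,b) = log p(a,b) / (p(a) p(b)); -inf if the pair never co-occurs."""
    pab = p_pair.get(tuple(sorted((a, b))), 0.0)
    return log(pab / (p_single[a] * p_single[b])) if pab else float("-inf")

def aff_doc(a, doc_authors):
    """Aff(a,D): sum of affinities of a to every other author of document D."""
    return sum(aff(a, b) for b in doc_authors if b != a)
```

Here "smith" and "jones" always appear together, so their affinity is positive (log 2), while "smith" and "wu" co-occur exactly as often as chance predicts, giving affinity zero.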
This cannot really be expressed without equations, which are shown here in Figure 7. This method was the most effective of the “fair and honest” methods. As fate would have it, some useful information had not been removed from the data as provided, namely information about the authors’ addresses. This, not surprisingly, turned out to be a great help, but we regard it as “not quite fair”. Of course, there is a message there: we should not ignore any information. It has been mentioned by other speakers, and it is in fact generally recognized now that this kind of “open source” information is extremely important. But the decision on how to include it is quite difficult. We had two ways of figuring out whether a pair of authors belonged together; one of them, if you can remember your probability theory, is not hard to understand. We looked at how frequently they
appeared together, compared to the probability that they would have appeared together just by chance. We call this the “affinity” of the pair of authors. The affinity of a particular author to a document was just taken as the sum of the affinities of the author to each of the other authors on the document. We also had a more complicated way, and with your permission I’ll jump past this.

So, taken together, what have we found in these experiments using Bayesian methods to identify messages or people of interest?

1. With little training data, Bayesian methods work quite a lot better when they can use general knowledge about the target group, which may be a set of messages, a set of persons, or what have you.

2. To determine whether several records refer to the same person, there is no magic bullet. But combining many different methods is quite feasible once we get past researchers’ emotional commitment to their own methods. And it’s quite powerful.

3. Finally, to detect an imposter in a group, social methods based on combining probabilities are quite effective. And every piece of information helps.
6. Relation to Patrolling the Web

I would like now to relate all of this to our conference goals of patrolling the Web. Specifically, we are looking for ideas that limit the terrorists’ use of the Web. We want to consider the constraints imposed by respect for the privacy of law-abiding citizens. And, at least in the kinds of work I have just presented, we are looking for technical fixes.

I think the most important reality to recognize here is that acceptable error rates for identifying terrorists and terror threats are very, very low. If there are N persons to be screened, N is in the hundreds of millions, and the false alarm rate is F, then F times N persons will be investigated. Even if we make only one error in a million, we are going to investigate hundreds of innocent people, who will not be very happy about that. It’s not the kind of situation we want to get into.

I think what that means is that, for the moment, these technologies are much more likely to be useful in forensics, after the fact. Suppose, to be concrete, that we can find ways (and there are some big “if”s here) to store all the messages in some hashed and compressed form. Then they are not being read by everyone who might like to read our mail. In fact, they are not even being read by screening software, because there would be too many false positives. However, once something bad has been uncovered by other means (it need not have happened yet, as with several intercepted plots in recent years), it may prove possible to search for it in the stored data. In particular, we might be able to identify, after the fact, all of the persons who sent messages containing a specific set of keywords, even if we are not able to reconstruct the entire messages. This would be extremely useful.

When we consider the Web itself as a medium, we can ask, for any specific collection of web pages, what are all the IP addresses of the people who downloaded them?
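The screening arithmetic above (F times N expected false alarms) can be made concrete. The population figure below is an assumed 300 million; the text says only “hundreds of millions”.

```python
# With N people screened at false-alarm rate F, about F * N innocent
# people are flagged for investigation, regardless of how rare the true
# targets are.

def expected_false_alarms(n_screened, false_alarm_rate):
    return n_screened * false_alarm_rate

n = 300_000_000   # assumed stand-in for "hundreds of millions"
f = 1e-6          # "one error in a million"
alarms = expected_false_alarms(n, f)   # about 300 innocent people flagged
```

Even an extraordinarily accurate screen thus produces hundreds of false investigations, which is the base-rate argument for using these methods forensically rather than for blanket screening.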
Might we be able to crosslink persons who visited a particular collection of web
pages and find out where else they had gone? So the technical challenges are: (1) how much data does this require, and (2) can it be held secure from inappropriate snooping? These are real challenges. At the moment it’s not clear whether data can be stored in such privacy-protecting ways. When and if it can be, then the Bayesian methods that I have been talking about can be used after the fact to track down perpetrators and their associates.
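One plausible sketch of the “hashed and compressed” storage idea discussed above: keep only salted token hashes per message, so the text itself cannot be read, but an after-the-fact keyword query is still possible. The salt, the messages, and the whole API below are invented for illustration; this is not the design proposed in the chapter.

```python
import hashlib

# For each message we store only the sender and a set of salted token
# hashes. The plaintext is discarded, so casual snooping reveals nothing,
# but a forensic query for a known keyword set is still answerable.

SALT = b"per-archive-secret"   # assumed secret held only by the archive

def token_hash(word):
    return hashlib.sha256(SALT + word.lower().encode()).hexdigest()

class MessageArchive:
    def __init__(self):
        self.records = []      # list of (sender, frozenset of token hashes)

    def store(self, sender, text):
        hashes = frozenset(token_hash(w) for w in text.split())
        self.records.append((sender, hashes))

    def senders_using(self, keywords):
        """Forensic query: senders whose messages contain all keywords."""
        wanted = {token_hash(w) for w in keywords}
        return {s for s, hs in self.records if wanted <= hs}

archive = MessageArchive()
archive.store("alice", "meet at the harbour at dawn")
archive.store("bob", "lunch at noon")
suspects = archive.senders_using(["harbour", "dawn"])
```

The design choice mirrors the text: screening software never reads the archive proactively; only a targeted query, made after something bad has been uncovered by other means, touches it.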
Acknowledgments Many excellent researchers have contributed to the work reported here. Fred Roberts is principal investigator on the KDD project, and the Director of the DIMACS Institute and the DyDAn Center for Dynamic Data Analysis. David Madigan, now in the Department of Statistics at Columbia, designed the algorithm, working with David Lewis of David D. Lewis Consultants, an expert on text classification. Vladimir Menkov and Alex Genkin did the programming. A number of doctoral students have worked on the project, including Andrei Anghelescu, Sukrid Balakrishnan, Aynur Dayanik, and Dmitry Fradkin, and we had help from a number of undergraduate students: Diana Michalek, Ross Sowell, Jordana Chord, Melissa Mitchell, and Paul Bonamy. We gratefully acknowledge support from the NSF (CCR 00-87022; SES 05-18543), ONR (N00014-07-1-029; N00014-07-1-0150), and IBM (2003 S518700 00).
Security Informatics and Terrorism: Patrolling the Web C.S. Gal et al. (Eds.) IOS Press, 2008 © 2008 IOS Press. All rights reserved.
Entity and Relation Extraction in Texts with Semi-Supervised Extensions Gerhard PAAß 1 and Jörg KINDERMANN Fraunhofer Institute Intelligente Analyse- und Informationssysteme Sankt Augustin, Germany
Abstract. In the last few years the Internet has become a prominent vehicle for communication, with the side effect that digital media have also become more relevant for criminal and terrorist activities. This necessitates the surveillance of these activities on the Internet. A simple way to monitor content is the spotting of suspicious words and phrases in texts. Yet one of the problems with simply looking for words is the ambiguity of words, whose meaning often depends on context. Information extraction aims at recovering the meaning of words and phrases from the neighboring words. We give an overview of term and relation extraction methods based on pattern matching and trainable statistical methods, and report on experiments with semi-supervised training of such methods. Keywords. Named Entity Recognition, Relation Extraction, Semi-Supervised Learning, Conditional Random Field
1 Corresponding Author: Gerhard Paaß, Fraunhofer Institute for Intelligent Analysis and Information Systems, IAIS, Schloss Birlinghoven, 53757 Sankt Augustin, Germany, http://www.iais.fraunhofer.de; Email: [email protected]

Introduction

In the last few years electronic media and the Internet have become a prominent vehicle for communication in society and in the economy. This has had the side effect that digital media have also become more relevant for criminal and terrorist activities, which necessitates the surveillance of these activities on the Internet. There is a variety of content to monitor, e.g. public websites, discussion forums and blogs, and – with legal restrictions – email and electronic communication. Organized crime uses email and the Internet to coordinate criminal activities. Terrorist groups promote their aims through Internet web sites and they coordinate their actions through electronic media. Economic transactions, especially between businesses and individuals, are hampered by phishing emails and Internet fraud. The Internet is an ideal framework for illegal trading and gambling. Surveillance activities, of course, have to observe constitutional and legal constraints.

A simple way to monitor content is the spotting of suspicious words in texts. Yet one of the problems with simply looking for words is the ambiguity of words, whose meaning often depends on context. If you take the term “bank”, it can have the following meanings: sloping land, depository financial institution, a long ridge or pile, an arrangement of objects in a row, a supply or stock held in reserve, funds held by a gambling house, slope in the turn of a road or track, savings bank, a container (usually with a slot), bank building, flight maneuver. In addition “bank” may be part of a name, e.g. for many banking institutes, persons (Jesper Bank, Ondřej Bank) or locations (Jodrell Bank). Moreover, there are different words or phrases which may express the same concept. Under the heading of information extraction there are a number of approaches that tackle these problems and recover the meaning of words and phrases:

- Named entity recognition aims to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc. Named entity recognition systems have been created that use linguistic grammar-based techniques as well as statistical models.
- Latent semantic indexing is a clustering method which assigns a cluster index to single words and phrases [6], [8]. These clusters group words together which appear in a similar neighbourhood. In this way they allow the disambiguation of words.
- Relation extraction identifies relations between entities, which may be named. An example is the relation “is_member(PERSON, PARTY)”. Extraction methods should detect equivalent formulations, e.g. “Tony Blair joined the Labour party in 1975”.
Extraction methods are based on pattern matching or trainable statistical methods and should detect similar formulations. Statistical methods have the advantage that they often can generalize to new, similar formulations, whereas pattern matching methods (e.g. based on regular expressions) just detect the anticipated pattern. New statistical methods have been shown to reach the level of accuracy of carefully constructed, hand-tuned pattern matching approaches, but with much higher flexibility and robustness.

If statistical techniques are used, they require training data with annotated named entities, relations, etc. It is usually very costly to compile this labeled data, as humans have to look at each instance and decide on the membership of that instance. An alternative is semi-supervised machine learning techniques that make use of both labeled and unlabeled data for training, typically a small amount of labeled data with a large amount of unlabeled data. Semi-supervised learning falls between unsupervised learning (without any labeled training data) and supervised learning (with completely labeled training data). Many studies show that unlabeled data, when used in conjunction with a small amount of labeled data, can produce considerable improvement in learning accuracy.

In the next section we will discuss information extraction methods in more depth. The following section is devoted to semi-supervised learning techniques. We will report some results from our work which show the possible gain in accuracy. The last section contains a short summary.
Figure 1. Logo for Theseus Project
In Germany a large new project named Theseus has been started by the German Federal Ministry of Economics and Technology, devoted to the semantic annotation of multimedia content (text, speech, audio, images, video). This five-year program is scheduled to run until 2011 and joins 22 enterprises and research institutes. Fraunhofer IAIS takes part in two projects. In the first project a multimedia collection of the German National Library is digitized using OCR, image and video capturing, as well as audio digitization. This content is semantically indexed and prepared for integrated retrieval. In the second project professional content such as legal and business texts as well as patents is semantically annotated and entered into a semantic search engine which takes the semantic annotations into account. Finally, IAIS participates in a Core Technology Cluster which is devoted to research on information extraction.
1. Named Entity Extraction

Natural language texts contain much information that is not directly suitable for automatic analysis by a computer. However, computers can be used to sift through large amounts of text and extract useful information from single words, phrases or passages. Information extraction can be regarded as a restricted form of full natural language understanding, where we know in advance what kind of semantic information we are looking for. The task of information extraction naturally decomposes into a series of processing steps, typically including tokenization, sentence segmentation, part-of-speech assignment, and the identification of named entities, i.e. person names, location names and names of organizations. At a higher level, phrases and sentences have to be parsed, semantically interpreted and integrated.

Named entity recognition (NER) is a subtask that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc. As an example consider the task of extracting executive position changes from news stories: “Robert L. James, Chairman and Chief Executive Officer of McCann-Erickson, is going to retire on July 1st. He will be replaced by John J. Donner, Jr., the agency’s Chief Operating Officer.” In this case we have to identify the following information: organization (McCann-Erickson), position (Chief Executive Officer), date (July 1), outgoing person’s name (Robert L. James), and incoming person’s name (John J. Donner, Jr.).
NER systems usually use linguistic grammar-based techniques as well as statistical models. Hand-crafted rule-based systems typically obtain better results, but at the cost of months of work by experienced linguists. Statistical NER systems typically require a large amount of manually annotated training data. NER was originally formulated in the Message Understanding Conference [4].

One can regard NER as a word-based tagging problem: the word where the entity starts gets tag “B”, possible continuation words get tag “I” and words outside the entity get tag “O”. This is done for each type of entity of interest. For the example above we have for instance the person-words “by (O) John (B) J. (I) Donner (I) Jr. (I) the (O)”. Hence we have a sequential classification problem for the labels of each word, with the surrounding words as input feature vectors.

A typical way of forming the feature vector is a binary encoding scheme. Each feature component can be considered as a test that asserts whether a certain pattern occurs at a specific position or not. For example, a feature component takes the value 1 if the previous word is “John” and 0 otherwise. Of course we may not only test for the presence of specific words but also for whether a word starts with a capital letter, has a specific suffix, or is a specific part of speech. In this way results of previous analyses may be used. Now we may employ any efficient classification method to classify the word labels using the input feature vector. A good candidate is the Support Vector Machine because of its ability to handle large sparse feature vectors efficiently. Takeuchi and Collier [20] used it to extract entities in the molecular biology domain. One problem of standard classification approaches is that they do not take into account the predicted labels of the surrounding words. This can be done using probabilistic models of sequences of labels and features.
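The BIO coding and the binary feature tests just described can be sketched as follows, using the person-label example from the text; the particular feature set here is a tiny invented one.

```python
# BIO word tagging and binary feature encoding, following the example
# "by (O) John (B) J. (I) Donner (I) Jr. (I) the (O)" from the text.

words = ["by", "John", "J.", "Donner", "Jr.", "the"]
tags  = ["O",  "B",    "I",  "I",      "I",   "O"]

def features(i, words):
    """Binary tests on the word and its neighbour (a tiny, invented set)."""
    w = words[i]
    prev = words[i - 1] if i > 0 else "<s>"
    return {
        "is_capitalized": w[0].isupper(),
        "prev_is_John": prev == "John",      # the example test from the text
        "ends_with_period": w.endswith("."),
    }

# Each word becomes a (feature vector, label) training instance for a
# sequential classifier such as an SVM, HMM, or CRF.
training = [(features(i, words), tags[i]) for i in range(len(words))]
```

In a real system the feature dictionary would be expanded into a large sparse binary vector, with one component per (test, position) combination.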
One such frequently used model is the Hidden Markov model (HMM), which models the label sequence through conditional distributions of the current label given the previous one. To use this model a training set of words and their correct labels is required. For the observed words the algorithm takes into account all possible sequences of labels and computes their probabilities. Hidden Markov models were successfully used for named entity extraction, e.g. in the Identifinder system [1].

Hidden Markov models require the conditional independence of features of different words given the labels. This is quite restrictive, as we would like to include features which correspond to several words simultaneously. A recent approach for modeling this type of data is called conditional random field (CRF) [14]. Again we consider the observed vector of words and the corresponding vector of labels. The labels have a graph structure, where each label may depend on labels in its neighborhood as well as on additional observed inputs, e.g. the words in the neighborhood and features thereof, such as capitalization. The whole vector of observed terms and the labels of neighbors may influence the distribution of the label of interest. CRF models encompass hidden Markov models, but they are much more expressive because they allow arbitrary dependencies in the observation sequence and more complex neighborhood structures of labels.

As for most machine learning algorithms, a training sample of words and the correct labels is required. In addition to the words themselves, arbitrary properties of the words, like part-of-speech tags, capitalization, prefixes and suffixes, etc., may be used, sometimes leading to more than a million features. McCallum [17] applies CRFs with feature selection to named entity recognition and reports the following F1-measures for the CoNLL corpus: person names 93%, location names 92%, organization names 84%, miscellaneous names 80%. CRFs also
have been successfully applied to noun phrase identification, part-of-speech tagging, shallow parsing, and biological entity recognition.

1.1. Relation Extraction

After identifying named entities in the text we are often interested in predefined relations between these entities, such as the relation between a person and an organization. For instance the ACE (Automatic Content Extraction) evaluation in 2007 defines seven major types of relations [16]. In addition to named entities (e.g. Fred Miller) other entities are considered, such as nominal mentions (e.g. the man wearing a red shirt) or pronoun mentions (e.g. he). According to ACE, relations connect two entities in some logical relationship. For example, “employment” or “membership” relations can apply between people and organizations, “located” relations between two geographical regions, locations, or facilities, and “family” relations between people.

In contrast to the hand-written rules used previously, current systems mostly use statistically trained models, as they have obvious advantages in domain and language portability. After some elementary preprocessing (e.g. tokenization, POS-tagging) a typical system first identifies entity mentions: named, nominal, and pronominal. Subsequently coreferences of the same entity are identified. Finally relations between entities are detected based on textual evidence surrounding mentions of the two entities. Again this task can be handled by statistical learning from a sample of sentences where the target relations are labeled. Crucial for success is the prior annotation of words and text with a variety of features such as part-of-speech, shallow parsing, recognized named entities, deep parsing, etc. A common observation is that deep parsing has relatively high error rates; therefore methods relying on it alone are less robust. Relations are typically predicted based on local evidence from those locations in the text where two entities are mentioned together.
For example, in “the White House in Washington”, the location relation is conveyed by the facility and the town being connected with “in”. The machine learning techniques used for relation extraction are similar to those for NER, e.g. Hidden Markov Models or Conditional Random Fields [19]. In our institute we performed a study on extracting the membership of persons in political parties from newswire documents and achieved an F-value of 64%.

The semantic role labeling task has a similar objective. It aims at developing a machine learning system to recognize arguments of verbs in a sentence and label them with their semantic role. A verb and its set of arguments form a proposition in the sentence, and typically a sentence contains a number of propositions. Popular learning methods for the CoNLL-2005 Shared Task evaluation [3] were maximum entropy approaches, Support Vector classifiers, decision trees, and AdaBoost. On the test set the best systems had an F-score of 78%.
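To make the contrast with pattern matching concrete, a minimal regular-expression extractor for the “located” evidence above might look like this; the pattern is invented, and unlike a trained statistical model it detects only this one anticipated formulation.

```python
import re

# Minimal pattern-matching relation extractor of the kind contrasted with
# statistical methods above: it finds only the "the X in Y" pattern, as in
# "the White House in Washington". A trained model would generalize to
# other formulations; this pattern will not.

LOCATED_IN = re.compile(r"the ([A-Z]\w*(?: [A-Z]\w*)*) in ([A-Z]\w*)")

def extract_located(text):
    return [("located", fac, town) for fac, town in LOCATED_IN.findall(text)]

rels = extract_located("We visited the White House in Washington yesterday.")
```

A rephrasing such as “Washington’s White House” yields nothing here, which is exactly the brittleness the statistical approaches are meant to overcome.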
2. Semi-supervised Learning

2.1. Probabilistic Approaches

A central problem for statistical learning techniques is the preparation of labeled training data, where entities or relations are marked. Labeled instances, however, are often difficult, expensive, or time consuming to obtain, as they require the efforts of
experienced human annotators. Unlabeled data, by contrast, is usually relatively easy to collect. Semi-supervised learning addresses this problem by using a large amount of unlabeled data, together with the labeled data, to build better information extraction methods. Because semi-supervised learning requires less human effort and gives higher accuracy, it is of great interest both in theory and in practice.

Note that unlabeled data has no direct information on the dependency between the unknown class and the input variables. But how can unlabeled data improve statistical performance? Consider a simple example, where from the labeled data the term “Tony” has been identified as an indicator for a person’s name. Now assume that in the unlabeled data the terms “Tony” and “Blair” often occur together. By this correlation “Blair” can also become an indicator for a person’s name. In this way co-occurrence information can be used to improve generalization performance.

If a statistical model of the relevant entities is known, the Expectation-Maximization algorithm (EM) is the optimal semi-supervised procedure [15]. It predicts the probabilities for the different classes of the unlabeled data using the current model. Subsequently, the different possible classes for an instance are treated as weighted cases, using the class probabilities as case weights. With this artificial data a new model is estimated. The procedure is iterated until it converges to some limit, which can be shown to be a (local) optimum. It is difficult, however, to formulate a valid joint probabilistic model for all variables and words in a text. Usually strong independence assumptions have to be made; otherwise there is an exponential number of parameters which cannot be estimated from a reasonably sized training set. Nigam et al. [15], for instance, use a Naïve Bayesian model in the case of document classification.

2.2. Self-Training

An alternative approach is self-training.
Here a classifier is first trained with a small amount of labeled data. It is then used to classify the unlabeled data. Subsequently the most confident unlabeled points, together with their predicted labels, are added to the training set. The classifier is re-trained and the procedure is repeated. Kozareva et al. [13] use self-training for NER together with K-nearest-neighbor classifiers.

We have applied self-training to semi-supervised learning of conditional random fields (CRF) for named entity recognition. Here we used the familiar BIO coding of named entities, where “B” marks the first word of a named entity, “I” marks a possible subsequent word of the name, and “O” denotes words outside a name. The CRF outputs a probability for the different labels. We imputed only labels which were predicted with a probability above a threshold. The approach was applied to the CoNLL-2003 English corpus (Reuters newsfeed texts). The corpus of 14041 sentences was randomly partitioned into a training set of 7020 sentences and a test set of 7020 sentences. Labeled and unlabeled items were collected from the training set, while the test set was only used for final testing. We trained CRFs for four types of named entities: persons, locations, organizations, and names of miscellaneous entities that do not belong to the previous three groups.

Tables 1 and 2 give the average F1-score of the four CRFs for predicting the “B”-label and the “I”-label. We define F1-reduction as the relative reduction of the difference of the F1-value from 100. It turns out that semi-supervised learning can improve the F1-value to a large extent, with a reduction between 60% and 20%.
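The self-training loop just described can be sketched generically. The toy one-dimensional nearest-centroid classifier, the margin-based confidence, and the threshold below are invented stand-ins for the CRF and its label probabilities.

```python
# Generic self-training: train, label the unlabeled pool, move only
# confident predictions into the training set, retrain, repeat.

def train(labeled):
    """Return per-class centroids of 1-D points (the toy 'classifier')."""
    sums, counts = {}, {}
    for x, y in labeled:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict(model, x):
    """Label plus a crude confidence: margin between the two nearest centroids."""
    dists = sorted((abs(x - c), y) for y, c in model.items())
    label = dists[0][1]
    margin = dists[1][0] - dists[0][0] if len(dists) > 1 else float("inf")
    return label, margin

def self_train(labeled, unlabeled, threshold=2.0, rounds=5):
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        model = train(labeled)
        confident = [(x, predict(model, x)) for x in pool]
        added = [(x, y) for x, (y, m) in confident if m >= threshold]
        if not added:
            break
        labeled += added
        pool = [x for x in pool if all(x != a for a, _ in added)]
    return train(labeled)

model = self_train([(0.0, "A"), (10.0, "B")], [1.0, 2.0, 8.5, 9.5])
```

The thresholding step corresponds to imputing only labels predicted above a probability threshold in the CRF experiments.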
Table 1. F1-scores for semi-supervised self-training of CRFs for English. Columns: number of labeled and unlabeled training sentences; B-F1 per entity type and on average, with the F1-reduction achieved by adding unlabeled data; average I-F1 with its F1-reduction.

labeled  unlabeled  B-F1 PER  LOC   ORG   MISC  avg   F1-red.  I-F1 avg  F1-red.
35       0          76.0      64.6  48.4  68.6  64.4  -        38.6      -
35       3510       82.5      79.5  87.1  88.6  86.6  62%      52.5      23%
70       0          89.1      68.5  60.2  70.7  72.1  -        43.8      -
70       3510       97.0      87.5  90.7  84.5  89.5  62%      61.7      32%
350      0          97.1      98.7  98.9  97.1  97.9  -        89.5      -
350      3510       98.6      98.8  99.2  98.2  98.8  43%      92.1      25%
3159     0          99.5      99.6  99.8  99.7  99.6  -        98.2      -
For the German language named entity recognition is harder, as not only names but all nouns are capitalized. This can be seen in Table 2, which gives the results of semi-supervised learning for the CoNLL-2003 German corpus (Frankfurter Rundschau newspaper texts). It contains 12148 sentences, again randomly split into a training and a test set of 6074 sentences each. Table 2 gives, again, the average F1-score of the four CRFs for predicting the “B”-label. It shows that more labeled training examples are required to obtain a performance comparable to the English case, and that semi-supervised learning can lead to a sizable reduction of error. On the other hand, the utilization of the complete labeled training set gives a much higher accuracy.
Table 2. F1-scores for semi-supervised self-training of CRFs for German (average B-F1 in % and F1-reduction).

labeled  unlabeled  B-F1 (%)  F1-reduction
30       0          6.6       -
30       3030       9.3       3%
120      0          48.4      -
120      3030       65.4      26%
300      0          83.2      -
300      3030       88.7      56%
2730     0          99.4      -
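The F1-reduction measure defined before Table 1 can be computed directly; the values checked below are taken from Table 1’s B-F1 averages.

```python
# F1-reduction: the relative reduction of the distance of the F1 value
# from 100 when unlabeled data is added to the training set.

def f1_reduction(f1_supervised, f1_semisupervised):
    gap_before = 100.0 - f1_supervised
    gap_after = 100.0 - f1_semisupervised
    return 100.0 * (gap_before - gap_after) / gap_before

# B-F1 averages for 35 labeled sentences (Table 1): 64.4 without and
# 86.6 with 3510 unlabeled sentences.
r = f1_reduction(64.4, 86.6)
```

Rounding r gives the 62% reported in Table 1 for that row; the same formula reproduces the 62% and 43% entries of the other semi-supervised English rows.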
2.3. Other Semi-Supervised Approaches

Another, more heuristic criterion for semi-supervised learning is to avoid decision boundaries between different classes in low-density input regions. This is exploited, for
instance, by the transductive Support Vector Machine algorithm [10]. Another approach tries to assign label probabilities to the unlabeled instances such that the entropy, a measure of the label uncertainty, is reduced. Jiao et al. [9] apply this principle to conditional random fields. In an entity extraction task with biomedical texts they achieve an increase of about 20% in the F-measure.

Co-training [2] assumes that the features can be split into two sets and that each subset is sufficient to train a good classifier. This requires that the two sets are approximately conditionally independent given the class. Initially two separate classifiers are trained with the labeled data, on the two feature subsets respectively. Each classifier then classifies the unlabeled data and “teaches” the other classifier with the predicted labels of the unlabeled instances which have the lowest prediction uncertainty. Each classifier is retrained with the additional training instances given by the other classifier, and the process repeats. Jones [11] uses co-training, EM, and other approaches for entity extraction from text.

Co-training makes strong assumptions about the independence of the feature subsets, and several approaches relax these assumptions. The approach of [7] employs two classifiers of different types but uses the complete set of features. Each of them is used to predict the labels of the unlabeled data for the other classifier. This is extended by [21] to an ensemble of classifiers of different types with different inductive biases, which again are trained separately on the complete set of features of the labeled data. They then make predictions on the unlabeled data, using a majority vote to predict the unknown label. All classifiers are retrained on the updated training set. The final prediction is made with a variant of a weighted majority vote among all the classifiers. As a variant, Zhou and Li [22] propose to use three learners.
If two of them agree on the classification of an unlabeled point, the classification is used to teach the third classifier. In this way conflicts are resolved. The scheme can be applied to data sets without different views, or with different types of classifiers.

Overall, semi-supervised learning can help to increase accuracy without any additional labeling effort. If, however, less labeled data is used, the modeling assumptions are more critical and in the worst case can lead to an increase in errors. As a consequence one should carefully review prior knowledge and use it if appropriate. For practical applications it is necessary to set aside a final test set to assess the results of semi-supervised learning.

2.4. Active Learning

Complementary to semi-supervised learning we may use active learning to select specific examples for labeling. This should be done in such a way that the labeling effort is minimized. Active selection of the training examples can significantly reduce the necessary number of labeled training examples without degrading the performance. Optimal approaches should pose queries in such a way that the overall number of labeled examples is minimized. As this is usually intractable, approximate methods are used. Uncertainty-based systems [18] begin with an initial classifier, which assigns uncertainty scores to the unlabeled examples. A number of examples with the highest scores are labeled by human experts and the classifier is retrained. In ensemble-based systems [5], diverse ensembles of classifiers are generated. The degree of disagreement among the ensemble members is evaluated and the examples with the highest disagreement are selected for manual labeling. Active learning with conditional random fields is evaluated by Kim et al. [12]. They measure
the uncertainty of labeling according to the entropy of label probabilities and use it to propose examples for labeling.
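The entropy-based selection just described can be sketched as follows; the predicted label distributions below are invented, where a real system would obtain them from the CRF.

```python
from math import log

# Uncertainty-based active learning: score each unlabeled example by the
# entropy of its predicted label distribution and propose the most
# uncertain ones for manual labeling.

def entropy(probs):
    return -sum(p * log(p) for p in probs if p > 0)

# Predicted label distributions for three unlabeled examples (invented).
candidates = {
    "ex1": [0.98, 0.01, 0.01],   # confident prediction -> low entropy
    "ex2": [0.40, 0.35, 0.25],   # near-uniform -> high entropy
    "ex3": [0.70, 0.20, 0.10],
}

def select_for_labeling(candidates, k=1):
    ranked = sorted(candidates, key=lambda e: entropy(candidates[e]), reverse=True)
    return ranked[:k]

queries = select_for_labeling(candidates)
```

The near-uniform example is queried first, which is the intended behaviour: annotator effort is spent where the current model is least certain.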
Summary

In recent years information extraction methods have been developed which are able to scan large document collections and the Internet. In contrast to simple string comparisons they capture the meaning of content phrases. Named entity extraction methods annotate names of persons and other entities and allow different formulations to be compared. Relation extraction methods and semantic role labeling identify relations between entities. Recently the new German joint project Theseus was started to advance the technology in this area and apply it to real-world problems.

To reduce the labeling effort we may utilize semi-supervised learning, which exploits information in unlabeled samples to improve accuracy. Different techniques were discussed and it was shown that sizeable gains are possible. Complementary to these methods we may use active learning to select new cases for labeling. The combination of these techniques allows us to apply statistical information extraction methods on a much larger scale than before by reducing human labeling effort.

If information extraction methods are used for security applications we have to keep in mind that in most cases statistical annotations currently have an error of 5-20%. Note that terrorist activities are usually very rare events, e.g. one in a million. If millions of people are screened this will lead to large numbers of false positives and the authorities would be overwhelmed by false alarms. Therefore it is necessary to follow a cautious strategy. We may require a number of independent “confirmations” before an alarm is given, and humans have to judge the severity of a threat before an action is taken. On the other hand, non-rare events may be readily detected by information extraction methods, e.g. terrorist websites for recruiting new supporters.
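The false-alarm arithmetic above can be made concrete. All numbers below are illustrative assumptions (a one-in-a-million base rate, ten million people screened, a 5% error rate at the low end of the 5-20% range quoted above), and the independence of repeated checks is itself an optimistic assumption:

```python
prevalence = 1e-6        # assumed base rate: one actual case per million
population = 10_000_000  # assumed number of people screened
error_rate = 0.05        # assumed false-positive rate (low end of 5-20%)

actual_cases = prevalence * population
false_alarms = (1 - prevalence) * population * error_rate
print(f"actual cases: {actual_cases:.0f}, false alarms: about {false_alarms:,.0f}")

# Requiring several independent "confirmations" before raising an alarm
# drives the false-alarm count down geometrically -- if errors are independent.
for k in (1, 2, 3):
    alarms = (1 - prevalence) * population * error_rate ** k
    print(f"{k} independent check(s) -> about {alarms:,.0f} false alarms")
```

Even with three independent checks, the roughly ten real cases are still swamped by over a thousand false alarms, which is why human judgment of severity remains necessary before any action is taken.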
References

[1] D. Bikel, R. Schwartz, R. Weischedel (1999): An Algorithm that Learns What's in a Name. Machine Learning, 34, p. 211-231.
[2] A. Blum, T. Mitchell (1998): Combining labeled and unlabeled data with co-training. COLT: Proceedings of the Workshop on Computational Learning Theory.
[3] X. Carreras, L. Màrquez (2005): Introduction to the CoNLL-2005 Shared Task: Semantic Role Labeling. http://www.lsi.upc.edu/~srlconll/st05/st05.html#CM05.
[4] N. Chinchor (1998): MUC-7 Information Extraction Task Definition. http://acl.ldc.upenn.edu/muc7/ie_task.html.
[5] D.A. Cohn, L. Atlas, R.E. Ladner (1994): Improving generalization with active learning. Machine Learning, 15(2), 201-221.
[6] S.C. Deerwester, S.T. Dumais, T.K. Landauer, G.W. Furnas, R.A. Harshman (1990): Indexing by Latent Semantic Analysis. Journal of the American Society of Information Science 41, 391-407.
[7] S. Goldman, Y. Zhou (2000): Enhancing supervised learning with unlabeled data. Proc. 17th International Conf. on Machine Learning, pp. 327-334. Morgan Kaufmann, San Francisco, CA.
[8] T. Hofmann (1999): Probabilistic Latent Semantic Indexing. Proc. SIGIR 1999, p. 50-57.
[9] F. Jiao, S. Wang, C. Lee, R. Greiner, D. Schuurmans (2006): Semi-supervised conditional random fields for improved sequence segmentation and labelling. Proc. 21st International Conference on Computational Linguistics, Sydney, Australia, pp. 209-216.
[10] T. Joachims (1999): Transductive inference for text classification using support vector machines. Proc. 16th ICML, pp. 200-209.
[11] R. Jones (2005): Learning to extract entities from labeled and unlabeled text. Carnegie Mellon University, Doctoral Thesis (Technical Report CMU-LTI-05-191).
[12] S. Kim, Y. Song, K. Kim, J. Cha, G. Lee (2006): MMR-based active machine learning for Bio named entity recognition. Proc. of the Human Language Technology Conference/North American Chapter of the Assoc. for Computational Linguistics Annual Meeting (HLT-NAACL06), June 2006, New York.
[13] Z. Kozareva, B. Bonev, A. Montoyo (2005): Self-Training and Co-Training applied to Spanish Named Entity Recognition. MICAI 2005, Mexico. LNAI 3789, p. 770-779.
[14] J. Lafferty, A. McCallum, F. Pereira (2001): Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. Proc. ICML.
[15] K. Nigam, A.K. McCallum, S. Thrun, T. Mitchell (2000): Text classification from labeled and unlabeled documents using EM. Machine Learning, 39, 103-134.
[16] NIST (2007): The ACE 2007 (ACE07) Evaluation Plan. http://www.nist.gov/speech/tests/ace/ace07/doc/ace07-evalplan.v1.3a.pdf.
[17] A. McCallum (2003): Efficiently Inducing Features of Conditional Random Fields. Proc. Conf. on Uncertainty in Artificial Intelligence (UAI).
[18] T. Scheffer, S. Wrobel (2001): Active learning of partially hidden Markov models. Proceedings of the ECML/PKDD Workshop on Instance Selection.
[19] C. Sutton, A. McCallum (2006): An Introduction to Conditional Random Fields for Relational Learning. In: L. Getoor, B. Taskar (eds.), Introduction to Statistical Relational Learning. MIT Press.
[20] K. Takeuchi, N. Collier (2002): Use of support vector machines in extended named entity recognition. 6th Conf. on Natural Language Learning (CoNLL-02), p. 119-125.
[21] Y. Zhou, S. Goldman (2004): Democratic co-learning. Proceedings of the 16th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2004).
[22] Z.-H. Zhou, M. Li (2005): Semi-supervised regression with co-training. International Joint Conference on Artificial Intelligence (IJCAI).
Security Informatics and Terrorism: Patrolling the Web C.S. Gal et al. (Eds.) IOS Press, 2008 © 2008 IOS Press. All rights reserved.
Data Compression Models for Prediction and Classification
Gordon CORMACK 1
University of Waterloo, Waterloo, Canada
Abstract. The crux of data compression is to process a string of bits, predicting each subsequent bit as accurately as possible. The accuracy of this prediction is reflected directly in compression effectiveness. Dynamic Markov Compression (DMC) uses a simple finite state model which grows and adapts in response to each bit, and achieves state-of-the-art compression on a variety of data streams. While its performance on text is competitive with the best known techniques, its major strength is that it lacks prior assumptions about language and data encoding and therefore works well for binary data like executable programs and aircraft telemetry. The DMC model alone may be used to predict any activity represented as a stream of bits. For example, DMC plays "Rock, Paper, Scissors" quite effectively against humans. Recently, DMC has been shown to be applicable to the problem of email and web spam detection -- it is one of the best known techniques for this purpose. The reasons for its effectiveness in this domain are not completely understood, because DMC performs poorly on some other standard text classification tasks. I conjecture that the reason is DMC's ability to process non-linguistic information like the headers of email, and to predict the nature of polymorphic spam rather than relying on fixed features to identify spam. In this presentation I describe DMC and its application to classification and prediction, particularly in an environment where particular patterns of data and behavior cannot be anticipated, and may be chosen by an adversary so as to defeat classification and prediction.
Keywords. Prediction, Classification, Markov Model, Compression, Screening, Spam
I’d like to play a guessing game with you. I’ll start with a very simple version. I’ll give you a sequence of numbers, zeros and ones, and I’ll stop at some point, and you tell me what the next one is. So, if we look at the first sequence [Figure 1], one zero, one one zero, one one zero one one zero one one, and I ask you what the next one is, probably you’ll tell me it’s zero. I’m going to make things a little harder though. I’m going to say, “what odds will you give me that it’s zero?” And so those are the two basic problems, “guess what the answer is” and “give me odds.” And of course, this string may never have happened before but I still want you to guess. Here’s another string: the second one [in Figure 1] is a little bit more complicated, zero one zero, one
Corresponding Author: Gordon Cormack, David R. Cheriton School of Computer Science, 2502 Davis Centre, University of Waterloo, Waterloo, Ontario N2L 3G1, Canada; Email: [email protected]
G. Cormack / Data Compression Models for Prediction and Classification
one zero, one one one zero, one one one one zero and so on. What’s the next? Again, we can say it’s probably “one” and I’ll say that’s not good enough. I want to know “how probably” it is “one”. And so, in a nutshell, I want to know, “given a string what is the probability distribution for what the next symbol might be?”
Figure 1. A simple prediction game
Here [Figure 2] is another version of the game. I have a number of email messages from an inbox, let’s say I have 100,000 of them and I pick one of the ones that happens to have “ai.stanford.” in it. What’s the next symbol after “ai.stanford.?” Any guesses? An E. Well, one of the messages has an E, but in fact your answer was underspecified, because in the data set there are actually two forms of E, lowercase “e” and uppercase “E.” Any guesses which is the more common? Well, if you guessed lowercase “e,” you’d be wrong. In this particular corpus there was one lowercase “e” and there were 509 uppercase E’s [Figure 3]. And not only that, I’m going to make an observation that was not part of the game: All 509 of the uppercase E’s were in spam messages. And the one lowercase E was the one legitimate message. So what further predictions might we make from this? First of all, if we want to know, “what’s spam or not,” capital E is a very good predictor of spam, at least in this data set. And we might be tempted to say 509 to 1 odds. That would just be an approximation, but it’s a common way of doing approximation, you count proportions of things you’ve seen before and assume that
that’s going to predict a proportion that you’ll see in advance and therefore the probability. Now, Paul [Kantor] don’t bother with your question about, “you know, the world isn’t a stochastic place?” It isn’t, but in this case it’s modeled fairly well by this. The point is if you see a lowercase E, you’ve seen a lowercase E once and it was not a spam message, so does that mean that we can predict from seeing another lowercase E that it’s not a spam message? Our odds are one to zero, that’s infinite. So, we are 100% sure. Well, that would be a really bad assumption. And probably wrong, from our intuition as to what this data means. Here is an even worse assumption: If we see “ai.stanford” in an email message the chances that it’s spam are 509 to 1. That’s almost certainly a specious assumption inferred from this particular data. But it’s not completely obvious just from looking at this example without any intuition.
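One standard remedy for the infinite-odds problem just described is add-one (Laplace) smoothing. This is a textbook fix sketched for illustration, not something the speaker prescribes here; the pseudo-count keeps the estimate finite even when one of the observed counts is zero:

```python
def smoothed_odds(spam_count, ham_count, alpha=1.0):
    """Add-alpha (Laplace) smoothed odds that a symbol indicates spam.
    The pseudo-count alpha keeps the estimate finite even when one of
    the observed counts is zero."""
    return (spam_count + alpha) / (ham_count + alpha)

# Uppercase 'E': 509 spam, 1 non-spam -> strong but finite evidence of spam.
print(smoothed_odds(509, 1))   # 255.0, rather than 509:1
# Lowercase 'e': 0 spam, 1 non-spam -> leans non-spam, but no 100% certainty.
print(smoothed_odds(0, 1))     # 0.5, rather than infinite odds against spam
```

The smoothed estimates encode exactly the intuition in the text: seeing one lowercase "e" in one legitimate message is weak evidence, not certainty, and 509-to-1 observed counts should not be read as 509-to-1 odds.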
Figure 2. Guess the next symbol
Here is another version of the game. Well, this is basically the same game. I have two email messages, or at least fragments of email messages here and I want to find the spam or I want to find the non-spam. Just to make this a little bit more fun -- as they do in game shows, sometimes you have to make a weird noise or say a strange word when you have the answer -- if you see spam you have to color it black and if you see non-spam you have to color it gray. So here [Figure 4] is actually what my program
Figure 3. Observed frequencies of “e” and “E”
that plays this game does. It doesn’t completely color the messages black or gray; it colors the bits of the message that it thinks are spammy black and bits that are nonspammy, gray. And it does a great job on the spam message. One of the reasons it can do a great job on the spam messages is that somebody architected this message to try to beat spam filters. They split it up so that there are none of these features or “bag of words” things that all of the machine learning people who have talked here already love so much. There just aren’t any in this message. And that will beat somebody who depends on putting things in words, but it certainly doesn’t beat this particular technique. It [the technique] is not so unequivocal on the non-spam message. But again, if you look at it “from a distance” it’s more gray than it is black and this supports the view that one should combine all sources of evidence in making a decision. You have to step back and look at the whole picture; the whole picture is pretty clear. The first one was identified as spam because it has this chopped up feature. In the non-spam my name was non-spammy and there were various other words and bits of words that indicated non-spamminess.
Figure 4. Spam coloring game
I’m going to talk very briefly about a method -- the method I use to do this coloring. But I’m also going to talk about that in the context of … as an academic I love to play games, …. but really this is in support of a genuine application. We should always be asking, “if I learn to play this game really well does it really help to solve the spam filtering problem?” So I need to look at a bigger context and I need to ask if what I am measuring about this actually predicts whether this is solving the email reading problem. I’m also going to talk about other applications, or other versions of this game. But first, before I talk about the methods, I want to give you one more game. This is an adversarial game, it’s called Rock-Paper-Scissors or Roshambo, and it’s played with a rock and paper and scissors. Well, it’s actually not played with a rock and paper and scissors, it’s played with hand gestures that sort of look like these things and two people simultaneously choose one of these things. The game has a cyclical dominance graph [Figure 5]. If you get dominated you lose and if you dominate you win. So, the key to the game is actually to guess what the other person is going to play at the same time as you are and then to beat them. And of course, we are not going to use hands here; we’re going to write down an R for rock, and a P for paper and so on. And then we’ll pair these things up, but together they’re just a string of characters [Figure 5, bottom right], or in other words a spam message [Figure 4]. So what we want to do once again is to read a bunch of stuff that’s been played and then we want to say, “well, what’s the next move my opponent is going to make?” and then I want to beat him. How do I do that? I consider the move that he is most likely going to make and I play the move that will beat that. But in fact it’s not quite as simple as that,
because I am also very much concerned about my opponent being able to predict my behavior. So I have a big trade-off between telling my opponent what I know and beating my opponent on this particular round. In short, there is the cost (well, benefit) of winning this round versus the cost of what I communicate to my opponent -- not only what I communicate to my opponent but what I communicate perhaps to a third-party observer. If this is a “Rock, Paper, Scissors” tournament I’m playing, the person I’m up against next is watching how I play, trying to decide how good I am. And others may be watching me as well. So, I should really introduce enough noise so that I only just play well enough to beat my opponent; or I only play well enough to do what I need to qualify for the next round. In any event, there are these two trade-offs, but that doesn’t really change the fact that underlying the game, the main thing we need is to predict the opponent’s next move with an odds ratio or with odds.
Figure 5. Rock-paper-scissors game
All right, now let’s go back to the main game of which these are all instances [Figure 1]. We have a sequence of symbols -- without loss of generality, they’re all bits -- and we want to predict the next one. The particular method that I am going to talk about today is called Dynamic Markov Modeling, and it was something I first did in the 1980’s (actually for data compression). It’s an extremely simple algorithm. But I am going to oversimplify it even further; (If you want the code, not the code for the game,
but the code for the model, it’s on the Web right now; search for “dmc.c”; if that’s not enough put in my name). The entire compressor is 250 lines of code, and approximately half of that is the model. So it really is as trivial as it seems. The way you work this model is: you put your finger on A [Figure 6] and then if you see a 1 you move your finger to B, and if you see a zero you move your finger back to A and so on. And you do this in sequence. Then at any given time, when you follow one of the arrows you increment the count on that arrow. So finally, the counts on the two arrows that your finger might follow -- the “one” and the “zero” -- give the odds estimate; the ratio of the counts is the odds estimate. It is that simple. Now there’s nothing special about this, this is a very trivial Markov model; they’ve been talked about already. Where it gets a little bit more interesting is that after a while you say, “wait, I’ve had my finger on B a lot”. “Not only that, I’ve gone from A to B a lot.” And so whenever you go from A to B and hit some threshold you say, “oh, I think I’m going to make a new version of B.” So you clone B and you divert some of the traffic from A -- in particular, all the traffic from A to B [Figure 7] -- to your new cloned version of B. All the other traffic you leave where it was. So you start a “bypass” for this particular path. And what happens is, this grows and grows and grows and eventually yields a big model that is a variable-order Markov model. You don’t have to worry about back-off probabilities and everything. There are a few arbitrary constants, such as what you initialize these weights to. But when you actually do this cloning operation, you simply split the traffic in proportion to how often you visited it from this place or state, versus from other places. So anyway, that’s DMC; that is as much as I’m going to say about the actual model other than it plays exactly our game [Figure 8].
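The finger-on-the-state walk and the cloning step can be sketched in a few lines. This is a highly simplified toy, not the real dmc.c: the initial edge weights, the cloning threshold, and the cloning criterion (edge traffic high, and the target also reached from elsewhere) are illustrative choices, but the mechanics match the description above -- counts on the two outgoing edges give the odds, and a heavily used edge triggers a clone of its target with the counts split proportionally:

```python
class ToyDMC:
    """Simplified sketch of a Dynamic Markov Compression model.
    Each state has two outgoing edges (bit 0 / bit 1) carrying visit
    counts; their ratio is the odds estimate for the next bit."""

    def __init__(self, threshold=2.0):
        self.counts = {0: [0.2, 0.2]}  # small nonzero priors avoid 0/0 odds
        self.next = {0: [0, 0]}        # start with one self-looping state
        self.state = 0
        self.threshold = threshold     # illustrative cloning threshold

    def p_one(self):
        """Estimated probability that the next bit is 1."""
        c0, c1 = self.counts[self.state]
        return c1 / (c0 + c1)

    def update(self, bit):
        """Follow the edge for `bit`, bump its count, maybe clone."""
        s = self.state
        t = self.next[s][bit]
        self.counts[s][bit] += 1
        edge = self.counts[s][bit]
        t_total = sum(self.counts[t])  # proxy for total visits to t
        # Clone t when this edge carries enough traffic and t also
        # receives traffic from elsewhere.
        if edge > self.threshold and t_total > edge:
            new = max(self.counts) + 1
            ratio = edge / t_total
            # Split t's counts in proportion to the diverted traffic.
            self.counts[new] = [c * ratio for c in self.counts[t]]
            self.counts[t] = [c * (1 - ratio) for c in self.counts[t]]
            self.next[new] = list(self.next[t])
            self.next[s][bit] = new    # the "bypass": s now goes to the clone
            t = new
        self.state = t

model = ToyDMC()
for ch in "110110110110110":          # the repeating pattern from Figure 1
    model.update(int(ch))
print(f"{len(model.counts)} states, P(next bit = 1) = {model.p_one():.2f}")
```

Fed the repeating "110" pattern from the opening guessing game, the model grows new states and its per-state counts begin to reflect the pattern; the real DMC differs in its constants and in tracking transition traffic exactly rather than via the outgoing-count proxy used here.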
Figure 6. A Markov model
Figure 7. Dynamic Markov model
Figure 8. Spam to Ham likelihood ratio
Figure 9. Combining likelihood ratios
Now, if you want to predict something bigger than a single bit, it’s easy, you just predict all the bits in sequence. And then it doesn’t really make much difference, but you can either average the predictions or you can multiply the predictions together or you can sum them. Anyway, you can combine them in a number of ways, and actually for the purpose of this presentation it doesn’t matter how you combine them. Basically, you are taking the average prediction over the entire string; that’s how you are predicting an email. That’s essentially what Figure 9 says. So what we do with email now is this: if we are trying to predict spam or non-spam, we take all the spam we’ve ever seen and stick it all together into a sequence and then predict how likely it is that the new message we are trying to judge would occur in a list of spam messages. And we also predict how likely it is to occur in a list of non-spam messages, and that’s our odds. The method is absolutely as simple as that; we take the ratio of those -- well, actually it’s more convenient to take the log of the odds, just so when we sum these logarithms of ratios we get something meaningful. But the question then is: how well does this work? Well, I can tell you “it works great”. But let me do two things. First of all, I want to talk about measures of how well it actually plays the game. To do this, I want to show you a little picture of the context. In the context of this conference I want to stress that I think we should always be doing this -- you know, drawing a stick diagram like this [Figure 10] asking “where does this mathematical game actually fit in the overall picture?” For the case of email, the typical email reader looks something like this: there is an incoming stream of mail, the filter puts it into a mail file or into a quarantine file, and you routinely read your mail file. So it annoys you if there’s spam in the mail file.
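The two-model scoring scheme just described can be sketched concretely. The snippet below substitutes a first-order character model with add-one smoothing for the DMC bit model -- a simplification for illustration, not the actual method -- but the classification logic is the same: score the new message by its summed log-probability under a model trained on spam and under one trained on non-spam, and take the difference as the log-odds. The tiny training messages are invented:

```python
import math
from collections import defaultdict

def train(texts):
    """First-order character model (bigram counts), a simplified
    stand-in for the DMC bit model."""
    counts = defaultdict(lambda: defaultdict(int))
    for t in texts:
        for a, b in zip(t, t[1:]):
            counts[a][b] += 1
    return counts

def log_prob(model, text, alphabet):
    """Summed log-probability of text under the model, add-one smoothed."""
    lp = 0.0
    for a, b in zip(text, text[1:]):
        total = sum(model[a].values()) + len(alphabet)
        lp += math.log((model[a][b] + 1) / total)
    return lp

spam = ["buy cheap meds now", "cheap meds cheap meds"]   # invented examples
ham = ["meeting at noon today", "see you at the meeting"]
alphabet = set("".join(spam + ham))

spam_model, ham_model = train(spam), train(ham)
msg = "cheap meds"
log_odds = log_prob(spam_model, msg, alphabet) - log_prob(ham_model, msg, alphabet)
print("spam" if log_odds > 0 else "ham", round(log_odds, 2))
```

Summing log-probabilities over the whole string is what lets every character contribute a little evidence, which is exactly why the chopped-up spam message in Figure 4 is still caught even though it contains no intact "bag of words" features.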
If you feel that there’s some (psychological) benefit to venting your emotional state, you
might even yell at the filter and say, “how could you misclassify this?” and if the filter is smart it will say “well, I’m sorry I’ll try not to do that again”. In fact, this is what the DMC model can do; it can grow and clone and learn the kinds of mail that are spam and non-spam. Now, I should mention that no user will ever tell you the correct classification for every mail. But they might, if the errors are rare enough, tell you about the errors that they notice or at least some substantial fraction of the errors that they notice. So what you have to do is to assume (by default) that you did it right unless you hear otherwise; this is an example of bootstrapping or positive feedback. If you’re right often enough it works. The quarantine file is more difficult because, in general, the user doesn’t look at it. And the better your filter is the less likely the user is to look at it. Of course, if you are perfect that’s fine; but if you are not perfect there might be some misclassified mail. In general, looking in the spam file is a completely different information retrieval task from reading actual email. The quarantine file is something that the user has to search. Maybe she’s bored and just wants to look through it and see if there are any needles in the haystack, i.e., errors. Or maybe, she just booked an airline flight and didn’t receive the message and says, “I bet there’s a message from Expedia™ in there.” You have to put these factors all together in order to figure out what the cost, the downside of a misclassification is. It didn’t really cost her that much to lose that travel agent booking because she was expecting it; even though it was valuable information, she was expecting it and she knew exactly how to get it. On the other hand, say, Paul sent me a message the day before my flight saying “the workshop is cancelled; don’t come!” and it went into my spam file, that could have disastrous consequences. 
Happily, that happens to be the kind of message that this method would be very unlikely to get wrong, because he and I had corresponded already and the filter would have had ample opportunity to develop a positive attitude about this kind of message. To sum up, when we measure misclassification rates they can be extremely misleading, because 1 in 100 errors or 1 in 1000 errors could be disastrous. If errors are one in 1,000, but they are all critical messages that I wasn’t going to get by some alternate channel and got lost, that would be unacceptable. On the other hand, if the advertising for my frequent flyer plan gets lost, who cares? So, what I want to do is, I want to measure error rates. And what I’m not going to do is to talk about precision or recall or F-measure. I can’t, in my mind, get any picture for what precision, recall, or F-measure would mean in terms of this overall picture; it wouldn’t mean anything to me. On the other hand, I’ve already mentioned measurements that do mean something even though they may not be strictly proportions. That is, the number of spam, or proportion of spam, misclassified, and the proportion of non-spam misclassified. So we can measure one against the other, because I have a “paranoia threshold” and I can vary it from “no paranoia” to “lots of paranoia” and just plot one measure against the other. Now this is plotted on the log odds scale [Figure 11]; otherwise it is exactly the ROC curve that Paul talked about earlier, and the black line is what our DMC does and all the other lines are what the best other known spam filters do. The number, the area under the ROC curve, is very high -- perhaps 0.999 -- so, for whatever reason, in this particular application classifiers are two or three orders of magnitude more accurate than they are for others like Reuters or 20 Newsgroups.
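Sweeping the "paranoia threshold" and plotting one error rate against the other, as in Figure 11, amounts to the following computation. The scores here are made-up log-odds values, purely for illustration:

```python
def error_tradeoff(ham_scores, spam_scores, thresholds):
    """For each 'paranoia' threshold, report the fraction of good mail
    misclassified as spam and the fraction of spam let through.
    Scores are log-odds: above the threshold -> classified as spam."""
    curve = []
    for th in thresholds:
        ham_err = sum(s > th for s in ham_scores) / len(ham_scores)
        spam_err = sum(s <= th for s in spam_scores) / len(spam_scores)
        curve.append((th, ham_err, spam_err))
    return curve

# Hypothetical log-odds scores assigned by a filter.
ham_scores = [-5.1, -3.2, -4.8, -0.5, -2.9]
spam_scores = [4.2, 3.1, 0.2, 5.0, 2.7]

for th, ham_err, spam_err in error_tradeoff(ham_scores, spam_scores, [-1, 0, 1]):
    print(f"threshold {th:+d}: ham misclassified {ham_err:.0%}, "
          f"spam accepted {spam_err:.0%}")
```

Raising the threshold (more "paranoia" required before calling something spam) trades spam leakage for fewer misclassified good messages; plotting the full sweep on a log-odds scale gives exactly the kind of curve shown in Figure 11.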
Figure 10. Context for the spam coloring game
Figure 11. Quantitative results for spam filters [1]
Not only that, DMC is pretty mediocre for some of those standard text classification tasks but it really works in this application. I want to go back if I can to Figure 11. A reasonable place to look at this graph would be at the first X tick, at 0.1%; that is, one in one thousand good emails misclassified. And if you look at the curve at x=0.1% you find that the filter gives slightly less than one percent spam misclassification, so it gets rid of more than 99% of spam while only misclassifying about one in a thousand good messages. Again, it is useful to think about 1 in 1000 just as a proportion, because these are rare occurrences, so let’s look at them for some spam filters. Unfortunately, my spam filter isn’t in this table [Figure 12] but it still illustrates the idea. Of 9000 good email messages the best filters here misclassified 6 or 7. And if you categorize these by genre you’ll see that personal communication with known correspondents, news clipping services, mailing lists and non-delivery messages in general are not the things that are misclassified. What gets misclassified is advertising, and the occasional cold call. Now, look down at some of the poorer filters and you’ll notice that right at the bottom of the list you actually have two filters that are very heavily touted on the Internet, to such an extent that I might even give them as an example of a disinformation campaign. They just misclassify everything -- personal email, bad email, it’s all the same.
Figure 12. Qualitative “genre classification” [2]
All in all, we have to measure, and we have to use measurements that are meaningful and scientifically controllable. I can tell you that this works in a number of applications [Figure 13] because I can test it. If I can’t test it what can I do? Try it out and see if it seems to work; hire an advertising company to promote it and then it doesn’t matter whether it works because I have enough money anyway. What I can say for sure is that DMC works for data compression, because I can test it. It works well for spam detection -- all the variants: viruses, phishing, different kinds of spam on the Web, and log spam. It’s completely insensitive to the language of discourse and to the character encoding technique; it works great for multi-media, for multi-part MIME, and so on. It works pretty well for plagiarism and authorship detection, and works pretty well for intrusion detection. I’ve already mentioned game playing, but in this case I mean outright game playing. Is it good for terrorism? I think it is, but terrorism is such an amorphous term I have no idea what you are talking about. So, if you have well-defined tasks -- better still, well-defined tasks and data -- please talk to me. If you can send me your data -- which you will probably tell me is classified -- that’s great, but if not, I can send you some software and you can run your data through it and send me back one of those ROC curves and I’ll be extremely happy.
Figure 13. DMC applications
As far as “promoting DMC” goes, I think I’ve actually said everything here. There are no explicit features, it handles heterogeneous data, it’s adaptive, it learns on the job, there’s none of this “find a training set and do this and then freeze the training set,” so in terms of this concept drift and so on, it automatically captures new things that happen. I’ve already talked about visual output. It’s extremely difficult for an adversary to defeat. And here’s yet another example, and this is my final slide [Figure 14] which shows the application of DMC to web spam -- web pages that have no purpose other than to redirect you to other pages and to mislead search engines. This is DMC applied to the host name only, not even the URL, and you can see it actually does a pretty good job here as well.
Figure 14. Web spam hostnames
Thank you.

References

[1] Andrej Bratko, Gordon V. Cormack, Bogdan Filipič, Thomas R. Lynam and Blaž Zupan, Spam filtering using statistical data compression models, J. Machine Learning Research 7 (Dec. 2006), 2673-2698.
[2] Gordon V. Cormack and Thomas R. Lynam, On-line supervised spam filtering, ACM Trans. Information Systems 25 (July 2007).
Security Informatics in Complex Documents
G. AGAM 1,a, D. GROSSMAN 2,a and O. FRIEDER 3,a,b
a Illinois Institute of Technology, b Georgetown University
Abstract. Paper documents are routinely found in general litigation and criminal and terrorist investigations. The current state of the art in processing these documents is to simply OCR them and search strictly the text. This ignores all handwriting, signatures, logos, images, watermarks, and any other non-text artifacts in a document. Technology, however, exists to extract key metadata from paper documents, such as logos and signatures, and match these against a set of known logos and signatures. We describe a prototype that moves beyond simple OCR processing of paper documents and relies on additional document artifacts rather than only on text in the search process. We also describe a benchmark developed for the evaluation of paper document search systems.
Keywords. Information Retrieval, Text Mining, Document Segmentation
Introduction

Investigating terrorism on the Web requires the analysis of documents that are often complex. Automated analysis of complex documents is, therefore, a crucial component. Consider for example the following case: On September 29, 2002, during an episode of 60 Minutes, the reporter, Lesley Stahl, broadcast a story called “The Arafat Papers.” During the story the following dialog occurred:

STAHL: (Voiceover) The Israelis captured tens of thousands of documents when they bulldozed into Arafat's compound in Ramallah in March. Now the Palestinian Authority's most sensitive secrets are stacked in a sea of boxes in an Israeli army hangar.

Colonel MIRI EISIN: It's basically all of their files, all of their documents, everything that we could take out.
1 Corresponding Authors: G. Agam, Illinois Institute of Technology, 10 West 31st Street, Chicago, IL, 60610; E-mail: [email protected] 2 D. Grossman, Illinois Institute of Technology, 10 West 31st Street, Chicago, IL 60610; E-mail: [email protected] 3 O. Frieder, Georgetown University, Washington, DC 20057; E-mail: [email protected]; He is on leave from IIT.
G. Agam et al. / Security Informatics in Complex Documents
Clearly, searching this document collection is an example of a problem that involves searching text, signatures, images, logos, watermarks, etc. While it is true that search companies such as GoogleTM and Yahoo!TM have made search technology a commodity, they fail to support the searching of complex documents. A complex document, or informally a "real world, paper document”, is one that comprises not only text but also figures, signatures, stamps, watermarks, logos, and handwritten annotations. Furthermore, many of these documents are available in print form only. That is, the documents must first be scanned so as to be in digital format, and their scan (image) quality is often poor. Searching complex documents (such as those found in the Arafat Papers) involves the integration of image processing techniques such as, but not limited to, image enhancement, layer separation, optical character recognition, and signature and logo detection and identification, as well as information retrieval techniques including relevance ranking, relevance feedback, data integration and style detection. To date no such system is available. Searching such a collection often involves discarding all document components other than text and then searching the text with a conventional search engine. Yet another problem is evaluation. Currently, even if a complex document search system did exist, it would not be possible to scientifically evaluate it; this is a direct consequence of the lack of an existing benchmark. Search systems are evaluated using benchmarks, e.g., the various NIST TREC data sets (see trec.nist.gov for details), and the lack of benchmarks prevents any meaningful evaluation. To advance the state of the art of complex document search, our effort focused on the development of a complex document information processing prototype and an evaluation benchmark, the IIT CDIP data set.
1. The CDIP Collection

For a data set to be of lasting value, it must meet, challenge, and exceed the demands of multiple application domains. These applications require a collection that:

• Covers a richness of input in terms of a range of formats, lengths, and genres, and variance in print and image quality;
• Includes documents that contain handwritten text and notations, diverse fonts, multiple character sets, and graphical elements, namely graphs, tables, photos, logos, and diagrams;
• Contains a sufficiently high volume of documents;
• Contains documents in multiple languages, including documents that have multiple languages within the same document;
• Contains a vast volume of redundant and irrelevant documents;
• Supports diverse applications and thereby includes private communications within and between groups planning activities and deploying resources;
• Is publicly available at minimal cost and licensing.
The collection chosen is a subset of the Master Settlement Agreement documents hosted by the University of California at San Francisco as the Legacy Tobacco
Document Library (see http://legacy.library.ucsf.edu). These data were made public via legal proceedings against United States tobacco industries and research institutes. For the most part, the documents are distributed free of charge and are free of copyright restrictions. (The sued parties did not own a few of the Legacy Tobacco Document Library documents included; hence, some of them are potentially subject to copyright restrictions.) The collection consists of roughly 7 million documents, or approximately 42 million scanned TIFF-format pages (about 1.5 TB). These documents are predominantly in English; however, there are some documents in German, French, Japanese, and a few other languages, and a few documents include multiple languages within a single document. Because multiple companies at multiple sites scanned the pages using a diversity of scanners, the resulting image quality varies significantly. As search benchmark collections require queries with their associated relevant documents indicated, we developed in excess of 50 such queries of varying complexity for the Legacy Tobacco Document Collection. This benchmark collection is, however, only in its "infancy" stage: it currently suffers from rather limited coverage of query topics and a low number of relevant documents per query. Nonetheless, the collection was successfully used for the NIST TREC Legal Track in both 2006 and 2007. A complete description of the collection is provided in [1].
2. The CDIP Prototype

2.1. Functional Components

Our prototype comprises an integrated tool suite, based on several existing technologies, implementing three core CDIP functionalities: document image analysis, named-entity recognition, and integrated retrieval. The prototype also facilitates the later inclusion of a fourth core technology: data mining. As noted, specific attention was paid to modular design to ensure that the developed software modules are easily integrated into different task-level applications. Document image analysis extracts information from raster-scanned images, such as the overall structure of the document [2], the content of text regions, the location of images/graphics, the location of logos, the location of signatures and handwritten comments [3], and the identification of signatures [4, 5, 6, 7]. It should be noted that OCR of machine-printed text in real-world documents has limited accuracy (depending on the quality of the input documents), so the textual features obtained are unavoidably noisy. Named-entity recognition identifies meaningful entities, such as people and organizations, in textual components; our prototype relies on Clarabridge™ technology. Our initial tests on real-world data showed that the effectiveness of entity extraction on noisy text obtained from OCR of our test collection drops to 70% of its performance on noise-free text. Integrated retrieval from different kinds of data sources is the key high-level function. Such integrated retrieval is possible through the IIT Intranet Mediator technology [8, 9], which is capable of integrating traditional data sources such as unstructured text and semi-structured XML/text data, as well as structured database querying.
A rule-based source selection algorithm selects those data sources most relevant to an information request, enabling the system to take full advantage of domain-specific searching techniques, such as translation of a natural language request
into a structured SQL query. Results are then fused into an integrated retrieval set [10]. Although the IIT Mediator is protected by an issued patent, which guarantees us unconstrained free use of the technology, the existing mediator implementation is only at the prototype level. Consequently, as we needed a more robust framework on which to build our CDIP prototype, we built it using the Clarabridge™ integration fabric. Data mining, a component not implemented in our current prototype, will leverage text, metadata, and information extracted from complex documents. Our approach allows the application of traditional data mining and machine learning methods to discover relationships in the data, such as association rules [11] and document clusters [12]. We will further develop routines to find correlations in document descriptors (for example, possible relationships between the author of a document and particular language styles). Note that data mining was not targeted in our initial implementation of the system prototype but is a goal for follow-on efforts.

2.2. Software Architecture

The prototype's architecture (Figure 1) is designed as a generic framework for integrating component technologies with appropriate APIs and data format standards through SOAP (Simple Object Access Protocol), allowing different subsystems for component tasks to be "plugged in." Our current effort integrates available components with little emphasis on the development of new ones. The workflow of the system consists of three main processes: a document ingestion process, a data transition process, and a document querying process. The document ingestion process is a straightforward pipeline that consists of:

• Low-level image processing for noise removal, skew correction, orientation determination, and document and text region zoning (using Abbyy™'s SDK and the DocLib package [2]);
• OCR in text regions (using Abbyy™'s SDK), recognition of logos (using the DocLib package [2]), and recognition of signatures (using CEDAR's signature recognition system [3, 4, 5]) together with a signature warping module [6, 7];
• Linguistic and classification analysis of the extracted information for annotation in the database: entity tagging, relationship tagging, and stylistic tagging in text regions (using Clarabridge™ software).
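The three ingestion stages above can be sketched as a minimal pipeline. Every function below is a trivial stand-in we invented for the corresponding Abbyy, DocLib, CEDAR, or Clarabridge component; only the stage structure reflects the text.

```python
# Stand-ins for the commercial/research components named in the text.
def enhance_image(img):        # stage 1: noise removal, skew correction, ...
    return img

def zone_page(img):            # stage 1: document and text region zoning
    return {"text": img, "graphics": img}

def run_ocr(region):           # stage 2: OCR engine stand-in
    return "OCRed text"

def detect_logos(region):      # stage 2: logo recognition stand-in
    return []

def detect_signatures(region): # stage 2: signature recognition stand-in
    return []

def tag_entities(text):        # stage 3: entity/relation/style tagging stand-in
    return []

def ingest(page_image):
    """Run one scanned page through the three-stage ingestion pipeline."""
    zones = zone_page(enhance_image(page_image))            # stage 1
    record = {
        "text":       run_ocr(zones["text"]),               # stage 2
        "logos":      detect_logos(zones["graphics"]),
        "signatures": detect_signatures(zones["graphics"]),
    }
    record["entities"] = tag_entities(record["text"])       # stage 3
    return record
```

In the actual system each path runs as its own thread (see Section 2.2); the sequential sketch only shows the data dependencies.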
At the end of the ingestion process, we have an operational data store in third normal form (3NF). At this point, it would be too complex to perform sophisticated roll-up or drill-down computations along various data dimensions. Hence, we transition the ingested data from 3NF into a multidimensional star schema. This is a common technique for analyzing structured data, and it is well known to dramatically improve decision support. Using this structure for complex document metadata results in a scalable query tool that can quickly answer questions like "How many documents do we have from Fortune 500 companies?" and then quickly drill into different market sectors (e.g., manufacturing companies, IT companies, etc.). At the center of this process are tools from Clarabridge™. These tools use web services to access the point solutions and identify metadata about complex documents to populate the 3NF schema. Clarabridge™ tools also migrate the 3NF schema to a star
schema using well-known Extract, Transform, and Load (ETL) processing. Clarabridge™ is a startup dedicated to applying well-known structured-data techniques, such as the star schema, to the integration of structured data and text. Because the analysis of document images involves errors inherent to the automated interpretation process, each attribute in the database is associated with a probability that indicates the confidence in the value obtained from the corresponding point solution. Finally, following the ETL process, a query tool accesses both an inverted index of all text and the star schema to integrate structured results. A key capability facilitated by our approach is a tight integration of the processes of document image interpretation, symbol extraction and grounding, and information retrieval. This integrated approach could increase the reliability of all of these processes: constraints on image interpretation, based on consistency with other data, can improve its reliability, and gaps in the database can potentially be filled in at retrieval time by reinterpreting image data using top-down expectations based on user queries. Due to its added complexity, this tight integration model is not followed in our current implementation of the system prototype. A summary of the CDIP architecture is presented in Figure 1. Each component in this figure is a separate thread, so processing is fully parallelized and pipelined. Image files are served to processing modules dealing with different types of document image information. The Abbyy™ OCR engine extracts text from the document image; this text is fed to the Clarabridge™ information extraction module, which finds and classifies named entities and relations. Signatures are segmented and then fed to CEDAR's signature recognition system, which matches document signatures to known signatures in a database.
Logos are segmented and matched using the DocLib package. These three threaded processing paths are then synchronized, and the data extracted are transformed into a unified database schema for retrieval and analysis.
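To make the 3NF-to-star transition concrete, here is a toy star schema in SQLite. The table and column names and the sample rows are our invention, not the actual Clarabridge-built schema; the two queries mirror the roll-up ("how many documents from Fortune 500 companies?") and sector drill-down described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One dimension table and one fact table: a minimal star schema.
    CREATE TABLE dim_company (company_id INTEGER PRIMARY KEY,
                              name TEXT, sector TEXT, fortune500 INTEGER);
    CREATE TABLE fact_document (doc_id INTEGER PRIMARY KEY,
                                company_id INTEGER REFERENCES dim_company,
                                ocr_confidence REAL);
    INSERT INTO dim_company VALUES (1, 'Acme Mfg', 'manufacturing', 1),
                                   (2, 'Initech',  'IT',            1),
                                   (3, 'Corner Shop', 'retail',     0);
    INSERT INTO fact_document VALUES (10, 1, 0.9), (11, 1, 0.7),
                                     (12, 2, 0.8), (13, 3, 0.6);
""")

# Roll-up: how many documents come from Fortune 500 companies?
(total,) = conn.execute("""
    SELECT COUNT(*) FROM fact_document f
    JOIN dim_company c USING (company_id)
    WHERE c.fortune500 = 1""").fetchone()

# Drill-down: break the same count out by market sector.
by_sector = conn.execute("""
    SELECT c.sector, COUNT(*) FROM fact_document f
    JOIN dim_company c USING (company_id)
    WHERE c.fortune500 = 1
    GROUP BY c.sector ORDER BY c.sector""").fetchall()
```

The `ocr_confidence` column stands in for the per-attribute confidence values mentioned above.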
3. Document Image Analysis Components

3.1. Document Image Enhancement

Given an image of a faded, washed-out, damaged, crumpled, or otherwise difficult-to-read document, one with mixed handwriting, typed or printed material, and possibly pictures, tables, or diagrams, it is necessary to enhance its readability and comprehensibility. Documents might contain multiple languages on a single page and both handwritten and machine-printed text, and the machine-printed text might have been produced using various technologies of variable quality. The approach we developed [13] addresses automatic enhancement of such documents and is based on several steps: the input image is segmented into foreground and background, the foreground image is enhanced, the original image is enhanced, and the two enhanced images are blended using a linear blending scheme. The use of the original image in addition to the foreground channel allows for foreground enhancement while preserving qualities of the original image. In addition, it compensates for errors that might occur in the foreground separation.
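A minimal numerical sketch of this two-channel scheme, assuming a simple global threshold for the segmentation step and a contrast stretch as the per-channel enhancement; both are stand-ins for the operators of [13], and the parameter names are ours.

```python
import numpy as np

def enhance_document(img, fg_threshold=0.5, blend_alpha=0.5):
    """Two-channel enhancement sketch: segment, enhance each channel,
    linearly blend. `fg_threshold` and `blend_alpha` stand in for the
    two interactive parameters described in the text."""
    img = img.astype(np.float64) / 255.0          # normalize to [0, 1]
    # 1. Foreground/background separation (placeholder global threshold;
    #    dark pixels are treated as ink).
    foreground_mask = img < fg_threshold
    # 2. Enhance the foreground channel: push ink toward black,
    #    background toward white.
    fg_enhanced = np.where(foreground_mask, img * 0.5, 1.0)
    # 3. Enhance the original channel: a simple contrast stretch.
    lo, hi = img.min(), img.max()
    orig_enhanced = (img - lo) / max(hi - lo, 1e-9)
    # 4. Linear blending of the two enhanced channels.
    return blend_alpha * fg_enhanced + (1.0 - blend_alpha) * orig_enhanced
```

Because the blend is a per-pixel linear combination, re-rendering the document for new parameter values is immediate, matching the interactive behavior described below.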
Figure 1. Architectural overview of the current CDIP research prototype.
The enhancement process we developed produces a document image that can be viewed in different ways using two interactive parameters with simple and intuitive interpretations. The first parameter controls the decision threshold used in the foreground segmentation; the second controls the blending weight of the two channels. Using the decision threshold, the user can increase or decrease the sensitivity of the foreground segmentation process. Using the blending factor, the user can control the level of enhancement: at one end of the scale the original document image is presented without any enhancement, whereas at the other end the enhanced foreground is displayed by itself. Note that applying these two adjustable parameters is immediate once the document image has been processed. Adjusting the parameters is not necessary; it is provided to enable different views of the document as the user deems necessary. For automated component analysis, the parameters can be set automatically.

3.2. Logo Detection

Our approach to logo detection is based on two steps: detection of distinct document zones and classification of the detected zones. For efficiency reasons, some heuristics incorporating the expected location of logos are used to reduce the candidate set. We detect distinct zones using the DOCLIB library [2]. Our approach uses automated means of training a classifier to recognize a document layout or set of layouts [14]; the classifier is then used to score an unknown image. For page segmentation, we use the Docstrum method for structural page layout analysis [15, 16]. The Docstrum method is based on bottom-up, nearest-neighbor clustering of page
components. It detects text lines and text blocks, and it has three main advantages over many other methods: independence from skew angle, independence from different text spacings, and the ability to process local regions of different text orientations within the same image. Script identification [17] for machine-printed document images can be used to increase reliability. This approach classifies a document image as being printed in one of the following scripts: Amharic, Arabic, Armenian, Burmese, Chinese, Cyrillic, Devanagari, Greek, Hebrew, Japanese, Korean, Latin, or Thai; it can also be retrained to focus on different language mixes. Once the zones are detected, logo detection identifies blocks with certain spatial and content characteristics, including the relative position of the zone's center of mass, the aspect ratio of the zone's bounding box, the relative area of the bounding box, and the density of the bounding box. The features are tuned based on a training set of documents.

3.3. Logo Recognition

Logo recognition is performed by matching candidate regions against a database of known logos. While it is possible to match logos by extracting and matching feature vectors, it has been shown that direct correlation of bitmaps produces better results [18, 19]. To improve the correlation measure, we first normalize the logos to a standard size and orientation and then sum the products of corresponding elements in the bitmaps; the computed correlation measure is the standard gray-scale correlation. For each candidate, a score between 0 and 100 is generated corresponding to the degree of similarity, and the best match is provided along with its score. To improve performance, the algorithm stops comparing against candidate logos when the best score is beneath a predefined threshold. Text associated with logos could assist in their recognition but is not currently used in our system.
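The normalized gray-scale correlation score can be sketched as follows, assuming the candidate and template have already been normalized to the same size and orientation. The 0–100 mapping follows the description above; the early-termination threshold is omitted from this sketch.

```python
import numpy as np

def logo_similarity(candidate, template):
    """Score a candidate region against a known logo on a 0-100 scale
    using standard (mean-centered, normalized) gray-scale correlation."""
    a = candidate.astype(np.float64).ravel()
    b = template.astype(np.float64).ravel()
    a -= a.mean()
    b -= b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0:                      # a constant image carries no signal
        return 0.0
    corr = float(np.dot(a, b) / denom)  # correlation in [-1, 1]
    return max(0.0, corr) * 100.0       # map to a 0-100 similarity score

def best_match(candidate, logo_db):
    """Return (name, score) of the best-matching known logo.
    `logo_db` maps logo names to normalized template bitmaps."""
    return max(((name, logo_similarity(candidate, t))
                for name, t in logo_db.items()),
               key=lambda pair: pair[1])
```

An identical candidate and template score exactly 100; anti-correlated bitmaps score 0.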
3.4. Signature Detection

Signature detection is performed by zoning the document, as described above, and analyzing the resulting zones for signatures [20]. In analyzing zones for signatures, line and word segmentation are necessary. The process of automatic word segmentation [21] begins with obtaining the set of connected components for each line in the document image. The interior contours or loops in a component are ignored for the purpose of word segmentation, as they provide no information for this purpose. The connected components are grouped into clusters by merging minor components, such as dots above and below a major component. Every pair of adjacent clusters is a candidate for a word gap. Features are extracted for such pairs of clusters, and a neural network is used to determine whether the gap between the pair is a word gap. Possible features are: the width of the first cluster, the width of the second cluster, the difference between the bounding boxes of the two clusters, the number of components in each of the two clusters, the minimum distance between the convex hulls enclosing the two clusters, and the ratio of the sum of the areas enclosed by the convex hulls of the individual clusters to the total area inside the convex hull enclosing the clusters together. The minimum distance between convex hulls is calculated by sampling points on the convex hull of each connected component and taking the minimum distance over all pairs of such points.
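The sampled minimum hull distance, together with two of the simpler gap features, can be sketched as follows. Here each cluster is given directly as a list of sampled hull points, and the selection of features (and their names) is ours.

```python
import math

def min_hull_distance(hull_a, hull_b):
    """Brute-force minimum distance over all pairs of sampled hull
    points, as described in the text."""
    return min(math.dist(p, q) for p in hull_a for q in hull_b)

def gap_features(cluster_a, cluster_b):
    """A small subset of the candidate word-gap features: the widths of
    the two clusters and the sampled minimum hull distance. In the real
    system these would feed a trained neural network."""
    width = lambda pts: max(x for x, _ in pts) - min(x for x, _ in pts)
    return (width(cluster_a), width(cluster_b),
            min_hull_distance(cluster_a, cluster_b))
```

For two clusters sampled at `[(0, 0), (1, 0)]` and `[(4, 0), (5, 3)]`, the nearest pair of points is `(1, 0)`/`(4, 0)`, giving a hull distance of 3.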
3.5. Signature Recognition

Signature recognition works by obtaining feature vectors for signatures and measuring the similarity between the feature vectors of compared signatures. Image warping techniques can be used to increase the similarity between signatures before comparing them. The signature feature extraction approach we employ [4, 5, 6, 7] takes the block of the image identified as a potential signature and partitions it into rectangles such that the size of each rectangle is adapted to the content of the signature. Each rectangle is examined for multiple features (e.g., curvature of lines, principal directions, fill ratio). The obtained feature vector is then compared to a database of known signatures represented in the same way; the closest-matching signatures are identified as possible candidates.
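The matching step can be sketched as a nearest-neighbor search over feature vectors. The feature extraction itself (adaptive rectangles, curvature, fill ratio) is assumed to have already produced the vectors, and the Euclidean metric is our simplification of the similarity measure.

```python
import math

def closest_signatures(query_vec, signature_db, k=3):
    """Return up to k (name, distance) pairs from the signature database,
    nearest first. `signature_db` maps signer names to feature vectors
    of the same length as `query_vec`."""
    ranked = sorted(
        ((name, math.dist(query_vec, vec))
         for name, vec in signature_db.items()),
        key=lambda pair: pair[1])
    return ranked[:k]
```

Returning several candidates rather than a single decision matches the text: the closest matches are offered as possible candidates for downstream verification.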
4. Performance Evaluation

The rich collection of attributes our system associates with each document (including words, linguistic entities such as names and amounts, logos, and signatures) enables both novel forms of text retrieval and the evidence-combining capabilities of a relational database. We have finished the initial implementation of our research prototype and are currently evaluating it quantitatively. The evaluation uses a subset of several hundred document images that were manually labeled for authorship (based on signatures), organizational unit (based on logos), and various entity tags based on textual information (such as monetary amounts, dates, and addresses). The evaluated tasks include authorship-based, organization-based, monetary-based, date-based, and address-based document image retrieval. In each experiment, precision and recall are recorded as a function of a decision threshold. This experiment is expected to be expanded in the near future to include a larger subset of several tens of thousands of document images. We realize that this testing methodology cannot be extended to still larger subsets, as it requires complete manual labeling, which is labor intensive. Consequently, effectiveness on larger subsets will be evaluated by inserting manually labeled document images containing unique labels into them; the uniqueness of these labels guarantees that documents with similar labels should not otherwise exist within the subset. While we have, as yet, no quantitative evaluations to report, we illustrate here the kinds of capabilities that our prototype currently supports. The mini-corpus used for this consists of 800 documents taken from the IIT CDIP benchmark collection. We consider integrated queries that our prototype makes possible for the first time.
We apply conjunctive constraints on document image components to a straightforward document ranking based on total query-word frequency in the OCRed document text. Once the metadata are populated using the logo and signature processing components, SQL queries easily associate textual and non-textual data. One query involving currency amounts found in text showed that Dr. D. Stone, who was active during 1986, was associated with a company whose logo template is "liggett.tif", with dollar amounts between $140K and $1.68M, and with several other persons such as Dr. Calabrese. By clicking on the document ID, the user is presented with the original documents for full examination.
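A hedged sketch of this kind of integrated query: rank documents by total query-word frequency in the OCRed text, subject to conjunctive constraints on image-derived metadata. The field names (`text`, `logo`) and the sample documents are invented for illustration.

```python
def rank(docs, query_words, **constraints):
    """Rank documents matching all metadata constraints by total
    query-word frequency in their OCRed text, highest first."""
    def satisfies(doc):
        # Conjunctive constraints: every metadata field must match.
        return all(doc.get(key) == value for key, value in constraints.items())

    def score(doc):
        words = doc["text"].lower().split()
        return sum(words.count(w.lower()) for w in query_words)

    hits = [d for d in docs if satisfies(d)]
    return sorted(hits, key=score, reverse=True)
```

For example, `rank(docs, ["amount"], logo="liggett.tif")` keeps only documents whose detected logo matched that template and orders them by how often "amount" appears in their OCR text.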
5. Conclusion

As stated throughout, the complex document information processing area of research is only in its infancy. We have developed an initial prototype but have yet to evaluate it effectively. We have, however, created a benchmark that should stress all foreseeable complex document information processing systems. This benchmark was already used to evaluate search systems in recent TREC activities; we can only hope that its availability will inspire further research into the design of complex document information processing systems.
Acknowledgements

We gratefully acknowledge Shlomo Argamon, David Doermann, David Lewis, Gene Sohn, and Sargur Srihari for their vast contributions to the CDIP effort. We also thank the March 2005 CDIP Workshop participants for their suggestions.
References

[1] D. Lewis, G. Agam, S. Argamon, O. Frieder, D. Grossman, and J. Heard, "Building a test collection for complex document information processing," in ACM Twenty-Ninth Conference on Research and Development in Information Retrieval (SIGIR), (Seattle, Washington), August 2006.
[2] K. Chen, S. Jaeger, G. Zhu, and D. Doermann, "DOCLIB: a document processing research tool," in Symposium on Document Image Understanding Technology, pp. 159–163, 2005.
[3] S. N. Srihari, C. Huang, and H. Srinivasan, "A search engine for handwritten documents," in Proc. Document Recognition and Retrieval XII, pp. 66–75, SPIE, (San Jose, CA), January 2005.
[4] S. Chen and S. N. Srihari, "Use of exterior contours and word shape in off-line signature verification," in Proc. Intl. Conference on Document Analysis and Recognition, pp. 1280–1284, (Seoul, Korea), August 2005.
[5] S. N. Srihari, S. Shetty, S. Chen, H. Srinivasan, C. Huang, G. Agam, and O. Frieder, "Document image retrieval using signatures as queries," in IEEE Intl. Conf. on Document Image Analysis for Libraries (DIAL), pp. 198–203, 2006.
[6] G. Agam and S. Suresh, "Particle dynamics warping approach for offline signature recognition," in IEEE Workshop on Biometrics, pp. 38–44, 2006.
[7] G. Agam and S. Suresh, "Warping-based offline signature recognition," IEEE Trans. Information Forensics and Security, 2007. Accepted for publication.
[8] D. Grossman, S. Beitzel, E. Jensen, and O. Frieder, "IIT Intranet Mediator: Bringing data together on a corporate intranet," IEEE IT Professional 4(1), pp. 49–54, 2002.
[9] J. Heard, J. Wilberding, G. Frieder, O. Frieder, D. Grossman, and L. Kane, "On a mediated search of the United States Holocaust Memorial Museum data," in Sixth Next Generation Information Technology Systems, (Sefayim, Israel), July 2006.
[10] S. Beitzel, E. Jensen, A. Chowdhury, D. Grossman, O. Frieder, and N. Goharian, "On fusion of effective retrieval strategies in the same information retrieval system," Journal of the American Society of Information Science and Technology 55(10), 2004.
[11] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. I. Verkamo, "Fast discovery of association rules," in Advances in Knowledge Discovery and Data Mining, pp. 307–328, American Association for Artificial Intelligence, 1996.
[12] M. Steinbach, G. Karypis, and V. Kumar, "A comparison of document clustering techniques," in Proc. SIGKDD Workshop on Text Mining, 2000.
[13] G. Agam, G. Bal, G. Frieder, and O. Frieder, "Degraded document image enhancement," in Document Recognition and Retrieval XIV, X. Lin and B. A. Yanikoglu, eds., Proc. SPIE 6500, pp. 65000C-1–65000C-11, 2007.
[14] L. Golebiowski, "Automated layout recognition," in Symposium on Document Image Understanding Technology, pp. 219–228, 2003.
[15] L. O'Gorman, "The document spectrum for page layout analysis," IEEE Trans. Pattern Analysis and Machine Intelligence 15(11), pp. 1162–1173, 1993.
[16] S. Mao, A. Rosenfeld, and T. Kanungo, "Document structure analysis algorithms: a literature survey," in Proc. SPIE Electronic Imaging, 5010, pp. 197–207, 2003.
[17] J. Hochberg, P. Kelly, T. Thomas, and L. Kerns, "Automatic script identification from document images using cluster-based templates," IEEE Trans. Pattern Analysis and Machine Intelligence 19(2), pp. 176–181, 1997.
[18] G. Zhu, S. Jaeger, and D. Doermann, "Robust stamp detection framework on degraded documents," in Proc. Intl. Conf. Document Recognition and Retrieval XIII, pp. 1–9, 2006.
[19] D. S. Doermann, E. Rivlin, and I. Weiss, "Applying algebraic and differential invariants for logo recognition," Machine Vision and Applications 9(2), pp. 73–86, 1996.
[20] Y. Zheng, H. Li, and D. Doermann, "Machine printed text and handwriting identification in noisy document images," IEEE Trans. Pattern Analysis and Machine Intelligence 26(3), pp. 337–353, 2004.
[21] S. Srihari, H. Srinivasan, P. Babu, and C. Bhole, "Spotting words in handwritten Arabic documents," in Proc. SPIE, pp. 101–108, 2006.
Security Informatics and Terrorism: Patrolling the Web
C.S. Gal et al. (Eds.)
IOS Press, 2008
© 2008 IOS Press. All rights reserved.
Visual Recognition: How can we learn complex models?

Joachim M. BUHMANN 1
Institute of Computational Science
Swiss Federal Institute of Technology ETH Zurich, Switzerland
Abstract. Visual objects are composed of parts: a body, arms, legs, and a head for a human, or wheels, a hood, a trunk, and a body for a car. This compositional structure significantly limits the representation complexity of objects and renders learning of structured object models tractable. Adopting this modeling strategy, I describe a system which (i) automatically decomposes objects into a hierarchy of relevant compositions and (ii) learns such a compositional representation for each category without supervision. Compositions are represented as probability distributions over their constituent parts and the relations between them. The global shape of objects is captured by a graphical model which combines all compositions. Experiments on large standard benchmark data sets underline the competitive recognition performance of this approach and provide insights into the learned compositional structure of objects.

Keywords. Object recognition, compositionality, learning, graphical models
Good afternoon, and thank you very much for the kind introduction. Let me begin my presentation by stating that I have not worked on terrorist activity detection or on detecting terrorist events. The reason I was invited to this conference is my research specialty, machine learning. Many of the fundamental challenges you face when patrolling the Internet or searching through large data archives require algorithmic solutions for pattern recognition and pattern analysis. Such algorithms have to learn regularities and patterns of large-scale data sets in a data-driven way. Machine learning and modern artificial intelligence provide methods and concepts for these data analysis challenges. So I will now present a project on visual recognition, i.e., on recognizing objects in images. What is the computer science problem? You are confronted with diverse types of data: you have images, and these images most often have a common theme. Based on the statistics of gray and color values and other features in the image, you have to assign a label to the image. Such an image category might be a motorbike, and you should not confuse it with the car category, the person category, or the plane category. We would like to find the correct categories in an automatic way. Therefore, learning of these image categories should not be hand-tailored by humans; the statistics should speak for themselves, and the algorithm should find a representation which actually warrants the label given to that image.

1 Corresponding Author: Joachim M. Buhmann, Institute of Computational Science, CAB G 69.2, Universitaetstrasse 6, 8092 Zurich, Switzerland; Email: [email protected]
J.M. Buhmann / Visual Recognition: How Can We Learn Complex Models?
Figure 1. Learning of image categories should be achieved without hand-segmented training exemplars.
Automatic image categorization belongs to a research area which has attracted my interest for a long time: unsupervised learning. In unsupervised learning, an explicit teacher is missing, and we do not have access to the correct labels during training. The learning algorithm is not explicitly instructed which decision should be made for a specific training example. So what is the unsupervised learning part in object recognition? The procedure works as follows: I give you the label for the image (e.g., Figure 1), but what I don't provide you with is the evidence of where in the image you actually find the information that this image belongs to the motorbike category. A fairly common approach in the literature assumes that you have access both to the image and to a segmentation of semantically meaningful objects in the image. Often, however, only the global image label is available, and a detailed segmentation is ambiguous or involves substantial human labor. We work under the hypothesis that such side information should not be required for training. The algorithm should detect by itself which part of an image supports the global image label; therefore, we only require the log(k) bits needed to specify one of k image labels as supervision information. That's what we call unsupervised learning, but you might also call this type of learning weakly supervised, because we still require the correct label for each training image. Now why is such a form of learning a grand challenge in computer vision? Actually, the challenge has been open since 1959, when Marvin Minsky, one of the forefathers of artificial intelligence, gave this project of image understanding to a graduate student. We are still trying to finish this graduate student project. The real problem of object categorization is related to the fact that objects come in many different varieties, in different poses, different colors, and so on.
In Figure 2 you can see a small sample of the image categories in Caltech 101, one of the standard benchmark databases.
Figure 2. Some of the 101 image categories from the Caltech 101 image database.
To give you numbers: in this experiment we are confronted with 101 image categories, which could be arranged in a hierarchy, and essentially you are provided with between 15 and 100 exemplars per category. On the basis of this information you should learn what speaks for a particular category in a given image. There exist a number of design principles for learning algorithms which might apply to other detection problems as well, and so it is worthwhile to think about these principles:

1. Decoupling: When learning object descriptions for computer vision, we frequently look for image evidence which decouples the data, e.g., pixel sets. That means a hidden variable like a motorbike template might explain a set of different pixels. Estimating this source decouples these foreground pixels from the background.

2. Modularity: Another key idea for learning is modularity. You would like to keep things which belong together as an entity (often also called a symbol) and separate them from other groups of objects, pixels for example, which are only loosely coupled. Such a grouping gives rise to modules.

3. Hierarchies: Similarly, we would like to discover hierarchies in our data. Hierarchical dependencies can be learned efficiently, and they enable very efficient search procedures for decision making.
On the learning side, these structural properties of image representations relate to efficient learnability. If your model has a highly modular structure, then there exists a realistic chance to learn it with a small number of samples. This constraint of few samples per class precisely describes the situation which we are facing in object recognition, since we would like to classify objects of a new class after having inspected as few as half a dozen exemplars.

It is particularly important in such a learning scenario to validate learning results. Validation of learning denotes a procedure which tests the learned model on new data. If the model generalizes to new examples, then we have validated it and the learning results can be used on future examples. For a couple of years, we have explored the concept of stability for validation, i.e., we test whether the learning result remains stable when we repeat the learning process on a second sample from the same source. In image analysis, we test our recognition process on images of the same category but with different arrangements of the objects; e.g., if a recognition system has learned the category "tiger image", then it should also be able to correctly classify a test image with a tiger hunting an antelope, although the training image set does not contain such a scene. The stability method requires that the description of an object (like the tiger in our example) does not change too much from one scene to the next but remains stable. This principle selects models which are insensitive to noise in the data but which are also sufficiently discriminative.

Let me describe a recognition system which has been developed by my student Björn Ommer. The system is conceptually based on a face recognizer which I built together with C. von der Malsburg at the end of the 1980s and early 1990s. That recognizer employed an elastic grid for face classification and it was one of the first working systems for access control.
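The stability test can be sketched in a few lines of Python. This is a hypothetical toy example, not the actual recognition system: the "learner" here simply fits a 1-D decision threshold, and we measure how often two models trained on independent samples from the same source disagree on fresh data.

```python
import random

def train_threshold(sample):
    """Toy 'learner': fit a 1-D decision threshold as the midpoint
    between the means of the two classes (labels 0 and 1)."""
    xs0 = [x for x, y in sample if y == 0]
    xs1 = [x for x, y in sample if y == 1]
    return (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2.0

def instability(sample_a, sample_b, test_points):
    """Fraction of test points on which models trained on two
    independent samples of the same source disagree."""
    t_a, t_b = train_threshold(sample_a), train_threshold(sample_b)
    return sum((x > t_a) != (x > t_b) for x in test_points) / len(test_points)

random.seed(0)
def draw(n):
    # Two well-separated classes: class 0 around 0, class 1 around 4.
    return ([(random.gauss(0, 1), 0) for _ in range(n)] +
            [(random.gauss(4, 1), 1) for _ in range(n)])

score = instability(draw(50), draw(50), [i / 10.0 for i in range(-20, 60)])
print(score)  # a small value: the learned model is stable across samples
```

A low instability score indicates that the model is insensitive to the noise in any particular sample, which is the selection principle described above.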
Now in the recent work by Ommer and myself we rely on the concept of composition systems for recognition and training. In particular, we use a class of statistical models which are called graphical models.
Historically, graphical models have been introduced in machine learning as a generalization of artificial neural networks, which have proven very valuable in technical learning applications like digit recognition, protein classification, etc. In particular, graphical models are well suited for the modeling requirements of computer vision. They are intuitive, at least for us, since their graphical representation captures the relevant statistical dependencies you care about. A schematic sketch of a composition system is shown in Figure 3.
Figure 3. An object recognition system based on compositions as intermediate representations of object components. The upper box describes a transformation of the image into a set of relevant compositions. The final classification is achieved by the recognition part where we estimate the position of an object and its most likely class.
What is the idea of compositionality and of composition systems? Objects can appear in images in an unimaginably large number of different ways depending on lighting conditions, pose, shape and other imaging factors. The object identity is not
affected by these influence factors, and recognition systems effectively have to separate these appearance factors from the information which is relevant for classification. Compositionality can be seen as a strategy to infer parts which are highly correlated with object identity and to encode appearance factors by relations between these parts. Such a highly structured object representation is expected to capture the complexity and richness of real-world scenes. You decompose a novel image (like the airplane in Figure 3) into parts and you combine them into compositions. Among all possible compositions, only a small subset is relevant for correct and reliable classification, and we select those according to their class discriminability. In the recognition stage, the object recognition system then estimates the position of the object and determines the most likely class of the image. Such a representation of image information avoids storing many complicated templates but encodes objects by their components and their relative arrangement. This idea was strongly promoted by Stuart Geman and, with a neural computing emphasis, by C. von der Malsburg.

Composition systems use a small number of generic parts as their key design concept. These parts are pieces of your data which occur sufficiently often to code them as separate entities. More complex representations are then assembled by combining these parts into new generic features. Such an assembly is called a composition. You have to store the relations between these generic parts, and these relations might be quite indicative for a particular object category. If you want to recognize a face as in Figure 4a, you might concentrate on the edges of the mouth, store their appearance and their relation to each other, store some information from the chin, maybe some of the eyes, and that gives you a description of what you have here.
Figure 4. A prototypical (a) and a distorted (b) face as processed by a face recognition system. Feature vectors which are attached at the grid nodes measure the local image content and, thereby, characterize the image category.
What is the decision making involved in image categorization? The decision making for the image in Figure 4 means that you can classify the image intensities as a
face image. But even more, you can tell where in the image you find the evidence for this face image. So you have to localize the object. This behavior is important not only to inspect what the system is actually doing, but also to ensure that the system does not learn artifacts of the database which have nothing to do with the object itself but with your limited ability to set up a really challenging benchmark.
Figure 5. Bayesian network that couples compositions G_i, shape, and image categorization C. Shaded nodes denote evidence variables. X_i is the spatial location of a composition, whereas r_kl represents the distance vector between two compositions. The sets Γ_1 and Γ_2 denote the set of compositions and the set of compositions of compositions, respectively. The random variable X codes the position of an object.
The complete system is now constructed in a hierarchical fashion. Let us assume that we would like to interpret the motorbike image in Figure 1; then we locally extract statistics of image features. For example, we determine color features and we measure local texture features which are indicative for a small part of an image; these local descriptors are called atomic parts. The atomic parts are assembled into what we call candidate compositions. The categorization system basically hypothesizes which combinations of atomic parts might be indicative for the motorbike label of this image. After this preprocessing step, the categorization machine selects the relevant compositions by maximizing the probability of the category label given the composition. For this purpose, we have developed a maximum a posteriori type of approach to filter relevant compositions out of a large set of candidates.

In addition to image labeling, we would also like to encode where in the image we have found the evidence for the selected image label. This information is represented by a small graphical model, which is depicted in Figure 5. I do not want to enter into a discussion of the details, but all vertices in the graph denote random variables. For example, the random variable for the category of an image is C, and the random variable for the object center is X. The location where in the image you find the evidence for an object is encoded in the upper dashed box by X_j and G_j. The lower box summarizes the compositions of compositions with their relative distance vectors r_kl. They all have relations between each other, they all have positions, and they cover to some degree the pixels in the image which give rise to a particular label. Some
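The maximum a posteriori filtering step can be illustrated with a minimal sketch. The composition names and posterior values below are invented for illustration; the real system estimates the posteriors from image features.

```python
def select_relevant(candidates, posterior, label, threshold=0.5):
    """Keep candidate compositions g whose estimated posterior
    P(category = label | composition g) exceeds the threshold."""
    return [g for g in candidates if posterior[g].get(label, 0.0) > threshold]

# Hypothetical posteriors P(C | G) for three candidate compositions.
posterior = {
    "wheel+wheel": {"motorbike": 0.85, "car": 0.10},
    "handlebar":   {"motorbike": 0.70, "bicycle": 0.25},
    "sky_texture": {"motorbike": 0.05, "airplane": 0.40},
}
relevant = select_relevant(posterior, posterior, "motorbike")
print(relevant)  # ['wheel+wheel', 'handlebar']
```

Only compositions whose posterior strongly supports the hypothesized category survive; the background composition is discarded.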
researchers have compared such an approach to image interpretation with parsing the image. In parallel to the extraction of compositions we form compositions of compositions, which defines a hierarchical approach. The logical next step is to filter out relevant compositions of compositions and to use them for decision making. In the end, the categorization system calculates the probability of a category given the image evidence that is captured by the atomic parts, the position of the object, the values of the compositions, and the values of the compositions of compositions. Figure 3 visualizes the structure of the model without the hierarchical part.

The training of these composition systems is quite delicate since we require sufficient regularization at every level of learning. Only very few images are available to estimate the compositions; for some of the categories we actually have to learn from only 15 to 20 images. To avoid the notorious problem of overfitting, a cross-validation scheme for learning the categories is used in the following way: (i) we select a set of irrelevant images which do not contain the object of interest; (ii) we assemble a set of relevant images for a particular category and subdivide it into two subsets of equal size; (iii) we train the composition candidates on the first half of the image set. Then we learn whether compositions and compositions of compositions are relevant or not. Relevance determination is achieved by scoring the compositions on the second half of the image set. Thereby, we select the discriminant representations of category evidence. The set of irrelevant images determines the acceptance level for discriminant compositions. A schematic summary of the learning procedure is shown in Figure 6.
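Steps (i)–(iii) can be sketched as a toy procedure. The simplifications are hypothetical: images are reduced to sets of part tokens, and the discriminance score to an occurrence frequency, which stands in for the actual composition scoring.

```python
def train_candidates(images):
    # Stand-in for composition learning: each "composition" is just a
    # token that occurs in a training image (hypothetical simplification).
    return {tok for img in images for tok in img}

def score(composition, images):
    """Fraction of images in which the composition fires (toy score)."""
    return sum(composition in img for img in images) / len(images)

def learn_discriminant(relevant, irrelevant):
    half = len(relevant) // 2
    train, validate = relevant[:half], relevant[half:]
    candidates = train_candidates(train)            # step (iii), first half
    # Acceptance level determined by the irrelevant image set.
    accept = max((score(c, irrelevant) for c in candidates), default=0.0)
    # Keep compositions that score higher on the held-out relevant half
    # than on any irrelevant image.
    return {c for c in candidates if score(c, validate) > accept}

relevant = [{"wheel", "handlebar"}, {"wheel", "seat"},
            {"wheel", "handlebar"}, {"wheel", "engine"}]
irrelevant = [{"cloud", "seat"}, {"tree"}]
print(learn_discriminant(relevant, irrelevant))  # {'wheel'}
```

Only the part that fires consistently on held-out relevant images and rarely on irrelevant ones is kept as category evidence.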
Figure 6. Training of our object recognition system requires a cross-validation scheme to filter out discriminant compositions. The critical discriminance score is estimated by using the irrelevant images.
Figure 7. Alpha matting of pixels according to their importance for category discrimination. Opaque pixels (white) have not contributed to classifying this image as an airplane image. The fully transparent pixels contribute the most to the decision for the category "airplane". This visualization of pixel relevance reveals that the composition machine has learned the concept of an airplane with a body, wings and turbines.
What has the composition machine learned at the end of training? This subtle question cannot be answered in a simple way. The information flows from image intensities to local histograms, which are then composed into parts. These parts are further assembled into compositions and compositions of compositions. Every step in the information flow is accompanied by information condensation and loss of irrelevant image content. Since the model is generative down to the features extracted from the image, we can easily sample new features from the graphical model. How these features correspond to pixel configurations cannot be decided in a unique fashion, since this mapping is not invertible. However, we can answer the question of what an image patch from our database of natural images should look like to explain the synthesized features. In Figure 7 we have printed the original image together with a pixel weighting in the form of transparency. Fully transparent pixels with the original color carry the largest weight for the decision; vice versa, opaque pixels appear as white and are irrelevant for the categorization decision.

When we test these types of systems on image databases like the Caltech 101 database or the Pascal database, we measure confusion tables which encode the probabilities of predicting category i when the true category is j. Consequently, for a reliable system we should find high values on the diagonal and low values on the off-diagonal elements. The total detection rate is 59% for the system without multiscale optimization and 62% for the system with all features to detect objects at multiple scales. If you switch off the compositionality part of the system, the performance drops to a 33% categorization rate.
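Confusion tables and the total detection rate mentioned above are straightforward to compute; below is a small sketch with made-up labels (not the reported Caltech 101 experiment).

```python
def confusion(true_labels, predicted, classes):
    """counts[i][j] = number of examples with true class j predicted as i,
    i.e. rows enumerate the predicted category, columns the true one."""
    counts = {i: {j: 0 for j in classes} for i in classes}
    for t, p in zip(true_labels, predicted):
        counts[p][t] += 1
    return counts

def detection_rate(true_labels, predicted):
    """Fraction of examples whose predicted category matches the true one."""
    return sum(t == p for t, p in zip(true_labels, predicted)) / len(true_labels)

true_y = ["plane", "plane", "bike", "bike", "bike"]
pred_y = ["plane", "bike",  "bike", "bike", "plane"]
cm = confusion(true_y, pred_y, ["plane", "bike"])
print(cm["bike"]["bike"], detection_rate(true_y, pred_y))  # 2 0.6
```

Normalizing the counts by the total number of examples yields the joint empirical probabilities that the figures in this chapter visualize.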
We have extended this approach to design and build a composition machine for video streams as well. For video, we have to address the additional difficulty of processing the time domain. Objects can appear and disappear in the image and you might encounter multiple instances of an object that can define the object class. The setting for categorization has to be re-evaluated since we might have multiple category labels which apply to a particular image sequence.
Figure 8. Confusion matrix between categories, depicted with a grey-value code of probabilities. Rows enumerate the predicted category whereas columns denote the true category. A black matrix element (i,j) indicates that the joint empirical probability of predicted category i and true category j assumes a high value; a light element denotes a low value. A perfect result of the composition system would be achieved if the diagonal elements were black or dark and the off-diagonal elements white or light grey.
To give the time dimension an appropriate treatment, we extract the optical flow of image segments from the image stream. These additional features are then again used to enrich the compositions and the compositions of compositions. Unsupervised training of image parts, as well as discriminative selection of relevant compositions, allows the categorization machine to achieve competitive categorization results on non-calibrated videos which have been acquired by a standard consumer electronics video camera. The system does not yet perform the video processing in real time, but the current implementation is within a factor of ten of the video frame rate.

Let me summarize what I consider to be relevant for patrolling the Web and for cyber crime detection on the Internet. In my view, the compositionality approach to pattern recognition demonstrates that our learning technology today is powerful enough
to learn complex statistical models if we have weakly labeled data with regularities. These models might contain hundreds or thousands of random variables that have to be estimated. The composition system is still learnable in very moderate time frames: we are not talking about high-performance computing over days, but about computations that are performed on desktop computers within minutes. We now have a fairly good understanding of how to extract such models from data in a weakly supervised way by just exploiting regularities.

I believe that machine learning will also help us to unravel the semantics behind the information flood on the Internet and that it might provide us with the ability to find solutions for some of the difficult security issues raised in this session. Beyond the pattern recognition question, we have to address the issue of how to interpret events or sequences of events on the Internet. Not all suspicious activities are forbidden or illegal, and quite often we can find various interpretations for the activity traces which we can extract from log-files, email exchanges, page clicks, etc. The described system detects regularities in massive amounts of data and it might be suitable to support operators or security personnel who are trained to understand the semantics behind such patterns. The self-learning abilities endow the composition system with a high degree of autonomy and fault tolerance, and in combination with a human in the loop, we might be able to build the next generation of surveillance and security systems.
Acknowledgements The research results surveyed in this paper are part of B. Ommer’s Ph.D. thesis.
References

B. Ommer and J.M. Buhmann, "Learning Compositional Categorization Models", Proceedings of the European Conference on Computer Vision 2006, Lecture Notes in Computer Science 3753, eds. Ales Leonardis, Horst Bischof and Axel Pinz, Springer Verlag, Vol. III, pp. 316-329, (2006).
B. Ommer and J.M. Buhmann, "Learning the Compositional Nature of Visual Objects", 2007 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2007), Minneapolis, IEEE Computer Society, (2007).
B. Ommer, "Learning the Compositional Nature of Objects for Visual Recognition", PhD thesis, D-INFK ETH Zurich, October 2007.
S. Geman, D.F. Potter and Z. Chi, "Composition Systems", Technical Report, Division of Applied Mathematics, Brown University, (1998).
S. Geman, "Hierarchy in Machine and Natural Vision", Proceedings of the 11th Scandinavian Conference on Image Analysis, 1999.
M. Lades, J.C. Vorbrüggen, J.M. Buhmann, J. Lange, C. von der Malsburg, R.P. Würtz and W. Konen, "Distortion Invariant Object Recognition in the Dynamic Link Architecture", IEEE Trans. Computers 42, 300-311, (1993).
C. von der Malsburg, "Dynamic Link Architecture", The Handbook of Brain Theory and Neural Networks, The MIT Press, Second Edition, pp. 365-368, (2002).
Security Informatics and Terrorism: Patrolling the Web
C.S. Gal et al. (Eds.)
IOS Press, 2008
© 2008 IOS Press. All rights reserved.
Approaches for Learning Classifiers of Drifting Concepts

Ivan KOYCHEV 1,2
Faculty of Mathematics and Informatics, University of Sofia, Bulgaria
Abstract. Numerous counterterrorist activities on the Web have to distinguish between terror-related and non-terror-related items. In this setting, Machine Learning algorithms can be employed to construct classifiers from examples. Machine Learning applications often face the problem that real-life concepts tend to change over time, so that some of the classifiers learned from old observations become out-of-date. This problem is known as concept drift. It seems to be doubly valid for terrorists acting on the Web, because they want to avoid being tracked. This paper gives a brief overview of the approaches that aim to deal with drifting concepts. Further, it describes in more detail two mechanisms for dealing with drifting concepts, which are able to adapt dynamically to changes by forgetting irrelevant data and models. The presented mechanisms are general in nature and can be an add-on to any concept learning algorithm. Results from experiments that give evidence for the effectiveness of the presented approaches are reported and discussed.

Keywords. Concept Learning, Drifting Concepts
Introduction

One of the major problems related to detecting terrorism on the Web is to distinguish between terror-related and other items (e.g., documents, user behaviour profiles, etc.) [1]. For this purpose a classifier needs to be built. Often these systems employ Machine Learning algorithms to construct classifiers from examples (labelled instances of the cases) [2]. While such a system works, new examples arrive continuously and need to be added to the training dataset. This requires the classifier to be updated regularly to take the new set of examples into account. We can expect that with time the classification accuracy will improve, because we are using a larger training set. However, a classifier built on all previous examples can decrease in accuracy on new data, because real-life concepts tend to change over time. Hence, some of the old observations that are out-of-date or out-of-context have to be "forgotten", because they produce noise rather than useful training examples. This problem is known as concept drift [3]. A prominent example is systems that learn from observing user profiles. These systems face the problem that users are inclined to change over time or when the context is shifting ([4], [5], [6], [7], [8]). This concept drift is even more pronounced for terrorist actions on the Web, because the terrorists are most likely trying to hide.

1 Associated with the Institute of Mathematics and Informatics at the Bulgarian Academy of Science.
2 Corresponding Author: Ivan Koychev, 5 J. Bouchier Street, Sofia 1164, Bulgaria; Email: [email protected]
I. Koychev / Approaches for Learning Classifiers of Drifting Concepts
177
A number of context-aware "forgetting" mechanisms have been developed to cope with this problem. They can be divided into two major types, depending on whether they are able only to forget, or whether they are also able to keep old data/knowledge and recall them if they seem to be useful again. The first type of mechanism has the advantage of being simpler and faster, and in many cases it provides good enough solutions. A typical example is the time window approach. It performs well with concepts that do not change often and change gradually rather than abruptly, but it performs very poorly on frequently changing and recurring concepts. The second type of mechanism has a clear advantage in the case of recurring concepts (e.g., caused by seasonal changes or other transformations of context), where the relevant old data or acquired knowledge can be very helpful. However, it requires more storage space for the old information and extra time for retrieving and adapting/weighting the old relevant episodes of data or knowledge.

This paper describes two forgetting mechanisms which belong to the forget-only type: the first approach is a time window with weights which decrease over time [6]; it thus simulates a gradual forgetting, which seems to be more natural. The second approach dynamically adapts the size of the time window according to the current concept behaviour [9]. This paper reports the results from experiments with these two forgetting mechanisms. The objectives of the experiments are to test and compare the approaches, as well as to explore the possibility of combining them for further improvement of the model's performance.

The next section gives a short overview of related work. The forgetting mechanisms are described in section 2. The experimental design is introduced in section 3. The results from the experiments are reported and discussed in section 4.
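A plain, fixed-size time window — the first, abrupt-forgetting type of mechanism — can be sketched in a few lines. The majority-vote "learner" below is a hypothetical stand-in for any concept learning algorithm.

```python
def window_retrain(stream, window_size, fit):
    """Retrain after each new example on only the most recent
    `window_size` examples, forgetting everything older."""
    window, models = [], []
    for example in stream:
        window.append(example)
        if len(window) > window_size:
            window.pop(0)                  # abruptly forget the oldest example
        models.append(fit(list(window)))
    return models

# Toy "learner": the majority label inside the current window.
fit = lambda data: max(set(lbl for _, lbl in data),
                       key=lambda c: sum(lbl == c for _, lbl in data))

# A stream whose concept flips from label 0 to label 1 halfway through.
stream = [(x, 0) for x in range(6)] + [(x, 1) for x in range(6, 12)]
models = window_retrain(stream, window_size=4, fit=fit)
print(models[5], models[11])  # 0 1 -- the window lets the learner track the flip
```

A learner trained on the whole stream would still be dominated by the old concept at the end; the window discards it, at the cost of abruptly losing all older examples.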
1 Related Work
Different approaches have been developed to track changing (also known as shifting, drifting or evolving) concepts. Typically it is assumed that if the concept changes, the old examples become irrelevant to the current period. The concept description is then learned from a collection of the most recent examples, called a time window. For example, Mitchell et al. [4] developed a software assistant for scheduling meetings, which employs machine learning to acquire assumptions about individual habits of arranging meetings and uses a time window to adapt faster to the changing preferences of the user. Widmer and Kubat [10] developed the first approach that uses an adaptive window size. The algorithm monitors the learning process, and if the performance drops below a predefined threshold it uses heuristics to adapt the time window size dynamically. Maloof and Michalski [5] have developed a method for selecting training examples for a partial memory learning system. The method uses a time-based function to provide each instance with an age; examples that are older than a certain age are removed from the partial memory. Delany et al. [11] employ a case-base editing approach that removes noise cases (i.e., cases that contribute incorrectly to the classification) to deal with concept drift in a spam filtering system. The approach is very promising, but it seems applicable to lazy learning algorithms only. To manage the problems of gradual concept drift and noisy data, the approach in [12] suggests the use of three windows: a small one (with fixed size), a medium one and a large one (dynamically adapted by simple heuristics).
The pure time window approaches totally forget the examples that are outside the given window, or older than a certain age, while the examples which remain in the partial memory are equally important for the learning algorithm. This is an abrupt forgetting of old examples, which probably does not reflect their rather gradual aging. To deal with this problem, it was suggested to weight the training examples in the time window according to their appearance over time [6]. These weights make recent examples more important than older ones, essentially causing the system to gradually forget old examples. This approach has been explored further in [13], [14] and [15].

Some systems use different approaches to avoid the loss of useful knowledge learned from old examples. The CAP system [4] keeps old rules as long as they are competitive with the new ones. The FLORA system [10] also maintains a store of concept descriptions relevant to previous contexts. When the learner suspects a context change, it will examine the potential of previously stored descriptions to provide better classification. The COPL approach [7] employs a two-level scheme to deal with changing and recurring concepts. On the first level the system learns a classifier from a small set of the most recent examples. The learned classifier is accurate enough to be able to distinguish the past episodes that are relevant to the current context. The algorithm then constructs a new training set, "remembering" relevant examples and "forgetting" irrelevant ones. As explained in the introduction, the approach explored in this paper does not assume that old examples or models can be retrieved.

Widmer [16] assumes that the domain provides explicit clues to the current context (e.g., attributes with characteristic values). A two-level learning algorithm is presented that effectively adjusts to changing contexts by trying to detect (via meta-learning) contextual clues, and it uses this information to focus the learning process.
Another two-level learning algorithm assumes that concepts are likely to be stable for some period of time [17]. This approach uses contextual clustering to detect stable concepts and to extract hidden context. The mechanisms studied in this paper do not assume that the domain provides clues that can be discovered by a meta-learning level; they rather aim to get the best performance using one-level learning.

An adaptive boosting method based on dynamic sample-weighting is presented by Chu and Zaniolo [18]. This approach uses statistical hypothesis testing to detect concept changes. Gama et al. [19] also use a hypothesis testing procedure, similar to that used in control charts, to detect concept drift, calculated on all of the data so far. The mechanism gives a warning at 2 standard deviations (approximately 95%) and then takes action at 3 standard deviations (approximately 99.7%). If the action level is reached, the start of the window is reset to the point at which the warning level was reached. However, this mechanism can take quite a long time to react to changes, and the examples that belong to the old concept are not always completely useless, especially when the concept drift is rather gradual.

Baron and Spiliopoulou [20] suggested a pattern monitoring mechanism that observes pattern evolution across the timeline, aiming to detect interesting changes. It devises heuristics that detect specific types of change. The objective is the identification and categorization of changes according to an "interestingness" model, so that the mining expert can be notified accordingly. To adapt the size of the window according to current changes in the concept, the approach presented in [15] uses a naïve optimization approach, which tries all possible window sizes and selects the one with the smallest error rate.
This work provides interesting results from experiments with different forgetting mechanisms used with a support vector machines classifier, using a textual data corpus.
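The two-threshold error-rate test of Gama et al. [19] can be sketched as follows. This is a simplified, hypothetical illustration: it monitors the online error rate p and its standard deviation s, remembers the best (lowest) p + s level seen so far, and signals "warning" at two and "drift" at three standard deviations above it; the warm-up count of 30 and the toy stream are assumptions of the sketch.

```python
import math

class DriftDetector:
    """Simplified drift-detection sketch in the spirit of Gama et al."""
    def __init__(self):
        self.n = 0
        self.errors = 0
        self.best = float("inf")   # minimum of p + s observed so far
        self.best_s = 0.0          # s at that minimum

    def update(self, mistake):
        self.n += 1
        self.errors += int(mistake)
        p = self.errors / self.n
        s = math.sqrt(p * (1 - p) / self.n)
        if p + s < self.best:
            self.best, self.best_s = p + s, s
        if self.n < 30:            # wait for a minimally reliable estimate
            return "ok"
        if p + s >= self.best + 3 * self.best_s:
            return "drift"
        if p + s >= self.best + 2 * self.best_s:
            return "warning"
        return "ok"

det = DriftDetector()
# Phase 1: ~5% error rate; phase 2: the concept changes, errors pile up.
states = [det.update(i % 20 == 0) for i in range(100)]
states += [det.update(True) for _ in range(40)]
print(states[99], states[104], states[-1])  # ok warning drift
```

Once the "drift" level is reached, a windowed learner would reset its window to the point where the warning was first raised.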
To detect concept changes, the TWO (Time Window Optimisation) approach [9] uses a statistical test on a selected population from the time window which excludes its beginning and end. If a concept drift is detected, an efficient optimization algorithm is employed to find the optimal window size. This approach is explained in more detail in the next section and studied in the experiments.
2 Two Approaches for Tracking Drifting Concepts
Let us consider a sequence of examples. Each example is classified according to underlying criteria into one of a set of classes, which forms the training set. The task is to "learn" a classifier that can be used to classify the next examples in the sequence. However, the underlying criteria can subsequently change, and the same example can be classified differently according to the time of its appearance, i.e., a concept drift takes place. As discussed above, to deal with this problem machine learning systems often use a time window, i.e., the classifier is not learned on all examples but only on a subset of recent examples. The next sections describe forgetting mechanisms that develop this idea further.

2.1 Gradual Forgetting
This section describes a gradual forgetting mechanism introduced earlier in [6] and [13]. Just like the time window approach, it assumes that when a concept tends to drift the newest observations represent the current concept best. Additionally, it aims to make the forgetting of old examples gradual. Let us define a gradual forgetting function w_f(t), which provides a weight for each instance according to its location in the course of time. The weights assign a higher importance value to the recent examples. Earlier, Widmer [16] suggested an "exemplar weighting" mechanism, which is used for the IBL algorithm in METAL(IB); however, it is not exploited for NBC in METAL(B). Researchers in the area of boosting also saw the need for weighting examples. There are two ways in which boosting employs the weights. The first one is known as boosting by sampling: the examples are drawn with replacement from the training set with a probability proportional to their weights. However, this approach requires a preprocessing stage in which a new set of training examples is generated; the better the sampling approximates the weights, the larger the new training set becomes. The second method, boosting by weighting, is used with learning algorithms that accept weighted training examples directly. In this case the weights are constrained as follows:

    w'_i \ge 0  \quad and \quad  \sum_{i=1}^{n} w'_i = 1        (1)
where n is the size of the training set. Most of the basic learning algorithms are designed to treat all examples as equally important. For kNN the weighting can easily be implemented by multiplying the calculated distances by the weights of the examples. For other algorithms it is not so straightforward. When there are no weights (i.e., all weights are the same), the constraint (1) yields weights w'_i = 1/n. However, as the weights are multiplied, it seems better to have w_i = 1 in this boundary case.
If we substitute w_i = n w'_i, then the constraints in equation (1) are transformed as follows:

    w_i \ge 0  \quad and \quad  \frac{1}{n} \sum_{i=1}^{n} w_i = 1        (2)
Weights that obey the constraints (2) can easily be used in almost any learning algorithm, requiring only minor changes to the code: every time the algorithm counts an example, the count should be multiplied by the example's weight. Various functions that model the process of forgetting and satisfy the constraints (2) can be defined. For example, the following linear gradual forgetting function has been defined:

    w_i = -\frac{2k}{n-1} (i-1) + 1 + k        (3)
where i is a counter of observations starting from the most recent observation and going back in time (as the function models forgetting, w_i ≥ w_{i+1}); k ∈ [0, 1] is a parameter that controls the relative weight of the most recent and the oldest observations in comparison to the average. By varying k, the slope of the forgetting function can be adjusted to reach better predictive accuracy. The presented gradual forgetting mechanism is naturally integrated into a time window by weighting the examples in it (i.e., n = l, where l is the size of the time window). The system then forgets gradually inside the window. When k = 0, all weights are equal to 1, which means that we have a "standard" time window. As k approaches 1, the weights of the examples at the end of the window approach 0 and there is no abrupt forgetting after the window's end.
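The linear gradual forgetting function of formula (3) can be implemented in a few lines; the sketch below (an illustration with our own function name, Python assumed) also makes it easy to verify that the weights satisfy constraints (2):

```python
def gradual_forgetting_weights(n, k):
    """Linear gradual forgetting weights for a time window of n examples.

    i = 1 is the most recent example and i = n the oldest:
        w_i = -(2k / (n - 1)) * (i - 1) + 1 + k
    The most recent example gets weight 1 + k, the oldest 1 - k, and the
    average weight is 1, as required by constraints (2).
    """
    assert 0.0 <= k <= 1.0 and n > 1
    return [-(2.0 * k / (n - 1)) * (i - 1) + 1.0 + k for i in range(1, n + 1)]

w = gradual_forgetting_weights(n=5, k=0.5)   # [1.5, 1.25, 1.0, 0.75, 0.5]
```

With k = 0 all weights equal 1 (a standard time window); as k approaches 1 the weight of the oldest example approaches 0.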
2.2 Time Window Optimization
This section presents an approach that learns an up-to-date classifier on drifting concepts by dynamically optimizing the time window size to maximize prediction accuracy [9]. With dynamic adaptation of the window size, two important questions have to be addressed. The first is how to detect a change in the concept. We assume that there is no background information about the process, and we use the decrease in the predictive accuracy of the learned classifier as an indicator of changes in the concept. Usually, the detection mechanism uses a predefined threshold tailored to the particular dataset. However, the underlying concept can change at different speeds and to a larger or smaller extent. To detect the changes, the approach uses a statistical hypothesis test, which adapts the thresholds according to the recently observed concept deviation. The second important question is how to adapt when a change is detected. Other approaches use heuristics to decrease the size of the time window when changes in the concept are detected (e.g., [10] and [19]). This approach employs a fast one-dimensional optimization to find the optimal size of
the time window if a concept drift is detected. The next two subsections describe the algorithm in more detail.
2.2.1 Detecting the Concept Drift
To detect concept changes, the approach monitors the performance of the algorithm. For the presentation below we observe the classification accuracy as the measure of performance, but other measures, such as error rate or precision, could be used. The approach processes the sequence of examples in small episodes (a.k.a. batches). On each step, a new classifier is learned from the current time window, and the average accuracy of this classifier is calculated on the next batch of examples [4]. The approach then uses a statistical test to check whether the classification accuracy has decreased significantly compared with its historical level. If the average prediction accuracy on the last batch is significantly lower than the average accuracy over the population of batches defined on the current time window, then a concept drift can be assumed. The significance test uses a population from the time window that excludes the first one or few batches at the beginning of the window, because their predictive accuracy could be low due to a previous concept drift. The one to few most recent batches are also excluded from the test population, because if the drift is gradual the accuracy will drop slowly. A test that uses a population from the core of the time window works well for both abrupt and gradual drift. The confidence level for the significance test should be sensitive enough to discover concept drift as soon as possible, yet not mistake noise for changes in the concept. Experience shows that the "standard" confidence level of 95% works very well in all experiments. This drift detection level is rather sensitive, and it helps the algorithm detect the drift early. If a false concept drift alarm is triggered, it activates the window optimizing mechanism, but in practice this results only in an insignificant decrease in the time window size.
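The detection test just described might be sketched as follows (a hedged illustration: the exact form of the statistical test and the numbers of excluded batches are design choices here, so a one-sided z-style test against the core of the window is assumed):

```python
import statistics

def drift_detected(batch_accs, z_crit=1.645, skip_head=2, skip_tail=2):
    """Return True if the newest batch accuracy is significantly below
    the historical level of the current time window.

    batch_accs lists the per-batch accuracies in the window, oldest
    first.  The test population is the 'core' of the window: the first
    skip_head batches (possibly degraded by a previous drift) and the
    skip_tail most recent batches (possibly already drifting) are
    excluded before computing the historical mean and deviation.
    """
    core = batch_accs[skip_head:len(batch_accs) - skip_tail]
    if len(core) < 2:
        return False                      # not enough history to test
    mu = statistics.mean(core)
    sigma = statistics.pstdev(core)
    if sigma == 0.0:
        return batch_accs[-1] < mu
    # One-sided test at roughly the 95% level (z_crit = 1.645).
    return (batch_accs[-1] - mu) / sigma < -z_crit
```

A stable accuracy sequence leaves the window untouched, while a sharp drop on the newest batch triggers the window optimization described next.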
This mechanism works as follows: if concept drift is detected, then the optimization of the size of the time window is performed (see the next section); otherwise, the time window size is increased to include the new examples.
2.2.2 Optimising the Time Window Size
In general, if the concept is stable, the bigger the training set (the time window), the more accurately the classifier can be learned. However, when the concept is changing, a big window will probably contain many old examples, which results in a decrease in classification accuracy. Hence, the window size should be decreased to exclude the out-of-date examples, so that the algorithm can learn a more accurate classifier. However, if the window becomes too small, this also leads to a decrease in accuracy. The shape of the curve demonstrating the relationship between the size of the time window and the classification accuracy is shown in Figure 1. To adapt the size of the window to the current changes in the concept, the presented mechanism uses the golden section algorithm for one-dimensional optimization. The algorithm looks for an optimal solution in a closed and bounded interval [a, b] - in our case the possible window sizes X = [x_min, x_c], where x_min is a
[Figure: classification accuracy plotted against window size, with the points a, l, r and b marked on the window-size axis.]
Figure 1. A sample shape of the correlation between the window size and accuracy of the classifier.
predefined minimum size of the window and x_c is the current size of the time window. It assumes that the function f(x) is unimodal on X, i.e., there is only one maximum x*, and f is strictly increasing on (x_min, x*) and strictly decreasing on (x*, x_c), which is the shape that can be seen in Figure 1. In our case the function f(x) calculates the classification accuracy of the model learned using a time window of size x. The basic idea of this algorithm is to minimize the number of function evaluations by trapping the optimum in a set of nested intervals. On each step the algorithm uses the golden section ratio τ = 1/φ ≈ 0.618 to split the interval into three subintervals, as shown in Figure 1, where l = b − τ(b − a) and r = a + τ(b − a). If f(l) > f(r), then the new interval chosen for the next step is [a, r], else [l, b]. The length of the interval for the next iteration is τ(b − a). These iterations continue until the interval containing the maximum reaches a predefined minimum size; x* is taken to lie at the centre of the final interval. The golden section optimization algorithm is a very efficient way to trap the x* that optimizes the function f(x). After n iterations, the interval is reduced to 0.618^n times its original size. For example, if n = 10, less than 1% of the original interval remains. Note that, due to the properties of the golden section (e.g., 1/φ = φ − 1), each iteration requires only one new function evaluation. In conclusion, if we can assume that the classification accuracy as a function of the time window size is unimodal, then the golden section algorithm can be used as an efficient way to find the optimal size of the time window. It is possible to find datasets for which the unimodal assumption does not hold (e.g., when the concept changes very often and abruptly). In such cases, we can use other optimization methods that do not assume unimodality; however, they take much longer.
The trade-off that we have to make is that we can occasionally be trapped in a local maximum, but have a fast optimization; or find a global maximum, but have significantly slower optimization.
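The golden-section step above can be sketched as follows (an illustrative implementation; the variable names follow Figure 1, and the synthetic stand-in for f(x) replaces the actual accuracy of a classifier trained on a window of size x):

```python
def golden_section_max(f, a, b, min_interval=2.0):
    """Find the x in [a, b] maximising a unimodal f by golden-section
    search; each iteration shrinks the interval by tau ~ 0.618 and,
    thanks to 1/phi = phi - 1, needs only one new evaluation of f."""
    tau = 0.6180339887498949              # (sqrt(5) - 1) / 2 = 1/phi
    l, r = b - tau * (b - a), a + tau * (b - a)
    fl, fr = f(l), f(r)
    while (b - a) > min_interval:
        if fl > fr:                       # the maximum lies in [a, r]
            b, r, fr = r, l, fl           # old l is reused as the new r
            l = b - tau * (b - a)
            fl = f(l)
        else:                             # the maximum lies in [l, b]
            a, l, fl = l, r, fr           # old r is reused as the new l
            r = a + tau * (b - a)
            fr = f(r)
    return (a + b) / 2.0                  # x* at the centre of the interval

# Synthetic unimodal stand-in for the accuracy curve of Figure 1.
best = golden_section_max(lambda x: -(x - 70.0) ** 2, 10.0, 300.0)
```

The reuse of one interior point per iteration is exactly the property the text attributes to the golden section: only one new evaluation of f is needed per step.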
3 Experimental Design
All experiments were designed to run iteratively, in this way simulating the process of the mechanism's utilisation. For this reason the data streams were chunked into
episodes/batches. Figure 2 illustrates how the experiments were conducted: on each iteration, a concept description is learned from the examples in the current time window; then the learned classifier is tested on the next batch.
Figure 2. Iterations' cross-validation design. [Diagram: along the time axis of the data stream, the time window (i.e., the training set) is immediately followed by the test set.]
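The iteration scheme of Figure 2 can be sketched as follows (a sketch with illustrative names: `learn` stands for any of the learning algorithms used below and `accuracy` for the evaluation measure):

```python
def iterative_evaluation(stream, batch_size, window_size, learn, accuracy):
    """Slide through the data stream: on each iteration, train a
    classifier on the current time window and test it on the next
    batch, collecting the per-batch accuracies."""
    accs = []
    for start in range(window_size, len(stream) - batch_size + 1, batch_size):
        window = stream[start - window_size:start]   # training set
        batch = stream[start:start + batch_size]     # test set
        model = learn(window)
        accs.append(accuracy(model, batch))
    return accs
```

The forgetting mechanisms differ only in how `window` is formed on each iteration (fixed size, weighted, or optimized).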
To facilitate the presentation below, the basic forgetting mechanisms are abbreviated as follows:

FTW - Fixed-size Time Window
GF - Gradual Forgetting
TWO - Time Window Optimization

For each dataset, the window size for the FTW was chosen to approximate the average time window size obtained in the experiments with the TWO mechanism on the same set. This functions as extra help for the FTW that would not be available in a real situation, where the forthcoming sequence of events is unknown; the aim here is to allow the FTW approach to achieve its best performance. Each hypothesis was tested on four experimental datasets, using three basic machine learning algorithms. The results from the experiments are reported and discussed in the next subsection.
3.1 Datasets
This subsection explains how the data streams used in the experiments were generated.
3.1.1 STAGGER Dataset
The first experiments were conducted with an artificial learning problem that was defined and used by Schlimmer and Granger [3] for testing STAGGER, probably the first concept drift tracking system. Much of the subsequent work on this problem used this dataset for testing purposes (e.g., [5], [6], [7], [9], [10], [13], [16], [17], [21], [22], etc.). The STAGGER problem is defined as follows. The instance space is described by three attributes: size = {small, medium, large}, color = {red, green, blue}, and shape = {square, circular, triangular}. There is a sequence of three target concepts: (1) size = small and color = red; (2) color = green or shape = circular; and (3) size = medium or large. 120 training instances are generated randomly and classified according to the current concept. The underlying concept is forced to change after every 40 training examples, in the sequence (1)-(2)-(3). The experiments with the STAGGER dataset were set up in exactly the same way as in other similar experiments. The retraining step is 1; however, there is a test set of 100 instances, generated randomly and classified according to the current concept. This differs from the experiments with the other datasets, where the retraining step and the test set are the
same - one batch. The size of the FTW is set to 25, which approximates the average size of the optimized windows.
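The STAGGER stream described above is easy to regenerate; the sketch below (our own illustration, with an arbitrary seed) produces the 120 instances with the concept switching after every 40 examples:

```python
import random

SIZES = ["small", "medium", "large"]
COLORS = ["red", "green", "blue"]
SHAPES = ["square", "circular", "triangular"]

# The sequence of three target concepts (1)-(2)-(3).
CONCEPTS = [
    lambda size, color, shape: size == "small" and color == "red",
    lambda size, color, shape: color == "green" or shape == "circular",
    lambda size, color, shape: size in ("medium", "large"),
]

def stagger_stream(seed=0):
    """120 random instances labelled by the current concept, which is
    forced to change after every 40 training examples."""
    rng = random.Random(seed)
    stream = []
    for i in range(120):
        concept = CONCEPTS[i // 40]
        size = rng.choice(SIZES)
        color = rng.choice(COLORS)
        shape = rng.choice(SHAPES)
        stream.append(((size, color, shape), concept(size, color, shape)))
    return stream
```

The abrupt switches at examples 40 and 80 make this stream a standard stress test for drift tracking.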
3.1.2 German Credit Dataset
Experiments were conducted with the German credit dataset from the UCI Machine Learning Repository³. The dataset contains 1000 instances of bank credit data, described by 20 attributes. The examples are classified into two classes, "good" or "bad". To simulate hidden changes in the context, the dataset was sorted by an attribute, and this attribute was then removed from the dataset for the experiments. Using an attribute to sort the dataset, and in this way simulate a changing context, is a commonly used way to set up experiments for studying concept drift. Two sorted datasets were created from the German credit dataset. The first was sorted by a continuous attribute, "age", which produces a gradual drift of the "class" concept. The second was sorted by the attribute "checking_status", which has three discrete values; in this way we aimed to create abrupt changes of the "class" concept. Thus, two data streams were generated from this source dataset for the experiments. The datasets were divided into a sequence of batches, each containing 10 examples. The size of the FTW was set to 200, which approximates the average size of the optimized windows.
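The hidden-context simulation used here (and, below, for the Spam data) amounts to sorting by an attribute, dropping it, and batching; a generic sketch with illustrative names:

```python
def drifting_batches(examples, sort_attr, batch_size=10):
    """Sort labelled examples (attr_dict, label) by a chosen attribute,
    remove that attribute so it gives no explicit clue about the
    change, and split the result into fixed-size batches."""
    ordered = sorted(examples, key=lambda ex: ex[0][sort_attr])
    stripped = [({a: v for a, v in x.items() if a != sort_attr}, y)
                for x, y in ordered]
    return [stripped[i:i + batch_size]
            for i in range(0, len(stripped), batch_size)]
```

Sorting by a continuous attribute such as "age" yields a gradual drift, while sorting by a discrete attribute such as "checking_status" yields abrupt changes at the value boundaries.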
3.1.3 Spam Dataset
Experiments were also conducted with the Spam dataset from the UCI Machine Learning Repository. Spam is an unsolicited e-mail message. The dataset consists of 4601 instances, 1813 (39.4%) of which are spam messages. Each instance is represented by 54 attributes giving the occurrence of a pre-selected set of words in the document, plus three attributes describing the capital letters in the e-mail. To simulate the changing hidden context, the examples in the dataset were sorted by "capital_run_length_total", the total number of capital letters in the e-mail. This attribute and the two related attributes, "capital_run_length_average" and "capital_run_length_longest", were removed from the dataset, because they could provide explicit clues about the concept changes. The sorted dataset was divided into a sequence of batches of 10 examples each. For this dataset the fixed window size was set to 400, an approximation of the average window size used by the TWO mechanism on this dataset.
3.2 Machine Learning Algorithms
The experiments were conducted using three basic machine learning algorithms:
- k Nearest Neighbours (kNN), also known as Instance Based Learning (IBL) [21]. k = 3 was the default setting for the experiments reported below, except for the experiments with the STAGGER dataset, where k = 1 was chosen because it produces a more accurate classification than k = 3;
- Induction of Decision Trees (ID3) [23] (using an attribute selection criterion based on the χ² statistic);
- Naïve Bayesian Classifier (NBC) [2].
³ http://www.ics.uci.edu/~mlearn/MLRepository.html
3.3 Hypotheses

The following four hypotheses were tested, with performance measured in classification accuracy:

H1: Fixed-size Time Window with Gradual Forgetting (FTW+GF) is better than the pure Fixed-size Time Window (FTW).
H2: Time Window Optimization (TWO) is better than the Fixed-size Time Window (FTW).
H3: Time Window Optimization (TWO) is better than the Fixed-size Time Window with Gradual Forgetting (FTW+GF).
H4: Time Window Optimization with Gradual Forgetting (TWO+GF) is better than Time Window Optimization (TWO).
4 Experimental Results and Discussion
This section presents the results from the experiments. Parts of these results were introduced in [9] and [22]; the current paper summarizes the findings. Tables 1, 2, 3 and 4 below summarise the results: the rows show the learning algorithms used (kNN, ID3 and NBC), and the columns show the different datasets. For a more compact presentation, the datasets are referred to in the tables by numbers as follows:

1 - STAGGER dataset
2 - German dataset, sorted by the attribute "checking_status"
3 - German dataset, sorted by the attribute "age"
4 - Spam dataset, sorted by the attribute "capital_run_length_total"

We used paired t-tests with a 95% confidence level to check whether there is a significant difference in the accuracy of the learned classifiers. The pairs are formed by comparing the algorithms' accuracies on the same iteration. The tables below report the results of the experiments testing the formulated hypotheses. Each cell gives the result of the significance test comparing the runs using the different forgetting mechanisms; the basic learner is given by the row and the dataset by the column. The notation is as follows: 9 - the second algorithm performs significantly better than the first one; 8 - the first algorithm performs significantly better than the second one; ÷ - there is no significant difference in the performance of the compared algorithms. For example, the sign 9 in the first row and first column of Table 1 denotes that the experiments conducted using kNN as the basic algorithm on the STAGGER dataset (column 1) show that FTW with GF is significantly better than the plain FTW, measured in classification accuracy.
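The paired test can be sketched as follows (an illustration only; the critical value 1.96 assumes a two-sided 95% test with many iterations, whereas the exact critical value depends on the degrees of freedom):

```python
import math
import statistics

def second_significantly_better(acc_a, acc_b, t_crit=1.96):
    """Paired t-test on per-iteration accuracies of two runs over the
    same batches: is mechanism B significantly more accurate than A?"""
    diffs = [b - a for a, b in zip(acc_a, acc_b)]
    n = len(diffs)
    mean = statistics.mean(diffs)
    sd = statistics.stdev(diffs)
    if sd == 0.0:
        return mean > 0.0                  # identical differences
    t = mean / (sd / math.sqrt(n))
    return t > t_crit
```

Pairing by iteration removes the batch-to-batch variation shared by both mechanisms, which makes the test considerably more sensitive than comparing overall averages.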
Table 1. Comparison of FTW versus FTW+GF forgetting mechanisms.

algorithm   dataset 1   dataset 2   dataset 3   dataset 4
kNN             9           9           9           8
ID3             9           9           9           ÷
NBC             9           9           9           ÷
The results from the experiments testing hypothesis H1 are presented in Table 1. For three of the four datasets, and for each of the learning algorithms, gradual forgetting significantly improves the average classification accuracy in comparison with the Fixed-size Time Window; however, gradual forgetting fails to improve the accuracy on the fourth dataset. The main difference between this dataset and the others is that its features represent word frequencies in e-mail messages. As a result, in most cases a feature's value is equal to zero, i.e., the dataset is sparse. This is probably the reason for the lack of improvement. Perhaps using feature selection and similarity measures adapted for text in kNN would diminish this problem; further studies are needed to answer these questions.

Table 2. Comparison of FTW versus TWO forgetting mechanisms.

algorithm   dataset 1   dataset 2   dataset 3   dataset 4
kNN             9           9           9           9
ID3             9           9           9           9
NBC             9           9           9           9
The experimental results presented in Table 2 provide very strong evidence that the Time Window Optimization mechanism is able to improve the prediction accuracy significantly in comparison with the Fixed-size Time Window, i.e., hypothesis H2 is confirmed.

Table 3. Comparison of FTW+GF versus TWO forgetting mechanisms.

algorithm   dataset 1   dataset 2   dataset 3   dataset 4
kNN             9           ÷           9           9
ID3             9           9           9           9
NBC             9           9           9           9
Table 3 presents the results from the experiments testing the third hypothesis, H3. In all experiments the Time Window Optimization mechanism achieves significantly better classification accuracy than Gradual Forgetting, except for one instance in which there is no significant difference in performance.

Table 4. Comparison of TWO versus TWO+GF forgetting mechanisms.

algorithm   dataset 1   dataset 2   dataset 3   dataset 4
kNN             ÷           9           9           8
ID3             9           9           ÷           8
NBC             8           ÷           9           ÷
Table 4 shows the results from the experiments testing the hypothesis that applying Gradual Forgetting together with TWO can improve the prediction accuracy in comparison to pure TWO, i.e., that the two mechanisms have a synergistic effect. The results from the experiments do not provide enough evidence in support of the H4 hypothesis. To explore how the TWO mechanism behaves according to the nature of the changes in the concept, we draw on the same diagram (Figure 3) the classification accuracy in one of the experiments with the TWO mechanism (the thin line) and the size of the optimized time window (the thick line) on each step. We see that a drop in the accuracy normally leads to a decrease in the time window size. However, a sudden decrease in the classification accuracy does not always indicate a concept drift; it can also be caused by noise in the data stream. It can be seen that the TWO mechanism is very robust to noise, merely decreasing the window size insignificantly (e.g., see arrow 1 in Figure 3), while remaining sensitive enough to detect genuine concept drifts that decrease the accuracy by a relatively small amount (e.g., see arrow 2 in Figure 3). The detection mechanism flags both real and false concept drifts, but the window size optimizer responds very differently to the two possibilities.
Figure 3. The relationship between the measured prediction accuracy and the time window size. [Line chart: the time window size (thick line, left axis, 0-1000) and the predictive accuracy of TWO (thin line, right axis, 0%-100%) plotted against the progress through the instances (900-4500); arrows 1 and 2 mark the episodes discussed in the text.]
The presented forgetting mechanisms (GF and TWO) were further compared with similar approaches (i.e., [5], [7], [10] and [21]) on the STAGGER dataset. A detailed discussion of this comparison can be found in [9]. In summary, the results show that:
- the GF mechanism is a simple approach that usually performs comparably to the other approaches, and in some cases significantly outperforms them when concepts are drifting;
- the TWO mechanism significantly outperforms the other forgetting mechanisms.
5 Conclusion
This paper presents two forgetting mechanisms for dealing with the concept drift problem, together with an experimental study. Both mechanisms are general in nature and can be added to any relevant algorithm. Experiments were conducted with three learning algorithms on four datasets. The results show that applying Gradual Forgetting inside a fixed-size time window usually leads to a significant improvement of the classification accuracy on drifting concepts; that when the two forgetting mechanisms are compared, the advantage is clearly on the side of the self-adapting algorithm, Time Window Optimization; and, finally, that applying both mechanisms together usually does not lead to an additional improvement in classification accuracy. Further controlled experiments using common patterns from the space of different types of drift (i.e., frequent - rare; abrupt - gradual; slow - fast; permanent - temporary; etc.) would provide clearer evidence about which kind of forgetting mechanism is most appropriate for which pattern. The importance of taking concept drift into account in counter-terrorist mining of the Web springs from the nature of the typical task: a limited data history and often-changing concepts. The significance of this problem was also raised by other participants in their formal talks and informal discussions.
References

[1] A. Abbasi and H. Chen. Applying Authorship Analysis to Extremist-Group Web Forum Messages. IEEE Computer Society, September/October (2005), 67-75.
[2] T. Mitchell. Machine Learning. McGraw-Hill (1997).
[3] J. Schlimmer and R. Granger. Incremental Learning from Noisy Data. Machine Learning 3 (1986), 317-357.
[4] T. Mitchell, R. Caruana, D. Freitag, J. McDermott and D. Zabowski. Experience with a Learning Personal Assistant. Communications of the ACM 37(7) (1994), 81-91.
[5] M. Maloof and R. Michalski. Selecting Examples for Partial Memory Learning. Machine Learning 41 (2000), 27-52.
[6] I. Koychev and I. Schwab. Adaptation to Drifting User's Interests. In: Proc. of the ECML 2000 Workshop: Machine Learning in New Information Age, Barcelona, Spain (2000), 39-46.
[7] I. Koychev. Tracking Changing User Interests through Prior-Learning of Context. In: P. de Bra, P. Brusilovsky and R. Conejo (eds.): Adaptive Hypermedia and Adaptive Web Based Systems. Lecture Notes in Computer Science, Vol. 2347, Springer-Verlag (2002), 223-232.
[8] G. Ducatel and A. Nürnberger. iArchive: An Assistant To Help Users Personalise Search Engines. In: Enhancing the Power of the Internet (Chapter 2: User Profiles in Information Retrieval). Studies in Fuzziness and Soft Computing, Vol. 139, M. Nikravesh, B. Azvine, R. Yager and L.A. Zadeh (eds.), Springer-Verlag (2004).
[9] I. Koychev and R. Lothian. Tracking Drifting Concepts by Time Window Optimisation. In: Research and Development in Intelligent Systems XXII, Proc. of AI-2005, the 25th SGAI Int. Conference on Innovative Techniques and Applications of Artificial Intelligence. M. Bramer, F. Coenen and T. Allen (eds.), Springer-Verlag, London (2006), 46-59.
[10] G. Widmer and M. Kubat. Learning in the Presence of Concept Drift and Hidden Contexts. Machine Learning 23 (1996), 69-101.
[11] S.J. Delany, P. Cunningham, A. Tsymbal and L. Coyle. A Case-Based Technique for Tracking Concept Drift in Spam Filtering. In: A. Macintosh, R. Ellis and T. Allen (eds.): Applications and Innovations in Intelligent Systems XII, Proceedings of AI-2004, Springer (2004), 3-16.
[12] M. Lazarescu, S. Venkatesh and H. Bui. Using Multiple Windows to Track Concept Drift. Intelligent Data Analysis 8(1) (2004), 29-59.
[13] I. Koychev. Gradual Forgetting for Adaptation to Concept Drift. In: Proceedings of the ECAI 2000 Workshop on Current Issues in Spatio-Temporal Reasoning, Berlin (2000), 101-107.
[14] M. Kukar. Drifting Concepts as Hidden Factors in Clinical Studies. In: M. Dojat, E.T. Keravnou and P. Barahona (eds.): Proceedings of the 9th Conference on Artificial Intelligence in Medicine in Europe, AIME 2003, Protaras, Cyprus, October 18-22, 2003. Lecture Notes in Computer Science, Vol. 2780, Springer-Verlag (2003), 355-364.
[15] R. Klinkenberg. Learning Drifting Concepts: Example Selection vs. Example Weighting. Intelligent Data Analysis, Special Issue on Incremental Learning Systems Capable of Dealing with Concept Drift, 8(3) (2004), 281-300.
[16] G. Widmer. Tracking Changes through Meta-Learning. Machine Learning 27 (1997), 256-286.
[17] M. Harries and C. Sammut. Extracting Hidden Context. Machine Learning 32 (1998), 101-126.
[18] F. Chu and C. Zaniolo. Fast and Light Boosting for Adaptive Mining of Data Streams. In: Proc. of the 8th Pacific-Asia Conference on Knowledge Discovery and Data Mining. Lecture Notes in Computer Science, Vol. 3056, Springer-Verlag (2004), 282-292.
[19] J. Gama, P. Medas, G. Castillo and P. Rodrigues. Learning with Drift Detection. In: A.L.C. Bazzan and S. Labidi (eds.): Proceedings of the 17th Brazilian Symposium on Artificial Intelligence. Lecture Notes in Computer Science, Vol. 3171, Springer (2004), 286-295.
[20] S. Baron and M. Spiliopoulou. Temporal Evolution and Local Patterns. In: K. Morik, J.-F. Boulicaut and A. Siebes (eds.): Local Pattern Detection, International Seminar, Dagstuhl Castle, Germany, April 12-16, 2004, Revised Selected Papers. Lecture Notes in Computer Science, Vol. 3539, Springer (2005).
[21] D. Aha, D. Kibler and M. Albert. Instance-Based Learning Algorithms. Machine Learning 6 (1991), 37-66.
[22] I. Koychev. Experiments with Two Approaches for Tracking Drifting Concepts. Serdica Journal of Computing 1 (2007), 27-44.
[23] R. Quinlan. Induction of Decision Trees. Machine Learning 1 (1986), 81-106.
Part Four

Some Contemporary Technologies and Applications
Security Informatics and Terrorism: Patrolling the Web C.S. Gal et al. (Eds.) IOS Press, 2008 © 2008 IOS Press. All rights reserved.
Techno-Intelligence Signatures Analysis

Gadi AVIRAN¹
Hazard Threat Analysis, Ltd., Bnei Zion, Israel
Abstract. The following is an introduction to the world of techno-intelligence signatures, given by Gadi Aviran, CEO of Hazard Threat Analysis Ltd. Hazard Threat Analysis Ltd. (HTA) specializes in the assessment of terror threats derived from Internet-sourced intelligence material (WebInt), utilizing an array of expertise in the collection, translation and analysis of focused data. 'Techno-intelligence signatures' is a collective name for the indicative effects an action has on its environment. Either physical or behavioral, these 'signatures' vary in nature, ranging from patterns of communication to unusual scents and sounds. During the development of a new technical or operational capability, or during the execution of one, these 'signatures' are emitted, allowing intelligence analysts to collect and analyze them in an effort to issue an intelligence alert. The difficulty in achieving a timely alert based on the analysis of such signatures lies in the process of "connecting the dots" - making sense of information particles that often lack an obvious connection. In order to begin 'connecting the dots', it is first vital to know what the 'dots' the analyst is looking at are, by making the distinction between theoretical threats and current intelligence threats. A profound understanding of both current capabilities and their distinctive 'signatures', and of the nature of the information sources, results in the life-saving intelligence alert the counter-terrorism agencies are after.

Keywords. Terror, Intelligence, Internet, open-source, Techno-Intelligence
I am the second representative of a commercial company, which kind of puts me on the spot here. But unlike my predecessor, you may see me as your "anti-Christ", because we do not believe in machines, and I will explain why in a minute. We see the problem of assessing terror threats from open sources, or the Internet in that respect, as a growing and very complicated concern that gets more complicated all the time. While you can see what our company does for a living (see Figure 1, Hazard Threat Analysis' (HTA) company introduction), I must stress that at this stage we do not see any feasible way of taking the man out of the loop when assessing terror-related information. That said, this has nothing to do with the ability to use machines, up to a certain point, and to deploy them in other areas of expertise where the unknowns, or the dilemmas as you will see later on, such as the context dilemma, are as big as they are in the area of terror threats.
¹ Corresponding Author: Gadi Aviran, Hazard Threat Analysis Ltd., PO Box 395, Bnei Zion 60910, Israel; E-mail: [email protected], [email protected]
G. Aviran / Techno-Intelligence Signatures Analysis
Figure 1. HTA’s Company introduction
When we look at terror threats, we define basically three major components of the threat. One of them is the intention, and the other two are the technical and operational capabilities. First of all, there is the technical capability. Consider, for example, our life here in Israel over the past six years, in which Kassam rockets, or other rockets, have been fired from the Gaza Strip. To our analysis, they represent a very low operational capability, because they are not as good, in the adversary's eyes, as they should be; they are not as effective as they should be. But, on the other hand, they do show high technical capability: creating from scratch a rocket that flies up to about 15 kilometers and lands in the proximity of the target, and doing that for a long period of time without us being able to stop it. At the other extreme, when we look at 9/11, you have a very high operational capability: managing to get people into the flight schools, getting them on the planes, getting them to fly the planes into the targets. But the technical capabilities were extremely low: they were using carpet knives. Their knowledge is something which we'll talk about in a minute. In this presentation, I will try to demonstrate the complexity of what a terror threat is through an example to which I think most of you can relate. It is in the news that you see, specifically from Iraq and Afghanistan, and around the world as well. When we look at an IED, an improvised explosive device, what we see is a whole supply chain around it. We do not see the device itself as an incident; we see it as a process: starting from an initial intention or capability,
Figure 2. Techno-Intelligence Signatures
sometimes the need to respond to a specific occasion, leading to the final execution. And after the execution there are other stages that link to the same incident: lessons learned, transferring the knowledge of the event, thinking "can we improve the next one, can we make it better?" This is all part of the world that we are investigating. The whole supply chain that you see surrounding the IED incident emits signatures. And "signatures" is a very tricky word, because in all these processes every action being taken emits signatures. There is no action that doesn't have a signature. The problem is to understand what the signature is, how to find it, and what conclusions to draw from a collection of signatures, especially if you don't know what the road map is. It really comes down to what is commonly called "connecting the dots". In order to connect the dots, first of all, you have to understand that there are dots. And we will discuss this in a few minutes. The signatures themselves, or the techno-intelligence signatures as we tend to call them in our methodology, can range from anything to anything. For example, they can be smells of certain explosive manufacturing procedures, perhaps susceptible to sniffers in the future, and definitely to human resources, canines, etc. These dots can include communications, and this is where we intercept, or everybody tries to intercept, as much as possible. But we have a very limited capability of actually understanding that this "dot" belongs to a picture that has 400 "dots", and that if you connect them right you will end up with 9/11. If you remember the committee report on the intelligence failures of 9/11, the dots were there; most of them were there. But there wasn't a map, a knowledge map, that
could have told you that if a bunch of guys are practicing flight lessons down in Florida, and another guy is looking into this, and another guy is sending "go" codes to a certain group, then at the end of the day this bunch of guys ends up on airplanes. The map was not there, and therefore the dots could not be connected. At HTA we provide analysis of terror threats of all families and types, but what we do, most of all, is define the dots.
Figure 3. Drawing the Dots
Defining the dots is the business of techno-intelligence signatures. You have to understand what emits which signature and for what purpose it can be used, and then, by a process of elimination, start working with the information until you arrive at a real picture. The first problem with dots, or with signatures, is that you have to make a very clear distinction between a theoretical threat and an intelligence threat, because trying to fight or identify every theoretical threat in the world is not a realistic goal. But when you narrow it down to the intelligence threats – the threats you really have to deal with, not the theory that a meteor will hit or that a satellite will fire a laser beam at the Empire State Building – the picture improves. Some acts may be feasible in a few years but are not relevant to terrorists at this stage, so you have to understand that there is a gap between a known theoretical threat and current-day intelligence threats. The other problem is: what should we be looking for? And that involves, of course, the machines. This is not an easy task to perform, because there are many sources of
information, ranging from simple printouts of the billing machines of cellular networks, which can help to identify locations, to, at the other extreme, human visual reports: "I saw a guy walking around with ten gallons of anti-freeze." Is he a terrorist? Is he trying to replace the liquid in his radiator? Interesting question. Another problem is that media formats and types come in a huge variety. We will now look at three examples of what we would call collection complexities.

1. Complexities

1.1. The first example is the context dilemma. Brainiac is a big Sky One show in the UK that deals with various aspects of technology – a very interesting show, quite humorous in a way.
Figure 4. The Context Dilemma.
One of the episodes deals with a substance called thermite, not yet used very heavily by terrorists, but one that will create a very big problem once it starts being deployed, because of what it does and how easily it is obtainable. Basically, thermite is little more than rust mixed with aluminum powder, and therefore it is very easily obtainable. The second problem is that once the explanation of how thermite works has aired on Sky One, thermite becomes a theoretical threat, but it is not an intelligence threat at this stage. A machine looking at Sky One against what we define as a threat would deduce that this is not a threat, this is information. There are billions and billions of items like that floating around the world in open sources.
The second stage concerns where the information resides, or how it was pointed to. In this case: Google™ Video. Search for "thermite" and "Brainiac" and you will get to the clip I just described. Now, is this a terror threat or not? The answer is probably "not", because the fact that Sky One's thermite episode of Brainiac appears on Google™ Video is natural. The problem starts with examples such as the following, an excerpt from a closed, renowned global jihadi forum called Al-Firdaws.
Figure 5. The Context Dilemma
On Al-Firdaws, one of the members posts a message (see Figure 5) to the other members. You have to understand that Al-Firdaws has probably 40,000 registered users, in addition to guest users, and you do not even know how many have entered as guests. This member says: "go to this link and look at the clip", and then he explains in his own words what the clip means to him. I assume that you would no longer consider thermite a theoretical threat; now it has become an intelligence threat. When a terrorist tells another terrorist, or another group of terrorists, on an accessible platform, to go take this and do that, this is no longer theory, this is intelligence. This is something that the intelligence analyst or decision maker has to deal with and manage as part of his risk analysis, risk management and threat analysis.
1.2. The second example of a collection complexity that I am going to present is the double meaning dilemma (see Figure 6):
Figure 6. The Double Meaning Dilemma
The meaning of the word "Namsa" in Arabic is "Austria", and "Namsawi" means "Austrian". But originating from Iraq, and now spreading through the Internet, is a second meaning – "artillery shell" – and artillery shells are nowadays used for making improvised explosive devices (IEDs). This issue of double meaning is extremely common in the Arabic language and in the terrorists' jargon and slang. Another example: what happens when your translator tells you someone is planning to attack your aircraft with a rocket? For a rocket attack there are standard procedures and standard measures that should be implemented according to security protocols. But what if your translator tells you to expect an attack using a missile? Missile attacks call for an entirely different security protocol. The problem is that in Arabic both rockets and missiles are the exact same word – Sarukh – and you tell the difference from context, which depends entirely on your translator's knowledge and experience. What did I mean when I said I was going to fire a Sarukh? That is the problem. 1.3. A third example of a collection complexity is the proximity dilemma. The top left of Figure 7 shows a device that was captured in the attempted attack against the U.S. Embassy in Damascus in 2006.
Figure 7. The Proximity Dilemma.
Since we have tens of thousands of original documents, a knowledge base, and videos of terrorists, we easily found the second picture in a manual. The item at the bottom right of Figure 7 is from the manual on how to make the one on the top left. Now, for an intelligence agency, understanding where the information came from is a good beginning for an investigation. From that specific document, since we do a lot of link analysis and a lot of metadata analysis of documents, we can tell you a great deal: where it was uploaded, what it was called, on what file share it sat, whether it was raw or zipped, whether it had a password, how many people downloaded it and, in some cases, from which countries. But all of this analysis starts with the ability to determine that these two pictures are the same, and a machine might have a problem determining this.

2. Solutions

First, based on our experience, you need to single out the current intelligence threat and not deal with noise. We have to be very focused, not on theory but on actual intelligence threats. Second, understand what signatures any given threat emits, in order to have a map or a book of dots where you can start placing them in various locations and then start creating your own little charts. Somebody is buying butane tanks, now another guy goes into a plumber's store and buys pipes, another is buying acetone, another is getting peroxide, and another is buying a car battery. These are the dots that, at the end of the day, would lead you to the device in Figure 7. Another solution is to analyze and understand behavioral patterns, and to analyze meaning: slang, jargon, the issue of Arabish and Pinglish. (Arabish is writing in
Arabic with Latin letters, for people who do not have Arabic keyboards or who do not want to use virtual keyboards; Pinglish is the same but for Farsi.) Beyond words there is the issue of text and context, which you have to infer, because the words alone do not contain enough meaning to make a clear distinction between terror and non-terror issues. And at the base of everything sits what we are doing in the field of link and network analysis. We are getting to a level of sophistication at this stage where we can take a virtual entity on a forum and give you enough details about that entity to begin an investigation. In some cases you will get a name, a picture of the guy, a current location and so on. It is an ongoing chase, because things change: forums close, identity authentication methods change; you live in an environment which is almost like a HUMINT (HUMan INTelligence) environment. In summary – and this is why I started by saying I may be your "antichrist" regarding learning machines – to date we have not found any technology that can mimic the sharpness of one extraordinary researcher. If you are talking about volumes, and somebody here mentioned analyzing billions of documents, then machine analysis is a cost-effective approach; you would not assign 10,000 people to do that. I do not know if you remember what the Iranians did with the shredded CIA documents in the Tehran embassy. They simply took 5,000 people and had them staple the strips back together, and they got intelligence; that is a brute-force method you can employ. We are not firm believers in brute force. We work in much more delicate ways; we employ extraordinary people to do the work. Of course, we are open to the concept of machine harvesting, as you can see since I am speaking here, but we do not employ any at this stage.
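The "book of dots" approach described in this section can be caricatured in a few lines of code: given a stream of purchase events, flag localities where several distinct precursor categories co-occur within a short time window. The categories, threshold and window below are illustrative stand-ins, not HTA's actual signature map.

```python
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical precursor categories ("dots") from the talk's example;
# the threshold and window are illustrative, not real tradecraft rules.
PRECURSORS = {"butane tank", "pipe", "acetone", "peroxide", "car battery"}

def flag_clusters(events, window_days=30, min_categories=3):
    """events: list of (date, locality, item). Returns localities where
    enough distinct precursor categories co-occur within the window."""
    by_locality = defaultdict(list)
    for day, locality, item in events:
        if item in PRECURSORS:
            by_locality[locality].append((day, item))
    flagged = []
    for locality, dots in by_locality.items():
        dots.sort()  # chronological order
        for i, (start, _) in enumerate(dots):
            seen = {item for day, item in dots[i:]
                    if day <= start + timedelta(days=window_days)}
            if len(seen) >= min_categories:
                flagged.append(locality)
                break
    return flagged

events = [
    (date(2007, 3, 1), "town-A", "butane tank"),
    (date(2007, 3, 5), "town-A", "acetone"),
    (date(2007, 3, 9), "town-A", "peroxide"),
    (date(2007, 3, 2), "town-B", "pipe"),
]
print(flag_clusters(events))  # ['town-A']
```

Real dot-connecting of course involves far richer entity resolution and link analysis; the point of the sketch is only that isolated, innocuous events become meaningful in aggregate.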
But we do have a lot of information, a lot of knowledge that we have collected, stored and keep on collecting, which can serve as training text for your machines. There are processes in our organization and our company where you could look over our shoulders and "tell" the machines why we thought about something, what made an analyst think of it. I am not suggesting you relocate to Israel and look over my shoulder, but there are ways around that, as you may understand, and we are doing that as part of the training and workshops we provide to customers. We are a commercial company, no misunderstanding about that; we are not a nonprofit organization. But we do have a lot of information which could be valuable to some of the research areas Dr. Elovici was speaking about, such as looking for signatures, as well as a lot of cyber-terror analysis. And on this happy note, thank you.
Security Informatics and Terrorism: Patrolling the Web
C.S. Gal et al. (Eds.)
IOS Press, 2008
© 2008 IOS Press. All rights reserved.
Web Harvesting Intelligent Anti-Terror System – Technology and Methodology
Dr. Uri HANANI¹
Founder and ex-CTO of MindCite Ltd., Israel
Abstract. The terror and anti-terror activities on the Web pose a real challenge for web harvesting and data analysis tasks. The Internet virtual domain is likely to be used intensively by terrorist organizations, and using advanced, innovative, AI-based web harvesting technologies gives a surprising advantage to those who master Internet open sources. At the same time, developing and implementing a large-scale web harvesting intelligent terror and security IT system is an extensive and time-consuming organizational endeavor. In order to ensure a successful system implementation that realizes the payoff of such a system, the organization has to initiate and manage a large-scale operation consisting of methodological, organizational, data-integration and technological components. In this paper we present a comprehensive system to tackle this challenge, starting with a theoretical web harvesting framework and leading to a collection of web technologies that were developed into a complete product that has given users vast field experience throughout the globe. The paper has two main objectives: to present a comprehensive technological and methodological framework for developing and implementing a web harvesting IT security system, and to draw and discuss lessons learned and conclusions from field experience gathered over the last decade. To achieve these goals we introduce and describe in detail an IT solution equipped with all necessary algorithms and tools, and then describe how this system and methodology were implemented in the intelligence community over the last decade. The paper concludes with lessons learned, real field experience, and a discussion of the system's advantages in the fight against terror and terrorist attacks in cyberspace.
Keywords. Anti-Terror, Web, Information Retrieval, Implementation Methodology, Digital Libraries, Harvested Digital Library, Ontology, Knowledge Management
Introduction

The terror and anti-terror activities on the Web pose a real challenge for web harvesting and data analysis. The Internet virtual domain is likely to be used intensively by terrorist organizations. Using advanced, innovative, AI-based web harvesting technologies gives a surprising advantage to those who master Internet open sources.

¹ Corresponding Author: Uri Hanani, 10/61 Harimon St., Givat-Shmuel, Israel 54403, Email: [email protected]
U. Hanani / Web Harvesting Intelligent Anti-Terror System – Technology and Methodology
Developing and implementing a large-scale web harvesting intelligent terror and security IT system is an extensive and time-consuming organizational endeavor. In order to ensure a successful system implementation that realizes the payoff of such a system, the organization has to initiate and manage a large-scale operation consisting of methodological, organizational, data-integration and technological components. In this paper we present a comprehensive system to tackle this challenge, starting with a theoretical web harvesting framework and leading to a collection of web technologies that were developed into a complete product that has given users vast field experience in many organizations throughout the globe. The paper has two main objectives:

1. Present a comprehensive technological and methodological framework for developing and implementing a web harvesting IT security system.
2. Draw and discuss lessons learned and the conclusions from field experience gathered over the last decade.
Although the data sources which exist on the Internet are text as well as multimedia (such as graphics, voice and video), we have limited our technologies to text-based systems only. The rationale for this assumption lies in the basic structure of the first generation of the Web environment. This assumption also allows us to go further and expand the scope of research in the text domain, as will be explained later. There are many research and technological issues that can be discussed. In this paper we focus on a better understanding of the following methodological issues (mainly based on the author's knowledge of the field and academic experience):

• Text mining and Natural Language Processing (NLP) capabilities – what are the necessary text web mining and harvesting components and algorithms required for an optimal or near-optimal system?
• System architecture and buildup.
• Web open sources – are they reliable, deceptive, exhaustive and inclusive in the anti-terror arena?
• Intranet resources – how to integrate legacy systems, email, office and file repositories, paper-based archives, etc., within the framework of a total anti-terror IT system?
• Ontology and taxonomy – how to design and maintain them?
• Human resources – the optimal combination of IT/technology staff and anti-terror operators (practitioners).
• Does history count? For which range of time? Can one look upon an anti-terror system as a dynamic digital library?
• Does the system pay off? Can one measure CSFs (Critical Success Factors)?
• User education and system implementation – the use of an anti-terror IT system for educational purposes within the intelligence community; who is responsible? How much to invest in the implementation phase? Is it a one-time operation or a continuing process?
• Who controls the system – in supporting high governmental decision making and policy making, who should control/manage/operate the anti-terror IT system?
This article is structured as follows. Section 1 concentrates on system requirements and objectives as well as data flow requirements. Section 2 discusses the system architecture and describes the developed software platform. The third section is devoted to case studies and the demonstration of the system in the field of anti-terror. The last section, Section 4, returns to methodological issues and presents some practitioners' conclusions and recommendations.
1. Web Harvesting – System Objectives

Following the introduction, this section aims to provide a general overview of various aspects and issues related to web harvesting and its specific requirements for information search on the Internet. Mainly we deal here with the definition of the information requirements and the data flow. Information requirements of organizations dealing with anti-terror issues are composed of two different components: the first is the data sources for the organization, and the second is the data flow and organizational processes for handling data in order to turn it into meaningful information and critical knowledge. In the first sub-section we start with a discussion of data sources.

1.1. Data Source Requirements

Data sources in a corporate environment include Internet open sources (sometimes closed sources as well, but we do not discuss this type of data in this paper) and Intranet data sources. The Intranet involves databases (legacy systems and data warehouses) and repositories such as email archives, scanned documents, ERP, office and file archives, etc. (see Figure 1). Dealing with all these types of data can be a huge and complicated task for an organization, especially for a large, dynamic corporation. Evidently there is a need for a consistent, reliable and simple system that will handle all these different types of data under one comprehensive model. Figure 2 presents such an approach: data can be gathered from the data sources shown from left to right. Internet data can be gathered using a sort of Crawler or Robot, and the Intranet data can be handled in the same manner, all controlled and managed by the Corporate Semantic System. Moreover, it is clear that the key element is the semantic capability of that Semantic System. These system characteristics will be explained in the next sub-section.
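One way to read Figure 2 is that every source type, web or Intranet, is normalized into a single document shape before the Corporate Semantic System processes it. A minimal sketch under that reading (class and field names are our assumptions for illustration, not MindCite's API):

```python
from dataclasses import dataclass
from typing import Iterable, Protocol

# Illustrative: every source type is normalized into one Document shape
# before semantic processing. Names here are ours, not the product's.
@dataclass
class Document:
    source_type: str   # "web", "email", "legacy-db", "scanned", ...
    uri: str
    text: str

class Source(Protocol):
    def fetch(self) -> Iterable[Document]: ...

class WebSource:
    def __init__(self, urls): self.urls = urls
    def fetch(self):
        for url in self.urls:
            # A real crawler would download the page and strip HTML here.
            yield Document("web", url, f"<crawled text of {url}>")

class EmailArchive:
    def __init__(self, messages): self.messages = messages
    def fetch(self):
        for msg_id, body in self.messages.items():
            yield Document("email", f"mail:{msg_id}", body)

def gather(sources: Iterable[Source]) -> list[Document]:
    """The 'Corporate Semantic System' sees one uniform stream."""
    return [doc for src in sources for doc in src.fetch()]

docs = gather([WebSource(["http://example.org"]),
               EmailArchive({"42": "meeting notes"})])
print([d.source_type for d in docs])  # ['web', 'email']
```

The design point is that downstream components (filtering, summarizing, indexing) never need to know which adapter produced a document.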
Figure 1. Mapping Corporate Data Sources
Figure 2. Flow of data gathering sources
1.2. Data Flow Requirements

Corporate IT departments dealing with anti-terrorism issues are expected to monitor the Web for terror-related topics from relevant websites. Figure 3 presents such a situation: information from websites involved with "Jihad", discussion groups dealing with possible threats, and other related terror data sources is monitored, gathered and stored in the corporate internal repositories.
Figure 3. Open Sources example
Yet one may pose many operational questions to the users and owners of this system. For example: how does one identify the relevant websites? What are the topics, magnitude and depth of these websites? What is the optimal scheduling policy? Which data is appropriate² for collection and storage? The approach we have chosen is the original Harvest Model [1, 2] described below. The steps depicted by the circles in Figure 4 are modifications of the Harvest Model and will be described later on. In order to understand our modified approach, we briefly present the enhanced Harvest Model, named in Hebrew "Katsir" [meaning "harvest"] [3, 4]. The harvesting logical model architecture is composed of the following seven components [5]:

² The academic concept is 'relevant' rather than 'appropriate'. We have chosen to start the discussion with the more popular concept; further on we will use the academic one.
1. Harvester – prepares the harvesting request for the HDL³. The harvester provides an interface to the information scientist to achieve this goal. The harvesting request consists initially of a DL (Digital Library) profile and a list of URLs. The DL profile contains the DL categories, a list of keywords and a set of expected stereotypes of the DL. The given URLs represent the most relevant sites known to the information scientist.
2. Locator – receives the initial harvesting request and automatically or semi-automatically expands the URL list. The Locator can consult various search engines to expand the harvesting request, and can also enhance the DL profile by using knowledge management techniques.
3. Gatherer – contacts the Internet and Intranet providers in order to gather the prospective resources for the HDL. The gathering is done by recursive descent of all the URLs provided in the harvesting request.
4. Filterer – filters out irrelevant documents and passes on only the documents that should be part of the HDL.
5. Summarizer – extracts summaries from the relevant resources and streams them to the Broker.
6. Broker – organizes the set of HDL summaries and builds the various metadata structures, such as a full index, a topic tree and relational views of the DL. It can relieve network traffic and solve bandwidth bottlenecks on the Web by using a Harvest Caching Server.
7. Retriever – provides the user with an interface for querying, browsing and touring the HDL.

Figure 4. Full Intelligence Cycle requirements & functionalities

³ HDL stands for Harvested Digital Library, which in accordance with the Harvest Model is any local repository equivalent to the repositories discussed here. It should meet six different criteria, the most important of which are: metadata services, persistency, quality control and a collection of services. DL is the acronym for Digital Library. For a detailed discussion of DLs see the following references [6, 7, 8, 9, 10, 11, and 12].
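As a rough illustration, the seven components can be sketched as a chain of functions. The bodies below are toy stand-ins (names and behaviors are ours for illustration, not the Katsir implementation):

```python
# Toy skeleton of the seven-component Harvest logical model.
def harvester(profile, seed_urls):
    return {"profile": profile, "urls": list(seed_urls)}

def locator(request):
    # Would consult search engines to expand the URL list; faked here.
    request["urls"] += [u + "/related" for u in request["urls"]]
    return request

def gatherer(request):
    # Recursive descent over all URLs; here we fake one page per URL.
    return [{"url": u, "text": f"page at {u}"} for u in request["urls"]]

def filterer(pages, keywords):
    return [p for p in pages if any(k in p["text"] for k in keywords)]

def summarizer(pages):
    return [{**p, "summary": p["text"][:40]} for p in pages]

def broker(summaries):
    # Builds metadata structures; here, a trivial full index by URL.
    return {p["url"]: p for p in summaries}

def retriever(hdl, query):
    return [u for u, p in hdl.items() if query in p["summary"]]

request = locator(harvester({"categories": ["IED"]}, ["http://example.org"]))
hdl = broker(summarizer(filterer(gatherer(request), keywords=["page"])))
print(retriever(hdl, "example.org"))
```

Each stage consumes the previous stage's output, which is what lets the model be recombined into the five end-user steps discussed next.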
These seven components of the Harvest logical model were mapped into the five basic requirements described by the circles in Figure 4. The Harvester and Locator were mapped into the Identify step in our current model; the Gatherer and Filterer were combined into the Gathering step; the Summarizer is depicted by the Organizer; the Broker and Retriever were combined and mapped into the Searching step; and an additional step was added to handle the Sharing and Collaborating functions. These requirements reflect the end-user perspective, whereas the original Harvest logical model first and foremost reflected the perspective of the software designer⁴. A detailed description of these requirements is shown in the "squares" in Figure 4 and discussed in Section 2 of the paper.
2. Web Harvesting – System Architecture

Having delineated the system requirements above, one can turn to the issues of system architecture. The best way to survey the web harvesting system architecture is to follow the path taken when a command for harvesting a data item is executed. This path includes the following steps (all, of course, performed automatically by the system):

1. Identify – crawls among the initial data sources (URLs) and identifies the appropriate (or relevant) items in accordance with keywords and profiles, applying other relevance measures such as the Ontology⁵.
2. Gather – initiates additional branching among candidate data sources (URLs), crawls among them and gathers the relevant ones after initial filtering.
3. Organize – generates metadata fields (using Natural Language Processing techniques such as stemming, stopping, noun phrasing and other linguistic algorithms), indexes and classifies the gathered data items, categorizes and clusters them, and uses additional categorization tools such as the Tray to serve user needs.
4. Search – endows the end user with services such as searching the repository, semantic search, keyword search, advanced search, browsing a topic-tree hierarchy view, cluster-based search, zooming into the metadata fields, field search, etc.
5. Share – collaboration services for the user community, such as SMS, RSS, Hot Stuff, file/tray sharing and the generation of automatic reports.
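A minimal sketch of the cyclic character of these steps, assuming (purely for illustration) that each harvested page yields one new candidate URL for the next iteration:

```python
# Toy sketch of the Identify -> Gather -> Organize cycle with feedback:
# URLs discovered in one run seed the next run. Not the product's code.
def run_cycle(seed_urls, known):
    identified = [u for u in seed_urls if u not in known]   # Identify
    gathered = {u: f"content of {u}" for u in identified}   # Gather
    organized = {u: {"keywords": text.split()}              # Organize
                 for u, text in gathered.items()}
    known.update(organized)          # repository feeds Search and Share
    # Feedback assumption: pretend each page links to one new URL.
    return [u + "/next" for u in identified]

known = {}
frontier = ["http://example.org"]
for _ in range(3):                   # three harvesting iterations
    frontier = run_cycle(frontier, known)
print(len(known))  # 3 pages known after three cycles
```

The feedback loop is what the paper means by the system getting "better and better with use": each run enlarges the repository that constrains and seeds the next.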
⁴ One can say that going from the academic model to the practitioner arena requires some re-adaptation and compromise; the changes made to the Harvest model while converting it into the working software described below were due to marketing as well as other software engineering constraints.
⁵ An ontology [13, 14, and 15] represents a conceptual body of knowledge in a specific domain. It can be shown as a network of keywords and various relations among them. See further on in the paper for a visualization of terror ontologies.
The nature of this process is cyclic, with feedback from one iteration to the next. Thus, the system uses additional information gathered in the previous run to enhance and elaborate the next cycle's performance. This iterative nature relies on a learning process to guarantee that this knowledge-based system gets better and better with use. Another overview of the system architecture is given in Figure 5. Four different "software machines", or modules, are included in the system framework, from left to right. The figure is drawn after the original documentation of MindCite Ltd. (www.mindcite.com), the company that developed the Citer⁶ product, which is a commercialization of the software framework presented in this paper. The four modules are:

1. Data Sources – the ability to gather unstructured sources (Internet open sources, forums and blogs, Intranet file repositories, office files, email, RSS, OCR) and structured data sources such as SQL Server, DB2, Oracle and other DBMSs, XML repositories and any other existing data storage in the organization.
2. Core – the intelligent core of the system, containing the algorithmic components which process the gathered data items, extract their metadata fields, take care of the multilingual⁷ issue, prepare the summary of each page, filter the irrelevant items, and cluster and index the pages. These components provide services to all the other components of the system (for simplicity, from now on we denote these functions as Citer).
3. Administration – as a software tool that should function smoothly in the end-user and corporate environments, the product must include management services. Besides the regular services of backup, recovery, user handling and DBMS management, there are specific services typical of a web harvesting environment: for example, scheduling the routine order of the harvesting tasks, management of groups of URLs, receiving services from search engines, authentication, and internal quality control of the system's functionalities. All these routine tasks are handled by this module.
4. PKM Workstation – the other three modules are part of what is called, in the software industry, an "SDK/API"⁸. In other words, no end-user or GUI (Graphical User Interface) components are included in them. The Personal Knowledge Management (PKM) module, however, provides an intelligent workstation with a GUI that delivers these services to the end users, including among others: searching, browsing, management of metadata fields, preparing reports, and any input or management activity.
There are many other technical and software capabilities in the Citer system. Some, like security handling, user management (LDAP) and DBMS management, are beyond the scope of this paper. Others are demonstrated in Figures 6-15 (screen layouts of Citer) and in the examples of the Citer application described in Section 3.
⁶ ®2000-2008 Citer is a Registered Trademark of MindCite Ltd.
⁷ Among many other technical features, Citer is I18N ('Internationalization' standard) compatible. It adheres to the severe requirements of a multilingual system.
⁸ SDK – Software Development Kit; API – Application Programming Interface.
Figure 5. System Architecture and Components
3. Application and Real Field Examples

Citer has been applied in fields such as anti-terror, finance, military, advertisement, transportation, etc. Hereafter we dedicate the discussion to some typical anti-terror applications and examples. Due to the nature of the customers and their applications, we cannot refer directly to the specific applications and data as they were applied, especially when dealing with anti-terror activities. Yet we shall try to illustrate the original approach with some samples of real data and data sources.

3.1. Terror in South-East Asia (SEA)

In this example, terrorist activities in South-East Asia are monitored, analyzed and reported. We start in this first sub-section with the construction of the ontology, then the data sources are handled, and we sum up with the HDL built and the search and analysis phases.

3.1.1. SEA Ontology Construction

To build an ontology, several steps must be completed. This process can be done using the software tool called the Ontology Editor, which is part of the Citer framework. It is out of the scope of this paper to go into the details of this process⁹. Yet, as
⁹ For a more complete explanation of the ontology construction process, we refer the reader to the Citer documentation at the MindCite web site (www.mindcite.com). This recommendation also holds for the other parts of the technical discussion of the Citer framework and the way it is implemented.
illustrated in Figure 6, the ontology building process involves several steps, such as: entity definition, relation definition (the "color"¹⁰ arcs in Figure 6), entity weights, entity slots and entity types. Figure 7 gives a zoomed-in view of the arch-terrorist Hambali. He is one of the main terrorist figures responsible for the Bali nightclub bombing of October 2002, and the leader of the JI (Jemaah Islamiah) terror organization, which was responsible for many terrorist activities in Indonesia and South-East Asia in the last decade.
Figure 6. SEA Ontology
The color¹⁰ connections between the various entities show how Hambali is connected to other objects in the terror world; each color stands for one kind of relation. For example, blue represents "Affiliation", while "Frozen Assets", "Terror Risk", "Operational" and other relations stand for other characteristics of the terror world. Hambali's personal slots (with their data fillers) appear in another data box (reached with a right click). Thus, unstructured data can be mingled with structured data from databases. Each relation can be filtered, searched and processed further as needed by the analyst. One should note the magnitude of this ontology: in comparison to a regular ontology that contains a few hundred entities [16], this demo ontology contains thousands of objects. A real-life ontology built by the system may sometimes contain
The original screens are in color. As these screen shots are in black & white one should note that each color stands for a different Relation type.
212
U. Hanani / Web Harvesting Intelligent Anti-Terror System – Technology and Methodology
hundreds of thousands of objects. Taking into consideration that this building process can be done automatically (see Section 3.1.2) this ontology can reach even millions of objects (and even more data fields). No need to explain the importance of the magnitude for anti-terror practitioners.
Figure 7. Hambali view at SEA Ontology
3.1.2. SEA Library

The next step is building the data source list. This is a task of collecting URLs and scheduling their harvesting routine, all done through Citer's “Administration Panel”, as seen in Figure 8, using the Data Source Editor. Data sources can be assembled into groups, and for each group a scheduling routine is determined. For each data source several parameters are optional (branching policy, maximum number of pages, depth). Feedback from these data sources arrives after each harvesting cycle is finished, and their content is enhanced through this learning process.

After completing the data source phase, the harvesting process is set to work on the data and an HDL is built. The end users can access the results harvested into the library through a GUI that works within the browser, as drawn in Figure 9. This interface encompasses a multitude of services such as searching (regular, advanced), browsing through the Topic-Tree viewer, using the Tray, quick view, the K-card and many others. Special attention should be given to the K-card (to the right of the figure). The K-card
contains the metadata fields that were generated automatically by the system; search, classification and other activities can be performed on them. From the anti-terror perspective, the Topic-Tree viewer (also generated automatically from the ontology) contains many relevant names and objects (e.g., the Abu Sayyaf group, Al Qaeda, FARC, Gerakan Aceh Merdeka, etc.) and many other related terror information objects. Besides the new interconnections revealed to the analyst using the system, one of the powerful tools in Citer is the set of metadata fields contained in the K-card. As explained previously, the metadata is generated automatically using Natural Language Processing algorithms (stemming, parsing, noun phrasing, summarization, multilingual processing and other semantics-based tools) and AI-based techniques (clustering, Markov chains, graph theory). Figure 10 demonstrates an example of a K-card that deals with the Indonesian terrorist scene. Besides the classical metadata fields (following the Dublin Core standard [17, 18]), one can see an automatic summary, automatic keyword generation (note the interesting examples of noun phrasing in Bahasa) and, above all, the ability to recognize that the original page is in the Bahasa Indonesia language, thus invoking the grammar rules of that specific language. All of these decisions are made automatically by the system.
Figure 8. Data Source Editor
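The grouping-and-scheduling step handled by the Data Source Editor might be sketched as follows. The class names, the interval-based scheduling policy and the defaults are illustrative assumptions, not the product's real configuration model.

```python
# Sketch of data sources assembled into groups, each group carrying a
# harvesting interval; the scheduler expands intervals into a timeline
# of harvesting events.
import heapq

class DataSource:
    def __init__(self, url, max_pages=1000, depth=3):
        # branching policy omitted for brevity
        self.url, self.max_pages, self.depth = url, max_pages, depth

class Group:
    def __init__(self, name, interval_hours, sources=()):
        self.name = name
        self.interval = interval_hours
        self.sources = list(sources)

def build_schedule(groups, horizon_hours):
    """Return (hour, group_name) harvesting events within the horizon."""
    events = []
    for g in groups:
        t = g.interval
        while t <= horizon_hours:
            heapq.heappush(events, (t, g.name))
            t += g.interval
    return [heapq.heappop(events) for _ in range(len(events))]

groups = [Group("SEA news", 6, [DataSource("http://example.org/news")]),
          Group("Forums", 12)]
print(build_schedule(groups, 24))
```

Each scheduled event would trigger one harvesting cycle for its group, after which the feedback-driven refinement described in the text takes place.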
Figure 9. Citer GUI
Figure 10. Bahasa (Indonesia) K-card
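The automatic K-card generation just described (language identification, then language-specific keyword and summary extraction) can be sketched as below. The stop-word lists, field names and first-sentence summary are illustrative assumptions standing in for the system's real NLP pipeline.

```python
# Toy K-card builder: detect the page language from stop-word overlap,
# then extract keywords and a crude summary in that language.
import re
from collections import Counter

STOPWORDS = {
    "en": {"the", "and", "of", "in", "to", "is"},
    "id": {"yang", "dan", "di", "ini", "dengan", "untuk"},  # Bahasa Indonesia
}

def detect_language(text):
    words = set(re.findall(r"[a-z]+", text.lower()))
    return max(STOPWORDS, key=lambda lang: len(words & STOPWORDS[lang]))

def build_kcard(text, n_keywords=3):
    lang = detect_language(text)
    tokens = [w for w in re.findall(r"[a-z]+", text.lower())
              if w not in STOPWORDS[lang] and len(w) > 3]
    keywords = [w for w, _ in Counter(tokens).most_common(n_keywords)]
    summary = text.strip().split(".")[0] + "."   # first sentence as summary
    return {"language": lang, "keywords": keywords, "summary": summary}

card = build_kcard("Laporan ini membahas jaringan teror di Indonesia. "
                   "Jaringan teror ini aktif dengan dukungan untuk sel lokal.")
print(card["language"])  # id
```

A production system would of course use proper language models, stemmers and noun-phrase extractors per language rather than raw token counts.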
3.1.3. Automatic Report Generation

Out of the hundreds of thousands11 (sometimes even many millions) of relevant pages gathered, the analysts are asked to produce a report that contains the most important information. A special tool called the e-Report Generator helps the analyst in this crucial task. Using the e-Report Manager, the operator prepares a template beforehand (or chooses among the available ones), and the system then automatically assembles several metadata fields into a report that can be edited further and/or e-mailed as a draft or an official report. Such a tool is highly valuable, as it saves a lot of expensive human labor and shortens production time. An example of such an e-Report on Hambali is presented in Figure 11. On the right side of Figure 11 one can see the GUI, and on the left a portion of the e-Report is shown.
Figure 11. Hambali e-Report
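The template-then-assemble workflow of the e-Report Manager could be sketched as below. The template syntax, field names and page structure are illustrative assumptions.

```python
# Toy e-Report assembly: an operator-prepared template is filled from the
# metadata fields of the harvested pages.
from string import Template

REPORT_TEMPLATE = Template(
    "INTELLIGENCE e-REPORT\n"
    "Subject: $subject\n"
    "Sources: $n_sources open-source pages\n"
    "Summary: $summary\n")

def generate_report(subject, pages):
    """Assemble a draft report from the pages' metadata fields."""
    summary = " ".join(p["summary"] for p in pages[:2])  # top pages only
    return REPORT_TEMPLATE.substitute(
        subject=subject, n_sources=len(pages), summary=summary)

pages = [{"summary": "Hambali led JI operations in South-East Asia."},
         {"summary": "He is linked to the October 2002 Bali bombing."}]
print(generate_report("Hambali", pages))
```

The generated draft would then be edited by the analyst or sent onward, as described in the text.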
The e-Reports and gathered pages (and any other input or output of the system) can be stored temporarily or permanently in the Citer system in a tool called the "Tray". A Tray can be personal or organizational, as chosen by the user. In Figure 9, on the left, a collection of trays is shown. The Tray mechanism serves as a file cabinet, and as trays can be shared within the community, it can serve as an investigative tool (or a case file system). There are many more tools that support investigation and anti-terror follow-up and analysis. In Section 3.2 some additional analytically oriented features are discussed.
11 In the demo HDL used in this example, there were about 537,500 gathered pages in the repository.
3.2. Spain Blast

3.2.1. The Ontology

The next real-life example is a retrospective investigation of the March 2004 Madrid blast. A lot of data was reported in the open sources about the terrorist group that planned and executed the attack, their activities and the social network that was created to carry out the blast (a network that could have been detected beforehand by analysts). Some of this important and complicated data is described in Figures 12 and 13.

The relations chosen in this case are presented in Figure 12. Besides the classical relations (Related-to, Synonym, Broader-than, Narrower-than) and the typical terror-related relations such as Affiliation, Arms Trafficking, Cells and Organizations, this instance includes relations with social networking features, such as Family, Meetings, Phones and Flights. Figure 13 visualizes the resulting network. Connections between the terrorists can be detected, as represented in Figure 14 by the example of Imad Barakat, the commander of the Madrid terrorist group. Zooming in on this visual picture endows the analysts with valuable information about the networking, family connections, flights and telephone calls which were all part of the preparation that took place before the blast. Additional structured data, such as flight and telephone records, can be gathered from governmental bodies – data which, most of the time, is accessible to homeland security investigative organizations. This data can teach the terrorists' mode of operation (modus operandi) and enable preemptive action against the next terrorist network. We emphasize that all this data was gathered from open sources. Availability of information on the Web is no longer the critical issue; the main issue now is making the information available to the anti-terror community, and used by it, in time.
Figure 12. Relations in Madrid Blast Ontology
Figure 13. Madrid Blast Ontology
Figure 14. Social Network of the Commander of the Madrid Blast
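The link-analysis view of Figures 13 and 14 amounts to a typed graph over people and a search for chains of connections. A minimal sketch follows; the sample edges and relation types are illustrative (only the names "Imad Barakat", "Meetings", "Phones" and "Family" come from the text).

```python
# Typed edge list plus breadth-first search that surfaces how two figures
# in the network are connected.
from collections import deque

edges = [  # (person, relation, person) -- toy sample of the Madrid network
    ("Imad Barakat", "Meetings", "Suspect A"),
    ("Suspect A", "Phones", "Suspect B"),
    ("Imad Barakat", "Family", "Relative C"),
]

def adjacency(edge_list):
    adj = {}
    for a, rel, b in edge_list:
        adj.setdefault(a, []).append((rel, b))
        adj.setdefault(b, []).append((rel, a))
    return adj

def connection_path(adj, src, dst):
    """Shortest chain of typed links between two entities, or None."""
    queue, seen = deque([(src, [src])]), {src}
    while queue:
        node, path = queue.popleft()
        if node == dst:
            return path
        for rel, nxt in adj.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [f"-{rel}->", nxt]))
    return None

adj = adjacency(edges)
print(connection_path(adj, "Imad Barakat", "Suspect B"))
# ['Imad Barakat', '-Meetings->', 'Suspect A', '-Phones->', 'Suspect B']
```

Such shortest-connection queries are one of the basic operations behind visual investigative tools of this kind.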
3.2.2. Entities Extraction

One may wonder how the tagging of objects (or entities) occurred in the various ontologies. Examining Figures 6, 7, 13, 14 and 15, one can see the categorization of the entities by gender, person, country, organization, etc. Most of the time, human labor is required to find the identity of the object under study, which requires enormous time and attention. The area of research called Entities Extraction tries to automate this tedious task. In the Citer framework, human manual tagging, semi-automatic tagging and fully automatic tagging are all possible. By semi-automatic tagging we mean that the system output is revised and checked by a human analyst, while fully automatic tagging produces tags and extracts entities without human intervention. It is clear that adopting fully automatic tagging (full entities extraction) can produce big cost savings as well as efficient ontologies. The research to fully automate entities extraction is still under way, yet some tools are already available. As can be seen in Figure 16, the page shown deals with an analysis of global terrorism, and on the right side of the figure an ontology map was generated, with identification of specific objects and their identifiers (person, organization and place).12

3.3. Additional Advantages

In this sub-section we summarize some of the advantages of using the system and lay the groundwork for the methodological discussion in Section 4. One can show advantages in four principal domains: cost savings, timely and beneficial information, enhanced investigative tools and capabilities, and achieving long-term goals such as the education of trainees and, above all, organizational memory.

1. Cost savings: it is evident that an army of operators dealing with the collection of open source and Intranet information is very expensive for the intelligence community. If some of these tasks can be done automatically or semi-automatically, the cost savings are significant. It is beyond the scope of this research to measure and quantify the amount of the cost savings. However, one can easily understand that such broad functionality entails large and substantial cost savings.
2. Timely and beneficial information: even if cost reduction were not provable, the advantage of highly timely and focused information would be of very high value to the intelligence community. The value of data that has been gathered, filtered, focused, organized and presented in a totally integrated way at the right time is valuable beyond any measurable criteria.
3. Enhanced investigative tools: link analysis and visual investigative graphical tools (see for example Figure 15) add capabilities that no manual work can supply. Scanning enormous databases with billions of records and pinpointing clues for additional analysis is beyond human labor capabilities.
4. Long-term organizational goals: the education and training of new recruits in the intelligence community is a tedious and difficult task. Using the system’s ontology memory, past experience, “war stories”, e-Reports and other such tools enables easy simulations in training exercises. Moreover, storing all the past reports, documents and history objects in the system creates what is called in the Knowledge Management arena “organizational memory” (OM) [19]. Organizational memory is considered one of a corporation’s most valuable assets. It can contain broad past experience, lessons learned and procedures, thus creating a living corporate heritage.

12 A detailed analysis of the results shows that although the 'Senate' and other organizations are identified correctly, 'Al Qaeda', which is an organization, was identified as a Person. Using a machine learning algorithm ensures that this extraction error will be corrected in the next run.
Figure 15. Links of Imad Barakat to his network
Figure 16. Automatic Generation of Ontology Map
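One simple building block of the entity extraction step described in Section 3.2.2 is a gazetteer lookup that assigns types to known names, with anything ambiguous queued for human review (the semi-automatic mode). A minimal sketch follows; the gazetteer contents are illustrative assumptions, and real extractors combine such lookups with statistical and machine-learning models.

```python
# Toy gazetteer-based entity extractor: known names are matched as whole
# words and tagged with their type.
import re

GAZETTEER = {
    "al qaeda": "Organization",
    "senate": "Organization",
    "hambali": "Person",
    "madrid": "Place",
}

def extract_entities(text):
    """Return sorted (surface form, type) pairs found via the gazetteer."""
    found = []
    lowered = text.lower()
    for name, etype in GAZETTEER.items():
        if re.search(r"\b" + re.escape(name) + r"\b", lowered):
            found.append((name, etype))
    return sorted(found)

print(extract_entities("The Senate report mentions Al Qaeda cells in Madrid."))
# [('al qaeda', 'Organization'), ('madrid', 'Place'), ('senate', 'Organization')]
```

The misclassification noted in footnote 12 ('Al Qaeda' tagged as a Person) is exactly the kind of error that a purely statistical extractor can make and that a curated gazetteer or a feedback-trained model corrects.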
4. Conclusions – Methodological and In-the-Field Issues

As explained in the introduction, developing and implementing a large-scale web harvesting intelligent terror & security IT system is an extensive and time-consuming organizational endeavor. In order to launch a successful system implementation, the intelligence organization has to initiate and manage a large-scale and complex operation that consists of several activities with methodological, organizational, data integration and technological components. In this last section we aim to make use of widespread field and academic experience gathered during the last decade. Rather than giving answers, we prefer to pose the methodological issues to be dealt with and present some of the lessons learned.

• Text Mining and Natural Language Processing (NLP) capabilities – are the existing algorithms strong and efficient enough to meet the end-users’ requirements? It seems that although much is yet to be developed in NLP research, the tools that already exist are not the bottleneck for the success of the systems. In other words, the level of sophistication and efficiency of NLP seems able to provide at least minimally satisfactory results for the intelligence community.

• System architecture and buildup – is the system an “isolated island” in the organizational IT network? Are the system servers isolated from the rest of the corporate network? Can one trust the various existing firewall technologies? What level of security must be adopted for the open, live connection to the Web? As seen in Figure 3, the system server is behind a firewall, but how should one manage the rest of the HDLs (the digital libraries handling the Intranet documents)? These critical security issues are certainly beyond the scope of this paper. Practical solutions exist, with a certain amount of organizational security risk-taking. There are other, non-risky solutions that come with complexities in the operational usage of the diverse HDLs. As always, the organization has to find the optimal “golden way” compromise.

• Web open sources – are they reliable, deceptive, exhaustive and inclusive in the anti-terror arena? Can the organization make use of open sources on a regular basis? It is clear that although open sources cannot be considered a fully reliable source of information, the intelligence community long ago adopted open sources as a channel one cannot avoid using in the daily routine. Constant follow-up of reliability and integrity (which can be implemented through the Citer system), together with other internal measurements (such as authentication, reliability, exhaustiveness, etc.) and tools, is a partial answer to these critical problems.

• Intranet resources – how should legacy systems, e-mail, file repositories, office file archives, paper-based archives, etc., be integrated within the framework of a total anti-terror IT system? Can the corporation produce an encompassing one-stop solution? As indicated above, the cost of failing to integrate information is immeasurable. How can one accept a situation in which the relevant information exists, yet is not available in the right time frame when asked for? Practical solutions that enable most of the necessary integration exist in the Citer system (and perhaps in other data warehousing systems), as Intranet repositories can be integrated with open sources of information. In summary, the end user can have an overview of the relevant information from existing corporate repositories.

• Ontology & taxonomy – how should they be designed and maintained? How should a corporate anti-terror ontology be developed and maintained? In this paper we have shown that the ontology-based approach can yield more than reasonable results. Another important phenomenon is the Semantic Web [20, 21]. It derives from Sir Tim Berners-Lee's vision of the Web as a universal medium for data, information and knowledge exchange. Over the last decade the notion of the Semantic Web has turned into a widespread market tendency, and it is all based on the concept of ontology (and folksonomies13, i.e., ontologies created by specific communities). Tools to maintain and develop ontologies are under development, and the academic community is giving more and more attention to this domain of research.

• Human resources – what is the optimal combination of IT/technology staff and anti-terror operators (practitioners)? Who is in charge of the management policy of the system – IT, information scientists or the operators? Who is in charge of the project in general? As always, the end users are typically in charge; however, some IT middle management should be put in place to ensure smooth operations and proper maintenance of the systems. Special attention should be given to the information scientists (the librarian or archivist of the “old paper culture”), or the intermediary person who implements and tutors the system and takes care of the ontology, data sources, system data integrity and content.

• Does history count? For what range of time? Can one look upon an anti-terror system as a dynamic digital library? At first, while launching the system, this issue seems remote and irrelevant. As time passes and the system becomes a living phenomenon in the organization, end users cannot understand how the operators' routine work was ever done without the digital library. In such intelligent organizations there is no problem in finding and integrating past experience and knowledge in daily affairs.

• Does the system pay off? Can one measure CSF (Critical Success Factors)? This is not a simple issue to determine. Sometimes even a single successful operation with such a system gives a positive answer. In other scenarios, only a comprehensive definition of full CSF criteria and measurements yields, at last, a trustworthy answer.

• User education and system implementation – the benefit to users of anti-terror IT systems for educational purposes within the intelligence community was explained in Section 3.3. The benefit is substantial and allows simulation and intensive training. The process of continuing education for users of the system goes hand-in-hand with the classical process of system implementation and life-cycle considerations. Even after the first round of system implementation is over, a new generation of operators is imminent and a continual process of education is needed. Moreover, the introduction and teaching of routine procedures of daily work to operators may be more successful when they are trained hand-in-hand with an appropriate supportive system.

• Who controls the system? – in supporting high governmental decision making and policy making, who should control, manage and operate the anti-terror IT system? Organizational policy, and sometimes a specific country’s or government’s policy, is an important consideration. The owner of such a system is surely the owner of a powerful resource. Attention should be given to this issue beforehand to ensure that the right government policy (decentralized, centralized or a hybrid solution) is chosen.

13 Folksonomy stands for collaborative tagging, social classification and social tagging. In other words, it is the "practice of collaboratively creating and managing tags to annotate and categorize content. In contrast to traditional subject indexing, metadata is not only generated by experts but also by creators and consumers of the content. Usually, freely chosen keywords are used instead of a controlled vocabulary" [23].
There are many more methodological and technological issues that could be raised and discussed [22], but that goes beyond the scope of this paper. In summary, this research has tried to bring together theoretical and academic technologies developed over the last few decades with the pragmatic field experience of practitioners gathered over the last few years. We are confident that the convergence of both approaches can yield better solutions and answers for the intelligence community in the important fight against terror in cyberspace.
Acknowledgements

The author is indebted to Dr. S. Gal, Ms. A. Alter and other members of the MindCite Ltd. team for their insightful observations and elucidations, and for their support and assistance over the last decade of collaboration.
References

[1] Bowman C. M., et al., “Scaleable Internet Resource Discovery: Research Problems and Approaches”, Communications of the ACM, vol. 37, no. 8, August 1994, pp. 98-107.
[2] Bowman C. M., et al., “Harvest: a Scaleable, Customizable Discovery and Access System”, Dept. Computer Science, Univ. Colorado, TR CU-CS-732-94, March 1995.
[3] Hanani U., and A. Frank, "Katsir: A Framework for Harvesting Digital Libraries on the Web”, ECIS 2000 Proceedings, Vienna, July 2000, pp. 306-312.
[4] Hanani U., and A. Frank, "Intelligent Information Harvesting Architecture: an Application to a High School Environment”, Online Information 96, London, December 1996, pp. 211-220.
[5] Hanani U., and A. Frank, "The Parallel Evolution of Search Engines and Digital Libraries: their Convergence to the Mega Portal”, 2000 Kyoto International Conference on Digital Libraries: Research and Practice, Kyoto, Japan, November 2000, pp. 269-276.
[6] Special Issue: Digital Library, Communications of the ACM, April 1995, vol. 38, no. 4.
[7] Special Issue: Digital Library, IEEE Computer, May 1996, vol. 29, no. 5.
[8] Chen H., and A. L. Houston, “Digital Libraries: Social Issues and Technological Advances”, Advances in Computers, Academic Press, vol. 48, 1999, pp. 257-314.
[9] Arms W. Y., Digital Libraries, MIT Press, Cambridge, Mass., 2000.
[10] Lesk M., Practical Digital Libraries, Morgan Kaufmann, San Francisco, 1997.
[11] Kessler J., Internet Digital Libraries, Artech House, Boston, 1996.
[12] Waters D. J., “What are digital libraries?”, CLIR Issues, July/August 1998, http://www.clir.org/pubs/issues/issues04.HTML
[13] Katifori A., et al., "Ontology Visualization Methods—a Survey", ACM Computing Surveys, vol. 39, no. 4, article 10, October 2007, 43 pages.
[14] Staab S., and R. Studer (Eds.), Handbook on Ontologies, Springer Verlag, Berlin, 2004.
[15] Fensel D., Ontologies: A Silver Bullet for Knowledge Management and Electronic Commerce, Springer Verlag, Berlin, 2004.
[16] Cardoso J., “The Semantic Web Vision: Where Are We?”, IEEE Intelligent Systems, September/October 2007, vol. 22, no. 5, pp. 84-88.
[17] Lassila O., "Web Metadata: A Matter of Semantics", IEEE Internet Computing, July 1998, vol. 2, no. 4, pp. 30-37.
[18] Rust G., "Metadata: the Right Approach", D-Lib Magazine, July/August 1998.
[19] Kransdorff A., Corporate DNA, Gower Publishing, 2006.
[20] Davies J., D. Fensel and F. Van Harmelen (Eds.), Towards the Semantic Web, John Wiley & Sons, Chichester, 2003.
[21] Shadbolt N., T. Berners-Lee and W. Hall, "The Semantic Web Revisited", IEEE Intelligent Systems, May/June 2006, vol. 21, no. 3, pp. 96-101.
[22] Last M., and A. Kandel (Eds.), Fighting Terror in Cyberspace, World Scientific, New Jersey, 2005.
[23] Wikipedia, http://en.wikipedia.org/wiki/Folksonomy.
Security Informatics and Terrorism: Patrolling the Web
C.S. Gal et al. (Eds.)
IOS Press, 2008
© 2008 IOS Press. All rights reserved.
How to Protect Critical Infrastructure from Cyber-Terrorist Attacks

Yuval ELOVICI1, Chanan GLEZER and Roman ENGLERT
Deutsche Telekom Laboratories at Ben-Gurion University, Israel
Abstract. This article deals with the protection of home and enterprise users, and in particular Critical Infrastructures (CIs), against attacks unleashed by terrorists or criminals. Threats and challenges in large-scale network protection are discussed, and the corresponding defense mechanisms are classified into defensive and offensive. One defensive and one offensive mechanism are described. The Early Detection, Alert and Response (eDare) framework is a defensive mechanism aimed at removing malware from NSPs’ traffic. eDare employs powerful network traffic scanners to sanitize web traffic of known malware. The remaining traffic is monitored, and various types of algorithms are used to identify unknown malware. To augment the judgments of the algorithms, experts’ opinions are used to classify files suspected as malware about which the algorithms are not decisive. Finally, collaborative feedback and tips from end-users are meshed into the identification process. The DENAT system is an offensive mechanism which uses machine learning algorithms to analyze traffic sent from organizations such as universities through Network Address Translators (NAT). The analysis associates users with the content of their traffic in order to identify access to terror-related websites.
Keywords. Distributed Network Intrusion Detection System (DNIDS), Critical Infrastructure Protection, Network Address Translator
Introduction

The Internet has become a public medium where governments, businesses and home users, as well as terrorists, can interact in a relatively anonymous manner. Every day new vulnerabilities are discovered and later exploited by hackers to create killer worms that inflict huge damage on the public. As a case in point, the Melissa worm caused an estimated 1.1 billion dollars of damage, “Code Red” 2.6 billion dollars and the SQL Slammer 1.2 billion dollars. These threats have existed for the last 20 years and are simply more abundant nowadays. They include viruses, worms, Trojan horses, spyware, adware, phishing and spam.

According to a 2005 survey by America Online and the National Cyber Security Alliance (NCSA) [1-3], 81% of home users’ computers lack core protections, including up-to-date anti-virus software, a properly configured firewall and/or spyware protection. Twenty-three percent of home computer users had received at least one phishing attempt via e-mail over the prior two weeks. Given the size of the home web access market, the survey highlights how vulnerable home-user computers are. For example, terrorists may follow up an announcement about a new vulnerability in any software commonly used by the public and create a killer worm which they launch unobtrusively through a Trojan horse installed by innocent users. The bottom line is that terrorists could exploit such vulnerabilities to attack Critical Infrastructures (CIs) through home users. It is therefore very important to develop effective technologies to protect home users from being infected by malware.

Shielding end users and CIs against malware can be divided into two approaches: defensive and offensive. Under the defensive approach, one possible solution is to focus on web traffic and malware through activities such as detecting and classifying new vulnerabilities; contrasting normal versus abnormal traffic patterns; continuously updating existing firewalls and intrusion detection systems (IDS) with signatures of new threats; and protecting users’ hosts from being exploited as launch pads for cyber-terror attacks (e.g., Distributed Denial of Service). The offensive approach, on the other hand, may include engaging in a stealthy hunt for the human and behavioral traits characterizing perpetrator activities, including forensic analysis to pinpoint the origin of cyber-terrorist attacks after they have occurred.

1 Corresponding Author: Dr. Yuval Elovici, Deutsche Telekom Research Labs at Ben-Gurion University; E-mail: [email protected]
This paper presents the research activities on defensive and offensive protection mechanisms which have been conducted at Deutsche Telekom Laboratories at Ben-Gurion University.
1. Defensive Protection Mechanisms

As indicated above, home users cannot be relied upon to keep their computers adequately protected. The idea is, therefore, to enable Network Service Providers (NSPs) to protect their customers by sanitizing traffic on the core network. Consequently, customers benefit from cleaner traffic without having to deal with the ongoing hurdle of keeping the security definitions on their end-devices up-to-date. The underlying assumption is that NSPs have a unique ability to quickly detect an emerging new threat and then purify the network of the threat in a centralized manner by sanitizing all the traffic they deliver to their customers.

To achieve the above goal, a sophisticated algorithm was developed to determine the optimal location of high-performance cleaning devices across the NSP network (see Figure 1). The idea is that it is possible to purify the traffic of an NSP by cleaning only a few strategic links, identified by algorithms stemming from complex network analysis. Such theoretic models [4] suggest protecting 4% of the routers/links in order to contain malware propagation on the entire network.
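The idea of protecting a small fraction of strategic links can be sketched with a toy ranking. Real deployments use centrality measures from complex-network analysis, as in the cited models [4]; the degree-based score below is only a rough proxy, and the network and fraction are illustrative.

```python
# Toy selection of links to protect: rank each link by the combined
# degree of its endpoints and keep the top fraction.
def link_scores(links):
    """Score each link by the combined degree of its endpoints."""
    degree = {}
    for a, b in links:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
    return {(a, b): degree[a] + degree[b] for a, b in links}

def links_to_protect(links, fraction=0.04):
    """Return the top `fraction` of links by score (at least one link)."""
    scores = link_scores(links)
    k = max(1, round(len(links) * fraction))
    return sorted(scores, key=scores.get, reverse=True)[:k]

backbone = [("core1", "core2"), ("core1", "edge1"), ("core1", "edge2"),
            ("core2", "edge3"), ("edge1", "edge4")]
print(links_to_protect(backbone, fraction=0.2))  # [('core1', 'core2')]
```

Cleaning devices would then be deployed only on the selected links, which is what makes centralized NSP-side sanitization economically feasible.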
Figure 1. Finding the optimal deployment of sniffers and cleaning devices
Executable files are synthesized and assembled from the traffic collected at the optimally selected monitoring points and reconstructed from layer 3 to layer 7 (see Figure 1). The executable files are analyzed using various machine learning algorithms (e.g., artificial neural networks, decision trees, static analysis [12]). Whenever the conclusion is reached that a file includes malicious code, we immediately create a new signature and update all the distributed cleaning devices optimally deployed inside the network with the new signature. In that way, if the threat emerges again in the network, it will be blocked (see Figure 2).

eDare [5-11] is designed to provide maximum automation in the cycle of malware interception: detection, analysis, alert, and response or remedy. The system aims at a very low false-positive rate by integrating multiple sources of information and processing techniques. The system also incorporates state-of-the-art hardware devices enabling fast scanning of web traffic at speeds meeting the typical throughputs of NSP edge routers. When encountering new malware or suspicious behaviors, the response time of the system is expedited by the sharing of observations and warnings between users and the system. Finally, the system can accommodate external plug-ins, expert consultation and risk assessment in a flexible manner. The malware faced by eDare can be classified into two types:

• Known malware, for which eDare has already generated a distinct signature; and
• Unknown (new) malware, which eDare has yet to encounter and classify, and for which eDare needs to generate a distinct signature.
The architecture of eDare is presented in Figure 3. Known malware is identified and blocked in real time by the known malware handler module (KMHM), using lightweight yet powerful hardware scanning devices capable of detecting, blocking or disarming known malware at a rate of multiple gigabits per second. This hardware is the main building block of the known malware detection module.
Figure 2. Eliminating known malicious code from the network traffic
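The known-malware path amounts to matching byte signatures against reassembled payloads. A minimal sketch follows; the signature names and byte strings are toy examples, not real malware patterns, and hardware scanners use far more efficient multi-pattern matching.

```python
# Toy signature scan over a reassembled payload: report every known
# signature whose byte pattern occurs in the payload.
SIGNATURES = {
    "demo-worm-1": b"\x4d\x5a\x90\xde\xad",
    "demo-trojan-2": b"EVILPAYLOAD",
}

def scan_payload(payload, signatures=SIGNATURES):
    """Return names of all signatures found in the payload bytes."""
    return [name for name, sig in signatures.items() if sig in payload]

clean = b"GET /index.html HTTP/1.1"
infected = b"header" + b"EVILPAYLOAD" + b"trailer"
print(scan_payload(clean), scan_payload(infected))  # [] ['demo-trojan-2']
```

When the new-malware path (described next) produces a fresh signature, it is simply added to this signature set on every deployed cleaning device.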
Another module, the new malware detection module (NMDM), constantly monitors data traffic and searches for previously un-encountered malware. This module cannot work in real-time as evidence may need to be accumulated over time and since some modern algorithms are very computationally intensive. Upon new malware recognition, a signature builder component is activated and the generated signature is published to the known malware detection module and to all eDare protection and Feedback Agents installed on end-user devices. To augment detection performed on the network, each end-user can decide whether to install the eDare protection & feedback agent on her or his computer. Endusers will also have the ability to choose which of the following functions of the eDare agent they wish to use. The agent can automatically clean out (after user’s confirmation) malware located on the user’s device that managed to enter between the moment it first infiltrated the network and the moment eDare generated and propagated
a valid signature to its end-users. Cleaning is possible when the malware was downloaded but not yet launched, or when it was launched but did not change the OS in a way that would prevent its deletion. The agent also monitors user-content interaction for suspicious patterns; deploys a dynamic sandbox to safely open and test incoming files; transfers end-users' observation tips to eDare; and displays warnings and other messages to end-users. These capabilities are provided while maintaining user anonymity, privacy and data secrecy.
Figure 3. Conceptual Architecture of eDare
To leverage the sharing of observations and tips among end-users, a collaborative malware recognition module receives feedback from both the eDare agent and the end-users. In some cases the feedback will be an opinion, and in other cases it might even be false. Thus, the main function of this module is to transform subjective feedback gathered from various sources (both human and software) into objective, coherent and valid information, which is then fed into the new malware recognition module. Additionally, the collaborative malware recognition module keeps eDare agents up to date by forwarding various kinds of information, such as new malware signatures, configuration updates, software updates and warnings.

Nevertheless, some novel or peculiar malware that cannot be handled decisively by the new malware detection module requires more intensive, manual analysis. An expert group feedback manager module, which resides in the eDare control center, collects feedback from a community of computer security experts affiliated with eDare. The group of experts is asked to resolve conflicts by confirming whether or not a data stream (or a file) actually contains a new malware instance. The experts are presented with all the supporting information and tools necessary to make such a judgment. Using this module, experts may also fine-tune the statistical thresholds, reliability scores and weights for human and software judgments that determine when to automatically treat a potential instance of malware as actual new malware and when to ask for further expert consultation.

All in all, eDare facilitates a detection and immunization cycle that is much shorter than that of antivirus companies, because it is performed automatically by an ensemble of automated classifiers with assistance from human experts whenever required. This enables the detection of new threats and the immediate updating of the cleaning devices with a new signature.
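The routing logic that the experts tune (automatic treatment versus further consultation) might be sketched as follows; the weights, thresholds and function names are purely illustrative, not taken from the eDare system:

```python
def aggregate(reports, auto_threshold=0.8, expert_threshold=0.4):
    """Combine subjective judgments into one score, then route it.

    `reports` is a list of (weight, verdict) pairs, where `weight` is the
    source's reliability score and `verdict` in [0, 1] is how strongly
    that source (human or software) believes the item is malware.
    Threshold values are illustrative assumptions.
    """
    total_weight = sum(w for w, _ in reports)
    if total_weight == 0:
        return "dismiss"
    score = sum(w * v for w, v in reports) / total_weight
    if score >= auto_threshold:
        return "treat as malware"     # confident enough to act automatically
    if score >= expert_threshold:
        return "ask expert group"     # ambiguous: escalate for manual analysis
    return "dismiss"

# two reliable agents flag the item; one low-reliability user disagrees
assert aggregate([(0.9, 1.0), (0.8, 1.0), (0.2, 0.0)]) == "treat as malware"
assert aggregate([(0.5, 0.5)]) == "ask expert group"
assert aggregate([(1.0, 0.0)]) == "dismiss"
```

Raising or lowering the two thresholds directly trades automation against expert workload, which is exactly the knob the text says the experts adjust.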
2. Offensive Protection Mechanisms

The goal of an offensive approach is to proactively detect terrorist activities on the World Wide Web. In this case, the goal is to detect terrorist activities hidden behind a Network Address Translator (NAT) used by organizational network infrastructures. A NAT is a device that connects an organizational internal network to the Internet and hides the real IP addresses of the internal network from the external network. Thus an outside observer sees only one Internet Protocol (IP) address, while there are many internal IPs inside the organization. The problem with NAT is that a perpetrator can hide behind the organizational NAT and navigate to terror-related websites without being identified (Figure 4). The goal of this research was to develop a system, deployed outside the organizational network, that attempts to attribute the traffic emanating from the organizational network to its original IP address. The research employed several analyzers based on machine learning techniques. The underlying assumption is that packets originating from a single host behind the NAT share features that can be used to separate them from the packets of other hosts. For example, each host's packets carry an incrementing counter, so continuity in counter values can most likely be attributed to packets that emanate from the same internal IP address. Each cluster of packets is expected to include only packets of a specific user hidden behind the NAT, and then
cluster detection techniques can be applied to the accumulated traffic to analyze whether it includes access to terror-related websites. A system based on the new approach includes four modules:

- Sniffer – intercepts the NAT network traffic and stores it in a database.
- Analyzers – cluster the intercepted traffic based on internal detection techniques and store the results in the database.
- Final Clustering – combines the clustering results of the analyzers into one final clustering scheme.
- Viewer – presents to the user the reconstructed sessions of each group (host computer).

Each analyzer implemented in our study relies on a different network protocol and uses different protocol attributes in order to assign each packet to the appropriate group, with a pertinence probability. Each analyzer examines only packets/sessions that are relevant to the protocol it can handle.

1. MSN Messenger™ Analyzer – uses the "transaction id" of the MSN Messenger protocol to group sessions.
2. HTTP Analyzer – uses the HTTP request header fields user agent, host and referrer to cross-reference HTTP sessions.
3. Time Stamps Analyzer – uses the TCP protocol's time stamp field to assign each packet to the stream of packets with similar time stamps.
4. IP-ID Analyzer – uses the IP protocol's ID field [13], originally used for defragmentation, to assign each packet to the group of packets with similar IDs.
5. SMTP Analyzer – uses the SMTP protocol's handshake to cross-reference different mail sessions.
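As an illustration of analyzer 4, the sketch below groups packets by continuity of their IP-ID counters, in the spirit of Bellovin's NAT-counting technique [13]; the function name, packet representation and parameter values are assumptions for the sketch, not the system's actual settings:

```python
def ipid_analyzer(packets, max_gap=16):
    """Assign each packet (represented here by its 16-bit IP-ID value) to
    the group whose most recent IP-ID it most nearly continues; a new
    group is opened when no existing counter is close enough."""
    groups = []        # per group: last IP-ID value seen
    assignments = []
    for ipid in packets:
        best, best_gap = None, None
        for i, last in enumerate(groups):
            gap = (ipid - last) % 65536    # the counter wraps at 2**16
            if 0 < gap <= max_gap and (best is None or gap < best_gap):
                best, best_gap = i, gap
        if best is None:
            groups.append(ipid)
            assignments.append(len(groups) - 1)
        else:
            groups[best] = ipid
            assignments.append(best)
    return assignments

# two interleaved hosts with independent IP-ID counters are separated
assert ipid_analyzer([100, 5000, 101, 5001, 102, 5002]) == [0, 1, 0, 1, 0, 1]
assert ipid_analyzer([10, 11, 12]) == [0, 0, 0]
```

A real analyzer would attach a probability to each assignment (shrinking as the gap grows) rather than a hard label, matching the pertinence probabilities described above.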
Figure 4. NAT architecture
Each analyzer examines packets one by one, and for each packet/session it delivers a list of pairs: a group and the probability that the packet/session belongs to that group. The groups generated by the analyzers are defined locally, in that each analyzer constructs its own database of groups. Finally, the results of all the analyzers are clustered together into a single scheme. Figure 5 depicts the architecture of the DENAT system, which employs several analyzers to study the traffic obtained from the NAT sniffers.
Figure 5. DENAT system
3. Conclusion

New research challenges today lie mainly in developing innovative approaches for detecting and preventing a variety of computer network attacks, including distributed denial-of-service attacks. Penetration-testing technologies need to be improved in order to detect vulnerabilities before terrorists can exploit them. In addition, there is a need for innovative approaches to detecting the origin of cyber attacks, thereby deterring terrorists from unleashing such attacks because they will know they can be identified. Finally, there is a need for state-of-the-art approaches to detecting unknown malicious code. With our current system we achieved a detection rate of over 95% for unknown malicious code (true positives) with relatively few false positives.
References

[1] NCSA Study. Available from http://www.staysafeonline.info/pdf/safety_study_2005.pdf
[2] Symantec Internet Security Threat Report (January-June 2004). www.symantec.com
[3] The Danger of Spyware, Symantec Security Response. www.symantec.com, June 2003.
[4] L.C. Chen and M. Carley, The impact of countermeasure propagation on the prevalence of computer viruses. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 34(2), 2004, pp. 823–833.
[5] A. Shabtai, Y. Elovici, and Y. Shahar, Knowledge-Based Temporal Abstraction (KBTA) Method for Detection of Electronic Threats, 5th European Conference on Information Warfare and Security, National Defence College, Helsinki, Finland, June 1-2, 2006.
[6] D. Stopel, Z. Boger, R. Moskovitch, Y. Shahar, and Y. Elovici, Application of Artificial Neural Networks Techniques to Computer Worm Detection, IEEE World Congress on Computational Intelligence (IEEE WCCI 2006), Vancouver, BC, Canada, July 16-21, 2006.
[7] D. Stopel, Z. Boger, R. Moskovitch, Y. Shahar, and Y. Elovici, Improving Worm Detection with Artificial Neural Networks through Feature Selection and Temporal Analysis Techniques, International Conference on Neural Networks (ICNN 2006), Barcelona, Spain, October 22-24, 2006.
[8] A. Shabtai, D. Klimov, Y. Shahar, and Y. Elovici, An Innovative Visualization Tool for Exploration of Time-Oriented Security Data, ACM Workshop on Visualization for Computer Security (VizSEC 2006), Virginia, USA, November 2006.
[9] A. Shabtai, Y. Shahar, and Y. Elovici, Monitoring for Malware Using a Temporal-Abstraction Knowledge Base, 8th International Symposium on Systems and Information Security (SSI 2006), Sao Jose dos Campos, Sao Paulo, Brazil, November 8-10, 2006.
[10] R. Moskovitch, I. Gus, S. Pluderman, D. Stopel, C. Glezer, Y. Shahar, and Y. Elovici, Detection of Unknown Computer Worms Activity Based on Computer Behavior Using Data Mining, IEEE Symposium on Computational Intelligence and Data Mining, Honolulu, Hawaii, 2007.
[11] M. Tubi, R. Puzis, and Y. Elovici, Deployment of DNIDS in Social Networks, Intelligence and Security Informatics 2007, New Jersey, USA, May 23, 2007.
[12] M. Christodorescu and S. Jha, Static Analysis of Executables to Detect Malicious Patterns, Proc. of the 12th USENIX Security Symposium (Security '03), Washington, D.C., August 4-8, 2003.
[13] S.M. Bellovin, A Technique for Counting NATted Hosts, Proc. of the 2nd Internet Measurement Workshop (IMW '02), Marseille, France, November 2002.
Security Informatics and Terrorism: Patrolling the Web C.S. Gal et al. (Eds.) IOS Press, 2008 © 2008 IOS Press. All rights reserved.
Appendix 1: Conference Participants

Co-Chairs

Paul Kantor
School of Communication, Information and Library Studies (SCILS)
Rutgers, The State University of New Jersey
4 Huntington Street, New Brunswick, NJ 08901-1071, USA
[email protected]

Bracha Shapira
Department of Information Systems Engineering
Ben-Gurion University of the Negev
P.O.B. 653, Beer-Sheva 84728, Israel
[email protected]
Participants

Bulgaria
Ivan Koychev
Institute of Mathematics and Informatics, Bulgarian Academy of Science
5 J. Bouchier Street, Sofia 1164, Bulgaria
[email protected]

Canada
Gordon Cormack
David R. Cheriton School of Computer Science, University of Waterloo
2502 Davis Centre, Waterloo, Ontario N2L 3G1, Canada
[email protected]
Germany
Joachim Buhmann
Institute of Computational Science
CAB G 69.2, Universitaetsstrasse 6, 8092 Zürich, Switzerland
[email protected]

Daniel Keim
Computer Science Institute, Uni Konstanz
Fach D78, Universitätsstr 10, D-78457 Konstanz, Germany
[email protected]

Gerhard Paass
Fraunhofer Institute for Intelligent Analysis and Information Systems, IAIS
Schloss Birlinghoven, 53757 Sankt Augustin, Germany
http://www.iais.fraunhofer.de
[email protected]

Stefan Wrobel
Fraunhofer Institute for Intelligent Analysis and Information Systems, IAIS
Schloss Birlinghoven, 53757 Sankt Augustin, Germany
http://www.iais.fraunhofer.de

Israel
Yaakov Amidror, Maj. Gen. (ret.)
Lander Institute, Jerusalem, Israel
[email protected]

Gadi Aviran
Hazard Threat Analysis Ltd.
P.O. Box 395, Bnei Zion 60910, Israel
[email protected], [email protected]

Yigal Carmon
MEMRI
P.O. Box 27837, Washington, DC 20038-7837, USA
www.memri.org
[email protected], [email protected]

Yuval Elovici
Deutsche Telekom Research Laboratories at Ben-Gurion University
P.O.B. 653, Beer-Sheva, Israel
[email protected]
Uri Hanani
Mindcite
P.O.B. 8707, 13 Giborei Israel St., Netanya 42504, Israel
[email protected]

Moshe Koppel
Department of Computer Science, Bar-Ilan University
Ramat-Gan, Israel
[email protected]

Poland
Janusz Luks
GROM Group
Ul. Granitowa 3/6, 02-681 Warszawa, Poland
[email protected]

UK
Nick Craswell
Microsoft Research
7 J J Thomson Ave, Cambridge CB3 0FB, UK
[email protected]

Mark Levene
School of Computer Science and Information Systems
Birkbeck, University of London
Malet Street, London WC1E 7HX, UK
[email protected]

Charles Shoniregun
School of Computing and Technology, University of East London
Docklands Campus, 4-6 University Way, London E16 2RD, UK
[email protected]

Ukraine
Vladimir Golubev
Director, Computer Crime Research Center
Box 8010, Zaporozhye 95, Ukraine, 69095
[email protected]

USA
Mark Goldberg
Rensselaer Polytechnic Institute
110 8th Street, Troy, New York 12180, USA
[email protected]
Marc Goodman
Senior Advisor, Interpol Steering Committee on Information Technology Crime
INTERPOL, General Secretariat, 200 Quai Charles de Gaulle, 69006 Lyon, France
[email protected]

David Grossman
Illinois Institute of Technology
10 West 31st Street, Rm. 285B, Chicago, IL 60610, USA
[email protected]

John Prange
The MITRE Corporation
Fort Meade Office, Hanover, MD 21076, USA
[email protected]

Dan Roth
University of Illinois
3322 Siebel Center, 201 N. Goodwin Avenue, Urbana, IL 61801, USA
[email protected]
Author Index

Agam, G. 156
Amidror, Y. 3
Aviran, G. 193
Baumes, J. 82
Borges, J. 45
Buhmann, J.M. 166
Carmon, Y. 26
Cormack, G. 142
Elovici, Y. 71, 224
Englert, R. 224
Frieder, O. 156
Gal, C.S. v
Glezer, C. 224
Goldberg, M.K. 82
Golubev, V. 20
Goodman, M. 8
Grossman, D. 156
Hanani, U. 202
Kandel, A. 71
Kantor, P.B. v, 120
Kindermann, J. 132
Koppel, M. 111
Koychev, I. 176
Last, M. 71
Levene, M. 45
Luks, J. 17
Magdon-Ismail, M. 82
Messeri, E. 111
Paaß, G. 56, 132
Reinhardt, W. 56
Rüping, S. 56
Schler, J. 111
Shapira, B. v, 71
Wallace, W.A. 82
Wrobel, S. 56