HANDBOOKS IN INFORMATION SYSTEMS VOLUME 3
Handbooks in Information Systems

Editor: Andrew B. Whinston

Advisory Editors:
Ba, Sulin (University of Connecticut)
Duan, Wenjing (The George Washington University)
Geng, Xianjun (University of Washington)
Gupta, Alok (University of Minnesota)
Hendershott, Terry (University of California at Berkeley)
Rao, H.R. (SUNY at Buffalo)
Santanam, Raghu T. (Arizona State University)
Zhang, Han (Georgia Institute of Technology)

Volume 3
Business Computing

Edited by
Gediminas Adomavicius, University of Minnesota
Alok Gupta, University of Minnesota

United Kingdom · North America · Japan · India · Malaysia · China
Emerald Group Publishing Limited
Howard House, Wagon Lane, Bingley BD16 1WA, UK

First edition 2009

Copyright © 2009 Emerald Group Publishing Limited

Reprints and permission service
Contact: [email protected]

No part of this book may be reproduced, stored in a retrieval system, transmitted in any form or by any means electronic, mechanical, photocopying, recording or otherwise without either the prior written permission of the publisher or a licence permitting restricted copying issued in the UK by The Copyright Licensing Agency and in the USA by The Copyright Clearance Center. No responsibility is accepted for the accuracy of information contained in the text, illustrations or advertisements. The opinions expressed in these chapters are not necessarily those of the Editor or the publisher.

British Library Cataloguing in Publication Data
A catalogue record for this book is available from the British Library

ISBN: 978-1-84855-264-7
ISSN: 1574-0145
Contents

Preface
Introduction
Part I: Enhancing and Managing Customer Value

CHAPTER 1  Personalization: The State of the Art and Future Directions
Alexander Tuzhilin

1. Introduction
2. Definition of personalization
3. Types of personalization
   3.1. Provider- vs. consumer- vs. market-centric personalization
   3.2. Types of personalized offerings
   3.3. Individual vs. segment-based personalization
   3.4. Smart vs. trivial personalization
   3.5. Intrusive vs. non-intrusive personalization
   3.6. Static vs. dynamic personalization
4. When does it pay to personalize?
5. Personalization process
6. Integrating the personalization process
7. Future research directions in personalization
Acknowledgments
References
CHAPTER 2  Web Mining for Business Computing
Prasanna Desikan, Colin DeLong, Sandeep Mane, Kalyan Beemanapalli, Kuo-Wei Hsu, Prasad Sriram, Jaideep Srivastava, Woong-Kee Loh and Vamsee Venuturumilli

1. Introduction
2. Web mining
   2.1. Data-centric Web mining taxonomy
   2.2. Web mining techniques—state-of-the-art
3. How Web mining can enhance major business functions
   3.1. Sales
   3.2. Purchasing
   3.3. Operations
4. Gaps in existing technology
   4.1. Lack of data preparation for Web mining
   4.2. Under-utilization of domain knowledge repositories
   4.3. Under-utilization of Web log data
5. Looking ahead: The future of Web mining in business
   5.1. Microformats
   5.2. Mining and incorporating sentiments
   5.3. e-CRM to p-CRM
   5.4. Other directions
6. Conclusion
Acknowledgments
References
CHAPTER 3  Current Issues in Keyword Auctions
De Liu, Jianqing Chen and Andrew B. Whinston

1. Introduction
2. A historical look at keyword auctions
   2.1. Early Internet advertising contracts
   2.2. Keyword auctions by GoTo.com
   2.3. Subsequent innovations by Google
   2.4. Beyond search engine advertising
3. Models of keyword auctions
   3.1. Generalized first-price auction
   3.2. Generalized second-price auction
   3.3. Weighted unit-price auction
4. How to rank advertisers
5. How to package resources
   5.1. The revenue-maximizing share structure problem
   5.2. Results on revenue-maximizing share structures
   5.3. Other issues on resource packaging
6. Click fraud
   6.1. Detection
   6.2. Prevention
7. Concluding remarks
References
CHAPTER 4  Web Clickstream Data and Pattern Discovery: A Framework and Applications
Balaji Padmanabhan

1. Background
2. Web clickstream data and pattern discovery
3. A framework for pattern discovery
   3.1. Representation
   3.2. Evaluation
   3.3. Search
   3.4. Discussion and examples
4. Online segmentation from clickstream data
5. Other applications
6. Conclusion
References
CHAPTER 5  Customer Delay in E-Commerce Sites: Design and Strategic Implications
Deborah Barnes and Vijay Mookerjee

1. E-commerce environment and consumer behavior
   1.1. E-commerce environment
   1.2. Demand generation and consumer behaviors
   1.3. System processing technique
2. The long-term capacity planning problem
   2.1. Allocating spending between advertising and information technology in electronic retailing
3. The short-term capacity allocation problem
   3.1. Optimal processing policies for an e-commerce web server
   3.2. Environmental assumptions
   3.3. Priority processing scheme
   3.4. Profit-focused policy
   3.5. Quality of service (QoS) focused policy
   3.6. Practical implications
4. The effects of competition
   4.1. A multiperiod approach to competition for capacity allocation
   4.2. Practical implications
   4.3. Long-term capacity planning under competition
   4.4. Practical applications and future adaptations
5. Conclusions and future research
References
Part II: Computational Approaches for Business Processes

CHAPTER 6  An Autonomous Agent for Supply Chain Management
David Pardoe and Peter Stone

1. Introduction
2. The TAC SCM scenario
   2.1. Component procurement
   2.2. Computer sales
   2.3. Production and delivery
3. Overview of TacTex-06
   3.1. Agent components
4. The Demand Manager
   4.1. Demand Model
   4.2. Offer Acceptance Predictor
   4.3. Demand Manager
5. The Supply Manager
   5.1. Supplier Model
   5.2. Supply Manager
6. Adaptation over a series of games
   6.1. Initial component orders
   6.2. Endgame sales
7. 2006 Competition results
8. Experiments
   8.1. Supply price prediction modification
   8.2. Offer Acceptance Predictor
9. Related work
10. Conclusions and future work
Acknowledgments
References
CHAPTER 7  IT Advances for Industrial Procurement: Automating Data Cleansing for Enterprise Spend Aggregation
Moninder Singh and Jayant R. Kalagnanam

1. Introduction
2. Techniques for data cleansing
   2.1. Overview of data cleansing approaches
   2.2. Text similarity methods
   2.3. Clustering methods
   2.4. Classification methods
3. Automating data cleansing for spend aggregation
   3.1. Data cleansing tasks for spend aggregation
   3.2. Automating data cleansing tasks for spend aggregation
4. Conclusion
References
CHAPTER 8  Spatial-Temporal Data Analysis and Its Applications in Infectious Disease Informatics
Daniel Zeng, James Ma, Hsinchun Chen and Wei Chang

1. Introduction
2. Retrospective and prospective spatial clustering
   2.1. Literature review
   2.2. Support vector clustering-based spatial-temporal data analysis
   2.3. Experimental studies
   2.4. A case study: Public health surveillance
3. Spatial-temporal cross-correlation analysis
   3.1. Literature review
   3.2. Extended K(r) function with temporal considerations
   3.3. A case study with infectious disease data
4. Conclusions
Acknowledgments
References
CHAPTER 9  Studying Heterogeneity of Price Evolution in eBay Auctions via Functional Clustering
Wolfgang Jank and Galit Shmueli

1. Introduction
2. Auction structure and data on eBay.com
   2.1. How eBay auctions work
   2.2. eBay's data
3. Estimating price evolution and price dynamics
   3.1. Estimating a continuous price curve via smoothing
   3.2. Estimating price dynamics via curve derivatives
   3.3. Heterogeneity of price dynamics
4. Auction segmentation via curve clustering
   4.1. Clustering mechanism and number of clusters
   4.2. Comparing price dynamics of auction clusters
   4.3. A differential equation for price
   4.4. Comparing dynamic and non-dynamic cluster features
   4.5. A comparison with "traditional" clustering
5. Conclusions
References
CHAPTER 10  Scheduling Tasks Using Combinatorial Auctions: The MAGNET Approach
John Collins and Maria Gini

1. Introduction
2. Decision processes in a MAGNET customer agent
   2.1. Agents and their environment
   2.2. Planning
   2.3. Planning the bidding process
   2.4. Composing a request for quotes
   2.5. Evaluating bids
   2.6. Awarding bids
3. Solving the MAGNET winner-determination problem
   3.1. Bidtree framework
   3.2. A formulation
   3.3. Iterative-deepening A*
4. Related work
   4.1. Multi-agent negotiation
   4.2. Combinatorial auctions
   4.3. Deliberation scheduling
5. Conclusions
References
Part III: Supporting Knowledge Enterprise

CHAPTER 11  Structuring Knowledge Bases Using Metagraphs
Amit Basu and Robert Blanning

1. Introduction
2. The components of organizational knowledge
3. Metagraphs and metapaths
   3.1. Metagraph definition
   3.2. Metapaths
   3.3. Metagraph algebra
   3.4. Metapath dominance and metagraph projection
4. Metagraphs and knowledge bases
   4.1. Applications of metagraphs to the four information types
   4.2. Combining data, models, rules, and workflows
   4.3. Metagraph views
5. Conclusion
   5.1. Related work
   5.2. Research opportunities
References
CHAPTER 12  Information Systems Security and Statistical Databases: Preserving Confidentiality through Camouflage
Robert Garfinkel, Ram Gopal, Manuel Nunez and Daniel Rice

1. Introduction
2. DB Concepts
   2.1. Types of statistical databases (SDBs)
   2.2. Privacy-preserving data-mining applications
   2.3. A simple database model
   2.4. Statistical inference in SDBs
3. Protecting against disclosure in SDBs
   3.1. Protecting against statistical inference
   3.2. The query restriction approach
   3.3. The data masking approach
   3.4. The confidentiality via camouflage (CVC) approach
4. Protecting data with CVC
   4.1. Computing certain queries in CVC
   4.2. Star
5. Linking security to a market for private information—A compensation model
   5.1. A market for private information
   5.2. Compensating subjects for increased risk of disclosure
   5.3. Improvement in answer quality
   5.4. The compensation model
   5.5. Shrinking algorithm
   5.6. The advantages of the star mechanism
6. Simulation model and computational results
   6.1. Sample database
   6.2. User queries
   6.3. Results
7. Conclusion
References
CHAPTER 13  The Efficacy of Mobile Computing for Enterprise Applications
John Burke, Judith Gebauer and Michael J. Shaw

1. Introduction
2. Trends
   2.1. Initial experiments in mobile information systems
   2.2. The trend towards user mobility
   2.3. The trend towards pervasive computing
   2.4. The future: ubiquitous computing
3. Theoretical frameworks
   3.1. Introduction
   3.2. The technology acceptance model
   3.3. Example of the technology acceptance model
   3.4. Limitations of the technology acceptance model
   3.5. The task technology fit model
   3.6. Limitations of the task technology fit model
4. Case study: mobile E-procurement
   4.1. Introduction
   4.2. A TTF model for mobile technologies
5. Case study findings
   5.1. Functionality
   5.2. User experiences
6. Conclusions from the case study
7. New research opportunities
8. Conclusion
References
CHAPTER 14  Web-Based Business Intelligence Systems: A Review and Case Studies
Wingyan Chung and Hsinchun Chen

1. Introduction
2. Literature review
   2.1. Business intelligence systems
   2.2. Mining the Web for BI
3. A framework for discovering BI on the Web
   3.1. Collection
   3.2. Conversion
   3.3. Extraction
   3.4. Analysis
   3.5. Visualization
   3.6. Comparison with existing frameworks
4. Case studies
   4.1. Case 1: Searching for BI across different regions
   4.2. Case 2: Exploring BI using Web visualization techniques
   4.3. Case 3: Business stakeholder analysis using Web classification techniques
5. Summary and future directions
References
PREFACE
Fueled by the rapid growth of the Internet, the continuously increasing accessibility of communication technologies, and the vast amount of information collected by transactional systems, information overabundance has become an increasingly important problem. Technology evolution has also given rise to new challenges that frustrate both researchers and practitioners. For example, information overload has created data management problems for firms, while the analysis of very large datasets is forcing researchers to look beyond the bounds of inferential statistics. As a result, researchers and practitioners have been focusing on new techniques of data analysis that allow the identification, organization, and processing of data in innovative ways to facilitate meaningful analysis. These approaches are based on data mining, machine learning, and advanced statistical learning techniques. Their goal is to discover models and/or identify patterns of potential interest that lead to strategic or operational opportunities. In addition, privacy, security, and trust issues have grown in importance, and recent legislation (e.g., Sarbanes–Oxley) is also beginning to impact IT infrastructure deployment.

While the popular press has given a lot of attention to the entrepreneurial activities that information technologies, in particular computer networking technologies, have facilitated, their tremendous impact on business practices has received less direct attention. Enterprises are continuously leveraging advances in computing paradigms and techniques to redefine business processes and to increase process effectiveness, leading to better productivity. Some of the important questions in these dimensions include: What new business models are created by the evolution of advanced computing infrastructures for innovative business computing? What are the IT infrastructure and risk management issues for these new business models? Business computing has been the foundation of these, often internal, innovations.

The research contributions in this collection present modeling, computational, and statistical techniques that are being developed and deployed as cutting-edge research approaches to address the problems and challenges posed by information overabundance in electronic business and electronic commerce. This book is an attempt to bring together articles from
thought leaders in their respective areas, covering state-of-the-art knowledge in business computing research, emerging innovative techniques, and futuristic reflections and approaches that will find their way into mainstream business processes in the near future.

The intended audiences for this book are students in both graduate business and applied computer science classes who want to understand the role of modern computing machinery in business applications. The book also serves as a comprehensive research handbook for researchers who intend to conduct research on the design, evaluation, and management of computing-based innovation for business processes. Business practitioners (e.g., IT managers or technology analysts) should find the book useful as a reference on a variety of novel (current and emerging) computing approaches to important business problems. While the focus of many book chapters is data-centric, the book also provides frameworks for making the business case for computing technology's role in creating value for organizations.
INTRODUCTION
An overview of the book

The book is broadly organized into three parts. The first part (Enhancing and Managing Customer Value) focuses on presenting the state of knowledge in managing and enhancing customer value through the extraction of consumer-centric knowledge from the mountains of data that modern interactive applications generate. The extracted information can then be used to provide more personalized information to customers, offer more relevant information or products, and even create innovative business processes that enhance overall value to customers. The second part (Computational Approaches for Business Processes) presents several specific innovative computing artifacts and tools developed by researchers that are not yet commercially used. These represent cutting-edge thought and advances in business computing research that should soon find utility in real-world applications or as tools to analyze real-world scenarios. The final part (Supporting Knowledge Enterprise) presents approaches and frameworks that focus on the ability of an enterprise to analyze, build, and protect the computing infrastructure that adds value to the enterprise's existing business processes.

Chapter summaries

Part I: Enhancing and managing customer value

The chapters in this part are primarily surveys of the state of the art in research; however, each chapter points to business applications as well as future opportunities for research. The first chapter, by Alexander Tuzhilin (Personalization: The State of the Art and Future Directions), provides a survey of research in personalization technologies. The chapter focuses on providing a structured view of personalization and presents a six-step process for providing effective personalization. The chapter points out why, despite the hype, personalization applications have not reached their true potential and lays the groundwork for significant future research.
The second chapter, by Prasanna Desikan, Colin DeLong, Sandeep Mane, Kalyan Beemanapalli, Kuo-Wei Hsu, Prasad Sriram, Jaideep Srivastava, Woong-Kee Loh, and Vamsee Venuturumilli (Web Mining for Business Computing), focuses on knowledge extraction from data collected over the Web. The chapter discusses the different forms of data that can be collected and mined from different Web-based sources to extract knowledge about the content, structure, or organization of resources and their usage patterns. The chapter discusses the usage of the knowledge extracted from transactional websites in all areas of business applications, including human resources, finance, and technology infrastructure management. One of the results of Web mining has been a better understanding of consumers' browsing and search behavior and the introduction of advanced Web-based technologies and tools.

The chapter by De Liu, Jianqing Chen, and Andrew Whinston (Current Issues in Keyword Auctions) presents the state of knowledge and research opportunities in the area of markets for Web search keywords. For example, Google's popular AdWords and AdSense applications provide a way for advertisers to drive traffic to their sites or place appropriate advertisements on their webspace based on users' search or browsing patterns. While the technology issues surrounding the intent and purpose of a search, and matching that with appropriate advertisers, are also challenging, the chapter points out the challenges in organizing the markets for these keywords. The chapter presents the state of knowledge in keyword auctions as well as a comprehensive research agenda and issues that can lead to better and more economically efficient outcomes.

Another chapter in this part, by Balaji Padmanabhan (Web Clickstream Data and Pattern Discovery: A Framework and Applications), focuses specifically on pattern discovery in clickstream data. Management research has long distinguished between intent and action. Before the availability of clickstream data, the only data available regarding the actions of consumers on electronic commerce websites was their final product selection. However, the availability of data that captures not only buying behavior but browsing behavior as well can provide valuable insights into the choice criteria and product selection processes of consumers. This information can be further used to design streamlined storefronts, presentation protocols, purchase processes and, of course, personalized browsing and shopping experiences. The chapter provides a framework for pattern discovery that encompasses the process of representation, learning, and evaluation of patterns, illustrated by conceptual and applied examples of discovering useful patterns.

The part ends with a chapter by Deborah Barnes and Vijay Mookerjee (Customer Delay in E-Commerce Sites: Design and Strategic Implications) examining the operational strategies and concerns with respect to the delays suffered by customers on e-commerce sites. Delay management directly affects customers' satisfaction with a website and, as the chapter points out, has implications for decisions regarding the extent of effort devoted to generating traffic, managing content, and making infrastructure decisions.
The chapter also presents ideas regarding innovative business practices, such as an "express lane" and/or intentionally delaying customers when appropriate and acceptable. The chapter also examines the effect of competition on the determination of capacity and service levels.

Part II: Computational approaches for business processes

The first chapter in this part, by David Pardoe and Peter Stone (An Autonomous Agent for Supply Chain Management), describes the details of their winning agent in the Trading Agent Competition for Supply Chain Management. This competition allows autonomous software agents to compete in raw-material acquisition, inventory control, production, and sales decisions in a realistic simulated environment that lasts for 220 simulated days. The complexity and multidimensional nature of the agent's decisions makes the problem intractable from an analytical perspective. However, an agent still needs to predict the future state of the market and to take competitive dynamics into account to make profitable sales. It is likely that, in the not-so-distant future, several types of negotiations, particularly for commodities, may be fully automated. Therefore, intelligent and adaptive agent design, as described in this chapter, is an important area of business computing that is likely to make a significant contribution to practice.

The second chapter in this part, by Moninder Singh and Jayant Kalagnanam (IT Advances for Industrial Procurement: Automating Data Cleansing for Enterprise Spend Aggregation), examines the problem of cleansing the massive amounts of data that a reverse aggregator may need in order to make efficient buying decisions on behalf of several buyers. Increasingly, businesses are outsourcing non-core procurement. In such environments, a reverse aggregator needs to create complex negotiation mechanisms (such as electronic requests for quotes and requests for proposals). An essential part of preparing these mechanisms is to establish the rationale and business value of outsourcing. Simple tools such as spreadsheets are not sufficient to handle the scale of operations, in addition to being non-standardized and error-prone. The chapter provides a detailed roadmap and principles for developing an automated system for the aggregation and clean-up of data across multiple enterprises as a first step towards facilitating such a mechanism.

The third chapter in this part, by Daniel Zeng, James Ma, Wei Chang, and Hsinchun Chen (Spatial-Temporal Data Analysis and Its Applications in Infectious Disease Informatics), discusses the use of spatial-temporal data analysis techniques to correlate information from offline and online data sources. The research addresses important questions of interest, such as whether current trends are exceptional, and whether they are due to random variations or a new systematic pattern is emerging. Furthermore, the ability to discover temporal patterns and whether they match any known event in the past is also of crucial importance in many application domains, for
example, in the areas of public health (e.g., infectious disease outbreaks), public safety, food safety, transportation systems, and financial fraud detection. The chapter provides case studies in the domain of infectious disease informatics to demonstrate the utility of the analysis techniques.

The fourth chapter, by Wolfgang Jank and Galit Shmueli (Studying Heterogeneity of Price Evolution in eBay Auctions via Functional Clustering), provides a novel technique to study price formation in online auctions. While there has been an explosion of studies that analyze online auctions from an empirical perspective in the past decade, most of these studies provide either a comparative statics analysis of prices (i.e., the factors that affect prices in an auction) or a structural view of the price formation process (i.e., assuming that game-theoretic constructs of price formation are known and captured by the data). However, the dynamics of the price formation process have rarely been studied. The dynamics of the process can provide valuable and actionable insights to both a seller and a buyer. For example, different factors may drive prices at different phases in the auction; in particular, the starting bid or the number of items available may be the driver of price movement at the beginning of an auction, while the nature of bidding activity would be the driver in the middle of the auction. The technique discussed in the chapter provides a fresh statistical approach to characterizing the price formation process and can identify dynamic drivers of this process. The chapter shows the information that can be gained from this process and opens up the potential for designing a new generation of online mechanisms.

The fifth and final chapter in this part, by John Collins and Maria Gini (Scheduling Tasks Using Combinatorial Auctions: The MAGNET Approach), presents a combinatorial auction mechanism as a solution to complex business transactions that require coordinated combinations of goods and services under several business constraints, often resulting in complex combinatorial optimization problems. The chapter presents a new generation of systems that will help organizations and individuals find and exploit opportunities that are otherwise inaccessible or too complex to evaluate. These systems will help potential partners find each other and negotiate mutually beneficial deals. The authors evaluate their environment and proposed approach using the Multi-AGent NEgotiation Testbed (MAGNET). The testbed allows self-interested agents to negotiate complex coordinated tasks with a variety of constraints, including precedence and time constraints. Using the testbed, the chapter demonstrates how a customer agent can solve the complex problems that arise in such an environment.

Part III: Supporting knowledge enterprise

The first chapter in this part, by Amit Basu and Robert Blanning (Structuring Knowledge Bases Using Metagraphs), provides a graphical
modeling and analysis technique called metagraphs. Metagraphs can represent, integrate, and analyze various types of knowledge bases existing in an organization, such as data and their relationships, decision models, information structures, and organizational constraints and rules. While other graphical techniques to represent such knowledge bases exist, these approaches are usually purely representational and do not provide methods and techniques for conducting inferential analysis. A given metagraph allows the use of graph-theoretic techniques and several algebraic operations in order to analyze its constructs and the relationships among them. The chapter presents the constructs and methods available in metagraphs, some examples of usage, and directions for future research and applications.

The second chapter in this part, by Robert Garfinkel, Ram Gopal, Manuel Nunez, and Daniel Rice (Information Systems Security and Statistical Databases: Preserving Confidentiality through Camouflage), describes an innovative camouflage-based technique to ensure statistical confidentiality of data. The basic and innovative idea of this approach, as opposed to perturbation-based approaches to data confidentiality, is to allow aggregate analysis with exact and correct answers to the queries posed to a database and, at the same time, to provide confidentiality by ensuring that no combination of queries reveals exact privacy-compromising information. This provides an important approach for business applications where personal data often needs to be legally protected.

The third chapter, by John Burke, Michael Shaw, and Judith Gebauer (The Efficacy of Mobile Computing for Enterprise Applications), analyzes the efficacy of the mobile platform for enterprise and business applications. The chapter provides insights as to why firms have not been able to adopt the mobile platform in a widespread manner. The authors posit that gaps exist between users' task needs and technological capabilities that prevent users from adopting these applications. They identify antecedents to the acceptance of mobile applications in the context of a requisition system at a Fortune 100 company and provide insights as to what factors can enhance the chances of acceptance of the mobile platform for business applications.

The final chapter in this part and in the book, by Wingyan Chung and Hsinchun Chen (Web-based Business Intelligence Systems: A Review and Case Studies), reviews the state of knowledge in building Web-based Business Intelligence (BI) systems and proposes a framework for developing such systems. A Web-based BI system can provide managers with real-time capabilities for assessing their competitive environments and supporting managerial decisions. The authors discuss the various steps in building a Web-based BI system, such as the collection, conversion, extraction, analysis, and visualization of data for BI purposes. They provide three case studies of developing Web-based BI systems and present results from experimental studies regarding the efficacy of these systems.
Concluding remarks

The importance of the topic of business computing is unquestionable. Information technology and computing-based initiatives have been and continue to be at the forefront of many business innovations. This book is intended to provide an overview of the current state of knowledge in business computing research as well as the emerging computing-based approaches and technologies that may appear in the innovative business processes of the near future. We hope that this book will serve as a source of information for researchers and practitioners, facilitate further discussions on the topic of business computing, and provide inspiration for further research and applications.

This book has been several years in the making, and we are excited to see it come to life. It contains a collection of 14 chapters written by experts in the areas of information technologies and systems, computer science, business intelligence, and advanced data analytics. We would like to thank all the authors of the book chapters for their commitment and contributions to this book. We would also like to thank all the diligent reviewers who provided comprehensive and insightful reviews of the chapters, in the process making this a much better book: our sincere thanks go to Jesse Bockstedt, Wingyan Chung, Sanjukta Das Smith, Gilbert Karuga, Wolfgang Ketter, YoungOk Kwon, Chen Li, Balaji Padmanabhan, Claudia Perlich, Pallab Sanyal, Mu Xia, Xia Zhao, and Dmitry Zhdanov. We also extend our gratitude to Emerald for their encouragement and help throughout the book publication process.

Gediminas Adomavicius and Alok Gupta
Part I Enhancing and Managing Customer Value
Chapter 1
Personalization: The State of the Art and Future Directions
Alexander Tuzhilin Stern School of Business, New York University, 44 West 4th Street, Room 8-92, New York, NY 10012, USA
Abstract

This chapter examines the major definitions and concepts of personalization, reviews various personalization types and discusses when it makes sense to personalize and when it does not. It also reviews the personalization process and discusses how the various stages of this process can be integrated in a tightly coupled manner in order to avoid "discontinuity points" between its different stages. Finally, future research directions in personalization are discussed.
1 Introduction
Personalization, the ability to tailor products and services to individuals based on knowledge about their preferences and behavior, was listed in the July 2006 issue of Wired magazine among the six major trends driving the global economy (Kelleher, 2006). This observation was echoed by Eric Schmidt, the CEO of Google, who remarked in Schmidt (2006) that "we have the tiger by the tail in that we have this huge phenomenon of personalization." This is in sharp contrast to the previously reported disappointments with personalization, as expressed by numerous prior authors and eloquently summarized by Kemp (2001):

    No set of e-business applications has disappointed as much as personalization has. Vendors and their customers are realizing that, for example, truly personalized Web commerce requires a reexamination of business processes and marketing strategies as much as installation of shrink-wrapped software. Part of the problem is that personalization means something different to each e-business.
Many of these disappointing experiences happened because various businesses jumped on the popular "personalization bandwagon" in the late 1990s and early 2000s without putting much thought into such questions as what, why and when it makes sense to personalize. This situation fits Gartner's hype-and-gloom curve, presented in Fig. 1, which characterizes the growth patterns of many technologies, personalization among them. As Fig. 1 demonstrates, expectations of personalization technologies initially exceeded the actual technological developments (the area of inflated expectations), were then followed by profound disappointments with these technologies (in the early 2000s), as reflected in Kemp's quote above, and finally reached the "slope of enlightenment," when expectations of the technologies coincided with the actual technological developments. The Wired magazine article cited above and the remark by Eric Schmidt acknowledge that personalization technology has significantly matured by now and that it has very large potential if understood well and implemented properly.

Fig. 1. Gartner's hype-and-gloom curve for personalization technologies.

It turns out that the hype-and-gloom situation with personalization in the 1990s and 2000s described above constitutes only the most recent development in the field. The roots of personalization can be traced back to antiquity, when business owners knew their customers and provided different products and services to different customers based on extensive knowledge of who they were and a good understanding of their needs. More recently, elements of personalization can be traced to the second half of the 19th century, when Montgomery Ward added some simple personalization features to their otherwise mass-produced catalogs (Ross, 1992). However, all these early personalization activities were either done on a small scale or were quite elementary.

On a large scale, the roots of personalization can be traced to direct marketing, when the customer segmentation method based on the recency-frequency-monetary (RFM) model was developed by a catalog company to decide which customers should receive its catalog (Peterson et al., 1997). Also, the direct marketing company Metromail developed the Selection by Individual Families and Tracts (SIFT) system in the mid-1960s, which segmented customers based on such attributes as telephone ownership, length of residence, head of household, gender and the type of dwelling to make catalog shipping decisions. This approach was later refined in the late 1960s when customers were also segmented based on their ZIP codes. These segmentation efforts were also combined with content customization when Time magazine experimented in the 1940s with sending mass-produced letters that began with the salutation "Dear Mr. Smith . . ." addressed to all the Mr. Smiths on the company's mailing list (Reed, 1949). However, all these early-day personalization approaches were implemented "by hand" without using information technologies.

It was only in the mid-1960s that direct marketers began using IT to provide personalized services, such as producing computer-generated letters that were customized to the needs of particular segments of customers. As an early example of such computerized targeted marketing, Fingerhut targeted New York residents with personalized letters that began, "Remember last January when temperatures in the state of New York dropped to a chilly 32 degrees?" (Peterson et al., 1997). Similarly, Burger King was one of the modern early adopters of personalization with its "Have it your way" campaign launched in the mid-1970s. However, it was not until the 1980s that the areas of direct marketing and personalization experienced major advances, due to the development of more powerful computers, database technologies and more advanced data analysis methods (Peterson et al., 1997), and automated personalization became a reality.

Personalization was taken to the next level in the mid- to late 1990s with the advancement of Web technologies and various personalization tools helping marketers interact with their customers on a 1-to-1 basis in real time. As a result, a new wave of personalization companies emerged, such as Broadvision, ATG, Blue Martini, e.Piphany, Kana, DoubleClick, Claria, ChoiceStream and several others. As an example, the PersonalWeb platform developed by Claria provides behavioral targeting of website visitors by "watching" their clicks and delivering personalized online content, such as targeted ads, news and RSS feeds, based on the analysis of their online activities. Claria achieves this behavioral targeting by requesting online users to download and install the behavior-tracking
software on their computers. Similarly, ChoiceStream software helps Yahoo, AOL, Columbia House, Blockbuster and other companies personalize home pages for their customers and thus deliver relevant content, products, search results and advertising to them. The benefits derived from such personalized solutions should be balanced against possible problems of violating consumer privacy (Kobsa, 2007). Indeed, some of these personalization companies, including DoubleClick and Claria, have had problems with consumer privacy advocates in the past.

On the academic front, personalization has been explored in the marketing community since the 1980s. For example, Surprenant and Solomon (1987) studied personalization of services and concluded that personalization is a multidimensional construct that must be approached carefully in the context of service design, since personalization does not necessarily result in greater consumer satisfaction with the service offerings in all cases. The field of personalization was popularized by Peppers and Rogers with the publication of their first book on 1-to-1 marketing in 1993 (Peppers and Rogers, 1993). Since that time, many publications on personalization have appeared in the computer science, information systems, marketing, management science and economics literature.

In the computer science and information systems literature, special issues of the CACM (Communications of the ACM, 2000) and the ACM TOIT (Mobasher and Anand, 2007) journals have already been dedicated to personalization technologies, and another one (Mobasher and Tuzhilin, 2009) will be published shortly. Some of the most recent reviews and surveys of personalization include Adomavicius and Tuzhilin (2005a), Eirinaki and Vazirgiannis (2003) and Pierrakos et al. (2003). The main topics in personalization studied by computer scientists include Web personalization (Eirinaki and Vazirgiannis, 2003; Mobasher et al., 2000; Mobasher et al., 2002; Mulvenna et al., 2000; Nasraoui, 2005; Pierrakos et al., 2003; Spiliopoulou, 2000; Srivastava et al., 2000; Yang and Padmanabhan, 2005), recommender systems (Adomavicius and Tuzhilin, 2005b; Hill et al., 1995; Pazzani, 1999; Resnick et al., 1994; Schafer et al., 2001; Shardanand and Maes, 1995), building user profiles and models (Adomavicius and Tuzhilin, 2001a; Billsus and Pazzani, 2000; Cadez et al., 2001; Jiang and Tuzhilin, 2006a,b; Manavoglu et al., 2003; Mobasher et al., 2002; Pazzani and Billsus, 1997), design and analysis of personalization systems (Adomavicius and Tuzhilin, 2002; Adomavicius and Tuzhilin, 2005a; Eirinaki and Vazirgiannis, 2003; Padmanabhan et al., 2001; Pierrakos et al., 2003; Wu et al., 2003) and studies of personalized searches (Qiu and Cho, 2006; Tsoi et al., 2006).¹ Most of these areas have a vast body of literature and can be the subject of a separate survey. For example, the survey of recommender systems (Adomavicius and Tuzhilin, 2005b) cites over 100 papers, the 2003 survey of Web personalization (Eirinaki and Vazirgiannis, 2003) cites 40 papers on the corresponding topics, and these numbers grow rapidly each year.

¹ There are many papers published in each of these areas. The references cited above are either surveys or serve only as representative examples of some of this work demonstrating the scope of the efforts in these areas; they do not provide exhaustive lists of citations in each of the areas.

In the marketing literature, the early work on personalization (Surprenant and Solomon, 1987; Peppers and Rogers, 1993), described above, was followed by several authors studying such problems as targeted marketing (Chen and Iyer, 2002; Chen et al., 2001; Rossi et al., 1996), competitive personalized promotions (Shaffer and Zhang, 2002), recommender systems (Ansari et al., 2000; Haubl and Murray, 2003; Ying et al., 2006), customization (Ansari and Mela, 2003) and studies of effective strategies of personalization services firms (Pancras and Sudhir, 2007). In the economics literature, there has been work on personalized pricing, where companies charge different prices to different customers or customer segments (Choudhary et al., 2005; Elmaghraby and Keskinocak, 2003; Jain and Kannan, 2002; Liu and Zhang, 2006; Ulph and Vulkan, 2001). In the management science literature, the focus has been on the interactions between operations issues and personalized pricing (Elmaghraby and Keskinocak, 2003) and also on mass customization problems (Pine, 1999; Tseng and Jiao, 2001) and their limitations (Zipkin, 2001). Some of the management science and economics-based approaches to Internet-based product customization and pricing are described in Dewan et al. (2000). A review of the role of management science in research on personalization is presented in Murthi and Sarkar (2003).

With all these advances in academic research on personalization and in developing personalized solutions in the industry, personalization "is back," as is evidenced by the aforementioned quotes from the Wired magazine article and Eric Schmidt. In order to understand these sharp swings in perception about personalization, as described above, and grasp general developments in the field, we first review the basic concepts of personalization, starting with its definition in Section 2. In Section 3, we examine different types of personalization since, according to David Smith (2000), "there are myriad ways to get personal," and we need to understand them to have a good grasp of personalization. In Section 4, we discuss when it makes sense to personalize. In Section 5, we present a personalization process. In Section 6, we explain how different stages of the personalization process can be integrated into one coherent system. Finally, we discuss future research directions in personalization in Section 7.
2 Definition of personalization
Since personalization constitutes a rapidly developing field, there still exist different points of view on what personalization is, as expressed by
academics and practitioners. Some representative definitions of personalization proposed in the literature are:

- "Personalization is the ability to provide content and services that are tailored to individuals based on knowledge about their preferences and behavior" (Hagen, 1999).
- "Personalization is the capability to customize communication based on knowledge preferences and behaviors at the time of interaction" (Dyche, 2002).
- "Personalization is about building customer loyalty by building a meaningful 1-to-1 relationship; by understanding the needs of each individual and helping satisfy a goal that efficiently and knowledgeably addresses each individual's need in a given context" (Riecken, 2000).
- "Personalization involves the process of gathering user information during interaction with the user, which is then used to deliver appropriate content and services, tailor-made to the user's needs" (www.ariadne.ac.uk/issue28/personalization).
- "Personalization is the ability of a company to recognize and treat its customers as individuals through personal messaging, targeted banner ads, special offers, . . . or other personal transactions" (Imhoff et al., 2001).
- "Personalization is the combined use of technology and customer information to tailor electronic commerce interactions between a business and each individual customer. Using information either previously obtained or provided in real-time about the customer and other customers, the exchange between the parties is altered to fit that customer's stated needs so that the transaction requires less time and delivers a product best suited to that customer" (www.personalization.com, as it was defined on this website in the early 2000s).

Although different, all these definitions identify several important points about personalization. Collectively, they maintain that personalization tailors certain offerings by providers to consumers based on certain knowledge about them, on the context in which these offerings are provided and with certain goal(s) in mind. Moreover, these personalized offerings are delivered from providers to consumers through personalization engines along certain distribution channels based on the knowledge about the consumers, the context and the personalization goals. Each of the italicized words above is important and will be explained below.

1. Offerings. Personalized offerings can be of very different types. Some examples of these offerings include:

- Products, both ready-made that are selected for the particular consumer (such as books, CDs, vacation packages and other ready-made products offered by a retailer) and manufactured in a custom-made
fashion for a particular consumer (such as custom-made CDs and custom-designed clothes and shoes).
- Services, such as individualized subscriptions to concerts and personalized access to certain information services.
- Communications. Personalized offerings can include a broad range of marketing and other types of communications, including targeted ads, promotions and personalized email.
- Online content. Personalized content can be generated for an individual customer and delivered to him or her in the best possible manner. This personalized content can include dynamically generated Web pages, new and modified links and the insertion of various communications described above into pre-generated Web pages.
- Information searches. Depending on the past search history and on other personal characteristics of an online user, a search engine can return different search results or present them in a different order to customize them to the needs of a particular user (Qiu and Cho, 2006; Tsoi et al., 2006).
- Dynamic prices. Different prices can be charged for different products depending on personal characteristics of the consumer (Choudhary et al., 2005).
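The discussion that follows emphasizes that, for each offering type, the universe of offerings and its structure must be specified: which parts are fixed, which can be tailored, and how fine-grained the choices are. The following minimal sketch, which is not from the chapter and uses entirely hypothetical names and numbers, shows one way two of the offering types listed above might be represented as data so that a personalization engine can later select among them.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical, simplified representations of two offering types from the list
# above. The point is that each offering type defines its own "space": which
# parts are fixed, which parts can be tailored, and how fine-grained the
# admissible choices are.

@dataclass
class EmailOffer:
    fixed_header: str                        # part of the message that is never tailored
    tailorable_blocks: Dict[str, List[str]]  # slot name -> admissible content variants

@dataclass
class PriceOffer:
    product_id: str
    min_price: float
    max_price: float
    granularity: float                       # discrete price unit, e.g., 0.50

    def price_space(self) -> List[float]:
        """Enumerate the discrete prices that could be offered."""
        prices, p = [], self.min_price
        while p <= self.max_price + 1e-9:
            prices.append(round(p, 2))
            p += self.granularity
        return prices

# Example: a $10-$15 range at $0.50 granularity yields 11 candidate prices.
offer = PriceOffer("book-123", min_price=10.0, max_price=15.0, granularity=0.5)
assert len(offer.price_space()) == 11
```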
These offerings constitute the marketing outputs of the personalization process (Vesanen and Raulas, 2006). Given a particular type of offering, it is necessary to specify the universe (or the space) of offerings O of that type and identify its structure. For example, in the case of personalized online content, it is necessary to identify what kind of content can be delivered to the consumer, how "granular" it is and what the structure of this content is. Similarly, in the case of personalized emails, it is necessary to specify what the structure of an email message is, which parts of the message can be tailored and which are fixed, and what the "space" of all the email messages is. Similarly, in the case of personalized prices, it is important to know what the price ranges are and what the granularity of the price unit is if the prices are discrete.

2. Consumers can be considered either at the individual level or grouped into segments, depending on the particular type of personalization, the type of targeting and the personalization objectives. The former case fits into the 1-to-1 paradigm (Peppers and Rogers, 1993), whereas the latter fits into the segmentation paradigm (Wedel and Kamakura, 2000). It is an interesting and important research question to determine which of these two approaches is better and in which sense. The 1-to-1 approach builds truly personalized models of consumers but may suffer from not having enough data and from the data being "noisy," i.e., containing various types of consumer biases, imperfect information, mistakes, etc. (Chen et al., 2001), whereas the segmentation approach has sufficient data but may suffer from the problem of having heterogeneous populations of consumers
within the segments. This question has been studied before by marketers, and the results of this work are summarized in Wedel and Kamakura (2000). In the IS/CS literature, some solutions to this problem are described in Jiang and Tuzhilin (2006a,b, 2007). Moreover, this problem will be discussed further in Section 3.3.

Note that some of the definitions of personalization presented above refer to customers, while others refer to users and individuals. In the most general setting, personalization is applicable to a broad set of entities, including customers, suppliers, partners, employees and other stakeholders in the organization. In this chapter, we will collectively refer to these entities as consumers, using the most general meaning of this term in the sense described above.

3. Providers are the entities that provide personalized offerings, such as e-commerce websites, search engines and various offline outlets and organizations.

4. Tailoring. Given the space O of all the possible offerings described above and a particular consumer or a segment of consumers c, which offering or set of offerings should be selected from the space O in each particular situation to customize the offering(s) to the needs of c according to the personalization goal(s) described below? How to deliver these customized offerings to individual consumers constitutes one of the key questions of personalization. We will address this question in Section 5 (Stage 3) when describing the "matchmaking" stage of the personalization process.

5. Knowledge about consumers. All the available information about the consumer, including demographic, psychographic, browsing, purchasing and other transactional information, is collected, processed, transformed, analyzed and converted into actionable knowledge that is stored in consumer profiles. This information is gathered from multiple sources. One of the crucial sources of this knowledge is the transactional information about interactions between the personalization system and the consumer, including purchasing transactions, browsing activities and various types of inquiries and information-gathering interactions. This knowledge, obtained from the collected data and stored in the consumer profiles, is subsequently used to determine how to customize offerings to the consumers.

The consumer profiles contain two types of knowledge. First, a profile contains factual knowledge about the consumer: demographic, transactional and other crucial consumer information that is processed and aggregated into a collection of facts about the person, including various statistics about the consumer's behavior. Simple factual information about the consumer can be stored as a record in a relational database or in a consumer-centric data warehouse (DW) (Kimball, 1996). More complicated factual information, such as the information about the social network of a person
and his or her relationships and interactions with other consumers, may require the use of taxonomies and ontologies and can be captured using XML or special languages for defining ontologies (Staab and Studer, 2003), such as OWL (Antoniou and Harmelen, 2003). Second, the consumer profile contains one or several data mining and statistical models capturing the behavior either of this particular consumer or of the segment of similar consumers to which the person belongs. These models are stored as a part of the consumer-centric modelbase (Liu and Tuzhilin, 2008). Together, these two parts form the consumer profile that will be described in greater detail in Section 5.

6. Context. Tailoring a particular offering to the needs of the consumers depends not only on the knowledge about the consumer, but also on the context in which this tailoring occurs. For example, when recommending a movie to the consumer, it is important to know not only his or her movie preferences, but also the context in which these recommendations are made, such as with whom the person is going to see the movie, when and where. If a person wants to see a movie with his girlfriend in a movie theater on Saturday night, then, perhaps, a different movie should be recommended than if he wants to see it with his parents on Thursday evening at home on a VCR. Similarly, when a consumer shops for a gift, different products should be offered to her in this context than when she shops for herself.

7. Goal(s) determine the purpose of personalization. Tailoring particular offerings to the consumers can have various objectives, including:

- Maximizing consumer satisfaction with the provided offering and the overall consumer experience with the provider.
- Maximizing the Lifetime Value (LTV) (Dwyer, 1989) of the consumer, which determines the total discounted value of the person derived over the entire lifespan of the consumer (a standard discounted-sum formulation is sketched after this list). This maximization is done over a long-range time horizon rather than pursuing short-term satisfaction.
- Improving consumer retention and loyalty and decreasing churn. For example, the provider should tailor its offerings so that this tailoring would maximize repeat visits of the consumer to the provider. The dual problem is to minimize churn rates, i.e., the rates at which current consumers abandon the provider.
- Better anticipating consumers' needs and, therefore, serving them better. One way to do this would be to design the personalization engine so that it maximizes the predictive performance of tailored offerings, i.e., it would try to select the offerings that the consumer likes.
- Making interactions between providers and consumers efficient, satisfying and easier for both of them. For example, in the case of Web
personalization, this amounts to improving the website design and helping visitors find relevant information quickly and efficiently. Efficiency may also include saving consumer time. For example, a well-organized website may help consumers to come in, efficiently buy product(s) and exit, thus saving precious time for the consumer.
- Maximizing conversion rates whenever applicable, i.e., converting prospective customers into buyers. For example, in the case of Web personalization, this would amount to converting website visitors and browsers into buyers.
- Increasing cross- and up-selling of the provider's offerings.
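As flagged in the LTV item above, the lifetime value objective is usually made precise as a discounted sum of the per-period value expected from a consumer. The chapter defines LTV only verbally; the following is a standard textbook-style sketch using generic notation rather than the author's own:

\[
\mathrm{LTV}(c) \;=\; \sum_{t=0}^{T} \frac{\mathbb{E}\left[\, v_t(c) \,\right]}{(1+d)^{t}}
\]

where \(v_t(c)\) is the net value (e.g., margin) derived from consumer \(c\) in period \(t\), \(d\) is the per-period discount rate, and \(T\) is the (possibly unbounded) length of the relationship; retention and churn probabilities can be folded into the expectation. A provider pursuing this goal tailors offerings so as to increase the whole discounted sum rather than only the current period's term.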
The goals listed above can be classified into marketing- and economics-oriented. In the former case, the goal is to understand and satisfy the needs of the consumers, sometimes even at the expense of the short-term financial performance of the company, as is clearly demonstrated by the second (LTV) goal. For example, an online retailer may offer products and services to the consumer to satisfy his or her needs even if these offerings are not profitable to the retailer in the short term. In the latter case, the goal is to improve the short-term financial performance of the provider of the personalization service. As was extensively argued in the marketing literature, all the marketing-oriented goals eventually contribute to the long-term financial performance of the company (Kotler, 2003). Therefore, the difference between the marketing- and the economics-oriented goals boils down to the long- vs. the short-term performance of the company and, thus, both types of goals are based on fundamental economic principles. Among the seven examples of personalization goals listed above, the first five goals are marketing-oriented, whereas the last two are economics-oriented, since their objectives are to increase the immediate financial performance of the company. Finally, a personalization service provider can simultaneously pursue multiple goals, among which some can be marketing- and others economics-oriented.

8. Personalization engine is a software system that delivers personalized offerings from providers to consumers. It is responsible for providing customized offerings to the consumers according to the goals of the personalization system, such as the ones described above.

9. Distribution channel. Personalized offerings are delivered from the producers to the consumers along one or several distribution channels, such as a website, physical stores, email, etc. Selecting the right distribution channel for a particular customized offering often constitutes an important marketing decision. Moreover, the same offering can be delivered along multiple distribution channels. Selecting the right mixture of channels complementing each other and maximizing the distribution effects
constitutes the cross-channel optimization problem in marketing (IBM Consulting Services, 2006).

If implemented properly, personalization can provide several important advantages for the consumers and providers of personalized offerings, depending on the choice of specific goals listed in item (7) above. In particular, it can improve consumer satisfaction with the offerings and the consumer experience with the providers; it can make consumer interactions easier, more satisfying, efficient and less time consuming. It can improve consumer loyalty, increase retention, decrease churn rates and thus can lead to higher LTVs of some of the consumers. Finally, well-designed economics-oriented personalization programs lead to higher conversion and click-through rates and better up- and cross-selling results.

Besides personalization, mass customization (Tseng and Jiao, 2001; Zipkin, 2001) constitutes another popular concept in marketing and operations management, which is sometimes used interchangeably with personalization in the popular press. Therefore, it is important to distinguish these two concepts to avoid possible confusion. According to Tseng and Jiao (2001), mass customization is defined as "producing goods and services to meet individual customer's needs with near mass production efficiency." According to this definition, mass customization deals with the efficient production of goods and services, including manufacturing of certain products according to specified customer needs and desires. It is also important to note that these needs and desires are usually explicitly specified by the customers in mass customization systems, such as specifying the body parameters for manufacturing customized jeans, the feet parameters for manufacturing customized shoes and computer configurations for customized PCs.

In contrast, in the case of personalization, offerings are usually tailored to individual consumers without any significant production processes. Also, in the case of personalization, the knowledge about the needs and desires of consumers is usually implicitly learned from multiple interactions with them rather than being explicitly specified by the consumers, as in mass customization. For example, in the case of customized websites, such as myYahoo!, the user specifies her interests, and the website generates content according to the specified interests of the user. This is in contrast to the personalized web page on Amazon, where Amazon observes the consumer's purchases, implicitly learns her preferences and desires from these purchases and personalizes her "welcome" page according to this acquired knowledge. Therefore, personalization is about learning and responding to customer needs, whereas mass customization is about explicit specification of these needs by the customers and customizing offered products and services to these needs by tailoring production processes.

In this section, we explained what personalization means. In the next section, we describe different types of personalization.
3. Types of personalization
Tailoring of personalized offerings by providers to consumers can come in many different forms and shapes, thus resulting in various types of personalization. As David Smith put it, "there are myriad ways to get personal" (Smith, 2000). In this section, we describe different types of personalization.

3.1 Provider- vs. consumer- vs. market-centric personalization

Personalized offerings can be delivered from providers to consumers by personalization engines in three ways, as presented in Fig. 2 (Adomavicius and Tuzhilin, 2005a). In these diagrams, providers and consumers of personalized offerings are denoted by white boxes, personalization engines by gray boxes and the interactions between consumers and providers by solid lines.

Fig. 2. Classification of personalization approaches: (a) provider-centric, (b) consumer-centric, (c) market-centric.

Figure 2(a) presents the provider-centric personalization approach, which assumes that each provider has its own personalization engine that tailors the provider's content to its consumers. This is the most common approach to personalization, as popularized by Amazon.com, Netflix and the Pandora streaming music service. In this approach, there are two sets of goals for the personalization engines. On the one hand, they should provide the best marketing service to their customers and fulfill some of the marketing-oriented goals presented in Section 2. On the other hand, these provider-centric personalization services are designed to improve the financial performance of the providers of these services (e.g., Amazon.com and Netflix), and therefore their behavior is driven by the economics-oriented goals listed in Section 2. Therefore, the challenge for the provider-centric approaches to personalization is to strike a balance between the two sets of goals by keeping the customers happy with tailored offerings and making personalization solutions financially viable for the provider.
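One simple way such a balance is sometimes operationalized in practice (this is an illustrative device, not a method proposed in this chapter) is to score candidate offerings with a weighted combination of a marketing-oriented term and an economics-oriented term:

```python
def balanced_score(relevance, expected_margin, weight=0.7):
    """Hypothetical provider-centric objective: trade off predicted relevance
    to the consumer (marketing goal) against expected margin for the provider
    (economics goal). The weight and the comparable scales are assumptions."""
    return weight * relevance + (1 - weight) * expected_margin

candidates = {"book_A": (0.9, 0.2), "book_B": (0.6, 0.8)}
best = max(candidates, key=lambda o: balanced_score(*candidates[o]))
print(best)  # book_A under weight=0.7
```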
The second approach, presented in Fig. 2(b), is the consumer-centric approach, which assumes that each consumer has his or her own personalization engine (or agent) that "understands" this particular consumer and provides personalization services across several providers based on this knowledge. This type of consumer-centric personalization delivered across a broad range of providers and offerings is called an e-Butler service (Adomavicius and Tuzhilin, 2002) and is popularized by the PersonalWeb service from Claria (www.claria.com). The goals of a consumer-centric personalization service are limited exclusively to the needs of the consumer, and such a service should pursue only the consumer-centric objectives listed in Section 2, such as anticipating consumer needs and making interactions with a website more efficient and satisfying for the consumer. The problem with this approach lies in developing a personalization service of such quality and value to the consumers that they would be willing to pay for it. This would remove the dependency on advertising and other sources of revenue coming from the providers of personalized services, which would otherwise go against the philosophy of a purely consumer-centric service.

The third approach, presented in Fig. 2(c), is the market-centric approach, which provides personalization services for a marketplace in a certain industry or sector. In this case, the personalization engine performs the role of an infomediary that knows the needs of the consumers and the providers' offerings and tries to match the two parties in the best ways according to their internal goals. Personalized portals customizing the services offered by their corporate partners to the individual needs of their customers would be an example of this market-centric approach.
3.2 Types of personalized offerings

Types of personalization methods can vary very significantly depending on the type of offering provided by the personalization application. For example, methods for determining personalized searches (Qiu and Cho, 2006) differ significantly from the methods for determining personalized pricing (Choudhary et al., 2005), which also differ significantly from the methods for delivering personalized content to Web pages (Sheth et al., 2002) and personalized recommendations for useful products (Adomavicius and Tuzhilin, 2005b). In Section 2, we identified various types of offerings, including the following (a sketch of how such offering spaces can be represented follows the list):

• Products and services,
• Communications, including targeted ads, promotions and personalized email,
• Online content, including dynamically generated Web pages and links,
• Information searches,
• Dynamic prices.
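As the next paragraphs discuss, the structure of the offering space O differs sharply across these types. The following minimal sketch (the class names and the example taxonomy are hypothetical) contrasts a simple price-range space with a hierarchical content taxonomy:

```python
from dataclasses import dataclass, field

@dataclass
class PriceRange:
    """Offering space for dynamic pricing: an interval of admissible prices."""
    low: float
    high: float
    step: float = 1.0

    def offerings(self):
        price = self.low
        while price <= self.high:
            yield round(price, 2)
            price += self.step

@dataclass
class ContentNode:
    """Offering space for online content: a node in a content taxonomy."""
    name: str
    children: list = field(default_factory=list)

prices = PriceRange(10.0, 100.0, step=5.0)   # simple, one-dimensional space
books = ContentNode("Books", [               # hierarchical, potentially very large space
    ContentNode("Fiction"),
    ContentNode("Travel", [ContentNode("Travel to Africa")]),
])
print(len(list(prices.offerings())), books.children[1].children[0].name)
```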
One of the defining factors responsible for differences in methods of delivering various types of personalized offerings is the structure and complexity of the offerings space O, which can vary quite significantly across the types of offerings listed above. For example, in the case of dynamic prices, the structure of the offering space O is relatively simple (e.g., a discrete or continuous variable within a certain range), whereas in the case of online content tailoring it can be very large and complex, depending on the granularity of the web content and how the content is structured on the web pages of a particular personalization application. Another defining factor is that conceptually different methods are used for delivering various types of targeted offerings. For example, how to specify dynamic prices depends on the underlying economic theories, whereas providing personalized recommendations depends on the underlying data mining and other recommendation methods discussed in Section 5. Similarly, methods of delivering personalized searches depend on underlying information retrieval and web search theories.

A particular application can also deal with a mixture of the various types of offerings described above, which can result in a combination of different personalization methods. For example, if an online retailer decides to add dynamic prices to already developed personalized product offerings (i.e., customer X receives a recommendation for book Y at a personalized price Z), then this means combining personalized recommendation methods, such as the ones discussed in Section 5, with personalized pricing methods. Alternatively, a search engine may deliver personalized search results and personalized search-related ads targeted to individuals that are based not only on the search keywords specified by the consumer, but also on the personal characteristics of the consumer, as defined in his or her profile, such as the past search history, geographic location and demographic data, when available.

3.3 Individual vs. segment-based personalization

As was pointed out in Section 2, personalized offerings can be tailored either to the needs of individuals or to segments of consumers. In the former case, the consumer profile is built exclusively from the data pertaining to this and only this consumer (Adomavicius and Tuzhilin, 2001a; Jiang and Tuzhilin, 2006a). In the latter case, the consumer is grouped into a segment of similar individuals, and the profile is built for the whole segment. This profile is subsequently applied to target the same offering to the whole segment. The smaller the segment size, the finer the targeting of the offering to the consumers in that segment and, therefore, the more personalized the offerings become. Thus, by varying segment sizes, we change the degree of personalization from being coarse for large segments to being fine for
smaller segments. In the limit, complete personalization is reached with 1-to-1 marketing, when the segment size is always one. Although strongly advocated in the popular press (Peppers and Rogers, 1993; Peppers and Rogers, 2004), it is not clear that targeting personalized offerings to individual consumers will always be better than targeting them to segments of consumers, because of the tradeoff between sparsity of data for individual consumers and heterogeneity of consumers within segments: individual consumer profiles may suffer from sparse data, resulting in high variance of performance measures of individual consumer models, whereas aggregate profiles of consumer segments suffer from high levels of customer heterogeneity, resulting in high performance biases. Depending on which effect dominates the other, it is possible that individualized personalization models outperform the segmented or aggregated models, and vice versa.

The tradeoff between these two approaches has been studied in Jiang and Tuzhilin (2006a), where the performance of individual, aggregate and segmented models of consumer behavior was compared empirically across a broad spectrum of experimental settings. It was shown that for highly transacting consumers or for poor segmentation techniques, individual-level consumer models outperform segmentation models of consumer behavior. These results reaffirm the anecdotal evidence about the advantages of personalization and 1-to-1 marketing stipulated in the popular press (Peppers and Rogers, 1993; Peppers and Rogers, 2004). However, the experiments reported in Jiang and Tuzhilin (2006a) also show that segmentation models, taken at the best granularity level(s) and generated using effective clustering methods, dominate individual-level consumer models when modeling consumers with little transactional data. Moreover, this best granularity level is significantly skewed towards the 1-to-1 case and is usually achieved at the finest segmentation levels. This finding provides additional support for the case of micro-segmentation (Kotler, 2003; McDonnell, 2001), i.e., consumer segmentation done at a highly granular level.

In conclusion, determining the right segment sizes and the optimal degree of personalization constitutes an important decision in personalization applications and involves the tradeoff between heterogeneity of consumer behavior in segmented models vs. sparsity of data for small segment sizes and individual models.

3.4 Smart vs. trivial personalization

Some personalization systems provide only superficial solutions, such as presenting trivial content to the consumers, for example greeting them by name or recommending a book similar to the one the person has bought recently. As another example, a popular website personalization.com (or its alias personalizationmall.com) provides personalized engravings on various
items ranging from children's backpacks to personalized beer mugs. These examples constitute cases of trivial (Hagen, 1999) [shallow or cosmetic (Gilmore and Pine, 1997)] personalization. In contrast to this, if offerings are actively tailored to individuals based on rich knowledge about their preferences and behavior, then this constitutes smart (or deep) personalization (Hagen, 1999).

Continuing this categorization further, Paul Hagen classifies personalization applications into the following four categories, described with the 2 × 2 matrix shown in Fig. 3 (Hagen, 1999).

Fig. 3. Classification of personalization applications (Hagen, 1999): reactive vs. proactive tailoring crossed with poor vs. rich profiles, yielding trivial, lazy, overeager and smart personalizers.

According to Fig. 3 and Hagen (1999), one classification dimension constitutes consumer profiles, which are classified into rich vs. poor. Rich profiles contain comprehensive information about consumers and their behavior of the type described in Section 2 and further explained in Section 5. Poor profiles capture only partial and trivial information about consumers, such as their names and basic preferences. The second dimension of the 2 × 2 matrix in Fig. 3 constitutes tailoring (customization) of the offerings. According to Hagen (1999), the offerings can be tailored either reactively or proactively. Reactive tailoring takes already existing knowledge about consumers' preferences and "parrots" these preferences back to them without producing any new insights about potentially new and interesting offerings. In contrast, proactive tailoring takes consumer preferences stored in consumer profiles and generates new useful offerings by using innovative matchmaking methods to be described in Section 5. Using these two dimensions, Hagen (1999) classifies personalization applications into:

• Trivial personalizers: These applications have poor profiles and provide reactive targeting. For example, a company can ask many relevant questions about consumer preferences, but would not use this knowledge to build rich profiles of the customers and deliver truly personalized and relevant content. Instead, the company insults its customers by ignoring their inputs and delivering irrelevant marketing messages or doing cosmetic personalization, such as greeting the customers by name.

• Lazy personalizers: These applications build rich profiles, but do only reactive targeting. For example, an online drugstore can have rich
information about a customer's allergies, but miss or even ignore this information when recommending certain drugs to patients. This can lead to recommending drugs that cause allergies in patients, even though the allergy information is contained in the patients' profiles.

• Overeager personalizers: These applications have poor profiles but perform proactive targeting of their offerings. This can often lead to poor results because of the limited information about consumers and faulty assumptions about their preferences. Examples of these types of applications include recommending books similar to the ones the consumer bought recently and recommending various types of baby products to a woman who recently had a miscarriage.

• Smart personalizers: These applications use rich profiles and provide proactive targeting of the offerings. For example, an online gardening website may warn a customer that the plant she just bought would not grow well in the climate of the region where the customer lives. In addition, the website would recommend alternative plants, based on the customer's preferences and past purchases, that would better fit the climate where the customer lives.

On the basis of this classification, Hagen (1999), obviously, argues for the need to develop smart personalization applications by building rich profiles of consumers and actively tailoring personalized offerings to them. At the heart of smart personalization lie two problems: (a) how to build rich profiles of consumers and (b) how to match the targeted offerings to these profiles well. Solutions to these two problems will be discussed further in Section 5.

3.5 Intrusive vs. non-intrusive personalization

Tailored offerings can be delivered to the consumer in an automated manner without distracting her with questions and requests for information and preferences. Alternatively, the personalization engine can ask the consumer various questions in order to provide better offerings. For example, Amazon.com, Netflix and other similar systems that recommend various products and services to individual consumers ask these consumers for an initial set of ratings of the products and services before providing recommendations regarding them. Also, when a multidimensional recommender system wants to provide a recommendation in a specific context, such as recommending a movie to a person who wants to see it with his girlfriend on Saturday night in a movie theater, the system would first ask (a) when he wants to see the movie, (b) where and (c) with whom, before providing a specific recommendation (Adomavicius et al., 2005). Such personalization systems are intrusive in the sense that they keep asking consumers questions before delivering personalized offerings to
them, and the levels of consumer involvement can be very significant in some cases. Alternatively, personalization systems may not ask consumers explicit questions, but non-intrusively learn consumer preferences from various automated interactions with them. For example, the amount of time a consumer spends reading a newsgroup article can serve as a proxy for how much the consumer is interested in this article. Clearly, non-intrusive personalization systems are preferable from the consumer's point of view, but they may provide less accurate recommendations. Studying the tradeoffs between intrusive and non-intrusive personalization systems and determining optimal levels of intrusiveness in various personalization applications constitutes an interesting and important research problem. This problem has already been studied by several researchers in the context of recommender systems. In particular, Oard and Kim (1998) described several ways of obtaining implicit feedback for recommender systems. The methods of minimizing the number of intrusive questions for obtaining user ratings in recommender systems have also been studied in Pennock et al. (2000), Rashid et al. (2002), Boutilier et al. (2003), Montgomery and Srinivasan (2003) and Yu et al. (2004).
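As a minimal illustration of the non-intrusive approach, the reading-time example above can be turned into an implicit interest score; the heuristic below is a hypothetical sketch for illustration only, not one of the methods from the works just cited.

```python
def implicit_interest(dwell_seconds, article_length_words, words_per_second=4.0):
    """Toy heuristic: compare observed reading time with the expected reading
    time of the article and clamp the ratio to the interval [0, 1]."""
    expected_seconds = article_length_words / words_per_second
    if expected_seconds <= 0:
        return 0.0
    return max(0.0, min(1.0, dwell_seconds / expected_seconds))

# A consumer spends 90 seconds on a 600-word article (about 150s expected)
print(implicit_interest(90, 600))  # 0.6
```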
3.6 Static vs. dynamic personalization

Personalization applications can also be classified in terms of who selects and delivers the offerings and how this is done. On the one hand, the offerings can be selected and delivered dynamically by the personalization system. For example, the system may be monitoring the activities of the consumer and the environment and dynamically deciding to change the content of the web pages for the consumer based on her activities and the changes in the environment. One promising type of dynamic personalization is ubiquitous personalization based on mobile location-based services (LBS) (Rao and Minakakis, 2003), which deploy various types of wireless technologies to identify the location and other types of contextual information, such as the current time, the consumer's schedule and the purpose of the trip, in order to provide dynamic personalized services to the consumer based on this contextual information and the consumer profile. Examples of these LBS include suggestions of various shops, restaurants, entertainment events and other points of interest in the geographical and temporal vicinity of the consumer.

On the other hand, the offerings delivered to the consumer can be selected either by the consumer herself or by the system administrator, who defines a fixed set of business rules governing the delivery of the offerings to specific segments of the consumers (a minimal sketch of such rule-based selection follows). In this case, the selection is done statically and can be changed only by the consumer or the system administrator, depending on the case.
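The following sketch illustrates the static, administrator-defined selection just described; the rule conditions, segment names and offerings are hypothetical and would in practice be maintained by the administrator.

```python
# Each business rule maps a condition on the consumer and situation to a fixed offering.
business_rules = [
    (lambda c: c["segment"] == "frequent_buyers" and c["day"] == "Saturday", "loyalty_discount_banner"),
    (lambda c: c["segment"] == "new_visitors", "introductory_offer_page"),
]

def select_offering_statically(consumer, default="generic_homepage"):
    """Return the offering of the first matching rule; the rule set only changes
    when the administrator edits it, hence the selection is static."""
    for condition, offering in business_rules:
        if condition(consumer):
            return offering
    return default

print(select_offering_statically({"segment": "new_visitors", "day": "Monday"}))
```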
Obviously, the dynamic selection of offerings is more flexible and is preferable to the static selection process. On the other hand, it should be done in a smart way, as described in Section 3.4 above, to avoid substandard performance of the personalization system.

In summary, we discussed various types of personalization in this section, following the dictum of David Smith that "there are myriad ways to get personal" (Smith, 2000). Therefore, specific types of personalization approaches need to be selected carefully, depending on the particular personalization application at hand and on the goals that this application tries to accomplish, such as the ones described in Section 2.
4. When does it pay to personalize?
One of the reasons why personalization has had its share of successes and disappointments is that it does not always make sense to personalize, for both technical and economic reasons. One such technical reason is that the provision of personalized offerings can lead to questionable outcomes that do not benefit, or, even worse, insult the consumer. For example, an online maternity store can start recommending various types of baby products to a woman who bought maternity clothes for herself a few months ago, without realizing that she has recently had a miscarriage. One of the fundamental assumptions in personalization is the stability of consumer preferences and the assumption that past consumer activities can be used to predict their possible future preferences and actions. As the previous example clearly demonstrates, this assumption does not hold in some cases. In those cases, the dangers of personalization and the risks of falling into the de-personalization trap (to be discussed in Section 5) may outweigh the potential benefits of personalization, thus making it impractical.

On the economic side, proponents of personalization first need to build a strong business case before launching a personalization project. At the most general level, personalization should be done when the benefits derived from a personalization project exceed its costs for both providers and consumers of personalized offerings. Otherwise, one of the parties will refuse to participate in the personalization project. In the rest of this section, we examine the costs vs. benefits tradeoff for both providers and consumers of personalized offerings.

Consumers. From the consumers' perspective, the benefits of personalization are more relevant offerings delivered by the providers at the most opportune moments. One problem with these types of benefits is that it is hard to measure their effects, as will be discussed in Section 5. The costs of personalization consist of two parts for the consumers: direct and indirect. The direct costs are subscription costs paid by the consumers. For
the provider-centric personalization, personalization services are usually provided for free and, therefore, the direct costs for the consumers are usually zero. In the case of consumer-centric personalization, consumers must pay for these services, as discussed in Section 3.1, and these fees constitute the direct costs for the consumers. Indirect costs to the consumers include the time and cognitive effort of installing and configuring personalization services, and the privacy and security issues associated with these services. As a part of the subscription, the consumers must provide certain personal information to the personalization service providers, and there are always some risks that this personal information can be misused by the providers. As in the case of benefits, these indirect costs are also hard to measure.

Providers. For the providers, the key question is whether or not they should customize their offerings and, if so, to what degree and scope. The decision to personalize its offerings or not depends on the tradeoff between the personalization costs and the benefits derived by the provider from personalized offerings to the consumers. We will now examine these costs and benefits.

Customization does not come for free, since in most cases it requires additional costs to customize offerings, especially in the case of customized products that need to be manufactured. Also, the more personalized an offering is, the more customization is usually required. For example, it is one matter to make a certain type of shoe in 20 different varieties depending on the color, foot size and width, and a completely different and more expensive proposition to manufacture a personal pair of shoes for a specific customer. In general, the more customized an offering is and the smaller the targeted segment, the more costly the manufacturing process becomes. In the limit, manufacturing for the segment of one is the most expensive, and it requires stronger business justification to adopt this option (Zipkin, 2001).

One interesting research question is whether firms should customize their products based on one or multiple attributes and whether different firms should select the same or different attributes for customization purposes. In Syam et al. (2005), it is shown that it is better for the firms to select only one, and the same, attribute as a basis for customization. This problem was further explored in Ghose and Huang (2007). Moreover, it is also important to know how customization of products and prices affect each other. In Ghose and Huang (2006), it is shown that if the fixed costs of personalization are low, firms are always better off personalizing both prices and products. Shaffer and Zhang (2002) also show that similar effects can arise if firms are asymmetric in market share.

As for the consumers, personalization costs for the providers consist of direct and indirect costs. The direct costs are associated with the extra effort required to customize personalized offerings, whereas indirect costs are associated with the potential problems pertaining to providing personalized
solutions, such as privacy-related and legal costs. For example, Doubleclick and some other personalization companies had to deal with legal challenges pertaining to privacy, incurred significant legal costs and subsequently decided to abstain from certain types of personalization.

Benefits derived from providing personalized offerings include:

• Premium prices charged for these offerings under certain competitive economic conditions (Chen et al., 2001; Ghose and Huang, 2007; Shaffer and Zhang, 2002; Syam et al., 2005). For example, a shoe manufacturer can in many cases charge premium prices for custom-made shoes.

• Additional customer satisfaction, loyalty and higher retention rates, resulting in higher LTV values for the customers and less churn.

• Higher conversion rates from prospective to real and to loyal customers.

• Higher average revenue levels per customer via cross- and up-selling capabilities.

Unfortunately, as discussed in Section 5, some of these benefits are hard to measure. Therefore, it is often hard to produce exact numbers measuring personalization benefits. To deal with this problem, Rangaswamy and Anchel (2003) proposed a framework in which the provider's decision to personalize or not is evaluated in terms of the tradeoff between the customization costs incurred and the heterogeneity of consumers' wants. Rangaswamy and Anchel (2003) present a 2 × 2 matrix with the dimensions "customization costs" and "heterogeneity of consumer wants" and classify various manufacturing products into the quadrants of this matrix. Such products as mutual funds, music and similar types of digital products have low customization costs, while consumer wants for these products are very heterogeneous. Therefore, these types of products are primary candidates for personalization, as is witnessed by the personalized Internet radio station Pandora (www.pandora.com). On the other end of the spectrum are such products as cars, photocopiers and MBA programs. Customization costs for such products are high, whereas consumer wants are significantly more homogeneous than for the other types of products. Therefore, it is less attractive for the providers to personalize such products. An interesting situation arises for the class of products where consumer wants and customization costs are in between these two extremes, i.e., they are not too high and not too low. According to Rangaswamy and Anchel (2003), examples of such products include clothes, certain food items, computers, watches, etc. Therefore, we see certain personalization efforts for these products, such as certain customized clothes (e.g., jeans), foods prepared for individual consumers and customized computers (e.g., Dell), while still none for others (e.g., mass-produced clothes, watches, etc.).
In summary, it makes business sense to personalize only when the benefits of personalization exceed its costs for both providers and consumers of personalized offerings, which happens only for certain types of offerings and usually has to be determined on a case-by-case basis. Moreover, it is difficult to measure the costs and benefits of personalization in many cases. Therefore, personalization decisions are often hard to make in real business settings, and they require careful cost-benefit analysis and evaluation.
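A minimal sketch of this decision rule, assuming (optimistically, given the measurement difficulties discussed above) that per-consumer benefit and cost estimates are available for both parties; all numbers are illustrative.

```python
def worth_personalizing(provider_benefit, provider_cost, consumer_benefit, consumer_cost):
    """Personalize only if the estimated benefits exceed the estimated costs for
    both parties; otherwise one of the parties will refuse to participate."""
    return provider_benefit > provider_cost and consumer_benefit > consumer_cost

# Illustrative yearly per-consumer estimates (in dollars)
print(worth_personalizing(provider_benefit=12.0, provider_cost=5.0,
                          consumer_benefit=8.0, consumer_cost=3.0))  # True
```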
5. Personalization process
As was argued by Adomavicius and Tuzhilin (2001b), personalization should be considered an iterative process consisting of several stages that are integrated together into one tight system. In particular, Adomavicius and Tuzhilin (2001b) proposed the following five stages: (a) collecting customer data, (b) building customer profiles using this data, (c) matchmaking customized offerings to specific customer profiles to determine the most relevant offerings to individual customers, (d) delivery and presentation of customized information and offerings through the most relevant channels, at the most appropriate times and in the most appropriate form and (e) measuring customer responses to the delivered offerings. Moreover, Adomavicius and Tuzhilin (2001b) argued for the necessity of a feedback loop mechanism that takes customers' responses to the current personalization solution, transfers appropriate information to the earlier stages of the personalization process, and adjusts, improves and corrects various activities in these earlier stages that cause poor responses from the customers.

This approach of viewing personalization as a process was further developed by Murthi and Sarkar (2003), who partitioned the personalization process into the following three stages: (a) learning customer preferences, (b) matching offerings to customers' preferences and (c) evaluation of the learning and matching processes. Murthi and Sarkar (2003) also placed personalization within the firm's overall Value Net framework and connected it to the general business strategy of the firm.

Subsequently, Adomavicius and Tuzhilin (2005a) extended and refined the previous approaches by proposing the Understand–Deliver–Measure (UDM) framework, according to which the personalization process is defined in terms of the UDM cycle consisting of the following stages, as shown in Fig. 4:

Understand consumers by collecting comprehensive information about them and converting it into actionable knowledge stored in consumer profiles. The output of this stage is a consumer-centric DW (Kimball, 1996) and the consumer-centric modelbase (Liu and Tuzhilin, 2008). The consumer-centric DW stores factual profiles of each consumer.
Fig. 4. Personalization process: the Understand–Deliver–Measure cycle implemented as six stages (Data Collection, Building Consumer Profiles, Matchmaking, Delivery and Presentation, Measuring Personalization Impact, Adjusting Personalization Strategy) connected by a feedback loop.
The consumer-centric modelbase stores data mining and statistical models describing the behavior of individual consumers. Collectively, the factual profile and the collection of data mining models of the consumer form the consumer profile.

Deliver customized offerings based on the knowledge about each consumer C, as stored in the consumer profiles, and on the information about the space of offerings O. The personalization engine should find the customized offerings from the space O that are the most relevant to each consumer C within the specified context and deliver them to C in the best possible manner, including at the most appropriate time(s), through the most appropriate channels and in the most appropriate form. These customized offerings constitute the marketing outputs of the personalization process.

Measure personalization impact by determining how much the consumer is satisfied with the marketing outputs (in the form of delivered personalized offerings). This stage provides information that can enhance our understanding of consumers or point out the deficiencies of the methods for personalized delivery. Therefore, this additional information serves as feedback for possible improvements to each of the other components of the personalization process. This feedback information completes one cycle of the personalization process and sets the stage for the next cycle, where improved personalization techniques can make better personalization decisions.

More recently, Vesanen and Raulas (2006) presented an alternative approach to describing the personalization process that consists of interaction,
processing, customization and delivery stages. In addition, Vesanen and Raulas (2006) explicitly introduced four objects into their framework: customers, customer data, customer profiles and marketing outputs, and showed how the aforementioned four stages are connected to these four objects. In particular, they described how customer data is obtained from the customers via interactions with them and from external sources, then how it is preprocessed into the customer profiles, and then how marketing outputs are customized based on the profiling information. Vesanen and Raulas (2006) also argue for the importance of integrating the various personalization stages and describe possible problems arising from improper integration of various stages of the personalization process and the existence of "discontinuity points." Finally, Vesanen and Raulas (2006) present a case study describing how the described personalization process was implemented in a direct marketing company.

Although each of the described approaches covers different aspects of the personalization process, we will follow below the modified UDM model from Adomavicius and Tuzhilin (2005a) that is schematically described in Fig. 4, because we believe that this modified UDM model covers all the aspects of the personalization process. For example, the four personalization stages presented in Vesanen and Raulas (2006) are closely related to the six stages of the personalization process presented in Fig. 4.

The UDM framework described above constitutes a high-level conceptual description of the personalization process. The technical implementation of the UDM framework consists of the following six stages (Adomavicius and Tuzhilin, 2005a) presented in Fig. 4.

Stage 1: Data Collection. The personalization process begins with collecting data across different channels of interaction between consumers and providers (e.g., Web, phone, direct mail and other channels) and from various other external data sources, with the objective of obtaining the most comprehensive "picture" of a consumer. Examples of the "interactions" data include browsing, searching and purchasing data on the Web, direct mail, phone and email interactions data, and various demographic and psychographic data collected through various online and offline forms and surveys. Examples of external data include economic, industry-specific, geographic and census data either purchased or obtained from external sources through means other than direct interactions with the consumer.

Stage 2: Building Customer Profiles. Once the data is collected, one of the key issues in developing personalization applications is integrating this data and constructing accurate and comprehensive consumer profiles based on the collected data. Many personalization systems represent consumer profiles in terms of a collection of facts about the consumer. These facts may include the consumer's demographics, such as name, gender, date of birth and address. The facts can also be derived from the past transactions of a consumer,
e.g., the favorite product category of the consumer or the value of the largest purchase made at a Web site. As explained in Section 2, this simple factual information about the consumer can be stored as a record in a relational database or a consumer-centric DW. Also, more complicated factual information, such as the information about the social network of a person and his or her relationships and interactions with other consumers, may require the use of taxonomies and ontologies and can be captured using XML or special languages for defining ontologies (Staab and Studer, 2003), such as OWL (Antoniou and Harmelen, 2003).

However, such factual profiles containing collections of facts may not be sufficient in certain more advanced personalization applications, including high-precision personalized content delivery and certain advanced recommendation applications. Such applications may require the deployment of more advanced profiling techniques that include the development of data mining and statistical models capturing various aspects of the behavior of individuals or segments of consumers. These consumer models may include predictive data mining models, such as decision trees, logistic regressions and Support Vector Machines (SVMs), predicting various aspects of consumer behavior. These models can be built either for individuals or for segments of consumers. The tradeoff between individual and segment-based models lies in the idiosyncrasy of individual models vs. the lack of sufficient data to build reliable predictive models (Jiang and Tuzhilin, 2006a). As was shown in Jiang and Tuzhilin (2006a), for applications where individual consumers perform many transactions, so that it is possible to build reliable individual predictive models, individual models dominate the segment-based models of consumers. In contrast, in low-frequency applications micro-segmentation models outperform individual models of consumers, assuming consumers are grouped into segments using high-quality clustering methods.

In addition to the predictive models, profiles may also include descriptive models of consumer behavior based on such data mining methods as descriptive rules (including association rules), sequential and temporal models and signatures (Adomavicius and Tuzhilin, 2005a). An example of a rule describing a consumer's movie viewing behavior is "John Doe prefers to see action movies on weekends" (i.e., Name = "John Doe" & MovieType = "action" → TimeOfWeek = "weekend"). Such rules can be learned from the transactional history of the consumer (e.g., John Doe in this case) using the techniques described in Adomavicius and Tuzhilin (2001a). Consumer profiles can also contain important and frequently occurring sequences of the consumer's most popular activities, such as sequences of Web browsing behavior and various temporal sequences. For example, we may want to store in John Doe's profile his typical browsing sequence "when John Doe visits the book Web site XYZ, he usually first accesses the home page, then goes to the Home&Gardening section of the site, then browses the
Gardening section and then leaves the Web site" (i.e., XYZ: StartPage → Home&Gardening → Gardening → Exit). Such sequences can be learned from the transactional histories of consumers using frequent episodes and other sequence learning methods (Hand et al., 2001). Finally, consumer profiles can also contain signatures of consumer behavior (Cortes et al., 2000), i.e., data structures used to capture evolving behavior learned from large data streams of simple transactions. For example, "top 5 most frequently browsed product categories over the last 30 days" would be an example of a signature that could be stored in individual consumer profiles in a Web store application.

In summary, besides factual information about consumers, their profiles can also contain various data mining and statistical models describing consumer behavior, such as predictive, descriptive rule-based, sequential and temporal models and signatures. All this consumer profiling information can be stored in two types of repositories:

• Consumer-centric DW (Kimball, 1996), where each consumer has a unique record or a taxonomy containing demographic and other factual information describing his or her activities.

• Consumer-centric modelbase (Liu and Tuzhilin, 2008), containing one or several models describing different aspects of the behavior of a consumer. As explained before, a model can be unique for a consumer or a segment of consumers, and can be organized and stored in the modelbase in several different ways (Liu and Tuzhilin, 2008). However, each consumer should have an easily accessible and managed list of all the models describing his or her behavior. Collectively, the set of all the models of all the consumers forms a modelbase, and it is organized and managed according to the principles described in Liu and Tuzhilin (2008).

Stage 3: Matchmaking. Once the consumer profiles are constructed, personalization systems must be able to match customized offerings to individuals or segments of consumers within a certain context, such as shopping for oneself vs. for a friend, based on the consumer profiling information obtained in Stage 2 and on the information about the space of offerings O. The matchmaking process should find customized offerings from the space O that are the most relevant to each consumer C within the specified context. Before describing the matchmaking process, we first need to clarify the following concepts:

1. Space of offerings O: This space has a certain structure that varies significantly among the offerings. For example, in the case of dynamic prices, the space of offerings O can consist of a range of possible prices (e.g., from $10 to $100), whereas for content management systems presenting personalized content in the form of dynamically generated pages, links and
other content, the space O can consist of a complex structure with a certain taxonomy or ontology specifying granular and hierarchical content for a particular application. (Ontology is a more general concept than taxonomy and includes the representation of a set of concepts, such as various types of offerings, and of different types of relationships among them. However, it is more difficult to support fully fledged ontologies in the matchmaking process, and taxonomies of offerings, as well as of the consumers discussed below, may constitute a reasonable compromise.) For example, the space O for the book portion of the Amazon website needs to specify a taxonomy of books (such as the one on the left-hand side of the home page of Amazon's book section, which classifies books into categories, such as arts & entertainment, business & technology, children's books, fiction and travel, and subcategories, such as travel to Africa). Besides the book taxonomy, the home page of Amazon's book section has granular content comprising various templates, sections and slots that are filled with specific content. Some examples of these sections include the middle section for the most interesting and appropriate books for the customer, the bargain offers section, the recent history section at the bottom and so on, with each section having its own structure and taxonomy. In summary, all these granular offerings need to be organized according to some taxonomy, which should be hierarchical in structure with complex relationships among its various components. The problem of specifying the space O and organizing online content is a part of the bigger content management problem, which has been studied in Sheth et al. (2002). Defining appropriate taxonomies and ontologies of offerings for optimal targeting to consumers constitutes a challenging research problem for certain types of offerings, such as online content, and needs to be carefully studied further.

2. Space of consumers: In addition to offerings, we need to build an ontology or a taxonomy of consumers by categorizing them according to one or more methods. For example, consumers can be categorized based on geography, occupation, and consumption and spending patterns. Each of these dimensions can have a complex hierarchical or other structure, such as the geographic dimension divided into country, region, state, city, zip code and other categories. One of the ways to categorize the consumers is to partition them into some segmentation hierarchy (Jiang and Tuzhilin, 2006a). For each segment, one or several models can be built describing the behavior of this segment of consumers. As explained in Stage 2 above, these models are part of consumer profiles. More generally, profiles can be built not only for individuals but also for various segments of consumers. Also, we can support more complex ontologies of consumers that incorporate their social networks and other relationships among themselves and with the products and services in which they may be interested, including various types of reviews and opinions. The problem of defining and building appropriate consumer ontologies and taxonomies, including social networks, for
optimal targeting of customized offerings constitutes an interesting research question.

3. Context: Personalization systems can deliver significantly different customized offerings to consumers depending on the context in which these offerings are made. For example, if an online book retailer knows that a consumer looks for a book for a course that she takes at a university, a different type of offering will be provided to her than in the case when she looks for a gift for her boyfriend. Defining and specifying the context can significantly improve the personalization results, as was shown in Adomavicius et al. (2005) and Gorgoglione et al. (2006). Moreover, the more specific the context and the more individualized the models that are built, the more this context matters for better customizing offerings to the consumers (Gorgoglione et al., 2006).

Given these preliminary concepts, the matchmaking process can be defined as follows: for a given context and a specified segment of consumers, (a) find the appropriate granularity level in the taxonomy associated with the offerings space O at which the offerings should be made and (b) select the best tailoring of the offering at that granularity level. For example, assume that a female high school teacher, 30–35 years old, from New York is buying a book online for herself. Then the personalization engine should figure out which books should be placed on the starting page of a female high school teacher from New York in the specified age category. It should also identify how many books should be placed and how to reorder various categories of books in the list to personalize the list for the teacher and her segment in general. The personalization engine may also expand some categories of books most suitable for the teacher into subcategories to make the book selection process more convenient for her. A related issue is how often the online retailer should send the teacher emails with various book offerings. Note that this tailoring is done for a segment of consumers (with the specified characteristics) and within a particular context (personal purchase). The special case is when this matchmaking is done for individual consumers, i.e., for segments of one.

The answer to this question depends on various factors, including the goals of personalization: what goals we want to accomplish with this particular customization of offering(s). One such goal is to maximize the utility of the offering o in O for the segment of consumers s in the context c, U(o, s, c), i.e., we want to select the offering o that maximizes the utility U(o, s, c) for the given context c and the consumer segment s. As we said before, the special case of this problem is when the segment s consists of a single consumer.

The tailoring process can be of two types. The first type requires manufacturing processes, with the associated time delays, delivery issues and costs incurred to customize the offering(s).
Examples of such customized offerings include customized jeans, shoes, CD records and personal computers. The second type does not require any manufacturing and only needs to deal with selection and configurability issues, such as the selection of appropriate books to display on a website or the generation of personalized web pages and other online content. Such customization can be done in real time with negligible costs. Although both problems are important, we will focus on the latter one in the rest of this section, since the first one is a large subject on its own and could constitute a separate stand-alone paper.

Although some of the matchmaking principles are common across all the targeted offerings and applications, such as maximizing the utility of the offering o, U(o, s, c), other matchmaking methods depend critically on the particular offering and/or application. For example, website personalization has its own set of matchmaking algorithms that are quite different from those for recommending books or for personalizing product prices. These differences come in part from using different objective functions and dealing with different structures of the offering space O across these applications. For example, in the case of recommending books, one of the possible objectives is to maximize the predictive accuracy of a recommendation. In contrast, one of the objectives of website personalization is to maximize the navigational simplicity of the website. Because of these different objectives and different offering spaces, the matchmaking approaches can be quite different.

There are many matchmaking technologies proposed in the literature, including recommender systems, statistics-based predictive approaches and rule-based systems, where an expert specifies business rules governing the delivery of content and services that depend on the conditions specified in the antecedent part of the rule. In particular, several industrial personalization solutions, initially developed by BroadVision and subsequently integrated into various personalization servers, support rule-based matchmaking, where the rules are defined by a domain expert. For example, a marketing manager may specify the following business rule: if a consumer of a certain type visits the online grocery store on a Sunday night, then this consumer should be shown the discount coupons for diapers.

There has been much work done on developing various recommendation-based matchmaking technologies over the past decade since the appearance of the first papers on collaborative filtering in the mid-1990s (Hill et al., 1995; Resnick et al., 1994; Shardanand and Maes, 1995). These technologies are based on a broad range of different approaches and feature a variety of methods from such disciplines as statistics, machine learning, information retrieval and human–computer interaction. Moreover, these methods are often classified into broad categories according to their recommendation approach as well as their algorithmic technique. In particular, Balabanovic
and Shoham (1997) classify these methods based on the recommendation approach as follows:

• Content-based recommendations: the consumer is recommended items (e.g., content, services, products) similar to the ones the consumer preferred in the past. In other words, content-based methods analyze the commonalities among the items the consumer has rated highly in the past. Then, only the items that have high similarity with the consumer's past preferences would get recommended.

• Collaborative recommendations (or collaborative filtering): the consumer is recommended items that people with similar tastes and preferences liked in the past. Collaborative methods first find the closest peers for each consumer, i.e., the ones with the most similar tastes and preferences. Then, only the items that are most liked by these peers would get recommended.

• Hybrid approaches: these methods combine collaborative and content-based methods. This combination can be done in many different ways, e.g., separate content-based and collaborative systems are implemented and their results are combined to produce the final recommendations. Another approach would be to use content-based and collaborative techniques in a single recommendation model, rather than implementing them separately.

Classifications based on the algorithmic technique (Breese et al., 1998) are:

• Heuristic-based techniques, which calculate recommendations directly from the previous transactions made by the consumers. An example of such a heuristic for a movie recommender system could be to find a consumer X whose taste in movies is the closest to the tastes of consumer Y, and recommend to consumer Y everything that X liked that Y has not yet seen.

• Model-based techniques, which use the previous transactions to learn a model (usually using a machine learning or a statistical technique), which is then used to make recommendations. For example, based on the movies that consumer X has seen, a probabilistic model is built to estimate the probability of how consumer X would like each of the yet unseen movies.

These two classifications are orthogonal and give rise to six classes of matchmaking methods corresponding to the six possible combinations of these classifications. Adomavicius and Tuzhilin survey various recommendation methods within the specified framework in Adomavicius and Tuzhilin (2005b), and the interested reader is referred to this article.

Although there has been much work done on developing different matchmaking methods, most of them do not address certain issues that are crucial for personalization technologies to be successfully deployed in real-life applications, such as not fully considering contextual information, working
Although there has been much work on developing different matchmaking methods, most of them do not address certain issues that are crucial for personalization technologies to be successfully deployed in real-life applications, such as not fully considering contextual information, working only with single-criterion ratings, and not fully addressing explainability, trustworthiness, privacy and other issues. A detailed list of limitations of the current generation of recommender systems and a discussion of possible approaches to overcome these limitations is presented in Adomavicius and Tuzhilin (2005b). How to address privacy issues in personalization is discussed in Kobsa (2007).

Many commercial ventures have implemented recommender systems over the past several years to provide useful recommendations to their customers. Examples of such companies include Amazon (recommendations of books, CDs and other products), TiVo (TV programs), Netflix and Yahoo! (movies), Pandora and eMusic (music) and Verizon (phone service plans and configurations). Despite all the recent progress in developing successful matchmaking methods, "smart matchmaking" remains a complex and difficult problem, and much more work is required to advance the state-of-the-art and achieve better personalization results. To advance the state-of-the-art in recommender systems, Netflix launched a $1 million prize competition in October 2006 to improve its recommendation methods so that its recommendations would achieve better performance results (Bennett and Lanning, 2007). This competition and other related activities have further reinvigorated research interest in recommender systems, as demonstrated by the launch of the ACM Conference on Recommender Systems (RecSys) in 2007.

Stage 4: Delivery and Presentation. As a result of matchmaking, one or several customized offerings are selected for the consumer. Next, these offerings should be delivered and presented to the consumer in the best possible manner, i.e., at the most appropriate time(s), through the most appropriate channels and in the most appropriate form, such as lists of offerings ordered by relevance or other criteria, through visualization methods, or using narratives. These customized offerings, when delivered to the consumer, constitute the marketing outputs of the personalization process. One classification of delivery methods is into pull, push and passive methods (Schafer et al., 2001). Push methods reach a consumer who is not currently interacting with the system, e.g., by sending an email message. Pull methods notify consumers that personalized information is available but display this information only when the consumer explicitly requests it. Passive delivery displays personalized information as a by-product of other activities of the consumer, such as up- and cross-selling activities; for example, while looking at a product on a Web site, a consumer also sees recommendations for related products. The problem of selecting the most appropriate delivery methods, including the choice of push, pull or passive methods and the determination of the most appropriate times, channels and forms, constitutes an interesting and underexplored problem of personalization.
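As a minimal illustration of these three delivery modes, the following sketch (with hypothetical channel stand-ins; a real system would integrate with mail gateways, notification services and page-rendering components) routes a prepared offering through a push, pull or passive channel.

    from dataclasses import dataclass, field
    from typing import List, Tuple

    @dataclass
    class Consumer:
        email: str
        notifications: List[str] = field(default_factory=list)   # seen only when the consumer asks for them

    @dataclass
    class Session:
        page_widgets: List[Tuple[str, str]] = field(default_factory=list)   # rendered on the current page

    def send_email(address: str, body: str) -> None:
        print(f"push: e-mailing {address}: {body}")   # stand-in for a real mail gateway

    def deliver(offering: str, consumer: Consumer, mode: str, session: Session = None) -> None:
        """Route a personalized offering through a push, pull, or passive channel."""
        if mode == "push":
            send_email(consumer.email, offering)                        # reach the consumer outside any interaction
        elif mode == "pull":
            consumer.notifications.append("Personalized offers are waiting for you")
        elif mode == "passive":
            session.page_widgets.append(("related_items", offering))    # shown as a by-product of browsing
        else:
            raise ValueError(f"unknown delivery mode: {mode}")

    deliver("20% off diapers", Consumer(email="[email protected]"), "push")

Choosing among these modes, and the timing and form of each delivery, is itself a decision problem that the text identifies as underexplored.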
Stage 5: Measuring Personalization Impact. In this step, it is necessary to evaluate the effectiveness of personalization using various metrics, such as accuracy, consumer LTV, loyalty value and purchasing and consumption experience metrics. The most commonly used metrics for measuring personalization impact are accuracy-related metrics, i.e., they measure how much the consumer liked a specific personalized offering, e.g., how accurate and relevant the recommendation was (Breese et al., 1998; Pazzani, 1999). Although important and most widely used now, accuracy-based metrics are quite simplistic and do not capture more complex and subtle aspects of personalization. Therefore, attempts have been made to develop and study more general aspects of personalization effectiveness by advocating the use of more advanced and comprehensive personalization metrics, such as consumer LTV, loyalty value, purchasing and consumption experience and other metrics based on the return on consumer (Peppers and Rogers, 2004, Chapter 11). However, they constitute only initial steps, and clearly much more work is required to develop better and more feasible ways to measure the impact of personalization.

Stage 6: Adjusting Personalization Strategy. Finally, after the personalization impact is measured, these metrics can be used for possible improvements to each of the other five stages of the personalization process. If we are not satisfied with the measurement results, we need to identify the causes of this poor performance and adjust some of the previously discussed methods associated with the previous five stages of the personalization process based on the feedback loops presented in Fig. 4. In other words, if the performance metrics suggest that the personalization strategy is not performing well, we need to understand whether this happens because of poor data collection, inaccurate consumer profiles, or poorly chosen techniques for matchmaking or content delivery. After identifying possible sources of the problem, it is necessary to fix it through a feedback mechanism. Alternatively, we may determine that the selected performance metric measures wrong indicators that are irrelevant for the personalization application and needs to be replaced with more relevant metric(s). This was called the feedback integration problem in Adomavicius and Tuzhilin (2005a), since it determines how to adjust different stages of the personalization process based on the feedback from the performance measures. For example, assume that a personalization system delivers recommendations of restaurants to consumers and does it poorly, so that the recommendation performance measures described in Stage 5 above remain low. In Stage 6, we need to examine the causes of this poor performance and identify which of the prior stages are responsible for it. For example, poor recommendation results might be due to poorly collected data in Stage 1, such as an incomplete list of restaurants available for recommendation purposes or insufficient information about these restaurants (e.g., absence of information about the chef or absence of consumer reviews and
comments about the restaurants). Alternatively, the data about the consumers may be insufficient and needs to be enhanced. Further, consumer profiles can be poorly constructed in Stage 2 and need to be readjusted or completely rebuilt. For example, it may be the case that we did not include the list of the person's favorite websites or the list of friends in the person's profile, thus cutting off access to the consumers' social network and thus decreasing the quality of recommendations. Finally, we may need to re-examine the selected recommendation algorithm in Stage 3 or consider deploying a different one that can achieve better performance results. All these are examples of how we can adjust the chosen personalization solution in order to achieve better performance results. Note that the feedback integration problem is a recursive one, i.e., even if we are able to identify the underperforming stages of the personalization process, we may still face similar challenges when deciding on the specific adjustments within each stage. For example, if we need to improve the data collection phase of the personalization process, we would have to decide whether we should collect more data, collect different data, or just use better data preprocessing techniques.

If this feedback is properly integrated in the personalization process, the quality of interactions with individual consumers, as measured by the metrics discussed above, should grow over time, resulting in the virtuous cycle of personalization. (The term "virtuous cycle" was conceived in the 1950s; according to www.wordspy.com/words/virtuouscycle.asp, a virtuous cycle is a situation in which improvement in one element of a chain of circumstances leads to improvement in another element, which then leads to further improvement in the original element, and so on.) If this virtuous cycle is achieved, then personalization becomes a powerful process of delivering ever-increasing value to the stakeholders. This virtuous cycle is not only essential for improving the personalized service over time, it is also crucial in order for the personalization system to keep up with the constantly changing environment, e.g., to be able to adjust to changes in the tastes and preferences of individual customers and to changes in product offerings.

The opposite of the virtuous cycle is the process of de-personalization (Adomavicius and Tuzhilin, 2005a). It can occur when the metrics of consumer satisfaction are low from the start or when they are decreasing over time, or when the system cannot adjust in time to the changing environment. In either case, the consumers get so frustrated with the personalization system that they stop using it. The de-personalization effect is largely responsible for the failures of some personalization projects. Therefore, one of the main challenges of personalization is the ability to achieve the virtuous cycle of personalization and not fall into the de-personalization trap.

This completes the description of the personalization process. As was argued in Adomavicius and Tuzhilin (2005a) and Vesanen and Raulas
(2006), it is really important to integrate all the stages of the personalization process into one smooth iterative process to achieve the virtuous cycle of personalization. This issue is addressed in the next section.
6 Integrating the personalization process
As was pointed out above, the various stages of the personalization process described in Section 5 need to be integrated through carefully developed transitions from one stage to another in a tightly coupled manner (Adomavicius and Tuzhilin, 2005a). Without such tight coupling, there will be discontinuity points between various stages of personalization (Vesanen and Raulas, 2006), and this would result in a failure to achieve the virtuous cycle of personalization. Some of the failures of personalization projects in the past are attributed to the lack of this integration. In particular, many companies have developed piecemeal solutions to their personalization initiatives by focusing on individual stages of the personalization process without much thought about how to integrate the different stages into an organic process. For instance, Vesanen and Raulas (2006) present an example of a "discontinuity point" in a large oil and fuel marketing company, where the marketing department of the company owns and manages the company's credit cards. However, the customers' purchasing data is owned and managed by the finance department, which produces credit card bills based on the data. Unfortunately, the finance department does not share purchasing data with the marketing department, thus creating a discontinuity point in the personalization process in that company. This is unfortunate because the marketing department cannot do much in terms of building personalized relationships with the customers without such purchasing data and the customer profiles built from this data. Vesanen and Raulas (2006) also present a case study of a mail-order company in which they identify other discontinuity points in its personalization process. This situation is typical of many personalization projects, since few of them support (a) all six stages of the personalization process presented in Fig. 4, including extensive measurement mechanisms for personalization impacts, (b) feedback loops allowing adjustments of personalization strategies based on this feedback and (c) integration of all the adjacent personalization stages in Fig. 4 to avoid discontinuity points. This is unfortunate because developing good evaluation measures, sound methods for adjusting personalization strategies and proper feedback loops constitutes one of the most important tasks of personalization, and achieving the virtuous cycle of personalization (or falling into the trap of de-personalization) crucially depends on how well these steps are implemented. A successful implementation of the personalization process that achieves the virtuous cycle of personalization needs to deploy
1. viable solutions for each of the six stages of the personalization process, and
2. sound design principles for integrating these six stages into the complete personalization process.

The technologies used in each of the six stages of the process were discussed in Section 5. Integration principles for the personalization process are presented in Adomavicius and Tuzhilin (2005a), where they are classified into data-driven and goal-driven. According to Adomavicius and Tuzhilin (2005a), the currently most widespread method for designing the personalization process is the data-driven (or "forward") method. According to this method, the data is usually collected first (or has already been collected), then consumer profiles are built based on the collected data, then these profiles are used in the matchmaking algorithms, and so on. In contrast to this currently adopted practice, Adomavicius and Tuzhilin (2005a) advocate designing the personalization process backwards, in accordance with the well-known dictum that "you cannot manage what you cannot measure." This means that the design of the personalization process should start with the specification of the measures used for determining the impact of the personalization process. The selected measure(s) should determine what types of personalized offerings should be delivered to consumers. Next, the profiling and matchmaking technologies for delivering these offerings need to be determined, as well as the types of information that should be stored in the consumer profiles and how this information should be organized in the profiles. Finally, the types of relevant data to be collected for building comprehensive profiles of consumers need to be determined. Adomavicius and Tuzhilin (2005a) call this approach goal-driven (as opposed to the aforementioned data-driven approach), because it starts with a predefined set of goal-oriented measures. Adomavicius and Tuzhilin (2005a) argue that the goal-driven approach can realize the virtuous cycle of personalization better than the data-driven approach, because it starts with personalization goals and, therefore, would provide more value to the providers and consumers. However, Adomavicius and Tuzhilin (2005a) also maintain that the goal-driven approach has not been systematically studied before, and therefore this conjecture needs to be rigorously validated by personalization researchers.
7 Future research directions in personalization
Although much work has been done in the field of personalization, as is evidenced by this survey, personalization still remains a young field, and much more research is needed to advance the state-of-the-art in the field. Throughout the chapter, we identified various open problems or discussed possible extensions and new directions for the already studied problems.
Therefore, we will not repeat these observations in this section. Instead, we will summarize the issues that are, in our opinion, currently the most important in the field. We believe that the following topics are among the most important for the advancement of the field:

1. Improving each of the six stages of the personalization process presented in Fig. 4. Although some of these six stages, such as data collection and matchmaking, have been studied more extensively than others, more work is still required to develop a deeper understanding and to improve the performance of personalization systems across all six stages. We believe that the performance measurement and consumer profile building stages are the most underexplored and among the most crucial of the six stages. Therefore, particular emphasis should be placed on advancing our understanding of these stages. Although there has been much recent work on the matchmaking stage, including work on recommender systems, much additional research is still required to advance this crucial stage.

2. As was argued in Adomavicius and Tuzhilin (2005a) and Vesanen and Raulas (2006), integration of the different stages of the personalization process constitutes a very important problem, and little work has been done in this area. In addition to integrating adjacent stages, it is also important to develop viable feedback loop methods, and practically no research exists on this important problem.

3. Developing specific personalization techniques for particular types of offerings. Although the overall personalization framework, as described in this chapter, is applicable to the various types of offerings listed in Section 2, some personalization methods in various stages of the personalization process can vary across different offerings, as was explained in Section 3.2. For example, the techniques for matchmaking of personalized prices can be quite different from personalized search and product recommendation techniques. Therefore, it is necessary to advance the state-of-the-art for each of the offering-specific methods in addition to developing novel offering-independent techniques. Although this is a general problem that is important for the various types of offerings described in Section 2, delivery of targeted communications, including targeted ads, promotions and personalized emails, stands out because of its importance in business. Solutions to this problem have been developed since the mid-1990s, when companies such as DoubleClick and 24/7 introduced targeted ad delivery methods for online advertising. Still, this topic constitutes an interesting and important area of research that has become even more important in recent years due to the advent of search engine marketing and advertising, popularized by sponsored search products provided by Yahoo (Overture) and Google (AdWords).
4. Formalization of the whole personalization process. As stated before in Sections 5 and 6, most of the personalization research has focused only on a few stages of the personalization process, and appropriate formal methods have been developed for those stages. For example, the field of recommender systems has witnessed rich theoretical developments over the past few years (Adomavicius and Tuzhilin, 2005b). Unfortunately, little mathematically rigorous work has been done on formalizing the whole personalization process, including formal definitions of the feedback loop mechanisms. We believe that this work is needed to gain a deeper understanding of personalization and also to be able to abstract particular personalization problems for subsequent theoretical analysis.

5. Understanding how stability (or rather instability) of consumer preferences affects the whole personalization (and customization) process. As discussed in Section 4, one of the fundamental assumptions behind the personalization approach is the stability of consumer preferences and the assumption that past consumer activities can be used to predict their possible future preferences and actions. Since consumer preferences change over time, it is important to understand how these changes affect the delivery of personalized offerings. Simonson (2005) provides several important insights into this problem and outlines possible future research directions. Continuation of this line of work constitutes an important research topic that should be pursued by personalization researchers.

6. Privacy and its relationship to personalization constitutes another important topic of future research. A recent paper by Kobsa (2007) examines the tensions between personalization and privacy and outlines some of the possible approaches for finding the balance between the two.

We believe that these six areas require the immediate attention of personalization researchers. However, as stated before, these are not the only important problems in the personalization field, and numerous other open problems were formulated throughout this chapter. On the basis of this observation, we believe that personalization constitutes a rich area of research that will only grow in its importance over time since, as Eric Schmidt from Google pointed out, we indeed "have the tiger by the tail in that we have this huge phenomenon of personalization" (Schmidt, 2006).
Acknowledgments

The author would like to thank Anindya Ghose from NYU and two anonymous reviewers for their insightful comments that helped to improve the quality of the chapter.
References Adomavicius, G., A. Tuzhilin (2001a). Using data mining methods to build customer profiles. IEEE Computer 34(2), 74–82. Adomavicius, G., A. Tuzhilin (2001b). Expert-driven validation of rule-based user models in personalization applications. Data Mining and Knowledge Discovery 5(1–2), 33–58. Adomavicius, G., A. Tuzhilin (2002). An architecture of e-butler—a consumer-centric online personalization system. International Journal of Computational Intelligence and Applications 2(3), 313–327. Adomavicius, G., A. Tuzhilin (2005a). Personalization technologies: a process-oriented perspective. Communications of the ACM 48(10), 83–90. Adomavicius, G., A. Tuzhilin (2005b). Towards the next generation of recommender systems: a survey of the state-of-the-art and possible extensions. IEEE Transactions on Knowledge and Data Engineering 17(6), 734–749. Adomavicius, G., R. Sankaranarayanan, S. Sen, A. Tuzhilin (2005). Incorporating contextual information in recommender systems using a multidimensional approach. ACM Transactions on Information Systems 23(1), 103–145. Ansari, A., C. Mela (2003). E-customization. Journal of Marketing Research 40(2), 131–146. Ansari, A., S. Essegaier, R. Kohli (2000). Internet recommendations systems. Journal of Marketing Research 37(3), 363–375. Antoniou, G., F. Harmelen (2003). Web ontology language, in: S. Staab, R. Studer (eds.), Handbook on Ontologies in Information Systems. Springer-Verlag, Berlin. Balabanovic, M., Y. Shoham (1997). Fab: content-based, collaborative recommendation. Communications of the ACM 40(3), 66–72. Bennett, J., S. Lanning (2007). The Netflix Prize, in: Proceedings of the KDD Cup and Workshop, San Jose, CA. Billsus, D., M. Pazzani (2000). User modeling for adaptive news access. User Modeling and UserAdapted Interaction 10(2–3), 147–180. Boutilier, C., R. Zemel, B. Marlin (2003). Active collaborative filtering, in: Proceedings of the 19th Conference on Uncertainty in AI, Acapulco, Mexico. Breese, J.S., D. Heckerman, C. Kadie (1998). Empirical analysis of predictive algorithms for collaborative filtering, in: Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, Madison, WI, July 1998. Cadez, I.V., P. Smyth, H. Mannila (2001). Probabilistic modeling of transaction data with applications to profiling, visualization, and prediction, in: Proceedings of the ACM KDD Conference, San Francisco, CA. Chen, Y., C. Narasimhan, Z. Zhang (2001). Individual marketing with imperfect targetability. Marketing Science 20, 23–43. Chen, Y., G. Iyer (2002). Consumer addressability and customized pricing. Marketing Science 21(2), 197–208. Choudhary, V., A. Ghose, T. Mukhopadhyay, U. Rajan (2005). Personalized pricing and quality differentiation. Management Science 51(7), 1120–1130. Communications of the ACM (2000). Special issue on personalization. 43(8). Cortes, C., K. Fisher, D. Pregibon, A. Rogers, F. Smith (2000). Hancock: a language for extracting signatures from data streams. Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Boston, MA. Dewan, R., B. Jing, A. Seidmann (2000). Adoption of internet-based product customization and pricing strategies. Journal of Management Information Systems 17(2), 9–28. Dwyer, F.R. (1989). Customer lifetime valuation to support marketing decision making. Journal of Direct Marketing 3(4), 8–15. Dyche, J. (2002). The CRM Handbook. Addison-Wesley, Boston, MA. Eirinaki, M., M. Vazirgiannis (2003). Web mining for web personalization. 
ACM Transactions on Internet Technologies 3(1), 1–27.
Elmaghraby, W., P. Keskinocak (2003). Dynamic pricing in the presence of inventory considerations: research overview, current practices, and future directions. Management Science 49(10), p. 47. Ghose, A., K. Huang (2006). Personalized Pricing and Quality Design, Working Paper CeDER-06-06, Stern School, New York University. Ghose, A., K. Huang (2007). Personalization in a two dimensional model. Unpublished manuscript. Gilmore, J., B.J. Pine (1997). The four faces of mass customization. Harvard Business Review 75(1), 91–101. Gorgoglione, M., C. Palmisano, A. Tuzhilin (2006). Personalization in context: does context matter when building personalized customer models? IEEE International Conference on Data Mining, Hong Kong. Hagen, P. (1999). Smart personalization. Forrester Report. Hand, D., H. Mannila, P. Smyth (2001). Principles of Data Mining. MIT Press, Cambridge, MA. Haubl, G., K. Murray (2003). Preference construction and persistence in digital marketplaces: the role of electronic recommendation agents. Journal of Consumer Psychology 13(1), 75–91. Hill, W., L. Stead, M. Rosenstein, G. Furnas (1995). Recommending and evaluating choices in a virtual community of use, in: Proceedings of the CHI Conference. IBM Consulting Services. (2006). Cross-Channel Optimization: A Strategic Roadmap for Multichannel Retailers. The Wharton School Publishing. Imhoff, C., L. Loftis, J. Geiger (2001). Building the Customer-Centric Enterprise, Data Warehousing Techniques for Supporting Customer Relationship Management. Wiley, New York, NY. Jain, S., P.K. Kannan (2002). Pricing of information products on online servers: issues, models, and analysis. Management Science 48(9), 1123–1143. Jiang, T., A. Tuzhilin (2006a). Segmenting customers from populations to individuals: does 1-to-1 keep your customers forever? IEEE Transactions on Knowledge and Data Engineering 18(10), 1297–1311. Jiang, T., A. Tuzhilin (2006b). Improving personalization solutions through optimal segmentation of customer bases, in: Proceedings of the IEEE ICDM Conference, Hong Kong. Jiang, T., A. Tuzhilin (2007). Dynamic micro targeting: fitness-based approach to predicting individual preferences, in: Proceedings of the IEEE ICDM Conference, Omaha, NE. Kelleher, K. (2006). Personalize it. Wired Magazine, July. Kemp, T. (2001). Personalization isn’t a product. Internet Week 864, 1–2. Kimball, R. (1996). The Data Warehousing Toolkit. Wiley, New York, NY. Kobsa, A. (2007). Privacy-enhanced personalization. Communications of the ACM 50(8), 24–33. Kotler, P. (2003). Marketing Management. 11th ed. Prentice Hall. Liu, B., A. Tuzhilin (2008). Managing and analyzing large collections of data mining models. Communications of the ACM 51(2), 85–89. Liu, Y., Z.J. Zhang (2006). The benefits of personalized pricing in a channel. Marketing Science 25(1), 97–105. Manavoglu, E., D. Pavlov, C.L. Giles (2003). Probabilistic user behavior models, in: Proceedings of the ICDM Conference, Melbourne, FL. McDonnell, S. (2001). Microsegmentation, ComputerWorld, January 29. Mobasher, B., A. Tuzhilin (2009). Data mining for personalization. Special Issue of the User Modeling and User-Adapted Interaction Journal, in press. Mobasher, B., H. Dai, T. Luo, M. Nakagawa (2002). Discovery and evaluation of aggregate usage profiles for web personalization. Data Mining and Knowledge Discovery 6(1), 61–82. Mobasher, B., R. Cooley, J. Srivastava (2000). Automatic personalization based on web usage mining. Communications of the ACM 43(8), 142–151. Mobasher, B., S. Anand (eds.). (2007). 
Intelligent techniques for web personalization. Special Issue of the ACM Transactions on Internet Technologies 7(4). Montgomery, A., K. Srinivasan (2003). Learning about customers without asking, in: N. Pal, A. Rangaswamy (eds.), The Power of One: Gaining Business Value from Personalization Technologies. Trafford Publishing, Victoria, BC, Canada. Mulvenna, M., S. Anand, A. Buchner (2000). Personalization on the net using web mining. Communications of the ACM 43(8), 122–125. Murthi, B.P., S. Sarkar (2003). The role of the management sciences in research on personalization. Management Science 49(10), 1344–1362.
Nasraoui, O. (2005). World wide web personalization, in: J. Wang (ed.), The Encyclopedia of Data Warehousing and Mining, pp. 1235–1241. Oard, D.W., J. Kim (1998). Implicit feedback for recommender systems, in: Recommender Systems Papers from the 1998 Workshop. AAAI Press, Menlo Park, CA. Padmanabhan, B., Z. Zheng, S. O. Kimbrough (2001). Personalization from incomplete data: what you don’t know can hurt, in: Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA. Pancras, J., K. Sudhir (2007). Optimal marketing strategies for a customer data intermediary. Journal of Marketing Research XLIV(4), 560–578. Pazzani, M. (1999). A framework for collaborative, content-based and demographic filtering. Artificial Intelligence Review 13(5–6), 393–408. Pazzani, M., D. Billsus (1997). Learning and revising user profiles: the identification of interesting web sites. Machine Learning 27, 313–331. Pennock, D.M., E. Horvitz, S. Lawrence, C.L. Giles (2000). Collaborative filtering by personality diagnosis: a hybrid memory- and model-based approach, in: Proceedings of the 16th Conference on Uncertainty in AI, Stanford, CA. Peppers, D., M. Rogers (1993). The One-to-One Future. Doubleday, New York. Peppers, D., M. Rogers (2004). Managing Customer Relationships: A Strategic Framework. Wiley, New York, NY. Peterson, L.A., R.C. Blattberg, P. Wang (1997). Database marketing: past, present, and future. Journal of Direct marketing 11(4), 27–43. Pierrakos, D., G. Paliouras, C. Papatheodorou, C. Spyropoulos (2003). Web usage mining as a tool for personalization: a survey. User Modeling and User-Adapted Interaction 13, 311–372. Pine, J. (1999). Mass Customization: The New Frontier in Business Competition. HBS Press, Cambridge, MA. Qiu, F., J. Cho (2006). Automatic identification of USER interest for personalized search, in: Proceedings of the WWW Conference, May, Edinburgh, Scotland. Rangaswamy, A., J. Anchel (2003). From many to one: personalized product fulfillment systems, in: N. Pal, A. Rangaswamy (eds.), The Power of One: Gaining Business Value from Personalization Technologies. Trafford Publishing, Victoria, BC, Canada. Rao, B., L. Minakakis (2003). Evolution of mobile location-based services. Communications of the ACM 46(12), 61–65. Rashid, A.M., I. Albert, D. Cosley, S.K. Lam, S.M. McNee, J.A. Konstan, J. Riedl (2002). Getting to know you: learning new user preferences in recommender systems, in: Proceedings of the International Conference on Intelligent User Interfaces, Gran Canaria, Canary Islands, Spain. Reed, O. (1949). Some Random Thoughts . . . On Personalizing, The Reporter of Direct Mail Advertising, April. Resnick, P., N. Iakovou, M. Sushak, P. Bergstrom, J. Riedl (1994). GroupLens: an open architecture for collaborative filtering of netnews, in: Proceedings of the 1994 Computer Supported Cooperative Work Conference. Riecken, D. (2000). Personalized views of personalization. Communications of the ACM 43(8), 26–28. Ross, N. (1992). A history of direct marketing. Unpublished paper, Direct Marketing Association. Rossi, P.E., R.E., McCulloch, G.M. Allenby (1996). The value of purchase history data in target marketing. Marketing Science 15, 321–340. Schafer, J.B., J.A. Konstan, J. Riedl (2001). E-commerce recommendation applications. Data Mining and Knowledge Discovery 5(1/2), 115–153. Schmidt, E. (2006). ‘‘Succeed with Simplicity’’ (interview with Eric Schmidt of Google). Business 2.0 7(11), p. 86. Shaffer, G., Z. Zhang (2002). 
Competitive one-to-one promotions. Management Science 48(9), 1143–1160. Shardanand, U., P. Maes (1995). Social information filtering: algorithms for automating ‘word of mouth’, in: Proceedings of the Conference on Human Factors in Computing Systems.
Sheth, A., C. Bertram, D. Avant, B. Hammond, K. Kochut, Y. Warke (2002). Semantic Content Management for Enterprises and the Web, IEEE Computing, July/August. Simonson, I. (2005). Determinants of customers’ responses to customized offers: conceptual framework and research propositions. Journal of Marketing 69, 32–45. Smith, D. (2000). There are myriad ways to get personal. Internet Week, May 15. Spiliopoulou, M. (2000). Web usage mining for web site evaluation: making a site better fit its users. Communications of the ACM 43(8), 127–134. Srivastava, J., R. Cooley, M. Deshpande, P.-N. Tan (2000). Web usage mining: discovery and applications of usage patterns from web data. SIGKDD Explorations 1(2), 12–23. Staab, S., R. Studer (2003). Handbook on Ontologies in Information Systems. Springer-Verlag, Berlin. Surprenant, C., M.R. Solomon (1987). Predictability and personalization in the service encounter. Journal of Marketing 51, 86–96. Syam, N., R. Ruan, J. Hess (2005). Customized products: a competitive analysis. Marketing Science 24(4), 569–584. Tseng, M.M., J. Jiao (2001). Mass customization, in: Handbook of Industrial Engineering, Technology and Operation Management, 3rd ed. Wiley, New York, NY. Tsoi, A., M. Hagenbuchner, F. Scarselli (2006). Computing customized page ranks. ACM Transactions on Internet Technology 6(4), 381–414. Ulph, D., N. Vulkan (2001). E-commerce, mass customisation and price discrimination, Working Paper, Said Business School, Oxford University. Vesanen, J., M. Raulas (2006). Building bridges for personalization: a process model for marketing. Journal of Interactive Marketing 20(1), 5–20. Wedel, M., W. Kamakura (2000). Market segmentation: conceptual and methodological foundations. 2nd ed. Kluwer Publishers, Dordrecht, Boston. Wu, D., I. Im, M. Tremaine, K. Instone, M. Turoff (2003). A framework for classifying personalization scheme used on e-commerce websites, in: Proceedings of the HICSS Conference, Big Island, HI, USA. Yang, Y., B. Padmanabhan (2005). Evaluation of online personalization systems: a survey of evaluation schemes and a knowledge-based approach. Journal of Electronic Commerce Research 6(2), 112–122. Ying, Y., F. Feinberg, M. Wedel (2006). Leveraging missing ratings to improve online recommendation systems. Journal of Marketing Research 43(3), 355–365. Yu, K., A. Schwaighofer, V. Tresp, X. Xu, H.-P. Kriegel (2004). Probabilistic memory-based collaborative filtering. IEEE Transactions on Knowledge and Data Engineering 16(1), 56–69. Zipkin, P. (2001). The limits of mass customization. MIT Sloan Management Review 42(3), 81–87.
Chapter 2
Web Mining for Business Computing
Prasanna Desikan, Colin DeLong, Sandeep Mane, Kalyan Beemanapalli, Kuo-Wei Hsu, Prasad Sriram, Jaideep Srivastava, Woong-Kee Loh and Vamsee Venuturumilli Department of Computer Science and Engineering, 200 Union Street SE, Room 4-192, University of Minnesota, Minneapolis, MN 55455, USA
Abstract

Over the past decade, there has been a paradigm shift in business computing, with the emphasis moving from data collection and warehousing to knowledge extraction. Central to this shift has been the explosive growth of the World Wide Web, which has enabled myriad technologies, including online stores, Web services, blogs, and social networking websites. As the number of online competitors, as well as consumer demand for personalization, has increased, new techniques for large-scale knowledge extraction from the Web have been developed. A popular and successful suite of techniques which has shown much promise is "Web mining." Web mining is essentially data mining for Web data, enabling businesses to turn their vast repositories of transactional and website usage data into actionable knowledge that is useful at every level of the enterprise, not just the front-end of an online store. This chapter provides an introduction to the field of Web mining and examines existing and potential Web mining applications for several business functions, such as marketing, human resources, and fiscal administration. Suggestions for improving information technology infrastructure are given, which can help businesses interested in Web mining begin implementing projects quickly.
1 Introduction
The Internet has changed the rules for today's businesses, which now increasingly face the challenge of sustaining and improving performance throughout the enterprise. The growth of the World Wide Web and
enabling technologies has made data collection, data exchange, and information exchange easier and has sped up most major business functions. Delays in retail, manufacturing, shipping, and customer service processes are no longer accepted as necessary evils, and firms improving upon these (and other) critical functions have an edge in their battle at the margins. Technology has been brought to bear on myriad business processes and has effected massive change in the form of automation, tracking, and communications, but many of the most profound changes are yet to come. Leaps in computational power have enabled businesses to collect and process large amounts of data of different kinds. The availability of data and the necessary computational resources, together with the potential of data mining, shows great promise for having a transformational effect on the way businesses perform their work. Well-known successes of companies such as Amazon.com have provided evidence to that end. By leveraging large repositories of data collected by corporations, data mining techniques and methods offer unprecedented opportunities for understanding business processes and predicting future behavior. With the Web serving as the realm of many of today's businesses, firms can improve their ability to know when and what customers want by understanding customer behavior, find bottlenecks in internal processes, and better anticipate industry trends. Companies such as Amazon, Google, and Yahoo have been top performers in B2C commerce because of their ability to understand consumers and communicate with them effectively. This chapter examines past success stories, current efforts, and future directions of Web mining as it applies to business computing. Examples are given for several different business aspects, such as product recommendations, fraud detection, process mining, and inventory management, showing how the use of Web mining can enable revenue growth, cost minimization, and enhancement of strategic vision. Gaps in existing technology are also elaborated on, along with pointers to future directions.
2 Web mining
Web mining is the application of data mining techniques to extract knowledge from Web data, including Web documents, hyperlinks between documents, and usage logs of websites. A panel organized at ICTAI 1997 (Srivastava and Mobasher, 1997) asked the question "Is there anything distinct about Web mining (compared to data mining in general)?" While no definitive conclusions were reached then, the tremendous attention paid to Web mining in the past decade and the number of significant ideas that have been developed have answered this question in the affirmative. In addition, a fairly stable community of researchers interested in the area has formed through successful workshop series such as WebKDD
(held annually in conjunction with the ACM SIGKDD Conference) and Web Analytics (held in conjunction with the SIAM Data Mining Conference). Many informative surveys exist in the literature that address various aspects of Web mining (Cooley et al., 1997; Kosala and Blockeel, 2000; Mobasher, 2005). Two different approaches have been taken in defining Web mining. The first was a "process-centric view," which defined Web mining as a sequence of tasks (Etzioni, 1996). The second was a "data-centric view," which defined Web mining in terms of the types of Web data that were being used in the mining process (Cooley et al., 1997). The second definition has become the more accepted one, as is evident from the approach adopted in most recent papers that have addressed the issue. In this chapter, we use the data-centric view of Web mining, which is defined as follows:

Web mining is the application of data mining techniques to extract knowledge from Web data, i.e., Web Content, Web Structure and Web Usage data.
The attention paid to Web mining in research, the software industry, and Web-based organizations has led to the accumulation of a great deal of experience. Its application in business computing has also found tremendous utility. In the following sub-sections, we describe the taxonomy of Web mining research, discuss the applicability of Web mining to business computing, and point out some key aspects of Web mining that make it different from traditional data mining. First, in Section 2.1, we present the different kinds of Web data that can be captured and classify the area of Web mining according to the kinds of data collected. This classification is natural, since the techniques adopted for each kind of data are more or less unique to extracting knowledge from that kind of data. Second, Web data, by its unique nature, has also led to novel problems that could not be addressed by earlier data mining techniques, because the enabling infrastructure for collecting such data, namely the Web, did not exist. Typical examples include user-session identification, robot identification, online recommendations, etc. In Section 2.2, we present an overview of Web mining techniques and relevant pointers to the state-of-the-art in the literature. Some of the techniques developed have been exclusive to Web mining because of the nature of the data collected.

2.1 Data-centric Web mining taxonomy

Web mining can be broadly divided into three distinct categories according to the kinds of data to be mined. We provide a brief overview of the three categories below; an illustration depicting the taxonomy is shown in Fig. 1.
Fig. 1. Web mining taxonomy.

Web content mining. Web content mining is the process of extracting useful information from the contents of Web documents. Content data
corresponds to the collection of information on a Web page that is conveyed to users. It may consist of text, images, audio, video, or structured records such as lists and tables. Application of text mining to Web content has been the most widely researched. Issues addressed in text mining include topic discovery, extracting association patterns, clustering of Web documents, and classification of Web pages. Research activities on this topic have drawn heavily on techniques developed in other disciplines such as Information Retrieval (IR) and Natural Language Processing (NLP). While a significant body of work on extracting knowledge from images exists in the fields of image processing and computer vision, the application of these techniques to Web content mining has been limited.

Web structure mining. Web structure mining is the process of discovering structure information from the Web. The structure of a typical Web graph consists of Web pages as nodes and hyperlinks as edges connecting related pages. Web structure mining can be further divided into two kinds based on the type of structural information used.

Hyperlinks: A hyperlink is a structural unit that connects a location in a Web page to a different location, either within the same Web page or on a different Web page. A hyperlink that connects to a different part of the same page is called an Intra-Document Hyperlink, and a hyperlink that connects two different pages is called an Inter-Document Hyperlink. There has been a significant body of work on hyperlink analysis (see the survey on hyperlink analysis by Desikan et al., 2002).
Document structure: The content within a Web page can also be organized in a tree-structured format, based on the various HTML and XML tags within the page. Here, mining efforts have focused on automatically extracting document object model (DOM) structures out of documents.

Web usage mining. Web usage mining is the application of data mining techniques to discover interesting usage patterns from Web data, in order to understand and better serve the needs of Web-based applications. Usage data captures the identity or origin of Web users along with their browsing behavior at a website. Web usage mining itself is further classified depending on the kind of usage data used:

Web server data: The user logs are collected by the Web server. Typical data includes IP address, page reference, and access time.

Application server data: Commercial application servers, for example WebLogic, have significant features in the framework to enable e-commerce applications to be built on top of them with little effort. A key feature is the ability to track various kinds of business events and log them in application server logs.

Application level data: New kinds of events can always be defined in an application, and logging can be turned on for them, generating histories of these specially defined events.
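To make the notion of Web server usage data concrete, the following sketch (assuming a simplified log format with IP address, timestamp and requested page; real server log formats vary) parses log records and groups them into user sessions using a 30-minute inactivity timeout, a commonly used sessionization heuristic.

    from collections import defaultdict
    from datetime import datetime, timedelta

    # Simplified log records: IP address, timestamp, requested page.
    raw_log = [
        "192.0.2.1 2009-01-05T10:00:00 /index.html",
        "192.0.2.1 2009-01-05T10:05:00 /products/shoes.html",
        "192.0.2.1 2009-01-05T11:30:00 /index.html",        # more than 30 minutes later: a new session
        "198.51.100.7 2009-01-05T10:01:00 /cart.html",
    ]

    def parse(line):
        ip, ts, page = line.split()
        return ip, datetime.fromisoformat(ts), page

    def sessionize(lines, timeout=timedelta(minutes=30)):
        """Group page requests per IP address into sessions separated by periods of inactivity."""
        sessions_by_user = defaultdict(list)
        for ip, ts, page in sorted(map(parse, lines)):
            sessions = sessions_by_user[ip]
            if sessions and ts - sessions[-1][-1][0] <= timeout:
                sessions[-1].append((ts, page))     # continue the current session
            else:
                sessions.append([(ts, page)])       # start a new session
        return sessions_by_user

    for ip, sessions in sessionize(raw_log).items():
        print(ip, [[page for _, page in s] for s in sessions])

The resulting sessions are the basic unit of analysis for most Web usage mining tasks, such as user navigation pattern discovery and online recommendation.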
2.2 Web mining techniques—state-of-the-art

Enabling technologies such as the Internet have not only generated new kinds of data, such as Web data, but have also generated a new class of techniques associated with this kind of data and with applications built on this platform. For example, the ease of obtaining user feedback resulted in new data and new techniques for collaborative filtering. The relationship between the content and the structure of the Web has itself led to a new class of relevance ranking measures. Web usage data collection has given rise to new kinds of problems and techniques to address them, such as user-session identification, user identification, spam detection, etc. Thus, the Web has not only generated new kinds of data but has also opened up a series of new problems, different from those addressed by traditional data mining approaches, that can be tackled given the availability of such data and its applications on the Web. In the following paragraphs we discuss the state-of-the-art in Web mining research. Web mining techniques have also adopted significant ideas from the field of information retrieval. However, our focus in this chapter is restricted to core Web mining techniques, and we do not delve into the large area of information retrieval.

The interest of the research community and the rapid growth of work in this area have resulted in significant research contributions, which have been summarized in a number of surveys and book chapters over the past few years (Cooley et al., 1997; Kosala and Blockeel, 2000; Srivastava et al., 2004). Research on Web content mining has focused on issues such as extracting information from structured and unstructured data and integrating information from various sources of content. Earlier work on Web content mining can be found in Kosala's work (Kosala and Blockeel, 2000). Web content mining has found utility in a variety of applications such as Web page categorization and topic distillation. A special issue on Web content mining (Liu and Chang, 2004) captures the recent issues addressed by the research community in the area.

Web structure mining has focused primarily on hyperlink analysis. A survey of hyperlink analysis techniques and a methodology for pursuing research in the area have been proposed by Desikan et al. (2002). Most of these techniques can be used independently or in conjunction with techniques based on Web content and Web usage. The most popular application is the ranking of Web pages. PageRank (Page et al., 1998), developed by the founders of Google, is a popular metric for ranking the importance of hypertext documents for Web search. The key idea in PageRank is that a page has a high rank if many highly ranked pages point to it, and hence the rank of a page depends upon the ranks of the pages pointing to it. Another popular measure is hub and authority scores. The underlying model for computing these scores is a bipartite graph (Kleinberg, 1998). The Web pages are modeled as "fans" and "centers" of a bipartite core, where a "fan" is regarded as a hub page and a "center" as an authority page. For a given query, a set of relevant pages is retrieved, and a hub score and an authority score are computed for each page in this set (Kleinberg, 1998).

Web usage data is the key to understanding the user's perspective of the Web, while content and structure reflect the creator's perspective. Understanding user profiles and user navigation patterns for building better adaptive websites and predicting user access patterns has evoked interest from both the research and the business communities. The primary step in Web usage mining is pre-processing the user log data, e.g., separating Web page references into those made for navigational purposes and those made for content purposes (Cooley et al., 1999). The concept of the adaptive Web was introduced by researchers from the University of Washington, Seattle (Perkowitz and Etzioni, 1997). Markov models have been the most popular class of techniques for predicting user behavior (Pirolli and Pitkow, 1999; Sarukkai, 1999; Zhu et al., 2002). More detailed information about various aspects of Web usage mining techniques can be found in a recent extensive survey on this topic (Mobasher, 2005).
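The intuition behind PageRank described above can be captured in a few lines of code. The following sketch is a simplified power-iteration implementation on a toy link graph (ignoring refinements such as dangling-node handling used in practice); a page ends up with a high rank when highly ranked pages point to it.

    def pagerank(links, damping=0.85, iterations=50):
        """links maps each page to the list of pages it links to."""
        pages = list(links)
        rank = {p: 1.0 / len(pages) for p in pages}
        for _ in range(iterations):
            new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
            for p, outlinks in links.items():
                if not outlinks:
                    continue                          # real implementations redistribute this mass
                share = damping * rank[p] / len(outlinks)
                for q in outlinks:
                    new_rank[q] += share              # a page inherits rank from the pages that point to it
            rank = new_rank
        return rank

    # Toy Web graph: both A and C link to B, so B ends up with the highest rank.
    toy_graph = {"A": ["B"], "B": ["C"], "C": ["A", "B"]}
    print(pagerank(toy_graph))

Hub and authority scores are computed in a similar iterative fashion, except that two mutually reinforcing scores are maintained per page rather than one.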
3 How Web mining can enhance major business functions
This section discusses existing and potential efforts in the application of Web mining techniques to the major functional areas of businesses. Some
examples of deployed systems, as well as frameworks for emerging applications yet to be built, are discussed. It should be noted that the examples should not be regarded as solutions to all problems within the business function area they are cited in. Their purpose is to illustrate that Web mining techniques have been applied successfully to handle certain kinds of problems, providing evidence of their utility. Table 1 summarizes how Web mining techniques have been successfully applied to address various issues that arise in business functions.

Table 1
Summary of how Web mining techniques are applicable to different business functions

Area         Function                      Application                                 Technique
Sales        Product marketing             Product recommendations                    Association rules
Sales        Consumer marketing            Product trends                              Time series data mining
Sales        Customer service              Expert-driven recommendations               Association rules, text mining, link analysis
Purchasing   Shipping and inventory        Inventory management                        Clustering, association rules, forecasting
Operations   Human resources               HR call centers                             Sequence similarities, clustering, association rules
Operations   Sales management              Sales leads identification and assignment   Multi-stage supervised learning
Operations   Fiscal management             Fraud detection                             Link mining
Operations   Information technology        Developer duplication reduction             Clustering, text mining
Operations   Business process management   Process mining                              Clustering, association rules
3.1 Sales

3.1.1 Product recommendations
Recommending products to customers is a key issue for all businesses. Currently, traditional brick-and-mortar stores have to rely on data collected explicitly from customers through surveys to offer customer-centric recommendations. However, the advent of e-commerce not only enables a level of personalization in customer-to-store interaction that is far greater than imaginable in the physical world, but also leads to unprecedented levels of data collection, especially about the "process of shopping." The desire to understand individual customer shopping behavior and psychology in detail through data mining has led to significant advances in online customer-relationship management (e-CRM),
Fig. 2. NetFlix.com—an example of product recommendation using Web usage mining.
as well as providing services such as real-time recommendations. A recent survey (Adomavicius and Tuzhilin, 2005) provides an excellent taxonomy of the various techniques that have been developed for online recommendations. NetFlix.com is a good example of how an online store uses Web mining techniques for recommending products, such as movies, to customers based on their past rental profile and movie ratings, together with the profiles of users who have similar movie rating and renting patterns. As shown in Fig. 2, Netflix uses a collaborative-filtering-based recommendation system called Cinematch (Bennet, 2007) that analyzes the movie ratings given by users to make personalized recommendations based on their profiles. Knowledge gained from Web data is the key driver of Netflix features such as favorite genres, recommendations based on movies rated earlier by users, and recommendations based on information shared with friends who are part of their social network. Other companies such as Amazon.com use a host of Web mining techniques, such as associations between pages visited and click-path analysis, to improve the customer's experience and provide recommendations during a "store visit." Techniques for the automatic generation of personalized product recommendations (Mobasher et al., 2000) form the basis of most Web-mining-based recommendation models.
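As a minimal illustration of the "associations between pages visited" idea, the following sketch (using simple co-occurrence counts over made-up sessions rather than a full association-rule miner) recommends, for the page a customer is viewing, the pages most frequently co-viewed with it in past store visits.

    from collections import Counter
    from itertools import combinations

    # Toy clickstream data: each session is the set of product pages visited in one store visit.
    sessions = [
        {"tent", "sleeping-bag", "lantern"},
        {"tent", "sleeping-bag"},
        {"tent", "hiking-boots"},
        {"sleeping-bag", "lantern"},
    ]

    # Count how often each pair of pages is viewed together in the same session.
    pair_counts = Counter()
    for s in sessions:
        for a, b in combinations(sorted(s), 2):
            pair_counts[(a, b)] += 1

    def recommend(page, top_n=3):
        """Pages most frequently co-viewed with the given page, most frequent first."""
        scores = Counter()
        for (a, b), n in pair_counts.items():
            if page == a:
                scores[b] += n
            elif page == b:
                scores[a] += n
        return [p for p, _ in scores.most_common(top_n)]

    print(recommend("tent"))   # 'sleeping-bag' comes first: it is co-viewed with 'tent' most often

A full association-rule approach would additionally filter candidate pairs by support and confidence thresholds before using them for recommendation.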
Businesses would definitely like to see such projections of the future, especially identifying new product areas based on emerging trends—key for
any business to build market share. Prediction using trend analysis for a new product typically addresses two kinds of issues: first, the potential market for a particular product and second, that a single product may result in a platform to develop a class of products having potentially high market value. Different methods have been implemented for such prediction purposes. Among the popular approaches are surveying techniques and time-series forecasting techniques. Traditionally, sufficient data collection was a major hurdle in the application of such techniques. However, with the advent of the Web, the task of filling out forms and recording results has been reduced to a series of clicks. This enabling technology has caused a huge shift in the amount and types of data collected, especially in regards to understanding customer behavior. For example, applying Web mining to data collected from online community interactions provides a very good understanding of how such communities are defined, which can then used for targeted marketing through advertisements and e-mail solicitation. A good example is AOL’s concept of ‘‘community sponsorship,’’ whereby an organization, Nike, for instance, may sponsor a community called ‘‘Young Athletic TwentySomethings.’’ In return, consumer survey and new product development experts of the sponsoring organization are able to participate in that community, perhaps without the knowledge of other participants. The idea is to treat the community as a highly specialized focus group, understand its needs and opinions towards existing and new products, and to test strategies for influencing opinions. New product sales can also be modeled using other techniques, such as co-integration analysis (Franses, 1994). The second most popular technique is time-series analysis. Box and Jenkins give an excellent account of various time series analysis and forecasting techniques in their book (Box and Jenkins, 1994). It has been also shown how time-series analysis can be used for decision-making in business administration (Arsham, 2006). These techniques have broad applicability and can be used for predicting trends for potential products. While most of these techniques have been based on statistical approaches, recent work have shown the data mining can be successfully used to discover patterns of interest in time-series data. Keogh (Keogh, 2004) provides a good overview of data mining techniques in time-series analysis, most of which can also be applied to the Web data. With the growth of Web search and keyword search-based ad placement, query words have assumed a great deal of significance in the advertising world. These query words represent topics or products popular among users. Search engines have been increasingly focusing on analyzing trends in these query words, as well as their click-through rates, for improving queryrelated ad delivery. Fig. 3 gives an example of how keywords can be analyzed for trends. It depicts the trends in keywords ‘‘Web Mining’’ and ‘‘Business Computing.’’ For example, a possible conclusion seems to be that since the two keywords have correlated search volume as of late,
3.1.3 Expert-driven recommendations for customer assistance
Most recommender systems used in business today are product-focused, where the recommendations made to a customer are typically a function of his/her interests in products (based on his/her browsing history) and those of other, similar customers. However, in many cases, recommendations must be made without knowledge of a customer's preferences, such as in customer service call centers. In such cases, call center employees leverage their domain knowledge in order to help align customer inquiries with appropriate answers. Here, the customer's framing of a question may itself be wrong, as is often observed when non-experts pose questions to domain experts. Many businesses must maintain large customer service call centers, especially in retail-based operations, in order to address this need. However, advances in Web-based recommender systems may make it possible to improve call center capacity by offering expert-based recommendations online (DeLong et al., 2005). Fig. 4 gives an overview of expert-driven customer assistance recommendations.
Fig. 4. Overview of expert-driven customer assistance recommendations: the customer's browsing sequence and the site's Web usage logs are matched against a concept-page graph (page_id, topic_id pairs) by the recommender system.
Similar to a customer talking to a call center assistant, the recommendation system treats customer browsing behavior as a series of "questions"
the customer wants answered or, more generally, as expressions of interest in the topic matter of a clicked-on Web page. Given the order and topic matter covered by such sequences of clicks, the recommendation system continuously refines its recommendations, which are not themselves directly a function of customer interest. Rather, they are generated by querying an abstract representation of the customer service website, called a "concept-page graph." This graph contains a ranked set of topic/Web page combinations, and as the customer clicks through the website, the system looks for the Web pages that best capture the topics the customer is seeking to know more about. And since the customer's browsing behavior helps determine the questions he/she wants answered, the eventual recommendations are more likely to lead to the correct answer to the question, rather than to a potentially misleading one based on interest alone.
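The concept-page graph idea can be illustrated with a minimal sketch. The page names, topic labels, and recency weighting below are illustrative assumptions, not the actual representation used in the cited systems.

```python
# Toy concept-page graph recommender: score unvisited pages by how well their
# topics match the topics of the customer's recently clicked pages.
concept_page_graph = {            # page -> topics it covers (illustrative)
    "faq_returns":   {"returns", "refunds"},
    "refund_policy": {"refunds", "payments"},
    "refund_status": {"refunds", "tracking"},
    "shipping_info": {"shipping", "tracking"},
}

def recommend(browsing_sequence, top_n=2):
    """Later clicks count more: the i-th click gets weight (i + 1)."""
    interest = {}
    for i, page in enumerate(browsing_sequence):
        for topic in concept_page_graph.get(page, ()):
            interest[topic] = interest.get(topic, 0) + (i + 1)

    scores = {}
    for page, topics in concept_page_graph.items():
        if page in browsing_sequence:
            continue                       # only recommend pages not yet seen
        scores[page] = sum(interest.get(t, 0) for t in topics)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(recommend(["faq_returns", "refund_policy"]))
# -> 'refund_status' ranks first because it best matches the inferred "questions"
```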
3.2 Purchasing
3.2.1 Predictive inventory management
A significant cost for a business that sells large quantities of products is the maintenance of an inventory management system to support sales. The goal of inventory management is to keep inventory acquisition and maintenance costs low while simultaneously maximizing customer satisfaction through product availability. As such, inventory management systems must keep track of customer product demand through sales trend and market analysis. By analyzing transaction data for purchasing trends, inventory needs can be addressed in a pre-emptive fashion, improving efficiency by enabling "just-in-time" inventory. As the Internet has permeated business computing, the task of browsing and purchasing products has been reduced to a series of clicks. This has made shopping extremely simple for customers and has lowered the barrier for businesses to obtain detailed customer feedback and shopping behavior
data. And though Web mining techniques have provided an excellent framework for personalized shopping, as well as improved direct marketing and advertisement, they can also aid companies in understanding customer access and purchasing patterns for inventory management at a very detailed level. Web mining techniques can improve inventory management in a variety of ways. First, using techniques such as Web content mining to search the Web, including competitors' websites, businesses can discover new or alternate vendors and third-party manufacturers. Second, trend analysis using Web usage mining can yield valuable information about potentially latent relationships between products, helping gauge demand for one or more products based on the sales of others. Taken together, these capabilities allow inventory gaps (demand for products that are not yet stocked) to be identified and filled at levels corresponding to the demand estimated from product-relationship analysis. Amazon.com, again, is a great example. As it became one of the largest B2C websites, its ever-increasing customer base and product breadth (and depth) demanded an efficient inventory management system. As such, Amazon.com adopted advanced Web mining techniques to manage and plan material resource availability. These techniques have enabled Amazon.com to decrease the costs incurred from idle stock maintenance and consequently to increase product choice for customers, greatly increasing Amazon's revenue. By taking advantage of Web usage mining techniques and applying them to website usage data, transaction data, and external website data, other companies can reap the benefits of such predictive inventory management.
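As a minimal illustration of the product-relationship analysis described above, the following sketch computes pairwise co-occurrence and lift from a handful of made-up transactions; a real deployment would use far larger transaction volumes and a full association-rule miner.

```python
# Toy product-relationship analysis: pairwise co-occurrence and lift.
from itertools import combinations
from collections import Counter

transactions = [                      # illustrative market-basket data
    {"laptop", "mouse", "sleeve"},
    {"laptop", "mouse"},
    {"laptop", "sleeve"},
    {"mouse", "keyboard"},
    {"laptop", "mouse", "keyboard"},
]

n = len(transactions)
item_counts = Counter(i for t in transactions for i in t)
pair_counts = Counter(frozenset(p) for t in transactions
                      for p in combinations(sorted(t), 2))

for pair, count in pair_counts.most_common():
    a, b = tuple(pair)
    support = count / n
    lift = support / ((item_counts[a] / n) * (item_counts[b] / n))
    print(f"{a} & {b}: support={support:.2f}, lift={lift:.2f}")
# Pairs with lift well above 1 suggest that demand for one product can be
# gauged from sales of the other, informing stocking levels.
```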
3.3 Operations
3.3.1 Human resources call centers
Human resource (HR) departments of large companies are faced with answering many policy, benefits, and payroll related questions from employees. As a company grows, the task becomes more difficult: HR must not only handle a larger number of employees, but also consider other issues such as geographically local policies. An often-used approach to handling this problem is to have "call centers," where human representatives provide expert assistance to employees through telephone conversations. Due in part to the cost associated with call centers, many companies have also published all relevant policy and process information to their corporate intranet websites for easy perusal by employees. However, faced with such large repositories of detailed content, many employees still tend to seek the advice of human representatives at call centers to sort through their policy and procedure questions more quickly, resulting in call center escalation.
A recent study (Bose et al., 2006) has shown promise in applying Web mining techniques, as well as gene sequence similarity approaches from bioinformatics, to the problem of providing knowledge assistance in cases such as HR call center escalation. As a result, a Web recommendation system was developed to help employees navigate HR websites by reducing the number of clicks needed to locate answers to their questions. This was done by coupling the conceptual and structural characteristics of the website such that relevant pages for a particular topic (e.g., 401(k) plans, retirement, etc.) could be determined. In the study, the conceptual characteristics are represented by the logical organization of the website as designated by the website administrator, while the structural characteristics provide a navigational path starting from a given Web page. By using this information, expert knowledge can be incorporated into website usage analysis, which recent studies (DeLong et al., 2005) have shown plays an important role in improving the predictive power of recommendation engines. Figure 5 gives an example of an employee benefits website, with a sample of recommendations provided to a user looking for information related to a 401(k) plan.
Fig. 5. Recommendations provided to the user of an employee benefits website.
3.3.2 Sales leads identification and assignment
To quote Jeff Thull, a leading sales and marketing strategist:
Accepting that 20% of your salespeople bring in 80% of your revenue is like accepting that 80% of your manufacturing machines are, on the average, producing one-fourth of your most productive machines.
In many businesses, an effective way of countering this problem is to develop a process (or system) that allows sales managers to learn from past performance and track their current performance, both qualitatively and quantitatively. With the Internet, a number of Web-based businesses have emerged to enable customer-relationship management (CRM), a related approach used to collect, store, and analyze customer information. For example, Salesforce.com (http://www.salesforce.com) offers Web-based infrastructure for CRM to companies. Web-based approaches enable easy customization and integration of different application tools, as well as location-independent access to the CRM data through a Web-based interface. Macroscopic and microscopic (detailed) views and spatio-temporal partitions of information are also possible. Further, real-time dashboards allow easy visualization of various parameters and their statistical properties (e.g., means, medians, standard deviations). Commonly, tools for more sophisticated statistical analysis of the data are also available in such applications. However, much of the burden of analyzing and interpreting such data lies with the sales managers. Given the number of types of data collected and the range of possible analysis techniques, it becomes difficult for a sales manager to apply all possible techniques and search for interesting results. As such, many sales managers will use only a few parameters for analysis on a daily basis. Additionally, the learning curve for this process is slow due to the manual effort required to learn which parameters are important. Effective analysis is, therefore, made that much more difficult and cumbersome by limited analytical resources and constraints on the sales manager's time, which can result in errors and the inability to properly forecast emerging leads. In sales leads identification, traditional sources of information, such as phone calls, are supplemented by Web data, providing an additional means of collecting information about buyers. For example, eBay (http://www.ebay.com/) creates behavior profiles of the buyers (and sellers) of products on its Web portal. Web usage data such as the products bought and browsed by buyers provide critical sales lead information. For similar portals, a buyer can be a group of individuals, a small business, or a large organization, depending upon the portal's type (e.g., customer-to-customer, customer-to-business, or business-to-business e-commerce). Web content data from these external websites can be analyzed using Web content mining to help identify new customers and understand their requirements. Competitors' websites can be used to learn about their past performance and customers, as well as to help identify competitors in different geographical regions. This rich information about buyers can be harnessed by an enterprise to determine the most profitable markets, find new sales leads, and align them with the business's offerings. Here, Web mining approaches can play an important role in identifying connections between various customers, analyzing them, and understanding their various business impacts.
3.3.3 Fraud analysis
The Internet has dramatically changed the ways in which businesses sell products. There are now many well-established Internet sites for e-commerce, and huge numbers of items have been bought and sold online. Meanwhile, fraudulent attempts to unjustly obtain property on websites have also been increasing. Although a great deal of effort has been expended in investigating and preventing Internet fraud, criminals have shown they are capable of quickly adapting to existing defensive methods and continue to create more sophisticated ways of perpetrating fraud. Some Internet fraud, such as shilling, also exists in offline commerce. However, the frequency of such fraud has dramatically increased in online e-commerce applications because it is easy to implement in an online setting. While some fraudulent activities are ignored when detected, others are more serious, involve large sums of lost money and property, and can result in lawsuits brought by their victims. Much Internet-based fraud is perpetrated in a cooperative manner among multiple associates. For example, in online auction shilling, fake customers (who are actually associates of a fraudulent seller) pretend not to have any connection with the seller and raise the bid price so that the seller's item is sold at a higher price than its real value. Such associates are called shills, though shilling can be perpetrated without human associates: a seller can hold multiple IDs and pose as different customers simultaneously, participating in a single auction from multiple computers with different IP addresses and pretending to be different bidders. Detecting such fraud often means tracking the relationships between sellers and customers over a period of time. Clearly, non-automated techniques for accomplishing this task on a wide scale will incur significant costs. To address such issues, Web mining techniques have risen to prominence through their capacity to automatically detect "fraudulent behavior" in Web usage data. Since Web mining techniques are often focused on discovering relationships in usage, content, and transaction data, they can be readily applied to analyzing the relationships among people participating in online trading. As previously mentioned, much Internet fraud is perpetrated in cooperation with multiple "associates." In order to detect such fraudulent activity, graph mining techniques can be used to uncover latent relationships between associates by finding graphs with similar topological structures. Since a number of frauds may be perpetrated by the same group of fraudsters, these techniques also make it possible to identify the group's other frauds; they have been exploited not only for detecting fraud in e-commerce, but also for antiterrorism, financial crime detection, and spam detection.
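As a minimal illustration of this relationship-mining idea, the following sketch builds bidder co-occurrence counts from auction records and flags pairs of accounts that repeatedly appear together in the same seller's auctions; the data, threshold, and scoring are illustrative assumptions only, not a description of any deployed detection system.

```python
# Toy shill-detection sketch: flag bidder pairs that co-occur unusually often
# in auctions run by the same seller.
from itertools import combinations
from collections import Counter

auctions = [                                   # (seller, bidders), illustrative
    ("sellerA", {"b1", "b2", "b7"}),
    ("sellerA", {"b1", "b2", "b5"}),
    ("sellerA", {"b1", "b2"}),
    ("sellerB", {"b3", "b4"}),
    ("sellerB", {"b3", "b6", "b7"}),
]

# Count how often each bidder pair appears together, per seller.
pair_per_seller = Counter()
for seller, bidders in auctions:
    for pair in combinations(sorted(bidders), 2):
        pair_per_seller[(seller, pair)] += 1

THRESHOLD = 3   # co-occurrences with one seller that warrant a closer look
suspicious = [(seller, pair, n) for (seller, pair), n in pair_per_seller.items()
              if n >= THRESHOLD]
print(suspicious)   # e.g., ('sellerA', ('b1', 'b2'), 3): b1/b2 may be shills
```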
3.3.4 Developer duplication reduction
Many businesses, both large and small, maintain one or more internal application development units. Thus, at any given time, there may be hundreds, if not thousands, of projects being developed, deployed, and maintained concurrently. Due to overlapping business processes (e.g., human resources and fiscal administration) and multiple project development groups, duplication of source code often occurs (Rajapakse and Jarzabek, 2005; Kapser and Godfrey, 2003). Given the non-trivial cost of application development, mitigating such duplication is critical. Source code consistency is also an issue, for example, to prevent a case where only one of two duplicate segments is updated to address a bug and/or feature addition. Turnkey solutions for detecting source code duplication are already available, but they suffer from two major problems: (1) they are not able to address code that is functionally similar but syntactically different, and (2) they only detect duplication after it has already occurred. The goal of a full-featured duplication detection system would be to address both existing and potential duplication; the latter capability is currently unavailable. However, Web mining methods may offer a solution. Many businesses maintain intranets containing corporate policy information, best practices manuals, contact information, and project details, the last of which is of particular interest here. Assuming project information is kept current, it is possible to use Web mining techniques to identify functionality that is potentially duplicative: oftentimes, syntactically different functions are described using similar language. Figure 6 gives an overview of a possible approach for identifying potential duplication among multiple projects.
Fig. 6. Duplication candidate process overview.
First, the project Web pages and documents must be extracted from the intranet. Next, each document is split into fragments using common separators (periods, commas, bullet points, new lines, etc.). These fragments form the most basic element of comparison: the smallest entity capable of expressing a single thought. Using clustering techniques, these fragments can then be grouped into collections of similar fragments. When two or more fragments are part of the same collection but come from different projects, potential duplication has been identified. These fragments may then be red-flagged and brought to the attention of affected project managers.
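A minimal sketch of the fragment-comparison step is given below: it groups fragments whose word overlap (Jaccard similarity) exceeds a threshold, standing in for the clustering techniques mentioned above; the project fragments and threshold are illustrative assumptions.

```python
# Toy duplication-candidate detection: flag project-description fragments from
# different projects whose word overlap (Jaccard similarity) is high.
fragments = [                                  # (project, fragment), illustrative
    ("payroll",  "validate employee bank account numbers before payment"),
    ("benefits", "validate employee bank account numbers for reimbursement"),
    ("payroll",  "generate monthly payment summary report"),
    ("intranet", "render monthly summary report as a web page"),
]

def jaccard(a, b):
    """Word-set overlap between two text fragments."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb)

THRESHOLD = 0.4
candidates = []
for i in range(len(fragments)):
    for j in range(i + 1, len(fragments)):
        (proj_i, frag_i), (proj_j, frag_j) = fragments[i], fragments[j]
        if proj_i != proj_j and jaccard(frag_i, frag_j) >= THRESHOLD:
            candidates.append((proj_i, proj_j, frag_i, frag_j))

for c in candidates:
    print("possible duplication:", c)
```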
3.3.5 Business process mining
Business process mining, also called workflow mining, reveals how existing processes work and can provide considerable return on investment (ROI) when used to discover new process efficiencies. In the context of the World Wide Web, business process mining can be defined as the task of extracting useful process-related information from the click stream of users of a website or from the usage logs collected by the Web server. For example, mining of market-basket data to understand shopping behavior is perhaps the most well-known and popular application of Web mining. Similarly, one can gain a better understanding of the shopping process by modeling a customer's browsing behavior as a state transition diagram while he/she shops online. To implement such a system, Web usage logs and click stream data obtained from servers can be transformed into an XML format. These event logs can then be cleaned and the temporal ordering of business processes inferred. One can then combine Web usage mining with Web structure mining. By determining the number of traversals (usage) on each link (structure), one can estimate the transition probabilities between different states (a minimal sketch of this estimation is given below). Using these probabilities, entire business process models can be benchmarked and measured for performance increases or decreases. The discovered process model can also be checked for conformance with previously discovered models. Here, an anomaly detection system can also be used to identify deviations in existing business process behavior. Srivastava and Mobasher (1997) give an example of such a state transition diagram modeling a shopping transaction in a website, shown in Fig. 7.
Fig. 7. State transition diagram modeling a shopping transaction in a website (Srivastava and Mobasher, 1997).
One can also analyze "process outcome" data to understand the value of various parts (e.g., states) of the process model (i.e., the impact of various states on the probability of desired/undesired outcomes). The results of such analysis can be used to help develop strategies for increasing (or decreasing) the probabilities of desired outcomes. A possible end objective of this business process mining would be to maximize the probability of reaching the final state while simultaneously maximizing the expected number (or value) of products sold per visit, to conduct a sensitivity analysis of the state transition probabilities, and to identify appropriate promotion opportunities. In addition to the above-mentioned example, business process mining can also be applied to e-mail traffic to discover how people work and interact with each other in an organization. We can see what kinds of patterns exist in workflow processes and answer questions such as: do people hand over their tasks to others, do they subcontract, do they work together, or do they work on similar tasks? It thereby helps in determining the process, data, organizational, and social structure.
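As a minimal sketch of the transition-probability estimation referred to above, the following code counts link traversals in a few illustrative click-stream sessions and normalizes them into a transition matrix; the page names and sessions are made up for illustration.

```python
# Minimal sketch: estimate state-transition probabilities from click streams.
from collections import defaultdict

sessions = [                                   # illustrative page sequences
    ["home", "product", "cart", "checkout"],
    ["home", "product", "product", "exit"],
    ["home", "search", "product", "cart", "exit"],
]

counts = defaultdict(lambda: defaultdict(int))
for s in sessions:
    for a, b in zip(s, s[1:]):
        counts[a][b] += 1                      # traversals of link a -> b

transition = {
    a: {b: n / sum(nexts.values()) for b, n in nexts.items()}
    for a, nexts in counts.items()
}
print(transition["product"])   # e.g., {'cart': 0.5, 'product': 0.25, 'exit': 0.25}
```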
Sometimes the information contained in Web server logs is incomplete, noisy, too fine-grained, or specific to an application, which makes preprocessing more difficult and challenging. Further research is needed on extracting business process models from such server logs. By leveraging business process mining properly, a business process can be re-engineered by reducing work-in-progress, adding resources to increase capacity, or eliminating or improving the efficiency of bottleneck processes, thereby boosting the performance of the business.
4 Gaps in existing technology
Though Web mining techniques can be extremely useful to businesses, there are gaps which must often be bridged (or completely dealt with) in order to properly leverage Web mining's potential. In this section, we discuss a few such important gaps and how they can be addressed.
4.1 Lack of data preparation for Web mining
To properly apply Web mining in a production setting (e.g., recommending products to customers), data stored in archival systems must be linked back to online applications. As such, there must be processes in place to clean, transform, and move large segments of data back into a setting where they can be accessed by Web mining applications quickly and continuously. This often means removing extraneous fields and converting
textual identifiers (names, products, etc.) into numerical identifiers to make the processing of large amounts of transactional data quick. For instance, segmenting data into one-month intervals can cut down on expended computing resources and ensure that relevant trends are identified by Web mining techniques, provided there is sufficient transactional activity. Additionally, databases that hold these kinds of intermediate calculations have to be developed to reduce repeated computation. Web mining is often computationally expensive; thus, efforts to maximize efficiency are important.
4.2 Under-utilization of domain knowledge repositories
Businesses have long made use of domain knowledge repositories to store information about business processes, policies, and projects, and if these repositories are to be utilized in a Web mining setting, it becomes all the more important to manage them well. For instance, corporate intranets provide a wealth of information that is useful in expert-oriented recommendations (e.g., customer service) and duplication reduction, but the repository itself must be kept up-to-date and properly maintained. One of the best ways to ensure an intranet's "freshness" is to maintain it with a content management system (CMS) that allows non-specialists to update the website, distributing the responsibility to internal stakeholders.
4.3 Under-utilization of Web log data
Most companies keep track of employees' Web browsing behavior by collecting Web logs, mostly for security purposes. However, as seen from previous successful applications of Web mining techniques to such data, companies could utilize this information to better serve their employees. For example, one of the key issues usually dealt with by the human resources department is keeping employees motivated and retaining them. A common approach is to offer perks and bonuses in various forms to satisfy the employee. However, most such policies are "corporate-centric" rather than "employee-centric." With the advance of Web mining techniques, it is now possible to understand employees' interests in a better way. Two kinds of techniques can be employed. The first is to mine employees' behavior on company policy and benefits websites in order to understand what employees are looking for. For example, employees browsing retirement-benefits pages could be offered a better retirement package. Other examples include a tuition waiver for employees looking to pursue a professional development course, or a travel package deal for an employee who has shown interest in traveling. The second is to use trend analysis to see what is new and popular in the market, such as a new MP3 player, and to offer perks in the form of such products.
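As a minimal illustration of the first technique, the following sketch tallies intranet page visits per employee by topic and surfaces each employee's strongest interest; the log entries, URL-to-topic mapping, and identifiers are purely illustrative.

```python
# Toy sketch: infer employee interests from intranet Web logs.
from collections import Counter, defaultdict

page_category = {                       # illustrative URL-to-topic mapping
    "/benefits/retirement/401k": "retirement",
    "/benefits/tuition": "education",
    "/travel/policy": "travel",
}

web_log = [                             # (employee_id, url), illustrative
    ("e1", "/benefits/retirement/401k"),
    ("e1", "/benefits/retirement/401k"),
    ("e1", "/travel/policy"),
    ("e2", "/benefits/tuition"),
    ("e2", "/benefits/tuition"),
]

interests = defaultdict(Counter)
for employee, url in web_log:
    topic = page_category.get(url)
    if topic:
        interests[employee][topic] += 1

for employee, counts in interests.items():
    topic, hits = counts.most_common(1)[0]
    print(f"{employee}: strongest interest = {topic} ({hits} visits)")
```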
Of course, a key issue with such techniques is privacy. Privacy-preserving data mining is currently an active area of research. It has also been studied and shown, from examples such as Amazon, that people are willing to compromise a certain level of privacy to gain the benefits offered.
5 Looking ahead: The future of Web mining in business
We believe that the future of Web mining is entwined with the emerging needs of businesses and with the development of techniques fueled by the recognition of gaps and areas for improvement in existing methods. This section examines what is on the horizon for Web mining, the nascent areas currently under research, and how they can help in a business computing setting.
5.1 Microformats
It is important not only to present the right content on a website, but also to present it in the right format. For example, a first step in formatting for the Web was the use of HTML, which gave browsers the ability to parse and present text in a more readable and presentable format. However, researchers soon developed formats with richer semantics and presentation, for example, XML and CSS, for efficient processing of content and extraction of useful information. XML is used to store data in formats such that automatic processing can be done to extract meaningful information (not just for presenting it in a website). Today, the trend is moving more towards "microformats," which capture the best of XML and CSS. Microformats are design principles for formats, not another new language. They provide a way of thinking about data that gives humans a better understanding of it. They are currently widely used in websites such as blogs. With such newly structured data, there arises a need for natural language processing (NLP) and Web content mining techniques such as data extraction, information integration, knowledge synthesis, template detection, and page segmentation. This suggests that businesses should decide on the right kind of format to best utilize their data for processing, analysis, and presentation.
5.2 Mining and incorporating sentiments
Even though automated conceptual discovery from text is still relatively new, difficult, and imperfect, accurately connecting that knowledge to sentiment information (how someone feels about something) is even harder. Natural language processing techniques, melded with Web mining, hold great promise in this area. To understand how someone feels about a
particular product, brand, or initiative, and to project that level of understanding across all customers, would give a business a more accurate picture of what its customers currently think. Applied to the Web, one could imagine an application that collects such topic/sentiment information from the Internet and returns that information to a business. Accomplishing this would open up many marketing possibilities.
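A deliberately simplified sketch of the idea: score customer comments about a product against a tiny polarity lexicon and aggregate by product. Real sentiment mining requires far more sophisticated NLP; the lexicon and comments here are illustrative only.

```python
# Toy topic/sentiment collection: lexicon-based polarity scoring of comments.
POSITIVE = {"great", "love", "reliable", "fast"}
NEGATIVE = {"slow", "broken", "terrible", "disappointed"}

comments = [                              # (product, text), illustrative
    ("phoneX", "love the screen, fast and reliable"),
    ("phoneX", "battery is terrible and support was slow"),
    ("laptopY", "great keyboard, a bit slow but reliable"),
]

scores = {}
for product, text in comments:
    words = text.lower().replace(",", " ").split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    scores.setdefault(product, []).append(score)

for product, s in scores.items():
    print(f"{product}: average sentiment = {sum(s) / len(s):+.2f}")
```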
5.3 e-CRM to p-CRM
Traditionally, brick-and-mortar stores have been organized in a product-oriented manner, with aisles for various product categories. However, the success of e-CRM initiatives in the online world in building customer loyalty has not gone unnoticed by CRM practitioners in the physical world, which we refer to as p-CRM for clarity in this chapter. Additionally, the significance of physical stores has motivated a number of online businesses to open physical stores to serve "real people" (Earle, 2005). Many businesses have also moved from running their online and physical stores separately to integrating both, in order to better serve their customers (Stuart, 2000). Carp (2001) points out that although the online presence of a business does affect its physical division, people still find entertainment value in shopping in malls and other physical stores. Finally, people prefer to get a feel for products before purchase, and hence prefer to go out to shop instead of shopping online. From these observations, it is evident that physical stores will continue to be the preferred means of conducting consumer commerce for quite some time. However, their margins will be under pressure as they must adapt to compete with online stores. These observations led us to posit the following in our previous study (Mane et al., 2005):
Given that detailed knowledge of an individual customer's habits can provide insight into his/her preferences and psychology, which can be used to develop a much higher level of trust in a customer-vendor relationship, the time is ripe for revisiting p-CRM to see what lessons learned from e-CRM are applicable.
Until recently, a significant roadblock to achieving this vision has been the difficulty of collecting and analyzing detailed customer data in the physical world, as Underhill's seminal study (Underhill, 1999) showed, from both cost and customer-sensitivity perspectives. With advancements in pervasive computing technologies such as mobile Internet access, third-generation wireless communication, RFIDs, handheld devices, and Bluetooth, there has been a significant increase in the ability to collect detailed customer data. This raises the possibility of bringing e-CRM-style real-time, personalized customer relationship functions to the physical world. For a more detailed study on this, refer to our previous work (Mane et al., 2005).
5.4 Other directions
We have mentioned some of the key issues that should be noted by businesses as they proceed to adopt Web mining techniques to improve their business intelligence. However, as noted earlier, this is by no means an exhaustive list. There are various other issues that need to be addressed from a technical perspective in order to determine the framework necessary to make these techniques more widely applicable to businesses. For example, there are a host of open areas of research in Web mining, such as the extraction of structured data from unstructured data, the ranking of Web pages by integrating semantic relationships between documents, and the automatic derivation of user sentiment. Businesses must also focus on the types of data that need to be collected for many Web usage mining techniques to be possible. The design of website content also plays a crucial role in deciding what kinds of data can be collected. For example, one viewpoint is that pages with Flash-based content, though attractive, are more broadcast in nature and do not easily facilitate the collection of information about customer behavior. However, recent advances in technologies such as AJAX, which enhance customer/website interaction, not only allow corporations to collect data, but also give the customer a "sense of control," leading to an enriched user experience.
6 Conclusion
This chapter examines how technology, such as Web mining, can aid businesses in gaining additional information and intelligence. We provide an introduction to Web mining and the various techniques associated with it. We briefly update the reader on state-of-the-art research in this area. We then show how this class of techniques can be used effectively to aid various business functions and provide example applications to illustrate their applicability. These examples provide evidence of Web mining's potential, as well as its existing success, in improving business intelligence. Finally, we point out gaps in existing technologies and elaborate on future directions that should be of interest to the business community at large. In doing so, we also note that we have intentionally left out specific technical details of existing and future work, given the introductory nature of this chapter.
Acknowledgments
This work was supported in part by AHPCRC contract number DAAD19-01-2-0014, by NSF Grant ISS-0308264 and by ARDA grant F30602-03C-0243. This work does not necessarily reflect the position or policy of the government, and no official endorsement should be inferred.
We would like to thank the Data Mining Research Group at the University of Minnesota for providing valuable feedback.
References
Adomavicius, G., A. Tuzhilin (2005). Towards the next generation of recommender systems: a survey of the state-of-the-art and possible extensions. IEEE Transactions on Knowledge and Data Engineering 17, 734–749.
Arsham, H. (2006). Time-critical decision making for business administration. Available at http://home.ubalt.edu/ntsbarsh/stat-data/Forecast.htm. Retrieved on 2006.
Bennet, J. (2007). The Cinematch system: Operation, scale coverage, accuracy impact. Available at http://blog.recommenders06.com/wp-content/uploads/2006/09/bennett.pdf. Retrieved and accessed on July 30.
Bose, A., K. Beemanapalli, J. Srivastava, S. Sahar (2006). Incorporating Concept Hierarchies into Usage Based Recommendations. WEBKDD, Philadelphia, PA, USA.
Box, G.E., G.M. Jenkins (1994). Time Series Analysis: Forecasting and Control. 3rd ed. Prentice Hall PTR.
Carp, J. (2001). Clicks vs. bricks: Internet sales affect retail properties. Houston Business Journal.
Cooley, R., B. Mobasher, J. Srivastava (1997). Web mining: Information and pattern discovery on the World Wide Web. 9th IEEE ICTAI.
Cooley, R., B. Mobasher, J. Srivastava (1999). Data preparation for mining World Wide Web browsing patterns. Knowledge and Information Systems 1(1), 5–32.
DeLong, C., P. Desikan, J. Srivastava (2005). USER (User Sensitive Expert Recommendation): What non-experts NEED to know, in: Proceedings of WebKDD, Chicago, IL.
Desikan, P., J. Srivastava, V. Kumar, P.N. Tan (2002). Hyperlink analysis: Techniques and applications. Technical Report 2002-0152, Army High Performance Computing and Research Center.
Earle, S. (2005). From clicks to bricks . . . online retailers coming back down to earth. Feature story. Available at http://www.specialtyretail.net/issues/december00/feature_bricks.htm. Retrieved on 2005.
Etzioni, O. (1996). The World Wide Web: Quagmire or gold mine? Communications of the ACM 39(11), 65–68.
Franses, Ph.H.B.F. (1994). Modeling new product sales; an application of co-integration analysis. International Journal of Research in Marketing.
Kapser, C., M.W. Godfrey (2003). Toward a taxonomy of clones in source code: A case study, in: International Workshop on Evolution of Large-scale Industrial Software Applications, Amsterdam, The Netherlands.
Keogh, E. (2004). Data mining in time series databases tutorial, in: Proceedings of the IEEE Int. Conference on Data Mining.
Kleinberg, J.M. (1998). Authoritative sources in a hyperlinked environment, in: 9th Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 668–677.
Kosala, R., H. Blockeel (2000). Web mining research: A survey. SIGKDD Explorations 2(1), 1–15.
Liu, B., K.C.C. Chang (2004). Editorial: Special issue on web content mining. SIGKDD Explorations special issue on Web Content Mining.
Mane, S., P. Desikan, J. Srivastava (2005). From clicks to bricks: CRM lessons from E-commerce. Technical report 05-033, Department of Computer Science, University of Minnesota, Minneapolis, USA.
Mobasher, B. (2005). Web usage mining and personalization, in: M.P. Singh (ed.), Practical Handbook of Internet Computing. CRC Press.
Mobasher, B., R. Cooley, J. Srivastava (2000). Automatic personalization based on web usage mining. Communications of the ACM.
Page, L., S. Brin, R. Motwani, T. Winograd (1998). The PageRank citation ranking: Bringing order to the web. Stanford Digital Library Technologies.
Perkowitz, M., O. Etzioni (1997). Adaptive Web Sites: An AI Challenge. IJCAI.
Pirolli, P., J.E. Pitkow (1999). Distribution of surfer's path through the World Wide Web: Empirical characterization. World Wide Web 1, 1–17.
Rajapakse, D.C., S. Jarzabek (2005). An investigation of cloning in web applications, in: Fifth International Conference on Web Engineering, Sydney, Australia.
Sarukkai, R.R. (1999). Link prediction and path analysis using Markov chains, in: Proceedings of the 9th World Wide Web Conference.
Srivastava, J., P. Desikan, V. Kumar (2004). Web mining: Concepts, applications and research directions. Data Mining: Next Generation Challenges and Future Directions, MIT/AAAI.
Srivastava, J., B. Mobasher (1997). Panel discussion on "Web Mining: Hype or Reality?", ICTAI.
Stuart, A. (2000). Clicks and bricks. CIO Magazine.
Underhill, P. (1999). Why We Buy: The Science of Shopping. Simon and Schuster, New York.
Zhu, J., J. Hong, J.G. Hughes (2002). Using Markov chains for link prediction in adaptive web sites, in: Proceedings of ACM SIGWEB Hypertext.
Adomavicius & Gupta, Eds., Handbooks in Information Systems, Vol. 3. Copyright © 2009 by Emerald Group Publishing Limited.
Chapter 3
Current Issues in Keyword Auctions
De Liu
455Y Gatton College of Business and Economics, University of Kentucky, Lexington, KY 40506, USA
Jianqing Chen
Haskayne School of Business, The University of Calgary, Calgary, AB T1N 2N4, Canada
Andrew B. Whinston
McCombs School of Business, The University of Texas at Austin, Austin, TX 78712, USA
Abstract
Search engines developed a unique advertising model a decade ago that matched online users with short-text advertisements based on the users' search keywords. These keyword-based advertisements, also known as "sponsored links," are nowadays the flagship of the thriving Internet advertising business. Relatively unknown to online users, however, is the fact that slots for search engine advertisements are sold through a special kind of auction, which we call "keyword auctions." As the most successful online auctions since eBay's business-to-consumer auctions, keyword auctions form the backbone of the multibillion-dollar search advertising industry. Owing to their newness and significance, keyword auctions have captured the attention of researchers from information systems, computer science, and economics. Many questions have been raised, including how best to characterize keyword auctions, why keyword auctions, rather than other selling mechanisms, are used, and how to design keyword auctions optimally. The purpose of this chapter is to summarize the current efforts to address these questions. In doing so, we highlight the last question, that is, how to design effective auctions for allocating keyword advertising resources. As keyword auctions are still new, many issues surrounding them remain open. We point out several such issues for future research, including the click-fraud problem associated with keyword auctions.
1 Introduction
Keyword advertising is a form of targeted online advertising. A basic variation of keyword advertising is "sponsored links" (also known as "sponsored results" and "sponsored search") on search engines. Sponsored links are advertisements triggered by search phrases entered by Internet users on search engines. For example, a search for "laptop" on Google will bring up both the regular search results and advertisements from laptop makers and sellers. Figure 1 shows such a search-result page with sponsored links at the top and on the side of the page.
Fig. 1. Search-based keyword advertising.
Another variation of keyword advertising is "contextual advertising" on content pages. Unlike sponsored links, contextual advertisements are triggered by certain keywords in the content. For example, a news article about "Cisco" is likely to be displayed with contextual advertisements from Cisco network equipment sellers and Cisco training providers. Both sponsored links and contextual advertisements can target the online users who are most likely to be interested in seeing the advertisements. Because of its superior targeting ability, keyword advertising has quickly gained popularity among marketers and has become a leading form of online advertising. According to a report by the Interactive Advertising Bureau (2007)
and PricewaterhouseCoopers, keyword advertising in the United States reached $6.8 billion in total revenue in 2006. eMarketer (2007) predicts that the market for online advertising will rise from $16.9 billion in 2006 to $42 billion in 2011, with keyword advertising accounting for about 40% of the total revenue. A typical keyword advertising market consists of advertisers and publishers (i.e., websites), with keyword advertising providers (KAPs) in between. There are three key KAPs in the U.S. keyword advertising market: Google, Yahoo!, and MSN adCenter. Figure 2 illustrates Google's keyword-advertising business model.
Fig. 2. Google's Adwords and AdSense programs.
Google has two main advertising programs, Adwords and AdSense. Adwords is Google's flagship advertising program that interfaces with advertisers. Through Adwords, advertisers can submit advertisements, choose keywords relevant to their businesses, and pay for the cost of their advertising campaigns. Adwords has separate programs for sponsored search (Adwords for search) and for contextual advertising (Adwords for content). In each case, advertisers can choose to place their advertisements on Google's site only or also on publishers' sites that are part of Google's advertising network. Advertisers can also choose to display text, image, or, more recently, video advertisements. AdSense is another Google advertising program, one that interfaces with publishers. Publishers from personal blogs to large portals such as CNN.com can participate in Google's AdSense program to monetize the traffic to their websites. By signing up with AdSense, publishers agree to publish advertisements and receive payments from Google. Publishers may choose to display text, image, and video advertisements on their sites. They receive payments from Google on either a per-click or per-thousand-impressions basis.1 AdSense has become the single most important revenue source for many Web 2.0 companies. This chapter focuses on keyword auctions, which are used by KAPs to sell their keyword advertising slots to advertisers.
1 Google is also beta-testing a per-action-based service in which a publisher is paid each time a user carries out a certain action (e.g., a purchase).
A basic form of keyword auction is as follows. Advertisers choose their willingness-to-pay for a keyword phrase either on a per-click (pay-per-click) or on a per-impression (pay-per-impression) basis. An automated program ranks advertisers and assigns them to the available slots whenever a user searches for the keyword or browses a content page deemed relevant to the keyword. The ranking may be based on advertisers' pay-per-click/pay-per-impression only. It may also include other information, such as their historical click-through rate (CTR), namely the ratio of the number of clicks on the advertisement to the number of times the advertisement is displayed. Almost all major KAPs use automated bidding systems, but their specific designs differ from each other and change over time. Keyword auctions are another multibillion-dollar application of auctions in electronic markets, following the celebrated eBay-like business-to-consumer auctions. Inevitably, keyword auctions have recently gained attention among researchers. Questions have been raised regarding what a keyword auction is, why keyword auctions should be used, and how keyword auctions should be designed. Some of these questions have been addressed over time, but many are still open. The purpose of this chapter is to summarize the current efforts in addressing these questions. In doing so, we focus mainly on the third question, that is, how to design effective auctions for allocating keyword advertising resources. We also point out several issues for future research. We will examine keyword auctions from a theoretical point of view. The benefits of conducting a rigorous theoretical analysis of real-world keyword auctions are two-fold. On one hand, we hope to learn what makes this new auction format popular and successful. On the other hand, by conducting a theoretical analysis of keyword auctions, we may be able to recommend changes to the existing designs. The rest of the chapter is organized as follows. Next, we discuss the research context by briefly reviewing the history of keyword advertising and keyword auctions. In Section 3, we introduce a few popular models of keyword auctions. In Sections 4 and 5, we focus on two design issues in keyword auctions, namely, how to rank advertisers and how to package advertising resources. In Section 6, we discuss a threat to the current keyword-advertising model: click fraud. We conclude this chapter in Section 7.
2 A historical look at keyword auctions
Keyword advertising and keyword auctions were born out of practice. They were fashioned to replace earlier, less efficient market mechanisms and are still being shaped by the cumulative experience of the keyword advertising industry. In this section, we chronicle the design of keyword advertising markets and keyword auctions, and show how they evolved.
2.1 Early Internet advertising contracts
In early online advertising, advertising space was sold through advance contracts. These contracts were negotiated on a case-by-case basis. As such negotiations were time-consuming, advertising sales were limited to large advertisers (e.g., those paying at least a few thousand dollars per month). These advertising contracts were typically priced in terms of the number of thousand-page-impressions (cost-per-mille, or CPM). CPM pricing was borrowed directly from off-line advertising, such as TV, radio, and print, where advertising costs are measured on a CPM basis. The problem with CPM pricing is that it provides no indication as to whether users have paid attention to the advertisement. Advertisers may be concerned that their advertisements are pushed to web users without necessarily generating any impact. The lack of accountability is reflected in the saying among marketing professionals: "I know that I waste half of my advertising budget. The problem is I don't know which half."
2.2 Keyword auctions by GoTo.com
In 1998, a startup company called GoTo.com demonstrated a new proof-of-concept search engine at a technology conference in Monterey, California. At that time, all other search engines sorted search results based purely on algorithm-assessed relevancy. GoTo.com, on the other hand, devised a plan to let advertisers bid on the top positions of the search results. Specifically, advertisers can submit their advertisements on chosen words or phrases ("search terms") together with their pay-per-click on these advertisements. Once the submitted advertisements are validated by GoTo.com's editorial team, they will appear as search results. The highest-bidding advertiser will appear at the top of the result list, the second-highest advertiser will appear in second place, and so on. Each time a user clicks on an advertisement, the advertiser will be billed the amount of the bid. GoTo.com's advertising model contains several key innovations. First, advertising based on user-entered search terms represents a new form of targeted advertising that is based on users' behavior. For example, a user who searches for "laptop" is very likely in the process of buying a laptop. Keyword-based search engine advertising opens a new era of behaviorally targeted advertising. Second, by billing advertisers only when users click on the advertisements, GoTo.com provides a partial solution to the longstanding issue of lack of accountability. Clicking on an advertisement indicates an online user's interest. Therefore, pay-per-click represents a significant step toward more accountable advertising. The ability to track behavioral outcomes such as clicks is a crucial difference between online advertising and its off-line counterparts. The act
of clicking on an advertisement provides an important clue about advertising effectiveness. Accumulated information on clicking behavior can be further used to fine-tune advertisement placement and content. In this sense, pay-per-click is a significant leap beyond the CPM scheme and signifies the huge potential of online advertising. Finally, the practice of using auctions to sell advertising slots on a continuous, real-time basis is another innovation. These real-time auctions allow advertisements to go online a few minutes after a successful bid. As there is no pre-set minimum spending, auction-based advertising has the advantage of tapping into the "long tail" of the advertising market, that is, advertisers who have small spending budgets and are more likely to "do it themselves." GoTo.com was re-branded as Overture Services in 2001 and acquired by Yahoo! in 2003. Throughout this process, however, the auction mechanism and the pay-per-click pricing scheme remained largely unchanged.
2.3 Subsequent innovations by Google
Google, among others, made several key innovations to the keyword advertising business model. Some of these have become standard features of today's keyword advertising. In the following, we briefly review these innovations.
2.3.1 Content vs. advertising
The initial design by GoTo.com featured a "paid placement" model: paid advertising links are mixed with organic search results so that users cannot tell whether a link is paid for. Google, when introducing its own keyword advertising in 1998, promoted a "sponsored link" model that distinguished advertisements from organic search results. In Google's design, advertisements are displayed on the side of or on top of the result page with the label "sponsored links." Google's approach has been welcomed by the industry and policy-makers and has now become standard practice.
2.3.2 Allocation rules
Google introduced a new allocation rule in 2002 in its "Adwords Select" program, in which listings are ranked not only by bid amount but also by CTR (later termed "quality score"). Under such a ranking rule, paying a high price alone cannot guarantee a high position. An advertiser with a low CTR will get a lower position than advertisers who bid the same (or slightly less) but have higher CTRs. In 2006, Google revised its quality score calculation to include not only advertisers' past CTRs but also the quality of their landing pages. Advertisers with low quality scores are required to pay a high minimum bid or they will become inactive.
Google's approach to allocation gradually gained acceptance. At the beginning of 2007, Yahoo! conducted a major overhaul of its online advertising platform so that rankings consider both an advertisement's CTR and other undisclosed factors. Microsoft adCenter, which came into use only at the beginning of 2006, used a ranking rule similar to that of Google's Adwords. Before that, all of the advertisements displayed on the MSN search engine were supplied by Yahoo!
2.3.3 Payment rules
In the keyword auctions used by GoTo.com, bidders pay the amount of their bids. This way, any decrease in one's bid will result in a lower payment. As a result, bidders have incentives to monitor the next-highest bids and make sure their own bids are only slightly higher. The benefits from constantly adjusting one's bid create undesirable volatility in the bidding process. Perhaps as a remedy, Google used a different payment rule in its Adwords Select program. In Adwords Select, bidders do not pay the full amount of their bids. Instead, they pay the lowest amount possible to remain above the next-highest competitor. If the next-highest competitor's bid drops, Google automatically adjusts the advertiser's payment downward. This feature, termed the "Adwords Discounter," is essentially an implementation of second-price auctions in a dynamic context. One key advantage of such an arrangement is that bidders' payments are no longer directly linked to their own bids. This reduces bidders' incentive to game the system. Recognizing this advantage, Yahoo! (Overture) also switched to a similar payment rule. We discuss the implications of different payment rules further in Section 3.
2.3.4 Pricing schemes
As of now, Google's Adwords for search offers only pay-per-click advertising. On the other hand, Adwords for content allows advertisers to bid either pay-per-click or pay-per-thousand-impressions. Starting in spring 2007, Google began beta-testing a new billing metric called pay-per-action with its Adwords for content. Under the pay-per-action metric, advertisers pay only for completed actions of their choice, such as a lead, a sale, or a page view, after a user has clicked through the advertisement to the publisher's website.
2.4 Beyond search engine advertising
The idea of using keywords to place the most relevant advertisements is not limited to search engine advertising. In 2003, Google introduced an "AdSense" program that allows web publishers to generate advertising revenue by receiving advertisements served by Google. AdSense analyzes publishers' web pages to generate a list of the most relevant keywords, which are subsequently used to select the most appropriate advertisements for
these pages. Figure 3 shows an example of contextual advertising in Gmail.
Fig. 3. Context-based keyword advertising.
The order of the advertisements supplied to a page is determined by Adwords auctions. The proceeds of these advertisements are shared between Google and the web publishers. Yahoo! has a similar program called the Yahoo! Publisher Network. KAPs have also actively sought expansion into domains other than Internet advertising, such as mobile devices, video, print, and TV advertising. Google experimented with classified advertising in the Chicago Sun-Times as early as fall 2005. In February 2006, Google announced a deal with global operator Vodafone to include its search engine on the Vodafone Live! mobile Internet service. In April 2007, Google struck a deal with radio broadcaster Clear Channel to start supplying less than 5% of the advertising inventory across Clear Channel's 600+ radio stations. During the same month, Google signed a multiyear contract with satellite-TV provider EchoStar to sell TV advertisement spots on EchoStar's Dish service through auctions.
3 Models of keyword auctions
In this section we discuss several models of keyword auctions. The purpose of these models is not to propose new auction designs for keyword-advertising
settings but to capture the essence of keyword auctions accurately. The value of these models lies in allowing real-world keyword auctions to be analyzed in a simplified theoretical framework. We start by describing the problem setting. There are n advertisers bidding for m (≤ n) slots on a specific keyword phrase. Let $c_{ij}$ denote the number of clicks generated by advertiser i in slot j. In general, $c_{ij}$ depends both on the relevance of the advertisement and on the prominence of the slot. In light of this, we decompose $c_{ij}$ into an advertiser (advertisement) factor $q_i$ and a slot factor $d_j$:
$c_{ij} = d_j q_i$   (1)
We interpret the advertiser factor $q_i$ as advertiser i's CTR. For example, everything else being equal, a brand-name advertiser may attract more clicks and thus have a higher CTR than a non-brand-name advertiser. We interpret the slot factor $d_j$ as the click potential of the slot. For example, a slot at the top of a page has a higher click potential than a slot at the bottom of the same page. Each advertiser has a valuation-per-click $v_i$. As in most research, we assume that advertisers know their own valuation-per-click, though in reality advertisers may have to learn it over time from the outcomes of their keyword advertising. Each advertiser submits a bid b that is the advertiser's maximum willingness-to-pay per click for the keyword phrase. Each time a user initiates a search for the keyword phrase or requests a content page related to the keyword phrase, the auctioneer (KAP) will examine the bids from all participating advertisers and determine which advertisements should be displayed, and in which order, according to an allocation rule. If a user clicks on a particular advertisement, the advertiser will be charged a price determined by the payment rule of the keyword auction (which we will explain shortly). The allocation and payment rules used in keyword auctions differ across KAPs. For example, until recently, Yahoo! ranked advertisers strictly by the prices they bid, and advertisers paid the amount they bid. Google, on the other hand, ranks advertisers based on both their prices and their CTRs, and advertisers pay the lowest price that keeps them above the next highest-ranked advertiser. We distinguish the following models of keyword auctions by the allocation and payment rules they use.
3.1 Generalized first-price auction
In the early days of keyword auctions, bidders paid the price they bid. Such a format is termed a "generalized first-price (GFP)" auction because it essentially extends the first-price auction to a multiple-object
setting. However, people soon discovered that GFP auctions could be unstable in a dynamic environment where bidders can observe and react to other bidders' latest bids as often as they like. For instance, assume there are two slots and two advertisers, 1 and 2, with valuations-per-click of $2 and $1, respectively. Assume the minimum bid is $0.10 and slot 1 generates twice the number of clicks that slot 2 generates. It is best for advertiser 1 to bid one cent higher than advertiser 2. It is also best for advertiser 2 to bid one cent higher than advertiser 1 until advertiser 1 reaches $0.55 or higher, at which point advertiser 2 is better off bidding just the minimum bid of $0.10 (at any price above $0.55, advertiser 2's payoff from slot 1, which is 2 × ($1 − price) per unit of slot-2 clicks, falls below the $0.90 it earns from taking slot 2 at the minimum bid). The two advertisers thus form a bidding cycle that escalates from the minimum bid to $0.55 and then starts over from the minimum bid. Zhang and Feng (2005) and Asdemir (2005) show that this cyclic bidding pattern existed in Overture. Cyclic bidding is harmful in three ways. First, the frequent revision of bids requires additional computing resources that can slow down the entire auction system. Second, as shown by Zhang and Feng (2005), the oscillation of prices caused by the bidding cycle can dramatically reduce the KAP's revenue. Third, GFP auctions are biased toward bidders who can attend to and revise their bids more often. Such a bias may be perceived as unfair.
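The escalation-and-reset dynamic described above is easy to reproduce in a few lines of code. The following is a minimal sketch, not taken from the chapter or from any KAP system, that alternates myopic best responses for the two advertisers in the example (valuations $2 and $1, slot 1 earning twice the clicks of slot 2, minimum bid $0.10, one-cent increments); the function name and the strict alternation of moves are assumptions of the sketch.

```python
def best_response(my_value, rival_bid, min_bid=0.10, step=0.01, clicks=(2.0, 1.0)):
    """Myopic best response: either outbid the rival by one step (take slot 1,
    paying the own bid under GFP) or retreat to the minimum bid (take slot 2)."""
    overbid = round(rival_bid + step, 2)
    payoff_top = clicks[0] * (my_value - overbid)
    payoff_bottom = clicks[1] * (my_value - min_bid)
    return overbid if payoff_top > payoff_bottom else min_bid

values = {1: 2.00, 2: 1.00}          # valuations per click
bids = {1: 0.10, 2: 0.10}            # both start at the minimum bid
history = []
for t in range(60):                  # advertisers revise bids in alternation
    mover = 1 if t % 2 == 0 else 2
    rival = 2 if mover == 1 else 1
    bids[mover] = best_response(values[mover], bids[rival])
    history.append((mover, bids[mover]))

print(history[:6])     # escalation: bids rise by one cent per move
print(history[42:48])  # collapse: advertiser 2 drops to $0.10 and the cycle restarts
```

Running the sketch shows bids climbing one cent at a time until the low-valuation advertiser drops back to the minimum bid, after which the cycle restarts, matching the pattern documented by Zhang and Feng (2005) and Asdemir (2005).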
3.2 Generalized second-price auction

Edelman et al. (2007) and Varian (2007) study a generalized second-price (GSP) auction that captures Google's payment rule. In GSP auctions, advertisers are arranged in descending order of their pay-per-click bids. The highest-ranked advertiser pays a price equal to the bid of the second-ranked advertiser plus a small increment; the second-ranked advertiser pays a price equal to the bid of the third-ranked advertiser plus a small increment; and so on. For example, suppose there are two slots and three advertisers {1, 2, 3} who bid $1, $0.80, and $0.75, respectively. Under the GSP rule, advertiser 1 wins the first slot and advertiser 2 wins the second slot. Assuming that the minimum increment is negligible, advertisers 1 and 2 pay $0.80 and $0.75 per click, respectively (Table 1).

Table 1
Payments under generalized second-price auctions

Advertiser   Bid ($)   Slot assigned   Pay-per-click ($)
1            1.00      1               0.80
2            0.80      2               0.75
3            0.75      –               –
Table 2
Payments under the VCG mechanism

Advertiser   Bid ($)   Slot assigned   Pay-per-click ($)
1            1.00      1               0.775
2            0.80      2               0.75
3            0.75      –               –
One notable feature of GSP auctions is that advertisers' payments are not directly affected by their own bids. This feature is also present in the well-known Vickrey-Clarke-Groves (VCG) mechanism. Under the VCG mechanism, each player's payment equals the opportunity cost that the player imposes on the other players. To illustrate, in the earlier example the VCG payment of advertiser 1 should equal the reduction in advertisers 2 and 3's total valuation caused by 1's participation. Assume that all advertisers have the same CTR (normalized to 1) and that all bids in Table 2 reflect advertisers' true valuation-per-click. Assume also that the first slot has a (normalized) click potential of 1 and the second slot has a click potential of 0.5. Without advertiser 1, advertisers 2 and 3 would be assigned to the two slots, generating a total valuation of 0.8 × 1 + 0.75 × 0.5 = 1.175. With advertiser 1, advertiser 2 is assigned to the second slot and advertiser 3 is not assigned a slot, so advertisers 2 and 3 generate a total valuation of 0.8 × 0.5 = 0.4. The VCG mechanism therefore suggests that advertiser 1 should pay (1.175 − 0.4)/1 = 0.775 per click. Similarly, we can calculate the VCG payment for advertiser 2. Table 2 illustrates the slot allocation and payments under the VCG rule. Clearly, GSP is not a VCG mechanism. Advertisers (except the lowest-ranked winner) pay higher prices under GSP than under VCG, provided that they bid the same prices.2 Edelman et al. (2007) show that GSP has no dominant-strategy equilibrium and that truth-telling is not an equilibrium. However, the corresponding generalized English auction has a unique equilibrium, and in that equilibrium bidders' strategies are independent of their beliefs about other bidders' types. These findings suggest that GSP auctions offer a certain degree of robustness against opportunism and instability. The above results are obtained under the somewhat restrictive assumption that bidders differ in a single dimension (valuation-per-click). In reality, advertisers differ at least in both valuation-per-click and CTR. This fact has motivated Google, and later Yahoo! and MSN adCenter, to rank advertisers based on both bid prices and CTRs. In this sense, GSP is accurate about Google's payment rule but not about its allocation rule. In the next subsection, we discuss an auction framework that captures the allocation rules of keyword auctions.

2 This is not to say that GSP generates higher revenue than VCG, because advertisers may bid differently under the two mechanisms.
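To make the two payment rules concrete, here is a small, self-contained sketch (not from the chapter) that reproduces the per-click payments in Tables 1 and 2 for this three-advertiser example. It assumes, as the text does, identical CTRs normalized to 1 and slot click potentials of 1 and 0.5; the function names are illustrative.

```python
bids = [1.00, 0.80, 0.75]      # advertisers 1, 2, 3, already sorted descending
slots = [1.0, 0.5]             # click potential of slot 1 and slot 2

def gsp_payments(bids, slots):
    """Each winner pays (approximately) the next-ranked advertiser's bid."""
    return [bids[i + 1] for i in range(len(slots))]

def vcg_payments(bids, slots):
    """Winner i pays the per-click externality it imposes on the others."""
    payments = []
    for i in range(len(slots)):
        others = bids[:i] + bids[i + 1:]
        # total value others get when i is absent (each moves up one slot)
        without_i = sum(d * v for d, v in zip(slots, others))
        # total value others get when i is present (i occupies slot i)
        with_i = sum(d * v for d, v in zip(slots[:i], others[:i]))
        with_i += sum(d * v for d, v in zip(slots[i + 1:], others[i:]))
        payments.append((without_i - with_i) / slots[i])
    return payments

print(gsp_payments(bids, slots))   # [0.80, 0.75]
print(vcg_payments(bids, slots))   # [0.775, 0.75]
```

The VCG routine computes, for each winner, the total value the other advertisers would obtain if that winner were absent, subtracts what they obtain in its presence, and divides by the winner's click potential, which is exactly the calculation spelled out above.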
3.3 Weighted unit-price auction

The weighted unit-price auction (WUPA) has been studied in Liu and Chen (2006) and Liu et al. (2009). The WUPA is motivated by the fact that Google allocates slots based on a score that is a function of advertisers' unit-price bids. While Google does not fully disclose its scoring formula, Search Engine Watch reported that the formula used by Google is (Sullivan, 2002)

$$\text{Score} = \text{Willingness-to-pay per click} \times \text{CTR} \qquad (2)$$
In 2006, Google updated its ranking rule by replacing CTR in the above formula with a more comprehensive "quality score," which takes into account advertisers' CTRs as well as other information such as the quality of their landing pages. In the updated ranking rule, CTR remains a primary consideration in an advertiser's quality score. The idea of using a scoring rule to synthesize multidimensional criteria is not new. "Scoring auctions" have been used in procurement settings (Asker and Cantillon, 2008; Che, 1993), where suppliers (bidders) submit multidimensional bids, such as price p and quality q, and are ranked by a scoring function of the form s(p, q) = f(q) − p. A score rule is also used in "tender auctions" (Ewerhart and Fieseler, 2003), where a buyer (the auctioneer) asks suppliers to bid a unit price for each input factor (e.g., labor and materials) and ranks suppliers by the weighted sum of their unit-price bids. However, a weighted unit-price score rule has never been used on such a large scale, and the scoring rule used in keyword auctions differs from those in procurement and tender auctions. The weighted unit-price auction as a general scoring-auction format had therefore not been studied previously. The specifics of the WUPA framework are as follows. The auctioneer assigns each advertiser a score s based on the advertiser's bid b and information on the advertiser's future CTR:

$$s = w b \qquad (3)$$
where w is a weighting factor assigned to the advertiser based on the advertiser's expected future CTR. The auctioneer allocates the first slot to the advertiser with the highest score, the second slot to the advertiser with the second-highest score, and so on. As with the price-only allocation rule, WUPAs can have "first-score" and "second-score" formats. Under the first-score rule, each advertiser pays the price that "fulfills" the advertiser's score, which is equivalent to saying that advertisers pay the prices they bid. Under the second-score payment rule, each advertiser pays the lowest price that keeps the advertiser's score above the next-highest advertiser's score. For example, suppose there are only two types of expected CTRs, high and low. Suppose also that the weighting factor for high-CTR advertisers is 1 and for low-CTR advertisers is 0.5.
Table 3
Payments under first- and second-score WUPAs

Advertiser   Bid ($)   CTR    Score   Slot assigned   Pay-per-click, first-score ($)   Pay-per-click, second-score ($)
1            1.00      Low    0.5     –               –                                –
2            0.80      High   0.8     1               0.80                             0.75
3            0.75      High   0.75    2               0.75                             0.50
Continuing with the earlier examples, we assume the expected CTRs of advertisers 1, 2, and 3 are low, high, and high, respectively. Table 3 shows the winning advertisers under the WUPA and the prices they pay per click under the first-score and second-score rules, respectively. Liu et al. (2009) show that in an incomplete-information setting (i.e., advertisers do not know other advertisers' valuation-per-click or expected CTRs), the first-score and second-score WUPAs are equivalent in expected revenue. The first-score WUPA has a unique Bayesian-Nash equilibrium, and the equilibrium can be solved explicitly. As in GSP auctions, the second-score WUPA does not have a truth-telling equilibrium except when there is only one slot. In Section 4, we discuss the implications of different ranking rules.
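The allocation and the two payment rules in Table 3 can be reproduced with a short script. The following is a hedged sketch rather than any KAP's actual implementation; the weight values (1 for high expected CTR, 0.5 for low) and the two slots follow the example above.

```python
advertisers = [                       # (name, bid, expected CTR class)
    ("1", 1.00, "low"),
    ("2", 0.80, "high"),
    ("3", 0.75, "high"),
]
weight = {"high": 1.0, "low": 0.5}
num_slots = 2

# score each advertiser and rank by score (the WUPA allocation rule)
scored = []
for name, bid, ctr in advertisers:
    w = weight[ctr]
    scored.append((name, bid, w, w * bid))
ranking = sorted(scored, key=lambda a: a[3], reverse=True)

for rank, (name, bid, w, score) in enumerate(ranking):
    if rank >= num_slots:
        print(f"advertiser {name}: no slot")
        continue
    first_score_price = bid                      # first-score: pay what you bid
    next_score = ranking[rank + 1][3] if rank + 1 < len(ranking) else 0.0
    second_score_price = next_score / w          # lowest price matching next score
    print(f"advertiser {name}: slot {rank + 1}, "
          f"first-score {first_score_price:.2f}, second-score {second_score_price:.2f}")
```

For the second-score price, each winner pays the lowest bid whose weighted score still matches the next-ranked advertiser's score, i.e., that advertiser's score divided by the winner's own weight, which yields the last two columns of Table 3.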
4 How to rank advertisers

Yahoo is strictly capitalistic—pay more and you are number one. Google has more socialist tendencies. They like to give their users a vote.
—Dana Todd, SiteLab International
This quote summarizes an interesting dichotomy in Yahoo!'s and Google's approaches to advertiser ranking. Yahoo! (Overture), the pioneer in keyword auctions, ranked advertisers purely based on their willingness-to-pay. On the other hand, Google, the now-leading player, invented a design that ranks advertisers by the product of the per-click prices they bid and their historical CTRs. What exactly is the significance of different ranking approaches? Vise and Malseed (2005), authors of The Google Story, noted that Google displays a "socialist tendency" because in Google's approach, advertisements that Internet users frequently click on are more likely to show up in top positions. Authors from academia, on the other hand, have searched for answers along the lines of revenue generation and resource-allocation efficiency. A few authors, such as Feng et al. (2007) and Lahaie (2006), studied Google's and Yahoo!'s approaches strictly as a ranking issue. Liu and Chen (2006) embedded the ranking problem in the larger issue of how to use bidders' past-performance information. After all, what
Google uses is the information on advertisers' past CTRs, which essentially signals their ability to generate clicks in the future. In fact, KAPs can also impose differentiated minimum bids on advertisers with different historical CTRs, which is what Google is doing now. This latter use of past-performance information is studied in Liu et al. (2009). Three main questions are associated with different ranking rules. What is the impact of adopting different ranking rules on advertisers' equilibrium bidding? On the KAP's revenue? And on resource-allocation efficiency? The revenue and efficiency criteria may matter at different stages of the keyword advertising industry. At the developing stage of the keyword advertising market, it is often sensible for KAPs to use efficient designs to maximize the "total pie." After all, if advertisers see high returns from their initial use, they are likely to allocate more budget to keyword advertising in the future. On the other hand, as the keyword advertising market matures and market shares stabilize, KAPs will more likely focus on revenue. Several authors in economics, information systems, and computer science have approached these questions. Feng et al. (2007) were the earliest to formally compare the ranking rules of Google and Yahoo! One focus of their study is the correlation between advertisers' pay-per-click and the relevance of their advertisements to the keywords. With numerical simulation, they find that Google's ranking mechanism performs well and robustly across varying degrees of correlation, while Yahoo!'s performs well only if pay-per-click and relevance are positively correlated. These observations are sensible. Intuitively, an advertiser's contribution to the KAP's revenue is jointly determined by the advertiser's pay-per-click and the number of clicks the advertiser can generate (i.e., relevance). When pay-per-click is negatively correlated with relevance, ranking purely on pay-per-click tends to select advertisers with low revenue contribution, which can result in a revenue loss for KAPs. Their study has certain limitations, however: instead of solving the auction model, they simplify by assuming that bidders bid truthfully under Google's mechanism. Lahaie (2006) compares Google's and Yahoo!'s ranking rules based on an explicit solution of the auction-theoretic model. He finds that Google's ranking rule is efficient while Yahoo!'s is not; Yahoo!'s ranking is inefficient because, as we mentioned earlier, a high bid does not necessarily mean a high total valuation, since total valuation also depends on relevance. He also shows that no general revenue ranking of Google's and Yahoo!'s mechanisms is possible given an arbitrary distribution over bidder values and relevance. His findings are consistent with results derived in a weighted unit-price auction framework by Liu and Chen (2006) and Liu et al. (2009), which we discuss next.
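The simulation-based comparison by Feng et al. (2007) can be illustrated with a rough Monte Carlo sketch. This is not their code: the bivariate-normal correlation structure, the parameter values, and the use of the top slot's revenue contribution (pay-per-click × CTR, under the truthful-bidding simplification noted above) as the comparison metric are all assumptions of the sketch.

```python
import random

def revenue_contribution(rho, n_advertisers=5, trials=20000, seed=0):
    """Compare the top slot's expected pay-per-click x CTR under two rankings."""
    rng = random.Random(seed)
    by_bid, by_score = 0.0, 0.0
    for _ in range(trials):
        ads = []
        for _ in range(n_advertisers):
            z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
            bid = 1.0 + 0.2 * z1                                       # truthful bid
            ctr = 0.1 + 0.02 * (rho * z1 + (1 - rho ** 2) ** 0.5 * z2)  # correlated CTR
            ads.append((max(bid, 0.01), max(ctr, 0.001)))
        top_by_bid = max(ads, key=lambda a: a[0])          # Yahoo!-style, bid only
        top_by_score = max(ads, key=lambda a: a[0] * a[1])  # Google-style, bid x CTR
        by_bid += top_by_bid[0] * top_by_bid[1]
        by_score += top_by_score[0] * top_by_score[1]
    return by_bid / trials, by_score / trials

for rho in (-0.8, 0.0, 0.8):
    bid_only, weighted = revenue_contribution(rho)
    print(f"correlation {rho:+.1f}: bid-only {bid_only:.4f}, bid x CTR {weighted:.4f}")
```

With strongly negative correlation, the bid-only ranking places high-bid, low-CTR advertisers on top and its revenue contribution falls well below that of the bid × CTR ranking, while with positive correlation the two rules pick nearly the same advertisers, which is the qualitative pattern reported above.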
While Feng et al. (2007) and Lahaie (2006) focus on two specific cases, namely Yahoo!'s price-only ranking rule and Google's ranking rule, Liu and Chen (2006) and Liu et al. (2009) study weighted unit-price auctions (WUPAs), which encompass both Yahoo!'s and Google's ranking rules. Under WUPAs, advertisers bid their willingness-to-pay per click (or unit price), and the auctioneer weighs the unit-price bids by advertisers' expected CTRs. Liu and Chen (2006) consider a single-slot setting; Liu et al. (2009) extend it to a more general multiple-slot setting and study both ranking rules and minimum-bid rules. As in Section 3, advertiser i, if assigned to slot j, generates c_ij = d_j q_i clicks, where d_j is a deterministic coefficient that captures the prominence of slot j, with d_1 ≥ d_2 ≥ ... ≥ d_m and d_1 = 1, and q_i is a stochastic quantity that captures advertiser i's CTR. A key assumption of the WUPA framework is that the KAP has information on advertisers' future CTRs. This assumption is motivated by the fact that e-commerce technologies allow KAPs to track advertisers' past CTRs and predict their future CTRs. The KAP can therefore make the ranking of advertisers depend on both their pay-per-click bids and their expected CTRs. In particular, the KAP assigns each advertiser a score s = wb, where the weighting factor w is determined by the advertiser's expected CTR: if the advertiser has a high expected CTR, the weighting factor is 1; if the advertiser has a low expected CTR, the weighting factor is w_l. Liu et al. (2009) study WUPAs in an incomplete-information setting. They assume that each advertiser has a private valuation-per-click $v \in [\underline{v}, \bar{v}]$. The distributions of valuation-per-click, F_h(v) for high-CTR advertisers and F_l(v) for low-CTR advertisers, are known to all advertisers and the KAP. The probability of being a high-CTR advertiser, α, and that of being a low-CTR one, 1 − α, are also known to all advertisers and the KAP. Furthermore, we denote by Q_h and Q_l the expected CTRs of a high-CTR advertiser and a low-CTR advertiser, respectively; it is assumed that Q_h > Q_l. Suppose advertisers' payoff functions are additive in their total valuation and their payment. Under the first-score payment rule (see Section 3),3 the payoffs of a low-CTR advertiser and a high-CTR advertiser are, respectively,

$$U_l(v, b) = Q_l\,(v - b) \sum_{j=1}^{m} d_j \Pr\{\,w_l b \text{ ranks } j\text{th}\,\} \qquad (4)$$

$$U_h(v, b) = Q_h\,(v - b) \sum_{j=1}^{m} d_j \Pr\{\,b \text{ ranks } j\text{th}\,\} \qquad (5)$$

3 In this model setting, a first-score auction is revenue-equivalent to a second-score auction.
Liu et al.'s analysis generates several insights. First, it illustrates how ranking rules affect equilibrium bidding. The ranking rule affects how low- and high-CTR advertisers match up against each other in equilibrium. Specifically, the weighting factors for low- and high-CTR
advertisers determine the ratio of valuations-per-click at which a low-CTR advertiser and a high-CTR advertiser tie in equilibrium. For example, if low-CTR advertisers are given a weighting factor of 0.6 and high-CTR advertisers a weighting factor of 1, a low-CTR advertiser with valuation-per-click 1 will tie with a high-CTR advertiser with valuation-per-click 0.6 in equilibrium. Furthermore, high-CTR advertisers with valuation-per-click higher than 0.6 out-compete all low-CTR advertisers and therefore compete only with other high-CTR advertisers. As a result, these high-CTR advertisers will bid markedly less aggressively than high-CTR advertisers with valuation-per-click lower than 0.6. Second, they identify the efficient ranking rule under the WUPA framework. Here efficiency is measured by the total expected valuation realized in the auction. The efficient ranking rule under the WUPA framework is remarkably simple: the KAP should weigh advertisers' pay-per-click bids by their expected CTRs, as if they bid their true valuation-per-click (while in fact they generally do not). Third, they characterize the revenue-maximizing ranking rule under WUPAs. The revenue-maximizing ranking rule may favor low- or high-CTR advertisers relative to the efficient ranking rule. When the distribution of valuation-per-click is the same for high- and low-CTR advertisers, the revenue-maximizing ranking rule always favors low-CTR advertisers (relative to the efficient design). But when the valuation distribution of low-CTR advertisers becomes less "disadvantaged," the revenue-maximizing ranking rule may instead favor high-CTR advertisers (relative to the efficient design).
Besides the above-mentioned research on ranking rules, Weber and Zheng (2007) study the revenue-maximizing ranking rule in a setting that resembles "paid placement." They study a model in which two competing firms can reach their customers through sponsored links offered by a search-engine intermediary. Consumers differ in "inspection cost" (the cost incurred when they click on a sponsored link to find out the surplus they can get from purchasing the product). Thus, some consumers may inspect only the first link, while others inspect both before making a purchase decision. To get the higher position, firms can offer a payment b to the search engine each time a consumer clicks on their product link (their "bid"). The search engine's problem is to choose how to rank the firms, given the consumer surplus u generated by the two firms (assumed known to the search engine) and their bids b. The authors study an additive ranking rule

$$s(\beta; u, b) = \beta u + (1 - \beta)\, b \qquad (6)$$

where the parameter β is the focal design factor. They find that the revenue-maximizing ranking design should put nonzero weight on firms' bids b. In other words, search engines have an incentive to accept "bribes" from advertisers to bias the ranking of product links.
5 How to package resources
Keyword auctions require only that bidders bid their willingness-to-pay per click (or per thousand impressions, or per action); bidders are then assigned to slots by an automatic algorithm, with higher-ranked advertisers receiving better slots (more exposure). This is different from a fixed-price scheme, where sellers specify a menu of price-quantity pairs for buyers to choose from, and from traditional divisible-good auctions, where sellers need not specify anything and buyers bid both the price and the quantity they desire. In a sense, keyword auctions strike a middle ground between the fixed-price scheme and traditional divisible-good auctions: in keyword auctions, the buyers (advertisers) specify the prices they desire, and the seller (the KAP) decides the quantities to offer. Given this unique division of tasks, how to package resources for auctioning becomes an important issue facing KAPs.
Before we address the issue of resource packaging, it is useful to clarify what we mean by resources in keyword auctions and why resource packaging is a practical issue. What KAPs sell to advertisers is impressions. Each time a page is requested by an Internet user, all advertisements on this page get an impression. Though keyword advertising is often priced by the number of clicks or "actions" (e.g., purchases), KAPs can always reallocate impressions from one advertiser to another but cannot do the same with clicks or actions. Therefore, impressions are the ultimate resource controlled by KAPs. Although slots on the same page generate the same number of impressions, they may not be equally valuable to advertisers. For example, an advertising slot is noticed more often if it is at the top of a page than at the bottom of the page. Other factors can also affect how often a slot is noticed, such as its geometric size, the time of day it is displayed, and whether the deployment website is frequented by shoppers. One way to address these differences in page impressions is to substitute raw impressions with a standardized effective exposure, which weighs impressions by how much value they deliver to an average advertiser. For example, if the effective exposure generated by one page impression at the top of a page is 1, then the effective exposure generated by one page impression at the bottom of the page might be 0.3. In the following, we study the packaging of effective exposure rather than raw page impressions.4

4 A recommendation based on effective exposure can be translated transparently into a recommendation based on raw page impressions, because KAPs can always tailor the exposure allocated to an advertisement by randomizing its placement among different slots, varying the timing and length of its appearance, and/or selecting the number of websites on which the advertisement appears.

With the notion of effective exposure, a keyword auction works as follows. The KAP packages the available effective exposure into several shares,
ordered from large to small. Advertisers are assigned to shares by their rankings, with the highest-ranked advertiser receiving the largest share, the second-highest-ranked advertiser receiving the second-largest share, and so on. The resource-packaging problem in such a setting is to decide how many shares to provide and the size of each share so as to maximize total revenue. We call this the share-structure design problem. The share-structure design problem is relevant to KAPs' day-to-day operations. The demand for and supply of keyword advertising resources are highly dynamic. On one hand, the supply of advertising resources fluctuates as new websites join KAPs' advertising networks and existing websites lose their draw with online users. On the other hand, the demand for advertising on particular keywords shifts constantly in response to changes in underlying market trends. Therefore, KAPs must constantly adjust their share structures to maximize their total revenue. To do so, KAPs need a good understanding of the share-structure design problem. Given that KAPs have become managers of tremendous advertising resources, the issue of share-structure design is critical to their success.
5.1 The revenue-maximizing share structure problem

Chen et al. (2006, 2009) address the issue of revenue-maximizing share structures with the following specifications. There are n risk-neutral advertisers (bidders). The KAP (auctioneer) packages the total effective exposure (normalized to 1) into as many as n shares arranged in descending order, $s_1 \geq s_2 \geq \cdots \geq s_n$. A share structure refers to the vector s = (s_1, s_2, ..., s_n). Table 4 shows some examples of share structures and their interpretations. A bidder's valuation for a share is determined by the size of the share (s) and a private parameter (v), called the bidder's type. v is distributed according to a cumulative distribution function F(v) on $[\underline{v}, \bar{v}]$, with density f(v). Bidders' valuations of a share take the form vQ(s), where Q(·) is an increasing function. Bidders are invited to bid their willingness-to-pay per unit of exposure (or unit price), and all shares are allocated at once by a rank-order of bidders' unit-price bids.5 Bidders pay the price they bid.6 Each bidder's expected payoff is the expected valuation minus the expected payment to the auctioneer. Denote by p_j(b) the probability of winning share j by placing bid b.

5 Google ranks advertisers by the product of their willingness-to-pay per click and a click-through-rate-based quality score, which can be loosely interpreted as advertisers' willingness-to-pay per impression (see Liu and Chen (2006) for a more detailed discussion). Yahoo! used to rank advertisers by their willingness-to-pay per click only, but recently switched to a format similar to Google's. Our assumption that bidders are ranked by their willingness-to-pay per unit of exposure is consistent with both Google's approach and Yahoo!'s new approach.
6 The expected revenue for the auctioneer is the same if bidders pay the next highest bidder's price.
Table 4
Examples of share structures

s                            Interpretation
(1, 0, 0, 0)                 The highest bidder gets all effective exposure
(0.25, 0.25, 0.25, 0.25)     The top 4 bidders each get one-fourth of the total effective exposure
(0.4, 0.2, 0.2, 0.2)         The top bidder gets 40% of the total effective exposure; the 2nd-4th highest bidders each get 20%
The expected payoff of a bidder of type v is

$$U(v, b) = \sum_{j=1}^{n} p_j(b)\left( v\,Q(s_j) - b\,s_j \right) \qquad (7)$$

The auctioneer's revenue is the expected total payment from all bidders:

$$\pi = n\,E\!\left[\, b \sum_{j=1}^{n} p_j(b)\, s_j \right] \qquad (8)$$
Bidders maximize their expected payoff by choosing a unit price b. The auctioneer maximizes the expected revenue by choosing a share structure s.

5.2 Results on revenue-maximizing share structures

Chen et al. (2009) showed that the auctioneer's expected revenue in the above setting can be written as

$$\pi = n \sum_{j=1}^{n} Q(s_j) \int_{\underline{v}}^{\bar{v}} P_j(v)\left( v - \frac{1 - F(v)}{f(v)} \right) f(v)\, dv \qquad (9)$$

where

$$P_j(v) = \binom{n-1}{n-j} F(v)^{\,n-j} \left(1 - F(v)\right)^{\,j-1} \qquad (10)$$

is the equilibrium probability that a bidder of type v wins share j. We denote

$$\alpha_j \equiv n \int_{\underline{v}}^{\bar{v}} P_j(v)\left( v - \frac{1 - F(v)}{f(v)} \right) f(v)\, dv, \qquad j = 1, 2, \ldots, n \qquad (11)$$
The expected revenue (Eq. (9)) can then be written as

$$\pi = \sum_{j=1}^{n} \alpha_j\, Q(s_j) \qquad (12)$$
Here α_j is interpreted as the return coefficient of the jth share. Chen et al. (2009) showed that the revenue-maximizing share structure may consist of plateaus, where a plateau is a set of consecutively ranked shares of the same size. For example, the third example in Table 4 has two plateaus: the first plateau consists of the first share (of size 0.4); the second plateau consists of the second to the fourth shares (of size 0.2). Chen et al. (2009) showed that the starting and ending ranks of the plateaus in the revenue-maximizing share structure are determined only by the distribution of bidders' types. Based on their analysis, the optimal starting/ending ranks of the plateaus and the optimal sizes of the shares in each plateau can be computed using the following algorithm.

1. Compute the return coefficients $\{\alpha_j\}$, $j = 1, \ldots, n$.
2. Let $j_k$ denote the ending rank of the $k$-th plateau. Set $j_0 \leftarrow 0$ and $k \leftarrow 1$.
3. Given $j_{k-1}$, compute $j_k \leftarrow \arg\max_{j \in \{j_{k-1}+1, \ldots, n\}} \frac{1}{j - j_{k-1}} \sum_{l = j_{k-1}+1}^{j} \alpha_l$.
4. If $j_k = n$, set $K \leftarrow k$ ($K$ denotes the total number of plateaus) and continue to step 5. Otherwise, set $k \leftarrow k + 1$ and go to step 3.
5. Compute the average return coefficient $\bar{\alpha}_k \leftarrow \frac{1}{j_k - j_{k-1}} \sum_{l = j_{k-1}+1}^{j_k} \alpha_l$ for each plateau $k = 1, \ldots, K$.
6. Solve the following nonlinear programming problem for the sizes of the shares $(z_1, z_2, \ldots, z_K)$ in all plateaus:
$$\max \; \sum_{k=1}^{K} (j_k - j_{k-1})\, \bar{\alpha}_k\, Q(z_k)$$

$$\text{subject to} \quad \sum_{k=1}^{K} (j_k - j_{k-1})\, z_k = 1 \quad \text{and} \quad z_1 \geq z_2 \geq \cdots \geq z_K \geq 0$$
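To make steps 1-5 concrete, here is a hedged numerical sketch rather than the authors' code. It computes the return coefficients by simple numerical integration and then groups shares into plateaus; the uniform type distribution on [0, 1] and six bidders are assumptions chosen purely for illustration (Example 1 below instead uses a truncated exponential on [1, 3]), and step 6 would additionally require a nonlinear-programming solver.

```python
from math import comb

def return_coefficients(n, F, f, v_lo, v_hi, grid=20000):
    """Step 1: alpha_j = n * integral of P_j(v) * (v - (1 - F(v)) / f(v)) * f(v) dv."""
    dv = (v_hi - v_lo) / grid
    alphas = []
    for j in range(1, n + 1):
        total = 0.0
        for i in range(grid):
            v = v_lo + (i + 0.5) * dv                       # midpoint rule
            Fv, fv = F(v), f(v)
            P_j = comb(n - 1, n - j) * Fv ** (n - j) * (1 - Fv) ** (j - 1)
            total += P_j * (v - (1 - Fv) / fv) * fv * dv
        alphas.append(n * total)
    return alphas

def plateaus(alphas):
    """Steps 2-5: ending ranks and average return coefficients of the plateaus."""
    n, ends, avgs, j_prev = len(alphas), [], [], 0
    while j_prev < n:
        best_j, best_avg = None, None
        for j in range(j_prev + 1, n + 1):                  # step 3: maximize the average
            avg = sum(alphas[j_prev:j]) / (j - j_prev)
            if best_avg is None or avg > best_avg:
                best_j, best_avg = j, avg
        ends.append(best_j)
        avgs.append(best_avg)
        j_prev = best_j                                     # step 4: move to next plateau
    return ends, avgs

# illustration: six bidders, types uniform on [0, 1]
alphas = return_coefficients(6, F=lambda v: v, f=lambda v: 1.0, v_lo=0.0, v_hi=1.0)
print([round(a, 3) for a in alphas])
print(plateaus(alphas))
```

The grouping in `plateaus` mirrors step 3: each new plateau extends to the rank that maximizes the running average of the remaining return coefficients.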
A share structure becomes steeper if we allocate more resources to high-ranked shares and less to low-ranked ones. In Table 4, the steepest share structure is (1, 0, 0, 0), followed by (0.4, 0.2, 0.2, 0.2), and then by (0.25, 0.25, 0.25, 0.25). Chen et al. (2009) obtained several results on how the revenue-maximizing share structure should change in steepness when the underlying demand or supply factors change. First, as bidders' demands become less price-elastic (as the valuation function Q(·) becomes more concave), the auctioneer should use a less steep share structure. When bidders have perfectly elastic demand (i.e., the valuation function Q(·) is a
linear function), the auctioneer should use the steepest share structure, winner-take-all. The following example illustrates this finding.
Example 1. Let the number of bidders be six and the type distribution be a (truncated) exponential distribution on [1, 3]. When Q(s) = s, the revenue-maximizing share structure is (1, 0, 0, 0, 0, 0) (winner-take-all). When Q(s) = √s, the revenue-maximizing share structure is (0.51, 0.25, 0.13, 0.07, 0.03, 0.01). When Q(s) = s^{1/4}, the revenue-maximizing share structure is (0.40, 0.25, 0.16, 0.10, 0.06, 0.03). Figure 4 plots the first to the sixth shares under the three valuation functions. The figure shows that the revenue-maximizing share structure becomes flatter as bidders' demand becomes less price-elastic.
A change in the type distribution affects the revenue-maximizing share structure through the return coefficients α_j. In the case of "scaling" (all bidders' valuations are multiplied by a common factor), all return coefficients are also scaled, and the revenue-maximizing share structure remains the same. When the type distribution is "shifted" to the right (i.e., every bidder's v increases by the same amount), the return coefficient of a low-ranked share increases by a larger proportion than that of a high-ranked share, and thus the revenue-maximizing share structure becomes less steep.
Example 2. Continue with Example 1 and fix Q(s) = √s. When the type distribution is shifted to [5, 7], the revenue-maximizing share structure becomes (0.24, 0.19, 0.17, 0.15, 0.13, 0.12). Figure 5 shows that the revenue-maximizing share structure becomes flatter when the type distribution is shifted from [1, 3] to [5, 7].
Fig. 4. Effect of price elasticity of demand.
Fig. 5. Effect of type distribution.
Another factor studied in Chen et al. (2009) is the effect of increasing the total resources available. They showed that when the total resource increases, all shares increase, but whether the share structure (in terms of percentages of the total resources) becomes flatter or steeper depends on whether bidders' price elasticity increases or decreases with the resources assigned. When bidders' price elasticity increases in the amount of resources allocated, the KAP should increase high-ranked shares by a larger percentage; when bidders' price elasticity of demand decreases, the KAP should increase low-ranked shares by a larger percentage. The above results highlight the importance of advertisers' price elasticity of demand and of the competitive landscape (as determined by the distribution of bidders' types). Generally speaking, when bidders become more price-elastic, the share structure should be steeper; when the competition between bidders is fiercer, the share structure should be flatter.
5.3 Other issues on resource packaging

The resources managed by KAPs have expanded significantly since the advent of keyword advertising. Leading KAPs have developed vast advertising networks of thousands of websites. Meanwhile, they are also actively seeking expansion to other media, including mobile devices, radio, and print advertising. The issue of resource packaging will only become more important as KAPs manage more advertising resources.
The earlier research addressed only a small part of a larger resource-packaging problem, and there are several interesting directions for future research on this issue. First, Chen et al.'s (2009) framework assumes that bidders share a common valuation function Q. A more general setting would allow bidders' valuation functions to differ; for example, bidders with highly elastic demand and bidders with inelastic demand may coexist. Feng (2008) studies a setting in which bidders differ in price elasticities, but her focus is not on share-structure design. Another interesting direction is to compare keyword auctions with alternative mechanisms for divisible goods, such as the conventional discriminatory-price and uniform-price auctions (Wang and Zender, 2002; Wilson, 1979), in which bidders bid not only on prices but also on the quantities desired. The study of revenue-maximizing share structures facilitates such a comparison, because one would need to pick a revenue-maximizing share structure for keyword auctions to make the comparison meaningful. It is also interesting to study the optimal mechanism for allocating keyword-advertising resources. Different mechanisms may be evaluated along the lines of the auctioneer's revenue, allocation efficiency, and whether the mechanism encourages bidders to reveal their true valuations. Bapna and Weber (2006) study a mechanism that allows bidders to specify their "demand curves" rather than just one price. They consider a more general setting in which multiple divisible goods are offered and bidders may have multidimensional private information. More specifically, they consider n bidders that have valuations for fractional allocations of m slots. For an allocated fraction x_i = (x_i^1, ..., x_i^m), bidder i's utility is v_i(x_i; Z_i), where Z_i represents bidder i's private information, or type. The auctioneer first announces its mechanism, which includes a fixed m-dimensional price vector p = (p_1, ..., p_m). Then each bidder submits a bid function b_i(·; Z_i). The bidders' bids are treated as discounts that are subtracted from the payment implied by the posted price schedule. Under such a setting, Bapna and Weber show that this mechanism has a dominant-strategy incentive-compatible equilibrium in which a bidder's equilibrium bids do not depend on knowledge of the type distribution, the number of bidders, or other bidders' payoff functions.
6 Click fraud
The keyword advertising industry has been extraordinarily successful in the past few years and continues to grow rapidly. However, its core "pay-per-click" advertising model faces a threat known as "click fraud." Click fraud occurs when a person, automated script, or computer program imitates a legitimate user of a web browser clicking on an advertisement, for the purpose of generating a click with no real interest in the target link. The
consequences of click fraud include depleting advertisers' budgets without generating any real returns, increasing uncertainty in the cost of advertising campaigns, and creating difficulty in estimating the impact of keyword advertising campaigns. Click fraud can ultimately harm KAPs because advertisers can lose confidence in keyword advertising and switch to other advertising outlets.
Both industry analysts and KAPs have cited click fraud as a serious threat to the industry. A Microsoft AdCenter spokesperson stated, "Microsoft recognizes that invalid clicks, which include clicks sometimes referred to as 'click fraud,' are a serious issue for pay-per-click advertising."7 In its IPO document, Google warned that "we are exposed to the risk of fraudulent clicks on our ads."8 While no consensus exists on how click fraud should be measured, "most academics and consultants who study online advertising estimate that 10% to 15% of advertisement clicks are fake, representing roughly $1 billion in annual billings" (Grow and Elgin, 2006).
Click fraud has created a lingering tension between KAPs and advertisers. Because advertisers pay for the clicks they receive, it is critical for advertisers not to pay for clicks that are invalid or fraudulent. The tension arises when advertisers and KAPs cannot agree on which clicks are valid. KAPs often do not inform advertisers which clicks are fraudulent, citing the concern that click spammers may use such information against KAPs and undermine KAPs' efforts to fight click fraud. Also, KAPs may have financial incentives to charge advertisers for invalid clicks to increase their revenues; such incentives may exist at least in the short run. A few events illustrate the tension between advertisers and KAPs. In June 2005, Yahoo! settled a click-fraud lawsuit and agreed to pay the plaintiffs' $5 million legal bills. In July 2006, Google settled a class-action lawsuit over alleged click fraud by offering a maximum of $90 million in credits to marketers who claim they were charged for invalid clicks.
Before we proceed, it is useful to clarify the two main sources of fraudulent clicks. The first is competing advertisers. Knowing that most advertisers have a daily spending budget, an advertiser can initiate a click-fraud attack on competitors to drain their daily budgets. Once the competitors' daily budgets are exhausted, their advertisements will be suspended for the rest of the day, so the attacker can snag a high rank at less cost. The second and more prevalent source of click fraud is publishers who partner with KAPs to display keyword advertisements. Many publishers earn revenue from KAPs on a per-click basis and therefore have incentives to inflate the number of clicks on the advertisements displayed on their sites.

7 http://news.com.com/2100-10243-6090939.html
8 http://googleblog.blogspot.com/2006/07/let-click-fraud-happen-uh-no.html

This became a major form of click fraud after KAPs expanded keyword advertising services to millions of websites,
including many small and obscure websites that are often built solely for advertising purposes.
One argument is that click fraud is not a real threat. This line of argument underlies current Google CEO Eric Schmidt's comment on click fraud:9

Let's imagine for purposes of argument that click fraud were not policed by Google and it were rampant . . . Eventually, the price that the advertiser is willing to pay for the conversion will decline, because the advertiser will realize that these are bad clicks, in other words, the value of the ad declines, so over some amount of time, the system is in fact self-correcting. In fact, there is a perfect economic solution which is to let it happen.
Research also shows that Google's keyword auction mechanisms resist click fraud (Immorlica et al., 2005; Liu and Chen, 2006). The reason is that advertisers who suffer from click fraud also gain in their CTR rating, which works in their favor in future auctions (recall that Google's ranking mechanism favors advertisers with high historical CTRs).
While the above arguments have merit, they also have flaws. The first argument works best when the click-fraud attack is predictable; when the attack is unpredictable, advertisers cannot effectively discount its impact. Also, unpredictable click fraud creates uncertainty for advertisers, which can make keyword advertising unattractive. As to the second argument, while receiving fraudulent clicks has positive effects under the current system, it is unclear whether the positive effects dominate the negative ones.
In what follows, we discuss measures to detect and to prevent click fraud. Detection efforts such as online filtering and off-line detection reduce the negative impact of fraudulent clicks. Preventive measures such as alternative pricing or a rental approach can reduce or eliminate the incentives to conduct click fraud.

6.1 Detection

6.1.1 Online filtering

A major tool used in combating click fraud is an automatic algorithm called a "filter." Before charging advertisers, major KAPs use automatic filter programs to discount suspected fraudulent clicks as they occur. Such filters are usually rule-based. For example, if a second click on an advertisement occurs immediately after a first click, the second click (the "doubleclick") is automatically marked as invalid, and the advertiser does not pay for it. KAPs may deploy multiple filters so that if one filter misses a fraudulent click, another may still catch it. Tuzhilin (2006) studied the filters used by Google and concluded that Google's effort in filtering out invalid clicks is reasonable, especially after Google started to treat doubleclicks as invalid clicks in 2005.9

9 http://googleblog.blogspot.com/2006/07/let-click-fraud-happen-uh-no.html
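As an illustration of the kind of rule-based online filter described above, here is a minimal sketch. It is not any KAP's actual system: the five-second window and the use of the (IP address, advertisement) pair as the key are assumptions made purely for the example.

```python
from collections import defaultdict

DOUBLECLICK_WINDOW = 5.0   # seconds; hypothetical threshold

class DoubleclickFilter:
    def __init__(self, window=DOUBLECLICK_WINDOW):
        self.window = window
        self.last_click = defaultdict(lambda: float("-inf"))  # (ip, ad_id) -> last time

    def is_valid(self, ip, ad_id, timestamp):
        """Return False for repeat clicks inside the window (advertiser not charged)."""
        key = (ip, ad_id)
        valid = timestamp - self.last_click[key] > self.window
        self.last_click[key] = timestamp
        return valid

clicks = [("1.2.3.4", "ad42", 10.0), ("1.2.3.4", "ad42", 11.5), ("5.6.7.8", "ad42", 12.0)]
f = DoubleclickFilter()
for ip, ad, t in clicks:
    print(ip, ad, t, "valid" if f.is_valid(ip, ad, t) else "invalid (doubleclick)")
```

Real filters combine many such rules and are deliberately kept undisclosed, which is part of the tension with advertisers discussed earlier.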
While some fraudulent clicks are easy to detect (e.g., doubleclicks), others are very difficult. For example, it is virtually impossible to determine whether a click is made by a legitimate Internet user or by a laborer hired cheaply in India to click on competitors' advertisements.10 The current filters are still simplistic (Tuzhilin, 2006). More sophisticated and time-consuming methods are not used in online filters because they do not work well in real time. As a result, current filters may miss sophisticated and less-common attacks (Tuzhilin, 2006). The fact that advertisers have requested refunds or even pursued lawsuits over click fraud indicates that filter programs alone cannot satisfactorily address the click-fraud problem.

10 http://timesofindia.indiatimes.com/articleshow/msid-654822,curpg-1.cms

6.1.2 Off-line detection

Off-line detection methods do not have the real-time constraint. An off-line detection team can therefore deploy more computationally intensive methods and consider a larger set of click data and many other factors (such as conversion data). Off-line detection can be automatic or manual. Google uses automated off-line detection methods to generate fraud alerts and to terminate publishers' accounts for fraudulent click patterns. Automatic off-line detection methods are pre-programmed and thus cannot react to new fraud patterns. Google also uses manual off-line detection to inspect click data questioned by advertisers, alert programs, or internal employees. While such manual detection is powerful, it is hardly scalable. Unlike online filtering, off-line detection does not automatically credit advertisers for invalid clicks; however, if a case of click fraud is found, advertisers are refunded.

6.2 Prevention

First of all, KAPs may prevent click fraud by increasing the cost of conducting it. KAPs have taken several other steps to discourage click spammers, including (Tuzhilin, 2006):
- making it hard for publishers to create duplicate accounts or open new accounts after their old accounts are terminated,
- making it hard for publishers to register using false identities, and
- automatically discounting fraudulent clicks so that click spammers are discouraged.
All of these prevention efforts rely on a powerful click-fraud detection system. However, a powerful and scalable click-fraud detection system is very difficult, if not impossible, to develop, and the above prevention efforts are dwarfed if sophisticated click spammers can pass detection.
6.2.1 Alternative pricing

Pay-per-click is susceptible to click fraud because clicks can be easily falsified. Observing this, some have suggested different pricing metrics, such as pay-per-action (e.g., pay-per-call and pay-per-purchase), as a remedy for click fraud. Because purchases and calls are much more costly to falsify, switching to a pay-per-action or pay-per-call pricing scheme would overcome the click-fraud problem.
Pay-per-action pricing is, however, unlikely to be a remedy for all advertisers. Sometimes outcome events such as purchases are hard to track or define (e.g., should KAPs count a purchase made the day after the customer visited the link?). Other times, advertisers may be reluctant to share purchase information with KAPs. Finally, different advertisers may be interested in different outcome measures: direct marketers are more interested in sales, while brand advertisers may be interested in the time Internet users spend on their websites.
One may suggest going back to the pay-per-impression model to prevent click fraud. However, pay-per-impression is subject to fraud of its own kind: knowing that advertisers are charged on a per-impression basis, a malicious attacker can request the advertising pages many times to exhaust the advertisers' budgets; similarly, publishers can recruit viewers to their websites to demand higher revenue from KAPs. Goodman (2005) proposed a pricing scheme based on a percentage of impressions. The assumption is that if attackers systematically inflate impressions, advertisers still pay the same amount because they still receive the same percentage of all impressions. While this proposed pricing scheme addresses the click-fraud problem to a large extent, it also has drawbacks. For example, such a pricing scheme does not automatically adjust to changes in overall legitimate traffic, so web publishers have no incentive to increase the popularity of their websites. Also, pay-per-percentage-of-impressions pricing imposes all risk on advertisers. In general, advertisers are more risk-averse than KAPs, and it is often revenue-maximizing for KAPs to absorb some of the risk.

6.2.2 Rental model

Another possible remedy is a rental model in which advertisers bid on how much they are willing to pay per hour of exposure. Clearly, such a pricing model is immune to the click-fraud problem. The rental model can be implemented in different ways. One way is to ask each advertiser to bid on each slot, with the KAP assigning each slot to its highest bidder. Alternatively, the KAP can ask advertisers to bid on the first slot only, provided that they agree to receive other slots at discounted prices proportional to their bids for the first slot. Such a rental model can be valuable when advertisers have a reasonable idea of how much exposure they can get from a particular slot. Of course, when the outcome is highly uncertain, a rental model also exposes advertisers to grave risks.
In sum, a single best solution to the click-fraud problem may not exist. While alternatives to pay-per-click advertising may remove the incentives to conduct click fraud, they often come with other costs and limitations. Clearly, future keyword auction designs must take the click-fraud problem into account.
7 Concluding remarks
In this chapter, we have reviewed the current research on keyword advertising auctions, with an emphasis on keyword-auction design. Keyword auctions were born out of practice and have unique features that the previous auction literature had not studied. They are still evolving, giving researchers an opportunity to influence future keyword-auction designs. Given the central position of search and advertising in the online world, research on keyword auctions holds important practical value.
It is worth noting that keyword auctions, as a mechanism for allocating massive resources in real time, are not limited to online advertising settings. Other promising areas of application include grid-computing resources, Internet bandwidth, electricity, radio spectrum, and some procurement settings. In fact, on May 21, 2007, Google filed a proposal with the Federal Communications Commission calling for the use of keyword-auction-like mechanisms to allocate radio spectrum. In the proposal, Google argued that a keyword-auction-like real-time mechanism would improve the fairness and efficiency of spectrum allocation and create a market for innovative digital services. As keyword auctions are proposed and tested as a general mechanism in other settings, several important questions arise. For example, what conditions are required for keyword auctions to perform well? And what needs to change for keyword auctions to apply in new settings?
This chapter has focused on design issues within keyword advertising settings. It would also be interesting to compare keyword auctions with other alternative mechanisms in different settings. It is not immediately clear whether keyword auctions are superior to, for instance, dynamic pricing or a uniform-price auction in which bidders bid both price and quantity. More research is needed to integrate the relatively new keyword auctions into the existing auction literature. We believe research in this direction will yield new theoretical insights and contribute to that literature.
References

Asdemir, K. (2005). Bidding patterns in search engine auctions. Working Paper, University of Alberta School of Business.
Asker, J.W., E. Cantillon (2008). Properties of scoring auctions. RAND Journal of Economics 39(1), 69–85.
Bapna, A., T.A. Weber (2006). Efficient allocation of online advertising resources. Working Paper, Stanford University.
Che, Y.-K. (1993). Design competition through multidimensional auctions. RAND Journal of Economics 24(4), 668–680.
Chen, J., D. Liu, A.B. Whinston (2006). Resource packaging in keyword auctions, in: Proceedings of the 27th International Conference on Information Systems, December, Milwaukee, WI, pp. 1999–2013.
Chen, J., D. Liu, A.B. Whinston (2009). Auctioning keywords in online search. Forthcoming in Journal of Marketing.
Edelman, B., M. Ostrovsky, M. Schwarz (2007). Internet advertising and the generalized second price auction: Selling billions of dollars worth of keywords. American Economic Review 97(1), 242–259.
eMarketer (2007). Online advertising on a rocket ride. eMarketer News Report, November 7.
Ewerhart, C., K. Fieseler (2003). Procurement auctions and unit-price contracts. RAND Journal of Economics 34(3), 569–581.
Feng, J. (2008). Optimal mechanism for selling a set of commonly ranked objects. Marketing Science 27(3), 501–512.
Feng, J., H. Bhargava, D. Pennock (2007). Implementing sponsored search in web search engines: Computational evaluation of alternative mechanisms. INFORMS Journal on Computing 19(1), 137–148.
Goodman, J. (2005). Pay-per-percentage of impressions: An advertising method that is highly robust to fraud. Workshop on Sponsored Search Auctions, Vancouver, BC, Canada.
Grow, B., B. Elgin (2006). Click fraud: The dark side of online advertising. Business Week, October 2.
Immorlica, N., K. Jain, M. Mahdian, K. Talwar (2005). Click fraud resistant methods for learning click-through rates. Workshop on Internet and Network Economics.
Interactive Advertising Bureau (2007). Internet advertising revenues grow 35% in '06, hitting a record close to $17 billion. Interactive Advertising Bureau News Press Release, May 23.
Lahaie, S. (2006). An analysis of alternative slot auction designs for sponsored search, in: Proceedings of the 7th ACM Conference on Electronic Commerce, Ann Arbor, MI, ACM Press.
Liu, D., J. Chen (2006). Designing online auctions with performance information. Decision Support Systems 42(3), 1307–1320.
Liu, D., J. Chen, A.B. Whinston (2009). Ex-ante information and the design of keyword auctions. Forthcoming in Information Systems Research.
Sullivan, D. (2002). Up close with Google AdWords. Search Engine Watch Report.
Tuzhilin, A. (2006). The Lane's Gifts v. Google Report. Available at http://googleblog.blogspot.com/pdf/TuzhilinReport.pdf. Retrieved on December 25, 2007.
Varian, H.R. (2007). Position auctions. International Journal of Industrial Organization 25(6), 1163–1178.
Vise, A.D., M. Malseed (2005). The Google Story. New York, NY.
Wang, J.J.D., J.F. Zender (2002). Auctioning divisible goods. Economic Theory 19(4), 673–705.
Weber, T.A., Z. Zheng (2007). A model of search intermediaries and paid referrals. Information Systems Research 18(4), 414–436.
Wilson, R. (1979). Auctions of shares. The Quarterly Journal of Economics 93(4), 675–689.
Zhang, X., J. Feng (2005). Price cycles in online advertising auctions, in: Proceedings of the 26th International Conference on Information Systems, December, Las Vegas, NV, pp. 769–781.
Chapter 4
Web Clickstream Data and Pattern Discovery: A Framework and Applications
Balaji Padmanabhan
Information Systems and Decision Sciences Department, College of Business, University of South Florida, 4202 E. Fowler Avenue, CIS 1040, Tampa, FL 33620, USA
Abstract

There is tremendous potential for firms to make effective use of Web clickstream data. In a perfect world a firm will be able to optimally manage online customer interactions by using real-time Web clickstream data. Here it may be possible to proactively serve users by determining customer interests and needs before they are even expressed. Effective techniques for learning from this data are needed to bridge the gap between the potential inherent in clickstream data and the goal of optimized customer interactions. Techniques developed in various fields including statistics, machine learning, databases, artificial intelligence, information systems and bioinformatics can be of value here, and indeed several pattern discovery techniques have been proposed in these areas in the recent past. In this chapter we discuss a few applications of pattern discovery in Web clickstream data in the context of a new pattern discovery framework presented here. The framework is general and we note other applications and opportunities for research that this framework suggests.
1 Background
From a business perspective the Web is widely recognized to be a key channel of communication between firms and their current and potential customers, suppliers and partners. Early on, several firms used the medium to provide information to customers, a large number of whom still transacted offline. However, this has rapidly changed for a number of reasons, as noted below.
A factor that facilitated this transition has been the steady increase in customer comfort in using the Web to transact. Web user surveys
conducted annually reflect this trend clearly.1 For instance, one of the questions in a 1996 survey asked users to provide their degree of agreement with the statement "Providing credit card information through the Web is just plain foolish."2 In 1996, survey respondents were divided on this, although slightly more of them disagreed. Responses to such security questions over the years suggest a trend toward greater comfort in transacting online. Improved security technology, such as strong encryption standards, has clearly played a significant role here. If payment information such as credit card data had to be sent in plain text, then the cost of battling online fraud would be prohibitive. This of course did not apply to the pre-Web days in which users submitted credit card data over the telephone, for two reasons. First, telephone networks mostly went through switches operated by large telecom firms, and physically tapping into such a network was hard. Second, telephones, unlike personal computers, did not run software applications, some of which may be malicious programs that can intercept and re-direct data.
Equally important, comfort with policies that relate to online transactions has also increased. Consumers today are, for instance, more savvy about procedures for online returns. A recent poll by Harris Interactive3 revealed that 85% of respondents indicated that they are not likely to shop again with a direct retailer if they found the returns process inconvenient. Online retailers have recognized the importance of this and have substantially eased the returns process for most goods purchased online. However, one area that remains an issue today is online privacy. While firms have online privacy policies, it is not clear to what extent consumers read, understand and explicitly accept some of the data use and data sharing policies currently in place.
Other key reasons for the transition to transacting online are increased product variety (Brynjolfsson et al., 2003) and the overall convenience of online shopping. Online auctions such as eBay have enabled every item in every home to potentially be available online. Online services contribute to this as well: services such as Google Answers provide consumers with often-immediate access to experts on a wide variety of issues, ranging from troubleshooting computers to picking the right school for children. In terms of convenience, a book that is not required immediately can be purchased online in minutes, compared to the many hours it could otherwise take to check its availability and purchase it from a retail bookstore.
This increased use of the Web is also evident from macroeconomic indicators released by the U.S. Census Bureau.4

1 http://www-static.cc.gatech.edu/gvu/user_surveys/
2 http://www-static.cc.gatech.edu/gvu/user_surveys/survey-10-1996/questions/security.html
3 See "Return to Sender: Customer Satisfaction Can Hinge on Convenient Merchandise Returns", Business Wire, Dec 13, 2004.
4 http://www.census.gov/mrts/www/ecomm.html

From being virtually
non-existent a decade or so ago, U.S. retail ecommerce sales in 2004 were $71 billion, accounting for 2% of total retail sales in the economy. More recently, in the third quarter of 2007, retail ecommerce sales were estimated at $32.2 billion, accounting for 3.2% of total retail sales.
2 Web clickstream data and pattern discovery
As this trend has played out, firms have invested significant resources in tracking, storing and analyzing data about customer interactions online. In a recent commercial report,5 the authors indicate that the worldwide Business Intelligence (BI) tools market grew to $6.25 billion in 2006. Note that this amount only captures expenditure on BI software purchases by firms and does not include internally developed tools or the cost of labor. Compared to other channels of interaction with customers, a unique characteristic of the Web is that every single item viewed or piece of content seen by a user is captured in real time by Web servers. This results in massive amounts of detailed Web clickstream data captured at various servers. Further, the two component terms in "clickstream" have both meaningfully changed over the years. When Web servers were first used, the hypertext transfer protocol determined what was captured every time a user clicked on a link online. Typically the tracked information was content captured from http headers, such as the time of access, the Internet Protocol (IP) address of the user's computer and the file name of the page requested. Today, firms capture a large amount of more specific data, such as the content that was shown to a user on the screen before the user clicked on a link. For instance, a single user click on some page at an online content site translates into a large number of variables that describe the environment at the time of this activity. Examples of these variables include the list of all products that were displayed, the list of specific online advertisements that were shown on the page and the specific categories of products that appeared on the user's screen at that time. Further, Internet use has significantly increased in the last several years, perhaps even disproportionately to the number of major content sites that are accessed. Hence the "stream" part of "clickstream" has also significantly increased for the major online firms. Some reports, for instance, put the number of unique users at Yahoo! at more than 400 million in 2007. The rate at which data streams in from such a large user base contributes to several terabytes of data collected each day. Firms are naturally interested in leveraging such a resource, subject to stated privacy policies. Toyota, for instance, may prefer to have its online advertisements shown to users who are tagged as "likely auto buyers"

5 Worldwide Business Intelligence Tools 2006 Vendor Shares, Vesset and McDonough, IDC Inc., June 2007.
rather than to an urban family that may have no interest in cars. The granularity at which clickstream data is collected today has enabled online firms to build much more accurate customer models, such as one that might score a user as a potential auto buyer. In the context of customer relationship management (CRM), in a perfect world a firm would be able to optimally manage online customer interactions by using real-time Web clickstream data to determine customer interests and needs and to proactively serve users. Between Web clickstream data and the implementation of models to manage online interactions lies the critical component of learning from this data. One approach to learning from Web clickstream data is to use pattern discovery techniques. As defined in Hand et al. (2001), we use the term "pattern" to mean some local structure that may exist in the data. This is in contrast to "models" (Hand et al., 2001), which represent global structure. Models are also built to make specific predictions, unlike pattern discovery techniques, which are often used for exploratory analysis. However, models may also be informed by the pattern discovery process. For instance, pattern discovery may help unearth a previously unknown behavioral pattern of a Web user, such as a specific combination of content that this user consumes. New features constructed from such patterns may help build better prediction models learned from clickstream data. There is a large amount of research in the interdisciplinary data mining literature on pattern discovery from Web clickstream data (see Srivastava et al. (2000) and Kosala and Blockeel (2000) for reviews). Having a framework for understanding the contributions of the different papers in this literature can help in making sense of this large (and growing) body of work. The purpose of this chapter is not to survey this literature but to discuss one framework for pattern discovery under which some of this research can be better understood. Certainly there can be many approaches for grouping the various contributions relating to pattern discovery from Web clickstream data. One approach might be based on the application, for instance, where different research papers are grouped based on the specific application addressed (e.g., online advertising, product recommender systems, dynamic Web page design). Another approach for grouping may be based on the specific pattern discovery technique used. Yet another approach may be based on the reference literature from which the pattern discovery techniques come, given that pattern discovery has been addressed in various areas. In the next section, we discuss one framework for pattern discovery that is general and can be applied to provide a useful perspective on specific pattern discovery papers in the literature. Another application of the framework may be to group the literature based on dimensions specific to this framework. We present examples in different domains to show how this framework helps in understanding research in pattern discovery. To motivate the application of this framework in the Web clickstream
context we will use this framework to explain two different approaches taken in the literature to segment online users using patterns discovered from Web clickstream data. We discuss the relationship between these two specific segmentation applications and conclude by providing a discussion of other opportunities that this framework suggests.
3 A framework for pattern discovery
As noted in Section 2, the data mining literature has a large body of work on pattern discovery from Web clickstream data. One characteristic of the data mining area is a focus on pattern discovery (Hand, 1998). In such cases the focus is usually not on prediction but on learning interesting "local" patterns that hold in a given database. Taking a different perspective, there has also been substantial research in this literature on learning models from large databases. A primary goal of model building in this literature is prediction in very large databases. These models are usually computationally intensive and are evaluated mainly on predictive accuracy. Pattern discovery techniques can be completely described based on three choices: the representation chosen for the pattern, the method of evaluation by which a specific pattern is deemed "interesting" and, finally, an algorithm for learning interesting patterns in this representation. Below we discuss these three choices and present examples.

3.1 Representation

First, pattern discovery techniques have to make an explicit choice or assumption regarding what forms a pattern can take. Specifically, a representational form has to be chosen. Some examples of representations considered in the pattern discovery literature in data mining are itemsets and association rules (Agrawal et al., 1993), quantitative rules (Aumann and Lindell, 1999), sequences (see Roddick and Spiliopoulou, 2002) and temporal logic expressions (Padmanabhan and Tuzhilin, 1996). An itemset is a representation for a set of items {i1, i2, ..., ik} that occur together in a single transaction. While the initial application for this was market basket analysis, there have been other applications, such as learning the set of Web pages that are frequently accessed together during a single session. An association rule, however, is represented as I1 → I2, where I1 and I2 are both itemsets and I1 ∩ I2 = {}. Unlike itemsets, the association rule representation is not used to convey a notion of mutual co-occurrence; rather, it is used to indicate that if I1 exists in a transaction then I2 also exists. For instance, it may even be the case that {I1, I2} does not occur often, but whenever I1 occurs in a transaction
then I2 also occurs.6 Depending on what captures the notion of a "pattern" in a specific application, one or both of these representations may be useful. The "items" in itemsets are usually based on categorical attributes (although they have been used for continuous attributes based on discretization). Quantitative rules extend the representation of typical association rules to one where the right-hand side of the rule is a quantitative expression such as the mean or variance of a continuous attribute (Aumann and Lindell, 1999). A sequence is yet another example of a representation. Srikant and Agrawal (1996) defined a sequence as an ordered list of itemsets ⟨I1, I2, ..., Ik⟩. The ordering is important and is used to represent a pattern where a series of itemsets follow one another (usually in time, where transactions have time stamps associated with them). Such a representation is useful where patterns relating to the order of occurrences are relevant.

6 Standard association rule discovery algorithms, however, use itemset frequency constraints for practical as well as computational reasons.

3.2 Evaluation

Given a representation, what makes a specific pattern in this representation interesting? Some examples of evaluation criteria considered in pattern discovery include the support and confidence measures for association rules (Agrawal et al., 1993) and frequency (for sequences and temporal logic expressions). For an association rule I1 → I2, support is the percentage of all transactions in the data set that contain {I1, I2}. Confidence is defined, based on a measure of conditional probability, as the percentage of transactions where I1 is present in which I2 is also present. Frequency for sequences is defined as the number of times a specific sequence occurs in a database. The main point here is that these measures (support, confidence, frequency) are all different evaluation criteria for patterns in a given representation. Further, the criteria are specific to each representation, i.e., it is meaningful to compute the support of an itemset, but confidence only applies to rules and not to individual itemsets.

3.3 Search

Given a representation and a method of evaluation, search is the process of learning patterns in that representation that meet the specified evaluation criteria. The development of efficient search algorithms is a critical component given the size and high dimensionality of the databases that these methods are designed for.
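As a small, self-contained illustration of the support and confidence measures just defined, the sketch below computes them over a handful of made-up Web sessions. The page names, the sessions and the printed values are purely hypothetical, and this is only the evaluation step, not a search procedure such as Apriori.

```python
# Toy Web sessions: each transaction is the set of pages visited in one session.
# All data here is hypothetical, purely for illustration.
transactions = [
    {"home", "sports", "finance"},
    {"home", "sports"},
    {"home", "finance"},
    {"sports", "finance"},
    {"home", "sports", "finance"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Support of (antecedent plus consequent) divided by support of the antecedent."""
    return (support(set(antecedent) | set(consequent), transactions)
            / support(antecedent, transactions))

print(support({"home", "sports"}, transactions))                   # 0.6
print(confidence({"home", "sports"}, {"finance"}, transactions))   # about 0.67
```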
The Apriori algorithm (Agrawal et al., 1995) and its many improvements (see Hipp et al., 2000) are examples of efficient search algorithms for learning frequent (evaluation) itemsets (representation) and association rules (representation) with high support and confidence (evaluation). The GSP (Generalized Sequential Patterns) algorithm (Srikant and Agrawal, 1996) is a search technique that learns all frequent (evaluation) sequential patterns (representation) subject to specific time constraints (also evaluation) regarding the occurrences of itemsets in the sequence. The time constraint, for instance, can specify that all the itemsets in the sequence have to occur within a specified time window. These additional constraints can be considered part of the evaluation criteria for a pattern (i.e., a pattern is considered "good" if it is frequent and satisfies each additional constraint). Pattern discovery is also often application-driven. In some cases the context (the domain plus the specific application) drives the choice of the specific representation and evaluation criteria. Search is well defined given specific choices of representation and evaluation, and hence it is, in this sense, only indirectly application-driven, if at all.

3.4 Discussion and examples

The process of making choices in the representation-evaluation-search (R-E-S) dimensions also helps identify specific differences between the inductive methods developed in the data mining literature and those developed in other areas such as statistics. Compared to other literatures, the data mining area has developed and studied a different set of representations for what constitutes a pattern, developed and studied different evaluation criteria in some cases, and developed and studied various search algorithms for pattern discovery. While it is difficult to compare different representations and evaluation criteria developed across disciplines, studying multiple plausible representations (and evaluation criteria and search methods) is by itself a critical component of the process of understanding what constitutes real structure in observed data. Engineers often use the term "reverse engineering" to describe understanding the principles of how something works by observing its operations, and much research in various data-driven fields is similar in spirit and is often guided by (necessary) inductive bias in the R-E-S dimensions (particularly representation). In principle such methods contribute to the inductive process in scientific reasoning. Below we discuss a few examples that illustrate these three choices (Figs. 1–3 summarize pictorially the framework and the examples).

Fig. 1. Three steps in pattern discovery.
Fig. 2. Context-motivated choices for representation, evaluation and search.
Fig. 3. Examples of specific choices for representation, evaluation and search. Clockwise from top-left these are from Padmanabhan and Tuzhilin (1996), Padmanabhan and Tuzhilin (1998) and Swanson (1986), respectively.

Padmanabhan and Tuzhilin (1996) addressed the problem of learning patterns in sequences (such as genetic sequences, or a series of discrete system events captured about network behavior). Prior work (Mannila et al., 1995) had used episodes as a representation for a pattern in a
sequence. An episode was defined as a directed graph in which the links between nodes represented the observation that one event in the sequence occurred before the other event did. Padmanabhan and Tuzhilin (1996) extended the episodes representation to a more general form using a temporal logic representation. An example of such a temporal logic expression is A Until_K B, capturing the occurrence of event A zero to K times just before event B occurs. The operator Until is a temporal logic operator. For instance, the sequence ⟨C, A, B, D, C, A, A, B, A, B, A⟩ contains the pattern A Until_2 B three times.
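To make the frequency evaluation of such temporal patterns concrete, the sketch below counts occurrences of A Until_K B in an event sequence under one plausible reading of the operator as described above (an occurrence of B immediately preceded by a run of at most K consecutive A's). The function name and this reading are illustrative and are not claimed to be the exact semantics used in Padmanabhan and Tuzhilin (1996).

```python
def count_until(sequence, a, b, k):
    """Count occurrences of 'a Until_k b': an occurrence of b preceded
    immediately by a run of zero to k consecutive a's (illustrative reading)."""
    count = 0
    for i, event in enumerate(sequence):
        if event != b:
            continue
        run, j = 0, i - 1
        while j >= 0 and sequence[j] == a:
            run += 1
            j -= 1
        if run <= k:
            count += 1
    return count

seq = ["C", "A", "B", "D", "C", "A", "A", "B", "A", "B", "A"]
print(count_until(seq, "A", "B", 2))  # 3, matching the example in the text
```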
In this case, the directed graph representation considered before in Mannila et al. (1995) was extended, since the temporal logic approach permitted more general ways of reasoning about patterns in time. The evaluation for patterns, both for the episodes approach and for the temporal logic approach, was a simple measure of frequency (counting the number of occurrences of a specific pattern in a sequence). In both cases, the papers also presented new search techniques: a method for efficiently discovering episodes in large sequences in Mannila et al. (1995) and a method for learning specific classes of temporal logic patterns in sequences in Padmanabhan and Tuzhilin (1996). In the previous example, the representation and search dimensions are the ones in which the main contributions were made by the papers discussed. The next example focuses specifically on work where the contribution is mainly in the evaluation dimension. In the late 1990s there was a lot of work in the data mining area on developing fast algorithms for learning association rules in databases. Much research, as well as applications in industry, suggested that most of the patterns discovered by these methods, while considered "strong" based on existing evaluation metrics, were in reality obvious or irrelevant. If strong patterns are not necessarily interesting, what makes patterns interesting, and can such patterns be systematically discovered? Padmanabhan and Tuzhilin (1998) developed a new evaluation criterion for the interestingness of patterns. Specifically, they defined an association rule to be interesting if it was unexpected with respect to prior knowledge. This approach requires starting from a set of rules that capture prior domain knowledge, elicited from experts or from rules embedded in operational systems used by firms. A discovered association rule is deemed interesting if it satisfies threshold significance criteria and if it contradicts a rule in the existing knowledge base. For instance, in a subset of retail scanner data relating to the purchase of beverages (categorized as regular or diet), prior knowledge may represent some condition such as female → diet beverages. A rule that satisfies threshold significance criteria and contradicts the prior knowledge, such as female, advertisement → regular beverages, is defined to be unexpected. The definition presented is based on contradiction in formal logic, and Padmanabhan and Tuzhilin (1998, 2000) present efficient algorithms to learn all unexpected patterns defined in this manner. In this example, the representation for patterns (association rules) was not new. In contrast, the evaluation criterion developed was the main contribution and was one that focused specifically on the fundamental problem of what makes patterns interesting. In this case, rather than using the evaluation criterion as a "filter" to select rules generated by an existing technique, new search algorithms were proposed to learn only the unexpected rules, and hence the contribution is along two dimensions (evaluation and search). In the previous two examples the choice of representation, evaluation and search did not depend in any meaningful way on the application domain in
which it was used. In contrast, consider the following example. In the field of information science, Swanson (1986) made a seminal contribution in a paper on identifying "undiscovered public knowledge". Swanson was particularly interested in learning potential treatments for medical conditions from publicly available information. A well-known example of a discovery facilitated by Swanson (1986) was that fish oil may be a potential treatment for Raynaud's disease. This was identified as a potential undiscovered treatment since:
1. the Medline literature had numerous published scientific articles about blood viscosity and Raynaud's disease (the disease apparently was correlated with higher blood viscosity),
2. the literature also had numerous published articles about fish oil and blood viscosity (these articles frequently noted that fish oil helped lower blood viscosity), and
3. the literature had few or no articles that discussed fish oil and Raynaud's disease directly, suggesting that this was not a well-known link.
Note here that (a) the original work was not a completely automated approach and (b) the work was in a different area (information science) and was presented even before the field of data mining gained momentum. However, this is an excellent example of the potential power of inductive approaches such as data mining in a world in which an increasingly large amount of information is automatically captured. In the R-E-S framework, the representation for patterns such as the one discovered in Swanson (1986) is a triple ⟨A, B, C⟩ where A, B and C are phrases. For instance, ⟨fish oil, blood viscosity, Raynaud's disease⟩ is one specific such triple (pattern). The evaluation is a measure with two components. The first requires A, B and C to represent a potential treatment, a disease condition and a disease, respectively; this requires background knowledge such as domain ontologies. The second component is a binary indicator based on the counts of documents that contain the pairwise terms. Specifically, this component may be defined to be one if count(A, B) is high, count(B, C) is high and count(A, C) is low, where count(X, Y) is the number of Medline documents in which the phrases X and Y co-occur. Search is then designing efficient algorithms for learning all such triples. In this example too the main contribution is in the evaluation, but this is an instance where the choice of the three dimensions, from a pattern discovery perspective, is driven by the specific application.
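A minimal sketch of the count-based component of this evaluation is given below. The pair counts, thresholds and category lists are entirely made up, and the ontology-based step of deciding which phrases are treatments, conditions and diseases is assumed to have been done already; the sketch only shows how the high/high/low co-occurrence test over triples might be expressed.

```python
from itertools import product

# Hypothetical pairwise co-occurrence counts of phrases in a document collection.
pair_counts = {
    ("fish oil", "blood viscosity"): 120,
    ("blood viscosity", "raynaud's disease"): 85,
    ("fish oil", "raynaud's disease"): 1,
}

treatments = ["fish oil"]          # assumed output of an ontology-based filter
conditions = ["blood viscosity"]
diseases = ["raynaud's disease"]

HIGH, LOW = 50, 5                  # illustrative thresholds, not from the paper

def count(x, y):
    """Symmetric lookup of the co-occurrence count for a pair of phrases."""
    return pair_counts.get((x, y), pair_counts.get((y, x), 0))

# A triple <A, B, C> is flagged when A-B and B-C co-occur often but A-C rarely do.
candidates = [
    (a, b, c)
    for a, b, c in product(treatments, conditions, diseases)
    if count(a, b) >= HIGH and count(b, c) >= HIGH and count(a, c) <= LOW
]
print(candidates)  # [('fish oil', 'blood viscosity', "raynaud's disease")]
```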
4 Online segmentation from clickstream data
In the previous section we presented a pattern discovery framework and discussed how specific examples of pattern discovery approaches may be
viewed in this framework. In this section and the next we continue to examine this link, but specifically for pattern discovery applications that arise in the context of Web clickstream data. The examples discussed in this section are specifically related to learning interesting user segments from Web clickstream data. Zhang et al. (2004) present an approach motivated by the problem of learning interesting market share patterns in online retail. An example of a pattern discovered by this method is

Region = South and household size = 4 ⇒ marketshare(xyz.com) = 38.54%, support = 5.4%

The data set consists of book purchases at a subset of leading online retailers. Each record in the data set consists of one online purchase of books. The discovered rule highlights one customer segment (which covers 5.4% of all records) in which xyz.com has an unusually low market share. Generalizing from this, Zhang et al. (2004) defined a new class of patterns called statistical quantitative rules (SQ rules) in the following manner. Given (i) sets of attributes A and B, (ii) a data set D and (iii) a function f that computes a desired statistic of interest on any subset of data from the B attributes, an SQ rule was defined in Zhang et al. (2004) as a rule of the form

X ⇒ f(D_X) = statistic, support = sup

where X is an itemset (conjunction of conditions) involving attributes in A only, D_X is the subset of D satisfying X, the function f computes some statistic from the values of the B attributes in the subset D_X, and support is the percentage of transactions in D satisfying X. This representation built on prior representations (association rules and quantitative rules) in the data mining literature. In association rules the antecedent and consequent were conjunctions of conditions, whereas in quantitative rules the consequent was a quantitative measure such as the mean of some attribute. In SQ rules the consequent is defined to be a more general function (possibly involving several attributes) of the specific segment considered by the rule. These rules were evaluated based on statistical significance, specifically on whether the computed quantitative measure for a segment was different from values that might be expected by chance alone. The evaluation criterion was therefore not novel (standard statistical significance). However, to construct the needed confidence intervals, Zhang et al. (2004) use randomization to create data sets in which the attributes pertaining to the computed function are made independent of the others. Given the high computational complexity of creating several randomized data sets, particularly when the size of the data is very large, Zhang et al. (2004) present an efficient computational technique that exploits specific problem
characteristics for learning interesting market share rules from the data. Hence the search technique developed, a computational method based on randomization, was a contribution here as well. Along the three dimensions, the representation and search dimensions are the main dimensions in which Zhang et al. (2004) makes contributions to the literature. We note two characteristics of the above application. First, it learns purchase-based segments, i.e., segments defined based on dollar volumes spent at competing online retailers. Second, it uses Web clickstream data gathered on the client side. Such data is available from data vendors such as comScore Networks and tracks the Web activity of users across multiple sites. In contrast to this, we next describe another pattern discovery method for online segmentation that discovers behavioral segments as opposed to purchase-based segments, and that can be used by online retailers directly on the Web clickstream data that they individually observe (i.e., it does not need user clickstream data across firms). Yang and Padmanabhan (2005) present a segmentation approach based on pattern discovery that is motivated by grouping Web sessions into clusters such that the behavioral patterns learned from one cluster are very different from the behavioral patterns learned from other clusters. This motivation is similar to standard cluster analysis, but the difference is in how behavioral patterns are defined. In their approach, a behavioral pattern is defined as an itemset such as:

{day = Saturday, most_visited_category = sports, time_spent = high}

The evaluation for each such pattern is a count of how often it occurs in a set of Web sessions, and in any given cluster the set of all such patterns can be learned efficiently using standard techniques in the literature. Given that any cluster consists of a set of such behavioral patterns learned from the cluster, Yang and Padmanabhan (2005) develop a distance metric that computes the difference between two clusters based on how different the behavioral patterns learned from them are. Based on this distance metric they develop a greedy hierarchical clustering algorithm that learns pattern-based clusters. Hence, given a set of user sessions at an online retailer, this approach learns clusters such that online user behavior is very different across different clusters. In this sense this approach develops a behavioral segmentation approach specifically for Web sessions. Interestingly, the result of this analysis can, in some cases (where the number of different users is very small), identify individual users. That is, even if the user ID is ignored, the segments learned sometimes end up isolating different users. In most cases the method does not do this but instead isolates different behaviors, which is the main objective of the approach. Yang and Padmanabhan (2005) also showed how this approach can be used to learn explainable clusters in real Web clickstream data.
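One simple way to make the idea of comparing clusters through their pattern sets concrete is sketched below: each cluster is summarized by its frequently occurring attribute-value items, and two clusters are compared with a Jaccard distance over those sets. This is only an illustration under simplifying assumptions (single-item patterns, a made-up frequency threshold and toy sessions); it is not the distance metric or the GHIC algorithm of Yang and Padmanabhan (2005).

```python
def frequent_patterns(sessions, min_count=2):
    """Summarize a cluster by the attribute=value items (single-item patterns)
    that occur at least min_count times in its sessions."""
    counts = {}
    for session in sessions:
        for item in session:
            counts[item] = counts.get(item, 0) + 1
    return {item for item, c in counts.items() if c >= min_count}

def cluster_distance(cluster_a, cluster_b, min_count=2):
    """Jaccard distance between the two clusters' pattern sets: 0 when they
    exhibit the same frequent behavior, 1 when the behaviors are disjoint."""
    pa = frequent_patterns(cluster_a, min_count)
    pb = frequent_patterns(cluster_b, min_count)
    if not pa and not pb:
        return 0.0
    return 1 - len(pa & pb) / len(pa | pb)

# Hypothetical sessions described by attribute=value items.
weekend_sports = [{"day=Saturday", "category=sports", "time=high"}] * 3
weekday_news = [{"day=Monday", "category=news", "time=low"}] * 3
print(cluster_distance(weekend_sports, weekday_news))  # 1.0: very different behavior
```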
In the Yang and Padmanabhan (2005) approach there is no new contribution along the R-E-S dimensions for what constitutes a behavioral pattern online. Rather, the contribution along these dimensions is a new representation of a cluster (as a set of behavioral patterns), a new objective function for clustering (i.e., the evaluation dimension) that takes into account differences between patterns in different clusters, and a new greedy heuristic for learning such clusters. While the approaches for online segmentation discussed above are generally viewed as applications of pattern discovery to Web clickstream data, viewing them in the R-E-S framework helps to appreciate exactly the more general contributions that are made to the pattern discovery literature. In both the segmentation examples described in this section the dimensions were largely motivated by the specific application domain (segmentation in Web clickstream data). Yet the choices for the R-E-S dimensions were not standard, and existing pattern discovery methods could not directly be used. Instead the papers developed new approaches for these, thereby making more general contributions to the pattern discovery literature. More generally, while the examples in this section and in Section 3.4 do not prove that the R-E-S framework is good, they provide evidence that the framework is useful in some cases for identifying the more general contributions made by applied pattern discovery research. Further, the Web segmentation applications show how clickstream data can motivate interesting pattern discovery problems that can result in broader contributions and generalizable pattern discovery techniques. While this section focused on examples in online segmentation, the next section briefly identifies other problems to which pattern discovery from Web clickstream data has been applied and discusses connections to the R-E-S framework.
5 Other applications
Below we briefly discuss other applications of pattern discovery to Web clickstream data and highlight the R-E-S dimensions of importance in these cases. The applications and R-E-S possibilities below are not exhaustive. They are mainly intended to illustrate how this framework can be used to be clear about the specific choices made along the different dimensions and to identify where the main contributions are. There has been substantial research (e.g., Perkowitz and Etzioni, 2000; Srikant and Yang, 2001) on automatically reconfiguring Web sites based on learning patterns related to how users access specific content. For instance, if most users who visit a certain site navigate several levels before they get to some commonly accessed page, the site design might conceivably be improved to make it easier to access this content. Here the pattern representation considered may be a sequence (of pages). Learning the set of
all frequent sequences of pages accessed can help in understanding the popular paths visitors take at a Web site. The evaluation is usually based on a count of how often sequences occur, and existing search algorithms such as GSP/Apriori can be directly used. In the same context of improving Web site design, some authors (Srikant and Yang, 2001) have studied backtracking patterns. Such patterns are argued to be important since they suggest cases where users locate content only after some trial and error that involves backtracking (going back to previous pages to follow new links). Here the goal is to learn specific types of sequences where users visit the same page again and branch in a different direction. One example of this is the work of Srikant and Yang (2001). The representation again is sequences; the evaluation is based on counts and on whether there exists a backtracking event in a given sequence. In this case a new algorithm was also developed (Srikant and Yang, 2001) for learning such patterns efficiently. Rules learned from Web clickstream data can also be used to make recommendations of products or types of content a user may be interested in. There is a large literature on recommender systems and on learning user profiles based on such rules (e.g., Adomavicius and Tuzhilin, 2001; Aggarwal et al., 1998; Mobasher et al., 2002). The representations of these rules may or may not have an explicit temporal component. For instance, rules of the form "if (A, B, C) then (D, E)" may indicate that most users who access (like) A, B and C also access (like) D and E. Such rules are easily learned from the matrices used for collaborative filtering. These rules can also be modified to "if A, B, C (0, t) then D, E (t, t + k)", thereby explicitly having a temporal component that captures the fact that the behavior in the consequent should only occur after the behavior in the antecedent. In the specific example shown, the content accessed in the consequent of the rule is within k time units after a user accesses A, B and C. The literature on sequential pattern discovery (Roddick and Spiliopoulou, 2002; Srikant and Agrawal, 1996) addresses such pattern discovery methods. The research problems here are on developing appropriate new representations and search algorithms. Adomavicius and Tuzhilin (2001) note that rules generated from Web clickstream data may need to be validated by domain experts before they are used in making specific recommendations. However, validation of individual rules may be impractical given that rule discovery methods can learn thousands of these rules for each user. To address this, Adomavicius and Tuzhilin (2001) present a system that facilitates this rule validation process by using validation operators that permit experts to select groups of rules for simultaneous validation. From the R-E-S perspective the rule validation process can be viewed as a critical evaluation component for discovered patterns. The evaluation of patterns is not solely based on traditional strength measures, but on user-defined criteria that the validation system described in Adomavicius and Tuzhilin (2001) facilitates.
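As an illustration of what a backtracking event in a page sequence might look like operationally, the sketch below flags a session in which the user returns to a previously visited page and then branches to a different next page than before. The page names are hypothetical and the check is deliberately simple; the actual algorithm in Srikant and Yang (2001) is considerably more involved.

```python
def has_backtrack(pages):
    """Return True if the visitor returns to a previously seen page and then
    branches to a different next page than on the earlier visit."""
    next_seen = {}  # page -> page visited immediately after its first visit
    for current, nxt in zip(pages, pages[1:]):
        if current in next_seen and next_seen[current] != nxt:
            return True
        next_seen.setdefault(current, nxt)
    return False

# Hypothetical navigation: the user goes home -> products -> specs,
# backs up to products, then branches to reviews instead.
print(has_backtrack(["home", "products", "specs", "products", "reviews"]))  # True
print(has_backtrack(["home", "products", "specs"]))                         # False
```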
More generally, beyond product recommendations, future systems may need to make process recommendations for online users intelligently. As firms increasingly develop their online presence and as customers increasingly use this channel to transact, it will be important to develop proactive methods for assisting customers, just as a sales clerk may come over in a physical store when a customer appears to be in need of help. There is the potential to do this automatically (i.e., determine from observed real-time clickstream data that a customer is in need), but this has not been studied as yet. The R-E-S framework raises questions that can be useful here in building such methods. What is the representation for patterns indicating that a user is in need? What is the evaluation criterion, and how can such patterns be learned? Methods for detecting online fraud may also use patterns learned from Web clickstream data. These methods broadly fall into two classes. In the first, the methods must be able to determine that some sequence of Web activity is unusual. This requires a definition of usual or normal behavior. One possibility is to define user profiles based on behavioral patterns, as done in Adomavicius and Tuzhilin (2001). Then a new stream of clicks can be evaluated against an existing user profile to determine how likely the access is from a given user. The second class of methods actually builds explicit representations of what fraudulent activity may look like. The approaches that do this are in online intrusion detection (Lee et al., 1999), where the goal is to determine hacks or security compromises in computer networks. One example is a (malicious) program attempting to connect on specific port numbers in sequence. If the behavior of such a program is known (based on experts who study how networks get hacked into) then specific patterns in that representation may be learned. In both these examples, the contributions can be in all three R-E-S dimensions. Pattern representations may be novel here, the evaluation criterion (what is usual/unusual) is critical as well, and methods for learning typical (or unusual) patterns are important.
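A minimal sketch of the first class of methods is shown below: a new session is scored against a profile of a user's frequent behavioral patterns, and a low overlap suggests unusual activity. The profile contents, the pattern form and the simple overlap score are illustrative assumptions, not the profiling method of Adomavicius and Tuzhilin (2001).

```python
def profile_score(session_items, profile_patterns):
    """Fraction of the profile's frequent patterns that also appear in the new
    session; a low score suggests the session is unusual for this user."""
    if not profile_patterns:
        return 1.0
    matched = sum(pattern <= session_items for pattern in profile_patterns)
    return matched / len(profile_patterns)

# Hypothetical profile learned from a user's past sessions.
profile = [
    frozenset({"category=finance", "time=morning"}),
    frozenset({"category=finance", "device=desktop"}),
    frozenset({"login=usual_ip"}),
]

usual = {"category=finance", "time=morning", "device=desktop", "login=usual_ip"}
odd = {"category=gift_cards", "time=3am", "device=new", "login=foreign_ip"}

print(profile_score(usual, profile))  # 1.0
print(profile_score(odd, profile))    # 0.0, would be flagged for review
```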
6 Conclusion
As firms increasingly use the Web to interact with customers, Web clickstream data becomes increasingly valuable, since it captures information pertaining to every interaction a customer has with a firm. This naturally presents opportunities for leveraging this data using pattern discovery approaches, and there has been substantial research on various topics related to pattern discovery from Web clickstream data. This chapter presented a framework for pattern discovery and showed how this framework can be used both to understand different pattern discovery techniques proposed in the literature and to understand the research on applications of these techniques to Web clickstream data. Examples in a
variety of applications such as online segmentation, Web site design, online recommendations and online fraud highlight both the value that pattern discovery techniques can provide and the value of the R-E-S framework as a tool to better understand the pattern discovery approaches developed for these problems.
References

Adomavicius, G., A. Tuzhilin (2001). Using data mining methods to build customer profiles. IEEE Computer 34(2).
Aggarwal, C., Z. Sun, P.S. Yu (1998). Online generation of profile association rules, in: Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, August, New York, NY.
Agrawal, R., T. Imielinski, A. Swami (1993). Mining association rules between sets of items in large databases, in: Proceedings of the 1993 ACM SIGMOD Conference on Management of Data, Washington, DC, pp. 207–216.
Agrawal, R., H. Mannila, R. Srikant, H. Toivonen, A.I. Verkamo (1995). Fast discovery of association rules, in: U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, R. Uthurusamy (eds.), Advances in Knowledge Discovery and Data Mining. AAAI Press, Menlo Park, CA.
Aumann, Y., Y. Lindell (1999). A statistical theory for quantitative association rules, in: Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA, pp. 261–270.
Brynjolfsson, E., Y. Hu, M.D. Smith (2003). Consumer surplus in the digital economy: estimating the value of increased product variety at online booksellers. Management Science 49(11), 1580–1596.
Hand, D. (1998). Data mining: statistics and more. The American Statistician 52, 112–118.
Hand, D.J., H. Mannila, P. Smyth (2001). Principles of Data Mining. The MIT Press, Cambridge, MA.
Hipp, J., U. Guntzer, G. Nakhaeizadeh (2000). Algorithms for association rule mining—a general survey and comparison. SIGKDD Explorations 2(1), 58–64.
Kosala, R., H. Blockeel (2000). Web mining research: a survey. SIGKDD Explorations 2(1), 1–15.
Lee, W., S.J. Stolfo, K.W. Mok (1999). A data mining framework for building intrusion detection models, in: Proceedings of the IEEE Symposium on Security and Privacy, Oakland, CA, pp. 120–132.
Mannila, H., H. Toivonen, A.I. Verkamo (1995). Discovering frequent episodes in sequences, in: Proceedings of the First International Conference on Knowledge Discovery and Data Mining, Montreal, Canada, August, pp. 210–215.
Mobasher, B., H. Dai, T. Luo, M. Nakagawa (2002). Discovery and evaluation of aggregate usage profiles for web personalization. Data Mining and Knowledge Discovery 6(1), 61–82.
Padmanabhan, B., A. Tuzhilin (1996). Pattern discovery in temporal databases: a temporal logic approach, in: Proceedings of KDD 1996, Portland, OR, pp. 351–355.
Padmanabhan, B., A. Tuzhilin (1998). A belief-driven method for discovering unexpected patterns, in: Proceedings of KDD 1998, New York, NY, pp. 94–100.
Padmanabhan, B., A. Tuzhilin (2000). Small is beautiful: discovering the minimal set of unexpected patterns, in: Proceedings of KDD 2000, Boston, MA, pp. 54–64.
Perkowitz, M., O. Etzioni (2000). Adaptive web sites. Communications of the ACM 43(8), 152–158.
Roddick, J.F., M. Spiliopoulou (2002). A survey of temporal knowledge discovery paradigms and methods. IEEE Transactions on Knowledge and Data Engineering 14(4), 750–767.
Srikant, R., R. Agrawal (1996). Mining sequential patterns: generalizations and performance improvements, in: Proceedings of the 5th International Conference on Extending Database Technology: Advances in Database Technology, March 25–29, Avignon, France.
Srikant, R., Y. Yang (2001). Mining web logs to improve website organization, in: Proceedings of the 10th International Conference on World Wide Web, Hong Kong, May 01–05 (WWW '01). ACM Press, New York, NY, pp. 430–437.
Srivastava, J., R. Cooley, M. Deshpande, P. Tan (2000). Web usage mining: discovery and applications of usage patterns from Web data. SIGKDD Explorations Newsletter 1(2), 12–23.
Swanson, D.R. (1986). Fish oil, Raynaud's syndrome, and undiscovered public knowledge. Perspectives in Biology and Medicine 30, 7–18.
Yang, Y., B. Padmanabhan (2005). GHIC: a hierarchical pattern-based clustering algorithm for grouping web transactions. IEEE Transactions on Knowledge and Data Engineering 17(9), 1300–1304.
Zhang, H., B. Padmanabhan, A. Tuzhilin (2004). On the discovery of significant statistical quantitative rules, in: Proceedings of KDD 2004, Seattle, WA, pp. 374–383.
Chapter 5
Customer Delay in E-Commerce Sites: Design and Strategic Implications
Deborah Barnes
Market Research and Member Insights, 9800 Fredericksburg Road, San Antonio, TX 78240, USA
Vijay Mookerjee
Information Systems & Operations Management, The School of Management, The University of Texas at Dallas, PO Box 860688 SM 33, Richardson, TX 75080-0688, USA
Abstract

This chapter explores how e-commerce firms consider potential delays to their consumers as a component of their overall profitability. A successful e-commerce strategy must incorporate the impact of potential delays into important firm decisions such as: IT capacity and its allocation, advertising dollars spent, the quality of service (i.e., more or less delay) provided to consumers, and the ease with which pricing information can be discovered at competing web sites.
Opportunities to conduct business in the online environment have been considered a vital component of traditional information systems development; however, the need for speed (efficient processing mechanisms) is magnified in the e-business context. System users in this context are clients and customers, and thus increasing delays hold a double threat: inefficiency (a symptom found in the traditional context) and potential customer and revenue loss. In order both to manage the efficiency of the system and to reduce potential customer loss, e-business firms may approach the delay problem from several points of view. One consideration is to manage the demand arriving at the site. For example, the objective of spending valuable dollars on advertising is to generate traffic to the firm's web presence; however, what if the demand generated exceeds the capacity of the web space and therefore delays (perhaps in excess of the consumer's tolerance) are experienced?
The firm may want to jointly consider the budgets allocated to advertising and IT capacity in order to manage delays. Another delay management technique might be more short term: creating some sort of "Express Lane" in the online environment. In a traditional store, salespersons do not necessarily provide the same quality of service to all customers. For example, an important customer may get more attention. In an e-business environment, it is not clear if web sites are designed to accommodate differentiated service. One reason for providing differentiated service is that customers may exhibit different amounts of impatience, i.e., the point at which they leave if they are made to wait. A customer's impatience could be a function of what the customer intends to purchase, the intended purchase value, and so on. For example, it is quite reasonable to expect that delay tolerance (the opposite of impatience) will increase with intended purchase value. This suggests the equivalent of an "express" check-out lane in a grocery store. On the other hand, it is also reasonable to expect that the processing requirements of a high-value transaction will also be high. In an e-business site, the average time waited depends on the processing requirements of a transaction. This feature is unlike a grocery store, where the average amount waited (total time in the queue minus actual service time) at a check-out counter does not depend on the items purchased. Thus attention should be allocated across the customers after a careful analysis of several effects: (1) the likelihood of the customer leaving due to impatience, (2) the revenue generated from the sale if it is successfully completed, and (3) the processing requirements of the transaction. Finally, how does a firm manage quality of service and delay in a competitive environment? If the customer has the option to go elsewhere, how will the firm's strategy change? In addition, are there any circumstances in which a firm may want to intentionally delay a customer; that is, can delay be used strategically? A firm's delay management plan may not always be to reduce delay, but instead to employ delay strategically. We see examples of built-in delay in travel sites when the site is "trying to find the best deal". Why might a firm build in well-managed delays such as this (perhaps for competitive reasons, or to block shop-bots), and how do consumers' search behaviors affect the use of strategic delay? The main outcome of interest in this chapter is the management of delay to optimally benefit the firm. Firms may seek to reduce customer and revenue loss or to use delay to increase market share. The focus of this chapter is how to modify firm decisions such as IT capacity and its allocation, advertising dollars spent, service differentiation technique, and competitive strategy in order to maximize the benefits derived from the firm's web presence. The chapter is organized in five sections. Section 1 introduces the assumed e-commerce environment and consumer behaviors, Section 2 focuses on balancing demand generation (advertising dollars) with the website's capacity to support customers, Section 3 looks at how to provide
differentiated services (i.e., the online equivalent of an "Express Lane"), Section 4 examines how delay should be managed in a competitive environment, and Section 5 concludes the chapter.

1 E-commerce environment and consumer behavior
1.1 E-commerce environment

For simplicity we will consider the structure in Fig. 1 for e-commerce sites. It is important to note that the browsing and buying behaviors of users are separated onto two functional servers: the Catalog server handles browsing activities, while the Transaction server handles buying activities. Requests are submitted to the Catalog server and processed according to the needs of the consumers.

Fig. 1. Simplified e-commerce site structure.

1.2 Demand generation and consumer behaviors

The arrival rate of consumer requests to the Catalog server (the demand level) is a counting process characterized by a Poisson distribution. The Poisson distribution applies to the probability of events occurring discretely when the probability is unchanging in time. Using the Poisson distribution to simulate customer arrivals implies that the time between arrivals is exponentially distributed.1,2 This property states that if the last arrival has not occurred for some time (t) then the density that the next arrival will occur in t further time units is the same as the exponential density, i.e., it does not depend on t. Hence the system does not hold any memory.

1 See Ross (2000), Chapter 5, for an in-depth explanation of the Poisson and exponential distributions.
2 For further justification of this assumption, see Burman (1981), O'Donovan (1974), and Sakata et al. (1971).

There are three possible outcomes for users currently being managed by the Catalog server: (1) the consumer may choose to purchase items and is therefore transferred to the Transaction server, (2) the
consumer may browse the site only and continue to be served by the Catalog server, or (3) a consumer may exit the site early. This last, "Early Quit" scenario may be due to a variety of reasons, including consumer impatience or expiring consumer time budgets. While consumers may leave the Transaction server before the transaction has been completed, we consider this scenario much less likely as the user has already invested in the process. That is, a user who has provided shipping information and a payment method is not as likely to quit before purchase. Therefore, our focus on customer loss examines the impact of delays experienced on the Catalog server.

1.3 System processing technique

The system needs to be able to support multiple requests simultaneously while preventing too much delay on the part of any of the separate requests. The system needs to allocate the processing power equitably across all requests. Round-robin processing is one of the simplest time-shared processing techniques. In this technique, all users are given an equal share of the processing time in turn. This technique is well suited to environments where the jobs to be executed are relatively similar in size. By using this method, we can control expected waiting times for users based on the possible distribution of jobs and the quantum allocated to jobs. The time waited for each job will be proportional to the processing time attained. A new customer or user enters the system at the end of the queue and must wait some unknown amount of time before they receive their allotted processing unit. It is possible that a customer leaves the queue before receiving their processing unit. If the customer's processing needs are met after one processing unit, their session ends and they exit the system; if not, the customer returns to the end of the queue. The customer will repeat this process until his processing needs are met, at which time the customer exits the system.

For example, a time slot or quantum could be 50 milliseconds per user. Imagine a queue with 3 users: user 1 has a 75 millisecond job, user 2 has a 25 millisecond job and user 3 has a 150 millisecond job. The first user would use all 50 milliseconds of processing time and then return to the end of the queue. The second user would self-terminate after the job was complete after 25 milliseconds and exit the system. The third user would use all 50 milliseconds of processing time and then return to the end of the queue. At this time there are only two users in the queue: user 1 and user 3. A new user arrives, user 4 with a 40 millisecond job, and is added to the end of the queue. This time user 1 completes his job and self-terminates after 25 milliseconds and exits the system. Again, user 3 uses all 50 milliseconds of processing time and returns to the end of the queue. User 4 completes his job and self-terminates after 40 milliseconds and exits the system. Finally, user 3, the only remaining user in the system, is able to complete his job after 50 milliseconds and exits the system.
In addition, instead of using a fixed quantum, the processor could allocate time based on the number of jobs in the queue. For example, if the number of jobs is n, then each user will receive (1/n) of the processing capacity of the resource.
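The round-robin discipline described above can be expressed as a short simulation. The sketch below reproduces the 50 millisecond quantum example, under the simplifying assumption that user 4 joins the queue after the first three turns rather than at a specific clock time.

```python
from collections import deque

def round_robin(jobs, quantum, arrivals=None):
    """Simulate round-robin scheduling. `jobs` maps user -> job length (ms).
    `arrivals` maps a completed-turn count to (user, job length) pairs that
    join the queue at that point (a simplification of real arrival times).
    Returns each user's completion time in the order they finish."""
    arrivals = arrivals or {}
    queue = deque(jobs)
    remaining = dict(jobs)
    clock, turns, finish = 0, 0, {}
    while queue:
        user = queue.popleft()
        used = min(quantum, remaining[user])
        clock += used
        remaining[user] -= used
        if remaining[user] == 0:
            finish[user] = clock      # job done: the user exits the system
        else:
            queue.append(user)        # otherwise back to the end of the queue
        turns += 1
        for new_user, work in arrivals.get(turns, []):
            queue.append(new_user)
            remaining[new_user] = work
    return finish

# The 50 ms quantum example from the text: users 1-3 start in the queue and
# user 4 (a 40 ms job) joins after the first three turns.
print(round_robin({"user1": 75, "user2": 25, "user3": 150}, 50,
                  arrivals={3: [("user4", 40)]}))
# {'user2': 75, 'user1': 150, 'user4': 240, 'user3': 290}
```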
2 The long-term capacity planning problem
2.1 Allocating spending between advertising and information technology in electronic retailing3

Firms must make decisions regarding the capacity of the servers supporting e-retail sites. Generally, this is a long-term decision and cannot be adjusted daily in order to meet the varying traffic demands of the e-retail site. If the firm does not provide enough capacity to support the site, consumers will experience delays and possibly leave the site, causing a loss in potential revenue for the firm. However, additional capacity is not cheap, and the firm does not want to overspend on capacity as this will reduce profits. Therefore, it is important that the e-retailer choose an appropriate level of capacity to support the e-commerce site.

Demand for the site can be affected by many factors. Seasonal trends, item popularity, and advertising campaigns can all influence demand for products sold on an e-commerce site. While a firm does not have explicit control over the demand (traffic) for the e-commerce site, the e-retailer may influence traffic patterns using advertising campaigns to promote items on the site. In order to address this complex relationship between advertising's demand stimulation and the corresponding server capacities, we must first examine a few related topics: (1) coordinated decision-making among functional units of the firm and (2) advertising response curves, which relate advertising and the stimulation of demand.

2.1.1 Coordinated decision-making

There has been a separation of the functional components of firms into distinct units: marketing, production, accounting, payroll, information technology, and so on. Each of these distinct units has performance objectives to satisfy, which may be influenced by factors internal and external to the firm. In addition, these objectives may not be aligned globally across departmental units, which may cause problems for the firm at large. It may be in the interest of the firm to enforce coordination schemes across functional units in order to ensure the global objectives of the firm are met. For example, suppose the marketing department has launched an enormous campaign promoting the widgets sold by Firm XYZ. In the production department of the same firm, the machine that makes widgets has been experiencing problems, lowering production. The marketing campaign boosts demand for the widgets of the firm; however, the firm cannot fulfill the orders due to production problems and thus loses out on potential profits and wastes the advertising budget spent on the campaign. If instead the advertising campaign budget had been determined in conjunction with the
production capabilities of the firm at present, Firm XYZ would have reduced the size of the campaign (if allowing one at all) to suit the level of production available.

3 For additional information, see Tan and Mookerjee (2005).
Although a simple example, it illustrates the need for coordination across departments. Functional units optimizing local objectives do not guarantee an optimal global solution for the firm. The levels of demand satisfying the individual and local objectives of each department most likely do not match, and therefore the firm will incur some kind of loss due to this mismatch. If production levels are too high, the firm will incur inventory-holding costs; whereas, if the marketing campaign generates demand that is too high, some of the marketing budget will have been wasted.

2.1.2 Advertising response functions

In order to increase demand for products and services, firms often use sales promotions (in various forms such as coupons, price discounts, and other mechanisms) to entice customers to purchase their products. Studies of the relationship between advertising and market share show that there is a maximum saturation level (the total market size) beyond which additional advertising dollars will not increase market share. This relationship between advertising and market share has been described as an S-shaped curve4 where very low advertising expenditures and very high advertising expenditures have little impact on market share, while advertising expenditures in a "middle" range show clear improvements in market share. Figure 2 illustrates the S-curve graphically.

Fig. 2. Advertising response as an S-curve.

4 See Carpenter et al. (1988), Johansson (1979), Little (1975), Mahajan and Muller (1986), and Villas-Boas (1993).
5 See Eliashberg and Steinberg (1993), Fauli-Oller and Giralt (1995), Ho et al. (2002), Morgan et al. (2001), and Shapiro (1977).

2.1.3 Departmental roles

The relationship between marketing and production has been well analyzed5; however, the relationship between marketing and IT capacities has not been bridged. Advertising expenditures are known to stimulate demand up to a threshold level. Assuming the simple e-commerce site structure
in Fig. 1, the advertising-IT problem is presented. Similar to the production problem discussed regarding Firm XYZ and the widgets, in an e-commerce setting flooding an e-retailer's site with requests can actually cause exorbitant wait times and possibly denial of service. In fact, in most firms IT departments are demand takers, meaning that based on the advertising budget allocated for advertising campaigns, IT resources must provide support for the demand stimulated at the e-retailer's site. It may be in the best interest of the firm to evaluate the costs of allowing IT to function as a demand taker, such as overspending on capacity. If coordination between marketing and IT can be achieved, the e-retailer may be better off.

Marketing department. The marketing department launches advertising campaigns in an attempt to increase demand. With no advertising the existing demand is λ0 (λ0 > 0). By spending on ad campaigns, demand can be increased following the S-curve presented in Fig. 2. The specific functional form which relates the demand level (λ) and the advertising spending A is

A = a ln[ (λ∞/λ0 - 1) / (λ∞/λ - 1) ]    (1)
where A is the advertising spending, a the advertising cost parameter that measures advertising effectiveness, l0 the initial demand level, and lN the total market size. IT department. The IT department is responsible for processing customer requests during a session. m represents the capacity of the IT resource (the catalog server). The larger the value of m the faster the processing of customer session requests. When modeling the loss of customer mathematically, there are many influencing variables such as the ratio of arrival rate to processing time, the service rate, and the maximum number of sessions where the services rate and maximum number of sessions both represent site capacity parameters. If the site is experiencing response problems due to maximum number of sessions then the site would most likely be experiencing a denial of service attack, which is not the focus of this discussion. If the maximum number of sessions is infinite then the number of sessions is not a constraint and the response time/service rate(m) can be the object of focus. 2.1.4 Site performance assumptions In order to define the site performance characteristics, several assumptions regarding demand, processing time, customer impatience, and customer time budgets are put in place. As described earlier, customers arrive according to a Poisson distribution. In addition, we focus on the capacity needs of the Catalog server not the Transaction server as illustrated in Fig. 1.
124
D. Barnes and V. Mookerjee
Generic processing time. Each session requires a mean processing time 1/m and a generic distribution (this means that no specific probability distribution has been assumed). The processing time for a session is the total time required to process all requests generated during a given session. Due to differing processing time per request there can be a variable number of requests in a session; therefore, allowing the total processing time distribution for a session to be generic provides robustness. Customer impatience. Although customers remain impatient, the server does not immediately account for customers who have left due to impatience. Therefore, the loss of impatient customers does not relieve congestion on the server. Customer time budgets exponentially distributed. Characterizing the time budgets as exponential takes advantage of the aforementioned property that the time incurred thus far does not impact the likelihood of the time budget expiring during a discrete future time interval. In general, this scenario can be described as a M/G/1/K/PS queue (Exponential, General, One-Server, K-Connections, Processor-Sharing). M refers to the memory-less nature of the arrivals to the queue expressed by the exponential inter-arrival times characterized in Poisson counting processes. The remaining elements are expressed as follows: (G) processing times are generic, (1) one-server, (K) possible sessions at a time and (PS) a shared processor. 2.1.5 Centralized planner case In a centralized setting, the firm is a profit maximizer, and hopes to balance the revenues generated from the site, the cost of additional capacity, and the cost of advertising. The e-retailer’s profit function can be written as: p Sðl; mÞ ðg0 þ g1 mÞ AðlÞ
(2)
where S(l,m) is the net revenue based on the value of arriving customers less the lost customers (h(lL)). m is a variable in the revenue function as the processing capacity will determine how many customers are processed and also those that are lost. (g0 þ g1m) is the IT cost and A(l) is the advertising cost based on the S-curve response function discussed earlier in (Eq. (1)). Neither the IT department or the marketing department have access to all of this information; therefore, a centralized planner must solve this problem collecting the IT capacity costs from the IT department, the advertising costs from the marketing department, and compute the revenue function. By evaluating the profits, demand levels, and processing capacity with respect to the change of advertising and capacity costs using the partial
Ch. 5. Customer Delay in E-Commerce Sites
125
derivatives, we find the following: @F o0 @a
and
@F o0 @g1
(3)
where F ¼ p, l, m the optimal profit, demand level, and processing capacity respectively. From the first partial derivative ((@F/@a)o0) we can see that as advertising becomes more costly IT capacity should be reduced, and from the second (@F/@g1)o0 that as IT capacity becomes more costly advertising should be reduced. With an increase in the cost of advertising or IT capacity, it is intuitive that both the optimal demand level and the IT capacity should decrease. For example, if the advertising cost parameter (a) is held constant while the IT capacity cost l1 is increased, the e-retailer should decrease capacity. In addition, this has a negative impact on the number of sessions that can be processed and completed. Therefore, the level of advertising should also be adjusted downward, even though the cost of advertising has been held constant. This example highlights the importance of coordinating the marketing and IT functions. In a decentralized setting, marketing chooses a demand level, based purely on the advertising cost parameter, a. However, it is clear from the above argument that IT capacity cost should be considered by marketing to arrive at the optimal level of advertising. 2.1.6 Uncoordinated case If IT is set up as a demand taker then the marketing department will choose a demand level l and derive value (hlA) without considering the capacity issues the IT department may face based on this traffic perturbation. There is inherent asymmetry in this setting: marketing chooses demand locally, whereas IT reacts to this demand with an optimal choice for the capacity. This case is uncoordinated as there is no cooperation across the IT and Marketing departments. Based on the given demand level, the IT department chooses a capacity level m and derives value (hL(l, m)g0g1m) balancing the value of lost customers with the cost of capacity. Marketing overspends on advertising to attract more demand than what is optimal, causing IT to incur too much cost for increasing capacity. In this case, over-advertisement will result in the loss of profit. In fact, profit in decentralized case becomes worse as cost of IT capacity increases. This shows additional support for the need to coordinate the marketing and IT decisions. When an appropriate coordination scheme is imposed, the optimal demand level and IT capacity can be achieved.
126
D. Barnes and V. Mookerjee
2.1.7 IT as a cost center In the uncoordinated case, the capacity costs of the IT department are ignored by the marketing department. In this case, the marketing department chooses a demand level which is sub-optimal and too high. In order to achieve the optimal profits that would be attained if there was a central planner, the e-retailer must make the IT department’s capacity costs a part of the marketing departments advertising decisions. Currently, the marketing department makes advertising decisions based on h, the average per session value of the site. In order to adjust the decision-making process of the marketing department, a reduced session value x can be used by the marketing department to make advertising budget decisions. If the session value is reduced, then the marketing department will choose to spend less on advertising, and thus reduce the demand burden on the IT department. The objective of the e-retailer is to find x such that the IT demand levels and the marketing demand levels are equivalent. Although not optimal, an approximate solution for determining the reduced session value x follows: Step 1: IT assumes that a ¼ 0 and estimates x as rffiffiffiffiffiffiffiffiffiffiffiffiffiffi n hg x h g1 l1 1 Step 2: Marketing chooses a demand level l based on the reduced session value x provided. Step 3: IT chooses capacity: m ¼ c(l). In determining x, the IT department uses parameters known to the IT department such as the average per session value h, the marginal cost of capacity g1, the customer impatience level n, and the maximum demand level lN. While this approach does not allow for the exact profits gained in a centralized case, it does approximate those profits very well. In addition, this policy allows a simple coordination of the marketing and IT departments which yields higher profits than a strictly uncoordinated case. 2.1.8 IT as a profit center If the IT department operates as a profit center a processing fee is charged to marketing becomes a source of revenue to the IT department. Marketing is required to pay this processing fee as there is no alternate source of IT resources within the firm. In this setup, Marketing chooses a demand level lM deriving value (hlMZ(lM)A), and Z(lM) is the processing contract of the IT department for use of resources. The process contract specifies the cost of using the IT resources. In equilibrium the two demand levels will match. Likewise, IT chooses capacity m and demand level lIT and derives value Z(lIT)hL(lIT, m)(g0þg1m). In equilibrium, lIT ¼ lM. Furthermore, a processing contract can be derived such that the equilibrium is reached.
Ch. 5. Customer Delay in E-Commerce Sites
127
An additional property of this processing contract is that as more capacity is required the price charged to the marketing department will be reduced (i.e., quantity discounts are in effect). 2.1.9 Practical implications By examining possible coordination and cooperation between marketing and IT departments, advertising campaigns can be added as an additional factor in the customer loss web. In addition, an implicit connection between demand and processing capacity is made based on the IT department’s ability to choose both demand and processing capacity levels. Advertising spending by the marketing department stimulates demand levels. While IT acts as a demand taker and adjusts capacity in order to prevent waiting intolerances and customer loss, the capacity level is sub-optimal. In addition, advertising costs and IT capacity costs can be recognized as additional factors determining the IT capacity and advertising campaign decisions of the e-retailer. By enforcing a coordination scheme between the IT department and the Marketing department, an e-retailer can avoid over-stimulation of demand which causes high capacity costs. Although the increased capacity prevents customer reneging, the benefit from retaining the customers is lessened due to the capacity costs required to keep them. Firms must carefully balance the advertising campaigns put in place with the capacity of the servers available to support the e-commerce site functionality.
3
The short-term capacity allocation problem
3.1 Optimal processing policies for an e-commerce web server6 E-retailers must make long-term decisions balancing their advertising expenditures with the demand capacity available on their web sites. However, once the capacity decision is made there are additional concerns regarding customer impatience. It may be possible to provide differentiated services to customers based on known information regarding the customer. Using differentiated services, the e-retailer can allocate additional capacity to more consumers more sensitive to delay, and less capacity to those consumers less sensitive to delay. Given a fixed capacity level, the e-retailer can allocate processing time to customers, based on customer shopping characteristics. The e-commerce environment is very amenable to this type of discrimination as customers do not witness the service provided to others. Priority processing. Priority processing considers that different users may have a higher level of priority in the system and should in turn receive more processing time based on their needs. 6
For more information, see Tan et al. (2005).
128
D. Barnes and V. Mookerjee
That is, based on a user’s priority class, the users will be allocated different quantum. For example, the higher the priority, the longer the processing time allocated to the user. Implementation of priority processing systems can vary. For example, a modified Round-Robin processing system can be used where the time slot is modified by the priority class level. This type of priority processing requires that the priority level of jobs are known ex ante. While priority processing schemes have been put in place for resources shared by a firm’s internal members, in the e-commerce context, where the users are external to the firm’s boundaries, it is more difficult to implement effective priority processing schemes. The firm’s objectives are tied to unknown consumer characteristics such as the consumer’s willingness to buy, the amount the consumer will spend, the probability that the consumer will renege (leave the site before purchase), and other factors not determined by the firm. The firm wishes to keep the consumers who will spend the most while balancing the processing required for those users. Therefore, establishing a priority-processing scheme in this e-commerce context will be more difficult for the firm than the traditional internal scenario. As aforementioned, the objective of the firm is to implement a priorityprocessing scheme which will allocate capacity to optimize some firm goals. In a super market we see express lanes designed to reduce delay for customers with fewer items. This is an example of a priority scheme. Consumers with fewer items are more likely to leave without purchasing their items; therefore, providing them priority check out allows the firm to retain these consumers. Likewise, e-commerce firms need to implement similar priority processing systems based on the characteristics of consumers. Instead of using a server which allocates equal quantum based on Round-Robin approach, firms should implement a server which uses consumer characteristics to assign priority classes and time slots. 3.2 Environmental assumptions In order to define the environment, several assumptions are made regarding the expected purchase value of consumers, the time to process an order and impatience of consumers, the rate at which consumers arrive at the site, and finally that a Round-Robin processing scheme is in place. Static purchase value. Specifically, the purchase value of a given consumer is static throughout the course of shopping. This means that the current behavior of the consumer does not drive the assigned purchase value. Using historical data, the firm can compute averages of previous visits to the site, or assign a pre-determined value such as the market-average for first time visitors. The predicted purchase value of a customer (denoted by h) follows a distribution f(h). In fact, the e-retailer can determine both the processing
Ch. 5. Customer Delay in E-Commerce Sites
129
time and the delay tolerance (customer impatience) based on the static purchase value assigned to them initially. The only information needed by the e-retailer is the static purchase value of the customer. Value-dependent delay tolerance and processing time. Customer impatience may lead to the customer leaving the system if their tolerance for waiting (delay tolerance) is exceeded. Again referring to the express lane example, customers with a low purchase value (generally fewer items) have a lower delay tolerance and therefore should be given priority while customers with a high purchase value (generally more items) have a higher delay tolerance and can be made to wait. Specifically, customers with a static purchase value h will be willing to wait a random time that is exponentially distributed with a mean w(h).7 Exponential processing time. Characterizing the processing time (t(h)) as exponential implies that the lifetime of the consumer in the system is exponentially distributed. This implies that the probability of a consumer leaving the system in a discrete future time interval, given they have been in the system 1 millisecond or an infinite time, is just as likely; therefore, the processing time incurred does not impact the likelihood of the processing completing during a discrete future time interval. Round-Robin processing. Round-Robin processing is commonly used in prominent e-commerce server software and takes advantage of the idle time between user requests. As mentioned earlier, this type of processing makes better use of processing power by dividing the processing time into processing units allocated to each user. 3.3 Priority processing scheme By using the predicted purchase value (h) of a consumer to determine the priority class k, we can assign processing time g(h)kQ to each priority class (g(h)kZ0), where g(h)k is a priority scheme assigning weights adjusting a fixed quantum Q. The e-retailer is concerned with the loss incurred due to intolerable delays on the e-commerce site. The loss function density (l(h)) defined as the number of customers lost per unit time per unit value for customers with value h is lðhÞ ¼
tðhÞðgðhÞ þ 1Þ lf ðhÞ tðhÞðgðhÞ þ 1Þ þ wðhÞgðhÞ
(4)
t(h) is the processing time per customer. w(h) is the mean delay tolerance. g(h) is the priority assigned. l is the arrival rate of consumers to the catalog server. f(h) is the value distribution of h. We now want to examine how the 7
For further justification of this assumption, see Ancker and Gafarian (1962).
130
D. Barnes and V. Mookerjee
priority scheme (g(h)k) impacts this loss. We will examine two possible priority schemes: a Profit-focused policy and a Quality of Service (QoS) focused policy. 3.4 Profit-focused policy In this problem the e-retailer determines the priority weights by maximizing profits. Priority weights will be assigned such that the e-retailer’s profits will be the highest. The e-retailer’s problem is to choose capacity and processing weight to maximize expected profit per unit time. The total expected revenue (per unit time) is Z S¼ hðlf ðhÞ lðhÞÞdh (5) h2H
where (lf(h)l(h)) are the Net Customers with value h. In addition, the serviceable set (the set of customers worth retaining) is h S H ¼ h: c; h 2 H (6) tðhÞ This expression denotes that the ratio of the static value to the processing time should be greater than a threshold value c.8 Therefore, only customers who exceed this threshold value will be allotted processing time. This demonstrates that the e-retailer is interested in balancing the predicted purchase value (value gained) with the amount of time needed to gain that value (processing time). The priority weights need to be determined. In this case, continuous priority weights are determined based on the delay tolerance, predicted purchase value, and processing time of a given user. The expression for the priority scheme is as follows: 8 ! 1=2 > 1 1 h wðhÞ > < 1 ; h 2 HS 1þ 1 1þ 1 þ wðhÞ=tðhÞ c tðhÞ tðhÞ g ðhÞ ¼ > > : 0; heH S (7) Therefore, given any h value the firm first decides whether or not the individual will be serviced based on the determined serviceable set HS and then assigns a priority weight using the above expression to determine the customer’s capacity allocation. By examining the partial derivatives of the 8
For details regarding the derivation of these expressions see Tan et al. (2005).
Ch. 5. Customer Delay in E-Commerce Sites
131
priority weights with respect to the rate of revenue realization (h/t(h)) and priority to processing ratio (w(h)/t(h)) we find i: ii:
@g ðhÞ
40 @ h=tðhÞ @g ðhÞ
o0 @ wðhÞ=tðhÞ
(8)
These results can be interpreted as follows: (i.) customers with a higher rate of revenue realization receive more processing and (ii.) customers with a higher value of priority to patience can tolerate more delay and hence receive less processing time. Profits may not be the only variable of interest to an e-retailer. While profits are certainly important to all firms, some may wish to focus on customer satisfaction and reducing customer loss due to poor service quality rather than on a policy that treats the customer as a dollar sign. 3.5 Quality of service (QoS) focused policy In this policy, the e-retailer assigns priority weights such that lost customers are minimized. With QoS as the performance objective, the e-retailers’ problem becomes Z min L lðhÞdh (9) h2H
when (Eq. (4)) defines l(h). This means that the objective of the e-retailer is to minimize the loss of customers. The optimal processing allocation can be obtained as 8 1=2 > wðhÞ tc 1 < 1 þ tðhÞ 1 1 þ tðhÞ 1 ; tðhÞotc g ðhÞ ¼ 1þwðhÞ=tðhÞ (10) > : 0; tðhÞ tc It is obvious from the nature of the above equation that customers who require more processing (with higher values of 1/t(h)) will be assigned less priority.More patient buyers (with higher w(h)/t(h) ratio) also receive less processing time. The value threshold hc is given by t(hc) ¼ tc, and can be found by substituting the above into (Eq. (1)). Assuming t(h) increases with the value h, customers with value above the value threshold hc will not receive any processing capacity. Because of the nature of the policies, the profit-focused policy will outperform the QoS-focused policy when profit is the performance variable and vice versa when only a single-period performance variable is considered.
132
D. Barnes and V. Mookerjee
However, e-retailers generally do not operate in a single period and must consider future profits, and how decisions and policies in earlier periods may impact those future profits. 3.6 Practical implications In order to implement these priority processing schemes for a practical server, the e-retailer would need to gather information regarding delay tolerances (w(h)) and expected processing time (t(h)). Click-stream behavior can be analyzed to derive insights regarding these functional forms. Currently, information is commonly collected regarding purchases, customer click behavior, and so on while little information has been gathered regarding customer delay tolerances. This illustrates the need for firms to collect information regarding delay tolerances and expected processing time as it relates to the predicted purchase value. Using real-time click-stream information as well as shopping cart information may allow the implementation of a dynamic policy which uses the current shopping cart value to determine the priority weight assigned to users. One possible drawback of using current shopping cart information instead of a static predicted purchase value is that users may attempt to gain higher priority by adding items to their cart and later removing those items, essentially, users would be ‘‘gaming’’ the system. 4
The effects of competition
In the previous considerations, the e-retailer’s decisions were being made in a monopolistic setting where competition from other e-retailers was not considered. One approximation to the competitive nature of the environment is to consider a multiperiod model, where poor QoS early on may cost the e-retailer the continuing business of a customer. This is a crude approximation, as losing the customer implies that there is another provider of the services or products. 4.1 A multiperiod approach to competition for capacity allocation When considering that customers who are lost rarely come back, an e-retailer considering future profits may have a different outlook. In fact, there is an externality effect that demand growth depends on QoS provided in previous periods. It may be a better policy to first build a solid customer base by focusing on QoS rather than focus on the value of orders. The multiperiod model considers multiple periods indexed by j. The QoS in the earlier periods impacts the demand (lj) in later periods. E-retailers can increase the processing capacity in later periods to match the demand
Ch. 5. Customer Delay in E-Commerce Sites
133
generated. Several new factors come into play in this model: (1) the probability of a dissatisfied customer returning (pj), (2) word of mouth impact on demand, and (3) capacity costs decreasing over time. QoS in earlier periods will impact demand in later periods; therefore, the demand in period j is modeled as follows: ljþ1 ¼ Lj pj þ ðlj Lj Þrj
(11)
where (ljLj) represents the word of mouth impact of QoS in the first period. rj is the growth due to satisfied customers, Lj represents the customers lost, and pj the probability of unsatisfied customers returning. Increasing capacity (Cj) reduces processing time(t(h)) and therefore we modify the processing time expression: tðhÞ ¼ t0 ðhÞ=C j
(12)
Acquiring additional capacity comes with some cost (gj) per unit of capacity. The cost of new capacity acquisition is gj(Cjþ1Cj). This cost must be considered in the profit function. In addition, capacity costs (gj) decrease over time as processing power becomes less expensive. 4.2 Practical implications The specific policy is not shown, but attributes of the policy are discussed. As the discount factor (d) increases, the value of future period profits becomes more valuable in the current period. It will be to the advantage of firms to start with a loss in the first period in order to boost demand and profits in later periods when the discount factor is high enough. That is, initially the firm follows a policy mimicking the earlier QoS-focused policy. In fact, while the policy appears to operate as a QoS policy in initial periods, it is optimizing the long-run profits of the firm. However, if the discount factor is relatively small, then the current period is of greater importance than future periods and a policy like the profit-focused policy will be followed. In the multiperiod problem, the e-retailer may lose money in the first period in order to provide better service and growth in future demand and the server capacity is determined optimally in each period. 4.3 Long-term capacity planning under competition With the introduction of competition, e-retailers may begin to see a new perspective on delay and integrate delay strategically based on known pricing information. In contrast to the more traditional views of delay where delays act as an impediment to the functioning of e-commerce, a new view of delay is explored where a firm may intentionally impose delays. Previously, we focused on the catalog server and the browsing behavior of
134
D. Barnes and V. Mookerjee
the consumer, in this case the browsing behavior is also considered; however users are seeking specific information regarding product prices that can be found at multiple e-retailers. Waiting at a web site for a consumer may not have a negative impact on a user’s evaluation of the site if the waiting is well-managed. Consumers tend not to get as frustrated if the waiting occurs at expected positions like before the web text appears on the screen and not in the middle of interaction. Thus, a ‘‘well managed’’ delay is one that is not in the middle of a sensitive transaction, like the transmission of personal information including financial information. Thus it is conceivable that an e-retailer design its site in a way to engage the customer and make the experience enjoyable enough so that the probability of a purchase increases despite taking a longer time. This Strategic Engagement Model in attempts to capture the strategic motives of e-retailers that are associated with building delay into their interaction with potential customers. There are two e-firms, firm 1 and firm 2 selling an identical product. The firms have identical engagement costs c(ti), where dc(ti)/dtW0. Engagement costs increase with the delay or engagement time (ti) that is built into the browsing process by firm j (where j ¼ 1 or 2) at its web site.9 Thus an e-firm needs to incur higher costs (computing and personnel resources, operating costs, etc.) in order to keep customers engaged for a longer time. The consumer side of the market comprises of homogenous agents, who have unit demand for the product and always buy from the firm with the lower price if they discover that price. A representative consumer does not browse indefinitely in order to discover the price, but has a search time budget t that is known to both firms, and may be defined as the maximum time spent for the purpose of price discovery. The consumer randomly chooses one of the web sites to begin the search for the lower price. If the time budget is expended without discovering the price offered by the second site, a purchase is made from the first site. No purchase is made if neither price is discovered. If both prices are discovered, then the product is purchased from the lower priced firm. The prices (p1Wp2) charged are common knowledge to the firms. The firms interact strategically, setting delay times in a simultaneous game. We calculate the firms’ best response delay functions and solve for the Nash equilibrium in delays. For the lower priced firm 2, it is good strategy to display its price upfront without any delay or engagement time. If it is the first firm to be visited by the consumer, all firm 2 needs to do is to display its price in the shortest time. A longer period of price discovery will merely augment its costs without increasing its expected profit. If firm 2 is the second to be visited then too its optimal strategy is to provide the smallest possible price discovery time as any longer than the minimum time would increase the 9
This is also the time taken by a potential customer for price discovery at a firm.
Ch. 5. Customer Delay in E-Commerce Sites
Fig. 3.
135
Pure strategy Nash equilibrium of the engagement model.
chances of the consumer’s time budget (t) being exceeded. If e (eW0) is the smallest time period that a firm can use to display its price, then the optimal engagement time for firm 2 (t 2 ) is t 2 ¼
(13)
The higher priced firm 1 knows that the only way it can affect a sale is by not allowing the other (lower priced) firm to display its price to the consumer. Thus its optimal strategy is to set its engagement time an e above tt2, the remainder of the time budget after subtracting firm 2’s choice of delay.10 t 1 ¼ t t2 þ
(14)
Mapping out the best response functions for firms 1 and 2 in Fig. 3 we get the pure strategy Nash equilibrium (NE) given by (t, e) which are the NE strategies for firms 1 and 2 respectively. Thus the cheaper vendor makes its price discovery almost instantaneous while the more expensive provider attempts to engage the potential customer and exhaust the search time budget. 10 Notice that the higher priced firm can never win the consumer if it is visited second. The higher priced firm makes a sale only if it is visited first and sets its delay so that the consumer exhausts the time budget before discovering the lower priced firm’s price. However, the higher priced firm should not build delays any higher than e above the ‘‘available’’ time (tt2) since engagement time is costly.
136
D. Barnes and V. Mookerjee
4.4 Practical applications and future adaptations The basic wisdom from the theoretical result above, namely cheaper firms affect quicker price discovery and do not attempt to invest in engaging a customer is seen in e-commerce interfaces all over the World Wide Web. If we compare car loan web sites with more established loan agencies like Chase Automotive Finance, we find that the former, which typically offer lower APRs and a ‘‘bad credit, no problem’’ attitude has a much quicker price (in this case the interest rate) discovery than the latter. Organizations like Chase invest in making the customer go through many ‘‘personalized’’ steps (ostensibly for benefiting the customer), but in reality may just be attempting to engage customers long enough in order to induce them to sign them up for a loan. A similar pattern is observed in the online provision of health insurance. Lower premium e-tailers like Affordable Health Insurance Quotes have a quicker price discovery than larger and more established insurance firms. Several extensions can be made to the current model. It may be useful to consider a model where the cost of engagement time includes a technology cost component that reduces with time, for example, the total cost of engagement could be expressed as: c(t) ¼ a(t)þb(t), where a(t) increases with time whereas b(t) (technology cost) decreases with time. This may lead to interior solutions to the game. Another extension is to consider n firms to see if the basic intuition (high priced firms deliberately delay customers) carries through in the general case. It may also be useful to model consumers as being impatient, i.e., they may leave a slow web site with some probability even if their time budget has not been exhausted. Finally, it will be interesting to conduct an experiment to see if the model predicts correctly with human search agents. 5
Conclusions and future research
There are many issues and perspectives related to customer loss caused by delays in e-commerce settings. In the last section, delay is examined as a strategic tool for an e-retailer; however, this tool can only be enabled in a competitive environment. An e-retailer’s view of delay may change based on the competitive context. Therefore the firm’s competitive environment which could be monopolistic or competitive becomes an additional consideration for customer loss. In addition, forcing coordination between IT departments and the Marketing department is in a firm’s best interest. If IT is treated as a demand taker, the Marketing department over spends on advertising forcing the IT department to make sub-optimal capacity decisions. While these sub-optimal capacity decisions are made in an effort to prevent customer delay and eventually customer loss, the cost of the capacity is not
Ch. 5. Customer Delay in E-Commerce Sites
137
compensated. Using contracts from the IT department for IT resources, the Marketing department sees IT capacity as a cost and the IT department gains revenues from providing resources. This contract enables the departments to work in cooperation to set demand levels in an optimal fashion, and aligns the incentives of the departments. Alternatively, the IT department can generate a reduced average session value for the marketing department to consider. This reduced average session value helps account for the capacity costs experienced by the IT department and drives the demand level decisions for each department towards an optimal choice. Capacity adjustments in e-retailer servers are essential to providing quality service. In fact, quality of service is more important to an e-retailer trying to build a customer base in early periods. By implementing priority processing schemes that focus on maintaining the customer base initially, the e-retailer builds his market share. Later, once the market share is stationary, the e-retailer can maintain his market share using profit-focused priority processing schemes. In addition to an e-commerce environment, an online decision support system environment such as an online help desk may also have queuing effects as discussed in Sections 2 and 3. Capacity planning and allocation will be important decisions in this domain as well. We will also investigate the issue of queuing externalities in the context of decision-oriented systems. The analysis and design of such systems has been greatly facilitated by the use Sequential Decision Models (SDM). These models provide a powerful framework to improve system operation. The objective is to optimize cost or value over a horizon of sequential decisions. In sequential decision-making, the decision maker is assumed to possess a set of beliefs about the state of nature and a set of payoffs about alternatives. The decision maker can either make an immediate decision given current beliefs or make a costly observation to revise current beliefs. The next observation or input to acquire depends on the values of previously acquired inputs. For example, a physician collects relevant information by asking questions or conducting clinical tests in an order that depends on the specific case. Once enough information has been acquired, the decision maker selects the best alternative. Consider an example of an e-business information system where customers log on to a web site to obtain advice on health related matters. Because of queuing effects, customers may have to wait before and/or during the consulting session. The operating policy could adjust (increases or decreases) the quality of the advice offered by the system depending on the length of the queue. Such a policy may aim to optimize an objective such as the total expected cost incurred, namely, the sum of the expected error cost (associated with the advice offered by the system) and the expected waiting cost. The examination of delay experienced in an e-retail environment is of utmost important in the modern age where much business is transacted
138
D. Barnes and V. Mookerjee
online. Just as firms have carefully planned the logistics of their brick and mortar stores, so must they pay special attention to the logistics of their web presence. Customers who are impatient may leave the e-store with the slightest delay; therefore firms must carefully examine the value that may be lost for any given customer type. Understanding the behaviors and value of the individual customer will allow the firms to strategically design the web presence with an appropriate amount of delay. References Ancker, C.J., A.V. Gafarian (1962). Queuing with impatient customers who leave at random. Journal of Industrial Engineering 13, 84–90. Burman, D. (1981). Insensitivity in queuing systems. Advances in Applied Probability 13, 846–859. Carpenter, G.S., L.G. Cooper, D.M. Hanssens, D.F. Midgley (1988). Modeling asymmetric competition. Marketing Science 7, 393–412. Eliashberg, J., R. Steinberg (1993). Marketing-production joint decision making, in: J. Elizashberg, G.L. Lilien (eds.), Marketing, DHandbooks in Operations Research and Management Science, Vol. 5. Elsevier, North Holland. Fauli-Oller, R., M. Giralt (1995). Competition and cooperation within a multidivisional firm. Journal of Industrial Economics XLIII, 77–99. Ho, T.-H., S. Savin, C. Terwiesch (2002). Managing demand and sales dynamics in new product diffusion under supply constraint. Management Science 48(4), 402–419. Johansson, J.K. (1979). Advertising and the S-curve: A new approach. Journal of Marketing Research XVI 346–354. Little, J.D.C. (1975). BRANDAID: A marketing mix model part 1: Structure. Operations Research 23, 628–655. Mahajan, V., E. Muller (1986). Advertising pulsing policies for generating awareness for new products. Marketing Science 5, 86–106. Morgan, L.O., R.L. Daniels, P. Kouvelis (2001). Marketing/manufacturing tradeoffs in product line management. IIE Transactions 33 949–962. O’Donovan, T.M. (1974). Direct solution of M/G/1 processor sharing models. Operations Research 22, 1232–1235. Ross, S.M. (2000). Introduction to Probability Models. 7th ed. Harcourt Academic Press. Sakata, M., S. Noguchi, J. Oizumi (1971). An analysis of the M/G/1 queue under round-robin scheduling. Operations Research 19, 371–385. Shapiro, B.P. (1977). Can marketing and manufacturing coexist? Harvard Business Review 55, 104–114. Tan, Y., K. Moionzaeh, V.S. Mookerjee (2005). Optimal processing policies for an e-commerce web server. Informs Journal on Computing 17(1), 99–110. Tan, Y., V.S. Mookerjee (2005). Allocating spending between advertising and information technology in electronic retailing. Management Science 51(8), 1236–1249. Villas-Boas, J.M. (1993). Predicting advertising pulsing policies in and oligopoly: A model and empirical test. Marketing Science 12, 88–102.
Part II Computational Approaches for Business Processes
This page intentionally left blank
Adomavicius & Gupta, Eds., Handbooks in Information Systems, Vol. 3 Copyright r 2009 by Emerald Group Publishing Limited
Chapter 6
An Autonomous Agent for Supply Chain Management
David Pardoe and Peter Stone Department of Computer Sciences, The University of Texas at Austin, 1 University Station CO500, Austin, TX 78712-0233, USA
Abstract Supply chain management (SCM) involves planning for the procurement of materials, assembly of finished products from these materials, and distribution of products to customers. The Trading Agent Competition Supply Chain Management (TAC SCM) scenario provides a competitive benchmarking environment for developing and testing agent-based solutions to SCM. Autonomous software agents must perform the above tasks while competing against each other as computer manufacturers: each agent must purchase components such as memory and hard drives from suppliers, manage a factory where computers are assembled, and negotiate with customers to sell computers. In this chapter, we describe TacTex-06, the winning agent in the 2006 TAC SCM competition. TacTex-06 operates by making predictions about the future of the economy, such as the prices that will be offered by component suppliers and the level of customer demand, and then planning its future actions in order to maximize profits. A key component of TacTex-06 is the ability to adapt these predictions based on the observed behavior of other agents. Although the agent is described in full, particular emphasis is given to agent components that differ from the previous year’s winner, TacTex-05, and the importance of these components is demonstrated through controlled experiments.
1
Introduction
In today’s industrial world, supply chains are ubiquitous in the manufacturing of many complex products. Traditionally, supply chains have been created through the interactions of human representatives of the various companies involved. However, recent advances in autonomous agent 141
142
D. Pardoe and P. Stone
technologies have sparked an interest, both in academia and in industry, in automating the process (Chen et al., 1999; Kumar, 2001; Sadeh et al., 2001). Creating a fully autonomous agent for supply chain management (SCM) is difficult due to the large number of tasks such an agent must perform. In general, the agent must procure resources for, manage the assembly of, and negotiate the sale of a completed product. To perform these tasks intelligently, the agent must be able to plan in the face of uncertainty, schedule the optimal use of its resources, and adapt to changing market conditions. One barrier to SCM research is that it can be difficult to benchmark automated strategies in a live business environment, both due to the proprietary nature of the systems and due to the high cost of errors. The Trading Agent Competition Supply Chain Management (TAC SCM) scenario provides a unique testbed for studying and prototyping SCM agents by providing a competitive environment in which independently created agents can be tested against each other over the course of many simulations in an open academic setting. A particularly appealing feature of TAC is that, unlike in many simulation environments, the other agents are real profit-maximizing agents with incentive to perform well, rather than strawman benchmarks. In a TAC SCM game, each agent acts as an independent computer manufacturer in a simulated economy. The agent must procure components such as CPUs and memory; decide what types of computers to manufacture from these components as constrained by its factory resources; bid for sales contracts with customers; and decide which computers to deliver to whom and by when. In this chapter, we describe TacTex-06 , the winner of the 2006 TAC SCM competition. In particular, we describe the various components that make up the agent and discuss how they are combined to result in an effective SCM agent. Emphasis is given to those components that differ from the previous year’s winner, TacTex-05, and the importance of these components is demonstrated through controlled experiments. The remainder of the chapter is organized as follows. We first summarize the TAC SCM scenario, and then give an overview of the design of TacTex-06 . Next, we describe in detail the individual components: three predictive modules, two decisionmaking modules that attempt to identify optimal behavior with respect to the predictions, and two methods of adapting to opponent behavior based on past games. Finally, we examine the success of the complete agent, through both analysis of competition results and controlled experiments. 2
The TAC SCM scenario
In this section, we provide a summary of the TAC SCM scenario. Full details are available in the official specification document (Collins et al., 2005).
Ch. 6. An Autonomous Agent for Supply Chain Management
143
In a TAC SCM game, six agents act as computer manufacturers in a simulated economy that is managed by a game server. The length of a game is 220 simulated days, with each day lasting 15 s of real time. At the beginning of each day, agents receive messages from the game server with information concerning the state of the game, such as the customer requests for quotes (RFQs) for that day, and agents have until the end of the day to send messages to the server indicating their actions for that day, such as making offers to customers. The game can be divided into three parts: (i) component procurement, (ii) computer sales, and (iii) production and delivery as explained below and illustrated in Fig. 1. 2.1 Component procurement The computers are made from four components: CPUs, motherboards, memory, and hard drives, each of which come in multiple varieties. From these components, 16 different computer configurations can be made. Each component has a base price that is used as a reference point by suppliers making offers. Agents wanting to purchase components send RFQs to suppliers indicating the type and quantity of components desired, the date on which they should be delivered, and a reserve price stating the maximum amount
Fig. 1.
The TAC SCM Scenario (Collins et al., 2005).
144
D. Pardoe and P. Stone
the agent is willing to pay. Agents are limited to sending at most five RFQs per component per supplier per day. Suppliers respond to RFQs the next day by offering a price for the requested components if the request can be satisfied. Agents may then accept or reject the offers. Suppliers have a limited capacity for producing components, and this capacity varies throughout the game according to a random walk. Suppliers base their prices offered in response to RFQs on the fraction of their capacity that is currently free. When determining prices for RFQs for a particular component, a supplier simulates scheduling the production of all components currently ordered plus those components requested in the RFQs as late as possible. From the production schedule, the supplier can determine the remaining free capacity between the current day and any future day. The price offered in response to an RFQ is equal to the base price of the component discounted by an amount proportional to the fraction of the supplier’s free capacity before the due date. Agents may send zero-quantity RFQs to serve as price probes. Due to the nature of the supplier pricing model, it is possible for prices to be as low when components are requested at the last minute as when they are requested well in advance. Agents thus face an interesting tradeoff : they may either commit to ordering while knowledge of future customer demand is still limited (see below), or wait to order and risk being unable to purchase needed components. To prevent agents from driving up prices by sending RFQs with no intention of buying, each supplier keeps track of a reputation rating for each agent that represents the fraction of offered components that have been accepted by the agent. If this reputation falls below a minimum acceptable purchase ratio (75% for CPU suppliers and 45% for others), then the prices and availability of components are affected for that agent. Agents must therefore plan component purchases carefully, sending RFQs only when they believe it is likely that they will accept the offers received. 2.2 Computer sales Customers wishing to buy computers send the agents RFQs consisting of the type and quantity of computer desired, the due date, a reserve price indicating the maximum amount the customer is willing to pay per computer, and a penalty that must be paid for each day the delivery is late. Agents respond to the RFQs by bidding in a first-price auction: the agent offering the lowest price on each RFQ wins the order. Agents are unable to see the prices offered by other agents or even the winning prices, but they do receive a report each day indicating the highest and lowest price at which each type of computer sold on the previous day. Each RFQ is for between 1 and 20 computers, with due dates ranging from 3 to 12 days in the future, and reserve prices ranging from 75% to
Ch. 6. An Autonomous Agent for Supply Chain Management
145
125% of the base price of the requested computer type. (The base price of a computer is equal to the sum of the base prices of its parts.) The number of RFQs sent by customers each day depends on the level of customer demand, which fluctuates throughout the game. Demand is broken into three segments, each containing about one-third of the 16 computer types: high, mid, and low range. Each range has its own level of demand. The total number of RFQs per day ranges between roughly 80 and 320, all of which can be bid upon by all six agents. It is possible for demand levels to change rapidly, limiting the ability of agents to plan for the future with confidence. 2.3 Production and delivery Each agent manages a factory where computers are assembled. Factory operation is constrained by both the components in inventory and assembly cycles. Factories are limited to producing roughly 360 computers per day (depending on their types). Each day an agent must send a production schedule and a delivery schedule to the server indicating its actions for the next day. The production schedule specifies how many of each computer will be assembled by the factory, while the delivery schedule indicates which customer orders will be filled from the completed computers in inventory. Agents are required to pay a small daily storage fee for all components in inventory at the factory. This cost is sufficiently high to discourage agents from holding large inventories of components for long periods.
3
Overview of TacTex-06
Given the detail and complexity of the TAC SCM scenario, creating an effective agent requires the development of tightly coupled modules for interacting with suppliers, customers, and the factory. The fact that each day’s decisions must be made in less than 15 s constrains the set of possible approaches. TacTex-06 is a fully implemented agent that operates within the TAC SCM scenario. We present a high-level overview of the agent in this section, and full details in the sections that follow. 3.1 Agent components Figure 2 illustrates the basic components of TacTex-06 and their interaction. There are five basic tasks a TAC SCM agent must perform: (1) Sending RFQs to suppliers to request components; (2) Deciding which offers from suppliers to accept;
146
D. Pardoe and P. Stone
Supplier Model
Supply Manager plan for component purchases negotiate with suppliers
offers and deliveries projected inventory and costs
projected component use
Demand Manager bid on customer RFQs produce and deliver computers
offers and deliveries
Customers
computer RFQs and orders
Demand Model
Suppliers
component RFQs and orders
Offer Acceptance Predictor TacTex
Fig. 2.
An overview of the main agent components.
(3) Bidding on RFQs from customers requesting computers; (4) Sending the daily production schedule to the factory; (5) Delivering completed computers. We assign the first two tasks to a Supply Manager module, and the last three to a Demand Manager module. The Supply Manager handles all planning related to component inventories and purchases, and requires no information about computer production except for a projection of future component use, which is provided by the Demand Manager. The Demand Manager, in turn, handles all planning related to computer sales and production. The only information about components required by the Demand Manager is a projection of the current inventory and future component deliveries, along with an estimated replacement cost for each component used. This information is provided by the Supply Manager. We view the tasks to be performed by these two managers as optimization tasks: the Supply Manager tries to minimize the cost of obtaining
Ch. 6. An Autonomous Agent for Supply Chain Management
147
the components required by the Demand Manager, whereas the Demand Manager seeks to maximize the profits from computer sales subject to the information provided by the Supply Manager. In order to perform these tasks, the two managers need to be able to make predictions about the results of their actions and the future of the economy. TacTex-06 uses three predictive models to assist the managers with these predictions: a predictive Supplier Model, a predictive Demand Model, and an Offer Acceptance Predictor. The Supplier Model keeps track of all information available about each supplier, such as TacTex-06’s outstanding orders and the prices that have been offered in response to RFQs. Using this information, the Supplier Model can assist the Supply Manager by making predictions concerning future component availability and prices. The Demand Model tracks the customer demand in each of the three market segments, and tries to estimate the underlying demand parameters in each segment. With these estimates, it is possible to predict the number of RFQs that will be received on any future day. The Demand Manager can then use these predictions to plan for future production. When deciding what bids to make in response to customer RFQs, the Demand Manager needs to be able to estimate the probability of a particular bid being accepted (which depends on the bidding behavior of the other agents). This prediction is handled by the Offer Acceptance Predictor. On the basis of past bidding results, the Offer Acceptance Predictor produces a function for each RFQ that maps bid prices to the predicted probability of winning the order. The steps taken each day by TacTex-06 as it performs the five tasks described previously are presented in Table 1.
4
The Demand Manager
The Demand Manager handles all computation related to computer sales and production. This section describes the Demand Manager, along with the Demand Predictor and the Offer Acceptance Predictor upon which it relies.
4.1 Demand Model When planning for future computer production, the Demand Manager needs to be able to make predictions about future demand in each market segment. For example, if more RFQs are expected for high range than low range computers, the planned production should reflect this fact. The Demand Model is responsible for making these predictions.
148
D. Pardoe and P. Stone
Table 1 Overview of the steps taken each day by TacTex-06 Record information received from the server and update prediction modules. The Supply Manager takes the supplier offers as input and performs the following: decide which offers to accept, update projected future inventory, update replacement costs. The Demand Manager takes customer RFQs, current orders, projected inventory,and replacement costs as input and performs the following:
predict future customer demand using the Demand Model, use the Offer Acceptance Predictor to generate acceptance functions for RFQs, schedule production several days into the future, extract the current day’s production, delivery, and bids from the schedule, update projected future component use.
The Supply Manager takes the projected future component use as input andperforms the following: determine the future deliveries needed to maintain a threshold inventory, use the Supplier Model to predict future component prices, decide what RFQs need to be sent on the current day.
To explain its operation, further detail is required about the customer demand model. The state of each demand segment (high, mid, and low range computers) is represented by parameters Qd and td (both of which are internal to the game server). Qd represents the expected number of RFQs on day d, and td is the trend in demand (increasing or decreasing) on day d. The actual number of RFQs is generated randomly from a Poisson distribution with Qd as its mean. The next day’s demand, Qdþ1, is set to Qdtd, and tdþ1 is determined from td according to a random walk. To predict future demand, the Demand Manager estimates the values of Qd and td for each segment using an approach first used by the agent DeepMaize in 2003 (Kiekintveld et al., 2004). Basically, this is a Bayesian approach that involves maintaining a probability distribution over (Qd, td) pairs for each segment. The number of RFQs received each day from the segment represents information that can be used to update this distribution, and the distribution over (Qdþ1, tdþ1) pairs can then be generated based on the game’s demand model. By repeating this last step, the expected value of Qi can be determined for any future day i and used as the number of RFQs predicted on that day. Full details of the approach are available in Kiekintveld et al. (2004).1 1 The DeepMaize team has released their code for this approach: http://www.eecs.umich.edu/ Bckiekint/downloads/DeepMaize_CustomerDemand_Release.tar.gz
4.2 Offer Acceptance Predictor

(This section presents a significant addition to the previous agent, TacTex-05.)

To bid on customer RFQs, the Demand Manager needs to be able to predict the orders that will result from the offers it makes. A simple method of prediction would be to estimate the winning price for each RFQ, and assume that any bid below this price would result in an order. Alternatively, for each RFQ the probability of winning the order could be estimated as a function of the current bid. This latter approach is the one implemented by the Offer Acceptance Predictor. For each customer RFQ received, the Offer Acceptance Predictor generates a function mapping the possible bid prices to the probability of acceptance. (The function can thus be viewed as a cumulative distribution function.) This approach involves three components: a particle filter used to generate initial predictions, an adaptive means of revising the predictions to account for the impact of an RFQ's due date, and a learned predictor that predicts how the prices of computers will change in the future.

A visual inspection of each day's winning prices for each type of computer in a typical completed game suggests that these prices tend to follow a normal distribution. To estimate these distributions during a game, the Offer Acceptance Predictor makes use of a separate particle filter [specifically a Sampling Importance Resampling filter (Arulampalam et al., 2002)] for each computer type. A particle filter is a sequential Monte Carlo method that tracks the changing state of a system by using a set of weighted samples (called particles) to estimate a posterior density function over the possible states. The weight of each particle represents its relative probability, and particles and weights are revised each time an observation (conditioned on the current state) is received. In this case, each of the 100 particles used per filter represents a normal distribution (indicating the probability that a given price will be the winning price on the computer) with a particular mean and variance. At the beginning of each game, weights are set equally and each particle is assigned a mean and variance drawn randomly from a distribution that is generated by analyzing the first day prices from a large data set of past games. (The source of this data set will be described below.) Each succeeding day, a new set of particles is generated from the old. For each new particle to be generated, an old particle is selected at random based on weight, and the new particle's estimate of mean and variance are set to those of the old particle plus small changes, drawn randomly from the distribution of day-to-day changes seen in the data set of past games. The new particles are then reweighted, with the weight of each particle set to the probability of the previous day's price-related observations occurring according to the distribution represented. These observations consist of the reported highest and lowest winning prices and the acceptance or rejection of each offer made to a customer for the given type of computer. Finally, the weights are normalized to sum to one. The distribution of winning prices predicted by the particle filter is simply the weighted sum of the individual particles' distributions, and from this distribution the function mapping each possible bid price to a probability of acceptance can be determined.
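As a concrete illustration of this idea, here is a minimal sketch (not the TacTex-06 implementation) of an SIR-style filter in which each particle is a candidate (mean, standard deviation) for the day's normal distribution of winning prices. The reweighting likelihood is deliberately simplified to score only the reported highest and lowest winning prices (the real filter also uses the acceptance or rejection of each offer), and all numeric ranges and drift parameters are illustrative assumptions.

```python
import numpy as np
from scipy.special import erfc

rng = np.random.default_rng(0)

class WinningPriceFilter:
    """SIR particle filter over (mean, std) of one computer type's winning prices."""

    def __init__(self, n_particles=100, mean_range=(1500.0, 2500.0), std_range=(20.0, 200.0)):
        # TacTex-06 draws initial particles from statistics of past games;
        # here they are simply drawn uniformly from assumed ranges.
        self.means = rng.uniform(*mean_range, n_particles)
        self.stds = rng.uniform(*std_range, n_particles)
        self.weights = np.full(n_particles, 1.0 / n_particles)

    def _pdf(self, x):
        z = (x - self.means) / self.stds
        return np.exp(-0.5 * z * z) / (self.stds * np.sqrt(2.0 * np.pi))

    def step(self, reported_high, reported_low, drift_std=30.0, jitter_std=10.0):
        n = len(self.weights)
        # 1. Resample old particles in proportion to their weights.
        idx = rng.choice(n, size=n, p=self.weights)
        # 2. Perturb each resampled particle with small random changes
        #    (standing in for the empirical day-to-day change distribution).
        self.means = self.means[idx] + rng.normal(0.0, drift_std, n)
        self.stds = np.abs(self.stds[idx] + rng.normal(0.0, jitter_std, n)) + 1e-3
        # 3. Reweight by a (simplified) likelihood of the day's observations.
        like = self._pdf(reported_high) * self._pdf(reported_low) + 1e-300
        self.weights = like / like.sum()

    def acceptance_probability(self, bid):
        # Probability that the bid falls below the winning price, i.e. the weighted
        # survival function of the mixture of particle distributions.
        z = (bid - self.means) / self.stds
        sf = 0.5 * erfc(z / np.sqrt(2.0))
        return float(np.dot(self.weights, sf))
```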
These functions are then modified using values we call day factors, which are designed to measure the effect of the due date on offer acceptance. The due dates for RFQs range from 3 to 12 days in the future, and a separate day factor is learned for each day in this range. Each day factor is set to the ratio of actual orders received to orders expected based on the linear heuristic, for all recent offers made. When an offer is made on an RFQ, the Offer Acceptance Predictor computes the probability of an order by multiplying the initial prediction by the corresponding day factor. The day factors therefore serve both as a means of gauging the impact of due dates on computer prices and as a mechanism for ensuring that the number of orders received is roughly the number expected.

To maximize revenue from the computers sold, the Demand Manager needs to consider not only the prices it will offer in response to the current day's RFQs, but also what computers it will wish to sell on future days. In fact, the Demand Manager plans ahead for 10 days and considers future as well as current RFQs when making offers, as will be described in the next section. It is therefore important for the Offer Acceptance Predictor to be able to predict future changes in computer prices.

To illustrate why this is important, Fig. 3 shows the prices at which one type of computer sold during a single game of the 2006 finals. For each day, points representing one standard deviation above and below the average price are plotted. On most days, there is clearly little variance between the winning prices, but prices often change drastically over the course of a few days. This fact suggests that it may be even more valuable to be able to predict future changes in price than to predict the distribution of winning prices on a single day. By simply selling a computer a few days earlier or later, it might be possible for the Demand Manager to significantly increase the price it obtains.

Fig. 3. Average prices at which one type of computer sold during one game of the 2006 finals. One standard deviation above and below the average is shown.

To make these predictions of price changes, the Offer Acceptance Predictor performs machine learning on data from past games. Each training instance consists of 31 features representing data available to the agent during the game, such as the date, estimated levels of customer demand and demand trend, and current and recent computer prices. The label for each instance is the amount by which the average price changes in 10 days. Once the Offer Acceptance Predictor has learned to predict this quantity, it can predict the change in average price for any day between zero and ten days in the future through linear interpolation. No effort is made to predict changes in the shape of the distribution, i.e., the variance. Thus, to generate an offer acceptance function for a future RFQ, the Offer
Acceptance Predictor simply shifts the predicted distribution over winning prices up or down depending on the predicted change in average price, and bases the acceptance function on this modified distribution.

To train the price change predictor, a learning algorithm and source of training data must be chosen. After experimenting with various algorithms from the WEKA machine learning package (Witten and Frank, 1999), we selected additive regression with decision stumps, an iterative method in which a decision stump is repeatedly fit to the residual from the previous step. (M5 regression trees gave nearly identical performance, but the models generated were significantly larger.) For training data, we could have used data from games in the competition, but instead we ran a large number of games of our own using both variations of TacTex-06 and other agents taken from the TAC Agent Repository (http://www.sics.se/tac/showagents.php), a collection of agents provided by the teams involved in the competition. Doing so allowed us to generate separate training and testing data sets for various combinations of six agents, which we then used to test whether predictors trained on data from games with one set of agents would generalize to games involving a different set of agents. In particular, for four different groups of six agents, we ran 40 games, and we generated training data using 30 games and testing data with the other 10. We then trained a separate predictor on each training set. Fortunately, generalization was good: for each of the four testing data sets, all four predictors were reasonably accurate. In other words, in order to predict price changes in a game with a particular group of agents, it was not absolutely necessary to have trained on data specific to
those agents. We thus chose to train a single predictor on the entire set of data from these games, and use the same predictor throughout the competition. In our post-competition analysis, we found that this was a reasonable decision given the limited number of games that would have been available during the competition to use for training. In more recent work, however, we explore methods of making use of both sources of data (games from the competition and games run on our own) and show that improvements in predictor accuracy are possible (Pardoe and Stone, 2007).

4.3 Demand Manager

The Demand Manager is responsible for bidding on customer RFQs, producing computers, and delivering them to customers. All three tasks can be performed using the same production scheduling algorithm. As these tasks compete for the same resources (components, completed computers, and factory cycles), the Demand Manager begins by planning to satisfy existing orders, and then uses the remaining resources in planning for RFQs. The latest possible due date for an RFQ received on the current day is 12 days in the future, meaning the production schedule for the needed computers must be sent within the next 10 days. The Demand Manager thus always plans for the next 10 days of production. Each day, the Demand Manager (i) schedules production of existing orders, (ii) schedules production of predicted future orders, and then (iii) extracts the next day's production and delivery schedule from the result. The production scheduling algorithm, these three steps, and the means of predicting production beyond 10 days are described in the following sections.

4.3.1 Production scheduling algorithm

The goal of the production scheduler is to take a set of orders and to determine the 10-day production schedule that maximizes profit, subject to the available resources. The resources provided are:
  A fixed number of factory cycles per day;
  The components in inventory;
  The components projected to be delivered;
  Completed computers in inventory.
The profit for each order is equal to its price (if it could be delivered) minus any penalties for late delivery and the replacement costs for the components involved as specified by the Supply Manager.

The scheduling algorithm used by the Demand Manager is a greedy algorithm that attempts to produce each order as late as possible. Orders are sorted by profit, and the scheduler tries to produce each order using cycles and components from the latest possible dates. If any part of the order cannot be produced, the needed computers will be taken from the existing inventory of completed computers, if possible. The purpose of scheduling production as late as possible is to preserve resources that might be needed by orders with earlier due dates. A record is kept of what production took place on each day and how each order was filled.

It should be noted that the scheduling problem at hand lends itself to the use of linear programming to determine an optimal solution. We initially experimented with this approach, using a linear program similar to one designed for a slightly simplified scenario by Benisch et al. (2004a). However, due to the game's time constraints (15 s allowed per simulated day), the need to use the scheduler multiple times per day (and in a modified fashion for bidding on customer RFQs, as described below), and the fact that the greedy approach is nearly optimal [observed in our own experiments and confirmed by Benisch et al. (2006a)], we chose to use the greedy approach.
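The following is a minimal sketch of this latest-possible greedy scheduling idea, not the authors' implementation: it tracks only free factory cycles per day and a single pooled component "kit" count per computer type, whereas the real scheduler tracks individual component types, delivery dates, and the finished-computer inventory. The data structures and field names are assumptions made for illustration.

```python
from dataclasses import dataclass

DAYS = 10  # planning horizon

@dataclass
class Order:
    order_id: int
    computer_type: str
    quantity: int
    due_day: int          # production must finish before this planning day
    profit: float         # price minus penalties and replacement costs

def greedy_schedule(orders, cycles_free, kits_free, cycles_per_unit):
    """Schedule each order as late as possible, most profitable orders first.

    cycles_free[d]     -- factory cycles still free on planning day d
    kits_free[t][d]    -- component 'kits' of type t usable through day d
                          (cumulative arrivals minus prior commitments)
    cycles_per_unit[t] -- cycles needed to build one computer of type t
    Returns {order_id: {day: units}} for the orders (or parts) that fit.
    """
    plan = {}
    for order in sorted(orders, key=lambda o: o.profit, reverse=True):
        need = order.quantity
        t = order.computer_type
        produced = {}
        # Walk backwards from the last day before the due date.
        for day in range(min(order.due_day, DAYS) - 1, -1, -1):
            if need == 0:
                break
            by_cycles = cycles_free[day] // cycles_per_unit[t]
            by_kits = kits_free[t][day]
            units = min(need, by_cycles, by_kits)
            if units <= 0:
                continue
            produced[day] = units
            need -= units
            cycles_free[day] -= units * cycles_per_unit[t]
            # Kits consumed on `day` are unavailable on that day and every later day.
            for later in range(day, DAYS):
                kits_free[t][later] -= units
        # The real agent would fall back on finished-computer inventory here
        # if part of the order still cannot be produced.
        plan[order.order_id] = produced
    return plan
```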
4.3.2 Handling existing orders

The Demand Manager plans for the production of existing orders in two steps. Before starting, the production resources are initialized using the values provided by the Supply Manager. Then the production scheduler is applied to the set of orders due in one day or less. All orders that can be taken from inventory (hopefully all of them, to avoid penalties) are scheduled for delivery the next day. The production scheduler is next applied to the remaining orders. No deliveries are scheduled at this time, because there is no reward for early delivery.

4.3.3 Bidding on RFQs and handling predicted orders

The goal of the Demand Manager is now to identify the set of bids in response to customer RFQs that will maximize the expected profit from using the remaining production resources for the next 10 days, and to schedule production of the resulting predicted orders. The profit depends not only on the RFQs being bid on the current day, but also on RFQs that will be received on later days for computers due during the period. If these future RFQs were ignored when selecting the current day's bids, the Demand Manager might plan to use up all available production resources on the current RFQs, leaving it unable to bid on future RFQs. One way to address this issue would be to restrict the resources available to the agent for production of the computers being bid on (as in Benisch et al., 2004a). Instead, the Demand Manager generates a predicted set of all RFQs, using the levels of customer demand predicted by the Demand Model, that will be received for computers due during the period, and chooses bids for these RFQs at the same time as the actual RFQs from the current day. Once the predicted RFQs are generated, the Offer Acceptance Predictor is used to generate an acceptance prediction function for every RFQ, both real and predicted. The acceptance prediction functions for predicted RFQs are shifted based on the price changes predicted, as described in Section 4.2.
The Demand Manager then considers the production resources remaining, the set of RFQs, and the set of acceptance prediction functions and simultaneously generates a set of bids on RFQs and a production schedule that produces the expected resulting orders, using the following modification of the greedy scheduler. If we were considering only a single RFQ and had no resource constraints, the expected profit resulting from a particular bid price would be:

    Expected profit = P(order | price) × (price − cost)    (1)
The optimal bid would be the value that maximized this quantity. Computing the expected profit from a set of bids when resource constraints are considered is much more difficult, however, because the profit from each RFQ cannot be computed independently. For each possible set of orders in which it is not possible to fill all orders, the profit obtained depends on the agent's production and delivery strategy. For any nontrivial production and delivery strategy, precise calculation of the expected profit would require separate consideration of a number of possible outcomes that is exponential in the number of RFQs. If we were guaranteed that we would be able to fill all orders, we would not have this problem. The expected profit from each RFQ could be computed independently, and we would have:

    Expected profit = Σ_{i ∈ all RFQs} P(order_i | price_i) × (price_i − cost_i)    (2)
Our bidding heuristic is based on the assumption that the expected number of computers ordered for each RFQ will be the actual number ordered. In other words, we pretend that it is possible to win a part of an order, so that instead of winning an entire order with probability p, we win a fraction p of an order with probability 1. This assumption greatly simplifies the consideration of filling orders, since we now have only one set of orders to consider, while leaving the formulation of expected profit unchanged. As long as it is possible to fill the partial orders, (2) will hold, where the probability term now refers to the fraction of the order won. It would appear that this approach could lead to unfilled orders when the agent wins more orders than expected, but in practice, this is not generally a problem. Most of the RFQs being bid on are the predicted RFQs that will be received on future days, and so the agent can modify its future bidding behavior to correct for an unexpectedly high number of orders resulting from the current day’s RFQs. TacTex-06 indeed tends to have very few late or missed deliveries using this bidding strategy. By using this notion of partial orders, we can transform the problem of bid selection into the problem of finding the most profitable set of partial orders that can be filled with the resources available, and we can solve this
problem using the greedy production scheduler. All bids are initially set to be just above the reserve price, which means we begin with no orders. The scheduler then chooses an RFQ and an amount by which its bid will be lowered, resulting in an increased partial order for that RFQ. The scheduler simulates filling this increase by scheduling its production as described previously. This process is repeated until no more production is possible or no bid can be reduced without reducing the expected profit.

Because we are working with resource constraints, the goal of the greedy production scheduler at each step is to obtain the largest possible increase in profit using the fewest possible production resources. At each step, the scheduler considers each RFQ and determines the bid reduction that will produce the largest increase in profit per additional computer. The scheduler then selects the RFQ for which this value is the largest. In many cases, however, the most limited resource is production cycles, and not components. In such cases, the increase in profit per cycle used is a better measure of the desirability of a partial order than the increase in profit per additional computer, so we divide the latter quantity by the number of cycles required to produce the type of computer requested by the RFQ and use the resulting values to choose which RFQ should be considered next. We consider cycles to be the limiting factor whenever the previous day's production used more than 90% of the available cycles.

The range of possible bid prices is discretized for the sake of efficiency. Even with fairly fine granularity, this bidding heuristic produces a set of bids in significantly less time than the 15 s allowed per simulated game day. The complete bidding heuristic is summarized in Table 2.

Table 2
The bidding heuristic

For each RFQ, compute both the probability of winning and the expected profit as a function of price.
Set the bid for each RFQ to be just above the reserve price.
Repeat until no RFQs are left in the list of RFQs to be considered:
  For each RFQ, find the bid lower than the current bid that produces the largest increase in profit per additional computer ordered (or per additional cycle required during periods of high factory utilization).
  Choose the RFQ and bid that produce the largest increase.
  Try to schedule production of the partial order resulting from lowering the bid.
  If it cannot be scheduled, remove the RFQ from the list.
  If the production was scheduled, but no further decrease in the bid will lead to an increase in profit, remove the RFQ from the list.
Return the final bid for each RFQ.
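A minimal sketch of the flavor of this heuristic is given below; it is not the TacTex-06 implementation. It treats fractional "partial orders", scores each candidate bid reduction by expected profit gained per factory cycle (the high-utilization variant of the rule in Table 2), and checks feasibility against a single pooled cycle budget rather than the full production scheduler. The price grid, the RFQ representation, and the one-step-at-a-time search are all simplifying assumptions.

```python
import numpy as np

def choose_bids(rfqs, acceptance, cycle_budget, price_step=10.0):
    """Greedy partial-order bid selection.

    rfqs         -- list of dicts with 'reserve', 'qty', 'cost', 'cycles_per_unit'
    acceptance   -- acceptance[i](price) -> probability of winning RFQ i at `price`
    cycle_budget -- total factory cycles assumed available for these RFQs
    """
    bids = [r["reserve"] + 1.0 for r in rfqs]   # start just above reserve: no orders
    committed = [0.0] * len(rfqs)               # expected (fractional) computers won
    active = set(range(len(rfqs)))

    def expected_units(i, price):
        return acceptance[i](price) * rfqs[i]["qty"]

    def expected_profit(i, price):
        return expected_units(i, price) * (price - rfqs[i]["cost"])

    while active and cycle_budget > 0:
        best = None  # (score, rfq index, candidate price, extra units)
        for i in list(active):
            r = rfqs[i]
            candidates = np.arange(bids[i] - price_step, r["cost"], -price_step)
            if len(candidates) == 0:
                active.discard(i)
                continue
            for p in candidates:
                gain = expected_profit(i, p) - expected_profit(i, bids[i])
                extra = expected_units(i, p) - expected_units(i, bids[i])
                if extra <= 0 or gain <= 0:
                    continue
                score = gain / (extra * r["cycles_per_unit"])  # profit per cycle
                if best is None or score > best[0]:
                    best = (score, i, p, extra)
                break  # only the next profitable step down is considered per RFQ
            else:
                active.discard(i)  # no profitable reduction left for this RFQ
        if best is None:
            break
        _, i, p, extra = best
        needed_cycles = extra * rfqs[i]["cycles_per_unit"]
        if needed_cycles > cycle_budget:
            active.discard(i)      # the increase cannot be "scheduled"; drop the RFQ
            continue
        cycle_budget -= needed_cycles
        bids[i], committed[i] = p, committed[i] + extra
    return bids, committed
```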
4.3.4 Completing production and delivery

After applying the production scheduler to the current orders and RFQs, the Demand Manager is left with a 10-day production schedule, a record of how each order was filled, and a set of bids for the actual and predicted RFQs. The bids on actual RFQs can be sent directly to customers in their current form, and computers scheduled for delivery can be shipped. The Demand Manager then considers modifications to the production schedule to send to the factory for the next day. If there are no cycles remaining on the first day of the 10-day production schedule, the first day can be sent unchanged to the factory. Otherwise, the Demand Manager shifts production from future days into the first day so as to utilize all cycles, if possible.

4.3.5 Production beyond 10 days

The components purchased by the Supply Manager depend on the component use projected by the Demand Manager. If we want to allow the possibility of ordering components more than 10 days in advance, the Demand Manager must be able to project its component use beyond the 10-day period for which it plans production. One possibility we considered was to extend this period and predict RFQs farther into the future. Another was to predict future computer and component prices by estimating our opponents' inventories and predicting their future behavior. Neither method provided accurate predictions of the future, and both resulted in large swings in projected component use from one day to the next. The Demand Manager thus uses a simple and conservative prediction of future component use.

The Demand Manager attempts to predict its component use for the period between 11 and 40 days in the future. (Within the next 10 days, the components used in the 10-day production schedule serve as the prediction, and situations in which it is advantageous to order components more than 40 days in advance appear to be rare.) The Demand Model is used to predict customer demand during this period, and the Demand Manager assumes that it will win, and thus need to produce, some fraction of this demand. This fraction ranges from zero during times of low demand to 1/6 during times of moderate or high demand, although the Demand Manager will not predict a higher level of component use than is possible given the available factory cycles. While this method of projecting component use yields reasonable results, improving the prediction is a significant area for future work.

5 The Supply Manager
The Supply Manager is responsible for purchasing components from suppliers based on the projection of future component use provided by the Demand Manager, and for informing the Demand Manager of expected component deliveries and replacement costs. In order to be effective, the
Supply Manager must be able to predict future component availability and prices. The Supplier Model assists in these predictions.

5.1 Supplier Model

The Supplier Model keeps track of all information sent to and received from suppliers. This information is used to model the state of each supplier, allowing predictions to be made. The Supplier Model performs three main tasks: predicting component prices, tracking reputation, and generating probe RFQs to improve its models.

5.1.1 Price prediction

To assist the Supply Manager in choosing which RFQs to send to suppliers, the Supplier Model predicts the price that a supplier will offer in response to an RFQ with a given quantity and due date. The Supplier Model requires an estimate of each supplier's existing commitments in order to make this prediction. Recall that the price offered in response to an RFQ requesting delivery on a given day is determined entirely by the fraction of the supplier's capacity that is committed through that day. As a result, the Supplier Model can compute this fraction from the price offered. If two offers with different due dates are available, the fraction of the supplier's capacity that is committed in the period between the first and second date can be determined by subtracting the total capacity committed before the first date from that committed before the second. With enough offers, the Supplier Model can form a reasonable estimate of the fraction of capacity committed by a supplier on any single day. For each supplier and supply line, the Supplier Model maintains an estimate of free capacity, and updates this estimate daily based on offers received. Using this estimate, the Supplier Model is able to make predictions on the price a supplier will offer for a particular RFQ.
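To illustrate the inversion and differencing described above, here is a minimal sketch. The pricing rule used (price proportional to one minus half the free-capacity fraction) is only an assumed stand-in for the game's actual supplier pricing rule, and the daily capacity constant and function names are hypothetical.

```python
DAILY_CAPACITY = 500  # nominal units per day; illustrative value only

def committed_fraction(offer_price, base_price):
    """Invert the assumed pricing rule to recover the fraction of capacity
    committed through the offer's due day."""
    free_fraction = 2.0 * (1.0 - offer_price / base_price)
    return 1.0 - min(max(free_fraction, 0.0), 1.0)

def committed_between(day1, price1, day2, price2, base_price):
    """Capacity committed between two due dates (day1 < day2), estimated by
    differencing the cumulative commitments implied by the two offered prices."""
    total1 = committed_fraction(price1, base_price) * DAILY_CAPACITY * day1
    total2 = committed_fraction(price2, base_price) * DAILY_CAPACITY * day2
    return max(total2 - total1, 0.0)

def predict_offer_price(free_capacity_through_day, day, base_price):
    """Forward direction: given an estimate of free capacity through `day`,
    predict the price the supplier would offer (same assumed rule)."""
    free_fraction = free_capacity_through_day / (DAILY_CAPACITY * day)
    return base_price * (1.0 - 0.5 * min(max(free_fraction, 0.0), 1.0))
```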
5.1.2 Reputation

When deciding which RFQs to send, the Supply Manager needs to be careful to maintain a good reputation with suppliers. Each supplier has a minimum acceptable purchase ratio, and the Supply Manager tries to keep this ratio above the minimum. The Supplier Model tracks the offers accepted from each supplier and informs the Supply Manager of the quantity of offered components that can be rejected from each supplier before the ratio falls below the minimum.

5.1.3 Price probes

The Supply Manager will often not need to use the full five RFQs allowed each day per supplier line. In these cases, the remaining RFQs can be used
as zero-quantity price probes to improve the Supplier Model’s estimate of a supplier’s committed capacity. For each supplier line, the Supplier Model records the last time each future day has been the due date for an offer received. Each day, the Supply Manager informs the Supplier Model of the number of RFQs available per supplier line to be used as probes. The Supplier Model chooses the due dates for these RFQs by finding dates that have been used as due dates least recently.
5.2 Supply Manager

The Supply Manager's goal is to obtain the components that the Demand Manager projects it will use at the lowest possible cost. This process is divided into two steps: first the Supply Manager decides what components will need to be delivered, and then it decides how best to ensure the delivery of these components. These two steps are described below, along with an alternative means of obtaining components.

5.2.1 Deciding what to order

The Supply Manager seeks to keep the inventory of each component above a certain threshold. This threshold (determined experimentally) is 800, or 400 in the case of CPUs, and decreases linearly to zero between days 195 and 215. Each day the Supply Manager determines the deliveries that will be needed to maintain the threshold on each day in the future. Starting with the current component inventory, the Supply Manager moves through each future day, adding the deliveries from suppliers expected for that day, subtracting the amount projected to be used by the Demand Manager for that day, and making a note of any new deliveries needed to maintain the threshold. The result is a list of needed deliveries that we will call intended deliveries.

When informing the Demand Manager of the expected future component deliveries, the Supply Manager will add these intended deliveries to the actual deliveries expected from previously placed component orders. The idea is that although the Supply Manager has not yet placed the orders guaranteeing these deliveries, it intends to, and is willing to make a commitment to the Demand Manager to have these components available. Because prices offered in response to short-term RFQs can be very unpredictable, the Supply Manager never makes plans to send RFQs requesting delivery in less than five days. (One exception is discussed later.) As discussed previously, no component use is projected beyond 40 days in the future, meaning that the intended deliveries fall in the period between 5 and 40 days in the future.
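A minimal sketch of this forward simulation follows; it is illustrative only, not the agent's code. The threshold values and the 5-to-40-day window come from the description above, while the dictionary-based interface is an assumption.

```python
def inventory_threshold(component, game_day):
    """800 units (400 for CPUs), decreasing linearly to zero between days 195 and 215."""
    base = 400 if component == "cpu" else 800
    if game_day <= 195:
        return base
    if game_day >= 215:
        return 0.0
    return base * (215 - game_day) / 20.0

def intended_deliveries(component, game_day, inventory,
                        expected_deliveries, projected_use, horizon=40):
    """Walk through future days and note deliveries needed to keep inventory
    at or above the threshold.

    expected_deliveries[d], projected_use[d] -- quantities d days in the future
    Returns {days_ahead: quantity}, the list of intended deliveries.
    """
    needed = {}
    level = inventory
    for d in range(1, horizon + 1):
        level += expected_deliveries.get(d, 0)
        level -= projected_use.get(d, 0)
        threshold = inventory_threshold(component, game_day + d)
        # RFQs due in fewer than five days are never planned; those gaps are
        # left to the opportunistic 2-day RFQs described in Section 5.2.4.
        if level < threshold and d >= 5:
            shortfall = threshold - level
            needed[d] = shortfall
            level += shortfall   # assume the intended delivery arrives on day d
    return needed
```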
5.2.2 Deciding how to order

Once the Supply Manager has determined the intended deliveries, it must decide how to ensure their delivery at the lowest possible cost. We simplify this task by requiring that for each component and day, that day's intended delivery will be supplied by a single order with that day as the due date. Thus, the only decisions left for the Supply Manager are when to send the RFQ and which supplier to send it to. For each individual intended delivery, the Supply Manager predicts whether sending the RFQ immediately will result in a lower offered price than waiting for some future day, and sends the RFQ if this is the case.

To make this prediction correctly, the Supply Manager would need to know the prices that would be offered by a supplier on any future day. Although this information is clearly not available, the Supplier Model does have the ability to predict the prices that would be offered by a supplier for any RFQ sent on the current day. To enable the Supply Manager to extend these predictions into the future, we make the simplifying assumption that the price pattern predicted on the current day will remain the same on all future days. In other words, if an RFQ sent on the current day due in i days would result in a certain price, then sending an RFQ on any future day d due on day d + i would result in the same price. This assumption is not entirely unrealistic due to the fact that agents tend to order components a certain number of days in advance, and this number generally changes slowly. Essentially, we are saying, "Given the current ordering pattern of other agents, prices are lowest when RFQs are sent x days in advance of the due date, so plan to send all RFQs x days in advance."

The resulting procedure followed by the Supply Manager is as follows. For each intended delivery, the Supplier Model is asked to predict the prices that would result from sending RFQs today with various due dates requesting the needed quantity. A price is predicted for each due date between 5 and 40 days in the future. (Each price is then modified slightly according to a heuristic that will be presented in the next section.) If there are two suppliers, the lower price is used. If the intended delivery is needed in i days, and the price for ordering i days in advance is lower than that of any smaller number of days, the Supply Manager will send the RFQ. Any spare RFQs will be offered to the Supplier Model to use as probes.

The final step is to predict the replacement cost of each component. The Supply Manager assumes that any need for additional components that results from the decisions of the Demand Manager will be felt on the first day on which components are currently needed, i.e., the day with the first intended delivery. Therefore, for each component's replacement cost, the Supply Manager uses the lowest price found when considering the first intended delivery of that component, even if no RFQ was sent.

For each RFQ, a reserve price somewhat higher than the expected offer price is used. Because the Supply Manager believes that the RFQs it sends
are the ones that will result in the lowest possible prices, all offers are accepted. If the reserve price cannot be met, the Supplier Model's predictions will be updated accordingly and the Supply Manager will try again the next day.

5.2.3 Waiting to order in certain cases

(This section presents a significant addition to the previous agent, TacTex-05.)

When prices are lower for long-term orders than short-term orders, the Supply Manager faces an interesting tradeoff. Waiting to order an intended delivery in the short term is expected to increase costs, but by waiting the agent might gain a clearer picture of its true component needs. For example, if customer demand suddenly drops, the agent may be better off if it has waited to order and can avoid unnecessary purchases, even if prices are somewhat higher for those components which the agent does purchase. Using the ordering strategy of the previous section, however, the Supply Manager would always choose to place long-term orders no matter how small the expected increase in cost would be if it waited. A number of experiments using the previous version of the agent, TacTex-05, suggest that agent performance would improve if the Supply Manager were to postpone ordering in such situations (Pardoe and Stone, 2006).

One possible way of ensuring this behavior would be to modify the current strategy so that instead of sending a request as soon as the predicted price is at its lowest point, the request is only sent when it is believed to be unlikely that a reasonably close price can still be obtained. In TacTex-06, the Supply Manager implements an approximation of this strategy using a straightforward heuristic: predictions of offer prices are increased by an amount proportional to the distance of the requested due date. In particular, the predicted price for a requested due date d days away, 5 ≤ d ≤ 40, is multiplied by 1 + xd, where xd = 0.1(d − 5)/35. Predicted prices are thus increased by between 0% and 10%, values chosen through experimentation. As a result, the Supply Manager will wait to order when long-term prices are only slightly lower than short-term prices.
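The two rules just described (send the RFQ only when ordering at the full lead time is cheapest, after inflating longer-lead-time predictions by up to 10%) can be sketched as follows. The prediction interface is hypothetical; only the 0.1(d − 5)/35 factor and the 5-to-40-day window come from the text.

```python
def adjusted_price(predicted_price, lead_time_days):
    """Inflate a predicted offer price in proportion to how far out the due date is."""
    assert 5 <= lead_time_days <= 40
    x = 0.1 * (lead_time_days - 5) / 35.0
    return predicted_price * (1.0 + x)

def should_send_today(predicted_prices, needed_in_days):
    """Send the RFQ today iff ordering at the full lead time is (adjusted-)cheapest.

    predicted_prices -- dict: lead time in days (5..40) -> price predicted for an
                        RFQ sent today with that due date (hypothetical interface)
    needed_in_days   -- lead time of this intended delivery
    """
    if needed_in_days < 5:
        return False  # never planned; short-term needs are left to 2-day RFQs
    price_now = adjusted_price(predicted_prices[needed_in_days], needed_in_days)
    # Waiting j days means ordering with lead time (needed_in_days - j) later on;
    # under the "same pattern every day" assumption, its price is the one
    # predicted today for that shorter lead time.
    later_prices = [adjusted_price(predicted_prices[d], d)
                    for d in range(5, needed_in_days)]
    return all(price_now <= p for p in later_prices)
```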
5.2.4 2-Day RFQs

As mentioned previously, the prices offered in response to RFQs requesting near-immediate delivery are very unpredictable. If the Supply Manager were to wait until the last minute to send RFQs in hopes of low prices, it might frequently end up paying more than expected or be unable to buy the components at all. To allow for the possibility of getting low-priced short-term orders without risk, the Supply Manager sends RFQs due in 2 days, the minimum possible, for small quantities in addition to what is required by the intended deliveries. If the prices offered are lower than those expected from the normal RFQs, the offers will be accepted.
The size of each 2-day RFQ depends on the need for components, the reputation with the supplier, and the success of past 2-day RFQs. Because the Supply Manager may reject many of the offers resulting from 2-day RFQs, it is possible for the agent’s reputation with a supplier to fall below the acceptable purchase ratio. The Supplier Model determines the maximum amount from each supplier that can be rejected before this happens, and the quantity requested is kept below this amount. The Supply Manager decides whether to accept an offer resulting from a 2-day RFQ by comparing the price to the replacement cost and the prices in offers resulting from normal RFQs for that component. If the offer price is lower than any of these other prices, the offer is accepted. If the quantity in another, more expensive offer is smaller than the quantity of the 2-day RFQ, then that offer may safely be rejected. The 2-day RFQs enable the agent to be opportunistic in taking advantage of short-term bargains on components without being dependent on the availability of such bargains.
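The sizing and accept/reject logic for these probe orders can be sketched as follows; the function names and parameters are hypothetical, introduced only to illustrate the two comparisons described above.

```python
def two_day_rfq_quantity(desired_quantity, rejectable_quantity):
    """Size the 2-day request so that rejecting all of it cannot push the agent's
    purchase ratio below the supplier's minimum (rejectable_quantity comes from
    the Supplier Model's reputation tracking)."""
    return max(0, min(desired_quantity, rejectable_quantity))

def accept_two_day_offer(offer_price, replacement_cost, normal_offer_prices):
    """Accept a 2-day offer only if it is cheaper than the replacement cost and
    cheaper than every offer received from the normal RFQs for this component."""
    benchmarks = [replacement_cost] + list(normal_offer_prices)
    return all(offer_price < b for b in benchmarks)
```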
6 Adaptation over a series of games
The predictions made by the predictive modules as described above are based only on observations from the current game. Another source of information that could be useful in making predictions is the events of past games, made available in log files kept by the game server. During the final rounds of the TAC SCM competition, agents are divided into brackets of six and play a number of games (16 on the final day of competition) against the same set of opponents. When facing the same opponents repeatedly, it makes sense to consider adapting predictions in response to completed games. TacTex-06 makes use of information from these games in its decisions during two phases of the game: buying components at the beginning of the game (impacting mainly the behavior described in Section 5.2), and selling computers at the end of the game (impacting the behavior in Section 4.2). In both cases, only past games within a bracket are considered, and default strategies are used when no game logs are yet available. We chose to focus on these areas for two reasons. Behavior during these two phases varies significantly from one agent to another, possibly due to the fact that these phases are difficult to reason about in general and may thus be handled using special-case heuristic strategies by many agents. At the same time, each agent’s behavior remains somewhat consistent from game to game (e.g. many agents order the same components at the beginning of each game). This fact is critical to the success of an adaptive strategy—the limited number of games played means that it must be possible to learn an effective response from only a few past games.
6.1 Initial component orders

At the beginning of each game, many agents place relatively large component orders (when compared to the rest of the game) to ensure that they will be able to produce computers during the early part of the game. Prices for some components may also be lower on the first day than they will be afterwards, depending on the due date requested. Determining the optimal initial orders to place is difficult, because no information is made available on the first day of the game, and prices depend heavily on the orders of other agents. TacTex-06 addresses this issue by analyzing component costs from past games and deciding what components need to be requested on the first two days in order to ensure a sufficient supply of components early in the game and to take advantage of low prices. The process is very similar to the one described in Section 5.2, except that predictions of prices offered by suppliers are based on past games. First, the components needed are identified, then the decision of which components should be requested is made, and finally the RFQs are generated.

The Supply Manager begins by deciding what components will be needed. On the first day, when no demand information is available (customers begin sending RFQs on the second day), the Supply Manager assumes that it will be producing an equal number of each type of computer, and projects the components needed to sustain full factory utilization for 80 days. On the second day, the Supply Manager projects future customer demand as before and assumes it will receive orders for some fraction of RFQs over each of the next 80 days. The projected component use is converted into a list of intended deliveries as before. (The Supply Manager makes no projections beyond the first 80 days, because we have not observed instances where it would be worthwhile to order components so far in advance.)

Next, the Supply Manager must decide which components should be requested on the current day (the first or second day of the game). As in Section 5.2.2, the Supply Manager must determine which intended deliveries will be cheapest if they are requested immediately. At the beginning of the game, the Supplier Model will have no information to use in predicting prices, and so information from past games is used. By analyzing the log from a past game and modeling the state of each supplier, it is possible to determine the exact price that would have been offered in response to any possible RFQ. Predictions for the current game can be made by averaging the results from all past games. When modeling the states of suppliers, RFQs and orders from TacTex-06 are omitted to prevent the agent from trying to adapt to its own behavior. If the initial component purchasing strategies of opponents remain the same from game to game, these average values provide a reasonable means of estimating prices.
At the beginning of the game, the Supply Manager reads in a table from a file that gives the average price for each component for each pair of request date and due date. Using this table, the Supply Manager can determine which intended deliveries will cost less if requested on the current day than on any later day. Intended deliveries due within the first 20 days are always requested on the first day, however, to avoid the possibility that they will be unavailable later. If opponents request many components on the first day of the game but few on the second, the prices offered in response to RFQs sent on the second day will be about the same as if the RFQs had been sent on the first day. Since information about customer demand is available on the second day of the game but not on the first, it might be beneficial to wait until the second day to send RFQs. For this reason, the Supply Manager will not send a request for an intended delivery if the price expected on the second day is less than 3% more than the price expected on the first.

Once the Supply Manager has decided which intended deliveries to request, it must decide how to combine these requests into the available number of RFQs (five, or ten if there are two suppliers). In Section 5.2.2, this problem did not arise, because there were typically few requests per day. On the first two days, it is possible for the number of intended deliveries requested to be much larger than the number of RFQs available. Intended deliveries will therefore need to be combined into groups, with delivery on the earliest group member's delivery date. The choice of grouping can have a large impact on the prices offered. When there is only one supplier, the Supply Manager begins by dividing the 80-day period into five intervals, defined by six interval endpoints, with a roughly equal number of intended deliveries in each interval. Each interval represents a group of intended deliveries that will have delivery requested on the first day of the interval. One at a time, each endpoint is adjusted to minimize the sum of expected prices plus storage costs for those components delivered early. When no more adjustments will reduce the cost, the Supply Manager sends the resulting RFQs. When there are two suppliers, 10 intervals are used, and intervals alternate between suppliers.
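A minimal sketch of this grouping step follows (not the agent's code): it nudges interval endpoints one at a time while a hypothetical cost function, combining each group's predicted RFQ price with storage costs for early-delivered items, keeps improving. The cost model, the storage rate, and the equal-length initialization are assumptions (the agent initializes with roughly equal numbers of deliveries per interval instead).

```python
def group_cost(deliveries, start, end, predicted_price, storage_per_day):
    """Cost of one group: every delivery in [start, end) is requested for the
    group's first day, so later-needed items pay storage until they are needed."""
    group = [d for d in deliveries if start <= d["due_day"] < end]
    if not group:
        return 0.0
    first_day = min(d["due_day"] for d in group)
    cost = 0.0
    for d in group:
        cost += predicted_price(first_day) * d["quantity"]
        cost += storage_per_day * d["quantity"] * (d["due_day"] - first_day)
    return cost

def optimize_endpoints(deliveries, predicted_price, horizon=80,
                       n_intervals=5, storage_per_day=0.5):
    """Adjust interior endpoints one at a time until no move lowers total cost."""
    # Start with roughly equal-length intervals over the horizon.
    ends = [round(horizon * k / n_intervals) for k in range(n_intervals + 1)]

    def total_cost(e):
        return sum(group_cost(deliveries, e[k], e[k + 1],
                              predicted_price, storage_per_day)
                   for k in range(n_intervals))

    improved = True
    while improved:
        improved = False
        for k in range(1, n_intervals):            # interior endpoints only
            for move in (-1, 1):
                trial = list(ends)
                trial[k] += move
                if not (trial[k - 1] < trial[k] < trial[k + 1]):
                    continue
                if total_cost(trial) < total_cost(ends):
                    ends, improved = trial, True
    return ends
```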
6.2 Endgame sales

Near the end of each game, some agents tend to run out of inventory and stop bidding on computers, whereas other agents tend to have surplus computers, possibly by design, that they attempt to sell up until the last possible day. As a result, computer prices on the last few days of the game are often either very high or very low. When endgame prices will be high, it can be beneficial to hold on to inventory so as to sell it at a premium during the last days. When prices will be low, the agent should deplete its inventory earlier in the game. TacTex-06 adapts in response to the behavior of its
competitors in past games by adjusting the predictions of the Offer Acceptance Predictor (Section 4.2) during the last few days of each game. TacTex-06’s endgame strategy is essentially to reserve only as many computers for the final few days as it expects to be able to sell at high prices. In particular, from day 215 to 217, the Demand Manager will always respond to a customer RFQ (if it chooses to respond) by offering a price slightly below the reserve. For RFQs received on these days, the probability predicted by the Offer Acceptance Predictor is set to the fraction of computers that would have sold at the reserve price on that day in past games. When the Demand Manager plans for a period of production that includes one of these days, these acceptance probabilities will hopefully result in an appropriate number of computers being saved for these three days.
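The endgame override can be sketched as follows; the game-log statistics interface and the exact "slightly below the reserve" margin are hypothetical, while the day range and the fraction-sold-at-reserve rule come from the text.

```python
END_DAYS = (215, 216, 217)

def endgame_bid_and_acceptance(day, computer_type, past_reserve_sales, reserve_price):
    """For RFQs received on days 215-217, bid slightly below the reserve and
    predict acceptance as the fraction of computers that sold at the reserve
    price on that day in past games within the bracket."""
    if day not in END_DAYS:
        return None  # outside the endgame: use the normal Offer Acceptance Predictor
    sold, offered = past_reserve_sales[(day, computer_type)]  # hypothetical log statistics
    bid = 0.99 * reserve_price                                # assumed margin below reserve
    acceptance_probability = sold / offered if offered else 0.0
    return bid, acceptance_probability
```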
7 2006 Competition results
Out of 21 teams that participated in the final round of the 2006 TAC SCM competition, held over three days at AAMAS 2006, six advanced to the final day of competition. After 16 games between these agents, TacTex-06 had the highest average score, $5.9 million, followed closely by PhantAgent with $4.1 million, and DeepMaize with $3.6 million (competition scores are available at http://www.sics.se/tac/scmserver). Both PhantAgent and DeepMaize were much improved over their 2005 counterparts, and would very likely have beaten the previous year's champion, TacTex-05, if it had competed unchanged. It thus appears that the improvements present in TacTex-06 were an important part of its victory. Although it is difficult to assign credit for an agent's performance in the competition to particular components, we can make some observations that support this hypothesis.

Figure 4 shows the average, over all 16 games on the final day of the competition, of the profit earned per game day for the top three agents. Daily profit is computed by determining what computers were delivered to customers each day and which components in inventory went into those computers, and then subtracting costs from revenue. TacTex-06 clearly had the highest daily profits over the first 70 days of the game, and after this point profits were roughly equal for all three agents. The difference in profits appears to be accounted for by higher revenue per computer. During the first 70 days of each game, TacTex-06 sold about as many computers as PhantAgent and DeepMaize while paying roughly the same costs for components, but TacTex-06 almost always had a much higher average sales price for each type of computer. After day 70, TacTex-06 still had somewhat higher average computer prices, but these were offset by higher component costs than the other two agents paid.
Fig. 4. Daily profits for the top three agents (TacTex-06, PhantAgent, DeepMaize) on the final day of the 2006 competition, averaged over all 16 games.
The ability of TacTex-06 to sell computers at higher prices appears to be due to its attempt to predict future changes in computer prices and react accordingly. During the competition, TacTex-06 could often be seen building up its inventory of completed computers before prices rose or selling off its inventory as prices peaked, while such behavior among other agents was less visible. This behavior can explain not only the fact that TacTex-06 sold computers at higher prices, but also the fact that the advantage was especially large in the first portion of each game (to see why, consider Fig. 3). For this particular game and computer type, prices began very high, then fell rapidly before recovering somewhat. This pattern is actually very common. Agents begin with no components or computers in inventory, and the supply of computers is thus much smaller than the demand in the beginning of each game. As agents obtain components and begin selling computers, prices usually drop rapidly. Due to the rapid changes in computer prices and the predictability of this pattern, the attempts by TacTex-06 to predict and exploit changes in prices are particularly effective in this period of the game. To get a clearer picture of how the improvements in TacTex-06 contribute to its performance, we perform a series of controlled experiments in the following section.
8 Experiments
We now present the results of controlled experiments designed to measure the impact of individual components of TacTex-06 on its overall
performance. In each experiment, two versions of TacTex-06 compete: one unaltered agent that matches the description provided previously, and one agent that has been modified in a specific way. Each experiment involves 30 games. The other four agents competing, Mertacor, DeepMaize, MinneTAC, and PhantAgent (all versions from 2005), are taken from the TAC Agent Repository. (Experiments against different combinations of agents appear to produce qualitatively similar results.) Experimental results are shown in Table 3. Each experiment is labeled with a number. The columns represent the averages over the 30 games of the total score (profit), percent of factory utilization over the game (which is closely correlated with the number of computers sold), revenue from selling computers to customers, component costs, and the percentage of games in which the altered agent outscored the unaltered agent. In every experiment, the difference between the altered and unaltered agent is statistically significant with 99% confidence according to a paired t-test.

The first row, experiment 0, is provided to give perspective to the results of other experiments. In experiment 0, two unaltered agents are used, and all numbers represent the actual results obtained. In all other rows, the numbers represent the differences between the results of the altered agent and the unaltered agent (from that experiment, not from experiment 0). In general, the results of the unaltered agents are close to those in experiment 0, but there is some variation due to differences between games (e.g. customer demand), and due to the effects of the altered agent on the economy.
Table 3
Experimental results

Experiment   Description                               Score    Utilization (%)   Revenue    Costs     Win %
0            No changes                                $7.28M   83                $104.7M    $94.5M    —
1            No component price prediction increase    −1.42    +3                +3.51      +4.79     23
2            No computer price change prediction       −3.51    −1                −4.50      −0.70     0
3            No particle filter                        −1.97    −7                −10.05     −8.03     0
4            No particle filter or prediction          −3.93    −6                −10.99     −6.83     0
5            Heuristic price change prediction         −1.74    0                 −1.14      −0.64     13

Note: In each experiment, one altered version of TacTex-06 and one unaltered version compete in 30 games, along with four additional agents. Columns represent the total score, percent of factory utilization, revenue from customers, component costs, and how often the altered agent outscored the unaltered agent. Score, revenue, and cost figures are in millions of dollars. In experiment 0, provided to place other experiments' results in perspective, no alteration is made to TacTex-06, and numbers represent the actual results. In all other experiments, numbers represent the difference between the altered and unaltered agent. In each experiment, the difference between the altered and unaltered agent is statistically significant with 99% confidence according to a paired t-test.
8.1 Supply price prediction modification

As described in Section 5.2.3, the Supply Manager slightly increases the predictions of prices that will be offered for components by an amount proportional to the number of days before the requested due date. This addition to TacTex-06 is designed to cause the agent to favor short-term component orders over long-term orders if the difference in price is small. In experiment 1, an agent that does not use this technique is tested. Compared to the unaltered agent, this agent has increased component purchases and factory utilization, but the increase in revenue is not enough to offset the higher costs, and the final score is lower than that of the unaltered agent. It appears that the unaltered agent is able to avoid purchasing unprofitable components in some cases by waiting longer to place its orders.

8.2 Offer Acceptance Predictor

We now consider the impact of the improvements to the Offer Acceptance Predictor described in Section 4.2. In experiment 2, the altered
agent always predicts that future computer prices will remain unchanged. Not surprisingly, the result is a large decrease in revenue and score. The decrease in score is almost twice as large as the margin of victory for TacTex-06 in the 2006 competition ($1.8 million), adding more weight to the claim of Section 7 that the prediction of future price changes played a large role in the winning performance. In experiment 3, the particle filter used to generate predictions of offer acceptance is replaced with a simpler heuristic that was used in TacTex-05. This heuristic used linear regression over the results of the past five days’ offers to generate a linear function used for offer acceptance predictions and was originally used by the agent Botticelli in 2003 (Benisch et al., 2004a). The experiment shows that the particle filter approach is an improvement over this heuristic. The large drop in factory utilization in the altered agent is surprising. Experiment 4 shows the result when the changes of experiments 2 and 3 are combined: the agent makes no predictions of future price changes and uses the linear heuristic instead of the particle filter. The score is only slightly worse than in experiment 2, suggesting that the benefits of using the particle filter are more pronounced when price changes are predicted. It is possible that the more detailed and precise predictions of offer acceptance generated from the particle filter are necessary for the agent to effectively make use of the predictions of future price changes. In experiment 5, the learned predictor of price changes is replaced with a heuristic that performs linear regression on the average computer price over the last 10 days, and extrapolates the trend seen into the future to predict price changes. Although the heuristic’s predictions are reasonably accurate, the performance of the altered agent is about midway between that of the unaltered agent and that of the agent from experiment 2 that makes no predictions at all, demonstrating the value of learning an accurate predictor.
9 Related work
Outside of TAC SCM, much of the work on agent-based SCM has focused on the design of architectures for distributed systems in which multiple agents throughout the supply chain must be able to communicate and coordinate (Fox et al., 2000; Sadeh et al., 2001). These systems may involve a static supply chain or allow for the dynamic formation of supply chains through agent negotiation (Chen et al., 1999). Other work has focused on general solutions to specific subproblems such as procurement or delivery. TAC SCM appears to be unique in that it represents a concrete domain in which individual agents must manage a complete supply chain in a competitive setting.
A number of agent descriptions for TAC SCM have been published presenting various solutions to the problem. At a high level, many of these agents are similar in design to TacTex-06: they divide the full problem into a number of smaller tasks and generally solve these tasks using decisiontheoretic approaches based on maximizing utility given estimates of various values and prices. The key differences are the specific methods used to solve these tasks. The problem of bidding on customer RFQs has been addressed with a wide variety of solutions. Southampton-SCM (He et al., 2006) takes a fuzzy reasoning approach in which a rule base is developed containing fuzzy rules that specify how to bid in various situations. PSUTAC (Sun et al., 2004) takes a similar knowledge-based approach. DeepMaize (Kiekintveld et al., 2004) performs a game-theoretic analysis of the economy to decide which bids to place. RedAgent (Keller et al., 2004) uses a simulated internal market to allocate resources and determine their values, identifying bid prices in the process. The approach described in this chapter, where probabilities of offer acceptance are predicted and then used in an optimization routine, is also used in various forms by several other agents. CMieux (Benisch et al., 2006b) makes predictions using a form of regression tree that is trained on data from past games, Foreseer (Burke et al., 2006) uses a form of online learning to learn multipliers (similar to the day factors used in TacTex-06) indicating the impact of various RFQ properties on prices, and Botticelli (Benisch et al., 2004a) uses the heuristic described in Section 8.2. Like TacTex-06, many agents use some form of greedy production scheduling, but other, more sophisticated approaches have been studied. These include a stochastic programming approach, in which expected profit is maximized through the use of samples generated from a probabilistic model of possible customer orders (Benisch et al., 2004b) and an approach treating the bidding and scheduling problems as a continuous knapsack problem (Benisch et al., 2006a). In the latter case, an e-optimal solution is presented which is shown to produce results similar to the greedy approach of TacTex-06, but in significantly less time for large problems. Attention has also been paid to the problem of component procurement, although much of it has focused on an unintended feature of the game rules (eliminated in 2005) that caused many agents to purchase the majority of their components at the very beginning of the game (Kiekintveld et al., 2005). Most agents now employ approaches that involve predictions of future component needs and prices and are somewhat similar to the approach described in this chapter. These approaches are often heuristic in nature, although there are some exceptions; NaRC (Buffett and Scott, 2004) models the procurement problem as a Markov decision process and uses dynamic programming to identify optimal actions. Although several agents make efforts to adapt to changing conditions during a single game, such as MinneTAC (Ketter et al., 2005) and
Southampton-SCM (He et al., 2005), to our knowledge methods of adaptation to a set of opponents over a series of games in TAC SCM have not been reported on by any other agent. [Such adaptation has been used in the TAC Travel competition, however, both during a round of competition (Stone et al., 2001), and in response to hundreds of previous games (Stone et al., 2003).]
10 Conclusions and future work

In this chapter, we described TacTex-06, an SCM agent consisting of predictive, optimizing, and adaptive components. We analyzed its winning performance in the 2006 TAC SCM competition, and found evidence that the strategy of exploiting predicted changes in computer prices to increase revenue played a significant role in this performance. Controlled experiments verified the value of a number of improvements made to TacTex-05, the previous winner.

A number of areas remain open for future work. There is room for improvement in many of the predictions, possibly through additional uses of learning. Also, by looking farther ahead when planning offers to customers, it may be possible for the agent to better take advantage of the predicted changes in future prices. In addition, there is the question of what would happen if several agents attempted to utilize such a strategy for responding to price changes, and what the proper response to this situation would be.

The most important area for improvement, in both TacTex-06 and other TAC SCM agents, is likely increasing the degree to which agents are adaptive, to ensure robust performance regardless of market conditions. While developing TacTex-06, we had the opportunity to carefully tune agent parameters (such as inventory thresholds) and to test various agent modifications during several rounds of competition and in our own experiments with the available agent binaries. In addition, we were able to implement learning-based approaches that took advantage of data from past games. When developing agents for real-world supply chains, such sources of feedback and experience would be reduced in quantity or unavailable. Although it would still be possible to test agents in simulation, the market conditions encountered upon deployment might differ significantly from the simulated conditions. Designing agents that can adapt quickly given limited experience is therefore a significant part of our future research agenda. Ultimately, this research drives both towards understanding the implications and challenges of deploying autonomous agents in SCM scenarios, and towards developing new, complete, machine-learning-based autonomous agents in dynamic multiagent domains.
Acknowledgments

We would like to thank Jan Ulrich and Mark VanMiddlesworth for contributing to the development of TacTex, the SICS team for developing the game server, and all teams that have contributed to the agent repository. This research was supported in part by NSF CAREER award IIS-0237699.
References

Arulampalam, S., S. Maskell, N. Gordon, T. Clapp (2002). A tutorial on particle filters for on-line nonlinear/non-Gaussian Bayesian tracking. IEEE Transactions on Signal Processing 50(2), 174–188.
Benisch, M., A. Greenwald, I. Grypari, R. Lederman, V. Naroditskiy, M. Tschantz (2004a). Botticelli: a supply chain management agent, in: Third International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS), New York, NY, 3, 1174–1181.
Benisch, M., A. Greenwald, V. Naroditskiy, M. Tschantz (2004b). A stochastic programming approach to scheduling in TAC SCM, in: Fifth ACM Conference on Electronic Commerce, New York, NY, 152–159.
Benisch, M., J. Andrews, N. Sadeh (2006a). Pricing for customers with probabilistic valuations as a continuous knapsack problem, in: Eighth International Conference on Electronic Commerce, Fredericton, New Brunswick, Canada.
Benisch, M., A. Sardinha, J. Andrews, N. Sadeh (2006b). CMieux: adaptive strategies for competitive supply chain trading, in: Eighth International Conference on Electronic Commerce, Fredericton, New Brunswick, Canada.
Buffett, S., N. Scott (2004). An algorithm for procurement in supply chain management, in: AAMAS 2004 Workshop on Trading Agent Design and Analysis, New York, NY.
Burke, D.A., K.N. Brown, B. Hnich, A. Tarim (2006). Learning market prices for a real-time supply chain management trading agent, in: AAMAS 2006 Workshop on Trading Agent Design and Analysis/Agent Mediated Electronic Commerce, Hakodate, Japan.
Chen, Y., Y. Peng, T. Finin, Y. Labrou, S. Cost (1999). A negotiation-based multi-agent system for supply chain management, in: Workshop on Agent-Based Decision Support in Managing the Internet-Enabled Supply-Chain, at Agents '99, Seattle, Washington.
Collins, J., R. Arunachalam, N. Sadeh, J. Eriksson, N. Finne, S. Janson (2005). The supply chain management game for the 2006 trading agent competition. Technical report. Available at http://www.sics.se/tac/tac06scmspec_v16.pdf
Fox, M.S., M. Barbuceanu, R. Teigen (2000). Agent-oriented supply-chain management. International Journal of Flexible Manufacturing Systems 12, 165–188.
He, M., A. Rogers, E. David, N.R. Jennings (2005). Designing and evaluating an adaptive trading agent for supply chain management applications, in: IJCAI 2005 Workshop on Trading Agent Design and Analysis, Edinburgh, Scotland, UK.
He, M., A. Rogers, X. Luo, N.R. Jennings (2006). Designing a successful trading agent for supply chain management, in: Fifth International Joint Conference on Autonomous Agents and Multiagent Systems, Hakodate, Japan, 1159–1166.
Keller, P.W., F.-O. Duguay, D. Precup (2004). RedAgent—winner of TAC SCM 2003. SIGecom Exchanges: Special Issue on Trading Agent Design and Analysis 4(3), 1–8.
Ketter, W., J. Collins, M. Gini, A. Gupta, P. Schrater (2005). Identifying and forecasting economic regimes in TAC SCM, in: IJCAI 2005 Workshop on Trading Agent Design and Analysis, Edinburgh, Scotland, UK, 53–60.
Kiekintveld, C., M. Wellman, S. Singh, J. Estelle, Y. Vorobeychik, V. Soni, M. Rudary (2004). Distributed feedback control for decision making on supply chains, in: Fourteenth International Conference on Automated Planning and Scheduling, Whistler, British Columbia, Canada.
Kiekintveld, C., Y. Vorobeychik, M.P. Wellman (2005). An analysis of the 2004 supply chain management trading agent competition, in: IJCAI 2005 Workshop on Trading Agent Design and Analysis, Edinburgh, Scotland, UK.
Kumar, K. (2001). Technology for supporting supply-chain management. Communications of the ACM 44(6), 58–61.
Pardoe, D., P. Stone (2006). Predictive planning for supply chain management, in: Sixteenth International Conference on Automated Planning and Scheduling, Cumbria, UK.
Pardoe, D., P. Stone (2007). Adapting price predictions in TAC SCM, in: AAMAS 2007 Workshop on Agent Mediated Electronic Commerce, Honolulu, HI.
Sadeh, N., D. Hildum, D. Kjenstad, A. Tseng (2001). MASCOT: an agent-based architecture for dynamic supply chain creation and coordination in the Internet economy. Journal of Production, Planning and Control 12(3), 211–223.
Stone, P., M.L. Littman, S. Singh, M. Kearns (2001). ATTac-2000: an adaptive autonomous bidding agent. Journal of Artificial Intelligence Research 15, 189–206.
Stone, P., R.E. Schapire, M.L. Littman, J.A. Csirik, D. McAllester (2003). Decision-theoretic bidding based on learned density models in simultaneous, interacting auctions. Journal of Artificial Intelligence Research 19, 209–242.
Sun, S., V. Avasarala, T. Mullen, J. Yen (2004). PSUTAC: a trading agent designed from heuristics to knowledge, in: AAMAS 2004 Workshop on Trading Agent Design and Analysis, New York, NY.
Witten, I.H., E. Frank (1999). Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, San Francisco, CA.
Chapter 7
IT Advances for Industrial Procurement: Automating Data Cleansing for Enterprise Spend Aggregation
Moninder Singh and Jayant R. Kalagnanam IBM T.J. Watson Research Center, 1101 Kitchawan Road, Yorktown Heights, NY 10598, USA
Abstract

The last few years have seen tremendous changes in IT applications targeted towards improving the procurement activities of an enterprise. The significant cost savings generated by such changes have in turn led to an even greater focus on, and investments in, the development of tools and systems for streamlining enterprise procurement. While the earliest changes dealt with the development of electronic procurement systems, subsequent developments involved an increased shift to strategic procurement functions, and consequently towards the development of tools such as eRFPs and auctions as negotiation mechanisms. A recent trend is the move towards outsourcing part or all of the procurement function, especially for non-core procurement pieces, to emerging intermediaries who then provide the procurement function for the enterprise. In this practice, called Business Transformation Outsourcing (BTO), such third parties can substantially reduce procurement costs, in part by doing procurement on behalf of several different enterprises. An essential aspect of managing this outsourced procurement function is the ability to aggregate and analyze the procurement-spend across one or more enterprises, and rationalize this process. This too requires a new set of IT tools that are able to manage unstructured data and provide ways to efficiently aggregate and analyze spend information across potentially several enterprises. Typically, these data cleansing tasks are done manually using rudimentary data analysis techniques and spreadsheets. However, a significant amount of research has been conducted over the past couple of decades in various fields, such as databases, statistics and artificial intelligence, on the development of various data cleansing techniques, and their application to a broad range of applications and domains. This chapter provides a brief survey of these techniques and applications, and then discusses how some of these methods can be adapted to automate the various cleansing activities needed for spend data aggregation. Moreover, the chapter
provides a detailed roadmap for the development of such an automated system, enabling spend aggregation, especially across multiple enterprises, to be done in an efficient, repeatable and automated manner.
1 Introduction
By streamlining its procurement activities, an enterprise can realize substantial cost savings that directly impact the bottom line. Additionally, rapid developments in information technology (IT) have made this streamlining process significantly faster and cheaper than was possible just a few years ago. As such, more and more enterprises are recognizing this to be strategically essential and are devoting considerable effort and resources to improving their procurement activities, both in terms of reducing the total procurement spend as well as using what is spent more effectively.

Towards this end, enterprises have been increasingly using IT tools targeted primarily at their procurement activities. Over the past few years, these tools have gradually become more and more sophisticated, from both technological and functional standpoints. Initially, the focus was primarily on the development of electronic systems to assist daily procurement activity at an operational level. These were the early "procurement systems" that focused largely on managing the business process dealing with operational buying, streamlining it to follow procedures and authorization, as well as handling requisitioning and payment by electronic means. Thereafter, tool development moved to tackle some of the strategic functions of procurement, such as strategic sourcing. This led to an increased interest in the use of tools, such as eRFPs and auctions, as a way of negotiating price and non-price aspects of the requirements, and subsequently to the use of various auction mechanisms for negotiation on electronic exchanges and portals. The latest trend, however, is towards outsourcing non-core parts of the procurement function, especially for non-direct spend, to emerging intermediaries who then provide the procurement function.

An essential aspect of managing this outsourced procurement function (as well as of strategic sourcing for procurement that is kept in-house) is the ability to analyze the procurement-spend of a company (along various dimensions such as suppliers and commodities) and rationalize this process. Owing to this, one of the most important activities that an enterprise has to undertake prior to strategic sourcing or outsourcing its procurement functions is to develop a single, aggregated view of its procurement-spend across the entire enterprise. Since procurement activities normally take place across an enterprise, spanning multiple back-end systems and/or geographic and
functional areas and often using multiple procurement applications, spend aggregation becomes necessary to understand where the money is being spent, and on what. Once an aggregated, enterprise-wide view of spend is developed, it can be used by the enterprise for various strategic activities such as consolidating suppliers and negotiating better volume-based prices.

Spend aggregation becomes an even more essential activity in cases of procurement outsourcing. In such cases, a form of business transformation outsourcing (BTO), a third party (referred to henceforth as the BTO service provider) takes over the procurement functions of one or more enterprises (referred to henceforth as BTO clients). However, in order to do the procurement efficiently, the BTO service provider needs to aggregate spend across all these enterprises (i.e. the BTO clients plus the BTO service provider itself) so as to develop a consistent supplier base and a consistent commodity base, resulting in an accurate cross-enterprise view of exactly what is being procured and from whom. Using this view, the BTO service provider too can do significant strategic sourcing (similar to an enterprise doing strategic sourcing with its internal spend but on a much larger scale), such as evaluating all suppliers from which a particular commodity is acquired, and negotiating better deals with one or more of them based on the combined volume of that commodity across all the BTO clients.

Procurement outsourcing can lead to significant savings for an enterprise, especially since procurement accounts for a major part of enterprise costs. This is due to several reasons. First, by delegating the procurement function (generally a non-core business activity) to a third party, an enterprise can focus more on its core business operations, streamline its business processes and reduce the complexity and overhead of its operations (by eliminating an activity in which it does not have much expertise). Second, procurement outsourcing allows an enterprise to immediately shrink its cost structure by reducing or eliminating procurement-related resources, including headcount as well as hardware and procurement applications. Third, the cost for an enterprise to acquire goods falls, since the BTO service provider passes on some of the savings it generates via the bigger (volume-based) discounts it is able to get by aggregating spend over all its BTO clients, thereby generating higher commodity volumes, and directing that volume to fewer, preferred suppliers.

Moreover, the magnitude of the savings that can be generated by the BTO service provider is typically higher than what an enterprise could achieve by doing similar activities (such as volume aggregation, supplier consolidation, etc.) while keeping its procurement activities in-house. This can be attributed to three main reasons. First, the BTO service provider normally has significant expertise in procurement, and can utilize specialized and more efficient procurement processes. Second, taking on the procurement of multiple enterprises allows the service provider to take advantage of economies of scale. Third, the volume-based discounts that a service provider can negotiate with its suppliers are much higher than what any of the client enterprises could get by itself, since the
service provider can generate significantly higher commodity volumes by aggregating the procurement-spend across all of the clients and combining it with its own spend. For a BTO service provider that itself has significant procurement spend, this allows even greater savings to be negotiated.

Figure 1 illustrates the savings that can thus be generated by procurement BTO. Consider an enterprise that typically buys a certain volume, v1, of a given commodity under a pricing contract, c1, that it has negotiated with its supplier. Contract c2 corresponds to the BTO service provider, which has been able to negotiate a better deal by guaranteeing a larger minimum volume, v2. Now, simply by moving the above enterprise onto its current pricing contract, the BTO service provider can generate significant savings (volume v = v1 + v2 on contract c2). Moreover, the BTO service provider may now be able to negotiate an even better deal, say c3, due to the further increase in the volume of the given commodity, which allows even greater savings to be generated.
[Fig. 1. Example showing cost savings in procurement BTO: cost per unit vs. volume for contracts c1, c2 and c3, at volumes v1, v2 and v = v1 + v2.]
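To make the volume-aggregation effect sketched in Fig. 1 concrete, the short Python sketch below uses purely hypothetical tiered unit prices and volumes (the numbers and function name are illustrative, not taken from the chapter) to compare a client's stand-alone procurement cost with the cost after its volume is pooled with the BTO service provider's volume:

def unit_price(volume):
    # Hypothetical volume-tiered pricing: larger committed volumes get lower unit prices.
    if volume >= 15000:       # tier corresponding to contract c3
        return 8.0
    if volume >= 10000:       # tier corresponding to contract c2
        return 9.0
    return 10.0               # tier corresponding to contract c1

v1, v2 = 4000, 11000          # client volume and BTO service provider volume
standalone = v1 * unit_price(v1) + v2 * unit_price(v2)
pooled = (v1 + v2) * unit_price(v1 + v2)
print(standalone, pooled)     # 139000.0 vs 120000.0: pooling moves the combined
                              # volume onto the cheapest tier (contract c3)

Under these assumed tiers, pooling the two volumes lowers the total cost even before any further renegotiation, which is exactly the pattern the figure describes.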
However, spend data within an enterprise generally resides in multiple, disparate data sources, often distributed across several functional and geographic organizations. Moreover, data in these repositories comes from a variety of sources and applications, such as invoices, purchase orders, account ledgers, and payments. As such, this data is generally inconsistent, with no cross-indexing between transactions and different naming conventions used for suppliers and commodities, resulting in the same supplier or commodity being described differently in different transactions and/or systems. Consequently, spend aggregation typically requires a significant amount of effort, since the spend data has to be cleansed and rationalized so that discrepancies between multiple naming conventions get resolved, transactions get mapped to a common spend/commodity taxonomy, etc. Clearly, the level of difficulty, and the effort needed, to do this across multiple enterprises, as required for procurement BTO, gets progressively higher, since different supplier bases as well as multiple commodity taxonomies have to be reconciled and mapped. This has led to a renewed focus on the development of new tools and methodologies for managing the unstructured content inherent in spend data (e.g. commodity descriptions) and cleansing the data to enable spend aggregation, especially across multiple enterprises, to be done in an efficient, repeatable and automated manner.

Data cleansing has long been studied in various fields, such as statistics, databases and machine learning/data mining, resulting in a host of data cleansing techniques that have been applied to a multitude of different problems and domains, such as duplicate record detection in databases/data warehouses and linkage of medical records belonging to the same individual in different databases. Often, data cleansing has been a labor-intensive task requiring substantial human involvement. Automation has generally been addressed only recently, and then only in limited cases. Moreover, many of the problems tackled have been of a very specific nature and fairly domain specific. Nevertheless, the underlying techniques behind the solutions developed have generally been quite similar. Also, some of the problems addressed (e.g. duplicate detection) have much in common with some of the cleansing tasks needed for aggregation of enterprise spend.

As such, Section 2 provides an overview of various techniques for data cleansing that have been developed, and applied to various cleansing tasks, over the past few decades. Section 2.1 provides a broad, albeit brief, survey of the main data cleansing techniques and applications, while Sections 2.2–2.4 take three of these techniques that are quite useful for developing automated spend aggregation systems, discuss them in detail, and highlight their pros and cons for various data cleansing activities. Subsequently, Section 3 deals with the automation of data cleansing for spend aggregation, with Section 3.1 detailing the various data cleansing tasks that must be carried out to facilitate effective spend aggregation within and across enterprises, and Section 3.2 providing a detailed roadmap for developing an automated system for carrying out those tasks using the data cleansing techniques discussed in Section 2. Finally, we conclude and summarize this discussion in Section 4.
2 Techniques for data cleansing
As discussed previously, data cleansing has been studied in various fields and applied to several different problems and domains. Section 2.1 provides
a brief survey of some of the data cleansing literature. Sections 2.2–2.4 then take a closer look at some of the types of algorithms underlying commonly used data cleansing techniques.
2.1 Overview of data cleansing approaches

The data cleansing problem has been studied over several decades under various names, such as record linkage (Fellegi and Sunter, 1969; Winkler, 2002, 2006), duplicate detection (Bitton and Dewitt, 1983; Wang and Madnick, 1989), record matching (Cochinwala et al., 2001), the merge/purge problem (Hernandez and Stolfo, 1995), etc.1 This task, in general, refers to the identification of duplicates that may be present in data due to a variety of reasons, such as errors, different representations or notations, inconsistencies in the data, etc. While substantial work around this issue has been conducted in the statistics community with a focus on specific problems, such as record linkage in medical data for identifying medical records for the same person in multiple databases (Jaro, 1995; Newcombe, 1988) or for matching people across census or taxation records (Alvey and Jamerson, 1997; Jaro, 1989), a large body of literature also exists, especially in the database literature, on more general, domain-independent data cleaning, especially in the context of data integration and data warehousing (Bright et al., 1994; Dey et al., 2002; Lim and Chiang, 2004; Monge and Elkan, 1997).

From an algorithmic point of view, the techniques that have been studied for addressing the data cleansing problem can be broadly categorized into text similarity methods (Cohen, 2000; Hernandez and Stolfo, 1995; Monge and Elkan, 1997), unsupervised learning approaches, such as clustering (Cohen and Richman, 2002; Fellegi and Sunter, 1969), and supervised learning approaches (Bilenko and Mooney, 2003; Borkar et al., 2000; Winkler, 2002). Winkler (2006) provides an extensive and detailed survey of data cleansing approaches that have been developed using methods in one or more of these categories. A detailed discussion of this subject matter is beyond the scope of this chapter, and the interested reader is referred to Winkler's paper cited above, as well as numerous other survey articles (Cohen et al., 2003; Rahm and Do, 2000). Nevertheless, in the following sections, we discuss, in some level of detail, a few of the most commonly used techniques that are especially suited for the development of automated data cleansing techniques for enterprise spend aggregation.
1 We refer to this "classic" data cleansing problem as the "duplicate detection" problem in the rest of this chapter.
2.2 Text similarity methods

Some of the most commonly used methods for data cleansing have their roots in the information retrieval literature (Baeza-Yates and Ribeiro-Neto, 1999). Generally referred to as string or text similarity methods, these techniques often measure the "similarity" between different strings (with identical strings considered to be the most similar) on the basis of some metric that provides a quantitative measure of the "distance" between multiple strings; the higher the distance between them, the lower the similarity, and vice versa.2

2 We use the terms "similarity" and "distance" interchangeably, depending upon the interpretation that is more commonly used in the literature, with "higher similarity" analogous to "smaller distance" and vice versa.

One class of such functions comprises the so-called edit distance functions, which measure the distance between two strings as a cost function based on the minimum number of operations (character insertions, deletions and substitutions) needed to transform one string into the other. The Levenshtein distance (LD) (Levenshtein, 1966) is a basic edit distance that assumes a unit cost for each such operation. Several variations that use different costs for the various operations, as well as extensions of the basic edit distance, have also been proposed (Cohen et al., 2003; Navarro, 2001). Computation of the LD between two strings can be done using dynamic programming based on a set of recurrence relations, as described below.

Consider the calculation of the LD between two strings, say S and T, with lengths n and m, respectively. Let S[1...i] (and T[1...j]) and S[i] (and T[j]) represent the substring consisting of the first i (and j) characters and the ith (and jth) character of S (and T), respectively. Moreover, let LD(S[1...i], T[1...j]) be the distance between the substrings comprised of the first i characters of S and the first j characters of T. Then, LD(S,T) is given by LD(S[1...n], T[1...m]). It is easy to see that this computation can be done recursively by looking at the three different ways of transforming S[1...i] to T[1...j]. These are (i) converting S[1...i-1] to T[1...j-1] followed by converting S[i] to T[j], leading to a cost of LD(S[1...i-1], T[1...j-1]) plus the cost of replacing S[i] by T[j], which is either 0 (if the characters are the same) or 1 (if different); (ii) converting S[1...i-1] to T[1...j] and deleting S[i], leading to a cost of LD(S[1...i-1], T[1...j]) + 1; and (iii) converting S[1...i] to T[1...j-1] and inserting T[j], leading to a cost of LD(S[1...i], T[1...j-1]) + 1. The cost of converting S to T is then given by the minimum of these three costs, thus

LD(S[1 \ldots i], T[1 \ldots j]) = \min \begin{cases} LD(S[1 \ldots i-1], T[1 \ldots j-1]) + C_{sub} \\ LD(S[1 \ldots i-1], T[1 \ldots j]) + 1 \\ LD(S[1 \ldots i], T[1 \ldots j-1]) + 1 \end{cases}

where C_{sub} is either 0 or 1, as described in (i) above.
The form of these recurrence relations leads to a dynamic programming formulation of the LD computation as follows. Let C be an (n+1) by (m+1) array where C[i,j] represents LD(S[1...i], T[1...j]). Then, LD(S,T) = LD(S[1...n], T[1...m]) = C[n,m] is calculated by successively computing C[i,j] based on the recurrence relations above, as follows:

Initialization:
C[0,0] = 0
C[i,0] = i,  1 \le i \le n
C[0,j] = j,  1 \le j \le m

Calculation: Compute C[i,j] for all n \ge i \ge 1, m \ge j \ge 1 using the formula
C[i,j] = \min(C[i-1,j-1] + C_{sub},\; C[i-1,j] + 1,\; C[i,j-1] + 1)
where C_{sub} = 1 if S[i] \ne T[j] and 0 otherwise.

The advantage of using edit distance measures is that they are fairly robust to spelling errors and small local differences between strings. However, the computation of edit distances, as shown above, can be computationally expensive, especially when it has to be done repeatedly for comparing a large number of strings.
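As a concrete illustration of the dynamic programming formulation above, the following minimal Python sketch fills in the table C row by row; the function name and example strings are ours, chosen only for illustration:

def levenshtein(s, t):
    # Levenshtein distance via the dynamic programming table C described above.
    n, m = len(s), len(t)
    C = [[0] * (m + 1) for _ in range(n + 1)]   # C[i][j] holds LD(S[1..i], T[1..j])
    for i in range(1, n + 1):
        C[i][0] = i                             # deleting i characters of S
    for j in range(1, m + 1):
        C[0][j] = j                             # inserting j characters of T
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c_sub = 0 if s[i - 1] == t[j - 1] else 1
            C[i][j] = min(C[i - 1][j - 1] + c_sub,   # substitution (or match)
                          C[i - 1][j] + 1,           # deletion
                          C[i][j - 1] + 1)           # insertion
    return C[n][m]

# e.g. levenshtein("IBM Corp.", "IBMCorp.") returns 1 (one deleted space)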
Another class of commonly used distance-based similarity methods works by breaking up strings into bags of tokens and computing the distance between the strings based on these tokens. Tokens can be words (using white space and punctuation as delimiters), n-grams (consecutive n-character substrings), etc. The simplest way to then measure the similarity between two strings is to determine the number of tokens they have in common; the higher the count, the greater the similarity. However, since this generally favors longer strings, it is better to use normalized measures such as the Jaccard similarity (Jaccard, 1912), the Dice similarity (Dice, 1945) or the Cosine similarity (Salton and Buckley, 1987). A common way to represent such similarity measures is by considering the strings, say S and T, as vectors in a multi-dimensional vector space and representing them as weight vectors of the form S = {s_1, ..., s_n} and T = {t_1, ..., t_n}, where s_i and t_i are the weights assigned to the ith token (in the collection of all n tokens present in the system) for the strings S and T, respectively. Then, the vector product of the two weight vectors, \sum_{i=1}^{n} s_i t_i, measures the number of tokens that are common to the two strings, and the above-mentioned similarity measures can be expressed as follows:

Jaccard(S, T) = \frac{\sum_{i=1}^{n} s_i t_i}{\sum_{i=1}^{n} s_i^2 + \sum_{i=1}^{n} t_i^2 - \sum_{i=1}^{n} s_i t_i}   (1)

Dice(S, T) = \frac{2 \sum_{i=1}^{n} s_i t_i}{\sum_{i=1}^{n} s_i^2 + \sum_{i=1}^{n} t_i^2}   (2)

Cosine(S, T) = \frac{\sum_{i=1}^{n} s_i t_i}{\sqrt{\sum_{i=1}^{n} s_i^2}\,\sqrt{\sum_{i=1}^{n} t_i^2}}   (3)
In this formulation, if the weights s_i, t_i are assigned such that their value is 1 if the ith token is present in the corresponding string and 0 otherwise, then the Jaccard similarity can be seen to be the number of tokens in common between the two strings, normalized by the total number of unique tokens in the two strings (the union), whereas the Dice similarity can be seen to be the number of tokens in common between the two strings, normalized by the average of the number of tokens in the two strings. Cosine similarity is slightly different in that a vector-length normalization factor is used, where the weight of each token depends on the weights of the other tokens in the same string. Accordingly, in the above formulation (Salton and Buckley, 1987), the similarity may be considered to be a vector product of the two weight vectors, with the individual weights being s_i / \sqrt{\sum_{i=1}^{n} s_i^2} and t_i / \sqrt{\sum_{i=1}^{n} t_i^2} (instead of s_i and t_i, respectively).

However, these methods do not distinguish between different terms (tokens) in the strings being compared, either in terms of their importance to the strings containing those tokens or in terms of their ability to discriminate such strings from other strings not containing those tokens. The TF/IDF (Term Frequency/Inverse Document Frequency) (Salton and Buckley, 1987) approach uses a cosine distance-based similarity measure where each token in a string is assigned a weight representing the importance of that term to that particular string as well as relative to all other strings to which it is compared. While this approach is commonly used for document retrieval, it can also be used to measure similarity between different strings in a given set of strings. In this case, the weight assigned to a token consists of three components: (i) a term frequency component measuring the number of
times the token occurs in the string, (ii) an inverse document frequency component that is inversely proportional to the number of strings in which that token occurs, and (iii) a normalization component, typically based on the length of the string vector. While the term frequency component measures the importance of a term to the string in which it is contained, the inverse document frequency component measures its ability to discriminate between multiple strings, and the normalization component ensures that longer strings are not unfairly preferred over smaller strings (since longer strings, with more tokens, would otherwise have a higher likelihood of having more tokens in common with the string being compared with, as opposed to smaller strings). Thus, typically, we define

Term Frequency (tf) = number of times the token occurs in the string
Inverse Document Frequency (idf) = \log(N/n)

where n is the number of strings in which the token occurs in the entire collection of N strings under consideration. Then, for a string S with a weight vector of the form S = {s_1, ..., s_n}, the weight of the ith token is specified as

s_i = tf_{s_i} \cdot idf_{s_i} = tf_{s_i} \cdot \log(N/n_i)   (4)

Then, the TF-IDF similarity (Salton and Buckley, 1987) between two strings, say S and T, represented as weight vectors S = {s_1, ..., s_n} and T = {t_1, ..., t_n}, respectively, is given by

TF/IDF(S, T) = \frac{\sum_{i=1}^{n} s_i t_i}{\sqrt{\sum_{i=1}^{n} s_i^2}\,\sqrt{\sum_{i=1}^{n} t_i^2}} = \frac{\sum_{i=1}^{n} tf_{s_i}\, tf_{t_i} \left(\log(N/n_i)\right)^2}{\sqrt{\sum_{i=1}^{n} \left(tf_{s_i} \log(N/n_i)\right)^2}\,\sqrt{\sum_{i=1}^{n} \left(tf_{t_i} \log(N/n_i)\right)^2}}   (5)
As can be seen from Eq. (3), this is equivalent to the cosine similarity between the two strings (with the token weights defined as in Eq. (4)). Several different variations of this have been studied as well (Salton and Buckley, 1987).
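As an illustrative sketch (not prescribed by the chapter), the TF-IDF weighting of Eq. (4) and the cosine similarity of Eq. (5) can be computed along the following lines in Python; the whitespace tokenization, function names and example supplier strings are our own simplifications:

import math
from collections import Counter

def tfidf_vectors(strings):
    # TF-IDF weight vectors (Eq. (4)) for a collection of strings.
    token_lists = [s.lower().split() for s in strings]
    N = len(strings)
    # document frequency: number of strings in which each token occurs
    df = Counter(tok for toks in token_lists for tok in set(toks))
    vectors = []
    for toks in token_lists:
        tf = Counter(toks)
        vectors.append({tok: tf[tok] * math.log(N / df[tok]) for tok in tf})
    return vectors

def cosine(u, v):
    # Cosine (TF/IDF) similarity of two sparse weight vectors (Eq. (5)).
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

names = ["international business machines corp", "ibm corp", "hewlett packard company"]
vecs = tfidf_vectors(names)
print(cosine(vecs[0], vecs[1]))   # the shared token "corp" yields a positive similarity

Note that a token occurring in every string receives a zero IDF weight, which is precisely the discrimination behavior described above.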
2.3 Clustering methods

Another class of methods that is especially useful for the cleansing activities is clustering techniques. The aim of clustering is to partition a given dataset into a set of groups such that the data items within a group are "similar" in some way to each other, but "dissimilar" from data items belonging to the other groups. This implies that a good clustering of a dataset corresponds to high intra-cluster similarity and low inter-cluster similarity, and as such depends on how such similarity is measured, as well as implemented, by a clustering method. Clustering has been used over the years in various domains and applications such as pattern recognition, image processing, marketing, information retrieval, etc. (Anderberg, 1973; Jain et al., 1999; Jain and Dubes, 1988; Salton, 1991), and a number of different algorithms have been devised to do such analysis. Here, we describe some of the most commonly used methods and discuss their relative advantages and disadvantages; the interested reader is referred to Jain et al. (1999) for a more general review and discussion of different clustering techniques.

Arguably the simplest and most widely used clustering technique is the k-means algorithm (McQueen, 1967). The aim of the k-means clustering algorithm is to partition the dataset into a set of k clusters, the value of k being assigned a priori. The k-means algorithm starts with an initial partition of the data into k clusters (this could be done randomly, for example), and uses a heuristic to search through the space of all possible clusterings by cycling through steps (i) and (ii) as follows: (i) for each cluster, the centroid (mean point) is calculated using the data points currently assigned to the cluster, and (ii) each data point is then re-assigned to the cluster whose centroid is the least distance from it. This process is continued until some convergence criterion is satisfied (e.g. there is no movement of any data point to a new cluster). Since this is essentially a greedy approach, it generally terminates in a local optimum. The method is very popular due to the fact that it is easy to implement and fairly computationally efficient. However, it assumes that the number of clusters is known beforehand and, since the method converges to a local optimal solution, the quality of the clusters found is very sensitive to the initial partition (Selim and Ismail, 1984).

Although the k-means algorithm belongs to a wider class of clustering algorithms called partitioning algorithms (since they construct various partitions from the dataset and evaluate them using some criterion), another popular set of clustering techniques are hierarchical clustering algorithms, which work by creating a hierarchical decomposition (tree) of the dataset using some criterion (normally distance based). There are two types of hierarchical clustering methods: agglomerative and divisive. While agglomerative methods start with each data item being placed in its own cluster and then successively merge the clusters until a termination
condition is reached, divisive methods work in the opposite direction, starting with a single cluster consisting of the entire dataset and successively splitting it until a stopping criterion is satisfied. The majority of hierarchical algorithms are agglomerative, differing primarily in the distance measure used and the method of measuring similarity (distance) between clusters; divisive methods are rarely used and we do not discuss them further.

For measuring the distance between data items, any of an extensive array of distance measures can be used, including those that are based on similarity measures as described previously, as well as various other distance metrics that have been used for clustering and similar tasks in the literature, such as the Euclidean distance, the Minkowski metric, the Manhattan (or L1) distance, etc. (Cohen et al., 2003; Jain et al., 1999). For measuring the distance between clusters, the two most commonly used methods are the single-link and complete-link approaches, though other methods have also been used (Jain et al., 1999). In the single-link case, the distance between two clusters is defined as the shortest distance (or maximum similarity) between any member of one cluster and any member of the other cluster. In the complete-link (maximum linkage) case, however, the distance between two clusters is defined as the maximum distance between any member of one cluster and any member of the other cluster. Of these, the complete-link method generally leads to a higher degree of intra-cluster homogeneity for a given number of clusters. Once a choice of the distance measure as well as the method of determining inter-cluster distance is made, agglomerative clustering proceeds as described above: starting with singleton clusters, the pair of clusters (including clusters that have been created by previous merge steps) that have the least distance between them are successively merged until a stopping criterion, such as a maximum cluster size threshold or a maximum intra-cluster distance threshold, is reached. Although hierarchical clustering techniques have the advantage (over k-means) that the number of clusters does not have to be specified a priori, these methods are not as computationally efficient and do not allow any merge (or split) decision taken earlier on to be reversed later.
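To make the agglomerative procedure concrete, the following is a minimal Python sketch of complete-link agglomerative clustering; the dist function, the stopping threshold and the list of records are hypothetical placeholders rather than anything prescribed here (e.g. the levenshtein function sketched earlier could serve as dist over supplier name strings):

def complete_link_clusters(items, dist, max_dist):
    # Greedy agglomerative clustering with complete (maximum) linkage:
    # repeatedly merge the two clusters whose complete-link distance is
    # smallest, stopping once no pair is closer than max_dist.
    clusters = [[x] for x in items]              # start with singleton clusters
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # complete link: largest pairwise distance across the two clusters
                d = max(dist(x, y) for x in clusters[a] for y in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        if d > max_dist:                         # stopping criterion reached
            break
        clusters[a] = clusters[a] + clusters[b]  # merge b into a
        del clusters[b]
    return clusters

# e.g. complete_link_clusters(supplier_names, levenshtein, max_dist=2)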
Yet another class of popular clustering algorithms, called model-based clustering methods, assumes certain models for the clusters and attempts to optimize the fit between these models and the data. The most common of this class of methods assumes that each of the clusters can be modeled by a Gaussian distribution (Banfield and Raftery, 1993), and thus the entire dataset can be modeled by a mixture of Gaussian distributions. The task of identifying the clusters then boils down to the estimation of the parameters of the individual Gaussians. The EM algorithm (Dempster et al., 1997) is commonly used for this parametric estimation. AutoClass (Cheeseman and Stutz, 1996) is another approach that takes a mixture-of-distributions approach, in addition to using Bayesian statistics to estimate the most probable number of clusters given the data. While model-based clustering allows the use of established statistical techniques, it differs from the approaches described earlier in that, unlike the k-means and hierarchical clustering approaches that are purely data driven, it requires prior assumptions regarding the component distributions. Additionally, as in the case of k-means, the number of clusters also has to be specified a priori (except for AutoClass, which estimates it).

Irrespective of the type of clustering method used, computational efficiency and scalability become very important issues when clustering is applied to problems that are characterized by large datasets. This can occur due to a large number of records in the dataset, high dimensionality of the feature space, or a large number of underlying clusters into which the data needs to be split up. In such situations, the direct application of any of the previously discussed clustering approaches can become highly computationally intensive, and practically infeasible, especially when the dataset being clustered is large for all of these reasons at the same time. Recently, however, new techniques have been developed for performing clustering efficiently on precisely these kinds of high-dimensional datasets. The main idea behind such techniques is to significantly reduce the number of times exact similarity (or distance) measures have to be computed during the clustering process, thereby reducing the computational complexity of the process. One such method is the two-stage clustering technique developed by McCallum et al. (2000). In this method, the first stage is a "quick and dirty" stage in which cheap and approximate distance measures are used to divide the dataset into a set of overlapping subsets called canopies. This is followed by a more rigorous stage where expensive, exact distance calculations are made only between data items that occur within the same canopy. By ensuring that the canopies are constructed such that only data items that exist in a common canopy can exist in the same cluster (i.e. clusters cannot span canopies), substantial computational savings are attained by eliminating the exact distance computations between any pair of points that does not belong to the same canopy. Moreover, this allows any of the standard clustering techniques described previously to be used during the second stage; essentially, that clustering approach is used repeatedly to cluster smaller datasets corresponding to the canopies, as opposed to performing clustering on the entire dataset as required when using a traditional clustering approach directly.
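The following rough Python sketch illustrates only the first (canopy-building) stage of the two-stage idea of McCallum et al. (2000); the cheap_sim function, both thresholds and the function names are assumptions of ours, not details given in that work or in this chapter:

def build_canopies(items, cheap_sim, t_loose, t_tight):
    # "Quick and dirty" stage: group items into overlapping canopies using a
    # cheap similarity measure; only pairs sharing a canopy will later be
    # compared with an expensive, exact measure.
    remaining = set(range(len(items)))
    canopies = []
    while remaining:
        center = remaining.pop()                 # pick an arbitrary canopy center
        canopy = {center}
        for i in list(remaining):
            s = cheap_sim(items[center], items[i])
            if s >= t_loose:
                canopy.add(i)                    # i may also appear in other canopies
            if s >= t_tight:
                remaining.discard(i)             # close enough: never a center itself
        canopies.append(canopy)
    return canopies

# Second stage (not shown): run k-means or agglomerative clustering within each
# canopy, applying the expensive distance only to items in the same canopy.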
2.4 Classification methods

A third category of methods, as described previously, that can sometimes prove useful for the data cleansing tasks is classification techniques (also commonly referred to as supervised learning methods). Since data cleansing often involves mapping and manipulation of textual data, fields such as information retrieval and natural language processing offer a plethora of machine learning techniques that have been found effective in such domains (e.g. maximum entropy (Nigam et al., 1999), support vector machines (Joachims, 1998) and Bayesian methods (McCallum and Nigam, 1998)).

However, classification methods need "labeled" data in order to build/train classifiers which could then be used for the mapping tasks needed for spend aggregation, such as supplier name normalization and commodity mapping. Such labeled data is, however, not always available. This is in stark contrast to the methods described previously, string similarity methods as well as clustering techniques, which have no such requirement and, hence, are generally used instead of the classification techniques. As such, we do not discuss these approaches in detail but refer the interested reader to the above-mentioned references. Nevertheless, we do highlight in subsequent sections where classification approaches could be applied, especially in the context of data cleansing for spend aggregation in a procurement-BTO setting, since cleansed data for a given enterprise could potentially provide the labeled data needed for cleansing the data of other enterprises, especially those in the same industrial sector.

Irrespective of the actual approach adopted, two steps are involved in using classification methods for data cleansing: (i) learning classification models for predicting one or more attributes of interest ("target" attributes) based on the values of other attributes, and (ii) applying these models to the unmapped data to determine the appropriate values of the target attributes. Winkler (2006) provides an extensive list of citations to work in the data cleansing literature based on the use of supervised learning techniques for data cleansing, and the interested reader is referred to the same. One particular area, though, in which supervised learning techniques may be relevant in the context of data cleansing for spend aggregation is the automatic parsing and element extraction from free-text address data (Borkar et al., 2000; Califf, 1998). For this specific task, it may be easier to get labeled data through some combination of standard address databases, manual tagging and labeling, as well as the incremental data cleansing activities that would be performed during procurement BTO as subsequent enterprise repositories encounter free-text addresses that have already been cleansed for previous enterprises.
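Purely as an illustration of how a supervised approach might be applied once labeled data exists (e.g. transactions already mapped for an earlier BTO client), the sketch below trains a simple multinomial naive Bayes text classifier over commodity descriptions; the commodity codes, example descriptions, tokenization and add-one smoothing choice are our own and are not a method prescribed by this chapter:

import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_descriptions):
    # labeled_descriptions: list of (description, commodity_code) pairs.
    class_counts = Counter()
    token_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labeled_descriptions:
        class_counts[label] += 1
        for tok in text.lower().split():
            token_counts[label][tok] += 1
            vocab.add(tok)
    return class_counts, token_counts, vocab

def classify(text, model):
    class_counts, token_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        # log prior + log likelihood with add-one (Laplace) smoothing
        score = math.log(class_counts[label] / total_docs)
        denom = sum(token_counts[label].values()) + len(vocab)
        for tok in text.lower().split():
            score += math.log((token_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# e.g. model = train_naive_bayes([("hazardous waste removal", "HAZMAT"),
#                                 ("server maintenance contract", "IT H/W SUPPORT")])
#      classify("waste handling services", model)   # -> likely "HAZMAT"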
3 Automating data cleansing for spend aggregation
We now turn our attention to the specific task of automating the cleansing and rationalization of spend across an enterprise so that it can be aggregated and analyzed in a meaningful way. In the case of BTO, this data cleansing has to span multiple enterprises, thus leading to a significantly higher level of complexity. Spend aggregation has traditionally been done manually, generally employing spreadsheets and rudimentary data analysis
techniques for mapping and normalizing the data prior to aggregation. However, this is an extremely costly and time-intensive process, especially for larger enterprises where the volume and complexity of spend data makes it all the more difficult. Owing to the slow, error-prone and expensive nature of manual cleansing, coupled with the increased focus in this area and the aforementioned rapid developments in various fields such as databases, data mining and information retrieval, there has been a steady shift towards the development of methods and tools that automate at least some aspects of this cleansing activity. While some enterprises have turned to in-house development of automated solutions for cleansing and aggregating their spend data, others use solutions provided by independent software vendors (ISVs) such as Emptoris, VerticalNet and Zycus to address their spend aggregation needs. Some of these are pure consulting solutions in which the ISV takes the spend data from the enterprise, cleanses, aggregates (automatically and/or manually) and analyzes it, and returns aggregate spend reports back to the enterprise for further action. On the other end are automated spend analysis solutions that are deployed and configured to work directly with the enterprise's spend data repositories and systems to cleanse, aggregate and analyze the data on a continual basis. However, most of these solutions are primarily for aggregating intra-company spend (traditionally referred to as spend analysis); there are few solutions that deal explicitly with inter-company spend aggregation, which presents many challenges not encountered while aggregating intra-company spend (Singh and Kalagnanam, 2006).

In Section 3.1, we discuss in detail the various cleansing tasks that must be carried out in order to convert spend data to a form where effective aggregation is possible, and the issues that must be addressed in order to enable this cleansing to be done in an automated manner. We specifically highlight the similarities of some of these cleansing tasks with the classic duplicate detection problem and also point out the key points where the two differ. Section 3.2 then provides a rough roadmap towards the development of a simple automated spend-aggregation solution using some of the techniques discussed in Section 2. While many of the techniques and methods discussed in Section 2 can be used to create such an automated solution, we focus on only some of the most commonly used techniques, such as string comparisons and clustering, and address how the various issues that arise during spend-data cleansing activities can be effectively addressed using these methods.
3.1 Data cleansing tasks for spend aggregation Regardless of the techniques adopted, three main tasks generally need to be performed for cleansing spend data to facilitate effective spend aggregation and analysis, and the development of automated solutions to
perform these tasks brings forth several technical issues that need to be addressed satisfactorily.

One of the cleansing tasks that needs to be performed is the normalization of supplier names, to enable the development of a consistent supplier base across all the data repositories and systems. This has to be done both for analyzing intra-company spend as well as for aggregating spend across multiple enterprises for procurement BTO. The normalization of supplier names involves the mapping of multiple names for the same supplier to a single, common, standard name for that supplier. Multiple names may arise due to various reasons, including errors (e.g. IBM Corp. and IBMCorp.), different locations (e.g. IBM Canada, IBM TJ Watson Research Center, IBM India, IBM Ireland Ltd., etc.), different businesses undertaken by the same supplier (e.g. IBM Software Group, IBM Global Services, IBM Daksh Business Process Services, etc.), parent–child relationships due to acquisitions (e.g. IBM, Tivoli Systems, Lotus Software, Ascential Corporation, etc.) as well as different terminologies and naming conventions employed by an enterprise for its suppliers in different geographic or functional locations (e.g. IBM, I.B.M, I B M, IBM Corporation, IBM Corp., Inter Bus Machines, International Business Machines, etc.). Clearly, the number of possible ways a supplier may be represented within the spend data may be fairly large, and unless they are all normalized to a single, unique name instance, the procurement-spend corresponding to that supplier will be significantly underestimated by any aggregation exercise. Moreover, even if the name of a supplier in multiple transactions or systems is the same, other attributes for that supplier, such as address and supplier id, may differ, again due to the reasons described above (e.g. IBM, 1101 Kitchawan Rd, Yorktown, NY; IBM, Rt 134, Yorktown Heights, NY; IBM, 365 Maiden Lane, New York, NY, etc.). As such, to properly normalize supplier names, it is imperative to compare not only the suppliers' names but also other information, such as address and contact information, that may be available. This is especially true for enterprises that do world-wide procurement, since different suppliers in different countries may in fact have the same, or fairly similar, names. This is more likely in the case of suppliers that have fairly common words in their names. The complexity of supplier normalization increases rapidly in the case of procurement outsourcing: as the number of BTO clients increases, the size of the supplier base that needs to be normalized increases, as does the noise and variability in the data.

Figure 2 shows highly simplified views of the procurement spend for three enterprises (ENT 1, ENT 2 and ENT 3). The view for each enterprise shows the procurement-spend aggregated over suppliers and commodities before any data-cleaning activity has been undertaken. In each case, there are multiple name variations of the same supplier. For ENT 1, it would seem that the total procurement amount from IBM is of the order of 1 million. Moreover, no supplier would seem to account for more than 2.5 million.
[Fig. 2. Example demonstrating the need for data cleansing for spend aggregation: simplified spend views for ENT 1, ENT 2 and ENT 3 list multiple name and address variants of IBM and HP (e.g. Inter Bus Machines, IBMCorp, IBM SWG, International Business Machines; HP Corp, H.P. Corp. (Sales), Hewlett Packard, Hewlett-Packard Company) against inconsistently described commodities (e.g. H/W support, high-end server maintenance, S/W maintenance and licenses, enterprise software, business consulting services). After normalization, the cross-enterprise view shows only two suppliers (IBM, 16,100,000; HP, 10,300,000) and three commodity categories (H/W support, 13,300,000; software, 11,100,000; consulting services, 2,000,000).]
However, the true picture is quite different and becomes obvious after normalizing the supplier names, which shows that IBM actually accounts for about 6 million in all. A similar picture can be observed for the other enterprises as well. Without supplier name normalization, one would assume that there were 11 different suppliers, none of them accounting for more than 4 million in spend across the three enterprises. However, as the view on the right shows, there are only two suppliers, with IBM accounting for about 16 million and HP accounting for 10 million. Clearly, data cleansing allows a much clearer, cross-enterprise picture of procurement-spend than is available by simple aggregation without any cleansing.

As illustrated in Fig. 2, another cleansing activity that needs to be carried out is to make the spend categorization of the entire procurement-spend consistent, both within as well as across enterprises, to enable spend aggregation across commodities. Most enterprises label their spend transactions with an associated commodity and/or spend category. Moreover, the commodities and various spend categories are arranged in a taxonomy, with higher levels corresponding to general categories of spend and lower levels representing more specific ones (the lowest level corresponding to commodities). However, there are often multiple taxonomies in use across an enterprise, resulting in multiple ways of representing the same commodity. One reason may be that the enterprise has no specific, enterprise-wide taxonomy that is used to categorize the procurement spend; instead, different geographic and functional organizations develop and use their own taxonomies for categorizing the procurement spend. Another reason could be that there is no formal taxonomy in place, either at the enterprise level or at a functional/geographic organization level, and spend is categorized in an ad hoc manner by multiple people spread across the enterprise and using various procurement functions and applications (such as requisitions, supplier catalogs and payments). Clearly, this leads to multiple taxonomies within the enterprise, with substantial inconsistencies, disparity, noise and errors in the way the same commodity is referred to across the enterprise. This is especially true in the latter case, where multiple descriptions may be used for the same commodities based on different naming conventions and styles, errors and terminologies. Before a meaningful and accurate spend aggregation can be done across the enterprise, all the taxonomies in use across the enterprise must be mapped to, or consolidated into, a single taxonomy that uses a normalized, consistent commodity base (e.g. hazardous waste handling, pollution control expense, hazardous waste management, HAZMAT removal, etc. need to be mapped to the same commodity, say hazardous waste management). This taxonomy may be one of those currently in use, built by consolidating several taxonomies in use, or may be a standard classification code, such as the United Nations Standard Products and Services Code, or UNSPSC (Granada Research, 2001; UNSPSC). Nowhere is the importance and the complexity of this
mapping more apparent than in the case of procurement BTO, where the spend categories of all the involved enterprises (BTO clients and the BTO service provider) need to be mapped to a uniform taxonomy in order to accurately determine the total procurement volume of any given commodity. Thus, not only do multiple taxonomies within an enterprise need to be reconciled, but taxonomies across several enterprises also have to be mapped to a single, consistent taxonomy across all enterprises to enable a uniform, cross-enterprise view of commodity spend to be developed. In such cases, a standard taxonomy such as the UNSPSC is best suited since it spans multiple industry verticals, thus enabling a BTO service provider to host procurement for all kinds of clients. As in the case of supplier name normalization, Fig. 2 illustrates the impact that commodity taxonomy mapping has on spend aggregation, especially across multiple enterprises. As the example shows, the same commodities are referred to in several different forms across the different enterprises, and it is only after mapping them to a common taxonomy (shown on the right side of the figure) that the true picture emerges, i.e. three commodities account for the entire spend, with h/w support and software accounting for most of it. Thus, by performing supplier normalization and commodity mapping across the various enterprises and then aggregating the spend, the BTO service provider can see that there are significant volumes associated with the commodities being procured, which in turn enables it to negotiate better deals with the suppliers involved; without the cleansing, the view available to the BTO service provider would be quite distorted and would not be amenable to such analysis.

Finally, besides supplier name normalization and commodity mapping, individual spend transactions may also need to be consolidated and mapped to a normalized commodity taxonomy. This generally happens when an enterprise either does not have a formal taxonomy for spend categorization, or does not require its usage for categorizing individual spend transactions. In such cases, the individual spend transactions have to be mapped to a common commodity taxonomy (either the enterprise taxonomy, if it exists, or a standard taxonomy such as the UNSPSC), based on unstructured textual descriptions in the transactions (such as from invoices or purchase orders). Another case where such transactional mapping is needed is when the enterprise spend taxonomy is not specific enough, i.e. spend is categorized at a much more general level than the commodity level needed for the aggregation. In such cases, the transactional-level descriptions may provide more information about the actual commodities purchased to allow such mapping to be done.

As Singh et al. discuss, there are several factors that make taxonomy, as well as transactional, mapping far more complex and difficult in the case of spend aggregation for procurement BTO, as opposed to aggregation of spend within a single enterprise (Singh et al., 2005; Singh and Kalagnanam, 2006). These arise primarily due to the need to map the procurement-spend
of each of the participating enterprises (via mapping of taxonomies and/or transactions) to a common, multi-industry, standard taxonomy (primarily the UNSPSC), since the participating enterprises may be from different industries with little or no overlap between the commodities they deal with.

One issue is that, though cross-industry standards like the UNSPSC are fairly broad and cover all industry sectors, there are often cases where the taxonomy is not very specific within an industry (i.e. commodities are quite general). Enterprise-specific taxonomies, though generally smaller (in terms of the number of commodities covered), may have a narrower but more specific coverage of commodities. Many times, the inverse also holds true, where the UNSPSC is more specific but the enterprise taxonomy's commodities are more general. In the former case, multiple enterprise commodities will end up getting mapped to a single UNSPSC commodity. This entails a loss of information during spend aggregation unless the UNSPSC is extended to include more detailed commodities. In the latter case, an enterprise commodity will correspond to multiple UNSPSC commodities, which requires the enterprise commodity either to be mapped to a more general UNSPSC class (group of commodities), or the use of transactional descriptions in mapping individual transactions (rather than commodities) to appropriate UNSPSC commodities.

A second issue is the fact that, while the UNSPSC taxonomy is a true hierarchical taxonomy in which an "is-a" relationship exists across different levels, enterprise taxonomies are seldom organized as such and, more often than not, reflect functional/operational organizations or spend categorizations (such as business travel expenses, direct procurement related, etc.). This implies that multiple related commodities in an enterprise taxonomy (i.e. commodities that have the same parent in the taxonomy) may map to very different areas of the UNSPSC taxonomy, and vice versa. As such, it is not possible to take advantage of the taxonomy structure during the mapping process, since mapping a higher-level item in a taxonomy to a similar high-level item in the UNSPSC taxonomy does not imply that all children of that item in the enterprise taxonomy will also map to children of the corresponding UNSPSC commodity; rather, they could be anywhere in the taxonomy. As such, mapping generally has to be done at the commodity level, one commodity at a time.

3.2 Automating data cleansing tasks for spend aggregation

As mentioned previously, automating the various data cleansing activities often requires the use of multiple methods and techniques, based on the specific cleansing task as well as the quality of the available data. The best approach depends upon several factors—the cleansing task at hand, the availability of labeled data, the clustering method best suited for the
available data in the absence of labeled data, the similarity measures to be used in computing similarity between different terms, the data attributes available, etc. Also, the data cleansing literature (as discussed in Section 2.1) offers a variety of techniques that have been successfully applied to such activities.

At first glance, the supplier normalization task seems to be identical to the extensively studied duplicate detection problem. Clearly, there are significant similarities between the two, especially when enterprises have multiple representations in the data due to errors, different terminologies or nomenclatures, etc. As such, many of the techniques that have been developed in the past for tackling the duplicate detection problem can be adapted for use in the supplier normalization task. However, there is one aspect specific to the supplier normalization task that makes it different from, and often considerably more difficult than, the duplicate detection problem. It arises from the fact that duplicate detection (in the context of supplier normalization) can be considered to be the task of checking whether the name/address data for two enterprises is syntactically different but semantically equivalent, i.e. both entities actually represent the same enterprise that has been represented differently due to errors, etc., and at least one of the entries is erroneous and must be fixed. However, in enterprise spend data, it is often the case that supplier information is both syntactically and semantically different but, in fact, refers to the same enterprise. Moreover, each representation is correct and must be maintained that way; at the same time, for spend aggregation, they must be considered equivalent and spend must be aggregated over them and associated with a single enterprise.

This can arise for different reasons. One reason is acquisitions and divestitures. In such cases, different parts of the same enterprise may have completely different names/addresses (e.g. Lotus Software and IBM Corp.). Another reason is that many enterprises operate on a global scale, often with several business units (e.g. IBM Software Group, IBM Systems and Technology Group) and/or subsidiaries (e.g. IBM India Pvt Ltd) across multiple geographies, each conducting business directly with its customers, suppliers, etc. In either case, there may be invoice and/or accounts-payable transactions that have substantially different names and addresses that are entirely correct and must be maintained as such (since, e.g., payments may be made to the individual business units directly). However, for spend aggregation, the individual entities do need to be mapped to a common (parent) enterprise. In the former case, the only solution is to purchase such information from companies such as Dun & Bradstreet, or to cull news sources for such information and build up a repository of acquisitions/divestitures for use during the supplier normalization task. In the latter case, the differences between names and addresses are often far more significant than those introduced by errors or naming conventions. Hence, as we discuss subsequently in Section 3.2.1, standard similarity methods as are used in
duplicate detection tasks cannot be directly applied, and more elaborate schemes must be devised. Commodity taxonomy mapping and commodity transactional mapping are much more general cleansing problems than the classic duplicate matching problem since they involve unstructured, noisy data, and a direct application of similarity methods is not enough, as we discuss in Section 3.2.2. In the following paragraphs, we discuss the pros and cons of using various techniques for the three data cleansing tasks described previously, and describe how they can be used for successfully automating these tasks.

3.2.1 Supplier name normalization

One common limiting factor in spend aggregation is the absence of any "mapped" data. For example, there is generally no data that explicitly labels different supplier names as being variations of the same physical enterprise's name. Similarly, there is often no transactional spend data that is labeled with the appropriate normalized commodity codes. In the absence of such labeled data, the first step in a supplier name normalization exercise for an enterprise is to perform a clustering exercise on the enterprise's supplier data in order to partition it into a set of groups, each group representing a unique supplier. An alternative approach is to compare the data for each supplier in the dataset with the data for (potentially) every other supplier, possibly several times. For example, starting with an empty list of normalized suppliers, this approach would entail comparing each supplier in the dataset with every supplier in the list of normalized suppliers, and either mapping it to one already on the list or adding it to the list as a new normalized supplier. This process would continue until every supplier is mapped to some supplier on the normalized list. Compared to this approach, a clustering-based approach is usually significantly more computationally efficient, especially when the size of the supplier base is fairly large. Once the clustering is done, each cluster has to be assigned a normalized name, which can be done using any of a variety of heuristics, such as the most common supplier name in the cluster, the shortest name in the cluster, or the shortest common prefix among the names in the cluster (see the sketch below). Since multiple enterprises are involved in a procurement BTO setting, the task of normalizing the supplier base across all the involved enterprises can then be conducted in two steps: first, the supplier base of each participating enterprise is normalized, and then all the normalized supplier bases are merged together to yield a single, consistent, cross-enterprise, normalized supplier base. To perform supplier name normalization using the cluster-based approach, two decisions need to be made. First, an appropriate clustering method has to be selected. Second, an appropriate similarity measure has to be selected for comparing data items and/or clusters.
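As a brief aside, the cluster-labeling step mentioned above can be illustrated with a minimal Python sketch. The heuristics it applies (most frequent spelling, shortest common token prefix, shortest name) are those named in the text, but the function itself is only a hypothetical example, not the implementation proposed by the authors.

```python
from collections import Counter

def normalized_name(cluster_names):
    """Pick a representative (normalized) name for a cluster of supplier-name variants."""
    counts = Counter(cluster_names)
    name, freq = counts.most_common(1)[0]
    if freq > 1:                        # one spelling repeats across records: use it
        return name
    token_lists = [n.split() for n in cluster_names]
    prefix = []
    for position in zip(*token_lists):  # walk aligned token positions left to right
        if len(set(position)) == 1:
            prefix.append(position[0])
        else:
            break
    if prefix:                          # shortest common token prefix, if any
        return " ".join(prefix)
    return min(cluster_names, key=len)  # otherwise fall back to the shortest name

print(normalized_name(["IBM Corp", "IBM Corporation", "IBM Corp."]))  # -> IBM
```

In practice the choice among such heuristics is itself a design decision; the point is simply that labeling a cluster is cheap once the clustering itself has been done.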
There are certain characteristics of this problem that affect the choice of these methods. One, an enterprise can have tens of thousands of suppliers. Two, in order to do effective similarity matching between supplier names, it is necessary to tokenize the names, which greatly increases the dimensionality of the feature space. Three, the number of normalized supplier names (corresponding to the clusters) is also usually quite large, corresponding to a significant part of the non-normalized supplier base. Owing to this, methods such as k-means are generally not suited for clustering the data since it is difficult to estimate beforehand the number of clusters into which the supplier base will eventually be partitioned. Hierarchical methods are better suited since they require no a priori assumptions, either about the number or the distribution of the clusters. More importantly, as discussed in Section 3.2, for these very reasons the dataset is often quite large (in the number of data items, the feature dimensionality as well as the number of clusters), which makes it computationally quite inefficient to do clustering using a straightforward application of any of the standard clustering techniques. A more suitable approach is to use a clustering technique meant for large datasets, such as the two-stage canopy-based clustering technique (McCallum et al., 2000) discussed previously, in conjunction with a hierarchical (agglomerative) clustering method.

As in the case of the clustering technique, several issues also need to be considered while deciding on the similarity approach to use to measure the distance between the data items (and the clusters). While, theoretically, string similarity metrics can be used for directly comparing supplier names by calculating the similarity (or distance) between them, several issues arise that make this practically infeasible. One, edit distance (e.g. Levenshtein distance) calculation is computationally expensive, and its usage on real data with tens of thousands of supplier names can make the mapping process very slow. Two, similarity metrics are position invariant. That is, they only provide a quantitative measure of the difference between strings, but no indication of where the differences are. This is especially important in the case of supplier names, which often consist of multiple words. Consider "IBM Corp", "IBMCorp" and "ABM Corp". The Levenshtein distance between "IBM Corp" and "IBMCorp" is 1, as is the distance between "IBM Corp" and "ABM Corp". However, while the former pair represents a variation (due to an error) in the name of the same enterprise, the latter pair corresponds to names of different enterprises. Or consider "Texas Instruments Incorporated", "Texas Instruments Inc", "Westbay Instruments Incorporated" and "Western Instruments Incorporated". Here, the Levenshtein distance between the first and second names is 9, but the distance between the first and third (as well as between the first and fourth) is considerably smaller, which implies that the first name is more "similar" to the obviously incorrect names than it is to the correct one.
Finally, consider "IBM", "IBM Corporation" and "CBS". Whereas the first and second are obviously variations of the same enterprise's name and the third name is obviously different, the edit distance between the first two is 12 while the distance between the first and third is only 2.

Part of this problem, however, can be alleviated by tokenizing the names and performing similarity checks on tokens instead of the entire string. Moreover, a token-based similarity technique, such as the TF/IDF approach discussed previously, has the advantage of making similarity comparisons between strings while taking into account distinctions between the various terms (tokens) in those strings, both in terms of their importance to the strings containing them as well as their ability to discriminate these strings from other strings not containing these tokens. However, like the other similarity methods, the TF/IDF approach does not differentiate between the positions of tokens that are dissimilar; it simply considers each string as a "bag" of tokens and calculates the similarity between the strings based on those tokens. As such, it does not distinguish between differences among the compared strings at different positions. Yet, as the above examples show, it is often the case that differences towards the beginning of supplier names are more significant than differences towards the end of the names. Moreover, it does not take into account the order of the tokens, merely the similarity (or dissimilarity) between the tokens. Thus, names containing the same words but in different positions (e.g. "Advanced Technology Systems Inc" and "Advanced Systems Technology Inc") are considered to be identical.

Other issues arise as well. First, an appropriate tokenization scheme has to be selected. A tokenization scheme that produces too many small tokens (such as character n-grams) introduces too much noise, while schemes with too few tokens (such as word-grams or sequences of words) reduce the detection of local differences and/or make the process more computationally intensive. According to Singh et al. (2005, 2006), a word-based tokenization scheme (that uses white space and punctuation for tokenization) generally provides a suitable compromise between detecting local differences and computational complexity when comparing supplier names. Second, in addition to supplier names, data such as address and contact information is often available, and helpful, for supplier name normalization. As such, in addition to comparing supplier names, similarity checks also need to be done for various other data attributes such as street address, PO Box number, city, state, etc., and all these comparisons may yield conflicting or inconclusive information that needs to be resolved. On top of this, address information may not be available as attribute-value pairs but simply as unstructured textual data. In that case, values for various attributes such as street name, city, zip code, etc. need to be extracted before
similarity checks can be performed. While several different techniques can be used for extracting this information, a common and efficient method involves the use of regular expressions (Hopcroft et al., 2006) to define patterns corresponding to various attributes and then searching the text for those patterns to find the corresponding attribute values. Regular expressions are textual expressions that concisely represent sets of strings, according to certain syntax rules, without enumerating all the members of the set. For example, street addresses, such as "1101 Mount Kisco Avenue", often consist of three parts: a numeric part (street number) and a sequence of words (street name), followed by a keyword (such as "avenue", "street" or "lane"). A corresponding regular expression could then be defined (using appropriate syntax) as a string consisting of a numeric prefix, followed by white space, then one or more non-numeric, alphabetic words, then some more white space and finally a keyword. Using such a regular expression would allow the street address to be extracted from unstructured address data and broken up into its constituents to allow similarity matching. Given the type and quality of data, regular expressions may need to be defined for several different attributes, and may also need to be refined several times. This process is best undertaken by taking a suitable sample of the data and using trial and error on that sample to create and subsequently refine the needed regular expressions. Third, no similarity method can directly handle differences in supplier names due to different naming conventions, acronyms, punctuation, formatting, etc. These must be addressed before any clustering/similarity exercises can be successfully carried out.

Owing to all the reasons cited above, similarity comparisons between suppliers using a direct application of standard string similarity techniques are likely to yield unsatisfactory results for supplier name normalization. Rather, these comparisons typically have to be carried out by applying a variety of different methods to various supplier data attributes. One way is via a repository of rules based on the application of various string similarity methods to tokenized supplier names, as proposed by Singh and Kalagnanam (2006). Such an approach provides a simple, straightforward way of addressing all the issues raised earlier. For example, by constructing and prioritizing rules of varying complexity, including exact as well as fuzzy matches on whole or parts of supplier name and address attributes, it is possible to limit the use of computationally intensive tasks (such as extensive edit distance calculations) as well as satisfactorily address issues such as the position-invariant nature of string similarity methods. The former can be achieved by using simpler rules (such as exact name match, exact zip code match, etc.) first and using complex rules (involving distance computations) only if the simpler rules are not sufficient to make a decision regarding the level of similarity of the entities being compared, and then only on some of the data attributes (such as a subset of name tokens, or on
the street name). The latter, however, can be attained by building rules that specifically address this issue, such as rules that treat differences towards the beginning of names as more significant than differences towards the end of the names. This rule-based similarity approach can be further enhanced by using techniques such as stop word elimination, formatting and special character removal, number transformations, abbreviation generation and comparison, etc., which help preprocess and "standardize" the supplier data to enable a better comparison. Furthermore, rules can be designed to use information from dictionaries and standard company name databases (such as the Fortune 500 list) to assign different weights to different words and tokens in a name, thereby enhancing the normalization process further. For example, a non-dictionary word that occurs in the Fortune 500 list, such as "Intel", can be considered to be more significant in similarity measurements than other words.

Constructing the repository of similarity rules is generally an iterative process, involving a fair amount of manual trial and error initially. As in the case of regular expressions, it is often helpful to use a sample of data to help create these rules initially and then refine them as more data is analyzed. However, as more and more supplier bases are normalized, the rule repository incrementally grows to encompass more types of situations, and less and less manual intervention is required.

As such, the canopy-based clustering technique with hierarchical (agglomerative) clustering using rule-based similarity measurement provides a suitable, efficient approach for supplier name normalization, the idea being to first use computationally cheap methods to form some loose clusters, called canopies, followed by more computationally intensive methods to refine the canopies into appropriate clusters. Owing to the extremely large supplier bases encountered at many enterprises, this clustering approach is particularly attractive. To create canopies, cheap methods including zip code matches, phone number matches and name and/or address token matches are used in various rules. Once the canopies have been formed, more expensive techniques consisting of the elaborate similarity measurements are used to form clusters. Once the supplier base of the enterprise under consideration has been clustered, the same process is repeated for each of the other enterprises involved in the procurement BTO exercise. Once all the individual enterprises are normalized, they can be merged into a single, cross-enterprise normalized supplier base. Note, however, that if the enterprise being normalized is a new enterprise being brought on board an existing BTO platform, then it would need to be merged with the cumulative normalized supplier base formed from all previously normalized clients' data. In either case, the merger can be easily done using agglomerative clustering in conjunction with the same set of similarity rules used previously, with the sets of clusters in the two supplier bases being the starting clusters from which
the agglomeration process starts (see Section 3.2). A side advantage of this process is that by incrementally building up such a repository of normalized suppliers, and mining the repository for subsequent clients' normalization tasks, the accuracy and performance of the system can be progressively improved with each additional client. The supplier-normalization approach described above can be summarized as follows:

1. Pre-process the supplier data (name, address, etc.) by eliminating stop words, removing special characters, transforming numbers to a uniform format, etc.
2. Define regular expression-based extractors to break up address fields into more specific information such as street name, street number, PO Box number, etc. It may be helpful to take a sample of the data and use that to define the regular expressions (as discussed previously).
3. Create/augment the similarity-rules repository. As more and more supplier bases get normalized, the incremental changes needed to this repository decrease.
4. Segment the supplier base of the current enterprise:
   a. Create canopies using cheap similarity rules such as zip code matches, phone number matches, first-token matches (using n-grams) as well as exact and inclusion name matches.
   b. Use more stringent, computationally intensive similarity rules to create clusters from the canopies. Use cheaper rules first, followed by more expensive rules. Rules include checking for non-dictionary words, Fortune 500 words, similarity in name and address fields, abbreviation matches, etc.
5. Merge the set of clusters with the current normalized supplier base consisting of all enterprises' data that has already been normalized. This can be easily done using agglomerative clustering with the same repository of similarity rules, as described above.

3.2.2 Commodity taxonomy and commodity transactional mapping

Like supplier name normalization, commodity taxonomy mapping is also typically limited by the absence of mapped data. This is especially true in the case of taxonomy mapping for BTO procurement, as it often involves mapping the taxonomy of one enterprise to a totally different taxonomy. At the transactional level too, there is often no transactional data that is labeled with appropriate commodity codes, either due to the absence of a formal commodity taxonomy for spend categorization, or simply due to a lack of strict enforcement of such labeling for all transactions. In the limited cases in which labeled data is available within an enterprise (by way of descriptions in transactions that are also labeled with appropriate commodity codes), it does not associate the individual transactions with an
external taxonomy, as is the case for BTO procurement. As a result, systems for automating commodity mapping, both taxonomy as well as transactional, are once again mostly limited to unsupervised methods, such as the similarity and clustering techniques discussed earlier, although, in some cases, classification techniques can play a useful role, as we discuss later in this section. Moreover, even clustering techniques are often of little use, especially for commodity taxonomy mapping, since each commodity in a taxonomy is generally different (except in some cases where the taxonomy is built on an ad hoc basis, which may result in some duplication), leaving similarity methods as one of the most viable techniques for building effective commodity mapping solutions.

Furthermore, as in the case of supplier name normalization, several issues have to be considered while deciding on the specific approach for both commodity taxonomy mapping and commodity transactional mapping. Some of these are the same issues that also affect supplier name normalization, such as the computational expense of edit distance methods and the absence of positional differentiation in all the string similarity methods. Others are specific to the task of commodity mapping, though they affect taxonomy mapping and transactional mapping to different extents. Nevertheless, they must be considered and addressed appropriately for commodity mapping to be done successfully. One, different words (or their various grammatical forms) may be used in different descriptions to represent the same entity, both in taxonomies as well as in transactions; examples are synonyms, tenses, plurals, etc. Two, commodity taxonomy descriptions are normally very short and concise. As such, each word is significant, albeit to different degrees. However, distinguishing between different potential matches becomes correspondingly harder, since the items in a taxonomy often number in the tens of thousands, of which the best one has to be selected on the basis of a couple of words. Moreover, taxonomy descriptions may still have errors such as spelling mistakes, as the taxonomy may have been generated on the fly during categorization of spend transactions in day-to-day procurement activities. Three, commodity descriptions often contain significant amounts of domain-specific terminology. Four, the order of words in commodity descriptions becomes an important issue, one that is not considered in traditional information retrieval methods that use a bag-of-words approach. For example, "tax software" and "software tax" are considered similar by a token-based similarity metric such as TF/IDF. Five, in cases where transactional mapping needs to be done, for the reasons highlighted earlier, the problems are compounded by the fact that transactional descriptions are often noisier than taxonomy descriptions, often have substantially more domain-specific terminology, and also entail the need for resolving potentially conflicting matches resulting from multiple descriptions in the same transaction (arising from different sources such as POs and invoices).
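The fourth issue is easy to demonstrate. The short sketch below (using the scikit-learn library, an assumption of ours rather than anything prescribed by the chapter) computes TF/IDF cosine similarities for a few toy descriptions: word order is ignored entirely, and the "safety shoes"/"safety boots" pair, revisited below, shows that only shared tokens contribute.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

descriptions = ["tax software", "software tax", "safety shoes", "safety boots"]
vectors = TfidfVectorizer().fit_transform(descriptions)
sims = cosine_similarity(vectors)

print(sims[0, 1])  # 1.0: "tax software" vs "software tax" -- word order is ignored
print(sims[2, 3])  # < 1.0: only the shared token "safety" contributes; nothing
                   # tells the model that "shoes" and "boots" are related terms
```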
Compounding all this is the fact that the source and target taxonomies may have wide structural differences. As a case in point, consider the UNSPSC code. It has roughly 20K commodities in a four-level taxonomy. However, while the taxonomy is very broad, and includes commodities and services in almost all industrial sectors, it is not very deep in any given sector. Company taxonomies, on the other hand, are not very broad but are generally far more specific in terms of commodities, especially in the case of items used in production. For example, while the UNSPSC has commodity codes for desktop and notebook computers, companies are much more specific about the particular types of desktop and notebook computers. This is more so in the case of production parts, but also occurs in the case of services. As such, there is often a many-to-many mapping that needs to be done between the two taxonomies.

Another important factor is the need to determine exactly what the commodity description is referring to. For example, software tax is a sort of tax, while tax software is a type of software. As pointed out earlier, token-based string similarity methods cannot distinguish between these two phrases. More importantly, they do not distinguish between tokens on the basis of their semantic significance to the description, but only on the basis of their discriminative ability in terms of token and document frequencies. The problem is that, while mapping taxonomies, it is quite common for a sizeable list of possible candidates to be evaluated as similar to the description being considered on the basis of common tokens, but an accurate match cannot be made unless it is known what the specific object of the description is.

To enable the mapping to be done properly, various techniques from the classical information retrieval literature, including stop word removal, stemming, tokenization using words and grams, etc., can be used in conjunction with dictionaries and domain-specific vocabulary. Moreover, lexical databases, such as WordNet (WordNet, 1998), enable the use of synonyms, sense determination, morphological analysis and part-of-speech determination in the creation of rules and methods for better identifying the main keyword(s) in a description and for better ranking the mapping results, as well as providing means of enhancing the similarity calculation on the basis of synonyms, different word forms, etc. instead of just the token similarity provided by vanilla string similarity methods. For example, consider the descriptions "safety shoes" and "safety boots". With a similarity measure like TF/IDF, they would be considered similar to some extent (due to the common token "safety"), but there is no way to know that "shoes" and "boots" are also similar. Use of a system such as WordNet enables such mappings to be correctly made. Finally, a set of post-filtering and ranking rules needs to be created which assigns weights to tokens in the queries and candidate descriptions based on their importance, and re-ranks the candidate results to get a more accurate match list. This is necessary since exact matches are often not found; rather, a list of potential matches is found with different sets of tokens in common with the description being
mapped, and a decision needs to be made as to which of these is the best match. Thus, for mapping commodity taxonomies, an appropriate method is to use token-based string similarity (TF/IDF) augmented with a lexical database, such as WordNet, and rules based on positional differences between tokens in the query and candidate descriptions. Another step that can prove quite beneficial is to mine previously mapped enterprise taxonomies for similarities to the commodity description in question, and to use those to acquire the proper UNSPSC mapping when substantial similarities are found. This approach (Singh and Kalagnanam, 2006) can be described as follows:

1. Pre-process the commodity descriptions for the target (UNSPSC) taxonomy, previously mapped enterprise taxonomies and the to-be-mapped taxonomy by eliminating stop words and performing transformations such as stemming and term normalization; generate synonyms, etc. for the descriptions using a lexical database such as WordNet; and generate TF/IDF indexes for each taxonomy.
2. Define/augment a weighting-rules repository for identifying the best match for a given description.
   a. Define rules for identifying the main object of the description, as well as the related qualifiers. Thus, for "software tax", the object would be tax and the qualifier would be software. For "application software tax", there would be an additional qualifier, "application".
   b. Define rules to rank prospective matches based on the objects, qualifiers and their relative positions. The general idea is as follows: in addition to the presence/absence of various tokens in the query and candidate match, weights are assigned to tokens based on their relative and absolute position, as well as their importance to the query (object, immediate qualifier, distant qualifier, etc.). Thus, for example, if the objects match in value and position, the candidate would be ranked higher than a candidate in which the tokens match but their relative positions do not. Thus, if the query were "software tax", then a candidate "tax" would be ranked higher than a candidate "tax software" even though the latter is a perfect token-based match. Similarly, "application software tax" would be ranked higher than "tax software" but lower than "software tax".
3. For each commodity description in the to-be-mapped taxonomy:
   a. Check for an exact match with a description in the target taxonomy or a previously mapped taxonomy. If found, stop and use the match. Otherwise, proceed.
   b. Use the TF/IDF similarity method to generate a candidate list of possible matches.
   c. For the query description, identify the main object and qualifiers, use the weighting rules to rank the possible matches, and map to the highest-ranked description.
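As a purely hypothetical illustration of steps 2a–2b, the sketch below parses a description into an object (assumed here to be the last token) and its qualifiers, and scores candidates with hand-picked weights. The weights are invented for illustration only, but the resulting ranking reproduces the ordering described above for the query "software tax".

```python
def parse(description):
    """Split a short commodity description into (object, qualifiers).

    Assumption (in the spirit of step 2a): the last token is the head object,
    the preceding tokens are qualifiers, so "application software tax"
    -> object "tax", qualifiers ["application", "software"].
    """
    tokens = description.lower().split()
    return tokens[-1], tokens[:-1]

def rank_score(query, candidate):
    """Toy position-aware score mirroring the kind of weighting rules in step 2b."""
    q_obj, q_quals = parse(query)
    c_obj, c_quals = parse(candidate)
    score = 0
    if c_obj == q_obj:
        score += 10            # head objects agree
    elif q_obj in c_quals:
        score += 4             # query object present, but in a qualifier slot
    for q in q_quals:
        if q in c_quals:
            score += 2         # qualifier matched in a qualifier slot
        elif q == c_obj:
            score += 1         # qualifier matched, but as the candidate's object
    # small penalty for candidate tokens that do not appear in the query at all
    score -= sum(1 for t in c_quals + [c_obj] if t not in query.lower().split())
    return score

candidates = ["software tax", "application software tax", "tax", "tax software"]
print(sorted(candidates, key=lambda c: rank_score("software tax", c), reverse=True))
# ['software tax', 'application software tax', 'tax', 'tax software']
```

A production rule repository would of course use many more signals (synonyms, absolute positions, domain vocabulary), but the basic object/qualifier bookkeeping is the same.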
For transactional mapping, the same techniques and algorithms as described above for taxonomy mapping can be used, with some extensions. First, the same clustering technique as used for supplier name normalization (canopy-based clustering in conjunction with hierarchical, agglomerative clustering, using rule-based similarity measurement) can be applied to cluster together similar transactions based on the transactional descriptions. Second, the taxonomy mapping algorithm described above is extended to use transactional descriptions from previously mapped companies' data as well. Third, simple methods (such as majority rule) are used to combine mapping results arising from multiple descriptions, either for the same transaction or for different transactions in the same cluster. Fourth, better repositories are built, and improved techniques for filtering out the noise from such descriptions, mainly using stop words and better keyword indices, are designed. By the sheer nature of the task, this step will almost always necessitate extensive human intervention, primarily due to domain-specific terminology and conventions, to do the mapping correctly. However, as more and more data is mapped, subsequent mapping exercises for other enterprises, especially those in the same industry as the ones mapped earlier, should require less and less human involvement and enable more automation. In this regard, classification methods can also prove useful, as models can be induced from the previously mapped data and used to map transactions from newer enterprises, especially in the same industry.
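As a rough sketch of how such classification models might be induced, the following example (again using scikit-learn, with entirely made-up transaction descriptions and placeholder commodity labels rather than real UNSPSC codes) trains a simple TF-IDF/Naive Bayes classifier on previously mapped transactions and uses it to propose commodity labels for a new client's descriptions.

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical, previously mapped transactions: (description, commodity label)
mapped = [
    ("notebook computer 15 inch",        "it_hardware"),
    ("laptop docking station",           "it_hardware"),
    ("hazardous waste disposal service", "hazardous_waste_management"),
    ("hazmat removal and transport",     "hazardous_waste_management"),
    ("desktop tax software license",     "software"),
]
texts, labels = zip(*mapped)

# Induce a simple model from the mapped data ...
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(list(texts), list(labels))

# ... and use it to propose labels for a new client's unmapped transactions.
# Expected: the hazardous-waste class for the first, the IT-hardware class for the second.
print(model.predict(["hazardous waste management services", "laptop computer"]))
```

In practice, far more training data, better text normalization and a human review loop would be needed, but the induced model can cut down the manual effort for each subsequent client in the same industry.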
4 Conclusion
This chapter discussed how enterprise spend aggregation can be automated using data cleansing techniques. Over the past few years, more and more enterprises have been investing heavily in IT tools that can significantly improve their procurement activities. An earlier step in this direction was the move towards addressing strategic procurement functions, such as strategic sourcing, which require aggregation of spend across the entire enterprise. A recent trend is the move towards outsourcing procurement functions (especially non-core procurement pieces) to third-party providers who then perform the procurement function for the enterprise. This practice, called Business Transformation Outsourcing, can generate significant financial benefits for the enterprises involved, but requires spend aggregation to be done on a much larger scale than before, often across multiple enterprises. However, before such spend aggregation can be done, the spend data has to be cleansed and rationalized across, and within, the multiple enterprises, an activity that is typically done manually using rudimentary data analysis techniques and spreadsheets. Meanwhile, a significant amount of research has been conducted over the past couple of decades in various fields, such as databases, statistics and artificial
intelligence, on the development of various data cleansing techniques and their application to a broad range of applications and domains. This chapter provides a brief survey of these techniques and applications, and then discusses how some of these methods can be adapted to automate the various cleansing activities needed for spend data aggregation. Moreover, the chapter provides a detailed roadmap to enable the development of such an automated system for spend aggregation.
References Alvey, W., B. Jamerson (eds.) (1997). Record linkage techniques, in: Proceedings of an International Record Linkage Workshop and Exposition, March 20–21, Arlington, Virginia. Also published by National Academy Press (1999) and available at http://www.fcsm.gov under methodology reports. Anderberg, M.R. (1973). Cluster Analysis for Applications. Academic Press, New York. Baeza-Yates, R., B. Ribeiro-Neto (1999). Modern Information Retrieval. Addison-Wesley, Boston, MA. Banfield, J.D., A.E. Raftery (1993). Model based gaussian and non-gaussian clustering. Biometrics 49, 803–821. Bilenko, M., R.J. Mooney (2003). Adaptive duplicate detection using learnable string similarity metrics, in: Proceedings of ACM Conference on Knowledge Discovery and Data Mining, Washington, DC, pp. 39–48. Bitton, D., D.J. DeWitt (1983). Duplicate record elimination in large data files. ACM Transactions on Database Systems 8(2), 255–265. Borkar, V., K. Deshmukh, S. Sarawagi (2000). Automatically extracting structure from free text addresses. Bulletin of the Technical Committee on Data Engineering 23(4), 27–32. Bright, M.W., A.R. Hurson, S. Pakzad (1994). Automated resolution of semantic heterogeneity in multidatabases. ACM Transactions on Database Systems 19(2), 212–253. Califf, M.E. (1998). Relational learning techniques for natural language information extraction. Unpublished doctoral dissertation, University of Texas at Austin, Austin, TX, USA. Cheeseman, P., J. Stutz (1996). Bayesian classification (AutoClass): theory and results, in: U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, R. Uthurusamy (eds.), Advances in Knowledge Discovery and Data Mining, AAAI Press, Menlo Park, CA, pp. 153–180. Cochinwala, M., S. Dalal, A.K. Elmagarmid, V.S. Verykios (2001). Record matching: past, present and future. Available as Technical Report CSD-TR #01-013, Department of Computer Sciences, Purdue University. Available at http://www.cs.purdue.edu/research/technical_reports/2001/TR%2001-013.pdf Cohen, W.W. (2000). Data integration using similarity Joins and a word-based Information representation language. ACM Transactions on Information Systems 18(3), 288–321. Cohen, W.W., P. Ravikumar, S.E. Fienberg (2003). A comparison of string distance metrics for namematching tasks, in: Proceedings of the IJCAI-2003 Workshop on Information Integration on the Web, Acapulco, Mexico, pp. 73–78. Cohen, W.W., J. Richman (2002). Learning to match and cluster large high-dimensional data sets for data integration, in: Proceedings of the 8th International Conference on Knowledge Discovery and Data Mining, pp. 475–480. Dempster, A.P., N.M. Laird, D.B., Rubin (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical. Society, Series B 39(1), 1–38. Dey, D., S. Sarkar, P. De (2002). A distance-based approach to entity reconciliation in heterogeneous databases. IEEE Transactions on Knowledge and Data Engineering 14(3), 567–582. Dice, L.R. (1945). Measures of the amount of ecologic association between species. Ecology 26(3), 297–302.
Fellegi, I.P., A.B. Sunter (1969). A theory of record linkage. Journal of the American Statistical Society 64, 1183–1210. Granada Research (2001). Using the UNSPSC – United Nations Standard Products and Services Code White Paper. Available at http://www.unspsc.org/ Hernandez, M.A., S.J. Stolfo (1995) The Merge/Purge problem for large databases, in: Proceedings of the ACM SIGMOD Conference, San Jose, CA. Hopcroft, J.E., R. Motwani, J.D. Ullman (2006). Introduction to automata theory, languages and computation. 3rd ed. Addison-Wesley, Boston, MA. Jaccard, P. (1912). The distribution of flora in the alpine zone. New Phytologist 11, 37–50. Jain, A.K., M.N. Murty, P.J. Flynn (1999). Data clustering: a review. ACM Computing Surveys 31(3). Jain, A.K., R.C. Dubes (1988). Algorithms for Clustering Data. Prentice Hall, Saddle River, NJ. Jaro, M.A. (1989). Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association 89, 414–420. Jaro, M.A. (1995). Probabilistic linkage of large public health data files. Statistics in Medicine 14, 491–498. Joachims, T. (1998). Text categorization with support vector machines: learning with many relevant features, in: C. Nedellec, C. Rouveirol (eds.), Lecture Notes in Computer Science: Proceedings of the 10th European Conference on Machine Learning. Springer, London, UK. Levenshtein, V.I. (1966). Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady 10(8), 707–710. Lim, E., R.H.L. Chiang (2004). Accommodating instance heterogeneities in database integration. Decision Support Systems 38(2), 213–231. McCallum, A., K. Nigam (1998). A comparison of event models for Naive Bayes text classification, in: AAAI-98 Workshop on Learning for Text Categorization. McCallum, A., K. Nigam, L.H. Ungar (2000). Efficient clustering of high-dimensional data sets with application to reference matching, in: Proceedings of the sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM Press, pp. 169–178. McQueen, J. (1967). Some methods for classification and analysis of multivariate observations, in: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA, pp. 281–297. Monge, A.E., C. Elkan (1997). An efficient domain-independent algorithm for detecting approximately duplicate database records, in: Proceedings of the SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, Tucson, AZ. Navarro, G. (2001). A guided tour to approximate string matching. ACM Computing Surveys 33(1), 31–88. Newcombe, H.B. (1988). Handbook of Record Linkage: Methods for Health and Statistical Studies, Administration, and Business. Oxford University Press, New York, NY. Nigam, K., J. Lafferty, A. McCallum (1999). Using maximum entropy for text classification, in: IJCAI99 Workshop on Machine Learning for Information Filtering, pp. 61–67. Rahm, E., H.H. Do (2000). Data cleaning: problems and current approaches. Bulletin of the Technical Committee on Data Engineering 23(4), 3–13. Salton, G., C. Buckley (1987). Term weighting approaches in automatic text retrieval. Technical Report No. 87-881, Department of Computer Science, Cornell University, Ithaca, New York. Salton, G. (1991). Developments in automatic text retrieval. Science 253, 974–980. Singh, M., J. Kalagnanam (2006). Using data mining in procurement business transformation outsourcing. 
12th ACM SIGKDD Conference on Knowledge Discovery and Data Mining – Workshop on Data Mining for Business Applications, Philadelphia, PA, pp. 80–86. Singh, M., J. Kalagnanam, S. Verma, A. Shah, S. Chalasani (2005). Automated cleansing for spend analytics. CIKM ’05-ACM 14th Conference on Information and Knowledge Management, Bremen, Germany. Selim, S.Z., M.A. Ismail (1984). K-means-type algorithms: a generalized convergence theorem and characterization of local optimality. IEEE Transactions on Pattern Analysis and Machine Intelligence 6, 81–87.
UNSPSC, The United Nations Standard Products and Services Code. Available at http:// www.unspsc.org Wang, Y.R., S.E. Madnick (1989). The inter-database instance identification problem in integrating autonomous systems, in: Proceedings of the 5th International Conference on Data Engineering, Los Angeles, CA, pp. 46–55. Winkler, W.E. (2002). Record linkage and Bayesian networks, in: Proceedings of the Section on Survey Research Methods, American Statistical Association, Washington, DC. Winkler, W.E. (2006). Overview of record linkage and current research directions. Research Report Series: Statistics #2006-2, Statistical Research Division, U.S. Census Bureau, Washington, DC 20233. Available at http://www.census.gov/srd/papers/pdf/rrs2006-02.pdf WordNet (1998). A lexical database for the English language. Cognitive Science Laboratory, Princeton University, Princeton, NJ. Available at http://wordnet.princeton.edu
Chapter 8
Spatial-Temporal Data Analysis and Its Applications in Infectious Disease Informatics
Daniel Zeng Department of Management Information Systems, The University of Arizona 1130 E. Helen Street, Rm 430, Tucson, AZ 85721-0108, USA; Institute of Automation, Chinese Academy of Sciences, Beijing, China
James Ma and Hsinchun Chen Department of Management Information Systems, The University of Arizona 1130 E. Helen Street, Rm 430, Tucson, AZ 85721-0108, USA
Wei Chang Katz Graduate School of Business, The University of Pittsburgh 343 Mervis Hall, Pittsburgh, PA 15213, USA
Abstract Recent years have witnessed significant interest in spatial-temporal data analysis. In this chapter, we introduce two types of spatial-temporal data analysis techniques and discuss their applications in public health informatics. The first technique focuses on clustering or hotspot analysis. Both statistical and machine learning-based analysis techniques are discussed in off-line (retrospective) and online (prospective) data analysis contexts. The second technique aims to analyze multiple data streams and identify significant correlations among them. Both classical spatial correlation analysis methods and new research on spatial-temporal correlation are presented. To illustrate how these spatial-temporal data analysis techniques can be applied in real-world settings, we report case studies in the domain of infectious disease informatics.
1 Introduction
Recent years have witnessed significant interest in spatial-temporal data analysis. The main reason for this interest is the availability of datasets
containing important spatial and temporal data elements across a wide spectrum of applications, ranging from public health (disease case reports), public safety (crime case reports), search engines (search keyword geographical distributions over time) and transportation systems (data from Global Positioning Systems (GPS)), to product lifecycle management (data generated by Radio Frequency Identification (RFID) devices) and financial fraud detection (financial transaction tracking data) (Sonesson and Bock, 2003). The following central questions of great practical importance have arisen in spatial-temporal data analysis and related predictive modeling:
(a) How to identify areas having exceptionally high or low measures (hotspots)?
(b) How to determine whether the unusual measures can be attributed to known random variations or are statistically significant? In the latter case, how to assess the explanatory factors?
(c) How to identify any statistically significant changes in geographic areas in a timely manner?
(d) How to identify significant correlations among multiple data streams with spatial and temporal data elements?
Questions (a)–(c) can be tackled by spatial-temporal clustering analysis techniques, also known as hotspot analysis techniques. Two types of clustering methods have been developed in the literature. The first type of approach falls under the general umbrella of retrospective models (Yao, 2003; Kulldorff, 1997). It is aimed at testing statistically whether events (e.g., disease cases) are randomly distributed over space and time in a predefined geographical region during a predetermined time period. In many cases, however, this static perspective is inadequate, as data often arrive dynamically and continuously, and in many applications there is a critical need for detecting and analyzing emerging spatial patterns on an ongoing basis. The second type of approach, prospective in nature, aims to meet this need with repeated, periodic analyses targeted at identifying statistically significant changes in an online context (Rogerson, 2001). Alerts are usually disseminated whenever such changes are detected. In the first part of this chapter, we present introductory material on both retrospective and prospective spatial-temporal data analysis techniques, and illustrate their applications using public health datasets (Chang et al., 2005; Kulldorff, 2001).

To answer question (d), one has to study relationships among multiple datasets. Current correlation analysis is mainly applied in fields such as forestry (Stoyan and Penttinen, 2000), acoustics (Tichy, 1973; Veit, 1976), entomology (Cappaert et al., 1991), or animal science (Lean et al., 1992; Procknor et al., 1986), whose practices focus mostly on either time series or spatial data. One of the widely adopted definitions for spatial correlation analysis is Ripley's K(r) function (Ripley, 1976, 1981). In order to analyze
the data sets with both spatial and temporal dimensions, in our recent research, we have extended the traditional K(r) definition by adding a temporal parameter t. By analyzing real-world infectious disease-related data sets, we found that the extended definition K(r, t) is more discriminating than the K(r) function and can discover causal events whose occurrences induce those of other events. In the second part of this chapter, we introduce both Ripley’s K(r) function and its extended form K(r, t), and discuss a case study applying them to a public health dataset concerning mosquito control. The remainder of this chapter is structured as follows. In Section 2 we introduce the retrospective and prospective spatial clustering techniques. Section 3 focuses on spatial and spatial-temporal correlation analysis methods. In Section 4 we conclude by summarizing the chapter.
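For readers unfamiliar with Ripley's K function, the following Python sketch shows a naive estimator of K(r) (ignoring edge correction) together with one plausible way a temporal parameter t could be added. The K(r, t) shown here only illustrates the idea and is not necessarily normalized the same way as the extended definition developed later in the chapter.

```python
import numpy as np

def ripley_k(points, r, area):
    """Naive estimate of Ripley's K(r) for 2-D event locations (no edge correction)."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    close = (d <= r).sum() - n          # drop the n zero self-distances
    return area * close / (n * (n - 1))

def ripley_kt(points, times, r, t, area):
    """Illustrative space-time variant: count pairs within distance r AND time lag t."""
    pts = np.asarray(points, dtype=float)
    ts = np.asarray(times, dtype=float)
    n = len(pts)
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    dt = np.abs(ts[:, None] - ts[None, :])
    close = ((d <= r) & (dt <= t)).sum() - n
    return area * close / (n * (n - 1))
```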
2 Retrospective and prospective spatial clustering
We first review major types of retrospective and prospective surveillance approaches in Section 2.1. Section 2.2 introduces recently developed spatial-temporal data analysis methods based on a robust support vector machine (SVM)-based spatial clustering technique. The main technical motivation behind such methods is the lack of hotspot analysis techniques capable of detecting unusual geographical regions with arbitrary shapes. Section 2.3 summarizes several computational experiments based on simulated datasets. This experimental study includes a comparative component evaluating the SVM-based approaches against other methods in both retrospective and prospective scenarios. In Section 2.4, we summarize a case study applying spatial-temporal clustering analysis technologies to real-world datasets.

2.1 Literature review

In this section, we first introduce retrospective spatial-temporal data analysis, and then present the representative prospective surveillance methods, many of which were developed as extensions to retrospective methods.

2.1.1 Retrospective spatial-temporal data analysis

Retrospective approaches determine whether observations or measures are randomly distributed over space and time for a given region. Clusters of data points or measures that are unlikely under the random distribution assumption are reported as anomalies. A key difference between retrospective analysis and standard clustering lies in the concept of "baseline" data. For standard clustering, data points are grouped together directly
based on the distances between them. Retrospective analysis, on the other hand, is not concerned with such clusters. Rather, it aims to find out whether unusual clusters formed by the data points of interest exist relative to the baseline data points. These baseline data points represent how the normal data should be spatially distributed given the known factors or background information. Clusters identified in this relative sense provide clues about dynamic changes in spatial patterns and indicate the possible existence of unknown factors or emerging phenomena that may warrant further investigation. In practice, it is the data analyst's responsibility to separate the dataset into two groups: baseline data and data points of interest, typically with events corresponding to the baseline data preceding those corresponding to the data points of interest. As such, retrospective analysis can be conceptualized as a spatial "before and after" comparison. For example, Fig. 1 shows certain disease incidents in a city. Asterisks indicate the locations where the disease incidents usually occur in normal situations (corresponding to the baseline cases). Crosses are the recently confirmed incidents (cases of interest). Comparing the distribution of the cases of interest with that of the baseline, one can identify an emerging area containing dense disease incidents, indicative of a possible outbreak. In Fig. 1, this emerging area is identified as an irregularly shaped area close to the center. Later we discuss two major types of retrospective analysis methods: scan statistic-based and clustering-based. A comparative study of these two types of retrospective approaches can be found in Zeng et al. (2004).
Fig. 1. An example of retrospective analysis.
2.1.1.1 Scan statistic-based retrospective analysis. Various types of scan statistics have been developed over the past four decades for surveillance and monitoring purposes in a wide range of application contexts. For spatial-temporal data analysis, a representative method is the spatial scan statistic approach (Kulldorff, 1997). This method has become one of the most popular methods for the detection of geographical disease clusters and is widely used by public health departments and researchers. In this approach, the number of events, for example, disease cases, may be assumed to be either Poisson or Bernoulli distributed. Algorithmically, the spatial scan statistic method imposes a circular window on the map under study and lets the center of the circle move over the area so that at different positions the window includes different sets of neighboring cases. Over the course of data analysis, the method creates a large number of distinct circular windows (other shapes such as rectangles and ellipses have also been used), each with a different set of neighboring areas within it and each a possible candidate for containing an unusual cluster of events. A likelihood ratio is defined for each circle to assess how unlikely it is that the concentration of cases of interest within that circle arose by pure chance. The circles with high likelihood ratios are in turn reported as spatial anomalies or hotspots.

2.1.1.2 Clustering-based retrospective analysis. Despite the success of the spatial scan statistic and its variations in spatial anomaly detection, the major computational problem faced by this type of method is that the scanning windows are limited to simple, fixed symmetrical shapes for analytical and search efficiency reasons. As a result, when the real underlying clusters do not conform to such shapes, the identified regions are often not well localized. Another problem is that it is often difficult to customize and fine-tune the clustering results using scan statistic approaches. For different types of analysis, users often have different needs as to the level of granularity and the number of the resulting clusters, and they have different degrees of tolerance regarding outliers. These problems have motivated the use of alternative modeling approaches based on clustering. Risk-adjusted nearest neighbor hierarchical clustering (RNNH) (Levine, 2002) is a representative of such approaches. Developed for crime hotspot analysis, RNNH is based on the well-known nearest neighbor hierarchical (NNH) clustering method, combining hierarchical clustering (Johnson, 1967) capabilities with kernel density interpolation techniques (Levine, 2002). The standard NNH approach identifies clusters of data points that are close together (based on a threshold distance). Many such clusters, however, are due to some background or baseline factors (e.g., the population, which is not evenly distributed over the entire area of interest). RNNH is primarily motivated by the need to identify clusters of data points relative to the baseline factor. Algorithmically, it dynamically adjusts the threshold distance inversely proportional to some density measure of the baseline factor (e.g., the threshold should be shorter in regions where
2.1.1.2 Clustering-based retrospective analysis. Despite the success of the spatial scan statistic and its variations in spatial anomaly detection, the major computational problem faced by this type of method is that the scanning windows are limited to simple, fixed symmetrical shapes for analytical and search efficiency reasons. As a result, when the real underlying clusters do not conform to such shapes, the identified regions are often not well localized. Another problem is that it is often difficult to customize and fine-tune the clustering results using scan statistic approaches. For different types of analysis, the users often have different needs as to the level of granularity and the number of the resulting clusters, and they have different degrees of tolerance regarding outliers. These problems have motivated the use of alternative modeling approaches based on clustering. Risk-adjusted nearest neighbor hierarchical clustering (RNNH) (Levine, 2002) is a representative of such approaches. Developed for crime hotspot analysis, RNNH is based on the well-known nearest neighbor hierarchical (NNH) clustering method, combining hierarchical clustering (Johnson, 1967) capabilities with kernel density interpolation techniques (Levine, 2002). The standard NNH approach identifies clusters of data points that are close together (based on a threshold distance). Many such clusters, however, are due to some background or baseline factors (e.g., the population, which is not evenly distributed over the entire area of interest). RNNH is primarily motivated to identify clusters of data points relative to the baseline factor. Algorithmically, it dynamically adjusts the threshold distance inversely proportional to some density measure of the baseline factor (e.g., the threshold should be shorter in regions where the population is high). Such density measures are computed using kernel density estimates based on the distances between the location under study and some or all other data points. We summarize below the key steps of the RNNH approach. First, define a grid over the area of interest, calculate the kernel density of baseline points for each grid cell, and rescale these density measures using the total number of cases. Second, calculate the threshold distances between data points for hierarchical clustering purposes and perform the standard NNH clustering based on this distance threshold. RNNH has been shown to be a successful tool in detecting spatial-temporal criminal activity patterns (Levine, 2002). We argue that its built-in flexibility of incorporating any given baseline information and its computational efficiency also make it a good candidate for analyzing spatial-temporal data in other applications. In Section 2.2.1, we will introduce another clustering-based method, called risk-adjusted support vector clustering (RSVC) (Zeng et al., 2005), the result of our recent attempt to combine the risk adjustment idea of RNNH with a modern, robust clustering mechanism such as SVM to improve the quality of hotspot analysis. 2.1.2 Prospective spatial-temporal surveillance A major advantage that prospective approaches have over retrospective approaches is that they do not require the separation between the baseline cases and cases of interest in the input data. Such a requirement is necessary in retrospective analysis, and is a major source of confusion and difficulty for the end users. Prospective methods bypass this problem, and process data points continuously in an online context. Two types of prospective spatial-temporal data analysis approaches have been developed in the statistics literature (Kulldorff, 2001; Rogerson, 1997, 2001). The first type segments the surveillance data into chunks by arrival time, and then applies a spatial clustering algorithm to identify abnormal changes. In essence, this type of approach reduces a spatial-temporal surveillance problem into a series of spatial surveillance problems. The second type explicitly considers the temporal dimension, and clusters data points directly based on both spatial and temporal coordinates. We briefly summarize representative approaches for both types of methods, including Rogerson’s method and the space-time scan statistic. 2.1.2.1 Rogerson’s methods. Rogerson has developed CUSUM-based surveillance methods to monitor spatial statistics such as Tango and Knox statistics, which capture spatial distribution patterns existing in the surveillance data (Rogerson, 1997, 2001). CUSUM is a univariate surveillance approach that monitors the number of events in a fixed
interval. Let C_t be the spatial statistic (e.g., Tango or Knox) at time t. The surveillance variable is defined as Z_t = (C_t − E(C_t | C_{t−1})) / σ(C_t | C_{t−1}). Refer to Rogerson (1997, 2001) for the derivation of the conditional expected value E(C_t | C_{t−1}) and the corresponding variance σ(C_t | C_{t−1}). Following the CUSUM surveillance approach, when the accumulated deviation Z_t exceeds a threshold value, the system will report an anomaly (which typically triggers an alarm in public health applications). Rogerson’s methods have successfully detected the onset of Burkitt’s lymphoma in Uganda during 1961–1975 (Rogerson, 1997). 2.1.2.2 Space-time scan statistic. Kulldorff has extended his retrospective 2-dimensional spatial scan statistic to a 3-dimensional space-time scan statistic, which can be used as a prospective analysis method (Kulldorff, 2001). The basic intuition is as follows. Instead of using a moving circle to search the area of interest, one can use a cylindrical window in three dimensions. The base of the cylinder represents space, exactly as with the spatial scan statistic, whereas the height of the cylinder represents time. For each possible circle location and size, the algorithm considers all possible starting and ending times. The likelihood ratio test statistic for each cylinder is constructed in the same way as for the spatial scan statistic. After a computationally intensive search process, the algorithm can identify the abnormal clusters with corresponding geolocations and time periods. The space-time scan statistic has successfully detected an increased rate of male thyroid cancer in Los Alamos, New Mexico during 1989–1992 (Kulldorff, 2001). 2.2 Support vector clustering-based spatial-temporal data analysis In this section, we present two recently developed robust spatial-temporal data analysis methods. The first is a retrospective hotspot analysis method called RSVC (Zeng et al., 2005). The second is a prospective analysis method called prospective support vector clustering (PSVC), which uses RSVC as a clustering engine (Chang et al., 2005). 2.2.1 Risk-Adjusted Support Vector Clustering (RSVC) RSVC is the result of our recent attempt to combine the risk adjustment idea of RNNH with a modern, SVM-based robust clustering mechanism to improve the quality of hotspot analysis. SVM-based clustering (SVC) (Ben-Hur et al., 2001) is a well-known extension of SVM-based classification. However, the standard version of SVC does not take into consideration baseline data points and therefore cannot be directly used in spatial-temporal data analysis. As such, we have developed a risk-adjusted variation, called RSVC, based on ideas similar to those in RNNH. Firstly, using only the baseline points, a density map is constructed
using standard approaches such as the kernel density estimation method. Secondly, the case data points are mapped implicitly to a high-dimensional feature space defined by a kernel function (typically the Gaussian kernel). The width parameter in the Gaussian kernel function determines the dimensionality of the feature space. The larger the width parameter is, the harder it is for data points in the original space to form a single cluster, and hence data points are more likely to belong to smaller clusters. Our algorithm dynamically adjusts the width parameter based on the kernel density estimates obtained in the previous step. The basic intuition is as follows: when the baseline density is high, a larger width value is used to make it harder for points to be clustered together. Thirdly, following the SVM approach, RSVC finds a hypersphere in the feature space with a minimal radius to contain most of the data. The problem of finding this hypersphere can be formulated as a quadratic or linear program depending on the distance function used. Fourthly, the function estimating the support of the underlying data distribution is then constructed using the kernel function and the parameters learned in the third step. When projected back to the original data space, the identified hypersphere is mapped to (possibly multiple) clusters. These clusters are then returned as the output of RSVC. 2.2.2 Prospective support vector clustering Although well-grounded in theoretical development, both Rogerson’s methods and the space–time scan statistic have major computational problems. Rogerson’s approaches can monitor a given target area but cannot search for problematic areas or identify the geographic shape of these areas. The space–time scan statistic method performs poorly when the true abnormal areas do not conform to simple shapes such as circles. Below we introduce the basic ideas behind our approach, which is called PSVC, and summarize its main algorithmic steps. Our PSVC approach follows the design of the first type of spatial-temporal surveillance method discussed in Section 2.1.2, which involves repeated spatial clusterings over time. More specifically, the time horizon is first discretized based on the specific characteristics of the data stream under study. Whenever a new batch of data arrives, PSVC treats the data collected during the previous time frame as the baseline data, and runs the retrospective RSVC method. After obtaining a potential abnormal area, PSVC attempts to determine how statistically significant the identified spatial anomaly is. Many indices have been developed to assess the significance of the results of clustering algorithms in general (Halkidi et al., 2002a,b). However, all these criteria assess clustering in an absolute sense without considering the baseline information. Thus, they are not readily suitable for prospective spatial-temporal data analysis. Kulldorff’s (1997) likelihood ratio L(Z), as defined in the following equation, is to the best of our knowledge the only statistic that explicitly takes
the baseline information into account.

L(Z) = \left(\frac{c}{n}\right)^{c} \left(1-\frac{c}{n}\right)^{n-c} \left(\frac{C-c}{N-n}\right)^{C-c} \left(1-\frac{C-c}{N-n}\right)^{(N-n)-(C-c)}    (1)
In this definition, C and c are the number of the cases in the entire dataset and the number of the cases within the scanned area Z, respectively. N and n are the total number of the cases and the baseline points in the entire dataset and the total number of the cases and the baseline points within Z, respectively. Since the distribution of the statistic L(Z) is unknown, we use the standard simulation approach to calculate statistical significance measured by the p-value. Specifically, we first generate T replications of the dataset, assuming that the data are randomly distributed. We then calculate the likelihood ratio L(Z) in the same area Z for each replication. Finally, we rank these likelihood ratios, and if the observed L(Z) takes the Xth position, then the p-value is set to X/(T+1). Note that in a straightforward implementation of the earlier algorithmic design, anomalies are identified (or equivalently, alerts are triggered) only when adjacent data batches have significant changes in terms of data spatial distribution. This localized, myopic view, however, may lead to significant delay in alarm triggering or even false negatives because in some circumstances, unusual changes may manifest gradually. In such cases, there might not be any significant changes between adjacent data batches. However, the accumulated changes over several consecutive batches can be significant and should trigger an alarm. This observation suggests that a more ‘‘global’’ perspective beyond comparing adjacent data batches is needed. It turns out that the CUSUM approach provides a suitable conceptual framework to help design a computational approach with such a global perspective. The analogy is as follows. In the CUSUM approach, accumulated deviations from the expected value are explicitly tracked. In prospective analysis, it is difficult to design a single one-dimensional statistic to capture what the normal spatial distribution should look like and to measure the extent to which deviations occur. However, conceptually the output of a retrospective surveillance method such as RSVC can be viewed as the differences or discrepancies between two data batches, with the baseline data representing the expected data distribution. In addition, accumulated discrepancies can be computed by running RSVC with a properly set separation between baseline and case data. For an efficient implementation, we use a stack as a control data structure to keep track of RSVC runs, which now include comparisons beyond data from adjacent single periods. The detailed control strategy is described later. When clusters generated in two consecutive RSVC runs have overlaps, we deem that the areas covered by these clusters are risky areas. We use the stack to store the clusters along with the data batches from which these risky clusters are identified. Then we run RSVC to compare the current data
batch with each element (in the form of a data batch) of the stack sequentially from the top to the bottom to examine whether significant spatial pattern changes have occurred. Stacks whose top data batch is not the current data batch under examination can be emptied since the areas represented by them no longer have the trend to bring on any significant distribution change. This operation resembles one of the steps in the CUSUM calculation where the accumulated deviation is reset to 0 when the monitored variable is no longer within the risky range. We now explain the main steps of the PSVC algorithm as shown in Fig. 2. Each cluster stack represents a candidate abnormal area and the array clusterstacks holds a number of cluster stacks keeping track of all candidate areas at stake. Initially (line 1) clusterstacks is empty. The steps from line 3 to 35 are run whenever a new data batch enters the system. First, the RSVC retrospective method is executed (line 3) to compare the spatial distribution of the new data batch with that of the previous data batch. The resulting abnormal clusters are saved in rsvcresult. Any statistically significant cluster in rsvcresult will immediately trigger an alert (line 5).
1  clusterstacks=[]
2  Whenever a new data batch arrives {
3      rsvcresult=RSVC(previousdate, currentdate)
4      For each cluster C recorded in rsvcresult {
           /* C records the identified cluster, its p-value, and the
              date of the associated data batch. */
5          If (C.p-value ...

Fig. 2. PSVC algorithm.
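The stack-based control strategy shown in Fig. 2 and described in the surrounding text can be sketched in Python as follows. The helpers rsvc(), p_value(), and overlaps() are hypothetical placeholders for the RSVC clustering engine, the likelihood-ratio significance test based on Eq. (1), and a geometric overlap test; the sketch is an interpretation of the prose, not the authors’ exact implementation.

ALPHA = 0.05                  # assumed significance threshold

def rsvc(baseline, cases):
    """Placeholder for the retrospective RSVC engine: returns candidate clusters."""
    return []

def p_value(cluster):
    """Placeholder for the Monte Carlo significance test based on L(Z) in Eq. (1)."""
    return 1.0

def overlaps(cluster_a, cluster_b):
    """Placeholder geometric overlap test between two clusters."""
    return False

cluster_stacks = []           # each stack tracks one candidate abnormal area over time

def process_batch(current_batch, previous_batch):
    """One PSVC surveillance step, run whenever a new data batch arrives."""
    global cluster_stacks
    alerts = []
    for cluster in rsvc(baseline=previous_batch, cases=current_batch):
        if p_value(cluster) < ALPHA:
            alerts.append(cluster)                # significant change between adjacent batches
            continue
        placed = False
        for stack in cluster_stacks:
            top_cluster, _ = stack[-1]
            if overlaps(cluster, top_cluster):
                stack.append((cluster, current_batch))
                # Compare the current batch against the older batches in this stack,
                # from top to bottom, to catch gradually accumulating changes.
                for _, older_batch in reversed(stack[:-1]):
                    deeper = rsvc(baseline=older_batch, cases=current_batch)
                    if any(p_value(c) < ALPHA for c in deeper):
                        alerts.append(cluster)
                        break
                placed = True
        if not placed:
            cluster_stacks.append([(cluster, current_batch)])
    # CUSUM-style reset: drop stacks that were not extended by the current batch.
    cluster_stacks = [s for s in cluster_stacks if s[-1][1] is current_batch]
    return alerts

print(process_batch(current_batch={"week": 2}, previous_batch={"week": 1}))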
Emerging candidate areas that are not yet statistically significant are kept in clusterstacks. Lines 7 to 32 of the PSVC algorithm describe the operations to be performed on each of these candidate clusters C. If no cluster stack exists, we simply create a new cluster stack which contains only C as its member (line 30), and update the array clusterstacks accordingly (line 31). If cluster stacks already exist, for each existing cluster stack S, we determine whether the current cluster C has any overlap with the most recent cluster (the top element) in S (line 10). If the current cluster C does overlap with an existing candidate area, further investigation beyond the comparison between adjacent data batches is warranted. The operations described in lines 11 to 15 implement these further investigative steps. First, cluster C is added onto stack S (line 11). Then the current data batch is compared against all remaining data batches in S in turn, from top to bottom. Should any significant spatial distribution change be detected, an alert will be triggered (lines 13 to 15). If cluster C does not overlap with any of the most recent clusters in all of the existing cluster stacks, a new cluster stack is created with C as its only element and the array clusterstacks is updated accordingly (lines 22 and 23). After processing the candidate cluster C, we remove all inactive cluster stacks whose top clusters are not generated at the present time (equal to the creation time of C) (line 26). Note that two stacks may have the same top element. However, because the accumulated deviations of spatial distribution stored in these two stacks might be different, and this accumulated deviation information may be valuable in deciding whether to trigger an alert, we do not merge these two stacks. 2.3 Experimental studies This section reports experimental studies designed to evaluate RSVC and PSVC, and compare their performance with that of existing retrospective and prospective analysis methods. 2.3.1 RSVC evaluation We have conducted a series of computational studies to evaluate the effectiveness of the three hotspot analysis techniques (SaTScan, RNNH, RSVC) (Zeng et al., 2005). In the first set of experiments, we used artificially generated datasets with known underlying probability distributions to precisely and quantitatively evaluate the efficacy of these techniques. Since the true hotspots are known in these experiments based on simulated data, we use the following well-known measures from information retrieval to evaluate the performance of hotspot techniques: Precision, Recall, and F-measure. In the spatial data analysis context, we define these measures as follows. Let A denote the size of the hotspot(s) identified by a given algorithm, B the size of the true hotspot(s), and C the size of the overlapped
area between the algorithm-identified hotspot(s) and the true hotspot(s). Precision is defined as C/A. Recall is defined as C/B. F-measure is defined as the harmonic mean of precision and recall, i.e., 2 × Precision × Recall / (Precision + Recall). Observe that high recall indicates low false negatives and that high precision indicates low false positives. F-measure represents a balance and trade-off between precision and recall. Below we report one artificially generated scenario we have experimented with. In this scenario, as shown in Fig. 3, the true hotspot is a square with its circular-shaped center removed. We first randomly generated 100 baseline points in the circle located in the center. We then generated 200 case points of interest in total over the square. To make the problem more interesting, we introduced some noise—30 outlier baseline points and 40 outlier case points over the entire map. For statistical testing purposes, we repeated the above data generation process 30 times to produce 30 instances of the template scenario by moving the center of the circle randomly across the map. To collect the performance data, we ran all hotspot analysis methods under study on all the problem instances. Table 1 summarizes these methods’ average performance across all instances. We have also conducted additional experiments using different template scenarios. Interested readers are referred to Zeng et al. (2005).
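These area-based measures can be computed with the short Python sketch below; it assumes that the detected and true hotspots are represented as boolean masks over a common spatial grid, which is an illustrative choice since the chapter does not specify how the areas are encoded.

import numpy as np

def hotspot_scores(detected_mask, true_mask):
    """Area-based precision, recall, and F-measure for hotspot detection.

    Both arguments are boolean arrays over the same spatial grid, where
    True marks cells inside the detected (resp. true) hotspot."""
    A = detected_mask.sum()                              # size of detected hotspot(s)
    B = true_mask.sum()                                  # size of true hotspot(s)
    C = np.logical_and(detected_mask, true_mask).sum()   # size of the overlap
    precision = C / A if A else 0.0
    recall = C / B if B else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f

# Toy example: a 100 x 100 grid with two overlapping square regions.
grid = np.zeros((100, 100), dtype=bool)
true_hotspot = grid.copy()
true_hotspot[20:60, 20:60] = True
detected = grid.copy()
detected[30:70, 30:70] = True
print(hotspot_scores(detected, true_hotspot))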
Fig. 3. Scenario 1 based on simulated data.
Table 1
Average performance of RSVC, SCAN, and RNNH

          Precision (%)    Recall (%)    F-measure (%)
RSVC      80               92            85
SCAN      54               92            65
RNNH      95               50            64
Some general observations are in order. RSVC and the spatial scan method achieve similar levels of recall across different scenarios. However, RSVC has higher precision than the spatial scan method does (confirmed by statistical tests). RNNH has the highest precision level but typically the lowest recall. When considering the combined measure, i.e., the F-measure, RSVC consistently delivers the best results. Although this experimental study is based on artificially generated datasets whose characteristics are ad hoc and somewhat arbitrary, this type of study is the only one that can produce verifiable quantitative performance measures. The consistent observations suggest that RSVC is a strong (if not the best) candidate for real-world applications in security informatics for various types of hotspot identification. 2.3.2 PSVC evaluation To evaluate a prospective spatial-temporal data analysis method, we need to consider both spatial and temporal evaluation measures. From a spatial perspective, the goal is to evaluate how accurate the detected clusters are geographically, relative to the location of the true clusters. When true hotspots are known, precision, recall, and F-measure provide appropriate performance metrics, as in the case of retrospective analysis. As for the temporal evaluation measures, ARL0 (average run length under the null hypothesis) and ARL1 (average run length under the alternative hypothesis) are two widely used ones in univariate surveillance (Sonesson and Bock, 2003). ARL1 reveals how timely an algorithm can detect an anomaly, and ARL0 how easily an algorithm tends to trigger a false alarm. In our study, we adopt the ARL1 measure and rename it ‘‘Alarm Delay,’’ which is defined as the delay between the time an anomaly occurs and the time the algorithm triggers the corresponding alert. Using ARL0 can be difficult in practice as it would require the system to run for a long time under the normal condition to collect false alarm data. As an alternative, we followed the performance data collection procedure below. We apply the prospective analysis method under study to a simulated data stream for a relatively long period of time. This data stream contains some anomalies generated according to known patterns. When a suspicious area reported by the method does not overlap with the true abnormal area (i.e., both precision and recall are 0) or the report date is earlier than the actual date
of the abnormal occurrence, we consider it a false alarm. In some cases, the system fails to trigger any alarms during the entire monitoring period. We count how many times an algorithm triggers false alarms and how many times it fails to detect the true anomalies as surrogate measures for ARL0. We have chosen the space–time scan statistic as the benchmark method since it has been widely tested and deployed, especially in public health applications, and its implementation is freely available through the SaTScan system. Similar to the evaluation of RSVC, we have used simulated datasets with the generation of true clusters fully under our control. Below we report one scenario used in our computational experiments. For ease of exposition, throughout this section, we use the public health application context to illustrate this scenario. This scenario corresponds to an ‘‘emerging’’ scenario where disease outbreaks start from some location where very few disease incidents occurred before. For this scenario, we created 30 problem instances by randomly changing the size, location, starting date, and speed of expansion of the simulated abnormal cluster. For all ‘‘emerging’’ scenario problem instances, both the x- and y-axes have the support [0, 20]. The range for time is from 0 to 50 days. We first generated 300 data points in this 3-dimensional space ([0, 20] × [0, 20] × [0, 50]) as the background. We then generated another 300 data points inside a cylinder whose bottom circle resides at center (x_l, y_l) with radius r_l. The height of this cylinder is set to 50, covering the entire time range. This cylinder is designed to test whether a prospective spatio-temporal data analysis method might identify the pure spatial cluster by mistake. Consider the dense cone-shaped area in the top sub-figure of Fig. 4. An abnormal circular cluster centered at (x_r, y_r) emerges on some date startT. This circle starts with radius startR and continuously expands until the radius reaches endR on the last day, day 50. In contrast to the cylinder to the left, which has roughly the same number of data points every day, the cone-shaped area represents an emerging phenomenon. To approximate exponential expansion, we let the number of points inside the cone-shaped area on any given day follow the expression a × (current_date − start_date + 1)^increaserate, where a is the number of points inside the area on the anomaly starting date and increaserate indicates how fast an outbreak expands. Figure 5 shows three snapshots of an emerging scenario problem instance projected onto the spatial map at three different times. The crosses represent the new data batch for the current time frame during which the analysis is being conducted. The stars represent the data points from the last time frame. As shown in these snapshots, until day 22 there is no notable spatial pattern
Fig. 4. A problem instance of the ‘‘emerging’’ scenario. The top panel shows the ‘‘emerging’’ scenario data set in the three-dimensional (x, y, z) space; the bottom panel shows the emerging anomaly detected on day 29, with the incidents collected on day 22, the incidents collected on day 29, the PSVC result, the SaTScan result, and the true cluster plotted on the (x, y) map.
change during two consecutive weeks. But during the week from day 22 to day 29, we can clearly observe an emerging circle. When generating data points for 30 replications of the emerging scenario, we aimed to experiment with the cone-shaped area and the cylinder of
Fig. 5. Snapshots of an ‘‘emerging’’ scenario problem instance on the (x, y) map: the day 15 batch plotted with the day 8 batch, the day 22 batch plotted with the day 15 batch, and the day 29 batch plotted with the day 22 batch (days 22–29).
varying sizes and locations under two constraints: (a) neither area is completely inside the other area, and (b) both areas are confined within the boundary of the three-dimensional space. Under this guideline, we carefully generated the experimental parameters as follows: x_l, y_l, r_l, x_r, and y_r are uniformly distributed on the intervals [4,8], [8,12], [1,5], [13,15], and [9,11], respectively; the anomaly starting date startT is uniformly distributed on [18,35]; the starting and ending radii of the emerging circle, startR and endR, are uniformly distributed on [1,2] and [3,5], respectively; and a and increaserate are uniformly distributed on [2,4] and [0.2,1.5], respectively. Prospective analysis was conducted on a weekly basis with each batch containing around 80–100 data points. The bottom sub-figure of Fig. 4 illustrates the results of the analyses using SaTScan and PSVC on the problem instance shown in the top sub-figure. As expected, both methods reported an emerging abnormal area. Neither reported the pure spatial cluster (the cylinder), which is a positive result. The average performance of PSVC and SaTScan over the 30 problem instances is summarized in Table 2. We observe that for the emerging scenario,
Table 2
Average performance of SaTScan and PSVC over 30 ‘‘emerging’’ scenario instances

           Precision (%)  Recall (%)  F-measure (%)  Alarm delay (days)  False alarm (times)  Fail to detect (times)  Computing time (seconds)
SaTScan    66.2           83.6        69.5           5.4                 5                    2                       607
PSVC       88.5           55.2        64.8           6.0                 0                    2                       95
SaTScan achieves a higher level of recall and PSVC a higher level of precision. These two methods do not differ significantly with respect to the overall spatial performance given by the F-measure. In general, PSVC detected anomalies as quickly as SaTScan did but with fewer false alarms. Besides the ‘‘emerging’’ scenario, we also created the ‘‘expanding’’ scenario and the ‘‘moving’’ scenario to mimic two events that could possibly occur in public health applications. In the ‘‘expanding’’ scenario, the disease cases are first concentrated in a particular infected area and then spread to the neighboring area. The ‘‘moving’’ scenario captures the movement of the infected area along certain directions, possibly due to environmental factors such as rivers and wind. Detailed experimental results and discussion can be found in Chang et al. (2008). In general, we observe that both SaTScan and PSVC can effectively identify abnormal areas demonstrating changes in the spatial distribution pattern over time and correctly ignore pure spatial clusters. PSVC significantly outperforms SaTScan in terms of the spatial evaluation measures when detecting abnormal areas with complex, irregular shapes, as in the case of the expanding and moving scenarios. PSVC and SaTScan achieve similar detection speeds, but PSVC does so with fewer false alarms. This is particularly true when abnormal areas do not conform to simple regular shapes. We observe that, given the same number of baseline points, the scanning cylinder of SaTScan tends to include more positive cases, resulting in a higher likelihood value. In other words, SaTScan inherently tends to reach a higher level of recall at the expense of lowered precision. In addition, the tendency to include more points inside the scanning window makes SaTScan more likely to trigger false alarms.
2.4 A case study: Public health surveillance In many security-related applications, an issue of central importance is to identify the regions of potential interest or unusual activities as soon as possible as candidates for further investigation. In such applications, knowing the precise locations of such regions and the start and end of unusual events is critical. Additional refined measures such as intensity of activities within these regions play a role only after the area is identified. Our research exclusively focuses on the identification of spatial-temporal
Fig. 6. WNV migration patterns identified by PSVC and SaTScan.
areas. In this section, we analyze a real-world dataset to demonstrate how we can apply PSVC and SaTScan to detect the areas at risk. The dataset used in this study contains the dead bird sightings in the state of New York in spring and summer 2002. As dead bird clusters have been proven to be highly indicative of West Nile Virus (WNV) outbreaks, we applied PSVC and SaTScan to monitor the dead bird sighting data to identify possible abnormal clustering effects. The dataset contains 364 sightings in total; before May 2002, there were fewer than 10 records per week. We chose a 2-week data-monitoring interval for PSVC. From the results shown in Fig. 6, we note that most sightings stayed inside Long Island before April 29. However, in the next two weeks, more and more sightings started to show up north of Long Island along the Hudson River. Both PSVC and SaTScan detected an abnormal cluster forming on May 12, which is much earlier than May 26, the first day a dead bird was diagnosed with WNV. This automated advance warning capability, albeit anecdotal, is of great interest and importance to the public health community from the viewpoints of infectious disease monitoring, modeling, and related resource allocation and counter-measure planning. Figure 6 also shows that the irregularly shaped area detected by PSVC is more informative than the large circle detected by SaTScan.
3 Spatial-temporal cross-correlation analysis
We now introduce the correlation analysis of multiple spatial-temporal data sets.
3.1 Literature review In general, correlation refers to the departure of two random events from independence. In the specific area of spatial-temporal data analysis, correlation measures the degree to which multiple data observations co-occur within a close geographical neighborhood and a short time frame. In this section, we use ‘‘data event’’ to represent one type of data stream, and ‘‘data observation’’ to represent one occurrence of a data event. For example, in infectious disease informatics, a data event can be ‘‘dead birds,’’ and a data observation is a specific occurrence of dead birds. Depending on the co-occurrence effect among data observations, there are two possible types of correlations (Dixon, 2002). Positive correlation indicates that an observation increases the probability that other observations occur, whereas negative correlation indicates that an observation tends to decrease the probability that other observations appear. In the domain of infectious disease informatics, if a dead bird occurrence tends to increase the probability that more dead birds occur, we state that the event of dead birds has a positive correlation to itself. With respect to the number of data events being analyzed, two types of correlation analyses can be defined (Ripley, 1981). If the observations of one data event are not independent, the event is called auto-correlated. If the observations of one event are not independent of those of another event, those two events are cross-correlated. This section is mainly concerned with cross-correlations. Depending on the type of data being analyzed, correlation analysis can be categorized into time-series correlation, spatial correlation, and spatial-temporal correlation. We briefly summarize the literature on all three types of correlations. Our research is particularly interested in the analysis of spatial-temporal cross-correlations. Specifically, we are interested in exploring the possible correlations among events by studying the relations of the geographical locations and the times of the observations. 3.1.1 Time series cross-correlations Time series cross-correlation analysis focuses on the correlation of multiple data events with temporal components. The methods to analyze time series cross-correlations are well established and are generally available in commercial statistical software packages. Two representative studies from public health informatics are summarized below. Lean et al. (1992) used time series cross-correlation analysis to determine relationships among glucose, cholesterol, and milk yield, for 42 days postpartum, for 14 multiparous cows. They claimed that time series cross-correlation analysis was a useful tool in examining relationships among variables when repeated samples were obtained from the same individuals.
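As a simple illustration of the kind of computation such studies rely on, the Python sketch below estimates lagged cross-correlations between two equally spaced series; the synthetic series and the lag range are purely illustrative and are not drawn from the cited studies.

import numpy as np

def cross_correlation(x, y, max_lag):
    """Sample cross-correlation of two equally spaced series at lags 0..max_lag,
    where lag k relates x at time t to y at time t + k."""
    x = (x - x.mean()) / x.std()
    y = (y - y.mean()) / y.std()
    n = len(x)
    return {lag: float(np.mean(x[: n - lag] * y[lag:])) for lag in range(max_lag + 1)}

rng = np.random.default_rng(1)
driver = rng.normal(size=200)
response = np.roll(driver, 3) + 0.5 * rng.normal(size=200)   # responds about 3 steps later
corr = cross_correlation(driver, response, max_lag=6)
print(max(corr, key=corr.get), corr)   # the strongest correlation appears near lag 3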
Cappaert et al. (1991) used time series cross-correlation to estimate the synchrony of Colorado potato beetle eggs with predators. Predators, including asopine pentatomids, carabids, thomisid spiders, and coccinellids, were abundant and well synchronized with developing Colorado potato beetle in late 1987, corresponding to the period of highest mortality. 3.1.2 Moran’s I A widely adopted statistical measure in spatial correlation analysis is Moran’s I index (Moran, 1948). Moran’s I is a weighted correlation coefficient used to detect departures from spatial randomness. Departures from randomness indicate spatial patterns, such as clusters. The statistic may also identify other kinds of patterns, such as a geographic trend. The value of Moran’s I ranges from −1 to 1. Values of I larger than 0 indicate positive spatial correlation; values smaller than 0 indicate negative spatial correlation. Moran’s I is commonly applied in areas such as epidemiology, and many extensions have been developed (Thioulouse et al., 1995; Waldhor, 1996). Moran’s I requires a comparable numeric value for each observation (e.g., a count of a certain event), and is not suitable for data analysis in which data observations only have information about the occurring location and time.
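To illustrate how Moran’s I is typically computed, the following Python sketch evaluates the statistic for point locations carrying a numeric value, using inverse-distance spatial weights with a zero diagonal. The data and the weighting scheme are illustrative assumptions, not choices from any study cited here.

import numpy as np

def morans_i(coords, values):
    """Moran's I with inverse-distance spatial weights and a zero diagonal."""
    coords = np.asarray(coords, dtype=float)
    x = np.asarray(values, dtype=float)
    n = len(x)
    d = np.hypot(coords[:, None, 0] - coords[None, :, 0],
                 coords[:, None, 1] - coords[None, :, 1])
    w = np.zeros_like(d)
    mask = d > 0
    w[mask] = 1.0 / d[mask]                  # w_ij = 1/d_ij, w_ii = 0
    z = x - x.mean()
    return (n / w.sum()) * (w * np.outer(z, z)).sum() / (z ** 2).sum()

# Toy example: values that trend with the x-coordinate show positive spatial autocorrelation.
rng = np.random.default_rng(2)
pts = rng.uniform(0, 10, size=(50, 2))
vals = pts[:, 0] + rng.normal(scale=0.5, size=50)
print(morans_i(pts, vals))                   # expected to be clearly above 0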
1 XX wðim ; j n ÞIðd im ; jn orÞ l^ i l^ j A im jn
(3)
Ch. 8. Spatial-Temporal Data Analysis
227
A is the area of the study region. The unbiased estimated density l^ j is given by Number of Observations of Event j/A. In Eq. (3), im denotes the m-th observation of event i, and d im ; jn the distance between im and jn. I(x) is an indicator variable whose value is 1 when x is true and 0 otherwise. The term wðim ; j n Þ takes into account of the edge effect. The edge effect considers the situation where im is so close to the edge of the study region that the impact of im on jn should be discounted. The value of wðim ; j n Þ is the fraction of the circumference of a circle centered at im with radius d im ; j n that lies inside the study region. From Eq. (3) the value of K(r) function ranges from zero to the area A if the edge effect is ignored (Ripley, 1976). While the traditional Ripley’s K(r) function has been proven effective in discovering spatial cross-correlative relations, it does not include the temporal component. Simply considering spatial condition and ignoring temporal effect will possibly lead to false conclusions in many applications. Take a common scenario in infectious disease informatics as an example. A recent dead bird observation is unlikely to be related to another observation that happened 10 years ago in a close spatial neighborhood. However, based on Eq. (3), Iðd im ; jn orÞ has the value 1 as far as those two observations occurred in a close neighborhood. More specifically, Ripley’s K(r) function has two effects that may lead to false conclusions of cross-correlations. We now discuss them by examples. The first effect is what we refer to as the ‘‘aggregate effect’’ which may overlook significant correlations under certain circumstances. A typical scenario of positive correlations in infectious disease informatics is that a disease case has positive effects within its neighborhood in a short period of time. As time passes, the effects decrease, and disappear eventually. For example, given an observation im, assume that there were many occurrences in the neighborhood in the next 30 days after im happened, but few occurrences in other time periods in the same neighborhood. The absence of occurrences at other time periods could dilute the intensity of occurrences within the neighborhood. The overall K(r) value may not be high enough to report a significant correlation even though a real possible positive correlation is likely. We define the second effect as the ‘‘backward shadow effect’’ that may falsely indicate irrational correlations. Given im, assume there were many occurrences happened in the neighborhood prior to the time of im, and few after im occurred. Since K(r) does not differentiate if the occurrences happened before or after the time of im, the abundance of previously occurred observations will play a dominant role. As a result, the overall K(r) value may indicate a significant positive correlation. This conclusion, however, may be false because the correlation is built on the process implying that previously occurred cases can be affected by cases occurred later.
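The following Python sketch is a direct, unoptimized implementation of the estimator in Eq. (3); for brevity it omits the edge-correction term w(i_m, j_n) (i.e., it sets w = 1), which is a simplifying assumption made only for this illustration.

import numpy as np

def ripley_k_cross(points_i, points_j, r, area):
    """Estimate K_ij(r) per Eq. (3) without edge correction (w = 1)."""
    points_i = np.asarray(points_i, float)
    points_j = np.asarray(points_j, float)
    lam_i = len(points_i) / area
    lam_j = len(points_j) / area
    # Pairwise distances between every event-i observation and every event-j observation.
    d = np.hypot(points_i[:, None, 0] - points_j[None, :, 0],
                 points_i[:, None, 1] - points_j[None, :, 1])
    count = np.sum(d < r)
    if points_i is points_j:
        count -= len(points_i)        # exclude self-pairs in the auto-correlation case
    return count / (lam_i * lam_j * area)

# Toy usage: event-j observations scattered near event-i observations give a larger K than
# an unrelated uniform pattern (which should be close to pi * r**2 under randomness).
rng = np.random.default_rng(3)
event_i = rng.uniform(0, 10, size=(100, 2))
event_j = event_i + rng.normal(scale=0.3, size=(100, 2))
print(ripley_k_cross(event_i, event_j, r=1.0, area=100.0))
print(ripley_k_cross(event_i, rng.uniform(0, 10, size=(100, 2)), r=1.0, area=100.0))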
3.2 Extended K(r) function with temporal considerations To analyze spatial-temporal correlation, we have proposed a new measure K(r, t) that extends the traditional Ripley’s K(r) function by considering temporal effects. We intend to reduce the aggregate effect by eliminating the data observations that satisfy the spatial restriction but occur at time periods that are very far apart. Eq. (4) gives a mathematical definition of our new measure.

K_{ij}(r, t) = \lambda_j^{-1}\, E(\text{number of event } j \text{ observations within distance } r \text{ and time } t \text{ since a randomly chosen event } i \text{ observation})    (4)

Note that the density of event j, λ_j, is defined as the number of observations of event j per unit area per time period. Depending on whether or not the backward shadow effect is taken into account, two types of time windows can be defined. Given a randomly chosen event i occurrence, i_m, the first type of time window, defined as the one-tail time effect, considers only the data observations that occurred after the time of i_m. The second type, defined as the two-tail time effect, considers observations both before and after the time of i_m. Eqs. (5) and (6) define the unbiased estimations of K(r, t) using one-tailed and two-tailed time effects, respectively.

\hat{K}_{ij}(r, t) = \frac{1}{\hat{\lambda}_i \hat{\lambda}_j A T} \sum_{i_m} \sum_{j_n} w(i_m, j_n)\, I(d_{i_m, j_n} < r \ \text{and} \ 0 \le t_{j_n} - t_{i_m} < t)    (5)

\hat{K}_{ij}(r, t) = \frac{1}{\hat{\lambda}_i \hat{\lambda}_j A T} \sum_{i_m} \sum_{j_n} w(i_m, j_n)\, I(d_{i_m, j_n} < r \ \text{and} \ {-t} < t_{j_n} - t_{i_m} < t)    (6)
In these definitions, T is the entire time span, and the occurrence time of j_n is denoted by t_{j_n}. The terms 0 ≤ t_{j_n} − t_{i_m} < t and −t < t_{j_n} − t_{i_m} < t eliminate unlikely related observations by restricting the observations to a certain time window around the time of i_m. Note that K̂_ij(r, t) is a monotonically increasing function with respect to the radius r and time t. The minimum of the function value is 0 and the maximum is the product of the study area A and the time span T if the edge effect is ignored. Also note that when the time window t is set to be greater than the time span T, Eq. (6) provides no more restriction than Eq. (3); their values differ only by a constant factor T. On the other hand, even when the time window t is set to be infinite, Eq. (5) is still different from Eq. (3) because one-tail K(r, t) eliminates the backward shadow effect.
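Extending the previous sketch, the Python function below implements the one-tailed estimator of Eq. (5), again setting the edge-correction term to 1 for brevity; the time stamps and the window t are assumed to be expressed in the same units.

import numpy as np

def k_rt_one_tailed(xy_i, t_i, xy_j, t_j, r, t, area, time_span):
    """One-tailed K(r, t) estimate per Eq. (5), without edge correction."""
    xy_i, xy_j = np.asarray(xy_i, float), np.asarray(xy_j, float)
    t_i, t_j = np.asarray(t_i, float), np.asarray(t_j, float)
    lam_i = len(xy_i) / (area * time_span)      # density per unit area per time period
    lam_j = len(xy_j) / (area * time_span)
    d = np.hypot(xy_i[:, None, 0] - xy_j[None, :, 0],
                 xy_i[:, None, 1] - xy_j[None, :, 1])
    dt = t_j[None, :] - t_i[:, None]            # t_{j_n} - t_{i_m}
    count = np.sum((d < r) & (0 <= dt) & (dt < t))
    if xy_i is xy_j:
        count -= len(xy_i)                      # exclude self-pairs in the auto-correlation case
    return count / (lam_i * lam_j * area * time_span)

# Toy usage: event j tends to occur near event i and shortly afterwards.
rng = np.random.default_rng(4)
xy_i = rng.uniform(0, 10, size=(80, 2))
t_i = rng.uniform(0, 100, size=80)
xy_j = xy_i + rng.normal(scale=0.3, size=(80, 2))
t_j = t_i + rng.uniform(0, 5, size=80)
print(k_rt_one_tailed(xy_i, t_i, xy_j, t_j, r=1.0, t=10.0, area=100.0, time_span=100.0))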
3.3 A case study with infectious disease data We have evaluated the new measure K(r, t) with infectious disease data sets. By analyzing real data sets, we demonstrate that K(r, t) reveals more detailed and accurate correlative relationships among spatial-temporal data events. The new measure also helps discover the causal events whose occurrences induce those of other events. 3.3.1 Dataset The dataset was collected from the dead bird and mosquito surveillance system in Los Angeles County, California, with a time span of 162 days from May to September, 2004. There were four types of events in total, namely, Dead birds with Positive results on WNV detection, Mosquitoes with Positive WNV detection, Mosquitoes with Negative WNV detection, and Mosquito Treatments performed by public health clerks. We denoted them as DBP, MosP, MosN, and MosTr, respectively. Table 3 summarizes the data set. 3.3.2 Data analysis procedure For simplicity, we defined the spatial area as the smallest rectangle that covers the locations of all cases. We then chose multiple combinations of radius r and time window t to perform our analysis. Table 4 shows the parameters we have chosen.

Table 3
Summary statistics

Event type    Number of cases
DBP           545
MosP          207
MosN          591
MosTr         1918

Table 4
Parameter setting

Parameter      Value
Minimum t      3
t increment    3
Maximum t      162
Minimum r      0.75%
r increment    0.75%
Maximum r      30%
Specifically, we chose the time window to range from 3 days to 162 days, with 3 days as the increment. Although we have tested both one- and two-tailed time effects on the correlations, we only report the results using one-tail time effects. For each given time window t, we chose multiple radius values. The initial radius was 0.75% of the average of the length and the width of the rectangular study area. With an increment of 0.75%, the largest radius we have tested is 30% of the average of the length and width of the area. We wanted the maximum radius to be sufficiently large to explore all possible situations. Due to the size of Los Angeles County and the limited mobility of the studied creatures, 30% was considered sufficient to explore possible correlations among the studied events. For each pair of radius r and time window t, we first evaluated our extended K(r, t) definition. To evaluate the cross-correlation between two data events without the impact of the possible auto-correlation in either one of the events, we applied a common approach (Gaines et al., 2000) that computes the difference between the cross-correlation and the auto-correlation. In our study, we calculated K_ij(r, t) − K_jj(r, t) for all 12 possible event pairs. We then employed the random labeling method to evaluate the correlation significance. The random labeling method mixes the observations of the two events and then randomly reassigns an event type to each observation while preserving the total number of observations of each event (Dixon, 2002). We repeated the random labeling 200 times for each event pair. If the value of K_ij(r, t) − K_jj(r, t) calculated from the real data is higher than 195 of the values (the 97.5% quantile) calculated from the random labeling simulations, we concluded that events i and j were significantly positively correlated under such r and t, and that event i was the causal event whose occurrences induced those of event j. Finally, we analyzed the traditional Ripley’s K(r) function, using an almost identical experimental procedure, for comparison with our extension. 3.3.3 Results and discussion We now summarize our key findings. (1) As t increases, more event pairs are found to be significantly correlated. When t was set to six days or less, no significant cross-correlations were discovered. When t was set to nine days, the first correlated event pair, MosTr-MosN, was identified, as shown in Fig. 7(a). When r was larger than 26.25%, the value of K_ij(r, t) − K_jj(r, t) from the real data was higher than that of the 97.5% quantile from the random labeling simulations. This indicates that MosN has a significantly higher than usual probability of appearing when MosTr observations occur. In other words, MosN was significantly correlated to MosTr.
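The significance procedure just described can be sketched in Python as follows; the 200 replications and the 97.5% quantile follow the description above, while the statistic reuses the simplified K(r, t) estimator sketched after Eq. (6) and therefore inherits its assumptions.

import numpy as np

# Assumes k_rt_one_tailed() from the earlier sketch is in scope.

def k_diff(xy_i, t_i, xy_j, t_j, r, t, area, time_span):
    """Observed statistic K_ij(r, t) - K_jj(r, t)."""
    return (k_rt_one_tailed(xy_i, t_i, xy_j, t_j, r, t, area, time_span)
            - k_rt_one_tailed(xy_j, t_j, xy_j, t_j, r, t, area, time_span))

def random_labeling_test(xy_i, t_i, xy_j, t_j, r, t, area, time_span,
                         n_rep=200, seed=0):
    """Return the observed statistic and its 97.5% quantile under random labeling."""
    rng = np.random.default_rng(seed)
    observed = k_diff(xy_i, t_i, xy_j, t_j, r, t, area, time_span)
    xy = np.vstack([xy_i, xy_j])
    times = np.concatenate([t_i, t_j])
    n_i = len(xy_i)
    sims = []
    for _ in range(n_rep):
        perm = rng.permutation(len(xy))      # reshuffle event labels, keeping the counts fixed
        a, b = perm[:n_i], perm[n_i:]
        sims.append(k_diff(xy[a], times[a], xy[b], times[b], r, t, area, time_span))
    return observed, float(np.quantile(sims, 0.975))

# A pair (i, j) is declared significantly positively correlated at (r, t)
# when the observed statistic exceeds the simulated 97.5% quantile.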
Fig. 7. Analysis using the one-tailed K(r, t): (a) i: MosTr, j: MosN, t = 9; (b) i: MosTr, j: MosN, t = 30. All x-axes are radius; all y-axes are K_ij(r, t) − K_jj(r, t). The cone-shaped band is formed by the 2.5% and 97.5% quantiles estimated from 200 random labeling simulations.
This finding is intuitively correct. Recall that MosN represents the event of Mosquitoes with Negative WNV detection, and MosTr is Mosquito Treatment performed by public health personnel. Our finding shows that the mosquito treatments lead to higher than usual mosquito occurrences with no virus infections. When we set t to 30 days, one more correlated event pair was revealed: DbP-MosN. When t was set to 60 days or greater, two more pairs were found: MosTr-MosP and DbP-MosP. Since we only considered the one-tailed time effect, the first event within each pair was the causal event. That is, MosTr and DbP potentially induced the occurrences of MosN and MosP, respectively. (2) A larger t also results in a larger difference in the value of K_ij(r, t) − K_jj(r, t) between the real data and the random labeling simulations. This can be observed by comparing Fig. 7(a) and 7(b). As t goes from nine days to 30 days, the curve of the real data rises more clearly above the random labeling band.
If we consider the smallest r and t at which an event pair is found to be correlated, we realize that those two values are inversely related. In Fig. 7(b) the smallest radius at which the correlation is observed is 19.50% with t being 30 days, compared with 26.25% with t being nine days as in Fig. 7(a). This finding persists when we keep increasing the value of t. The above observations demonstrate that if the radius r and the time window t are set to small values, the cross-correlation analysis process is selective and could miss correlative relations. By selecting larger values for r and t, more correlations may be identified. (3) More importantly, when comparing the results achieved from K(r, t) and K(r), we notice that those two methods provide different correlated event pairs. K(r), ignoring the temporal consideration, identified four pairs of events with high correlations: MosTr-MosN, DbP-MosN, MosTr-MosP, and MosTr-DbP. When compared with the four pairs found by K(r, t) with t greater than 60 days, the difference is that K(r) reports MosTr-DbP and does not report DbP-MosP. As shown in Fig. 8(a)–(d), this difference between the results of K(r, t) and K(r) is very substantial, especially for MosTr-DbP. While K(r) showed that those two events are positively correlated, K(r, t) concluded that they are actually negatively correlated once the temporal dimension is considered. This is due to the backward shadow effect. K(r) did not identify DbP-MosP because of the aggregate effect.
Fig. 8. Results for the event pairs i: DbP, j: MosP and i: MosTr, j: DbP, each shown with t = 30 and with no time restriction. All x-axes are radius; all y-axes are K_ij(r, t) − K_jj(r, t). The cone-shaped band is formed by the 2.5% and 97.5% quantiles estimated from 200 random labeling simulations.
In general, the traditional K(r) function tends to process data in an aggregated manner and sometimes fails to reveal meaningful correlations. Our extended measure reduces the aggregate effect and can eliminate the backward shadow effect as well. By adjusting an additional time window parameter, domain experts have at their disposal a more balanced and flexible analysis framework.

4 Conclusions
In this chapter, we have discussed two types of spatial-temporal data analysis techniques: clustering and correlation analysis. These techniques have wide applications in public health and the natural sciences, as well as in business. In the area of spatial-temporal clustering analysis, this chapter introduces both retrospective and prospective hotspot analysis methods. Compared with retrospective methods, prospective analysis methods provide a more powerful data analysis framework. Prospective methods are aimed at identifying spatial-temporal interaction patterns in an online context and do not require preprocessing the data points into baseline cases and cases of interest. We have provided a survey of major types of retrospective and prospective analysis techniques and presented a case study in public health surveillance to demonstrate the potential value of these techniques in real-world security informatics applications.
Second, the interface between spatial-temporal data analysis and business decision-making holds opportunities which can lead to major findings with practical relevance. Yet, research in this area is just starting. How to find meaningful ways of leveraging the results of spatial-temporal data analysis in decision making contexts could well become an emerging ‘‘hotspot’’ in the research landscape.
Acknowledgments We would like to thank the members of the NSF BioPortal team for insightful discussions. Various BioPortal public health partners have provided datasets which were used in the research reported in this chapter. We also would like to acknowledge funding support provided by the U.S. National Science Foundation through Grant IIS-0428241. The first author wishes to acknowledge support from the National Natural Science Foundation of China (60621001, 60573078), the Chinese Academy of Sciences (2F07C01, 2F05N01), and the Ministry of Science and Technology (2006CB705500, 2006AA010106).
References Ben-Hur, A., D. Horn, H.T. Siegelmann, V. Vapnik (2001). Support vector clustering. Journal of Machine Learning Research 2, 125–137. Cappaert, D.L., F.A. Drummond, P.A. Logan (1991). Population-dynamics of the Colorado potato beetle (coleoptera, chrysomelidae) on a native host in Mexico. Environmental Entomology 20(December), 1549–1555. Chang, W., D. Zeng, H. Chen (2005). A novel spatio-temporal data analysis approach based on prospective support vector clustering, in: Workshop on Information Technologies and Systems, Las Vegas, Nevada. Chang, W., D. Zeng, H. Chen (2008). A stack-based prospective spatio-temporal data analysis approach. Decision Support Systems 45, 697–713. Diggle, P.J., A.G. Chetwynd (1991). Second-order analysis of spatial clustering for inhomogeneous populations. Biometrics 47(September), 1155–1163. Diggle, P.J., J. Besag, J.T. Gleaves (1976). Statistical-analysis of spatial point patterns by means of distance methods. Biometrics 32, 659–667. Dixon, P.M. (2002). Ripley’s K function, in: Encyclopedia of Environmetrics, Vol. 3. Wiley, Chichester, pp. 1796–1803. Gaines, K.F., A.L. Bryan, P.M. Dixon (2000). The effects of drought on foraging habitat selection of breeding wood storks in coastal Georgia. Waterbirds 23, 64–73. Halkidi, M., Y. Batistakis, M. Vazirgiannis (2002a). Cluster validity methods: Part 1. SIGMOD Record 31, 40–45. Halkidi, M., Y. Batistakis, M. Vazirgiannis (2002b). Clustering validity checking methods: Part II. SIGMOD Record 31, 19–27. Johnson, S.C. (1967). Hierarchical clustering schemes. Psychometrika 2, 241–254. Kulldorff, M. (1997). A spatial scan statistic. Communications in statistics—Theory and methods 26, 1481–1496.
Kulldorff, M. (2001). Prospective time periodic geographical disease surveillance using a scan statistic. Journal of the Royal Statistical Society A 164, 61–72. Lean, I.J., T.B. Farver, H.F. Troutt, M.L. Bruss, J.C. Galland, R.L. Baldwin, C.A. Holmberg, L.D. Weaver (1992). Time-series cross-correlation analysis of postparturient relationships among serum metabolites and yield variables in Holstein cows. Journal of Dairy Science 75(July), 1891–1900. Levine, N. (2002). CrimeStat III: A spatial statistics program for the analysis of crime incident locations. The National Institute of Justice, Washington, DC. Moran, P.A.P. (1948). The interpretation of statistical maps. Journal of the Royal Statistical Society Series B—Statistical Methodology 10, 243–251. Peterson, C.J., E.R. Squiers (1995). An unexpected change in spatial pattern across 10 years in an aspen white-pine forest. Journal of Ecology 83(October), 847–855. Procknor, M., S. Dachir, R.E. Owens, D.E. Little, P.G. Harms (1986). Temporal relationship of the pulsatile fluctuation of luteinizing-hormone and progesterone in cattle—A time-series cross-correlation analysis. Journal of Animal Science 62(January), 191–198. Ripley, B.D. (1976). Second-order analysis of stationary point processes. Journal of Applied Probability 13, 255–266. Ripley, B.D. (1981). Spatial Statistics. Wiley, New York. Rogerson, P.A. (1997). Surveillance systems for monitoring the development of spatial patterns. Statistics in Medicine 16, 2081–2093. Rogerson, P.A. (2001). Monitoring point patterns for the development of space–time clusters. Journal of the Royal Statistical Society A 164, 87–96. Sonesson, C., D. Bock (2003). A review and discussion of prospective statistical surveillance in public health. Journal of the Royal Statistical Society: Series A 166, 5–12. Stoyan, D., A. Penttinen (2000). Recent applications of point process methods in forestry statistics. Statistical Science 15(February), 61–78. Thioulouse, J., D. Chessel, S. Champely (1995). Multivariate analysis of spatial patterns: A unified approach to local and global structures. Environmental and Ecological Statistics 2, 1–14. Tichy, J. (1973). Application of correlation and Fourier-transform techniques in architectural and building acoustics. Journal of the Acoustical Society of America 53, p. 319. Veit, I. (1976). Application of correlation technique in acoustics and vibration engineering. Acustica 35, 219–231. Waldhor, T. (1996). The spatial autocorrelation coefficient Moran’s I under heteroscedasticity. Statistics in Medicine 15(April 15), 887–892. Yao, X. (2003). Research issues in spatio-temporal data mining, in: UCGIS Workshop on Geospatial Visualization and Knowledge Discovery, Lansdowne, VA. Zeng, D., W. Chang, H. Chen (2004). A comparative study of spatio-temporal hotspot analysis techniques in security informatics, in: Proceedings of the 7th IEEE International Conference on Intelligent Transportation Systems, Washington. Zeng, D., W. Chang, H. Chen (2005). Clustering-based spatio-temporal hotspot analysis techniques in security informatics. IEEE Transactions on Intelligent Transportation Systems.
This page intentionally left blank
Chapter 9
Studying Heterogeneity of Price Evolution in eBay Auctions via Functional Clustering
Wolfgang Jank and Galit Shmueli Department of Decisions, Operations and Information Technologies and the Center for Electronic Markets and Enterprises, The Robert H. Smith School of Business, University of Maryland, College Park, MD 20742, USA
Abstract

Electronic commerce, and in particular online auctions, have experienced a tremendous surge in popularity in recent years. While auction theory has been studied for a long time from a game-theory perspective, the electronic implementation of the auction mechanism poses new and challenging research questions. In this work, we focus on the price formation process and its dynamics. We present a new source of rich auction data and introduce an innovative way of modeling and analyzing price dynamics. Specifically, the goal of this chapter is to characterize heterogeneity in the price formation process and to understand its sources. We represent the price process in auctions as functional objects by accommodating the special structure of bidding data. We then use curve clustering to segment auctions and characterize each cluster, and directly model cluster-specific price dynamics via differential equations. Our findings suggest that there are several types of dynamics, even for auctions of comparable items. Moreover, by coupling the dynamics with information on the auction format, the seller, and the winner, we find new relationships between price dynamics and the auction environment, and we tie these findings to the existing literature on online auctions. Our results also show a significant gain in information compared to a traditional clustering approach.
1 Introduction
The public nature of many online marketplaces has allowed empirical researchers new opportunities to gather and analyze data. One example of an online marketplace is the online auction. Online auctions have become a popular way for both businesses and consumers to exchange goods. One of
the biggest online marketplaces, and currently the biggest Consumer-to-Consumer (C2C) online auction place, is eBay (www.ebay.com). In 2005, eBay had 180.6 million registered users, of which over 71.8 million bid, bought, or listed an item during the year. The number of listings in 2005 was 1.9 billion, amounting to $44.3 billion in gross merchandise volume.1 At any point in time there are millions of items listed for sale, across thousands of product categories. Since eBay archives detailed records of its completed auctions, it is a great source for immense amounts of high-quality data.

There has been extensive research on classical auction theory (Klemperer, 1999; Milgrom and Weber, 1982). Classical auction theory typically focuses on the analysis of optimal auctions and the effects of relaxing some of their assumptions. While there has been some empirical research in the area (Hendricks and Paarsch, 1995), more and more bidding data are now becoming accessible thanks to the recent surge of online auctions and the capability of collecting data conveniently over the Internet. While online auctions offer an opportunity to improve our understanding of classical auction theory, they also pose new research questions. Recent research suggests the need for additional investigation of the impact of the electronic implementation on the auction mechanism (Klein and O'Keefe, 1999). In fact, empirical studies provide evidence that classical auction theory may not generally carry over to the online context (Lucking-Reiley, 1999). Reasons such as the anonymity of the Internet, its worldwide reach free of geographical boundaries and time, and the longer duration of online auctions can contribute to deviations from what classical auction theory predicts. One example is the observed phenomenon of "bid sniping", where a large volume of bids arrives in the final moments of the auction (Roth and Ockenfels, 2002). According to auction theory, this bidding strategy is not optimal for the bidder in a second-price auction like eBay, and it is also not typically observed in offline auctions (Pinker et al., 2003).

Empirical research on online auctions has been growing fast in the last few years. Studies use data from online auction houses such as eBay, Yahoo!, and uBid, but eBay's market dominance and its data accessibility have made it by far the center of research efforts (Ba and Pavlou, 2002; Bajari and Hortacsu, 2003; Bapna et al., 2004, 2008; Dellarocas, 2003; Hyde et al., 2006; Klein and O'Keefe, 1999; Lucking-Reiley, 1999, 2000; Roth and Ockenfels, 2002; Shmueli and Jank, 2005; Wang et al., 2008). Most of these studies rely on large databases that are collected using "web spiders", which are software programs designed to "crawl" over webpages and collect designated data into a database.

In this work we consider a feature of online auctions that has been mostly overlooked in the auction literature: the price process and its dynamics during an auction.
1 See eBay's press release from 1/18/2006, available at investor.ebay.com/releases.cfm?Year=2006.
Rather than characterizing an auction solely by its "static" outcome, such as the final price or the number of bids, we characterize it by its dynamic price formation process throughout the auction. The goal of this chapter is to better understand the heterogeneity of price dynamics in different auctions, by segmenting auctions into clusters according to their price processes and then characterizing the dynamics of each cluster. To do so, we first estimate price curves from bid histories and quantify their dynamics by calculating curve derivatives. We then search for "types" or "profiles" of auctions that have similar dynamics, and supplement these profiles with static auction information to obtain a comprehensive description of auction types. We further study each cluster by directly modeling price dynamics through differential equations. Finally, static information about the auction is integrated into the dynamic system in order to shed light on their combined behavior.

The ability to describe price dynamics in an online auction, or to classify an auction into certain dynamic types, can be useful for the seller, the bidder, and the auction house. Some examples are real-time forecasting of the final price while the auction is still ongoing (Wang et al., 2008); selecting which auction to bid on among a set of competing auctions; selecting auction settings (by the seller) that are associated with fast price dynamics; pricing auction options (by the auction house) that guarantee faster price increases; and detecting suspicious price dynamics that may suggest fraudulent bidding activity.

We address these tasks using functional data analysis (FDA). In FDA, the object of interest is a set of curves, shapes, objects, or, more generally, a set of functional observations (Ramsay and Silverman, 2002, 2005). In the auction context, we represent the auction price process as a functional object. That is, we interpret the process of price changes between the start and the end of the auction as a continuous curve. We refer to this process as the auction's price evolution. In that sense, every auction is associated with a functional object describing the price between its start and end. Several million auctions transact on eBay every day. We segment auctions into groups with more homogeneous price evolutions to learn about price patterns. We do so by using curve clustering (Abraham et al., 2003; James and Sugar, 2003; Tarpey and Kinateder, 2003). Our results show that curve clustering can lead to new insights about the auction process compared to traditional (non-functional) approaches.

The chapter is organized as follows. In Section 2 we describe online auction data and the mechanism that generates and collects them. In Section 3 we discuss the functional representation of price via a smooth curve and quantify price dynamics via curve derivatives. Section 4 uses curve clustering to segment online auctions according to their price patterns. We investigate the details of each cluster and estimate differential equations which capture the differences in price formation in a compact (and novel) way. We use a data set of Palm Pilot eBay auctions throughout the chapter. The chapter concludes with final remarks in Section 5.
2 Auction structure and data on eBay.com
Since eBay is by far the largest source of data and empirical work in this field, we focus here on its format and data structure. However, our functional approach can be readily adapted to data originating from other types of auctions.

2.1 How eBay auctions work

The dominant auction format on eBay is a variant of the second-price sealed-bid auction (Krishna, 2002) with "proxy bidding". This means that individuals submit a "proxy bid", which is the maximum value they are willing to pay for the item. The auction mechanism automates the bidding process to ensure that the person with the highest proxy bid is in the lead of the auction. The winner is the highest bidder and pays the second-highest bid. For example, suppose that bidder A is the first bidder to submit a proxy bid on an item with a minimum bid of $10 and a minimum bid increment of $0.50. Suppose that bidder A places a proxy bid of $25. Then eBay automatically displays A as the highest bidder, with a bid of $10. Next, suppose that bidder B enters the auction with a proxy bid of $13. eBay still displays A as the highest bidder; however, it raises the displayed high bid to $13.50, one bid increment above the second-highest bid. If another bidder submits a proxy bid above $25.50, bidder A is no longer in the lead. However, if bidder A wishes, he or she can submit a new proxy bid. This process continues until the auction ends. Unlike other auctions, eBay has strict ending times, ranging between 1 and 10 days from the opening of the auction, as determined by the seller. eBay posts the complete bid histories of closed auctions for a duration of at least 30 days on its website.2

2 See http://listings.ebay.com/pool1/listings/list/completed.html
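The proxy-bidding rule in the example above is simple enough to state in a few lines of code. The following Python sketch is our own illustration, not eBay's implementation: it uses the opening bid and the fixed $0.50 increment from the example, whereas in practice the increment itself grows with the current price level according to eBay's increment table.

```python
def displayed_price(proxy_bids, opening_bid=10.00, increment=0.50):
    """Return (leader, displayed price) for a list of (bidder, proxy bid) pairs.

    The highest proxy bid leads; the displayed price is one increment above the
    second-highest proxy bid, capped at the leader's own proxy bid.
    """
    if not proxy_bids:
        return None, opening_bid
    ranked = sorted(proxy_bids, key=lambda bid: bid[1], reverse=True)
    leader, top = ranked[0]
    if len(ranked) == 1:
        return leader, opening_bid            # a single bidder pays the opening bid
    second = ranked[1][1]
    return leader, min(top, second + increment)

# The example from the text: A submits $25, then B submits $13.
print(displayed_price([("A", 25.00)]))                 # ('A', 10.0)
print(displayed_price([("A", 25.00), ("B", 13.00)]))   # ('A', 13.5)
```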
2.2 eBay's data

Figure 1 shows an example of a bid history found on eBay's website. The top of Fig. 1 displays a summary of the auction: The item for sale, the current bid, the starting bid, the number of bids received, the start and end times, and the seller's username together with his/her rating (in parentheses). The bottom of the page includes detailed information on the history of bids. Starting with the highest bid, the bottom displays the bidder's user name, the rating, and the time and date when the bid was placed.

The data for this study are the complete bid histories from 183 closed auctions for new Palm M515 Personal Digital Assistant (PDA) units on eBay.com.
Fig. 1. Partial bid history for an eBay auction. The top includes summary information on the auction format, the seller and the item sold; the bottom includes the detailed history of the bidding. Notice that the bids are ordered by descending bid amounts, not chronologically. Since the current highest bid is not revealed, later bids can be lower than earlier ones.
These data are available at www.smith.umd.edu/ceme/statistics/. We chose the Palm M515 since it was a very popular item on eBay at the time, with a multitude of different auctions every day. All of the auctions were 7 days long and took place from mid-March through June 2003. At the time of the data collection, the Palm M515 sold for $250.00 on Palm's website and we therefore considered only auctions that closed at or below this price. Figure 2 displays the bid times and amounts for the 183 auctions (the data are aggregated across auctions). The histogram shows bidding volume on different days of the auction. Most of the bids arrive on the last day and range between $0.10 and $250.00. The aggregated data are plotted in order to give a general picture. However, the methods that we describe in the next sections treat each auction individually.

Fig. 2. Data for 183 Palm M515 7-day auctions: The left panel shows the bid amount vs. the bid time, aggregated across all auctions. The grey dashed line represents the manufacturer's selling price. The histogram in the right panel displays the distribution of the number of bids over the 7-day auction with 12-h bins. Notice that most of the bids arrive in the final moments of the auction.
3 Estimating price evolution and price dynamics
Since our focus is on the price evolution during an online auction, we start by estimating a continuous price curve from the observed, discrete bid histories. Once estimated, we study the sample of auction curves and their dynamics.
Fig. 3. Live bids (circles) and associated step function reflecting price during the live auction (auctions 3013787547 and 3015682701).
3.1 Estimating a continuous price curve via smoothing

We start by estimating for each auction its underlying price curve. To obtain the price curve, we first compute the "live bid" function, a step function that reflects the price as it was seen on eBay during the ongoing auction.3 Figure 3 displays the live bid step function for two auctions in our data set. Since we assume a smooth underlying price curve, and because we are interested in quantifying its dynamics, we use polynomial smoothing splines (Ramsay and Silverman, 2005; Ruppert et al., 2003) to transform the step function into a smooth functional object which we denote by f(t). A variety of different smoothers exist. One very flexible and computationally efficient choice is the penalized smoothing spline. Let t_1, ..., t_L be a set of knots. Then, a polynomial spline of order p is given by

f(t) = \beta_0 + \beta_1 t + \beta_2 t^2 + \cdots + \beta_p t^p + \sum_{l=1}^{L} \beta_{pl} (t - t_l)_+^p,   (1)

where u_+ = u \cdot I_{[u \ge 0]} denotes the positive part of the function u. Define the roughness penalty

\mathrm{PEN}_m(t) = \int \{D^m f(t)\}^2 \, dt,   (2)

where D^m f, m = 1, 2, 3, ..., denotes the mth derivative of the function f. The penalized smoothing spline f minimizes the penalized squared error

\mathrm{PENSS}_{\lambda,m} = \int \{y(t) - f(t)\}^2 \, dt + \lambda \cdot \mathrm{PEN}_m(t),   (3)

where y(t) denotes the observed data at time t and the smoothing parameter \lambda controls the tradeoff between data fit and smoothness of the function f. Using m = 2 in Eq. (3) leads to the commonly encountered cubic smoothing spline. Other possible smoothers include the use of B-splines or radial basis functions (Ruppert et al., 2003). We use the same family of smoothers (i.e., the same spline order, the same set of knots, and the same smoothing parameters) for all auctions so that different auctions differ only with respect to their spline coefficients.

Our choices of spline order, smoothing parameter, and knots are closely tied to our goal and the nature of the data. In general, choosing a spline of order p guarantees that the first p - 2 derivatives are smooth (Ramsay and Silverman, 2002). Since we are interested in studying at least the first two derivatives of the price curves, we use splines of order 5. Knot locations are chosen according to expected changepoints in the curves (a separate polynomial is fit between each set of consecutive knots). In our case, the selection of knots is based on the empirical bidding frequencies, and in particular accounts for the phenomenon of "sniping" or "last-moment bidding" (Roth and Ockenfels, 2002; Shmueli et al., 2007). In order to capture the increased bidding activity at the end, we place an increasing number of knots toward the auction end. Specifically, our selection of knots mirrors the distribution of bid arrivals: We place 7 equally spaced knots every 24 h along the first 6 days of the auction, that is, t_l = 0, ..., 6; l = 1, ..., 7. Then, over the first 18 h of the final day, we place knots over shorter intervals of 6 h each, that is, t_8 = 6.25, t_9 = 6.5, and t_10 = 6.75. And finally, we divide the last 6 h of the auction into 4 intervals of 1 1/2 h each, letting t_11 = 6.8125, t_12 = 6.8750, t_13 = 6.9375, and t_14 = 7.0000. The choice of the smoothing parameter \lambda is based on visual inspection of the curves with the goal of balancing data fit and smoothness.4 This leads us to a value of \lambda = 50.

3 Computing the live bid from the bid history is possible using eBay's increment table. See pages.ebay.com/help/basics/g-bid-increment.html
4 Generalized cross-validation (GCV) did not lead to visually very appealing representations of the price evolution.
Fig. 4. Price dynamics for two sample auctions. The top panels show the fitted smoothing spline f(t). The middle and bottom panels show the first and second derivatives f'(t) and f''(t), respectively.
Finally, because of the high frequency of late bids (see Fig. 2), the price can take significant jumps at the end of the auction (as in the right panel of Fig. 3). In order to capture the late bidding activity and the associated price jumps, we transform the live bid step function onto the log scale and apply smoothing to the log-scaled data.5 The top panel in Fig. 4 shows the resulting smooth functional objects f(t) (on the log scale) pertaining to the live bids from Fig. 3.

3.2 Estimating price dynamics via curve derivatives

The smooth function f(t) estimates the price at any time t during an auction. We refer to f(t) as the price evolution. While f(t) describes the exact position of the price for any t, it does not reveal how fast the price is moving. Attributes that we typically associate with a moving object are its velocity (or its speed) and its acceleration. Because we use smoothing splines to obtain f(t), velocity and acceleration can be computed for each auction via the first and second derivatives of f(t), respectively.

5 Log transformations are common in the auction literature since price and other related variables are often skewed.
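To make Eqs. (1)-(3) and the derivative step concrete, the sketch below fits a penalized regression spline to a simulated log-scale live-bid curve sampled on a time grid and evaluates f(t), f'(t), and f''(t) analytically from the fitted coefficients. It is a simplified illustration, not the authors' code: the ridge-type penalty on the knot coefficients is the common shortcut from Ruppert et al. (2003) rather than the exact roughness penalty of Eq. (2), the toy data are invented, and we take degree 4 as our reading of "order 5".

```python
import numpy as np

# Knot layout from Section 3.1: daily knots over days 0-6, 6-hour knots up to
# hour 18 of the last day, and 1.5-hour knots over the final 6 hours.
KNOTS = np.array([0, 1, 2, 3, 4, 5, 6,
                  6.25, 6.5, 6.75,
                  6.8125, 6.875, 6.9375, 7.0])

def basis(t, knots=KNOTS, p=4):
    """Truncated power basis of Eq. (1): 1, t, ..., t^p, (t - t_l)_+^p."""
    t = np.atleast_1d(np.asarray(t, dtype=float))
    poly = np.vstack([t ** j for j in range(p + 1)]).T
    trunc = np.clip(t[:, None] - knots[None, :], 0.0, None) ** p
    return np.hstack([poly, trunc])

def basis_deriv(t, order, knots=KNOTS, p=4):
    """Derivative of the basis (order=1 gives velocity, order=2 acceleration)."""
    t = np.atleast_1d(np.asarray(t, dtype=float))
    cols = []
    for j in range(p + 1):
        if j < order:
            cols.append(np.zeros_like(t))
        else:
            cols.append(np.prod(np.arange(j, j - order, -1)) * t ** (j - order))
    c = np.prod(np.arange(p, p - order, -1))
    trunc = c * np.clip(t[:, None] - knots[None, :], 0.0, None) ** (p - order)
    return np.hstack([np.vstack(cols).T, trunc])

def fit_price_curve(t_obs, log_price, lam=50.0, knots=KNOTS, p=4):
    """Penalized least squares in the spirit of Eq. (3); only the knot
    coefficients beta_{p1},...,beta_{pL} are shrunk (Ruppert et al. shortcut)."""
    X = basis(t_obs, knots, p)
    D = np.diag([0.0] * (p + 1) + [1.0] * len(knots))
    return np.linalg.solve(X.T @ X + lam * D, X.T @ log_price)

# Toy log-scale live-bid data on a fine grid (invented, for illustration only).
t_grid = np.linspace(0, 7, 200)
log_bid = np.log(10 + 140 * (t_grid / 7) ** 6)

beta = fit_price_curve(t_grid, log_bid)
f  = basis(t_grid) @ beta              # price evolution f(t)
f1 = basis_deriv(t_grid, 1) @ beta     # price velocity f'(t)
f2 = basis_deriv(t_grid, 2) @ beta     # price acceleration f''(t)
```

Once the coefficient vector is available, velocity and acceleration at any time point come from the differentiated basis, which is what the dynamics plots in this and the following sections rely on.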
Figure 4 shows the price curve f(t) with its first and second derivatives for the two sample auctions from Fig. 3. Although the price evolution in both auctions is somewhat similar (monotonically increasing), the price dynamics are quite different: In the first auction (left panel), the velocity and acceleration decrease toward the auction end. In contrast, price velocity and price acceleration increase in the second auction (right panel). This suggests that the price formation process can be quite heterogeneous across auctions, even when considering very similar auctions (i.e., the same item, sold during the same time period, in an auction with the same duration as is the case here). We take this as motivation for more carefully investigating heterogeneity among price dynamics and what leads to their differences.
3.3 Heterogeneity of price dynamics

Figure 5 shows the price curves and their dynamics for the 183 auctions in our data set. There is clear variation in the price formation process: While some auctions start out at a low price and shoot up toward the end, others maintain a relatively high price throughout the entire auction. Differences in the price formation are easier to detect in the price dynamics plots: The first derivative of the price curve shows that, on average, price increases quickly during the first part of the auction. The velocity then slows down, only to increase sharply again after day 5. A similar picture is seen for the average price acceleration: It is high at the onset of the auction, then drops below zero ("deceleration"), only to sharply increase again toward the end.

Although it is tempting to talk about "typical" dynamics for these data, we notice that there is significant variation among the curves. For instance, not all auctions show increasing price acceleration toward the end. In fact, in some auctions the price decelerates and ends at a large negative acceleration. Similarly, in many auctions there is no increase in the price velocity during the first part of the auction. All this suggests that the price formation process of similar auctioned items is not as homogeneous as expected. In the following we use functional cluster analysis to segment auctions into groups of more homogeneous price dynamics.

Fig. 5. Price curves and dynamics for 183 Palm M515 auctions. The left panel shows the price curves. The middle and right panels show the price velocities (first derivative) and price accelerations (second derivative), respectively. The heavy solid and dashed curves are the pointwise means and ±2 standard error curves.
4 Auction segmentation via curve clustering
4.1 Clustering mechanism and number of clusters

As in ordinary cluster analysis, our goal is to segment the data into clusters of observations that are more homogeneous, in order to better understand the characteristics or factors that lead to heterogeneity in the data. Since our data are curves, we use curve clustering to find segments of more homogeneous price profiles. Curve clustering can be done in several ways. One option is to sample each curve on a finite grid and then cluster on the grid. However, this can lead to unstable estimates (Hastie et al., 1995). A different approach that has been explored in the literature only recently (Abraham et al., 2003) is to cluster the set of curve coefficients rather than the functions themselves (James and Sugar, 2003). Let B = {b_1, ..., b_N} be the set of coefficients pertaining to the N polynomial smoothing splines. Since each of the N curves is based on the same set of knots and the same smoothing parameters, heterogeneity across curves is captured by the heterogeneity across the coefficients. Thus, rather than clustering the original curves, we cluster the set of coefficients B. We use the K-medoids algorithm (with a Manhattan distance) since it is more robust to extreme values than K-means (Cuesta-Albertos et al., 1997). The K-medoids algorithm iteratively minimizes the within-cluster dissimilarity

W_K = \sum_{k=1}^{K} \sum_{j, j' \in I_k} D(b_j, b_{j'}),   (4)
where D(b_j, b_{j'}) denotes the dissimilarity between coefficients j and j', and I_k denotes the set of indices pertaining to the elements of the kth cluster, k = 1, ..., K (Hastie et al., 2001; Kaufman and Rousseeuw, 1987).

We investigate different choices for K and use several different criteria to determine the most plausible number of clusters. The first is the popular method of examining the reduction in within-cluster dissimilarity as a function of the number of clusters. This is shown in Fig. 6 (left panel), where we see that the within-cluster dissimilarity is reduced by about 2 when moving from one to two clusters and also when moving from two to three clusters. However, the reduction is less than 0.5 for a larger number of clusters. The danger with this approach is that it is prone to show kinks even if there is no clustering in the data (Sugar and James, 2003; Tibshirani et al., 2001). We therefore also use an alternative measure based on an information-theoretic approach introduced by Sugar and James (2003). This non-parametric measure of within-cluster dispersion, d_K, also called the "distortion", is the average Mahalanobis distance per dimension between each observation and its closest cluster center.
Fig. 6. Choosing the number of clusters: The left plot shows the reduction in the within-cluster dissimilarity as the number of clusters increases. K = 2 and K = 3 clusters lead to a strong reduction. As K > 3, this reduction diminishes. The right plot shows the jump plot of the transformed distortions. The largest jump occurs at K = 3, providing additional evidence for 3 clusters in the data.
Rather than using the raw distortions, Sugar and James (2003) suggest using the "jump statistic", defined as J_K = d_K^{-Y} - d_{K-1}^{-Y}, where Y = dim/2 and dim denotes the dimension of the data. A graph of J_K vs. K is expected to peak at the number of clusters K that best describes the data. A jump plot for our data is shown in the right panel of Fig. 6. The largest jump occurs at K = 3, providing additional evidence for 3 clusters in the data.6
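The clustering step itself can be sketched compactly. The routine below is a plain alternating K-medoids with Manhattan distance applied to the spline coefficient vectors, together with the within-cluster dissimilarity of Eq. (4) and a simplified jump statistic (squared Euclidean distortions, with an identity covariance standing in for the Mahalanobis metric). It is our own illustration of the idea, not the authors' implementation.

```python
import numpy as np

def k_medoids(B, K, n_iter=100, seed=0):
    """Alternating K-medoids (Manhattan distance) on the rows of B (N x d)."""
    rng = np.random.default_rng(seed)
    medoids = B[rng.choice(len(B), size=K, replace=False)]
    for _ in range(n_iter):
        dist = np.abs(B[:, None, :] - medoids[None, :, :]).sum(axis=2)   # N x K
        labels = dist.argmin(axis=1)
        new_medoids = medoids.copy()
        for k in range(K):
            members = B[labels == k]
            if len(members) == 0:
                continue
            # Medoid = member with minimal total Manhattan distance to its cluster.
            cost = np.abs(members[:, None, :] - members[None, :, :]).sum(axis=(1, 2))
            new_medoids[k] = members[cost.argmin()]
        if np.allclose(new_medoids, medoids):
            break
        medoids = new_medoids
    dist = np.abs(B[:, None, :] - medoids[None, :, :]).sum(axis=2)
    return dist.argmin(axis=1), medoids

def within_cluster_dissimilarity(B, labels):
    """W_K of Eq. (4): total pairwise Manhattan distance within each cluster."""
    return sum(np.abs(B[labels == k][:, None, :] - B[labels == k][None, :, :]).sum()
               for k in np.unique(labels))

def jump_statistics(B, K_max=10):
    """Simplified Sugar-James jumps J_K = d_K^(-Y) - d_{K-1}^(-Y), with Y = dim/2
    and d_K the average squared distance per dimension to the closest medoid."""
    dim = B.shape[1]
    Y = dim / 2.0
    d = []
    for K in range(1, K_max + 1):
        labels, medoids = k_medoids(B, K)
        d.append(((B - medoids[labels]) ** 2).sum(axis=1).mean() / dim)
    d_inv = np.array(d) ** (-Y)
    return d_inv - np.concatenate([[0.0], d_inv[:-1]])    # J_1, ..., J_{K_max}
```

Under the assumptions of the earlier spline sketch, B would be a 183 x 19 matrix of coefficients (five polynomial terms plus fourteen knot terms), and the jump curve peaking at K = 3 corresponds to the right panel of Fig. 6.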
4.2 Comparing price dynamics of auction clusters

After determining the number of clusters in the data, we investigate each cluster individually in order to derive insight about the differences in the price formation process across the clusters. For our data we obtain 3 distinct clusters of sizes 90, 47, and 46 auctions. We start by comparing differences across cluster dynamics, and then supplement the dynamic characterization with differences in static auction features such as opening price, seller reputation, and winner experience.

In order to compare the price dynamics across clusters we plot cluster-specific price curves and their derivatives together with 95% confidence interval bounds (see Fig. 7). A comparison of the price curves (top row) reveals that auctions in clusters 1 and 2, on average, start out at a higher price than those in cluster 3. Also, the average price curves in clusters 2 and 3 increase more sharply at the auction end. Differences in price dynamics are more visible in the price velocities (second row) and accelerations (third row): Cluster 1 is marked by high acceleration at the auction start, followed by a long period of stalling/decreasing dynamics which extends almost to the end of day 6. The price then speeds up again toward closing. Cluster 2, in contrast, experiences hardly any price dynamics during the first five days, but toward the auction end the price speeds up rapidly at an increasing rate. In fact, the maximum acceleration is reached at closing. Since acceleration precedes velocity, a maximum acceleration at closing does not translate into maximum speed. In other words, auctions in cluster 2 close before they reach maximum speed. The picture for cluster 3 is different: Like cluster 1, it experiences some early dynamics, albeit of larger magnitude. Like cluster 2, the dynamics slow down during mid-auction, with a speeding up toward the auction end. However, while auction prices in cluster 2 do not reach their maximum speed before closing, the ones in cluster 3 do! Notice that acceleration in cluster 3 reaches its maximum a day before the auction end and then levels off. This means that the price velocity in cluster 3 is "maxed out" when the auction finishes. We further investigate the potential impact of these different price dynamics below, by directly modeling the price dynamics.

6 We also clustered a lower-dimensional representation of the spline coefficients using principal components and obtained similar results. Further evidence for 3 clusters was also obtained by performing functional principal component analysis (Ramsay and Silverman, 2005).
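The cluster profiles plotted in Fig. 7 reduce to pointwise means and standard-error bands of the evaluated curves within each cluster; a minimal sketch, assuming the curves (or their derivatives) have already been evaluated on a common time grid:

```python
import numpy as np

def cluster_profiles(curves, labels, z=1.96):
    """Pointwise mean and approximate 95% confidence bounds per cluster.

    curves: (N x T) array of price curves (or velocities/accelerations) on a
    common time grid; labels: length-N cluster assignment.
    """
    profiles = {}
    for k in np.unique(labels):
        c = curves[labels == k]
        mean = c.mean(axis=0)
        se = c.std(axis=0, ddof=1) / np.sqrt(len(c))
        profiles[k] = (mean, mean - z * se, mean + z * se)
    return profiles
```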
Fig. 7. Price Dynamic Profiles: For each cluster, the graph shows the price curves (first row), their first derivative (velocity, second row), and their second derivative (acceleration, third row). The thick solid curve represents the average curve and the thick dashed curves correspond to 95% confidence bounds. The last column overlays the mean curves from the three clusters.
4.3 A differential equation for price

The price curve and its derivatives appear to play an important role in understanding the price formation process in an auction. For this reason, we set out to model the relationship directly via functional differential equation analysis. Consider a second-order homogeneous linear differential equation of the form

\beta_0(t) f(t) + \beta_1(t) f'(t) + f''(t) = 0.   (5)

This is a plausible model for describing the dynamics of price because it is likely that there exist forces proportional to the velocity or the position that affect the system, similar to phenomena in physics. In the above model, \beta_0(t) reflects forces applied to the system at time t that depend on the position of the price, while \beta_1(t) reflects forces that are proportional to the velocity. One example of the former is the effect of the current price on the system: if an auction reaches a high price level, its acceleration may decrease because the price is at or near the market price. Other forces that can be proportional either to the velocity or to the position are high-frequency bidding, the deadline effect (i.e., the end of the auction), and fierce competition. For instance, competition may imply that action is immediately followed by reaction, which may result in a fast movement (i.e., velocity) of the price. We therefore explore the dynamic system of price by fitting differential equation models. The next logical step is to extend auction theory to formulate "laws of auction dynamics", which, like laws of physics, would relate the different factors of the system (similar to viscosity or friction, which are forces proportional to velocity). Our model for auction i is therefore

f''_i(t) = -\beta_0(t) f_i(t) - \beta_1(t) f'_i(t) + \epsilon_i(t).   (6)

This model, which describes the dynamics of the price process, is conceptually equivalent to a linear regression model that relates the acceleration f'' to the position and velocity. The inclusion of time-varying coefficients \beta_0(t) and \beta_1(t) means that we allow the system dynamics to evolve over time. The results of the previous section indicate that there are three groups of auctions, each with different dynamic behavior. We therefore fit the model in Eq. (6) separately to each of the clusters. Since fitting differential equations to data in this context is not standard, we describe the process and interpretations in detail next. The process of modeling includes estimating parameters, evaluating model fit, and interpretation. Parameter estimation is obtained by minimizing the sum of squared errors using least squares.

Fig. 8. Estimated weight functions β̂0(t) and β̂1(t) for each cluster. Dashed lines are bootstrapped 95% confidence intervals.
This can be done either pointwise or via basis functions (Ramsay and Silverman, 2005).7 The estimated parameter curves for each cluster are shown in Fig. 8. We delay interpreting these coefficients until we assess model fit.

To assess fit, we use the estimated parameters to find a function f(t) that follows these dynamics, i.e., we solve the homogeneous differential equation. To obtain a unique solution we specify the initial conditions f_k(0) = average (log) opening bid in cluster k, and f'_k(0) = 0, for k = 1, 2, 3. The solution for each cluster is shown in Fig. 9. Comparing the solutions (solid lines) with the average price curves (broken lines) shows that for clusters 1 and 3 the solution matches the average price curve fairly well. In contrast, for cluster 2 the solution deviates from the data and overestimates the price more and more as the auction progresses. It appears that the price dynamics in some auctions (such as those in clusters 1 and 3) can be described fairly well by a closed system, whereas in other auctions external factors play an important role. We now return to the estimated coefficients in order to compare the dynamics in clusters 1 and 3 (left and right columns of Fig. 8).
7 We use the function pdacell to estimate the two parameter functions.
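A hedged sketch of the pointwise variant is shown below: at every grid point t, the accelerations of the auctions in a cluster are regressed on their (negated) positions and velocities, which yields β̂0(t) and β̂1(t) one time point at a time, and resampling auctions gives bootstrap bands like those in Fig. 8. This only illustrates the idea; it is not the pdacell routine mentioned in the footnote.

```python
import numpy as np

def pointwise_pda(F, F1, F2):
    """Pointwise least-squares estimates of beta_0(t), beta_1(t) in Eq. (6).

    F, F1, F2: (N x T) arrays of f_i(t), f_i'(t), f_i''(t) for the N auctions of
    one cluster, evaluated on a common grid of T time points.
    """
    T = F.shape[1]
    beta0, beta1 = np.empty(T), np.empty(T)
    for t in range(T):
        X = np.column_stack([-F[:, t], -F1[:, t]])          # N x 2 design
        coef, *_ = np.linalg.lstsq(X, F2[:, t], rcond=None)
        beta0[t], beta1[t] = coef
    return beta0, beta1

def bootstrap_bands(F, F1, F2, n_boot=500, seed=0):
    """95% pointwise bootstrap bands obtained by resampling auctions."""
    rng = np.random.default_rng(seed)
    N = F.shape[0]
    draws0, draws1 = [], []
    for _ in range(n_boot):
        idx = rng.integers(0, N, size=N)
        b0, b1 = pointwise_pda(F[idx], F1[idx], F2[idx])
        draws0.append(b0)
        draws1.append(b1)
    band0 = np.percentile(draws0, [2.5, 97.5], axis=0)
    band1 = np.percentile(draws1, [2.5, 97.5], axis=0)
    return band0, band1
```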
Fig. 9. Solution of the differential equation (solid lines) compared to the average price curve (broken lines) for each cluster.
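The solution step behind Fig. 9 can be sketched with a standard initial value problem solver. The snippet below assumes that β̂0(t) and β̂1(t) have been estimated on a grid (e.g., with the pointwise routine above) and that f0 is the cluster's average log opening bid; the use of scipy's solve_ivp is our choice for illustration, not necessarily the authors'.

```python
import numpy as np
from scipy.integrate import solve_ivp

def solve_price_ode(t_grid, beta0, beta1, f0, fprime0=0.0):
    """Solve the homogeneous Eq. (5), f'' = -beta0(t) f - beta1(t) f', with
    initial conditions f(0) = f0 and f'(0) = fprime0."""
    def rhs(t, y):
        b0 = np.interp(t, t_grid, beta0)   # linear interpolation between grid points
        b1 = np.interp(t, t_grid, beta1)
        f, fp = y
        return [fp, -b0 * f - b1 * fp]
    sol = solve_ivp(rhs, (t_grid[0], t_grid[-1]), [f0, fprime0], t_eval=t_grid)
    return sol.y[0]                        # the implied log-price curve f(t)
```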
For each weight function (β̂0(t) and β̂1(t)), we estimate a 95% confidence interval based on bootstrap samples with 500 replications. The deviation of β̂0 from zero is statistically insignificant throughout the auction in both clusters, indicating a process with no oscillations, as would be expected for price. However, β̂1 is similar for the two clusters from mid-auction (slightly positive) to the auction end (increasingly negative), but this change occurs earlier in cluster 1. The other difference between the two clusters is the behavior of β̂1 during the first day of the auction: For cluster 1, it is near zero, whereas in cluster 3 it is positive, although of very small magnitude. This change in the coefficients over time indicates a system with evolving dynamics: For cluster 1 the system starts out with linear motion for log-price (β̂0 ≈ 0, β̂1 ≈ 0); in other words, the price increases exponentially only as a function of time (log f(t) = a + bt). From around day 5, we see β̂0 ≈ 0 and β̂1 < 0, so that the price moves to double-exponential growth (log f(t) = a + b e^{gt}), with a growth rate g that increases as the auction draws closer to an end. This captures the increasingly steep price increase that results from high-frequency bidding toward the auction end. The slightly positive β̂1 during days 2-3 can be attributed to the deceleration during this period, although price does increase during that time (β̂1 describes the relationship between price and its acceleration). A similar picture is seen for cluster 3, except that the price deceleration at the auction
start manifests itself as a positive β̂1. This can be seen when comparing the acceleration curves in Fig. 7. In conclusion, we learn that the price dynamics in auctions of clusters 1 and 3 can be described by a homogeneous differential equation of the form of Eq. (5). In both cases, price increases double-exponentially from mid-auction, with an increasing growth rate. This occurs earlier in cluster 1 than in cluster 3. In contrast, the differential equation does not capture the dynamics of cluster 2 auctions, which may suggest that additional factors affect the system. Next, we examine such factors and their relation to the dynamics.

4.4 Comparing dynamic and non-dynamic cluster features

In order to gain insight into the relationship between price dynamics and other auction-related information, we compare the three clusters with respect to some key auction features. First, we compare the opening and closing prices of the three clusters. Although the price curves approximate the actual price during the auction relatively well, the actual prices at the auction start and close have special economic significance and contain additional information about the price formation. Indeed, for our data we find that the three clusters differ with respect to these prices. We investigate additional auction characteristics that are relevant to the price formation: the day of the week that the auction closes, the seller's rating (a proxy for reputation), the winner's rating (a proxy for experience), and the number of bids and the number of distinct bidders that participated in the auction (both proxies for competition). Table 1 gives summary statistics by cluster for all numerical variables. Figure 10 compares the closing days of the three clusters, and a chi-squared test (p-value = 0.02) further confirms that cluster and day-of-week are statistically dependent.

Table 1 and Fig. 10 suggest several possible strategies for bidding on an auction. Take, for instance, auctions in cluster 2. These auctions are characterized by high opening bids, Tuesday closing days, and low final prices.
Table 1
Summary statistics by cluster (on log scale)

Cluster          Obid    Price  SelRate  WinRate  Nbids  Nbidders  ClusSize
1  Mean          6.98   234.86    40.70    34.04  10.79      6.19        90
   SE            1.43     1.01     1.13     1.19   1.13      1.10
2  Mean          6.18   216.30    26.43    32.13  17.16      8.27        47
   SE            1.66     1.02     1.20     1.28   1.10      1.10
3  Mean          0.17   233.04    20.52    22.41  24.87     12.53        46
   SE            1.63     1.01     1.15     1.29   1.06      1.05

Notes: Opening bid (Obid), closing price (Price), seller rating (SelRate), winner rating (WinRate), and number of bids and bidders (Nbids, Nbidders). "ClusSize" denotes the size of each cluster.
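Summaries like Table 1 and the chi-squared test reported above are straightforward to compute once the auction-level variables are in a table. The sketch below is hypothetical: the column names are ours, and it interprets the log-scale summaries as back-transformed (geometric) means with multiplicative standard errors, which is consistent with the magnitudes in Table 1 but is our own reading.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# 'auctions' is assumed to hold one row per auction with (hypothetical) columns:
# cluster, obid, price, sel_rate, win_rate, nbids, nbidders, close_day.

def summarize_by_cluster(auctions,
                         cols=("obid", "price", "sel_rate",
                               "win_rate", "nbids", "nbidders")):
    """Geometric means and multiplicative SEs by cluster (log-scale summaries)."""
    logged = auctions[list(cols)].apply(np.log)
    logged["cluster"] = auctions["cluster"].to_numpy()
    grouped = logged.groupby("cluster")
    return np.exp(grouped.mean()), np.exp(grouped.sem())

def closing_day_test(auctions):
    """Chi-squared test of independence between cluster and closing day."""
    table = pd.crosstab(auctions["cluster"], auctions["close_day"])
    chi2, p, dof, _ = chi2_contingency(table)
    return chi2, p, dof
```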
Fig. 10. The percent of auctions that close on each day of the week, by cluster. The bars denote the corresponding standard errors.
Alternatively, auctions closing on Saturday or Sunday tend to originate from cluster 1, with high final prices. In light of these results we characterize three types of auctions, based on the three clusters and their dynamics:

Steady auctions (cluster 1). Auctions in this cluster have moderately high opening prices and low competition, yet they achieve high closing prices. This cluster is also well approximated by a differential equation which moves from exponential growth to double-exponential growth around day 5. A plausible explanation for this steadiness is the highly experienced sellers. A highly rated seller in the online environment is usually regarded as being trustworthy. Such sellers have performed many transactions and have therefore gained trust and experience. They also make better design choices, such as scheduling the auction closing on a weekend (when there are likely to be more users logged on). Winners in these auctions are also relatively experienced, indicating that experienced bidders are willing to pay a premium for buying from trustworthy sellers.

Low-energy auctions with late dynamics (cluster 2). Auctions in this cluster have the highest average opening prices and the lowest closing price. In between, the price hardly moves until the last day, when suddenly it accelerates and remains in high acceleration when the auction closes. These auctions are not well approximated by a differential equation, and seem somewhat "out of balance". It appears
that the high opening price deters potential bidders from participating in the auction, and in turn the low competition level leads to lower prices. Or perhaps there are other reasons that make these auctions unattractive (e.g., they tend to end on a weekday).

Bazaar auctions (cluster 3). Auctions in this cluster are the most dynamic: the price increase moves quickly from exponential to double-exponential growth (and is well approximated by a differential equation), with acceleration peaking early on and then again before the close. These auctions have, on average, the lowest opening price, the highest competition, and high closing prices. The low opening price appears to be the only attractor of these auctions: sellers have low ratings and auctions close mostly on weekdays. We call them "bazaar" auctions because the marketplace seems to work efficiently in reaching the final price through dynamic competition.
4.5 A comparison with "traditional" clustering

Section 4.4 sheds light on auction dynamics and their relationship with other (static) auction information. We obtain this insight via curve clustering on the auction price processes. In the following we show the amount of information that is lost by not taking a functional approach, and instead performing a traditional cluster analysis. The price curves describe the process of bidding between the start and end of the auction. In contrast, a traditional approach ignores the price curves and focuses only on the beginning and the end of this process, that is, on the opening bid and the final price only. We can think of this as a static approach since it ignores all the price dynamics.

Table 2 shows the results of such a static clustering approach. Using the same approach (including the same K-medoids algorithm) as in Section 4.1, we obtain three clusters. Although the cluster sizes are similar to those in Table 1 (90, 47, 46 vs. 93, 45, 45), the three clusters are in fact very different from those in Table 1.
Table 2
Cluster-specific summaries (similar to Table 1) for clustering on the opening bid and the final price only

Cluster          Obid    Price  SelRate  WinRate  Nbids  Nbidders  ClusSize
1  Mean         20.24   201.16    46.74    35.59  12.73      6.83        93
   SE            1.35     1.01     1.18     1.29   1.15      1.12
2  Mean         15.14   240.53    40.86    41.45  12.22      6.52        45
   SE            1.40     1.01     1.20     1.25   1.17      1.14
3  Mean          0.01   238.18    11.23    13.38  26.78     13.94        45
   SE            1.00     1.01     1.01     1.20   1.03      1.02
In particular, the static clustering segments the auctions into three groups: one with eBay's "default" opening bid of $0.01, one with medium opening bids, and one with high opening bids, but otherwise there is hardly any separation on the other dimensions. In fact, clusters 1 and 2 are almost identical with respect to seller rating, winner rating, and competition (number of bidders and bids). Furthermore, the cluster variances are also much larger. The fact that competition is not captured by this static clustering can be attributed to the fact that competition pertains to what happens during an auction and can therefore only be captured by the price curve. In contrast, the opening bid and the final price ignore competition and thus result in a loss of valuable auction information.

Another way of assessing the information loss due to a static clustering approach is by looking at the dynamic price profiles. Figure 11 shows the price curves and their dynamics (similar to Fig. 7) for the static clustering (i.e., using only the opening bid and the final price). The mean trend in the curves is generally comparable to the curve clustering result, but the inter-cluster heterogeneity is very large. The effect is especially notable in the velocity and acceleration.

Fig. 11. Price profiles for clustering only on the starting- and the end-price.
5 Conclusions
In this study we take advantage of the wealth of high-quality, publicly available data that exist in the online marketplace, and discuss limitations of some of the empirical research that has been conducted. In particular, the current literature has focused on quantifying the relationship between the final price of an auction and factors such as competition, opening price, and closing day. Although there is empirical evidence of different dynamics that take place during an auction, and although these differing dynamics are likely to affect the final price, there has been no attempt to quantify these dynamics. We focus here on price formation processes in online auctions and their dynamics, and show that the price process in an online auction can be diverse even when dealing with auctions for very similar items with very similar settings. Using a functional representation of the price evolution, and functional data analytic tools, we find and characterize three "types" of auctions that exhibit different price dynamics. These differences reflect or result from different auction settings such as seller ratings, closing days, opening prices, and competition, but also from different dynamics that take place throughout the auction. We show that the static information alone does not fully explain the differences and leads to a loss of valuable information. Hence, the ability to couple static and dynamic sources of information is important for better understanding the mechanics of online auctions.
The combination of static information with price dynamics leads to behavior that sometimes adheres to auction theory and sometimes does not. For example, the three-way relationship between the opening price, competition, and the final price, which has also been shown empirically elsewhere (Bapna et al., 2008; Lucking-Reiley, 1999), is theoretically justified: lower prices lead to more competition, which in turn results in higher prices. However, we find that this three-way relationship differs across clusters: it holds in "bazaar auctions" and its reverse is true in "low-energy auctions with late dynamics", but it does not hold in "steady auctions". In the latter it appears as if the experienced sellers are able to elicit higher prices (from more experienced winners) despite high opening prices and low competition. These conditional relationships were uncovered by examining the price dynamics of the entire auction rather than the opening and closing price alone. Another well-known effect is the premium carried by experienced sellers (Lucking-Reiley, 2000). We find evidence for this effect in some auction types ("steady auctions"), but not in others ("low-energy auctions with late dynamics").

The existence of several auction types and the relationship between auction characteristics and price dynamics can be of use to sellers, bidders, the auction house, and, in general, to decision makers in the online marketplace. For instance, our results based on the Palm Pilot sample indicate that low-rated sellers might be better off setting a low opening price, a weekend closing, and other features that will increase competition and result in high price velocity at the end. Furthermore, on eBay, for example, a seller is allowed to change the opening price and a few other settings before the first bid is submitted. We can therefore imagine a system in which a seller evaluates, based on the dynamics at the beginning of his/her auction, what type of auction it is, and dynamically changes the auction settings to improve the dynamics. An auction house, like eBay, can use information on different types of auction dynamics to decide on varying pricing strategies and to determine optimal bid increments. Pricing strategies and bid increments affect the price dynamics, and their optimal choice could avoid the typical lack of price movement during the middle of an auction. Moreover, price dynamics can be used to detect fraud: fraudulent sellers could be suspected before the end of the transaction if their auction dynamics do not follow typical patterns. Bidders could use differences in price dynamics to forecast the ending price of an auction. This information, in turn, can help in deciding which auction to bid on among a large set of competing auctions for the same product (Wang et al., 2008).

One of the limitations of the current approach is that it is geared only toward auctions of a certain fixed duration (e.g., 7-day auctions). While this can be generalized to auctions of different durations (see Bapna et al., 2008), one distinct feature of eBay auctions is the fixed ending time. It is yet
to be investigated how the same approach can be used for auctions with flexible ending times (e.g., Yahoo! or Amazon auctions). Another avenue for future research is to investigate exactly which external factors affect auctions that show less of a "closed-system" behavior (e.g., those in cluster 2). A first step in that direction is proposed by Jank et al. (2008), who use a novel functional differential equation tree methodology to incorporate predictor variables into dynamic models. Moreover, while we discuss several possible implications of our work for the seller, the buyer, and the auction house, a thorough exploitation of price dynamics for business decision making in the auction context has not yet been undertaken.

In conclusion, new insights can be gained within the field of online auctions by examining and modeling price dynamics or other dynamic processes through an FDA approach. In fact, a functional approach is very natural in this digital era, where data tend to comprise both longitudinal and cross-sectional information that is often very tightly coupled (Jank and Shmueli, 2006). Methods for exploring (e.g., Shmueli et al., 2006) and analyzing such data are therefore likely to become a necessity.
References

Abraham, C., P.A. Cornillion, E. Matzner-Lober, N. Molinari (2003). Unsupervised curve-clustering using b-spline. Scandinavian Journal of Statistics 30(3), 581–595.
Ba, S., P.A. Pavlou (2002). Evidence of the effect of trust building technology in electronic markets: Price premiums and buyer behavior. MIS Quarterly 26, 269–289.
Bajari, P., A. Hortacsu (2003). The winner's curse, reserve prices and endogenous entry: empirical insights from eBay auctions. Rand Journal of Economics 3(2), 329–355.
Bapna, R., P. Goes, A. Gupta, Y. Jin (2004). User heterogeneity and its impact on electronic auction market design: an empirical exploration. MIS Quarterly 28(1), 21–43.
Bapna, R., W. Jank, G. Shmueli (2008). Consumer surplus in online auctions. Information Systems Research 19(4), December Issue.
Bapna, R., W. Jank, G. Shmueli (2008). Price formation and its dynamics in online auctions. Decision Support Systems 44(3), 641–656.
Cuesta-Albertos, J.A., A. Gordaliza, C. Matran (1997). Trimmed k-means: an attempt to robustify quantizers. The Annals of Statistics 25, 553–576.
Dellarocas, C. (2003). The digitization of word-of-mouth: promise and challenges of online reputation mechanisms. Management Science 49, 1407–1442.
Hastie, T., R. Tibshirani, J. Friedman (2001). The Elements of Statistical Learning. Springer-Verlag, New York.
Hastie, T.J., A. Buja, R.J. Tibshirani (1995). Penalized discriminant analysis. The Annals of Statistics 23, 73–102.
Hendricks, K., H.J. Paarsch (1995). A survey of recent empirical work concerning auctions. Canadian Journal of Economics 28, 403–426.
Hyde, V., W. Jank, G. Shmueli (2006). Investigating concurrency in online auctions through visualization. The American Statistician 60(3), 241–250.
James, G.M., C.A. Sugar (2003). Clustering sparsely sampled functional data. Journal of the American Statistical Association 98, 397–408.
Jank, W., G. Shmueli (2006). Functional data analysis in electronic commerce research. Statistical Science 21(2), 113–115.
Jank, W., G. Shmueli, S. Wang (2008). Modeling price dynamics in online auctions via regression trees, in: W. Jank and G. Shmueli (eds.), Statistical Methods in eCommerce Research. Wiley, New York.
Kaufman, L., P.J. Rousseeuw (1987). Clustering by means of medoids, in: Y. Dodge (ed.), Statistical Data Analysis Based on the L1-Norm and Related Methods. Elsevier, North-Holland, pp. 405–416.
Klein, S., R.M. O'Keefe (1999). The impact of the web on auctions: some empirical evidence and theoretical considerations. International Journal of Electronic Commerce 3(3), 7–20.
Klemperer, P. (1999). Auction theory: a guide to the literature. Journal of Economic Surveys 13(3), 227–286.
Krishna, V. (2002). Auction Theory. Academic Press, San Diego.
Lucking-Reiley, D. (1999). Using field experiments to test equivalence between auction formats: magic on the internet. American Economic Review 89(5), 1063–1080.
Lucking-Reiley, D. (2000). Auctions on the internet: what's being auctioned and how? Journal of Industrial Economics 48(3), 227–252.
Milgrom, P., R. Weber (1982). A theory of auctions and competitive bidding. Econometrica 50(5), 1089–1122.
Pinker, E.J., A. Seidmann, Y. Vakrat (2003). Managing online auctions: current business and research issues. Management Science 49, 1457–1484.
Ramsay, J.O., B.W. Silverman (2002). Applied Functional Data Analysis: Methods and Case Studies. Springer-Verlag, New York.
Ramsay, J.O., B.W. Silverman (2005). Functional Data Analysis, 2nd ed. Springer Series in Statistics. Springer-Verlag, New York.
Roth, A.E., A. Ockenfels (2002). Last-minute bidding and the rules for ending second-price auctions: evidence from eBay and Amazon auctions on the internet. The American Economic Review 92(4), 1093–1103.
Ruppert, D., M.P. Wand, R.J. Carroll (2003). Semiparametric Regression. Cambridge University Press, Cambridge.
Shmueli, G., W. Jank (2005). Visualizing online auctions. Journal of Computational and Graphical Statistics 14(2), 299–319.
Shmueli, G., W. Jank, A. Aris, C. Plaisant, B. Shneiderman (2006). Exploring auction databases through interactive visualization. Decision Support Systems 42(3), 1521–1538.
Shmueli, G., R.P. Russo, W. Jank (2007). The BARISTA: a model for bid arrivals in online auctions. Annals of Applied Statistics 1(2), 412–441.
Sugar, C.A., G.M. James (2003). Finding the number of clusters in a data set: an information theoretic approach. Journal of the American Statistical Association 98, 750–763.
Tarpey, T., K.K.J. Kinateder (2003). Clustering functional data. Journal of Classification 20(1), 93–114.
Tibshirani, R., G. Walther, T. Hastie (2001). Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society Series B 63, 411–423.
Wang, S., W. Jank, G. Shmueli (2008). Explaining and forecasting eBay's online auction prices using functional data analysis. Journal of Business and Economic Statistics 26(2), 144–160.
Chapter 10
Scheduling Tasks Using Combinatorial Auctions: The MAGNET Approach$
John Collins and Maria Gini Department of Computer Science and Engineering, University of Minnesota, 200 Union St SE, Room 4-192, Minneapolis, MN 55455, USA
Abstract

We consider the problem of rational, self-interested, economic agents who must negotiate with each other in a market environment in order to carry out their plans. Customer agents express their plans in the form of task networks with temporal and precedence constraints. A combinatorial reverse auction allows supplier agents to submit bids specifying prices for combinations of tasks, along with time windows and duration data that the customer may use to compose a work schedule. We describe the consequences of allowing the advertised task network to contain schedule infeasibilities, and show how to resolve them in the auction winner-determination process.
1 Introduction
We believe that much of the commercial potential of the Internet will remain unrealized until a new generation of autonomous systems is developed and deployed. A major problem is that the global connectivity and rapid communication capabilities of the Internet can present an organization with vast numbers of alternative choices, to the point that users are overwhelmed, and conventional automation is insufficient. Much has been done to enable simple buying and selling over the Internet, and systems exist to help customers and suppliers find each other, such as search engines, vertical industry portals, personalization systems, and recommender engines. However, many business operations are much
Work supported in part by the National Science Foundation under grants IIS-0084202 and IIS-0414466.
more complex than the simple buying and selling of individual items. We are interested in situations that require coordinated combinations of goods and services, where there is often some sort of constraint-satisfaction or combinatorial optimization problem that needs to be solved in order to assemble a "deal." Commonly, these extra complications are related to constraints among tasks and services, and to time limitations. The combinatorics of such situations are not a major problem when an organization is working with small numbers of partners, but can easily become nearly insurmountable when "opened up" to the public Internet.

We envision a new generation of systems that will help organizations and individuals find and exploit opportunities that are otherwise inaccessible or too complex to seriously evaluate. These systems will help potential partners find each other (matchmaking), negotiate mutually beneficial deals (negotiation, evaluation, commitment), and help them monitor the progress of distributed activities (monitoring, dispute resolution). They will operate with variable levels of autonomy, allowing their principals (users) to delegate or reserve authority as needed, and they will provide their principals with a market presence and power that is far beyond what is currently achievable with today's telephone, fax, web, and email-based methods.

We believe that an important negotiation paradigm among these systems will be market-based combinatorial auctions, with added precedence and temporal constraints. The multi-agent negotiation testbed (MAGNET) project represents a first step in bringing this vision to reality. MAGNET provides a unique capability that allows self-interested agents to negotiate over complex coordinated tasks, with precedence and time constraints, in an auction-based market environment. This chapter introduces many of the problems a customer agent must solve in the MAGNET environment and explores in detail the problem of solving the extended combinatorial-auction winner-determination problem.

Guide to this chapter: Section 2 works through a complete interaction scenario with an example problem, describing each of the decision processes a customer agent must implement in order to maximize the expected utility of its principal. For some of them, we have worked out how to implement the decisions, while for the remainder we only describe the problems. Section 3 focuses on one specific decision problem, that of deciding the winners in a MAGNET auction. A number of approaches are possible; we describe an optimal tree search algorithm for this problem. Section 4 places this work in context with other work in the field. In particular, we draw on work in multi-agent negotiation, auction mechanism design, and combinatorial-auction winner determination, which has been a very active field in recent years. Finally, Section 5 wraps up the discussion and points out a set of additional research topics that must be addressed to further realize the MAGNET vision.
Fig. 1. The MAGNET architecture: a customer agent (planner, bid manager, execution manager), a supplier agent (bid manager, resource manager), and the market (market ontology, market sessions, market statistics) that mediates the bid protocol between them.

2 Decision processes in a MAGNET customer agent
We focus on negotiation scenarios in which the object of the interaction is to gain agreement on the performance of a set of coordinated tasks that one of the agents has been asked to complete. We assume that self-interested agents will cooperate in such a scheme to the extent that they believe it will be profitable for them to do so. After a brief high-level overview of the MAGNET system, we focus on the decision processes that must be implemented by an agent that acts as a customer in the MAGNET environment. We intend that our agents exhibit rational economic behavior. In other words, an agent should always act to maximize the expected utility of its principal.

We will use an example to work through the agent's decisions. Imagine that you own a small vineyard, and that you need to get last autumn's batch of wine bottled and shipped.1 During the peak bottling season, there is often a shortage of supplies and equipment, and your small operation must lease the equipment and bring on seasonal labor to complete the process. If the wine is to be sold immediately, then labels and cases must also be procured, and shipping resources must be booked. Experience shows that during the Christmas season, wine cases are often in short supply and shipping resources overbooked.

2.1 Agents and their environment

Agents may fulfill one or both of two roles with respect to the overall MAGNET architecture, as shown in Fig. 1. A customer agent pursues its goals by formulating and presenting requests for quotations (RFQs) to
This example is taken from the operations of the Weingut W. Ketter winery, Kröv, Germany.
supplier agents through a market infrastructure (Collins et al., 1998). An RFQ specifies a task network that includes task descriptions, a precedence network, and temporal constraints that limit task start and completion times. Customer agents attempt to satisfy their goals for the greatest expected profit, and so they will accept bids at the least net cost, where cost factors can include not only bid prices, but also goal completion time, risk factors, and possibly other factors, such as preferences for specific suppliers. More precisely, these agents are attempting to maximize the utility function of some user, as discussed in detail in Collins et al. (2000).

A supplier agent attempts to maximize the value of the resources under its control by submitting bids in response to customer RFQs. A bid specifies what tasks the supplier is able to undertake, when it is available to perform those tasks, how long they will take to complete, and a price. Each bid may specify one or more tasks. Suppliers may submit multiple bids to specify different combinations of tasks, or possibly different time constraints with different prices. For example, a supplier might specify a short duration for some task that requires use of high-cost overtime labor, as well as a longer duration at a lower cost using straight-time labor. MAGNET currently supports simple disjunction semantics for bids from the same supplier. This means that if a supplier submits multiple bids, any non-conflicting subset can be accepted. Other bid semantics are also possible (Boutilier and Hoos, 2001; Nisan, 2000a).

2.2 Planning

A transaction (or possibly a series of transactions) starts when the agent is given a goal that must be satisfied. Attributes of the goal might include a payoff and a deadline, or a payoff function that varies over time. While it would certainly be possible to integrate a general-purpose planning capability into a MAGNET agent, we expect that in many realistic situations the principal will already have a plan, perhaps based on standard industry practices. Figure 2 shows such a plan for our winery bottling operation. We shall use this plan to illustrate the decision processes the agent must perform.

Formally, we define a plan $P = (S, V)$ as a task network containing a set of tasks $S$ and a set of precedence relations $V$. A precedence relation relates
Fig. 2. Plan for the wine-bottling example.
two tasks $s, s' \in S$ as $s \prec s'$, interpreted as "task $s$ must be completed before task $s'$ can start."

We assume that markets will be sponsored by trade associations and commercial entities, and will therefore be more or less specialized. A consequence of this is that agents must in general deal with multiple markets to accomplish their goals. For our example, we assume that the tasks in our plan are associated with markets as specified in Table 1. It appears that we will need to deal with three different markets, and we will pack the cases ourselves. Or perhaps we will open a few bottles and invite the village to help out.

So far, our plan is not situated in time, and we have not discussed our expected payoff for completing this plan. In the wine business, the quality and value of the product depend strongly on time. The wine must be removed from the casks within a 2-week window, and the bottling must be done immediately. For some varieties, the price we can get for our wine is higher if we can ship earlier, given a certain quality level. All the small vineyards in the region are on roughly the same schedule, so competition for resources during the prime bottling period can be intense. Without specifying the exact functions, we assume that the payoff drops off dramatically if we miss the 2-week bottling window, and less dramatically as the shipment date recedes into the future.

This example is of course simplified to demonstrate our ideas. For example, we are treating the bottling and labeling operations as atomic (the entire bottling operation must be finished before we can start labeling) even though common sense would suggest that we would probably want to apply this constraint to individual bottles, rather than to the entire batch. However, some varieties of wine are aged in the bottles for 6 months or more before the labels are applied.
Table 1
Tasks and market associations for the wine-bottling example

Task  Description       Market
s1    Deliver bottles   Vineyard services
s2    Deliver cork      Vineyard services
s3    Bottle wine       Vineyard services
s4    Print labels      Printing and graphic arts
s5    Apply labels      Vineyard services
s6    Print cases       Vineyard services
s7    Deliver cases     Vineyard services
s8    Pack cases        (none)
s9    Ship cases        Transport services
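A task network of this kind is straightforward to represent directly. The sketch below is our own illustration (the class and field names are hypothetical, not MAGNET's): tasks carry their market association from Table 1, and precedence edges record the relation s ≺ s' defined above; the two edges shown are our reading of the wine-bottling plan.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Task:
    name: str          # e.g. "s3"
    description: str   # e.g. "Bottle wine"
    market: str        # e.g. "Vineyard services"

@dataclass
class Plan:
    """A task network P = (S, V): a set of tasks plus precedence relations."""
    tasks: dict = field(default_factory=dict)   # name -> Task
    precedes: set = field(default_factory=set)  # (s, s') pairs: s must finish before s' starts

    def add_task(self, name, description, market):
        self.tasks[name] = Task(name, description, market)

    def add_precedence(self, before, after):
        self.precedes.add((before, after))

plan = Plan()
plan.add_task("s1", "Deliver bottles", "Vineyard services")
plan.add_task("s3", "Bottle wine", "Vineyard services")
plan.add_task("s5", "Apply labels", "Vineyard services")
plan.add_precedence("s1", "s3")   # bottles must be delivered before bottling starts
plan.add_precedence("s3", "s5")   # bottling precedes labeling, as discussed in the text
```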
2.3 Planning the bidding process

At this point, the agent has a plan, and it knows which markets it must deal in to complete the plan. It also knows the value of completing the plan, and how that value depends on time. The next step is to decide how best to use the markets to maximize its utility. It will do this in two phases. First, the agent generates an overall plan for the bidding process, which may involve multiple RFQs in each of multiple markets. We call this a "bid-process plan." Then a detailed timeline is generated for each RFQ.

The simplest bid-process plan would be to issue a single RFQ in each market, each consisting of the portion of the plan that is relevant to its respective market. If all RFQs are issued simultaneously, and if they are all on the same timeline, then we can combine their bids and solve the combined winner-determination problem in a single step. However, this might not be the optimum strategy. For example:
- We may not have space available to store the cases if we are not ready to pack them when they arrive.
- Our labor costs might be much lower if we can label as we bottle; otherwise, we will need to move the bottles into storage as we bottle, then take them back out to label them.
- Once cases are packed, it is easy for us to store them for a short period. This means that we can allow some slack between the packing and shipping tasks.
- There is a limit to what we are willing to pay to bottle our wine, and there is a limit to the premium we are willing to pay to have the bottling completed earlier.

The agent can represent these issues as additional constraints on the plan, or in some cases as alternative plan components. For example, we could constrain the interval between s5 (labeling) and s8 (packing) to a maximum of one day, or we could add an additional storage task between s3 (bottling) and s5 that must be performed just in case there is a non-zero delay between the end of s3 and the start of s5.

There are many possible alternative actions that the agent can take to deal with these issues. It need not issue RFQs in all markets simultaneously. It need not include all tasks for a given market in a single RFQ. Indeed, dividing the plan into multiple RFQs can be an important way to reduce scheduling uncertainty. For example, we might want to have a firm completion date for the bottling and labeling steps before we order the cases. When a plan is divided into multiple RFQs that are negotiated sequentially, then the results of the first negotiation provide additional constraints on subsequent negotiations.

Market statistics can be used to support these decisions. For example, if we knew that resources were readily available for the steps up through the labeling process (tasks s1–s5), we could include the case delivery and
printing steps (tasks s6 and s7) in the same RFQ. This could be advantageous if suppliers were more likely to bid, or likely to bid lower prices, if they could bid on more of the business in a single interaction. In other words, some suppliers might be willing to offer a discount if we agree to purchase both bottles and cases from them, but if we negotiate these two steps in separate RFQs, we eliminate the ability to find out about such discounts.

We should note that suppliers can either help or hinder the customer in this process, depending on the supplier's motivations. A supplier can help the customer mitigate issues like the constraint between bottling and packing: if a supplier knew about this constraint, it could offer both tasks at appropriate times, or it could give the customer the needed scheduling flexibility by offering the case delivery over a broad time window or with multiple bids covering a range of time windows. In some domains this could result in higher costs, due to the large speculative resource reservations the supplier would have to make in order to support its bids. However, if a supplier saw an RFQ consisting of s6 and s7, it would know that the customer had likely already made commitments for the earlier tasks, since nobody wants cases printed if they are not bottling. If the supplier also knew that there would be little competition within the customer's specified time window, it could inflate its prices, knowing that the customer would have little choice.

The bid-process plan that results from this decision process is a network of negotiation tasks and decision points. Figure 3 shows a possible bid-process plan for our wine-bottling example. Once we have a bid-process plan, we know what markets we will operate in, and how we want to divide up the bidding process. We must then start
Fig. 3. Bid-process plan for the wine-bottling example (RFQ r1: Vineyard Services, tasks s1–s5; r2: Printing & Graphic Arts, task s4; r3: Vineyard Services, tasks s6–s7; r4: Transport Services, task s9).
Fig. 4. Typical timeline for a single RFQ (compose RFQ, send RFQ, bid deadline, bid award deadline, earliest start of task execution, plan completion).
schedule the bid-process plan, and allocate time within each RFQ/bidding interaction. These two scheduling problems may need to be solved together if the bid-process plan contains multiple steps and it is important to finish it in minimum time. Each RFQ step needs to start at a particular time, or when a particular event occurs or some condition becomes true. For example, if the rules of the market require deposits to be paid when bids are awarded, the customer may be motivated to move RFQ steps as late as possible, other factors being equal. However, if resources such as our bottling and labeling steps are expected to be in short supply, the agent may wish to gain commitments for them as early as possible in order to optimize its own schedule and payoff. We assume these decisions can be supported by market statistics, the agent's own experience, and/or the agent's principal.

Each RFQ must also be allocated enough time to cover the necessary deliberation processes on both the customer and supplier sides. Some of these processes may be automated, and some may involve user interaction. The timeline in Fig. 4 shows an abstract view of the progress of a single negotiation. At the beginning of the process, the customer agent must allocate deliberation time to itself to compose its RFQ,2 to the supplier for bid preparation, and to itself again for the bid-evaluation process. Two of these time points, the bid deadline and the bid award deadline, must be communicated to suppliers as part of the RFQ. The bid deadline is the latest time a supplier may submit a bid, and the bid award deadline is the earliest time a supplier may expire a bid. The interval between these two times is available to the customer to determine the winners of the auction.

In general, it is expected that bid prices will be lower if suppliers have more time to prepare bids, and more time and schedule flexibility in the execution phase. Minimizing the delay between the bid deadline and the award deadline will also minimize the supplier's opportunity cost, and would therefore be expected to reduce bid prices. However, the customer's
This may be a significant combinatorial problem—see for example Babanov et al. (2003).
ability to find a good set of bids is dependent on the time allocated to bid evaluation, and if a user is making the final decision on bid awards, she may want to run multiple bid-evaluation cycles with some additional think time. We are interested in the performance of the winner-determination process precisely because it takes place within a window of time that must be determined ahead of time, before bids are received, and because we expect better overall results, in terms of maximizing the agent's utility, if we can maximize the amount of time available to suppliers while minimizing the time required for customer deliberation. These time intervals can be overlapped to some extent, but doing so can create opportunities for strategic manipulation of the customer by the suppliers, as discussed in Collins et al. (1997). The process for setting these time intervals could be handled as a nonlinear optimization problem, although it may be necessary to settle for an approximation. This could consist of estimating the minimum time required for the customer's processes, and allocating the remainder of the available time to the suppliers, up to some reasonable limit.

2.4 Composing a request for quotes

At this point in the agent's decision process, we have the information needed to compose one or more RFQs, we know when to submit them, and we presumably know what to do if they fail (if we fail to receive a bid set that covers all the tasks in the RFQ, for example). The next step is to set the time windows for tasks in the individual RFQs, and submit them to their respective markets.

Formally, an RFQ $r = (S_r, V_r, W_r, t)$ contains a subset $S_r$ of the tasks in the task network $P$, with their precedence relations $V_r$, the task time windows $W_r$ specifying constraints on when each task may be started and completed, and the RFQ timeline $t$ containing at least the bid deadline and bid award deadline. As we discussed earlier, there might be elements of the task network $P$ that are not included in the RFQ. For each task $s \in S_r$ in the RFQ, we must specify a time window $w \in W_r$, consisting of an earliest start time $t_{es}(s, r)$ and a latest finish time $t_{lf}(s, r)$, and a set of precedence relationships $V_r = \{s' \in S_r : s' \prec s\}$, associating $s$ with each of the other tasks $s' \in S_r$ whose completion must precede the start of $s$.

The principal outcome of the RFQ-generation process is a set of values for the early-start and late-finish times for the time windows $W_r$ in the RFQ. We obtain a first approximation using the critical path method (CPM) algorithm (Hillier and Lieberman, 1990), after making some assumptions about the durations of tasks, and about the earliest start times for tasks that have no predecessors in the RFQ (the root tasks $S_R$) and the latest finish times for tasks that have no successors in the RFQ (the leaf tasks $S_L$). Market mean-duration statistics could be used for the task durations.
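Continuing the illustrative representation sketched earlier (the names are again hypothetical, not MAGNET's own data structures), an RFQ with its per-task time windows and timeline might look like this; the precedence pairs and numbers below are our own reading of the wine-bottling example.

```python
from dataclasses import dataclass

@dataclass
class TimeWindow:
    earliest_start: float   # t_es(s, r)
    latest_finish: float    # t_lf(s, r)

@dataclass
class RFQ:
    """An RFQ r = (S_r, V_r, W_r, t) over a subset of the plan's tasks."""
    tasks: set                # S_r: names of the tasks included in this RFQ
    precedes: set             # V_r: (s, s') precedence pairs restricted to S_r
    windows: dict             # W_r: task name -> TimeWindow
    bid_deadline: float       # part of the RFQ timeline t
    bid_award_deadline: float

# Hypothetical instance of RFQ r1 (only the bottling window is filled in here).
r1 = RFQ(tasks={"s1", "s2", "s3", "s4", "s5"},
         precedes={("s1", "s3"), ("s2", "s3"), ("s3", "s5"), ("s4", "s5")},
         windows={"s3": TimeWindow(earliest_start=2.0, latest_finish=9.0)},
         bid_deadline=0.5,
         bid_award_deadline=1.0)
```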
Overall start and finish times for the tasks in the RFQ may come from the bid-process plan, or we may already have commitments that constrain them as a result of other activities. For this discussion, we assume a continuous-time domain, although we realize that many real domains effectively work on a discrete-time basis. Indeed, it is very likely that some of our wine-bottling activities would typically be quoted in whole-day increments. We also ignore calendar issues such as overtime/straight time, weekends, holidays, time zones, etc.

The CPM algorithm walks the directed graph of tasks and precedence constraints, forward from the early-start times of the root tasks to compute the earliest start $t_{es}(s)$ and finish $t_{ef}(s)$ times for each task $s \in S_r$, and then backward from the late-finish times of the leaf tasks to compute the latest finish $t_{lf}(s)$ and start $t_{ls}(s)$ times for each task. The minimum duration of the entire task network specified by the RFQ, defined as $\max_{s' \in S_L} t_{ef}(s') - \min_{s \in S_R} t_{es}(s)$, is called the makespan of the task network. The smallest slack in any leaf task, $\min_{s \in S_L} (t_{lf}(s) - t_{ef}(s))$, is called the total slack of the task network within the RFQ. All tasks $s$ for which $t_{lf}(s) - t_{ef}(s) = \text{total slack}$ are called critical tasks. Paths in the graph through critical tasks are called critical paths.

Some situations will be more complex than this, such as the case when there are constraints that are not captured in the precedence network of the RFQ. For example, some non-leaf task may have successors that are already committed but are outside the RFQ. The CPM algorithm is still applicable, but the definition of critical tasks and critical paths becomes more complex.
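A compact sketch of this forward/backward pass (our own illustration, not the MAGNET implementation; it assumes an acyclic task network, uses the notation above, and takes duration[s] as the assumed task durations):

```python
def cpm(tasks, precedes, duration, es_root, lf_leaf):
    """Critical-path pass over an RFQ task network.
    tasks: iterable of task names; precedes: set of (before, after) pairs;
    duration[s]: assumed duration for task s (e.g. a market mean);
    es_root / lf_leaf: earliest start for root tasks / latest finish for leaf tasks.
    Returns (t_es, t_ef, t_ls, t_lf, makespan, total_slack)."""
    tasks = list(tasks)
    preds = {s: [a for (a, b) in precedes if b == s] for s in tasks}
    succs = {s: [b for (a, b) in precedes if a == s] for s in tasks}

    # Forward pass: earliest start and finish times.
    t_es, t_ef = {}, {}
    def forward(s):
        if s not in t_ef:
            t_es[s] = max((forward(p) for p in preds[s]), default=es_root)
            t_ef[s] = t_es[s] + duration[s]
        return t_ef[s]

    # Backward pass: latest finish and start times.
    t_lf, t_ls = {}, {}
    def backward(s):
        if s not in t_ls:
            t_lf[s] = min((backward(c) for c in succs[s]), default=lf_leaf)
            t_ls[s] = t_lf[s] - duration[s]
        return t_ls[s]

    for s in tasks:
        forward(s)
        backward(s)

    makespan = max(t_ef.values()) - min(t_es.values())
    total_slack = min(t_lf[s] - t_ef[s] for s in tasks if not succs[s])
    return t_es, t_ef, t_ls, t_lf, makespan, total_slack
```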
Fig. 5. Initial time allocations for tasks in RFQ r1. Only the $t_{es}(s)$ and $t_{lf}(s)$ times are actually specified in the RFQ.

Figure 5 shows the result of running the CPM algorithm on the tasks of RFQ r1 from our bid-process plan. We are assuming task durations as given in the individual "task boxes." We observe several problems immediately. The most obvious is that it is likely that many bids returned in response to this RFQ would conflict with one another because they would fail to
combine feasibly. For example, if I had a bid for the label-printing task s4 for days 5–7, then the only bids I could accept for the labeling task s5 would be those that had a late start time at least as late as day 7. If the bids for s5 were evenly distributed across the indicated time windows, and if all of them specified the same 4-day duration, then only 1/3 of those bids could be considered. In general, we want to allow time windows to overlap, but excessive overlap is almost certainly counterproductive. We will revisit this issue shortly. Once we have initial estimates from the CPM algorithm, there are several issues to be resolved, as described in the following sections.

2.4.1 Setting the total slack

The plan may have a hard deadline, which may be set by a user or determined by existing commitments for tasks that cannot be started until tasks in the current RFQ are complete. Otherwise, in the normal case, the bid-process plan is expected to set the time limits for the RFQ. It would be interesting to find a way to use the market to dynamically derive a schedule that maximizes the customer's payoff. This would require cooperation of bidders, and could be quite costly. Parkes and Ungar (2001) have done something like this in a restricted domain, but it is hard to see how to apply it to the more generalized MAGNET domain.

2.4.2 Task ordering

For any pair of tasks in the plan that could potentially be executed in parallel, we may have a choice of handling them either in parallel, or in sequential order. For example, in our wine-bottling example, we could choose to acquire the bottles before buying the corks. In general, if there is uncertainty over the ability to complete tasks which could cause the plan to be abandoned, then (given some straightforward assumptions, such as payments being due when work is completed) the agent's financial exposure can be affected by task ordering. If a risky task is scheduled ahead of a "safe" task, then if the risky task fails we can abandon the plan without having to pay for the safe task. Babanov et al. (2003) have worked out in detail how to use task completion probabilities and discount rates in an expected-utility framework to maximize the probabilistic "certain payoff" for an agent with a given risk-aversion coefficient. For some tasks, linearizing the schedule will extend the plan's makespan, and this must be taken into account in terms of changes to the ultimate payoff. Note that in many cases the agent may have flexibility in the start time as well as the completion time of the schedule. This would presumably be true of our wine-bottling example.

2.4.3 Allocating time to individual tasks

Once we have made decisions about the overall time available and about task ordering, the CPM algorithm gives us a set of preliminary time
windows. In most cases, this will not produce the best results, for several reasons.

Resource availability. In most markets, services will vary in terms of availability and resource requirements. There may be only a few dozen portable bottling and labeling machines in the region, while corks may be stored in a warehouse ready for shipping. There is a high probability that one could receive several bids for delivery of corks on one specific day, but a much lower probability that one could find even one bid for a 6-day bottling job for a specific 6-day period. More likely one would have to allow some flexibility in the timing of the bottling operation in order to receive usable bids.

Lead-time effects. In many industries, suppliers have resources on the payroll that must be paid for whether their services are sold or not. In these cases, suppliers will typically attempt to "book" commitments for their resources into the future. In our example, the chances of finding a print shop to produce our labels tomorrow are probably much lower than the chances of finding shops to print them next month. This means that, at least for some types of services, one must allow more scheduling flexibility to attract short lead-time bids than for longer lead times. We should also expect to pay more for shorter lead times.

Task-duration variability. Some services are very standardized (delivering corks, printing 5000 labels), while others may be highly variable, either because they rely on human creativity (software development) or the weather (bridge construction), or because different suppliers use different processes, different equipment, or different staffing levels (wine bottling). These two types of variability can usually be differentiated by the level of predictability: suppliers that use a predictable process with variable staffing levels are likely to be able to deliver on time on a regular basis, while services that are inherently unpredictable will tend to exhibit frequent deviations from the predictions specified in bids.3 For services that exhibit a high variability in duration, as specified in bids, the customer's strategy may depend on whether a large number of bidders is expected, and whether there is a correlation between bid price and quoted task duration. If a large number of bidders is expected, then the customer may be able to allocate a below-average time window to the task, in the expectation that there will be some suppliers at the lower end of the distribution who will be able to perform within the specified window. However, if few bidders are expected, a larger than average time window may be required in order to achieve a reasonable probability of receiving at least one usable bid.
3 Whether the market or customers would be able to observe these deviations may depend on market rules and incentives, such as whether a supplier can be paid early by delivering early.
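To make the last trade-off concrete (a back-of-the-envelope illustration, not a MAGNET formula; the independence assumption is ours): if each of n expected bidders independently submits a bid that is usable within the offered time window with probability p, and p grows with the width of that window, then the chance of receiving at least one usable bid is 1 - (1 - p)^n.

```python
def p_at_least_one_usable(n_bidders, p_usable):
    """P(at least one usable bid), assuming independent bidders; illustrative only."""
    return 1.0 - (1.0 - p_usable) ** n_bidders

# Many bidders: even a narrow window (low p) may be enough.
print(p_at_least_one_usable(20, 0.15))  # ~0.96
# Few bidders: a wider window (higher p) is needed for comparable assurance.
print(p_at_least_one_usable(3, 0.15))   # ~0.39
print(p_at_least_one_usable(3, 0.65))   # ~0.96
```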
Excessive allocations to non-critical tasks. One obvious problem with the time allocations from the CPM algorithm as shown in Fig. 5 is that non-critical tasks (tasks not on the critical path) are allocated too much time, causing unnecessary overlap in their time windows. All other things being equal, we are likely to be better off if RFQ time windows do not overlap, because we will have fewer infeasible bid combinations.

2.4.4 Trading off feasibility for flexibility

In general we expect more bidders, and lower bid prices, if we offer suppliers more flexibility in scheduling their resources by specifying wider time windows. However, if we define RFQ time windows with excessive overlap, a significant proportion of bid combinations will be unusable due to schedule infeasibility. The intuition is that there will be some realistic market situations where the customer is better off allowing RFQ time windows to overlap to some degree, if we take into account price, plan completion time, and probability of successful plan completion (which requires at minimum a set of bids that covers the task set and can be composed into a feasible schedule). This means that the winner-determination procedure must handle schedule infeasibilities among bids.

Figure 6 shows a possible updated set of RFQ time windows for our wine-bottling example, taking into account the factors we have discussed. We have shortened the time windows for tasks s1 and s2, because we believe that bottles and corks are readily available, and can be delivered when needed. There is no advantage in allowing more time for these tasks. Market data tells us that bottling services are somewhat more difficult to schedule than labeling services, and so we have specified a wider time window for task s3 than for s4. Our deadline is such that the value of completing the work a day or two earlier is higher than the potential loss of having to reject some conflicting bids. We also know from market data that a large fraction of suppliers of the bottling crews can also provide the labeling service, and so the risk of schedule infeasibility will be reduced if we
Fig. 6. Revised time allocations for tasks in RFQ r1.
receive bids for both bottling and labeling. Finally, there is plenty of time available for the non-critical label-printing task s5 without needing to overlap its time window with its successor task s4.

2.5 Evaluating bids

Once an RFQ is issued and the bids are returned, the agent must decide which bids to accept. The bidding process is an extended combinatorial auction, because bids can specify multiple tasks, and there are additional constraints the bids must meet (the precedence constraints) other than just covering the tasks. The winner-determination process must choose a set of bids that maximizes the agent's utility, covers all tasks in the associated RFQ, and forms a feasible schedule.

2.5.1 Formal description of the winner-determination problem

Each bid represents an offer to execute some subset of the tasks specified in an RFQ, for a specified price, within specified time windows. Formally, a bid $b = (r, S_b, W_b, c_b)$ consists of a subset $S_b \subseteq S_r$ of the tasks specified in the corresponding RFQ $r$, a set of time windows $W_b$, and an overall cost $c_b$. Each time window $w_s \in W_b$ specifies for a task $s$ an earliest start time $t_{es}(s, b)$, a latest start time $t_{ls}(s, b)$, and a task duration $d(s, b)$. It is a requirement of the protocol that the time window parameters in a bid $b$ are within the time windows specified in the RFQ, or $t_{es}(s, b) \ge t_{es}(s, r)$ and $t_{ls}(s, b) + d(s, b) < t_{lf}(s, r)$ for a given task $s$ and RFQ $r$. This requirement may be relaxed, although it is not clear why a supplier agent would want to expose resource availability information beyond that required to respond to a particular bid.

For bids that specify multiple tasks, it is also a requirement that the time windows in the bids be internally feasible. In other words, for any bid $b$, if for any two of its tasks $(s_i, s_j) \in S_b$ there is a precedence relation $s_i \prec s_j$ specified in the RFQ, then it is required that $t_{es}(s_i, b) + d(s_i, b) < t_{ls}(s_j, b)$.

A solution to the bid-evaluation problem is defined as a complete mapping $S \rightarrow B$ of tasks to bids in which each task in the corresponding RFQ is mapped to exactly one bid, and that is consistent with the temporal and precedence constraints on the tasks as expressed in the RFQ and the mapped bids.
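These conditions are easy to check mechanically once a candidate mapping is in hand. The sketch below is our own illustration (not the MAGNET winner-determination code): it assumes a complete mapping, propagates earliest feasible start times forward through the RFQ's precedence network using each mapped bid's time window and duration, and rejects the mapping if any task cannot start by its bid's latest start time.

```python
def mapping_is_feasible(mapping, precedes, t_es, t_ls, dur):
    """mapping: task -> bid id, covering every task in the RFQ;
    precedes: set of (s_i, s_j) precedence pairs from the RFQ;
    t_es[(s, b)] / t_ls[(s, b)]: bid b's earliest / latest start for task s;
    dur[(s, b)]: bid b's quoted duration for task s."""
    preds = {s: [a for (a, b) in precedes if b == s] for s in mapping}
    start, done = {}, set()

    def earliest_start(s):
        # Earliest feasible start of s under the mapping, given its predecessors.
        if s in done:
            return start[s]
        b = mapping[s]
        est = t_es[(s, b)]
        for p in preds[s]:
            pb = mapping[p]
            est = max(est, earliest_start(p) + dur[(p, pb)])
        start[s] = est
        done.add(s)
        return est

    # Feasible only if every task can still start within its bid's window.
    return all(earliest_start(s) <= t_ls[(s, mapping[s])] for s in mapping)
```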
Fig. 7. Bid example (RFQ time windows for s3, bottling, and s4, labeling, with four bids: bid 1, bottling, $500; bid 2, labeling, $300; bid 3, bottling, $800; bid 4, bottling and labeling, $1200).

Figure 7 shows a very small example of the problem the bid evaluator must solve. As noted before, there is scant availability of bottling equipment and crews, so we have provided an ample time window for that activity. At the same time, we have allowed some overlap between the bottling and labeling tasks, perhaps because we believed this would attract a large number of bidders with a wide variation in lead times and lower prices. Bid 1 indicates this bottling service is available from day 3 to day 7 only, and will take the full 5 days, but the price is very good. Similarly, bid 2
offers labeling from day 7 to day 10 only, again for a good price. Unfortunately, we cannot use these two bids together because of the schedule infeasibility between them. Bid 3 offers bottling for any 3-day period from day 2 to day 7, at a higher price. We can use this bid with bid 2 if we start on day 4, but if we start earlier we will have to handle the unlabeled bottles somehow. Finally, bid 4 offers both the bottling and labeling services, but the price is higher and we would finish a day later than if we accepted bids 2 and 3.

2.5.2 Evaluation criteria

We have discussed the winner-determination problem in terms of price, task coverage, and schedule feasibility. In many situations, there are other factors that can be at least as important as price. For example, we might know (although the agent might not know) that the bottling machine being offered in bid 3 is prone to breakdown, or that it tends to spill a lot of wine. We might have a long-term contract with one of the suppliers, Hermann, that gives us a good price on fertilizer only if we buy a certain quantity of corks from him every year. We might also know that one of the local printers tends to miss his time estimates on a regular basis, but his prices are often worth the hassle, as long as we build some slack into the schedule when we award a bid to him.

Of course, including these other factors will distort a "pure" auction market, since the lowest-price bidder will not always win. As a practical matter, such factors are commonly used to evaluate potential procurement
decisions, and real market mechanisms must include them if they are to be widely acceptable. Many of these additional factors can be expressed as additional constraints on the winner-determination problem, and some can be expressed as cost factors. These constraints can be as simple as "don't use bid b3" or more complex, as in "if Hermann bids on corks, and if a solution using his bid is no more than 10% more costly than a solution without his bid, then award the bid to Hermann." Some of them can be handled by preprocessing, some must be handled within the winner-determination process, and some will require running it multiple times and comparing results.

Mixed-initiative approaches. There are many environments in which an automated agent is unlikely to be given the authority to make unsupervised commitments on behalf of a person or organization. In these situations, we expect that many of the decision processes we discuss here will be used as decision-support tools for a human decision-maker, rather than as elements of a completely autonomous agent. The decision to award bids is one that directly creates commitment, and so it is a prime candidate for user interaction. We have constructed an early prototype of such an interface. It allows a user to view bids, add simple bid inclusion and exclusion constraints, and run one of the winner-determination search methods. Bids may be graphically overlaid on the RFQ, and both the RFQ and bid time windows are displayed in contrasting colors on a Gantt-chart display. Effective interactive use of the bid-evaluation functions of an agent requires the ability to visualize the plan and bids, to visualize bids in groups with constraint violations highlighted, and to add and update constraints. The winner-determination solver must be accessible and its results presented in an understandable way, and there must be a capability to generate multiple alternative solutions and compare them.
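The simpler of these side constraints (bid exclusions, and conditional preferences such as the Hermann rule quoted above) can be captured as small filters applied before or after winner determination. The sketch below is purely illustrative; the Bid and Solution attributes (bid_id, supplier, cost, bids) are hypothetical, not part of MAGNET.

```python
def exclude_bid(bid_id):
    """Pre-filter implementing a constraint like 'don't use bid b3'."""
    return lambda bids: [b for b in bids if b.bid_id != bid_id]

def prefer_supplier(solutions, supplier, tolerance=0.10):
    """Post-filter: if a solution using this supplier's bids costs no more than
    (1 + tolerance) times the cheapest solution overall, prefer it."""
    best = min(solutions, key=lambda s: s.cost)
    with_supplier = [s for s in solutions
                     if any(b.supplier == supplier for b in s.bids)]
    if with_supplier:
        candidate = min(with_supplier, key=lambda s: s.cost)
        if candidate.cost <= (1 + tolerance) * best.cost:
            return candidate
    return best
```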
2.6 Awarding bids

The result of the winner-determination process is a (possibly empty) mapping $S \rightarrow B$ of tasks to bids. We assume that the bids in this mapping meet the criteria of the winner-determination process: they cover the tasks in the RFQ, they can be composed into a feasible schedule, and they maximize the agent's or user's expected utility. However, we cannot just award the winning bids. In general, a bid $b$ contains one or more offers of services for tasks $s$, each with a duration $d(s, b)$ within a time window $w(s, b)$ that is wider than $d(s, b)$. The price assumes that the customer will specify, as part of the bid award, a specific start time for each activity. Otherwise, the supplier would have to maintain its resource reservation until some indefinite future time when the customer would specify a start time. This would create a disincentive for
suppliers to specify large time windows, raise prices, and complicate the customer’s scheduling problem. This means that the customer must build a final work schedule before awarding bids. We will defer the issue of dealing with schedule changes as work progresses. This scheduling activity represents another opportunity to maximize the customer’s expected utility. In general, the customer’s utility at this point is maximized by appropriate distribution of slack in the schedule, and possibly also by deferring task execution in order to defer payment for completion.
3 Solving the MAGNET winner-determination problem

We now focus on the MAGNET winner-determination problem, originally introduced in Section 2.5. We have previously described both an Integer Programming formulation (Collins and Gini, 2001) and a simulated annealing framework for solving this problem (Collins et al., 2001). In this chapter, we describe an application of the A* method (Russell and Norvig, 1995). For simplicity, the algorithm presented here solves the winner-determination problem assuming that the payoff does not depend on completion time.

The A* algorithm is a method for finding optimal solutions to combinatorial problems that can be decomposed into a series of discrete steps. A classic example is finding the shortest route between two points in a road network. A* works by constructing a tree of partial solutions. In general, tree search methods such as A* are useful when the problem can be characterized by a solution path in a tree that starts at an initial node (root) and progresses through a series of expansions to a final node that meets the solution criteria. Each expansion generates successors (children) of some existing node, expansions continuing until a solution node is found. The questions of which node is chosen for expansion, and how the search tree is represented, lead to a family of related search methods. In the A* method, the node chosen for expansion is the one with the "best" evaluation,4 and the search tree is typically kept in memory in the form of a sorted queue. A* uses an evaluation function

$$ f(N) = g(N) + h(N) \qquad (1) $$

for a node $N$, where $g(N)$ is the cost of the path from the initial node $N_0$ to node $N$, and $h(N)$ is an estimate of the remaining cost to a solution node. If $h(N)$ is a strict lower bound on the remaining cost (upper bound for a maximization problem), we call it an admissible heuristic and A* is complete and optimal; that is, it is guaranteed to find a solution with the lowest
Lowest for a minimization problem, highest for a maximization problem.
evaluation, if any solutions exist, and it is guaranteed to terminate eventually if no solutions exist.

The winner-determination problem for combinatorial auctions has been shown to be NP-complete and inapproximable (Sandholm, 1999). This result clearly applies to the MAGNET winner-determination problem, since we simply apply an additional set of (temporal) constraints to the basic combinatorial-auction problem, and we do not allow free disposal (because we want a set of bids that covers all tasks). In fact, because the additional constraints create additional bid-to-bid dependencies, and because bids can vary in both price and in time specifications, the bid-domination and partitioning methods used by others to simplify the problem (for example, see Sandholm, 2002) cannot be applied in the MAGNET case. Sandholm has shown that there can be no polynomial-time solution, nor even a polynomial-time bounded approximation (Sandholm, 2002), so we must accept exponential complexity. We have shown in Collins (2002) that we can determine probability distributions for search time, based on problem size metrics, and we can use those empirically determined distributions in our deliberation scheduling process.

Sandholm described an approach to solving the standard combinatorial-auction winner-determination problem (Sandholm, 2002) using an iterative-deepening A* formulation. Although many of his optimizations, such as the elimination of dominated bids and partitioning of the problem, cannot be easily applied to the MAGNET problem, we have adapted the basic structure of Sandholm's formulation, and we have improved upon it by specifying a means to minimize the mean branching factor in the generated search tree. We describe a basic A* formulation of the MAGNET winner-determination problem, and then we show how this formulation can be adapted to a depth-first iterative-deepening model (Korf, 1985) to reduce or eliminate memory limitations.

3.1 Bidtree framework

Our formulation depends on two structures which must be prepared before the search can run. The first is the bidtree introduced by Sandholm, and the second is the bid-bucket, a container for the set of bids that cover the same task set. A bidtree is a binary tree that allows lookup of bids based on item content. The bidtree is used to determine the order in which bids are considered during the search, and to ensure that each bid combination is tested at most once. In Sandholm's formulation, the collection of bids into groups that cover the same item sets supports the discard of dominated bids, with the result that each leaf in the bidtree contains one bid.
However, because our precedence constraints create dependencies among bids in different buckets, bid-domination is a much more complex issue in the MAGNET problem domain. Therefore, we use bid-buckets at the leaves rather than individual bids.

The principal purpose of the bidtree is to support content-based lookup of bids. Suppose we have a plan $S$ with tasks $s_m$, $m = 1, \ldots, 4$. Further suppose that we have received a set of bids $b_n$, $n = 1, \ldots, 10$, with the following contents: b1: {s1, s2}, b2: {s2, s3}, b3: {s1, s4}, b4: {s3, s4}, b5: {s2}, b6: {s1, s2, s4}, b7: {s4}, b8: {s2, s4}, b9: {s1, s2}, b10: {s2, s4}. Figure 8 shows a bidtree we might construct for this problem. Each node corresponds to a task. One branch, labeled in, leads to bids that include the task, and the other branch, labeled out, leads to bids that do not.

We use the bidtree by querying it for bid-buckets. A query consists of a mask, a vector of values whose successive entries correspond to the "levels" in the bidtree. Each entry in the vector may take on one of the three values {in, out, any}. A query is processed by walking the bidtree from its root as we traverse the vector. If an entry in the mask vector is in, then the in branch is taken at the corresponding level of the tree, and similarly with out. If an entry is any, then both branches are taken at the corresponding level of the bidtree. So, for example, if we used a mask of [in, any, any, in], the bidtree in Fig. 8 would return the bid-buckets containing {b6} and {b3}.

A bid-bucket is a container for a set of bids that cover the same task set. In addition to the bid set, the bid-bucket structure stores the list of other bid-buckets whose bids conflict with its own (where we use "conflicts" to mean that they cover overlapping task sets). This recognizes the fact that all bids with the same task set will have the same conflict set.
Fig. 8. Example bidtree, lexical task order.
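A compact way to realize the bidtree and the mask query just described (an illustrative sketch, not the MAGNET implementation): the tree is built over a fixed task ordering, with in/out branches per task and bid-buckets at the leaves. Run on the ten example bids, the query [in, any, any, in] returns the buckets for b6 and b3, as in Fig. 8.

```python
IN, OUT, ANY = "in", "out", "any"

class BidTree:
    """Binary tree over a fixed task ordering; each level tests whether a bid's
    task set contains that task. Leaves hold bid-buckets (bids whose task sets
    are identical)."""

    def __init__(self, task_order):
        self.task_order = task_order
        self.root = {}   # nested dicts keyed by IN/OUT; leaves are lists (buckets)

    def insert(self, bid_id, task_set):
        node = self.root
        for i, task in enumerate(self.task_order):
            branch = IN if task in task_set else OUT
            last = (i == len(self.task_order) - 1)
            node = node.setdefault(branch, [] if last else {})
        node.append(bid_id)

    def lookup(self, mask):
        """mask: one of IN/OUT/ANY per task, in task order; returns bid-buckets."""
        frontier = [self.root]
        for want in mask:
            nxt = []
            for node in frontier:
                for branch in (IN, OUT):
                    if want in (branch, ANY) and branch in node:
                        nxt.append(node[branch])
            frontier = nxt
        return frontier   # each element is one bucket (a list of bid ids)

# The ten bids from the example above:
bids = {"b1": {"s1", "s2"}, "b2": {"s2", "s3"}, "b3": {"s1", "s4"},
        "b4": {"s3", "s4"}, "b5": {"s2"}, "b6": {"s1", "s2", "s4"},
        "b7": {"s4"}, "b8": {"s2", "s4"}, "b9": {"s1", "s2"},
        "b10": {"s2", "s4"}}
tree = BidTree(["s1", "s2", "s3", "s4"])
for bid_id, tasks in bids.items():
    tree.insert(bid_id, tasks)
print(tree.lookup([IN, ANY, ANY, IN]))   # -> buckets containing b6 and b3
```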
To support computation of the heuristic function, we use a somewhat different problem formulation for A* and IDA* than we used for the IP formulation described in Collins and Gini (2001). In that formulation, we were minimizing the sum of the costs of the selected bids. In this formulation, we minimize the cost of each of the tasks, given a set of bid assignments. This allows for straightforward computation of the A* heuristic function $f(N)$ for a given node $N$ in the search tree. We first define

$$ f(N) = g(S_m(N)) + h(S_u(N)) \qquad (2) $$

where $S_m(N)$ is the set of tasks that are mapped to bids in node $N$, while $S_u(N) = S_r \setminus S_m(N)$ is the set of tasks that are not mapped to any bids in the same node. We then define

$$ g(S_m(N)) = \sum_{j \,|\, s_j \in S_m} \frac{c(b_j)}{n(b_j)} \qquad (3) $$

where $b_j$ is the bid mapped to task $s_j$, $c(b_j)$ is the total cost of $b_j$, $n(b_j)$ is the number of tasks in $b_j$, and

$$ h(S_u(N)) = \sum_{j \,|\, s_j \in S_u} \frac{c(\bar{b}_j)}{n(\bar{b}_j)} \qquad (4) $$

where $\bar{b}_j$ is the "usable" bid for task $s_j$ that has the lowest cost per task. By "usable," we mean that the bid $\bar{b}_j$ includes $s_j$ and does not conflict (in the sense of having overlapping task sets) with any of the bids $b_j$ already mapped in node $N$. Note that the definition of $g(S_m(N))$ can be expanded to include other factors, such as risk estimates or penalties for inadequate slack in the schedule, and these factors can be non-linear. The only requirement is that any such additional factor must increase the value of $g(S_m(N))$, not decrease it, because otherwise the admissibility of the heuristic $h(S_u(N))$ would be compromised and we would no longer have an optimal search method.
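As a concrete reading of Eqs. (2)-(4) (our own sketch, not the MAGNET code): each selected bid's price is spread evenly over the tasks it covers, and each unmapped task is charged the cheapest per-task price among the bids that could still supply it.

```python
def node_evaluation(mapping, all_tasks, bids):
    """mapping: task -> bid id for the tasks already mapped in node N.
    bids: bid id -> (cost, task_set, conflicts), where conflicts is the set of
    bid ids whose task sets overlap this bid's. Returns (f, g, h); h is the
    admissible estimate of Eq. (4), or infinity if some unmapped task has no
    usable bid left, in which case the node can be pruned."""
    used = set(mapping.values())

    # g(S_m): allocated cost of the bids already selected, per Eq. (3).
    g = sum(bids[b][0] / len(bids[b][1]) for b in mapping.values())

    # h(S_u): for each unmapped task, the lowest per-task cost among bids that
    # cover it and do not conflict with any already-selected bid, per Eq. (4).
    h = 0.0
    for s in all_tasks:
        if s in mapping:
            continue
        candidates = [bids[b][0] / len(bids[b][1])
                      for b in bids
                      if s in bids[b][1] and not (bids[b][2] & used)]
        if not candidates:
            return float("inf"), g, float("inf")
        h += min(candidates)
    return g + h, g, h
```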
3.2 A* formulation

Fig. 9. Bidtree-based A* search algorithm.
In line 17 of Fig. 10, we must find the minimum-cost ‘‘usable’’ bids for all unallocated tasks S u (tasks not in the union of the task sets of BN 0 ), as discussed earlier. One way (not necessarily the most efficient way) to find the set of usable bids is to query the bidtree using the mask that was generated in line 14, changing the single in entry to any. If there is any unallocated task that is not covered by some bid in the resulting set, then we can discard node Nu because it cannot lead to a solution (line 22). Because all other bids in the same bidtree leaf node with the candidate-bid bx will produce the same bidtree mask and the same usable-bid set, we can also discard all other bids in that leaf node from the candidate set of the parent node N. This implementation is very time-efficient but A fails to scale to large problems because of the need to keep in the queue all nodes that have not been fully expanded. Limiting the queue length destroys the optimality and completeness guarantees. Some improvement in memory usage can be achieved by setting an upper bound once the first solution is found in line 18 of Fig. 10. Once an upper bound flimit exists, then any node N for which f(N)Wflimit can be safely discarded, including nodes already on the queue. Unfortunately, this helps only on the margin; there will be a very small number of problems for which the resulting reduction in maximum queue size will be sufficient to convert a failed or incomplete search into a complete one. We address this in the next section. One of the design decisions that must be made when implementing a bidtree-based search is how to order the tasks (or items, in the case of a
Fig. 10. Bidtree-based node-expansion algorithm.

One of the design decisions that must be made when implementing a bidtree-based search is how to order the tasks (or items, in the case of a
standard combinatorial auction) when building the bidtree. It turns out that this decision can have a major impact on the size of the tree that must be searched, and therefore on performance and predictability. As we have shown in Collins et al. (2002), the tasks should be ordered such that tasks with higher numbers of bids come ahead of tasks with lower numbers of bids. This ordering is exploited in line 18 of Fig. 10, where bid conflicts are detected.

3.3 Iterative-deepening A*

Iterative-deepening A* (IDA*) (Korf, 1985) is a variant of A* that uses the same two functions g and h in a depth-first search, and which keeps in memory only the current path from the root to a particular node. In each iteration of IDA*, search depth is limited by a threshold value $f_{limit}$ on the
Fig. 11. Bidtree-based iterative-deepening A* search algorithm: top level.
evaluation function $f(N)$. We show in Fig. 11 a version of IDA* that uses the same bidtree and node structure as the A* algorithm. The recursive core of the algorithm is shown in Fig. 12. This search algorithm uses the same node-expansion algorithm as we used for the A* search, shown in Fig. 10.

Complete solutions are detected in line 10 of Fig. 12. Because we are doing a depth-first search with $f(N) < f_{limit}$, we have no way of knowing whether there might be another solution with a lower value of $f(N)$. But we do have an upper bound on the solution cost at this point, so whenever a solution is found, the value of $f_{limit}$ can be updated in line 11 of Fig. 12. This follows the usage in Sandholm (2002), and limits exploration to nodes (and solutions) that are better than the best solution found so far. Nodes are tested for feasibility in line 17 of Fig. 12 to prevent consideration and further expansion of nodes that cannot possibly lead to a solution.

There is a single tuning parameter $z$, shown in line 19 of Fig. 11, which must be a positive number greater than 1. This controls the amount of additional depth explored in each iteration of the main loop that starts on line 15. If $z$ is too small, then dfs_contour() is called repeatedly to expand essentially the same portion of the search tree, and progress toward a solution is slow. However, if $z$ is too large, large portions of the search tree leading to suboptimal solutions will be explored unnecessarily. In general, more effective heuristic functions (functions $h(N)$ that are more accurate estimates of the remaining cost) lead to lower values of $z$.
Fig. 12. Bidtree-based iterative-deepening A* search algorithm: depth-first contour.
using the heuristic shown in Eq. (4) shows that a good value is z = 1.15, and that it is only moderately sensitive (performance falls off noticeably with z < 1.1 or z > 1.2).
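The control structure just described can be sketched as follows. This is a minimal, generic IDA* loop written only to illustrate the roles of f_limit and the widening factor z; it is not the authors' implementation (their algorithms appear in Figs. 11 and 12), and the node fields (g, h, is_solution) and the expand_node routine are hypothetical placeholders.

```python
import math

def ida_star(root, expand_node, z=1.15):
    """Generic IDA* contour search: depth-first passes bounded by f_limit,
    with the bound widened by the factor z (z > 1) after each failed pass."""
    assert z > 1.0, "z must be a positive number greater than 1"
    f_limit = root.g + root.h          # f(N) = g(N) + h(N) of the root node
    best = None                        # cheapest solution found so far

    def dfs_contour(node):
        nonlocal best, f_limit
        f = node.g + node.h
        if f > f_limit:
            return f                   # cheapest f value just beyond the contour
        if node.is_solution():
            if best is None or node.g < best.g:
                best = node
                f_limit = node.g       # explore only strictly better solutions now
            return f
        next_f = math.inf
        for child in expand_node(node):    # e.g., the node expansion of Fig. 10
            next_f = min(next_f, dfs_contour(child))
        return next_f

    while best is None:
        next_f = dfs_contour(root)
        if math.isinf(next_f):
            return None                # search space exhausted: no solution
        f_limit = max(next_f, z * f_limit)   # widen the contour for the next pass
    return best
```

The final line is where z matters: taking the larger of the next unexplored f value and z times the old threshold keeps successive contours from re-expanding essentially the same nodes when z is chosen too small.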
4 Related work
This work draws from several fields. In Computer Science, it is related to work in artificial intelligence and autonomous agents. In Economics, it draws from auction theory and expected-utility theory. From Operations Research, we draw from work in combinatorial optimization.

4.1 Multi-agent negotiation

MAGNET proposes using an auction paradigm to support problem-solving interactions among autonomous, self-interested, heterogeneous agents. Several other approaches to multi-agent problem-solving have been proposed. Some of them use a "market" abstraction, and some do not. Rosenschein and Zlotkin (1994) show how the behavior of agents can be influenced by the set of rules system designers choose for their agents' environment. In their study the agents are homogeneous and there are no side payments. In other words, the goal is to share the work, in a more or
less ‘‘equitable’’ fashion, but not to have agents pay other agents for work. They also assume that each agent has sufficient resources to handle all the tasks, while we assume the contrary. In Sandholm’s TRACONET system (Sandholm, 1996; Sandholm and Lesser, 1995), agents redistribute work among themselves using a contracting mechanism. Sandholm considers agreements involving explicit payments, but he also assumes that the agents are homogeneous—they have equivalent capabilities, and any agent can handle any task. MAGNET agents are heterogeneous, and in general do not have the resources or capabilities to carry out the tasks necessary to meet their own goals without assistance from others. Both Pollack’s DIPART system (Pollack, 1996) and the Multiagent Planning architecture (MPA) (Wilkins and Myers, 1998) assume multiple agents that operate independently. However, in both of those systems the agents are explicitly cooperative, and all work toward the achievement of a shared goal. MAGNET agents are trying to achieve their own goals and to maximize their own profits; there is no global or shared goal. 4.1.1 Solving problems using markets and auctions MAGNET uses an auction-based negotiation style because auctions have the right economic and motivational properties to support ‘‘reasonable’’ resource allocations among heterogeneous, self-interested agents. However, MAGNET uses the auction approach not only to allocate resources, but also to solve constrained scheduling problems. A set of auction-based protocols for decentralized resource-allocation and scheduling problems is proposed in Wellman et al. (2001). The analysis assumes that the items in the market are individual discrete time slots for a single resource, although there is a brief analysis of the use of the generalized Vickrey auctions (Varian and MacKie-Mason, 1995) to allow for combinatorial bidding. A combinatorial-auction mechanism for dynamic creation of supply chains was proposed and analyzed in Walsh et al. (2000). This system deals with the constraints that are represented by a multi-level supply-chain graph, but does not deal with temporal and precedence constraints among tasks. MAGNET agents must deal with multiple resources and continuous time, but we do not currently deal explicitly with multi-level supply chains.5 Several proposed bidding languages for combinatorial auctions allow bidders to express constraints, for example, Boutilier and Hoos (2001) and Nisan (2000b). However, these approaches only allow bidders to communicate constraints to the bid-taker (suppliers to the customer, in the MAGNET scenario), while MAGNET needs to communicate constraints in both directions. 5 Individual MAGNET agents can deal with multi-level supply chains by subcontracting, but this requires that the initial time allocation provide sufficient slack for the extra negotiation cycles.
4.1.2 Infrastructure support for negotiation

Markets play an essential role in the economy (Bakos, 1998), and market-based architectures are a popular choice for multiple agents (see, for instance, Chavez and Maes, 1996; Rodriguez et al., 1997; Sycara and Pannu, 1998; Wellman and Wurman, 1998; and our own MAGMA architecture, Tsvetovatyy et al., 1997). Most market architectures limit the interactions of agents to manual negotiations, direct agent-to-agent negotiation (Faratin et al., 1997; Sandholm, 1996), or some form of auction (Wurman et al., 1998). The Michigan Internet AuctionBot (Wurman et al., 1998) is a very interesting system, in that it is highly configurable, able to handle a wide variety of auction rules. It is the basis for the ongoing Trading Agent Competition (Collins et al., 2004), which has stimulated interesting research on bidding behavior in autonomous agents, such as Stone et al. (2002).

Matchmaking, the process of making connections among agents that request services and agents that provide services, will be an important issue in a large community of MAGNET agents. The process is usually done using one or more intermediaries, called middle-agents (Sycara et al., 1997). Sycara et al. (1999) present a language that can be used by agents to describe their capabilities, along with algorithms that use it to match agents over the Web. Our system casts the market in the role of matchmaker. The MAGNET market infrastructure depends on an ontology to describe services that can be traded and the terms of discourse among agents. There has been considerable attention to the development of detailed ontologies for describing business and industrial domains (Fox, 1996; Gruninger and Fox, 1994; Schlenoff et al., 1998).

4.2 Combinatorial auctions

Determining the winners of a combinatorial auction (McAfee and McMillan, 1987) is an NP-complete problem, equivalent to the weighted bin-packing problem. A good overview of the problem and approaches to solving it is given in de Vries and Vohra (2001). Dynamic programming (Rothkopf et al., 1998) works well for small sets of bids, but it does not scale well, and it imposes significant restrictions on the bids. Sandholm (2002) and Sandholm and Suri (2003) relax some of the restrictions and present an algorithm for optimal selection of combinatorial bids, but their bids specify only a price and a set of items. Hoos and Boutilier (2000) describe a stochastic local search approach to solving combinatorial auctions, and characterize its performance with a focus on time-limited situations. A key element of their approach involves ranking bids according to expected revenue; it is very hard to see how this could be adapted to the MAGNET domain with temporal and precedence constraints, and without
free disposal.6 Andersson et al. (2000) describe an Integer Programming approach to the winner-determination problem in combinatorial auctions. Nisan (2000b) extends this model to handle richer bidding languages for combinatorial auctions, and we have extended it to handle the MAGNET situation in Collins and Gini (2001). More recently, Sandholm (Sandholm and Suri, 2003) has described an improved winner-determination algorithm called BOB that uses a combination of linear programming and branch-and-bound techniques. It is not clear how this technique could be extended to deal with the temporal constraints in the MAGNET problem, although the bid-graph structure may be of value.

One of the problems with combinatorial auctions is that they are nearly always run in a single-round, sealed-bid format, and this is the format MAGNET uses. Parkes and Ungar (2000) have shown how to organize multiple-round combinatorial auctions. Another problem is that the items in a combinatorial auction are treated as single, indivisible units; there is no notion of quantity. MAGNET will eventually need to address this. This limitation is overcome in Leyton-Brown et al. (2000) for simple items without side constraints. The addition of precedence constraints would seriously complicate their procedure, and it has not yet been attempted.

4.3 Deliberation scheduling

The principal reason we are interested in search performance is that the search is embedded in a real-time negotiation scenario, and time must be allocated to it before bids are received, and therefore before the exact dimensions of the problem are known. In Greenwald and Dean (1995), deliberation scheduling is done with the aid of anytime and contract algorithms, and performance profiles. An anytime algorithm is one that produces a continuously improving result given additional time, and a contract algorithm is one that produces a result of a given quality level in a given amount of time, but may not improve given additional time. The best winner-determination algorithms we know of for the MAGNET problem have marginal anytime characteristics, and we know of no applicable contract-type algorithms. In fact, Sandholm (2002) presents an inapproximability result for the winner-determination problem, leading us to believe that there may not be an acceptable contract algorithm.

One way to think about deliberation scheduling is to assign the time required for deliberation a cost, and then to balance the cost of deliberation against the expected benefit to be gained by the results of the deliberation. This is the approach taken in Boddy and Dean (1994). However, much of this analysis assumes that there is a "default" action or state that can be

6 Under the "free disposal" assumption, the goal is to maximize revenue even if this means failing to allocate all the items at auction.
used or attained without spending the deliberation effort, and that there is a clear relationship between the time spent in deliberation and the quantifiable quality of the result. In the MAGNET case, the alternative to deliberation is to do nothing.
5 Conclusions
We have examined the problem of rational economic agents who must negotiate among themselves in a market environment in order to acquire the resources needed to accomplish their goals. We are interested in agents that are self-interested and heterogeneous, and we assume that a plan to achieve an agent’s goal may be described in the form of a task network, containing task descriptions, precedence relationships among tasks, and time limits for individual tasks. Negotiation among agents is carried out by holding combinatorial reverse auctions in a marketplace, in which a customer agent offers a task network in the form of a RFQ. Supplier agents may then place bids on portions of the task network, each bid specifying the tasks they are interested in undertaking, durations and time limits for those tasks, and a price for the bid as a whole. The presence of temporal and precedence constraints among the items at auction requires extensions to the standard winner-determination procedures for combinatorial auctions, and the use of the enhanced winner-determination procedure within the context of a real-time negotiation requires us to be able to predict its runtime when planning the negotiation process. There are a number of real-world business scenarios where such a capability would be of value. These include flexible manufacturing, mass customization, travel arrangement, logistics and international shipping, health care resource management, and large-scale systems management. Each of these areas is characterized by limited capabilities and suboptimal performance, due at least in part to the limits imposed by human problemsolving capabilities. In each of these areas, a general ability to coordinate plans among multiple independent suppliers would be of benefit, but does not exist or is not used effectively because of an inability to solve the resulting combinatorial problems. The use of extended combinatorial auctions such as we propose is one approach to solving these problems. There are many difficulties yet to be overcome before this vision can be realized, however, not the least of which is that such auction-based markets would not be effective without wide adoption of new technology across an industry, and a willingness to delegate at least some level of autonomy and authority to that new technology. We have designed and implemented a testbed, which we call MAGNET for multi-agent negotiation testbed, to begin exploring and testing this vision. It includes a customer agent, a rudimentary market infrastructure, and a simple simulation of a population of supplier agents. The customer
agent implementation is designed so that virtually all behaviors can be specified and implemented in terms of responses to events. Events can be external occurrences, internal state changes, or the arrival of a particular point in time. The MAGNET software package is available to the research community under an open-source license. When a goal arises, the agent and its principal must develop a plan, in the form of a task network. Once a plan is available, a bid-process plan must be developed to guide the negotiation process. The bid-process plan specifies which tasks are to be offered in which markets, allocates time to the bidding process and to the plan execution, and may split the bidding into phases in order to mitigate risk. For each bidding step in the bid-process plan, time must be allocated to the customer to compose its RFQ, to the supplier to compose bids, and to the customer to evaluate bids. For each auction episode specified in the bid-process plan, a RFQ must be composed. The RFQ specifies a subset of tasks in the task network, and for each task, it specifies a time window within which that task must be accomplished. The setting of time windows is critical, because it influences the likelihood that bidders will bid, the prices bidders are likely to charge, and the difficulty of the resulting winner-determination process. If the time windows specified in the RFQ allow task precedence relationships to be violated, then the winner-determination process will need to choose a set of bids that can be composed into a feasible schedule. Once the RFQ has been issued and bids received, the agent must determine winners. We have described an optimal algorithm for determining winners based on an IDA framework. Much work remains to be done before the vision of the MAGNET project is fully realized. Some of that work, particularly with respect to the supplier agent and its decision processes, is already under way by other members of the team. With respect to the customer agent, many of the decision processes outlined in Section 2 still need to be worked out and tested. The present work has resulted in models for the auction winner-determination problem and the time that must be allocated to it. For the remainder of the decisions, we need models that will maximize the expected utility of the agent or its principal. These include composing the plan, developing the bid-process plan, allocating time to the deliberation processes of the customer and suppliers, balancing negotiation time against plan execution time, setting the time windows in the RFQ, scheduling the work in preparation for awarding bids, and dealing with unexpected events during plan execution. Babanov et al. (2003) have addressed the problem of setting time windows in the customer’s RFQ. The language we currently use for plans and bids treats tasks as simple atomic objects, without attributes. There are many real-world problems in which attributes are important, both for specifying tasks and for expressing offers in bids. Examples include colors, quantities, dimensions, and quality attributes. In addition, many real-world operations operate on a ‘‘flow’’
basis. This includes the wine-making example we used in Chapter 2, in which the precedence between filling bottles and applying labels would normally be applied bottle-by-bottle, and not at the batch level. In addition, the expressivity of our bidding language is limited. A number of proposals have been made for more expressive bidding languages in combinatorial auctions (Boutilier and Hoos, 2001; Nisan, 2000b). Bidding can also be done with oracles, which are functions passed from bidder to customer that can be evaluated to produce bid conditions. Some features of a more expressive bidding language would likely have minimal impact on the winner-determination process (parameterized quality values, for example), while others, including the use of oracles, could require wholesale re-invention.
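To make the preceding point about expressiveness concrete, the following sketch shows one possible shape for an extended bid. It is purely illustrative: the TaskOffer and Bid structures, their field names, and the oracle signature are assumptions made for this example, not MAGNET's actual message format.

```python
# Hypothetical illustration of an extended bid: task-level attributes plus an
# optional "oracle", a function supplied by the bidder that the customer can
# evaluate to obtain bid conditions. All names are illustrative only.

from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class TaskOffer:
    task_id: str
    earliest_start: float
    latest_finish: float
    duration: float
    attributes: dict = field(default_factory=dict)   # e.g. color, quantity, quality

@dataclass
class Bid:
    bidder: str
    offers: list                  # TaskOffer objects covering a subset of the tasks
    price: float                  # price for the bid as a whole
    # Optional oracle: given a tentative schedule, returns an adjusted price,
    # or None if the schedule is unacceptable to the bidder.
    oracle: Optional[Callable[[dict], Optional[float]]] = None

def effective_price(bid: Bid, schedule: dict) -> Optional[float]:
    """Price of the bid under a tentative schedule, evaluating the oracle if present."""
    if bid.oracle is None:
        return bid.price
    return bid.oracle(schedule)
```

A winner-determination procedure would then have to call something like effective_price for each candidate allocation, which is exactly where features such as oracles would complicate the search.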
References Andersson, A., M., Tenhunen, F. Ygge (2000). Integer programming for combinatorial auction winner determination, in: Proceedings of 4th International Conference on Multi-Agent Systems, Boston, MA, pp. 39–46. Babanov, A., J. Collins, M. Gini (2003). Asking the right question: risk and expectation in multi-agent contracting. Artificial Intelligence for Engineering Design, Analysis and Manufacturing 17, 173–186. Bakos, Y. (1998). The emerging role of electronic marketplaces on the Internet. Communications of the ACM 41, 33–42. Boddy, M., T. Dean (1994). Decision-theoretic deliberation scheduling for problem solving in timeconstrained environments. Artificial Intelligence 67, 245–286. Boutilier, C., H.H. Hoos (2001) Bidding languages for combinatorial auctions, in: Proceedings of the 17th Joint Conference on Artificial Intelligence, Seattle, WA, pp. 1211–1217. Collins, J. (2002). Solving combinatorial auctions with temporal constraints in economic agents. PhD thesis, University of Minnesota, Minneapolis, MN. Collins, J., R. Arunachalam, N. Sadeh, J. Ericsson, N. Finne, S. Janson (2004). The supply chain management game for the 2005 trading agent competition. Technical Report CMU-ISRI-04-139, Carnegie Mellon University, Pittsburgh, PA. Collins, J., C. Bilot, M. Gini, B. Mobasher (2000). Mixed-initiative decision support in agent-based automated contracting, in: Proceedings of the Fourth International Conference on Autonomous Agents, Barcelona, Catalonia, Spain, pp. 247–254. Collins, J., C. Bilot, M. Gini, B. Mobasher (2001). Decision processes in agent-based automated contracting. IEEE Internet Computing 5, 61–72. Collins, J., G. Demir, M. Gini (2002). Bidtree ordering in IDA combinatorial auction winnerdetermination with side constraints, in: J. Padget, O. Shehory, D. Parkes, N. Sadeh, W. Walsh (eds.), Agent Mediated Electronic Commerce IV, Vol. LNAI2531. Springer-Verlag, Berlin, pp. 17–33. Collins, J., M. Gini (2001). An integer programming formulation of the bid evaluation problem for coordinated tasks, in: B. Dietrich, R.V. Vohra (eds.), Mathematics of the Internet: E-Auction and Markets. Volume 127: IMA Volumes in Mathematics and its Applications. Springer-Verlag, New York, pp. 59–74. Collins, J., S. Jamison, M. Gini, B. Mobasher (1997). Temporal strategies in a multi-agent contracting protocol, in: AAAI-97 Workshop on AI in Electronic Commerce. Providence, RI. Collins, J., M. Tsvetovat, B. Mobasher, M., Gini (1998). MAGNET: A multi-agent contracting system for plan execution, in: Proceedings of SIGMAN, AAAI Press, Menlo Park, CA, pp. 63–68. Chavez, A., P. Maes (1996). Kasbah An agent marketplace for buying and selling goods, in: Proceedings of the First International Conference on the Practical Application of Intelligent Agents and Multi-Agent Technology, London, UK, Practical Application Company, pp. 75–90.
de Vries, S., R. Vohra (2001). Combinatorial auctions: a survey. Technical report, Tech-nische Universita¨t Mu¨nchen, Munich. Faratin, P., C. Sierra, N.R. Jennings (1997). Negotiation decision functions for autonomous agents. International Journal of Robotics and Autonomous Systems 24, 159–182. Fox, M.S. (1996). An organization ontology for enterprise modeling: preliminary concepts. Computers in Industry 19, 123–134. Greenwald, L., T. Dean (1995). Anticipating computational demands when solving time-critical decision-making problems, in: K. Goldberg, D. Halperin, J. Latombe, R. Wilson (eds.), The Algorithmic Foundations of Robotics. A. K. Peters, Boston, MA. Gruninger, M., M.S. Fox (1994). An activity ontology for enterprise modelling, in: Workshop on Enabling Technologies-Infrastructures for Collaborative Enterprises, West Virginia University. Hillier, F.S., G.J. Lieberman (1990). Introduction to Operations Research. McGraw-Hill, New York. Hoos, H.H., C. Boutilier (2000). Solving combinatorial auctions using stochastic local search, in: Proceedings of the Seventeen National Conference on Artificial Intelligence, Austin, TX, pp. 22–29. Korf, R.E. (1985). Depth-first iterative deepening: an optimal admissible tree search. Artificial Intelligence 27, 97–109. Leyton-Brown, K., Y. Shoham, M. Tennenholtz (2000). An algorithm for multi-unit combinatorial auctions, in: Proceedings of the Seventeenth National Conference on Artificial Intelligence, Austin, Texas. McAfee, R., P.J. McMillan (1987). Auctions and bidding. Journal of Economic Literature 25, 699–738. Nisan, N. (2000a). Bidding and allocation in combinatorial auctions. Technical report, Institute of Computer Science, Hebrew University, Jerusalem. Nisan, N. (2000b). Bidding and allocation in combinatorial auctions, in: Proceedings of ACM Conference on Electronic Commerce (EC’00), Minneapolis, MN, ACM SIGecom, ACM Press, pp. 1–12. Parkes, D.C., L.H. Ungar (2000). Iterative combinatorial auctions: theory and practice, in: Proceedings of the Seventeenth National Conference on Artificial Intelligence, Austin, TX, pp. 74–81. Parkes, D.C., L.H. Ungar (2001). An auction-based method for decentralized train scheduling, in: Proceedings of the Fifth International Conference on Autonomous Agents, Montreal, Quebec, ACM Press, pp. 43–50. Pollack, M.E. (1996). Planning in dynamic environments: the DIPART system, in: A. Tate (ed.), Advanced Planning Technology. AAAI Press, Menlo Park, CA. Rodriguez, J.A., P. Noriega, C. Sierra, J. Padget (1997). FM96.5-A Java-based electronic auction house, in: Second International Conf on The Practical Application of Intelligent Agents and Multi-Agent Technology (PAAM’97), London, pp. 207–224. Rosenschein, J.S., G. Zlotkin (1994). Rules of Encounter. MIT Press, Cambridge, MA. Rothkopf, M.H., A. Peke˘c, R.M. Harstad (1998). Computationally manageable combinatorial auctions. Management Science 44, 1131–1147. Russell, S.J., P. Norvig (1995). Artificial Intelligence: A Modern Approach. Prentice-Hall, Upper Saddle River, NJ. Sandholm, T. (1999). An algorithm for winner determination in combinatorial auctions, in: Proceedings of the 16th Joint Conference on Artificial Intelligence, Orlando, FL, pp. 524–547. Sandholm, T. (2002). Algorithm for optimal winner determination in combinatorial auctions. Artificial Intelligence 135, 1–54. Sandholm, T.W. (1996). Negotiation among self-interested computationally limited agents. PhD thesis, Department of Computer Science, University of Massachusetts at Amherst, Amherst, MA. 
Sandholm, T.W., V. Lesser (1995). On automated contracting in multi-enterprise manufacturing, in: Distributed Enterprise: Advanced Systems and Tools, Edinburgh, Scotland, pp. 33–42. Sandholm, T., S. Suri (2003). Bob: improved winner determination in combinatorial auctions and generalizations. Artificial Intelligence 145, 33–58. Schlenoff, C., R. Ivester, A. Knutilla (1998). A robust process ontology for manufacturing systems integration. National Institute of Standards and Technology.
Stone, P., R.E. Schnapire, M.L.L.J.A. Csirik, D. McAllester (2002). ATTac-2001: A learning, autonomous bidding agent. in: Lecture Notes in Computer Science; Revised Papers from the Workshop on Agent Mediated Electronic Commerce IV, Designing Mechanisms and Systems Table of Contents, Springer-Verlag, London, UK, Vol. 2531, pp. 143–160. Sycara, K., K. Decker, M. Williamson (1997). Middle-agents for the Internet, in: Proceedings of the 15th Joint Conference on Artificial Intelligence, Providence, RI, pp. 578–583. Sycara, K., M. Klusch, S. Widoff, J. Lu (1999). Dynamic service matchmaking among agents in open information environments. SIGMOD Record (ACM Special Interests Group on Management of Data) 28, 47–53. Sycara, K., A.S. Pannu (1998). The RETSINA multiagent system: towards integrating planning, execution, and information gathering, in: Proceedings of the Second International Conference on Autonomous Agents, pp. 350–351. Tsvetovatyy, M., M. Gini, B. Mobasher, Z. Wieckowski (1997). MAGMA: an agent-based virtual market for electronic commerce. Journal of Applied Artificial Intelligence 11, 501–524. Varian, H.R., J.K. MacKie-Mason (1995). Generalized vickrey auctions. Technical report, University of Michigan, Ann Arbor, MI. Walsh, W.E., M. Wellman, F. Ygge (2000). Combinatorial auctions for supply chain formation, in: Proceedings of ACM Conference on Electronic Commerce (EC’00), Minneapolis, MN, pp. 260– 269. Wellman, M.P., W.E. Walsh, P.R. Wurman, J.K. MacKie-Mason (2001). Auction protocols for decentralized scheduling. Games and Economic Behavior 35, 271–303. Wellman, M.P., P.R. Wurman (1998). Market-aware agents for a multiagent world. Robotics and Autonomous Systems 24, 115–125. Wilkins, D.E., K.L. Myers (1998). A multiagent planning architecture, in: Proceedings of International Conference on AI Planning Systems, Pittsburgh, PA, pp. 154–162. Wurman, P.R., M.P. Wellman, W.E. Walsh (1998). The Michigan Internet AuctionBot: a configurable auction server for human and software agents, in: Second International Conference on Autonomous Agents, Minneapolis, MN, pp. 301–308.
Part III Supporting Knowledge Enterprise
Chapter 11
Structuring Knowledge Bases Using Metagraphs
Amit Basu Edwin L Cox School of Business, Southern Methodist University, Dallas, TX 75275, USA
Robert Blanning Owen Graduate School of Management, Vanderbilt University, Nashville, TN 37203, USA
Abstract Knowledge bases may contain a variety of knowledge types, such as data relations, decision models, production rules, and models of workflow systems. A graphical structure, called metagraphs, provides a single approach to structuring these information types and the interactions between them. Metagraphs may be used both for visualization and for formal algebraic analysis. We present an overview of metagraphs and describe their application to the structuring of knowledge bases.
1 Introduction
It has been suggested that knowledge bases will be to the twenty-first century what databases were to the twentieth century. Each captures the type of information that can be stored (in machine-readable form) and processed (in algorithmic form) at the time of its rise to prominence. To put it another way, the twentieth century can be viewed as the age of data, and the twenty-first century can be viewed as the age of knowledge. Knowledge bases differ from databases not in the sense that knowledge and data are mutually exclusive, but rather because the concept of knowledge extends that of stored data relations to include additional types of information. Another type of information is decision models, such as simulation or optimization models. A third type of information found in knowledge bases is the collection of knowledge-based information structures found in artificial intelligence. This may include production rules,
semantic nets, and frames, but we will focus on the first of these—production rules. Yet another information type found in knowledge bases is workflow systems. This is of special interest because an important type of organizational knowledge is an explicit representation of the organization's information processes. Each of these four types of information (i.e., stored data, decision models, production rules, and workflow systems) can be described separately by means of certain graphical structures, such as entity-relationship diagrams, program flowcharts, semantic nets, and data flow diagrams. However, there are two problems with these graphical representations. First, they are primarily diagrammatic and not analytical. That is, they illustrate informational relationships, but they do not tell us how to apply these illustrations to construct inference paths. Second, they do not suggest how these different information types can be integrated. In this chapter, we show how a mathematical structure called metagraphs can be used to describe these four different types of information, the ways in which they are processed, and the ways in which they interact. Thus, metagraphs can be used as a basis for a uniform and comprehensive analytical framework for structuring knowledge bases. In the following section, we describe the four types of information found in knowledge bases—stored data, decision models, production rules, and workflow structures. In Section 3, we present the fundamentals of metagraphs, with an emphasis on the algebraic properties of metagraphs and the use of these properties in constructing inference paths. In Section 4, we explain and illustrate how metagraphs can be applied to the four information types and investigate metagraph transformations. For example, we show that metagraphs can be used to determine connectivity, and therefore inference paths, between sets of elements. They can also be used to combine data, models, rules, and workflows into a single information systems model. In addition, they can be used to project a complex system into a simpler representation without omitting details that are important to the user. Then in the final section, we summarize other work that has been done in this area and we identify promising topics for further research.
2 The components of organizational knowledge
We now describe the four components of organizational knowledge mentioned above—stored data, decision models, production rules, and workflow structures. The first type of information, stored data, is typically structured in the form of files (or data relations) in which each record is a collection of attributes describing a real-world entity. The key attributes uniquely identify the record, and therefore the entity, and the remaining content attributes describe other features of the entity. A graphical structure defining a database is the entity-relationship diagram, in which
the relations and the relationships between them are represented diagrammatically. For example, a university database may contain student and course relations, along with a relationship linking students with the courses they are taking. Any information common to a combination (or intersection) of both student and course, such as a grade, appears in the relationship. The relationship suggests access paths that can be used to construct a class list or a transcript. Models are structurally similar to data, but the information they provide does not appear in stored form. Rather, it is calculated as needed by a stored algorithm which represents a real-world computational process. Instead of key and content attributes, there are input and output variables. The access paths are relationships between the outputs of some models and the inputs of other models. For example, the output of a forecasting model may be the input to a production scheduling or inventory control model. Rules are if-then structures defined over logical (or yes/no) propositions. The ‘‘if ’’ part, called the antecedent of the rule, is a conjunction of propositions. The ‘‘then’’ part, also a conjunction of propositions, is called the consequent. Access paths occur when rules are linked—that is, when the consequents of some rules are among the antecedents of other rules. Rules may be causal rules defined by objective considerations or they may be elicited from experienced and knowledgeable specialists during the construction of expert systems. In the latter case, the rules are often called productions or production rules, because the antecedents are said to produce the consequents. To understand workflows, one must first understand processes. A process is a collection of tasks intended to achieve some desirable organizational outcome. From an information processing standpoint, a useful representation of a process is as a network of tasks each of which transforms input documents into output documents. An example is a loan evaluation process, in which the documents are loan applications, credit reports, etc. and the tasks are the preparation of these documents. Workflows are instantiations of business processes, where the paths taken by the documents depend on the content of the documents. For example, the path taken by a loan application may depend on certain documents performed by a risk evaluation department. In this case a single process, loan evaluation, may be instantiated into two workflows, one corresponding to high-risk loans and the other to low-risk loans. Knowledge of workflows is important because an important component of organizational knowledge is an understanding of how the organization processes information and makes decisions in response to opportunities and problems. We have briefly described four types of information found in knowledge bases. We have also seen that there are certain similarities between these different information types. Specifically, they are networks of information elements (stored data, model variables, logical assertions, and documents) and relationships (data relations, decision models, production rules, and
workstations performing information processing tasks). This suggests that graph theory would be useful in analyzing the structure and processing of these information types. However, traditional graph theory presents certain challenges in performing such an analysis. We will present an enhanced theory, the theory of metagraphs, which responds to these challenges and allows us to structure and integrate knowledge bases regardless of the type of information they contain. We should point out that there is also an alternative, but less structured, view of knowledge bases. In this case a knowledge base is a repository of memos, reports, descriptions of best practices, suggestions as to who should be contacted or what should be done if an unusual event occurs, etc. We do not doubt the importance of this type of information, but we will focus on the more structured types of information found in knowledge bases. As we gain a better understanding of the modeling of this type of knowledge, we expect that more and less structured knowledge will become explicit and will be amenable to formal modeling techniques of the type described here.
3 Metagraphs and metapaths
A metagraph combines the properties of a directed graph (digraph) and a hypergraph. All three are based on a set of indivisible elements, which in the case of metagraphs is called a ‘‘generating set.’’ The generating set is supplemented with a set of relationships among the elements in the generating set. In the case of digraphs, the relationships are a set of ordered pairs of elements, resulting in a network connecting certain of the elements. The elements are often represented by points in the plane, and the ordered pairs are represented by arrows (or arcs) connecting selected pairs of points (or nodes). An example is a PERT/CPM project network, in which the elements are events that take place at instances in time and the ordered pairs are activities that span durations in time between a start event and an end event. Another example is a transportation network, in which nodes are origins or destinations of shipments and the arcs are the shipments. However, digraphs do not properly describe the case in which a set of source elements (e.g., the key in a relation or the input to a model) are collectively connected to a set of target elements (e.g., the content of a relation or the output of a model). It would be possible to link each input element to each output element, but this would not disclose the fact that the entire set of source elements is needed to determine the target. Similarly, it would not be adequate to define new vertices corresponding to sets of vertices, since the element-to-element mappings in digraphs would not capture the set-to-set mappings found in data relations, models, rules, and workflows.
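As a small illustration of this point (not taken from the chapter), consider a relationship that needs both REV and EXP to produce NI; the element names anticipate the example of Fig. 1 below, and the helper function is only a toy check written for this sketch.

```python
# Illustration of the limitation just described. Pairwise (digraph) arcs from
# each input to the output cannot express that REV and EXP are *jointly*
# required to compute NI; a single set-to-set edge can.

# Digraph view: two independent arcs, with nothing saying both sources are needed.
digraph_arcs = [("REV", "NI"), ("EXP", "NI")]

# Metagraph-style view: one edge whose invertex is a set of elements.
set_to_set_edge = ({"REV", "EXP"}, {"NI"})   # (invertex, outvertex)

def computable(targets, known, edges):
    """Targets are computable only if some edge supplies them and the *entire*
    invertex of that edge is already known."""
    return all(
        any(t in out and inv <= known for inv, out in edges)
        for t in targets
    )

print(computable({"NI"}, {"REV"}, [set_to_set_edge]))          # False: EXP missing
print(computable({"NI"}, {"REV", "EXP"}, [set_to_set_edge]))   # True
```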
In the case of hypergraphs, there is no directionality. A hypergraph consists of the generating set and a set of edges (or hyper-edges), each of which is a subset of the generating set. Hypergraphs overcome a disadvantage of digraphs, which do not capture relationships among more than two elements. A hyper-edge can include any number of elements, up to the size of the generating set. In fact, some hypergraph definitions require that each hyper-edge contain more than two elements and that each element in the generating set appear in at least one hyper-edge. Although the lack of directionality does pose a problem in systems modeling, the principal purpose of hypergraphs is to facilitate combinatorial analyses.

Metagraphs combine the beneficial properties of digraphs with those of hypergraphs. A metagraph consists of a generating set and a set of ordered pairs of subsets of the generating set. Each such ordered pair is called an edge; the first subset is the invertex of the edge and the second subset is the outvertex. We require that for each edge either the invertex or the outvertex (or both) be non-null. If the invertex is null, then the edge asserts the outvertex, and if the outvertex is null, then the edge (in a rule-based system) asserts that the invertex is false. We note that metagraphs are similar to a hypergraph construct called "directed hypergraphs," but directed hypergraphs have been explored in the context of individual elements as connected by their set-oriented properties. As we will see, metagraphs place more emphasis on the connectivity of sets of elements.

3.1 Metagraph definition

We now present a formal definition of metagraphs. The definitions presented here and the theorems alluded to here are found in previous articles and papers referenced at the end of this chapter.

Definition 1. Given a finite generating set X = {x_i, i = 1 ... I} consisting of elements x_i, a metagraph is an ordered pair S = ⟨X, E⟩ in which E = {e_k, k = 1 ... K} is a set of edges and each edge is an ordered pair e_k = ⟨V_k, W_k⟩, with V_k ⊆ X, W_k ⊆ X, and V_k ∪ W_k ≠ ∅ for all k. In addition, if for some e_k we have x_i ∈ V_k and x_j ∈ W_k, then the coinput of x_i in e_k is V_k\{x_i} and the cooutput of x_j in e_k is W_k\{x_j}.

In other words, for any edge e and any element x, the coinput of x consists of all elements in the invertex of e other than x, and the cooutput of x consists of all elements in the outvertex of e other than x. We note that in directed graphs all coinputs and cooutputs are null, since the edges of a directed graph are element-to-element mappings, rather than set-to-set mappings.

The application of the metagraph definition to knowledge bases follows from the definition of metagraphs (and specifically, of elements and edges)
and the descriptions of the four types of information found in knowledge bases, as described in Section 2: In a database each element in the generating set represents a data attribute and each edge represents a file (or data relation), with the invertex as key attributes and the outvertex as content attributes. In a model base each element represents a variable and each edge represents a model, with the invertex as inputs to the model and the outvertex as outputs. In a rule base each element is a logical variable, and each edge corresponds to a rule. The invertex is the conjunction of assertions making up the antecedent to the rule, and the outvertex is the conjunction of assertions making up the consequent of the rule. Thus, the edge represents a statement that if all of the invertex assertions are true, then all of the outvertex assertions are true. Finally, in a workflow system the elements represent documents. Each edge represents an information processing task (often implemented by a workstation) with the invertex as input documents and the outvertex as output documents.

We illustrate metagraphs with the example shown in Fig. 1. To simplify matters, we assume a metagraph with five elements and three edges. The generating set consists of INFL (the inflation rate), REV (revenues), EXP (expense), SCH (schedule), and NI (net income). There are three edges: e1 = ⟨{INFL}, {REV}⟩ determines REV as a function of INFL, e2 = ⟨{INFL}, {EXP, SCH}⟩ determines EXP and SCH as a function of INFL, and e3 = ⟨{REV, EXP}, {NI}⟩ determines NI as a function of REV and EXP.

Connectivity in metagraphs can be described in terms of simple paths and metapaths. Simple paths in metagraphs are similar to paths in directed
Fig. 1. Metagraph containing a metapath.
graphs. In metagraphs, a simple path is a sequence of edges connecting a source element to a target element, where connectivity between two successive edges occurs when the outvertex of the first edge overlaps (i.e., has a non-null intersection with) the invertex of the second edge. A formal definition is as follows:

Definition 2. Given a metagraph S = ⟨X, E⟩, a source x_i ∈ X, and a target x_j ∈ X, a simple path from x_i to x_j is a sequence of edges P(x_i, x_j) = ⟨e′_l, l = 1 ... L⟩ such that (1) e′_l ∈ E for all l, (2) x_i ∈ V′_1, (3) x_j ∈ W′_L, and W′_l ∩ V′_{l+1} ≠ ∅ for l = 1 ... L−1. The coinput of x_i in P(x_i, x_j) is

(∪_{l=1...L} V′_l \ ∪_{l=1...L} W′_l) \ {x_i}

and the cooutput of x_j in P(x_i, x_j) is

(∪_{l=1...L} W′_l) \ {x_j}.

We note that ⟨e′_l, l = 1 ... L⟩ is a sequence of edges, rather than a set of edges.

3.2 Metapaths

The concept of a simple path describes element-to-element connectivity, but it does not describe set-to-set connectivity. To do this, we need another concept—that of a metapath. Consider the metagraph shown in Fig. 1. There are two simple paths from INFL to NI. They are ⟨e1, e3⟩ with coinput EXP and ⟨e2, e3⟩ with coinput REV. Thus, there is no simple path from INFL to NI with a null coinput. Even so, INFL alone is sufficient to determine NI. To do this we must invoke all three edges: e1, e2, and e3. But the set of edges {e1, e2, e3} is not a simple path; it is a set of edges, not a sequence of edges. Thus, we need a more general view of connectivity that goes beyond simple paths. This leads us to the concept of a metapath, which in this case is the set {e1, e2, e3}. Thus, a metapath, unlike a simple path, is a set of edges, rather than a sequence of edges. In addition, the source and/or the target may be a set of elements, although in this example they were singleton sets.

A metapath from a source to a target is a set of edges with three properties. First, each edge in the metapath must be on a simple path from some element in the source to some element in the target, and all other edges in that simple path must also be in the metapath. Second, the set of all elements in the invertices of the edges in the metapath that are not in the outvertex of some edge in the metapath are contained in the source. Third, the set of elements in the outvertices of all the edges in the metapath
contains the target, although there may be additional outvertex elements not in the target. The first property ensures that the metapath contains no extraneous edges, and the second and third properties ensure that the source and the edges in the metapath are sufficient to determine the target. Thus, there is no notion of coinputs for metapaths.

Definition 3. Given a metagraph S = ⟨X, E⟩, a source A ⊂ X, and a target B ⊂ X, a metapath from A to B is a set of edges M(A, B) = {e′_l, l = 1 ... L}, such that
(1) ∪_{l=1...L} V′_l \ ∪_{l=1...L} W′_l ⊆ A,
(2) B ⊆ ∪_{l=1...L} W′_l, and
(3) for each l = 1 ... L there is a simple path P_l(a, b) from some a ∈ A to some b ∈ B, such that e′_l ∈ set(P_l(a, b)) ⊆ M(A, B).

The purpose of the third condition above is to ensure that M(A, B) does not contain edges in simple paths from some elements in the set B to other elements in another set C. Their inclusion would not invalidate the first two conditions, but they also would not be needed to define a metapath from A to B. However, as we will see in Section 4.3, the definition of another concept, that of metapath dominance, will serve to identify other types of superfluous elements and edges. The construction of metapaths is described in Basu and Blanning (1994b).

3.3 Metagraph algebra

Although metagraphs can be illustrated diagrammatically, as was done in Fig. 1, it is also possible to represent them algebraically. This makes it possible to perform calculations that will identify such properties as the existence of simple paths and metapaths, rather than to identify them by visual inspection. The foundation of the algebraic approach is the adjacency matrix. An adjacency matrix is a square matrix with one row and one column for each element in the generating set. Each (row, column) member A_ij in the adjacency matrix is a set of zero or more triples, one for each edge containing the row element in its invertex and the column element in its outvertex. The first component of the triple is the coinput of the row element, the second component is the cooutput of the column element, and the third component is the name of the edge, represented as a sequence. The reason that there may be no triples in a particular A_ij is that there may be no edges in the metagraph with x_i in its invertex and x_j in its outvertex. In this case, A_ij would be null. The reason that there may be more than one triple in a particular A_ij is that there may be multiple edges in the metagraph with x_i in its invertex and x_j in its outvertex. For example, if it were possible to calculate net income from revenue and schedule, then there would be an additional edge e4 = ⟨{REV, SCH}, {NI}⟩. In that case
we would have A_REV,NI = {⟨{EXP}, ∅, ⟨e3⟩⟩, ⟨{SCH}, ∅, ⟨e4⟩⟩} and A_SCH,NI = {⟨{REV}, ∅, ⟨e4⟩⟩}. We will see another example of multiple triples in the description of the A² matrix below.

The adjacency matrix of the metagraph in Fig. 1 appears in Table 1. Twenty of the twenty-five members of the matrix do not correspond to any edges and are null. For example, there is no single edge directly connecting an invertex containing INFL to an outvertex containing NI, so A_INFL,NI = ∅. However there is an edge, e3, connecting REV to NI, so the member A_REV,NI consists of the triple {⟨{EXP}, ∅, ⟨e3⟩⟩}; EXP is the coinput of REV in e3, the cooutput of NI in e3 is null, and e3 is the edge.

A multiplication operation has been defined for metagraph adjacency matrices (Basu and Blanning, 1994a), and this can be used to calculate successive powers of these matrices. As with simple graphs and digraphs, the nth power of the adjacency matrix discloses all paths of length n from a source element to a destination element. The square of the adjacency matrix, A², consists entirely of null members, except for A²_INFL,NI = {⟨{EXP}, {REV}, ⟨e1, e3⟩⟩, ⟨{REV}, {EXP, SCH}, ⟨e2, e3⟩⟩}. Each member of this set corresponds to a simple path of length 2 between INFL and NI; the first is ⟨e1, e3⟩ and the second is ⟨e2, e3⟩. All higher powers of A consist entirely of null members, since there are no simple paths of length greater than two in the metagraph.

The powers of the adjacency matrix can be summed to produce the closure of the adjacency matrix. In this case, the closure of A is A + A², as illustrated in Table 2. The closure identifies all simple paths of any length and can be combined with the requirements of Definition 3 to find all metapaths between any two sets of elements in the metagraph, if such metapaths exist. The closure can also be used to identify any cycles in the metagraph—that is, any simple paths from an element to itself. A metagraph contains a cycle if and only if its closure has at least one non-null member in its diagonal. The cycle will be the third component in the triple. There will be one such triple for each element in the cycle.

Table 1
Adjacency matrix for Fig. 1

        INFL   REV               EXP                  SCH                  NI
INFL    ∅      {⟨∅, ∅, ⟨e1⟩⟩}    {⟨∅, {SCH}, ⟨e2⟩⟩}   {⟨∅, {EXP}, ⟨e2⟩⟩}   ∅
REV     ∅      ∅                 ∅                    ∅                    {⟨{EXP}, ∅, ⟨e3⟩⟩}
EXP     ∅      ∅                 ∅                    ∅                    {⟨{REV}, ∅, ⟨e3⟩⟩}
SCH     ∅      ∅                 ∅                    ∅                    ∅
NI      ∅      ∅                 ∅                    ∅                    ∅
Table 2
Closure of the adjacency matrix for Fig. 1

        INFL   REV               EXP                  SCH                  NI
INFL    ∅      {⟨∅, ∅, ⟨e1⟩⟩}    {⟨∅, {SCH}, ⟨e2⟩⟩}   {⟨∅, {EXP}, ⟨e2⟩⟩}   {⟨{EXP}, {REV}, ⟨e1, e3⟩⟩, ⟨{REV}, {EXP, SCH}, ⟨e2, e3⟩⟩}
REV     ∅      ∅                 ∅                    ∅                    {⟨{EXP}, ∅, ⟨e3⟩⟩}
EXP     ∅      ∅                 ∅                    ∅                    {⟨{REV}, ∅, ⟨e3⟩⟩}
SCH     ∅      ∅                 ∅                    ∅                    ∅
NI      ∅      ∅                 ∅                    ∅                    ∅
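The entries of Tables 1 and 2 can be reproduced mechanically. The sketch below derives the non-null members of A and A² for the Fig. 1 metagraph directly from the path definitions given earlier (one triple per simple path), rather than by implementing the algebraic multiplication rule of Basu and Blanning (1994a); for this example the two approaches yield the same entries.

```python
edges = {                      # edge name -> (invertex, outvertex), as in Fig. 1
    "e1": ({"INFL"}, {"REV"}),
    "e2": ({"INFL"}, {"EXP", "SCH"}),
    "e3": ({"REV", "EXP"}, {"NI"}),
}

def paths_of_length(n):
    """All edge sequences <e'_1 ... e'_n> in which W'_l overlaps V'_(l+1)."""
    seqs = [[e] for e in edges]
    for _ in range(n - 1):
        seqs = [s + [e] for s in seqs for e in edges
                if edges[s[-1]][1] & edges[e][0]]
    return seqs

def matrix_power(n):
    """Members of A^n: one triple <coinput, cooutput, edge sequence> per
    length-n simple path from a row element x_i to a column element x_j."""
    A = {}
    for seq in paths_of_length(n):
        V = set().union(*(edges[e][0] for e in seq))
        W = set().union(*(edges[e][1] for e in seq))
        for xi in edges[seq[0]][0]:          # possible sources
            for xj in edges[seq[-1]][1]:     # possible targets
                coinput = (V - W) - {xi}
                cooutput = W - {xj}
                A.setdefault((xi, xj), []).append((coinput, cooutput, seq))
    return A

print(matrix_power(1)[("REV", "NI")])   # [({'EXP'}, set(), ['e3'])], as in Table 1
print(matrix_power(2)[("INFL", "NI")])  # the two length-2 triples shown in Table 2
```

Summing the powers until they become empty (here, A + A²) gives the closure of Table 2.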
3.4 Metapath dominance and metagraph projection

We conclude this section by addressing two issues, one related to metapaths and the second to entire metagraphs. The metapath-related issue is whether the metapath contains any elements or edges that are not needed to connect a source to a target. In other words, we wish to know whether the metapath contains superfluous elements or edges. This leads to the concept of metapath dominance. The metagraph-related issue is whether a metagraph can be simplified by eliminating some of the elements and edges in ways that retain certain specified important relationships but hide unimportant relationships. This leads to the concept of projecting a metagraph over a subset of its generating set. We begin with the concept of metapath dominance.

Definition 4. Given a metagraph S = ⟨X, E⟩, for any two sets of elements B ⊆ X and C ⊆ X, (1) a metapath M(B, C) is edge dominant if no proper subset of M(B, C) is also a metapath from B to C, and (2) a metapath M(B, C) is input dominant if there is no B′ ⊂ B such that M(B′, C) is a metapath from B′ to C. A metapath M(B, C) is dominant if it is both edge dominant and input dominant.

In other words, a metapath from a set B to a set C is edge dominant if it is not possible to remove any edges from the metapath and still have a metapath from B to C. It is input dominant if it is not possible to remove any elements from B and have a metapath from the resulting subset of B to C. It is dominant if it is both edge and input dominant. We note that this also applies to conditional metagraphs, since a conditional metagraph is still a metagraph, but with an expanded generating set.

We now turn to the concept of a projection. Before giving a formal definition, we present an example. Consider the metagraph illustrated at the top of Fig. 2. Price (PRI) determines volume (VOL), which in turn
Fig. 2. A metagraph and its projection.
determines both capacity requirements (CAP) and expense (EXP), and price and volume together determine revenue (REV). Let us say that we are interested only in whether there is a relationship between price and expense. There are three issues here. First, although price indirectly determines expense, there is an intervening variable, volume, in which we are not interested. Second, volume also determines capacity, in which we are not interested. Third, price, along with volume, determines revenue, but revenue is of no interest to us. A projection of the metagraph onto the subset X′ = {PRI, EXP} of the generating set would clarify any relationships in X′ and not burden us with information about the remaining elements in X\X′. This projection is illustrated at the bottom of Fig. 2. We can see that price determines expense. This relationship is determined by a new edge e′, which is not found in the original (or base) metagraph. The projected edge e′ is derived from e1 and e2 and is called a composition of these edges. The composition c(e′) = {{e1, e2}} is the set of all metapaths describing relationships involving elements of X′. In this case, there is only one such metapath, {e1, e2}, but more generally there could be several metapaths connecting PRI to EXP. Therefore, a composition is a set of sets (of edges in the base metagraph). We are now ready for a formal definition.

Definition 5. Given a base metagraph S = ⟨X, E⟩ and X′ ⊆ X, a metagraph S′ = ⟨X′, E′⟩ is a projection of S onto X′ if (1) for any e′ = ⟨V′, W′⟩ ∈ E′ and for any x′ ∈ W′ there is a dominant metapath M(V′, {x′}) in S,
(2) for every x′ ∈ X′, if there is any dominant metapath M(V, {x′}) in S with V ⊆ X′, then there is an edge ⟨V′, W′⟩ ∈ E′ such that V = V′ and x′ ∈ W′, and (3) no two edges in E′ have the same invertex.

The third condition simplifies the projection by minimizing the number of edges in it. As a result, for any S = ⟨X, E⟩ and any X′ ⊆ X the projection of S onto X′ exists and is unique. An algorithm for constructing projections is found in Basu et al. (1997) and also in Basu and Blanning (2007). Thus, a projection provides us with a high-level view of a complex system, and a composition tells us how the relationships in the high-level view can be derived from those in the base system.
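A simplified sketch of the projection idea is shown below. It is not the algorithm of Basu et al. (1997): derivability of an element from a candidate source set is tested here by plain forward chaining over the base edges (equivalent to asking whether some metapath connects them), and only the minimal, input-dominant source sets are kept. For the base metagraph of Fig. 2 it reproduces the single projected edge e′ from PRI to EXP.

```python
from itertools import combinations

base_edges = {                       # base metagraph at the top of Fig. 2
    "e1": ({"PRI"}, {"VOL"}),
    "e2": ({"VOL"}, {"CAP", "EXP"}),
    "e3": ({"PRI", "VOL"}, {"REV"}),
}

def derivable(source, target, edges):
    """True if the target element can be computed from the source set alone."""
    known = set(source)
    changed = True
    while changed:
        changed = False
        for inv, out in edges.values():
            if inv <= known and not out <= known:
                known |= out
                changed = True
    return target in known

def project(edges, x_prime):
    """Projection onto x_prime: for each element of x_prime, keep only the
    minimal source sets drawn from x_prime that determine it."""
    proj = {}
    for x in x_prime:
        others = x_prime - {x}
        candidates = [set(c) for r in range(1, len(others) + 1)
                      for c in combinations(sorted(others), r)
                      if derivable(c, x, edges)]
        # keep only input-dominant (minimal) source sets
        minimal = [s for s in candidates if not any(t < s for t in candidates)]
        for src in minimal:
            proj.setdefault(frozenset(src), set()).add(x)
    return proj

print(project(base_edges, {"PRI", "EXP"}))   # {frozenset({'PRI'}): {'EXP'}}, the edge e'
```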
4 Metagraphs and knowledge bases
In this section, we examine three issues. In Section 4.1, we examine the application of metagraphs to the four types of information contained in knowledge bases, including the case in which a single knowledge base contains more than one type of information and the information types interact. For example, a rule base may constrain a model base and may determine how the models may interact with each other. We also address a more specialized topic, the existence of cycles in a knowledge base and therefore in the metagraph that models the knowledge base. Cycles represent simultaneity in the relationships among the elements (e.g., there is a directed relationship from xi to xj, but there is also a directed relationship from xj to xi). We will present a supply–demand example below. The existence of cycles may affect the knowledge base in ways that may not be immediately apparent. Then we will examine an additional topic—the combination of data, models, rules, and workflows in a single metagraph (Section 4.2). We conclude this section by discussing simplified views of a metagraph (Section 4.3). These are metagraphs that capture the essential elements of a base metagraph as defined by the user and omit extraneous details. They are constructed using the projection operator defined above. 4.1 Applications of metagraphs to the four information types The first of the four information types is stored data, usually in the form of data relations. Each relation is represented by a metagraph edge in which the invertex corresponds to the key and the outvertex corresponds to the content. An access path between two sets of data attributes is represented by a metapath in which the first set of attributes is the source of the metapath and the second set of attributes is the target of the metapath. We note that a projection operation in metagraphs corresponds to the decomposition (or projection) of relational databases into a succession of
higher normal forms to eliminate storage anomalies. For example, consider a database containing a transitive dependency (Fig. 3). Each employee (EMP) is in a single department (DEP), which in turn has a single location (LOC). Placing all of this into single relation (or metagraph edge) presents problems if the location of a department should change, if the last employee in a particular department should leave, or if a new department in a known location and yet to contain any employees should be established. A better approach is to project the edge (i.e., decompose the relation) into two components, one involving a functional dependency between EMP and DEP, and the other involving a functional dependency between DEP and LOC. The second information type is decision models, each of which is an edge with the input to the model represented by the invertex of the edge and the output by the outvertex. As with data relations, access paths (for integrating the models) correspond to metapaths in the metagraph. But there is an additional issue seldom found in databases: the possible existence of cycles in the metagraph, corresponding to cycles in the model base, which in turn will correspond to cycles in the real world. These real world cycles will arise whenever there are two (or more) variables or sets of variables, each of which affects the other. In other words, the relationships are not unidirectional, but rather are bidirectional, and this simultaneity in relationships must be resolved. For example, consider the cyclic metagraph in Fig. 4 representing a model base containing two models. A demand model (DMD) calculates the
Fig. 3. Transitive dependency (the single edge over EMP, DEP, and LOC decomposed into EMP–DEP and DEP–LOC).
volume (VOL) of a product that will be sold as a function of the price charged for the product (PRI) and the GNP of the country in which the company selling the product is found. A supply model (SUP) calculates the equilibrium price (again, PRI) at which the product will be sold and the needed production capacity (CAP) as a function of the volume sold (again, VOL). The method of solving these relationships depends on their functional forms. If they are simple functional relationships (e.g., linear functions), then they can be solved analytically as simultaneous equations. Unfortunately this is seldom the case, and it is usually necessary to solve them numerically using an iterative convergence procedure. A value is posited for either PRI or VOL, as well as GNP, the appropriate model is used to calculate VOL or PRI, the other model is used to reverse the process, and the iterative cycle continues until the process converges. There are several problems here: one is that the iterative process may not converge, another is that the convergence may be too slow, and the third is that there may be multiple equilibria. But in any case, cycles in the model base, which can be detected by visual inspection or by the existence of non-null members in the diagonal of the adjacency matrix, may complicate the solution of the models. Another complicating factor is the computational effort needed to solve the models. In part this may arise from the size and complexity of the models themselves, but it may also result from the characteristics of the model inputs and outputs represented by the elements in the metagraph. For example, the variables corresponding to PRI and VOL may be indexed over geographical regions, industrial sectors, and/or time stages, thus complicating the analysis. The third information type is production rules, in which the elements in the generating set correspond to propositions that may be either true or false. For example, in Fig. 4, the edges would correspond to rules in which
Fig. 4. Cyclic metagraph (elements GNP, PRI, VOL, CAP; edges DMD and SUP).
the invertices and outvertices would be conjunctions of the appropriate propositions:

DMD: GNP ∧ PRI → VOL
SUP: VOL → PRI ∧ CAP

In an acyclic metagraph, a metapath would correspond to an inference path in the rule base, but this need not be true in a cyclic metagraph. In this example, each of the two rules (DMD and SUP) corresponds to a (rather simple) inference path. For example, if we know the truth values of GNP and PRI, we can determine the truth value of VOL. However, when the metagraph is cyclic, this breaks down. Specifically, there is an inference path between a source and a target if and only if there is an acyclic metapath connecting the source to the target. In the above example, if we consider both rules, there is a metapath with source {GNP} and target {PRI, VOL, CAP}. But the rule base (and hence the metagraph) is cyclic and there is no acyclic metapath connecting the source to the target, and so the two rules are mutually inconsistent. For example, if GNP = T, PRI = F, VOL = F, and CAP = F, then both of the rules DMD and SUP would be true. However, the implication corresponding to the metapath described above (i.e., GNP → PRI ∧ VOL ∧ CAP) would be false. We should point out that in the metagraph representation of rule bases, propositions are generally assumed to be positive. That is, there is no negation operator except in special cases. The exception is the use of metagraphs in discovering implicit integrity constraints in rule bases. This exception will not be addressed here.

The fourth information type is a workflow system, in which the generating set corresponds to documents and the edges represent workstations that transform input documents into output documents. The issues here are similar to those in model management, except that there is human intervention in the transformations. That is, in an acyclic system metapaths correspond to solution procedures (or inference rules). However, cyclic systems require iterative processes with the attendant possibilities of (1) lack of convergence, (2) unacceptably slow convergence, and (3) multiple equilibria.

4.2 Combining data, models, rules, and workflows

There are several ways in which metagraphs can be used to combine data relations, decision models, production rules, and workflow systems. For example, a data element can be an input to a model, or the output of a model can be part of the key in a data relation. The same can be said for combinations involving workflow systems—for example, data relations and workflows, or decision models and workflows. In each of these cases, the metagraph
constructs described above can be integrated in a reasonably straightforward manner. A more interesting case occurs when the data elements include propositions, which can be in the invertex or outvertex of a metagraph edge. Both possibilities are illustrated in Fig. 5. Edge e1 calculates the truth value of a proposition p from the value of the variable PRI. For example, if PRI is the price of a product (as in the examples above) and p is the proposition "The price is less than twenty dollars" (i.e., "PRI < 20"), then e1 represents the calculation of the truth value of p from the numerical value of PRI. However, when a proposition is in the invertex of an edge, the interpretation is quite different. In this case, we are asserting that the proposition must be true for the edge to be valid—that is, for the edge to be used in a metapath. An example is the edge e2 in Fig. 5, which states that VOL can be calculated from PRI if p is true. In other words, we can execute the model (or data relation or workstation) represented by e2, but only if the price is less than 20 dollars. There is no suggestion here as to what happens if the price is greater than or equal to 20 dollars, although this condition can be addressed by defining another proposition, requiring another edge. Thus, we can think of a proposition in an invertex as an assumption that must be true for the edge to be valid. In this case, in order to invoke the relation/model/workstation represented by the edge e2, we must assume that the price is less than $20.

A metagraph in which there are propositions in some of the invertices is called a conditional metagraph. In a conditional metagraph, the generating set is partitioned into two disjoint sets of data elements: the quantitative and qualitative variables formerly represented by X, and a set of propositions. The former will be denoted $X_V$, and the latter will be denoted $X_P$. The entire generating set will now be denoted X, so we have $X = X_V \cup X_P$, with $X_V \cap X_P = \emptyset$. In addition, we require that if an outvertex contains a proposition in $X_P$, then it cannot contain any variable in $X_V$—in other words, for any $e \in E$ and $p \in X_P$, if $p \in W_e$, then $W_e = \{p\}$. Now for any metapath, we can identify those assumptions that must be true initially for the metapath to be valid and those assumptions that must also be true but whose values will not be known until the appropriate edges have been executed.
Fig. 5. Conditional metagraph (edge e1 from {PRI} to {p}; edge e2 from {PRI, p} to {VOL}).
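To make the conditional-metagraph idea concrete, the following minimal sketch distinguishes the two kinds of assumptions just described: propositions that must hold up front, and propositions that are established by earlier edges in a metapath. The edge representation and function names are our own illustration, not part of the metagraph formalism.

```python
# A minimal sketch of a conditional metagraph, assuming a simple edge
# representation; names (Edge, required_assumptions) are illustrative only.
from dataclasses import dataclass

PROPOSITIONS = {"p"}           # p: "PRI < 20"

@dataclass(frozen=True)
class Edge:
    name: str
    invertex: frozenset        # inputs (variables and/or propositions)
    outvertex: frozenset       # outputs

# Edges of Fig. 5: e1 derives the truth value of p from PRI;
# e2 computes VOL from PRI, but only if p holds.
e1 = Edge("e1", frozenset({"PRI"}), frozenset({"p"}))
e2 = Edge("e2", frozenset({"PRI", "p"}), frozenset({"VOL"}))

def required_assumptions(metapath):
    """Split the propositions appearing in invertices into those that must
    be assumed up front and those established by earlier edges in the path."""
    produced = set()
    up_front, established = set(), set()
    for edge in metapath:
        for prop in edge.invertex & PROPOSITIONS:
            (established if prop in produced else up_front).add(prop)
        produced |= edge.outvertex
    return up_front, established

# The composition {e1, e2} connects {PRI} to {VOL}: p is not an up-front
# assumption because e1 establishes it along the way; using e2 alone requires
# assuming p.
print(required_assumptions([e1, e2]))   # (set(), {'p'})
print(required_assumptions([e2]))       # ({'p'}, set())
```

Run on the edges of Fig. 5, the sketch shows that the composition {e1, e2} needs no prior assumption about p, whereas invoking e2 by itself does.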
The concept of a conditional metagraph is of interest because it allows one information structure to enable or disable another such structure. For example, one workflow process may generate a proposition that will activate another workflow or that will determine the details of how the second workflow will be implemented. In the context of an order entry process, the enabling workflows may be a credit check on a customer and an evaluation of inventory levels to determine whether existing inventories are sufficient to fill the order or whether additional production is needed. The enabled process would be a process for assembling and packaging the line items in the order and shipping them to the customer. In a more complex system, another enabling process might access a database describing a parts explosion in order to transform a demand for finished goods into derived demands for subassemblies and piece parts. Thus, conditional metagraphs can model a situation in which one process can turn on, turn off, or modify another process.

4.3 Metagraph views

It is sometimes useful in knowledge processing to provide a simplified view of knowledge bases. This will make it possible for managers interested in the content of a knowledge base to examine those elements and relationships (edges) of greatest relevance to a particular situation. A projection provides that information in an explicit and convenient form. A single metagraph may have several views, as defined by the elements over which the projection is taking place. For example, in Fig. 2, we illustrated a projection over {PRI, EXP}, which resulted in the composition {{e1, e2}}. However, someone with different interests may wish to focus on the variables PRI and REV, which would have resulted in a projection containing as its edge the composition e′ = {{e1, e3}}. Yet another possibility is to focus on X′ = {EXP, REV}, which would result in a null projection, since there are no metapaths connecting {EXP} with {REV}.

We note that a metapath in a projection of a conditional metagraph does not necessarily imply an inference path. The reason is that, if the metagraph is a conditional metagraph, a proposition not in the projection set may be false. For example, if in the example of Fig. 5 we project over X′ = {PRI, VOL}, then we find that {PRI} is connected to {VOL} using the composition e′ = {{e1, e2}}. However, we do not see in the projection that p must be true (i.e., PRI must be less than $20) for connectivity to occur. Thus, a projection simplifies matters while retaining underlying connectivity, but it is wise to be careful that the omitted information is really unimportant.

Projection is not the only method for simplifying metagraphs. Another approach is to use contexts. A context is applied to a conditional metagraph and it focuses on the propositions in the generating set, rather than on the variables. We will not describe contexts here, but we
summarize this concept in Section 5.1, where we discuss other work on metagraphs. In summary, for large and complex metagraphs (including conditional metagraphs) methods are available to provide simplified views without omitting the underlying connectivity, and this increases the usefulness of metagraphs.

5 Conclusion
In this section, we summarize work that we have done on metagraphs, including work not described above but that has been published elsewhere, and discuss its relevance to knowledge management. Then we suggest additional work that might be done in this area.

We began in this chapter by describing knowledge bases in terms of a variety of information types—stored data, decision models, production rules, and workflow systems. We then defined metagraphs as a collection of set-to-set mappings, where the first (invertex) set represents inputs to the mappings (key attributes, input variables, antecedents, and input documents), and the second (outvertex) set represents outputs of the mappings (content attributes, output variables, consequents, and output documents). The set-to-set property of metagraphs allows for a rich type of connectivity, represented by metapaths, which can be applied not only to each of the information types taken separately, but also to systems that integrate the information types, leading to the concept of a conditional metagraph. We concluded with the definition of a projection, which transforms a base metagraph into a high-level view that focuses on those elements of interest to a specific decision maker and the relationships between them.

5.1 Related work

The work we have reported here summarizes the fundamental research that has been done on metagraphs. But much more has been done as well. We will present a brief summary, as follows:

Hierarchical Modeling: Consider the case in which two separate metagraphs have overlapping but not identical generating sets. An example is metagraphs that describe manufacturing and marketing information, with distribution information common to both. We have examined the relationship between projections of the aggregate metagraph and projections of the individual metagraphs and have suggested implications for information systems design and organization design (Basu et al., 1997).

Independent Submetagraphs (ISMGs): A submetagraph is simply a metagraph contained within another metagraph. A submetagraph is an
ISMG if there are "pure" input elements such that all elements in the submetagraph are determined either by the pure elements or by other elements in the submetagraph, and there is a similar condition defined in reverse for a set of pure output elements. We examine conditions under which the union and intersection of ISMGs are ISMGs. ISMGs are useful in determining whether a subprocess is independent of a larger process. This would occur when the elements in the subprocess other than the pure inputs and the pure outputs do not interact with other elements in the larger metagraph. We also examine conditions under which two subprocesses contained within a larger process are mutually independent of each other (Basu and Blanning, 2003).

Attributed Metagraphs: Attributes are labels attached to edges and may be used to specify such things as resources needed to implement the models or workstations represented by the edges. The resources may be hardware, software, or human (e.g., programmers, analysts, or other specialists). We have examined cases in which a common resource is used by several edges and in which adjacent edges (and therefore, successive tasks) communicate through the use of a common resource (Basu and Blanning, 2000).

Metagraph Contexts: A context is a metagraph transformation defined for a conditional metagraph in which each proposition is either known to be true, is known to be false, or is undetermined. Some of the propositions and edges in a context can be deleted, and therefore, a context provides a simplified view of a metagraph. There is an interesting relationship between contexts and projections—they are commutative. That is, for a given conditional metagraph and a given subset of the generating set, the context of its projection is the projection of its context (Basu and Blanning, 1998).

Fuzzy Metagraphs: The use of metagraphs for representing rules in rule-based knowledge bases has been extended to fuzzy rules, using fuzzy metagraphs (Tan, 2006). This approach combines the set-to-set representational power of metagraphs with the expressive features of fuzzy systems, and has been shown to offer some distinct advantages for both automated reasoning systems and intelligent information retrieval.

5.2 Research opportunities

Although there has been significant progress in the development and application of metagraphs, there is much left to be done. We now speculate on several areas that might yield fruitful results:

A Metagraph Workbench: A software and database workbench for metagraphs would help a manager or analyst to answer such questions
as: (1) given a description of known information and desired information, how can existing modules be used to obtain the desired information, (2) what other information would be obtained as a byproduct of this process, (3) are there alternative ways of obtaining the desired information, and (4) what new modules might be useful in performing these analyses? The workbench would contain repositories of known information and existing modules, as well as software for calculating metapaths, projections, and contexts, and a user interface.

Metagraph-Based Models of Social Networks: How can one use metagraphs to model formal and informal social networks and organization structures and the consequent channels of communication? This might require the elimination of the directional character of metagraphs to model bilateral communication networks. In other words, an edge would be defined as an unordered pair of subsets of the generating set. What are the consequences of doing this?

Metagraphs in Software Engineering: How can/should metagraphs be used in feasibility studies, system analysis, system design, and project implementation? Can metagraphs or metagraph-like features be useful in enhancing data flow diagrams, entity-relationship diagrams, and project management diagrams? Can metagraphs assist in constructing scenarios for software and system testing?

In conclusion, we believe that the topic of metagraphs offers a promising set of opportunities to capture the inherent complexity of knowledge bases. The reason is that metagraphs are based on a worldview that describes the variety of modules found in knowledge bases—that is, modules involving data, models, rules, and workflows. This worldview relies on set-to-set, rather than point-to-point, mappings, and these mappings can represent the rich interactions found in knowledge bases. This will extend existing thinking about knowledge bases and may even help us in an effort to structure currently unstructured knowledge bases.
References

Basu, A., R.W. Blanning (1994a). Metagraphs: a tool for modeling decision support systems. Management Science 40, 1579–1600.
Basu, A., R.W. Blanning (1994b). Model integration using metagraphs. Information Systems Research 5, 195–218.
Basu, A., R.W. Blanning (1998). The analysis of assumptions in model bases using metagraphs. Management Science 44, 982–995.
Basu, A., R.W. Blanning (2000). A formal approach to workflow analysis. Information Systems Research 11, 17–36.
Basu, A., R.W. Blanning (2003). Synthesis and decomposition of processes in organizations. Information Systems Research 14, 317–355.
Basu, A., R.W. Blanning (2007). Metagraphs and Their Applications. Edited by R. Sharda and S. Vos, Kluwer Academic Publishers, New York, Expository and Research Monograph, Integrated Series in Information Systems.
Basu, A., R.W. Blanning, A. Shtub (1997). Metagraphs in hierarchical modeling. Management Science 43, 623–639.
Tan, Z.-H. (2006). Fuzzy metagraph and its combination with the indexing approach in rule-based systems. IEEE Transactions on Knowledge and Data Engineering 18, 829–841.
Chapter 12
Information Systems Security and Statistical Databases: Preserving Confidentiality through Camouflage
Robert Garfinkel, Ram Gopal and Manuel Nunez The University of Connecticut, Storrs, CT 06269, USA
Daniel Rice Department of Information Systems and Operations Management, The Sellinger School of Business, Loyola College in Maryland, 4501 N. Charles Street, Baltimore, MD 21210, USA
Abstract

This chapter outlines some of the major themes of information systems security in statistical databases (SDB). Information security in databases, including confidentiality in SDBs and privacy-preserving data mining, is discussed broadly, while the chapter focuses primarily on the protection of SDBs against the very specific threat of statistical inference. Several protection mechanisms that researchers have developed to protect against this threat are introduced, including data restriction, perturbation, and data-hiding techniques. One particular data-hiding model, Confidentiality via Camouflage (CVC), is introduced and demonstrated in detail. CVC provides a functional and robust technique to protect online and dynamic SDBs from the inference threat. Additionally, the chapter demonstrates how CVC can be linked to an economic model for an intermediated electronic market for private information.
1 Introduction
Database (DB) security pertains to a database's ability to reliably and securely handle sensitive information. A secure DB can protect users from the risk of loss, destruction, or misuse of information. DB security should be strong enough to protect against all the above risks, but it is increasingly important that it protects against the threat of an attacker who is trying to
gain access to sensitive information. Naturally, the requirement for strong DB security leads to an inevitable problem, namely the "conflict between the individual's right to privacy and the society's need to know and process information" (Palley and Siminoff, 1987).

Much of information systems (IS) security is handled at the operating system (OS) level. For example, the modern OS is capable of concurrently managing multiple processes and multiple users while ensuring that the data stored and accessed by each of these processes and users remains separated. This requires the OS to allow only authorized users to run processes for which they have permission and to restrict processes and users from accessing unauthorized data storage locations. Separation in the OS is implemented using access control matrices coupled with user authentication (verifying user identity before authorizing access). Additionally, the modern OS uses a variety of other security features to ensure information security, including firewalls, intrusion detection, security policy implementation, routine backups, and data encryption.

IS security is also implemented at the application level; for example, database management systems (DBMS), such as Oracle and Microsoft Access, generally come equipped with many security features independent from the OS. DBMS-level security protects against specific information threats that could result in the accidental or intentional loss, destruction, and misuse of data. The modern DBMS typically supports user authentication, backups, data encryption, and other information security features. Protecting against data misuse is often difficult because it requires a balancing of the user's right to access data with protection. The misuse of data implies that an intentional act results in the improper release of information to unauthorized parties (data confidentiality), the unauthorized or unwanted modification of data (data integrity), or the loss of data availability (Castana et al., 1995). In order to protect the confidentiality, integrity, and availability of data, standard DBMS security often includes some or all of the following features:

- DB views that restrict the information the DB reveals to users.
- Data integrity controls that are enforced by the DB during queries and updates to preserve integrity.
- User authentication and access controls that limit unauthorized user access to the DB, views, queries, and tables.
- User-defined procedures that limit users' ability to manipulate data in the DB.
- Encryption that limits the readability of DB data (requires access to the decryption algorithm and a secret key in order to read data).

While OS and application level security provide a first line of defense against some common DB information security threats, there remains an additional threat, called statistical inference, that often results in the misuse of data. The statistical inference problem arises when revealed aggregate
information on sensitive data can be combined with other information to deduce individual confidential data on records in the DB (Denning, 1980). This problem is particularly acute in statistical databases (SDBs), which have been specifically tailored to handle statistical queries, and in data-mining applications that often analyze massive sets of privacy-sensitive data. The goal of both SDBs and data-mining applications is to allow users to query a system that returns important information from stored data. SDBs are specifically designed to answer statistical queries such as count, mean, standard deviation, minimum and maximum, and correlation queries (Date, 2000). Data-mining applications, on the other hand, allow users to make sense of large data sets collected in the course of daily activities for the purpose of knowledge discovery. While the implementations of SDBs and data-mining applications are quite different, both may be vulnerable to statistical inference attacks. Therefore, statistical disclosure control (SDC) and data confidentiality techniques have been developed to protect against statistical inference in SDBs, while privacy-preserving data-mining techniques have been developed to protect against statistical inference in data-mining applications (Agrawal and Srikant, 2000; Willenborg and de Waal, 2001).
2 DB Concepts
A DB is a collection of logically related data and the rules that govern the relationships between the data. A relational DB consists of two-dimensional tables storing information about real-life entities and composed of records, the rows of the table, and attributes, the columns. The terms "DB" and "relational DB", as well as DBMS and relational DBMS, are used interchangeably in this chapter. Each record, or instance, contains one group of related data including the attributes. Each field represents an elementary unit of the data. A typical relational DB model consists of multiple tables, and the DB schema contains information about the logical structure of the DB (that is, how the DB tables are related).

This chapter presents an example DB table that represents a simple but typical SDB table. The example table is used to illustrate the concept of SDB security. The table stores multiple non-confidential attributes (categorical and numeric) and a single confidential attribute (numeric). You will notice that additional attributes (protection vectors) have been created and added for the sole purpose of protecting the confidential attribute.

2.1 Types of statistical databases (SDBs)

Another important concept for DB security is the intended use of the DB. Is it a stand-alone version or networked throughout the world? Does the data change over time or does it stay the same? Does the system handle
queries as they arrive, or in a batch? Answering these questions can allow us to begin to classify DBs by their intended use environment. Turn and Shapiro (1978) classify SDBs by the following system environments:

1. Online–Offline: Online SDBs are real-time, meaning the user interacts with the DB and query answers, updates, and deletions occur at the time of the request. The user of an off-line SDB does not directly control or even know when a data request is processed.
2. Static–Dynamic: Static SDBs are created once and never change during the lifetime of the DB (no updates or deletions). A dynamic SDB can change over time, increasing the complexity of the DB security.
3. Centralized–Decentralized: Centralized SDBs essentially refer to a single copy of the DB that is maintained in one location. Decentralized SDBs, also called distributed SDBs, are multiple-copy overlapping DBs that can be maintained in several locations. Decentralized SDBs complicate security.
4. Dedicated–Shared Computer System: Dedicated SDBs exist to serve a single application, while shared SDBs exist to serve several applications concurrently. Shared SDBs increase the security risks and complexity.

System environment often impacts the complexity of the data protection problem, as well as performance requirements. For example, an online DB that requires ad-hoc queries to be answered as they arrive necessitates that the SDB be capable of rapidly handling requests. Dynamic DBs often allow users to add and delete data and may be more vulnerable to inferential disclosure risks (a user with limited read and write privileges could track changes to query responses before and after updating records, which can then lead to disclosure of sensitive data). System environment can complicate the operational and security requirements of the SDBs.
2.2 Privacy-preserving data-mining applications

Data mining, also called knowledge discovery and exploratory analysis, allows for the exploration of volumes of data collected from various sources, such as cash register data and credit card purchases, that are dumped into data warehouses. The data are analyzed using some classical statistical approaches, such as regression analysis, as well as alternative data-mining approaches including cluster analysis, neural networks, and decision tree analysis. The recent advances in the exploratory power of data-mining algorithms necessitate full consideration of the privacy implications of data mining. Privacy-preserving data mining is a stream of research that investigates these implications. Privacy-preserving data mining often draws from statistical disclosure control approaches developed for SDBs (Verykios et al., 2004).
Inference control techniques for data-mining systems, much like those for SDBs, have been classified as query-oriented and data-oriented methodologies and include data blocking and classification, partitioning, query audits, and data perturbation (Verykios et al., 2004; Zhang and Zhao, 2007). Distributed data mining occurs when the source of the data is distributed between two or more sites. Distributed data mining further complicates the statistical inference problem and introduces some interesting privacy-preserving data-mining issues. For example, even when protection against statistical inference is ensured at one site, the combined analysis involving many sites may not protect the data from statistical inference (Clifton et al., 2002). Similar to SDBs, the application and environment of data-mining systems have an impact on the systems' capability to protect against statistical inference. The more accessible, dynamic, and complex a data-mining application is, the more difficult it is to preserve privacy. Privacy-preserving data-mining research is a rapidly growing field that has many parallels to, and intersections with, statistical disclosure control and SDB research. This chapter will focus on protecting SDBs against the threat of statistical inference. Future confidentiality research may include the application of the techniques discussed in this chapter to privacy-preserving data-mining systems.

2.3 A simple database model

Consider Fig. 1, an example SDB table, where the "record", "name", "age", "state", and "occupation" attributes are non-confidential. Non-confidential attributes are assumed to be known by the DB users and do not require special protection. In other words, a query that requests only non-confidential attribute information is answered directly. The "salary" attribute in the example SDB table is confidential. A degree of care must
Record  Name  Age  State  Occupation  Salary  P1    P2
1       A     32   CT     Engineer    65      60    73
2       B     56   OH     Manager     25      27    22
3       C     45   NY     Manager     98      89.7  109.3
4       D     27   NJ     Chemist     87      81    98.4
5       E     45   CT     Manager     54      59.3  48.5
6       F     54   OH     Engineer    27      25    30.4
7       G     37   NY     Journalist  45      48    39
8       H     34   CT     Engineer    78      68    83.6
9       I     28   CT     Manager     56      49.1  60.3
10      J     47   OH     Manager     30      33    27

Fig. 1. Example database table (Record, Name, Age, State, and Occupation are non-confidential; Salary is the confidential vector; P1 and P2 are protection vectors).
be taken when answering queries that use information specific to a confidential attribute. For instance, a query, Q1, requesting a count of the records of subjects from Connecticut could be answered directly. However, care should be taken in answering a query, Q2, requesting a count of subjects with a salary over 50, because this query involves confidential information. Also in Fig. 1 there are two protection attributes, P1 and P2, explained in Section 4, that exist for the sole purpose of protecting the confidential attribute. In the example DB there are n = 10 records, five non-confidential attributes, one confidential attribute, and k = 2 protection attributes.

2.4 Statistical inference in SDBs

There are a variety of logical and statistical methods that an attacker may use in attempts to deduce the values of protected sensitive information. An attacker may try to discover protected data using insider information (that is, prior knowledge of either some of the protected data or some aspect of the protection algorithm). The attacker may very well be a legitimate user of the system who has access to some of the information in the SDB, but is restricted from direct access to other more sensitive information. Hence, understanding disclosure risk and protecting against it is difficult not only because of the complexity of the problem itself, but also due to the difficulties in making assumptions about the disclosure threats and information needs of the users (Willenborg and de Waal, 2001).

Attacks often take on the characteristic of a sequence of apparently innocuous queries. Each query during the attack might appear to be a legitimate query that in and of itself does not leak any of the protected information. For example, consider the following sequence of queries to the example SDB above by an attacker with no insider information who desires exact salary knowledge of person "F". We can assume that all OS and application levels of security are effective; however, the user is allowed access to information aggregated from the sensitive information (COUNT and SUM) because this information is required by the user to execute her official duties. Note that in this sequence of queries, no single query alone discloses individual salary information.

Query 1: Name, state, and occupation of every employee. Answer: All data is provided.
Query 2: Average salary of Ohio employees. Answer: 27.3334.
Query 3: Number of Ohio employees. Answer: 3.
Query 4: Average salary of Ohio managers. Answer: 27.5.
Query 5: Number of Ohio managers. Answer: 2.

An attacker could quickly infer from these queries that the salary of employee F is 27. This simple illustration shows an SDB's vulnerability to the indirect disclosure of a sensitive datum by an attacker with no prior
knowledge of the information (though, as a legitimate user, she may have some insider knowledge of the protection algorithm or at least of how queries are answered). The problem arises because SDBs balance user access with data protection. In this example, there is a simple protection algorithm that says "do not answer singleton queries". This example illustrates a very simple attack that prevails over a naive protection mechanism. Of course, there are more sophisticated measures that can be taken to protect information against attacks. We discuss some of these measures in the following section. Eventually, the DB administrator (DBA) is forced to make a trade-off of "data loss", a degradation of information quality, in exchange for increased data protection. The remainder of this chapter will focus on techniques available for the protection of sensitive information in SDBs.

3 Protecting against disclosure in SDBs
Inferential disclosure (henceforth referred to as disclosure) refers to the ability of an attacker to infer either the identity of an entity to which a particular record belongs (re-identification) or the value of a confidential variable (predictive disclosure), as described by Willenborg and de Waal (2001). Disclosure can be complete or partial. Complete disclosure, also called exact disclosure, is said to occur if the exact value of a sensitive datum is revealed. Partial disclosure may also be referred to as "statistical disclosure". Beck (1980) states that statistical disclosure is said to occur if information from a sequence of queries makes it possible to obtain a better estimate for a sensitive data point than would have been possible from only a single query. Dalenius (1977) comments that disclosure control in SDBs refers to a system's ability to provide answers, usually as point estimates, to users' queries while necessitating a large number of independent samples to obtain a small variance of the estimator. Adam and Wortmann (1989) discuss the adequacy of disclosure control mechanisms and state that in order for disclosure control to be acceptable it must prevent exact disclosure while providing statistical disclosure control. If there is a common theme in the various views of disclosure control, it is that any disclosure control technique must consider both exact disclosure and partial disclosure. Further, while the issue of protecting against exact disclosure is relatively clear, protecting against partial disclosure is less clear, and this implies that there is an acceptable range of statistical protection that could vary from no disclosure to some level of disclosure.

While the level of partial disclosure is imperative when discussing the adequacy of a disclosure control mechanism, it is not the only factor to consider. The next most obvious factor to consider is the level of "information loss" (Adam and Wortmann, 1989). There are also several additional factors to be considered, including system robustness, ability to protect data (numerical and categorical), ability to protect more than one
attribute, suitability for the protection of dynamic SDBs, richness of information given in answers to queries, cost of implementation, and overall usability (Adam and Wortmann, 1989). Ultimately, a DBA may decide that partial disclosure is acceptable because it may be the only way to balance the disclosure control with the other performance factors.

3.1 Protecting against statistical inference

Approaches to protecting against statistical inference in SDBs can be divided into four general categories: conceptual, data masking, query restriction, and camouflaging techniques. In implementation, these approaches are sometimes combined to increase effectiveness. Conceptual techniques for data protection are introduced in the conceptual design phase of the DB. Two common types of conceptual techniques are the lattice model and conceptual partitioning (Chin and Ozsoyoglu, 1981). Query restriction refers to restricting answers to queries that could potentially lead to disclosure of confidential data. Query restriction is discussed in Chin and Ozsoyoglu (1982), Dobkin et al. (1979), Fellegi (1972), Gopal et al. (1998), Hoffman and Miller (1970), and Schlorer (1980). Data masking techniques are introduced and discussed in Beck (1980), Denning (1980), Lefons et al. (1982), Leiss (1982), Liew et al. (1985), Muralidhar et al. (1995), Reiss (1984), and Schlorer (1981). Data masking refers to the changing of numerical data systematically so that a user querying the data cannot determine confidential data with certainty. Camouflage techniques "hide" the confidential data within a larger data set and answer queries with respect to that set. Users are then provided with intervals that are guaranteed to contain the exact answer. Thus, these techniques are related to query restriction in that query answers are "deterministically correct", while eliminating the need to ever refuse to answer a query. The latter property is shared with data masking techniques, as is the downside that query answers are not generally exact (point answers). Gopal et al. (2002) introduce the Confidentiality Via Camouflage (CVC) approach for confidential numerical data.

As previously mentioned, each of the above data confidentiality methods has advantages and disadvantages. Conceptual and query restriction techniques offer effective protection and the answers given are correct because they are exact answers; however, often these techniques require some queries not to be answered. Data masking techniques allow for all queries to be answered; however, the answers given are not exact and can differ significantly from the exact answers, especially in the case of queries of small cardinality. CVC's major advantage is that it will give deterministically correct answers to an unlimited number of user queries. However, CVC techniques are sometimes vulnerable to "insider information" threats. These are threats due to prior knowledge of some sensitive data or of the CVC technique, specifically knowledge of the parameters used in the technique.
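Before turning to the individual approaches, it is useful to see exactly what they are guarding against. The tracker-style attack of Section 2.4 can be reproduced with a few lines of arithmetic over the released aggregates; the sketch below is our own illustration using the salary data of Fig. 1 (the function name is hypothetical, not part of any DBMS).

```python
# Minimal sketch of the inference attack from Section 2.4, assuming the
# example table of Fig. 1; only aggregate answers are "revealed" to the user.
rows = [  # (name, state, occupation, salary)
    ("A", "CT", "Engineer", 65), ("B", "OH", "Manager", 25),
    ("C", "NY", "Manager", 98), ("D", "NJ", "Chemist", 87),
    ("E", "CT", "Manager", 54), ("F", "OH", "Engineer", 27),
    ("G", "NY", "Journalist", 45), ("H", "CT", "Engineer", 78),
    ("I", "CT", "Manager", 56), ("J", "OH", "Manager", 30),
]

def avg_and_count(pred):
    """The only information the SDB releases: AVG and COUNT over a query set."""
    sal = [s for (_, st, occ, s) in rows if pred(st, occ)]
    return sum(sal) / len(sal), len(sal)

avg_oh, n_oh = avg_and_count(lambda st, occ: st == "OH")                          # Queries 2-3
avg_mgr, n_mgr = avg_and_count(lambda st, occ: st == "OH" and occ == "Manager")   # Queries 4-5

# Ohio has exactly one non-manager (employee F), so subtraction exposes the datum.
salary_F = avg_oh * n_oh - avg_mgr * n_mgr
print(round(salary_F, 2))   # 27.0 -- F's confidential salary
```

No single released aggregate is sensitive on its own; it is the combination that leaks the exact value, which is precisely the behavior the techniques below are designed to prevent.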
3.2 The query restriction approach

Query restriction techniques protect sensitive data by enforcing rules restricting the type and number of queries answered. One very simple technique is based on query size, where queries of very small size or very large size are not answered. For instance, the earlier example query "Average salary of all the engineers from Ohio" would not be answered because the cardinality (the number of records used in the query) is one. Similarly, the query "Average salary of all employees" would not be answered because a follow-up query such as "Average salary of all employees who are not journalists" would allow a user to quickly infer the journalist's salary. However, Friedman and Hoffman (1980) point out that simple query restriction alone may not guarantee confidentiality. They proposed an expanded query-set size control mechanism that takes into account the logical relationship between separate queries and determines an "implied query set". However, it is easy to show that the number of implied query sets grows exponentially with the number of attributes, and therefore calculating and storing the implied query sets could be exorbitantly expensive in terms of computation and storage. Furthermore, it has been shown that sensitive statistics can still be inferred from allowed query answers (Denning, 1982). Query-set overlap techniques were developed to overcome this problem (Schlorer, 1976). However, these techniques are burdensome because they require extensive auditing and comparing incoming queries with all the query answers previously released. Other approaches to this type of query restriction include using an audit expert system (Chin and Ozsoyoglu, 1982) and the combination of query restriction and conceptual design techniques that take into account the number of attributes, such as logical partitioning and cell suppression.

3.3 The data masking approach

The general idea of data masking is to modify the confidential data. In some realizations the replacement data are also from the same confidential field. In that case the technique is called data swapping or data shuffling (Muralidhar and Sarathy, 2006). When the data does not come from the confidential field, but is randomly generated, the name perturbation is often used. There are a number of ways that the sensitive data may be perturbed. For instance, one technique is to view the sensitive numerical data in an SDB as a sample belonging to a distribution of a population. Then, by either replacing the "exact" sample with another sample from the same distribution or by simply using the distribution, rather than the exact data, to answer queries, disclosure of "exact" information can be avoided. Data
masking can allow for unlimited query answers, thus allowing more queries to be answered than would query restriction techniques.

3.4 The confidentiality via camouflage (CVC) approach

In CVC, protection of confidential data is termed "interval protection" because each confidential datum is guaranteed protection in the form of an interval (Garfinkel et al., 2002). That is, for the ith given confidential datum $a_i$, an interval $[\ell_i, u_i]$ is established such that $a_i \in [\ell_i, u_i]$, and no user will be able to determine from the answers to queries that $a_i > \ell_i$ or that $a_i < u_i$. The values $\ell_i$ and $u_i$ are set by the ith user. Interval answers to query functions are given by the minimum and maximum of the functions over a convex, compact (closed and bounded) set that contains the confidential vector, thus assuring that answers will be deterministically correct. Details are given in the next section.
4 Protecting data with CVC
Suppose a DB user submits a query Q to the DBA. Query selection results in the query set T(Q), which is a subset of all records in the DB, $N = \{1, 2, \ldots, n\}$. T(Q) is of cardinality t, and $T(Q) \subseteq N$. The confidential vector is denoted a. A query Q corresponds to a function over the record set T(Q), which is denoted as $f_Q(\cdot)$. If the query Q is over the query set of actual values a, its corresponding answer is $f_Q(a)$. If the data involved in the query is not confidential, the query is answered exactly. Consider the query in the following example.

Example 1. Answering a SUM query: A user submits query $Q_S$: The sum of the salaries of the engineers to the DBA (who answers queries using the example SDB table, Fig. 1). $T(Q_S) = \{1, 6, 8\}$, $t = 3$, $f_{Q_S}(a) = \sum_{i \in T(Q_S)} a_i = 170$.

Consider protecting the private vector a using the vector set P, where

\[
\mathcal{P} := \mathrm{conv}(P) = \Big\{ x : x = \sum_{j=1}^{k} \lambda_j P^j,\ \sum_{j=1}^{k} \lambda_j = 1,\ \lambda_j \geq 0 \Big\}
\]

is the convex hull of the vectors in $P = \{P^1, \ldots, P^k\}$. The protection vectors are created so that $a \in \mathcal{P}$ to ensure that query answers are deterministically correct. Also, the protection vectors are chosen to provide a level of protection guaranteed to the DB subjects. That is, for each i there is at least one $p_i^j \leq \ell_i$; otherwise a user will be able to infer that $\ell_i < \min\{p_i^j : 1 \leq j \leq k\} = \min\{x_i : x \in \mathcal{P}\} \leq a_i$. Similarly, for each i there is at least one $p_i^j \geq u_i$; otherwise a user will be able to infer that $u_i > \max\{p_i^j : 1 \leq j \leq k\} = \max\{x_i : x \in \mathcal{P}\} \geq a_i$.
Record  P1    P2     P3    a
1       60    73     52.5  65
2       27    22     29.5  25
3       89.7  109.3  82.2  98
4       81    98.4   67.5  87
5       59.3  48.5   59.8  54
6       25    30.4   21.5  27
7       48    39     55.5  45
8       68    83.6   79    78
9       49.1  60.3   55.6  56
10      33    27     33    30

Fig. 2. Database table with a in the relative interior of $\mathcal{P}$.
Query responses are given as answer intervals $I_Q = [r^L, r^U]$, where

\[
r^L := \min\{ f_Q(x) : x \in \mathcal{P} \} \qquad (1)
\]
\[
r^U := \max\{ f_Q(x) : x \in \mathcal{P} \} \qquad (2)
\]

To ensure protection it is necessary for a to be a point in the relative interior of $\mathcal{P}$. This can be accomplished by choosing arbitrary vectors $P^1, \ldots, P^{k-1}$ and strictly positive numbers $\gamma_1, \ldots, \gamma_k$ such that $\sum_{j=1}^{k} \gamma_j = 1$, and then setting

\[
P^k := a/\gamma_k - \sum_{j=1}^{k-1} (\gamma_j/\gamma_k)\, P^j
\]

Figure 2 shows the CVC protection vectors for k = 3, where $P^1$ and $P^2$ are from Fig. 1 and $\gamma_1 = 0.3$, $\gamma_2 = 0.5$, and $\gamma_3 = 0.2$.

4.1 Computing certain queries in CVC

Gopal et al. (2002) provide methods to determine interval answers for VAR, MIN, MEDIAN, PERCENTILE, and REGRESSION queries. These methods include four classes of algorithms: (1) extreme point evaluation (used for SUM queries), (2) very efficient algorithms that are minimal access and time bounded by a low-order polynomial in k (used for VAR queries), (3) grid search (REGRESSION query), and (4) bounding heuristics, fast minimal access heuristics that yield $[r^-, r^+]$ where $r^- \leq r^L$ and $r^+ \geq r^U$ (used for MIN queries). We briefly introduce these methods by way of several examples (see Fig. 3 and Examples 2–4 below).
Record  State  Occupation  a    p1   p2    p3
1       CT     Engineer    65   60   73    52.5
6       OH     Engineer    27   25   30.4  21.5
8       CT     Engineer    78   68   83.6  79
SUM                        170  153  187   153

Fig. 3. Selected DB table used for Examples (2)–(4).
Example 2. $Q_S$: The sum of the salaries of the engineers. $T(Q_S) = \{1, 6, 8\}$, and t = 3. We have

\[
r^U = \max_{1 \leq j \leq k} \Big\{ \sum_{i \in T(Q_S)} p_i^j \Big\} = 73 + 30.4 + 83.6 = 187
\]
\[
r^L = \min_{1 \leq j \leq k} \Big\{ \sum_{i \in T(Q_S)} p_i^j \Big\} = 60 + 25 + 68 = 153
\]

and the answer is given as the interval $I_{Q_S} = [r^L, r^U] = [153, 187]$ (exact answer is 170).

Example 3. $Q_M$: The minimum of the salaries of the engineers. $T(Q_M) = \{1, 6, 8\}$, and t = 3. MIN is a concave function and so $r^L$ is achieved at an extreme point of $\mathcal{P}$, that is,

\[
r^L = \min_{1 \leq j \leq k} \min_{i \in T(Q_S)} p_i^j = 21.5
\]

However, computing $r^U$ would require solving a linear program with k+1 variables (4 in this example) and t constraints (3 in this example). Since t may be large, a bounding heuristic is used:

\[
r^U = \min_{i \in T(Q_S)} \max_{1 \leq j \leq k} p_i^j = 30.4
\]

and the interval answer is $I_{Q_M} = [r^L, r^U] = [21.5, 30.4]$ (exact answer is 27).

Example 4. $Q_V$: The variance of the salaries of the engineers. $T(Q_V) = \{1, 6, 8\}$, and t = 3. In this case VAR is a convex function and so $r^U$ is achieved at an extreme point of $\mathcal{P}$, that is,

\[
r^U = \max_{1 \leq j \leq k} \Big\{ \frac{1}{t} \sum_{i \in T(Q)} \big( p_i^j - \bar{p}^j \big)^2 \Big\} = 552.2
\]
To compute $r^L$ it is necessary to solve a quadratic program, namely,

\[
r^L = \min\Big\{ \frac{1}{t} \sum_{i \in T(Q)} \Big( \sum_{j=1}^{k} \lambda_j \big( p_i^j - \bar{p}^j \big) \Big)^2 : \sum_{j=1}^{k} \lambda_j = 1,\ \lambda_j \geq 0 \Big\} = 348.7
\]

and the interval answer is $I_{Q_V} = [r^L, r^U] = [348.7, 552.2]$ (exact answer is 468.2).
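The extreme-point calculations in Examples 2–4 are straightforward to reproduce. The sketch below is our own illustration (not the authors' implementation): it rebuilds $P^3$ from a, $P^1$, $P^2$ and the weights $\gamma = (0.3, 0.5, 0.2)$ used in Fig. 2, and then evaluates the SUM, MIN, and VAR bounds for the engineer query set; the VAR lower bound (348.7) comes from the quadratic program above and is not recomputed here.

```python
# A sketch of CVC interval answers over conv(P^1, P^2, P^3), assuming the data
# of Figs. 1-2; variable and function names are our own illustration.
a  = [65, 25, 98, 87, 54, 27, 45, 78, 56, 30]              # confidential salaries
P1 = [60, 27, 89.7, 81, 59.3, 25, 48, 68, 49.1, 33]
P2 = [73, 22, 109.3, 98.4, 48.5, 30.4, 39, 83.6, 60.3, 27]
g  = [0.3, 0.5, 0.2]                                        # strictly positive, sum to 1

# P^3 := a/g_3 - (g_1/g_3) P^1 - (g_2/g_3) P^2, so a lies in the relative interior.
P3 = [x / g[2] - (g[0] / g[2]) * y - (g[1] / g[2]) * z for x, y, z in zip(a, P1, P2)]
P  = [P1, P2, P3]                                           # extreme points of the polytope

T = [0, 5, 7]                                               # records 1, 6, 8 (0-based): engineers

def pvar(vals):                                             # population variance, as in Example 4
    m = sum(vals) / len(vals)
    return sum((v - m) ** 2 for v in vals) / len(vals)

# SUM is linear, so both bounds are attained at extreme points (Example 2).
sums = [sum(Pj[i] for i in T) for Pj in P]
print("SUM interval:", round(min(sums), 1), round(max(sums), 1))        # 153.0 187.0

# MIN is concave: the lower bound is at an extreme point; the upper bound
# uses the bounding heuristic min_i max_j p_i^j (Example 3).
print("MIN interval:", round(min(min(Pj[i] for i in T) for Pj in P), 1),
      round(min(max(Pj[i] for Pj in P) for i in T), 1))                 # 21.5 30.4

# VAR is convex: the upper bound is at an extreme point (Example 4).
print("VAR upper bound:", round(max(pvar([Pj[i] for i in T]) for Pj in P), 1))  # 552.2
```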
4.2 Star

Star is a variation of CVC. In CVC, protection is implemented through a camouflaging set that is a polytope, while Star protection is implemented through an n-dimensional union of n line segments intersecting at a common point. The adjective "star" originates from the resemblance of the protection set to a multi-dimensional star. There are two major advantages of the Star technique: first, it protects confidential data against insider information threats, and second, it is a more flexible technique in that the protection bounds may be easily manipulated in order to generate smaller answer intervals. Protecting against insider data information is a very desirable property. However, Star is vulnerable to insider algorithm information. In other words, the Star pseudo code must not be known to users. Concretely, Star uses a protection set $\mathcal{P}$ defined as

\[
\mathcal{P} := \bigcup_{i \in N} S_i
\]

where each $S_i$ is the line segment

\[
S_i := \big\{\, a + \big( \alpha u_i + (1-\alpha)\ell_i - a_i \big)\, e_i : \alpha \in [0, 1] \,\big\}
\]

and $e_i$ is the ith unit n-vector. In words, $S_i$ is the line segment in which all elements of a except for $a_i$ retain their original values, while the ith element takes on all values in the range $[\ell_i, u_i]$. It follows that for any query Q corresponding to the function $f_Q$ and the set T(Q), the answer interval $I_Q$ can be computed by minimizing and maximizing $f_Q$ over all $S_i$, $i \in T(Q)$, and then concatenating the t answer intervals. We repeat the following example to illustrate the Star solution for a SUM query submitted to the example SDB table.

Example 5. $Q_S$: The sum of the salaries of the engineers. $T(Q_S) = \{1, 6, 8\}$, and t = 3. Three sets are generated, and the SUM is calculated, providing an answer interval for each set in Table 1. The interval answer for the query is $I_{Q_S} = [148, 202]$.
Table 1. CVC-Star computation for SUM example

Record  Set 1    Set 2    Set 3
1       50–90    65       65
6       27       5–45     27
8       78       78       70–110
SUM     155–195  148–188  162–202
Note that for a large-size query, the creation of the line segments and optimizing over each segment may become unwieldy, even though the calculations are very straightforward. Therefore, exact algorithms that are O(t) and minimal access have been developed for some common query types including SUM, COUNT, and VARIANCE (Garfinkel et al., 2006a).
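For a SUM query the Star computation amounts to sliding one protected coordinate at a time across its interval while keeping the others at their exact values. A minimal sketch, assuming the protection intervals [50, 90], [5, 45], and [70, 110] implied by Table 1 (names are illustrative only):

```python
# Sketch of the Star answer interval for a SUM query, assuming the protection
# intervals implied by Table 1; names are our own illustration.
a = {1: 65, 6: 27, 8: 78}                          # exact engineer salaries (Fig. 1)
bounds = {1: (50, 90), 6: (5, 45), 8: (70, 110)}   # [l_i, u_i] for each subject

def star_sum_interval(query_set):
    """Optimize the SUM over each line segment S_i and concatenate the results."""
    total = sum(a[i] for i in query_set)
    per_segment = [(total - a[i] + bounds[i][0],    # i-th value pushed down to l_i
                    total - a[i] + bounds[i][1])    # i-th value pushed up to u_i
                   for i in query_set]
    return min(lo for lo, _ in per_segment), max(hi for _, hi in per_segment)

print(star_sum_interval({1, 6, 8}))                # (148, 202), as in Example 5
```

The three per-segment intervals are exactly the Set 1–Set 3 rows of Table 1, and their concatenation gives the reported answer interval [148, 202].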
5 Linking security to a market for private information—A compensation model

5.1 A market for private information

The purpose of SDBs and data-mining systems is to extract data (sometimes from various sources), analyze data, and present the information and knowledge gleaned from analysis to the systems' users. When the data collected and analyzed is of a personal nature, this introduces confidentiality problems in SDBs and the problem of privacy preservation in data mining. Most individuals consider privacy as a guarantee that their personal information will not be misused. One way that the administrators of SDBs and data-mining applications can ensure that subjects (the individuals providing their personal data, one way or another, to the system) do not have their privacy violated is to fairly compensate them for the use of their information. For example, some marketing firms have been compensating users for the collection and use of personal data in creating marketing information using "freebies" such as free Internet access, e-mail service, and even computers, all in exchange for individuals' private information (Chang et al., 1999). Laudon (1996) introduces the concept of a National Information Marketplace (NIM), a marketplace that would have the capability of tracking individuals' private information and compensating these individuals for use of their private information. Varian (1996) explores markets for personal information, like the NIM depicted by Laudon, from an economic perspective. These personal information markets, however, require the control of confidentiality as well as fair compensation. Therefore, a compensation mechanism that protects subjects' privacy while providing
compensation to the subjects is necessary for the creation of sustainable personal information markets. This section demonstrates a compensation model based on the CVC approach.

5.2 Compensating subjects for increased risk of disclosure

We develop subject compensation models that are based on the reduction in protection intervals (Garfinkel et al., 2006a,b). We model the decrease in the protection interval as a proxy for an increased risk of disclosure. In our models, the compensation details are agreed upon between a third-party trusted information intermediary (TII) and each subject. The details of the agreement, and the resulting compensation model, influence the degree of control subjects exert over the level of disclosure of their private information. Varying degrees of subject control may be achieved through more detailed agreements and the associated implementations. The way that the TII manipulates bounds and compensates subjects has a great impact on the TII's production costs. The TII should set a price of compensation such that subjects have incentive to participate. We look at two compensation models that we call "elastic" and "inelastic".

Answer quality is defined to be flexible if the TII allows "constrained queries", where the constraint is an upper limit on the range of the answer interval. For example, consider again the query $Q_S$: the sum of the salaries of the engineers. Based on the data from Table 1, the exact answer is 170, the interval answer from Star is [148, 202], and the range of the interval is 202 − 148 = 54. An example of a constrained query would occur when a user requests the range of the answer interval to be less than or equal to 40, that is, a 26% reduction with respect to the initial answer range. Additionally, consistency in answering queries involving shrinking of protection bounds is critical. That is, if a user is provided with two different answers to the same query, then the user could infer additional information. For instance, by intersecting overlapping intervals resulting from inconsistent answers, a user could expose a DB to additional inference threat. Therefore, queries requiring reductions in protection bounds should be systematically answered so that the answers to the same queries are consistent. The nature of the Star model guarantees consistent answers.

Elastic "memoryless" shrinking and compensation: Elastic bound manipulation is considered to be "memoryless" or resilient in the sense that each query is answered individually and any adjustment to protection bounds in answering the query is forgotten as soon as the query has been answered. After a query is answered, the TII compensates the appropriate subjects and the DB is rolled back to its original protection state. Subsequent query answers would be based on the original protection state. Subjects are compensated each time a query answer compromising their protection bounds is given.
Inelastic shrinking and compensation: The inelastic compensation method is considered to have "memory". When a query is answered, subjects are compensated if their protection bounds are manipulated, resulting in smaller protection intervals. However, after the query is answered, the DB does not go back to its original state. Rather, it remains at the current state and the changed protection bounds remain. A subsequent query is answered based on the new state of the protection bounds. The TII can choose to implement either compensation scheme or some combination of the two.

5.3 Improvement in answer quality

As indicated earlier, the improvement of the quality of an answer to a query is determined by comparing the answer intervals obtained before and after reducing the subjects' protection intervals. We denote by $\theta_i$, $0 \leq \theta_i \leq 1$, the percentage of reduction in the protection interval of subject $i \in N$. We also denote by $R_i := u_i - \ell_i$ the protection range for subject i. Let $D_i^- := \ell_i - a_i$ and $D_i^+ := u_i - a_i$ for all $i \in N$ (so that $\ell_i = a_i + D_i^-$, $u_i = a_i + D_i^+$, and $R_i = D_i^+ - D_i^-$). We denote by $[\ell_i(\theta_i), u_i(\theta_i)]$ the interval obtained after reducing by $100\theta_i\%$ the protection interval of subject i, and by $R_i(\theta_i) := u_i(\theta_i) - \ell_i(\theta_i)$ the corresponding range. The following two conditions should hold:

\[
a_i \in [\ell_i(\theta_i), u_i(\theta_i)] \qquad (3)
\]
\[
[\ell_i(\theta_i), u_i(\theta_i)] \subseteq [\ell_i, u_i] \qquad (4)
\]

There are many ways to shrink the interval $[\ell_i, u_i]$ so as to achieve Eqs. (3)–(4). In particular, we will describe one algorithm to compute $\ell_i(\theta_i)$ and $u_i(\theta_i)$ in Section 5.3. Given a general query Q with a corresponding exact answer $f_Q(a)$, we denote by $R(f_Q)$ the range of the interval answer obtained before reducing the protection intervals of the DB subjects. For example, $R(f_{Q_S}) = 202 - 148 = 54$. To simplify the discussion for the rest of this section, we will use f and T instead of $f_Q$ and T(Q), respectively, and, abusing notation, refer to f as the query. Notice that

\[
R_i(\theta_i) = (1 - \theta_i) R_i \qquad (5)
\]

for all $i \in N$. Therefore, on a scale of zero to one, the overall improvement in the quality of the answer to a query f, after the reduction of the subjects' protection intervals, is given by

\[
q(\Theta) := 1 - \frac{f^+(\Theta) - f^-(\Theta)}{R(f)} \qquad (6)
\]
where Y := (y_1, ..., y_n),

$$f^{-}(Y) := \inf\{\, f(x, T) : x \in S(Y) \,\} \tag{7}$$

$$f^{+}(Y) := \sup\{\, f(x, T) : x \in S(Y) \,\} \tag{8}$$

and S(Y) is the Star protection set obtained from using the subject intervals [l_i(y_i), u_i(y_i)], i ∈ N.

5.4 The compensation model

Let M denote an index set to refer to the users and F_m denote an index set for the queries requested by user m ∈ M. A user is charged a price p_f for each record accessed to answer a query f ∈ F_m and a price s_i proportional to the reduction of the ith protection interval. The user's profit function after shrinking the protection intervals is given by

$$\Pi_m := \sum_{f \in F_m} \left( B_m\big(q(Y^f), |T^f|\big) - p_f\, |T^f| - \sum_{i \in T^f} s_i\, y_i^f \right) \tag{9}$$

Here B_m is a function of query cardinality and answer quality representing the user's utility from the interval answer to a given query, and Y^f is the vector of interval shrinkages from query f. We assume that B_m is twice-continuously differentiable, concave, and non-decreasing in both coordinates. The total revenue from the users of the system is

$$\Pi_{\mathrm{total}} := \sum_{m \in M} \sum_{f \in F_m} \left( p_f\, |T^f| + \sum_{i \in T^f} s_i\, y_i^f \right) \tag{10}$$

Of these total revenues the TII pays the subjects a fraction 1 − b, where b ∈ [0, 1], of the revenue that is generated from the shrinking of the protection intervals. Therefore the remaining proceeds, namely

$$\Pi_{\mathrm{TII}} := \sum_{m \in M} \sum_{f \in F_m} p_f\, |T^f| + b \sum_{m \in M} \sum_{f \in F_m} \sum_{i \in T^f} s_i\, y_i^f \tag{11}$$

constitute the revenues of the TII. The revenue of subject i is

$$\Pi_i := (1 - b)\, s_i \sum_{m \in M} \; \sum_{f \in F_m,\, T^f \ni i} y_i^f \tag{12}$$
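To make the revenue flows in Eqs. (9)-(12) concrete, the following minimal Python sketch computes the payoffs generated by a single purchased query. It is purely illustrative: the function, its argument names, and the toy numbers are our own assumptions, not part of the chapter beyond the formulas above.

```python
# Illustrative sketch of the revenue split in Eqs. (9)-(12) for one purchased query;
# all names and numbers are hypothetical, not from the chapter.

def revenue_split(T_f, p_f, s, y_f, b, benefit):
    # T_f: subjects touched by query f; p_f: per-record price
    # s[i]: unit shrinkage price of subject i; y_f[i]: reduction fraction for f
    # b: share of shrinkage revenue retained by the TII
    # benefit: the user's utility B_m(q(Y^f), |T^f|) for this query
    record_charge = p_f * len(T_f)                        # p_f * |T^f|
    shrink_charge = sum(s[i] * y_f[i] for i in T_f)       # sum_i s_i * y_i^f
    user_profit = benefit - record_charge - shrink_charge          # one term of Eq. (9)
    tii_revenue = record_charge + b * shrink_charge                # one term of Eq. (11)
    subject_revenue = {i: (1 - b) * s[i] * y_f[i] for i in T_f}    # one term of Eq. (12)
    return user_profit, tii_revenue, subject_revenue

# Toy example: three subjects, equal unit prices, 2% shrinkage each.
T_f = [1, 2, 3]
s = {1: 5.0, 2: 5.0, 3: 5.0}
y_f = {1: 0.02, 2: 0.02, 3: 0.02}
print(revenue_split(T_f, p_f=1.0, s=s, y_f=y_f, b=0.5, benefit=10.0))
```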
If the quality improvement function q(Y) is differentiable, then we obtain first-order optimality conditions for the mth user's profit, namely,

$$\frac{\partial B_m}{\partial q}\big(q(Y^f), |T^f|\big)\, \frac{\partial q}{\partial y_i^f}(Y^f) - s_i = 0 \tag{13}$$

for all f ∈ F_m and i ∈ T^f. Eq. (13) indicates that user m's quality choice will depend on the change in the user's benefit with respect to quality, the change in protection intervals, and the cost of protection reduction. By re-arranging terms, we get

$$\frac{\partial q}{\partial y_i^f}(Y^f) = \frac{s_i}{\partial B_m\big(q(Y^f), |T^f|\big)/\partial q} \tag{14}$$
which indicates that, for a given query f, the optimal reduction in the protection interval for subject i is found when the change in the overall quality of the answer to the query matches the ratio of the unit reduction cost of the subject's interval to the marginal utility of user m. Roughly speaking, when the subject's unit reduction cost exceeds the user's marginal utility, equilibrium is achieved at a point where the query answer quality curve, as a function of interval reduction, is steep. In other words, equilibrium is found at a point where there are considerable gains in quality achieved by small reductions in interval size. On the other hand, when the unit reduction cost is below the utility margin, equilibrium is achieved at a point where the quality curve is close to flat, that is, where there is little to gain from small reductions in interval size.
Note that subject i receives revenue only if the data point a_i is used in the calculation of query answers. The determination of whether that happens is independent of the unit price s_i and the protection interval [l_i, u_i], unless the user requests shrinking of the answer interval. In that case, if either s_i is too high or [l_i, u_i] is too wide, the total price of shrinking may be so high that the user will refuse to pay, and thus no compensation will accrue to subject i. Therefore the subjects are motivated to be honest in terms of both their required costs and levels of protection. We formalize that concept with the following proposition, the proof of which can be found in Garfinkel et al. (2006b).

Proposition 6. Given an arbitrary query, a fixed user-desired answer quality level for the query, and a fixed subject i for whom the protection interval has been reduced, then (i) there exists a constant ŝ_i such that Π_TII and Π_i are linear, strictly increasing functions of s_i for s_i ≤ ŝ_i, and Π_TII = Π_i = 0 for s_i > ŝ_i; and (ii) there exists a constant r̂_i such that Π_TII and Π_i are increasing functions of R_i for R_i ≤ r̂_i, and Π_TII = Π_i = 0 for R_i > r̂_i.

As a consequence of this result, it is clear that there is no incentive for a DB subject to arbitrarily increase her protection interval. Also, there is no incentive for either the TII or the subjects to arbitrarily increase the prices s_i.
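The threshold behavior described in Proposition 6 can be illustrated with a toy numerical sketch. Everything below is hypothetical: the user is simply modeled as refusing the query once the total charge exceeds a fixed willingness to pay, which is a simplification of the profit condition in Eq. (9).

```python
# Toy illustration of Proposition 6: subject i's revenue grows linearly in s_i
# until the total charge exceeds the user's willingness to pay, then drops to zero.
# All parameter values are hypothetical.

def subject_revenue(s_i, y_i=0.05, b=0.5, other_charges=100.0, willingness_to_pay=102.0):
    total_charge = other_charges + s_i * y_i   # record charges plus this subject's shrinkage fee
    if total_charge > willingness_to_pay:      # the user refuses to buy the answer
        return 0.0
    return (1 - b) * s_i * y_i                 # the subject's share, as in Eq. (12)

for s_i in (10, 20, 30, 40, 50):
    print(s_i, subject_revenue(s_i))           # revenue rises until roughly s_i = 40, then is zero
```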
5.5 Shrinking algorithm

The Star method has the advantage that it can quickly calculate answers of varying quality levels to satisfy user demand. The following generic algorithm provides optimal, i.e., minimum cost, reductions of subject protection intervals to achieve a desired level of quality of the answer to a given query.

Star shrinking algorithm:
  Input parameters: q̂, l, u, T, and a query f
  Output parameters: (f^−(Y), f^+(Y))
  Algorithm body:
    Y ← 0;
    for each i ∈ T do R̂_i ← u_i − l_i;
    stop ← false;
    while not stop do
      stop ← true;
      M^− ← { i ∈ T : f_i^−(Y) = f^−(Y) };
      M^+ ← { i ∈ T : f_i^+(Y) = f^+(Y) };
      find increment y in the lower bounds l;
      for each i ∈ M^− do l_i ← l_i + y·R̂_i; Y_i ← Y_i + y; end for;
      if q(Y) ≥ q̂ then return (f^−(Y), f^+(Y));
      find decrement y in the upper bounds u;
      for each i ∈ M^+ do u_i ← u_i − y·R̂_i; Y_i ← Y_i + y; end for;
      if q(Y) ≥ q̂ then return (f^−(Y), f^+(Y));
      if y > 0 then stop ← false;
    end while
In the algorithm, q̂ represents a user's desired quality improvement level for a given query f. The algorithm will return a pair (f^−(Y), f^+(Y)) corresponding to a Star interval answer with improvement of at least q̂. At each iteration the algorithm first tries to increase a collection of protection lower bounds to reduce the overall size of the protection set and thus to improve the quality of the Star answer to the query. If the increment in the lower bounds is not enough, then the algorithm tries to decrease a collection of protection upper bounds to additionally reduce the size of the original protection set. The algorithm iterates until the original protection set is sufficiently reduced in size as to achieve the desired quality level q̂. Since the algorithm can potentially reduce the original protection set to just one point (the confidential vector a), its finiteness is guaranteed.
To find the increment in the chosen lower bounds at each iteration, the variable y is increased as long as three conditions are satisfied: the confidential values a_i remain inside the reduced protection intervals; the lower bounds indexed by M^− remain tight; and the overall quality of the query improves. A similar procedure is used to determine the amount to decrease the chosen upper bounds at each iteration.
Our algorithm is not specific enough to make a precise statement concerning its complexity. The complexity will depend on how difficult it is to find the increment and decrement amounts in the lower and upper bounds, respectively, at each iteration. It also depends on the complexity of computing the Star answer interval for a given query f and corresponding query set T. For common queries such as SUM, MEAN, STANDARD DEVIATION, MIN (MAX), and REGRESSION, the Star structure allows computations at each iteration to be extremely fast. For example, for SUM queries, the Star answer requires O(|T|) computations at each iteration. Then the y required to increase the corresponding lower bounds can be found by computing

$$g^{-}(Y) := \min\{\, f_i^{-}(Y) : i \notin M^{-} \,\} \tag{15}$$

$$y := \min\left\{ \max\left\{ \frac{g^{-}(Y) - f^{-}(Y)}{\hat{R}_i},\ \frac{-D_i^{-}}{R_i} \right\} : i \in M^{-} \right\} \tag{16}$$
Likewise, a similar formula can be established to determine the y required to decrease the corresponding upper bounds. Let

$$\hat{R} := \max\{\, \hat{R}_i : i \in T \,\}$$

$$\epsilon := \min\left\{ |D_i^{-} - D_j^{-}| : D_i^{-} \ne D_j^{-};\ |D_i^{+} - D_j^{+}| : D_i^{+} \ne D_j^{+} \right\}$$

then the algorithm applied to a SUM query will take O(R̂/ε) iterations in the worst case. It follows that the shrinking algorithm applied to a SUM query would take O(R̂|T|/ε) computations.
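As a concrete illustration of the shrinking loop for a SUM query, the sketch below computes f^−(Y) and f^+(Y) simply as the sums of the current lower and upper bounds over T. It is a deliberately simplified reading of the generic algorithm, not the authors' implementation: every subject's interval is shrunk by a uniform small step per iteration, so the tight sets M^−, M^+ and the step-size formulas (15)-(16) are abstracted away, and all data are hypothetical.

```python
# Simplified sketch of interval shrinking for a SUM query (uniform per-iteration step
# instead of the tight-set increments of Eqs. (15)-(16)); the data are hypothetical.

def shrink_for_sum(a, l, u, T, q_hat, step=0.01):
    # a, l, u: confidential values and protection bounds; T: query set; q_hat: desired quality.
    l, u = dict(l), dict(u)                    # work on copies (elastic use would roll back)
    R_hat = {i: u[i] - l[i] for i in T}        # original protection ranges
    R_f = sum(R_hat[i] for i in T)             # original SUM answer range
    y = {i: 0.0 for i in T}

    def quality():
        return 1.0 - (sum(u[i] for i in T) - sum(l[i] for i in T)) / R_f

    while quality() < q_hat:
        for i in T:
            l[i] = min(a[i], l[i] + step * R_hat[i])   # raise lower bound, never past a_i
            u[i] = max(a[i], u[i] - step * R_hat[i])   # lower upper bound, never past a_i
            y[i] = 1.0 - (u[i] - l[i]) / R_hat[i]      # cumulative reduction fraction
    return (sum(l[i] for i in T), sum(u[i] for i in T)), y

# Hypothetical three-subject query with 50% protection intervals.
a = {1: 40.0, 2: 60.0, 3: 70.0}
l = {i: 0.8 * a[i] for i in a}
u = {i: l[i] + 0.5 * a[i] for i in a}
print(shrink_for_sum(a, l, u, T=[1, 2, 3], q_hat=0.5))
```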
In general, for random or unstructured intervals the number of iterations should be very small. A very highly structured, worst-case scenario would be the following. For a given constant a, suppose that all the subject intervals are of the form [a − d_i, a + d_i], where d_1 > ... > d_n, that is, each interval is centered at a and interval i strictly contains interval i+1. Then, at the end of the kth iteration of the algorithm the quality of the Star answer for SUM queries will be

$$q_k := 1 - d_{k+1}/d_1$$

Hence, to obtain a desired quality level of q̂, the algorithm will take k̂ iterations, where k̂ is the first index k for which q_k ≥ q̂, that is,

$$1 - \hat{q} \ge d_{k+1}/d_1$$

When d_k decreases "very slowly" the algorithm can take comparatively many iterations. For example, suppose a user requests q̂ = 0.5 (a 50% reduction with respect to the original Star answer). If d_k = k^(−0.1), then k̂ = 2^10 − 1 = 1023 iterations. Yet, since each iteration will take no more than a fraction of a second, the answer would still come back very quickly. On the other hand, in practice it is very unlikely to have intervals that exhibit such a "slow" decreasing behavior, and the algorithm will find an answer in just a few iterations. For instance, if d_k = 10^(−k), then k̂ = ⌈ln(2)/ln(10)⌉ = 1 iteration.

5.6 The advantages of the Star mechanism

Since all aspects of the proposed market have now been illustrated, it is worthwhile to enumerate the beneficial properties of the Star mechanism for each of the market's desiderata. In particular these are as follows.
Consistency. By its nature, namely minimizing and maximizing the query function over the set S, the Star mechanism guarantees consistent answers. That is, a user will never be able to obtain additional information by asking the same query twice and getting different answers.
Computational ease. Because of the simple structure of the compact Star set, exact algorithms for query answers can be developed that are very efficient. That is, they consist of nothing more than a set of single-variable function optimizations, and typically can be done in time linear in the size of the DB.
Safety. It is easy to see that the structure of the Star set also guarantees that even if a user gains some information about one subject through the shrinking process, that information will be of no use in shrinking the protection interval of any other subject without compensating that subject.
Usefulness for shrinking answer intervals. Star is uniquely designed to make it simple to determine optimal shrinking, in that the effect of shrinking one subject's protection interval on the answer interval is immediately seen and has no effect on the other subjects.
6
Simulation model and computational results
A simulation model is used to illustrate the compensation model with elastic shrinking. The interaction between flexible quality, privacy protection, and compensation is studied via user demand for SUM queries on the subjects’ private information. SUM queries are chosen for analysis since they represent the most common query type studied in data security models. A typical sequence of events in the simulation starts with a user submitting a SUM query to a DB and a desired quality level for the query answer. The TII answers the query by adjusting protection bounds where necessary to meet the requested quality, and calculates the cost of the query and subject compensation. Finally, the user decides whether or not to purchase the query, based on how the user values the query answer and the price charged by the TII. If the query is purchased, the TII compensates the subjects according to the corresponding percentage b.
6.1 Sample database

We generated a 1000-subject DB containing a private information vector a. The entries of the confidential vector are randomly generated according to a uniform distribution ranging from 10 to 100. We chose the uniform distribution for the following reasons. Since these models are not statistical in nature, it seems clear that the distribution of the actual data is not of paramount importance. That is, the models should allow the market to function efficiently independent of the data structure. What we did want to provide was data that had little structure but a fair amount of variance, for which the uniform distribution is ideal. We also note that the uniform distribution is used to test the models in both Gopal et al. (1998) and Gopal et al. (2002). To further ascertain the robustness of the simulation findings, we replicated our study using a log-normal distribution, another distribution commonly encountered in a variety of data. The findings are virtually identical, providing another dimension of support for the validity of the findings.
Lower and upper protection bounds for the ith subject are also randomly generated as follows:

$$l_i := a_i (1 - U_i/2), \qquad u_i := l_i + a_i/2$$

where U_i has a uniform distribution in [0, 1]. Note that since a_i ∈ [l_i, u_i] and (u_i − l_i)/a_i = 0.5, the generated interval provides a 50% protection level for the subjects' confidential data. We assume that the interval reduction price per subject s_i is the same for all subjects, that is, s_i = s for all i ∈ N and some positive value s.
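A minimal sketch of how such a test database might be generated (the use of Python's random module and the seed are our own choices; the distributions and the 50% protection level follow the description above):

```python
# Hypothetical recreation of the simulated DB: 1000 confidential values a_i ~ U(10, 100),
# with bounds l_i = a_i(1 - U_i/2) and u_i = l_i + a_i/2, where U_i ~ U(0, 1).
import random

random.seed(42)  # arbitrary seed, for reproducibility only
n = 1000
a = [random.uniform(10, 100) for _ in range(n)]
U = [random.random() for _ in range(n)]
l = [a_i * (1 - U_i / 2) for a_i, U_i in zip(a, U)]
u = [l_i + a_i / 2 for l_i, a_i in zip(l, a)]

# Sanity checks: each a_i lies inside its interval and the relative width is 50%.
assert all(l_i <= a_i <= u_i for l_i, a_i, u_i in zip(l, a, u))
assert all(abs((u_i - l_i) / a_i - 0.5) < 1e-9 for l_i, u_i, a_i in zip(l, u, a))
```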
6.2 User queries

We use a finite sequence u_1, ..., u_W of users. The mth user submits exactly one SUM query with query set denoted by T_m. Users have the following utility function:

$$B(q, t) := c\, t^{(1+q)/2}$$

where c is a positive constant. User u_m has a value v̂_m for a query answer with quality level q̂_m or higher. The values v̂_m are computed based on the user's utility function as

$$\hat{v}_m := B(\hat{q}_m, |T_m|)$$

In the simulation we use a total of W = 10,000 users, each posing a SUM query with |T_m| = 100. The queries were designed so that every query set contained one chosen subject, termed the control subject. This control was created to track compensation and private data revelation of a single subject. The other 99 elements in a given query set are found by using simple random sampling without replacement from the 999 non-control subjects. The user's profit is calculated as

$$\Pi_m = \hat{v}_m - p_f\, |T_m| - s \sum_{i \in T_m} y_i \tag{17}$$
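A hypothetical sketch of one simulated user's purchase decision based on Eq. (17); the default parameter values mirror those reported below (c = 15, p_f = 1), while the query set, the reductions y_i, and the shrinkage price are illustrative assumptions:

```python
# Hypothetical sketch of a single user's purchase decision, Eq. (17).

def benefit(q, t, c=15.0):
    # User utility B(q, t) = c * t**((1 + q) / 2)
    return c * t ** ((1 + q) / 2)

def purchase_decision(q_hat, T_m, y, p_f=1.0, s=5.0):
    # q_hat: desired quality; T_m: query set; y[i]: reduction required from subject i
    v_hat = benefit(q_hat, len(T_m))                    # reservation value v_hat_m
    cost = p_f * len(T_m) + s * sum(y[i] for i in T_m)  # record charges + shrinkage charges
    profit = v_hat - cost                               # Pi_m in Eq. (17)
    return profit > 0, profit

# Toy SUM query over 100 subjects, each interval shrunk by 3%, desired quality 0.7.
T_m = list(range(100))
y = {i: 0.03 for i in T_m}
print(purchase_decision(q_hat=0.7, T_m=T_m, y=y))
```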
We use the following values in the simulation: c = 15 and p_f = 1. The shrinking price s is analyzed at various levels ranging from zero to $14 in increments of $0.50. At a given iteration, if Π_m > 0, the query answer is purchased, the user's profit and surplus are recorded, and the TII and subjects' profits are also calculated and recorded. Otherwise, the query answer is not purchased, and profit and surplus are recorded as zero. For the subjects and TII we equate profits to revenues, since any fixed cost structure does not alter the results qualitatively. Finally, we use a profit distribution proportion of b = 0.5.

6.3 Results

Figure 4 illustrates the percentage of the queries that the users were willing to pay for as a function of s. While the percentage of queries that users consent to pay for is monotonically decreasing in the price, it is close to flat for prices below $7.50 and then exhibits a sharp decline. For prices under $7.50, nearly 70% of user queries are answered. Figure 5 illustrates the average profit per query for users (Σ_m Π_m / W), subjects ((1 − b) s Σ_m Σ_{i ∈ T_m} y_i / W), and the TII (Π_TII / W), along with the average total welfare (defined as the sum of the user, subject, and TII average profits). As expected, the user profits decline with increasing prices.
Fig. 4. User willingness to pay.

Fig. 5. Price–profit relationships.
The subject profits initially increase, reach a maximum at s = 7.50, and decline with further price increases. A similar price–profit relationship is also found in the control subject profits depicted in Fig. 6. Interestingly, subject profits are maximized when the percentage of queries accepted by the users remains relatively high (see Fig. 4), and where further price increases result in a sharp drop in the number of queries the users are willing to pay for. Thus the subjects are better off pricing private information such that the consumption of information products by users is high. The TII profits and the total welfare display similar behavior, remaining relatively flat for s ≤ 8, followed by a sharp decline with further increases in s. These findings indicate that the economic incentives of the TII and the subjects are closely aligned, and that the prices set by the TII are nearly welfare maximizing. This result suggests that an intermediated market for private information may be as efficient as a regulated one.
Fig. 6. Price–profit relationship for the control subject.

Fig. 7. Percentage of private information disclosure.
Note that a subject's private information is completely revealed when y = 1, i.e., the subject's private data needs to be fully revealed in order to satisfy a user's information request. On the other hand, when y = 0, no disclosure of sensitive information is necessary to respond to a user's query request. Figure 7 illustrates that y values are no more than 5% for all price ranges considered in the simulation model. In fact, for large values of s, y approaches zero since users are unwilling to pay for queries and thus no disclosure takes place. However, even minimal reductions in subject confidentiality intervals can lead to significant improvements in the quality of the query answers provided to the users. Figure 8 indicates that the users are provided quality levels of nearly 70% for all price ranges. That small reductions in the confidentiality intervals can lead to large improvements in query answer quality can be attributed to the SUM query type, which leverages the "law of large numbers" to realize such gains. Other commonly used information products are compiled from analyses that also employ the "law of large numbers" (such as regression analysis and other data analytic methods), and thus should exhibit similar behavior.
Fig. 8. Price impact on query quality.
Table 2
Impact of reservation value on feasible prices

Subject reservation value    Range for s: Lower bound ($)    Upper bound ($)
0.01                         1.94                            12.05
0.02                         3.88                            10.22
0.03                         5.84                            8.84
Clearly, the willingness of the subjects to participate (i.e., permit the TII to reduce protection intervals) is critical for the market. The incentive for the subjects to participate comes from the compensation received from the TII. Table 2 illustrates the price impact of the subjects' reservation values. For instance, if the subjects are willing to participate only if they receive, on average, $0.02 or more per query to the DB, then it is clear that the TII needs to restrict the feasible values for s to lie in the interval [$3.88, $10.22]. If s is not contained in this interval, the subjects will choose not to participate, and hence the TII profits resulting from higher quality answers are zero. Interestingly, increasing reservation values on the part of the subjects drive the prices towards welfare-maximizing levels. In summary, the simulation analysis provides an initial validation of the viability of markets for private information.
7
Conclusion
This chapter has outlined some of the major themes in SDB protection and has focused on protection against the threat of statistical inference.
Several protection mechanisms were discussed, with a focus on the CVC protection model, while demonstrating how CVC provides a functional and robust technique to protect online and dynamic SDBs from the inference threat. Additionally, the chapter demonstrates how CVC can be linked to an economic model for an intermediated electronic market for private information.

References

Adam, N.R., J.C. Wortmann (1989). Security-control methods for statistical databases: A comparative study. ACM Computing Surveys 21(4), 515–586.
Agrawal, R., R. Srikant (2000). Privacy-preserving data mining, in: Proceedings of the 2000 ACM SIGMOD Conference on the Management of Data, Dallas, TX, May 14–19, ACM, pp. 439–450.
Beck, L.L. (1980). A security mechanism for statistical databases. ACM Transactions on Database Systems 5, 316–338.
Castano, S., G.F. Fugini, G. Martella, P. Samarati (1995). Database Security. ACM Press/Addison Wesley Longman Limited, England.
Chang, A., P.K. Kannan, A. Whinston (1999). The economics of freebies in exchange for consumer information on the Internet: An exploratory study. International Journal of Electronic Commerce 4(1).
Chin, F., G. Ozsoyoglu (1982). Auditing and inference control in statistical databases. IEEE Transactions on Software Engineering SE-8, 574–582.
Chin, F., G. Ozsoyoglu (1981). Statistical database design. ACM Transactions on Database Systems 6(1), 113–139.
Clifton, C., M. Kantarcioglu, J. Vaidya, X. Lin, M. Zhu (2002). Tools for privacy preserving distributed data mining. SIGKDD Explorations 4(2), 1–7.
Date, C.J. (2000). An Introduction to Database Systems, 7th ed. Addison Wesley, Reading, MA.
Dalenius, T. (1977). Towards a methodology for statistical disclosure control. Statistik Tidskrift 15, 429–444.
Denning, D.E. (1980). Secure statistical databases with random sample queries. ACM Transactions on Database Systems 5(3), 291–315.
Denning, D.E. (1982). Cryptography and Data Security. Addison Wesley, Reading, MA.
Dobkin, D., A.K. Jones, R.J. Lipton (1979). Secure databases: Protection against user influence. ACM Transactions on Database Systems 4, 97–100.
Fellegi, I.P. (1972). On the question of statistical confidentiality. Journal of the American Statistical Association 67, 7–18.
Friedman, A.D., L.J. Hoffman (1980). Towards a failsafe approach to secure databases, in: Proceedings of the IEEE Symposium on Security and Privacy, Oakland, CA.
Garfinkel, R., R. Gopal, P. Goes (2002). Privacy protection of binary confidential data against deterministic, stochastic, and insider threat. Management Science 48, 749–764.
Garfinkel, R., R. Gopal, D. Rice (2006a). New approaches to disclosure limitation while answering queries to a database: Protecting numerical confidential data against insider threat based on data or algorithms, in: Proceedings of the 39th Hawaii International Conference on System Sciences (HICSS), Kauai, HI.
Garfinkel, R., R. Gopal, D. Rice, M. Nunez (2006b). Secure electronic markets for private information. IEEE Transactions on Systems, Man and Cybernetics, Part A, Special Issue on Secure Knowledge Management 36(3), 461–472.
Gopal, R., P. Goes, R. Garfinkel (1998). Interval protection of confidential information in a database. INFORMS Journal on Computing 10, 309–322.
Gopal, R., P. Goes, R. Garfinkel (2002). Confidentiality via camouflage: The CVC approach to database security. Operations Research 50(3).
Hoffman, L.J., W.F. Miller (1970). Getting a personal dossier from a statistical data bank. Datamation 16, 74–75.
Laudon, K. (1996). Markets and privacy. Communications of the ACM 39(9), 92–104.
Lefons, D., A. Silvestri, F. Tangorra (1982). An analytic approach to statistical databases, in: Proceedings of the 9th Conference on Very Large Databases, Florence, Italy, pp. 189–196.
Leiss, E. (1982). Randomizing: A practical method for protecting statistical databases against compromise, in: Proceedings of the 8th Conference on Very Large Databases, Mexico City, Mexico, pp. 189–196.
Liew, C.K., W.J. Choi, C.J. Liew (1985). A data distortion by probability distribution. ACM Transactions on Database Systems 10, 395–411.
Muralidhar, K., D. Batra, P.J. Kirs (1995). Accessibility, security, and accuracy in statistical databases: The case for the multiplicative fixed data perturbation approach. Management Science 41, 1549–1564.
Muralidhar, K., R. Sarathy (2006). Data shuffling: A new masking approach for numerical data. Management Science 52(5), 658–670.
Palley, M.A., J.S. Siminoff (1987). The use of regression methodology for the compromise of confidential information in statistical databases. ACM Transactions on Database Systems 12(4), 593–608.
Reiss, S.P. (1984). Practical data swapping: The first steps. ACM Transactions on Database Systems 9, 20–37.
Schlorer, J. (1976). Confidentiality of statistical records: A threat monitoring scheme of on-line dialogue. Methods of Information in Medicine 15(1), 36–42.
Schlorer, J. (1980). Disclosure from statistical databases: Quantitative aspects of trackers. ACM Transactions on Database Systems 5, 467–492.
Schlorer, J. (1981). Security of statistical databases: Multidimensional transformation. ACM Transactions on Database Systems 6, 95–112.
Turn, R., N.Z. Shapiro (1978). Privacy and security in databank systems: Measures of effectiveness, costs, and protector–intruder interactions, in: C.T. Dinardo (ed.), Computers and Security. AFIPS Press, Arlington, VA, pp. 49–57.
Varian, H.R. (1996). Economic aspects of personal privacy, from Privacy and Self-Regulation in the Information Age, Department of Commerce. Available at http://www.ntia.doc.gov/reports/privacy/selfreg1.htm#1C. Accessed on May 7, 2007.
Verykios, V.S., E. Bertino, I. Fovino, L.P. Provenza, Y. Saygin, Y. Theodoridis (2004). State-of-the-art in privacy preserving data mining. SIGMOD Record 33(1).
Willenborg, L., T. de Waal (2001). Elements of Statistical Disclosure Control, Lecture Notes in Statistics. Springer-Verlag, New York.
Zhang, N., W. Zhao (2007). Privacy-preserving data mining systems. Computer, April.
Adomavicius & Gupta, Eds., Handbooks in Information Systems, Vol. 3
Copyright © 2009 by Emerald Group Publishing Limited
Chapter 13
The Efficacy of Mobile Computing for Enterprise Applications
John Burke
University of Illinois at Urbana-Champaign

Judith Gebauer
Information Systems, University of North Carolina Wilmington, Cameron School of Business

Michael J. Shaw
Business Administration and Leonard C. and Mary Lou Hoeft Chair of Information Systems, University of Illinois at Urbana-Champaign
Abstract

In the last twenty years the use of mobile technology has grown tremendously among consumers; however, organizations have been slow to adopt mobile information systems. In this chapter we explore some of the issues surrounding mobile information systems, their future, and some possible reasons why they are or are not successfully adopted by firms. In addition, several adoption models that are commonly found in the MIS literature are introduced. Finally, using a case study of a mobile supply requisition system at a Fortune 100 company, the Task Technology Fit model is used to help explain why some users found a mobile information system useful, while other users found it unnecessary.
1
Introduction ‘‘The impact of technology is generally overestimated in three years and underestimated in 10 years.’’ As attributed to Bill Gates
This chapter is about the adoption of mobile information systems by organizations. Specifically, why are some of these systems successfully
adopted, while others are not? Another important question for researchers is: what is the future of mobile information systems in organizations? However, before these questions can be addressed it is important to define what a mobile information system is. For our purposes, a mobile information system is not just the use of mobile devices such as cell phones by employees of a firm. It is an integrated information system where users, usually a firm's employees, suppliers or customers, use mobile devices to interact with the organization, in order to perform some important function(s).

Mobile information system: an integrated information system where users, usually a firm's employees, suppliers, or customers, use mobile devices to interact with the organization in order to perform some important function(s).
In the last twenty years the growth of mobile technologies has been phenomenal, especially in the consumer market. In part, significant advances in technology have driven this growth. For example, the first "portable computers" were the size of large briefcases; today many personal digital assistants (PDAs) and cell phones can easily fit into a pocket. Furthermore, the power and flexibility of these same devices have increased dramatically over the same period. Many are even designed to access the Internet natively, which did not even exist twenty years ago. Today consumers and organizations have a bewildering array of mobile technologies to choose from, and the future looks promising. In fact, these devices are now so popular with the general public that many individuals no longer have landline phones or desktop computers, preferring to use their mobile counterparts instead.
However, despite success with consumers at home and tremendous investments in mobile technology by firms, the impact of mobile applications in organizations has been relatively limited. In this sense, mobile technologies have followed a similar pattern to other revolutionary technologies like the railroad, the automobile, the airplane, and the Internet. That is, there have been enthusiastic investments in the beginning with accompanying high stock prices, followed by disillusionment as firms discovered that their markets and profits took longer than expected to develop. The Internet bubble is just the latest example of this pattern, as firms required time to figure out the proper uses of and expectations for E-technologies and lost billions of dollars in the process.
In this chapter we will cover several aspects of the use of mobile technology in organizations. First, some of the current trends in mobile technologies are given, along with some predictions for the future. Second, several research frameworks used for evaluating technologies are presented. Third, one study of a mobile application in a Fortune 100 company that highlights many of the issues facing mobile applications is reviewed. Finally, areas for new research opportunities are discussed.
2
Trends
2.1 Initial experiments in mobile information systems

Early experiments in mobile information systems often failed to live up to expectations. They tended to be small pilot programs that lacked sufficient scope to yield significant results. Many of the early problems experienced in these systems stemmed from a lack of IT investment by companies that cut budgets in response to the recession in 2001.1 However, even with the budget cuts in many IT departments, mobile information systems have steadily grown in use as the variety and power of available devices have increased. For example, wireless LANs (local area networks) in many urban areas have become commonplace, car companies sell mobile services such as OnStar and other global information systems, radio frequency identification (RFID) tags are now being used to track assets in factories and warehouses, and in Europe some ski resorts are even using smart personal objects embedded in Swatches as electronic ski lift passes.2
Yet even with this growth, many issues still need to be worked out. One issue is that IS/IT managers are sometimes unsure which mobile technologies are best to adopt given the large number of choices they have in the marketplace; for example, should they choose smart phones or PDAs, Windows- or Linux-based systems, standardize on one type of device or allow employees to choose from a variety of devices, and so on. This has led some firms to adopt a variety of small mobile devices and applications in pilot programs without giving much thought as to how they would eventually fit into the overall organization. Such a patchwork of applications is often difficult to support or integrate into day-to-day operations. Therefore, it is beneficial to consider the recent trends in mobile technology before making new investments, so that a more cohesive mobile strategy can be implemented.
2.2 The trend towards user mobility

Today workers are more dispersed and more on the move than ever before. In order to support these employees, firms have to deploy suitable technology. As mentioned above, the number and type of mobile devices has increased dramatically over the last decade. But exactly what kind of devices a firm should invest in depends a great deal on what kind of work its employees need to do.

1 A survey released in 2001 by Forrester Research showed that the number of large companies in North America that had cut their E-Business budgets had nearly doubled. Specifically, Forrester found in May 2001 that 17% of large companies had decreased their E-Business budgets. Nearly one-third of Global 3500 firms had also reported such reductions. The average reduction was only 0.3% in spring 2000, whereas in fall 2001, big companies said that they expected a 6% budget drop.
2 Kueppers, A., A. Latour (2000). Watch out phone makers, people are wearing wrist technology. WSJ Interactive Edition, April 17.

Fig. 1. Functionality vs. mobility tradeoff.
Several different types of users and their usage behaviors have been discussed in the literature, namely wanderers, travelers, and visitors (Sarker and Wells, 2003). Wandering is when users are centered in a place such as an office, but throughout the day they are away from their desks for substantial amounts of time. For these users the devices need to be small and are used mostly for staying in touch by phone and perhaps email, i.e. smart phones. Travelers are those who are constantly moving from one place to another, like salespeople. These users require the ability to stay in touch like wanderers, but also need the ability to run more powerful applications like word processors and spreadsheets. Devices such as laptops are common among travelers. Finally, visitors are those who do move from one place to another, but not every day, such as executives. Although these users do require mobility, they expect greater functionality from their devices, such as full-size keyboards and large screens. For these users desktop replacements are more appropriate. For all of these groups it can be seen that there is a tradeoff between functionality and mobility (Gebauer and Shaw, 2002) (Fig. 1).

2.3 The trend towards pervasive computing

Another trend that is entering the workplace is the concept of pervasive computing. Pervasive computing is the concept that the "computer has the capability to obtain the information from the environment in which
it is embedded and utilize it to build models of computing", and that the environment "can and should also become 'intelligent'" and be able to "detect other computing devices entering it" (Lyytinen and Yoo, 2002a). Such "intelligent" environments are becoming quite common in the workplace in the form of RFID tags. RFID tags are used in the same basic way that bar codes are used: to identify objects such as supplies and equipment in an organization. The difference is that bar codes must be read by passing a scanner over them, while RFID tags emit their information electronically to the scanner when prompted. This allows information to be read much more quickly and accurately than in the past. These tags are being used by companies in various ways. For example, warehouses use them in pallets and crates to correctly route merchandise and to determine the weight of trucks as they are loaded. Walmart in particular is pushing suppliers in the retail industry to adopt RFID tags to improve the speed and efficiency of their supply chains. As the power and adaptability of scanners and RFID tags increase, more and more companies can be expected to adopt this technology (Angeles, 2005).
2.4 The future: ubiquitous computing

As companies progress in using mobile technology effectively, they will begin to achieve ubiquitous computing. Ubiquitous computing (Lyytinen and Yoo, 2002b) refers to the combination of the above two trends, namely a large amount of mobility in computer applications combined with a large number of environment-aware "intelligent" embedded devices. Lyytinen and Yoo suggest that achieving this will require:
- Large-scale infrastructures.
- Computer technologies that share diverse types of data seamlessly with many different kinds of mobile devices.
- Devices that can manipulate diverse types of data effectively even with limited user interfaces such as small screens and keyboards.
- Computer-aware environments that provide appropriate services to various devices as needed.
- Continued improvements in the computing power available in mobile devices.
- Significant increases in the number, quality, and geographic scope of wireless LANs.
- The standardization of the computer protocols used to communicate between heterogeneous computer technologies (Fig. 2).
While the rate at which companies adopt mobile technologies may be in question, it is clear that the trends towards mobile, pervasive, and
Fig. 2. The future: ubiquitous computing (Lyytinen and Yoo, 2002b). The figure plots level of mobility (low to high) against level of embeddedness (low to high), with quadrants labeled traditional business computing, mobile computing, pervasive computing, and ubiquitous computing.
ubiquitous computing can be expected to continue. According to Forrester,3 the driving forces for the growth in these trends are many but include:
- Growing familiarity with mobile technologies, like cell phones, by an increasing segment of the population.
- Growing familiarity with mobile technologies by organizations, many of which already have experience with small wireless projects.
- Greater connectivity options, with more mature wireless technologies, and both wireless WANs and LANs being offered by giants like AT&T and Sprint.
- Greater support by devices for wireless access, with most new laptops natively supporting such access.
- An increasingly wide variety of devices to choose from, with many devices supporting multiple functions.
- Growing IT budgets over the next few years, as companies benefit from the growing economy.
- Government regulations regarding such things as tracking drugs and hazardous materials.

3
Theoretical frameworks
3.1 Introduction

In discussing the development of mobile technologies and their introduction into an organization, it is helpful to relate them to the common theoretical models found in the current Management Information Systems (MIS) literature. Two of the most widely accepted models are the Technology Acceptance Model (TAM) and the Task Technology Fit (TTF) model, which are used to predict IT utilization and/or individual performance.

3 Daley, E. (2005). The mobile enterprise warms up: but widespread adoption is still three years away, Forrester Research, May 23, http://www.forrester.com/Research/Document/Excerpt/0,7211,36419,00.html
Fig. 3. The technology acceptance model (TAM). Perceived ease of use and perceived usefulness of a given technology shape the attitude towards its use, which leads to the intention to use the technology and, in turn, to actual usage.
3.2 The technology acceptance model

The idea of the TAM model is that if individuals perceive that a technology is easy to use and also perceive it as being useful, they will have a positive attitude towards it. This in turn will increase the individual's intention to use that technology when given the opportunity, and finally this will lead to actual utilization. The TAM model is shown in Fig. 3 (Davis, 1989). As can be seen, the model also predicts that technology which is perceived as being easier to use is also generally seen as being more useful.

3.3 Example of the technology acceptance model

As an example of how the TAM model works, consider a hypothetical non-technical manager given the opportunity to use a basic cell phone, a laptop, or a PDA while traveling. A cell phone today would probably be considered easy to use by the average manager and also fairly useful, as it would allow him or her to stay in communication with the office while traveling or going to meetings, conferences, etc. Part of the perception of the cell phone's usefulness would be based on the perception that a cell phone is easy to use. Therefore, the TAM model predicts that such a manager is likely to have a positive attitude toward using a cell phone, would intend to use it if given the chance, and would in fact use it if provided with one by their office.
Similarly, a laptop might be perceived as being even more useful by the same manager in that it could be used for email, spreadsheets, and word processing. However, a laptop is fairly cumbersome to carry around, and so it would probably be perceived as more difficult to use than the cell phone. Therefore, the TAM model predicts that the manager is likely to have a
more ambivalent attitude towards the laptop than the cell phone, would have less intention to travel with it, and would use it less. Finally, this non-technical manager might consider a new PDA unfamiliar and advanced technology and therefore difficult to use. Also, since PDAs have limited functionality in comparison to a laptop, the manager would probably perceive it as having limited usefulness. Part of this perception would be driven by the notion that PDAs are difficult to use. Therefore, TAM predicts that the manager would have a relatively negative attitude towards the PDA, a low intention to use it even if given the chance, and low actual usage of the PDA compared to a laptop or cell phone.

3.4 Limitations of the technology acceptance model

There are several basic limitations of the TAM model. The first limitation is that it assumes that the usage of a given technology is voluntary. However, in the real world many employees have to use a system in order to do their jobs. For example, cashiers in a retail store have to use the checkout system their employer provides. In fact, many organizations have systems that they require their employees to use, including banking systems, supply reorder systems, and centralized databases. Therefore, in many cases a user's perceptions of a system may not be a valid predictor of utilization.
A second, related limitation of the TAM model is that there are situations where even if a user is not compelled to use a technology by their employer, they may be forced to use a technology they find difficult simply because there is no other viable choice. An example of this would be using a mainframe system to process extremely large quantities of data. A user may have very negative perceptions of how difficult mainframe systems are, but some organizations, such as insurance companies, have so much data to process that they have no other alternatives. Likewise, some technologies may have no competitors, such as a cell phone company that is the only service provider in a particular geographic area.
A third limitation of the TAM model is that even if a user has positive perceptions of a technology's usefulness and ease of use, and even if this in turn results in high utilization of that technology, this does not mean that the individual's performance will be positively impacted. A system which does not work well, but which is highly utilized by an organization, may actually harm the organization that uses it. An extreme example would be a banking system that is easy to use and well liked by a bank's customers, but which has security holes that allow hackers to access customers' accounts. Another example would be a firm that switched all of its employees' desktop computers for laptop computers, but required all of its employees to sit in a cubicle all day long. In this case, the laptop would lose its prime advantage of mobility but would cost more, have fewer features, and would be harder to use than a desktop. Clearly it would be a case of the wrong tool for the job.
Finally, it should be noted that a user's perceptions of a technology can be highly affected by network effects and/or by "coolness" aspects. If a person's colleagues successfully use a particular technology, then that person is more likely to adopt it, even if it appears somewhat difficult to use at first glance. Users, especially those in an organization, are affected by those around them, and perceptions will change accordingly. For example, many students use social networking technologies such as MySpace, Facebook, blogs, RSS, and wikis simply because their peers use them. The same is true with mobile technologies. If the people surrounding you use a technology and have good experiences with it, you are more likely to expend the effort to learn and adopt it yourself. This is a major emphasis of Social Network Theory (Milgram, 1967), which proposes that it is the informal links between people, including employees, executives, managers, etc., both within and between companies, that are really important in shaping behavior. As a simple example, many corporate decisions are actually made on the golf course!
3.5 The task technology fit model

In order to address these limitations, Goodhue and Thompson (1995) presented the TTF model. They postulated that in order for a technology to improve an individual's performance, the technology would both have to be utilized and also fit the job for which it was used. Their originally proposed TTF model is shown in Fig. 4. In words, similar to the TAM model, the TTF model uses attitudes and beliefs as a predictor of utilization. However, the model goes further in that it says that not only does a technology have to be utilized, it also must fit the task for which it is being used.
Fig. 4. The task technology fit model (TTF). Task characteristics and technology characteristics jointly determine the task technology fit of the technology to the tasks required of it; fit, together with the precursors of utilization (beliefs, attitudes, etc. about the technology), drives utilization, and fit and utilization together produce performance impacts.
Only if the task characteristics fit well with the technology's characteristics and the technology is also being used will an individual experience a positive performance impact. As can be seen in Fig. 4, TTF also predicts that the fit of a technology to a task influences its utilization in addition to the overall performance of the task itself. Finally, the TTF model admits the possibility that the overall performance of a task might influence future utilization of a tool in the form of feedback. That is, if a particular technology worked well for a particular task, individuals would incorporate that into their beliefs and attitudes, which would then increase future utilization. Likewise, if a technology did not work well for a task, this would decrease future utilization of that technology for that task.
As an example of TTF, consider a company that purchases a Computer Aided Software Engineering (CASE) tool for its programmers in the hope of finding some gains in efficiency. According to the model, those gains may fail to materialize for two general reasons. One reason would be if the programmers simply choose not to use the new CASE tools. This may simply be because the programmers are comfortable using an older system and may not want to learn the new one. This is consistent with the TAM model. However, a second reason for potential failure exists according to the TTF model. The system could fail if the new CASE tools were a poor fit with the current practices used by the programmers. Such would be the case if the CASE tools were designed for Object Oriented Programming and the programmers were using a non-Object Oriented language like COBOL (Dishaw and Strong, 1998).

3.6 Limitations of the task technology fit model

Unlike the TAM model, which has been fairly stable since the 1990s, the TTF model has been in flux since first presented by Goodhue and Thompson in 1995. For one thing, what exactly is a positive performance impact? Should it be considered an individual's increase in performance, a departmental increase, or an organizational increase? If performance is measured at a level higher than the individual, then other factors beyond the technology in question come into play (Zigurs and Buckland, 1998; Zigurs et al., 1999). For example, the technology may fit the tasks well, and may be utilized by employees, but if the company experiences a downturn, any performance impacts of the technology may be washed out by the negative consequences of layoffs, a lack of funding, high turnover, etc. In light of this, Dishaw and Strong (1999) proposed a combined, somewhat modified version of the TAM and TTF models where utilization itself is used as the measure of performance. That is, if the technology fits the tasks and users have positive perceptions of the technology, utilization of the technology will increase. Their model (shown in Fig. 5) was used in a study that showed support for the idea that utilization of software engineering tools increased if the tools fit the tasks that the programmers needed to perform.
Fig. 5. A combined technology acceptance/task technology fit model (TAM/TTF). Constructs adopted from TAM (perceived ease of use, perceived usefulness, attitude toward use, intention to use, actual use of the technology) are linked with constructs adopted from TTF (task characteristics, functionality of the technology, experience with the technology, and task technology fit of the technology).
It should also be noted that there are other fit models that combine the concept of task and technology fit with social effects and user acceptance. Socio-Technical Theory posits that not only does there need to be a fit between the tasks and the technology used, but also a fit between the structure of an organization (such as its hierarchy and its reward systems) and its employees' skills, attitudes, values, etc. Only if there is a fit between both systems will the firm be able to adapt to changing situations, such as the adoption of a new information system (Bostrom and Heinen, 1977). Likewise, the Unified Theory of Acceptance and Use of Technology (Venkatesh et al., 2003) attempts to unify all of the common theories in IS regarding IT adoption.

4
Case study: mobile E-procurement
4.1 Introduction

In their study of a supply requisition system for a Fortune 100 company, Gebauer et al. (2005) took a slightly different approach to modeling TTF for
mobile information systems. The case study was conducted regarding the implementation of a mobile procurement system. Automation of procurement systems became popular in the 1990s as large organizations sought to control costs by involving end users in the process of requisitioning supplies, automating paper-based procurement systems, and standardizing business rules and suppliers across business departments. The new system was expected to add mobile application access to an existing procurement system in order to allow managers to conduct business even when out of the office. The system under consideration was based on the Wireless Application Protocol, which allows cell phones and similar devices to access Web-based applications. The system provided the following functionality:
- The ability to create purchase requisitions.
- The ability for managers to approve requisitions.
- The ability to check the status of existing orders.
The mobile system required users to log in using a PIN number and a credit-card-sized authenticator. If access was granted, the system accessed the corporate procurement system via Internet gateways. Not only could requisitioning employees check the status of their orders, but approving managers were also notified if pending requisition requests were waiting for approval.
The study was conducted using a small group of employees involved in a voluntary pilot project. Data were obtained in two primary ways: first, by interacting with the group via meetings and interviews on a bi-weekly basis from August 2001 to March 2002; second, through a survey administered to the users during February and March of 2002. The survey contained questions regarding the usage of the system, the benefits it provided, and general experiences, both good and bad, the users had with the system. Responses were collected from 17 users, roughly half of whom were at the director and mid-management levels, and all of these were approving managers. The other half of the respondents were at lower managerial or staff levels and were from a variety of areas, including finance, accounting, operations, etc. These individuals acted as either finance and accounting approvers of the system, or as requesters and receivers of the system. Although the small sample size prevents robust statistical analysis of the study, it did provide some insight into why mobile information systems face special challenges.

4.2 A TTF model for mobile technologies

Recognizing that mobile technologies are not completely mature and that they are used in non-traditional situations, Gebauer et al. (2005) proposed splitting the TTF technology characteristics into two separate constructs: one representing the functionality of the actual devices used, and another representing the context in which the device is being used.
Fig. 6. A task technology fit model (TTF) for mobile technologies. Fit 1 links managerial tasks (non-routineness, interdependence, time-criticality) with the mobile information system (functionality, user interface, adaptability); Fit 2 links the mobile information system with the mobile use context (distraction, network connectivity, mobility, previous experience) as a moderating factor; Fit 1 and Fit 2 combine into Fit 3, the moderated task technology fit, which drives task performance.
Their model is shown in Fig. 6. In words, the above variables are used to develop two separate fit constructs. Fit 1 is the typical TTF fit construct, and measures the extent to which the information system is capable of performing the managerial tasks that it is required to perform. Fit 2 is a new construct that is used to determine the feasibility of using mobile devices to deliver the required information system. Fit 1 and Fit 2 together then determine Fit 3, which is the ability of the system to support managerial tasks in a mobile context. The insight gained from breaking the TTF model down in this way, using Fit 1, Fit 2, and Fit 3, is that it becomes clear that the more a system is used in a mobile fashion, the greater the emphasis that must be placed on developing user interfaces, resolving network connectivity, and training users to use the mobile devices to the fullest extent. Definitions for the concepts used in the model are listed in Table 1.

5
Case study findings
5.1 Functionality

The results of the case study were varied (Gebauer and Shaw, 2004). Many users liked the data processing features of the system that allowed them to check on past orders, and they also liked that managers received notification when new orders were awaiting approval. However, several expressed frustration with the limited keyboards and small screens associated with the cell phones. Users also had several problems with system support, and even those who were initially enthusiastic about the system stopped using it after experiencing difficulties.
Table 1
Definition of concepts in the revised TTF model

Managerial tasks
  Non-routineness: The extent to which the task being attempted using a mobile device is a routine everyday occurrence or a unique one.
  Interdependence: The extent to which the task is dependent on other departments or other outside entities.
  Time-criticality: The extent to which the task is time critical, or urgent, i.e. an emergency.
Mobile information system
  Functionality: The extent to which the device could handle the operations it needed to perform, i.e. cell phones are good at two-way voice communication, but usually not very good at video communication or at data processing. In the case the focus was on (1) the extent to which the device is used for communication or computing and (2) the extent to which the device is used for one-way or two-way interaction.
  User interface: Because of their small size, mobile devices often have limited user interfaces, including small screens and small keyboards.a
  Adaptability: The ability of a mobile information system to adapt to varying circumstances, such as locations, service disruptions, and the personalization of services to a particular user.
Mobile use context
  Distraction: The extent to which the person using the device is distracted by the environment. For example, an Internet cafe is noisier than a business office.
  Network connectivity: The extent to which the user is able to establish network connectivity. Even in urban areas not all locations offer cellular service or wireless Internet connectivity.
  Mobility: The extent to which the device is actually being used in different geographic locations, which of course is related to network connectivity and distraction levels.
  Previous experience: The user's previous experience and comfort with the devices used in the mobile information system.

a In fact, the keyboards are often so small that Netlingo.com has an online dictionary of slang used in text messaging that developed in part because of the difficulty in typing on cell phones (i.e. "JOOTT" is text message slang for "Just one of those things").
Users also had several problems with system support, and even those who were initially enthusiastic about the system stopped using it after experiencing difficulties. Problems encountered included poor system documentation, unstable connectivity, a lack of training, and a non-intuitive user interface. Additionally, several users did not like carrying around an authenticator card. These results raise several important questions. First, which of the above problems are inherent in the system, and which are problems with system implementation, managing user expectations,
and training? Second, what could the organization have done better to help employees reach a "comfort zone" with the technology in order to improve their experiences? The field of change management focuses on these very issues and can be very helpful for shaping employee perceptions when introducing new technologies, practices, or policies. Third, which of the above problems are really a function of the quality of the existing system? The mobile information system in question was implemented in the United States in a firm that already had a wired network, with desktop computers, desktop multi-function phones, large computer screens, full-sized keyboards, and employees trained to use them. How would the reception of the system have been different in a developing country? For example, the "USA is 22.6 times the population of Ecuador and 10 times the wealth, yet Ecuador sends four times as many text messages as Americans" (Kerensky, 2006), while in India fishermen are now using mobile phones to identify the best port to deliver their catches to based on current spot prices. In the past they had to rely on luck, and some fish went unsold if there was a surplus in the local market. However, since mobile coverage became available in 1997, they have been able to identify the market with the greatest demand and plan delivery accordingly (Ribeiro, 2007). In both cases satisfactory alternatives to mobile systems may simply not exist, which of course shapes user perceptions.

As can be seen from the following charts, the system was perceived to work best for functions associated with approval, and less well for functions associated with requests or delivery. These are functions that are typically associated with managers rather than end users. It is worth noting, however, that the survey recorded the users' perceptions about the mobile technology rather than the actual benefits. Although mobile phones in particular are not ideal tools for sending email or creating long documents, just being able to approve a purchase or send a short text message may be quite valuable to an organization if it allows managers to multi-task. Therefore, the actual value to the firm may be higher than users first perceive. As with most technologies, perceptions evolve, and the impact of a system may take time to emerge and become apparent (Charts 1–3).

5.2 User experiences

From the surveys and interviews, four different types of users were identified: (1) approving managers, (2) finance and accounting managers, (3) requesters, and (4) approvers. Approving managers tended to perform many unstructured tasks with the system, as they often had to gather information needed to approve requests for supplies.
Chart 1. Requests: support of mobile business applications for different procurement activities (extent to which mobile business applications can provide support, as a percentage of all occurrences that can be supported), covering selecting items from the catalog, submitting a purchasing request, and tracking, changing, and canceling purchase orders.
Chart 2. Approvals: support of mobile business applications for different procurement activities (extent to which mobile business applications can provide support, as a percentage of all occurrences that can be supported), covering notification about waiting approval requests, delegation of approval authority, interpersonal communication, access to additional data during approval, and approval of purchase orders.
Not surprisingly, approving managers who spent a lot of time out of the office reported using the mobile information system more than their more stationary counterparts, and generally they found the system useful (see Chart 2). However, some approving managers were so overwhelmed by the frequent use of the notification feature that they eventually disabled it. This may indicate that there are still problems with how the business processes are structured. In contrast, finance and accounting managers tended to use the system for more routine tasks that usually involved following simple business rules, e.g., checking to make sure requisitions were within budgetary limits.
Chart 3. Delivery: support of mobile business applications for different procurement activities (extent to which mobile business applications can provide support, as a percentage of all occurrences that can be supported), covering instant notification of deliveries as they arrive, order tracking, access to order information, and submission of receiving information.
These managers used the system mainly for communication purposes and to access information. They also generally used the system more than the approving managers, yet task frequency was generally negatively correlated with usage of the mobile system. This may reflect the fact that high-frequency users tended to need the flexibility of the mobile system less than managers who traveled more often. Requesters were clearly aware that being out of the office could negatively impact operations if approvers were unreachable. This was especially true when requests were inter-departmental in nature or affected other entities in some way, e.g., relationships with suppliers. Those awaiting approval also appreciated approving managers having access to mobile systems when out of the office. Finally, while all users expressed a general appreciation for the flexibility of mobile technologies, especially when dealing with urgent situations, they did not have a strong understanding of how these systems actually added value to the organization. Complete findings of the case study are summarized in Table 2.
6 Conclusions from the case study

The case study presented above sheds light on four central questions:
1. What are the impacts of task characteristics on mobile information system usage?
2. What are the impacts of technology characteristics on application usage?
3. How do the characteristics of mobile technology moderate the ability of an information system to deliver a good "Fit" for the required managerial tasks?
4. What is the impact of mobile information system usage on business processes?
Table 2
Research propositions and evidence from the case study

Functionality
  Proposition 1: Users will first use mobile business applications for notification and communication purposes rather than for data processing or information access.
  Evidence from case study: Limited.
  Finding: Users showed willingness to use mobile devices (cell phones) for "innovative" purposes.

Portability
  Proposition 2: There is a tradeoff between portability and usability of mobile devices, effectively limiting the usage of mobile business applications to simple activities.
  Evidence from case study: Yes.
  Finding: Factor will become less significant in the future as improved mobile devices become available.

System performance and user support
  Proposition 3: System performance and user support have a positive impact on the usage of mobile business applications.
  Evidence from case study: Yes (strong evidence).
  Finding: Strong evidence for the need to make the effort "worthwhile"; poor performance can have a detrimental effect on usage.

Task structure
  Proposition 4a: Employees performing highly structured tasks tend to use mobile business applications for data processing.
  Proposition 4b: Employees performing unstructured tasks tend to use mobile business applications to access information and for communication purposes.
  Evidence from case study: Yes.

Task frequency
  Proposition 5: The frequency with which a task is performed has a positive impact on the usage of mobile business applications.
  Evidence from case study: Unclear.
  Finding: Evidence indicated that a combination of frequency and mobility can have greater explanatory power of usage than each factor alone.

Task mobility
  Proposition 6: Employees who are more mobile tend to use mobile business applications more often to perform their tasks than employees who are less mobile.
  Evidence from case study: Yes.

Need to handle emergency situations
  Proposition 7: The use of mobile business applications is positively related to the perceived need to handle emergency situations.
  Evidence from case study: Some (anecdotal evidence).
  Finding: Variations among users regarding the question of what constituted an emergency situation.

Impact on operational efficiency
  Proposition 8: The use of mobile business applications increases employee productivity and operational efficiency.
  Evidence from case study: Yes.
  Finding: Future research should consider both direct effects on the user and indirect effects on employees interacting with the user.

Impact on organizational flexibility
  Proposition 9: The use of mobile business applications improves organizational flexibility and the ability to handle emergency situations.
  Evidence from case study: Some (anecdotal evidence).
  Finding: No clear understanding of the value of increased flexibility and the ability to handle emergency situations; some lack of awareness among users.
With respect to the first question, the study found that users most highly valued notification and support for simple tasks, such as tracking orders, as opposed to handling more complex operations. However, access to ad hoc information was also highly valued. This was especially true for managers who were highly mobile and needed to process urgent requests while on the move between locations for meetings. Therefore, as expected, managers who were often "wandering" valued the smart-phone mobile application system the most.

With respect to the second question, the study found that poor technology characteristics, such as phones with small screens and keyboards, inhibited usage. Also, some employees did not like carrying around the encryption key that was used to log into the system from mobile devices, while others noted that training and support were key issues that limited their usage of the mobile information system.

For the third question, it appears that how the mobile technology affected the organization very much depended on the kind of user involved. Approving managers valued the system more than accounting and finance staff, largely because they were more mobile and thus had more incentive to learn and use the new system. Also, the system greatly increased the efficiency and processing speed of the requisition system, since it allowed users to access the system in situations where they previously would have been simply "out of the office". This in turn allowed users who depended on approval of their requests to do their own jobs more efficiently. However, while most employees recognized the potential benefits of the system, reactions were in fact mixed. While many expected mobile technology to add flexibility to the organization, others questioned the strategic value of making the procurement function mobile. This was especially true for finance and accounting managers, who were basically stationary.

As for the overall conclusions that can be drawn from the case regarding the impact of mobile technology on business processes, it was clear that users had high expectations. They were impatient with connectivity problems and frustrated by hardware limitations such as small screens and keyboards. While many employees recognized the potential efficiency benefits that could be realized by the organization, many expressed skepticism about using the technology in this instance. Even more interesting, one employee mentioned simply turning off the mobile device because they were overwhelmed with the number of requests directed to them. This perhaps indicates that more needs to be done on:

- Improving the fit between the devices chosen and the tasks they need to perform.
- Improving the fit between the devices provided and the users who use them (i.e., giving wanderers smart phones and stationary users desktops).
- Identifying the appropriate scope of a system. Does the system really need to be department-wide, or should it be limited to a smaller subset of users to maximize impact?
- Increasing the training of users, so that they know how to use the system and understand its importance to the organization.
- Educating users on the strengths and weaknesses of mobile systems in advance of deployment, in order to avoid unrealistic expectations. If users' expectations can be set in advance, less frustration should result from the technology's inherent characteristics and limitations.
- Managing user perceptions over time through "change management" techniques. As noted several times, perceptions are not static. What can an organization do so that user experiences evolve in a positive way? What can management do before, during, and after implementation to improve system success?
- Measuring the value of a system as well as the perception of that value. For example, if a manager uses a mobile system only while on the run and under stress, they may not fully appreciate the system's value simply because they do not realize how much time the system saved them.
- Modifying the business rules and processes as appropriate so that users are not overwhelmed by the new technology. Technology cannot help the organization if employees turn it off!
7 New research opportunities
The case study presented above points to several areas of research, both academic and practical, that need to be explored. On the academic side, a variety of research questions present themselves:

- How can organizations design mobile devices and services so that they are easy for individuals to use and understand?
- How can mobile technology be used to support teams of employees, rather than just single users?
- How will the advent of ubiquitous computing affect the structure of organizations?
- What are the security implications of ubiquitous computing, especially given government privacy regulations?
- How will organizations have to change their infrastructures to seamlessly support heterogeneous data across heterogeneous devices?
- What types of protocols and standards need to be developed before ubiquitous computing can be supported?
- How do business processes have to change as ubiquitous computing becomes a reality?
On the practical side, it is clear that mobile devices are quickly reaching a level of technical maturity at which they can start to replace many traditional non-mobile systems. However, as was clear from the case, many practical issues remain:

- What kinds of users does a firm have: wanderers, travelers, or visitors?
- What kinds of devices should a firm purchase, and how can they be integrated?
- How can firms train their employees to use mobile systems most effectively?
- How should firms decide which applications should be made mobile?
- Do these applications justify the cost?
8 Conclusion
Most of the discussion above, including the case study, focused on using mobile technology to improve the efficiency of existing processes and corporate structures. However, an equally important question is how these new devices and systems will change business processes in the future. Five hundred years ago most organizations had very few employees, existed for a very limited time, sometimes only for one day, and usually operated in only one place. In such cases, very little effort was needed to control or communicate with workers, and voice communication usually sufficed. Over time, however, organizations have grown to the point where today they may span several countries, have tens of thousands of employees using several languages, and require worldwide communication systems that use different technologies at the same time. This trend toward increasing complexity can be expected to continue, especially if one considers the broader organization, including suppliers, outsourcing vendors, and downstream customers. Therefore, managers are now realizing that mobile technologies offer the potential for organizations to connect all of their employees, customers, managers, suppliers, and partners, just as consumer devices like cell phones are connecting families.

However, the wide scope that mobile information systems encompass can itself become a problem. Especially when systems cross international, corporate, divisional, or vendor lines, devices tend to become heterogeneous. Although the issue was not actively explored above, one major issue faced by the company was which mobile device(s) should be the platform for its procurement system. PDAs, smart phones, and laptops were all potential devices, and several vendors and models were identified. Devices that had large keyboards and screens were easier for some users,
particularly those who needed to use the system a lot but were relatively stationary. Managers who needed to wander a lot obviously preferred smaller devices that were easier to carry. Unfortunately, these devices were not completely compatible, and this contributed to difficulties in providing user support. How organizations connect and manage the multitude of devices from different vendors is a real issue that executives need to address both technically and culturally. Technically, devices in the future will likely need to adapt to new situations through dynamically deployed processes and systems, such as the SATIN component system (Zachariadis et al., 2006). This system proposes that as users move from place to place, their devices would be updated largely automatically with code appropriate to their new location (ideally using cross-platform code components). Much as users today download new ring tones, future systems will need to intelligently download appropriate applications or add-ons to existing applications. Cultural barriers may present even greater problems than technical ones. How do you adapt a phone designed for the USA to the Cyrillic character set if the user suddenly travels to Russia, and how do you design a smart phone with a large keyboard and screen that is still fashionable enough for use in Hong Kong? Furthermore, new government regulations regarding information privacy and the tracking of hazardous materials require greater integration within supply chains, and some firms are looking to mobile technologies to help their companies stay in compliance. How these different forces will affect mobile technologies has yet to be determined.

Another interesting issue illustrated by the case study is that perhaps a new perspective is needed when designing systems. For the last 50 years technology has often been used to increase the efficiency of existing systems. Accounting systems, resource management systems, database and document processing systems have all been automated through technology. However, as was seen in the case study, it is not necessarily sufficient to simply add new technology to existing systems without rethinking the business processes. In the procurement system described above it is clear that managers who were often out of the office liked the system, whereas most other users found it difficult to use. In a sense the system was designed with managers in mind rather than with the users who would actually request items. Given the increase in power and portability of new devices, it may be time to rethink business processes entirely and focus on the end user. While this is an often-stated idea, the reality is that most automated accounting systems and enterprise resource planning systems focus on reducing costs rather than on improving usability. Put another way, systems are often designed for the convenience of the enterprise rather than for the actual users of the system, whether employees or customers.
Example of how technology can have unexpected negative consequences

Several local hospitals serve the University of Illinois at Urbana-Champaign. One in particular has a very modern mobile information system that allows doctors to wirelessly enter the results of examinations and medical tests directly into the hospital's records system using laptops. The doctors can also use the wireless system to send most prescriptions to a variety of local pharmacies while still moving from room to room. It is a very convenient system for the doctors. When patients want to find out the results of their medical tests, they can simply call the appropriate medical office, and in theory a nurse can look up the information on the system right away. Unfortunately, medical offices are busy places and nurses are often with patients. It may take several calls to the office, being placed on hold for 10–15 min at a time, before a nurse is available to look up the necessary information, especially if the doctor has not fully updated the patient's records. In that case, the nurse has to track down the doctor while the patient waits on hold. It is a system many readers are probably familiar with, and in the days of wired phones it worked satisfactorily. However, this system does not work well for those, like students, who use mobile technology themselves and have cell phone plans with limited anytime minutes. They simply cannot afford to wait on hold for 30 min. Even if they have unlimited night and weekend minutes, hospital offices are generally not open on nights and weekends. As a result, it is often easier to drive to the hospital to get the information than to call! This is a clear example where the switch from traditional to mobile technology, by the customer in this case, has caused a marked drop in customer service. It illustrates the need to always consider your customer when designing your system. Does the hospital's system really end with the doctors and nurses? Obviously not. So while the bulk of this chapter concerned a mobile information system that was internal to an organization, it is important to always ask how an organization's information systems ultimately affect the customer. The hospital system freed doctors and nurses to wander away from their offices, but it also limited the ability of patients to track them down!

References

Angeles, R. (2005). RFID technologies: supply-chain applications and implementation issues. Information Systems Management 22(1), 51–65.
Bostrom, R., J.S. Heinen (1977). MIS problems and failures: a socio-technical perspective. MIS Quarterly 1(3), 17–32.
Davis, F. (1989). Perceived usefulness, perceived ease of use, and user acceptance of information technology. MIS Quarterly 13(3), 319–340.
Dishaw, M.T., D.M. Strong (1998). Supporting software maintenance with software engineering tools: a computed task-technology fit analysis. Journal of Systems and Software 44(2), 107–120.
Dishaw, M.T., D.M. Strong (1999). Extending the technology acceptance model with task-technology fit constructs. Information & Management 36(1), 9–21.
Gebauer, J., M. Shaw (2002). A Theory of Task/Technology Fit for Mobile Applications to Support Organizational Processes. College of Business Working Paper CITEBM-02-03, University of Illinois at Urbana-Champaign.
Gebauer, J., M. Shaw (2004). Success factors and benefits of mobile business applications: results from a mobile E-procurement study. International Journal of Electronic Commerce 8(3), 19–41.
Gebauer, J., M.J. Shaw, M.L. Gribbins (2005). Task-Technology Fit for Mobile Information Systems. College of Business Working Paper 05-0119.
Goodhue, D.L., R.L. Thompson (1995). Task-technology fit and individual performance. MIS Quarterly 19(2), 213–236.
Kerensky, L. (2006). For Better or for Worse, Mobile is on the Move. Adotas.com. Available at http://www.adotas.com/2006/08/for-better-or-for-worse-mobile-is-on-the-move/. Current June 29, 2007.
Lyytinen, K., Y. Yoo (2002a). Research commentary: the next wave of nomadic computing. Information Systems Research 13(4), 377–388.
Lyytinen, K., Y. Yoo (2002b). Issues and challenges in ubiquitous computing. Communications of the ACM 45(12), 63–65.
Milgram, S. (1967). The small world problem. Psychology Today 2, 60–67.
Ribeiro, J. (2007). To do with the price of fish. Economist.com. Available at http://www.economist.com/finance/displaystory.cfm?story_id=9149142. Current June 29, 2007.
Sarker, S., J.D. Wells (2003). Understanding mobile handheld device use and adoption. Communications of the ACM 46(12), 35–40.
Venkatesh, V., M.G. Morris, G.B. Davis, F.D. Davis (2003). User acceptance of information technology: toward a unified view. MIS Quarterly 27(3), 425–478.
Zachariadis, S., C. Mascolo, W. Emmerich (2006). The SATIN component system—A metamodel for engineering adaptable mobile systems. IEEE Transactions on Software Engineering 32(11), 910–927.
Zigurs, I., B.K. Buckland (1998). A theory of task-technology fit and group support system effectiveness. MIS Quarterly 22(3), 313–334.
Zigurs, I., B.K. Buckland, J.R. Connolly, E.V. Wilson (1999). A test of task/technology fit theory for group support systems. Database for Advances in Information Systems 30(3/4), 34–50.
Adomavicius & Gupta, Eds., Handbooks in Information Systems, Vol. 3. Copyright © 2009 by Emerald Group Publishing Limited.
Chapter 14
Web-Based Business Intelligence Systems: A Review and Case Studies
Wingyan Chung Department of Operations and Management Information Systems, Leavey School of Business, Santa Clara University, Santa Clara, CA 95053, USA
Hsinchun Chen Department of Management Information Systems, Eller College of Management, The University of Arizona, Tucson, AZ 85721, USA
Abstract

As businesses increasingly use the Web to share and disseminate information, effectively and efficiently discovering business intelligence (BI) from voluminous information has challenged researchers and practitioners. In this chapter, we review literature in BI on the Web and technology for Web analysis, and propose a framework for developing Web-based BI systems, which can help managers and analysts to understand their competitive environment and to support decision-making. The framework consists of steps including collection, conversion, extraction, analysis, and visualization that transform Web data into BI. Combinations of data and text mining (TM) techniques were used to assist human analysis in different scenarios. We present three case studies applying the framework to BI discovery on the Web. In these studies, we designed, developed, and evaluated Web-based BI systems that search for and explore BI from a large number of Web pages and help analysts to classify and visualize the results. Experimental results show that the systems compared favorably with benchmark methods, showing the usability of the framework in alleviating information overload and in collecting and analyzing BI on the Web. Considering the scarce research work found in this field, this chapter provides valuable insights and new research findings on developing Web-based BI systems. Future directions of BI research are discussed.
1 Introduction
As businesses increasingly share and disseminate information on the Web, information overload often hinders the discovery of business intelligence (BI). A study found that the world produces between 3.41 million and 5.61 million terabytes of unique information per year, most of which is stored in computer hard drives or servers (Lyman and Varian, 2003). Many of these computing devices serve as the repository of the Internet, supporting convenient access to information but also posing challenges for effective knowledge discovery from voluminous information. Such convenient storage of information on the Web has made information exploration difficult (Bowman et al., 1994). While it is easy to access a large number of Web repositories nowadays, it is difficult to identify the relationships among interconnected Web resources. A study found that over 90% of pages on the public Web are connected to other Web pages through hyperlinks (Broder et al., 2000). The proliferation of electronic commerce further aggravates the problems of identifying business relationships on the Web. Business managers and analysts need better approaches to understanding large amounts of information and data. Effectively and efficiently discovering BI from the vast amounts of information on the Web has thus challenged researchers and practitioners. Unfortunately, little research on BI systems and applications has been found in the literature (Negash, 2004). This chapter reviews related work on BI systems and mining the Web for BI, describes a framework for discovering BI on the Web, and presents three case studies applying the framework to designing and developing Web-based BI systems, which can help managers and analysts to understand their competitive environment and to support decision-making. Our purpose is to inform the community of researchers and practitioners about recent advances in BI research and to enrich this still young field.
2 Literature review
BI is defined as the product of systematic acquisition, collation, analysis, interpretation, and exploitation of business information (Chung et al., 2005). As the Web becomes a major source of business information (Futures-Group, 1998), Web-based BI emerges as a valuable resource for managers to understand the business environment, to devise competitive strategies, and to support decision-making. For example, business managers and analysts can study competitors' movements by analyzing their Web site content and hyperlinks. Web log messages, news articles, and online forum messages can be analyzed to extract customer preferences or emerging trends. We review below the literature on BI systems and Web mining
technologies. Our review focuses on analyzing and extracting Web-based BI from organizations' external environments.
2.1 Business intelligence systems

BI systems enable organizations to understand their internal and external environments. Two classes of BI tools have been defined (Carvalho and Ferreira, 2001). The first class is used to manipulate massive operational data and to extract essential business information from a company's (internal) operational data. Examples include decision support systems, executive information systems, online analytical processing (OLAP), data warehouses, and data mining systems that are built upon database management systems to reveal hidden trends and patterns (Choo, 1998). The second class of BI tools, sometimes called competitive intelligence (CI) tools, aims at systematically collecting and analyzing information from the external business environment to assist organizational decision-making. These tools mainly gather information from public sources such as the Web. Rooted in military strategy (Cronin, 2000; Nolan, 1999), they also provide insights into various value-adding processes in knowledge discovery. In contrast to the first class of BI tools, technologies for collecting and analyzing data from the external environment are less standardized because of the relatively unstructured nature of the data. These technologies have to implement the steps of the BI discovery process.

2.1.1 BI discovery process

Researchers have proposed various steps in the process of discovering BI. Taylor proposes a value-added spectrum consisting of four major phases: organizing processes (grouping, classifying, relating, formatting, signaling, displaying); analyzing processes (separating, evaluating, validating, comparing, interpreting, synthesizing); judgmental processes (presenting options, presenting advantages, presenting disadvantages); and decision processes (matching goals, compromising, bargaining, choosing) (Taylor, 1986). Some authors add "evaluation" as a feedback loop (Fuld et al., 2002). Through the different phases, transformations take place in the order of data, information, informing knowledge, productive knowledge, and action. The difficulty of discovering BI increases as one progresses along the phases, because the data are less structured and require more processing. An empirical study of BI implementation identified four phases (Westney and Ghoshal, 1994) similar to Taylor's spectrum. The data management phase consists of acquisition, classification, storage, retrieval, editing, verification and quality control, presentation, aggregation, distribution, and assessment. The analysis phase consists of synthesis, hypothesis, and assumption building and testing. The implication and
action phases, respectively, concern how analysts should respond and what tasks should be performed.

2.1.2 Commercial BI tools

Fuld et al. (2003) found that global interest in BI technology has increased significantly over the past five years. They compared 16 commercial BI tools based on a five-stage intelligence cycle: (1) planning and direction, (2) published information collection, (3) source collection from humans, (4) analysis, and (5) reporting and information sharing. It was found that the tools have become more open to the Web, through which businesses nowadays share information and perform transactions. There is no "one-size-fits-all" solution because different tools are used for different purposes. In terms of the weaknesses of BI tools, the automated search capability in many tools can lead to information overload. Despite improvements in analysis capability over the past year (Fuld et al., 2002), there is still a long way to go to assist qualitative analysis effectively. Most tools that claim to do analysis simply provide different views of a collection of information (e.g., comparisons between different products or companies). More advanced tools use text mining (TM) technology or rule-based systems to determine relationships among people, places, and organizations using a user-defined dictionary or a dynamically generated semantic taxonomy. Because existing BI tools are not capable of illustrating the landscape of a large number of documents collected from the Web, their actual value to analysis is questionable (Fuld et al., 2003). In addition, only a few improvements have been made to reporting and information sharing functions, although many tools integrate their reports with Microsoft Office products and present them in a textual format.

2.2 Mining the Web for BI

As most resources on the Web are text-based, automated tools and techniques have been developed to exploit textual information. For instance, Fuld et al. (2003) have noticed that more BI tools are now compatible with the Web than in the past. Although text expresses a vast, rich range of information, it encodes this information in a form that is difficult to decipher automatically (Hearst, 1999), and researchers have recently identified TM as a potential solution. Compared with data mining, TM focuses on knowledge discovery in textual documents and involves multiple processes.

2.2.1 Text mining

TM is the process of finding interesting or useful patterns in textual data and information. An example is analyzing textual descriptions in financial
news to predict stock market movements. TM combines many of the techniques of information extraction, information retrieval, natural language processing, and document summarization (Hearst, 1999; Trybula, 1999). It provides a means of developing knowledge links and knowledge sharing among people within organizations. Though the field is in its infancy, it has been anticipated to grow explosively in order to address growing information challenges in organizations (Trybula, 1999). TM evolved from the field of automatic text processing that emerged in the 1970s, and was influenced by the related field of machine learning in the 1980s, and by data mining, knowledge discovery, and Web mining in the 1990s. In recent years, businesses have increasingly relied on TM to discover intelligence on the Web.

Trybula (1999) proposes a framework for knowledge discernment in text documents. The framework includes several processes to transform textual data into knowledge: (1) Information acquisition: the text is gathered from textbases at various sources, through finding, gathering, cleaning, transforming, and organizing; manuscripts are compiled into a preprocessed textbase. (2) Extraction: the purpose of extraction is to provide a means of categorizing the information so that relationships can be identified; activities include language identification, feature extraction, lexical analysis, syntactic evaluation, and semantic analysis. (3) Mining: this involves clustering in order to provide a manageable set of textbase relationships that can be evaluated during information searches. (4) Presentation: visualizations or textual summarizations are used to facilitate browsing and knowledge discovery.

Although Trybula's framework covers important areas of TM, it has several limitations for TM on the Web. First, documents on the Web need more preprocessing because they exist in many formats such as HTML, XML, and dynamically generated Web pages. Second, efficient and effective methods are needed to collect Web pages because they are often voluminous; human collection does not scale with the growth of the Web. Third, information on the Web comes from heterogeneous sources and requires better integration and more discrimination. Fourth, mining and visualization options beyond clustering are needed to reveal hidden patterns in noisy Web data. Web mining technologies are needed to augment TM for discovering BI.

2.2.2 Web mining

Web mining is the use of data mining techniques to automatically discover and extract information from Web documents and services (Etzioni, 1996). Machine learning techniques have been applied to Web mining (Chen and Chau, 2004). Given the exponential growth of the Web, it is difficult for any single search engine to provide comprehensive coverage of search results. Meta-searching has been shown to be a highly effective method of resource discovery and collection on the Web. By sending
queries to multiple search engines and collating the set of top-ranked results from each engine, meta-search engines can greatly reduce bias in search results and improve coverage. Chen et al. (2001) showed that integrating meta-searching with textual clustering tools achieved high precision in searching the Web.

To extract information and uncover patterns from Web pages or sites, three categories of Web mining have been identified: Web content mining, Web structure mining, and Web usage mining (Kosala and Blockeel, 2000). Web content mining refers to the discovery of useful information from Web contents, such as text, image, video, audio, and so on. Previous work on Web content mining includes Web-page categorization (Chen et al., 1996), clustering (Zamir and Etzioni, 1999), rule and pattern extraction (Hurst, 2001), and concept retrieval (Chen et al., 1998; Schatz, 2002). Web structure mining refers to the analysis of the link structures that model the Web, encompassing work on resource discovery (Chakrabarti et al., 1999), Web-page ranking (Brin and Page, 1998; Lempel and Moran, 2001), authority identification (Kleinberg, 1999; Mendelzon and Rafiei, 2000), and the evolution of Web documents (Henzinger and Lawrence, 2004). Web usage mining studies techniques that can predict user behavior while the user interacts with the Web. Knowledge of Web usage can contribute to building e-commerce recommender systems (Pazzani, 1999), Web-based personalization and collaboration (Adomavicius and Tuzhilin, 2001), and decision support (Chen and Cooper, 2001).

In addition to the aforementioned Web mining technologies, there have been applications of machine learning and data mining techniques to Web mining. Web-page summarization, a process of automatically generating a compact representation of a Web page based on the page features and their relative importance (Hearst, 1994; McDonald and Chen, 2002), can be used to facilitate understanding of search engine results. Web-page classification, a process of automatically assigning Web pages into predefined categories, can be used to assign pages into meaningful classes (Mladenic, 1998). Web-page clustering, a process of identifying naturally occurring subgroups among a set of Web pages, can be used to discover trends and patterns within a large number of pages (Chen et al., 1996). Web-page visualization, a process of transforming a high-dimensional representation of a set of Web pages into a two- or three-dimensional representation that can be perceived by human eyes, can be used to represent important knowledge as pictures (Yang et al., 2003).
3 A framework for discovering BI on the Web
Although a number of data and TM technologies exist, there has not yet been a comprehensive framework to address problems of discovering BI on the Web. Data and TM technologies hold the promise for alleviating these
problems by augmenting human analysis. However, applying these technologies effectively requires consideration of several factors related to the Web itself, such as the choice of collection methods, Web-page parsing and information extraction, the presence of hyperlinks, and language differences in heterogeneous information sources. Unfortunately, existing frameworks using data and TM techniques (e.g., Nasukawa and Nagano, 2001; Soper, 2005; Trybula, 1999) do not address these issues. Research on intelligent Web technologies (e.g., Zhong et al., 2003) seldom addresses the need for BI discovery on the Web (Negash, 2004).

To address these needs, we have developed a framework for discovering BI on the Web. The rationale underlying our framework is to capture the strengths of different data and TM techniques and to complement their weaknesses, thereby effectively assisting human analysts as they tackle problems of discovering BI on the Web. As shown in Fig. 1, the framework provides system designers with useful tools, techniques, and guidelines for designing and developing Web-based BI systems. The framework consists of five steps: collection, conversion, extraction, analysis, and visualization. Input to and output from the framework are, respectively, Web data and the BI discovered after applying the steps. Each step allows human knowledge to guide the application of techniques (e.g., heuristics for parsing, weighting in calculating similarities, keywords for meta-searching/meta-spidering). The steps shown in Fig. 1 produce collections of processed results: Web pages and documents; a tagged collection; indexes and relationships; similarities, classes, and clusters; and hierarchies, maps, and graphs. As we move from left to right across these collections, the degree of context and the difficulty of detecting noise in the results increase. The three left-hand-side collections are labeled "data and text bases" and the two right-hand-side collections are labeled "knowledge bases." The former mainly contain raw data and processed textual information, while the latter contain BI discovered from the data and text bases. We explain each step in the following sections.

3.1 Collection

The purpose of this step is to acquire raw data for creating research test beds. Data in the form of textual Web pages (e.g., HTML, XML, JSP, ASP, etc.) are collected. Several types of data are found in these pages: textual content (the text that can be seen in an Internet browser), hyperlinks (embedded behind anchor text), and structural content (textual mark-up tags that indicate the types of content on the pages). To collect these data, meta-searching/meta-spidering and domain spidering are used. Meta-spidering is an enhanced version of meta-searching that uses keywords as inputs. These keywords can be identified by human experts or by reviewing related literature.
Fig. 1. A framework for discovering BI on the Web. (The framework transforms Web data, including HTML/XML pages, Web sites, and the hidden Web behind databases, into BI through five steps: collection (meta-searching/meta-spidering with keywords; domain spidering with links), conversion (language identification; HTML/XML parsing; domain/database-specific parsing), extraction (word/phrase indexing; link extraction; lexical/syntactic entity extraction), analysis (co-occurrence analysis; classification/categorization; clustering/summarization; link/network analysis), and visualization (structure visualization; spatial navigation; placing entities on a map). Intermediate results move from data and text bases (Web pages and documents; tagged collection; indexes and relationships) to knowledge bases (similarities, classes, clusters; hierarchies, maps, graphs), which the user accesses through the system's user interface.)
In addition to obtaining results from multiple search engines and collating the set of top-ranked results, the process follows the links of the results and downloads appropriate Web pages for further processing. Data in the hidden Web (i.e., Web sites behind a firewall or protected by passwords) can be collected through meta-spidering. Domain spidering uses a set of seed URLs (provided by experts or identified in reputable sources) as starting pages. A crawler follows the links in these pages to fetch further pages automatically. Oftentimes, a breadth-first search strategy is used because it generally provides good coverage of resources on the topic being studied. The result of this step is a collection of Web pages and documents that contains much noisy data.

3.2 Conversion

Because collected raw data often contain irrelevant details (i.e., the data are noisy), several steps may be needed to convert them into more organized collections and to filter out unrelated items. Language identification (mentioned in the framework by Trybula, 1999) is used mainly for Web pages in which more than one language may exist or English may not be the primary language. Heuristics (such as reading the meta-tags about language encoding) may be needed. HTML/XML parsing tries to extract meaningful entities based on HTML or XML mark-up tags (e.g., <H1>, <TITLE>, <A HREF="http://www.nytimes.com/">). Domain/database-specific parsing adds domain knowledge or database schematic knowledge to improve the accuracy of entity extraction. For example, knowledge about major BI companies can be used to capture hyperlinks appearing in Web pages, and further analysis can then study the relationships among the interlinked companies. The result of this step is a collection of Web pages tagged with the above-mentioned semantic details (e.g., language, meaning of entities, domain knowledge), carrying more contextual information than the results from the previous step.
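To make the collection and conversion steps concrete, the sketch below shows a minimal breadth-first domain spider that fetches pages from a set of seed URLs and extracts hyperlinks and titles using standard HTML parsing. It is an illustration of the general approach described above, not the implementation used in our case studies; the seed URLs, page limit, and timeout are hypothetical placeholders.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

class LinkAndTitleParser(HTMLParser):
    """Collects <a href> targets and the <title> text of one HTML page."""
    def __init__(self):
        super().__init__()
        self.links, self.title, self._in_title = [], "", False
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "title":
            self._in_title = True
    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
    def handle_data(self, data):
        if self._in_title:
            self.title += data

def domain_spider(seed_urls, max_pages=50):
    """Breadth-first crawl restricted to the seed URLs' domains."""
    allowed = {urlparse(u).netloc for u in seed_urls}
    queue, seen, collected = deque(seed_urls), set(seed_urls), {}
    while queue and len(collected) < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "ignore")
        except Exception:
            continue  # skip unreachable or non-decodable pages
        parser = LinkAndTitleParser()
        parser.feed(html)
        collected[url] = {"title": parser.title.strip(), "html": html}
        for href in parser.links:
            link = urljoin(url, href)
            if urlparse(link).netloc in allowed and link not in seen:
                seen.add(link)
                queue.append(link)  # breadth-first: new links join the tail of the queue
    return collected

# Hypothetical usage: pages = domain_spider(["http://www.example.com/"], max_pages=20)
```

In practice, a production crawler would also respect robots.txt, throttle requests, and restrict crawling to topic-relevant pages, but the breadth-first queue above captures the coverage rationale mentioned in the text.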
3.3 Extraction

This step aims to extract entities automatically as inputs for analysis and visualization. Indexing is the process of extracting words or phrases from textual documents. A list of stop words is typically used to remove non-semantically-bearing terms (e.g., "of," "the," "a"), which can be identified in the literature (e.g., van Rijsbergen, 1979). Link extraction identifies hyperlinks within Web pages. Anchor texts of these links are often extracted to provide further details about the linkage relationships. Lexical or syntactic entities can be extracted to provide richer context for the Web pages (i.e., entity extraction). An example of a lexical entity is a company name (e.g., "Siebel," "ClearForest") appearing on a Web page. The results of this step are indexes to Web pages and relationships between entities and Web pages (e.g., indicating which terms appear on which pages, or showing the stakeholder relationship between a business and its partner). They provide more contextual information to users by showing the relationships among entities. Noise in the data is much reduced compared with the previous steps.

3.4 Analysis

Once the indexes, relationships, and entities have been extracted in the previous step, several analyses can be performed to discover knowledge or previously hidden patterns. Co-occurrence analysis tries to identify frequently occurring pairs of terms and similar Web pages; pairwise comparison between pages is often performed. Classification/categorization helps analysts to categorize Web pages into predefined classes so as to facilitate understanding of individual pages or an entire set of pages. Web-page classification has been studied in previous research (Glover et al., 2002; Kwon and Lee, 2003; Lee et al., 2002). Clustering organizes similar Web pages into naturally occurring groups to help detect patterns; related works include Chen et al. (1998), Jain and Dubes (1988), and Roussinov and Chen (2001). Summarization provides the gist of a Web page and has been studied by Hearst (1994) and McDonald and Chen (2002). Link or network analysis reveals the relationships or communities hidden in a group of interrelated Web pages (e.g., Menczer, 2004). Depending on the contexts and needs, these functions are selectively applied to individual empirical studies using appropriate techniques. The results of this step are similarities (e.g., a similarity matrix among pairs of Web pages), classes (e.g., classes of stakeholders), and clusters (e.g., groups of closely related Web pages). They are more abstract than the results from previous steps while supporting the use of structured analysis techniques (e.g., visualization techniques).
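As a hedged illustration of the extraction and analysis steps, the sketch below indexes page text into term sets (after removing a small stop-word list) and computes a pairwise page-similarity matrix with the Jaccard coefficient. The stop-word list and tokenization are simplified placeholders; a full system would use a larger lexicon and a weighting scheme such as TF-IDF.

```python
import re
from itertools import combinations

STOP_WORDS = {"of", "the", "a", "an", "and", "or", "to", "in", "is", "for"}  # illustrative subset

def index_page(text):
    """Extract a set of lower-cased index terms, dropping stop words."""
    tokens = re.findall(r"[a-zA-Z][a-zA-Z\-]+", text.lower())
    return {t for t in tokens if t not in STOP_WORDS}

def similarity_matrix(pages):
    """pages: dict mapping URL -> page text. Returns Jaccard similarities for all URL pairs."""
    term_sets = {url: index_page(text) for url, text in pages.items()}
    sims = {}
    for u, v in combinations(term_sets, 2):
        a, b = term_sets[u], term_sets[v]
        union = a | b
        sims[(u, v)] = len(a & b) / len(union) if union else 0.0
    return sims

# Hypothetical usage with the spider output sketched earlier:
# sims = similarity_matrix({url: page["html"] for url, page in pages.items()})
```

The resulting similarity matrix is exactly the kind of intermediate output that the visualization step below turns into trees, maps, or graphs.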
3.5 Visualization

In some applications (e.g., understanding the market environment of an industry), it is worthwhile to present a high-level overview of the results graphically, and visualization appears to be a promising way to accomplish this. In the information visualization community, various frameworks and taxonomies for information visualization have been proposed (Shneiderman, 1996; Spence, 2001; Yang et al., 2003). Three kinds of visualization can be performed on the results from the previous step. (1) Structure visualization reveals the underlying structure of the set of Web pages, often in the form of hierarchies. An example is identifying and portraying the similarity of Web pages as trees so that related pages are grouped into nodes. (2) Spatial navigation presents information (abstracted from voluminous data) in a two- or three-dimensional space, allowing users to move around in different directions to explore the details. A specific instance of spatial navigation is map browsing, in which a user navigates a map to look for relevant information. (3) Placing entities on a map (e.g., mapping Web sites) allows analysts to study relationships among Web pages; often, the distances among the points reflect the similarity among the pages. For example, the Kohonen self-organizing map has been used to visualize large numbers of Web pages (Chen et al., 1998; Shneiderman, 1996; Spence, 2001; Yang et al., 2003). The results of this step include hierarchies (e.g., hierarchically related Web pages or sites), maps (e.g., Web sites placed as points on a map), and graphs (e.g., interconnected Web sites represented as graphs). They can be perceived graphically, supporting the understanding of large amounts of information.

3.6 Comparison with existing frameworks

Compared with existing TM frameworks known to us, our proposed framework recognizes special needs for collecting and analyzing Web data. While Trybula's framework (Trybula, 1999) touches on issues of finding and gathering data, it does not address the voluminous and heterogeneous nature of Web data. The framework proposed by Nasukawa and Nagano (2001) assumes the use of operational data stored in business databases and hence does not deal with data collection and conversion on the Web. The framework proposed by Soper (2005) lacks the capability to process the textual and hyperlink information that is important for understanding businesses on the Web. In contrast, the different spidering techniques in our framework provide broader and deeper exploration of a domain's content. The conversion and extraction methods in our framework provide more comprehensive details specific to the Web, such as hyperlinks, anchor texts, and meta-contents, than Trybula's framework, which considers clustering only in its mining stage. Nasukawa and Nagano's framework relies mainly on natural language processing techniques to extract concepts from textual documents and is not tailored to the processing of noisy Web data. In contrast, our framework encompasses a wider range of analysis and visualization techniques that take into account the noisiness and heterogeneity of Web data.
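Before turning to the case studies, the following minimal sketch illustrates the map-based visualization idea from Section 3.5: a small Kohonen-style self-organizing map trained on page feature vectors, after which each page is placed in a two-dimensional grid cell so that similar pages land near one another. The grid size, learning-rate schedule, and the assumption that pages are already represented as numeric term vectors are illustrative choices, not the configuration of the systems described below.

```python
import numpy as np

def train_som(page_vectors, grid=(10, 10), iters=2000, lr0=0.5, sigma0=3.0, seed=0):
    """Train a small self-organizing map on row vectors (one row per Web page)."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    dim = page_vectors.shape[1]
    weights = rng.random((rows, cols, dim))
    ys, xs = np.mgrid[0:rows, 0:cols]
    for t in range(iters):
        v = page_vectors[rng.integers(len(page_vectors))]
        # locate the best-matching unit (BMU) for this page vector
        dists = np.linalg.norm(weights - v, axis=2)
        by, bx = np.unravel_index(np.argmin(dists), dists.shape)
        # decay the learning rate and the neighbourhood radius over time
        frac = t / iters
        lr = lr0 * np.exp(-frac)
        sigma = sigma0 * np.exp(-frac)
        neigh = np.exp(-((ys - by) ** 2 + (xs - bx) ** 2) / (2 * sigma ** 2))
        weights += lr * neigh[..., None] * (v - weights)
    return weights

def place_pages_on_map(page_vectors, weights):
    """Return the (row, col) map cell of the BMU for each page."""
    cells = []
    for v in page_vectors:
        d = np.linalg.norm(weights - v, axis=2)
        cells.append(np.unravel_index(np.argmin(d), d.shape))
    return cells
```

Labeling each occupied cell with the most frequent phrase of the pages it holds yields a browsable map of a document collection, which is the spirit of the SOM-based visualizations cited above.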
4 Case studies
In this section, we present three case studies of applying the framework to developing Web-based BI systems. We describe the system developed in each study and summarize its potential benefits. Table 1 shows the detailed applications of the framework in the three empirical studies.
Table 1
Detailed applications of the framework in the three case studies (Cases 1–3)

Collection: meta-searching/meta-spidering; domain spidering
Conversion: language identification; HTML/XML parsing; domain/database-specific parsing
Extraction: indexing (word/phrase); link extraction; entity extraction (lexical/syntactic)
Analysis: co-occurrence analysis; classification/categorization; clustering/summarization; link/network analysis
Visualization: structure visualization; spatial navigation; placing entities on map
Most of the components of collection, conversion, and extraction were applied, while the analysis and visualization components were selectively applied to specific studies that focused on certain BI discovery problems.

4.1 Case 1: Searching for BI across different regions

As electronic commerce grows in popularity worldwide, business analysts need to access more diverse information, some of which may be scattered across different regions. It is estimated that the majority of the total global online population (64.8%) lives in non-English-speaking regions (Global Reach, 2004b). Moreover, that population is estimated to grow to 820 million in the near future, while the population of English-speaking users is anticipated to remain at 300 million (Global Reach, 2004a). These statistics imply a growing need to develop Web-based BI systems for non-English-speaking users. The Chinese e-commerce environment provides a good example. Chinese is the primary language for people in Mainland China, Hong Kong, and Taiwan, where emerging economies are bringing tremendous growth to the Internet population. In Mainland China, the number of Internet users has been growing at 65% every six months since 1997 (CNNIC, 2002). Taiwan and Hong Kong lead the regions by having
the highest Internet penetration rates in the world (ACNielsen, 2002). The need for searching and browsing Chinese business information on the Internet is growing just as quickly. To facilitate BI discovery in the Chinese e-commerce environment, we have applied our framework to developing a BI search system, called the Chinese Business Intelligence Portal (CBizPort), that supports searching and browsing of BI across the Greater China regions of Mainland China, Hong Kong, and Taiwan. The portal integrates information from heterogeneous sources and provides post-retrieval analysis capabilities. Meta-searching, pattern extraction, and summarization were major components of the portal, which has an interface customized to the usages of the three regions. An online demo of the system is available at: http://aidemo.eller.arizona.edu/big5biz.

CBizPort has two versions of the user interface (Simplified Chinese and Traditional Chinese) that have the same look and feel. Each version uses its own character encoding when processing queries, and an encoding converter converts all Chinese characters into the encoding of the interface version. On the search page (Fig. 2), the major component is the meta-searching area, on top of which is a keyword input box. Users can input multiple keywords on different lines and can choose among eight carefully selected information sources (Table 2) from the three regions by checking the boxes. A one-sentence description is provided for each information source. On the result page, we display the top 20 results from each information source. The results are organized according to the information sources on one Web page. Users can browse the set of results from a particular source by clicking on the bookmark at the top-right-hand side of the page (e.g., "HKTDCmeta," "Baidu," and "Yahoo Hong Kong" in Fig. 2). Users can also click on the "Analyze results" button to use the categorizer, or choose a number of sentences with which to summarize a Web page.

The CBizPort summarizer was modified from an English summarizer called TXTRACTOR, which uses sentence-selection heuristics to rank text segments (McDonald and Chen, 2002). These heuristics strive to reduce redundancy of information in a query-based summary (Carbonell and Goldstein, 1998). The summarization takes place in three main steps: (1) sentence evaluation, (2) segmentation or topic identification, and (3) segment ranking and extraction. First, the Web page to be summarized is fetched from the remote server and parsed to extract its full text. Sentences are extracted by identifying punctuation marks that act as sentence delimiters (e.g., Chinese and Western periods, exclamation marks, and question marks). Important information such as the presence of cue phrases (e.g., "therefore," "in summary"), sentence lengths, and positions is also extracted for ranking the sentences. Second, the TextTiling algorithm (Hearst, 1994) is used to analyze the Web page and determine where the topic boundaries are located. A Jaccard similarity function is used to compare the similarity of different blocks of sentences.
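As an illustration of the block-comparison step just described, the sketch below computes Jaccard similarity between adjacent blocks of sentences; low-similarity valleys suggest topic boundaries in a TextTiling-like manner. The tokenization, block size, and threshold are simplified assumptions, not the exact TXTRACTOR/TextTiling implementation used in CBizPort.

```python
import re

def tokens(sentence):
    """Very simple word tokenizer used only for block comparison."""
    return set(re.findall(r"\w+", sentence.lower()))

def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def block_similarities(sentences, block_size=3):
    """Compare each adjacent pair of sentence blocks; low scores hint at topic boundaries."""
    scores = []
    for i in range(block_size, len(sentences) - block_size + 1):
        left = set().union(*(tokens(s) for s in sentences[i - block_size:i]))
        right = set().union(*(tokens(s) for s in sentences[i:i + block_size]))
        scores.append((i, jaccard(left, right)))
    return scores

def likely_boundaries(scores, threshold=0.1):
    """Positions where adjacent blocks share few terms are candidate topic boundaries."""
    return [i for i, s in scores if s < threshold]
```

Once boundaries are known, a summarizer can rank the resulting segments by the sentence-level cues mentioned above (cue phrases, length, position) and extract the top sentences from the highest-ranked segments.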
Fig. 2. Screenshots of various functions of CBizPort. (Clockwise from top-left: (1) Search page: allows a user to input multiple keywords on separate lines and to choose among eight regional Chinese search engines; (2) Result page: displays the title and summary of each result and provides options for summarizing each result in one to five sentences; (3) Categorizer: organizes the search results into twenty or fewer folders, each labeled by a key phrase; (4) Summarizer: displays the summary (on the left) and the original page (on the right) of a selected result in a pop-up window.)
Table 2
Information sources of CBizPort

Region            Information source                         Description
Mainland China    Baidu                                      A general search engine for Mainland China
                  China Security Regulatory Commission       A portal containing news and financial reports of the listed companies in Mainland China
Hong Kong         Yahoo Hong Kong                            A general search engine for Hong Kong
                  Hong Kong Trade Development Council        A business portal providing information about local companies, products, and trading opportunities
                  Hong Kong Government Information Center    A portal with government publications, services and policies, business statistics, etc.
Taiwan            Yam                                        A general search engine for Taiwan
                  PCHome                                     An IT news portal with hundreds of online publications in business and IT areas
                  Taiwan Government Information Office       A government portal with business and legal information
Third, document segments identified in the previous step are ranked according to the ranking scores obtained in the first step, and key sentences are extracted as the summary. The CBizPort summarizer can flexibly summarize Web pages using one to five sentences. Users invoke it by choosing the number of sentences for summarization in a pull-down menu under each result; a new window then displays the summary alongside the original Web page.

The CBizPort categorizer organizes the Web pages into folders labeled by the key phrases appearing in the page summaries or titles (see Fig. 2). It relies on a Chinese phrase lexicon to extract phrases from the Web-page summaries obtained from the eight search engines or portals. The lexicon for Simplified Chinese CBizPort differs from that for Traditional Chinese because the terms and expressions are likely to differ in the two contexts. To create the lexicons, we collected a large number of Chinese business Web pages and extracted meaningful phrases from them using the mutual information approach, a statistical method that identifies significant patterns in a large amount of text in any language as meaningful phrases (Church and Hanks, 1989; Ong and Chen, 1999). The mutual information (MI) algorithm computes how frequently a pattern appears in the corpus relative to its sub-patterns. Based on the algorithm, the MI of a pattern c (MI_c) is given by

MI_c = \frac{f_c}{f_{\text{left}} + f_{\text{right}} - f_c}

where f denotes the frequency of a set of words (f_c for the pattern itself, f_left and f_right for its left and right sub-patterns). Intuitively, MI_c represents the probability of co-occurrence of pattern c relative to its left sub-pattern and right sub-pattern.
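As a quick illustration, the MI score above can be computed directly from the three frequencies. The sketch below reproduces the numerical example given in the text that follows; counting pattern frequencies over a real corpus (e.g., with the PAT-array approach of Ong and Chen, 1999) is omitted.

def mutual_information(f_c: int, f_left: int, f_right: int) -> float:
    """MI of a pattern c given its own frequency and the frequencies of
    its left and right sub-patterns: f_c / (f_left + f_right - f_c)."""
    return f_c / (f_left + f_right - f_c)

# Pattern appears 100 times; its left sub-pattern 110 times; its right 105 times.
print(round(mutual_information(100, 110, 105), 2))  # 0.87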
Phrases with high MI are likely to be extracted and used in automatic indexing. For example, if the Chinese phrase for "knowledge management" appears in the corpus 100 times, its left sub-pattern appears 110 times, and its right sub-pattern appears 105 times, then the MI for the pattern is 100/(110 + 105 − 100) ≈ 0.87. To create the Simplified Chinese lexicon, over 100,000 Web pages in GB2312 encoding were collected from major business portals such as Sohu.com, Sina Tech, and Sina Finance in Mainland China. To create the Traditional Chinese lexicon, over 200,000 Web pages in Big5 encoding were collected from major business or news portals in Hong Kong and Taiwan (e.g., HKTDC, HK Government, Taiwan United Daily News Finance Section, Central Daily News). The Simplified Chinese lexicon contains about 38,000 phrases and the Traditional Chinese lexicon about 22,000 phrases.

Using the Chinese phrase lexicon, the categorizer performs full-text indexing on the title and summary of each result (or Web page) and extracts the top 20 (or fewer) phrases from the results. Phrases occurring in the text of more Web pages are ranked higher. A folder is then used to represent each phrase, and the categorizer assigns the Web pages to the respective folders based on the occurrences of the phrase in their text. A Web page can be assigned to more than one folder if it contains more than one of the extracted phrases. The number of Web pages in each folder is also shown. After clicking on a folder, users can see the titles of the Web pages assigned to that folder; further clicking on a title opens the Web page in a new window.

To evaluate the performance of CBizPort in assisting human analysis, we conducted an experiment with 30 Chinese speakers (10 from each of the three regions) who were students at a major university in the United States. We compared the search and browse performance of CBizPort with that of regional Chinese search engines, assigned according to a subject's place of origin (Mainland China – Sina, Hong Kong – Yahoo Hong Kong, and Taiwan – Openfind). Details of the experimental design and results can be found in Chung et al. (2004). In general, we found that CBizPort's searching and browsing performance (measured by the accuracy, precision, and recall of information retrieval tasks) was comparable to that of regional Chinese search engines, and that a combination of the two systems performed significantly better than either one alone for search and browse tasks. However, CBizPort's analysis functions did not enable the portal to achieve significantly better searching and browsing performance, despite subjects' many positive comments. Subjects' verbal comments indicated that CBizPort outperformed regional Chinese search engines in analysis functions, cross-regional searching capabilities, and user-friendliness, while the regional search engines operated more efficiently and were more popular. Overall, we believe that improvements are needed in applying the framework to address the heterogeneity and unmonitored quality of information on the Web.
4.2 Case 2: Exploring BI using Web visualization techniques

As reported by Fuld et al. (2003), existing BI tools suffer from a lack of analysis and visualization capabilities because many of them do not reveal the underlying structure of the data. This case study examines the use of clustering and visualization techniques to assist analysts in exploring BI on the Web. We applied our framework to developing a system, called Business Intelligence Explorer (BIE), that assists in the discovery and exploration of BI from a large number of Web pages. Data in the form of Web pages were collected by meta-searching seven major search engines (AltaVista, AlltheWeb, Yahoo, MSN, LookSmart, Teoma, and Wisenut) using nine BI queries obtained by searching the INSPEC literature database. In total we collected 3,149 Web pages from 2,860 business Web sites; each Web page represented one Web site. We then performed automatic parsing and indexing to extract key terms from the pages, which were used as inputs to co-occurrence analysis, transforming the raw data (indexes and weights) into a matrix giving the similarity between every pair of Web sites. The similarity between each pair of Web sites combines their content, structural (connectivity), and co-citation information, as shown in Fig. 3 (Chen and Lynch, 1992).
Fig. 3. Formulae used in co-occurrence analysis. (The similarity weight W_ij between sites i and j combines three matrices, mixed with weighting parameters between 0 and 1: the hyperlink adjacency A, with A_ij = 1 if site i has a hyperlink to site j and 0 otherwise; an asymmetric content-similarity score S_ij computed from term weights d_ij = tf_ij × log(N/df_j) × w_j adjusted by a term-type factor that favors terms appearing in the page title, headings, content text, or image alternate text; and the co-citation matrix C, where C_ij is the number of Web sites pointing to both site i and site j. Here tf_ij is the number of occurrences of term j in Web page i, df_j the number of Web pages containing term j, and w_j the number of words in term j.)
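Because the exact weighting scheme of Fig. 3 is only partially legible in this reproduction, the following is a schematic sketch of the general idea rather than the chapter's formulae: a term-based content-similarity matrix, the hyperlink adjacency matrix, and a co-citation matrix are blended into a single site-by-site similarity matrix. The cosine form of the content component and the mixing weights alpha and beta are illustrative assumptions.

import numpy as np

def combined_similarity(term_weights: np.ndarray, adjacency: np.ndarray,
                        alpha: float = 0.4, beta: float = 0.3) -> np.ndarray:
    """Blend content, link-structure, and co-citation evidence into one
    site-by-site similarity matrix (schematic version of the Fig. 3 idea).

    term_weights: sites x terms matrix of term weights (e.g., tf-idf).
    adjacency:    sites x sites matrix, adjacency[i, j] = 1 if site i links to site j.
    """
    # Content similarity: cosine similarity between term-weight vectors.
    norms = np.linalg.norm(term_weights, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    unit = term_weights / norms
    content = unit @ unit.T

    # Co-citation: number of sites pointing to both site i and site j.
    cocitation = adjacency.T @ adjacency
    if cocitation.max() > 0:
        cocitation = cocitation / cocitation.max()   # scale to [0, 1]

    structure = np.maximum(adjacency, adjacency.T)   # symmetric link evidence

    return alpha * content + beta * structure + (1 - alpha - beta) * cocitation

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tw = rng.random((4, 6))                          # 4 sites, 6 index terms
    adj = (rng.random((4, 4)) > 0.5).astype(float)
    np.fill_diagonal(adj, 0)
    print(combined_similarity(tw, adj).round(2))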
Fig. 4. Web community browsing method. (Groups of Web sites are organized into hierarchical communities; clicking on any node immediately below the root opens that sub-tree, a back button lets users traverse upward in the tree, users can open a Web site once they have specified it, and a panel shows details on demand: labels, title, summary, and URL.)
Web clustering and visualization techniques were used to transform the similarity matrix into a hierarchical display and a map display. The hierarchical display of Web pages was generated by using a genetic algorithm (GA) to recursively partition sets of Web pages, each represented as a node on a graph whose links are the computed similarities. During each iteration, the GA tries to find a bipartition of the graph that optimizes a certain criterion (the fitness function). The partitioned pages form Web communities, each consisting of similar pages, arranged in a multilevel hierarchy as shown in Fig. 4, in which the user is browsing the topic "information management" and can click on the node titled "Privacy information management system" to further navigate such topics as "digital libraries" and "Lotus organizer," each of which has a number of related Web sites that the user can click to open.

The map display of Web pages was generated by using a Multidimensional Scaling (MDS) algorithm (Young, 1987) that transformed the similarity matrix into a two-dimensional arrangement of points, each representing a Web site. The steps of MDS were:

1. Convert the similarity matrix into a dissimilarity matrix by subtracting each element from the maximum value in the original matrix. Call the new dissimilarity matrix D.

2. Calculate the matrix B of scalar products by using the cosine law. Each element of B is given by

b_{ij} = \frac{1}{2}\left(\frac{1}{n}\sum_{k=1}^{n} d_{ik}^2 + \frac{1}{n}\sum_{k=1}^{n} d_{kj}^2 - \frac{1}{n^2}\sum_{g=1}^{n}\sum_{h=1}^{n} d_{gh}^2 - d_{ij}^2\right)

where d_ij is an element of D and n is the number of nodes in the Web graph.
3. Perform a singular value decomposition on B and use the following formulae to find the coordinates of the points:

B = U V U'   (1)

X = U V^{1/2}   (2)

where U contains the eigenvectors of B in its columns and V holds the corresponding eigenvalues on its diagonal. Combining (1) and (2), we have B = X X' (a brief numerical sketch of these steps is given below).

We then used the first two column vectors of X as the two-dimensional coordinates of the points, which were used to place the Web sites on the screen, forming a knowledge map of a BI topic. Figure 5 shows a screenshot of the knowledge map, on which companies appear as points and proximity represents similarity. A user can click on a point to browse details of the company and can navigate the map by zooming in and out. For example, the user can see that "eKnowledgeCenter" is similar to "IT Toolbox Knowledge Management" because of their close proximity.

To evaluate the usability of the new browsing methods, we conducted an experiment with 30 subjects, comparing the knowledge map with a graphical search engine, Kartoo.com, and comparing the Web community display with a traditional linear list of search results. The detailed findings are presented in Chung et al. (2005). In summary, we found that the knowledge map performed significantly better than Kartoo in terms of effectiveness (measured by accuracy, precision, and recall), efficiency (measured by time spent on a task), and users' ratings of the placement of Web sites, owing to the knowledge map's accurate placement of Web sites and its clean interface.
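To make the three MDS steps above concrete, here is a minimal numerical sketch that uses NumPy's eigendecomposition in place of a dedicated MDS routine; the toy similarity matrix is purely illustrative.

import numpy as np

def classical_mds(similarity: np.ndarray, dims: int = 2) -> np.ndarray:
    """Classical (Torgerson) MDS: similarity matrix -> low-dimensional coordinates.

    Step 1: convert similarities to dissimilarities (subtract from the maximum).
    Step 2: double-center the squared dissimilarities to obtain B.
    Step 3: eigendecompose B and keep the leading coordinates X = U V^(1/2).
    """
    D = similarity.max() - similarity                   # step 1
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J                         # step 2 (double centering)
    eigvals, eigvecs = np.linalg.eigh(B)                # step 3
    order = np.argsort(eigvals)[::-1][:dims]            # largest eigenvalues first
    scale = np.sqrt(np.clip(eigvals[order], 0, None))
    return eigvecs[:, order] * scale                    # X = U V^(1/2)

if __name__ == "__main__":
    # Toy 4-site similarity matrix (symmetric, illustrative values).
    S = np.array([[1.0, 0.8, 0.2, 0.1],
                  [0.8, 1.0, 0.3, 0.2],
                  [0.2, 0.3, 1.0, 0.7],
                  [0.1, 0.2, 0.7, 1.0]])
    print(classical_mds(S).round(3))    # 2-D coordinates for the knowledge map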
Fig. 5. Knowledge map browsing method. (The closeness of any two points reflects their similarity; navigation buttons allow browsing in four directions and zooming buttons allow zooming in or out; users can control the number of Web sites displayed; a bottom panel shows details of the selected Web site: title, URL, and summary.)
The Web community display performed significantly better than the result list in terms of effectiveness, efficiency, and usability.

4.3 Case 3: Business stakeholder analysis using Web classification techniques

Business stakeholder analysis has been complicated by the use of the Web, which allows a much greater variety and number of stakeholders than in the past to establish relationships with a company via hyperlink connections, distributed databases, online forums, intranets and extranets, and other Web-enabled technologies. These stakeholders may include individuals, international businesses, local and regional suppliers, government agencies, and global organizations. As a result, business managers and analysts find it increasingly difficult to understand and address the needs of stakeholders. Although traditional stakeholder analysis approaches (e.g., Donaldson and Preston, 1995; Freeman, 1984; Mitchell et al., 1997) offer theoretical foundations for understanding business relationships, they are largely manual and do not scale to the rapid growth and change of the Web.

In this case study, we applied our framework to business stakeholder analysis on the Web, with the goal of helping business analysts better understand stakeholder relationships on the Web. We collected Web pages of business stakeholders of the top 100 knowledge management companies identified by the Knowledge Management World (KMWorld.com) Web site (McKellar, 2003), a major Web portal providing news, publications, online resources, and solutions to more than 51,000 subscribers in the knowledge management systems market. To identify the stakeholders, we used the back-link search function of the Google search engine (http://www.google.com/) to find Web pages having hyperlinks pointing to the companies' Web sites. For each of the 100 KM (host) companies, we considered only the first 100 results returned by Google in order to limit the scope of analysis. After filtering self-links and duplicate results, we obtained 3,713 Web pages, or about 37 stakeholders per host company.

A BI expert with a doctorate and years of industry and academic experience helped us develop a domain lexicon of 329 key terms for distinguishing among 11 business stakeholder types, which were modified from Reid (2003) and are shown in Table 3. These key terms were selected from the 414 collected Web pages of 9 companies randomly chosen from the 100 KM companies for training the algorithms used in automatic classification of business stakeholder pages. Next, the expert manually classified each of the 414 stakeholder pages of the 9 selected companies into one of the 11 stakeholder types (listed in Table 3). These tagged pages served as the training set of stakeholders.
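The Google back-link retrieval itself is not reproduced here, but the filtering step just described, keeping at most the first 100 back-link results per host company and dropping self-links and duplicates, can be sketched as follows; the function name and sample URLs are hypothetical.

from urllib.parse import urlparse

def filter_stakeholder_links(host_domain: str, backlink_urls: list[str],
                             limit: int = 100) -> list[str]:
    """Keep at most `limit` back-link URLs, dropping self-links
    (same domain as the host company) and duplicate URLs."""
    seen, kept = set(), []
    for url in backlink_urls[:limit]:
        domain = urlparse(url).netloc.lower()
        if domain.endswith(host_domain):        # self-link: skip
            continue
        if url in seen:                         # duplicate: skip
            continue
        seen.add(url)
        kept.append(url)
    return kept

if __name__ == "__main__":
    links = ["http://www.example-host.com/about",
             "http://partner.example.org/casestudy",
             "http://partner.example.org/casestudy",
             "http://news.example.net/review"]
    print(filter_stakeholder_links("example-host.com", links))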
Table 3
Stakeholder types used in manual tagging of Web pages

Group                                  Description                                                       Stakeholder type
Transactional (internal environment)   Actor that the enterprise interacts with and influences           Partner/supplier/sponsor; Customer; Employee; Shareholder
Contextual (external environment)      Distant actor that the enterprise has no power or influence over  Government; Competitor; Community (special interest groups); Education/research institution; Media/reviewer; Portal creator/owner
Other                                  Cannot identify a stakeholder type                                Unknown
To prepare for automatic classification, we considered two sets of features of business stakeholders' Web pages: structural content features and textual content features. Structural content features capture occurrences of lexicon terms in different parts of a Web page. To identify such occurrences, an HTML parser automatically extracted all one-, two-, and three-word terms from the pages' full-text content. A list of 462 stop words was used to remove non-semantic-bearing words (e.g., "the," "a," "of," "and"). Using HTML tags, the parser identified the positions in which the terms appeared on the page; we considered terms appearing in the page title, in the extended anchor text (the anchor text plus the 50 words surrounding it), and in the page full text. Textual content features are the frequencies of occurrence of important one-, two-, and three-word terms appearing in the business stakeholder pages. To account for terms appearing in multiple categories of stakeholders, we modified the thresholding method used in Glover et al. (2002) to select important terms from the large number of extracted terms; Figure 6 shows the formulae and procedure used in the method. Terms with high feature ratios were selected as features for classification.

Fig. 6. Formulae and procedure in the thresholding method.
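The thresholding formulae of Fig. 6 are not legible in this reproduction, so the sketch below only illustrates the two kinds of features: binary structural indicators for lexicon terms appearing in the title, extended anchor text, and full text, and a simple "feature ratio" that stands in for the modified method of Glover et al. (2002). The ratio used here is an assumption for illustration, not the chapter's formula.

def structural_features(lexicon: list[str], title: str,
                        anchor_text: str, full_text: str) -> list[int]:
    """Binary indicators: for each lexicon term, does it occur in the title,
    the extended anchor text, and the page full text (three bits per term)?"""
    feats = []
    for term in lexicon:
        for section in (title, anchor_text, full_text):
            feats.append(1 if term in section.lower() else 0)
    return feats

def feature_ratio(term: str, class_pages: list[str], all_pages: list[str]) -> float:
    """Illustrative ratio: how concentrated a term is in one stakeholder
    class relative to the whole collection."""
    in_class = sum(page.lower().count(term) for page in class_pages)
    overall = sum(page.lower().count(term) for page in all_pages)
    return in_class / overall if overall else 0.0

def select_terms(candidates: list[str], class_pages: list[str],
                 all_pages: list[str], threshold: float = 0.5) -> list[str]:
    """Keep candidate terms whose feature ratio exceeds the threshold."""
    return [t for t in candidates
            if feature_ratio(t, class_pages, all_pages) > threshold]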
Two machine learning algorithms, a feedforward/backpropagation neural network and support vector machines (Cristianini and Shawe-Taylor, 2000), were used to classify business stakeholder pages automatically into their respective stakeholder types. A neural network, a computing system modeled after the human brain's mesh-like network of interconnected neurons, has been shown to be robust in classification and has wide applicability in different domains (Lippman, 1987), including Web-page filtering (Lee et al., 2002). Our neural network has an input layer with 2,284 nodes (987 structural content features and 1,297 textual content features), a hidden layer with 1,148 nodes, and an output layer with 11 nodes (the 11 stakeholder classes). A single hidden layer was selected because it has been found able to model complex systems with the desired accuracy (Patuwo et al., 1993). To achieve high accuracy while avoiding over-fitting, we set the number of hidden nodes to the average of the number of input nodes and the number of output nodes (Rich and Knight, 1991). Training of the neural network stopped after it had run for one hundred epochs, a limit chosen based on empirical testing.

Support vector machines (SVM), a machine learning algorithm that minimizes structural risk in classification, has been successfully applied to text categorization (Joachims, 1998) and Web-page classification (Glover et al., 2002). We used the decomposition method for the bound-constrained SVM formulation proposed in Hsu and Lin (2002b) to perform multi-class classification, which has also been studied in Hsu and Lin (2002a).

The structural and textual content features selected in the previous step were used as input to the algorithms. Each stakeholder page was represented as a feature vector containing 987 structural content features (binary variables indicating whether certain lexicon terms appeared in the page title, extended anchor text, and full text) and 1,334 textual content features (frequencies of occurrence of the selected features: 663 words and 671 two- or three-word phrases). We used 283 pages of the 9 companies' stakeholders to train the algorithms. The model and weights obtained from training were used to predict the types of business stakeholder pages of 10 testing companies randomly selected from the 100 KM companies. In this process, we assumed that meaningful classification could be obtained from business stakeholders who provided explicit information on their Web pages about their relationships with the host companies. Figure 7 shows the front page of our system, called Business Stakeholder Analyzer, and a sample application in which a user explores the stakeholders of Siebel, a major company specializing in customer relationship management.
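The chapter's classifiers were a feedforward/backpropagation neural network and the bound-constrained SVM decomposition of Hsu and Lin (2002b). As a rough stand-in for the SVM step, the sketch below trains scikit-learn's SVC on toy feature vectors that mimic the combined structural and textual representation; all feature values and labels are illustrative only.

import numpy as np
from sklearn.svm import SVC

# Toy feature vectors: each row concatenates binary structural indicators
# with term-frequency textual features (values are illustrative only).
X_train = np.array([[1, 0, 1, 3, 0, 2],
                    [0, 0, 1, 0, 4, 1],
                    [1, 1, 1, 5, 0, 0],
                    [0, 0, 0, 0, 2, 3]])
# Toy labels drawn from the stakeholder types of Table 3.
y_train = ["partner/supplier/sponsor", "media/reviewer",
           "partner/supplier/sponsor", "customer"]

# A linear-kernel SVM as a stand-in for the multi-class formulation used in
# the chapter; scikit-learn handles the multi-class decomposition internally.
clf = SVC(kernel="linear", C=1.0)
clf.fit(X_train, y_train)

X_new = np.array([[1, 0, 1, 4, 0, 1]])
print(clf.predict(X_new))   # predicted stakeholder type for an unseen page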
Fig. 7. Business Stakeholder Analyzer. ((a) Front page: subjects could browse definitions of stakeholder types, examples of stakeholder pages, and the three business stakeholder groups. (b) Business stakeholders of Siebel.)
The stakeholders were identified and classified automatically by the aforementioned techniques. We conducted an experiment with 36 student subjects to evaluate the performance of applying the framework to stakeholder classification, compared with human users and with random classification (the baseline). Detailed experimental results are presented in Chung et al. (forthcoming). In summary, we found that any combination of features and techniques in automatic stakeholder classification significantly outperformed the baseline method (measured by classification accuracy). Overall, humans were more effective than the neural network or SVM, but the algorithms outperformed human judgment in classifying certain stakeholder types, such as partners/sponsors/suppliers and community. Subjects perceived the automatic classification favorably, viewing it as helping business analysts identify and classify business stakeholders.
5. Summary and future directions
In this chapter, we have reviewed related work on BI systems and on mining the Web for BI, described a framework for discovering BI on the Web, and presented three case studies applying the framework. The framework was found to help meet analysis needs that would otherwise require substantial human effort, including summarizing, classifying, visualizing, exploring the information landscape, and extracting relationships. Each case study demonstrates how certain analysis needs in discovering BI can be fulfilled, thereby freeing humans to perform other value-added work. Given the scarcity of prior research (Negash, 2004), this chapter presents new advances in the BI field and provides practical insights for BI practitioners and system developers.

Several directions are worth exploring for researchers and practitioners. Developing Web-based systems to support the discovery of BI in languages other than English (or Chinese, described in Section 4.1) is promising given the multinational nature of the Web in supporting business operations. For example, Spanish and Arabic are major languages in South America and the Middle East, respectively; as commercial activity in these regions has grown significantly in recent years, systems that support effective and efficient discovery of BI on the Web will help managers and analysts understand their business environments. In addition, new visualization metaphors for BI exploration would further assist effective and efficient discovery of BI on the Web; metaphors that exploit the nature of the Web as well as features of a specific domain may bring a more satisfying and pleasurable browsing experience to users. Type-specific stakeholder analysis can be performed to further
support Web-based stakeholder analysis. Multinational business partnerships and cooperation can also be analyzed through explicit information posted on the Web.
References

ACNielsen (2002). Nielsen//NetRatings reports a record half billion people worldwide now have home Internet access [Online]. Available at http://asiapacific.acnielsen.com.au/news.asp?newsID=74
Adomavicius, G., A. Tuzhilin (2001). Using data mining methods to build customer profiles. IEEE Computer 34(2), 74–82.
Bowman, C.M., P.B. Danzig, U. Manber, F. Schwartz (1994). Scalable Internet resource discovery: research problems and approaches. Communications of the ACM 37(8), 98–107.
Brin, S., L. Page (1998). The anatomy of a large-scale hypertextual Web search engine, in: Proceedings of the 7th International WWW Conference, Brisbane, Australia.
Broder, A., R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopolan, R. Stata, A. Tomkins, J.L. Wiener (2000). Graph structure in the Web, in: Proceedings of the 9th International World Wide Web Conference, Elsevier Science, Amsterdam, The Netherlands, pp. 309–320.
Carbonell, J., J. Goldstein (1998). The use of MMR: diversity-based reranking for reordering documents and producing summaries, in: Proceedings of the 21st Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, ACM Press, Melbourne, Australia, pp. 335–336.
Carvalho, R., M. Ferreira (2001). Using information technology to support knowledge conversion processes. Information Research 7(1).
Chakrabarti, S., B. Dom, R.S. Kumar, P. Raghavan, S. Rajagopalan, A. Tomkins, D. Gibson, J. Kleinberg (1999). Mining the web's link structure. IEEE Computer 32(8), 60–67.
Chen, H., M. Chau (2004). Web mining: machine learning for web applications, in: M.E. Williams (ed.), Annual Review of Information Science and Technology (ARIST). Information Today, Inc., Medford, NJ, pp. 289–329.
Chen, H., H. Fan, M. Chau, D. Zeng (2001). MetaSpider: meta-searching and categorization on the web. Journal of the American Society for Information Science and Technology 52(13), 1134–1147.
Chen, H., A. Houston, R. Sewell, B. Schatz (1998). Internet browsing and searching: user evaluation of category map and concept space techniques. Journal of the American Society for Information Science, Special Issue on AI Techniques for Emerging Information Systems Applications 49(7), 582–603.
Chen, H., K.J. Lynch (1992). Automatic construction of networks of concepts characterizing document databases. IEEE Transactions on Systems, Man, and Cybernetics 22(5), 885–902.
Chen, H., C. Schuffels, R. Orwig (1996). Internet categorization and search: a self-organizing approach. Journal of Visual Communication and Image Representation 7(1), 88–102.
Chen, H.M., M.D. Cooper (2001). Using clustering techniques to detect usage patterns in a Web-based information system. Journal of the American Society for Information Science and Technology 52(11), 888–904.
Choo, C.W. (1998). The Knowing Organization. Oxford University Press, Oxford.
Chung, W., H. Chen, E. Reid (forthcoming). Business stakeholder analyzer: an experiment of classifying stakeholders on the web. Journal of the American Society for Information Science and Technology (accepted; published online, print version forthcoming).
Chung, W., H. Chen, J.F. Nunamaker (2005). A visual framework for knowledge discovery on the Web: an empirical study on business intelligence exploration. Journal of Management Information Systems 21(4), 57–84.
Chung, W., Y. Zhang, Z. Huang, G. Wang, T.-H. Ong, H. Chen (2004). Internet searching and browsing in a multilingual world: an experiment on the Chinese business intelligence portal (CBizPort). Journal of the American Society for Information Science and Technology 55(9), 818–831.
Church, K., P. Hanks (1989). Word association norms, mutual information, and lexicography, in: Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics, Vancouver, BC, Canada, pp. 76–83.
CNNIC (2002). Analysis report on the growth of the Internet in China, China Internet Network Information Center [Online]. Available at http://www.cnnic.net.cn/develst/2002-7e/6.shtml
Cristianini, N., J. Shawe-Taylor (2000). An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, Cambridge, UK.
Cronin, B. (2000). Strategic intelligence and networked business. Journal of Information Science 26, 133–138.
Donaldson, T., L.E. Preston (1995). The stakeholder theory of the corporation: concepts, evidence and implications. Academy of Management Review 20(1), 65–91.
Etzioni, O. (1996). The world-wide web: quagmire or gold mine? Communications of the ACM 39(11), 65–68.
Freeman, E. (1984). Strategic Management: A Stakeholder Approach. Pitman, Marshfield, MA.
Fuld, L.M., K. Sawka, J. Carmichael, J. Kim, K. Hynes (2002). Intelligence Software Report 2002. Fuld & Company Inc., Cambridge, MA.
Fuld, L.M., A. Singh, K. Rothwell, J. Kim (2003). Intelligence Software Report 2003: Leveraging the Web. Fuld & Company Inc., Cambridge, MA.
Futures-Group (1998). Ostriches & Eagles 1997, in: The Futures Group Articles, Glastonbury, CT.
Global Reach (2004a). Evolution of non-English online populations [Online]. Available at http://global-reach.biz/globstats/evol.html
Global Reach (2004b). Global Internet Statistics (by Language) [Online]. Available at http://www.glreach.com/globstats/
Glover, E.J., K. Tsioutsiouliklis, S. Lawrence, D.M. Pennock, G.W. Flake (2002). Using Web structure for classifying and describing Web pages, in: Proceedings of the 11th International World Wide Web Conference, Honolulu, Hawaii.
Hearst, M.A. (1994). Multi-paragraph segmentation of expository text, in: Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, Las Cruces, NM, pp. 9–16.
Hearst, M.A. (1999). Untangling text data mining, in: Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, The Association for Computational Linguistics, College Park, MD.
Henzinger, M.R., S. Lawrence (2004). Extracting knowledge from the world wide web, in: Proceedings of the National Academy of Sciences of the United States of America.
Hsu, C.W., C.J. Lin (2002a). A comparison of methods for multi-class support vector machines. IEEE Transactions on Neural Networks 13(2), 415–425.
Hsu, C.W., C.J. Lin (2002b). A simple decomposition method for support vector machines. Machine Learning 46(1–3), 291–314.
Hurst, M. (2001). Layout and language: challenges for table understanding on the Web, in: Proceedings of the 1st International Workshop on Web Document Analysis, Seattle, WA, pp. 27–30.
Jain, A.K., R.C. Dubes (1988). Algorithms for Clustering Data. Prentice-Hall, Englewood Cliffs, NJ.
Joachims, T. (1998). Text categorization with support vector machines: learning with many relevant features, in: Proceedings of the Tenth European Conference on Machine Learning, Springer-Verlag, Chemnitz, Germany, pp. 137–142.
Kleinberg, J. (1999). Authoritative sources in a hyperlinked environment. Journal of the Association for Computing Machinery 46(5), 604–632.
Kosala, R., H. Blockeel (2000). Web mining research: a survey. ACM SIGKDD Explorations 2(1), 1–15.
Kwon, O.-W., J.-H. Lee (2003). Text categorization based on k-nearest neighbor approach for web site classification. Information Processing & Management 39(1), 25–44.
Lee, P.Y., S.C. Hui, A. Cheuk, M. Fong (2002). Neural networks for web content filtering. IEEE Intelligent Systems 17(5), 48–57.
Lempel, R., S. Moran (2001). SALSA: the stochastic approach for link-structure analysis. ACM Transactions on Information Systems 19(2), 131–160.
Lippman, R.P. (1987). Introduction to computing with neural networks. IEEE ASSP Magazine 4(2), 4–22.
Lyman, P., H. Varian (2003). How much information? University of California, Berkeley [Online]. Available at http://www.sims.berkeley.edu:8000/research/projects/how-much-info-2003/
McDonald, D., H. Chen (2002). Using sentence selection heuristics to rank text segments in TXTRACTOR, in: Proceedings of the Second ACM/IEEE-CS Joint Conference on Digital Libraries, ACM/IEEE-CS, Portland, OR, pp. 28–35.
McKellar, H. (2003). KMWorld's 100 Companies that Matter in Knowledge Management 2003, KM World [Online]. Available at http://www.kmworld.com/100.cfm
Menczer, F. (2004). Evolution of document networks, in: Proceedings of the National Academy of Sciences of the United States of America.
Mendelzon, A.O., D. Rafiei (2000). What do the neighbours think? Computing web page reputations. IEEE Data Engineering Bulletin 23(3), 9–16.
Mitchell, R.K., B.R. Agle, D.J. Wood (1997). Toward a theory of stakeholder identification and salience: defining the principle of who and what really counts. Academy of Management Review 22(4), 853–886.
Mladenic, D. (1998). Turning Yahoo into an automatic Web page classifier, in: Proceedings of the 13th European Conference on Artificial Intelligence, Brighton, UK, pp. 473–474.
Nasukawa, T., T. Nagano (2001). Text analysis and knowledge mining system. IBM Systems Journal 40(4), 967–984.
Negash, S. (2004). Business intelligence. Communications of the Association for Information Systems 13, 177–195.
Nolan, J. (1999). Confidential: Uncover Your Competitor's Secrets Legally and Quickly and Protect Your Own. Harper Business, New York.
Ong, T.-H., H. Chen (1999). Updateable PAT-array approach for Chinese key phrase extraction using mutual information: a linguistic foundation for knowledge management, in: Proceedings of the Second Asian Digital Library Conference, Taipei, Taiwan, pp. 63–84.
Patuwo, E., M.S. Hu, M.S. Hung (1993). Two-group classification using neural networks. Decision Sciences 24(4), 825–845.
Pazzani, M. (1999). A framework for collaborative, content-based and demographic filtering. Artificial Intelligence Review 13(5), 393–408.
Reid, E.O.F. (2003). Identifying a company's non-customer online communities: a proto-typology, in: Proceedings of the 36th Hawaii International Conference on System Sciences (HICSS-36), IEEE Computer Society, Island of Hawaii, HI.
Rich, E., K. Knight (1991). Learning in neural networks, in: Artificial Intelligence (2nd ed.). McGraw-Hill, Inc., New York, pp. 500–507.
Roussinov, D., H. Chen (2001). Information navigation on the Web by clustering and summarizing query results. Information Processing and Management 37(6), 789–816.
Schatz, B. (2002). The Interspace: concept navigation across distributed communities. IEEE Computer 35(1), 54–62.
Shneiderman, B. (1996). The eyes have it: a task by data type taxonomy for information visualizations, in: Proceedings of Visual Languages, IEEE Computer Society, Boulder, CO, pp. 336–343.
Soper, D.S. (2005). A framework for automated Web business intelligence systems, in: Proceedings of the 38th Hawaii International Conference on System Sciences, IEEE Computer Society, Island of Hawaii, HI.
Spence, R. (2001). Information Visualization. ACM Press, New York, NY.
Taylor, R.S. (1986). Value-added Processes in Information Systems. Ablex, Norwood, NJ.
Trybula, W.J. (1999). Text mining, in: M.E. Williams (ed.), Annual Review of Information Science and Technology. Information Today, Inc., Medford, NJ, pp. 385–419.
van Rijsbergen, C.J. (1979). Information Retrieval, 2nd ed. Butterworths, London.
Westney, E., S. Ghoshal (1994). Building a competitor intelligence organization: adding value in an information function, in: T.J. Allen, M.S. Scott (eds.), Information Technology and the Corporation in the 1990s: Research Studies. Oxford University Press, New York, pp. 430–453.
Yang, C.C., H. Chen, K. Hong (2003). Visualization of large category map for Internet browsing. Decision Support Systems 35(1), 89–102.
Young, F.W. (1987). Multidimensional Scaling: History, Theory, and Applications. Lawrence Erlbaum Associates, Hillsdale, NJ.
Zamir, O., O. Etzioni (1999). Grouper: a dynamic clustering interface to Web search results, in: Proceedings of the 8th World Wide Web Conference, Toronto, Canada.
Zhong, N., J. Liu, Y. Yao (eds.) (2003). Web Intelligence. Springer, New York, NY.