Community-Built Databases
Eric Pardede Editor
Community-Built Databases: Research and Development
Editor
Dr. Eric Pardede
La Trobe University
Dept. Computer Science & Computer Engineering
Bundoora, VIC, Australia
[email protected]
ACM Computing Classification (1998): H.3, H.4, H.5.3, K.4
ISBN 978-3-642-19046-9
e-ISBN 978-3-642-19047-6
DOI 10.1007/978-3-642-19047-6
Springer Heidelberg Dordrecht London New York
Library of Congress Control Number: 2011928234
© Springer-Verlag Berlin Heidelberg 2011
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law.
The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
Cover design: deblik
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)
Preface
Communities have built collections of information in a collaborative manner for centuries. Around 250 years ago, more than 140 people wrote l'Encyclopédie in 28 volumes with 70,000 articles. More recently, Wikipedia has demonstrated how collaborative effort can be a powerful method of building massive data storage. Wikipedia has become a key part of many corporations' knowledge management systems for decision making. Wikipedia is only one example brought about by Web 2.0 with the goal of creating communities of users. We have witnessed global member-built media repositories such as Flickr and YouTube, while social networking applications such as Friendster, Facebook, and LinkedIn also share information between members in a large unstructured information pool.

While Web 2.0 has many benefits, there are many more opportunities to be unleashed. Imagine if we could use information gathered by many people for critical decision making. There is great potential for creating and sharing more structured data through the web. To make it more regulated and more realistic, the data will be limited to the community scale rather than the global scale, for example, an academic research community or a community of doctors in a particular region. Each community can create a large database to which each member can contribute information freely and whose information each member can use with higher levels of confidence.

This book addresses the need for comprehensive research sources in community-built databases. It does not focus on only one database area or one domain; rather, the chapters discuss various aspects of research in and the development of community-built databases, providing information on advanced community-built database research and indicating which parts of the research can benefit from further investigation. It is expected that this book will provide comprehensive reading material for a wide-ranging audience and has the potential to encourage readers to think further and investigate areas that are novel to them.

The first chapter of the book reemphasizes the need for community-built databases. Ayanso et al. present the use of Web 2.0 technologies to enhance information creation, dissemination, and collaboration among knowledge communities. The concept of the social web is explained and several real cases are provided, including
cases in health care, research communities, and software development. The challenges of the social web, as well as the collaborative knowledge repository, are detailed. This introductory chapter opens up further research topics, some of which are provided in the next chapters.

Cerquitelli et al. highlight the fact that current user-generated content includes not only textual information, but also multimodal material. The authors review different collections of user-contributed media by using popular online communities such as YouTube, Flickr, and Wikipedia. Several issues related to the maintenance of different types of data are provided. Finally, the chapter gives insight into possible research issues for community-contributed media collections.

Part II commences with a chapter on Social Network Analysis (SNA) in Community-Built Databases by Snášel et al. The authors describe various SNA techniques that can be used to determine the quality and success of a community-built database. Fundamental aspects of social networks are identified, including the representation, the visualization, and the measures for SNA. Following this, the authors conduct several experiments using real data that demonstrate how knowledge on social aspects can be used to leverage functionalities of community-built databases, such as for recommender systems.

Online recommender systems have attracted wide interest from web community researchers in the last few years. In the chapter by He and Chu, a novel design model that can be used to extend the functionalities and the performance of current recommender systems is proposed. The authors propose the Social-Network-based Recommender System (SNRS) as the architecture of their model, which utilizes information from social networks, including the users' preferences, item likability, and homophily, for decision recommendation. Furthermore, the performance of SNRS is extended through semantic filtering applied on the source social network data. In addition, trust issues in SNRS are also addressed. The recommender system is developed and experiments are conducted in a social network formed by a large group of graduate students.

In the next chapter, Papadopoulos et al. investigate the use of collaborative tagging systems for detecting similarities in online communities. The authors provide relevant studies in a collaborative tagging system along with community detection and its applications. On the basis of their studies, the authors propose methods to identify tag communities, which are groups of tags that are either semantically close to each other or share some usage context. The experiments are conducted using real-world tagging systems, namely, BibSonomy, Flickr, and Delicious. The performance of the proposed method is compared against an established method, the former showing superior results.

In the next chapter, Tamine et al. investigate the use of social context to model information retrieval and collaboration in a scientific research community. The shift from personal context to social context in information retrieval is explained as the motivation of the work. The formalization of social information retrieval that has quantitative and qualitative model components is provided in the chapter. For experiments, the authors use an online scientific documents dataset with citations and coauthor analysis as the social context.
The next four chapters cover the issue of community-built database design and storage. We start with the chapter by Badia, which aims to accommodate social interaction on existing relational database technologies. The author briefly describes basic terminologies of current database technology and explains why they cannot be used efficiently to support user-created content and user interaction activities. To address this, the chapter proposes the concept of private views, a dynamic layer between the database and the users. Issues in the organization, querying, sharing, and maintenance of private views are explained in detail. Comparison with existing work demonstrates that the proposal is better able to capture social interaction using existing database technology.

In the next chapter, Różewski and Kusztina identify an ontological model to represent the knowledge network inside a collaborative system, such as community-built databases. The proposed ontological model incorporates not only technical aspects of the collaborative system, but also the psychological, cognitive, and social aspects. The ontological model is implemented in an e-learning case study using the Wiki mechanism for the knowledge repository.

A popular design model for collaborative communities is the graph database, discussed in the next chapter by Soussi et al. The dynamic nature and rich representation capabilities enable a graph database to be used for unstructured and semistructured data, which commonly appear in collaborative communities. As a preliminary, the authors describe current graph database models and query languages. As their research contribution, the chapter proposes methodologies to extract social network information from relational databases into graph databases.

In the last chapter of this part, Uden discusses semantics in a collaborative data repository such as Wiki. The use of semantic technologies enables the formalization of the interrelationships of concepts and the automation of the interpretation process by machines. The author identifies the current state of semantics utilization in Wikis, such as Wikipedia. Some examples of current applications are explained in detail and the author concludes the chapter by envisioning the future of the Semantic Wiki.

The next part, on the future of community-built database research and development, commences with a chapter by Cortizo et al., which examines the potential of mobile devices, such as the mobile phone, to further facilitate the use and development of community-built databases. The current and future characteristics of mobile devices in terms of physical hardware, operating and application software, and supported services are discussed. The authors use these characteristics to envision their idea of placing mobile phones at the heart of community-built databases. Not only can mobile phones be used as an interface to separate databases, they can also be used as mobile database tools. The chapter concludes with several future scenarios that clearly demonstrate the potential and the importance of mobile phones to future research in and the development of community-built databases.

In the next chapter, Vodanovich et al. explore the design of community-built databases for a targeted audience, the youth. This work is motivated by the need to assist the well-being of youths through various means, including the use of
technology. The authors first introduce the concept of youth well-being. Next, they present how various youth community-built databases (YCD) have emerged recently. Their analyses of the concepts of youth well-being and the existing databases enable the authors to identify problems and issues in YCD design and development. A conceptual design and framework for YCD is proposed, embodying four dimensions, namely web interaction, social collaboration, semantic integration, and community governance.

With another target audience in mind, the next chapter by Mesiti et al. investigates an environment that can assist people with special needs to collaborate in and produce data for community-built databases. The authors summarize the latest software targeted at people with special needs. They also review software accessibility and usability models for this target audience. They contribute a new usability model for a collaborative environment catering to people with special needs and, as an example, provide a wiki-based e-learning environment called VisualPedia.

We conclude this book with a chapter on trust in collaborative information systems by Javanmardi and Lopes. The previous chapters have, we hope, demonstrated the need for and the benefits of collaborative work in community-built databases. In a collaborative system, all users must have confidence to contribute and use the information. This chapter discusses a reputation model as the means to ensure information quality in collaborative information systems. To achieve this, the authors propose models that associate a user's reputation with the quality of the content that he or she provides. As proof of concept, an empirical study is conducted using Wikipedia.

These chapters by no means encompass all the issues in community-built database research and development. However, they demonstrate the vastness of this research area. It is inevitable that sooner or later, we will need to collaborate with people from different domains to ensure this research area grows. Finally, the editor hopes this book will contribute to research in and the development of community-built databases and collaborative information systems, both by academia and by industry practitioners.

Melbourne, December 2010
Eric Pardede
Contents
Part I
Collaborative Knowledge Collection through Community-Built Databases
1 Social Web: Web 2.0 Technologies to Enhance Knowledge Communities ..... 3
Anteneh Ayanso, Tejaswini Herath, and Kaveepan Lertwachara

2 Community-Contributed Media Collections: Knowledge at Our Fingertips ..... 21
Tania Cerquitelli, Alessandro Fiori, and Alberto Grand

Part II Social Analysis in Community-Built Databases

3 Social Network Analysis in Community-Built Databases ..... 51
Václav Snášel, Zdeněk Horák, and Miloš Kudělka

4 Design Considerations for a Social Network-Based Recommendation System (SNRS) ..... 73
Jianming He and Wesley W. Chu

5 Community Detection in Collaborative Tagging Systems ..... 107
Symeon Papadopoulos, Athena Vakali, and Yiannis Kompatsiaris

6 On Using Social Context to Model Information Retrieval and Collaboration in Scientific Research Community ..... 133
Lynda Tamine, Lamjed Ben Jabeur, and Wahiba Bahsoun

Part III Community-Built Databases Storage and Modelling

7 Social Interaction in Databases ..... 159
Antonio Badia

8 Motivation Model in Community-Built System ..... 183
Przemysław Różewski and Emma Kusztina

9 Graph Database for Collaborative Communities ..... 205
Rania Soussi, Marie-Aude Aufaure, and Hajer Baazaoui

10 Semantics in Wiki ..... 235
Lorna Uden

Part IV Future of Community-Built Databases Research and Development

11 On the Future of Mobile Phones as the Heart of Community-Built Databases ..... 261
Jose C. Cortizo, Luis I. Diaz, Francisco M. Carrero, Adrian Yanes, and Borja Monsalve

12 Designed for Good: Community Well-Being Oriented Online Databases for Youth ..... 289
S. Vodanovich, M. Rohde, and D. Sundaram

13 Collaborative Environments: Accessibility and Usability for Users with Special Needs ..... 319
M. Mesiti, M. Ribaudo, S. Valtolina, B.R. Barricelli, P. Boccacci, and S. Dini

14 Trust in Online Collaborative IS ..... 341
Sara Javanmardi and Cristina Lopes

Index ..... 371
Contributors
Marie-Aude Aufaure Applied Mathematics and Systems Laboratory (MAS), SAP Business Objects Academic Chair in Business Intelligence, Ecole Centrale Paris, Chatenay-Malabry, France
Anteneh Ayanso Department of Finance, Operations, and Information Systems, Brock University, 500 Glenridge Avenue, St. Catharines, ON, Canada L2S3A1, [email protected]
Hajer Baazaoui Riadi-GDL Laboratory, ENSI – Manouba University, Tunis, Tunisia
Antonio Badia Computer Engineering and Computer Science Department, University of Louisville, Louisville, KY, USA, [email protected]
Wahiba Bahsoun Institut de Recherche en Informatique de Toulouse, Toulouse, France
B.R. Barricelli DiCo, University of Milano, Milano, Italy, [email protected]
P. Boccacci DISI, University of Genova, Genova, Italy, [email protected]
Francisco M. Carrero BrainSins, [email protected]
Tania Cerquitelli Politecnico di Torino, Corso Duca degli Abruzzi 24, Torino, Italy, [email protected]
Wesley W. Chu Computer Science Department, University of California, Los Angeles, CA 90095, USA, [email protected]
Jose C. Cortizo Universidad Europea de Madrid, Villaviciosa de Odón, Spain, [email protected]
Luis I. Diaz BrainSins, [email protected]
S. Dini Istituto David Chiossone – ONLUS, Genova, Italy, [email protected]
Alessandro Fiori Politecnico di Torino, Corso Duca degli Abruzzi 24, Torino, Italy, [email protected]
Alberto Grand Politecnico di Torino, Corso Duca degli Abruzzi 24, Torino, Italy, [email protected]
Jianming He Computer Science Department, University of California, Los Angeles, CA 90095, USA, [email protected]
Tejaswini Herath Department of Finance, Operations, and Information Systems, Brock University, 500 Glenridge Avenue, St. Catharines, ON, Canada L2S3A1
Zdeněk Horák VŠB-Technical University of Ostrava, 17. listopadu 15, 70833 Ostrava, Czech Republic, [email protected]
Lamjed Ben Jabeur Institut de Recherche en Informatique de Toulouse, Toulouse, France
Sara Javanmardi University of California, Suite 5069, Bren Hall, Irvine, CA 92697, USA, [email protected]
Yiannis Kompatsiaris Informatics and Telematics Institute, CERTH, 57001 Thermi, Greece, [email protected]
Miloš Kudělka VŠB-Technical University of Ostrava, 17. listopadu 15, 70833 Ostrava, Czech Republic, [email protected]
Emma Kusztina Faculty of Computer Science and Information Systems, West Pomeranian University of Technology in Szczecin, ul. Żołnierska 49, 71-210 Szczecin, Poland
Kaveepan Lertwachara Orfalea College of Business, California Polytechnic State University, 1 Grand Avenue, San Luis Obispo, CA 93407, USA
Cristina Lopes University of California, Suite 5069, Bren Hall, Irvine, CA 92697, USA, [email protected]
M. Mesiti DiCo, University of Milano, Milano, Italy, [email protected]
Borja Monsalve Wipley, [email protected]
Symeon Papadopoulos Informatics and Telematics Institute, CERTH, 57001 Thermi, Greece; Aristotle University, 54124 Thessaloniki, Greece, [email protected]
M. Ribaudo DISI, University of Genova, Genova, Italy, [email protected]
M. Rohde The University of Auckland, New Zealand
Przemysław Różewski Faculty of Computer Science and Information Systems, West Pomeranian University of Technology in Szczecin, ul. Żołnierska 49, 71-210 Szczecin, Poland, [email protected]
Václav Snášel VŠB-Technical University of Ostrava, 17. listopadu 15, 70833 Ostrava, Czech Republic, [email protected]
Rania Soussi Applied Mathematics and Systems Laboratory (MAS), SAP Business Objects Academic Chair in Business Intelligence, Ecole Centrale Paris, Chatenay-Malabry, France; Riadi-GDL Laboratory, ENSI – Manouba University, Tunis, Tunisia
D. Sundaram The University of Auckland, New Zealand
Lynda Tamine Institut de Recherche en Informatique de Toulouse, Toulouse, France, [email protected]
Lorna Uden Faculty of Computing, Engineering and Technology, Staffordshire University, Beaconside, Stafford ST18 0AD, UK, [email protected]
Athena Vakali Aristotle University, 54124 Thessaloniki, Greece, [email protected]
S. Valtolina DiCo, University of Milano, Milano, Italy, [email protected]
S. Vodanovich The University of Auckland, New Zealand, [email protected]
Adrian Yanes Tieto Corporation, [email protected]
Part I
Collaborative Knowledge Collection through Community-Built Databases
Chapter 1
Social Web: Web 2.0 Technologies to Enhance Knowledge Communities
Anteneh Ayanso, Tejaswini Herath, and Kaveepan Lertwachara
Abstract In recent years, online social networks have grown immensely and become widely popular among Internet users. In general, a social network is a social structure consisting of nodes (which are generally individuals or organizations) that are connected by one or more specific types of relations. These online groups are made up of those who share passions, beliefs, hobbies, or lifestyles. These networks allow the development of communities that exploit the capacity of the Internet to expand users’ social worlds to include people in distant locations, binding them more strongly. The Internet helps many people find others who share their interests regardless of the distance between them. These social networks use a variety of communication and collaboration technologies such as blogging, video conferencing, and Wiki tools to name a few, which can be used to harness collective intelligence. Thus, they provide great communication potential and tremendous opportunities for both casual users and professionals to share knowledge with others and thus benefit from the collective pool of shared knowledge. For instance, in the context of learning where communication of knowledge about issues and experience may be limited by traditional means, educators can share experiences and teaching materials that can advance eLearning. In other communities such as the health care domain, doctors and nurses can share practices, experiences, and other resources to provide better health care. Another example can be experiences and knowledge shared by emergency workers which can improve emergency responses in various dimensions. In this chapter, we discuss how Web 2.0 technologies can enhance knowledge-based professional communities. Specifically, we identify a few select communities and discuss the technologies that are used, the ways in which they can be used, and the potential opportunities and challenges encountered by these communities.
A. Ayanso (*) Department of Finance, Operations, and Information Systems, Brock University, 500 Glenridge Avenue, St. Catharines, ON L2S3A1, Canada
e-mail: [email protected]
E. Pardede (ed.), Community-Built Databases, DOI 10.1007/978-3-642-19047-6_1, © Springer-Verlag Berlin Heidelberg 2011
1.1 Introduction
Web 2.0 is generally described as a cost-effective collection of technologies, based largely upon user-generated content that can become a richer pool of knowledge as more people exchange their information and expertise [1]. These technologies have redefined the web as a platform and contributed unique features and design characteristics that focus on collaborations [1, 2]. Web 2.0 also represents “social computing” as it shifts computing to the edges of the network and empowers individual users with lightweight computing tools to manifest their creativity, engage in social interaction, and share knowledge [3, 4]. Recent technologies have also emerged that allow real-time information sharing and thus further facilitate the users’ sense of community. Common characteristics of these online communities include empowering users (e.g., as codevelopers of software) and trusting and encouraging them to share information with members of other social networks, both online and offline, which can help expand the knowledge base, as well as disseminate the information.
1.1.1 Web 2.0 Technologies in Social Web
Barnatt [5] identified three key reasons why Web 2.0 sites have evolved into a network of interconnections between people, services, and information: interpersonal computing, web services, and software as a service (SaaS).

Interpersonal computing involves person-to-person interactions facilitated by the tools and features provided on websites that enable collaborative content creation, sharing, and manipulation. In particular, interpersonal computing in the current social media environment is most commonly associated with the development of Wikis, blogs, social networking, and viral video sites.

Web services involve application-to-application data and service exchanges between organizations, automated by web servers and other related Internet technologies. These service-oriented architectures (SOAs), together with standards such as XML and APIs, allow computer applications to automatically execute online transactions with minimal human involvement. Hence, Web 2.0 technologies such as these offer tremendous opportunities and potential benefits to a wide range of business organizations.

SaaS, sometimes considered a type of web service, refers to online applications delivered over the Internet. This means that the applications can be accessed from any electronic device with an Internet connection, which frees the user from the need for locally installed software.

Regardless of the specific type of technology, application, or connection, Web 2.0 tools present both individuals and organizations with several new value propositions. Organizations are increasingly using Web 2.0 tools both internally and externally to identify experts, facilitate collaboration, and promote organizational learning.
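As a minimal illustration of the application-to-application exchange described above, the Python sketch below calls a web service and parses an XML response. It is not taken from the chapter: the endpoint URL, the resource, and the XML element names are hypothetical, and a real SOA integration would follow the partner service's published contract.

import urllib.request
import xml.etree.ElementTree as ET

def fetch_order_status(order_id):
    # Hypothetical partner endpoint; a real service publishes its own URL scheme.
    url = "https://api.example-partner.com/orders/" + order_id
    # Application-to-application HTTP exchange, executed without human involvement.
    with urllib.request.urlopen(url) as response:
        payload = response.read()
    # Parse the XML payload and extract a single field.
    root = ET.fromstring(payload)
    return root.findtext("status", default="unknown")

# Example use (requires the hypothetical service to exist):
# print(fetch_order_status("A-1042"))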
According to McKinsey’s Global Survey on Internet technologies in January 2007, more than 75% of corporate executives indicate a plan to maintain or increase their investments in technology trends that facilitate collaborations such as peer-to-peer networking, online social networks, blogs, podcasts, and web services [6]. Many of these executives also recognize the strategic importance of Web 2.0 technologies and the potential value of their investments in these technologies. A survey carried out by Forrester also shows the increasing corporate spending on Web 2.0 technologies and tools, which is estimated to reach $4.6 billion globally by 2013 [7].
1.1.2 The Role of Social Networking
After analyzing seven Web 2.0 technologies including blogs, mashups, podcasting, RSS, social networking, widgets, and Wikis, Forrester reports that social networking has the potential to attract the greatest levels of investment [7, 8]. Social networks are “circles in which people interact and connect with other people” [9]. They provide an appropriate environment for collaboration and knowledge sharing. According to Fraser and Dutta [10], social networking sites can generally be classified into five broad categories: egocentric, community based, passion centric, media sharing, and opportunistic. Egocentric networks represent popular user profile sites like MySpace and Facebook that serve as virtual platforms for identity construction, personal creativity, and artistic expression. The community networks connect members with strong identity linkages based on nation, race, religion, and other social communities that also exist in the physical world. Passion-centric networks, also known as “communities of interest”, connect people who share similar interests and hobbies. Media-sharing sites such as YouTube and Flickr are popular social networking sites which are defined primarily by their content, rather than their members. The main focus of this chapter is on opportunistic networks which represent socially organized knowledge communities where members look for professional networking and opportunities in their profession or academic discipline. These include networking sites for experts, scientists, practitioners, and researchers in various professions (see Appendix 1 for the list of communities examined).
1.2 Social Web and Organizational Knowledge Communities
Enterprise 2.0 is the term used to refer to the use of Web 2.0 platforms by organizations on their intranets and extranets in order to make visible the practices and outputs of their knowledge workers [11, 12]. Web 2.0 platforms have fueled the creation of online communities within business organizations that can encourage
learning and professional development among knowledge workers. These online community portals provide multiformat platforms and create various forms of learning environments for organizations and their members. Such knowledge communities provide a range of content, services, and tools for successful implementation of organizational initiatives or industry best practices. They also provide several strategic advantages such as enhancing staff competency by replacing traditional training and development and exposing organizations and their members to new thinking through peer-to-peer learning. In addition to facilitating intraorganizational knowledge networks, Web 2.0 technologies and applications have created new intermediaries that connect organizations with practitioners, experts, and researchers around the world.

One notable example of a knowledge management network is the Knowledge and Innovation Network (KIN) (http://www.ki-network.org), a membership-based community linking business practitioners from leading industry organizations, world-class researchers, and some of the world's leading experts in knowledge management and innovation. Its objectives include promoting, fostering, and supporting collaboration between practitioners, researchers, and experts to create new knowledge and practice. One unique aspect of the KIN is that it involves only one member from each industry to promote open knowledge sharing and collaboration free from competitors and vendors. Collaboration and knowledge sharing in KIN take place through special interest groups (SIGs) which include communities of practice, enabling technologies, and innovation.

Another interesting example is HR.com (http://www.hr.com), the largest social network and online community of corporate human resource executives. This network provides member executives with easy access to shared knowledge on best practices, trends, and industry news in order to help them develop and effectively manage their workforce and organization. The network supports several communities whose specific interests include employee benefits, compensation, staffing and recruitment, HR outsourcing, training and development, performance management, leadership, and many other areas within the human resources domain. Within each community, members connect, learn, and share with their peers. Some of the shared knowledge on best practices includes the latest articles relevant to the community, live and archived webcasts with leaders and innovators, HR blogs and free spaces for executive members to create their own blogs, and an online bulletin board or forum for members to share their ideas, experiences, questions, and answers.

A unique and important feature of this network is its HR Market Intelligence tool, which provides real-time and comprehensive market data from industry peers. HR Market Intelligence is a metric-based program that offers extensive data on multiple supplier categories (e.g., job boards, consultants, and services) within the HR industry. This tool also allows executives to build customizable reports based on survey data collected from the largest pool of HR professionals. Members can create reports by mapping their needs against the best practices, trends, and benchmarks reported by other HR executives.
1.3 Social Web and Health Care
In the health care industry, the Internet has become an important resource for consumers seeking online health-related information, as well as social and emotional support. The Internet is also changing medical practice, transforming biomedical research, and empowering health care consumers [13]. Pew Internet research surveys conducted in 2002, 2004, 2006, 2007, and 2008 have consistently found that well over 75% of Internet users looked online for health information [14].

On websites such as WebMD, familydoctor.org, mayoclinic.com, and emedicinehealth.com that serve the general public in health-related issues, information is posted by a select group of experts (e.g., medical doctors and nurse practitioners) but can be searched and used by anyone. These websites serve different patient populations whose interests range from chronic diseases, such as cancer and Alzheimer's, to minor ailments, such as joint pain or fever, and have become an indispensable resource for many health care consumers. Traditionally, many of these venues were used for disseminating health-related knowledge, which was perceived as a relatively low-cost solution to help alleviate the load on the overburdened health care system. Some health care forums such as WebMD, enabled by Web 2.0 technologies, also allow the general public to share their experiences, and an increasing number of web users are turning to one another for answers about many different subjects. The knowledge gained from expert-provided information, as well as from experiences shared by other patients, can be a valuable resource that can positively affect both the physical and emotional well-being of patients seeking help [15]. The trends suggest that "whereas someone may have in the past called [upon] a health professional, their Mom, or a good friend, they now are also reading blogs, listening to podcasts, updating their social network profile, and posting comments" ([14], p. 7).

Because of the immense popularity of these Web 2.0 technologies in health care, new terms such as "Health 2.0" and "Medicine 2.0" have been coined. Hughes et al. [16] define Health 2.0 as "the use of Web 2.0 web tools such as blogs, podcasts, tagging, search, Wikis, etc. by actors in health care including doctors, patients, and scientists, using principles of open source and generation of content by users, and the power of networks in order to personalize health care, collaborate, and promote health education". Along a very similar line, Eysenbach [13] defines "Medicine 2.0 applications, services, and tools ... as web-based services for health care consumers, caregivers, patients, health professionals, and biomedical researchers, that use Web 2.0 technologies and/or semantic web and virtual reality approaches to enable and facilitate specifically (1) social networking, (2) participation, (3) apomediation, (4) openness, and (5) collaboration, within and between these user groups". According to Jessen [17], Medicine 2.0 is "the science of maintaining and/or restoring human health through the study, diagnosis and treatment of patients utilizing Web 2.0 internet-based services, including web-based community sites, blogs, Wikis, social bookmarking, folksonomies (tagging) and Really Simple Syndication (RSS),
to collaborate, exchange information and share knowledge". Jessen's definition focuses on the scientific, collaborative aspects of knowledge generation and use in health care provision.

As Abidi et al. [18] argue, to generate holistic health care knowledge it is important to augment explicit knowledge with context-specific experiential knowledge, and Web 2.0 technologies can be very advantageous in encouraging collaboration between health care professionals such as doctors and biomedical researchers.

There are several sites that promote the exchange of information among health care professionals. One of the notable names in this category is Sermo (http://www.sermo.com/). Sermo is a social network through which physicians exchange information and collaborate (see Fig. 1.1). It is a community with more than 112,000 members who communicate in detail with each other about clinical and nonclinical issues. According to Sermo's website, "it's where practicing US physicians – spanning 68 specialties and all 50 states – collaborate on difficult cases and exchange observations about drugs, devices and clinical issues". Sermo is assisted by its survey and ratings tools, which are among the most advanced anywhere on the web [19].

Fig. 1.1 Sermo. Source: MacManus [19]

Syndicom SpineConnect (http://www.syndicom.com/spineconnect/) is the leading collaborative knowledge network where spine surgeons collaborate on difficult and unusual cases. According to the site, every day thousands of spine surgeons from around the world log on to SpineConnect "to share knowledge, develop novel approaches to treatment, address the top challenges in spine health care, and create technological solutions that address voids in the current marketplace with the underlying goal of improving patient outcomes". With simple and intuitive "case wizard" features, SpineConnect enables surgeons to obtain feedback on complicated patient cases quickly and securely from some of the world's foremost authorities in spine surgery. Another feature of SpineConnect is ResearchEdge PN, a unique web-based tool for conducting case-based research which provides a secure environment
where researchers can gather, compile, and analyze case data from multiple research centers.

There is a slew of sites that empower health consumers by providing information and other services. Lists of Health 2.0 sites mostly focused on consumers can be found at the Health20.org Wiki site (http://www.health20.org). In general, prior research shows that members of online support groups express very positive attitudes toward the usefulness of the web for seeking health information [20]. The December 2003 PEW Internet report [21] indicates that more and more citizens are accessing health information online, increasing from 46 million Americans in March 2000 to 93 million Americans in 2003.

WebMD (http://www.webmd.com) is probably one of the most notable names among health-related websites. It is one of the most comprehensive health resources for everyday consumers, as well as physicians, nurses, and educators. It is not a pure Web 2.0 site; for the most part, in order to provide credible information, content on thousands of symptoms, medical tests and tools, drugs and vitamin supplements, etc., is posted by a select group of experts. Another similar example is FamilyDoctor (http://www.familydoctor.org/), which provides information on a variety of health topics. On its website, there are sections that focus on men, women, seniors, parents, and children. WebMD, however, recognizes that "the power and comfort of learning from and sharing with others who face the same challenges can be invaluable". To fulfill this need, WebMD has a large number of message boards, discussion boards, as well as live events where health care consumers can communicate and exchange information with each other. The blogging community and chat forums on WebMD are very active, with a large number of users logging on and posting messages on a daily basis. WebMD also implements "widgets" – small software modules operating within or connecting to WebMD's environment and appearing on a device or website of the member's choice – that provide registered members with access to a variety of WebMD content and services.

There are other similar sites that focus on specific ailments. PsychCentral (http://www.psychcentral.com) has been operated since 1995 by mental health professionals offering reliable, trusted information and over 150 support groups to consumers. Although mainly using Web 1.0 features, this site also provides a blog for community members to participate in. PatientsLikeMe (http://www.patientslikeme.com) is also an online patient community which allows people to "meet [other] patients just like you, and learn more about your disease". As MacManus [19] comments, it is "the best example of a combination of really useful community and tools making a significant difference in the lives of people with serious debilitating diseases". This is a social networking site currently activated for four diseases, namely amyotrophic lateral sclerosis, multiple sclerosis, Parkinson's, and HIV, where patients with similar medical conditions can find each other, mutually share progress, and collectively discover the best answer to each other's questions. Perhaps the most important feature on its website is a tool that helps registered users track their medical conditions and outcomes. Users on the site can literally drill down and see other patients in similar situations or who are on
similar medications and see what did or did not work for them [19]. WiserTogether (http://www.wisertogether.com), which focuses mainly on maternity and pregnancy, is another example of a website where members write entries sharing their health problems and experiences with treatment methods.

These sites have generally evolved beyond the encyclopedic style of WebMD and become more interactive and case-study oriented. DoublecheckMD (http://www.doublecheckmd.com) is a site that uses natural language recognition to allow consumers to search medical documents and match symptoms with the drugs they are taking. DoubleCheckMD.com [22] is a search engine that recognizes symptom descriptions expressed in everyday language and allows consumers to search medical documents for information on adverse drug reactions, drug–drug interactions, and the side effects of individual drugs. This is especially important for drug side effects and drug interactions that may themselves result in new medical conditions. The website shows the potential for extracting information from the millions of pages of medical texts and presenting it to users in a useful format. With the "Next Step" tag feature, the website also allows its users to learn how they should proceed, or what laboratory tests they should take next. The idea of presenting this information to the patient is that the patient can be better informed when visiting the doctor's office to decide on the next set of appropriate steps.

Vitals (http://www.vitals.com) is a doctor directory which serves as a one-stop shop for information about physicians and is publicized as a "source for comprehensive medical information on 720,000 doctors nationwide". Health consumers can find a doctor for a particular location and required expertise. The site uses reported empirical data, patient reviews, and physician reviews of their peers.
1.4 Social Web and Research Communities
Social networks underlying Web 2.0 technologies have also created tremendous collaborative opportunities for the research community. The speed and ease of connectivity afforded by social media and Web 2.0 tools have reduced the barriers to collaboration among researchers in various disciplines and professions [23]. Such networks are revolutionizing how scientists and researchers in different geographic locations can interact and work with each other on research projects ranging from small-scale studies to international issues affecting the global society. The ability to connect with other researchers across disciplines and large geographical boundaries is even more significant, and holds even greater promise, for researchers in developing countries and for experts with unusual specialties seeking collaborators.

Notable examples of knowledge networks for the scientific research community include ResearchGate, Academia.edu, Labmeeting, and Mendeley. ResearchGATE (http://www.researchgate.net) is a scientific network with over 200,000 members that provides powerful collaboration tools, contents tailored to
scholarly needs, and subcommunities dedicated to research specialties with a vision of creating "Science 2.0". The aim of ResearchGATE is to promote knowledge sharing between scientists all over the world and is based on the idea that communication between scientists will accelerate the creation and distribution of new knowledge and assure research quality [24]. In addition to facilitating research connections and collaborations within research specialties, ResearchGATE offers an international job board through RSS feeds from various top international research institutions.

At Academia.edu (http://www.academia.edu), researchers can easily find out "who's researching what?", find people with similar research interests, as well as keep track of the latest developments in their research area, including the latest papers, talks, blog posts, and status updates. Researchers can create an easy-to-maintain webpage, listing their research interests, biography, and papers they have written. Researchers can choose to receive notifications when someone searches for their names and research papers on Google. They can also upload and share their papers and talks, and receive statistics on how often their papers are viewed and downloaded.

Similarly, at Labmeeting (http://www.labmeeting.com), researchers have access to a web service that helps organize, collect, and share scientific papers. Researchers can also interact with top scientists at top universities through a question-and-answer service from the network.

At Mendeley (http://www.mendeley.com), scientists and researchers can upload research papers, and the site sifts through bibliographic data to match and recommend other related research papers already in its database. The site provides a research management tool for desktop and web. Mendeley Desktop is freely available on Windows, Mac, and Linux. It allows users to organize a collection of research papers and citations, automatically extracts references from documents, and generates bibliographies. Mendeley Web lets users access their research paper library from anywhere, share documents in closed groups, and collaborate on research projects online.

The potential values brought about by the online communities mentioned above go beyond helping the research community become more productive. Researchers can also use these websites to enhance their professional development efforts by establishing a professional online presence and building an online social network that allows members to conduct research, obtain expert advice, collaborate with others, learn about new ventures, and reconnect with others. These networks are built on a variety of Web 2.0 applications to cater for user needs and are designed to help users expand their social circles into well-connected and diverse professional networks. Users can easily take advantage of the various tools and features available in these networks to focus on key professional goals and objectives. For example, to enhance the user experience, many of these networks support various blogging platforms and enable users to link their personal blog to their social network sites. Content can also be shared with other popular sites such as LinkedIn, Digg, Twitter, Facebook, Technorati, and so on. The ability to access other members' free contents can save a significant amount of time in information search and decision making.
Due to the information and user profiles that these social networking sites make available about their members, other members can easily locate experts, as well as partners, for their research projects.
1.5 Social Web and Software Development Communities
In software development communities, online collaboration has been an integral part of the open source movement for at least the past 10 years. For example, Ubuntu, a Linux operating system, has been developed and maintained partly by a community of volunteers who collaborate virtually with each other (see http://www.ubuntu.com). These voluntary efforts range from highly technical tasks, such as coding, testing, and debugging the software, to nontechnical, administrative tasks, such as designing a user interface, writing documentation, and proofreading. Traditionally, Ubuntu "members" or "contributors" collaborate using Wiki tools where end users help write manuals and supporting documents. Recently, Ubuntu volunteers have adopted other Web 2.0 tools such as blogs and RSS (Really Simple Syndication) feeds to communicate with each other. Over the years, subcommunities have also emerged as "grass root" groups establishing themselves to serve the needs of local Ubuntu users. For example, on Planet.Ubuntu.com, there are currently over 20 virtual subcommunities serving specific locales such as Argentina, Chile, Indonesia, and South Africa, as well as French-, Chinese-, and Italian-speaking user groups. There are also over 200 self-organized local support teams (known as LoCo Teams) that provide offline support and face-to-face meetings among their members.

The open source software development movement behind systems like Ubuntu can be traced as far back as the early 1980s. A "free" software movement called the "GNU Project" has been credited with creating the back-end components of what is now known as the Linux operating system (see http://www.gnu.org). Since then, millions of copies of GNU/Linux software have been distributed under more familiar names such as RedHat, Debian, and of course Ubuntu. As in the Ubuntu communities, GNU volunteers work together virtually using various collaboration tools, including Web 2.0 technologies such as blogs and online forums.

These "grass root" movements in software development extend beyond the nonprofit sector. Sun Microsystems (now owned by Oracle) created virtual communities such as OpenJDK, GlassFish, and NetBeans to harness the collaborative power of the voluntary mass (see http://www.java.net). Like Ubuntu volunteers, members of these virtual communities communicate with each other using various Web 2.0 tools ranging from listservs, discussion forums, and blogs to more recent technologies such as Twitter, YouTube, and Facebook groups. However, these company-sponsored communities are partly managed by their corporate sponsors and are less self-directed than those created for Ubuntu and GNU.

Regardless of the nature of their sponsorship, these online communities provide their members with networking opportunities with other members in similar professions or with similar interests. These social opportunities can also add to the professional stature of members who become actively involved in the various projects in the online communities. As a result, these communities can be classified as both passion centric and opportunistic [10]. Consequently, members of these communities have incentives (either intrinsic or extrinsic) to engage with and contribute to
the contents that are captured on these websites. These contents are eventually stored in a database as part of the website's archive for subsequent searches and retrieval. Due to the collective efforts of a large number of volunteers, updates and changes to these community-built databases are made relatively frequently. To encourage members to participate, many online communities also award their members "status points", which can be accumulated to increase the individual member's social status within the online community. Because of their relatively low cost, companies often implement these online communities as a way to supplement traditional customer service mechanisms (e.g., call centers, online FAQs, and online knowledge bases).

Moreover, since the contents of these online archives are created and edited by a large number of members who are likely to be the actual users of the software, the online archive includes less formal, but potentially useful, information such as the actual user's experience, potential "off-label" usage of the software, troubleshooting tips and solutions, and other information that could be beneficial to other members of the community. As a result, these online archives can become a unique repository of a large amount of knowledge and expertise that cannot be found anywhere else.

However, unlike traditional knowledge bases and technical support databases on retail or corporate websites, the information compiled from virtual "grass root" communities is less structured and thus potentially more problematic for information retrieval. In addition, the information posted on Web 2.0 portals tends to be free-form text (e.g., blogs and tweets) and often includes multimedia components (e.g., YouTube videos and photos on Facebook). Moreover, due to the informal nature of most online communities, the semantics used by each member of the community to describe similar events or objects may differ significantly. For example, members of the GNU communities explicitly distinguish the terms "Free Software" and "Open Source Software" (http://www.gnu.org), whereas members of other open source software communities tend to use these terms somewhat interchangeably. As a result, searching for useful and relevant information on these community-built databases can become a challenge.

In spite of the potential difficulty, many members of virtual communities do perform searches to locate the information they need in a manner similar to what they would use on a more structured database. For example, most online forums have a search engine that matches the search keyword with the content of text messages, as do most blogs. On other Web 2.0 sites such as Facebook and YouTube, a search engine is provided that matches the search keywords with predetermined data fields such as Facebook group names, video clip titles, and textual descriptions (tags) of photos.

The next section of this chapter discusses some of the challenges common to most online communities and to the online knowledge communities described here (i.e., working professionals, health care, scholarly research, and software development communities). These challenges range from technical issues such as information retrieval and accuracy to social and legal concerns such as intellectual property rights and privacy.
1.6 Social Web and Critical Challenges
Social media has exponentially increased the scale and pattern of information exchange on the Internet. It provides a network platform that allows individuals and groups to interact more immediately and on a much larger scale than ever before. However, the influence of social media can also be seen as a double-edged sword. While social media has opened new channels and opportunities for individuals, social groups, and professional knowledge communities, the same technologies can also be exploited in a destructive way by individuals and groups to propagate misinformation, propaganda, violence, defamation, and threats.

Social media creates an open forum for anyone to post personal, social, and political views in various forms such as photographs, videos, podcasts, articles, and blogs. It also involves many different individuals and societies with different social views, moral standards, ethics, cultures, and codes of conduct. Content that is acceptable to one individual or group may not necessarily be equally acceptable to another individual or group. In addition, the degree of censorship may vary from place to place because some countries have stricter laws regarding the nature of content that can be posted on the Internet. Organizations and governments are confronted with the issue of striking a balance between the social costs on the one hand and employees' and citizens' rights to free speech on the other.

Below we review some of the critical issues surrounding Web 2.0 sites and their operations which are broadly applicable to many different types of knowledge communities, whether organizational, health care, research, or software development communities.
1.6.1 Information Retrieval
Many online communities allow their members to post multimedia objects that may include text messages, pictures, and video or audio files that are eventually stored as part of an online archive. These multimedia contents are “tagged” with textual descriptions for subsequent searches and retrieval. For instance, a picture of a symptom of a particular ailment may be uploaded on a health site and may be described with a tag. However, these tags often fail to fully describe the multimedia objects. As a result, information retrieval based on text-based queries may not yield desired outcomes [25]. Moreover, many members of these online communities engage in exploratory searches, as opposed to using specific keyword searches. These exploratory searches (often called information foraging behavior) require a type of information retrieval tool different from what is usually available to these online communities [26]. As a result, many new research efforts have been directed toward creating information retrieval tools that can accommodate various types of online searches (see, e.g., [27, 28]). In addition, even with the ability to search using text-based queries, people from different backgrounds may understand similar concepts using different textual
descriptions. As a result, unless an exact keyword is used, search results may not include all the information contained in an online repository. This could also be a significant inherent challenge faced by online communities that span a large geographical boundary (e.g., across countries and continents) and whose members come from a wide variety of backgrounds and walks of life. Recent developments of folksonomy systems such as del.icio.us have been used as a way to alleviate this issue facing online information retrieval [29, 30].
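As a rough illustration of how folksonomy-style tagging can mitigate the vocabulary problem described above, the following Python sketch expands a keyword query with tags that frequently co-occur with it in a small, made-up collection of tagged items; the collection, tag names, and expansion depth are purely illustrative assumptions and are not taken from any particular system.

from collections import Counter, defaultdict
from itertools import combinations

# Illustrative, hand-made collection: item id -> set of community-assigned tags.
items = {
    "post-17": {"free-software", "gpl", "licensing"},
    "post-42": {"open-source", "licensing", "community"},
    "post-99": {"open-source", "free-software", "foss"},
    "post-03": {"open-source", "free-software"},
}

# Count how often pairs of tags are used together across the collection.
cooccur = defaultdict(Counter)
for tags in items.values():
    for a, b in combinations(sorted(tags), 2):
        cooccur[a][b] += 1
        cooccur[b][a] += 1

def expand_query(keyword, top_n=2):
    """Return the keyword plus the tags that most often co-occur with it."""
    related = [t for t, _ in cooccur[keyword].most_common(top_n)]
    return {keyword, *related}

def search(keywords):
    """Match an item if any (expanded) query keyword appears among its tags."""
    query = set().union(*(expand_query(k) for k in keywords))
    return [item for item, tags in items.items() if tags & query]

# Also retrieves post-42, which is tagged "open-source" but not "free-software".
print(search({"free-software"}))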
1.6.2 Intellectual Property
Social media has resulted in an explosive growth of user-generated content, which presents significant legal challenges [31]. Content sharing sites are increasingly impacting the business models of traditional content creators. Postings of copyrighted works, ranging from photographs and short clips to entire copies of literary works, have become a common occurrence. A chapter or an excerpt from a book posted on a research, health, or any other site is a simple example of this issue. Many social media companies have failed to effectively control copyrighted content on their sites, simply binding their users to be liable for their own actions as part of the terms of use accepted when signing on. Copyright laws differ from country to country, whereas many of the content sharing websites cross several cultures, nations, and geographic regions. Understanding the law applicable to a particular action is becoming a major challenge for many users. Companies, on the other hand, rely heavily on the specific legislation of their country of origin, and they tend to adopt a reactive, rather than a proactive, approach to addressing copyright infringements in other locations.
In organizational knowledge communities, user-generated content – a key feature of Web 2.0 technologies – is also likely to cause information leakage [32]. The use of social media sites may potentially lead to accidental or deliberate disclosure of private company information as users utilize the various tools and features available on the sites. In addition to safety measures related to personal information, professional networking sites should promote the importance of social media policies for organizations.
1.6.3 Information Accuracy
Web 2.0 systems can create a platform where end users contribute content and collaborate further on the social networks. In general, when the content is generated by the community at large, the accuracy of information is often challenged. Even with the most well-known initiatives such as Wikipedia, this concern is often expressed. Although, at any one point in time, the information on Wikipedia may be questioned for its accuracy, there is a general tendency for information to be
corrected over time by social policing, as well as other information auditing mechanisms. Other members of the community visiting the same page can edit the information to make it more accurate if required. Concerns about the accuracy of information are also expressed in the context of health communities [14]. The expertise and number of contributing members of the community can impact significantly on this attribute of information. Bhatnagar et al. [33] discuss several network-related characteristics, such as boundedness, density, exclusivity, centrality, and strength of ties, that can be important aspects of effective social knowledge communities. Also, strategies such as expert review of information posted and citation and referencing mechanisms can be used to effectively manage this problem.
1.6.4 Privacy and Security
Many new Web 2.0 technologies create new vulnerabilities for online social networks. Since Web 2.0 platforms enable anyone to upload content, these sites are easily susceptible to hackers wishing to upload malicious content. Once the malicious content has been uploaded, innocent visitors to these sites can also be infected [34]. Web 2.0 relies on mechanisms such as RSS, Trackback, Pingback, and tagging, and each of these introduces new vulnerabilities. Examples of such threats include a syndicated page delivered through RSS that opens a backdoor, the misuse of a VoIP framework built for online communication to launch SYN attacks, cross-site scripting (XSS), and various new types of injection attacks [33]. As Adhikari [32] notes, what makes matters worse is that the majority of these sites are considered "trusted" by URL filtering products and are not blocked. Thus, malicious code can easily be passed on through the use of these sites. A good example of this type of threat is a recent event in which a malicious link posted on Facebook and clicked by employees of a large US financial firm allowed the perpetrators to "slip deep inside the financial firm's network, where they roamed for weeks" [35].
Social networking sites rely on connections and communication, and they often encourage the user to provide a certain amount of personal information. Users' online profiles may also be made available to the general public and attract unwanted attention. Thus, precautions to protect the privacy and security of users' information are an important aspect of social media. In knowledge communities such as health-related websites, the privacy and security of patient information is among the chief cornerstones of these social networks. Privacy issues are rampant in new participative systems such as blogs and online social networks, where content is posted by the end user. Privacy is not limited to the individual: one's entire network's privacy is critical. For example, FOAF (Friend of a Friend) data can expose private information for a chain of people connected to each other, posing a privacy risk to the entire network of people rather than to a single individual.
Although companies understand that they are exposed to many social, ethical, and legal issues, all too common are passive approaches taken by social networking
sites when addressing these issues. To date, most of the steps taken by companies have focused on framing the terms of use that protect themselves from potential legal liabilities. This strategy, however, puts the onus on the user and does not provide a comprehensive solution. In fact, most users do not read the terms of use when signing up to social media sites. Some terms may also require legal interpretations; thus, users may have minimal awareness of the legal implications and consequences of some actions. In addition to company-specific terms of use, companies should have a broader framework that incorporates well-defined social, ethical, and legal dimensions. For example, LinkedIn has demonstrated this step by becoming a licensee of the TRUSTe Privacy Program, which is an independent organization that enables individuals and organizations to establish trusting relationships by promoting the use of fair information practices. The company’s privacy policy outlines the type of personal information that is collected and how it will be used. The user agreement also outlines the precautions users should take so that the security of personal information is partly their responsibility. Other professional networking sites should take similar steps to increase user awareness and confidence in the areas of privacy and security.
1.6.5 The Digital Divide
The rapid growth of Web 2.0 technologies may expand the already existing gap in access to or use of information and communication technologies (ICTs) between the developing and the developed parts of the world. For those who have access to social media, there is an immense promise of knowledge creation and knowledge sharing. However, if the digital divide is not addressed in the first place, the social disparity and the knowledge gap for those who do not have access will continue to grow at an alarming rate. Therefore, ensuring access to technology for all and the development of skills in its use should be the top priority of any economic initiative in both local and global settings.
1.7 Conclusion
Online knowledge communities provide immense opportunities for knowledge sharing among experts, as well as casual users. The benefits to be derived from a collective pool of shared knowledge are tremendous. This chapter discusses the ways in which Web 2.0 technologies are used in four knowledge communities – namely organizational, health care, research and software development – identifying the potential opportunities, as well as challenges encountered by these communities. Knowledge communities have transformed into another form of social media due to recent advances in web technologies. To reiterate, Web 2.0 tools and techniques support more effective collaboration and knowledge sharing. While the specific
objectives differ from one community to another, they all take advantage of the features and functionalities Web 2.0 offers to promote the identification, creation, representation, and distribution of information and/or knowledge on a much larger scale than ever before. Finally, it is important that organizations and professionals recognize the risks in online professional networks. Although there are several advantages of Web 2.0 technologies, they also come with several technological, social, ethical, and legal issues. In professional networks such as those discussed in this chapter, these issues become more significant due to the scope of the potential damage that can be done to organizations and professionals. Addressing these critical issues, therefore, would require a concerted effort from multiple players which include individual users, legislators, software developers, and service providers [31].
Appendix 1: Community-Built Networks

Community | Community category | URL
Academia.edu | Research | http://www.academia.edu
DoublecheckMD | Health | http://doublecheckmd.com/
FamilyDoctor | Health | http://familydoctor.org/
GNU | Software development | http://www.gnu.org
Health20.org | Health | http://health20.org/Wiki/
HR.com | Interorganization – human resource | http://www.hr.com
Java.net | Software | http://www.java.net
Knowledge and Innovation Network (KIN) | Interorganization – innovation | http://www.ki-network.org
Labmeeting | Research | http://www.labmeeting.com
Mendeley | Research | http://www.mendeley.com
Patientslikeme | Health | http://www.patientslikeme.com/
Psychcentral | Health | http://psychcentral.com/
ResearchGATE | Research | http://www.researchgate.net
Sermo | Health | http://www.sermo.com/
Syndicom | Health | http://www.syndicom.com/spineconnect/
Ubuntu | Software development | http://www.ubuntu.com/
Vitals | Health | http://www.vitals.com/
WebMD | Health | http://www.webmd.com
WiserTogether | Health | http://wisertogether.com
References 1. O’Reilly, T.: What is Web 2.0: design patterns and business models for the next generation of software. O’Reilly net. http://www.oreillynet.com/pub/a/oreilly/tim/news/2005/09/30/whatis-web-20.html (2005)
2. Yang, T.A., Kim, D.J., Dhalwani, V., Vu, T.K.: The 8C framework as a reference model for collaborative value webs in the context of Web 2.0. In: Proceedings of the 41st Hawaii International Conference on System Sciences (HICSS-41), Hawaii, USA (2008) 3. Parameswaran, M., Whinston, A.B.: Social computing: an overview. Commun. Assoc. Inf. Syst. 19, 762–780 (2007) 4. Parameswaran, M., Whinston, A.B.: Research issues in social computing. J. Assoc. Inf. Syst. 8(6), 336–350 (2007) 5. Barnatt, C.: Web 2.0: an introduction. http://explainingcomputers.com/web2.html (2010). Accessed 7 Feb 2010 6. McKinsey & Company: How businesses are using Web 2.0: a Mckinsey global survey. McKinsey Q. http://www.mckinseyquarterly.com/How_businesses_are_using_Web_20_A_ McKinsey_Global_Survey_1913. Accessed March 2007 7. Young, G.O., Brown, E.G., Keitt, T.J., Owyang, J.K., Koplowitz, R., Lo, H.: Global enterprise Web 2.0 market forecast: 2007 to 2013 Forrester (2008). Accessed 21 Apr 2008 8. Shiels, M.: Web 2.0 is set for spending boom, BBC News. http://news.bbc.co.uk/2/hi/7359927. stm (2008). Accessed Apr 2008 9. Avram, G.: At the crossroads of knowledge management and social software. Electron. J. Knowl. Manage. 4(1), 1–10. http://www.ejkm.com (2006) 10. Fraser, M., Dutta, S.: Throwing Sheep in the Boardroom: How Online Social Networking Will Transform Your Life, Work and World. Wiley, London (2008) 11. McAfee, P.A.: Enterprise 2.0: New Collaborative Tools for Your Organization’s Toughest Challenges. Harvard University Press, Cambridge (2009) 12. McAfee, P.A.: Enterprise 2.0: the dawn of emergent collaboration. MIT Sloan Manage. Rev. 47(3) 21–28 (2006) 13. Eysenbach, G.: Medicine 2.0: social networking, collaboration, participation, apomediation, and openness. J. Med. Internet Res. 10(3), e22 (2008) 14. Fox, S., Jones, S.: The social life of health information. Pew Internet and American Life Project. http://www.pewinternet.org/Reports/2009/8-The-Social-Life-of-Health-Information. aspx (June 2009) 15. Barnett, G.A., Hwang, J.M.: The use of the internet for health information and social support: a content analysis of online breast cancer discussion groups. In: Murero, M., Rice, R.E. (eds.) The Internet and Health Care: Theory, Research and Practice, pp. 233–253. Lawrence Erlbaum, Mahwah (2006) 16. Hughes, B., Joshi, I., Wareham, J.: Health 2.0 and Medicine 2.0: tensions and controversies in the field. J. Med. Internet Res. 10(3), e23 (2008) 17. Jessen, W.: Medicine 2.0 #27 – communication is key. http://blog.highlighthealth.info/ medicine-20/medicine-20-27-communication-is-key/ (2008). Accessed June 2008 18. Abidi, S.S.R., Hussini, S., Sriraj, W., Thienthong, S., Finley, A.: Knowledge sharing for pediatric pain management via a Web 2.0 framework. Med. Inf. United Healthy Eur. 44(5), 287–291 (2009) 19. MacManus, R.: Top health 2.0 web apps. Read Write Web. http://www.readwriteweb.com/ archives/top_health_20_web_apps.php (2008). Accessed Feb 2008 20. Lieberman, M.A., Golant, M., Giese-Davis, J., Winzlenberg, A., Benjamin, H., Humphreys, K., et al.: Electronic support groups for breast carcinoma: a clinical trial of effectiveness. Cancer 97(4), 920–925 (2003) 21. Fox, S., Rainie, L.: The online health care revolution: how the web helps Americans take better care of themselves. Pew Internet and American Life Project. http://www.pewinternet. org/Reports/2000/The-Online-Health-Care-Revolution.aspx (November 2000) 22. DoublecheckMD.com: http://doublecheckmd.com/ 23. Ingram, M.: Social media help generate science 2.0. 
Internet Evolution. http://www.internetevolution.com/author.asp?section_id=539&doc_id=185130& (2009). Accessed 30 Nov 2009 24. Hamm, S.: ResearchGATE and its savvy use of the web. BusinessWeek. http://www.businessweek.com/innovate/content/dec2009/id2009127_441475.htm (2009). Accessed 7 Dec 2009
25. Li, W.-S., Candan, K.S., Hirata, K., Hara, Y.: Supporting efficient multimedia database exploration. VLDB J. 9(4), 312–326 (2001) 26. Fu, W., Pirolli, P.: SNIF-ACT: a cognitive model of user navigation on the world wide web. Hum. Comput. Interact. 22(4), 355–412 (2007) 27. Mitchell, W.J.T.: What Do Pictures Want? The Lives and Loves of Images. University of Chicago Press, Chicago (2005) 28. O’Gorman, M.: E-crit: Digital Media, Critical Theory and the Humanities. University of Toronto Press, Toronto (2006) 29. Abel, F.: The benefit of additional semantics in folksonomy systems. In: Proceedings of the Conference on Information and Knowledge Management, Napa Valley, CA, USA, pp. 49–56 (2008) 30. Bao, S., Xue, G., Wu, X., Yu, Y., Fei, B., Su, Z.: Optimizing web search using social annotations. In: Proceedings of the 16th International Conference on World Wide Web, Banff, Alberta, Canada, pp. 501–510 (2007) 31. George, C.E., Scerri, J.: Web 2.0 and user-generated content: legal challenges in the new frontier. J. Inf. Law Technol. 2 (2007) 32. Adhikari, R.: Experts on Web 2.0 security problems. http://itmanagement.earthweb.com/secu/ article.php/3805496/Experts-on-Web-20-Security-Problems.htm (2009). Accessed 21 Feb 2009 33. Bhatnagar, S., Herath, T., Sharman, R., Rao, H.R., Upadhyaya, S.: Web 2.0: investigation of issues for the design of social networks. In: Ordoneez de Pablos, P. (ed.) Web 2.0: The Business Model, pp. 133–146. Springer, New York (2009) 34. Ben-Itzhak, Y.: Tackling the security issues of Web 2.0. http://www.scmagazineus.com/ tackling-the-security-issues-of-web-20/article/35609/ (2007). Accessed 10 Sept 2007 35. Acohido, B.: How cybercriminals invade social networks, companies. USA Today. http:// www.usatoday.com/tech/news/computersecurity/2010-03-04-1Anetsecurity04_CV_N.htm (2010). Accessed 4 Mar 2010
Chapter 2
Community-Contributed Media Collections: Knowledge at Our Fingertips
Tania Cerquitelli, Alessandro Fiori, and Alberto Grand
Abstract The widespread popularity of the Web has supported collaborative efforts to build large collections of community-contributed media. For example, social video-sharing communities like YouTube incorporate ever-increasing amounts of user-contributed media, while photo-sharing communities like Flickr manage huge photographic databases. The variegated abundance of multimodal, user-generated material opens new and exciting research perspectives and contextually introduces novel challenges. This chapter reviews different collections of user-contributed media, such as YouTube, Flickr, and Wikipedia, by presenting the main features of their online social networking sites. Different research efforts related to community-contributed media collections are presented and discussed. The works described in this chapter aim to (a) improve the automatic understanding of this multimedia data and (b) enhance the document classification task and the user searching activity on media collections.
2.1 Introduction
The past few years have witnessed the steady growth of Web-based communities such as social networking sites, blogs, and media-sharing communities. This recent trend in the WWW technology, commonly known as Web 2.0, finds its natural outcome in the emergence of well-known online media-sharing communities. For example, licensed broadcasters and production companies were historically the only publishers of video content in online Video-on-Demand (VoD) systems. However, the advent of video-sharing communities, partially supported by the popularity of affordable hand-held video recording devices (e.g., digital cameras, camera phones), has reshaped roles. Nowadays, hundreds of millions of Internet users are self-publishing consumers. This has resulted in user-generated content (UGC) becoming a popular and everyday part of the Internet culture, thus creating
new viewing patterns and novel forms of social interaction. Accordingly, an increasing research effort has been devoted to analyzing and modeling user behavior on social networking sites. The abundance of contents generated by media-sharing communities could potentially enable a comprehensive and deeper multimedia coverage of events. Unfortunately, this potential is hindered by issues of relevance, findability, and redundancy. Automated systems are largely incapable of understanding the semantic content of multimedia resources (e.g., photos, videos, documents). Queries on multimedia data are thus extensively dependent on metadata and information provided by the users who upload the media content, for example, in the form of tags. However, this information is often missing, ambiguous, inaccurate, or erroneous, which makes the task of querying and mining multimedia collections a nontrivial one. Hence, user-contributed media collections present new opportunities and novel challenges to mine large amounts of multimedia data and efficiently extract knowledge useful to improve the access, the querying, and the exploration of multimedia resources. The aim of the chapter is to present how information retrieval and data mining approaches can be used to extract and manage the content generated by media communities. The chapter is organized as follows. Section 2.2 reviews different collections of user-contributed media, such as YouTube, Flickr, and Wikipedia, by presenting their main features. Section 2.3 discusses several aspects of the reviewed media collections and their associated user communities. In particular, the structure of the underlying social networks is analyzed. Different research efforts aimed at studying media content distribution and user behavior are then presented. Finally, the semantics of content found in these collections are addressed. Section 2.4 describes three taxonomies, one for each media collection, proposed to sum up and classify the main research issues being addressed. An overview of results achieved in the media annotation domain is presented in Sect. 2.5, whereas Sect. 2.6 describes diverse research efforts targeted at developing novel and efficient data mining techniques to (a) extract relevant semantics from image tags, (b) train concept-based classifiers and automatically organize a set of video clips relative to a given event, and (c) efficiently categorize a huge amount of documents exploiting Wikipedia knowledge. Finally, Sect. 2.7 draws conclusions and suggests research directions offered by the community-built media collections.
2.2 Social Media-Sharing Communities
The past few years have witnessed the rapid proliferation of social networking sites, wikis, blogs, and media-sharing communities. The advent of media-sharing communities, partially spurred by the popularity of affordable hand-held image and video recording devices (e.g., digital cameras, camera phones), has favored the growth of social networks. Nowadays, hundreds of millions of Internet users are self-publishing consumers. This has resulted in user-generated content (UGC)
becoming a popular and everyday part of the Internet culture, thus establishing new viewing patterns and novel forms of social behavior. This section reviews different collections of user-contributed media like YouTube [1] (see Sect. 2.2.1), Flickr [2] (see Sect. 2.2.2), and Wikipedia [3] (see Sect. 2.2.3) by discussing their main features. We discuss these media collections as representatives of the rapid growth witnessed in the Internet-based multimedia domain. In addition, the interest of Web users in these Web services has significantly grown, attracting increasing attention from the scientific community.
2.2.1 YouTube
YouTube [1] is one of the largest and most successful online services allowing users to upload, share, and watch video material freely and easily. Since its establishment in early 2005, YouTube has become one of the fastest-growing Web sites and ranks fourth in Alexa’s [4] top popular site list, with 24 h of new video content uploaded every minute, as of March 2010. YouTube’s primary features include the ability to upload and play back video clips. Any user with a Web browser can view YouTube videos, but users are required to create an account to publish their own content and interact with each other. Registered users are assigned a profile page, named a “channel”, which serves as an index to the user’s uploaded material. Users may easily customize the looks of their channel page by selecting an available graphical theme or by creating a new one. They may optionally disclose their personal details, subscribe to other users’ videos, or “make friends” with them. Comments can be posted by registered users on another user’s profile page or on a specific video’s page. Viewers can additionally rate videos or join groups that focus on particular interests. YouTube thus shows its strong community nature: it is a social networking site, with the added feature of hosting video content [5]. Videos can be uploaded in most existing container formats and are automatically converted into the Adobe Flash Video format (FLV). The Adobe Flash Player browser plug-in, required to play the FLV format, is one of the most common pieces of software installed on personal computers. Currently, video clips uploaded by standard users are limited to 10 min in length and a file size of 2 GB. In YouTube’s early days, it was possible for users to upload longer videos, but a time restriction was introduced in March 2006, when it was found that the majority of videos exceeding this length were unauthorized uploads of copyrighted materials. Since November 2008, YouTube has been making an ongoing effort to improve picture quality. Videos were thus made available in HD format and currently use the H.264/MPEG-4 AVC codec, with stereo AAC audio. YouTube assigns each video a distinct 11-digit identifier, composed of digits and (uppercase and lowercase) letters. Associated with the videos are some metadata, including the name of the uploader, the date of upload, and the number of views,
ratings, and comments. A title, a category, a set of keywords (tags), and a textual description are also provided by the user who added the video. A list of related videos is generated by YouTube, which consists of links to other videos that have a similar title, description, or tags, and are thus supposedly similar in content. Uploaders may also specify a set of related videos of their choice. YouTube provides a dedicated feature to share videos on other online communities (e.g., Facebook [6], Twitter [7], MySpace [8]) and makes it easy to embed them in an external Web page by automatically generating the required HTML code.
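YouTube's actual related-video mechanism is not public; purely as an illustration of the kind of title and tag similarity mentioned above, the following sketch scores hypothetical video metadata records by the overlap of their title words and tag sets. All identifiers and records are invented for the example.

def token_set(text):
    """Lowercase, whitespace-split bag of words for a crude textual comparison."""
    return set(text.lower().split())

def overlap(a, b):
    """Jaccard overlap between two sets; 0.0 when both are empty."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

def related_videos(target, catalog, k=3):
    """Rank other videos by combined title-word and tag overlap with `target`."""
    scores = []
    for video in catalog:
        if video["id"] == target["id"]:
            continue
        score = (overlap(token_set(target["title"]), token_set(video["title"]))
                 + overlap(set(target["tags"]), set(video["tags"])))
        scores.append((score, video["id"]))
    return [vid for _, vid in sorted(scores, reverse=True)[:k]]

# Hypothetical metadata records, loosely mirroring the fields listed above.
catalog = [
    {"id": "a1b2c3d4e5f", "title": "Funny cat compilation", "tags": ["cat", "funny", "pets"]},
    {"id": "f6g7h8i9j0k", "title": "Funny dog compilation", "tags": ["dog", "funny", "pets"]},
    {"id": "q1w2e3r4t5y", "title": "Guitar lesson for beginners", "tags": ["music", "guitar"]},
]

print(related_videos(catalog[0], catalog, k=2))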
2.2.2 Flickr
Flickr [2] is a popular photo-sharing Web site that allows users to store, search, and share their photos with family, friends, and the online community. Flickr, hosted on 62 databases across 124 servers, manages about 800,000 user accounts per pair of servers and contained more than 5 billion images as of August 2009. Launched in February 2004 by a Vancouver-based company, Flickr was originally a multiuser chat room, called “FlickrLive”, with the capability of exchanging photos in real time. In March 2005, Flickr was acquired by Yahoo! and became a photo-sharing Web site allowing users to upload, share, and view photos freely. In March 2009, the ability to upload and view HD videos was also added to Flickr’s features. Users can join the Flickr network by means of two account types. Free accounts allow users to upload 100 MB of images in a month and at most 2 videos. However, free account users can manage at most 200 photos in their photostream. A photo can be added to at most ten groups, and statistics about photos are not accessible. A free account is automatically deleted when it has been inactive for 90 consecutive days. “Pro accounts”, characterized by unlimited storage, allow users to upload any number of images and videos every month and to access account statistics. Furthermore, each image can be added to up to 60 groups. However, pro account users are charged a yearly fee of about 25 dollars. Flickr, as a photo-sharing Web site, provides both private and public image storage. Private photos are visible by default only to the photo owner. By contrast, public photos are viewable by all Flickr users. In addition, each user maintains a list of “favorite photos” on the Flickr Web site, which is publicly visible to the user’s contacts when they log into Flickr. However, they can also be marked as viewable by friends and/or family. Furthermore, for each uploaded image, belonging to either the private or the public category, the user (photo owner) can define a “contact list” to control image access for a specific set of users. Flickr users, characterized by common interests, can gather to form self-organized communities, referred to as groups. The main purpose of groups is to facilitate the sharing of user photos in the group pool, i.e., a collection of photos shared by any member with the group. There are three types of groups: (a) public, where anyone can see the group photo pool, and anyone can join it, (b) public, where anyone can
see the group but only "invited" users can join it, and (c) private, where the group is hidden from the community and an invitation is required both to see and to join it. Browsing techniques and diverse sophisticated search functionalities are also available on Flickr. For example, it is possible to filter the search results according to geographical locations, time intervals, and media type. Since community-built photo collections are typically very large, the efficient browsing of these collections is still an open research issue. Different approaches aimed at improving the browsing of large image collections have been proposed in the literature [9, 10].
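As a minimal sketch of the faceted filtering just described (location, time interval, and media type), and not of Flickr's actual search implementation, the following fragment filters an in-memory list of hypothetical photo records:

from datetime import date

# Hypothetical photo records carrying the metadata facets mentioned above.
photos = [
    {"id": 1, "lat": 45.07, "lon": 7.69, "taken": date(2009, 7, 14), "media": "photo"},
    {"id": 2, "lat": 48.86, "lon": 2.35, "taken": date(2010, 1, 3), "media": "video"},
    {"id": 3, "lat": 45.44, "lon": 9.19, "taken": date(2009, 8, 2), "media": "photo"},
]

def search(photos, bbox=None, start=None, end=None, media=None):
    """Filter by bounding box (min_lat, min_lon, max_lat, max_lon), date range, and media type."""
    results = []
    for p in photos:
        if bbox and not (bbox[0] <= p["lat"] <= bbox[2] and bbox[1] <= p["lon"] <= bbox[3]):
            continue
        if start and p["taken"] < start:
            continue
        if end and p["taken"] > end:
            continue
        if media and p["media"] != media:
            continue
        results.append(p["id"])
    return results

# Photos taken in (roughly) northern Italy during summer 2009 -> [1, 3]
print(search(photos, bbox=(44.0, 6.0, 46.5, 12.0),
             start=date(2009, 6, 1), end=date(2009, 9, 1), media="photo"))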
2.2.3 Wikipedia
Wikipedia [3] is a free, collaborative, multilingual encyclopedia. Any Web user can collaborate to create or edit a Wikipedia page. Since 2001, this effort of the Web community has produced over 15 million articles. The corpus is composed of several collections of articles in different languages. For example, the English version contains over 3.2 million articles as of March 2010, whereas the German version contains more than 1 million. In recent years, Wikipedia has come to be considered the most powerful, accurate, and complete free encyclopedia available. Due to its nature, Wikipedia is one of the most dynamic and fastest-growing resources on the Web. For instance, articles about events are often added within a few days of their occurrence. Moreover, since everyone can edit the information, errors are usually removed after a few revisions by other Web users. Unlike other encyclopedias, Wikipedia is freely available and covers a huge number of topics. For example, Encyclopedia Britannica, one of the oldest encyclopedias and a reference work for the English language, with articles typically contributed by experts, contains only around 120,000 articles in its latest release. Indeed, in [11] Wikipedia was found to be very similar to Encyclopedia Britannica in terms of accuracy, but with greater coverage of topics.
The main advantage of Wikipedia is its community, which improves and controls the content of the encyclopedia. Except for a few pages, every article may be edited anonymously or with a user account, while only registered users may create a new article. Once a new article has been added, the page is owned by the community, which can modify the content. Even though every Web user can contribute to the growth of the encyclopedia, the Wikipedia community has established "a bureaucracy of sorts", including a clear power structure that gives volunteer administrators the authority to exercise editorial control. The administrators are a group of privileged users who have the ability to delete pages, lock articles from being changed in case of vandalism or editorial disputes, and block users from editing.
Since many Web users may not be proficient at creating and managing Web content, the editing model of Wikipedia is based on "wiki". This technology allows
even inexpert users to create and/or edit complex Web pages with structured information, such as internal and external links, tables, images, and videos. Furthermore, the wiki model makes changes to an article immediately available, even if they contain errors. The German edition of Wikipedia is an exception to this rule. It has been testing a system of maintaining stable versions of articles to permit readers access only to versions of articles that have passed certain reviews. Many features have been implemented to assist contributors. For example, the “History” page attached to each article records every single past revision of the article. This feature makes it easy to compare old and new versions, undo changes that an editor considers undesirable, or restore lost content. The “Discussion” pages associated with each article are used to coordinate work among multiple editors. Pieces of software such as Internet bots (e.g., Vandal Fighter) are in wide use to remove vandalism as soon as it is committed, to correct common misspellings and stylistic issues, or to ensure that new articles comply with a standard format [12]. Since Wikipedia grows very dynamically and is human contributed and mainly composed of free text, the structure of the media collection is very complex. Dumps of articles are generated automatically every week and can be downloaded to apply offline analyses of the content. The basic entry in Wikipedia is an article (or page), which defines and describes an entity or an event and consists of a hypertext document with hyperlinks to other pages, within or outside Wikipedia. The role of the hyperlinks is to guide the reader to pages that provide additional information about the entities or events mentioned in an article. Each Wikipedia article is uniquely referenced by an identifier, which consists of one or more words separated by spaces or underscores, and occasionally a parenthetical explanation. Some articles can contain an Infobox, a table which sums up key information about the article. The community provides a collection of templates for different categories (e.g., City, Company) to avoid ambiguous annotations and present information with a uniform layout. The hyperlinks within Wikipedia are created using the articles’ unique identifiers. Since every article can be edited by any user, one critical issue is the consistency with respect to these identifiers. “Redirect” pages, which contain only a redirect link, exist for each alternative name of a concept and point readers to the one preferred by Wikipedia. Another issue is the different meanings that words can assume according to the context. Disambiguation pages are specifically created for these ambiguous entities and are identified by the parenthetical explanation “(disambiguation)”. These pages consist of links to articles defining the different meanings of the entity.
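Assuming a redirect table has already been extracted from such a dump, a small sketch of how identifiers, redirect pages, and disambiguation pages might be handled when processing articles offline could look as follows; the titles and mappings are invented examples rather than real dump contents.

def normalize_title(raw):
    """Wikipedia identifiers use underscores; normalize spacing and the leading capital."""
    title = raw.strip().replace(" ", "_")
    return title[:1].upper() + title[1:] if title else title

# Hypothetical redirect table extracted from a dump: alternative name -> preferred article.
redirects = {
    "NYC": "New_York_City",
    "New_york": "New_York_City",
}

def resolve(title, redirects, max_hops=10):
    """Follow redirect links to the preferred identifier, guarding against cycles."""
    seen = set()
    title = normalize_title(title)
    while title in redirects and title not in seen and len(seen) < max_hops:
        seen.add(title)
        title = redirects[title]
    return title

def is_disambiguation(title):
    """Disambiguation pages are flagged by their parenthetical explanation."""
    return normalize_title(title).endswith("(disambiguation)")

print(resolve("new york", redirects))                  # -> New_York_City (via the New_york redirect)
print(is_disambiguation("Mercury (disambiguation)"))   # -> True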
2.3 Community Nature and Media Collection Features
Each community-contributed media collection has different features due to both the structure of the Web service and the managed media type. Many research efforts have been devoted to studying the structure of social networks [13], proposing folksonomy models [14], and analyzing the properties of information spreading through communities [15]. In this section, we discuss diverse works analyzing the community nature and the media collection features of Flickr, YouTube, and Wikipedia. We propose a taxonomy, shown in Fig. 2.1, to categorize the discussed works in three main topics: (a) social networks (see Sect. 2.3.1), (b) media content distribution (see Sect. 2.3.2), and (c) semantics of media content (see Sect. 2.3.3). The social network topic includes two main issues, network structure [5, 13, 16] and growth trend [17], while the media content distribution topic includes lifespan of content [18] and social popularity [19, 20].

Fig. 2.1 Taxonomy of user-contributed media collection features
2.3.1 Social Networks
Social networks have been studied from two different points of view: network structure [5, 13, 16] and growth trend [17], as shown in the taxonomy reported in Fig. 2.1. From the structure point of view, social networks are modeled by graphs where nodes represent the users and links represent the friendships among them. Different graph mining algorithms have been exploited to extract relevant and useful information on social community structure [13]. Online social networking sites also represent a unique opportunity to study the dynamics of social networks and the ways they grow. The growth trend (see taxonomy in Fig. 2.1) in the Flickr social network has been investigated in [17].
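As a concrete, toy-scale illustration of this graph abstraction (using the networkx library and an invented friendship list), the following fragment computes a few of the elementary structural measures from which such graph mining analyses typically start:

import networkx as nx

# Toy friendship network: nodes are users, edges are (undirected) friendship links.
G = nx.Graph()
G.add_edges_from([
    ("alice", "bob"), ("bob", "carol"), ("carol", "alice"),   # a tight triangle
    ("carol", "dave"), ("dave", "erin"),                      # a chain hanging off it
])

# Elementary structural measures often used as a starting point for community analysis.
print("degree:", dict(G.degree()))
print("average clustering:", nx.average_clustering(G))
print("connected components:", [sorted(c) for c in nx.connected_components(G)])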
In [5], the social network structure of YouTube has been analyzed as a first step toward understanding the kind of media and social space that it represents. The analysis draws upon a crawl of YouTube’s user profiles, characterized by the aggregate of all tags used by the authors to annotate the uploaded videos. Clusters of authors and associated keywords were identified through vector-space projection and hierarchical cluster analysis. Nine author clusters were thus found, corresponding to distinct genres of popular Internet video. Similar clustering was identified by analyzing the authors’ networks of friends, represented as graphs. These results show that socially coherent activity (friendship among users) is strongly characterized by semantic coherence (similar descriptive tags). A different research issue has been addressed in [16]. The work is founded on the observation that Web surfers are not required to register or upload videos in order to view the existing materials; a large proportion of YouTube’s audience is in fact expected to fall into this category of users. As a consequence, the network among (registered) users does not necessarily reflect that among the videos. The social networking among videos has thus been studied, and relationships between pairs of related videos have been modeled by means of a directed graph. Measurements on the graph topology revealed definite small-world characteristics. This phenomenon, also known as six degrees of separation, refers to the principle that items (e.g., people, files, URL links) within a given environment are linked to all others by short chains of “acquaintances”. Similar results were achieved on other real-world user-generated graphs. However, compared with, for example, the graph formed by URL links in the World Wide Web, the YouTube network of videos exhibits a much shorter characteristic path length, thus implying a much more closely related group. Authors in [17] analyze the growth trend in the Flickr social network in order to understand the link formation process. Flickr can be modeled as a directed network, where users represent nodes and directed edges model links between a pair of users. The presence of a link from a user to another does not imply the presence of the reverse link. As shown in [17] the creation of the first link affects the second, since users tend to rapidly respond to the incoming link by creating a link in the reverse direction. Since users explore the network by visiting their neighbors, users tend to connect to nearby users in the network. Furthermore, the number of links created and received by users is directly proportional to their current number of links.
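The small-world observation rests on two quantities: the characteristic path length and the clustering coefficient. The sketch below computes both on the undirected projection of a toy related-video graph and compares the clustering against a random graph of the same size; the graph is an invented stand-in for a real crawl, and the networkx library is assumed to be available.

import networkx as nx

# Toy directed "related video" graph standing in for a crawl.
D = nx.DiGraph([
    ("v1", "v2"), ("v2", "v3"), ("v3", "v1"),
    ("v3", "v4"), ("v4", "v5"), ("v5", "v3"),
])

# Work on the largest connected component of the undirected projection.
U = D.to_undirected()
giant = U.subgraph(max(nx.connected_components(U), key=len))

path_len = nx.average_shortest_path_length(giant)   # "characteristic path length"
clustering = nx.average_clustering(giant)

# Compare against a random graph of the same size: a short path length combined with
# clustering well above the random baseline is the usual small-world signature.
R = nx.gnm_random_graph(giant.number_of_nodes(), giant.number_of_edges(), seed=42)
print(path_len, clustering, nx.average_clustering(R))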
2.3.2 Media Content Distribution
The enormous, steady growth of the well-known online media-sharing communities, such as YouTube, Flickr, Facebook, MySpace, and the emergence of numerous similar Web sites, confirm the mass market interest. While similar on the surface to standard commercial media distribution systems, UGC collections follow in fact
much less predictable evolution trends. Furthermore, the popularity of UGC presents rather diverse and complex dynamics, thus making traditional content popularity predictions unsuitable. Works devoted to studying media content distribution address either (a) the lifespan of the media content or (b) social popularity as shown in Fig. 2.1. The lifespan of the media content is highly dependent on the user behavior and interests. Several studies have been investigating the main reasons for the popularity and diffusion of videos and photos. For example, in [18] the analysis of the popularity evolution of user-produced videos in YouTube and other similar UGC Web sites is presented. The key observation is that understanding the popularity characteristics can prove pivotal in discovering weaknesses and bottlenecks in the system and suggest policies to improve it. The study was conducted on several datasets containing video meta information crawled from YouTube and Daum [21], a popular search engine and UGC service in Korea. Their analysis reveals that the popularity distribution of videos exhibits power-law behavior with a steep truncated tail (exponential cutoff), suggesting that requests for videos are highly skewed toward popular files. Rather than this skewness being a natural phenomenon due to the low level of interest in many UGC videos, filtering effects in search engines, which typically favor a small number of popular items, seem most likely responsible for the significant imbalance in the video popularity distribution. Popular videos thus tend to gain more and more views, while niche videos reach a much smaller audience than expected. Proper leverage of the latter could increase the total number of views by as much as 45% and reveal the latent demand created by the search engine bottleneck. Other studies consider also the influence of social contacts in the popularity of media content (see taxonomy in Fig. 2.1). For example, the characterization of the Flickr media collection has been studied in [20]. An analysis has been performed along three dimensions: (a) the temporal dimension, which allows tracking the user interest in a photo over time, (b) the social dimension, aimed at discovering the social incentives of users in viewing a photo, and (c) the spatial dimension, which analyzes the geographic distribution of user interest in a photo. Experimental results reported in [20] show that users discover new photos within 3 h of their upload. Furthermore, for the most popular photos (i.e., photo with high view frequency) almost 45% of the new photo views are generated within the first two days, while for infrequent images (i.e., photo with low view frequency), this ratio increases to 82%. Moreover, the following two factors affect photo popularity: (a) the social network behavior of users and (b) photo polling. In fact, people with a large social network within Flickr have their photos viewed many times, while people with a poor social network have their images accessed only few times. Finally, the geographic distribution of user interest is also dependent on the photo popularity. In fact, the geographic interest in a photo is worldwide when the photo has many views, while for infrequently viewed photos the geographic distribution is around a given geographic location.
In [19] the analysis performed is more focused on the influence of social contacts on the bookmarking of favorite photos to estimate the potential spreading capability of 1,000 favorite photos, over 15,000 unique fans and 35,000 favorite markings. As shown in [19] the information dissemination through social links can be modeled as a slight variation of the model exploited to study the spread of infectious diseases throughout human populations [22]. A social cascade begins when the first user includes the photo in his/her list of favorites. Then, the cascade continues along social links. The following aspects of Flickr in the social behavior of users have been observed: (a) users maintain their favorite photos indefinitely (e.g., until photos are removed from the list) and (b) the higher the number of neighbor users, the higher the dissemination rate. Furthermore, the time required by the dissemination is in inverse relation to the number of neighbors. Hence, social links are an effective mechanism for disseminating information in online social networks.
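A deliberately simplified, independent-cascade-style simulation of this favorite-marking spread is sketched below; the contact graph, adoption probability, and number of rounds are arbitrary illustrative choices rather than the calibrated epidemiological model used in [19, 22].

import random

# Toy social graph: user -> set of contacts who see that user's favorite markings.
contacts = {
    "u1": {"u2", "u3", "u4"},
    "u2": {"u1", "u5"},
    "u3": {"u1"},
    "u4": {"u1", "u5"},
    "u5": {"u2", "u4"},
}

def simulate_cascade(seed, p=0.4, rounds=5, rng=None):
    """Each round, every newly 'infected' user marks the photo as a favorite and
    exposes their contacts, who adopt independently with probability p."""
    rng = rng or random.Random(7)
    adopted, frontier = {seed}, {seed}
    for _ in range(rounds):
        new = set()
        for user in frontier:
            for friend in contacts.get(user, ()):
                if friend not in adopted and rng.random() < p:
                    new.add(friend)
        adopted |= new
        frontier = new
        if not frontier:
            break
    return adopted

print(simulate_cascade("u1"))  # users the favorite marking reached in this run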
2.3.3 Semantics of Media Content
In the last few years, an increasing effort has been devoted by the research community to drawing a general picture of the semantic distribution of UGC. Since this aspect is fundamental to understanding the interests and the predominant uses, we included it in our proposed taxonomy, shown in Fig. 2.1.
In [16], an in-depth measurement study of the statistics of YouTube videos is provided. Based on a dataset including metadata about more than 3 million videos, content distribution among categories has first been analyzed, showing that music videos are prevalent (22.9%), followed by entertainment videos (17.8%) and comedy (12.1%). Accordingly, the video length distribution indicates that the majority of videos do not exceed 3–4 min in length.
Similar works have analyzed the distribution of the encyclopedic content available on Wikipedia. Wikipedia articles are classified into categories and subcategories exploiting a predefined taxonomy, enriched by article authors. When an article is uploaded, authors can freely choose one or more categories for their new encyclopedia entry, or create a new one. Due to this uncontrolled mechanism, the categories associated with a given article may not always optimally suit its content. The distribution and the growth of Wikipedia topics were studied in [23]. The approach employed for this analysis is based on the evaluation of the semantic relatedness of each article with taxonomy categories. The topic coverage of Wikipedia articles is uneven: "Culture and arts", "People", and "Geography" cover more than 50% of all the articles, while "Society and social sciences" covers 12%. However, some of the other topics are growing rapidly. The results of the analysis, reported by the Wikimedia Foundation, show that the geographic distribution of articles is highly uneven. Most articles are written about North America, Europe, and East Asia, while only a few are about Africa and large parts of the developing world.
2.4 Taxonomies on Research Issues
A lot of investigation has been carried out on community-contributed media collections to improve the user searching activity and document classification of media collections. To provide a clear overview of the research efforts, we propose three taxonomies (see Figs. 2.2–2.4), one for each media collection, to sum up the main issues addressed in the research.

Figure 2.2 summarizes the works addressing video-sharing communities. A first class of approaches is aimed at studying whether and how meaningful, coherent annotations can be derived from the collaborative tagging effort of community users. The second group of research activities is instead focused on the exploitation of these video collections from a data mining perspective. In the works reviewed, video clips and their associated tags are exploited to (a) train concept-based classifiers and improve query representations in a media retrieval framework or (b) generate relevant metadata about captured events and enhance the user viewing activity.

Fig. 2.2 YouTube: taxonomy of discussed approaches

Figure 2.3 depicts the taxonomy proposed to classify the works focused on photo-sharing communities. Research efforts can be grouped according to the addressed topic, like tag recommendation to effectively support photo annotation, and automatic extraction of semantics. Among the tag annotation works, two main approaches have been proposed. The first one resorts to basic techniques that consider tag frequencies in the past to suggest useful and relevant tags, while the second one exploits collective knowledge, residing in the Flickr community, to support photo annotations. Furthermore, numerous works have been devoted to exploiting data mining techniques to (a) discover location and event semantics from photo tags, (b) identify clusters of similar photographs, or (c) summarize a large set of images.

Fig. 2.3 Flickr: taxonomy of discussed approaches

Figure 2.4 shows a possible taxonomy to categorize the discussed works relating to the analysis and the employment of Wikipedia data. Three main topics can be identified according to the use of Wikipedia information and the purpose of the work: (a) word analysis, (b) document classification, and (c) document search engine. The word analysis addresses the evaluation of the semantic relatedness among Wikipedia topics and the disambiguation of terms in documents, according to the Wikipedia categories. Wikipedia information can also be used to build more efficient text representations in terms of classification performance. Different approaches have been proposed, based on (a) the bag-of-word representation, (b) the analysis of Wikipedia taxonomies, and (c) the analysis of the Wikipedia graph structure. Moreover, some works have been devoted to developing search engines which retrieve documents according to (a) the semantic analysis of terms in the documents, based on Wikipedia taxonomies, or (b) the employment of ontologies extracted by Wikipedia infoboxes.

Fig. 2.4 Wikipedia: taxonomy of discussed approaches

The proposed taxonomies are useful for categorizing the works discussed in Sects. 2.5 and 2.6.
2.5 Media Annotation
An interesting challenge when dealing with knowledge collected on a large scale is that of making it searchable and thus usable. Despite the growing level of interest in multimedia Web search, most major Web search engines still offer limited search functionality and exploit keywords as the only means of media retrieval [24]. In the context of media (e.g., video, images, documents), this requires content to be annotated, which can be done manually or automatically. In the first case, the process is an extremely time-consuming, and hence costly, one. As pointed out in [25], a potential drawback of manual annotation is its subjective nature as an indicator of content. The same media may produce rather disparate reactions from different users or groups of users, who may also have varying motivations for annotating it. This would result in the media being annotated very differently. However, automatic annotation of media content may require content analysis algorithms to extract descriptions from media data. Community-built media collections are typically designed in such a way as to enable user queries on the content, and thus provide varying levels of media annotation. Tagging is the most popular form of annotation and has proved successful over the past years, as shown in [26–31]. In addition, it is available at virtually no cost, because the annotation task is spread across the entire community.
2.5.1 Photo Annotation
Photos uploaded on Flickr can be enriched by different kinds of metadata, in the form of tags, notes, number of views, comments, number of people who mark the photo as their favorite, and even geographical location data. The analysis of “how users tag photos” and “what kind of tags they provide” is presented in [31]. By analyzing 52 million photos collected between February 2004 and June 2007, authors show that the tag frequency distribution can be modeled by
a power law [32], and the probability of a tag having tag frequency x is proportional to x^-1.15. The head of the power-law fit contains tags that would be too generic to be useful as a suggestion, while the tail contains the infrequent tags that often correspond either to misspelled words or to highly specific tags. Furthermore, the distribution of the number of tags per photo also follows a power law: the probability of having x tags per photo is proportional to x^-0.33. The head of this power law contains photos annotated with more than 50 tags, and the tail contains more than 15 million photos with only a single tag, while almost 17 million photos have only two to three tags. The majority of photos is thus annotated with only a few tags, which describe where the photo was taken, who or what appears in the photo, and when the photo was taken.
Different works have addressed the research issues on photo annotation. To support automatic photo annotation, diverse tag recommendation techniques have been studied. As shown in the taxonomy depicted in Fig. 2.3, the proposed approaches can be classified into basic techniques and techniques supported by collective knowledge. Flickr itself offers a service to suggest tags when a user wants to tag a picture. Suggested tags, sorted lexicographically, include recently used tags and those most frequently employed by the user in the past. However, this service is rather limited.
One step further toward personalized tag suggestion for Flickr was presented in [27]. Three algorithms have been proposed to suggest a ranked list of tags to the user. The proposed algorithms receive as input parameters the identity of the user, an initial set of tags (if available), and the corresponding tagging history of all users. The recommendation is based on the tags that the user or other people have exploited in the past. Furthermore, the suggested tags are dynamically updated with every additional tag entered by the user. The first two methods consider only those tags exploited by the user in the past and suggest a ranked list of tags, sorted by considering both tag frequency and past inserted tags. The last method considers both the set of tags exploited by other people in the past and the tags similar to those entered by the user for the same picture in the past. In particular, a set of promising groups is first identified by analyzing both the user and the group profiles. Then, for each of these groups, a ranked list of suggested tags is generated according to tag frequency and past inserted tags.
To validate the methods proposed in [27], different pictures were downloaded and divided into two groups: (a) 200 pictures with four to eight given tags and (b) 200 pictures with more than ten given tags. The method which also considers tags exploited by other people is more effective in suggesting relevant tags. Furthermore, the results obtained for the second set of pictures are better than those obtained for the first set because users who add more tags to an individual picture usually have a better tagging history. In fact, the methods proposed in [27] yielded better accuracy on pictures with a large number of given tags.
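Returning to the tag-frequency statistics at the beginning of this subsection, a power-law exponent of this kind can be estimated from observed frequencies with the standard continuous maximum-likelihood estimator. The sketch below applies it to an invented frequency sample and ignores both the discrete nature of the data and any cutoff, so it is illustrative only.

import math

def powerlaw_exponent(values, x_min=1.0):
    """Continuous MLE for alpha in p(x) ~ x^(-alpha), for x >= x_min."""
    xs = [x for x in values if x >= x_min]
    return 1.0 + len(xs) / sum(math.log(x / x_min) for x in xs)

# Hypothetical tag-frequency sample: how many photos each tag appears on.
tag_frequencies = [1, 1, 1, 2, 1, 3, 1, 8, 1, 2, 40, 1, 5, 2, 1, 120, 1, 2, 1, 6]

print(round(powerlaw_exponent(tag_frequencies), 2))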
in two steps. From a photo with user-defined tags, an ordered list of candidate tags is derived for each of the user-defined tag, based on co-occurrence. Then, the lists of candidate tags are aggregated and classified to generate a ranked list of recommended tags. The co-occurrence between two tags is computed as the number of photos where both tags are used in the annotation. The obtained value is normalized with respect to the overall frequency of the two tags individually. Two measures have been proposed to normalize the tag co-occurrence: symmetric and asymmetric measures. The first, according to the Jaccard coefficient [33], is defined as the size of the intersection (co-occurrence of the two tags) divided by the size of the union of the two tags (sum of the frequencies of two tags). This measure can be exploited to identify equivalent tags (i.e., tags with similar meaning). By contrast, the asymmetric measure evaluates the probability of finding tag tj in annotations, under the condition that these annotations also contain tag ti. For each user-defined tag, an ordered list of candidate tags is derived from the collective knowledge (i.e., user-generated content created by Flickr users). The larger the collective knowledge, the more relevant and useful the list of candidate tags. Given diverse lists of candidate tags, they are merged in a single ranked list by means of two strategies: voting and summing. The first one computes a score for each candidate tag, and the ranked list of recommended tags is obtained by sorting the tags according to the number of votes. On the other hand, the summing strategy computes for each candidate tag the sum of all co-occurrence values between the considered tag and the user-defined tags. To evaluate the recommendation system proposed in [31], 331 photos with at least one user-defined tag have been analyzed. 131 photos were used as a training set, while the remaining 200 photos were the actual test set. Experimental results show the effectiveness of the proposed recommendation system in selecting relevant tags. For almost 70% of the photos, the system suggests a good recommendation at the first position of the ranked list, and for 94% a good recommendation is provided among the top 5 ranked tags.
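A compact sketch of the co-occurrence strategy just described is given below: it computes the symmetric (Jaccard-style) and asymmetric normalizations over a tiny invented photo corpus and merges the candidate lists with the summing strategy. It is a simplification of the system in [31], not a reimplementation of it.

from collections import Counter, defaultdict

# Made-up "collective knowledge": tag annotations of photos already in the collection.
corpus = [
    {"sunset", "beach", "sea"},
    {"sunset", "beach", "sand"},
    {"sunset", "sky", "clouds"},
    {"beach", "sea", "surf"},
]

tag_freq = Counter(t for tags in corpus for t in tags)
pair_freq = defaultdict(int)
for tags in corpus:
    for a in tags:
        for b in tags:
            if a != b:
                pair_freq[(a, b)] += 1  # co-occurrence count of the ordered pair

def symmetric(a, b):
    """Jaccard-style normalization: co-occurrence over the two tags' combined frequency."""
    return pair_freq[(a, b)] / (tag_freq[a] + tag_freq[b] - pair_freq[(a, b)])

def asymmetric(a, b):
    """P(b | a): how often b appears in annotations that already contain a."""
    return pair_freq[(a, b)] / tag_freq[a]

def recommend(user_tags, top_n=3):
    """Summing strategy: add up the asymmetric co-occurrence scores of every candidate."""
    scores = Counter()
    for t in user_tags:
        for candidate in tag_freq:
            if candidate not in user_tags:
                scores[candidate] += asymmetric(t, candidate)
    return [tag for tag, _ in scores.most_common(top_n)]

print(recommend({"sunset", "beach"}))  # e.g. suggests "sea", "sand", ...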
2.5.2
Video Annotation
To extend the accessibility of video materials and enhance video querying, manual or automatic annotation is needed. Since manual annotations often reflect personal perspective, videos may be tagged very differently by different users. However, the study reported in [25] suggested that user interaction with multimedia resources within social networks could help generate more consistent and less ambiguous tagging semantics for video content. In particular, it has been observed that when multiple users are allowed to label content over a long period of time, stable tags tend to emerge [34]. This can be thought of as a form of user consensus built by letting users interactively correct tags, similarly to wiki pages, thus providing more reliable metadata. This form of “collaborative tagging” (see taxonomy in Fig. 2.2) has been investigated by inferring semantics for the content from user behavior, both explicitly (i.e., through direct user input) and implicitly (i.e., by monitoring
user activity). To this aim, Facebook's public APIs were exploited to build a social network application, Tag!t. The application makes it possible to share and interact with video content in a broader way than allowed by Facebook's own features. Besides sharing videos and aggregating them in collections, users can tag specific timestamps of their friends' videoclips and link additional materials (e.g., videoclips, images, Web pages) to them. The application also keeps track of user interaction with the media itself (e.g., play/pause/seek events). Experiments with the Tag!t application showed that users within a selected group tend to tag in a similar manner. In addition, the semantics suggested by a user were found to be only slightly biased by other users' tagging of the same content, thus indicating that collaborative tagging leads to coherent semantics.
2.5.3
Document Annotation
Wikipedia articles can also be used to analyze words in order to improve keyword extraction from documents and disambiguation algorithms, as shown in the taxonomy depicted in Fig. 2.4. For example, Semantic MediaWiki [35] is an extension of the MediaWiki software for annotating the wiki content within the articles. The aim of this tool is to improve consistency in Wikipedia articles by reusing the information stored in the encyclopedia. Some approaches have attempted to detect the semantic relatedness between terms in documents to identify possible document topics. In [36] the authors introduced "Explicit Semantic Analysis" (ESA), which computes the semantic relatedness between fragments of natural language text using a concept space. The method employs machine learning techniques to build a semantic interpreter which maps fragments of natural language to a weighted vector of Wikipedia concepts ordered by their relevance, named the "interpretation vector". The relatedness between different interpretation vectors is evaluated by means of cosine similarity. In [37], the Wikipedia Link-Based Measure is described. The approach identifies a set of candidate articles which represent the analyzed concepts and measures the relatedness between these articles using a similarity measure which can be a tf-idf-based measure, the Normalized Google Distance, or a combination of both. Experimental results show that the ESA approach is effective in identifying the relatedness between terms. The Wikify! system [38] supports both keyword extraction from documents and word sense disambiguation, assigning to each extracted keyword a link to the correct Wikipedia article. The keyword extraction algorithm is based on two steps: (a) candidate extraction, which extracts all possible n-grams that are also present in a controlled dictionary, and (b) keyword ranking, which is based on tf-idf statistics, the χ² independence test, or keyphraseness (i.e., the probability that a term is selected as a keyword for a document). Three different disambiguation algorithms are integrated in the system. The first one is based on the overlap between the terms in the document and a set of ambiguous terms stored in a dictionary. The second one
is based on a Naïve Bayes classifier whose model is built on feature vectors of correlated Wikipedia articles. Finally, a voting scheme which combines the disambiguation results obtained by the previous techniques was also employed. Independent evaluations carried out for each of the two tasks showed that both system components produce accurate annotations. The best performance for the keyword extraction task is achieved by the keyphraseness statistic, with precision, recall, and F-measure of 53.37%, 55.90%, and 54.63%, respectively. The disambiguation procedure reaches an accuracy of 94% at best.
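The keyphraseness statistic lends itself to a very compact implementation. The sketch below, in Python, assumes (as is commonly done, though the chapter does not spell it out) that keyphraseness is estimated from Wikipedia link statistics: the number of articles in which a term appears as a link anchor divided by the number of articles in which it appears at all. The counts in the example are invented.

```python
from collections import Counter

def keyphraseness(term, anchor_counts, occurrence_counts):
    """Estimated probability that `term` is selected as a keyword."""
    occ = occurrence_counts.get(term, 0)
    return anchor_counts.get(term, 0) / occ if occ else 0.0

def rank_candidates(ngrams, anchor_counts, occurrence_counts, top_k=10):
    """Rank candidate n-grams of a document by their keyphraseness."""
    scored = {g: keyphraseness(g, anchor_counts, occurrence_counts) for g in ngrams}
    return sorted(scored, key=scored.get, reverse=True)[:top_k]

# hypothetical corpus statistics
anchor_counts = Counter({"machine learning": 1200, "learning": 300,
                         "support vector machine": 800})
occurrence_counts = Counter({"machine learning": 2000, "learning": 90000,
                             "support vector machine": 1000})
candidates = ["machine learning", "learning", "support vector machine"]
print(rank_candidates(candidates, anchor_counts, occurrence_counts))
```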
2.6
Mining Community-Contributed Media
An overview of different works focused on mining huge collections of community-contributed media is presented in the following. Section 2.6.1 describes different approaches for the extraction of semantics from the photo tags available on Flickr, while Sect. 2.6.2 presents how Wikipedia articles can be used as a knowledge base for the automatic classification of electronic documents. Finally, Sect. 2.6.3 describes two research efforts aimed at categorizing and automatically organizing large sets of video clips.
2.6.1
Semantics Extraction from Photo Tags
Photo tags, in the form of unstructured knowledge without a priori semantics, can be efficiently mined to automatically extract interesting and relevant semantics. Many works have been devoted to these issues, and they can be classified according to the taxonomy shown in Fig. 2.3. A lot of research effort has been devoted to jointly analyzing Flickr tags with photo location and time metadata. The approaches proposed in [39, 40] analyze inter-tag frequencies to discover relevant and recurrent tags within a given period of time [39] or region of space [40]. However, the semantics of specific tags were not discovered. One step further toward the automatic extraction of semantics from Flickr tags was based on analyzing the temporal and spatial distributions of each tag's usage [41]. The proposed approach extracts place and event semantics by analyzing the usage in the space and time dimensions of the user-contributed tags assigned to photos on Flickr. Based on the temporal and spatial tag usage distributions, a scale-structure identification (SSI) approach is employed, which clusters usage distributions at multiple scales and measures the degree of similarity to a single cluster at each scale. Tags can ultimately be identified as places and/or events. The proposed technique is based on the intuition that an event refers to a specific segment of time, while a place refers to a specific location. Hence, relevant patterns for event and place tags "burst" in specific segments of time and regions of space, respectively. In particular, the number of usage occurrences for an event tag should be much higher in a small segment of time than the number of usage occurrences of
that tag outside the segment. However, the segment size and the number of usage occurrences inside and outside the segments significantly affect the analysis. The evaluation has been focused on photos from the San Francisco Bay area. Experimental analysis has been performed on a dataset including both photos and tags. The dataset consists of 49,897 photos with an average of 3.74 tags per photo. Each photo is also characterized by a location and a time. The location represents the latitude–longitude coordinates either of the place where the photo was taken or of the photographed object. The time represents either the photo capture time or the time the photo was uploaded to Flickr. These photos cover a temporal range of 1,015 days, starting from 1 January 2004. The average number of photos per day was 49.16, with a minimum of zero and a maximum of 643. From these photos, 803 unique tags were extracted. The maximum number of photos associated with a single tag was 34,325 for the San Francisco Bay area, and the mean was 232.26. The method described in [41] achieves good precision in classifying tags as either places or events. A parallel effort has been devoted to enhancing the approach proposed in [41] to efficiently mine the huge photographic dataset managed by Flickr. The approach proposed in [42] is organized in three steps. First, the issue of generating representative tags for arbitrary areas in the world is addressed using a location-driven approach. Georeferences associated with the uploaded photographs are initially exploited to cluster photographs. Candidate tags within each cluster are then ranked to select the best representative ones. The extracted tags often correspond to landmarks within the selected area. Second, the method to identify tag semantics proposed in [41] has been exploited. The method allows the automatic identification of tags as places and/or events based on temporal and spatial tag usage distributions. Lastly, tag-location-driven analysis is combined with computer vision techniques to achieve the automatic selection of representative photographs of some landmark or geographic feature. Tags that represent landmarks and places are initially selected by the aforementioned location-driven approach. For each tag, the corresponding images are clustered by the k-Means [33] clustering algorithm according to their visual content to discover varying views of the landmark in question. For this purpose, a range of complementary visual features is extracted from the images. Clusters are subsequently ranked by applying four distinct methods so as to identify the ones which best represent the various views associated with a given tag or location. Finally, images within each cluster are also ranked according to how well they represent the cluster. The proposed techniques have been evaluated on a set of over 110,000 georeferenced photos from the San Francisco area by manually selecting ten landmarks of interest. Results showed that the tag-location-visual-based approach is able to select representative images for a given landmark with an increase in precision of more than 45% over the best nonvisual technique (tag-location based). Across most of the locations, all of the selected images were representative. For some geographical features, the visual-based methods still do not provide perfect precision in image summaries, mainly due to the complex variety of scenarios and ambiguities connected with the notion of representativeness.
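The burst intuition behind the place/event identification can be illustrated with a toy calculation. The sketch below is not the scale-structure identification method of [41]; it only compares a tag's usage rate inside a candidate time segment with its rate outside, which is the signal that method builds on. The usage log is invented.

```python
def burst_score(timestamps, seg_start, seg_end):
    """Ratio between a tag's usage rate inside a time segment and its rate outside it."""
    inside = sum(seg_start <= t < seg_end for t in timestamps)
    outside = len(timestamps) - inside
    seg_len = seg_end - seg_start
    total_len = max(timestamps) - min(timestamps) + 1
    rate_inside = inside / seg_len
    rate_outside = outside / max(total_len - seg_len, 1)
    return rate_inside / (rate_outside + 1e-9)   # values >> 1 suggest an event tag

# hypothetical usage days of a tag such as "marathon" over a three-hundred-day period
usage_days = [3, 4, 120, 121, 121, 122, 122, 123, 300]
print(round(burst_score(usage_days, 120, 124), 1))
```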
2.6.2
Wikipedia
The huge amount of data available in Wikipedia articles represents an interesting media collection that can be used to improve automatic document understanding and retrieval, as shown in Fig. 2.4. An overview of the results achieved in the document categorization domain is presented in Sect. 2.6.2.1, while Sect. 2.6.2.2 discusses some research efforts targeted at developing novel and efficient search engines.
2.6.2.1
Document Classification
The categorization task is usually performed by building models based on the statistical properties of small document collections. The variety of Wikipedia articles, the hyperlink graph structure, and the taxonomy of categories have been employed in different studies to build automatic categorization approaches or improve the performance of existing models. According to the representation, we can divide the discussed works into three categories as depicted in the taxonomy in Fig. 2.4: (a) bag-of-words, (b) Wikipedia taxonomy analysis, and (c) Wikipedia graph analysis. In [43], Gabrilovich and Markovitch presented one of the first works employing Wikipedia as an external resource for the document categorization task. The idea is to improve document representation by using the knowledge stored in the encyclopedia. A feature generator identifies the most relevant encyclopedia articles for each document. Then, the titles of the articles are used as new features to augment the bag-of-words (BOW) representation of the document. In the BOW representation, a document or a sentence is represented as an unordered collection of words, disregarding the structure of the text. This representation is usually associated with statistical measures such as tf-idf (term frequency-inverse document frequency). The tf-idf is used to evaluate how important a word is with respect to a document in a collection: the higher this value, the more representative the word. Empirical evaluation shows that, using background knowledge stored in Wikipedia, classification performance on short and long documents drawn from different datasets can be improved with respect to traditional classification approaches based only on the BOW representation. A similar idea was presented in [44], where the authors automatically constructed a thesaurus of concepts from Wikipedia. The thesaurus was extracted using redirect and disambiguation pages and the hyperlink graph of Wikipedia articles. Similarly to the previous approach, the authors search for candidate concepts mentioned in each document, but then they add synonyms, hyponyms, and correlated concepts of these candidate concepts, used as new features to enrich the BOW representation. This extended knowledge can be leveraged to relate documents which did not originally share common terms. Therefore, such documents are shifted closer to each other in the new representation. The effectiveness of this approach was empirically demonstrated by means of a linear Support Vector Machine (SVM)
[33] to classify documents from different datasets. The micro-averaged and macro-averaged precision–recall break-even point (BEP) were used to compare classification performance with respect to the baseline approach. Compared with the baseline method, the proposed approach yielded an improvement of 2–5%. A new classification model based on Wikipedia information was proposed by Schönhofen in [45]. The relatedness between a document and a category of the Wikipedia taxonomy was computed by evaluating the similarity between that document and the titles of articles classified under that category. Since each article can belong to different categories, relevance statistics are used to rank the categories. The method was tested on the Wikipedia article body, which is not used to build the model, and on news datasets. The best results are achieved by combining Wikipedia categorization with the top terms identified by tf-idf. For example, the accuracy achieved on the news dataset is around 89%. An improvement over Schönhofen's approach was suggested in [46]. The authors propose to exploit the words appearing both in the article titles and in the hyperlinks. In fact, the hyperlinks better characterize the content of the article. Empirical results on a subset of Wikipedia articles show an improvement in precision and recall with respect to Schönhofen's method. Using only the top-3 Wikipedia categories returned by the method, the improvement in precision and recall is around 15% and 35%, respectively. A different text categorization approach based on an RDF ontology extracted from Wikipedia Infoboxes was presented in [47]. The method focuses on news documents of varying themes. For each document, the authors manually selected the Wikipedia category which best relates to its topic. A text document is then converted into a "thematic graph" of entities occurring in the document. Since the thematic graph can include uncorrelated entities, a selection of the most dominant component is applied. Finally, the text is classified according to the best coverage class of the entities belonging to the graph. The accuracy achieved by this approach on two different document collections is worse than that of a Naïve Bayes classifier [33] based on the BOW representation. One of the reasons for misclassifications may be the manual mapping of Wikipedia categories to the document topics. Moreover, news documents – unlike encyclopedia content – may be biased to reflect the interests of the readers. Yet, an interesting highlight of the ontology-based categorization approach is that it does not require a training phase, since all information about categories is stored in the ontology.
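The bag-of-words representation with tf-idf weighting, and its enrichment with Wikipedia concepts, can be sketched very briefly. The following Python fragment is only an illustration of the idea behind [43], not the authors' feature generator; the inverted index mapping words to article titles and the toy documents are invented.

```python
import math
from collections import Counter

def tf_idf(docs):
    """tf-idf weighted bag-of-words vectors for a small list of tokenized documents."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    n = len(docs)
    return [{t: count * math.log(n / df[t]) for t, count in Counter(doc).items()}
            for doc in docs]

def enrich_with_concepts(doc_tokens, concept_index, max_concepts=3):
    """Append the titles of the best-matching Wikipedia articles as extra features."""
    scores = Counter()
    for token in doc_tokens:
        for title in concept_index.get(token, []):
            scores[title] += 1
    extra = [title for title, _ in scores.most_common(max_concepts)]
    return doc_tokens + [f"CONCEPT:{title}" for title in extra]

# hypothetical word -> Wikipedia article titles index
concept_index = {"jaguar": ["Jaguar (animal)", "Jaguar Cars"],
                 "engine": ["Jaguar Cars", "Internal combustion engine"]}
doc = ["jaguar", "engine", "price"]
print(enrich_with_concepts(doc, concept_index))
print(tf_idf([doc, ["car", "engine", "review"]]))
```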
2.6.2.2
Search Engine
Several search engines based on Wikipedia have been developed to retrieve documents which are highly correlated with the keywords typed by the user. As shown in the taxonomy depicted in Fig. 2.4, the proposed approaches can be classified according to the Wikipedia information employed in the query analysis.
Some approaches have addressed user interactivity in an effort to improve the relevance of the results. For example, the Koru search engine [48] allows the user to automatically expand queries with semantically related terms through an interactive interface to extract relevant documents. The interface is composed of three panels: (a) the query topic panel, which provides users with a summary based on a ranking of significant topics extracted from the query, (b) the query results panel, which presents the outcome of the query in the form of a series of document surrogates, and (c) the document tray, which allows users to collect multiple documents they wish to peruse. Real users were asked to experiment with the system to identify the improvements offered over traditional keyword search. The main advantages reported by testing users were the capability of lending assistance to almost every query and the improved relevance of the documents returned. Other approaches have been devoted to improving efficiency, in terms of query processing scalability, and proficiency in entity recognition. Bast et al. [49] present the ESTER modular system for highly efficient combined full-text and ontology search. It is based on graph-pattern queries, expressed in the SPARQL language, and on an entity recognizer. The entity recognizer combines a supervised technique with a disambiguation step to identify concepts in the query and in the documents. In addition, the system includes a user interface which suggests a semantic completion based on the typed keywords, and the display of properties of a desired entity. The interface is designed in such a way as to offer all the features of a SPARQL-based query engine, with the added benefit of being intuitive for inexpert users. For example, when a user has typed "Beatles musician", the system will give instant feedback that there is semantic information on musicians, and it will execute, in addition to an ordinary full-text query, a query searching for instances of that class (in the context of the other parts of the query), showing the best hits for either query. Good performance in terms of scalability and entity recognition has been achieved by the proposed system.
2.6.3
Video Content Interpretation
User-contributed video collections like YouTube present new opportunities and novel challenges to mine large amounts of videos and extract knowledge useful to categorize and organize their content. Following the taxonomy depicted in Fig. 2.2, Sect. 2.6.3.1 presents a classification technique based on the visual features of video clips, while Sect. 2.6.3.2 describes a novel approach to synchronize and organize a set of video clips related to a concert event.
2.6.3.1
Concept-Based Classification
Automatic indexing of video content has already received significant interest as an alternative to manual annotation and aims at deriving meaningful descriptors from the video data itself. Since such data is of sensory origin (image, sound, video), techniques from digital signal processing and computer vision are employed to
extract relevant descriptions. The role of visual content in machine-driven labeling has been long investigated and has resulted in a variety of content-based image and video retrieval systems. Such systems commonly depend on low-level visual and spatiotemporal features and are based on the query-by-example paradigm. As a consequence, they are not effective if proper examples are unavailable. Furthermore, similarities in terms of low-level features do not easily translate to the high-level, semantic similarity expected by users. Concept-based video retrieval [50] tries to bridge this semantic gap and has evolved over the last decade as a promising research field. It enables textual queries to be carried out on multimedia databases by substituting manual indexing with automatic detectors that mine media collections for semantic (visual) concepts. This approach has proven effective and, when a large set of concept detectors is available, its performance can be comparable with that of standard Web search [51]. Concept detection relies on machine learning techniques and requires, to be effective, that vast training sets be available to build large-scale concept dictionaries and semantic relations. So far, the standard approach has been to employ manually labeled training examples provided by experts for concept learning. This solution is costly and gives rise to additional inconveniences: the number of learned concepts is limited, the insufficient scale of training data causes overfitting, and adapting to changes (like new concepts of interest) remains difficult. In [52], the huge video repository offered by YouTube is utilized as a novel kind of knowledge base for the machine interpretation of multimedia data. Web videos are exploited for two distinct purposes. On the one hand, result video clips of a YouTube search for a given concept are employed as positive examples to train the corresponding detector. Negative examples are drawn from other videos not tagged with that concept. Frames are sampled from the videos and their visual descriptors are fed to several statistical classifiers (Support Vector Machines, Passive-Aggressive Online Learning, Maximum Entropy), whose performance has been compared. On the other hand, tag co-occurrences in video annotations are used to link concepts. For each concept, a bag-of-words representation is extracted from the tags of the associated video clips. The process is then repeated with user queries, which are thus mapped to the best-matching learned concepts. The approach has been evaluated on a large dataset (1,200 hours of video) by manually selecting 233 concepts. Precision in detecting concepts rapidly increases with the number of videoclips included in the training set and stabilizes when 100–150 videos are used. Results show that the average achieved precision, although largely dependent on the concepts, is promising (32.2%), suggesting that Web-based video collections indeed have the potential to support unsupervised visual and semantic learning.
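The weak-supervision idea of [52] – treating clips returned by a tag-based search as positive examples and untagged clips as negatives – can be sketched with any off-the-shelf classifier. The fragment below is only an illustration (it uses a linear SVM from scikit-learn and random vectors in place of real visual descriptors); it is not the pipeline of [52].

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_concept_detector(pos_frames, neg_frames):
    """Binary detector for one concept, trained on weakly labeled video frames."""
    X = np.vstack([pos_frames, neg_frames])
    y = np.hstack([np.ones(len(pos_frames)), np.zeros(len(neg_frames))])
    clf = LinearSVC()   # SVMs are one of the classifiers compared in [52]
    clf.fit(X, y)
    return clf

rng = np.random.default_rng(0)
pos = rng.normal(1.0, 1.0, size=(100, 16))    # stand-ins for descriptors of tagged clips
neg = rng.normal(-1.0, 1.0, size=(100, 16))   # descriptors drawn from untagged clips
detector = train_concept_detector(pos, neg)
print(detector.predict(rng.normal(1.0, 1.0, size=(3, 16))))   # expected mostly 1s
```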
2.6.3.2
Automated Synchronization of Video Clips
The abundance of video material found in user-generated video collections could enable a broad coverage of captured events. However, the lack of detailed semantic and time-based metadata associated with video content makes the task of identifying and synchronizing a set of video clips relative to a given event a nontrivial one.
Kennedy and Naaman [53] propose the novel application of existing audio fingerprinting techniques to the problem of synchronizing video clips taken at the same event, particularly concert events. Synchronization allows the generation of important metadata about the clips and the event itself and thus enhances the user browsing and watching experience. A set of video clips crawled from the Web and related to the same event is assumed to be initially available. Fingerprints are generated for each of them by spectral analysis of the audio tracks. The results of this process are then compared for any two clips to identify matches. Both the fingerprinting techniques and the matching algorithm are quite robust against noisy sources – as is often the case with user-contributed media. Audio fingerprinting matches are exploited to build an undirected graph, where each node represents a single clip and edges indicate temporal overlapping between pairs of clips. Such a graph typically includes a few connected components, or clusters, each one corresponding to a different portion of the captured event. Based on the clip overlap graph, information about the level of interest of each cluster is extracted and highly interesting segments of the event are identified. In addition, cluster analysis is employed to aid the selection of the highest quality audio tracks. Textual information provided by the users and associated with the video clips is also mined by a tf-idf strategy to gather descriptive tags for each cluster so as to improve the accuracy of search tasks, as well as suggest metadata for unannotated video clips. This system has been applied to a set of real user-contributed videos from three music concerts. A set of initial experiments enabled the fine-tuning of the audio fingerprinting and matching algorithms to achieve the best matching precision. Manual inspection showed that a large fraction of the clips left out by the system were very short or of abnormally low quality, and thus intrinsically uninteresting. Proficiency in identifying important concert segments (typically hit songs) has then been assessed by comparison with rankings found on the music-sharing Web site Last.fm, with a positive outcome. A study with human subjects has also been conducted which was able to validate the system’s selection of high-quality audio. Finally, the approach proposed to extract textual descriptive information proved successful in many cases, with failure cases being mostly related to poorly annotated clips and small clusters.
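The clip-overlap graph described above reduces to building an undirected graph from the pairwise fingerprint matches and extracting its connected components. A minimal sketch follows; the clip identifiers and matches are hypothetical, and a real system would of course also store the estimated time offsets between clips.

```python
from collections import defaultdict

def overlap_clusters(clips, matches):
    """Group clips into clusters of temporally overlapping footage.
    matches: pairs (a, b) whose audio fingerprints matched."""
    graph = defaultdict(set)
    for a, b in matches:
        graph[a].add(b)
        graph[b].add(a)
    seen, clusters = set(), []
    for clip in clips:
        if clip in seen:
            continue
        stack, component = [clip], set()
        while stack:                      # depth-first traversal of one component
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        clusters.append(component)
    return clusters

clips = ["clip1", "clip2", "clip3", "clip4", "clip5"]
matches = [("clip1", "clip2"), ("clip2", "clip3"), ("clip4", "clip5")]
print(overlap_clusters(clips, matches))   # two clusters: {1, 2, 3} and {4, 5}
```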
2.7
Perspective
This chapter presented and discussed different community-contributed media collections by highlighting the main challenging research issues opened by the online social networking sites. In the last few years, an increasing research effort has been devoted to studying and understanding the growth trend in social networks and the media content distribution. Furthermore, the abundance of multimodal and user-generated media has opened novel research perspectives and thus introduced novel challenges to mine large collections of multimedia data and effectively extract relevant knowledge. Much research has been focused on improving the automatic understanding
of user-contributed media, enhancing user search over media collections, and automatically classifying Wikipedia articles. However, less attention has been paid to the integration of the heterogeneous data available on different online social networking sites to tailor personalized multimedia services. For example, users can explore and gain a broader understanding of a context by integrating the huge amount of data stored in Wikipedia, YouTube, and Flickr. Digital libraries which cover a specific field (e.g., geography, economy) can be enriched by extracting related information from Wikipedia. Similarly, news videos can be integrated and contextualized with information provided by Wikipedia articles. Only a few approaches have been proposed to address these issues [54, 55]. The problem of integrating Wikipedia and a geographic digital library is presented in [54]. The integration approach consists of identifying relevant articles correlated to geographical entries in the digital library. The identification is carried out by analyzing List_of_⟨G⟩
Wikipedia pages, where G is a geographical entity (e.g., region, country, city). Finally, additional information is extracted by parsing the infobox content of the selected Wikipedia pages. Experimental evaluation of the extraction of relevant articles and metadata information shows good performance in precision and recall. The integration of news videos and Wikipedia articles about news events is addressed in [55]. The method aims to automatically label news videos with Wikipedia entries in order to provide more detailed explanations about the context of the video content. Using "Wikinews," a sister project of Wikipedia, all news-related articles are extracted from Wikipedia, while information about videos is extracted from the closed captions (CC), which are usually provided. The CCs of news stories are labeled with Wikipedia entries by evaluating date information. Experimental results show that Japanese news videos broadcast over a year were accurately labeled with Wikipedia entries with a precision of 86% and a recall of 79%. A host of technical challenges remains for better exploiting community-contributed media in personalized applications able to tailor multimedia services according to the context in which the user is currently involved. Multimedia services for mobile devices could be personalized by exploiting both the current context of the user and the huge amount of media content available on online social networking sites. For example, WikEye [56], a system for mobile technology, is able to retrieve interesting information on tourist places from Wikipedia spatial and temporal data. Future research directions might also exploit relevant knowledge available in different media collections.
References
1. Youtube website. http://www.youtube.com/
2. Flickr website. http://www.flickr.com/
3. Wikipedia website. http://www.wikipedia.com/
4. Alexa the web information company. http://www.alexa.com/
5. Paolillo, J.: Structure and network in the youtube core. In: HICSS, p. 156 (2008)
6. Facebook website. http://www.facebook.com/
7. Twitter website. http://twitter.com/
8. Myspace website. http://www.myspace.com/
9. Graham, A., Garcia-Molina, H., Paepcke, A., Winograd, T.: Time as essence for photo browsing through personal digital libraries. In: Proceedings of the 2nd ACM/IEEE-CS joint conference on Digital libraries, pp. 326–335 (2002)
10. Heesch, D.: A survey of browsing models for content based image retrieval. Multimedia Tools and Applications 40(2), 261–284 (2008)
11. Giles, J.: Special report: Internet encyclopedias go head to head. Nature 438(15), 900–901 (2005)
12. Smets, K., Goethals, B., Verdonk, B.: Automatic vandalism detection in Wikipedia: Towards a machine learning approach. In: AAAI Workshop on Wikipedia and Artificial Intelligence: An Evolving Synergy, pp. 43–48 (2008)
13. Tang, L., Liu, H.: Graph mining applications to social network analysis. Managing and Mining Graph Data, pp. 487–513 (2010)
14. Hotho, A., Jäschke, R., Schmitz, C., Stumme, G.: Information retrieval in folksonomies: Search and ranking. The Semantic Web: Research and Applications, pp. 411–426 (2006)
15. Lerman, K., Galstyan, A.: Analysis of social voting patterns on digg. In: Proceedings of the first workshop on Online social networks, pp. 7–12 (2008)
16. Cheng, X., Dale, C., Liu, J.: Statistics and social network of youtube videos. In: Proceedings of the 16th International Workshop on Quality of Service, pp. 229–238 (2008)
17. Mislove, A., Koppula, H.S., Gummadi, K.P., Druschel, P., Bhattacharjee, B.: Growth of the flickr social network. In: Proceedings of the first workshop on Online social networks, pp. 25–30 (2008)
18. Cha, M., Kwak, H., Rodriguez, P., Ahn, Y., Moon, S.: I tube, you tube, everybody tubes: analyzing the world's largest user generated content video system. In: Proceedings of the 7th ACM SIGCOMM conference on Internet measurement, p. 14 (2007)
19. Cha, M., Mislove, A., Adams, B., Gummadi, K.P.: Characterizing social cascades in flickr. In: Proceedings of the first workshop on Online social networks, pp. 13–18 (2008)
20. Van Zwol, R.: Flickr: Who is looking? In: Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence, pp. 184–190 (2007)
21. Daum website. http://www.daum.net/
22. Pastor-Satorras, R., Vespignani, A.: Epidemics and immunization in scale-free networks. In: Handbook of Graphs and Networks: From the Genome to the Internet (2002)
23. Kittur, A., Chi, E., Suh, B.: What's in Wikipedia?: mapping topics and conflict using socially annotated category structure. In: Proceedings of the 27th international conference on Human factors in computing systems, pp. 1509–1512 (2009)
24. Tjondronegoro, D., Spink, A.: Web search engine multimedia functionality. Information Processing & Management 44(1), 340–357 (2008)
25. Davis, S., Burnett, I., Ritz, C.: Using social networking and collections to enable video semantics acquisition. IEEE MultiMedia (2009)
26. Ames, M., Naaman, M.: Why we tag: motivations for annotation in mobile and online media. In: Proceedings of the SIGCHI conference on Human factors in computing systems, pp. 971–980 (2007)
27. Garg, N., Weber, I.: Personalized tag suggestion for flickr. In: Proceedings of the 17th international conference on World Wide Web, pp. 1063–1064 (2008)
28. Golder, S., Huberman, B.: Usage patterns of collaborative tagging systems. Journal of Information Science 32(2), 198 (2006)
29. Marlow, C., Naaman, M., Boyd, D., Davis, M.: HT06, tagging paper, taxonomy, Flickr, academic article, to read. In: Proceedings of the seventeenth conference on Hypertext and hypermedia, p. 40 (2006)
30. Noll, M., Meinel, C.: Authors vs. readers: a comparative study of document metadata and content in the www. In: Proceedings of the 2007 ACM symposium on Document engineering, p. 186 (2007)
31. Sigurbjörnsson, B., van Zwol, R.: Flickr tag recommendation based on collective knowledge. In: Proceedings of the 17th international conference on World Wide Web, pp. 327–336 (2008)
32. Reed, W.: The pareto, zipf and other power laws. Economics Letters 1, 15–19 (2001)
33. Tan, P., Steinbach, M., Kumar, V.: Introduction to Data Mining (2006)
34. Mikroyannidis, A.: Toward a social semantic web. Computer, pp. 113–115 (2007)
35. Völkel, M., Krötzsch, M., Vrandecic, D., Haller, H., Studer, R.: Semantic wikipedia. In: Proceedings of the 15th international conference on World Wide Web, p. 594. ACM (2006)
36. Gabrilovich, E., Markovitch, S.: Computing semantic relatedness using wikipedia-based explicit semantic analysis. In: Proceedings of the 20th International Joint Conference on Artificial Intelligence, pp. 6–12 (2007)
37. Witten, D., Milne, D.: An effective, low-cost measure of semantic relatedness obtained from Wikipedia links. In: Proceedings of the AAAI Workshop on Wikipedia and Artificial Intelligence: an Evolving Synergy, AAAI Press, Chicago, USA, pp. 25–30 (2008)
38. Mihalcea, R., Csomai, A.: Wikify!: linking documents to encyclopedic knowledge. In: Proceedings of the sixteenth ACM conference on Conference on information and knowledge management, pp. 233–242 (2007)
39. Dubinko, M., Kumar, R., Magnani, J., Novak, J., Raghavan, P., Tomkins, A.: Visualizing tags over time. In: Proceedings of the 15th international conference on World Wide Web, pp. 193–202 (2006)
40. Jaffe, A., Naaman, M., Tassa, T., Davis, M.: Generating summaries and visualization for large collections of geo-referenced photographs. In: Proceedings of the 8th ACM international workshop on Multimedia information retrieval, pp. 89–98 (2006)
41. Rattenbury, T., Good, N., Naaman, M.: Towards automatic extraction of event and place semantics from flickr tags. In: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 103–110 (2007)
42. Kennedy, L., Naaman, M., Ahern, S., Nair, R., Rattenbury, T.: How flickr helps us make sense of the world: context and content in community-contributed media collections. In: Proceedings of the 15th international conference on Multimedia, pp. 631–640 (2007)
43. Gabrilovich, E., Markovitch, S.: Overcoming the brittleness bottleneck using Wikipedia: enhancing text categorization with encyclopedic knowledge. In: Proceedings of the National Conference on Artificial Intelligence, vol. 21, p. 1301 (2006)
44. Wang, P., Hu, J., Zeng, H., Chen, Z.: Using Wikipedia knowledge to improve text classification. Knowledge and Information Systems 19(3), 265–281 (2009)
45. Schönhofen, P.: Identifying document topics using the Wikipedia category network. Web Intelligence and Agent Systems 7(2), 195–207 (2009)
46. Huynh, D., Cao, T., Pham, P., Hoang, T.: Using hyperlink texts to improve quality of identifying document topics based on wikipedia. In: International Conference on Knowledge and Systems Engineering, pp. 249–254 (2009)
47. Janik, M., Kochut, K.: Wikipedia in action: Ontological knowledge in text categorization. In: International Conference on Semantic Computing, pp. 268–275 (2008)
48. Milne, D., Witten, I., Nichols, D.: A knowledge-based search engine powered by wikipedia. In: Proceedings of the sixteenth ACM conference on Conference on information and knowledge management, pp. 445–454 (2007)
49. Bast, H., Chitea, A., Suchanek, F., Weber, I.: ESTER: efficient search on text, entities, and relations. In: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, p. 678 (2007)
50. Snoek, C., Worring, M.: Concept-based video retrieval (2009)
51. Hauptmann, A., Yan, R., Lin, W., Christel, M., Wactlar, H.: Can high-level concepts fill the semantic gap in video retrieval? A case study with broadcast news. IEEE Transactions on Multimedia 9(5), 958–966 (2007)
52. Ulges, A., Koch, M., Borth, D., Breuel, T.: TubeTagger – YouTube-based concept detection. In: IEEE International Conference on Data Mining Workshops, pp. 190–195 (2009)
53. Kennedy, L., Naaman, M.: Less talk, more rock: automated organization of community-contributed collections of concert videos. In: Proceedings of the 18th international conference on World Wide Web, pp. 311–320 (2009)
54. Lim, E., Wang, Z., Sadeli, D., Li, Y., Chang, C., Chatterjea, K., Goh, D., Theng, Y., Zhang, J., Sun, A., et al.: Integration of Wikipedia and a geography digital library. Lecture Notes in Computer Science 4312, 449 (2006)
55. Okuoka, T., Takahashi, T., Deguchi, D., Ide, I., Murase, H.: Labeling news topic threads with Wikipedia entries. In: 11th IEEE International Symposium on Multimedia, pp. 501–504 (2009)
56. Hecht, B., Rohs, M., Schöning, J., Krüger, A.: WikEye – using magic lenses to explore georeferenced Wikipedia content. In: Proceedings of the 3rd International Workshop on Pervasive Mobile Interaction Devices (2007)
Part II
Social Analysis in Community-Built Databases
Chapter 3
Social Network Analysis in Community-Built Databases
Václav Snášel, Zdeněk Horák, and Miloš Kudělka
Abstract In this chapter, we see community-built databases (CBD) as a direct product of social interaction among different users and apply social network analysis techniques to understand and uncover hidden relations that explain various aspects of CBD quality and success. We consider several types of CBD data, discuss their visualization, and provide a short survey of related work. Finally, we present two experiments applying a social perspective to the CBD data.
3.1
Introduction
When working in the field of community-built databases (CBD), we are dealing with several kinds of data, which can be divided according to their nature into the following categories (see Fig. 3.1):
- Database-specific data (e.g., movie information in movie databases)
- User-specific data (information about user accounts)
- Mixed data combining users with database-specific data (user/movie rating, user discussion, etc.)
From this point of view, one may consider a CBD as a direct product of the social network formed around it. Technically speaking, the database should be considered in a broader sense, such as a set of Web pages containing knowledge or information, not only as the technological base for storing data. It is natural to employ techniques of social network analysis (also referred to as SNA) in the field of CBD. SNA has proven to be useful in understanding complex relations among subjects with hidden implications. Uncovering these relations and implications can give us further insight into the information contained in the database, the quantity and quality
V. Snášel (*) VŠB-Technical University of Ostrava, 17. listopadu 15, 70833 Ostrava, Czech Republic, e-mail: [email protected]
Fig. 3.1 Basic categories of data in community-built databases
of this information and possibly also show the path which leads to successful and high-quality CBD. User-specific data can be evaluated effectively using purely statistical approaches, so we do not discuss them further in this chapter. A similar situation holds for the database-specific data, because they heavily depend on the nature of a particular database type. This chapter is organized as follows. At first we mention some of the biggest CBD. Then we discuss some of their common features related to the social aspects. The next section is dedicated to understanding these databases in terms of the SNA, various types of data and their visualization. The latter part of this chapter presents several practical applications using the social perspective in order to mine information from CBD. Illustrative experiments are accompanied by basic notions of theory. We would like to thank the editors of this book and reviewers of this chapter for their valuable suggestions and comments.
3.1.1
Community-Built Databases Brief Survey
Wikipedia is surely the biggest success of CBD. Currently it is available in more than 270 languages containing more than 15 million articles created by 90,000 contributors.1 English Wikipedia, the largest language edition, has more than three million articles. The success and support of volunteers have also made possible the existence of other wiki projects such as Wiktionary, WikiBooks, Wikimedia Commons, etc. Since Wikipedia is not aimed at one specific type of data, it clearly conflicts with the specialized CBD. Regardless of this fact, many of them were able to grow to respectable sizes.
1. According to http://en.wikipedia.org/wiki/Help:About
In the field of movies, the most important player is The Internet Movie Database (www.imdb.com), which contains information about more than 1,500,000 movie titles and more than 3,500,000 people involved in the film industry.2 Data about music is gathered on the Discogs.com site, which provides information about artists, albums, and songs, as well as being a marketplace for their exchange. A similar scheme holds for collectors of almost everything (magazines at Kaastor.com, paper money at Plaata.com, etc.). Almost every popular car model has its own fan club with a Web site. Most of these sites also contain discussion forums, where car owners discuss various issues linked to a specific car model, thereby creating a CBD. The quality of such a database correlates with the popularity of the specific car model within the community of Internet-oriented people (roughly speaking, more owners means more users creating more content) and also with the popularity of the Web site (more owners are searching for specific information). Depending on the definition, we may also consider many other services as CBD, often dedicated to sharing interesting things, such as Digg (www.digg.com), Delicious (www.delicious.com), StumbleUpon (www.stumbleupon.com), and many others.
3.1.2
Common Features
When analyzing the largest successful CBD, we can see that most of them share certain features based on social aspects. In the following sections, we discuss them in greater detail.
3.1.2.1
Motivation
This is the cornerstone of success, because users add value [32], and users have to be motivated to add value. This is not a big problem in a company environment where users are paid for their work, but it is a key issue in systems based on volunteers. Kuznetsov [24] summarizes the motivations of Wikipedia editors. According to their study, the main motives are: altruism (users are working for the common good), reciprocity (users are helping and expect to get help when needed), community (users form communities and are acting as members of these communities), reputation (users can create a virtual identity which can be respected), and autonomy (users are not restricted to a specific subject). Nov [31] in a similar study also mentions fun, career, and understanding (users think that contributing to Wikipedia will give them a new perspective on subjects and issues).
2. According to http://www.imdb.com/stats
3.1.2.2
Openness to Users
Registration is quick, only a few fields are required, and many actions can be performed without registration. If a user wants to contribute to the database, nobody wants to waste his/her time or, worse, lose his/her interest. Every user counts, so the sites try to be as international as possible. Kuznetsov [24] also points out that users are more willing to correct an error in Wikipedia (81%) than in a printed source (16%). The explanation may lie in the process of correcting the error.
3.1.2.3
Protection Against Vandalism
Not all users are willing to help and provide useful information. The problem is that protection against abuse must not conflict with openness to users, and CBD have to balance these criteria. Some databases prefer the accuracy provided by a review process applied before data publication, at the risk of the data not being up to date and of reduced user satisfaction. Others distinguish key information, which cannot be modified freely, from new or less important information that is open to anyone wanting to make a change. This process guarantees the quality of content but does not restrict most users' actions. The use of automatic heuristics for vandalism detection has proven to be very useful. While a small error may involve the user in its correction and possibly attract him to the database construction process, huge mistakes will confuse users for a long time. An analysis of different types of vandalism and Wikipedia's responses to them can be found in [33]. Although there have been several proposals for reputation systems for Wikipedia (see [1] for a concrete approach and valuable references), they have mainly played a supporting role for administrators and are not accessible to normal visitors. Final decisions about reverting changes are still based on the particular contribution. Long-term misbehavior may result in the permanent blocking of the user or his/her IP address. Similar approaches can be used in a positive manner to identify promising users suitable for promotion in the Wikipedia hierarchy [6].
3.1.2.4
Strong Emphasis on Discussion
Users must have a place to express their opinions and must feel that their opinions are taken into account. The discussion can be hidden but has to be recorded. The final presentation of knowledge is often formed by consensus. Kittur and Kraut [21] analyze the need for collaboration, particularly discussion, in Wikipedia. A direct correlation exists between the quality of constructed knowledge and the amount of collaboration needed for its construction. But the correlation is not that simple, as it depends on the type of collaboration and the number of people involved.
3.1.2.5
Continuous Development
Only a few rules are present from the beginning. Additional rules are subsequently created, often after a wide consensus among users. The functionality of the technological platform is improved in small steps based on specific demand. A slow response to users' needs often creates confusion. Butler et al. [7] emphasize the need for a sequential evolution of policies in collaborative systems, rather than stating them in advance.
3.1.3
Particular Types of Data
In this section, we focus on seeing CBD in terms of social network analysis. We consider the CBD data as being of one of the following types (see Fig. 3.2):
- User versus document (or other unit of knowledge)
- User versus set of documents (e.g., topic/genre/etc.)
- User versus user
All types can be grouped as multiple objects, thus forming subtypes like group of users versus group of documents. The first two types can be seen as so-called two-mode networks. These networks consist of two disjoint sets of nodes (e.g., users and documents), where an edge can connect only nodes from the first set with nodes from the second set (a user created a document). A good overview can be found in [4]. Data considered under the third type is very often not present in the database directly (with some exceptions like systems strongly based on friendship) but can be quickly inferred, for example, from discussions or mutual collaboration on one document or topic. From the view of SNA, this data can be considered as so-called one-mode networks. The illustrative images (Fig. 3.2) provide good insight into the structure of social networks and CBD. But with an increasing number of objects, the clarity of images decreases quickly. In many cases, the computational complexity of the methods being used increases rapidly. There are several possible ways to address this issue (illustrated in Fig. 3.3):
- Find faster methods
- Find better visualization techniques
- Apply some simplification, which will allow us to maintain a good overview of the network
- Specify our interests more precisely and use computer algorithms to find important parts
Fig. 3.2 Particular types of data in CBD with respect to the social point of view
Fig. 3.3 Social network visualization possibilities
Finding faster methods is always a possibility, but we will surely reach a limit. In this situation, more processed data also means more data for visualization (see the first two parts of Fig. 3.3). Here, we will reach the limits of a particular visualization technique and should therefore think of a better one. A common approach is to use another dimension (switching from 2D to 3D, or using animation rather than a static image) or another visualization attribute (coloring nodes, sizing nodes and edges). Some of these enhancements can be used only in specific environments and are not suited, for example, to grayscale printers (third part of Fig. 3.3). If it is so complicated to display so much information at once, perhaps the data could be simplified, with only the important parts being displayed (fourth part of Fig. 3.3). Clearly, this leads to the omission of less important information, which is not always a suitable option. We have to know what is important. In this case, we specify the objective using a pattern and let the algorithm search for it. Identified patterns can be highlighted in visualizations or presented separately (last part of Fig. 3.3).
3.1.4
Social Networks Representation
Social networks are usually modeled using graphs (see Fig. 3.4). Below we recall some basic notions from graph theory. A graph can be considered as a tuple G = (V, E), where V is the list of nodes (V = {A, B, C, D}) and E is the list of edges (E = {e1, e2, e3, e4, e5}). An edge connects two nodes and can more formally be expressed as a tuple (e1 = ⟨A, B⟩). It is natural to represent social network data and their respective graphs using matrices, as they can be immediately used in further computations. Graphs can be modeled using the incidence matrix (rows are vertices, columns are edges, and the value denotes the relation of the given vertex to the given edge; see Fig. 3.4). The incidence matrix of a directed graph usually contains the values −1 (the edge starts from the node), 0 (the node is not related to the edge), and 1 (the edge ends in the node). Another way of modeling a graph is the adjacency matrix, which represents the relation between two nodes. Later in the text we will use this type of matrix.
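A small example makes the two representations concrete. The edge list below is hypothetical (the actual edges of Fig. 3.4 are not reproduced here), and the sign convention (−1 for the start node, +1 for the end node) follows the description above.

```python
import numpy as np

nodes = ["A", "B", "C", "D"]
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("C", "D")]  # hypothetical
idx = {n: i for i, n in enumerate(nodes)}

# incidence matrix: one row per node, one column per (directed) edge
incidence = np.zeros((len(nodes), len(edges)), dtype=int)
for j, (u, v) in enumerate(edges):
    incidence[idx[u], j] = -1   # edge starts from u
    incidence[idx[v], j] = 1    # edge ends in v

# adjacency matrix of the undirected version of the same graph
adjacency = np.zeros((len(nodes), len(nodes)), dtype=int)
for u, v in edges:
    adjacency[idx[u], idx[v]] = adjacency[idx[v], idx[u]] = 1

print(incidence)
print(adjacency)
```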
Fig. 3.4 Modeling social network using graph and representing it using incidence and adjacency matrix
Fig. 3.5 Directed and undirected unweighted graph of the one-mode social network and its representation using binary matrices
Fig. 3.6 Directed and undirected weighted graph of the one-mode social network and its representation using binary matrices
Fig. 3.7 Weighted and unweighted graph of the two-mode social network and its representation using binary matrices
Using the previous paragraphs, we may distinguish between the matrices of one-mode networks (all objects are of the same type, see Figs. 3.5 and 3.6) and two-mode networks (objects can be divided into two groups, see Fig. 3.7). The relation between objects can be considered (according to various needs) as symmetric or not (if object A has a relation to object B, does object B also have a relation to object A?). This leads to directed or undirected graphs (see the left and right parts of Figs. 3.5 and 3.6).
In different situations, we may consider a binary relation only (if object A is related to object B or not) or we need a wider sense of relation (object A is related to object B in degree C). This leads to unweighted (see Fig. 3.5) or weighted graphs (see Fig. 3.6).
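Two-mode data of the user-versus-document kind can also be projected onto a one-mode network, which is one simple way to obtain the inferred user–user relations mentioned above. The following sketch uses an invented user–document matrix; the projection B·Bᵀ counts, for every pair of users, the documents on which they both worked.

```python
import numpy as np

# two-mode (user x document) matrix: B[u, d] = 1 if user u contributed to document d
users = ["u1", "u2", "u3"]
B = np.array([[1, 1, 0, 0],
              [0, 1, 1, 0],
              [0, 0, 1, 1]])

# one-mode projection: a weighted user-user network of co-contribution
user_user = B @ B.T
np.fill_diagonal(user_user, 0)   # ignore a user's relation to himself/herself
print(user_user)
# [[0 1 0]
#  [1 0 1]
#  [0 1 0]]
```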
3.1.5
Social Networks Visualization
In the following sections, we present one particular approach to processing social network data, which may also be suitable for processing CBD. As a social network we consider a set of subjects which are linked together by some kind of relationship. Social networking – in the sense of providing services to people to stay in touch, communicate, and express their relations – has received great attention in recent years. Freeman, in [16], underlines the need for social network visualization and provides an overview of the development of such visualizations. Detailed information about this development can be found in [17]. The development from hand-drawn images to complex computer-rendered scenes is evident, as is the shift from classic sociograms to new approaches and methods of visualization. What remains is the need for clarity of such visualizations. As previously mentioned, CBD data may be seen as two-mode networks. Such networks can be visualized using a bipartite graph. But it is apparent from Fig. 3.8 (which links people denoted by letters with events denoted by numbers) that there are too many nodes and edges for this visualization method. Even if we use a better layout algorithm (which will, e.g., place nodes in a different way to minimize crossings between edges), the graph does not clearly illustrate the inner structure of the data. Paper [18] introduces the usage of formal concept analysis (FCA), a well-known general-purpose data analysis method, in the area of social networks and reviews the motivation for finding relations hidden in data that are not covered by simple graph visualization. The paper shows that the Galois lattice is capable of capturing all three scopes of two-mode network data – the relation between subjects, the relation between events, and also the relation between subjects and events.
Fig. 3.8 Social network – visualized as bipartite graph
Fig. 3.9 Concept lattice before reduction
We will later see (in Fig. 3.9) that by using FCA, we can create a different visualization of presented data, which will uncover the inner structure of the network.
3.1.6
Basic Social Networks Analysis Measures
In this section, we discuss some basic SNA measures and notions, which can be used to identify interesting nodes in the network. Additional measures and their discussion can be found in a recent paper [30].
3.1.6.1
Degree
The number of edges incident to the node is called the degree of the node. See Fig. 3.10 for two example networks – the first network has five nodes – four nodes have degree one, one node has degree four; the second network has four nodes, each with degree two. Nodes with a lower degree are more isolated in the network; conversely, nodes with higher degree have more contacts and are more important for the network structure. The average degree of nodes in the network can be considered as a coefficient describing the average linkage of network members. As can be seen from the previous figure, the average degree cannot describe the properties of the whole network (the resulting number is very similar although the linkage of the nodes is very different). To describe the distribution of the degrees among the nodes, we can use the histogram (see Fig. 3.11).
Fig. 3.10 Two similar networks with different degrees (average degrees 1.6 and 2)
Fig. 3.11 Two similar networks with different degrees (average degrees 2.5 and 3.3) and their histograms
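To make these degree-related notions concrete, the following Python sketch (the two toy graphs are made up to mimic the networks in Fig. 3.10) computes node degrees, the average degree, and the degree distribution:

```python
from collections import Counter

# Undirected graph as an adjacency list; the "star" network from Fig. 3.10:
# one central node connected to four leaves.
star = {1: {2, 3, 4, 5}, 2: {1}, 3: {1}, 4: {1}, 5: {1}}
# A cycle of four nodes, each with degree two.
cycle = {1: {2, 4}, 2: {1, 3}, 3: {2, 4}, 4: {1, 3}}

def degrees(graph):
    """Degree of every node: the number of incident edges."""
    return {node: len(neighbors) for node, neighbors in graph.items()}

def average_degree(graph):
    """Average degree over all nodes of the network."""
    degs = degrees(graph)
    return sum(degs.values()) / len(degs)

def degree_histogram(graph):
    """Degree distribution: how many nodes have each degree value."""
    return Counter(degrees(graph).values())

print(average_degree(star), average_degree(cycle))        # 1.6 and 2.0
print(degree_histogram(star), degree_histogram(cycle))
```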
3.1.6.2 Degree Centrality
To evaluate the degree of a node with respect to the whole network, we can also use the so-called degree centrality, which can be computed as

$$C_D(v) = \frac{\operatorname{degree}(v)}{n - 1},$$

where n is the number of nodes in the network. This measure can be extended to the whole network using the node with the highest degree centrality as a baseline.
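A minimal sketch of this normalization, reusing the same made-up star graph:

```python
def degree_centrality(graph):
    """C_D(v) = degree(v) / (n - 1), where n is the number of nodes."""
    n = len(graph)
    return {v: len(nbrs) / (n - 1) for v, nbrs in graph.items()}

star = {1: {2, 3, 4, 5}, 2: {1}, 3: {1}, 4: {1}, 5: {1}}  # hypothetical example
print(degree_centrality(star))  # central node: 1.0, leaves: 0.25
```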
3.1.6.3 Betweenness
The relative importance of a node in the network – with respect to the transmission of information through the network – can be represented using betweenness. For a graph (or network) G = (V, E), this measure can be computed as

$$C_B(v) = \sum_{s \neq v \neq t \in V} \frac{\sigma_{st}(v)}{\sigma_{st}},$$

where $\sigma_{st}$ is the number of shortest paths between the nodes s and t, and $\sigma_{st}(v)$ is the number of shortest paths between the nodes s and t that pass through the node v. Figure 3.12 contains two networks with a similar number of nodes but different inner structure, which results in different betweenness values of the nodes. The first network has a hierarchical structure; therefore, the central node has the
Fig. 3.12 Two networks with similar number of nodes, but different betweenness
Fig. 3.13 Two similar networks with highlighted bridges
highest betweenness and the outlying nodes have the minimum betweenness. The structure of the second network is cyclic; therefore, the value of betweenness is the same for all nodes.
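The following sketch computes unnormalized betweenness for two toy networks resembling Fig. 3.12; it assumes the networkx library is available rather than implementing the shortest-path counting by hand:

```python
import networkx as nx

# A small hierarchical (star-like) network and a cycle, echoing Fig. 3.12.
hierarchical = nx.Graph([(0, 1), (0, 2), (0, 3), (0, 4)])
cyclic = nx.cycle_graph(5)

# Unnormalized betweenness: for each node v, the sum over pairs (s, t) of the
# fraction of shortest s-t paths that pass through v.
print(nx.betweenness_centrality(hierarchical, normalized=False))
print(nx.betweenness_centrality(cyclic, normalized=False))  # identical values for all nodes
```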
3.1.6.4 Bridge
An edge connecting separate parts of the network is called a bridge. Technically speaking, removing this edge from the network increases the number of connected components in the network. Finding bridges in the network can be very helpful for identifying important relations between nodes and also for finding independent groups. An example of two contrasting networks with highlighted bridges can be seen in Fig. 3.13. The first one contains three bridges; the second one is bridgeless.
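A small illustration of bridge detection, again assuming networkx (the example graph is made up):

```python
import networkx as nx

# Two triangles joined by a single edge; that edge is a bridge because
# removing it increases the number of connected components.
g = nx.Graph([(1, 2), (2, 3), (3, 1), (3, 4), (4, 5), (5, 6), (6, 4)])
print(list(nx.bridges(g)))                 # [(3, 4)]

g.remove_edge(3, 4)
print(nx.number_connected_components(g))   # 2
```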
3.1.6.5 Closeness Centrality
Closeness can be understood as a measure of how long it will take to distribute some information from a given node to the other reachable nodes in the network. Closeness can be computed as

$$C_C(v) = \frac{1}{\sum_{t \in V \setminus \{v\}} d(v, t)},$$

where d(v, t) is the length of the shortest path between the nodes v and t in the given network. Figure 3.14 illustrates two networks – the first one has a dominant node in the center of the network with the highest closeness value, and in the second one all nodes have similar closeness values.
Fig. 3.14 Two similar networks, the first one having one dominant node with higher closeness
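A sketch of closeness computed exactly as defined above (the star graph is a hypothetical example; networkx is used only to obtain shortest-path lengths):

```python
import networkx as nx

def closeness(graph, v):
    """C_C(v) = 1 / sum of shortest-path distances from v to all other reachable nodes."""
    distances = nx.single_source_shortest_path_length(graph, v)
    total = sum(d for node, d in distances.items() if node != v)
    return 1.0 / total if total > 0 else 0.0

# A star network: the hub reaches every leaf in one step, while the leaves need
# two steps to reach each other, so the hub has the highest closeness.
star = nx.star_graph(4)  # node 0 is the hub
print({v: round(closeness(star, v), 3) for v in star})
```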
Fig. 3.15 Two networks with similar number of nodes, but different clustering coefficients
3.1.6.6 Clustering Coefficient
The local clustering coefficient describes the neighborhood of a node in terms of its interconnections. It can be computed as

$$C(v) = \frac{|\{(j, k) \in E : j, k \in N\}|}{|N|\,(|N| - 1)},$$

where N is the set of nodes connected to the node v and the numerator counts the links that exist between these neighbors. Figure 3.15 shows two networks with the same number of nodes. The first one is less interconnected; therefore, its nodes have smaller clustering coefficient values. The second one contains a lot of cliques; therefore, its nodes have higher clustering coefficient values. This measure can be simply extended to the whole network (the so-called global clustering coefficient) using the concept of connected triplets and triangles.
3.1.7 Complexity Aspects
As can be seen from both the aforementioned paper and the experiments presented below, with an increasing amount of input data the Galois lattice soon becomes very complicated and its information value decreases. The computational complexity increases rapidly. A comparison of the computational complexity of algorithms for generating a concept lattice can be found in [25]. As stated in that paper, the total complexity of lattice generation depends on the size of the input data as well as on the size of the output lattice, and can be exponential. An important aspect of these algorithms is their time delay complexity (the time complexity between generating two consecutive concepts). A recent paper [15] describes a linear time delay algorithm. In many
applications, it is possible to provide additional information about the key properties interesting to the user, which can be used to filter out unsuitable concepts during lattice construction [3]. In some applications, it is also possible to select one particular concept and navigate through its neighborhood. These approaches allow us to manage data at a larger scale but cannot provide the whole picture of the lattice. Many social networks can be seen as object–attribute data or simply as a matrix (binary or fuzzy). They can be processed using matrix factorization methods, which have proven useful in many data mining applications dealing with large-scale problems. Our aim is to allow the processing of a larger amount of data, and our approximation approach is compatible with the two approaches mentioned in the previous paragraph. Clearly, some bits of information have to be neglected, but we want to know how close to or far from the original result we are. One way of doing this is to directly compare the results from the original and reduced datasets. Egghe and Rousseau [14] introduce a modification of the classic Lorenz curve to describe the dissimilarity between presence–absence data.
3.1.8 Related Work
Previously, the social network approach has been applied in connection with CBD mostly to the analysis of Wikipedia. In this section, we refer to some of these publications. The SNA metrics computed on Wikipedia ([40]) are used mainly in the area of content quality evaluation [23]. Suh et al. [36] identify user conflicts based on mutual action reverts. Crandall et al. [10] analyze user attributes and their behavior with respect to their friends. A description of Wikipedia articles using an edit network is contained in [5]. A well-known earlier approach to visualizing article history using the so-called history flow can be found in [39]. Finding experts both from Wikipedia content and among its users is addressed in [13]. An interesting prototype of CBD enhancements using social interaction and online communities can be found in [2]. An automatic recommendation of work to be done according to the user profile is presented in [9]. Said et al. [34] try to explain several CBD properties using the social properties of their users. We have selected the following experiments to illustrate both the usefulness and the bottlenecks of the social point of view on CBD. We use FCA for structure analysis and matrix reduction methods to handle the complexity.
3.2 Tools and Techniques
Before we present our experiments illustrating the usefulness of the social perspective on CBD, we have to recall some basic notions of the tools and techniques we will use.
3.2.1 Formal Concept Analysis
Formal concept analysis (FCA for short, introduced by Rudolf Wille in 1980) is a well-known method for object–attribute data analysis. The input of FCA is called a formal context C, which can be described as C = (G, M, I) – a triple consisting of a set of objects G, a set of attributes M, and a relation I between G and M. The elements of G are called the objects and the elements of M the attributes of the context. For a set A ⊆ G of objects, we define A′ as the set of attributes common to the objects in A. Correspondingly, for a set B ⊆ M of attributes, we define B′ as the set of objects which have all the attributes in B. A formal concept of the context (G, M, I) is a pair (A, B) with A ⊆ G, B ⊆ M, A′ = B, and B′ = A. B(G, M, I) denotes the set of all concepts of the context (G, M, I) and forms a complete lattice (the so-called Galois lattice). For more details, see [19, 20]. The Galois lattice may be represented by a Hasse diagram. In this diagram, every node represents one formal concept from the lattice. Nodes are usually labeled with the attributes (above the node) and objects (below the node) possessed by the concept. For the sake of clarity, reduced labeling is sometimes used (see Fig. 3.9 for an illustration), which means that each attribute is shown only at the first node (concept) in which it appears; the same holds, dually, for objects. These two labelings are equivalent. For this setting in particular, where FCA meets social networks (and consequently also graph theory), we have proposed a modified concept lattice drawing method. The result of this method can be seen in Fig. 3.16. The first difference is that the size of the presented nodes is linked to their semantic value; therefore, nodes representing concepts covering more objects (with respect to
Fig. 3.16 Modified visualization of concept lattice at rank 5
concepts at the same level in the hierarchy) are bigger. The second difference lies in the opacity of the presented nodes. Due to the reduction process presented below, we know that some concepts are not fully reliable (concepts containing objects whose attributes were modified during the reduction).
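As a concrete illustration, the following brute-force Python sketch enumerates all formal concepts of a tiny, made-up context; it is meant only to show the A → A′ and B → B′ derivations, not to be an efficient lattice-construction algorithm:

```python
from itertools import chain, combinations

# A toy formal context: objects (people) x attributes (events they attended).
# The data are invented for illustration.
context = {
    "Anna":  {"e1", "e2"},
    "Betty": {"e1", "e2", "e3"},
    "Clara": {"e2", "e3"},
}
attributes = set().union(*context.values())

def common_attributes(objs):        # A -> A'
    return set.intersection(*(context[o] for o in objs)) if objs else set(attributes)

def common_objects(attrs):          # B -> B'
    return {o for o, a in context.items() if attrs <= a}

# Brute-force enumeration: (A, B) is a formal concept iff A' = B and B' = A.
concepts = set()
for objs in chain.from_iterable(combinations(context, r) for r in range(len(context) + 1)):
    intent = common_attributes(set(objs))
    extent = common_objects(intent)
    concepts.add((frozenset(extent), frozenset(intent)))

for extent, intent in sorted(concepts, key=lambda c: len(c[0])):
    print(sorted(extent), sorted(intent))
```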
3.2.2 Nonnegative Matrix Factorization
Matrix factorization methods decompose one – usually huge – matrix into several smaller ones. Nonnegative matrix factorization (NMF) differs in that its constraints produce nonnegative basis vectors, which make possible the concept of a parts-based representation. Common approaches to NMF obtain an approximation of V by computing a pair (W, H) that minimizes the Frobenius norm of the difference V − WH. Let $V \in \mathbb{R}^{m \times n}$ be a nonnegative matrix and $W \in \mathbb{R}^{m \times k}$ and $H \in \mathbb{R}^{k \times n}$ for $0 < k \ll \min(m, n)$. Then the minimization problem can be stated as

$$\min_{W, H} \| V - WH \|^2, \quad \text{with } W_{ij} \ge 0 \text{ and } H_{ij} \ge 0 \text{ for each } i \text{ and } j.$$

There are several methods for computing an NMF. We have used the multiplicative update algorithm proposed by Lee and Seung [26, 27].
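A compact sketch of the multiplicative-update NMF; the random binary matrix stands in for a real context, and the rank, iteration count, and 0.5 threshold are illustrative assumptions, not the settings used in the experiments below:

```python
import numpy as np

def nmf(V, k, iterations=200, eps=1e-9):
    """Lee-Seung multiplicative updates minimizing ||V - WH||_F^2
    with nonnegative factors W (m x k) and H (k x n)."""
    m, n = V.shape
    rng = np.random.default_rng(0)
    W = rng.random((m, k))
    H = rng.random((k, n))
    for _ in range(iterations):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# A small random binary "context" matrix stands in for real data here.
V = (np.random.default_rng(1).random((18, 14)) > 0.6).astype(float)
W, H = nmf(V, k=5)
V_approx = W @ H
print(np.linalg.norm(V - V_approx))        # reconstruction error
# Thresholding the approximation (e.g., at 0.5) yields a reduced binary context.
print((V_approx > 0.5).astype(int))
```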
3.2.3 Lorenz Curves
To evaluate similarity we can use Lorenz curves, an approach well known in economics, in the way proposed in [14]. Suppose we have two presence–absence (binary) arrays r = (x_i)_{i=1,...,N} and s = (y_i)_{i=1,...,N} of dimension N. In the same manner as we normalize vectors, we can create arrays a_i and b_i by dividing each element of an array by its total sum. Formally,

$$a_i = \frac{x_i}{T_r}, \qquad b_i = \frac{y_i}{T_s}, \qquad \forall i = 1, \ldots, N,$$

where $T_r = \sum_{j=1}^{N} x_j$ and $T_s = \sum_{j=1}^{N} y_j$. Next, we compute the difference array d = (d_i)_{i=1,...,N} as d_i = a_i − b_i, ordered from the largest value to the smallest one. By putting $c_i = \sum_{j=1}^{i} d_j$, we obtain the Lorenz curve by joining the origin (0, 0) with the points $(i/N,\, c_i)$ for i = 1, ..., N.
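A small sketch of this construction (the two presence–absence arrays are made up):

```python
import numpy as np

def lorenz_curve(r, s):
    """Lorenz-style dissimilarity curve between two presence-absence arrays,
    following the construction described above."""
    r, s = np.asarray(r, dtype=float), np.asarray(s, dtype=float)
    a, b = r / r.sum(), s / s.sum()            # normalize by the totals T_r and T_s
    d = np.sort(a - b)[::-1]                   # differences, largest to smallest
    c = np.concatenate(([0.0], np.cumsum(d)))  # partial sums c_i, starting at the origin
    x = np.linspace(0.0, 1.0, len(c))          # i / N on the horizontal axis
    return x, c

x, c = lorenz_curve([1, 0, 1, 1, 0], [1, 1, 0, 1, 0])
print(list(zip(x.round(2), c.round(2))))
# The area under this curve grows with the dissimilarity of the two arrays.
```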
3.3 Experiments

3.3.1 Real-World Experiment
In our first example, we will use the well-known dataset from [11]. We have dealt with this dataset previously in [35], and it has essentially the same
characteristics as a lot of data from CBD. It contains information about the participation of 18 women in 14 social events during a season. This participation can be considered as a two-mode network or as a formal context (a binary matrix with rows as women and columns as social events). The visualization of this network as a bipartite graph can be seen in the upper part of Fig. 3.17. Events are represented by nodes in the first row, labeled by event numbers. The second row contains nodes representing the women, labeled by the first two letters of their names. The participation of a woman in an event is represented by an edge between the corresponding nodes. An illustration of the formal context (resp. binary matrix) can be seen in the left part of Fig. 3.18. Now we will describe the computed Galois lattice (Fig. 3.9). Each node in the graph represents one formal concept. Every concept is a set of objects (women, in this case) and a set of corresponding attributes (events). Edges express the ordering of the concepts. The aforementioned reduced labeling is used here. The lattice contains all combinations of objects and attributes present in the data. One can easily read that Sylvia participated in all events that Katherine did, and that everyone who participated in events 13 and 14 also participated in event 10. The reason these nodes are separate is the women Dorothy and Myrna, who took part in event 10 but not in events 13 and 14.
Fig. 3.17 Social network – visualized as bipartite graph before and after reduction to rank 5
Fig. 3.18 Context visualization (original, rank 5)
3.3.1.1 After Reduction
Due to the high number of nodes and edges, many interesting groups and dependencies are hard to find. We will now try to reduce the formal context to a lower dimension and observe the changes. We have performed a reduction of the original 18 × 14 context to lower ranks and computed the corresponding concept lattices. To illustrate, we have selected the results obtained for rank 5 using the NMF method. The modified context can be seen in the remaining part of Fig. 3.18. A visualization of the network as a bipartite graph (Fig. 3.17) reveals some changes, but is still too complicated. The concept lattice can give us better insight. A detailed look at the reduced lattice (Fig. 3.19 for rank 5) shows that the general layout has been preserved, as well as the most important properties (e.g., the aforementioned implication about Sylvia and Katherine). The reduction to rank 5 caused the merging of the nodes previously marked by attributes 10, 13, and 14 (which we discussed earlier). To illustrate the amount of reduction, we can compute the similarity between the original and the reduced context and draw Lorenz curves (see the first row of Fig. 3.20). A larger area under the curve means higher dissimilarity (lower similarity). Because we compare the contexts using an object-by-object approach, we obtain several curves (drawn in gray in the figure). To simplify the comparison, these curves have been averaged (the result is drawn in black). In the same manner, we have computed these curves for the formal concepts (second row of Fig. 3.20). An alternative approach to measuring the dimension reduction is to compute the so-called normalized correlation dimension. For more details see [37] and [35].
Fig. 3.19 Concept lattice at rank 5
Fig. 3.20 Lorenz curve comparing contexts (first row) and lattices (second row) – before (first column) and after reduction to ranks 8, 5, 3 (remaining columns)
Fig. 3.21 Principle of recommendation experiment
3.3.2 Recommendation and Prediction System
In this experiment, we used the social context of CBD users to suggest, or possibly predict, their further actions. A recent paper [38] shows that this is still an important problem and that the social aspect may be helpful. Our hypothesis is as follows: the process of dimension reduction in the object–attribute matrix leads to the unification of objects based on their attributes. The attributes of an object may change according to the attributes of similar objects. This principle (illustrated in Fig. 3.21) works in the field of latent semantic indexing [12] and also in the field of recommendation systems [29]. Because users do not behave randomly, but exist in their own social context and have their own motivation and knowledge, this principle has a meaningful interpretation here. This hypothesis may be used to predict user behavior or to suggest interesting subjects to the user. We wanted to analyze the success ratio of such an approach, especially in connection with the size of the original data used for the suggestions. To perform this experiment, we analyzed a random sample of user accounts from various Wikipedia projects, such as Wiktionary, WikiQuote, WikiBooks, WikiSource, WikiSpecies, and WikiNews. With each user account, we also
retrieved the date and time of user registration. To pair user accounts across different Wikipedia projects, we used only the account name (as no other reliable information is available). Next, we considered data within some predefined time range only and constructed a binary matrix containing information about the presence of a given user account in the analyzed projects. In the next step, we reduced the dimension of this matrix using NMF (rank 3). This process changed the matrix slightly. In our interpretation, a change from 0 to 1 in attribute A of user U means that many users with attributes similar to those of user U also possessed attribute A. This change can be considered as a recommendation that the user is interested in this project (or as a prediction that this user will be interested in this project in the future). To verify this, we compared these proposals with data observed several years later. Only users participating in more than two projects have been selected.
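A sketch of the whole pipeline under these assumptions – the matrix, the rank, and the 0.5 threshold used to read off 0→1 changes are illustrative choices, not the exact experimental setup:

```python
import numpy as np

def suggest_from_reduction(user_project, k=3, threshold=0.5, iterations=300, eps=1e-9):
    """Reduce a binary user x project matrix with NMF and read suggestions from
    cells that flip from 0 to (approximately) 1 in the reconstruction."""
    rng = np.random.default_rng(0)
    m, n = user_project.shape
    W, H = rng.random((m, k)), rng.random((k, n))
    for _ in range(iterations):                         # Lee-Seung updates
        H *= (W.T @ user_project) / (W.T @ W @ H + eps)
        W *= (user_project @ H.T) / (W @ H @ H.T + eps)
    reconstructed = (W @ H) > threshold
    return np.argwhere(reconstructed & (user_project == 0))  # (user, project) suggestions

# Made-up matrix: rows are user accounts, columns are Wikipedia sister projects.
m = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 1, 1, 1],
              [1, 0, 0, 0]], dtype=float)
print(suggest_from_reduction(m))
```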
3.3.2.1 Obtained Results
The experiment results are summarized in Table 3.1. We have constructed several datasets containing information about user accounts from different years (line 1). From each dataset we have selected a random sample of data (line 2). For the year 2008, we have selected three samples of different sizes. Next, we have computed our baseline – the probability of a random suggestion being true with respect to the data from 8/3/2010 (line 3). Using the aforementioned process, we have generated suggestions (line 4) and computed the precision (line 5) and recall (line 6) of these suggestions. The results show that this process can significantly improve on random suggestions (from 2–9% up to 20–27%), but only in a limited scope (recall 4–17%). It seems that some correlation exists between the sample size and recall, but this correlation will not be direct and will probably depend on some other property. Computing the NMF every time we need suggestions for one particular user is clearly not efficient. There are two options: we can preprocess the data from time to time, or we can use Online NMF (see [8]) to update the computed data whenever needed. Such an approach can be used not only to discover users' interest in projects, but in the same manner also their interest in documents, sets of documents, or topics in general. It would also be interesting to investigate the differences in behavior of similar matrix decomposition methods, such as the singular value decomposition (SVD, [22]) and especially the semidiscrete decomposition method (SDD, [28]), which is well suited to binary data.

Table 3.1 Results of the recommendation experiment (compared with the data from 8/3/2010)

Data from       1/1/2007   1/1/2008   1/1/2008   1/1/2008   1/1/2009
Sample size     1,156      751        1,309      2,838      2,660
Random change   9%         9%         9%         9%         2%
Suggestions     150        346        165        1,478      393
Precision       26%        27%        27%        27%        20%
Recall          4%         15%        4%         17%        14%
3.4 Conclusions
In this chapter, we have discussed the social aspect of CBD and how the techniques of SNA can be useful in the understanding of these databases and their evolution. We suggest that this perspective be the subject of future research.
References

1. Adler, B.T., De Alfaro, L.: A content-driven reputation system for the Wikipedia. In: Proceedings of the 16th International World Wide Web Conference (2007)
2. Atzenbeck, C., Hicks, D.L.: Socs: Increasing social and group awareness for wikis by example of Wikipedia. WikiSym (2008)
3. Belohlavek, R., Sklenar, V.: Formal concept analysis constrained by attribute-dependency formulas. ICFCA, vol. 3403, pp. 176–191 (2005)
4. Borgatti, S.P., Everett, M.G.: Network analysis of 2-mode data. Soc. Networks 19, 243–269 (1997)
5. Brandes, U., Kenis, P., Lerner, J., van Raaij, D.: Network analysis of collaboration structure in Wikipedia. In: Proceedings of the 18th International Conference on World Wide Web, pp. 731–740 (2009)
6. Burke, M., Kraut, R.: Mopping up: Modeling Wikipedia promotion decisions. In: Proceedings of the ACM 2008 Conference on Computer Supported Cooperative Work, pp. 27–36 (2008)
7. Butler, B., Joyce, E., Pike, J.: Don't look now, but we've created a bureaucracy: the nature and roles of policies and rules in Wikipedia. In: Proceeding of the 26th Annual SIGCHI Conference on Human Factors in Computing Systems, pp. 1101–1110 (2008)
8. Cao, B., Shen, D., Sun, J.T., Wang, X., Yang, Q., Chen, Z.: Detect and track latent factors with online nonnegative matrix factorization. In: The 12th International Joint Conference on Artificial Intelligence, pp. 2689–2694 (2007)
9. Cosley, D., Frankowski, D., Terveen, L., Riedl, J.: SuggestBot: Using intelligent task routing to help people find work in Wikipedia. In: Proceedings of the 12th International Conference on Intelligent User Interfaces, pp. 32–41 (2007)
10. Crandall, D., Cosley, D., Huttenlocher, D., Kleinberg, J., Suri, S.: Feedback effects between similarity and social influence in online communities. In: Proceeding of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 160–168 (2008)
11. Davis, A., Gardner, B.B., Gardner, M.R.: Deep South: A Social Anthropological Study of Caste and Class. University of Chicago Press, Chicago (1965)
12. Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K., Harshman, R.: Indexing by latent semantic analysis. J. Am. Soc. Inf. Sci. 41, 391–407 (1990)
13. Demartini, G.: Finding experts using Wikipedia. In: Proceedings of the Workshop on Finding Experts on the Web with Semantics (FEWS2007) at ISWC/ASWC2007 (2007)
14. Egghe, L., Rousseau, R.: Classical retrieval and overlap measures satisfy the requirements for rankings based on a Lorenz curve. Inf. Process. Manage. 42, 106–120 (2006)
15. Farach-Colton, M., Huang, Y.: A linear delay algorithm for building concept lattices. Combinatorial Pattern Matching: 19th Annual Symposium (2008)
16. Freeman, L.C.: Visualizing social networks. J. Soc. Struct. 1, 4 (2000)
17. Freeman, L.C.: The Development of Social Network Analysis. Empirical Press, Vancouver, British Columbia (2004)
18. Freeman, L.C., White, D.R.: Using Galois lattices to represent network data. Sociol. Methodol. 23, 127–146 (1993)
19. Ganter, B., Wille, R.: Applied lattice theory: Formal concept analysis. In: Grätzer, G. (ed.) General Lattice Theory, pp. 592–606. Birkhäuser, Basel (1997)
20. Ganter, B., Wille, R.: Formal Concept Analysis: Mathematical Foundations. Springer, New York (1997)
21. Kittur, A., Kraut, R.E.: Harnessing the wisdom of crowds in Wikipedia: quality through coordination. In: Proceedings of the ACM 2008 Conference on Computer Supported Cooperative Work, pp. 37–46 (2008)
22. Kolda, T.G., O'Leary, D.P.: A semidiscrete matrix decomposition for latent semantic indexing information retrieval. ACM Trans. Inf. Syst. 16, 322–346 (1998)
23. Korfiatis, N.T., Poulos, M., Bokos, G.: Evaluating authoritative sources using social networks: An insight from Wikipedia. Online Inf. Rev. 30, 252–262 (2006)
24. Kuznetsov, S.: Motivations of contributors to Wikipedia. ACM SIGCAS Computers and Society, vol. 36, issue 2 (2006)
25. Kuznetsov, S.O., Obedkov, S.A.: Comparing performance of algorithms for generating concept lattices. J. Exp. Theor. Artif. Intell. 14, 189–216 (2002)
26. Lee, D., Seung, H.: Algorithms for non-negative matrix factorization. Adv. Neural Inf. Process. Syst. 13, 556–562 (2001)
27. Lee, D., Seung, H.: Learning the parts of objects by non-negative matrix factorization. Nature 401, 788–791 (1999)
28. Letsche, T., Berry, M.W., Dumais, S.T.: Computational methods for intelligent information access. In: Proceedings of the 1995 ACM/IEEE Supercomputing Conference (1995)
29. Li, T., Gao, C., Du, J.: A NMF-based privacy-preserving recommendation algorithm. First IEEE International Conference on Information Science and Engineering, pp. 754–757 (2009)
30. Musiał, K., Kazienko, P., Bródka, P.: User position measures in social networks. In: Proceedings of the 3rd Workshop on Social Network Mining and Analysis, pp. 1–9 (2009)
31. Nov, O.: What motivates Wikipedians? Commun. ACM 50, 60–64 (2007)
32. O'Reilly, T.: What is Web 2.0: Design patterns and business models for the next generation of software 30 (2005)
33. Priedhorsky, R., Chen, J., Lam, S.T.K., Panciera, K., Terveen, L., Riedl, J.: Creating, destroying, and restoring value in Wikipedia. In: Proceedings of the 2007 International ACM Conference on Supporting Group Work, pp. 259–268 (2007)
34. Said, A., De Luca, E.W., Albayrak, S.: How social relationships affect user similarities. Workshop on Social Recommender Systems, IUI2010 (2010)
35. Snášel, V., Horák, Z., Kočíbová, J., Abraham, A.: On social networks reduction. In: Proceedings of the 18th International Symposium on Foundations of Intelligent Systems, pp. 533–541 (2009)
36. Suh, B., Chi, E.H., Pendleton, B.A., Kittur, A.: Us vs. them: Understanding social dynamics in Wikipedia with revert graph visualizations. In: Proceedings of the IEEE VAST, pp. 163–170 (2007)
37. Tatti, N., Mielikainen, T., Gionis, A., Mannila, H.: What is the dimension of your binary data? In: Proceedings of the 6th International Conference on Data Mining, pp. 603–612 (2006)
38. Tylenda, T., Angelova, R., Bedathur, S.: Towards time-aware link prediction in evolving social networks. In: Proceedings of the 3rd Workshop on Social Network Mining and Analysis, pp. 1–10 (2009)
39. Viégas, F.B., Wattenberg, M., Dave, K.: Studying cooperation and conflict between authors with history flow visualizations. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 575–582 (2004)
40. Zlatić, V., Božičević, M., Štefančić, H., Domazet, M.: Wikipedias: Collaborative web-based encyclopedias as complex networks. Phys. Rev. E 74, 016115 (2006)
Chapter 4
Design Considerations for a Social Network-Based Recommendation System (SNRS)
Jianming He and Wesley W. Chu
Abstract The effects of homophily among friends have demonstrated their importance to product marketing. However, homophily has rarely been considered in recommender systems. In this chapter, we propose a new paradigm of recommender systems which can significantly improve performance by utilizing information in social networks, including user preference, item likability, and homophily. A probabilistic model, named SNRS, is developed to make personalized recommendations from such information. We extract data from a real online social network, and our analysis of this large dataset reveals that friends have a tendency to select the same items and give similar ratings. Experimental results from this dataset show that SNRS not only improves the prediction accuracy of recommender systems, but also remedies the data sparsity and cold-start issues inherent in collaborative filtering. Furthermore, we propose to improve the performance of SNRS by applying semantic filtering of social networks, and validate its improvement via a class project experiment. In this experiment, we demonstrate how relevant friends can be selected for inference based on the semantics of friend relationships and finer-grained user ratings. Such technologies can be deployed by most content providers. Finally, we discuss two trust issues in recommender systems and show how SNRS can be extended to solve these problems.
4.1 Introduction
In order to overcome information overload, recommender systems have become a key tool for providing users with personalized recommendations on items such as movies, music, books, and news. Motivated by many practical applications, researchers have developed algorithms and systems over the last decade. Some of them have been commercialized by online vendors such as Amazon.com and Netflix.com. These systems predict user preferences (often represented as numeric ratings) for
J. He (*) Computer Science Department, University of California, Los Angeles, CA 90095, USA e-mail: [email protected]
new items based on the user's past ratings of other items. The algorithms used in recommender systems are usually of two types – content-based filtering and collaborative filtering. Let us define a target item as the item being considered for recommendation, and a target user as the user who is receiving recommendations. In content-based filtering, a target item is recommended to a target user if the item is similar, in terms of explicit content attributes, to the ones that the user liked in the past [1, 2], while in collaborative filtering, a target item is recommended to a target user if it has been liked in the past by people who are similar to this user. Collaborative filtering finds users who are similar to a target user based on their previous ratings of other items [3–5]. Despite all of the efforts above, recommender systems still face many challenges. First, there are continuous demands for further improvements in the prediction accuracy of recommender systems. Second, the algorithms for recommender systems suffer from many issues. For example, in order to measure item similarity, content-based methods rely on explicit item descriptions. However, such descriptions may be difficult to obtain for abstract items like ideas or opinions. On the other hand, collaborative filtering has a data sparsity problem [6]. In contrast to the huge number of items in recommender systems, each user normally rates only a few items. Therefore, the user/item rating matrix is typically very sparse. It is difficult for recommender systems to accurately measure user similarities from that limited number of reviews. A related problem is the cold-start problem [6]. Even in a system that is not particularly sparse, when a user initially joins, the system has no reviews from this user. Therefore, the system cannot accurately interpret this user's preference. To tackle these problems, two approaches have been proposed [1, 4, 7, 8]. The first approach condenses the user/item rating matrix through dimensionality reduction techniques such as Singular Value Decomposition (SVD) [4, 8, 9]. By clustering users or items according to their latent structure, unrepresentative users or items can be discarded, and thus the user/item matrix becomes denser. However, these techniques do not significantly improve the performance of recommender systems and sometimes even make the performance worse. The second approach "enriches" the user/item rating matrix by: (1) using a default rating; (2) incorporating implicit user ratings, for example, the time spent on reading articles [10]; (3) filling in with half-baked rating predictions from content-based methods [7]; or (4) exploiting transitive associations among users through their past transactions and feedback [11]. These methods alleviate the data sparsity problem to some extent but still cannot solve the cold-start issue. In this chapter, we plan to solve these problems from a different perspective. Specifically, we propose a social network-based recommender system (SNRS) [12] which predicts user interests by utilizing rich semantic information in social networks, especially social relationships. In a social network, two persons connected via a social relationship tend to have attributes similar to each other. This is a fundamental property of social networks, and it is also known as the homophily principle [13]. In product marketing, the importance of social relationships has long been recognized [14, 15].
Intuitively, when we want to buy an unfamiliar product, we often consult with our friends who
have already experienced the product, since they are those whom we can reach for immediate advice. When friends recommend a product to us, we also tend to accept the recommendation because we consider their input trustworthy. Many marketing strategies, such as that of Hotmail, that leveraged social relationships have achieved great success [16]. Thus, social relationships play a key role when people make decisions about products, and they are the basis for constructing SNRS. The recent emergence of online social networks (OSNs) gives us an opportunity to investigate the role of social relationships in recommender systems. With the increasing popularity of Web 2.0, many OSNs such as Myspace.com and Facebook.com have emerged. Members of those networks have their own personalized space where they not only publish their biographies, hobbies, interests, blogs, etc., but also list their friends. Here, friends are defined in a general sense: any two users who are connected by an explicit social relationship are considered friends. In reality, they can be family members, buddies, classmates, and so on. In addition, we define immediate friends as friends who are just one hop away from each other in a social network graph, and distant friends as friends who are multiple hops away. OSNs provide platforms where people can place themselves on exhibit and maintain connections with friends. As OSNs continue to gain popularity, the unprecedented amount of personal information and social relationships can promote social science research that was once limited by a lack of data. In this chapter, we design a new paradigm of recommender systems by utilizing such information in social networks. While the benefits of utilizing social network information in recommender systems can be significant, how to materialize such an idea is especially challenging considering the complexity of social networks. Many challenging questions can be raised in this context. In particular, we investigate the following: (1) Does homophily really exist when friends rate items? (2) How can different types of social network information be used effectively to make better predictions? (3) If predictions rely on the opinions of immediate friends, what if a target user has no immediate friend who has reviewed the same target item? (4) How does SNRS handle heterogeneities in social networks, such as different types of friend relationships? (5) How does SNRS handle situations where the reviews from immediate friends are not trustworthy? The remainder of the chapter is organized as follows. First, in Sect. 4.2, we give a background on collaborative filtering algorithms. Then, in Sect. 4.3, we introduce the dataset that we crawled from a real online social network, Yelp.com. We study this dataset to determine whether homophily exists when friends rate items. In Sect. 4.4, we present our SNRS system. Following that, we evaluate the performance of SNRS on the Yelp dataset in Sect. 4.5, focusing on its prediction accuracy and coverage. In Sect. 4.6, we propose to further improve the prediction accuracy of SNRS by applying semantic filtering of social networks. We design a student experiment in a graduate class to validate its effectiveness. In Sect. 4.7, we propose extensions of SNRS to handle the trust issues caused by users with unreliable domain knowledge. Finally, we review related work in Sect. 4.8.
4.2 Background
After the pioneering work done in the GroupLens project in 1994 [5], collaborative filtering (CF) soon became one of the most popular algorithms in recommender systems. Many variations of this algorithm have also been proposed, such as hybrid approaches combining CF with content-based filtering [7, 17–19] or adopting different weighting schemes [6, 20]. In this chapter, we will use the traditional CF proposed in the GroupLens project as one of the comparison methods. Therefore, the remainder of this section focuses on this algorithm. The assumption of CF is that people who agreed in the past tend to agree again in the future. Therefore, CF first finds users with tastes similar to those of the target user. CF then makes recommendations to the target user by predicting the target user's rating of the target item based on the ratings of his/her top-K similar users. User ratings are often represented by discrete values within a certain range, for example, 1–5. Here 1 indicates an extreme dislike of the target item, while 5 shows high praise. Let R_UI be the rating of the target user U on the target item I. Then R_UI is estimated as the weighted sum of the votes of similar users as follows:

$$R_{UI} = \bar{R}_U + Z \sum_{V \in C} w(U, V)\,(R_{VI} - \bar{R}_V), \qquad (4.1)$$

where $\bar{R}_U$ and $\bar{R}_V$ represent the average ratings of the target user U and of every user V in U's neighborhood C, which consists of the top-K similar users of U. w(U, V) is the weight between users U and V, and $Z = 1/\sum_{V} w(U, V)$ is a normalizing constant that normalizes the total weight to one. Specifically, w(U, V) can be defined using the Pearson correlation coefficient [5]:

$$w(U, V) = \frac{\sum_{I}(R_{UI} - \bar{R}_U)(R_{VI} - \bar{R}_V)}{\sqrt{\sum_{I}(R_{UI} - \bar{R}_U)^2\,\sum_{I}(R_{VI} - \bar{R}_V)^2}}, \qquad (4.2)$$

where the summations over I are over the common items which both users U and V have rated. As we can see, traditional CF models user-to-user relations based purely on user rating similarities and does not utilize the semantic friend relations among users at all. However, such semantics are essential to the buying decisions of users. In the following sections, we present a new paradigm of recommender systems which improves their performance by using the semantic information in social networks.
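A minimal sketch of (4.1) and (4.2) on made-up ratings; note that, as a common safeguard, the normalizing constant below uses the absolute weights, which is an assumption beyond the formula as stated:

```python
import numpy as np

def pearson(u, v, ratings):
    """Pearson-style weight (4.2) over the items co-rated by users u and v,
    with deviations taken from each user's overall average rating."""
    common = set(ratings[u]) & set(ratings[v])
    if len(common) < 2:
        return 0.0
    ru = np.array([ratings[u][i] for i in common])
    rv = np.array([ratings[v][i] for i in common])
    du = ru - np.mean(list(ratings[u].values()))
    dv = rv - np.mean(list(ratings[v].values()))
    denom = np.sqrt((du ** 2).sum() * (dv ** 2).sum())
    return float((du * dv).sum() / denom) if denom else 0.0

def predict(u, item, ratings, k=2):
    """Weighted-sum prediction (4.1) from the top-k most similar users who rated the item."""
    mean_u = np.mean(list(ratings[u].values()))
    nbrs = sorted(((pearson(u, v, ratings), v) for v in ratings
                   if v != u and item in ratings[v]), reverse=True)[:k]
    total_w = sum(abs(w) for w, _ in nbrs)   # normalize by total (absolute) weight
    if total_w == 0:
        return mean_u
    return mean_u + sum(w * (ratings[v][item] - np.mean(list(ratings[v].values())))
                        for w, v in nbrs) / total_w

# Hypothetical ratings on a 1-5 scale: {user: {item: rating}}.
ratings = {"u1": {"a": 5, "b": 3, "c": 4},
           "u2": {"a": 4, "b": 2, "c": 5, "d": 4},
           "u3": {"a": 2, "b": 5, "d": 1}}
print(predict("u1", "d", ratings))
```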
4.3 Yelp.com
For this research, we collect a dataset from a real online social network Yelp.com. As one of the most popular Web 2.0 websites, Yelp provides users with local searches for restaurants, shopping, spas, nightlife, etc. Besides maintaining the
traditional features of recommender systems, Yelp provides social network features so that it can attract more users. Specifically, Yelp allows users to invite their friends to join Yelp or make new friends with those who already exist at Yelp. The friendship at Yelp is a mutual relationship, which means that when a user adds another user as a friend, the first user will be automatically added as a friend of the second user. Yelp provides a homepage for each local commercial entity and each user. From the homepage of a local entity, we can find all the reviews of this entity. From the homepage of a user, we can find all the reviews written by this user, as well as friends explicitly specified by this user. Specifically, we picked restaurants, the most popular category at Yelp, as the problem domain. We crawled the homepages of all the Yelp restaurants in the Los Angeles area that were registered before November 2007, which ended up being 4,152 restaurants. Then, by following the reviewers’ links in the Yelp restaurant homepages, we further crawled the homepages of all these reviewers, which resulted in 9,414 users. Based on the friend links in users’ homepages, we were able to identify friends from the crawled users and thus reconstruct a social network of Yelp users. Note that the friends we collected for each user may only be a subset of the actual friends listed on the user’s homepage. This is because we require every user in our dataset to have at least one review in the crawled restaurants. In other words, the social network that we crawled has a focus on dining. A preliminary study of this dataset yields the following results. The dataset contains 4,152 restaurants, 9,414 users, and 55,801 user reviews. Thus, each Yelp user, on average, writes 5.93 reviews and each restaurant, on average, has 13.44 reviews. If we take a closer look at the relations between the number of users and the number of their immediate friends (as shown in Fig. 4.1a), we can see that it actually follows a power-law distribution; this means that most users have only a few immediate friends while a few users have a lot of immediate friends. A similar distribution also applies to the relations between the number of users and the number of reviews, as shown in Fig. 4.1b. Because most users on Yelp review only a few restaurants, it thus causes a data sparsity issue as in most recommender systems. In particular, the sparsity of this dataset, i.e., the percentage of user/item pairs whose ratings are unknown, is 99.86%.
Fig. 4.1 (a) The number of users versus the number of immediate friends in the Yelp network, and (b) the number of users versus the number of reviews both follow the power-law distribution
Since homophily is the main assumption behind SNRS, we would like to see whether homophily appears in the Yelp dataset. In the following studies, we focus on two questions: (1) whether friends tend to review the same restaurants more than non-friends do, and (2) whether friends tend to give more similar ratings than non-friends do.
4.3.1 Review Correlations of Immediate Friends
In this study, we want to know: if a user reviews a restaurant, what is the chance that at least one of the user's immediate friends has also reviewed the same restaurant? To answer this question, we count, for each user, the percentage of restaurants that have also been reviewed by at least one immediate friend. The average percentage over all users in the dataset is 18.6%. As a comparison, we calculate the same probability by assuming that immediate friends review restaurants uniformly at random and independently. In a social network with n users, for a user with q immediate friends and a restaurant with m reviewers (including the current user), the chance that at least one of the q immediate friends appears among the m − 1 other reviewers is

$$1 - \binom{n - q - 1}{m - 1} \Big/ \binom{n - 1}{m - 1}.$$

We calculate this value for every user and every restaurant the user reviewed. The average probability over all users is only 3.7%. Compared with the 18.6% observed in the dataset, it is clear that immediate friends do not review restaurants randomly. We also compare the average number of co-reviewed restaurants between any two immediate friends and any two random users on Yelp. The results are 0.85 and 0.03, respectively, which again illustrates the tendency of immediate friends to co-review the same restaurants.
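The baseline can be computed directly; the numbers plugged in below are illustrative, not the actual per-restaurant values:

```python
from math import comb

def p_at_least_one_friend(n, q, m):
    """Probability that at least one of q friends is among the m - 1 other
    reviewers, if the reviewers were drawn uniformly at random from n - 1 users."""
    return 1 - comb(n - q - 1, m - 1) / comb(n - 1, m - 1)

print(p_at_least_one_friend(n=9414, q=10, m=14))  # roughly 0.014 for these made-up values
```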
4.3.2 Rating Correlations of Immediate Friends
To validate whether immediate friends tend to give more similar ratings than non-friends do, we compare the average rating differences (in absolute values) for the same restaurant between reviewers who are immediate friends and reviewers who are not. We find that for every restaurant in our dataset, if two reviewers are immediate friends, their ratings of this restaurant differ by 0.88 on average with a standard deviation of 0.89. If they are not, their rating difference is 1.05 on average and the standard deviation is 0.98. This result clearly demonstrates that immediate friends, on average, give more similar ratings than do non-friends. From the studies above, we can see that immediate friends at Yelp show a stronger tendency than non-friends both to review the same restaurants and to give similar ratings. In other words, homophily indeed exists when friends rate items. This observation further leads us to the design of SNRS in Sect. 4.4.
4.4 A Social Network-Based Recommender System (SNRS)
Before we present SNRS, let us first use Angela's story to recall the critical factors in our buying decisions. Angela wants to watch a movie on a weekend. Her favorite movies are dramas. From the Internet, she finds two movies that are particularly interesting – "Revolutionary Road" and "The Curious Case of Benjamin Button". These two movies are both highly rated on the message board at Yahoo Movies. Because she cannot decide which movie to watch, she calls her best friend Linda, with whom she often socializes. Linda has not viewed these two movies either, but she knows that one of her office mates has just watched "Revolutionary Road" and highly recommended it. So Linda suggests, "Why don't we go to watch Revolutionary Road together?" Angela is certainly willing to take Linda's recommendation and has a fun night at the movies with her friend. If we review this scenario, we can see at least three factors that really contribute to Angela's final decision. The first factor is Angela's own preference for drama movies. If Angela did not like drama movies, she would be less likely to pick something like "Revolutionary Road" to begin with. The second factor is the global popularity of these two movies. If these movies had received unfavorable reviews, Angela would most likely lose interest and stop any further investigation. Finally, it is the recommendation from Angela's friend, Linda, that leads to Angela finally choosing "Revolutionary Road". Interestingly, Linda's opinion is also influenced by her office mate. If we consider the decisions that we make in our daily lives, many of them are actually influenced by these three factors. Figure 4.2 further illustrates how these three factors impact upon customers' final buying decisions. Intuitively, a customer's buying decision or rating is decided by both his/her own preference for similar items and his/her knowledge about the
Fig. 4.2 The three factors that influence a customer’s buying decision: user preference for similar items, information regarding the target item from the public media, and feedback from friends
characteristics of the target item. A user’s preference, such as Angela’s interest in drama movies, is usually reflected in the user’s past ratings of other similar items, for example, the number of drama movies that Angela previously viewed and the average rating that Angela gave to those movies. Knowledge about the target item can be obtained from public media such as magazines, television, and the Internet. Meanwhile, the feedback from friends is another source of knowledge regarding the item, and they are often more trustworthy than advertisements. When a user starts considering the feedback from his/her friends, this user is then influenced by his/her friends. Note that such an influence is not limited to that from our immediate friends. Distant friends can also indirectly exert their influence on us; in the previous scenario, for example, Angela was influenced by Linda’s office mate. Each one of these three factors has an impact on a user’s final buying decision. If the impact from all of them is positive, it is very likely that the target user will select the item. On the contrary, if any has a negative influence, for example, very low ratings in other user reviews, the chance that the target user will select the item will decrease. Bearing this in mind, we propose SNRS in the following sections.
4.4.1 SNRS Architecture
Let us now introduce the variables used in this chapter and formalize the problem that we are dealing with. Specifically, we use capital letters to represent variables, and capital bold letters to represent their corresponding variable sets. The value of each variable or variable set is represented by the corresponding lowercase letter. Formally, we consider a social network as a graph G = (U, E) in which U represents the nodes (users) and E represents the links (social relationships). Each user U in U has a set of attributes A_U as well as immediate neighbors (friends) N(U) such that if V ∈ N(U), then (U, V) ∈ E. The values of the attributes A_U are represented as a_U. Moreover, a recommender system contains the records of users' previous ratings, which can be represented by a triple relation T = (U, I, R), in which U is the set of users in the social network G, I is the set of items (products or services), and each item I in I has a set of attributes A′_I. R is the set of item ratings; that is, R = {R_UI}, where R_UI is user U's rating on item I and takes a numeric value k (e.g., k ∈ {1, 2, ..., 5}). Moreover, we define I(U) as the set of items that user U has reviewed, and refer to the set of reviewers of item I as U(I). The goal of this recommender system is to predict

$$\Pr\bigl(R_{UI} = k \mid A' = a'_I,\; A = a_U,\; \{R_{VI} = r_{VI} : \forall V \in U(I)\}\bigr),$$

i.e., the probability distribution of the target user U's rating of the target item I, given the attribute values of item I, the attribute values of user U, and V's rating of item I for every reviewer V of item I. Once we obtain this distribution, R_UI is the expected value of the distribution. Items with high estimated ratings will be recommended to the target user, and users with high estimated ratings of the target item are the potential buyers. To achieve this goal, we propose SNRS as shown in Fig. 4.3. SNRS consists of two major components: an immediate friend inference engine and a distant friend inference engine. As we pointed out in Angela's story, a user's buying decision is
Fig. 4.3 The architecture of a social network-based recommender system
influenced not only by his/her immediate friends; his/her distant friends can also exert their influence indirectly through his/her immediate friends. Therefore, SNRS incorporates these two types of influences, but it deals with them differently. In particular, the immediate friend inference engine focuses on exploiting the homophily effects among immediate friends, and the distant friend inference engine leverages the immediate friend inference engine to bring homophily effects among distant friends into consideration. More specifically, the immediate friend inference engine contains four smaller components: (1) User preference inference engine computes the probability distribution of a target user U’s rating based on U’s preferences to the items similar to a target item I; (2) Item likability inference engine computes the probability distribution of the rating that item I receives based on the characteristics of the reviewers similar to user U; (3) Homophily inference engine utilizes homophily effects among
82
J. He and W.W. Chu
immediate friends and computes the probability distribution of user U’s rating of item I based on U’s immediate friends’ ratings on item I; and finally (4) an aggregator takes the results from the aforementioned three inference engines, combines them, and predicts user U’s rating distribution for item I. We shall discuss these components of SNRS in the following sections.
4.4.2 Immediate Friend Inference
Since the immediate friend inference engine considers homophily from immediate friends only, the probability distribution it estimates is actually Pr(R_UI = k | A′ = a′_I, A = a_U, {R_VI = r_VI : ∀V ∈ U(I) ∩ N(U)}). The set of users V is limited from all reviewers of item I to U's immediate friends who have also rated item I. Note that information from the other reviewers of item I will be used in the distant friend inference engine. Since directly computing Pr(R_UI = k | A′ = a′_I, A = a_U, {R_VI = r_VI : ∀V ∈ U(I) ∩ N(U)}) is difficult, we assume that the influences of the three factors, i.e., item attributes, user attributes, and ratings of immediate friends, are independent of each other. Therefore, we factorize this probability as follows:

$$\Pr\bigl(R_{UI} = k \mid A' = a'_I,\; A = a_U,\; \{R_{VI} = r_{VI} : \forall V \in U(I) \cap N(U)\}\bigr) = \frac{1}{Z}\,\Pr(R_{UI} = k \mid A' = a'_I)\;\Pr(R_{UI} = k \mid A = a_U)\;\Pr\bigl(R_{UI} = k \mid \{R_{VI} = r_{VI} : \forall V \in U(I) \cap N(U)\}\bigr). \qquad (4.3)$$
First, Pr(R_U = k | A′ = a′_I) is the conditional probability that the target user U will give a rating k to an item with the same attribute values as item I. This probability represents U's preference for items similar to I. Because this value depends on the attribute values of items rather than on an individual item, we drop the subscript I in R_UI for simplification. Second, Pr(R_I = k | A = a_U) is the probability that the target item I will receive a rating value k from a reviewer whose attribute values are the same as U's. This probability reflects the general likability of the target item I by users like U. For the same reason, because this value depends on the attribute values of users rather than on a specific user, we drop the subscript U in R_UI. Finally, Pr(R_UI = k | {R_VI = r_VI : ∀V ∈ U(I) ∩ N(U)}) is the probability that the target user U gives a rating value k to the target item I given the ratings of U's immediate friends for item I. This is where we actually take homophily effects into consideration in SNRS. We shall present the components for estimating each of the above probabilities in the following sections.

4.4.2.1 User Preference
Pr(R_U = k | A′ = a′_I) measures the target user U's preference for items similar to item I. For example, if we want to predict Angela's rating of "Revolutionary Road", Pr(R_U = k | A′ = a′_I) gives us a hint of how likely it is that Angela will give a rating k to a drama movie which also has Kate Winslet in the cast. To estimate this
probability, we adopt the naïve Bayes assumption: we assume that the item attributes in A′, for example, category and cast, are independent of each other. Therefore, we have

$$\Pr(R_U = k \mid A' = a'_I) = \frac{\Pr(R_U = k)\,\Pr(A'_1, A'_2, \ldots, A'_n \mid R_U = k)}{\Pr(A'_1, A'_2, \ldots, A'_n)} = \frac{\Pr(R_U = k)\prod_{j=1}^{n}\Pr(A'_j \mid R_U = k)}{\Pr(A'_1, A'_2, \ldots, A'_n)}, \quad A' = \{A'_1, A'_2, \ldots, A'_n\}, \qquad (4.4)$$

where Pr(A′_1, A′_2, ..., A′_n) can be treated as a normalizing constant, Pr(R_U = k) is the prior probability that U gives a rating k, and Pr(A′_j | R_U = k) is the conditional probability that item attribute A′_j in A′ has a value a′_j given that U rated k; for example, Pr(movie type = drama | R_U = 4). The last two probabilities can be estimated by counting the review ratings of the target user U. Specifically,

$$\Pr(R_U = k) = \frac{|I(R_U = k)| + 1}{|I(U)| + n} \qquad (4.5)$$

and

$$\Pr(A'_j = a'_j \mid R_U = k) = \frac{|I(A'_j = a'_j,\; R_U = k)| + 1}{|I(R_U = k)| + m}, \qquad (4.6)$$
where |I(U)| is the number of reviews of user U in the training set, |I(R_U = k)| is the number of reviews to which user U gives the rating value k, and |I(A′_j = a′_j, R_U = k)| is the number of reviews to which U gives a rating value k while attribute A′_j of the corresponding target item has the value a′_j. Notice that we add an extra 1 to the numerators in both equations, and add n, the range of review ratings, to the denominator in (4.5) and m, the range of A′_j's values, to the denominator in (4.6). This method is known as the Laplace estimate, a well-known technique for estimating probabilities [21], especially from a small number of training samples. Because of the Laplace estimate, "strong" probabilities, such as 0 or 1, resulting from direct probability computation can be avoided. Moreover, in some cases when item attributes are not available, we can approximate Pr(R_U = k | A′ = a′_I) by the prior probability Pr(R_U = k). Even though Pr(R_U = k) does not contain information specific to certain item attributes, it does take into account U's general rating preference; for example, if U is a generous person, U gives high ratings regardless of the items.
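A sketch of the user-preference estimate (4.4)–(4.6) with Laplace smoothing; the ratings and genre attributes are made up, and the attribute-value count m is taken from the training items for simplicity:

```python
def user_preference(reviews, item_attrs, target_attrs, rating_values=(1, 2, 3, 4, 5)):
    """Pr(R_U = k | A' = a'_I) via naive Bayes with Laplace smoothing, (4.4)-(4.6).
    reviews: {item: rating} for the target user; item_attrs: {item: {attr: value}}."""
    n = len(rating_values)
    dist = {}
    for k in rating_values:
        items_k = [i for i, r in reviews.items() if r == k]
        prior = (len(items_k) + 1) / (len(reviews) + n)                # (4.5)
        likelihood = 1.0
        for attr, value in target_attrs.items():
            m = len({item_attrs[i][attr] for i in item_attrs})         # number of values of this attribute
            matches = sum(1 for i in items_k if item_attrs[i][attr] == value)
            likelihood *= (matches + 1) / (len(items_k) + m)           # (4.6)
        dist[k] = prior * likelihood
    z = sum(dist.values())                                             # normalizing constant
    return {k: p / z for k, p in dist.items()}

# Hypothetical data: Angela's past movie ratings and the movies' genres.
reviews = {"m1": 5, "m2": 4, "m3": 2}
item_attrs = {"m1": {"genre": "drama"}, "m2": {"genre": "drama"}, "m3": {"genre": "comedy"}}
print(user_preference(reviews, item_attrs, {"genre": "drama"}))
```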
4.4.2.2 Item Likability Inference Engine
Pr(R_I = k | A = a_U) captures the general likability of item I by users like user U. For example, from a reviewer who is similar to Angela (e.g., the same gender and age), how likely is it that "Revolutionary Road" will receive a rating of 5? Similar
84
J. He and W.W. Chu
to the estimation of user preference, we use the naïve Bayes assumption and assume that user attributes are independent. Thus, we have

$$\Pr(R_I = k \mid A = a_U) = \frac{\Pr(R_I = k)\,\Pr(A_1, A_2, \ldots, A_m \mid R_I = k)}{\Pr(A_1, A_2, \ldots, A_m)} = \frac{\Pr(R_I = k)\prod_{j=1}^{m}\Pr(A_j \mid R_I = k)}{\Pr(A_1, A_2, \ldots, A_m)}, \quad A = \{A_1, A_2, \ldots, A_m\}, \qquad (4.7)$$

where Pr(R_I = k) is the prior probability that the target item I receives a rating value k, and Pr(A_j | R_I = k) is the conditional probability that user attribute A_j of a reviewer has a value a_j given that item I receives a rating k from this reviewer. These two probabilities can be learned by counting the review ratings of the target item I in a manner similar to what we did in learning user preferences. When user attributes are not available, we use Pr(R_I = k), i.e., item I's general likability regardless of users, to approximate Pr(R_I = k | A = a_U). In addition, Pr(A_1, A_2, ..., A_m) is a normalizing constant.
4.4.2.3 Homophily Inference Engine
Finally, $\Pr(R_{UI} = k \mid \{R_{VI} = r_{VI} : \forall V \in U(I) \cap N(U)\})$ is where SNRS utilizes the homophily effects from immediate friends. To estimate this probability, SNRS needs to learn the correlations between the target user U and each of U's immediate friends V from the items that they have both rated previously, and then to assume that each pair of friends will behave consistently when reviewing the target item I as well. Thus, U's rating can be predicted from $r_{VI}$ according to these correlations. A common practice for learning such correlations is to estimate user similarities or coefficients, based on either user profiles or user ratings. However, user correlations are often too complex to be fully captured by a single similarity or coefficient value; different measures return different results and lead to different conclusions on whether or not a pair of users is really correlated [16]. At the other extreme, user correlations can also be represented in a joint distribution table of U's and V's ratings on the items that they have both rated, i.e., $\Pr(R_{UJ}, R_{VJ})$ for $\forall J \in I(U) \cap I(V)$. This table fully preserves the correlations between U's and V's ratings. However, building such a distribution with accurate statistics requires a large number of training samples. This is especially a problem for recommender systems because, in most of these systems, users review only a few items compared with the large number of items available, and the items co-rated by two users are even fewer. Therefore, in this study, we use another approach to remedy the problems of both cases. In Sect. 4.3, we showed that immediate friends do tend to give more similar ratings than non-friends. Therefore, for each pair of immediate friends U and V, we consider their ratings on the same item to be close, with some error $\epsilon$. That is,
$$R_{UI} = R_{VI} + \epsilon, \qquad I \in I(U) \cap I(V),\ V \in N(U) \cap U(I). \qquad (4.8)$$
From this equation, we can see that the error $\epsilon$ can be simulated from the histogram of U's and V's rating differences, $\mathrm{Hist}(R_{UJ} - R_{VJ})$ for $\forall J \in I(U) \cap I(V)$. Thus, $\mathrm{Hist}(R_{UJ} - R_{VJ})$ serves as the correlation measure between U and V. For ratings ranging from one to five, $\mathrm{Hist}(R_{UJ} - R_{VJ})$ is a distribution over nine values, i.e., from $-4$ to $4$. Compared with similarity measures, it preserves more detail of the friends' review ratings; compared with the joint distribution approach, it has fewer degrees of freedom. We assume that U's and V's rating difference on the target item I is consistent with $\mathrm{Hist}(R_{UJ} - R_{VJ})$. Therefore, when V gives a rating $r_{VI}$ to the target item, the probability that $R_{UI}$ has a value k is proportional to $\mathrm{Hist}(k - r_{VI})$:

$$\Pr(R_{UI} = k \mid R_{VI} = r_{VI}) \propto \mathrm{Hist}(k - r_{VI}). \qquad (4.9)$$
When the target user U has more than one immediate friend who has co-rated the target item, the influences from all of those friends can be incorporated as a product of the normalized histograms of the individual friend pairs:

$$\Pr(R_{UI} = k \mid \{R_{VI} = r_{VI} : \forall V \in U(I) \cap N(U)\}) = \frac{1}{Z}\prod_{V}\frac{1}{Z_V}\,\mathrm{Hist}(k - r_{VI}), \qquad (4.10)$$
where $Z_V$ is the normalizing constant for the histogram of each immediate friend pair, and Z is the normalizing constant for the overall product. Once we obtain $\Pr(R_U = k \mid A' = a'_I)$, $\Pr(R_I = k \mid A = a_U)$, and $\Pr(R_{UI} = k \mid \{R_{VI} = r_{VI} : \forall V \in U(I) \cap N(U)\})$, these probabilities are fed into an aggregator, which yields the ultimate rating distribution of $R_{UI}$ as shown in (4.3). $R'_{UI}$, the predicted value of $R_{UI}$, is the expected value of this distribution:

$$R'_{UI} = \sum_{k} k \cdot \Pr\bigl(R_{UI} = k \mid A' = a'_I,\ A = a_U,\ \{R_{VI} = r_{VI} : \forall V \in U(I) \cap N(U)\}\bigr). \qquad (4.11)$$
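Putting the pieces together, and assuming that the aggregator referred to in (4.3) combines the three distributions as a normalized product, immediate friend inference can be sketched as follows. The helper names are illustrative, and the small constant added to empty histogram bins is our own practical choice to keep the product from collapsing to zero.

from collections import Counter

def rating_diff_hist(u_ratings, v_ratings, n=5):
    """Normalized Hist(R_UJ - R_VJ) over the items J co-rated by U and V."""
    common = set(u_ratings) & set(v_ratings)
    hist = Counter(u_ratings[j] - v_ratings[j] for j in common)
    bins = list(range(-(n - 1), n))                     # possible differences -(n-1)..(n-1)
    eps = 1e-6                                          # keeps empty bins from zeroing the product
    total = sum(hist.values()) + eps * len(bins)
    return {d: (hist[d] + eps) / total for d in bins}

def immediate_friend_inference(user_pref, item_like, friend_hists, friend_ratings, n=5):
    """Combine user preference, item likability, and homophily, cf. (4.9)-(4.11).

    user_pref, item_like: distributions over k = 1..n from the two engines above
    friend_hists:         dict friend V -> normalized difference histogram with U
    friend_ratings:       dict friend V -> rating r_VI of the target item
    """
    dist = {}
    for k in range(1, n + 1):
        p = user_pref[k] * item_like[k]
        for v, r_vi in friend_ratings.items():
            p *= friend_hists[v][k - r_vi]              # Hist(k - r_VI), as in (4.9)-(4.10)
        dist[k] = p
    z = sum(dist.values()) or 1.0                       # overall normalizing constant Z
    dist = {k: p / z for k, p in dist.items()}
    expectation = sum(k * p for k, p in dist.items())   # predicted rating R'_UI, as in (4.11)
    return expectation, dist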
4.4.3 Distant Friend Inference
We have just introduced the approach for predicting a target user’s rating of a target item from those of the user’s immediate friends for the same item. However, in reality, there are many cases where no immediate friends of a target user have reviewed the same target item; thus, the rating of the target user cannot be predicted from immediate friend inference. To solve this problem, we propose distant friend inference. The idea of distant friend inference is intuitive. Even though V, an immediate friend of a target user U, has no rating for the target item, if V has his/her own immediate friends who rated the target item, we should be able to predict V’s rating of the target item via the immediate friend inference, and then to predict U’s rating
based on the predicted rating from V. This process conforms to real scenarios, as in our previous example, where Linda's office mate influences Linda, who in turn influences Angela. Following this intuition, we apply an iterative classification method [22-24] for distant friend inference. Iterative classification is an approximation technique for classifying relational entities. It is based on the fact that relational entities are correlated with each other: the classification of an entity often depends on the classification estimates of its neighbors, and the improved classification of one entity helps to infer its related neighbors, and vice versa. Unlike traditional data mining, which assumes that data instances are independent and identically distributed (i.i.d.) samples and classifies them one by one, iterative classification iteratively classifies all the entities in the testing set simultaneously, because the classifications of those entities are correlated. Note that iterative classification is an approximation technique, because exact inference is computationally intractable unless the network structure has certain graph topologies such as sequences, trees, or networks with low tree width. In previous research, iterative classification has been used successfully to classify company profiles [23], hypertext documents [22], and emails [25]. The pseudo-code for distant friend inference is shown in Table 4.1.

Table 4.1 Pseudo-code for distant friend inference
1.  For each item I in the testing set do
2.    Select a set of users N for inference. N includes the target users of item I and their corresponding immediate friends.
3.    For iteration from 1 to M do
4.      Generate a random ordering, O, of users in N
5.      For each user U in O do
6.        If U has no immediate friend who exists in U(I)
7.          Continue
8.        Else
9.          Apply immediate friend inference
10.         R_UI = Σ_k k · Pr(R_UI = k | A = a_U, A' = a'_I, {R_VI = r_VI : ∀V ∈ U(I) ∩ N(U)})
11.         Insert into or update U(I) with R_UI if different
12.       End If
13.     End For
14.     If no updates in the current iteration
15.       Break
16.     End If
17.   End For
18.   Output the final predictions for the target users
19. End For

This pseudo-code predicts the users' ratings for one target item at a time. The original iterative classification method classifies the whole network of users; however, since the number of users in social networks is usually large, we reduce the computation cost by limiting the inference to a user set N which includes the target users of the target item I and their corresponding immediate friends. In each iteration, we generate a random ordering O of the users in N. For each user U in O, if U has no immediate friend who belongs to U(I), which is the set of users whose rating (either
ground truth or predicted value) is observable, then the estimation of $R_{UI}$ will be skipped in this iteration. Otherwise, $\Pr(R_{UI} = k \mid A' = a'_I, A = a_U, \{R_{VI} = r_{VI} : \forall V \in U(I) \cap N(U)\})$ will be estimated by immediate friend inference, and $R_{UI}$ is then obtained from (4.11). Because a user rating is an integer value, in order to continue the iterative process we round $R_{UI}$ to the closest integer, and insert it into or update U(I) with it if it differs. This entire process iterates M times or until no update occurs in the current iteration. In our experiments, the process usually converges within ten iterations. It is worth pointing out that after we compute $\Pr(R_{UI} = k \mid A' = a'_I, A = a_U, \{R_{VI} = r_{VI} : \forall V \in U(I) \cap N(U)\})$, there are two other options for updating $R_{UI}$ besides rounding the expectation. The first option is to select the value k that maximizes this probability. However, by doing so we discard the clues carried by small probabilities; after several iterations, the errors caused by this greedy selection are exacerbated, and the target users are likely to be classified into the majority class. The other option is to use the probability distribution directly as soft evidence to classify other users. However, in our experiments this approach does not return results as good as those obtained by rounding the expectation.
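A compact Python skeleton of the loop in Table 4.1 is shown below. It abstracts the immediate friend inference of (4.11) into a callable and uses illustrative names throughout; it is a sketch of the procedure rather than the authors' implementation.

import random

def distant_friend_inference(targets, friends, observed, predict_one, max_iter=10):
    """Iteratively estimate ratings of a single target item, as in Table 4.1.

    targets:     users whose rating of the item must be predicted
    friends:     dict user -> set of immediate friends N(U)
    observed:    dict user -> ground-truth rating of the item (the initial U(I))
    predict_one: callable(user, friend_ratings) -> expected rating from (4.11)
    """
    ratings = dict(observed)                            # U(I): observed plus estimated ratings
    pool = set(targets) | {f for u in targets for f in friends.get(u, set())}
    for _ in range(max_iter):                           # at most M iterations
        changed = False
        order = list(pool)
        random.shuffle(order)                           # random ordering O of the users in N
        for u in order:
            if u in observed:                           # never overwrite ground-truth ratings
                continue
            rated = {v: ratings[v] for v in friends.get(u, set()) if v in ratings}
            if not rated:                               # no immediate friend in U(I): skip
                continue
            estimate = round(predict_one(u, rated))     # round the expectation to an integer
            if ratings.get(u) != estimate:              # insert into or update U(I) if different
                ratings[u] = estimate
                changed = True
        if not changed:                                 # converged: no update in this iteration
            break
    return {u: ratings[u] for u in targets if u in ratings}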
4.5 Experiments
We evaluate the performance of SNRS on the Yelp dataset, focusing mainly on prediction accuracy, data sparsity, and cold-start. We used a restaurant's price range as the item attribute. Since there is no useful user attribute, we substituted $\Pr(R_I = k)$ for $\Pr(R_I = k \mid A = a_U)$ when estimating item likability. For comparison, we implemented CF and trust-based collaborative filtering (TCF) [26]. The basic idea of TCF is to combine trust-based weighting with collaborative filtering. It first estimates two types of implicit trust among users, profile-level and item-level trust, based on their ratings, and then filters out users with low trust values. To make predictions it uses CF but, instead of the user similarity in (4.1), TCF uses the harmonic mean of user trust and user similarity. Whereas TCF relies on implicit user trust, SNRS utilizes the interpersonal trust underlying friend relationships. For this reason, we are interested in comparing the performance of SNRS with that of TCF.
4.5.1 Cross-Validation
We carried out this experiment in a tenfold cross-validation. The prediction accuracy was measured by the mean absolute error (MAE), which is defined as the average absolute deviation of the predictions from the ground truth over all the instances, i.e., target user/item pairs, in the testing set. The smaller the MAE, the better the inference. The second metric is the coverage, which is defined as the percentage of testing instances for which the method can make predictions. The experimental results are listed in Table 4.2.

Table 4.2 Comparison of the MAEs of selected methods in a tenfold cross-validation on the Yelp dataset
Method   MAE     Coverage
SNRS     0.727   0.807
TCF      0.775   0.454
CF       0.848   0.616
The methods used are: collaborative filtering (CF), trust-based collaborative filtering (TCF), and social network-based recommender system (SNRS).

From this table, we note that SNRS achieves the best performance in terms of MAE (0.727); it is lower than that of CF by 14.3% and than that of TCF by 6.2%. Thus, the use of social network information in SNRS improves the prediction accuracy. SNRS also reaches the highest coverage (0.807). The reason for this high coverage is that SNRS is able to make use of estimated user ratings for predictions. The low MAE and high coverage together demonstrate that SNRS is promising. In addition, TCF improves the MAE of CF at the cost of reduced coverage. This is because, in some cases, even though the similarity for a pair of users can be estimated, TCF still cannot make a prediction if the trust between them cannot be obtained.
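For reference, the two metrics can be computed as in the trivial sketch below; the input structures are hypothetical.

def mae_and_coverage(predictions, ground_truth):
    """predictions maps (user, item) to a predicted rating, or to None when the
    method could not make a prediction; ground_truth maps the same pairs to the
    true ratings of the testing instances."""
    covered = {p: r for p, r in predictions.items() if r is not None}
    mae = sum(abs(r - ground_truth[p]) for p, r in covered.items()) / len(covered) if covered else float("nan")
    coverage = len(covered) / len(ground_truth)
    return mae, coverage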
4.5.2 Data Sparsity
CF suffers from problems with sparse data. In this study, we want to evaluate the performance of SNRS at various levels of data sparsity. To do so, we randomly divide all the user/item pairs of our dataset into ten groups, and then randomly select n groups as the testing set and the rest as the training set. The value of n controls the sparsity of the dataset. At each value of n, we repeat the experiment 100 times. The performance is measured by the average MAEs and the coverage. Figure 4.4a compares the MAEs of the above methods when the percentage of the testing set varies from 10 to 70%. There are two observations. First, the MAEs of SNRS are consistently lower than those of CF and TCF. Second, although the MAEs of all the methods increase as the training set becomes sparser, the MAEs of SNRS grow at a much slower pace. For example, the MAEs of SNRS increase by 6.2%, from 0.714 to 0.758, when the testing set is increased from 10 to 70% of the entire dataset, while the MAEs of CF and TCF grow by 10.7% and 9.5%, respectively, under the same conditions. Figure 4.4b compares the coverage of these methods. We note that the coverage of SNRS is the highest for all test conditions. For example, when the size of the testing set is 40% of the whole dataset, the coverage of SNRS is 0.786, while that of CF and
Fig. 4.4 Comparison of the (a) MAEs and the (b) coverage of CF, TCF, and SNRS for different testing set sizes
TCF is 0.713 and 0.401, respectively. The decrease in the coverage of SNRS is also the slowest as the training set becomes sparser. In particular, the ratio of the decrease in the coverage of SNRS is 9% when the size of the testing set changes from 10 to 70% of the entire dataset, while the same ratio of CF is 85.4%.
4.5.3 Cold-Start
Cold-start is an extreme case of data sparsity in which a new user has no reviews, so CF cannot make recommendations to the new user. Neither can SNRS do so if this new user has no friends. However, in some cases of cold-start, when a new user is invited by some existing users of the system, the preference of this new user can be estimated from those of the user's friends. In this study, we simulated the latter case of cold-start by creating the following experimental settings: (1) Since there are no prior ratings of the target user, we simply set the output of $\Pr(R_U = k \mid A' = a'_I)$ to a uniform distribution. (2) Because we cannot learn the rating correlation between this new user and the user's friends, we directly used the friends' rating distributions on the target item, $\Pr(R_{UI} \mid \{R_{VI} : \forall V \in N(U) \cap U(I)\})$, as the result of friend inference. (3) Except for the target user, the ratings of all other users were known. We simulated cold-start for every user in the dataset. The resulting MAE is 0.753 and the coverage is 1. This result demonstrates that even in cold-start, SNRS can still perform decently. The coverage of SNRS is high compared with that in the tenfold cross-validation (0.807) because the ratings of every target user's friends are all observable in the setting of this experiment.
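Under the three settings above, the prediction for a new user reduces to a simple combination. The sketch below is only an illustration of that combination; in particular, the Laplace smoothing of the friends' vote share is our own choice and not something specified by the authors.

from collections import Counter

def cold_start_distribution(item_like, friend_ratings, n=5):
    """Cold-start sketch: a uniform distribution replaces Pr(R_U = k | A' = a'_I),
    and the friends' empirical rating distribution on the target item replaces the
    learned histograms."""
    votes = Counter(friend_ratings.values())
    total = sum(votes.values())
    dist = {}
    for k in range(1, n + 1):
        friend_term = (votes[k] + 1) / (total + n)       # smoothed share of friends voting k
        dist[k] = (1.0 / n) * item_like[k] * friend_term
    z = sum(dist.values())
    return {k: p / z for k, p in dist.items()}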
4.5.4 Role of Distant Friends
In Sect. 4.5.1, we noticed SNRS achieved the highest coverage because it is able to make use of estimated ratings of immediate friends which are inferred from distant
Table 4.3 Comparison of the performance of SNRS with and without distant friend inference
                                   MAE     Coverage
With distant friend inference      0.727   0.807
Without distant friend inference   0.683   0.364
friends. This observation leads us to further study the role of distant friends in SNRS. Specifically, we compared the performance of SNRS with and without distant friend inference in a tenfold cross-validation. The experimental results are shown in Table 4.3. From these results, we can see that by considering the influences from distant friends, the coverage of SNRS is increased from 0.364 to 0.807, which is equivalent to a 122% improvement. However, the improvement is achieved at the cost of a slight reduction in the prediction accuracy. In our experiments, the MAE increases from 0.683 to 0.727, which is only a 6.4% difference. This is consistent with our intuition that the impact from distant friends is not as direct as that from immediate friends, and certain errors will be inevitably introduced when considering distant friends, but compensated for by the enormous gain in the coverage. Our experimental results revealed that social network information can be used to improve the performance of recommender systems. In the next section, we shall discuss how to remedy some issues in SNRS that are caused by heterogeneities in social network information.
4.6 Semantic Filtering of Social Networks
Social networks contain rich semantics that are valuable to SNRS. However, this information can also interfere with the predictions of SNRS if not carefully applied. In this section, we discuss the issues of SNRS caused by the heterogeneities in social relationships and items. Friends exhibit similar behaviors when selecting items; however, the favorite items that friends have in common depend on their social relationships. For example, two friends who have common interests in music CDs may not necessarily agree about their favorite restaurants. Therefore, to find the favorite restaurants, we should not consider friends who share only a common preference in music. Instead, an appropriate set of friends needs to be selected according to the target items. In fact, we considered this issue when performing experiments on the Yelp dataset. Rather than considering all friends listed in users’ profiles, we keep only those friends who also have an interest in food. For example, even though two real friends may have reviewed many common hotels on Yelp, they are not necessarily considered as friends in SNRS unless they both have reviewed restaurants. However, this solution is still a gross approximation, because even within the domain of restaurants, friends can be further grouped based on their opinions on different food categories, price range, restaurant environment, etc.
Item clustering can theoretically be applied to SNRS to select relevant friends for inference. That is, by clustering similar items into different groups, homophily effects among friends can be estimated based on the ratings of items within the same group. Thus, it is possible for SNRS to identify friends who have a high correlation in music CDs but a low correlation in restaurants. However, because the number of items used to measure user similarity becomes smaller due to item clustering, the estimated similarity values may not be as accurate as those without clustering. A better way to select relevant friends is to utilize the semantics in social relationships. Unfortunately, such semantics are not readily available in most current OSNs. When a user indicates someone as a friend, it is not clear how and why they became friends, and more importantly, we do not know in which aspects they have homophily effects. Some OSNs ask how friends know each other, for example, whether they were/are classmates or colleagues. Information like this definitely helps us understand friend relationships. However, it is still too general to have practical application in recommender systems. Instead, the semantics that we really want to know from friend relationships should be more specific to the domain of interest, in particular, the factors that influence users' buying decisions. For example, in terms of dining, it would be ideal for SNRS to know whether two individuals are friends because they have similar taste in food and/or similar preference regarding the price of the meals. Although items have many characteristics, the factors that matter in most users' buying decisions when choosing restaurants may be limited to only a few common ones, such as food taste, nutrition value, price, service, and environment. Such factors can be obtained through carefully designed questionnaires or other means of marketing analysis. Thus, more semantics regarding users' rating intentions and social relationships can be collected. By providing users with a mechanism to rate items on each factor of their buying decisions, for example, asking them to rate a restaurant on food taste and price, recommender systems can improve their understanding of users' rating intentions. Currently, most recommender systems ask users to input only overall ratings, which, however, conflate too many factors and are difficult to interpret. For example, when a user gives an overall rating of 4 to a restaurant, it is not clear whether it is because of the food taste or the price of the meal. On the other hand, if a user can provide ratings for those factors, the rationale behind the overall rating can be well explained. Besides understanding users' rating intentions, SNRS can also obtain the semantics of social relationships by asking users to rate their friends on those factors. A user's high rating of a friend on a specific factor means that this user tends to agree with the friend's opinion on it, and together they have a stronger homophily effect. To predict a user's rating on a factor, SNRS needs to select those friends on whom this user has a strong homophily effect regarding the same factor. The selection of friends is thus dynamic according to the semantics in the factors of user ratings. We call this process semantic filtering and denote SNRS with semantic filtering as SNRS-SF.
The framework of SNRS-SF is almost the same as that of SNRS, except that immediate friend inference and distant friend inference are now based on semantically filtered social networks.
Since an overall rating is not determined by a single factor, the relevant friends for predicting a user's overall rating cannot be selected in the same way as for predicting fine-grained ratings. Instead, we consider as relevant those friends who are selected as relevant for two or more of the most important factors. For example, if a system considers price and taste to be the most important decision factors for dining, then a user's relevant friends for predicting overall ratings are those users who are considered relevant friends for predicting the user's ratings on price and on taste. In the following section, we shall use this approach.
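In code, the friend selection for SNRS-SF might look like the sketch below. The factor names and the 1-3 rating scale anticipate the class experiment described in Sect. 4.6.1; the function itself is only an illustration.

def relevant_friends(friend_ratings, factor=None,
                     important_factors=("interestingness", "agreement", "writing"),
                     threshold=3):
    """Select the friends that are relevant for predicting a given rating factor.

    friend_ratings: dict friend V -> {factor: U's rating of V on that factor, 1..3}
    factor:         the fine-grained factor being predicted, or None for the overall rating
    """
    selected = set()
    for v, scores in friend_ratings.items():
        if factor is not None:                             # fine-grained rating: match that factor
            if scores.get(factor, 0) >= threshold:
                selected.add(v)
        else:                                              # overall rating: at least two factors
            strong = sum(1 for f in important_factors if scores.get(f, 0) >= threshold)
            if strong >= 2:
                selected.add(v)
    return selected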
4.6.1 Semantic Filtering Experiments
Since the Yelp dataset does not have fine-grained user ratings, we cannot use the Yelp dataset for semantic filtering experiments. Therefore, we designed an experiment for a graduate student class and collected a social network and fine-grained user ratings from students. The goal of this experiment is to predict students’ ratings for reading online articles. It was conducted in a graduate student class, “Intelligent Information Systems”, with 22 students. We first selected 21 articles which focus mainly on four topics: local news, US news, technologies, and culture. These articles all contain strong opinions expressed by the authors. The article information and the corresponding categories of these 21 articles are listed in Table 4.4. Before asking these students to review the online articles, we first collected their demographic information, including gender, age, student type, employment, and religion. We then asked the students to answer a set of survey questions related to the articles as shown in Table 4.5. These survey responses will provide prior information about the students. We then asked the students to review, as shown in Fig. 4.5, every article and give ratings (from 1 to 5, with 5 being the best) on the following four factors: (1) Interestingness: Is the article interesting? (2) Agreement: How much do you agree with the author? (3) Writing: Is the article well written? and (4) Overall: Overall evaluation. The reason we include the first three ratings is because they usually play the most important roles when we give an overall score to an article. Since most students did not know each other before the experiment, it would be difficult to form a social network from their original relationships. We therefore divided the students into groups and let them get to know each other through discussions of the articles. Specifically, we divided the students into three groups twice. The first grouping was based on students’ ethnicities, and the second grouping was based on students’ responses to the survey questions. The goal of these groupings was to organize the students in such a way that the students in a group were more likely to be friends after the group discussions. Each group then had a meeting to discuss the articles. During the discussions, every student needed to explain the reasons why s/he liked or disliked each article. Thus, the other
Table 4.4 The article information (article ID, article information, and category)
1  Adenhart's death is a tragic loss for baseball. By Kendall Salter, Daily Bruin, April 10, 2009.  (Local)
2  Aggressive biking, skateboarding poorly fit our walking campus. By Karen Louth, Daily Bruin, March 6, 2009.  (Local)
3  Backers of stem cell research are on guard. By Robert T. Garrett, The Dallas Morning News, April 10, 2009.  (Technology)
4  Budget cuts should not degrade education. By Daily Bruin, March 12, 2009.  (Local)
5  File-Sharing Site Admin Sentenced to 6 Months Jail. By Enigmax, TorrentFreak, April 11, 2009.  (Technology)
6  Google Earth accused of aiding terrorists. By Rhys Blakely, Times Online, December 9, 2008.  (Technology)
7  Hot Topic: A Gay Marriage Tipping Point? By Brian Montopoli, CBS News, April 6, 2009.  (Culture)
8  How Environmentalists Plan to Control Your Life. By Fox News, April 6, 2009.  (Culture)
9  Identity theft hits close to home. By Patt Morrison, Los Angeles Times, March 12, 2009.  (Culture)
10 Is an Italian rail company taking L.A. for a ride? By Tim Rutten, Los Angeles Times, March 25, 2009.  (Local)
11 Israel boycott shows ignorance and limits ideas. By Daily Bruin, March 5, 2009.  (Local)
12 L.A.'s animal terrorists. By Tim Rutten, Los Angeles Times, March 11, 2009.  (Local)
13 Learning to Love the Bailout. By The New York Times, April 11, 2009.  (U.S.)
14 Obama Flinches on Immigration. By Editorials, The New York Times, March 23, 2009.  (U.S.)
15 The age of Friendaholism. By Meghan Daum, Los Angeles Times, March 7, 2009.  (Technology)
16 The First Showdown on Health Care. By Editorial, The New York Times, April 11, 2009.  (U.S.)
17 The recession heats up romance novels. By Meghan Daum, Los Angeles Times, April 4, 2009.  (Culture)
18 Unemployment, and CEO pay, on the rise. By Tim Rutten, Los Angeles Times, April 4, 2009.  (U.S.)
19 We need a bailout too. By Rosa Brooks, Los Angeles Times, February 19, 2009.  (U.S.)
20 Why not gay marriage? By Raymond Lesniak, NJ.com, August 16, 2007.  (Culture)
21 Wild wild Web. By Patt Morrison, Los Angeles Times, February 26, 2009.  (Technology)
members in the group were able to learn more about that student. After the discussions, the students evaluated the other group members, as shown in Fig. 4.6 (using ratings from 1 to 3), according to the following three aspects: (1) Do you have common interests in the articles? (2) Do you agree with his/her opinions on the articles? (3) Do you have common judgments about the author's writing skill?
Table 4.5 Survey questions
Q1  Has the rise in unemployment affected you or someone in your family?
Q2  Given the current state of the economy, are you concerned about getting a job after you graduate?
Q3  Are you concerned about increased government spending? What if increased government spending leads to higher tuition cost?
Q4  Are you affiliated with a political party?
Q5  Do you consider yourself conservative, liberal or moderate?
Q6  Does the government do enough to regulate immigration?
Q7  Do you support gay marriage?
Q8  Do you think there is a need for health care reform?
Q9  Should every American have health insurance?
Q10 Do you agree with the use of stem cells for medical research?
Q11 Do you know anyone with an incurable illness who may benefit from stem cell research?
Q12 Should websites and tools that could be used improperly be outlawed? (Google Earth, BitTorrent, P2P, etc.)
Q13 South Korea has a three-strikes law where repeated copyright offenders can be banned from the Internet. Do you think this is fair?
Fig. 4.5 Form for reviewing an article
Since the students may rate each other differently, i.e., one may consider the other a friend but not vice versa, the social relationships in this dataset are directional. The students were allowed to revise their previous ratings of the articles if they had gained a new understanding of the articles after the discussion. Compared with the Yelp dataset, there are three differences in this dataset. First, instead of having only an overall rating, each article now also has three fine-grained ratings (interestingness, agreement, and writing), which more clearly reflect the students' opinions of the articles. Second, friend relationships are based on buying decision factors rather than just on friendship. We are now able to know whether a friendship is based on similar interests or similar opinions, etc. Third, since every student
Fig. 4.6 Form for reviewing a group member
reviewed every article in this experiment, the student/article rating matrix is completely filled in. Thus, the data sparsity of the dataset is 0. Compared with the extremely sparse data in the Yelp dataset, the fully observed student ratings in this experiment allow us to measure the performance of SNRS-SF over the full range of the sparseness test.
4.6.2 Experiment Setup
We implement the following methods for performance comparison.

Collaborative filtering (CF). When predicting fine-grained student ratings, we select similar users based on their fine-grained ratings on all articles.

Collaborative filtering with item clustering based on item category (CF-C). In this method, the 21 articles are clustered into four groups according to their categories. To predict a student's rating of an article, we measure his/her Pearson coefficient with other students based on their ratings of the articles in the corresponding group.

Collaborative filtering with item clustering by running K-means on students' ratings (CF-K). In this method, we use K-means to cluster the 21 articles based on their rating similarities. Since there are four types of ratings (three fine-grained ratings and an overall rating), we have four sets of clusters; in each set, the articles are clustered into three groups. Similar to CF-C, to predict each student's rating of an article, we measure the Pearson coefficient of student pairs based on their ratings of the articles in the same cluster.

SNRS. In this method, we consider student V as student U's friend if U rates V with a value of 3 on at least one of the three factors. The social network of these students is shown in Fig. 4.7a. Each node in the figure represents a student. If student
Fig. 4.7 The student social network (a) before and (b) after semantic filtering. Each node represents a student. In (a), node U has a directed edge to node V if U rates V with a value of 3 on at least one of the three factors; in (b), the edge is kept only if U rates V with a value of 3 on at least two of the three factors
U considers student V as a friend, then there is a corresponding directed edge from U to V.

SNRS with semantic filtering (SNRS-SF). In this method, when predicting fine-grained user ratings, we consider student V as student U's friend if U rates V with a value of 3 on the given factor. When predicting overall ratings, we select V as U's friend if U rates V with a value of 3 on at least two of the three factors. Figure 4.7b shows the social network of the students after we apply semantic filtering for overall ratings. When compared to Fig. 4.7a, we can see that many social relationships have
been pruned. For example, before we apply semantic filtering, there are 179 friend links in Fig. 4.7a, and on average each student has 8.14 friends. In Fig. 4.7b, there are 94 friend links, and each student on average has 4.27 friends after semantic filtering. Similar to the previous section, we control the sparseness of the dataset by randomly selecting a different percentage of the dataset as the testing set. For each size of the testing set, we repeated the experiment 100 times. For each pair of student/article in the testing set, we predicted the target student’s fine-grained ratings and overall ratings of the target article by applying the above methods. MAEs and coverage are used as the performance metrics.
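The CF baselines above weight other students by a Pearson coefficient computed only over the articles of the relevant cluster. A small, self-contained sketch of that restricted coefficient follows; the names are illustrative.

from math import sqrt

def pearson_within_cluster(ratings_u, ratings_v, cluster_items):
    """Pearson coefficient of two students restricted to the articles in one cluster.

    ratings_u, ratings_v: dict article -> rating
    cluster_items:        the articles belonging to the cluster of the target article
    """
    common = [j for j in cluster_items if j in ratings_u and j in ratings_v]
    if len(common) < 2:
        return 0.0
    mu_u = sum(ratings_u[j] for j in common) / len(common)
    mu_v = sum(ratings_v[j] for j in common) / len(common)
    num = sum((ratings_u[j] - mu_u) * (ratings_v[j] - mu_v) for j in common)
    den = sqrt(sum((ratings_u[j] - mu_u) ** 2 for j in common)) * \
          sqrt(sum((ratings_v[j] - mu_v) ** 2 for j in common))
    return num / den if den else 0.0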
4.6.3 Experimental Results
Figure 4.8a-d shows the MAEs for predicting student ratings of the interestingness, agreement, writing, and overall aspects of the articles. We notice that the trends of SNRS and SNRS-SF are very different from those of CF, CF-C, and CF-K. The MAEs of SNRS and SNRS-SF remain almost constant at all levels of data sparseness, while those of CF, CF-C, and CF-K all increase significantly as the sparseness increases. The improvements in MAE of SNRS and SNRS-SF over the CF group grow as data sparseness increases. For example, in Fig. 4.8a, when the sparseness is 90%, the MAEs of CF, SNRS, and SNRS-SF are 0.974, 0.764, and 0.705, respectively. This implies 21.6% and 27.6% prediction accuracy improvements over CF. These results reveal that the performance of recommender systems can be significantly improved by effectively using the semantic information in social networks, which is consistent with our findings on the Yelp dataset. Furthermore, the MAEs of SNRS-SF are lower than those of SNRS. Specifically, SNRS-SF yields an average MAE reduction over SNRS of 9.8%, 11.6%, 7.4%, and 6.2% for predicting student ratings on the interestingness, agreement, writing, and overall aspects, respectively. These results illustrate that applying semantic filtering can further improve the prediction accuracy of SNRS. We note that the MAEs of CF, CF-C, and CF-K are similar; CF-C performs worst among these three methods, which implies that item clustering does not improve the prediction accuracy of CF, as also reported in [10].
Fig. 4.8 The comparisons of the MAEs of CF, CF-C, CF-K, SNRS, and SNRS-SF for predicting fine-grained ratings on (a) interestingness (b) agreement, (c) writing, and (d) overall aspects of the articles
Fig. 4.9 The comparisons of the Coverage of CF, CF-C, CF-K, SNRS, and SNRS-SF for predicting fine-grained ratings on (a) interestingness, (b) agreement, (c) writing, and (d) overall aspects of the articles
This is because item clustering makes the number of items in each cluster smaller; thus, the coverage is affected more strongly by the sparseness. On the other hand, we notice that the coverage of SNRS and SNRS-SF is rather insensitive to the sparseness. It starts to drop only after the sparseness exceeds 0.8, and at a slower rate than that of CF, CF-C, and CF-K. For example, even when the sparseness is 0.9, the coverage of SNRS and SNRS-SF is still 0.789 and 0.645, respectively, as shown in Fig. 4.9a. The coverage of SNRS-SF is slightly lower than that of SNRS because each student has fewer friends after semantic filtering.
4.7 Trust in SNRS
SNRS implicitly assumes that all users in the social network are trustworthy. However, in most recommender systems, this assumption is not necessarily valid. In this section, we shall discuss two trust issues and propose how SNRS can be extended to handle them.
4.7.1 Shilling Attacks from Malicious Users
Driven by incentives, malicious users in recommender systems can purposely provide false reviews to promote their own products or to attack similar products of competitors. For example, in a user-based collaborative filtering system, a malicious user can simply fake a set of reviews with exactly the same ratings as those of a target user. This malicious user will then be considered the most similar user to the target user. If malicious users want to promote their own products, they can simply give those products high ratings, and the products will have a high chance of being recommended to the target user. This problem is known as shilling attacks. The main reason that shilling attacks can become threats is that recommender systems rely too much on user rating similarity and overlook another important aspect, i.e., trust among users. Some studies have introduced explicit trust defined by users [27] and implicit trust inferred from user ratings [26] and have shown some improvements. However, unlike these approaches, SNRS is in essence built on trust, and thus it is able to handle the shilling attack problem. Instead of using rating similarity, SNRS makes predictions by exploiting homophily among friends. Since users know their friends personally, they are unlikely to add malicious users as friends. If a user suspects that some friends may be potential malicious users, the user can remove those friends from the friend list. Thus, in SNRS, the fact that two users are friends indicates the trust between them. In addition, with the capability of rating friends on each factor of a user's buying decisions (as discussed in Sect. 4.6), SNRS not only knows who are friends, but also on which aspect of buying decisions
two friends trust each other. Therefore, the risk of shilling attacks can be further reduced.
4.7.2 Misleading by Friends with Unreliable Knowledge
It is worth pointing out that malicious users are not the only cause of trust problems in recommender systems. Due to limited knowledge of the target items, users who are trustworthy may still provide inaccurate reviews that do not truly reflect the quality of the items. Since SNRS relies on friends' opinions to make predictions, such inaccurate reviews will lead SNRS to produce misleading recommendations. For example, Alice has a taste similar to her friend Bob's for Italian food, but Bob seldom goes to Thai restaurants. In this case, even though Bob is trustworthy to Alice, his opinion on Thai restaurants may not be very useful. To the best of our knowledge, little research, if any, has been devoted to solving problems caused by users with unreliable knowledge. The key problem in this example is that the quantification of user correlations is based on all of the common items that a pair of users has reviewed, and it does not consider differences in item categories, such as the difference between Thai food and Italian food. Conceptually, SNRS can solve this problem by introducing item clustering, for example, the clustering of items based on their contents or rating similarities (as shown in Sect. 4.6). SNRS could then quantify two friends' homophily effects based only on the items within the same cluster as a target item. However, in practice, this solution may not work well because item clustering makes the data sparser. To solve this problem, we propose to relax item categories when quantifying homophily effects. Instead of treating different categories as totally isolated, we consider some of them as still related, based upon domain knowledge such as item taxonomies. For example, if we know from an item taxonomy that Chinese food and Thai food are both Asian food, then Chinese food is more similar to Thai food than Italian food is. Therefore, even though we cannot use Bob's preference for Italian food, we can still leverage his preference for Chinese food, if any, to guide the recommendation to Alice about Thai food. In particular, we model item taxonomies as a type abstraction hierarchy (TAH) [28]. A TAH is often used to facilitate approximate query answering. It has a tree structure representing objects at different levels of abstraction: the leaf nodes are the most specific objects, and as the level goes up, the nodes become more general. In Fig. 4.10, we show a sample TAH generated from a food taxonomy. Let us refer to the leaf nodes in a TAH generated from an item taxonomy as item categories, such as Thai food and Chinese food; every item in the system can then be mapped to a corresponding leaf node according to its category. Let us assume that a target item belongs to category T, let C refer to a category in the item taxonomy, and let $I_C$ be the set of items of category C. We define $W_{CT}$ as the
similarity between category C and category T.

Fig. 4.10 A TAH for relaxation of food styles. Dining branches into Asian Food (Thai Food, Chinese Food, Japanese Food) and European Food (French Food, Italian Food)

Thus, homophily effects among friends U and V can be estimated as

$$\mathrm{Hist}\bigl(W_{CT}\,(R_{UI} - R_{VI})\bigr), \qquad I \in I(U) \cap I(V). \qquad (4.12)$$
Equation (4.12) counts U's and V's rating differences on all of their commonly reviewed items, but the contribution of each item I that they both reviewed is multiplied by the factor $W_{CT}$, the similarity between the category of item I and the category of the target item. Given two categories C and T, the value of $W_{CT}$ can be decided based on the following two observations. First, let us define D(C, T) as the distance from categories C and T to their lowest common ancestor LCA(C, T) in the TAH. (Note that C and T are leaf nodes at the same depth.) The smaller the distance, the closer C and T are in the domain space, and thus the more closely they are related. Second, categories in a specific domain are more strongly related to one another than categories in general domains. We use |LCA(C, T)|, the number of leaf nodes under LCA(C, T), to measure its generality: the larger |LCA(C, T)|, the more general the domain space to which both C and T belong. Following these observations, we propose to measure $W_{CT}$ as in (4.13):

$$W_{CT} = \begin{cases} 1 & \text{if } C = T, \\ \dfrac{1}{D(C, T)\,\log_2\bigl(|LCA(C, T)| + 1\bigr)} & \text{otherwise.} \end{cases} \qquad (4.13)$$

Therefore, the similarity between Thai food and Chinese food is $\frac{1}{\log_2 4} = 0.5$, while the similarity between Thai food and Italian food is $\frac{1}{2\log_2 6} = 0.19$. Since 0.19 is less than 0.5, this is consistent with our intuition. Note that similar intuitions have been used to estimate the similarity between two concepts in a TAH [29]; the difference in our work is that we estimate the similarity between leaf nodes, while [29] has no such restriction. In addition, (4.13) assumes a linear decay of $W_{CT}$ in D(C, T), which is arguable; future work could select a better model for a specific domain. Once we obtain $W_{CT}$, the homophily of a pair of users can be quantified as shown in (4.12). By doing so, even though these two users may not have enough
commonly reviewed items in the same category as a target item, their rating correlations in other categories can remedy the data sparsity if used properly.
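The category relaxation is easy to prototype. The sketch below encodes the TAH of Fig. 4.10 as leaf-to-root paths and reproduces the two similarity values computed above; the weighted histogram follows the reading of (4.12) given in the text, where $W_{CT}$ scales the contribution of each co-rated item. All names are illustrative.

from collections import Counter
from math import log2

# Leaf categories of the TAH in Fig. 4.10, each mapped to its ancestors from the root.
TAXONOMY = {
    "thai":     ["dining", "asian"],
    "chinese":  ["dining", "asian"],
    "japanese": ["dining", "asian"],
    "french":   ["dining", "european"],
    "italian":  ["dining", "european"],
}

def category_similarity(c, t):
    """W_CT as in (4.13): 1 if C = T, else 1 / (D(C,T) * log2(|LCA(C,T)| + 1))."""
    if c == t:
        return 1.0
    path_c, path_t = TAXONOMY[c], TAXONOMY[t]
    depth = 0                                            # depth of LCA(C, T): length of the shared prefix
    for a, b in zip(path_c, path_t):
        if a != b:
            break
        depth += 1
    d = len(path_c) - depth + 1                          # distance from a leaf to the LCA
    leaves = sum(1 for path in TAXONOMY.values() if path[:depth] == path_c[:depth])
    return 1.0 / (d * log2(leaves + 1))

def weighted_diff_hist(u_ratings, v_ratings, item_category, target_category):
    """Category-weighted histogram of U's and V's rating differences, cf. (4.12)."""
    hist = Counter()
    for j in set(u_ratings) & set(v_ratings):
        w = category_similarity(item_category[j], target_category)
        hist[u_ratings[j] - v_ratings[j]] += w
    return hist

print(round(category_similarity("thai", "chinese"), 2))   # 0.5, as in the text
print(round(category_similarity("thai", "italian"), 2))   # 0.19, as in the text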
4.8 Related Work
Studies show that recommendations from friends are far more useful than those from recommender systems [2]. However, the systems that really utilize interpersonal relationships in social networks are few, if in fact there are any. Most recommender systems use information in social networks, especially user profiles, as an extra information resource to remedy the data sparsity issue. For example, in [18] the author uses contents in user profiles to find similar users. In [7], the authors use approximated predictions from contents in user profiles to “enrich” the original user/item rating matrix. However, none of them uses homophily among friends for inference. Most directly related work is found in [30]. The authors proposed to combine social networks with recommender systems. They estimated the weights in collaborative filtering with an exponential function of the minimal distance of two users in a social network. This is an over-simplified correlation between users. Distance has no semantic meaning of similarity. Two distant friends may still share common opinions. As noted by the authors, this approach does not work well. Zheng et al. [30] proposed another approach to reduce the computational cost in recommender systems by limiting the candidate similar users within a user’s social network neighbors. This approach will actually exacerbate the data sparsity problem of a recommender system, because there are far fewer candidates of similar users than before. By contrast, we use a histogram of friends’ rating differences to quantify homophily effects among friends rather than using their minimal distances. In addition, we consider the impact not only from immediate friends, but also from distant friends in an iterative classification.
4.9 Conclusions
Social networks provide an important source of semantic information regarding user behaviors and friend interactions. This information, especially homophily effects among friends, is valuable to recommender systems. Through statistical analyses of the dataset crawled from Yelp.com, we show that friends undoubtedly tend to review the same restaurants and give more similar ratings than non-friends. Based on these observations, we designed a SNRS. To the best of our knowledge, this is the first attempt to incorporate the semantics of social networks into recommender systems. SNRS predicts user ratings by exploiting information in social networks, including the user’s own preferences, item’s likability, and homophily effects among
104
J. He and W.W. Chu
friends. It incorporates impacts from distant friends via an iterative classification. We evaluated the performance of SNRS with several other methods on the Yelp dataset through a tenfold cross-validation, and SNRS achieves the best result. In terms of prediction accuracy, it yields a 14.3% improvement compared with that of CF, while in terms of coverage, it yields a 31% improvement compared with CF. In the sparsity test, SNRS returns consistently accurate predictions and high coverage over a wide range of data sparsity. Even in a cold-start test, SNRS still performs reasonably well. We also studied the role of distant friends in SNRS and found that when the influences from distant friends are considered, the coverage of SNRS can be significantly improved with only a slight reduction in the prediction accuracy. To deal with heterogeneities in social networks, we further proposed an approach for filtering social networks based on the semantics in fine-grained user ratings and ratings of friends. Using this approach, relevant friends can be selected for inference according to the type of target items. A specific class experiment was designed to evaluate the effectiveness of semantic filtering in the social network that was formed by a large group of graduate students. The experimental results reveal that SNRS with semantic filtering can further improve the prediction accuracy by 11.6%. Finally, we investigated two trust issues in SNRS. We showed that SNRS has the capability of handling shilling attacks, as well as the problems caused by friends with unreliable knowledge. Further research in these areas is desirable. In our future work, we propose to study the performance of SNRS in other datasets, such as categories other than restaurants on Yelp. We also want to investigate how to apply SNRS to other Web 2.0 domains such as Facebook. For example, Facebook recently started personalizing user contents such as news feeds. Intuitively, our framework may also be applicable to the recommendations of news feeds, since the recommendation has to consider users’ own preferences, the global popularity of news itself (i.e., item likability), and users’ social networks.
References
1. Mooney, R.J., Roy, L.: Content-based book recommending using learning for text categorization. In: Proceedings of ACM SIGIR'99 Workshop Recommender Systems: Algorithms and Evaluation (1999)
2. Sarwar, B.M., Karypis, G., Konstan, J., Riedl, J.: Item-based collaborative filtering recommendation algorithms. In: Proceedings of 10th International Conference on World Wide Web (WWW'01) (2001)
3. Breese, J.S., Heckerman, D., Kadie, C.: Empirical analysis of predictive algorithms for collaborative filtering. In: Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence, pp. 43–52 (1998)
4. Billsus, D., Pazzani, M.: Learning collaborative information filters. In: Proceedings of International Conference on Machine Learning (1998)
5. Resnick, P., Iakovou, N., Sushak, M., Bergstrom, P., Riedl, J.: GroupLens: An open architecture for collaborative filtering of Netnews. In: Proceedings of Computer Supported Cooperative Work Conference (1994)
6. Adomavicius, G., Tuzhilin, A.: Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. IEEE Trans. Knowl. Data Eng. 17(6), 734–749 (2005)
7. Melville, P., Mooney, R.J., Nagarajan, R.: Content-boosted collaborative filtering for improved recommendations. In: Proceedings of the 18th National Conference on Artificial Intelligence (AAAI-2002), pp. 187–192. Edmonton, Canada, July 2002
8. Sarwar, B.M., Karypis, G., Konstan, J., Riedl, J.: Application of dimensionality reduction in recommender systems – a case study. In: Proceedings of ACM WebKDD Workshop (2000)
9. O'Connor, M., Herlocker, J.: Clustering items for collaborative filtering. In: Proceedings of the ACM SIGIR Workshop on Recommender Systems, Berkeley, CA (1999)
10. Morita, M., Shinoda, Y.: Information filtering based on user behavior analysis and best match text retrieval. In: Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 272–281 (1994)
11. Huang, Z., Chen, H., Zeng, D.: Applying associative retrieval techniques to alleviate the sparsity problem in collaborative filtering. ACM Trans. Inf. Syst. 22(1), 116–142 (2004)
12. He, J., Chu, W.W.: A social network-based recommender system (SNRS). Ann. Inf. Syst. 12, Special Issue on Data Mining for Social Network Data (AIS-DMSND), pp. 47–74 (2010)
13. McPherson, M., Smith-Lovin, L., Cook, J.M.: Birds of a feather: Homophily in social networks. Annu. Rev. Sociol. 27, 415–444 (2001)
14. Subramani, M.R., Rajagopalan, B.: Knowledge-sharing and influence in online social networks via viral marketing. Commun. ACM 46(12), 300–307 (2003)
15. Yang, S., Allenby, G.M.: Modeling interdependent consumer preferences. J. Mark. Res. 40, 282–294 (2003)
16. Jurvetson, S.: What exactly is viral marketing? Red Herring 78, 110–112 (2000)
17. Basu, C., Hirsh, H., Cohen, W.: Recommendation as classification: Using social and content-based information in recommendation. In: Recommender System Workshop'98, pp. 11–15 (1998)
18. Pazzani, M.: A framework for collaborative, content-based, and demographic filtering. Artif. Intell. Rev. 13, 393–408 (1999)
19. Wang, J., Vries, A.P., Reinders, M.J.T.: Unifying user-based and item-based collaborative filtering approaches by similarity fusion. In: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development on Information Retrieval (SIGIR'06), 6–11 Aug 2006
20. Herlocker, J.L., Konstan, J.A., Borchers, A., Riedl, J.: An algorithmic framework for performing collaborative filtering. In: Proceedings of ACM SIGIR, pp. 230–237 (1999)
21. Chandra, B., Gupta, M., Gupta, M.P.: Robust approach for estimating probabilities in Naïve Bayes classifier. Pattern Recognit. Mach. Intell. 4815, 11–16 (2007)
22. Lu, Q., Getoor, L.: Link-based classification. In: Proceedings of the 20th International Conference on Machine Learning (ICML), pp. 496–503 (2003)
23. Neville, J., Jensen, D.: Iterative classification in relational data. In: Proceedings of the Workshop on Learning Statistical Models from Relational Data at the 17th National Conference on Artificial Intelligence (AAAI), pp. 13–20 (2000)
24. Sen, P., Getoor, L.: Empirical comparison of approximate inference algorithms for networked data. In: ICML Workshop on Open Problems in Statistical Relational Learning, Pittsburgh, PA (2006)
25. Carvalho, V., Cohen, W.W.: On the collective classification of email speech acts. Special Interest Group on Information Retrieval (2005)
26. O'Donovan, J., Smyth, B.: Trust in recommender systems. In: Proceedings of IUI'05, 9–12 Jan (2005)
27. Massa, P., Avesani, P.: Trust-aware collaborative filtering for recommender systems. In: Proceedings of Federated International Conference on the Move to Meaningful Internet: CoopIS, DOA, ODBASE, pp. 492–508 (2004)
28. Chu, W.W., Chiang, K., Hsu, C.C., Yau, H.: An error-based conceptual clustering method for providing approximate query answers. Commun. ACM 39(13) (1996)
29. Mao, W., Chu, W.W.: Free-text medical document retrieval via phrase-based vector space model. In: Proceedings of American Medical Informatics Association (AMIA) Annual Symposium (2002)
30. Zheng, R., Provost, F., Ghose, A.: Social network collaborative filtering: preliminary results. In: Proceedings of the 6th Workshop on eBusiness (WEB2007), Dec 2007
31. Lathia, N., Hailes, S., Capra, L.: The effect of correlation coefficients on communities of recommenders. In: Proceedings of SAC'08, 16–20 March 2008
32. Pazzani, M., Billsus, D.: Learning and revising user profiles: The identification of interesting web sites. Mach. Learn. 27, 313–331 (1997)
33. Sinha, R., Swearingen, K.: Comparing recommendations made by online systems and friends. In: Proceedings of the DELOS-NSF Workshop on Personalization and Recommender Systems in Digital Libraries (2001)
Chapter 5
Community Detection in Collaborative Tagging Systems
Symeon Papadopoulos, Athena Vakali, and Yiannis Kompatsiaris
Abstract Collaborative Tagging Systems have seen significant success in recent years as a convenient mechanism for organizing and sharing the favorite content of web users. The collective tagging activities of users can be represented in the form of a folksonomy, i.e., a tripartite network associating the users with the online content resources of their selection and the tags used to annotate them. The network structure of folksonomies has been extensively studied and exploited in a series of information retrieval systems. This chapter discusses the application of community detection, i.e., the identification of groups of nodes in a network that are more densely connected to each other than to the rest of the network, on folksonomy networks. In addition, we describe a parameter-free extension of an existing community detection scheme (Xu et al.: SCAN: A structural clustering algorithm for networks. In: Proceedings of KDD'07: 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 824–833. ACM, New York, NY (2007)) that is particularly suited for discovering communities on tag networks, i.e., networks comprising tags as nodes and associations among tags as edges. We found that the resulting tag communities correspond to meaningful topic areas, which may be used in the context of content retrieval and recommendation systems. The chapter discussion is complemented by a set of evaluation tests on real tagging systems that demonstrate that the proposed method produces more relevant tag communities than the ones discovered by a state-of-the-art modularity maximization method (Clauset et al.: Phys. Rev. E 70:066111, 2004).
S. Papadopoulos (*) Informatics and Telematics Institute, CERTH, 57001 Thermi, Greece and Aristotle University, 54124 Thessaloniki, Greece e-mail: [email protected]
5.1 Introduction
Collaborative data is nowadays prevalent across the Web. The wide adoption of technologies that enable users to connect to each other and to contribute to the online community has revolutionized the way that content is organized and shared on the Web. In many online applications, it is possible for users to upload their own content or links to existing content (bookmarking) and to organize it by use of tags, i.e., freeform keywords. Such applications, examples of which are delicious,1 flickr,2 and BibSonomy,3 are commonly referred to as Collaborative Tagging Systems. An established means of modeling the structure of Collaborative Tagging Systems is the folksonomy model [1, 2]. A folksonomy comprises three types of entities that are of interest in a Collaborative Tagging System: Users (U), Resources (R), and Tags (T). In addition, it encodes the associations among these entities in the form of a network, i.e., a tag assignment by a user to a resource is modeled as a set of edges between the respective nodes (tags, resource, user) on the folksonomy network under study. A characteristic property of folksonomy networks, similar to other complex networks, is the existence of community structure [3]. The entities of a folksonomy tend to form groups that are more closely related to each other than to the rest of the folksonomy entities. For instance, one can identify sets of resources within a Collaborative Tagging System that are focused on a specific topic. Such community structure can be explicitly declared when appropriate mechanisms, such as the Flickr Groups, are provided by the tagging system. Even more interesting is the implicit community structure that can be discovered by analyzing the network structure arising from the tagging activities of users. This chapter presents an overview of research pertaining to the identification and exploitation of community structure within a Collaborative Tagging System. It will be shown that existing online applications can benefit from incorporating the results of community detection and that new intelligent services can be developed on top of the community analysis results. Moreover, an extension of a popular community detection method [4] will be presented. The method is applied on tag networks in order to discover sets of tags that are consistently used together by users of a Collaborative Tagging System and correspond to emerging topics of social interest. An evaluation study on three real-world Collaborative Tagging Systems will be presented in order to demonstrate the value of the derived community analysis results. The rest of the chapter has the following structure. Section 5.2 presents background information on the chapter topic. Section 5.3 presents an efficient community detection scheme that addresses the specific characteristics of tag networks. Furthermore, the section discusses how to evaluate the detected community
1 http://delicious.com
2 http://www.flickr.com
3 http://bibsonomy.org
structure. The proposed framework is applied on three real-world tag datasets and the results are presented in Sect. 5.4. Finally, Sect. 5.5 concludes the chapter.
5.2 Background
This section presents background material that is necessary for the subsequent discussion. Some mathematical notation is provided in Sect. 5.2.1 and several recent works that are pertinent to the chapter subject are discussed in Sect. 5.2.2.
5.2.1 Notation
For folksonomies, we employ the definition presented in [2]. However, we do not include the subtag/supertag relation nor the personomy construct that appear in the original definition.

Definition 1. A folksonomy is a tuple F = {U, R, T, Y}, where U, R, T are finite sets comprising respectively the users, resources and tags of the Collaborative Tagging Systems under study, and Y is a ternary relation between them, i.e., Y ⊆ U × R × T, called tag assignments (TAS).

Since folksonomies are commonly represented in the form of networks, we will adopt the common graph notation, according to which G = (V, E) is a graph consisting of the set V of nodes and the set E of edges. A natural way to model a folksonomy is by use of a hypergraph, where V = U ∪ R ∪ T and E = {{u,r,t} | (u,r,t) ∈ Y}. However, the hypergraph model is very rarely used in practice due to its complexity, as well as due to the lack of efficient techniques for analyzing its structure. Instead, the tripartite graph model, in which each hyper-edge {u,r,t} ∈ Y is reduced to three simple edges {(u,r) ∈ U × R, (u,t) ∈ U × T, (r,t) ∈ R × T}, is used as an approximate representation for folksonomies. Further simplifications of the model, for example, to bipartite graphs and to one-mode networks [1], are even more frequently used for tackling specific analysis problems. For instance, a very common folksonomy-derived graph is the tag co-occurrence graph, GT = {VT, ET}, where nodes represent tags, VT ⊆ T, and edges depict co-occurrences between pairs of tags, ET = {(ti, tj) | ti, tj ∈ T}. Tag co-occurrence is usually defined in the context of resources, i.e., when two tags are used together to annotate the same resource, they are considered co-occurring. The number of times that two tags co-occur in the context of some resource can be used as a weight of their relation on the graph, c(ti, tj) = cij = |{∃r ∈ R | (r,ti), (r,tj) ∈ R × T}|.4
4 In the following, we refer to this kind of co-occurrence as resource-based tag co-occurrence.
Fig. 5.1 Tag co-occurrence network formation example. In case (i), tag co-occurrence is considered in the context of resources, while in case (ii) it is considered in the context of both resource and user
There are variations in how tag co-occurrence is computed. For instance, in [5], tag co-occurrence is also defined in the context of both a user and a resource, i.e., c′ij = |{∃u ∈ U, r ∈ R | (u,r,ti), (u,r,tj) ∈ Y}|. These two different tag co-occurrence definitions are exemplified in Fig. 5.1. Alternative tag similarity measures can also be used to build a tag graph, for example, tag context similarity [5] and FolkRank [2]. Independent of the measure used to build the tag graph, the tag neighborhood operator is defined around an input tag t0, N(t0) = {ti ∈ T | (t0, ti) ∈ ET}. This returns all tags that co-occur with t0 on G.
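To make the notation concrete, the following minimal Python sketch (not taken from the chapter; the triples and variable names are hypothetical) builds a resource-based tag co-occurrence graph and the tag neighborhood operator N(t0) from a small set of tag assignments.

```python
from collections import defaultdict
from itertools import combinations

import networkx as nx

# Hypothetical tag assignments Y as (user, resource, tag) triples.
Y = [("u1", "r1", "music"), ("u1", "r1", "opera"),
     ("u2", "r1", "opera"), ("u2", "r2", "browser"), ("u3", "r2", "opera")]

tags_per_resource = defaultdict(set)
for user, resource, tag in Y:
    tags_per_resource[resource].add(tag)

G_T = nx.Graph()
for resource, tags in tags_per_resource.items():
    for t_i, t_j in combinations(sorted(tags), 2):
        # c(t_i, t_j): number of resources annotated with both tags.
        weight = G_T.get_edge_data(t_i, t_j, default={"weight": 0})["weight"]
        G_T.add_edge(t_i, t_j, weight=weight + 1)

def neighborhood(t0):
    """N(t0): all tags that co-occur with t0 on G_T."""
    return set(G_T.neighbors(t0)) if t0 in G_T else set()

print(neighborhood("opera"))  # {'music', 'browser'}
```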
5.2.2 Related Work
The concept of community in online systems is extremely wide. In the past, it has been commonly used to denote communities of web pages [6, 7] and web users [8]. In these works, communities are defined with reference to an underlying network (e.g., the Web graph [6, 7]) and correspond to the result of community detection, i.e., a process that discovers groups of nodes in the network that are more densely connected to each other than to the rest of the network. Naturally, since Collaborative Tagging Systems comprise three types of entities (U, R, T), one should expect that even in the specific context of folksonomies, the notion of community is still wide in scope. To this end, we present a classification of pertinent works along three dimensions: (a) type of node, (b) application scenario, and (c) method. These dimensions are briefly
Table 5.1 Types of folksonomy networks and respective application scenarios

One-mode networks (type of node: U [9]; R [11–13]; T [5, 14–17])
  Methods: k-means [9], Hierarchical Agglomerative Clustering [14], Spectral clustering [13], Community detection [5, 16, 17]
  Application scenarios: Social link prediction [10], User profiling [12], Personal content organization [12], Resource clustering [13], Personalized retrieval [14], Ontology evolution [18], Search and navigation [15], Tag sense disambiguation [5], Content recommendation [13]

Two-mode networks (type of node: U, R [19]; U, T; R, T [20])
  Methods: Fuzzy Biclustering [19], Co-clustering [20], Biclique detection [21]
  Application scenarios: Collaborator recommendation [22], Content recommendation [22], Tag recommendation [18], Resource (image-page) organization [20]

Three-mode networks (type of node: U, R, T [23, 24])
  Methods: Tripartite clustering [23], Hypergraph clustering [24]
  Application scenarios: User-Resource-Tag recommendation, Study of Collaborative Tagging Systems
explained below and further analyzed in the following sections. Table 5.1 presents a summarized view of the classification.

Type of node: Depending on the kind of objects that they contain, there is a distinction between several kinds of communities in Collaborative Tagging Systems. First, there are three one-mode communities (user based, resource based, and tag based). In addition, there are three two-mode communities (user–resource, user–tag, resource–tag). Finally, it is possible to have communities comprising all three kinds of folksonomy entities (user–resource–tag).

Method: There is a multitude of methods used to derive folksonomy communities. Methods range from conventional clustering schemes, such as k-means and Hierarchical Agglomerative Clustering, to state-of-the-art community detection methods, such as modularity optimization and biclique detection. For two-mode and three-mode folksonomy networks, more elaborate methods need to be applied, such as co-clustering and hypergraph clustering.

Application scenario: The results of community detection can be exploited in the context of several data analysis tasks. In most cases, community analysis results are incorporated in recommendation algorithms, for example, friend/collaborator recommendation, resource (web page/image) recommendation, and tag suggestion. Additional application scenarios include tag sense disambiguation, personalized content search and navigation, as well as ontology evolution and population.

Note that the term "community detection" is frequently used to refer to graph-based clustering algorithms [25]. In essence, these two fields have a common objective: the identification of groups of nodes in a network that form natural groups. It is due to their different scientific origin (social network analysis and statistical physics for community detection and computer science for graph-based clustering) that these two fields sometimes appear as different despite the fact that
they address the same problem. In the rest of this chapter, we will use the term community detection for consistency, but in most cases it is legitimate to read it as graph-based clustering. Similarly, a community can be thought of as a cluster.
5.2.3 Types of Communities in Collaborative Tagging Systems
There can be several kinds of communities in a Collaborative Tagging System depending on the kind of objects that constitute them. One-mode communities, i.e., communities comprising objects of one kind, are by far the most commonly studied structures within folksonomies. These communities are implicitly defined upon the assumption that their members are similar to each other. The employed measure of similarity is usually empirically selected based on the achieved results, as well as on computational efficiency considerations. Among the possible three kinds of one-mode communities within a Collaborative Tagging System, tag communities are the ones attracting the most research interest [5, 14–18]. The problem of tag community identification is attractive mainly for two reasons: (a) tag communities usually correspond to semantically related concepts or topics, which make them suitable for a series of applications (see Sect. 5.2.5), (b) the number of unique tags is limited,5 which makes the problem of tag community detection easier from a computational point of view. It is also for these reasons that tag communities constitute the main focus of this chapter. Before the advent of Collaborative Tagging Systems, the problem of resource (web page) community identification was tackled by means of web graph [6, 7] and content-based [26] analysis techniques. In view of the wealth of user input that is now possible in the context of Collaborative Tagging Systems, the discovery of resource communities attracts once again significant research interest [11–13]. Collaborative tagging has been particularly valuable for resources with no textual component, for example, images and videos. Detecting user communities in tagging systems is probably the least studied kind of folksonomy communities. This is mainly due to two reasons: (a) users are typically characterized by multiple interests; therefore, it is hard to group them into communities, unless overlapping community detection methods are devised; (b) even when communities of users are identified within a tagging system, there is no straightforward means of evaluating the quality or utility of the identified communities. Recently, some interesting work has appeared [9] that attempted to cluster users of a Social Tagging System based both on the tags that they use and on the temporal patterns of their tag usage. Such analysis could be potentially interesting for the administrators of a tagging application, for example, for improving the effectiveness of online advertising.
5 Even though tags are allowed to have any form, there is practically a finite number of distinct tags and when tag filtering and normalization techniques come into play, this number is even lower.
Two-mode and three-mode communities are relatively less studied mainly due to the increased complexity and sophistication of the methods necessary for their detection. Identifying such communities relies on bipartite and tripartite graph clustering, which is computationally more expensive. Several simplified approaches for deriving such communities [20, 23] have delivered promising results.
5.2.4 Community Discovery Methods
Community identification can be thought of as a kind of clustering, i.e., a process that splits the objects of a dataset into meaningful groups. It is for this reason that conventional clustering techniques, such as k-means and Hierarchical Agglomerative Clustering, have been used in the context of Collaborative Tagging Systems to derive communities. For instance, in [9], a variant of the k-means algorithm is employed to cluster a set of flickr users into groups based on their tagging behavior. Hierarchical Agglomerative Clustering is even more popular, e.g., for clustering tags in [14, 15]. Recently, more sophisticated clustering schemes have been applied on folksonomies, for example, co-clustering [20] and tensor-based spectral clustering [13]. Despite their wide adoption, the application of conventional clustering techniques is troubled by two main problems, which are especially profound in the context of Collaborative Tagging Systems. First, clustering algorithms typically require the number of clusters (communities) to be set a priori. Exploring the effect of such a parameter in datasets of limited size is usually not a problem. However, in the context of a real-world folksonomy dataset, there is no reliable method for estimating the number of communities. Furthermore, conventional clustering algorithms operate on the whole dataset in order to produce the cluster structure and involve significant computational complexity. For this reason, the magnitude of real-world tagging datasets renders many of these methods impractical for detecting communities in Collaborative Tagging Systems. The two above reasons encourage the use of community detection methods [27] in the context of Collaborative Tagging Systems. Community detection methods are based on a graph representation of the objects under study and exploit the graph structure in order to group objects into communities. Community detection methods attempt to find the “natural” organization of the objects of a system into groups. Thus, they do not need the number of communities to be provided as parameter. Furthermore, new community detection techniques have recently appeared that are scalable to datasets of very large size, thus being suitable for the analysis of community structure in real-world tagging systems. One should note that the majority of works applying community detection on folksonomy networks [5, 12, 28, 29] has made use of modularity-optimization schemes such as the one in [30]. Thus, such methods address the first of the above problems: they can identify the number of communities present in the network under study. Furthermore, there are computationally efficient schemes for optimizing
modularity, for example, the greedy modularity optimization technique in [31] identifies community structure in a sparse network of n nodes in time O(n log² n), which addresses the second of the aforementioned problems. However, classical modularity optimization approaches also suffer from some weaknesses, such as inability to find overlapping communities or flat treatment of tags irrespective of their role in the network (these will be discussed in more detail in Sect. 5.3), which are pertinent in the context of tag community identification. For this reason, we introduced two density-based community detection methods in [16] that attempt to mitigate these weaknesses.
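As an illustration of this family of methods, the short sketch below (an assumption on my part, not code from the chapter) runs the greedy modularity maximization of Clauset et al. [31] through its networkx implementation; the placeholder graph stands in for a real tag co-occurrence network.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

G_T = nx.karate_club_graph()  # placeholder standing in for a real tag network

# Greedy (CNM-style) modularity optimization: the number of communities is
# determined by the algorithm rather than supplied as a parameter.
communities = greedy_modularity_communities(G_T)
for i, community in enumerate(communities):
    print(f"community {i}: {sorted(community)}")
```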
5.2.5 Applications of Community Detection
The results of community detection are valuable for understanding the structure and dynamics of Collaborative Tagging Systems. The complexity and magnitude of the microscopic structure of folksonomy networks obfuscate the understanding of interactions and processes taking place in such systems. Thus, community structure is necessary since it provides a mesoscopic view on the elements (users, resources, tags) that constitute the folksonomy, as well as their relations. It is common to derive community-based views of networks, i.e., networks of which the nodes correspond to the identified communities of the original networks and the edges to the relations between the communities (inter-community edges of the original network are merged and intra-community edges are removed from the view). Such views are more succinct and informative than the original networks. Furthermore, communities are a meaningful unit of organization, which means that their members share a similar role in the system and can thus be treated in the same way. For that reason, many data analysis tasks and derivative services can benefit from having access to the knowledge of the network community structure. For instance, assuming that two users of an online social network are found to belong to the same community, while there is no explicit link to each other, the system can recommend them to establish such a link (the so-called “friend recommendation” feature found in many online social networks). In a similar way, if two pieces of content (e.g., web pages) have been found to belong to the same community, then the system can recommend to users who liked the first piece of content to also read the second (in case they have not done already). It is for this reason that community detection has found applications in the field of recommendation systems [10, 13, 18, 22]. Community structure can be also used as a new means of representing user profiles [12, 15]. In the absence of communities, a user profile could be represented either as a tag frequency vector in the tag vector space (denoting the number of times that the given user has used each tag) or as a resource vector (denoting which resources the user has liked). Due to the sparsity of these vector spaces, this raw user profile representation is ineffective. For instance, if a user makes frequent use of the tag “ubuntu,” but not of the tag “Linux,” then it appears as if he/she is not
interested at all in resources tagged with the latter. Employing tag communities as a vector space for the user profile alleviates the above problem and thus leads to increased recall performance for personalized retrieval systems. Moreover, the discovery of tag communities can serve in two additional tasks: (a) sense disambiguation [5] and (b) ontology evolution/population [18]. The work in [5] resulted in the conclusion that clustering tag networks can lead to the identification of the different contexts/senses that a tag is used in a potentially more effective way than by resorting to some external source of knowledge (e.g., WordNet). In addition, mapping tags to formal ontology concepts and tag co-occurrences to ontology relations has also been shown to be a task benefiting from the use of tag clustering [18]. Such results demonstrate that tag community structure can be exploited as a step toward semantifying the content in large-scale repositories containing tagged resources.
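As a small illustration of the community-based profile representation discussed above, the sketch below (hypothetical data and names, not from the chapter) aggregates a raw tag-frequency profile into one weight per tag community, so that "ubuntu" and "linux" contribute to the same dimension.

```python
# Hypothetical tag communities and per-user tag frequencies.
communities = [{"ubuntu", "linux", "debian"}, {"jazz", "bebop", "saxophone"}]
user_tag_counts = {"ubuntu": 12, "debian": 3, "jazz": 1}

def community_profile(tag_counts, communities):
    """Aggregate raw tag frequencies into one weight per tag community."""
    return [sum(count for tag, count in tag_counts.items() if tag in community)
            for community in communities]

print(community_profile(user_tag_counts, communities))  # [15, 1]
```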
5.3 Detection and Evaluation of Communities in Tag Networks
From the above, it has become evident that identifying tag communities in Collaborative Tagging Systems is an important research problem with potential benefits for several applications. The goal of tag community detection is to derive groups of tags that either are semantically close to each other or share some usage context. In practice, one may expect that the tag communities of a Collaborative Tagging System correspond to the topics that are of interest to its users. Thus, tag communities act as a proxy of user interests at a higher level of abstraction than individual tags and for that reason they present a more informative and context-rich means of describing both the resources and the users of a Collaborative Tagging System. Most existing efforts on the analysis of tag communities have employed either conventional clustering schemes (e.g., hierarchical agglomerative clustering was used in [14, 15]) or the recently popularized modularity optimization schemes (used in [5, 12, 28, 29]). However, there are a number of issues specific to tag networks that these methods do not adequately address. Overlapping community structure. Tag communities are expected to overlap with each other since there are numerous polysemous tags. For instance, the tag “opera” is expected to belong to at least two communities: one related to music and one related to browsers. Moreover, tag community overlap is expected to be more pronounced by tags that are used in a different context by different groups of users. For example, the tag “Barcelona” may be used by a group of people to refer to architecture-related resources and by another group of people to refer to the city as a travel destination. With this consideration in mind, one may expect that clustering or community detection methods that are inherently partitional, i.e., they do not allow for overlaps among communities, are bound to miss important information stemming from the different contexts of tag usage.
Tag roles. Most existing community detection methods treat network nodes in a uniform way. In the case of tag networks, however, this may lead to poor results. For instance, tags that are used by many users tend to denote generic topics (categories) that are connected to a large number of different tags. Unless such tags are treated in a special way by community agglomeration or expansion schemes, it is possible that topically irrelevant terms will end up in the same community due to their transitive relation through some generic tag. Thus, a community detection approach, which differentiates between generic and regular tags, is more likely to yield meaningful tag communities.

Importance of local context. The majority of existing clustering and community detection schemes requires access to the complete network structure in order to perform the division of nodes into communities. Such a global approach has two undesired implications. First, it incurs a substantial amount of recomputation to derive the updated community structure after changes take place on a network, even when these changes are only minor. This renders global approaches computationally inefficient and thus limits their applicability to static snapshots of networks. Such a limitation diminishes their utility in the context of Collaborative Tagging Systems since these systems are massive and highly dynamic. A second implication of the global approach in community detection is the effect of total network scale on the size of communities that are detectable: it was found in [32] that even profound communities are not discovered by modularity maximization algorithms if their size falls below some threshold that is dependent on the network scale.

In order to address the aforementioned particularities of tag networks, we introduced in [16] a hybrid community detection scheme based on two steps: (a) community seed set detection based on the notion of (m,e)-cores (Sect. 5.3.1) and (b) local expansion of the identified cores with the goal of attaching additional relevant nodes (Sect. 5.3.2). While the proposed combined approach successfully tackles the specific requirements of tag networks, it is also troubled by the need to set parameters, specifically, two parameters (m and e) for the community seed set selection and an additional parameter (BL) for the local community expansion step. For this reason, we introduced in [17] a parameter-free refinement of the community detection scheme of [16], which we describe in the following.
5.3.1 Community Seed Set Detection
The community seed set detection step of our method is based on the concept of (m, e)-cores introduced in [4]. The definition of (m, e)-cores is based on the concepts of structural similarity and e-neighborhood that we repeat here for convenience. We also repeat the definition of direct structure reachability.

Definition 2. The structural similarity between two nodes v and w of a graph G = {V, E} is defined as:

  σ(v, w) = |Γ(v) ∩ Γ(w)| / √(|Γ(v)| · |Γ(w)|)    (5.1)

where Γ(v) is the structure of node v and is defined as:

  Γ(v) = {w ∈ V | (v, w) ∈ E} ∪ {v}    (5.2)

Definition 3. The e-neighborhood of a node is the subset of its structure containing only those nodes that are at least e-similar with the node; in math notation:

  Ne(v) = {w ∈ Γ(v) | σ(v, w) ≥ e}    (5.3)

Definition 4. A vertex v is called a (m, e)-core if its e-neighborhood contains at least m vertices; formally:

  COREm,e(v) ⇔ |Ne(v)| ≥ m    (5.4)

Definition 5. A node w is directly structure reachable from a (m, e)-core v if it is at least e-similar to it: DirReachm,e(v, w) ⇔ COREm,e(v) ∧ w ∈ Ne(v).

Once the (m, e)-cores of a network have been identified, it is possible to start attaching adjacent nodes to them provided that they are reachable through a chain of nodes which are directly structure reachable from each other. We call the resulting set of nodes a community seed set. An example of computing structural similarity values for the edges of a network and then identifying the underlying (m, e)-cores, hubs, and outliers of the network is illustrated in Fig. 5.2. This technique for collecting community seed sets is computationally efficient since its complexity is O(k·n) for a network of n edges and average degree k. Computing the structural similarity value for every edge of the network introduces an additional O(k·m) complexity in the community detection scheme.
Fig. 5.2 Example of community structure in an artificial network. Nodes are labeled with successive numbers and edges are labeled with the structural similarity value between the nodes that they connect. Nodes 1 and 10 are (m,e)-cores with m = 5 and e = 0.65. Nodes 2–6 are structure reachable from node 1 and nodes 9, 11–15 are structure reachable from node 10. Thus, two community seed sets have been identified: the first consisting of nodes 1–6 and the second consisting of nodes 9–15
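The following sketch (an illustration under the definitions above, not the authors' implementation) computes structural similarity, the e-neighborhood, and (m, e)-core membership for nodes of a networkx graph.

```python
import math

import networkx as nx

def structure(G, v):
    """Gamma(v): the node itself together with its neighbors (5.2)."""
    return set(G.neighbors(v)) | {v}

def structural_similarity(G, v, w):
    """sigma(v, w) as defined in (5.1)."""
    gv, gw = structure(G, v), structure(G, w)
    return len(gv & gw) / math.sqrt(len(gv) * len(gw))

def epsilon_neighborhood(G, v, eps):
    """N_e(v) as defined in (5.3)."""
    return {w for w in structure(G, v) if structural_similarity(G, v, w) >= eps}

def is_core(G, v, mu, eps):
    """CORE_{m,e}(v) as defined in (5.4)."""
    return len(epsilon_neighborhood(G, v, eps)) >= mu

G = nx.karate_club_graph()  # placeholder graph standing in for a tag network
print([v for v in G if is_core(G, v, mu=5, eps=0.65)])
```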
One issue that is not addressed in [4] pertains to the selection of parameters m and e. Setting a high value for e (the maximum possible value for e is 1.0) will render the core detection step very eclectic, i.e., few (m, e)-cores will be detected. Moreover, higher values for m will also result in the detection of fewer cores (for instance, all nodes with degree lower than m will be excluded from the core selection process). For this reason, we employed an iterative scheme [17], in which the community seed set selection operation is carried out multiple times with different values of m and e so that a meaningful subspace of these two parameters is thoroughly explored and the respective (m, e)-cores are detected. The exploration of the (m, e) parameter space is carried out as depicted in Fig. 5.3. We start with very high values for both parameters. Since the maximum possible values for m and e are kmax (the maximum degree on the graph) and 1.0, respectively, we start the parameter exploration with two values close to them (for instance, we could select m0 = 0.9·kmax and e0 = 0.9; the results of the algorithm are not very sensitive to this choice). We identify the respective (m,e)-cores and associated core sets and then relax the parameters in the following way. First, we reduce m; if it falls below a certain threshold (e.g., mmin = 4), we then reduce e by a small step (e.g., 0.05) and we reset m = m0. When both m and e reach a small value (m = mmin and e = emin), we terminate the community seed set detection step. This exploration path ensures that high-quality communities will be discovered first and that less profound ones will also be detected subsequently. Although the parameter sampling process is depicted as linear in Fig. 5.3, in practice one could employ a near-logarithmic sampling scheme for parameter m in order to save computational cost.
Fig. 5.3 Depiction of the (m,e) parameter space exploration path
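A possible rendering of this exploration path is sketched below (an assumption, reusing is_core from the previous sketch); the starting values, the minimum e, and the step sizes are illustrative defaults rather than values prescribed by the chapter.

```python
def explore_parameter_space(G, k_max, mu_min=4, eps_min=0.4, eps_step=0.05):
    """Yield (m, e, cores) triples, moving from strict to permissive values."""
    eps = 0.9
    while eps >= eps_min:
        mu = max(mu_min, int(0.9 * k_max))   # reset m to a high value (m0)
        while mu >= mu_min:
            cores = [v for v in G if is_core(G, v, mu, eps)]
            yield mu, eps, cores             # community seed sets grow around these cores
            mu = max(mu_min - 1, mu // 2)    # near-logarithmic sampling of m
        eps -= eps_step                      # relax e by a small step
```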
5.3.2 Community Expansion
Starting from a community seed set S, the second step in the proposed community detection method involves an expansion process, which aims at attaching additional nodes, which are relevant, to the initial community seed set. The expansion step is essential for deriving higher quality communities since the community seed sets produced by the previous step may fail to include in the communities nodes that are of importance for them. In the case of tag communities, this would lead to tag communities that would miss some important keywords and would thus be less representative of their topic. In addition, it is due to this expansion step that overlap among communities is possible since the previous step produces non-overlapping community seed sets. The community expansion step is based on the maximization of a local measure of community quality, namely subgraph modularity introduced in [33]. The modularity of a subgraph S ⊆ V is defined as the ratio of the number of intra-community edges (edges connecting nodes within S) over the number of edges sticking out of S (5.5). Obviously, the larger such a value is, the more well separated the subgraph is from the rest of the graph. In the extreme case of a disconnected subgraph, its modularity value tends to infinity:

  M(S) = |{(v, w) ∈ E | v, w ∈ S}| / |{(v, w) ∈ E | v ∈ S ∧ w ∈ V∖S}|    (5.5)
The proposed expansion step is based on a greedy maximization scheme, i.e., it successively attaches nodes to community S as long as their addition increases the subgraph modularity M(S) of the community. The set of nodes that are considered as candidates for attachment to S is pooled from the "community frontier," i.e., the set of all nodes that are adjacent to at least one node of the community. Each candidate node is tentatively attached to the community and the new value of its modularity is computed. Nodes with very high degree6 are not considered in this process for two reasons: (a) to reduce the computational complexity of the expansion step and (b) to prevent the expansion process from creating a "gigantic" community. The node resulting in the maximum increase of modularity for the community is considered a member of the community and the process is repeated for the rest of the candidate nodes (it is possible that there is no increase of modularity by adding a node to the community, in which case no expansion takes place). The expansion process is specified in Algorithm 1 and exemplified in Fig. 5.4a, b. It is thanks to this expansion step that overlap among the resulting communities is possible.
6 We create a degree-ordered list of nodes for the whole graph and consider as high-degree nodes the top 10% of them.
Fig. 5.4 An example of maximizing the local modularity of the expanding community. In Fig. 5.4a the local modularity of the subgraph consisting of nodes 1–6 has a value of M(S) = 10/7 = 1.429. By attaching node 11 to the community (Fig. 5.4b), the local modularity of the expanded subgraph (comprising nodes 1–6 and 11) is now equal to M(Sexp) = 12/5 = 2.4
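A greedy version of this expansion step could look as follows (an illustrative sketch under the above description, not the authors' code); high-degree nodes are assumed to have been collected into excluded beforehand.

```python
def subgraph_modularity(G, S):
    """M(S) of (5.5): intra-community edges over edges leaving the community."""
    internal = sum(1 for v, w in G.edges() if v in S and w in S)
    boundary = sum(1 for v, w in G.edges() if (v in S) != (w in S))
    return float("inf") if boundary == 0 else internal / boundary

def expand_community(G, seed_set, excluded=frozenset()):
    S = set(seed_set)
    while True:
        frontier = {w for v in S for w in G.neighbors(v)} - S - set(excluded)
        current = subgraph_modularity(G, S)
        best_node, best_gain = None, 0.0
        for w in frontier:
            gain = subgraph_modularity(G, S | {w}) - current
            if gain > best_gain:
                best_node, best_gain = w, gain
        if best_node is None:        # no candidate increases M(S): stop expanding
            return S
        S.add(best_node)
```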
After carrying out the community expansion step, there are still nodes that have no community associated with them. These nodes are classified as either hubs or outliers using the same criterion as in [4]. A node is considered to be a hub if it is connected to more than one community. For instance, node 8 in Fig. 5.2 is classified as a hub since it is adjacent to two communities. The remaining nodes, i.e., nodes that are adjacent to only one or no community at all are classified as outliers. In Fig. 5.2, node 7 is such a node. In the case of tag networks, hub tags may correspond to either polysemous tags that are used in multiple topical contexts or generic tags that can be used independent of the topic of a resource (for instance, in
the case of tags referring to images, such a tag could refer to the picture quality of the image, for example, “bright,” “black and white,” etc.). Outlier tags may correspond to personal tags (i.e., tags that have some specific meaning only for the person using them), infrequent tags, or spam/erroneous tags.
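The classification of the remaining nodes could be sketched as follows (an assumption consistent with the criterion above): a node adjacent to more than one community becomes a hub, and any other unassigned node an outlier.

```python
def classify_unassigned(G, communities):
    """Split nodes that belong to no community into hubs and outliers."""
    assigned = set().union(*communities) if communities else set()
    hubs, outliers = set(), set()
    for v in G:
        if v in assigned:
            continue
        adjacent = {i for i, community in enumerate(communities)
                    if any(w in community for w in G.neighbors(v))}
        (hubs if len(adjacent) > 1 else outliers).add(v)
    return hubs, outliers
```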
5.3.3 Evaluation of Tag Communities
Evaluating the results of community detection, which is a kind of clustering process, constitutes a challenging task. Due to the size and uncontrolled nature of folksonomy data, it is simply impractical to have the produced communities subjectively evaluated by human subjects. Furthermore, since Collaborative Tagging Systems are complex systems characterized by evolving and emerging semantics, there is no standard ground truth and it is hard to rely on external sources of knowledge, such as Wikipedia,7 to establish semantic relationships among the tags of a folksonomy. Therefore, there is no commonly agreed evaluation protocol for assessing the quality of the produced communities. Obviously, the most direct means of evaluating the quality of a set of communities is to subject them to human judgment, i.e., to ask people to assess the relatedness that members of the same community have to each other. By asking multiple users to evaluate the same community, it is also possible to derive confidence scores based on the inter-annotator agreement, thus removing the subjective element of the evaluation. However, such user evaluation studies are costly and can only be applied to limited samples of the community structures under test. An implicit means of evaluating the quality of communities on a graph is by use of some graph-based community structure quality measure. Such a measure, which is commonly used in the community detection literature, is the modularity of the community structure. Modularity quantifies the extent to which the division of a network into communities results into more edges between nodes of the same community than those that would result from the same division but with a random edge distribution between the nodes. Modularity is computed by means of (5.6):

  Q = (1/2m) Σij [Aij − ki·kj/(2m)] δ(ci, cj)    (5.6)
where A denotes the adjacency matrix of the graph, ci is the community to which node i belongs, and d(ci, cj) is the Kronecker delta symbol. There are two reasons why we are not going to use modularity in our evaluation study. First, modularity can be computed only for partition-like community structures. Instead, the community structure proposed by our method is not partitional,
7 Wikipedia, http://wikipedia.org/
since it allows overlaps between communities and it leaves several network nodes unassigned to communities. An additional, and perhaps even more important, reason for not using modularity is its known limitation in correctly capturing small-scale communities [32]. Instead of modularity, we employ another popular graph-based quality measure, the graph conductance. Conductance is defined in relation to a subgraph S (i.e., a community), which implies a cut (S, V∖S) between the subgraph and the rest of the graph. The measure is computed by means of the following equation:

  φ(S) = ( Σ{i∈S, j∈V∖S} Aij ) / min(A(S), A(V∖S))    (5.7)

where A(S) is the total number of edges that are incident with S:

  A(S) = Σ{i∈S} Σ{j∈V} Aij    (5.8)
An advantage of conductance is that it is defined per community, enabling the derivation of an empirical distribution of conductance values for a given community structure. In that way, it is possible to quantify the performance of the community detection method over the whole set of discovered communities and thus assess the robustness of the method under test. Furthermore, conductance is considered to capture the "gestalt" notion of communities and has been extensively used for evaluating community quality in a wide range of online networks [8]. Last but not least, it is possible to evaluate a community detection method by incorporating its results in some Information Retrieval (IR) task and measuring the IR performance for that task in terms of measures such as precision, recall, and F-measure. In the case of tag communities, such a task is tag recommendation, i.e., given some input tag(s) the system produces a set of tag suggestions to the user. The advantage of employing such an evaluation method is that it is possible to use the tagging history of real users as ground truth, which enables large-scale evaluation studies at low cost. In practice, one divides all available tag assignments of a Collaborative Tagging System into two sets, one used for training and the other used for testing. Based on the training set, one builds the corresponding tag graph and detects the communities in it. Then, by using the tag assignments of the test set, the evaluation aims to quantify the extent to which the community structure found by use of the training set can help predict the tagging activities of users on the test set. For each test resource that is tagged with L tags, K < L tags are used as input to the tag recommendation algorithm and the remaining L − K are predicted. In that way, both the number of correctly predicted tags and the number of missed tags are known, which enables quantification of the IR performance in terms of precision and recall.
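For completeness, a direct implementation of (5.7) on an unweighted networkx graph might look like this (a sketch, not the evaluation code used in the chapter):

```python
def conductance(G, S):
    """phi(S) of (5.7) for a community S of graph G (unweighted edges)."""
    S = set(S)
    cut = sum(1 for v, w in G.edges() if (v in S) != (w in S))
    a_s = sum(G.degree(v) for v in S)                 # A(S) as in (5.8)
    a_rest = sum(G.degree(v) for v in G if v not in S)
    denominator = min(a_s, a_rest)
    return cut / denominator if denominator else 0.0
```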
5.4 Experimental Evaluation
In order to gain insights into the behavior of community detection in real-world tagging systems, we conduct an empirical evaluation of the performance of two community detection methods on three datasets coming from different tagging applications, namely BibSonomy, Flickr, and Delicious. The first of the two community detection methods under study is the well-known greedy modularity maximization scheme presented by Clauset et al. [31]8 and the second is the scheme that we presented in Sect. 5.3. We will use the abbreviations CNM and HCD (standing for Hybrid Community Detection) to denote the two methods. The three datasets that we used for our study are described in the following and basic information on their size is presented in Table 5.2.

Table 5.2 Folksonomy datasets used for evaluation
Dataset          #triplets   U        R          T
BIBSONOMY-200K   234,403     1,185    64,119     12,216
FLICKR-1M        927,473     5,463    123,585    27,969
DELICIOUS-7M     7,501,032   112,950  1,332,796  251,352

BIBSONOMY-200K: BibSonomy is a social bookmarking and publication sharing application focused on research literature. The BibSonomy dataset was made available through the ECML PKDD Discovery Challenge 2009.9 We used the "PostCore" version of the dataset, which consists of a little more than 200,000 tag assignments (triplets) and hence the label "200K" was used to form the dataset name.

FLICKR-1M: Flickr is a popular photo-sharing and organizing application on the Web today, featuring billions of tagged images. For our experiments, we used a focused subset of Flickr comprising approximately 120,000 images that were located within the city of Barcelona (the images contained geolocation information). In total, the number of tag assignments for this dataset approaches one million.

DELICIOUS-7M: Delicious is a popular social bookmarking service that enables users to manage and share their bookmark collections online. We made use of a small snapshot of the Delicious bookmark collection corresponding to January 2006, comprising seven million tag assignments. This dataset is a subset of the collection studied in [34].

8 We used the publicly available implementation of this algorithm, which we downloaded from http://www.cs.unm.edu/~aaron/research/fastmodularity.htm
9 http://www.kde.cs.uni-kassel.de/ws/dc09

Starting from each dataset, we built a resource-based tag co-occurrence graph as described in Sect. 5.2.1. The raw graph contained a large component and several very small components and isolated nodes. For the experiments we used only the large component of each graph, which accounts for more than 99% of the size of the raw graph for all three datasets. Some basic statistics of the analyzed large
components are presented in Table 5.3. The nodes of the three tag graphs appear to have a high clustering coefficient on average, which indicates the existence of community structure in them.

Table 5.3 Basic graph statistics for the large component of the examined tag graphs (k̄: average degree, cc: average clustering coefficient)
Dataset          |V|       |E|         k̄       cc
BIBSONOMY-200K   11,949    236,791     39.63   0.6689
FLICKR-1M        27,521    693,412     50.39   0.8512
DELICIOUS-7M     216,844   3,443,367   31.76   0.8018

We applied both community detection methods, CNM and HCD, on the tag graphs and proceeded with the analysis of the derived communities. First, we present a comparison of the sizes of the detected communities. Figure 5.5 presents the rank plots of the communities detected by CNM and HCD based on their size. It is evident that CNM produces communities with much more skewed size distribution than HCD. For instance, the three largest communities of the BIBSONOMY-200K tag network together comprise a total of 10,625 tags, accounting for approximately 89% of all unique tags of this graph. By contrast, the communities produced by HCD have a much more balanced size distribution, with the largest community of BIBSONOMY-200K consisting of just 38 nodes. A similar situation holds also for the other two datasets. When considering the applications of tag community detection (see Sect. 5.2.5), it is hard to imagine that the highly imbalanced community structure produced by CNM can be of much benefit. For instance, knowing that two tags belong to the same huge community is not very informative of their semantic relation; in fact, there are many pairs of tags within such huge communities that are not actually related to each other. Table 5.4 presents several such examples of unrelated tags which were placed in the same community. Having these tags in the same community is not only uninformative but is actually misleading and thus potentially harmful for use within some information retrieval task. By contrast, Table 5.5 presents several examples of interesting tag communities discovered by HCD.
Fig. 5.5 Size distribution of the communities detected by CNM and HCD. Across all three datasets, CNM produces communities with a much more skewed size distribution than HCD
Table 5.4 Examples of unrelated tags that were assigned by CNM to the same community

BIBSONOMY-200K:
- Hannover, nutritional, ebusiness, bishop, vivaldi, sunsets, skyscapes, recycle, antiracist, patentbibliometrics
- Information retrieval, magnetic, robotics, kolmogorov, wordnet, darmstadt, socialinformatics, changemanagement, thermodynamics, metaphysics
- webdesign, windows, torrent, puzzle, vmware, geotagging, mov, techcrunch, cpplib, baseballplayers

FLICKR-1M:
- Spanien, common chimpanzee, star wars, renault, restaurant, prostitution, olympicstadium, large windows, infrared, president of the usa
- Barcelona, watermelon, photon awards, birthday, mediterranean, palm tree, fine arts, volkswagen, building, logistics
- Roma, double bass, crowd surfing, environment, lomography, flickr babes, sombrero, basketball, bruce springsteen, design for children

DELICIOUS-7M:
- Geekiness, telepathy, scifihorror, britneyspears, theflintstones, sportculture, onlinepokergames, environmentalhealth, uspatent, argentina
- Education, capetown, flashwebsites, businessanalyst, alcoholicsanonymous, newjournalism, adventuretravel, countrycallingcodes, musicnetwork, scienceastrophysics
- Food, island, bike, jersey, federal, climate, ghosts, athletics, enviroment, imperialism

Examples from the three largest communities of each dataset are presented
Close examination of the tags contained in them reveals their close semantic and contextual association. In the case of CNM, these communities are contained in the aforementioned gigantic communities together with numerous unrelated tags, thus their utility is limited. Although the tag communities detected by HCD contain tags that are closely related to each other, there are cases in which they appear to be fragmented: there are multiple tag communities that refer to the same topic, but are split in different communities. Such an example is presented in Table 5.6. In this case, the CNM algorithm managed to assemble all tags related to recipes and ingredients to a single community,10 while HCD had them dispersed in four different groups. Subsequently, we also computed the conductance values for the communities derived by CNM and HCD for every dataset. Figure 5.6 presents the conductance distributions characterizing the community structure produced by the two methods under comparison. It appears that CNM produces communities of lower conductance than HCD, which in terms of graphs means that the CNM communities are better separated than their HCD counterparts from the rest of the network. However, this seemingly superior performance of CNM in terms of conductance comes at the cost of creating highly unbalanced tag communities that do not correspond well to the topics connoted by the tags.
10 However, it is obvious from the CNM community tags that irrelevant tags were also placed in the same community.
Table 5.5 Examples of interesting tag communities discovered by HCD. In the case of CNM, these communities are "hidden" within the gigantic communities discovered by CNM

BIBSONOMY-200K:
- mpg, tif, jpeg, mpc, fileconverter, ico, wma, swf, fileconversion, txt, midi, psd, wmi, ogg, avi, psp, tiff, odg, mdb, kar, divx, wmv, qcp, odp, realaudio, ods, rtf, odt, jpg, mov, amv, png, flv, flac, mmf, gif, sxw, amr
- israelis, Middleast, terrorism, middleastpeace, peaceprocess, onevoice, palestinians, conflictresolution, extremism, hatred
- urlogic, lymphatic, neoplasms, virus, pathophysiology, microbial, hemic, physician, doctor, musculoskeletal, respiratory, student, hepatological, viral, infections, hematological, gastrointestinal, genital

FLICKR-1M:
- salad, spansih gastronomy, catalan food, modena, bacallà, colmenillas, bread with tomato, marinated, gastronomy, esqueixada, merluzzo, ec, marinado, cod, vinegar, bacalao, foie, meatfest, publish, duck foie
- george clooney, sean connery, charcoal, jude law, antonio banderas, jennifer lopez, tom cruise, penelope cruz, viggo mortensen, travel photography series
- australian, federer, conde godó, open, moya, tenerife, atp, las palmas gran, garros, las, torneo, murray, tamarasit, roland, roddick, podcast, bernardes, sharapova, djokovic, islas, wta, wawrinka, campeonato, canarias, usopen, enric molina, en, masters, chela gran

DELICIOUS-7M:
- apollomission, saturnrocket, spacecrew, crewflight, navylieutenant, gusgrissom, flightcommander, colonelwhite, americanastronauts, lieutenantcolonel, edwardwhite, spacewalk, capekennedy
- herbiehancock, dextergordon, billstewart, bobbyhutcherson, chrispotter, samyahel, brianblade, johnscofield, grantgreen, freddiehubbard, bradmehldau, adamrogers, wayneshorter, donaldbyrd, theloniousmonk, leemorgan, larrygoldings, hardbop, peterbernstein, weatherreport, marcjohnson, mainstreamjazz, artblakey, billevans, joehenderson, joshuaredman, charlieparker
- gildaradner, danacarvey, commercialparodies, momjeans, thehanukkasong, richardpryor, stevemartin, wilferrell, chrisfarley, billmurray, schwettyballs, motivationalspeaker, wakeupandsmile, cleargravy, adamsandler, colonblow, kingtut, alecbaldwin, mikemyers, churchlady, pinkpanther, chevychase
Table 5.6 Example of fragmentation in the HCD communities in comparison to a CNM community
CNM community / Fragmented HCD communities: sausage, polenta, casserole, tomato, mozzarella, recipes, thanksgiving, turkey, roastturkey, saucy, oregano, basil cranberry, orange, relish, grinder, raw, berries, molasses, syrup, pomegranates, culinária, recept, recipe, recipes, receitas, pastries, pies, pecans, lowsugar, nuts, estados, cooking, ecards, tea, unidos, drinks, walnuts, tarts, tea, fooddrink, drinks, daniel, fooddrink programing, goto, johnston, singer, songwriter, horseradish, crab, spicy, tarts, lowsugar, pastries, pies, nuts, pecans, crabcakes, creamy, potatoes, casseroles, walnuts onions, scalloped, scallopedpotatoes, cheese, mozzarella, sausage, saucy, tomato, onions, scallopedpotatoes, scalloped, casseroles, potatoes, cheese polenta, oregano, casserole, basil, triana
Several CNM community tags are written in italics since they are unrelated to the community topic
Fig. 5.6 Conductance distribution of the communities detected by CNM and HCD. Across all three datasets, CNM produces communities with lower (better) conductance values than HCD

Table 5.7 IR performance of CNM and HCD community structures in tag recommendation
           BIBSONOMY-200K    FLICKR-1M         DELICIOUS-7M
           CNM      HCD      CNM      HCD      CNM      HCD
RT         15,344   15,344   57,206   57,206   56,754   56,754
Rout       15,271   12,440   57,021   51,374   56,115   32,252
RTP        377      1,547    2,545    14,086   1,250    6,425
P (%)      2.47     12.44    4.46     27.42    2.23     19.92
R (%)      2.46     10.08    4.45     24.62    2.20     11.32
F (%)      2.46     11.14    4.46     25.95    2.21     14.44
P@1 (%)    2.54     8.48     1.89     14.50    1.63     9.16
P@5 (%)    2.39     17.48    3.04     26.90    2.36     31.83
The following notation was used: RT denotes the number of correct tags according to the ground truth, Rout the number of tag suggestions made by the recommender, RTP the number of correct suggestions, P, R, and F denote precision, recall, and F-measure, respectively, and P@1 and P@5 denote precision at ranks 1 and 5, respectively
Finally, we used the derived tag communities in the context of tag recommendation in order to quantify their effect on the IR performance of a community-based tag recommendation system. More specifically, we created a simple tag community recommendation scheme, which, based on an input tag, uses the most frequent tags of its containing community to form the recommendation set. In case more than one tag is provided as input, the system produces one tag recommendation list (ranked by tag frequency) for each tag and then aggregates the ranked lists by summing the frequencies of tags belonging to more than one list. Although this recommendation implementation is very simple, it is suitable for benchmarking the utility of community structure since it is directly based on it. The evaluation process was conducted as described in Sect. 5.3.3 with an additional filtering step applied on the tag assignments of the test set. Out of those, we removed the tags that (a) did not appear in the training set, since it would be impossible to recommend them, and (b) were among the top 5% of the most frequent tags, since in that case recommending trivial tags (i.e., the most frequent within the dataset) would be enough to achieve high performance.
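A minimal version of this recommender could be sketched as follows (an assumption, not the authors' implementation; tag_to_community and tag_frequency are hypothetical inputs mapping each tag to its community and to its global frequency):

```python
from collections import Counter

def recommend(input_tags, tag_to_community, tag_frequency, top_k=5):
    """Suggest the most frequent tags from the communities of the input tags."""
    scores = Counter()
    for tag in input_tags:
        community = tag_to_community.get(tag)
        if community is None:                # hub/outlier tags belong to no community
            continue
        for candidate in community:
            if candidate not in input_tags:
                # Summing frequencies aggregates candidates that appear in several lists.
                scores[candidate] += tag_frequency.get(candidate, 0)
    return [tag for tag, _ in scores.most_common(top_k)]
```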
Table 5.7 presents a comparison between the IR performance of tag recommendation when using the CNM and HCD tag communities. According to the table, using the HCD tag communities results in far better tag recommendations than using the CNM ones across all three datasets. For instance, in the FLICKR-1M dataset, the HCD-based recommendation achieves more than six times higher precision than the CNM-based one (27.42% compared with 4.46%). A large part of the failure of the CNM-based tag recommendation can be attributed to the few gigantic communities that dominate its community structure. Also, in the case of HCD-based tag recommendation, there were numerous cases in which it was not possible to provide any recommendations due to the fact that the input tags did not belong to any community (HCD leaves several tags unassigned to communities). Obviously, one could employ a fall-back tag recommendation scheme (e.g., based on simple tag co-occurrence) to address cases where no recommendation can be made based on HCD communities. Alternatively, it is also possible to directly exploit the hub and outlier information of the HCD structure in order to improve the performance of the respective tag recommendation method. In this study, we wanted to focus on the comparison between the community structures of the two methods. In the future, we plan to develop more sophisticated approaches for tag recommendation based on tag communities and compare them with existing tag recommendation schemes (e.g., FolkRank [2]). Apart from tag recommendation, HCD can also be used as a means of discovering the different usage contexts/senses of tags. Since HCD identifies hub tags that are adjacent to multiple communities, we can assume that at least some of them correspond to different usage contexts for these tags.
Table 5.8 Examples of different usage contexts for three sample hub tags (one for each dataset)

Networks (BIBSONOMY-200K)
- Network analysis: socialanalysis, socialdynamics, sociophysics, autocorrelation, swarm, percolation, selforg, fitnesslandscapes, socialstructure, ...
- Epidemic spread: large, spreading, enidtl, kcore, dtl, pysics, epidemic, gda, vespignani, networks, delivery, pastorsatorras
- Bibliometrics: citation, reputation, publication, bibliometrics, academics, impact, ...
- VoIP networks: voip, phone, skype, telephony, jajah, vonage, internettelephony, ...
- Security: proxy, encryption, wlan, password, openssl, cracker, https, wep, ...

Bridge (FLICKR-1M)
- Bac de Roda: bacderoda, puente de calatrava, bru, puente luminoso, structural, ...
- Bridge of Sighs (Bisbe): carrer del bisbe, bridge of sighs, footbridge, cobblestone, bascule, ...
- Port bridge: port, puerto, sea, port vell, ship, boat
- Pont del Centenari: transporte, pont, montserrat, new

Bean (DELICIOUS-7M)
- Plant seeds: lentils, blackbean, greenbean, kidneybeans, hummus, chickpeas, ...
- Coffee beans: grown, arabica, robusta, pricey, coffea, wholesaler
- Bean bag: toffet, velveteen, crushed, swirl, tuffet, herbagreeen, swirled
- Java beans: spring, hibernate, jsf, ejb
- Travis Bean: guitars, fender, musicstuff, giger, telecaster, strat, stratocaster, ...
- Frances Bean Cobain: courtneylove, cobain, francesbean, lindacarroll, thevoid, girlgerms, ...
For instance, Table 5.8 presents several usage contexts discovered for the tags "networks," "bridge," and "bean" taken from the datasets BIBSONOMY-200K, FLICKR-1M, and DELICIOUS-7M, respectively.11 These contexts correspond to meaningful concepts/objects connoted by the tags; for example, in the case of "bean," there are tag communities corresponding to bean as a plant seed, coffee bean, java bean, etc. In the future, we plan to investigate methods of accurately mining such tag contexts in an automatic way.
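One simple way to read such contexts off the HCD output is sketched below (an assumption rather than the chapter's procedure): the usage contexts of a hub tag are taken to be the detected tag communities adjacent to it in the tag graph.

```python
def usage_contexts(G_T, hub_tag, communities):
    """Return the tag communities adjacent to a hub tag in the tag graph G_T."""
    neighbors = set(G_T.neighbors(hub_tag))
    return [community for community in communities
            if hub_tag not in community and neighbors & set(community)]
```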
5.5 Conclusions
This chapter provided an overview of community detection in Collaborative Tagging Systems. It discussed a range of works in this area that deal with the problem of identifying groups of objects (communities) in a tagging system that are highly related to each other. The chapter structured the related work discussion along three dimensions: type of community (user, resource, tag, or composite), community detection method, and application context. In addition, a parameter-free extension of an existing community detection method was presented that caters to the particular characteristics of tag networks. Furthermore, a series of means for evaluating the derived tag community structure were discussed. Finally, a comparative evaluation study was conducted involving the proposed technique and an established modularity maximization scheme. Three datasets coming from real-world tagging systems were employed in the study to ensure the reliability of the findings. The proposed approach was demonstrated to produce tag communities that are more precise and more useful for tag recommendation than the ones produced by the modularity maximization scheme. In addition, the presented method was successfully applied to discover multiple contexts of usage for several hub tags.
Acknowledgments This work was supported by the WeKnowIt and GLOCAL projects, partially funded by the European Commission, under contract numbers FP7-215453 and FP7-248984, respectively.
11 In fact, these hub tags were adjacent to many more tag communities than the ones presented. Out of those, some were very generic, some were similar to the ones presented, and for some we could not establish a profound sense. For the sake of brevity, we manually picked some of the prominent contexts to present.
References
1. Mika, P.: Ontologies are us: A unified model of social networks and semantics. In: ISWC 2005, Lecture Notes in Artificial Intelligence, vol. 3729, pp. 522–536. Springer, Heidelberg (2005)
2. Hotho, A., Jäschke, R., Schmitz, C., Stumme, G.: Information retrieval in folksonomies: Search and ranking. In: ESWC 2006, Lecture Notes in Artificial Intelligence, vol. 4011, pp. 411–426 (2006)
3. Girvan, M., Newman, M.E.J.: Community structure in social and biological networks. Proc. Natl Acad. Sci. USA 99, 7821–7826 (2002)
4. Xu, X., Yuruk, N., Feng, Z., Schweiger, T.A.: SCAN: A structural clustering algorithm for networks. In: Proceedings of KDD'07: 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 824–833. ACM, New York, NY (2007)
5. Au Yeung, C.M., Gibbins, N., Shadbolt, N.: Contextualising tags in collaborative tagging systems. In: Proceedings of 20th ACM Conference on Hypertext and Hypermedia (Turin, Italy, 29 June–1 July), pp. 251–260. ACM, New York, NY (2009)
6. Kumar, R., Raghavan, P., Rajagopalan, S., Tomkins, A.: Trawling the web for emerging cyber-communities. Comp. Netw. 31(11–16), 1481–1493 (1999)
7. Flake, G.W., Lawrence, S., Giles, C.L.: Efficient identification of Web communities. In: Proceedings of KDD'00: 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 150–160. ACM, New York, NY (2000)
8. Leskovec, J., Lang, K.J., Dasgupta, A., Mahoney, M.W.: Statistical properties of community structure in large social and information networks. In: Proceedings of WWW'08: 17th International Conference on World Wide Web, pp. 695–704. ACM, New York, NY (2008)
9. Koutsonikola, V., Vakali, A., Giannakidou, E., Kompatsiaris, I.: Clustering users of a social tagging system: A topic and time based approach. In: Proceedings of the 10th International Conference on Web Information Systems Engineering (2009)
10. Schifanella, R., Barrat, A., Cattuto, C., Markines, B., Menczer, F.: Folks in folksonomies: Social link prediction from shared metadata. In: Proceedings of WSDM'10: 3rd ACM International Conference on Web Search and Data Mining, pp. 271–280. ACM, New York, NY (2010)
11. Cattuto, C., Baldassarri, A., Servedio, V.D.P., Loreto, V.: Emergent community structure in social tagging systems. Adv. Complex Syst. 11(4), 597–608 (2008)
12. Au Yeung, C.M., Gibbins, N., Shadbolt, N.: A study of user profile generation from folksonomies. In: Proceedings of SWKM 2008: Workshop on Social Web and Knowledge Management at WWW2008 (2008)
13. Nanopoulos, A., Gabriel, H.-H., Spiliopoulou, M.: Spectral clustering in social-tagging systems. In: Proceedings of WISE 2009: Web Information System Engineering, pp. 87–100. Springer, Heidelberg (2009)
14. Shepitsen, A., Gemmell, J., Mobasher, B., Burke, R.: Personalized recommendation in social tagging systems using hierarchical clustering. In: Proceedings of RecSys 2008: ACM Conference on Recommender Systems, pp. 259–266. ACM, New York, NY (2008)
15. Gemmell, J., Shepitsen, A., Mobasher, B., Burke, R.: Personalizing navigation in folksonomies using hierarchical tag clustering. In: Proceedings of DaWaK 2008: International Conference on Data Warehousing and Knowledge Discovery, pp. 196–205. Springer, Heidelberg (2008)
16. Papadopoulos, S., Kompatsiaris, Y., Vakali, A.: Leveraging collective intelligence through community detection in tag networks. In: Proceedings of CKCaR'09 Workshop on Collective Knowledge Capturing and Representation, Redondo Beach, California, USA (2009)
17. Papadopoulos, S., Kompatsiaris, Y., Vakali, A.: A graph-based clustering scheme for identifying related tags in folksonomies. In: Proceedings of 12th International Conference on Data Warehousing and Knowledge Discovery (DaWaK 2010), Bilbao, Spain (2010)
5 Community Detection in Collaborative Tagging Systems
131
18. Specia, L., Motta, E.: Integrating folksonomies with the semantic web. In: Proceedings of the 4th European Conference on the Semantic Web: Research and Applications. Lecture Notes in Artificial Intelligence, vol. 4519, pp. 624–639. Springer, Heidelberg (2007) 19. Han, L., Yan, H.: A fuzzy biclustering algorithm for social annotations. J. Inf. Sci. 35(4), 426–438 (2009) 20. Giannakidou, E., Koutsonikola, V., Vakali, A., and Kompatsiaris, Y.: Co-clustering tags and social data sources. In: Proceedings of WAIM 2008: International Conference on Web-Age information Management, IEEE Computer Society, Washington, DC, pp. 317–324 (2008) 21. Lehmann, S., Schwartz, M., Hansen, L.K.: Biclique communities. Phys. Rev. E 78(1), 016108 (2008) 22. Diederich, J., Iofciu, T.: Finding communities of practice from user profiles based on folksonomies. In: Proceedings of the 1st International Workshop on Building Technology Enhanced Learning solutions for Communities of Practice (2006) 23. Lu, C., Chen, X., Park, E.K.: Exploit the tripartite network of social tagging for web clustering. In: Proceedings of CIKM ´09: 18th ACM Conference on information and Knowledge Management, pp. 1545–1548. ACM, New York, NY (2009) 24. Neubauer, N., Obermayer, K.: Towards community detection in k-partite, k-uniform hypergraphs. In: Workshop on Analyzing Networks and Learning with Graphs at NIPS (2009) 25. Schaeffer, S.E.: Graph clustering. Comput. Sci. Rev. 1(1), 27–64 (2007) 26. Strehl, E., Ghosh, J., Mooney, R.: Impact of similarity measures on web-page clustering. In: Proceedings of AAAI 2000 Workshop on Artificial Intelligence for Web Search, pp. 58–64 (2000) 27. Fortunato, S.: Community detection in graphs. Phys. Rep. 486(3–5), 75–174 (2010) 28. Begelman, G., Keller, P., Smadja, F.: Automated tag clustering: Improving search and exploration in the tag space. In: Proceedings of the WWW 2006 Collaborative Web Tagging Workshop, Edinburgh (2006) 29. Simpson, E.: Clustering tags in enterprise and web folksonomies. HP Labs Techincal Reports (2008) 30. Newman, M.E.J.: Fast algorithm for detecting community structure in networks. Phys. Rev. E 69, 066133 (2004) 31. Clauset, A., Newman, M.E.J., Moore, C.: Finding community structure in very large networks. Phys. Rev. E 70, 066111 (2004) 32. Fortunato, S., Barthe´lemy, M.: Resolution limit in community detection. Proc. Natl Acad. Sci. USA 104(1), 36–41 (2007) 33. Luo, F., Wang, J.Z., Promislow, E.: Exploring local community structures in large networks. In: Proceedings of WI 2006: IEEE/WIC/ACM international Conference on Web Intelligence, pp. 233–239. IEEE Computer Society, Washington, DC (2006) 34. Wetzker, R., Zimmermann, C., Bauckhage, C.: Analyzing social bookmarking systems: A del. icio.us cookbook. In: Proceedings of ECAI 2008 Workshop on Mining Social Data (MSoDa), 2630, July 2008, Patras, Greece (2008)
.
Chapter 6
On Using Social Context to Model Information Retrieval and Collaboration in Scientific Research Community Lynda Tamine, Lamjed Ben Jabeur, and Wahiba Bahsoun
Abstract In this chapter, we address the shift in usage from personal context toward social context in the information retrieval area. We are specifically interested in scientific communities and their practices for information retrieval, seeking, and collaboration. Therefore, we present an overview of works that tackle the problem of information retrieval in a scientific community from a social perspective. With the objective of exploiting sociometric foundations in the context of literature retrieval, we present our social retrieval model. In particular, we model the author's importance within the community, formalize the degree of collaboration between authors and the taggers' interest in a scientific topic, and combine these factors to tune information relevance in response to a user query. An experimental evaluation using a scientific corpus of documents and social data extracted from the academic social network CiteULike (http://www.citeulike.org) is presented; it shows the impact of the social view on retrieval effectiveness under different assumptions of relevance emerging from socially endorsed data.
6.1
Introduction
It is well known that the fundamental intellectual problems of information retrieval are the production and consumption of information. More specifically, information retrieval is a field that deals with storage and access to relevant information according to the user needs. The main goal of an information retrieval system is to return to the user the most valuable documents in response to his queries. Several information retrieval theoretical models have been proposed to specify both document
L. Tamine (*) Institut de Recherche en Informatique de Toulouse, Toulouse, France e-mail: [email protected]
and query representations and document-query matching [1, 2]. These retrieval approaches, considered system-centered ones, are basically founded on computing the topical relevance of a document according to a topical query. In fact, topical relevance estimation is expressed through document-query matching, via evidence showing the extent to which these two entities share content expressed using keywords or concepts. Thus, these approaches are characterized as "one size fits all", since they provide the same results for the same query keywords, even though the latter may be expressed by different users with different intentions, interests, backgrounds, and, more generally, surrounding contexts. While being fundamental for the advances and present stage of information retrieval, these approaches make information retrieval difficult and challenging from the cognitive side, particularly in large-scale and interactive environments. The main criticism is that, in these approaches, retrieval ignores the influence of the user's context, which moves information relevance from topical relevance to situational or cognitive relevance. According to the cognitive view [3] that emerges from user-centered retrieval approaches, cognitive relevance is leveraged by both topical relevance, viewed from the system side, and usefulness, viewed from the human side. Previous works in the field of contextual information retrieval tackled the problem of user-centered retrieval by combining search technologies and knowledge about the query and the user context into a single framework to provide the most appropriate answer for the user's information needs [4]. Context refers particularly to the user's background, preferences, interests, and community. In this chapter, we focus on the user's community as the key contextual factor for enhancing the conception of several central concepts in the information retrieval process, such as information, information need, user, interaction, and relevance. More specifically, we address a particular community, namely the scientific community, which is well known to be highly connected regarding both information production and consumption. Thus, we investigate ways to model scientific collaboration for achieving information access. Accordingly, the main objectives of this chapter are the following:
- To review the concept of context in information retrieval and highlight the shift from personal to social context.
- To discuss major information retrieval models within scientific information communities.
- To present and experimentally validate our approach of social information retrieval specifically for scientific communities.
In Sect. 6.2, we provide a basic background of notions relating to personal and social context in information retrieval. We then give in Sect. 6.3 an overview of collaboration and social retrieval models specifically designed for scientific information access. Section 6.4 highlights the chapter’s contributions. In Sect. 6.5, we detail and evaluate the effectiveness of our approach of a social model for literature access. Section 6.6 concludes the chapter and gives insight into the remaining challenging issues.
6.2
From Personal Context to Social Context in Information Retrieval
The notion of context has a long history in multiple computer science applications [5, 6]. It is a core concept that has been addressed mainly in user modeling, artificial intelligence, adaptive hypermedia, information retrieval, and ubiquitous computing. It is a wide and difficult notion that does not have one definition that can cover all the aspects it refers to. We shall basically define context as: “any knowledge or elementary information characterizing the surrounding application (user, objects, interactions) and having an important relationship with the application itself ”. In this chapter, we focus on context in information retrieval area. In information retrieval applications, context refers to the whole data, metadata, applications, and cognitive structures embedded in situations of retrieval or information seeking and having an impact on the user’s behavior in general and relevance assessment in particular. Intuitively, user–system interaction constitutes a rich repository of potential information about preferences, experience and knowledge, as well as interests [3]. This information repository represents a context of interaction, viewed as source of evidence that could allow retrieval systems to better capture user’s information needs and to more accurately measure the relevance of the delivered information. In other words, the system’s estimation of the relevance would rely not only on the results of query-document matching but also on the user’s contextdocument adequacy. This has challenged the design of contextual information retrieval systems with regard to the definition of relevant dimensions of context and the specification of methods and strategies dealing with these aspects in order to improve the search performance [7]. One of the fundamental research questions here is: which context dimensions should be considered in the retrieval process? Several studies proposed a specification of context within and across application domains [8, 9]. Prior works in contextual information retrieval [3] focused on using personal context to personalize the information retrieval process. Personal context refers mainly to the user’s specific preferences, background, location, etc. However, with the prominence of online social media tools, many studies proved that interaction and collaboration among the users is an appropriate source of evidence allowing the users’ information needs to be better met. In the following paragraphs, we present an overview of works that use either personal context or social context as part of the process of information seeking and retrieval.
6.2.1
On Using Personal Context in Information Retrieval: Toward Personalized Information Retrieval
Personal context can be characterized by five context factors, summarized as follows [10]:
1. User behavior: user behavior data consists of user–search engine interaction features such as click through data, eye-tracking, and browsing features. These various interaction features constitute the dynamic context [11, 12] about the user’s experience allowing a robust prediction to be made about his preferences and short-term or long-term preferences, when seeking information. 2. User interests: generally, this factor expresses the cognitive background of the user that has an impact on his relevance judgment. The great benefit of using the user interests [13, 14] is to disambiguate queries and improve retrieval precision from a large amount of information. 3. Application: user’s application refers to the user’s background in accordance with the principles of evidence-based domain of interest such as diagnosis in medicine. Domain-dependent applications provide clues allowing the user’s information need to be better ascertained. The main objective of using application-dependent features in the retrieval process is to interpret more accurately the user’s information need within a more restricted domain so as to provide specific answers [15]. In order to achieve this goal, specific domain tasks are identified with related specific queries and guidelines for selecting relevant results. 4. Task: task could be defined as the goal of information seeking behavior [16]. Numerous tasks may be achieved by the users like reading news, searching for jobs, preparing course material, and shopping. The aim of considering this factor is to understand the purpose of user queries in order to deliver more accurate results. In web document retrieval, user queries can achieve three main (general) tasks: the topic relevance task, the homepage finding task, and service finding task. Appropriate query and document features are then exploited to predict the desired task and rerank the results. In mobile IR, task could be defined as the application’s achievement such as tourist guide or GPS1-based transport. 5. Location: this factor concerns the geographical zone of interest corresponding to the current query. It is particularly used to categorize queries as local or global ones. Compared with global queries, local ones are likely to be of interest only to a searcher in the relatively narrow personal region. As an example, a search for housing is a location-sensitive query. Techniques are applied [17] to identify the specific geographical locality addressed by a query in order to improve the quality of the query results. By considering the personal context during document retrieval, personalized information retrieval is achieved. The key idea behind personalizing information retrieval is to customize search based on specific personal characteristics of the user. Therefore, as a personalized search engine is intended for a wide variety of users with personal contexts, it has to learn the user model first (commonly called user profile) and then exploit it in order to tailor the retrieval task to the particular user.
1 Global positioning system.
6.2.2
On Using Social Context in Information Retrieval: Toward Social Information Retrieval
Nowadays, it is well known that traditional retrieval models make information retrieval difficult and challenging from the cognitive side, particularly in large-scale and interactive environments supporting communities such as bloggers, Wikipedia authors and users, and online communities on Facebook, Myspace, and Skyblog. The main criticism is that, in these approaches, retrieval ignores the influence of the user's interactions within his social context on the whole information process. Thus, the use of the theoretical foundations of social networks becomes tractable for achieving several retrieval tasks. Accordingly, social information retrieval [18, 19] has become a novel research area that bridges information retrieval and social network analysis to enhance traditional information models by means of the social usage of information. Social context is inferred from the analysis of unstructured communication between users [20]. It refers to user profiles, interactions among users, and tasks achieved through sharable information spaces, mainly the Web sphere. Indeed, analyzing what people exchange, say, share, annotate, etc., allows users' interests to be identified so that their information needs can be better met. In practice, the social context includes the personal context as well as implicit and explicit indicators of interest and information usage, such as tagging, rating, and browsing activities, communication with friends, and the activities of friends [21]. According to the main objective achieved, we can categorize state-of-the-art works on using social context for information and/or knowledge management and retrieval as follows:
- Online communities identification [22, 23]
- Social recommendation and filtering [24, 25]
- Collaborative information production and sharing [26, 27]
- Folksonomy and tag prediction [28, 29]
- Expert search [30] and people search [31]
- Social information retrieval [32, 33]
In particular, social information retrieval, which is the subarea addressed in this chapter, consists mainly of using the social context during the document retrieval process. The main challenge of social information retrieval is, thus, to use social metadata to cluster the user’s information needs according to his socially close neighbors and consequently adapt the relevance assumption of documents. With this in mind, our objective here is to reach a wide definition of context in a social dimension involving users seeking scientific information. Thus, key components such as information, information producers (authors), information consumers (scientific users), and several explicit and implicit relationships between them such as production, retrieval, interest sharing, authoring, trust, citation, and bookmarking are modeled and used as clues for estimating document relevance according to user’s information needs. In what follows, we first review prior works on social information retrieval within a scientific community and then outline our main research contributions.
6.3
Background: Collaboration and Social Models for Scientific Information Access
In this section, we discuss the integration of social context in order to achieve retrieval and collaboration tasks in a scientific research community. In particular, we first review major information retrieval models within scientific information networks that neglect social context and then highlight the benefits of considering several social relevance factors such as trustworthiness of authors, citation sources importance, strengths between authors, and taggers’ interests in order to enhance the retrieval rankings.
6.3.1
Analysis of Coauthorship and Co-citation Networks: How Strongly Connected Is the Scientific Community?
With the increasing collaboration between scientists, many studies have addressed the structure of the scientific community and its evolution patterns. These studies evaluate the cooperation among research groups and predict reliable collaborations between local and international research teams. In fact, collaboration does not cover only individuals, but also concerns research groups, institutions, and countries. As defined in [34], scientific collaboration involves two or more scientists working on a research project and sharing intellectual, economic, and physical resources. Given these various factors and the complexity of the interactions during the collaboration process, it is difficult to represent and evaluate each collaborator's contribution. Nevertheless, it can be approximated through the coauthorship and citation associations explicitly defined in the resulting research documents. Scientific collaboration networks are extracted from bibliographic resources and modeled using a graph where nodes represent authors or collaborators and edges denote collaboration associations. In [35, 36], scientific collaboration is represented by a coauthorship network where connections express direct and collegial interaction between authors. By contrast, the approaches described in [37, 38] focus on the citation network for modeling scientific collaboration, as it represents influence and knowledge transfer between authors. Notably, the analysis of the two kinds of scientific networks shows important collaboration within the researcher community. It is concluded in [35] that authors in experimental research fields (biomedicine and astrophysics) tend to diversify their collaborations, reaching 18 different collaborators. This study, covering several bibliographic databases, also shows that coauthorship highly connects scientists, with around 80-90% of nodes included in the giant component. Likewise, citation network analysis of physics researchers [39] shows important collaboration interaction, with 74% of the papers having ten or fewer citations and an average of 14.6 citation links per paper.
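As an illustration of the giant-component statistic mentioned above, the following minimal sketch (not the authors' code) measures the share of authors belonging to the largest connected component of a collaboration graph with the networkx library; the author identifiers and edges are invented examples.

import networkx as nx

# Toy coauthorship/citation graph between author identifiers (hypothetical data).
G = nx.Graph()
G.add_edges_from([
    ("a1", "a2"), ("a2", "a3"), ("a3", "a1"),  # a small connected group
    ("a4", "a5"),                              # an isolated pair
])

# Fraction of authors in the largest connected component of the network.
giant = max(nx.connected_components(G), key=len)
print(len(giant) / G.number_of_nodes())  # -> 0.6 for this toy graph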
On the basis of the coauthorship network, a set of indicators has been introduced by bibliometrics to quantify scientific collaboration, such as the coauthorship index, which represents the average number of coauthors per paper, and the collaboration rate, which calculates the proportion of papers resulting from collaboration between two groups of scientists. While most of these measures generally rank institutions and research groups, it has also been proposed that authors be ranked according to their scientific quality by applying the PageRank algorithm on the coauthorship network [40]. We note that coauthorship is considered either as a symmetric association that equally involves the collaborators [41] or as an asymmetric association that depends on the frequency and the exclusivity of the coauthorship between each couple of authors [40]. Scientific collaboration is also quantified based on the citation links, which represent indicators of scientific excellence. For this purpose, the H-index [42] measures the scientific quality of an author based on his most cited papers and the number of citations that he has received. The approach presented in [38] proposes to weight the citation links depending on the number of coauthors, and then to rank authors by disseminating scientific credits on the network through the citation links so as to simulate knowledge redistribution between scientists. However, the citation feature alone is not sufficient to estimate paper relevance, considering that the probability of reaching an article depends on its publication time. The citation count has therefore been extended to rank papers according to their age and their expected citations over time [43, 44]. Accordingly, CiteRank [45] takes the publication time into account to rank articles and defines a random walk to predict the number of expected citations.
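As a small illustration of the H-index [42] mentioned above, the sketch below computes it from a list of per-paper citation counts; the counts are invented for the example and the function name is ours.

def h_index(citation_counts):
    # Largest h such that at least h papers have h or more citations each.
    counts = sorted(citation_counts, reverse=True)
    h = 0
    for rank, c in enumerate(counts, start=1):
        if c >= rank:
            h = rank
    return h

print(h_index([12, 9, 7, 3, 2, 1]))  # -> 3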
6.3.2
On Applying Social Networks for Literature Retrieval and Scientific Collaboration
Information retrieval within bibliographic resources differs from other usage contexts by a specific information need targeting high scientific quality of the retrieved documents in addition to query similarity. Therefore, scientific indexes and academic digital libraries, from the introduction of SHEPARD'S CITATIONS (1873) to the more recent GOOGLE SCHOLAR (http://scholar.google.com, 2004), have addressed one common issue: evaluating the scientific quality of bibliographic resources. To tackle this problem, literature access systems rank documents using bibliometric measures such as the citation index. Other approaches apply hyperlink analysis algorithms to the citation network of documents and rank resources depending on their authority computed by the HITS and PageRank algorithms [46, 47]. In fact, the quality of a document is related to its authors, since documents and authors are inseparable entities and represent each other. On the basis of this idea,
recent approaches in literature access have drawn on the social context of bibliographic resources and bind their quality to the importance of the corresponding authors in scientists' social networks. The social network of authors is a generalized representation of the coauthorship and citation network that possibly integrates additional entities and interactions. Unlike early work that models the social network of authors by using coauthorship associations only [41, 48], recent works include documents as information nodes in the social network and align entities into document and author layers, with possible associations connecting nodes from different layers [18]. These models extract social relationships from interactions involving document nodes, such as collaboration, publication, and citation. The social importance of authors is evaluated by a set of measures introduced by the domains of both social network analysis [49] and hyperlink analysis [46, 47]. In the context of scientific publications, the Betweenness measure is considered an indicator of interdisciplinarity and highlights authors connecting dispersed partitions of the scientific community. The Closeness measure, based on the shortest path in the graph, reflects the reachability and independence of an author in his social neighborhood. The PageRank measure and the Authority score computed by the HITS algorithm distinguish the authoritative resources in the social network. By contrast, the Hub score computed by the HITS algorithm identifies authors who have an important social activity and rely on authoritative resources; these authors are called Centrals. With the introduction of academic social networks on the Web [e.g., CITEULIKE (see Footnote 3) and ACADEMIA (see Footnote 4)], the importance of a scientific paper is inferred not only from its production context but also through its consumption context. The social network of bibliographic resources is extended so as to include more social entities interacting in the social producing and consuming context of the document. It includes all the actors and the data that help to estimate the social relevance of documents. In fact, actors represent information producers (authors) and information consumers (users), whereas data cover documents and social annotations (tags, ratings, reviews). Accordingly, actors become information nodes collaborating to produce documents and interacting to provide social annotations. In this context, the importance of a scientific paper is estimated by the social importance of the related actors, as well as by their popularity and received tags [21].
6.4
Research Objectives and Contributions
Our focus in this chapter is on the formalization of a social information retrieval, gathering several entity types that share and exchange information. More specifically, we instantiate the generic model within a scientific community and show how to model scientific information retrieval embedded within authors, users, and taggers.
3 http://www.citeulike.org
4 http://www.academia.edu/
Unlike related work, this model has several new features. First, the social information network includes new entities corresponding to users and social annotations, in addition to the document and author nodes considered in [18, 50]. This helps to estimate document relevance based on both the social production and the social consumption context of documents. Second, we include citation links as social interactions between the authors of scientific papers, thus enriching their mutual associations, which were previously based on coauthor relationships only [41, 51]. Finally, we define a weighting model for the edges connecting social entities, in contrast to the approaches presented in [41, 50], which model bibliographic resources using a binary network model. Specifically, weights are assigned to coauthorship, citation, and authorship edges to evaluate influence, knowledge transfer, and shared interest between authors.
6.5
A Social Retrieval Model for Literature Access
This section presents our novel retrieval approach for literature access [52] based on social network analysis. In fact, we investigate a social model where authors represent the main entities and relationships are extracted from coauthor and citation links. Moreover, we define a weighting model for social relationships which takes into account the authors’ positions in the social network and their mutual collaborations. Assigned weights express influence, knowledge transfer, and shared interest between authors. Furthermore, we estimate document relevance by combining the document-query similarity and the document social importance derived from corresponding authors. To evaluate the effectiveness of our model, we conduct a series of experiments on a scientific document dataset that includes textual content and social data extracted from the academic social network CITEULIKE.
6.5.1
The Social Information Retrieval Model
An information retrieval model is a theoretical support that aims at representing documents and queries and at measuring their similarity, viewed as relevance. Formally, and based on the representation introduced in [53], the social information retrieval model can be represented by a quintuple [D, Q, G, F, R(qi, dj, G)], where D is the set of documents, Q represents the set of queries, G is the social information network, F represents the modeling process of documents and queries, and R(qi, dj, G) is the ranking function, which includes various social relevance features and takes into account the topology of the social information network G. This function can be defined by combining a subset of the following factors: the topical relevance, the social importance of actors, the social distance, the popularity, the freshness, and the incoming links and tags [21].
Fig. 6.1 The social information network, connecting documents, information producers (authors), information consumers (users), and social annotations

The social information network G represents the
social entities that interact in the social producing and consuming context of documents. As illustrated in Fig. 6.1, the social information network G includes all actors and the data that help to estimate the social relevance of documents. In fact, actors represent information producers (authors) and information consumers (users), whereas data cover documents and social annotations (tags, rating, reviews). Accordingly, actors become information nodes collaborating to produce documents and interacting to provide social annotations. Within this generic model, we present in what follows the qualitative and quantitative components of the social retrieval model.
6.5.1.1
The Qualitative Model Component
The social information model can be represented by a graph G = (V, E), where the nodes V = A ∪ U ∪ D ∪ T denote social entities, with A, U, D, and T corresponding to authors, users, documents, and social annotations, respectively. The set of edges E ⊆ V × V represents social relationships connecting the various node types (authorship, coauthorship, friendship, citation, annotation, etc.). Unlike friends-of-friends social applications such as FACEBOOK (see Footnote 5) and MYSPACE (see Footnote 6), users of academic social networks [e.g., CITEULIKE (see Footnote 3) and ACADEMIA (see Footnote 4)] produce and consume information through additional descriptors for bibliographic resources. Therefore, specific relationships connecting the social entities involved (documents, authors, users, and tags) are defined. In general, we identify the following social relationships:
5
Authorship: connects an author ai 2 A with his authored document dj 2 D. Reference: connects a document di 2 D with its referenced documents. Coauthorship: connects two authors ai, aj 2 A having produced one common document at least.
5 http://www.facebook.com
6 http://www.myspace.com
l l
143
Citation: connects two authors ai, aj 2 A with author aj is cited by ai at least once through his documents. Bookmarking: connects a user ui 2 U and his bookmarked document dj 2 D. Annotation: connects a document di 2 D with a tag tj 2 T assigned at least once to describe its content. Tagging: connects a user ui 2 U and a tag tj 2 T as he uses it at least once to bookmark a document. Friendship: connects users ui, uj 2 U if either they have a direct personal relationship or they join the same group.
The social entities included in the social information network for literature access can be represented using the graph notation illustrated in Fig. 6.2.
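A minimal sketch (not the authors' implementation) of how such a typed social graph could be assembled with the networkx library; all node identifiers and edges below are hypothetical.

import networkx as nx

# Directed multigraph: nodes carry a "kind" (author, user, document, tag)
# and edges carry a "rel" attribute naming the social relationship.
G = nx.MultiDiGraph()
G.add_node("a1", kind="author")
G.add_node("a2", kind="author")
G.add_node("d1", kind="document")
G.add_node("u1", kind="user")
G.add_node("t1", kind="tag")

G.add_edge("a1", "d1", rel="authorship")
G.add_edge("a2", "d1", rel="authorship")
G.add_edge("a1", "a2", rel="coauthorship")
G.add_edge("a1", "a2", rel="citation")      # a1 cites a2
G.add_edge("u1", "d1", rel="bookmarking")
G.add_edge("d1", "t1", rel="annotation")
G.add_edge("u1", "t1", rel="tagging")

# Example: list the author nodes of the network.
authors = [n for n, data in G.nodes(data=True) if data["kind"] == "author"]
print(authors)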
6.5.1.2
The Quantitative Model Component
As stated above, the edges connecting social nodes may express different kinds of social relationships and significantly optimize the exploration of the social network. In fact, if we explore the social neighborhood (closely reachable nodes) of a social entity, weights help to select the jumping nodes. Through our model, we investigate a weighting model for author-to-author relationships e(a_i, a_j) ∈ (A × A) and author-to-document relationships e(a_i, d_j) ∈ (A × D).
Coauthorship. This social relationship is represented by an undirected edge connecting two authors who have collaborated to produce a document. Coauthors often have direct personal relationships; moreover, multiple collaborations reflect their similarity and shared interest. In fact, scientific authors tend to exchange knowledge and diversify their collaborations. For this reason, we propose to normalize the weights by the total number of collaborations involving the pair of authors. The coauthorship edges are weighted as follows:

\mathrm{Co}(i, j) = \frac{2 A(i, j)}{A(i) + A(j)}    (6.1)
Fig. 6.2 The social entities included in the social information network for literature access, represented using a graph notation (node types: document, author, user, tag; edge types: co-authorship, authorship, citation, reference, bookmarking, tagging, annotation, friendship)
where A(i) is the number of documents authored by a_i and A(i, j) represents the number of documents coauthored by a_i and a_j.
Citation. This social relationship is represented by a directed edge connecting an author with his cited authors. An author who usually cites a second author is influenced by his opinions and will eventually discuss similar subjects. Therefore, the citation links express knowledge transfer between authors. To evaluate the strength of a citation relationship, we propose to take into consideration the citation frequency as well as the total number of announced citations. Citation relationships are weighted as follows:

\mathrm{Ci}(i, j) = \frac{C(i, j)}{C(i)}    (6.2)
where C(i) is the number of citations announced by author a_i and C(i, j) represents the number of times author a_i cites a_j.
Authorship. This social relationship is represented by a directed edge connecting an author with his authored documents. The strength of the authorship association is viewed as the author's affiliation with the topic of the document. We note that an author is more affiliated with a topic if he frequently addresses it in his published papers. Therefore, a coauthor will be more strongly associated with a document d discussing a topic S than his coauthors if he has published more documents on this topic. In order to estimate the knowledge and the experience of a coauthor a_k ∈ A on the topic of document d, we propose to compare the quantity of information he has imported via his other publications. From the information producer's point of view, this can be measured by the information entropy H_d^k(t_i) of the tags assigned to the subset of the coauthor's publications, noted A_k. We consider as a random variable each tag t_i ∈ T_d assigned to document d for which there exists an edge e(t_i, d_j) ∈ (T × D), and we calculate its probability distribution Pr^k(t_i) among the subcollection of documents A = A_1 ∪ ... ∪ A_m published by the m coauthors of document d. We propose to normalize the information entropy values by the number of tags associated with the document, noted T_d. Meanwhile, a coauthor with a single publication in the collection has H_d^k(t_i) = 0 and would obtain a higher weight value w(a_i, d_j) = 1 than his coauthors with many more publications on the topic of the document. We therefore propose to assign a default weight value to authors having a unique document in the dataset and to take into consideration the number of publications per author. The final authorship weight w(a_k, d) is computed as follows:
w(a_k, d) = 1 - \frac{1}{|T_d|}\, H_d^k(t_i) - \frac{1}{|A_k|}\, \theta    (6.3)

where

H_d^k(t_i) = - \sum_{t_i \in T_d} \Pr^k(t_i) \log \Pr^k(t_i)    (6.4)

and

\Pr^k(t_i) = 0.5\, \frac{tf(t_i, A_k)}{tf(t_i, A)} + 0.5    (6.5)
where tf(t_i, A_k) is the frequency of tag t_i in the subset of author a_k's documents (A_k) and tf(t_i, A) represents the tag frequency in the subcollection of the coauthors' documents (A). In order to obtain ascending values of entropy, we scale the tag probability into the interval [0.5, 1]. We note that 1 − θ is the default weight value attributed to authors having a single document in the collection. Some social network analysis algorithms do not support multiple edges with the same direction between two nodes. Thus, we propose to combine the coauthorship and citation weights as follows:

w(i, j) = \frac{1}{4}\, (1 + \mathrm{Co}(i, j))\, (1 + \mathrm{Ci}(i, j))    (6.6)
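To make the weighting concrete, the following sketch evaluates Eqs. (6.1), (6.2), and (6.6) for a pair of authors; the publication and citation counts are invented, and the function names are ours, not the authors'. The citation weight is normalized by the citing author's announced citations, following the definition given in the text.

def coauthorship_weight(a_i_docs, a_j_docs, coauthored_docs):
    # Eq. (6.1): Co(i, j) = 2 A(i, j) / (A(i) + A(j))
    return 2.0 * coauthored_docs / (a_i_docs + a_j_docs)

def citation_weight(cites_i_to_j, announced_by_i):
    # Eq. (6.2): Ci(i, j) = C(i, j) / C(i)
    return cites_i_to_j / announced_by_i

def combined_weight(co, ci):
    # Eq. (6.6): w(i, j) = 1/4 (1 + Co(i, j)) (1 + Ci(i, j))
    return 0.25 * (1.0 + co) * (1.0 + ci)

co = coauthorship_weight(a_i_docs=10, a_j_docs=6, coauthored_docs=4)  # 0.5
ci = citation_weight(cites_i_to_j=3, announced_by_i=30)               # 0.1
print(combined_weight(co, ci))                                        # 0.4125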
6.5.2
Computing Social Relevance
The objective of document relevance estimation within the social network is to derive a more accurate response for the user by combining the topical relevance of document d and the importance of associated authors in the social network. Accordingly, we aim to select the social importance measures that identify central
authors in the social network of bibliographic resources. Therefore, we compute for each author a social importance score C_G(a_i) using one of the following importance measures: the Betweenness, the Closeness, the PageRank, the HITS Authority score, and the HITS Hub score. We apply these importance measures only on the subgraph of authors G_a = (A, E_a), where E_a ⊆ (A × A). Edges in G_a denote either coauthorship or citation links and are weighted as described previously. Afterward, a social importance score is transposed from authors to documents using a weighted sum aggregation as follows:

\mathrm{Imp}_G(d) = \sum_{i=1}^{k} w(a_i, d)\, C_G(a_i)    (6.7)
The social score Imp_G(d) of a document estimates its social relevance. Nevertheless, it is not sufficient on its own to retrieve relevant documents from the collection. Therefore, we combine the Imp_G(d) score with a traditional information retrieval metric such as a TF-IDF score, as follows:

\mathrm{Rel}(d) = \alpha\, \mathrm{RSV}(q, d) + (1 - \alpha)\, \mathrm{Imp}_G(d)    (6.8)

where α ∈ [0, 1] is a weighting parameter, RSV(q, d) is a normalized similarity measure between query q and document d, and Imp_G(d) is the importance of document d in the social network G.
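A minimal sketch of Eqs. (6.7) and (6.8): the authorship weights, author importance scores, and RSV value below are illustrative placeholders, not values from the chapter's experiments.

def document_social_score(authorship_weights, author_importance):
    # Eq. (6.7): Imp_G(d) = sum_i w(a_i, d) * C_G(a_i)
    return sum(w * author_importance[a] for a, w in authorship_weights.items())

def relevance(rsv, imp_g, alpha=0.5):
    # Eq. (6.8): Rel(d) = alpha * RSV(q, d) + (1 - alpha) * Imp_G(d)
    return alpha * rsv + (1.0 - alpha) * imp_g

w_ad = {"a1": 0.7, "a2": 0.3}           # hypothetical authorship weights for document d
c_g = {"a1": 0.04, "a2": 0.01}          # hypothetical social importance of the authors
imp = document_social_score(w_ad, c_g)  # 0.031
print(relevance(rsv=0.6, imp_g=imp, alpha=0.5))  # 0.3155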
6.5.3
Experimental Evaluation
In order to evaluate the effectiveness of our model, we conduct a series of experiments on a dataset of scientific documents published at the ACM SIGIR conference from 1978 to 2008. The main evaluation objectives are to:
1. Compare different importance measures with both binary and weighted social network models to estimate the importance of scientific papers.
2. Evaluate the effectiveness of our model compared with traditional information retrieval models and other closely related retrieval models.
6.5.3.1
Experimental Datasets and Design
We used for our experiments the SIGIR dataset that contains information about authors and citation links in addition to the textual content of publications. We included in the social network all authors having published at least one paper for the ACM SIGIR conference. Two authors are associated with a social relationship if they either coauthored a SIGIR publication or one of them cites the other author through his SIGIR paper.
To enrich this dataset, we gathered data about information consumers and social interactions from the academic social network CITEULIKE. We collected all social bookmarks targeting the SIGIR publications and extracted the related tags and corresponding users. The following paragraphs describe the dataset characteristics and evaluation measures:
Social network statistics. The SIGIR dataset includes 2,871 authors, with an average of 2 coauthorships and 16 citation links per author. As shown in Table 6.1, the citation relationships dominate the social network, with nine times as many links as the coauthorship associations. In fact, the inclusion of citation links restructures small and dispersed components into larger author communities. Consequently, the giant component connecting the majority of author nodes is enlarged by the citation relationships to include 84% of authors, as shown in Fig. 6.3.
Queries and relevance assumption. Tags are user-generated keywords used to annotate document content. They help a user to index a document from his own perspective and consequently correspond to later information needs which may possibly be satisfied by this document. Unlike terms automatically extracted from textual content, tags seem more convenient for representing queries, since both are user-generated terms expressing information needs. Thus, we propose to use the tags assigned to the SIGIR publications as representative queries in our experiments. We assume that popular tags are more important in the social context. Thus, we select as queries the most frequent tags assigned to the SIGIR publications, and then we build the ground truth through the following steps:
1. We select as initial queries the top 100 tags sorted by the total number of bookmarks targeting the SIGIR publications (popular tags).
2. We remove personal and empty tags such as "to read" and "sigir."
3. We regroup similar tags with different forms, like "language model" and "language modeling."
Table 6.1 Social network properties of the SIGIR dataset
Authors                              2,871
Coauthorships                        5,047
Citation links                      45,880
Coauthorship and/or citation links  52,516
Fig. 6.3 The giant component of the SIGIR social network (A: co-authorship network; C: citation network; AC: co-authorship and/or citation network)
4. For each query, we collect documents bookmarked at least once by the corresponding tag or its similar forms. 5. From the previous list of documents corresponding to a query tag, we select only the documents having the query tag among their three top assigned tags. The final document set corresponds to the query relevant documents. 6. We remove the query tag if no relevant document is found.
We retain for experimentation the top 25 queries and their corresponding relevant documents. The final collection includes 512 relevant documents, with an average of 20 relevant documents per query. To index the dataset, we used the open-source information retrieval library APACHE LUCENE (http://lucene.apache.org/), which is based on a modified scoring function of the vector-space model described in [54].
Evaluation measures. In order to compare the social importance measures and evaluate our model's performance, we use recall and precision. Users are commonly interested in the top results; therefore, we study precision at 0.1 and 0.2 points of recall. With an average of 800 retrieved documents per query, these recall points correspond to the first 160 documents.
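The sketch below shows one simple, non-interpolated way to compute precision at a given recall point over a ranked result list; it is an illustration under our own simplifying assumptions rather than the exact evaluation procedure used in the experiments, and the ranking and ground truth are invented.

def precision_at_recall(ranking, relevant, recall_point):
    # Precision measured at the earliest rank where the requested
    # fraction of the relevant documents has been retrieved.
    relevant = set(relevant)
    needed = recall_point * len(relevant)
    found = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            found += 1
            if found >= needed:
                return found / rank
    return 0.0

ranking = ["d3", "d7", "d1", "d9", "d2"]   # hypothetical ranked list
relevant = ["d1", "d2", "d5", "d8"]        # hypothetical ground truth
print(precision_at_recall(ranking, relevant, 0.2))  # first relevant at rank 3 -> 0.333...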
6.5.3.2
Comparison of Social Importance Measures
The social importance measures highlight key entities in the social network and include measures introduced by the domains of both social network analysis [49] and hyperlink analysis [46, 47]. These measures have multiple semantics that vary from one social application to another. In the context of scientific publications, the Betweenness measure is considered an indicator of interdisciplinarity and highlights authors connecting dispersed sectors of the scientific community. The Closeness measure, based on the shortest path in the graph, reflects the reachability and independence of an author in his social neighborhood. The PageRank measure and the Authority score computed by the HITS algorithm distinguish the authoritative resources in the social network. By contrast, the Hub score computed by the HITS algorithm identifies authors who have an important social activity and rely on authoritative resources; these authors are called Centrals. We applied these social importance measures on both a binary and a weighted model of the social network. We denote by W-Betweenness the application of the Betweenness measure on the weighted model of the social network, and we use the same notation for the rest of the social importance measures. Table 6.2 presents comparative effectiveness results of the different importance measures for both the binary and the weighted models of the social network. These results are obtained using only the social importance score of documents, by setting α = 0 in (6.8). We note that the Hub measure ranks scientific papers best for both the binary and the weighted models of the social network. We conclude that the importance of scientific publications can be estimated from the Centrality of their authors.
Table 6.2 Comparison of social importance measures

                [email protected]  [email protected]                  [email protected]  [email protected]
Betweenness     0.0363  0.0363  W-Betweenness  0.0374  0.0398
Closeness       0.0232  0.0191  W-Closeness    0.0214  0.0189
PageRank        0.0324  0.0299  W-PageRank     0.0225  0.0199
Authority       0.0389  0.0411  W-Authority    0.0398  0.0423
Hub             0.0516  0.0430  W-Hub          0.0516  0.0433
The weighted model slightly improves the retrieval precision for most social importance measures. This is confirmed by the values obtained with the W-Hub, W-Authority, and W-Betweenness measures, which exceed those of their analogous measures applied to the binary social network. Therefore, we conclude that the properties expressed through the weights on social relationships, including shared interests, influence, and knowledge transfer, can better identify Central authors and hence better estimate the relevance of bibliographic resources. For all the social importance measures, the precisions [email protected] and [email protected] do not exceed 60% of those of the traditional information retrieval model based on the TF*IDF metric, which obtains [email protected] = 0.08 and [email protected] = 0.0786. Therefore, the social importance measures alone are not able to rank the results without taking into account the similarity between document and query. In the remaining experiments, we retained the W-Hub measure, as it is the best measure expressing the social importance of bibliographic resources.
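For illustration, the sketch below computes several of the importance measures discussed above on a tiny weighted author graph with networkx; the edges and weights are invented. Note two caveats: whether HITS honors edge weights depends on the networkx version, and betweenness centrality interprets weights as distances, so strong ties may need to be inverted (e.g. 1/w) before the call; treat this as an approximation, not the chapter's exact setup.

import networkx as nx

Ga = nx.DiGraph()
# Hypothetical author-to-author edges carrying the combined weight w(i, j).
Ga.add_edge("a1", "a2", weight=0.45)
Ga.add_edge("a2", "a1", weight=0.30)
Ga.add_edge("a1", "a3", weight=0.25)
Ga.add_edge("a3", "a2", weight=0.40)

w_pagerank = nx.pagerank(Ga, weight="weight")            # W-PageRank
w_betweenness = nx.betweenness_centrality(Ga, weight="weight")  # W-Betweenness (weights as distances)
hubs, authorities = nx.hits(Ga, max_iter=500)            # Hub / Authority scores

print(max(hubs, key=hubs.get))  # author with the highest Hub score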
6.5.3.3
Evaluation of Our Model Effectiveness
In order to evaluate the effectiveness of our model, we first select the best tuning parameter α, and then we compare the retrieval performance with that of similar retrieval systems.
Tuning the Parameter α
We studied the impact of the parameter α on the retrieval process [see (6.8)]. We note that if α = 0, only the social relevance is taken into account; moreover, α = 1 corresponds to the TF*IDF baseline, since only the topical relevance is then considered to rank documents. We note in Fig. 6.4 a significant improvement in performance following the integration of the topical relevance, for values of α over 0.4. Analyzing the precisions [email protected] and [email protected] as a function of the parameter α, we note that the curves present peaks whose values exceed those obtained when only one type of relevance is taken into consideration. Hence, the combination of the two scores can effectively improve the final ranking of documents. The best values of the parameter α are obtained between 0.5 and 0.6.
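A toy sketch of the α sweep behind Fig. 6.4: the evaluate function passed in is hypothetical (it would run the queries with Eq. (6.8) and return [email protected] for a given α) and is not part of the chapter.

def sweep_alpha(evaluate, steps=10):
    # Try alpha = 0.0, 0.1, ..., 1.0 and keep the best-performing value.
    best_alpha, best_score = None, float("-inf")
    for i in range(steps + 1):
        alpha = i / steps
        score = evaluate(alpha)  # hypothetical helper: returns [email protected] for this alpha
        if score > best_score:
            best_alpha, best_score = alpha, score
    return best_alpha, best_score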
Fig. 6.4 Tuning the α parameter ([email protected] and [email protected] as a function of α)
Evaluating Search Performance
We compare our model with three baselines, detailed as follows:
TF*IDF model: denotes a traditional information retrieval system implemented with APACHE LUCENE, based on the TF*IDF metric and using the SnowBall stemming algorithm. We used this retrieval system with the same configuration as in our model to select documents and compute their topical relevance.
PR-Docs model: denotes a retrieval system that estimates the importance of documents based on their authority. It combines the topical relevance and the PageRank score of documents computed on the document graph, where edges represent citation links. The final document relevance is computed as follows:

\mathrm{Rel}(d) = \alpha\, \mathrm{RSV}(q, d) + (1 - \alpha)\, \mathrm{PageRank}_{docs}(d)    (6.9)

We note that the topical relevance RSV(q, d) is computed using the first baseline, TF*IDF. We studied the impact of the α parameter on the search effectiveness and note that the best retrieval precisions are obtained with α = 0.7 for [email protected] and α = 0.3 for [email protected], as shown in Fig. 6.5.
Kirsch's model: denotes the social information retrieval model introduced in [51], which represents authors using a binary coauthorship network and computes their social importance score using the PageRank measure. This model combines the topical relevance and the social relevance as follows:

\mathrm{Rel}(d) = \mathrm{RSV}(q, d) \cdot r_d    (6.10)

where r_d is the social relevance of document d, computed as the sum of its authors' PageRank scores.
Figure 6.6 compares the results obtained by the different baselines and by our social model tuned with α = 0.5 and α = 0.6, noted SM0.5 and SM0.6, respectively.
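A minimal sketch of the PR-Docs baseline of Eq. (6.9): PageRank is computed on a document citation graph and mixed with the topical score. The citation edges, RSV value, and function name are invented for the example.

import networkx as nx

D = nx.DiGraph()
# Hypothetical document-to-document citation edges (d1 cites d2, ...).
D.add_edges_from([("d1", "d2"), ("d3", "d2"), ("d3", "d1")])
pagerank_docs = nx.pagerank(D)

def pr_docs_relevance(rsv, doc, alpha=0.7):
    # Eq. (6.9): Rel(d) = alpha * RSV(q, d) + (1 - alpha) * PageRank_docs(d)
    return alpha * rsv + (1.0 - alpha) * pagerank_docs[doc]

print(pr_docs_relevance(rsv=0.5, doc="d2"))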
Fig. 6.5 Tuning the α parameter for the PR-Docs model ([email protected] and [email protected] as a function of α)
Fig. 6.6 Evaluation of the retrieval effectiveness ([email protected] and [email protected] for TF*IDF, PR-Docs, Kirsch, SM0.5, and SM0.6)
We note that the best values of the parameter α lead to an improvement in favor of our model of between 15 and 55% compared with the TF*IDF baseline. Therefore, we confirm that integrating the social relevance of a document can significantly improve the retrieval effectiveness. Comparing our model with the best obtained values of the PR-Docs model, we note an improvement of 45% for [email protected]. Therefore, we conclude that the W-Hub measure computed on the social network of authors expresses the social importance of scientific papers better than prior measures based on the citation graph. Comparing our model with Kirsch's model, we note an improvement of 14%, which confirms the impact on the retrieval performance of including the citation links and weighting the social network edges. In summary, the results show a low overall performance for the retrieval systems used for comparison. In fact, we used tags for the experimental evaluation; these are user-generated terms and may not be present in the document content. Therefore, only a few relevant documents can be retrieved, which explains the low precision of the proposed model. Furthermore, the results depend on the content-based retrieval model used to compute the topical relevance of documents, and its performance directly affects the extended models. The main objective of the previous experiments was to improve content-based ranking by including the social importance of documents, and this is achieved with a significant improvement of 55% compared with the TF*IDF baseline.
6.6
Conclusion
Although there have been several studies of scientific literature access, little has been published about the use of social metadata to enhance the effectiveness of the underlying process. From the social perspective, several factors of author’s collaboration have to be considered: citation, coauthoring, co-citation, bibliographic coupling, bookmarking, etc. Measuring aspects of social behavior in science is still challenging. In this chapter, we focused on citation and coauthor analysis as sources of evidence to fine-tune document relevance and highlighted some implications of social relation weighting. More precisely, what constitutes strong collaboration is derived from the shared topics of authors through coauthored publications. Furthermore, social document relevance is impacted by the social importance of the authors according to their social position in the scientific network. An experimental case study shows that our model is effective. However, at this early stage, there is no means for systematically establishing the right balance between effectiveness and efficiency of a retrieval model of scientific information from the social perspective. Thus, the future of social information retrieval research within a scientific community is promising.
References 1. Robertson, S.E., Hancock-Beaulieu, M.M.: On the evaluation of IR systems. Inform. Process. Manage. 28(4), 457–466 (1992) 2. Salton, G.: The SMART Information Retrieval System. Prentice-Hall, Englewood Cliffs (1971) 3. Ingwersen, P., Jarvelin, K.: The Turn: Integration of Information Seeking and Information Retrieval in Context. Springer, Dordrecht (2005) 4. Allan, J. Hard track overview in TREC 2003 high accuracy retrieval from documents. In: Proceedings of the 12th Text Retrieval Conference (TREC-12), National Institute of Standards and Technology, NIST special publication, pp. 24–37 (2003) 5. Ryan, N., Pascoe, J., Morse, D.: Enhanced reality fieldwork: the context-aware archaeological assistant. In: Gaffney, V., van Leusen, M., Exxon, S. (eds.) Computer Applications in Archeology, Tempus Reparatum, Oxford (1997) 6. Schilit, B., Adams, N., Want, R.: Context-aware computing applications. In: Proceedings of the Workshop on Mobile Computing Systems and Applications, IEEE Computer Society, Santa Cruz, CA, pp. 85–90 (1994) 7. Crestani, F., Ruthven, I.: Introduction to special issue on contextual information retrieval systems. Inf. Retr. 10, 829–847 (2007) 8. Tamine, L., Boughanem, M., Daoud, M.: Evaluation of contextual information retrieval: overview of research and issues. Knowl. Inf. Syst. 24(1), 1–34 (2010) 9. Vieira, V., Tedesco, P., Salgado, A.C., Bre´zillon, P.: Investigating the specifics of contextual elements management: the cemantika approach. In: CONTEXT, pp. 493–506 (2007) 10. Tamine-Lechani, L., Boughanem, M., Zemirli, N.: Personalized document ranking: exploiting evidence from multiple user interests for profiling and retrieval. J. Digit. Inf. Manage. 6(5), 354–365 (2008)
11. Joachims, T., Granka, L., Hembrooke, H., Radlinski, F., Gay, G.: Evaluating the accuracy of implicit feedback from clicks and query reformulations in web search. ACM Trans. Inform. Syst. 25(2), 7 (2007) 12. Teevan, J., Dumais, S.: Personalizing search via automated analysis of interests and activities. In: Proceedings of the 28th International SIGIR Conference on Research and Development in Information Retrieval, pp. 449–456 (2005) 13. Sieg, A., Mobasher, B., Burke, R.: Users information context: Integrating user profiles and concept hierarchies. In: Proceedings of the 2004 Meeting of the International Federation of Classification Societies, Vol. 1, pp. 28–40 (2004) 14. Speretta, M., Gauch, S.: Personalized search based on user search histories. In: Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence, pp. 622–628 (2005) 15. Lee, J., Hu, X., Downie, J.: Qa websites: rich research resources for contextualizing information retrieval behaviors. In: Proceedings of the 28th International SIGIR Conference on Research and Development in Information Retrieval, Workshop on Information Retrieval in Context, pp. 33–366 (2005) 16. Kelly, D., Belkin, N.: Display time as implicit feedback: Understanding task effects. In: Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 377–384 (2004) 17. Hattori, S., Tezuka, T., Tanaka, K.: Context-Aware Query Refinement for Mobile Web Search. In: SAINT-W’07: Proceedings of the 2007 International Symposium on Applications and the Internet Workshops, IEEE Computer Society, Washington, DC, USA, p. 15 (2007) 18. Korfiatis, N.T., Poulos, M., Bokos, G.: Evaluating authoritative sources using social networks: an insight from Wikipedia. Online Inform. Rev. 30(3), 252–262 (2006) 19. Stanoevska-Slabeva, L., Nicolai, K., Fleck, T.: Using social network analysis to enhance information retrieval systems. In: Proceedings of Social Network Applications Conference (2008) 20. Kleinberg, J.: The convergence of social and technological networks. Commun. ACM 51(11), 66–72 (2008) 21. Amer-Yahia, S., Benedikt, M., Bohannon, P.: Challenges in searching online communities. IEEE Data Eng. Bull. 30(2), 23–31 (2007) 22. Li, J., Cheung, W.K., Liu, J., Li, C.H.: On discovering community trends in social networks. In: WI-IAT’09: Proceedings of the 2009 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology, IEEE Computer Society, Washington, DC, USA, pp. 230–237 (2009) 23. Spertus, E., Sahami, M., Buyukkokten, O.: Evaluating similarity measures: a large-scale study in the Orkut social network. In: KDD’05: Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery in Data mining, ACM, New York, NY, USA, pp. 678–684 (2005) 24. Jamali, M., Ester, M.: Using a trust network to improve top-n recommendation. In: RecSys’09: Proceedings of the Third ACM Conference on Recommender Systems, ACM, New York, NY, USA, pp. 181–188 (2009) 25. Ma, H., Lyu, M.R., King, I.: Learning to recommend with trust and distrust relationships. In: RecSys’09: Proceedings of the Third ACM Conference on Recommender Systems, ACM, New York, NY, USA, pp. 189–196 (2009) 26. Hu, M., Lim, E.-P., Sun, A., Lauw, H.W., Vuong, B.-Q.: Measuring article quality in Wikipedia: models and evaluation. In: CIKM’07: Proceedings of the 16th ACM Conference on Information and Knowledge Management, ACM, New York, NY, USA, pp. 243–252 (2007) 27. 
Wilkinson, D.M., Huberman, B.A.: Cooperation and quality in Wikipedia. In: WikiSym’07: Proceedings of the 2007 International Symposium on Wikis, ACM, New York, NY, USA, pp. 157–164 (2007) 28. Heymann, P., Ramage, D., Garcia-Molina, H.: Social tag prediction. In: SIGIR’08: Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, New York, NY, USA, pp. 531–538 (2008)
154
L. Tamine et al.
29. Schenkel, R., Crecelius, T., Kacimi, M., Michel, S., Neumann, T., Parreira, J.X., Weikum, G.: Efficient top-k querying over social-tagging networks. In: SIGIR’08: Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, New York, NY, USA, pp. 523–530 (2008) 30. Balog, K., Azzopardi, L., de Rijke, M.: Formal models for expert finding in enterprise corpora. In: SIGIR’06: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, New York, NY, USA, pp. 43–50 (2006) 31. Artiles, J., Sekine, S., Gonzalo, J.: Web people search: results of the first evaluation and the plan for the second. In: WWW’08: Proceeding of the 17th International Conference on World Wide Web, ACM, New York, NY, USA, pp. 1071–1072 (2008) 32. Bao, S., Xue, G., Wu, X., Yu, Y., Fei, B., Su, Z.: Optimizing web search using social annotations. In: WWW’07: Proceedings of the 16th International Conference on World Wide Web, ACM, New York, NY, USA, pp. 501–510 (2007) 33. Carmel, D., Zwerdling, N., Guy, I., Ofek-Koifman, S., Har’el, N., Ronen, I., Uziel, E., Yogev, S., Chernov, S.: Personalized social search based on the user’s social network. In: CIKM’09: Proceeding of the 18th ACM Conference on Information and Knowledge Management, ACM, New York, NY, USA, pp. 1227–1236 (2009) 34. Bordons, M., Gsmez, I.: Collaboration networks in science. In: Garfield, E., Cronin, B., Atkins, H.B. (eds.) The Web of Knowledge: A Festschrift in Honor of Eugene Garfield, pp. 197–213. Information Today, Medford (2000) 35. Newman, M.E.J.: Scientific collaboration networks. I. network construction and fundamental results. Phys. Rev. E 64(1), 016131 (2001) 36. Newman, M.E.J.: The structure of scientific collaboration networks. Proc. Natl Acad. Sci. USA. 98(2), 404–409 (2001) 37. Ding, Y., Yan, E., Frazho, A., Caverlee, J.: Pagerank for ranking authors in co-citation networks. J. Am. Soc. Inf. Sci. Technol. 60(11), 2229–2243 (2009) 38. Radicchi, F., Fortunato, S., Markines, B., Vespignani, A.: Diffusion of scientific credits and the ranking of scientists. Phys. Rev. E80, 056103 (2009) 39. Lehmann, S., Lautrup, B., Jackson, A.D.: Citation networks in high energy physics. Phys. Rev. E 68(2), 026113 (2003) 40. Xiaoming Liu, Johan Bollen, Michael L., Nelson, and Herbert Van de Sompel. 2005. Coauthorship networks in the digital library research community. Inf. Process. Manage. 41, 6 (December 2005), 1462-1480. DOI=10.1016/j.ipm.2005.03.012 http://dx.doi.org/10.1016/ j.ipm.2005.03.012 41. Mutschke, P.: Enhancing information retrieval in federated bibliographic data sources using author network based stratagems. In: Research and Advanced Technology for Digital Libraries: Proceedings of Fifth European Conference, ECDL 2001, Darmstadt, Germany, 4–9 September 2001 42. Hirsch, J.E.: An index to quantify an individual’s scientific research output. Proc. Natl Acad. Sci. U. S. A. 102(46), 16569–16572 (2005) 43. Hauff, C., Azzopardi, L.: Age dependent document priors in link structure analysis. In: ECIR, pp. 552–554 (2005) 44. Meij, E., de Rijke, M.: Using prior information derived from citations in literature search. In: RIAO (2007) 45. Walker, D., Xie, H., Yan, K.-K., Maslov, S.: Ranking scientific publications using a simple model of network traffic. J. Stat. Mech., P06010 (2007) 46. Langville, A.N., Meyer, C.D.: A survey of eigenvector methods for web information retrieval. SIAM Rev. 47(1), 135–161 (2005) 47. 
Page, L., Brin, S., Motwani, R., Winograd, T.: The pagerank citation ranking: bringing order to the web. Technical report, Stanford Digital Library Technologies Project (1998) 48. Yan, E., Ding, Y.: Applying centrality measures to impact analysis: a coauthorship network analysis. J. Am. Soc. Inf. Sci. Technol. 60(10), 2107–2118 (2009)
6 On Using Social Context to Model Information Retrieval and Collaboration
155
49. Wasserman, S., Faust, K., Iacobucci, D.: Social Network Analysis: Methods and Applications (Structural Analysis in the Social Sciences). Cambridge University Press, Cambridge (1994) 50. Kirchhoff, L., Stanoevska-Slabeva, K., Nicolai, T., Fleck, M.: Using social network analysis to enhance information retrieval systems. In: Applications of Social Network Analysis, ASNA, Zurich (12 Sept 2008) 51. Kirsch, S.M., Gnasa, M., Cremers, A.B.: Beyond the web: retrieval in social information spaces. In: Proceedings of the 28th European Conference on Information Retrieval, ECIR 2006. Springer (2006) 52. Tamine, L., Jabeur, A.B., Bahsoun, W.: An exploratory study on using social information networks for flexible literature access. In: FQAS, pp. 88–98 (2009) 53. Baeza-Yates, R., Ribeiro-Neto, B.: Modern Information Retrieval. ACM, New York (1999) 54. Hatcher, E., Gospodnetic, O.: Lucene in Action (In Action Series). Manning, Greenwich (2004)
Part III
Community-Built Databases Storage and Modelling
Chapter 7
Social Interaction in Databases
Antonio Badia
7.1 Introduction
The advent of the social web has made user-generated content a focus of research. Such content is already available in the form of tags, blogs, comments, etc., in many Web sites. Experience with social Web sites has shown that users tend to have a say about the data: they may interpret the same data in somewhat different ways, or may be able to add detail to it, enriching the existing contents. When users are able to enter information in a format-free manner, they tend to do so and may provide additional semantics to the data. The success of tagging clearly indicates that users are willing to provide content when this can be done in a manner that is natural to the users. Past research shows that the tags that users add to items in many Web sites are, as a whole, excellent descriptors of the contents of the items themselves and can be fruitfully used for several tasks, like improving search or clustering of the item set [1–3]. Overall, the amount of user-created content is growing at a fast rate. Databases would seem a perfect place for social interaction, since basically any database is created and designed with a group of users in mind; very few databases are for a single, particular user. However, in a database the interaction with the user is strictly regulated and constrained. The database is usually planned in a centralized manner by a person or small team of people who try to anticipate the needs of the user community and come up with a database design. In the context of relational databases, this includes the database schema, that is, the set of tables (and attributes per table) that will store the information. Access to the database is regulated in a centralized manner too, with a database administrator (DBA) in charge of giving users permissions. These permissions may be read permissions (user can see the data, but not change it) and write permissions (users can add, delete, or change data). Write permissions are usually restricted. Also, it is very infrequent that users are given permission to alter the database schema (as opposed to the data). Thus, the
A. Badia Computer Engineering and Computer Science Department, University of Louisville, Louisville, KY, USA e-mail: [email protected]
viewpoint that organizes the data is given to the users, who are usually unable to change it. In many cases, access to data is done through views, that is, the result of relational queries that display a selected part of the data for a given user or set of users. While views allow a certain degree of customization, full access to data is complicated by the fact that changes to views actually amount to changes to the database tables (since views do not have an existence of their own), and technical issues prevent many updates from being possible in relational views. As a result, databases accept very little in the way of user input. Hence, user-created content is currently being ignored in databases. In this chapter, we argue that database technology should and can be adapted to provide the needed capabilities to support user interaction, user communities, and the social dynamics that arise from them:

• Database technology should be used to support user interaction because databases tend to have communities of users (i.e., not a single user or a small group), so they are a perfect environment in which to enable social interaction. Furthermore, there are several ways in which database systems can benefit from user-created content: it can help interpret the data in the database, enrich it, and fill in gaps in (much needed, but rarely present) metadata. Also, by allowing users to store their own data in the database, we make them more likely to explore the data and, in general, to use the database for their tasks.
• Database technology can be used to support user interaction because the relational data model can be seen as a general platform on top of which flexible schemas can be developed, so that almost arbitrary content can be captured and stored.
There are clear benefits to adapting existing database technology for this end (as opposed to creating brand-new approaches), including:

• Reusing existing data repositories. Such repositories may provide a starting point for user communities to gather around.
• Reusing database technology to store large amounts of user-created data securely and efficiently.
• Allowing a smooth transition from traditional repositories of information to user-created content. This continuity should be an important point for many organizations that may not want to throw away years of information.
In this chapter, we overview the challenges to such goals posed by current relational technology and offer an initial set of solutions. In order to make the chapter self-contained, in the next section we introduce some basic definitions and establish some terminology. Next, in Sect. 7.3, we describe with some examples the kind of social interaction that we aim to support and show how in each case the interaction is not supported or ill-supported by current relational technology. Then in Sect. 7.4, we describe our general framework to enable social interaction in databases and show how capturing user-created data and metadata plays a fundamental role. Hence, in Sect. 7.5, we offer some initial solutions to the problems exposed in
Sect. 7.3. In this chapter, we focus on the challenge of capturing user-created data and metadata, storing it in the database, making it available for querying, and allowing controlled sharing among users. Since the idea of capturing user-created content in databases may be received with some skepticism, Sect. 7.6 discusses several issues that arise in giving users the possibility to generate content and add it to the database. This is a new situation, and hence the consequences of using the technology should be examined. Thus, we enumerate several possible objections to the goals of the framework presented and present our answer to them. Finally, Sect. 7.8 closes with some remarks on the ongoing development of the framework.
7.2 Basic Database Technology
A relational database can be seen as having two parts. The first one is a database schema, that is, a collection of distinct table names and, associated with each name, a table schema, a collection of attribute names which are assumed to be unique for each table. Given relation T, the schema of T is denoted sch(T); and given (relational) database D, the schema of D is denoted sch(D). The second component of the database is a database extension, that is, a set of tuples for each table in the database. Such a set of tuples is expected to change dynamically over time. Hence, the extension can be considered as a function from a linearly ordered set of time values into the database schema, which associates a set of tuples with each table in the schema. For table T, ext(T) denotes the extension of T at a fixed point in time. Most of our analysis is static, but there are clearly some concepts that could (and should) be extended to a dynamic viewpoint; this can be achieved by denoting the extension of T at time t with ext(T, t). For the rest of the chapter, though, we stick to the static view; we will say more about the dynamic view in Sect. 7.5.4. Thus, given database D, ext(D) = ⋃_{T ∈ sch(D)} ext(T). We assume that each table has a primary key defined, and that foreign keys are also defined as needed. We also assume that all tables in a database are connected through primary key–foreign key connections. In other words, the graph created by generating a node for each table, and adding an edge (t1, t2) between tables (nodes) t1, t2 if t1 has a foreign key referencing t2, is a connected graph. By a relational query language, we mean SQL or the Relational Algebra (henceforth RA) extended with a group-by operator and basic aggregates (at least, count). Given a database D, a relational query q can be seen as a function from D into a fixed relational schema, expressed in SQL or in RA. q(D) will denote the result of applying q to ext(D), that is, the answer to q in D.¹ For a given query q, rels(q) ⊆ sch(D) is the set of relation names used in q (that is, if q is an SQL query, the relations in the FROM clause, including the FROM clauses of any subquery).
¹ Once again, a dynamic analysis will add a time component, as ext(D) changes over time.
Besides queries, we will also consider write statements, which may be insertion, deletion, or update statements. We denote by aff(w) the set of rows affected (changed) by the statement w. For an insertion, this is the set of inserted rows; for a deletion, this is the set of deleted rows; and for an update, this is the set of modified rows. Note that for a user u to issue a write statement, the user needs to have certain permissions granted by the DBA. Given a command c from user u, we use c(D) to refer to q(D) if c is a query q, or to aff(w) if c is a write statement w. Given a database D, we talk of the set of users U of D. This is the set of all agents that are allowed to access D. When agent u ∈ U accesses D, it is to send a query or a write statement. We assume that the system can attribute any query or write statement to a particular user, unambiguously. Note that we do not define what an agent is. This is due to the fact that many databases are accessed by applications, and an application may access D on behalf of several users. Likewise, a database may be used to register sensor inputs in a totally automatic fashion. We assume, in what follows, that the set of agents has been defined as needed by the analysis, so that it may be restricted to human users or include other entities. What we assume is that the interactions of agents with the database can be identified as such, that is, when user u ∈ U accesses the database, such access is considered a transaction, and all queries and write accesses throughout the transaction can be attributed to u. In the examples that follow, we will assume that u is a person; however, we point out that even "programmed" agents like a sensor may want to access the database in ways that have not been foreseen by the database creators/designers, and hence such agents would also benefit from the framework proposed here. Given database D and user u ∈ U, we denote by D(u) ⊆ sch(D) the part of the database which is accessible (visible) to u. It is understood that any user access of the database (query or write access) is over D(u) only. Thus, if user u issues command c, c(D) ⊆ D(u), that is, if c is a query q or write statement w, rels(q) ⊆ D(u) and aff(w) ⊆ D(u). Finally, we note that in many cases the access of u to D will be mediated by an interface of some kind: typically a GUI, a Web form, or even a process that picks up user input (clicks, etc.). Such interfaces are usually designed based on the database schema and on the functionality of the application of which they are a part. We will assume that any added functionality proposed also creates changes in the interface so that u can effectively communicate the extra information to D. How interfaces should be enlarged to handle this extra information is an issue beyond the scope of this chapter.
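To make these notions concrete, the following minimal sketch (all table and attribute names are ours, purely for illustration) shows a small schema that satisfies the connectivity assumption, a query q, and a write statement w:

    -- Toy schema: the primary key-foreign key edge (Orders.cust_id -> Customers.cust_id)
    -- makes the schema graph connected, as assumed above.
    CREATE TABLE Customers (
      cust_id INTEGER PRIMARY KEY,
      name    VARCHAR(100)
    );
    CREATE TABLE Orders (
      order_id INTEGER PRIMARY KEY,
      cust_id  INTEGER REFERENCES Customers(cust_id),
      amount   DECIMAL(10,2)
    );

    -- A query q with rels(q) = {Customers, Orders}; its answer q(D) is transient.
    SELECT c.name, COUNT(*) AS num_orders
    FROM Customers c JOIN Orders o ON c.cust_id = o.cust_id
    GROUP BY c.name;

    -- A write statement w; aff(w) is the set of Orders rows whose amount changed.
    UPDATE Orders SET amount = amount * 1.1 WHERE amount < 10;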
7.3 User-Created Content
From now on, we call all data that is entered by a user user-created content. This covers a wide variety of information. Roughly speaking, user-created content can be divided into two types:
• Independent: content is generated without dependence on any other data (although pointers to other data may exist). Blogs are a typical example of this.
• Parasitic: the user-generated data is dependent for its existence on other, preexisting (and usually not user-generated) data. Tags are a typical example of this.
Both types are important and deserve to be studied. Here, we focus on parasitic data, which is generated when users interact with an existing data repository. A data repository can be a Web site, an (electronic) document, or a database. Our focus here is on relational databases. While some data repositories accept user input, traditional (relational) databases are not able to store user-created content, except when such information flows through predefined, restricted access paths. There are several actions that users are not allowed to perform with current relational technology:

• Very often, users are presented with a view of the database, which they cannot update (add, delete, or change existing data). This is a well-known issue, mentioned here for completeness [4].
• Users cannot restructure the database data, that is, change the given schema to another schema that is more convenient. In particular, sometimes a user considers some of the data as labels that could be used to manage other data (i.e., as metadata). Existing systems do not allow data–metadata restructuring [5]. Other restructuring by users is usually severely restricted.
• Users cannot add metadata, tags, or comments to data or query results. In particular, they cannot add to the database anything except data that is structured exactly as the data in the database already is.
The first two issues are well known and have been investigated in the research literature. The last one, however, has been considered only recently and in a limited way (see Sect. 7.7 for references to related work). There is a strong assumption that users should not add anything to the database that does not conform to the existing schema (we discuss this view further in Sect. 7.6). Here, our goal is to disregard this assumption and allow users to enter data in their own terms. To understand what the challenges are, we give a few examples of unsupported interaction next. In practice, we expect that most user-created content will be associated with queries, in particular, with the metadata of queries: a user may want to enter tags or comments about the results of a query, e.g., when finding unexpected and/or interesting results, or simply when trying to interpret the data. Of course, a user may also make other annotations. Assume, for instance, a user accessing a geographic database with information about the position of certain locations: in a table Locations, there are attributes longitude and latitude. The user examines data in the table and realizes that the format is in the form of degrees, minutes, and seconds, so she/he makes a comment and attaches it to both attributes (i.e., to metadata). Then the user issues a query, examines the results, and realizes that some of the tuples have latitude values outside the −90 to +90 range, a mistake probably due to faulty data. The user proceeds to mark such tuples as having incorrect values
(i.e., data). Finally, the user proceeds to link the table with an outside resource that contains similar information, the Getty Thesaurus of Geographic Names, by attaching its URL to the table.²
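A minimal SQL rendering of the second part of this example, assuming a table Locations(name, longitude, latitude) (the exact schema is ours), would be:

    -- Tuples whose latitude falls outside the valid -90 to +90 range;
    -- these are the rows the user would mark as having incorrect values.
    SELECT *
    FROM Locations
    WHERE latitude < -90 OR latitude > 90;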
7.4 General Framework
Our goal is to develop the concept of a social database, defined as a database that handles user-created content and enables user–user interaction. There are two main issues in social databases:

• Handling user input in a way that is convenient and meaningful to the user, including (but not limited to) annotations that a user may wish to create over some data or metadata in the database, as well as related information that she/he may already have and that may be stored outside the database. Such user-created content should be stored in the database; the user should be able to query it (or query it together with the data, in what we call enriched queries, see Sect. 7.5.2) and (if the user allows) share it with other users so that user–user communication (inter-user interaction) can be achieved.
• Building a social network of users. This can intuitively be based on the commonalities of their information needs. A measure of the connection between two users can be established based simply on typical user interaction with the database (queries and other commands); however, having user-created content enables the computation of more fine-grained links. The network obtained can be used in several ways; we discuss some potential uses in Sect. 7.6.
When both issues are handled, one obtains a system that can store user input and use it to enable user–user communication. The overall system, as envisioned, is depicted in Fig. 7.1. To develop such a system, the first issue should be tackled before the second one, given that in order to build the social network, it is necessary to gather information at the user level. Going beyond traditional information (database) commands allows us to obtain a more faithful representation of user connections. Hence, in this chapter we concentrate on the first issue. As stated in the Introduction, our position is that relational databases should be used to store user-generated information, since the database can greatly benefit from it. While relational databases are very restricted and inflexible regarding the data that they can accept, they can be made capable of storing user-generated information with some extensions to existing technology. Here, we propose one such extension, private views, that allows existing relational databases to accept user input, store it, and organize it. Next, we explain and define the concept and study the issues of private view creation, storage, and maintenance.
² www.getty.edu/research/conducting_research/vocabularies/tgn/index.html
Fig. 7.1 A social database system (figure labels: Users, Traditional Database, Database Data, Social Database, Private Views, Master Table, User Social Network/User Similarity)

7.5 Private Views
Our approach is to create a flexible, dynamic layer between the database, on one end, and the user on the other. We call this layer a private view since it is devoted to collecting, for a given user u, all the data generated by u and making such data accessible to u only (although we will see later that u may choose to make the data accessible to others). It is important to point out that our extension, while adding functionality to existing database technology, is conservative; that is, it works within the relational database model as much as possible. This maximizes the chances of the approach being implementable in current platforms, without requiring major changes or creating conceptual conflicts with existing technology. We divide user-created content with respect to two fundamental dimensions:

• What it is attached to. Since all user-created content is assumed to be parasitic, each piece of content is linked to either data (a cell, a row, a set of rows) or metadata (an attribute, or a set of attributes). Since we want to be flexible, we allow the data reference to be any user command (SQL query or write statement). Note that in a relational database, metadata is also stored in tables (usually called Catalog tables or system tables), and hence an SQL query over such tables identifies metadata (see the sketch after this list). In fact, we assume that most user-created content will be attached to metadata.
• What type of data it is. We distinguish between internal data, that is, data that the user creates and enters in the database, and external data, that is, data that already exists outside the database (a document, a web page, etc.). For external data, we define a link type. Internal data is characterized as either structured (having a tuple structure) or unstructured (a tag or a comment).
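As an illustration of a metadata reference, the following query is one possible sketch; it uses the SQL-standard information_schema, whereas actual Catalog table names and identifier casing vary from system to system. It identifies the longitude and latitude attributes of the Locations table of the earlier example:

    -- A command over catalog tables: its result rows are the metadata referent
    -- to which a tag or comment about the coordinate format can be attached.
    SELECT table_name, column_name
    FROM information_schema.columns
    WHERE table_name = 'Locations'
      AND column_name IN ('longitude', 'latitude');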
Having defined the user-created content we intend to represent, we outline a list of operations to capture such content. In the following, c is a command issued by the user (either a query or a write statement).

• annotate(t, c) associates a tag or comment t with an attribute (column), tuple (row), cell, or whole relation which is the result of running c on D. Note that this includes schema items (relation or attribute names), which are obtained by running c on Catalog tables.
• enrich(t, c) associates structured information (a tuple t) with some existing data: an attribute (column), tuple (row), cell, or whole relation which is the result of running c on D. Again, this includes metadata.
• link(l, c) associates an external reference l (a URL, a filename in a local file system, or a combination of network address and filename) with some data or metadata (the result of running c).
Formally, this is summarized as follows. Let u ∈ U be a database user. We assume that u has access to the Catalog (system) tables, at least to the data in them that refers to the schema of D. The private view of user u (in symbols, Pv(u)) is defined as a relation with schema (Reference, Type, Value), where:

• Reference is a set of rowids in the database, given by the command c from the user, that identify the referent.
• Type is one of annotate, enrich, link.
• Value depends on the type, and it represents the user-created content.
We discuss Reference and Value in more detail. When a command c is over Catalog tables, the reference is called a metadata reference; otherwise it is a data reference. When the command refers to a single attribute in a single row, it is called a cell reference.³ For metadata references, which we expect to be the most used, we simply get the rowids from the Catalog tables. For data references, though, we need to distinguish between aff(w) for a write statement w and the result q(D) of running a query q. The rows in aff(w) exist in the database, and hence have rowids, which we can collect. A relational query, on the other hand, may create tuples that do not exist in the database through the use of joins and groupings. When the user wants to attach content to some tuples R = {r1, . . ., rn} of the answer to some query q, those tuples exist only in a transient form and have no rowids (unless, of course, q is issued in the context of a CREATE TABLE or CREATE MATERIALIZED VIEW command). Hence, we instead attach the content to those tuples in the database that were used to create q(D). For a query q with answer q(D), the lineage of
³ Note that it may be impossible to decide when a query returns a single row as its answer. In practice, we expect that a (human) user would pick a particular row (and/or a particular attribute within a row) by clicking on it in a GUI, and then attach some information to the selected value. Thus, the implementation of this functionality may depend on the user interface, an issue that is not within the scope of this chapter. We assume, though, that such functionality exists, and hence we provide operations for it.
row r ∈ q(D) is defined as follows: let q be expressed in RA (if q is an SQL query, a standard translation to RA exists). Then we define Lin(r) recursively:

• if q is of the form π_L(E), for E an RA expression, then

  Lin(r) = { t′ ∈ E | r = t′[L] },

  where t′[L] denotes the subtuple of t′ with schema L.
• if q is of the form σ_c(E), for E an RA expression, then

  Lin(r) = { t′ ∈ E | t′ = r ∧ c(t′) is true },

  where c(t′) denotes the evaluation of condition c on tuple t′.
• if q is of the form E1 ⋈_c E2, for E1 and E2 RA expressions, then

  Lin(r) = { t1 ∈ E1, t2 ∈ E2 | r = t1t2 ∧ c(t1t2) },

  where t1t2 denotes tuple concatenation.
• if q is of the form GB_{A,f(B)}(E), for E an RA expression, then

  Lin(r) = { t′ ∈ E | r[A] = t′[A] },

  where A is any subset of attributes of the schema of E, and f(B) is a sequence of aggregate functions applied to corresponding attributes in the schema of E.
• if q is of the form E1 − E2, then

  Lin(r) = { t′ ∈ E1 | t′ = r ∧ ¬∃ t″ ∈ E2 (t′ = t″) }.
Note that the lineage of row r ∈ q(D) always comes from rels(q). We compute Lin(q(D)) = ⋃_{r ∈ q(D)} Lin(r) and attach the user content to the set of rowids of rows in Lin(R). As for Content, it can be one of the following:

• A tag is a single keyword, syntactically denoted by a string with no separators (commas, whitespace, etc.). Semantically, it is usually a word in some natural language. We note that in some social sites, complex tags are allowed. Some sites do not allow them, forcing users to combine several words into a single string ("baby_picture"). We do not enter into this issue here and, for simplicity, consider tags to be single words.
• A comment is a piece of text, usually a fragment in some natural language. Syntactically, it is denoted by a string where separators are allowed.
• A tuple is a sequence of pairs; it takes the form ((att1, val1), . . ., (attn, valn)), where each atti is an attribute name and each vali is a value (1 ≤ i ≤ n).
This structure allows the user to enter structured information in the database using a schema that may not be part of the database (view) schema, that is, with a user-defined schema. The motivation to allow this kind of input is that whenever structure is available, it should be used, as it makes storage and querying support more effective. At the same time, we want to allow the user to enter the information in a format of his/her choice, to make sure that no data is discarded because it does not adhere to the predetermined database schema. A private view is distinguished from a traditional view by two main characteristics:

• Because it is created to collect user-created data in whatever format the user chooses, its schema is different from traditional schemas. We use a generic table format inspired by the work in [5].
• Since private views are completely under the control of the user, the user can insert, delete, or change all the data in the view. Thus, unlike traditional views, all kinds of updates are allowed. This avoids the problems that are associated with traditional views [4].
Even though the previous point would suggest that a private view should be considered as an independent table, the data in it is dependent on the database in the sense that we collect only parasitic user-created content (see Sect. 7.3). Hence, the data in the view does not really have an independent existence and should change whenever the referred data or metadata changes. This is discussed in Sect. 7.5.4. Applied to our examples in the previous section, we can see how a private view would help deal with them. In the first example, the user attaches a comment to two of the attributes in the schema of the table Locations, using an annotate command. In the second, a query q is run, and some tuples are found to have incorrect values. The system computes the lineage of such tuples and attaches to them (through their rowids) a comment that the values are incorrect (another annotation, but this time to data). Finally, the user uses a link to associate with the table Locations the URL of a Web site with relevant information.
7.5.1 Private View Organization and Storage
We propose a completely normalized storage for our private views. This achieves not only efficiencies of space, but also helps define operations over the view, as will be seen in the next subsection. We assume that the user issues a query q or a write statement w and then selects a part of q(D) or aff(w), called the referent, to which user-created content refers. Note again that q or w is usually generated through an interface of some kind, and the selection process may involve clicking or other icon manipulation. We simply assume that the database is made aware of the referent by receiving a set of row identifiers. We also assume that each row in any table of the database is assigned a
unique rowid.⁴ This includes the Catalog tables, where the metadata is. Since the reference may be a single cell (an attribute in a row), a single row, or a set of rows, we create a separate table Refs to store the rowids for a given referent. The Value attribute can be a simple string (for tags and comments) or a complex object (tuple). Tuples are stored in a separate table Values with attributes Attribute and Value. This allows us to store any type of tuple that the user may decide to store. Note that strings can also be stored here with a special attribute tag or comment. The type attribute can be stored simply as a string, or even as an enumerated type if the system allows this. Overall, then, the conceptual private view schema (Reference, Type, Value) is stored in tables Private(refid, type, refval), Refs(refid, rowid), and Values(refval, Attr, Val). Tuple (r, t, v) in the conceptual view represents the fact that the user attaches to the data in r some content v of type t. This tuple is stored by generating tuples (ir, t, iv), (ir, r), and (iv, a, v) in tables Private, Refs, and Values, respectively, where:

• (ir, iv) is the primary key in Private, while ir and iv are foreign keys in Refs and Values, respectively. Both Refs and Values have no proper primary keys; that is, the whole schema of the table is the primary key.
• If t (the type) is
  – annotate, Values contains a tuple (iv, tag, st), where st is a string with the tag or comment;
  – meta or enrich, Values contains tuples (iv, At1, Val1), . . ., (iv, Atn, Valn) to represent the tuple ((At1, Val1), . . ., (Atn, Valn));
  – link, Values contains a tuple (iv, Linki, st), where i = 1 and st is a string representing a URL, or i = 2 and st is a file name in a local directory, or i = 3 and st is a file name on another machine (the file name is preceded by a network name for the machine). The index i in Linki is used to help the system determine how to handle the value in st (see the next subsection).
• Regardless of type, Refs contains tuples (ir, rowid), where rowid is one of the identifiers in the referent.
Note that a data tuple may have more than one item of user-created content attached to it, since the user is not limited to attaching a single tag or comment to a given data item. Also, the same tuple may be part of more than one referent (i.e., a tuple may be in the answer set of more than one query and/or may be affected by more than one write statement). Hence, the relation between data or metadata items and user-created content is assumed to be many-to-many, and normalizing the storage solves the problems of schemas like [6], which create a potentially large number of nulls.
⁴ This is already the case in several commercial systems.
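A minimal SQL sketch of this storage layout follows. The column types and the example rowids are ours; "Values" is quoted because VALUES is a reserved word, and the rowid column is renamed row_id to avoid clashing with the rowid pseudo-column of some systems:

    -- One per-user instance of the private-view tables.
    CREATE TABLE Private (
      refid  INTEGER,          -- identifies the referent (its rowids are in Refs)
      type   VARCHAR(10),      -- 'annotate' | 'enrich' | 'link'
      refval INTEGER,          -- identifies the content (its rows are in "Values")
      PRIMARY KEY (refid, refval)
    );
    CREATE TABLE Refs (
      refid  INTEGER,          -- acts as a foreign key into Private
      row_id BIGINT            -- rowid of a referred data or Catalog row
    );
    CREATE TABLE "Values" (
      refval INTEGER,          -- acts as a foreign key into Private
      Attr   VARCHAR(100),
      Val    VARCHAR(1000)
    );

    -- The comment on the longitude/latitude attributes of Locations, stored as
    -- an annotation over two (made-up) Catalog rowids:
    INSERT INTO Private  VALUES (1, 'annotate', 1);
    INSERT INTO Refs     VALUES (1, 4711);
    INSERT INTO Refs     VALUES (1, 4712);
    INSERT INTO "Values" VALUES (1, 'comment', 'coordinates use degrees, minutes, and seconds');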
7.5.2 Private View Querying
Note that it is not enough to allow the user to add data to the database in a customizable way. The user should also be able to retrieve such information; therefore, any query means (SQL, forms, or otherwise) must be made aware that this extra information exists and retrieve it if requested. In our view, the system should automatically create a private view (and all needed tables) each time a user is added to the system; methods like naming the tables with the username can be used to keep track of who owns the data and to set permissions accordingly. For instance, for a user with username "JJones", the system creates tables JJones-Private, JJones-Refs, and JJones-Values and gives read and write permission to this user only. One of the advantages of storing the user-created content in the database using regular tables is that it is immediately available for querying. Using SQL (perhaps through an appropriate interface), the user can search for comments, tags, or any other user-created content that was previously stored. But another interesting possibility is that when data from the database is queried, the user can now retrieve data in two ways: standard mode, in which only the data from the database is used to create an answer, and enriched mode, in which data from the database is processed, but any related user-created content is automatically added to the data in the answer. That is, the private data is attached by the system, without the user having to specify additional tables or joins in the query. We can think of this as delivering raw data versus annotated data. Note that a user can manually generate an enriched result by adding clauses to an existing SQL query. However, the system can be made to perform this task automatically. Intuitively, the process seems simple: for command c and user u, Pv(u) is the private data of u. Then, enriched mode processes c as follows:

1. Compute c(D) (c over D) as usual. If c was a query q, this is q(D). If c was a write statement w, then aff(w) is computed.
2. Do a left outerjoin between c(D) and Pv(u) using the rowid attribute. Formally, we have

   q(D) ⟕_rowid (Refs ⋈_refid Private ⋈_refval Values),

   where ⟕ denotes a left outerjoin.

Note that a left outerjoin is used since some rows in the answer may have user-created content attached to them and others may not, but we do not want to miss any row originally in c(D). (A SQL rendering of this lookup is sketched after the metadata definitions below.) However, as we will see next, there is an important difference between handling data and metadata, and we deal with each separately. Given a query q or write statement w, we can define the metadata involved in q (in symbols, Meta(q)) or in w (in symbols, Meta(w)) precisely:
• For an RA query q, we can define Meta(q) recursively:
  – if q = R, for any relation R, Meta(q) = ∅. This may seem incorrect, but we want to gather only attributes that are truly used in q; hence the real work is done in the other clauses.
  – if q = σ_c(E), for any RA expression E, Meta(q) = Att(c) ∪ Meta(E), where Att(c) is the set of all attribute names used in condition c (note that we assume sets, that is, duplicates are removed).
  – if q = π_L(E), for any RA expression E and attribute list L, Meta(q) = L (note that, for q to be a correct expression, L ⊆ sch(E)).
  – if q is one of E1 ⋈_c E2, E1 ∪ E2, E1 − E2, or E1 ∩ E2, for any RA expressions E1 and E2, Meta(q) = Meta(E1) ∪ Meta(E2) (again, we assume that duplicates are removed).
  – if q = GB_{L,F(L′)}(E), where GB represents the group-by operator, L and L′ are subsets of sch(E), and F is a list of aggregate functions, Meta(q) = L ∪ L′ ∪ Meta(E).
  If q is an SQL query, the set Meta(q) can be obtained either by translating q to RA (note that most query processors do this in order to execute the query), or directly from SQL by gathering all attribute names used in the SELECT clause (including any attributes used with aggregates) and in the conditions of the WHERE clause, as well as the attributes in the GROUP BY and HAVING clauses if they are present,⁵ plus recursively adding any attribute names in any subquery of q, be it in the SELECT, FROM, or WHERE clauses. It is important to note that Meta(q) is a superset of the schema of q(D), the table produced as an answer to q in D, and it includes all attributes involved in formulating q. This is done to help users understand the semantics of the query.
• For a write statement w, if w is an insertion into table R, a deletion from table R, or a modification of some tuples in table R, then Meta(w) = sch(R).⁶
The key point is that there is a unique set of metadata attached to each query or write statement. Hence, once Meta(q) or Meta(w) is computed, we can extract the corresponding set of rowids from the Catalog tables, and use the procedure above (outerjoin) to add any user annotation to the metadata of the command. For a query q, this means adding to the top row of q(D) (where the schema is usually displayed) any user-created content found. Note that we include content related to attributes in the WHERE clause, which may not be part of the schema of the answer table. Such content could be displayed also at the top of the answer, previous to the data. Again, how to best display this information is considered a matter of user interface, and not dealt with here.
⁵ Note that the SQL standard allows some attributes to appear in the GROUP BY clause and not in the SELECT clause, so this check is necessary.
⁶ While an update statement may involve more than one table, the typical usage of SQL is to make changes to one table at a time.
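Assuming the per-user tables named above and a table Q_Lineage(row_id, ...) holding the answer rows of q together with the rowids of their lineage (both names are ours), the enriched lookup for user JJones can be sketched in SQL as:

    -- Enriched mode: the answer is left-outer-joined, through lineage rowids,
    -- with the private data of the user issuing the command.
    SELECT q.*, pv.Attr, pv.Val
    FROM Q_Lineage q
    LEFT OUTER JOIN (
      SELECT r.row_id, v.Attr, v.Val
      FROM "JJones-Refs" r
      JOIN "JJones-Private" p ON p.refid  = r.refid
      JOIN "JJones-Values"  v ON v.refval = p.refval
    ) pv ON pv.row_id = q.row_id;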
At the data level, however, there are some special issues to deal with. Let c(D) = {r1, . . ., rm}, where each ri is a rowid (1 ≤ i ≤ m). Assume that the user attached some content to a row r ∈ c(D). In fact, the user may have attached several pieces of content u1, . . ., us. However, the user attached content u1 to r when it was part of the answer of another command c′, and so it was part of the set c′(D) ≠ c(D); content u2 was attached when r was part of the answer to another command c″; and so on. Should all this content still be displayed? Some of it? One could argue that such content should be displayed in context, so the content associated with r through c′(D) should be displayed only if c(D) is somewhat related to c′(D). For a given rowid r ∈ π_rowid(Refs), we define the annotation contexts of r as follows:

AC(r) = { t.refid | t ∈ Refs ∧ t.rowid = r }

(that is, the set of reference ids containing r). For each annotation context ac ∈ AC(r), its extension is simply the set of all rowids attached to it:

Ext(ac) = { t.rowid | t ∈ Refs ∧ t.refid = ac }.

Then, for each row rj ∈ c(D), we display the user-created content attached to rj in context ac ∈ AC(rj) if Ext(ac) is sufficiently related to c(D). What "sufficiently related" means can be defined using several semantic measures. An analytic measure can be defined along the lines of typical metrics like Jaccard, since both Ext(ac) and c(D) are sets:

dist(ac, c) = |Ext(ac) ∩ c(D)| / |Ext(ac) ∪ c(D)| ≥ a,

where a is a threshold. The advantage of this well-known metric is that it closely matches intuition in extreme cases (i.e., it reaches its maximum value of 1 when c(D) ⊆ Ext(ac) or Ext(ac) ⊆ c(D), and its minimum of 0 when Ext(ac) ∩ c(D) = ∅). Thus, for c(D) = {r1, . . ., rm}, let AC(c) = ⋂_{r ∈ c(D)} AC(r). If AC(c) ≠ ∅, this represents contexts that are common to all tuples in c(D). Then we can choose the ac ∈ AC(c) that achieves min_{r ∈ c(D)} dist(ac, r), that is, the context that is closest to some tuple in c(D). Other measures are also possible: the context that minimizes the overall distance (min Σ_{r ∈ c(D)} dist(ac, r)) or the average distance (min Avg_{r ∈ c(D)} dist(ac, r)). If AC(c) = ∅ (there is no context that is common to all tuples), we can choose to display, for each tuple r ∈ c(D), the context ac that minimizes the distance to c(D), that is, min_{ac ∈ AC(r)} dist(ac, c). (A SQL sketch of this computation appears at the end of this subsection.) Once it is decided which user-created content to show, the same technique outlined above (outerjoin based on rowids) can be used. Note that carrying out the procedure just outlined can be quite costly: we need to compute, for all r ∈ c(D), AC(r); then, for each ac ∈ AC(r), we have to obtain Ext(ac); and finally we need to determine a metric between Ext(ac) and c(D) (the one proposed above or an alternate one). It is easy to see that in a worst-case
scenario this may turn out to be quite expensive. However, two things need to be noticed: first, the worst-case scenario happens only when each data row in the database has some user-created content attached to it by a single user. Even though a user can attach some content to many data rows at once by using a large size result of some query, this could still be considered unlikely in a real-size database. But, most importantly, the system proposed here is envisioned mostly to be used with metadata, and only occasionally with data. Indeed, users are normally interested in interpreting and understanding the results of their queries, and for this task the metadata is of considerable help. As a second consideration, if the private data increases significantly, standard database techniques (indices, etc.) can be used to achieve better performance. Finally, to attach values, we add user-created content to each data row. When presenting the enriched result to the user, however, the full outerjoin above may generate considerable redundancy, since, as stated earlier, a given data tuple may have several items of user-created content attached. In particular, when the user attaches a tuple, the normalized storage proposed creates several duplicates of the original data tuple. Thus, alternative presentations of the information should be considered. For instance, even though not truly a relational table, a slightly different arrangement could attach all tags or comments to a data tuple together, without duplicating it. When a set of rows have the same user-created content attached, this could be shown by displaying the rows together and visually attaching the content to all of them as a block, without repetitions. Of course, this will not avoid all repetitions (as previously stated, the relationship between data and user-created content is many-to-many). Nevertheless, duplication in enriched answers can be avoided to a certain degree. This applies to structured (tuples) and unstructured (tags, comments) internal content. With respect to external content (links), there are two obvious choices: to simply display the link itself, or to actually follow the link and display the contents of whatever it points to. It is here that having typed links would help the system deal with different kinds of references (i.e., URLs versus filenames).
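The context-relatedness test of this subsection can also be sketched in SQL, assuming the answer rowids of the current command have been materialized in a table Answer_Rowids(row_id) (the table name is ours, and the threshold a would be applied to the result):

    -- For each annotation context, compute the Jaccard coefficient between its
    -- extension Ext(ac) (the rowids in Refs with that refid) and the answer rowids.
    SELECT r.refid AS context,
           CAST(COUNT(a.row_id) AS FLOAT)
             / (COUNT(*) + (SELECT COUNT(*) FROM Answer_Rowids) - COUNT(a.row_id))
             AS relatedness
    FROM Refs r
    LEFT OUTER JOIN Answer_Rowids a ON a.row_id = r.row_id
    GROUP BY r.refid
    ORDER BY relatedness DESC;  -- keep only the contexts whose value reaches the threshold a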
7.5.3 Private View Sharing
Recall that the ultimate goal is to enable user–user communication. The private data can be used to foster inter-user communication in several ways (we discuss this in Sect. 7.6). However, so far all user-created content is kept private: only the user who authored the content can see it. Thus, a user u could be given the option to share Pv(u) with other users. We envision the database system supporting a share command with a syntax along the lines of

PUBLISH DATA TO ALL | SIMILAR | UserList

where:

• ALL means to all users.
• UserList is a list of users with whom we want to share the data.
• SIMILAR asks the system to find users similar to u and share the data with them only. Similarity between two users u and u′ can be defined in several ways: based on commonality of queries (that is, common data of interest) or even commonality of private data (which the system has access to). This includes similarity not only of data (using references) but also of tags, annotations, and external links. While we do not discuss this in detail, we point out that capturing private data allows a more exact determination of the similarity among users. Formally, given user u, the history of u is a sequence Hu = ((a1, t1), . . ., (an, tn)) of pairs, where the ai are database accesses and the ti are timestamps. Each database access ai is either a read access (query) ri or a write access (insert, delete, update) wi. The history captures the interaction of u with the database by storing the commands that u issued against the database and the time at which each command was issued. A user profile for u is defined as a pair Pfu = (Hu, Pv(u)), where Hu is the history of u and Pv(u) is the private view of u. Similarity between users u and u′ could be established based solely on Hu and Hu′; however, it seems clear that a similarity measure that takes into account Pfu and Pfu′ is likely to be more fine-grained and return better results.
When the user decides to share private data, the private view is copied into a system-wide table with a similar schema, but with an attribute user-id added, so that the author of the content can be identified. A command SUBSCRIBE should also be supported by the system, so that users can gain access to this system-wide table. Note that this command can be extended, similarly to the share command described above, so that users are able to request access to everyone's private data or just to that of some specific users. Note also that since the sharing is done through an intermediate system-wide table, users have control over the data they access and the data they share. This is an extremely desirable property; giving users control over the data they generate encourages them to create such data in the first place. Note also that the system may impose some constraints of its own. The most important (and obvious) one is that content may be shared only with users who are allowed to see the referents, that is, who have (at least read) access to the data being referred to. Allowing users to see metadata on data that they are not allowed to access may be considered a security breach, especially since some user-created content may give away the data. Thus, permissions given by the user may be narrowed down by the system to that part of the audience that has permission to access the underlying data. Note that, in computing similarity between pairs of users, most measures along the lines sketched here would return a very low similarity between users who cannot access at least some common data, and hence the risk of sharing data with unauthorized users is quite small.
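A sketch of how PUBLISH DATA TO ALL could be realized for user JJones, assuming a system-wide table Shared with the private-view schema plus a user_id attribute (the table and column names are ours); the corresponding Refs and Values rows would be copied analogously:

    -- Copy the user's private view into the system-wide shared table; read
    -- access to Shared can then be granted to subscribers, narrowed by the
    -- system to those users who may see the referred data.
    INSERT INTO Shared (user_id, refid, type, refval)
    SELECT 'JJones', refid, type, refval
    FROM "JJones-Private";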
7.5.4 Private View Maintenance
Finally, we discuss the issue of maintaining the private view over time. We focus on changes to referred data or metadata.
The key intuition here is that a reference in the Refs table is akin to a foreign key. That is, rowid values in this table are used to refer to existing data in the database on which they depend. This intuition immediately tells us two important things:

• When data is added to the database, there are no maintenance issues for private views. This is similar to the introduction of new values in a primary key column; such actions cannot break an integrity constraint.
• When data is deleted or updated, there are several possible options as to what to do with references to it. Which one is better may depend on the particular user, so the user may be offered a set of options for such cases. A cascade-like action would delete all rows in the Refs table (and also in the Private and Values tables) that refer to rows that are deleted or updated (the rationale for deleting references to updated values is that it was the old value that most likely prompted the user to add some content). Note that this policy creates some overhead, since rows that are referenced in the private data need to be marked somehow so that, on a deletion or update, the system knows that it needs to check the private data of some users. Since the change may come from the system, in principle the private data of all users that have access to such rows needs to be checked. This could be implemented with a mechanism similar to the one used to manage triggers. However, cost remains an issue.
There is, however, a more complex but interesting possibility. As pointed out earlier, the definitions given above are static, that is, they use the state of the database at a point in time. As database D evolves over time, ext(D) is bound to change. We assume that changes in both data and metadata are beyond the control of any user, that is, for any user u, D may change over time as a result of events that u neither initiates nor controls. Deleting any references to the changed data would avoid dangling references. But if users created content (tags, etc.) in the first place, it is because there was something of interest to them in such data. Even if the data changes or is deleted, the content may still be of interest, and remembering the content may be a more interesting option. An idea proposed a long time ago is that of a no-overwrite database, that is, a system where all data has a timestamp attached and new values are simply added to the database with a current timestamp, while old values are never deleted [7]. Such a mechanism enables the support of sophisticated temporal database capabilities, which is a complex subject of research in itself [8]. The basic idea, however, is quite intuitive: a tuple update creates a new tuple version. Each tuple version has an associated timestamp, and all versions, old and new, are maintained by the system and made accessible. Applying this idea to our concept of private view, we can transform such a view into a time-sensitive repository (even if the database itself is not!). When referred data is deleted or updated, the content related to it in the private view is not deleted. However, an extra marker is added to the table to indicate whether the data is active or has been deleted or updated. Thus, when data changes, it is simply marked, but annotations are kept. Note that the mechanism proposed earlier for enriched queries can still be used, with only a slight modification. A selection is
added to the outerjoin to pick up only data that is still current. This selection will filter out data with a marker set to indicate that the data has been deleted or updated. This mechanism also protects the private data in case a system reuses rowids. Thus, when using enriched queries, only currently valid private data is retrieved. But the user is still able to look up past private data, which may help the user understand the evolution of the database. If users are allowed to share personal data, maintenance of the central repository that is created for such purposes presents challenges of its own. Since the data in the central repository is only a copy of the private user data, when such data is updated on a per-user basis, it should also be updated in the central repository. Such a central repository could be considered a materialized view, and hence techniques for maintaining such views (like incremental computation) could be used here [9]. In particular, note that it may be useful to limit the sharing to current data in the database, if only to ensure that all users can see and understand the referents. Thus, the system should add to the central table only currently valid data. One intriguing application of this idea is that it can be applied to the metadata (schema) of the database. A no-overwrite mechanism could keep information about the old schema around, while presenting the new schema to applications and users. This could help users understand database evolution. In particular, if users comment on changes in formatting, units, etc., they could provide very valuable metadata which is usually absent from the database (see the next section for more on this).
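A minimal sketch of the marker-based option, assuming a status column added to the per-user Refs table (the column name and codes are ours):

    -- Mark references instead of cascading: 'A' = active, 'D' = referred row
    -- deleted, 'U' = referred row updated.
    ALTER TABLE Refs ADD COLUMN status CHAR(1) DEFAULT 'A';

    -- When the row with (made-up) rowid 4711 is deleted, its references are
    -- marked rather than removed, so the annotation itself survives:
    UPDATE Refs SET status = 'D' WHERE row_id = 4711;

    -- The enriched lookup of Sect. 7.5.2 then adds the selection status = 'A'
    -- to its join so that only currently valid private data is retrieved.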
7.6 Discussion
There are several potential objections to allowing user content to be part of the database:

• Data quality: if users can generate content without any supervision (there is no editing or curation process), the quality of the result may be quite uneven and questionable. Database contents (data) are valuable because they are curated; they are obtained through a predefined process and have some guarantees of quality.
• Data ownership: data in the database is usually owned by the organization that created (and funded) the database, with control delegated to the DBA and/or a small group of people in charge. If users create content and add it to the database, who owns it? The user or the organization?
• Data control: without some central control, databases could grow in an uncontrolled manner (not only the data itself, but also the schema). While dealing with large volumes of data is something that databases do well, increasing the complexity of the schema may have negative effects in many areas (including, ironically, user access, which can become more complicated).
There are several answers to these concerns. Our approach is based on distinguishing between public data (that in the database, curated, accessible to all users with the right permissions) and private data (user generated, accessible only to the user and to whoever the user wishes to share the data with). They are kept separate to a certain degree; data in the database does not get changed, except as permitted by the usual channels. Private data is kept apart. We have proposed a mechanism to make private data sharable; however, note that private data, even when shared, is still kept separate from regular data. Also, private data can be examined by a curator or editor before being admitted to the central repository. A particular control implemented in social sites is to allow other users to judge whether contents are appropriate; e.g., people can complain that other people's comments are offensive, cf. the mechanisms of Wikipedia [10]. Clearly, there is additional work for the DBA to take care of. However, our mechanism is uniform and simple (and we cannot expect new functionality to be added at no cost). There are, on the other hand, many benefits that a database system can derive from allowing users to store private data. We have hinted at such benefits throughout the chapter, but here we list them to make the case that the additional overhead is well worth the effort:

• Users may add valuable metadata to the database. If a user has questions about how to interpret a result, the user could consult what other users have said about the data involved. An important type of metadata is that where users comment on the meaning and interpretation of the database schema (mostly table attributes). This is the source of many problems in database integration [11–13]; hence, saving user-generated metadata could be very useful, as such metadata is rarely present nowadays.
• By mining the tags and comments deposited by users, a system could add information to its Catalog tables, obtaining a more accurate description of the items in the database. Such a description could prove extremely useful to the DBA: it could alert about misconceptions that users may have about the data in the database, it could help change the way data is represented and/or captured, etc. Thus, we believe that capturing user data is a first step toward a better understanding not just of the users, but of the database itself.
• The system can also exploit a user's profile to its advantage. As mentioned, many systems already keep histories in order to understand system usage, for the purposes of tuning and optimizing system resources (although such systems usually keep a centralized history for all users, since they are not interested, as we are, in using such information on an individual basis). But a more fine-grained approach can allow further tuning by taking into account which users are logged into the system, and using only their information to set parameters dynamically.
• There is a large body of research on social networks [14] that can be applied once the network is built, and which can lead to a better understanding of the users of the database (for instance, centrality measures may reveal influential users). This in turn may lead to a better understanding of how users access the database, their expectations, and their understanding of it. This may influence database evolution and maintenance.
– The system can obtain a better analysis of database usage. By discovering which parts of the database users use and see, we can determine whether there are "hot spots" that most people use, parts that very few people use, and parts that nobody uses. While this information is currently used only for performance purposes, it can also be used for semantic purposes. For instance, we could attempt to enrich data in hot spots and restructure the database to make sure that users are aware of data that is infrequently used. Thus, usage can guide database evolution [15].
In the end, the effort proposed here can be seen as having a dual advantage: it allows users to interact more freely with the database, storing information that is convenient to them, and it also gives the database a way to "get to know" the user better and to use this information to encourage user–user communication.
7.7 Related Work
There are several very active areas of research that influence this proposal; therefore, the body of related work is quite large. Here we overview only some of the work that has directly influenced the ideas presented.

Tagging (in particular, social tagging, that is, tagging systems developed by large numbers of users) has been shown to have more resilience and structure than expected from its uncontrolled, decentralized development [3]. As a result, social annotations have already been used for similar analyses, for instance, automatic resource discovery [16]. We believe that such annotations will be especially useful in databases. One particular advantage of capturing user-created content in this context is that it can be used to semantically annotate existing data and metadata, a task that is difficult with automatic approaches [1, 17]. However, organizing and browsing large amounts of annotations is still a challenge [18]. Our approach here implicitly organizes the annotations by author and (through references) by the item tagged. Once connections among database users are established, several tools can be brought to bear on the resulting network for analyses such as detecting closely related subgroups of users (communities) [19]. In fact, the whole body of work on social networks can be applied here [14].

Many systems already store user commands, especially queries. This information is then used for query optimization [20] or to generate query forms for database interfaces [21]. However, most systems store the information in a central repository, often ignoring the particular user that issued the command. The reason is that the information is used at the system level, for instance, to identify frequently accessed tables or attributes in order to build indices on them. In such a context, well-supported patterns in the data are important. Here, we want to use the information at the user level; hence, we keep the information at the user level. However, we
note that the analysis of the resulting social network also yields interesting global results.

A somewhat similar approach is described in [22]. There, a central repository is created using a generic schema, with domain experts expected to simply take care of uploading data into the system using a simple, general RDF interface. The data and the schema of the database are customized as much as possible, exploiting the fact that, from all the declarations made by the experts, patterns can be observed that allow grouping and organization of the data. The paper [23] proposes a system for building content-sharing communities. This system allows users to share their data, which is similar to our idea of allowing users to share their private data. However, [23] does not presuppose an existing database, and therefore it is free to choose a novel structure for the shared data and its querying (based on Active XML over data streams). In our framework, a preexisting (relational) database is taken as the starting point. Nevertheless, there are interesting similarities between the aims of [23] and ours. Another approach [6] also proposes to capture user data in a flexible format and to allow sharing in a central repository. As in our proposal, a generic relational table is used to hold user-created content. There, a flexible schema in which attributes are added on the fly is proposed. The data is then clustered by common attributes in an attempt to define the types of objects that the users are referring to. Note that this schema leads to numerous null values, since not all objects will have values for all attributes. By contrast, our approach stores only the needed information. In [6], each type of object leads to the creation of a view, and a collection of such views is offered to users for querying. This work, like [23], does not assume a preexisting database. Hence, the data from users must be analyzed to identify the entities under consideration. We anchor all user-created content in the database. In other words, we focus on parasitic data, while these efforts focus on independent data (see Sect. 7.3).

The papers [24] and [25] also develop systems to store and query annotations. Geerts et al. [24] present the MONDRIAN system, based on a special algebra to handle annotations together with the tuples they refer to. The resulting language gives a "parallel" relational algebra to manage annotations. While this is an interesting approach, it presents serious challenges from an implementation point of view. Here we have chosen to manipulate all information in traditional relational algebra in order to make the resulting system as simple as possible. Bhagwat et al. [25] introduce the DBNotes system. This system focuses on annotation propagation: as data is transformed, the associated annotations are transparently carried with the data. This is carried out using ideas from research on data lineage (provenance). We have not covered this dynamic aspect here.

Recently, work on emergent semantics [26] has added emphasis to user-created content, proposing it as a means to add semantics to different objects, like multimedia objects, which are hard to describe with traditional modeling [27]. Our system can be seen as an enabling mechanism for capturing emergent semantics in databases and is partly motivated by this research, which considers user input as first-class information, i.e., just as important as, or more important than, the data stored in the system.
7.8 Conclusion and Further Research
We have presented a general framework for enabling social interaction in databases. The overall goal is to increase the flow of information among users. As part of this project, it is necessary to find ways to allow users to add their own content to the database. In this chapter, we have argued that current database technology does not support the required functionality, and we have proposed extensions to the technology that offer the basic support needed. Our extensions are conservative, that is, they work within the relational database model, since this maximizes the chances of the approach being implemented in current software. At the same time, our extensions allow users to enter several forms of content (tags, comments, structured data) in a format that is convenient to the user.

The ideas proposed here can be extended in several ways. From a technical point of view, there are some extensions that should be considered to make the framework complete. For instance, we stated that most of our analysis is static, but there are clearly some concepts that could (and should) be extended to a dynamic viewpoint. As database usage evolves over time, the dynamic view of user content should become valuable. Such analysis is already carried out in social Web sites [3]. Also, we do not deal with nulls. However, it could be very interesting to see what users say about null values, and whether their input helps deal with the difficult issues surrounding incomplete information [28, 29].

Besides these technical issues, there are other, larger areas where the research could be fruitfully exploited. We point out here three particular avenues on which our current research is focused. One is the extension of the ideas to other data models, in particular to semistructured data models (XML). There is already a very large body of work on users tagging content in social web sites. Such research assumes a very simple organization of the data; a more formal approach that exploits XML Schema could complement such research. Another idea is to exploit the framework in the context of inherently distributed tasks, like collaborative document editing. While several software tools already exist for this task, most of them limit user input to document edits, not allowing or strictly restricting the use of tags or comments. Such systems usually have problems with conflict resolution, depending on simple voting schemas. Letting users add metadata to the documents may give an additional perspective on users' views of modifications and allow more sophisticated schemas for integrating versions. Finally, the information obtained here can be exploited in many different ways. We have briefly sketched a few uses in Sect. 7.6, but the list there was not meant to be exhaustive. A long-term goal of our research is to implement the ideas described, test them in a real-life system, and analyze the data obtained to determine potential uses for it. As usually happens when users are allowed to provide content, we expect several unanticipated results from this research.

Acknowledgments This research was sponsored by NSF under grant CAREER IIS-0347555. The author is very grateful to his Program Manager, Maria Zemankova, for her support and patience.
References

1. Bao, S., Wu, X., Fei, B., Xue, G., Su, Z., Yu, Y.: Optimizing web search using social annotations. In: Proceedings of the WWW 2007 Conference, pp. 501–510. ACM, New York, NY (2007)
2. Escobar-Molano, M., Badia, A.: Exploiting tags for concept extraction and information integration. In: Proceedings of the 5th International Conference on Collaborative Computing: Networking, Applications and Worksharing. IEEE Press (2009)
3. Golder, S., Huberman, B.A.: The structure of collaborative tagging systems. J. Inf. Sci. 32(2), 198–208 (2006)
4. Melnik, S., Adya, A., Bernstein, P.A.: Compiling mappings to bridge applications and databases. ACM Trans. Database Syst. 33(4), 1–50 (2008)
5. Wyss, C., Robertson, E.: Relational languages for metadata integration. ACM Trans. Database Syst. (TODS) 30(2), 624–660 (2005)
6. Yu, B., Li, G., Ooi, B.C., Zhou, L.-Z.: One table stores all: enabling painless free-and-easy data publishing and sharing. In: Proceedings of the 3rd Biennial Conference on Innovative Data Systems Research (CIDR), pp. 142–153 (2007)
7. Nørvåg, K.: Design issues in transaction-time temporal object database systems. In: Stuller, J., Pokorny, J., Thalheim, B., Masunaga, Y. (eds.) Current Issues in Databases and Information Systems. Lecture Notes in Computer Science, vol. 1884, pp. 371–378. Springer, Heidelberg (2000)
8. Jensen, C.S., Snodgrass, R.T.: Temporal data management. IEEE Trans. Knowl. Data Eng. (TKDE) 11, 36–44 (1999)
9. Gupta, A., Mumick, I.S. (eds.): Materialized Views: Techniques, Implementations, and Applications. MIT Press, Cambridge, MA (1999)
10. Wilkinson, D.M., Huberman, B.A.: Cooperation and quality in Wikipedia. In: Proceedings of the 2007 International Symposium on Wikis, Montreal, Quebec, Canada, pp. 157–164 (2007)
11. Doan, A., Halevy, A.: Semantic integration research in the database community: a brief survey. AI Magazine, Special Issue on Semantic Integration, Spring (2005)
12. Kang, J., Naughton, J.: Schema matching using interattribute dependencies. IEEE Trans. Knowl. Data Eng. (TKDE) 20(10), 1393–1407 (2008)
13. Sciore, E., Siegel, M., Rosenthal, A.: Using semantic values to facilitate interoperability among heterogeneous information systems. ACM Trans. Database Syst. 19, 254–290 (1994)
14. Wasserman, S., Faust, K.: Social Network Analysis. Cambridge University Press, Cambridge (1994)
15. Vassiliadis, P., Papastefanatos, G., Vassiliou, Y., Sellis, T.: Management of the evolution of database-centric information systems. Available from CiteSeerX at http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.104.5372 (2008)
16. Plangprasopchok, A., Lerman, K.: Exploiting social annotation for automatic resource discovery. In: Proceedings of the AAAI Workshop on Information Integration (2007)
17. Wu, X., Zhang, L., Yu, Y.: Exploring social annotations for the semantic web. In: Proceedings of the WWW 2006 Conference (2006)
18. Li, R., Bao, S., Fei, B., Su, Z., Yu, Y.: Towards effective browsing of large scale social annotations. In: Proceedings of the WWW 2007 Conference, pp. 943–952. ACM Press, New York, NY (2007)
19. Liu, H., Tang, T., Agarwal, N.: Tutorial on community detection and behavior study for social computing. Presented at the 1st IEEE International Conference on Social Computing (2009)
20. Agarwal, P.K., Xie, J., Yang, J., Yu, H.: Input-sensitive scalable continuous join query processing. ACM Trans. Database Syst. (TODS) 34(3), article 13 (2009)
21. Jayapandian, M., Jagadish, H.V.: Automating the design and construction of query forms. IEEE Trans. Knowl. Data Eng. (TKDE) 21(10), 1389–1402 (2009)
22. Howe, B., Tanna, K., Turner, P., Maier, D.: Emergent semantics: towards self-organizing scientific metadata. In: Proceedings of the International Conference on Semantics of a Networked World (ICSNW), pp. 177–198 (2004)
23. Abiteboul, S., Polyzotis, N.: The data ring: community content sharing. In: Proceedings of the 3rd Biennial Conference on Innovative Data Systems Research (CIDR), pp. 154–163 (2007)
24. Geerts, F., Kementsietsidis, A., Milano, D.: MONDRIAN: annotating and querying databases through colors and blocks. In: Proceedings of the ICDE 2006 Conference (2006)
25. Bhagwat, D., Chiticariu, L., Tan, W., Vijayvargiya, G.: An annotation management system for relational database systems. In: Proceedings of the VLDB 2004 Conference, pp. 942–944 (2004)
26. Aberer, K., Cudré-Mauroux, P., Ouksel, A.M., Catarci, T., Hacid, S., Illarramendi, A., Mecella, M., Mena, E., Neuhold, E.J., De Troyer, O., Risse, T., Scannapieco, M., Saltor, F., De Santis, L., Spaccapietra, S., Staab, S., Studer, R.: Emergent semantics: principles and issues. In: Proceedings of the 9th International Conference on Database Systems for Advanced Applications (DASFAA 2004), pp. 25–38. Springer, Heidelberg (2004)
27. Santini, S., Gupta, A., Jain, R.: Emergent semantics through interaction in image databases. IEEE Trans. Knowl. Data Eng. 13(3), 337–351 (2001)
28. Benjelloun, O., Sarma, A.D., Halevy, A., Widom, J.: ULDBs: databases with uncertainty and lineage. In: Proceedings of the 32nd International Conference on Very Large Data Bases (VLDB), pp. 953–964 (2006)
29. Libkin, L.: Data exchange and incomplete information. In: Proceedings of the 25th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, Chicago, IL, USA, pp. 60–69 (2006)
Chapter 8
Motivation Model in Community-Built System
Przemysław Różewski and Emma Kusztina
Abstract Every person can be characterized by motivation. Therefore, a collective motivation model of the users of an information system can be applied to increase its overall efficiency and productivity. In this chapter, a motivation model for the community-built system environment is proposed. The model's objective is to maintain control of socioeconomic systems. The psychological, cognitive, and social aspects of the community-built system have been incorporated into the model. Having introduced an analysis of the networked nature of knowledge, an ontological model for knowledge representation in community-built systems is formulated. Following that, the main features of a computer-supported collaboration process are discussed to establish the environment of a community-built system. The motivation model is located within this environment and leverages the qualities of the ontological model.
8.1 Introduction
Motivation is a vital component of human activities. This is especially true on the Internet, where the quality of content often depends on the creator's motivation. Moreover, it is more likely that a highly motivated user will cooperate with other individuals and perform complex research activities. Therefore, the problem of motivation is an important research area in community-built databases. A formalized motivation model can be used as a tool for evaluating the development process of a community-built database. During this discussion, we will begin by investigating the concept of a community-built database system and move toward the concept of a community-built system. The latter concept is more general and can be defined as a system for content creation by a community operating on a dedicated engine (e.g., a Wiki). From this perspective, the
P. Różewski (*) Faculty of Computer Science and Information Systems, West Pomeranian University of Technology in Szczecin, ul. Żołnierska 49, 71-210 Szczecin, Poland e-mail: [email protected]
discussion presented herein is related both to community-built database systems and to other community-built systems, such as a knowledge repository [1, 2] or a digital library [3].

The motivation model belongs to the class of incentive models and is oriented toward maintaining control of socioeconomic systems. Generally speaking, the incentive model consists of elements induced from a community of actors performing certain actions [4]. The main assumption of the incentive model is that an agent's strategy is defined by a choice of action. The system's strategy is a choice of the incentive function (being the control parameter) – a mapping of agents' actions onto the set of obtainable rewards (remuneration) [5]. In the case of community-built systems, it is impossible to have a predefined operational center; the community members switch their roles (editors and creators) continuously. The model can be used to simulate various community-built information systems. In the provided case study, an e-learning information system is analyzed.

The actors share knowledge in order to build up community knowledge, using the efficient and effective collaborative authoring and communication tools that the community-built system provides. For example, a Wiki provides opportunities for collaborative knowledge building in order to create community knowledge [6]. The community knowledge can be represented by a computer-friendly ontology representation [7]. As in a knowledge-based system, the goal of community members can be defined by expressing a member's ontology relative to the commonly shared ontologies. Hence, the main research issues are: the development of a distributed knowledge model [8], the construction of a collaborative ontology [9], and the integration of ontologies and its attendant issues [10]. In this chapter, the motivation model supports a process of achieving consensus on commonly shared ontologies [11]. In the literature, there are other collaborative systems based on the concept of ontology, like a workflow-centric collaboration system [12] or collaborative semantic tools for domain conceptualization [13]. However, none of them supports the concept of a community-built system.

This chapter investigates the development of a motivation model aimed at supporting the activity of users in the process of developing and maintaining a community-built information system. The proposed motivation model focuses on the task of filling the community-built system with high-quality content. Furthermore, the motivation model includes interactions and collaboration between users. Hence, the community-built system is treated as a knowledge repository. The structure of the motivation model and formal assumptions regarding the evaluation of the model will be described. Moreover, an example of a motivation model application will be presented as a case study.

The chapter is organized in the following manner: in the second section, the process of networked knowledge processing is discussed from the perspective of a knowledge network. The authors believe that the emergence of the knowledge network is one of the main reasons for the existence and development of community-built information systems. The third section covers certain elements of the concept of a community-built system. Particular attention is paid to the processes of collaboration and cooperation and to the Wiki mechanism. In the fourth section, the motivation model
is introduced along with the supporting case study: implementation of the motivation model in an e-learning system.
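The incentive-function view sketched earlier in this introduction can be made concrete with a toy numerical example. Everything in the following sketch (the action set, the cost function, and both reward functions) is an illustrative assumption only; it is not part of the formal model developed later in Sect. 8.4.

```python
# The system picks an incentive function sigma: action -> reward; each agent
# then picks the action that maximizes reward minus personal effort cost.

def best_response(actions, sigma, cost):
    """Return the action an agent chooses under incentive function sigma."""
    return max(actions, key=lambda a: sigma(a) - cost(a))

actions = [0, 1, 2, 3]                   # e.g., number of content items contributed

def cost(a):                             # agent's effort cost per contribution
    return 1.5 * a

def flat_reward(a):                      # incentive independent of effort
    return 1.0

def linear_reward(a):                    # incentive proportional to effort
    return 2.0 * a

print(best_response(actions, flat_reward, cost))    # -> 0: no reason to contribute
print(best_response(actions, linear_reward, cost))  # -> 3: contribution is induced
```

The point of the illustration is simply that the system's choice of incentive function changes which action is rational for the agent, which is the lever the motivation model exploits.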
8.2 Networked Knowledge Processing
Knowledge creation and processing are two processes that are based on an individual's cognitive capabilities and his or her access to resources (i.e., the objects of processing). The networked organization of knowledge (networked knowledge) ensures access to resources and, in addition, creates a cognitively responsive environment. Networked knowledge can produce units of knowledge whose informational value is far greater than the mere sum of the individual units, i.e., it creates new knowledge [14]. The information resources available on the Internet, paired with appropriate semantics, constitute the networked knowledge.

The phenomena of synergy and emergence that appear in a community-built system are the most significant purposes of processing the networked knowledge. Synergy occurs when individuals with similar knowledge resources are mutually related; as a result of the relation, the quality of the initial knowledge potential and of the processed knowledge increases. Emergence occurs when individuals with similar objectives are connected. In this case, the problem can be discussed from the point of view of different domains. As shown by many examples, most often only the application of a multidisciplinary approach leads to adequate solutions.

The knowledge network is the operational environment of a community-built system, and it allows the processing of networked knowledge. The main objective of the knowledge network is to enable users to easily share knowledge across the network. By sharing knowledge, individuals' expertise can be integrated to form a much larger and more valuable network of knowledge [15], such as Wikipedia.
8.2.1 The Concept of Knowledge Network
Generally speaking, the knowledge network is a network through which all available knowledge is properly represented, correlated, and accessed [16]. The knowledge network identifies “who knows what” [17]. The structure of a knowledge network is based on a production network and a social network [18]. A knowledge flow is the relaying of knowledge between nodes according to certain rules and principles of a production network [19]. On the social network level, a knowledge node is a team member or role, or a knowledge portal or process. From the point of view of the community-built system, the community-built engine (e.g., wiki) is a hub in the production network. Additionally, the community-built engine is the most important element of a social network, representing the knowledge community. The knowledge network views knowledge in terms of the main flow elements. A knowledge flow starts and ends at a determined node. The node can generate,
learn, process, understand, synthesize, and deliver knowledge [19]. Knowledge network operations include knowledge sharing and knowledge transfer [18]. Knowledge sharing covers the profundity and depth of sharing; knowledge transfer includes the transport, absorption, and feedback of knowledge. The nodes in the knowledge network include individuals, as well as aggregates of individuals, such as groups, departments, organizations, and agencies. Increasingly, the nodes also include nonhuman agents such as community-built engines, knowledge repositories, websites, content, and referral databases [20].
8.2.2 The Problem of Knowledge Representation in the Knowledge Network
Reality is defined by an unlimited and diverse set of information and stimuli that attract the human perception system [21]. Cognitive science assumes the mind's natural ability to conceptualize [22]. The conceptual scheme exists in every domain [23] and informs the domain's boundaries and paradigm. The conceptual model of the knowledge domain can be created as an ontology, where the concepts are the atomic, elementary semantic structures [24, 25]. A concept is a nomination of classes of objects, phenomena, and abstract categories [26]. For each of them, common features are specified in such a way that every class can be easily distinguished. From the practical point of view, an ontology is a set of concepts from a specific domain.

The modeling approach, considered as a cognition tool, assumes some level of simplification of the examined object. The ontological approach is based on the following assumptions [27]: (1) the ontology describes the fundamental knowledge, (2) the ontology is built by the subject experts of the specified domain, (3) the concepts are the ontology nodes, connected by relations, (4) the ontology can be considered as an unordered graph, and (5) the concepts have to be unique across the ontology. The community-built database system (repository) is designed to present the philosophical, scientific, scientific-technical, and scientific-technological state of the selected domain [28]. Using the repository, the elements of domain knowledge are shared, mainly in the form of knowledge objects, which are interpreted as modules of knowledge that emerge as a result of analysis and division of knowledge into "pieces" [27].
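The graph view of an ontology described above can be sketched very compactly in code. The sketch below is only an illustration: the Ontology class, its method names, and the example concepts and relation labels are assumptions made for this example, not an implementation used in the chapter.

```python
from dataclasses import dataclass, field

@dataclass
class Ontology:
    concepts: set[str] = field(default_factory=set)                    # W^D: nodes (unique concepts)
    relations: set[tuple[str, str, str]] = field(default_factory=set)  # K^D: arcs (labeled relations)

    def add_relation(self, source: str, label: str, target: str) -> None:
        # Concepts must be unique across the ontology; storing them in a set enforces that.
        self.concepts.update({source, target})
        self.relations.add((source, label, target))

    def neighbours(self, concept: str) -> set[str]:
        """Concepts directly related to a given concept (direction ignored: unordered graph)."""
        return ({t for s, _, t in self.relations if s == concept}
                | {s for s, _, t in self.relations if t == concept})

domain = Ontology()
domain.add_relation("database", "has_part", "schema")
domain.add_relation("schema", "describes", "table")
print(domain.neighbours("schema"))   # {'database', 'table'}
```

Such a structure is enough to support the coverage-style reasoning used later in the motivation model, where solved tasks contribute subgraphs of the domain ontology.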
8.2.3 Role of Users in the Knowledge Network
The knowledge processing "actors" in the knowledge network include knowledge workers (knowledge individuals) [29], as well as knowledge systems [15]. The schema of the actors' connections affects the quality of the knowledge network. As found in [15], the typical mistakes in maintaining connections can be identified as: actors who are connected only to other actors with low expertise levels; actors who receive
knowledge only by means of knowledge transfer with low viscosity; loosely connected subgroups that do not profit from the knowledge in other groups; dependency on a few actors for preservation of the network, without whom only loosely connected subgroups would remain; and actors who are not well integrated into the network because they have no, or only a few, relationships with other actors.

Every actor in the network plays some role, and an actor can often perform more than one role in the knowledge network. The actor's role is the actor's response to the network's demands, and such an actor desires adequate acknowledgement of his intellectual potential and available resources in a proper network context. Based on [15], the following roles can be identified in the knowledge network: a knowledge creator is an actor who creates new knowledge that is used by others in the organization; a knowledge sharer (knowledge broker) is an actor who is responsible for sharing knowledge that is created by the knowledge creators; a knowledge user is an actor who depends on knowledge for executing his or her job. In the community-built system, the knowledge broker role is maintained by the system, which creates the operational environment for the knowledge network. In addition, in the community-built system, the role of editor is introduced. The editor is responsible for maintaining the quality of content and its estimation in the knowledge network.

Every role that exists within the community-built system environment requires certain skills and behaviors [30]:

– Cognitive skills: writing and constructive editing skills (skills in research, writing, and editing), web skills (accessing the Internet, using web browsers, tracking logins and passwords, writing with embedded Wiki HTML editors, and working with digital images or other web media), group process skills (being able to set goals; communicate clearly; share leadership, participation, power, and influence; make effective decisions; engage in constructive controversy; and negotiate conflict).
– Personal characteristics: openness (opening up each contributor's ideas to scrutiny and criticism, so that others can modify, reorganize, and improve any contributions), integrity (integrity can be perceived through the accountability, honesty, and competence of each contributor), self-organization (requires metacognition, the ability to self-assess, and the ability to adjust to feedback from the environment).
8.3 The Concept of a Community-Built System
A community-built system can be defined as a system of virtual collaborations organized to provide an open resource development within a given community [31]. According to [32], virtual collaborations are a type of collaboration in which individuals are interdependent in their tasks, share responsibility for outcomes, and rely on information communication technology to produce an outcome, such as shared understanding, evaluation, strategy, recommendation, decision, action plan,
or other product. Moreover, in order to create the community-built system, we need a common goal or deliverable [33]. In the case of the community-built system, the common goal is to develop an open resource. The open resource contains content consistent with the concept of community work.

One of the main characteristics of a community-built system is a common purpose for collaboration. As described in [34], many activities in the field of learning, education, and training are collaborative in nature, involving turn-taking, statement-and-response, or multithread discussions which, in turn, occur among participants over periods of time ranging from seconds to entire human generations. Usually, people collaborate because they are assigned tasks that they cannot perform alone, so they are driven to collaborate with others [33]. However, in the case of community-built systems, the social aspect of people gathering in a community becomes central. The community of collaborators is accelerated by the social context, motivational aspects, distributed cognition, and the learning community [35]. In the community-built system, the main process of knowledge collaboration involves co-construction of knowledge, collaborative knowledge construction, and reciprocal sense making [36].

A community-built system's operational environment is defined as a collaborative service provided within a collaborative workplace supporting the collaborative activities of a collaborative group [34]. Collaborative systems, groupware, or multiuser applications allow groups of users to communicate and cooperate on common tasks [37]. Such a system is modeled as a dynamic and interindependent, diverse, partially self-organizing, fragile, and complex adaptive system [38]. Based on [39], we can formulate key assumptions necessary for community-built system success: (1) knowledge is created as it is shared; (2) individuals have prior knowledge that they can contribute during a discussion/collaboration; (3) participation is critical to open resource development; and (4) individuals will participate if given optimal conditions. The community-built system operates asynchronously, and this pattern supports cooperation rather than competition between individuals [40].
8.3.1 The Collaboration Process
The collaboration process is the driving force behind the operation of a community-built system. During the collaboration, individuals work jointly. It is important to distinguish between processes involving collaboration and coordinated work [41]; in collaboration, no control element is required. The collaboration between individuals strongly depends on the topology of their relationships [42]. It is rarely possible to maintain an all-to-all connectivity pattern; more likely, the connection pattern is limited to regular graphs, lattices, or other similar structures. The collaboration process might be analyzed from different perspectives that depend on the context (see Table 8.1). Knowledge sharing is the most important facet of the community-built system. We assume that every individual who takes
Table 8.1 Different aspects of the collaboration process

Definition: Collaboration is an interaction between multiple parties (two or more). All parties are doing work with a shared purpose or goal, and all parties will benefit from their efforts. In addition, collaboration can happen across boundaries. (Based on [33])
Context: Organization. Key points [33]: collaboration, like all relationships, is often a trade; both parties invest, and both parties receive something in return.

Definition: Collaboration is a conversational, relatively unstructured, iterative, but nevertheless active process during which the participants work together to achieve a goal and reach a decision or a solution. (Based on [43])
Context: Learning. Key points [44]: collaboration requires consensus, mutual understanding, reciprocity, and trust.

Definition: Collaboration refers to informal, cooperative relationships that build the shared vision and understanding needed for conceptualizing cross-functional linkages in the context of knowledge-intensive activities. Collaboration facilitates the acquisition and integration of resources through external integration and cooperation with other cooperative or supporting agents, conducted on a basis of common consensus, trust, cooperation, and sharing by a multifunctional team of experienced knowledge workers. (Based on [8])
Context: Knowledge sharing. Key points [8]: collaboration requires complete understanding and effective sharing of information and knowledge throughout the development cycle.
part in the collaboration process wants the repository content to be evaluated. When an individual objects to the repository content, the disputed knowledge object is created or edited.
8.3.2 Wiki: An Example of the Community-Built System
Let us analyze a Wiki system as a working example of the community-built system. A Wiki is a special kind of website, the contents of which can be edited by visitors [45]. It is a web-based hypertext system that supports community-oriented authoring, in order to rapidly and collaboratively build content by editing pages online in a web browser [37, 46, 47]. The Wiki also represents a class of asynchronous communication tools [61, 62]. The main differences between a Wiki and other tools (such as blogs or threaded discussions) are the following [30]: content authorship is collaborative (not single or multiple), the content is dynamic in nature, and the construction is nonlinear and multipage. Technically, Wiki systems consist of four basic elements [48]: (1) the content, (2) a template that defines the layout of the Wiki pages, (3) a Wiki engine, the software that handles all the business logic of the Wiki, and (4) a Wiki page, the page that is created by the Wiki engine to display the content in a web browser. Wiki server technology enables the creation of associative hypertexts with nonlinear navigation structures [46]. The Wiki mechanism is found to be popular in
several real-life applications, such as a digital library [49] and communication and collaboration in the chemistry domain [50]. In Table 8.2, the main features of Wiki are presented and mapped onto the features of the community-built system.

Table 8.2 Main features of the Wiki and the community-built systems

Open: If any page is found to be incomplete or poorly organized, any reader can edit it as he/she sees fit (Wiki, based on [48]). Each member can create, modify, and delete the content (community-built system, in the terms used by [51]). Corresponding community-built system feature: open source.

Organic: The structure and content of the site evolve over time. The structure and content are dynamic and nonlinear and can be rapidly constructed, accessed, and modified. Corresponding feature: rapidness.

Universal: Any writer is automatically an editor and organizer. Every member can shift roles on the fly. Corresponding feature: simplicity.

Observable: Activity within the site can be watched and revised by any visitor to the site. The system maintains a version database, which records its historical revisions and content. Corresponding feature: maintainable.

Tolerant: Interpretable (even if undesirable) behavior is preferred to error messages. The content must fulfill interface requirements; the quality is decided by the community. Corresponding feature: self-repair.
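To make the four basic Wiki elements and the "observable" (version database) feature from Table 8.2 more tangible, here is a minimal sketch. The WikiEngine class, its method names, and the tuple-based version store are assumptions introduced only for this illustration; they do not describe the API of any particular Wiki engine discussed in the chapter.

```python
from datetime import datetime, timezone

class WikiEngine:
    TEMPLATE = "== {title} ==\n{body}"          # trivial page layout (the "template" element)

    def __init__(self):
        self.versions = {}                       # title -> list of (time, author, body) revisions

    def edit(self, title: str, author: str, body: str) -> None:
        # Observable/no-overwrite behavior: every edit appends a new revision.
        entry = (datetime.now(timezone.utc), author, body)
        self.versions.setdefault(title, []).append(entry)

    def render(self, title: str) -> str:
        # The "Wiki page" shown in a browser is built by the engine from the latest revision.
        _, _, body = self.versions[title][-1]
        return self.TEMPLATE.format(title=title, body=body)

    def history(self, title: str):
        return self.versions.get(title, [])

wiki = WikiEngine()
wiki.edit("Ontology", "creator_1", "First draft.")
wiki.edit("Ontology", "editor_1", "Improved draft.")
print(wiki.render("Ontology"))
print(len(wiki.history("Ontology")))   # 2 revisions recorded, so any visitor can inspect the change
```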
8.4 Motivation Model
We will review a motivation model designed to support the activity of editors and creators during the processes of implementing and using community-built information systems. A comparable motivation model, focused on student and teacher cooperation, can be found in [28, 52]. The main difference between these two motivation models lies in the characteristics of the roles: creators can change roles and become editors, whereas teachers and students cannot. Our research will be limited to the repository information system. The proposed motivation model is oriented toward the task of supplying the repository with high-quality material.
8.4.1 The Concept of Motivation
A motif (a reason for action) is a consciously understood need for a certain object, position, situation, etc. It is therefore acceptable to claim that the motif comes from a requirement, becomes its current state, and leads to certain actions [53]. During the realization of the mentioned sequence “need–motif–action,” at each step, we deal with the following decision-making situations: many motifs can lead to a certain
action, many needs can make up one motif, and many motifs come out of a single need [52]. Motivation, like other cognitive processes, cannot be observed directly [54]. This means that it is possible to define only the quantitative relationships between the choice-specific parameters through exterior registration of the choice results. Motivation is a set of several components. The creator's attention, relevance, confidence, and satisfaction are paired with the creator's motivation in the four-factor theory of the ARCS Motivation Model [55]. The ARCS model also contains strategies that can help an editor to stimulate or maintain each motivational element. In [56], it is shown that personally valued future goals are the core of motivation. Moreover, cultural discontinuities and limited opportunities within the creators' working environment may weaken the future motivational force [57]. The environment where motivational action takes place and the object to be motivated strongly influence the form and content of the motivation model. A diverse environment creates individual needs for the motivation model because of multicultural differences [58]. Moreover, the motivation model can be designed for artificial or human objects. In [59], a motivation model for virtual humans, such as nonplayer characters, has been proposed; that model is based on overlapping hierarchical classifier systems, working to generate coherent behavioral patterns.
8.4.2 Interpretation of the Motivation Model in the Context of the Community-Built System
The process of content creation always includes social, research, and educational aspects and occurs across the cognitive, information, and computer-based levels of community-built systems. At each of these levels, the editors and the creators have their own roles and levels of involvement. At the cognitive level, several assumptions can be made about the content and the related tasks. At the information level, information is exchanged between the participants of the creation process. Lastly, the computer-based level is characterized by the community-built system's makeup and the participants' ability to use it.

In the community-built system, motivation plays a different part in the content creation and modification processes. The creator is a person responsible for creating and improving the content. The editor is a person responsible for content assessment. The editor's assessment process is based on his/her personal knowledge. From the point of view of knowledge engineering, the editor's ontology is projected onto the community-built system's content. As a result of an intelligent comparison of the two, the editor either accepts the content or, using the community-built system's tools, assumes the creator role and improves the content on his/her own (Fig. 8.1).
Fig. 8.1 Relationship between the creator and the editor in community-built systems

Fig. 8.2 Schema presenting the process of filling in the community-built system
The editor's main motivation is content verification, so that the content repository reflects his own understanding of the knowledge. Hence, the content is the editor's main focus of interest, and the editor is responsible for the development of the repository. More importantly, the editor is the initiator of motivation. Repository development is carried out through the editor's actions, such as the selection of a concept to be improved by a given creator. The creator's main motivation is to create high-quality content. The quality of the creator's work depends on its conceptual depth. Moreover, the creator's and editor's motivations can be influenced by both personal and social factors. The community-built system can support this with special tools and by excluding anonymous input.

The discussed motivation model is applied in the content formation situation (Fig. 8.2). The goal of the content formation process is to create high-quality content as the result of the interaction between the actions of content creation, editing, and populating the content repository. The content in community-built systems can be described as an ontology that is stored in the knowledge repository. Both the editors and the creators interact in an environment encompassing community-built systems. The editor's and creator's motivation functions are task dependent. On the one hand, a task may be formulated to deal with a wide range of concepts; on the other hand, a task may require further examination of a single concept. Moreover, every concept is characterized by its cognitive workload. Both the creator and the editor operate in the community-built system's environment based on the selected
task. Every task is selected by the creator or the editor with respect to their motivation functions. It is noteworthy that every creator has his/her own motivation function, while all of the editors have one common motivation function. Similarly to [52], the motivation model can be developed as a particular game scenario, where the activities of the editor and the creator are supported by their own interests. The development of the motivation model in a specific content formation situation is possible once the following requirements have been met:

– A choice is made of a specific content creation situation
– The result of a choice action can be registered
– A system for assessment of the results is in place
– The creator and the editor have access to observations and evaluations of the choices that have been made
– The result of a choice has to be deemed by the creator to be a desired one (i.e., usability of the result)
– The creator has to be certain that the desired result can be achieved in a given formation situation with probability greater than zero (i.e., subjective probability of achieving the result)
8.4.3 Scenario of a Community-Built Systems Interaction
Speaking from the perspective of the community-built system, the motivation can be identified as a certain game scenario (interaction and interplay) between the content's editor and the creator, both performing actions in a specific content formation situation [52]. The scenario is aimed at influencing the creator's involvement in the community-built system and at supplying the community-built system repository with new content. The creator's task can be understood as filling the community-built system's repository with content related to a given concept. Figure 8.3 shows the schema of supplying the community-built system repository.

The proposed scenario assumes that the role of the creator is to choose a task after the preprocessing phase. The purpose of the preprocessing phase is to estimate the repository content with regard to the creator's interest (the creator's preference). Every task is related to some part of the content development process; hence, each new concept is represented by a new task. The creator deals with the task, and the task quality depends on the correctness of the development process and the complexity level of the task. Both the task solved by the creator and the assessment accepted by the editor are placed in the repository and will serve as a base solution for other creators. We will assume that this is sufficient to stimulate involvement and that it has a positive influence on the content; this leads us to conclude that it is satisfactory as a part of the creator's motivation function. At the same time, supplying the repository with a wide spectrum of high-quality content gives satisfaction to the editor and makes up the editor's motivation function.

The editor's and the creator's interaction with the community-built system's environment can take on a scientific quality. We will assume that the subject range of
Fig. 8.3 The scenarios of supplying and using the community-built system repository
the content in the community-built system repository is in accordance with the editor’s and creator’s scientific and research interests. The content formation process can be described in terms of the ontology development process. The elements of the content are created on the basis of the ontology and differ in their levels of complexity [27]. The editor’s motivation function is oriented to maximizing the span of the domain with tasks (concepts) that have the following qualities: topicality of the
tasks' subject from the editor's point of view, and the individual resources he is prepared to assign to the creator for solving a certain task (for example, consultation time, access to scientific material, etc.). Maximizing the degree to which the repository is supplied with concepts is the editor's interest; in the end, the repository should reflect the complete ontology of a given subject. The criterion for placing a concept in the repository is decided by the editor on the basis of the concept's complexity level, graphical quality, language correctness, and other factors. Consequently, the possibility of realizing the editor's interests is limited by his resources, considering time in both quantitative and calendar terms, and by other informal preferences.

The creator's motivation function is formulated to maximize the satisfaction of his own interests while choosing and solving the task, under given constraints regarding time (his own as well as the editor's) and the way the resulting effects are graded. The creator's interests rely on individual preferences and can be described using two opposing groups of creators (similarly to [52]). The first group is concerned mainly with achieving a minimal acceptable success level, meaning meeting only the basic requirements for obtaining a positive opinion about the task (low complexity of the task, minimal acceptable quality) and saving the maximal amount of their time. The second group of creators is interested in providing the community-built system repository with the maximal possible success level, implying creating and editing content of high complexity in order to produce the best overall quality.

The assertions presented above indicate that both motivation function types depend on the complexity level of the task and have common constraints related to time. Supplementing the repository with new tasks can be interpreted as an accrual of this knowledge resource, and increasing the motivation of both the creators and the editors positively affects and accelerates the accrual of this resource. The parameters describing the activities of both stakeholders, the creator and the editor, are the main elements of the motivation model in the community-built system (from now on referred to simply as the "motivation model"). A measure of the success of their cooperation is the accrual of knowledge in the community-built system repository, which can be evaluated through the intensity of supplying it with properly created content. Upon developing a motivation model, one has to consider a very important factor: the stochastic character of the creators' and editors' arrivals, which is mainly a result of individual work routines, and the stochastic character of the creators' and editors' motivation parameters.

The motivation model regulates the process of the creator choosing a task to be solved within the scope of a certain subject on the basis of his/her own motivation function, while taking into consideration the editor's requirements and preferences. The entire process, from the moment of formulating tasks to the moment of evaluating them and placing them in the repository, followed by creating a new set of tasks waiting for the next group of creators prepared to address them, can be described as a game scenario. The new tasks are a result of the editor's work. Modeling a game scenario requires formulating the motivation and goal functions of the game participants with regard to the repository supply process.
8.4.4 Formalization of the Motivation Problem
Let us consider the basic components of the motivation model. The proposed motivation model is an interpretation of the motivation model presented in [28, 52].

1. Development process participants. $E = (e_1, e_2, \ldots, e_j, \ldots)$ – editors responsible for task creation and assessment (leads of the subject, disposers of the subject repository); $p(E) = \{w_e, \lambda_e\}$ – parameters of the stochastic process of editors' arrivals; $C = (c_1, c_2, \ldots, c_z, \ldots)$ – creators coming to choose and solve tasks (i.e., to develop content in accordance with the task); $p(C) = \{w_c, \lambda_c\}$ – parameters of the stochastic process of creators' arrivals, where $w$ – distribution law and $\lambda$ – intensity of arrivals. We take the processes $p(E)$ and $p(C)$ to be Markovian, meaning that they have a stationary, memoryless, and sequential character.

2. Domain ontology. $G^D = \{W^D, K^D\}$ – ontology graph, where $W^D = \{w\}$ – nodes of the graph (basic concepts) and $K^D = \{k\}$ – arcs of the graph (relations between concepts).

3. Task set. $R = \{r_i\}$, $i = 1, 2, \ldots, i^*$ – the set of tasks within the frame of a domain $D$; $P = \{Q(r_i), A(r_i)\}$ – parameters of task $r_i$, where $Q(r_i)$ – complexity level of task $r_i$ and $A(r_i)$ – the task's topicality for the repository. The editors determine a task's topicality based on the task quality.

4. Editor's motivation function. $s^E$ – motivation function of the editors $e$. The editor's motivation function depends on the task parameters, $s^E = s^E(Q(r_i), A(r_i))$, and defines the resources assigned to every task $r_i$. The resources can be described by a vector $X(r_i)$ and mainly cover the following items: consultation time, data storage space, time of access to telecommunication channels, etc. The editor's motivation function $s^E$ is a monotonically increasing function of the discrete argument $Q(r_i)$, $i = 1, 2, \ldots, i^*$.

5. Creator's motivation function. $s^{C_z}$ – motivation function of each creator $c_z$. In the general case, the function depends on the task parameters, $s^{C_z} = s^{C_z}(Q(r_i), A(r_i))$. From the point of view of content development, the whole group of creators can be divided into the two extreme groups mentioned above. For the first group of creators (interested in achieving the minimal acceptable success level), the motivation function $s^{C_z}$ is a monotonically decreasing function of the discrete argument $Q(r_i)$, $i = 1, 2, \ldots, i^*$, i.e., of the task complexity. For the second group of creators (interested in filling the repository with the maximal possible success level – the best quality), the motivation function $s^{C_z}$ is a monotonically increasing function of the same argument $Q(r_i)$.

6. Goal function of the creator's task choice. By the effectiveness of the decision made, we understand the correlation between the maximal satisfaction of both the creator's and the editor's interests
and the total maximal motivation function. The editor's interest is satisfied by placing in the repository a properly solved task of a significantly high level of complexity. The creator's interests involve attaining a high task quality while minimizing time costs, which also depends on the level of complexity of the task. The decision-making process can be described by a binary variable $y_{ij}$:

$$y_{ij} = \begin{cases} 1 & \text{if creator } c_z \text{ chooses task } r_i, \\ 0 & \text{otherwise.} \end{cases}$$

Then, the goal function has the following structure:

$$F(y_{ij}) = \alpha s^E + s^{C_z} \rightarrow \max_{Y},$$

where $Y = \{y_{ij}\}$, $i = 1, 2, \ldots, i^*$, $j = 1, 2, \ldots, j^*$, and $\alpha$ is a weighting coefficient. Both elements of the goal function depend on the same argument $Q(r_i)$. The element $s^E$ is a monotonically increasing function of the argument $Q(r_i)$, while $s^{C_z}$, depending on the kind of creator, is a monotonically decreasing or increasing function of the same argument $Q(r_i)$.

7. Editor's goal function of repository filling. The editor's goal function reflects the influence of each task on the accumulation of knowledge in the repository ($\Delta W$). The current state of knowledge in the repository is characterized by two parameters: the domain ontology graph $G^D$ and the level of its coverage with properly solved tasks that are topical for the editor – the graph $G^P$. Each solved task $r_i$ ensures a proper accrual of knowledge in the repository, $G(r_i) \subseteq G^P \subseteq G^D$, meaning $\Delta W(r_i) = G^D \cap G(r_i)$. We assume that tasks with a higher complexity level provide a greater increase of knowledge than tasks with a low complexity level. Let us consider:

$G^D = \{W^D, K^D\}$ – domain ontology graph,
$C = (c_1, c_2, \ldots, c_z, \ldots)$ – creators coming to choose and solve tasks,
$\tilde{t}(c) = (t_1(c_1), t_2(c_2), \ldots, t_z(c_z), \ldots)$ – stochastic process of creators' arrivals,
$\tilde{R} = \{r_i(c_z)\}$, $r_i \in R$, $c_z \in C$ – the set of tasks chosen by creators according to their goal function

$$F(y_{ij}) = \alpha s^E + s^{C_z} \rightarrow \max_{Y}, \quad i = 1, 2, \ldots, i^*, \; j = 1, 2, \ldots, j^*,$$

$G(r_i(c_z)) = \{W_{iz}, K_{iz}\}$ – ontology subgraph of the solved task $r_i(c_z)$,
$G^P$ – summary graph of the ontologies of tasks placed in the repository in the interval $t \in [0, T_0]$:

$$G^P = G(r_1(c_1)) \cup G(r_2(c_2)) \cup \cdots \cup G(r_i(c_z)) \cup \cdots, \quad \text{where } t_1, t_2, \ldots, t_j, \ldots \in [0, T_0],$$
U(T0) – accrual of knowledge in the repository in the interval t ∈ [0, T0]: U(T0) = G^D ∩ G^P. Within a given calendar interval [0, T0], the knowledge accrual in the repository has to be maximal: U(T0) = G^D ∩ G^P → max.
8.4.5 Formulating the Repository Motivation Model
For the given:
1. Domain D and its ontology graph G^D
2. Set of tasks R = {r_i} and task parameters P = {Q(r_i), A(r_i)}
3. Stochastic creators' arrival pattern p(C) and arrival process parameters {w_c, λ_c}
4. Stochastic editors' arrival pattern p(E) and arrival process parameters {w_e, λ_e}

One has to:
(a) Form the editor's motivation function s^E regarding repository filling
(b) Optimize each arriving creator's c_z goal function of the task choice r_i(c_z) according to his/her motivation function s^Cz:
F(y_ij) = a·s^E + s^Cz → max over Y
(c) Choose properly solved tasks among the ones produced by creators according to the condition G^D ∩ G(r_i(c_z)) ≠ ∅, and supplement the repository with the chosen tasks:
G^P = G(r1(c1)) ∪ G(r2(c2)) ∪ ... ∪ G(r_i(c_z)) ∪ ...

Repository filling criterion: U(T0) – knowledge accrual in the repository within the calendar interval [0, T0]:
U(T0) = G^D ∩ G^P → max
Constraints:
(a) Summary resources (time-related, technical, didactic, staff) offered to creators for solving tasks:
Σ_{c_z ∈ C} x(r_iz)·y_zi ≤ X,
where x(r_iz) – resources assigned to task r_i(c_z), X – summary resources for the subject led by the editor.
(b) Calendar interval t ∈ [0, T0] given to creators for choosing and solving tasks:
min_z t_start(r_iz) ≥ 0,  max_z t_end(r_iz) ≤ T0,
where t_start(r_iz) and t_end(r_iz) – the moments at which creator c_z starts and ends solving task r_iz.

The presented motivation model can be solved using an analytical method; however, it is also well suited to a simulation approach. On the basis of simulation, various characteristics of the motivation model can be evaluated. For example, the simulation approach allows us to recognize knowledge network stagnation (e.g., based on the birth–death process approach) and to devise a method for accelerating the knowledge network dynamics.
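The repository-filling criterion defined above is essentially a set computation over ontology (sub)graphs. The following sketch is only a simplified illustration under the assumption that ontologies can be reduced to sets of concept identifiers; the example data and helper names (domain_ontology, knowledge_accrual, etc.) are invented and do not come from the chapter.

# Minimal sketch of the repository-filling criterion: U(T0) = G^D ∩ G^P.
# Ontology (sub)graphs are reduced to sets of concept identifiers; all
# names and example data below are hypothetical illustrations.

domain_ontology = {"graph", "node", "edge", "path", "ontology", "query"}

# Subgraphs G(r_i(c_z)) of tasks accepted into the repository over [0, T0].
solved_task_subgraphs = [
    {"graph", "node", "edge"},          # task r1 solved by creator c1
    {"path", "query"},                  # task r2 solved by creator c2
    {"ontology", "annotation"},         # task r3: partly outside the domain
]

def knowledge_accrual(domain, task_subgraphs):
    """Return U(T0): the part of the domain ontology covered by the repository."""
    repository = set().union(*task_subgraphs)    # G^P = union of task subgraphs
    return domain & repository                   # G^D ∩ G^P

accrual = knowledge_accrual(domain_ontology, solved_task_subgraphs)
coverage = len(accrual) / len(domain_ontology)
print(f"covered concepts: {sorted(accrual)}  ({coverage:.0%} of the domain)")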
8.4.6 Case Study: Motivation Model Implementation in E-Learning
The discussed motivation model can be interpreted in terms of the simulation model. The motivation model has been adapted to an e-learning system in order to simulate the community-built system, where both groups of actors – students and teachers – work jointly, aiming to provide high-quality content to the knowledge repository. The case study is based on [28]. In that article, the authors propose a model of the educational and social agent collaboration between students, teacher, and an e-learning information system (the repository). Such a system is an example of the social learning process [60]. The system will support the competency-based learning process by means of creating a social network in order to exchange its ontology in the repository environment. The student is included in the repository development process. The idea is to expand the student’s knowledge during the learning process and record his achievements relying on an external (pertaining to the market) e-portfolio mechanism. The students work with their teacher according to a certain motivation model [28]. The teacher (the editor) is interested in having the repository developed in a specific direction. The student (the creator) is interested in developing his competencies
and improving the e-portfolio components. The information system provides appropriate information tools and knowledge-portioning methods and defines the collaboration between teachers and students. As a result, a collaboration environment for knowledge management in a competence-based learning system is obtained. The behavior of the participants in the learning process can be defined using appropriate motivation functions. The key goal is to find a balance between the motivation function of the teacher and that of the student. The teacher's motivation function can be represented as

s^N(r_i) = x(r_i),

where x(r_i) – resources assigned to solving a task, e.g., didactic materials, teacher's time, software, hardware, etc. The student's motivation function is

s^Sj(r_i) = F(W(s_j), H(r_i), C^Sj(r_i), F^S),

where W(s_j) – student's base knowledge, H(r_i) – grade/mark that can be given by the teacher, C^Sj(r_i) – costs borne by the student in solving the task, F^S – student's other preferences (e.g., the student's goals and constraints in the learning process).

In the case study, the motivation model is simulated using the ARENA software (Fig. 8.4) as a collaboration process. The collaboration process between students and the teacher can be interpreted as a queuing system with the following features:
– For a specified repository content, it can be assumed that the teacher's work consists of checking tasks.
– The result of the checking process is one of: (1) a positive mark without placing the task in the repository, (2) a positive mark and placing the task in the repository, (3) a negative evaluation – repeating the task solving to obtain a positive mark, or improving a well-done task for the purpose of developing the repository.
– For a specified subject, time, and group of students, the teacher's work can be treated as a server with a specified entry, exit, and average evaluation time.
Fig. 8.4 ARENA model of motivation model in e-learning
– The average evaluation time results from the experience of the teacher (the specifics of each course, the subject, the tasks' difficulty level, the type of students, and the time assigned to a subject in the learning program).
– The students' stream flow is stochastic and Markovian.
– Students are served on one server. A queue may also form, characterized by a specific service time and method of service.

The simulation experiment allows us to analyze the queue parameters at the teacher's workstation (the "tasks checking" component in Fig. 8.4), as well as to determine the students' service time for specified input parameters. For example, the following simulation scenario can be evaluated.

Given: 55 students, a time interval of 6 days, daily time for tasks' examination of 3 h, expected time per student of 20 min, a correction time of 1 day, and the predicted probabilities of task evaluation outcomes: 70% – exit with promotion without repository development, 15% – placing the solution in the repository, 15% – sending back for correction.

One has to: define the total working time of the teacher assigned to task checking with the specified output probability distribution, specified students' service distribution, and estimated time for task checking.

Results: With a time interval limited to a maximum of 6 days, a queue on the teacher's side will emerge. Using the same input assumptions, we can establish how many days the teacher needs in order to evaluate all the tasks – it turns out to be almost 8. Finally, the simulation allows evaluating parameters such as limited access to software and hardware resources, the maximum size of the social network queue, and the cost related to the teacher's work. Using the statistical data, the group management strategy can be modified and appropriate adjustments can be made in the repository development plan.
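The queuing scenario above can also be approximated outside ARENA with a small Monte Carlo sketch. The code below is only a rough illustration under assumed exponential service times; the parameter values mirror the scenario (55 students, 3 h of daily checking, about 20 min per student, a 15% resubmission rate), but the function names and modeling shortcuts are assumptions of this sketch, not the chapter's ARENA model.

import random

# Rough Monte Carlo sketch of the case-study scenario (not the ARENA model):
# 55 students, 3 h of task checking per day, ~20 min per student on average,
# and roughly 15% of submissions coming back once for correction.
random.seed(1)

def simulate_once(n_students=55, mean_service_min=20.0,
                  daily_capacity_min=180.0, p_resubmit=0.15):
    total_minutes = 0.0
    for _ in range(n_students):
        total_minutes += random.expovariate(1.0 / mean_service_min)   # first check
        if random.random() < p_resubmit:                              # correction round
            total_minutes += random.expovariate(1.0 / mean_service_min)
    return total_minutes / daily_capacity_min   # days of teacher work needed

days = [simulate_once() for _ in range(10_000)]
avg_days = sum(days) / len(days)
overflow = sum(d > 6 for d in days) / len(days)
print(f"average teacher workload: {avg_days:.1f} days")
print(f"probability the 6-day interval is exceeded: {overflow:.0%}")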
8.5 Conclusion
In this chapter, the community-built system has been identified as a form of knowledge network. The knowledge is modeled using the ontological approach. The approach allows the representation of knowledge in a formal way available for further computer processing. Moreover, the ontological approach is compatible with the concept of semantic web. In the future, the community-built system will be integrated with the semantic web. Subsequently, multiple software agents can become participants in the process of populating the knowledge repository. Using the motivation model, the activity of the community-built system can be analyzed equally on both technical and knowledge levels. The motivation model covers two functions important for motivation: that of the creator and that of the editor and describes their unique interests in supplying a knowledge repository using the Wiki mechanism. As shown in the case study, the motivation model can serve as a theoretical framework for different simulation models.
References 1. Jones, R.: Giving birth to next generation repositories. Int. J. Inf. Manage. 27(3), 154–158 (2007) 2. McGreal, R.: A typology of learning object repositories. In: Adelsberger, H.H., Kinshuk Pawlowski, J.M., Sampson, D. (eds.) Handbook on Information Technologies for Education and Training, 2nd edn, pp. 5–28. Springer, Heidelberg (2008) 3. Bj€ork, B.-C.: Open access to scientific publications-an analysis of the barriers to change. Inf. Res. 9(2), Paper 170 (2004) 4. Novikov, D.A., Shokhina, T.E.: Incentive mechanisms in dynamic active systems. Automat. Remote Contr. 64(12), 1912–1921 (2003) 5. Novikov, D.A.: Incentives in organizations: theory and practice. In: Proceedings of 14th International Conference on Systems Science. Wroclaw, vol. 2, pp. 19–29 (2001) 6. Sheehy, G.: The Wiki as knowledge repository: using a Wiki in a community of practice to strengthen K-12 education. TechTrends 52(6), 55–60 (2008) 7. Gomez-Perez, A., Fernandez-Lopez, M., Corcho, O.: Ontological Engineering. Springer, Heidelberg (2004) 8. Ho, C.-T., Chen, Y.-M., Chen, Y.-J., Wang, C.-B.: Developing a distributed knowledge model for knowledge management in collaborative development and implementation of an enterprise system. Robot. Comput. Integr. Manufact. 20(5), 439–456 (2004) 9. Farquhar, A., Fikes, R., Rice, J.: The ontolingua server: a tool for collaborative ontology construction. Int. J. Hum. Comput. Stud. 46, 707–727 (1997) 10. Ferna´ndez-Breis, J., Martiinez-Bejar, R.: A cooperative framework for integrating ontologies. Int. J. Hum. Comput. Stud. 56(6), 665–720 (2002) 11. Cress, U., Kimmerle, J.: A systemic and cognitive view on collaborative knowledge building with Wikis. Int. J. Comput. Support. Collab. Learn. 3(2), 105–122 (2008) 12. Yao, Z., Liu, S., Han, L., Reddy, Y.V.R., Yu, J., Liu, Y., Zhang, Ch., Zhaoqing, Z.: An ontology based workflow centric collaboration system. In: Shen, W., et al. (eds.) The Tenth International Conference on Computer Supported Cooperative Work in Design, CSCWD 2006. Lecture Notes in Computer Science, vol. 4402. Springer, Heidelberg, pp. 689–698 (2007) 13. Pereira, C., Sousa, C., Soares, A.L.: A socio-semantic approach to collaborative domain conceptualization. In: Meersman, R., Herrero, P., Dillon, T. (eds.) OnTheMove OTM Federated Conferences and Workshops 2009. Lecture Notes in Computer Science, vol. 5872, pp. 524–533. Springer, Heidelberg (2009) 14. Decker, S., Hauswirth, M.: Enabling networked knowledge. In: Klusch, M., Pechoucek, M., Polleres, A. (eds.) Cooperative Information Agents XII. Lecture Notes in Computer Science (Lecture Notes in Artificial Intelligence), vol. 5180, pp. 1–15. Springer, Heidelberg (2008) 15. Helms, R., Buijsrogge, K.: Knowledge network analysis: a technique to analyze knowledge management bottlenecks in organizations. In: 16th International Workshop on Database and Expert Systems Applications (DEXA’05), pp. 410–414 (2005) 16. Mulvenna, M., Zambonelli, F., Curran, K., Nugent, Ch: Knowledge networks. In: Stavrakakis, I., Smirnov, M. (eds.) The Second International workshop on Autonomic Communication (WAC 2005). Lecture Notes in Computer Science, vol. 3854, pp. 99–114. Springer, Heidelberg (2006) 17. Palazzolo, E.T., Ghate, A., Dandi, R., Mahalingam, A., Contractor, N., Levitt, R.: Modelling 21st century project teams: docking workflow and knowledge network computational models. In: CASOS 2002 Computational and Mathematical Organization Theory Conference, Pittsburgh, USA, 21–23 June 2002 18. 
Chen, J., Chen, D., Li, Z.: The analysis of knowledge network efficiency in industrial clusters. In: Proceedings of the 2008 International Seminar on Future Information Technology and Management Engineering (FITME 2008), IEEE Computer Society, Leicestershire, United Kingdom, pp. 257–260 (2008)
19. Zhuge, H.: Knowledge flow network planning and simulation. Decis. Support Syst. 42(2), 571–592 (2006) 20. Carley, K.: Smart agents and organizations of the future. In: Lievrouw, L., Livingstone, S. (eds.) Handbook of New Media, pp. 206–220. Sage, London (2002) 21. Kushtina, E., Zaikin, O., Ro´z˙ewski, P.: On the knowledge repository design and management in e-learning. In: Lu, Jie, Ruan, Da, Zhang, Guangquan (eds.) E-Service Intelligence: Methodologies, Technologies and Applications, Studies in Computational Intelligence, vol. 37, pp. 497–517. Springer, Heidelberg (2007) 22. Gomez, A., Moreno, A., Pazos, J., Sierra-Alonso, A.: Knowledge maps: an essential technique for conceptualization. Data Knowl. Eng. 33(2), 169–190 (2000) 23. Eden, K.: Analyzing cognitive maps to help structure issues or problems. Eur. J. Oper. Res. 159(3), 637–686 (2004) 24. Guarino, N.: Understanding, building and using ontologies. Int. J. Hum. Comput. Stud. 46(2–3), 293–310 (1997) 25. Sugumaran, V., Storey, V.C.: Ontologies for conceptual modeling: their creation, use, and management. Data Knowl. Eng. 42(3), 251–271 (2002) 26. Kushtina, E., Ro´z˙ewski, P., Zaikine, O.: Extended ontological model for distance learning purpose. In: Reimer, U., Karagiannis, D. (eds.) Practical Aspects of Knowledge Management, PAKM2006. Lecture Notes in Computer Science (Lecture Notes in Artificial Intelligence), vol. 4333, pp. 155–165. Springer, Heidelberg (2006) 27. Zaikine, O., Kushtina, E., Ro´z˙ewski, P.: Model and algorithm of the conceptual scheme formation for knowledge domain in distance learning. Eur. J. Oper. Res. 175(3), 1379–1399 (2006) 28. Ro´z˙ewski, P., Ciszczyk, M.: Model of a collaboration environment for knowledge management in competence based learning. In: Nguyen, N.T., Kowalczyk, R., Chen, S.-M. (eds.) 1st International Conference on Computational Collective Intelligence, Lecture Notes in Computer Science (Lecture Notes in Artificial Intelligence), vol. 5796. Springer, Heidelberg, pp. 333–344 (2009) 29. Marwick, A.D.: Knowledge management technology. IBM Syst. J. 40(4), 814–830 (2001) 30. West, J.A., West, M.L.: Using Wikis for Online Collaboration: The Power of the Read-Write Web. Jossey-Bass, San Francisco (2008) 31. Shen, D., Nuankhieo, P., Huang, X., Amelung, Ch, Laffey, J.: Using social network analysis to understand sense of community in an online learning environment. J. Educ. Comput. Res. 39(1), 17–36 (2008) 32. Wainfan, L., Davis, P.K.: Challenges in Virtual Collaboration: Videoconferencing Audioconferencing and Computer-Mediated Communications. RAND, Santa Monica (2005) 33. Ommeren, E., Duivestein, S., de Vadoss, J., Reijnen, C., Gunvaldson, E.: Collaboration in the Cloud – How Cross-Boundary Collaboration Is Transforming Business. Microsoft and Sogeti, Groningen (2009) 34. ISO 19778: Information technology – learning, education and training – collaborative technology – collaborative workplace. International Organization for Standardization (2008) 35. Lowyck, J., Poysa, J.: Design of Collaborative Learning Environments. Comput. Hum. Behav. 17(5–6), 507–516 (2001) 36. Fischer, F., Bruhn, J., Grasel, C., Mandl, H.: Fostering collaborative knowledge construction with visualization tools. Learn. Instruct. 12(2), 213–232 (2002) 37. Tolone, W., Ahn, G.-J., Pai, T., Hong, S.-H.: Access control in collaborative systems. ACM Comput. Surv. 37(1), 29–41 (2005) 38. Brown, J.S.: Leveraging technology for learning in the cyber age – opportunities and pitfalls. 
In: International Conference on Computers in Education (ICCE 98) (invited speaker) (1998) 39. Cole, M.: Using Wiki technology to support student engagement: lessons from the trenches. Comput. Educ. 52(1), 141–146 (2009) 40. De Pedro, X., Rieradevall, M., Lopez, P., Sant, D., Pinol, J., Nunez, L.: Writing documents collaboratively in Higher Education using traditional vs. Wiki methodology (I): qualitative
results from a 2-year project study. In: International Congress of University Teaching and Innovation, Barcelona, 5–7 July 2006
41. Alexander, P.M.: Virtual teamwork in very large undergraduate classes. Comput. Educ. 47(2), 127–147 (2006)
42. Delgado, J.: Emergence of social conventions in complex networks. Artif. Intell. 141(1–2), 171–185 (2002)
43. Moran, T.: Shared environments to support face-to-face collaboration. Position Paper for the Workshop on Shared Environments to Support Face-to-Face Collaboration. In: ACM 2000 Conference on Computer Supported Cooperative Work, Philadelphia, 2–6 December 2000
44. Skyrme, D.J.: The realities of virtuality. In: Sieber, P., Griese, J. (eds.) Organizational Virtualness. Proceedings of the VO Net – Workshop, Simowa Verlag, Bern, April 1998
45. Mathieu, J.: Blogs, podcasts, and Wikis: the new names in information dissemination. J. Am. Diet. Assoc. 107(4), 553–555 (2007)
46. Ebersbach, A., Glaser, M., Heigl, R.: Wiki: Web Collaboration, 2nd edn. Springer, Berlin (2008)
47. Ravid, G., Kalman, Y.M., Rafaeli, S.: Wikibooks in higher education: empowerment through online distributed collaboration. Comput. Hum. Behav. 24(5), 1913–1928 (2008)
48. Klobas, J.: Wikis as tools for collaboration. In: Kock, N. (ed.) Encyclopedia of E-Collaboration. IGI, Hershey (2008)
49. Frumkin, J.: The Wiki and the digital library. OCLC Syst. Serv. 21(1), 18–22 (2005)
50. Williams, A.J.: Internet-based tools for communication and collaboration in chemistry. Drug Discov. Today 13(11–12), 502–506 (2008)
51. Wagner, C.: Wiki: a technology for conversational knowledge management and group collaboration. Commun. AIS 13, 265–289 (2004)
52. Kusztina, E., Zaikin, O., Tadeusiewicz, R.: The research behavior/attitude support model in open learning systems. Bull. Pol. Acad. Sci. Tech. Sci. 58(4), 705–711 (2010)
53. Mayer, R.E.: Cognitive, metacognitive, and motivational aspects of problem solving. Instruct. Sci. 26(1–2), 49–63 (1998)
54. Anderson, J.R.: Cognitive Psychology and Its Implications, 5th edn. Worth, New York (2000)
55. Keller, J.M.: Using the ARCS motivational process in computer-based instruction and distance education. New Direct. Teach. Learn. 78, 37–47 (1999)
56. Miller, R.B., Brickman, S.J.: A model of future-oriented motivation and self-regulation. Educ. Psychol. Rev. 16(1), 9–33 (2004)
57. Phalet, K., Andriessen, I., Lens, W.: How future goals enhance motivation and learning in multicultural classrooms. Educ. Psychol. Rev. 16(1), 59–89 (2004)
58. DeVoe, S.E., Iyengar, S.S.: Managers' theories of subordinates: a cross-cultural examination of manager perceptions of motivation and appraisal of performance. Organ. Behav. Hum. Decis. Process. 93(1), 47–61 (2004)
59. Sevin, D., Thalmann, D.: A motivational model of action selection for virtual humans. In: Computer Graphics International (CGI'2005), IEEE Computer Society Press, New York, pp. 213–220 (2005)
60. Chang, B., Cheng, N.-H., Deng, Y.-Ch., Chan, T.-W.: Environmental design for a structured network learning society. Comput. Educ. 28(2), 234–249 (2007)
61. Hee-Seop, H., Hyeoncheol, K.: Eyes of a Wiki: automated navigation map. In: Fox, E.A., et al. (eds.) Eighth International Conference on Asian Digital Libraries, ICADL 2005. Lecture Notes in Computer Science, vol. 3815, pp. 186–193. Springer, Heidelberg (2005)
62. Shih, W.-Ch., Tseng, S.-S., Yang, Ch.-T.: Wiki-based rapid prototyping for teaching-material design in e-learning grids. Comput. Educ. 51(3), 1037–1057 (2008)
Chapter 9
Graph Database for Collaborative Communities Rania Soussi, Marie-Aude Aufaure, and Hajer Baazaoui
Abstract Data manipulated in an enterprise context are structured data as well as unstructured data such as emails, documents, and social networks. Graphs are a natural way of representing and modeling such data (structured, semistructured, and unstructured ones) in a unified manner. The main advantage of such a structure relies on the dynamic aspect and the capability to represent relations, even multiple ones, between objects. Recent database research work shows a growing interest in the definition of graph models and languages to allow a natural way of presenting data. In this chapter, we provide a survey of the main graph database models and the associated graph query languages. We then present an application using a graph database to extract social networks.
9.1 Introduction
We have now entered the knowledge era, where people work in a collaborative way and manipulate structured as well as unstructured data. More and more information about communications among people is available. This mass of information should be used by companies to optimize the business process, for example, by using information about people to constitute the best team for a particular project. These vast amounts of data need storage and analysis. This data may reside in multiple locations and may change over time. Moreover, the data sources do not have a unified schema, or their schemas cannot be controlled. Current representation and storage systems are not flexible enough to deal with dynamic changes and
are not very efficient in manipulating complex data. Besides, data manipulation systems cannot easily work with structural or relational data. Graphs are powerful representation formalism for both structured and unstructured data and can be seen as a unified data representation. Data in multiple domains can be modeled naturally as graphs like Semantic Web [1], images, social networks [2], and bioinformatics. Thus, recent database research demonstrates an increasing interest in the definition of graph models and languages to allow a natural way of handling data appearing in these applications. Indeed, a graph database leads to a more natural modeling (graph structures) and offers flexible support for dynamic data (social network, web, etc.). It also facilitates data query using graph operations. Explicit graphs and graph operations allow a user to express a query at a very high level of abstraction. Queries about paths and the shortest path between two nodes are performed efficiently with graph database techniques. In this chapter, we present the main graph database models and the associated graph query languages; we also discuss two related models that are not quite graph database models, but use graphs, for example, for navigation, for defining views, or as language representation. We discuss in each section the capacity of these models and query languages to present or to query communities’ data, especially information found on social networks. Then, we present an application using a graph database for modeling social networks.
9.2 Graph Database: Models and Query Languages

9.2.1 Brief Overview of Graph Database Models
A graph database is defined [3] as a "database where the data structures for the schema and/or instances are modeled as a (labeled) (directed) graph, or generalizations of the graph data structure, where data manipulation is expressed by graph-oriented operations and type constructors, and has integrity constraints appropriate for the graph structure". More formally, a graph database schema is in the form of a graph G_db = (N, E, c, l), where N is a set of nodes and E is a set of edges; c is an incidence function from E into N × N; V is a set of labels and l is a labeling function from N ∪ E into V. There are a variety of models for a graph database (for more details, see [3]). All these models have their formal foundation as variations of the basic mathematical definition of a graph. The structure used for modeling entities and relations influences the way that data are queried and visualized. In this section, we compare existing models to find the one most suitable for storing and representing a social network. We focus on the representation of entities and relations in these models. We present in what follows some models classified according to the data structure used to model entities and relations.
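As a concrete reading of this definition, a labeled directed graph can be held in a few plain dictionaries: node and edge identifiers, an incidence function, and a labeling function. The sketch below is a toy encoding only; the identifiers and labels are invented and it is not tied to any particular graph database system.

# Toy encoding of a graph database schema G_db = (N, E, c, l):
# N – node identifiers, E – edge identifiers,
# c – incidence function E -> N x N, l – labeling function on N ∪ E.

nodes = {"n1", "n2", "n3"}                              # N
edges = {"e1", "e2"}                                    # E
incidence = {"e1": ("n1", "n2"), "e2": ("n2", "n3")}    # c
labels = {"n1": "Student", "n2": "Supervisor", "n3": "Thesis",
          "e1": "supervised_by", "e2": "works_on"}      # l

def neighbors(node):
    """Nodes reachable from `node` by one outgoing edge, with edge labels."""
    return [(labels[e], dst) for e, (src, dst) in incidence.items() if src == node]

print(neighbors("n1"))   # [('supervised_by', 'n2')]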
9.2.1.1 Models Based on Simple Node
Data are represented in these models by a (directed or undirected) graph with simple nodes and edges. Most of these models (GOOD [4], GMOD [5], etc.) represent both schema and instance database as a labeled directed graph. Moreover, LDM [6] represents the graph schema as a directed graph where leaves represent data and whose internal nodes represent connections among the data. LDM instances consist of two-column tables, one for each node of the schema. Entities in these model are represented by nodes labeled with type name and also with type value or object identifier (in the case of instance graph). Some models have nodes for explicit representation of tuples and sets (PaMaL [7] and GDM [8]), and n-ary relations (GDM). Relations (attributes and relations between entities) are generally represented in these models by means of labeled edges. LDM and PaMaL use tuple nodes to describe a set of attributes that are used to define an entity. GOOD defines edges to distinguish between monovalued (functional edge) and multivalued attributes (nonfunctional edge). Nevertheless, these models do not allow the presentation of nested relations and are not very suited to modeling complex objects.
9.2.1.2 Models Based on Complex Node
In these models, the basic structure of a graph (node and edge) and the presentation of entities and relations are based on hypernodes (and hypergraphs). Indeed, a hypernode is a directed graph in which the nodes can themselves be graphs (or hypernodes) [40]. Hypernodes can be used to represent simple (flat) and complex objects (hierarchical, composite, and cyclic) as well as mappings and records. A hypergraph is a generalized notion of a graph in which the notion of edge is extended to the hyperedge, which relates an arbitrary set of nodes. The Hypernode Model and GGL [39] emphasize the use of hypernodes for representing nested complex objects, while GROOVY is centered on the use of hypergraphs. The hypernode model is characterized by the use of nested graphs at both the schema and instance levels. GGL introduces, in addition to its support for hypernodes (called Master-nodes), the notion of the Master-edge for the encapsulation of paths. It uses hypernodes as an abstraction mechanism that packages other graphs as an encapsulated vertex, whereas the Hypernode model additionally uses hypernodes to represent other abstractions such as complex objects and relations. Most models have explicit labels on edges. In the hypernode model and GROOVY, labeling can be obtained by encapsulating edges that represent the same relation within one hypernode (or hyperedge) labeled with the relation name.
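The nesting idea behind hypernodes can be mimicked with a recursive structure in which a node is either an atomic value or another labeled graph. The sketch below is a simplified illustration of this encapsulation, not an implementation of the Hypernode model's semantics; the class and example names are invented.

from dataclasses import dataclass, field

# Minimal nested-graph ("hypernode") sketch: a node is either an atomic
# value or another Hypernode, so complex objects can be encapsulated.
@dataclass
class Hypernode:
    label: str
    nodes: list = field(default_factory=list)    # atoms or Hypernodes
    edges: list = field(default_factory=list)    # (source, target) pairs over `nodes`

address = Hypernode("Address", nodes=["Paris", "France"], edges=[("Paris", "France")])
person = Hypernode("Person",
                   nodes=["Alice", address],     # a hypernode nested as a node
                   edges=[("Alice", address)])

def atoms(h):
    """Flatten a hypernode into its atomic values, descending into nested graphs."""
    for n in h.nodes:
        if isinstance(n, Hypernode):
            yield from atoms(n)
        else:
            yield n

print(list(atoms(person)))   # ['Alice', 'Paris', 'France']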
9.2.1.3 Discussion
The purpose of this review of graph database models is to find the one most suited to model many complex data objects and their relationships, such as social networks.
Table 9.1 Graph database model comparison: Hypernode, Groovy, GGL, GOOD, GMOD, PaMaL, GDM, and LDM, compared on entity support (complex, dynamic), relation support (nested, neighborhood), and visualization
A Social Network is an explicit representation of relationships between people, groups, organizations, computers, or other entities [9]. Like other networks, it can be represented as a complex graph [2], G = (V, E), where V is the set of nodes representing people and E ⊆ V × V is the set of edges representing the different kinds of relationships among people. Indeed, the social network structure can contain one or more types of relations, one or more types or levels of entities, and many attributes over the entities. This structure is dynamic due to the growth of its volume and changes of attributes and relations. We therefore compare the previous graph database models using characteristics related to social networks: the ability to represent dynamic and complex objects, nested and neighborhood relations, and the ability to give a good visualization of a social network. We summarize the comparison in Table 9.1, where "+" indicates that the graph model supports the characteristic, "–" indicates that it does not, and "+/–" indicates partial support. From this comparison, we have concluded that models based on hypernodes can be very appropriate for representing complex and dynamic objects. In particular, the hypernode model with its nested graphs can provide efficient support to represent every real-world object as a separate database entity. Moreover, models based on a simple graph are unsuitable for complex networks where entities have many attributes and multiple relations.
9.2.2 Graph Database Languages
A query language is a collection of operators or inference rules, which can be applied to any valid instance of the model data structure types, with the objective of manipulating and querying data in those structures in any desired combination [10]. In this section, we review some proposals for graph database query languages found in the literature. We concentrate this study on visual, semantic, SQL-like, and formal query languages.
Fig. 9.1 PhD students and their supervisors (tables and corresponding graph)
For each category, we will run several queries using the following example of a PhD student database, as shown in Fig. 9.1. We will show how these graph database languages support graph features (path, neighborhood, etc.).

9.2.2.1 Visual Query Languages
Visual query languages aim to provide the functionality of textual query languages to users who are not technical database experts and also to improve the productivity of expert database users [41]. In general, these languages allow users to draw a query as a graph pattern with the help of a graphical interface. The result is the collection of all subgraphs of the database matching the desired pattern [11–13].

G, G+, and GraphLog

G [12] is a visual query language based on regular expressions that allows the simple formulation of recursive queries. G enables users to pose queries, including transitive closure, that are not expressible in relational query languages. A graphical query Q (see the example in Fig. 9.2) is a set of labeled directed multigraphs, in which the node labels of Q may be either variables or constants, and the edge labels are regular expressions defined over n-tuples of variables and constants. A path is expressed in a G query by means of two types of edges: dashed edges correspond to paths of arbitrary length in the graph, and solid edges correspond to paths of fixed length. In G, simple paths are traversed using certain non-Horn clause constructs available in Prolog. However, G does not support cycles, finding the shortest path, or calculating node distances. Moreover, G does not support aggregation functions. G evolved into a more powerful language called G+ [13], in which a query graph remains the basic building block. A simple query in G+ has two elements: a query graph that specifies the class of patterns to search for, and a summary graph, which represents how to restructure the answer obtained by the query graph. G+ provides primitive operators such as depth-first search, shortest path, transitive closure, and connected components. It can easily find a regular simple path. The language
Fig. 9.2 G query to find students and supervisors and query GraphLog query to find all students working on Ontology
Fig. 9.3 Hypernode database schema and instance
also contains aggregate operators that allow the path length and node degree to be found. The graph-based query language G+ provided a starting point for GraphLog [14]. GraphLog differs from G+ with a more general data model, the use of negation, and the computational traceability. GraphLog queries are graph patterns that ask for patterns that must be present or absent in the database graph. Edges in queries represent edges or paths in the database. Each pattern defines a set of new edges (i.e., a new relation) that are added to the graph whenever the pattern is found. An edge used in a query graph either represents a base relation or is itself defined in another query graph. GraphLog supports computing aggregate functions and summarizing along paths. Figure 9.2 shows an example of a GraphLog query. Hyperlog Hyperlog [15] is a declarative query and update language for the Hypernode Model (Fig. 9.3). It visualizes schema information, data, and query output as sets of nested graphs, which can be stored, browsed, and queried in a uniform way. A hyperlog query consists of a number of graphs (templates), which are matched against the hypernodes and generate graphical output. The user chooses which variables in the query should have their instantiations output in the query result. Hyperlog programs contain sets of rules. The body of a rule is composed of a number of queries, which may contain variables. The head of a rule is also a query and indicates the updates (if any) to be undertaken for each match of the graphs in the body. In order to illustrate the template and the query in the Hyperlog query language, we give an example in Fig. 9.4: the template can find the
Fig. 9.4 Template and query with Hyperlog
students and their supervisors; the query can find the students working on Ontology. Hyperlog does not offer a special notation or expression to express paths. The existing rules can find only simple ones. The absence of aggregation functions explains the absence of answers to a query about node degree or path lengths. QGRAPH QGRAPH [11] query is a labeled connected graph in which the vertices correspond to objects and the edges to links with a unique label. The query specifies the desired structure of vertices and edges. It may also place Boolean conditions on the attribute values of matching objects and links, as well as global constraints. A query consists of match vertices and edges and optional update vertices and edges. The former determine which subgraphs in the graph database constitute a match for the query. The latter determine modifications made to the matching subgraphs. A query with both match and update vertices and edges can be used for attribute calculation and for structural modification of the database. The query processor first finds the matching subgraphs using the query’s match elements and then makes changes to those subgraphs as indicated by the query’s update elements. QGRAPH offers a good support to express paths by means of subqueries, conditions, and annotations on edges and nodes. However, it does not offer an operator for aggregation. Figure 9.5 contains two queries: the right query finds all subgraphs with a supervised link between a Student and a Supervisor; the left one finds just the students that have ontology as the thesis topic. GOOD and Languages Based on GOOD The GOOD [4] data transformation language is a database language with graphical syntax and semantics. This query language is used for the GOOD graph-based data model (Fig. 9.6). GOOD query language is based on graph-pattern matching and allows the user to specify node insertions and deletions in a graphical way.
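The graph-pattern matching that underlies these visual languages can be made concrete with a small brute-force matcher: a query pattern is a tiny labeled graph, and an answer is any assignment of database nodes to pattern nodes that preserves labels and edges. The code below is a naive illustration (exponential in the pattern size) rather than the evaluation mechanism of GOOD, QGRAPH, or G+; the sample data only loosely mirror the PhD-student example of Fig. 9.1.

from itertools import permutations

# Database graph: node -> label, plus labeled edges (src, label, dst).
db_nodes = {1: "Student", 2: "Student", 3: "Supervisor", 4: "Topic"}
db_edges = {(1, "supervised_by", 3), (2, "supervised_by", 3), (1, "works_on", 4)}

# Query pattern: find every (Student, Supervisor) pair linked by "supervised_by".
pat_nodes = {"s": "Student", "p": "Supervisor"}
pat_edges = {("s", "supervised_by", "p")}

def match(pattern_nodes, pattern_edges):
    """Brute-force subgraph pattern matching (naive, for illustration only)."""
    keys = list(pattern_nodes)
    for assignment in permutations(db_nodes, len(keys)):
        binding = dict(zip(keys, assignment))
        if all(db_nodes[binding[k]] == lbl for k, lbl in pattern_nodes.items()) and \
           all((binding[a], lbl, binding[b]) in db_edges for a, lbl, b in pattern_edges):
            yield binding

print(list(match(pat_nodes, pat_edges)))   # [{'s': 1, 'p': 3}, {'s': 2, 'p': 3}]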
Fig. 9.5 Queries with QGRAPH
Fig. 9.6 GOOD data model schema and instance
GOOD contains five operators. Four of them correspond to elementary manipulation of graphs: addition of nodes and edges, and deletion of nodes and edges. The fifth operation called ‘abstraction’ is used to group nodes on the basis of common functional or nonfunctional properties. The specification of all these operations relies on the notion of pattern to describe subgraphs in an object-based instance. GOOD presents other features such as macros (for more succinct expression of frequent operations), computational-completeness of the query language, and simulation of object-oriented characteristics such as encapsulation and inheritance. A simple path can be discovered by using a pattern. Moreover, GOOD is incapable of finding a path with no fixed length. Figure 9.7 illustrates two examples of GOOD query: the first query is to find students and their supervisors; the second one is to find students working on the ontology topic. This language was followed by the proposals GMOD [5], PaMaL [7], and GOAL [16]. These languages use GOOD’s principal features and add several new functionalities.
9.2.2.2 SQL-Like Languages
SQL-like languages are declarative rule query languages that extend traditional SQL and propose new SQL-like operators for querying graphs and objects [17–19].
Fig. 9.7 GOOD queries
Fig. 9.8 Object Exchange Model (OEM). Schema and instance are mixed
Lorel Lorel [18] is implemented as the query language of the Lore prototype database management system at Stanford (http://www.db.stanford.edu/lore). It is used for the Object Exchange Model (OEM) data model (Fig. 9.8). A database conforming to OEM can be thought of as a graph where Object-IDs represent node labels and OEM labels represent edge labels. Atomic objects are leaf nodes where the OEM value is the node value. Lorel allows flexible path expressions, which allow querying without precise knowledge of the structure. Path expressions are built from labels and wildcards (placeholders) using regular expressions, allowing the user to specify rich patterns that are matched to actual paths in the graph database. Lorel also includes a declarative update language.
GraphDB Güting [20] proposes an explicit model named GraphDB, which allows the simple modeling of graphs in an object-oriented environment. A database in GraphDB is a collection of object classes where objects are composed of an identity and a tuple structure; attributes may be data- or object-valued. There are three different kinds of object classes, called simple classes, link classes, and path classes. Simple objects are just objects, but they also play the role of nodes in the database graph. Link objects are objects with additional distinguished references to source and target simple
objects. Path objects are objects with an additional list of references to simple and link objects that form a path over the database graph. GraphDB uses graph algorithms in order to implement graph operations. Both the shortest path and cycle were implemented using the A* algorithm. Moreover, nodes, paths, and subgraphs are indexed using path classes and index structures such as B-Tree and LSD-Tree. GraphDB allows aggregation by using aggregate functions.
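For unweighted graphs, a shortest-path operation of the kind GraphDB exposes can be approximated with a plain breadth-first search. The sketch below is a generic BFS over an adjacency-list dictionary, shown only to make the operation concrete; it is unrelated to GraphDB's actual A*-based implementation, and the example network is invented.

from collections import deque

def shortest_path(adjacency, start, goal):
    """Breadth-first search returning one shortest path in an unweighted graph."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for nxt in adjacency.get(node, ()):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None

# Hypothetical co-supervision network.
graph = {"Yan": ["Smith"], "Smith": ["Sara", "Jones"], "Sara": ["Alain"], "Jones": []}
print(shortest_path(graph, "Yan", "Alain"))   # ['Yan', 'Smith', 'Sara', 'Alain']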
GOQL GOQL [21] is an extension of OQL enriched with constructs to create, manipulate and query objects of type graph, path, and edge. GOQL is applied to graph databases that use an object-oriented data model. In this data model, they define, similar to GraphDB, a special type: node type, edge type, path type, and graph type. GOQL is capable to query sequences and paths. In addition to the OQL sequence operators, GOQL uses the temporal operators Next, Until, and Connected for queries involving the relative ordering of sequence elements. For processing, GOQL queries are translated into an operator-based language, O-Algebra, extended with new operators. O-Algebra is an object algebra designed for processing objectoriented database (OODB) queries. To deal with GOQL’s extension for path and sequence expressions, O-Algebra is extended with three temporal operators, corresponding to the temporal operators: Next, Until and Connected.
SOQL SOcial networks Query Language (SoQL) [22] is an SQL-like language for querying and creating data in social networks. SoQL enables the user to retrieve paths to other participants in the network and to use a retrieved path in order to attempt to create a connection with the participant at the end of the path. The main element of a SoQL query is either a path or a group, with subpaths, subgroups, and paths within a group defined in the query. Creation of new data is also based on the path and group structures. Indeed, SoQL presents four new operators: – SELECT FROM PATH query which retrieves paths between network participants, starting at a specific node and satisfying conditions in the path predicates. – SELECT FROM GROUP query which retrieves groups of participants that satisfy conditions as a set of nodes. – The CONNECT USING PATH and CONNECT GROUP commands are presented. These commands automate the process of creating connections between participants. The language uses operators that specify conditions on a path or a group. It also proposes aggregation functionalities, as well as existential and universal quantifiers on nodes and edges in a path or a group, and on paths within a defined group.
GraphQL GraphQL [17] is a graph query language for graphs with arbitrary attributes and sizes. In GraphQL, graphs are the basic unit of information. Then, each operator takes one or more collections of graphs as input and generates a collection of graphs as output. It is based on graph algebra and the FLWR (For, Let, Where, and Return) expressions used in Xquery (see next section). In the graph algebra, the selection operator is generalized to graph pattern matching and a composition operator is introduced for rewriting matched graphs using the idea of neighborhood subgraphs and profiles, refinement of the overall search space, and optimization of the search order.
9.2.2.3 Formal Languages
LDM The Logical Database Model [6] presents a logic very much in the spirit of relational tuple calculus, which uses fixed types of variables and atomic formulas to represent queries over a schema using the power of full first-order languages. Figure 9.9 presents the LDM schema and instances. The result of a query is another LDM schema called ‘query schema,’ which consists of those objects over a valid instance that satisfy the query formula. In addition, the model presents an alternative algebraic query language proved to be equivalent to the logical one.
Fig. 9.9 Logical Data Model: The schema (on the left) and part of the instances (on the right)

Gram

Gram [23] is an algebraic language based on regular expressions and supporting a restricted form of recursion. Figure 9.10 shows the data model used by Gram. Regular expressions over data types are used to select walks (paths) in a graph; walks are the basic objects of the data model. A walk expression is a regular expression without union whose language contains only alternating sequences of node and edge types, starting and ending with a node type. The query language is based on a hyperwalk algebra whose operations are closed under the set of hyperwalks. This hyperwalk notion facilitates querying paths and finding adjacent nodes and edges. A Gram query example is presented in Fig. 9.11.

Fig. 9.10 Gram Data Model: The schema (on the left) and the instances (on the right)

Fig. 9.11 Gram query to find students and their supervisors

Fig. 9.12 G-Log Data Model: The schema (on the left) and the instances (on the right)
G-Log G-Log [24] is a declarative, nondeterministic complete language for complex objects with identity. The data model of G-Log is (right down to minor details) the same as that of GOOD (Fig. 9.12). The main difference between G-Log and GOOD is that the former is a declarative language, and that the latter is imperative. In G-Log, the basic entity of a program is a rule. Rules in G-Log are graph-based and are built up from colored patterns. A G-Log program is defined as a sequence of sets of G-Log rules.
HNQL HyperNode Query Language (HNQL) is a query and update language for the hypernode model [25]. HNQL consists of a basic set of operators for declarative querying and updating of hypernodes. In addition to the standard deterministic operators, HNQL provides several nondeterministic operators, which arbitrarily choose a member from a set. HNQL is further extended in a procedural style by adding to the said set of operators an assignment construct, a sequential composition construct, a conditional construct for making inferences, and, finally, loop and while loop constructs for providing iteration (or equivalently, recursion) facilities.
9.2.2.4 Semantic Languages
A semantic query language is a query language which is defined for querying a semantic data model. The semantic query language presented in [26] provides a foundation for extracting information from the semantic graph in which the possible structure of the graph is described by ontology (Fig. 9.13) that defines the vertex types, the edge types, and how edges may interconnect vertices to form a directed graph. It uses a query with a specific format containing a function that specifies patterns and conditions for matching graphs in the database. Figure 9.13 shows an example of a pattern used by Kaplan query language.
9.2.2.5 Discussion
Querying social networks turns out to be a nontrivial task due to the intrinsic complexity of the networked data. Also, these kinds of querying focus on a special type of information. Moreover, the information needs of a community or a social network are diverse and can be categorized as two types: (1) values or measures such as the centrality and diameter; (2) information about attributes relations and data management on social networks. In this section, we present a comparison of the previous graph database languages and we discuss whether they are well adapted to query a social network. The existing graph query languages cannot extract all the
Fig. 9.13 Ontology describing the graph (left) and the pattern to extract students working on same topic (right)
Table 9.2 Graph query languages (1): G, G+, GraphLog, Hyperlog, QGraph, GOOD, Kaplan's semantic language, and HNQL, compared on basic unit, data model, language style, pattern, update query, implementation, path, neighborhood, diameter, and distance between nodes

Table 9.3 Graph query languages (2): Lorel, GOQL, SOQL, GraphDB, GraphQL, LDM, Gram, and G-Log, compared on the same criteria
characteristics of a social network, even those designed for social networks (e.g., SoQL). We summarize the main characteristics of the previous languages in Tables 9.2 and 9.3. We put (+) where the language proposes an explicit definition for the characteristic, (–) if not, and (+/–) where it tries to define it indirectly. These two tables show that:
– Many query languages use patterns to facilitate the information search process, especially graphical languages such as GOOD, QGRAPH, and G.
– Most languages provide operators or techniques to find paths. Nevertheless, graphical languages do not determine paths by a direct operation.
– The neighborhood characteristic is not well handled by existing languages.
– Graph characteristics based on calculations, such as the diameter or the distance between nodes, are treated only by G+ and GraphLog.
In practice, users prefer graphical languages because they are easy to use. Moreover, graphical query languages for a graph model are unable to obtain information about communities. Languages designed for a social network such as SOQL are based on SQL and can be applied only on simple graphs.
9.3 Related Data Model

9.3.1 RDF Query Languages
RDF [27] is a knowledge representation language dedicated to the annotation of documents and, more generally, of resources within the framework of the Semantic Web. By definition, an RDF graph [28] is a set of RDF triples. An RDF triple is a triple (s, p, o) ∈ (I ∪ B) × I × (I ∪ B ∪ L), where I, B, and L are sets that represent IRIs, blank nodes, and literals, respectively. In this triple, s is the subject, p the predicate, and o the object. RDF models information with a graph-like structure, where basic notions of graph theory such as node, edge, path, neighborhood, connectivity, distance, and degree play a central role. RDF has been used for presenting communities and social networks (e.g., FOAF, http://www.foaf-project.org/, and RELATIONSHIP, http://vocab.org/relationship/.html). RDF can be a good support to model social networks, although its query languages do not offer efficient support for querying this kind of data. Indeed, several languages for querying RDF documents have been proposed, some in the tradition of database query languages (i.e., SQL, OQL): RQL [29], SeRQL [30], RDQL [31], and SPARQL [42]. Others are more closely inspired by rule languages: Triple [32], Versa (http://wiki.xml3k.org/Versa), N3, and RxPath (http://rx4rdf.liminalzone.org/RxPathSpec). The currently available query languages for RDF support a wide variety of operations. However, several important features are not well supported or not supported at all. RDF query languages support only querying for patterns of paths, which are limited in length and form. Nevertheless, RDF allows the representation of irregular and incomplete information (e.g., through the use of blank nodes). Of the original approaches, only Versa and SeRQL provide a built-in means of dealing with incomplete information. For example, the SeRQL language provides so-called optional path expressions (denoted by square brackets) to match paths whose presence is irregular. Usually, such optional path expressions can be simulated if a language provides set union and negation. Other work on RDF query languages tries to extend the original languages to improve path expressiveness. For example, Alkhateeb et al. [33] allow an RDF knowledge base to be queried using graph patterns whose predicates are regular expressions. RDF Path (http://infomesh.net/2003/rdfpath), N3, and GraphPath (http://www.langdale.com.au/GraphPath/) try to use specifications similar to those of XPath to query paths in RDF. Moreover, RDF query languages are not well adapted to querying paths of unknown length or paths that include multiple properties in the RDF graph. Neighborhood retrieval cannot be performed well in languages that do not have a union operator. Many of the
existing proposals support very little functionality for grouping and aggregation. Moreover, aggregate functions such as COUNT, MIN, and MAX applied to paths could be used to answer queries that analyze the data (like the degree of a node, the distance between nodes, and the diameter of a graph). We can find exceptions in Versa, RQL, and N3, which support the count functionality; however, aggregation over paths and nodes is not explicitly treated by any of these languages and needs to be considered as a requirement.
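To make the neighborhood queries discussed above concrete, the sketch below builds a few FOAF-style triples with the rdflib library (assumed to be installed) and retrieves the one-step foaf:knows neighborhood of a person, once through the API and once through a SPARQL pattern. The data and the example.org namespace are invented; this is just one common way of touching RDF from code, not a reference implementation of any language surveyed here.

from rdflib import Graph, Literal, Namespace

FOAF = Namespace("http://xmlns.com/foaf/0.1/")
EX = Namespace("http://example.org/people/")         # hypothetical namespace

g = Graph()
g.add((EX.alice, FOAF.name, Literal("Alice")))
g.add((EX.alice, FOAF.knows, EX.bob))
g.add((EX.bob, FOAF.knows, EX.carol))

# One-step "foaf:knows" neighborhood of Alice, via the rdflib API.
for friend in g.objects(EX.alice, FOAF.knows):
    print(friend)                                     # http://example.org/people/bob

# The same neighborhood expressed as a SPARQL pattern.
q = """PREFIX foaf: <http://xmlns.com/foaf/0.1/>
       SELECT ?friend WHERE { <http://example.org/people/alice> foaf:knows ?friend }"""
for row in g.query(q):
    print(row.friend)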
9.3.2 XML Query Languages
The Extensible Markup Language (XML) is a subset of SGML. XML data are labeled ordered trees (with labels on nodes), where internal nodes define the structure and leaves hold the data (schema and data are mixed). XML additionally provides a referencing mechanism among elements that allows the simulation of arbitrary graphs. In this sense, XML can simulate semistructured data. Also, many new extensions of XML are designed to represent graphs, such as GML, GraphML, and XGML. Current query languages [34] for XML do not support the majority of features needed for graph-structured XML documents. The principal feature supported is the path. For example, XPath (http://www.w3.org/TR/xpath) uses path expressions to select nodes or node sets in an XML document. Also, the set of axes defined in XPath is clearly designed to allow the set of graph traversal operations that are seen as atomic in XML document trees. An XPath axis is fundamentally a mapping from nodes to node sets and defines a way of traversing the underlying graph. Each axis encapsulates two things: a type of edge to follow (e.g., child vs. attribute) and whether it is followed transitively (e.g., child vs. descendant). Also, XQuery (http://www.w3.org/TR/xquery/) uses XPath to express complex paths and supports flexible query semantics. In XML-QL [43], path expressions are admitted within the tag specification, and they permit the alternation, concatenation, and Kleene-star operators, similar to those used in regular expressions. In XML-GL [35], the only path expression supported is arbitrary containment, by means of a wildcard * as the edge label; this allows the traversal of the XML-GL graph, reaching an element at any level of depth. However, current query languages for XML are designed for tree-structured XML data and do not support the matching of schemas in the form of a general graph. Even though XPath can express a node with multiple parents by multiple constraints with the "parent" axis, it cannot express a graph with cycles. While XML will not allow multiple parents, there is nothing in XQuery (or XPath in particular) which precludes a traversal from parent to child to a different parent. This insufficiency does not allow the representation and querying of all kinds of graphs, particularly those of social networks.
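As a tiny illustration of tree-oriented XML querying, the sketch below uses Python's standard ElementTree module to evaluate a simple path expression over an invented document fragment; ElementTree implements only a limited XPath subset, and the example also hints at why purely tree-shaped navigation becomes awkward for graph-shaped data.

import xml.etree.ElementTree as ET

# Invented fragment: students and their thesis topics.
doc = ET.fromstring(
    "<lab>"
    "  <student name='Yan'><thesis topic='Ontology'/></student>"
    "  <student name='Sara'><thesis topic='Ontology'/></student>"
    "  <student name='Alain'><thesis topic='Graph'/></student>"
    "</lab>"
)

# ElementTree understands a limited XPath subset: path steps plus attribute predicates.
ontology_theses = doc.findall(".//student/thesis[@topic='Ontology']")
print(len(ontology_theses))                       # 2

# Going back from a matched child to its parent still needs plain Python,
# which hints at why tree-only navigation is limiting for graph-shaped data.
names = [s.get("name") for s in doc.iter("student")
         if s.find("thesis").get("topic") == "Ontology"]
print(names)                                      # ['Yan', 'Sara']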
9.4 Social Network Extraction from Relational Database Using a Graph Database
A Social Network is an explicit representation of relationships between people, groups, organizations, computers, or other entities and it is modeled by a graph (see Sect. 9.2.1.3). There are many ways to obtain a social network. The approaches presented in the literature on social network extraction use a specific type of data source to extract people and relations among them [36]. Most of these data sources come from the Web. However, some problems related to the extraction of a social network from various information sources available on the World Wide Web still remain. First, a general problem is the identification of people because of different naming standards or the same names assigned to different persons. The social context and the type of social interactions among people within these information sources need to be carefully analyzed in order to obtain a meaningful understanding of the underlying Social Network structure. Moreover, data from the Web are often not very reliable because anyone can add information; also, in some cases, we cannot easily collect information from the Web due to privacy issues. Nevertheless, in the context of business, important expertise information about people is not stored on the Web. Such information is stored in files, databases, and especially relational databases. A relational database is a rich source of data, but it is not appropriate for storing and manipulating social network data. Indeed, the relational model was intended for simple record-type data with a structure known in advance. The schema is fixed and extensibility is difficult. Thus, it might require very sophisticated and expensive operations, such as renormalization and reindexing, which might not be performed automatically. Schema renormalization in such cases is neither desirable nor easy to perform. The standard query and transformation language for the relational database is SQL, which does not support paths, neighborhoods, and queries that address connectivity (an exception is transitivity). These graph features will facilitate the application of a social network analysis algorithm. Also, it will allow response to queries such as who owns the information, who has the leadership, and who is an expert in a particular domain. Such information is very important for business applications. Then, enterprises need to extract their social network from the existing relational database to store, update, and retrieve information in a simple way such as graphs. On the other hand, extracting a social network from a relational database is not just a translation of a relational database into a simple graph structure. The resulting social network should contain detailed information about people and their relations. As we have shown in the previous section, a graph database can be a good representation for social networks and facilitate its querying. There are many approaches that transform relational databases to other structures having graph-like features such as RDF, XML, or even ontology, but not into a graph database. Hence, in this section, we will present our approach to transforming a relational database to a social network using a graph database.
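The transitivity exception mentioned above corresponds to recursive SQL queries, which can express reachability but remain clumsy compared with native graph traversal. The sketch below uses Python's built-in sqlite3 module and a WITH RECURSIVE common table expression over an invented "knows" table; it is a generic illustration and not part of the transformation approach presented in this section.

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE knows(src TEXT, dst TEXT);
    INSERT INTO knows VALUES ('alice','bob'), ('bob','carol'), ('carol','dave');
""")

# Transitive closure of 'knows' starting from alice, via a recursive CTE.
rows = con.execute("""
    WITH RECURSIVE reachable(person) AS (
        SELECT dst FROM knows WHERE src = 'alice'
        UNION
        SELECT k.dst FROM knows AS k JOIN reachable AS r ON k.src = r.person
    )
    SELECT person FROM reachable;
""").fetchall()
print([p for (p,) in rows])   # e.g. ['bob', 'carol', 'dave']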
9.4.1 Converting Relational Database into Hypernode Database
Having a graph database instead of a relational database provides a clearer view of the entities existing in the initial database. Indeed, all these entities are presented in the form of nodes, and the relations between them are made explicit, which facilitates the subsequent step of selecting the desired entities. Also, nodes in a graph database can encapsulate all the attributes of an entity in the same node, giving us a simple graph of entities from which a social network can be extracted. Using the comparison between existing graph database models (Table 9.1), we have chosen to work with the hypernode model [25], because the hypernode database, with its nested graphs, provides efficient support for representing every real-world object as a separate database entity. The transformation of a relational database into a graph database includes schema translation and data conversion [37]. The schema translation turns the source schema into the target schema by applying a set of mapping rules; in our work, we propose a translation process which directly transforms the relational schema into a hypernode schema. The data conversion process then converts data from the source to the target database based on the translated schema: data stored as tuples (rows) in the relational database are converted into nodes and edges in the graph database. This involves unloading and restructuring the relational data and then reloading them into the target database in order to populate the schema generated during the translation process. In what follows, we detail these two steps.
9.4.1.1 Schema Translation
The first step consists of extracting the relational database schema using the schema metadata of the relational database management system (information about tables and columns), which is obtained through SQL queries. The idea is to identify the primary key, composite key(s), and foreign key(s) of each relation. This information is then used to design the new schema (hypernodes and the relations within and between them). This process is performed by the following steps.
Step 1: Relational schema extraction. In this step, information from the relational database is extracted using SQL queries (a sketch of this extraction is given after the list). In our approach, a relational schema is represented as a set of relations (tables) {TR | TR := <rn, A, Kp,F>}, where:
– rn denotes the name of TR.
– A denotes the set of attributes of TR and gives information about each attribute's integrity constraints, A := {a | a := <an, t, ce, cp, n, d>}, where an is an attribute name, t is its type, ce states whether or not a is a foreign key, cp states whether or not a is a primary key, n states whether or not a can be null, and d is a default value if one is available.
– Kp,F denotes the set of keys of TR and gives information about each key's integrity constraints, Kp,F := {b | b := <kr, ce, cp, re, fa>}, where b represents a key (an attribute which can be a key or part of a composite key), kr is the name of the key attribute, ce indicates whether or not b is a foreign key, cp indicates whether or not b is a primary key, re is the relation that contains the exported primary key, and fa is the attribute name of the foreign key.
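As a concrete illustration of Step 1, the minimal sketch below (assuming a PostgreSQL source accessed through the psycopg2 driver; the connection parameters are invented) pulls from the catalog the column information and the primary/foreign-key constraints needed to build the <rn, A, Kp,F> description of each relation.

```python
import psycopg2

conn = psycopg2.connect("dbname=phd_db user=postgres")   # hypothetical database
cur = conn.cursor()

# Attributes A: name, type, nullability and default for every column of every table.
cur.execute("""
    SELECT table_name, column_name, data_type, is_nullable, column_default
    FROM information_schema.columns
    WHERE table_schema = 'public'
""")
attributes = cur.fetchall()

# Keys Kp,F: primary/foreign-key columns and, for a foreign key, the relation
# that exports the referenced primary key (re in the schema above).
cur.execute("""
    SELECT tc.table_name, kcu.column_name, tc.constraint_type,
           ccu.table_name AS referenced_table
    FROM information_schema.table_constraints tc
    JOIN information_schema.key_column_usage kcu
      ON tc.constraint_name = kcu.constraint_name
    LEFT JOIN information_schema.constraint_column_usage ccu
      ON tc.constraint_name = ccu.constraint_name
    WHERE tc.constraint_type IN ('PRIMARY KEY', 'FOREIGN KEY')
""")
keys = cur.fetchall()
```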
Fig. 9.14 Relational database schema (primary key is underlined and foreign key is marked by “#”)
This schema provides an image of the metadata obtained from an existing relational database and provides more information than a traditional schema does. Indeed, it gives information about primary and foreign keys, which facilitates the extraction of relations in subsequent steps. For example, for the table "Thesis" (database in Fig. 9.14), the relation Thesis(Th_id, Th_name, Topic) was extracted. Once the schema has been extracted, we can generate the corresponding hypernode schema (Step 2).
Step 2: Mapping the relational schema to the hypernode schema. For this step, we use a hypernode schema which is an extension of the original one. A hypernode is defined [25] by H = (N, E), where N is a finite set of nodes containing primitive nodes and further hypernodes, and E is a set of edges between members of N, such that N ⊆ A ∪ L (where A is the set of atomic values and L the set of labels) and E ⊆ (N × N). A Hypernode Database (HD) is a finite set of hypernodes which satisfies the following conditions:
1. Each hypernode label is unique in HD.
2. For every label H in the label set of HD, there exists a hypernode h ∈ HD whose defining label is H.
The hypernode model does not use labeled edges; the task of representing relations (and their names) can be accomplished by encapsulating the edges that represent the same relation (same label edges) within one hypernode labeled with the relation name. However, the traditional presentation of a social network is labeled nodes connected by explicit labeled edges. We therefore extend the HD to an LHD (Labeled Hypernode Database) by adding explicit labels to the edges. LHD = HS ∪ ES, where:
1. HS is a finite set of hypernodes.
2. ES is a set of edges, where ES ⊆ (HS × HS) and every e ∈ ES has a label.
In this step, we use a hypernode database schema composed of the union of two sets: {H | H := <hn, Nh>} ∪ {Rh | Rh := <r, hs, hd>}.
The first one is the set of hypernodes, where:
– hn denotes the name of H.
– Nh denotes a set of nodes, Nh := {n | n := <nn, t, ce, cp>}, where nn is the node name, t is the type, ce indicates whether the node contains a foreign key (i.e., the corresponding attribute is a foreign key in the relational schema), and cp indicates whether the node contains a primary key.
The second one is the set of relations, where:
– r denotes the name of Rh.
– hs denotes the name of the source hypernode.
– hd denotes the name of the destination hypernode.
To extract this schema, we start by identifying the hypernodes and then their relations.
Hypernode identification. Using the relational schema, we create from each table t ∈ TR a new hypernode h. h has the same characteristics as t: the same name and attributes. Indeed, each attribute a of the table is transformed into a node n in h, where n carries all the characteristics of a (name, type, etc.). If the attribute is a foreign key, its type is changed to the name of the exported relation.
Relation identification. In order to identify the relations between the identified hypernodes, the node set Nh of each hypernode h is analyzed. For each node, we check whether it contains a foreign key in order to find existing dependencies on other hypernodes. We have identified four relation types (a sketch of these rules follows the list):
– "IS-A" relation: if h has exactly one node npf containing a key which is both a (simple) primary key and a foreign key, then h shares the relation "IS-A" with the hypernode mentioned in the type of npf; for example, if the hypernode "Foreign_Student" contains the node "St_id", which is both a primary key and a foreign key, then "Foreign_Student" shares the relation "IS-A" with "Student".
– "Part_of" relation: if h has more than one node npf containing a key which is both primary and foreign, then h shares a "Part_of" relation with each hypernode mentioned in the types of those nodes; for example, "Student" and "Thesis" are "Part_of" the hypernode "Thesis_hasStudent", because "Thesis_hasStudent" contains the nodes "St_id" with type "Student" and "Th_id" with type "Thesis".
– R relation: this kind of relation is a particular case of the "Part_of" relation. When the hypernode is composed only of nodes containing keys which are both primary and foreign, we delete this hypernode and use its name to build relations between the hypernodes mentioned in the types of its nodes; for example, the hypernode "Thesis_hasLab" is deleted and transformed into a relation between "Thesis" and "Laboratory".
– "" (unnamed) relation: if h contains a node holding a plain foreign key, h has a relation with the hypernode mentioned in the type of that node. In this case, we are not able to give a name to this relation.
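The following minimal sketch (illustrative data structures, not the authors' implementation) shows one way these four rules can be applied to a hypernode, using the cp/ce flags carried over from the relational schema; a foreign-key node's type is assumed to hold the name of the exported relation.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    type: str          # for a foreign key, the name of the referenced hypernode
    ce: bool = False   # node holds a foreign key
    cp: bool = False   # node holds a primary key

@dataclass
class Hypernode:
    name: str
    nodes: list = field(default_factory=list)

def identify_relations(h):
    """Return (relation_name, source, target) triples derived from one hypernode."""
    pf = [n for n in h.nodes if n.cp and n.ce]        # both primary and foreign key
    fk = [n for n in h.nodes if n.ce and not n.cp]    # plain foreign keys
    if pf and len(pf) == len(h.nodes) and len(pf) >= 2:
        # R relation: a pure join table (e.g. Thesis_hasLab) is dropped and its
        # name becomes an edge between the referenced hypernodes.
        return [(h.name, pf[0].type, pf[1].type)]
    rels = []
    if len(pf) == 1:
        # IS-A: e.g. Foreign_Student.St_id is both a primary and a foreign key.
        rels.append(("IS-A", h.name, pf[0].type))
    elif len(pf) > 1:
        # Part_of: e.g. Student and Thesis are Part_of Thesis_hasStudent.
        rels.extend(("Part_of", n.type, h.name) for n in pf)
    # Unnamed relation for every remaining plain foreign key (e.g. Student.Dir_id).
    rels.extend(("", h.name, n.type) for n in fk)
    return rels

foreign_student = Hypernode("Foreign_Student",
                            [Node("St_id", "Student", ce=True, cp=True),
                             Node("Country", "String")])
print(identify_relations(foreign_student))   # [('IS-A', 'Foreign_Student', 'Student')]
```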
Fig. 9.15 Hypernode database schema
Considering the initial database, Fig. 9.15 shows the resulting hypernode schema.
9.4.1.2 Data Conversion
In order to instantiate the hypernode database schema already identified, data conversion is performed in three steps. First, the tuples of the relational database tables are extracted. Second, these data are converted to match the target format. Then, for each hypernode in the hypernode database, a set of instance hypernodes HI is extracted from the relational tuples. The set of instance hypernodes HI is defined by HI := {Hi | Hi := <h, hi, Nhi>}, where:
– Hi denotes the instance hypernode.
– h denotes the name of the source hypernode.
– hi denotes the name of Hi.
– Nhi denotes a set of nodes, Nhi := {ni | ni := <nn, t, val>}, where nn is the node name, t is the type, and val is the node value.
For each relation in the LHD, a set of instance relations RI is extracted using the values of the keys in the relational tables. RI is defined by RI := {ri | ri := <r, his, hid>}, where:
– r denotes the relation which is instantiated by ri.
– his denotes the source hypernode instance.
– hid denotes the destination hypernode instance.
Fig. 9.16 Part of the hypernode database instance
Finally, transformed data are loaded into the LHD schema. An excerpt of the HD is shown in Fig. 9.16.
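A minimal sketch of this conversion is given below (illustrative names; cur is assumed to be a DB-API cursor on the source database, and the first column of each table is assumed to hold the key used to name the instance): each tuple becomes an instance hypernode, and foreign-key values are matched against key values to instantiate the relations of the LHD.

```python
def extract_instances(cur, hypernode_name):
    """Turn every tuple of the table behind a hypernode into an instance hypernode Hi."""
    cur.execute(f'SELECT * FROM "{hypernode_name}"')
    columns = [d[0] for d in cur.description]
    instances = []
    for row in cur.fetchall():
        instances.append({
            "source": hypernode_name,                  # h: the schema hypernode
            "name": f"{hypernode_name}_{row[0]}",      # hi, e.g. "Student_3"
            "nodes": dict(zip(columns, row)),          # Nhi: node name -> value
        })
    return instances

def instantiate_relation(rel_name, sources, targets, fk_column, key_column):
    """Link source instances to the target instance holding the matching key value."""
    by_key = {t["nodes"][key_column]: t for t in targets}
    links = []
    for s in sources:
        t = by_key.get(s["nodes"].get(fk_column))
        if t is not None:
            links.append((rel_name, s["name"], t["name"]))   # an element of RI
    return links
```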
9.4.2 Social Network Extraction
Using the hypernode database resulting from the previous step, the social network is extracted. This phase passes through two steps: (1) identification of the entities (people) and (2) detection of the relations among people. The social network is defined by SN = (ESN, RSN), where:
– ESN is a finite set of entities, ESN := {e | e ∈ HI, e := <h, en, Ne>}, where h is the hypernode which represents e, en is e's name, and Ne is the set of e's nodes.
– RSN is a finite set of relations between entities, RSN := {rsn | rsn := <n, e1, e2>}, where n is the relation name and e1, e2 ∈ ESN.
In what follows, we describe the two steps of the social network extraction.
9.4.2.1 Entities Identification
Entities identification is the process of identifying hypernodes that contain entities which compose the social network. In this step, we describe the process for identifying people. The hypernode database schema is used to extract candidate hypernodes (hypernodes that may contain persons). Then, the hypernodes’ instances
are used to analyze the candidate hypernodes in depth and detect those that contain people.
Candidate hypernodes detection. A person has a number of characteristics such as name, surname, birthday, address, and email. Some of these characteristics are used when designing databases containing persons. We collect these characteristics from various ontologies, such as the FOAF ontology and the person ontology of schemaWeb (http://ebiquity.umbc.edu/ontology/person.owl), and we manually build a person ontology (PO) containing all these characteristics and their synonyms (collected from WordNet). Using the person ontology, the set of nodes related to each hypernode in the LHD is analyzed (a sketch of this test follows the list):
– If a node's name is one of the PO concepts, the number of characteristics for this hypernode is incremented.
– If the number of characteristics for the hypernode is >= 1 and one of them contains a name, the hypernode h is a candidate to contain persons.
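A minimal sketch of this candidate test follows; the PERSON_CONCEPTS set stands in for the manually built person ontology (PO) and would in practice also contain the WordNet synonyms.

```python
PERSON_CONCEPTS = {"name", "surname", "lastname", "firstname", "birthday",
                   "birthdate", "address", "email", "phone"}

def is_candidate(node_names):
    """node_names: the names of the nodes of one hypernode in the LHD schema."""
    matched = [n for n in node_names
               if any(concept in n.lower() for concept in PERSON_CONCEPTS)]
    # at least one person characteristic, one of which must involve a name
    return len(matched) >= 1 and any("name" in n.lower() for n in matched)

print(is_candidate(["St_id", "St_name", "St_lastname", "Dir_id"]))   # True
print(is_candidate(["Lab_id", "Lab_name", "Lab_adress"]))            # also True:
# false positives such as Laboratory are filtered out by the NER-based analysis below
```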
Candidate hypernodes analysis. Each candidate hypernode has a set of instance hypernodes hi. In order to analyze the name found in each instance hypernode (we take just the first ten entities), the name is sent to a Web search engine (the Bing API). The top ten returned documents are downloaded and parsed using DOM (http://www.w3.org/DOM/). Each document is analyzed using the named entity recognizer (NER) proposed by Stanford (http://nlp.stanford.edu/ner/index.shtml), which assigns three kinds of tags (person, location, or organization). We give each document a rank rd: if the name is tagged in the document as a Person, the document is ranked rd = 1, otherwise rd = 0. The average assigned to the name found in the hypernode instance hi (avg_hi) measures how often the name is considered a person name in the documents (i.e., how often its tag is Person):

  avg_hi = (Σ rd) / (number of documents)    (9.1)

The average assigned to the hypernode (avg_H) measures how often the names found in its hypernode instances are considered a person's name:

  avg_H = (Σ avg_hi) / (number of hi)    (9.2)
In order to identify persons, we rely on the Stanford NER, whose precision in finding Person entities is on average about 90%. A hypernode is considered representative of a person if more than 60% of its instances contain a person name (we take only 60% as the threshold because of problems such as wrongly written names and the use of abbreviations, which decrease the precision of the NER).
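The following minimal sketch implements Eqs. (9.1) and (9.2) and the 60% threshold; the web search and NER calls are abstracted behind tag_name_in_documents, a hypothetical function assumed to return one tag per retrieved document for the given name.

```python
def avg_hi(name, tag_name_in_documents):
    tags = tag_name_in_documents(name)      # e.g. ["PERSON", "ORGANIZATION", ...]
    ranks = [1 if tag == "PERSON" else 0 for tag in tags]
    return sum(ranks) / len(ranks)          # Eq. (9.1)

def is_person_hypernode(instance_names, tag_name_in_documents, threshold=0.60):
    names = instance_names[:10]             # only the first ten instances are analyzed
    avg_h = sum(avg_hi(n, tag_name_in_documents) for n in names) / len(names)  # Eq. (9.2)
    return avg_h > threshold
```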
9.4.2.2 Building Relations
After the identification of the entity set ESN, we use the existing relations in the hypernode database (the relations that connect ESN elements with other hypernodes or among themselves) to find the set of relations RSN. In order to facilitate this step, we have designed a set of patterns that apply this kind of transformation to all the relations in the hypernode database. The patterns enumerate all the existing relations between persons by using only the hypernode database schema; after the relation patterns have been identified, we search for the corresponding relations in the instance database. A pattern relation Pr is defined by Pr = <nPr, hp1, hp2, hin>, where nPr is the name of the relation, hp1 and hp2 are the hypernodes which share the relation (hypernodes that represent people), and hin is a mediator for this relation (the hypernode used to identify the relation). For each relation Rh in the set of LHD relations, we check the following conditions:
1. If Rh := <"IS-A", hs, hd>, where hs or hd ∈ ESN, then hs or hd is added to ESN. The relation "IS-A" allows us to find hidden entities that were not identified in the previous step. In the relation construction process, we start by analyzing this kind of relation so as to find, in the next steps, the relations related to the newly discovered entity.
2. If Rh := <r, hs, hd>, where hs and hd ∈ ESN, then two patterns are identified:
   2.1 Pr1 := <r, hs, hd, null>: if the two entities (hs and hd) are already connected in the LHD, we check whether their instances (his and hid) are connected, too.
   2.2 Pr2 := <Same_hd.name, hpi, hpj, hd>, where hpi and hpj are distinct instances of hs (i != j). Pr2 represents the relations between the instances of hs that are connected with the same instance of hd.
3. If Rh := <r, hs, hd>, where hs ∈ ESN and hd ∉ ESN, then:
   3.1 If r != "Part_of", the pattern Pr3 is extracted: Pr3 := <Same_hd.name, hpi, hpj, hd>, where hpi and hpj are distinct instances of hs (i != j). We search for the hs instances that are connected with the same instance of hd.
   3.2 If r = "Part_of", the hypernodes that are "Part_of" hd are examined: first, for each hj ∈ {h | h has the relation Rh := <"Part_of", hj, hd>}, a new node containing the name of hj is added to hs, and then the pattern Pr4 is extracted: Pr4 := <Same_hj.name, hp1, hp2, hj>, where hp1 and hp2 are instances of hs. Pr4 represents the relations between the instances of hs that share the same value of hj.
4. If Rh := <r, hs, hd>, where hs ∉ ESN and hd ∈ ESN, then:
   4.1 A new node containing the name of hd is added to hs.
   4.2 If hs has relations with other entities, then for each detected relation a pattern Pr5 is extracted: Pr5 := <Same_hs.name, hp1, hpj, hd>, where hp1 = hd and hpj ∈ {e | e has a relation with hs}.
By applying these patterns to our example, we can detect the relations between the detected entities Student and Director_thesis:
– From the relation Rh := <"IS-A", Foreign_Student, Student>, we detect a new entity, "Foreign_Student", which is added to the set of entities ESN.
– From the relation Rh1 := <"", Student, Director_thesis>, we identify two patterns:
  – Pr1 := <"", Student, Director_thesis, null>: Student and Director_thesis share the relation Rh1, so a Student instance and a Director_thesis instance are linked by this relation if they have the same Dir_id value (Fig. 9.17).
  – Pr2 := <Same_Director_thesis, Student_i, Student_j, Director_thesis>: two students may have the same Director_thesis (the same value of Dir_id) (Fig. 9.18).
– From the relation Rh2 := <"", Director_thesis, Laboratory>, we identify the pattern:
  – Pr3 := <Same_Laboratory, Director_thesis_i, Director_thesis_j, Laboratory>: using the value of the foreign key Lab_id in each hypernode instance of the entity Director_thesis, we link those having the same value of Lab_id by the relation Same_Laboratory (Fig. 9.19).
– From the relation Rh3 := <"Part_of", Student, Thesis_hasStudent>: Thesis_hasStudent shares two "Part_of" relations, with Student and with Thesis. We add a new node to the hypernode Student, corresponding to its thesis, and can then apply the pattern Pr4 (Fig. 9.20).
Fig. 9.17 Instance of the relation between Student and Director_thesis
Fig. 9.18 Relation among Student instances
Fig. 9.19 Relation between Director_thesis instances
Fig. 9.20 Adding the node thesis on the Student hypernode
Fig. 9.21 Corresponding social network
– Pr4 := <Same_Thesis, Student_i, Student_j, Thesis_hasStudent>: with this pattern we search for all the students who share the same thesis (none are found in our data).
– From the relation Rh4 := <"", Thesis, Director_thesis>, no pattern is identified, because Thesis is not related to other entities.
After identifying the relations and entities, the final social network is obtained by applying all the previous detections (people and their relations) and giving each hypernode, as a tag, its type and the name of the corresponding person (see Fig. 9.21).
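As an illustration, the minimal sketch below (reusing the instance-hypernode dictionaries introduced earlier; the values are those of the running example) applies a Pr2-style pattern: two Student instances are linked by "Same_Director_thesis" whenever their Dir_id nodes hold the same value.

```python
from itertools import combinations

def apply_same_mediator(instances, fk_node="Dir_id", rel_name="Same_Director_thesis"):
    edges = []
    for a, b in combinations(instances, 2):
        va, vb = a["nodes"].get(fk_node), b["nodes"].get(fk_node)
        if va is not None and va == vb:
            edges.append((rel_name, a["name"], b["name"]))
    return edges

students = [
    {"name": "Student_MohsenAli", "nodes": {"St_id": 3,  "Dir_id": "Director_thesis_2"}},
    {"name": "Student_YenYang",   "nodes": {"St_id": 12, "Dir_id": "Director_thesis_2"}},
]
print(apply_same_mediator(students))
# [('Same_Director_thesis', 'Student_MohsenAli', 'Student_YenYang')]
```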
9.5 Implementation and Evaluation
In order to demonstrate the effectiveness and validity of the proposed approach, a prototype has been developed. The prototype was implemented using Java and PostgreSQL. We have visualized the output database and the social network using SNA (http://www.sapweb20.com/blog/2009/03/sap-enterprise-social-networking-prototype/) (Fig. 9.22). The hypernode database will be stored in an adapted database management system that also allows the storage of voluminous complex graphs.
Fig. 9.22 SNA visualization
We experimented with the process (for more details, see [38]) using the data of a database containing information about PhD students (administrative and technical information) and also information about relevant personnel. This database contains 1,788 students (with about 80 students added per year). To evaluate the scalability and performance of our method for converting a relational database into a hypernode database, a set of SQL queries was designed to observe any differences between the source relational database and the hypernode database. The comparison of the results from the two databases shows that the hypernode schema is generated without loss or redundancy of data, which demonstrates the correctness of the conversion. We used the same method to evaluate the transformation into a social network. For each entity e := <h, en, Ne>, we verify that (1) the same attributes appear in the relational database and in the social network, and (2) we find the same relations. For example, in order to verify an entity's attributes, we issue: Select * from <h.name> where <n0.name> = <n0.val>; (n0 is the first node in e's node set and usually corresponds to the primary key, or part of the primary key, in the relational database). For the entity "Student_Mohsen_Ali", the corresponding query is: Select * from Student where St_id = 03; The results obtained from these queries show the accuracy of the transformation approach: it can transform a relational database into a graph, from a social network perspective, without loss of information.
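A minimal sketch of this attribute check is shown below (assuming a psycopg2-style cursor on the source database and the running example's Student table): the entity's key value is looked up in the source table and every column value is compared with the corresponding node of the extracted entity.

```python
def verify_entity(cur, entity, table="Student", key_column="St_id"):
    key_value = entity["nodes"][key_column]
    cur.execute(f'SELECT * FROM "{table}" WHERE "{key_column}" = %s', (key_value,))
    row = cur.fetchone()
    columns = [d[0] for d in cur.description]
    # every attribute of the relational tuple must match the entity's nodes
    return row is not None and all(entity["nodes"].get(c) == v
                                   for c, v in zip(columns, row))
```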
9.6 Conclusion
In this chapter, we have presented the main graph database models and the associated graph query languages. Graph database models can model communities (social networks) and their activities, even though they are complex and dynamic (using a model based on complex nodes). Moreover, graph query languages are the most appropriate query languages for querying communities (e.g., better than RDF
query languages), but they are not well adapted to extracting information about communities, because they do not use social network analysis methods; nor do most of them offer techniques to extract information about paths and neighborhoods. We have also presented a social network extraction method that starts from a relational database and is based on (1) a transformation of the relational database into a hypernode database and (2) a social network extraction from the hypernode database. In our future work, we will focus on improving the extraction method by using ontologies describing the relations between entities in the relational database. We will then try to define a storage system based on the hypernode model and a graph query language better adapted to the social network structure. We will also work on generic transformation rules from different users' perspectives, and on the merging of graphs.
References
1. Shadbolt, N., Berners-Lee, T., Hall, W.: The semantic web revisited. IEEE Intell. Syst. 21(3), 96–101 (2006)
2. Xu, X., Zhan, J., Zhu, H.: Using social networks to organize researcher community. In: Proceedings of the IEEE ISI 2008 Paisi, Paccf, and SOCO International Workshops on Intelligence and Security Informatics, pp. 421–427. Springer, Heidelberg (2008)
3. Angles, R., Gutierrez, C.: Survey of graph database models. ACM Comput. Surv. 40(1), 1–39 (2008). doi: http://doi.acm.org/10.1145/1322432.1322433
4. Gyssens, M., Paredaens, J., Gucht, D.V.: A graph-oriented object model for database end-user interfaces. SIGMOD Rec. 19(2), 24–33 (1990). doi: http://doi.acm.org/10.1145/93605.93616
5. Andries, M., Gemis, M., Paredaens, J., Thyssens, I., Bussche, J.D.: Concepts for graph-oriented object manipulation. In: Proceedings of the 3rd International Conference on Extending Database Technology: Advances in Database Technology, vol. 580, pp. 21–38. Springer, London (1992)
6. Kuper, G.M., Vardi, M.Y.: The logical data model. ACM Trans. Database Syst. 18, 379–413 (1993)
7. Gemis, M., Paredaens, J.: An object-oriented pattern matching language. In: JSSST, vol. 742, pp. 339–355. Springer, Heidelberg (1993)
8. Hidders, J., Paredaens, J.: GOAL, a graph-based object and association language. Advances in Database Systems: Implementations and Applications, CISM, 247–265 (1993)
9. Barnes, J.A.: Class and committees in a Norwegian island parish. Hum. Relat. 7, 39–58 (1954)
10. Codd, E.F.: Data models in database management. In: Proceedings of the 1980 Workshop on Data Abstraction, Databases and Conceptual Modeling (Pingree Park, Colorado, United States, June 23–26, 1980), pp. 112–114. ACM, New York, NY (1980)
11. Blau, H., Immerman, N., Jensen, D.: A visual language for querying and updating graphs. University of Massachusetts Amherst, Computer Science Department, Technical Report 2002-037 (2002)
12. Cruz, I.F., Mendelzon, A.O., Wood, P.T.: A graphical query language supporting recursion. SIGMOD Rec. 16(3), 323–330 (1987). doi: http://doi.acm.org/10.1145/38714.38749
13. Cruz, I.F., Mendelzon, A.O., Wood, P.T.: G+: Recursive queries without recursion. In: Proceedings of the 2nd International Conference on Expert Database Systems (EDS), pp. 645–666. Addison-Wesley, New York (1989)
14. Consens, M.P., Mendelzon, A.O.: Expressing structural hypertext queries in Graphlog. In: Proceedings of the 2nd International Conference on Hypertext, pp. 269–292. ACM, New York, NY (1989)
15. Levene, M., Poulovassilis, A.: An object-oriented data model formalised through hypergraphs. Data Knowl. Eng. 6(3), 205–224 (1991)
16. Hidders, J.: Typing graph-manipulation operations. In: Proceedings of the 9th International Conference on Database Theory (ICDT), pp. 394–409. Springer, Heidelberg (2002)
17. He, H., Singh, A.K.: Graphs-at-a-time: Query language and access methods for graph databases. In: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD '08, pp. 405–418. ACM, New York, NY (2008)
18. Abiteboul, S., Quass, D., McHugh, J., Widom, J., Wiener, J.L.: The Lorel query language for semistructured data. Int. J. Digit. Libr. 1(1), 68–88 (1997)
19. Flesca, S., Greco, S.: Querying graph databases. In: Zaniolo, C., Lockemann, P.C., Scholl, M.H., Grust, T. (eds.) Proceedings of the 7th International Conference on Extending Database Technology: Advances in Database Technology (March 27–31, 2000). Extending Database Technology, vol. 1777, pp. 510–524. Springer, London (2000)
20. Güting, R.H.: GraphDB: Modeling and querying graphs in databases. In: Proceedings of the 20th International Conference on Very Large Data Bases, pp. 297–308. Morgan Kaufmann Publishers, San Francisco, CA, 12–15 Sept (1994)
21. Sheng, L., Ozsoyoglu, Z.M., Ozsoyoglu, G.: A graph query language and its query processing. In: Proceedings of the 15th International Conference on Data Engineering (ICDE), pp. 572–581. IEEE Computer Society, Washington, DC (1999)
22. Ronen, R., Shmueli, O.: SoQL: A language for querying and creating data in social networks. In: Proceedings of the 2009 IEEE International Conference on Data Engineering (ICDE), pp. 1595–1602. IEEE Computer Society, Washington, DC, March 29–April 02 2009
23. Amann, B., Scholl, M.: Gram: A graph data model and query language. In: Proceedings of the European Conference on Hypertext Technology (ECHT), pp. 201–211. ACM, New York, NY (1992)
24. Paredaens, J., Peelman, P., Tanca, L.: G-Log: A graph-based query language. IEEE Trans. Knowl. Data Eng. 7(3), 436–453 (1995)
25. Levene, M., Loizou, G.: A graph-based data model and its ramifications. IEEE Trans. Knowl. Data Eng. 7(5), 809–823 (1995). doi: http://dx.doi.org/10.1109/69.469818
26. Kaplan, I.: A semantic graph query language. Lawrence Livermore National Laboratory, Technical Report UCRL-TR-255447 (2006)
27. Miller, E., Swick, R., Brickley, D.: Resource description framework (RDF). W3C Recommendation (2004)
28. Klyne, G., Carrol, J.J., McBride, B.: Resource description framework (RDF): Concepts and abstract syntax. W3C Recommendation. http://www.w3.org/TR/rdf-concepts/ (2004)
29. Karvounarakis, G., Alexaki, S., Christophides, V., Plexousakis, D., Scholl, M.: RQL: A declarative query language for RDF. In: Proceedings of WWW'02, pp. 592–603. ACM, New York, NY (2002)
30. Broekstra, J.: SeRQL: Sesame RDF query language. In: SWAP Deliverable 3.2 Method Design (2003)
31. Seaborne, A.: RDQL – A query language for RDF. Member Submission, W3C (2004)
32. Sintek, M., Decker, S.: TRIPLE – A query, inference, and transformation language for the semantic web. In: Proceedings of the 1st International Semantic Web Conference on the Semantic Web. Lecture Notes in Computer Science, vol. 2342, pp. 364–378. Springer, Heidelberg (2002)
33. Alkhateeb, F., Baget, J., Euzenat, J.: Extending SPARQL with regular expression patterns (for querying RDF). Web Semant. 7(2), 57–73 (2009)
34. Bonifati, A., Ceri, S.: Comparative analysis of five XML query languages. SIGMOD Rec. 29(1), 68–79 (2000)
35. Ceri, S., Comai, S., Damiani, E., Fraternali, P., Paraboschi, S., Tanca, L.: XML-GL: A graphical language for querying and restructuring XML documents. Comput. Netw. 31, 1171–1187 (1999)
36. Kirchhoff, L., Stanoevska-Slabeva, K., Nicolai, T., Fleck, M.: Using social network analysis to enhance information retrieval systems. In: Applications of Social Network Analysis (ASNA) (2008)
37. Maatuk, A., Akhtar, M., Rossiter, B.N.: Relational database migration: A perspective. In: DEXA'08, pp. 676–683 (2008)
38. Soussi, R., Aufaure, M.A., Baazaoui, H.: Towards social network extraction using a graph database. In: Proceedings of the 2nd International Conference on Advances in Databases, Knowledge, and Data Applications, pp. 28–34 (2010)
39. Graves, M., Bergeman, E.R., Lawrence, C.B.: Graph database systems for genomics. IEEE Eng. Med. Biol. 14(6), 737–745 (1995)
40. Levene, M., Poulovassilis, A.: The hypernode model and its associated query language. In: Proceedings of the 5th Jerusalem Conference on Information Technology (Jerusalem, Israel), pp. 520–530. IEEE Computer Society Press, Los Alamitos, CA (1990)
41. Harel, D.: On visual formalisms. Commun. ACM 31(5), 514–530 (1988)
42. Pérez, J., Arenas, M., Gutierrez, C.: Semantics of SPARQL. Tech. rep., Department of Computer Science, Universidad de Chile, TR/DCC-2006-17 (2006)
43. Deutsch, A., Fernandez, M., Florescu, D., Levy, A., Suciu, D.: A query language for XML. Comput. Netw. 31, 1155–1169 (1999)
Chapter 10
Semantics in Wiki
Lorna Uden
Abstract Wikis provide flexible ways of supporting the quick and simple creation, sharing and management of content. Despite the obvious benefits of a wiki, its contents are barely machine interpretable. Structural knowledge, for example about how concepts are interrelated, can neither be formally stated nor automatically processed. In addition, numerical data is available only as plain text and thus cannot be processed according to its actual meaning. Semantic Wikis provide the ability to capture (by humans), store and later identify (by machines) further meta-information or metadata about wiki articles and hyperlinks, as well as their relations. The Semantic Web provides intelligent access to heterogeneous, distributed information, enabling software products (agents) to mediate between user needs and the information sources available. This chapter describes the use of semantic technologies in wikis.
10.1 Introduction
Wikis are usually viewed as tools to manage online content in a quick and easy way by editing some simple syntax known as wiki-text. A typical wiki is a repository of information. That information is placed on pages, which are interconnected through a web of links. Also, pages are categorised and have a heading. The aim of wiki is to organise the collected knowledge and to share this information. Despite the obvious benefit of a wiki, its contents are barely machine interpretable. Structural knowledge, for example about how concepts are interrelated, can neither be formally stated nor be automatically processed. In addition, numerical data is available only as plain text and thus cannot be processed by its actual meaning. Since Wikis are by nature unstructured, use of semantics can greatly facilitate the process of sifting through the information. In a corporate wiki, which has been in use for a lengthy period of time, the amount of information on projects,
L. Uden Faculty of Computing, Engineering and Technology, Staffordshire University, Beaconside, Stafford ST18 0AD, UK e-mail: [email protected]
customers, processes, and so forth, can be overwhelming and, despite good categorisation and other navigational tools, being able to search through the pages by using semantics is an important advancement in the technology. Also, updating different language versions becomes easier. This chapter begins with an introduction to Wikis. This is followed by a definition of a Wiki, Web 2.0 applications and Wiki, and a discussion of the importance and limitations of Wikis in Sect. 10.2. In Sect. 10.3, Semantic Webs are discussed and their importance is critiqued. In Sect. 10.4, we provide an overview of the development of and rationale behind Semantic Web Wikis, followed in Sect. 10.5 by a description of current Semantic Web Wiki applications and in Sect. 10.6 the direction of future developments. Section 10.7 presents our conclusion.
10.2 Background
A Wiki (from WikiWiki, meaning "fast" in Hawaiian) is a set of linked Web pages, created through incremental development by a group of collaborating users, as well as the software used to manage the set of Web pages. A Wiki:
– Enables Web documents to be authored collectively
– Uses a simple markup scheme
– Does not publish content instantly, once an author submits a page to the Wiki engine
– Creates new Web pages when users create hyperlinks that point nowhere
A Wiki is essentially a collection of Websites connected by hyperlinks [1]. “Web 2.0” is commonly associated with Web applications that facilitate interactive information sharing, interoperability, user-centred design and collaboration on the World Wide Web. Examples of Web 2.0 include Web-based communities, hosted services, Web applications, social networking sites, video-sharing sites, wikis, blogs, mashups and folksonomies. A Web 2.0 site allows its users to interact with other users or to change Website content in contrast to non-interactive Websites where users are limited to the passive viewing of information that is provided to them [2]. Web 2.0 is not a new technology; it is the creative use and bundling of existing technologies for use in new ways.
10.2.1 Social Software
Social software belongs to Web 2.0. In Web 1.0, a few authors of content provide material for a wide audience of relatively passive readers. Conversely, Web 2.0 allows every user of the Web to use it as a platform to generate, re-purpose and consume shared content. According to Boyd [3], there are three characteristics attributed to social software, namely:
– Support for conversational interaction between individuals or groups, ranging from real-time instant messaging to asynchronous collaborative teamwork spaces. This category also includes collaborative commenting on, and within, blog spaces.
– Support for social feedback that allows a group to rate the contributions of others, perhaps implicitly, leading to the creation of digital reputation.
– Support for social networks to explicitly create and manage a digital expression of people's personal relationships and to help them build new relationships.
10.2.2 Wiki
Wikis are a well-known Web 2.0 content management platform. The first wiki was designed by Cunningham and Bo Leuf [4] in 1995 because they wanted to make their hypertext database collaborative. Their idea was that it would be a collaborative community Website where anyone can contribute. Anyone should be able to edit any page from a simple Web form. Making the site easily editable gives it numerous advantages. This is because it encourages many people to participate in creating the Website together, making Wikis a very useful knowledge base. Wikis are being used in many organisations to provide affordable and effective Intranets and for knowledge management purposes. Since 1995, many wiki-inspired programmes and Wiki Websites have been created. Some Wiki knowledge bases include: Wikipedia, The Tcl'ers Wiki, The Emacs Wiki, British Telecom, TWiki, Colorado State University, USA, etc. All of them share the following common properties [1]:
– Edit via browser
– Simplified wiki syntax
– Rollback mechanism
– Strong linking
– Unrestricted access
– Collaborative editing
A Wiki is a system that allows one or more people to build up a corpus of knowledge in a set of interlinked Web pages, using a process of creating and editing contents. Wikipedia is the most famous Wiki. There are many possible uses for a Wiki, including research collaboration, multi-authored papers, project work and maintenance of documents that require regular updating. Some people describe Wiki as a collaborative content management system. It means that a Wiki allows multiple people to work on the same document. Wikis permit asynchronous communication and group collaboration across the Internet. They can be described as a composition system, a discussion medium, a repository, a mail system and a tool for collaboration. Wikis provide users with both author and editor privileges; the overall organisation of contributions, as well as the
content itself, can be edited. It can incorporate sounds, movies and pictures. A Wiki can be used as a simple tool to create multimedia presentations and simple digital stories. Wiki content – contributed "on the fly" by subject-matter specialists – can be immediately (and widely) viewed and commented on. Matias [5] argues that Wikis can be thought of as simple interfaces to a hypertext database for keeping track of notes and interlinked information. According to Matias [5], there are several advantages of using Wiki, one of which includes giving an opportunity for users with different types of knowledge, confidence and communication skills to contribute equally to a joint publication, reducing e-mail traffic. Collaborators who lack confidence to argue a case in a "live" face-to-face debate can feel more comfortable making the same points in a Wiki environment where the pace of discussion is slower and the quality of the thinking is more significant than force of personality. Wikis work well for keeping track of notes and interlinked information. In addition, Wikis have the following advantages [5]:
– They simplify the editing of a Website
– They use simple markup
– They record document histories
– Creating links is simple with Wikis
– Creating new pages is simple with Wikis
– Wikis simplify site organisation
– Wikis keep track of all material
– Many Wikis are collaborative communities
– Wikis encourage good hypertext
TWiki [6] is a flexible, powerful and easy-to-use enterprise Wiki, enterprise collaboration platform and Web application platform. It is a Structured Wiki, typically used to run a project development space, a document management system, a knowledge base or any other groupware tool on an intranet, extranet or the Internet. The applications can be implemented by users who have no programming skills. Developers can extend the functionality of TWiki with Plugins. TWiki fosters information flow within an organisation, allows distributed teams to work together seamlessly and productively and eliminates the one-Webmaster syndrome of outdated intranet content. TWiki is developed by an active open-source community on twiki.org. TWiki is used internally to manage documentation and project planning of Yahoo products. TWiki is a mature, full-featured Web-based collaboration system [6]:
– Any Web browser can be used to edit existing pages or create new pages. There is no need for ftp or http to upload pages.
– Edit link: To edit a page, simply click on the Edit link at the bottom of every page.
– Auto links: Web pages are linked automatically. One does not need to learn HTML commands to link pages.
– Text formatting: Simple, powerful and easy to learn text formatting rules. Basically, the user writes text like she or he would write an e-mail.
– Webs: Pages are grouped into TWiki Webs (or collections). This allows the setting up of separate collaboration groups.
– Search: Full-text search with/without regular expressions (see a sample search result).
– E-mail notification: a user is automatically notified when something has changed in a TWiki Web. Subscribe in WebNotify.
– Structured content: TWiki Forms are used to classify and categorise unstructured Web pages and to create simple workflow systems.
– File attachments: Any file can be uploaded and downloaded as an attachment to a page by using the browser. This is similar to file attachments in an e-mail, albeit on Web pages.
– Revision control: All changes to pages and attachments are tracked. Previous page revisions and differences can be retrieved and tracked to determine who has made the change and when.
– Access control: Allows groups to be defined and imposes fine-grained read and write access restrictions based on groups and users.
– Variables: Variables are used to dynamically compose pages. This allows the user, for example, to dynamically build a table of contents, include other pages or show a search result embedded in a page.
– TWiki Plugins: TWiki functionality can be enhanced with server side Plugin modules. Developers can create Perl Plugins using the TWiki Plugin API.
10.2.3 Wikipedia
The most popular Wiki is Wikipedia. This online collaborative encyclopaedia project is one of the most visited Websites in the world. But Wikipedia is just one part of a bigger picture. The Wikimedia Foundation [2], which runs Wikipedia, also hosts several other reference sites like Wiktionary and Wikiquotes. Wikipedia was "born" on 15 January 2001 and was created by Jimbo Wales and Larry Sanger. It was conceived as an encyclopaedia based on the Wiki principle, that is, an encyclopaedia which could be edited by anyone [7]. The first version of Wikipedia was created in English, and it was soon followed by a version in French, created on 11 May 2001. Wikipedia is a general encyclopaedia with diversified contents. Articles in Wikipedia are categorised by topics into subcategories, thus forming a hierarchical structure of Wikipedia contents. Wikipedia is a free encyclopaedia, well established as the world's largest online collection of encyclopaedic knowledge. It is also an example of global collaboration within an open community of volunteers. Wikipedia is based on the MediaWiki software. The idea of Wikipedia is to allow everyone to edit and extend the encyclopaedic content (or simply correct typos). Besides the encyclopaedic articles on many subjects, Wikipedia also contains numerous articles that are meant to enhance the browsing of Wikipedia. These include lists of the countries of the world, sorted by area or population, and the index of free
speech. Currently, all these lists have to be written manually. This introduces several sources of inconsistency.
10.2.4 Limitations of Wikipedia
One of the main benefits of Wikipedia is the strong interconnection of its articles via hyperlinks. The ubiquity of such links in Wikipedia is one of the key features for finding desired information. In spite of its revolutionary editing mechanism and organisation, Wikipedia's dedicated facilities for searching information are surprisingly primitive. Users often rely on full-text search, article names or links for finding information. So it became common to create pages with the sole purpose of collecting links (lists of articles). A more structured approach with a similar functionality is Wikipedia's category system [8]. Although Wikipedia is the biggest collaboratively created source of encyclopaedic knowledge, it is expanding beyond the borders of any traditional encyclopaedia and is facing the problems of knowledge management; its current organisation is no longer sufficient for today's needs. Currently, Wikipedia's contents are accessible only for human reading. According to Völkel et al. [9], the current Wikipedia has no way to automatically gather information scattered across multiple articles. They give the example of a query such as "Give me a table of all movies from the 1960s with Italian directors". Although the data is quite structured (each movie in its own article, with links to actors and directors), its meaning is unclear to the computer, because it is not represented in a machine-processable (i.e. formalised) way. Another drawback of Wikis is their iterative nature (sections appearing and disappearing as a document evolves), which can be very difficult to track, especially for those with slow reading speeds. Semantic Web technologies offer ways to overcome such problems.
10.3 Semantic Web
Tim Berners-Lee coined the term Semantic Web when envisioning the next dramatic evolution of Web technology. He envisions forms of intelligence and meaning being added to the display and navigational context of the current World Wide Web (Web). The Semantic Web is a long-range development that is being built in stages by groups of researchers, developers, scientists and engineers around the world through a process of specification and prototypes instantiating these interoperable specifications [10].
"The Semantic Web is an extension of the current Web in which information is given well-defined meaning, better enabling computers and people to work in co-operation." – Tim Berners-Lee, James Hendler, Ora Lassila, The Semantic Web, Scientific American, May 2001.
The W3C has developed a set of standards and tools to support this vision, and after several years of research and development, these are now usable and could make a significant impact. Semantic Web is an extension of the current Web in which information is given well-defined meaning, better enabling computers and people to work in cooperation [11]. While Web 2.0 focuses on people, the Semantic Web is focused on machines. The Web requires a human operator, using computer systems to perform the tasks required to find, search and aggregate its information. It is impossible for a computer to perform these tasks without human guidance because Web pages are specifically designed for human readers. The Semantic Web is a project that aims to change this by presenting Web page data in such a way that it is understood by computers, enabling machines to do the searching, aggregating and combining of the Web’s information – without a human operator. The Semantic Web is about two things. It is about common formats for integration and combination of data drawn from diverse sources, whereas the original Web concentrated mainly on the interchange of documents. It is also about language for recording how the data relates to real-world objects. This allows a person, or a machine, to start off in one database and then move through an unending set of databases which are connected not by wires, but by common subject matter [12]. The Semantic Web is an extension to the Web that adds new data and metadata to existing Web documents, extending those documents into data. This extension of Web documents to data is what will enable the Web to be processed automatically by machines and also manually by humans. The Semantic Web is typically built on syntaxes that use URIs to represent data, usually in triples-based structures: i.e. many triples of URI data that can be held in databases or interchanged on the World Wide Web using a set of particular syntaxes developed especially for the task. These syntaxes are called “Resource Description Framework” (RDF) syntaxes. The RDF is used to turn basic Web data into structured data that software can make use of. RDF works on Web pages and also inside applications and databases. Data are accessed using the general Web architecture using, for example, URIs. Data should be related to one another just as documents (or portions of documents) are already. This also requires the creation of a common framework that allows data to be shared and reused across application, enterprise and community boundaries, and to be processed automatically by tools, as well as manually, including revealing possible new relationships among pieces of data. A variety of applications can be used in Semantic Web technologies. These include: data integration, whereby data in various locations and various formats can be integrated in one, seamless application; resource discovery and classification to provide better, domain-specific search engine capabilities; cataloguing for describing the content and content relationships available at a particular Website, page or digital library; use by intelligent software agents to facilitate knowledge sharing and exchange; content rating; description of collections of pages that represent a single logical “document”; description of intellectual property rights of Web pages.
It has taken years to put together the components that comprise the Semantic Web, including the standardisation of RDF, the W3C release of the Web Ontology Language (OWL) and standardisation on SPARQL, which adds querying capabilities to RDF. So with standards and languages in place, we can see Semantic Web technologies being used by early adopters. Semantic Web technologies are popular in areas such as research and the life sciences, where they can help researchers by aggregating data on different medicines and illnesses that have multiple names in different parts of the world. On the Web, Twine is offering a knowledge networking application that has been built with Semantic Web technologies. The Joost online television service also uses Semantic technology on the backend; here, Semantic technology is used to help Joost users understand the relationships between pieces of content, enabling them to find the types of content they want most. Oracle offers a Semantic Web view of its Oracle Technology Network, called the OTN Semantic Web, to name just a few of the companies that are implementing Semantic Web technologies. The architecture of the Semantic Web is shown in Fig. 10.1. The seven layers are as follows:
– Unicode and URI: Unicode, the standard for computer character representation, and URIs, the standard for identifying and locating resources (such as pages on the Web), provide a baseline for representing characters used in most of the languages in the world and for identifying resources.
– XML: XML and its related standards, such as Namespaces and Schemas, form a common means for structuring data on the Web, but without communicating the meaning of the data. These are already well established within the Web.
– RDF: RDF is the first layer of the Semantic Web proper. RDF is a simple metadata representation framework, using URIs to identify Web-based resources and a graph model for describing relationships between resources. Several syntactic representations are available, including a standard XML format.
Fig. 10.1 Semantic Web layered architecture
– RDF Schema: a simple type modelling language for describing classes of resources and properties between them in the basic RDF model. It provides a simple reasoning framework for inferring types of resources.
– Ontologies: a richer language for providing more complex constraints on the types of resources and their properties.
– Logic and Proof: an (automatic) reasoning system provided on top of the ontology structure to make new inferences. Thus, using such a system, a software agent can make deductions as to whether a particular resource satisfies its requirements (and vice versa).
– Trust: The final layer of the stack addresses issues of trust that the Semantic Web can support. This component has not progressed further than a vision of allowing people to ask questions about the trustworthiness of the information on the Web in order to receive an assurance of its quality.
10.3.1 Resource Description Framework A language for representing information about resources in the World Wide Web is known as a RDF. It is used to represent metadata about Web resources, such as the title, author and modification date of a Web page, copyright and licensing information about a Web document or the availability schedule for some shared resource. RDF can also be used to represent information about things that can be identified on the Web, even when they cannot be directly retrieved on the Web. Examples include information about items available from online shopping facilities (e.g. information about specifications, prices and availability), or the description of a Web user’s preferences for information delivery. RDF is a graph-centric approach which builds on the intuition that the fundamental aspects of a domain of interest can be modelled as binary relationships between objects. This linking structure forms a directed, labelled graph, where the edges represent the named link between two resources, represented by the graph nodes. This graph view is the easiest possible mental model for RDF and is often used in easy-to-understand visual explanations [8]. RDF is based on the idea of identifying things using Web identifiers (called Uniform Resource Identifiers, or URIs) and describing resources in terms of simple properties and property values. This enables RDF to represent simple statements about resources as a graph of nodes and arcs representing the resources, and their properties and values. For example, “there is a Person identified by http://www.w3. org/People/EM/contact#me, whose name is Eric Miller, whose email address is [email protected], and whose title is Dr”, could be represented as the RDF graph in Fig. 10.2 [13]. RDF extends the linking structure of the Web to use URIs to name the relationship between things and the two ends of the link (this is usually referred to as a “triple”). This simple model allows structured and semi-structured data to be mixed, exposed and shared across different applications [12]. RDF provides a common
Fig. 10.2 RDF example [13]
framework for expressing information so it can be exchanged between applications without loss of meaning. The ability to exchange information between different applications means that the information may be made available to applications other than those for which it was originally created. According to Davies [14], Semantic technologies are functional capabilities that enable both people and computers to create, discover, represent, organise, process, manage, reason with, present, share and utilise meanings and knowledge to accomplish business, personal and societal purposes. Semantic technologies are tools that represent meanings, associations, theories and know-how about the uses of things separately from data and programme code. This knowledge representation is called ontology – a run-time semantic model of information, defined using constructs for: l l l l
Concepts – classes, things Relationships – properties (object and data) Rules – axioms and constraints Instances of concepts – individuals (data, facts)
10.3.2 Functions of Semantic Technology Davies [14] describes the functions of semantic Web technology as: to create, discover, represent, organise, process, manage, reason with, present, share and utilise meanings and knowledge in order to accomplish business, personal and societal purposes. Davies further pointed out that the business value of semantic technologies has three dimensions or axes as shown in Fig. 10.3: l
l
Capabilities enabled by semantic technologies and new solution patterns. New capabilities are the main value driver. Lifecycle economics of semantic solutions measured as the ratio of benefits to cost and risk. The lifecycle perspective focuses on development risk.
10
Semantics in Wiki
245
Fig. 10.3 Business value dimensions
l
Performance of semantic solutions measured by improvements in efficiency, effectiveness or strategic edge. Performance focuses on returns. In the following pages, we discuss each dimension of business value in more detail.
Davies [14] also suggested that Semantic technologies drive business value by providing superior capabilities (increased capacity to perform) in five critical areas: l
l
l
Development – Semantic automation of the “business-need-to-capabilityto-simulate-to-test-to-deploy-to-execute” development paradigm solves problems of complexity, labour-intensivity, time-to-solution, cost and development risk. Infrastructure – Semantic enablement and orchestration of core resources for transport, storage and computing help solve problems of infrastructure scale, complexity and security. Information – Semantic interoperability of information and applications in context, powered by semantic models makes “killer apps” of semantic search, semantic collaboration, semantic portals and composite applications. Knowledge – Knowledge work automation and knowledge worker augmentation based on executable knowledge assets enable new concepts of operation, superproductive knowledge work, enterprise knowledge-superiority and new forms of intellectual property. Behaviour – Systems that learn and reason as humans do, using large knowledge bases, and reasoning with uncertainty and values, as well as logic enable new categories of hi-value product, service and process.
10.4 Semantics for Wikipedia
Semantic wiki systems, such as Semantic MediaWiki (SMW) [15] and IkeWiki [16], have been developed to extend conventional wikis by providing support for simple semantic annotations on wiki pages, such as categories and typed links. A wiki page in semantic wikis may contain both conventional wiki text and structured data, and the structured data can be further accessed by customisable queries using simple query languages.
Völkel et al. [9] have provided an extension to MediaWiki which allows important parts of Wikipedia's knowledge to be made machine processable with as little effort as possible. These authors arrived at the following key elements for annotations:

• Categories, which classify articles according to their content
• Typed links, which classify links between articles according to their meaning
• Attributes, which specify simple properties related to the content of an article
Categories already exist in Wikipedia, though they are mainly used to assist browsing. According to these authors, typed links are obtained from normal links by slightly extending the way a hyperlink between articles is created. Attributes provide a source of machine-readable data by incorporating data values into the encyclopaedia. Typically, such values are provided in the form of numbers, dates or coordinates.
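To make the distinction concrete, the sketch below shows hypothetical wiki text containing a typed link, an attribute and a category in SMW's double-bracket property syntax, and extracts the annotations as machine-readable pairs; the property names and values are invented for illustration.

import re

# Hypothetical wiki text for an article about a city. "[[located in::Germany]]"
# is a typed link (it still renders as a link to "Germany"), while
# "[[population::3,400,000]]" is an attribute carrying a data value;
# "[[Category:City]]" classifies the article.
wiki_text = """
Berlin is the capital of [[located in::Germany]] and has a population
of [[population::3,400,000]] inhabitants.
[[Category:City]]
"""

# Extract property::value annotations and category assignments from the markup.
annotations = re.findall(r"\[\[([^:\]]+)::([^\]]+)\]\]", wiki_text)
categories = re.findall(r"\[\[Category:([^\]]+)\]\]", wiki_text)

print(annotations)  # [('located in', 'Germany'), ('population', '3,400,000')]
print(categories)   # ['City']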
10.4.1 Development of Semantic Wiki

The development of Web application models has been driven by advances in Web technology. Early Web applications only allowed users to browse Web pages. Nowadays, users are able to control content publishing with the help of Web 2.0 technologies. Wikis and blogs today further provide extensible computing infrastructures that support scripting and/or customisable plug-ins to facilitate collective Web application development. Recent advances in the Social Semantic Web, such as SMW, allow users to collaboratively control structured data management. Figure 10.4 shows the different Web application models (from Bao et al. [17]). The earliest of these is the Conventional Model.
Fig. 10.4 A comparison of several Web application models (from [17])
In the Conventional Model, the Web application is composed of three clearly separated major components, namely the Web browser, the Web server and the backend storage system (e.g. a database or a file system). Users of such an application are provided with limited control over the contents of the application; interaction is usually limited to browsing and search. The representation, computation and presentation components are primarily hosted on the server and are controlled by Webmasters only. Other models extend the Conventional Model with extra client-side control over data and computation. The AJAX Model [18] uses an AJAX engine to act as a mediator between the browser and the server and is becoming increasingly popular due to its powerful client-side computing ability. It improves the user experience with respect to both data transfer (e.g. asynchronous data retrieval from the server without interfering with page display) and data presentation. For example, a powerful word processing system (e.g. Google Docs) can be used within a browser while the data is actually stored on the Web. It is notable that users may also insert client-side scripts into Web applications for customised processing.

According to Bao et al. [17], the Wiki-based Model enables end users to directly control some data content and presentation on the server side. For example, Wikipedia articles are collaboratively maintained by users, and complex wiki templates are frequently used to enable advanced page layout (e.g. to render a calendar). A wiki page may embed external script languages (e.g. JavaScript) for more advanced tasks. The SemWiki (Semantic Wiki)-based Model gives users additional control over the management and consumption of structured data. In Wikipedia today, for example, it is not yet possible to assert a structured, queryable annotation for an individual wiki page, nor to execute a query such as "all European countries that have female government leaders". Semantic wikis overcome these limitations by extending wikis with the ability to create and query structured annotations using a relatively simple modelling and querying language. As a result, users can exert greater levels of control over the data in an application.
10.4.2 Semantic MediaWiki

MediaWiki is software for running Web-based wikis and is used to run many of the most popular wikis in the world. It is written in PHP and is best known for powering Wikipedia. Semantic MediaWiki (SMW) is an extension of MediaWiki that helps to search, organise, tag, browse, evaluate and share the wiki's content. It was initially created by Markus Krötzsch, Denny Vrandečić and Max Völkel and was first released in 2005. The extension enables wiki users to semantically annotate wiki pages, based on which the wiki contents can be browsed, searched and reused in novel ways; SMW is thus a semantically enhanced wiki engine that enables users to annotate the wiki's contents with explicit, machine-readable information [15].

Typically, semantic wikis are built upon RDF triples for storing structured data. Data in a semantic wiki does not need to be stored with a predefined schema as is required by an RDBMS (although it is possible to do this). This conforms to the open nature of the Web.
It enables significant flexibility and extensibility in data modelling. The semantic wiki model also allows hybrid modelling with a predefined "schema", schema-free user-added metadata and unstructured data, thus making the extension of an application much easier. For example, users can always add new attributes as needed to a specific article in addition to existing attributes [17].

According to Bao et al. [17], some semantic wikis (like SMW) not only preserve the semantic structure of data, but also provide lightweight query capabilities (with a role similar to that of SELECT queries in SQL). They gave an example, saying that in SMW it is possible to pose a query {{#show [[Category:Article]][[tag::<q>Category:food]]}} to find all articles tagged with "food" or its subtags (like "doughnut"). These authors also argue that because the modelling specification and queries themselves are also presented as semantic wiki pages, they can be accessed, updated or deleted in the same fashion as other wiki pages in the browser. Therefore, semantic wikis function as a virtual abstraction layer over the Web server and database/file systems, so that programmers are not required to directly access the layers hidden below semantic wikis. Several MediaWiki extensions provide scripting functionalities. These extensions can be used to support a wide range of lightweight data processing abilities, including variables, data types and control flow [17]. In SMW, many elements of an application's user interface (UI) can be constructed using scripts. In MediaWiki (and thus also in SMW), users can also inject JavaScript code into a wiki page by including either server-side scripts or code in some client-editable special wiki pages. Some SMW-based applications (e.g. wikicafe.metacafe.com and metavid.org) have developed advanced UIs such as video browsing and annotation. Semantic wikis thus provide a transparent platform for lightweight Web application development through data modelling, processing and presentation (via a user interface) abilities. According to Bao et al. [17], such a development model enjoys several advantages: flexibility, socialisation, inference ability, efficient modelling ability and safety.

SMW is free software, available as an extension of the popular wiki engine MediaWiki. Figure 10.5 provides an overview of SMW's core components [15]. The integration between MediaWiki and SMW is based on MediaWiki's extension mechanism: SMW registers for certain events or requests, and MediaWiki calls SMW functions when needed. Hence, SMW does not overwrite any part of MediaWiki and can be added to existing wikis without much migration cost. According to Krötzsch et al. [15], by using semantic data, SMW addresses core problems of today's wikis, including consistency of content, accessing knowledge and reusing knowledge. While traditional wikis contain only text which computers can neither understand nor evaluate, SMW adds semantic annotations that allow a wiki to function as a collaborative database.
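In addition to inline queries like the #show example above, SMW-style structured data can also be retrieved by external clients. The following is a minimal sketch, assuming the target wiki exposes Semantic MediaWiki's "ask" API module through the standard MediaWiki api.php endpoint; the wiki URL, category and property names are hypothetical, as is the exact shape of the JSON response.

import requests

# Hypothetical SMW installation; replace with a real wiki that has the SMW API enabled.
API = "https://wiki.example.org/w/api.php"

# Ask-style query: countries annotated with a female head of government,
# also returning the Population property for each result.
query = "[[Category:Country]][[Head of government gender::female]]|?Population"

resp = requests.get(API, params={
    "action": "ask",        # provided by the Semantic MediaWiki extension
    "query": query,
    "format": "json",
}, timeout=10)
resp.raise_for_status()

# Assumed response layout: results keyed by page title, with requested
# properties under "printouts".
results = resp.json().get("query", {}).get("results", {})
for page_title, data in results.items():
    population = data.get("printouts", {}).get("Population", [])
    print(page_title, population)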
Fig. 10.5 Architecture of SMW's main components in relation to MediaWiki (from Krötzsch et al. [15])
The primary objective of SMW, therefore, is the seamless integration of semantic technologies into the established usage patterns of the existing MediaWiki system. SMW is available in multiple languages. Information is dynamic and changes in a decentralised way, and there is no central control of the wiki's content. Decentralisation leads to heterogeneity, but SMW ensures that existing processes of consensus finding can also be applied to the novel semantic parts. Wikis contain not only text but also uploaded files, especially pictures and similar multimedia content. A semantic wiki combines the strengths of wiki technology (easy to use and contribute to, strongly interconnected, collaborative) with the strengths of Semantic Web technologies (machine processable, data integration, complex queries). Semantic wikis have the ability to capture or identify information about the data within pages, and the relationships between pages, in ways that can be queried or exported like database data [19].

The primary structural mechanism of most wikis is the organisation of content within wiki pages. In MediaWiki, these pages are further classified into namespaces, which distinguish different kinds of pages according to their function [15]. Semantic data in SMW is also structured by pages, such that all semantic content explicitly belongs to a page. Semantically speaking, every page corresponds to an ontological element (including classes and properties) that might be further described by annotations on that very page. SMW collects semantic data by letting users add annotations to the wiki text of pages via a special markup. The processing of this markup is performed by the components for Parsing and Rendering in Fig. 10.5. Properties in SMW are used to express binary relationships between one semantic entity (as represented by a wiki page) and some other such entity or data value.

According to Krötzsch et al. [15], a number of other semantic wiki implementations besides SMW have been created. These include IkeWiki [16] and MaknaWiki [20].
Both of these are similar to SMW with respect to the supported kinds of easy-to-use inline wiki annotations and the various search and export functions. In contrast to SMW, both systems introduce the concept of ontologies and (to some extent) URIs into the wiki, which emphasises use-cases of collaborative ontology editing that are not the main focus of SMW. SMW has many advantages when compared with other semantic wiki engines, because it is built on the open-source MediaWiki, which has a large user community, a mature system framework and various extensions. Another advantage is that MediaWiki is flexible and can be easily extended for specific uses. SMW enables users to annotate the wiki's contents with explicit, machine-readable information [15], and it was chosen because of its robustness, flexibility and support [21].

According to Zou and Fan [21], all content in SMW is structured by wiki pages. Each page can be considered a resource, a property or an individual. Pages can be further classified into namespaces to distinguish different types of pages. Each wiki page also has a unique name which, in combination with its namespace, can be treated as its URI.
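Because each SMW page corresponds to a URI-identified resource, its annotations can also be pulled out as RDF. A minimal sketch follows, assuming the wiki exposes SMW's Special:ExportRDF page; the wiki base URL and page title are placeholders.

import requests
from rdflib import Graph

# Hypothetical SMW wiki; Special:ExportRDF serialises a page's annotations as RDF/XML.
BASE = "https://wiki.example.org/wiki"
page = "Berlin"

resp = requests.get(f"{BASE}/Special:ExportRDF/{page}", timeout=10)
resp.raise_for_status()

g = Graph()
g.parse(data=resp.text, format="xml")  # parse the RDF/XML export

# Every triple links the page's URI-identified resource to a property and value.
for subject, predicate, obj in g:
    print(subject, predicate, obj)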
10.4.3 Why Semantic MediaWiki

Semantic MediaWiki introduces some additional markup into the wiki text which allows users to add "semantic annotations" to the wiki. These simplify the structure of the wiki, help users to find more information in less time and improve the overall quality and consistency of the wiki. Some of the benefits of using SMW are [2]:

• Automatically generated lists
• Visual display of information
• Improved data structure
• Searching information
• Inter-language consistency
• External reuse
• Integrate and mash-up data
Semantic wikis enhance collaboration and information sharing by providing capabilities such as:

• Concept-based rather than language-based searching: queries span vocabularies, languages and search engines.
• Question answering rather than simple retrieval. Also, overlay ontologies and knowledge bases can integrate with major Web search engines.
• More richly structured content navigation, including multiple perspectives, multiple levels of abstraction, dependency/contingency relationships, etc.
• Easy visualisation of content structure (categories, taxonomies, semantic nets, etc.).
• Direct editing of content structure.
• Mining of semantic relationships in content.
• Wiki content linked to dynamic models, simulations and visualisations.
• Wiki content linked with external repositories and file systems (e.g. personal desktop, enterprise servers, Web sources, semantic-enabled feeds such as RSS).
• Richer user access/rights models, including reputation systems.
According to Bao et al. [17], semantic wikis offer the promise of a new application model with the following interesting characteristics:

• Rich data modelling: User-contributed content may be a mixture of text and structured data, and semantic wikis can best preserve structured data without forcing a structured representation of the free-text part. With the structured data, a semantic wiki can function like a lightweight database or a knowledge base, and users can model data using several common modelling methods, e.g. relational modelling or rule modelling.
• Transparent data processing: As wikis allow simple computing logic (such as declaring an object and applying a data processing rule) to be published as part of wiki pages in the form of wiki scripts, it is transparent to all wiki users and can be collaboratively authored and improved in Web browsers.
• Social programming: The transparency of data modelling and data processing, and the convergence model of wikis themselves, opens up the development of Web applications to all interested users.
10.5 Current Semantic Wiki Applications
Semantic MediaWiki has come a long way from its roots as an academic research project. It is currently in active use on hundreds of sites, in many languages, around the world, including Fortune 500 companies, biomedical projects, government agencies and consumer directories. A number of consulting companies implement SMW as part of their solutions, including Benchmarking Partners, FZI, LeveragePoint, Ontoprise and WikiWorks. Four Websites currently hosting Semantic MediaWiki and some of its extensions are Wikia, Referata, YourWiki and Pseudomenon.
10.5.1 Semantic MediaWiki

Dengler et al. [22] have extended the Semantic MediaWiki (SMW) software with process modelling and visualisation functionalities to support collaborative, distributed and iterative process documentation with a modelling tool. Using widespread and well-accepted wiki technology, users are able to model and update organisational processes in a familiar environment by reusing externalised knowledge already stored in wikis. According to these authors, other expected advantages of modelling business processes with SMW are the reduced maintenance costs of the system and improved interoperability through the RDF export functionality already provided by SMW.
An application of Semantic MediaWiki was proposed by Zou and Fan [21] to construct a terminology registry. A terminology registry registers and points to terminology resources: it lists, describes, identifies and points to sets of vocabularies available for use in information systems and services [23]. Terminology registries provide the fundamental infrastructure for terminology services, such as Web navigation, query expansion, cross-language retrieval and metadata creation [21]. According to Proffitt [24], a terminology is a list or vocabulary of terms (words or phrases) or notation used to describe, navigate and search content. Terminology resources may contain terms, concepts and their relationships in vocabularies, as well as metadata schemas. These resources can be considered part of the linked data that can be shared, mashed up and reused by either humans or applications on the Semantic Web. A terminology resource may be listed, described and identified by one or more terminology registries. Examples of terminology registries include the Dublin Core Metadata Initiative (DCMI) registry, the National Science Digital Library (NSDL) metadata registry and the CORES metadata registry.
10.5.2 SMW+

SMW+ is a semantic enterprise wiki that allows one to tag, process and query data. Thus, it serves as a real knowledge management platform, fostering the integration and reuse of knowledge. It combines a wiki's social authoring approach with proven semantic technology. It also allows the aggregation of information in suitable views and output formats. Users can, for example, dynamically compile their own task lists or present events automatically in the form of an up-to-date timeline. In general, the semantic features enable better organisation, browsing and retrieval of wiki contents compared with traditional, non-semantic wikis. SMW+ is open source: there is complete control over the code and it can be customised to individual needs. The flexible nature of SMW+ allows its use in a variety of applications and usage scenarios. Typical scenarios for which SMW+ is suited include project management, knowledge management, software development and vocabulary building. The developers of SMW+ compare it to a Swiss army knife: a tool that is hardly limited to a particular scenario.

In order to enable companies to operate in a global environment, a flexible communication and information infrastructure that can be easily adapted to changing needs is required. Cearley et al. [25] argue that a modern Service-Oriented Architecture (SOA) infrastructure is the answer. An essential component of a SOA infrastructure is the central service registry. According to Paoli et al. [26], current standards for organising service registries and their implementations are driven by the technical aspects of the infrastructure. When using such technically organised service registries, business users often fail to find the information they need. Paoli et al. [26] proposed a new approach to the organisation and implementation of business registries that is driven by the needs of business users, using UDDI and Semantic MediaWiki.
Paoli et al. [26] believe that service registries should address all stakeholders. Current service descriptions concentrate only on service developers. Service discovery has technical and business ("semantic") facets. The technical part of a service description deals with the syntax of the service interface and is affected by the underlying SOA infrastructure. The semantic part should reflect the business objectives of the service. They conclude that traditional information retrieval techniques based on descriptive terms are clearly insufficient and must be augmented by considering each term together with its network of somehow-related terms. These authors argue that for technical and application reasons the technical and business aspects of the service description should be kept separate. They also suggest that UDDI is practically the only standard for advertising services through service registries. The use of UDDI serves to establish a worldwide service registry, creating a worldwide market of services and enabling small and unknown companies anywhere in the world to offer their innovative services to customers on the other side of the globe.

The rapid growth of personal and community data on the Web demands effective social Webtops, i.e. Web-based infrastructure for users to organise and share online data [27]. A social Webtop uses the Web for data storage and provides special applications for seamlessly moving our daily data processing applications onto the Web. For example, Wikipedia reduces the need for encyclopaedia software on PCs, and Google Docs, an open Web-based alternative, does the same for conventional word processing software. According to these authors, a successful social Webtop requires many critical capabilities, such as structured data representation, smart data integration and propagation, provenance-aware social data access control, Web data persistency and preservation, and a friendly data access UI. Bao et al. [27] argue that a social Webtop should provide versatile concept modelling support to facilitate Web users in creating, propagating, accessing and consuming data. It should also be aware of social provenance and support both socialisation, which promotes collaborative data generation and consumption, and personalisation, which ensures necessary data privacy protection and customisation. According to these authors, the semantic wiki is not only a general-purpose tool for content management, but also a powerful workbench for building lightweight social Webtop applications. Social Webtop applications can be developed on internal and external online data using wiki-based concept modelling and programming power. Each layer gives a higher level of abstraction of the information than the layer below.
10.5.3 Personal Knowledge Management

A key to success in the modern economy is the management of knowledge. Knowledge is fundamentally created by individuals [28]. It is therefore important to support individuals in their personal knowledge management.
Fig. 10.6 Semantic Personal Knowledge Management supporting externalisation (authoring) and internalisation (learning) of personal knowledge (from [29])
According to Völkel and Oren [29], current tools for personal knowledge management have limitations: analogue approaches are not automated and cannot be searched, while digital approaches are restrictive, do not support ad hoc structures and do not support associative browsing. These authors have applied Semantic Web and wiki technologies to personal knowledge management, using wiki technology as a personal authoring environment. The implementation is an SPKM tool, a semantically enhanced personal knowledge management tool, shown in Fig. 10.6. In the system, each individual uses his own SPKM tool as a personal knowledge repository. Völkel and Oren [29] argue that the individual benefits personally from this system through better retrieval of, and reminders about, his knowledge. His personal wiki is connected to other applications and other wikis; this network allows individuals to combine their knowledge through sharing and exchange. According to Völkel and Oren [29], personal information management (PIM) tools such as Microsoft Outlook support finding and reminding very well but do not offer any support for authoring, knowledge reuse and collaboration (again, context management and interoperability are lacking). These authors use an SPKM system consisting of a semantic wiki enhanced in multiple ways to support all their requirements.
10.5.4 ITSM Semantic Wiki

IT service management (ITSM) is a set of processes for planning, organising, directing and controlling the provisioning of IT services. There are several frameworks that provide guidelines for implementing ITSM processes, the Information Technology Infrastructure Library (ITIL) being the most prominent [30]. According to Kleiner and Abecker [31], configuration management is the process within ITSM which is responsible for describing all entities that are used to provide IT services, as well as the relationships between these entities [32]. Entities relevant to configuration management are referred to as configuration items (CIs). Descriptions of configuration items and the relationships between them are stored in the configuration management database (CMDB), which is the logical abstraction of all physical databases that contain information relevant to the configuration management discipline within an organisation.
Although there are a variety of tools for automatically populating and updating CMDBs, not all information can be determined and put in context automatically. In order to overcome this problem, Kleiner and Abecker [31] use semantic wiki software as the basis for creating a platform, which allows users to participate in the documentation of configuration items and their relations to each other, as well as best practices for the use of IT components. The developed ITSM wiki supports a distributed, collaborative approach to configuration management in agile environments. It is built on top of MediaWiki, Semantic MediaWiki and SMW+, and the extension adds ITSM-specific functions to the wiki. Kleiner and Abecker [31] have shown that it is feasible to use semantic wikis as a platform to implement a CMDB.
10.6 Future of Semantic Wiki
The future of the semantic wiki is promising. Researchers predict that technologies from semantic wikis will merge into collaborative Web 2.0 (3.0?) applications like Google Docs. It has been suggested that there will no longer be wiki text, but rather WYSIWYG HTML editing of semantic forms. In the future, semantic wikis will have to balance more expressive reasoning against consistency. Semantic wikis have the potential to bring the Semantic Web to the masses. More and more open virtual communities and corporate intranets will be utilising semantic wiki technology. Many have also foretold that semantic wikis will be used as a programming environment for applications (in business, public and other domains), programmed in the same way as the wiki pages are edited and supported by easy-to-use tools. Most semantic wikis will be able to offer some data exchange, thereby enabling knowledge exchange. Some researchers believe that semantic wikis will become the dominant medium for human-generated information, while for machine- and externally generated information, semantic wikis will serve as the "glue" bringing together different information sets (see, for example, http://opencongress.org/wiki – the shape of things to come). In the future, semantic wiki systems might play an important role in "knowledge acquisition", enabling non-technical users to contribute to the Semantic Web.
10.7 Conclusion
Wikipedia is the world's largest collaboratively edited source of encyclopaedic knowledge. But despite its utility, its contents are barely machine interpretable. Structural knowledge, such as the inter-relationships of concepts, can neither be formally stated nor automatically processed. Also, the wealth of numerical data is available only as plain text and thus cannot be processed according to its actual meaning. An extension has been integrated into Wikipedia that allows the typing of links between articles and the specification of typed data inside the articles in an easy-to-use manner, enabling even casual users to participate in the creation of an open semantic knowledge base.
Wikipedia thus has the chance to become a resource of semantic statements of hitherto unknown size, scope, openness and internationalisation. These semantic enhancements bring to Wikipedia the benefits of today's semantic technologies: more specific ways of searching and browsing. Also, the RDF export, which gives direct access to the formalised knowledge, opens Wikipedia up to a wide range of external applications that will be able to use it as a background knowledge base. Semantic wikis combine the strengths of the social "Web 2.0" and the Semantic Web [33], and in doing so overcome some of their respective weaknesses or deficiencies. There is a growing number of Semantic MediaWiki extensions on offer today for adding and modifying data, and there is much potential for semantic wikis. Future semantic wikis will be easy to use; they will enable us to programme in the same way as wiki pages are edited, and most will offer both data and knowledge exchange.
References

1. Schaffert, S., Bischof, D., Buerger, T., Gruber, A., Hilzensauer, W., Schaffert, S.: Learning with semantic wikis. In: Proceedings of the First Workshop on Semantic Wikis – From Wiki to Semantics (SemWiki2006), Budva, Montenegro, pp. 109–123, 11–14 June 2006. Retrieved November 2006 from http://www.wastl.net/download/paper/Schaffert06_SemWikiLearning.pdf
2. Wikipedia Foundation, http://en.wikipedia.org/wiki/Wikipedia
3. Boyd, D.: Reflections on Friendster, trust and intimacy. In: Ubiquitous Computing (Ubicomp 2003), Workshop application for the Intimate Ubiquitous Computing Workshop, Seattle, WA, 12–15 Oct 2003
4. Leuf, B., Cunningham, W.: The Wiki Way: Quick Collaboration on the Web, p. 15. Addison-Wesley, Boston (2001)
5. Matias, N.: What is a wiki. http://articles.sitepoint.com/article/what-is-a-wiki
6. Thoeny, P.: TWiki® – the Open Source Enterprise Wiki and Web 2.0 Application Platform, http://www.twiki.org/ (2010)
7. Gonzalez-Reinhart, J.: Wiki and the Wiki Way: Beyond a knowledge management solution. Information Systems Research Center, pp. 1–22 (2005)
8. Krötzsch, M., Vrandecic, D., Völkel, M.: Wikipedia and the Semantic Web – The Missing Links. In: Proceedings of the First International Wikimedia Conference (Wikimania-05). Wikimedia Foundation (2005)
9. Völkel, M., Krötzsch, M., Vrandecic, D., Haller, H., Studer, R.: Semantic Wikipedia. International World Wide Web Conference Committee (IW3C2), WWW 2006, Edinburgh, Scotland, 23–26 May 2006
10. Anderson, T., Whitelock, D.: The educational semantic web: visioning and practicing the future of education (Special Issue). J. Interactive Media in Education 2004(1). www-jime.open.ac.uk/2004/1 [Accessed 23.6.2008]
11. Berners-Lee, T., Hendler, J., Lassila, O.: The semantic web. Scientific American 5 (2001). Available at http://www.sciam.com/2001/0501issue/0501berners-lee.html
12. W3C, http://www.w3.org/
13. RDF Primer, http://www.w3.org/TR/rdf-primer/ (2010)
14. Davies, M.: Semantic Interoperability Community of Practice (SICoP), Semantic Wave 2006: Executive Guide to the Business Value of Semantic Technologies, White Paper Series Module 2, Updated on 05/15/06, Version 1.1, DRAFT for Public Review (2006)
15. Krötzsch, M., Vrandecic, D., Völkel, M., Haller, H., Studer, R.: Semantic Wikipedia. J. Web Semant. 5, 251–261 (2007)
16. Schaffert, S.: IkeWiki: A semantic wiki for collaborative knowledge management. In: Tolksdorf, R., Bontas, E.P., Schild, K. (eds.) 1st International Workshop on Semantic Technologies in Collaborative Applications (STICA'06), Manchester, UK, June 2006
17. Bao, J., Ding, L., Huang, R., Smart, P., Braines, D., Jones, G.: A semantic wiki based lightweight web application model. In: 4th Asian Semantic Web Conference (ASWC), Shanghai, China, 6–9 Dec 2009. http://eprints.ecs.soton.ac.uk/17738/
18. Garrett, J.J.: Ajax: A new approach to Web applications. Adaptivepath.com, February 2005. [Online; Stand 18.03.2005]
19. Semantic Wiki: Wikipedia. Retrieved April 27, 2009, from http://en.wikipedia.org/wiki/Semantic_wiki (2009)
20. Nixon, L.J.B., Simperl, E.P.B.: Makna and multimakna: towards semantic and multimedia capability in wikis for the emerging web. In: Proceedings of Semantics 2006, Vienna, Austria, Nov 2006
21. Zou, Q., Fan, W.: A semantic mediawiki-empowered terminology registry. In: Proceedings of the International Conference on Dublin Core and Metadata Applications 2009. http://dcpapers.dublincore.org/ojs/pubs/article/viewFile/949/956
22. Dengler, F., Lamparter, S., Hefke, M., Abecker, A.: Collaborative process development using semantic mediawiki. http://publications.f-dengler.de/wm2009-Collaborative_Process_Development_using_SMW.pdf
23. Terminology Registry Scoping Study (2008). UKOLN at the University of Bath and University of Glamorgan. Retrieved April 7, 2009, from http://www.ukoln.ac.uk/projects/trss
24. Proffitt, M., Waibel, G., Vizine-Goetz, D., Hunghton, A.: Terminologies Services Strawman. Distributed to the Terminologies Services Meeting at the Metropolitan Museum of Art, September 12, 2007, New York City. Retrieved May 2, 2009, from http://www.oclc.org/programs/events/2007-09-12a.pdf (2007)
25. Cearley, D., Fenn, J., Plummer, D.: Gartner's positions on the five hottest IT topics and trends in 2005. Gartner Web Site, 2005. http://www.gartner.com/DisplayDocument?doc_cd=125868 [as seen on 2006-09-20] (2005)
26. Paoli, H., Schmidt, A., Lockemann, P.C.: User-Driven SemanticWiki-based Business Service Description. In: Schaffert, S., Tochtermann, K., Pellegrini, T. (eds.) Networked Knowledge – Networked Media: Integrating Knowledge Management, New Media Technologies and Semantic Systems. Springer, Berlin (2009)
27. Bao, J., Ding, L., McGuinness, D., Hendler, J.: Towards social webtops using semantic wiki. http://www.cs.rpi.edu/~baojie/pub/2008-07-19_ISWC08poser.pdf (2008)
28. Nonaka, I., Takeuchi, H.: The Knowledge-Creating Company, p. 59. Oxford University Press, New York (1995)
29. Völkel, M., Oren, E.: Personal knowledge management with semantic wikis. http://www.eyaloren.org/pubs/pkm.pdf (2006)
30. Clacy, B., Jennings, B.: Service management: Driving the future of IT. Computer 40(5), 98–100 (2007)
31. Kleiner, F., Abecker, A.: Towards a collaborative semantic wiki-based approach to IT service management. wiki.ontoprise.de/smwforum/images/d/df/Paper-2010-03-01.pdf, Accessed 20 April 2010
32. Lacy, S., Macfarlane, I.: Service Transition, ITIL, Version 3. Stationery Office, Norwich (2007)
33. Völkel, M., Haller, H.: Conceptual data structures for personal knowledge management. Online Inf. Rev. 33(2), 298–315 (2009)
Part IV
Future of Community-Built Databases Research and Development
Chapter 11
On the Future of Mobile Phones as the Heart of Community-Built Databases

Jose C. Cortizo, Luis I. Diaz, Francisco M. Carrero, Adrian Yanes, and Borja Monsalve
Abstract In retrospect, 10 years ago we would not have imagined ourselves uploading or consuming high-quality videos via the Web, contributing to an online encyclopedia written by millions of users around the world or instantly sharing information with our friends and colleagues using an online platform that allows us to manage our contacts. The Web is still evolving, and what seems to be science fiction now may become reality within 5–10 years. Nowadays, the Mobile Web concept is still an immature prototype of what it will be in a few years' time, but it already represents a giant industry (some five billion people are expected to be using mobile/cellular phones in 2010) with even greater possibilities in the future. In this paper, we examine the possible future of mobile devices as the heart of community-built databases. The characteristics of mobile devices, both current and future, will allow them to play a very relevant role not only as interfaces to community-driven databases, but also as platforms where applications using data from community-driven databases will run, or even as distributed databases where users can have better control of the relevant data they contribute to those databases.
11.1 Introduction
The Social Web [19] has been able to shift the way information is generated and consumed. Initially, information was generated by one person and consumed by many people, but now the information is generated by many people and consumed by many people, changing the needs in information access and management [18].
Ten years ago, we could not have imagined ourselves uploading or consuming high-quality videos via the Web, contributing to an online encyclopedia written by millions of users around the world or instantly sharing information with our friends and colleagues using an online platform that allows us to manage our contacts. The Web is still evolving, and what seems to be science fiction now may become reality within 5–10 years. Nowadays, the Mobile Web concept is still an immature prototype of what it will be in a few years' time. On the one hand, current mobile devices are mostly used to acquire information from the Web due to the current limitations of the devices and software, which position their users in a pre-Web 2.0 era. On the other hand, mobile devices play a significant role in our daily life and will gain prominence in future years. In this paper, we examine the possible future of mobile devices as the heart of community-built databases. Both the current and future characteristics of mobile devices will allow them to play a very relevant role not only as interfaces to community-driven databases, but also as platforms where applications using data from community-driven databases will run, or even as distributed databases where users can have better control of the relevant data they contribute to those databases.

The current state of the art in mobile devices related to community-driven databases focuses on the interface level, with a lot of work being undertaken on augmented reality that enhances the real world with information extracted from these databases. But mobile devices will also have a key position in the database itself. In many community-driven databases, users contribute content created by themselves or even personal information. This is the case with communities such as Flickr, where photographers share their photographs, or LinkedIn, where users share their CVs and their professional experiences. Given the current state of technology development, users have no option other than to upload their contributions to a common database managed by third parties. But it may be possible in the near future to create distributed databases where certain parts of the information, or even the whole database, remain on mobile devices. In that way, users would have stronger control of their most important information and contents.

The rest of the paper is organized as follows. First, we analyze the mobile phone industry, which has become one of the most important industries. In Sect. 11.3, we analyze how mobile phones have changed our culture and explore the links between mobile phones and everyday life. In Sect. 11.4, the anatomy of a mobile phone is reviewed, analyzing the most common services offered by a mobile phone, their technical capacities, and their components. Section 11.5 analyzes the current role of mobile phones in community-built databases and some future directions, not only from the perspective of mobile phones used as an interface to access or contribute to community-built databases, but also as core components hosting certain relevant data. Section 11.6 presents two scenarios where the current use of mobile phones can be improved in community-built databases, and Sect. 11.7 concludes the paper.
11.2 Mobile Devices: A Real Market
A mobile device is a small computing device, typically having a display and an input that can be either a touch input or a miniature keyboard. There are several kinds of mobile devices, including:

• Mobile phones
• E-book readers
• Handheld game consoles
• Mobile computers such as netbooks
• GPS navigation devices
• Portable media players
• PDAs (personal digital assistants)
But mobile phones are the most important mobile devices in terms of their usage, capacities, and even their size. The United Nations communications agency (ITU) expects around five billion people to be using mobile/cellular phones in 2010 [1], an increase from four and a half billion in 2009. As the world's population is estimated at 6.8 billion people, these figures reflect a mobile phone penetration higher than 70% worldwide. ITU also expects one billion mobile broadband subscriptions in 2010, up from 600 million at the end of 2009, and mobile broadband access is expected to exceed Web access from desktop computers within the next few years. Being a little more realistic, not all mobile phones belong to a unique user: about 35% of unique users have two or more subscriptions. But there are also several million users sharing a mobile phone with other users, particularly those who live in poor households, in Africa, or in other parts of the world. By contrast, on the planet there are 1.2 billion PCs of any kind (Gartner expects 2 billion PCs by 2014), 1.6 billion TV sets, 1.7 billion Internet users, and 3.9 billion FM radio receivers.

The mobile industry is a global giant which generated 1.07 trillion dollars in annual revenues in 2009 [2], which positions it among the few industries earning a trillion dollars a year (automobiles, food, construction, and military spending). Mobile phones may be very small devices, but they represent a giant industry with even greater possibilities in the future. For instance, many print publishers are focusing on the mobile market as an opportunity to expand their brands, reach new audiences, and generate additional revenue, which leads Gartner to predict that global spending on mobile advertising will increase from 913 million dollars in 2009 to 13 billion in 2013 [3]. In a recent report, QuickPlay Media [58] studied mobile TV and video consumption in the USA, polling more than 1,000 US-based mobile subscribers between 18 and 35. The report shows that almost 80% of respondents expect that more people will be watching TV and video programs on their mobile phone by 2010. In [45], Lenhart surveyed teenagers about their mobile phones in 2004 and then surveyed them again in 2008 to determine how the penetration of mobile phones had changed over the years.
The report shows that the use of mobile phones among teens escalated from 45% in 2004 to 63% in 2006 and 71% in 2008. And how do mobile phones stand up against the adoption of other technologies?

• 77% of teens own a game console (Xbox, PlayStation, or Wii)
• 74% of teens own an MP3 player (iPod or similar)
• 71% of teens own a mobile phone
• 60% of teens own a computer
• 55% of teens own a portable game console (PSP, Nintendo DS)
These figures show a very high penetration of mobile phones among teenagers, increasing rapidly to reach the penetration of game consoles, a more children-oriented technology.
11.3 Mobile Phone Culture
In addition to their important contribution to industry, mobile phones have acquired a fundamental and distinctive role in today's societies, contributing to the transformation of relationships, work, leisure, family, sexualities, globalization, etc. Mobile phones have played a significant role in mobilizing crowds and changing history [36]. As an example, in March 2004, one day before the Spanish general elections, a public demonstration against the war in Iraq was convoked via text messages (SMS) sent by mobile phones. This action had such a great impact on Spanish society that the final result of the elections was completely different from the predictions made several days previously. Hence the great and widespread interest in mobile phones among media producers, industry, artists, educators, and policy makers. Mobile phones have an impact on our cultures in several ways, including:

• Social interaction
• Commercial identity
• Protecting privacy
• Politics and government
• Economy
11.3.1 Mobile Phones and Social Interaction

Mobile phones play a very important role in the interplay between the public and private spheres of life. They have made it more difficult to separate work from private life, as an employee's mobile phone is required to be turned on 24 h a day as a way of locating employees to solve company issues. Mobile phones also play an important role in making friends and arranging to meet other people. It is very common to arrange a date or a meeting via SMS, as text messages do not require synchronous communication but are likely to be attended to as soon as possible due to most people's proximity to their phones.
Regarding social interactions, mobile phones lack certain elements of human interaction such as tone of voice, body language, facial expression, and touch, which results in a loss of communication quality. People have also integrated mobile phones into the "culture of lying," using text messages to pretend to be in a meeting while on a date, or to be at work while socializing. In certain situations, mobile phones are seen as a disruptive technology, such as when someone uses a bewildering ring tone, shouts and paces in public places while on the phone, or answers the phone while talking with others, at a meeting, or attending a lecture.
11.3.2 Commercial Identity

In recent years, several technologies concerning commercial identity using the mobile phone have been developed. Mobile payment is one of the most rapidly adopted alternative payment methods, especially in Asia and Europe, where credit cards are not so widely used for small, everyday payments. There are four primary mobile payment models:

• Premium SMS-based transactional payments
• Direct mobile billing
• Mobile Web payments
• Contactless NFC
According to Gartner [65], there were 74.4 million mobile payment users in 2009, a figure expected to grow to more than 190 million users by 2012, representing more than 3% of all mobile device users. According to that report, "the most profound impact of mobile payment services is that they provide the nonbanking population with access to modern financial services, providing tools to improve their living standards." In addition to mobile payments, mobile phones allow users to access other financial services, such as carrying out bank transactions, being notified of large charges, and transferring money.
11.3.3 Protecting Privacy

The place of mobile phones in our everyday lives has also given rise to privacy issues, as they are a perfect tool for tracking others. As mobile phones are used for storing private information (contacts, e-mails, and personal documents), they are susceptible to theft. The inclusion of cameras on almost every last-generation phone has become a threat to personal privacy and intellectual property and has changed the way we behave in certain situations.
11.3.4 Politics and Government

Mobile phones have also become a tool for several political and e-government purposes. In several UK cities, it is possible to select political leaders through SMS elections, and mobile phones have also become a new tool for election propaganda, such as in the presidential election in the Philippines in 2004, when candidate Arroyo's consultant sent SMSs to the constituencies to influence their voting. Text messages via mobile phones are also a common, quick, and effective way to organize political protests, as with the Spanish general elections in 2004 or the overthrow of President Joseph Estrada of the Philippines.
11.4 Anatomy of a Mobile Phone
Mobile phones are electronic devices designed to work on cellular networks and contain a set of services that allow phones of different types to communicate with each other. Mobile phones need a subscription to a carrier (mobile phone operator). Typically, mobile phones work on GSM networks, and the operator issues a SIM card for each client, which contains the subscription and authentication parameters. Mobile phones are meant to support voice calls, but they can also send and receive data, send text messages (SMS), and access WAP services or the Internet using technologies such as GPRS or 3G. Every day, mobile phones are used more often for data communications. A typical data communication from/to a mobile phone is an SMS; text messages have been so successful as a mobile phone service that they have had an impact on our cultures, allowing the creation of pseudo-languages used to communicate with as few characters as possible. Most cell phones that support data communications can be used as wireless modems (tethering) to connect computers to the Internet anywhere, which enables people in rural areas to access the Internet. Most mobile handsets produced after 2008 incorporate a set of hardware characteristics [39] that make them a complex piece of electronics (see Fig. 11.1) and also a useful tool for several tasks.
11.4.1 Hardware Anatomy

In the last few years, telephone hardware has improved in several ways. Nowadays, the marketplace offers different models with capabilities similar to those of 4-year-old computers. The main issues for manufacturers are still battery life and connectivity, all compressed into limited dimensions. Almost 98% of smart phone manufacturers use the ARM hardware platform due to its low battery consumption. This choice is causing several difficulties in the software industry due to the need to develop different technologies for mobile phone deployment.
Fig. 11.1 Components of an iPhone 4, showing the hardware complexity of current mobile phones. Image provided by iFixit.com
Currently, manufacturers such as Intel are investing in the creation of a low-consumption platform to offer an alternative to ARM processors. The mobile phone industry still needs a generic platform, as occurred in the PC industry, with low battery consumption and other improved features. As there is still no standardized platform, deploying software for mobile devices requires high investment. These are some of the common hardware features available on modern smart phones:

• High processing power, between 800 MHz and 1 GHz
• High connectivity through Bluetooth, WiFi (802.11a/b/g/n), infrared, etc.
• Broadband connectivity: EDGE, CDMA2000, UMTS, etc.
• Camera sensors similar to those of normal cameras, offering between 3.2 and 12 Mpixels
• GPS and A-GPS, in combination with geo-location using Internet databases
• Tilt sensors (accelerometer, brightness, and proximity)
• Digital compass
• Radio transmitter/receiver
These hardware capabilities have opened a new world of possibilities for the software industry. Software manufacturers can now offer software integrated with continuous connectivity, data sensors, and geo-location. These features have been integrated very rapidly into social networks due to the ease of transmitting multimedia content (photos, videos, and audio), managing geo-positioning (where the user is), and supporting bilateral communication as on normal computers (instant messaging).
Furthermore, the combination of mobile sensors, broadband connectivity, and GPS capabilities is providing an environment conducive to augmented reality applications.

Graphics quality has been one of the research lines of mobile phone manufacturers in recent years. The high battery cost of graphics hardware hindered innovation for some time. This issue was also preventing interaction with social networks, since graphical capabilities are mandatory for successful interaction with Internet services. But this problem has been solved thanks to the creation of a new compact platform for mobile devices. Multimedia quality is also a strong research focus. New models come with high-quality sensors, offering excellent image quality; this also requires specific software to take advantage of the sensors' potential.

The hardware industry in recent years has been researching the deployment of a compact and functional platform for mobile devices. Thanks to these developments, manufacturers have found a way to integrate all the hardware components (graphics hardware, networking modules such as Bluetooth and 802.11a/b/g/n radio, and multimedia components) into compact all-in-one motherboards. For this reason, it is difficult to find any significant difference between the hardware components of different models. Over the last 2 years, the hardware war has focused on appearance (device design) and software capabilities (mobile operating systems, software deployment platforms, and software distribution). We should highlight that, due to the increasing use of social networks through mobile platforms, smart phones need a continuous connection with Web applications. This has pushed mobile phone operators and manufacturers to offer new forms of data transmission with low cost and low power consumption. Moreover, some mobile phone operators and manufacturers have established bilateral agreements to offer mobile phone solutions integrated with broadband data plans.
11.4.2 Software Anatomy

11.4.2.1 Mobile Operating Systems
Due to improvements in hardware, mobile operating systems have evolved alongside the new hardware capabilities. For several years, mobile operating systems were focused on limited functions, such as managing the communications modules, and offered limited capabilities for data transmission. This situation caused a frozen period for operating systems in mobile phones, whose only alternatives were Symbian OS, Windows Mobile, and Palm OS. After the advances in the hardware industry, mobile operating systems such as Symbian and Windows Mobile began to face other competitors, including RIM (BlackBerry), iOS (Apple), and Linux (Maemo, Bada, LiMo, and Android). These alternative mobile operating systems created a war between software manufacturers and hardware manufacturers.
As aforementioned, hardware capabilities are practically the same between different manufacturers; therefore, over the last 4 years the differences have lain in the software capabilities. Modern mobile operating systems offer a system architecture similar to that of generic operating systems on computers. Some mobile operating systems, such as iPhone OS or Android, are even adaptations of the same operating systems as Mac OS X and GNU/Linux. This situation necessitates the development of a platform for deploying mobile phone applications. Due to the limitations of such platforms (slow development due to hardware/software emulation), social networks have focused their efforts on offering their services through generic services. In line with hardware requirements, mobile operating systems attempt to integrate all hardware capabilities with low battery consumption. In addition, the current trend is to offer a mobile operating system with a documented API so that all the hardware and software capabilities can be used more easily. The problem with this is that software manufacturers need to duplicate their efforts to offer the same software solution on different mobile operating systems due to software incompatibility; hence, this practice is avoided by social networks. It is quite common that the only way to deploy mobile phone applications using social network services is to use public APIs through standard protocols such as HTTP and generic data encoding formats such as XML or JSON. Thanks to this, social network providers do not have to worry about the mobile platform, but only about the content's format.
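As a concrete illustration of this pattern, the sketch below fetches JSON over plain HTTP from a hypothetical social-network endpoint; the URL, parameters and response fields are invented, and a real service would normally also require authentication.

import requests

# Hypothetical endpoint and fields; any real social-network API will differ and
# will normally require an API key or OAuth token.
ENDPOINT = "https://api.social-example.com/v1/posts"

resp = requests.get(ENDPOINT, params={"user": "alice", "limit": 10}, timeout=10)
resp.raise_for_status()

# Because the payload is plain JSON over HTTP, the same client logic works
# regardless of the mobile operating system the application targets.
for post in resp.json().get("posts", []):
    print(post.get("created_at"), post.get("text"))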
11.4.2.2 Software Development Mechanisms
As mentioned above, software deployment has encountered several barriers in the mobile phone industry. Although the hardware platforms are not too different, the software that runs on them is implemented in very different ways, which causes several headaches for the software industry. Every mobile phone manufacturer releases its own way of developing applications for its platform. For this reason, mobile phone deployment is still an expensive area for software manufacturers. The usual approach is to publish a software development kit (SDK) for creating applications on the specific mobile platform, but the limitations differ depending on the platform chosen. For instance, Apple provides an SDK with high visual capabilities but with limited hardware interaction. On the other hand, Android OS, through its API, provides practically the same capabilities as native operating system applications; other alternatives such as Bada and Maemo offer more complete interaction than normal deployments, catching the attention of the software industry due to the ease of deploying cross-platform applications without much effort.
the hardware capabilities with those that the software development kits provide. In the future, major changes to software development kits are expected, offering new capabilities and facilities to deploy new applications with less effort and better integration with social networks.
11.4.3 Services Anatomy

Mobile phones are the best way to interact with different services while maintaining mobility. Over the last few years, such services have increased notably. Voice service (calls) and text messages (SMS) are still the basic services provided by mobile phones, but they are now complemented by other services including voice mailbox, conference calls, and video conference (Fig. 11.2). Current mobile phones offer users numerous services (see Fig. 11.3) that can be classified into several groups:
• Voice (voice services, voice mailbox, and call diverting)
• Text messages (short text messages and multimedia messages)
• Leisure (games, Internet, and multimedia)
• Media (newspapers, television, and videos)
• Financial (mobile wallet, mobile payments, and mobile stocks)
• Other (GPS and maps, electronic apparel control, live video, etc.)

Fig. 11.2 Decomposition of services included on most mobile phones in the market
Fig. 11.3 Example of integration of mobile services
One of the main goals of the software industry is the integration of these services with social networks. For instance, companies are researching how to provide local databases on mobile phones in order to integrate services. An example of this is the contacts database in the new generation of mobile phones, where contacts are integrated independently of the services (calls, SMS, MMS, e-mails, etc.). This practice was exported to other services, so that when we take a photo with our mobile phone it becomes available to the whole system, which in turn makes it available to other services on the phone (for instance, a picture taken with the mobile phone can be sent by e-mail or via Bluetooth). The latest generation of mobile phones (iPhone and Android) has also introduced a powerful mechanism to add functionality to mobile phones, making it easier for consumers to adapt their phones to their needs. These new services are the mobile application markets (Apple's App Store and the Android Market), which offer users a fast, easy, and effective way to access paid or free applications for their devices. Currently, Apple's App Store offers over 150K applications for the iPhone, and the Android Market has more than 20K and is growing. The application markets have radically changed the way mobile services are offered and consumed: previously most services were offered by mobile operators, whereas now they are developed by thousands of independent developers and distributed directly to the mobile phones. Mobile phones are everywhere, nearly everyone has one, and they are used for almost any task imaginable. They represent one of the most widespread electronic devices, and they have an intimate relationship with their owner. There is no other device more adapted to the new reality of the Social Web.
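As a rough illustration of the service integration mentioned above (a single contacts view shared by calls, SMS, and e-mail), the sketch below groups entries produced by different services around each contact. The service names and record layout are invented for the example and do not correspond to any particular phone platform.

```python
from collections import defaultdict

# Entries produced by different services on the phone (illustrative data only).
events = [
    {"service": "call",  "contact": "Bob", "detail": "missed call"},
    {"service": "sms",   "contact": "Bob", "detail": "See you at 8"},
    {"service": "email", "contact": "Ana", "detail": "Meeting agenda"},
]

def unified_contacts(events):
    """Group events by contact so each contact is seen independently of the service."""
    view = defaultdict(list)
    for event in events:
        view[event["contact"]].append((event["service"], event["detail"]))
    return dict(view)

print(unified_contacts(events))
# {'Bob': [('call', 'missed call'), ('sms', 'See you at 8')], 'Ana': [('email', 'Meeting agenda')]}
```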
11.5 Envisioning Mobile Phones as the Heart of Community-Built Databases
As stated in previous sections, the current and future features of mobile devices will allow them to play a very relevant role, not only as interfaces to community-driven databases, but also as platforms where applications using data from community-driven databases will run, or even as distributed databases where users can have better control of the relevant data they contribute to those databases. The ongoing state-of-the-art development of mobile devices related to community-driven databases focuses on the interface level, with a lot of work being undertaken on augmented reality that enhances the real world with information extracted from these databases. But mobile devices will also have a key position with respect to the database itself. In many community-driven databases, users contribute content created by them, or even personal information. This is the case of communities such as Flickr, where photographers share their photographs, or LinkedIn, where users share their CVs and professional experiences. With the current state of technology development, users have no option but to upload contributions to a common database managed by third parties. But it would be possible in the near future to create distributed databases where certain parts of the information, or even the whole database, remain on mobile devices. In that way, users would have greater control of their most important information and contents.
11.5.1 Mobile Phones as Interface

Mobile phones integrate a great number of technologies that make them suitable for many tasks. But when dealing with interfaces, they are particularly adaptable. They include touch screens, cameras, and many sensors such as GPS, digital compass, and accelerometer. This is why they are being used as interfaces between our everyday activities in the real world and the digital reality. Some examples of using mobile phones as our interfaces with the world can be seen in the following applications:
• FourSquare (http://www.foursquare.com) is a mobile social network that allows registered users to connect with friends and update and share their locations. It integrates a geo-location-based game in order to encourage users to share their actual location. FourSquare mobile applications (available for the most important devices such as iPhone, Android, and BlackBerry) give users an interface to their environment, as they can easily share their location and discover nearby businesses and places of interest.
• Strands (http://www.strands.com) is a mobile social network for runners that allows registered users to connect with other runners and share their training routes and evolution. The Strands application tracks all the movements of its users when they are running, calculating their routes, average speed, running intensity, etc.
• Google Shopper (http://www.google.com/mobile/shopper/) is a mobile application that allows users to take photos of books, music CDs, and other products; it then retrieves information about those products from the Web in order to offer detailed product information such as prices, reviews, specifications, and more.
These mobile applications serve as a connection between the real and virtual worlds, an interface that lets users access community-built databases from a mobile device in a natural manner. To make all this possible, mobile phone applications use several technologies, the most relevant being touch interfaces, augmented reality, and positioning-related sensors.
11.5.1.1 Mobile Phone Inputs
Mobile phones pose a challenge for input and output techniques due to their small size [30], which allows rather limited UI functionality design. Whereas screen resolution has increased, allowing more sophisticated GUIs, the input mechanics have remained quite unchanged throughout the short history of mobile phones. Current phones use either buttons or a touch screen. To overcome the limitations of small keypads, several approaches have been suggested, including speech input and auditory UIs [11], or gesture input [64]. Tangible user interfaces and touchable interaction are terms increasingly gaining currency within the human-computer interaction community [33], particularly in relation to mobile devices. Tangible user interfaces utilize physical representation and manipulation of digital data and offer interactive couplings of physical artifacts with computationally mediated digital information [74]. Many different research projects have studied enabling technologies, usability aspects, and various applications of tangible user interfaces [32, 63, 67]. When dealing with mobile phones, tangible user interfaces are usually simplified to touch interfaces, where the interaction with the mobile phone is conducted by touching the screen with one or several fingers, a more natural way to interact with the device than navigating a set of menus by means of several keys or a joypad. The introduction of touch screens on mobile phones has been a major step in the simplification of mobile phone interfaces and makes mobile devices more useful as interfaces. As an extension of touch interfaces, the concept of "gestures" has been introduced over recent years [71]. Gesture is a broad term defined in common usage as a
means of expression. As described in [43], gestures are used for everything from pointing at a person to draw their attention to conveying information about space and temporal characteristics. Current research on gesture ranges from the recognition of human body motion (including facial expressions and hand movements) [7, 70] to pen- and mouse-based research [57] and sign language [10]. Recently, there has been considerable research on applying tangible user interfaces (TUIs) [35] to mobile phones. TUIs are one step beyond gestures, as they require users to interact with real-world elements, and they can be applied to several mobile applications such as gaming [38, 48], social interaction [44], and education [5]. These kinds of interfaces allow users to provide input to mobile applications in a simple and natural way, interacting with their environment as they do when they are not using any mobile device. There exist several other direct inputs for mobile phones, including speech-based [24, 34, 51, 54] and text-based ones [12, 13, 20], and several combinations of methods. The latest generation of mobile phones includes a set of sensors that allow them to receive indirect inputs from users. GPS, digital compass, accelerometer, and other sensors can track the user's position and activities and can be used to make mobile phones context-aware. Context-aware mobile phones have been used to demonstrate various context-adaptive features. For instance, adaptation of the mobile device's screen layout orientation, ring tone, and volume has been proposed [25, 27]. Location-based applications are a significant area of interest and have great potential for commercial applications; this is particularly relevant with today's GPS-enabled mobile devices that include mobile maps. There exist several location-awareness applications such as city tour guides [4], campus or museum environments, shopping assistants, messaging [60], and location-sensitive reminders [21]. All these input technologies developed for mobile phones allow them to be converted into effective interfaces to community-built databases. We can upload our location or the route taken to social networks such as FourSquare, Yelp, Rummble, Google Buzz, or Runner+, and users can also interact with content in social networks in a more natural way, adapted to the mobile experience, through touch and related interfaces. But all these technologies also empower augmented reality applications, which are perfectly suited to mobile devices and social contents.
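As a simple illustration of the context-adaptive behaviour mentioned above, the sketch below derives a coarse screen orientation from raw accelerometer readings. The readings and the decision rule are illustrative assumptions; actual sensor APIs and adaptation policies differ between mobile operating systems.

```python
def orientation_from_accelerometer(ax, ay):
    """Infer a coarse screen orientation from accelerometer gravity components.

    ax, ay: acceleration along the device's x (short) and y (long) axes, in m/s^2.
    Returns "portrait" or "landscape". A real implementation would also smooth
    the readings and handle the face-up/face-down cases.
    """
    if abs(ay) >= abs(ax):
        return "portrait"
    return "landscape"

# Hypothetical readings: gravity mostly along the y axis means the phone is upright.
print(orientation_from_accelerometer(ax=0.5, ay=9.6))   # portrait
print(orientation_from_accelerometer(ax=9.7, ay=0.3))   # landscape
```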
11.5.1.2 Augmented Reality
Augmented reality technologies [6] are among the technologies that have had an increased impact on innovative applications [23], including applications for mobile devices [29, 31], in recent years [69], since they make use of elements present on most mobile phones (screen, camera, and sometimes other sensors such as inclinometers and compass) to simplify the tasks related to information access. Hence, quite a number of applications and augmented reality browsers have emerged that include layers of metadata about physical objects that have been geo-tagged,
either by the developers of these applications or by third parties (including the communities of users). These applications identify objects using the Global Positioning System (GPS) integrated into the mobile [61] in conjunction with other sensors such as the compass, allowing the system to determine at all times the location of the user and what he or she is looking at. There are two basic techniques for recognizing the object the user is looking at: the detection of tags placed on the objects to be recognized, or image recognition [40], which requires more computational load but offers greater flexibility in recognizing objects. With all this, the user can "see" a digital image on the screen of the mobile phone composed of the actual image captured by the camera of the handset and overlapping images or content from other sources. Augmented reality applications have usually been oriented to multimedia applications, overlapping 3D objects onto captured images, but increasingly applications are adding information of interest to the user to the final image, such as the description of a painting in a museum, of a building or a monument, in video games, etc. There exist numerous augmented reality-based applications for mobile phones, such as media authoring [26], cultural heritage [16], archaeological fieldwork assistance [41], e-learning [22], industrial applications [66], and location-based applications [62]. But many of the most used augmented reality-enhanced applications on mobile phones are related to social networks or community-built databases. As an example, Google has recently released Google Shopper, a mobile application that enhances the reality captured by a phone camera by adding additional information such as descriptions of the captured products, reviews, and ratings. As another example, Layar (http://www.layar.com) is an augmented reality browser based on a set of information layers that can be added to augment the reality captured by a mobile phone. Some of those layers allow users to access user-generated content in social networks such as YouTube, Twitter, FourSquare, or UnLike.
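A minimal sketch of the GPS-plus-compass technique described above is given below: it computes the bearing from the user to a geo-tagged point of interest and decides whether that point falls inside the camera's field of view, so that its metadata can be overlaid on the screen. The coordinates, field of view, and function names are illustrative assumptions, not taken from any particular augmented reality browser.

```python
import math

def bearing_to_poi(lat1, lon1, lat2, lon2):
    """Initial bearing (degrees from north) from the user (lat1, lon1)
    to a geo-tagged point of interest (lat2, lon2)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    x = math.sin(dlon) * math.cos(phi2)
    y = math.cos(phi1) * math.sin(phi2) - math.sin(phi1) * math.cos(phi2) * math.cos(dlon)
    return (math.degrees(math.atan2(x, y)) + 360) % 360

def poi_visible(user_pos, compass_heading, poi_pos, fov_degrees=60):
    """True if the point of interest lies within the camera's horizontal field of view."""
    bearing = bearing_to_poi(*user_pos, *poi_pos)
    diff = (bearing - compass_heading + 180) % 360 - 180
    return abs(diff) <= fov_degrees / 2

# Hypothetical example: the user points the phone roughly towards a geo-tagged building.
user = (40.4168, -3.7038)
museum = (40.4138, -3.6921)
if poi_visible(user, compass_heading=110.0, poi_pos=museum):
    print("Overlay the museum's metadata on the camera image")
```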
11.5.2 Mobile Phones as Databases

There has been extensive research on mobile databases [9], addressing several possible situations (see Figs. 11.4 and 11.5):
• Mobile client–fixed host. This is the most typical mobile application involving databases and the one where most research has been conducted. Examples involve traveling employees accessing corporate databases [14], mobile users accessing personal data such as their bank accounts [49] or agenda, servers broadcasting information (weather, traffic, etc.) [15, 46] to a large set of users, and specific mobile applications for accessing social networks such as Twitter or Facebook [17].
Fig. 11.4 Representation of the mobile databases models based on mobile clients–fixed host, and mobile client–mobile host
Fig. 11.5 Representation of the mobile databases models based on fixed client–mobile host, fixed client–fixed host, and P2P
• Mobile client–mobile host. Most of these applications require the management of a database embedded in mobile devices, data durability, and copy synchronization. Examples include portable folders such as medical folders on a smart card or SIM card, and phonebooks in mobile phones [52].
• Fixed client–mobile host. This situation is not well covered by current research, although it has been proposed in the literature [37].
• Fixed client–fixed host. This kind of application may involve mobile databases, such as keeping track of mobile locations in telecommunications, transportation, or traffic [72].
• P2P. There has not been much research on mobile P2P databases concerning mobile phones, but some of the research conducted on mobile P2P databases for sensor networks [75] can be applied to mobile phones.
The current state of the art in mobile databases is centered on the mobile client–fixed host model, which allows plenty of possibilities but is based on the assumption that the information is managed by the fixed host (typically a Web server) and mobile clients are only an additional way to access or modify the information on the host. With the current capabilities of mobile phones, and their intense impact on our everyday activities, it is time to let mobile phones and devices gain importance when dealing with databases, especially community-built databases. New models such as fixed client–mobile host or P2P mobile databases are going to have a great impact on community-built databases.
11.5.2.1 Fixed Client–Mobile Host Mobile Databases
Until now, there has been no option to run an application where the host resides on the mobile phone, due to the lack of connectivity and processing power. But nowadays mobile phones have the same processing capacity as 4- to 5-year-old servers, enough for most smaller applications. They are able to run 3D games, and even to host a server (Web server, FTP server, etc.). They also have great connectivity, as they are able to connect to the Internet via wireless LAN, GPRS, 3G, and almost any other wireless technology available on the market. And there are several situations where mobile phones should host at least an important part of the data, instead of it being hosted by an external Web server:
• Host profile data for social networks. Every time we want to sign in to a new social network, we have to introduce our personal data into that social network, and each time something changes (e.g., surname, e-mail, or affiliation) we have to access each social network and update that information. There exist several Web services that try to centralize our profile data in order to broadcast changes to every social network we are registered in, but as each social network treats our profile as its own, there are many difficulties in that process. Our profile information on social networks contains very personal information that should be managed from a more personal device/environment in order to be
controlled and synchronized with all the desired services. It would obviously be a good idea, then, to host that information on our personal mobile phones and allow external websites or social networks to access it in a more controlled way.
• Host contact information such as personal or professional contacts. Our mobile phone is used to store data about our contacts, our relatives, and our work connections. We add new contacts to the mobile phone when we meet them and update their information as it changes. But when we sign into a new social network, we have to add our contacts manually or from our e-mail accounts (typically Gmail/Yahoo/Hotmail accounts, and not other e-mail providers such as our company's), which usually makes no sense as we already have most of our contacts on our mobile phones.
• Pictures taken with our mobile phone. When we take a picture with our mobile phone, we can upload it to social networks such as Flickr (http://www.flickr.com), using our mobile phone as a client to upload the pictures to our online picture database. This is very useful for sharing photos from our vacations or special events, but it works in a strange way, since using the mobile phone as a client implies that the important database is the one stored online. Why can the really important database, the photo album, not be stored on our mobile phones? Why can we not establish the access restrictions to those photos directly from the phone instead of doing it online?
In the situations discussed above, our important data could be kept under better control if the mobile phone were used as the database host, allowing certain applications or users to access that data. One of the possible problems in doing so is managing access restrictions, but this has already been solved in the Web applications/social networks area with technologies such as OAuth (http://oauth.net/) [28], which can be adapted for use with mobile hosts. As each mobile phone has a unique phone number associated with it, there is a direct route to accessing the data from everywhere. To avoid connectivity issues, it could be useful to use existing mobile database technologies developed for mobile client–fixed host applications. In such applications, it is common to store a copy of certain data on the mobile phone in order to make it available to the user if there is no connection available at that moment [53]. But in this case, the fixed host (a social network or website hosted on a fixed server) would be the one hosting a local copy of certain data.
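A minimal sketch of this idea is shown below, assuming an OAuth-style bearer token rather than the full OAuth protocol: the phone acts as the host of the user's profile and only returns the fields that a given token has been granted. Tokens, field names, and the grant table are illustrative; a real deployment would use the actual OAuth flows and persistent, revocable grants.

```python
# Hypothetical sketch of a mobile phone acting as the host of profile data.
PROFILE = {
    "name": "Jane Doe",
    "email": "[email protected]",
    "cv": "curriculum_vitae.pdf",
    "photos": ["holiday_001.jpg", "holiday_002.jpg"],
}

# Which fields each issued access token is allowed to read.
GRANTS = {
    "token-professional-network": {"name", "cv"},
    "token-photo-sharing-site": {"name", "photos"},
}

def serve_request(token, requested_fields):
    """Return only the requested fields that the presented token has been granted."""
    allowed = GRANTS.get(token, set())
    return {f: PROFILE[f] for f in requested_fields if f in allowed and f in PROFILE}

# A professional network may read the CV, but not the photo album.
print(serve_request("token-professional-network", ["cv", "photos"]))  # only the CV
print(serve_request("token-photo-sharing-site", ["photos"]))          # only the photos
```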
11.5.2.2 P2P Mobile Databases
There exist several situations where mobile phones should act not as a host or a client, but as peers. Probably the most common scenario is represented by real mobile social networks [8, 47], where mobile applications are not merely clients for an online social network, but make extensive use of the mobile's features. Mobile
social networks are a hot research topic as they combine elements from two huge markets: mobile devices/phones and social networks. Mobile settings offer opportunities for new services on social networks, as changes in user contexts (location and activities) can be used to provide the user with context-aware services relevant to that moment and situation [59]. There also exist specific situations where social networks must be really connected with mobile phones, such as healthcare-oriented social networks/services where the mobile phone can be used to monitor the health status of the patient [76]. In all these and other similar situations, mobile databases may be used as P2P mobile databases, where each mobile phone shares certain information with other mobile phones. Mobile P2P databases are a recent research topic with some interesting work [42, 50, 56, 68, 73], but the area must be developed further in order to suit industry needs.
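To make the P2P idea concrete, the following toy simulation, with invented peer names and data, shows phones answering a query from their own local databases and aggregating the replies of nearby peers. Real mobile P2P databases must additionally deal with peer discovery, intermittent connectivity, and trust.

```python
class PhonePeer:
    """A toy peer: each phone holds a small local database it is willing to share."""

    def __init__(self, owner, shared_data):
        self.owner = owner
        self.shared_data = shared_data   # e.g. {"route": "river 5k"}
        self.neighbours = []             # peers currently in range

    def local_answer(self, key):
        return self.shared_data.get(key)

    def query(self, key):
        """Collect answers from this peer and its neighbours (single-hop flooding)."""
        answers = {}
        for peer in [self] + self.neighbours:
            value = peer.local_answer(key)
            if value is not None:
                answers[peer.owner] = value
        return answers

# Hypothetical runners sharing training data directly between their phones.
alice = PhonePeer("alice", {"route": "river 5k"})
bob = PhonePeer("bob", {"route": "hill 8k"})
carol = PhonePeer("carol", {})
alice.neighbours = [bob, carol]
print(alice.query("route"))   # {'alice': 'river 5k', 'bob': 'hill 8k'}
```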
11.6 Future Scenarios
We would like to introduce a couple of near-future scenarios in which mobile phones have a relevant role in community-built databases, and which exemplify the use of several of the technologies and ideas presented above.
11.6.1 MobiTag: A Mobile Interface to the Real World

MobiTag (not an actual product) would be a mix of Layar and Google Shopper. With this application, users could capture any product, person, or place with their camera phones and then tag, review, or comment on it, sharing this content with others and being able to access other people's contributions. MobiTag would be an interface to a general community-built database empowered by augmented reality, object/face recognition, and several other technologies such as opinion mining/sentiment analysis [55], in order to process all the user-generated content appropriately so that it can be condensed and displayed on a small screen. Figure 11.6 shows an example of the use of MobiTag or a similar product. One of the most important challenges to be addressed for this future scenario is the object/face recognition software, but products such as Google Shopper show that the current state of the art should be enough for recognizing products. Applications such as Recognizr (a prototype developed by The Tribe, http://www.tat.se/site/showroom/latest_design.html) are working on the face detection problem from a mobile phone, retrieving the related information from social media such as LinkedIn or Facebook (see Fig. 11.7).
Fig. 11.6 Example of a possible application using the mobile phone as an interface to a community-built database of objects and persons in the real world
Fig. 11.7 Recognizr is an easy-to-use application to retrieve information from social networks such as LinkedIn from a mobile phone using a picture or camera capture as input to the system
In general terms, the system displays information about what is being captured by the phone camera. The user focuses the camera on the objects of interest. When the camera captures an object, the object is recognized and information about it is displayed using augmented reality techniques. The MobiTag system works in four steps (see Fig. 11.8). The first step captures the image and detects the presence of objects of interest (a good example would be a book). To resolve this point, it could use a preprocessing algorithm to determine the object's shape, crop the image by removing irrelevant information, and isolate the object in a new image.
Fig. 11.8 The whole MobiTag system
The second step relates the image to a text string. That is, it sends the bounded image of the object to a Web service that determines what the object is. For example, in the case of a book the service will return the title of the book. The third step is a Web service that retrieves related information from the Internet and social media, such as the author, genre, criticism, social media ratings, and prices. It basically finds related information and returns what the algorithm considers most relevant to the user. The fourth and final step shows the relevant information to the user by means of augmented reality. The related information is superimposed on the detected object. For example, for a book the system can show an aggregate of ratings in social networks, an indication of the best price, the option to buy it just by clicking a button, or a link to published information related to the object. All this information is placed over the object, so that even if the user moves the handset, as long as the whole object appears inside the image, the information is displayed over it, giving the user the impression that the information belongs to the object. This is augmented reality.
As the most complex parts of the system are extracting the object from the initial capture and relating this image to the product's name, we have focused our first experiments on these initial steps. It was necessary to test the response of CBIR (content-based image retrieval) systems against real images of the object under different illuminations. The experiment consisted of detecting an object and performing an action associated with it. CDs were used as the objects to be detected, and the associated action was to play a music file related to the captured object. In this case, Web services were not used, only local services such as image recognition and music playback. This approach simulated the first three steps of the whole system. The result of the experiment was positive. The preprocessing system (step 1) improves the CBIR image association by 55%. The percentage of improvement of the system is subject to the proper functioning of the preprocessing algorithm, which worked successfully in 57% of the cases in the total sample.
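A minimal sketch of how these four steps could be wired together is given below. Every function is a stub standing in for the components just described (preprocessing, the recognition/CBIR Web service, the information-retrieval service, and the augmented reality overlay); the names and return values are illustrative assumptions, not an actual MobiTag implementation. In a real deployment, steps 2 and 3 would be remote Web services reached over HTTP, while steps 1 and 4 would run locally on the handset.

```python
def preprocess(frame):
    """Step 1: detect the object of interest and crop away irrelevant background."""
    # A real system would use shape detection / segmentation here.
    return frame  # cropped image containing only the object

def recognize(object_image):
    """Step 2: send the bounded object image to a recognition (CBIR) service
    and get back a text label, e.g. a book title."""
    return "Example Book Title"  # placeholder result

def retrieve_information(label):
    """Step 3: query Web/social-media services for information related to the label."""
    return {"author": "A. Writer", "rating": 4.2, "best_price": "9.99"}

def overlay(frame, info):
    """Step 4: superimpose the retrieved information on the camera image."""
    print("Overlaying on frame:", info)

def mobitag_pipeline(frame):
    object_image = preprocess(frame)
    label = recognize(object_image)
    info = retrieve_information(label)
    overlay(frame, info)

mobitag_pipeline(frame="camera capture placeholder")
```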
Fig. 11.9 Example of a mobile phone hosting personal information and controlling the access of several social networks to that information
11.6.2 Mobile Phone Hosting Personal Information

This second scenario is related to what was proposed in the mobile databases section. Figure 11.9 shows a mobile phone hosting personal information, such as a curriculum vitae or a personal photo album, and granting access to that data to social networks such as LinkedIn (where we would like to share our most up-to-date CV) or Flickr (where we would like to share our pictures with our friends and relatives). Users would introduce their phone number in the social networks, which would send a petition to the phone requesting access to specific content. If access to a certain social network or to certain content is not granted, the website is not able to retrieve the data and share it with other users of that social network.
11.7 Conclusions
In this chapter, we have reviewed the current and future roles of mobile phones in community-built databases. Although most of the observations could be applied to other mobile devices, such as portable game consoles or e-readers, we have focused the chapter on mobile phones because of their huge industry, their market penetration, and their strong links with users, which have made possible important changes in our cultures and have a great impact on everyday activities. The success of community-built databases can be attributed to the rise of the Internet and broadband connectivity. However, there are more users worldwide using mobile phones than users connected to the Internet, and in the next few years the number of users accessing the Internet from mobile devices will surpass the number accessing it from desktop or notebook computers. That
means that the role of mobile devices in community-built databases must be strengthened. Currently, mobile phones are mostly used for accessing community-built databases. Examples include social network mobile applications for accessing Facebook, Flickr, Twitter, FourSquare, or any other important social network. But most of those mobile interfaces do not allow users to access all the functionalities offered on a computer. Only a few social networks are currently able to exploit the capabilities of mobile phones in order to add extra functionality when a mobile phone is used. That is the case with FourSquare, a social service that allows users to share their location and find nearby businesses or places of interest. We have reviewed several opportunities for mobile phones to improve their use and contribution to community-built databases. When they are used as an interface to access Web social networks, technologies such as augmented reality, or the sensors now available in most mobile phones (GPS, accelerometer, and digital compass), allow social networks and other community-built databases to offer their users a real mobile experience. But mobile phones can also be used as an important part of the database itself, hosting personal information such as the user profile. To allow that kind of application, more research on mobile databases must be undertaken, especially on mobile host–fixed client and P2P mobile databases. We also believe that there will be extensive research on OAuth-related protocols applied to mobile databases in order to make it possible for Web applications to access content hosted on mobile phones. The use of mobile phones as the core of community-built databases has emerged as an excellent research field for the coming years, and we believe it will attract the interest of researchers and companies as it combines two of the most rapidly growing segments of the Web: the Mobile Web and the Social Web.
References
1. URL: http://www.unmultimedia.org/radio/english/detail/90889.html (2010)
2. URL: http://communities-dominate.blogs.com/brands/2010/02/announcing-tomiahonenalmanac-2010-edition-180-pages-84-charts-on-mobile-stats-and-facts.html (2010)
3. ABCinteractive: Going mobile: How publishers are preparing for the burgeoning digital market (2009)
4. Abowd, G.D., Atkeson, C.G., Hong, J., Long, S., Kooper, R., Pinkerton, M.: Cyber-guide: a mobile context-aware tour guide. Wireless Netw. 3(5), 421–433 (1997). DOI http://dx.doi.org/10.1023/A:1019194325861
5. Almgren, J., Carlsson, R., Erkkonen, H., Fredriksson, J., Møller, S., Rydgård, H., Österberg, M., Fjeld, M.: Tangible user interface for chemistry education. In: Visualization, Portability, and Database. Proc. SIGRAD 2005 (2005)
6. Azuma, R.: A survey of augmented reality. Presence Teleoperators Virtual Environ. 6(4), 355–385 (1997)
7. Baudel, T., Beaudouin-Lafon, M.: Charade: remote control of objects using free-hand gestures. Commun. ACM 36(7), 28–35 (1993). DOI http://doi.acm.org/10.1145/159544.159562
8. Beach, A., Raz, B., Buechley, L.: Touch me wear: Getting physical with social networks. In: Computational Science and Engineering, IEEE International Conference on 4, 960–965 (2009). DOI http://doi.ieeecomputersociety.org/10.1109/CSE.2009.393 9. Bernard, G., Ben-othman, J., Bouganim, L., Canals, G., Chabridon, S., Defude, B., Ferrie´, J., Ganc¸arski, S., Guerraoui, R., Molli, P., Pucheral, P., Roncancio, C., Serrano-Alvarado, P., Valduriez, P.: Mobile databases: a selection of open issues and research directions. SIGMOD Rec. 33(2), 78–83 (2004). DOI http://doi.acm.org/10.1145/1024694.1024708 10. Brashear, H., Starner, T., Lukowicz, P., Junker, H.: Using multiple sensors for mobile sign language recognition. In: ISWC ’03: Proceedings of the 7th IEEE International Symposium on Wearable Computers, p. 45. IEEE Computer Society, Washington, DC, USA (2003) 11. Brewster, S.: Overcoming the lack of screen space on mobile computers. Personal Ubiquitous Comput. 6(3), 188–205 (2002). DOI http://dx.doi.org/10.1007/s007790200019 12. Brewster, S.A., Hughes, M.: Pressure-based text entry for mobile devices. In: Mobile-HCI ’09: Proceedings of the 11th International Conference on Human-Computer Interaction with Mobile Devices and Services, pp. 1–4. ACM, New York, NY, USA (2009). DOI http://doi. acm.org/10.1145/1613858.1613870 13. Butts, L., Cockburn, A.: An evaluation of mobile phone text input methods. Aust. Comput. Sci. Commun 24(4), 55–59 (2002). doi:DOI http://doi.acm.org/10.1145/563997.563993 14. Chen, Y.F., Huang, H., Jana, R., Jim, T., Hiltunen, M., John, S., Jora, S., Muthumanickam, R., Wei, B.: mobile ee: An enterprise mobile service platform. Wireless Netw. 9(4), 283–297 (2003). doi:DOI http://dx.doi.org/10.1023/A:1023687025164 15. Cherniack, M., Franklin, M.J., Zdonik, S.B.: Data management for pervasive computing. In: VLDB ’01: Proceedings of the 27th International Conference on Very Large Data Bases, p. 727. Morgan Kaufmann Publishers, San Francisco, CA (2001) 16. Choudary, O., Charvillat, V., Grigoras, R., Gurdjos, P.: March: mobile augmented reality for cultural heritage. In: MM ’09: Proceedings of the 17th ACM international conference on Multimedia, pp. 1023–1024. ACM, New York, NY (2009). DOI http://doi.acm.org/10.1145/ 1631272.1631500 17. Church, K., Neumann, J., Cherubini, M., Oliver, N.: Socialsearchbrowser: a novel mobile search and information discovery tool. In: IUI ’10: Proceeding of the 14th international conference on Intelligent User Interfaces, pp. 101–110. ACM, New York, NY (2010). DOI http://doi.acm.org/10.1145/1719970.1719985 18. Cortizo, J.C., Carrero, F., Gomez, J.M., Monsalve, B., Puertsa, E. (eds.): Proceedings of the 1st International Workshop on Mining Social Media. Bubok (2009) 19. Crowley, D., Reas, C., Schneider, M., Tiersky, H.: The social web: platforms, communities, and creativity. In: SIGGRAPH ’05: ACM SIGGRAPH 2005 Web program, p. 19. ACM, New York, NY (2005). DOI http://doi.acm.org/10.1145/1187335.1187357 20. Curran, K., Woods, D., Riordan, B.O.: Investigating text input methods for mobile phones. Telemat. Inf. 23(1), 1–21 (2006). DOI http://dx.doi.org/10.1016/j.tele.2004.12.001 21. Dey, A.K., Abowd, G.D.: Cybreminder: A context-aware system for supporting reminders, pp. 172–186. Springer, Heidelberg (2000) 22. Doswell, J.T.: Context-aware mobile augmented reality architecture for lifelong learning. In: ICALT ’06: Proceedings of the 6th IEEE International Conference on Advanced Learning Technologies, pp. 372–374. 
IEEE Computer Society, Washington, DC (2006) 23. Feiner, S., MacIntyre, B., H€ ollerer, T.: Wearing it out: First steps toward mobile augmented reality systems. In: First International Symposium on Mixed Reality (1999) 24. del Galdo, E., Rose, T.: Speech user interface design for mobile devices. In: CHI ’99: CHI ’99 extended abstracts on Human factors in computing systems, pp. 159–160. ACM, New York, NY (1999). DOI http://doi.acm.org/10.1145/632716.632811 25. Gellersen, H.W., Schmidt, A., Beigl, M.: Multi-sensor context-awareness in mobile devices and smart artifacts. Mob. Netw. Appl. 7(5), 341–351 (2002). DOI http://dx.doi.org/10.1023/ A:1016587515822 26. Guven, S., Feiner, S., Oda, O.: Mobile augmented reality interaction techniques for authoring situated media on-site. In: ISMAR ’06: Proceedings of the 5th IEEE and ACM International
Symposium on Mixed and Augmented Reality, pp. 235–236. IEEE Computer Society, Washington, DC (2006). DOI http://dx.doi.org/10.1109/ISMAR.2006.297821
27. Häkkilä, J., Schmidt, A., Mäntyjärvi, J., Sahami, A., Aakerman, P., Dey, A.K.: Context-aware mobile media and social networks. In: MobileHCI '09: Proceedings of the 11th International Conference on Human-Computer Interaction with Mobile Devices and Services, pp. 1–3. ACM, New York, NY (2009). DOI http://doi.acm.org/10.1145/1613858.1613982
28. Hashimoto, R., Ueno, N., Shimomura, M.: A design of usable and secure access-control apis for mashup applications. In: DIM '09: Proceedings of the 5th ACM workshop on Digital identity management, pp. 31–34. ACM, New York, NY (2009). DOI http://doi.acm.org/10.1145/1655028.1655037
29. Henrysson, A., Olilla, M.: Umar – Ubiquitous mobile augmented reality. In: Proceedings of the 3rd International Conference on Mobile and Ubiquitous Multimedia, pp. 41–45 (2004)
30. Holleis, P., Huhtala, J., Häkkilä, J.: Studying applications for touch-enabled mobile phone keypads. In: TEI '08: Proceedings of the 2nd international conference on Tangible and embedded interaction, pp. 15–18. ACM, New York, NY (2008). DOI http://doi.acm.org/10.1145/1347390.1347396
31. Höllerer, T.H.: User interfaces for mobile augmented reality systems. Ph.D. thesis, New York, NY, USA (2004). Adviser: Feiner, Steven K.
32. Holmquist, L.E., Schmidt, A., Ullmer, B.: Tangible interfaces in perspective: Guest editors' introduction. Personal Ubiquitous Comput. 8(5), 291–293 (2004). DOI http://dx.doi.org/10.1007/s00779-004-0292-9
33. Hornecker, E., Buur, J.: Getting a grip on tangible interaction: a framework on physical space and social interaction. In: CHI '06: Proceedings of the SIGCHI conference on Human Factors in computing systems, pp. 437–446. ACM, New York, NY (2006). DOI http://doi.acm.org/10.1145/1124772.1124838
34. Howell, M., Love, S., Turner, M.: Spatial metaphors for a speech-based mobile city guide service. Personal Ubiquitous Comput. 9(1), 32–45 (2005). DOI http://dx.doi.org/10.1007/s00779-004-0271-1
35. Ishii, H., Ullmer, B.: Tangible bits: Towards seamless interfaces between people, bits and atoms (1997)
36. Jiayin, Q.: Research on mobile phone culture (2007)
37. Jing, J., Helal, A.S., Elmagarmid, A.: Client-server computing in mobile environments. ACM Comput. Surv. 31(2), 117–157 (1999). DOI http://doi.acm.org/10.1145/319806.319814
38. Jung, B., Schrader, A., Carlson, D.V.: Tangible interfaces for pervasive gaming. In: Proceedings of the 2nd International Conference of the Digital Games Research Association. Changing Views: Worlds in Play, pp. 16–20 (2005)
39. Juniper: Mobile augmented reality – A whole new world (2009)
40. Karpischek, S., Marforio, C., Godenzi, M.: Swisspeaks – Mobile augmented reality to identify mountains. In: Proceedings of the 3rd European Conference on Ambient Intelligence (2009)
41. Kayalar, C., Kavlak, E., Balcisoy, S.: A user interface prototype for a mobile augmented reality tool to assist archaeological fieldwork. In: SIGGRAPH '08: ACM SIGGRAPH 2008 posters, pp. 1–1. ACM, New York, NY (2008). DOI http://doi.acm.org/10.1145/1400885.1401024
42. Kim, D.H., Lee, M.R., Han, L., In, H.P.: Efficient data dissemination in mobile p2p ad-hoc networks for ubiquitous computing. In: Multimedia and Ubiquitous Engineering, International Conference on 0, 384–389 (2008). DOI http://doi.ieeecomputersociety.org/10.1109/MUE.2008.100
43. Krüger, A., Butz, A., Müller, C., Stahl, C., Wasinger, R., Steinberg, K.E., Dirschl, A.: The connected user interface: Realizing a personal situated navigation service. In: IUI '04: Proceedings of the 9th International Conference on Intelligent User Interfaces, pp. 161–168. ACM, New York, NY (2004). DOI http://doi.acm.org/10.1145/964442.964473
44. Leichtenstern, K., André, E.: Social mobile interaction using tangible user interfaces and mobile phones. In: Proceedings of the Conference of the Gesellschaft für Informatik, Workshop Mobile and Embedded Interactive Systems (MEIS) (2006)
45. Lenhart, A.: Teens and mobile phones over the past five years: Pew internet looks back (2009) 46. Li, G., Wang, H., Liu, Y., Chen, J.: Mobile real-time read-only transaction processing in data broadcast environments. In: SAC ’05: Proceedings of the 2005 ACM Symposium on Applied Computing, pp. 1176–1177. ACM, New York, NY (2005). DOI http://doi.acm.org/10.1145/ 1066677.1066942 47. Li, N., Chen, G.: Analysis of a location-based social network. In: Computational Science and Engineering, IEEE International Conference on 4, 263–270 (2009). DOI http://doi. ieeecomputersociety.org/10.1109/CSE.2009.98 48. Liarokapis, F., Macan, L., Malone, G., Rebolledo-Mendez, G., de Freitas, S.: Multimodal augmented reality tangible gaming. Vis. Comput. 25(12), 1109–1120 (2009). DOI http://dx. doi.org/10.1007/s00371-009-0388-3 49. Mallat, N., Rossi, M., Tuunainen, V.K.: Mobile banking services. Commun. ACM 47(5), 42–46 (2004). DOI http://doi.acm.org/10.1145/986213.986236 50. Matuszewski, M., Balandin, S.: Peer-to-peer knowledge sharing in the mobile environment. In: Creating, Connecting and Collaborating through Computing, International Conference on 0, 76–83 (2007). DOI http://doi.ieeecomputersociety.org/10.1109/C5.2007.24 51. Melto, A., Turunen, M., Kainulainen, A., Hakulinen, J., Heimonen, T., Antila, V.: Evaluation of predictive text and speech inputs in a multimodal mobile route guidance application. In: MobileHCI ’08: Proceedings of the 10th international conference on Human computer interaction with mobile devices and services, pp. 355–358. ACM, New York, NY, USA (2008). DOI http://doi.acm.org/10.1145/1409240.1409287 52. Montante, R.: A survey of portable software. J. Comput. Small Coll. 24(3), 19–24 (2009) 53. Monteiro, J.M., Brayner, A., Lifschitz, S.: A mechanism for replicated data consistency in mobile computing environments. In: SAC ’07: Proceedings of the 2007 ACM symposium on Applied computing, pp. 914–919. ACM, New York, NY (2007). DOI http://doi.acm.org/ 10.1145/1244002.1244203 54. Paek, T., Chickering, D.M.: Improving command and control speech recognition on mobile devices: Using predictive user models for language modeling. User Modeling and UserAdapted Interaction 17(1–2), 93–117 (2007) 55. Pang, B., Lee, L.: Opinion mining and sentiment analysis. Found. Trends Inf Retr 2(1–2), 1–135 (2008). DOI http://dx.doi.org/10.1561/1500000011 56. Pang, X., Catania, B., Tan, K.L.: Securing your data in agent-based p2p systems. In: Database Systems for Advanced Applications, International Conference on 0, 55 (2003). DOI http://doi. ieeecomputersociety.org/10.1109/DASFAA.2003.1192368 57. Pastel, R., Skalsky, N.: Demonstrating information in simple gestures. In: IUI ’04: Proceedings of the 9th international conference on Intelligent user interfaces, pp. 360–361. ACM, New York, NY (2004). DOI http://doi.acm.org/10.1145/964442.964534 58. QPM: Us mobile tv and video survey (2009) 59. Rana, J., Kristiansson, J., Hallberg, J., Synnes, K.: An architecture for mobile social networking applications. Computational Intelligence, Communication Systems and Networks, International Conference on 0, 241–246 (2009). DOI http://doi.ieeecomputersociety.org/10.1109/ CICSYN.2009.73 60. Rantanen, M., Oulasvirta, A., Blom, J., Tiitta, S., M€antyl€a, M.: Inforadar: group and public messaging in the mobile context. In: NordiCHI ’04: Proceedings of the third Nordic conference on Human-computer interaction, pp. 131–140. ACM, New York, NY (2004). DOI http:// doi.acm.org/10.1145/1028014.1028035 61. 
Reitmayr, G., Schmalstieg, D.: Location based applications for mobile augmented reality. In: Proceedings of the 4th Australasian user interface Conference, vol. 18, pp. 65–73 (2003) 62. Reitmayr, G., Schmalstieg, D.: Location based applications for mobile augmented reality. In: AUIC ’03: Proceedings of the Fourth Australasian user interface conference on User interfaces 2003, pp. 65–73. Australian Computer Society, Inc., Darlinghurst, Australia (2003) 63. Rogers, Y., Muller, H.: A framework for designing sensor-based interactions to promote exploration and reflection in play. Int. J. Hum.-Comput. Stud. 64(1), 1–14 (2006). DOI http:// dx.doi.org/10.1016/j.ijhcs.2005.05.004
64. Ronkainen, S., H€akkil€a, J., Kaleva, S., Colley, A., Linjama, J.: Tap input as an embedded interaction method for mobile devices. In: TEI ’07: Proceedings of the 1st international Conference on Tangible and Embedded Interaction, pp. 263–270. ACM, New York, NY (2007). DOI http://doi.acm.org/10.1145/1226969.1227023 65. Shen, S.: Dataquest insight: Mobile payment, 2007–2012 (2009) 66. Tumler, J., Doil, F., Mecke, R., Paul, G., Schenk, M., Pfister, E.A., Huckauf, A., Bockelmann, I., Roggentin, A.: Mobile augmented reality in industrial applications: Approaches for solution of user-related issues. In: ISMAR ’08: Proceedings of the 7th IEEE/ACM International Symposium on Mixed and Augmented Reality, pp. 87–90. IEEE Computer Society, Washington, DC (2008). DOI http://dx.doi.org/10.1109/ISMAR.2008.4637330 67. Ullmer, B., Ishii, H.: Emerging frameworks for tangible user interfaces. IBM Syst. J. 39(3–4), 915–931 (2000) 68. Veijalainen, J.: Autonomy, heterogeneity and trust in mobile p2p environments. Multimedia and Ubiquitous Engineering, International Conference on 0, 41–47 (2007). DOI http://doi. ieeecomputersociety.org/10.1109/MUE.2007.99 69. Wagner, D., Schmalstieg, D.: First steps towards handheld augmented reality. In: Proceedings of the 7th International Conference on Wearable Computers (2003) 70. Wahlster, W.: Towards symmetric multimodality: Fusion and fission of speech, gesture, and facial expression. In: Springer (ed.) Proceedings of the Annual German Conference on AI, KI 2003 (2003) 71. Wasinger, R., Kr€ uger, A., Jacobs, O.: Integrating intra and extra gestures into a mobile and multimodal shopping assistant. In: In Pervasive, pp. 297–314. Springer LNCS (2005) 72. Wolfson, O., Xu, B., Chamberlain, S., Jiang, L.: Moving objects databases: Issues and solutions (1998) 73. Wolfson, O., Xu, B., Yin, H., Cao, H.: Search-and-discover in mobile p2p network databases. In: Distributed Computing Systems, International Conference on 0, 65 (2006). DOI http://doi. ieeecomputersociety.org/10.1109/ICDCS.2006.74 74. Xie, L., Antle, A.N., Motamedi, N.: Are tangibles more fun?: comparing children’s enjoyment and engagement using physical, graphical and tangible user interfaces. In: TEI ’08: Proceedings of the 2nd international conference on Tangible and embedded interaction, pp. 191–198. ACM, New York, NY (2008). DOI http://doi.acm.org/10.1145/1347390.1347433 75. Xu, B., Vafaee, F., Wolfson, O.: In-network query processing in mobile p2p databases. In: GIS ’09: Proceedings of the 17th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, pp. 207–216. ACM, New York, NY (2009). DOI http:// doi.acm.org/10.1145/1653771.1653802 76. Yu, W.D., Siddiqui, A.: Towards a wireless mobile social network system design in healthcare. In: Multimedia and Ubiquitous Engineering, International Conference on 0, 429–436 (2009). DOI http://doi.ieeecomputersociety.org/10.1109/MUE.2009.77
Chapter 12
Designed for Good: Community Well-Being Oriented Online Databases for Youth
S. Vodanovich, M. Rohde, and D. Sundaram
S. Vodanovich (*), The University of Auckland, New Zealand, e-mail: [email protected]
Abstract Community-built databases have been established for diverse groups and communities. While traditional community-built databases are driven by a professional and technically oriented audience, the rise of community-built databases shaped by the Net Generation brings its own particular problems and issues. We begin this chapter with a definition of the concept of well-being for youth, as well as the developmental processes that lead toward well-being. Next, we discuss the problems, issues, challenges and requirements for positive well-being that may arise in the context of community-built databases. We then propose conceptual and design principles to mitigate these problems, address the issues and provide mechanisms to enhance the developmental processes that lead to positive social, emotional, moral and cognitive outcomes. We further propose a framework and architecture that embody these principles in terms of four key dimensions, namely web interaction, social collaboration, semantic integration and community governance. Finally, we conduct an exploratory implementation to illustrate key aspects of the architecture and framework to enhance youth well-being and support their needs.
12.1 Introduction
Youth is a period of rapid emotional, physical and intellectual change, where young people progress from being dependent children to independent adults. Young people who are unable to make this transition smoothly can face significant difficulties in both the short and long term. Although the vast majority of young people are able to find all the resources they need for their health, wellbeing and development within their families and living environments, some young
people have difficulty in locating resources that can help them and moreover difficulty in integrating into society. One way to support this transition is to create an environment that enables youth to be well supported through the provision of information and the creation of a community-built database where youth feel empowered to collaborate with their peers, as well as decision makers and legislators. There is immense interest in youth well-being from policy makers, advocates, parents, teachers and the youth themselves. However, this interest has not successfully been translated into adequate support mechanisms for youth in a format which they understand. The youth of today are part of a Net Generation [81] where technologies, such as mobiles and, more popularly, the Internet, are part and parcel of their everyday lives. The Net Generation uses informational, collaborative and community-oriented systems to a level that is unprecedented. The youth of today represent the first generation to grow up surrounded by digital technologies. They have spent their entire lives surrounded by and using computers, video games, digital music players, video cams, cell phones and all the other toys and tools of the digital age. This ready availability of multiple forms of media, in diverse contexts of everyday life, means that information systems content is increasingly central to everyday communication and is increasingly vital to the developmental needs of youth. Stakeholders involved in the well-being of youth have thus far failed to adequately leverage Web 2.0 technologies such as the Internet in their attempt to help youth overcome the challenges that they face and develop into well-balanced adults. There are few frameworks or guidelines available for the design of web spaces to cater for the well-being of youth. While there are many web spaces that provide entertainment for youth, there are very few that cater for the well-being of youth. As such, there are not enough web spaces which provide youth with up-to-date and relevant information and which allow youth to collaborate and participate in an online community in an interactive form so that youth well-being is enabled. Therefore, the main objective of this research is to explore, conceptualise, design and implement a youth-oriented community database that can enhance youth wellbeing.
12.2 Youth Well-being
Youth workers and policy makers, teachers, parents and researchers have highlighted concerns about young people’s well-being and the need for improvement in this area [5, 27]. However, the concept of well-being has not been clearly defined, theorised or measured [26, 73], especially when applied to young people [3]. A recent study found that young people and youth workers agree that well-being is multidimensional and that key dimensions include relationships, psychological factors, health, social environments and emotions [92]. However, young people
indicated that aspects of the self and their relationships were more important to their well-being, while youth workers focused on social contexts and emotions. This implies that young people may be uncritical or unaware of the role of their social contexts and are therefore unlikely to seek to address structural dimensions that impact upon their well-being. The findings also suggest that young people felt they could control their well-being, while youth workers felt that young people's well-being was connected to, if not produced by, their social environments [6].
Family and friends dominate the social environments of youth. Consequently, loneliness and perceived social support from family and close friends [77], a socially supportive network [2, 39], the level of emotional support [13], relationships and friendships [2], as well as a feeling of closeness and connectedness to others on a daily basis, have all been found to contribute to youth well-being [36]. Moreover, feeling understood and appreciated and sharing pleasant interactions with friends and family are especially strong predictors of well-being [68]. Having friends, and more significantly having good-quality friendships, is an important developmental element for adolescents [36]. These findings underscore the importance of schools as a primary source of connectedness with adults and with the broader community as perceived and experienced by the adolescent [70]. Resnick et al. [70] assert that family connectedness still plays an important role in youth's well-being.
Eckersley et al. [27] assert that adolescence is a period of difficult metamorphosis where youth are deciding who they are and what they believe and begin to accept responsibility for their own lives. One of the phenomena occurring at this time is the construction of a coherent self-identity [59]. According to popular psychologists, a coherent self-identity is directly linked to psychological well-being [18, 48, 78], as well as to a subjective sense of well-being [78]. A study by Suh [79] found that adolescents who had successfully achieved a coherent and consistent self-identity required less affirmation from external sources [23] and therefore had a stronger sense of subjective well-being.
Adolescence is a period of many questions: questions about oneself and questions about changing relationships with the outside world. In this phase of life, there is a struggle for independence from parents and an increased reliance on peers for support [10, 50]. However, there are some areas of a young person's life that she or he does not feel comfortable sharing with even the closest friends. For these reasons, the Internet as a form of digital media has become increasingly popular among youth [65, 82]. The Internet accommodates the increased need to communicate with existing friends and the creation of new social relationships, as well as the need for anonymity when looking for information on sensitive topics [4]. Youth are not only exposed to a plethora of technological tools that allow them to connect to the Internet; they are equally surrounded by friends and family who go online. According to a survey done by the PEW Online American Study [41], 83% of all the youth surveyed stated that most of the people they know use the Internet, while only 6% said that very few or none of the people they know use the Internet [51].
The sense of empowerment, and in turn the well-being of youth, comes from the ability of computers and other information and communications technologies to provide better access to information, anonymity and the ability to include their views in decision making [21]. Moreover, networked computers empower people around the world as never before to disregard the limitations of geography and time, find one another and gather together in groups based on a wide range of cultural and subcultural interests and social affiliations [49]. It is empowering for youth to know that they are in control of the information that they are receiving, and a key part of this is their awareness of the tools and paths that are open to them in achieving changes to policies that affect them. A survey conducted by Valaitis [86] about youths creating and implementing their own web sites found that they felt that technology empowered them in three ways: sharing their views and information with the community, obtaining others' opinions and gaining access to influential people. Furthermore, she also found that youth felt more confident, better prepared and more knowledgeable when expressing themselves to the wider community. These three elements of youth web spaces – information, community and collaboration – will be discussed in the next section.
12.2.1 Information

Enhancing youth knowledge about how issues in the media and changes to government policy at the local, national and international levels affect them is crucial to their understanding of themselves. The exponential growth of the web and the increasing availability of collaborative tools and services on the Internet have facilitated innovative knowledge creation/dissemination infrastructures, such as electronic libraries, digital journals, resource discovery environments, distributed co-authoring systems and virtual scientific communities [16]. The transformation of such a rich information base is vital for youth. Transformation of this information may include filtering, aggregation or visualisation. This transformation can facilitate the explanation to the youth community of issues impacting upon them. This in turn is an important step towards ensuring that youth understand how local, regional and international issues impact upon them and, as a result, ensure the well-being and empowerment of youth through knowledge acquisition [16].
12.2.2 Community

It is important to acknowledge the role of youth in participating in their own well-being. Not only are they capable of providing support for each other, but also, as previously mentioned, they are more aware of what their concerns and issues are, and thus their participation needs to be encouraged in all spheres of society and in decision-making processes at the national, regional and international levels.
There are increasing calls for young people to participate in the debates and decisions made concerning their well-being, their education and their communities. These calls are fuelled partly by a growing recognition of children's rights to express themselves, participate and be heard in general, and partly by the decline in civic and political participation both generally [54] and especially among young people [47, 67]. The Internet can be seen as a means of increasing young people's participation in a community environment [34]. It is important to establish environments in which youth have opportunities to be heard within meaningful, caring and supportive relationships which enhance their sense of themselves in positive ways [35]. The Internet provides individual spaces (websites and e-zines) and communities (chats, special interest groups) that allow youth to participate within parameters (anonymity, connectivity and simultaneity) [24] different from those in their on-ground communities. Users of these community-based web spaces report a feeling of higher contribution and closeness both to society in general and to their own group/community; they experience a growing sense of actualisation; and they perceive higher levels of coherence and understanding with respect to what is happening in the world. However, on the whole, they have less trust in people at large and tend to resort to their own group/community for comfort and certainty [22].
12.2.3 Collaboration

Collaboration amongst youth and between them and legislators and decision makers is a vital part of ensuring that their voice is heard. Calvert [11] asserts that collaborative and group-based activities can promote pro-social behaviour, or positive social interaction skills such as cooperation, sharing, kindness, helping, showing affection and verbalising feelings [11, p. 209]. Some scholars see digital technologies as a way of enabling children to have more control and navigation in their learning, mostly through direct exploration of the world around them, ways to design and express their own ideas and ways to communicate and collaborate on a global level [42]. This type of collaboration will improve decision-making processes at national, regional and international levels and, more importantly, will help frame future discussions around the issues that youth and children consider most important for themselves and their well-being.
12.3 Youth Developmental Needs
Adolescence may be defined as the period within the life span when most of a person’s biological, cognitive, psychological and social characteristics are changing from what is typically considered childlike to what is considered adultlike [52].
Lerner et al. [53] assert that there are ubiquitous individual differences in adolescent development and they involve connections among biological, cognitive, psychological and societal factors, with none of these influences acting either alone or as the prime mover of change. Adolescence can be seen as a period of rapid physical transitions in such characteristics as height, weight and body proportions. Hormonal changes are part of this development. Nevertheless, hormones are not primarily responsible for the psychological or social developments of this period [64]. In fact, the quality and timing of hormonal or other biological changes influence and are influenced by psychological, social, cultural and historical contexts [14]. Biological processes can influence an individual's psychological and psychosocial state, but psychological and psychosocial events may also influence the biological systems. Therefore, the timing and outcome of pubertal processes can be modified by psychosocial factors.

The most important psychological and psychosocial changes in puberty and early adolescence are the emergence of abstract thinking, the increasing ability to absorb the perspectives or viewpoints of others, increased introspection, the development of a personal and sexual identity, the establishment of a system of values, increasing autonomy from family and more personal independence, greater importance of peer relationships and the emergence of skills and coping strategies to overcome problems and crises [69]. The difference between the Net Generation and preceding generations is that these processes are now taking place within the context of the Internet.

One aspect that can be considered in the context of the impact the Internet has is the development of self-identity, which can be seen as one of the key developmental milestones during the period of adolescence [30]. Erikson saw adolescence as a period of moratorium – a time-out period during which the adolescent experiments with a variety of identities, without having to assume the responsibility for the consequences of any particular one. The Internet allows adolescents to try on different personas and determine which persona will gain the most approval and acceptance [58]. Another important influence in successful identity development is membership of social groups [57]. Group membership also enhances the need to belong, an innate human motivation [9]. Self-reflection and identity formation can be enhanced by emotional and intellectual openness, which the Net Generation finds in the virtual environment as they find it easier to expose their inner thoughts and personal information online. Adolescents are increasingly embracing the virtual world as a means of exploring their identity in creative and new ways, asserting their sense of self in a highly personal form and customising their sites with unique photos, text, tags and avatars [17].

Similarly, adolescent cognitive development within the context of the Internet is also taking place. Johnson and MacEwan (2006) assert that the Internet is a cultural tool that influences cognitive processes and an environmental stimulus that contributes to the formation of specific cognitive architecture. Cognition is a general term encompassing mental processes such as attention, perception, comprehension, memory and problem solving [75]. Considering the most common Internet activities for adolescents, such as playing games, navigating websites and
communicating with others [72], Johnson and MacEwan (2006) suggest that meta-cognitive abilities are developed as a result of all these activities. Video games, as well as synchronous communication, decrease cognitive processing time (i.e., reaction time). Video games require simultaneous processing; online communication requires successive processing. Video games make extreme demands on visual and meta-cognitive skills. Accessing websites builds the knowledge base and contributes to concept development.
12.4 Youth-Oriented Community-Built Databases (YCD)
Youth-oriented community-built databases are a part of almost all web spaces aimed at young people. Well-known examples of this include the open-source community, social networking sites, message boards, discussion forums and the Wikipedia community. Most features available on youth-centred web spaces are usually intended to fulfil one or more of the following purposes: promoting knowledge about a particular issue or set of issues; promoting youth voice or empowerment of youth as members of society, social trust, or community building; and/or promoting team building or leadership skills [61].

This section will present a comparison between two popular youth websites: one at an international level (Voices of Youth) and the other at a national level (Aotearoa Youth Voices). The purpose of this comparison is twofold: to explore what purpose each website serves in terms of target audiences as well as the features they provide, and to explore the strengths and weaknesses of two very different approaches to the design and creation of a youth-oriented community-built database.
12.4.1 Voices of Youth – UNICEF

Voices of Youth is a website that was created by UNICEF in 1995 and proclaims that since its launch it has reached young people in more than 180 countries, of whom over 60% are from developing countries. The primary purpose of this web space is listed as creating a vibrant online meeting place where young people from around the world explore, discuss and take action on global problems. This website has three main areas for exploration. The first is called Explore, which gives youth the opportunity to learn about rights-related issues such as sexual exploitation, HIV/AIDS and education. The second area is called Speak out, where youth have the opportunity to voice their opinions and take part in discussion forums on issues that concern them as young people. The third section is called Take Action, whereby youth have the opportunity to learn the skills to start their own projects and report back to their peers on progress and ideas. As mentioned previously, a major part of this website is concerned with gathering youth to draw
their attention to global problems and make them aware of how they can help be part of the solution to some of these problems. This website emphasises interactive features which enable young people to, for example, be an editor of a mini-paper or a report; an open discussion forum allowing youth to contribute to an online community with other youth around the world; and the ability to collaborate with other community members to create newsletters and reports on relevant topics.

In terms of evaluating the degree to which this web space contributes to the well-being of youth, it is perhaps useful to consider it in terms of the conceptual model presented earlier (Fig. 12.1). This web space encompasses all three types of web spaces for youth. It is interactive in so far as the features it provides enable youth to send messages and emails to global organisations about issues they feel strongly about, as well as giving them the opportunity to be part of the design team of Voices of Youth in terms of suggesting changes to the website and the addition of new topics they would like to see discussed. It is information provisional as it provides information on a wide and diverse range of topics. It is collaborative as it allows youth to come together and produce artefacts such as a monthly newsletter on topics that they are passionate about. It is community based in so far as the discussion forum aspect of the website is concerned, since youth can not only communicate with other youth on issues that concern them, but also consult professionals and decision makers on how to get their action projects off the ground. In addition, Voices of Youth periodically holds online real-time chats where adults and decision makers are invited along with the youth of the website to discuss issues that concern them.
Fig. 12.1 Information, community and collaboration
In spite of the numerous interactive and informative features of UNICEF’s website, there are still a few areas worth mentioning that require enhancement: l l l
l l
Content focuses on global youth issues and not enough on local youth issues Insufficient localised country-specific information Insufficient features that allow youth to customise/personalise the page to according to their preferences Insufficient governance controls in place Difficulty in combining data stored in underlying database with community-built databases in other popular youth web spaces, e.g., it would be useful if the comments they make on this web space were to be available for their peers to see on their Facebook page.
12.4.2 Aotearoa Youth Voices

Aotearoa Youth Voices is a New Zealand-based initiative whose primary purpose is stated as providing ways to get young people's voices heard. In keeping with this aim, the web space provides many opportunities for youth to have their voices heard in the local and national spheres of the government. As this web space is part of the Ministry of Youth Development website, there is much rhetoric surrounding how the ministry can be used as a vehicle for youth participation and thus empowerment. There is an immense amount of information on how to participate on a local and national level, but the features providing for this participation are not really present.

This web space encompasses only two types of web spaces for youth: information provisional and community. There are elements of interactivity where youth are able to send emails and messages to their local councils and to the Minister of Youth Affairs. It is information provisional in so far as it provides extensive information on local and global laws and policies that affect youth. And lastly, it is community oriented in so far as it provides a discussion forum for members only, where network members can come together and discuss what they have been doing in terms of youth action projects, as well as gain feedback on other topics and issues of interest.

As mentioned previously, the content of this website is focused on getting young people's voices heard through a variety of means. In comparison with the UNICEF website, this website is seriously lacking in terms of interactivity, quality and content:

• There is no information on youth-related issues as identified in the literature review
• Despite there being plenty of information on how to take action in one's community, the help provided is a bit confusing
• The format of how to use the community and participation tools is not very clear
• There are no collaborative tools enabling youth to come together to create an artefact
• The discussion forum is restricted to members only, which provides some level of governance
• It is equally difficult to gain access to community data being created by contributing youth.
12.5 Problems and Issues of Youth Well-Being in YCDs
The following is a discussion of problems and issues in reference to the two existing youth well-being web spaces mentioned above. The problems are discussed alongside the issues that relate to them. A more detailed explanation of these issues is presented in this section, followed by a discussion of requirements arising from these issues.

• Relevance: This issue relates to the relevance of the issues presented in web spaces. As noted above, youth are searching for information relevant to what is important in their lives at the present time. Often, these concerns are different when considered from the perspective of many adult designers of youth web spaces.
• Presentation: The presentation of information, features and content of a web space needs to be in a youth-friendly format. As Livingstone et al. [55] note, youth are fickle web users who need to be actively engaged in web spaces as opposed to just sitting and reading, thereby making presentation a key issue to be considered in the design of an effective youth web space.
• Personalisation and Customisation: Moreover, as Ha and Chan-Olmsted (2001) note, personalisation and customisation of web spaces are vital to captivating audiences, especially youth audiences. Personalisation involves filtering the types of information being viewed by the user, whereas customisation involves changing the appearance (e.g., background and font colours) of the web space.
• Interactivity: Interacting with the web space, with the ability to change or filter in some way the information being received by the user, is an important tool that any web space should take advantage of; in the case of designing a youth web space, this becomes even more important. As Ha and James [33] note, interactivity of web spaces for youth adds a fun and playful element that youth are searching for on the Internet.
• Reach and Range: These two issues refer to the potential reach that youth are able to have with not only their peers but also legislators and decision makers. The range of available tools enabling interaction with these two groups allows youth to acquire and contribute information in a meaningful manner.
• Ubiquity: Young people tend to continue many different threads of conversation within many different contexts on the Internet. They may start an email conversation, which may lead to including other friends on Facebook, Twitter and other social networking technologies, all whilst referring to static informational content on another web space. Ubiquity refers to the ability of community-built databases within web spaces to liaise with each other to provide better integration of data.
12.6 Design Principles for YCDs
Following the discussion of the problems and issues that exist in the domain of youth well-being community-built databases and web spaces, we can now consider the requisite conceptual and design requirements.
12.6.1 Conceptual Requirements

This field of research is dominated by three distinct areas of research – Youth, Youth Development and Information Systems. Therefore, it is difficult to locate models and frameworks that will enable the design and implementation of a community-built database and web space for youth well-being. There are numerous frameworks for designing and implementing web spaces for organisations [1, 40, 62], frameworks for explaining the transition of youth from childhood to adulthood, as well as frameworks and models that explain to some degree what young people are involved with online (e.g., [8, 31, 84]). However, in order to design and implement a community-built database and web space for youth well-being, we need to bring these diverse perspectives together to create a robust framework and model for this specific area of research.

The three modalities of information, collaboration and community can additionally be analysed in terms of their content, quality of information and interactivity. Content, quality of information and interactivity all play a vital role in the design of a web space for youth (Fig. 12.2). In terms of what content should be covered by web spaces for youth, it is perhaps useful to first consider the issues which youth consider the most important to them. Recent surveys conducted by Harris Interactive [56] and the United Nations Youth Association of Australia [85] found that young people wanted to discover information and contribute information on topics that were central to their lives and about which they felt strongly [56] (Fig. 12.3).

The quality of the web space for youth is strongly influenced by the quality of the content within it as well as the quality of the information presented within it. That is, there is a phenomenal amount of content available for the consumption of youth regarding a variety of issues and concerns that may be of interest to them. However, in order to ensure a better quality of information available to the youth, two steps could be taken. The first is some form of intelligence density, defined as measuring 'how quickly can you get the essence of the underlying data from the input' [25]. Intelligence density allows the user to filter data to satisfy their particular interest and also to present the data at levels of abstraction given the depth they want to focus on. Intelligence density in this form can be enhanced immensely by the voice of youth. Put another way, the more emphasis there is on listening to the voice of children and youth, the more the quality of the information provided regarding youth advocacy and policies will improve.
Fig. 12.2 Problems, issues and requirements in current youth-oriented community-built databases
Fig. 12.3 Youth interaction in youth-oriented community-built databases
As an increasing number of youth turn to the Internet as a research tool [51], the quality of the information provided, in terms of accuracy and relevance, should be quite an important consideration in the design of a youth-oriented web space. For example, the range and quality of information provided by Epal, an interactive site to assist the provision of the Connexions service in Britain, is noted as a major factor behind its success [55]. Similarly, the success of Rizer – a Nottingham site aimed at educating potential youth offenders about the Criminal Justice System and the consequences of crime – is due in part to the fact that it fills an important information gap on the web with up-to-date information and that youths find it 'interesting and stimulating' [55]. Therefore, the presentation of content is important in determining the success of a youth web space. In this regard, interactivity and ease of use are important factors, as is the kind of language used, all of which will ensure that it is appealing to youth.

In addition, interactivity is another dimension that should be considered in the design of a youth web space. Terdimen [83] reports on a study that observed American and Australian youth using dozens of websites across a variety of genres. They found that the participants want to be "doing something as opposed to just sitting and reading, which tends to be more boring and something they say they do enough of already in school". Therefore, interactivity is very important, especially when it comes down to capturing youth attention. There is much debate about the definition of interactivity. Steuer [76] defines interactivity as the extent to which users can participate in modifying the form and content of a mediated environment in real time. However, not all observers agree about the importance of real time. For example, Rheingold [71] suggested that the asynchronous characteristics of tools such as email, news groups and listservs are one of the key benefits of these interactive media. We agree with Heeter [37], who defines two components of interactive websites. The first is ease of adding information, meaning the degree to which users can add information for access by a mass, undifferentiated audience. The second is interpersonal communication facilitation, which comes in at least two forms: asynchronous (allowing users to respond to messages at their convenience) and synchronous (allowing for concurrent participation in real time).

Furthermore, Ha and James [33] identified five dimensions of Web interactivity that fulfil different communication needs: (1) Playfulness – measured by the presence of such curiosity-arousing devices as Q and A formats and games; (2) Choice – measured by the number of alternatives for colour, speed, language and other non-informational aspects; (3) Connectedness – measured by the presence of information about the product, company, third parties and other content of interest to visitors; (4) Information collection – measured by the presence of such monitoring mechanisms as registration forms and counters; and (5) Reciprocal communication – measured by the presence of response mechanisms, including the Webmaster's email address, surveys and purchase orders. Hugh-Hassell and Miller [43] echo similar sentiments. Their research identifies that visual appeal of the site, ease of navigation, and currency and accuracy of information are all key elements when it comes to creating an interactive web space for youth.
12.6.2 Design Requirements

In the section Conceptual Requirements, we outlined three areas that should be considered in the design of a youth web space to enhance well-being: content, quality and interactivity. In this section, we elaborate on specific requirements for the design of interfaces for youth web spaces to enhance well-being that address the issues raised in the prior sections. The requirements are as follows: web spaces to enhance youth well-being should (a) present information in such a way that dense information can easily be navigated; (b) use up-to-date designs; (c) be interactive; (d) show who contributed what; (e) be customisable and easy to personalise, ultimately allowing youth to contribute to the overall design of the web space; and (f) allow users to express their virtual identity.

The issues that are relevant in the context of designing interfaces are presentation, interactivity and personalisation/customisation. The problems related to presentation are poor navigation facilities and an inappropriate structure for presenting information. These observations lead us to the requirement that the information presented on a web space to enhance youth well-being should be presented in a way that is easy to both understand and navigate. Youth in general are attracted to what is new and innovative and not dusty from their parents' cupboard. Hence, another requirement to ensure a youth-friendly appeal is to improve the appearance, and thus experience, of websites so that they do not undermine young people's desire to be, and to be seen to be, cool [55].

The issue of interactivity originates from the limited facilities of current platforms to support interactivity. Interactivity generally leads to improved user satisfaction and acceptance along with increasing the visibility of websites [15]. Livingstone et al. [55] assert that the Internet can facilitate participation in so far as it encourages its users to sit forward, click on the options, find the opportunities exciting, begin to contribute content, come to feel part of a community and so, perhaps by gradual steps, shift from acting as a consumer to increasingly (or in addition) acting as a citizen. Thus, the emphasis among academics is clear that creating an interactive environment is what is required to enable youth to engage with the Internet in a meaningful manner [37, 55, 61]. The strong evidence in the literature underscores the importance of this aspect in designing youth web spaces, and we therefore propose that interactivity should be considered.

From another perspective, youth seek to modify web spaces so as to "leave their mark" and receive acknowledgement and other positive feedback for their contributions. Also, youth are seeking pillars for navigation in a complex and confusing world and tend to understand knowledge in a social context [70]. Therefore, we state as a further requirement that youth should directly see who has contributed which content in order to create social interactivity.

Personalisation and customisation are generally not well supported in current youth web spaces. As argued for interactivity, users should engage with the web space. Additionally, the web space should be fun [8] to use. This is enabled by the web space's capability to adapt to their personal needs. Therefore, we state as one
requirement that the web space should be personalised for the users and allow them to customise the web space according to their needs. This relates to the requirement of presenting the information in a manner that is easy to understand and navigate. Each user will have different prior knowledge and preferences that moderate the perception of how well the information is presented and, to accommodate diverse needs, the web space should be presented differently for each user.

Moreover, the creation of a coherent identity is an important part of adolescence [78]. Online role-play games are a good illustration of this concept: users are able to create virtual identities and are encouraged by the social dynamics of the virtual world to make their virtual identity stronger and "more appreciated" [11]. For the interface of a web space to enhance youth well-being, identities and what they do (as described under interactivity) should be presented as an integral part of the interface. This means that youth users do not only personalise the web space in terms of how it is presented to them, but can also contribute to how the web space appears to other users. This concept is well illustrated by popular sites such as MySpace, where it is apparent that the pages of different users differ significantly in both content and design.

Moreover, governance is another important requirement that must be considered in the construction of a youth-oriented community-built database. While all Internet users are susceptible to governance issues in an online context, adolescents are more at risk because of their still developing social, interpersonal, computer and life skills. Adolescents regularly use the Internet to communicate through emails, social networking sites and chat rooms. Wirth et al. [92] assert that while engaging in most of these online activities takes little to no training for the adolescent user, fully understanding safety threats, knowing how to effectively protect oneself from these threats and having the motivation to do so require a greater level of understanding. Designing a web space or a community-built database that protects these vulnerable users is discussed in more detail in the following sections.
12.7 YCD Framework
In order to support the design of community-driven databases which can enhance youth well-being, we propose a conceptual framework. This framework is informed by the issues and design principles discussed above. One important aspect we wish to consider is systems that are tailored to the requirements of youth users and their well-being, where concepts of community, identity, database and data may differ slightly from traditional perspectives.

Youth use the Internet to multitask between various web spaces that catch their attention, and therefore the data produced is often conversation-based rather than ontology-based [72]. For instance, while organisations can build up knowledge bases in wikis or other forms of community-driven databases, youth users usually have a very diverse set of interests and motives that collide with the strict organisation of wikis in categories and articles.
Fig. 12.4 Conceptual framework to guide the design of community databases for youth well-being
Conversation-driven data, rather than depending on an ontology, depends more on the identities of those conversing as well as the order in which a conversation progresses. Youth use a diverse variety of tools to converse [19, 66, 81]. This also includes different web spaces and therefore their different underlying community-driven databases.

Figure 12.4 emphasises a number of central elements for a youth-oriented, community-driven database. First, it is assumed that youth communities span multiple databases or web spaces. These communities are connected by means of a conversation rather than topics or an ontology. Furthermore, based on our discussion above, community databases should provide the following dimensions: mechanisms for enriched, appealing and personalised presentation; mechanisms that enable social collaboration using distributed social identities; mechanisms that support the integration of various elements such as data, objects, knowledge and processes; and mechanisms that allow the governance of the content. In the following subsections, we explore these three dimensions and governance issues in greater detail.
12.8 Web Interaction Dimension
One of the key design requirements for youth community-driven databases is the creation of an environment that is rich and visually attractive. The community database is one such technology that enables the visualisation of the content using
multiple media and technologies. Youth spend a lot of their time searching for information on the Internet and are attracted to web spaces that make this process "fun" [7]. A community database for youth could provide this information in a strongly visual manner with a focus on pictures, interactive content and other media. The information can further be made more accessible with the use of knowledge maps or symbols to support the meaning of text content, following the principles of concept maps, mind maps, conceptual diagrams and visual metaphors [29].

Community databases are web-based technologies where the creation of links to any web resource is identified by a URL. However, young users of community databases do not want to click on "boring" links but rather see the linked content directly integrated into the site. An example of a web space that does this is MySpace, which enables young people to represent themselves using a "MashUp" of different media.

Youth are not just consumers of content; they are also motivated to contribute to web spaces [74], thereby popularising technologies which make it easy for them to add and edit content. An editor for community databases that makes data entry "fun" could enable and motivate youth to contribute their own content to the database.
12.8.1 Social Collaboration Dimension

Youth are seeking pillars for simple navigation in a complex and confusing world and tend to understand knowledge in a social context [62]. Consequently, youth seek to modify web spaces so as to "leave their mark" and receive acknowledgement and other positive feedback for their contributions. For community databases, this has two implications: youth should directly see who has contributed what content, and they should be able to create their own identity in the community database.

Most community databases allow traceability of users and the content they have contributed. However, this is mostly hidden in special "version" pages that are often not easy to understand. Some community databases allow direct contributions to pages by leaving comments rather than editing the page directly, and the comments are directly attributed to the user who made the comment. Most community databases allow users to create their own personal pages.

In addition, the creation of a coherent identity is an important part of adolescence [78]. Online role-play games are a good illustration of this concept: users are able to create virtual identities and are encouraged by the social dynamics of the virtual world to make their virtual identity stronger and "more appreciated" [11]. Allowing the content to be enriched with the meta-data of the author of the content adds a social dimension to collaboration. As a result, rather than collaborating on topics and content with anonymous users, it becomes more a communication process between virtual identities that present themselves through their contributions and personally designed pages.
Youth regularly make use of many different communication channels, such as chat clients, email, discussion forums, shout boxes, YouTube and VoIP [19, 66, 81]. A community database for youth can strengthen the youth user community by offering as many communication channels as possible.
12.8.2 Semantic Integration Dimension

To enable the creation of mash-ups using different resources from the Internet, the community-driven database should be a suitable platform for integrating content from different sources. Rather than becoming a central repository for various bits of information, the community database would become an information source for a "digital native" [65], allowing them to become a "spider in the net" rather than trying to contain all the content in one database. The Internet and its uses evolve faster than any single platform will ever be able to. Today, one of the more popular ways for youth to express their virtual identity is a MySpace page; tomorrow it could be a blog; and the day after, it could be a video cast. A community database for youth must enable them to say what they want to say using the most recent technologies and to integrate their social identity in the community database with social identities they have developed on other platforms. The facilities provided by the database would focus on providing a standardised means of linking to content in other sources rather than a means of storing content in the community database.
12.8.3 Community Governance

A requirement that has emerged with the growing number of community-driven platforms is the governance of user contributions [12]. This is of special importance in the context of youth-related databases, as youth well-being depends on how well users are protected [60, 93, 94]. Governance should be based on two elements: peer control by youth and control by adults. This requires at least two different processes and roles for governance. Many community-driven databases differentiate between system administrator users (e.g., Sysops in MediaWiki) and normal users. A community database specifically designed for youth would require a finer differentiation of user rights and roles: for instance, youth supervisors, youth users and adult supervisors.
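As a minimal illustration of what such a finer differentiation of rights might look like, the Java sketch below models hypothetical roles and a simple permission check. The role names, the actions and the rules assigned to them are assumptions made for illustration only; they are not part of MediaWiki or of any particular platform.

```java
// Illustrative sketch of role-based governance for a youth community database.
public class GovernanceSketch {

    enum Role { YOUTH_USER, YOUTH_SUPERVISOR, ADULT_SUPERVISOR, SYSTEM_ADMIN }

    enum Action { READ, CONTRIBUTE, FLAG_CONTENT, MODERATE_PEERS, REMOVE_CONTENT }

    /** Returns true if the given role is allowed to perform the given action. */
    static boolean isAllowed(Role role, Action action) {
        switch (role) {
            case YOUTH_USER:
                // Ordinary youth users read, contribute and flag problematic content.
                return action == Action.READ || action == Action.CONTRIBUTE
                        || action == Action.FLAG_CONTENT;
            case YOUTH_SUPERVISOR:
                // Peer control: youth supervisors moderate contributions but cannot delete.
                return action != Action.REMOVE_CONTENT;
            case ADULT_SUPERVISOR:
            case SYSTEM_ADMIN:
                // Adult control: full moderation rights, including content removal.
                return true;
            default:
                return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isAllowed(Role.YOUTH_USER, Action.REMOVE_CONTENT));       // false
        System.out.println(isAllowed(Role.YOUTH_SUPERVISOR, Action.MODERATE_PEERS)); // true
    }
}
```

The two supervisor roles reflect the two governance processes named above: peer control exercised by youth and oversight exercised by adults.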
12.9 Architecture
Informed by our discussions of requirements for youth-oriented community databases, we propose a conceptual architecture to guide the implementation of such databases. Primarily, the four dimensions of web interaction, social collaboration,
Fig. 12.5 Conceptual architecture
semantic integration and community governance need to be considered. In addition, a system that helps youth build up databases to promote their well-being must consider that the data is conversation-driven and not topic- or ontology-driven. Furthermore, it must be considered that these conversations cannot be expected to be limited to one web space or database, but can possibly span multiple databases. Our proposed architecture is given in Fig. 12.5. The individual components of the architecture will be discussed in the following sections.

The web interaction layer of our proposed architecture accommodates the mechanisms offered by the database that allow users to enter data into, or retrieve data from, the web space in various ways. In a service-oriented architecture, such layers are sometimes referred to as a portal layer [46]. Similar to the portal layers in SOA architectures, the web interaction layer is composed of a heterogeneous set of tools which allow interaction with the layer. We conceptualise tools which allow changes to be made to the content of the database as application objects. These application objects offer users of the database a means of viewing the data or changing the data. This is reflected in Norman's well-known concept of presentation language and action language. Examples of application objects are, for instance, an embedded map from Google maps (http://maps.google.com/) or an embedded YouTube video. These objects would primarily support presentation language. Various editors for the purpose of adding and editing text or other media can support action language.

These application objects must be presented to the users in a way that accommodates the specific requirements of youth users. As discussed above, this includes an attractive and modern design that includes interactive elements.
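To make the notion of an application object more concrete, the following Java sketch separates a presentation operation (Norman's presentation language) from an editing operation (action language). The interface, the class and the embedded-video example are illustrative assumptions, not part of any existing toolkit.

```java
// Illustrative sketch: an application object exposes both a presentation and an action facet.
interface ApplicationObject {
    String renderHtml();              // presentation language: how the data is shown to the user
    void applyEdit(String newValue);  // action language: how the user changes the data
}

// Hypothetical embedded-video object, e.g. wrapping a video URL in an iFrame.
class EmbeddedVideo implements ApplicationObject {
    private String videoUrl;

    EmbeddedVideo(String videoUrl) {
        this.videoUrl = videoUrl;
    }

    @Override
    public String renderHtml() {
        // Presentation: the linked content is shown directly inside the page, not as a bare link.
        return "<iframe src=\"" + videoUrl + "\" width=\"480\" height=\"270\"></iframe>";
    }

    @Override
    public void applyEdit(String newValue) {
        // Action: editing simply points the object at a different video.
        this.videoUrl = newValue;
    }
}
```

An editor widget in the web interaction layer would call applyEdit, while the page renderer would call renderHtml; the same object thereby serves both languages.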
Fig. 12.6 Choreography, orchestration and composition
Furthermore, a database to promote well-being must be presented in an accessible way, both in technical and in user-experience terms. Standard compliance of the interface of the web space is one factor which not only contributes to accessibility, but also ensures that the user experience of this database can be integrated into other web spaces.

The social collaboration layer is the key component of the architecture to leverage and adapt to the system-spanning, erratic and conversation-based manner in which youth use the Internet (Fig. 12.6). Users enter data by recombining content from other sources or from within the database. If users enter new data exclusively by recombining data which is part of one database, we speak of composition. If users enter data as a recombination of content from multiple databases, we speak of orchestration. When users contribute data to one database which becomes meaningful only in the context of a conversation that spans multiple databases or web spaces, we speak of choreography of data. Unlike orchestration, however, in choreography the users do not use explicit links which could be encoded in the database. The terms composition, orchestration and choreography stem from the context of service-oriented architecture [28].

Community databases should support composition, orchestration and choreography by allowing easy linkage between data and linkage of data from external sources as a central functionality. By altering or deleting these links, the structure and meaning of the data can be reorganised. As youth users contribute a great share of content to social networking sites, it is especially important to integrate these sites in the process of social collaboration. The easy integration of the community database with the various social networking sites or other web resources allows users to link and extend their distributed online identity within the community database.

One important aspect of the social collaboration layer is the concept of partial persistence. The discussion of composition, orchestration and choreography shows that the relevant data cannot be found in one database but is distributed across different databases.
Consequently, only parts of the data can be stored in one database. Although persistence layers, for instance in service-oriented architectures, do allow distributed data repositories [38], these are usually within the boundary of one organisation or business network. The diversity of the social collaboration layer means that any attempt to work with this data in a structured manner can be difficult. Therefore, our conceptual architecture proposes that the social collaboration layer is designed in a way which allows parts of the data on this layer to be projected onto a semantic integration layer.

The semantic integration layer allows the addition of semantic meta-data to the data entered in the process of social collaboration. A technology that is capable of integrating data from different sources in different formats is the semantic web. The use of semantic web technologies is a common approach to integrating data [20, 32, 80, 95]. The semantic web principles – the open world assumption, meaning it is always assumed that more perspectives can be added to a certain set of data, and the principle that 'anyone can say anything about any topic', meaning that there is no ownership of information and that everybody can contribute to any topic – align with youths' desire to freely express themselves in ways that, due to the rapidly evolving nature of the Internet, cannot be anticipated. A community database for youth must develop these concepts further, not just to semantically enable the content stored in the community database but to be prepared to provide a common platform to meaningfully integrate content from other web sources.

The community governance layer is enabled by the semantically enriched data of the semantic integration layer. Youth users contribute content by composition, orchestration and choreography. Governance must address all of these in order to create a youth space that enhances well-being. While composition is relatively easy to govern, as all of the content resides in one database, orchestration is more problematic, as the data entered by the users is dynamically generated from data that is stored elsewhere. The choreography of data poses even greater challenges, as the meaning of data may depend on what users have contributed to other platforms. To fulfil the challenging task of governing a community database which is conversation-driven rather than ontology-driven, advanced intelligence mechanisms must be employed. These mechanisms can be driven by visualisation, automatic reasoning or semantic graph-based queries.
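Because governance depends on this distinction, it may help to see composition, orchestration and choreography expressed as a simple data model. The following Java fragment is an illustrative assumption of how a content node might record whether the items it recombines are local or external; it is not a prescription of any particular implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of conversation-driven recombination of content.
class ContentNode {
    final String uri;       // identifier of the contributed item
    final boolean local;    // true if the item is stored in this community database
    final List<ContentNode> parts = new ArrayList<>();

    ContentNode(String uri, boolean local) {
        this.uri = uri;
        this.local = local;
    }

    void addPart(ContentNode part) {
        parts.add(part);
    }

    /** Composition: every recombined part lives in this database and is easy to govern. */
    boolean isComposition() {
        return parts.stream().allMatch(p -> p.local);
    }

    /** Orchestration: at least one recombined part originates from another database. */
    boolean isOrchestration() {
        return parts.stream().anyMatch(p -> !p.local);
    }

    // Choreography cannot be detected from explicit links at all: the contribution only
    // becomes meaningful in the context of a conversation spanning other web spaces, so
    // no method on the stored structure can compute it. This is what makes governing
    // choreographed data the hardest case.
}
```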
12.10 Implementation
We have conducted an explorative implementation to illustrate the key aspects of our architecture and framework. However, our focus was not on the delivery of a complete horizontal solution to all of the issues relevant for youth-oriented community-driven databases. Our focus was on a vertical prototype, which can help in discovering difficulties in the implementation of projects or reveal issues that cannot be derived from a theoretical discussion.
A wide range of web spaces that promote youth well-being have already been implemented. However, our survey of youth web spaces revealed two important limitations. First of all, the web spaces provide their information in a manner that is not very engaging for teenagers: the content is either provided in a static fashion or tries to be so flashy that the web spaces become unusable. Second, although there are web spaces that address various topics concerning the life and well-being of youth, there are no integrated and well-governed web spaces that could provide youth with a central contact point when querying the web for their well-being. Our studies have shown that the primary contact point for youth when enquiring about sensitive topics concerning their well-being on the web is the Google search engine. Although Google governs some of its search terms, such as sex, many topics that youth search for can lead to more dubious places. We therefore explore the implementation of a database which can provide a community-driven approach to assist youth in seeking information using search engines in a way that supports their developmental requirements.

Conceptually, the requirements for our community database can be expressed in two modes of user interaction. The first is users actively seeking information. They can achieve this by entering a search term and clicking either a search or a whisper button. By using the search button, the search term that the user has entered will be visible to other users of the platform under his or her primary user name. The whisper button will also publish the search query to other users, but anonymously. This brings into play multiple identities, as discussed in the framework section. The second mode of interaction is assisting other users in their search queries. As every search query is made available to other users, these users have the opportunity to click on them and visit a unique page which represents this search query. Here, users can post additional content such as their comments or links to sources they deem relevant to the query at hand.

Our aim was to explore the combination of technologies that can be used to fulfil these requirements and whether an implementation can be guided by our proposed conceptual architecture. Our vertical prototype is not aimed at providing an integrated user experience of the different aspects discussed below. Some parts have been implemented in Java Swing or as a JavaScript prototype. One possible combination of components and technologies is given in Fig. 12.7. In the following paragraphs, we will discuss important aspects of our explorative implementation in greater detail.

One important requirement we found in our implementation was that composition and orchestration could be supported only when the database enables the user to work with data in a network representation. Composition inherently requires a hierarchical organisation of data when smaller pieces are aggregated into mash-ups or composites. These composites in turn can be reused to build higher-level mash-ups or composites. In terms of the search queries, the whole discussion about one search term can be a useful contribution to the discussion of another search term. Users will thereby be enabled to link one search term with another, creating a hierarchical structure of composites. As some elements are to be reused in more than one composite, a network organisation of the data emerges. Orchestration follows the same process as composition.
Fig. 12.7 Overview of explorative youth community database implementation
However, unlike composition, some elements that are used and linked together originate from other databases. For the search queries, these can, for instance, be YouTube videos which contain valuable hints for a search query. Important components of an implementation are a browser and an editor to work with such networks.

The primary purpose of the youth portal is to allow the first mode of interaction – entering new search queries – and to provide an appealing first impression of the database. However, assisting other users in their search queries requires a more complex mode of interaction, requiring the browser and editor discussed above. These are represented in Fig. 12.7 by the browser application. This application turned out to be too complex to reside entirely on the client side, and parts of the application logic must be implemented on a browser server. We chose to explore the use of the Google Web Toolkit (GWT) technology to implement the browser client, as this technology allows easy integration with a server written in Java. Application objects can be supported by HTML iFrame objects [88], for instance to embed YouTube videos or Flickr photos, GWT composite objects and HTML editors. GWT further enables the easy implementation of drag and drop within the browser application. One difficulty arising for composition and orchestration in this context is that content residing in an iFrame object cannot be dragged and dropped to be recombined with other objects; only the iFrame as a whole can be recombined with other objects.
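As an illustration of the two modes of interaction and of iFrame-based application objects in a GWT client, the sketch below shows a minimal entry point with search and whisper buttons and an embedded video frame. The class name, the placeholder video URL and the publishQuery call are hypothetical; only the GWT widgets used (TextBox, Button, Frame, RootPanel) belong to the toolkit itself.

```java
import com.google.gwt.core.client.EntryPoint;
import com.google.gwt.event.dom.client.ClickEvent;
import com.google.gwt.event.dom.client.ClickHandler;
import com.google.gwt.user.client.ui.Button;
import com.google.gwt.user.client.ui.Frame;
import com.google.gwt.user.client.ui.RootPanel;
import com.google.gwt.user.client.ui.TextBox;

// Illustrative GWT entry point for the youth portal (hypothetical class and server call).
public class YouthPortalSketch implements EntryPoint {

    public void onModuleLoad() {
        final TextBox queryBox = new TextBox();

        // "search" publishes the query under the user's primary identity.
        Button search = new Button("search", new ClickHandler() {
            public void onClick(ClickEvent event) {
                publishQuery(queryBox.getText(), false);
            }
        });

        // "whisper" publishes the same query, but anonymously.
        Button whisper = new Button("whisper", new ClickHandler() {
            public void onClick(ClickEvent event) {
                publishQuery(queryBox.getText(), true);
            }
        });

        // An iFrame application object, e.g. an embedded video relevant to a query.
        Frame video = new Frame("https://www.youtube.com/embed/VIDEO_ID");

        RootPanel.get().add(queryBox);
        RootPanel.get().add(search);
        RootPanel.get().add(whisper);
        RootPanel.get().add(video);
    }

    // Hypothetical call to the browser server; a real client might use GWT RPC or REST.
    private void publishQuery(String term, boolean anonymous) {
        // ... send the query and its anonymity flag to the content server via the browser server ...
    }
}
```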
The data that the browser application offers to the user is managed by a content server. The browser application and content server do not communicate directly, primarily to enable orchestration. Due to security measures, a browser application should only communicate directly with one server (the domain currently displayed in the browser). To enable the browser application to show contents originating from different servers, the browser server needs to work as a mediator to establish connections to the different servers. The purpose of feed robots is to automatically provide content to the database which might be helpful for the users. Most importantly, when a user enters a search query, a new page is created and, in a simultaneous process, a feed robot is initiated; this in turn calls the Google search API to feed the search results to the newly created page.

Each object added by the users is attached to semantic metadata. Currently, these are very simple semantic statements. The underlying technology follows the semantic web stack. Uniform resource identifiers (URIs) [44] provide a technology to uniquely identify resources. They can be used to identify external content and content that is used within the platform. The resource description framework (RDF) [90] allows the modelling of basic statements about the identified resources. This can be used to link different content, to give content meaning and to store content in a structured way [87]. The integration of different content can be facilitated using the web ontology language (OWL) [89], as defining integrating ontologies has proven to be a good way to integrate heterogeneous data from heterogeneous data sources [80, 95]. We have used the Jena semantic web framework to attach metadata to the data network. The semantic metadata is provided as hierarchically organised RDF documents on a web server. The metadata are linked to each other using the linked data principles. The data can be easily retrieved and processed using SPARQL queries. These could enable different forms of social network analysis but, so far, this has not been investigated further in our prototype implementation.

Overall, as illustrated in Fig. 12.7, our proposed conceptual architecture could provide guidance for this implementation project directed towards youth well-being. The most distinguishing characteristic of our framework and architecture is the social collaboration layer. In order to understand data that is distributed among extremely loosely coupled databases by erratic conversations, it is essential to provide databases which accommodate the special requirements of youth. We have identified further requirements for youth-oriented community-driven databases based on our explorative implementation: for instance, the need to support composition and orchestration by allowing users to work with data networks.
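To make the metadata handling described above more tangible, the sketch below creates a small in-memory RDF model and retrieves queries on a given topic with SPARQL. The vocabulary (namespace, contributedBy, aboutTopic) is a hypothetical example rather than the prototype's actual schema, and the imports use the current Apache Jena packages.

```java
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.QueryFactory;
import org.apache.jena.query.ResultSet;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Property;
import org.apache.jena.rdf.model.Resource;

// Illustrative sketch: attaching semantic metadata to a search-query page and querying it.
public class MetadataSketch {

    public static void main(String[] args) {
        String ns = "http://example.org/ycd#";   // hypothetical vocabulary namespace
        Model model = ModelFactory.createDefaultModel();

        Property contributedBy = model.createProperty(ns, "contributedBy");
        Property aboutTopic = model.createProperty(ns, "aboutTopic");

        // A search-query page contributed anonymously ("whisper") about a sensitive topic.
        Resource query = model.createResource(ns + "query/42");
        query.addProperty(contributedBy, model.createResource(ns + "user/anonymous"));
        query.addProperty(aboutTopic, "bullying");

        // A semantic graph-based query, e.g. as a building block for governance or analysis.
        String sparql = "PREFIX ycd: <" + ns + "> "
                + "SELECT ?q WHERE { ?q ycd:aboutTopic \"bullying\" }";

        try (QueryExecution exec = QueryExecutionFactory.create(QueryFactory.create(sparql), model)) {
            ResultSet results = exec.execSelect();
            while (results.hasNext()) {
                System.out.println(results.next().get("q"));
            }
        }
    }
}
```

In the prototype the RDF documents live on a web server and are interlinked following linked data principles, so the same kind of query would run over fetched and merged models rather than a single in-memory one.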
12.11 Conclusion
Community-built databases have been established for diverse groups and communities. Well-known examples of this include the open-source community and the Wikipedia community. With the emergence of Web 2.0 technologies and the adaptation of community-driven web spaces by wider audiences, new issues
concerning the governance of these web spaces arise. While traditional community-built databases are driven by a professional and technically oriented audience, the rise of community-built databases shaped by the Net Generation may require a greater degree of governance, and greater sensitivity to the developmental needs of youth, than that afforded by traditional community-built databases.

We began this chapter with a definition of the concept of well-being for youth, as well as the developmental processes that lead towards well-being. In particular, we explored the informational, community and collaborative needs of youth. We then explored the features, problems, issues and requirements of youth-oriented community-built databases in the context of two popular youth websites, one at an international level (Voices of Youth) and the other at a national level (Aotearoa Youth Voices). Apart from identifying the features provided by these websites, another key purpose of this comparison was to explore two very different approaches to the design and creation of youth-oriented community-built databases. This led us to identify problems, issues, challenges and requirements for youth well-being that may arise in the context of community-built databases. Key issues that were synthesised included: relevance, reach, range, presentation, personalisation, customisation, interactivity and ubiquity.

These issues helped us to identify a number of requirements, concepts and design principles for youth-oriented community-built databases. These principles in turn led us to propose a framework and architecture that would help in solving the afore-identified problems. The four key pillars underpinning the framework and architecture are: web interaction (accessibility, search, visualisation), social collaboration (identity, communication, traceability), semantic integration (standardisation, ontology, integration) and community governance (authorisation, self-governance, youth workers). Lastly, using an explorative proof-of-concept implementation, we illustrated the concepts, principles, framework, architecture and pillars of youth-oriented community-built databases that enhance and govern youth well-being through information discovery, capture, transformation and sharing.
References 1. Adams, L.A, Courtney, J.F.: Achieving relevance in IS research via the DAGS framework. Paper presented at the System Sciences, 2004. Proceedings of the 37th Annual Hawaii International Conference (2004). 2. Argyle, M.: The Social Psychology of Everyday Life. Routledge, New York, (1992). 3. Ben-Arieh, A.: Where are the children? Children’s role in measuring and monitoring their well-being. Soc. Indic. Res. 74(3), 573–596 (2005). 4. Borzekowski, D.L.G, Rickert, V.I.: Adolescent cybersurfing for health information: a new resource that crosses barriers. Arch. Pediatr. Adolesc. Med. 155(7), 813–817 (2001). 5. Bourke, L.: Toward understanding youth suicide in an Australian rural community. Soc. Sci. Med. 57(12), 2355–2365 (2003). 6. Bourke, L., Geldens, P.M.: What does wellbeing mean? Perspectives of wellbeing among young people and youth workers in rural Victoria, Youth Studies Australia 26(1), 41–49 (2007).
7. Boyd, D.: Identity production in a networked culture: Why youth heart MySpace. Paper presented at the American Association for the Advancement of Science, St Louis, Missouri (2007). 8. Boyd, d.m., Ellison, N.B.: Social network sites: definition, history and scholarship. J. Comput. Mediat. Commun. 13(1), 210–230 (2008). 9. Brewer, M.B., Gardner, W.: Who is this “We”? Levels of collective identity and self representations. J. Personal. Soc. Psychol. 71(1), 83–93 (1996). 10. Buhrmester, D., Furman, W.: The development of companionship and intimacy. Child Dev., 58(4), 1101–1113 (1987). 11. Calvert, S.L.: The form of thought. In: Sigel, I. (ed.) Theoretical Perspectives in the Concept of Representation. Erlbaum, Hillsdale (1999). 12. Camille, R.: Viable wikis: struggle for life in the wikisphere. Paper presented at the Proceedings of the 2007 International Symposium on Wikis, Montréal, Canada (2007). 13. Caplan, S.E., Turner, J.S.: Online emotional support: bringing theory to research on computer-mediated supportive and comforting communication. Comput. Hum. Behav. 23(2), 985–998 (2007). 14. Caspi, A., Lynam, D., Moffitt, T.E., Silva, P.A.: Unraveling girls’ delinquency: biological, dispositional and contextual contributions to adolescent misbehavior. Develop. Psychol., 29(1), 19–30 (1993). 15. Chen, K., Yen, D.C.: Improving the quality of online presence through interactivity. Inf. Manag., 42(1), 217–226 (2004). 16. Chen, L.L.J., Gaines, B.R.: Knowledge acquisition processes in internet communities. Paper presented at the Proceedings of Tenth Knowledge Acquisition Workshop, Nottingham, UK (1996). 17. Cheng, L., Farnham, S., Stone, L.: Lessons learned: social interaction in virtual environments. In: Tanabe, M., Besselaar, P., Ishida, T. (eds.), Digital Cities II: Computational and Sociological Approaches, Springer, Heidelberg, Germany, pp. 597–604 (2002). 18. Christopher, J.C.: Situating psychological well-being: exploring the cultural roots of its theory and research. J. Counsel. Dev. 77(2), 141–152 (1999). 19. Chu, J.: Navigating the media environment: how youth claim a place through zines. Soc. Justice, 24, pp. 147 (1997). 20. Chung, P.W.H., Cheung, L., Stader, J., Jarvis, P., Moore, J., Macintosh, A.: Knowledge-based process management – an approach to handling adaptive workflow. Knowl. Based Syst., 16(3), 149–160 (2003). 21. Cockburn, T.: New information communication technologies and the development of a children’s ‘community of interest’. Commun. Dev. J. 40(3), 329–342 (2005). 22. Contarello, A., Sarrica, M.: ICTs, social thinking and subjective well-being – the internet and its representations in everyday life. Comput. Human Behav., 23(2), 1016–1032 (2007). 23. Crocker, J., Luhtanen, R.K., Cooper, M.L.: Contingencies of self-worth in college students: theory and measurement. J. Personal. Soc. Psychol., 85(5), 894–908 (2003). 24. De Kerckhove, D.: Connected Intelligence. The Arrival of the Web Society. Somerville House Publishing, Toronto (1995). 25. Dhar, V., Stein, R.: Seven Methods for Transforming Corporate Data into Business Intelligence. Prentice-Hall, New Jersey (1997). 26. Diener, E., Seligman, M.E.P.: Beyond money. Psychol. Sci. Public Interest, 5(1), 1–31 (2004). 27. Eckersley, R., Wierenga, A., Wyn, J.: Success and wellbeing: a preview of the Australia 21 report on young people’s wellbeing. Youth Stud. Australia, 25(1), 10–18 (2006). 28. Emig, C., Langer, K., Krutz, K., Link, S., Momm, C., Abeck, S.: The SOA’s Layers. 
Universitaet Karlsruhe, Karlsruhe (2006). 29. Eppler, M.J.: A comparison between concept maps, mind maps, conceptual diagrams and visual metaphors as complementary tools for knowledge construction and sharing. Inf. Visual., 5(3), 202–210 (2006).
30. Erikson, E.H. Themes of adulthood in the Freud-Jung correspondence. In: Smelser, N. J., Erickson E. H. (eds.), Themes of Work and Love in Adulthood. Harvard University Press, Cambridge (1980). 31. Gross, E.F., Juvonen, J., Gable, S.L.: Internet use and well-being in adolescence. J. Soc. Iss., 58(1), 75 (2002). 32. Gruber, T.: Toward principles for the design of ontologies used for knowledge sharing. Int. J. Hum. Comput. Stud. 43(5–6), 907–928 (1995). 33. Ha, L., James, E.L.: Interactivity reexamined: a baseline analysis of early business web sites. J. Broadcast. Electr. Media42(4), 457–474 (1998). 34. Hall, R., Newbury, D. What makes you switch on?: young people, the internet and cultural participation. In: Sefton-Green J., (ed.), Young People, Creativity and New Technologies. Routledge, London (1999). 35. Hart, R., Daiute, C., Iltus, S., Kritt, D., Rome, M., Sabo, K. Developmental theory and children’s participation in community organizations. Soc. Justice 24(3), 33–63 (1997). 36. Hartup, W.W., Neil, J.S., Paul, B.B. Friendship: development in childhood and adolescence. In: International Encyclopedia of the Social & Behavioral Sciences, pp. 5809–5812. Pergamon, Oxford (2001). 37. Heeter, C. Implications of new interactive technologies for conceptualizing communication. In: Salvaggio, J., Bryant, J. (eds.) Media Use in the Information Age: Emerging Patterns of Adoption and Consumer Use, pp. 217–235. Lawrence Erlbaum, Hillsdale (1989). 38. Heidasch, R. Get ready for the next generation of SAP business applications based on the Enterprise Service-Oriented Architecture (Enterprise SOA). SAP Prof. J. pp. 103–128 (2007). 39. Henderson, M., Argyle, M. Social support by four categories of work colleagues: relationships between activities, stress and satisfaction. J. Occup. Behav. 6(3), 229–239 (1985). 40. Hevner, A.R., March, S.T., Park, J. Design science in information systems research [Research Essay]. MIS Q. 28(1), 75–105 (2004). 41. Horrigan, J.B.: Home Broadband Adoption, Pew Internet and American Life Project, Washington, DC (2006). 42. Huffaker, D.: Spinning yarns around the digital fire: storytelling and dialogue among youth on the Internet, First Monday 9(1), 1 (2004). 43. Hughes-Hassell, S., Miller, E.T. Public library Web sites for young adults: Meeting the needs of today’s teens online. Library Inf. Sci. Res. 25(2), 143–156 (2003). 44. IETF. RFC 3968 – Uniform Resource Identifier (URI): Generic Syntax. IETF (2005). 45. Johnson, G.: A Theoretical framework for organizing the effect of the internet on cognitive development. Paper presented at the World Conference on Educational Multimedia, Hypermedia and Telecommunications 2006, Orlando, Florida (2006). 46. Kalakota, R., Robinson, M.: Services Blueprint: Roadmap for Execution (Addison-Wesley Information Technology Series). Addison-Wesley Professional, Reading (2003). 47. Kimberlee, R.H.: Why Don’t British young people vote at general elections? J. Youth Stud. 5 (1), 85–98 (2002). 48. Kitayama, S., Markus, H.R.: The pursuit of happiness and the realization of sympathy: cultural patterns of self, social relations and well-being. In: Diener, Suh, E. M. (eds.) Culture and Subjective Well-Being. MIT Press, Cambridge, (2000). 49. Kozinets, R.V.: E-tribalized marketing?: the strategic implications of virtual communities of consumption. European Manag. J. 17, 252–264 (1999). 50. Larson, R., Richards, M.H.: Daily companionship in late childhood and early adolescence: changing developmental contexts. Child Dev. 62(2), 284–300 (1991). 51. 
Lenhart, A., Madden, M.: Teen content creators and consumers. Pew Internet and American Life Project, Washington, DC (2005) 52. Lerner, R.M: Adolescent development: scientific study in the 1980s. Youth Society, 12(3), 251–275 (1981). 53. Lerner, R.M., Lerner, J.V., De Stefanis, I., Apfel, A.: Understanding developmental systems in adolescence: implications for methodological strategies, data analytic approaches and training. J. Adoles. Res., 16(1), 9–27 (2001).
54. Livingstone, S., Bober, M.: Active Participation or Just More Information? Young People’s Take Up of Opportunities to Act and Interact on the Internet. London School of Economics and Political Science, London (2004). 55. Livingstone, S., Bober, M., Helsper, E.J.: Active participation or just more information? Inf. Commun. Soc., 8(3), 287–314 (2005). 56. Markow, D.: Editorial: Our Take on It. In Trends and Tudes, Harris Interactive, New York (2003). 57. Martin, J., Nakayama, T.: Intercultural Communication in Contexts. Mayfield, Mountain View (1997). 58. McKenna, K.Y.A., Bargh, J.A.: Causes and consequences of social interaction on the Internet: a conceptual framework. Media Psychol., 1(3), 249–269 (1999). 59. Meeus, W.: Studies on identity development in adolescence: an overview of research and some new data. J. Youth Adolesc., 25(5), 569–598 (1996). 60. Mitchell, K.J., Finkelhor, D., Wolak, J.: Protecting youth online: family use of filtering and blocking software. Child Abuse Negl., 29(7), 753–765 (2005). 61. Montgomery, K., Gottlieb-Robles, B., Larson, G.O.: Youth as E-citizens: Engaging the Digital Generation. American University, Washington, DC (2004). 62. Nunamaker, J.F., Chen, M., Purdin, T.D.M.: Systems development in information systems research. J. Manag. Inf. Sys., 7(3), 89–106 (1990). 63. Perret-Clermont, A.N., Foundation, J., Resnick, L.B.: Joining society: social interaction and learning in adolescence and youth. Cambridge University Press, Cambridge (2004). 64. Petersen, A., Taylor, B.: The biological approach to adolescence: biological change and psychological adaptation. In: Adelson, J. (ed.) Handbook of Adolescent Psychology, pp. 117–155. Wiley, New York (1980). 65. Prensky, M.: Digital natives, digital immigrants. Horizon, 9(5), 1–2 (2001a). 66. Prensky, M.: Digital natives, digital immigrants, part II. Do they really think differently? Horizon, 9(6), 1 (2001b). 67. Prout, A.: Children’s participation: control and self-realisation in British late modernity. Child. Soci., 14(4), 304–315 (2000). 68. Reis, H.T., Sheldon, K.M., Gable, S.L., Roscoe, J., Ryan, R.M.: Daily well-being: the role of autonomy, competence and relatedness. Pers. Soc. Psychol. Bull., 26(4), 419–435 (2000). 69. Remschmidt, H.: Psychosocial milestones in normal puberty and adolescence. Horm. Res. Paediatr., 41(Suppl. 2), 19–29 (1994). 70. Resnick, M.D., Harris, L.J., Blum, R.W.: The impact of caring and connectedness on adolescent health and well-being. J. Paediatr. Child Health 29(s1), S3–S9 (1993). 71. Rheingold, H.: The Virtual Community: Homesteading on the Electronic Frontier. Addison-Wesley, New York (1993). 72. Roberts, D.F., Foehr, U.G., Rideout, V.: Generation M: media in the lives of 8–18 year olds. http://www.kff.org/entmedia/loader.cfm?url=/commonspot/security/getfile.cfm&PageID=51809 (2004). 73. Ryff, C.D.: Happiness is everything, or is it? Explorations on the meaning of psychological well-being. J. Personal. Soci. Psychol., 57(6), 1069–1081 (1989). 74. Sharp, J.: Digital diversions: youth culture in the age of multimedia. Gend. Educ. 12(1), 138–139 (2000). 75. Solso, R.L., MacLin, M.K., MacLin, O.H.: Cognitive Psychology, 7th edn. Allyn and Bacon, Boston, MA (2005). 76. Steuer, J.: Defining virtual reality: dimensions determining telepresence. J. Commun., 42(4), 73–93 (1992). 77. Subrahmanyam, K., Lin, G.: Adolescents on the net: Internet use and well-being. Adolescence, 42(168), 659–677 (2007). 78. Suh, E.M.: Self, the hyphen between culture and subjective well-being. 
In: Diener, E., Suh, E. M. (eds.), Culture and Subjective Well-being. MIT Press, Cambridge (2000).
79. Suh, E.M.: Culture, identity consistency and subjective well-being. J. Person. Soc. Psychol., 83(6), 1378–1391 (2002). 80. Suwanmanee, S., Benslimane, D., Thiran, P.: OWL-based approach for semantic interoperability. Paper presented at the Advanced Information Networking and Applications, 2005. AINA 2005. 19th International Conference, Taipei, Taiwan (2005). 81. Tapscott, D.: The Digital Economy. McGraw-Hill, New York (1996). 82. Tapscott, D.: Growing Up Digital: The Rise of the Net Generation. McGraw-Hill, New York (1997). 83. Terdiman, D.: What websites do to turn on teens. Retrieved 15th October 2008, from http://www.wired.com/culture/lifestyle/news/2005/02/66514 (2005). 84. Turow, J., Nir, L.: The Internet and the Family 2000: The View From Parents/the View From Kids. University of Pennsylvania, Philadelphia (2005). 85. UNYAA: Youth speak: a conversation for the future. Retrieved 26th October 2008, from http://www.unya.asn.au/youthspeak/download/YouthSpeak_Major_Concerns.pdf (2008). 86. Valaitis, R.: They don’t trust us; we’re just kids. Views about community from predominantly female inner city youth. Health Care for Women International, 23, 248–266 (2002). 87. Vdovjak, R., Houben, G.-J.: RDF-based architecture for semantic integration of heterogeneous information sources. Paper presented at the Workshop on Information Integration on the Web, Rio de Janeiro, Brazil (2001). 88. W3C: HTML 4.01 specification, W3C recommendation 24 December 1999, Frames. W3C (1999). 89. W3C: OWL Web Ontology Language Overview, W3C Recommendation 10 February 2004. W3C (2004a). 90. W3C: RDF Primer, W3C Recommendation 10 February 2004. W3C (2004b). 91. White, R., Wyn, J.: Youth and Society. Oxford University Press, New York (2004). 92. Wirth, C.B., Rifon, N.J., Larose, R., Lewis, M.L.: Promoting teenage online safety with an i-safety intervention: enhancing self-efficacy and protective behaviors. Paper presented at the Annual Meeting of the International Communication Association, Montreal, Quebec (2008). 93. Wolak, J., Mitchell, K.J., Finkelhor, D.: Escaping or connecting? Characteristics of youth who form close online relationships. J. Adolesc., 26(1), 105–119 (2003). 94. Ybarra, M.L., Mitchell, K.J.: Online aggressor/targets, aggressors and targets: a comparison of associated youth characteristics. J. Child Psychol. Psychiatr., 45(7), 1308–1316 (2004). 95. Zhou, J., Zhang, S., Zhao, H., Wang, M.: SGII: towards semantic grid-based enterprise information integration. In Grid and Cooperative Computing – GCC 2005, Beijing, China, pp. 560–565 (2005).
Chapter 13
Collaborative Environments: Accessibility and Usability for Users with Special Needs M. Mesiti, M. Ribaudo, S. Valtolina, B.R. Barricelli, P. Boccacci, and S. Dini
Abstract Accessibility and usability are relevant characteristics when considering the design and development of interactive environments especially when the target users have different forms of disability. Guaranteeing the inclusion of such characteristics poses new challenges in Web 2.0 collaborative environments because their users can have the role of both producers and consumers of information (the produsers of information). The purpose of this chapter is to specify the expected characteristics of accessible and usable collaborative systems tailored for produsers who have special needs. Hence, usability models are reviewed and the W3C guidelines and approaches for guaranteeing accessibility are reported. Then an accessible e-learning wiki-based environment is presented and, finally, a specific usability model is proposed that could overcome some of the problems that have emerged when analyzing the usability of such a collaborative environment.
13.1
Introduction
E-learning, especially in the form of Web-based training, is changing the way that schools and academia organize their education and training programs. In traditional educational institutions, Learning Management Systems are not aimed at replacing the traditional face-to-face teaching practice, but at integrating it into more efficient and effective blended-learning strategies. Nevertheless, innovation in Web 2.0 and e-learning technologies leads toward a revolution in education, encouraging the cooperation among teachers and students for the collaborative production of educational materials. This collaborative-learning approach is transforming the role of both the teachers, involving them in an innovative process of learning objects creation, and the students, who can contribute, under proper supervision, to the creation of supplementary knowledge on the basis of the available educational
materials. To describe this double role of the users who are no longer only simple consumers but also producers of knowledge, the term “produser” was coined [4]. The main idea of e-learning technologies is to offer teachers and students the control over content, learning sequence, pace of learning, time, and often media, allowing them to tailor their experiences to meet their personal learning objectives. Unfortunately, most e-learning systems omit clear and measurable objectives and strategies for supporting collaborative processes and the evolution of the teachers’ and the students’ communities in a community of e-learning content providers. Starting from the user requirements, e-learning systems should guarantee the fruition of content at multiple levels, from access to generation, to large communities of users, each of them characterized by different abilities, skills, and requirements, acting in a variety of use contexts and adopting different technologies [31]. We think this is particularly true in the educational context where the access to the software and learning material should be guaranteed to all learners, thus facilitating the inclusion of those with different abilities. We remark that users affected by disabilities like blindness and low vision, deafness and hearing loss, learning disabilities, cognitive limitations, motor difficulties, and speech difficulties are around 50 million in the European Union (10% of its population) [7]. Moreover, this need of accessibility is of great interest to international consortia, like the WWW Consortium (W3C), and national governments that have issued recommendations and laws for the formulation of accessibility guidelines in order to develop software tools useful for citizens with disabilities. Besides accessibility issues, e-learning applications have to face more general problems related to usability. Usability guidelines, initially proposed for traditional interactive systems [6, 10], have been currently enhanced to take into account the social creation of knowledge [12, 30, 33] and the users’ need to be more aware about the actions of the rest of the users in collaborative e-learning classes. For these reasons, the purpose of this chapter is to define a model for supporting the design and development of collaborative e-learning systems accessible and usable by teachers and by all students during their activities of fruition or generation of educational materials. The model, which is inspired by already existing usability models but places a strong emphasis on accessibility and collaboration, has been conceived during the development of VisualPedia [2], an accessible wikibased e-learning system that will also be introduced. The chapter starts with a presentation of assistive tools that can be used for interacting with computer systems and that should be considered when developing collaborative environments. Indeed, users with special needs can exploit one (or more) of them to interact with the collaborative environment. Then, in Sect. 13.3 an overview is given of the current usability models for the design and the development of interactive environments, and the W3C accessibility guidelines are briefly described. Once the usability models and the accessibility guidelines have been described, we present the experience we gained in designing and developing VisualPedia, an accessible and collaborative e-learning platform. 
Special emphasis will be given to the usability and collaborative characteristics of this system and to the problems that emerged during our usability analysis. Starting from the QUIM model
[27], a well-established model for the design and development of interactive systems, in Sect. 13.5 a new model specific to collaborative environments is proposed that includes some of the factors considered in the QUIM model plus other factors related to the communication process at the base of the produser activity. This model considers all the aspects relevant to a collaborative environment, including accessibility, which we handle as one of the facets of usability. Finally, Sect. 13.6 reports our conclusions and suggests directions for future work.
13.2
Disabilities and Software for Disability
Several attempts have been made in recent years to give a definition of disability. Until very recently, this definition has been determined by a medical approach [11] largely based on pity and charity toward disabled people, who should be “cured” in order to fit in society. The World Health Organization [39] has worked to overcome this definition and to spread a notion of disability based on a social approach, which led to the international classification of functioning (ICF), disability, and health model. According to this model, disability is viewed as the lack of functionalities encountered by an individual in executing a task or action.
Assistive technology (AT) [17] is a generic term that includes assistive, adaptive, and rehabilitative devices developed for people with motor and sensorial impairments that are adopted to overcome the limitations in technologies and the access to their functionalities. AT promotes greater independence by enabling these users to perform tasks that they were formerly unable to accomplish, or could accomplish only with great difficulty, by providing enhancements to or changed methods for interacting with the required technology. Thus, technology is not simply employed to realize physical devices but also to suggest approaches for performing activities that are based on technological principles and components. ATs can be classified into four categories reported in Table 13.1 together with an icon used in the paper for their identification.
Table 13.1 Category of disability from a functional point of view: visual impairments, physical impairments, hearing impairments, cognitive and learning difficulties (in the original table, each category of users is paired with the icon used to identify it).
In the remainder of the section, we first detail some characteristics of these disabilities and the main ATs that can be used to overcome the limitations in accessing computer systems. Then, ATs are classified according to their use for inserting and accessing information.
ATs for Visual Impairments. Partially sighted, low vision, legally and totally blind are terms used to describe users with increasing levels of visual impairment. Visual impairments also include issues in color recognition like daltonism or color blindness. According to the US Census [7], severe visual impairments occur at the rate of 0.06 per 1,000 individuals. Specific ATs that run on off-the-shelf computers can display the text on the screen or magnify it in different environments (e.g., word processors, Web browsers, e-mail clients). Screen magnification software enlarges text and graphics and is mainly used by partially impaired or low vision users. Screen readers with vocal synthesis read the content of a page and reproduce it (via sound card) through a
synthesizer and they are mainly used by users with severe low vision or totally blind. Braille displays can be jointly used with screen readers to reproduce the content of a page by raising and lowering different combinations of pins. In addition to these specific ATs, modern operating systems offer built-in functionalities for setting, for example, visual output, high-contrast schemes, size, and color of the mouse cursor. ATs for Physical Impairments. Motor impairment is a loss or limitation of body control. According to the US Census [7], around 18 million Americans report difficulty using their arms or hands that can affect both typing and mouse control. Users give instructions to computers through the following ATs: alternative keyboards (e.g., reduced, larger, touch-sensitive membrane) or their emulation on the screen, alternative mouses (e.g., trackballs, trackpads, and joysticks), pointing devices (e.g., head-controlled mouses and eye tracking) or touch screens, voice recognition systems, predictive word processors and programs with built-in word lists. Modern operating systems offer functionalities for mouse and input typing control. ATs for Hearing Impairments. The terms mild hearing, hard of hearing, and deaf are used to describe users with varying levels of hearing impairments. According to the US Census [7], almost 8 million Americans over the age of 14 have difficulty in hearing a conversation, even when wearing a hearing aid. ATs allow auditory cues and multimedia information to be identified (sounds, music, audio feedback) and can be considered accommodation techniques, i.e., techniques supporting information exchange through visual channels (e.g., flashing lights). They can be used to convey complex auditory information in a text form or through other communication modes (e.g., sign language, fingerspelling, and cued speech). ATs for Cognitive and Learning Difficulties. Cognitive impairment is any limitation in the ability of thinking or reasoning that affects the capacity to perform a task. A person with cognitive impairment can have problems with memory, language, or other mental functions. According to the US Census, 14.3 million Americans aged 15 or over have a cognitive or emotion-related mental disability.
For this kind of impairment, there are no specific ATs, but some tools developed for other impairments can be adapted to the specific cognitive or learning difficulty. For example, screen readers can help dyslexic users to monitor their own work by allowing them to listen to what they have written. Word prediction facilities and speech recognition systems can help users with dysgraphia to avoid typing mistakes and storage of incorrect spellings.
Assistive Tools for Inserting and Accessing Information. Depending on their specific use, ATs can be classified as INPUT ATs, when used to produce information, and OUTPUT ATs, when used to access information. These classifications stress the usefulness of ATs across different forms of disability and thus the possibility of utilizing the same AT for different kinds of disability. It is also important to observe that it is not rare for users to be affected by multiple disabilities; therefore, ATs of different types can be combined to help them to access/produce information. Table 13.2 presents ATs divided according to this classification, and the associated icons underline that any effort to overcome a difficulty can support people with different kinds of disability.
Table 13.2 Difficulties and ATs (Action – Difficulty – AT and main accommodation techniques)
INPUT (Managing) – Keyboard typing: alternative keyboards (e.g., reduced, larger, touch-sensitive membrane), keyboard emulation; voice recognition systems; predictive word processors and programs with built-in word lists; keyboard accessibility options in the operating system.
INPUT (Managing) – Pointing device use: mouse alternatives such as trackball, track pad and joystick, touch pad, head-controlled mouse and eye tracking, touch screen, mouse emulator; mouse accessibility options in the operating system (e.g., to adjust the mouse pointer).
OUTPUT (Understanding) – Reading (texts): screen readers with vocal synthesis; screen readers with Braille display; screen magnification software; display accessibility options in the operating system.
OUTPUT (Understanding) – Listening (sounds and speech): accommodation techniques such as visual support, captioning for additional text or information, LIS (sign language), fingerspelling and cued speech; audio accessibility options in the operating system.
OUTPUT (Understanding) – Watching (images and videos): alternative descriptions, image processing (edge detection, enhancement).
The icons in the original table stress that any effort to overcome a difficulty can support people with different kinds of disability.
13.3
State of the Art in Usability Models and Accessibility
Usability is related to how well a software system enables a group of users to achieve specific goals in a specific use context. By contrast, accessibility describes the degree to which a product, device, or service can be used by as many people as possible. The purpose of this section is first to present the current models that should be considered when designing and developing collaborative environments, and then to present the W3C guidelines for making interactive environments accessible to all produsers, including those with special needs.
13.3.1 Usability in Traditional Interactive Environments
The success of an interactive system depends on a number of different factors, including functionality, performance, cost, reliability, maintenance, and usability. All these factors are of roughly equal importance, and a serious failure in any one of them can cause a failure of the system as a whole. Several studies [18, 21] have highlighted a growing recognition that the success of an interactive system depends to a significant degree on its usability. The ISO 9241-11 standard [15] provides the following usability definition: “the extent to which a product can be used by specified users to achieve specified goals with effectiveness, efficiency and satisfaction in a specified context of use”. According to this definition, the design process of usable interactive systems cannot be based only on a designer’s personal experience and intuition. The literature offers many models and standards containing principles for good interface design that can provide helpful guidelines for designers. Examples of standards include ISO/IEC 9126-1 [16], which identifies usability as one of six different software quality attributes. Moreover, several models proposing different sets of usability attributes that can be used to evaluate or to support the design of interactive environments have been defined in the literature (e.g., [22, 26, 28, 29]). For example, Nielsen [22] proposes a set of ten heuristics (see Table 13.3) related to efficiency in use, learnability, safety strategies to prevent errors and satisfaction. By contrast, Shneiderman [29] and Preece [26] describe the same issues using different parameters (i.e., speed of performance or throughput instead of efficiency in use). The problem is that these standards and models suggest similar usability attributes using different terminology, with the result of increasing confusion in the activities of designers and developers alike. In order to address this problem, Seffah in [27] defines a consolidated model for usability measurement, redefined and extended by Holzinger in [14], with consistent terms for usability attributes and metrics. The model, named quality in use integrated map (QUIM) [27], is composed of eleven factors representing different facets of usability that are based on the main usability guidelines expressed in standards and models presented in the literature.
Table 13.3 Nielsen's heuristic principles
1. Visibility of system status
2. Match between the system and the real world
3. User control and freedom
4. Consistency and standards
5. Recognition rather than recall
6. Flexibility and efficiency of use
7. Display only needed information
8. Error prevention
9. Help users recognize, diagnose and recover from errors
10. Help and documentation
The model refines these factors and expresses them as criteria and then maps the criteria into usability metrics. The factors describe different facets of usability by means of user-oriented attributes identified in existing standards and models. The eleven factors are: efficiency, effectiveness, productivity, satisfaction, learnability, safety, trustfulness, accessibility, universality, usefulness, and acceptability. The criteria have subfactors that are directly measurable via a set of metrics. Metrics are numeric values that summarize the status of specific user interface attributes. They are faster, cheaper, and less ambiguous than other usability evaluation attributes [23], but their context of use can be influenced by a specific category of users, tasks, or environments. The QUIM model does not cover all models proposed in the literature. For example, the Polillo quality model for Web sites [25] is not taken into account even if the usability factors of the two models are quite similar. The Polillo quality model considers some aspects of Web design activities (like architecture, content, or management characteristics) that can also be described using QUIM factors or criteria. For instance, the architecture characteristics deal with issues related to the navigability or the consistency of Web sites, and the QUIM model also offers specific criteria to represent them. The consolidated QUIM model provides design principles able to bring the designer’s attention to issues that should be considered in order to develop successful interactive applications. Current interactive systems have to support powerful functionalities, but also a simple and clear interface; they have to be easy to use and to learn; they have to be flexible but also provide good error handling. Under this perspective, a consolidated model for detecting and assessing usability factors can be employed as a guideline for designing interactive systems because it consolidates the experiences, points of view, and studies proposed within several Human–Computer Interaction and Software Engineering communities [27].
13.3.2 Accessibility
Accessibility is particularly important in Web-based collaborative environments since accessible user interfaces allow users with disabilities to take an active part in the process of knowledge creation.
Since 1999 the W3C has proposed different guidelines for developing accessible Web environments and at least two approaches have been identified to guarantee their accessibility. The first one is rather technical and defines guidelines and checklists to be satisfied in order to accomplish the accessibility of user interfaces; the second one is more user centered and also considers the user experience while browsing the Web. In the following, we will briefly introduce the initiatives promoted by the W3C to cope with Web accessibility. Accessibility 1.0. European national laws such as the Italian Law n. 4/2004 on Web Accessibility, the German Federal Ordinance on Barrier-Free Information Technology, and the French General Reference for Accessibility of Administrations are based on the technical approach providing sets of technical requirements to be checked in the spirit of the guidelines proposed by the W3C: the Web Content Accessibility Guidelines (WCAG 1.0) [34], the Authoring Tools Accessibility Guidelines (ATAG 1.0) [35], the User Agent Accessibility Guidelines (UAAG 1.0) [36]. These guidelines explain how to make Web content accessible and are intended for all Web content developers (page authors and site designers) and for developers of authoring tools. Consider for example WCAG 1.0. The first guideline explains how to introduce accessible images by providing equivalent alternatives to nontextual content. The second one emphasizes that content, error messages, or possible actions should not rely only on colors, so that visually impaired or color-blind users can also easily understand them. Moreover, clear navigation mechanisms (guideline 13) and clear and simple documents (guideline 14) are fundamental for people with visual or cognitive impairments. WCAG 1.0 has been acknowledged by different governments in Europe and the USA1 because of their relevance in providing the same right to all citizens to access information. Moreover, many software tools [9] have been developed to be compliant with the technical indications they contain, which are mostly based on HTML markup and proper use of CSS properties. The guidelines contained in ATAG 1.0 are specific for Web authoring tool developers. Their purpose is “twofold: to assist developers in designing authoring tools that produce accessible Web content and to assist developers in creating an accessible authoring interface” [35]. Specifically, authoring tools should automatically generate standard markup (guideline 2) and support the creation of accessible content (guideline 3) facilitating for instance the insertion of text alternatives. A peculiarity of these guidelines is the number 7, which also requires that authoring interfaces should be accessible. This is particularly relevant in the collaborative context since users with a disability are encouraged to become producers and contributors of new knowledge. Accessibility 2.0. The W3C guidelines presented so far have been a reference for many years. However, the technical approach on which they are based has been deeply criticized by experts [3, 19]. Indeed, a strict adherence to technical
1 An example is Section 508, http://www.section508.gov/
constraints could be counterproductive since it does not necessarily entail an improvement in the overall browsing experience for people with disabilities. Passing the tests of automatic tools for checking accessibility does not guarantee that the Web site will also be usable. Moreover, new challenges have emerged from the introduction of Web 2.0 facilities that are not covered by WCAG 1.0/ATAG 1.0. Collaborative environments in which the content is continuously extended through successive refinements are completely different from Web 1.0 environments, where Web site owners are the only ones responsible for the published content. These challenges are more user oriented than technical and require that developers consider usability aspects within the technical guidelines. To address these issues, in December 2008 a new version of the W3C guidelines was made available with the publication of the W3C Recommendation of WCAG 2.0 [37]. As for the guidelines for authoring tools, a working draft (ATAG 2.0 [38]) was published in October 2009. The approach of WCAG 2.0 is different from the one adopted in WCAG 1.0. The proposal is neither HTML specific nor CSS specific, and the intended audience is wider, including, on the one hand, the target users (i.e., people with different forms of disability) and, on the other hand, Web content producers such as Web designers and developers, but also teachers and students. For each proposed guideline, success criteria are provided and (like in WCAG 1.0) three levels of conformance are defined: A (lowest), AA (middle), and AAA (highest). For each guideline and success criterion, sufficient and advisory techniques for meeting the success criteria have been defined. The guidelines are classified under four POUR principles [37]:
1. Perceivable, i.e., users must be able to sense the Web content, which cannot be invisible to all of their senses.
2. Operable, i.e., users must be able to use the interface, which cannot require interaction that they are not able to perform.
3. Understandable, i.e., users must be able to understand the information, as well as the operation of the user interface.
4. Robust, i.e., users must be able to access the content as technologies advance.
The WCAG 2.0 guidelines are shown in Table 13.4, together with indications of the form of disability they might address; their numbering suggests the principle with which they are associated. Further details on WCAG 2.0 can be found online [37]. ATAG 2.0 splits the guidelines for authoring tools into two parts: Part A, promoting their accessibility, and Part B, supporting the automatic production of accessible Web content for end users. Guideline 7 of ATAG 1.0 has been enhanced with a dedicated section for the development of accessible authoring tools including HTML editors, software for converting to Web content technologies, collaborative software for updating portions of Web pages (e.g., blogs, wikis, online forums), content management systems, and many others. It is particularly important to guarantee access to user-generated content environments to people with disabilities. Further details of ATAG 2.0 can be found in the official online documentation [38].
Table 13.4 WCAG 2.0 guidelines
1.1 Text Alternatives: Provide text alternatives for any nontext content so that it can be changed into other forms people need, such as large print, braille, speech, symbols or simpler language.
1.2 Time-based Media: Provide alternatives for time-based media.
1.3 Adaptable: Create content that can be presented in different ways (for example simpler layout) without losing information or structure.
1.4 Distinguishable: Make it easier for users to see and hear content including separating foreground from background.
2.1 Keyboard Accessible: Make all functionality available from a keyboard.
2.2 Enough Time: Provide users enough time to read and use content.
2.3 Seizures: Do not design content in a way that is known to cause seizures.
2.4 Navigable: Provide ways to help users navigate, find content, and determine where they are.
3.1 Readable: Make text content readable and understandable.
3.2 Predictable: Make Web pages appear and operate in predictable ways.
3.3 Input Assistance: Help users avoid and correct mistakes.
4.1 Compatible: Maximize compatibility with current and future user agents, including assistive technologies.
13.4
The VisualPedia Experience
VisualPedia [2] is a wiki-based collaborative system developed in the e-learning context. It extends Mediawiki and offers a repository of e-learning units, named educational objects, that can be shown to users (students) according to their profiles and are accessible with respect to the WCAG guidelines. To the best of our knowledge, VisualPedia is one of the first wiki-based environments that considers the issue of accessibility in its development. Indeed, as clearly outlined in [32], despite the increasing success of wikis, not much work has been done on the development of accessibility tools. The authors of this paper point out that, in order to guarantee accessibility, the following issues should be addressed. First, a standardization of wiki markup languages is needed, since each developed system tries to impose its own dialect. Moreover, even if many WYSIWYG editors have been provided to help authors produce content, these are rarely accessible. Finally, wiki pages are generally offered without any form of personalization, either in their content or in their visual presentation. After a brief presentation of the educational objects stored in VisualPedia, we will discuss the main features of the system and the accessibility issues that have been taken into account during its development.
13.4.1 Educational Objects
Educational objects are multimedia objects describing notions represented through different media – text, image, sound – that can be described at three levels of complexity: complete, summary, and simplified. Figure 13.1 shows an example of these three levels for an educational object describing a leaf. For the leaf, three descriptions (taken from Wikipedia) are given, each one providing a different level of detail. Specifically, the simplified description preserves the essential concepts, expressed with a simpler sentence structure and a controlled vocabulary, and is intended for students with cognitive difficulties. Notice that simplified descriptions can also be useful for students who are learning a foreign language and possess a temporarily reduced vocabulary. With this design choice, we have addressed guideline 3.1 of WCAG 2.0, and especially success criterion 3.1.5 Reading Level: “When text requires reading ability more advanced than the lower secondary education level after removal of proper names and titles, supplemental content, or a version that does not require reading ability more advanced than the lower secondary education level, is available”.
1. Complete. In botany, a leaf is an above-ground plant organ specialized for photosynthesis. For this purpose, a leaf is typically flat (laminar) and thin. There is continued debate about whether the flatness of leaves evolved to expose the chloroplasts to more light or to increase the absorption of carbon dioxide. In either case, the adaptation was made at the expense of water loss. Leaves are also the sites in most plants where transpiration and guttation take place. Leaves can store food and water, and are modified in some plants for other purposes. The comparable structures of ferns are correctly referred to as fronds. Furthermore, leaves are prominent in the human diet as leaf vegetables.
2. Summary. A leaf is a plant organ specialized for photosynthesis and is typically flat and thin. Leaves are also the sites in most plants where transpiration and guttation take place. Leaves are prominent in the human diet as leaf vegetables.
3. Simplified. If you look at the tree in the park, it is full of flat leaves that can be of different colours depending on the season. The tree uses them for many purposes, for example for converting carbon dioxide into sugars using the energy from sunlight. There are types of leaves that humans can eat.
Fig. 13.1 Three different textual descriptions of the leaf educational object
Besides textual descriptions, audio descriptions can be included. This alternative can be useful when the screen reader cannot reach the same quality as human reading (e.g., for reading a poem in ancient Greek).
An educational object is stored within VisualPedia as an XML document that replaces the textual content contained in the text attribute of the Mediawiki page table. This XML document contains the descriptions of the object at the three levels of complexity plus the bibliography reference and optional links to vocal descriptions. Images and sounds are stored in the VisualPedia Image table. Tags are introduced in the XML document to distinguish the role (an object or part of a description) of the image/sound within the educational object. Moreover, in the Image table, the values of the longDesc and alt attributes required for image accessibility are also added. Figure 13.2 reports a part of the XML document corresponding to the leaf educational object.
Fig. 13.2 XML representation of the leaf educational object
Teachers can add new objects in VisualPedia thanks to an extended user interface. Descriptions that may be incomplete can be updated and extended by collaborating colleagues. Links among educational objects are provided through the Mediawiki markup language (even if it is not a standard, it is one of the most widely used and popular). Students can access views of the educational objects tailored according to the preferences specified in their profiles, as discussed in the following paragraphs.
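As an illustration of how a stored educational object might be consumed at page-generation time, the following sketch selects the textual description matching a student's profile with a standard XPath query (Java, JAXP). The element and attribute names (object, description, level) are hypothetical and need not coincide with the actual VisualPedia markup shown in Fig. 13.2.

    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.xpath.XPath;
    import javax.xml.xpath.XPathFactory;
    import org.w3c.dom.Document;
    import java.io.File;

    public class DescriptionSelector {
        // level is "complete", "summary" or "simplified", taken from the user profile.
        public static String descriptionFor(File educationalObject, String level) throws Exception {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().parse(educationalObject);
            XPath xpath = XPathFactory.newInstance().newXPath();
            // Hypothetical structure: <object><description level="summary">...</description>...</object>
            return xpath.evaluate("/object/description[@level='" + level + "']", doc);
        }
    }

Such a selection step is one way in which the profile-based tailoring discussed in the next subsection could be realised.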
13.4.2 VisualPedia Features
Personalization. In addition to the information already collected by Mediawiki at user registration, new Web forms have been added to obtain the extra data necessary for personalization. The parameters relate to fonts (color and size of the characters) and to images (line thickness and color scheme). Vertical and horizontal bars can be shown beside the text to direct the eyes of users with visual and cognitive disabilities. Preferences are adopted to customize the look and feel and the content of pages. The personalized visualization is obtained by creating, for each user, an inline CSS fragment that is added to the CSS files of the selected Mediawiki skin. By introducing this feature within VisualPedia, we have addressed guideline 1.3, since the content is adapted according to the user profile.
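The following minimal sketch illustrates the idea of deriving an inline CSS fragment from a user profile. It is written in Java for uniformity with the other examples (VisualPedia itself extends Mediawiki, which is PHP-based), and the profile parameters, the CSS selector and the property values are illustrative assumptions rather than the actual VisualPedia code.

    public class ProfileCss {
        // Builds the inline CSS fragment appended to the CSS of the selected Mediawiki skin.
        // The #content selector and the chosen properties are illustrative.
        public static String cssFor(String fontColor, int fontSizePx, String backgroundColor) {
            return "#content {"
                 + " color: " + fontColor + ";"
                 + " background-color: " + backgroundColor + ";"
                 + " font-size: " + fontSizePx + "px;"
                 + " }";
        }
    }

For example, cssFor("#000000", 20, "#FFFF99") would yield a high-contrast, large-print fragment suitable for a low-vision profile.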
Image simplification. Images can be delivered in a simplified form that tries to mask many irrelevant details: such details are discarded without even being noticed by users without disabilities, whereas capturing them wastes much effort for visually impaired ones. Moreover, the use of many bright colors, intended to capture attention, can be perceived as confusing by disabled users, who would most appreciate a line drawing with a strongly contrasting background. Therefore, simplification algorithms, based on the Canny method [5], have been developed so that any color or gray-scale image can be converted, automatically and relying on the user profile, into another, simplified one. Fig. 13.3 provides a few examples of the simplification of a leaf image obtained through the algorithm. Since providing these alternative images can be time consuming, their storage has been coupled with the storage of a corresponding matrix representing the edges contained within. This matrix is exploited in the Canny algorithm when image simplification is required, thereby reducing the computational effort. By introducing this feature within VisualPedia, we have addressed guidelines 1.3 and 1.4 of WCAG 2.0.
Fig. 13.3 Simplification of a leaf image for different user profiles
Retrieval of educational objects. Retrieval facilities that take the user’s profile into account to rank and return educational objects have been introduced. For each educational object, the available description matching the user profile is returned, sorted according to the resultant ranking. The snippet associated with each link also contains information on the classification of the educational object, i.e., the subject it belongs to and the school level for which the educational object is most suitable. In addition to the retrieval facility, educational objects in VisualPedia can be accessed by following proper links available from the initial pages of the system. Through this feature, we have addressed guideline 2.4 of WCAG 2.0 for navigation on the Web site (Table 13.4).
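The simplification step itself can be approximated with any standard implementation of the Canny edge detector. The sketch below uses the OpenCV Java bindings and is only meant to convey the idea: the actual VisualPedia algorithm additionally adapts the conversion to the user profile and caches the edge matrix as described above, and the blur kernel and thresholds shown here are illustrative values.

    import org.opencv.core.Core;
    import org.opencv.core.Mat;
    import org.opencv.core.Size;
    import org.opencv.imgcodecs.Imgcodecs;
    import org.opencv.imgproc.Imgproc;

    public class ImageSimplifier {
        static { System.loadLibrary(Core.NATIVE_LIBRARY_NAME); }

        // Converts a colour or gray-scale image into a line drawing on a contrasting background.
        public static void simplify(String inputPath, String outputPath) {
            Mat source = Imgcodecs.imread(inputPath, Imgcodecs.IMREAD_GRAYSCALE);
            Mat edges = new Mat();
            // Blur first to suppress irrelevant details, then detect edges.
            Imgproc.GaussianBlur(source, edges, new Size(5, 5), 0);
            Imgproc.Canny(edges, edges, 50, 150);
            Imgcodecs.imwrite(outputPath, edges);
        }
    }

In a profile-aware version, the blur kernel size and the two Canny thresholds would be chosen according to the simplification level recorded in the user profile.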
13.4.3 VisualPedia's Usability and Accessibility Characteristics
The strategy applied in the design and development of VisualPedia follows the star life-cycle model [13], in that incremental prototyping techniques are applied and evaluations of the prototype are performed at each incremental step. Specifically, in the case of the collaborative design and development of software, the cycle becomes an evolving and never-ending process, in that the use of the prototypes by the users suggests new ways of use and therefore leads to redesign [1]. In this framework, at each step of development, the prototypes are evaluated by discount evaluation techniques [20], by experiments, and by informal meetings with novice users. During these meetings, users are asked to have a look at the current prototype and to use it. Usability problems are identified by observing their activities, and direct feedback is collected through a final interview.
Fig. 13.4 Distribution of the violated Nielsen principles
After a first development step of VisualPedia, usability and accessibility analyses using methods from the cognitive engineering domain were performed. One of these was a heuristic analysis with the ten Nielsen principles (see Table 13.3) that was performed by five evaluators with expertise in computer science. The analysis results highlighted the existence of several usability issues, depicting VisualPedia as an unusable system. Examples of the detected issues are related to the presentation of content in the pages, the navigation among the links, the lack of help pages, and the poor rendering quality of the images. Figure 13.4 shows how many times the ten Nielsen principles have been violated. Technical accessibility tests were performed by a group of experts. They used CSE Validator [9] to check the HTML and CSS code of VisualPedia and highlighted some technical mistakes that have been corrected. Apart from minor mistakes, the technical accessibility of the pages was generally good. Unfortunately, as we know from the literature, conformance to the technical accessibility guidelines does not necessarily mean that a Web site is usable or that “Web sites that achieve higher conformance to WCAG are also more usable by people with disabilities…” [24]. Moreover, the real problem is shifted to the production and upload of the educational objects: data entry is indeed the critical phase that could undermine the development of an accessible framework. A further evaluation phase was conducted by performing an experiment with the final users. These were people with different types of disability (especially visual and cognitive difficulties), different ages, different education (ranging from primary to high schools), and different skills in the use of the computer and/or assistive tools. The users were asked to perform three simple tasks: (1) login to VisualPedia, (2) browse educational objects relying on their profiles, and (3) search educational objects using the internal retrieval facility. The majority of the users performed their tasks successfully in a completely independent way. A few of them needed more time and/or some help in performing the tasks due to their inexperience when using
the computer. Those users affected by vision impairment obtained different results in the use of VisualPedia. Low vision users did not encounter any problem, while blind users experienced trouble using the screen reader because of the presence of extra information within VisualPedia pages. The results of the experiment suggest that VisualPedia is generally easy to use and potentially useful for educational purposes. Figure 13.5 illustrates, for each of the three tasks the users were asked to perform, the number of people who found it easy, needed training, or found it difficult to use.
Fig. 13.5 Evaluation of final users
The comparison between the results obtained by the two usability evaluation phases highlights a contradiction: even if VisualPedia proved generally easy to use for users with disabilities, the experts found severe usability issues. Thus, the experiment with final users pointed out that a useful system such as VisualPedia can achieve high acceptability among its users despite the presence of usability problems. The VisualPedia case is an example that highlights how often, in analyzing collaborative systems, the results obtained by applying cognitive engineering methods of usability evaluation are very different from those obtained by experiments with the final users. Therefore, in the next section, we propose a model for the evaluation of collaborative systems that considers not only the aspects taken into account by cognitive engineering, but also other aspects related to the accessibility, usefulness, and acceptability of the system for its final users.
13.5
A Model for Collaborative Environments
In this section, we focus on the usability factors that should be considered when designing and developing collaborative e-learning environments such as VisualPedia. The experience gained during the usability analysis of VisualPedia has shown how an unusable system, according to a heuristic evaluation performed by
experts, ended being a system that was accepted and useful for end users. This result is a well-known problem related to usability evaluation methods based on the cognitive engineering strategies [8, 30, 33]. Several authors claim that conventional usability evaluation methods like heuristic evaluation [22] and even exploratory methods like the cognitive walkthrough [6] are not appropriate for analyzing Web 2.0 or social networking sites. These studies argue that the results of the adopted inspection methods do not reflect the users’ opinions because, as also verified in our case study, despite these methods pointing out unusable systems, other experiments with end users have highlighted enthusiasms and usefulness in using them. For example, in [30] the authors found that YouTube failed when tested using heuristic evaluation although it is one of the most popular Web environments. Therefore, in some situations “usability evaluations, wrongfully applied, can quash potentially valuable ideas early in the design process, incorrectly promote poor ideas, misdirect developers into solving minor instead of major problems or ignore (or incorrectly suggest) how a design would be adopted and used in everyday practice” [12]. The innovative ideas which are at the basis of some interactive environments, especially in novel collaborative contexts, must not be discouraged by some negative results of usability evaluations, but they must rely on the enthusiasm of users demonstrated during testing. With these considerations in mind, this section aims to define a new usability method able to extend classic usability inspection analysis in order to take into consideration issues related to social network systems and systems devoted to a wider audience including users with special needs. As already pointed out in Sect. 13.3.1, even if several standards and models were proposed to study the usability facets of an interactive system, the absence of a unique and consolidated set of guidelines has led us to start with the QUIM model and to extend it in order to take into account guidelines supporting designers and developers in designing and implementing collaborative environments accessible and usable also by users with special needs. The first step for reaching our goal involved redefining some usability QUIM factors specifically important for our aim. In particular, we focused on specific QUIM factors such as productivity, learnability, accessibility, universality, acceptability, and usefulness. The first two factors, productivity and learnability, can give useful indications in designing environments in which the user is an active producer of knowledge. By contrast, the succeeding three factors provide useful indications for the possibility to design interactive environments for all. Finally, the last factor, the usefulness, is used to measure how much an easy-to-use system is also relevant for actual users’ needs. A second step in the development of our model was devoted to extend the QUIM model to consider factors for improving the communicability in interactive systems. This extension relies on studies carried out in [8, 33] in which the authors proposed to extend the classic heuristic evaluation model using a set of new heuristics to describe aspects related to the design of social environments developed adopting Web 2.0 technologies. Starting from these studies we propose to integrate the QUIM model with a new factor related to the communicability of
interactive systems. A detailed description of these factors follows; for each factor a description of the associated criteria for their evaluation is also proposed. The productivity and learnability factors. The aim of the productivity factor is to assess the level of effectiveness achieved in relation to the resources consumed by the user and the system within a specific context of use. This factor is very similar to the efficiency and effectiveness factors but the difference is that it focuses on the number of useful outputs without considering unproductive actions.2 The learnability factor, instead, studies the extent to which the features of an interactive environment are easily understood and managed within a specific context of use. Table 13.5 reports a list of usability factors and related criteria. For example, the productivity criteria are: time behavior, resource utilization, and loading time. Several measures have been proposed to evaluate the criteria of each factor. For example, for the time behavior criterion, the model proposes to calculate the time spent on performing each productive task; for the loading time criterion it
Table 13.5 The relations among criteria and usability factors. The factors (columns) are Productivity, Learnability, Accessibility, Universality, Acceptability, and Usefulness; the criteria (rows) are Accuracy, Appropriateness, Attractiveness, Consistency, Controllability, Dependability, Discretion, Familiarity, Fault tolerance, Flexibility, Loading time, Minimal action, Minimal memory load, Navigability, Nonobtrusiveness, Operability, Privacy, Readability, Resource utilization, Safety, Security, Self-descriptiveness, Simplicity, Time behavior, Trustworthiness, Understandability, and User guidance. Each cell marks whether a criterion contributes to the evaluation of the corresponding factor.
2 Unproductive actions mean actions that do not directly or indirectly contribute to the task output (i.e. help actions, search actions, undo actions).
proposes metrics that measure either the time taken to load a Web page or the time span between the moment a user clicks on a selection and the moment a new element appears on the screen. For the metrics related to each criterion see [23]. The accessibility factor. This factor captures the capability of an interactive environment to be used by people with special needs. The QUIM model, and the one we present here, considers accessibility as one facet of the usability problem and proposes a set of criteria to address accessibility issues for different types of disability. We remark that the literature offers other interpretations of the relationship between usability and accessibility: the two aspects have been presented as nonintersecting, as intersecting, or with one as a subproblem of the other (see, for example, [24] for a discussion of their mutual relationship). Our position, however, is that accessibility is one of the facets of usability. To evaluate user interface problems specifically related to accessibility, the QUIM model proposes criteria that address issues faced by users affected by the forms of disability discussed in Sect. 13.2. We consider accessibility to be of great importance in the context of collaborative e-learning systems. Our proposal adopts the QUIM accessibility factor in order to provide measures of performance and satisfaction for all possible users, according to the international regulations defined by WAI. Table 13.6 extends the information on the accessibility factor of Table 13.5 by associating WCAG 2.0 with the QUIM criteria, giving an idea of the close interconnection between the two models. A letter A in the WCAG 2.0 column indicates that a QUIM criterion has a counterpart in the WCAG 2.0 accessibility guidelines, possibly expressed with slightly different terminology (criteria that influence neither usability nor accessibility are omitted). This confirms our assumption that accessibility should be considered a facet of the usability problem. The universality and acceptability factors. The universality factor covers the support an interactive environment offers for the activities of users with different profiles, cultures, contexts of use, and access technologies, whereas the acceptability factor quantifies the extent to which an environment is comfortable to use. The criteria related to these factors are reported in Table 13.5 and detailed in [23].
Table 13.6 The relations among QUIM accessibility criteria and WCAG 2.0. The criteria considered are Consistency, Controllability, Flexibility, Minimal action, Minimal memory load, Navigability, Operability, Readability, Self-descriptiveness, Simplicity, Time behavior, Understandability, and User guidance; for each criterion, one column marks whether it belongs to the QUIM accessibility factor and another marks whether it has a counterpart in the WCAG 2.0 guidelines.
The usefulness factor. The main goal of usefulness is to capture how well users accept an environment, considering that, in some cases, focusing on usability alone can be harmful [12]. To address this problem, we take the QUIM usefulness factor as one of the most important factors at the base of our model. It concerns improving the acceptability of the system in terms of quality in use, i.e., how well the interaction style fits the user’s activities and mental model (e.g., her/his knowledge, skills, profile, and activity procedures). The last column of Table 13.5 shows a set of criteria for this factor. The communication factor. A shortcoming of the usefulness factor is that it does not consider the specific problems arising from the communication process among produsers who collaborate through the same system, nor the active role that users play in creating, sharing, and transferring knowledge. These considerations require us to extend our model with a communication factor. This factor is new with respect to the QUIM model and is introduced because communication is fundamental in the context of collaborative environments. It entails heuristics that focus on communication issues, in order to support the appropriate exchange of verbal and gestural messages and to enable the efficient sharing of artifacts, data protection, collaboration, and coordination of the users’ actions. Table 13.7 shows a set of criteria for this factor. These criteria, except the first, are already used in the context of other factors but are also relevant to communication. The first criterion, collaborability, is introduced to measure the collaborative degree of the environment: it assesses how well a set of group activities achieves the team’s targeted objectives with respect to the individual effort or contribution of each team member. The metrics underlying this criterion measure the capability of an interactive system to support:
- The feeling of virtual presence
- The creation of a common awareness without fearing that personal ideas will be diminished
- The possibility of asking questions for clarification or more details, or of commenting on other contributions
- The possibility of sharing and transferring multimedia objects (keeping a reference to whoever has posted the material)
- The protection of data and personal information
- The coordination of users’ actions and contributions
Comparing Tables 13.6 and 13.7, it is easy to identify some overlaps between the communication and the accessibility factors. These overlaps are justified by the need to guarantee accessible content after the collaborative process has been carried out by the e-learning providers who act as produsers. Teachers, creating new learning objects in collaboration with the rest of the users, have to produce educational material in accordance with the usability factors of our model, paying particular attention to accessibility issues so that access is also ensured for users with special needs.
Table 13.7 Criteria for the communication factor: Collaborability, Controllability, Loading time, Navigability, Operability, Privacy, Resource utilization, Safety, Time behavior, and User guidance.
13.6 Conclusions and Future Work
In this chapter, we addressed the issue of developing collaborative e-learning systems that are accessible and usable by users with different profiles, including those with special needs. We started by presenting the assistive tools that can help users access and insert educational materials in collaborative e-learning systems. We then gave an overview of current usability models for interactive systems, along with the accessibility guidelines developed by the W3C. We reported the experience gained during the development of VisualPedia, a collaborative environment in which users with different abilities can cooperate to create, share, transfer, and use learning objects. This system was conceived by adopting a life-cycle model based on an iterative usability evaluation process. According to this model, VisualPedia was evaluated in its initial prototyping phase by means of heuristic analysis and experiments with end users. These evaluations highlighted some usability problems but also a high degree of acceptance by users with disabilities. Further steps are required to improve the accessibility and usability of VisualPedia, and therefore a usability model has been defined starting from studies on accessibility and usability carried out by research institutions, international organizations, and national governments. This model stems from the consolidated QUIM model and extends it to consider properties related to collaboration and the specific accessibility facets contained in the W3C guidelines. The model can be used in the design process of collaborative environments to evaluate usability and accessibility facets from different points of view, with particular emphasis on the communication and accessibility factors. As future work, we plan to design and develop a new version of the VisualPedia system by addressing the usability issues arising from the heuristic evaluation presented in this chapter and by exploiting the proposed usability model. Moreover, we wish to develop an articulated set of metrics for the evaluation of the suggested usability factors, metrics that can be personalized with respect to the specific contexts of use (user profiles and access media). These metrics will be integrated in a Web application for the quantitative evaluation of the usability and accessibility of collaborative systems. Finally, we wish to further investigate the connection between usability and the W3C accessibility guidelines in order to enhance the proposed usability model. Acknowledgment This work has been partially supported by a grant from “Ministero della Pubblica Istruzione” within the project “Nuove Tecnologie e Disabilità. Azione 6: Progetti di ricerca per l’innovazione”.
References
1. Bianchi, A., Bottoni, P., Mussio, P.: Issues in design and implementation of multimedia software systems. In: Proceedings of IEEE International Conference on Multimedia Computing and Systems, pp. 91–96 (1999)
2. Boccacci, P., Ribaudo, M., Mesiti, M.: A collaborative environment for the design of accessible educational objects. In: IEEE/WIC/ACM International Conference (2009)
3. Brian, K., Sloan, D., Brown, S., Seale, J., Petrie, H., Lauke, P., Ball, S.: Accessibility 2.0: People, policies and processes. In: W4A’07: Proceedings of International Conference on Web Accessibility, pp. 138–147. ACM, New York, NY (2007)
4. Bruns, A.: Blogs, Wikipedia, Second Life, and Beyond: From Production to Produsage. Peter Lang Publishing, New York (2008)
5. Canny, J.: A computational approach to edge detection. IEEE Trans. Pattern Anal. Mach. Intell. 1(8) (1986)
6. Cathleen, W., John, R., Clayton, L., Peter, P.: The cognitive walkthrough method: A practitioner’s guide. In: Nielsen, J., Mack, R.L. (eds.) Usability Inspection Methods, pp. 105–140. John Wiley, New York (1994)
7. CENSUS: www.census.gov/prod/2001pubs/p70-73.pdf (2001)
8. Chounta, I., Avouris, N.: Heuristic evaluation of an argumentation tool used by communities of practice. In: EC-TEL 2009 Workshop: TEL-CoPs’09: 3rd International Workshop on Building Technology Enhanced Learning Solutions for Communities of Practice (2009)
9. CSE HTML Validator: www.htmlvalidator.com/
10. Dix, A., Finlay, J., Abowd, G., Beale, R.: Human Computer Interaction, vol. 3. Prentice Hall, Englewood Cliffs, NJ (2003)
11. EDF: www.edf-feph.org
12. Greenberg, S., Buxton, W.: Usability evaluation considered harmful (some of the time). In: CHI, pp. 111–120 (2008)
13. Hix, D., Hartson, H.: Developing User Interfaces: Ensuring Usability Through Product and Process. Wiley, New York (1993)
14. Holzinger, A., Searle, G., Kleinberger, T., Seffah, A., Javahery, H.: Investigating usability metrics for the design and development of applications for the elderly. In: ICCHP, pp. 98–105 (2008)
15. International Organization for Standardization: ISO 9241-11, ergonomic requirements for office work with visual display terminals (VDTs), part 11: Guidance on usability (1998)
16. International Organization for Standardization: ISO/IEC 9126-1 standard, software engineering, product quality, part 1: Quality model (2001)
17. ISO: www.iso.org/iso/catalogue_detail.htm?csnumber=38894 (2007)
18. Mayhew, D.J.: The Usability Engineering Lifecycle: A Practitioner’s Handbook for User Interface Design. Morgan Kaufmann, San Francisco, CA (1999)
19. Moss, T.: 10 common errors when implementing accessibility. www.webcredible.co.uk/userfriendly-resources/web-accessibility/errors.shtml (2008). Retrieved Jan 2010
20. Nielsen, J.: Usability engineering at a discount. In: Salvendy, G., Smith, M.J. (eds.) Designing and Using Human-Computer Interfaces and Knowledge Based Systems, pp. 394–401. Elsevier, Amsterdam (1989)
21. Nielsen, J.: Usability Engineering. Morgan Kaufmann, San Francisco, CA (1993)
22. Nielsen, J.: Usability inspection methods. In: CHI’95: Conference Companion on Human Factors in Computing Systems, pp. 377–378. ACM, New York, NY (1995)
23. Padda, H.: UIM map: A repository for usability/quality in use. Master’s thesis, Degree of Master of Computer Science at Concordia University (2003)
24. Petrie, H., Kheir, O.: The relationship between accessibility and usability of websites. In: CHI’07: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 397–406. ACM, New York, NY (2007)
25. Polillo, R.: Il check-up dei siti Web. Valutare la qualità per migliorarla. Apogeo, Milano (2004)
26. Preece, J., Rogers, Y., Sharp, H., Benyon, D., Holl, S., Carey, T.: Human-Computer Interaction. Addison-Wesley, Reading, MA (1992)
27. Seffah, A., Donyaee, M., Kline, R., Padda, H.: Usability measurement and metrics: A consolidated model. Softw. Qual. J. 14(2), 159–178 (2006)
28. Shackel, B.: Usability-Context, Framework, Definition, Design and Evaluation. Cambridge University Press, Cambridge (1991)
29. Shneiderman, B.: Designing the User Interface: Strategies for Effective Human-Computer Interaction. Addison-Wesley, Boston, MA (1992)
30. Silva, P., Dix, A.: Usability – Not as we know it! In: HCI 2007 – The 21st British HCI Group Annual Conference (2007)
31. Stephanidis, C.: User interfaces for all: concepts, methods, and tools. In: Stephanidis, C. (ed.) User Interfaces for All: New Perspectives into Human-Computer Interaction, pp. 3–17. Lawrence Erlbaum, Mahwah, NJ (2001)
32. Taras, C., Siemoneit, O., Weißer, N., Rotard, M., Ertl, T.: Improving the accessibility of Wikis. In: ICCHP’08: Proceedings of 11th International Conference on Computers Helping People with Special Needs, pp. 430–437. Springer, Heidelberg (2008)
33. Thompson, A., Kemp, E.: Web 2.0: Extending the framework for heuristic evaluation. In: CHINZ ’09: Proceedings of the 10th International Conference NZ Chapter of the ACM’s Special Interest Group on Human-Computer Interaction, pp. 29–36. ACM, New York, NY (2009)
34. W3C: Web content accessibility guidelines 1.0, 1999. http://www.w3.org/TR/WCAG10/. Last access March 2010
35. W3C: Authoring tool accessibility guidelines 1.0, 2000. http://www.w3.org/TR/WAI-AUTOOLS/. Last access March 2010
36. W3C: User agent accessibility guidelines 1.0, 2002. http://www.w3.org/TR/WAIUSERAGENT/. Last access March 2010
37. W3C: Web content accessibility guidelines (WCAG) 2.0, 2008. http://www.w3.org/TR/WCAG20/. Last access March 2010
38. W3C: Authoring tool accessibility guidelines (ATAG) 2.0, 2009. http://www.w3.org/TR/ATAG20/. Last access March 2010
39. WHO: http://www.who.int/topics/disabilities/en/
Chapter 14
Trust in Online Collaborative IS
Sara Javanmardi and Cristina Lopes
14.1 Introduction
Collaborative systems available on the Web allow millions of users to share information through a growing collection of tools and platforms such as wikis, blogs, and shared forums. Simple editing interfaces encourage users to create and maintain repositories of shared content. Online information repositories such as wikis, forums, and blogs have increased the participation of the general public in the production of Web content through the notion of social software [1–3]. The open nature of these systems, however, makes it difficult for users to trust the quality of the available information and the reputation of its providers. Online information repositories, especially in the form of wikis, are widely used on the Web. Wikis were originally designed to hide the association between a wiki page and the authors who have produced it [4]. The main advantages of this feature are as follows: (a) it eliminates the social biases associated with group deliberation, thus contributing to the diversity of opinions and to the collective intelligence of the group, and (b) it directs authors toward group goals, rather than individual benefits [5]. In addition, one of the key characteristics of wiki software is its very low-cost collective content creation, requiring only a regular Web browser and a simple markup language. This feature makes wiki software a popular choice for content creation projects where minimizing overhead is of high priority, especially in creating new or editing already existing content. “Wikinomics” is a recent term that denotes the art and science of peer production when masses of people collaborate to create innovative knowledge resources [6]. The most well-known example of a public collaborative information repository is Wikipedia, which has a traffic rank of six worldwide.1 Usually, people trust user-generated content in Wikipedia for learning purposes or decision-making without validating its information [7]. Therefore, the highly desirable properties of wikis or
1 According to the traffic report by Alexa.com in April 2010.
other similar social software technologies – openness, ease of use, and decentralization – can also have disruptive consequences on society. Open wikis can easily be associated with poor quality information and often fall prey to malicious or misleading content editing [8]. Online communities use trust/reputation management components to facilitate cooperative user behavior [9]. In general, trust management systems seek two main goals: (a) to assist users in rating products or other users for better decision-making, and (b) to provide an incentive for better user behavior resulting in improved future performance [10, 11]. In the context of wikis, reputation management systems are suggested as a social rewarding technique that motivates users to participate actively in sharing knowledge [12]. In addition, these systems can assist administrators with automatic detection of high/low reputation users to promote/demote the access rights. Reputation can be defined as the public’s opinion (more technically, a social evaluation) of a person, a group of people, or an organization [13]. Trust is one user’s belief in another user’s capabilities, honesty, and reliability based on his/her own direct experiences. In online communities, there are two notions of trust: individual-to-individual trust and individual-to-technology trust [14]. eBay and online banking are examples of these two categories, respectively. In wikis, we have a combination of these trust/reputation relationships; individuals need to have trust in content that is collaboratively created by other individuals. Authors also need to have trust in other authors collaborating with them to create/edit content. For example, one of the obstacles faced by experts who collaborate with Wikipedia is the lack of guarantee that an inexpert/vandal user will not tamper with their contributed content [15]. Therefore, the trustworthiness of content is tightly linked with the reputation of the author. The remainder of this chapter is as follows: Section 14.2 provides a brief overview of the relevant literature. Section 14.3 introduces the user reputation model. Section 14.4 describes how user reputation can be used to assess content quality. Section 14.5 provides technical background and describes how the data were collected and mined. Finally, Sect. 14.6 draws some conclusions and points to a few directions for future investigation.
14.2 Background
Many online communities have trouble motivating enough users to build an active community. High user participation is the key factor for a successful online community, and that is why good motivating factors are essential [12]. As of April 2010, six of the ten most popular Web sites worldwide simply could not exist without user-contributed content [16]. These sites – Myspace, YouTube, Facebook, eBay, Wikipedia, and Craigslist – look for some incentives to encourage broader participation or the contribution of higher quality content. In order to increase and enhance user-generated content contributions, it is important to
understand the factors that lead people to freely share their time and knowledge with others [17, 18]. The positive correlation between content quality and user participation has been discussed in some work [19, 20]. Some studies also showed that building a good reputation/trust can be a motivating factor that encourages user participation in collaborative systems, as well as an incentive for good behavior [12, 21–24]. There is an extensive amount of research focused on building trust for online communities through trusted third parties or intermediaries [25, 26]. However, it is not applicable to all online communities where users are equal in their roles and there are no entities that can serve as trusted third parties or intermediaries. Reputation management systems provide a way to build trust through social control without trusted third parties [27]. A reputation management system is an approach to systematically evaluate opinions of online community members on various issues (e.g., products and events) and their opinions concerning the reputation of other community members [28]. Reputation management systems try to quantify reputation based on metrics to rate their users or products. In this way, users are able to judge other users or products to decide on future transactions. A well-known example of reputation management is eBay’s auction and feedback mechanism. In this system, buyers and sellers can rate each other after each transaction by using crude +1 or 1 values so that the overall reputation of a trustee becomes the sum of these ratings over the last 6 months. Besides assigning these ratings, users can add textual annotations to present their experiences during their transactions [29]. In other distributed environments such as peer-to-peer (P2P) file sharing networks or grid computing, users can rate each other after each transaction (e.g., downloading a file). So far, a considerable amount of research has been focused on the development of trust/ reputation models in virtual organizations, social networks, and P2P networks [30–33]. Reputation management systems are difficult to scale when they have limited sources of information. Users do not always give feedback about other users/ products. They also prefer not to return negative feedback [11]. To overcome this problem, these systems consider reputation as a transitive property and try to propagate it, in order to have an estimation of unknown users and products. In this way, there is a high risk of propagating biased or inaccurate ratings. A study of P2P e-commerce communities confirms this issue and shows that reputation models based solely on feedback from other peers in the community are inaccurate and ineffective [27]. To alleviate this problem, a reputation management system can make its judgments based on objective observations rather than using explicit experiences from other users, for example, by tracking behavior of users in the system or analyzing users’ feedback to products over time. Quite unlike some research lines that are based on subjective observations in wiki systems [12, 34], in this work our aim is to quantify reputation based on objective observations of the users’ actions. In the open editing model of Wikipedia, users can contribute anonymously or with untested credentials. As a consequence, the quality of Wikipedia articles has
been a subject of widespread debate. For example, in late 2005, American journalist John Seigenthaler publicly criticized Wikipedia because of a collection of inaccuracies in his biography page, including an assertion that he was involved in the assassination of former U.S. President John F. Kennedy.2 Apparently the inaccuracies remained in Wikipedia for 132 days. Because there is no single entity taking responsibility for the accuracy of Wikipedia content, and because users have no other way of differentiating accurate content from inaccurate content, it is commonly thought that Wikipedia content cannot be relied upon, even if inaccuracies are rare [35]. To overcome this weakness, Wikipedia has developed several user-driven approaches for evaluating the quality of its articles. For example, some articles are marked as “featured articles.” Featured articles are considered to be the best articles in Wikipedia, as determined by Wikipedia’s editors. Before being listed here, articles are reviewed as “featured article candidates,” according to special criteria that take into account: accuracy, neutrality, completeness, and style.3 In addition, Wikipedia users keep track of articles that have undergone repeated vandalism in order to eliminate it and report it.4 However, these user-driven approaches cannot be scaled and only a small number of Wikipedia articles are evaluated in this way. For example, as of March 2010, only 2,825 articles (less than 0.1%) in English Wikipedia are marked as featured. Another difficulty of the userdriven evaluations is that Wikipedia content is, by its nature, highly dynamic and the evaluations often become obsolete rather quickly. Due to these conditions, recent research work involves automatic quality analysis of Wikipedia [33, 35–43]. Cross [35] proposes a system of text coloring according to the age of the assertions in a particular article; this enables Wikipedia users to see what assertions in an article have survived after several edits of the article and what assertions are relatively recent and thus, perhaps, less reliable. Adler et al. [37] quantify the reputation of users according to the survival of their edit actions; then they specify ownerships of different parts of the text. Finally, based on the reputation of the user, they estimate the trustworthiness of each word. Javanmardi et al. in [36] present a robust reputation model for wiki users and show that it is not only simpler but also more precise compared to the previous work. Other research methods try to assess the quality of a Wikipedia article in its entirety. Lih [40] shows that there is a positive correlation between the quality of an article and the number of editors as well as the number of revisions. Liu et al. [33] present three models for ranking Wikipedia articles according to their level of accuracy. The models are based on the length of the article, the total number of revisions, and the reputation of the authors, who are further evaluated by their total number of previous edits. Zeng et al. [42] compute the quality of a particular article revision with a Bayesian network from the reputation of its author, the number of
2 http://bit.ly/4Bmrhz
3 http://en.wikipedia.org/wiki/Wikipedia:Featured_articles
4 http://bit.ly/dy3t1Y
words the author has changed, and the quality score of the previous version. They categorize users into several groups and assign a static reputation value to each group, ignoring individual user behavior. Stvilia et al. [41] have constructed seven complex metrics and use a combination of them for quality measurement. Dondio et al. [39] have derived ten metrics from research related to collaboration in order to predict quality. Blumenstock [38] investigates over 100 partial simple metrics, for example, the number of words, characters, sentences, and internal and external links. He evaluates the metrics by using them to classify articles as featured or nonfeatured. Zeng et al., Stvilia et al., and Dondio et al. used a similar method, which enables the evaluation results to be compared. Blumenstock demonstrates, with a classification accuracy of 97%, that the number of words is the best current metric for distinguishing between featured and nonfeatured articles. These works assume that featured articles are of much higher quality than nonfeatured articles and recast the problem as a classification issue. Wohner and Peters [43] suggest that, with improved evaluation methods, these metrics-based studies enable us to determine the accuracy of various submissions. Studying German Wikipedia, they believe that a significant number of nonfeatured articles are also highly accurate and reliable. However, this category includes a large number of short articles. Their study of German Wikipedia from January 2008 shows that about 50% of the articles contain fewer than 500 characters, and they therefore assume that some short nonfeatured articles are of high quality, since their subject matter can be briefly but precisely explained. In addition, we and others [43, 44] assume that when an article is marked as featured and is displayed on the respective pages, it attracts many more Web users as contributors and demands more administrative maintenance. Wohner and Peters’ investigation of German Wikipedia [43] reveals this assumption to be true. For example, over 95% of all articles are edited with greater intensity once they are marked as featured. Wilkinson and Huberman [44], in a similar study of English Wikipedia, show that featured articles gain an increase in the number of edits and editors after being marked as featured. According to these observations, the accuracy of the classification in the related work [39, 41, 42] will be valid only if featured articles are considered before they are marked as featured.
14.3 Modeling User Reputation
The long-term goal of this effort is to develop an automated system that can estimate the reputation R_i(t) of a Wikipedia user i at time t based on his/her past behavior. The reputation index R_i(t) should be positive and scaled between 0 and 1 and, for the moment, should be loosely interpretable as the probability that i produces high-quality content. Here, we take a first step toward this long-term goal by developing several computational models of R_i(t) and testing them, in the
form of classifiers, on the available “ground truth” associated with Wikipedia: known administrators and vandals. This general approach, which is fairly standard in machine learning applications, requires some explanations. It is reasonable to assume that there exists a true reputation function that is scaled between 0 and 1 and increases monotonically from the user with the lowest reputation to the user with the highest reputation. Our work is an attempt to approximate this unknown function. The only ground truth available to us concerning this function comes in the form of two extreme datasets of users: the vandals and the admins. No ground-truth data are available for individuals in the middle range of the spectrum. Thus, to approximate the true unknown reputation function, our first focus is on testing whether the proposed models behave well on the two extreme populations. The models we propose have very few free parameters and they are used to predict reputation values for large numbers of admins and vandals. Once a model capable of producing an output between 0 and 1 for each user has been shown to perform well on the two extreme populations, it is also reasonable to ask whether it performs well on other users. Since no ground truth is available for these users, only indirect evidence can be provided regarding the corresponding performance of the model. Indirect, yet very significant, evidence can be provided in a number of different ways including assessment with respect to other models and datasets proposed in the relevant literature, and results obtained on curated datasets that go beyond the available admin/vandal data. These are precisely the kinds of analyses that are described in the following sections. In order to estimate users’ reputations, we deconstruct edit actions into inserts and deletes. We consider the stability of the inserts done by a user, the fraction of inserts that remain, to be an estimate of his/her reputation. Although stability of deletes can also be considered as another source of information, it has several shortcomings. In fact, Wikipedia is driven more by inserts, and the size of inserts is 1.6 times larger than the size of deletes. Deletes are more difficult to track and therefore calculating the stability of deletes is noisier and more computationally expensive. Hence, we make an assumption that using only the stability of inserts would result in a reliable estimation of users’ reputation values. Consider a user i who at time t inserts c_i(t) tokens into a Wikipedia page. It is reasonable to assume that the update R_i^+(t) of R_i(t) should depend on the quality of the tokens inserted at time t. To assess the quality of each token, let t′ represent the first time point, after t, where an administrator (hereafter referred to as “admin”) checks the current status of a wiki page by submitting a new revision. According to English Wikipedia history dumps, admins on average submit about 11% of the revisions of the page, which are distributed over the life cycle of the page. By definition (or approximation), a token inserted at time t is defined to be of good quality if it is present after the intervention of the admin at time t′; otherwise it is considered to be of poor quality. Therefore, we have c_i(t) = g_i(t) + p_i(t), where g_i(t) (resp. p_i(t)) represents the number of good quality tokens (resp. poor quality).
For user i, we also let N_i(t) be the total number of tokens inserted up to and right before the time t and, similarly, let n_i(t) be the number of good quality tokens
inserted up to and right before the time t. Using a “+” superscript to denote values immediately after the time t, we have N_i^+(t) = N_i(t) + c_i(t) and n_i^+(t) = n_i(t) + g_i(t). We can now define three different models of reputation. Model 1: In the first model, reputation is simply measured by the fraction of good tokens inserted. In this model, we simply have

R_i^+(t) = \frac{n_i^+(t)}{N_i^+(t)} = \frac{n_i(t) + g_i(t)}{N_i(t) + c_i(t)}   (14.1)
Model 2: While the first model appears reasonable, tokens that are deleted are treated uniformly. In reality, there is some information to be found in the time at which deletions occur. Vandalistic insertions, for instance, tend to be removed very rapidly [23, 45, 46]. According to our study on Wikipedia, 76% of vandalism is reverted in the very next revision. Insertions that are deleted only after a very long period of time tend to be deleted because they are outdated rather than poor in quality. Thus, in general, we arrive at the hypothesis that the quicker a token is deleted, the more likely it is to be of poor quality. To realize this hypothesis, we propose a variation on Model 1, where deleted tokens introduce a penalty in the numerator with an exponential time decay controlled by a single parameter α:

R_i^+(t) = \frac{n_i(t) + g_i(t) - \sum_{d=1}^{p_i(t)} e^{-\alpha(t_d - t)}}{N_i(t) + c_i(t)}   (14.2)
Here, t_d represents the time at which the corresponding token was deleted. Since the update rate can vary among different wiki pages, we measure the time interval in terms of the number of revisions. We trained R_i^+(t) over different values of α in order to maximize the area under the ROC curve (AUC). The results show that α = 0.1 performs best. Model 3: This model is a variation of Model 2, where we also take into account the reputation of the deleter and use his/her reputation to weigh the corresponding deletion:

R_i^+(t) = \frac{n_i(t) + g_i(t) - \sum_{d=1}^{p_i(t)} R_{j(t_d)}(t_d)\, e^{-\alpha(t_d - t)}}{N_i(t) + c_i(t)}   (14.3)

where j(t_d) denotes the user who performed deletion d at time t_d.
The idea behind this variation of the model is to value the deletions performed by high reputation users (e.g., admins) and devalue the deletions performed by low reputation users (e.g., vandals). In Model 3, α = 0.08 yields the maximum AUC. For users who start with a delete action, we need to know the initial value, R_i(0). If we denote by T the final time, experiments show that the fastest convergence from
R_i(t) to R_i(T) is obtained using the initial values R_i(0) = 0.2 for all anonymous users and R_i(0) = 0.45 for all registered users (data not shown). These initial values are used in the rest of the chapter. Finally, it is worth noting that if Model 3 were to perform well on the classification task (vandals vs. admins), this would provide further indirect evidence that Model 3 is self-consistent and may perform well on other users too, since the update equation at time t + 1 for Model 3 uses the predicted reputation of users other than vandals or admins at time t.
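To make the three update rules concrete, the following is a minimal Python sketch, not the authors' implementation: it assumes that token-level insertions and deletions (with their owners, deleters, and elapsed revision counts) have already been extracted from the revision history, and the class, method, and parameter names are ours.

```python
import math
from collections import defaultdict

ALPHA = 0.1             # decay parameter reported as best for Model 2 (0.08 for Model 3)
INIT_ANONYMOUS = 0.2    # initial reputation for anonymous users
INIT_REGISTERED = 0.45  # initial reputation for registered users

class ReputationTracker:
    """Token-level reputation bookkeeping in the spirit of Eqs. (14.1)-(14.3).

    N[u]       : total tokens inserted by user u
    n[u]       : tokens judged good, i.e. still present at the admin check
    penalty[u] : accumulated decay-weighted penalty for u's deleted tokens
    """

    def __init__(self, model=3, alpha=ALPHA):
        self.model = model
        self.alpha = alpha
        self.N = defaultdict(int)
        self.n = defaultdict(int)
        self.penalty = defaultdict(float)
        self.registered = set()        # users known to be registered

    def reputation(self, user):
        if self.N[user] == 0:          # no contributions seen yet
            return INIT_REGISTERED if user in self.registered else INIT_ANONYMOUS
        if self.model == 1:
            return self.n[user] / self.N[user]
        # Models 2 and 3 subtract the decayed penalty from the numerator;
        # the max() clamp is our addition to keep the value nonnegative.
        return max(0.0, (self.n[user] - self.penalty[user]) / self.N[user])

    def record_admin_check(self, inserts, deletions):
        """Apply updates when an admin submits a revision (the gain trigger).

        inserts   : iterable of (user, good_tokens, total_tokens) gathered
                    since the previous admin check.
        deletions : iterable of (owner, deleter, revisions_elapsed), one entry
                    per deleted token, with t_d - t measured in revisions.
        """
        for user, good, total in inserts:
            self.n[user] += good
            self.N[user] += total
        for owner, deleter, dt in deletions:
            if self.model == 2:
                self.penalty[owner] += math.exp(-self.alpha * dt)
            elif self.model == 3:
                # Model 3 weighs the penalty by the deleter's current reputation.
                self.penalty[owner] += self.reputation(deleter) * math.exp(-self.alpha * dt)
```

Under these assumptions, a tracker created with `ReputationTracker(model=3)` is fed with `record_admin_check` calls as admin revisions arrive, and per-user reputations are then read off via `reputation(user)`.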
14.3.1 User Reputation Results
In this section, we evaluate the reputation models on our dataset extracted from English Wikipedia history, as described in Sect. 14.5.
14.3.1.1 Evaluation on Ground-Truth Data
We first analyze the performance of the reputation models on two major populations: vandals and admins. Vandals are users who have been blocked by the Wikipedia Committee because they performed edits in violation of Wikipedia rules by engaging in vandalism. The “admin” title is conferred on users selected by the Wikipedia Committee for their helpful, long-term contributions. Although Model 1, Model 2, and Model 3 have at most one free parameter (α) and can be applied directly to estimate the reputation R_i(t) of any user at any time, here we first use the output R_i(T) to derive a classifier that separates vandals from admins. Table 14.1 shows the AUC (Area Under the Curve) values corresponding to the ROC curves of the three classifiers. The table shows that all three models perform well and that their classification performances are comparable. To further analyze classification performance on a broader set of users, we extend the test populations beyond the extremes of vandals and admins to all blocked users on one side and to good users of Wikipedia on the other side. All blocked users are a superset of the vandals. According to Wikipedia, in addition to vandalism, user blocking can happen for other reasons such as
Table 14.1 AUC values for the three reputation models

            Admins vs. vandals   Good users vs. vandals   Admins vs. blocked users   Good users vs. blocked users
  Model 1   0.9751               0.9839                   0.9196                     0.9220
  Model 2   0.9753               0.9769                   0.9094                     0.9153
  Model 3   0.9742               0.9762                   0.9073                     0.9125
Fig. 14.1 TPRs and FPRs for the three reputation models as the classification threshold is decreased from 1 to 0
sock-puppetry,5 edit war,6 advertising, or edit disruption. At the other end of the spectrum, automatic extraction of good users beyond admins is not a trivial task. To identify a set of good users, we focus on Wikipedia articles that are marked as good or featured by a committee of experts. From the pool of users contributing to these articles, we extract those who still have contributions that are live in the most recent revisions of these articles. Our definition of good users is also consistent with the result of a recent study of Wikipedia [5], which shows that identification of top page contributors is most highly correlated with the count of their contributed sentences that have survived up to the most recent revision of the wiki pages. Table 14.1 shows the AUC values for this extended classification experiment. Similar to the previous results, all the three models perform well and their classification performances are comparable; however, looking at TPRs (True Positive Rates) and FPRs (False Positive Rates) separately (Fig. 14.1) reveals some subtle differences. In particular, we can see that Model 1 is the best model for detecting vandals/blocked users (lower FPR), while Model 3 is the best model for detecting admin/good users (higher TPR). Table 14.2 compares the mean and standard deviation of the reputation values for good users and admins against blocked users. In general, all three models assign high reputation values to admins/good users and low reputation values to blocked users, but the distribution of assigned reputations (Fig. 14.2) confirms that Model 1
5 http://en.wikipedia.org/wiki/Wikipedia:Sock_puppetry
6 http://en.wikipedia.org/wiki/Edit_warring
Table 14.2 Mean and standard deviation of reputation values for admins, good users, and blocked users for the three reputation models

            Admins and good users   Blocked users
  Model 1   0.5416 (± 0.2407)       0.0926 (± 0.2091)
  Model 2   0.7835 (± 0.1698)       0.1884 (± 0.2962)
  Model 3   0.8180 (± 0.1514)       0.2216 (± 0.3128)
Fig. 14.2 Distribution of reputation for good users/admins vs. blocked users based on the three models. The X-axis shows the reputation bins and the Y-axis shows the percentage of users in the bins
outperforms the other two models at detecting blocked users, while Model 3 outperforms the other two models at detecting good users.
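The classification experiment itself can be reproduced along the following lines. This is an illustrative sketch using scikit-learn, assuming final reputation values R_i(T) and ground-truth labels (admin/good user vs. vandal/blocked user) are already available as arrays; the data shown are made up.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def evaluate_reputation_model(reputations, labels):
    """AUC plus the full ROC curve (FPR/TPR per threshold) for one model.

    reputations : final reputation values R_i(T), one per user
    labels      : 1 for admins/good users, 0 for vandals/blocked users
    """
    auc = roc_auc_score(labels, reputations)
    fpr, tpr, thresholds = roc_curve(labels, reputations)
    return auc, fpr, tpr, thresholds

# Toy data only: three high-reputation good users, three low-reputation blocked users.
scores = np.array([0.92, 0.85, 0.78, 0.30, 0.15, 0.05])
labels = np.array([1, 1, 1, 0, 0, 0])
auc, fpr, tpr, thresholds = evaluate_reputation_model(scores, labels)
print(f"AUC = {auc:.3f}")   # 1.000 for this perfectly separable toy example
```

Sweeping the threshold from 1 down to 0 over the returned curve reproduces the kind of TPR/FPR comparison summarized in Fig. 14.1.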
14.3.1.2 Reputation and User Behavior
In this section, we consider the application of the three models to estimate the reputation of all users by extending the previous analyses. We first estimate reputation values for all the users of English Wikipedia. Figure 14.3 shows the distribution of reputation values for the three models. Unlike Models 2 and 3, where higher reputation users are more dominant, Model 1 yields a higher number of low reputation users. This is a direct consequence of the prompt punishment of a user in Model 1 after his/her contributed data are deleted. The decrease in reputation punishment occurs in Model 1 regardless of the reason for the deletion or the reputation of the deleter. Hence, it is very likely that Model 1 overly shifts good users to the left. This is also confirmed by the results of the previous experiments and the poor TPRs of Model 1, compared with Models 2 and 3. In order to evaluate the predictive value of the proposed reputation models, we run another experiment. In this experiment, we calculate the reputation of all the users of English Wikipedia up to time t and analyze the users’ behavior up to time t. Then, in a second phase, we analyze their behavior after time t, and correlate this behavior with the reputation values calculated before time t. Specifically, we measure the statistical correlation between the reputation of the users at time t and their behavioral indicators before and after time t. We process history revisions up
Fig. 14.3 Distribution of reputation for all users in English Wikipedia based on the three Models. The X-axis shows the reputation bins and the Y-axis shows the percentage of users in the bins
to January 1, 2007, for reputation estimation and then examine users’ behavioral indicators on January 1, 2007, and September 30, 2009. For each model, we classify all the users into ten different bins (ignoring bots) according to their reputation values. For each bin associated with each model, we calculate the mean of four individual, time-dependent, behavioral indicators, namely RDR, DSR, SDR, and CDR, defined as follows (a computational sketch of these indicators is given after the list):
- Reverted Data Ratio (RDR) is the ratio of the number of revisions submitted by a user that are reverted by other users to the total number of revisions submitted by the same user. This metric can be interpreted as the tendency of a user toward contributing vandalistic/problematic content.
- Data Stability Ratio (DSR) is the percentage of the data contributed by a user that remains live in the wiki pages. It shows the percentage of content contributed by a user that has not yet been deleted by other users.
- Submission Data Ratio (SDR) is the ratio of the number of revisions submitted by a user to the total number of submitted revisions. This metric shows how actively each user contributes to the wiki pages by submitting new revisions.
- Correction Data Ratio (CDR) is the ratio of the number of reverts done by a user to the total number of reverts. This metric can be interpreted as the tendency of a user to make corrections in the wiki pages.
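As a rough illustration of how these indicators could be computed, the sketch below assumes that per-user counts have already been aggregated from the revision history; the record layout and field names are our own and are not part of the chapter.

```python
def behavioral_indicators(user_stats, total_revisions, total_reverts):
    """Compute RDR, DSR, SDR, and CDR for one user.

    user_stats is a dict with the per-user counts assumed here:
      'revisions'          - revisions submitted by the user
      'reverted_revisions' - those revisions later reverted by others
      'tokens_contributed' - tokens the user inserted
      'tokens_live'        - inserted tokens still live in the wiki pages
      'reverts_done'       - reverts performed by the user
    """
    rdr = user_stats['reverted_revisions'] / max(user_stats['revisions'], 1)
    dsr = user_stats['tokens_live'] / max(user_stats['tokens_contributed'], 1)
    sdr = user_stats['revisions'] / max(total_revisions, 1)
    cdr = user_stats['reverts_done'] / max(total_reverts, 1)
    return {'RDR': rdr, 'DSR': dsr, 'SDR': sdr, 'CDR': cdr}

# Example with made-up numbers:
stats = {'revisions': 120, 'reverted_revisions': 6,
         'tokens_contributed': 5000, 'tokens_live': 4300,
         'reverts_done': 15}
print(behavioral_indicators(stats, total_revisions=1_000_000, total_reverts=50_000))
```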
Figure 14.4 shows the mean of CDR, SDR, DSR, and RDR, respectively, in each bin associated with each reputation model, when the behavioral indicators and reputation values are calculated using data up to January 1, 2007. As the diagrams show, in general, there is a positive correlation between user reputation and CDR – signifying that users with an estimated high reputation tend to make more corrections than users with estimated low reputation. The positive correlation between user reputation and SDR also shows that higher reputation users submit more revisions compared to lower reputation users. The correlation between user reputation and RDR is negative, indicating that lower reputation users tend to contribute vandalistic or low-quality content more frequently. These positive and negative
Fig. 14.4 CDR, SDR, DSR, and RDR as functions of reputation (based on data before 2007). The X-axis shows the reputation bins and the Y-axis shows the percentage of users in the bins
correlations are consistent with the general intuitions about Wikipedia that were used to build the models. It is important to note that among these parameters DSR is a direct input to Model 1 and an indirect input to Models 2 and 3. Hence, the positive correlation between DSR and the user reputation is expected for the three models. For this first set of graphs shown in Fig. 14.4, this positive correlation does not give any evidence about the predictive value of the models, since both user behavior indicators and user reputation are calculated on the same data. To show the predictive value of the models, we plot users’ behavioral indicators computed using data up to September 30, 2009, against the reputation values estimated using data up to January 1, 2007. Figure 14.5 shows the mean of CDR, SDR, DSR, and RDR, respectively, in each bin associated with each reputation model, where the users’ behavioral indicators are estimated in 2009, while reputation values used to determine the bins are estimated at the beginning of 2007. The first observation is that this second set of curves has shapes similar to those in Fig. 14.4, indicating that the estimated users’ reputations are consistent with their behaviors – users continue to behave in 2007–2009 as they had behaved before 2007. Furthermore, behavior or reputation is captured in broad strokes, by the reputation models (Fig. 14.6). The values of the behavioral indicators in Fig. 14.5 are slightly different from their predicted values corresponding to Fig. 14.4. For example, according to Model 3 applied up to 2007, users with a reputation of 0.1 or below ought to have 69% reverted data (RDR), whereas in reality during 2007–2009 those users had only 52% reverted data. Likewise, the same Model 3 predicts that users with a reputation between 0.8 and 0.9 ought to be responsible for 37% of the total number
Fig. 14.5 CDR, SDR, DSR and RDR extracted after 2007 as functions of reputation computed before 2007. The X-axis shows the reputation bins and the Y-axis shows the percentage of users in the bins
Fig. 14.6 Transitions between high-quality and low-quality states
Table 14.3 Correlation values for the three reputation models

  Models    RDR                 CDR               SDR               DSR
  Model 1   (-0.906, -0.871)    (0.434, 0.760)    (0.757, 0.861)    (0.999, 0.996)
  Model 2   (-0.927, -0.939)    (0.783, 0.852)    (0.822, 0.833)    (0.976, 0.975)
  Model 3   (-0.958, -0.973)    (0.779, 0.811)    (0.791, 0.786)    (0.944, 0.944)
of submissions (SDR), whereas in reality during 2007–2009 those users were responsible for only 27% of submissions. To compare these two sets of diagrams (Figs. 14.4 and 14.5), we perform a Pearson correlation analysis. The results are described in Table 14.3, where each tuple shows the correlation between the two parameters before and after 2007, respectively. For example, the entry (-0.906, -0.871) signifies that the correlation between RDR and Model 1 reputation is -0.906 in Fig. 14.4, while it is -0.871 in Fig. 14.5. These correlations are highly significant, and the same is observed if one measures the correlation between the reputation values themselves within or across models, and up to 2007 or up to 2009. In combination, these results suggest that the reputation models are good at predicting behavioral indices and reputation values at future times, not only for extreme populations of very good or very bad users, but also across the entire spectrum of reputation values.
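The correlation analysis over the ten bins can be sketched as follows with SciPy; the bin means below are placeholders rather than the measurements behind Table 14.3.

```python
import numpy as np
from scipy.stats import pearsonr

# Mean RDR per reputation bin (10 bins, low to high reputation), once computed
# from data before the cutoff date and once from data after it.
# These numbers are placeholders, not the chapter's measurements.
bin_centres = np.linspace(0.05, 0.95, 10)
mean_rdr_before = np.array([0.69, 0.55, 0.41, 0.33, 0.26, 0.20, 0.15, 0.11, 0.08, 0.05])
mean_rdr_after  = np.array([0.52, 0.45, 0.36, 0.30, 0.24, 0.19, 0.14, 0.10, 0.07, 0.05])

r_before, _ = pearsonr(bin_centres, mean_rdr_before)
r_after, _ = pearsonr(bin_centres, mean_rdr_after)
print(f"RDR correlation tuple: ({r_before:.3f}, {r_after:.3f})")
```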
14.3.2 Comparison to Related Work
In this section, we discuss our model in more detail and compare it to related work in the literature according to several different criteria, each discussed in one of the following subsections.
14.3.2.1 Tracking Token Ownership
Effective assignment of inserts and deletes to owners is highly dependent on (1) the accuracy of the diff algorithm used for calculating the distance between two revisions of a wiki page; and (2) the side effects of reverts, which result in incorrect ownership assignments. An effective diff algorithm for wikis should identify differences in a way that is meaningful to human readers. In particular, reordering of text blocks should be detected in order to accurately assign ownership to the tokens in the reordered blocks. This issue has not been taken into consideration in some of the previous work [5, 34, 47]. For example, Sabel et al. [34] use the Levenshtein algorithm7 to compute the edit distance between two revisions. This algorithm penalizes block reordering and, as a result, each token that has been shifted is usually considered deleted from its old position and inserted in its new position [48, 49]. In our experience, Wikipedia’s diff algorithm can suffer from the same problem, occasionally preventing the detection of block reorderings. We and others [37] overcome this problem by using efficient diff algorithms that detect reordering of blocks and run in time and space linear in the size of the input [50, 51]. Another issue in accurate assignment of token ownership has to do with taking into account the side effects of reverts. In general, successive revisions of a wiki page have similar content, and each revision, except the very first, is a descendant of the preceding one. However, this model is insufficient for describing the realistic evolution of a wiki page [34]. Assume that a vandal blanks out the ith revision of a wiki page. Therefore, the (i + 1)th revision becomes blank. When user u reverts the (i + 1)th revision to the previous revision, this revert results in a new revision and the content of the (i + 2)th revision and the ith revision become the same. This scenario raises several problems: (1) users whose contributions were deleted by the vandal are penalized unfairly; (2) u is erroneously considered to be the owner of all the content of the (i + 2)th revision; and (3) the true original owner(s) are denied ownership of the content they actually contributed. We and others [37] address this issue by ignoring these spurious insertions and deletions caused by reverts. However, in [37], the authors decided to process only up to the third successive revision in order to extract reverts and assign ownership. Our study of Wikipedia shows that about 6% of reverts return the ith revision of a page to the jth, where i - j > 3. For this reason, in order not to lose any information, we process all
7 http://en.wikipedia.org/wiki/Levenshtein_distance
revisions. Because reverts happen very frequently in Wikipedia, ignoring the side effect of reverts can result in significant numbers of incorrect assignments of token ownership.
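For illustration, a token-level diff between two revisions can be approximated with Python's standard difflib, as in the sketch below. Note that, unlike the linear-time block-move-aware algorithms cited above, SequenceMatcher treats a moved block as a deletion plus an insertion, which is exactly the ownership pitfall discussed in this subsection.

```python
import difflib

def token_diff(old_revision: str, new_revision: str):
    """Return (inserted_tokens, deleted_tokens) between two revisions.

    A simplified ownership step: tokens in `inserted` would be credited to
    the author of `new_revision`; tokens in `deleted` would count against
    the users who originally inserted them.
    """
    old_tokens = old_revision.split()
    new_tokens = new_revision.split()
    matcher = difflib.SequenceMatcher(a=old_tokens, b=new_tokens, autojunk=False)
    inserted, deleted = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ('replace', 'delete'):
            deleted.extend(old_tokens[i1:i2])
        if tag in ('replace', 'insert'):
            inserted.extend(new_tokens[j1:j2])
    return inserted, deleted

ins, dels = token_diff("the quick brown fox", "the quick red fox jumps")
print(ins, dels)   # ['red', 'jumps'] ['brown']
```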
14.3.2.2 Stability of Edits
For the purpose of this study, user reputation is estimated by looking at the stability of the content he/she contributes. To estimate the stability of the content, we track the tokens inserted by a user up to the last revision of the page to see how many of these tokens are deleted. In some of the related work in the literature, the tracking process has been more limited, for instance, by tracking inserted tokens only up to a limited number of successive revisions and therefore missing some deleted tokens. For example, the authors in [37] use only up to the tenth successive revision. Our study of Wikipedia shows that 37% of the deletes happen after the tenth revision. Hence, ignoring this fraction of deletes may lead to reputation estimates that are less accurate. For the purpose of this study, user reputation is estimated by considering the stability of inserts only. One may argue that although the number of deletes is considerably smaller than the number of inserts, there is some information in the stability of the deletes too, and one ought to be able to use this additional information to derive even more accurate models of reputation. To see if the stability of deletes can improve the accuracy of the models, we reformulate our simplest model (Model 1) by considering the stability of deletes. We define Model 1′ as follows:

R_i^+(t) = \frac{n_i^+(t) + n_d^+(t)}{N_i^+(t) + N_d^+(t)}   (14.4)
where n_d^+(t) is the number of good quality deleted tokens and N_d^+(t) is the total number of deleted tokens immediately after time t. We tested Model 1′ as a classifier on admins and vandals and the results showed that Model 1′ has a lower AUC (0.84) than Model 1. Interestingly, this observation is consistent with the result of another study [5], which shows that delete and proofread edits have little impact on the perception of top contributors in Wikipedia. In other words, there does not seem to exist any significant correlation between an author’s reputation and an author’s number of deletes in the wiki pages, but, in contrast, there is a very strong correlation between an author’s reputation and an author’s number of insertions.
14.3.2.3 Dynamic/Nondynamic and Individualized/Nonindividualized Reputation Measures
356
S. Javanmardi and C. Lopes
users. They categorize users into four groups – administrators, anonymous users, registered users, and blocked users – and assign a static reputation value to each group. In [37], authors consider dynamic and individualized reputation values for registered users, but assign a static and nonindividualized reputation value to anonymous users.
14.3.2.4
Resistance to Attacks
According to the proposed models, users increase their reputation when their contributions to the wiki pages survive. The robustness of the models are highly dependent on when the reputation gain events are triggered. Assume that the reputation of a user increases immediately after he/she inserts some content; if the page is revised only after a long period of time, the user will have an increased reputation throughout the period, even if his/her contribution is of poor quality. One solution to this problem is to postpone the reputation increase until the contribution is reviewed by another user. Although this solution solves the previous problem, the reputation model becomes vulnerable to a Sybil attack,8 whereby an attacker has multiple identities and can follow up his/her own edits. To overcome both problems at once, we postpone the reputation increase until a high reputation user (e.g., admin) approves the corresponding page. Therefore, in the proposed models, a reputation gain can be triggered only when an admin submits a new revision. One may argue that this reliance on the limited number of admins as outside authorities might reduce the accuracy or scope of applicability of the proposed models. However, as shown in Table 14.6, in Wikipedia we have large number of good users who contribute actively to Wikipedia pages. Thus, enlarging the pool of authorities beyond admins to include these good users to validate the quality of insertions may provide an efficient solution, especially for pages with high edit rates. Among related work, Chatterjee et al. [52] have addressed the attack resistance problem by extending their previously presented model [37]. Although the extended model is resistant to the aforementioned attacks, it is considerably more complex than the original model. Since we do not consider the stability of deletes and reverts and we ignore the side effects of reverts, our proposed models are not prone to other kinds of attacks, such as delete–restore or fake followers [52]. Another issue in the proposed models is that reputation gains happen without giving any consideration to the quality of the page that a user contributes to. In [53], the authors make two assumptions: (1) the quality of a wiki page depends on the reputation of its contributors; and (2) the reputation of a user depends on the quality of the pages he/she contributes to. Although the first assumption is often true, the second assumption is more debatable; furthermore, it also increases the vulnerability of the model against some attacks. Our study of Wikipedia shows that vandals are more active in high-quality pages. For example, the average RDR associated 8
with featured articles (http://en.wikipedia.org/wiki/Featured_Article) is 17.8% (11.4% before being marked as featured and 25.4% after), while it is about 9.9% for nonfeatured articles. In general, a policy based on the assumptions in [53] would give vandals more incentive to contribute to high-quality pages in the hope of increasing their reputations, and give high-reputation users less incentive to contribute to low-quality pages to improve their quality.
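Returning to the deferred reputation gain discussed at the start of this subsection, the following is a minimal sketch of such a policy. The class, field names, and the per-token gain are illustrative assumptions for this chapter's discussion, not the actual model implementation: queued gains are committed only when an admin submits a new revision of the page.

```python
from collections import defaultdict

class DeferredReputation:
    """Queue reputation gains per page; commit them only when an admin
    (or other trusted user) submits a new revision of that page.
    Hypothetical sketch: field names and gain values are illustrative."""

    def __init__(self):
        self.reputation = defaultdict(float)   # user_id -> reputation score
        self.pending = defaultdict(list)        # page_id -> [(user_id, gain), ...]

    def on_insert(self, page_id, user_id, inserted_tokens, gain_per_token=0.001):
        # Do NOT credit the author immediately; queue the gain instead,
        # so unreviewed edits cannot boost reputation on their own.
        self.pending[page_id].append((user_id, inserted_tokens * gain_per_token))

    def on_revision(self, page_id, author_is_admin):
        # Gains are triggered only when an admin approves the page by
        # submitting a new revision; this also blunts Sybil-style self-review.
        if author_is_admin:
            for user_id, gain in self.pending.pop(page_id, []):
                self.reputation[user_id] = min(1.0, self.reputation[user_id] + gain)

rep = DeferredReputation()
rep.on_insert("Article_X", "user_42", inserted_tokens=120)
rep.on_revision("Article_X", author_is_admin=True)   # commits the queued gain
```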
14.3.2.5 Population Coverage and Precision and Recall Issues
In related work, anonymous users are either completely ignored or assigned a static reputation value, regardless of their behavior [37]. There are three main reasons why we think it is important to consider anonymous users in the reputation estimation process: (1) about 33% of the submissions and 39% of the inserts in Wikipedia are contributed by anonymous users, and 16% of these contributions have survived up to the last revisions of the articles; they therefore cannot be ignored; (2) Wikipedia itself blocks IP addresses associated with anonymous vandals, and 40% of anonymous vandals are subject to indefinite blocking. An effective reputation management system for Wikipedia should therefore be able to identify anonymous vandals; otherwise, a significant number of vandals will go undetected; and (3) about 15% of the data deleted from registered users is deleted by anonymous users, so ignoring their deletes would degrade the accuracy of the reputations estimated for registered users. To further verify the relevance of anonymous users, we reformulated Model 3 and assigned a static reputation value to all anonymous users, as suggested in [37, 42]. Several static reputation values were tested, and the results for the new model (Model 3′) show that the AUC always drops, for instance by 1% when the reputation of all anonymous users is set to 0.1. These results indicate that ignoring the anonymous population is likely to decrease the accuracy of a reputation model. Evaluation results reported by Adler et al. [37] using a precision and recall analysis also confirm this observation. To be more specific, in their work they use a model to estimate reputation values up to time t and then estimate the precision and recall, after time t, provided by low-reputation users for short-lived text, which are defined as follows:
- Short-lived text is text that is almost immediately removed (only 20% of the text in a version survives to the next version).
- A low-reputation author is an author whose reputation falls in the bottom 20% of the reputation scale.
Table 14.4 shows the precision and recall values obtained on these data by Adler et al. by first ignoring anonymous users (first row) and then by assigning a static common reputation value to all anonymous users (second row).
Table 14.4 Precision and recall provided by low-reputation users for short-lived text

  Model                               Precision   Recall
  Ignoring anonymous users [37]       0.058       0.378
  Considering anonymous users [37]    0.190       0.904
  Model 3                             0.404       0.975
The third row shows the results obtained using Model 3, the one of our models most similar to theirs, to estimate reputations in the English Wikipedia up to 2007 and to measure precision and recall on the same data. As the table shows, the model by Adler et al. [37] performs better when a reputation is assigned to anonymous users, albeit statically. Model 3 significantly outperforms the other two approaches because of its dynamic assignment of reputation to anonymous users, better token ownership assignment, and effective removal of the side effects of reverts.
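The criterion behind Table 14.4 can be reproduced with a few lines of code. The sketch below is an illustration only, not the evaluation code used in [37] or in this study: it flags tokens contributed by low-reputation authors and scores the flags against short-lived-text labels.

```python
def precision_recall(tokens, low_rep_threshold=0.2):
    """tokens: iterable of (author_reputation, is_short_lived) pairs.
    A token is flagged when its author's reputation falls in the
    low-reputation band (here, below 0.2 on a [0, 1] scale)."""
    tp = fp = fn = 0
    for reputation, is_short_lived in tokens:
        flagged = reputation < low_rep_threshold
        if flagged and is_short_lived:
            tp += 1
        elif flagged and not is_short_lived:
            fp += 1
        elif not flagged and is_short_lived:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# toy example: (author reputation, token turned out to be short-lived)
print(precision_recall([(0.05, True), (0.10, False), (0.90, True), (0.70, False)]))
```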
14.4 Measuring Quality Evolution of Wikipedia Articles
Since Wikipedia is a dynamic system, articles can change very frequently. The quality of an article is therefore a time-dependent function, and a single article may contain high- and low-quality content at different periods of its lifetime. The goal of our study is to analyze the evolution of content in articles over time and to estimate the fraction of time that articles are in a high-quality state. In our analysis of the evolution of content quality in Wikipedia articles, we separate revisions into low- and high-quality revisions. On the basis of this assumption, an article can be in a low-quality (q = 0) or high-quality (q = 1) state. In order to assess the quality q of a revision, we take into account two factors: the reputation of the author and whether the revision has been reverted in one of the subsequent revisions. The reputation of a contributor is a value between 0 and 1 and can be viewed as the probability that he/she produces a contribution of high quality. This probability is computed from the stability of the past contributions of the user, using the methods developed in [36]. The heuristic behind this reputation assessment is that high-quality contributions tend to survive longer in the articles than low-quality contributions. This heuristic is also supported by other work [37, 53]. As Fig. 14.5 suggests, submission of a new revision can keep the article in its current state or move it to the other state. If the revision is reverted later in the article history, we consider the new state of the article to be q = 0. Otherwise, if the reputation of the author of that revision is r, then with probability r the new revision will have q = 1 and with probability 1 − r it will have q = 0. With all these elements in place, we define Q(T) as the ratio of high-quality revisions submitted for the article up to time T:

Q(T) = \frac{1}{n} \sum_{i=1}^{n} q(t_i)    (14.5)
Fig. 14.7 Distribution of Q(T) for featured and nonfeatured articles
where q(t_i) is the quality of the revision submitted at time t_i and n is the total number of revisions up to time T. Figure 14.7 shows the distribution of Q(T) for both featured and nonfeatured articles. While the average of Q(T) is relatively high for both featured and nonfeatured articles, it is higher for featured articles (74% vs. 65%). To estimate the proportion of time during which an article is in a high-quality state, we also define the duration QD(T) by

QD(T) = \frac{\sum_{i=1}^{n} (t_{i+1} - t_i)\, q(t_i)}{T - t_1}    (14.6)
The distributions of QD(T) for featured and nonfeatured articles are shown in Fig. 14.8. Figure 14.9 also shows the average and standard deviation of Q(T) and QD(T) for both featured and nonfeatured articles. Featured articles on average contain high-quality content 86% of the time. Interestingly, this value increases to 99% if we only consider the last 50 revisions of the articles. The same statistics for nonfeatured articles show that they contain high-quality content 74% of the time. The difference between the averages of Q(T) and QD(T) suggests that low-quality content typically has a short life span. This result is consistent with other studies reporting the rapid elimination of vandalism in Wikipedia [23, 45, 46]. For example, Kittur et al. [46] reported that about one third to one half of systematically inserted fictitious claims in Wikipedia are corrected within 48 h.
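Both measures are straightforward to compute from a revision log. The sketch below assumes revisions are given as (timestamp, quality) pairs with the quality value already assigned as described above, and that the state set by the last revision persists until T, following Eqs. (14.5) and (14.6):

```python
def quality_ratio(revisions):
    """Q(T): fraction of high-quality revisions submitted up to time T (Eq. 14.5).
    revisions: list of (timestamp, quality) pairs, quality in {0, 1}, sorted by time."""
    qualities = [q for _, q in revisions]
    return sum(qualities) / len(qualities)

def quality_duration(revisions, T):
    """QD(T): fraction of the article's lifetime spent in the high-quality state (Eq. 14.6).
    The state set by revision i is assumed to persist until revision i+1 (or until T)."""
    total = 0.0
    for i, (t_i, q_i) in enumerate(revisions):
        t_next = revisions[i + 1][0] if i + 1 < len(revisions) else T
        total += (t_next - t_i) * q_i
    return total / (T - revisions[0][0])

history = [(0, 1), (10, 0), (12, 1), (40, 1)]   # (time, quality) pairs
print(quality_ratio(history))         # 0.75
print(quality_duration(history, 50))  # (10 + 28 + 10) / 50 = 0.96
```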
Fig. 14.8 Distribution of QD(T) for featured and nonfeatured articles
Fig. 14.9 Average article quality for featured articles and nonfeatured articles. Quality is assessed by the average and the standard deviation of Q, QD, and QD50 for featured and nonfeatured articles. For each article, Q is the ratio of high-quality revisions. QD is the amount of time that an article spends in its high-quality state computed over its entire lifetime. QD50 is the value of QD when only considering the last 50 revisions of the article
Figure 14.10 shows the evolution of QD(T) as a function of T for both featured and nonfeatured articles of the same age. Overall, QD(T) tends to increase with T and its standard deviation decreases gradually.
Fig. 14.10 Evolution of article quality over time for same-age articles in Wikipedia
14.4.1 Parameters Affecting Quality

In this section, we present an empirical study of Wikipedia statistics that can explain the results in Sect. 14.4. First, we analyze user attribution in Wikipedia and compare the behavior of anonymous and registered users. Second, we compare the evolution of content in featured and nonfeatured articles to see which parameters result in higher quality in featured articles.
14.4.1.1 Anonymous vs. Registered Users
Wikipedia users can contribute to wiki pages either anonymously or as registered users. Registered users are identified by their usernames, while anonymous users are tracked by their IP addresses (http://en.wikipedia.org/wiki/Why_create_an_account). Although there is no one-to-one correspondence between people and accounts or IP addresses, Wikipedia uses usernames and IP addresses to track user behavior for further promotions (e.g., admin assignment) or demotions (e.g., user blocks). To investigate the effect of the open editing model of Wikipedia, we compare the behavior of anonymous and registered users to see if there is any correlation between registration and the quality of the contributed content. We, like others [23, 42, 54], follow the same nomenclature as Wikipedia: a "user" in this study refers to a registered account or an IP address, not to a real-world individual. Wikipedia keeps the past revisions of articles, and these revisions are accessible through the articles' history pages. These history pages can be mined to analyze the behavior of both registered and anonymous users in Wikipedia. Our first attempt to compare the behavior of anonymous and registered users was based on the revert actions performed on Wikipedia articles. A revert is the action of undoing all changes made to an article, restoring it to what it was at a specific time in the past. According to the Wikipedia revert policy (http://en.wikipedia.org/wiki/Wikipedia:Revert_only_when_necessary), reversion is used primarily to fight vandalism or similar activities such as spamming. Our study of all English Wikipedia reverts shows that 96% of reverts are done by registered users, while most of the reverted revisions are associated with anonymous users. Furthermore, 73% of the time a revert restores the current revision of an article to a recent revision submitted by a registered user. In order to have a more fine-grained analysis of user behavior, we compared the text of consecutive revisions to extract the insertions and deletions made by each user in each revision. The granularity of inserts and deletes is measured in terms of single tokens (words). The results show that 60.6% of the total inserted content is contributed by registered users. We also followed the evolution of articles and extracted the contributions made to each article over time. Using this method, we were able to determine the contributor of each single token in the last revision of each article. The results show that 84% of the current content of Wikipedia articles (i.e., content that has survived into the latest revisions) has been contributed by registered users. Another interesting observation is that 49.4% of the content contributed by registered users has been deleted over time, while this value is 85.2% for anonymous users. These observations show the high dynamics in the evolution of content in Wikipedia and the higher stability of registered contributions. Comparison of the reputation distributions for anonymous and registered users clearly shows that registered users tend to have higher reputation. The average reputation of registered users (as measured in [36]) is 59%, while that of
anonymous users is 49% [55]. Furthermore, 70% of the reverted revisions (vandalistic content) are associated with anonymous users. Together these results suggest that user registration has a positive effect on the quality of Wikipedia.
14.4.1.2 Featured vs. Nonfeatured Articles
By comparing content evolution in featured and nonfeatured articles, we aim to find out which parameters affect content quality and how the open editing model of Wikipedia lets featured articles attain high quality. In [20, 44], the authors compared featured and nonfeatured articles and concluded that, on average, featured articles benefit from a higher number of edits and distinct editors. In order to make a more detailed comparison between featured and nonfeatured articles, we examined the evolution of content in these articles and extracted the statistics shown in Table 14.5. Although 39.1% of the total inserted tokens in nonfeatured articles are contributed by anonymous authors, this figure drops to 15.2% when only the last revisions are considered. In the case of featured articles, the total percentage of tokens inserted by anonymous authors is 56.3%, with this figure dropping to 7.8% in the last revisions. According to these statistics, most of the remaining content in both featured and nonfeatured articles belongs to registered users, but this percentage is higher in featured articles. This observation, together with the result in Sect. 14.4.1.1, provides strong evidence for why featured articles contain higher quality content throughout their lifetime. Furthermore, the token survival ratios presented in Table 14.5 show a much higher turnover of content in featured articles over time; a higher ratio of tokens is deleted in featured articles than in other articles. This might be counterintuitive, as one might expect the content inserted in featured articles to be of higher quality and therefore more stable.
Table 14.5 Statistical comparison of featured and nonfeatured articles

                                           Featured articles                       Nonfeatured articles
  Total inserted tokens
    Registered                             43.7% (before: 50.9%, after: 36.7%)     60.9%
    Anonymous                              56.3% (before: 49.1%, after: 63.3%)     39.1%
  Tokens in the last revisions
    Registered                             92.2% (before: 93.1%, after: 88.2%)     83.9%
    Anonymous                              7.8% (before: 6.9%, after: 11.8%)       16.1%
  Token survival
    Registered                             23.1% (before: 33.8%, after: 8.7%)      50.9%
    Anonymous                              1.5% (before: 2.6%, after: 0.7%)        15.2%
  Ratio of reverted revisions              17.8% (before: 11.4%, after: 25.4%)     9.9%
  Ratio of revisions submitted by admins   17.4% (before: 20.7%, after: 14.2%)     10.9%

The statistics for featured articles consider revisions submitted before and after the articles were marked as featured.
However, it can be interpreted as higher dynamics in the evolution of the content of these articles, which allow only very high-quality content to survive. Note that in order to control for the increased visibility and attention that articles might gain after being marked as featured, we have also reported the results of our analysis both before and after the articles became featured. Featured articles can also be distinguished from other articles in terms of the proportion of reverted revisions. While, on average, 9.9% of the revisions in nonfeatured articles are reverted, this figure rises to 25.4% after an article becomes featured. The significant increase in the ratio of reversions after articles are marked as featured is a matter for further study; it may be due to more vandalism as a consequence of higher visibility, or it may reflect the fact that most featured articles have become mature and thus more resistant to change. In summary, we conclude that (a) featured articles are more closely followed: although less than 0.08% of the articles are marked as featured, they comprise about 1.4% of the total number of revisions; (b) Wikipedia administrators contribute more actively to featured articles even before these articles are marked as featured; (c) the revert ratio in featured articles is about 1.8 times higher than the ratio for nonfeatured articles; and (d) featured articles have a much higher turnover of content. This higher dynamism in an article's evolution allows very high-quality content to survive. It is interesting to note that even at this lower survival rate, featured articles are on average longer than other articles [38]. Overall, these statistics support the view that featured articles benefit from a higher degree of supervision than other articles.
14.5 Tools and Methods
In order to obtain the data for our study, we used five client machines for a period of 2.5 months during the summer of 2009 to send requests to the MediaWiki API and extract the data. By sending consecutive requests to the MediaWiki API, one can obtain the text of all revisions of each Wikipedia article. We needed the list of articles in the English Wikipedia to feed to the API in order to obtain article revisions. However, a significant number of Wikipedia articles had been redirected to other articles, so we ignored them. In order to obtain a clean list of Wikipedia articles, we used crawler4j [56] to crawl the English Wikipedia and extract the list of nonredirected articles. We started from the Wikipedia main page and some other seed pages and, by traversing the links, crawled about 1.9 million articles. We also used the MediaWiki API to extract different types of contributors, such as bots (programs or scripts that make automated edits without the need for human decision-making), admins, and blocked users. Table 14.6 shows the properties of the dataset.
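For readers who want to reproduce this kind of crawl, the sketch below pulls one article's revision history through the MediaWiki API (action=query, prop=revisions). It is a simplification of what the study's scripts would need; continuation handling, rate limiting, and redirect filtering are omitted, and the exact parameters should be checked against the current API documentation.

```python
import requests

API = "https://en.wikipedia.org/w/api.php"

def fetch_revisions(title, limit=50):
    """Fetch up to `limit` revisions (user, timestamp, text) of one article.
    Simplified sketch: real crawls must follow API continuation tokens and
    skip redirected articles, as described in the study above."""
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvprop": "user|timestamp|content",
        "rvlimit": limit,
        "rvslots": "main",
        "format": "json",
    }
    data = requests.get(API, params=params).json()
    page = next(iter(data["query"]["pages"].values()))
    return page.get("revisions", [])

# hypothetical usage: print who edited the article and when
for rev in fetch_revisions("Watershed", limit=5):
    print(rev["user"], rev["timestamp"])
```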
Table 14.6 Properties of the dataset

  Time span                96 months
  Number of users          12,797,391
    Registered users       1,749,146
    Anonymous users        11,048,245
  Number of articles       1,899,622
    Featured               2,650
    Good users             197,436
    Good                   7,502
    Good users             334,369
    For deletion           125
    Regular                1,889,345
  Number of revisions      123,938,034
    By registered users    82,577,828
    By anonymous users     41,360,206
A note about "users": it is virtually impossible to associate actual persons with internet behavior in a one-to-one fashion. To bypass this problem, Wikipedia defines two classes of users. An anonymous user is known only through his/her IP address. A registered user is associated with the username (i.e., nickname) entered during the registration process. We, like others [23, 42, 54], follow the same nomenclature as Wikipedia: a user in this study refers to a registered account or an IP address, and it does not refer to a real-world individual.
14.5.1 Extracting Reverts

A revert is an action that undoes all changes made to an article and is primarily used for fighting vandalism. To extract reverts, we compare the text of each revision to the text of the previous revisions. Since the text comparison process is computationally expensive, the comparison is done on the MD5 signature of the texts rather than on the texts themselves.
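A minimal sketch of this hashing trick (illustrative only, not the study's actual pipeline): digest every revision and flag a revision as a revert when its digest matches that of an earlier revision of the same article.

```python
import hashlib

def find_reverts(revision_texts):
    """Return (revert_index, restored_index) pairs for revisions that exactly
    restore an earlier revision. Comparing MD5 digests avoids repeatedly
    comparing full revision texts."""
    seen = {}       # digest -> index of first revision with that text
    reverts = []
    for i, text in enumerate(revision_texts):
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest in seen:
            reverts.append((i, seen[digest]))
        else:
            seen[digest] = i
    return reverts

history = ["first draft", "first draft vandalised!!!", "first draft"]
print(find_reverts(history))   # [(2, 0)] -- revision 2 restores revision 0
```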
14.5.2 Extracting Events

We consider an atomic event to be an insertion or deletion of a word. Insertions are extracted by comparing the text of each revision with the text of the previous revision; deletions are extracted by comparing the text in a revision with the text of all the subsequent revisions. We use the diff algorithm described in [50] for accurate extraction of atomic events. The advantage of this algorithm over most current diff algorithms is its ability to detect movements of blocks. The developed tool, named Wikipedia Event Extractor, is publicly available at [57]. We calculated Ri(T) for users by processing the extracted events.
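For illustration, the sketch below extracts word-level insertion and deletion events with Python's difflib. Unlike the algorithm of [50] used in the study, difflib does not detect block moves, so a moved paragraph shows up as a deletion plus an insertion.

```python
import difflib

def atomic_events(old_text, new_text):
    """Return word-level ('insert', token) and ('delete', token) events
    between two consecutive revisions."""
    old_tokens, new_tokens = old_text.split(), new_text.split()
    events = []
    matcher = difflib.SequenceMatcher(a=old_tokens, b=new_tokens)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op in ("delete", "replace"):
            events += [("delete", tok) for tok in old_tokens[i1:i2]]
        if op in ("insert", "replace"):
            events += [("insert", tok) for tok in new_tokens[j1:j2]]
    return events

print(atomic_events("the cat sat on the mat", "the cat slept on the mat"))
# [('delete', 'sat'), ('insert', 'slept')]
```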
14.6 Discussion and Conclusion
In this chapter, we summarized several studies of Wikipedia, the largest online collaborative information system, in order to analyze the issue of trust. We showed how a user's reputation can be modeled from his/her editing behavior, and how this reputation value can be used to assess the quality of collaborative content. One application of this work is in scientific mashups, especially when the content comes from collaborative information repositories where the quality of that content is unknown. In [58], we studied CalSWIM, a watershed scientific mashup whose content is taken both from highly reliable sources and from Wikipedia, which may be less so. We showed how integrating CalSWIM with the reputation management system can help assess the reputation of users and the trustworthiness of the content. Using user reputations, the system selects the most recent and trustworthy revision of a wiki article rather than merely the most recent revision, which might be vandalistic or of poor quality. This study of trust in the CalSWIM mashup indicates that showing users the most trustworthy recent revision of an article can be beneficial when fetching content from wikis. However, it is important to note that assessing the trustworthiness of content based only on the reputation of the contributor has some limitations:

- Data sparsity: for a considerable number of Wikipedia users, we do not have enough information for accurate reputation estimation. The models we use to estimate user reputation are based on the observed behavior of users and on how other users react to their contributions. Therefore, when a user is new to the system, we do not have a stable reputation estimate for him/her.
- Anonymity: a significant number of users contribute to Wikipedia articles anonymously and are identified only by their IP addresses. There is, however, only a loose correspondence between IP addresses and real-world users.
- Expertise: the quality of a user's contribution to a topic depends on the user's expertise on that topic. A single reputation value may not be a good representative of the quality of the user's contributions across different topics. In CalSWIM, we tried to alleviate this problem by estimating the reputation of users based only on their contributions to water-related articles.
In addition to the above limitations, there is no guarantee that users will not change their behavior in the future: a user who has contributed high-quality content in the past might contribute low-quality content in the future. Moreover, when a new user contributes high-quality content to an article, the system sacrifices freshness for trustworthiness simply because it does not yet have an accurate estimate of the user's reputation. This problem becomes worse for articles that are updated less frequently. In the case of our CalSWIM mashup, some articles are updated very infrequently. The average time span between the submission of the
last two revisions of articles is 29 days. However, our study of Wikipedia featured articles shows that the update rate of an article increases significantly as it gains more visibility [36]. Based on this observation, our conjecture is that mashups like CalSWIM can help these articles gain more visibility and thereby enjoy more frequent updates. To overcome the limitations caused by inaccurate user reputation, in future work we aim to process the changes made in newly submitted revisions of an article to ascertain whether or not they are vandalistic. Inspired by [59], we categorize Wikipedia vandalism types and build statistical language models, constructing distributions of words from the revision history of Wikipedia articles. As vandalism often involves the use of unexpected words to draw attention, the fitness (or lack thereof) of a new edit with respect to language models built from previous revisions may well indicate that the edit is the product of vandalism. One of the main advantages of this technique is that it is extendable, even to other Web 2.0 domains such as blogs.
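A minimal sketch of this planned direction, under the simplifying assumption of a unigram model with add-one smoothing (the eventual system may differ): build a word distribution from the article's previous revisions and score the words inserted by a new edit; unusually high perplexity suggests unexpected wording, a common trait of vandalism.

```python
import math
from collections import Counter

def build_unigram_model(previous_revisions, smoothing=1.0):
    """Word distribution over the article's history with add-one smoothing."""
    counts = Counter(w.lower() for text in previous_revisions for w in text.split())
    total = sum(counts.values())
    vocab = len(counts) + 1   # +1 bucket for unseen words
    return lambda w: (counts[w.lower()] + smoothing) / (total + smoothing * vocab)

def edit_perplexity(model, inserted_words):
    """Perplexity of the inserted words under the article's language model;
    high values mean the edit's wording fits the article's history poorly."""
    if not inserted_words:
        return 0.0
    log_prob = sum(math.log(model(w)) for w in inserted_words)
    return math.exp(-log_prob / len(inserted_words))

model = build_unigram_model(["the watershed drains into the bay",
                             "the watershed covers the northern bay area"])
print(edit_perplexity(model, ["watershed", "area"]))   # relatively low: fits the article
print(edit_perplexity(model, ["LOL", "u", "suck"]))    # higher: out-of-vocabulary words
```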
References

1. O'Reilly, T.: What is Web 2.0: Design patterns and business models for the next generation of software. Commun. Strat. 65, 17–27 (2007)
2. Alexander, B.: Web 2.0: A new wave of innovation for teaching and learning? Educause Rev. 41(2), 32–44 (2006)
3. Boulos, M., Maramba, I., Wheeler, S.: Wikis, blogs and podcasts: A new generation of web-based tools for virtual collaborative clinical practice and education. BMC Med. Educ. 6, 41 (2006)
4. Leuf, B., Cunningham, W.: The Wiki Way: Quick Collaboration on the Web. Addison-Wesley, Boston (2001)
5. Arazy, O., Stroulia, E.: A utility for estimating the relative contributions of wiki authors. In: International AAAI Conference on Weblogs and Social Media. [Online]. Available: http://www.aaai.org/ocs/index.php/ICWSM/09/paper/view/157 (2009)
6. Tapscott, D., Williams, A.: Wikinomics: How Mass Collaboration Changes Everything, pp. 70–77. Penguin Group, New York (2006)
7. Giles, J.: Internet encyclopaedias go head to head. Nature 438, 900–901 (2005)
8. Seigenthaler, J.: A false Wikipedia 'biography'. [Online]. Available: http://www.usatoday.com/news/opinion/editorials/2005-11-29-wikipedia-edit_x.htm (2005)
9. Shneiderman, B.: Designing trust into online experiences. Commun. ACM 43(12), 57–59 (2000)
10. Resnick, P., Zeckhauser, R.: Trust among strangers in Internet transactions: Empirical analysis of eBay's reputation system. In: Baye, M.R. (ed.) The Economics of the Internet and E-Commerce. Advances in Applied Microeconomics, vol. 11. Elsevier, Amsterdam (2002)
11. Jøsang, A., Ismail, R., Boyd, C.: A survey of trust and reputation systems for online service provision. Decis. Support Syst. 43(2), 618–644 (2007)
12. Hoisl, B., Aigner, W., Miksch, S.: Social rewarding in wiki systems – motivating the community. In: Schuler, D. (ed.) Proceedings of HCI International – 12th International Conference on Human-Computer Interaction (HCII 2007). LNCS, vol. 4564, pp. 362–371. Springer, Berlin (2007)
13. Reputation. [Online]. Available: http://en.wikipedia.org/wiki/Reputation
14. Corritore, C.L., Kracher, B., Wiedenbeck, S.: On-line trust: Concepts, evolving themes, a model. Int. J. Hum. Comput. Stud. 58(6), 737–758 (2003)
15. Knowledge smackdown: Wikipedia vs. Citizendium. [Online]. Available: http://www.storysouth.com/comment/2006/09/knowledge_smackdown_wikipedia.html
16. Alexa's top 10 websites. [Online]. Available: http://www.alexa.com/
17. Nov, O.: What motivates Wikipedians? Commun. ACM 50(11), 60–64 (2007)
18. Maslow, A.H.: Motivation and Personality. HarperCollins, New York (1987)
19. Wilkinson, D., Huberman, B.: Cooperation and quality in Wikipedia. In: WikiSym '07: Proceedings of the 2007 International Symposium on Wikis, pp. 157–164. ACM, New York, NY (2007)
20. Ganjisaffar, Y., Javanmardi, S., Lopes, C.: Review-based ranking of Wikipedia articles. In: Proceedings of the International Conference on Computational Aspects of Social Networks, June 2009
21. Anthony, D., Smith, S.W., Williamson, T.: Explaining quality in internet collective goods: Zealots and good Samaritans in the case of Wikipedia. Dartmouth College, Hanover, Tech. Rep. [Online]. Available: web.mit.edu/iandeseminar/Papers/Fall2005/anthony.pdf (2005)
22. Voss, J.: Measuring Wikipedia. In: Proceedings of the 10th International Conference of the International Society for Scientometrics and Informetrics, Stockholm, Sweden (2005)
23. Viégas, F.B., Wattenberg, M., Dave, K.: Studying cooperation and conflict between authors with history flow visualizations. In: CHI '04: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 575–582. ACM, New York, NY (2004)
24. Jøsang, A., Keser, C., Dimitrakos, T.: Can we manage trust? In: Proceedings of the 3rd International Conference on Trust Management (iTrust'05), May 2005
25. Ketchpel, S., Garcia-Molina, H.: Making trust explicit in distributed commerce transactions. In: Proceedings of the International Conference on Distributed Computing Systems, pp. 270–281 (1996)
26. Atif, Y.: Building trust in e-commerce. IEEE Internet Comput. 6(1), 18–24 (2002)
27. Xiong, L., Liu, L.: A reputation-based trust model for peer-to-peer ecommerce communities [extended abstract]. In: EC '03: Proceedings of the 4th ACM Conference on Electronic Commerce, pp. 228–229. ACM, New York, NY (2003)
28. Gutscher, A.: A trust model for an open, decentralized reputation system. In: Trust Management, pp. 285–300 (2007)
29. Evaluating a member's reputation. [Online]. Available: http://pages.ebay.com/help/feedback/evaluating-feedback.html (2005)
30. Squicciarini, A.C., Paci, F., Bertino, E.: Trust establishment in the formation of virtual organizations. In: ICDE Workshops, pp. 454–461 (2008)
31. Aringhieri, R., Damiani, E., Vimercati, S.D.C.D., Paraboschi, S., Samarati, P.: Fuzzy techniques for trust and reputation management in anonymous peer-to-peer systems: Special topic section on soft approaches to information retrieval and information access on the web. J. Am. Soc. Inf. Sci. Technol. 57(4), 528–537 (2006)
32. Ziegler, C.-N., Golbeck, J.: Investigating interactions of trust and interest similarity. Decis. Support Syst. 43(2), 460–475 (2007)
33. Liu, H., Lim, E., Lauw, H., Le, M., Sun, A., Srivastava, J., Kim, Y.A.: Predicting trusts among users of online communities: An Epinions case study. In: EC '08: Proceedings of the 9th ACM Conference on Electronic Commerce, pp. 310–319. ACM, New York, NY (2008)
34. Sabel, M., Garg, A., Battiti, R.: WikiRep: Digital reputation in virtual communities. University of Trento, Tech. Rep. [Online]. Available: http://eprints.biblio.unitn.it/archive/00000810/ (2005)
35. Cross, T.: Puppy smoothies: Improving the reliability of open, collaborative wikis. First Monday 11(9) (2006)
36. Javanmardi, S., Lopes, C., Baldi, P.: Mining Wikipedia to extract user reputation. J. Stat. Anal. Data Min. 2(3), 126–139 (2010)
37. Adler, B.T., de Alfaro, L.: A content-driven reputation system for the Wikipedia. In: WWW '07: Proceedings of the 16th International Conference on World Wide Web, pp. 261–270. ACM, New York, NY (2007)
38. Blumenstock, J.E.: Size matters: Word count as a measure of quality on Wikipedia. In: WWW '08: Proceedings of the 17th International Conference on World Wide Web, pp. 1095–1096. ACM, New York, NY (2008)
39. Dondio, P., Barrett, S.: Computational trust in web content quality: A comparative evaluation on the Wikipedia project. Informatica Int. J. Comput. Inform. 31(2), 151–160 (2007)
40. Lih, A.: Wikipedia as participatory journalism: Reliable sources? Metrics for evaluating collaborative media as a news resource. In: Proceedings of the 5th International Symposium on Online Journalism, April 2004
41. Stvilia, B., Twidale, M.B., Gasser, L.: Assessing information quality of a community-based encyclopedia. In: Proceedings of the International Conference on Information Quality, pp. 442–454, Nov 2005
42. Zeng, H., Alhossaini, M., Ding, L., Fikes, R., McGuinness, D.L.: Computing trust from revision history. In: Proceedings of the 2006 International Conference on Privacy, Security and Trust, Oct 2006
43. Wöhner, T., Peters, R.: Assessing the quality of Wikipedia articles with lifecycle. In: WikiSym '09: Proceedings of the 2009 International Symposium on Wikis. ACM, New York, NY, Oct 2009
44. Wilkinson, D., Huberman, B.A.: Assessing the value of cooperation in Wikipedia. First Monday 12(4) (2007)
45. Kittur, A., Suh, B., Pendleton, B.A., Chi, E.H.: He says, she says: Conflict and coordination in Wikipedia. In: CHI '07: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 453–462. ACM, New York, NY (2007)
46. Magnus, P.D.: Early response to false claims in Wikipedia. First Monday 13(9), Sep 2008
47. Hess, M., Kerrand, B., Rickards, L.: Wiki user statistics for regulating behaviour. Tech. Rep. [Online]. Available: http://icd.si.umich.edu/684/files/684%20wikistat%20paper%202.pdf (2006)
48. Leusch, G., Ney, H.: Bleusp, invwer, cder: Three improved MT evaluation measures. In: NIST Metrics for Machine Translation Challenge, Waikiki, Honolulu, Hawaii, Oct 2008
49. Leusch, G., Ueffing, N., Ney, H.: Cder: Efficient MT evaluation using block movements. In: Proceedings of EACL, pp. 241–248 (2006)
50. Heckel, P.: A technique for isolating differences between files. Commun. ACM 21(4), 264–268 (1978)
51. Tichy, W.F.: The string-to-string correction problem with block moves. ACM Trans. Comput. Syst. 2(4), 309–321 (1984)
52. Chatterjee, K., de Alfaro, L., Pye, I.: Robust content-driven reputation. School of Engineering, University of California, Santa Cruz, CA, USA, Tech. Rep. UCSC-SOE-08-09. [Online]. Available: http://www.soe.ucsc.edu/~luca/papers/08/ucsc-soe-08-09.pdf (2008)
53. Hu, M., Lim, E.-P., Sun, A., Lauw, H.W., Vuong, B.-Q.: Measuring article quality in Wikipedia: Models and evaluation. In: CIKM '07: Proceedings of the 16th ACM Conference on Information and Knowledge Management, pp. 243–252. ACM, New York, NY (2007)
54. Ekstrand, M., Riedl, J.: rv you're dumb: Identifying discarded work in wiki article history. In: WikiSym '09: Proceedings of the 2009 International Symposium on Wikis. ACM, New York, NY, Oct 2009
55. Javanmardi, S., Ganjisaffar, Y., Lopes, C., Baldi, P.: User contribution and trust in Wikipedia. In: Proceedings of the 5th International Conference on Collaborative Computing: Networking, Applications and Worksharing, Nov 2009
56. crawler4j. [Online]. Available: http://crawler4j.googlecode.com/
57. Wikipedia Event Extractor. [Online]. Available: http://mondego.calit2.uci.edu/WikipediaEventExtractor/
58. Javanmardi, S., Ganjisaffar, Y., Lopes, C., Grant, S.: Scientific mashups: The issue of trust in the aggregation of Web 2.0 content. In: WebScience 2010, Apr 2010
59. Lopes, R., Carriço, L.: On the credibility of Wikipedia: An accessibility perspective. In: WICOW '08: Proceedings of the 2nd ACM Workshop on Information Credibility on the Web, pp. 27–34. ACM, New York, NY (2008)