The Role of Digital Libraries in a Time of Global Change: 12th International Conference on Asia-Pacific Digital Libraries, ICADL 2010, Gold Coast, ... Applications, incl. Internet/Web, and HCI)
Lecture Notes in Computer Science Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board
David Hutchison, Lancaster University, UK
Takeo Kanade, Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler, University of Surrey, Guildford, UK
Jon M. Kleinberg, Cornell University, Ithaca, NY, USA
Alfred Kobsa, University of California, Irvine, CA, USA
Friedemann Mattern, ETH Zurich, Switzerland
John C. Mitchell, Stanford University, CA, USA
Moni Naor, Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz, University of Bern, Switzerland
C. Pandu Rangan, Indian Institute of Technology, Madras, India
Bernhard Steffen, TU Dortmund University, Germany
Madhu Sudan, Microsoft Research, Cambridge, MA, USA
Demetri Terzopoulos, University of California, Los Angeles, CA, USA
Doug Tygar, University of California, Berkeley, CA, USA
Gerhard Weikum, Max-Planck Institute of Computer Science, Saarbruecken, Germany
6102
Gobinda Chowdhury Chris Khoo Jane Hunter (Eds.)
The Role of Digital Libraries in a Time of Global Change
12th International Conference on Asia-Pacific Digital Libraries, ICADL 2010
Gold Coast, Australia, June 21-25, 2010
Proceedings
Volume Editors

Gobinda Chowdhury
University of Technology, Sydney
PO Box 123, Broadway, NSW 2007, Australia
E-mail: [email protected]

Chris Khoo
Nanyang Technological University
31 Nanyang Link, Singapore 637718
E-mail: [email protected]

Jane Hunter
The University of Queensland
Brisbane, QLD 4072, Australia
E-mail: [email protected]
Library of Congress Control Number: 2010927958
CR Subject Classification (1998): H.3, I.2, H.4, H.5, C.2, J.1
LNCS Sublibrary: SL 3 – Information Systems and Applications, incl. Internet/Web and HCI
ISSN 0302-9743
ISBN-10 3-642-13653-2 Springer Berlin Heidelberg New York
ISBN-13 978-3-642-13653-5 Springer Berlin Heidelberg New York
Preface

The year 2010 was a landmark in the history of digital libraries because for the first time the ACM/IEEE Joint Conference on Digital Libraries (JCDL) and the annual International Conference on Asia-Pacific Digital Libraries (ICADL) were held together, at the Gold Coast in Australia. The combined conferences provided an opportunity for digital library researchers, academics and professionals from across the globe to meet in a single forum to disseminate, discuss, and share their valuable research. For the past 12 years ICADL has remained a major forum for digital library researchers and professionals from around the world in general, and from the Asia-Pacific region in particular.

Research and development activities in digital libraries that began almost two decades ago have gone through some distinct phases: digital libraries have evolved from mere networked collections of digital objects to robust information services designed for both specific applications and global audiences. Consequently, researchers have focused on various challenges ranging from technical issues such as networked infrastructure and the creation and management of complex digital objects to user-centric issues such as usability, impact and evaluation. Simultaneously, digital preservation has emerged and remained a major area of influence for digital library research. Research in digital libraries has also been influenced by several socio-economic and legal issues such as the digital divide, intellectual property, sustainability and business models. More recently, Web 2.0 technologies have had a significant influence on digital library research. Of particular interest to digital library researchers are issues including social networks, mobile technology, social tagging, and social information retrieval and access. Issues surrounding free versus fee-based information services, related intellectual property issues, the Open Archives Initiative, Creative Commons, institutional repositories, and of late Google Books will also have a significant impact on digital libraries of the future. Keeping this in mind, the theme of this year's ICADL conference was "The Role of Digital Libraries in a Time of Global Change."

Given the large number of high-quality submissions, choosing the accepted papers for this conference proved a great challenge. Fortunately, a large number of international experts came forward to participate in the Program Committee and review papers. Following reviews of each submission by at least two, and in most cases three or four, experts, a total of 21 full papers, 9 short papers and 7 posters were selected for this conference volume. Through this collection of papers and posters, the volume provides a snapshot of current research and development activities in digital libraries around the globe in general, and in the Asia-Pacific region in particular. The papers have been arranged under the specific sub-themes of the conference, providing readers with an overview of current research activities in digital libraries today.

The conference was also significantly enriched by keynote papers from three world leaders in the field: Katy Borner of Indiana University, David Rosenthal of Stanford University and Curtis Wong of Microsoft Research. Prior to the conference, four tutorials were offered: Multimedia Information Retrieval, Introduction to Digital Libraries, Lightweight User Studies, and Managing Digital Libraries on the Web. In addition, four workshops were held on the final day following the main conference program: Music IR for the Masses, Digital Libraries for International Development, Global Collaboration of I-Schools, and Digital Libraries and Education.

On behalf of the ICADL 2010 Organizing Committee we would like to thank the authors for their submissions, the Program Committee for their outstanding reviews, and the editorial team members without whose cooperation and support it would not have been possible to prepare this conference volume. We would also like to convey our sincere thanks to our sponsors (Microsoft Research, SAGE Publications Asia Pacific Pte Ltd, ExLibris (Australia) Pty Ltd and the University of Queensland Library), the chairs and members of the Organizing Committee, and the University of Queensland for their generous help and support. Finally, we are thankful to Springer for publishing this conference volume.
June 2010
Gobinda Chowdhury Chris Khoo Jane Hunter
Organization
Organizing Committee

General Chair
Jane Hunter, The University of Queensland, Australia

Program Chairs
Gobinda Chowdhury, University of Technology, Sydney, Australia
Chris Khoo, Nanyang Technological University, Singapore
Chun Xiao Xing, Tsinghua University, China

Tutorial Chairs
Glenn Newton, National Research Council Canada
Tamara Sumner, University of Colorado, USA

Workshop Chairs
Stephen Downie, University of Illinois, USA
Shigeo Sugimoto, University of Tsukuba, Japan

Poster and Demo Chairs
Maureen Henniger, University of Technology, Sydney, Australia
Cecile Paris, CSIRO, Australia
University of British Columbia, Canada Nanyang Technological University, Singapore
Program Committee
Maristella Agosti, University of Padova, Italy
Jamshid Beheshti, McGill University, Canada
Jose Borbinha, IST/INESC-ID - Information Systems Group, Portugal
Christine L. Borgman, UCLA, USA
George Buchanan, Swansea University, UK
Hsinchun Chen, University of Arizona, USA
Hsueh-hua Chen, National Taiwan University, Taiwan
Sudatta Chowdhury, University of Technology, Sydney, Australia
Fabio Crestani, University of Lugano, Italy
Milena Dobreva, University of Strathclyde, UK
Schubert Foo, Nanyang Technological University, Singapore
Edward Fox, Virginia Tech, USA
Dion Goh, Nanyang Technological University, Singapore
Preben Hansen, Swedish Institute of Computer Science, Sweden
Hao-Ren Ke, National Taiwan Normal University, Taiwan
Ross Harvey, Simmons College, USA
Jessie Hey, University of Southampton, UK
Ian Ruthven, University of Strathclyde, UK
Peter Jacso, University of Hawaii, USA
Min-Yen Kan, National University of Singapore
Debal Kar, TERI, India
Ray Larson, University of California, USA
Yuan-Fang Li, The University of Queensland, Australia
Chern Li Liew, Victoria University, New Zealand
Ee-Peng Lim, Nanyang Technological University, Singapore
Gavin McCarthy, University of Melbourne, Australia
Michael Moss, Glasgow University, UK
Jin-Cheon Na, Nanyang Technological University, Singapore
Paul Nieuwenhuysen, Vrije Universiteit Brussel, Belgium
Michael Olsson, University of Technology, Sydney, Australia
Edie Rasmussen, University of British Columbia, Canada
Ingeborg Solvberg, Norwegian University of Science and Technology
Shigeo Sugimoto, University of Tsukuba, Japan
Hussein Suleman, University of Cape Town, South Africa
Yin-Leng Theng, Nanyang Technological University, Singapore
Shalini Urs, University of Mysore, India
Ross Wilkinson, Australian National Data Service
Vilas Wuwongse, AIT, Thailand
Ding Ying, Indiana University, USA
Table of Contents
Digital Libraries of Heritage Materials

A Visual Dictionary for an Extinct Language (p. 1)
Kyle Williams, Sanvir Manilal, Lebogang Molwantoa, and Hussein Suleman

A Scalable Method for Preserving Oral Literature from Small Languages (p. 5)
Steven Bird

Digital Folklore Contents on Education of Childhood Folklore and Corporate Identification System Design (p. 15)
Ya-Chin Liao, Kuo-An Wang, Po-Chou Chan, Yu-Ting Lin, Jung-I Chin, and Yung-Fu Chen

Ancient-to-Modern Information Retrieval for Digital Collections of Traditional Mongolian Script (p. 25)
Biligsaikhan Batjargal, Garmaabazar Khaltarkhuu, Fuminori Kimura, and Akira Maeda

Annotation and Collaboration

A Collaborative Scholarly Annotation System for Dynamic Web Documents - A Literary Case Study (p. 29)
Anna Gerber, Andrew Hyland, and Jane Hunter

The Relation between Comments Inserted onto Digital Textbooks by Students and Grades Earned in the Course
Akihiro Motoki, Tomoko Harada, and Takashi Nagatsuka

A Configurable RDF Editor for Australian Curriculum (p. 189)
Diny Golder, Les Kneebone, Jon Phipps, Steve Sunter, and Stuart A. Sutton

Thesaurus Extension Using Web Search Engines (p. 198)
Robert Meusel, Mathias Niepert, Kai Eckert, and Heiner Stuckenschmidt

Images and Retrieval

Preservation of Cultural Heritage: From Print Book to Digital Library - A Greenstone Experience
Henny M. Sutedjo, Gladys Sau-Mei Theng, and Yin-Leng Theng

Improving Social Tag-Based Image Retrieval with CBIR Technique
Choochart Haruechaiyasak and Chaianun Damrongrat
A Visual Dictionary for an Extinct Language
Kyle Williams, Sanvir Manilal, Lebogang Molwantoa, and Hussein Suleman
Department of Computer Science, University of Cape Town, Private Bag X3, Rondebosch, 7701
{kwilliams,smanilal,lmolwantaa,hussein}@cs.uct.ac.za
Abstract. Cultural heritage artefacts are often digitised in order to allow for them to be easily accessed by researchers and scholars. In the case of the Bleek and Lloyd dictionary of the |xam Bushman language, 14000 pages were digitised. These pages could not be transcribed, however, because the language and script are both extinct. A custom digital library system was therefore created to manage and provide access to this collection as a purely “visual dictionary”. Results from user testing showed that users found the system to be interesting, simple, efficient and informative. Keywords: Digital preservation, repository, dictionary, Web interfaces.
1 Introduction
Cultural heritage in South Africa, like in most developing countries, is subject to degradation and inaccessibility. There is a clear need to preserve cultural heritage and make it accessible for the future. A snapshot of South African heritage would be incomplete without mentioning the Bushman people - one of the oldest known ethnic groups in the world. With the rapid influence of Western culture, there are now only a handful of these Bushmen people left in South Africa [1]. It is estimated that in a few years the entire generation of Bushman will have passed on, thereby creating a need to preserve whatever artefacts and knowledge exist from the Bushmen people in order to allow for them to be accessed in the future. The Bleek and Lloyd Collection is a collection of artefacts that document the life, language and culture of the Bushman people of Southern Africa. The collection is primarily made up of notebooks that contain Bushman stories, narratives and artwork. Included in this collection is a dictionary that contains English words and their corresponding |xam Bushman language translations. This dictionary can be used to assist researchers in understanding and interpreting the |xam Bushman language. However, the script used to represent the |xam language can not be represented using modern data encoding techniques. An accessible and preservable archive for a “visual dictionary,” where words are represented by images and definitions are extracted from viewing the images, known as the Bushman OnLine Dictionary (BOLD), was therefore built. Using this visual dictionary, users are able to browse, search and interact with the |xam words based on their English translations.
2 Related Work
The Native Languages of the Americas is a non-profit organisation that works towards preserving Native American languages by making use of Web technology [2]. The Nuer Field Notes Project is an attempt to preserve and make available a set of linguistic field notes recorded by Eleanor Vandevort, a missionary in South Sudan between 1949 and 1963, using modern data encoding techniques [3]. In addition to these specific attempts at preserving languages, there also are a number of general preservation attempts, such as the Contemporary African Music and Art Archive [4] and the Armarius archive [5]. In 2007, Suleman [1] devised an XML-centric approach to manage the notebooks and artwork in the Bleek and Lloyd Collection, showing the XML-centric approach to be more efficient than the traditional database model. All of the above-mentioned systems allow users to readily access the collection of digital cultural artefacts. However, none of these systems preserve and make accessible a dictionary, which cannot be represented using modern data encoding techniques, as a live reference for researchers and other people who access the archives.
3 Design and Implementation
The BOLD system was built using Fedora Commons as an underlying digital repository system. Figure 1 shows the Web interface with which users interact with the visual dictionary. The Web interface shows the three classes of digital objects that make up the collection: envelopes, slips and inserts. For every word in the dictionary there is one or more envelopes, and each envelope contains one slip and one or more inserts. The |xam Bushman words and their corresponding English translations are written on the inserts. The following means of interaction with the visual dictionary are provided by the interface: Core services. Browsing and searching the visual dictionary based on the English translations of |xam words is the core functionality provided by the Web interface. The results of browsing and searching the visual dictionary should be such that they can assist researchers in understanding and interpreting the |xam language, even though the language cannot be represented using modern data encoding techniques. Users are able to browse the words in the dictionary by their English initial letters, or via a scrollable list. Users also can make use of an AJAX live search to find specific words in the dictionary. Enhancement for understanding. There are a number of services that attempt to assist the user in their understanding of the collection. The first of these services comes in the form of links to the notebooks of the Bleek and Lloyd Collection for each word displayed, thereby assisting in contextualising words. In addition to this, the definition of a word appears below it when it is clicked on and spelling correction suggestion takes place when a search is performed. Enhancement for experience. To improve the experience of the user when interacting with the collection, a history of all the words that have been searched
Fig. 1. The website users use to interact with the collection
for is stored in a session cookie and there is an image zoom function that zooms an image to the size of the browser viewport when the image is clicked on, thereby allowing for closer inspection of the image. Having briefly discussed the design and implementation of the BOLD visual dictionary, the next section will present an evaluation of the system.
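The object structure described above (one or more envelopes per English headword, each envelope holding one slip and one or more inserts) and the browse and live-search services can be pictured with a minimal sketch. The class names, fields and search functions below are illustrative assumptions for this volume, not the actual BOLD code or the Fedora Commons object model.

```python
# Minimal sketch of the visual-dictionary structure described in the text.
# Class and field names are illustrative; the real BOLD repository stores
# these as Fedora Commons digital objects rather than Python objects.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Insert:
    image_path: str           # scanned insert bearing the |xam word
    english_translation: str  # English gloss written on the insert

@dataclass
class Slip:
    image_path: str           # scanned dictionary slip

@dataclass
class Envelope:
    headword: str             # English headword used for browsing/searching
    slip: Slip
    inserts: List[Insert] = field(default_factory=list)

def browse_by_initial(envelopes, letter):
    """Browse envelopes whose English headword starts with a given letter."""
    return [e for e in envelopes if e.headword.lower().startswith(letter.lower())]

def live_search(envelopes, query):
    """Simple substring match, standing in for the AJAX live search."""
    q = query.lower()
    return [e for e in envelopes if q in e.headword.lower()]

# Example: one word with a single envelope, slip and insert
collection = [Envelope("antelope", Slip("env001/slip.jpg"),
                       [Insert("env001/insert1.jpg", "antelope")])]
print([e.headword for e in live_search(collection, "ante")])
```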
4 Evaluation
The visual dictionary was evaluated using usability testing in which 22 users participated. User evaluation was conducted on a one-on-one basis in which users first answered a pre-test questionnaire. Users then were given an introduction to the Bleek and Lloyd Collection to read, after which they were asked to perform three tasks. They were then given free rein to experiment with the system and, lastly, were asked to answer a post-test questionnaire. The key findings from the evaluation are summarised here. 81% of users were 100% satisfied with the way in which they could search for a single word and 60% of users were 100% satisfied with the way they could search for multiple words. 72% of users were 100% satisfied with the way in which they could browse the system. Users noted that when searching for multiple words the system became slow and was not intuitive. However, overall the results show that the core services of searching and browsing were well received and resulted in high user satisfaction. Users felt that the system was easy to navigate and that there was adequate information to help with navigation. Users also found the dictionary to be generally informative and useful and felt that they could find information quickly. Users felt that their errors were easy to correct while 17 users agreed that the system furthered their knowledge in African Cultural Heritage. All users agreed that the system is useful for researching the |xam language. Users found the
enhancement for understanding useful and were surprised by the apparent complexity of the |xam language script compared to English. Two users expressed that they felt like they did not know what they could do with the system. All but 2 users felt that the system was slow. The most appealing aspects of the visual dictionary were: the thumbnail view when browsing; the multiple ways of searching and the AJAX search; the simple layout, lack of clutter, pretty design, ease of use and easy navigation; the quick access to resources; high resolution images; rich information; and links to the Bleek and Lloyd Collection. The least appealing aspects of the visual dictionary were: that it was slow; the links to the Bleek and Lloyd notebooks were not intuitive; the interface was too simple; and the correction of misspelled words to suggested words that were not in the dictionary. The most prominent words used to describe the visual dictionary were: interesting, simple, effective and informative. Evaluation showed that users were generally happy with the system and felt as if they could use it in meaningful ways. In this sense, the BOLD Project shows that it has potential in meeting its goal of assisting researchers in interpreting and understanding the |xam Bushman language.
5 Conclusions
The Bushman OnLine Dictionary (BOLD) is a cultural heritage archive system for providing access to a visual dictionary for the |xam Bushman language, which forms part of the Bleek and Lloyd Collection. The system was built to assist researchers and scholars in understanding and interpreting |xam Bushman texts. The system contains core services for browsing and searching the archive, as well as services for enhancing user understanding and enhancing user experience. Evaluation showed that users had a positive experience using the system and were pleasantly surprised by many of its features. The BOLD Project is a first step in building a system that allows for meaningful interaction with a visual dictionary and has set the stage for future work to be done in this area.
References
1. Suleman, H.: Digital Libraries Without Databases: The Bleek and Lloyd Collection. In: Kovács, L., Fuhr, N., Meghini, C. (eds.) ECDL 2007. LNCS, vol. 4675, pp. 392–403. Springer, Heidelberg (2007)
2. Native Languages of the Americas: Preserving and promoting American Indian languages, http://www.native-languages.org/
3. Nuer Field Notes, http://www.dlib.indiana.edu/collections/nuer/
4. Marsden, G., Malan, K., Blake, E.: Using digital technology to access and store African art. In: CHI 2002 Extended Abstracts on Human Factors in Computing Systems (CHI 2002), pp. 258–259. ACM, New York (2002)
5. Doumat, R., Egyed-Zsigmond, E., Pinon, J., Csiszar, E.: Online ancient documents: Armarius. In: Proceedings of the Eighth ACM Symposium on Document Engineering (DocEng 2008), pp. 127–130. ACM, New York (2008)
A Scalable Method for Preserving Oral Literature from Small Languages
Steven Bird
Dept of Computer Science and Software Engineering, University of Melbourne
Linguistic Data Consortium, University of Pennsylvania
Abstract. Can the speakers of small languages, which may be remote, unwritten, and endangered, be trained to create an archival record of their oral literature, with only limited external support? This paper describes the model of “Basic Oral Language Documentation”, as adapted for use in remote village locations, far from digital archives but close to endangered languages and cultures. Speakers of a small Papuan language were trained and observed during a six week period. Linguistic performances were collected using digital voice recorders. Careful speech versions of selected items, together with spontaneous oral translations into a language of wider communication, were also recorded and curated. A smaller selection was transcribed. This paper describes the method, and shows how it is able to address linguistic, technological and sociological obstacles, and how it can be used to collect a sizeable corpus. We conclude that Basic Oral Language Documentation is a promising technique for expediting the task of preserving endangered linguistic heritage.
1 Introduction
Preserving the world’s endangered linguistic heritage is a daunting task, far exceeding the capacity of existing programs that sponsor the typical 2-5 year “language documentation” projects. In recent years, digital voice recorders have reached a sufficient level of audio quality, storage capacity, and ease of use, to be used by local speakers who want to record their own languages. This paper investigates the possibility of putting the language preservation task into the hands of the speech community. With suitable training, they can be equipped to record a variety of oral discourse genres from a broad cross-section of the speech community, and then provide additional content to permit the recordings to be interpreted by others who do not speak the language. The result is an audio collection with time-aligned translations and transcriptions, a substantial archival resource. This paper describes a method for preserving oral discourse, originating in field recordings made by native speakers, and generating a variety of products including digitally archived collections. It addresses the problem of unwritten languages being omitted from various ongoing efforts to collect language resources for ever larger subsets of the world’s languages [1]. The starting point is Reiman’s work [2], modified and refined so that it uses appropriate technology
for Papua New Guinea, and so that it can scale up easily. The method has been tested with Usarufa, a language of Papua New Guinea. Usarufa is spoken by about 1200 people, in a cluster of six villages in the Eastern Highlands Province, about 20km south of Kainantu (06°25’S, 145°39’E). There are probably no fluent speakers of Usarufa under the age of 25; only the oldest speakers retain the rich vocabulary for animal and plant species, and for a variety of cultural artefacts and traditional practices. Some texts including the New Testament and a grammar have been published in Usarufa [3]. However, only a handful of speakers are literate in the language.
2 Basic Oral Language Documentation

2.1 Audio Capture
The initial task in Basic Oral Language Documentation (BOLD) is audio capture. Collecting the primary text from individual speakers is straightforward. They press the record button, hold the voice recorder a few inches from their mouth, and begin by giving their name, the date, and location. The person operating the recorder may or may not be the speaker.
Fig. 1. Informal Recording of Dialogue and Personal Narrative
Collecting a dialogue involves two speakers plus someone to operate the voice recorder (who may be a dialogue participant). The operator can introduce the recording and hold the recorder in an appropriate position between the participants. The exchange shown in Fig. 1 involved a language worker (left), the author, and a village elder. The dialogue began with an extended monologue from the man on the left, explaining the purpose of the recording and asking the other man to recount a narrative, followed by some conversation for clarification, followed by an extended monologue from the man on the right. The voice recorder was moved closer to the speaker during these extended passages, but returned to the centre during conversational sections. In most cases, the person operating the recorder was a native speaker of Usarufa, and was also participating in the dialogue. The operator was instructed not to treat the recorder like a hand-held microphone, moved deliberately between an interviewer and interviewee to signal turns in the
conversation. Instead, the recorder was to be held still, and usual linguistic cues were to be used for marking conversational turns. A configuration which was not tried would be to have separate lapel microphones, one per speaker, connected to the digital voice recorder via a splitter jack. However, this would have involved four pieces of equipment (recorder, splitter, and two microphones), and increased risks of loss, incorrect use, and degraded signal quality.

2.2 Oral Annotation and Text Selection
The oral literature collected in the first step above has several shortcomings as an archival resource. Most obviously, its content is only accessible to speakers of the language. If the language falls out of use, or if knowledge of the particular word meanings of the texts is lost, then the content becomes inaccessible. Thus, it is important to provide a translation. Fortunately, most speakers of minority languages also speak a language of wider communication, and so they can record oral translations of the original sources. This can be done by playing back the original recording, pausing it regularly, and recording a translation on a second recorder, a process which is found to take no more than five minutes for each minute of source material. A second shortcoming is that the original speech may be difficult to make out clearly by a language learner or non-speaker. The speech may be too fast, the recording level may be too low, and background noise may obscure the content. Often the most authentic linguistic events take place in the least controlled recording situations. In the context of recording traditional narratives, elderly speakers are often required; they may have a weak voice or few teeth, compromising the clarity of the recording. These problems are addressed by having another person “respeak” the original recording, to make a second version [2,4]. This is done at a slower pace, in a quiet location, with the recorder positioned close to the speaker. This process has also been found to take no more than five minutes for each minute of source material. A third shortcoming is that the original collection will usually be unbalanced, having a bias towards the kinds of oral literature that were the easiest to collect. While it is possible to aim for balance during the collection process, one often cannot predict which events will produce the best recordings. Thus, it is best to capture much more material than necessary, and only later create a balanced collection. Given that the respeaking and oral translation take ten times real time, we suggest that only 10% of the original recordings are selected. This may be enough for a would-be interpreter in the distant future to get a sufficient handle on the materials to be able to detect structure and meaning in the remaining 90% of the collection. The texts are identified according to the following criteria: 1. cultural and linguistic value: idiomatic use of language, culturally significant content, rich vocabulary, minimal code-switching 2. diversity: folklore, personal narrative, public address, dialogue (greeting, discussion, instruction, parent-child), song 3. recording quality: clear source recording, minimal background noise
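A rough worked example, using the figures quoted above (respeaking and oral translation each taking about five minutes per minute of source, and a 10% selection), illustrates why the selection step keeps the workload manageable. The numbers below are assumptions drawn from the text, not measurements from the pilot study.

```python
# Rough effort estimate using the figures quoted in the text;
# the values are assumptions taken from the description, not measurements.
primary_hours = 10.0      # hours of primary recordings collected
selection_ratio = 0.10    # fraction selected for oral annotation
respeak_factor = 5        # minutes of respeaking per minute of source
translate_factor = 5      # minutes of oral translation per minute of source

selected_hours = primary_hours * selection_ratio
annotation_hours = selected_hours * (respeak_factor + translate_factor)
print(f"Selected for annotation: {selected_hours:.1f} hours")
print(f"Respeaking + translation effort: {annotation_hours:.1f} hours")
# -> 1.0 hours selected, 10.0 hours of annotation for 10 hours of recordings
```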
2.3 Recommended Protocol for Oral Annotation
The task of capturing oral transcriptions and translations onto a second recorder offers an array of possibilities. After trying several protocols, we settled on the one described here. The process requires two native speakers with specialised roles, the operator and the talker, and is illustrated in Fig. 2.
Fig. 2. Protocol for Respeaking and Oral Translation: the operator (left) controls playback and audio segmentation; the talker (right) provides oral annotations using a second recorder
Once a text has been selected, it is played back in its entirety, and the two language workers discuss any aspect of the recording which is problematic. For instance, older people captured in the recordings may have used near-obsolete vocabulary unknown to younger language workers. This preview step is also important as the opportunity to experience the context of the text; sometimes a text is so enthralling or amusing that the oral annotators are distracted from their work. When they are ready to begin recording, the operator holds the voice recorder close to the talker, with the playback speaker (rear of recorder) facing the talker. The talker holds the other recorder about 10cm from his/her mouth, turns it on, checks that the recording light came on, and then introduces the recording, giving the names of the two language workers, the date and location, and the identifier of the original recording. For the respeaking task, the operator pauses playback every 5-10 words (2-3 seconds), with a preference for phrase boundaries. For the translation task, the operator pauses playback every sentence or major clause (5-10 seconds), trying to include complete sense units which can be translated into full sentences. The talker leaves the second recorder running the whole time, and does not touch the controls. This recorder captures playback of the original recording, along with the respoken version or the translation. The operator monitors the talker’s speech, ensuring that it is slow, loud, and accurate. The operator uses agreed hand signals to control the talker’s speed and volume, and to ask for the phrase to
be repeated. When necessary, the talker is prompted verbally with corrections or clarifications, and any interactions about the correct pronunciation or translation are captured on the second recorder. Once the work is complete, recording is halted, and the logbooks for both recorders are updated.

2.4 Logbooks
For each primary text, the language workers note the date, location, participant names, topic, and genre, using the logbook provided with each recorder. Genre is coded using the OLAC Discourse Type vocabulary [5]. If there is any major problem during the original recording, a fresh recording is started right away. Pausing to delete files is a distraction, draws attention to the device, and is prone to error. The recorder has substantial capacity and extraneous recordings can easily be filtered out later during the selection process.
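A logbook entry of the kind described here could later be keyboarded into a simple structured record. The field names and values below are hypothetical, with the genre drawn from the OLAC Discourse Type vocabulary cited above.

```python
# Hypothetical digital form of a logbook entry; field names are assumptions.
# Genre values follow the OLAC Discourse Type vocabulary (e.g. "narrative",
# "dialogue", "oratory", "singing").
logbook_entry = {
    "recorder_id": "REC-03",          # engraved identifier of the voice recorder
    "file_id": "C01",                 # folder letter + file number as displayed
    "date": "2009-05-12",
    "location": "Moife village",
    "participants": ["speaker name", "operator name"],
    "topic": "traditional gardening practices",
    "genre": "narrative",             # OLAC discourse type
}

def to_olac_stub(entry):
    """Sketch of mapping a logbook entry toward an OLAC/Dublin Core record."""
    return {
        "dc:date": entry["date"],
        "dc:coverage": entry["location"],
        "dc:contributor": entry["participants"],
        "dc:subject": entry["topic"],
        "olac:discourse-type": entry["genre"],
    }
```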
Fig. 3. Metadata Capture in Village: (a) creating metadata by listening to the opening of each recording; (b) scanned page showing file identifier, participants, topic and genre (date and location were already known in this case)
2.5 Summary
Fig. 4 summarises the process, assuming 10 hours (100k words) of primary recordings are collected. It includes a third stage – not discussed above – involving pen and paper transcription using any orthography or notation known
Fig. 4. Overview of Basic Oral Language Documentation
to the participants, such as the orthography of the language of wider communication. Such transcripts, while imperfect, serve as a finding aid and as a clue to linguistically salient features such as sound contrasts and word breaks. A separate archiving process involves occasional backup of recorders onto a portable mass storage device, keyboarding texts and metadata, and converting audio files to a non-proprietary format, steps that typically require outside support.
3 Pilot Study in Papua New Guinea
The above protocol was developed during a pilot study in April-June 2009. Bird and Willems trained a group of language workers in the village for one week, then left them to do oral literature collection and oral annotation for a month, then brought them into an office environment to work on further oral annotation and textual transcription. In this section, the activities are briefly described and the key findings are reported.

3.1 Activities
Village-based training. Teachers, literacy workers, and other literate community members were gathered for a half-day training session. It took place in the literacy classroom in Moife village, with everyone sitting on the floor in a circle. We explained the value of preserving linguistic heritage and demonstrated the operation of the voice recorders. Participants practiced using the recorders and were soon comfortable with the controls and with hearing their own voices. Next, participants took turns to record a narrative while the rest of the group observed. Later, we demonstrated the oral annotation methods and the participants practiced respeaking and oral translation. The four recorders were loaned out, and participants were asked to collect oral literature during the evening and the next day, and to return the following day to review what they collected and to continue practicing oral annotation. A further five days were spent doing collection and annotation under the supervision of Bird and Willems. Village-based collection and oral annotation. In the second stage, we sent the digital voice recorders and logbooks back to the village for two 2-week periods. This would assess whether the training we provided was retained. Could the participants find time each day for recording activities? Could they meet with an assigned partner to do the oral annotation work using a pair of recorders? Could they maintain the logbooks? Apart from reproducing the activities from the first stage, they were asked to broaden the scope of the work in three ways. First, they were to collect audio in a greater range of contexts (e.g. home, market, garden, church, village court) and a greater range of genres (e.g. instructional dialogue, oratory, child-directed speech). They were to include a wider cross-section of the community, including elderly speakers and children, and to go to the other villages where Usarufa is spoken, up to two hours walk away. Finally, they were asked to train another person in collecting oral discourse and maintaining the logbook, then entrust the recorder to that person.
Town-based oral annotation and transcription. In the third stage, we asked the language workers to come to Ukarumpa, a centralized Western setting 20km away, near Kainantu, with office space and mains electricity. This provided a clean and quiet environment for text selection and oral annotation, plus the final step of the BOLD protocol: writing out the transcriptions and translations for a selection of the materials. The town context also permitted us to explore the issue of informed consent. Four speakers saw how it was possible to access materials for other languages over the Internet (see Fig. 5), and even listen to recordings of dead languages. As community leaders, they gave their written consent for the recorded materials to be placed in a digital archive with open access.
Fig. 5. Experiencing the Web and Online Access to Archived Language Data
3.2 Findings
The findings summarized here include many issues that were encountered early on in the pilot study but resolved in time for the town-based stage, leading to the protocol described in Section 2 above. Recording. The Usarufa speakers had no difficulty in operating the recorders and collecting a wide variety of material. The built-in microphone and speaker avoided the need for any auxiliary equipment. The clear display and large controls were ideal, and the small size of the device meant it could be hidden in clothing and carried safely in crowded places. We gave out four recorders for periods of up to two weeks, and some were lent on to others, but none were lost or damaged. Many members of the speech community were willing to be recorded, though some speakers spoke in a stilted manner once the recorder was turned on, and others declined to be recorded unless they were paid a share of what they assumed the language workers were being paid per recording. Respeaking. Talkers usually adopted the fast tempo of the original recording, in spite of requests to produce careful speech. When the audio segment was long, they sometimes omitted words or gave a paraphrase. Texts from older people presented difficulties for younger speakers who did not always know all
the vocabulary items. These problems were resolved by having a second person control playback and monitor speed and accuracy of the respoken version, and by having both people listen through the recording first, to discuss any problematic terms or concepts. Oral Translation. A key issue was the difficulty in translating specialised vocabulary into the language of wider communication (Tok Pisin). For example, the name of a tree species might be translated simply as diwai (tree), or sampela kain diwai (some kind of tree). They were asked to mention any salient physical or cultural attributes of the term the first time it was encountered in a text. Another problem arose as a consequence of using the transcriber to control playback. The translator sometimes paused mid translation, in order to compose the rest of the translation before speaking. This pause was sometimes mistaken for the end of the translation, and the transcriber would resume playback. Occasionally, the resumed translation and resumed playback overlapped (just like when two people might start speaking simultaneously after a brief pause in conversation). This problem is solved by having the translator nod to the operator when s/he is finished translating a segment. Segmentation. Fundamental to respeaking and translation is the decision about where to pause playback of the original recording. While listening to playback, one needed to anticipate phrase boundaries in order to press the pause button. Older participants, or those with less manual dexterity, tended to wait until they heard silence before deciding to pause playback, by which time the next sentence had started. These problems were largely resolved once we adopted the practice of having participants review and discuss recordings before starting oral annotation, and simply through practice (e.g. about an hour of doing oral annotation). Metadata. Each participant was able to document their recordings in the supplied logbook. There was some variability in how the participants interpreted the instructions, resolved in later work by attaching a fold-out flyer with examples in the back of the logbooks. It was easy for anyone to check the state of completeness of the metadata by pressing the folder button to cycle through the five folders, and checking the current file number against the corresponding page of the exercise book. At the end of the pilot study, the logbooks were scanned and converted to PDF format for archiving. These scans are the basis for creating OLAC metadata records [6,7]. Archiving. The contents of the recorders were transferred to a computer via a USB cable. We had engraved unique identifiers on the recorders, but the filenames inside each recorder were identical, and care had to be taken to keep them separate on disk. A more pernicious problem was that the file names displayed on the recorder (e.g. folder C, file 01) did not correspond to the names inside the device, where file numbers were in time order and not relative to the folder. For example, C01 could have filename VN52017 (which means that it is the 17th file
on the recorder, even though it is the first file in Folder C). Thus, the identifier for the audio file (machine id, folder letter, file number) should be spoken at the start of each recording. (Care must be taken to only read out the file number once recording has started.) A selection of the audio files were burnt on audio CD for use back in the village, and the complete set of recordings are being prepared for archiving with PARADISEC [8], and with the Institute of PNG Studies in Port Moresby.
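One way to avoid the mismatch between displayed and internal file names during transfer is to rename each file with the recorder's engraved identifier, folder letter and displayed number. The folder layout, file extension and naming pattern in the sketch below are assumptions for illustration only; this is not the actual archiving workflow used in the pilot study.

```python
# Sketch of renaming transferred audio files so that the archived name carries
# the machine id, folder letter and displayed file number described in the text.
# Paths, the file extension and the folder layout are assumptions, not the
# actual recorder firmware behaviour.
import shutil
from pathlib import Path

def archive_copy(source_dir: Path, dest_dir: Path, machine_id: str, folder: str):
    """Copy one folder's recordings, prefixing each with a unique identifier."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    files = sorted(source_dir.glob("*.WMA"))       # internal names, e.g. VN52017.WMA
    for displayed_number, src in enumerate(files, start=1):
        new_name = f"{machine_id}-{folder}{displayed_number:02d}{src.suffix}"
        shutil.copy2(src, dest_dir / new_name)     # e.g. REC03-C01.WMA

# archive_copy(Path("/media/recorder3/FOLDER_C"), Path("archive/usarufa"),
#              machine_id="REC03", folder="C")
```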
4 Conclusions and Further Work
This paper has described a method for preserving oral literature that has been shown to work effectively for a minority language in the highlands of Papua New Guinea. Using appropriate technology and a simple workflow, people with no previous technical training were able to collect a significant body of oral literature (30 hours), and provide oral annotations and textual transcriptions for a small selection. Much of the collection and annotation work could happen in the evenings, when people were sitting around their kitchen houses lit only by the embers of a fire and possibly a kerosene lantern. At US $50 per recorder, it was easy to acquire multiple recorders, and little was risked when they were given out to people to take away for days at a time. Note that the important matters of access and use have not been addressed here (cf. [9,10]). This approach to language documentation has several benefits. It harnesses the voluntary labour of interested community members who already have access to a wide range of natural contexts where the language is used, and who decide what subjects and genres to record, cf. [11], and who are in an excellent position to train others. They are also able to move around the country to visit other language groups far more easily than a foreign linguist could. As owners of the project they may be expected to show a higher level of commitment to the task, enhancing the quality and quantity of the collected materials. The activities easily fit alongside language development activities, adding status and substance to those activities, and potentially drawing a wider cross-section of the community into language development. Limited supervision by a trained linguist/archivist is required between the initial training and the final archiving. Metadata can be collected alongside the recording activities in a simple logbook which accompanies the voice recorder, and then captured for the later creation of electronic metadata records. The whole process is able to sit alongside ongoing language documentation and development activities (and there is no suggestion that it supplant these activities). Building on the success of the pilot study, a much larger effort is underway in 2010, involving 100 digital voice recorders donated by Olympus Imaging Corporation, in collaboration with the University of Goroka, the University of PNG, Divine Word University, the Institute of PNG Studies, and the Summer Institute of Linguistics, with sponsorship from the Firebird Foundation for Anthropological Research (http://boldpng.info/).
Acknowledgments
I am indebted to staff of the Summer Institute of Linguistics at Ukarumpa, especially Aaron Willems, for substantial logistical and technical support.
References
1. Maxwell, M., Hughes, B.: Frontiers in linguistic annotation for lower-density languages. In: Proceedings of the Workshop on Frontiers in Linguistically Annotated Corpora 2006, pp. 29–37. Association for Computational Linguistics, http://www.aclweb.org/anthology/W06-0605
2. Reiman, W.: Basic oral language documentation. Presentation at the First International Conference on Language Documentation and Conservation (2009)
3. Bee, D.: Usarufa: a descriptive grammar. In: McKaughan, H. (ed.) The Languages of the Eastern Family of the East New Guinea Highland Stock, pp. 324–400. University of Washington Press (1973)
4. Woodbury, A.C.: Defining documentary linguistics. In: Austin, P. (ed.) Language Documentation and Description, vol. 1, pp. 35–51. SOAS, London (2003)
5. Johnson, H., Aristar Dry, H.: OLAC discourse type vocabulary (2002), http://www.language-archives.org/REC/discourse.html
6. Bird, S., Simons, G.: Extending Dublin Core metadata to support the description and discovery of language resources. Computers and the Humanities 37, 375–388 (2003), http://arxiv.org/abs/cs.CL/0308022
7. Bird, S., Simons, G.: Building an Open Language Archives Community on the DC foundation. In: Hillmann, D., Westbrooks, E. (eds.) Metadata in Practice: A Work in Progress. ALA Editions, Chicago (2004)
8. Barwick, L.: Networking digital data on endangered languages of the Asia Pacific region. International Journal of Indigenous Research 1, 11–16 (2005)
9. Duncker, E.: Cross-cultural usability of the library metaphor. In: JCDL 2002: Proceedings of the 2nd ACM/IEEE-CS Joint Conference on Digital Libraries, pp. 223–230. Association for Computing Machinery (2002)
10. Jones, M., Harwood, W., Buchanan, G., Lalmas, M.: StoryBank: an Indian village community digital library. In: JCDL 2007: Proceedings of the 7th ACM/IEEE-CS Joint Conference on Digital Libraries, pp. 257–258. Association for Computing Machinery (2007)
11. Downie, J.S.: Realization of four important principles in cross-cultural digital library development. In: JCDL Workshop on Cross-Cultural Usability for Digital Libraries (2003)
Digital Folklore Contents on Education of Childhood Folklore and Corporate Identification System Design
Ya-Chin Liao (1,*), Kuo-An Wang (2,*), Po-Chou Chan (2), Yu-Ting Lin (3), Jung-I Chin (2), and Yung-Fu Chen (4,**)

(1) Department of Commercial Design, National Taichung Institute of Technology, Taichung 40402, Taiwan, ROC, [email protected]
(2) Department of Management Information Systems and (3) Center of General Education, Central Taiwan University of Science and Technology, Taichung 40601, Taiwan, {gawang,bjjem,jichin}@ctust.edu.tw, [email protected]
(4) Department of Health Services Administration, China Medical University, 91 Hsueh-Shih Road, Taichung 40402, Taiwan, Tel.: 886-4-22053366 ext. 6315; Fax: 886-4-22031108, [email protected]

Abstract. Digital artifacts preserved in digital repositories of museums are mostly static images. However, the artifacts may be lost, degraded, or damaged no matter how well the preservation and exhibition environments have been controlled, which makes the artifacts difficult to recover. Furthermore, if not properly inherited, information regarding the making, function, and usage of an artifact might be lost after several generations. Hence, in addition to digitizing folklore artifacts, we have also digitized the crafts of making them and the skills and rituals of using them, recorded in videos. With abundant digitized collections, the repository website is becoming more and more popular among teachers and students, especially in kindergartens and elementary schools, for extracting and creating teaching materials for folklore education. Recently, the application of folklore contents has also been encouraged in the education of English as a second language (ESL), social work, and mathematics. In this study, we applied the digital folklore contents to developing story books for childhood folklore education and to instructing students in designing a corporate identification system (CIS) as a class exercise. The technology acceptance model (TAM) was used to evaluate perceived usefulness (PU), perceived ease of use (PEU), and behavioral intention (BI) in using these digital contents to accomplish these tasks. The results show that the scores of PU, PEU, and BI are all greater than 3 (on a 5-point Likert scale), indicating usefulness and ease of use of the contents and website, as well as a positive attitude toward continuous use of the contents in various educational areas.
1 Introduction

Folklore may refer to unsubstantiated beliefs, legends and customs currently existing among the common people [1], or to substantiated artifacts, crafts, skills, and rituals widely governing the living style of the common people [2]. It reflects the ancestral missions that have shaped a people and the inherited values that they express in their daily lives and pass to future generations [3]. In general, folklore refers to the society and cultural tradition of the common people and the customs practiced and beliefs held by the vast majority of people in the cultural mainstream that they have inherited from their ancestors [3]. As a result, the value of folklore artifacts, crafts, skills, and rituals lies in their demonstration of popular conceptions, life wisdom and the ancestral legacy hidden within the culture. Their basic value lies in their tight intermeshing of spirituality, psychology, and social mores, as well as their social functions, while their symbolic cultural meanings lie largely in their artistic and historical worth.

1.1 Significance of Digitizing Folklore Artifacts and Activities

Recently, digital contents of biology, archeology, geology, anthropology, architecture, art, language (dialect), and so forth, have been widely developed in Taiwan (http://catalog.digitalarchives.tw) and around the world. However, most of these contents emphasized static artifacts rather than the crafts in making them or the skills and rituals in using these digitized artifacts. The artifacts may be lost, degraded, or damaged no matter how well the preservation and exhibition environments are monitored and controlled, and degraded or damaged artifacts are not easy to recover. Furthermore, the manufacturing, functions, and usages of the artifacts might be forgotten after several generations if not properly inherited. Hence, it is very important to preserve the crafts in making, the skills in operating, and the ceremonies or rituals in using the artifacts. In our previous investigations, in addition to folklore artifacts [2], folklore activities [4] have also been digitized for the preservation of Taiwanese tradition and culture. For example, the craft of making puppets concerns wood sculpture, painting, clothing, and decoration, while the skill of using or playing a puppet in religious rituals involves delicate finger operation, hand control, and arm and body movements. Besides, folklore and religious rituals have their spiritual meaning: the step-by-step procedure embeds significant meaning for a people or a religion.

The artifacts collected by national or local museums are mostly incomplete. Although the Taiwanese Folklore Museum is a popular multi-function site with an important mission in the exhibition and preservation of representative folklore artifacts, there are only 1412 artifacts currently preserved in this museum. In order to extend the digital folklore contents, two strategies have been adopted to supplement the insufficiency of the collected artifacts. First, folklore hobbyists are regularly invited to demonstrate their private collections, which are digitized by the task force of the digital preservation team of the museum to extend the number of digital contents. Another, more aggressive strategy is to sign cooperation agreements with members of folklore associations by offering services to digitize their personal collections [5]. As a result, 2140 additional digitized artifacts have been added to the digital repository.
With the extended digital contents in hand, the website of the Taiwanese Folklore Museum is becoming even more popular among folklore hobbyists, students, and teachers, especially in kindergartens and primary schools, for extracting and preparing teaching materials for folklore, social work, and other subjects.
1.2 Web-Based Learning

Motivated by investigations showing that media richness facilitates learning of courses with high uncertainty and equivocality [6] and that e-learning with interactive videos gains more learner satisfaction than non-interactive and traditional classroom learning [7], digital contents of crafts, skills, and rituals have been developed for the purpose of both digital preservation and online education [4]. The same concept was applied to digitize childcare standard operation procedures (SOPs) [8]. In contrast to general non-interactive e-learning and traditional classroom learning styles, in a previous study we proposed a metadata-based method for recording each step of a folklore activity as a metadata record in which the title, description, associated digital media, and other related information are all included [4]. In addition, folklore artifacts related to individual folklore activities are also linked in the metadata for seamless integration. Recently, a web-based system (MOSAICA), which integrates numerous functions including preservation of cultural heritage and provision of interactive and creative educational experiences, has been shown to be valuable in helping people with different cultural backgrounds develop a positive attitude toward open-mindedness through learning from stories, customs, and diverse cultures via hypertext [9]. Hence, digital folklore contents preserved in digital repositories hold promise for facilitating folklore and other forms of education.
2 Development of Digital Taiwanese Folklore Contents

The digital folklore contents consist of folklore artifacts and activities. The dimensions, originalities, functions, and other detailed descriptions of folklore artifacts were examined, investigated, and recorded by folklore specialists. Metadata based on the Dublin Core were used to record individual folklore artifacts and activities so as to be compatible with international standards. All the digitized contents are stored in a database system with a website (http://www.folkpark.org.tw) designed for the general public, students, instructors, and researchers to browse and surrogate the digital contents.
Fig. 1. Folklore artifacts preserved were classified into ten categories: (a) Clothing and Jewelry, (b) Kitchenware and Dinnerware, (c) Furnishings, (d) Transportation, (e) Religion and Religious Ceremonies, (f) Aborigines, (g) Documents and Deeds, (h) Machinery and Tools, (i) Study, and (j) Arts and Recreation
Digitized Folklore Artifacts. A total of 1412 artifacts preserved in the Taiwanese Folklore Museum were classified into ten categories according to their life styles and functions (Fig. 1). In addition, 2140 artifacts and artworks collected by folklore artists or hobbyists had also been digitized and pooled to the digital content repository (Fig. 2).
Fig. 2. Folklore artifacts and arts digitized during exhibitions: (a) Bride Wore Red, (b) Historical Documents from Old Taichung, (c) Artifacts from When Grandparents Were Young, (d) Collection of Cicada-shaped Jade Pieces, (e) Carvings by Hsu Pei-ming, (f) Paintings and Personal Letters by Contemporary Taiwan Artists, (g) Carved Wooden Puppets, (h) Jade Auspicious Animals, (i) Handicrafts Decorated Using Plant Dye and Indigo Blue, (j) Wood-fired Ceramic Creations, (k) Ancient Taiwanese Religious Books and Paintings, and (l) Dream of the Red Chamber-themed Art and Handicrafts.
Digitized Folklore Activities. Step-by-step folklore activities were demonstrated by the folklore specialists invited to participate in this study, and the demonstrations were recorded by a professional photographer using a digital camcorder at a resolution of 640x480 pixels. The text and oral descriptions of a folklore activity were prepared by a folklorist who specializes in that activity. Video clips of the individual steps were obtained by editing the video sequence with video editing software and were saved in Microsoft WMV and Apple QuickTime formats. These video clips were then combined with other related information and recorded in a metadata format compatible with the Dublin Core standard. The metadata designed for Taiwanese folklore artifacts [3] were extended to cover folklore crafts, skills, and rituals, in which the "Relation" element, containing the two qualifiers "Has Part" and "Is Part Of", is used to interlink the main (parent) metadata record with the child metadata records of the individual steps [4,5].

Table 1. Step-by-step demonstration of bamboo weaving craft
(1) Scraping
(2) Splitting into strips
(3) Splitting into thinner strips
(4) Trimming strip width
(5) Trimming thickness
(6) Round mouth weaving
(7) Weaving the bottom
(8) Drawing in the mouth
(9) Making the base
(10) Making the handle
An example of the step-by-step demonstration of bamboo weaving is described in Table 1. Each step in this table has a corresponding video segment. Additionally, a main (parent) metadata record is interlinked with its related activity steps (children) through the "Relation" element proposed by the Dublin Core, in which the element contains the two qualifiers "Has Part" and "Is Part Of" for describing the sequential relation between the parent and children metadata records. Furthermore, the qualifier "Reference source" is applied to express a record's relationship with other artifacts or folklore activities. The "Has Part" qualifier is used by the parent metadata record to point to its child steps, while "Is Part Of" allows the child steps to trace back to their parent. With this mechanism, all the child steps are tightly connected to their parent, so that the web pages can support flexible interaction and easy navigation for users (a sketch of this linking is given after Table 2). The qualifier "Sub-Collection Type" was added to the element "Type" of the metadata proposed in [2]. Table 2 shows a total of 24 digitized folklore activities, which include crafts (A1-A12), skills (B1-B9), and ceremonies or rituals (C1-C3).

Table 2. Digitized folklore activities
Crafts: (A1) Dough figurines, (A2) Bamboo weaving, (A3) Bamboo utensil carving, (A4) Piece together the cloth, (A5) Puppet head carving, (A6) Art cultivated in pot, (A7) Pottery, (A8) Wooden carving, (A9) Painting of objects, (A10) New Year's paintings, (A11) Plant dyeing, (A12) Calligraphy brush making
Skills: (B1) Top spinning performance, (B2) Puppet show performance, (B3) Gongs and drums performance, (B4) Nanguan performance, (B5) Traditional music performance, (B6) Ruan performance, (B7) Dulcimer performance, (B8) Nanhu performance, (B9) Guzheng performance
Ceremonies or rituals: (C1) Spring Bull Hitting Festival, (C2) Venerating Heaven Ruler, (C3) Turning Adult Ceremony
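As a concrete illustration of the parent-child linking described above, the sketch below builds a parent activity record and two step records and interlinks them through a "Relation" element carrying the "Has Part" and "Is Part Of" qualifiers; the identifiers, step titles, and the exact XML encoding of the qualifiers are assumptions for illustration, not the repository's actual serialization.

```python
# A sketch of interlinking a parent activity record with its step (child)
# records through the Dublin Core "Relation" element and the "Has Part" /
# "Is Part Of" qualifiers. Identifiers and the attribute-based encoding of
# the qualifier are illustrative assumptions.
import xml.etree.ElementTree as ET

def make_record(identifier, title):
    rec = ET.Element("record", id=identifier)
    ET.SubElement(rec, "title").text = title
    return rec

def link_parent_children(parent, children):
    """Add reciprocal Relation elements between a parent and its child steps."""
    for child in children:
        has_part = ET.SubElement(parent, "relation", qualifier="Has Part")
        has_part.text = child.get("id")
        is_part_of = ET.SubElement(child, "relation", qualifier="Is Part Of")
        is_part_of.text = parent.get("id")

parent = make_record("A2", "Bamboo weaving")
steps = [make_record(f"A2-{i}", title)
         for i, title in enumerate(["Scraping", "Splitting into strips"], start=1)]
link_parent_children(parent, steps)

for rec in [parent] + steps:
    print(ET.tostring(rec, encoding="unicode"))
```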
Website for Presenting Digital Contents. The web-based system is implemented using XML. Basically, XML is a way of structuring information and sending it from one software component to another; in this website, XML is used for exchanging digital contents between different museums. Figure 3 shows the homepage of the website supporting the digital preservation of folklore artifacts and activities. For each craft, skill, or ritual, a step-by-step video demonstration accompanied by either a Chinese or an English description can be selected. Figure 4 shows the homepage of a partner museum and its digital contents of folklore artifacts collected by hobbyists.
Fig. 3. Home page of the Taiwanese Folklore Museum for demonstrating (a) folklore artifacts and (b) folklore activities
Fig. 4. Homepage of a collaborating folklore society: (a) blog page of a member for information exchange with other members and (b) the member's digitized artifacts
3 Adoption of Digital Folklore Contents for Education

Recently, the application of folklore contents has been encouraged in the education of ESL (English as a second language) [10], social work [11], and mathematics [12]. Bowman [10] stated that knowledge of folklore and ethnography provides good discussion material for language education, and suggested that language teachers should take at least an introductory folklore course [10]. Lee et al. [12] proposed a web-based learning system designed around a well-known Chinese folklore story for teaching secondary school students basic probability; the outcome was reported to be encouraging because the system provides a near-real learning environment that stimulates students' interest and motivation to learn. Furthermore, folklore knowledge and skills, especially concerning diverse cultural groups, are deemed powerful tools for improving the effectiveness of social work [11]. Social work students with strong folklore skills and understanding will be more competent in social work practice. Open-mindedness is crucial for people to accommodate diverse views and opinions, which in turn enables critical thinking when making important decisions [9]. Nakanishi and Rittner [13] argued that people must learn their own culture before learning the cultures of other peoples. Hence, learning our own folklore
will enable students to be more open-minded in accommodating and tolerating different views regarding political policies, social opinions, ethnic identity, or religious beliefs. Motivated by the aforementioned investigations and findings, we applied the digital folklore contents developed in past years to develop storybooks for childhood folklore education. In addition, these contents were also used to instruct students majoring in commercial design in designing a Corporate Identification System (CIS) for the Taiwan Folklore Museum as a class exercise.

Childhood Folklore Education. The Taiwan Folklore Museum is sponsored by the city government of Taichung and is currently run by a university. In addition to the collection, preservation, and exhibition of folklore artifacts, another important mission is folklore education for citizens, especially children, around the country. The number of annual visitors is more than 20,000, of whom more than 80% are kindergarten and pre-school children. Storytelling is deemed a viable instructional strategy for promoting students' cultural competence [11]. In this study, the stories used for childhood folklore education were composed, based on the collected artifacts and activities, by students majoring in early childhood care and multimedia design. Figure 5 shows an example of the composed stories. The upper row shows the digitized artifact accompanied by its description written by a folklorist; sample pages of the storybook are displayed in the bottom row.

Halter tops: Halter tops were the innermost layer of clothing and were in direct contact with the skin. This rounded halter top is relatively narrow overall and the lower half even more so. It is probably a children's halter top from mainland China. The top and side borders are adorned with lace and the bottom half with curved pieces of fabric embroidered with plant designs. From the embroidery techniques used and the flower designs, we can deduce that this halter top is a relatively recent creation.
Story: Tong is a boy with a kind heart. He likes to play in the field. One day, he saw Bear, a panda that got lost in the bamboo forest, so he carried the bear home and tried to keep him warm. Bear can’t stop himself from shaking, so Tong took off his halter top and put it on Bear. They both live happily ever after.
Fig. 5. An example of composed stories used for childhood folklore education
Practice on Designing a Corporate Identification System (CIS). The students were asked to design a CIS based on the digital folklore contents. The aim is to create designs for the museum's collection of cultural items. This exercise uses traditional images as ideas for creating logos, standard fonts, and combinations of both, as well as dolls and product designs. A CIS consists of three elements, i.e., Mind Identity (MI), Behavior Identity (BI), and Visual Identity (VI), which are interconnected and interact in an organic way. A CIS communicates an organization's management concept and culture both inside the organization and to the public through its own system, so that people can develop a sense of approval and unity. MI is the core of the whole CI system; VI and BI are its external expressions. MI includes three aspects, which are
the objectives, policies, and values of management. VI is the most visible part of the CI system and is made up of two elements: core and application. BI is used to express the philosophy of the organization. In this study, students majoring in Commercial Design and taking the course "Advertising Creative Strategy" were asked to design a CIS for the Taiwan Folklore Museum as an exercise using the digital folklore contents available on the website. Figure 6 shows an example of the designed CIS. The upper row of the example shows the digitized artifact and its description available on the website. The design concept proposed by the student is depicted in the middle row. The bottom row shows the mark, logotype, combined mark and logotype, character, and product, which are basic CIS components.

Coir Rain Coat: Coir capes, comprised of either one or two pieces, were worn in agricultural societies by individuals working in the rain. Coir capes were very effective at keeping the wearer dry and were not easily damaged by wind or rain. They were manufactured out of better quality brushed and dried palm leaves that were arranged and sewn tightly together.

Design Concept: The logo is created based on the cape itself. The coir cape, which resembles a raincoat, was one of the most important cultural items in early agricultural societies; it prevented those who went out to work from getting wet on rainy days. The aim of the design is to protect our culture and tradition. The two colors, green and orange, symbolize societies in which people relied on agriculture, and the rectangle on the back of the cape represents the land.
Fig. 6. An example of CIS designs
4 Evaluation

In this pilot study, 33 students majoring in Early Childhood Care and taking the course "Multimedia Design" and 65 students majoring in Commercial Design and taking the course "Advertising Creative Strategy" were asked to use the digital folklore contents to design storybooks and a CIS, respectively, as their term projects. Each student was asked to fill in a questionnaire (Table 3) based on a modified technology acceptance model (TAM) [14] to evaluate perceived usefulness (PU), perceived ease of use (PEU), and behavioral intention (BI) when finding contents of interest on the website and using these materials to accomplish his or her project. All questions are graded on a 5-point Likert scale ranging from 1 to 5 points. The Cronbach's alpha of the questionnaire is 0.83, which indicates high reliability. As shown in Table 3, the scores of PU, PEU, and BI are all greater than 3, indicating the usefulness of the digital folklore contents, the ease of use of the website environment, and a positive attitude toward using the contents in numerous educational areas.
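For reference, Cronbach's alpha for a k-item questionnaire is k/(k-1) multiplied by (1 minus the sum of the item variances divided by the variance of the total scores). The sketch below implements this formula in Python on an invented response matrix; the reported value of 0.83 comes from the study's real data, which are not reproduced here.

```python
# A sketch of computing Cronbach's alpha for a 15-item, 5-point Likert
# questionnaire. The response matrix below is invented for illustration only.
from statistics import pvariance

def cronbach_alpha(responses):
    """responses: one list of 15 item scores per respondent."""
    k = len(responses[0])                                  # number of items
    item_vars = [pvariance([r[i] for r in responses]) for i in range(k)]
    total_var = pvariance([sum(r) for r in responses])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

sample = [
    [4, 5, 4, 3, 4, 4, 5, 4, 4, 5, 4, 4, 3, 4, 4],
    [3, 3, 4, 4, 3, 3, 4, 3, 4, 4, 3, 4, 3, 3, 4],
    [5, 4, 5, 4, 5, 4, 5, 5, 4, 5, 5, 4, 4, 5, 5],
    [2, 3, 3, 2, 3, 3, 3, 2, 3, 3, 3, 3, 2, 3, 3],
]
print(round(cronbach_alpha(sample), 2))
```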
5 Discussion and Conclusion

Folklore is believed to be an endangered, marginalized, or misunderstood field. Folk artists are honored to study and inherit skills built upon earlier generations [10].
Previously, we proposed an information exchange platform that allows folk artists and folklore hobbyists to exchange information regarding their own created or collected folklore artifacts and knowledge [5]. The digitized contents contained in the platform complement the limited collection of folklore artifacts in the Taiwan Folklore Museum. Additionally, hobbyists and professional folk artists of the folklore associations have the potential to serve as folklore educators, preparing introductory materials and answering questions regarding their private creations and collections. A similar function was also recently integrated into the MOSAICA project, in which the dual objectives of preservation and presentation of diverse cultural heritage have been achieved [9]. It is believed that a platform containing abundant and diverse digital folklore contents and folk artists' knowledge can stimulate students' interest and motivation in learning. After having learned deeply about our own folklore or culture, we can understand more about other cultures [10], which in turn can prevent conflicts among peoples with different political, social, ethnic, and religious identities [1,9]. Further system development will focus on adding folklore legends, stories, songs, dances, and riddles to increase the diversity of the digital folklore contents, which may also be useful in reminiscence therapy for elderly people with dementia or Alzheimer's disease. Information quality and system integration are two important factors that strongly influence the perceived usefulness and post-adoption use of an information system [15]. In our study, the origin, category, and function of each artifact were studied, examined, and recorded by well-known Taiwanese folklore specialists, which helps ensure the quality of the digital contents [1]. Furthermore, integration of the folklore activities and folklore artifacts was achieved through the "Relation" element of the metadata [4]. Each artifact collected by folklore museums and folklore hobbyists has its own story regarding religion, myth, folklore legend, ethnography, or anthropology; these materials and information are also tightly integrated into the digital folklore content system [4,5].

Table 3. Descriptive statistics of the modified TAM
A. Perceived Ease of Use
1. I found it easy to operate the Digital Folklore Content (DFC) Repository
2. I found it easy to get the DFC Repository to do what I want it to do
3. I found the user interface of the DFC Repository clear and understandable
4. I found the interaction with the DFC Repository flexible
B. Perceived Usefulness
5. I agree the DFC can facilitate self-learning and help accomplish tasks more quickly
6. I agree the DFC Repository can decrease learning time and increase productivity
7. I agree the DFC Repository can elevate learning willingness and enhance effectiveness
8. I agree the DFC Repository can provide information for different age groups
9. I agree the DFC Repository can promote folklore activities
10. I agree the DFC is useful for making folklore course materials
11. I agree the DFC is useful for learning the current course
12. I agree the DFC is useful for understanding Taiwanese folklore
C. Behavior Intention
13. I intend to use the DFC as frequently as I need
14. I will continue to use the DFC whenever possible in suitable circumstances
15. I expect to use the DFC in other related activities and courses in the future
In conclusion, this paper presents the adoption of digital folklore contents in childhood folklore education and in CIS design practice. An evaluation study based on the TAM suggests that the developed digital folklore contents are useful in designing course materials for numerous educational areas.

Acknowledgments. This work was funded in part by the National Science Council of Taiwan (Grant Nos. NSC96-2422-H-039-002, NSC97-2631-H-166-001 and NSC98-2410-H-039-003-MY2) and China Medical University (Grant No. CMU96-210).
References
1. Chan, P.C., Chen, Y.F., Huang, K.H., Lin, H.H.: Digital Content Development of Taiwanese Folklore Artifacts. In: Fox, E.A., Neuhold, E.J., Premsmit, P., Wuwongse, V. (eds.) ICADL 2005. LNCS, vol. 3815, pp. 90–99. Springer, Heidelberg (2005)
2. Bronner, S.J.: The Meanings of Tradition: An Introduction. Western Folklore 59, 87–104 (2000)
3. Randall, M.: Unsubstantiated belief: What we assume as truth, and how we use those assumptions. Journal of American Folklore 117, 288–295 (2004)
4. Chen, Y.F., Chan, P.C., Huang, K.H., Lin, H.H.: A Digital Library for Preservation of Folklore Crafts, Skills, and Rituals and Its Role in Folklore Education. In: Sugimoto, S., Hunter, J., Rauber, A., Morishima, A. (eds.) ICADL 2006. LNCS, vol. 4312, pp. 32–41. Springer, Heidelberg (2006)
5. Chan, P.C., Liao, Y.C., Wang, K.A., Lin, H.H., Chen, Y.F.: Digital Content Development of Folklore Artifacts and Activities for Folklore Education. In: Li, F., Zhao, J., Shih, T.K., Lau, R., Li, Q., McLeod, D. (eds.) ICWL 2008. LNCS, vol. 5145, pp. 332–343. Springer, Heidelberg (2008)
6. Sun, P.C., Cheng, H.K.: The design of instructional multimedia in e-Learning: A media richness theory-based approach. Computers and Education 49, 662–676 (2007)
7. Zhang, D., Zhou, L., Briggs, R.O., Nunamaker, J.F.: Instructional video in e-learning: Assessing the impact of interactive video on learning effectiveness. Information & Management 43, 15–27 (2006)
8. Wang, J.H.T., Chan, P.C., Chen, Y.F., Huang, K.H.: Implementation and Evaluation of Interactive Online Video Learning for Childcare SOPs. WSEAS Transactions on Computers 5, 2799–2806 (2006)
9. Barak, M., Herscoviz, O., Kaberman, Z., Dori, Y.J.: MOSAICA: A web-2.0 based system for the preservation and presentation of cultural heritage. Computers and Education 53, 841–852 (2009)
10. Bowman, P.B.: Standing at the crossroads of folklore and education. Journal of American Folklore 119, 66–79 (2006)
11. Carter-Black, J.: Teaching cultural competence: An innovative strategy grounded in the universality of storytelling as depicted in African and African American storytelling traditions. Journal of Social Work Education 43, 31–50 (2007)
12. Lee, J.H.M., Lee, F.L., Lau, T.S.: Folklore-based learning on the Web – Pedagogy, case study, and evaluation. Journal of Educational Computing Research 34, 1–27 (2006)
13. Nakanishi, M., Rittner, B.: The inclusionary culture model. Journal of Social Work Education 28, 27–35 (1992)
14. Davis, F.D.: Perceived usefulness, perceived ease of use, and user acceptance of information technology. MIS Quarterly 13, 319–340 (1989)
15. Saeed, K., Abdinnour-Helm, S.: Examining the effects of information system characteristics and perceived usefulness on post adoption usage of information systems. Information and Management 45, 376–386 (2008)
Ancient-to-Modern Information Retrieval for Digital Collections of Traditional Mongolian Script

Biligsaikhan Batjargal1, Garmaabazar Khaltarkhuu2, Fuminori Kimura3, and Akira Maeda3

1 Graduate School of Science and Engineering, Ritsumeikan University, Japan
2 Mongolia-Japan Center for Human Resources Development, Mongolia
3 College of Information Science and Engineering, Ritsumeikan University, Japan
{biligsaikhan,garmaabazar}@gmail.com, {fkimura,amaeda}@is.ritsumei.ac.jp

Abstract. This paper discusses our recent improvements to the traditional Mongolian script digital library (TMSDL), which can be used to access ancient historical documents written in traditional Mongolian using a query in modern Mongolian. The results of the experiment show that the percentage of successfully retrieved queries was improved.

Keywords: Traditional Mongolian, Historical documents, Information retrieval.
1 Introduction

In recent years, the role and importance of digital cultural heritage preservation has been continuously increasing in the Asia-Pacific region as well as worldwide. This paper provides a summary of the recent achievements of the TMSDL [1], which aims to preserve over 800 years of historical records written in traditional Mongolian for future use and to make them available for public viewing. There are over 50,000 registered manuscripts and historical records written in traditional Mongolian script stored in the National Library of Mongolia. Despite the importance of keeping old historical materials in good condition, storage conditions in Mongolia are not adequate for preserving historical records over long periods of time. The TMSDL, which is based on the Greenstone Digital Library Software (GSDL) and accepts modern Mongolian query input, will help users interested in the history of the Mongols access materials written in traditional Mongolian. However, some limitations still exist. Thus, we have made two improvements to the TMSDL. The first is a search function that retrieves traditional Mongolian documents using a modern Mongolian query. The second is a display function that properly renders this ancient and complex script, the traditional Mongolian script.
preserves a more ancient language and reflects the Mongolian language spoken in the ancient period, while the modern Mongolian script reflects pronunciation differences in modern dialects. Traditional Mongolian has different grammar and a distinct dialect compared to modern Mongolian. Retrieval of the desired information from traditional Mongolian documents using modern Mongolian is not a simple task due to substantial changes in the Mongolian language over time. We wanted to improve retrieval effectiveness by utilizing a dictionary to eliminate limitations of the previous version, such as not handling irregular words in query translation. The structure of the TMSDL with the improvements is shown in Fig. 1. Adding a dictionary-based query translation approach to the translation module was a major improvement that takes into account the age difference between the writing systems of the ancient and modern Mongolian languages. We define such a retrieval method as "ancient-to-modern information retrieval".
Fig. 1. Ancient-to-modern information retrieval at the TMSDL
The online version of Tsevel's concise Mongolian dictionary [2] (http://toli.query.mn/), which is under development, was utilized. Tsevel's dictionary was printed in 1966 and is one of two Mongolian dictionaries with definitions written in Mongolian available on the market. It includes over 30,000 words in Cyrillic and traditional Mongolian script. As shown in Fig. 1, the query in modern Mongolian (Cyrillic) was translated into a query in traditional Mongolian script. Our translation approach involved exact matching of query terms, word by word, against Tsevel's dictionary. The translation technique of the previous version, which was based on grammatical rules, was used if no exact match was found. Consequently, the query in traditional Mongolian (Unicode characters in the range U+1800 – U+18AF) was submitted as a retrieval query against the traditional Mongolian script collections in Greenstone. This improvement boosted the quality of the translation and allowed users to access documents written in an ancient language (traditional Mongolian) with a query input in a modern language (modern Mongolian – Cyrillic). The "document.form" object of the HTML Document Object Model (DOM) was used to submit queries to and extract results from Greenstone. Our translation algorithm, which utilizes asynchronous JavaScript and XML (AJAX), and Tsevel's dictionary were integrated into the TMSDL using our collection-specific macro files. The modern Mongolian (Cyrillic) input at the TMSDL is illustrated in Fig. 2.
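The translation step described above, dictionary lookup first with the grammar/rule-based conversion of the previous version as a fallback, can be pictured with the following sketch; the dictionary entry, the rule function, and the whitespace tokenization are placeholders, since the actual TMSDL implementation runs as JavaScript within Greenstone macro files.

```python
# A sketch of dictionary-based query translation with a rule-based fallback.
# The single dictionary entry and the placeholder rule function are invented;
# the real system maps Cyrillic input to traditional Mongolian script
# (Unicode range U+1800-U+18AF) using Tsevel's dictionary.
DICTIONARY = {
    # modern Mongolian (Cyrillic) -> traditional Mongolian (illustrative only)
    "хаан": "\u182C\u1820\u182D\u1820\u1828",
}

def rule_based_convert(word):
    """Placeholder for the previous version's grammar/rule-based conversion."""
    return word

def translate_query(query):
    translated = []
    for word in query.split():
        # exact dictionary match first, rule-based conversion as fallback
        translated.append(DICTIONARY.get(word, rule_based_convert(word)))
    return " ".join(translated)

print(translate_query("хаан"))
```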
Furthermore, we improved the display algorithm of the TMSDL, allowing traditional Mongolian script to be rendered properly on any platform, as shown in Fig. 2. The display algorithm was enhanced so that Uniscribe (the Unicode Scripts Processor) is used on Windows Vista and later versions, while our own conversion algorithm is used on other platforms, e.g., Windows XP, Unix-like systems, and Mac OS X. The JavaScript property "navigator.appVersion" is utilized to identify the version and name of the web browser and operating system of the client system. The "innerHTML" property is used to display the rendered text in Greenstone.
Fig. 2. Cyrillic input and retrieval results at the TMSDL
3 Evaluation

After these improvements, we conducted an experiment to examine the difference between the old and new versions of the TMSDL as well as to check the translation correctness. The Altan Tobci (year 1604, 164 pp.) and the Story of Asragch (year 1677, 130 pp.) – chronicles of Mongolian kings, Genghis Khan, and the Mongol Empire – are available in the TMSDL. We compared the word counts in the search results for the 150 most frequently appearing queries in the two versions against the "Qad-un Undusun-u Quriyanggui Altan Tobci – Textological Study" [3], which contains a detailed analysis of the frequency of traditional Mongolian words in the Altan Tobci. In the experiment on retrieving traditional Mongolian documents via modern Mongolian (Cyrillic) utilizing a dictionary, we found that the new version translated and retrieved about 86% of the input queries, an improvement on the old version's 61%. However, for about 64% of the input queries in modern Mongolian, the retrieved word count was less than or greater than the actual frequency because
of possible errors in translation, grammatical inflection, and text digitization, or limitations of the indexer and retrieval function. Improvements in the retrieval results are illustrated in Fig. 3. Detailed retrieval results for sample query terms are shown in Fig. 4, with modern and ancient forms, their meanings, and word counts. Retrieved results in the TMSDL, with highlights, are shown in Fig. 2.
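The comparison behind these percentages can be sketched as follows; the query terms and counts are invented stand-ins for the 150 frequent terms taken from the textological study.

```python
# A sketch of the evaluation: for each frequent query term, compare the word
# count retrieved by the TMSDL with the frequency reported in the textological
# study. The terms and counts below are invented for illustration.
expected = {"term_a": 120, "term_b": 45, "term_c": 8}   # textological study
retrieved = {"term_a": 118, "term_b": 45}               # TMSDL search results

retrieved_share = len(retrieved) / len(expected)
exact_matches = sum(1 for term, count in retrieved.items()
                    if expected.get(term) == count)

print(f"retrieved: {retrieved_share:.0%}, "
      f"exact count match: {exact_matches}/{len(expected)}")
```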
4 Summary and Future Development

In this paper, we introduced recent improvements to the TMSDL, which can be used to access ancient historical documents written in an ancient language using a query in a modern language. The TMSDL is a useful system for building a digital library of traditional Mongolian script collections. The percentage of successfully retrieved input queries in the TMSDL was increased. There were, however, some discrepancies in the word counts of the retrieved query terms, which require further improvement. For future development, we want to improve the TMSDL so that all queries in modern Mongolian are correctly translated into traditional Mongolian and the word counts in the retrieval results match the textological study of the TMSDL contents.
Fig. 3. Improvements on the retrieval results
Fig. 4. Sample retrieval results in detail
References
1. Khaltarkhuu, G., Maeda, A.: Developing a Traditional Mongolian Script Digital Library. In: Buchanan, G., Masoodian, M., Cunningham, S.J. (eds.) ICADL 2008. LNCS, vol. 5362, pp. 41–50. Springer, Heidelberg (2008)
2. Tsevel, Y.: Mongol Helnii Tovch Tailbar Toli. Ulaanbaatar (1966) (in Mongolian)
3. Choimaa, S., Shagdarsuren, T.: Qad-un Undusun-u Quriyanggui Altan Tobci – Textological Study. Ulaanbaatar (2002) (in Mongolian)
A Collaborative Scholarly Annotation System for Dynamic Web Documents – A Literary Case Study Anna Gerber, Andrew Hyland, and Jane Hunter University of Queensland, St. Lucia, Queensland, Australia, (617) 3365 1092 {agerber,ahyland,jane}@itee.uq.edu.au
Abstract. This paper describes ongoing work within the Aus-e-Lit project at the University of Queensland to provide collaborative annotation tools for Australian Literary Scholars. It describes our implementation of an annotation framework to facilitate collaboration and sharing of annotations within research sub-communities. Using the annotation system, scholars can collaboratively select web resources and attach different types of annotations (comments, notes, queries, tags and metadata), which can be harvested to enrich the AustLit collection. We describe how rich semantic descriptions can be added to the constantly changing AustLit collection through a set of interoperable annotation tools based on the Open Annotations Collaboration (OAC) model. RDFa enables scholars to semantically annotate dynamic web pages and contribute typed metadata about the IFLA FRBR entities represented within the AustLit collection. We also describe how the OAC model can be used in combination with OAI-ORE to produce scholarly digital editions, and compare this approach with existing scholarly annotation approaches. Keywords: Annotation, Interoperability, Scholarly Editions, Ontology.
records biographical information about people and organisations (agents) involved in the creation and dissemination of Australian Literature, as well as information about the events in which they participate, including work creation, realization, and manifestation, and their relationships with other agents and works. Sub-communities use their own specific vocabularies and terminologies associated with their topic of interest. If they wish to contribute their research data and knowledge to AustLit, they need to request that additional attributes are added to the AustLit data model by technical staff. This process presents a delay and barrier to establishing new topics and has also led to increasing complexity of the AustLit data model. This has also affected the user interface for editing AustLit records, making it inaccessible to scholars who do not have adequate training in information systems. As collaborative research teams have become increasingly geographically distributed, scholars have also identified a need for tools to support discussion, sharing of notes and data on projects such as scholarly editions. An analysis of annotation needs within the AustLit community has indicated there are at least five different use cases for annotation tools:
1. Fine-grained semantic tagging of textual documents (both manual and automated);
2. Subjective attachments of free-text notes and interpretations of resources;
3. Input of metadata descriptions for AustLit resources and derived resources;
4. Representation of different versions of scholarly editions as annotations;
5. Representation of compound objects as annotated links between resources.
Hence, the primary motivation for the work described in this paper was to provide collaborative annotation and scholarly editing tools integrated within the AustLit web portal, to enable scholars to collaboratively select and annotate digital resources, and to share their annotations with the research community and enrich the AustLit collection. The remainder of the paper is structured as follows: Section 2 describes Related Work; Section 3 outlines the objectives of the work described here; Section 4 describes the architecture, implementation and user interface; Section 5 provides a discussion of the results; Section 6 outlines future work plans and Section 7 provides a brief conclusion of the outcomes.
2 Related Work Web-based collaborative annotation systems are designed to enable online communities of users to attach comments or notes to web resources and to tag them with keywords. An overview of existing web annotation systems is provided in [4]. Existing web annotation systems support the first three use cases described in Section 1, but do not support the use cases for scholarly editions or compound objects. Boot [5] provides a discussion of existing approaches to annotation for scholarly digital editions. Many scholarly edition projects employ bespoke annotation systems, using proprietary or non-standard annotation formats. These approaches typically focus on annotating source documents that are encoded using TEI XML [6], and assume that the source documents are static: when changes are made a new document will be created to represent the new version. TEI allows both inline and stand-off annotations. Inline annotations do not allow annotations to overlap, and also require that the annotator
have write access to the source document, making it difficult to maintain the integrity of the contents, and almost impossible to implement in a collaborative web-based environment. Stand-off annotation allows multiple layers of potentially overlapping annotation hierarchies to be stored; however this approach has limitations for some types of scholarly annotations [7], and does not provide a complete solution for annotating web resources generally, including non-TEI documents or images. The Open Annotations Collaboration (OAC) [8] has been developing a data model and framework to enable the sharing and interoperability of scholarly annotations across annotation clients, collections, media types, applications and architectures. The OAC model, outlined in Figure 1, draws on initiatives such as the W3C’s Annotea [9] and the W3C Media Fragments Working Group. Representing annotations using RDF and OWL enables the annotated resources to become accessible to the larger Semantic Web, including inferencing and reasoning engines.
Fig. 1. Open Annotations Collaboration model
3 Objectives

The first objective of this work was to meet the annotation requirements of the literary scholars using AustLit. Hence, the Aus-e-Lit annotation tools need to support the following activities:
• Scholarly (free text) annotation of textual documents, web pages and images;
• Fine-grained semantic tagging of AustLit documents and web pages (automatic or manual) with user-specified controlled terms (AustLit data entities) to enable search and inferencing;
• Annotation of changes between documents, for example for recording the transmission history of a particular text (for tracking and visualization of scholarly editions).
AustLit scholars also work with documents and images sourced from a variety of institutional repositories, both within national and international collections and on the Web.
Hence a further objective was to allow the scholars to use the same annotation system to annotate resources regardless of location. Consequently, we needed to ensure that annotation content was available in an open, interoperable format that could be easily extracted and exported to other formats for re-use. Hence, additional objectives were:
• To evaluate the Open Annotations Collaboration (OAC) model;
• To compare RDF-based approaches with existing (TEI-XML) approaches.
For literary scholarship (and to enable further semantic reasoning), it is important to be able to identify exactly which work, expression, manifestation, or agent, and which section of a literary text displayed on a Web page, is under scrutiny. Most Web annotation systems allow a context (textual segment or image region) to be specified to identify the section of interest. In practice, the annotated documents are usually HTML Web pages, with XPointer contexts. However, AustLit is a dynamic system: Web pages are dynamically generated from collections that are frequently updated with additional or corrected information. Using XPointers to represent contexts for a dynamic page is a fragile, unstable approach, as minor changes to the underlying data or stylesheets that render the page can cause the context to become invalid. In order to provide a robust Web-annotation system for dynamic pages, we need to be able to distinguish between content and presentation [10] and refer to the content objects rather than the presentation markup in annotation contexts. While it is possible to adopt conventions for HTML IDs to work around this issue, the relationships between identified sections of a dynamic web page and the data entities from which they were generated are not explicit, nor able to be generalized across systems. RDFa [11] is a W3C standard for embedding RDF data in (X)HTML. Our hypothesis is that RDFa is ideal for encoding semantic entities within dynamic web pages. Hence our final objective is to evaluate RDFa for semantic tagging of dynamic documents.
4 Architecture, Implementation and User Interface (UI)

Figure 2 shows a high-level view of the Aus-e-Lit annotation system architecture. Researchers annotate pages on the AustLit web portal, or Web resources from other
Fig. 2. Architecture diagram
sites using the Aus-e-Lit annotation client, which is implemented as a Firefox browser extension. The annotations are stored as RDF on a separate annotation server and can be browsed and searched directly via the annotation client. Metadata attached to the annotations can be selectively harvested into the AustLit database via OAI-PMH.

4.1 Adding RDFa to the AustLit Web Portal

The AustLit web portal pages have been enhanced with RDFa. The annotation client (described in Section 4.3) can identify, extract and update metadata/tags that are embedded in the AustLit resources. AustLit record pages are rendered from the AustLit database by custom Java servlets using XSLT stylesheets, which we extended to insert RDFa into the HTML. The RDFa uses classes and properties from the AustLit ontology that we developed for use with LORE [12] to represent the data entities from which the pages were generated. Figure 3 shows an AustLit work record page alongside a subset of the RDFa that encodes information about the FRBR entities represented on the page.
Fig. 3. RDFa representing FRBR entities and properties for The Drover’s Wife
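The mapping from RDFa markup back to data entities that the client relies on can be illustrated with the sketch below, which pulls about, typeof, and property attributes out of an RDFa-annotated fragment using only the Python standard library; the HTML fragment and the prefixes in it are invented and do not reproduce AustLit's actual markup or ontology.

```python
# A sketch of extracting RDFa attributes (about / typeof / property) from a
# dynamically generated page fragment. The fragment and the "austlit:" prefix
# are invented examples, not AustLit's real markup.
from html.parser import HTMLParser

FRAGMENT = """
<div about="http://example.org/work/123" typeof="austlit:Work">
  <span property="dc:title">The Drover's Wife</span>
  <span property="austlit:formOfWork">short story</span>
</div>
"""

class RDFaCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.subject = None
        self.pending = None          # property attribute awaiting its text
        self.triples = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "about" in attrs:
            self.subject = attrs["about"]
        if "property" in attrs:
            self.pending = attrs["property"]

    def handle_data(self, data):
        if self.pending and data.strip():
            self.triples.append((self.subject, self.pending, data.strip()))
            self.pending = None

collector = RDFaCollector()
collector.feed(FRAGMENT)
print(collector.triples)
```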
4.2 Annotation Model
In order to develop a set of annotation tools that will satisfy the five use cases outlined in Section 1, we need a common model that is interoperable across all of them; hence we have adopted the OAC model. Annotations are stored as RDF using Danno [13], an annotation server developed at the UQ eResearch Lab. Danno uses the Annotea schema to ensure compatibility with existing Annotea clients and servers, while also allowing us to store additional attributes for use by OAC-aware annotation clients. Aus-e-Lit annotations include the following attributes from the OAC model:
• oac:hasContent (the URI of the resource containing the content of the annotation – in our case this is an HTML document);
• oac:hasTarget (the URI of the web document being annotated);
• oac:hasPredicate (indicates the kind of annotation, such as question, comment, explanation, change, etc.).
Additional information can be recorded using attributes from other schemas such as Dublin Core (language, format, title, subject, date created or modified, etc.). In addition to implementing basic annotation types representing questions, comments and so on, we extended the model to define a ScholarlyAnnotation class, with subclasses VariationAnnotation and SemanticAnnotation, to support the types of scholarly annotations requested by AustLit scholars. Our extensions are as follows:
ScholarlyAnnotation. Extends annotations to include tags, importance, alternate body and references, based on the scholarly annotation requirements outlined in [14].
SemanticAnnotation. A subclass of ScholarlyAnnotation, which allows scholars to attach metadata conforming to the AustLit ontology to AustLit entities. It extends the model with a semantic context which indicates the entity or property from the web page's RDFa to which the metadata applies.
VariationAnnotation. Records metadata about two variants of a text, for example to annotate changes between them. It subclasses ScholarlyAnnotation and allows two targets, variantTarget and originalTarget (subproperties of oac:hasTarget), to identify the original and variant texts. It also adds attributes for recording the date, agent (person) and place where the variation on a text occurred. Table 1 shows the attributes of a VariationAnnotation. The namespace v represents the Aus-e-Lit VariationAnnotation schema.

Table 1. Example VariationAnnotation record
dcterms:created
oac:hasContent (stored as separate HTML document): H. M. Martin was commissioned to edit a collection of poetry from Harpur's manuscript books. The extent of his editorial intervention can be seen in the variation between the 1867 and 1883 versions of 'The Creek of the Four ...
v:variation-date: [1882?]
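An annotation along the lines of Table 1 could be assembled as RDF as in the following sketch, which uses the rdflib package; the namespace and resource URIs and the literal values are invented, and only the property names follow the model described above.

```python
# A sketch of building a VariationAnnotation as an RDF graph with rdflib.
# All URIs (including the OAC and Aus-e-Lit namespace addresses) and the
# literal values are invented placeholders; only the property names follow
# the model described in the text.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, RDF

OAC = Namespace("http://example.org/oac#")        # placeholder namespace URI
V = Namespace("http://example.org/aus-e-lit/variation#")

g = Graph()
g.bind("oac", OAC)
g.bind("v", V)
g.bind("dcterms", DCTERMS)

anno = URIRef("http://example.org/annotations/1")
g.add((anno, RDF.type, V.VariationAnnotation))
g.add((anno, OAC.hasContent, URIRef("http://example.org/annotations/1/body.html")))
g.add((anno, V.originalTarget, URIRef("http://example.org/texts/1867-version")))
g.add((anno, V.variantTarget, URIRef("http://example.org/texts/1883-version")))
g.add((anno, V["variation-date"], Literal("1882?")))
g.add((anno, DCTERMS.created, Literal("2010-03-01")))

print(g.serialize(format="turtle"))
```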
4.3 Annotation Client

The Aus-e-Lit annotation client is implemented as an open source add-on to the LORE Firefox extension [12]. The user interface elements are implemented using HTML and AJAX with the cross-platform ExtJS library, so that it will be possible to port the client to alternative browser extension frameworks in the future. The client provides several views that allow scholars to browse, search and edit annotations:
• a tree-based Browse view, which lists all annotations on a given page, along with any replies; the annotations in the tree can be sorted in ascending or descending order by date of creation, creator or annotation title;
• a Timeline view, displaying a summary of all annotations and replies by date;
• a Search view, allowing search by creator, title or date-range;
• an Edit view, which allows scholars to create or modify existing annotations;
• a VariationAnnotation view, which shows resources that are being compared or related side-by-side, along with additional scholarly metadata.
Figure 4 shows the Aus-e-Lit annotation client. The main browser window displays the VariationAnnotation view, which relates a manuscript image from the notebook of Patrick White with a transcript of another version of the text. The Browse view is visible at the top-right, and the Edit view is shown at the bottom-right of the screen, including fields for editing the additional scholarly annotation attributes described in Section 4.2. For example, the Tags field allows scholars to tag resources using keywords, which can be selected from the AustLit thesaurus or entered as free text. The user interface automatically suggests matching tags as the user types. The list of fields displayed in the editor changes depending on the type of the annotation. A key objective was to make semantic annotations accessible to scholars by providing a user interface that hides the complexity of the AustLit data model and does not require a detailed understanding of IFLA FRBR. We designed the UI to allow scholars to attach
Fig. 4. Editing a VariationAnnotation relating a manuscript image and transcript
Fig. 5. Editing a SemanticAnnotation that attaches metadata to a Manifestation
metadata directly to entities such as works or manifestations via the familiar AustLit Web interface. In Figure 5, an alternateTitle property is being contributed for an AustLit manifestation via a SemanticAnnotation. When the Choose semantic selection button is pressed, the annotation client parses the embedded RDFa and inserts icons into the Web page; a brick icon next to each entity, and a brackets icon next to each property from the RDFa. The user selects the entity that they wish to annotate by clicking on these icons. A drop down menu in the editor allows users to attach OWL DataType properties from the AustLit ontology to the annotation. The UI only displays properties that are applicable for the selected entity, and basic validation is performed on the data values entered, to assist scholars to enter data that conforms to the AustLit data model.
Fig. 6. Publishing a scholarly digital edition using LORE
LORE [12] allows metadata and relationships to be specified at the resource level, which are then published as OAI-ORE-compliant RDF. Annotations extend the capabilities of LORE so that parts of resources (such as image regions, or sections of a text) can be discussed, and annotations can be attached to this context. As each annotation is itself a resource with a unique URI, a Scholarly Edition can be built up by a group of collaborators annotating multiple documents representing different versions of a work, and then published as a LORE compound object that encapsulates all of the versions, along with ScholarlyAnnotations that provide critical commentary and VariationAnnotations that describe each variation in detail. Figure 6 shows how the annotation client integrates with LORE to allow annotations to be selectively added to a compound object from the annotation search or browse panel. In addition to publishing collections of annotations using LORE, scholars can select annotations for export to RDF/XML or to a Microsoft Word document, which lists the metadata, target URI(s) and contents of the selected annotations.
5 Discussion
A survey of existing scholarly annotation systems [15] revealed that the majority:
• focus on scholarly formats such as TEI to the exclusion of general Web resources;
• are closed systems that do not allow annotation of outside documents;
• act as silos: annotations are not easily reused or referenced outside of the system.
This presents a problem for collaboration and re-use of annotations. The OAC model allows any online resource to be the target or content of an annotation, regardless of media type or location, and is designed to facilitate interoperability between annotation systems. Our case study has demonstrated that the OAC model can be extended to enable specialized scholarly annotations that record scholarly metadata and relate multiple documents. Web pages are often discounted as scholarly resources as they are considered to be "transitory representations of the scholarly objects that need Annotation" [5]. By using RDFa to encode the relationship between presentation markup and the scholarly or data entities represented, our system enables scholarly annotation of dynamic Web pages from digital library and information systems. Feedback from AustLit scholars has included a request for semantic contexts for VariationAnnotations. While our semantic context works well for AustLit record pages, it remains to be seen how well this approach generalizes to Web pages representing full texts rendered from TEI XML. We believe our approach can be generalized to work with other ontologies and types of data; however, the following considerations should be noted. Current JavaScript libraries for extracting and working with RDFa are not very efficient, although we expect browser support for RDFa to improve over time. Each data entity of interest needs a URI. Blank nodes should be avoided in RDFa, and triples need to be reified in order to be identified within the semantic context. Our system generates a unique hash that identifies the context triple; a better approach would be to construct Named Graphs. XPointer contexts are also used in combination with the semantic context for presentation markup that exists inside of RDFa spans. This is necessary because the granularity of our RDFa does not allow us to specify contexts down to the character level.
This will be a problem when working with TEI documents, as the sections within AustLit’s TEI documents are very large. We may need to enrich the original TEI documents to provide finer section granularity.
6 Future Work

Further work to be undertaken on this project includes a detailed performance and user evaluation of the system and the following:
• To improve the integration of the annotation client and the LORE Compound Object editor for publishing scholarly digital editions, including options to export scholarly compound objects that contain annotations to TEI documents;
• To store versioning information for Web documents in order to alert scholars about changes that may have occurred since the page was annotated;
• To harvest and map metadata contributed through annotation into AustLit, with a facility for filtering and moderating the annotations for inclusion;
• To investigate using OAI-ORE with the OAC model for Variation annotations so that users can compare and record changes between more than two documents.
7 Conclusions

In this paper we have described the implementation of a scholarly annotation and publishing framework that enables literary scholars to annotate web resources with tags, comments and notes using the Open Annotations Collaboration (OAC) model. We have extended the OAC model to enable scholars to create VariationAnnotations, which document changes between texts, and SemanticAnnotations for contributing typed metadata. Our annotation client leverages RDFa embedded within dynamic Web pages to identify the underlying data entities and to allow annotations to be associated directly with them, rather than with the constantly changing presentation markup. Finally, through integration with LORE, annotations and other digital resources can be aggregated into OAI-ORE compound objects representing scholarly digital editions, and shared with the research community to enrich the AustLit collection.

Acknowledgements. Aus-e-Lit is funded by DEST through the National eResearch Architecture Taskforce. We gratefully acknowledge the valuable contributions made to this paper by Roger Osborne, Kerry Kilner and the AustLit research communities.
References
1. Aus-e-Lit, http://www.itee.uq.edu.au/~eresearch/projects/aus-e-lit/
2. AustLit: The Australian Literature Resource, http://austlit.edu.au
3. Kilner, K.: The AustLit Gateway and Scholarly Bibliography: A Specialist Implementation of the FRBR. Cataloging & Classification Quarterly 39(3/4) (2004)
4. Hunter, J., Khan, I., Gerber, A.: HarVANA – Harvesting Community Tags to Enrich Collection Metadata. In: JCDL, PA, USA, June 16–20, pp. 147–156 (2008)
5. Boot, P.: Mesotext. Digitised Emblems, Modelled Annotations and Humanities Scholarship. Pallas Proefschriften, Amsterdam. PhD Thesis (2009)
6. Text Encoding Initiative (TEI), http://www.tei-c.org/index.xml
7. Banski, P., Przepiorkowski, A.: Stand-off TEI Annotation: the Case of the National Corpus of Polish. In: ACL-IJCNLP LAW III, Singapore, August 6–7 (2009)
8. Open Annotations Collaboration, http://www.openannotation.org/
9. W3C Annotea Project, http://www.w3.org/2001/Annotea/
10. Hemminger, B.: NeoNote. Suggestions for a Global Shared Scholarly Annotation System. D-Lib Magazine (May/June 2009)
11. W3C RDFa in XHTML, http://www.w3.org/TR/rdfa-syntax/
12. Gerber, A., Hunter, J.: LORE: A Compound Object Authoring and Publishing Tool for the Australian Literature Studies Community. In: Buchanan, G., Masoodian, M., Cunningham, S.J. (eds.) ICADL 2008. LNCS, vol. 5362, pp. 246–255. Springer, Heidelberg (2008)
13. Chernich, R., Crawley, S., Hunter, J.: Universal Collaborative Annotations with Thin Clients – Supporting User Feedback to the Atlas of Living Australia. In: eResearch Australasia, Sydney, Australia, November 13–15 (2009)
14. Furuta, R., Urbina, E.: On the Characteristics of Scholarly Annotations. In: Conference on Hypertext and Hypermedia, MD, USA, pp. 78–79 (2002)
15. Hunter, J.: Collaborative Semantic Tagging and Annotation Systems. In: Annual Review of Information Science and Technology, vol. 43. American Society for Information Science & Technology (2009)
The Relation between Comments Inserted onto Digital Textbooks by Students and Grades Earned in the Course

Akihiro Motoki, Tomoko Harada, and Takashi Nagatsuka

Tsurumi University, Dept. of Library, Archival and Information Studies, Tsurumi 2-1-3, Tsurumi-ku, Yokohama, 230-8501, Japan
{Motoki-A,Harada-T,Nagatsuka-T}@Tsurumi-u.ac.jp
Abstract. When students read textbooks in the classroom, they usually apply active reading. The practice of marking up university textbooks is a familiar one. Students scribble comments in the margins, highlight elements, underline words and phrases, and correlate distinct parts to foster critical thinking. While the use of annotations during active reading supports the students themselves, the annotations can also be useful for other readers. Investigations were carried out to evaluate the comments inserted by students onto their digital textbooks and how this relates to the grade earned at the end of the course. The results of our study highlight two main factors influencing students' eventual grade: the quantity and quality of annotation. Students who wrote many comments and focused on the more important keywords in the text tend to receive a higher grade. Accordingly, our analysis was based on the number of comments and the quality of text word selection.

Keywords: annotation, electronic note-taking, grade, digital textbook, comments.
to use such a system. It is clear we need to address these problems regarding annotation in the digital environment [4]. Another important student behavior is taking notes with a pen and paper, known as traditional note-taking [5,6]. Traditional note-taking remains an activity that many students in higher education continue to rely upon heavily [7]. The long-established tradition of note-taking may also benefit from recent advances in digital technology. We also have a comprehensive understanding of the many issues that surround the process of traditional pen-on-paper note-taking, for example the relationship between note-taking and the storage function of notes, optimal note-taking behaviors, and the relationship between note-taking and one's score on a test [5-6,8-11]. Many studies or projects involving in-class and online educational technology include note-taking or annotation applications [2-4,7,9]. In-class and online note-taking or annotation systems are often based upon the methodologies of traditional note-taking or annotation on paper documents. Therefore, it is valuable to know the ways in which students today manage their note-taking and annotation practices when they are using digital textbooks, and also how they view traditional note-taking with a pen and paper versus electronic note-taking and annotation. The authors chose to build the digital textbooks (DTs) with Microsoft Word 2003 because of its annotation and note-taking features [12]. MS Word is the only digital system among annotation and note-taking systems that supports in-line annotations [4]. Digital textbooks (DTs) are uploaded to the server, and students then download the DTs to their portable personal computers for use in each lecture. Students then write comments on the DTs in their classrooms [13]. First, HTML tags are inserted into each set of comments, together with the selected text, in the DT using a macro for Word 2003; then each tagged set of comments with its selected text is extracted with a Perl script [14,15]. The students were divided into two distinct groups: cluster-A, defined as those DTs in which students inserted many comments, and cluster-B, in which students inserted few comments [14]. The number of comments inserted by students is gauged to be one of the useful indexes for quantitative analysis of student motivation [14]. A preliminary study was carried out on how to relate a student's grade earned at the end of the course to the number of comments on his/her digital textbook. What the authors discovered appears to suggest that there is a positive correlation between the number of comments inserted by students and their grades earned at the end of the course [15]. In this paper, investigations are carried out to evaluate the comments inserted by students onto their digital textbooks and how this relates to their eventual grade earned at the end of the course. Our study highlights two main factors influencing students' eventual grade: the quantity and quality of annotation. We investigate whether students who wrote many comments and focused on the more important keywords in the text have a tendency to receive a higher grade. Accordingly, our analysis is based on the number of comments and the quality of text word selection.
2 Method
2.1 Overview
The authors chose digital textbooks (DTs) built with Microsoft Word 2003 because of its annotation and note-taking features. In MS Word, the composition of a textual
annotation takes place in a sub-window within the main editing window. Among annotation and note-taking systems, MS Word is the only digital system that supports inline annotations. In the DT, comments are inserted without overwriting the original text and are indicated by the annotator's name between double brackets. When reading the DT, students may filter comments by author and may also opt to hide all comments. Digital textbooks (DTs) are uploaded onto our department's server at Tsurumi University (Stage 1 in Fig 1), and students then download the DTs to their portable personal computers for use in lectures (Stage 2 in Fig 1). Students write comments on the DTs in their classrooms (Stage 3 in Fig 1). The DTs with comments inserted by students were collected at the end of each semester in 2005, 2006 and 2007. The authors analyze the comments inserted into the DTs. First, HTML tags are added around each set of comments and its selected text in the DT with a macro for Word 2003 (Stage 6 in Fig 1), and then each tagged set of comments with its selected text is extracted from the HTML documents with PerlScript (Stage 7 in Fig 1).
Fig. 1. Extraction process of comments inserted by students on Digital Textbooks
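The paper implements this extraction with a Word 2003 macro and PerlScript; purely as an illustration of the same idea, the sketch below pulls (selected text, comment) pairs out of an HTML export of a DT in Java. The marker tag, its attributes and the file layout are assumptions made for this example, not the authors' actual macro output.

```java
import java.nio.file.*;
import java.util.regex.*;
import java.util.*;

// Sketch: extract (selected text, comment) pairs from an HTML-exported DT.
// Assumes the macro wrapped each pair in a hypothetical marker span such as:
//   <span class="dt-comment"><b>selected text</b><i>comment</i></span>
public class CommentExtractor {
    private static final Pattern COMMENT = Pattern.compile(
        "<span class=\"dt-comment\"[^>]*>\\s*<b>(.*?)</b>\\s*<i>(.*?)</i>\\s*</span>",
        Pattern.DOTALL);

    public static void main(String[] args) throws Exception {
        String html = Files.readString(Path.of(args[0]));   // HTML export of one DT
        Matcher m = COMMENT.matcher(html);
        List<String[]> pairs = new ArrayList<>();
        while (m.find()) {
            pairs.add(new String[] { m.group(1).trim(), m.group(2).trim() });
        }
        System.out.println("comments: " + pairs.size());    // quantity index per student
        for (String[] p : pairs) {
            System.out.println(p[0] + "\t" + p[1]);          // selected text, comment text
        }
    }
}
```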
2.2 Subjects
Students select a part of the text in the DT, and a comment frame coupled with the selected portion of text is then inserted into the DT automatically. Students write annotations and/or notes within the frame, such as their opinions, the meaning of a word or phrase, or questions to the teacher (Fig 2). Because inserted comments are automatically numbered in order, students find it easy to manage the comments they create in the classroom.
Fig. 2. A comment with a selected portion of text inserted by a student into a Digital Textbook
2.3 Procedure
Digital textbooks (DTs) in Japanese are prepared with Microsoft Word 2003 by the teacher in charge of each course, “Introduction to Networks” and “Introduction to Multimedia”. The two DTs comprise 41 and 54 pages, respectively. The DTs in both courses are composed of alphanumerical text and graphics such as figures, tables and images. In the opening lecture of the course, the teacher explains to students how to use the DT and encourages them to add comments on their DT during the lectures. The teacher suggests to students that adding comments will improve their learning, and also explains that their DT will be collected for investigation at the end of the course. In the final lecture of the course, students upload their DT with their comments and annotations to the server. The eventual grades in the course are based on a weighted combination of the following three requirements: class participation, assignment and/or quiz, and examination. The eventual grade is one of S (100-90 points), A (89-80 points), B (79-70 points), C (69-60 points) and D (59-0 points). Students who receive a grade of S, A, B or C get credit for the course, while students who receive a grade of D fail to get credit for the course.
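As a small illustration of the grading bands described above, the sketch below maps a weighted score to a grade. The paper does not state the weights given to class participation, assignment/quiz and examination, so the 0.2/0.3/0.5 split here is a placeholder assumption.

```java
// Sketch: map a final score (0-100) to the grade bands described above.
// The 0.2/0.3/0.5 weighting is a placeholder; the paper does not give the weights.
public class Grading {
    static String grade(double participation, double assignment, double exam) {
        double points = 0.2 * participation + 0.3 * assignment + 0.5 * exam;
        if (points >= 90) return "S";
        if (points >= 80) return "A";
        if (points >= 70) return "B";
        if (points >= 60) return "C";
        return "D";              // D means no credit for the course
    }

    public static void main(String[] args) {
        System.out.println(grade(85, 78, 72));  // prints "B" for this example
    }
}
```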
3 Data Analysis and Results
The number and percentage of students who added comments to Digital Textbooks (DTs) are shown in Table 1. As identified above, students annotate text in their DTs using the comment feature of MS Word. Both courses, “Introduction to Networks” and “Introduction to Multimedia”, are required courses during the first year in the Department of Library, Archival and Information Studies at Tsurumi University. The number of students who registered for each course fluctuated over the three years, in part because students who failed these courses in their first year had to repeat them in their second year. The total number of students who registered for the two courses over the three years 2005, 2006 and 2007 was 510. The percentages of students who submitted their DTs to the server ranged from 62.1 to 84.5. The average percentage of students who submitted their DTs, relative to the total number of registered students, was 74.1. The percentages of
students who added comments to DTs ranged from 83.1 to 100.0. The average percentage of students who added comments, relative to the total number of students who submitted their DTs, was 92.7. As shown above, most of the students who submitted their DTs added comments. In three classes, “Introduction to Networks” in 2006 and “Introduction to Multimedia” in 2005 and 2006, the proportion of students who did not add comments as annotations onto their DT was 10 percent or more of all students who submitted a DT in the course. Some of the students who did not add comments used the highlighter pen and/or color-change features as a form of annotation on their DT. Most of the students who did not submit their DT failed to get credit for the course. The teachers encouraged their students to add comments during the lectures in 2007. Compared with previous years, the proportion of students who added comments onto DTs in the two classes of “Introduction to Networks” and “Introduction to Multimedia” in 2007 rose above 95 percent. Being able to show that adding comments improves their scores would clearly encourage students to adopt this practice more vigorously.
Table 1. The number and percentage of students who added comments to Digital Textbooks (DTs)
Course | Year | Registered students (RS) | Students submitted DTs (SS) (% of RS) | Students added comments to DTs (% of SS)
Introduction to Networks | 2005 | 84 | 69 (82.1) | 67 (97.1)
Introduction to Networks | 2006 | 84 | 71 (84.5) | 59 (83.1)
Introduction to Networks | 2007 | 83 | 67 (80.7) | 65 (97.0)
Introduction to Multimedia | 2005 | 82 | 60 (73.2) | 54 (90.0)
Introduction to Multimedia | 2006 | 90 | 60 (66.7) | 54 (90.0)
Introduction to Multimedia | 2007 | 87 | 54 (62.1) | 54 (100.0)
Total | | 510 | 381 (74.1) | 353 (92.7)
Table 2 shows the average number of comments for each grade in the two courses “Introduction to Networks” and “Introduction to Multimedia”. The average numbers of comments in the course “Introduction to Networks” are 36.91 +/- 15.73 (mean +/- standard deviation) in S, 31.78 +/- 18.21 in A, 20.68 +/- 15.54 in B and 11.12 +/- 13.47 in C, respectively. The average number of comments clearly declines from grade S to grade C. The average numbers of comments in the course “Introduction to Multimedia” were 49.80 +/- 11.43 (mean +/- standard error of the mean) in S, 28.51 in A, 22.23 in B and 13.71 in C, respectively. Here too, the average number of comments clearly declines from grade S to grade C. In both cases, we can point to a trend that high-performing students who got better grades added more comments in the form of annotations than low-performing students who received a poor grade. It can be supposed from the results of our experiments shown in Table 2 that the average number of comments per student within a group of the same grade has a positive relation with the eventual grades earned in the course.
Table 2. The average number of comments inserted by each student onto the DT and their eventual grades earned in the two courses “Introduction to Networks” and “Introduction to Multimedia”
Course | Grade | Mean | SD
Introduction to Networks | S | 36.91 | 15.73
Introduction to Networks | A | 31.78 | 18.21
Introduction to Networks | B | 20.68 | 15.54
Introduction to Networks | C | 11.12 | 13.47
Introduction to Multimedia | S | 49.80 | 20.56
Introduction to Multimedia | A | 28.51 | 20.65
Introduction to Multimedia | B | 22.23 | 15.25
Introduction to Multimedia | C | 13.71 | 12.43
The left side of Fig 3 is a scatter plot of the number of comments inserted by each student on the DT against the grade earned in the course “Introduction to Networks”, and the right side of Fig 3 shows the corresponding scatter plot for the course “Introduction to Multimedia”. The number of comments within each grade is widely dispersed, from zero to high counts. For example, the number of comments in the grade-A group in “Introduction to Networks” ranges from zero to 101. As mentioned earlier, the eventual grade was based on the weighted combination of three requirements: class participation, assignment and/or quiz, and examination. However, it needs to be investigated further why some students who did not add many comments could still receive a higher eventual grade. The average number of comments per student across all 381 students was 24.02. Conversely, the grade-C groups in both “Introduction to Networks” and “Introduction to Multimedia” included students who added more than 40 comments, which is considerably higher than the overall average across the 381 students. We need further study that analyzes the contents written by students within the comment frames to understand more exactly the relation between comments and the eventual grade earned in the course.
Fig. 3. Scatter plots of the number of comments inserted by each student on the DT against the eventual grade earned in the two courses “Introduction to Networks” and “Introduction to Multimedia”
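As a rough sketch of the kind of per-grade summary reported in Table 2 (not the authors' analysis scripts), the following computes the mean and standard deviation of comment counts for each grade group; the comment counts used here are illustrative values only.

```java
import java.util.*;

// Sketch: per-grade mean and standard deviation of comment counts, as in Table 2.
// The counts are illustrative placeholders, not the study's data; the population
// SD is shown (the paper does not state which variant was used).
public class GradeStats {
    public static void main(String[] args) {
        Map<String, int[]> countsByGrade = Map.of(
            "S", new int[] {41, 28, 52, 33},
            "A", new int[] {30, 12, 45, 27, 101, 0},
            "B", new int[] {25, 9, 18},
            "C", new int[] {2, 40, 7});
        for (String g : List.of("S", "A", "B", "C")) {
            int[] c = countsByGrade.get(g);
            double mean = Arrays.stream(c).average().orElse(0);
            double var = Arrays.stream(c)
                               .mapToDouble(x -> (x - mean) * (x - mean))
                               .sum() / c.length;
            System.out.printf("%s: mean=%.2f sd=%.2f%n", g, mean, Math.sqrt(var));
        }
    }
}
```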
The terms ranked within the top 10 by frequency were picked out from all words and phrases selected by students in their annotations on the DTs of “Introduction to Networks” (Table 3) and “Introduction to Multimedia” (Table 4). Some terms were translated from Japanese to English before being listed in the tables. The number of comments was first calculated automatically for each word or phrase, and then similar words or phrases were merged into a single term manually. For example, because “OSI” is high up on the list of higher-frequency terms in “Introduction to Networks”, “OSI with (section number)” and “OSI with (Japanese words)” were merged into the single term “OSI”.
Table 3. The higher-frequency terms picked out from all words and phrases selected by students to add comments onto the DT of “Introduction to Networks”
[Table 3: for each grade (S, A, B, C), the top-10 words or phrases selected by students and the number of comments on each. Frequently selected terms include flow control, routing, protocol, OSI, physical layer, data link layer, transport layer, network layer, IP, IP address, packet, bluetooth, broadcast, buffer over-flow, router, error control, INS and ACK.]
In the case of “Introduction to Networks”, terms such as protocol, the names of some layers of the OSI reference model, flow control, routing or router, and packet were used often by students across all grades. In the case of “Introduction to Multimedia”, terms such as ASCII, Open (Web), Stor(aged Information), font, binary file, CD-ROM, ANK and JIS were also used often by students across all grades.
Table 4. The higher-frequency terms picked out from all words and phrases selected by students to add comments onto the DT of “Introduction to Multimedia”
Order | S: words or phrases (comments) | A: words or phrases (comments) | B: words or phrases (comments) | C: words or phrases (comments)
1 | RAM (5) | ASCII (48) | ASCII (37) | Open (16)
2 | ROM (5) | Open (36) | multimedia (30) | ASCII (16)
3 | ASCII (5) | multimedia (30) | digital contents (29) | Stor (13)
4 | font (5) | Deep (28) | Open (27) | binary file (13)
5 | semiconductor memory (4) | Stor (27) | CD-ROM (21) | font (11)
6 | binary file (4) | JIS (23) | text file (21) | CD-ROM (9)
7 | property (4) | ANK (19) | hard disk (19) | hard disk (9)
8 | Open (3) | binary file (16) | Deep (19) | floppy disk (9)
9 | DIF (3) | bitmap format (15) | floppy disk (17) | ANK (9)
10 | JIS (3) | escape sequence (15) | ANK (16) | Deep (9)
Fig. 4 and Fig. 5 show the similarities of the terms which appeared often among the four grades. Grade S is used as the control point in Fig 4, and grade C is used as the control point in Fig 5. Based on the lists of higher-frequency terms picked out from each grade, the similarities of terms in “Introduction to Networks” and “Introduction to Multimedia” were calculated for both Fig 4 and Fig 5. The number of terms and the frequency of appearance in “Introduction to Networks” are 22 and over 5 in grade S, 22 and over 13 in grade A, 22 and over 9 in grade B, and 19 and over 4 in grade C, respectively. The number of terms and the frequency of appearance in “Introduction to Multimedia” are 18 and over 3 in grade S, 22 and over 11 in grade A, 19 and over 11 in grade B, and 20 and over 7 in grade C, respectively.
Fig. 4. The similarity of high-frequency terms picked out from all words and phrases selected by students, compared among the four grades with grade S as the control point
The percentages of similar terms in grades A, B and C in “Introduction to Multimedia” are 50%, 53% and 45% of grade S, respectively. The percentages of similar terms in grades A, B and C in “Introduction to Networks” are 64%, 64% and 47% of grade S, respectively (Fig 4). The percentages of similar terms in grades B, A and S in “Introduction to Multimedia” are 79%, 59% and 50% of grade C, respectively. The percentages of similar terms in grades B, A and S in “Introduction to Networks” are 59%, 55% and 41% of grade C, respectively (Fig 5). Both figures illustrate that the similarities of the terms which appeared often among the four grades positively relate to the eventual grades earned at the end of the course.
Fig. 5. The similarity of high-frequency terms picked out from all words and phrases selected by students, compared among the four grades with grade C as the control point
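The paper does not give the exact similarity formula behind Figs. 4 and 5; assuming it is the fraction of the control grade's high-frequency terms that also appear in the compared grade's list, a minimal sketch might look as follows, with illustrative term lists.

```java
import java.util.*;

// Sketch: percentage of a control grade's high-frequency terms that also appear
// in another grade's list. The overlap ratio is an assumption (the paper does not
// state the formula), and the term lists below are illustrative only.
public class TermOverlap {
    static double overlapPercent(Set<String> control, Set<String> other) {
        long shared = control.stream().filter(other::contains).count();
        return 100.0 * shared / control.size();
    }

    public static void main(String[] args) {
        Set<String> gradeS = Set.of("flow control", "protocol", "bluetooth", "OSI", "routing");
        Set<String> gradeA = Set.of("flow control", "routing", "broadcast", "physical layer", "OSI");
        System.out.printf("A vs S: %.0f%%%n", overlapPercent(gradeS, gradeA)); // 60%
    }
}
```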
4 Conclusion
In this study, we found that the average number of comments per student within a group of the same grade is related to the eventual grade earned at the end of the course; in other words, high-performing students who got better grades added more comments in the form of annotations than under-performing students who received a poor grade. There are already many studies concerning students' practice of adding annotations to their printed textbooks and taking notes in lectures [1, 5, 6, 8]. In our study, students added their comments in the form of annotations onto their digital textbooks using the comment feature of MS Word. We also found other kinds of annotations on their digital textbooks: for example, highlighting important words, changing part of a sentence to a different color, or writing words directly within the text without using the comment feature of MS Word. It needs to be investigated further what differences there are between the annotation and note-taking practices of students on digital textbooks and those on printed textbooks. We also found in the study that the similarity of terms which appeared often among the four grades is positively related to the eventual grades earned at the end of the course. Further study is needed to analyze the contents written by students within the comment frames to understand more exactly the relation between comments and the eventual grade earned in the course.
With the growth of the digital environment, the document tradition is changing from paper to electronic. As more of our educational material moves to the computer, supporting annotation and note-taking digitally becomes an important task. At the same time, technology gives us unprecedented control over the annotation and note-taking process. We should study student behavior and practices and what changes are likely to occur through the use of digital textbooks as opposed to the more traditional paper-and-pen approach in the classroom learning environment. Digital textbooks should be improved and applied to lessons to increase student motivation and encourage better grades. The next major step will be to build a system that extracts the sets of annotated text and annotation data added by students onto the digital textbooks and then analyzes the data automatically.
References
1. Marshall, C.: Annotation: from paper books to the digital library. In: Proceedings of the ACM Digital Libraries 1997 Conference, pp. 131–140 (1997)
2. Hoff, C., Wehling, U., Rothkugel, S.: From paper-and-pen annotations to artifact-based mobile learning. Journal of Computer Assisted Learning 25(3), 219–237 (2009)
3. Hoff, C., Rothkugel, S.: Shortcomings in Computer-based Annotation Systems. In: Proceedings of World Conference on E-Learning in Corporate, Government, Healthcare, and Higher Education 2008, pp. 3715–3720 (2008)
4. Cousins, S.B., Baldonado, M., Paepcke, A.: A Systems View of Annotations. Xerox PARC Tech Report P9910022 (2000)
5. Hartley, J., Ivor, K.D.: Note-Taking: A Critical Review. Programmed Learning and Educational Technology 15(3), 207–224 (1978)
6. Palmatier, R.A., Bennet, J.M.: Note-taking habits of college students. Journal of Reading 18, 215–218 (1974) 7. Reimer, Y.J., Brimhall, E., Chen, C., O’Reilly, K.: Empirical user studies inform the design of an e-notetaking and information assimilation system for students in higher education. Computers & Education 52(4), 893–913 (2009) 8. Knight, L.J., McKelvie, S.J.: Effects of attendance, note-taking, and review on memory for a lecture: Encoding vs. external storage functions of notes. Canadian Journal of Behavioural Science 18(1), 52–61 (1986) 9. Bauer, A., Mellon, C.: Selection-based note-taking applications. In: Proceedings of the SIGCHI conference on Human factors in computing systems, pp. 981–990 (2007) 10. Kiewra, K.A., Benton, S.L.: The relationship between information-processing ability and notetaking. Contemporary Educational Psychology 13(1), 33–44 (1988) 11. Kiewra, K.A.: A review of note-taking: The encoding-storage paradigm and beyond. Educational Psychology Review 1(2), 147–172 (1989) 12. Motoki, A., Harada, T., Nagatsuka, T.: Poster Presentation: Digital Workbooks applied on the Librarian Training Course. In: World Library and Information Congress: 72nd IFLA General Conference and Council, Seoul (2006) 13. Motoki, A., Harada, T., Nagatsuka, T.: Digital Workbooks designed to improve skills of the Students on Librarian Training Course: A Content Analysis of Student Written Comments. The bulletin of Tsurumi University. Part 4, Studies in humanities, social and natural sciences 44, 69–76 (2007) 14. Motoki, A., Harada, T., Nagatsuka, T.: Digital Workbooks designed to improve skills of the Students on Librarian Training Course (2): Evaluation of the workbooks and content analysis of the comments written by students. The bulletin of Tsurumi University. Part 4, Studies in humanities, social and natural sciences 45, 97–111 (2008) 15. Motoki, A., Harada, T., Nagatsuka, T.: Poster Presentation: The Effects of Digital Workbook on the Information Literacy Education: The number of comments written by students and a grade for the subjects. Annual meeting of Japanese Society for Information and Media Studies (2008)
Visualizing and Exploring Evolving Information Networks in Wikipedia
Ee-Peng Lim², Agus Trisnajaya Kwee¹, Nelman Lubis Ibrahim¹, Aixin Sun¹, Anwitaman Datta¹, Kuiyu Chang¹, and Maureen¹
¹ School of Computer Engineering, Nanyang Technological University, Singapore
{atkwee,INLUBIS,axsun,anwitaman,askychang,maureen}@ntu.edu.sg
² School of Information Systems, Singapore Management University, Singapore
[email protected]
Abstract. Information networks in Wikipedia evolve as users collaboratively edit the articles that embed them. These information networks represent both the structure and content of a community's knowledge, and they evolve as the knowledge gets updated. By observing the networks evolve and finding their evolving patterns, one can gain higher-order knowledge about the networks and conduct longitudinal network analysis to detect events and summarize trends. In this paper, we present SSNetViz+, a visual analytic tool to support visualization and exploration of Wikipedia's information networks. SSNetViz+ supports time-based network browsing, content browsing and search. Using a terrorism information network as an example, we show that different time-stamped versions of the network can be interactively explored. As information networks in Wikipedia are created and maintained by collaborative editing efforts, the edit activity data are also shown to help detect interesting events that may have happened to the network. SSNetViz+ also supports temporal queries that allow other relevant nodes to be added so as to expand the network being analyzed.
1 Introduction
An information network represents information entities as nodes and their inter-relationships as edges. In this paper, we address the problem of visualizing and exploring evolving information networks in Wikipedia. Each Wikipedia article, written in Wikitext, describes an information entity (e.g., a person) and contains links to other related information entities (e.g., country of birth, company worked for, etc.). Using a web browser, users can browse Wikipedia articles and navigate to other articles via links. This mode of user interaction, however, does not support a network view of Wikipedia information. It is therefore difficult to gain high-level knowledge about the browsed articles in the context of the entire information network. Wikipedia also maintains historical versions of an article under the history tab of the article. Nevertheless, these historical article versions are viewed as singletons and not as part of some evolving information network. Visualizing and exploring evolving information networks is important for several reasons. Firstly, it enables us to understand the relationships among network
nodes in the time dimension. When a link has existed between a pair of nodes for a long time, it is considered more permanent than another link that exists only for a very short time period. Secondly, time-based network visualization allows one to study interesting network changes that signify interesting trends or events. These may well be trends and events that occur in the physical world. SSNetViz+ is a tool designed and implemented to overcome the limitations of existing web browsers in analysing Wikipedia information networks. As an extension of the earlier SSNetViz project [6], which focuses on visualizing and exploring heterogeneous semantic networks, SSNetViz+ introduces a new time dimension and a multi-version information network representation. It supports storage of multiple versions of an information network with multiple node types. By introducing new operators to manipulate information networks, it helps users to better understand the relationships among networks, network trends and events. From Wikipedia articles, we first extract relevant information networks of multiple versions based on the network analysis task to be conducted. At present, this step is carried out semi-automatically. The extracted networks are stored in a repository. Users can then perform interactive network analysis on the multiple versioned networks using a combination of network manipulation operators and network search. There are a few design challenges for SSNetViz+. In the following, we outline these challenges and our general approaches to tackling them.
– When multiple versions of information networks exist, how can they be visualized and explored without overloading users with too much information? An information network can be very large in size. Having multiple versions of information networks only aggravates information overloading. Displaying the entire network or multiple versions of the network on a single screen is usually infeasible and not useful for analysis. SSNetViz+ is thus designed to show only one version of the network at a time. Moreover, for a given version of the information network, we allow users to select a subset of nodes, known as anchor nodes, and their neighboring nodes to be visualized. This incremental exploration approach reduces the amount of information shown at a time, allowing users to focus on the interesting sub-networks and versions.
– How can time-based network analysis be performed easily? An important purpose of using SSNetViz+ is to find interesting sub-networks and versions through visual means. We realise this capability by introducing: (a) a time scrollbar for examining nodes created at different points in time; (b) historical node statistics to guide users in exploiting Wikipedia user activity data so as to identify the interesting network versions; (c) a delta graph operator to compare different versions of the network; and (d) search-enabled network exploration that allows new interesting nodes to be added to the visualized sub-networks to expand the scope of network analysis.
– How can information networks with multiple versions be searched? SSNetViz+ incorporates a keyword search engine that allows users to search for interesting nodes containing keywords within a user-specified time period. As the
same keywords may appear in different versions of the information network, we have to develop a new node result ranking function that considers the versions containing the keywords. In this paper, we will illustrate SSNetViz+ using a terrorism information network example. We create this information network by gathering terrorism related articles and links among them. Using this real example, we highlight the strengths of using SSNetViz+ for analysing evolving information networks.
2 Related Work
There are several previous research works related to this paper. The first body of related work concerns visual analytics on graph or network data. A good survey on graph and network visualization is given in [4]. The survey describes a variety of graph layout techniques applicable to both tree and graph structures. It also covers clustering techniques that summarize the amount of network information to be visualized. In the context of scientific literature, Chen also proposed several visualization techniques for citation, co-citation and other networks embedded in document collections [3]. His work also introduced the concept of a pivotal point, which refers to an article with a high betweenness centrality value. Such an article is believed to be important for the users to examine. Yang et al. described a visual analytic toolkit for detecting events in an evolving information network. The main idea is to define seven different events to be tracked for the network, and to visualize and explain the network changes [10]. This work shares the same objectives as ours, but we rely more on user activity data to find events. We combine visualization, search and analysis in SSNetViz+, while the toolkit is more for visualization only. In Wikipedia, there are also works on extracting topic or ontological networks from article content. Wu and Weld proposed the Kylin Ontology Generator (KOG) to extract an ontology from infoboxes of Wikipedia articles and combine it with WordNet [9]. DBpedia represents an ongoing effort to extract Wikipedia tagged content, turning it into a knowledge base of RDF data [1]. Kittur, Chi and Suh extracted category tags of Wikipedia articles and derived the high-level topic distribution of each article using the extracted category tags. By performing topic analysis on Wikipedia topics in January 2008 and July 2006, they concluded that “Natural and physical sciences” and “Mathematics and logic” were the two fastest growing topics [5]. To visualize the revision history of a Wikipedia article, Nunes et al. proposed a timeline representation of article revision activity and implemented it as an application known as WikiChanges [7]. HistoryViz is yet another web application that allows users to examine events and related entities of a person entity using data in Wikipedia [8]. Our work differs from the above as we use Wikipedia article titles as nodes in the information network instead of extracting nodes or events from article content. We study Wikipedia articles in an evolving network as opposed to one article evolving in isolation.
3 Modeling of Evolving Information Networks
3.1 Multi-version Information Network
The information network is the basic data structure that SSNetViz+ is designed to represent and manipulate. We define an information network $G = \langle V, E \rangle$ to be a set of nodes $V$ and a set of directed edges $E$. Every node belongs to some node type, and nodes of the same type share a common set of attributes. Similarly, edges are of some edge type but do not carry any attributes. For Wikipedia, the nodes are articles and the directed edges are links from articles to other articles. As Wikipedia allows users to modify or revise articles, this results in multiple versions of the information network for those articles. An article $v_i \in V$ in the information network can have multiple versions $\{v_{i1}, v_{i2}, \cdots, v_{i,|T_i|}\}$ created at different timestamps $T_i = \{t_{i1}, t_{i2}, \cdots, t_{i,|T_i|}\}$. Each article version $v_{ik}$ may have links to other articles, and we denote these links by $E_{ik} \subseteq V$. Given a set of articles $V$, we assume that their versions can be linearized by timestamp, and we define a version of the information network for each timestamp. Formally, we define a multi-version information network $\langle \mathbf{V}, \mathbf{E} \rangle$ as a series of information networks represented by $\langle \mathbf{V}, \mathbf{E}, \mathbf{T} \rangle$, where $\mathbf{V} = \{(v_i, t_{ij}) \mid v_i \in V \wedge t_{ij} \in T_i\}$, $\mathbf{E} = \{(v'_i, v'_j) \mid \exists v_i, v_j \in V, \exists t_{ik} \in T_i, \exists t_{jl} \in T_j, v'_i = (v_i, t_{ik}) \wedge v'_j = (v_j, t_{jl}) \wedge t_{ik} \geq t_{jl} \wedge v_j \in E_{ik}\}$, and $\mathbf{T}$ is the union of all articles' timestamps, i.e., $\mathbf{T} = \cup_{v_i \in V} T_i$. Given a multi-version information network $\langle \mathbf{V}, \mathbf{E}, \mathbf{T} \rangle$, we can induce an information network for a given node set $V_t \subseteq V$ at a timestamp $t$ ($t$ may or may not be in $\mathbf{T}$), defined by $\langle \mathbf{V}_t, \mathbf{E}_t \rangle$ where $\mathbf{V}_t = \{(v_i, t_{ij}) \in \mathbf{V} \mid v_i \in V_t \wedge t_{ij}$ is the latest timestamp at or before $t\}$ and $\mathbf{E}_t = \{(v'_i, v'_j) \in \mathbf{E} \mid v'_i, v'_j \in \mathbf{V}_t\}$. The induced information network is thus a snapshot of the multi-version information network at timestamp $t$. SSNetViz+ is designed to induce information networks with a user-specified node set at regular timestamps, e.g., weekly or monthly, for visualization and exploration. This requires a much smaller data size to be manipulated by SSNetViz+ compared with the original multi-version information network, hence reducing the overheads in network exploration. Henceforth, we will use information networks and induced information networks interchangeably.
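As a rough sketch of the induced-network idea (not the SSNetViz+ implementation), the following selects, for each article in a user-specified node set, its latest revision at or before a timestamp t and keeps only links whose endpoints are both in the snapshot. The class and field names are assumptions made for this example.

```java
import java.util.*;

// Sketch of inducing a snapshot at timestamp t from a multi-version network.
// ArticleVersion and its fields are assumed data structures, not SSNetViz+ code.
class ArticleVersion {
    String article;        // article title (node identity)
    long timestamp;        // revision timestamp
    Set<String> links;     // outgoing links of this revision
    ArticleVersion(String a, long ts, Set<String> l) { article = a; timestamp = ts; links = l; }
}

class InducedNetwork {
    Map<String, ArticleVersion> nodes = new HashMap<>(); // article -> chosen version
    List<String[]> edges = new ArrayList<>();            // directed (from, to) pairs

    // For each article in the node set, take its latest revision at or before t,
    // then keep only links whose target is also in the snapshot.
    static InducedNetwork induce(Map<String, List<ArticleVersion>> history,
                                 Set<String> nodeSet, long t) {
        InducedNetwork g = new InducedNetwork();
        for (String a : nodeSet) {
            history.getOrDefault(a, List.of()).stream()
                   .filter(v -> v.timestamp <= t)
                   .max(Comparator.comparingLong(v -> v.timestamp))
                   .ifPresent(v -> g.nodes.put(a, v));
        }
        for (ArticleVersion v : g.nodes.values()) {
            for (String target : v.links) {
                if (g.nodes.containsKey(target)) {
                    g.edges.add(new String[] { v.article, target });
                }
            }
        }
        return g;
    }
}
```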
3.2 Manipulation of Information Networks
Based on a set of induced information networks with equally spaced timestamps, we can manipulate the networks using a set of operators so as to explore them. In the following, we formally define some operators for this purpose.
Addition of an article node. This operator adds a new node $v'$ into an existing induced information network $\langle \mathbf{V}_t, \mathbf{E}_t \rangle$, returning a new induced information network $\langle \mathbf{V}'_t, \mathbf{E}'_t \rangle$ where (a) $\mathbf{V}'_t = \mathbf{V}_t \cup \{(v', t') \in \mathbf{V} \mid t'$ is the latest timestamp at or before $t\}$, and (b) $\mathbf{E}'_t = \mathbf{E}_t \cup \{(\mathbf{v}', v_j) \in \mathbf{E} \mid \mathbf{v}' = (v', t') \in \mathbf{V}'_t \wedge v_j \in \mathbf{V}_t\} \cup \{(v_j, \mathbf{v}') \in \mathbf{E} \mid \mathbf{v}' = (v', t') \in \mathbf{V}'_t \wedge v_j \in \mathbf{V}_t\}$.
When a new node is added, its links to the existing induced information network are also included in the new network if the links are to nodes in the existing network. Similarly, we can remove an article node from an induced information network by removing the relevant node and its links.
Removal of an article node. The node removal operator deletes a node $v'$ from an existing induced information network $\langle \mathbf{V}_t, \mathbf{E}_t \rangle$, returning a new induced information network $\langle \mathbf{V}'_t, \mathbf{E}'_t \rangle$ where (a) $\mathbf{V}'_t = \mathbf{V}_t - \{(v', t') \mid (v', t') \in \mathbf{V}_t\}$, and (b) $\mathbf{E}'_t = \mathbf{E}_t - \{(\mathbf{v}', v_j) \mid (\mathbf{v}', v_j) \in \mathbf{E}_t\} - \{(v_j, \mathbf{v}') \mid (v_j, \mathbf{v}') \in \mathbf{E}_t\}$ with $\mathbf{v}' = (v', t')$.
Delta difference of two induced information networks. The delta difference of two induced information networks at timestamps $t_1$ and $t_2$ ($t_1 < t_2$), $\langle \mathbf{V}_{t_2}, \mathbf{E}_{t_2} \rangle - \langle \mathbf{V}_{t_1}, \mathbf{E}_{t_1} \rangle$, returns three information networks, namely:
– the $\Delta^+$ network $\langle V^{\Delta+}, E^{\Delta+} \rangle$, defined by (a) $V^{\Delta+} = \{v \mid (v, t) \in \mathbf{V}_{t_2} \wedge \nexists (v, t') \in \mathbf{V}_{t_1}\}$, and (b) $E^{\Delta+} = \{(v, w) \mid ((v, t_k), (w, t_{k'})) \in \mathbf{E}_{t_2} \wedge \nexists ((v, t_l), (w, t_{l'})) \in \mathbf{E}_{t_1}\}$;
– the $\Delta^-$ network $\langle V^{\Delta-}, E^{\Delta-} \rangle$, defined by (a) $V^{\Delta-} = \{v \mid (v, t) \in \mathbf{V}_{t_1} \wedge \nexists (v, t') \in \mathbf{V}_{t_2}\}$, and (b) $E^{\Delta-} = \{(v, w) \mid ((v, t_k), (w, t_{k'})) \in \mathbf{E}_{t_1} \wedge \nexists ((v, t_l), (w, t_{l'})) \in \mathbf{E}_{t_2}\}$;
– the $\Delta^0$ network $\langle V^{\Delta 0}, E^{\Delta 0} \rangle$, defined by (a) $V^{\Delta 0} = \{v \mid (v, t) \in \mathbf{V}_{t_1} \wedge (v, t') \in \mathbf{V}_{t_2}\}$, and (b) $E^{\Delta 0} = \{(v, w) \mid ((v, t_k), (w, t_{k'})) \in \mathbf{E}_{t_1} \wedge ((v, t_l), (w, t_{l'})) \in \mathbf{E}_{t_2}\}$.
The $\Delta^+$, $\Delta^-$ and $\Delta^0$ information networks represent the added, deleted and remaining parts when comparing the newer and older information networks, respectively. Note that this operation does not return an induced network; the three resultant networks are not time specific.
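A minimal sketch of the delta difference on node sets, assuming the two snapshots are available as plain sets; edges would be handled the same way. This is illustrative only and not the SSNetViz+ code.

```java
import java.util.*;

// Sketch: delta difference of two snapshots on their node sets.
// delta-plus = nodes only in the newer snapshot, delta-minus = only in the older,
// delta-zero = nodes present in both. The node names are illustrative.
public class DeltaDiff {
    public static void main(String[] args) {
        Set<String> older = new HashSet<>(Set.of("Al-Qaeda", "Osama bin Laden", "Taliban"));
        Set<String> newer = new HashSet<>(Set.of("Al-Qaeda", "Taliban", "Zacarias Moussaoui"));

        Set<String> added = new HashSet<>(newer);   added.removeAll(older);   // delta-plus
        Set<String> removed = new HashSet<>(older); removed.removeAll(newer); // delta-minus
        Set<String> kept = new HashSet<>(older);    kept.retainAll(newer);    // delta-zero

        System.out.println("+ " + added + "  - " + removed + "  0 " + kept);
    }
}
```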
3.3 Example Terrorism Information Network
In the SSNetViz+ project, we construct a terrorism information network with multiple versions from Wikipedia. This network consists of 813 terrorist articles, 439 terrorist group articles and 1797 incident articles. These articles were identified by first using the terrorism related entities provided by MIPT’s Terrorism Knowledge Base1 (TKB) to locate the relevant Wikipedia articles, followed by another round of human checking so as to remove wrongly matched articles. Once the terrorism articles had been found, we manually identified the links between entities represented by these articles. A total of 1922 links were found. We extracted the version histories of all selected articles and constructed the induced information networks from the version histories with monthly timestamps. The final induced information networks cover the period from August 5, 2001 to January 6, 2008.
4 Network Visualization and Exploration
4.1 Overview of User Interface
In this section, we present the visualization and exploration features of SSNetViz+. SSNetViz+ has an interactive user interface as shown in Figure 1.
1 MIPT: National Memorial Institute for the Prevention of Terrorism.
Fig. 1. SSNetViz+ User Interface
The interface consists of: (1) a menu bar that allows information networks to be loaded for visualization; (2) a search bar for querying the network nodes; (3) a visualization panel for displaying networks and exploring them; (4) an anchor node panel for maintaining a list of anchor nodes that the user has bookmarked as interesting; and (5) a profile panel for displaying the Wikipedia article of any selected node.
4.2 Network Visualization and Exploration
Like its predecessor SSNetViz, SSNetViz+ displays the nodes of an information network in different shapes depending on their node types. As shown in Figure 1, terrorist group nodes (e.g., Al-Qaeda) and terrorist nodes (e.g., Osama bin Laden) are shown as rectangles and ovals respectively. The lines between nodes represent relationships, with source nodes at the thick ends and destination nodes at the thin ends of the lines. One can perform normal and hyperbolic zooming, rotation, and node positioning on the information network using the display options and slide bar at the bottom of the main window. A set of important nodes to be always included in the visualization panel can be bookmarked by the user as anchor nodes. Anchor nodes are maintained in the anchor node panel, and the respective nodes in the visualization panel are adorned with a green box at the bottom right corner. Whenever a node is selected, the textual content of the article it represents appears in the profile panel. A small red box on the top right corner of a node indicates that the node has neighboring nodes that are not yet shown; the number in the red box shows how many.
Time travel using scroll bar. SSNetViz+ incorporates several other unique visual features. To help the user select an induced information network with a specific timestamp for display, a time scroll bar at the bottom of the window is used. Once a timestamp is selected, an appropriate induced information network will be shown with appropriate colors assigned to the nodes indicating their existence statuses. The node existence status can be pre-stub (article not created and cited yet), stub (article not created but cited) or created (article created and cited). We use red, yellow and green to encode the three status values of nodes respectively. To help a user visualize when the node articles are cited and created, SSNetViz+ can display the life span bars of selected nodes which display the three stages of the nodes in red, yellow and green. In this way, the user can easily tell when a node article is cited and created, and move the time scroll bar to a timestamp to examine the network and node articles at the timestamp. For example, Figure 2(a) shows the Al-Qaeda’s network before its article was created. As we move the time scroll bar to the later timestamps, the Al-Qaeda network evolves to have more neighbors as shown in Figures 2(b) and 2(c).
Fig. 2. Al-Qaeda network at different timestamps: (a) the initial Al-Qaeda network; (b) at June 6, 2002; (c) at April 18, 2003
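The three node existence statuses and their colors described above can be summarized in a small sketch; the enum and method names are illustrative, not SSNetViz+ source.

```java
import java.awt.Color;

// Sketch: node existence statuses and the colors used to display them.
enum NodeStatus { PRE_STUB, STUB, CREATED }

class StatusColors {
    static Color colorFor(NodeStatus s) {
        switch (s) {
            case PRE_STUB: return Color.RED;     // article not created and not cited yet
            case STUB:     return Color.YELLOW;  // article cited but not yet created
            case CREATED:  return Color.GREEN;   // article created and cited
            default:       return Color.GRAY;
        }
    }
}
```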
Browsing articles based on past activities. To visualize the user activity behind an information network, SSNetViz+ allows user activity statistics of selected nodes to be displayed in timeline charts. SSNetViz+ can display three types of timeline charts, namely (a) number of links, (b) cumulative number of revisions, and (c) number of words. The time scroll bar, again, can be moved to the appropriate timestamp to visualize the node articles and network at time points at which the articles show interesting changes in activity statistics. For example, Figure 3 shows the link activity data for the Al-Qaeda node.
Addition and removal of nodes. A new node can be added into or removed from the existing information network by calling the node addition and removal operators defined in Section 3.2. In addition, SSNetViz+ supports multi-node addition when an existing node (adorned with a little red box) is selected for neighborhood expansion. The expansion will include all nodes linked to the selected node
Fig. 3. Link Activity Data
added to the information network. Multi-node removal is performed by removing the neighbors of the selected node, collapsing the selected node's neighborhood.
Delta graph analysis. Delta graph analysis in SSNetViz+ is performed by specifying a timestamp marker (using the button at the menu bar) and moving the time scroll bar to another timestamp, so as to compare the information networks at these two timestamps. The delta difference operation in Section 3.2 will be called to return the nodes and edges in the $\Delta^+$, $\Delta^-$ and $\Delta^0$ networks. The nodes and edges in $\Delta^+$ and $\Delta^-$ are shown in blue and red respectively, while those in $\Delta^0$ have their color(s) unchanged. Figure 4 shows the delta difference of the Al-Qaeda network between April 11, 2005 and January 22, 2006.
Fig. 4. Delta Difference Example
5 Temporal Search
SSNetViz+ supports temporal keyword search with start and end dates as shown in the main interface window. The search returns node articles with revisions between the start and end dates and containing the specified keyword(s). After using Lucene2 to score the relevance of every revision, SSNetViz+ displays in the result window the revision scores across timestamps of each relevant node article as a colored line in a line chart, as shown in Figure 5. The figure shows the nodes containing the search keyword “al-qaeda” and the top 10 relevant nodes
2 http://lucene.apache.org/
Fig. 5. Temporal Search Window
Fig. 6. Event Detection in Search Results
are highlighted (including the three terrorist group nodes al-qaeda, taliban and jama’at al-tawhid wal-jihad). Only the revision scores of these 10 nodes are shown to keep the chart easy to read. The recent revisions of the al-qaeda node (in red) are shown to be most relevant. This is followed by the revisions of the Osama bin Laden node (in blue). The line chart can be further zoomed in for more detailed viewing. Such a line chart provides a useful temporal interpretation of the search result. The left panel of the result window enumerates all nodes of the search results, grouped by node type and ranked by relevance. Each node result has a number of revisions that satisfy the search criteria, shown as a number in parentheses. Each time, only 10 relevant nodes are highlighted, and the user can always select the next or previous 10 relevant nodes to examine.
Temporal search coupled with the line chart can be used to locate relevant events in relevant nodes. For example, when using the ”september 11” keyword between August 5, 2001 and June 1, 2008, a sudden surge in the revision relevance score of the terrorist Zacarias Moussaoui was found around May 5, 2005, as shown in Figure 6. Upon verification, we found that Moussaoui had surprised the court by pleading guilty to all September 11 related charges against him around that time. This caused more frequent use of the “september 11” keyword in his article. Selected nodes in the search results can be added to SSNetViz+’s visualization panel as anchor nodes by dragging and dropping the nodes onto the visualization panel. This effectively combines search results with visualization and exploration.
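The exact ranking function is not given in the paper beyond the fact that it considers the versions containing the keywords; assuming nodes are ranked by their best per-revision relevance score within the query's date range (with scores precomputed by a full-text engine such as Lucene), a sketch could look like this.

```java
import java.util.*;

// Sketch: rank nodes by their best per-revision relevance score inside a date
// range. Per-revision scores are assumed to be precomputed; the max-score
// ranking is an assumption, not SSNetViz+'s actual ranking function.
public class TemporalRank {
    record Revision(long timestamp, double score) {}

    static List<String> rank(Map<String, List<Revision>> revisionsByNode, long start, long end) {
        Map<String, Double> best = new HashMap<>();
        revisionsByNode.forEach((node, revs) -> revs.stream()
                .filter(r -> r.timestamp() >= start && r.timestamp() <= end)
                .mapToDouble(Revision::score)
                .max()
                .ifPresent(s -> best.put(node, s)));
        return best.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .map(Map.Entry::getKey)
                .toList();
    }
}
```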
6 Conclusion
In this paper, we describe the design and implementation of SSNetViz+, a tool for visualizing, exploring and querying evolving information networks in Wikipedia. SSNetViz+ has been implemented in Java, with the extracted information network data stored in a MySQL database. SSNetViz+ manages multiple versions of information networks and supports operations to manipulate these networks. Using a terrorism network as an example, we show the various capabilities of SSNetViz+, including exploring information networks by timestamp, comparing information networks across timestamps, and showing search results as line charts. SSNetViz+ is currently under user evaluation by terrorism experts from the International Center for Political Violence and Terrorism Research (ICPVTR). Future work items on SSNetViz+ include automatic identification and visualization of interesting nodes and links based on user activity data, and summarization of search results that include inter-linked node articles.
Acknowledgement
This work was supported by A*STAR Public Sector R&D, Singapore, Project Number 062 101 0031.
References
1. Bizer, C., Lehmann, J., Kobilarov, G., Auer, S., Becker, C., Cyganiak, R., Hellmann, S.: DBpedia - a crystallization point for the web of data. Web Semantics: Science, Services and Agents on the World Wide Web 7(3) (September 2009)
2. Brin, S., Page, L.: The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems 30(1-7) (1998)
3. Chen, C.: Visualizing semantic spaces and author co-citation networks in digital libraries. Information Processing and Management, 401–420 (1999)
4. Herman, I., Melançon, G., Marshall, M.S.: Graph visualization and navigation in information visualization: A survey. IEEE Transactions on Visualization and Computer Graphics 6(1), 24–43 (2000)
5. Kittur, A., Chi, E., Suh, B.: What's in Wikipedia? Mapping topics and conflict using socially annotated category structure. In: ACM CHI (2009)
6. Lim, E.-P., Maureen, Ibrahim, N., Sun, A., Datta, A., Chang, K.: SSNetViz: a visualization engine for heterogeneous semantic social networks. In: International Conference on Electronic Commerce (2009)
7. Nunes, S., Ribeiro, C., Gabriel, D.: WikiChanges - exposing Wikipedia revision activity. In: WikiSym (2008)
8. Sipos, R., Bhole, A., Fortuna, B., Grobelnik, M., Mladenic, D.: HistoryViz - visualizing events and relations extracted from Wikipedia. In: Aroyo, L., Traverso, P., Ciravegna, F., Cimiano, P., Heath, T., Hyvönen, E., Mizoguchi, R., Oren, E., Sabou, M., Simperl, E. (eds.) ESWC 2009. LNCS, vol. 5554. Springer, Heidelberg (2009)
9. Wu, F., Weld, D.: Automatically refining the Wikipedia infobox ontology. In: WWW (2008)
10. Yang, X., Asur, S., Parthasarathy, S., Mehta, S.: A visual-analytic toolkit for dynamic interaction graphs. In: KDD, pp. 1016–1024 (2008)
Do Games Motivate Mobile Content Sharing?
Dion Hoe-Lian Goh, Chei Sian Lee, and Alton Yeow-Kuan Chua
Division of Information Studies, Wee Kim Wee School of Communication and Information, Nanyang Technological University, Singapore 637718
{ashlgoh,leecs,altonchua}@ntu.edu.sg
of areas, including their appropriate game designs, as well as users’ experience, perceptions and attitudes. Hence, research is needed to understand how such applications can effectively blend gameplay into mobile content sharing to motivate such activities, and understand how users perceive and respond to them. The objectives of the present paper are thus two-fold. The first is to extend current research in mobile content sharing games through the design and implementation of Indagator (Latin for explorer). Unlike existing mobile content sharing games which are primarily casual in nature, Indagator introduces multiplayer, pervasive gaming elements set in a persistent virtual world. The second objective is to evaluate Indagator and uncover the motivations for using the application by examining the influence of usability and users’ demographic profiles.
2 Related Work Games that combine content sharing and gaming are also known as Games With A Purpose (GWAP), and may be characterized as applications that use games to solve a given problem [18]. Such games harness humans to perform computations in an entertaining setting. One of the more prominent examples is the ESP Game [19]. Two remote players are randomly paired and tasked to create keywords to images presented to them within a given time limit. Points are earned only if the two players assign the same keyword. The specificity of the keywords determine the number of points obtained, and coupled with a countdown timer that ticks away the seconds on the screen, excitement, challenge and motivation is added for players. While players are entertained, the matching keywords can be used to improve the performance of image search engines. A related example is the Google Image Labeler1, a variant of the ESP Game. Another example is OntoGame, a platform that uses games for creating ontological knowledge structures [17]. Games include OntoPronto for creating an ontology from Wikipedia entries and OntoTube for annotating YouTube videos with ontological elements. Similar ideas that blend content sharing and gaming can also be found in mobile applications. One example reviewed earlier is the Gopher Game [2]. The game is location-based and a player helps a gopher complete its mission by supplying it with camera phone images and textual content based on a task description. Players earn points depending on the quality of the content submitted and use these points to create new gophers and participate in other in-game activities. By helping gophers complete their missions, content sharing among players is facilitated because other users may collect these gophers and view the images and text associated with them. Next, in MobiMissions [5], content sharing is accomplished through the completion of missions, which are defined by sequences of digital photographs and text annotations associated with specific locations. Players create missions for others to undertake, and search locations for available missions. To complete a mission, a player has to capture up to five photographs and add up to five text annotations. This content can then be shared with other players. Finally, CityExplorer [14] extends the idea of games for labeling images to the physical world. The game treats a geographic 1
http://images.google.com/imagelabeler/
area, such as a city, as a game board subdivided into segments. Within each segment, players need to label as many points of interest with category names as possible. Categories are not predefined and players can develop their own. A player who creates the most number of labels in a segment wins credits for that segment at the end of the game.
3 Indagator: Blending Content Sharing and Gaming
Indagator is inspired by tales of explorers who navigate uncharted territory in their quest for fame, fortune and adventure. As its name suggests, gameplay is modeled after an exploration theme operating on two levels. First, Indagator provides an environment for users to share and seek location-based content. Second, layered upon this information environment is a game of exploration, in which players navigate their physical world to amass treasure, overcome obstacles, and interact with other players. In doing so, Indagator blends play with the collaborative creation, seeking and sharing of information.
3.1 Content Sharing Features
At its core, Indagator is a location-based mobile content sharing system. The application provides features for users to create, share and seek content on their mobile devices, and is adapted from a mobile annotation system known as MobiTOP (Mobile Tagging of Objects and People) [12]. In Indagator, as in MobiTOP, content refers to location-based annotations, each comprising attributes such as title, tags, textual information, multimedia content (e.g. images) and users' ratings for that annotation (see top section of Figure 1). At creation time, other implicit attributes are also captured, such as contributor name, location (latitude and longitude), and date. The Indagator client supports a map-based interface for exploring annotations (Figure 2). Individual annotations are displayed as markers on a map-based interface, and selecting a marker will show the details of its corresponding annotation. The map
Fig. 1. Annotation details
Fig. 2. Map interface
Fig. 3. Engaging an encounter
also offers standard navigation features such as pan and zoom. Annotations may also be accessed via tags as well as via filtering by date/time, location and user. The Indagator mobile client was developed using the Java Platform, Micro Edition (J2ME) API running on Nokia N95 smart phones.
3.2 Gaming Features
Layered upon Indagator's content sharing environment are features that allow users to concurrently engage with their content through play. The gaming environment is overlaid upon a player's actual physical surroundings, so that interaction with gaming features is done within the real world. Gameplay in Indagator is deliberately designed to be simple to reduce the cognitive overhead of players who are on the go. In essence, players explore their environment to seek content they need, or create new content to be shared. As part of the game, their goal is to amass wealth by interacting with Indagator's gaming features. During exploration, players may engage various encounters (see below), interact with other players, and earn in-game currency. Further, the more currency (and hence wealth) a player gains, the higher his/her rank will be, and this is reflected in the game's leaderboard. Indagator may thus be compared with pervasive, multiplayer games, but unlike these, there are no designated objectives. Instead, Indagator gameplay is open-ended, with the primary purpose of facilitating the exploration and creation of content as players move around in their physical environment. In this sense, Indagator is more similar in genre to virtual worlds such as Second Life2, with the addition of gaming elements unique to the system. Indagator gaming elements include the following.
Earning currency. Players earn in-game currency (called aurum) by enriching the environment through contributing content, rating existing content, or through successful engagement of encounters. Aurum can then be spent on creating encounters, acquiring game objects and accessing other game-based features.
Setting encounters. When creating content, players have the option to associate an encounter with it, which will be triggered when the content is accessed. The bottom section of Figure 1 shows an annotation with an encounter type and level of difficulty being selected. Encounters are meant to introduce the elements of entertainment and surprise into content sharing and seeking, and include the following types:
• Mini-games are primarily casual in genre and may include puzzle, shooting and board varieties. Games may be contextual, such as a guessing game that presents clues about a nearby attraction for the player to solve, or non-contextual, as in the case of a shooting game. Different games cost varying amounts of aurum.
• Traps are designed to inflict damage, which in Indagator refers to aurum lost. A player stumbling onto a trap causes him/her to lose aurum to the encounter setter.
• Treasure earns a player some amount of aurum and hence increases wealth. This is system-generated and randomized across content.
Indagator may also randomly set encounters on annotations to increase the diversity of play and introduce an element of unpredictability.
2 http://secondlife.com
Engaging encounters. Players who access content associated with encounters have a choice of engaging the encounter to earn aurum, or bypassing the encounter and continuing to access the annotation with no penalty or reward. The latter was a design decision to ensure that content access takes priority over gameplay, and that content is not denied to those who need it. However, to encourage engagement, players who engage an encounter receive a small amount of aurum, while successful engagements earn more. Figure 3 shows a contextual mini-game in which a player is presented with an image and is asked to specify its location within a nine-grid map. Here, images are harvested from nearby content and the system randomly selects one for the encounter. Players who successfully specify the location obtain aurum.
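As a hedged sketch of the engage-or-bypass choice and reward scheme described above (the reward amounts, class names and mini-game outcome below are placeholders, not Indagator's actual values or code):

```java
// Sketch: engage-or-bypass logic when a player opens content carrying an
// encounter. All values and names here are placeholder assumptions.
class Encounter {
    boolean play() { return Math.random() < 0.5; }  // stand-in for a mini-game outcome
}

class Player {
    int aurum;

    void openAnnotation(Encounter encounter, boolean engage) {
        if (encounter == null || !engage) {
            return;                 // bypass: content is shown with no penalty or reward
        }
        aurum += 1;                 // small reward just for engaging
        if (encounter.play()) {
            aurum += 10;            // larger reward for a successful engagement
        }
    }
}
```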
4 Evaluating Indagator: Methodology
An evaluation was conducted to examine the influence of usability and participants' demographic profiles on the motivation to use Indagator. Forty-one participants (24 males and 17 females) were recruited from a local university and consisted of undergraduate and graduate students. Their ages ranged from 19 to 37, with an average age of 24. Twenty-three of the participants (about 56%) had a background in computer science, information technology or related disciplines, while the others came from disciplines such as arts, humanities, and business. Slightly more than half of the participants (22) reported that they frequently used their mobile phones to share pictures, video, music and other media with others. At the same time, 17 participants (about 40%) played games on their mobile phones frequently. However, 34 participants (about 80%) were frequent players of games on desktop computers. The study was conducted across nine sessions with four to eight participants per session. The number of participants per session was kept small to allow for greater interaction with Indagator as well as with the researchers conducting the study. Each session began with an introduction to Indagator and the concept of blending gameplay with mobile content sharing. Participants were then given a demonstration of the system and its various features. They were also asked to interact with the system and to raise any questions if needed. The entire session lasted approximately one hour. At the end of the session, participants were given a questionnaire to complete. The questionnaire consisted of items that rated the usability of Indagator's various features on a scale of 1 (strongly disagree) to 5 (strongly agree). Items were adapted from past studies on mobile usability and gaming (e.g. [10, 16]), and covered seven aspects. The first three related to the usability of Indagator's mobile content sharing features, while the last four were associated with Indagator's gaming features:
System navigation – accessing the various features in Indagator Map navigation – zooming, panning and visualizing the annotations on the map Annotation management – creating, viewing, editing and deleting annotations Encounter usability – ease of understanding and engaging encounters Entertainment value of encounters – whether encounters were entertaining Encounter appeal – whether encounters would help encourage content sharing Adequacy of encounter genres – whether the types of encounters were sufficient
Finally, participants were asked whether they would be motivated to share content if they were to use Indagator.
5 Results

Multiple regression analysis was conducted to examine the impact of the usability of Indagator's content sharing and gaming features, as well as demographic profiles, on participants' motivation to use the application. Table 1 shows the standardized beta-weights and their associated t-values, as well as the means and standard deviations of participants' responses to the content sharing and gaming features. In terms of the latter, the table suggests that responses were favorable, with mean values above two, suggesting agreement with the seven usability aspects covered in the questionnaire. Finally, the mean value of the response on motivation to share content (the dependent variable) was 3.37 (SD = 1.26), suggesting that participants seemed willing to use Indagator for content sharing. The independent variables impacted motivation to use [F(13, 27) = 20.71, p < 0.001] and explained about 91% of the variance (R2 = .91). We summarize our results as follows:

• Age of participants was not a predictor of motivation to use Indagator.
• Males were more likely to use Indagator than females (β = .60, p < .01).
• Participants with a background in IT-related fields were more likely to use Indagator than those with non-IT backgrounds.
• Familiarity with mobile content sharing activities (e.g. photos, music, etc.) did not seem to influence participants' motivation to use Indagator.
• Familiarity with playing games on desktop computers did not appear to predict motivation to use. However, familiarity with playing mobile games was significant (β = -.45, p < .01). Interestingly, this association was negative – lower familiarity with playing mobile games was associated with a greater inclination to use Indagator.
• The usability of Indagator's mobile content sharing features, comprising system navigation, map navigation and annotation management, did not appear to influence participants' motivation to use the application.
• All four aspects associated with Indagator's gaming features (encounter usability, entertainment value of encounters, encounter appeal and adequacy of encounter genres) were found to be significant predictors. Specifically, participants who reported that encounters were usable (β = .20, p < .05), entertaining (β = .49, p < .01) and likely to promote content sharing (β = .42, p < .01), and that the genres provided were sufficient (β = .66, p < .01), were more likely to use the application.

Table 1. Results of multiple regression analysis

Independent Variables
Demographics: Age; Gender; Educational background; Familiarity with mobile content sharing; Familiarity with playing mobile games; Familiarity with playing desktop games
Usability of mobile content sharing features: System navigation; Map navigation; Annotation management
Usability of gaming features: Encounter usability; Entertainment value of encounters; Encounter appeal; Adequacy of encounter genres

** p < .01. * p < .05. All other p-values were greater than .05. + Valid only for usability of content and gaming features.
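The analysis above can be reproduced in outline as follows. This is a minimal sketch, assuming the questionnaire and demographic responses have been assembled into one table with a row per participant; the column names and the CSV file name are illustrative assumptions rather than released study artifacts, and standardizing all variables before fitting ordinary least squares is just one common way of obtaining standardized beta-weights.

```python
# A minimal sketch of the analysis in Section 5: OLS regression of
# motivation to use Indagator on 13 predictors, reported as standardized
# beta-weights. Column names and the CSV file are illustrative assumptions.
import pandas as pd
import statsmodels.api as sm

predictors = [
    "age", "gender", "it_background",
    "familiarity_content_sharing", "familiarity_mobile_games", "familiarity_desktop_games",
    "system_navigation", "map_navigation", "annotation_management",
    "encounter_usability", "encounter_entertainment", "encounter_appeal", "encounter_genres",
]

df = pd.read_csv("indagator_responses.csv")         # one row per participant (assumed layout)
cols = predictors + ["motivation"]
z = (df[cols] - df[cols].mean()) / df[cols].std()   # z-scoring yields standardized betas

model = sm.OLS(z["motivation"], sm.add_constant(z[predictors])).fit()
print(model.rsquared)               # proportion of variance explained
print(model.fvalue, model.f_pvalue) # overall F-test
print(model.params)                 # standardized beta-weights
print(model.tvalues)                # associated t-values
```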
6 Discussion

Our work shares similarities with the Gopher Game, MobiMissions, GWAPs and similar applications in that we aim to investigate the use of games for content sharing. Nevertheless, there are distinct differences that warrant the present research. The Gopher Game may be viewed as a task-based approach to content sharing in which players are given specific objectives to accomplish via gophers. On the other hand, Indagator players can independently create and share any content of interest. The Indagator environment is therefore more open-ended, requiring more complex player-player and player-system dynamics. Similarly, MobiMissions focuses primarily on generating and completing missions, which are well-defined sequences of locations and associated content. Here, Indagator offers a richer gaming environment that supports encounters in addition to content creation and access. In addition, many GWAPs are Web-based games and are primarily casual in genre. Indagator differs by catering to mobile users, and incorporates location-based, multiplayer gaming elements that existing GWAPs do not.

The multiple regression analysis showed that all four aspects of Indagator's gaming features were positively associated with participants' intention to use the application. This suggests the importance of creating usable, diverse, exciting and enjoyable genres of encounters to cater to a wide spectrum of players with diverse interests and preferences [6, 16]. Put differently, while Indagator is fundamentally a mobile content sharing application, the layering of gaming elements introduces the added challenge of ensuring that both components (content sharing and gaming) are effectively addressed in terms of usability and playability.

Interestingly, none of the usability aspects pertaining to content sharing (system navigation, map navigation and annotation management) were significant predictors of intention to use the application. This finding does not, however, imply their unimportance. Rather, one possible explanation is that due to the novelty of blending content sharing and gameplay, participants were more focused on gaming features rather than on those for sharing and viewing annotations. In addition, because Indagator's content sharing features were patterned after those found on the Web (e.g. map navigation) as well as on established mobile user interface design guidelines (e.g. [12]), participants may have found usability to be a non-issue when compared to the gaming features.

In terms of demographics, only educational background and gender appeared to be significant predictors of motivation to use Indagator. For the former, those trained in IT-related fields showed a greater inclination to use Indagator than those without. One likely reason could be that gaming and content sharing, which were realized as technologically-oriented activities in Indagator, resonated more strongly with those who had more IT knowledge. There is much support for this finding in the literature, where
prior experience with various forms of information technology has been found to influence the intention or motivation to use a particular technology (e.g. [11, 15]).

In terms of gender, males were more likely to use Indagator than females. This finding concurs with research showing that, in general, males seem to spend more time playing computer games and enjoy them more [3], and that this behavior occurs across different cultures [8]. This finding again underscores the importance of designing different genres of encounters to appeal to both males and females, as well as the need for better positioning of Indagator as a platform primarily for mobile content sharing in which players can vary their level of usage of the application's gaming features.

Surprisingly, familiarity with playing mobile games was shown to be negatively associated with motivation to use Indagator. We hypothesize that this could be due to differing expectations of mobile games for entertainment and mobile games for content sharing among participants, although there is insufficient data to support this. Specifically, participants with prior mobile gaming experience may have held pre-conceived ideas of what games should be. Since Indagator is a relatively new gaming genre in which content sharing is the primary activity and games serve to enhance the content sharing experience, our design decisions were focused on how encounters could be used to access timely and relevant content. While the entertainment aspects of encounters were considered, they were secondary in the version of Indagator that was evaluated. In contrast, participants with no prior mobile gaming experience appeared more likely to use Indagator as their expectations of mobile games remained malleable. Consequently, they may have been better able to appreciate the way encounters were designed to support content sharing.
7 Conclusion

The following are some design considerations for Indagator and similar systems:

• Games are not universal motivators, and applications that incorporate gaming elements into content sharing should attempt to decouple these components to attract a wider user group [1]. Indagator was deliberately designed such that gameplay can be decoupled from mobile content sharing through the selection of an "Information Mode" by users. In this mode, encounters and other gaming elements are turned off and users focus only on content creation and access. Our intention is to appeal to non-gamers and gamers alike so that a rich information environment can be created.
• Applications that combine content sharing and gaming have to address the twin challenges of usability and game design. As demonstrated in our experience with Indagator, participants appeared to focus on the playability aspects of encounters. Consequently, it would be prudent to adhere to established game design guidelines (e.g. [9, 13, 16]) even though content sharing may be the primary focus.
• On a related note, since games are the facilitators of content sharing, the importance of creating diverse and enjoyable genres of games to cater to players with different profiles cannot be overlooked [6]. Thus, for example, genres of games that appeal to either or both genders should be incorporated (e.g. [7]).
Although the user study has yielded useful results, we plan to conduct more comprehensive evaluations in future work. For example, the sample of 41 participants may limit the generalizability of our results. Further, the present study uncovered interesting observations that warrant deeper investigation. These include a better understanding of the usability and game design aspects, as well as the personality, gender and other demographic factors that influence attitudes towards content sharing and gaming applications. In the process, formulating effective design guidelines and evaluation heuristics for this genre of applications would be another area of investigation.

Acknowledgements. This work was supported by the Singapore National Research Foundation Interactive Digital Media R&D Program, under research grant NRF2008IDM-IDM004-012.
References [1] Bell, M., Chalmers, M., Barkhuus, L., Hall, M., Sherwood, S., Tennent, P., Brown, B., Rowland, D., Benford, S., Capra, M., Hampshire, A.: Interweaving mobile games with everyday life. In: Proceedings of the 2006 Annual SIGCHI Conference on Human Factors in Computing Systems, pp. 417–426 (2006) [2] Casey, S., Kirman, B., Rowland, D.: The Gopher Game: A social, mobile, locative game with user generated content and peer review. In: Proceedings of the 2007 International Conference on Advances in Computer Entertainment Technology, pp. 9–16 (2007) [3] Chou, C., Tsai, M.J.: Gender differences in Taiwan high school students’ computer game playing. Computers in Human Behavior 23(1), 812–824 (2007) [4] Goh, D.H.-L., Ang, R.P., Chua, A.Y.K., Lee, C.S.: Why we share: A study of motivations for mobile media sharing. In: Liu, J., Wu, J., Yao, Y., Nishida, T. (eds.) AMT 2009. LNCS, vol. 5820, pp. 195–206. Springer, Heidelberg (2009) [5] Grant, L., Daanen, H., Benford, S., Hampshire, A., Drozd, A., Greenhalgh, C.: MobiMissions: The game of missions for mobile phones. In: Proceedings of the ACM SIGGRAPH 2007 Educators Program (2007), http://doi.acm.org/10.1145/1282040.1282053 [6] Ha, I., Yoon, Y., Choi, M.: Determinants of adoption of mobile games under mobile broadband wireless access environment. Information & Management 44(3), 276–286 (2007) [7] Hartmann, T., Klimmt, C.: Gender and computer games: Exploring females’ dislikes. Journal of Computer-Mediated Communication 11(4), Article 2 (2006), http://jcmc.indiana.edu/vol11/issue4/hartmann.html [8] Jackson, L.A., Zhao, Y., Qiu, W., Kolenic, A., Fitzgerald, H.E., Harold, R., von Eye, A.: Culture, gender and information technology use: A comparison of Chinese and US children. Computers in Human Behavior 24(6), 2817–2829 (2008) [9] Jegers, K.: Pervasive game flow: Understanding player enjoyment in pervasive gaming. ACM Computers in Entertainment 5(1), Article 9 (2007) [10] Ji, Y.G., Park, J.H., Lee, C., Yun, M.H.: A usability checklist for the usability evaluation of mobile phone user interface. International Journal of Human-Computer Interaction 20(3), 207–231 (2006) [11] Kim, S.H.: Moderating effects of job relevance and experience on mobile wireless technology acceptance: Adoption of a smartphone by individuals. Information & Management 45(6), 387–393 (2008)
[12] Kim, T.N.Q., Razikin, K., Goh, D.H., Theng, Y.L., Nguyen, Q.M., Lim, E.P., Sun, A., Chang, C.H., Chatterjea, K.: Exploring hierarchically organized georeferenced multimedia annotations in the MobiTOP system. In: Proceedings of the 6th International Conference on Information Technology: New Generations, pp. 1355–1360 (2009) [13] Korhonen, H., Koivisto, E.M.I.: Playability heuristics for mobile multi-player games. In: Proceedings of the 2nd International Conference on Digital Interactive Media in Entertainment and Arts, pp. 28–35 (2007) [14] Matyas, S., Matyas, C., Schlieder, C., Kiefer, P., Mitarai, H., Kamata, M.: Designing location-based mobile games with a purpose – collecting geospatial data with CityExplorer. In: Proceedings of the 2008 International Conference on Advances in Computer Entertainment Technology, pp. 244–247 (2008) [15] Morris, S.A., Gullekson, N.L., Morse, B.J., Popovich, P.M.: Updating the attitudes toward computer usage scale using American undergraduate students. Computers in Human Behavior 25(2), 535–543 (2009) [16] Pinelle, D., Wong, N., Stach, T.: Heuristic evaluation for games: Usability principles for video game design. In: Proceedings of the 2008 Annual SIGCHI Conference on Human Factors in Computing Systems, pp. 1453–1462 (2008) [17] Siorpaes, K., Hepp, M.: Games with a purpose for the Semantic Web. IEEE Intelligent Systems 23(3), 50–60 (2008) [18] von Ahn, L., Dabbish, L.: Designing games with a purpose. Communication of the ACM 51(8), 58–67 (2008) [19] von Ahn, L., Dabbish, L.: Labeling images with a computer game. In: Proceedings of the 2004 Annual SIGCHI Conference on Human Factors in Computing Systems, pp. 319–326 (2004)
A Multifaceted Approach to Exploring Mobile Annotations*

Guanghao Low, Dion Hoe-Lian Goh, and Chei Sian Lee

Wee Kim Wee School of Communication and Information, Nanyang Technological University
{lowg0008,ashlgoh,leecs}@ntu.edu.sg
Abstract. Mobile phones with capabilities such as media capture and location detection have become popular among consumers, and this has made possible the development of location-based mobile annotation sharing applications. The present research investigates the creation of mobile annotations from three perspectives: the recipients of the annotations, the type of content created, and the goals behind creating these annotations. Participants maintained a two-week-long diary documenting their annotation activities. Results revealed a range of motivational factors, including those for relationship maintenance and entertainment. Participants were also more inclined to create leisure-related annotations, while the types of recipients were varied. Implications of our work are also discussed.

Keywords: Mobile phone, mobile annotation sharing, motivations, diary study.
1 Introduction

Mobile phones coupled with location-positioning functionality (e.g. the global positioning system, GPS) have become increasingly popular, and this has fueled the growth of location-based applications. Combined with other functions such as wireless networking and media capture, mobile phones have become an easy-to-use platform for media creation and sharing. Besides the availability of such functionality, a wealth of research has yielded a number of other motivations for creating or sharing mobile media (e.g. photos and video). For example, mobile media is used to share people's daily routines, engage in conversations, share information and report news (e.g. [1], [2]). Past research has suggested that such sharing activities can be personally satisfying, meet emotional needs [3], and serve hedonic purposes (curiosity and diversion, social connection, social avoidance) (e.g. [4], [5]). Conversely, people may choose to create mobile media for personal use only, perhaps for personal recollection or self-reflection [3]. While there are numerous studies on the motivations for mobile media creation and sharing, there is comparatively less work done on mobile annotations, which in this
* This work is partly funded by A*STAR grant 062 130 0057 and Singapore National Research Foundation Interactive Digital Media R&D Program grant NRF2008IDM-IDM004-012.
paper, we define as mobile content encompassing multiple media such as text, images and video. Here, to the best of our knowledge, research mostly centers on the tagging of mobile media. For example, in [2], users tag photographs taken on mobile phones for individual and social purposes. We argue that mobile media (encompassing single media) and mobile annotations (encompassing multiple/mixed media) are sufficiently different to warrant separate investigation. This is because mobile media may elicit different motivations [1] [6], and we should not make a general assumption that these motivations will surface when multiple media are brought together [7]. Multiple media present a richer environment and may convey more information than single media. For example, annotations that combine text and images about a vacation would likely communicate more information than images alone.

In this paper, we therefore adopt a multifaceted investigation of mobile annotations. To accomplish this, a diary study and interviews of participants were conducted. In the study, participants used MobiTOP (Mobile Tagging of Objects and People), a mobile annotation system, to create location-based annotations. The annotations, diary returns and interviews were analyzed using a three-component framework which we coin TAG (Target, Annotation, Goal). The contributions of this work include: a) the TAG framework for studying the creation of mobile annotations from multiple perspectives; and b) the application of this framework to annotations created with MobiTOP via a diary study and email interviews. By understanding mobile annotation creation, we argue that systems that meet users' needs more effectively can be designed.

The remaining sections of this paper are structured as follows. Section 2 provides an overview of related work and a discussion of the TAG framework. Section 3 introduces MobiTOP. Next, Section 4 describes the methodology of the study while Section 5 presents our findings and analyses. Finally, Section 6 discusses the implications of this work as well as opportunities for future research.
2 Related Work

2.1 Literature Review

In earlier work, mobile media systems were generally for individual use, where the main motivations were documentation and reminders. For example, [8] focuses on how to motivate annotations in a personal digital photo library. In contrast, mobile media systems for sharing attract users with different motivations [6]. Studies have shown that users of mobile media applications for sharing created tags that were primarily directed towards "others". Another study [7] divided "others" into "friends and family" and "public". Further, [9] explored the uses of Flickr images and discovered that in most cases, they were used to update "others" about one's life.

Many studies have also looked at the type of annotations that people create. For example, [10] showed that the images taken on mobile phones tend to be personal, short-lived and ephemeral. Likewise, [11] concluded that mobile phones tended to participate in the "aesthetics of banality", in which the images captured were mainly focused on the mundane, trivial aspects of everyday life. Another example is the diary study carried out in [3], in which a list of categories of annotations from the images captured for sharing was uncovered (e.g. People, Objects, and Places of Interest).
In the case of motivations, much work has been conducted on mobile media. For example, [12] uncovered five uses of digital images captured using the MMM2 system: (1) creating and maintaining social relationships; (2) as a record and reminder of personal and collective experiences; (3) as a means to voice one's views; (4) to influence others' view of oneself through self-presentation; and (5) as a means to support both personal and group tasks. In related work, [13] organized reasons for co-present media sharing into storytelling, identity presentation, social information sharing and serendipitous discovery. In terms of annotating captured media, [6] examined the motivations for tagging Flickr images. Motivations were cast along two dimensions. The sociality dimension relates to whether an annotation was meant for personal use or for others, while the function dimension refers to an annotation's intended use – organization or communication.

As discussed, there is much work studying human behavior in mobile media, while there is comparatively less work on mobile annotations. Here, work is mostly centered on implementation issues, as in GeoNotes [14] and Micro-Blog [15]. The present work is thus timely as we seek to understand the creation of mobile annotations from multiple perspectives.

2.2 The TAG Framework

The literature review above suggests that annotations can be analyzed from different perspectives. In the present research, we consolidate existing work in this area and propose a framework that focuses on three dimensions: the target recipient (T), the content of the annotation (A), and the goal for creating the annotation (G).

The target recipient (T) looks at who the annotations are created for. Here, past work has suggested that an annotation may be created for the general public, family and friends, or for individual consumption [2] [6] [7]. Next, the content of the annotation (A) identifies the different types of annotations that users create. Studies in mobile media (e.g. [1] [2] [3]) suggest a diverse range of possible categories, which may apply to mobile annotations as well. Finally, the goal for creating the annotation (G) refers to the intentions or motivations of the creator for the annotation itself [4]. This dimension investigates the thoughts behind users' actions and, as discussed previously, has been extensively studied, especially in mobile media [3] [12] [13].

Our approach is multifaceted, and synthesizes the perspectives from past studies such as those reviewed above. By bringing these perspectives together, we hope to provide a framework for a more holistic analysis of mobile annotation behavior.
3 A Brief Introduction to MobiTOP

MobiTOP is a mobile annotation system that allows users to create, share and seek location-based annotations [16]. In MobiTOP, a map-based visualization is supported for exploring and discovering annotations. This map is populated with markers which represent the available annotations in the vicinity (see Figure 1a). Selecting a marker will cause its details to be displayed in a separate screen (see Figure 1b). Standard navigational features such as panning and zooming are also available.
Fig. 1. (a - left) MobiTOP’s map view. (b - right) Screenshot of an annotation’s details
Annotations in MobiTOP consist of information such as title, tags, multimedia content (e.g. images), textual information, creator's name and date (Figure 1b). Information such as the contributor's name, date and location (latitude and longitude) is implicitly captured at the point of creation. In MobiTOP, tags are used to facilitate the retrieval of related annotations. Additional access mechanisms include filtering by attributes such as date/time, location and user. The current implementation of MobiTOP is developed using the Java Platform, Micro Edition, and has been tested on Nokia N95 8GB smartphones to ease development and evaluation. More details about the MobiTOP system may be found in [16].
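As an illustration of the data model just described, the sketch below shows how a MobiTOP-style annotation record and its tag- and attribute-based retrieval might look. The field names and helper functions are illustrative assumptions and do not reflect MobiTOP's actual implementation, which targets the Java Platform, Micro Edition.

```python
# A minimal sketch of the annotation record and retrieval mechanisms
# described in Section 3. Field names and helpers are assumptions.
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class Annotation:
    title: str
    text: str
    tags: set = field(default_factory=set)
    media: list = field(default_factory=list)  # e.g. image file names
    creator: str = ""
    created: datetime = None                   # captured implicitly at creation
    lat: float = 0.0                           # captured implicitly at creation
    lon: float = 0.0                           # captured implicitly at creation

def by_tag(annotations, tag):
    """Tag-based retrieval of related annotations."""
    return [a for a in annotations if tag in a.tags]

def by_user(annotations, creator):
    """Attribute-based filtering, here by contributing user."""
    return [a for a in annotations if a.creator == creator]

notes = [Annotation("Chicken rice stall", "Delicious and affordable",
                    tags={"food", "jurong"}, creator="p01",
                    created=datetime(2009, 11, 2), lat=1.34, lon=103.71)]
print([a.title for a in by_tag(notes, "food")])
```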
4 Methodology

We adopted the diary study methodology to gather data from our participants. This was supplemented with an interview at the end of the diary study. Participants in the study were 24 part-time graduate students who were full-time working professionals. They ranged from 20 to 39 years of age and consisted of seven females and 17 males. Most of the participants (24) had a computer science or IT-related educational background, and they worked in diverse sectors including telecommunications, banking/finance, manufacturing and education.

Generally, participants were familiar with the use of mobile phones and mobile applications. Of the 24 participants, 20 were familiar with using the mobile phone's camera feature and more than half had experience using the mobile phone to browse the Web. However, only seven participants had experience with their phone's GPS, suggesting that many of the participants were not frequent users of location-based applications.

For the diary study, each participant was provided with a Nokia N95 phone and tasked to maintain a two-week diary in which they recorded their MobiTOP usage activities. Specifically, participants were required to record the location they were at when creating an annotation, the intended recipient(s), the tags that describe the annotation, and the reason for creating the annotation. Due to the mobile nature of the study, employing direct observation methods to capture user intentions may not be feasible [17]. Obtaining responses via post-study interviews alone presents issues, as interviewees may have filtered out or forgotten certain information [18]. In contrast, a diary study is able to obtain a more personal and natural account which may be difficult for another party to observe. It is unintrusive, takes place in situ, and does not suffer from the retrospection problems of interviews.
Prior to the diary study, participants were trained in using MobiTOP. Thereafter, they were given instructions and scenarios for the creation of annotations. Participants were also requested to attend a post-study interview in which they further elaborated on their motivations and clarified unclear diary entries.
5 Results and Analyses

The diary study generated a total of 650 entries, with an average of 27.1 entries per participant (minimum = 14, maximum = 33). Here, we elaborate on the findings from the analysis of these diary entries using the TAG framework described earlier.

5.1 Target (T): Who Are the Recipients?

We referred to past work to identify the target audiences of the annotations created by the participants in the study. Specifically, several studies [2] [3] [6] [7] have identified "self", "friends and family" and "public" as types of intended recipients. Annotations meant for "self" refer to those created for the creator only. Those for "friends and family" were meant for the creator's friends and family, while "public" annotations were meant for anyone who might want to view them. In our study, we manually inspected each diary entry provided by the participants to determine the intended recipient(s) of the annotations.

We found that annotations intended for self were primarily created to serve as reminders for future reference or for entertainment purposes (see Section 5.3). For instance, one participant noted that "This is the first time I am having dinner at this restaurant and it is quite nice. I want to tag it for future reference." While most annotations were triggered by a place or an activity, some participants reported in their diary that they were "Playing with MobiTOP" from the comfort of their home.

Our data showed that annotations intended for friends and family typically contained names of locations or items with which the participants had developed strong positive or negative associations, and they felt that the information within the annotations might be valuable to their family and friends. For instance, one participant had a good dining experience and created this annotation: "I want to share the real chicken rice stall around Jurong that is delicious." Another had a bad experience in a train station and shared this with his family and friends: "The crowd was frustrating me and I had to join a long queue to take the stairs down."

Lastly, we found that annotations intended for the public contained details of attractions, suggestions or recommendations. Participants tended to include the phrase "share with others" in their diary entries, indicating their desire to share the information with others. In addition, these annotations usually possessed a neutral tone. Some examples include "To share with others the new left side of Jurong Point center" and "I wanted to share with others a moderate price food stall that I found."

5.2 Annotation (A): What Is Annotated?

The content of all annotations was manually inspected and organized into categories as shown in Figure 2. Here, content included the title, description, attached media resources and tags. Some annotations' content belonged to more than one category and
were thus coded as such. As shown in the figure, "Food" and "Shop" were the top two categories, indicating that many of the annotations were related to dining or shopping experiences. Annotations from the "Food" category were typically reviews and recommendations such as, "Nice place to have steamboat. Affordable and the food variety is good" and "The restaurant here provides good service compared to other branches". As for "Shop", one participant created the following annotation: "This is the place where you can get your fashionable fancy swimsuits".
Fig. 2. Annotation categories
"View" was the third largest category. These annotations were created when participants encountered scenic views, and often contained images of landscapes, buildings and breathtaking skylines. They tended to be of the "image of the moment" type, reflecting the participants' emotions at the moment the image was captured and their need to archive the image permanently. Thus, annotations in this category provided an outlet for participants to express their emotions visually. Additionally, we found that annotations related to one of the lower occurring categories, "religion", typically contained descriptions, images or even videos of places of worship. We attributed the low counts to the sensitive nature of the topic (i.e. religion). This suggests that participants in our study were generally considerate and sensitive to the needs of others.

5.3 Goal: Why Annotate?

Due to the lack of literature on mobile annotations, we culled motivational factors from our earlier work on mobile media [3] and from others such as [6] [12], and identified new categories for those annotations that could not fit into existing ones. We manually examined the motivations in the participants' diary entries and attempted to classify them into the following: (1) creation/maintenance of social relationships; (2) reminding of individual and collective experiences; (3) self-presentation; (4) task performance; and (5) entertainment. Here, entertainment was a new category that was uncovered through our analysis. The following presents a more detailed discussion of the motivations that emerged from our analysis.

Creation/maintenance of social relationships refers to establishing new, and keeping up with existing, relationships between the sender and recipients through mobile
annotations. Such annotations may contain content to provide personal contextual information for another individual, or for friends and family. For example, a participant wanted to share with family members a statue that he saw while travelling around Singapore: "To share with my family and friends the Merlion Statue that is a symbol of Singapore" (Figure 3a). Another participant used MobiTOP as an application to share time with his spouse by showing her how MobiTOP works ("Wife wanted to have a try on the application"). We also found that participants created annotations to develop new online relationships with other users. For instance, one participant responded to the annotations created by another by providing more information (e.g. pictures), and the original annotation creator subsequently reciprocated with words of appreciation.

Reminding of individual and collective experiences involves sharing annotations as a record and reminder of individual and collective experiences, and may include key moments, everyday activities or even mundane content related to oneself, others within a social circle or even the public. One of the participants created an annotation at a bus stop so that he could find his way back to the location if he got lost: "I wanted to get to it later in case I lost my way". Additionally, we found that participants were motivated to record places for future reference, and these places were usually locations where they had great and fun experiences. An example is, "This was the first time I had dinner at this restaurant and it was quite nice. I wanted to tag it for future reference."
(a) An annotation for creation/maintenance of social relationships
(b) An annotation created while having fun
Fig. 3. Example of photos of annotations
Self-presentation refers to the sharing of annotations to create an impression of oneself on others, or put differently, to influence how other people view oneself. For instance, a participant wanted to inform others that he was taking a Master's degree at a local university and created an annotation for it ("To tell others that my Master's degree was from this university.") Other participants felt the need to share various aspects of their personal everyday lives with others. Examples include annotations that "Show where I buy my groceries weekly", and "I want to show the place where I stay with my friends."

Task performance refers to annotations that have a functional value and are meant to assist the sender and/or recipient to complete a task. For instance, participants were found to help by answering or responding to queries posted by other participants. An example was an annotation used to provide travel directions for a participant. We
also observed that some participants offered help using a more proactive approach. Here, participants anticipated the needs of others and shared information that they felt might be useful. For instance, one participant found a store selling unique items that he had never seen before. He created an annotation "to tell everyone around the world" about the store.

Entertainment refers to motivational factors that involve distracting oneself from a particular problem or activity, exploration by experimenting, or simply just having fun. In our study, MobiTOP was used to kill time. For instance, a participant created an annotation describing her surroundings while waiting for transportation: "Instead of waiting without doing anything, I decided to create an annotation." Apart from killing time, some participants were curious about how MobiTOP would work in different situations. "Wanted to try the application in a semi-open parking lot" and "Trying out creating annotations on the move" are examples of participants who wanted to test the location detection capabilities of the application. While experimenting, some participants found it fun to create annotation trails of their travels on the map. One participant noted, "As I am in a boat, my annotations would then trace my boat's travel when I view them in the map" (Figure 3b).
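The coding scheme that emerges from Sections 5.1–5.3 can be summarized in a small data structure. The sketch below is an illustrative assumption about how a coded diary entry might be represented for tallying; only the category labels come from the paper, and the example entries are invented.

```python
# An illustrative representation of a diary entry coded under the TAG
# framework (Sections 5.1-5.3). Category labels follow the paper; the
# data structure and example entries are assumptions made for the sketch.
from collections import Counter
from dataclasses import dataclass, field
from enum import Enum

class Target(Enum):
    SELF = "self"
    FRIENDS_FAMILY = "friends and family"
    PUBLIC = "public"

class Goal(Enum):
    SOCIAL_RELATIONSHIPS = "creation/maintenance of social relationships"
    REMINDING = "reminding of individual and collective experiences"
    SELF_PRESENTATION = "self-presentation"
    TASK_PERFORMANCE = "task performance"
    ENTERTAINMENT = "entertainment"

@dataclass
class CodedEntry:
    target: Target
    categories: list = field(default_factory=list)  # content may span several categories
    goal: Goal = Goal.REMINDING

entries = [
    CodedEntry(Target.FRIENDS_FAMILY, ["Food"], Goal.SOCIAL_RELATIONSHIPS),
    CodedEntry(Target.SELF, ["Food"], Goal.REMINDING),
    CodedEntry(Target.PUBLIC, ["View"], Goal.ENTERTAINMENT),
]

# Tallies of the kind reported in Sections 5.1 and 5.2.
print(Counter(e.target.value for e in entries))
print(Counter(c for e in entries for c in e.categories))
```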
6 Discussion and Conclusion

This research aims to understand for whom, what and why users create mobile annotations. This work is one of the first studies of mobile annotation creation and proposes a multifaceted analytical framework which we term TAG. This framework was applied to MobiTOP, a mobile annotation system, to discover and understand the creation of location-based annotations. Earlier studies typically looked at one or two of the following perspectives: who the annotations were for, what the annotations contained, and why participants created them. The TAG framework, however, provides an integrated examination by synthesizing the various perspectives culled from the existing literature. We summarize our key findings and discuss their implications for the design and development of mobile annotation systems below:

• For the target of the annotations (T), we found that the three major categories were self, family and friends, and public. More importantly, we found that different features were required by different target groups. For the self category, helpful features are those that allow individuals to organize annotations to facilitate future retrieval. However, for annotations intended for family and friends, communication features (e.g. social networking tools) will be highly valuable. This indicates that designers of mobile annotation systems should not only consider who their users will be, but also account for who their target recipients are, so that appropriate features are incorporated to ensure that the intended recipients (e.g. self, friends and family, public) are able to access the annotations easily.
• The analysis of the content of annotations (A) provides an overview of the categories of annotations that users are interested in. Hence, such analysis enables data related to the personal interests of users to be captured. This will be useful for identifying user profiles, which will aid developers in terms of providing better
customization and personalization. Such data are also highly valued by potential advertisers for personalized mobile advertisements. Thus, designers of such systems may want to explore partnerships and alliances with relevant businesses.
• The Goal (G) component of the framework allows us to uncover the motivations behind the creation of mobile annotations. We found that motivations are varied and include self-presentation (satisfying personal needs), creation and maintenance of social relationships (social interaction), and task performance (completing a task). These are consistent with the findings of past work (e.g. [3]). Our findings have important implications for the design of mobile annotation systems. For example, creation and maintenance of social relationships can be afforded by social networking tools (e.g. "friend feeds"). More importantly, one critical finding is that entertainment was a major motivational force behind mobile annotation creation. This implies that designers of similar systems may consider incorporating games to further motivate users to create mobile content. This could include mini-games such as planting puzzles, or creating missions, which would help to sustain users' interest in content creation and access.

Caution, however, should be exercised when interpreting our results because the nature of this study may reduce the generalizability of its findings. Specifically, the majority of the respondents were working professionals with IT and engineering educational backgrounds. Replication of this study in other contexts (e.g. other age groups) or in a specific domain (e.g. education) would be useful to better understand mobile annotation creation. Expanding the study to compare MobiTOP with other mobile content sharing applications (e.g. those with games) would allow researchers to understand the impact of entertainment on the creation of mobile annotations. Finally, other personal characteristics such as gender and personality are relevant constructs that may influence the different dimensions of our TAG framework. Future work may want to investigate the influence of such constructs.
References 1. Java, A., Song, X., Finin, T., Tseng, B.: Why we twitter: understanding microblogging usage and communities. In: Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 Workshop on Web Mining and Social Network Analysis, pp. 56–65. ACM, San Jose (2007) 2. Kindberg, T., Spasojevic, M., Fleck, R., Sellen, A.: I saw this and thought of you: some social uses of camera phones. In: CHI 2005 Extended Abstracts on Human Factors in Computing Systems, pp. 1545–1548. ACM, Portland (2005) 3. Goh, D., Ang, R., Chua, A., Lee, C.: Why we share: A study of motivations for mobile media sharing. In: Liu, J., Wu, J., Yao, Y., Nishida, T. (eds.) AMT 2009. LNCS, vol. 5820, pp. 195–206. Springer, Heidelberg (2009) 4. Taylor, C.A., Anicello, O., Somohano, S., Samuels, N., Whitaker, L., Ramey, J.A.: A framework for understanding mobile internet motivations and behaviors. In: CHI 2008 Extended Abstracts on Human Factors in Computing Systems, pp. 2679–2684. ACM, New York (2008) 5. Kim, H., Kim, J., Lee, Y., Chae, M., Choi, Y.: An empirical study of the use contexts and usability problems in mobile internet. In: Proceedings of the 35th Annual Hawaii International Conference on System Sciences, vol. 5, p. 132. IEEE Computer Society, Los Alamitos (2002)
6. Ames, M., Naaman, M.: Why we tag: motivations for annotation in mobile and online media. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 971–980. ACM, San Jose (2007) 7. Nov, O., Naaman, M., Ye, C.: What drives content tagging: the case of photos on Flickr. In: Proceeding of the Twenty-Sixth Annual SIGCHI Conference on Human Factors in Computing Systems, pp. 1097–1100. ACM, Florence (2008) 8. Kustanowitz, J., Shneiderman, B.: Motivating annotations for personal digital photo libraries: Lowering barriers while raising incentives. Technical report, HCIL, Univ. of Maryland (2004) 9. House, N.A.V.: Flickr and public image-sharing: distant closeness and photo exhibition. In: CHI 2007 Extended Abstracts on Human Factors in Computing Systems, pp. 2717– 2722. ACM, San Jose (2007) 10. Gye, L.: Picture This: the Impact of Mobile Camera Phones on Personal Photographic Practices. Continuum: Journal of Media & Cultural Studies 21, 279–288 (2007) 11. Koskinen, I.: Seeing with mobile images: Towards perpetual visual contact. In: Proceedings of The Global and the Local in Mobile Communication: Places, Images, People, Connections, Budapest, Hungary (2004) 12. House, N.V., Davis, M., Ames, M., Finn, M., Viswanathan, V.: The uses of personal networked digital imaging: an empirical study of cameraphone photos and sharing. In: CHI 2005 Extended Abstracts on Human Factors in Computing Systems, pp. 1853–1856. ACM, Portland (2005) 13. Naaman, M., Nair, R., Kaplun, V.: Photos on the go: a mobile application case study. In: Proceeding of the Twenty-Sixth Annual SIGCHI Conference on Human Factors in Computing Systems, pp. 1739–1748. ACM, Florence (2008) 14. Espinoza, F., Persson, P., Sandin, A., Nyström, H., Cacciatore, E., Bylund, M.: GeoNotes: Social and navigational aspects of location-based information systems. In: Abowd, G.D., Brumitt, B., Shafer, S. (eds.) UbiComp 2001. LNCS, vol. 2201, pp. 2–17. Springer, Heidelberg (2001) 15. Gaonkar, S., Li, J., Choudhury, R.R., Cox, L., Schmidt, A.: Micro-Blog: sharing and querying content through mobile phones and social participation. In: Proceedings of the 6th International Conference on Mobile Systems, Applications, and Services, pp. 174–186. ACM, Breckenridge (2008) 16. Razikin, K., Goh, D., Theng, Y.-L., Nguyen, Q., Kim, T., Lim, E.-P., Chang, C., Chatterjea, K., Sun, A.: Sharing mobile multimedia annotations to support inquiry-based learning using MobiTOP. In: Liu, J., Wu, J., Yao, Y., Nishida, T. (eds.) AMT 2009. LNCS, vol. 5820, pp. 171–182. Springer, Heidelberg (2009) 17. Jacucci, G., Oulasvirta, A., Salovaara, A., Sarvas, R.: Supporting the shared experience of spectators through mobile group media. In: Proceedings of the 2005 international ACM SIGGROUP conference on Supporting group work, pp. 207–216. ACM, Sanibel Island (2005) 18. Olsson, T., Soronen, H., Väänänen-Vainio-Mattila, K.: User needs and design guidelines for mobile services for sharing digital life memories. In: Proceedings of the 10th International Conference on Human Computer Interaction with Mobile Devices and Services, pp. 273–282. ACM, Amsterdam (2008)
Model Migration Approach for Database Preservation

Arif Ur Rahman1,2, Gabriel David1,2, and Cristina Ribeiro1,2
1 Departamento de Engenharia Informática – Faculdade de Engenharia, Universidade do Porto
2 INESC Porto
Rua Dr. Roberto Frias, 4200-465 Porto, Portugal
{badwanpk,gtd,mcr}@fe.up.pt
Abstract. Strategies developed for database preservation in the past include technology preservation, migration, emulation and the use of a universal virtual computer. In this paper we present a new concept of "Model Migration for Database Preservation". Our proposed approach involves two major activities: first, migrating the database model from the conventional relational model to a dimensional model; and second, calculating the information embedded in code and preserving it instead of preserving the code required to calculate it. This will affect the originality of the database but improve two other characteristics: the information considered relevant is kept in a simple and easy-to-understand format, and the systematic process to preserve the dimensional model is independent of the DBMS details and application logic.

Keywords: database preservation, dimensional modeling.
1 Introduction
Organizations are increasingly relying on databases as the main component of their recordkeeping systems. However, as the amount and detail of information contained in such systems grows, so does the concern that in a few years most of it may be lost, when the current hardware, operating systems, database management systems (DBMS) and applications become obsolete and render the data repositories unreadable. The paperless office increases the risk of losing significant chunks of organizational memory. In this paper we present an approach for preserving the information stored in relational databases for the future.

According to research in the area, the five characteristics of databases which must be preserved are context, structure, content, appearance and behaviour [15,17]. The context includes non-technical information giving answers to questions like who, when and why about the database, as well as information on its technical features. The contents are the data stored in the database, representing real-world facts. The structure of the database relates to the composition and logical hierarchy of the elements of a database, thus contributing to the meaning
This work is supported by FCT grant reference number SFRH/BD/45731/2008.
of the data. Appearance is about the screen forms used for entering and modifying data and about generated reports. It requires the presence of the user application designed to manipulate data, submit queries, and extract information. The behaviour is the dynamic part of the system and, therefore, the most difficult to preserve. It includes the interaction control component and the code implementing the business rules. If the former can be seen as less relevant from a preservation viewpoint, the latter may contain important bits of information, in the form of functions that produce important derived results not explicitly stored in the database. This paper will discuss how the migration from the relational to the dimensional model impacts the preservation of these database characteristics.

Some aspects of preservation which need to be taken care of during the process of database preservation include integrity, intelligibility, authenticity, originality and accessibility [5]. Integrity refers to the completeness, correctness and consistency of the data stored in the database. Intelligibility of a database concerns both the interpretation of the data formats and the understandability of the relationships between tables and their relation to the reality they represent. An intricate database model becomes hard to understand. Authenticity is the property which relates the preserved information to its source and is guaranteed by keeping a record of the actors, tools and operations involved in a preservation process. Originality, in terms of preserving the structure and functionality, should be taken into account but may conflict with other aspects like intelligibility or accessibility. Technical accessibility means that the data is kept in open formats and does not rely on vendor-specific software.

The approach to database preservation proposed in Section 3 is based on model migration. This operation changes the structure of the database in order to improve intelligibility and accessibility, the crucial problems identified above. In the process, we might decide to perform a data quality assessment and repair the data to reduce problems like missing values, or insert records to resolve foreign key errors. However, the decision was to keep the data as it is, in order to preserve as much as possible the facts recorded, though in a different format, even when they are affected by data quality problems. So, the goal is to preserve the actual level of integrity. The authenticity of the actual database is guaranteed by the inclusion of audit information qualifying the records. The authenticity of the preserved database requires the addition of audit information relating the preserved records to the original ones and the specification of the migration procedures, when they were executed and by whom, using which tools. Authenticity also benefits from metadata about the context of creation and use of the original database, which should be recorded in the context component of the preserved database. Note however that the model migration approach is done at the expense of originality.
2 What to Preserve
The model migration approach provides a pre-processing step for the database, which can be coupled with existing database preservation initiatives such as the
Software Independent Archiving of Relational Databases (SIARD) [4,14] or the Digital Preservation Testbed (DPT) [15].

SIARD is a non-proprietary, published open standard. It is based on other open standards like Unicode, XML, SQL 1999 and the industry-standard ZIP [14]. As it is based on open standards, it supports interoperability of the database contents in the long term. Using the SIARD format, even if the database software through which the database was created is not available or not executable, the database will remain accessible and usable. At present it is possible to migrate Oracle, Microsoft SQL Server and Microsoft Access databases to the SIARD format. A database in the SIARD format consists of two components, namely the metadata and the primary data. An uncompressed ZIP archive stores these components, with the metadata in the folder header and the primary data in the folder content. Moreover, the archive also stores metadata about which primary data can be found where in the archive [14]. A SIARD database archive can be reloaded in the future into any RDBMS which supports standard SQL [5]. The model migration proposal concentrates on the archival format for the database contents. The archive should contain the original relational model and it may contain the original database file or an export file. The archive must also contain the new preservation model and the preserved contents according to the new model, following the SIARD archive structure.

A similar approach could be applied to the DPT, modifying its central notion of a preservation object. The DPT preservation object has five main components, namely the original database, an XML overview file of the database, applications, the preservation log file and metadata. The Testbed suggests the preservation of the original database file (*.mdb or export file). The XML overview file represents an overview of the tables in the database, the relationships between the tables and the content and structure of the actual tables and views. The application component is for the storage of queries, stored procedures, application code (if applicable), system documentation and user manuals. It is not meant to preserve the applications as a working entity. The preservation log file contains all the information about the preservation actions through which the database passes. The metadata component contains metadata for authentic preservation; this is mainly contextual metadata. To use the DPT with our proposal, the overview file has to contain the original relational model and the new model, and also the original database file version, possibly in XML or SQL DDL format.

It is clear from the research done that there can be no single way to preserve all kinds of databases. For our work we define a database as a combination of four components:

1. Data: The data is the contents stored in the tables of a database.
2. Schema: The schema of a database is the structure (data model) which is needed to understand the relationships among tables. Business rules which are partly structural also need to be preserved.
3. Context: The contextual information which is normally not included in the operational system.
4. Database Application: The application developed in a high-level programming language for the retrieval, modification and deletion of data, in conjunction with various data-processing operations. This contains the appearance and behavioural aspects of the database. A part of the business rules may be implemented in the behavioural aspect.

For preserving a database it is important to take into account the nature of its contents. The contents of some databases evolve with the passage of time, e.g. the CIA World Factbook database [1,2], while others remain static, as is the case of a population census database. The former needs a different approach to preservation than the latter. In this paper the focus will be on the latter.

It is very important to preserve the schema of a database for the understandability and usability of the information stored in it. In our approach, as we suggest migrating the database to a dimensional model, the structure of the preserved database is different from the structure of the operational system but is easier to understand.

A database application can be preserved as a working entity by writing an emulator for it. An emulator is a program that runs on one computer and virtually re-creates a different computer. Therefore, through emulation we can use an obsolete application on a recent computer [11]. However, there are many problems associated with using emulation as a preservation strategy. For example, it cannot be ensured that future computers will be capable of executing an emulator of any older computer. Every time there is some change in the platform for which the emulator was developed, the emulator needs to be re-developed. Another approach to dealing with the application component is to simply preserve the user manuals, queries and functions in textual format and not as a working entity [15]. In this paper we propose an alternative approach which calculates and explicitly stores the information embedded in the code (application logic). The goal is to keep just the data and make the information application-independent.
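The idea of calculating and explicitly storing information that is otherwise embedded in application code can be sketched as follows. This is a minimal illustration under assumed names: the sales table, the discount rule and the derived column are invented for the example and are not taken from any system discussed in the paper.

```python
# A minimal sketch of materializing a derived value (normally computed
# by application code) as an ordinary column before preservation.
# Table, rule and column names are illustrative assumptions.
import sqlite3

def net_amount(quantity: int, unit_price: float) -> float:
    """Business rule hidden in application code: 10% discount above 100 units."""
    gross = quantity * unit_price
    return gross * 0.9 if quantity > 100 else gross

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, quantity INTEGER, unit_price REAL)")
con.executemany("INSERT INTO sales (quantity, unit_price) VALUES (?, ?)",
                [(20, 3.5), (150, 2.0)])

# Store the derived result explicitly, so the preserved database no longer
# depends on the code that computed it.
con.execute("ALTER TABLE sales ADD COLUMN net_amount REAL")
for row_id, qty, price in con.execute("SELECT id, quantity, unit_price FROM sales").fetchall():
    con.execute("UPDATE sales SET net_amount = ? WHERE id = ?", (net_amount(qty, price), row_id))

print(con.execute("SELECT id, net_amount FROM sales").fetchall())
```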
3 Database Migration for Database Preservation
Database migration is not a new concept; it has been studied and discussed in the past [8,9]. Database migration may take different forms, including DBMS version evolution (Oracle 10 to Oracle 11), a change of DBMS (Oracle to DB2) or a change of the data model (hierarchical to relational). In this paper we propose the use of model migration from the relational model to the dimensional model as a step in preserving a relational database.
Dimensional Modeling
Dimensional modeling is a logical design technique that seeks to present the data in a standard framework which is intuitive, allows for high-performance access and is resilient to change [3,6]. Information is stored in tables of two natures: dimensions store detailed data about the entities or objects involved in a certain
relevant process (like clients, items being sold, employees); fact tables store the values representing real-world facts (like quantities sold, amounts earned) and the relationships to the corresponding dimensions. A fact table surrounded by its related dimensions is called a star. Dimensions may be shared by different stars; Time and Location are commonly used dimensions. The strengths of the dimensional model make it well suited for long-term preservation of and access to the information. As discussed by Kimball [7] and Ponniah [10], report writers, query tools and user interfaces can all make strong assumptions about the dimensional model, which makes processing more efficient. Other features, as discussed by Torlone [16], include the explicit separation of structure and contents, and hierarchies in the dimensions. The separation of structure and contents helps to make the preserved database DBMS-independent, which is crucial for database preservation. The hierarchies in the dimensions help in aggregating the data and result in faster access. In the past, dimensional modeling has not been considered for database preservation.
3.2 Model Migration
The design of a database preservation process requires a proper balance among originality, integrity, accessibility, intelligibility and authenticity in each major problem to be solved. Two issues to be considered are the complexity of the relational model of real-size information systems and the embedding in code of important knowledge from the application domain.

The complexity of the relational model may prove to be a serious stumbling block for preserving databases. Part of it comes from the redundancy elimination that transaction-oriented databases must follow in order to be efficient and consistent in capturing facts. The preserved database is no longer used for transaction processing but instead for querying and decision making. Although it contains the facts of the original database, the change of usage brings a change of requirements: it is better if the data is preserved in a form that gives simpler and quicker access. This can be achieved by migrating the database from the relational model to a dimensional model, as depicted in Figure 1 [12]. The operation will affect the originality of the database but will provide relief from the complexities of the relational model and improve intelligibility and accessibility, because the resulting model is much easier to understand and the queries on it are simpler to state.

The second problem is the fact that some results coming from the database are produced by functions embodying application-domain knowledge. Preserving code is a much more difficult problem than preserving data, because it requires the ability to preserve the engine able to run it, from the application to the DBMS or the underlying operating system. But discarding the code affects accessibility, as there is no technical way to reach the derived data it would be producing, and it also affects integrity, as chunks of the data are lost. The solution offered by migration is to include the facts and dimension attributes in the dimensional model so as to explicitly store the data in danger. In the data migration phase, also called ETL (extraction, transformation and loading) in data warehouse terminology [7], the application code is run to produce the implicit values, which are then kept in the preserved database. It is assumed that when the preservation operation is performed, the original platform or a compatible one is still available. In the migration process, parts of the relational model which are needed only to support the interaction at the data capture phase, or which are not relevant, may be dropped. To further simplify the information stored in the database, it can be converted to XML format. This makes the information platform-independent, which is very important for achieving its long-term preservation [13,14].

Fig. 1. Model Migration Approach for Database Preservation
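To make the ETL step concrete, the following is a minimal sketch rather than the authors' implementation: it assumes a hypothetical answers table in the operational database and a fact_results fact table in the preservation target, runs a simple statistical computation in place of the original application code, and serializes the result to XML for platform independence.

import sqlite3
import statistics
import xml.etree.ElementTree as ET

# Hypothetical source and target; the 'answers' table and its columns are assumptions.
src = sqlite3.connect("operational.db")   # original relational (operational) database
dst = sqlite3.connect("preserved.db")     # preservation target following the dimensional model

dst.execute("""CREATE TABLE IF NOT EXISTS fact_results (
                   course_key INTEGER, question_key INTEGER,
                   n_answers INTEGER, avg_score REAL, std_dev REAL)""")

# Extraction: pull the raw captured facts from the operational system.
rows = src.execute("SELECT course_id, question_id, value FROM answers").fetchall()
groups = {}
for course_id, question_id, value in rows:
    groups.setdefault((course_id, question_id), []).append(value)

# Transformation: run the computation that the application code used to perform
# at query time, so the derived values are stored explicitly and no code needs
# to be preserved.
for (course_key, question_key), values in groups.items():
    dst.execute("INSERT INTO fact_results VALUES (?, ?, ?, ?, ?)",
                (course_key, question_key, len(values),
                 statistics.mean(values), statistics.pstdev(values)))
dst.commit()

# Loading/serialization: dump the fact table to XML for platform independence.
root = ET.Element("fact_results")
for row in dst.execute("SELECT * FROM fact_results"):
    rec = ET.SubElement(root, "row")
    for col, val in zip(("course_key", "question_key", "n_answers",
                         "avg_score", "std_dev"), row):
        ET.SubElement(rec, col).text = str(val)
ET.ElementTree(root).write("fact_results.xml", encoding="utf-8", xml_declaration=True)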
4 Case Study
The proof of concept for the ideas presented in this paper is a case study involving the database of the “Course Evaluation System” of the University of Porto. Students are invited to answer 31 questions about the course they are attending and the teacher's performance, with answers ranging from 1 to 5, where 1 is the lowest and 5 the highest grade. Information about the identity of the student is not stored and the answers are anonymous. The operational system has a rather complex model, a part of which is shown in Figure 2. It is designed to capture the answers via dynamically built on-line forms. All the reports are calculated at query time using functions based on complex queries. A report can be about the whole faculty, a program, a curricular year or a single course. The user may also choose a granularity level for a report, which may be one of the following:

– Question level: the report presents statistics about each individual question.
– Vector level: the questions are grouped into vectors; the report presents statistics about the vectors.
– Global level: the vectors are combined into a single result; the report presents statistics about the global results.

Fig. 2. Relational Model of the Operational System

Before starting the migration process, a thorough analysis of the operational system was done. The process was carried out in small steps, which resulted in the dimensional model shown in Figure 3. The tables COURSES and PROGRAMS in the operational system are represented by the IPDW COURSES dimension, which has two levels (courses and programs). The tables QUESTIONS and VECTORS are represented by the IPDW QUESTIONS dimension, also with two levels (questions and their aggregation into vectors). The questionnaire has been modified six times since the inception of the system; the tables QUIZZES and QUIZ GROUPS in the operational system store information about these different quizzes. The IPDW SEMESTER dimension stores information about a semester and the quiz used for it. Though the answers of the students are kept anonymous, some information related to them is stored in the IPDW QUIZ dimension. IPDW ANSWERS is the fact table; it is the de-normalized form of the ANSWERS and VALUES tables. Reference keys to the corresponding dimensions, along with the values from the ANSWERS and VALUES tables, are stored in it. The tables GROUPS, VALUES GROUPS, CONFIGURATION, and QUESTIONNAIRE GROUPS were discarded, as they were used only for dynamically building the interface for the on-line data capture. In this process it was important not to damage the integrity of the data.
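As an illustration only, and not the schema actually used, the following sketch renders the star of Figure 3 as SQL tables via Python's sqlite3; the table names follow the text (IPDW_COURSES, IPDW_QUESTIONS, IPDW_SEMESTER, IPDW_QUIZ, IPDW_ANSWERS), while every column name is an assumption.

import sqlite3

# Table names follow the text (IPDW_* dimensions and the IPDW_ANSWERS fact table);
# all column names below are assumptions made for illustration.
ddl = """
CREATE TABLE IPDW_COURSES   (course_key   INTEGER PRIMARY KEY,
                             course_name  TEXT, program_name TEXT);   -- two levels
CREATE TABLE IPDW_QUESTIONS (question_key INTEGER PRIMARY KEY,
                             question_text TEXT, vector_name TEXT);   -- two levels
CREATE TABLE IPDW_SEMESTER  (semester_key INTEGER PRIMARY KEY,
                             semester TEXT, quiz_version TEXT);
CREATE TABLE IPDW_QUIZ      (quiz_key INTEGER PRIMARY KEY,
                             submitted_at TEXT);                      -- answers stay anonymous
CREATE TABLE IPDW_ANSWERS   (course_key   INTEGER REFERENCES IPDW_COURSES(course_key),
                             question_key INTEGER REFERENCES IPDW_QUESTIONS(question_key),
                             semester_key INTEGER REFERENCES IPDW_SEMESTER(semester_key),
                             quiz_key     INTEGER REFERENCES IPDW_QUIZ(quiz_key),
                             value        INTEGER);                   -- fact table
"""

conn = sqlite3.connect(":memory:")
conn.executescript(ddl)
# A typical query against the preserved star: average answer per course and question.
conn.execute("""SELECT course_key, question_key, AVG(value)
                FROM IPDW_ANSWERS GROUP BY course_key, question_key""")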
Fig. 3. Dimensional Model for the Operational System
In the next step the functions were executed and the results, such as averages, standard deviations, numbers of answers and percentiles for each granularity level, were explicitly stored. For each granularity level we obtained a different star, with the fact table storing the results of executing the functions and references to the corresponding dimensions. One of the stars is shown in Figure 3; the dimensions in this star are shared by others. At this stage the database became application-independent. The dimensions in the dimensional model are systematic and easy to serialize and store in XML along with their structure and metadata. If we compare the models (Figure 2 and Figure 3), it is obvious that the dimensional model is simpler and easier to understand, and therefore more intelligible. The information which was embedded in code is now explicitly stored in the database and is readily accessible; there is also no need to preserve the code for the future. After the migration process was completed, the results coming from the migrated database were compared with those of the operational system to verify the authenticity of the information.
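A minimal sketch of such a verification check, under the same assumed fact_results table as before, is given below; the expected values stand in for the reports of the original system, and the comparison simply flags aggregates that differ beyond a rounding tolerance.

import math
import sqlite3

def verify_migration(preserved_db, expected_reports, tolerance=1e-6):
    """Compare aggregates in the migrated database with the reports of the
    operational system. expected_reports maps (course_key, question_key) to
    the average produced by the original report functions (hypothetical)."""
    dst = sqlite3.connect(preserved_db)
    migrated = {(c, q): avg for c, q, avg in dst.execute(
        "SELECT course_key, question_key, avg_score FROM fact_results")}
    return [key for key, expected in expected_reports.items()
            if not math.isclose(migrated.get(key, float("nan")),
                                expected, abs_tol=tolerance)]
# An empty return value means every checked result matches the operational system.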
5 Conclusion
This paper proposes a model migration approach for database preservation. For migrating the relational model of the operational system a thorough understanding of the original system is required. Before migration it should be decided what is to be kept for the future and what can be discarded. This work is similar to the evaluation, elimination and description work an archivist must perform before archiving a set of documents.
6 Future Work
This is work in progress, and we are currently engaged in making the migration process easier. As the proposed approach involves migrating an operational system from a relational model to a dimensional model, we are working on defining generic transformation rules for the process; these rules will guide the team involved in a migration. Another aspect that requires further research is metadata: a specification of the metadata needed, both for keeping the database system context and for describing the preservation process, is still required.
References
1. Buneman, P., Cheney, J., Tan, W.-C., Vansummeren, S.: Curated databases. In: PODS 2008: Proceedings of the Twenty-Seventh ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, New York, NY, USA, pp. 1–12 (2008)
2. Buneman, P., Müller, H., Rusbridge, C.: Curating the CIA World Factbook. International Journal of Digital Curation 4(3) (2009)
3. Connolly, T.M., Begg, C.: Database Systems: A Practical Approach to Design, Implementation, and Management. Addison-Wesley Longman Publishing Co., Inc., Boston (2001)
4. Heuscher, S.: Technical aspects of SIARD. ERPANET (2003)
5. Heuscher, S., Järmann, S., Keller-Marxer, P., Möhle, F.: Providing authentic long-term archival access to complex relational data. In: Ensuring Long-Term Preservation and Adding Value to Scientific and Technical Data. European Space Agency (2004)
6. Imhoff, C., Galemmo, N., Geiger, J.G.: Mastering Data Warehouse Design: Relational and Dimensional Techniques. Joe Wikert (2003)
7. Kimball, R., Reeves, L., Thornthwaite, W., Ross, M., Thornwaite, W.: The Data Warehouse Lifecycle Toolkit: Expert Methods for Designing, Developing and Deploying Data Warehouses. John Wiley & Sons, Inc., New York (1998)
8. Meier, A.: Providing database migration tools – a practicioner's approach. In: VLDB 1995: Proceedings of the 21st International Conference on Very Large Databases, pp. 635–641. Morgan Kaufmann Publishers Inc., San Francisco (1995)
9. Meier, A., Dippold, R., Mercerat, J., Muriset, A., Untersinger, J.-C., Eckerlin, R., Ferrara, F.: Hierarchical to relational database migration. IEEE Softw. 11(3), 21–27 (1994)
10. Ponniah, P.: Data Warehousing Fundamentals: A Comprehensive Guide for IT Professionals. John Wiley & Sons, Inc., Chichester (2001)
11. Dutch Archives Testbed Project: Digital preservation testbed white paper – emulation: context and current status. Technical report, Dutch National Archives, Digital Preservation Testbed Project, ICTU, Nieuwe Duinweg 24-26, 2587 AD Den Haag (June 2003)
12. Rahman, A.U., David, G., Ribeiro, C.: Model migration approach for database preservation. In: 5th International Digital Curation Conference, London, December 2–4 (2009)
13. Ramalho, J.C., Ferreira, M., Faria, L., Castro, R.: Relational database preservation through XML modeling. In: Extreme Markup Languages, Montreal, Quebec. Department of Informatics, University of Minho, Portugal (August 2007)
14. SFA: SIARD format description. Technical report, Swiss Federal Archives, Berne (September 2008)
15. Digital Preservation Testbed: From digital volatility to digital permanence: preserving databases. Technical report, National Library of Australia, Dutch National Archives (2003)
16. Torlone, R.: Conceptual multidimensional models. In: Multidimensional Databases, pp. 69–90. Idea Group, USA (2003)
17. Wilson, A.: Significant properties report. Technical report, InSPECT (April 2007)
Automated Processing of Digitized Historical Newspapers beyond the Article Level: Sections and Regular Features
Robert B. Allen and Catherine Hall
The iSchool at Drexel University, 3141 Chestnut Street, Philadelphia, PA 19104, USA
{rba,ceh48}@drexel.edu
Abstract. Millions of pages of historical newspapers have been digitized, but in most cases access to them is supported by only basic search services. We are exploring interactive services for these collections which would be useful for supporting access, including automatic categorization of articles. Such categorization is difficult because of the uneven quality of the OCR text, but there are many clues which can be useful for improving its accuracy. Here, we describe observations of several historical newspapers to determine the characteristics of sections. We then explore how to automatically identify those sections and how to detect serialized feature articles which are repeated across days and weeks. The goal is not the introduction of new algorithms but the development of practical and robust techniques. For both analyses we find substantial success for some categories and articles, but others prove very difficult.
Keywords: Access, Classification, Digital Humanities, Historian's Workbench, Newspapers, Text Processing.
1 Using Structure to Enhance Indexing of Historical Newspapers

Local newspapers are important historical resources. In the past several years vast amounts of historical newspaper content, largely created from digitized microfilm, have been generated, and we would like to support access to these potentially valuable resources. In the U.S., the major initiative is the National Digital Newspaper Program (NDNP), which is being sponsored by the National Endowment for the Humanities (NEH) and the Library of Congress [1]. Participating states deliver digitized page images and OCR text using the METS/ALTO schema1. To date, this OCR has mostly been used for search interfaces2, but much richer access could be supported by identifying the structure (e.g., articles and sections) of the newspapers. Because such large amounts of historical materials are being digitized, automated methods of processing are needed.

1 METS/ALTO is a marriage of METS (Metadata Encoding and Transmission Standard) and ALTO (Analyzed Layout and Text Object). The former standard uses XML to encode descriptive, administrative, and structural metadata about digital objects; the latter describes the content and layout of each piece of the digital object.
2 e.g., http://www.loc.gov/chroniclingamerica/
The structure of modern newspapers is described by standards such as the International Press and Telecommunications Council (IPTC) family of News Exchange Format Standards (e.g., NEWSML 1, NEWSML-G2, NITF). The IPTC has also developed taxonomies, known as descriptive NewsCodes, for the categorization of newspaper content.3 There are five sets of these descriptors and we will focus on two of them here. The first of these is Genres, which describes the nature and the journalistic or intellectual characteristics of a news object, but not specifically its content. Some examples of Genre include: Background, Daybook, Scener, and Feature. The second taxonomy is Subject Codes, a hierarchical system which describes content at three levels of specificity (Topics, SubjectMatter, SubjectDetail). Examples here include: Arts, Culture & Entertainment; Archaeology; and Fire.

Allen et al. [2] report an initial study of automated processing methods for the OCR from historical newspapers. They first explored automatic methods to segment the pages based on the OCR. Several approaches were tested, such as including semantic coherence among the terms, but for this dataset the best results were obtained from detecting font changes in article headings. This approach was fairly successful overall, and especially for relatively large, well-structured articles; however, it was less successful for tightly packed advertisements and notices. After the articles were segmented, they were assigned to genres with the intention that at least some of the genres (especially news stories) would be processed further (in later research), for instance by assigning subject codes. This sequence, from segmentation to genre assignment to subject code assignment, was described as a pipeline processing model.

For the genre and topic categorization, they [2] primarily employed templates based on the presence of specific words. For instance, the weather reports were identified because they included terms such as “temperature”, “degree”, and “snow”. While the OCR text had many errors, there were often enough correct terms for the articles to be identified, especially for those types of articles (such as weather reports and reports about chess matches) with predictable and distinctive terms. In addition to the word templates, matches (exact or partial) for some distinctive phrases (e.g., “Weather Report”) were also used as evidence of the appropriate genre category. With considerable tuning, fairly accurate matching was obtained for focused categories. However, the cost of improved accuracy could be a loss of flexibility, both when formats and styles change and across different newspapers. Moreover, it was found to be difficult to maintain the spirit of the pipeline; the second step of determining genre would actually be of use in the first step, determining segmentation [2]. In addition, the IPTC distinction between Genres and Subject Codes did not always match the logic of the pipeline. For instance, advertising is considered by IPTC to be a type of content. Nor are all genres easy to identify; reviews (literature, theater, music, etc.) often look no different in format from other news articles.

Rather than pursuing the pipeline processing model per se, it seems more reasonable to frame the task as identifying and using as many of the regularities and constraints as are available to optimize the identification of the various components of these newspapers. One constraint which should be useful is that articles on related topics are often positioned near each other. Knowing the section could be helpful for disambiguating OCR and correcting errors, because we may then utilize domain-specific ontologies, which are more successful than general-purpose ontologies. As a second type of constraint, we can explore whether we can find regular feature articles. Neither of these tasks is difficult for human analysis, but because we need to process millions of pages of historical newspapers there is an enormous advantage to developing automated techniques. In Section 2, we survey the frequency and stability of sections across newspapers. We then turn to procedures for automatically extracting sections. In Section 3, we extend the subject categorization based on word counts. We also explore the identification and extraction of regular feature articles.

3 http://www.iptc.org. Other modern newspapers often have XML tags even if they are not IPTC compliant. For instance, the Los Angeles Times has a type of subject category based on “Desks”; for instance, there is a “Book Review Desk”. The New York Times has “Times Topics”.
2 Description of Regular Feature Articles and Sections

Surprisingly, there has been little systematic description of the structure and organization of newspapers, so we started with that. We found that some feature articles are repeated across days; some fall directly into the genre classifications as defined by IPTC, but others do not readily match those categories. In any event, knowing that those items are repeated and clustered may make them easier to detect. First, we examined what types of clustered and categorized material were typically present in the historical newspapers. Next, we examined how the type of material and its clustering changed across time and how it varies across different newspapers. We obtained page images from the NDNP team at the Library of Congress for several Washington DC newspapers for the years 1900 to 1910. The Washington Times was selected as our primary focus because we had the most complete run for it. We compared the Washington Times for two months during 1904 and then again during 1908. We also compared the Washington Times to another Washington newspaper, the Washington Herald, and to a rural Pennsylvania newspaper, the New Holland Clarion.

2.1 Washington Times, March 1904

We examined each page for the first week of March. Although we kept in mind the general notion of genres, and more specifically IPTC's genre categories, we found it helpful to think in terms of 'sections' as we attempted to identify and explain the pieces of information that come together to make up a newspaper. The backbone of any newspaper is its traditional news articles, those pieces that report on current or recent events, but we also identified a number of other regular items and features:

Classifieds: Times Want Ads is a dedicated page each day for classified advertisements. Typical categories include: Help Wanted, Situations Wanted, Wanted, For Sale, For Rent, For Hire, Lost, Found, Personal, Miscellaneous.

Daybook: This includes What Is Going on in Washington, a list of theaters and show times, excursions, etc. Also, a section of possibly paid-for advertisements under the heading of Amusements, which includes theater and music performances, forthcoming lectures, etc.
Editorial: Appearing daily with the masthead, a portion of the page is reserved for editorial comment.

Financial Information: A daily section which incorporates financial news items and market tables (Washington Stock Exchange, New York Stock Exchange, New York Cotton Market/Chicago Grain Market), as well as a section entitled Current News and Gossip of Interest to Investors.

Local News: In March 1904, The Washington Times had regularly appearing features dedicated to the news of regions local to the DC area (Alexandria, Boyds, Georgetown, Hyattsville, and Rockville). Each of these sections may be as short as two or three paragraphs, and while News of Georgetown appears every day during this particular week, News from Rockville appears only once.

Masthead: Editorial/publication information including the editor's name, office address, and subscription prices.

Notices: This is a broad category that covers many of the small recurring sections identified from the March 1904 files. Some examples of Notices include Advertised Letters, Church Notices, Death Record, Died, Foreign Mails, Legal Notices, Local Mention, Marriage Licenses, Railroads, Real Estate Transfers, Special Notices, and Trustee Sales.

Poems: Each day, on the same page as the masthead and editorial, the newspaper publishes a short poem or verse.

Reviews: Weekly reviews of new theater productions (Tuesday) and literature (Saturday).

Society News: A daily section appearing on the same page as the masthead/editorial, titled In the Circle of Society. It contains news about and of interest to the upper echelons of Washington society – dinners, dances, receptions, people who are in and out of town.

Sports: A dedicated sports page appears in the newspaper each day. It incorporates news items, results tables, schedules, etc.

Weather Report: Appears on the front page daily. Includes temperature tables, sunrise/sunset times and a tide table.

Women's Interests: A significant daily section is The Home Its Problems and Interests. Aimed at women, it contains a mix of short features, tips and recommendations on subjects such as fashion, food, children, health and beauty, and haberdashery.

On Sundays, most of the aforementioned features are present, but the paper is substantially longer (about 50 pages) and split into five parts. The first part closely resembles the Monday–Saturday version of the newspaper with a few changes, e.g. more full-page advertisements and the movement of certain features such as society news, financial information, and classifieds to the Metropolitan Section. There is also a self-explanatory Comic Section, and Magazine Features, which includes columnists, special reports and features, and women's fashion. The final part is titled Colored Section and contains a mixture of fictional stories and factual features (sometimes pieces of historical interest or sensational true stories), all of which are richly illustrated with photographs and drawings.
2.2 Washington Times, November 1904

By November 1904 several sections which we had observed in March had changed names. For example, The Circle of Society became In Society's Circle, and Literature became In the Book World. The content of these sections, however, remained largely the same. A more noticeable development is the thinning out of local news sections; Hyattsville Notes and News from Boyds, which appear in the March 1904 papers, do not appear in the November 1904 papers we analyzed. Additions to the paper are minor and include a daily cartoon and a section titled Points in Paragraph. Both of these appear on the masthead/editorial page, and the latter appears to be an extension of the editorial, offering pithy comments about current news.

2.3 Washington Times, March 1908

By 1908, Marriage Licenses and Died are combined (along with Births) in Vital Records. New sections added by this period include the politically focused What Congress Did (a short report on bills passed, resolutions adopted, and people who spoke in both the Senate and the House) and Today's Caller's at the White House (politicians and noteworthy visitors expected that day). Another minor addition is Court Record, a notification of the cases being heard in court that day. Changes were also made to the Masthead/Editorial page; Points in Paragraph and the cartoon which appeared in November 1904 have not lasted. News from Rockville has disappeared, leaving Alexandria and Georgetown as the only local areas to have distinct news sections. Society news enjoys more prominence at this point, moving from the editorial page to earlier in the newspaper. Variously titled (e.g., Tea and Luncheon Parties; Weddings Dinners Teas), it occupies the majority of a page, sometimes continuing on to a second. Similarly, the women's interests section (now called Facts and Fads in the Realms of Home and Fashion), which now also includes Notes from Stage Folks, has expanded to fill most of a page.

2.4 Washington Herald, 1908

After comparing the Washington Times across different years, we then compared it with other newspapers. We first looked at the Washington Herald, another daily DC newspaper, founded in October 1906. Looking at the same weeks in March and November 1908, it was easy to identify similarities in the sections and structure of the two papers. Features like weather reports, advertisements, editorial comment, sports and financial sections are expected to exist in both papers, and it is also no surprise that both contain similar notices – marriage licenses, death records, times of church services, court records, etc. Both papers also have sections for women's interests, news from local areas (although the geographic areas covered do differ), and society gossip. Sometimes the similarities extend beyond the content of the paper to the structure itself; the women's section of both papers appears regularly, but not always, on page 7, and the 'Want' advertisements appear usually, but not always, on page 10 of each. While the overall structure is similar, we noted several differences between the papers. Examples of features unique to the Herald include readers' letters, fiction serialization and the daily columnist Frederic J. Haskin, who writes on a variety of serious and frivolous topics. The similarities between the papers far outweigh the differences, however, and it
is highly likely that an automated process developed to identify sections in one paper could also identify a large number in the other. In February 1939, the two papers merged to become the Washington Times-Herald.

2.5 New Holland Clarion

As a comparison with the urban Washington DC newspapers, we also evaluated the March and November 1904 and 1908 editions of The New Holland Clarion, a weekly newspaper from New Holland, a small rural town in Lancaster County, Pennsylvania. Strikingly, many similarities could be drawn with the Washington Times. There were sections for local news (Hatville, Blue Ball, Intercourse, Churchtown, etc.), sports, and finance (although focused on the produce and livestock markets). Generally, these sections were shorter than those in the urban newspapers. National news was sparsely reported by comparison and rarely made the front page. This is a small community newspaper; a front-page story from March 1904 was titled Many Bones Are Broken and concerns a number of people who had broken limbs during inclement weather. Like the Washington Times, the Clarion includes a substantial number of advertisements and classifieds, and also a number of notice-type sections including unclaimed letters, times of church services, and railroad timetables. Marriages and deaths are also reported, though sometimes more sensationally than in the Times; example headlines include The Work of the Reaper and Gone to the Great Beyond. The New Holland Clarion also has a section devoted to people who are in and out of town (Points Purely Personal) and, in November 1908, it introduced a women's interest feature called Home Circle Department, aimed at “tired mothers as they join the home circle and eveningtide”.

2.6 Summary of Observations

The amount of consistency across the three papers is notable. Structure exists that should be easily recognizable by both human and automatic extraction methods. We know that certain features such as sports, classifieds, financial, society news and women's interests appear in the Washington Times daily. We also know that when an edition of the paper is 12 pages long, the sports section is likely to appear on page 8, the financial information on page 9, and the Times Want Ads on page 10. In addition, certain features of the Washington Times are likely to appear together; the masthead and editorial comment are always on the same page, and the society news regularly appears with them. However, there is also considerable change in sections over time and across newspapers. Indeed, not only do the details of the sections change but the conceptual organization itself changes. For instance, Vital Records may or may not appear as a separate section, and when it does appear it typically includes a mix of Births, Deaths, and Marriages.
3 Automated Procedures

So many pages of historical newspapers have been digitized that it is not realistic for them to be manually marked up. Thus, we explore techniques for automatically identifying the sections and features described in Section 2.
3.1 Test Data Set

Because sections change frequently, we cannot simply create a template to find them consistently; instead, we need to develop automated procedures for finding the sections. We focused on five categories of sections which we judged to be fairly robust and which, if correctly identified, would account for a substantial portion of the newspapers. For each of those sections, we manually listed the pages on which they appeared for each day during March 1904 and established rules for the coding. On some days some sections filled more than one page; for instance, sports and classified advertising sometimes covered one-and-a-half pages. In those cases, we coded only one full page. However, when these sections covered less than one page they were coded for that page and, indeed, there could be two sections on one page.

3.2 Section Identification from Tagged Articles

The first technique we explored was based on the article-categorization approach of Allen et al. [2]. If we found a sufficient number of sports articles on a page, we would conclude that that page was a Sports section. The accuracy of this procedure was low, even for distinct sections such as Sports. Apparently, there was enough error in the article-level identifications that accurate categorizations could not be made with this approach, and we did not pursue it.

3.3 Section Identification from Page-Level Word Lists

Rather than focusing on articles, we shifted to considering entire pages using a word-counting technique similar to the method in [2]. Specifically, we developed word lists for each of the five types of sections on which we were focusing. We then found the average frequency of each of the terms across all the pages for that month. Next, we compared the frequencies separately for each page to the frequency for the entire month. If the page frequency exceeded the overall monthly frequency by a large multiplier (e.g., 30 times), that was considered to be a hit. Then, if a minimum number of such matches (e.g., 4) was obtained for a given category, we identified the page as that type of section.

We applied this method to the 324 weekday pages of the Washington Times for March 1904 for which complete data were available, and we compared the results with the section coding from Section 3.1. Table 1 summarizes the results with the page-level word-list technique. This technique turned out to be quite successful for the sections on which we focused but less successful when we explored other types of sections such as Editorials. Although Sports had many distinctive terms, we found only a few distinctive words which were associated with Editorials, and even those were not reliably captured by the OCR. The lowest accuracy observed in Table 1 was for Society. While that section often included news about galas with royalty and diplomats, there were some days when the events reported were more mundane and hence more difficult to distinguish from other news.
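A minimal sketch of the page-level word-list technique follows; the word lists shown are illustrative stand-ins rather than the lists used in the study, while the multiplier (30) and the minimum number of matches (4) follow the values mentioned above.

import re
from collections import Counter

# Illustrative word lists only; the study built its own lists for five sections.
SECTION_WORDS = {
    "Sports":             ["game", "score", "team", "inning", "race"],
    "Stocks and Finance": ["stock", "market", "shares", "exchange", "bonds"],
}

def classify_pages(pages, multiplier=30, min_hits=4):
    """pages: dict mapping a page identifier to its OCR text for one month."""
    counts = {pid: Counter(re.findall(r"[a-z]+", text.lower()))
              for pid, text in pages.items()}
    monthly_total = Counter()
    for c in counts.values():
        monthly_total.update(c)
    n_pages = len(pages) or 1
    labels = {}
    for pid, c in counts.items():
        page_labels = []
        for section, words in SECTION_WORDS.items():
            # A "hit": the page frequency exceeds the monthly frequency
            # (interpreted here as the per-page average, an assumption)
            # by the large multiplier.
            hits = sum(1 for w in words
                       if c[w] > multiplier * (monthly_total[w] / n_pages))
            if hits >= min_hits:
                page_labels.append(section)
        labels[pid] = page_labels
    return labels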
Table 1. Hit and False Alarm Ratios for several sections for the weekdays (324 pages) during March 1904

Type of Section             Hit Ratio   False Alarm Ratio4
Classified Advertisements   1.00        0.00
Home and Family             0.96        0.65
Society                     0.62        0.00
Sports                      1.00        0.53
Stocks and Finance          0.92        0.00
4 A maximum of one false alarm was counted per day.

Our strategy was to select methods that would minimize the need for human intervention in the categorization process. However, it is unlikely that human intervention can be entirely eliminated, and the goal should be to find a balance in which human intelligence can be applied with the greatest leverage. In an early run, we noted that the Home and Family category was often falsely identified on a page of classified advertisements because there were many domestic terms among the classifieds. To minimize those false alarms we instituted a rule such that if a page is recognized as having Classified Advertising, then it should not also trigger the Home and Family category. In any event, these do not seem to be very serious errors, as the terms for the two categories overlap a lot. Of course, there were some true errors, for instance when a cluster of Finance stories appeared on the front page and that cluster was identified as a section. Because the five sections we studied were fairly standard and broad, we expect the results could likely be generalized across time and to other urban newspapers. However, as noted in Section 2.5, the Finance section of the rural New Holland Clarion was quite different from the Finance section of the urban newspapers. Finally, although the analysis in Table 1 excluded Sundays because of their very different structure of sections, when we did look at the Sunday newspapers we found that their Sports and Finance sections were several pages long and these were consistently correctly identified.

3.4 Identifying Regular Features

Some of the items described in Section 2 above were included because they were clearly demarcated in the newspaper as regular features with distinctive and repeating headings (e.g., News of Georgetown). While many of the larger sections had such headings (e.g., News and Gossip of the Day from the World of Sports), the shorter items might be better called regular features rather than sections. They remained distinctive in comparison to traditional news stories, which almost always had unique titles. In any event, identifying the items with repeated titles should help us to parse the sections of the newspapers. Moreover, because the strings of characters were so distinctive and were repeated so often, it seemed promising to identify them with our basic text processing tools. Furthermore, because the strings were generally several words long, identifying them would also be robust to OCR errors.
For each article during the month of March 1904, we extracted the first line of articles which contained more than five characters. We then compared those to the first lines of text from all other articles extracted for that month. Of course, the matching was complicated by OCR errors, so partial matches were based on counting the number of overlapping letters in two strings, and the order of the characters was ignored. Article matches were counted if at least 85% of the characters matched. Out of 9556 items, 667 sets of matching articles were identified. The majority of the matches were for advertisements being run for several days; it appeared that short articles with repeated headings were advertisements. This suggests a simple technique for detecting certain types of advertisements, one that is also useful for identifying the contents of the historical newspapers. All of the feature titles described above in Section 2.1, with the exception of “What is Going on in Washington”, were found using this method. However, in some cases, because the titles were split across lines, only the first words were matched. There were also some limitations of the automatic methods. For instance, many datelines (e.g., “NEW YORK”) were observed, though presumably those could be filtered out. Poems were not detected because the lead had only the title of the poem. There were, however, distinct separation lines demarcating poems, and conceivably those separators could be used to find the poem text. The partial-match procedure can also be used if the goal is not the discovery of items which are frequently repeated, but finding instances of titles which are known to be in the text. For instance, the title of the sports section was “News and Gossip of the Day from the World of Sports” but it was not associated with a specific article. In the same vein, the banner heading for the Society category, “In the Circle of Society”, could be used to reduce the errors for that category in the word-based categorization technique described above.
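The following sketch illustrates the partial-match procedure under the stated parameters (first lines longer than five characters, at least 85% of characters shared, character order ignored); the exact counting and normalization details are assumptions.

from collections import Counter
from itertools import combinations

def char_overlap(a, b):
    """Fraction of shared characters, ignoring their order; normalizing by the
    longer string is an assumption."""
    shared = sum((Counter(a.lower()) & Counter(b.lower())).values())
    return shared / max(len(a), len(b), 1)

def find_repeated_leads(first_lines, threshold=0.85):
    """Pair up article first lines (longer than five characters) whose
    character overlap reaches the threshold."""
    leads = [(i, s) for i, s in enumerate(first_lines) if len(s) > 5]
    return [(i, j) for (i, a), (j, b) in combinations(leads, 2)
            if char_overlap(a, b) >= threshold]

# A repeated feature heading still matches despite OCR noise:
print(char_overlap("News of Georgetown", "Nows of Georgetcwn"))  # well above 0.85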
4 Discussion

The techniques described in this paper should help the automated indexing of historical newspapers and thus ultimately support better access for users. Ultimately, we believe that the newspapers could be a substantial component of a Historian's Workbench [3, 4]. More immediately, the complete index would need to be compiled, and even with a great deal of automated processing, it seems likely that some level of human intervention will be needed. Thus, we envision an interface and content management system for managing updates. Perhaps this could be a version of the “collaborative correction” technique that was originally developed by the National Library of Australia [5]. We believe that automatic methods should be used to leverage human input, and there are several promising areas to explore. The categorization techniques described here are straightforward and robust but are probably not as sensitive as techniques which systematically assess the most predictive terms for each category. Importantly, we might be able to address the problem of sections shifting across months. While there is some variability of sections from day to day, those shifts are generally transient, and a high-level program could examine output across weeks to learn the prototypical pattern. Such a program might detect and flag shifts in editorial policy for sections
(e.g., when a new section is added or an old one is dropped). A further interesting challenge presented by the shifting of sections is how to describe the contents of the newspaper with structural metadata systems such as METS.5

Data management itself can be a substantial challenge. While the data sets used here are already very large, inferences such as those based on imperfect knowledge bases and on collaborative correction may multiply the number of errors. Inevitably, some incorrect information is likely to creep into the knowledge base. Even without errors there will be ambiguous (e.g., which Mr. Smith was mentioned in a given article) and disputed information, so the provenance of all inferences should be maintained, indicating what updates were done and why.

The contents of newspapers are clearly structured, yet those structures are sometimes fuzzy and flexible. At the item level, it is difficult to define what constitutes an article [2]. In this paper we have focused only on sections, such as Sports, which were relatively unambiguous, while acknowledging that there are several other content clusters. We should consider ways to accept and even embrace that variability. We could consider articles and sections as prototypes rather than as absolute categories. We might consider the faceting of metadata attributes or even degrees of applicability for them. Another way to address these issues is to consider the units of the newspaper to be genres6 [6]. Indeed, the interrelationship of the elements could be explained by considering them as an “ecology of genres” [7]. Thus, we might find that when one section changes there is a realignment of other sections to match it. Indeed, such shifts might be found across newspapers when several papers serve a community.

While there are limitations in the ability to identify and even to define articles and sections, it should also be emphasized that we can do this well most of the time. Furthermore, there is an increasing synergy among the pieces as we pin down more of them. For example, identifying sports sections should help us to identify sports articles which appear in those sections; it could even help us to disambiguate the OCR text. There are also many factors to consider beyond those that we have considered thus far. For instance, when doing the analyses described above, we observed that the classified advertisements consistently had the highest word count of any pages. A further synergy may be created by identifying the types of events being described. Some of our other work [4] has begun to explore developing community event models by combining evidence from several different types of historical resources. Furthermore, identifying events may allow us to develop visualization interfaces such as timelines [8] and, ultimately, to improve public awareness and understanding of history.

Acknowledgments. Catherine Hall was supported with an IMLS LC21 Fellowship to the College of Information Science and Technology of Drexel University. Other support was provided by an NEH Digital Humanities Start-up Grant to Robert B. Allen. We thank Ray Murray of LC for the digitized newspapers from Washington DC. We also thank the Lancaster County Historical Society and Access Pennsylvania for the New Holland Clarion.

5 While METS/ALTO is focused on describing OCR, we would also like to develop structural descriptions for the sections. However, as we have seen, it is sometimes difficult even to define sections.
6 This is a broader sense of “genre” than the IPTC Genre codes discussed in Sections 1 and 2.
References
1. Murray, R.L.: Toward a Metadata Standard for Digitized Historical Newspapers. In: Proceedings of IEEE/ACM JCDL, pp. 330–331 (2005)
2. Allen, R.B., Waldstein, I., Zhu, W.Z.: Automated Processing of Digitized Historical Newspapers: Identification of Segments and Genres. In: Buchanan, G., Masoodian, M., Cunningham, S.J. (eds.) ICADL 2008. LNCS, vol. 5362, pp. 380–387. Springer, Heidelberg (2008)
3. Toms, E., Flora, N.: From Physical to Digital Humanities Library: Designing the Humanities Scholar's Workbench. In: Siemens, R., Moorman, D. (eds.) Mind Technologies, Humanities Computing, and the Canadian Academic Community, pp. 91–115. U. Calgary Press, Calgary (2006)
4. Allen, R.B.: Improving Access to Digitized Historical Newspapers with Text Mining, Coordinated Models, and Formative User Interface Design. In: IFLA International Newspaper Conference: Digital Preservation and Access to News and Views, pp. 54–59 (2010)
5. Holley, R.: How Good Can It Get? Analysing and Improving OCR Accuracy in Large Scale Historic Newspaper Digitisation Programs. D-Lib Magazine 15(3/4) (March/April 2009)
6. Ihlström, C., Åkesson, M.: Genre Characteristics – A Front Page Analysis of 85 Swedish Online Newspapers. In: Proceedings of the Hawaii International Conference on System Sciences (2004)
7. Foulger, D.: Medium as an Ecology of Genre: Integrating Media Theory and Genre Theory. Media Ecology Association (2006)
8. Allen, R.B., Nalluru, S.: Exploring History with Narrative Timelines. In: Smith, M.J., Salvendy, G. (eds.) HCII 2009. LNCS, vol. 5617, pp. 333–338. Springer, Heidelberg (2009)
Keyphrases Extraction from Scientific Documents: Improving Machine Learning Approaches with Natural Language Processing
Mikalai Krapivin, Aliaksandr Autayeu, Maurizio Marchese, Enrico Blanzieri, and Nicola Segata
DISI, University of Trento, Italy
Abstract. In this paper we use Natural Language Processing techniques to improve different machine learning approaches (Support Vector Machines (SVM), Local SVM, Random Forests) to the problem of automatic keyphrases extraction from scientific papers. For the evaluation we propose a large and high-quality dataset: 2000 ACM papers from the Computer Science domain. We evaluate by comparison with expert-assigned keyphrases. The evaluation shows promising results that outperform the state-of-the-art Bayesian learning system KEA, improving the average F-Measure from 22% (KEA) to 30% (Random Forest) on the same dataset without the use of controlled vocabularies. Finally, we report a detailed analysis of the effect of the individual NLP features and of dataset size on the overall quality of extracted keyphrases.
1 Introduction
The rapid growth of information on the web has made autonomous digital libraries possible and popular. Autonomy here means automatic information harvesting, processing, classification and representation, and it brings several information extraction challenges. A specific challenge lies in the domain of scholarly papers [1] collected in autonomous digital libraries1. A library spider crawls the web for scientific papers. Having downloaded a paper, the crawler converts it into a text format. Then relevant metadata (like title, authors, citations) are extracted, and finally documents are classified, identified and ranked. Metadata helps to categorize papers, simplifying and improving users' searches, but often metadata is not available explicitly. Information extraction from scholarly papers contains two broad classes of tasks: i) recognition of structural information which is present inside the paper body (like authors, venues, title); ii) extraction of information which is only implicitly present, such as generic keyphrases or tags, which are not explicitly assigned by the authors. In this paper we focus on the second, more challenging task of extraction of implicit information. In particular, we analyse the effect of the use of Natural Language Processing (NLP) techniques and the use of specific NLP-based
heuristics to improve current Machine Learning (ML) approaches to keyphrases extraction. Machine learning methods are successfully used to support automatic information extraction tasks for short news, mails and web pages [2], and for the problem of keyphrases extraction [3,4,5]. A keyphrase is a short phrase representing a concept from a document. Keyphrases are useful for search, navigation and classification of digital content. This paper extends and improves on a preliminary work [6] describing our initial concepts and presenting initial results obtained using standard SVM approaches. In this paper:

1. We extend the use of state-of-the-art NLP tools [7] for the extraction, definition and use of linguistic-based features such as part of speech and syntactic relations extracted by dependency parsers [8].
2. We apply the proposed NLP-based approach to different ML methods: traditional SVM, innovative Local SVM and Random Forests.
3. We define and publish a large and high-quality dataset of 2000 documents with expert-assigned keyphrases in the Computer Science field.
4. We analyze in detail the effect of different NLP features and of dataset size on the overall quality of extracted keyphrases.
5. We perform a comparative analysis of the computed quality measures and of the keyphrases obtained with the various ML techniques (enhanced with NLP) and with the popular Bayesian learning system KEA.

An extended version of this paper is also available as a technical report2.
2 Related Work
The state-of-the-art system for keyphrases extraction is KEA [9]. It uses a Naïve Bayes classifier and a few heuristics. The best results reported by the KEA team show about 18% Precision [9] in extracting keyphrases from generic web pages. The use of domain-specific vocabularies may improve the result up to 28.3% Recall and 26.1% Precision [10]. Turney suggested another approach which uses the GenEx algorithm [11]. The GenEx algorithm is based on a combination of parameterized rules and genetic algorithms. The approach provides nearly the same precision and recall as KEA. In a more recent work [2], the author applies web-querying techniques to get additional information from the Web as background knowledge to improve the results. This method has a disadvantage: mining the Web for information and parsing the responses is a time- and resource-consuming operation, which is problematic for digital libraries with millions of documents. In this approach the author measures the results by the ratio of the average number of correctly found phrases to the total number of extracted phrases. Recent works by A. Hulth et al. considered domain [5] and linguistic [4] knowledge to search for relevant keyphrases. In particular, [5] used a thesaurus to capture domain knowledge. But the recall value reported in this work is very low,
namely 4–6%. The approach proposed in [4] introduced a heuristic related to part-of-speech usage, and proposed training based on the three standard KEA features plus one linguistic feature. The authors reported fairly good results (F-Measure up to 33.9%). However, it is hard to compare their results with others because of the strong specificity of the data set used: short abstracts with on average 120 tokens, where around 10% of all words in the proposed set were keyphrases. A recent interesting work on the application of linguistic knowledge to this specific problem is reported in [12]. The authors used WordNet and “lexical chains” based on synonyms and antonyms. They then applied decision trees as the ML part, with about 50 journal articles as the training set and 25 documents as the testing set. They reported high precision, up to 45%, but did not mention recall, which makes any comparison with other techniques difficult. Another ML technique, least squares SVM [3], shows 21.0% precision and 23.7% recall in the analysis of web-mined scientific papers. Also in this case the described experiments are limited to a very small testing dataset of 40 manually collected papers. Our work contributes along four dimensions: 1) using a large, high-quality, freely available dataset which can set up a ground for further comparison; 2) empowering machine learning methods with NLP techniques and heuristics; 3) applying an ensemble learning method (namely RF) that has never been applied to the specific keyphrases extraction problem before; and 4) providing a detailed investigation of the results, such as feature importance and the effect of varying dataset size.
3 Dataset Description, Characterization and Linguistic Processing

3.1 Dataset Description and Characterization
The dataset presented contains a set of papers published by the ACM in the Computer Science domain in 2003–2005. The documents are included in the ACM portal3 and their full texts were crawled by the CiteSeerX digital library. In our pre-processing tasks, we separated different parts of the papers, such as title and abstract, thus enabling extraction based on a part of an article's text. Formulas, tables, figures and possible LaTeX mark-up were removed automatically. We share this dataset and welcome interested communities to use it as a benchmarking set for information extraction approaches. Our experimental dataset consists of 2000 documents with keyphrases assigned by ACM authors and verified by ACM editors. Keyphrases fall into two categories: i) author assigned: located inside each document in the header sections after the prefix “Keywords:”; ii) editor assigned: manually assigned by human experts in a particular domain. It is important to note that in preparing the above dataset, we selected only papers that contain at least one expert-assigned keyphrase in the full text of the document. So we are not in the more challenging case of completely implicit extraction4. In our dataset, each document has on average about 3 unique human-assigned keyphrases inside.

3 http://portal.acm.org; available also at http://dit.unitn.it/~krapivin/
4 To tackle such a challenging implicit extraction one could move from syntactic to semantic relations between words in order to access (implicitly) related keyphrases.

Fig. 1. Distributions for normal text and keyphrases: (a) POS tags (NN, IN, JJ, NNP, NNS, VBP); (b) chunk types (I-NP, B-NP, O, B-PP, B-VP, I-VP).
3.2 Linguistic Analysis of Keyphrases
We performed an NLP analysis of the keyphrases to study their linguistic properties. This forms the basis for the choice of heuristics and the definition of the features used in the machine learning step, improving the quality of the generated keyphrase candidates while simultaneously reducing their quantity. We analysed a sample of 100 random documents using OpenNLP tools. We applied a tokenizer, a Part of Speech (POS) tagger and a chunker to explore differences between POS tags and chunk types for the normal text of documents and the corresponding keyphrase sets. Fig. 1 shows the POS tag and chunk type distributions, respectively, for the most common POS tags, such as nouns (NN, NNP, NNS), prepositions (IN), adjectives (JJ) and verbs (VBP), and chunk types, such as noun phrases (B-NP, I-NP), prepositional phrases (B-PP) and verbal phrases (B-VP, I-VP). To improve readability we have omitted values close to zero. One can note from the figures that the distributions differ significantly between the normal text and keyphrase sets. The major differences in the POS tag distribution confirm the intuition that most keyphrases consist of nouns, singular as well as plural, and adjectives. The difference in the chunk type distribution also confirms and reinforces this hypothesis, adding that the overwhelming majority of keyphrases are noun phrases. This is also confirmed by additional analysis we have done with MaltParser to explore differences between the dependencies of keyphrases and those of normal text.
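The kind of comparison behind Fig. 1 can be sketched as follows; the study used OpenNLP, so the NLTK-based version below is only an illustration and requires NLTK's tokenizer and tagger models to be installed.

import nltk
from collections import Counter

def pos_distribution(texts):
    """Relative frequency of POS tags over a list of raw text strings."""
    tags = Counter()
    for text in texts:
        for sentence in nltk.sent_tokenize(text):
            tags.update(tag for _, tag in nltk.pos_tag(nltk.word_tokenize(sentence)))
    total = sum(tags.values()) or 1
    return {tag: count / total for tag, count in tags.most_common()}

# Compare the distribution of document body text with that of its keyphrases.
body = pos_distribution(["The algorithm improves retrieval on large collections."])
keyphrases = pos_distribution(["information retrieval", "digital libraries"])
print(body, keyphrases)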
3.3 Text Processing
Before any extraction task, the text needs to be pre-processed [13]. Pre-processing includes sentence boundary detection, tokenization, POS tagging, chunking, parsing, stemming and recognizing separate blocks inside the article, such as Title, Abstract, Section Headers, Reference Section and Body. We used the OpenNLP suite [7] for the standard steps of text processing: we apply a sentence boundary detector, a tokenizer, a part-of-speech tagger and a chunker in sequence. Then we apply a heuristic inspired by the previous linguistic analysis of keyphrases. The heuristic consists of two steps. First we filter by chunk type, leaving only NP chunks for further processing. Then we filter the remaining chunks by POS, leaving only chunks whose tokens belong to the parts of speech from the top of the distribution in Fig. 1a, such as NN, NNP, JJ, NNS, VBG and VBN. Table 1 shows an example sentence and the extracted keyphrase candidates. This heuristic extracts for further analysis only linguistically meaningful keyphrase candidates.

Table 1. Keyphrase candidates extracted by the heuristic
Sentence: Therefore, the seat reservation problem is an on-line problem, and a competitive analysis is appropriate.
In addition, we apply MaltParser to extract dependencies, which we use as additional features for machine learning. Finally, we use the S-removal stemmer (from KEA) to avoid the well-known issues related to the same word appearing in different forms.
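A minimal sketch of the two-step filtering heuristic follows. It uses NLTK (a simple regular-expression NP grammar plus a POS filter) instead of the trained OpenNLP chunker used in the paper, so the chunk boundaries are only an approximation; the allowed tag set is the one listed above.

# Sketch of the candidate-extraction heuristic: keep NP chunks, then keep only
# the tokens whose POS tags are in the allowed set. NLTK stands in for the
# OpenNLP chunker used in the paper.
import nltk

ALLOWED_POS = {"NN", "NNP", "NNS", "JJ", "VBG", "VBN"}

# A deliberately simple NP grammar; the paper relies on a trained chunker.
chunker = nltk.RegexpParser("NP: {<DT>?<JJ|VBG|VBN>*<NN.*>+}")


def candidate_phrases(sentence):
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    tree = chunker.parse(tagged)
    candidates = []
    for subtree in tree.subtrees(lambda t: t.label() == "NP"):
        kept = [tok for tok, pos in subtree.leaves() if pos in ALLOWED_POS]
        if kept:  # drop chunks that reduce to nothing after the POS filter
            candidates.append(" ".join(kept))
    return candidates


print(candidate_phrases(
    "Therefore, the seat reservation problem is an on-line problem, "
    "and a competitive analysis is appropriate."))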
4 Enhancing Machine Learning with Natural Language Processing

4.1 Features Selection
Proper feature space selection is a crucial step in information extraction. Table 2 details the feature set we propose. Features 1, 2 and 3 are common and widely used in most information extraction systems [14]. Less traditional features include feature 4, the quantity of tokens in a phrase [1], and feature 5, the part of the text [3]. The features numbered from 6 to 20 in Table 2 are based on linguistic knowledge. We consider keyphrases containing a maximum of 3 tokens, with indices 1, 2 and 3 in the feature names referring, respectively, to the first, second and third token of the candidate. Features 6 to 8 contain the part-of-speech tags of the tokens of a keyphrase candidate. The next set of features uses the dependencies given by MaltParser. Each dependency contains a head, a dependant and a labelled arc joining them. Dependencies help us capture the relations between tokens and the position and role of the keyphrase in the sentence. Specifically, features 9-11 contain the part-of-speech tag of the head of each token of the candidate.
Table 2. The adopted feature set, i ∈ [1..3]

#          Feature
1          term frequency
2          inverse document frequency
3          position in text
4          quantity of tokens
5          part of text
6-8        i-th token POS tag
9-11       i-th token head POS tag
12,15,18   i-th token dependency label
13,16,19   distance for i-th incoming arc
14,17,20   distance for i-th outgoing arc
Features 12-20 refer to the relations within the keyphrase and the relations attaching the keyphrase to the sentence. They consist of three groups, one per token of the keyphrase candidate, with analogous meanings. Let us consider the first group, features 12-14, in detail. Feature 12 is the label of the arc from the first token of the candidate to its head; it captures the relation between the keyphrase and the sentence, or between the tokens of the keyphrase. Features 13 and 14 capture the cohesion of the keyphrase and its relative position in the sentence. Feature 13 is the distance between the first keyphrase token and its dependant, if one exists; as the distance we take the difference between the token indices. Feature 14 is the distance between the first token and its head. Differently from many text classification approaches based on the "bag-of-words" model, which encodes the text in a binary vector showing which words are present and causes very high dimensionality, we work here with only 20 features.
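The sketch below shows one way the 20-dimensional feature vector of Table 2 could be assembled for a single candidate. The Token record is an invented stand-in for the MaltParser output, the ordering of features 12-20 follows the per-token grouping described above, and categorical values (POS tags, arc labels) would still need to be encoded before training.

# Sketch of the 20-dimensional feature vector of Table 2. The Token record is
# a made-up structure standing in for the MaltParser output used in the paper.
from dataclasses import dataclass
from typing import Optional


@dataclass
class Token:
    index: int                      # position of the token in the sentence
    pos: str                        # POS tag of the token
    head_index: Optional[int]       # index of the syntactic head, None for the root
    head_pos: str                   # POS tag of the head
    dep_label: str                  # label of the arc to the head
    dependant_index: Optional[int]  # index of a dependant, if any


def candidate_features(tokens, tf, idf, position, part_of_text):
    """tokens: the (up to three) Token objects of the candidate phrase."""
    padded = list(tokens) + [None] * (3 - len(tokens))
    vec = [tf, idf, position, len(tokens), part_of_text]         # features 1-5
    vec += [t.pos if t else "NONE" for t in padded]              # 6-8: token POS tags
    vec += [t.head_pos if t else "NONE" for t in padded]         # 9-11: head POS tags
    for t in padded:                                             # 12-20, grouped per token
        vec.append(t.dep_label if t else "NONE")                 # arc label to the head
        vec.append(t.dependant_index - t.index                   # distance to a dependant
                   if t and t.dependant_index is not None else 0)
        vec.append(t.head_index - t.index                        # distance to the head
                   if t and t.head_index is not None else 0)
    return vec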
4.2 Machine Learning Methods Used for Comparison
Random Forest (RF) [15] is based on the core idea of building many decision tree classifiers and voting for the best result. RF extends the "bagging" approach [16] with random selection of feature sets. The RF algorithm does not need a costly cross-validation procedure and it is considered scalable and fairly fast; it is one of the state-of-the-art approaches for classification tasks. Support Vector Machines (SVMs) [17] are classifiers with sound foundations in statistical learning theory [18] and are considered, together with RF, the state of the art in classification. The reasons for SVM's popularity include customizable input-space mapping (with kernels), handling of noisy data and robustness to the curse of dimensionality. When the dimensionality is high, a linear classifier is often the best choice, while with a reduced number of features a non-linear approach works best. We adopt SVM with the highly non-linear Gaussian (RBF) kernel, because in our case the dimensionality of the data is low and linear separation in the input space gives poor results. FaLK-SVM [19] is a kernel method based on Local SVM [20] which is scalable to large datasets. There are theoretical and empirical arguments supporting the fact that local learning with kernel machines can be more accurate than
SVM [21]. In FaLK-SVM (the FaLKM-lib implementation [22] is available at http://disi.unitn.it/~segata/FaLKM-lib) a set of local SVMs is trained on redundant neighbourhoods in the training set, and at testing time the most suitable model is selected for each query point. The global separation function is subdivided into solutions of local optimization problems that can be handled very efficiently. This way, all points in the local neighbourhoods can be considered without any computational limit on the total number of support vectors, which is the major problem in applying SVM to large and very large datasets. KEA [9] represents the state of the art for keyphrase extraction tasks and is based on the bag-of-words concept. Bayes' theorem is used to estimate the probability that a phrase is a keyphrase from the frequencies observed in the training set, so in the end each phrase in the text has a probability of being a keyphrase. KEA then takes the top q phrases and considers them to be keyphrases. Naïve Bayes learning is widely used for other text-oriented tasks such as spam filtering.
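As a rough illustration of the comparison (FaLK-SVM has no off-the-shelf counterpart here), the sketch below trains a Random Forest, an RBF-kernel SVM and a Naive Bayes baseline standing in for KEA's learner on the same feature matrix, using scikit-learn instead of the WEKA-compatible, LibSVM and FaLKM-lib implementations used in the paper; X and y are placeholders.

# Rough sketch of the classifier comparison on a shared feature matrix.
# scikit-learn replaces WEKA/LibSVM/FaLKM-lib; X, y are random placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))        # 20 features per candidate (Table 2)
y = rng.integers(0, 2, size=1000)      # 1 = keyphrase, 0 = not a keyphrase

models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, max_depth=15),
    "RBF SVM": SVC(kernel="rbf", C=1.0, gamma="scale", class_weight="balanced"),
    "Naive Bayes (KEA-like baseline)": GaussianNB(),
}
for name, model in models.items():
    model.fit(X[:800], y[:800])
    print(name, "accuracy on held-out data:", model.score(X[800:], y[800:]))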
5 Experimental Evaluation
In this section we detail the experiments carried out for the analysis of the discussed keyphrase extraction approaches. To assess the results we used standard IR performance measures: Precision, Recall and F-Measure [18,1]. We divided the whole dataset of 2000 documents into 3 major sets: a training set (TR), a validation set (VS) and a testing set (TS), with 1400, 200 and 400 documents respectively. To find the best dataset size we further divided the training set into 7 subsets of 200 documents each.
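The sketch below shows how Precision, Recall and F-Measure can be computed by matching extracted keyphrases against the human-assigned gold set after a crude normalisation; the suffix-stripping function here is only a stand-in for the S-removal stemmer mentioned in Section 3.3, and the example phrases are invented.

# Sketch of the evaluation: Precision, Recall and F-Measure of extracted
# keyphrases against the human-assigned gold set, after a crude normalisation.
def normalise(phrase):
    return " ".join(w[:-1] if w.endswith("s") else w for w in phrase.lower().split())


def prf(extracted, gold):
    ext = {normalise(p) for p in extracted}
    ref = {normalise(p) for p in gold}
    correct = len(ext & ref)
    precision = correct / len(ext) if ext else 0.0
    recall = correct / len(ref) if ref else 0.0
    f = (2 * precision * recall / (precision + recall)) if correct else 0.0
    return precision, recall, f


print(prf(extracted=["support vector machines", "keyphrase extraction", "web mining"],
          gold=["keyphrase extraction", "support vector machine"]))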
5.1 Experiment 1. Comparison of ML Methods Enhanced by NLP
Random Forest: the four parameters to tune are the number of trees in the ensemble I, the splitting parameter K, the balancing parameter w and the depth of a tree d. By trial and error we defined the following strategy to reduce the number of training runs: (1) take 3 different K parameters: the default, half the default and double the default; (2) stop the algorithm as soon as an increase in the number of trees does not significantly improve the solution; (3) the depth of the tree usually should not exceed the number of selected features, and should not be much smaller than it. We used the fast open-source implementation compatible with WEKA [13] (http://code.google.com/p/fast-random-forest/). SVM: the hyper-parameters we tune are the regularization parameters of the positive and negative classes (C+ and C− respectively) and the width σ of the RBF kernel. These parameters are selected using 10-fold cross-validation with a three-dimensional grid search in the parameter space. Model selection is performed by maximizing the occurrence-based F-Measure in this parameter space. For SVM training and prediction we use LibSVM [23]. FaLK-SVM: as well as the SVM parameters (C−, C+ and σ) we have to set the neighbourhood size k used for the local learning approach. Model selection is thus performed as described for SVM, but using a four-dimensional grid search.
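As an illustration of this model-selection step, the sketch below runs a grid search with 10-fold cross-validation over the SVM parameters using scikit-learn's GridSearchCV rather than LibSVM directly; the class_weight values stand in for the separate C+ and C− of the paper, and X_train, y_train are assumed to be the feature matrix and labels built in Section 4.1.

# Sketch of the SVM model selection: grid search over the regularisation and
# RBF-kernel parameters with 10-fold cross-validation, maximising F-Measure.
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {
    "C": [0.1, 1, 10, 100],
    "gamma": [1e-3, 1e-2, 1e-1, 1],
    "class_weight": [{1: w} for w in (1, 2, 5, 10)],  # up-weight the rare keyphrase class
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, scoring="f1", cv=10, n_jobs=-1)
# search.fit(X_train, y_train)   # X_train, y_train: assumed feature matrix and labels
# print(search.best_params_, search.best_score_)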
Table 3. SVM, FaLK-SVM, RF and KEA results. Best values in bold

Method          Precision   Recall    F-Measure
FaLK-SVM        24.59%      35.88%    29.18%
SVM             22.78%      38.28%    28.64%
Random Forest   26.40%      34.15%    29.78%
KEA (best q)    18.61%      26.96%    22.02%
It is important to mention that the parameters of FaLK-SVM and SVM are tuned fully automatically. KEA: KEA has only one tuning parameter, the threshold q; experimentally we found that q = 5 produces the best F-Measure. Table 3 summarizes the results. We see that the best F-Measure is achieved by the Random Forest using all 20 proposed NLP features. FaLK-SVM and SVM follow very closely, while KEA is much lower. Since the difference between the best three methods is not very big (RF outperforms SVM by ca. 4%), it is important to understand which are the most important factors: particular features or peculiarities of the dataset. This has led us to explore both dimensions in the next set of experiments.
5.2 Experiment 2. Training Set Size Analysis
An increase in training set size may improve prediction quality. However, training on a large amount of data is computationally expensive, so it is relevant to estimate which dataset size is enough to obtain the best prediction performance. To study this, we carried out experiments at increasing training set sizes, as summarized in Figure 2a. One can see that i) the F-Measure improves as the training set size increases; and ii) the improvement levels off after ca. 400 documents. We can conclude that for the task of keyphrase extraction it is important to have rather large training sets, but training sets with more than 400 documents are computationally expensive without a relevant increase in prediction quality.
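A sketch of such a training-set-size analysis is given below, using scikit-learn's learning_curve helper at the candidate level; the seven training fractions mirror the 200-document subsets mentioned in Section 5, and X, y are random placeholders, so the numbers it prints are meaningless except as a template.

# Sketch of the training-set-size analysis: F-Measure at increasing training
# sizes (scikit-learn's learning_curve; X, y are placeholder data).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

rng = np.random.default_rng(0)
X = rng.normal(size=(1400, 20))      # placeholder: one row per candidate phrase
y = rng.integers(0, 2, size=1400)    # placeholder labels

sizes, _, test_scores = learning_curve(
    RandomForestClassifier(n_estimators=50),
    X, y,
    train_sizes=np.linspace(1 / 7, 1.0, 7),  # mirrors the seven 200-document subsets
    cv=5, scoring="f1", n_jobs=-1)

for n, score in zip(sizes, test_scores.mean(axis=1)):
    print(int(n), round(float(score), 3))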
5.3 Experiment 3. NLP Features Analysis
Fig. 2. F-Measure behaviour with dataset size and feature count growth. The triangle in (b) shows the 3 features of KEA.

In this experiment we analyse the individual effect of the features on the prediction ability. We performed experiments omitting features one by one and monitoring the effect on the overall quality. Since our features are logically grouped, we decided to exclude the following groups of features sequentially: i) arcs (11 features left); ii) head POS tags (8 features left); iii) POS tags (5 features left); iv) TF×IDF and relative position (3 features left). Figure 2b summarizes the results (KEA has just 3 features, so there is just one point on the plot). We see that in the case of Random Forest using only the first three features decreases the F-Measure essentially to the KEA result. This is very interesting, because the Bayesian learning of KEA considers just 3 features (we tried to increase the number of features in the Naïve Bayes approach, but with no significant success). In our comparison of four methods we have two "statistical learning" methods (SVM and FaLK-SVM) and two "probabilistic" methods, which give close results when using the same three basic features concerning simple counts of tokens. Figure 2b shows that the various methods exploit different features in different ways: arcs are important for Random Forest, while POS tags are most important for SVM. Moreover, while there is a tendency for the overall quality to level off, the experiments do not clearly show that a "plateau" has been reached, so other relevant features may yet be found.
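The sketch below mimics the ablation procedure: groups of feature columns are removed cumulatively and the classifier is re-evaluated. The column counts follow the grouping above (20, 11, 8, 5 and 3 features), the column order follows Table 2, and X, y are placeholder data.

# Sketch of the feature-ablation experiment: drop groups of columns
# cumulatively and re-evaluate with cross-validation.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))      # placeholder feature matrix, columns as in Table 2
y = rng.integers(0, 2, size=1000)    # placeholder labels

ablation_steps = {
    "all 20 features": 20,
    "arcs removed (11 left)": 11,
    "head POS tags also removed (8 left)": 8,
    "POS tags also removed (5 left)": 5,
    "only tf, idf and position (3 left)": 3,
}
for name, n_cols in ablation_steps.items():
    scores = cross_val_score(RandomForestClassifier(n_estimators=50),
                             X[:, :n_cols], y, cv=5, scoring="f1")
    print(name, round(scores.mean(), 3))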
6 Conclusions
In this paper we applied NLP-based knowledge to several different ML methods for automatic keyphrase extraction from scientific documents. The proposed NLP-based approach shows promising results. We performed a detailed evaluation of all the ML methods by comparing the extracted keyphrases with human-assigned keyphrases on a subset of 2000 ACM papers in the Computer Science domain. The evaluation shows that: i) the best NLP-powered ML methods outperform KEA in all quality measures (Precision, Recall and overall F-Measure); in particular, the average F-Measure improves from 22% (KEA) to 30% without the use of controlled vocabularies; ii) feature removal leads to a steady decrease of the F-Measure for all considered machine learning methods; iii) a larger training set improves the F-Measure, which reaches a "plateau" at a training set size of around 400 documents; iv) Random Forest offers a good tradeoff between keyphrase extraction quality and computational speed. The proposed hybrid ML+NLP approach may also be valid for different and more specific data such as news, emails, abstracts and web pages.
References
1. Han, H., Giles, L., Manavoglu, E., Zha, H., Zhang, Z., Fox, E.: Automatic document metadata extraction using support vector machines. In: JCDL (2003)
2. Turney, P.: Mining the web for lexical knowledge to improve keyphrase extraction: Learning from labeled and unlabeled data. Tech. Rep. NRC-44947/ERB-1096 (August 2002)
3. Wang, J., Peng, H.: Keyphrases extraction from web document by the least squares support vector machine. In: ICWI (2005)
4. Hulth, A.: Improved automatic keyword extraction given more linguistic knowledge. In: EMNLP, pp. 216–223 (2003)
5. Hulth, A., Karlgren, J., Jonsson, A., Bostrom, H., Asker, L.: Automatic Keyword Extraction Using Domain Knowledge. Springer, Heidelberg (2004)
6. Krapivin, M., Marchese, M., Yadrantsau, A., Liang, Y.: Automated key-phrases extraction using domain and linguistic knowledge. In: ICDIM (2008)
7. Morton, T.: Using Semantic Relations to Improve Information Retrieval. PhD thesis, University of Pennsylvania (2005)
8. Nivre, J., Hall, J., Nilsson, J., Chanev, A., Eryigit, G., Kübler, S., Marinov, S., Marsi, E.: MaltParser: A language-independent system for data-driven dependency parsing. Natural Language Engineering 13(2), 95–135 (2007)
9. Witten, I., Paynter, G., Frank, E., Gutwin, C., Nevill-Manning, C.: KEA: Practical automatic keyphrase extraction. In: DL (1999)
10. Medelyan, O., Witten, I.: Thesaurus based automatic keyphrase indexing. In: JCDL (2006)
11. Braams, J.: Learning to extract keyphrases from text. Technical report NRC (ERB-1057) (February 1999)
12. Ercan, C., Cicekli, I.: Using lexical chains for keyword extraction. Information Processing and Management 43(6), 1705–1714 (2007)
13. Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, San Francisco (2005)
14. Turney, P.: Coherent keyphrase extraction via web mining. In: IJCAI, pp. 434–439 (2003)
15. Breiman, L.: Random forests. Machine Learning, 5–32 (2001)
16. Breiman, L.: Bagging predictors. Machine Learning 24(2), 123–140 (1996)
17. Cortes, C., Vapnik, V.: Support-vector networks. Machine Learning 20(3), 273–297 (1995)
18. Osuna, E., Freund, R., Girosi, F.: Support Vector Machines: Training and Applications. In: CVPR, p. 130 (1997)
19. Segata, N., Blanzieri, E.: Fast Local Support Vector Machines for Large Datasets. In: Perner, P. (ed.) Machine Learning and Data Mining in Pattern Recognition. LNCS, vol. 5632, pp. 295–310. Springer, Heidelberg (2009)
20. Blanzieri, E., Melgani, F.: Nearest neighbor classification of remote sensing images with the maximal margin principle. IEEE Transactions on Geoscience and Remote Sensing 46(6), 1804–1811 (2008)
21. Segata, N., Blanzieri, E.: Fast and Scalable Local Kernel Machines. Technical Report DISI-09-072, University of Trento (2009)
22. Segata, N.: FaLKM-lib v1.0: a Library for Fast Local Kernel Machines. Technical report, University of Trento (2009), http://disi.unitn.it/~segata/FaLKM-lib/
23. Chang, C., Lin, C.: LIBSVM: a library for support vector machines (2001)
Measuring Peculiarity of Text Using Relation between Words on the Web

Takeru Nakabayashi (1), Takayuki Yumoto (2), Manabu Nii (2), Yutaka Takahashi (2), and Kazutoshi Sumiya (3)

(1) School of Engineering, University of Hyogo, 2167 Shosha, Himeji, Hyogo 671-2280, Japan, [email protected]
(2) Graduate School of Engineering, University of Hyogo, 2167 Shosha, Himeji, Hyogo 671-2280, Japan, {yumoto,nii,takahasi}@eng.u-hyogo.ac.jp
(3) School of Human Science and Environment, University of Hyogo, 1-1-12 Shinzaike-honcho, Himeji, Hyogo 670-0092, Japan, [email protected]
Abstract. We define the peculiarity of text as a metric of information credibility: higher peculiarity means lower credibility. We extract the theme word and the characteristic words from a text and check whether there is a subject-description relation between them. The peculiarity is defined using the ratio of subject-description relations between the theme word and the characteristic words. We evaluate the extent to which peculiarity can be used to judge credibility by classifying texts from Wikipedia and Uncyclopedia in terms of their peculiarity.
1 Introduction
The web makes it easy to obtain the various kinds of information we wish to know. However, information on the web is sometimes incorrect. Therefore, knowing the credibility of the information on a web page is very important. There is a great deal of research on detecting web spam, which is related to the credibility of information. For example, Gyöngyi et al. proposed the use of TrustRank to find web spam [1]. This approach is based on link analysis, and it cannot be applied to information that is not on the web or is in unlinked pages. On the other hand, Yamamoto and Tanaka proposed a method to measure the credibility of a statement [2], which they call a fact. This approach depends on web searches using phrases such as "(*) is famous for beer", and it is therefore sensitive to small differences in the description of the same fact. We define the peculiarity of text as a metric of information credibility on the basis of word relations; higher peculiarity means lower credibility. We extract the theme word and the characteristic words from a text and measure the subject-description relation between them. The peculiarity is defined using the ratio of subject-description relations.

This research was supported in part by a Grant-in-Aid for Scientific Research (B)(2) 20300039 from the MEXT of Japan.
2 Measuring Peculiarity of Text
2.1 Characteristic Words of Text
To define the peculiarity of text, we use the characteristic words of the text. First, we define the characteristic degree of words using two aspects: the characteristic degree in the document and the characteristic degree on the web. For the former, we use the importance degree computed by the Termex system (http://gensen.dl.itc.u-tokyo.ac.jp/win.html). This system is used to extract domain-specific words from Japanese texts using the following expression [3]:

FLR(w, p) = tf(w, p) × LR(w, p),   (1)

where tf(w, p) is the term frequency of word w in page p, and LR(w, p) is a score which increases when w is a longer compound noun. Nouns with a higher FLR value are regarded as domain-specific words. We regard words that frequently appear in the document, and compound nouns, as more characteristic; the FLR(w, p) function matches this idea. To express the degree to which a word is characteristic on the web, we use the following inverse document frequency:

idf(w) = log(N / df(w)),   (2)

where N is the number of documents and df(w) is the number of documents that contain word w. We use the number of search results for the query w as df(w). Thus, N should be the number of all documents on the web; as this is too difficult to determine, we use a very large number instead and set N = 500,000,000. We can use text t instead of page p in expressions (1) and (2). The characteristic score of word w in text t is defined as follows:

ch(w, t) = FLR(w, t) × idf(w)   (3)

We extract all nouns, including compound nouns, from the target text t and select the 4 words with the highest characteristic scores as characteristic words.
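A minimal sketch of this scoring follows. FLR(w, t) requires the Termex term extractor, so plain term frequency is used as a crude stand-in here, and web_hits() is a stub for the search-result counts the paper obtains from a web search engine; only the combination ch(w, t) = FLR(w, t) × idf(w) is taken from the paper.

# Sketch of the characteristic-word scoring of Sect. 2.1. Term frequency
# approximates FLR(w, t); web_hits() stubs the search-result count df(w).
import math

N = 500_000_000  # stand-in for the total number of documents on the web


def web_hits(word):
    # Placeholder counts; the paper uses real search-engine result counts.
    return {"global warming": 80_000_000, "warm-biz law": 120}.get(word, 1_000_000)


def characteristic_words(nouns_with_tf, top_k=4):
    """nouns_with_tf: {noun: term frequency in the text}; returns the top-k nouns."""
    scores = {}
    for w, tf in nouns_with_tf.items():
        flr_approx = tf                       # assumption: tf approximates FLR(w, t)
        idf = math.log(N / web_hits(w))
        scores[w] = flr_approx * idf          # ch(w, t) = FLR(w, t) x idf(w)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


print(characteristic_words({"global warming": 12, "warm-biz law": 3, "temperature": 7}))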
2.2 Subject-Description Relation
Oyama and Tanaka proposed a method to find the subject-description relation between words [4]. If A is a subject word and B is its description word, B is a detailed
word of A. If the following expression is satisfied and the difference is statistically significant, word B is a detailed word of word A:

DF(intitle(A) ∩ B) / DF(intitle(A)) > DF(A ∩ B) / DF(A)   (4)

DF(intitle(A)) is the number of documents that have a title containing keyword A, and DF(intitle(A) ∩ B) is the number of documents that contain keyword A in their title and keyword B in any part. DF(A ∩ B) is the number of documents that contain both keywords A and B, and DF(A) is the number of documents that contain keyword A. χ² tests are used as the significance test for expression (4). We use this method to judge whether the theme word and the characteristic words are related.

2.3 Peculiarity of Text
We define peculiarity using the relation between a theme word and the characteristic words. A theme word is a word that expresses the theme of the text; in our method, it is selected by the user. The peculiarity of text t with theme word w is defined as follows:

pe(t, w) = 1 − |σw(cWords(t))| / |cWords(t)|   (5)

cWords(t) is the set of characteristic words of t. σw(S) is a filtering function that returns the detailed words of w in the word set S:

σw(S) = {s | s is a detailed word of w, s ∈ S}   (6)
A text with a higher peculiarity is not regarded as credible.
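The sketch below ties expressions (4)-(6) together: a detailed-word test based on stubbed document-frequency counts and the peculiarity as one minus the fraction of characteristic words that pass the test. The query strings and counts are invented, and the χ² significance test of expression (4) is omitted.

# Sketch of expressions (4)-(6). df() stubs the document-frequency counts that
# the paper obtains from web search; the "intitle:" strings are just keys for
# the stub, not a claim about the actual query syntax used.
def df(query):
    counts = {
        "intitle:Ichiro": 1_000, "Ichiro": 50_000,
        "intitle:Ichiro batting": 400, "Ichiro batting": 5_000,
        "intitle:Ichiro spaceship": 1, "Ichiro spaceship": 800,
    }
    return counts.get(query, 0)


def is_detailed_word(theme, word):
    """Expression (4): word is a description (detailed word) of theme."""
    lhs = df(f"intitle:{theme} {word}") / df(f"intitle:{theme}")
    rhs = df(f"{theme} {word}") / df(f"{theme}")
    return lhs > rhs


def peculiarity(theme, characteristic_words):
    """Expression (5): 1 - |detailed words| / |characteristic words|."""
    detailed = [w for w in characteristic_words if is_detailed_word(theme, w)]
    return 1 - len(detailed) / len(characteristic_words)


print(peculiarity("Ichiro", ["batting", "spaceship"]))  # 0.5 with the stub counts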
3 Experiments and Discussion
We evaluated the contribution of peculiarity to the process of judging credibility. If the values of peculiarity can be used to classify credible and implausible text, then peculiarity can be used to judge whether a web page is credible or not. We regard a text as implausible when its peculiarity is higher than 0.4; otherwise, the text is regarded as credible. We prepared ten keywords and, for each, an article from Wikipedia and one from Uncyclopedia. Uncyclopedia is a parody site of Wikipedia and its articles contain many jokes. We regarded the text of the Wikipedia articles as credible and the text of the Uncyclopedia articles as implausible. We used the title of the articles as the theme words, and five characteristic words were extracted from each text by our algorithm. We show the results in Table 1; the original theme words are in Japanese. In Table 1, the column "Wikipedia" gives the peculiarity of the text from Wikipedia and "Uncyclopedia" that of the text from Uncyclopedia. From Table 1, the precision over all texts is 0.65, that for credible texts is 0.70, and that for implausible texts is 0.60.
Table 1. Results of the experiments

Theme word        Wikipedia   Uncyclopedia
Windows           0.4         0.4
USA               0.2         0.8
Ichiro            0.2         0.0
Object oriented   0.2         0.6
Cola              0.4         0.6
Dragon Ball       0.6         0.4
Pencil            0.2         0.4
Nuclear power     0.0         0.0
Global Warming    0.0         0.0
Alchemy           0.2         0.2
Precision         0.7         0.6
The results reveal some problems. The first is that some extracted characteristic words are not widely known. For example, "warm-biz law" is a characteristic word from the Uncyclopedia article about "global warming"; it is a fictitious law and is not used on other pages, yet it is judged to be a detailed word of "global warming". Characteristic words should also be used on other pages. To solve this problem, we need to change the method of extracting characteristic words or the method of judging the relation between the theme word and the characteristic words. The second problem is a failure to extract compound words correctly. For example, "device product for" is extracted as a characteristic word from the Wikipedia article about "Windows", and it should be split into "device product" and "for". To solve this problem, we should reconsider the algorithm used to extract compound words.
References
1. Gyöngyi, Z., Garcia-Molina, H., Pedersen, J.: Combating web spam with TrustRank. In: Proceedings of the Thirtieth International Conference on Very Large Data Bases (VLDB 2004), VLDB Endowment, pp. 576–587 (2004)
2. Yamamoto, Y., Tanaka, K.: Finding comparative facts and aspects for judging the credibility of uncertain facts. In: Vossen, G., Long, D.D.E., Yu, J.X. (eds.) WISE 2009. LNCS, vol. 5802, pp. 291–305. Springer, Heidelberg (2009)
3. Nakagawa, H., Yumoto, H., Mori, T.: Term extraction based on occurrence and concatenation frequency (in Japanese). Journal of Natural Language Processing 10(1), 27–45 (2003)
4. Oyama, S., Tanaka, K.: Query modification by discovering topics from web page structures. In: Yu, J.X., Lin, X., Lu, H., Zhang, Y. (eds.) APWeb 2004. LNCS, vol. 3007, pp. 553–564. Springer, Heidelberg (2004)
Imitating Human Literature Review Writing: An Approach to Multi-document Summarization

Kokil Jaidka, Christopher Khoo, and Jin-Cheon Na

Wee Kim Wee School of Communication and Information, Nanyang Technological University, Singapore 637718
{KOKI0001,assgkhoo,TJCNa}@ntu.edu.sg
Abstract. This paper gives an overview of a project to generate literature reviews from a set of research papers, based on techniques drawn from human summarization behavior. For this study, we identify the key features of natural literature reviews through a macro-level and clause-level discourse analysis; we also identify human information selection strategies by mapping referenced information to source documents. Our preliminary results of discourse analysis have helped us characterize literature review writing styles based on their document structure and rhetorical structure. These findings will be exploited to design templates for automatic content generation. Keywords: multi-document summarization, discourse analysis, rhetorical structure, literature reviews.
The first part of our study involves analyzing and identifying the typical discourse structures and the rhetorical devices used in human-generated literature reviews, and the linguistic expressions used to link information in the text to form a cohesive and coherent review. An analysis of how information is selected from the source papers and organized and synthesized in a literature review is also carried out. This study focuses on literature reviews which are written as a section in journal articles in the domain of information science. On the basis of the analyses, a literature review generation system will be designed and implemented. This paper gives an overview of the project and reports preliminary results from the analyses of human literature reviews.
2 Approach

2.1 Study of the Discourse Structure

In the first part of the study, we are analyzing the discourse structure of literature reviews at three levels of detail:
• Macro-level discourse structure—analysis of the document structure and the types of information conveyed by each sentence of a literature review. This will allow an understanding of the typical document structure of a literature review and how its elements are laid out.
• Inter-clause level discourse structure—analysis of the rhetorical functions which frame the information to serve a particular purpose of the literature review, and their linguistic realization.
• Intra-clause level discourse structure—concepts and relations between concepts expressed within the clause, which are connected to each other through the inter-clause level structure. This will help us to understand the types of information and relations that are important in literature reviews of research papers.

We have developed an XML document structure language to annotate the different sections of a literature review [5]. We applied it to annotate the text of a sample of twenty literature reviews taken from articles published in the Journal of the American Society for Information Science and Technology. The annotations were made at the sentence level to identify the type of information conveyed by each sentence. Broadly, our XML annotations identified two types of functional elements:
• meta-elements, which indicate the reviewer's comments and critique of cited studies: meta-critique and meta-summary elements;
• descriptive elements, which describe topics, studies and concepts at different levels of detail: topic, what, study, description, method, result, interpretation, brief-topics and current-study elements.

We observed that literature reviews may be either descriptive or integrative and that their discourse structures are correspondingly different [5]. Descriptive literature reviews summarize individual papers/studies and provide more information about each study, such as its research methods and results. Integrative literature reviews focus on the
ideas and results extracted from a number of research papers and provide fewer details of individual papers/studies. They provide critical summaries of topics and methodologies. We found that descriptive literature reviews have a significantly greater number of method, result and interpretation elements embedded within each study element, through which they provide more information on each study. Integrative literature reviews have fewer study elements and instead have significantly more meta-summary and meta-critique elements, wherein they provide high-level critical summaries of topics. Both types of information science literature reviews follow a hierarchical structure and have a typical composition of discourse elements. On the basis of these findings, we designed generic templates for our literature review generation system.

We analyzed the inter-clause level structure of the sample literature reviews and identified 34 rhetorical functions as well as a variety of linguistic expressions for realizing these functions. For the purpose of literature review generation, the focus will be on emulating the commonly occurring rhetorical functions, which are:
• state common topic
• introduce a topic
• describe the purpose of a study
• identify the method/model
• state the research results
• report the recommendations
• delineate a research gap
• compare studies
It was observed that for descriptive literature reviews, writers prefer rhetorical arguments which build a description of cited studies, providing information on their research methods, results and interpretation. For integrative literature reviews, writers employ rhetorical arguments which help to build a critical summary of topics and provide examples illustrating the author's argument. To study the intra-clause level discourse structure, we will follow the approach outlined in an earlier study [6] which investigated four kinds of information embedded at the clausal level: research concepts, relationships between them, contextual relations which place the research in the context of a framework, and research methods.

2.2 Mapping of Referenced Information to Source Papers

In this step, we attempt to identify the strategies followed by human reviewers to select information from research papers and to organize and synthesize it in a literature review. The objectives of this analysis are:
• to find out which parts of a research paper researchers select information from, for example, the abstract, conclusion, etc.;
• to find out what types of information are selected, for example, methodology, topic description, etc.;
• to find out what transformations are performed on the information;
• to identify possible rationales for the reviewer's choices.

From a preliminary analysis, we observed the following typical information selection strategies by reviewers:
• a marked preference for selecting information from certain sections of the source paper, such as its abstract, conclusion and methodology;
• for descriptive literature reviews: applying cut-paste or paraphrasing operations on text from individual sources to provide a detailed description of the studies;
• for integrative literature reviews: applying inferencing and generalization techniques to summarize information from several source papers into a higher-level overview.
3 Proposed System

The summarization process in our literature review generation system will have four phases: pre-processing, information selection and integration, rhetorical function implementation, and post-processing. The novel approaches in our literature review generation system would mainly be in the information selection and integration stage, to select information from different semantic levels, and the rhetorical function implementation stage, where the literature review will be drafted. These would draw from the results of our analysis of human summarization strategies. Information selection and integration involves:
• Sentence selection—identifying important sentences which appear to be fulfilling a literature review's purposes, such as identifying a gap, introducing a topic, etc.
• Concept extraction—extracting important concepts and relations from the selected text.
• Concept integration—linking concepts into lexical chains based on rhetorical relations.

At the rhetorical implementation stage, a draft literature review is generated by selecting a document structure and mapping the extracted information into the document structure. Text is generated by filling out sentence templates which embody rhetorical functions.
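Purely as an illustration of this template-filling step (the paper does not give its actual templates), the sketch below fills simple sentence templates for a few of the rhetorical functions listed in Section 2.1; the template wordings and slot names are invented.

# Illustrative sketch of template filling for rhetorical functions. The
# template wordings and slot names are invented, not taken from the paper.
TEMPLATES = {
    "introduce a topic": "{topic} has been studied extensively in recent years.",
    "describe the purpose of a study": "{authors} ({year}) aimed to {purpose}.",
    "state the research results": "{authors} ({year}) found that {result}.",
}


def realise(function_name, **slots):
    """Fill the template associated with a rhetorical function."""
    return TEMPLATES[function_name].format(**slots)


print(realise("introduce a topic", topic="Multi-document summarization"))
print(realise("describe the purpose of a study",
              authors="Saggion and Lapalme", year=2002,
              purpose="generate indicative-informative summaries"))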
References
1. Goldstein, J., Mittal, V., Carbonell, J., Kantrowitz, M.: Multi-document summarization by sentence extraction. In: NAACL-ANLP 2000 Workshop on Automatic Summarization, vol. 4, pp. 40–48. Comp. Ling., Seattle (2000)
2. Schiffman, B., Nenkova, A., McKeown, K.: Experiments in multidocument summarization. In: Proceedings of the Second International Conference on Human Language Technology Research, pp. 52–58. Morgan Kaufmann Publishers Inc., San Diego (2002)
3. Schlesinger, J.D., O'Leary, D.P., Conroy, J.M.: Arabic/English Multi-document Summarization with CLASSY – The Past and the Future. In: Gelbukh, A. (ed.) CICLing 2008. LNCS, vol. 4919, pp. 568–581. Springer, Heidelberg (2008)
4. Saggion, H., Lapalme, G.: Generating indicative-informative summaries with sumUM. Comput. Linguist. 28, 497–526 (2002)
5. Khoo, C., Na, J.-C., Jaidka, K.: Analysis of the Macro-Level Discourse Structure of Literature Reviews. Online Information Review (under review)
6. Ou, S., Khoo, C.S.-G., Goh, D.H.: Design and development of a concept-based multi-document summarization system for research abstracts. J. Inf. Sci. 34, 308–326 (2007)
A Study of Users' Requirements in the Development of Palm Leaf Manuscripts Metadata Schema

Nisachol Chamnongsri (1,2), Lampang Manmart (1), and Vilas Wuwongse (3)

(1) Information Studies Program, Khon Kaen University, Thailand
(2) School of Information Technology, Suranaree University of Technology, Thailand
(3) School of Engineering and Technology, Asian Institute of Technology, Thailand
[email protected], [email protected], [email protected]
Abstract. This paper presents users' behavior, needs and expectations with respect to palm leaf manuscripts (PLMs), which are ancient Thai documents. We focus on access tools, access points and how users select PLMs. The data were collected through in-depth interviews with 20 users, including researchers, local scholars and graduate students who are conducting research in the field and using PLMs as information and knowledge resources. The results reveal two important characteristics of user behavior: previous knowledge of items, and exploratory searches. Users adopt a 4-step pattern in searching for PLMs. Finally, we discuss the information that is important when searching for PLMs and compare it with the frequently consulted bibliographic elements and the Dublin Core elements. Keywords: User studies, User behaviors, User requirements, Metadata schema development, Palm leaf manuscripts, Ancient documents, Cultural heritage.
Fig. 1. Pictures of long palm leaf manuscripts
digitized and transcribed into modern languages, services should be expanded to allow access by the public, so as to maximize the application of the knowledge recorded in the PLMs, which is another way to preserve the cultural heritage. Using a suitable metadata schema will make the management of digitized PLM collections more efficient. The related literature in the area of user studies indicates that understanding users can help in developing information systems and services. For this reason, it is necessary to study the behavior and needs of PLM users in searching for these manuscripts. The purpose of this study is to discover the basic information needed for metadata design, such as determining the qualities of the desired metadata that could help users expand their searches and access the desired PLMs and their contents.
2 Related Works

Research exploring information-seeking behavior in the context of ancient documents and cultural heritage is still limited. It is mostly found in user studies in the domains of the Humanities and Social Sciences, digital libraries and digital museums [6], [7], [8], [9], [10], [11], [12], although these kinds of documents contain a vast amount of knowledge and have value for various academic areas. Furthermore, some of them record content that can be used to develop competency and strengthen economic and social foundations for the long-term, sustainable growth of the country. To date, scholars in these areas have been the main users and they are familiar with these kinds of documents; ancient books are also important resources for humanities scholars and researchers [10]. The subjects of the studies include academic scholars, graduate students, and experts in cultural heritage (researchers in museums, curators, registrars, IT personnel in museums, and librarians). These studies are based on transaction logs, monitoring of search behavior, interviews, think-aloud techniques, or surveys, and some use multiple approaches, such as monitoring search behavior and interviews, or conducting surveys with follow-up interviews. Similarly to other studies in this domain [6], [9], Buchanan et al. [7] reveal that humanities information seeking demonstrates a strong use of human support, as well as a more intensive use of printed or merchandised seeking tools. However, these users have a greater degree of satisfaction when seeking information from digital libraries and the internet. Wu and Chen [10] draw similar conclusions, namely that humanities
graduate students are satisfied with searching full-text databases for ancient books, but they comment that the coverage, quality and search interface could be improved, and they suggest that links to related resources should be added. Students continue to use paper sources, while professors rely less on the databases and emphasize the importance of paper versions. Trying to understand search behavior, Skov and Ingwersen [12] categorize the information-seeking behavior of virtual museum visitors into four groups: highly visual experience, meaning making, known item/element searching, and exploratory behavior, which illustrates the differences between professional and non-professional behavior; moreover, seeking behavior was highly task dependent. According to Wu and Chen [10], graduate students use full-text databases, like bibliographic databases, to locate information concerning their research interests, and even though they are satisfied with the search functions, they consider that the search interface should be improved. Studying the mismatch between current retrieval tools and the trends in information-seeking needs in cultural heritage, Amin, Ossenburggen, and Nispen [11] categorize information-seeking tasks into three groups: fact finding, information gathering, and keeping up to date. The problems revealed in this research include the difficulty of building queries when users are not familiar with the vocabulary, and the difficulty of comparing objects, performing relationship searches, exploring, and combining results. Furthermore, balancing between providing experts with the most recent and the most relevant information is another problem.
3 User Study

At the moment there are only limited groups of PLM users. Therefore, to gain more understanding of PLM user behavior, the study relies on the following groups: academic researchers, graduate students and local scholars who have real experience with PLMs. The study of needs is based on the study of information behavior, consisting of personal, social and environmental variables [13], as follows: personal characteristics; the objectives of PLM searches and how users search for, access and use these documents; the information that helps users search for, judge the relevance of, and access PLMs; and the obstacles and problems in PLM search and access.

3.1 Participants

A total of 20 subjects participated in the study. Ten participants were selected from the users' registration books of four PLM preservation projects and from their publications relating to PLMs. The other ten participants were selected after the interviews, using the snowball technique. The participants included seven graduate students (from three fields: five Master's students in Epigraphy, one in Thai language, and one Ph.D. student in Communication Arts), eight researchers (from six academic institutes in five research areas: Northeastern Thailand (Isan) and Laos history, Oriental language and Buddhism, Isan literature, Anthropology, and Thai literature and Folklore), and five local scholars who are interested in obtaining information from PLMs. Most of the participants (19) were familiar with palm leaf manuscripts in both their physical structure and content; for example, they knew how to identify the date of inscription, what type of information was written in each type of palm leaf manuscript,
and also that palm leaf manuscripts can be inscribed with more than one story in each fascicle. Furthermore, most participants (16, 18) reported that they can read at least one script and understand the language inscribed in the PLMs. Nine of them evaluated themselves as experts in reading the Tham-Isan script, while six participants evaluated themselves as experts in reading the Thai-Noi script. A few participants, mostly students, are not able to read these kinds of scripts or understand these languages; they evaluated themselves as novices in the study of PLMs. Six participants in the group of researchers have 20-30 years' experience in using PLMs, while four of the seven participants in the group of graduate students had only 1-3 years' experience.

Table 1. The participants' research areas

Research area                      Total (N=20)
1. Isan and Laos history           2
2. Ancient languages and scripts   5
3. Isan literature and language    11
4. Isan culture                    7
5. North region culture            1
6. Folklore                        3
7. Anthropology                    6
8. Sociology                       2
9. Buddhism                        2
10. Herbal medicine                1
11. Pali literature                1
12. Health communication           1
13. Legend of cities               1
3.2 Study Method

The semi-structured interviews were conducted over two months at the participants' workplaces. Telephone interviews were used in three cases where there were problems with distance and scheduling. Before the interviews started, the participants were informed about the objectives of the research and about metadata. Pilot interviews with the staff of the PLM preservation projects about their users' behaviors were conducted first, to gain more understanding and to make the questionnaire complete. Significant answers or opinions from previous interviewees were passed on to subsequent interviewees to confirm the requirements. The interviews took from one to two hours per participant, and only the interviewer and the participant were present during each interview.
4 Results and Discussions

4.1 User Behaviors in Seeking for PLMs

The results show that users use PLMs as their main resource for research. In addition, they use PLMs for translation, reading practice, and preparing materials for teaching ancient scripts. They consider that original PLMs are a primary source of knowledge and information, without the need for any other documents.
Nevertheless, searching for PLMs has been difficult for users because of the lack of efficient retrieval tools. Furthermore, the retrieval tools themselves were difficult to access and use. Thus, the main channel chosen by users to access the PLMs they want is to use their own personal resources, such as friends or colleagues who do research in the same area or who have experience of PLMs (e.g. curators, monks, graduate students etc.). In addition, they may know the holders of PLMs or have personal connections with temples located in their areas or with preservation projects of PLMs, or they may have their own collection of PLMs. Most of the users in this study did not have experience of using PLMs at the national libraries because of the strict restrictions; some of them believed that they could not find the PLMs they wanted at the national libraries, and some said that they did not know that the national libraries held collections of PLMs. These results are similar to those obtained by social scientists worldwide who have their own personal collection, and whose fieldwork and archives are their main sources of information because they provide unique, primary information; and whenever they need help they also ask colleagues, friends, librarians, and people who they believe to be knowledgeable about their topic and who can give them advice or offer suggestions [17].

With regard to the search methods used for accessing the PLMs, the research results showed two important characteristics:

Previous Knowledge of Items. In this case, users know what PLMs they are looking for (title and concept of content). Users in this group were people who are looking for PLMs which contain local literature and folktales. Normally, they used the title as their access point. Moreover, they already know the nature of the stories or subjects they are looking for. Thus, they can easily decide if the PLMs they have found are what they want, although in some cases the titles are in a different dialect or use a different word order. However, for new users who are not familiar with the PLMs they need, it is necessary for them to carry out some preliminary steps in order to find out the nature of the stories they are going to research by using related publications, such as encyclopedias of Northeastern Thailand, related research reports, and translation versions etc.

Exploratory Searches. In this case, users may have an idea about the nature of the stories they want, but they may or may not know the words to use and where to start their search. Sometimes they might not know or recognize that they have found the stories they need. Users categorized in this group were the subjects who were looking for stories other than local literature and folktales (e.g. traditional medicine, traditional law, folklore, traditional beliefs etc.). They need a good starting point for access, such as subject headings, keywords or related concepts, in order to navigate their way to what they require, which would be confirmed by a summary of the content. This group of subjects asked for help from people such as friends who do research in the same area, monks, and curators etc. These people then helped them establish the nature of the contents they required, because having only "Title" as their access point restricts the retrieval tools available for people to find what they need.
Search Patterns for the PLMs. The results of the study show a 4-step pattern of behavior in searching for the PLMs:

1. The users have to identify and clarify what information they want, the concept of the desired information, and their purposes.

2. They have to relate the identified concept to the title of the PLMs, because the "Title" is the single access point for guiding users to the content of the PLMs. However, users who are not familiar with the titles of the PLMs will not know which title they should select. Moreover, titles in ancient languages can cause some problems for users. Furthermore, if the title does not indicate the content, users might have no clue to guide them to the content they need. Although in some cases the titles will tell us something about the content, users will still want to obtain a summary of the content in order to confirm that the title they have retrieved contains all the information they need. Thus, it can be concluded that when searching for PLMs most users use the title in order to decide what document they require.

3. When searching for the PLMs, users want to know if the PLMs they want are still available in the collection where they are stored. As the PLMs are ancient documents, it is possible they may have disappeared, even if the titles are still shown in the registration book. Ideally, the participants would want to use retrieval tools like a union catalog, where they could carry out their search in one place in order to find out where the PLMs are stored. However, while this research was being conducted, there was no union catalog for PLMs. Other tools, similar to the registration book, which can be used are the PLM lists which are a by-product of the PLM surveys and collections in each region in the 1980s; these are available in academic libraries and PLM preservation projects. On the other hand, some users did not use any retrieval tools, but asked people (e.g. friends who had experience with PLMs, or monks who look after the PLMs at the temples where they are resident). Furthermore, users need to know if the PLMs are complete (i.e. no missing fascicles, no missing pages, not too damaged to read, readable script, etc.) so that they can decide which PLMs they should select and where they should obtain them.

4. In order to select the PLMs, users need to know the physical condition of the PLMs and their storage location. Usually users will select those PLMs which are complete, or which have a sufficient number of fascicles and pages for the length of the story, or according to the age of the PLM, its script or its language.

The process of searching for PLMs is similar to that of the social scientists studied by Ellis [18]. His pattern includes six generic features: Starting (Identify), Chaining (Relate to the concept), Browsing (Search), Differentiating (Search), Monitoring, and Extracting (Select). Similarly, the study of Meho and Tibbo [17], which is based on Ellis's pattern, adds four new features: Accessing, Verifying, Networking, and Managing Information, which have important roles in improving information retrieval and facilitating research.
However, monitoring of the task did not seem to occur in the case of searching for PLMs; there were few participants who said that they sometimes monitored new entries of PLMs related to their research in the preservation projects of the PLMs, for example, by asking staff to let them know if the project obtained a new PLM or issued a new publication of translations of the PLMs. In the seeking process, sometimes, when searching for the translation versions, users also need to verify the accuracy of the content.
4.2 Important Information in Searching for PLMs

The results of many studies have shown that only a few of the bibliographic elements given in catalog records are frequently used by users, while other descriptive elements are rarely used [16], [17], [18]. In general, the 10 most frequently used bibliographic items, in descending order, were: title, author, date of publication, subject headings, call number, content note, edition information, publisher, place, and summary [18] (Table 2). When compared with this study, whose focus was on PLMs and ancient documents, the results were quite similar. The first three items, namely the title, where the PLM was found, and the date of inscription, together with the subject matter or concept of the content, were frequently used to access the PLMs. One item which is seldom used is the storage location, which the user uses to locate PLMs (i.e. the call number for library materials). Next, a summary is used, which confirms the content of the PLMs. The final item is the physical condition, which is unique to PLMs; such a feature is not included in the top ten general bibliographic items. These results indicate that what users really want to see in bibliographic records is information which helps them make a decision about the content of the document rather than its physical characteristics. This conclusion is confirmed by the users' suggestions when they were asked "What items would you like to have but cannot find in the bibliographic records?" and "What are the document characteristics that might determine the relevance or usefulness of a document?" Their answers confirm the results of previous studies: users required more information related to the contents of the document [1], [16], [17], [18]. Of less importance were the content summary or annotation of each fascicle, the literary styles, subject, uniform title, original title of the PLM, and where the PLM was found, which could affect the form and content of the PLM. When compared with other research studies, the items on the catalog records identified by users of the PLM preservation projects were among the 10 most frequently used in cataloging records, but they only used the title as a guide to the contents and another two items which could give more information about the content, namely Note (regarding where the PLM was found, which refers to the geographic location) and Date of inscription, which refers to the period of time in which the PLM was produced. The other seven elements were the physical characteristics of the documents and the administrative information, which were not often consulted by the users. Moreover, even though information about four physical characteristics was available in the retrieval tools, the users frequently used only two of these items. Furthermore, some important physical features were not available: whether the PLMs were complete or incomplete, depending on the number of missing fascicles, the number of missing pages, and how badly damaged the document was, which affects its readability. When the items available to the PLM preservation projects were compared with the Dublin Core metadata element set, it was found that the items available looked similar, but the order of the items was slightly different (Table 2). The important information includes content and subject matter, physical characteristics, and location identifier (Table 3). Nevertheless, the projects still gave attention to the physical description.
Table 2. Comparison between the most frequently consulted bibliographic items and those found in this study

The Most Frequently Used Bibliographic Elements [18] | Dublin Core Elements Used by Data Providers [19] | This Study (the PLMs): Elements Users Identified and Used
1. Title (24) | 1. Title (98.8%) | 1. Title (75%)
2. Author (21) | 2. Creator (95.1%) | 2. Where the PLM was found (75%)
3. Date of publication (16) | 3. Date (92.7%) | 3. Date of inscription (70%)
4. Subject headings (11) | 4. Identifier (91.5%) | 4. Storage place (65%)
5. Call number (10) | 5. Type (87.7%) | 5. Script (55%)
6. Content note (7) | 6. Subject (82.9%) | 6. Language (55%)
7. Edition information (7) | 7. Description (72.0%) | 7. Physical condition (45%)
8. Publisher (5) | 8. Language (52.4%) | 8. Content Summary* (40%)
9. Place (5) | 9. Publisher (50.0%) | 9. Page Number (35%)
10. Summary (3) | 10. Format (47.6%) | 10. Where the PLM was inscribed (30%)
* Not available in the present retrieval tools
However, after an analysis of all these items, it was found that "Title" ranked highest in frequency of use. Next came content factors, especially the time period of the document. Identification of the location was the third most significant item of information in document evaluation. Finally, the physical characteristics of the document were used as a basis for decisions.

Table 3. Significant differences between the most frequently consulted bibliographic elements and those found in this study

Main Items | The Most Frequently Consulted Bibliographic Elements [18] | This Study (the PLMs): Elements Users Identified and Used
Subject matter | 1. Title; 4. Subject headings; 6. Content note; 10. Summary | 1. Title; 8. Content Summary*, Subject*, Uniform title*
Intellectual source | 2. Author; 8. Publisher | –
Content matter | 3. Date of publication; 7. Edition information | 3. Date of inscription; 2. Where the PLM was found; 10. Where the PLM was inscribed, Literary styles*
Location identifier | 5. Call number; 9. Place | 4. Storage place
Physical characteristics | 7. Edition information | 5. Script; 6. Language; 7. Physical condition; 9. Page Numbers
* Not available in the present retrieval tools
In this study, the significant information identified and used by the subjects (individual users and PLM preservation projects) was categorized into five groups. These are: Subject matter, Content matter, Intellectual source, Location, and Physical characteristics.

1. Subject Matter is the group of features that relate to the content topic or subject of the PLMs, and they include: Title, Summary of Content, Subject, and Uniform title.
2. Content Matter is the group of factors pertaining to the scope and coverage of the PLM and includes: Date of inscription, Where the PLM was found, Where the PLM was inscribed, Note, and Literary styles.
3. Intellectual Source is the group of features pertaining to the creation of the document, the original owner, and the copyright owners. It includes the Source of the PLM and the Date of reproduction.
4. Location Identifier is the group of features pertaining to locating the PLM, namely where the PLM is available and its Storage place.
5. Physical Characteristics is the group of features pertaining to the physical format of the PLMs, including Script, Language, Physical condition, and Number of pages.
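To make these groups concrete, the sketch below shows one possible way a single PLM descriptive record could be organised around them. It is only an illustration: the field names follow the five groups above, but the dictionary layout and all sample values are assumptions for demonstration, not the schema or data of this study.

```python
# Illustrative only: one possible organisation of a PLM descriptive record
# around the five groups identified in this study. All values are invented.
plm_record = {
    "subject_matter": {
        "title": "Example Jataka tale, fascicle 1",
        "content_summary": "Didactic tale used in sermons.",
        "subject": "Buddhist literature",
        "uniform_title": "Example Jataka",
    },
    "content_matter": {
        "date_of_inscription": "C.S. 1182",        # period in which the PLM was produced
        "where_found": "Village temple library (example)",
        "where_inscribed": "Example province",
        "literary_style": "Sermon text",
    },
    "intellectual_source": {
        "source_of_plm": "Temple manuscript collection (example)",
        "date_of_reproduction": None,
    },
    "location_identifier": {
        "storage_place": "Preservation project store, shelf 12, bundle 4",
    },
    "physical_characteristics": {
        "script": "Tham",
        "language": "Pali / local language",
        "physical_condition": "Complete; minor edge damage",
        "number_of_pages": 24,
    },
}

# A public catalogue view, as suggested in the conclusion below, could simply
# omit the physical and administrative parts of the full record.
public_view = {k: v for k, v in plm_record.items()
               if k in ("subject_matter", "content_matter", "location_identifier")}
```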
5 Conclusion

Good metadata should support both the needs of users and the needs of those who manage the collections; moreover, document representation has to satisfy the principle of least effort to learn and use, otherwise users may not bother to use those features [18]. Based on the results of this study, it is concluded that bibliographic records or metadata for the PLMs should be provided in two sets: descriptive metadata serving as a public catalogue, with detailed content information but little information about the physical characteristics, which would meet the needs of users; and a second set with complete bibliographic descriptions and administrative information, which would meet the needs of PLM specialists such as project staff responsible for the preservation of, and control of access to, the PLMs. The results of this study therefore provide important information for the design of an appropriate metadata schema for palm leaf manuscripts.
References
1. Bruce, T.R., Hillmann, D.I.: The continuum of metadata quality: defining, expressing, exploiting. In: Hillmann, D.I., Westbrooks, E.L. (eds.) Metadata in Practice, pp. 238–256. American Library Association, Chicago (2004)
2. Haynes, D.: Metadata for Information Management and Retrieval. Facet, London (2004)
3. Wendler, R.: The eye of the beholder: challenges of image description and access at Harvard. In: Hillmann, D.I., Westbrooks, E.L. (eds.) Metadata in Practice, pp. 51–69. American Library Association, Chicago (2004)
4. Ucak, N.O., Kurbanoglu, S.S.: Information need and information seeking behavior of scholars at a Turkish university. In: 64th IFLA General Conference, August 16–21 (1998), http://www.ifla.org.sg/IV/ifla64/041-112e.htm (retrieved May 19, 2008)
5. Geser, G.: Resource discovery – position paper: putting the users first. Resource Discovery Technologies for the Heritage Sector, DigiCULT Thematic Issue 6, 7–12 (2004)
6. Barrett, A.: The information-seeking habits of graduate student researchers in the humanities. Journal of Academic Librarianship 31(4), 324–331 (2005)
7. Buchanan, G., et al.: Information seeking by humanities scholars. In: Rauber, A., Christodoulakis, S., Tjoa, A.M. (eds.) ECDL 2005. LNCS, vol. 3652, pp. 218–229. Springer, Heidelberg (2005)
8. Chen, H.L.: A socio-technical perspective of museum practitioners' image-using behaviors. Electronic Library 25(1), 18–35 (2006)
9. Rimmer, et al.: Humanities scholars' information-seeking behaviour and use of digital resources. In: Workshop on Digital Libraries in the Context of Users' Broader Activities, part of JCDL 2006, June 15, Chapel Hill, USA (2006)
10. Wu, M., Chen, S.: Humanities graduate students' use behavior on full-text databases for ancient Chinese books. In: Goh, D.H.-L., Cao, T.H., Sølvberg, I.T., Rasmussen, E. (eds.) ICADL 2007. LNCS, vol. 4822, pp. 141–149. Springer, Heidelberg (2007)
11. Amin, A., Ossenbruggen, J., Nispen, A.: Understanding cultural heritage experts' information seeking needs. In: Proc. Joint Conference on Digital Libraries (JCDL 2008), pp. 39–47. ACM, New York (2008)
12. Skov, M., Ingwersen, P.: Exploring information seeking behaviour in a digital museum context. In: Proc. Second International Symposium on Information Interaction in Context. ACM International Conference Proceeding Series, vol. 348, pp. 110–115 (2008)
13. Wilson, T.D.: Recent trends in user studies: action research and qualitative methods. Information Research 5(3) (2000), http://www.informationr.net/ir/5-3/paper76.html#four (retrieved June 1, 2009)
14. Meho, L.I., Tibbo, H.R.: Modeling the information-seeking behavior and use of social scientists: Ellis's study revisited. Journal of the American Society for Information Science and Technology 54(6), 570–587 (2003)
15. Ellis, D.: A behavioural approach to information retrieval design. Journal of Documentation 46(3), 318–338 (1989)
16. Luk, A.T.: Evaluating bibliographic displays from the user's point of view: a focus group study. Master's thesis, Faculty of Information Studies, University of Toronto, Toronto (1996)
17. Lundgren, J., Simpson, B.: Looking through users' eyes: what do graduate students need to know about Internet resources via the library catalog? Journal of Internet Cataloging 1(14), 31–44 (1999)
18. Lan, W.-C.: From document clues to descriptive metadata: document characteristics used by graduate students in judging the usefulness of web documents. Ph.D. dissertation, School of Information and Library Science, University of North Carolina at Chapel Hill (2002)
19. Ward, J.A.: Quantitative analysis of unqualified Dublin Core Metadata Element Set usage within data providers registered with the Open Archives Initiative. In: Proceedings of the Third ACM/IEEE Joint Conference on Digital Libraries, pp. 315–317 (2003)
Landscaping Taiwan's Cultural Heritages – The Implementation of the TELDAP Collection-Level Description

Hsueh-Hua Chen (1), Chiung-Min Tsai (2), and Ya-Chen Ho (1)

(1) Department of Library and Information Science, National Taiwan University, Taipei, Taiwan
(2) Research Centre for Digital Humanities, National Taiwan University, Taipei, Taiwan
{sherry,tsaibu,r94126014}@ntu.edu.tw
Abstract. This paper depicts the implementation process of the collection-level description of TELDAP. Our study looks into collection-level description in order to eliminate the problems, caused by having only large amounts of item-level metadata, that users might encounter when accessing and retrieving resources. The implementation process is divided into five stages. In order to facilitate the application of collection-level description, we have put forward a revised schema based on currently available description standards. In the future, we intend to strengthen the relationships between item-level and collection-level metadata and provide versions in different languages, expanding the accessibility of these valuable resources to more users. Keywords: collection-level description, metadata, digital archives.
This study focuses on the digitized resources produced by TELDAP, establishing proper collection-level descriptions according to their characteristics and condition. The results are then presented in a prototype using faceted classification. It is expected that the addition of collection-level description will not only provide new facets of knowledge organisation for the TELDAP Union Catalogue, but also benefit users and information managers with means to explore and retrieve information more effectively, while further broadening the international user base of TELDAP.
2 Collection and Collection Level Description

A 'collection' is an aggregate of various objects or information resources that exist in actual or digital form and range in size; 'any aggregation of individual items' can be designated as a 'collection' (Johnston & Robinson, 2002). It might be restricted to a few items of related data, or contain the whole repository of an institute. It could be a collection of sub-collections or a super-collection of other collections. A collection is organised around a group of resources with a common factor (Chapman, 2006), which may concern the same topic, exist in the same format, have the same historical origin or originate from the same institution. Though related collections share some kind of association, they may in fact be stored in multiple locations or repositories.
The collection-level description, in short, is the complete metadata for a collection. The process of creating a collection-level description seeks a structured, accessible, standardized and machine-readable form for exchanging information. The purposes of collection-level description are collection management and resource discovery (CD Focus, 2001). It helps users narrow the scope of a search query and locate necessary or useful information related to their search enquiries. Content holders and information managers use it to assure the quality of information and to promote the usability and efficiency of information retrieval. Collection-level description brings the reality and usability of the resources to light, giving an overview to users wishing to explore and locate information among collections.
It has been the practice for libraries, museums and other cultural institutes to use an "item" as the basic unit of information description, such as a certain book or painting. The subsidiary information in item-level descriptions, although abundant and detailed, can baffle a user unfamiliar with the collected resources. Collection-level description enables users to see more fully the range of available information, and therefore to gain a clearer understanding and knowledge of the collections (Heaney, 2000). The information no longer appears unrelated and scattered, but is presented in a more useful manner; when users are presented with a range of relevant information to compare and can discover other interesting links, they are encouraged to explore deeper and further.
3 Implementation

The provision of collection-level description helps create an information landscape that enables users to navigate and retrieve information within huge networked resources, and enhances interoperability between databases. Collection-level description empowers information managers with information organization and quality control; by understanding the entire range of the collections, it follows that they can
better handle the tasks of management and development (CD Focus, 2001; Chapman, 2006; Powell, Heaney & Dempsey, 2000). There are five steps to creating the collection-level description, the essential information landscape, for TELDAP.

3.1 Identifying Collections in TELDAP
The first step is to identify the collections in TELDAP. By exploring the related websites from 2002 onwards and understanding the nature of the collections, related information, participating institutes and accessibility, the TELDAP collections can be identified. However, it is possible to take different approaches to classifying the collections. Though researchers usually take only one approach, small collections could be aggregated according to theme, institute, or another applicable common property to create a larger collection, and collections may also be classified using other approaches depending on the purpose of the reorganisation. Identifying a collection is complicated work, as it can proceed from a variety of approaches. Cultural artefacts, for example, can be variously categorised by period of history (such as Tang, Song, Yuan, Ming, Qing), genre (copperware, porcelain, jade, lacquer, enamelware), or purpose (writing tools, ceremonial items, cooking utensils). This study adopts the principle of 'functional granularity' (Macgregor, 2003): if a collection is 'functional' and can accomplish various purposes, then it is worth describing. For better accessibility, collections are derived from databases already classified and constructed by archiving institutions, enabling each collection to have corresponding databases so that the collection-level description functions usefully. This is conducted by preliminarily dividing digital resources from eight institutions into 145 collections, then confirming the suitability of this rudimentary classification with the original managers of the resources in order to make further adjustments.

3.2 Developing a TELDAP CLD Schema
A great deal of research has recently been conducted on schemas and frameworks for collection-level description. Considering the multi-disciplinary nature of TELDAP and the number of institutes involved, it is necessary to adopt a schema appropriate for multi-domain operation. In order to develop a collection-level description schema, we explored three of the most widely applied schemas: the UKOLN Simple Collection Description, the RSLP Collection Description Schema and the Dublin Core Collections Application Profile.

3.2.1 UKOLN Simple Collection Description (UKOLN SCD)
The UKOLN SCD is a simple collection description metadata schema that helps users discover, confirm, and access information collections of interest, and retrieve information across databases. UKOLN SCD consists of 23 elements, of which 12 originated from the Dublin Core; a further 11 elements were devised to meet the needs of the collection and description processes. The 23 elements fall into two major categories, one describing the collections themselves and the other describing services for collections. All elements are optional and repeatable (Powell, 1999).
3.2.2 Research Support Libraries Programme Collection Description Project
With the support of UKOLN and OCLC (the Online Computer Library Center), in 2000 Michael Heaney conceived the RSLP Analytical Model of Collections and Their Catalogues and conceptualised a collection description schema. Using this model as its basis, the RSLP Collection Description Metadata Schema was developed to describe the RSLP collections (Heaney, 2000). Elements of the RSLP schema were partly selected from Dublin Core, DCQ (DC Qualifiers) and the vCard element set, supplemented by several other elements required in practice, and the schema was named 'collection level description (cld)'. Encoded with the Resource Description Framework (RDF), the RSLP collection-level description schema is divided into three sets of elements: Collection, Location and Agent.

3.2.3 Dublin Core Collections Application Profile (DCCAP)
In October 2000 the DCMI Collection Working Group was formed, using the RSLP collection description model as its basis, and the first version of the Dublin Core Collection Description Application Profile was released in 2003. After many revisions and name changes, the newest version was released in March 2007 and named the Dublin Core Collections Application Profile (abbreviated to DCCAP). This schema supports discovery, identification, selection, identification of location and identification of services, and is suited for use in various fields. Based on Heaney's model and then expanded, the entities of DCCAP include Collection, Item, Location, Agent, Service, Catalogue and Index. There are 30 elements in the DCCAP schema to describe 'collections', 26 of which describe a Catalogue or Index. The properties of entities and relationships (e.g. type) all use DC description modifiers (Dublin Core Collection Description Task Group, 2007).

Table 1. Dublin Core Collection Description Schema (extract): 1. Type; 4. Alternative Title; 7. Language*; 10. Rights; 13. Accrual Periodicity; 16. Audience; 19. Temporal Coverage*; 24. Is Located At; 27. Super-Collection; 30. Associated Publication*

* The definition of the element has been modified.
The three schemas above were evaluated with consideration of previous practice (UKOLN, 2009; vads, 2009; Zeng, 1999). Seven aspects, namely fitness for purpose, reputation, existing experience, compatibility, easiness, available tools, and sustainability, were taken into account for the comparison. Though RSLP
provided the online tools, DCCAP was chosen as the basic model for implementation (see Table 2). The major reason is that DCCAP uses many elements that originate from the 15 elements of the Dublin Core, which is also used as the scheme for TELDAP item-level metadata; the future development of integrated item-level and collection-level metadata therefore becomes more feasible. In addition, the descriptions of 16 of the 30 DCCAP elements were revised in order to make the schema even more suitable for describing TELDAP collections (see Table 1).

Table 2. Comparison of Collection-Level Description Schema

Fitness for purpose: All three schemas are designed for collection management and resource discovery.
Reputation: RSLP and DCCAP are both notable CLD schemas; the UKOLN SCD is at an early stage of development.
Existing experience: The RSLP standard is adopted by many large-scale projects, while DCCAP has also gradually become more widely used in recent years.
Compatibility: DCCAP uses many elements that originated from the 15 elements of the Dublin Core.
Easiness: RSLP includes three sets of elements; UKOLN SCD and DCCAP are composed of equally simple elements.
Tools: Only RSLP provides online tools.
Sustainability: Presently only DCCAP is maintained by experts and continues to be updated by a small working group.
Source: Authors
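For illustration, a collection-level record built from DCCAP-style, Dublin-Core-derived elements might look like the sketch below, followed by a simple XML serialisation for exchange. The element names are drawn from Table 1 and the Dublin Core element set, but the example collection, its values and the serialisation layout are hypothetical; they are not TELDAP's actual metadata or the editing platform's output format.

```python
# Hypothetical collection-level description record using DCCAP-style,
# Dublin-Core-derived element names. All values are invented for illustration.
import xml.etree.ElementTree as ET

collection_record = {
    "Title": "Example Historical Calligraphy Collection",
    "Alternative Title": "Calligraphy of the Example Museum",
    "Type": "Collection",
    "Description": "Digitised images of calligraphy works across several centuries.",
    "Subject": ["calligraphy", "Chinese art"],
    "Language": "zh",
    "Temporal Coverage": "0200/1999",
    "Audience": "researchers; general public",
    "Rights": "See holding institution",
    "Accrual Periodicity": "irregular",
    "Is Located At": "Example Museum, Taipei",
    "Super-Collection": "TELDAP Union Catalogue",
    "Associated Publication": None,
}

# Produce a minimal XML serialisation with the standard library only.
root = ET.Element("collection")
for name, value in collection_record.items():
    if value is None:
        continue                                  # skip empty elements
    values = value if isinstance(value, list) else [value]
    for v in values:
        # Strip spaces and hyphens to form a simple element tag, e.g. SuperCollection.
        ET.SubElement(root, name.replace(" ", "").replace("-", "")).text = str(v)

print(ET.tostring(root, encoding="unicode"))
```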
3.3 Creating Collection-Level Description for TELDAP
To create the collection-level descriptions, the authors worked closely with the collection managers of each TELDAP participating institute. A full set of collection-level description elements was completed according to the team's basic understanding and sent to the collection managers of each participating institute before visiting them. With better domain knowledge and hands-on experience of their own collections, the collection managers were expected to give feedback on this first draft of the TELDAP collection-level descriptions. At the beginning of the process, the purpose and objectives of collection-level description were emphasised. Collection managers were asked to inspect whether the collections were correctly organised and to revise each element. The revised results returned by the collection managers were analysed for further modification of the TELDAP collection-level description schema. In accordance with the structure and contents of the TELDAP collections, an online platform was constructed as the editing tool for creating the corresponding metadata. The platform provides XML encoding for data exchange, and data can be sent back to collection managers for final confirmation in order to ensure the validity of their contents.
Fig. 1. Collection-level metadata creation procedures (the workflow covers defining the scope of collections and the depth of their descriptions; creating suitable collection-level descriptions and metadata; preliminary metadata fill-outs; organising and conducting interviews to affirm the suitability of the defined collections; revising the metadata contents and description standards on the basis of interview analysis; emending metadata on the editing platform and conducting authority control; institutional review with correction of errors; presentation of the results on the demonstration system; and evaluation of the outcome)
3.4 TELDAP Collection-Level Description Portal
A portal utilising faceted classification to provide the TELDAP collection-level description service is the major outcome of this study. The concept of faceted classification is widely adopted in library information and commerce applications (Adkisson, 2005). The TELDAP collection-level description portal resulting from this research uses faceted classification technology, which provides a powerful tool for facet browsing and better cataloguing, guiding users to find useful information from different perspectives while avoiding being overwhelmed by information. The architecture of the TELDAP collection-level description portal is designed according to knowledge organisation and disciplinary ontology. The completed collection-level descriptions were imported into the system. Using faceted classification by theme, time span, geographical coverage, institute, collection title and genre, different facets of the collections are presented. The TELDAP collection-level description portal thus renders collection data easy to browse and discover, in much closer alignment with the needs of users, while simultaneously increasing the exposure and accessibility of TELDAP collections.
Figure 2 shows the user interface of the TELDAP collection-level description portal, which highlights browsing as the main way into the collections. Titles of collections are listed for browsing under different facets, including themes, time spans and institutes; facets for collection genre and geographical coverage will be added in the near future. When a facet is selected, the next level of classification appears with a list of collections under that classification. The descriptive elements of each collection are presented in tabulated format. When users find a collection that interests them, they can choose to view the complete metadata or be redirected to the collection's databases or websites. It is believed that the method of faceted classification will help users discover more information relevant to their needs.
Fig. 2. User interface of the TELDAP Collection-Level Description Portal (http://culture.teldap.tw/culture/)
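As a rough sketch of the faceted browsing idea (not the portal's actual implementation), the snippet below builds facet indexes over a handful of invented collection records and narrows the collection list as facet values are selected; the field names and sample data are assumptions.

```python
# A rough sketch of faceted browsing over collection-level records.
# Field names and sample records are invented for illustration only.
from collections import defaultdict

collections = [
    {"title": "Type Insect Specimens", "theme": "Biology", "institute": "NTU", "time_span": "20th century"},
    {"title": "Insects Photo Gallery", "theme": "Biology", "institute": "NTU", "time_span": "21st century"},
    {"title": "Historical Calligraphy", "theme": "Art", "institute": "NPM", "time_span": "3rd-20th century"},
]

def build_facets(records, facet_fields):
    """Map each facet field to its values and the indexes of records holding them."""
    facets = {f: defaultdict(set) for f in facet_fields}
    for i, rec in enumerate(records):
        for f in facet_fields:
            facets[f][rec[f]].add(i)
    return facets

def browse(records, facets, selections):
    """Return records matching every selected facet value, e.g. {'theme': 'Biology'}."""
    matching = set(range(len(records)))
    for field, value in selections.items():
        matching &= facets[field].get(value, set())
    return [records[i] for i in sorted(matching)]

facets = build_facets(collections, ["theme", "institute", "time_span"])
for rec in browse(collections, facets, {"theme": "Biology", "institute": "NTU"}):
    print(rec["title"])
```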
3.5 Evaluation of Outcome
In order to understand how much benefit users gain from the collection-level descriptions and their system interface, as well as any possibilities for further improvement, an evaluation scheme will be created to facilitate appropriate revision.
4 Modifications

Metadata schemas reach maturity and better performance after a long period of development. Due to the heterogeneity and rapid growth of the TELDAP collections, applying DCCAP would encounter many problems. Modifications, covering both schema revisions and CLD operations, are suggested in this study. The principles for both kinds of modification are as follows.
4.1 Revising Principles
a. User oriented: considering that the main goal of the collection-level description is to promote the use of the collection and related resources, the description must offer understandable, comparable, accessible and usable information. The first revising principle is therefore user friendliness, achieved by minimising the use of technical terminology and employing a clear and simple form of expression to present the collection-level description elements.
b. In accordance with the nature of the collection or the goal of the project: though it may be more cost-effective to use existing principles of description, it must be acknowledged that no single metadata scheme can be applied to every item. Existing description principles should be amended according to the nature of the collection or the goal of the project. It is therefore preferable first to evaluate the practicality and importance of each element, retaining core elements for retrieval, before eliminating the unnecessary or adding the required. Finally, the amended principles are mapped to internationally used principles to facilitate any future data exchange.

4.2 Operating Principles
a. To describe the common properties of collections: the DCCAP schema can describe a whole body of information in both physical and digital forms. As to whether the physical or the digital should be given primary place, the essential information of the collection should be considered, as well as how the description can carry the most significance. Many TELDAP participating institutes are museums, which use 'objects' as the core of their collections. All activity centres on exhibiting, and descriptions of these collections, when transferred into the digital collection, therefore focus on the real objects themselves. However, there are exceptions where the digital form of the object is more useful. Examples are the 'Insects Photo Gallery (National Taiwan University)' and 'Type Insect Specimens (National Taiwan University)' collections, both of which exist as digital images. The 'Insects Photo Gallery' contains field photography of insects in their natural environment; these records are born digital. The 'Type Insect Specimens', by contrast, are digital images of actual specimens made for preservation and research purposes. Therefore the 'Insects Photo Gallery' collection is described as digital, while 'Type Insect Specimens' is noted as a physical collection.
b. To include unique and important characteristics of valuable items within collections: each item is precious and invaluable, with its own story. If presented in a collection with too large a scope, each individual item's history, unique story and qualities may be lost. To avoid this, a higher-level collection-level description can also be detailed, with an item's salient points emphasised. For example, Su Shih's calligraphy work 'the Cold Food Observance' is regarded as a representative work of the Song Dynasty. However, 'the Cold Food Observance' is placed under the collection of 'Historical calligraphy' in the National Palace Museum, which is described as 'a rich collection including seal, scribe, regular, running
and free forms of calligraphy, spanning from the 3rd to 20th centuries.' This item would therefore be lost in the generalised collection description. These principles are all designed to ensure that collection-level descriptions answer to the goals of discovering, identifying, selecting, locating and serving resources.
5 Conclusions and Future Work

The paper outlines the implementation of the TELDAP collection-level description, as well as modifications and suggestions for the revision and operation of the collection-level description schema. It is hoped that the addition of collection-level description solves the issues of handling the massive quantity of item-level metadata contained in TELDAP, which is difficult for users to use effectively. The five-stage implementation process of the TELDAP collection-level description is depicted: the first stage is to identify the scope of the TELDAP collections, dividing the digitized results into 150 collections; next, the DCCAP CLD schema was applied, with 16 of its 30 elements revised, and interviews were held to understand the collections clearly and enhance the creation of the collection-level descriptions. The final outcome is presented in a prototype which uses faceted classification to re-organise the TELDAP collections and improve the browsing function. A follow-up evaluation system will be created in order to understand how much benefit CLD provides to its users and to consider any further possibilities for improvement.
In order to smooth the implementation of the TELDAP collection-level description, different strategies are used. When amending collection-level descriptions, expression should proceed from the perspective of users, minimising the use of technical terms and keeping the text straightforward. Existing description principles should be revised according to the nature of the collection or the goal of the project. When describing a collection, its common properties are the main focus of the description, and special attention is paid to detailing the unique and important qualities of items in the collection. Due to the ever-changing nature of collections, constant effort is needed to create and maintain collection-level descriptions.
In the future, the new facets of geographical coverage and collection genre will be added to the portal website to broaden the collection scope of TELDAP and lead users to discover new information resources from many new perspectives. Relationships between the item-level and collection-level metadata will be constructed for better information retrieval and access, in order to integrate resources and improve efficiency. Finally, besides the present English and Chinese versions, TELDAP collection-level descriptions in other languages such as Japanese and Spanish will be offered in the future to broaden the base of international users.
Acknowledgements This paper is supported by the National Science Council’s Taiwan e-Learning and Digital Archives Program (Project No. NSC-98-2631-H-002-015). We are also grateful to Yi-cheng Weng and Jessamine Cheng for their help in editing and translations.
References
1. Adkisson, H.P.: Use of faceted classification (2005), http://www.webdesignpractices.com/navigation/facets.html (Retrieved December 25, 2009)
2. CD Focus: Collection Description Focus (2001), http://www.ukoln.ac.uk/cdfocus/ (Retrieved December 25, 2009)
3. Chapman, A.: Tapping into collections. MLA West Midlands ICT Network Day: Collection descriptions and cultural portals, University of Wolverhampton, Wolverhampton (2006), http://www.ukoln.ac.uk/cd-focus/presentations/mla-WM2006-july/mla-WM-2006-07-25.ppt (Retrieved December 25, 2009)
4. Dublin Core Collection Description Task Group: Dublin Core Collections Application Profile (2007), from DCMI Web site, http://dublincore.org/groups/collections/collection-application-profile/2007-03-09 (Retrieved December 10, 2009)
5. Heaney, M.: An Analytical Model of Collections and their Catalogues, 3rd issue (2000), from UKOLN Web site, http://www.ukoln.ac.uk/metadata/rslp/model/amcc-v31.pdf (Retrieved December 10, 2009)
6. Johnston, P., Robinson, B.: Collections and Collection Description (2002), http://www.ukoln.ac.uk/cd-focus/briefings/bp1/bp1.pdf (Retrieved December 10, 2009)
7. Macgregor, G.: Collection-level description: metadata of the future? Library Review 52(6), 247–250 (2003)
8. Powell, A.: Simple Collection Description (1999), from UKOLN Web site, http://www.ukoln.ac.uk/metadata/cld/simple/ (Retrieved December 10, 2009)
9. Powell, A., Heaney, M., Dempsey, L.: RSLP Collection Description. D-Lib Magazine 6(9) (2000), http://www.dlib.org/dlib/september00/powell/09powell.html (Retrieved December 15, 2009)
10. UKOLN: Good Practice Guide for Developers of Cultural Heritage Web Services (2009), http://www.ukoln.ac.uk/interop-focus/gpg/print-all/ (Retrieved December 10, 2009)
11. vads: Creating Digital Resources for the Visual Arts: Standards and Good Practice (2009), http://vads.ahds.ac.uk/guides/creating_guide/sect42.html (Retrieved December 10, 2009)
12. Zeng, M.L.: Metadata elements for object description and representation: a case report from a digitized historical fashion collection project. Journal of the American Society for Information Science 50(13), 1193–1208 (1999)
GLAM Metadata Interoperability

Shirley Lim and Chern Li Liew

Victoria University of Wellington, School of Information Management, Wellington, New Zealand
[email protected]
Abstract. Both digitised and born-digital images are a valuable part of cultural heritage collections in galleries, libraries, archives and museums (GLAM). Efforts have been put into aggregating these distributed resources. High-quality and consistent metadata practices across these institutions are necessary to ensure interoperability and the optimum retrieval of digital images. This paper reports on a study involving interviews with staff members from ten institutions in the GLAM sector in New Zealand who are responsible for creating metadata for digital images. The objective is to understand how GLAM institutions have gone about creating metadata for their image collections to facilitate access and interoperability (if any), the rationale for their practice, and the factors affecting current practice. Keywords: Galleries, libraries, archives, museums, metadata for digital images, metadata interoperability.
2 Research Design

The NZ Register of Digitisation Initiatives was consulted to identify institutions within the GLAM sector that have existing publicly accessible online collections. Only institutions with a minimum of fifty digital images and corresponding metadata records available for online viewing were selected. Sixteen institutions met these criteria and formed the target sample. In the end, however, only staff from ten institutions (two art galleries, three libraries, two archives and three museums) took part in the interviews. The questions to which the findings reported in this paper correspond are:
1. Does your organisation have an existing cataloguing guideline? Are there specific guidelines on cataloguing digital images?
2. How many staff members are responsible for cataloguing digital images in your organisation? If there is more than one staff member, how do you ensure consistency in cataloguing?
3. Does the integrated library system / collection management software used in your institution have an impact on metadata creation, e.g. the number and choice of metadata elements?
4. Does the organisation have any strategies to make metadata interoperable?
3 The Interview Findings

3.1 Galleries
Two galleries participated in the email interview process. Although both galleries have in-house cataloguing guidelines, these are applied to collection items and not specifically to digital images. One of the galleries, with a large collection, has four individuals responsible for cataloguing: a registrar and part-time assistants. Staff members learnt from previous job experience, and the main training is provided by the collection management system (CMS) vendor. There are no specific procedures in place in either of the galleries to ensure cataloguing consistency. One gallery has separate sets of metadata created in the Vernon Systems as well as a spreadsheet for Dublin Core (DC) metadata. The Vernon Systems is used by both galleries, and the system automatically generates metadata when creating audio-visual files. The gallery with the larger collection shares its metadata through eHive, an online collection management tool developed by Vernon Systems; in addition, it makes its metadata records interoperable by creating records in XML format, which is done in the Vernon Systems. Having metadata available in DC is considered important by the larger gallery. The other gallery had no intention of sharing metadata at the time of the interview.

3.2 Libraries
Two of the three libraries involved in the interviews use the Anglo-American Cataloguing Rules (AACR2) for cataloguing digital visual images. The other library follows an in-house manual which draws on the General International Standard
Archival Description (ISAD-G) for cataloguing its image collections. One library has only one cataloguer, while another, larger library has a team of staff with different responsibilities. To ensure consistency, the library with the larger staff trains all staff members and provides them with a manual; in addition, occasional consistency checks are performed by the team leader. Two libraries create MARC records. The other library has an in-house-created CMS that provides the structure for the creation of descriptive records. Two libraries share their metadata. Using OAI-PMH and the Z39.50 protocol, one library contributes its metadata to Matapihi, the National Library and Digital New Zealand. One library uses Primo services to harvest metadata from DigiTool. According to the interviewee from the library that uses ISAD-G, there is no intention to share metadata at the moment.

3.3 Archives
One of the two archives that participated in the interviews is a council archive, while the other is an archive with a small collection of art works. The council archive follows data standards developed in-house and based around the data fields available within the Vernon Systems, while the other archive has a less detailed procedural guideline. In the council archive, technical assistants as well as archivists are responsible for the cataloguing of digital visual images, while the smaller archive has two staff members performing the cataloguing task. To ensure consistency, the council archive also has Standard Operating Procedures and a monthly audit process to check the cataloguing work. The use of the Vernon Systems in the council archive affects metadata creation since some fields come with predefined lists. The other archive custom-designed its CMS. Digital images are treated as surrogates of physical items. As a result, the council archive catalogues digital images following the guidelines for all other items, with the addition of some extra metadata pertaining to access restrictions and image reference numbers. Metadata sharing is not a current plan of the council archive, but according to the interviewee the institution will take it into consideration during the next review of the CMS. The private archive has no intention to do so.

3.4 Museums
The interview results from the three participating museums show that museums follow in-house guidelines for digital image cataloguing that are often based on existing practices and art cataloguing manuals. One museum uses a library card catalogue system focusing on subject matter, key features or the names of individuals in the images. Two other museums, when cataloguing their digital visual images, adhere to minimum standards such as NISO Z39.87 as well as standards established by their collection management systems. One of the museums is in the process of creating a cataloguing manual which will be based on SPECTRUM (an established museum process and documentation standard) as well as Cataloguing Cultural Objects (CCO); this museum also uses the VRA cataloguing standard. Only one museum has a team of staff cataloguing images. There are no specific procedures in place to ensure cataloguing consistency. One museum does not use a CMS, while one uses the Vernon Systems, which comes with a pre-existing set of element fields. Another museum uses a custom-designed CMS
which has an automatic harvesting feature for technical metadata. In terms of metadata sharing, one museum considers the use of the Vernon Systems, which is commonly used in the museum sector, as a means of achieving metadata interoperability. Another museum chooses to have its metadata mapped to DC and uses OAI-PMH for metadata sharing in projects such as Matapihi and DigitalNZ. One other museum shares its metadata through Museums Aotearoa.
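To illustrate the kind of Dublin Core sharing over OAI-PMH mentioned above, the sketch below harvests unqualified DC records using only the Python standard library. The endpoint URL is hypothetical and the snippet ignores resumption tokens and error handling, so it is a minimal sketch rather than any interviewed institution's actual setup.

```python
# Minimal sketch: harvesting unqualified Dublin Core records over OAI-PMH.
# The endpoint URL is hypothetical; resumption tokens and errors are ignored.
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

def harvest_dc(base_url, set_spec=None):
    """Yield (title, identifier) pairs from a single ListRecords response."""
    params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    if set_spec:
        params["set"] = set_spec
    url = base_url + "?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url) as resp:
        tree = ET.parse(resp)
    for record in tree.iter(OAI + "record"):
        titles = [t.text for t in record.iter(DC + "title")]
        idents = [i.text for i in record.iter(DC + "identifier")]
        yield (titles[0] if titles else None, idents[0] if idents else None)

# Example usage (hypothetical endpoint):
# for title, ident in harvest_dc("https://example.org/oai"):
#     print(title, ident)
```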
4 Discussion – Interoperability of GLAM Metadata The interview findings confirmed that there still exist different cataloguing practices among the various institutions in the GLAM sector. There is a general consensus among those interviewed in this study that metadata interoperability is important and should be considered by institutions. However, as reflected in current practice, especially for smaller organisations, lack of resources (including staff with necessary skills and knowledge) is one of the main reasons for lack of apparent effort to ensure metadata interoperability. The other key reason highlighted is the wide use of a proprietary CMS. Since the system comes with pre-existing metadata fields, there is a lack of flexibility in metadata creation and management. Also, with training by the CMS vendor being used as standard training for staff, this could lead to the institutions concerned adopting cataloguing practices to accommodate their CMS rather than managing their metadata records to meet the requirements for interoperability. Further research could address the institutional policies necessary for institutions to manage metadata records for optimum retrieval and interoperability.
References
1. Carnaby, P.: Creating a digital New Zealand. In: 29th Annual IATUL Conference, Auckland, New Zealand, April 21–24 (2008)
2. Foulonneau, M., Riley, J.: Metadata for Digital Resources. Chandos, Oxford (2008)
3. Elings, M.W.: Metadata for all: descriptive standards and metadata sharing across cultural heritage communities. VRA Bulletin 34(1), 7–14 (2007)
4. Baca, M.: Fear of authority? Cataloguing & Classification Quarterly 38(3/4), 143–151 (2004)
5. Taylor, A.G., Joudrey, D.N.: The Organization of Information, 3rd edn. Libraries Unlimited, Westport (2009)
6. Caplan, P.: Oh what a tangled web we weave: opportunities and challenges for standards development in the digital library arena. First Monday 5(6) (2000), http://www.uic.edu/htbin/cgiwrap/bin/ojs/index.php/fm/article/viewArticle/765/674
7. Museums Aotearoa, http://www.museums-aotearoa.org.nz/
8. Digital Strategy, http://www.digitalstrategy.govt.nz/
Metadata Creation: Application for Thai Lanna Historical and Traditional Archives

Churee Techawut

Computer Science Department, Faculty of Science, Chiang Mai University, Thailand
[email protected]
Abstract. This paper describes the process of metadata creation for the Thai Lanna historical and traditional archives (shortened to the Lanna Archives) by applying the Singapore Framework for Dublin Core Application Profiles. Its metadata model for scholarly works, based on the Functional Requirements for Bibliographic Records (FRBR), is adapted to create a data model and metadata scheme for the Lanna Archives. The proposed metadata scheme provides the level of detail required for describing the digital Lanna Archives and also supports information consistency and information sharing. Keywords: Lanna Archives, Metadata, Dublin Core, Digital Collection.
1 Introduction
The digital preservation of the Lanna Archives and the recording of the knowledge they contain have been undertaken by many institutions in the north of Thailand. Because most of the digitized data are maintained in different institutions and in various formats, it is difficult to use and share this valuable information. It is therefore necessary to define an information standard, or metadata, for describing the Lanna Archives and to organize some basic structures for information consistency in order to support efficient usability and simple searches. The metadata standard widely accepted as a basis for designing metadata applications is the Dublin Core (DC) metadata standard [1]. One of its framework applications that is close to this work is the Singapore Framework for DC Application Profiles [2], which is used to present a metadata model for scholarly works [3] based on FRBR. This paper describes the adaptation of the Singapore Framework in the metadata creation process to provide a metadata scheme for the interoperability and reusability of Lanna Archive digital collections.
2 Related Works
Not much work has been found on the design and construction of metadata for Thai Lanna historical and traditional archives. Most existing work aims at creating collections of data about palm leaf manuscripts [4][5] in the northeastern area, but it does not focus on creating metadata for standard use.
Another related work outlines the adaptation of the FRBR model [6] to analyze attributes and relationships at the conceptual level and to facilitate retrieval in digital collections of Thai Isan palm leaf manuscripts. That work is similar in addressing the creation of metadata, but it differs in domain and data model; furthermore, the present work extends the type of archives covered to books and mulberry manuscripts.
3 The Process of Metadata Creation
The process of metadata creation began with requirements gathering. The major requirement was the quality and consistency of metadata for use and sharing between institutions. In this paper, quality means the extent to which the core data required by users are covered, and consistency means an equivalent required data format for interoperability. Based on the requirements, the created metadata must support the following functional requirements:
– Provide a richer set of metadata than the simple DC elements, because the 15 simple DC metadata elements do not offer the level of detail required for describing the digital Lanna Archives.
– Provide the interoperability of consistent metadata. The contents of metadata elements must be defined by an encoding standard and must be browsable on different software platforms.
After specifying the functional requirements for the metadata, a data model was drafted at the conceptual level. The first draft was outlined by applying Chen's entity-relationship (ER) model and the ER-to-relational mapping algorithm [7]. However, mapping the DC metadata elements to this first draft showed that it was incompatible with the DC metadata elements. In response to the functional requirements, the metadata model for scholarly works based on the FRBR model was adopted to define a conceptual model that clarifies the basic metadata elements and their relationships, as shown in Fig. 1. All entity and relationship labels used in the metadata model for scholarly works are used in this model, except that the isEditedBy relationship was replaced with an isContributedBy relationship. The following schemas show the attributes of each entity; attributes shown in italics do not appear as attributes in the metadata model for scholarly works.
– LannaArchive (title, subject, abstract, identifier)
– Expression (title, created date, language, character, type, has version, references, note, temporal coverage, identifier)
– Manifestation (format, digitized date, source, rights holder, provenance, identifier)
– Copy (URI, identifier)
– Agent (name, family name, given name, type of agent, organization name, address, identifier)
Fig. 1. Data model of Lanna Archives based on Singapore Framework
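As an illustration of the data model described above, the sketch below renders the five entities and their relationships as Python dataclasses. The attribute names follow the schemas listed in the text and the relationship names follow the scholarly-works model it adapts; the rendering itself is an assumption for illustration, not the project's implementation.

```python
# Illustrative rendering of the FRBR-based entities as dataclasses.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Agent:
    name: str
    type_of_agent: str                      # e.g. person or organisation
    identifier: str
    family_name: Optional[str] = None
    given_name: Optional[str] = None
    organization_name: Optional[str] = None
    address: Optional[str] = None

@dataclass
class Copy:                                 # a digital copy available at a URI
    uri: str
    identifier: str

@dataclass
class Manifestation:                        # a physical or digitized embodiment
    format: str
    identifier: str
    digitized_date: Optional[str] = None
    source: Optional[str] = None
    rights_holder: Optional[Agent] = None
    provenance: Optional[str] = None
    copies: List[Copy] = field(default_factory=list)          # isAvailableAs

@dataclass
class Expression:                           # a dated, language-specific version
    title: str
    identifier: str
    language: str
    character: str                          # script used in the manuscript
    created_date: Optional[str] = None
    manifestations: List[Manifestation] = field(default_factory=list)  # isManifestedAs
    contributors: List[Agent] = field(default_factory=list)            # isContributedBy

@dataclass
class LannaArchive:                         # the work itself
    title: str
    subject: str
    identifier: str
    abstract: Optional[str] = None
    creators: List[Agent] = field(default_factory=list)       # isCreatedBy
    expressions: List[Expression] = field(default_factory=list)  # isExpressedAs
```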
Table 1. The metadata elements for describing Lanna Archives
(Data Element | Element Refinement(s) | Encoding Scheme(s) / *Remarks)
Title | – | –
Creator | Orgname | –
Creator | Address | –
Subject | Keyword | –
Subject | Others | *Content categories in [9]
Description | Abstract | –
Description | Note | *Colophon
Publisher | – | –
Publisher | Place | –
Contributor | – | –
Date | Created | *Thai minor era year
Date | Digitized | *B.E. date of digitization
Type | – | DCMI Type Vocabulary
Format | – | IMT
Format | Extent | –
Format | Medium | *Book / palm leaf manuscript / mulberry manuscript
Format | Amount | *Number of pages / palm leafs in a bundle
Identifier | – | *Unified index
Identifier | – | URI
Source | – | –
Relation | Has version | –
Relation | References | –
Language | – | *Specific set of languages
Language | Character | *Specific set of characters
Coverage | Temporal | W3C-DTF
RightsHolder | – | *For digital copy
Provenance | – | *For physical copy
The metadata scheme after mapping with the DC metadata elements and their element refinements is shown in Table 1. Most elements are from the DC metadata standard. The element refinements in italics show additional refinements not included in the DC metadata. More details are shown in [8].
4 Conclusion
The development of the metadata scheme presented in this paper will support efficient and effective information access to the Lanna Archives, including books, palm leaf manuscripts and mulberry manuscripts. The metadata model for scholarly works, based on the FRBR model, in the Singapore Framework for DC Application Profiles helps elaborate a metadata scheme design that maps straightforwardly onto the DC metadata standard. The metadata scheme in this paper has been used in the Electronic Lanna Literary project [10], primarily for storing bibliographic records, and can be extended for cataloguing in the Electronic Heritage project at Chiang Mai University Library. Acknowledgments. This work was part of the Electronic Lanna Literary project supported by NSTDA (Northern Network), Thailand. I would also like to thank Udom Roongruangsri and Krerk Akkarachinores for setting me on the path of digitized data management.
References
1. Dublin Core Metadata Initiative (DCMI), http://dublincore.org
2. The Singapore Framework for Dublin Core Application Profiles, http://dublincore.org/documents/2008/01/14/singapore-framework/
3. Allinson, J., Johnston, P., Powell, A.: A Dublin Core Application Profile for Scholarly Works. Ariadne, Issue 50 (2007), http://www.ariadne.ac.uk/issue50/allinson-et-al/
4. Project for Palm Leaf Preservation in Northeastern Thailand, http://www.bl.msu.ac.th/bailan/en_bl/en_bl.html
5. Database Creation Project for Palm Leaf Manuscript in Ubonrachatani Province (in Thai), http://www.lib.ubu.ac.th/bailan/index.html
6. Chamnongsri, N., Manmart, L., Wuwongse, V., Jacob, E.K.: Applying FRBR Model as a Conceptual Model in Development of Metadata for Digitized Thai Palm Leaf Manuscripts. In: Sugimoto, S., Hunter, J., Rauber, A., Morishima, A. (eds.) ICADL 2006. LNCS, vol. 4312, pp. 254–263. Springer, Heidelberg (2006)
7. Chen, P.: The Entity-Relationship Model – Toward a Unified View of Data. ACM TODS 1(1), 9–36 (1976)
8. Roongruangsri, U., Akkarachinores, K., Sukin, S., Jitasungwaro, J., Chayasupo, S., SangBoon, P., Techawut, C., Chaijaruwanich, J.: Electronic Lanna Literary Project. Technical Report of NSTDA (Northern Network) (2009) (in Thai)
9. Wachirayano, D.: Handbook of Palm Leaf Manuscript Exploration, 2nd edn. Publishing supported by Social Investment Fund and Lanna Wisdom School, Chiangmai (2002) (in Thai)
10. Electronic Lanna Literary Project (in Thai), http://www.lannalit.org
A User-Centric Evaluation of the Europeana Digital Library

Milena Dobreva (1) and Sudatta Chowdhury (2)

(1) Centre for Digital Library Research (CDLR), Information Resources Directorate (IRD), University of Strathclyde, Livingstone Tower, 26 Richmond Street, Glasgow, G1 1XH, United Kingdom, [email protected]
(2) Information and Knowledge Management, University of Technology, Sydney, P.O. Box 123, Broadway, NSW 2007, Australia, [email protected]
Abstract. The usability of digital libraries is an essential factor in attracting users. Europeana, a digital library built around the idea of providing a single access point to European cultural heritage, pays special attention to user needs and behaviour. This paper presents user-related outcomes, addressing the dynamics of user perception, from a study which involved focus groups and media labs in four European countries. While Europeana was positively perceived by all groups at the beginning of the study, some groups were more critical after performing a task which involved eight types of searches. The study gathered opinions on the difficulties encountered, which help in better understanding users' expectations within the content and functionality domains of digital libraries and should be of interest to all stakeholders in digital library projects. Keywords: Usability, digital libraries, Europeana.
memory institutions across Europe. The website states that it "is a place for inspiration and ideas. Search through the cultural collections of Europe, connect to other user pathways and share your discoveries" [12]. It currently contains objects from all EU countries in different languages, and aims to include 10 million objects by the end of 2010; these are links to books, photographs, maps, movies and audio files which present a common European heritage [13]. Personal research is the dominant reason given for visiting Europeana: almost three-quarters visit for personal research activities and less than 20% visit for the next most popular reason, work-related research [15].
In its development Europeana pays special attention to user needs. As research indicates, user needs should not be studied in isolation. Chowdhury [6] stresses that both user-centric and context-based digital library design have been a major area of research in recent years. From this point of view, Europeana provides an excellent setting in which to apply user-centric design principles in a major real-life project.
This paper presents some of the outcomes of a study undertaken in October 2009 – January 2010 focusing on the usability of Europeana from a user-centric point of view. Special emphasis in the study was given to a number of areas [10], including: (1) ease of use and intuitiveness of the Europeana prototype, especially for users who visit the website for the first time; (2) identification of 'future' user needs as the young generation grows up; (3) styles of use of the prototype for knowledge discovery amongst young users; (4) expectations, including how users see trustworthiness; (5) similarities and differences between the groups from different countries; and (6) summarising feedback in order to assist the development of future versions of the digital library.
This paper looks in depth into the usability and functionality aspects of Europeana, addressing the question of which aspects of digital libraries are seen as more attractive and popular amongst users. The users' opinions were gathered through focus groups held in four countries (Bulgaria, Italy, the Netherlands and the UK) and media labs held in the UK. The paper also touches upon issues identified by the users which could be helpful in refining further versions of Europeana.
2 Literature Review

Sumner [17] emphasises the importance of understanding user needs, arguing that innovative user interfaces and interaction mechanisms can promote better use of digital library resources, collections, and services. Adams and Blandford [1] identify the changing information requirements, which they define as an "information journey", of users in two domains: health and academia. They propose that the process involves the following stages: (1) initiation, which is driven actively by a specific task or condition, or passively by friends, family, information intermediaries and the press; (2) facilitation, which is driven by the effectiveness of tools for retrieving information; and (3) interpretation, in which the user makes sense of retrieved information based on their needs. However, they note that
awareness of the resources is a major problem, and they therefore propose the need for press alerts for recent articles. Blandford and Buchanan [3] list a number of criteria for assessing the usability of digital libraries, e.g. how effectively and efficiently can the user achieve their goals with a system; how easily can the user learn to use the system; how does the system help the user avoid making errors, or recover from errors; how much does the user enjoy working with the system; and how well does the system fit within the context in which it is used. Arguing that user requirements change from one search session to another, or even within a given search session, Bollen and Luce [4] point out that usability factors such as user preferences and satisfaction tend to be highly transient and specific; for example, the user's search focus can shift from one scientific domain to another between, or even within, retrieval sessions. They therefore recommend that research on these issues needs to focus on more stable characteristics of a given user community, such as "the community's perspective on general document impact and the relationships between documents in a collection". Chowdhury [5] provides a detailed checklist of usability features for digital libraries, including interface features, the search process, database/resource selection, query formulation, search options for text, multimedia and specific collections, search operators, results manipulation, and help. Dillon [9] proposes a qualitative framework for evaluation called TIME which focuses on four elements: Task – what users want to do; Information model – what structures aid use; Manipulation of materials – how users access the components of the document; and Ergonomics of visual displays – how they affect human perception of information. Dillon [9] further claims that "TIME offers a simple framework for evaluators to employ, either alone or in conjunction with existing methods, which will enhance the process and communicability of usability evaluations". Stelmaszewska and Blandford [16] study how the user interface can support users in developing different search strategies; they focus on the difficulties users experience when search results fail to meet their expectations. Digital libraries are designed for specific users and to provide support for specific activities. Hence digital libraries should be evaluated in the context of their target users and specific applications and contexts. Users should be at the centre of any digital library evaluation, and their characteristics, information needs and information behaviour should be given priority when designing any usability study [8]. As Chowdhury [6] emphasizes, "…modern digital libraries tend to be person-centric with the mission of allowing users to perform various activities, and communicate and share information across individual, institutional and geographical boundaries". Digital library researchers often focus on technical issues such as information retrieval methods, software architecture, etc. rather than on user-centered issues [14]. Bertot et al. [2] claim that functionality, usability, and accessibility testing are important for providing high-quality digital library services to a diverse population of users.
They further explain that this multi-method approach to evaluating the digital services and resources of libraries enables researchers, library managers, and funding agencies to understand the extent to which a digital library meets users’ needs: functionality testing determines whether a digital library is able to perform operations such as basic search and support for multiple languages, usability testing determines whether users can use a digital library’s various features intuitively, and accessibility testing determines whether users with disabilities are able to interact with the digital library.
This survey of recent work clearly shows that there is a broad range of issues which need to be considered in the design of digital libraries from the user-centric point of view. Although various researchers suggest different models, the basic framework for such studies is clear: users should be given a task which makes it possible to observe their information behaviour and to study their emotional responses. Such an approach was used in the study of Europeana presented below.
3 Research Design The study was organised in four countries with different levels of contribution of digital objects to Europeana: Bulgaria, Italy, The Netherlands, and the UK. A combination of focus groups and media labs targeting young users and members of the general public was selected in order to study the user experience with Europeana as well as to gather ideas for possible future improvements of Europeana from a user-centric point of view. The study included 89 participants who took part in six focus groups (two in Amsterdam and two in Sofia for school students; one in Fermo for university students and one in Glasgow for members of the general public) and media labs in Glasgow that assessed user needs on an individual basis (Table 1).

Table 1. Number and percentage of participants by country

Type of users                       Location                      Study method                              %
Young users - school students       Sofia (Bulgaria)              Focus groups: 2 groups, 22 participants   24.7
Young users - school students       Amsterdam (the Netherlands)   Focus groups: 2 groups, 23 participants   25.8
Young users - university students   Fermo (Italy)                 Focus group: 1 group, 20 participants     22.5
General public                      Glasgow (UK)                  Focus group: 1 group, 12 participants;    27.0
                                                                  Media labs: 12 individual sessions
The study was held in four different countries using two study methods (focus groups and media labs), but all groups and individual sessions followed the same protocol, which makes the gathered data comparable. One difference between focus groups and media labs was that focus groups had a joint discussion, while media labs were held as individual sessions and the discussion took the form of a conversation between each participant and a moderator. The protocol included three questionnaires (first impressions, deeper impressions and lasting impressions);
a series of key discussion points; and an assignment requesting that participants put together a PowerPoint presentation in line with a predefined set of slides designed to provide a virtual portrait of their local city. The basic units of analysis were the various statements made by participants during discussions, supplemented by responses to the questionnaires. The searches stored within MyEuropeana and the examination of the content transferred to the PowerPoint presentations prepared by participants provided additional evidence on user behaviour. The series of media labs run in Glasgow provided an additional means of feedback through the collection of eye tracking data, which helped to evaluate which areas of the interface were most heavily used during the work on the assignment. The protocol was designed so that feedback gathered from the users at various stages of the study effectively reflected their impressions. A brief presentation about Europeana and its key features was given to elicit users’ first impressions of and expectations from Europeana before the actual assignment started. To gather the first impressions of participants on Europeana, a questionnaire was used. It offered seven dichotomic pairs of concepts related to the ease of use, uniqueness and look and feel of the site, as well as bubbles to be filled in with associations with Europeana. The dichotomic pairs provide a quick indication of the degree to which the participants liked or disliked Europeana, while filling in the bubbles elicited fuller, freely written comments on how participants perceived the website. Once the users had worked on the task with Europeana, the deeper impressions questionnaire helped to ascertain whether or not the nature of the digital library and its delivery met the expectations expressed earlier, and in the lasting impressions questionnaire users expressed their intention whether or not to use Europeana in the future. Data collection included completed questionnaires; recordings of discussion sessions; the populated presentations; queries saved in My Europeana by each participant; and eye tracking data. While most of these data allow for qualitative analysis, the data on the queries, the presentations and the eye tracking were very useful to back up qualitative statements with evidence on user behaviour.
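Purely as an illustration of how responses to the seven dichotomic pairs could be reduced to the overall positive/negative counts reported in the findings, the following minimal sketch tallies participants whose answers lean towards the positive or the negative pole. The pair labels and the scoring rule are assumptions for illustration; the paper does not prescribe a particular scoring procedure.

```python
from typing import Dict, List

# Hypothetical labels for the seven dichotomic pairs (positive pole listed first).
PAIRS = ["attractive/unattractive", "well organised/chaotic", "easy/difficult",
         "interesting/boring", "unique/ordinary", "fast/slow", "clear/confusing"]

def classify(responses: List[Dict[str, int]]) -> Dict[str, int]:
    """Count participants whose overall impression is positive, negative or neutral.

    Each response maps a pair label to +1 (positive pole chosen) or -1 (negative
    pole chosen); a participant counts as positive if the sum over all pairs is > 0.
    """
    counts = {"positive": 0, "negative": 0, "neutral": 0}
    for answer in responses:
        score = sum(answer.get(pair, 0) for pair in PAIRS)
        if score > 0:
            counts["positive"] += 1
        elif score < 0:
            counts["negative"] += 1
        else:
            counts["neutral"] += 1
    return counts

# Example: one participant leaning positive, one leaning negative.
sample = [
    {p: +1 for p in PAIRS},
    {**{p: -1 for p in PAIRS[:4]}, **{p: +1 for p in PAIRS[4:]}},
]
print(classify(sample))  # {'positive': 1, 'negative': 1, 'neutral': 0}
```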
4 Findings The assignment was designed to incorporate eight different usage scenarios: finding texts on a predefined topic; finding images on a predefined topic; finding audio and video materials on a predefined topic; finding materials presenting the same object at different times; finding materials on a very specific subject (such as a landmark, an event or a person); finding materials on a topic of the participants’ choice within the context of the general theme; finding materials on a specific time period or an event which happened on a specific date; and finally identifying the providers who contributed the highest number of digital objects on a particular topic, identifying what was found to be the most useful about Europeana and suggesting areas in which material may be lacking. This range of scenarios can be seen as the Tasks in the TIME model [9], where Europeana provides an environment in which the users can try various searches (this maps to the Information model in TIME). The basic challenge for the study was to learn as much as possible about the Manipulation of materials – how users access the
components of the document; and Ergonomics. Findings and recommendations from the study as a whole can be categorised into two broad areas – Content and Functionality/Usability. A number of commonalities in user behaviour, and particularly in user dissatisfaction, were identified. 4.1 Users’ Impressions
In the following sections, users’ impressions are discussed: the first impressions cover what users expected from Europeana, the deeper impressions cover what users experienced, and the lasting impressions cover whether or not users were willing to use Europeana in the future and what improvements would be required for future use.

Fig. 1. Initial and final opinions on Europeana. The four groups were homogeneous in their initial positive feedback but some groups were more critical after the performance of the Tasks.
Figure 1 summarises the overall estimates given by participants at the beginning and at the end of the study. Generally, the feedback of the participants at the beginning was rather positive, and since Europeana was new to almost everyone it could be said that the website creates the expectation of being mostly attractive, well organised, easy to use and interesting. Deeper impressions indicated a series of issues which the users thought should be addressed. Lasting impressions of participants show an increased number of critical opinions. This means that the experience of performing the tasks was not entirely positive, and a crucial question of the study is what can be done to improve future user experiences from the point of view of the participants – what was difficult and what stumbling blocks they experienced. 4.1.1 Content and Functionality/Usability Findings from the study as a whole can be categorised into two broad areas – Content and Functionality/Usability. Although findings did vary across user groups, a
number of common points were raised. For example, in relation to content, the most common findings across all groups were that there is a clear lack of textual material and very few contemporary resources; participants also wanted to have access to more audiovisual resources. This was found to be the case irrespective of the nationality of participants. Participants in each group also expressed a desire for translated materials, as opposed to only being able to set the language of the interface. Access to content proved problematic for all groups, particularly for audio and video material. The ranking of search results and the selection of objects matching a particular search also caused confusion across various groups. The issue of unexpected or inappropriate results was raised repeatedly in relation to searches. Expectations relating to functionality produced a range of diverging feedback. Most of the participants liked the interface, but some found it difficult. It was sometimes found that materials returned in response to a search bore no relevance to the search term(s), causing confusion and dissatisfaction. One common impediment to more successful searching was that a results set could not be narrowed further by implementing a secondary search. Some findings were quite specific, e.g. the misclassification of maps as texts, which was experienced by some participants in Glasgow. User opinions came as several hundred statements; we used content analysis to group these statements, and in the next section we present those which are relevant to content and functionality/usability. Although these observations came from the work with Europeana, many of them are generic enough to be relevant to other digital libraries.
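The paper does not describe its content-analysis procedure in detail; purely as an illustration, grouping coded statements into the two broad areas could be tallied as in the minimal sketch below. The codes and example statements are hypothetical.

```python
from collections import Counter

# Hypothetical codes assigned to participant statements during content analysis.
CODE_TO_AREA = {
    "missing_texts": "Content",
    "audiovisual_gap": "Content",
    "metadata_quality": "Content",
    "ranking": "Functionality/Usability",
    "refine_search": "Functionality/Usability",
    "language_barrier": "Functionality/Usability",
}

coded_statements = [
    ("I expected more digitised books", "missing_texts"),
    ("Why is this result at the top?", "ranking"),
    ("I could not narrow down my results", "refine_search"),
]

# Tally how many statements fall into each broad area.
area_counts = Counter(CODE_TO_AREA[code] for _, code in coded_statements)
print(area_counts)  # Counter({'Functionality/Usability': 2, 'Content': 1})
```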
5 Synthesis of Users’ Feedback Based on the results of the study, the feedback provided by users was analysed and synthesised as issues relevant to Content and Functionality/Usability. Some of these issues address areas which are already under development and some might seem to be outside the scope of immediate development; but the synthesis of the users’ opinions helps to see which concerns appear most essential when users work with the current prototype. These could be seen as indicative for other digital libraries, which is why we share them. The key Content issues are:

- Users expected more digitised texts (mentioning specifically books and manuscripts), and wanted to be able to annotate and manipulate them.
- Audiovisual content is not as well represented as other material, and users wanted more of it.
- The lack of contemporary books, pictures, films and music disappointed users.
- School students expected content to be downloadable; they also wanted to be able to add their own content.
- Users assumed that all content would be free and there was frustration that some content providers charged for access to material.
- Users recommended improvements to the quality of the information about the objects, i.e. the metadata records.
- People wanted more translation assistance in order to understand their results better.
- Users expected better classification of content, e.g. by art galleries, council records, newspapers etc. The top-level classifications caused concern – for example, maps were sometimes listed as ‘texts’ instead of ‘images’.
- While the participants in the study generally liked the Timeline, some thought it didn’t give enough description of the items displayed; in addition, the date cloud sometimes caused confusion, especially when a particular date did not appear in the date cloud.
- Broken links, however infrequent, are always an irritant for users.

The primary Functionality/Usability issues are:

- Reactions were very mixed: many participants found Europeana easy to use; others didn’t, and a small number found it very difficult.
- Better ranking or prioritising of results was the most frequent demand. Users generally want to be able to see why a certain result appears and to understand how the ranking of the results is made.
- Users wanted to be able to refine their search within a results set.
- Participants expected greater precision in search results. They didn’t understand how some of the results related to their search and became confused and dissatisfied.
- Language was perceived as a significant barrier. Users were willing to use materials either in their native language or in English but were not prepared to try to use another language. This was most marked among the younger students.
- More help menus, FAQs and ‘ask the expert’ (e.g. ‘ask the librarian’) services were wanted.
- People wanted more ways of browsing the content, including map-based visualizations.
- Students wanted to be able to customise the interface.
- There was a call for more linking between items to show relationships, as well as for narratives which would contextualise the items.
- People wanted a clearer and easier route back to their original search.
6 Conclusions After the launch of the prototype, Europeana will be developed into a fully operational service, and more content will be added to it. In due course, the intention is for users to contribute materials too (through an open-source approach, such as Wikipedia) [13]. This is only one example of functionality being considered for the future; the team developing Europeana has a working group which regularly discusses user-related issues, and in addition various user study methods are applied to consult with different user communities. Our study confirmed that Europeana’s general features and functions are rated highly by the majority of users. Most of them rated navigation around the site, search
functions, presentation of search results, and ease of access to content as “good” or “excellent” in a recent web survey [15]. According to this earlier web survey, limited awareness and understanding of My Europeana appears to be the main reason for its lack of use [15]. When asked about new functions and features that could be added to Europeana, the most popular request was the ability to download content [15]. The study aimed to gather opinions and to analyse possible future actions; some of its findings are in line with changes which are already being undertaken by Europeana. Nevertheless, the study provides valuable evidence and ideas for future consideration. The user-centric study of Europeana showed that users perceive it positively at first but that their experience discourages some of them. As Bertot et al. [2] note, “…by enacting multi-method user-centric approaches to assessing digital libraries, researchers and practitioners can ensure that investments in digital libraries are returned through extensive use of resources by a community with diverse information seeking needs”. Therefore, the user-centric evaluation of Europeana provided a set of very valuable findings and recommendations that can be implemented to improve the digital library to suit the needs of future users, but it could also be helpful for other projects in the digital library domain.
Acknowledgements The authors would like to acknowledge the support of the EDL foundation; Jill Cousins, Jonathan Purday, Adeline van den Berg, Anne Marie van Gerwen and Milena Popova from the Europeana office. We would also like to thank the members of the project team Emma McCulloch, Yurdagül Ünal, Prof. Ian Ruthven, Prof. Jonathan Sykes, Pierluigi Feliciati and Duncan Birrell for their valuable contributions and insights.
References 1. Adams, A., Blandford, A.: Digital libraries support for the user’s ‘Information Journey’. In: 5th ACM/IEEE Joint Conference on Digital Libraries (JCDL 2005), pp. 160–169. ACM, New York (2005), http://www.uclic.ucl.ac.uk/annb/docs/aaabJCDL05preprint.pdf 2. Bertot, J.C., Snead, J.T., Jaeger, P.T., McClure, C.R.: Functionality, usability, and accessibility Iterative user-centered evaluation strategies for digital libraries. Performance Measurement and Metrics 7(1), 17–28 (2006) 3. Blandford, A., Buchanan, G.: Usability of digital libraries: a source of creative tensions with technical developments. TCDL Bulletin (2003), http://www.ieeetcdl.org/Bulletin/current/blandford/blandford.htm 4. Bollen, J., Luce, R.: Evaluation of digital library impact and user communities by analysis of usage patterns. D-Lib Magazine 8, 6 (2002), http://www.dlib.org/dlib/june02/bollen/06bollen.html 5. Chowdhury, G.G.: Access and usability issues of scholarly electronic publications. In: Gorman, G.E., Rowland, F. (eds.) International Yearbook of Library and Information Management, 2004/2005, pp. 77–98. Facet Publishing, London (2004)
6. Chowdhury, G.G.: From digital libraries to digital preservation research: the importance of users and context. Journal of Documentation 66(2), 207–223 (2010) 7. Chowdhury, G.G., Chowdhury, S.: Introduction to Digital Libraries. Facet Publishing, London (2003) 8. Chowdhury, S., Landoni, M., Gibb, F.: Usability and impact of digital libraries: a review. Online Information Review 30(6), 656–680 (2006) 9. Dillon, A.: Evaluating on TIME: a framework for the expert evaluation of digital interface usability. International Journal on Digital Libraries 2(2/3) (1999) 10. Dobreva, M., McCulloch, E., Birrell, D., Feliciati, P., Ruthven, I., Sykes, J., Ünal, Y.: User and Functional Testing. Final report (2010) 11. Duncker, E., Theng, Y.L., Mohd-Nasir, N.: Cultural usability in digital libraries. Bulletin of the American Society for Information Science 26(4), 21–22 (2000), http://www.asis.org/Bulletin/May-00/duncker__et_al.html 12. Europeana, http://www.europeana.eu/portal/ 13. Europe's information society: Thematic portal, http://ec.europa.eu/information_society/activities/digital_libraries/europeana/index_en.htm 14. Khoo, M., Buchanan, G., Cunningham, S.J.: Lightweight User-Friendly Evaluation Knowledge for Digital Librarians. D-Lib Magazine 15(7/8) (2009), http://www.dlib.org/dlib/july09/khoo/07khoo.html 15. Europeana online survey research report, http://www.edlfoundation.eu/c/document_library/get_file?uuid=e165f7f8-981a-436b-8179-d27ec952b8aa&groupId=10602 16. Stelmaszewska, H., Blandford, A.: Patterns of interactions: user behaviour in response to search results (2002), http://www.uclic.ucl.ac.uk/annb/DLUsability/Stelmaszewska29.pdf 17. Sumner, T.: Report. In: 5th ACM/IEEE Joint Conference on Digital Libraries (JCDL 2005). ACM, New York (2005); D-Lib Magazine 11(7/8) (2005), http://www.dlib.org/dlib/july05/sumner/07sumner.html
Digital Map Application for Historical Photos Weiqin Chen and Thomas Nottveit Department of Information Science and Media Studies, University of Bergen, P.O.B. 7802, N-5020 Bergen, Norway [email protected], [email protected]
Abstract. Although many map applications are available for presenting, browsing and sharing photos over the Internet, historical photos are not given enough attention. In addition, limited research efforts have been made on the usability and functionalities of such map applications for photo galleries. This paper aims to address these issues by studying the role of digital maps in presenting, browsing and searching historical photos. We have developed a map application and conducted formative evaluation with users focusing on usability and user involvement. The evaluation has shown positive responses from users. The search and navigation functions in the map application were found especially useful. The map was found to be important in involving users to share local knowledge about historical photos. Keywords: Digital map, historical photo, geocoding, photo gallery, digital library.
Although maps have been integrated into many online photo galleries, little research has been conducted to study the usability and functionality of such systems. In order to address these issues we have developed a map application for historical photos and conducted a formative evaluation focusing on usability and user involvement. 1.1 Background The University of Bergen Library (UBL) holds one of the biggest and most popular photo archives (Billedsamlingen) in Norway, with work by historical photographers from 1862 to the 1970s. This archive includes about 500,000 photos. Since 1997 Billedsamlingen has digitised photos by several historical photographers and developed a database and a website allowing the general public to browse and search photos. Fig. 1(a) and Fig. 1(b) show the main search page and a search result page from Billedsamlingen’s website respectively. The site has been open to the public since 2000 and currently includes 19,000 historical photos. Over the years the number of historical photos in Billedsamlingen has increased significantly, which makes the search and presentation of search results even more difficult. As shown in Fig. 1(b), the search result for Gustav Brosing’s collection gives 2,070 photos which are divided into 42 pages. Users have to browse through these pages to find the photos they want. Billedsamlingen’s website has a pressing need for improved navigation and search.
Fig. 1. Search and result in Billedsamlingen: (a) search page with different search fields including signature, title, year, location/address, person/company, photo type and theme; (b) search result for Gustav Brosing’s collection, including 2,070 photos divided into 42 pages. Choosing a title in the left pane shows the photo and associated information in the right pane.
According to Fox and Sornil [3], the benefits of digital libraries will not be appreciated unless they are easy to use effectively. The research presented in this paper aims to improve usability by presenting historical photos in a map application so that users can easily search and navigate among photos. In addition, because many end users have good local knowledge about the historical photos, we plan to make user involvement easier by offering special functions for user contributions. 1.2 Related Research With the increase of location-aware equipment that automatically generates location data, recent years have seen a large increase in applications that link information to geographic locations. In addition to the Geographical Information Systems (GIS)
research, Digital Libraries (DL) have been an important research area for location-based information systems. One very interesting project within DL is the Alexandria Digital Library Project (ADL). ADL’s collection and services focus on maps, images, data sets and other information sources with links to geographic locations. Using the web interface, users can browse the content of the library using maps and search by spatial and temporal locations on multiple data types such as text, scanned imagery, map indexes, etc. The research team of ADL has conducted several user evaluations to study the interface functions and design, usage patterns and user satisfaction [4, 5]. Although inspired by the ADL project, our map application additionally provides the possibility for user contribution. Users can use their local knowledge to place historical photos at map locations and to correct wrongly geocoded locations. In addition to the ADL project, there is other research that focuses on map applications, including geocoding and search. However, researchers argue that the user experience, usability and functionality of such systems have not drawn enough attention [6-9]. Clough and Read [7] conducted a study in which they examined four well-known photo-sharing services with maps. In their study Panoramio was evaluated by ten master’s students to identify problems regarding utility and functionality. One of the findings was that 40% of the test participants felt that they needed local geographical knowledge in order to search and navigate efficiently. Another study examined the usability of seven map services [9]. Thirty people evaluated three to four map services each by carrying out six to seven tasks for each map service. Some of the results can be summarised as follows: (1) a map service is considered to be more useful if the tasks can be solved in a short time, and (2) satellite maps were found to be not very useful. Skarlatidou and Haklay [9] point out that websites which emphasise high-quality satellite imagery should run usability tests in order to identify how usable these services will be for end users. Nivala [10] studied interactive maps and user-friendliness. In her research she evaluated four online map services in order to expose problems with user-friendliness. The usability evaluation was conducted with twelve test users and twelve expert users who solved tasks related to a scenario of planning a trip to London as a tourist. The evaluation resulted in 343 unique usability problems. Google Maps was found to have the fewest problems (69) and the least serious problems, while Multimap had the most problems (99). Based on the evaluation she designed comprehensive guidelines for developing user-friendly online interactive maps, covering the design and functionality of web pages in general, maps and search, as well as help and assistance on web pages.
2 Implementation of Map Application for Billedsamlingen The map application was designed and developed in iterations following a User-Centric Design (UCD) approach. Potential end users were involved in formulating requirements for the map application. The map application should provide users with the possibility to search and navigate among historical photos from Bergen using an online map. In order to present the historical photos from Billedsamlingen on a map, we needed to find the coordinates of the photos based on the textual locations registered in the archive. This process is called geocoding. In addition to the search fields and the map, the map application should also contain a gallery where a thumbnail can be
converted to a large photo and shown with associated information. The map application should also be able to involve end users. For photos that have been associated with coordinates, the application should allow users to change the location of these photos on the map. For photos that do not have coordinates, the application should allow users to place them on the map using their local knowledge. 2.1 Choice of Map Service and Geocoding Service Because there are many service providers for maps and geocoding, we conducted extensive evaluations to determine which are most suitable for the map application. Seven map services which are free and provide a web-based API (Application Programming Interface) were evaluated against a predefined set of criteria. These services are Google Maps, Microsoft Virtual Earth, Yahoo! Maps, OpenStreetMap, MapQuest, ViaMichelin and MultiMap. Criteria include, among others, whether the service provides a detailed road map and satellite map of Bergen, whether the service allows markers on the map and the association of information windows with the markers, and whether the service provides good documentation. Google Maps was the only service found to satisfy all criteria and was therefore chosen as the map service. Geocoding is the process of finding associated geographic coordinates (latitude and longitude) from other geographic data, such as street addresses or zip codes. In our application we needed to automatically find coordinates for historical photos from addresses using geocoding. In addition, we needed to find street addresses for the photos from their geographic coordinates. This process is called reverse geocoding. Reverse geocoding is necessary when the end user has moved a photo’s location or placed a photo on the map. Eleven geocoding services were evaluated against two levels of criteria. These services are Yahoo! API Geocoder, Map24 AJAX API, Multimap Open API, MapQuest API, ViaMichelin Maps & Drive API, Google Maps Geocoding Web Service, Google Maps Javascript Geocoder, Open Geocoding, GeoPy, GeoNames Search WebService, and Where 2 Get It. The first level of criteria includes, among others, whether the service can be used with Google Maps, whether the service allows searches for both Norwegian place names and street addresses with/without street numbers, whether the service supports reverse geocoding, and the likely availability of the service in the future. After the first evaluation, Yahoo! API Geocoder, Map24 AJAX API, Multimap Open API, MapQuest API, ViaMichelin Maps & Drive API, and GeoNames Search WebService were excluded. The remaining services went through a second evaluation in which prototypes based on the services were developed and compared according to a second level of criteria, which include:
• Batch geocoding: the ability to geocode multiple locations at the same time.
• Single geocoding: the ability to geocode a single location.
• Gazetteer: how well the service works with Bergen locations, especially with regard to the special Norwegian letters, historical names, non-existing names, as well as names with multiple spellings (e.g. smug or smau).
• Success rate: the number of locations for which a geocoding service manages to find the coordinates.
• Accuracy: the distance between the coordinates that a geocoding service finds and the correct coordinates.
• Efficiency: how long a geocoding service takes to geocode all locations.
• Integration: how well a geocoding service can be used with the search feature in the map application.
• Robustness: how robust the service is.
After the second evaluation, the Google Maps Geocoding Web Service was chosen as the geocoding service for our application because it scored the highest. 2.2 Map Application for Billedsamlingen The user interface includes four parts: the menu at the very top, a search box, a map section, and a photo gallery (Fig. 2). The search box allows the user to search by geographic location, either among those which are registered in the database or other valid locations in Norway. When a user types into the text field, it automatically provides up to five suggestions. The result of the search is: (1) if the location is in the database, the corresponding green marker and thumbnail are highlighted in the map and photo gallery respectively; (2) if the location is not in the database, a blue marker is shown on the map, indicating that no photos were found; or (3) the coordinates of the location cannot be found and no marker is placed on the map. Additional feedback is also provided above the search box explaining the search results to users. For example, in case (3) the application shows “What you are searching for is not a valid location or is located out of Hordaland region. Please try again.”
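As a sketch of how a typed location could be geocoded and mapped onto the three search outcomes described above: the request below follows the general shape of the Google Maps Geocoding Web Service (an HTTP call returning JSON with a latitude/longitude pair), but the exact parameters, the regional bias appended to the query, the key handling and the database lookup are illustrative assumptions, not the application's actual code.

```python
import json
import urllib.parse
import urllib.request

GEOCODE_URL = "https://maps.googleapis.com/maps/api/geocode/json"  # assumed endpoint

def geocode(location: str, api_key: str):
    """Return (lat, lng) for a location string, or None if it cannot be geocoded."""
    query = urllib.parse.urlencode({"address": location + ", Hordaland, Norway",
                                    "key": api_key})
    with urllib.request.urlopen(f"{GEOCODE_URL}?{query}") as response:
        data = json.load(response)
    if data.get("status") != "OK" or not data.get("results"):
        return None
    loc = data["results"][0]["geometry"]["location"]
    return loc["lat"], loc["lng"]

def handle_search(location: str, photo_locations: dict, api_key: str) -> str:
    """Decide which of the three search outcomes applies (hypothetical logic)."""
    if location in photo_locations:            # (1) location registered in the database
        return "highlight green marker and thumbnails"
    coords = geocode(location, api_key)
    if coords is not None:                     # (2) valid location, but no photos
        return f"show blue marker at {coords}"
    return "report invalid location"           # (3) not geocodable
```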
There are different markers in the map section, and descriptions of the markers are shown above the map. A Group Photo marker represents a cluster of photos, a Single Photo marker represents a single photo, a blue marker means that a location does not have any photos, and a green marker represents a search result or a chosen photo. In order to load the map quickly and to allow users to navigate easily on the map, clustering is used. Photos
which are close to each other are represented on the map by a Group Photo marker. Such Group Photo markers are replaced with Single Photo markers when the user zooms in. When a Single Photo marker on the map is clicked, information about the photo appears on the map in a popup window and the corresponding thumbnail in the gallery is highlighted; vice versa, the marker is highlighted on the map when the thumbnail in the gallery is clicked. The thumbnail gallery is updated by panning or zooming on the map, so the thumbnails in the gallery area correspond with the markers on the map at any time. User Involvement and Administration. Photos could be placed at an incorrect location because location names may change over time, the same location name may exist at geographically different places, and administrators may misspell or shorten location names when registering information. It can also happen that photos cannot be geocoded at all and thus do not have coordinates in the database. Users sometimes have good local knowledge, particularly about historical photos. They can be a great help in associating these photos with the correct location on the map. The map application makes use of this local knowledge and allows users to move the location of photos or to put photos that were not automatically geocoded onto the map. The popup window has two tabs. From the “info” tab the user can click on the link "view larger photo" and the gallery will show a large photo with related information (Fig. 2). In addition to the "info" tab, the user can change the photo's location by clicking on the "edit" tab. When the change is saved, the administrator is notified about the change on the Administrator’s page. The administrator can determine whether the change is correct or not. The change becomes visible on the map when an administrator approves it.
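Before turning to the administrative workflow, here is a minimal sketch of the zoom-dependent clustering described at the start of this subsection. The paper does not specify how the clustering is computed; a simple, commonly used approach bins markers into a grid whose cell size shrinks as the user zooms in, and any cell holding several photos is drawn as a Group Photo marker. Grid size and data layout are assumptions.

```python
from collections import defaultdict

def cluster_photos(photos, zoom):
    """Group photos into grid cells; cell size shrinks as the zoom level rises.

    `photos` is an iterable of (photo_id, lat, lng) tuples. Returns a dict
    mapping a grid cell to the photo ids that fall into it: a cell with more
    than one photo would be drawn as a Group Photo marker, otherwise as a
    Single Photo marker.
    """
    cell_size = 360.0 / (2 ** zoom)          # degrees per cell at this zoom level
    clusters = defaultdict(list)
    for photo_id, lat, lng in photos:
        cell = (int(lat // cell_size), int(lng // cell_size))
        clusters[cell].append(photo_id)
    return clusters

photos = [("p1", 60.392, 5.324), ("p2", 60.393, 5.325), ("p3", 60.510, 5.700)]
for cell, ids in cluster_photos(photos, zoom=12).items():
    marker = "Group Photo" if len(ids) > 1 else "Single Photo"
    print(cell, marker, ids)
```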
On top of the map section there are two tabs. "Photos on the map" shows all the historical photos that have either been geocoded automatically by the map application or been geocoded by a user or administrator. "Photos without a map relation" includes the rest of the historical photos, which are not geocoded. The user can select such a photo in the gallery and place it on the map. The photo is then represented with a yellow marker on the map. The user can write a note to the administrator and, upon saving, the administrator is notified. Only after the administrator has approved the change will the photo be visible to all users. On the Administrator’s page administrators can automatically geocode a batch of historical photos. In addition, they can get an overview of all the photos that are and are not linked to the map, view all registered users and see how many changes they have made, and view all changes made by users and approve or reject them (Fig. 3).
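A minimal sketch of the data model behind this user-contribution workflow: a proposed location change is stored as pending, listed for administrators, and becomes visible on the public map only once approved. Field names and the in-memory store are assumptions for illustration, not the application's actual schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class LocationChange:
    photo_id: str
    lat: float
    lng: float
    note: str = ""              # optional message from the user to the administrator
    status: str = "pending"     # "pending", "approved" or "rejected"

@dataclass
class ChangeQueue:
    changes: List[LocationChange] = field(default_factory=list)

    def submit(self, change: LocationChange) -> None:
        self.changes.append(change)          # administrator is notified of a pending change

    def pending(self) -> List[LocationChange]:
        return [c for c in self.changes if c.status == "pending"]

    def review(self, change: LocationChange, approve: bool) -> None:
        change.status = "approved" if approve else "rejected"

    def visible_on_map(self) -> List[LocationChange]:
        return [c for c in self.changes if c.status == "approved"]

queue = ChangeQueue()
queue.submit(LocationChange("p42", 60.397, 5.324, note="This is the old fish market"))
queue.review(queue.pending()[0], approve=True)
print([c.photo_id for c in queue.visible_on_map()])   # ['p42']
```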
3 Formative Evaluation The goal of the evaluation is to study how the map application helps to present and search historical photos. The formative evaluation focused on two aspects: perceived usefulness and perceived ease of use. In addition, feedback from the users will help further improvement of the application. 3.1 Evaluation Design For the evaluation we have chosen six different themes which cover the important features of the map application. These themes include:

1. Find nearby photos by using the map.
2. Locate nearby photos using both the search field and the map.
3. Interactions between search, map and photo gallery.
4. Report a wrong location for a photo.
5. Place a photo on the map that the map application failed to geocode automatically and identify the relationship between historical photos and today’s map.
6. Manage user changes on the Administrator’s page.
Theme 1 deals with basic map operations, such as zooming, panning, changing the map type and clicking on the different types of markers that appear on the map. In addition, the theme examines how users find a historical photo that is close to an object (a road, a place, a heritage site, a building or a landmark) displayed on the map. Theme 2 examines how users find a historical photo that is close to a given location. Here, both the search field and the map are used, where the search field gives users a starting point and the map further filters search results. Theme 3 examines how users perceive the links between the search fields and the map, and between the map and the photo gallery. Themes 4 and 5 deal with user involvement features and how users use them. In addition, theme 5 examines how the relation between historical photos and today’s map influences users when using the map application. Theme 6 examines how administrators approve or reject changes made by users. We recruited ten participants for the evaluation, including 5 users who were active users of the current Billedsamlingen’s website and 5 administrators who were actively
involved in the work at Billedsamlingen. Right before the evaluation one of the users withdrew, so we conducted the evaluation with nine participants (five women and four men) who were between twenty and seventy years old. We began by asking background questions about experience with digital maps and the current Billedsamlingen’s website. All participants were familiar with and had used Billedsamlingen’s website. Observation and interview were chosen as the main methods for collecting data. In addition we used a log generated by the map application. A semi-structured evaluation guide was followed during the evaluation. All questions were designed to reflect on perceived usefulness and perceived ease of use. They also made it possible to compare the map application with Billedsamlingen’s website. For each of the six themes a concrete task was assigned to the participants. The four users carried out only the first five themes, while the five administrators had to carry out the tasks in all six themes. Participants were not allowed to access the help pages and they did not receive any training before the tasks. Our intention was to see if the design itself was intuitive and good, so that its use became obvious and “self-evident” to the users [11]. While a participant solved a task, he or she was observed and notes were taken during the observation. Thinking aloud was encouraged. When the task was completed, an interview was conducted in which the participants were asked follow-up questions following the evaluation guide. By asking questions immediately after a task, where the questions and the task belonged to the same theme, the participants could remember better and were able to answer questions more precisely than if the questions had been asked after all the tasks were finished. 3.2 Analysis and Findings The participants in this study had a generally positive attitude towards the application. They perceived the application to be useful. Using the map as a primary method for search and navigation, rather than relying only on the search field, makes it easier to search, view search results and make changes. Searching with a combination of search field and map was found very useful when one is not familiar with the geographical area, and this combination made the search process much easier than searching on Billedsamlingen’s website. Using the map alone requires a higher level of familiarity with the geographical area. However, even if one is not familiar with the area, the application is still useful -- by navigating through the photo gallery and getting a link from the photo to the map. Moreover, it was useful and easy to search for nearby photos from a known object or a known location using the map. Searching for and navigating among a variety of historical photos was much easier when there was a connection between the search field, the map and the photo gallery. The majority of participants found it easier to recognise the subjects in historical photos by seeing how the photos were geographically located on the map. All participants found it important to either give or receive feedback about incorrect locations of photos. Moreover, the map application was found to lower the threshold for reporting these errors, and it was easier to give feedback via the map application than having to explain the error by e-mail, which is the case with Billedsamlingen’s website.
Some participants, however, admitted that out of old habit they thought it was easier to explain the errors by e-mail (“I'm so old fashioned
that I think I still would have used e-mail, but I see the point here very well”). The interface for handling user changes made the administrators’ job much easier by requiring them to relate to only one website, rather than having to first read an e-mail and then go to a database or a web page to make the necessary changes. The perceived ease of use varied among participants. Different previous knowledge about digital maps affected the degree to which they completed the tasks. Map markers, on the other hand, were easily understood by the majority of participants with the help of the descriptions of the markers above the map section. They responded that it was simple and time-saving to search for photos with the map and to search for nearby photos from a given location using the available map features. Some participants found some search techniques with the map difficult to use, but thought they would become easier with more training. This was also confirmed during the evaluation, where participants became gradually fluent in using the map application. The perceived ease of changing the location of a photo on the map varied too. Some found it rather easy while others had big challenges. The variation could be explained by the different skills and personal characteristics of participants. New technologies can be difficult to understand at first use, but are easier to handle after some training. This was confirmed several times during the evaluation. For example, when the participants were asked to put three historical photos on the map, they had trouble placing the first photo. But after a brief explanation of the procedure, all participants were able to place the remaining two photos successfully ("The method is not difficult, but it does not fall immediately in place. But it is a very convenient way when you know how it works"). Although this can be explained by the difficulties of introducing new technologies, it can also indicate that the design is not quite intuitive. A better design alternative as well as an informative help page could help to solve the problem. Presenting historical photos on today’s map was not considered a problem; instead, the participants thought it was interesting to see changes over time between photos and map.
4 Conclusion and Future Work In this paper we presented a map application for historical photos. A formative evaluation was conducted with users and administrators. The evaluation indicates that the map application is considered an appropriate service for Billedsamlingen’s historical photos. Using the map, including locations, was found to be very useful when searching for and navigating among historical photos. Moreover, the map was found helpful by administrators in geocoding photos and managing changes made by users. The user involvement was found especially important by both users and administrators. The map application did, however, present different challenges to the users, but the majority of these challenges and difficulties can be resolved by giving access to the help pages and allowing users to practise with the new features. The map application was considered to be an improvement over Billedsamlingen’s website. The feedback from the evaluation also includes a number of improvements for further development. An important and interesting question is how to integrate the map application into UBL's digital library. This integration process opens up several challenges that
require further research. Because the UBL system, including its databases, is a legacy system, it can be difficult to integrate a new, modern system such as the map application and its database. Two main challenges can be identified in the integration process. The first concerns how the database in the map application can be integrated with the library's existing databases. The second concerns how the map application can be integrated with the UBL’s digital library. Further research is needed to decide whether the map application should be an alternative or a replacement to the existing Billedsamlingen’s website. Acknowledgments. The authors would like to thank all the participants in the evaluation. In addition we would like to thank Dag Elgesem and colleagues in Billedsamlingen at UBL.
References 1. Naaman, M., Harada, S., Wang, Q., Garcia-Molina, H., Paepcke, A.: Context Data in Georeferenced Digital Photo Collections. In: Proc. of the 12th annual ACM international conference on Multimedia, pp. 196–203. ACM, New York (2004) 2. Champ, H.: 4,000,000,000 (2009), http://blog.flickr.net/en/2009/10/12/4000000000/ 3. Fox, E.A., Sornil, O.: Digital Libraries. In: Baeza-Yates, R., Ribeiro-Neto, B. (eds.) Modern Information Retrieval, pp. 415–432. ACM Press, New York (1999) 4. Hill, L.L., Carver, L., Dolin, R., Frew, J., Larsgaard, M., Rae, M.-A., Smith, T.R.: Alexandria Digital Library: User Evaluation Studies and System Design. Journal of the American Society for Information Science (Special issue on Digital Libraries) 51, 246–259 (2000) 5. Hill, L.L., Dolin, R., Frew, J., Kemp, R.B., Larsgaard, M., Montello, D.R., Rae, M.-A., Simpson, J.: User Evaluation: Summary of the Methodologies and Results for the Alexandria Digital Library, University of California at Santa Barbara. In: Schwartz, C., Rorvig, M. (eds.) Proceedings of the American Society for Information Science (ASIS) Annual Meeting, pp. 225–243. Information Today, Medford, NJ (1997) 6. Cherubini, M., Hong, F., Dillenbourg, P., Girardin, F.: Ubiquitous Collaborative Annotations of Mobile maps: How and Why people might Want to Share Geographical Notes. In: The 9th International Workshop on Collaborative Editing Systems (IWCES 2007), Sanibel Island, FL (2007) 7. Clough, P., Read, S.: Key Design Issues with Visualising Images using Google Earth. In: Macdonald, C., Ounis, I., Plachouras, V., Ruthven, I., White, R.W. (eds.) ECIR 2008. LNCS, vol. 4956, pp. 570–574. Springer, Heidelberg (2008) 8. Portegys, T.E.: A Location-based Cooperative Web Service Using Google Maps. In: Proc. of the Conference on Information Technology and Economic Development (CITED 2006), Accra, Ghana (2006) 9. Skarlatidou, A., Haklay, M.: Public Web Mapping: Preliminary Usability Evaluation. In: Proc. of the 14th GIS Research, GISUK 2006 (2006) 10. Nivala, A.M.: Usability Perspectives for the Design of Interactive Maps. Helsinki University of Technology, Espoo, Finland (2007) 11. Norman, D.: The Psychology of Everyday Things. Basic Books, New York (1988)
Supporting Early Document Navigation with Semantic Zooming Tom Owen1 , George Buchanan2 , Parisa Eslambochilar1, and Fernando Loizides2 1
Future Interaction Laboratory, Swansea University, Swansea, UK 2 Centre for HCI Design, City University, London, UK {cstomo,p.eslambochilar}@swansea.ac.uk {george.buchanan.1,fernando.loizides.1}@city.ac.uk
Abstract. Traditional digital document navigation, as found in Acrobat and HTML document readers, performs poorly for this task when compared to paper documents. We investigate and compare two methods for improving navigation when a reader first views a digital document. One technique modifies the traditional scrolling method, combining it with Speed-Dependent Automatic Zooming (SDAZ). We also examine the effect of adding “semantic” rendering, where the document display is altered depending on scroll speed. We demonstrate that the combination of these methods reduces user effort without impacting on user behaviour. This confirms both the utility of our navigation techniques and the minimal use information seekers make of much of the content of digital documents.
1 Introduction
Within–document navigation is a common action performed by users when reading texts in many circumstances: initial skimming, quick readings, the deep analysis of a selected document, and checking remembered details. In this paper, we investigate the support of navigation from the primary perspective of initial navigation of a document. The user may seek specific targets (e.g. keywords), but the exact location of these is unknown, and they may even be absent from the document. In this environment, users seek to obtain a quick overview of the document to determine its structure and judge its content, usually in the context of an immediate information need. This initial and brief reading is usually termed “document triage”. Cathy Marshall [11] has demonstrated that digital document reader software significantly impedes users’ interaction with documents. Our work addresses the challenge, raised by Marshall, of providing “library materials that not only capture the affordances of paper, but also transcend paper’s limitations”. We build upon recent research [4,5] that indicates that users actually use only limited parts of a document’s content during interactive triage. This selective attention appears to be even more pronounced in digital documents when compared to printed texts. We follow the natural corollaries of this knowledge to
minimise the information presented to the user when scrolling, and to maximise the visibility of the remaining content. In principle, this will reduce visual clutter and improve a user’s visual search performance. However, there is contrary evidence that non–realistic renderings of documents undermine users’ acceptance of documents [8]. Therefore, a focussed study is required to assess whether the use of a novel presentation results in a lower subjective user evaluation of the techniques. This paper commences with a general introduction to the “state of the art” in digital document navigation. We proceed to describe the design and implementation of a set of novel techniques for supporting within–document navigation. These are subsequently evaluated in a user study. The results of the study are described in detail, and discussed in relation to previous research. We conclude with an outline of future avenues for research and a summary of the main contributions of the work.
2 Navigation in Digital Documents
To understand the problems experienced during digital document navigation, we initially ran a laboratory–based observational study [3]. We interviewed academics who were regular users of both digital and paper texts. The aim of these interviews was to provide qualitative data that would give a more detailed picture of the current interaction problems with digital documents. We identified specific issues from our own observational data [3] and the work of researchers such as Cathy Marshall [11] and Kenton O’Hara [12], including: temporary placeholding; permanent bookmarking [6]; navigation; skimming and overview; search–within–documents and reading effort. In this paper we focus on the issues of navigation, skimming and overview. We conducted semi–structured interviews with nine humanities researchers, and repeated this with three computer science researchers. Ratings were obtained on a scale of 1 to 10 for specific interactions, and qualitative feedback was elicited to complement each rating. When asked to compare the ease of navigation in digital and paper documents, the users gave an average rating of 5.7 (digital) versus 7.4 (paper). This difference may appear small. However, the detailed picture revealed a stark contrast between positive and negative features. Scrolling proved to be a complex issue: while participants expressed a positive inclination towards it, in practice their experiences were mixed: e.g. “I like scrolling but I find it annoying when it pauses and just stops on you”. Another user was more specific: “when you want to scroll quickly and the computer can’t keep up, it is really confusing, you end up not knowing where you are”. Render speeds are a key underlying problem here, where larger documents often cannot be displayed sufficiently quickly to create a smooth interaction. One academic observed that “You just cannot flick through the PDF”. Simple “go-to” navigation to a specified page was seen as an advantage of electronic media, though three participants reported problems with PDF documents having different printed and logical page numbers. Books downloaded
from Google Books were cited by a historian as one particular problem: “the page numbers seem just to start at one, regardless of the original front matter or numbers, so I have to keep on making up the numbers or just guessing it.” In contrast, though printed paper lacked that specific problem, it had a more prevalent one “it gets real hard when the pages seem to stick together and you go back and forth and blow on it to get ... the right page ... I get so distracted by that.” Other findings triangulated well with Marshall and O’Hara: e.g. reading was rated as requiring more effort on digital displays. We also corroborated our earlier observations in revealing a low rate of use of positively rated electronic tools such as within–document search. One literature scholar said: “I guess, in truth, I think I use it when I get desperate, when I can’t see what I think...expect should be there.” At the end of each session, we introduced a number of experimental document reader applications for comparison. User responses to these – particularly to one overview tool – informed the design of novel navigations that we introduce in the next section.
3 Design
Scrolling is a major component of navigation in digital documents. Existing literature demonstrates that scrolling is the main method for moving through an electronic text [2,9]. It is therefore worthwhile improving this particular element of document reader software. Our interviewees reported that slow and erratic response to scrolling was one irritation experienced in using Acrobat and similar software. This in part stems from a relatively simple rendering paradigm in document reader software: regardless of scrolling speed, all content is rendered. However, it is worth considering why people scroll rapidly through digital documents. Data from many sources indicates that rapid scrolling is caused by a desire to obtain a quick overview of a document. Liu [9] and Marshall [11] both note this behaviour, and relate its occurrence to gaining a general impression of a document’s content and structure. O’Hara and Sellen [12] report similar patterns of users’ document handling nearly a decade earlier. This suggests that this pattern is a deeply ingrained and persistent one, and good reader software will be designed with this behaviour in mind. In this section, we articulate four alternative designs for scrolling across PDF documents. We combine two different approaches: adjusting scrolling strategies on one side, and using different rendering techniques on the other. Using two different methods for each of the two approaches results in four different combinations. We will now discuss the scrolling and rendering methods we have used in turn, before demonstrating the resulting designs in use. In the following section, we will report on a four–way comparison of these different methods in a laboratory user study.
3.1 Speed–Dependent Automatic Zooming (SDAZ)
Traditional scrolling presents a simple direct–manipulation method for moving across a document. One novel method that has been popularised in recent years is SDAZ. In SDAZ, as a user scrolls across a surface more rapidly, the view zooms out, to give a wider overview of the content (see Fig. 1). This seems to be a reasonable candidate for document navigation, and indeed has been tested by Cockburn et al [1] for such purposes.
Fig. 1. Document display with traditional (left) and SDAZ (right) scrolling. In SDAZ, the view has shrunk the text and content, providing a less detailed but broader view.
One of the findings in Cockburn’s study suggests that SDAZ provides a more rapid method of targeting “large document features”, such as major headings and images, than traditional linear scrolling. However, in that study users were cued with a known target object and an indicated direction of movement. This did not match the problem we are studying: in our targeting task, users will often be moving towards a target that is at best partially anticipated, and where the direction is at best partially assumed (i.e. to be below the first page). We therefore needed to retest this proven method for our new task. We developed an extension of our existing document reader software [5] that encompassed both linear scrolling and SDAZ scrolling. The two different modes are represented in Figure 1.
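At its core, SDAZ is a mapping from scroll speed to zoom level: the faster the user scrolls, the further the view zooms out, within fixed bounds. The sketch below illustrates one such mapping; the constants are illustrative assumptions, not the values used in the system described here.

```python
def sdaz_zoom(scroll_speed: float,
              min_scale: float = 0.25,      # most zoomed-out view allowed
              max_scale: float = 1.0,       # normal 100% view when (almost) stationary
              speed_for_min: float = 3000.0) -> float:
    """Return a rendering scale for the current scroll speed (pixels/second).

    The scale falls linearly from max_scale at rest to min_scale at
    `speed_for_min`, and is clamped for any faster scrolling.
    """
    speed = min(abs(scroll_speed), speed_for_min)
    t = speed / speed_for_min
    return max_scale - t * (max_scale - min_scale)

for speed in (0, 500, 1500, 3000, 6000):
    print(speed, round(sdaz_zoom(speed), 2))   # 1.0, 0.88, 0.62, 0.25, 0.25
```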
3.2 Semantic Rendering
Our second design consideration is that of document rendering. Our previous study and the interview data both revealed that slow rendering directly impacted upon users’ document navigation. We had also discovered that visual clutter negatively impacts user performance during visual search [5]. There are multiple techniques used in the desktop publishing industry to alleviate rendering problems. These can be observed in design software such as Adobe Illustrator: first, there is full rendering, which yields a high quality view of the document at a high time cost; second, threading can be used to progressively render a complex view with increasing fidelity as the same document fragment remains visible; third, a reduced content display can be used (e.g. outlines only) which produces a general impression of the content very rapidly.
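The sketch below is our own simplified illustration (not taken from Illustrator or from the paper's software) of how a reader might choose between these three strategies using the current scroll speed and an assumed per-frame time budget.

from enum import Enum

class RenderMode(Enum):
    FULL = "full"                # complete, high-quality page render
    PROGRESSIVE = "progressive"  # refine the visible fragment over successive frames
    OUTLINE = "outline"          # reduced-content impression, e.g. outlines only

def choose_render_mode(scroll_speed_px_per_s, last_frame_ms, frame_budget_ms=33):
    """Pick a rendering strategy; all thresholds are illustrative assumptions."""
    if scroll_speed_px_per_s > 1000:
        return RenderMode.OUTLINE      # content is flying past: cheapest view
    if last_frame_ms > frame_budget_ms:
        return RenderMode.PROGRESSIVE  # falling behind: refine incrementally
    return RenderMode.FULL             # slow or idle scrolling: render everything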
Fig. 2. SDAZ with different levels of zoom: full (left) and semantic (right) rendering
We developed a semantically (or, more precisely, typographically) informed rendering method that combines progressive and outline rendering. Instead of reducing filled volumes to mere outlines, we select heading text to be displayed first, and then the remainder of the content is progressively rendered in decreasing order of significance: e.g. heading, sub–heading, caption, and emphasised text. This results in a relatively rapid display during rapid scrolling, and minimises interactional lags, where a user’s scrolling position is far in advance of the currently displayed text.
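A minimal sketch of the ordering step follows, assuming a simple numeric rank per typographic role; the roles come from the text above, but the ranking mechanism and block format are our own illustration rather than the implemented renderer.

# Lower rank = rendered earlier; body text is drawn last, once scrolling settles.
SIGNIFICANCE = {"heading": 0, "subheading": 1, "caption": 2, "emphasis": 3, "body": 4}

def progressive_render_order(text_blocks):
    """Order (role, text) blocks so the most significant typography appears first.

    Unknown roles are treated as body text; the block format is hypothetical.
    """
    return sorted(text_blocks, key=lambda block: SIGNIFICANCE.get(block[0], 4))

blocks = [("body", "Lorem ipsum ..."), ("heading", "4 User Study"), ("caption", "Fig. 1 ...")]
print([role for role, _ in progressive_render_order(blocks)])  # ['heading', 'caption', 'body']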
3.3 Implementation
We developed four different interfaces built upon the same underlying software: 1) traditional linear scrolling with full rendering; 2) linear scrolling with semantic rendering; 3) SDAZ with full document content; 4) SDAZ with semantic rendering. The software was built on JPedal, a Java–based PDF rendering suite.
4 User Study
We undertook a user study to evaluate the four different interfaces. In this section, we first report the study design, before presenting the results from the experiment.
4.1 Study Design
To provide a testbed for the study, we selected eight different PDF documents on a single computer–science related topic. For each document, we created a set of simple navigational tasks, informed by our knowledge of typical goals that users seek to satisfy during document triage. Examples would include “find the conclusions section to identify the main contribution of the paper”, or “locate where the paper discusses accessibility for haptic interfaces”. For each task, we had a known target point in the document. There were four types of target: a literal–match on a heading (e.g. “conclusions section”); a non–literal heading match; a match on the body text, and a false (absent) target. By varying the literalness and visibility of the target across different tasks, the results would allow us to discriminate between these factors. The basic form of the study followed a conventional pattern of pre–study questionnaire, the main experimental session, and a post–study semi–structured interview. The induction questionnaire obtained basic demographic information and the participant’s experience with using document reader software such as Acrobat. The post–study interview gleaned subjective information about the participant’s experience with the different systems that they encountered. The main part of the experiment consisted of the participant using all four example systems we described above. For each system, the observer demonstrated the software in use on a standard example document. The participant was then given an open–ended time to familiarise themselves with the system. The study then progressed to 32 small–scale tasks over eight documents (one task of each target type per document). A simple questionnaire was completed for each separate interface, together with a summative evaluation to compare all four interfaces. To balance for ordering, learning and dependency factors, a latin–squared design was used throughout. Combinations of system, task, document and order were used to minimise biases in the data. The questionnaire on each system captured the participant’s immediate subjective ratings of that system. The concluding interview probed the participant’s assessment of each system in detail, to gain further insight into the comparative advantages and disadvantages of the interfaces. To ensure that learning effects were not introduced into the study, we sought to recruit users who were experienced with digital document navigation and the tools and concepts behind scrolling techniques. Therefore for the study we recruited participants who were undertaking research and likely to be making regular use of digital documents. For the experiment, we recruited a total of sixteen computer science students, aged from 20 to 37. Participants were either in the final semester of a bachelor’s degree, or engaged in postgraduate study. Potential participants were vetted for dyslexia and uncorrected sight defects, and we present here only data from participants who passed these criteria (one dyslexic candidate was excluded).
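For readers unfamiliar with the counterbalancing step, the sketch below generates a balanced Latin square for four interface conditions. It is a generic textbook construction under our own labelling of the conditions, not the authors' actual allocation table.

def balanced_latin_square(conditions):
    """Return one presentation order per participant group using a balanced
    Latin-square construction (valid when the number of conditions is even)."""
    n = len(conditions)
    rows = []
    for i in range(n):
        order, fwd, back = [], 0, 0
        for step in range(n):
            if step % 2 == 0:
                idx = (i + fwd) % n
                fwd += 1
            else:
                back += 1
                idx = (i + n - back) % n
            order.append(conditions[idx])
        rows.append(order)
    return rows

for row in balanced_latin_square(["linear/full", "linear/semantic", "SDAZ/full", "SDAZ/semantic"]):
    print(row)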
4.2 Results
The experiment captured a volume of data, from which we will only extract the most authoritative findings. We commence with an evaluation of differences in user behaviour during the study, before progressing to analyse the participants’ subjective feedback elicited through the questionnaires and interview. Log Data. Each participant completed 8 tasks in every interface mode. Tasks were of four types, as described in the experimental design. We analysed the data for each type of task separately, using a two–way ANOVA in each case. For tasks where the body text matched the task, or the task was to find content not in the document, there were no discernible effects. Only when the targets were headings did any significant results emerge. The most striking result occurred in the exact heading match tasks: the two–factor ANOVA produced three significant results. Comparing SDAZ versus linear scrolling produced p=0.0360, F=4.543, F–crit=3.954. This indicated a significant difference between the two modes, with linear scrolling proving superior (average times of 17.50s vs. 24.20s). The comparison of full text versus semantic rendering modes resulted in p=0.0089 (F=7.171, F–crit=3.954), revealing an even stronger outcome, this time with the full rendering proving inferior (average of 25.05s vs 16.52s). Finally, the interaction test revealed p=0.0176 (F=5.861, F–crit=3.954). (df=1,1,1,84 for the test as a whole). The underlying reason for these outcomes is readily understood when it is stated that the SDAZ full–text mode produced an average time of 32.40s against averages of between 16.17s and 17.92s for the other three modes. Put plainly, there was strong evidence that in the SDAZ full text mode even heading text was unreadable for most participants. When semantic zooming was applied, the SDAZ method performed at a level comparable with linear scrolling. Task Completion Data. The rate of completion on heading match tasks was very high. Only 5 out of 128 exact heading match tasks and 8 topical heading match tasks were not completed. Of the 128 body–text match tasks, however, there was a very high failure rate: 69 out of 128. These failures were fairly evenly distributed across all modes, and applying a Chi–square test to the data gave a result indicating p=0.985. This is clearly very far from suggesting any significant effect from any mode. Subjective Ratings. The participants rated each interface for nine task–focussed factors on a seven point Likert scale. These were subsequently evaluated through a two–factor ANOVA to determine where perceptible effects were noted. The first item that returned a significant result was the rating “clarity of document display”. Here, we encountered p=0.038 (df=1,1,1,44; F=4.566,0.943,0.604; F–crit=4.062). The significant difference was produced from the comparison of the rendering method: semantic presentation was rated lower than full text (avg.=4.21, var=2.95 vs avg.=5.125, var=1.42). Interestingly, no reliable difference emerged in the rating “clarity of document display when scrolling” (F=0.030,1.921,0.120; F–crit=4.062). Nor was any effect uncovered when comparing “ease of use”.
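For readers who wish to run the same style of analysis over their own logs, a hedged sketch of a two-factor ANOVA on task-completion times follows, using pandas and statsmodels. The file name and column names are hypothetical, and this is not the authors' analysis script.

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Hypothetical log format: one row per completed task, with the two factors
# (scrolling: linear/SDAZ; rendering: full/semantic) and the time in seconds.
logs = pd.read_csv("task_times.csv")  # columns: participant, scrolling, rendering, seconds

model = ols("seconds ~ C(scrolling) * C(rendering)", data=logs).fit()
print(sm.stats.anova_lm(model, typ=2))  # main effects plus the interaction term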
The rating “Ease of seeing document structure” produced an interaction effect (p=0.008; F=0.028,0.713,7.31; df=1,1,14, etc.). The semantic SDAZ and traditional linear scrolling modes were rated inferior to the SDAZ full–text and linear semantic presentations. This may suggest that users are able to benefit from either SDAZ or semantic presentation, but the combination is too unfamiliar or demanding. Further investigation will be required. The second factor that produced a significant result was “Time to evaluate document size”. Here, the average ratings given varied from 2.83 (linear full–text mode) to 5.25 (semantic SDAZ). ANOVA yielded p=0.0005 (df=1,1,1,44; F=1.715,14.094,0.191). This reveals a significant impact from the scrolling method: SDAZ-based methods were faster than the linear scroll methods. A related third factor also produced a conclusive result: “Time to identify document structure”. ANOVA resulted in p=0.021 (df=1,1,1,44; F=1.033,5.587,1.560; F–crit=4.062). The significant difference was again between the SDAZ and linear scrolling modes. No significant effect was computed from the application of semantic zoom. These tests paint a contrasting picture, with the SDAZ methods yielding a better overview of the document in a short time, while linear methods were preferred in terms of the quality of display. We will now report the subjective feedback from our participants. Subjective Feedback. Our participants gave plentiful subjective feedback on their experience of using the four different systems. The impact of the different presentations was pronounced. Criticisms of the semantic presentation were commonplace, with the clear message being the low acceptability of losing document text: “this is completely unnatural” and “I’m not seeing the real document” are just two examples of a consistent hostility to the presentation format. The “non–natural” effect was specifically highlighted by ten participants. The SDAZ method also produced many reservations from our users. Concerns here were about the visibility of headings. These reservations were – unsurprisingly – more consistent in the full–text mode, where they were shared by 11 participants. However, even in the semantic presentation, six participants expressed the same problem.
5 Discussion
The results of our study raised a number of different issues. Subjectively, users preferred the familiarity of the traditional linear scrolling technique, and the display of full text. However, there was only minimal evidence of any actual performance advantage for this mode. In contrast, though the combined SDAZ and semantic rendering interface was subjectively poorly rated, its users attained a similar completion rate on every type of task, and in very similar times. The combination of SDAZ with full text display was, however, noticeably inferior to all other interfaces for simple heading–match tasks. It was a closer match in all other tasks.
Our experiment demonstrates that the SDAZ method, while comparable in performance for many tasks, is not well accepted in the context of reading text. On the positive side, the zooming feature of SDAZ can give a quick impression of the size of the document more readily than linear scrolling. On the other hand, the low resolution of the text makes reading even headings very difficult when using full–content rendering. Semantically–enhanced SDAZ outperforms traditional SDAZ when a user is searching for content that appears in headings, but there is no difference for smaller content (e.g. body text). Omitting the document content when scrolling is rapid proved to be unsatisfactory for most of our participants. The role of body text in skim reading is problematic: while users are dissatisfied when it is not displayed, they nonetheless are known to rate it as relatively unimportant when deciding document relevance in an information seeking task [7], and our own data shows that visual search is rather ineffective, with over 50% of visual searches failing. This raises rather significant questions for future research in the field, particularly regarding how to support a user’s visual search. While within–text search features are commonplace in document readers, studies repeatedly demonstrate a very low rate of use (e.g. [10]). If our participants are typical of the general user population, artificial renderings of content, even if more task–effective, will face considerable resistance and low subjective ratings. This last point is a significant issue for the development of digital libraries. If users are demonstrably poor at picking out key text for their information need from the document body, assistance will be required. However, users’ subjective dependence on seeing the “authentic” document will strongly constrain the range of designs intended to support their visual search. Our data clearly demonstrates that visual search performance is poor, and as this activity is a key step in the information seeking of most users, progress is urgently required to improve the efficacy of DLs as a whole.
6 Conclusion and Future Work
Reading digital documents is a commonplace task where there are clear outstanding issues with user interaction. This has direct relevance to digital libraries: as users increasingly read digital documents “online” rather than on paper, these deficiencies become ever more important to address. This paper has demonstrated that novel techniques for visual search, even within digital documents, cannot be used uncritically when the user is scanning text. We also identified that using document content and presentation can close usability deficits where they exist. Applying semantic rendering did improve many parts of the reading process, but does not yet perfect the SDAZ method. SDAZ provides a route for improving the rapid overview of a document, but it proves less effective for detailed reading. The limited resolution of digital displays, combined with the persistence effects commonplace even on modern LCD computer screens, results in poor legibility of small or briefly viewed text. These limitations are unlikely to disappear in the near future.
Our research into SDAZ as a method for supporting the rapid navigation of text shows that a naive use of the method leads to difficulties in user satisfaction and also some problems with effectiveness. Nonetheless, there are benefits and successes that suggest persistence is required to uncover its long–term potential. Our semantically–enhanced SDAZ presentation performed much closer to the traditional interface than its full–text rendering sibling. The main drawback was the elimination of body text. One avenue for research is how best to reconstitute the presentation of body text in the semantic rendering mode. An additional strand of research would be to investigate semantic rendering with a set of users who are likely to have different reading habits to those tested in our study. Based on the findings of the research presented in this paper, it would be sensible to introduce a broader audience to the concepts we have introduced. A different set of participants may yield yet more information as to how to improve SDAZ. Both SDAZ and linear scrolling methods proved ineffective in supporting a user in locating information hidden in the main body, and this is confirmed as a major interaction challenge. SDAZ and other presentation modes have shown that interactions can be improved and user needs both better understood and satisfied through systematic research. Similar diligence must now be applied to visual search across the plain text of digital documents.
Acknowledgments This research is supported by EPSRC Grant EP/F041217.
References
1. Cockburn, A., Savage, J., Wallace, A.: Tuning and testing scrolling interfaces that automatically zoom. In: CHI 2005: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 71–80. ACM, New York (2005)
2. Buchanan, G.: Rapid document navigation for information triage support. In: Proc. ACM/IEEE Joint Conference on Digital Libraries, June 2007, p. 503. ACM Press, New York (2007)
3. Buchanan, G., Loizides, F.: Investigating document triage on paper and electronic media. In: Kovács, L., Fuhr, N., Meghini, C. (eds.) ECDL 2007. LNCS, vol. 4675, pp. 416–426. Springer, Heidelberg (2007)
4. Buchanan, G., Owen, T.: Improving navigation in digital documents. In: Procs. ACM/IEEE Joint Conference on Digital Libraries, pp. 389–392. ACM, New York (2008)
5. Buchanan, G., Owen, T.: Improving skim reading for document triage. In: Proceedings of the Symposium on Information Interaction in Context (IIiX), pp. 83–88. British Computer Society (2008)
6. Buchanan, G., Pearson, J.: Improving placeholders in digital documents. In: Christensen-Dalsgaard, B., Castelli, D., Ammitzbøll Jurik, B., Lippincott, J. (eds.) ECDL 2008. LNCS, vol. 5173, pp. 1–12. Springer, Heidelberg (2008)
7. Cool, C., Belkin, N.J., Kantor, P.B.: Characteristics of texts affecting relevance judgments. In: 14th National Online Meeting, pp. 77–84 (1993)
8. Flanders, J.: The body encoded: Questions of gender and electronic text. In: Sutherland, K. (ed.) Electronic Text: Investigations in Method and Theory, pp. 127–144. Clarendon Press (1997)
9. Liu, Z.: Reading behavior in the digital environment. Journal of Documentation 61(6), 700–712 (2005)
10. Loizides, F., Buchanan, G.R.: The myth of find: user behaviour and attitudes towards the basic search feature. In: JCDL 2008: Proceedings of the 8th ACM/IEEE-CS Joint Conference on Digital Libraries, pp. 48–51. ACM, New York (2008)
11. Marshall, C.C., Bly, S.: Turning the page on navigation. In: JCDL 2005: Proceedings of the 5th ACM/IEEE-CS Joint Conference on Digital Libraries, pp. 225–234. ACM, New York (2005)
12. O’Hara, K., Sellen, A.: A comparison of reading paper and on-line documents. In: CHI 1997: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 335–342. ACM, New York (1997)
PODD: An Ontology-Driven Data Repository for Collaborative Phenomics Research Yuan-Fang Li, Gavin Kennedy, Faith Davies, and Jane Hunter School of ITEE, The University of Queensland {uqyli4,g.kennedy1,f.davies,j.hunter1}@uq.edu.au
Abstract. Phenomics, the systematic study of phenotypes, is an emerging field of research in biology. It complements genomics, the study of genotypes, and is becoming an increasingly critical tool to understand phenomena such as plant morphology and human diseases. Phenomics studies make use of both high- and low-throughput imaging and measurement devices to capture data, which are subsequently used for analysis. As a result, high volumes of data are generated on a regular basis, making storage, management, annotation and distribution a challenging task. Sufficient contextual information, the metadata, must also be maintained to facilitate the dissemination of these data. The challenge is further complicated by the need to support emerging technologies and processes in phenomics research. This paper describes our effort in designing and developing an ontology-driven, open, extensible data repository to support collaborative phenomics research in Australia. Keyword: OWL, ontology, repository, phenomics, data management.
1 Introduction
An organism’s phenotype is an observable or quantifiable trait of the organism as a consequence of its genetic makeup combined with its developmental stage, environment and disease conditions. Phenomics is the systematic and comprehensive study of an organism’s phenotype, which is determined through a combination of high-throughput and high-resolution imaging- and measurement-based analysis platforms. Phenomics research, together with genomics research, represents a holistic approach to biological study [3,11,7]. Unlike genomics, phenomics research emphasizes physical, observable traits of the subject under study. Like genomics, vast amounts of data are produced by imaging and measurement platforms and analysis tools. The storage, management, analysis and publication of these data is a challenging problem. Specifically, there are three key challenges for data management in phenomics research.
– The ability to provide a data management service that can manage large quantities of heterogeneous data in multiple formats (text, image, video) and not be constrained to a finite set of imaging and measurement platforms and data formats.
– The ability to support metadata-related services to provide context and structure for data within the data management service to facilitate effective search, query and dissemination.
– The ability to accommodate evolving and emerging technologies and processes as phenomics is still a rapidly developing field of research.
The Phenomics Ontology Driven Data (PODD) repository is being developed to meet the above challenges facing the Australian phenomics research community, aiming at providing efficient and flexible repository functionalities for large-scale phenomics data. An important goal of PODD is to provide a mechanism for maintaining structured and precise metadata around the raw data so that they can be distributed and published in a reusable fashion. Differences in research project reporting, organisms under study, research objectives, research methodologies and imaging and measurement platforms may result in differences in the models of data. In order to accommodate as wide a variety of biological research activities as possible, we have constructed the domain model using OWL [2] ontologies, instead of the traditional UML class diagrams and database schemas. The OWL domain model is at the core of the PODD repository as it drives the creation, storage, validation, query and search of data and metadata. In contrast to traditional data repositories that use database schemas as the underlying model, the employment of OWL ontologies as the domain model makes PODD highly extensible. In this paper, we present our work in addressing the above challenges and highlight the OWL-based modeling approach we take. The rest of the paper is organized as follows. In Section 2 we present some related work and give a brief overview of PODD and relevant technical background. Section 3 presents the high-level design of the repository. In Section 4, we discuss the PODD domain ontology in more detail and show how the ontology-based modeling approach is used in the life cycle of domain objects. Finally, Section 5 concludes the paper and identifies future directions.
2 Overview
2.1 Related Work
In biological research, a large number of databases have been developed to host a variety of information such as genes (Ensembl), proteins (UniProt), scientific publications (PubMed) and micro-array data (GEO). These databases are generally characterized by the fact that they specialize in a particular kind of data (protein sequences, publications, etc.) and the conceptual domain model is relatively well understood and stable. As a result, the need for extensibility and flexibility is not very high.
Phenomics is a fast growing discipline in biology and new technologies and processes are evolving and emerging rapidly. As a result, the domain model must be flexible enough to cater for such changes. Currently there are a number of related domain models available. The Functional Genomics Experiment Model (FuGe) [4] is an extensible modeling framework for high-throughput functional genomics experiments, aiming at increasing the consistency and efficiency of experimental data modeling for the molecular biology research community. Centered around the concept of experiments, it encompasses domain concepts such as protocols, samples and data. FuGe is developed using UML, from which XML Schemas and database definitions are derived. The FuGe model covers not only biology-specific information such as molecules, data and investigation, but also defines commonly-used concepts such as audit, reference and measurement. Extensions in FuGe are defined using inheritance of UML classes. We feel that the extensibility we require is not met by FuGe, as any addition of new concepts would require the development of new database schemas and code. Moreover, the concrete objects reside in relational databases, making subsequent integration and dissemination more difficult. The Ontology for Biomedical Investigations (OBI, http://obi-ontology.org/) is an on-going effort to develop an integrative ontology for biological and clinical investigations. It takes a top-down approach by reusing high-level, abstract concepts from other ontologies. It includes 2,600+ OWL [2] classes and 10,000+ axioms (in the import closure of the OBI ontology). Although OBI is very comprehensive, its size and complexity make reasoning and querying of OBI-based ontologies and RDF graphs computationally expensive and time consuming.
2.2 The PODD Repository
Under the National Collaborative Research Infrastructure Strategy (NCRIS), the Australian Government has funded two major phenomics initiatives: The Australian Plant Phenomics Facility (APPF), specializing in phenotyping crop and model plant species; and the Australian Phenomics Network (APN), which specializes in the phenotyping of mouse models. Both facilities have common requirements to gather and annotate data from both high- and low-throughput phenotyping devices. The scale of measurement can be from the micro or cellular level, through the level of a single organism, and up to (in the case of the APPF) the macro or field level. An organism’s phenotype, observable and quantifiable traits, is often the product of the organism’s genetic makeup, its development stage, disease conditions and its environment. Any measurement made against an organism needs to be recorded in the context of these other data. The opportunity exists to create a repository to record the data, its contextual data (metadata) and data classifiers in the form of ontological or structured vocabulary terms. The structured nature of this repository would support manual and autonomous data discovery
as well as provide the infrastructure for data-based collaborations with domestic and international research institutions. Currently there are no such integrated systems available to the two facilities. The National eResearch Architecture Taskforce (NeAT) Australia initiated the PODD project to fill this gap. In PODD, we have engaged in the design and development of the Phenomics Ontology Driven Data repository. The goal of PODD is to capture, manage, annotate and distribute the data generated by phenotyping platforms. It supports both Australian and international biological research communities by providing repository and data publication services.
2.3 The OWL Ontology Language
The Web Ontology Language (OWL) [2] is one of the cornerstone languages in the Semantic Web technology stack. Based on description logics [6], OWL DL (a sub-species of OWL) defines a precise and unambiguous semantics, which is carefully crafted so that it is very expressive yet core reasoning tasks can be fully automated. Information in OWL, as in RDF [5], is modeled in triples: subject, predicate, object, where subject is the entity of interest, predicate represents a particular characteristic/property of the entity and object is the actual value of that property. Classes are first-class citizens in OWL. They represent abstract concepts in a particular domain. Concrete objects are represented by OWL individuals. OWL predicates are used to relate OWL entities (classes, individuals, predicates) to their attributes or other entities.

C ::= C              – class name
    | ⊤              – top class
    | ⊥              – bottom class
    | C ⊔ C          – class union
    | C ⊓ C          – class intersection
    | ¬C             – class negation
    | ∀ P.C          – universal quantification
    | ∃ P.C          – existential quantification
    | P : o          – value restriction
    | ≥ n P          – at-least number restriction
    | ≤ n P          – at-most number restriction
    | {a1, ..., an}  – enumeration

Fig. 1. OWL expressions

AX ::= C ⊑ C         – class subsumption
     | C ≡ C         – class equivalence
     | C ⊓ C ≡ ⊥     – class disjointness
     | P ⊑ P         – property subsumption
     | P ≡ P         – property equivalence
     | ≥ 1 P ⊑ C     – property domain
     | ⊤ ⊑ ∀ P.C     – property range
     | ⊤ ⊑ ≤ 1 P     – functional property
     | P ≡ (−P)      – inverse property

Fig. 2. OWL class and predicate axioms
Class descriptions can be used to construct (anonymous) complex class expressions from existing ones, as shown in Figure 1. C represents (possibly complex) class expressions; C is a class name; P stands for a predicate; n is a natural number and the ai’s are individuals. OWL uses axioms to place restrictions
on OWL classes and predicates. These axioms include class subsumption, equivalence, disjointness; predicate domain, range, etc. Figure 2 shows some of the axioms. The OWL language has been widely used in life sciences and biotechnology [10,12,1] as a modeling language for its expressivity and extensibility. There is also growing tool support for tasks such as reasoning, querying and visualization, making it a viable option for the modeling and representation of domain concepts and objects in phenomics.
3 High-Level Design of PODD
PODD is intended to be an open platform that allows any user to access data that is either published or explicitly shared with them by the data owners. Moreover, PODD has been envisioned to provide data management services for a wide variety of research projects and phenotyping platforms. The key design considerations of PODD include:
– Data storage and management. It is estimated that several TB of data will be generated by PODD clients per year. Hence, the ability to efficiently manage large volumes of data is crucial.
– Repository reusability. Data generated by a wide range of projects and phenotyping platforms will be stored in the repository. Hence, the domain model needs to be flexible enough to cater for the administrative, methodological and technical differences across projects and platforms.
– Data persistence and identification. In order to support the dissemination of scientific findings, data in the repository needs to be publicly accessible after being published. Hence, a persistent naming scheme is required.
In the development of PODD we employ a number of core technologies to meet the above requirements.
– We use Fedora Commons, a digital repository for the management, storage and retrieval of domain objects.
– We use iRODS [9], a distributed, grid-based storage software system, for the actual storage solution of domain objects across a virtual data fabric.
– We incorporate the Sesame triple store for the storage and query of RDF triples (ontology definitions of concrete objects).
– We use the Lucene open-source search engine for the full-text index and search of repository contents, including values in the RDF triple store.
The high-level architecture of the PODD repository can be seen in Figure 3.
Fig. 3. The high-level architecture of the PODD repository
4 Ontology-Based Domain Modeling
One of the important design decisions to be made early in the development process is the domain model. Domain modeling aims at providing solutions for two important tasks: firstly, efficient and flexible data organization; and secondly, data contextualization in the form of metadata, so as to provide meaning for the raw data (documents, publications, image files, etc.) to facilitate search, query, dissemination, and so on. As we emphasized previously, the domain model should be flexible enough to accommodate the rapid changes and advancement of phenomics research. Inspired by FuGe and OBI, we created our own PODD ontology in OWL to define essential domain concepts and relations.
4.1 The PODD Ontology
As PODD is designed to support the data management tasks for phenomics research, a number of essential domain concepts and the relationships between these concepts need to be modeled. Domain concepts will be modeled as OWL classes; relationships between concepts and object attributes will be modeled as OWL object- and datatype-predicates. Concrete objects will be modeled as OWL individuals. The Domain Concepts. In our modeling, the top-level concept is Project, which is an administrative concept and contains essential meta information about the research project, such as the administering organization, principal investigator, project membership, project status, etc. A number of concepts may be associated with the project concept.
– Project Plan: describes the current project plan at the core metadata element level.
– Platform: describes any single technical measurement platform used in the project. A technical measurement platform means any platform for which parameters and parameter values may be captured.
– Genotype: describes the genotype of the materials used in the investigations. Multiple genotypes are described here and can then become fields in the instances of the Material object.
– Investigation: describes a planned process within a project that executes some form of study design and produces a coherent set of results. It can be considered equivalent to an experiment.
The Investigation concept is of central importance. It captures the data and metadata of experiments under a project. A number of concepts are defined to assist in the modeling of investigations.
– Experimental Design: describes experimental design components, e.g. plant layouts, sampling strategies, etc.
– Growth Condition: describes growth conditions, such as growth chambers used, environmental settings, etc.
– Process: represents a planned component of an investigation. It is a description of a series of steps taken to achieve the objective of the investigation.
– Protocol: describes a step within a process that is a consistent whole, e.g. sterilize seeds, plant seedlings, image the plants.
– Material: describes the materials used in the investigation. Materials can be either inputs or outputs. They can be chemicals, substrates, whole organisms or samples taken from a whole organism. The meaning of the Material is usually derived from its position in the model (as well as core metadata elements such as Type).
– Event: captures ad-hoc events and actions that occur against an individual material. In most instances the events and their timing are described in the process/protocol. An Event object can be utilized either to record fixed events in a form that allows for investigative analysis, or to record one-off observations (e.g., the plant under observation died).
– Measurement: describes a single measurement against a single material, e.g. an image of a plant is a single measurement. Measurement objects can capture measurement variables (e.g. shutter speed, lighting, etc.).
– Analysis: an Analysis object is a variation on a protocol object, in that it describes a step in a process. It is currently linked to the Project and Investigation objects as well as to specific Measurement objects, since analyses can be performed on single measurements and also on the entire investigation or project based on multiple inputs.
Inter-Concept Relationships. The structures and workflows of phenomics research activities are captured using inter-concept relationships, which are defined using OWL predicates.
Different research projects will utilize different measurement platforms and have significantly different approaches and project designs. As a result, different projects may have different structures. In OWL, there are a number of ways of defining the same relationship. In order to achieve high modeling flexibility and accommodate as many scenarios as possible, we have made the following design decisions:
– Use OWL restrictions to define inter-concept relationships. OWL restrictions impose constraints on the OWL classes they are defined in.
– Only define the domain, or the range, but not both, for predicates, so that the predicates can be used by different concepts.
For each of the concepts described in the previous subsection except for Project, we define an OWL object-predicate with the range being the concept. For example, for the concept Analysis, we define a predicate hasAnalysis and define its range to be Analysis.
Object Attributes. Attributes are inherent properties of an object, such as the start date of a project, the timestamp of an event, etc. In our model, we use OWL datatype-predicates to model object attributes, similar to the modeling of inter-object relationships. Figure 4 shows the partial definition of the OWL class Project, in OWL DL syntax [2]. Restriction 1 states that any Project instance must have exactly one ProjectPlan (through the predicate hasProjectPlan, the range of which is ProjectPlan). The other three restrictions are similarly defined.

Project ⊑ = 1 hasProjectPlan       (1)
        ⊑ ≥ 1 hasInvestigation     (2)
        ⊑ = 1 hasStartDate         (3)
        ⊑ ≤ 1 hasPublicationDate   (4)

Fig. 4. Partial OWL Definition for the Project concept
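As an illustration of how one such restriction can be serialised as RDF triples, the following rdflib sketch states that Project is a subclass of an anonymous owl:Restriction with cardinality 1 on hasProjectPlan. The namespace URI is an assumption made for the example, not the published PODD namespace.

from rdflib import Graph, Namespace, BNode, Literal
from rdflib.namespace import OWL, RDF, RDFS, XSD

PODD = Namespace("http://example.org/podd/ns#")  # assumed namespace for illustration
g = Graph()

# Project subClassOf (= 1 hasProjectPlan), expressed as an anonymous owl:Restriction.
restriction = BNode()
g.add((restriction, RDF.type, OWL.Restriction))
g.add((restriction, OWL.onProperty, PODD.hasProjectPlan))
g.add((restriction, OWL.cardinality, Literal(1, datatype=XSD.nonNegativeInteger)))
g.add((PODD.Project, RDF.type, OWL.Class))
g.add((PODD.Project, RDFS.subClassOf, restriction))

print(g.serialize(format="turtle"))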
4.2 Ontology-Based Model in Object Life Cycle
Concrete objects, instantiations of various concepts such as Project and Investigation, are stored in PODD and can subsequently be retrieved for different purposes. As stated in Section 1, the ontology-based domain model is at the center of the whole life cycle of objects. In this subsection, we briefly present the roles the ontology-based model performs at various stages of the object life cycle. Ingestion. When an object is created, the user specifies which type of object she intends to create and the repository will pull up all the ontological definitions for that type (from the OWL class corresponding to that type and its super classes). Such definitions will be used to (a) guide the rendering of object creation interfaces and (b) validate the attributes and inter-object relationships the user has entered before the object is ingested.
Retrieval & update. When an object is retrieved from the repository, its attributes and inter-object relations are retrieved from its ontology definition, which is used to drive the on-screen rendering. When any value is updated, it is validated and updated in the object’s ontology definition. Query & search. An object’s ontology definitions will be stored in an RDF [5] triple store, which can be queried using the SPARQL [8] query language. Similarly, ontology definitions can be indexed and searched by search engines such as Lucene. In summary, ontology-based domain modeling enables us to build very expressive and extensible phenomics domain models. Ample tool support is also available to perform ontology-based tasks such as validation, querying and searching.
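A brief sketch of the query step is shown below, using SPARQLWrapper against a Sesame-style SPARQL endpoint. The endpoint URL, namespace and property names are assumptions for illustration rather than the deployed PODD configuration.

from SPARQLWrapper import SPARQLWrapper, JSON

# Assumed repository endpoint; the real deployment URL will differ.
sparql = SPARQLWrapper("http://localhost:8080/openrdf-sesame/repositories/podd")
sparql.setQuery("""
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX podd: <http://example.org/podd/ns#>
    SELECT ?investigation ?label WHERE {
        ?project a podd:Project ;
                 podd:hasInvestigation ?investigation .
        OPTIONAL { ?investigation rdfs:label ?label }
    }
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["investigation"]["value"], row.get("label", {}).get("value", ""))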
5 Conclusion
Phenomics is an emergent discipline that is poised to have a significant impact upon industrial-scale biological research. Phenomics presents a number of data management challenges, such as managing high volumes of data, integrating highly heterogeneous datasets and ensuring the data will exist in perpetuity. To meet the data management needs of the Australian phenomics research community, the PODD repository is being developed to enable efficient storage, retrieval, contextualization, query, discovery and publication of large amounts of data. Central to the design and development of PODD is the use of OWL ontologies as the domain model. Based on description logics and with an emphasis on the Web, the OWL language features a precise semantics, high expressivity and high extensibility. It also has mature and growing tool support. As a result, the OWL language has been widely used in bioinformatics and life sciences to mark up genetic, molecular and disease information. In the PODD model, core domain concepts are defined as OWL classes. Their attributes and relationships with other domain concepts are defined as OWL class restrictions. Concrete domain objects are then initialized, conforming to the ontologies defined for their concepts. Such a modeling approach has a number of benefits. – Firstly, concepts are defined with unambiguous syntax and precise semantics, enabling automated validation and rendering. – Secondly, the extensible nature of OWL language ensures that new concepts can be easily added. – Thirdly, as OWL and RDF are both open standards, interoperability between repositories is expected to be high. In this paper, we introduce the high-level architecture of the PODD repository and the main technologies used in developing it. We focus on the ontology-based domain modeling approach, present the PODD domain model and discuss its role in the life cycle of concrete domain objects.
The development of the PODD repository will focus on enhancing ontology-based modeling, representation and processing of phenomics data. Ontology-based annotation services, automated data integration and data visualization will be part of the future directions.
Acknowledgement The authors wish to acknowledge the support of the National eResearch Architecture Taskforce (NeAT) and the Integrated Biological Sciences Steering Committee (IBSSC). The authors wish to thank Dr Xavier Sirault and Dr. Kai Xu for the discussion on the development of the PODD ontology.
References
1. Ashburner, M., Ball, C.A., Blake, J.A., et al.: Gene Ontology: Tool for the Unification of Biology. Nat. Genet. 25(1), 25–29 (2000)
2. Horrocks, I., Patel-Schneider, P.F., van Harmelen, F.: From SHIQ and RDF to OWL: The Making of a Web Ontology Language. Journal of Web Semantics 1(1), 7–26 (2003)
3. Houle, D.: Numbering the Hairs on Our Heads: The Shared Challenge and Promise of Phenomics. Proceedings of the National Academy of Sciences (October 2009)
4. Jones, A.R., Miller, M., Aebersold, R., et al.: The Functional Genomics Experiment model (FuGE): an Extensible Framework for Standards in Functional Genomics. Nature Biotechnology 25(10), 1127–1133 (2007)
5. Manola, F., Miller, E. (eds.): RDF Primer (February 2004), http://www.w3.org/TR/rdf-primer/
6. McGuinness, D.L.: Configuration. In: Baader, F., Calvanese, D., McGuinness, D., Nardi, D., Patel-Schneider, P.F. (eds.) Description Logic Handbook, pp. 388–405. Cambridge University Press, Cambridge (2003)
7. Nevo, E.: Evolution of genome-phenome diversity under environmental stress. Proceedings of the National Academy of Sciences 98(11), 6233–6240 (2001)
8. Prud’hommeaux, E., Seaborne, A.: SPARQL Query Language for RDF (April 2006), http://www.w3.org/TR/2006/CR-rdf-sparql-query-20060406/
9. Rajasekar, A., Moore, R., Vernon, F.: iRODS: A Distributed Data Management Cyberinfrastructure for Observatories. In: American Geophysical Union, Fall Meeting 2007 (December 2007)
10. Ruttenberg, A., Rees, J., Samwald, M., Marshall, M.S.: Life Sciences on the Semantic Web: the Neurocommons and Beyond. Briefings in Bioinformatics 10(2), 193–204 (2009)
11. Sauer, U.: High-throughput Phenomics: Experimental Methods for Mapping Fluxomes. Current Opinion in Biotechnology 15(1), 58–63 (2004)
12. Smith, B., Ashburner, M., Rosse, C., et al.: The OBO Foundry: Coordinated Evolution of Ontologies to Support Biomedical Data Integration. Nature Biotechnology 25(11), 1251–1255 (2007)
A Configurable RDF Editor for Australian Curriculum Diny Golder1, Les Kneebone2, Jon Phipps1, Steve Sunter2, and Stuart A. Sutton3 1 JES & Co., Tucson AZ, U.S.A. {dinyg,jonp}@jesandco.org 2 Curriculum Corporation, Melbourne, Australia {les.kneebone,steve.sunter}@curriculum.edu.au 3 University of Washington, Seattle, WA, U.S.A. [email protected]
Abstract. Representing Australian Curriculum for education in a form amenable to the Semantic Web and conforming to the Achievement Standards Network (ASN) schema required a new RDF instance data editor for describing bounded graphs—what the Dublin Core Metadata Initiative calls a ‘description set’. Developed using a ‘describe and relate’ metaphor, the editor reported here eliminates all need for authors of graphs to understand RDF or other Semantic Web formalisms. The Description Set Editor (ASN DSE) is configurable by means of a Description Set Profile (DSP) constraining properties and property values and a set of User Interface Profiles (UIP) that relate the constraints of the DSP to characteristics of the user interface. When fully deployed, the editor architecture will include a Sesame store for RDF persistence and a metadata server for deployment of all RESTful web services. Documents necessary for configuration of the editor including DSP, UIP, XSLT, HTML, CSS, and JavaScript files are stored as web resources. Keywords: DCMI Abstract Model (DCAM), Sesame, description set profile (DSP), description set editor (DSE), resource description editor (RDE), DCMI Application Profile (DCAP), RDF.
1 Background
In Australia, the process of developing a national curriculum from kindergarten to Year 12 in specific areas is underway with the support and participation of all governments. The Australian publicly-owned Curriculum Corporation partnered with JES & Co., a U.S. nonprofit dedicated to the education of youth [1], to use the Achievement Standards Network (ASN) [2] to ensure that existing and future digital curriculum resources will link seamlessly to all curricula in the country, including the emerging national curriculum.
organizations [hereafter, “promulgators”] to prescribe what K-12 students should know and be able to do as a result of specific educational experiences. The achievement standards of interest are frequently called curriculum objectives in the cataloging literature as well as academic standards, curriculum standards, learning indicators, benchmarks and an array of other names peculiar to each promulgator. For our purposes here, we shall refer to these variously-named achievement standards as ‘learning objectives’. The correlation or mapping of learning resources such as lesson plans, curriculum units, learning objects as well as student achievement through portfolios and standards-based assessments (e.g., report cards) to these formally promulgated learning objectives is a growing imperative in the education environment [3]. International interest is high in sharing access to learning resources using systematically described learning objectives. One thing missing until the work described here is the means to systematically engage promulgators in the process of describing their learning objectives using the ASN framework in a manner that supports global interoperability. The tasks in achieving RDF representations of learning objectives require: (1) careful analysis of existing or planned objectives; (2) modeling the learning objectives as Semantic Web amenable resources; (3) development of an ASN-based schema with local extensions; and (4) application development to generate a suitable editor to support creation of the RDF graphs representing the learning objectives. What has been needed to support the ASN framework and these processes is a precise documentation process and a resource description editor that totally masks the conceptual complexities of RDF from the people creating the RDF data. Because the ASN framework supports both the use of local or project-specific extensions to its property set as well as integration of locally defined value spaces (controlled vocabularies), any ASN editor devised has to be configurable to accommodate varying attribute and value spaces and user interface configurations. This paper chronicles the first phase in the development of such a configurable editor—we call it alternatively the Resource Description Editor (RDE) and the Description Set Editor (DSE) depending on the context.
3 General Architecture—Dublin Core
The ASN framework is based on the Dublin Core Metadata Initiative’s (DCMI) syntax-independent abstract information model (DCAM) [4]. The DCAM is intended to support development of Dublin Core Application Profiles (DCAP) of which the ASN DCAP is an example [5]. Based conceptually on Heery and Patel [6], a DCAP provides for a principled means of mixing and matching properties from disparate schemas sharing a common abstract model and describes constraints on the use of those properties within a given domain. The DCMI Usage Board has developed Recommended Resources to document best practices in DCAP development including evaluative criteria for development [7] and guidelines [8]. In 2008, DCMI released a Recommended Resource describing the Singapore Framework that “...defines a set of descriptive components that are necessary or useful for documenting an Application Profile and describes how these documentary standards relate to standard domain models and Semantic Web foundation standards. The framework forms a basis for reviewing Application Profiles for documentary
completeness and for conformance with web-architectural principles.”[9] The Singapore Framework refined the understanding of application profiles and identified a specific documentation component that described the DCAP “Description Set” using a Description Set Profile (DSP). This was followed by a DCMI Working Draft of a description set constraint language that provides the means for translating a nascent DCAP into machine-processable RDF/XML or XML that can then be used to configure metadata generation applications. [10] The ASN work described in the following paragraphs builds on the DCMI DCAP suite of specifications and practices and is the first example of DCAP deployment in the form of an RDF instance data editor configurable by means of any DCAM-conformant Description Set Profile. The ASN property schema is itself a component of the ASN DCAP since it adheres to the DCMI Abstract Model and defines usage of properties drawn from multiple namespaces. In this sense, the ASN AP is similar to other DCAPs such as the Scholarly Works Application Profile [11] and the Dublin Core Collections Application Profile [12].
4 ASN DSE/RDE Editor
The reasons for developing an architecture for a configurable editor for describing RDF graphs of learning objectives are twofold: (1) one reason stems from the ASN data model’s embracing open extensions to the ASN schema to accommodate properties and controlled vocabularies peculiar to national and organizational needs; and (2) the other reason stems from the need (requirement) to insulate the people generating the graphs of learning objectives from the complexities of RDF. The use of the ASN schema in Australia for describing its new national curriculum illustrates the first of these reasons. The DSE editor had to be configured using Australian controlled vocabularies, and property labeling in the editor had to align with Australian common usage. The editor had to be incrementally and simply extensible (reconfigurable) over time as the Australian DCAP evolves. While sharing the underlying ASN data model to support interoperability, aspects of the Australian Application Profile (AU-AP) are different from those same aspects in the application profile configuring the editor for the United States. As a consortium with global participation, ASN anticipates such variance will be common as more nations and organizations embrace the ASN framework and bring their learning objectives to the Semantic Web.
4.1 The Description Set Editor Architecture
When fully deployed, the editor architecture described here will include a Sesame triple store [13] for RDF persistence and a metadata server for deployment of RESTful web services. As noted above, the editor will be configured by means of a DSP, and one or more (depending on the nature of the editor to be instantiated) User Interface Profiles (UIP). All documents necessary for configuration of the editor including the DSP, one or more UIP, XSLT, HTML, CSS, and JavaScript files are stored as web resources either in the editor’s immediate triple store or elsewhere on the web. The basic relationships among the instantiating documents, the resulting editor, and the triple store are illustrated in Figure 1. We will briefly introduce the components of Figure 1 here and set them out more fully below.
Fig. 1. Description Set Editor (DSE) architecture
While the DSP defines the constraints on permitted entities and their associated properties, it does not specify the treatment of those properties in the human interface of the generated editor. The DSP defines what can be present (i.e., properties and controlled vocabularies) and the UIP defines how those properties will be treated: e.g., a UIP controls (1) local property labels; and (2) whether a controlled vocabulary will be treated as a small set of check boxes, a drop-down pick-list, or a separate search/browse interface. The UIP works in concert with XSLT, HTML and CSS. As already noted, all documents necessary for configuration of the editor are stored as web resources—represented in the figure by the upward arrows from the triple store to the DSP and UIP. The figure denotes the DSE in terms of 1-n different configurations when fed 1-n configuring documents. As a result, the DSE core implementation may be viewed as an enabling shell, with the resulting editor shaped largely by the configuration files it is fed.
4.2 The Description Set Profile
The DSP, as the machine-readable expression of the DCAP, is the principal component in defining the RDF data to be generated by the DSE. A DSP describes a description set, defined by DCMI as “[a] set of one or more descriptions, each of which describes a single resource” [4]. A record is defined by DCMI as an “instantiation of a description set…” [4]. Thus, a description set (or record) is a bounded graph of related descriptions with each description in the set identified by a URI. So defined, an ASN ‘record’ might entail a description of the document Social Studies Strand 4—Geography (2006) and 1-n descriptions of individual learning objective assertions of which that document is comprised. A DSP contains the formal syntactic constraints on: (1) classes of entities described by the ‘set’, (2) the desired properties and their form (literal or non-literal), (3) permitted value spaces (controlled vocabularies) as well as syntax encoding schemes
Fig. 2. DCAP constraint language templating: (a) a description set template is comprised of one or more resource description templates; (b) a resource description template is comprised of one or more statement templates; and (c) a statement template is comprised of either a literal or non-literal value template
(data types), and (4) cardinality. As developed in DCMI’s Singapore Framework, the DSP takes the form of a well defined templating structure that instantiates the DCAM. Figure 2 graphically summarizes this templating. Since a description set is comprised of a set of related descriptions, a relationship between any two Resource Description Templates in the set is established by means of a Statement Template in the subject Resource Description Template referencing the refID of the object Resource Description Template.
4.3 User Interface Profile
As noted above in section 4.1, while the DSP defines the constraints on entities and properties for a description set, the User Interface Profile (UIP) defines the human interface—how those entities/properties are presented to the user of the editor. While the DSE is configured through a single DSP, in the current implementation there is a UIP defined for each “view” of an entity and its properties. Currently, there is a UIP pair for each entity in the DSP—an edit UIP used for generating the editing environment, and a view UIP for generating the display-only environment. There are several other standard UIP present to support display for common functions of the editor, such as the “pocket” used for managing relationships between entities (discussed below) and the overall visual frame in which the entity-specific UIP functions. Like the DSP, UIP are web resources encoded in either RDF/XML or XML and are retrievable from the local triple store or elsewhere on the web. The various UIP control how properties are displayed and data entry is managed, as well as specifying how value spaces are handled—e.g., whether the interface provides a simple drop-down picklist,
check boxes, or a full search-and-browse interface environment needed for large controlled vocabularies such as the Australian Schools Online Thesaurus (ScOT) [14].

4.4 Description Set Editor (DSE) / Resource Description Editor (RDE)

An instance of the editor configured by the means described above can function either as a description set editor (DSE), where the intent is to describe sets of related resources, or as a resource description editor (RDE), where the intent is to describe individual resources. The only difference between a DSE configuration of the editor and an RDE configuration is that the latter involves a DSP comprised of a single entity (i.e., a single resource description) while the former is comprised of any number of related entities (i.e., a set of resource descriptions). In either configuration, the editor functions in the same manner.
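To make the configuration described in Sections 4.2-4.4 concrete, the following Python sketch models a two-entity, DSP-like constraint set as plain data structures. The entity names, property URIs and vocabulary references are simplified assumptions for illustration only; this is not the DSP XML syntax and not the actual Australian ASN application profile.

```python
# Illustrative (hypothetical) constraint structure loosely mirroring the DSP
# templating model: a description set template contains resource description
# templates, each of which contains statement templates.
CURRICULUM_DSP = {
    "resource_description_templates": [
        {
            "id": "curriculumDocument",
            "resource_class": "http://example.org/asn/StandardDocument",  # assumed URI
            "statement_templates": [
                {"property": "dcterms:title", "type": "literal", "min": 1, "max": 1},
                {"property": "dcterms:description", "type": "literal", "min": 0, "max": 1},
                {"property": "ex:hasStatement", "type": "non-literal",
                 "value_entity": "curriculumStatement", "min": 1, "max": None},
            ],
        },
        {
            "id": "curriculumStatement",
            "resource_class": "http://example.org/asn/Statement",  # assumed URI
            "statement_templates": [
                {"property": "dcterms:subject", "type": "non-literal",
                 "vocabulary": "ScOT", "min": 0, "max": None},
                {"property": "ex:alignTo", "type": "non-literal",
                 "value_entity": "curriculumStatement", "min": 0, "max": None},
            ],
        },
    ]
}

def allowed_properties(dsp: dict, entity_id: str) -> list[str]:
    """Return the property URIs a configured editor would expose for an entity."""
    for rdt in dsp["resource_description_templates"]:
        if rdt["id"] == entity_id:
            return [st["property"] for st in rdt["statement_templates"]]
    return []

print(allowed_properties(CURRICULUM_DSP, "curriculumStatement"))
```

A DSE instance shaped by such constraints would offer only these properties for each entity, with the UIP deciding how each one is rendered.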
5 The DSE Design Metaphor

The editor is designed around a "describe and relate" metaphor that builds on the underlying abstract information model and its templating framework in the DSP, while shielding the person using the editor from the information-modeling complexities and the RDF encodings generated. In essence, the creator of RDF descriptions is asked to "describe" individual entities using straightforward input templates and then to "relate" those described entities where appropriate. For example, in the Australian national curriculum application profile, there are two entities in the domain model as defined by ASN: a curriculum document entity and a curriculum statement entity. The curriculum document entity describes a strand in a national curriculum such as "Mathematics." In analyzing the curriculum document for representation, an analyst atomizes its text into individual learning objectives that are then represented in the RDF graph as structurally and semantically related individual curriculum statement entities. Thus a curriculum document representation may be comprised of hundreds or many hundreds of curriculum statement entities. Once the DSE is configured using this two-entity model, people generating RDF graphs create instances of model entities by selecting the appropriate entity from a menu. Once an entity is selected, the DSE presents the corresponding entity data template. Figure 3 is a screenshot of the data template generated for editing an existing curriculum statement from the Illinois Fine Arts curriculum as viewed through the Australian-configured DSE. In addition to the otherwise conventional data entry and selection mechanisms apparent in the figure, there are two features of note: (1) the Entity "Pocket", and (2) the URI Drop Zone for selected properties. Both features are used in the process of relating one entity in a description set to another entity either inside or outside the set. The "Pocket" is used to hold one or more entity URIs for subsequent use. As shown by the broken-line arrow in Figure 3, any resource URI in the Pocket whose class falls within a property's range of permitted classes in the DSP can be dragged from the Pocket to that property's Drop Zone as a value URI (grayed area in Figure 3). If the resource URI is not within a property's declared range, the resource URI will not be assigned.
Fig. 3. DSE configured for editing a curriculum statement entity using the Australian DSP
Fig. 4. Dragging a resource URI to the “Pocket”
Figure 4 illustrates a resource being dragged to the Pocket from a hierarchical display of related ASN statements in the New Jersey Visual and Performing Arts curriculum. The resource URI in the Pocket is then available to serve as a value URI of the alignTo property in Figure 3 when dragged from the Pocket to the Drop Zone for that property, thus asserting that the Illinois statement is semantically aligned to this New Jersey statement.
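As a rough illustration of the RDF that such a drag-and-drop alignment produces, the sketch below asserts a single alignment triple with rdflib. The statement URIs and the alignTo property URI are hypothetical placeholders, not the actual ASN identifiers.

```python
# A minimal sketch (not the DSE implementation) of the alignment assertion.
from rdflib import Graph, Namespace, URIRef

ASN = Namespace("http://example.org/asn/")                         # assumed namespace
illinois_stmt = URIRef("http://example.org/asn/resources/IL-FA-123")    # hypothetical
new_jersey_stmt = URIRef("http://example.org/asn/resources/NJ-VPA-456") # hypothetical

g = Graph()
g.bind("asn", ASN)
# Dropping the NJ statement URI onto the alignTo Drop Zone of the IL statement
# amounts to adding one triple to the description set:
g.add((illinois_stmt, ASN.alignTo, new_jersey_stmt))

print(g.serialize(format="turtle"))
```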
6 Future Work

While the discussion here has been cast in terms of the ASN core schema and the Australian ASN application profile, the DSE can be configured to generate RDF graphs for any DCAM-conformant description set. For example, work is underway to create a DSP and accompanying UIPs for the Gateway to 21st Century Skills digital library to configure their metadata generation tools [15]. Currently, the DSP and UIPs controlling the configuration of an instance of the DSE are created manually. We are working to integrate into the DSE editing environment an interface to facilitate the creation of both the DSP and all accompanying UIPs.
Acknowledgements

This work is partially funded by the National Science Foundation under grant number DUE: 0840740. The authors acknowledge the support of the Australian Curriculum Corporation and the Le@rning Federation in the development of the Australian DSP and the functionality of the DSE, and of Zepheira, LLC in the partial development of the DSE solution stack.
References [1] JES & Co., http://www.jesandco.org/ [2] Achievement Standards Network, http://www.achievementstandards.org/ [3] Sutton, S., Golder, D.: Achievement Standards Network (ASN): An Application Profile for Mapping K-12 Educational Resources to Achievement Standards. In: DC-2008 Proceedings of the International Conference on Dublin Core and Metadata Applications, pp. 69–79 (2008), http://dcpapers.dublincore.org/ojs/pubs/article/view/920/916 [4] Powell, A., Nilsson, M., Ambjörn, N., Johnston, P., Baker, T.: DCMI Abstract Model, http://dublincore.org/documents/abstract-model/ [5] ASN DCAP, http://www.achievementstandards.org/documentation/ASN-AP.htm [6] Heery, R., Patel, M.: Application Profiles: Mixing and Matching Metadata Schemas. Ariadne 25 (September 2000), http://www.ariadne.ac.uk/issue25/app-profiles/intro.html [7] DCMI Usage Board. Criteria for the Review of Application Profiles, http://dublincore.org/documents/2008/11/03/ profile-guidelines/
[8] Coyle, K., Baker, T.: Guidelines for Dublin Core Application Profiles, http://dublincore.org/documents/2008/11/03/profile-guidelines/
[9] Nilsson, M., Baker, T., Johnston, P.: The Singapore Framework for Dublin Core Application Profiles, http://dublincore.org/documents/singapore-framework/
[10] Nilsson, M.: Description Set Profiles: A constraint language for Dublin Core Application Profiles, http://dublincore.org/documents/2008/03/31/dc-dsp/
[11] Scholarly Works Application Profile (SWAP), http://www.ukoln.ac.uk/repositories/digirep/index/SWAP
[12] Dublin Core Collections Application Profile, http://dublincore.org/groups/collections/collection-application-profile/
[13] Sesame, http://www.openrdf.org/
[14] Schools Online Thesaurus, http://scot.curriculum.edu.au/
[15] The Gateway to 21st Century Skills, http://www.thegateway.org/
Thesaurus Extension Using Web Search Engines
Robert Meusel, Mathias Niepert, Kai Eckert, and Heiner Stuckenschmidt
KR & KM Research Group, University of Mannheim, Germany
[email protected]
Abstract. Maintaining and extending large thesauri is an important challenge facing digital libraries and IT businesses alike. In this paper we describe a method building on and extending existing methods from the areas of thesaurus maintenance, natural language processing, and machine learning to (a) extract a set of novel candidate concepts from text corpora and (b) generate a small ranked list of suggestions for the position of these concepts in an existing thesaurus. Based on a modification of the standard tf-idf term weighting, we extract relevant concept candidates from a document corpus. We then apply a pattern-based machine learning approach on content extracted from web search engine snippets to determine the type of relation between the candidate terms and existing thesaurus concepts. The approach is evaluated with a large-scale experiment using the MeSH and WordNet thesauri as a testbed.
1 Introduction
The use of thesauri in the area of document indexing and retrieval is a common approach to improve the quality of search results. Due to the fast-growing number of novel concepts, manual maintenance of comprehensive thesauri is no longer feasible. A manual process would not be able to keep up with new topics that arise as a reaction to current events in the real world, quickly making their way into publications. The recent past has provided us with a number of examples, two of which we want to mention here as motivation for our contribution. In economics, the financial crisis has led to a discussion of structured financial products, and terms such as "CDO" (collateralized debt obligation) frequently occur in documents covering current events. Nevertheless, the very same term is not included in the leading German thesaurus on business and economics. In the area of medicine, the outbreak of the H1N1 pandemic has recently sparked numerous media and research reports about the so-called "swine flu." At that point the term "swine flu" was not included in any of the major medical thesauri because it was only recently coined by the media. The current version of the MeSH thesaurus lists the term "Swine-Origin Influenza A H1N1 Virus" as a synonym for "Influenza A Virus, H1N1 Subtype" but not the more commonly used term "swine flu." In this paper we describe a possible approach to the problem of identifying important terms in text documents and semi-automatically extending
thesauri with novel concepts. The proposed system consists of three basic parts, each of which we briefly motivate by means of the swine flu example.

1. In a first step, we identify candidate terms to be included in the thesaurus. In our example this is the case for swine flu, as many existing documents discuss the different aspects of swine flu, including its origin, treatment, and impact on the economy.
2. Once we decide that the term "swine flu" should be included in the thesaurus, we have to identify a location that is most appropriate. This step requires a deeper understanding of the concept "swine flu", since we want to place it in the disease branch and not the animal branch of the thesaurus. In particular, the term should be classified next to the concept "Influenza A Virus, H1N1 Subtype."
3. After deciding to place "swine flu" close to "Influenza A Virus, H1N1 Subtype", one still needs to determine the relation between the two concepts. In particular, we have to decide whether the new term should be regarded as a synonym, whether it should be included as a concept of its own (either as a hyponym or a hypernym), or whether the similarity of the two terms was incidental.

The contributions of this paper are the following: (1) We propose methods for carrying out the three steps mentioned above by looking at the literature and adapting existing approaches. (2) We present a large-scale experiment applying these methods to extend parts of the MeSH thesaurus with new terms extracted from documents. (3) We present detailed results on the use of web search engines as a means for generating feature sets for learning the correct relation of new and existing terms in step (3).

The paper is structured as follows: in Section 2 we position our work and review what other researchers have accomplished in this area. Section 3 gives a detailed description of our approach and the necessary foundations. The experiments and their evaluation and results are summarized in Section 4. In the conclusion (Section 5) we summarize the individual results and present a short outlook on future research in the area.
2 Related Work
Nguyen et al. [13] used lexico-syntactic patterns mined from the online encyclopedia wikipedia.org to extract relations between terms. Gillam et al. [5] describe a combination of term extraction, co-occurrence-based measures and predefined linguistic patterns to construct a thesaurus structure from domain-specific collections of texts. Another combination of these techniques, using hidden Markov random fields, is presented by Kaji and Kitsuregawa [9]. Witschel [17] employs a decision tree algorithm to insert novel concepts into a taxonomy. Kermanidis et al. [10] present Eksairesis, a system for ontology building from unstructured text that is adaptable to different domains and languages. For the process of term extraction they use two corpora: a balanced corpus and a domain-specific corpus.
The semantic relations are learned from syntactic schemata, an approach that is applicable to corpora written in languages without strict sentence word ordering, such as modern Greek. Niepert, Buckner and Allen [14] combine statistical NLP methods with expert feedback and logic programming to extend a philosophy thesaurus. This approach is combined with crowdsourcing strategies in Eckert et al. [4]. Many methods focus only on the extraction of synonyms from text corpora: Turney [15] computes the similarity between synonym candidates by leveraging the number of hits returned for different combinations of search terms. Matsuo et al. [12] apply co-occurrence measures on search engine results to cluster words. Curran [3] combines several methods for synonym extraction and shows that the combination outperforms each of the single methods, including Grefenstette's approach [6]. In some cases, special resources such as bilingual corpora or dictionaries are available to support specialized methods for automatic thesaurus construction. Wu and Zhou [18] describe a combination of such methods to extract synonyms. Other techniques using multilingual corpora are described by van der Plas and Tiedemann [16] and Kageura et al. [8].
3 Method Description
Let us assume we are given a thesaurus T that needs to be extended with novel concepts. The process of thesaurus extension can be divided into two major phases. First, concept candidates have to be extracted from document collections and other textual content. In order to achieve satisfactory results it is necessary that the text corpora under consideration are semantically related to the concepts in the thesaurus. For instance, if we want to extend a thesaurus of medical terms we would have to choose a document collection covering medical topics. Given a set of candidate terms, the second step of thesaurus extension involves the classification of these candidates as either synonyms or hyponyms of already existing thesaurus concepts. Figure 1 depicts a typical instance of the thesaurus extension problem. We propose a method supporting the knowledge modeler during both of these phases by (a) extracting terms from text corpora using a novel extraction method based on the well-known tf-idf measure and (b) generating, for each of the extracted concept candidates, a reasonably sized set of suggestions for its position in the thesaurus.
Fig. 1. Fragment of the WordNet thesaurus
Fig. 2. The workflow of the thesaurus extension system
For the latter, we distinguish between synonymy and hyponymy relationships. Figure 2 depicts the workflow of the proposed thesaurus extension support system. In the remainder of this section we describe the two components of the system in more detail.
3.1 Term Selection
Term selection is the process of extracting terms that could serve as concepts in the thesaurus. This is usually done by applying statistical co-occurrence measures to a corpus of text documents. In order to quantify the importance of a term t in a corpus D we first compute the tf-idf value $w_{t,d}$ of term t in document d. We found that applying the tf-idf variant with (a) logarithmic term frequency weighting, (b) logarithmic document frequency weighting, and (c) cosine normalization yielded the best results. More formally, we computed the cosine-normalized tf-idf value $w^{\mathrm{norm}}_{t,d}$ for each term t and each document d according to Equation 1.

$$w^{\mathrm{norm}}_{t,d} = \frac{w_{t,d}}{\sqrt{\sum_{t' \in d} w_{t',d}^{2}}} \quad \text{with} \quad w_{t,d} = \left(1 + \log(\mathrm{tf}_{t,d})\right) \cdot \log\frac{|D|}{\mathrm{df}_t} \tag{1}$$
Since we want to assess the importance of a term t not only for a single document but the entire corpus, we compute the mean $\bar{w}_t$ of the tf-idf values over all documents in which term t occurs at least once.

$$\bar{w}_t = \frac{\sum_{d \in D} w^{\mathrm{norm}}_{t,d}}{\mathrm{df}_t} \tag{2}$$

We finally assign the importance weight $\hat{w}_t$ to term t by multiplying the squared value $\bar{w}_t$ with the logarithm of the document frequency $\mathrm{df}_t$.
$$\hat{w}_t = \log(\mathrm{df}_t + 1) \times \bar{w}_t^{2} \tag{3}$$
The intuition behind this approach is that terms that occur in more documents are more likely to be concept candidates for a thesaurus covering these documents. The presented importance measure $\hat{w}_t$, therefore, combines the average importance of a term relative to each document in the corpus with the importance of the term relative to the entire corpus.
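As an illustration, the corpus-level importance weight of Equations 1-3 can be computed in a few lines of Python. This is a minimal sketch over a pre-tokenized corpus, not the authors' implementation.

```python
import math
from collections import Counter

def term_importance(corpus: list[list[str]]) -> dict[str, float]:
    """Compute the importance weight w_hat for every term in a tokenized corpus
    (Equations 1-3): cosine-normalized log tf-idf, averaged over the documents
    containing the term, squared, and scaled by log document frequency."""
    N = len(corpus)
    df = Counter(term for doc in corpus for term in set(doc))

    sum_norm_weight = Counter()              # sum over documents of w_norm(t, d)
    for doc in corpus:
        tf = Counter(doc)
        weights = {t: (1 + math.log(tf[t])) * math.log(N / df[t]) for t in tf}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        for t, w in weights.items():
            sum_norm_weight[t] += w / norm

    importance = {}
    for t in df:
        mean_w = sum_norm_weight[t] / df[t]                  # Equation 2
        importance[t] = math.log(df[t] + 1) * mean_w ** 2    # Equation 3
    return importance

docs = [["swine", "flu", "outbreak"], ["swine", "flu", "vaccine"], ["stomach", "organ"]]
print(sorted(term_importance(docs).items(), key=lambda kv: -kv[1])[:3])
```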
3.2 Pattern-Based Position Extraction
In a second step, the previously extracted concept candidates are classified in the existing thesaurus. Classification is the process of finding concepts in the thesaurus that are potential hypernyms and synonyms, respectively, for each of the candidate concepts. This process is also often referred to as position extraction. We apply established machine learning approaches to learn lexico-syntactic patterns from search engine results. Typical patterns for concepts C1 and C2 are, for instance, [C1 is a C2] for hyponymy and [C1 is also a C2] for synonymy relationships. Instead of only using a predefined set of patterns [7], however, we learn these patterns from text snippets of search engines [2] using existing thesauri as training data. The learned patterns are then used as features for the classification of the relationship between each concept candidate and existing thesaurus concepts. Since we are mainly interested in hyponymy and synonymy relationships, we need to train at least two different binary classifiers. Fortunately, the classifiers can be trained with concept pairs contained in existing domain thesauri. The pattern extraction approach of the proposed system is based on the method presented by Bollegala et al. [2]. Instead of retrieving lexico-syntactic patterns to assess the semantic similarity of term pairs, however, we extract the patterns also for the purpose of classifying relationship types as either synonymy or hyponymy. For each pair of concepts (C1, C2) of which we know the relationship because it is contained in a training thesaurus, we send the query "C1" +"C2" to a web search engine. The returned text snippet is processed to extract all n-grams (2 ≤ n ≤ 6) that match the pattern "C1 X* C2", where X can be any combination of up to four space-separated word or punctuation tokens. For instance, assume the training thesaurus contains the concepts "car" and "vehicle", with car being a hyponym of vehicle. The method would query a search engine with the string "car" +"vehicle". Let us assume that one of the returned text snippets is "every car is a vehicle." In this case, the method would extract the pattern "car is a vehicle". This pattern would be added to the list of potential hyponymy patterns, with "car" and "vehicle" replaced by matching placeholders. Of course, the set of patterns extracted this way is too large to be used directly for machine learning algorithms. Therefore, we rank the patterns according to their ability to distinguish between the types of relationships we are interested in. For both the synonymy and hyponymy relationship we rank the extracted patterns according to the chi-square statistic [2]. For every pattern v
we determine its frequency $p_v$ in snippets for hyponymous (synonymous) word pairs and its frequency $n_v$ in snippets for non-hyponymous (non-synonymous) word pairs. Let P denote the total frequency of all patterns in snippets for hyponymous (synonymous) word pairs and N the total frequency of all patterns in snippets for non-hyponymous (non-synonymous) word pairs. We calculate the chi-square value (Bollegala et al. [2]) for each pattern as follows:

$$\chi^2_v = \frac{(P + N)\,\big(p_v (N - n_v) - n_v (P - p_v)\big)^2}{P\,N\,(p_v + n_v)\,(P + N - p_v - n_v)} \tag{4}$$
From the initially extracted set of patterns we kept only the 80 highest-ranked patterns when WordNet was the training thesaurus and the 60 highest-ranked patterns when the Medical Subject Headings (MeSH) thesaurus was the training thesaurus. The feature vector for the machine learning algorithms consists of the normalized frequencies for these top-ranked patterns. Finally, we learn a support vector machine with a linear kernel, a support vector machine with a radial basis function (RBF) kernel, and a decision tree algorithm (J48) using the generated feature vectors. Figure 1 depicts a typical instance of the thesaurus extension problem. The concept candidate "Viscus", which has been extracted from a text corpus, needs to be classified in the existing thesaurus. The thesaurus extension support system provides, for each candidate concept, a small ranked list of potential positions in the thesaurus. In the following section we report on the empirical evaluation of the presented approach.
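As an illustration of this ranking step, the sketch below computes the chi-square score of Equation 4 for a set of candidate patterns and keeps the top-ranked ones. The pattern frequencies are toy values standing in for counts gathered from search-engine snippets; it is not the authors' implementation.

```python
from collections import Counter

def chi_square_rank(pos_counts: Counter, neg_counts: Counter, top_k: int) -> list[str]:
    """Rank candidate patterns by the chi-square statistic of Equation 4.
    pos_counts[v] = frequency p_v of pattern v in snippets of positive pairs,
    neg_counts[v] = frequency n_v of pattern v in snippets of negative pairs."""
    P = sum(pos_counts.values())
    N = sum(neg_counts.values())
    scores = {}
    for v in set(pos_counts) | set(neg_counts):
        p_v, n_v = pos_counts[v], neg_counts[v]
        denom = P * N * (p_v + n_v) * (P + N - p_v - n_v)
        if denom == 0:
            continue
        scores[v] = (P + N) * (p_v * (N - n_v) - n_v * (P - p_v)) ** 2 / denom
    return [v for v, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]]

# Toy counts; in the paper these come from web search snippets of training pairs.
pos = Counter({"X is a Y": 40, "X , a Y": 15, "X and Y": 5})
neg = Counter({"X and Y": 30, "X is a Y": 3})
print(chi_square_rank(pos, neg, top_k=2))
```

The normalized frequencies of the surviving patterns then form the feature vector handed to the SVM or decision tree classifiers.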
4 Experimental Evaluation
Most thesauri are comprised of a large number of concepts and, for every candidate concept, we would have to send a query to a web search engine for each of the thesaurus' concepts. Hence, we have to reduce the number of potential thesaurus positions for any given candidate concept. To achieve such a search space reduction we compute, for every candidate concept that needs to be classified, its similarity to each of the thesaurus concepts using the weighted Jaccard value of its surrounding words (Lin [11]). Then, for each concept candidate, only the top-k most similar thesaurus concepts are considered for the pattern-based approach. In the following we call the concepts which are included in the top-k set the similar concepts. The thesaurus concepts that share a hyponymy or synonymy relation with a candidate concept are referred to as positional concepts. While the pattern extraction approach would work with any search engine, we decided to use the Yahoo search engine API as it is less restrictive on the allowed number of queries per day. A single query with the API took up to three seconds. To evaluate and test our methods we used a thesaurus extracted from the MeSH thesaurus of the year 2008. The thesaurus was created by combining all concepts located under the top-level concept anatomy (1611 concepts) with
Table 1. Accuracy results of the three machine learning approaches and two thesauri for different classification tasks

Training data  Classification task              SVM (lin)  SVM (RBF)  Decision tree
WordNet        synonym vs. no synonym              86%        54%         98%
WordNet        hyponym vs. no hyponym              73%        63%         82%
WordNet        synonym vs. hyponym                 70%        50%         71%
WordNet        synonym vs. hyponym vs. none        58%        47%         70%
MeSH           synonym vs. no synonym              71%        59%         85%
MeSH           hyponym vs. no hyponym              74%        60%         87%
MeSH           synonym vs. hyponym                 53%        52%         68%
MeSH           synonym vs. hyponym vs. none        51%        40%         68%
all concepts located under the top-level concept humanity (186 concepts). For each concept in these thesauri we retrieved the most relevant documents from PubMed (http://www.ncbi.nlm.nih.gov/pubmed/) from the years 2005 to 2008. The final document corpus included 13,392 documents for the anatomy thesaurus and 1,468 documents for the humanity thesaurus. We chose WordNet 3.0 as a second thesaurus for the experiments, primarily since this allows us to compare the results to those reported in Bollegala et al. [1]. For each of the three classes "synonymy", "hyponymy", and "neither synonymy nor hyponymy" we sampled 300 pairs of concepts belonging to the respective class. For the MeSH training set, these pairs were randomly sampled from the MeSH thesaurus excluding the previously constructed anatomy/humanity sub-thesaurus. Similarly, to create the WordNet training set, we randomly sampled 300 negative and positive training pairs for each class from WordNet. For testing, we isolated 100 concepts each from the anatomy/humanity sub-thesaurus and from WordNet. These concepts serve as candidate concepts, and the goal is to evaluate whether our approach can identify their correct positions. For both the 100 MeSH and WordNet candidate concepts we determined the top 100 most similar concepts in the MeSH and WordNet thesaurus, respectively, by applying the above-mentioned co-occurrence similarity measure. On average, 97 percent of the correct positions for each candidate concept were included in this set for WordNet and 90 percent for the MeSH thesaurus. This indicates that the Jaccard similarity measure is able to exclude the majority of all concept positions while retaining most of the correct positional concepts. For each of the 100 concept candidates, we applied the trained classifiers to the set of the previously ranked 100 most similar concepts, resulting in 10,000 classification instances for each combination of thesaurus (MeSH or WordNet), classifier (linear SVM, RBF SVM, decision tree), and classification task. The accuracy values ((true positives + true negatives) / all instances) of these experiments are shown in Table 1. Evidently, the accuracy of the classifiers is strongly influenced by the properties of the thesauri. For instance, for the synonymy classification task, we achieved an accuracy of 86 percent with a linear SVM for WordNet but only an accuracy of 71 percent for the MeSH thesaurus.
Table 2. Percentage of candidate concepts wrongly classified as synonyms (hyponyms) by the linear support vector machine (SVM) and the decision tree algorithm

Training data  Classification task       SVM (linear)  Decision tree
WordNet        synonym vs. no synonym        7.8%          6.7%
WordNet        hyponym vs. no hyponym       10.6%         15.4%
MeSH           synonym vs. no synonym        3.6%         12.7%
MeSH           hyponym vs. no hyponym        6.1%         14.1%
Not surprisingly, the three-class classification problem is more difficult and the approach is not as accurate as for the binary classification tasks. An additional observation is that the classification results for the hyponymy vs. synonymy problem are rather poor, pointing to the semantic similarity of the synonymy and hyponymy relations. Furthermore, the results reveal that the decision tree algorithm (J48) leads to more accurate classification results for the majority of the tasks. The accuracy of the J48 classifier is on average 11.6 percent better than the linear SVM classifier and 24.1 percent more accurate than the radial basis function SVM. This is especially interesting because pattern-based machine learning approaches mostly employ support vector machines for classification. Proper parameter tuning could close the performance gap between the two approaches; however, this is often not possible in real-world applications. While the decision tree approach is superior in accuracy, the linear SVM classifier is more precise. Table 2 shows the percentage of false positives for the synonymy and hyponymy classes for both the MeSH and WordNet thesaurus. Except for the synonym vs. no synonym classification problem on WordNet, the linear SVM algorithm results in fewer false positives. A thesaurus maintenance system should support the knowledge modeler by reducing the number of novel concept/position pairs without excluding correct ones. Therefore, we are especially interested in a high recall and moderate precision, making the decision tree algorithm the preferred choice for the thesaurus maintenance setting. For a librarian or knowledge modeler, the main application of the support system is to locate the correct position of the candidate concepts in the thesaurus. Let us assume we are given the concept candidate "tummy" and that we need to determine its position in the thesaurus fragment depicted in Figure 1. Now, two pieces of information will lead us to the correct location: the first being that "tummy" is a hyponym of "internal organ" and the second being that "tummy" is a synonym of "stomach." In an additional experiment we evaluated, for each concept candidate, in how many cases we were able to determine the correct position in the target thesaurus. Hence, for each concept candidate, we looked at the set of concepts in the thesaurus which the pattern-based approach classified as either synonyms or hyponyms and checked whether at least one of these concepts led us to the correct position. The size of this set was 14 on average, meaning that, on average, the number of choices was reduced from 100 to 14. Table 3 lists the percentage of cases for which we could determine the correct position for the MeSH thesaurus. We also widened the graph distance
Table 3. Fraction of candidate concepts for which the correct position in the thesaurus could be inferred using the pattern-based classification results, considering a graph distance of 1 ≤ n ≤ 4

Graph distance     1     2     3     4
MeSH thesaurus    85%   95%   99%   100%
to the correct position from 1 to 4, where the graph distance 1 represents direct synonymy or hyponymy relations. The suggested position was at most 4 edges away from the correct one.
5 Conclusion and Future Work
The results of the experimental evaluation demonstrate that the presented approach has the potential to support and speed up the laborious task of thesaurus construction and maintenance. The concept candidate ranking based on the adapted tf-idf relevance measure (see Equation 3) could identify most of the significant terms of a text corpus. The combination of co-occurrence-guided search space reduction and pattern-based position extraction results in accurate classification results, leaving a drastically reduced number of choices to the knowledge modeler. Furthermore, the experiments indicate that web search engine snippets contain enough information to also learn lexico-syntactic patterns for the problem of hyponymy extraction. The combination of synonymy and hyponymy classification allows us to locate, for each extracted candidate concept, the appropriate position in the thesaurus. We believe only slight modifications are necessary to adapt the system to several important real-world use cases, including thesaurus maintenance for digital libraries and information retrieval systems. Both of these use cases are important to businesses as well as university libraries. We intend to conduct more experiments on different heterogeneous thesauri, attempting to relate thesaurus properties to the performance of the pattern-based approach. Based on these findings we hope to be able to tune the machine learning approach to achieve improved accuracy and performance, making the approach more suitable for domain-specific and large-scale thesauri. Furthermore, instead of merely extending a thesaurus, we will try to adapt the approach to construct thesauri entirely from scratch using only text corpora and web search engines. A bottleneck of the pattern-based approach is the time it takes to query the web search engine. In this work, we reduced the number of pairs by using a co-occurrence similarity measure. In future work, however, we will investigate additional methods to reduce the number of concept positions that have to be visited in the thesaurus. For instance, having strong evidence that a candidate concept is not a hyponym of a thesaurus concept C, we can immediately infer that it cannot be a hyponym of any of C's descendants. This would allow
us to prune entire sub-trees in the thesaurus, drastically reducing the number of pairs that have to be sent to the web search engine. Another idea is to apply not only shallow parsing strategies to extract lexical patterns but also more sophisticated approaches such as POS tagging and deep syntactic parsing.
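The subtree-pruning idea can be sketched directly. The following minimal illustration assumes, as argued above, that a negative hyponymy decision for a concept also rules out all of its descendants; is_hyponym stands in for the pattern-based classifier, and the toy taxonomy mirrors Figure 1.

```python
def candidate_positions(candidate, taxonomy, root, is_hyponym):
    """Collect thesaurus concepts of which `candidate` may be a hyponym,
    pruning a concept's entire subtree once the classifier says 'no'."""
    positions, stack = [], [root]
    while stack:
        concept = stack.pop()
        if not is_hyponym(candidate, concept):
            continue                      # prune: skip all descendants of `concept`
        positions.append(concept)
        stack.extend(taxonomy.get(concept, []))
    return positions

# Toy taxonomy from Figure 1; the classifier is stubbed out for illustration.
taxonomy = {"internal organ": ["stomach", "respiratory organ"],
            "stomach": ["rumen", "reticulum"],
            "respiratory organ": ["lung", "branchia"]}
is_hyponym = lambda cand, concept: concept in {"internal organ", "stomach"}
print(candidate_positions("tummy", taxonomy, "internal organ", is_hyponym))
```

Each call to is_hyponym here stands for a batch of search-engine queries, so every pruned subtree directly saves queries.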
References 1. Bollegala, D., Matsuo, Y., Ishizuka, M.: An integrated approach to measuring semantic similarity between words using information available on the web. In: Proceedings of HLT-NAACL, pp. 340–347 (2007) 2. Bollegala, D., Matsuo, Y., Ishizuka, M.: Measuring semantic similarity between words using web search engines. In: Proceedings of WWW, pp. 757–766 (2007) 3. Curran, J.R.: Ensemble methods for automatic thesaurus extraction. In: Proceedings of ACL, pp. 222–229 (2002) 4. Eckert, K., Niepert, M., Niemann, C., Buckner, C., Allen, C., Stuckenschmidt, H.: Crowdsourcing the assembly of concept hierarchies. In: Proceedings of JCDL (2010) 5. Gillam, L., Tariq, M., Ahmad, K.: Terminology and the construction of ontology. Terminology 11, 55–81 (2005) 6. Grefenstette, G.: Explorations in Automatic Thesaurus Discovery. Springer, Heidelberg (1994) 7. Hearst, M.A.: Automatic acquisition of hyponyms from large text corpora. In: Proceedings of the Fourteenth International Conference on Computational Linguistics, Nantes, France (1992) 8. Kageura, K., Tsuji, K., Aizawa, A.N.: Automatic thesaurus generation through multiple filtering. In: Proceedings COLING, pp. 397–403 (2000) 9. Kaji, N., Kitsuregawa, M.: Using hidden markov random fields to combine distributional and pattern-based word clustering. In: Proceedings of COLING, pp. 401–408 (2008) 10. Kermanidis, M.M.K.L., Thanopoulos, A., Fakotakis, N.: Eksairesis: A domainadaptable system for ontology building from unstructured text. In: Proceedings of LREC (2008) 11. Lin, D.: An information-theoretic definition of similarity. In: Proceedings of ICML, pp. 296–304 (1998) 12. Matsuo, Y., Sakaki, T., Uchiyama, K., Ishizuka, M.: Graph-based word clustering using a web search engine. In: Proceedings of EMNLP, pp. 542–550 (2006) 13. Nguyen, D.P.T., Matsuo, Y., Ishizuka, M.: Exploiting syntactic and semantic information for relation extraction from wikipedia. In: Proceedings of IJCAI (2007) 14. Niepert, M., Buckner, C., Allen, C.: A dynamic ontology for a dynamic reference work. In: Proceedings of JCDL, pp. 288–297 (2007) 15. Turney, P.D.: Mining the web for synonyms: Pmi-ir versus lsa on toefl. In: Proceedings of EMCL, pp. 491–502 (2001) 16. van der Plas, L., Tiedemann, J.: Finding synonyms using automatic word alignment and measures of distributional similarity. In: Proceedings of COLING, pp. 866–873 (2006) 17. Witschel, H.F.: Using decision trees and text mining techniques for extending taxonomies. In: Proceedings of the Workshop on Learning and Extending Lexical Ontologies by Using Machine Learning Methods (2005) 18. Wu, H., Zhou, M.: Optimizing synonym extraction using monolingual and bilingual resources. In: Proceedings of the second international workshop on Paraphrasing, pp. 72–79 (2003)
Preservation of Cultural Heritage: From Print Book to Digital Library - A Greenstone Experience
Henny M. Sutedjo1, Gladys Sau-Mei Theng2, Yin-Leng Theng1
1 Wee Kim Wee School of Communication and Information, Nanyang Technological University {tyltheng,hennysutedjo}@ntu.edu.sg
2 Fashion Department, Northeast Normal University, China [email protected]
Abstract. We argue that current developments in digital libraries present an opportunity to explore the use of a DL as a tool for building and facilitating access to digital cultural resources. Using Greenstone, an open-source DL, we describe a 10-step approach to converting an out-of-print book, 'Costumes through Times', into a DL collection of costumes.

Keywords: Greenstone, digital library.
2 Case Study: "Costumes through Times" Project

Although historical and cultural research is pertinent to a better understanding of the peoples of the South-East Asian region, few studies have been carried out to understand the people's lifestyles and cultures. The "Costumes Through Time" project, undertaken by the National Heritage Board (NHB) in 1993 jointly with the Fashion Designers' Society, attempted to open a window onto one facet of Singapore's past, reminiscing about the way our forefathers clad themselves and explaining how historical background and social lifestyles are manifested through fashion (Khoo, Kraal, Toh and Quek, 1993). As of today, there are very few copies of this book kept by NHB. The book provides a vivid documentation that expresses Singapore's multi-ethnic heritage through the costumes that people in Singapore have worn through the years. Each chapter is filled with write-ups and pictures of the costumes worn by Singapore's people on different occasions and at different functions.
3 A Ten-Step Approach

We document the development process of designing and building the Costumes DL as a 10-step process covering both conceptual and technical design, adapted from Chaudhry and Tan (2005):

a. Define the objectives and scope of contents of the Costumes DL. A series of discussions was held with key members of this project, representatives from LaSalle-SIA College of the Arts and a librarian from the Nanyang Academy of Fine Arts, to gather and articulate the objectives of the Costumes DL.
b. Define the digitization process. In the library context, digitization usually refers to the process of converting a paper-based document into electronic form. Digitization can be very expensive and time-consuming; it is therefore important to consider the best method to digitize. In addition, the digitization method depends a lot on the nature and format of the materials, as they are varied and of different sizes.
c. Define terminology standards. Terminology standards are a way of controlling the terms used in the process of metadata creation. A good digital collection includes metadata to describe contents and manage the collection.
d. Define the collection organization. To facilitate future integration and interoperability of contents, it is important to design an appropriate collection organization. Defining a collection organization is the stage where the relationships between different aspects of metadata, as well as between the intellectual concepts described within the metadata, are defined.
e. Create and design the collection. To commence, either open an existing collection or begin a new one. A new collection can be created by copying the structure of an existing one and adding documents and metadata to it. Many decisions need to be made at this time, such as whether to have full-text indexes, whether to create collections or sub-collections, whether to add or remove support for predefined interface languages, which document types to include in the collection (which determines the plug-ins), and how to configure each plug-in with appropriate arguments. The following designs are typical for each collection inside the Costumes DL:
   - General. Enter the Creator's email and Maintainer's email fields with the appropriate values. Enter the Collection title and Collection description fields with the collection title and a brief description of the collection. Click on the Browse button beside the 'URL to about page' icon and choose the image to be displayed as the collection icon. By default, the collection will be accessible publicly.
   - Document Plug-ins. The GAPlug and HTMLPlug plug-ins are used. HTMLPlug was configured with smart block off, to allow images to be processed inside a document, and description_tags on, to display the structure of a document.
f. Define cataloguing and metadata standards. Cataloguing and metadata creation refer to the process of creating structured descriptions that provide information about any aspect of a digital document. There are some critical points to be considered: consistent metadata and standardized classification schemes or terminology. The Costumes DL uses the Dublin Core Metadata Element Set, version 1.1 (Reference Description). Each digital resource (HTML file) inside the collection has one corresponding metadata.xml file (see the sketch after this list).
g. Design browse and search features. To make the collection look like a book, with chapters and sub-chapters, create the sub-chapter structure and contents at each main document in the chapter (e.g., for the People Celebrating collection, the main document is People Celebrating.html). Greenstone shows a partial summary of the contents and "detach" and "no high-lighting" buttons on every document; these can be removed by adding the corresponding configuration syntax to the collect.cfg file under the collect\etc folder. Greenstone also allows an image to be associated with a document. The image filename should be the same as the document filename, in any of the supported image file types. For example, if the document filename is peoplecelebrating.htm, the associated document image filename should be peoplecelebrating.jpg. The standard images provided by Greenstone were changed to match the look and feel of the costumes collections.
h. Create various classifiers. A classifier is a structure used to browse the collection; the more classifiers, the more flexible the browsing. The Costumes DL uses three types of classifiers: (i) Generic (Table of Contents); (ii) Hierarchy (Subjects); and (iii) Hierarchy (A-Z).
i. Customise the user interface. Macros allow customization of the user interface. Macro files are stored in the Greenstone/Macros folder and have a .dm extension. Macros are associated with one or more packages.
j. Gather feedback and refine the Costumes DL. While the Costumes DL was developed based on the requirements gathered from the users, the design of this project was mostly done from the developer's perspective. To correct unintended errors and wrong perceptions, the prototype of the Costumes DL was presented to the users for their review and feedback. Based on the feedback gathered from these review sessions, the prototype was revised and refined.

Figure 1 shows the Costumes DL prototype. Using Greenstone's powerful search and retrieval system, it also provides a platform to support information sharing and publishing by capturing and preserving culture and fashion artifacts. Greenstone is a suite of open-source software for building and distributing DL collections (see http://www.greenstone.org/cgi-bin/library).
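As referenced in step f, the sketch below shows how a per-document Dublin Core metadata file might be generated. The XML element and attribute names are simplified assumptions for illustration and are not claimed to reproduce Greenstone's exact metadata.xml schema.

```python
# A minimal sketch (assumed, simplified structure) of writing Dublin Core
# metadata for one digitized chapter as an XML file alongside the document.
import xml.etree.ElementTree as ET

def write_metadata(filename: str, dc_fields: dict, out_path: str) -> None:
    root = ET.Element("DirectoryMetadata")          # element names are illustrative
    fileset = ET.SubElement(root, "FileSet")
    ET.SubElement(fileset, "FileName").text = filename
    desc = ET.SubElement(fileset, "Description")
    for name, value in dc_fields.items():
        meta = ET.SubElement(desc, "Metadata", attrib={"name": name})
        meta.text = value
    ET.ElementTree(root).write(out_path, encoding="utf-8", xml_declaration=True)

write_metadata(
    "peoplecelebrating.htm",
    {
        "dc.Title": "People Celebrating",
        "dc.Subject": "Costumes -- Singapore -- Festivals",
        "dc.Creator": "National Heritage Board",
        "dc.Date": "1993",
    },
    "metadata.xml",
)
```

Keeping one such file per HTML resource, as described in step f, keeps the descriptive metadata consistent and machine-readable across the collection.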
4 Discussion and Conclusion

In the print world, preservation of rare book collections is achieved in part by restricting usage: materials are accessed under the supervision of a librarian and off-premises circulation is prohibited. While these measures undoubtedly prolong the life of these valuable materials, they do little to promote their use. In the case of digital materials, mechanisms to ensure long-term persistence should operate harmoniously with mechanisms supporting dissemination and use. This paper describes a 10-step approach to converting the "Costumes through Times" book into the Costumes DL using Greenstone software v2.61. We argue that current developments in digital libraries (DLs) present an opportunity to explore the use of DLs as tools for facilitating access to digital cultural resources, towards the realization of Licklider's (1965) global DL. The Costumes DL is certainly not going to replace the value of the "Costumes through Times" book itself. It is hoped that this project can serve as a reference point for other printed books for which "simpler" means of preservation using open-source software are being considered.
Fig. 1. Costumes DL prototype
References

Chaudhry, A.S., Tan, P.J.: Enhancing access to digital information resources on heritage: A case of development of a taxonomy at the Integrated Museum and Archives System in Singapore. Journal of Documentation 61(6), 751–776 (2005)
Kenney, A., Rieger, O.: Moving Theory into Practice: Digital Imaging for Libraries and Archives. Research Libraries Group, Inc., Mountain View (2000)
Khoo, Kraal, Toh, Quek: Costumes through Times. National Heritage Board, Singapore (1993)
Licklider, J.R.: Libraries of the Future. MIT Press, Cambridge (1965)
Throsby, D.: Economic and cultural value in the work of creative artists. Research Report, The Getty Conservation Institute. Getty Information Institute, Los Angeles (2000)
Witten, I.H., Bainbridge, D.: How to build a digital library. Morgan Kaufmann, San Francisco (2003)
Improving Social Tag-Based Image Retrieval with CBIR Technique
Choochart Haruechaiyasak and Chaianun Damrongrat
Human Language Technology Laboratory (HLT), National Electronics and Computer Technology Center (NECTEC), Thailand Science Park, Klong Luang, Pathumthani 12120, Thailand
{choochart.haruechaiyasak,chaianun.damrongrat}@nectec.or.th
Abstract. With the popularity of social image-sharing websites, the amount of images uploaded and shared among users has increased explosively. To allow keyword search, the system constructs an index from image tags assigned by the users. The tag-based image retrieval approach, although very scalable, has some serious drawbacks due to the problems of tag spamming and subjectivity in tagging. In this paper, we propose an approach for improving tag-based image retrieval by exploiting some techniques from content-based image retrieval (CBIR). Given an image collection, we construct an index based on 130 Munsell-based colors. Users are allowed to perform queries by keywords with color and/or tone selection. The color index is also used for improving the ranking of search results via user relevance feedback. Keywords: Social media, relevance ranking, tag-based image retrieval, content-based image retrieval (CBIR), color indexing.
1 Introduction
Today many social image-sharing websites are very popular among Web 2.0 users. To allow image search by keywords, most systems adopt the social tagging approach, which lets users assign a set of terms to describe the images. Social tagging was foreseen as a method to bridge the semantic gap problem in image analysis. For example, an image of people running in a marathon could be tagged with a few terms such as "marathon" and "running". Current image analysis techniques are still far from being able to detect and identify these concepts at the semantic level. Using user-assigned tags to perform tag-based image retrieval is very efficient and scalable. However, assigning a set of tags for an image is subjective and sometimes unreliable due to differences in the experience and judgment of users. In the above example of the marathon image, the image owner could assign some other, different tags such as "Nikon" (camera brand used to take the picture), "John" (person in the picture), or "Boston" (city in which the marathon takes place). A more serious problem is tag spamming, in which an excessive number of tags or unrelated tags are included for the image in order to gain a high rank in search results.
Many research works have proposed techniques such as tag ranking to increase the quality of the tag sets [1,2]. Most current techniques construct models based on the tag sets without performing image analysis. In this paper, we propose an approach for improving tag-based image retrieval based on the content-based image retrieval (CBIR) technique. Given an image collection, we construct an index based on a 130-color image scale. The color index is used for improving the ranking of the search results via user relevance feedback. To allow faceted search and browsing of the image collection, we design a tone facet by grouping colors into four tone groups: vivid, pale, dark and gray-scale. Faceted search is a very effective technique which allows users to navigate and browse information (such as documents and multimedia) based on a predefined set of categories [4,5].
2 The Proposed CBIR Approach
Figure 1 illustrates the overall process for implementing the proposed CBIR approach. The first step is to download a collection of images from social image-sharing websites. Once the image collection (containing images with tags) is obtained, we construct an index from the tag set. With the tag index, users can perform keyword search as in web search engines.
Fig. 1. The proposed CBIR approach for improving tag-based image retrieval
The next step is to implement the CBIR approach by constructing a color index from the image collection. Given an image, we extract each pixel and map it into a predefined set of Munsell-based colors. Kobayashi [3] designed a color image scale of 130 colors. This color image scale defines 10 hue levels and 12 tone levels plus an additional 10 gray-scale colors. To provide faceted search and browsing based on color tone, we organize the 12 tone levels into three groups plus one group for gray scale, i.e., vivid, pale, dark and gray-scale. To provide scalability, we adopt a text-based index structure for building the color index. Each pixel of an image is assigned (1) a color code from the 130-color scale and (2) a tone code from the four tone groups. Once all pixels are encoded with color and tone symbols, we count the frequency of pixels assigned to each color-tone code. The frequency values are normalized as percentages
such that each image contains 100 color-tone codes. This approximation significantly helps increase the speed of the retrieval process (a sketch of this color-tone indexing follows the feature list below). With the color index, we can provide the following search and browsing features together with user relevance feedback.

1. Query by keywords plus color and/or tone selection: users can type in keywords and specify a color and/or a tone. Some usage examples are as follows.
   - Keywords plus color selection: to retrieve images of mountains with a preference for green foliage, the user may input the keyword "mountain" and specify a green color on the color palette.
   - Keywords plus tone selection: to retrieve images of mountains taken at dawn or sunset, the user may input the keyword "mountain" and specify the dark tone.
   - Keywords plus color and tone selection: to retrieve images of mountains with bright and vivid green foliage, the user may input the keyword "mountain" and specify a green color plus the vivid tone.
2. Relevance feedback based on a sample image: given the search results, the user provides feedback by selecting the image which best matches his/her preference. The system then uses the color and tone information of the selected image to rerank the image results. For example, given image results for the query "temple", to obtain close-up images of golden temples with a vivid tone, the user may select an image from the results which best matches his/her criteria. The system can then return a new ranked set of image results which are closely similar (in color and tone) to the sample image.
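As promised above, the following sketch illustrates the color-tone indexing idea: each pixel is quantized to its nearest palette color, mapped to a tone group, and the counts are normalized to roughly 100 color-tone tokens per image, which can then be indexed by a text IR engine. The five-entry palette and its tone assignments are toy stand-ins for Kobayashi's 130-color image scale; this is not the authors' implementation.

```python
from PIL import Image  # Pillow

# Toy palette: (name, RGB, tone group); a stand-in for the 130-color scale.
PALETTE = [
    ("vivid_red",   (220, 30, 40),   "vivid"),
    ("vivid_green", (40, 180, 60),   "vivid"),
    ("pale_blue",   (170, 200, 230), "pale"),
    ("dark_blue",   (20, 40, 90),    "dark"),
    ("gray",        (128, 128, 128), "gray-scale"),
]

def nearest_color(rgb):
    """Map an RGB pixel to the closest palette entry (Euclidean distance)."""
    return min(PALETTE, key=lambda c: sum((a - b) ** 2 for a, b in zip(rgb, c[1])))

def color_tone_tokens(path: str) -> list:
    """Return ~100 'color|tone' tokens describing an image, suitable for
    indexing with a text IR engine such as Lucene."""
    img = Image.open(path).convert("RGB").resize((64, 64))
    counts = {}
    for rgb in img.getdata():
        name, _, tone = nearest_color(rgb)
        code = f"{name}|{tone}"
        counts[code] = counts.get(code, 0) + 1
    total = sum(counts.values())
    tokens = []
    for code, n in counts.items():
        tokens.extend([code] * round(100 * n / total))   # normalize to ~100 tokens
    return tokens

# print(color_tone_tokens("mountain.jpg"))  # e.g. ['vivid_green|vivid', ...]
```

Reranking by relevance feedback then reduces to comparing the token histograms of the selected sample image and each result image.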
3 GUI Design and System Implementation
We demonstrated the proposed approach on the Thailand travel domain using images downloaded from the Flickr website. Using the provided API, we queried images by specifying the keyword "Thailand" with some related tags such as beach, mountain and temple. The total number of unique images in the collection is 9,938. We constructed the tag and color index using the Lucene IR package. Figure 2 illustrates the GUI design of the system prototype. The user can input some keywords as the text query, with an additional color selection as the color query. The tone group can also be specified to further filter and rerank the search results. For example, a query "island" could match many images containing "island" as a tag. Some of these images may include photos of people or a hotel taken on an island. However, if the user's real intent is to retrieve natural island scenes, he/she could include a color query such as green for "trees" and blue for "sea".
Fig. 2. GUI design for text and color query
“sea”. The system would filter and rerank the search results to better match the user’s need. In general, images typically contain some themes (e.g., mountain, waterfalls, sunset) which can be conceptually described by colors. Therefore, the CBIR technique could help improve the performance of tag-based image retrieval approach by allowing users to specify color and tone levels during the retrieval process.
4 Conclusion
We proposed a content-based image retrieval (CBIR) approach for improving the search results returned by tag-based image retrieval. The proposed approach is based on constructing an index from an image collection using color and tone information. The color (plus tone) index is used for filtering the results of keyword queries. Our GUI design allows users to adjust the colors and select the tone groups to match their preferences. In addition, the system also supports the user relevance feedback concept by reranking the search results based on a user's selected image.
References 1. Li, X., Snoek, C.G., Worring, M.: Learning tag relevance by neighbor voting for social image retrieval. In: Proc. of the 1st ACM int. Conference on Multimedia information Retrieval, pp. 180–187 (2008) 2. Liu, D., Hua, X., Wang, M., Zhang, H.: Boost search relevance for tag-based social image retrieval. In: Proc. of the 2009 IEEE Int. Conf. on Multimedia and Expo., pp. 1636–1639 (2008) 3. Kobayashi, S.: Color Image Scale. Kodansha International (1992) 4. Villa, R., Gildea, N., Jose, J.M.: FacetBrowser: a user interface for complex search tasks. In: Proc. of the 16th ACM Int. Conf. on Multimedia, pp. 489–498 (2008) 5. Yee, K., Swearingen, K., Li, K., Hearst, M.: Faceted metadata for image search and browsing. In: Proc. of the SIGCHI, pp. 401–408 (2003)
Identifying Persons in News Article Images Based on Textual Analysis
Choochart Haruechaiyasak and Chaianun Damrongrat
Human Language Technology Laboratory (HLT), National Electronics and Computer Technology Center (NECTEC), Thailand Science Park, Klong Luang, Pathumthani 12120, Thailand
{choochart.haruechaiyasak,chaianun.damrongrat}@nectec.or.th
Abstract. A large portion of news articles contains images of persons whose names appear in the news stories. To provide image search of persons, most search engines construct an index from textual descriptions (such as headline and caption) of images. The index search approach, although very simple and scalable, has one serious drawback. A query of a person name could match some news articles which do not contain images of the target person. Therefore, some irrelevant images could be returned as search results. Our main goal is to improve the performance of the index search approach based on the syntactic analysis of person name entities in the news articles. Given sentences containing person names, we construct a set of syntactic rules for identifying persons in news images. The set of syntactic rules is used to filter out images of non-target persons from the results returned by the index search. From the experimental results, our approach improved the performance over the basic index search by 10% based on the F1-measure. Keywords: News image retrieval, people search, syntactic analysis.
1 Introduction
To enhance visual appearance and provide readers with some illustration, most news websites include related multimedia content, such as images and video clips, in their articles. For example, a news report on a presidential election campaign would likely include images of the candidates involved in different activities such as delivering campaign speeches and greeting voters in public. A large portion of the images displayed in news articles contains photographs of persons who are mentioned in the news stories. Therefore, one of the focused and challenging tasks in image retrieval is to correctly identify and retrieve images of specified persons from a large image collection. Previous works on identifying people in images are based on two different approaches: face recognition and index search. Face recognition is based on image processing techniques to construct and learn visual features from facial images [16]. With many improvements made over the years, face recognition has been successfully applied in applications such as biometrics and consumer electronics (e.g., digital cameras). Applying face recognition in the domain
of people image search, however, has some drawbacks for the following reasons: (1) extracting and learning visual features from images typically takes a large amount of processing time; (2) learning to recognize each individual person requires an adequate sample of facial images, so this approach is not very scalable for handling a large number of people; and (3) simple face recognition techniques are not robust enough for identifying people from different viewing angles and in low-resolution images. Some advanced techniques are being explored and investigated to improve the speed and accuracy.
To handle the large scale of the Web, many image search engines adopt the index search approach that has been successfully applied to web page retrieval [8]. Typical index search is based on the bag-of-words model for building an inverted index structure, i.e., each term has a pointer to a list of documents containing the term. Under the bag-of-words model, a document containing the input query anywhere in its content is considered a hit and will be returned to the users. For image search, an index can be constructed from many pieces of textual information such as the image file name, anchor texts linking to the image, and texts adjacent to the image [6]. In general, using the index search for web images is efficient. However, finding images of specific persons in the news domain can sometimes yield undesirable results. With the simple bag-of-words model, a query for a person could match news articles in which the person is mentioned in the news stories but whose images show other persons. Therefore, irrelevant images are sometimes returned to the users along with the correct images of the target person.
We illustrate this problem with an example from Google News Image Search (http://news.google.com), which is based on the index search approach. Figure 1 shows the search results for the query "Barack Obama"; more than half of the returned images do not show "Barack Obama".
Fig. 1. Search results of "Barack Obama" from Google News Image Search
Based on our initial observation, a news article is likely to contain an image of a target person if his/her name appears as the main subject in different sentences of the news article, such as the headline and the image caption. Figure 2 shows an example of a news article from the CNN news web site (http://edition.cnn.com) containing a person image with a headline and image caption.
Fig. 2. An example of a person image with headline and image caption
In the figure, two person name entities, "Hillary Clinton" and "Barack Obama", appear in both the headline and the image caption. Using the simple index search, a query "Barack Obama" would mistakenly return this image of "Hillary Clinton".
The main goal of this paper is to improve the performance of the basic index search for finding persons in news images. Our proposed approach applies syntactic analysis to the textual contents (e.g., headlines and image captions) for positive and negative identification of persons in news images. The syntactic analysis used in our method includes named entity tagging and shallow parsing [1,3]. From Figure 2, by performing syntactic analysis on the news headline, "Clinton" will be parsed and tagged as a noun phrase (NP), followed by "endorses" as a verb phrase (VP) and "Obama" as a noun phrase
(NP). Therefore, from this sentence, we can derive two rules: (1) [NP*] [VP] [NP] for positive identification of persons in images; and (2) [NP] [VP] [NP*] for negative identification, where * denotes the position of the person name. The first positive identification rule means that if the person name appears as
a noun phrase and as the subject of a sentence, the current image is likely to contain the specified person. On the other hand, the negative identification rule means that if the person name appears as a noun phrase and as the object of a sentence, the current image is unlikely to contain the target person. Using a tagged news article corpus, we constructed a set of syntactic rules based on the above idea. The set of syntactic rules is then used for selecting and filtering the image results returned by the index search. From the above example, a query "Obama" will return the image from Figure 2; however, this image will be filtered out by the negative identification rule [NP] [VP] [NP*] since "Obama" appears as NP* in the sentence. The experimental results on a collection of news images showed that the proposed syntactic analysis approach could significantly improve the retrieval performance over the basic index search. In the next section, we review related work on people image search. In Section 3, we present the proposed framework for improving people image search based on syntactic analysis of textual content. In Section 4, experimental results based on the evaluation of the proposed method are given with some discussion. Section 5 concludes the paper with possible future work.
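The rule-derivation idea can be made concrete with a short sketch. The chunk representation, the helper name derive_rule and the example encoding below are illustrative assumptions rather than the authors' implementation; they only show how a shallow-parsed headline maps to a positive or negative identification pattern for a given person name.

```python
# Illustrative sketch: mapping a shallow-parsed headline to an
# identification rule pattern for a given person name.  The chunk format
# and function name are assumptions, not the authors' actual code.

def derive_rule(chunks, target_name):
    """chunks: list of (phrase_type, text) pairs from a shallow parser,
    e.g. [("NP", "Clinton"), ("VP", "endorses"), ("NP", "Obama")].
    Returns the pattern with '*' marking the chunk holding the target name."""
    pattern = []
    for phrase_type, text in chunks:
        marker = "*" if target_name in text else ""
        pattern.append(f"[{phrase_type}{marker}]")
    return " ".join(pattern)

headline = [("NP", "Clinton"), ("VP", "endorses"), ("NP", "Obama")]

# "Clinton" appears in the accompanying image -> positive identification rule
print(derive_rule(headline, "Clinton"))  # [NP*] [VP] [NP]
# "Obama" is mentioned but not pictured -> negative identification rule
print(derive_rule(headline, "Obama"))    # [NP] [VP] [NP*]
```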
2 Related Work
Previous works on image retrieval are mainly based on two different techniques: text-based and content-based image retrieval (CBIR) [4]. The text-based approach relies on textual information associated with the images to match an input query, consisting of one or more terms, against the images. Due to its scalability, text-based image retrieval has been widely applied in search engines such as Google Image Search (http://images.google.com) and Yahoo! Image Search (http://images.search.yahoo.com). On the other hand, CBIR applies image processing techniques to extract and learn visual features from images. In CBIR systems, a visual query describing features such as color and texture is used to find the images which best match the input features [7,12].
With a large number of images on the Web containing photographs of persons, people image search has gained increasing attention as a research topic in image retrieval. In people image search, given a query of a specific person name, the goal is to accurately retrieve images of that person (preferably as an individual) from a large image collection (e.g., from news web sites). The techniques for people image search are similar to those used for general image retrieval. The content-based approach relies on face recognition to construct and learn visual features from facial images. A large number of previous works in people image search are based on face recognition [2,9,10,11]. These works applied variations of face recognition to associate face images with person names found in the news articles. As mentioned earlier, applying face recognition is not very scalable since there are a large number of people in the news. Therefore, a large sample of facial images for
each individual person is required. In addition, extracting and learning visual features from facial images requires a large amount of processing time. To increase scalability, many works applied the text-based approach for finding people in news articles. Srihari [13] proposed a system called Piction which uses the image caption to identify each individual in an image containing multiple persons. The system combines an algorithm for locating faces in an image with an NLP module that analyzes the spatial locations of persons from the image caption. Yagnik and Islam [14] proposed a method based on learning text-image co-occurrence on the web. The approach, called consistency learning, is used for annotating people images. Yang and Hauptmann [15] proposed a statistical learning method to identify every individual person appearing in broadcast news videos based on the contexts found in the video transcripts. Edwards et al. [5] applied text clustering to group images of people into topic clusters; some observations on the syntactic structure of person names in image captions were discussed and used in their experiments. Our main contribution is an efficient and scalable approach which automatically constructs a set of syntactic rules to identify persons in news images. The syntactic rules are used for selecting and filtering the image results returned by the index search approach.
3 The Proposed Approach Based on Textual Analysis
Our proposed approach for people image search combines the index search with syntactic analysis of person named entities (NEs) in the textual content. Given a query of a person name, the index search often yields irrelevant images along with the correct images of the target person. The main idea is to construct a set of syntactic rules from textual information for filtering out the irrelevant images. Figure 3 illustrates the overall process of constructing a set of syntactic rules from a news article corpus.
Fig. 3. The proposed approach for constructing a set of filtering rules
Given a news article containing a person image, we apply named entity recognition to tag all person NEs appearing in the textual information (i.e., headline, subhead and image caption). All sentences which contain person NEs are collected and labelled as positive or negative samples. Positive samples are sentences which contain the name of the person displayed in the image. Negative samples are sentences which contain the names of other persons who do not appear in the image. Figure 4 gives examples of positive and negative samples.
Fig. 4. An example of positive and negative image samples
For the positive sample, a query of "Tiger Woods" would correctly match the given image. For the negative sample, a query of "Rafael Nadal" would incorrectly match the image of another tennis player, "Lleyton Hewitt". It can be observed that in the positive sample the target person name appears as the subject of the sentence, whereas in the negative sample the target person name appears as the object of the sentence.
Once a set of positive and negative samples is collected from the corpus, the shallow parsing technique is applied to segment and tag the person NEs and
all other terms in the sentences. For example, the sentence "Obama wants $1.5 billion for pandemic preps" would be parsed and tagged as [NP Obama] [VP wants] [NP $1.5 billion] [PP for] [NP pandemic preps], where NP denotes a noun phrase, VP a verb phrase and PP a preposition phrase. Since this sentence accompanies an image of "Barack Obama", we can derive a positive identification rule [#] [NP*] [VP] [NP] :positive, where # denotes the sentence boundary, * denotes the position of the person name
appearance and :positive is the class label. In this paper, we use OpenNLP (http://opennlp.sourceforge.net), a text analysis tool, to perform all related text processing tasks.
Once we have a set of positive and negative identification rules, we construct a set of filtering rules for screening out the negative image results returned by the index search. The same syntactic rule can occur with both the positive and the negative class label. Therefore, we calculate the score of each filtering rule as the ratio of the number of negative labels to the number of positive labels:

score(rule_filtering) = count(rule_negative) / count(rule_positive)    (1)

When the number of positive labels is zero, the rule appears only in the negative class; such rules are ranked above all other rules. A score of one means the rule appears equally often in both classes. We retain the rules whose scores are larger than one (i.e., count(rule_negative) > count(rule_positive)).
Figure 5 illustrates the process of filtering the image search results returned by the index search. Given a query of a person name, the system retrieves from the index all images whose textual descriptions contain the target person name. Next, the textual descriptions of the images are processed with the shallow parser. The filtering rules are then applied to screen out the images which are unlikely to contain the target person.
Fig. 5. The process of image search result filtering
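A minimal sketch of the rule scoring in Eq. (1) and the subsequent filtering step is given below. The data structures (labelled rule occurrences, a parse_pattern callback) are illustrative assumptions; the paper does not describe the implementation at this level of detail.

```python
# Sketch of Eq. (1): score each filtering rule by the ratio of its negative
# to positive occurrences, then use the retained rules to screen results.
from collections import Counter

def score_rules(labelled_rules):
    """labelled_rules: iterable of (pattern, label) pairs,
    label in {"positive", "negative"}."""
    pos, neg = Counter(), Counter()
    for pattern, label in labelled_rules:
        (pos if label == "positive" else neg)[pattern] += 1
    scores = {}
    for pattern in set(pos) | set(neg):
        if pos[pattern] == 0:
            scores[pattern] = float("inf")   # negative-only rules rank first
        else:
            scores[pattern] = neg[pattern] / pos[pattern]
    return scores

def select_filtering_rules(scores):
    # keep rules that occur more often with negative than positive labels
    return {p for p, s in scores.items() if s > 1.0}

def filter_results(images, parse_pattern, rules):
    """images: hits from the index search; parse_pattern(image) returns the
    rule pattern of the sentence mentioning the queried person."""
    return [img for img in images if parse_pattern(img) not in rules]
```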
4 Experiments and Discussion
To evaluate the effectiveness of our proposed method, we performed experiments using a collection of news images obtained from Google News Image Search. Given a list of persons in different categories (shown in Table 1), we downloaded the textual contents along with the enclosed images and captions. The corpus currently contains approximately 500 images. Using this corpus, we constructed a set of filtering rules following the process explained in the previous section. We set the maximum number of allowed grams (i.e., the number of tokens in a rule) to 5 throughout the experiments.

Table 1. List of person names in the test corpus
  Person name       Category
  Barack Obama      Politics
  Hillary Clinton   Politics
  Tiger Woods       Sports
  David Beckham     Sports
  Rafael Nadal      Sports
  Britney Spears    Entertainment
  Angelina Jolie    Entertainment

Table 2 shows the top-ranked filtering rules based on the score calculation given in Eq. (1), where # denotes the sentence boundary, * denotes the position of the person name, NP denotes a noun phrase, VP a verb phrase, PP a preposition phrase, ADVP an adverb phrase, ADJP an adjective phrase, and O a special character. A comprehensive list of tagging symbols can be found at the Penn Treebank Project (http://www.cis.upenn.edu/~treebank).

Table 2. Top-ranked filtering rules
  Filtering rule               Score
  [NP] [VP] [NP*] [PP] [NP]    7.0
  [#] [NP*] [O] [#]            5.0
  [NP] [VP] [NP*] [#]          2.67
  [#] [NP*] [VP] [ADVP]        2.0
  [#] [NP*] [NP] [O]           2.0
  [O] [VP] [NP*] [O] [#]       2.0
  [O] [VP] [NP*] [NP] [O]      2.0
  [NP] [VP] [NP*] [VP] [NP]    2.0
  [NP] [O] [NP*] [VP] [ADJP]   2.0
  [NP] [PP] [NP*] [O] [NP]     1.67

We evaluate the performance of the proposed syntactic analysis approach by comparing it with the index search. For the index search, we build an index of terms extracted and tokenized from the textual information of each news article. For the syntactic analysis approach, we apply the set of syntactic rules to filter the image results returned by the index search; an image whose associated texts match the filtering rules is removed from the results. We use the standard retrieval metrics of precision, recall and F1-measure for the evaluation. The experimental results are summarized in Table 3.

Table 3. Evaluation results of people image search from news articles
  Approach             Precision   Recall    F1-measure
  Index search         47.17       100.00    64.10
  Syntactic analysis   59.08       99.61     74.17

From the results, the index search approach yielded a recall of 100%. This is because a query for a person matches all images whose textual information contains the person's name. However, over half of the returned results do not contain positive images of the target person, so the precision is only approximately 47%. Applying the filtering process increased the precision to approximately 59%, while the recall stayed nearly the same; the syntactic analysis approach thus eliminated some false positives produced by the index search. In summary, the index search approach yielded an F1-measure of 64.10%, which the syntactic analysis approach increased to 74.17%. The syntactic rules are therefore useful for filtering negative results from the images returned by the index search.
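As a quick sanity check, the F1-measures in Table 3 follow directly from the reported precision and recall; the small script below recomputes them.

```python
# Recomputing the F1-measures in Table 3 from the reported precision/recall.
def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

print(f"{f1(47.17, 100.00):.2f}")  # 64.10  (index search)
print(f"{f1(59.08, 99.61):.2f}")   # 74.17  (syntactic analysis)
```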
5 Conclusion and Future Work
We proposed a syntactic analysis approach for finding persons in news article images. Our main goal is to improve the quality of image search results obtained by using the text-based index search. Named entity tagging and shallow parsing are applied to textual contents, e.g., news headlines and image captions, to assign appropriate part-of-speech (POS) tags to person name entities and all other terms. A set of syntactic rules for identifying the persons appearing in the images is constructed from the tagged corpus. We use these syntactic rules to filter out image results returned by the index search. Based on the evaluation results, our proposed approach improved the performance over the simple index search by 10% on the F1-measure. For future work, we plan to apply machine learning techniques such as Conditional Random Fields (CRFs) to efficiently build the syntactic rules from the tagged corpus. A larger news image corpus will also be collected and used for a more comprehensive evaluation.
References
1. Abney, S.: Parsing by chunks. In: Berwick, R., Abney, S., Tenny, C. (eds.) Principle-Based Parsing. Kluwer Academic Publishers, Dordrecht (1991)
2. Berg, T.L., Berg, A.C., Edwards, J., Maire, M., White, R., Yee-Whye, T., Learned-Miller, E., Forsyth, D.A.: Names and Faces in the News. In: Proc. of the 2004 IEEE Conf. on Computer Vision and Pattern Recognition, pp. 848–854 (2004)
3. Chinchor, N.: MUC-7 Named Entity Task Definition (Version 3.5). MUC-7, Fairfax, Virginia (1998)
4. Datta, R., Joshi, D., Li, J., Wang, J.Z.: Image retrieval: Ideas, influences, and trends of the new age. ACM Computing Surveys (CSUR) 40(2), 1–60 (2008)
5. Edwards, J., White, R., Forsyth, D.: Words and pictures in the news. In: Proc. of the HLT-NAACL 2003 Workshop on Learning Word Meaning from Non-Linguistic Data, pp. 6–13 (2003)
6. He, X., Cai, D., Wen, J.-R., Ma, W.-Y., Zhang, H.-J.: Clustering and searching WWW images using link and page layout analysis. ACM Trans. on Multimedia Computing, Communications, and Applications 3(2) (2007)
7. Hörster, E., Lienhart, R., Slaney, M.: Image retrieval on large-scale image databases. In: Proc. of the 6th ACM Int. Conf. on Image and Video Retrieval, pp. 17–24 (2007)
8. Kherfi, M.L., Ziou, D., Bernardi, A.: Image Retrieval from the World Wide Web: Issues, Techniques, and Systems. ACM Computing Surveys (CSUR) 36(1), 35–67 (2004)
9. Kitahara, A., Joutou, T., Yanai, K.: Associating Faces and Names in Japanese Photo News Articles on the Web. In: Proc. of the 22nd Int. Conf. on Advanced Information Networking and Applications – Workshops, pp. 1156–1161 (2008)
10. Liu, C., Jiang, S., Huang, Q.: Naming faces in broadcast news video by image google. In: Proc. of the 16th ACM Int. Conf. on Multimedia, pp. 717–720 (2008)
11. Ozkan, D., Duygulu, P.: A Graph Based Approach for Naming Faces in News Photos. In: Proc. of the 2006 IEEE Conf. on Computer Vision and Pattern Recognition, pp. 1477–1482 (2006)
12. Smeulders, A.W.M., Worring, M., Santini, S., Gupta, A., Jain, R.: Content-Based Image Retrieval at the End of the Early Years. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(12), 1349–1380 (2000)
13. Srihari, R.K.: Automatic Indexing and Content-Based Retrieval of Captioned Images. Computer 28(9), 49–56 (1995)
14. Yagnik, J., Islam, A.: Learning people annotation from the web via consistency learning. In: Proc. of the Int. Workshop on Multimedia Information Retrieval, pp. 285–290 (2007)
15. Yang, J., Hauptmann, A.G.: Naming every individual in news video monologues. In: Proc. of the 12th ACM Int. Conf. on Multimedia, pp. 580–587 (2004)
16. Zhao, W., Chellappa, R., Phillips, P.J., Rosenfeld, A.: Face recognition: A literature survey. ACM Computing Surveys (CSUR) 35(4), 399–458 (2003)
Kairos: Proactive Harvesting of Research Paper Metadata from Scientific Conference Web Sites
Markus Hänse, Min-Yen Kan, and Achim P. Karduck
Abstract. We investigate the automatic harvesting of research paper metadata from recent scholarly events. Our system, Kairos, combines a focused crawler and an information extraction engine to convert a list of conference websites into an index filled with fields of metadata that correspond to individual papers. Using event date metadata extracted from the conference website, Kairos proactively harvests metadata about the individual papers soon after they are made public. We use a maximum entropy classifier to classify uniform resource locators (URLs) as scientific conference websites and use conditional random fields (CRF) to extract individual paper metadata from such websites. Experiments show a classification accuracy of over 95% for each of the two components.
1 Introduction
With the growing trends of digital publishing and open access publishing, scientific progress is increasingly reliant on near-instantaneous access to published research results. It is now common to find published articles citing work within the same year, and even works that have yet to be formally published. Established imprints (such as Wiley, Elsevier and Springer) have adopted Really Simple Syndication (RSS), a web feed standard, to help readers stay abreast of recent news and articles. In the biomedical field, PubMed serves as an example of a one-stop aggregator that gives up-to-date access to the large bulk of scientific advances. However, in fields such as computer science and engineering, such aggregators do not exist, hampering the ability of researchers to stay current. There are myriad reasons for this that are both cultural and practical. PubMed relies on manual effort by authors and publishers to keep the information up-to-date. Also, in computer science, many cutting-edge research results are transmitted through conferences rather than journals, and such conferences often do not have RSS feeds for the metadata of individual scholarly papers. To address this problem, the community has built a number of digital libraries – most notably CiteSeerX and Google Scholar – that index web-available scientific papers. However, these crawlers are largely reactive – periodically scanning
the web for new contributions and indexing them as they come about. To really address the need for up-to-date indexing, we must provide digital libraries with crawlers that are proactive: crawlers that know when a conference has just occurred and know when and where to obtain the pertinent paper metadata from each conference web site. This paper details Kairos (from καιρός, classical Greek for "opportune moment"), an implemented system that aims to address this problem. Kairos uses supervised machine learning to model which URLs are indeed conference websites and to model how such conference websites present individual paper metadata. Furthermore, by extracting event date information from conference websites, Kairos can proactively schedule crawls to the event website as the date of the conference approaches. After briefly discussing related work, we give an overview of the architecture of our system in Section 3 and describe the two major components of Kairos – the crawler and the IE engine – and their evaluation in Sections 4 and 5. We conclude by discussing the project in its larger context.
2 Related Work
Digital libraries have turned to focused crawling to harvest materials for their collections. Traditionally, this has been done by downloading web pages and assessing whether they are useful; by useful we mean that the crawled web pages belong to the topic we would like to crawl. However, this can be wasteful, as downloaded pages that are not useful consume bandwidth. As such, a key step is to estimate the usefulness of a page before downloading it. Recent work has refined focused crawling heuristics, exploiting genre [1] and using priority estimation [2]. A number of experiments [3,4,5] have shown that careful analysis of URLs can be an effective estimator. Once useful webpages are identified, information must be extracted from the semi-structured text of each webpage. While many different models have been proposed, the conditional random field (CRF) model pairs pointwise classification with sequence labeling. CRFs have been applied to a multitude of information extraction and sequence labeling tasks [6,7,8]. While general CRFs can handle arbitrary dependencies among output classes, for textual NLP tasks a linear-chain CRF model often outperforms other models while maintaining tractable complexity. Focused crawling and information extraction are often used serially in many applications that distill data from the web; however, to our knowledge, there have been few works that discuss their integration in a single application.
3 Overview
We introduce Kairos pictorially and trace the subsequent workflow, from the initial input to the resulting extracted paper metadata per conference. Figure 1 shows an architectural overview of our system.
Fig. 1. Architectural overview of Kairos. New modules of interest highlighted in gray.
Kairos consists of two components: 1) a crawler that uses a maximum entropy classifier to determine whether input URLs are scientific conference websites, and 2) an information extraction engine that extracts individual paper metadata from a web page using a linear-chain conditional random field classifier. The first component, the crawler, shown on the top left of Figure 1, takes a list of candidate URLs as input and decides whether they are indeed scientific conference websites. If they are, key metadata about the event date is extracted for later use in the second component. For example, given a conference that issues its first call-for-papers, Kairos' crawling component will attempt to locate the correct site for the conference and extract the key dates for its deadlines (paper notification and actual conference date). The second component, the IE engine, shown on the right of Figure 1, is run periodically around the dates for each conference. It is encapsulated in its own per-conference crawler that does a periodic crawl of a particular conference's website around its key dates, looking for webpages where individual paper metadata have been posted. The IE engine is run during these crawls, extracting individual paper metadata from the conference website when and where they are found, and converting them to rows of paper metadata and PDF link locations. This final product can then be ingested into a digital library. As discussed in the introduction, certain web sites also act as scientific event portals, listing many different conferences and other scholarly events and meetings. In the computer science community, both DBLP and WikiCFP are such websites. WikiCFP, in particular, allows external parties to advertise information about workshops and conferences. It has become a major aggregator of information related to scholarly paper submission deadlines in this community, listing over 5,200 conference venues as of January 2010. WikiCFP also tracks
specific conference metadata, including the URL of the conference website, the date range for the conference itself, and the submission and camera-ready deadlines.
WikiCFP sub-crawler. For these reasons, we also implemented a component customized for WikiCFP, shown in the bottom left of Figure 1. This sub-crawler collects and extracts conference metadata using XPath queries from publicly visible pages in WikiCFP, including the author notification date and the dates of the meeting itself. We extract these two dates from both the sub-crawler for WikiCFP and any other URLs that are encountered or entered. These two dates are selected because the list of papers to appear at a conference is occasionally posted after the author notification date, and the full paper metadata and links to the source papers often appear after the event starts. Kairos schedules its IE engine runs around these key dates per conference.
All of the crawler instances in Kairos are built on top of Nutch (http://lucene.apache.org/nutch), an open-source crawler. Nutch itself builds on Lucene, a high-performance text search engine library.
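The proactive element can be sketched as follows. The paper does not specify the exact revisit frequency or window, so the every_days and window_days parameters below are illustrative assumptions; only the idea of queueing IE crawls around the author-notification and event dates comes from the text.

```python
# Sketch of proactive scheduling: queue IE-engine crawls of a conference
# site around its two key dates.  Revisit frequency and window length are
# assumptions for illustration.
from datetime import date, timedelta

def schedule_ie_runs(notification_date, event_date, every_days=3, window_days=14):
    """Return the dates on which the per-conference IE crawler should run:
    periodic visits for `window_days` after each key date."""
    runs = set()
    for key_date in (notification_date, event_date):
        day = key_date
        while day <= key_date + timedelta(days=window_days):
            runs.add(day)
            day += timedelta(days=every_days)
    return sorted(runs)

# e.g. a conference with author notification on 1 March 2010 and a meeting
# starting on 21 June 2010
for run_date in schedule_ie_runs(date(2010, 3, 1), date(2010, 6, 21)):
    print(run_date)
```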
4 Link-Based Focused Crawling
The focused crawler requires a module that judges the fitness of a link, to decide whether or not a potential page may lead to useful metadata to extract. In one guise, such a module may take a URL as input and output a binary classification. We place our classification module between the Fetcher and Injector modules of Nutch, so that it can make judgments on URLs found in newly downloaded web pages and thereby save bandwidth. Before computing individual features from the candidate URLs, we preprocess them into smaller units that are more amenable to analysis, similar to [4,5]. We glean features from the three components of the URL, namely the hostname, path and document components; hostnames are further broken up into top level domain (TLD), domain, and subdomain components. These and other components are further broken up into tokens at punctuation boundaries. Given this ordered set of tokens, we compute a set of binary feature classes for maximum entropy (ME) based classification. We use ME (a pointwise classification) instead of CRFs here, as the URLs are short and modeling sequential information between instances is unnecessary. We now list the feature classes that we extract.
– Tokens and Length: This feature class captures the length (in characters) and identity of the tokens with respect to their ordinal position in the component.
– URL Component Length in Tokens: This feature captures the number of tokens in each component. Short domain names and/or path names can indicate good candidates for conference URLs.
– Precedence Token Bigrams: From the tokens created in the preprocessing step, we form bigrams as features. These include normal adjacent bigrams as well as ones that contain a gap. The latter gapped bigrams are
used to combat sparse data in this feature class. For example, given the hostname "www.isd.mel.nist.gov", we create (4+3+2+1 =) 10 individual binary features such as "www≺isd" and "www≺nist".
– Ordinals: This feature class aims to detect the ordinal numbers used in the full names of many conferences and workshops. We use a regular expression to detect cases such as "1st", "2nd" or "Twentieth".
– Possible Year: This class of features aims to find both double-digit and full forms of years, which are also a frequent unit of information in conference URLs. Again, a regular expression is used to capture years. We normalize possible years in the YY format to their full form (e.g., "07" → "2007"). We also separate any component found as a prefix to the year and add special features if the candidate year detected is the current year. This last part helps to favor spidering conferences that are in the current calendar year.
Table 1 illustrates an example URL that has gone through preprocessing and subsequent feature extraction; a small sketch of this tokenization and feature extraction is also given below.
Table 1. Maximum Entropy features extracted from one sample URL
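The sketch below illustrates the preprocessing and a few of the feature classes (token identity, component length in tokens, precedence bigrams, ordinals, possible years). The feature-name strings and the exact normalization rules are assumptions; the example hostname and the count of ten precedence bigrams come from the text.

```python
# Illustrative sketch of URL preprocessing and binary feature generation
# for the maximum entropy classifier.  Feature naming is an assumption.
import re

def url_features(url):
    features = set()
    m = re.match(r"(?:https?://)?([^/]+)(/.*)?", url)
    host, path = m.group(1), m.group(2) or ""
    components = {"host": host.split("."),
                  "path": [t for t in re.split(r"[/._\-]+", path) if t]}
    for comp, tokens in components.items():
        features.add(f"{comp}_len={len(tokens)}")          # length in tokens
        for i, tok in enumerate(tokens):
            features.add(f"{comp}_tok{i}={tok.lower()}")   # token identity
            if re.fullmatch(r"\d+(st|nd|rd|th)", tok, re.I):
                features.add(f"{comp}_has_ordinal")
            if re.fullmatch(r"(19|20)\d{2}|\d{2}", tok):   # possible year
                year = tok if len(tok) == 4 else "20" + tok
                features.add(f"{comp}_year={year}")
        # precedence token bigrams, including gapped pairs
        for i in range(len(tokens)):
            for j in range(i + 1, len(tokens)):
                features.add(f"{comp}_{tokens[i]}<{tokens[j]}")
    return features

feats = url_features("http://www.isd.mel.nist.gov")
print(sum(1 for f in feats if "<" in f))   # 10 precedence bigrams for the host
```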
To train our classifier, we use both WikiCFP and DBLP as sources of positive seed instances and collect negative instances through a search engine API. WikiCFP provides a semi-structured, fixed format that gives the conference name, date and website URL. We extract these fields from conference information pages in WikiCFP, using the same crawler component discussed earlier (bottom left of Figure 1). DBLP also provides an XML dump that lists the conference title, year and URL information, which are similarly extracted. To construct negative instances, we need to find URLs that are not pages containing metadata about papers. However, for the negative instances to be useful in discriminating positive from negative, they should share some attributes in common with the positive examples. For this reason, we use tokens extracted from the positive URL instances to construct queries to a search engine to retrieve potential negative examples. For example, given a known positive URL of "http://www.icadl2010.org", we construct the query "inurl:(icadl 2010 org) filetype:html -site:www.icadl2010.org" and send it to a search engine to retrieve other webpage URLs as negative examples. Often, such URLs are call-for-papers for the target conference that have been posted elsewhere, or blog posts about
the conference, as well as other miscellaneous information. Our inspection of this collection process suggests that it is largely accurate, but that some URLs are false negatives (i.e., are actual conference websites).
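The negative-instance query construction can be sketched as below; the exact query template the authors used may differ slightly, and the tokenization here is just enough to reproduce the example from the text.

```python
# Sketch of building a search query for negative training examples from a
# known positive conference URL.
import re
from urllib.parse import urlparse

def negative_example_query(positive_url):
    host = urlparse(positive_url).netloc or positive_url
    parts = [p for p in host.split(".") if p != "www"]
    tokens = [t for p in parts for t in re.findall(r"[A-Za-z]+|\d+", p)]
    return f"inurl:({' '.join(tokens)}) filetype:html -site:{host}"

print(negative_example_query("http://www.icadl2010.org"))
# inurl:(icadl 2010 org) filetype:html -site:www.icadl2010.org
```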
4.2 Evaluation
We used the above methodology to retrieve training URLs for our classifier, automatically labeled as positive and negative instances. In total, we retrieved 9,530 URLs, of which 4,765 were positive instances. To assess the effectiveness of our classifier, we used stratified 10-fold cross validation: the dataset was randomly divided into 10 equal parts, each with the original proportion of positive and negative instances. We use a publicly available maximum entropy classifier implementation (version 2.5.2, available at http://maxent.sourceforge.net/), train the classifier on nine parts, test on the remaining part, and repeat this process ten times. The resulting binary URL classifier achieves an accuracy, precision, recall and F1 measure of 96.0%, 94.8%, 97.2% and .960 respectively, when conference URLs are considered positive. Table 2 gives the resulting confusion matrix for our ten-fold cross validation test. These results show a slight imbalance in the system's errors towards false positives.

Table 2. Confusion matrix for the URL conference classifier
                          Gold standard
                          +ve      -ve      Total
  System judgment  +ve    4519     246      4765
  System judgment  -ve    127      4638     4765
  Total                   4646     4884     9530
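The reported figures can be reproduced from the confusion matrix; the snippet below recomputes them (recall comes out at 97.3% when rounded rather than the quoted 97.2%, i.e. the same value truncated).

```python
# Recomputing the URL-classifier metrics from Table 2
# (conference URLs treated as the positive class).
tp, fp = 4519, 246    # system +ve row: gold +ve, gold -ve
fn, tn = 127, 4638    # system -ve row: gold +ve, gold -ve

accuracy  = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1        = 2 * precision * recall / (precision + recall)
print(f"{accuracy:.3f} {precision:.3f} {recall:.3f} {f1:.3f}")
# 0.961 0.948 0.973 0.960
```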
5 CRF-Based Information Extraction
Given a URL that passes the maximum entropy classifier's test for being a conference web page, Kairos downloads the web page represented by the URL and attempts to extract pertinent paper metadata from the page, if present. This second stage is run only during the key date periods extracted in the first stage. This is a standard web information extraction task, where web pages may present paper metadata in different formats. In conference and workshop websites, paper metadata is commonly found in some (semi-)structured format, such as tables, paragraphs or lists. As the conference website (and metadata gleaned from WikiCFP and DBLP) may already describe the venue and publisher information for the papers presented at the venue, our system's goal is to identify the three remaining salient pieces of paper metadata: the title, the authors and a link to the PDF of the paper itself (Kairos currently only handles source papers in PDF, PostScript or MS Word formats, and ignores HTML versions of papers).
To accomplish this second task, we again make use of a supervised classifier. The task here is to scan a web page for the individual pieces of metadata related to papers. Different from the pointwise classification that characterizes the previous URL classification task, this information extraction task requires labeling and extracting multiple, related fields from a stream of (richly formatted) text. For this reason, we turn to methodologies developed for sequence labeling and adopt conditional random fields (CRF) as our model representation. We divide each input webpage into a set of regions, where regions are small blocks of text delimited from other parts of the page by certain HTML table, list and logical tags. Currently, we do not use any positional features of regions. The features are calculated for each region of interest and passed to a publicly available linear-chain CRF implementation (CRF++ 0.53, available at http://crfpp.sourceforge.net/) for classification. After classification, a set of heuristics is used to bundle the classified spans of text into individual papers. The final output can then be visualized as an index, with fields for title and author gleaned from the CRF system, the PDF link (if one exists) gleaned by regular expressions over the webpage, and the venue name and venue location gleaned from the WikiCFP source or the link-focused crawler.
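A minimal sketch of the region segmentation step is shown below. The set of delimiting tags and the regex-based splitting are assumptions for illustration; the authors' system also computes stylistic, lexical and region features per region before handing them to CRF++, which is not reproduced here.

```python
# Minimal sketch: split a conference page into candidate text regions
# delimited by block-level HTML tags, one region per output line.
import re

BLOCK_TAGS = r"(?:table|tr|td|ul|ol|li|p|div|br|h[1-6])"

def extract_regions(html):
    chunks = re.split(rf"(?i)</?{BLOCK_TAGS}\b[^>]*>", html)
    regions = []
    for chunk in chunks:
        text = re.sub(r"<[^>]+>", " ", chunk)      # strip remaining markup
        text = re.sub(r"\s+", " ", text).strip()
        if text:
            regions.append(text)
    return regions

page = ("<table><tr><td><b>Kairos: Proactive Harvesting of Research Paper "
        "Metadata</b></td></tr>"
        "<tr><td>Markus Hanse, Min-Yen Kan and Achim P. Karduck</td></tr></table>")

for region in extract_regions(page):
    print(region)   # candidate regions, ready for feature extraction/labelling
```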
5.2 Dataset and Annotation Collection
To train and evaluate the CRF labeler, we downloaded positive examples of conference URLs gathered in the previous task from WikiCFP and DBLP. However, unlike the previous task, there is no automatic means of creating positive and negative instances of each class. As such, we needed to manually label a corpus of conference pages with the appropriate classes. To ensure impartial and replicable annotations, we prepared a corpus of 265 conference pages, pre-segmented into regions for annotation by human subjects. We recruited 30 student volunteers to annotate these web pages according to an annotation guide, which standardized our instructions. Each volunteer was asked to label around 20 pages using in-house labeling software, which took them between 20 minutes and an hour. Subjects were given a token of appreciation for their participation in the data collection task. Each page was thus annotated by two volunteers. The first author of this paper also carefully annotated all regions in the 276-page collection. The resulting dataset consists of 9015 title, 7863 author, 2044 author+affiliation and 1437 affiliation regions. We use the volunteers' labels to check the reliability of our own annotation. After discarding annotations from volunteers who misunderstood the guidelines, we calculated the inter-annotator reliability of the acquired annotations.
We calculated a Kappa [9] of .931 (kappa ranges from -1, indicating complete disagreement, to 1, total agreement), indicating very high agreement among annotators. We take this as confirmation that the task is feasible and that the first author's annotations can be used as a gold standard.
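For reference, the agreement computation for a pair of annotators can be sketched as below (Cohen's kappa over region labels); the paper's exact kappa variant across all annotator pairs may differ, and the labels here are a toy example.

```python
# Sketch: Cohen's kappa between two annotators' region labels.
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

a = ["title", "author", "author", "other", "title", "other"]
b = ["title", "author", "author", "other", "author", "other"]
print(round(cohens_kappa(a, b), 2))   # 0.75 on this toy example
```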
5.3 Evaluation
We trained and evaluated the information extraction engine using ten-fold cross validation at the region level, where a region is one or more contiguous lines delimited by certain HTML tags. We trained the CRF only with the title and author regions, and treated the author+affiliation regions as author regions, as they also include author metadata. The CRF achieves an accuracy and F1 measure of 97.4% and .974, respectively. To assess which features were most critical, we also performed ablation tests: removing different feature classes, retraining the model and then assessing the subsequent performance. These results are reported in Table 4, which shows that all three classes of features were important, but that the region and gazetteer features were the most helpful.

Table 3. CRF region labeling confusion matrix
                            Gold standard
                            title     author    Total
  System judgment  title    7506      234       7740
  System judgment  author   174       7566      7740
  Total                     7680      7800      15480
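The ablation procedure behind Table 4 amounts to retraining the CRF on each subset of the feature classes and comparing accuracies; a schematic version is given below. The functions train_crf and evaluate stand in for the actual CRF++ training and decoding calls and are assumptions, not real APIs.

```python
# Schematic ablation over feature-class subsets (cf. Table 4).  `train_crf`
# and `evaluate` are placeholders for the actual CRF training/decoding calls.
from itertools import combinations

FEATURE_CLASSES = ("stylistic", "lexica", "region")

def ablation(train_regions, test_regions, train_crf, evaluate):
    results = {}
    for k in range(1, len(FEATURE_CLASSES) + 1):
        for subset in combinations(FEATURE_CLASSES, k):
            model = train_crf(train_regions, feature_classes=subset)
            results[subset] = evaluate(model, test_regions)   # e.g. accuracy
    return sorted(results.items(), key=lambda kv: kv[1])
```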
6 Conclusion
We have described Kairos, an end-to-end scholarly paper metadata harvester, which autonomously finds and extracts paper metadata from seed URLs and proactive focused crawls. By knowing the key dates of a conference event, our system can locate and harvest such metadata with a shorter delay than other digital libraries and databases. Kairos is built on top of Nutch, a popular open source crawling system, adding a maximum entropy (ME) based URL classifier that converts it into a focused crawler amenable to detecting webpages that may contain conference paper metadata. A conditional random field (CRF) classifier subsequently runs on downloaded pages to identify and extract the pertinent title and author data for each scholarly work. Both the ME and CRF classifiers obtain good performance – over 95% accuracy. In total, the dataset collection and annotation for both stages took over a man-month of time. As the task was very laborious, we believe that these datasets would be valuable for others targeting similar work in the future, and we have made them available to the general public (http://wing.comp.nus.edu.sg/~mhaense). An annotation system has been set up to allow volunteers to hand-annotate scientific conference websites to further expand the existing labeled training dataset.
Table 4. CRF feature class ablation performance. F1 performance gain over the previous row given in parentheses; all performance gains are significant at the .01 level.
  Feature classes   Accuracy
  Stylistic (S)     69.4%
  Lexica (L)        85.9%
  L+S               87.8%
  Region (R)        95.6%
  R+S               96.3%
  R+L               96.7%
  All (R+L+S)       97.4%
In this work, we have concentrated on catering for cases in which paper metadata is presented as plain text. In future work, we plan to integrate abilities to deal with more proactive forms of publication and subscription (pub/sub): ingesting publisher RSS feeds, exporting discovered metadata as RDFa or microformats, and making the output OAI-PMH compliant. In our current work, we are integrating these modules into our production scholarly digital library. When the integration is complete, we will be close to making the real-time indexing of scientific articles a reality.
References
1. de Assis, G.T., Laender, A.H., Gonçalves, M.A., da Silva, A.S.: Exploiting genre in focused crawling. In: Ziviani, N., Baeza-Yates, R. (eds.) SPIRE 2007. LNCS, vol. 4726, pp. 62–73. Springer, Heidelberg (2007)
2. Guan, Z., Wang, C., Chen, C., Bu, J., Wang, J.: Guide focused crawler efficiently and effectively using on-line topical importance estimation. In: SIGIR 2008: Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 757–758. ACM, New York (2008)
3. Sun, A., Lim, E., Ng, W.: Web classification using support vector machine. In: Proceedings of the 4th International Workshop on Web Information and Data Management, pp. 96–99. ACM, New York (2002)
4. Kan, M.Y., Thi, H.O.N.: Fast webpage classification using URL features. In: Proceedings of the Conference on Information and Knowledge Management, pp. 325–326 (2005)
5. Baykan, E., Henzinger, M., Marian, L., Weber, I.: Purely URL-based topic classification. In: Proceedings of the 18th International World Wide Web Conference, pp. 1109–1110 (2009)
6. Peng, F., McCallum, A.: Information extraction from research papers using conditional random fields. Information Processing and Management 42(4), 963–979 (2006)
7. Sarawagi, S., Cohen, W.: Semi-markov conditional random fields for information extraction. Advances in Neural Info. Processing Systems 17, 1185–1192 (2005)
8. Kristjansson, T., Culotta, A., Viola, P., McCallum, A.: Interactive information extraction with constrained conditional random fields. In: Proceedings of the National Conference on Artificial Intelligence, pp. 412–418. AAAI Press, MIT Press, London (1999)
9. Carletta, J.C.: Assessing agreement on classification tasks: the kappa statistic. Computational Linguistics 22(2), 249–254 (1996)
Oranges Are Not the Only Fruit: An Institutional Case Study Demonstrating Why Data Digital Libraries Are Not the Whole Answer to E-Research
Dana McKay
Swinburne University of Technology Library, P.O. Box 218 John Street, Hawthorn VIC 3122, Australia
[email protected]
Abstract. Data sharing and e-research have long been touted as the future of research, and a general public good. A number of studies have suggested data digital libraries in some form or another as an answer to a perceived data deluge, and the focus in Australia is very much on digital libraries. Moreover, the Australian National Data Service positions the institution as the core unit for setting data policy and doing initial data management. In this paper we present the results of an institution-wide survey that shows that data digital libraries cannot be the only answer to the question of research data, at least at an institutional level, and that the current focus on digital libraries may actively alienate some researchers. Keywords: research data, e-research, data digital libraries, human factors, survey, institutional repositories.
Despite the obvious appeal of data digital libraries, without understanding researchers' existing concerns and practices, it is unlikely that any attempt to build data digital libraries will meet with success. One reason for the likely failure of data digital libraries that are not driven by researchers' interests is technical: it is evident from the existing studies that the metadata and technical requirements placed on data digital libraries will be quite discipline-specific [8, 13, 14], and the technical aspects of the problem are quite challenging [15]. The other reason a lack of understanding of researchers' existing concerns and practices around data will lead to the failure of digital libraries is amply demonstrated by the difficulty in filling institutional repositories: even when a mandate is in place, researchers are unwilling to participate in research-related services [16] unless they see real benefit in doing so [17-19] or someone does it for them [20]. Given the institution-centric approach to data management in Australia and our understanding of the need to engage researchers with any institution-wide data preservation strategy, at Swinburne we conducted an institution-wide survey to understand our researchers' entire approach to data: what data they gather, how they use that data, whether they share it, and how they would feel about an institutional approach to managing it. The results of that survey, and their implications for any attempt to create an institutional data digital library or repository, are presented in this paper. Section 2 of this paper gives some background to our work; Section 3 outlines our methodology; Section 4 presents the results of our survey; Section 5 offers some discussion of these results and their implications for data digital libraries; and Section 6 draws conclusions and offers avenues for future work.
2 Background
We are far from the only institution to study researcher behaviour with regard to data. There have been a number of other studies, but they are all discipline-specific, and tend to cherry-pick disciplines that already have large amounts of easily defined data to deal with, including ecology [9, 10, 21], crystallography [4, 14], musicology [14], and rheology [22]. While these discipline-based approaches have been largely successful, there is clear evidence that different research disciplines create, manage and share data in different ways [8, 23], and, possibly as a direct result of these differences, there are very few studies of humanities scholars [24]. This dearth of work with humanities researchers was also a hallmark of early work on institutional repositories [25]. Even in the science disciplines, a number of researcher concerns have been raised about data sharing: Lyons recognised as early as 2003 that intellectual property concerns are a large barrier to depositing data in a digital library [26]; Corti raised the ethical implications of data sharing (particularly for sensitive data) in 2000 [27]; Humphrey expressed concern about the impact of individualistic research cultures on data sharing in 2005 [28]. It is instructive to consider the lessons learned from the institutional repository experience: institutional repositories have the potential to bring esteem to both the institution and the individual researcher [29, 30], however once implemented they often
see a lack of uptake [31-33]. Researchers who did not engage in publication sharing prior to their institution setting up a repository were reluctant to engage with repositories, seeing them as burdensome, technologically challenging [17], risky due to copyright obligations, and as taking time away from their research [34]. Conversely, those who already had publication sharing systems in place, such as physicists who deposited in Arxiv.org, were not interested in repositories because they saw no benefit to them; their main ties are disciplinary rather than institutional [25]. Some measures have proven valuable in encouraging deposit in institutional repositories: providing immediate and tangible benefit to researchers in the form of profile pages has been successful [17], as has reducing the burden on researchers by using fully mediated deposit [20]. It is likely that the same kinds of things that inhibit publication deposit will inhibit data deposit, and approaches to dealing with these problems must be researcher-driven; it is amply demonstrated that mandates alone do not work [16, 35]. One possible researcher-driven incentive for data deposit in digital libraries (and by extension data sharing) is the potential for data sets to be cited in the same way as research publications [26, 36]; however, for this to be useful, data citation would have to be recognized by research quality assessment schemes. In conjunction with the issues raised by the institutional repository experience, traditional Human-Computer Interaction (HCI) techniques are ill-placed to assist in the design of data sharing technologies and digital libraries [37, 38]: given the highly distributed nature of these technologies, the novel needs they serve, and the highly specialized approaches to data that exist within research disciplines, typical participatory design techniques have nothing to offer. Nardi and Zimmerman suggested that potentially all HCI could offer was usability testing once data-sharing technologies were in place; however, given the large investment these technologies require and the likelihood that researchers will not use anything that does not suit them exactly, this is a risky strategy. We believe that knowing how a wide variety of researchers within an institution collect, share, and manage their research data is the first step towards providing useful policy and tools around data sharing and management at the institutional level.
3 Methodology
Swinburne is a geographically distributed institution with a diverse researcher population. To reach the greatest number of these researchers and get a broad-brush picture of data management practices at Swinburne, an online survey methodology was selected. The survey we wrote comprised 55 questions; however, it was designed such that no respondent would see all 55 questions, as the questions displayed were selected for relevance given previous answers. To capture a wider range of researchers' experiences, in addition to quantitative questions we included a number of questions that required free-text answers. To centre participants' responses more realistically in their experience [39], for some questions we asked them to comment on a specific incident: the data they used when writing their most recent publication. The survey was available for responses for one month.
We deliberately took a broad view of the definitions of both data and research; because we wanted a picture of the whole institution, we did not want to exclude anybody who identified as a researcher. As such, we advertised the survey via Swinburne's Research Bulletin (an online newsletter for research staff and graduate students) and allowed respondents to self-select. We sent a reminder email to all staff one week before the survey was due to close. To avoid bias in respondent self-selection, we offered a reward to those respondents who completed the survey: a chance to win a 30 GB video-capable iPod. We analysed the responses to this survey using typical statistical methods for the quantitative data, and grounded theory analysis for the free-text responses [40].
4 Results
The results of this survey clustered around five major themes: demographic information (see Section 4.1), data collection and use (see Section 4.2), data storing (see Section 4.3), data sharing practices (see Section 4.4) and attitudes to the institution's role in data curation and sharing (see Section 4.5).
4.1 Demographic Information
A total of 85 respondents completed the survey, two of whom completed it on behalf of their research group as a whole. While we do know how many academic staff work at Swinburne, it is impossible to determine a response rate as there is no firm definition of research-active. Some staff at Swinburne are teaching-focused, and it is evident from our research publications collections each year that some staff in corporate areas are doing research. In addition to these confounding factors, we felt that it was important to include graduate students in our survey, as they often generate data and then move on [41], and what happens to that data is relevant to Australian data policy. In this section we discuss respondents' research roles, experience and research disciplines.
4.1.1 Research Roles and Experience
Respondents worked in a variety of roles in the research world, including postgraduate study (masters and PhD), academics who conducted teaching and research (lecturer, senior lecturer, professor), research-only roles (research fellows and assistants) and other professions including consultants and managers. Four of the participants who listed their research role as 'other' said they were management or employed by a research project. The length of time in research was fairly evenly distributed, with 14 respondents stating they had been involved in research for fewer than two years at one end of the scale, and 10 respondents claiming a 20+ year career at the other. The largest groups were in the 2-5 year category (31 respondents) and the 5-10 year category (18 respondents). Surprisingly, there was no statistically significant correlation between length of career and research position.
4.1.2 Research Fields
Respondents entered their own fields of study into a free text box, and responses were as granular as 'Sustainable environmental biotechnology [sic]' and as broad as 'science'. We used a grounded theory analysis to classify these research fields [40], and discovered that of 85 respondents, 27 studied social science, 19 science, 11 law and business, 8 engineering, 8 IT, 7 education and 4 design. These research fields broadly reflect Swinburne's research strengths, so we can be reasonably sure that the survey sample represents institutional research at Swinburne.
4.2 Data Collection and Use
To understand how researchers manage their data, we first have to know what that data is (see Section 4.2.1), how it is prepared for use (see Section 4.2.2) and whether it is re-used (see Section 4.2.3).
4.2.1 What Kinds of Data?
84 respondents answered a multiple-choice question about what kind of research data they typically use, and the majority of them listed more than one type. The largest number of data types used was 5 types; the mean number was 2.3 types, and the median 2. The distribution of data types is shown in Table 1 (below).
Table 1. Data types used by Swinburne researchers
  Data type                        Number of uses   % of total uses   % of respondents listing this data type
  Scientific observations          23               12.6              27.4
  Experimental data                10               5.5               11.9
  Biological data                  5                2.7               6.0
  Data analyses                    6                3.3               7.1
  Computer program outputs         9                4.9               10.7
  Survey/questionnaire responses   31               16.9              36.9
  Interview/focus group materials  40               21.9              47.6
  Primary source materials         19               10.3              22.6
  Participant observations         4                2.2               4.8
  Published materials              15               8.2               17.9
  Written notes                    8                4.4               9.5
  Other                            13               7.1               15.5
Data types respondents classified as ‘other’ include video materials, theory, statistics and case studies. We asked respondents to comment on the format of the data they use, specifically splitting digital and non-digital formats. Most researchers used a combination of digital and non-digital data, and a significant portion of the data used (38.6%) is in a non-digital format. This has considerable implications for any institutional-level data curation: either the non-digital data will have to be digitized in order to store complete data sets, or there will have to be some way to link digital and non-digital data.
We asked respondents to comment on any special characteristics of the data they used, and 26 of them (30.6%) claimed their data had some unusual quality. The majority of those who responded to this question (15 respondents) said that their data was sensitive or confidential (conversely, one respondent said that his or her data was never subject to ethical or privacy concerns). Two participants mentioned large volumes of data (these would be the typical subjects of e-research initiatives), three said their data required special processing and two mentioned data in a language other than English.

4.2.2 Preparing Data for Use

To gain a more accurate picture of the amount and quality of data processing undertaken by Swinburne researchers, we asked them to describe the process of getting the data they used in their last publication to a state ready for use. Asking respondents to think of a specific incident results in more accurate self-reporting than asking them to think more generally [39]. The data that researchers reported using for their most recent publication broadly reflected the data they reported using more generally. The level of digital data use is also similar; 50 researchers (58.8%) created only digital data, 14 researchers (16.5%) created only non-digital data, and the remaining 21 researchers (24.7%) created a combination of digital and non-digital data. Respondents were asked whether they had to process their data (for example change its format or digitize it) before they could use it, and if so, what this processing involved. The kinds of data processing respondents commonly engaged in were audio-file transcription, changing the format of the data, entering data into statistical analysis software, analysis, and data cleaning and checking. The amount of processing required, and whether it was applied to digital or non-digital data, can be seen in Table 2 (below). There is a clear connection between creating non-digital data and having to process that data. This requirement is non-trivial; the minimum length of time respondents reported spending on processing was 1-2 hours, and some respondents took over a month to process their data. These processing requirements are a significant burden on researchers, but they undertake them so they can do research. Any institutional approach to data management could not rely on researchers being so generous with their time.

Table 2. Data processing requirements

Original data format | All data required processing (%) | No data required processing (%) | Some data required processing (%)
Digital-only | 20.0 | 54.0 | 10.0
Non-digital only | 78.6 | 21.4 | 0.0
Combination of digital and non-digital | 71.4 | 4.8 | 23.8
Percentage of total respondents | 51.8 | 36.5 | 10.6
4.2.3 Data Re-use

65 out of 85 researchers said that at some point in the past they had re-used their own research data, demonstrating that future use of one's own research data is not merely a hypothetical situation. The reasons researchers gave for re-using their data included further publication ('for a different publication'), re-analysis ('reanalysis [sic] using a slightly newer technique'), analysis from a different research perspective ('It was more of a change in focus. As a designer [sic] we look at things in new ways') and answering new research questions ('I was looking at waste compostion [sic] firstly and then moved onto using that data to predict gas generation rates'). Of the 65 respondents who had re-used research data, the majority (38) rated ease of re-use 1 or 2 on a 5-point Likert scale (where 1 was very easy). Only 5 researchers found it difficult (4 or 5 on the scale); the reasons they gave for this were that it was time consuming, that data required reformatting, or that some of the data was lost and had to be regenerated.

4.3 Storing Data

98% of respondents retained the data they used to write their last research paper, and only 4 respondents stored no data digitally. The most popular locations for data storage were '"A" computer' (43 respondents), USB, CD or DVD (12 respondents) and secure digital storage (12 respondents). When asked to rate their comfort level with their data storage arrangements on a 5-point Likert scale, the majority (55) scored their comfort 1 or 2 (where 1 is very comfortable). This indicates that researchers have data storage approaches which they are generally happy with; convincing them to use an alternative approach may be quite difficult. 83 researchers stated why they had retained their research data in a free-text field. In keeping with what they claimed about re-use of data, 48 mentioned that they were still using or expected to re-use the data in the future ('Generally we are pretty good at keeping data in astronomy, but in this particular case, we retained the data for comparison with future generations of the telescope equipment'). 7 respondents specifically mentioned re-use by other researchers ('To enable verification by external agencies at any stage and to provide access to the data to others at a later date'), 13 respondents (like the previous quotation) mentioned evidentiary reasons, 20 wanted it for future reference ('in case I ever need to look at it again'), and 16 mentioned legal or ethical reasons ('because of ethics requirements'). It is interesting to note how many researchers cited legal, ethical and evidentiary reasons for retaining research data; this tends to indicate that at least some of the rationale for storing research data is not research-motivated.

4.4 Sharing Data

We asked about two aspects of data sharing: using others' research data (see Section 4.4.1), and allowing re-use of respondents' own data (see Section 4.4.2).

4.4.1 Re-using Others' Data

45.3% of all respondents said they had used datasets created by other people. Use cases given for data re-use included confirming the researcher's own results, addressing new research questions and contexts, generating new questions, and using new analysis
methods. Three respondents reported using data from a data digital library. These use cases largely reflect respondents' reported re-use of their own data. As a group, survey respondents were largely ambivalent about using others' research data in the future; 29 rated the likelihood of such re-use a 3 on a 5-point Likert scale, and the median and mean ratings were 3 and 2.8 respectively. Unsurprisingly, those who had re-used others' data in the past were more likely to consider doing so in the future (p=0.002). Reasons given for data re-use included the expense and/or difficulty of obtaining data ('Data in astronomy is expensive and hard to get'), existing research ties ('My field has a number of groups working together around the world') and an interest in drawing one's own conclusions ('I like empirical works and I don't believe 100% in other people's theories'). Reasons for not being likely to re-use others' data included ethical constraints ('it is hard to get access…due to ethical and legal constraints'), concern about others' data collection methods ('it would depend on how reliable I felt the researcher was'), lack of available data ('If that data was available I would use it. I would very much like to have access to some study data') and lack of necessity ('I get my own data').

4.4.2 Sharing One's Own Data

Of 85 researchers, 32 had never shared any data, 30 had definitely had data re-used by others, and 23 were not sure (because their data was in an anonymous archive, making it impossible to tell whether it had been re-used). Those who had definitely shared data were more likely to have re-used others' data than those who had not (p=0.007). Those who had definitely shared research data were offered the opportunity to describe that data sharing; of the 30, only 4 had ever shared data with people they did not know. The majority shared data with close colleagues at the same institution (18 respondents) or close colleagues at another institution (3 respondents). Data sharing normally occurred when the respondent offered the data to the recipient; only one case of sharing via a data repository was reported (this may be a result of a design flaw in the survey: respondents who reported not being sure whether they had shared data did not answer this question). Respondents shared fewer data types than they used; a mean of 1.5 data types per researcher were shared (compared with 2.3 data types used). More of the shared data was digital, too: only 13.6% of shared data was non-digital, whereas 38.6% of created data is non-digital (see Section 4.2.1). Again as a group, respondents were largely ambivalent about sharing in the future, reflecting their opinions on using others' data in the future. Those who had shared in the past were slightly more likely to share in the future (p=0.004). Reasons given for sharing data included cultural reasons ('it's what we do in astronomy') and simple lack of reservation ('happy to share'). Concerns about data sharing included concerns about the person who might use their data ('depends on the person, and how trustworthy I think they are'), and ethical concerns ('I could not give out…interview transcripts').

4.5 Institutional Involvement in Research Data

We asked respondents whether there was anything they believed Swinburne could do to help them manage their data, and if so, what (results in Section 4.5.1). We also proposed a
policy that is in line with what the Australian National Data Service (ANDS) expects of institutions; responses to this proposal are discussed in Section 4.5.2.

4.5.1 What Do Researchers Want from Swinburne?

When asked what assistance Swinburne might provide in managing research data, the majority of researchers (55, 64.7%) said there wasn't anything Swinburne could offer them. Among the remaining 30 researchers the most popular form of assistance was archive space, either digital (10, 30.3%) or physical (3, 9.1%). Other popular responses included a back-up service, a data conversion service and data management training.

4.5.2 How Researchers Feel about Australian Data Policy

We proposed a policy in line with the ANDS guidelines to respondents: that all research data would have to be deposited in a digital data repository. When we asked respondents how likely they were to comply with such a policy, results were surprisingly positive; only 13 researchers claimed they were 'unlikely' or 'very unlikely' to abide by such a policy. The reason for this soon becomes apparent, though: the majority of researchers who commented on why they would abide by the policy gave the policy itself as the reason.

Table 3. Perceived advantages to a mandatory data deposit policy

Advantage | Number | % of total responses
Data preservation | 6 | 7.3
Data back-up | 9 | 11.0
Security of data | 9 | 11.0
Access to others' data | 7 | 8.5
Sharing own data with others | 20 | 24.4
Cultural shift toward sharing and community | 7 | 8.5
Good publicity | 3 | 3.7
Good use of resources | 4 | 4.9
Evidence/verification of results | 9 | 11.0
'None for me' | 5 | 6.1
Other | 3 | 3.7
Table 4. Concerns about a mandatory data deposit policy

Concern | Number | % of total responses | % of respondents listing this concern
Confidentiality/ethics | 24 | 34.3 | 32.4
Misuse or theft of data | 22 | 31.4 | 29.7
Concern that their work be acknowledged | 2 | 2.9 | 2.7
Time uploading takes | 9 | 12.9 | 12.2
Cost of repository | 5 | 7.1 | 6.8
Usability of repository | 3 | 4.3 | 4.1
Reliability of repository | 5 | 7.1 | 6.8
Bureaucracy | 5 | 7.1 | 6.8
Duplication of existing discipline services | 2 | 2.9 | 2.7
'None' | 2 | 2.9 | 2.7
Other | 12 | 17.1 | 16.2
Table 5. Restrictions on data deposited in an institutional data digital library

Restriction | Number | % of total responses | % of respondents listing this restriction
Control over who used the data | 27 | 33.8 | 36.0
Ethics obligations upheld | 15 | 18.8 | 20.0
Acknowledgement | 12 | 15.0 | 16.0
Intellectual property protection | 10 | 12.5 | 13.3
Notification when data is used | 6 | 7.5 | 8.0
To know how the data will be used | 5 | 6.3 | 6.7
Data security | 5 | 6.3 | 6.7
Access to the data themselves | 5 | 6.3 | 6.7
Non-commercial share-alike as in Creative Commons | 3 | 3.8 | 4.0
No restrictions | 3 | 3.8 | 4.0
Other | 9 | 11.3 | 12.0
When asked about the possible advantages of such a policy, researchers did find some advantages (see Table 3 above), but they also had a number of concerns (see Table 4 above). Furthermore, they would like to place a number of restrictions on data in such a library (see Table 5 above). This reflects the findings about institutional repositories [42], reinforcing the parallel between publications and data. Many more researchers listed concerns and restrictions than advantages; they find the prospect of a policy that would force them to share their data quite threatening, and the threats outweigh the advantages they can find in such a proposal.
5 Discussion

Data digital libraries do have a lot to offer in a data-sharing future; many disciplines and researchers share data already, and many more demonstrate a willingness to do so when offered the right support [10, 21, 22, 43]. Data digital libraries offer the opportunity for dramatic savings in data collection costs, and to bridge distance and even time [7, 9, 44-46]. Data digital libraries are particularly well-suited to data which does not have any ethical sensitivities attached to it, and which is easily described in ways that can be understood by the community that would use that data; the data that tends to meet these criteria (and which is often used in examples of data digital libraries) is scientific data. Despite the promise inherent in digital libraries, they are not the answer to every information problem [47] and in fact are seen as cumbersome and problematic in a general sense by many in academia [48]; they likewise cannot be the whole answer to data management. Although our survey respondents claimed they would deposit data in an institutional data digital library, the institutional repository experience has demonstrated that 'it's the rules' mandates simply are not a motivating factor for researchers [16, 35]. Similarly, our survey demonstrates some concern about the implications of deposit in a data digital library; when researchers have concerns about depositing material in an institutional repository they just refuse to do so [17, 35]. Usability of any data digital library was seen as a concern by many of our respondents; we know from the literature that researchers are looking to 'reduce the chaos' [49], and that digital libraries are already perceived by many in academia as unusable [48]. It is evident that to be
adequately descriptive, metadata in data digital libraries must be quite precise; our survey respondents were concerned about the time it would take to upload data to a digital library, and past experience shows that researchers were confused when asked to create metadata [50], and struggled to do so [21]. The standards of openness inherent in the architecture of many digital library systems [51] would be of concern to our respondents, who were worried about intellectual property and ethics concerns. Much of the data our respondents created (38.6%) was not born digital, and respondents described the digitization process as painful and slow (a finding reflected in other work [12]); we can assume they are unlikely to be willing to digitize data solely for deposit in a data digital library. For an institutional approach to data management to be successful, it will first have to provide data management assistance and services that researchers find useful. At present at Swinburne the services researchers would find most useful are digital and physical storage spaces for their data, digital back-up services, and assistance with digitization; these are all services that can be provided without the concerns a data digital library approach raises for researchers (and, as mentioned in Section 4.5, researchers see far more problems with a data digital library than advantages).
6 Conclusions and Future Work

Data digital libraries offer considerable promise, particularly for those researchers who work with data unencumbered by ethical constraints, and with clear and obvious metadata descriptors (usually scientific data). Researchers who use this kind of data have been targeted in a number of successful case studies; however, contrary to the outcomes of these studies, we found that data management policies are not met with universal support and approbation by researchers. The lack of enthusiasm stems not from reluctance to share data (the majority of researchers are willing to share), but from a desire for finer-grained control of research data than an institutional policy complete with a data digital library was perceived to provide. Not only do researchers want fine-grained control of their research data, but they create a considerable amount of research data in non-digital formats. It is readily apparent from the institutional repository experience that researchers simply will not support services that do not work for them. Given these constraints, data digital libraries are clearly not the whole answer to institutions' obligations around data. Despite their reservations about data digital libraries, researchers do perceive some value in institutional assistance with data management; storage, backup and data conversion services were all seen as useful by a number of researchers in our survey, and could be a way for institutions to build research data management capabilities. Similarly, some of the researchers we surveyed were already accustomed to archiving data sets, and these researchers would probably be both amenable to and capable of managing their data in a data digital library. As such, it is likely worthwhile for institutions to consider creating data digital libraries even though they are not the whole solution. How best to implement a comprehensive data policy that works for all researchers is still an open question for research, as is the best technical solution for meeting institutional obligations around data. Data digital libraries are undoubtedly part of the solution, though some data will require digitization, some will require novel metadata standards, and some will simply require more access granularity than digital libraries are
able to offer. More important than understanding how data digital libraries might work, though, is gaining a deeper understanding of the data practices and concerns of all researchers, because without that understanding neither policy nor technical solutions can meet researchers' needs, and research-oriented systems that do not meet the needs of researchers are destined to fail.

Acknowledgments. The work presented in this paper was funded by the ARROW Project (Australian Research Repositories Online to the World, www.arrow.edu.au). The ARROW Project was funded by the Australian Commonwealth Department of Education, Science and Training, under the Research Information Infrastructure Framework for Australian Higher Education.
References 1. Corti, L.: Progress and Problems of Preserving and Finding Access to Qualitative Data for Social Research – the International Picture of an Emerging Culture. Forum: Qualitative Sozialforschung / Forum: Qualitative Social Research [online journal] 1 (2000) 2. Hey, T., Trefethen, A.: The Data Deluge: An E-Science Perspective. In: Berman, F., Fox, G.C., Hey, A.J.G. (eds.) Grid Computing: Making the Global Infrastructure a Reality, pp. 809–824. John Wiley and Sons, Ltd., Chichester (2003) 3. Arzberger, P., Schroeder, P., Beaulieu, A., Casey, K., Laaksonen, L., Moorman, D., Uhlir, P., Wouters, P.: Promoting Access to Public Research Data for Scientific, Economic and Social Development. Data Science Journal 3, 135–152 (2004) 4. Heery, R., Duke, M., Day, M., Lyon, L., Hursthouse, M.B., Frey, J.G., Coles, S.J., Gutteridge, C., Carr, L.: Integrating Research Data into the Publication Workflow: The Ebank Uk Perspective. PV-2004 Ensuring the Long-Term Preservation and Adding Value to the Scientific and Technical Data. European Space Agency, Frascati, Italy (2004) 5. Zhuge, H.: China's E-Science Knowledge Grid Environment. Intelligent Systems, IEEE 19, 13–17 (2004) 6. National Science Foundation Cyberinfrastructure Council: Cyberinfrastructure Vision for 21st Century Discovery. National Science Foundation, Arlington, VA (2007) 7. The ANDS Technical Working Group: Towards an Australian Data Commons. Australian National Data Service, Canberra, Australia (2007) 8. Shadbolt, A., van der Knijff, D., Young, E., Winton, L.: Sustainable Paths for Data Intensive Research Communities at the University of Melbourne: A Report for the Australian Partnership for Sustainable Repositories. University of Melbourne, Melbourne (2006) 9. Borgman, C., Wallis, J., Enyedy, N.: Building Digital Libraries for Scientific Data: An Exploratory Study of Data Practices in Habitat Ecology. Research and Advanced Technology for Digital Libraries, 170–183 (2006) 10. Karasti, H., Baker, K.S.: Digital Data Practices and the Long Term Ecological Research Program Growing Global. International Journal of Digital Curation 3, 42–58 (2008) 11. ANDS: Research Data Policy and the 'Australian Code for the Responsible Conduct of Research'. Australian National Data Service, Canberra, Australia (2009) 12. Henty, M.: Dreaming of Data: The Library's Role in Supporting E-Research and Data Management. In: Australian Library and Information Association Biennial Conference. Australian Library and Information Association, Alice Springs (2008)
248
D. McKay
13. Borgman, C.L., Wallis, J.C., Mayernik, M.S., Pepe, A.: Drowning in Data: Digital Library Architecture to Support Scientific Use of Embedded Sensor Networks. In: Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries. ACM, Vancouver (2007) 14. Treloar, A., Harboe-Ree, C.: Data Management and the Curation Continuum: How the Monash Experience Is Informing Repository Relationships. In: 14th Biennial VALA Conference and Exhibilition, VALA - Libraries Technologies and the Future, Inc., Melbourne (2008) 15. Freeman, P.A., Crawford, D.L., Kim, S., Munoz, J.L.: Cyberinfrastructure for Science and Engineering: Promises and Challenges. Proceedings of the IEEE 93, 682–691 (2005) 16. Callan, P.: The Development and Implementation of a University-Wide Self Archiving Policy at Queensland University of Technology: Insights from the Front Line. In: SPARC IR04: Institutional repositories: The Next Stage, SPARC, Washington D.C (2004) 17. Fried Foster, N., Gibbons, S.: Understanding Faculty to Improve Content Recruitment for Institutional Repositories. D-Lib 11 (2005) 18. Mackie, M.: Filling Institutional Repositories: Practical Strategies from the Daedalus Project. Ariadne 39 (2004) 19. Piorun, M.E., Palmer, L.A., Comes, J.: Challenges and Lessons Learned: Moving from Image Database to Institutional Repository. OCLC Systems and Services 23, 148–157 (2007) 20. Parker, R., Wolff, H.: Building Swinburne Research Bank: An Engaged, User-Centred Approach to Content Recruitment. Information Online. ALIA, Sydney (2009) 21. Borgman, C., Wallis, J., Enyedy, N.: Little Science Confronts the Data Deluge: Habitat Ecology, Embedded Sensor Networks, and Digital Libraries. International Journal on Digital Libraries 7, 17–30 (2007) 22. Winter, H., Mours, M.: The Cyber Infrastructure Initiative for Rheology. Rheologica Acta 45, 331–338 (2006) 23. Fry, J.: Scholarly Research and Information Practices: A Domain Analytic Approach. Information Processing & Management 42, 299–316 (2006) 24. Borgman, C.L.: The Digital Future Is Now: A Call to Action for the Humanities. Digital Humanities Quarterly 3 (to appear, 2010) 25. Allen, J.: Interdisciplinary Differences in Attitudes Towards Deposit in Institutional Repositories. Department of Information and Communications, Masters. Manchester Metropolitan University, Manchester, 69 (2005) 26. Lyon, L.: Ebank Uk: Building the Links between Research Data, Scholarly Communication, and Learning. Ariadne 36 (2003) 27. Corti, L., Day, A., Backhouse, G.: Confidentiality and Informed Consent: Issues for Consideration in the Preservation of and Provision of Access to Qualitative Data Archives. forum: Qualitative Sozialforschung / Forum: Qualitative Social Research 1 (2000) 28. Humphrey, C.: The Preservation of Research in a Postmodern Culture. IASSIST Quarterly, 24–25 (Spring 2005) 29. Crow, R.: The Case for Institutional Repositories: A Sparc Position Paper. The Scholarly Publishing and Academic Resources Coalition (2002) 30. Whitehead, D.: Repositories: What’s the Target? An Arrow Perspective. In: International Association of Technological University Libraries Conference. IATUL, Quebec City, Canada (2005) 31. Björk, B.-C.: Open Access to Scientific Publications – an Analysis of the Barriers to Change. Information Research 9 (2004) 32. Lawal, I.: Scholarly Communication: The Use and Non-Use of E-Print Archives for the Dissemination of Scientific Information. Issues in Science and Technology Librarianship 36 (2002)
Oranges Are Not the Only Fruit
249
33. Ware, M.: Universities’ Own Electronic Repositories yet to Impact on Open Access. Nature Webfocus (2004) 34. Denison, T., Kethers, S., McPhee, N., Pang, N.: Final Report Work Package Cr1: Move Data from Personal Data Repositories to Trusted Digital Alternatives. Faculty of Information Technology, Monash University, Melbourne, 80 (2007) 35. Davis, P.M., Conolly, M.L.J.: Institutional Repositories: Evaluating the Reasons for NonUse of Cornell’s Installation of Dspace. D-Lib 13 (2007) 36. Gold, A.: Cyberinfrastructure, Data, and Libraries, Part 1: A Cyberinfrastructure Primer for Librarians. D-Lib. 13 (2007) 37. Procter, R., Borgman, C., Bowker, G., Jirotka, M., Olson, G., Pancake, C., Rodden, T., schraefel, m.c.: Usability Research Challenges for Cyberinfrastructure and Tools. In: CHI 2006 extended abstracts on Human factors in computing systems. ACM, Montréal (2006) 38. Zimmerman, A., Nardi, B.A.: Whither or Whether Hci: Requirements Analysis for MultiSited, Multi-User Cyberinfrastructures. In: CHI 2006 extended abstracts on Human factors in computing systems. ACM, Montréal (2006) 39. Joliffe, F.R.: Survey Design and Analysis. Halsted Press, New York (1986) 40. Glaser, B.G., Strauss, A.L.: The Discovery of Grounded Theory: Strategies for Qualitative Research. Aldine de Gruyter, Hawthorne (1967) 41. Hayes, B.E., Harroun, J.L., Temple, B.: Data Management and Curation of Research Data in Academic Scientific Environments. In: 2009 ASIS&T Annual Meeting. American Society for Information Science and Technology, Vancouver, Canada (2009) 42. Gadd, E., Oppenheim, C.: Romeo Studies 2: How Academics Want to Protect Their Open Access Research Papers. Journal of Information Science 29, 333–356 (2003) 43. ANDS: Ands Needs Analysis Survey. Australian National Data Service, Canberra, Australia (2009) 44. Arzberger, P., Schroeder, P., Beaulieu, A., Bowker, G., Casey, K., Laaksonen, L., Moorman, D., Uhlir, P., Wouters, P.: An International Framework to Promote Access to Data. Science 303, 1777–1778 (2004) 45. Buchhorn, M., McNamara, P.: Sustainability Issues for Australian Research Data: The Report of the Australian E-Research Sustainability Survey Project. Australian National University, Canberra, Australia (2006) 46. Jacobs, J.A., Humphrey, C.: Preserving Research Data. Communications of the ACM 47 (2004) 47. Cunningham, S.J., Knowles, C., Reeves, N.: An Ethnographic Study of Technical Support Workers: Why We Didn’t Build a Tech Support Digital Library. In: 1st ACM/IEEE-CS joint conference on Digital libraries. ACM, Roanoke (2001) 48. Adams, A., Blandford, A.: Digital Libraries in Academia: Challenges and Changes. In: Lim, E.-p., Foo, S.S.-B., Khoo, C., Chen, H., Fox, E., Urs, S.R., Costantino, T. (eds.) ICADL 2002. LNCS, vol. 2555, pp. 392–403. Springer, Heidelberg (2002) 49. Woodland, J., Ng, J.: Too Many Systems, Too Little Time: Integrating an Institutional Repository into a University Publications System. In: VALA 2006 13th Biennial Conference and Exhibition. Victorian Association of Academic Libraries, Melbourne (2006) 50. Cunningham, S.J., Nichols, D.M., McKay, D., Bainbridge, D.: An Ethnographic Study of Institutional Repository Librarians: Their Experiences of Usability. In: Open Repositories 2007, San Antonio, TX, USA (2007) 51. Witten, I.H., Bainbridge, D., Nichols, D.M.: How to Build a Digital Library. Morgan Kaufmann, Burlington (2010)
Open Access Publishing: An Initial Discussion of Income Sources, Scholarly Journals and Publishers Panayiota Polydoratou, Margit Palzenberger, Ralf Schimmer, and Salvatore Mele On behalf of the SOAP project1 {polydoratou,palzenberger,schimmer}@mpdl.mpg.de, [email protected]
Abstract. The Study of Open Access Publishing (SOAP) project is one of the initiatives undertaken to explore the risks and opportunities of the transition to open access publishing. Some of the early analyses of open access journals listed in the Directory of Open Access Journals (DOAJ) show that more than half of the open access publishing initiatives were undertaken by smaller publishers, learned societies and a few publishing houses that own a large number of journal titles. Regarding income sources as a means of sustaining a journal's functions, "article processing charges", "membership fee" and "advertisement" are the predominant options for the publishing houses; "subscription to the print version of the journal", "sponsorship" and, somewhat less, "article processing charges" have the highest incidences for all other publishers.
1 Introduction

The advance of technology and the development of the World Wide Web created immense opportunities for people to communicate and exchange information in new ways. This has also been true for scholars and the way they communicate their research findings. Over the last decade, activities around the open access movement rose significantly. Open access literature is online, free of charge for all readers, and permits its distribution and further use for research, education and other purposes [1]. Discussion around sustainable business models for open access publishing has been going on for several years now, and publishers have been experimenting and exploring new opportunities. Some of the areas often discussed are: the basis for charging fees, where publication charges currently come from and where they are expected in the future, the role of the print journal and the role of institutional memberships. Further areas of discussion are the role of waiver policies, new models for assessing the impact of research, and whether these have an impact on submission levels and growth.
Important notice: The research results of this Project are co-funded by the European Commission under the FP7 Research Infrastructures Grant Agreement Nr. 230220. This document contains material, which is the copyright of certain SOAP beneficiaries, and may not be reproduced or copied without permission. The information herein does only reflect the views of its authors and not those of the European Commission. The European Commission and the beneficiaries do not warrant that the information contained herein is capable of use, or that use of the information is free from risk, and they are not responsible for any use that might be made of data appearing herein.
It is frequently seen that publishers have been experimenting with a combination of different income sources and seeking opportunities to explore new partnerships, one example being collaboration with learned societies. In Europe, the European Commission recognized the need to examine the potential for change in the scholarly publishing arena [2] and to explore initiatives that would make suggestions at policy level for a smooth transition to open access. The SOAP project is one of the initiatives undertaken to explore the risks and opportunities of the transition to full open access publishing.

The Study of Open Access Publishing (SOAP) Project

The Study of Open Access Publishing (SOAP, http://project-soap.eu) is a two-year project, funded by the European Commission under FP7 (Seventh Framework Programme). The project is coordinated by CERN, the European Organization for Nuclear Research, and the SOAP consortium represents key stakeholders such as publishers (BioMed Central Ltd (BMC), Sage Publications Ltd (SAGE UK) and Springer Science+Business Media Deutschland GmbH (SSBM)), funding agencies (Science and Technology Facilities Council (STFC), UK), libraries (Max Planck Digital Library) and a broad spectrum of research disciplines. One of the project's aims is to describe and analyze open access publishing. SOAP aims to compare and contrast business models. Such an approach will allow for a better understanding of the marketplace as well as the opportunities and risks associated with open access publishing. The foundation for the study is an understanding of the market penetration of present open access publishing offers, and this paper presents a first part of this.
2 Methodology

The findings presented in this short paper are based on a quantitative analysis of open access journals. Journal-level metadata were downloaded from the Directory of Open Access Journals (DOAJ, http://www.doaj.org/) during July 2009. In addition to the DOAJ data, information about publisher types, copyright practices and income sources was manually collected from the journals' websites. The data collection took place between October 2009 and January 2010.
3 Preliminary Results

Publisher characteristics – size and type. The DOAJ data file listed 4568 records. After excluding duplicate records, there were 4032 journal titles and 2586 publisher names. More than half (56%) of the publishers were associated with one journal only. Less than a quarter (21%) of the publishers own between 2 and 9 journals, and 9% own between 10 and 49 journals. There are only five publishers with more than 50 journal titles each (14%). Those publishers are: Bentham Open, BioMed Central, Hindawi Publishing Corporation, Internet Scientific Publications – LLC and Medknow Publications (Table 1). It should be noted that results differ at the article level as compared to the journal level, which is discussed here.
Table 1. "OA-size" of publishers by number of journal titles in DOAJ

Publisher size class by number of journals | Number of publishers | Number of journal titles | % of journal titles in DOAJ
1 | 2270 | 2270 | 56
2 to 9 | 286 | 845 | 21
10 to 49 | 25 | 362 | 9
≥ 50 | 5 | 555 | 14
Total | 2586 | 4032 | 100
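As a rough illustration of how the size classes in Table 1 can be derived from a journal-level metadata dump, the following Python sketch groups journals by publisher and buckets the publishers into the four classes; the input file name and field name below are assumptions, not the actual DOAJ export schema.

import csv
from collections import Counter

def size_class(n):
    # Size classes used in Table 1.
    if n == 1:
        return "1"
    if n <= 9:
        return "2 to 9"
    if n <= 49:
        return "10 to 49"
    return ">= 50"

# Assumed input: one row per journal (after de-duplication) with a 'publisher' column.
with open("doaj_journals.csv", newline="", encoding="utf-8") as f:
    journals_per_publisher = Counter(row["publisher"] for row in csv.DictReader(f))

publishers = Counter()
titles = Counter()
for publisher, n in journals_per_publisher.items():
    cls = size_class(n)
    publishers[cls] += 1
    titles[cls] += n

total_titles = sum(titles.values())
for cls in ("1", "2 to 9", "10 to 49", ">= 50"):
    share = 100.0 * titles[cls] / total_titles
    print(f"{cls}: {publishers[cls]} publishers, {titles[cls]} titles, {share:.0f}% of titles")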
Publishers were also grouped by type. The categories considered were publishing houses, learned societies and individual/other initiatives. The publishers with the highest number of journal titles in DOAJ are primarily commercial publishers, while learned societies are represented by fewer journal titles. The majority of the publishers with fewer than 50 journal titles represent individual initiatives; some examples are academic departments, universities, governmental organizations, international organizations and foundations.

Income funds. Between October 2009 and January 2010, the project partners manually collected information about visible income funds of the journals from their websites. The information sought referred to the following options: article processing charges, membership fees, advertisement, sponsorship, subsidy, subscription to the print version of the journal and hard copy sales. The following table lists the income sources that were investigated and gives their relative share [%] at the level of journal title. The selection of income sources allowed for multiple responses. "Article processing charges", "membership fee" and "advertisement" are the predominant options for the large publishing houses, whereas "subscription", "sponsorship" and, somewhat less, "article processing charges" have the highest incidences for all other publishers.

Table 2. Income sources for OA journals by size of publisher

Publisher size class (number of journal titles per publisher) | Number of journal titles | Number of journal titles successfully processed | a | b | c | d | f | g | x
1 | 2270 | 954 | 15 | 8 | 13 | 37 | 45 | 15 | 21
2 to 9 | 845 | 438 | 45 | 8 | 29 | 86 | 91 | 32 | 35
10 to 49 | 362 | 185 | 51 | 8 | 15 | 11 | 55 | 5 | 40
≥ 50 | 555 | 540 | 88 | 76 | 83 | 23 | 28 | 61 | 11
Total | 4032 | 2117 | 199 | 100 | 140 | 157 | 219 | 113 | 107

Income sources [%], multiple selection: a = article processing charges; b = membership fee; c = advertisement; d = sponsorship; f = subscription to the print version of the journal; g = hard copy; x = other.
However, one should take into consideration that these findings differ at the article level as compared to the journal level, which is discussed here.
4 Summary and Future Work

Some of the early analyses from the DOAJ data show that more than half of the open access publishing initiatives were undertaken by academic institutions, governmental organizations, foundations, university presses, individuals, etc. Learned societies have so far been identified for about 14% of the DOAJ publishers' records. The publishing houses listed in DOAJ are dominant in terms of the number of journals that they publish. Regarding income sources as a means of sustaining a journal's viability, there is a distinctly different pattern in the overall prevalence of the options between the bigger publishers and those smaller in size. At the journal title level, "article processing charges", "membership fee" and "advertisement" are the predominant options for the 5 publishers that have more than 50 journal titles associated with them, whereas "subscription", "sponsorship" and, somewhat less, "article processing charges" have the highest incidences for all other publishers. This pattern is somewhat different when one looks at article-level information compared to the journal-level information discussed here. The SOAP project is currently finalizing data analyses pertaining to DOAJ data. Further analysis is currently being conducted with respect to the copyright/licensing options that are practiced, the income options found in subject domains, as well as the number of articles produced (data collected for 2008 or, where not available, for 2007). Other current work involves a review and comparison of large publishers' experimentation with open access. Specifically, SOAP partners are reviewing the share of hybrid journals in the market, the open access share of hybrid journals, and the open access share of publishers' total article output. Future work includes a large-scale questionnaire survey looking into scholars' practices, attitudes and requirements when it comes to open access publishing.

Open Access. This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.
References 1. Open Access at the Max Planck Society guide. Available at, http://www.mpdl.mpg.de/main/Open_Acces_MPDL_Flyer_3.pdf 2. http://ec.europa.eu/research/science-society/ document_library/pdf_06/scientific-info-results-crestfinal-090609_en.pdf
Ontology-Based Information Management: Exploiting User Perspective within the Motion Picture Industry Sharmin (Tinni) Choudhury, Kerry Raymond, and Peter Higgs Queensland University of Technology Brisbane, Australia {t.choudhury,k.raymond,p.higgs}@qut.edu.au
Abstract. In many industries, practitioners engaged in different tasks can pose the same question to which they expect vastly different answers. The answers need to be framed from their perspective, where perspective is based on their role within the industry. We introduce a method of determining the perspective of an agent within the context of the Motion Picture industry from examination of a domain ontology. Keywords: Ontology, Relatedness Metric, Information Management, Human Factors, Motion Picture Industry.
The motion picture industry is an information-intensive industry with multiple information silos, not only across organizational boundaries but within a given organization. In short, the motion picture industry is in need of innovative techniques of information exploitation and has agents with varying perspectives. We have developed: (i) a domain ontology for the motion picture industry, called the Loculus ontology; (ii) a rule-based metric for determining the semantic relatedness of concepts within a given ontology [1]; and (iii) the Loculus system [2], which uses the domain ontology and the relatedness metric for perspective-based information extraction and classification.

The Loculus domain ontology captures the production process of the motion picture industry and also models the motion picture product. The Loculus ontology is governed by a set of axioms. It captures key context of the domain, such as the timelines used by the industry and, most importantly, the agent context of the industry. The Loculus ontology allows perspective to be determined through analysis of the modeling of agents in relation to other industry concepts. The relatedness metric uses a set of rules to assign weights to edges of the graph-like ontology based on the inheritance and relationships that connect two concepts. These weights can then be used to determine which concepts the agent has a near view of and which concepts are at the periphery of the agent's awareness. This information can then be used to assist agents in their information management activities. For example, if an editor poses the question "How do you film a punch?", the Loculus ontology reveals that editors are closely associated with concepts such as crosscutting and cutting-on-action, which are editing techniques. They are not closely associated with concepts such as make-up and method acting. This information about the editor can then be used to better serve their information needs. Although our research was based on the motion picture industry, it has more general application in other domains which also have a wide range of agent roles with varying perspectives on domain concepts based on those roles, e.g., health.
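The actual weighting rules of the Loculus metric are defined in [1]; purely to illustrate the general idea of edge-weighted relatedness, the Python sketch below assigns smaller weights to some relationship types than others in a toy ontology graph and uses weighted shortest paths as a crude relatedness score. The concepts, edge types and weights here are invented, not the Loculus rules.

import networkx as nx

# Toy ontology fragment: relationship weights are invented for illustration only.
# A lower accumulated weight means the concept is 'nearer' to the agent's perspective.
WEIGHTS = {"is_a": 0.5, "uses_technique": 1.0, "works_with": 2.0}

G = nx.Graph()
edges = [
    ("editor", "editing_technique", "uses_technique"),
    ("editing_technique", "crosscutting", "is_a"),
    ("editing_technique", "cutting_on_action", "is_a"),
    ("editor", "actor", "works_with"),
    ("actor", "method_acting", "uses_technique"),
    ("actor", "make_up", "works_with"),
]
for u, v, rel in edges:
    G.add_edge(u, v, weight=WEIGHTS[rel])

def relatedness(agent, concept):
    # Sum of edge weights along the cheapest path between agent and concept.
    return nx.shortest_path_length(G, agent, concept, weight="weight")

# An editor is 'nearer' to crosscutting than to make-up:
print(relatedness("editor", "crosscutting"))  # 1.5
print(relatedness("editor", "make_up"))       # 4.0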
References
1. Choudhury, S., Raymond, K., Higgs, P.: A Rule-based Metric for Calculating Semantic Relatedness Score for the Motion Picture Industry. In: Workshop on Natural Language Processing and Ontology Engineering at the IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology, Sydney, Australia (2008)
2. Choudhury, S., et al.: Loculus: A Metadata Wrapper for Digital Motion Pictures. In: 11th IASTED International Conference on Internet and Multimedia Systems and Applications, Honolulu, Hawaii, USA (2007)
Expertise Mapping Based on a Bibliographic Keyword Annotation Model Choochart Haruechaiyasak, Santipong Thaiprayoon, and Alisa Kongthon Human Language Technology Laboratory (HLT) National Electronics and Computer Technology Center (NECTEC) Thailand Science Park, Klong Luang, Pathumthani 12120, Thailand {choochart.har,santipong.tha,alisa.kon}@nectec.or.th
Abstract. Expert finding is a task of identifying a list of people who are considered experts in a given specific domain. Many previous works have adopted bibliographic records (i.e., publications) as a source of evidence for representing the areas of expertise [1,2]. In this paper, we present an expertise mapping approach based on a probabilistic keyword annotation model constructed from bibliographic data. To build the model, we use the Science Citation Index (SCI) database as the main publication source due to its large coverage on science and technology (S&T) research areas. To represent the expertise keywords, we use the subject category field of the SCI database which provides general concepts for describing knowledge in S&T such as “Biotechnology & Applied Microbiology”, “Computer Science, Artificial Intelligence” and “Nanoscience & Nanotechnology”. The keyword annotation model contains a set of expertise keywords such that each is represented with a probability distribution over a set of terms appearing in titles and abstracts. Given publication records (perhaps from different sources) of an expert, a set of keywords can be automatically assigned to represent his/her area of expertise. Keywords: Expertise mapping, expert finding, keyword annotation, bibliographic database, Science Citation Index (SCI).
1 The Proposed Expertise Mapping Model
The proposed system architecture for implementing an expert finder system is illustrated in Fig. 1. The first step is to acquire a sample set of publications from a bibliographic database for constructing expert profiles. The expertise mapping is used to automatically assign related keywords to each expert. To allow users to search by keywords, a search index is constructed from different fields such as title, authors, abstract and subject category. Another feature is to find the relationships among the experts and visualize them as a network. We consider two types of relationships: direct and indirect. The direct (or social) relationship is defined as the co-authoring degree between one researcher and others. The indirect (or topical) relationship is defined as the similarity degree between the expertise keyword sets of two experts.
Fig. 1. The proposed system architecture for implementing an expert finder system
Based on Bayes' theorem, the probabilistic keyword annotation model can be formulated as follows:

P(KW \mid T) = \frac{P(KW) \times P(T \mid KW)}{P(T)}    (1)

where KW is the keyword set (from the subject category) and T is the set of terms (from the title and abstract). Given T = \{t_0, t_1, \ldots, t_{p-1}\} and assuming conditional independence among terms, we can construct the probabilistic model for a keyword k_i as follows:

p(k_i \mid t_0, t_1, \ldots, t_{p-1}) = \frac{p(k_i) \prod_{j=0}^{p-1} p(t_j \mid k_i)}{p(t_0, t_1, \ldots, t_{p-1})}    (2)

The conditional probability value for each term t_j given a keyword k_i, p(t_j \mid k_i), can be calculated by normalizing the distribution of all terms in T over the keyword k_i:

p(t_j \mid k_i) = \frac{f(t_j, k_i)}{\sum_{t \in T} f(t, k_i)}    (3)

where f(t, k_i) denotes the frequency of term t in the publications annotated with keyword k_i. Using Eq. 3, all keywords in KW are applied to obtain the keyword annotation model. To perform expertise mapping, we first extract a set of terms from the expert's publications (i.e., T(expert)). Then we calculate p(k_i \mid T(expert)), the conditional probability for each keyword k_i given the set of terms T(expert). The resulting probability values are ranked in descending order. The top-ranked keywords can then be assigned (i.e., annotated) to this expert.
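A minimal Python sketch of how such a model could be estimated and applied is shown below. The training-record format, the use of log probabilities and the add-one smoothing are implementation assumptions on our part, not details given in the paper.

import math
from collections import Counter, defaultdict

def train(records):
    """records: iterable of (terms, keyword) pairs, where terms is a list of
    title/abstract terms and keyword is an SCI subject category (assumed format)."""
    term_counts = defaultdict(Counter)   # keyword -> term frequencies
    keyword_counts = Counter()           # keyword -> number of records
    vocab = set()
    for terms, kw in records:
        keyword_counts[kw] += 1
        term_counts[kw].update(terms)
        vocab.update(terms)
    return term_counts, keyword_counts, vocab

def rank_keywords(expert_terms, term_counts, keyword_counts, vocab, top_n=5):
    """Score each keyword for an expert's publication terms and return the top-ranked ones."""
    total_records = sum(keyword_counts.values())
    scores = {}
    for kw, counts in term_counts.items():
        denom = sum(counts.values()) + len(vocab)             # add-one smoothing (assumption)
        score = math.log(keyword_counts[kw] / total_records)  # log p(k_i)
        for t in expert_terms:
            score += math.log((counts[t] + 1) / denom)        # log p(t_j | k_i)
        scores[kw] = score
    return sorted(scores, key=scores.get, reverse=True)[:top_n]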
References 1. Haruechaiyasak, C., Kongthon, A., Thaiprayoon, S.: Building a Thailand Researcher Network Based on a Bibliographic Database. In: Proc. of the 9th ACM/IEEE-CS Joint Conf. on Digital Libraries (JCDL), pp. 391–392 (2009) 2. Murray, C., Ke, W., Borner, K.: Mapping Scientific Disciplines and Author Expertise Based on Personal Bibliography Files. In: Proc. of the 10th Int. Conf. on Information Visualisation, pp. 258–263 (2006)
Extended Poster Abstract: Open Source Solution for Massive Map Sheet Georeferencing Tasks for Digital Archiving
Janis Jatnieks
University of Latvia, Faculty of Geography and Earth Sciences, Alberta 10, Riga, LV-1010, Latvia
[email protected]

1 Introduction

Scanned maps need to be georeferenced to be useful in a GIS environment for data extraction (vectorization), web publishing or spatially-aware archiving. Widely used software solutions with georeferencing functionality are designed to suit a universal scenario for georeferencing many different kinds of data sources. Such a general-purpose design also makes them very time-consuming when georeferencing a large number of map sheets to a known grid. This work presents an alternative scenario for georeferencing large numbers of map sheets in a time-efficient manner and implements the approach as the MapSheetAutoGeoRef plug-in for the freely available open source Quantum GIS [1].
1 Introduction Scanned maps need to be georeferenced, to be useful in a GIS environment for data extraction (vectorization), web publishing or spatially-aware archiving. Widely used software solutions with georeferencing functionality are designed to suit a universal scenario for georeferencing many different kinds of data sources. Such general nature also makes them very time-consuming for georeferencing a large number of map sheets to a known grid. This work presents an alternative scenario for georeferencing large numbers of map sheets in a time-efficient manner and implements the approach as the MapSheetAutoGeoRef plug-in for the freely available open source Quantum GIS [1].
Open Source Solution for Massive Map Sheet Georeferencing Tasks
259
writes the georeferencing information into the image header. The resulting files are written in the GeoTIFF open standard raster file format [2]. The strength of this approach lies in universality. It is suitable for spatially referencing many different types of map series, irrespective of their visual style, framing, physical state of the material, image acquisition artifacts or the target coordinate reference system, making this approach highly suitable for digital archiving. While it is possible to automate the corner recognition process with machine vision methods, such a solution is specific to certain visual characteristics of the targeted map series and therefore dependent on further customized software development [3].
3 Conclusion

The MapSheetAutoGeoRef plug-in automates most of the steps in the georeferencing process and is currently being extended to provide additional functionality such as warping, resampling the map sheet rasters to versions with a fixed cell size, masking map sheet margins and mosaicking the resulting rasters into a single large mosaic. This would permit a single workflow to produce a geospatial data product suitable for on-line publishing through WMS [4]. The results indicate significant time savings for the process, allowing possibly thousands of map sheets to be georeferenced in a person-day. This work should greatly extend the georeferencing capacity of small teams working towards digital map archiving, data acquisition from legacy materials or map publishing in electronic form.

Acknowledgments. This work was supported by the European Social Fund project "Establishment of interdisciplinary scientist group and modelling system for groundwater research".
References 1. Quantum GIS, http://www.qgis.org 2. GeoTIFF, http://trac.osgeo.org/geotiff/ 3. Titova, O.A., Chernov, A.V.: Method for the automatic georeferencing and calibration of cartographic images. In: Pattern Recognition and Image Analysis, vol. 19(1), pp. 193–196. Pleiades Publishing, Moscow (2009) 4. OpenGIS Web Map Service (WMS) Implementation Specification, OGC, http://www.opengeospatial.org/standards/wms
Describing OAI-ORE from the 5S Framework Perspective
Nádia P. Kozievitch and Ricardo da S. Torres
Institute of Computing, University of Campinas, Campinas, SP, Brazil
{nadiapk,rtorres}@ic.unicamp.br
Abstract. Despite the popularity of applications which manage complex digital objects, few attempts have been made to formally characterize them and their services. This poster addresses this problem by starting an analysis of the OAI-ORE specifications from the 5S framework perspective, verifying how they can be integrated to describe complex digital objects as resources that could later be exchanged.
1 Introduction
There are several applications that need support for complex digital objects, such as new mechanisms for managing data; creating references, links and annotations; and clustering or organizing complex digital objects and their components. Despite the popularity of such applications, few attempts have been made to formally characterize digital objects and their services. This poster addresses this problem through an analysis of the OAI-ORE specifications from the 5S framework perspective. The OAI-ORE [1,2] specifications develop, identify, and profile extensible standards and protocols to allow repositories, agents, and services to interoperate in the context of use and reuse of compound digital objects. It defines logical boundaries, the relationships among internal components, and their relationships to other resources. The 5S Framework (Streams, Structures, Spaces, Scenarios, and Societies) [3] is a formal theory to describe digital libraries. Streams are sequences of elements of an arbitrary type (e.g., bits, characters, images, etc.). A structure specifies the way in which parts of a whole are arranged or organized. A space is a set of objects together with operations on those objects that obey certain constraints. Scenarios can be used to describe external system behavior from the user's point of view; provide guidelines to build a cost-effective prototype; or help to validate, infer, and support requirements specifications and provide acceptance criteria for testing. A society is a set of entities and the relationships between them.
2 Mapping OAI-ORE to 5S Concepts

Figure 1 presents concepts from the 5S framework and OAI-ORE. A simple digital object defined by the 5S framework is a resource in OAI-ORE.
Fig. 1. Mapping concepts from OAI-ORE and the 5S framework
Resources are identified by URIs in OAI-ORE and by handlers in the 5S framework. This resource can be composed into a compound object in OAI-ORE and can be represented by a complex object in the 5S framework. The unit of encapsulation in OAI-ORE is the resource map, which maps the resources and their integration. The digital object tuple in the 5S framework is a formal definition to express structure and organization.
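To make the OAI-ORE side of this mapping concrete, the Python sketch below builds a minimal resource map describing an aggregation (a compound digital object) using the rdflib library. The URIs are invented placeholders, and the choice of rdflib is ours for illustration only.

from rdflib import Graph, Namespace, URIRef
from rdflib.namespace import RDF

ORE = Namespace("http://www.openarchives.org/ore/terms/")

# Invented placeholder URIs for a compound digital object and its parts.
rem = URIRef("http://example.org/object/1/rem")    # resource map
aggr = URIRef("http://example.org/object/1")       # aggregation (compound object)
parts = [URIRef("http://example.org/object/1/page1.tif"),
         URIRef("http://example.org/object/1/metadata.xml")]

g = Graph()
g.bind("ore", ORE)
g.add((rem, RDF.type, ORE.ResourceMap))
g.add((rem, ORE.describes, aggr))        # the resource map describes the aggregation
g.add((aggr, RDF.type, ORE.Aggregation))
for p in parts:
    g.add((aggr, ORE.aggregates, p))     # aggregated resources ~ parts of a 5S complex object

print(g.serialize(format="turtle"))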
3 Discussion and Future Work
The main goals of mapping OAI-ORE to the 5S framework are: (i) to use OAI-ORE as an interoperability layer, describing digital library content and its aggregation via an OAI-ORE resource map; (ii) to describe digital objects that will later be exchanged as web resources; and (iii) to formally take advantage of the 5S formalism and the OAI-ORE specifications to better understand the interactions among digital library components.
Acknowledgments. We would like to thank CAPES, FAPESP and CNPq - BioCORE Project.
References
1. Lagoze, C., Sompel, H.V.: Compound information objects: the oai-ore perspective. Open Archives Initiative Object Reuse and Exchange, White Paper (2007), http://www.openarchives.org/ore/documents
2. Lynch, C., Parastatidis, S., Jacobs, N., de Sompel, H.V., Lagoze, C.: The oai-ore effort: progress, challenges, synergies. In: JCDL 2007, p. 80. ACM, New York (2007)
3. Gonçalves, M.A., Fox, E.A., Watson, L.T., Kipp, N.A.: Streams, structures, spaces, scenarios, societies (5s): A formal model for digital libraries. ACM TOIS 22(2), 270–312 (2004)
Matching Evolving Hilbert Spaces and Language for Semantic Access to Digital Libraries
Peter Wittek (1), Sándor Darányi (2), and Milena Dobreva (3)
(1) Department of Computer Science, National University of Singapore, Computing 1, Law Link, Singapore 117590
(2) Swedish School of Library and Information Science, University of Borås and Göteborg University, Allégatan 1, 50190 Borås, Sweden
(3) Centre for Digital Library Research, Information Services Directorate, University of Strathclyde, Livingstone Tower, 26 Richmond Street, Glasgow, G1 1XH, United Kingdom

1 Background
Extended by function (Hilbert) spaces, the 5S model of digital libraries (DL) [1] enables a physical interpretation of vectors and functions to keep track of the evolving semantics and usage context of the digital objects by support vector machines (SVM) for text categorization (TC). For this conceptual transition, three steps are necessary: (1) the application of the formal theory of DL to Lebesgue (function, L2) spaces; (2) considering semantic content as vectors in the physical sense (i.e. position and direction vectors) rather than as in linear algebra, thereby modelling word semantics as an evolving field underlying classifications of digital objects; (3) the replacement of vectors by functions in a new compact support basis function (CSBF) semantic kernel utilizing wavelets for TC by SVMs.
2 Experimental Results
We processed 5946 abstracts with LCSH metadata for machine learning (ML) from the Strathprints digital repository, University of Strathclyde [2]. Keywords were obtained by a WordNet-based stemmer using the controlled vocabulary of the lexical database, resulting in 11586 keywords in the abstracts, and were ranked according to the Jiang-Conrath distance with the algorithm described in [3]. With altogether 176 classes, the research question was how efficiently SVM kernels can reproduce fine-grained text categories based on abstracts only. The corpus was split into 80% training data and 20% test data, without validation. Multilabel, multiclass classification problems were split into one-against-all binary problems, and their micro- and macro-averaged precision and recall values plus F1 scores were calculated. Only C-SVMs were benchmarked, with the C penalty parameter left at the default value of 1. The implementation used the libsvm library [4] with linear, polynomial and RBF kernels on vectors to study classification performance. Polynomial kernels were benchmarked at second and third degree, and
Matching Evolving Hilbert Spaces and Language for Semantic Access
263
Table 1. Results on the StrathPrints collection Kernel Support length Microaverage P Macroaverage P Microaverage R Macroaverage R Microaverage F1 Macroaverage F1
RBF kernels by a small value (γ = 1/size of feature set) parameter as well as relatively high ones (γ = 1 and 2). A B-spline kernel with multiple parameters was benchmarked with the length of support ranging between 2 and 10. In terms of the micro- and macroaverage F1 measures, in three out of four cases the wavelet kernel outperformed the traditional kernels while reconstructing existing classification tags based on abstracts (1). In all, the wavelet kernel performed best in the task of reconstructing the existing classification on a deeper level from abstracts.
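A hedged sketch of the benchmarking protocol described above (one-against-all binary C-SVMs, default C = 1, micro- and macro-averaged scores) might look as follows. It uses scikit-learn's libsvm-backed SVC rather than the libsvm command-line tools, and the feature matrix and label sets are placeholders, not the StrathPrints data; a custom compactly supported kernel, such as the B-spline sketch above, could be added to the kernel dictionary as another callable entry.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import f1_score, precision_score, recall_score

# Placeholder data: a term-weight matrix and multilabel class assignments.
rng = np.random.default_rng(0)
X = rng.random((500, 200))                                 # e.g. abstracts x keywords
labels = [set(rng.choice(20, size=2)) for _ in range(500)]
Y = MultiLabelBinarizer().fit_transform(labels)

X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.2, random_state=0)

kernels = {
    "linear": SVC(kernel="linear", C=1.0),
    "poly-2": SVC(kernel="poly", degree=2, C=1.0),
    "poly-3": SVC(kernel="poly", degree=3, C=1.0),
    "rbf":    SVC(kernel="rbf", gamma=1.0 / X.shape[1], C=1.0),
    # a compactly supported callable kernel could be added here
}

for name, base in kernels.items():
    clf = OneVsRestClassifier(base).fit(X_tr, Y_tr)        # one-against-all binary problems
    pred = clf.predict(X_te)
    for avg in ("micro", "macro"):
        p = precision_score(Y_te, pred, average=avg, zero_division=0)
        r = recall_score(Y_te, pred, average=avg, zero_division=0)
        f = f1_score(Y_te, pred, average=avg, zero_division=0)
        print(f"{name:7s} {avg:5s} P={p:.3f} R={r:.3f} F1={f:.3f}")
```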
Acknowledgement Research by the second and third authors was funded by the SHAMAN EU project (grant no.: 216736).
References
1. Gonçalves, M., Fox, E., Watson, L., Kipp, N.: Streams, structures, spaces, scenarios, societies (5S): A formal model for digital libraries. ACM Transactions on Information Systems 22(2), 270–312 (2004)
2. Dawson, A., Slevin, A.: Repository case history: University of Strathclyde, Strathprints (2008)
3. Wittek, P., Darányi, S., Tan, C.: Improving text classification by a sense spectrum approach to term expansion. In: Proceedings of CoNLL 2009, 13th Conference on Computational Natural Language Learning, Boulder, CO, USA, pp. 183–191 (2009)
4. Chang, C.C., Lin, C.J.: LIBSVM: A library for support vector machines (2001), http://www.csie.ntu.edu.tw/~cjlin/libsvm
Towards a Pattern-Based Approach for Scientific Writing and Publishing in Chinese
Hao Xu1,2,3 and Changhai Zhang1
1 College of Computer Science and Technology, Jilin University, China
2 Center for Computer Fundamental Education, Jilin University, China
3 Department of Information Science and Engineering, University of Trento, Italy
[email protected], [email protected]
Abstract. Writing and publishing scientific papers is an integral part of a researcher's life and of knowledge dissemination. In China, most on-line publishing today remains an exhibition of electronic facsimiles of traditional articles, out of step with advances in the Semantic Web. Our project proposes a pattern-based approach to scientific writing and publishing in Chinese that provides empirical templates of discourse rhetorical structure, together with attribute and entity annotation based on Semantic Web technologies.
1 Introduction
An inspiring motivation came from the “Article of the Future” initiative announced by Elsevier in 2009. Their prototypes give readers more choices for reaching the objects of interest via individualized access routes, and an ongoing case study in the journal Cell shows the feasibility and benefits of this new discourse representation format. In fact, each scientific paper carries a precise semantics, kept in the authors' minds but hardly ever articulated explicitly to readers. The interlinked knowledge of this hidden rhetorical structure therefore remains tacit and is not exposed by existing on-line publishing tools. Externalization enriched with semantics and rhetoric makes scientific publications much easier to disseminate, navigate, understand and reuse in research communities. Organizing an article not only by its linear structure but also by its rhetorical structure, with semantic links and metadata, helps readers access specific information units more efficiently without being overwhelmed by undesirable additional detail. Significant research along these lines has been done by Anita de Waard [1], Tudor Groza et al. [2], and others, but so far little of it applies to Chinese discourse, owing to different writing and publishing cultures.
2 Overview
Our methodology is built on Rhetorical Structure Theory (RST) [3], complemented by further analysis of Chinese discourse representation. The main task of the project is to study how different rhetorical structures can describe the content composition and rhetorical relations of various types of Chinese discourse, and how patternized bibliographic entity relationships can categorize knowledge about an article at both the data and the metadata level. Chinese discourse is organized by rhetorical structures that differ from those of English discourse, yet the elementary rhetorical chunks are quite similar, such as “State of the Art”, “Motivation”, “Experimental Results”, “Evaluation” and “Discussion”; the differing writing cultures simply order these chunks in dissimilar ways. These serialized rhetorical chunks within Chinese discourse constitute the basis of our pattern approach. Another focus is the use of semantic technologies to manage interlinked attributes and entities, which enriches the semantics and rhetoric of scientific publication representation and visualization. The output of the project will be an on-line editing and publishing utility that facilitates scientific authoring, reading, reviewing, searching and collaborative work in Chinese research communities.
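As a hedged illustration only (the paper does not define a concrete schema), the sketch below models a discourse pattern as an ordered list of rhetorical chunks with simple attribute annotations; the chunk names come from the examples above, while the class names, fields, and the sample pattern are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class RhetoricalChunk:
    """One elementary rhetorical unit of a scientific article."""
    role: str                                               # e.g. "Motivation", "Experimental Results"
    text: str = ""                                          # the chunk's content
    attributes: Dict[str, str] = field(default_factory=dict)  # semantic annotations

@dataclass
class DiscoursePattern:
    """An ordered template of rhetorical chunks for one article genre."""
    genre: str
    chunk_order: List[str]

    def instantiate(self) -> List[RhetoricalChunk]:
        return [RhetoricalChunk(role=r) for r in self.chunk_order]

# A hypothetical pattern: the same chunks as an English paper, ordered differently.
chinese_experimental_paper = DiscoursePattern(
    genre="experimental-paper-zh",
    chunk_order=["Motivation", "State of the Art", "Experimental Results",
                 "Evaluation", "Discussion"],
)

article = chinese_experimental_paper.instantiate()
article[0].attributes["dcterms:subject"] = "digital libraries"  # entity/attribute annotation
print([c.role for c in article])
```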
Acknowledgement This research is partially supported by China Scholarship Council (CSC). Many thanks to Prof. Fausto Giunchiglia for his inspiring suggestions.
References
[1] de Waard, A., Tel, G.: The ABCDE format enabling semantic conference proceedings. In: SemWiki (2006)
[2] Groza, T., Handschuh, S., Möller, K., Decker, S.: SALT: semantically annotated LaTeX for scientific publications. In: Franconi, E., Kifer, M., May, W. (eds.) ESWC 2007. LNCS, vol. 4519, pp. 518–532. Springer, Heidelberg (2007)
[3] Thompson, S.A., Mann, W.C.: Rhetorical structure theory: A theory of text organization. Technical report, Information Sciences Institute (1987)
Evaluation Algorithm about Digital Library Collections Based on Data Mining Technology
Yumin Zhao, Zhendong Niu, and Lin Dai
School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
{zhymbit,bitd2000}@gmail.com, [email protected]

Abstract. We present an evaluation algorithm for digital library collections based on data mining technology, intended to provide better decision-making support. The algorithm combines association analysis, classification and numeric prediction techniques, and it has been applied in our evaluation research.
1 Introduction

In our work we introduce a natural-integration ideology and data mining technology into digital library evaluation research (these require more in-depth analysis beyond this poster; only a brief summary is given here). On the one hand, there are two main ideologies in digital library evaluation research today [1]: expert-centric evaluation, which relies on experts' professional judgement, and user-centric evaluation, which relies entirely on users' experience of the digital library [2] [3]. We propose the natural-integration ideology to resolve the divergences between them in an intrinsic way, rather than through surface-level reference to and complementation of each other. On the other hand, our research team considers data mining techniques well suited to digital library evaluation. First, evaluation research involves massive amounts of data, which is precisely the foundation of data mining. Second, the result of digital library evaluation is a decision, and the task of data mining is precisely to support decision making; data mining may even reveal interesting factors that neither users nor experts would ever find on their own.

We divide the entire evaluation into a few modules. Evaluation of a digital library's collections is the most important module and fully embodies the above ideology. We give up the full-scale index systems often used in traditional expert-centric evaluation, as well as the statistical analysis of questionnaires often used in traditional user-centric evaluation. Instead, we construct datasets that express both expert-view factors, such as the types, range, number and time of collections, and user-view factors, such as users' scores for each aspect of the collections. We then adopt improved data mining techniques, including association analysis, classification and numeric prediction, to obtain the evaluation model. The evaluation algorithm is briefly described as follows (an illustrative sketch follows the list):

1) Construct a data warehouse selected from the collections metadata database and user data reflecting all kinds of views.
2) Preprocess the data, including cleaning, integration, transformation, discretization and concept hierarchy generation, to obtain the dataset for data mining.
3) Apply association analysis to each attribute and select frequent terms linked to the score; then perform data reduction and select the training-set attributes whose support and confidence exceed a threshold.
4) Apply improved classification and numeric prediction techniques to build a computing model for each collection, based on domain knowledge.
5) Obtain a score for every collection from the computing model and update the database.
6) According to a library classification scheme, derive the distribution model of collection breadth and calculate a summary score for every category.
7) Calculate the distribution model of user demand.
8) Perform a cross-over analysis of the two distribution models above to obtain an objective degree of satisfaction and, from this, the collection evaluation result.
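The following sketch is only a rough illustration of steps 3) to 5) under stated assumptions, not the authors' implementation: collection attributes are assumed to be already discretized into binary indicators, support and confidence are computed against a binarized user score, and a decision-tree regressor stands in for the "improved classification and numeric prediction techniques", which the poster does not specify.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Hypothetical discretized dataset: rows = collections, binary attribute indicators
# (expert-view factors) plus an aggregated user score in [0, 5] (user-view factor).
rng = np.random.default_rng(0)
attrs = pd.DataFrame(rng.integers(0, 2, size=(300, 6)),
                     columns=["type_book", "type_journal", "recent", "wide_range",
                              "large_volume", "has_fulltext"])
score = pd.Series(rng.random(300) * 5, name="user_score")
high = (score >= 3.0).astype(int)                   # "high score" target for rule mining

# Step 3: keep attributes whose association with a high score is strong enough.
MIN_SUPPORT, MIN_CONFIDENCE = 0.2, 0.6
selected = []
for col in attrs.columns:
    support = ((attrs[col] == 1) & (high == 1)).mean()                       # P(attr and high)
    confidence = high[attrs[col] == 1].mean() if attrs[col].any() else 0.0   # P(high | attr)
    if support >= MIN_SUPPORT and confidence >= MIN_CONFIDENCE:
        selected.append(col)

# Step 4: numeric prediction model over the selected attributes.
features = selected or list(attrs.columns)          # fall back if nothing passes the thresholds
model = DecisionTreeRegressor(max_depth=4, random_state=0).fit(attrs[features], score)

# Step 5: score every collection with the model (here: back onto the same table).
predicted = model.predict(attrs[features])
print(f"selected attributes: {features}")
print(f"mean predicted score: {predicted.mean():.2f}")
```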
2 Conclusion

Through the above process we obtain a credible score for collection evaluation. It integrates the sensibility of the user-centric approach with the objective analysis of the expert-centric approach. Interesting factors have repeatedly been found; for example, when the collection type is "book", the length of its title is related to users' degree of satisfaction.
Acknowledgements This work is supported by the Natural Science Foundation of China (grant no. 60803050), the Program for New Century Excellent Talents in University, China (NCET-06-0161), the Fok Ying Tong Education Foundation, China (91101), and the Ministerial Key Discipline Program.
References
1. Hsieh-Yee, I.: Digital library evaluation: progress & next steps. In: 2005 Annual Meeting of the American Society for Information Science and Technology, Charlotte, NC (2005)
2. Snead, J.T., Bertot, J.C., Jaeger, P.T., McClure, C.R.: Developing multi-method, iterative, and user-centered evaluation strategies for digital libraries: functionality, usability, and accessibility. In: Proceedings of the American Society for Information Science and Technology, vol. 42 (2005)
3. Kyrillidou, M., Giersch, S.: Developing the DigiQUAL protocol for digital library evaluation. In: Proceedings of the 5th ACM/IEEE-CS Joint Conference on Digital Libraries, Denver, CO, USA (2005)
Author Index
Allen, Robert B. 91 Autayeu, Aliaksandr 102
Jaidka, Kokil 116 Jatnieks, Janis 258
Batjargal, Biligsaikhan 25 Bird, Steven 5 Blanzieri, Enrico 102 Buchanan, George 168
Kan, Min-Yen 226 Karduck, Achim P. 226 Kennedy, Gavin 179 Khaltarkhuu, Garmaabazar Khoo, Christopher 116 Kimura, Fuminori 25 Kneebone, Les 189 Kongthon, Alisa 256 Kozievitch, Nádia P. 260 Krapivin, Mikalai 102 Kwee, Agus Trisnajaya 50